Create a new Dynamic Web Project and click Next. I named mine test, chose the Tomcat 6.0 server created earlier, and clicked Next until Finish. Under the WebContent directory, create a new JSP file; mine is called a.jsp, and I put "My First Jsp" in its body section. Right-click the file, choose Run on Server, select Tomcat, and the result appears as shown in the figure. Congratulations, everything is configured correctly! In Eclipse, choose Window, then Web Browser, then "Default system web browser" to open the page in your own browser, as shown in the figure.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

Once upon a time there were three little sisters; and their names were
for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'
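The two traversals differ only in whitespace handling: .stripped_strings strips each string and skips the ones that are whitespace-only. That relationship can be sketched in plain Python (the sample list below is hand-copied from the output above, not produced by Beautiful Soup):

```python
# A few raw strings as .strings yields them, whitespace included.
strings = [
    "The Dormouse's story",
    "\n\n",
    "Once upon a time there were three little sisters; and their names were\n",
    "Elsie",
    ",\n",
]

# What .stripped_strings does: strip, and drop whitespace-only entries.
stripped = [s.strip() for s in strings if s.strip()]
print(stripped)
```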
print soup.p.next_sibling
# a whitespace-only text node in practice

print soup.p.previous_sibling
# None    there is no previous sibling, so None is returned

print soup.p.next_sibling.next_sibling
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# the sibling after the whitespace node is the tag we can actually see
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# u',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# u'; and they lived at the bottom of a well.'
# None
(9) Next and previous elements
Key point: the .next_element and .previous_element attributes
Unlike .next_sibling and .previous_sibling, these attributes are not restricted to siblings: they step through every node in the document, regardless of nesting level. For example, for the head node:
for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
# None
That covers the basics of navigating the parse tree.
7. Searching the document tree
(1) find_all( name, attrs, recursive, text, **kwargs )
The find_all() method searches all tag children of the current tag and collects those that match the filter.

1) The name argument

The name argument matches every tag with that name; string objects are skipped automatically.

A. Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tags whose name equals that string exactly.
You can also pass a function that takes a tag and returns True when it matches:

def has_class_but_no_id(tag):
    # match tags that define a class attribute but no id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
2) Keyword arguments
Note: if a named argument is not one of the built-in search parameters, the search treats it as a filter on a tag attribute of that name. For example, passing an argument named id makes Beautiful Soup search every tag's "id" attribute.
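Beautiful Soup itself is needed for find_all(id=...), but the underlying idea, walking the tags and comparing one attribute, can be sketched with the standard library's html.parser. The IdFinder class below is purely illustrative, not part of any library:

```python
from html.parser import HTMLParser

class IdFinder(HTMLParser):
    """Collect the names of tags whose id attribute equals target_id."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if dict(attrs).get("id") == self.target_id:
            self.matches.append(tag)

finder = IdFinder("link2")
finder.feed('<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>')
print(finder.matches)  # ['a']
```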
An analogy with people: a person has an identity card, a name, and a category (adult, child, elderly).
1. ID is a person's identity-card number and is unique, so getElementById returns the one specified person.
2. Name is their name and can repeat, so getElementsByName returns the set of people sharing that name.
3. TagName can be seen as a category; getElementsByTagName returns everyone in that category, e.g. getElementsByTagName("child") fetches all children.
Mapping this example back to HTML gives the following:
function validB(){
    var u_agent = navigator.userAgent;
    var B_name = "Failed to identify the browser";
    if(u_agent.indexOf("Firefox") > -1){
        B_name = "Firefox";
    }else if(u_agent.indexOf("Chrome") > -1){
        B_name = "Chrome";
    }else if(u_agent.indexOf("MSIE") > -1 && u_agent.indexOf("Trident") > -1){
        B_name = "IE(8-10)";
    }
    document.write("B_name:" + B_name + "<br>");
    document.write("u_agent:" + u_agent + "<br>");
}
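The same user-agent sniffing logic translates directly to Python, which is handy when a scraper wants to know what kind of client a UA string describes. This is a minimal sketch mirroring the JavaScript checks above, not a robust UA parser:

```python
def detect_browser(user_agent):
    # Same priority order as the JavaScript version: Firefox first,
    # then Chrome, then IE 8-10 (which sends both "MSIE" and "Trident").
    if "Firefox" in user_agent:
        return "Firefox"
    if "Chrome" in user_agent:
        return "Chrome"
    if "MSIE" in user_agent and "Trident" in user_agent:
        return "IE(8-10)"
    return "Failed to identify the browser"

print(detect_browser("Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0"))  # Firefox
```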
<script type="text/javascript">
    var mydate = new Date(); // create a Date object
    var weekday = ["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"];
    // an array holding a name for each day of the week
    var mynum = mydate.getDay(); // store getDay()'s return value in mynum
    document.write(mydate.getDay()); // print the raw getDay() value
    document.write("Today is: " + weekday[mynum]); // print the day of the week
</script>
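The same weekday lookup in Python needs one adjustment: JavaScript's getDay() returns 0 for Sunday, while Python's date.weekday() returns 0 for Monday, so the index must be shifted. A small sketch:

```python
import datetime

# Names in getDay() order: index 0 is Sunday, as in the script above.
weekday = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

def weekday_name(d):
    # date.weekday() gives Monday == 0; shift by one to get
    # the Sunday-first indexing that getDay() uses.
    return weekday[(d.weekday() + 1) % 7]

print("Today is: " + weekday_name(datetime.date.today()))
```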
<script type="text/javascript">
    var mya1 = new Array("hello!");
    var mya2 = new Array("I","love");
    var mya3 = new Array("JavaScript","!");
    var mya4 = mya1.concat(mya2,mya3);
    document.write(mya4);
</script>
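The equivalent concatenation in Python uses the + operator on lists; joining with commas mimics how document.write renders an array:

```python
mya1 = ["hello!"]
mya2 = ["I", "love"]
mya3 = ["JavaScript", "!"]
mya4 = mya1 + mya2 + mya3   # list concatenation, like Array.concat
print(",".join(mya4))       # document.write prints arrays comma-joined
```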
1. Taobao encrypts the password with AES, producing a 256-bit value; the POST request transmits this encrypted password, not the plaintext.
2. Taobao requires a CAPTCHA at login; after several failed attempts, I ended up fetching the CAPTCHA image and having the user type it in for verification.
3. Taobao also uses a complex ua signing algorithm that changes daily; the program needs a pre-captured ua token before it can simulate a login.
4. Obtaining the final st login token took multiple requests plus regular-expression extraction, and an st token can be used only once.
Overall approach
1. Grab the ua token and the encrypted password from the browser by hand; capturing them once is enough.
2. Send a login request to the login page, POSTing the parameters (including the ua token and the password), read the response, and extract the CAPTCHA image.
3. Have the user type in the CAPTCHA, add it to the form data, POST again, and extract J_Htoken from the response.
4. Use J_Htoken to send a request to alipay and extract the st token from the response.
5. Send a fresh login request with the st token and username, extract the redirect URL from the response, and store the cookies.
6. Use the cookies to request other personal pages, such as the orders page, and extract order details from the responses.
Lost? No problem; below I walk through my login simulation step by step, and hopefully it will all make sense.
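Step 2 boils down to URL-encoding a form and POSTing it. As a minimal sketch with the standard library, the field names below are placeholders chosen for illustration (the real login form carries more fields), and the two token values must be the ones captured by hand in step 1:

```python
from urllib.parse import urlencode

# Hypothetical form fields; the real form has more of them, and the
# two tokens are the values copied out of the browser beforehand.
form = {
    "username": "your_taobao_name",
    "password2": "copied-aes-encrypted-password",
    "ua": "copied-ua-token",
}
body = urlencode(form)  # the string sent as the POST body
print(body)
```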
Preparation
Taobao's ua algorithm and AES password encryption are too complex to reimplement, and the ua algorithm changes on Taobao's side every day. However, a value you capture once keeps working; I tested this, so one capture is all you need. So how do we get the ua token and the AES-encrypted password?

We take them straight from the browser. Open the browser, go to Taobao's login page, and press F12 (or right-click and choose Inspect). I used Firefox here. First enable persistent logging in the developer tools, otherwise the captured entries disappear once the page redirects; see the screenshot.

Now capture the ua token and the AES password. Click the Network tab, which starts out empty with nothing captured yet. Then log in normally on the page: enter your username and password, type the CAPTCHA if asked, and click Log in.

After the redirect completes you will see many log entries. Click the login.taobao.com row shown in the figure and inspect its parameters; you will find the form data, which includes ua and, further down, password2. Copy both, since we will need them later. These are the ua token and the AES-encrypted password we are after.

By this point you should have your own ua and password2 values.
# fetch the details of every purchased item
def getAllGoods(self, pageNum):
    print u"The goods list retrieved is as follows"
    for x in range(1, int(pageNum) + 1):
        page = self.getGoodsPage(x)
        self.tool.getGoodsInfo(page)
# save a single image, given its URL and a file name
def saveImg(self, imageURL, fileName):
    u = urllib.urlopen(imageURL)
    data = u.read()
    f = open(fileName, 'wb')
    f.write(data)
    f.close()
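The method above is Python 2 (urllib.urlopen and the print statement). A Python 3 counterpart, written here as a standalone function rather than a class method, would look like this:

```python
from urllib.request import urlopen

def save_img(image_url, file_name):
    """Download image_url and write the raw bytes to file_name."""
    data = urlopen(image_url).read()   # fetch the raw image bytes
    with open(file_name, "wb") as f:   # the file closes automatically
        f.write(data)
```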
2) Writing text
# save the personal profile text
def saveBrief(self, content, name):
    fileName = name + "/" + name + ".txt"
    f = open(fileName, "w+")
    print u"Quietly saving her profile to", fileName
    f.write(content.encode('utf-8'))
    f.close()
# save a single image, given its URL and a file name
def saveImg(self, imageURL, fileName):
    u = urllib.urlopen(imageURL)
    data = u.read()
    f = open(fileName, 'wb')
    f.write(data)
    print u"Quietly saving one of her images to", fileName
    f.close()
Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory
Name = BAIDUID      Value = B07B663B645729F11F659C02AAE65B4C:FG=1
Name = BAIDUPSID    Value = B07B663B645729F11F659C02AAE65B4C
Name = H_PS_PSSID   Value = 12527_11076_1438_10633
Name = BDSVRTM      Value = 0
Name = BD_HOME      Value = 0
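A Name/Value listing like this can be reproduced with the standard library alone: http.cookies.SimpleCookie parses a cookie string into name/value pairs. A small sketch using two of the cookies shown above:

```python
from http.cookies import SimpleCookie

# Parse a Cookie-header style string and print each cookie in the
# same Name/Value layout as the listing above.
cookie = SimpleCookie()
cookie.load("H_PS_PSSID=12527_11076_1438_10633; BD_HOME=0")
for name, morsel in sorted(cookie.items()):
    print("Name = %s  Value = %s" % (name, morsel.value))
```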