
Scrapy for Beginners, Part 1


At this point, my crawler has arrived at this page:

[Screenshot Scrapy10: the page the spider has just reached]

This page contains a lot of the information we need. Enough talk; on to the code:

[Screenshot Scrapy11: the spider's parsing code; the line numbers below refer to this screenshot]

Line 40: instantiate the item class we imported; it will hold our data.

Everything after that: assign each piece of data we need to item[key]. (Note that the keys here are exactly the fields we defined earlier in the item file.)

Note! response.meta[key] retrieves a value passed down from the previous callback function.

return item hands our dict-like item back to Scrapy, and the pipelines can then process the data, for example by storing it.
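
Since the code itself only survives as a screenshot (Scrapy11), here is a minimal sketch of what a callback written this way typically looks like. The class name, field names, and selectors are illustrative assumptions, not the author's exact code:

import scrapy
from myproject.items import MyItem  # hypothetical item with the fields defined earlier


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/list_1.html']

    def parse(self, response):
        # Pass a value along to the next callback via the request's meta dict.
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse_detail,
                                 meta={'category': 'placeholder'})

    def parse_detail(self, response):
        item = MyItem()  # "line 40": instantiate the imported item to hold our data
        # Assign each extracted value to a key defined in the item file.
        item['title'] = response.xpath('//h1/text()').extract_first()
        # response.meta[key] reads the value passed down from the previous callback.
        item['category'] = response.meta['category']
        return item  # hand the item to the pipelines for processing/storage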

OK, that's as far as we'll take the Spider for now. (Did some of you notice that I left a few fields without values? Those are for you to try yourselves, hahaha ヽ(=^・ω・^=)丿) To be continued.

When reposting, please credit: 静觅 » Scrapy for Beginners, Part 1


Comments:

  1. Could someone tell me why I keep getting the following error when running Scrapy on Linux? I just can't figure it out. Any help from the experts would be appreciated!

2017-02-21 16:49:19 [scrapy.core.scraper] ERROR: Error downloading
Traceback (most recent call last):
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 61, in download_request
    return agent.download_request(request)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 286, in download_request
    method, to_bytes(url, encoding='ascii'), headers, bodyproducer)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/twisted/web/client.py", line 1631, in request
    parsedURI.originForm)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/twisted/web/client.py", line 1408, in _requestWithEndpoint
    d = self._pool.getConnection(key, endpoint)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/twisted/web/client.py", line 1294, in getConnection
    return self._newConnection(key, endpoint)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/twisted/web/client.py", line 1306, in _newConnection
    return endpoint.connect(factory)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/twisted/internet/endpoints.py", line 788, in connect
    EndpointReceiver, self._hostText, portNumber=self._port
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/twisted/internet/_resolver.py", line 174, in resolveHostName
    onAddress = self._simpleResolver.getHostByName(hostName)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/scrapy/resolver.py", line 21, in getHostByName
    d = super(CachingThreadedResolver, self).getHostByName(name, timeout)
  File "/etldata/migu/anaconda2/lib/python2.7/site-packages/twisted/internet/base.py", line 276, in getHostByName
    timeoutDelay = sum(timeout)
TypeError: 'float' object is not iterable
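
The last two frames point at the likely cause: Twisted's getHostByName calls sum(timeout), so it expects a sequence of retry timeouts, while the Scrapy resolver here handed it a single float. This is a known symptom of a version mismatch between Scrapy and Twisted; a minimal sketch of the failure (the values are illustrative, not taken from this setup):

timeout = 60.0            # a single float, as the resolver in this traceback passed it
try:
    delay = sum(timeout)  # Twisted sums the timeout sequence internally
except TypeError as err:
    print(err)            # 'float' object is not iterable

# The usual remedy (an assumption worth checking against the release notes)
# is to upgrade Scrapy, or to pin Twisted to a release that matches the
# installed Scrapy version, so the timeout is passed in the expected form.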

  2. The way the def parse function obtains bashurl is wrong: you can't index from the end of the string. Pages 1-9 work fine, but things break once you reach double-digit pages. It should be changed to str(response)[:27].
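
A small illustration of why end-relative slicing breaks here (the URLs and slice lengths are made up for the example; only the str(response)[:27] idea comes from the comment):

# str(response) on a Scrapy Response looks like '<status url>'.
page_1 = '<200 http://example.com/list_1.html>'
page_10 = '<200 http://example.com/list_10.html>'

# Counting from the END shifts as soon as the page number gains a digit:
print(page_1[:-7])    # <200 http://example.com/list_   (intended prefix)
print(page_10[:-7])   # <200 http://example.com/list_1  (a digit leaks in)

# Counting from the FRONT is stable because the scheme/host prefix has a
# fixed length; that is what the suggested str(response)[:27] relies on:
print(page_1[:24])    # <200 http://example.com/
print(page_10[:24])   # <200 http://example.com/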