These days, to speed up page loading, large parts of a page are generated with JavaScript. This is a real problem for Scrapy: it has no JavaScript engine, so it can only fetch static pages and cannot obtain content that is generated dynamically by JS.
The solution:
What is Splash?
Splash is a JavaScript rendering service. It is a lightweight browser that exposes an HTTP API, implemented in Python on top of Twisted and QT. Twisted (with QT) gives the service asynchronous processing capability, letting it take advantage of WebKit's concurrency.
Here is how to set up scrapy-splash. First install the Python package, then pull and run the Splash Docker image:
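Because Splash is just an HTTP service, it can also be used standalone, without Scrapy. For instance, its render.html endpoint returns the HTML of a page after JavaScript has executed. A minimal sketch (standard library only) of building such a request URL, assuming the default local Splash instance on port 8050:

```python
from urllib.parse import urlencode

def render_html_url(url, splash="http://localhost:8050", wait=0.5):
    # Splash's render.html endpoint returns the page HTML after the
    # browser has run its JavaScript; `wait` is the number of seconds
    # to wait after the page loads before rendering.
    query = urlencode({"url": url, "wait": wait})
    return f"{splash}/render.html?{query}"

print(render_html_url("http://example.com"))
# → http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```

Fetching this URL (e.g. with a browser or curl) returns the rendered HTML, which is exactly what SplashRequest does for you inside a Scrapy spider.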
$ pip install scrapy-splash
$ docker pull scrapinghub/splash
$ docker run -p 8050:8050 scrapinghub/splash
Configure the Splash service (all of the following goes in settings.py):
1) Add the Splash server address:
SPLASH_URL = 'http://localhost:8050'
2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
3) Enable SplashDeduplicateArgsMiddleware in SPIDER_MIDDLEWARES:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
4)Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
5) Set a custom cache storage backend:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
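Besides render.html, Splash also exposes an /execute endpoint that runs a user-supplied Lua script, which is what SplashRequest's lua_source argument drives under the hood. As a hedged sketch of what such a request carries (standard library only, no running Splash instance needed), this builds the JSON body that would be POSTed to /execute:

```python
import json

# A minimal Splash Lua script: load the page, wait for JS, return HTML.
LUA_SCRIPT = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(0.5)
  return splash:html()
end
"""

def execute_payload(url):
    # JSON body that would be POSTed to http://localhost:8050/execute;
    # extra keys (here `url`) become fields of `args` inside the script.
    return json.dumps({"lua_source": LUA_SCRIPT, "url": url})

body = execute_payload("http://example.com")
```

From a spider, the same script can be passed as SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': LUA_SCRIPT}).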
Example
Fetching the rendered HTML:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "my_spider"  # every Scrapy spider needs a unique name
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is the result of a render.html call;
        # it contains HTML processed by a browser.
        # ...
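Once Splash has rendered the page, response.body is ordinary HTML and can be parsed with the usual Scrapy tools (response.css, response.xpath, etc.). Purely as an illustration, here is a standard-library sketch of pulling the title out of such a rendered document; the sample HTML string is hypothetical:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text content of the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Stand-in for what Splash's render.html would return for a JS-built page.
rendered = "<html><head><title>Example Domain</title></head><body></body></html>"
parser = TitleParser()
parser.feed(rendered)
print(parser.title)  # → Example Domain
```

In a real spider you would feed response.text (the decoded response.body) to your selector of choice instead.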
Scrapinghub blog: http://blog.scrapinghub.com/
Splash documentation: http://splash.readthedocs.org/en/latest/scripting-tutorial.html
ScrapyJS project on GitHub: https://github.com/scrapinghub/scrapyjs