
Crawling glamour photos with a scrapy-redis distributed crawler


Background:

Crawl target: (you know what I mean)

url: https://www.jpxgyw.com

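Why use scrapy-redis:

It comes down to personal taste: I only want to crawl what I actually like. With scrapy-redis I leave the crawler running, and whenever I come across a photo set I like, I lpush its url into redis. The crawler picks up the url and starts working, so the crawl stays targeted. Put simply, everything I end up looking at is hand-picked, and that is pretty sweet.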

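Crawling approach:

Open a photo set you like. The goal is to extract the url of the next page and the urls of the images.

I tried both response.css and response.xpath and neither could extract the image urls, so here we use selenium to drive Chrome or PhantomJS, grab the rendered page source, and extract the urls we want from that.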
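The "next page" half of that goal can be sketched as a small pure function. This is only a sketch: it assumes the url pattern that the spider further down relies on (first page `name.html`, following pages `name_1.html`, `name_2.html`, ...), and the function name and the example path in the usage note are made up for illustration.

```python
import re


def next_page_url(url):
    """Return the url of the page that follows `url` inside one photo set.

    Assumed pattern: 'name.html' -> 'name_1.html' -> 'name_2.html' -> ...
    """
    match = re.search(r'^(.*)_(\d+)\.html$', url)
    if match:
        # Already on a numbered page: bump the number by one.
        base, page = match.group(1), int(match.group(2))
        return "{}_{}.html".format(base, page + 1)
    if not url.endswith(".html"):
        raise ValueError("unexpected url: " + url)
    # First page: strip ".html" and start numbering at 1.
    return url[:-len(".html")] + "_1.html"
```

For a hypothetical set at `https://www.jpxgyw.com/example/123.html` this gives `.../example/123_1.html`, then `.../example/123_2.html`, and so on, until the site answers with its error page.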

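Let's get to work!

1. Environment:

My setup is python3.6.4 + scrapy1.5.1.

I won't go over setting up scrapy; install it with pip3, and there are plenty of tutorials online. One extra note: since we are downloading images, the Pillow library is required. selenium is needed too, and so is redis. If any of them is missing, run pip3 install Pillow / pip3 install selenium / pip3 install redis.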
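A quick way to confirm the environment before going further is to check that each package can be imported. The sketch below uses only the standard library; the helper name `missing_packages` is made up, and the one thing to watch is that Pillow is imported as `PIL`.

```python
import importlib.util

# pip name -> import name (Pillow is imported as "PIL")
REQUIRED = {"Scrapy": "scrapy", "Pillow": "PIL",
            "selenium": "selenium", "redis": "redis"}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found in this environment."""
    return sorted(pip_name for pip_name, module in required.items()
                  if importlib.util.find_spec(module) is None)


# Prints an empty list when everything is installed.
print("missing:", missing_packages())
```

Anything it lists can be installed with pip3 install followed by the name.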

Packages in my virtualenv:

    Scrapy 1.5.1
    Pillow 5.2.0
    pywin32 223
    requests 2.19.1
    selenium 3.14.0
    redis

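2. Create the crawler

First create a scrapy project. With the virtualenv activated:

    scrapy startproject ScrapyRedisTest

Next, get the scrapy-redis source: go to github at https://github.com/rmax/scrapy-redis and download the project.

After unzipping, copy the whole scrapy_redis directory from src into the root of the ScrapyRedisTest project we just created.

Inside the ScrapyRedisTest package under the project root, create an images folder to hold the downloaded pictures.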

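The directory structure at this point:

    ScrapyRedisTest/
        scrapy.cfg
        scrapy_redis/
        ScrapyRedisTest/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            images/
            spiders/
                __init__.py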

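3. Write the crawler

With the environment ready, let's start writing the crawler.

Write settings.py (with comments):

You can copy the code below straight over the auto-generated settings.py.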

    # -*- coding: utf-8 -*-
    import os, sys
    # Scrapy settings for ScrapyRedisTest project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    # https://doc.scrapy.org/en/latest/topics/settings.html
    # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    BOT_NAME = 'ScrapyRedisTest'
    SPIDER_MODULES = ['ScrapyRedisTest.spiders']
    NEWSPIDER_MODULE = 'ScrapyRedisTest.spiders'
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # Read by Scrapy's built-in UserAgentMiddleware; here we set a Chrome User-Agent
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False  # do not follow robots.txt
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #    'Accept-Language': 'en',
    #}
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'ScrapyRedisTest.middlewares.ScrapyredistestSpiderMiddleware': 543,
    #}
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # replace Scrapy's scheduler with the scrapy_redis one
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # replace Scrapy's dupefilter with the scrapy_redis one
    #ITEM_PIPELINES = {
    #    'scrapy_redis.pipelines.RedisPipeline': 300,
    #}
    BASE_DIR = os.path.dirname(os.path.abspath(os.path.dirname(__file__)))
    sys.path.insert(0, os.path.join(BASE_DIR, 'ScrapyRedisTest'))
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'ScrapyRedisTest.middlewares.ScrapyredistestDownloaderMiddleware': 543,
    #}
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    #    'ScrapyRedisTest.pipelines.ScrapyredistestPipeline': 300,
    #}
    IMAGES_URLS_FIELD = "front_image_url"  # item field from which Scrapy's built-in ImagesPipeline reads the image urls
    project_dir = os.path.abspath(os.path.dirname(__file__))
    IMAGES_STORE = os.path.join(project_dir, 'images')  # folder the downloaded images are saved to
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    AUTOTHROTTLE_ENABLED = True  # let Scrapy adjust the download speed automatically
    # The initial download delay
    AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    AUTOTHROTTLE_DEBUG = False

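One extra note: I like to set AUTOTHROTTLE_ENABLED=True. The crawl may be a little slower, but it lowers the chance of tripping anti-crawler measures (when I crawled Lagou without it, I kept getting 302 redirects). Besides, crawling at full speed with no limit and knocking someone's site over would not be nice either.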
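To get a feel for what AutoThrottle does with the values above, here is a simplified model of the adjustment rule as described in Scrapy's AutoThrottle documentation. It is a sketch of the idea, not the extension's actual code, and the function name `next_delay` is made up.

```python
def next_delay(prev_delay, latency, status=200,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Simplified model of how AutoThrottle picks the next download delay.

    prev_delay         -- current delay in seconds (starts at AUTOTHROTTLE_START_DELAY)
    latency            -- how long the last response took, in seconds
    target_concurrency -- AUTOTHROTTLE_TARGET_CONCURRENCY
    min_delay          -- DOWNLOAD_DELAY
    max_delay          -- AUTOTHROTTLE_MAX_DELAY
    """
    # Delay that would yield the target number of parallel requests.
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay towards the target ...
    new_delay = (prev_delay + target_delay) / 2.0
    # ... but never drop below the target itself.
    new_delay = max(target_delay, new_delay)
    # Clamp to the configured bounds.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Non-200 responses (such as a 302) are not allowed to lower the delay.
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay
```

With the settings above, a first response that takes 1 second moves the delay from 5 seconds to 3 seconds, while a 302 with the same latency leaves it at 5.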

Write middlewares.py:

We only want selenium to handle specific urls, so we write a middleware for that.

Copy the code below into the auto-generated middlewares.py and keep what is already there.

    import time

    from scrapy.http import HtmlResponse


    class JSPageMiddleware_for_jp(object):
        def process_request(self, request, spider):
            # Only urls on this site are fetched through selenium
            if request.url.startswith("https://www.jpxgyw.com"):
                spider.driver.get(request.url)
                time.sleep(1)  # give the page a moment to render
                print("currentUrl", spider.driver.current_url)
                # Returning a response here makes Scrapy skip its own downloader for this request
                return HtmlResponse(url=spider.driver.current_url,
                                    body=spider.driver.page_source,
                                    encoding="utf-8",
                                    request=request)
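The startswith check in the middleware is a plain string prefix test. If you prefer to match on the host name instead, a sketch could look like the one below. The names `RENDER_HOSTS` and `needs_browser` are made up, and note one difference in behavior: this version also accepts http:// urls on the same host.

```python
from urllib.parse import urlparse

# Hosts whose pages should be rendered through selenium (assumed: just this one).
RENDER_HOSTS = {"www.jpxgyw.com"}


def needs_browser(url, hosts=RENDER_HOSTS):
    """Return True if `url` points at a host that must be rendered in a browser."""
    return urlparse(url).hostname in hosts
```

Image requests to img.xingganyouwu.com come back False, so they still go through Scrapy's normal downloader.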

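Write the spider:

Create jp.py in the spiders folder as our spider file.

The selenium.webdriver setup goes into __init__ so that the browser is opened once and reused, instead of being reopened for every new url. (I picked this trick up from bobby's course on imooc. Great teacher, big thumbs up.)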

    # -*- coding: utf-8 -*-
    import re

    import scrapy
    from scrapy import signals
    from scrapy.loader import ItemLoader
    # Works on the Scrapy 1.5.1 used here; this pydispatch shim is deprecated
    from scrapy.xlib.pydispatch import dispatcher
    from scrapy_redis.spiders import RedisSpider
    from selenium import webdriver

    from ScrapyRedisTest.items import JPItem


    class JpSpider(RedisSpider):
        name = 'jp'
        allowed_domains = ['www.jpxgyw.com', 'img.xingganyouwu.com']
        redis_key = 'jp:start_urls'  # redis key the spider reads its start urls from
        custom_settings = {
            "AUTOTHROTTLE_ENABLED": True,  # enable the extension that adjusts crawl speed automatically
            "DOWNLOADER_MIDDLEWARES": {
                'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 1,  # Scrapy's built-in middleware sets the user-agent
                'ScrapyRedisTest.middlewares.JSPageMiddleware_for_jp': 2  # our own selenium middleware
            },
            "ITEM_PIPELINES": {
                'scrapy.pipelines.images.ImagesPipeline': 1,  # Scrapy's built-in pipeline downloads the images
            }
        }
        current_page = 0  # class-level counter: which page we are on
        # max_page=17

        @staticmethod
        def judgeFinalPage(body):
            # Returns True while there are more pages, False once the site's error page shows up
            ma = re.search(r'性感尤物提醒你,访问页面出错了', body)
            return not ma

        def __init__(self, **kwargs):
            # I use PhantomJS here; you can swap in chromedriver. executable_path is where
            # phantomjs lives on my machine, so replace it with yours. On linux you do not
            # need to set executable_path.
            self.driver = webdriver.PhantomJS(executable_path="D:\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe")
            super(JpSpider, self).__init__(**kwargs)
            dispatcher.connect(self.spider_closed, signals.spider_closed)

        def spider_closed(self, spider):
            # Quit PhantomJS when the spider exits
            print("spider closed")
            self.driver.quit()

        def parse(self, response):
            if "_" in response.url:  # the start url is not the first page
                ma_url_num = re.search(r'_(\d+)\.html', response.url)  # current page number
                self.current_page = int(ma_url_num.group(1))  # store it on the spider
                self.current_page = self.current_page + 1
                ma_url = re.search(r'(.*)_\d+\.html', response.url)  # url without the page suffix
                nextUrl = ma_url.group(1) + "_" + str(self.current_page) + ".html"  # build the next page url
                print("nextUrl", nextUrl)
            else:  # the start url is the first page
                self.current_page = 0  # reset
                next_page_num = self.current_page + 1
                self.current_page = self.current_page + 1
                nextUrl = response.url[:-5] + "_" + str(next_page_num) + ".html"  # build the next page url
                print("nextUrl", nextUrl)

            ma = re.findall(r'src="/uploadfile(.*?)\.jpg', response.text)
            imgUrls = []  # all image urls found on the current page
            for i in ma:
                imgUrl = "http://img.xingganyouwu.com/uploadfile/" + i + ".jpg"
                imgUrls.append(imgUrl)
                print("imgUrl", imgUrl)

            item_loader = ItemLoader(item=JPItem(), response=response)
            item_loader.add_value("front_image_url", imgUrls)  # put the urls into the item
            jp_item = item_loader.load_item()
            yield jp_item  # hand over to the pipeline, which downloads the images

            if self.judgeFinalPage(response.text):  # not the last page yet, keep going
                yield scrapy.Request(nextUrl, callback=self.parse, dont_filter=True)
            else:
                print("Reached the last page!")
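The parsing part of parse() can be tried out without a browser by pulling it into pure functions that work on a page-source string. This is only a sketch: the function names are made up, it assumes the image paths look like /uploadfile/....jpg, and unlike the spider it joins host and path with a single slash.

```python
import re

IMAGE_HOST = "http://img.xingganyouwu.com"
# Text the site shows once you go past the last page (taken from the spider).
ERROR_MARKER = "性感尤物提醒你,访问页面出错了"


def extract_image_urls(html):
    """Collect absolute urls for every /uploadfile/....jpg referenced in `html`."""
    paths = re.findall(r'src="(/uploadfile/[^"]*?\.jpg)', html)
    return [IMAGE_HOST + path for path in paths]


def has_more_pages(html):
    """Return True while the error page has not been reached."""
    return ERROR_MARKER not in html
```

Feeding these a saved copy of driver.page_source is a cheap way to check the regex before starting a real crawl.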

Write items.py:

Don't forget to add the code below to items.py; it defines the item that Scrapy's image downloader works with.

    import scrapy


    class JPItem(scrapy.Item):
        front_image_url = scrapy.Field()

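Start crawling:

I run the project on my own Aliyun server (the connection at home is too slow...).

First start redis-server and redis-cli. That is easy on both windows and linux, so I'll be lazy and skip it here; look it up if you need to.

Then cd into the crawler's root directory and run scrapy crawl jp to start the crawler. It starts up and then waits for jp:start_urls in redis.

That means the crawler is running normally. Now go to redis and lpush a url.

After that you can see the crawler start working.

Once it finishes that url there is no need to stop the crawler, because it keeps listening on redis. Whenever you see a url you like, just lpush it into redis.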
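The same lpush can be done from Python instead of redis-cli. The sketch below keeps the redis client as a parameter; the helper name `push_start_url` is made up, and the key matches the spider's redis_key.

```python
def push_start_url(client, url, key="jp:start_urls"):
    """LPUSH `url` onto the list the spider is polling.

    `client` is anything with a redis-py style lpush(name, *values) method,
    for example redis.Redis(host="localhost", port=6379).
    Returns the length of the list after the push.
    """
    return client.lpush(key, url)
```

With redis-py installed this would be called as push_start_url(redis.Redis(), url), which does the same thing as typing lpush jp:start_urls followed by the url in redis-cli.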

Test result: a stable crawl of 321 urls and 894 images.

OK, the rest is up to you. Go enjoy the pictures in the images folder.
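If you want to find the file that belongs to a particular image url: as far as I know, the default ImagesPipeline in Scrapy 1.x stores each image under a full/ subfolder of IMAGES_STORE, named after the SHA-1 hash of its url. Treat that as an assumption and check your own images folder. The helper name below is made up.

```python
import hashlib


def default_image_path(url):
    """Path, relative to IMAGES_STORE, where the image for `url` is expected.

    Assumes the default naming scheme: full/<sha1 of the url>.jpg
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(digest)
```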

One last thing: the point of this post is learning the scrapy framework. Everything in moderation~
