Python Crawler: Scraping Images from the Douban Home Page

URL: https://movie.douban.com/

    import os

    import requests
    from lxml import etree

    url = 'https://movie.douban.com'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.5735.289 Safari/537.36'
    }
    session = requests.Session()
    # Fetch the home page once; the session keeps cookies for the image requests.
    res = session.get(url, headers=headers)
    # res.encoding = res.apparent_encoding  # uncomment if the text comes back garbled
    # print(res.text)
    # Output: the page source

    tree = etree.HTML(res.text)
    # print(tree)
    # Output: <Element html at 0x186fa6a3100>

    # Serialize every <img> tag back to a string (only useful for inspection).
    img_all = tree.xpath('//img')
    for tag in img_all:
        img = etree.tostring(tag, encoding='UTF-8').decode('UTF-8')
        # print(img)
        # <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2900931370.jpg" alt="&#x5C0F;&#x884C;&#x661F;&#x730E;&#x4EBA;" rel="nofollow" class=""/>

    # Collect the src of every image once, outside the loop above;
    # running this inside the loop would print and download the same list repeatedly.
    img_urls = tree.xpath('//img/@src')
    # img_names = tree.xpath('//img/@alt')
    # print(img_urls, img_names)

    # open() raises FileNotFoundError if the target folder does not exist yet.
    os.makedirs('./img', exist_ok=True)
    for img_url in img_urls:
        last_str = img_url.split('/')[-1]
        # e.g. p2900931370.jpg
        every_name = last_str.split('.')[0]
        # e.g. p2900931370
        res_img = session.get(img_url, headers=headers)
        with open(f'./img/{every_name}.jpg', 'wb') as f:
            f.write(res_img.content)
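
The file-name step above takes the last segment of each image URL and always writes a `.jpg` file. That works for the poster URLs shown in the comments, but it may misbehave if a `src` is relative, protocol-relative (`//host/...`), carries a query string, or has another extension. A slightly more defensive variant, using only the standard library, might look like the sketch below. The helper names `absolute_url` and `local_name` are hypothetical and not part of the original script, and whether Douban actually serves such URLs is an assumption.

```python
import os
from urllib.parse import urljoin, urlsplit

BASE_URL = 'https://movie.douban.com'  # page the src values were scraped from


def absolute_url(src, base=BASE_URL):
    """Resolve a relative or protocol-relative src against the page URL."""
    return urljoin(base, src)


def local_name(img_url, default_ext='.jpg'):
    """Build a file name from the last path segment, ignoring any query string.

    Returns None when the URL has no usable last segment (e.g. it ends with '/').
    """
    path = urlsplit(img_url).path
    stem, ext = os.path.splitext(os.path.basename(path))
    if not stem:
        return None
    # Keep the real extension when there is one; fall back to .jpg otherwise.
    return stem + (ext or default_ext)
```

In the download loop, one could call `absolute_url(src)` before `session.get(...)` and skip any image for which `local_name(...)` returns `None`.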

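Result: the script saves each image to the ./img directory, named after the last segment of its URL (for example p2900931370.jpg).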

This article was contributed by a community member. Please credit the source when reposting: wpsshop blog.