In an earlier crawler exercise I set the `User-Agent` request header so the site would believe the requests came from a browser. That made me wonder whether the same trick could inflate the view count of my CSDN blog, so I ran a test against this very post.

For the first attempt I called `requests.get` on the post without proxy IPs and without multiple threads. It worked, but the view count grew slowly, and requests from a single IP are easy for the site's anti-crawler system to detect and block.
```python
import requests
import time
import random

# Target page to request
url = 'https://blog.csdn.net/qq_36171287/article/details/91352388'

# Pool of User-Agent strings to rotate through
user_agent_list = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]

# Simulate a browser sending 200 HTTP requests,
# picking a random User-Agent each time
i = 0
while i < 200:
    header = {
        'User-Agent': random.choice(user_agent_list)
    }
    try:
        response = requests.get(url, headers=header)
        print("success")
    except requests.RequestException:
        print("mistake")
    time.sleep(30)
    i = i + 1

print("**********")
```
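One thing that makes the loop above easy to flag, besides the single IP, is its perfectly regular 30-second rhythm. A small tweak (my own sketch, not part of the original script; the jitter range is an arbitrary choice) is to randomize the delay with `random.uniform`:

```python
import random
import time


def polite_sleep(base=30.0, jitter=10.0):
    # Sleep for base +/- jitter seconds so requests do not arrive
    # at perfectly regular intervals, which anti-bot systems can spot
    delay = random.uniform(base - jitter, base + jitter)
    time.sleep(delay)
    return delay
```

In the loop, `time.sleep(30)` would simply be replaced by `polite_sleep()`.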

For the second attempt I added proxy IPs and multiple threads, which improved the throughput considerably.
```python
import requests
import time
import random
from threading import Thread
from bs4 import BeautifulSoup

# Target page to request
target_url = 'https://blog.csdn.net/qq_36171287/article/details/91352388'

# Pool of User-Agent strings to rotate through
user_agent_list = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Opera/8.0 (Windows NT 5.1; U; en)',
    'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]


def get_ip_list(url, headers):
    # Scrape a list of free proxy IP:port pairs from the proxy site
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        tds = ips[i].find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list


def get_random_ip(ip_list):
    # Pick one proxy at random and format it for requests.
    # Map both schemes: with only an 'http' key, HTTPS requests
    # would bypass the proxy entirely
    proxy_ip = 'http://' + random.choice(ip_list)
    proxies = {'http': proxy_ip, 'https': proxy_ip}
    return proxies


def fun(url, proxies):
    # Simulate a browser: one request through the given proxy
    # with a randomly chosen User-Agent
    header = {
        'User-Agent': random.choice(user_agent_list)
    }
    try:
        response = requests.get(url, headers=header, proxies=proxies)
        print("success")
    except requests.RequestException:
        print("mistake")
    time.sleep(30)


def run():
    """
    Run the requests in parallel:
    8 threads at a time (adjust as needed)
    """
    threads = []

    for i in range(8):
        proxies = get_random_ip(ip_list)
        print(proxies)
        threads.append(Thread(target=fun, args=(target_url, proxies)))

    for t in threads:
        t.start()

    for t in threads:
        t.join()


if __name__ == '__main__':
    proxy_url = 'http://www.xicidaili.com/nn/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
    }
    # Scrape the proxy list once, then loop forever
    ip_list = get_ip_list(proxy_url, headers=headers)
    while True:
        run()
```

In later tests, however, the view count stopped increasing: the page had added protection, and fetching it with the code below returns obfuscated JavaScript instead of the article.
```python
import requests

# Target page to request
url = 'https://blog.csdn.net/qq_36171287/article/details/91352388'

# Request header telling the server this is a browser
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
# Simulate a browser sending an HTTP request
response = requests.get(url, headers=header)
print(response.text)
```
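To tell programmatically whether a fetch returned the article or just a script-only shell, a rough heuristic (my own sketch; the 200-character threshold is a guess, not anything CSDN documents) is to strip the tags and check how much visible text is left:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    # Collect text that is not inside <script> or <style> tags
    def __init__(self):
        super().__init__()
        self.in_skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.in_skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.in_skip:
            self.in_skip -= 1

    def handle_data(self, data):
        if not self.in_skip:
            self.chunks.append(data.strip())


def looks_like_js_shell(html, min_text=200):
    # If almost no visible text survives outside the scripts,
    # the "page" is probably an obfuscated-JavaScript shell
    parser = TextExtractor()
    parser.feed(html)
    visible = ' '.join(c for c in parser.chunks if c)
    return len(visible) < min_text
```

Running this on `response.text` makes the script print a clear verdict instead of dumping a wall of minified JavaScript to the console.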