Web scraping has gradually become a skill that even non-programmers need to pick up, yet many of the crawler scripts available today are cluttered. Here is a beginner-friendly script whose main selling point is simplicity; see below to get started quickly.
Use case: scraping Wikipedia content
Effect: given a web page, the script produces a chinese_txt file containing all the text scraped from that page, plus the text of every page linked from it.
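Before running the script, install the third-party packages it imports (requests, BeautifulSoup, and the lxml parser). A typical way to do this, assuming you use pip:

pip install requests beautifulsoup4 lxml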
import requests
import re
import time
from bs4 import BeautifulSoup
from urllib.parse import urljoin

exist_url = []       # URLs that have already been visited
g_writecount = 0     # running count of links discovered

full_url = ""  # 填写你要爬取的网页 (fill in the page you want to crawl)

def scrappy(url, depth=1):
    global g_writecount
    try:
        headers = {
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"}
        r = requests.get(url, headers=headers)
        html = r.text
    except Exception as e:
        print("Failed downloading and saving ", url)
        print(e)
        exist_url.append(url)
        return None

    exist_url.append(url)

    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')

    # Extract the text of every <p> tag on the page
    chinese_text = ''.join([p.get_text() for p in soup.find_all('p')])

    # Append the extracted text to the output file
    with open('chinese_txt.txt', 'a+', encoding='utf-8') as f:
        f.write(chinese_text)

    # Find the internal /wiki/ links on the page; set() removes duplicates
    link_list = re.findall('<a href="/wiki/([^:#=<>]*?)".*?</a>', html)
    unique_list = list(set(link_list) - set(exist_url))

    for eachone in unique_list:
        g_writecount += 1
        output = 'No.' + str(g_writecount) + '\t Depth:' + str(depth) + '\t' + url + ' -> ' + eachone + '\n'
        print(output)

        if depth < 2:
            # Turn the relative /wiki/ path into an absolute URL before recursing,
            # and skip pages that have already been visited
            next_url = urljoin(url, '/wiki/' + eachone)
            if next_url not in exist_url:
                time.sleep(5)  # wait 5 seconds between requests
                scrappy(next_url, depth + 1)


start = time.time()
scrappy(full_url)
stop = time.time()
print("Elapsed time:", stop - start)
