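Requests, Beautiful Soup, and Scrapy are the libraries most commonly used to write a Python crawler. Below are three simple examples: the first uses Requests with Beautiful Soup to print a page title, the second uses Requests with a regular expression to list image URLs, and the third uses Scrapy.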
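import requests
from bs4 import BeautifulSoup

url = "https://example.com"
# A timeout keeps the request from hanging indefinitely
response = requests.get(url, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Beautiful Soup can extract page content here,
    # for example: titles = soup.find_all('h2')
    # soup.title is None when the page has no <title> element
    print(soup.title.text if soup.title else "No <title> found")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")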
The second example collects image URLs with a regular expression:
import re
from urllib.parse import urljoin

import requests

url = "https://example.com"
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Capture the src attribute of every <img> tag
    image_urls = re.findall(r'<img.*?src=["\'](.*?)["\']', response.text)
    for img_url in image_urls:
        # Resolve relative paths against the page URL
        full_url = urljoin(url, img_url)
        # Download the image or do other processing here, for example:
        # response = requests.get(full_url); save_image(response.content, "image.jpg")
        print(full_url)
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
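A regular expression is easy to trip up on real HTML (attributes in a different order, `<img>` tags without a `src`). As one possible alternative, not part of the original examples, the sketch below does the same extraction with the standard-library `html.parser` module. `ImageSrcParser`, `extract_image_urls`, the sample HTML, and the `cdn.example.com` address are all made up for illustration, and the HTML is inlined so the sketch runs without a network request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return absolute URLs for all <img src=...> found in html."""
    parser = ImageSrcParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.image_urls


# Inline sample page (an assumption, used instead of a live request)
sample_html = """
<html><body>
  <img src="/static/logo.png" alt="logo">
  <img alt="no source">
  <img class="photo" src='https://cdn.example.com/a.jpg'>
</body></html>
"""

for image_url in extract_image_urls(sample_html, "https://example.com/page"):
    print(image_url)
```

In a real crawler, `response.text` from the Requests call above would take the place of `sample_html`.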
The third example uses Scrapy. First, make sure Scrapy is installed:
pip install scrapy
Create a new Scrapy project:
scrapy startproject myproject
cd myproject
Edit the spider:
# myproject/spiders/myspider.py
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # XPath or CSS selectors can extract data here,
        # for example: titles = response.xpath('//h2/text()').getall()
        title = response.css('title::text').get()
        print(title)
Run the Scrapy spider:
scrapy crawl myspider
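These examples are only a starting point. Real projects usually need to handle more error cases, use proxies, set request headers, and so on. When crawling, make sure you comply with the site's robots.txt file and its terms of service.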
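As a rough illustration of the robots.txt point, the sketch below checks whether a URL may be fetched using the standard-library `urllib.robotparser`. The robots.txt content, the `my-crawler` user agent, and the helper names `build_parser` and `is_allowed` are assumptions made up for this example. The rules are inlined so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from robots.txt text (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for url in (
    "https://example.com/index.html",
    "https://example.com/private/data.html",
):
    print(url, is_allowed(parser, "my-crawler", url))

# Delay in seconds requested by the site, or None if not specified
print(parser.crawl_delay("my-crawler"))
```

Against a live site, `set_url()` followed by `read()` on the same class would download the real robots.txt file in place of the inline text.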