How to write a Python crawler, with some working examples

Author: opred | 2024-01-25 09:38:30


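Commonly used libraries for writing a Python crawler include Requests, Beautiful Soup, and Scrapy. Below are three simple examples: one using Requests with Beautiful Soup, one using Requests with regular expressions, and one using Scrapy.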

1. Fetching page content with Requests and Beautiful Soup:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
# Set a timeout so the request cannot hang indefinitely
response = requests.get(url, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
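    # Use Beautiful Soup here to extract page content,
    # for example: titles = soup.find_all('h2')
    # soup.title is None when the page has no <title> tag
    print(soup.title.text if soup.title else "No <title> found")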
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
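The snippet above only handles a non-200 status code, but a request can also fail outright with a timeout or connection error, which raises an exception instead of returning a response. The following is a minimal sketch of retrying with exponential backoff. The name `fetch_with_retries` and its parameters are illustrative assumptions, not part of Requests; `fetch` stands for any callable that takes a URL.

```python
import time


def fetch_with_retries(fetch, url, retries=3, backoff=0.5, sleep=time.sleep):
    """Call fetch(url), retrying on I/O errors with exponential backoff.

    `fetch` is any callable taking a URL (hypothetical helper signature).
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as exc:
            # requests.exceptions.RequestException subclasses IOError (OSError),
            # so timeouts and connection errors from Requests land here too.
            last_error = exc
            if attempt < retries - 1:
                # Wait 0.5s, 1s, 2s, ... between attempts
                sleep(backoff * 2 ** attempt)
    raise last_error
```

With Requests this could be called as `fetch_with_retries(lambda u: requests.get(u, timeout=10), url)`. Note that `requests.get` does not raise on a 4xx or 5xx status by itself, so the status check from the example is still needed.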

2. Collecting image URLs with Requests and regular expressions:

import requests
import re
from urllib.parse import urljoin

url = "https://example.com"
# Set a timeout so the request cannot hang indefinitely
response = requests.get(url, timeout=10)

if response.status_code == 200:
    image_urls = re.findall(r'<img.*?src=["\'](.*?)["\']', response.text)
    for img_url in image_urls:
        full_url = urljoin(url, img_url)
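        # Download the image or do other processing here,
        # for example: response = requests.get(full_url); save_image(response.content, "image.jpg")
        # (save_image is a placeholder helper that is not defined in this example)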
        print(full_url)
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
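A regular expression works for simple pages but can miss or mis-match tags with unusual attribute order or spacing. As an alternative, here is a minimal sketch that collects `img` sources with the standard library's `html.parser`. The class name `ImageCollector`, the function `extract_image_urls`, and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    collector = ImageCollector(base_url)
    collector.feed(html)
    return collector.image_urls


# Hypothetical sample input; a real crawler would pass response.text instead
sample_html = """<p><img alt="a" src="/img/a.png"><img src='b.jpg'/><img alt="no src"></p>"""
print(extract_image_urls(sample_html, "https://example.com/gallery/"))
```

The parser handles single or double quotes and self-closing tags, and tags without a `src` attribute are skipped.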

3. Crawling a site with Scrapy:

First, make sure Scrapy is installed:

pip install scrapy

Create a new Scrapy project:

scrapy startproject myproject
cd myproject

Edit the spider:

# myproject/spiders/myspider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
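        # Use XPath or CSS selectors here to extract data,
        # for example: titles = response.xpath('//h2/text()').getall()
        title = response.css('title::text').get()
        # Yield the result as an item so Scrapy can log or export it
        yield {'title': title}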

Run the Scrapy spider:

scrapy crawl myspider

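These examples are only a starting point. Real projects usually need to handle more error cases, route requests through proxies, set request headers, and so on. When crawling, make sure you comply with the site's robots.txt file and its terms of service.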
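Checking robots.txt can be automated with the standard library's `urllib.robotparser`. The sketch below parses an inline, made-up rule set so that it runs offline. The helper name `build_robot_rules`, the user agent string `my-crawler`, and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules; a real site publishes its own at /robots.txt
sample_robots = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(sample_robots)
# Check each URL before requesting it
print(rules.can_fetch("my-crawler", "https://example.com/private/data.html"))
print(rules.can_fetch("my-crawler", "https://example.com/index.html"))
```

In a real crawler, the rules would come from the site itself: call `set_url()` with the robots.txt address and then `read()` on the parser, instead of parsing an inline string.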

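Disclaimer: This article was contributed by a user and does not represent the position of 【wpsshop博客】 (wpsshop blog). Copyright belongs to the original author, and this site assumes no legal liability. If you find infringing content, please contact us. When reposting, please cite the source: https://www.wpsshop.cn/article/detail/41791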