Using Python's Basic Web-Scraping Libraries

Author: 木道寻08 | 2024-08-07 17:57:46

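Chapters written so far

Chapter 1: Introduction to Web Crawlers
Chapter 2: Using the Basic Libraries
Chapter 3: Using the Parsing Libraries
Chapter 4: Data Storage
Chapter 5: Scraping Dynamic Web Pages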

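Chapter 2: Using the Basic Libraries

2.1 Using the urllib library (not a focus)

Official urllib documentation

urllib is the HTTP request library that ships with Python, so it needs no extra installation. It contains the following four modules:

request: the basic HTTP request module, used to simulate sending requests
error: the exception-handling module
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining
robotparser: mainly used to read a site's robots.txt file and then decide which content may be crawled and which may not; it is used relatively rarely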
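
Since robotparser gets no further coverage in this section, here is a minimal sketch of how it can be used. The robots.txt content and the example.com URLs are made up for illustration; the rules are passed in as text so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() accepts an iterable of lines

# can_fetch(useragent, url) reports whether the rules allow crawling the URL.
allowed = parser.can_fetch("*", "http://example.com/index.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)  # True False
```

For a real site you would normally call set_url() with the address of its robots.txt and then read() instead of parse().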

2.1.1 The request module

Sending a request

# 2.1 Example: sending a request with the request module of urllib
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

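request.urlopen() sends a request to the Baidu home page and returns an http.client.HTTPResponse object. The object provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, reason, debuglevel, and closed. The code above reads the returned body, decodes it as UTF-8, and prints it, so running it outputs the HTML source of the Baidu home page.

Here is the output from my run: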

<!DOCTYPE html><!--STATUS OK-->


    <html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><meta name="description" content="全球最大的中文搜索引擎、致力于让网民更便捷地获取
信息，找到所求。百度超过千亿的中文网页数据库，可以瞬间找到相关的搜索结果。">
(remainder omitted)......
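
If you want to try the HTTPResponse methods and attributes without depending on an external site, the following sketch may help. It starts a throwaway HTTP server on 127.0.0.1 and requests a page from it. HelloHandler and the page body are invented for this example, and the explicit opener with an empty ProxyHandler is only there so that proxy settings in the environment do not redirect the local request; the response object is the same kind that urlopen() returns.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = "<html><body>hello</body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# An opener without proxies, so the request really goes to 127.0.0.1.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# Using the response as a context manager closes the connection afterwards.
with opener.open(url, timeout=5) as response:
    status = response.status
    reason = response.reason
    content_type = response.getheader("Content-Type")
    html = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(status, reason, content_type)
print(html)
```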

Now look at another example:

# 2.2 Example: sending a request with urllib's request module and reading information from the response object
import urllib.request

response = urllib.request.urlopen("http://www.python.org")
print(response.read().decode('utf-8')[:100])  # the first 100 characters of the returned HTML
print("Type of response: " + str(type(response)))
print("Status of response: " + str(response.status))
print("Response headers: " + str(response.getheaders()))
print("Server value in the response headers: " + str(response.getheader('Server')))

The code above uses urlopen() to send a request to the given URL and gets back an HTTPResponse object, then calls its methods and attributes to obtain the response status, the response headers, and so on.

Here is the result of my run:

<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!-
Type of response: <class 'http.client.HTTPResponse'>
Status of response: 200
Response headers: [('Connection', 'close'), ('Content-Length', '50890'), ('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur, 1.1 varnish, 1.1 varnish'), ('Accept-Ranges', 'bytes'), ('Date', 'Mon, 17 May 2021 08:59:57 GMT'), ('Age', '1660'), ('X-Served-By', 'cache-bwi5163-BWI, cache-hkg17920-HKG'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 3886'), ('X-Timer', 'S1621241997.260514,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
Server value in the response headers: nginx

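The data parameter

The data parameter is optional. It must be of type bytes, so a dictionary has to be URL-encoded and then converted to bytes first. When data is supplied, urlopen() sends the request as POST instead of GET.

# 2.3 Using the data parameter
import urllib.parse
import urllib.request

# parse.urlencode() turns the dictionary into a query string; bytes() then encodes it as UTF-8
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())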

This time the request goes to http://httpbin.org/post, a site for testing HTTP requests. It returns information about the request it received, including the data we passed.
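
To see what the data parameter does without sending anything, the two steps can be looked at separately. This is a small offline sketch: it only builds Request objects for the httpbin.org URLs used above and asks them which method they would use; no request is sent.

```python
from urllib import parse, request

form = {"word": "hello"}

# urlencode() returns a str; encode() turns it into the bytes that data expects.
encoded = parse.urlencode(form)
data = encoded.encode("utf-8")

get_req = request.Request("http://httpbin.org/get")
post_req = request.Request("http://httpbin.org/post", data=data)

print(encoded)                # word=hello
print(data)                   # b'word=hello'
print(get_req.get_method())   # GET, because there is no data
print(post_req.get_method())  # POST, because data was supplied
```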

The timeout parameter

The timeout parameter sets the timeout in seconds: if no response arrives within that time, an exception is raised.

# 2.4 Using the timeout parameter
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com', timeout=1)
print(response.read())

The output is not shown here.
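
In practice a timeout is usually caught rather than left to crash the crawler. The sketch below is one possible way to do that, under a few assumptions of mine: a local socket that never answers stands in for a slow server, and an opener without proxies is used so the request stays on 127.0.0.1. A timeout during connecting arrives wrapped in URLError, while a timeout during the wait for the response is raised as socket.timeout directly, so both are handled.

```python
import socket
import urllib.error
import urllib.request

# A local socket that takes TCP connections into its backlog but never answers,
# standing in for a slow server (an assumption made for this sketch).
silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent.bind(("127.0.0.1", 0))
silent.listen(1)
url = "http://127.0.0.1:%d/" % silent.getsockname()[1]

# An opener without proxies, so the request really goes to 127.0.0.1.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

timed_out = False
try:
    opener.open(url, timeout=0.5)
except urllib.error.URLError as e:
    # Timeouts while connecting or sending arrive wrapped in URLError.
    timed_out = isinstance(e.reason, socket.timeout)
except socket.timeout:
    # Timeouts while waiting for the response are raised directly.
    timed_out = True
finally:
    silent.close()

print("timed out:", timed_out)
```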

Other parameters

Besides data and timeout, there is also the context parameter, which must be an ssl.SSLContext object and specifies the SSL settings.
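
As a rough illustration of the context parameter, the following sketch creates a default SSLContext and passes it to urlopen(). The helper name fetch_https is hypothetical and is only defined here, not called, so the snippet itself needs no network access.

```python
import ssl
import urllib.request

# create_default_context() loads the system CA certificates and
# enables certificate and host-name verification.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True


def fetch_https(url, timeout=10):
    """Hypothetical helper: open an HTTPS URL with the explicit SSL context."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()
```

Calling fetch_https('https://www.python.org') would then use these SSL settings for the connection.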

The Request class

urlopen() can send a basic request, but its few parameters are not enough to build a complete one. To add headers and other information to a request, use the more powerful Request class.

# 2.5 Using the Request class
import urllib.request

request = urllib.request.Request('https://python.org')
print(type(request))
response = urllib.request.urlopen(request)   # pass in the Request object
print(response.read().decode('utf-8'))

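The Request constructor:

Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: the URL to request
data: must be a byte stream (bytes)
headers: the request headers, as a dictionary
origin_req_host: the host name or IP address of the requester
unverifiable: whether the request is unverifiable; defaults to False
method: the request method to use, such as GET, POST, or PUT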
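
A Request object can be inspected before it is sent, which is a convenient way to check how the constructor arguments end up in the request. The sketch below sends nothing; the User-Agent string and the Accept header are placeholders chosen for this example.

```python
from urllib import request

req = request.Request(
    url="http://httpbin.org/post",
    data=b"name=Germey",
    headers={"User-Agent": "Mozilla/5.0 (compatible; example)"},
    method="POST",
)
req.add_header("Accept", "application/json")  # headers can also be added later

print(req.full_url)      # http://httpbin.org/post
print(req.host)          # httpbin.org
print(req.get_method())  # POST
print(req.data)          # b'name=Germey'

# Request stores header names capitalized, so 'User-Agent' is kept as 'User-agent'.
print(req.get_header("User-agent"))
print(req.get_header("Accept"))
```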

Here is an example:

# 2.6 Using the Request class
from urllib import request, parse

url = "http://httpbin.org/get"
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# named form_data rather than dict, so the built-in dict type is not shadowed
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='GET')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Again we request the test URL http://httpbin.org/get, which returns information about the request we sent. Here is my result:

{
  "args": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)", 
    "X-Amzn-Trace-Id": "Root=1-60a236ed-01f68c862b09c8934983ae80"
  }, 
  "origin": "221.176.140.213", 
  "url": "http://httpbin.org/get"
}

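The result shows the User-Agent and Host headers that we set ourselves. The body name=Germey was sent as well, as the Content-Length of 11 and the form Content-Type indicate, although the /get response does not echo the body and args stays empty.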
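
With GET, parameters are normally carried in the query string rather than in the body, and a test service such as httpbin.org reports them under args. Here is a minimal offline sketch of building such a URL with urlencode(); the User-Agent value is a placeholder and no request is actually sent.

```python
from urllib import parse, request

base_url = "http://httpbin.org/get"
params = {"name": "Germey"}

# For GET, append the URL-encoded parameters to the URL as a query string.
url = base_url + "?" + parse.urlencode(params)
req = request.Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; example)"})

# Parse the query string back into a dictionary to check what was built.
query = parse.parse_qs(parse.urlparse(url).query)

print(url)               # http://httpbin.org/get?name=Germey
print(req.get_method())  # GET
print(query)             # {'name': ['Germey']}
```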

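2.1.2 The error module

The error module of urllib defines the exceptions raised by the request module. When something goes wrong, the request module raises one of the exceptions defined in the error module.

The two most commonly used exceptions are URLError and HTTPError.

URLError

URLError is the base class of the exceptions in the error module, so the exceptions that the request module raises from this module can be handled by catching it.

# 2.7 Example: using URLError
from urllib import request, error

# open a page that does not exist
try:
    response = request.urlopen('https://casdfasf.com/index.htm')
except error.URLError as e:
    print(e.reason)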

Output:

[Errno 11001] getaddrinfo failed

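HTTPError

HTTPError is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication. It has the following three attributes:

code: the HTTP status code
reason: the reason for the error
headers: the response headers

# 2.8 Attributes of an HTTPError object
from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')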

Output:

Not Found
404
Server: GitHub.com
Date: Tue, 16 Feb 2021 03:01:45 GMT
Content-Type: text/html; charset=utf-8
X-NWS-UUID-VERIFY: 8e28a376520626e0b40a8367b1c3ef01
Access-Control-Allow-Origin: *
ETag: "6026a4f6-c62c"
x-proxy-cache: MISS
X-GitHub-Request-Id: 0D4A:288A:10EE94:125FAD:602B33C2
Accept-Ranges: bytes
Age: 471
Via: 1.1 varnish
X-Served-By: cache-tyo11941-TYO
X-Cache: HIT
X-Cache-Hits: 0
X-Timer: S1613444506.169026,VS0,VE0
Vary: Accept-Encoding
X-Fastly-Request-ID: 9799b7e3df8bdc203561b19afc32bb5803c1f03c
X-Daa-Tunnel: hop_count=2
X-Cache-Lookup: Hit From Upstream
X-Cache-Lookup: Hit From Inner Cluster
Content-Length: 50732
X-NWS-LOG-UUID: 5426589989384885430
Connection: close
X-Cache-Lookup: Cache Miss
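
Because HTTPError is a subclass of URLError, a common pattern is to catch HTTPError first and URLError second. The sketch below is one way to write that, tested against a throwaway local server instead of a real site. NotFoundHandler and describe() are names I made up for the example, and the opener without proxies only keeps the requests on 127.0.0.1.

```python
import socket
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# An opener without proxies, so the requests really go to 127.0.0.1.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def describe(url):
    """Return a short description of what happened when opening url."""
    try:
        with opener.open(url, timeout=5) as response:
            return "ok %d" % response.status
    except urllib.error.HTTPError as e:  # the subclass must come first
        return "HTTPError %d %s" % (e.code, e.reason)
    except urllib.error.URLError as e:   # then the base class
        return "URLError %s" % type(e.reason).__name__


# Case 1: the server answers, but with an HTTP error status.
server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
http_result = describe("http://127.0.0.1:%d/missing" % server.server_address[1])
server.shutdown()
server.server_close()

# Case 2: nothing is listening. Bind to a free port, then close it again.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()
url_result = describe("http://127.0.0.1:%d/" % closed_port)

print(http_result)  # HTTPError 404 Not Found
print(url_result)   # starts with "URLError"; the reason depends on the platform
```

If the two except clauses were swapped, the URLError branch would also catch every HTTPError and the status code would never be reported.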

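2.1.3 The parse module

The parse module handles URLs: it can extract, merge, and convert the individual parts of a URL.

Several commonly used functions of the parse module are introduced below:

urlparse()

Identifies a URL and splits it into its components.

# 2.9 Using urlparse() from the parse module of urllib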
from urllib.parse import urlparse

result = urlparse('http://www.biadu.com/index.html;user?id=5#comment')
print(type(result), result)

result1 = urlparse('www.biadu.com/index.html;user?id=5#comment', scheme='https')
print(type(result1), result1)

result2 = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(type(result2), result2)

result3 = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result3.scheme, result3[0], result3.netloc, result3[1], sep="\n")

Output:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.biadu.com', path='/index.html', params='user', query='id=5', fragment='comment')
<class 'urllib.parse.ParseResult'> ParseResult(scheme='https', netloc='', path='www.biadu.com/index.html', params='user', query='id=5', fragment='comment')
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
http
http
www.baidu.com
www.baidu.com
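
The result of urlparse() is a named tuple, so its parts can be listed by name, and the reverse direction is available too. The following sketch prints each field, rebuilds the URL with urlunparse(), and resolves a relative link with urljoin(); the /a/b.html and c.html paths are made-up examples.

```python
from urllib.parse import urljoin, urlparse, urlunparse

url = "http://www.baidu.com/index.html;user?id=5#comment"
result = urlparse(url)

# ParseResult is a named tuple, so the field names are available as _fields.
for name, value in zip(result._fields, result):
    print(name, "=", value)

# urlunparse() merges the six parts back into a URL.
rebuilt = urlunparse(result)
# urljoin() resolves a relative link against a base URL.
joined = urljoin("http://www.baidu.com/a/b.html", "c.html")

print(rebuilt == url)  # True
print(joined)          # http://www.baidu.com/a/c.html
```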

As you can see, urlparse() splits the URL into six parts and returns a ParseResult object. The six parts are:

scheme：