Python web scraper returns garbled Chinese (python2: handling garbled Chinese in request responses)

Author: 黑客灵魂 | 2024-07-31 17:19:11



1. Reproducing the problem

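With some free time today, I decided to brush up on web scraping, using my school's official website as the test target: scrape the news titles, dates, and links from the home page. The Chinese text in the result, however, kept coming back garbled:

[Image: screenshot of the garbled output]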
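To see how this kind of garbling can arise, here is a minimal sketch. It assumes (the post does not show the response headers) that the server sent UTF-8 bytes without naming a charset in its Content-Type header, in which case requests falls back to ISO-8859-1 for text responses. The sample string and the variable names are illustrative only.

```python
# Assumption: the page is UTF-8, but the bytes get decoded as ISO-8859-1.
original = "学校新闻"
raw = original.encode("utf-8")           # the bytes a UTF-8 page would carry

# ISO-8859-1 maps every byte to one character, so this never raises;
# it silently produces one wrong character per byte.
garbled = raw.decode("iso-8859-1")

# Because that mapping is one-to-one, the damage can be undone.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

The round trip only works when the wrong decoder was lossless, as ISO-8859-1 is. Setting the right encoding before decoding is the cleaner fix.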

2. Attempting a fix

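First, check which encoding the target site uses. On the page you want to scrape, right-click, choose Inspect, open the Console tab, and type document.charset to display the page's encoding:

[Image: screenshot of the Console showing the page's encoding]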
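The same check can be approximated from Python by looking for a charset declaration in the page's meta tags. This is a rough sketch, not what the browser does internally: `sniff_meta_charset`, its `limit` parameter, and the sample markup are hypothetical names introduced here for illustration.

```python
import re

# Covers <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> form.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE
)

def sniff_meta_charset(raw, limit=2048):
    """Return the charset named in a <meta> tag near the top of raw, or None."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None

sample = b'<html><head><meta charset="UTF-8"><title>News</title></head></html>'
print(sniff_meta_charset(sample))
```

With requests, the raw bytes are available as `html.content`, so the result could be assigned to `html.encoding` before reading `html.text`. requests also exposes `html.apparent_encoding`, which guesses the encoding from the body itself.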

Our initial code was:

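import requests
from lxml import etree

# Fetch the school's home page
html = requests.get('https://www.cczu.edu.cn/')
# html.text is decoded with whatever encoding requests inferred from the headers
tree = etree.HTML(html.text)
# Select the <li> elements inside <ul class="clearfix">
items = tree.xpath("//ul[@class='clearfix']/li")
for item in items:
    title = ''.join(item.xpath('.//h2//text()'))
    date = ''.join(item.xpath('.//h3//text()'))
    link = ''.join(item.xpath('./h2/a/@href'))
    print(title, date, link)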

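So let's use the page's encoding to fix the garbled text that requests returns, by setting the response's encoding before reading html.text: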

import requests
from lxml import etree

# Fetch the school's home page
html = requests.get('https://www.cczu.edu.cn/')
# Newly added: set the encoding explicitly so html.text is decoded as UTF-8
html.encoding = "utf-8"
tree = etree.HTML(html.text)
# Select the <li> elements inside <ul class="clearfix">
items = tree.xpath("//ul[@class='clearfix']/li")
for item in items:
    title = ''.join(item.xpath('.//h2//text()'))
    date = ''.join(item.xpath('.//h3//text()'))
    link = ''.join(item.xpath('./h2/a/@href'))
    print(title, date, link)
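Hard-coding "utf-8" works for this site. For pages whose encoding is not known in advance, one possible variation is to try a short list of candidate encodings on the raw bytes. The sketch below is an assumption-laden illustration rather than part of the original fix: `decode_body` and its candidate list are hypothetical, and gb18030 is included only because it is a common alternative on Chinese sites.

```python
def decode_body(raw, candidates=("utf-8", "gb18030")):
    """Decode raw bytes with the first candidate encoding that fits.

    Returns a (text, encoding_name) tuple.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of failing.
    return raw.decode(candidates[0], errors="replace"), candidates[0]

print(decode_body("中文".encode("utf-8")))
print(decode_body("中文".encode("gbk")))
```

UTF-8 goes first because its decoder is strict and rejects most non-UTF-8 byte sequences, whereas a permissive decoder placed first would accept almost anything. With requests, `raw` would be `html.content`.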

Problem solved!

[Image: screenshot of the correctly decoded output]

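Disclaimer: This article was contributed by a user and does not represent the position of [wpsshop博客]. Copyright belongs to the original author, and this site assumes no legal liability. If you find infringing content, please contact us. When reposting, please cite the source: https://www.wpsshop.cn/w/黑客灵魂/article/detail/909911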
