Open the JD.com home page and type any product name into the search box. Here we use Huawei's recently released phone, the Huawei P50, as an example. Click Search and the results page looks like this:
A login page may appear first; if so, log in before continuing:
Once you are on the first page of results, note its URL, then keep scrolling down until the pagination controls appear:
Click through to pages 2, 3, and 4, record the URL of each page, and the following pattern emerges:
Page 1: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=1&s=58&click=0
Page 2: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=3&s=58&click=0
Page 3: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=5&s=121&click=0
Page 4: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=7&s=181&click=0
The URLs are nearly identical from page to page; only the trailing page and s parameters differ (the page parameter takes the odd values 1, 3, 5, 7, ...). It also turns out that removing the last two parameters (s and click) still returns each page's content, and without the s parameter the request URLs are much easier to construct:
Page 1: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=1
Page 2: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=3
Page 3: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=5
Page 4: https://search.jd.com/search.php?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=afc054c82185474c85e8964ee19626c1&page=7
While paging through the results you can also see that each page contains 60 products, but they are loaded in two batches of 30: the second batch only appears after you scroll down. In other words, with the requests library we would only capture half of the products on each page, which is clearly not acceptable, so we use the selenium library to scrape the data instead.
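You can verify the lazy-load behaviour yourself. The following is a minimal sketch (not part of the original code), assuming the search page can be fetched without a login redirect and using the same J_goodsList XPath as the parser below; it counts the products present in the initial HTML:

import requests
from lxml import etree

# Fetch one search page with plain requests and count the product <li> nodes in
# the initial HTML; only about 30 are there, the other 30 load after scrolling.
url = 'https://search.jd.com/Search?keyword=%E5%8D%8E%E4%B8%BAp50&page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
resp = requests.get(url, headers=headers)
tree = etree.HTML(resp.text)
print(len(tree.xpath('//*[@id="J_goodsList"]/ul/li')))  # roughly 30, not 60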
When selenium requests page after page in quick succession, the server sometimes redirects to the login page. To handle this, we first use selenium to capture the cookies of a logged-in session. Later, when fetching the other pages, we extract the data directly if no redirect occurs; if a redirect does occur, we attach the previously saved cookies and re-issue the request to obtain the page source.
The code for capturing the cookies is as follows:
from selenium import webdriver
import json, time

def get_cookie(url):
    browser = webdriver.Chrome()
    browser.get(url)
    time.sleep(60)
    dictCookies = browser.get_cookies()    # get the cookies as a list of dicts
    jsonCookies = json.dumps(dictCookies)  # serialize to a JSON string for saving
    with open('cookies.txt', 'w') as f:
        f.write(jsonCookies)
    print('Cookies saved!')
    browser.close()

if __name__ == '__main__':
    get_cookie('https://passport.jd.com/new/login.aspx')
The url is set to the login page, https://passport.jd.com/new/login.aspx. Run this script first: the login page opens and you have one minute to scan the QR code and log in (you must complete the login within that minute!). When the script finishes it writes a cookies.txt file containing the cookie values needed for a logged-in session.
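The fixed 60-second sleep works but is brittle. As an alternative sketch (not in the original code), selenium's explicit waits can block until the browser has actually left the login page, however long the QR-code scan takes:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import json

def get_cookie_with_wait(url, timeout=180):
    browser = webdriver.Chrome()
    browser.get(url)
    # Wait until the browser is redirected away from the passport (login) domain,
    # i.e. the QR-code login has completed, instead of sleeping a fixed 60 seconds.
    WebDriverWait(browser, timeout).until(
        lambda d: 'passport.jd.com' not in d.current_url
    )
    with open('cookies.txt', 'w') as f:
        json.dump(browser.get_cookies(), f)
    browser.close()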
From the analysis in section 2.1, the links for the first ten pages are constructed as follows:
base_url = 'https://search.jd.com/Search?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&suggest=3.def.0.base&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=ddb6b5344d4c452496da22357a030be8&page={}'
url_list = [base_url.format(i) for i in range(1, 20, 2)]
The list url_list now holds the request URLs for the first ten pages.
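If you want to scrape a different product, the URL-encoded keyword (%E5%8D%8E%E4%B8%BAp50 is simply 华为p50) can be generated with urllib.parse.quote. The following is a hedged generalization, not part of the original post, and it assumes the search endpoint also works with only the keyword, wq, and page parameters:

from urllib.parse import quote

def build_search_urls(keyword, pages=10):
    # URL-encode the keyword (e.g. '华为p50' -> '%E5%8D%8E%E4%B8%BAp50');
    # the page parameter takes the odd values 1, 3, 5, ... as observed above.
    kw = quote(keyword)
    base = f'https://search.jd.com/Search?keyword={kw}&wq={kw}&page={{}}'
    return [base.format(p) for p in range(1, pages * 2, 2)]

url_list = build_search_urls('华为p50')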
def get_html(url):
    browser = webdriver.Chrome()
    browser.get(url)
    # If the current URL matches the requested URL, run a JS snippet to scroll to
    # the bottom so that all products load, then return the page source.
    if browser.current_url == url:
        js = "var q=document.documentElement.scrollTop=100000"
        browser.execute_script(js)
        time.sleep(2)
        responses = browser.page_source
        browser.close()
        return responses
    # Otherwise a redirect happened: attach the locally saved cookies and request the page again.
    else:
        with open('cookies.txt', 'r', encoding='utf8') as f:
            listCookies = json.loads(f.read())
        for cookie in listCookies:
            cookie_dict = {
                'domain': '.jd.com',
                'name': cookie.get('name'),
                'value': cookie.get('value'),
                'expires': '1629446549',
                'path': '/',
                'httpOnly': False,
                'HostOnly': False,
                'Secure': False
            }
            browser.add_cookie(cookie_dict)
        time.sleep(1)
        browser.get(url)
        js = "var q=document.documentElement.scrollTop=100000"
        browser.execute_script(js)
        time.sleep(2)
        responses = browser.page_source
        browser.close()
        return responses
For parsing the page source I prefer XPath expressions, so the lxml module is used here. The parser extracts the product title, price, shop name, number of comments, and promotion information. Because the extracted title fragments are a bit messy, a process_list function is added to clean them up (its definition is shown right after the parser).
def parser(responses):
    res = etree.HTML(responses)
    li_list = res.xpath('//*[@id="J_goodsList"]/ul/li')
    info = []
    for li in li_list:
        title = li.xpath('./div/div[4]/a/em//font/text()')[0]
        all_title = li.xpath('./div/div[4]/a/em/text()')
        all_title = title + process_list(all_title)
        price = li.xpath('./div/div[3]/strong/i/text()')[0]
        shop = li.xpath('./div/div[7]/span/a/text()')
        comment_num = li.xpath('./div/div[5]/strong/a/text()')
        discount = li.xpath('./div/div[8]/i/text()')
        print(all_title, price, shop, comment_num, discount)
        a = {'title': all_title, 'price': price, 'shop': shop,
             'comment_num': comment_num, 'discount': discount}
        info.append(a)
    return info
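The process_list helper referenced above (it also appears in the complete script at the end) joins the title fragments and strips the noisy characters:

def process_list(lists):
    # Concatenate the title fragments, removing newlines, 【 】 brackets, and hyphens.
    a = ''
    for i in lists:
        b = i.replace('\n', '').replace('【', '').replace('】', '').replace('-', '')
        a += b
    return a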
The extracted information is stored in MongoDB: create a database named Jingdong with a collection named huawei P50, and insert the records returned by the parser one by one, as shown below:
def save_info_to_mongo(info):
    # Connect to the local MongoDB: database 'Jingdong', collection 'huawei P50'.
    client = pymongo.MongoClient('localhost', 27017)
    collection = Collection(Database(client, 'Jingdong'), 'huawei P50')
    for item in info:
        collection.insert_one(item)
    client.close()
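To check what was written, here is a minimal sketch (not part of the original post) that reads a few documents back from the same collection:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
collection = client['Jingdong']['huawei P50']
print(collection.count_documents({}))   # total number of products stored
for doc in collection.find().limit(5):  # peek at the first few records
    print(doc)
client.close()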
Putting everything together, the complete script:

import json
import time

import pymongo
from pymongo.database import Database
from pymongo.collection import Collection
from lxml import etree
from selenium import webdriver


def get_html(url):
    browser = webdriver.Chrome()
    browser.get(url)
    # No redirect: scroll to the bottom so all 60 products load, then return the source.
    if browser.current_url == url:
        js = "var q=document.documentElement.scrollTop=100000"
        browser.execute_script(js)
        time.sleep(2)
        responses = browser.page_source
        browser.close()
        return responses
    # Redirected to the login page: attach the saved cookies and request the page again.
    else:
        with open('cookies.txt', 'r', encoding='utf8') as f:
            listCookies = json.loads(f.read())
        for cookie in listCookies:
            cookie_dict = {
                'domain': '.jd.com',
                'name': cookie.get('name'),
                'value': cookie.get('value'),
                'expires': '1629446549',
                'path': '/',
                'httpOnly': False,
                'HostOnly': False,
                'Secure': False
            }
            browser.add_cookie(cookie_dict)
        time.sleep(1)
        browser.get(url)
        js = "var q=document.documentElement.scrollTop=100000"
        browser.execute_script(js)
        time.sleep(2)
        responses = browser.page_source
        browser.close()
        return responses


def parser(responses):
    res = etree.HTML(responses)
    li_list = res.xpath('//*[@id="J_goodsList"]/ul/li')
    info = []
    for li in li_list:
        title = li.xpath('./div/div[4]/a/em//font/text()')[0]
        all_title = li.xpath('./div/div[4]/a/em/text()')
        all_title = title + process_list(all_title)
        price = li.xpath('./div/div[3]/strong/i/text()')[0]
        shop = li.xpath('./div/div[7]/span/a/text()')
        comment_num = li.xpath('./div/div[5]/strong/a/text()')
        discount = li.xpath('./div/div[8]/i/text()')
        print(all_title, price, shop, comment_num, discount)
        a = {'title': all_title, 'price': price, 'shop': shop,
             'comment_num': comment_num, 'discount': discount}
        info.append(a)
    return info


def save_info_to_mongo(info):
    client = pymongo.MongoClient('localhost', 27017)
    collection = Collection(Database(client, 'Jingdong'), 'huawei P50')
    for item in info:
        collection.insert_one(item)
    client.close()


def process_list(lists):
    a = ''
    for i in lists:
        b = i.replace('\n', '').replace('【', '').replace('】', '').replace('-', '')
        a += b
    return a


if __name__ == '__main__':
    base_url = 'https://search.jd.com/Search?keyword=%E5%8D%8E%E4%B8%BAp50&qrst=1&suggest=3.def.0.base&wq=%E5%8D%8E%E4%B8%BAp50&ev=exbrand_%E5%8D%8E%E4%B8%BA%EF%BC%88HUAWEI%EF%BC%89%5E&pvid=ddb6b5344d4c452496da22357a030be8&page={}'
    url_list = [base_url.format(i) for i in range(1, 20, 2)]
    for page_url in url_list:
        save_info_to_mongo(parser(get_html(page_url)))
This example is for reference and learning only; if you spot any mistakes, please point them out!