Web Scraping with Java + Selenium: Simulating a Browser to Crawl Data

Author: 从前慢现在也慢 | 2024-02-16 02:08:27

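The overall approach: fetch a page's HTML after it has finished loading, parse that HTML, and extract the elements that hold the information you need.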

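1. Preparing the environment
You need basic knowledge of Java and HTML. For the environment you need Java, Selenium, and the Chrome browser with a matching chromedriver (other browsers such as Firefox also work with a little extra configuration). The code in section 3 also imports jsoup for HTML parsing, so that library needs to be on the classpath as well. For setting up Selenium + Chrome on macOS, see my other blog post: http://blog.csdn.net/egg1996911/article/details/72085151.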

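2. Analyzing the HTML structure of the target site
Taking Sina NBA (http://sports.sina.com.cn/nba/) as the example, I want to scrape the news items on the home page, the part outlined in blue in the screenshot below:
[Screenshot: Sina NBA home page, news section outlined in blue]

Open the browser's developer tools and inspect the HTML elements of the outlined section:
[Screenshot: developer tools showing the HTML of the outlined section]
The news entries all sit under li elements with class="item", which gives us something to select on.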

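3. Writing the code
Spider, MyWebDriver and MyFileWriter are helper types that are not shown here; they are presumably part of the project source linked at the end of this post.
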
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;

import java.util.List;
import java.util.concurrent.TimeUnit;

/**
 * Created by deng on 2018/3/6.
 */
public class SinaSportsSpider implements Spider {
    private final static String TARGET_URL = "http://sports.sina.com.cn/nba/";
    private final static String FILE_NAME = "/Users/deng/IdeaProjects/Spider/src/SinaSportsNews.txt";

    public void run() throws InterruptedException {
        StringBuilder allNewses = new StringBuilder();

        WebDriver driver = MyWebDriver.createWebDriver();
        driver.get(TARGET_URL);

        WebDriver newsDriver = MyWebDriver.createWebDriver();
        // A page load that takes longer than 8 seconds times out and throws an exception
        newsDriver.manage().timeouts().pageLoadTimeout(8, TimeUnit.SECONDS);

        Document document = Jsoup.parse(driver.getPageSource());
        List<Element> liTags = document.getElementsByClass("item");
        // Scrape only the first 3 "item" blocks, or fewer if the page has fewer
        for (int i = 0; i < Math.min(3, liTags.size()); i++) {
            List<Element> aTags = liTags.get(i).getElementsByTag("a");

            // Walk through the links of a single news block
            for (Element a : aTags) {
                String href = a.attr("href");

                // Keep only the URLs that point to news articles
                if (href.contains("sports.sina.com.cn") && href.contains("shtml")) {
                    System.out.println(href);
                    allNewses.append(href + "\n");

                    try {
                        newsDriver.get(href);
                    } catch (Exception e) {
                        // Page load timed out: run JavaScript to stop the load manually
                        ((JavascriptExecutor) newsDriver).executeScript("window.stop()");
                    } finally {
                        Document newsDocument = Jsoup.parse(newsDriver.getPageSource());
                        String title = newsDocument.getElementsByClass("main-title").get(0).text();

                        Element dateAndSource = newsDocument.getElementsByClass("date-source").get(0);
                        String date = dateAndSource.getElementsByTag("span").get(0).text();
                        String source = dateAndSource.getElementsByTag("a").get(0).text();


                        allNewses.append(title + "\n");
                        allNewses.append(date + " " + source + "\n");

                        Element article = newsDocument.getElementById("artibody");
                        for (Element p : article.getElementsByTag("p")) {
                            allNewses.append(p.text().trim() + "\n");
                        }
                    }
                }
            }
        }
        Thread.sleep(3000);

        driver.quit();
        newsDriver.quit();

        MyFileWriter.writeString(FILE_NAME, allNewses.toString());
    }

    public static void main(String[] args) {
        try {
            new SinaSportsSpider().run();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
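
The link filter in the loop above uses `href.contains(...)`, which matches those substrings anywhere in the URL, including the query string. As one possible tightening, the sketch below parses each href with `java.net.URI` and checks the host and the path suffix instead. It also drops duplicate links, on the assumption that one news item may be linked more than once. `NewsUrlFilter` and its method names are hypothetical and not part of the original project.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NewsUrlFilter {
    // Host and suffix are taken from the two contains(...) checks in the article's loop.
    private static final String NEWS_HOST = "sports.sina.com.cn";
    private static final String NEWS_SUFFIX = ".shtml";

    /** Returns true if href parses as a URI on the news host whose path ends in .shtml. */
    static boolean isNewsUrl(String href) {
        if (href == null || href.trim().isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(href.trim());
            String host = uri.getHost();
            String path = uri.getPath();
            return host != null && path != null
                    && host.equalsIgnoreCase(NEWS_HOST)
                    && path.endsWith(NEWS_SUFFIX);
        } catch (URISyntaxException e) {
            // Malformed hrefs are simply not news links.
            return false;
        }
    }

    /** Keeps news URLs only, in first-seen order, without duplicates. */
    static List<String> filter(List<String> hrefs) {
        Set<String> kept = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (isNewsUrl(href)) {
                kept.add(href.trim());
            }
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // Made-up hrefs for illustration only.
        List<String> hrefs = Arrays.asList(
                "http://sports.sina.com.cn/nba/",
                "http://sports.sina.com.cn/basketball/nba/doc-example.shtml",
                "http://sports.sina.com.cn/basketball/nba/doc-example.shtml",
                "javascript:void(0)");
        System.out.println(filter(hrefs));
    }
}
```

In the loop, `if (NewsUrlFilter.isNewsUrl(href))` could stand in for the two `contains` checks. This version is stricter than the original: it requires an exact host match, so article links on other subdomains would be rejected.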

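Overall flow: find the li elements with class item, take the a elements under each li to get the news title and link, then open the news page with a second WebDriver to get the article details.
When a single news page is opened, the content we need (title, time, body) has already loaded, but the page still shows as loading (a spinner stays next to the URL). A timeout is therefore needed. Provided the needed content has loaded within that time (8 seconds in this program), the information can still be scraped after the page load is stopped manually.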
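
An alternative to a fixed cutoff is to poll for the content itself and continue as soon as it shows up. The sketch below shows the idea with plain JDK classes only: `PollingWait.until` (a hypothetical helper, not from the original project) calls a probe repeatedly until it returns a non-null value or a deadline passes. In a real crawler the probe would ask the driver whether the element with id `artibody` is present; here a counter stands in for the page.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

final class PollingWait {
    /**
     * Calls probe until it returns a non-null value or the timeout elapses.
     * The probe always runs at least once. Returns Optional.empty() on timeout
     * or if the waiting thread is interrupted.
     */
    static <T> Optional<T> until(Supplier<T> probe, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = probe.get();
            if (value != null) {
                return Optional.of(value);
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "is the article body in the page yet?": ready on the 3rd probe.
        int[] probes = {0};
        Optional<String> body = until(
                () -> ++probes[0] >= 3 ? "<div id=\"artibody\">...</div>" : null,
                Duration.ofSeconds(2),
                Duration.ofMillis(50));
        System.out.println("found=" + body.isPresent() + ", probes=" + probes[0]);
    }
}
```

Selenium ships this pattern as `WebDriverWait` together with `ExpectedConditions`. With Selenium 4, `new WebDriverWait(newsDriver, Duration.ofSeconds(8)).until(ExpectedConditions.presenceOfElementLocated(By.id("artibody")))` would be the library equivalent, assuming the article body keeps the id `artibody`.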

[Image: NBA news]

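4. Summary
The code is fairly crude and leaves a lot of room for improvement.
1) A fixed 8-second timeout is hard to justify: why 8 seconds? Too long wastes time, and too short means the content may not have finished loading. There should be a better solution, for example Selenium's explicit wait mechanism (still to be investigated).
2) If an exception occurs partway through, everything scraped so far is lost and the crawler cannot continue. There is no error recovery mechanism.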
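
Point 2) can be addressed in a simple way by handling each article independently and writing each result to disk as soon as it is available. The sketch below is one way to do that under stated assumptions: `ResilientCrawl`, `crawlAll`, and the `fetch` function are hypothetical names, and `fetch` stands in for the Selenium + jsoup code that loads and parses one news page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class ResilientCrawl {
    /**
     * Fetches every URL on its own. Each successful result is appended to the
     * output file immediately; URLs whose fetch throws are collected and returned.
     */
    static List<String> crawlAll(List<String> urls, Function<String, String> fetch, Path out) {
        List<String> failed = new ArrayList<>();
        for (String url : urls) {
            String text;
            try {
                text = fetch.apply(url);
            } catch (RuntimeException e) {
                // One bad page should not abort the run or discard earlier results.
                System.err.println("skipped " + url + ": " + e);
                failed.add(url);
                continue;
            }
            append(out, url + "\n" + text + "\n");
        }
        return failed;
    }

    // Appends one block to the file, creating the file on first use.
    private static void append(Path out, String block) {
        try {
            Files.write(out, block.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("news", ".txt");
        // Fake fetch: the second page fails, the others succeed.
        List<String> failed = crawlAll(
                Arrays.asList("page-1", "page-2", "page-3"),
                url -> {
                    if (url.equals("page-2")) {
                        throw new IllegalStateException("simulated failure");
                    }
                    return "body of " + url;
                },
                out);
        System.out.println("failed: " + failed);
        System.out.print(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
    }
}
```

The returned list of failed URLs could be retried in a second pass. Because results are appended as they arrive, an unexpected crash loses at most the article being processed at that moment.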

Project source: https://github.com/LeiDengDengDeng/spider
