Overall project workflow:
python /home/jackie/project_0907/generate_log.py
[jackie@hadoop102:project_0907]$ crontab -e
*/1 * * * * /home/jackie/project_0907/lgl.sh
# standalone ZooKeeper
/opt/module/zookeeper/bin/zkServer.sh start
# start the Kafka broker
/opt/module/kafka/bin/kafka-server-start.sh \
-daemon /opt/module/kafka/config/server.properties
# start Flume
/opt/module/flume/bin/flume-ng agent \
--conf /opt/module/flume/conf/ \
--conf-file /home/jackie/project_0907/streaming_project.conf \
--name exec-memory-kafka \
-Dflume.root.logger=INFO,console
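The streaming_project.conf referenced above is not shown in this excerpt. Given the agent name exec-memory-kafka, it is presumably an exec source -> memory channel -> Kafka sink pipeline along the lines of the sketch below; the tailed log path and the broker port are assumptions, and the sink property names follow the Flume 1.6 style (newer versions use kafka.bootstrap.servers / kafka.topic instead):

exec-memory-kafka.sources = exec-source
exec-memory-kafka.channels = memory-channel
exec-memory-kafka.sinks = kafka-sink

exec-memory-kafka.sources.exec-source.type = exec
# assumed path of the file written by generate_log.py
exec-memory-kafka.sources.exec-source.command = tail -F /home/jackie/project_0907/access.log
exec-memory-kafka.sources.exec-source.channels = memory-channel

exec-memory-kafka.channels.memory-channel.type = memory

exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.brokerList = hadoop102:9092
exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel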
Simulate a consumer and print the messages to the console
# consume
[jackie@hadoop102:kafka]$ bin/kafka-console-consumer.sh --zookeeper hadoop102:2181 --topic streamingtopic
// 1. Initialize the Spark configuration.
//    For local testing:  val sparkConf = new SparkConf().setAppName("StreamCount").setMaster("local[*]")
//    When packaging for spark-submit, drop setMaster("local[*]"):
val sparkConf = new SparkConf().setAppName("CourseClickCount")

// 2. Initialize the StreamingContext (the real-time analysis environment), batch interval 60s
val streamingContext = new StreamingContext(sparkConf, Seconds(60))

// 3. Pull data from Kafka
val kafkaDStream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(
  streamingContext,
  "hadoop102:2181",
  "lzou",
  Map("streamingtopic" -> 1)
)

kafkaDStream.map(_._2).count().print()

// Sample output:
/*
10.63.98.87 2020-09-15 18:09:01 "GET /class/143.html HTTP/1.1" 404 -
143.55.98.124 2020-09-15 18:09:01 "GET /class/112.html HTTP/1.1" 200 http://www.baidu.com/s?wd=Hadoop基础
132.143.98.156 2020-09-15 18:09:01 "GET /class/128.html HTTP/1.1" 200 -
10.132.63.124 2020-09-15 18:09:01 "GET /class/128.html HTTP/1.1" 200 -
187.10.132.72 2020-09-15 18:09:01 "GET /course/list HTTP/1.1" 200 http://www.baidu.com/s?wd=Spark Streaming
*/
val cleanData: DStream[ClickLog] = kafkaDStream.map(_._2).map(line => {
  val fields: Array[String] = line.split("\t")
  val url: String = fields(2).split(" ")(1)
  var courseId = 0
  // e.g. 10.55.187.87 2020-09-09 14:03:01 "GET /class/112.html HTTP/1.1" 404 -
  if (url.startsWith("/class")) {
    val courseIdHTML: String = url.split("/")(2)
    courseId = courseIdHTML.substring(0, courseIdHTML.lastIndexOf(".")).toInt
  }
  ClickLog(fields(0), DateUtils.parseToMinute(fields(1)), courseId, fields(3).toInt, fields(4))
}).filter(clickLog => clickLog.courseId != 0)

cleanData.print()

// Sample output:
/**
 * ClickLog(30.87.55.46,20200915211301,143,404,-)
 * ClickLog(156.187.132.98,20200915211301,141,404,-)
 * ClickLog(30.55.63.167,20200915211301,143,500,http://search.yahoo.com/search?p=大数据面试)
 */
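ClickLog and DateUtils are referenced above but not defined in this excerpt. A minimal sketch, assuming the field order used in the map (ip, time, courseId, statusCode, referer) and the yyyyMMddHHmmss timestamps seen in the sample output; the project's actual definitions may differ:

case class ClickLog(ip: String, time: String, courseId: Int, statusCode: Int, referer: String)

object DateUtils {
  import java.text.SimpleDateFormat

  // "2020-09-15 18:09:01" -> "20200915180901"
  def parseToMinute(time: String): String = {
    val source = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    val target = new SimpleDateFormat("yyyyMMddHHmmss")
    target.format(source.parse(time))
  }
}

Creating the formatters inside the method sidesteps SimpleDateFormat's lack of thread safety; FastDateFormat from commons-lang3 is a common alternative.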
cleanData.map(x => {
(x.time.substring(0, 8) + "_" + x.courseId, 1)
}).reduceByKey(_ + _).foreachRDD(rdd => {
rdd.foreachPartition(partition => {
val buffer: ListBuffer[CourseClickCount] = new ListBuffer[CourseClickCount]
partition.foreach(pair => {
buffer.append(CourseClickCount(pair._1, pair._2))
})
CourseClickCountDAO.save(buffer)
})
})
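CourseClickCount and CourseClickCountDAO are likewise external to this excerpt. Since the project runs HBase (it appears in the component list and in the cleanup commands below), a plausible minimal sketch is an HBase counter keyed by the yyyyMMdd_courseId row key; the table, column family, and qualifier names here are assumptions, not the project's actual schema:

case class CourseClickCount(dayCourse: String, clickCount: Long)

object CourseClickCountDAO {
  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.ConnectionFactory
  import org.apache.hadoop.hbase.util.Bytes
  import scala.collection.mutable.ListBuffer

  def save(list: ListBuffer[CourseClickCount]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "hadoop102")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("course_clickcount"))  // assumed table name
    try {
      list.foreach { item =>
        // incrementColumnValue keeps the count additive across 60s micro-batches
        table.incrementColumnValue(
          Bytes.toBytes(item.dayCourse),   // row key: yyyyMMdd_courseId
          Bytes.toBytes("info"),           // assumed column family
          Bytes.toBytes("click_count"),    // assumed qualifier
          item.clickCount)
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}

Opening a connection per call keeps the sketch short; a real DAO would reuse a singleton connection across batches.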
// ClickLog(10.46.187.63,20200915212801,143,404,http://www.baidu.com/s?wd=Hadoop基础)
cleanData.map(x => {
  val referer: String = x.referer.replaceAll("//", "/")
  val splits: Array[String] = referer.split("/")
  var host = ""
  if (splits.length > 2) {
    host = splits(1)
  }
  (host, x.courseId, x.time)
}).filter(_._1 != "").map(x => {
  (x._3.substring(0, 8) + "_" + x._1 + "_" + x._2, 1)
}).reduceByKey(_ + _).foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    val list: ListBuffer[CourseSearchClickCount] = new ListBuffer[CourseSearchClickCount]
    partition.foreach(pair => {
      list.append(CourseSearchClickCount(pair._1, pair._2))
    })
    CourseSearchClickCountDAO.save(list)
  })
})
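CourseSearchClickCount follows the same pattern with a day_searchHost_courseId row key, and its DAO can mirror the HBase increment sketch above. Note also that the excerpt never starts the context; a Spark Streaming job only begins consuming once the StreamingContext is started and awaited:

case class CourseSearchClickCount(daySearchCourse: String, clickCount: Long)

// 4. Start the streaming job and block until it terminates
streamingContext.start()
streamingContext.awaitTermination()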
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <version>3.1.1</version>
      <configuration>
        <archive>
          <manifest>
            <!-- main class -->
            <mainClass>com.spark.SparkStreaming_Kafka</mainClass>
          </manifest>
        </archive>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
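With the assembly plugin bound to the package phase, mvn clean package should produce target/spark-streaming-project-1.0-SNAPSHOT-jar-with-dependencies.jar, the fat jar submitted below.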
Start HDFS and ZooKeeper first
# start Spark
[jackie@hadoop102:spark]$ sbin/start-all.sh
# start spark-shell
[jackie@hadoop102 spark]$ bin/spark-shell
[jackie@hadoop102:spark]$ bin/spark-submit --master local[5] \
> --name CourseClickCount \
> --class com.spark.SparkStreaming_Kafka \
> /home/jackie/project_0907/spark-streaming-project-1.0-SNAPSHOT-jar-with-dependencies.jar
Spark Web UI : http://hadoop102:4040/
Full component stack: HDFS, ZK, Flume, Kafka, HBase, Spark
Solution: clear HBase's stale metadata (the /hbase znode in ZooKeeper and the /hbase directory on HDFS):
[jackie@hadoop102:zookeeper]$ bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] rmr /hbase
[jackie@hadoop102:~]$ hadoop fs -rm -r /hbase