Map
DataStream → DataStream: takes one element as input and produces exactly one element as output.
- import org.apache.flink.streaming.api.scala._
-
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream = env.generateSequence(1,10)
- // double every element
- val streamMap = stream.map { x => x * 2 }
- streamMap.print()
-
- env.execute("FirstJob")
Note: in the output of stream.print(), the number at the start of each line indicates which parallel thread (subtask) printed that line.
- // The same Map operation with the Java DataSet (batch) API:
- // pair each input line (a student record) with a random score from 1 to 100.
- import org.apache.flink.api.common.functions.MapFunction;
- import org.apache.flink.api.java.DataSet;
- import org.apache.flink.api.java.ExecutionEnvironment;
- import org.apache.flink.api.java.operators.MapOperator;
- import org.apache.flink.api.java.tuple.Tuple2; // Flink's Tuple2, not scala.Tuple2: writeAsCsv only accepts Flink tuples
- import org.apache.flink.api.java.utils.ParameterTool;
-
- import java.util.Random;
-
- public class StuScore {
-     private static Random rand = new Random();
-
-     public static void main(String[] args) throws Exception {
-         ParameterTool params = ParameterTool.fromArgs(args);
-         ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
-         env.getConfig().setGlobalJobParameters(params);
-
-         DataSet<String> text;
-         // --input only acts as a switch here; the path itself is hard-coded
-         if (params.has("input")) {
-             text = env.readTextFile("F:\\date\\flinkdata\\stu.txt");
-         } else {
-             System.out.println("Please check your input");
-             return;
-         }
-
-         MapOperator<String, Tuple2<String, Integer>> stuscore = text.map(new MapFunction<String, Tuple2<String, Integer>>() {
-             @Override
-             public Tuple2<String, Integer> map(String s) throws Exception {
-                 return new Tuple2<>(s, rand.nextInt(100) + 1);
-             }
-         });
-
-         if (params.has("output")) {
-             stuscore.writeAsCsv("F:\\date\\flinkdata\\personinput\\A");
-             // writeAsCsv is lazy; execute() actually runs the job
-             env.execute("StuScore");
-         } else {
-             System.out.println("Printing to the console");
-             // print() triggers execution by itself
-             stuscore.print();
-         }
-     }
- }

FlatMap
DataStream → DataStream: takes one element as input and produces zero, one, or more elements as output.
- import org.apache.flink.streaming.api.scala._
-
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- // backslashes must be escaped inside a Scala string literal
- val stream = env.readTextFile("F:\\date\\flinkdata\\stu.tsv")
- // split every line into words; each word becomes its own element
- val streamFlatMap = stream.flatMap{
-   x => x.split(" ")
- }
- streamFlatMap.print()
-
- env.execute("FirstJob")
Filter
DataStream → DataStream: evaluates a boolean function on each element and keeps the elements for which it returns true. The example below keeps the odd numbers:
- import org.apache.flink.streaming.api.scala._
-
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream = env.generateSequence(1,10)
- val streamFilter = stream.filter{
-   // keep only the odd numbers
-   x => (x % 2 != 0)
- }
- streamFilter.print()
-
- env.execute("FirstJob")
Connect
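DataStream, DataStream → ConnectedStreams: connects two data streams while preserving their types. After connect, the two streams are merely placed inside one stream; internally each keeps its own data and form unchanged, and the two streams remain independent of each other.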
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream = env.readTextFile("F:\\date\\flinkdata\\stu.tsv")
-
- val streamMap = stream.flatMap(item => item.split(" ")).filter(item => item.equals("hadoop"))
- val streamCollect = env.fromCollection(List(1,2,3,4))
- // swapping the order of streamMap and streamCollect does not affect the result
- val streamConnect = streamMap.connect(streamCollect)
-
- streamConnect.map(item=>println(item), item=>println(item))
-
- env.execute("FirstJob")
CoMap,CoFlatMap
ConnectedStreams → DataStream: operates on ConnectedStreams and works like map and flatMap, applying a separate map or flatMap function to each of the two streams inside the ConnectedStreams.
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream1 = env.readTextFile("F:\\date\\flinkdata\\stu.tsv")
- val streamFlatMap = stream1.flatMap(x => x.split(" "))
- val stream2 = env.fromCollection(List(1,2,3,4))
- val streamConnect = streamFlatMap.connect(stream2)
- val streamCoMap = streamConnect.map(
- (str) => str + "connect",
- (in) => in + 100
- )
-
- streamCoMap.print()
-
- env.execute("FirstJob")
-
- //========================
-
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream1 = env.readTextFile("test.txt")
- val stream2 = env.readTextFile("test1.txt")
- val streamConnect = stream1.connect(stream2)
- val streamCoMap = streamConnect.flatMap(
- (str1) => str1.split(" "),
- (str2) => str2.split(" ")
- )
- streamCoMap.print()
-
- env.execute("FirstJob")

Split
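DataStream → SplitStream: splits one DataStream into two or more DataStreams according to some characteristic. Note: this snippet defines no sink, so running it produces no output; combine it with Select to see results.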
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream = env.readTextFile("F:\\date\\flinkdata\\stu.tsv")
- val streamFlatMap = stream.flatMap(x => x.split(" "))
- val streamSplit = streamFlatMap.split(
-   num =>
-     // elements equal to "hadoop" form one DataStream; all others form another
-     (num.equals("hadoop")) match{
-       case true => List("hadoop")
-       case false => List("other")
-     }
- )
-
- env.execute("FirstJob")
Select
SplitStream → DataStream: selects one or more DataStreams from a SplitStream.
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream = env.readTextFile("F:\\date\\flinkdata\\stu.tsv")
- val streamFlatMap = stream.flatMap(x => x.split(" "))
- val streamSplit = streamFlatMap.split(
- num =>
- (num.equals("hadoop")) match{
- case true => List("hadoop")
- case false => List("other")
- }
- )
-
- val hadoop = streamSplit.select("hadoop")
- val other = streamSplit.select("other")
- other.print()
-
- env.execute("FirstJob")

Union
DataStream → DataStream: performs a union of two or more DataStreams, producing a new DataStream that contains all elements of all of them. Note: if you union a DataStream with itself, every element appears twice in the resulting DataStream.
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream1 = env.readTextFile("test.txt")
- val streamFlatMap1 = stream1.flatMap(x => x.split(" "))
- val stream2 = env.readTextFile("test1.txt")
- val streamFlatMap2 = stream2.flatMap(x => x.split(" "))
- val streamUnion = streamFlatMap1.union(streamFlatMap2)
- streamUnion.print()
-
- env.execute("FirstJob")
KeyBy
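DataStream → KeyedStream: logically partitions a stream into disjoint partitions, each containing the elements that share the same key; internally this is implemented with hashing. With a positional key such as keyBy(0), as used below, the input must be a Tuple type.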
-
- val env = StreamExecutionEnvironment.getExecutionEnvironment
- val stream = env.readTextFile("test.txt")
- val streamFlatMap = stream.flatMap{
- x => x.split(" ")
- }
- val streamMap = streamFlatMap.map{
- x => (x,1)
- }
- val streamKeyBy = streamMap.keyBy(0)
- env.execute("FirstJob")
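To make the "same key, same partition" idea concrete, here is a minimal plain-Java sketch with no Flink dependency. The class and method names (`KeyBySketch`, `partitionFor`, `partition`) are hypothetical, and the plain modulo over `hashCode()` is a deliberate simplification: Flink's real assignment goes through key groups and an extra hashing step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyBySketch {
    // Simplified assumption: partition index = hash of the key modulo the parallelism.
    public static int partitionFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Groups the keys by partition index; equal keys always land in the same partition.
    public static Map<Integer, List<String>> partition(List<String> keys, int parallelism) {
        Map<Integer, List<String>> parts = new TreeMap<>();
        for (String key : keys) {
            parts.computeIfAbsent(partitionFor(key, parallelism), p -> new ArrayList<>()).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "flink", "hadoop", "spark", "flink");
        System.out.println(partition(words, 4));
    }
}
```

Different keys may share a partition, but one key is never spread over several partitions, which is what lets the later aggregation operators keep one running result per key.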
Reduce
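KeyedStream → DataStream: a rolling aggregation on a keyed stream. It combines the current element with the previous aggregated result to produce a new value. The returned stream contains the result of every aggregation step, not just the final result of the last one.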
-
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream = env.readTextFile("test.txt").flatMap(item => item.split(" ")).map(item => (item, 1)).keyBy(0)
-
- val streamReduce = stream.reduce(
- (item1, item2) => (item1._1, item1._2 + item2._2)
- )
-
- streamReduce.print()
-
- env.execute("FirstJob")
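The "one result per step" behaviour can be seen in a small plain-Java simulation with no Flink dependency. The names `RollingReduceSketch` and `rollingCount` are hypothetical, and a `HashMap` stands in for the per-key state that a keyed stream would hold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RollingReduceSketch {
    // For every incoming word, merge it into the running count of its key
    // and emit the updated (word, count) pair immediately.
    public static List<String> rollingCount(List<String> words) {
        Map<String, Integer> state = new HashMap<>(); // per-key running result
        List<String> emitted = new ArrayList<>();
        for (String word : words) {
            int updated = state.merge(word, 1, Integer::sum);
            emitted.add("(" + word + "," + updated + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints [(hadoop,1), (flink,1), (hadoop,2)]
        System.out.println(rollingCount(Arrays.asList("hadoop", "flink", "hadoop")));
    }
}
```

Three input elements yield three output elements: the intermediate `(hadoop,1)` is emitted as well as the later `(hadoop,2)`.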
Fold
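KeyedStream → DataStream: a rolling fold on a keyed stream with an initial value. It combines the current element with the result of the previous fold step to produce a new value. The returned stream contains the result of every fold step, not just the final result of the last one.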
-
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream = env.readTextFile("test.txt").flatMap(item => item.split(" ")).map(item => (item, 1)).keyBy(0)
-
- // start every key at 100 and add the count of each incoming element
- val streamFold = stream.fold(100)(
-   (begin, item) => (begin + item._2)
- )
-
- streamFold.print()
-
- env.execute("FirstJob")
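A plain-Java simulation of the same fold, with no Flink dependency, shows how the initial value is applied once per key. `RollingFoldSketch` and `rollingFold` are hypothetical names; each word is assumed to contribute 1, mirroring the `(item, 1)` tuples above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RollingFoldSketch {
    // Each key starts from `initial`; every element of that key adds 1
    // and the new accumulator is emitted right away.
    public static List<Integer> rollingFold(int initial, List<String> words) {
        Map<String, Integer> state = new HashMap<>(); // per-key accumulator
        List<Integer> emitted = new ArrayList<>();
        for (String word : words) {
            int acc = state.getOrDefault(word, initial) + 1;
            state.put(word, acc);
            emitted.add(acc);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints [101, 101, 102]
        System.out.println(rollingFold(100, Arrays.asList("hadoop", "flink", "hadoop")));
    }
}
```

Note that the output type (an `Integer`) differs from the input type (a word), which is the main thing fold allows and reduce does not.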
Aggregations
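KeyedStream → DataStream: rolling aggregations on a keyed stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that contains the minimum value in that field (the same applies to max and maxBy). The returned stream contains the result of every aggregation step, not just the final result of the last one.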
-
- keyedStream.sum(0)
- keyedStream.sum("key")
- keyedStream.min(0)
- keyedStream.min("key")
- keyedStream.max(0)
- keyedStream.max("key")
- keyedStream.minBy(0)
- keyedStream.minBy("key")
- keyedStream.maxBy(0)
- keyedStream.maxBy("key")
-
- val env = StreamExecutionEnvironment.getExecutionEnvironment
-
- val stream = env.readTextFile("test02.txt").map(item => (item.split(" ")(0), item.split(" ")(1).toLong)).keyBy(0)
-
- val streamReduce = stream.sum(1)
-
- streamReduce.print()
-
- env.execute("FirstJob")
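The min versus minBy distinction is easier to see on elements with more than two fields. The following plain-Java sketch (no Flink dependency; `MinVsMinBySketch`, `Reading`, and `sample` are hypothetical names) models min as "update only the tracked field" and minBy as "keep the whole element that holds the minimum". That the other fields under min stay as they were in the first element seen is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MinVsMinBySketch {
    public static final class Reading {
        final String key; final String tag; final int value;
        public Reading(String key, String tag, int value) {
            this.key = key; this.tag = tag; this.value = value;
        }
        @Override public String toString() { return "(" + key + "," + tag + "," + value + ")"; }
    }

    // min: only the value field is replaced; the tag stays that of the first element (assumption).
    public static List<String> rollingMin(List<Reading> input) {
        Map<String, Reading> state = new HashMap<>();
        List<String> emitted = new ArrayList<>();
        for (Reading r : input) {
            Reading prev = state.get(r.key);
            Reading next = (prev == null) ? r : new Reading(prev.key, prev.tag, Math.min(prev.value, r.value));
            state.put(r.key, next);
            emitted.add(next.toString());
        }
        return emitted;
    }

    // minBy: the entire element with the smallest value is kept and emitted.
    public static List<String> rollingMinBy(List<Reading> input) {
        Map<String, Reading> state = new HashMap<>();
        List<String> emitted = new ArrayList<>();
        for (Reading r : input) {
            Reading prev = state.get(r.key);
            Reading next = (prev == null || r.value < prev.value) ? r : prev;
            state.put(r.key, next);
            emitted.add(next.toString());
        }
        return emitted;
    }

    public static List<Reading> sample() {
        return Arrays.asList(new Reading("k", "a", 5), new Reading("k", "b", 3), new Reading("k", "c", 4));
    }

    public static void main(String[] args) {
        System.out.println(rollingMin(sample()));   // prints [(k,a,5), (k,a,3), (k,a,3)]
        System.out.println(rollingMinBy(sample())); // prints [(k,a,5), (k,b,3), (k,b,3)]
    }
}
```

With min the tag `a` no longer matches the value `3`; with minBy the emitted element `(k,b,3)` is one that really occurred in the input.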

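The operators before 2.3.10 (those preceding Reduce above) can be applied directly to a stream because they are not aggregations. From 2.3.10 onward, you will notice that although an aggregation operator can be applied directly to an unbounded stream, it records the result of every aggregation step, which is usually not what we want. In practice, aggregation operators such as reduce, fold, and the aggregations are used together with Window; only in combination with a Window do they produce the desired result.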
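As a rough illustration of why a window changes the output, the plain-Java sketch below (no Flink dependency) sums values per key but emits only when a key has collected a fixed number of elements, then starts over. It imitates a per-key tumbling count window; `CountWindowSketch`, `windowedSum`, and the window size are hypothetical choices for this example, not Flink API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountWindowSketch {
    // Emits one (key, sum) per completed window of `windowSize` elements per key;
    // intermediate sums are never emitted, and incomplete windows produce nothing.
    public static List<String> windowedSum(String[] keys, int[] values, int windowSize) {
        Map<String, Integer> counts = new HashMap<>(); // elements seen in the open window
        Map<String, Integer> sums = new HashMap<>();   // running sum of the open window
        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            String key = keys[i];
            int count = counts.merge(key, 1, Integer::sum);
            int sum = sums.merge(key, values[i], Integer::sum);
            if (count == windowSize) {
                emitted.add("(" + key + "," + sum + ")");
                counts.remove(key); // close the window and start a fresh one
                sums.remove(key);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "a", "b", "a", "a", "b"};
        int[] values = {1, 2, 10, 3, 4, 20};
        // prints [(a,3), (a,7), (b,30)]
        System.out.println(windowedSum(keys, values, 2));
    }
}
```

Six input elements produce three results, one per completed window, instead of the six rolling results an unwindowed aggregation would emit.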