Apache Flink Stream Processing Suneel Marthi @suneelmarthi Washington DC Apache Flink Meetup, Capital One, Vienna, VA November 19, 2015


Page 1: Apache Flink Stream Processing

Apache Flink Stream Processing

Suneel Marthi @suneelmarthi

Washington DC Apache Flink Meetup, Capital One, Vienna, VA

November 19, 2015

Page 2: Apache Flink Stream Processing

Source Code

2

https://github.com/smarthi/DC-FlinkMeetup

Page 3: Apache Flink Stream Processing

Flink Stack

3

Streaming dataflow runtime

Specialized Abstractions / APIs

Core APIs

Flink Core Runtime

Deployment

Page 4: Apache Flink Stream Processing

The Full Flink Stack

▪ Libraries on the DataSet (Java/Scala) API: Gelly, Table, ML, SAMOA, Hadoop M/R, Cascading, MRQL, Dataflow

▪ Libraries on the DataStream API: Table, Dataflow (WiP), Storm compatibility (WiP)

▪ Zeppelin (notebook integration)

▪ Streaming dataflow runtime

▪ Deployment: Local, Cluster, YARN, Tez, Embedded

Page 5: Apache Flink Stream Processing

Stream Processing?

▪ Real-world data is not produced in micro-batches; it originates as a continuous stream of events pushed through systems.

▪ Stream analysis today is mostly an extension of the batch paradigm.

▪ Recent frameworks such as Apache Flink and Confluent's platform are built to handle streaming data natively.

5

Web server → Kafka topic

Page 6: Apache Flink Stream Processing

Requirements for a Stream Processor

▪ Low latency: quick results (milliseconds)

▪ High throughput: able to handle millions of events/sec

▪ Exactly-once guarantees: deliver correct results even in failure scenarios

6

Page 7: Apache Flink Stream Processing

Fault Tolerance in Streaming

▪ At least once: all operators see all events
  ▪ Storm: re-processes the entire stream in failure scenarios

▪ Exactly once: operators do not perform duplicate updates to their state
  ▪ Flink: distributed snapshots
  ▪ Spark: micro-batches

7

Page 8: Apache Flink Stream Processing

Batch is an extension of Streaming

▪ Batch: process a bounded stream (DataSet) on a stream processor

▪ Form a Global Window over the entire DataSet for join or grouping operations

Page 9: Apache Flink Stream Processing

Flink Window Processing

9

Courtesy: Data Artisans

Page 10: Apache Flink Stream Processing

What is a Window?

▪ Grouping of elements into finite buckets
  ▪ by timestamps
  ▪ by record counts

▪ Windows have a maximum timestamp, which means that, at some point, all elements that need to be assigned to a window will have arrived.

10

Page 11: Apache Flink Stream Processing

Why Window?

▪ Process subsets of streams
  ▪ based on timestamps
  ▪ or by record counts

▪ Windows have a maximum timestamp, which means that, at some point, all elements that need to be assigned to a window will have arrived.

11

Page 12: Apache Flink Stream Processing

Different Window Schemes

▪ Global windows: all incoming elements are assigned to the same window

  stream.window(GlobalWindows.create());

▪ Tumbling time windows: elements are assigned to a window of a certain size (5 sec below) based on their timestamp; each element is assigned to exactly one window

  keyedStream.timeWindow(Time.of(5, TimeUnit.SECONDS));

▪ Sliding time windows: elements are assigned to a window of a certain size based on their timestamp; windows “slide” by the provided value and hence overlap

  stream.window(SlidingTimeWindows.of(Time.of(5, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)));

12
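To make time-based assignment concrete, here is a minimal sketch (plain Java, not Flink's API; the class and method names are illustrative) of how a timestamp maps to its tumbling window bucket:

```java
// Sketch only, not Flink's API: which tumbling window does a timestamp fall into?
public class WindowAssignment {

    // Start of the tumbling window of size `windowSizeMs` containing `timestamp`.
    static long windowStart(long timestamp, long windowSizeMs) {
        return timestamp - (timestamp % windowSizeMs);
    }

    public static void main(String[] args) {
        // With 5-second windows, an element at t = 7200 ms lands in the [5000, 10000) window.
        System.out.println(windowStart(7_200L, 5_000L)); // prints 5000
        // A sliding window of size 5s sliding by 1s would instead contain it in 5
        // overlapping windows, starting at 3000, 4000, 5000, 6000, and 7000 ms.
    }
}
```

This modulo arithmetic is why each element lands in exactly one tumbling window.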

Page 13: Apache Flink Stream Processing

Different Window Schemes

▪ Tumbling count windows: defines a window of 1000 elements that “tumbles”; elements are grouped in arrival order into groups of 1000, and each element belongs to exactly one window

  stream.countWindow(1000);

▪ Sliding count windows: defines a window of 1000 elements that slides every 100 elements; elements can belong to multiple windows

  stream.countWindow(1000, 100);

13
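The tumbling case can be sketched in a few lines (plain Java over a bounded list, not the Flink API), assuming, as the animations that follow show, that a window fires only once it is full:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: tumbling count windows over a bounded stream,
// i.e. consecutive, non-overlapping groups of `size` elements.
public class TumblingCountSketch {

    static <T> List<List<T>> tumbling(List<T> stream, int size) {
        List<List<T>> windows = new ArrayList<>();
        for (int i = 0; i + size <= stream.size(); i += size) {
            windows.add(new ArrayList<>(stream.subList(i, i + size)));
        }
        return windows;
    }

    public static void main(String[] args) {
        // With size = 3, element 7 stays buffered until a full window of 3 arrives.
        System.out.println(tumbling(Arrays.asList(1, 2, 3, 4, 5, 6, 7), 3));
        // prints [[1, 2, 3], [4, 5, 6]]
    }
}
```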

Page 14: Apache Flink Stream Processing

Tumbling Count Windows Animation

14

Courtesy: Data Artisans

Page 15: Apache Flink Stream Processing

Count Windows

15

Tumbling Count Window, Size = 3


Page 22: Apache Flink Stream Processing

Count Windows

22

Sliding Count Window, Size = 3, sliding every 2 elements
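The sliding frames can be reproduced with a small sketch (plain Java over a bounded list, not the Flink API) of a count window of size 3 that slides every 2 elements:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: a count window of `size` elements that slides every `slide`
// elements, so consecutive windows overlap (size = 3, slide = 2 below).
public class SlidingCountSketch {

    static <T> List<List<T>> sliding(List<T> stream, int size, int slide) {
        List<List<T>> windows = new ArrayList<>();
        for (int i = 0; i + size <= stream.size(); i += slide) {
            windows.add(new ArrayList<>(stream.subList(i, i + size)));
        }
        return windows;
    }

    public static void main(String[] args) {
        System.out.println(sliding(Arrays.asList(1, 2, 3, 4, 5, 6, 7), 3, 2));
        // prints [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
        // elements 3 and 5 each appear in two windows
    }
}
```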


Page 26: Apache Flink Stream Processing

Flink Streaming API

26

Page 27: Apache Flink Stream Processing

Flink DataStream API

27

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Create a DataStream from lines in a file
        DataStream<String> text = env.readTextFile("/path");

        DataStream<Tuple2<String, Integer>> counts = text
            .flatMap(new LineSplitter())
            .keyBy(0)   // group by the first element of the Tuple; DataStream -> KeyedStream
            .sum(1);

        counts.print();

        env.execute("Execute Streaming Word Counts"); // execute the WordCount job
    }

    // FlatMap implementation which converts each line to many <Word, 1> pairs
    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}

Source code - https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/StreamingWordCount.java

Page 28: Apache Flink Stream Processing

Streaming WordCount (Explained)

▪ Obtain a StreamExecutionEnvironment ▪ Connect to a DataSource ▪ Specify Transformations on the

DataStreams ▪ Specifying Output for the processed data ▪ Executing the program

28

Page 29: Apache Flink Stream Processing

Flink DataStream API

29

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Create a DataStream from lines in a file
        DataStream<String> text = env.readTextFile("/path");

        DataStream<Tuple2<String, Integer>> counts = text
            .flatMap(new LineSplitter())
            .keyBy(0)   // group by the first element of the Tuple; DataStream -> KeyedStream
            .sum(1);

        counts.print();

        env.execute("Execute Streaming Word Counts"); // execute the WordCount job
    }

    // FlatMap implementation which converts each line to many <Word, 1> pairs
    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}

Source code - https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/StreamingWordCount.java

Page 30: Apache Flink Stream Processing

Flink Window API

30

Page 31: Apache Flink Stream Processing

Keyed Windows (Grouped by Key)

31

public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Create a DataStream from lines in a file
        DataStream<String> text = env.readTextFile("/path");

        DataStream<Tuple2<String, Integer>> counts = text
            .flatMap(new LineSplitter())
            .keyBy(0)  // group by the first element of the Tuple
            // create a window of 'windowSize' records that slides by 'slideSize' records
            .countWindow(windowSize, slideSize)
            .sum(1);

        counts.print();

        env.execute("Execute Streaming Word Counts"); // execute the WordCount job
    }

    // FlatMap implementation which converts each line to many <Word, 1> pairs
    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}

https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java

Page 32: Apache Flink Stream Processing

Keyed Windows

32

public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Create a DataStream from lines in a file
        DataStream<String> text = env.readTextFile("/path");

        DataStream<Tuple2<String, Integer>> counts = text
            .flatMap(new LineSplitter())
            .keyBy(0)  // group by the first element of the Tuple
            // converts KeyedStream -> WindowedStream
            .timeWindow(Time.of(1, TimeUnit.SECONDS))
            .sum(1);

        counts.print();

        env.execute("Execute Streaming Word Counts"); // execute the WordCount job
    }

    // FlatMap implementation which converts each line to many <Word, 1> pairs
    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}

https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java

Page 33: Apache Flink Stream Processing

Global Windows

33

All incoming elements of a given key are assigned to the same window.

lines.flatMap(new LineSplitter())
    // group by the tuple field "0"
    .keyBy(0)
    // all records for a given key are assigned to the same window
    .window(GlobalWindows.create())
    // and sum up tuple field "1"
    .sum(1)
    // consider only word counts > 1
    .filter(new WordCountFilter())

Page 34: Apache Flink Stream Processing

Flink Streaming API (Tumbling Windows)

34

• All incoming elements are assigned to a window of a certain size based on their timestamp

• Each element is assigned to exactly one window

Page 35: Apache Flink Stream Processing

Flink Streaming API (Tumbling Window)

35

public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Create a DataStream from lines in a file
        DataStream<String> text = env.readTextFile("/path");

        DataStream<Tuple2<String, Integer>> counts = text
            .flatMap(new LineSplitter())
            .keyBy(0)  // group by the first element of the Tuple
            // tumbling window of 1 second
            .timeWindow(Time.of(1, TimeUnit.SECONDS))
            .sum(1);

        counts.print();

        env.execute("Execute Streaming Word Counts"); // execute the WordCount job
    }

    // FlatMap implementation which converts each line to many <Word, 1> pairs
    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}

https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WindowWordCount.java

Page 36: Apache Flink Stream Processing

Demos

36


Page 37: Apache Flink Stream Processing

Twitter + Flink Streaming

37

• Create a Flink DataStream from a live Twitter feed

• Split the stream into multiple DataStreams based on some criterion

• Persist the respective streams to storage

https://github.com/smarthi/DC-FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/twitter
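The split step can be sketched without the Twitter connector. This is a minimal illustration using plain Java streams rather than Flink's API; the "en:" language-prefix criterion and the record format are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch only: splitting one stream of records into two by a criterion,
// analogous to filtering a DataStream into multiple DataStreams.
public class SplitStreamSketch {

    // Partition records by a predicate; the "en:" prefix is a made-up criterion.
    static Map<Boolean, List<String>> splitByLanguage(List<String> tweets) {
        return tweets.stream()
                .collect(Collectors.partitioningBy(t -> t.startsWith("en:")));
    }

    public static void main(String[] args) {
        Map<Boolean, List<String>> split =
                splitByLanguage(Arrays.asList("en:hello", "de:hallo", "en:flink"));
        System.out.println(split.get(true));   // prints [en:hello, en:flink]
        System.out.println(split.get(false));  // prints [de:hallo]
    }
}
```

In the actual demo each sub-stream would then be written to its own sink instead of printed.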

Page 38: Apache Flink Stream Processing

Flink Event Processing: Animation

38

Courtesy: Ufuk Celebi and Stephan Ewen, Data Artisans

Page 39: Apache Flink Stream Processing

39

Tumbling Windows of 4 Seconds

[Animation frames: elements with various timestamps are assigned to 4-second buckets (0-3, 4-7, 8-11, 20-23, 24-27, 32-35)]

Page 40: Apache Flink Stream Processing

tl;dr

40

• Event-time processing is unique to Apache Flink

• Flink provides exactly-once guarantees

• With release 0.10.0, Flink supports streaming windows, sessions, triggers, multi-triggers, deltas, and event time

Page 41: Apache Flink Stream Processing

References

41

• Data Streaming Fault Tolerance in Flink (Apache Flink documentation)

• Lightweight Asynchronous Snapshots for Distributed Dataflows: http://arxiv.org/pdf/1506.08603.pdf

• The Google Dataflow paper

Page 42: Apache Flink Stream Processing

Acknowledgements

42

Thanks to following folks from Data Artisans for their help and feedback:

• Ufuk Celebi • Till Rohrmann • Stephan Ewen • Marton Balassi • Robert Metzger • Fabian Hueske • Kostas Tzoumas

Page 43: Apache Flink Stream Processing

Questions ???

43