
Real-time data processing with Apache Flink

Gyula Fóra
Flink committer
Swedish ICT

Stream processing

▪ Data stream: Infinite sequence of data arriving in a continuous fashion.

▪ Stream processing: Analyzing and acting on real-time streaming data, using continuous queries.

3 Parts of a Streaming Infrastructure

Gathering, Broker, Analysis

Example sources: Sensors, Transaction logs, Server Logs, …

Streaming landscape

Apache Storm
• True streaming over distributed dataflow
• Low level API (Bolts, Spouts) + Trident

Spark Streaming
• Stream processing emulated on top of batch system (non-native)
• Functional API (DStreams), restricted by batch runtime

Apache Samza
• True streaming built on top of Apache Kafka, state is first class citizen
• Slightly different stream notion, low level API

Apache Flink
• True streaming over stateful distributed dataflow
• Rich functional API exploiting streaming runtime; e.g. rich windowing semantics

What is Flink

[Figure: the Flink stack]
Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, MRQL, Cascading (WiP)
APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Remote, Yarn, Tez, Embedded

Program compilation

case class Path (from: Long, to: Long)

val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") {
        (path, edge) => Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
}
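
To see what the iteration above computes, the same logic can be mimicked on plain Scala collections. This is only a sketch under assumptions: it uses `Set` instead of Flink's `DataSet`, and the names `TransitiveClosureSketch`, `step` and `closure` are hypothetical, not part of any Flink API.

```scala
// Sketch only: plain Scala sets stand in for Flink's DataSet.
object TransitiveClosureSketch {
  final case class Path(from: Long, to: Long)

  // One iteration: join paths with edges on path.to == edge.from,
  // then union with the existing paths (a Set is already distinct).
  def step(paths: Set[Path], edges: Set[Path]): Set[Path] = {
    val extended = for {
      path <- paths
      edge <- edges
      if path.to == edge.from
    } yield Path(path.from, edge.to)
    extended ++ paths
  }

  // Apply the step a fixed number of times, starting from the edges.
  def closure(edges: Set[Path], maxIterations: Int): Set[Path] =
    (1 to maxIterations).foldLeft(edges)((paths, _) => step(paths, edges))
}
```

For the edges 1→2, 2→3 and 3→4, `closure(edges, 10)` additionally contains 1→3, 2→4 and 1→4.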

[Figure: program compilation pipeline. Program → Pre-flight (Client): type extraction stack, Optimizer → Dataflow Graph → Master: task scheduling, dataflow metadata, deploy operators, track intermediate results → Workers. Example dataflow graph: DataSource orders.tbl → Filter → Map and DataSource lineitem.tbl, each hash-part [0], into a Hybrid Hash Join (build HT, probe), followed by GroupRed (sort, forward).]

Flink Streaming

What is Flink Streaming

Native stream processor (low-latency)

Expressive functional API

Flexible operator state, stream windows

Exactly-once processing guarantees

Native vs non-native streaming

Non-native streaming: stream discretizer → Job, Job, Job, Job

while (true) {
  // get next few records
  // issue batch computation
}

Native streaming: long-standing operators

while (true) {
  // process next record
}
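
The two loops can be contrasted with a small sketch on plain Scala iterators. It is an illustration under assumptions, not how either system is implemented: the object and method names are hypothetical, and a running sum stands in for an arbitrary computation.

```scala
// Sketch only: contrasts the two processing models on an in-memory iterator.
object StreamingModelsSketch {
  // Non-native: discretize the stream into small batches and run one
  // batch computation per group; a result appears only per batch.
  def microBatchSums(records: Iterator[Int], batchSize: Int): List[Int] =
    records.grouped(batchSize).map(_.sum).toList

  // Native: a long-standing operator updates its state for every record
  // and can emit an updated result immediately.
  def perRecordRunningSums(records: Iterator[Int]): List[Int] =
    records.scanLeft(0)(_ + _).drop(1).toList
}
```

With the records 1, 2, 3, 4, 5 and a batch size of 2, the micro-batch variant emits 3, 7, 5, while the per-record variant emits 1, 3, 6, 10, 15.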

Pipelined stream processor

Streaming Shuffle!

Defining windows in Flink

Trigger policy
• When to trigger the computation on current window

Eviction policy
• When data points should leave the window
• Defines window width/size

E.g., count-based policy
• evict when #elements > n
• start a new window every n-th element

Built-in: Count, Time, Delta policies
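
The interplay of the two policies can be sketched in plain Scala for the count-based case. This is a simplified model, not Flink's implementation: `countWindows`, `size` and `slide` are hypothetical names, with `size` playing the role of the eviction count n and `slide` the trigger interval.

```scala
// Sketch only: count-based eviction and trigger policies over a finite sequence.
object CountWindowSketch {
  def countWindows[A](elements: Seq[A], size: Int, slide: Int): List[Vector[A]] = {
    val start = (Vector.empty[A], List.empty[Vector[A]])
    val (_, emitted) = elements.zipWithIndex.foldLeft(start) {
      case ((buffer, out), (element, index)) =>
        // Eviction policy: keep at most `size` elements in the window buffer.
        val window = (buffer :+ element).takeRight(size)
        // Trigger policy: emit the current window on every `slide`-th element.
        val triggered = if ((index + 1) % slide == 0) window :: out else out
        (window, triggered)
    }
    emitted.reverse
  }
}
```

For the elements 1 to 6 with `size = 3` and `slide = 2`, the emitted windows are (1, 2), (2, 3, 4) and (4, 5, 6).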

Expressive APIs

case class Word (word: String, frequency: Int)

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataStream API

Overview of the API

Data stream sources
• File system
• Message queue connectors
• Arbitrary source functionality

Stream transformations
• Basic transformations: Map, Reduce, Filter, Aggregations…
• Binary stream transformations: CoMap, CoReduce…
• Windowing semantics: Policy based flexible windowing (Time, Count, Delta…)
• Temporal binary stream operators: Joins, Crosses…
• Native support for iterations

Data stream outputs

For the details please refer to the programming guide:
• http://flink.apache.org/docs/latest/streaming_guide.html

[Figure: example dataflow built from the operators Src, Src, Map, Filter, Merge, Reduce, Sum, Sink]

Use-case: Financial analytics

Reading from multiple inputs

• Merge stock data from various sources

Window aggregations

• Compute simple statistics over windows

Data driven windows

• Define arbitrary windowing semantics

Combine with sentiment analysis

• Enrich your analytics with social media feeds (Twitter)

Streaming joins

• Join multiple data streams

Detailed explanation and source code on our blog

• http://flink.apache.org/news/2015/02/09/streaming-example.html

Reading from multiple inputs

case class StockPrice(symbol : String, price : Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// (1) Read lines from a socket and map them to StockPrice objects
val socketStockStream = env.socketTextStream("localhost", 9999)
  .map(x => { val split = x.split(",")
    StockPrice(split(0), split(1).toDouble) })

// (2), (3) Generate further stock streams
val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)

// (4) Merge all stock streams together
val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream)

Example records:
(1) socket input: "HDP, 23.8" "HDP, 26.6"
(2) SPX stream: StockPrice(SPX, 2113.9)
(3) FTSE stream: StockPrice(FTSE, 6931.7)
(4) merged stream: StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 23.8) StockPrice(HDP, 26.6)

Window aggregations

val windowedStream = stockStream
  .window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))

val lowest = windowedStream.minBy("price")

val maxByStock = windowedStream.groupBy("symbol").maxBy("price")

val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)
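
What the three aggregations produce for the contents of a single window can be sketched on plain Scala collections. The helper names below are hypothetical and the sketch ignores the sliding of the window; the slide does not show how `mean` is defined, so the arithmetic mean used here is an assumption.

```scala
// Sketch only: aggregations over the contents of one window.
object WindowAggregationSketch {
  final case class StockPrice(symbol: String, price: Double)

  // Lowest price in the whole window.
  def lowest(window: Seq[StockPrice]): StockPrice =
    window.minBy(_.price)

  // Highest price per stock symbol.
  def maxByStock(window: Seq[StockPrice]): Map[String, StockPrice] =
    window.groupBy(_.symbol).map { case (symbol, prices) => symbol -> prices.maxBy(_.price) }

  // Arithmetic mean per stock symbol (assumed meaning of `mean`).
  def meanByStock(window: Seq[StockPrice]): Map[String, Double] =
    window.groupBy(_.symbol).map { case (symbol, prices) =>
      symbol -> prices.map(_.price).sum / prices.size
    }
}
```

For a window holding SPX at 2113.9, FTSE at 6931.7 and HDP at 23.8 and 26.6, the lowest price is HDP at 23.8, the maximum for HDP is 26.6, and the HDP mean is approximately 25.2.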

Example records:
(1) windowedStream: StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 23.8) StockPrice(HDP, 26.6)
(2) lowest: StockPrice(HDP, 23.8)
(3) maxByStock: StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 26.6)
(4) rollingMean: StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 25.2)

Data-driven windows

case class Count(symbol : String, count : Int)

val priceWarnings = stockStream.groupBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

val warningsPerStock = priceWarnings.map(Count(_, 1))
  .groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
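
The idea behind a delta-based trigger can be sketched in plain Scala. The slide does not define `priceChange` or `defaultPrice`, so the relative-change function below is an assumption, as is the choice to measure each record against the last record that fired the trigger. All names are hypothetical and this is not Flink's implementation.

```scala
// Sketch only: a data-driven (delta) trigger over a finite sequence of prices.
object DeltaTriggerSketch {
  final case class StockPrice(symbol: String, price: Double)

  // Assumed delta function: relative price change between two records.
  def priceChange(previous: StockPrice, current: StockPrice): Double =
    math.abs(current.price / previous.price - 1.0)

  // Returns the records at which the trigger fires. The reference point
  // starts at `initial` and moves to every record that fires the trigger.
  def triggers(prices: Seq[StockPrice], threshold: Double, initial: StockPrice): List[StockPrice] = {
    val (_, fired) = prices.foldLeft((initial, List.empty[StockPrice])) {
      case ((reference, out), current) =>
        if (priceChange(reference, current) > threshold) (current, current :: out)
        else (reference, out)
    }
    fired.reverse
  }
}
```

With a threshold of 0.05 and HDP prices 23.8, 24.0, 26.6, 26.7 (starting reference 23.8), only the record at 26.6 fires, since it is the first to differ from the reference by more than 5%.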

Example records:
Input: StockPrice(SPX, 2113.9) StockPrice(FTSE, 6931.7) StockPrice(HDP, 23.8) StockPrice(HDP, 26.6)
Delta window: StockPrice(HDP, 23.8) StockPrice(HDP, 26.6)
Warning count: Count(HDP, 1)

Combining with a Twitter stream

val tweetStream = env.addSource(generateTweets _)

val mentionedSymbols = tweetStream.flatMap(tweet => tweet.split(" "))
  .map(_.toUpperCase())
  .filter(symbols.contains(_))

val tweetsPerStock = mentionedSymbols.map(Count(_, 1))
  .groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
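
The same tokenize, normalize, filter and count steps can be tried on plain Scala collections for the tweets of a single window. This is a sketch with hypothetical names; the set of known symbols is passed in explicitly because the slide does not show how `symbols` is defined.

```scala
// Sketch only: counts symbol mentions in the tweets of one window.
object MentionCountSketch {
  def mentionsPerSymbol(tweets: Seq[String], symbols: Set[String]): Map[String, Int] =
    tweets
      .flatMap(_.split(" ").toSeq)   // split each tweet into words
      .map(_.toUpperCase)            // normalize, so "hdp" matches "HDP"
      .filter(symbols.contains)      // keep only known stock symbols
      .groupBy(identity)
      .map { case (symbol, mentions) => symbol -> mentions.size }
}
```

For the tweets "hdp is on the rise!" and "I wish I bought more YHOO and HDP stocks" with the symbols HDP, YHOO and SPX, the result maps HDP to 2 and YHOO to 1.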

Example records:
Tweets: "hdp is on the rise!" "I wish I bought more YHOO and HDP stocks"
tweetsPerStock: Count(HDP, 2) Count(YHOO, 1)

Streaming joins

val tweetsAndWarning = warningsPerStock.join(tweetsPerStock)
  .onWindow(30, SECONDS)
  .where("symbol")
  .equalTo("symbol"){ (c1, c2) => (c1.count, c2.count) }

val rollingCorrelation = tweetsAndWarning
  .window(Time.of(30, SECONDS))
  .mapWindow(computeCorrelation _)
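
The join step can be sketched for the contents of one window using plain Scala collections. The names are hypothetical, and the sketch covers only the equi-join on `symbol`; `computeCorrelation` is not defined on the slide and is left out.

```scala
// Sketch only: equi-join of two windows of counts on the stock symbol.
object WindowJoinSketch {
  final case class Count(symbol: String, count: Int)

  // Pair the counts of records whose symbols match.
  def joinOnSymbol(warnings: Seq[Count], tweets: Seq[Count]): Seq[(Int, Int)] =
    for {
      warning <- warnings
      tweet <- tweets
      if warning.symbol == tweet.symbol
    } yield (warning.count, tweet.count)
}
```

Joining the warnings Count(HDP, 1) with the tweet counts Count(HDP, 2) and Count(YHOO, 1) yields the single pair (1, 2); YHOO has no matching warning and is dropped.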

Example records:
tweetsPerStock: Count(HDP, 2) Count(YHOO, 1)
warningsPerStock: Count(HDP, 1)
tweetsAndWarning: (1,2)
rollingCorrelation: 0.5

Performance

▪ Performance optimizations
• Effective serialization due to strongly typed topologies
• Operator chaining (thread sharing/no serialization)
• Different automatic query optimizations

▪ Competitive performance
• ~ 1.5m events / sec / core
• As a comparison Storm promises ~ 1m tuples / sec / node

Fault tolerance

Overview

▪ Fault tolerance in other systems
• Message tracking/acks (Apache Storm)
• RDD lineage tracking/recomputation

▪ Fault tolerance in Apache Flink
• Based on consistent global snapshots
• Algorithm inspired by Chandy-Lamport
• Low runtime overhead, stateful exactly-once semantics

Checkpointing / Recovery

Asynchronous Barrier Snapshotting for globally consistent checkpoints

Pushes checkpoint barriers through the data flow

[Figure: a barrier travelling through a data stream. Before barrier = part of the snapshot. After barrier = not in snapshot (backup till next snapshot). Operator states shown: operator checkpoint starting, checkpoint in progress, checkpoint done.]
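
The role of the barrier can be illustrated with a strongly simplified sketch: one operator with a single input keeps a running sum and records a copy of its state whenever a barrier arrives. The names are hypothetical, and the sketch omits what makes the real mechanism distributed and asynchronous, such as multiple inputs, forwarding barriers downstream and writing state to a backend.

```scala
// Sketch only: a single-input operator snapshotting its state at barriers.
object BarrierSnapshotSketch {
  sealed trait Event
  final case class Record(value: Int) extends Event
  final case class Barrier(checkpointId: Long) extends Event

  // Returns the final state and the state captured for each checkpoint.
  def run(events: Seq[Event]): (Int, Map[Long, Int]) =
    events.foldLeft((0, Map.empty[Long, Int])) {
      // A record before the barrier updates the state that will be snapshotted.
      case ((sum, snapshots), Record(value)) => (sum + value, snapshots)
      // The barrier captures the state; later records are not part of it.
      case ((sum, snapshots), Barrier(id)) => (sum, snapshots + (id -> sum))
    }
}
```

For the events Record(1), Record(2), Barrier(1), Record(3), the snapshot for checkpoint 1 holds the sum 3, while the final state is 6: the record after the barrier is not in the snapshot.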

State management

State declared in the operators is managed and checkpointed by Flink

Pluggable backends for storing persistent snapshots
• Currently: JobManager, FileSystem (HDFS, Tachyon)

State partitioning and flexible scaling in the future

Closing

Streaming roadmap for 2015

State management
• New backends for state snapshotting
• Support for state partitioning and incremental snapshots
• Master Failover

Improved monitoring

Integration with other Apache projects
• SAMOA, Zeppelin, Ignite

Streaming machine learning and other new libraries

flink.apache.org

@ApacheFlink