K. Tzoumas & S. Ewen – Flink Forward Keynote

Preview:

Citation preview

Welcome to

The first conference on Apache Flink

Sponsored by

Some practical info §  Registration, cloakroom, and meals are in

Palais

§  Information point always staffed

§  WiFi is FlinkForward

§  Twitter hashtag is #ff15

§  Follow @FlinkForward

Some practical info

§  Need help? Look for a volunteer (pink badges)

§  All sessions are recorded and will be made available online

§  This includes the training sessions

3

Getting around

4

Please go around while talks are in progress

Our speaker organizations

5

Kostas Tzoumas and Stephan Ewen @kostas_tzoumas | @StephanEwen

Apache FlinkTM: From Incubation to Flink 1.0

7

1.  A bit of history

2.  The streaming era and Flink

3.  Inside Flink 0.10

4.  Towards Flink 1.0 and beyond

A bit of history From incubation until now

8

9

DataSet API (Java/Scala)

Flink core

Local Remote Yarn

Apr 2014 Jun 2015 Dec 2014

0.7 0.6 0.5 0.9 0.9-m1 0.10

Oct 2015

Top level

0.8

Gel

ly

Tabl

e

Flin

kML

SAM

OA

DataSet (Java/Scala/Python) DataStream (Java/Scala)

Hado

op M

/R

Flink dataflow engine

Local Remote Yarn Tez Embedded

Dat

aflow

Dat

aflow

Casc

adin

g

Tabl

e

Stor

m

Community growth

Flink is one of the largest and most active Apache big data projects with well over 120 contributors

10

Flink meetups around the globe

11

Featured in

12

The streaming era Welcome to

13

14

batch

event based

need new systems

well served

15

Streaming is the biggest change in data infrastructure since Hadoop

16

1.  Radically simplified infrastructure 2.  Internet of Things, on-demand services

3.  Can completely subsume batch

17

In a world of events and isolated apps, the stream processor is the backbone of the data infrastructure

App App

App

local view

local view local view

Consistent movement,

analytics

App App App

Global view Consistent store

18

§  Until now, stream processors were less mature than batch processors

§  This led to •  in-house solutions •  abuse of batch processors •  Lambda architectures

§  This is no longer the case

19

Flink 0.10 With the upcoming 0.10 release, Flink significantly surpasses the state of the art in open source stream processing systems.

And, we are heading to Flink 1.0 after that.

20

§  Streaming technology has matured •  e.g., Flink, Kafka, Dataflow

§  Flink and Dataflow duality •  a Google technology •  an open source Apache project

+

21

§  Streaming is happening

§  Better adapt now

§  Flink 0.10: a ready to use open source stream processor

Flink 0.10 Flink for the streaming era

22

Improved DataStream API §  Stream data analysis differs from batch data

analysis by introducing time

§  Streams are unbounded and produce data over time

§  Simple as batch API if handling time in a simple way

§  Powerful if you want to handle time in an advanced way (out-of-order records, preliminary results, etc)

23

Improved DataStream API

24

case  class  Event(location:  Location,  numVehicles:  Long)    

val  stream:  DataStream[Event]  =  …;    stream        .filter  {  evt  =>  isIntersection(evt.location)  }  

Improved DataStream API

25

case  class  Event(location:  Location,  numVehicles:  Long)    

val  stream:  DataStream[Event]  =  …;    stream        .filter  {  evt  =>  isIntersection(evt.location)  }          .keyBy("location")        .timeWindow(Time.of(15,  MINUTES),  Time.of(5,  MINUTES))        .sum("numVehicles")  

Improved DataStream API

26

case  class  Event(location:  Location,  numVehicles:  Long)    

val  stream:  DataStream[Event]  =  …;    stream        .filter  {  evt  =>  isIntersection(evt.location)  }          .keyBy("location")        .timeWindow(Time.of(15,  MINUTES),  Time.of(5,  MINUTES))        .trigger(new  Threshold(200))        .sum("numVehicles")  

Improved DataStream API

27

case  class  Event(location:  Location,  numVehicles:  Long)    

val  stream:  DataStream[Event]  =  …;    stream        .filter  {  evt  =>  isIntersection(evt.location)  }          .keyBy("location")        .timeWindow(Time.of(15,  MINUTES),  Time.of(5,  MINUTES))        .trigger(new  Threshold(200))        .sum("numVehicles")          .keyBy(  evt  =>  evt.location.grid  )        .mapWithState  {  (evt,  state:  Option[Model])  =>  {              val  model  =  state.orElse(new  Model())              (model.classify(evt),  Some(model.update(evt)))          }}  

IoT / Mobile Applications

28

Events occur on devices

Queue / Log

Events analyzed in a data streaming

system

Stream Analysis

Events stored in a log

IoT / Mobile Applications

29

IoT / Mobile Applications

30

IoT / Mobile Applications

31

IoT / Mobile Applications

32

Out of order !!!

First burst of events

Second burst of events

IoT / Mobile Applications

33

Event time windows

Arrival time windows

Instant event-at-a-time

Flink supports out of order time (event time) windows, arrival time windows (and mixtures) plus low latency processing.

First burst of events

Second burst of events

High Availability and Consistency

34

No Single-Point-Of-Failure any more

Exactly-once processing semantics across pipeline

Checkpoints/Fault Tolerance is decoupled from windows è Allows for highly flexible window implementations

ZooKeeper ensemble

Multiple Masters

failover

Performance

35

Continuous streaming

Latency-bound buffering

Distributed Snapshots

High Throughput & Low Latency

With configurable throughput/latency tradeoff

Batch and Streaming

36

case  class  WordCount(word:  String,  count:  Int)    

val  text:  DataStream[String]  =  …;    text      .flatMap  {  line  =>  line.split("  ")  }      .map  {  word  =>  new  WordCount(word,  1)  }      .keyBy("word")      .window(GlobalWindows.create())      .trigger(new  EOFTrigger())      .sum("count")  

Batch Word Count in the DataStream API

Batch and Streaming

37

Batch Word Count in the DataSet API

case  class  WordCount(word:  String,  count:  Int)    

val  text:  DataStream[String]  =  …;    text      .flatMap  {  line  =>  line.split("  ")  }      .map  {  word  =>  new  WordCount(word,  1)  }      .keyBy("word")      .window(GlobalWindows.create())      .trigger(new  EOFTrigger())      .sum("count")  

val  text:  DataSet[String]  =  …;    text      .flatMap  {  line  =>  line.split("  ")  }      .map  {  word  =>  new  WordCount(word,  1)  }      .groupBy("word")      .sum("count")      

Batch and Streaming

38

Pipelined and blocking operators Streaming Dataflow Runtime

Batch Parameters

DataSet DataStream

Relational Optimizer

Window Optimization

Pipelined and windowed operators

Schedule lazily Schedule eagerly

Recompute whole operators Periodic checkpoints

Streaming data movement

Stateful operations

DAG recovery Fully buffered streams DAG resource management

Streaming Parameters

Batch and Streaming

39

A full-fledged batch processor as well G

elly

Tabl

e

Flin

kML

SAM

OA

DataSet (Java/Scala/Python) DataStream (Java/Scala)

Hado

op M

/R

Flink dataflow engine

Local Remote Yarn Tez Embedded

Dat

aflow

Dat

aflow

Casc

adin

g

Tabl

e

Stor

m

Batch and Streaming

40

A full-fledged batch processor as well G

elly

Tabl

e

Flin

kML

SAM

OA

DataSet (Java/Scala/Python) DataStream (Java/Scala)

Hado

op M

/R

Flink dataflow engine

Local Remote Yarn Tez Embedded

Dat

aflow

Dat

aflow

Casc

adin

g

Tabl

e

Stor

m

More details at Dongwon Kim's Talk "A comparative performance evaluation of Flink"

Integration (picture not complete)

41

POSIX   Java/Scala Collections

POSIX  

Monitoring

42

Life system metrics and user-defined accumulators/statistics

Get  http://flink-­‐m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators  

Monitoring REST API for custom monitoring tools

{  "id":  "dceafe2df1f57a1206fcb907cb38ad97",  "user-­‐accumulators":  [      {  "name":"avglen",  "type":"DoubleCounter",  "value":"123.03259440000001"  },      {  "name":"genwords",  "type":"LongCounter",  "value":"75000000"  }  ]  }  

Flink 0.10 Summary §  Focus on operational readiness •  high availability •  monitoring •  integration with other systems

§  First-class support for event time

§  Refined DataStream API: easy and powerful

43

Towards Flink 1.0 and beyond Where we see the project going

44

Towards Flink 1.0

§  Flink 1.0 is around the corner

§  Focus on defining public APIs and automatic API compatibility checks

§  Guarantee backwards compatibility in all Flink 1.X versions

45

Beyond Flink 1.0 §  Flink engine has most features in place

§  Focus on usability features on top of DataStream API •  e.g., SQL, ML, more connectors

§  Continue work on elasticity and memory management

46

47  

Enjoy the rest of

The first conference on Apache Flink