50
Apache Gearpump next-gen streaming engine Manu Zhang, [email protected] Sean Zhong, [email protected]

Apache Gearpump next-gen streaming engine

Embed Size (px)

Citation preview

Page 1: Apache Gearpump next-gen streaming engine

Apache Gearpump next-gen streaming engineManu Zhang, [email protected] Zhong, [email protected]

Page 2: Apache Gearpump next-gen streaming engine

What is Gearpump ?

● An Akka[2] based real-time streaming engine ● An Apache Incubator[1] project since Mar.8th, 2016● A super simple pump that consists of only two gears

but very powerful at streaming water

Page 3: Apache Gearpump next-gen streaming engine

Gearpump Architecture

● Actor concurrency● Message passing

communication● error handling and

isolation with supervision hierarchy

● Master HA with Akka Cluster

Page 4: Apache Gearpump next-gen streaming engine

Why Gearpump ?

Page 5: Apache Gearpump next-gen streaming engine

Stream processing is hard but no system meets all requirements

● Infinite data● Out-of-order data● Low latency assurance (e.g real-time recommendation)● Correctness requirement (e.g. charge advertisers for ads) ● Cheap to update user logics

Page 6: Apache Gearpump next-gen streaming engine

How Gearpump solves the hard parts

● User interface● Flow control● Out-of-order processing● Exactly once ● Dynamic DAG

Page 7: Apache Gearpump next-gen streaming engine

How Gearpump solves the hard parts

● User Interface● Flow control● Out-of-order processing● Exactly Once ● Dynamic DAG

Page 8: Apache Gearpump next-gen streaming engine

User Interface - DAG API

Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~> HashPartitioner ~> AnomalyDetector, AnomalyDetector ~> ShufflePartitioner ~> KafkaSink)

Page 9: Apache Gearpump next-gen streaming engine

partitioner

User Interface - Partitioner

Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~ HashPartitioner ~> AnomalyDetector, AnomalyDetector ~ ShufflePartitioner ~> KafkaSink)

WindowCounter(parallelism = 3)

ModelCalculator(parallelism = 2)

Task

Task

Task

Task

Task

Page 10: Apache Gearpump next-gen streaming engine

User Interface - Compose DAG

Page 11: Apache Gearpump next-gen streaming engine

How Gearpump solves the hard parts

● User interface● Flow control● Out-of-order processing● Exactly once ● Dynamic DAG

Page 12: Apache Gearpump next-gen streaming engine

Flow Control

1. Words

3. W

indow

Coun

ts

2. Window Counts

4. Models

5. Anomaly words

KafkaSource WindowCounter

ModelCalculator

AnomalyDetector

KafkaSink

Page 13: Apache Gearpump next-gen streaming engine

Without Flow Control - messages queue up

KafkaSource WindowCounter ModelCalculator

fast fast very slow

Page 14: Apache Gearpump next-gen streaming engine

Without Flow Control - OOM

KafkaSource WindowCounter ModelCalculator

fast fast very slow

Page 15: Apache Gearpump next-gen streaming engine

With Flow Control - Backpressure

KafkaSource WindowCounter ModelCalculator

fast slow down very slowbackpressure

Page 16: Apache Gearpump next-gen streaming engine

With Flow Control - Backpressure

KafkaSource WindowCounter ModelCalculator

Slow down Slow down Very Slowbackpressurebackpressure

pull slower

Page 17: Apache Gearpump next-gen streaming engine

How Gearpump solves the hard parts

● User Interface● Flow control● Out-of-order processing● Exactly Once ● Dynamic DAG

Page 18: Apache Gearpump next-gen streaming engine

Out-of-order data

Event time321

Proc

essi

ng t

ime

● Event time - when data generated● Processing time - when data processed

1

2

3

4

56

Page 19: Apache Gearpump next-gen streaming engine

Window Count

1. Words

3. W

indow

Coun

ts

2. Window Counts

4. Models

5. Anomaly words

KafkaSource WindowCounter

ModelCalculator

AnomalyDetector

KafkaSink

Page 20: Apache Gearpump next-gen streaming engine

When can window counts be emitted ?

window messages count

[0, 2) 1

[2, 4)2

[4, 6)2

(“gearpump”, 1)

WindowCounter In-memory Table

● No window can be emitted since message as early as time 1 has not arrived

(“gearpump”, 1)

(“gearpump”, 3)

(“gearpump”, 5)

(“gearpump”, 4)

(“gearpump”, 2)

WindowCounter

Page 21: Apache Gearpump next-gen streaming engine

On Watermark[4]

Event time321

Proc

essi

ng t

ime

1

2

3

4

56 watermark

● No timestamp earlier than watermark will be seen

Page 22: Apache Gearpump next-gen streaming engine

Out-of-order processing with watermark

window messages count

[0, 2) 1

[2, 4)2

[4, 6)2

(“gearpump”, 1)

WindowCounter In-memory Table

(“gearpump”, 1)

(“gearpump”, 3)

(“gearpump”, 5)

(“gearpump”, 4)

(“gearpump”, 2)

WindowCounter

watermark = 0,No window can be emitted

Page 23: Apache Gearpump next-gen streaming engine

Out-of-order processing with watermark

window messages count

[0, 2) 1

[2, 4)2

[4, 6)2

(“gearpump”, [

0, 2), 1

)

WindowCounter In-memory Table

(“gearpump”, 1)

(“gearpump”, 3)

(“gearpump”, 5)

(“gearpump”, 4)

(“gearpump”, 2)

WindowCounter

watermark = 2,Window [0, 2) can be emitted

(“gearpump”, [0, 2), 1)

Page 24: Apache Gearpump next-gen streaming engine

How to get watermark ?

1. Words

3. W

indow

Coun

ts

2. Window Counts

4. Models

5. Anomaly words

KafkaSource WindowCounter

ModelCalculator

AnomalyDetector

KafkaSink

Page 25: Apache Gearpump next-gen streaming engine

From upstream

1. Words

3. W

indow

Coun

ts

2. Window Counts

4. Models

5. Anomaly words

KafkaSource WindowCounter

ModelCalculator

AnomalyDetector

KafkaSink

watermark

Page 26: Apache Gearpump next-gen streaming engine

More on Watermark

● Source watermark defined by user● Usually heuristic based ● Users decide whether to drop data arriving after

watermark

Page 27: Apache Gearpump next-gen streaming engine

How Gearpump solves the hard parts

● User Interface● Flow control● Out-of-order processing● Exactly once ● Dynamic DAG

Page 28: Apache Gearpump next-gen streaming engine

Exactly Once - no Lost or Duplicated States

● Lost states can cause anomaly words skipped

● Duplicated states can cause false alarm

States

4. Models

5. Anomaly words

KafkaSource WindowCounter

ModelCalculator

AnomalyDetector

KafkaSink

kafka_offset window_counts

models

Page 29: Apache Gearpump next-gen streaming engine

Exactly Once with asynchronous checkpointing

(2, kafka_offset)

KafkaSource WindowCounter ModelCalculator

Watermark = 2 Watermark = 0 Watermark = 0

(2, kafka_offset)

Page 30: Apache Gearpump next-gen streaming engine

(2, kafka_offset)(2, window_counts)

Exactly Once with asynchronous checkpointing

KafkaSource WindowCounter ModelCalculator

Watermark = 2 Watermark = 2 Watermark = 0

(2, window_counts)

Page 31: Apache Gearpump next-gen streaming engine

(2, kafka_offset)(2, window_counts)

(2, models)

Exactly Once with asynchronous checkpointing

KafkaSource WindowCounter ModelCalculator

Watermark = 2 Watermark = 2 Watermark = 2

(2, models)

Page 32: Apache Gearpump next-gen streaming engine

(2, kafka_offset)(2, window_counts)

(2, models)

Exactly Once with asynchronous checkpointing

KafkaSource WindowCounter ModelCalculator

Watermark = 2 Watermark = 2 Watermark = 2

(2, models)

Page 33: Apache Gearpump next-gen streaming engine

(2, kafka_offset)(2, window_counts)

(2, models)

Crash

KafkaSource WindowCounter ModelCalculator

Watermark = 2 Watermark = 2 Watermark = 2

(2, models)

Page 34: Apache Gearpump next-gen streaming engine

(2, kafka_offset)(2, window_counts)

(2, models)

Recover to latest checkpoint at 2

KafkaSource WindowCounter ModelCalculator

modelswindow_countskafka_offset

Replay from kafka

Get state at 2

Page 35: Apache Gearpump next-gen streaming engine

More on Exactly Once

● User extends PersistentState interface and system checkpoints state for users automatically

● Watermark is persisted on master

Page 36: Apache Gearpump next-gen streaming engine

How Gearpump solves the hard parts

● User Interface● Flow control● Out-of-order processing● Exactly Once ● Dynamic DAG

Page 37: Apache Gearpump next-gen streaming engine

Can we update model algorithm in a cheap way ?

1. Words

3. W

indow

Coun

ts

2. Window Counts

4. Models

5. Anomaly words

KafkaSource WindowCounter

ModelCalculator

AnomalyDetector

KafkaSinkThe model algorithm doesn’t perform well

Page 38: Apache Gearpump next-gen streaming engine

Dynamic DAG - modify application at runtime

Page 39: Apache Gearpump next-gen streaming engine

Performance

Page 40: Apache Gearpump next-gen streaming engine

Yahoo! Streaming Benchmark[5]

Page 41: Apache Gearpump next-gen streaming engine

Yahoo! Streaming Benchmark

● 4 node cluster

● Each node has

32-core, 64 GB

memory

● 10GB Ethernet

Page 42: Apache Gearpump next-gen streaming engine

Advanced features

Page 43: Apache Gearpump next-gen streaming engine

DAG Visualization

● Watermark● Node size reflects throughput● Edge width represents flow rate

Page 44: Apache Gearpump next-gen streaming engine

DAG Visualization

● Data skew analysis

Page 45: Apache Gearpump next-gen streaming engine

Apache Beam[6] Gearpump Runner

Beam Model: Fn Runners

Apache Flink

Apache Spark

Beam Model: Pipeline Construction

OtherLanguagesBeam Java

Beam Python

Execution Execution

Cloud Dataflow

Execution

Apache Gearpump

Page 46: Apache Gearpump next-gen streaming engine

Experimental features

● Web UI Authorization / OAuth2 Authentication● CGroup Resource Isolation● Binary Storm compatibility● Akka Streams integration

Page 47: Apache Gearpump next-gen streaming engine

Summary

● Gearpump is good at streaming infinite out-of-order data and guarantees correctness

● Gearpump helps users to easily program streaming applications, get runtime information and update dynamically

Page 48: Apache Gearpump next-gen streaming engine

References

1. gearpump.apache.org2. akka.io3. http://www.slideshare.net/SeanZhong/strata-singapore-gearpumpreal-ti

me-dagprocessing-with-akka-at-scale4. https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akid

au.pdf5. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streami

ng-computation-engines-at6. Apache Beam [project overview]

Page 49: Apache Gearpump next-gen streaming engine

Backup Slides

Page 50: Apache Gearpump next-gen streaming engine

What is Akka

● Message driven● Shared nothing● Location transparent● Supervision based

failure management● High performance

Actor

ActorActor

Actor Actor

spawn