Unified Processing with Apache Beam · Cloud+Data NEXTCon 2017 · 16/09/2017

Unified Processing with Apache Beam

Cloud+Data NEXTCon 2017

Hello! I am Sourabh

I am a Software Engineer

I tweet at @sb2nov

What is Apache Beam?

Apache Beam is a unified programming model for expressing efficient and portable data processing pipelines.

Big Data

https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg

LAUNCH!!

DATA CAN BE BIG

… REALLY BIG …

[Figure: timeline labeled Tuesday, Wednesday, Thursday]

UNBOUNDED, DELAYED, OUT OF ORDER

[Figure: time axis from 8:00 to 14:00, with several elements labeled 8:00]

ORGANIZING THE STREAM

[Figure: elements labeled 8:00]

DATA PROCESSING TRADEOFFS

Completeness · Latency · Cost ($$$)

WHAT IS IMPORTANT?

[Figure: Completeness, Low Latency, and Low Cost ($$$), each placed on a scale from Important to Not Important]

MONTHLY BILLING

[Figure: Completeness, Low Latency, and Low Cost ($$$), each placed on a scale from Important to Not Important]

BILLING ESTIMATE

[Figure: Completeness, Low Latency, and Low Cost ($$$), each placed on a scale from Important to Not Important]

FRAUD DETECTION

[Figure: Completeness, Low Latency, and Low Cost ($$$), each placed on a scale from Important to Not Important]

Beam Model

GENERATIONS BEYOND MAP-REDUCE

Clearly separates event time from processing time

Improved abstractions let you focus on your application logic

Batch and stream processing are both first-class citizens

Pipeline

PTransform

PCollection (bounded or unbounded)

EVENT TIME VS PROCESSING TIME

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

Often heuristic-based.

Too slow? Results are delayed.

Too fast? Some data is late.
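The too-slow / too-fast trade-off can be illustrated with a small plain-Python sketch. This is not Beam API. It assumes a hypothetical heuristic in which the watermark is the largest event time seen so far minus a fixed `allowed_skew`; the function name and the sample timestamps are made up for illustration.

```python
def find_late_events(event_times, allowed_skew):
    """Return the event times that arrive behind a heuristic watermark.

    event_times: event timestamps (seconds), listed in arrival order.
    allowed_skew: hypothetical slack subtracted from the max event time seen.
    """
    watermark = float("-inf")
    late = []
    for event_time in event_times:
        if event_time < watermark:
            late.append(event_time)  # arrived after the watermark passed it
        # Heuristic: assume nothing older than (max seen - skew) is in flight.
        watermark = max(watermark, event_time - allowed_skew)
    return late


arrivals = [10, 12, 11, 30, 14, 31, 25]
fast = find_late_events(arrivals, allowed_skew=0)   # aggressive watermark
slow = find_late_events(arrivals, allowed_skew=20)  # conservative watermark
```

With no skew the watermark advances quickly and three elements are classified as late. With a 20-second skew nothing is late, but any result that waits for the watermark is held back by those 20 seconds.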

ASKING THE RIGHT QUESTIONS

What is being computed?

Where in event time?

When in processing time?

How do refinements happen?

WHAT IS BEING COMPUTED?

scores: PCollection[KV[str, int]] = (input | beam.CombinePerKey(sum))

[Figure panels: Element-Wise, Aggregating, Composite]

WHERE IN EVENT TIME?

scores: PCollection[KV[str, int]] = (input | beam.WindowInto(FixedWindows(2 * 60)) | beam.CombinePerKey(sum))

The choice of windowing is retained through subsequent aggregations.
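As a rough illustration of what fixed windowing followed by a per-key sum computes, here is a plain-Python model. It is not Beam code. The helper names, the keys, and the sample events are hypothetical; only the 2-minute window size comes from the slide.

```python
from collections import defaultdict

WINDOW_SIZE = 2 * 60  # seconds, matching FixedWindows(2 * 60)


def window_start(event_time, size=WINDOW_SIZE):
    """Start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def sum_per_key_and_window(events):
    """events: iterable of (key, value, event_time_seconds) tuples."""
    totals = defaultdict(int)
    for key, value, event_time in events:
        # The window is part of the grouping key, so each window gets its own sum.
        totals[(key, window_start(event_time))] += value
    return dict(totals)


events = [
    ("alice", 5, 10),    # window [0, 120)
    ("alice", 3, 119),   # window [0, 120)
    ("alice", 4, 120),   # window [120, 240)
    ("bob", 7, 130),     # window [120, 240)
]
scores = sum_per_key_and_window(events)
```

Each element is assigned to a window by its event time, not by when it arrives, and the aggregation then runs once per key and window.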

WHEN IN PROCESSING TIME?

scores: PCollection[KV[str, int]] = (input
    | beam.WindowInto(FixedWindows(2 * 60), trigger=trigger.AfterWatermark())
    | beam.CombinePerKey(sum))

Triggers control when results are emitted.

Triggers are often relative to the watermark.
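A watermark-relative trigger can be pictured with the following plain-Python sketch, which is a simplified model and not Beam's trigger implementation. The function name and the sample watermark values are hypothetical. A window fires at the first processing step where the watermark has reached the end of that window.

```python
def fire_on_watermark(window_ends, watermarks):
    """Return (processing_step, window_end) pairs, one per window firing.

    window_ends: end timestamps of the open windows.
    watermarks: watermark value observed at each processing step.
    """
    pending = sorted(window_ends)
    fired = []
    for step, watermark in enumerate(watermarks):
        # Emit every window whose end the watermark has now passed.
        while pending and watermark >= pending[0]:
            fired.append((step, pending.pop(0)))
    return fired


firings = fire_on_watermark(
    window_ends=[120, 240],
    watermarks=[60, 110, 125, 130, 250],
)
```

The first window is emitted at step 2 and the second at step 4, so the emission time depends on how fast the watermark advances rather than on wall-clock time alone.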

HOW DO REFINEMENTS HAPPEN?

scores: PCollection[KV[str, int]] = (input
    | beam.WindowInto(FixedWindows(2 * 60),
                      trigger=trigger.AfterWatermark(
                          early=trigger.AfterProcessingTime(1 * 60),
                          late=trigger.AfterCount(1)),
                      accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
    | beam.CombinePerKey(sum))
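To make the effect of the accumulation mode concrete, the sketch below models the panes emitted for a single window in plain Python. It is an illustration under simplifying assumptions, not Beam code: `emit_panes` and the sample values are hypothetical, and the "discarding" branch stands for the alternative mode in which each firing reports only new data.

```python
def emit_panes(pane_inputs, accumulating):
    """Return the sum emitted by each trigger firing for a single window.

    pane_inputs: list of lists; each inner list holds the values that arrived
        since the previous firing (early, on-time, then late firings).
    accumulating: if True each pane includes everything seen so far;
        if False (discarding) each pane covers only the new values.
    """
    emitted = []
    running = 0
    for new_values in pane_inputs:
        pane_sum = sum(new_values)
        running += pane_sum
        emitted.append(running if accumulating else pane_sum)
    return emitted


panes = [[5, 3], [4], [9]]  # early firing, on-time firing, one late element
accumulated = emit_panes(panes, accumulating=True)
discarded = emit_panes(panes, accumulating=False)
```

In accumulating mode each later pane refines the previous result, so a consumer can overwrite the old value. In discarding mode the consumer has to add the panes together itself.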

CUSTOMIZING WHAT WHERE WHEN HOW

Classic Batch

Windowed Batch

Streaming

Streaming + Accumulation

For more information see https://cloud.google.com/dataflow/examples/gaming-example

Python SDK

SIMPLE PIPELINE

with beam.Pipeline() as p:

Pipeline construction is deferred.

    lines = p | beam.io.ReadFromText('/path/to/files')

lines is a PCollection, a deferred collection of all lines in the specified files.

    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))

The "pipe" operator applies a transformation (on the right) to a PCollection, reminiscent of bash.

This will be applied to each line, resulting in a PCollection of words.

    totals = (words
              | beam.Map(lambda w: (w, 1))
              | beam.CombinePerKey(sum))

Operations can be chained.

    totals = words | Count()

Composite operations are easily defined.

    totals | beam.io.WriteToText('/path/to/output')

    (totals
     | beam.CombinePerKey(Largest(100))
     | beam.io.WriteToText('/path/to/another/output'))

Finally, write the results somewhere.

The pipeline actually executes on exiting its context. Pipelines are DAGs in general.
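The idea of deferred construction with a "pipe" operator can be sketched in a few lines of plain Python. This is a toy model, not how Beam is implemented: `Pipeline`, `Deferred`, `Transform`, `flat_map`, and `count_per_element` are hypothetical names, and the toy starts from an in-memory list via `p.create(...)` instead of reading files.

```python
import re


class Transform:
    """A deferred step: wraps a function from a list to a list."""

    def __init__(self, fn):
        self.fn = fn


class Deferred:
    """Stand-in for a PCollection: knows how to compute, holds no data yet."""

    def __init__(self, pipeline, compute):
        self.pipeline = pipeline
        self.compute = compute
        self.result = None
        pipeline.nodes.append(self)

    def __or__(self, transform):
        # `collection | transform` only records a new node in the graph.
        return Deferred(self.pipeline, lambda: transform.fn(self.result))


class Pipeline:
    def __init__(self):
        self.nodes = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            for node in self.nodes:  # nodes were appended in dependency order
                node.result = node.compute()
        return False

    def create(self, values):
        return Deferred(self, lambda: list(values))


def flat_map(fn):
    return Transform(lambda xs: [y for x in xs for y in fn(x)])


def count_per_element():
    def count(xs):
        totals = {}
        for x in xs:
            totals[x] = totals.get(x, 0) + 1
        return sorted(totals.items())
    return Transform(count)


with Pipeline() as p:
    lines = p.create(["to be or", "not to be"])
    words = lines | flat_map(lambda line: re.findall(r"\w+", line))
    totals = words | count_per_element()
    built_but_not_run = totals.result is None  # nothing has executed yet

word_counts = totals.result
```

Inside the `with` block every `|` only extends the graph. The work happens in `__exit__`, which is why results are available only after the block ends.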

SIMPLE BATCH PIPELINE

import re
import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText('/path/to/files')
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    totals = words | Count()
    totals | beam.io.WriteToText('/path/to/output')
    (totals
     | beam.CombinePerKey(Largest(100))
     | beam.io.WriteToText('/path/to/another/output'))

WHAT ABOUT STREAMING?

SIMPLE STREAMING PIPELINE

with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromPubSub(...) | beam.WindowInto(...)
    words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
    totals = words | Count()
    totals | beam.io.WriteToText('/path/to/output')
    (totals
     | beam.CombinePerKey(Largest(100))
     | beam.io.WriteToText('/path/to/another/output'))

Portability & Vision

Google Cloud Dataflow

WHAT DOES APACHE BEAM PROVIDE?

The Beam Model: What / Where / When / How

API (SDKs) for writing Beam pipelines

Runners for existing distributed processing backends:

● Apache Apex
● Apache Flink
● InProcess / Local
● Apache Spark
● Google Cloud Dataflow
● Apache Gearpump

Pipeline SDK

[Diagram: Beam Java, Beam Python, Other Languages]

User-facing SDK that defines a language-specific API for the end user to specify the pipeline computation DAG.

Runner API

[Diagram: Beam Java, Beam Python, Other Languages; Runner API]

A runner- and language-agnostic representation of the user’s pipeline graph. It contains only nodes of Beam model primitives that all runners understand, which maintains portability across runners.

SDK Harness

[Diagram: Beam Java, Beam Python, Other Languages; Runner API; three Execution boxes]

Docker-based execution environments that are shared by all runners for running the user code in a consistent environment.

Fn API

[Diagram: Beam Java, Beam Python, Other Languages; Runner API; three Execution boxes; Fn API]

The API that the execution environments use to exchange data with the runner and to report metrics about the execution of the user code.

Runner

[Diagram: Beam Java, Beam Python, Other Languages; Runner API; three Execution boxes; Fn API; runners: Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, Cloud Dataflow]

Distributed processing environments that understand the Runner API graph and how to execute the Beam model primitives.

BEAM RUNNER CAPABILITIES

https://beam.apache.org/capability-matrix/

MORE BEAM?

Issue tracker (https://issues.apache.org/jira/projects/BEAM)

Beam website (https://beam.apache.org/)

Source code (https://github.com/apache/beam)

Developers mailing list (dev@beam.apache.org)

Users mailing list (user@beam.apache.org)

Follow @ApacheBeam on Twitter

SUMMARY

● Beam helps you tackle big data that is:
    ○ Unbounded in volume
    ○ Out of order
    ○ Arbitrarily delayed

● The Beam model separates concerns of:
    ○ What is being computed?
    ○ Where in event time?
    ○ When in processing time?
    ○ How do refinements happen?

Thanks!

You can find me at: @sb2nov

Questions?