
Unified Processing with Apache Beam

Cloud+Data NEXTCon 2017

Hello!

I am Sourabh

I am a Software Engineer

I tweet at @sb2nov

What is Apache Beam?

Apache Beam is a unified programming model for expressing efficient and portable data processing pipelines.

Big Data

https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg

LAUNCH!!

DATA CAN BE BIG

… REALLY BIG ...

(Figure: data volume growing from Tuesday to Wednesday to Thursday)

UNBOUNDED, DELAYED, OUT OF ORDER

(Figure: elements with event time 8:00 arriving out of order across processing time from 8:00 to 14:00)

ORGANIZING THE STREAM

(Figure: the 8:00 elements grouped back together by event time)

DATA PROCESSING TRADEOFFS

Completeness · Latency · Cost ($$$)

WHAT IS IMPORTANT?

(Chart: Completeness, Low Latency, and Low Cost each rated Important or Not Important)

MONTHLY BILLING

(Chart: the completeness / latency / cost rating for monthly billing)

BILLING ESTIMATE

(Chart: the completeness / latency / cost rating for a billing estimate)

FRAUD DETECTION

(Chart: the completeness / latency / cost rating for fraud detection)

Beam Model

GENERATIONS BEYOND MAP-REDUCE

Clearly separates event time from processing time

Improved abstractions let you focus on your application logic

Batch and stream processing are both first-class citizens

Pipeline

PTransform

PCollection (bounded or unbounded)
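In the Python SDK these three concepts map directly onto code; a minimal sketch (the element values are illustrative):

import apache_beam as beam

with beam.Pipeline() as p:                      # Pipeline: the computation DAG
  nums = p | beam.Create([1, 2, 3])             # PCollection (bounded)
  doubled = nums | beam.Map(lambda x: x * 2)    # PTransform: one processing step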

EVENT TIME VS PROCESSING TIME

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

Often heuristic-based.

Too slow? Results are delayed. Too fast? Some data is late.
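Event time only exists if elements carry timestamps; in the Python SDK a timestamp is attached with TimestampedValue. A minimal sketch, where raw_events and extract_event_time are hypothetical stand-ins:

import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

# Attach an event-time timestamp (seconds since epoch) to each element.
stamped = raw_events | beam.Map(
    lambda e: TimestampedValue(e, extract_event_time(e)))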

ASKING THE RIGHT QUESTIONS

What is being computed?

Where in event time?

When in processing time?

How do refinements happen?

WHAT IS BEING COMPUTED?

scores = input | beam.CombinePerKey(sum)  # PCollection of (key, total) pairs
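For example, the input elements ('cat', 3), ('cat', 2), ('dog', 5) combine to ('cat', 5) and ('dog', 5).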


Element-Wise · Aggregating · Composite

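A sketch of one transform of each kind (illustrative, not from the talk; lines and pairs are assumed PCollections):

import apache_beam as beam

upper = lines | beam.Map(str.upper)        # element-wise: one output per input
totals = pairs | beam.CombinePerKey(sum)   # aggregating: one output per key

class CountWords(beam.PTransform):         # composite: wraps a subgraph
  def expand(self, lines):
    return (lines
            | beam.FlatMap(str.split)
            | beam.combiners.Count.PerElement())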

WHERE IN EVENT TIME?

scores = (input
          | beam.WindowInto(FixedWindows(2 * 60))
          | beam.CombinePerKey(sum))

The choice of windowing is retained through subsequent aggregations.
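FixedWindows is one choice among several; the Python SDK also ships sliding and session windows (arguments in seconds):

from apache_beam.transforms import window

window.FixedWindows(2 * 60)        # non-overlapping 2-minute windows
window.SlidingWindows(2 * 60, 30)  # 2-minute windows starting every 30 seconds
window.Sessions(10 * 60)           # per-key windows that close after 10 minutes of inactivity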

WHEN IN PROCESSING TIME?

scores = (input
          | beam.WindowInto(FixedWindows(2 * 60),
                            trigger=trigger.AfterWatermark(),
                            accumulation_mode=trigger.AccumulationMode.DISCARDING)
          | beam.CombinePerKey(sum))


Triggers control when results are emitted.

Triggers are often relative to the watermark.


HOW DO REFINEMENTS HAPPEN?

scores = (input
          | beam.WindowInto(FixedWindows(2 * 60),
                            trigger=trigger.AfterWatermark(
                                early=trigger.AfterProcessingTime(1 * 60),
                                late=trigger.AfterCount(1)),
                            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
          | beam.CombinePerKey(sum))

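For the snippet above to run against the actual Python SDK, the names must be imported; a sketch of the imports, assuming current Beam module paths (which may differ from the 2017 SDK):

from apache_beam.transforms import trigger   # AfterWatermark, AfterProcessingTime, AfterCount, AccumulationMode
from apache_beam.transforms.window import FixedWindows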

CUSTOMIZING WHAT WHERE WHEN HOW

Classic Batch

Windowed Batch

Streaming

Streaming + Accumulation

For more information see https://cloud.google.com/dataflow/examples/gaming-example

Python SDK


SIMPLE PIPELINE

with beam.Pipeline() as p:

Pipeline construction is deferred.


SIMPLE PIPELINE

with beam.Pipeline() as p:
  lines = p | beam.io.ReadFromText('/path/to/files')

lines is a PCollection, a deferred collection of all lines in the specified files.


SIMPLE PIPELINE

with beam.Pipeline() as p:
  lines = p | beam.io.ReadFromText('/path/to/files')
  words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))

The "pipe" operator applies a transformation (on the right) to a PCollection, reminiscent of bash.

This will be applied to each line, resulting in a PCollection of words.


SIMPLE PIPELINE

with beam.Pipeline() as p:
  lines = p | beam.io.ReadFromText('/path/to/files')
  words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
  totals = (words
            | beam.Map(lambda w: (w, 1))
            | beam.CombinePerKey(sum))

Operations can be chained.


SIMPLE PIPELINE

with beam.Pipeline() as p:
  lines = p | beam.io.ReadFromText('/path/to/files')
  words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
  totals = words | Count()

Composite operations are easily defined.
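The Count composite is not spelled out on the slide; a minimal sketch of how it might be defined (the built-in beam.combiners.Count.PerElement behaves the same way):

import apache_beam as beam

class Count(beam.PTransform):
  """Counts occurrences of each distinct element."""
  def expand(self, pcoll):
    return (pcoll
            | beam.Map(lambda x: (x, 1))
            | beam.CombinePerKey(sum))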


SIMPLE PIPELINE

with beam.Pipeline() as p:
  lines = p | beam.io.ReadFromText('/path/to/files')
  words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
  totals = words | Count()
  totals | beam.io.WriteToText('/path/to/output')
  (totals
   | beam.CombinePerKey(Largest(100))
   | beam.io.WriteToText('/path/to/another/output'))

Finally, write the results somewhere.

The pipeline actually executes on exiting its context. Pipelines are DAGs in general.
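Where the pipeline executes is controlled by pipeline options rather than by the code itself; a sketch, assuming the standard PipelineOptions mechanism:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap DirectRunner for DataflowRunner, FlinkRunner, SparkRunner, ...
opts = PipelineOptions(['--runner=DirectRunner'])
with beam.Pipeline(options=opts) as p:
  ...  # same pipeline as above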


SIMPLE BATCH PIPELINE

import re

import apache_beam as beam

with beam.Pipeline() as p:
  lines = p | beam.io.ReadFromText('/path/to/files')
  words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
  totals = words | Count()
  totals | beam.io.WriteToText('/path/to/output')
  (totals
   | beam.CombinePerKey(Largest(100))
   | beam.io.WriteToText('/path/to/another/output'))


WHAT ABOUT STREAMING?


SIMPLE STREAMING PIPELINE

with beam.Pipeline() as p:
  lines = (p
           | beam.io.ReadFromPubSub(...)
           | beam.WindowInto(...))
  words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
  totals = words | Count()
  totals | beam.io.WriteToText('/path/to/output')
  (totals
   | beam.CombinePerKey(Largest(100))
   | beam.io.WriteToText('/path/to/another/output'))
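The elided arguments might look like the following; the topic path and window size are placeholders, not from the talk. Most runners also require the --streaming pipeline option for unbounded sources:

lines = (p
         | beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
         | beam.WindowInto(window.FixedWindows(2 * 60)))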

Portability & Vision


WHAT DOES APACHE BEAM PROVIDE?

● The Beam Model: What / Where / When / How
● SDKs (APIs) for writing Beam pipelines
● Runners for existing distributed processing backends: Apache Apex, Apache Flink, Apache Spark, Apache Gearpump, Google Cloud Dataflow, and an in-process / local runner

Pipeline SDK (Beam Java, Beam Python, other languages): the user-facing SDK; it defines a language-specific API for the end user to specify the pipeline computation DAG.

Runner API: a runner- and language-agnostic representation of the user's pipeline graph. It contains only nodes of Beam model primitives that all runners understand, to maintain portability across runners.

SDK Harness: Docker-based execution environments, shared by all runners, for running the user code in a consistent environment.

Fn API: the API the execution environments use to send and receive data, and to report metrics about execution of the user code, with the Runner.

Runner: distributed processing environments (Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump, Apache Apex, ...) that understand the Runner API graph and know how to execute the Beam model primitives.

BEAM RUNNER CAPABILITIES

https://beam.apache.org/capability-matrix/

MORE BEAM?

Issue tracker (https://issues.apache.org/jira/projects/BEAM)

Beam website (https://beam.apache.org/)

Source code (https://github.com/apache/beam)

Developers mailing list (dev-subscribe@beam.apache.org)

Users mailing list (user-subscribe@beam.apache.org)

Follow @ApacheBeam on Twitter

SUMMARY

● Beam helps you tackle big data that is:
  ○ Unbounded in volume
  ○ Out of order
  ○ Arbitrarily delayed
● The Beam model separates the concerns of:
  ○ What is being computed?
  ○ Where in event time?
  ○ When in processing time?
  ○ How do refinements happen?

Thanks!

You can find me at: @sb2nov

Questions?