Unified Processing with Apache Beam
Cloud+Data NEXTCon 2017
Hello!
I am Sourabh
I am a Software Engineer
I tweet at @sb2nov
What is Apache Beam?
Apache Beam is a unified programming model for expressing efficient and
portable data processing pipelines
Big Data
LAUNCH!!
DATA CAN BE BIG
… REALLY BIG ...
UNBOUNDED, DELAYED, OUT OF ORDER
ORGANIZING THE STREAM
DATA PROCESSING TRADEOFFS
Completeness vs. latency vs. cost ($$$)
WHAT IS IMPORTANT?
Each use case weighs completeness, low latency, and low cost ($$$) differently:
MONTHLY BILLING
BILLING ESTIMATE
FRAUD DETECTION
Beam Model
GENERATIONS BEYOND MAP-REDUCE
Clearly separates event time from processing time
Improved abstractions let you focus on your application logic
Batch and stream processing are both first-class citizens
Pipeline
PTransform
PCollection (bounded or unbounded)
EVENT TIME VS PROCESSING TIME
Watermarks describe event time progress.
"No timestamp earlier than the watermark will be seen"
Often heuristic-based.
Too slow? Results are delayed. Too fast? Some data is late.
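The effect of a watermark can be sketched in plain Python (an illustrative analog, not the Beam API): elements whose event timestamps fall behind the current watermark are classified as late.

```python
def classify(elements, watermark):
    """Split (event_time, value) pairs into on-time and late,
    relative to the current watermark position."""
    on_time = [e for e in elements if e[0] >= watermark]
    late = [e for e in elements if e[0] < watermark]
    return on_time, late

# A heuristic watermark that advanced too fast leaves some data late:
events = [(100, 'a'), (95, 'b'), (110, 'c')]
on_time, late = classify(events, watermark=98)
# on_time -> [(100, 'a'), (110, 'c')], late -> [(95, 'b')]
```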
ASKING THE RIGHT QUESTIONS
What is being computed?
Where in event time?
When in processing time?
How do refinements happen?
WHAT IS BEING COMPUTED?
scores = input | beam.CombinePerKey(sum)  # a PCollection of (key, total) pairs
Transforms fall into three broad groups: element-wise, aggregating, and composite.
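The semantics of CombinePerKey(sum) can be sketched in plain Python (an illustration of what is computed, not how Beam implements it):

```python
from collections import defaultdict

def combine_per_key_sum(pairs):
    """Plain-Python analog of beam.CombinePerKey(sum):
    group (key, value) pairs by key, then sum each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

scores = combine_per_key_sum([('alice', 3), ('bob', 5), ('alice', 2)])
# scores -> {'alice': 5, 'bob': 5}
```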
WHERE IN EVENT TIME?
scores = (input
          | beam.WindowInto(FixedWindows(2 * 60))
          | beam.CombinePerKey(sum))
The choice of windowing is retained through subsequent aggregations.
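The arithmetic behind fixed windows is simple; a plain-Python sketch (illustrative, not the Beam implementation) assigns each event timestamp to a two-minute window by its start time:

```python
def fixed_window(timestamp, size=2 * 60):
    """Return the [start, end) fixed window containing a timestamp
    (in seconds). Every timestamp lands in exactly one window."""
    start = timestamp - (timestamp % size)
    return (start, start + size)

# Events at t=30s and t=119s share a window; t=125s starts the next one.
print(fixed_window(30))   # -> (0, 120)
print(fixed_window(119))  # -> (0, 120)
print(fixed_window(125))  # -> (120, 240)
```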
WHEN IN PROCESSING TIME?
scores = (input
          | beam.WindowInto(FixedWindows(2 * 60),
                            trigger=trigger.AfterWatermark())
          | beam.CombinePerKey(sum))
Triggers control when results are emitted, and are often defined relative to the watermark.
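How an AfterWatermark-style trigger relates pane timing to the window can be sketched in plain Python (a hypothetical simplification, not Beam code): firings before the watermark passes the window end are early, the first at-or-after is on time, and any later firings are late.

```python
def label_firings(watermarks, window_end):
    """Label successive trigger firings for one window, given the
    watermark position observed at each firing."""
    labels = []
    seen_on_time = False
    for wm in watermarks:
        if wm < window_end:
            labels.append('EARLY')
        elif not seen_on_time:
            labels.append('ON_TIME')
            seen_on_time = True
        else:
            labels.append('LATE')
    return labels

# Watermark observed at each firing of a window ending at t=120:
print(label_firings([60, 100, 125, 130], window_end=120))
# -> ['EARLY', 'EARLY', 'ON_TIME', 'LATE']
```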
HOW DO REFINEMENTS HAPPEN?
scores = (input
          | beam.WindowInto(FixedWindows(2 * 60),
                            trigger=trigger.AfterWatermark(
                                early=trigger.AfterProcessingTime(1 * 60),
                                late=trigger.AfterCount(1)),
                            accumulation_mode=AccumulationMode.ACCUMULATING)
          | beam.CombinePerKey(sum))
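The difference between the accumulation modes can be sketched in plain Python (an illustrative analog for a summing window, not Beam code): each pane either carries the running total or only the values that arrived since the previous pane.

```python
def panes(batches, accumulating=True):
    """Emit one pane per trigger firing for a summing window.
    ACCUMULATING emits the running total; DISCARDING emits only
    the contribution that arrived since the last pane."""
    emitted, running = [], 0
    for batch in batches:
        delta = sum(batch)
        running += delta
        emitted.append(running if accumulating else delta)
    return emitted

arrivals = [[3], [4, 2], [5]]   # values arriving between firings
print(panes(arrivals, accumulating=True))   # -> [3, 9, 14]
print(panes(arrivals, accumulating=False))  # -> [3, 6, 5]
```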
CUSTOMIZING WHAT WHERE WHEN HOW
Classic Batch
Windowed Batch
Streaming / Streaming + Accumulation
For more information see https://cloud.google.com/dataflow/examples/gaming-example
Python SDK
SIMPLE PIPELINE
with beam.Pipeline() as p:
Pipeline construction is deferred.
SIMPLE PIPELINE
with beam.Pipeline() as p:
lines = p | beam.io.ReadFromText('/path/to/files')
lines is a PCollection, a deferred collection of all lines in the specified files.
SIMPLE PIPELINE
with beam.Pipeline() as p:
lines = p | beam.io.ReadFromText('/path/to/files')
words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
The "pipe" operator applies a transformation (on the right) to a PCollection, reminiscent of bash.
This will be applied to each line, resulting in a PCollection of words.
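The tokenization step can be tried in plain Python: re.findall returns a list of words per line, and FlatMap's behavior is the flattening of those per-line lists into one collection (sketched here with itertools, not Beam).

```python
import re
from itertools import chain

lines = ['to be or', 'not to be']

# Each line maps to a list of words; FlatMap flattens the lists:
words = list(chain.from_iterable(re.findall(r'\w+', line) for line in lines))
# words -> ['to', 'be', 'or', 'not', 'to', 'be']
```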
SIMPLE PIPELINE
with beam.Pipeline() as p:
lines = p | beam.io.ReadFromText('/path/to/files')
words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
totals = (words
          | beam.Map(lambda w: (w, 1))
          | beam.CombinePerKey(sum))
Operations can be chained.
SIMPLE PIPELINE
with beam.Pipeline() as p:
lines = p | beam.io.ReadFromText('/path/to/files')
words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
totals = words | Count()
Composite operations are easily defined.
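The Count() used above is assumed to be a user-defined composite wrapping the Map + CombinePerKey pair from the previous slide; its semantics match collections.Counter in plain Python (an analog of what it computes, not Beam code):

```python
from collections import Counter

def count(words):
    """Plain-Python analog of the Count() composite:
    equivalent to Map(w -> (w, 1)) then CombinePerKey(sum)."""
    return dict(Counter(words))

print(count(['to', 'be', 'or', 'not', 'to', 'be']))
# -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```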
SIMPLE PIPELINE
with beam.Pipeline() as p:
lines = p | beam.io.ReadFromText('/path/to/files')
words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
totals = words | Count()
totals | beam.io.WriteToText('/path/to/output')
(totals | beam.CombinePerKey(Largest(100))
        | beam.io.WriteToText('/path/to/another/output'))
Finally, write the results somewhere.
The pipeline actually executes on exiting its context. Pipelines are DAGs in general.
SIMPLE BATCH PIPELINE
with beam.Pipeline() as p:
lines = p | beam.io.ReadFromText('/path/to/files')
words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
totals = words | Count()
totals | beam.io.WriteToText('/path/to/output')
(totals | beam.CombinePerKey(Largest(100))
        | beam.io.WriteToText('/path/to/another/output'))
WHAT ABOUT STREAMING?
SIMPLE STREAMING PIPELINE
with beam.Pipeline() as p:
lines = p | beam.io.ReadFromPubSub(...) | beam.WindowInto(...)
words = lines | beam.FlatMap(lambda line: re.findall(r'\w+', line))
totals = words | Count()
totals | beam.io.WriteToText('/path/to/output')
(totals | beam.CombinePerKey(Largest(100))
        | beam.io.WriteToText('/path/to/another/output'))
Portability & Vision
Google Cloud Dataflow
WHAT DOES APACHE BEAM PROVIDE?
The Beam Model: What / Where / When / How
API (SDKs) for writing Beam pipelines
Runners for existing distributed processing backends
Apache Apex
Apache Flink
InProcess / Local
Apache Spark
Google Cloud Dataflow
Apache GearPump
Other Languages / Beam Java / Beam Python
Runner API
Execution / Execution / Execution

Pipeline SDK: the user-facing SDK; defines a language-specific API for the end user to specify the pipeline's computation DAG.
Runner API: a runner- and language-agnostic representation of the user's pipeline graph. It contains only nodes of Beam model primitives that all runners understand, in order to maintain portability across runners.
SDK Harness: Docker-based execution environments, shared by all runners, for running user code in a consistent environment.
Fn API: the API through which the execution environments send and receive data and report metrics about the execution of user code to the runner.
Runner: a distributed processing environment (Apache Apex, Apache Flink, Apache Spark, Apache Gearpump, Cloud Dataflow) that understands the Runner API graph and how to execute the Beam model primitives.
BEAM RUNNER CAPABILITIES
https://beam.apache.org/capability-matrix/
MORE BEAM?
Issue tracker (https://issues.apache.org/jira/projects/BEAM)
Beam website (https://beam.apache.org/)
Source code (https://github.com/apache/beam)
Developers mailing list (dev@beam.apache.org)
Users mailing list (user@beam.apache.org)
Follow @ApacheBeam on Twitter
SUMMARY
● Beam helps you tackle big data that is:
○ Unbounded in volume
○ Out of order
○ Arbitrarily delayed
● The Beam model separates concerns of:
○ What is being computed?
○ Where in event time?
○ When in processing time?
○ How do refinements happen?