
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel Efficiency


Page 1

Dan Halperin
Apache Beam (incubating) PPMC
Senior Software Engineer, Google

Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel Efficiency

Page 2

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines.

Page 3

What is Apache Beam?

The Beam Programming Model
• What / Where / When / How

SDKs for writing Beam pipelines
• Java, Python

Beam Runners for existing distributed processing backends
• Apache Apex
• Apache Flink
• Apache Spark
• Google Cloud Dataflow

[Diagram: Beam Model: Pipeline Construction (Beam Java, Beam Python, other-language SDKs) feeding Beam Model: Fn Runners, with execution on Apache Apex, Apache Flink, Apache Spark, Apache Gearpump, and Google Cloud Dataflow]

Page 4

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines.

Page 5

Clean abstractions hide details

PCollection – a parallel collection of timestamped elements that are in windows.

Quick overview of the Beam model

Page 6

Clean abstractions hide details

PCollection – a parallel collection of timestamped elements that are in windows.

Sources & Readers – produce PCollections of timestamped elements and a watermark.

Quick overview of the Beam model

Page 7

Clean abstractions hide details

PCollection – a parallel collection of timestamped elements that are in windows.

Sources & Readers – produce PCollections of timestamped elements and a watermark.

ParDo – flatmap over elements of a PCollection.

Quick overview of the Beam model

Page 8

Clean abstractions hide details

PCollection – a parallel collection of timestamped elements that are in windows.

Sources & Readers – produce PCollections of timestamped elements and a watermark.

ParDo – flatmap over elements of a PCollection.

(Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}.

Quick overview of the Beam model

Page 9

Clean abstractions hide details

PCollection – a parallel collection of timestamped elements that are in windows.

Sources & Readers – produce PCollections of timestamped elements and a watermark.

ParDo – flatmap over elements of a PCollection.

(Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}.

Side inputs – global view of a PCollection used for broadcast / joins.

Quick overview of the Beam model

Page 10

Clean abstractions hide details

PCollection – a parallel collection of timestamped elements that are in windows.

Sources & Readers – produce PCollections of timestamped elements and a watermark.

ParDo – flatmap over elements of a PCollection.

(Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}.

Side inputs – global view of a PCollection used for broadcast / joins.

Window – reassign elements to zero or more windows; may be data-dependent.

Quick overview of the Beam model

Page 11

Clean abstractions hide details

PCollection – a parallel collection of timestamped elements that are in windows.

Sources & Readers – produce PCollections of timestamped elements and a watermark.

ParDo – flatmap over elements of a PCollection.

(Co)GroupByKey – shuffle & group {{K: V}} → {K: [V]}.

Side inputs – global view of a PCollection used for broadcast / joins.

Window – reassign elements to zero or more windows; may be data-dependent.

Triggers – user flow control based on window, watermark, element count, lateness.

Quick overview of the Beam model
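To make these primitives concrete, here is a minimal sketch using the Beam Java SDK (a recent SDK is assumed; the input strings and the user:count format are purely illustrative):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class ModelSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // ParDo: flatmap over elements of a PCollection (zero or more outputs per input).
    PCollection<KV<String, Long>> keyed =
        p.apply(Create.of("dhalperi:3", "jdoe:5", "dhalperi:7"))
         .apply(ParDo.of(new DoFn<String, KV<String, Long>>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             String[] parts = c.element().split(":");
             c.output(KV.of(parts[0], Long.parseLong(parts[1])));
           }
         }));

    // GroupByKey: shuffle & group {K: V} pairs into {K: [V]}.
    PCollection<KV<String, Iterable<Long>>> grouped =
        keyed.apply(GroupByKey.create());

    p.run();
  }
}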

Page 12

1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Sessions

Page 13

Simple clickstream analysis pipeline

Data: JSON-encoded analytics stream from site
• {"user":"dhalperi", "page":"oreilly.com/feed/7", "tstamp":"2016-08-31T15:07Z", …}

Desired output: Per-user session length and activity level
• dhalperi, 17 pageviews, 2016-08-31 15:00-15:35

Other pipeline goals:
• Live data – can track ongoing sessions with speculative output
  dhalperi, 10 pageviews, 2016-08-31 15:00-15:15 (EARLY)
• Archival data – much faster, still correct output respecting event time

Page 14

Simple clickstream analysis pipeline

PCollection<KV<User, Click>> clickstream =
    pipeline.apply(IO.Read(…))
            .apply(MapElements.of(new ParseClicksAndAssignUser()));

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))
        .apply(Count.perKey());

userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));

pipeline.run();
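The code above uses the talk's shorthand. In the actual Beam Java SDK the windowing step maps roughly to the following sketch (the User and Click types and the clickstream collection are carried over from the slide; note that setting an explicit trigger also requires an allowed lateness and an accumulation mode):

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

PCollection<KV<User, Long>> userSessions =
    clickstream
        .apply(Window.<KV<User, Click>>into(
                Sessions.withGapDuration(Duration.standardMinutes(3)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1))))
            .withAllowedLateness(Duration.ZERO)   // required once a trigger is set
            .accumulatingFiredPanes())            // early panes accumulate into the final count
        .apply(Count.perKey());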

Page 15

Unified unbounded & bounded PCollections

pipeline.apply(IO.Read(…))
        .apply(MapElements.of(new ParseClicksAndAssignUser()));

Apache Kafka, ActiveMQ, tailing filesystem, …
• A live, roughly in-order stream of messages, unbounded PCollections.
• KafkaIO.read().fromTopic("pageviews")

HDFS, Google Cloud Storage, yesterday's Kafka log, …
• Archival data, often readable in any order, bounded PCollections.
• TextIO.read().from("gs://oreilly/pageviews/*")

Page 16

Windowing and triggers

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))

[Figure: elements plotted on an event-time axis, 3:00–3:25]

Page 17

Windowing and triggers

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))

[Figure: event-time axis, 3:00–3:25; one session, 3:04–3:25]

Page 18

Windowing and triggers

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))

[Figure: processing time vs. event time (3:00–3:25)]

Page 19

Windowing and triggers

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))

[Figure: processing time vs. event time (3:00–3:25); early firing: 1 session, 3:04–3:10 (EARLY)]

Page 20

Windowing and triggers

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))

[Figure: processing time vs. event time (3:00–3:25); early firings: 1 session, 3:04–3:10 (EARLY), then 2 sessions, 3:04–3:10 & 3:15–3:20 (EARLY)]

Page 21

Windowing and triggers

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))

[Figure: processing time vs. event time (3:00–3:25); early firings: 1 session, 3:04–3:10 (EARLY); 2 sessions, 3:04–3:10 & 3:15–3:20 (EARLY); final: 1 session, 3:04–3:25]

Page 22

Windowing and triggers

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))

[Figure: processing time vs. event time (3:00–3:25), with the watermark shown; early firings: 1 session, 3:04–3:10 (EARLY); 2 sessions, 3:04–3:10 & 3:15–3:20 (EARLY); final: 1 session, 3:04–3:25]

Page 23

Writing output, and bundling

userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));

Writing is the dual of reading: format and then output.

Not its own primitive, but implemented based on sink semantics using ParDo + graph structure.
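As a hedged illustration, the IO.Write(…) shorthand could become a windowed TextIO write in the real SDK (the output path, shard count, and formatSession helper below are illustrative stand-ins):

userSessions
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<User, Long> session) -> formatSession(session)))  // formatSession is a stand-in
    .apply(TextIO.write()
        .to("gs://my-bucket/sessions")   // illustrative path
        .withWindowedWrites()            // one set of files per window/pane
        .withNumShards(1));              // windowed writes need an explicit shard count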

Page 24

Two example runs of this pipeline

Streaming job consuming Kafka stream
• Uses 10 workers.
• Pipeline lag of a few seconds.
• 2 million users over 1 day.
• A total of ~4.7M early + final sessions.
• 240 worker-hours.

Page 25

Two example runs of this pipeline

Streaming job consuming Kafka stream
• Uses 10 workers.
• Pipeline lag of a few seconds.
• 2 million users over 1 day.
• A total of ~4.7M early + final sessions.
• 240 worker-hours.

Batch job consuming HDFS archive
• Uses 200 workers.
• Runs for 30 minutes.
• Same input.
• A total of ~2.1M final sessions.
• 100 worker-hours.

Page 26

Two example runs of this pipeline

Streaming job consuming Kafka stream
• Uses 10 workers.
• Pipeline lag of a few seconds.
• 2 million users over 1 day.
• A total of ~4.7M early + final sessions.
• 240 worker-hours.

Batch job consuming HDFS archive
• Uses 200 workers.
• Runs for 30 minutes.
• Same input.
• A total of ~2.1M final sessions.
• 100 worker-hours.

What does the user have to change to get these results?

Page 27

Two example runs of this pipeline

Streaming job consuming Kafka stream
• Uses 10 workers.
• Pipeline lag of a few seconds.
• 2 million users over 1 day.
• A total of ~4.7M early + final sessions.
• 240 worker-hours.

Batch job consuming HDFS archive
• Uses 200 workers.
• Runs for 30 minutes.
• Same input.
• A total of ~2.1M final sessions.
• 100 worker-hours.

What does the user have to change to get these results?
A: O(10 lines of code) + command-line arguments
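A hedged sketch of what those command-line arguments might look like via Beam's PipelineOptions (StreamingOptions is a real Beam interface; the input option here is illustrative, not the exact flag from the talk):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

public class ClickstreamMain {
  public interface ClickstreamOptions extends StreamingOptions {
    @Description("Kafka topic (streaming) or file glob (batch) to read from")
    String getInput();
    void setInput(String value);
  }

  public static void main(String[] args) {
    // Streaming run:  --input=pageviews --streaming=true
    // Batch run:      --input=gs://oreilly/pageviews/* --streaming=false
    ClickstreamOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(ClickstreamOptions.class);
    Pipeline pipeline = Pipeline.create(options);
    // ... same transforms as on page 14 ...
    pipeline.run();
  }
}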

Page 28

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines.

Page 29

Beam abstractions empower runners

Efficiency at runner's discretion

“Read from this source, splitting it 1000 ways”
• user decides

“Read from this source”

Page 30

Beam abstractions empower runners

Efficiency at runner's discretion

“Read from this source, splitting it 1000 ways”
• user decides

“Read from this source”

APIs:
• long getEstimatedSize()
• List<Source> splitIntoBundles(size)
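A minimal sketch of that contract over a byte range, using the slide's simplified signatures (the real Beam BoundedSource methods are getEstimatedSizeBytes and splitIntoBundles, both of which also take PipelineOptions):

import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a Beam BoundedSource over byte offsets [start, end).
class OffsetRangeSource {
  final long start;
  final long end;

  OffsetRangeSource(long start, long end) {
    this.start = start;
    this.end = end;
  }

  // "How big is this source?" The runner uses this to pick a bundle size.
  long getEstimatedSize() {
    return end - start;
  }

  // "Split yourself into ~desiredBundleSize pieces." The runner chooses the size.
  List<OffsetRangeSource> splitIntoBundles(long desiredBundleSize) {
    List<OffsetRangeSource> bundles = new ArrayList<>();
    for (long s = start; s < end; s += desiredBundleSize) {
      bundles.add(new OffsetRangeSource(s, Math.min(s + desiredBundleSize, end)));
    }
    return bundles;
  }
}

With the 50 TiB estimate from the next slides and a desired bundle size of 1 TiB, this yields 50 independent bundles that the runner can schedule as it sees fit.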

Page 31

Beam abstractions empower runners

Efficiency at runner's discretion

“Read from this source, splitting it 1000 ways”
• user decides

“Read from this source”

APIs:
• long getEstimatedSize()
• List<Source> splitIntoBundles(size)

[Figure: the source gs://logs/*]

Page 32

Beam abstractions empower runners

Efficiency at runner's discretion

“Read from this source, splitting it 1000 ways”
• user decides

“Read from this source”

APIs:
• long getEstimatedSize()
• List<Source> splitIntoBundles(size)

[Figure: the runner asks gs://logs/* for its size]

Page 33

Beam abstractions empower runners

Efficiency at runner's discretion

“Read from this source, splitting it 1000 ways”
• user decides

“Read from this source”

APIs:
• long getEstimatedSize()
• List<Source> splitIntoBundles(size)

[Figure: gs://logs/* answers: estimated size 50 TiB]

Page 34

Beam abstractions empower runners

Efficiency at runner's discretion

“Read from this source, splitting it 1000 ways”
• user decides

“Read from this source”

APIs:
• long getEstimatedSize()
• List<Source> splitIntoBundles(size)

[Figure: given the 50 TiB estimate, the runner weighs cluster utilization, quota, reservations, bandwidth, throughput, and bottlenecks]

Page 35

Beam abstractions empower runners

Efficiency at runner's discretion

“Read from this source, splitting it 1000 ways”
• user decides

“Read from this source”

APIs:
• long getEstimatedSize()
• List<Source> splitIntoBundles(size)

[Figure: the runner asks the source to split into 1 TiB bundles]

Page 36

Beam abstractions empower runners

Efficiency at runner's discretion

“Read from this source, splitting it 1000 ways”
• user decides

“Read from this source”

APIs:
• long getEstimatedSize()
• List<Source> splitIntoBundles(size)

[Figure: the source returns its bundles as filenames]

Page 37

Bundling and runner efficiency

A bundle is a group of elements processed and committed together.

Page 38

Bundling and runner efficiency

A bundle is a group of elements processed and committed together.

APIs (ParDo/DoFn):
• startBundle()
• processElement() n times
• finishBundle()

Page 39

Bundling and runner efficiency

A bundle is a group of elements processed and committed together.

APIs (ParDo/DoFn):
• startBundle()
• processElement() n times
• finishBundle()

[Figure: a bundle is processed and committed as one transaction]

Page 40

Bundling and runner efficiency

A bundle is a group of elements processed and committed together.

Streaming runner: small bundles, low-latency pipelining across stages, overhead of frequent commits.

APIs (ParDo/DoFn):
• startBundle()
• processElement() n times
• finishBundle()

[Figure: a bundle is processed and committed as one transaction]

Page 41

Bundling and runner efficiency

A bundle is a group of elements processed and committed together.

Streaming runner: small bundles, low-latency pipelining across stages, overhead of frequent commits.

Classic batch runner: large bundles, fewer large commits, more efficient, long synchronous stages.

APIs (ParDo/DoFn):
• startBundle()
• processElement() n times
• finishBundle()

[Figure: a bundle is processed and committed as one transaction]

Page 42

Bundling and runner efficiency

A bundle is a group of elements processed and committed together.

Streaming runner: small bundles, low-latency pipelining across stages, overhead of frequent commits.

Classic batch runner: large bundles, fewer large commits, more efficient, long synchronous stages.

Other runner strategies strike a different balance.

APIs (ParDo/DoFn):
• startBundle()
• processElement() n times
• finishBundle()

[Figure: a bundle is processed and committed as one transaction]
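A minimal sketch of per-bundle batching with those hooks, assuming the Beam Java DoFn annotations; the ExternalSink client and its flush call are hypothetical stand-ins for some external service:

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

class BatchedWriteFn extends DoFn<String, Void> {
  private transient List<String> batch;

  @StartBundle
  public void startBundle() {
    batch = new ArrayList<>();    // per-bundle state; could also open a connection here
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    batch.add(c.element());       // buffer within the bundle instead of writing per element
  }

  @FinishBundle
  public void finishBundle() {
    ExternalSink.flush(batch);    // hypothetical client: commit once per bundle
  }
}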

Page 43

Runner-controlled triggering

Beam triggers are flow control, not instructions.
• “it is okay to produce data”, not “produce data now”.
• Runners decide when to produce data, and can make local choices for efficiency.

Streaming clickstream analysis: runner may optimize for latency and freshness.
• Small bundles and frequent triggering → more files and more (speculative) records.

Batch clickstream analysis: runner may optimize for throughput and efficiency.
• Large bundles and no early triggering → fewer large files and no early records.

Page 44

Pipeline workload varies

[Figure: workload vs. time; a streaming pipeline's input varies, and batch pipelines go through stages]

Page 45

Perils of fixed decisions

[Figure: workload vs. time; over-provisioned / worst case vs. under-provisioned / average case]

Page 46

Ideal case

[Figure: workload vs. time, ideal case]

Page 47

The Straggler Problem

[Figure: workers vs. time]

Work is unevenly distributed across tasks.

Reasons:
• Underlying data.
• Processing.
• Runtime effects.

Effects are cumulative per stage.

Page 48

Standard workarounds for stragglers

Split files into equal sizes?

Preemptively over-split?

Detect slow workers and re-execute?

Sample extensively and then split?

Page 49

Standard workarounds for stragglers

Split files into equal sizes?

Preemptively over-split?

Detect slow workers and re-execute?

Sample extensively and then split?

All of these have major costs, none is a complete solution.

Page 50

No amount of upfront heuristic tuning (be it manual or automatic) is enough to guarantee good performance: the system will always hit unpredictable situations at run-time.

A system that's able to dynamically adapt and get out of a bad situation is much more powerful than one that heuristically hopes to avoid getting into it.

Page 51

Beam readers enable dynamic adaptation

Readers provide simple progress signals, enabling runners to take action based on execution-time characteristics.

APIs for how much work is pending:
• Bounded: double getFractionConsumed()
• Unbounded: long getBacklogBytes()

Work-stealing:
• Bounded: Source splitAtFraction(double), int getParallelismRemaining()
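A minimal sketch of those two signals for a reader over byte offsets (simplified; in the real SDK these live on BoundedSource.BoundedReader, and splitAtFraction returns a new Source):

// Simplified reader over byte offsets [start, end); `current` is the last offset consumed.
class OffsetRangeReader {
  private final long start;
  private long end;       // may shrink when the runner steals work
  private long current;

  OffsetRangeReader(long start, long end) {
    this.start = start;
    this.end = end;
    this.current = start;
  }

  // Progress signal: how much of the assigned range is done?
  double getFractionConsumed() {
    return (double) (current - start) / (end - start);
  }

  // Work stealing: give the tail [splitPos, end) back to the runner as new work.
  long[] splitAtFraction(double fraction) {
    long splitPos = start + (long) (fraction * (end - start));
    if (splitPos <= current) {
      return null;                 // too late: that point was already consumed
    }
    long[] residual = new long[] {splitPos, end};
    end = splitPos;                // this reader now stops at the split point
    return residual;               // the runner schedules this range on another worker
  }
}

This is the mechanism behind the dynamic work rebalancing shown next: the runner notices a straggling task, splits off its unstarted tail, and hands it to an idle worker.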

Page 52

Dynamic work rebalancing

[Figure: per-task timelines at “Now”, showing done work, active work, and predicted completion; average completion time marked]

Page 53

Dynamic work rebalancing

[Figure: per-task timelines showing done work, active work, and predicted completion]

Page 54

Dynamic work rebalancing

[Figure: per-task timelines at a later “Now”, showing done work, active work, and predicted completion]

Page 55

Dynamic work rebalancing

[Figure: per-task timelines at a later “Now”, showing done work, active work, and predicted completion]

Page 56

Dynamic Work Rebalancing: a real example

Beam pipeline on the Google Cloud Dataflow runner

[Figure: 2-stage pipeline, split “evenly” but uneven in practice]

Page 57

Dynamic Work Rebalancing: a real example

Beam pipeline on the Google Cloud Dataflow runner

[Figure: 2-stage pipeline, split “evenly” but uneven in practice]

[Figure: same pipeline, with dynamic work rebalancing enabled]

Page 58

Dynamic Work Rebalancing: a real example

Beam pipeline on the Google Cloud Dataflow runner

[Figure: 2-stage pipeline, split “evenly” but uneven in practice]

[Figure: same pipeline, with dynamic work rebalancing enabled; savings highlighted]

Page 59

Dynamic Work Rebalancing + Autoscaling

Beam pipeline on the Google Cloud Dataflow runner

Page 60

Dynamic Work Rebalancing + Autoscaling

Beam pipeline on the Google Cloud Dataflow runner

Initially allocate ~80 workers based on size

Page 61

Dynamic Work Rebalancing + Autoscaling

Beam pipeline on the Google Cloud Dataflow runner

Initially allocate ~80 workers based on size

Multiple rounds of upsizing enabled by dynamic splitting

Page 62

Dynamic Work Rebalancing + Autoscaling

Beam pipeline on the Google Cloud Dataflow runner

Initially allocate ~80 workers based on size

Multiple rounds of upsizing enabled by dynamic splitting

Upscale to 1000 workers
• tasks stay well-balanced
• without oversplitting initially

Page 63

Dynamic Work Rebalancing + Autoscaling

Beam pipeline on the Google Cloud Dataflow runner

Initially allocate ~80 workers based on size

Multiple rounds of upsizing enabled by dynamic splitting

Long-running tasks aborted without causing stragglers

Upscale to 1000 workers
• tasks stay well-balanced
• without oversplitting initially

Page 64

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines.

Page 65

Unified model for portable execution

Write a pipeline once, run it anywhere

PCollection<KV<User, Click>> clickstream =
    pipeline.apply(IO.Read(…))
            .apply(MapElements.of(new ParseClicksAndAssignUser()));

PCollection<KV<User, Long>> userSessions =
    clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))))
        .apply(Count.perKey());

userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
            .apply(IO.Write(…));

pipeline.run();

Page 66

Demo time

Page 67

Get involved with Beam

Building the community

If you analyze data: Try it out!

If you have a data storage or messaging system: Write a connector!

If you have Big Data APIs: Write a Beam SDK, DSL, or library!

If you have a distributed processing backend: Write a Beam Runner!
(& join Apache Apex, Flink, Spark, Gearpump, and Google Cloud Dataflow)

Page 68

Learn more

Apache Beam (incubating):
• https://beam.incubator.apache.org/
• documentation, mailing lists, downloads, etc.

Read our blog posts!
• Streaming 101 & 102: oreilly.com/ideas/the-world-beyond-batch-streaming-101
• No shard left behind: Dynamic work rebalancing in Google Cloud Dataflow
• Apache Beam blog: http://beam.incubator.apache.org/blog/

Follow @ApacheBeam on Twitter