Introduction to Apache Beam (incubating) Slides by Manu Zhang, July 2016

Introduction to Apache Beam - files.meetup.com

Page 1

Introduction to Apache Beam (incubating)

Slides by Manu Zhang, July 2016

Page 2

Unified Batch + strEAM processing model

Page 3

Beam

credit: http://www.post-gazette.com/starwars
credit: http://reallyobsessedwithfilm.blogspot.com/2012/05/x-men-second-grade-we-need-leader.html

Beam

Page 4

The Evolution of Apache Beam

[Figure: diagram of related systems. Labels: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, MillWheel, Apache Beam, Google Cloud Dataflow]

Page 5

Batch Patterns: Time Based Windows

[Figure: MapReduce applied to input split into hourly windows. Labels: Tuesday [11:00 - 12:00), [12:00 - 13:00), [13:00 - 14:00), [14:00 - 15:00), [15:00 - 16:00), [16:00 - 17:00), [18:00 - 19:00), [19:00 - 20:00), [21:00 - 22:00), [22:00 - 23:00), [23:00 - 0:00)]

Page 6

Streaming Patterns: Event-Time Based Windows

[Figure: Input and Output panels plotted on Event Time and Processing Time axes, each running from 10:00 to 15:00]

Page 7

Formalizing Event-Time Skew

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

Often heuristic-based.

Too Slow? Results are delayed.

Too Fast? Some data is late.
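The trade-off above can be sketched in plain Java. This is a toy heuristic, not Beam's watermark implementation: the class name `HeuristicWatermark` and the `allowedSkewMillis` parameter are hypothetical, and the watermark is assumed to be "largest event time seen so far, minus a fixed allowance". An element is counted as late when its event time is earlier than the current watermark.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy heuristic watermark (hypothetical): newest event time seen, minus a fixed allowance. */
final class HeuristicWatermark {
    private final long allowedSkewMillis;
    private long maxEventTimeMillis = Long.MIN_VALUE;

    HeuristicWatermark(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    /** Current watermark estimate; Long.MIN_VALUE until the first element arrives. */
    long current() {
        return maxEventTimeMillis == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeMillis - allowedSkewMillis;
    }

    /** Observes one element and reports whether it arrived behind the watermark (late). */
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
        return late;
    }

    /** Returns, for each element in arrival order, whether it was late. */
    static List<Boolean> classify(long allowedSkewMillis, long... eventTimesMillis) {
        HeuristicWatermark watermark = new HeuristicWatermark(allowedSkewMillis);
        List<Boolean> late = new ArrayList<>();
        for (long t : eventTimesMillis) {
            late.add(watermark.observe(t));
        }
        return late;
    }
}
```

With an allowance of 0 the watermark advances as fast as possible, so an out-of-order element is flagged late. A larger allowance accepts the same element, at the cost of the watermark (and any result that waits for it) trailing further behind.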

Page 8

The Beam Model: What is Being Computed?

PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
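As a rough illustration of what a per-key integer sum computes, here is a plain-Java sketch that does not use the Beam SDK. The class `SumPerKey` and the helper `kv` are hypothetical names introduced only for this example.

```java
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration (not Beam) of summing integer values per key. */
final class SumPerKey {
    /** Convenience factory for a key/value pair. */
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    /** Adds up all values that share a key; keys come back in sorted order. */
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> input) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> element : input) {
            sums.merge(element.getKey(), element.getValue(), Integer::sum);
        }
        return sums;
    }
}
```

For example, the pairs (alice, 3), (bob, 1), (alice, 4) reduce to alice = 7 and bob = 1.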

Page 9

The Beam Model: Where in Event Time?

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
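To make the effect of fixed two-minute windows concrete, the following plain-Java sketch (not the Beam SDK) assigns each element to the window containing its event time and then sums per key within each window. `FixedWindowSum`, `Event`, and `windowStart` are hypothetical names, and windows are assumed to be half-open intervals [start, start + size) aligned to the epoch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration (not Beam) of per-key sums inside fixed event-time windows. */
final class FixedWindowSum {
    /** One keyed value with its event-time timestamp. */
    static final class Event {
        final String key;
        final int value;
        final long eventTimeMillis;

        Event(String key, int value, long eventTimeMillis) {
            this.key = key;
            this.value = value;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    /** Start of the window [start, start + size) that contains the timestamp. */
    static long windowStart(long eventTimeMillis, long sizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, sizeMillis);
    }

    /** Maps window start -> key -> sum of values whose event time falls in that window. */
    static Map<Long, Map<String, Integer>> sumPerKeyAndWindow(List<Event> events, long sizeMillis) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Event e : events) {
            long start = windowStart(e.eventTimeMillis, sizeMillis);
            result.computeIfAbsent(start, k -> new TreeMap<>())
                  .merge(e.key, e.value, Integer::sum);
        }
        return result;
    }
}
```

Note that grouping depends only on event time, so the order in which elements arrive does not change the per-window sums.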

Page 10

The Beam Model: When in Processing Time?

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey());
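The idea of firing once the watermark passes the end of the window can be sketched in plain Java. This is not the SDK's trigger machinery: `EndOfWindowTrigger` and `firstFiring` are hypothetical names, the window is assumed to be the half-open interval [start, end), and the watermark readings are assumed to be listed in processing-time order.

```java
/** Plain-Java illustration (not Beam) of an end-of-window watermark trigger. */
final class EndOfWindowTrigger {
    /**
     * Index of the first watermark reading at or past the window end, or -1 if the
     * watermark never gets that far. The index stands in for a point in processing time.
     */
    static int firstFiring(long windowEndMillis, long[] watermarkReadings) {
        for (int i = 0; i < watermarkReadings.length; i++) {
            if (watermarkReadings[i] >= windowEndMillis) {
                return i;
            }
        }
        return -1;
    }
}
```

A slow watermark pushes the returned index later, which is the "results are delayed" case from the watermark slide.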

Page 11

The Beam Model: How Do Refinements Relate?

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .discardingFiredPanes()) // or accumulatingFiredPanes()
    .apply(Sum.integersPerKey());
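The difference between discarding and accumulating fired panes is easiest to see with numbers. The plain-Java sketch below is not Beam code: `PaneModes` and `firedSums` are hypothetical names, and the input is assumed to be, for a single window and key, the values that arrived between consecutive trigger firings.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration (not Beam) of discarding vs. accumulating panes for one window. */
final class PaneModes {
    /**
     * Each inner array holds the values that arrived between two trigger firings.
     * Returns the sum emitted at each firing.
     */
    static List<Integer> firedSums(int[][] arrivalsPerFiring, boolean accumulating) {
        List<Integer> panes = new ArrayList<>();
        int state = 0;
        for (int[] arrivals : arrivalsPerFiring) {
            for (int value : arrivals) {
                state += value;
            }
            panes.add(state);
            if (!accumulating) {
                state = 0; // discarding mode forgets what was already emitted
            }
        }
        return panes;
    }
}
```

For arrivals {5, 7}, then {3}, then {4}, discarding mode emits 12, 3, 4 (each pane is a delta), while accumulating mode emits 12, 15, 19 (each pane replaces the previous one). A downstream consumer has to add up discarded panes, but must overwrite with accumulated ones.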

Page 12

The Beam Model: Asking the Right Questions

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

Page 13

The Beam Model: Batch

PCollection<String> input = pipeline.apply(HDFSSource.read());
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());

Page 14

The Beam Model: Streaming

PCollection<String> input = pipeline.apply(KafkaSource.read());
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());

Page 15

The Beam Model: Spark Runner

// Slide shorthand: in the Java SDK the runner is selected through PipelineOptions.
Pipeline pipeline = Pipeline.create("SparkRunner");
PCollection<String> input = pipeline.apply(KafkaSource.read());
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());

Page 16

The Beam Model: Flink Runner

// Slide shorthand: in the Java SDK the runner is selected through PipelineOptions.
Pipeline pipeline = Pipeline.create("FlinkRunner");
PCollection<String> input = pipeline.apply(KafkaSource.read());
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());

Page 17

Why Apache Beam?

Unified - One model handles batch and streaming use cases.

Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.

Extensible - Supports user and community driven SDKs, Runners, transformation libraries, and IO connectors.

Page 18

The Apache Beam Vision

1. End users: who want to write pipelines in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.

[Figure: architecture diagram. Labels: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Apache Flink, Apache Spark, Cloud Dataflow; Beam Model: Fn Runners; Execution (three times)]

Page 19

Learn More!

Apache Beam (incubating) http://beam.incubator.apache.org

The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Why Apache Beam? A Google Perspective http://goo.gl/eWTLH1

Join the mailing lists!
User discussions - [email protected]
Development discussions - [email protected]

Follow @ApacheBeam on Twitter