
Dataflow/Apache Beam: A Unified Model for Batch and Streaming Data Processing
Eugene Kirpichov, Google

STREAM 2016

Agenda

1. Google's Data Processing Story
2. Philosophy of the Beam programming model
3. Apache Beam project

Google’s Data Processing Story


Data Processing @ Google

(Timeline, 2002–2016: MapReduce, GFS, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow.)

MapReduce: SELECT + GROUP BY

(distributed input dataset) → Map (SELECT) → Shuffle (GROUP BY) → Reduce (SELECT) → (distributed output dataset)
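The Map → Shuffle → Reduce shape can be sketched in plain single-process Java (an illustration of the model only, not Google's MapReduce API; `MiniMapReduce` is a hypothetical name):

```java
import java.util.*;
import java.util.function.*;

// Minimal sketch of the MapReduce shape:
// Map = element-wise SELECT, Shuffle = GROUP BY key, Reduce = per-key SELECT.
public class MiniMapReduce {
    public static <I, K, V, O> Map<K, O> run(
            List<I> input,
            Function<I, Map.Entry<K, V>> mapFn,   // Map (SELECT)
            Function<List<V>, O> reduceFn) {      // Reduce (SELECT)
        // Shuffle (GROUP BY): collect all values for each key.
        Map<K, List<V>> groups = new HashMap<>();
        for (I element : input) {
            Map.Entry<K, V> kv = mapFn.apply(element);
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                  .add(kv.getValue());
        }
        // Reduce: one output value per key.
        Map<K, O> out = new HashMap<>();
        groups.forEach((k, vs) -> out.put(k, reduceFn.apply(vs)));
        return out;
    }
}
```

For example, word count maps each word to (word, 1) and reduces by summing, so `run(Arrays.asList("a", "b", "a"), w -> Map.entry(w, 1), vs -> vs.stream().mapToInt(Integer::intValue).sum())` yields a=2, b=1.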


FlumeJava Pipelines

• A Pipeline represents a graph of data processing transformations

• PCollections flow through the pipeline

• Optimized and executed as a unit for efficiency

Example: Computing mean temperature

// Collection of raw events
PCollection<SensorEvent> raw = ...;
// Element-wise: extract location/temperature pairs
PCollection<KV<String, Double>> input =
    raw.apply(ParDo.of(new ParseFn()));
// Composite transformation containing an aggregation
PCollection<KV<String, Double>> output =
    input.apply(Mean.<Double>perKey());
// Write output
output.apply(BigtableIO.Write.to(...));
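What the aggregation step computes can be mirrored in plain Java (a hypothetical `MeanPerKey` helper for illustration; the real transform runs distributed and combines partial (sum, count) pairs associatively):

```java
import java.util.*;

// Per-key mean: accumulate (sum, count) per key, then divide.
public class MeanPerKey {
    public static Map<String, Double> meanPerKey(
            List<Map.Entry<String, Double>> readings) {
        Map<String, double[]> acc = new HashMap<>(); // key -> {sum, count}
        for (Map.Entry<String, Double> kv : readings) {
            double[] sc = acc.computeIfAbsent(kv.getKey(), k -> new double[2]);
            sc[0] += kv.getValue();
            sc[1] += 1;
        }
        Map<String, Double> means = new HashMap<>();
        acc.forEach((k, sc) -> means.put(k, sc[0] / sc[1]));
        return means;
    }
}
```

Given readings ("SF", 10.0), ("SF", 20.0), ("NY", 5.0), this produces SF=15.0 and NY=5.0.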


So, people used FlumeJava to process data... big data... really, really big data.

Batch failure mode #1: Latency

(Diagram: a daily MapReduce over Tuesday's events doesn't produce results until well into Wednesday or Thursday.)

Batch failure mode #2: Sessions

(Diagram: per-user activity bursts for Jose, Lisa, Ingo, Asha, Cheryl, and Ari — sessions that span the Tuesday/Wednesday batch boundary get split.)
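The session-splitting problem can be made concrete with a sketch of gap-based sessionization (hypothetical `Sessions.sessionize` helper, not an API from the talk): a user's events form one session until the gap since the previous event exceeds a threshold. A daily batch cut inevitably splits any session straddling midnight.

```java
import java.util.*;

// Group one user's event timestamps into sessions separated by gaps
// longer than gapMillis.
public class Sessions {
    public static List<List<Long>> sessionize(List<Long> timestamps, long gapMillis) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sorted) {
            // A gap larger than gapMillis closes the current session.
            if (!current.isEmpty() && t - current.get(current.size() - 1) > gapMillis) {
                sessions.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }
}
```

With timestamps [0, 100, 5000] and a 1000 ms gap, this yields two sessions: [0, 100] and [5000].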

Continuous & Unbounded

(Diagram: events arriving on a continuous time axis from 1:00 to 14:00 — the stream never ends.)

State of the art until recently: Lambda Architecture

● Periodic batch processing over historical events → an exact historical model
● A stream processing system → continuous updates to an approximate real-time model


MillWheel: Deterministic, low-latency streaming

● Framework for building low-latency data-processing applications

● User provides a DAG of computations to be performed

● System manages state and persistent flow of elements

Streaming or Batch?

Batch gives correctness (1 + 1 = 2); streaming gives low latency. Why not both?

● What are you computing?
● Where in event time?
● When in processing time?
● How do refinements relate?

Where in event time?

● Windowing divides data into event-time-based finite chunks.
● Required when doing aggregations over unbounded data.
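The core of event-time windowing is that an element's window depends on its event timestamp, not its arrival time. For fixed windows this is just rounding down to a window boundary (a minimal sketch; `FixedWindows.windowStart` is an illustrative name, not the SDK's method):

```java
// Assign an event timestamp to its fixed event-time window.
public class FixedWindows {
    // Returns the start of the window of length sizeMillis containing
    // eventTimeMillis. floorMod keeps this correct for negative times too.
    public static long windowStart(long eventTimeMillis, long sizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, sizeMillis);
    }
}
```

For example, with one-minute windows (60,000 ms), an event at t = 65,000 ms lands in the window starting at 60,000 ms.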

When in Processing Time?

• Triggers control when results are emitted.
• Triggers are often relative to the watermark.

(Diagram: event time vs. processing time, with the watermark tracking how far event time has been processed.)
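One simple way to picture a watermark (a sketch under an assumed rule, not MillWheel's or Dataflow's actual implementation): take the minimum of the oldest still-pending event time across all inputs; elements with event times behind it are "late".

```java
import java.util.*;

// Toy watermark: the system's claim that no element with an event time
// older than the watermark is expected to still arrive.
public class Watermark {
    // Minimum oldest-pending event time across all inputs.
    public static long watermark(List<Long> oldestPendingEventTimes) {
        long wm = Long.MAX_VALUE;
        for (long t : oldestPendingEventTimes) wm = Math.min(wm, t);
        return wm;
    }
    // An element is late if its event time is behind the watermark.
    public static boolean isLate(long eventTime, long wm) {
        return eventTime < wm;
    }
}
```

A trigger like "at watermark" fires a window's result once the watermark passes the window's end; late firings handle elements that arrive behind it anyway.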

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .trigger(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetracting())
    .apply(new Sum());


How do refinements relate?
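When a trigger fires a window several times, each firing (pane) must relate to the previous ones somehow. The two basic modes can be sketched in plain Java (illustrative `Refinements` helper; Beam additionally offers accumulating-and-retracting, which also emits a retraction of the previous pane):

```java
import java.util.*;

// Successive trigger firings of one window; each firing delivers the
// elements that arrived since the previous firing.
public class Refinements {
    // Discarding: each pane reflects only the new elements.
    public static List<Integer> discarding(List<List<Integer>> firings) {
        List<Integer> panes = new ArrayList<>();
        for (List<Integer> batch : firings)
            panes.add(batch.stream().mapToInt(Integer::intValue).sum());
        return panes;
    }
    // Accumulating: each pane reflects everything seen so far.
    public static List<Integer> accumulating(List<List<Integer>> firings) {
        List<Integer> panes = new ArrayList<>();
        int total = 0;
        for (List<Integer> batch : firings) {
            total += batch.stream().mapToInt(Integer::intValue).sum();
            panes.add(total);
        }
        return panes;
    }
}
```

For firings [1, 2] then [3], discarding emits panes 3, 3 while accumulating emits 3, 6; downstream consumers must know which mode they are reading.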


Google Cloud Dataflow

A fully-managed cloud service and programming model for batch and streaming big data processing.

Dataflow SDK

● Portable API to construct and run a pipeline.
● Available in Java and Python (alpha).
● Pipelines can run…
  ○ On your development machine
  ○ On the Dataflow Service on Google Cloud Platform
  ○ On third-party environments like Spark or Flink.

Dataflow ⇒ Apache Beam

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(
     word -> Arrays.asList(word.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(
     count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://.../..."));
p.run();

Apache Beam ecosystem (layers, top to bottom):

● End-user's pipeline
● Libraries: transforms, sources/sinks etc.
● Language-specific SDK (Java, Python, ...)
● Beam model (ParDo, GBK, Windowing…)
● Runner
● Execution environment


Apache Beam Roadmap

● 02/01/2016: Enter Apache Incubator
● 02/25/2016: 1st commit to ASF repository
● Early 2016: Design for use cases, begin refactoring
● Mid 2016: Slight chaos
● Late 2016: Multiple runners execute Beam pipelines

Runner capability matrix

Technical Vision: Still more modular

• Multiple SDKs (Beam Java, Beam Python, other languages) with a shared pipeline representation
• Language-agnostic runners (Runner A, Runner B, Cloud Dataflow) implementing the model
• Fn Runners run language-specific code in each execution environment

Recap: Timeline of ideas

● 2004 MapReduce (SELECT / GROUP BY): library > DSL; abstract away fault tolerance & distribution
● 2010 FlumeJava: high-level API (typed DAG)
● 2013 MillWheel: deterministic stream processing
● 2015 Dataflow: unified batch/streaming model (windowing, triggers, retractions)
● 2016 Beam: portable programming model; language-agnostic runners

Google confidential │ Do not distribute

Thank you
