Have Your Cake and Eat It Too: Architectures for Batch and Stream Processing
Gwen Shapira // Cloudera
Stuff We’ll Talk About
• Why do we need both streams and batches?
• Why is it a problem?
• Stream-Only Patterns (i.e. Kappa Architecture)
• Lambda-Architecture Technologies
– SummingBird
– Apache Spark
– Apache Flink
– Bring-your-own-framework
©2014 Cloudera, Inc. All rights reserved.
About Me
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap
Why Streaming and Batch
Batch Processing
• Store data somewhere
• Read large chunks of data
• Do something with data
• Sometimes store results
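Those four steps fit in a minimal plain-Scala sketch (the file paths and the transformation are made up for illustration):

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

object BatchSketch {
  // "Do something with data": a pure transformation over the whole chunk
  def process(lines: Seq[String]): Seq[String] =
    lines.map(_.trim.toUpperCase).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Stored data somewhere -> read a large chunk -> process -> store results
    val in  = Paths.get(args(0))
    val out = Paths.get(args(1))
    val lines = Files.readAllLines(in).asScala.toSeq
    Files.write(out, process(lines).asJava)
  }
}
```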
Batch Examples
• Analytics
• ETL / ELT
• Training machine learning models
• Recommendations
Stream Processing
• Listen to incoming events
• Do something with each event
• Maybe store events / results
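The stream side in the same spirit: a loop that handles each event as it arrives (the event source here is a toy iterator standing in for a socket or queue):

```scala
object StreamSketch {
  // "Do something with each event": fold one event into running results
  def handle(state: Map[String, Int], event: String): Map[String, Int] =
    state.updated(event, state.getOrElse(event, 0) + 1)

  def main(args: Array[String]): Unit = {
    // Listen to incoming events (toy source, not a real stream)
    val incoming = Iterator("login", "click", "click", "logout")
    var results = Map.empty[String, Int]
    for (event <- incoming)
      results = handle(results, event)  // maybe store events / results
    println(results)
  }
}
```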
Stream Processing Examples
• Anomaly detection, alerts
• Monitoring, SLAs
• Operational intelligence
• Analytics, dashboards
• ETL
Streaming & Batch
• Alerts
• Monitoring, SLAs
• Operational Intelligence
• Risk Analysis
• Anomaly detection
• Analytics
• ETL
Four Categories
• Streams Only
• Batch Only
• Can be done in both
• Must be done in both (ETL, some analytics)
ETL
Most Stream Processing projects I see involve a few simple transformations:
• Currency conversion
• JSON to Avro
• Field extraction
• Joining a stream to a static data set
• Aggregate on window
• Identifying change in trend
• Document indexing
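A few of these transformations can be sketched in plain Scala, independent of any framework (the event shape, exchange rates, and window size are made-up illustrations):

```scala
object EtlSketch {
  // A made-up raw event: "userId,currency,amount"
  case class Event(userId: String, currency: String, amount: Double)

  // Field extraction: parse a delimited line into a typed event
  def parse(line: String): Event = {
    val Array(u, c, a) = line.split(",")
    Event(u, c, a.toDouble)
  }

  // Currency conversion via a static lookup table (a stream-to-static join)
  val ratesToUsd = Map("EUR" -> 1.1, "USD" -> 1.0)
  def toUsd(e: Event): Event =
    e.copy(currency = "USD", amount = e.amount * ratesToUsd(e.currency))

  // Aggregate on a window: sum amounts over fixed-size groups of events
  def windowedSums(events: Seq[Event], windowSize: Int): Seq[Double] =
    events.grouped(windowSize).map(_.map(_.amount).sum).toSeq

  def main(args: Array[String]): Unit = {
    val lines = Seq("a,EUR,10.0", "b,USD,5.0", "c,EUR,20.0", "d,USD,1.0")
    println(windowedSums(lines.map(parse).map(toUsd), 2))
  }
}
```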
Batch || Streaming
• Efficient:
– Lower CPU utilization
– Better network and disk throughput
– Fewer locks and waits
• Easier administration
• Easier integration with RDBMS
• Existing expertise
• Existing tools
• Real-time information
The Problem
We Like
• Efficiency
• Scalability
• Fault Tolerance
• Recovery from errors
• Experimenting with different approaches
• Debuggers
• Cookies
But… we don’t like maintaining two applications that do the same thing.
Do we really need to maintain the same app twice?
Yes, because:
• We are not sure about requirements
• We sometimes need to re-process with very high efficiency
Not really:
• Different apps for batch and streaming
• Can re-process with streams
• Can error-correct with streams
• Can maintain one code-base for batches and streams
Stream-Only Patterns (Kappa Architecture)
DWH Example
[Diagram: an OLTP DB and sensor/log streams feed two apps. App 1 does stream processing into real-time fact tables; App 2 occasionally loads the partitioned DWH fact table. Dimensions, views, and aggregates sit alongside the fact tables.]
We need to fix older data
[Diagram: a fact table with partitions 0–13. Streaming App v1 wrote the existing partitions and the real-time table; Streaming App v2 reprocesses the stream into a replacement partition of the partitioned fact table.]
We need to fix older data (cont.)
[Diagram: once the replacement partition is swapped in, Streaming App v1 is retired and only Streaming App v2 remains, writing to the real-time table.]
Lambda-Architecture Technologies
WordCount in Scala
source.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()
SummingBird
MapReduce was great because…
Very simple abstraction:
• Map
• Shuffle
• Reduce
• Type-safe
And it has simpler abstractions on top.
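The three phases can be mimicked with plain Scala collections (a toy sketch of the model, not the Hadoop API):

```scala
object MapReduceSketch {
  // Map phase: emit (word, 1) pairs from each input line
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split(" ")).map(w => (w, 1))

  // Shuffle phase: group all emitted values by key
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  // Reduce phase: fold each key's values down to a single count
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (k, vs) => (k, vs.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(mapPhase(lines)))
}
```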
SummingBird
• Multi-stage MapReduce
• Runs on Hadoop, Spark, Storm
• Very easy to combine batch and streaming results
API
• Platform – Storm, Scalding, Spark…
• Producer.source(Platform) – get data
• Producer – collection of events
• Transformations – map, filter, merge, leftJoin (lookup)
• Output – write(sink), sumByKey(store)
• Store – contains aggregate for each key, and the reduce operation
Associative Reduce
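The reason sumByKey can merge batch and streaming results into one dataset is that the reduce operation is associative: partial aggregates computed over any split of the data combine into the same total. A plain-Scala illustration (not the SummingBird API):

```scala
object AssociativeReduce {
  // Split the data into "partitions", reduce each one, then merge the
  // partial results. Associativity guarantees the grouping doesn't matter.
  def partialThenMerge(data: Seq[Long], partitions: Int): Long =
    data.grouped(math.max(1, data.length / partitions))
        .map(_.sum)   // reduce within each partition
        .sum          // merge the partials

  def main(args: Array[String]): Unit = {
    val data = (1L to 100L)
    // Same answer whether reduced in one pass or partition-by-partition
    println(data.sum == partialThenMerge(data, 4))
  }
}
```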
WordCount SummingBird
def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.sumByKey(store)

val stormTopology = Storm.remote("stormName").plan(wordCount)
val hadoopJob = Scalding("scaldingName").plan(wordCount)
SparkStreaming
First, there was the RDD
• Spark is its own execution engine
• With high-level API
• RDDs are sharded collections
• Can be mapped, reduced, grouped, filtered, etc
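The "sharded collection" idea can be sketched without Spark: represent the dataset as a sequence of shards and apply each transformation shard-by-shard (hypothetical toy types, not the Spark API):

```scala
object RddSketch {
  // A toy "RDD": a dataset split into independent shards
  case class ToyRdd[A](shards: Vector[Vector[A]]) {
    // map and filter run on each shard independently (parallelizable)
    def map[B](f: A => B): ToyRdd[B] = ToyRdd(shards.map(_.map(f)))
    def filter(p: A => Boolean): ToyRdd[A] = ToyRdd(shards.map(_.filter(p)))
    // reduce combines within each shard, then across shards
    def reduce(f: (A, A) => A): A = shards.map(_.reduce(f)).reduce(f)
  }

  def main(args: Array[String]): Unit = {
    val rdd = ToyRdd(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
    println(rdd.map(_ * 2).reduce(_ + _))  // 42
  }
}
```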
Spark Streaming
[Diagram: a receiver reads from the source and turns each batch interval's data into one RDD of a DStream. Before the first batch only the receiver runs; in each subsequent batch, a single pass of filter → count → print is applied to that batch's RDD.]
Spark Streaming (stateful)
[Diagram: the same pipeline, but each batch also updates state: the first batch produces Stateful RDD 1, the second batch combines its results with Stateful RDD 1 to produce Stateful RDD 2, and so on.]
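The stateful micro-batch pattern (the idea behind Spark Streaming's stateful operations) can be sketched in plain Scala: consume events in small batches and fold each batch into running per-key state (toy code, not the Spark API):

```scala
object MicroBatchSketch {
  // Fold one micro-batch of (key, value) events into the running state
  def updateState(state: Map[String, Long],
                  batch: Seq[(String, Long)]): Map[String, Long] =
    batch.foldLeft(state) { case (s, (k, v)) =>
      s.updated(k, s.getOrElse(k, 0L) + v)
    }

  // Process a stream as a sequence of micro-batches, threading state through
  def run(batches: Seq[Seq[(String, Long)]]): Map[String, Long] =
    batches.foldLeft(Map.empty[String, Long])(updateState)

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("clicks" -> 3L, "views" -> 1L),  // first batch interval
      Seq("clicks" -> 2L)                  // second batch interval
    )
    println(run(batches))  // running totals across all batches
  }
}
```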
Compared to SummingBird
Differences:
• Micro-batches
• Completely new execution model
• Real joins
• Reduce is not limited to Monoids

• SparkStreaming has a richer API
• Summingbird can aggregate batch and stream to one dataset
• SparkStreaming runs in debugger
Similarities:
• Almost same code will run in batch and streams
• Use of Scala
• Use of functional programming concepts
Spark Example
val conf = new SparkConf().setMaster("local[2]")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)  // RDDs have no print(); collect and print
Spark Streaming Example
val conf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()  // keep the streaming context running
Apache Flink
Execution Model
You don’t want to know.
Flink vs SparkStreaming
Differences:
• Flink is event-by-event streaming, events go through pipeline.
• SparkStreaming has good integration with HBase as a state store

• “Checkpoint barriers”

• Optimization based on strong typing

• Flink is newer than SparkStreaming, so there is less production experience
Similarities:
• Very similar APIs
• Built-in stream-specific operators (windows)
• Exactly once guarantees through checkpoints of offsets and state (Flink is limited to small state for now)
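The checkpoint idea behind those exactly-once guarantees can be sketched in plain Scala: periodically snapshot the input offset together with the operator state, and on failure resume from the last snapshot and replay (a toy model, not the Flink or Spark API):

```scala
object CheckpointSketch {
  // A checkpoint pairs the source offset with the state at that offset
  case class Checkpoint(offset: Int, state: Long)

  // Process events from a checkpoint onward, checkpointing every `interval`
  // events. Returns the final state plus the last checkpoint taken.
  def process(events: Vector[Long], from: Checkpoint,
              interval: Int): (Long, Checkpoint) = {
    var state = from.state
    var last = from
    for (i <- from.offset until events.length) {
      state += events(i)  // the operator: a running sum
      if ((i + 1) % interval == 0) last = Checkpoint(i + 1, state)
    }
    (state, last)
  }

  def main(args: Array[String]): Unit = {
    val events = Vector(1L, 2L, 3L, 4L, 5L)
    val (_, cp) = process(events, Checkpoint(0, 0L), 2)
    // Simulate a crash: recovery restarts from the checkpoint, replaying
    // only events after cp.offset, and reaches the same final state.
    val (recovered, _) = process(events, cp, 2)
    println(recovered)  // same as processing all events exactly once
  }
}
```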
WordCount Batch
val env = ExecutionEnvironment.getExecutionEnvironment
val text = getTextDataSet(env)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("Wordcount Example")
WordCount Streaming
val env = StreamExecutionEnvironment.getExecutionEnvironment  // streaming env, not the batch ExecutionEnvironment
val text = env.socketTextStream(host, port)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("Wordcount Example")
Bring Your Own Framework
If the requirements are simple…
How difficult is it to parallelize transformations?
Simple transformations are simple.
Just add Kafka
Kafka is a reliable data source. You can read:
• Batches
• Microbatches
• Streams
It also allows for re-partitioning.
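The same topic can feed all three read styles just by changing how much you poll before processing. A toy sketch of the idea, with an in-memory queue standing in for a topic partition (not the Kafka consumer API):

```scala
import scala.collection.mutable

object KafkaReadStyles {
  // Read up to `max` events from the front of a partition's queue; the batch
  // size is the only difference between "stream" (1), "micro-batch" (a few),
  // and "batch" (everything currently available).
  def poll(partition: mutable.Queue[String], max: Int): Seq[String] = {
    val n = math.min(max, partition.size)
    (1 to n).map(_ => partition.dequeue())
  }

  def main(args: Array[String]): Unit = {
    // An in-memory stand-in for a Kafka topic partition
    val partition = mutable.Queue("e1", "e2", "e3", "e4", "e5")
    println(poll(partition, 1))             // stream style: one event at a time
    println(poll(partition, 2))             // micro-batch style: small chunks
    println(poll(partition, Int.MaxValue))  // batch style: all that's left
  }
}
```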
Cluster management
• Managing cluster resources used to be difficult
• Now:
– YARN
– Mesos
– Docker
– Kubernetes
So your app should…
• Allocate resources and track tasks with YARN / Mesos
• Read from Kafka (however often you want)
• Do simple transformations
• Write to Kafka / HBase
• How difficult can it possibly be?
Parting Thoughts
Good engineering lessons
• DRY – do you really need the same code twice?
• Error correction is critical
• Reliability guarantees are critical
• Debuggers are really nice
• Latency / throughput trade-offs
• Use existing expertise
• Stream processing is about patterns
Thank you