Have Your Cake and Eat It Too: Architectures for Batch and Stream Processing
Gwen Shapira // Cloudera
Stuff We’ll Talk About
• Why do we need both streams and batches?
• Why is it a problem?
• Stream-Only Patterns (i.e. Kappa Architecture)
• Lambda-Architecture Technologies
– SummingBird
– Apache Spark
– Apache Flink
– Bring-your-own-framework
©2014 Cloudera, Inc. All rights reserved.
About Me
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap
Why Streaming and Batch
Batch Processing
• Store data somewhere
• Read large chunks of data
• Do something with data
• Sometimes store results
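Those four steps fit in a minimal plain-Scala sketch (the file paths and the transformation are made up for illustration):

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

object BatchSketch {
  // "Do something with data": a pure transformation over the whole chunk
  def process(lines: Seq[String]): Seq[String] =
    lines.map(_.trim.toUpperCase).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Stored data somewhere -> read a large chunk -> process -> store results
    val in  = Paths.get(args(0))
    val out = Paths.get(args(1))
    val lines = Files.readAllLines(in).asScala.toSeq
    Files.write(out, process(lines).asJava)
  }
}
```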
Batch Examples
• Analytics
• ETL / ELT
• Training machine learning models
• Recommendations
Stream Processing
• Listen to incoming events
• Do something with each event
• Maybe store events / results
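The stream side in the same spirit: a loop that handles each event as it arrives (the event source here is a toy iterator standing in for a socket or queue):

```scala
object StreamSketch {
  // "Do something with each event": fold one event into running results
  def handle(state: Map[String, Int], event: String): Map[String, Int] =
    state.updated(event, state.getOrElse(event, 0) + 1)

  def main(args: Array[String]): Unit = {
    // Listen to incoming events (toy source, not a real stream)
    val incoming = Iterator("login", "click", "click", "logout")
    var results = Map.empty[String, Int]
    for (event <- incoming)
      results = handle(results, event)  // maybe store events / results
    println(results)
  }
}
```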
Stream Processing Examples
• Anomaly detection, alerts
• Monitoring, SLAs
• Operational intelligence
• Analytics, dashboards
• ETL
Streaming & Batch
• Alerts
• Monitoring, SLAs
• Operational Intelligence
• Risk Analysis
• Anomaly detection
• Analytics
• ETL
Four Categories
• Streams Only
• Batch Only
• Can be done in both
• Must be done in both (ETL, some analytics)
ETL
Most Stream Processing projects I see involve a few simple transformations:
• Currency conversion
• JSON to Avro
• Field extraction
• Joining a stream to a static data set
• Aggregate on window
• Identifying change in trend
• Document indexing
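A few of these transformations can be sketched in plain Scala, independent of any framework (the event shape, exchange rates, and window size are made-up illustrations):

```scala
object EtlSketch {
  // A made-up raw event: "userId,currency,amount"
  case class Event(userId: String, currency: String, amount: Double)

  // Field extraction: parse a delimited line into a typed event
  def parse(line: String): Event = {
    val Array(u, c, a) = line.split(",")
    Event(u, c, a.toDouble)
  }

  // Currency conversion via a static lookup table (a stream-to-static join)
  val ratesToUsd = Map("EUR" -> 1.1, "USD" -> 1.0)
  def toUsd(e: Event): Event =
    e.copy(currency = "USD", amount = e.amount * ratesToUsd(e.currency))

  // Aggregate on a window: sum amounts over fixed-size groups of events
  def windowedSums(events: Seq[Event], windowSize: Int): Seq[Double] =
    events.grouped(windowSize).map(_.map(_.amount).sum).toSeq

  def main(args: Array[String]): Unit = {
    val lines = Seq("a,EUR,10.0", "b,USD,5.0", "c,EUR,20.0", "d,USD,1.0")
    println(windowedSums(lines.map(parse).map(toUsd), 2))
  }
}
```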
Batch || Streaming
• Efficient:
– Lower CPU utilization
– Better network and disk throughput
– Fewer locks and waits
• Easier administration
• Easier integration with RDBMS
• Existing expertise
• Existing tools
• Real-time information
The Problem
We Like
• Efficiency
• Scalability
• Fault Tolerance
• Recovery from errors
• Experimenting with different approaches
• Debuggers
• Cookies
But… we don’t like maintaining two applications that do the same thing.
Do we really need to maintain the same app twice?
Yes, because:
• We are not sure about requirements
• We sometimes need to re-process with very high efficiency
Not really:
• Different apps for batch and streaming
• Can re-process with streams
• Can error-correct with streams
• Can maintain one code-base for batches and streams
Stream-Only Patterns (Kappa Architecture)
DWH Example
[Diagram: an OLTP DB and sensor/log streams feed two apps. App 1 does stream processing into real-time fact tables; App 2 occasionally loads the partitioned DWH fact table. Dimensions, views, and aggregates sit alongside the fact tables.]
We need to fix older data
[Diagram: a fact table with partitions 0–13. Streaming App v1 wrote the existing partitions and the real-time table; Streaming App v2 reprocesses the stream into a replacement partition of the partitioned fact table.]
We need to fix older data (cont.)
[Diagram: once the replacement partition is swapped in, Streaming App v1 is retired and only Streaming App v2 remains, writing to the real-time table.]
Lambda-Architecture Technologies
WordCount in Scala
source.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()
SummingBird
MapReduce was great because…
Very simple abstraction:
• Map
• Shuffle
• Reduce
• Type-safe
And it has simpler abstractions on top.
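The three phases can be mimicked with plain Scala collections (a toy sketch of the model, not the Hadoop API):

```scala
object MapReduceSketch {
  // Map phase: emit (word, 1) pairs from each input line
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split(" ")).map(w => (w, 1))

  // Shuffle phase: group all emitted values by key
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  // Reduce phase: fold each key's values down to a single count
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (k, vs) => (k, vs.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(mapPhase(lines)))
}
```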
SummingBird
• Multi-stage MapReduce
• Runs on Hadoop, Spark, Storm
• Very easy to combine batch and streaming results
API
• Platform – Storm, Scalding, Spark…
• Producer.source(Platform) – get data
• Producer – collection of events
• Transformations – map, filter, merge, leftJoin (lookup)
• Output – write(sink), sumByKey(store)
• Store – contains aggregate for each key, and the reduce operation
Associative Reduce
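The reason sumByKey can merge batch and streaming results into one dataset is that the reduce operation is associative: partial aggregates computed over any split of the data combine into the same total. A plain-Scala illustration (not the SummingBird API):

```scala
object AssociativeReduce {
  // Split the data into "partitions", reduce each one, then merge the
  // partial results. Associativity guarantees the grouping doesn't matter.
  def partialThenMerge(data: Seq[Long], partitions: Int): Long =
    data.grouped(math.max(1, data.length / partitions))
        .map(_.sum)   // reduce within each partition
        .sum          // merge the partials

  def main(args: Array[String]): Unit = {
    val data = (1L to 100L)
    // Same answer whether reduced in one pass or partition-by-partition
    println(data.sum == partialThenMerge(data, 4))
  }
}
```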
WordCount SummingBird
def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.sumByKey(store)

val stormTopology = Storm.remote("stormName").plan(wordCount)
val hadoopJob = Scalding("scaldingName").plan(wordCount)
SparkStreaming
First, there was the RDD
• Spark is its own execution engine
• With high-level API
• RDDs are sharded collections
• Can be mapped, reduced, grouped, filtered, etc
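The "sharded collection" idea can be sketched without Spark: represent the dataset as a sequence of shards and apply each transformation shard-by-shard (hypothetical toy types, not the Spark API):

```scala
object RddSketch {
  // A toy "RDD": a dataset split into independent shards
  case class ToyRdd[A](shards: Vector[Vector[A]]) {
    // map and filter run on each shard independently (parallelizable)
    def map[B](f: A => B): ToyRdd[B] = ToyRdd(shards.map(_.map(f)))
    def filter(p: A => Boolean): ToyRdd[A] = ToyRdd(shards.map(_.filter(p)))
    // reduce combines within each shard, then across shards
    def reduce(f: (A, A) => A): A = shards.map(_.reduce(f)).reduce(f)
  }

  def main(args: Array[String]): Unit = {
    val rdd = ToyRdd(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
    println(rdd.map(_ * 2).reduce(_ + _))  // 42
  }
}
```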
Spark Streaming
[Diagram: a receiver reads from the source and turns each batch interval's data into one RDD of a DStream. Before the first batch only the receiver runs; in each subsequent batch, a single pass of filter → count → print is applied to that batch's RDD.]
Spark Streaming (stateful)
[Diagram: the same pipeline, but each batch also updates state: the first batch produces Stateful RDD 1, the second batch combines its results with Stateful RDD 1 to produce Stateful RDD 2, and so on.]
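The stateful micro-batch pattern (the idea behind Spark Streaming's stateful operations) can be sketched in plain Scala: consume events in small batches and fold each batch into running per-key state (toy code, not the Spark API):

```scala
object MicroBatchSketch {
  // Fold one micro-batch of (key, value) events into the running state
  def updateState(state: Map[String, Long],
                  batch: Seq[(String, Long)]): Map[String, Long] =
    batch.foldLeft(state) { case (s, (k, v)) =>
      s.updated(k, s.getOrElse(k, 0L) + v)
    }

  // Process a stream as a sequence of micro-batches, threading state through
  def run(batches: Seq[Seq[(String, Long)]]): Map[String, Long] =
    batches.foldLeft(Map.empty[String, Long])(updateState)

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("clicks" -> 3L, "views" -> 1L),  // first batch interval
      Seq("clicks" -> 2L)                  // second batch interval
    )
    println(run(batches))  // running totals across all batches
  }
}
```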
Compared to SummingBird
Differences:
• Micro-batches
• Completely new execution model
• Real joins
• Reduce is not limited to Monoids

• SparkStreaming has a richer API
• Summingbird can aggregate batch and stream to one dataset
• SparkStreaming runs in debugger
Similarities:
• Almost same code will run in batch and streams
• Use of Scala
• Use of functional programming concepts
Spark Example
val conf = new SparkConf().setMaster("local[2]")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)  // RDDs have no print(); collect and print
Spark Streaming Example
val conf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()  // keep the streaming context running
Apache Flink
Execution Model
You don’t want to know.
Flink vs SparkStreaming
Differences:
• Flink is event-by-event streaming, events go through pipeline.
• SparkStreaming has good integration with HBase as a state store

• “Checkpoint barriers”

• Optimization based on strong typing

• Flink is newer than SparkStreaming, so there is less production experience
Similarities:
• Very similar APIs
• Built-in stream-specific operators (windows)
• Exactly once guarantees through checkpoints of offsets and state (Flink is limited to small state for now)
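The checkpoint idea behind those exactly-once guarantees can be sketched in plain Scala: periodically snapshot the input offset together with the operator state, and on failure resume from the last snapshot and replay (a toy model, not the Flink or Spark API):

```scala
object CheckpointSketch {
  // A checkpoint pairs the source offset with the state at that offset
  case class Checkpoint(offset: Int, state: Long)

  // Process events from a checkpoint onward, checkpointing every `interval`
  // events. Returns the final state plus the last checkpoint taken.
  def process(events: Vector[Long], from: Checkpoint,
              interval: Int): (Long, Checkpoint) = {
    var state = from.state
    var last = from
    for (i <- from.offset until events.length) {
      state += events(i)  // the operator: a running sum
      if ((i + 1) % interval == 0) last = Checkpoint(i + 1, state)
    }
    (state, last)
  }

  def main(args: Array[String]): Unit = {
    val events = Vector(1L, 2L, 3L, 4L, 5L)
    val (_, cp) = process(events, Checkpoint(0, 0L), 2)
    // Simulate a crash: recovery restarts from the checkpoint, replaying
    // only events after cp.offset, and reaches the same final state.
    val (recovered, _) = process(events, cp, 2)
    println(recovered)  // same as processing all events exactly once
  }
}
```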
WordCount Batch
val env = ExecutionEnvironment.getExecutionEnvironment
val text = getTextDataSet(env)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("Wordcount Example")
WordCount Streaming
val env = StreamExecutionEnvironment.getExecutionEnvironment  // streaming env, not the batch ExecutionEnvironment
val text = env.socketTextStream(host, port)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("Wordcount Example")
Bring Your Own Framework
If the requirements are simple…
How difficult is it to parallelize transformations?
Simple transformations are simple.
Just add Kafka
Kafka is a reliable data source. You can read:
• Batches
• Microbatches
• Streams
It also allows for re-partitioning.
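The same topic can feed all three read styles just by changing how much you poll before processing. A toy sketch of the idea, with an in-memory queue standing in for a topic partition (not the Kafka consumer API):

```scala
import scala.collection.mutable

object KafkaReadStyles {
  // Read up to `max` events from the front of a partition's queue; the batch
  // size is the only difference between "stream" (1), "micro-batch" (a few),
  // and "batch" (everything currently available).
  def poll(partition: mutable.Queue[String], max: Int): Seq[String] = {
    val n = math.min(max, partition.size)
    (1 to n).map(_ => partition.dequeue())
  }

  def main(args: Array[String]): Unit = {
    // An in-memory stand-in for a Kafka topic partition
    val partition = mutable.Queue("e1", "e2", "e3", "e4", "e5")
    println(poll(partition, 1))             // stream style: one event at a time
    println(poll(partition, 2))             // micro-batch style: small chunks
    println(poll(partition, Int.MaxValue))  // batch style: all that's left
  }
}
```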
Cluster management
• Managing cluster resources used to be difficult
• Now:
– YARN
– Mesos
– Docker
– Kubernetes
So your app should…
• Allocate resources and track tasks with YARN / Mesos
• Read from Kafka (however often you want)
• Do simple transformations
• Write to Kafka / HBase
• How difficult can it possibly be?
Parting Thoughts
Good engineering lessons
• DRY – do you really need the same code twice?
• Error correction is critical
• Reliability guarantees are critical
• Debuggers are really nice
• Latency / throughput trade-offs
• Use existing expertise
• Stream processing is about patterns
Thank you