Best practices for streaming applications
O’Reilly Webcast, June 21st/22nd, 2016
Mark Grover | @mark_grover | Software Engineer
Ted Malaska | @TedMalaska | Principal Solutions Architect
2
About the presenters

Ted Malaska
• Principal Solutions Architect at Cloudera
• Has done Hadoop for 6 years
  – Worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fan boy, runner

Mark Grover
• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
3
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
4
Goal
5
Understand common use-cases for streaming and
their architectures
6
What is streaming?
7
When to stream, and when not to

• Real-time: constant low milliseconds & under
• Near real-time: low milliseconds to seconds, delay in case of failures
• Batch: 10s of seconds or more, re-run in case of failures
9
No free lunch

• Real-time: constant low milliseconds & under ("difficult" architectures, lower latency)
• Near real-time: low milliseconds to seconds, delay in case of failures
• Batch: 10s of seconds or more, re-run in case of failures ("easier" architectures, higher latency)
10
Use-cases for streaming
11
Use-case categories
• Ingestion
• Simple transformations
  – Decision (e.g. anomaly detection)
• Simple counts
  – Lambda, etc.
• Advanced usage
  – Machine Learning
  – Windowing
12
Ingestion & Transformations
13
What is ingestion?
[Diagram: Source Systems → Streaming engine → Destination system]
14
But there are multiple sources

[Diagram: Source Systems 1–3 each ingest into the Streaming engine, which feeds the Destination system]
15
But..
• Sources, sinks, and ingestion channels may go down
• Sources and sinks produce/consume at different rates (buffering)
• Regular maintenance windows may need to be scheduled
• You need a resilient message broker (pub/sub)
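The buffering point can be sketched with a toy in-memory queue standing in for the broker (names and sizes are made up; a real broker like Kafka also persists and replicates the log):

```python
from collections import deque

broker = deque()
produced = ["m%d" % i for i in range(6)]

# The producer runs ahead while the consumer is down for "maintenance":
for msg in produced:
    broker.append(msg)          # the publish survives the consumer outage

# The consumer comes back and drains at its own pace, order preserved:
consumed = []
while broker:
    consumed.append(broker.popleft())

print(consumed == produced)
```

The broker decouples the two rates: neither side needs the other to be up, or fast, at the same moment.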
16
Need for a message broker
[Diagram: Source Systems 1–3 ingest into a Message broker; the Streaming engine extracts from the broker and pushes to the Destination system]
17
Kafka
[Same diagram, with Kafka as the message broker]
18
Destination systems
[Same diagram as before]
Most common “destination” is a storage system
19
Architecture diagram with a broker
[Diagram: Source Systems 1–3 → Message broker → Streaming engine → Storage system]
20
Streaming engines
[Same diagram; example streaming engines: Kafka Connect, Apache Flume, Apache Beam (incubating)]
21
Storage options
[Same diagram, focusing on the storage system]
22
Semantics
At most once, Exactly once, At least once
23
Semantic types
• At most once
  – Not good for many cases
  – Only where performance/SLA is more important than accuracy
• Exactly once
  – Expensive to achieve but desirable
• At least once
  – Easiest to achieve
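A hypothetical simulation (not from the talk) of what the three semantics mean for a five-event stream with one crash mid-stream:

```python
events = ["a", "b", "c", "d", "e"]
crash_after = 2        # failure between "c" being processed and its ack landing

# At most once: the event in flight during the failure is simply lost.
at_most_once = events[:crash_after] + events[crash_after + 1:]

# At least once: the unacked event is replayed, so "c" is seen twice.
at_least_once = events[:crash_after + 1] + events[crash_after:]

# Exactly once: every event is seen exactly one time (the expensive case).
exactly_once = events

print(at_most_once)    # ['a', 'b', 'd', 'e']
print(at_least_once)   # ['a', 'b', 'c', 'c', 'd', 'e']
```

At-least-once plus a downstream sink that tolerates duplicates is often the practical middle ground.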
24
Review
[Diagram: Source Systems 1–3 → Message broker → Streaming engine → Destination system]
25
Semantics of our architecture
[Same diagram, annotated:]
• Ingest into the message broker: at least once
• Message broker: at least once, ordered, partitioned
• Extract and push by the streaming engine: it depends
26
Transforming data in flight
27
Streaming architecture for ingestion
[Same ingestion diagram, with Kafka Connect or Apache Flume as the streaming ingestion process]

The streaming ingestion process can be used to do simple transformations
28
Ingestion and/or Transformation
1. Zero Transformation
   – No transformation, plain ingest, no schema validation
   – Keep the original format: SequenceFiles, Text, etc.
   – Allows storing data that may have errors in the schema
2. Format Transformation
   – Simply change the format of a field, for example
   – Structured format, e.g. Avro, which does schema validation
3. Enrichment Transformation
   – Atomic
   – Contextual
29
#3 - Enrichment transformations
Atomic
• Need to work with one event at a time
• Mask a credit card number
• Add processing time or offset to the record

Contextual
• Need to refer to external context
• Example: convert zip code to state, by looking up a cache
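The two enrichment styles can be sketched as follows; the masking rule and the zip-to-state cache are made-up examples, not a real service:

```python
import time

ZIP_TO_STATE = {"94105": "CA", "10001": "NY"}   # hypothetical local cache

def atomic_transform(event):
    """Needs only the event itself: mask the card, stamp processing time."""
    event = dict(event)
    event["card"] = "****-****-****-" + event["card"][-4:]
    event["processed_at"] = time.time()
    return event

def contextual_transform(event):
    """Needs external context: look the zip code up in a cached dimension."""
    event = dict(event)
    event["state"] = ZIP_TO_STATE.get(event["zip"], "UNKNOWN")
    return event

e = {"card": "4111-1111-1111-1234", "zip": "94105"}
print(contextual_transform(atomic_transform(e))["card"])   # ****-****-****-1234
```

The atomic step is stateless and trivially parallel; the contextual step is only as scalable as wherever `ZIP_TO_STATE` lives.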
30
Atomic transformations
• Require no context
• All streaming engines support it
31
Contextual transformations
• Well supported by many streaming engines
• Need to store the context somewhere
32
Where to store the context
1. Locally broadcast cached dim data
   – Local to process (on heap, off heap)
   – Local to node (off process)
2. Partitioned cache
   – Shuffle to move new data to the partitioned cache
3. External fetch (e.g. HBase, Memcached)
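Option 2 can be sketched like this (partition count and dimension data are hypothetical): the dimension table is split by key across partitions, and each event is shuffled to the partition that caches its slice.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key):
    # Stable hash so the dim data and the events agree on the partition.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Each partition caches only its slice of the dimension table.
dim = {"u1": "gold", "u2": "silver", "u3": "bronze"}
caches = [{} for _ in range(NUM_PARTITIONS)]
for k, v in dim.items():
    caches[partition_for(k)][k] = v

def enrich(event):
    """Shuffle the event to its partition, then do a purely local lookup."""
    cache = caches[partition_for(event["user"])]
    return {**event, "tier": cache.get(event["user"], "none")}

print(enrich({"user": "u2"}))
```

No single node needs the whole dimension table, at the cost of a shuffle per event.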
33
#1a - Locally broadcast cached data
Could be On heap or Off heap
34
#1b - Off process cached data

Data is cached on the node, outside of the process, potentially in an external system like RocksDB
35
#2 - Partitioned cache data
Data is partitioned based on field(s) and then cached
36
#3 - External fetch
Data fetched from external system
37
A combination (partitioned cache + external)
38
Anomaly detection using contextual transformations
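A hedged sketch of the idea (threshold and baseline logic are illustrative only): the "context" is a running per-user baseline kept in a cache, and each event is judged against it.

```python
baselines = {}          # user -> running mean of amounts (the "context")

def is_anomaly(user, amount, factor=10.0):
    """Flag amounts far above the user's running mean, then update the mean."""
    mean = baselines.get(user)
    anomalous = mean is not None and amount > factor * mean
    baselines[user] = amount if mean is None else (mean + amount) / 2
    return anomalous

stream = [("u1", 5), ("u1", 7), ("u1", 6), ("u1", 900)]
flags = [is_anomaly(u, a) for u, a in stream]
print(flags)   # [False, False, False, True]
```

In a real pipeline the baseline cache would live in one of the three context stores above rather than a module-level dict.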
39
Storage systems
When to use which one?
40
Storage Considerations
• Throughput
• Access patterns
  – Scanning
  – Indexed
  – Reverse indexed
• Transaction level
  – Record/Document
  – File
41
File Level
• HDFS• S3
42
NoSQL

• HBase
• Cassandra
• MongoDB
43
Search
• Solr
• Elasticsearch
44
NoSQL-SQL
• Kudu
45
Streaming engines
Comparison
46 © Cloudera, Inc. All rights reserved.
Tricks With Producers
• Send a Source ID (requires partitioning in Kafka)
• Seq
• UUID
• UUID plus time
• Partition on the Source ID
• Watch out for repartitions and partition failovers
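These producer tricks can be combined in one record envelope; everything below is illustrative and is not the real Kafka producer API. Each record carries a source ID, a per-producer sequence number, and a UUID-plus-time key, and the partition is derived from the source ID so one source's records stay ordered.

```python
import itertools
import time
import uuid
import zlib

NUM_PARTITIONS = 8
seq = itertools.count()

def make_record(source_id, payload):
    return {
        "source": source_id,
        "seq": next(seq),                              # per-producer sequence
        "key": f"{uuid.uuid4()}-{time.time_ns()}",     # UUID plus time
        "partition": zlib.crc32(source_id.encode()) % NUM_PARTITIONS,
        "payload": payload,
    }

a = make_record("sensor-1", "t=21C")
b = make_record("sensor-1", "t=22C")
print(a["partition"] == b["partition"])   # True: same source, same partition
```

The sequence number is what lets a downstream consumer detect gaps or duplicates after a repartition or failover.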
47
Streaming Engines
• Consumer
  – Flume, KafkaConnect
• Streaming Engine
  – Storm
  – Spark Streaming
  – Flink
  – Kafka Streams
48
Consumer: Flume, KafkaConnect
• Simple and works
• Low latency
• High throughput
• Interceptors
  – Transformations
  – Alerting
• Ingestion
49
Consumer: Streaming Engines
• Not so great at HDFS ingestion
• But great for record storage systems
  – HBase
  – Cassandra
  – Kudu
  – Solr
  – Elasticsearch
50
Storm
• Old generation
• Low latency
• Low throughput
• At least once
• Around forever
• Topology based
51
Spark Streaming
• The juggernaut
• Higher latency
• High throughput
• Exactly once
• SQL
• MLlib
• Highly used
• Easy to debug/unit test
• Easy to transition from batch
• Flow language
• 600 commits in a month and about 100 meetups
52
Spark Streaming
[Diagram: a DStream is a series of RDDs, one per micro-batch. Each batch runs the same single pass: Source → Receiver → RDD, then Filter → Count → Print, repeated for the first batch, the second batch, and so on.]
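The micro-batch picture can be mimicked in a few lines of plain Python (a toy stand-in, not the Spark API): a DStream is just a sequence of small batches, and the same filter-and-count pass runs once per batch.

```python
first_batch = [3, 8, 1, 9]
second_batch = [7, 2, 10]
dstream = [first_batch, second_batch]       # one "RDD" per micro-batch

counts = []
for batch in dstream:                       # one pass per batch interval
    filtered = [x for x in batch if x > 5]  # Filter
    counts.append(len(filtered))            # Count
print(counts)   # [2, 2] -- an independent count per batch
```

Note that each batch's count is independent; carrying totals across batches is the stateful variant on the next slide.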
53
Spark Streaming

[Diagram: the stateful variant. A stateful RDD is seeded before the first batch; each micro-batch's Filter → Count pass merges its partitioned results into the previous stateful RDD, yielding stateful RDD 1, 2, ... across batches.]
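A hedged sketch of the stateful picture, again in plain Python rather than Spark: each batch's counts are folded into the state carried over from the previous batch, much like `updateStateByKey`.

```python
def update_state(state, batch):
    """Return a new state dict: previous totals plus this batch's counts."""
    new_state = dict(state)
    for key in batch:
        new_state[key] = new_state.get(key, 0) + 1
    return new_state

state = {}                                   # the pre-first-batch stateful RDD
for batch in [["a", "b", "a"], ["b", "c"]]:
    state = update_state(state, batch)       # stateful RDD 1, 2, ...
print(state)   # {'a': 2, 'b': 2, 'c': 1}
```

Building a fresh state each interval (rather than mutating in place) mirrors how each micro-batch produces a new stateful RDD.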
54
Flink
• “I’m better than Spark, why doesn’t anyone use me?”
• Very much like Spark but not as feature rich
• Lower latency
• Micro-batch -> ABS
  – Asynchronous Barrier Snapshotting
• Flow language
• ~1/6th the commits and meetups
• But Slim loves it ☺
55
Flink - ABS
[Diagram: an operator with its input buffer; barriers flow through the stream alongside the records]
56
Flink - ABS

[Diagram: two input channels into the operator. Barrier 1A has hit, barrier 1B is still behind, so records arriving after 1A are held in the buffer]
57
Flink - ABS

[Diagram: both barriers have hit, so the operator takes a checkpoint]
58
Flink - ABS

[Diagram: after the checkpoint, the barrier is combined and can move on downstream, and the buffer can be flushed out]
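The ABS slides can be condensed into a toy simulation (illustrative only, not Flink's implementation): records from a channel whose barrier has already arrived are buffered; once all barriers hit, a checkpoint is taken, the buffer is flushed, and processing continues.

```python
def run(stream, channels=("left", "right")):
    """stream: (channel, record) pairs; record "B" is that channel's barrier.
    Returns (records in processing order, number of checkpoints taken)."""
    barriers, processed, buffer, checkpoints = set(), [], [], 0
    for chan, rec in stream:
        if rec == "B":
            barriers.add(chan)
            if barriers == set(channels):   # both barriers hit
                checkpoints += 1            # snapshot the operator state
                processed += buffer         # buffer can be flushed out
                buffer, barriers = [], set()
        elif chan in barriers:
            buffer.append(rec)              # this channel's barrier hit first
        else:
            processed.append(rec)
    return processed, checkpoints

out, cps = run([("left", 1), ("right", 2), ("left", "B"),
                ("left", 3), ("right", 4), ("right", "B")])
print(out, cps)   # [1, 2, 4, 3] 1
```

Record 3 arrives after the left barrier, so it is held back until the right barrier catches up; that alignment is what keeps the checkpoint consistent.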
59
Kafka Streams

• The new kid on the block
• When you only have Kafka
• Low latency
• High throughput
• Not exactly once
• Very young
• Flow language
• Very different hardware profile than others
• Not widely supported
• Not widely used
• Worries about separation of concern
60
Summary about Engines

• Ingestion
  – Flume and KafkaConnect
• Super real-time and special consumer
• Counting, MLlib, SQL
  – Spark
• Maybe future and cool
  – Flink and KafkaStreams
• Odd man out
  – Storm
61
Abstractions
• UI Abstraction: StreamSets
• SQL Abstraction: SQL
• Code Abstraction: Beam
• All layered on top of the Streaming Engines
62
Counting
63
Streaming and Counting
• Counting is easy, right?
• Back to "only once"
64
We started with Lambda
[Diagram: a pipe feeds both a Speed Layer and a Batch Layer; their speed results and batch results are persisted into a Serving Layer]
65
Why did Streaming Suck?

• Increments with Cassandra
  – Double increments
  – No strong consistency
• Storm without Kafka
  – Not only once
  – Not at least once
• Batch would have to re-process EVERY record to remove dups
66
We have come a long way
• We don’t have to use increments any more, and we can have consistency
  – HBase
• We can have state in our streaming platform
  – Spark Streaming
• We don’t lose data
  – Spark Streaming
  – Kafka
  – Other options
• Full universe of deduping
  – Again, HBase with versions
67
Increments
68
Puts with State
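The "Increments" versus "Puts with State" slides contrast like this under an at-least-once replay (HBase is simulated with plain dicts, and the event IDs are made up):

```python
events = [("evt-1", 5), ("evt-2", 3)]
replayed = events + [("evt-2", 3)]        # evt-2 delivered twice

# Increments: counter += value on every delivery.
counter = 0
for _, value in replayed:
    counter += value                      # the replay double-increments

# Puts with state: one cell per event ID, overwritten on replay.
cells = {}
for event_id, value in replayed:
    cells[event_id] = value               # idempotent put
total = sum(cells.values())

print(counter, total)   # 11 8 -- the increment path over-counts
```

Keying the write by event ID makes the replay harmless, which is why puts with state (or HBase versions) replace blind increments.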
69
Advanced streaming
When to use which one?
70
Advanced Streaming
• Ad-hoc will identify value
• Ad-hoc will become batch
• The value will demand less latency on batch
• Batch will become streaming
71
Advanced Streaming
• Requirements for ideal batch-to-streaming frameworks
  – Something that can snap both paradigms
  – Something that can use the tools of ad-hoc
    • SQL
    • MLlib
    • R
    • Scala
    • Java
  – Development through a common IDE
    • Debugging
    • Unit testing
  – Common deployment model
72
Advanced Streaming
• In Spark Streaming
  – A DStream is a collection of RDDs with respect to micro-batch intervals
• If we can access RDDs in Spark Streaming
  – We can convert to Vectors
    • KMeans
    • Principal component analysis
  – We can convert to LabeledPoints
    • NaiveBayes
    • Random Forest
    • Linear Support Vector Machines
  – We can convert to DataFrames
    • SQL
    • R
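As a toy illustration of the last point (plain Python, not the Spark API): because a micro-batch is just a collection of records, it can be reshaped into the (label, feature-vector) pairs that tools like LabeledPoint expect. The field names are hypothetical.

```python
batch = [
    {"label": 1, "clicks": 3.0, "seconds": 40.0},
    {"label": 0, "clicks": 0.0, "seconds": 5.0},
]

def to_labeled_point(record):
    """Reshape a raw record into the (label, features) form MLlib-style
    batch tools consume."""
    return (record["label"], [record["clicks"], record["seconds"]])

points = [to_labeled_point(r) for r in batch]
print(points)   # [(1, [3.0, 40.0]), (0, [0.0, 5.0])]
```

The same reshaping done per micro-batch is what lets a streaming job feed models originally built for batch.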
73
Wrap-up
74
Our original goal: understand common use-cases for streaming and their architectures
75
Common streaming use-cases

• Ingestion
  – Transformation
• Counting
  – Lambda, etc.
• Advanced streaming
76
Thank you!

Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
@hadooparchbook
hadooparchitecturebook.com
77
Transformations with context