Streaming architecture patterns


Best practices for streaming applications
O'Reilly Webcast, June 21st/22nd, 2016

Mark Grover | @mark_grover | Software Engineer

Ted Malaska | @TedMalaska | Principal Solutions Architect

2

About the presenters

Ted Malaska
• Principal Solutions Architect at Cloudera
• Done Hadoop for 6 years; worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fan boy, runner

Mark Grover
• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume

3

About the book

• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook

4

Goal

5

Understand common use-cases for streaming and their architectures

6

What is streaming?

7

When to stream, and when not to

• Real-time: constant low milliseconds & under
• Near real-time: low milliseconds to seconds, delay in case of failures
• Batch: 10s of seconds or more, re-run in case of failures

9

No free lunch

Real-time → Near real-time → Batch
"Difficult" architectures, lower latency ⟷ "Easier" architectures, higher latency

10

Use-cases for streaming

11

Use-case categories

• Ingestion
• Simple transformations
  – Decision (e.g. anomaly detection)
• Simple counts
  – Lambda, etc.
• Advanced usage
  – Machine learning
  – Windowing

12

Ingestion & Transformations

13

What is ingestion?

Source systems → Streaming engine → Destination system

14

But there are multiple sources

Source Systems 1, 2, 3 → Ingest (one per source) → Streaming engine → Destination system

15

But..

• Sources, sinks, and ingestion channels may go down
• Sources and sinks produce/consume at different rates (buffering needed)
• Regular maintenance windows may need to be scheduled
• So you need a resilient message broker (pub/sub)
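The buffering point can be sketched with a toy bounded buffer standing in for the broker (hypothetical `MiniBroker`, plain Python): a bursty producer fills it, and a slower consumer drains it later at its own pace.

```python
from collections import deque

class MiniBroker:
    """Toy message broker: a bounded buffer decoupling producer and consumer rates."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def publish(self, msg):
        # Back-pressure: refuse (a real broker would block or spill) when full.
        if len(self.buffer) >= self.capacity:
            return False
        self.buffer.append(msg)
        return True

    def consume(self):
        # The consumer pulls at its own pace; returns None when empty.
        return self.buffer.popleft() if self.buffer else None

broker = MiniBroker(capacity=3)
accepted = [broker.publish(i) for i in range(5)]   # producer bursts 5 messages
drained = []
while (m := broker.consume()) is not None:         # consumer drains later
    drained.append(m)
```

A real broker adds durability and replay on top of this, but the rate-decoupling role is the same.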

16

Need for a message broker

Source Systems 1, 2, 3 → Ingest → Message broker → Extract → Streaming engine → Push → Destination system

17

Kafka

Source Systems 1, 2, 3 → Ingest → Kafka (message broker) → Extract → Streaming engine → Push → Destination system

18

Destination systems

Source Systems 1, 2, 3 → Ingest → Message broker → Extract → Streaming engine → Push → Destination system

Most common “destination” is a storage system

19

Architecture diagram with a broker

Source Systems 1, 2, 3 → Ingest → Message broker → Extract → Streaming engine → Push → Storage system

20

Streaming engines

Source Systems 1, 2, 3 → Ingest → Message broker → Extract → Streaming engine → Push → Storage system

Engine options: Kafka Connect, Apache Flume, Apache Beam (incubating)

21

Storage options

Source Systems 1, 2, 3 → Ingest → Message broker → Extract → Streaming engine (Kafka Connect, Apache Flume, Apache Beam) → Push → Storage system

22

Semantics
At most once, exactly once, at least once

23

Semantic types

• At most once
  – Not good for many cases
  – Only where performance/SLA is more important than accuracy
• Exactly once
  – Expensive to achieve, but desirable
• At least once
  – Easiest to achieve
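These semantics can be made concrete with a small sketch (hypothetical names, plain Python): at-least-once delivery duplicates messages on retry, and a sink that dedups on a message key recovers effectively exactly-once results.

```python
def at_least_once(messages):
    """Simulate retries: every message is delivered one or more times (here, twice)."""
    out = []
    for m in messages:
        out += [m, m]  # a lost ack makes the sender re-send the same message
    return out

class IdempotentSink:
    """Dedup on a message key: at-least-once delivery becomes effectively exactly-once."""
    def __init__(self):
        self.seen, self.applied = set(), []

    def apply(self, key, value):
        if key not in self.seen:   # duplicate deliveries are safely ignored
            self.seen.add(key)
            self.applied.append(value)

sink = IdempotentSink()
for k, v in at_least_once([("k1", 10), ("k2", 20)]):
    sink.apply(k, v)
```

This is why at-least-once is the pragmatic choice: pair it with idempotent writes downstream and you get the effect of exactly-once without paying its full cost.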

24

Review

Source Systems 1, 2, 3 → Ingest → Message broker → Extract → Streaming engine → Push → Destination system

Semantics of our architecture

Source Systems 1, 2, 3 → Ingest (at least once) → Message broker (at least once; ordered, partitioned) → Extract → Streaming engine (it depends) → Push (it depends) → Destination system

26

Transforming data in flight

27

Streaming architecture for ingestion

Source Systems 1, 2, 3 → Ingest → Message broker → Extract → Streaming ingestion process (Kafka Connect, Apache Flume) → Push → Storage system

The streaming ingestion process can be used to do simple transformations.

28

Ingestion and/or Transformation

1. Zero transformation
   – No transformation: plain ingest, no schema validation
   – Keep the original format: SequenceFiles, text, etc.
   – Allows storing data that may have schema errors
2. Format transformation
   – Simply change the format of a field, for example to a structured format like Avro, which does schema validation
3. Enrichment transformation
   – Atomic
   – Contextual

29

#3 - Enrichment transformations

Atomic
• Works on one event at a time
• Examples: mask a credit card number; add processing time or offset to the record

Contextual
• Needs to refer to external context
• Example: convert a zip code to a state by looking up a cache
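A minimal sketch of the two flavors (hypothetical helpers, plain Python): `mask_card` is atomic because it needs only the event itself, while `enrich_state` is contextual because it consults an external lookup, faked here as a small in-memory dict.

```python
# Atomic transformation: needs only the event itself.
def mask_card(event):
    """Mask all but the last four digits of the credit card number."""
    cc = event["card"]
    return {**event, "card": "*" * (len(cc) - 4) + cc[-4:]}

# Contextual transformation: needs an external lookup (here a local dict;
# in practice a broadcast dimension table, HBase, or Memcached).
ZIP_TO_STATE = {"94105": "CA", "10001": "NY"}  # hypothetical cache contents

def enrich_state(event, cache=ZIP_TO_STATE):
    """Convert the event's zip code to a state via the context cache."""
    return {**event, "state": cache.get(event["zip"], "UNKNOWN")}

event = {"card": "4111111111111111", "zip": "94105"}
out = enrich_state(mask_card(event))
```

The interesting engineering problem is entirely in the contextual case: where that dict actually lives, which the next slides cover.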

30

Atomic transformations

• Require no context
• All streaming engines support them

31

Contextual transformations

• Well supported by many streaming engines
• Need to store the context somewhere

32

Where to store the context

1. Locally broadcast cached dimension data
   – Local to the process (on-heap, off-heap)
   – Local to the node (off-process)
2. Partitioned cache
   – Shuffle to move new data to the partitioned cache
3. External fetch (e.g. HBase, Memcached)
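Option 2 can be sketched in a few lines (hypothetical names; CRC32 stands in for the engine's shuffle hash): both the context and the events are routed by the same key, so each event finds its context locally in its partition's shard.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key):
    """Deterministically route a key to a partition, as a shuffle would."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Each partition holds only its shard of the context (e.g. zip -> state).
caches = [{} for _ in range(NUM_PARTITIONS)]
for zip_code, state in [("94105", "CA"), ("10001", "NY")]:
    caches[partition_for(zip_code)][zip_code] = state

def enrich(event):
    """An event shuffled by zip lands on the partition that caches its context."""
    shard = caches[partition_for(event["zip"])]
    return {**event, "state": shard[event["zip"]]}

out = enrich({"user": "u1", "zip": "94105"})
```

The trade-off versus option 1 is the shuffle cost; versus option 3, no per-event network round trip to an external store.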

33

#1a - Locally broadcast cached data

Could be on-heap or off-heap

34

#1b - Off process cached data

Data is cached on the node, outside of the process, potentially in an external system like RocksDB

35

#2 - Partitioned cache data

Data is partitioned based on field(s) and then cached

36

#3 - External fetch

Data fetched from external system

37

A combination (partitioned cache + external)

38

Anomaly detection using contextual transformations

39

Storage systems
When to use which one?

40

Storage Considerations

• Throughput
• Access patterns
  – Scanning
  – Indexed
  – Reverse indexed
• Transaction level
  – Record/document
  – File

41

File Level

• HDFS
• S3

42

NoSQL

• HBase
• Cassandra
• MongoDB

43

Search

• Solr
• Elasticsearch

44

NoSQL-SQL

• Kudu

45

Streaming engines
Comparison

46

© Cloudera, Inc. All rights reserved.

Tricks With Producers

• Send a source ID (requires partitioning in Kafka)
  – Seq
  – UUID
  – UUID plus time
• Partition on source ID
• Watch out for repartitions and partition failovers
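A rough sketch of the keying trick (hypothetical helpers; not the Kafka client API): hash the source ID to pick a partition, so one source's events stay ordered within one partition, and attach a per-source sequence number so the consumer can spot gaps and duplicates after retries or failovers.

```python
import zlib

NUM_PARTITIONS = 3

def partition(source_id):
    """Key on source ID so all events from one source land in one partition,
    preserving per-source ordering (mirrors a keyed hash partitioner)."""
    return zlib.crc32(source_id.encode()) % NUM_PARTITIONS

def make_event(source_id, seq, payload):
    # Source ID + per-source sequence number let the consumer detect
    # gaps (lost events) and duplicates (replayed events).
    return {"source": source_id, "seq": seq, "payload": payload}

events = [make_event("sensor-a", i, f"reading-{i}") for i in range(3)]
partitions = {partition(e["source"]) for e in events}   # all in one partition
```

The caveat on the slide is real: a repartition changes which partition a source maps to, so sequence checking must survive partition reassignment.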

47

Streaming Engines

• Consumer: Flume, Kafka Connect
• Streaming engines: Storm, Spark Streaming, Flink, Kafka Streams

48

Consumer: Flume, KafkaConnect

• Simple and works
• Low latency
• High throughput
• Interceptors
  – Transformations
  – Alerting
• Ingestion

49

Consumer: Streaming Engines

• Not so great at HDFS ingestion
• But great for record-oriented storage systems:
  – HBase
  – Cassandra
  – Kudu
  – Solr
  – Elasticsearch

50

Storm

• Old generation
• Low latency
• Low throughput
• At least once
• Around forever
• Topology based

51

Spark Streaming

• The juggernaut
• Higher latency
• High throughput
• Exactly once
• SQL
• MLlib
• Highly used
• Easy to debug/unit test
• Easy to transition from batch
• Flow language
• ~600 commits in a month and about 100 meetups

52

Spark Streaming

[Diagram: a DStream is a sequence of RDDs, one per micro-batch interval. In each batch, Source → Receiver builds the RDDs, which flow through a single pass of Filter → Count → Print; the first and second batches run the same pipeline.]

53

Spark Streaming (stateful)

[Diagram: the same per-batch Filter → Count → Print pipeline, but each pass also folds in a stateful RDD carried over from the previous batch: a pre-first-batch state feeds the first batch, producing stateful RDD 1, which feeds the second batch, producing stateful RDD 2.]

54

Flink

• "I'm better than Spark, why doesn't anyone use me?"
• Very much like Spark, but not as feature rich
• Lower latency
• Micro-batch → ABS (Asynchronous Barrier Snapshotting)
• Flow language
• ~1/6th the commits and meetups
• But Slim loves it ☺

55

Flink - ABS

[Diagram: an operator with its input buffer.]

56

Flink - ABS

[Diagram: two operators with buffers; barrier 1A has hit its operator, barrier 1B is still behind.]

57

Flink - ABS

[Diagram: both barriers have hit; the operator takes a checkpoint, buffering the records that arrived after barrier 1A while barrier 1B was still behind.]

58

Flink - ABS

[Diagram: after both barriers hit and the checkpoint is taken, the barrier is combined and can move on, and the buffer can be flushed out.]

59

Kafka Streams

• The new kid on the block
• When you only have Kafka
• Low latency
• High throughput
• Not exactly once
• Very young
• Flow language
• Very different hardware profile than others
• Not widely supported
• Not widely used
• Worries about separation of concerns

60

Summary about engines

• Ingestion: Flume and Kafka Connect
• Super real-time and special: a plain consumer
• Counting, MLlib, SQL: Spark
• Maybe future and cool: Flink and Kafka Streams
• Odd man out: Storm

61

Abstractions

• Code abstraction: Beam
• SQL abstraction: SQL
• UI abstraction: StreamSets

All layered on top of the streaming engines.

62

Counting

63

Streaming and Counting

• Counting is easy, right?
• Back to only-once semantics

64

We started with Lambda

[Diagram: a pipe feeds both a speed layer and a batch layer; the speed layer emits speed results, the batch layer persists batch results, and both land in a serving layer.]

65

Why did streaming suck?

• Increments with Cassandra
  – Double increments
  – No strong consistency
• Storm without Kafka
  – Not only-once
  – Not at-least-once
• Batch would have to re-process EVERY record to remove dups

66

We have come a long way

• We don't have to use increments any more, and we can have consistency
  – HBase
• We can have state in our streaming platform
  – Spark Streaming
• We don't lose data
  – Spark Streaming, Kafka, other options
• A full universe of deduping
  – Again, HBase with versions

67

Increments

68

Puts with State
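The contrast between the two slides above can be sketched with toy stores (hypothetical classes, plain Python): an increment double-counts a replayed event, while a put keyed by event ID, in the spirit of an HBase put with versions, is idempotent under replay.

```python
class Counter:
    """Increment-based counting: replayed events double-count."""
    def __init__(self):
        self.total = 0

    def increment(self, n):
        self.total += n  # not idempotent: a replayed message adds again

class VersionedStore:
    """Put-based counting keyed by event ID: replays overwrite, not add."""
    def __init__(self):
        self.cells = {}

    def put(self, event_id, n):
        self.cells[event_id] = n  # idempotent: the same put twice = same state

    def total(self):
        return sum(self.cells.values())

events = [("e1", 5), ("e2", 7), ("e1", 5)]  # "e1" replayed by at-least-once delivery

c, s = Counter(), VersionedStore()
for eid, n in events:
    c.increment(n)
    s.put(eid, n)
```

This is why puts with state, rather than blind increments, make streaming counts trustworthy without a batch layer re-deduping every record.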

69

Advanced streaming
When to use which one?

70

Advanced Streaming

• Ad-hoc analysis will identify value
• Ad-hoc will become batch
• The value will demand less latency from batch
• Batch will become streaming

71

Advanced Streaming

• Requirements for an ideal batch-to-streaming framework:
  – Something that can span both paradigms
  – Something that can use the tools of ad-hoc analysis: SQL, MLlib, R, Scala, Java
  – Development through a common IDE: debugging, unit testing
  – A common deployment model

72

Advanced Streaming

• In Spark Streaming, a DStream is a collection of RDDs, one per micro-batch interval
• If we can access the RDDs in Spark Streaming:
  – We can convert to Vectors: KMeans, principal component analysis
  – We can convert to LabeledPoints: NaiveBayes, random forests, linear support vector machines
  – We can convert to DataFrames: SQL, R
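A Spark-free toy of the DStream idea (hypothetical `run_stream`): a stream is just a sequence of micro-batches, and the same filter-and-count pipeline runs on each one while folding a running state forward, roughly what a stateful DStream does per interval.

```python
def run_stream(batches, state=0):
    """Toy DStream: apply the same filter -> count pipeline to each
    micro-batch, carrying a running state forward between batches."""
    outputs = []
    for batch in batches:                        # one iteration = one micro-batch interval
        evens = [x for x in batch if x % 2 == 0]  # filter step
        state += len(evens)                       # count, merged into the carried state
        outputs.append(state)                     # "print" step, once per batch
    return outputs

running = run_stream([[1, 2, 3], [4, 6], [7]])
```

Because each micro-batch is an ordinary collection, the same code path can serve batch and streaming, which is the batch-to-streaming transition argument above.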

73

Wrap-up

74

Our original goal: understand common use-cases for streaming and their architectures

75

Common streaming use-cases

• Ingestion
  – Transformation
• Counting
  – Lambda, etc.
• Advanced streaming

76

Thank you!

Mark Grover | @mark_grover

Ted Malaska | @TedMalaska

@hadooparchbook

hadooparchitecturebook.com

77
