Spark streaming state of the union

Spark Streaming: The State of the Union and the Road Beyond

Tathagata “TD” Das @tathadas

March 18, 2015

Who am I?

Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks

What is Spark Streaming?

Spark Streaming

Scalable, fault-tolerant stream processing system

Sources: Kafka, Flume, Kinesis, HDFS/S3, Twitter

Sinks: file systems, databases, dashboards

High-level API

joins, windows, … often 5x less code

Fault-tolerant

Exactly-once semantics, even for stateful ops

Integration

Integrate with MLlib, SQL, DataFrames, GraphX

How does it work?

Receivers receive data streams and chop them up into batches

Spark processes the batches and pushes out the results

[Figure: data streams → receivers → batches → results]
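The two steps above can be illustrated without Spark at all. The following is a minimal plain-Python sketch of the micro-batch idea, not Spark's implementation: the record format, the `chop_into_batches` and `word_count` helpers, and the sample data are all hypothetical, and intervals with no data are simply skipped.

```python
from collections import Counter, defaultdict

def chop_into_batches(records, batch_interval):
    """Group (timestamp, line) records into batches, one per time interval."""
    batches = defaultdict(list)
    for timestamp, line in records:
        batches[int(timestamp // batch_interval)].append(line)
    # Hand the batches over in time order; empty intervals are skipped here.
    return [batches[i] for i in sorted(batches)]

def word_count(batch):
    """Process one batch: split lines into words and count them."""
    return Counter(word for line in batch for word in line.split(" "))

# Hypothetical input: (arrival time in seconds, text line).
records = [(0.2, "a b"), (0.9, "b"), (1.1, "a a"), (2.5, "c")]

# Each batch is processed independently and yields its own result.
results = [word_count(batch) for batch in chop_into_batches(records, 1.0)]
for index, counts in enumerate(results):
    print(index, dict(counts))
```

Because every batch is an ordinary finite collection, the per-batch logic is the same code one would write for a batch job.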

Streaming Word Count with Kafka

// create DStream with lines from Kafka
val kafka = KafkaUtils.createStream(ssc, kafkaParams, …)

// split lines into words
val words = kafka.map(_._2).flatMap(_.split(" "))

// count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

// print some counts on screen
wordCounts.print()

// start processing the stream
ssc.start()

Languages

Can natively use Scala, Java, and Python

Can use any other language by using RDD.pipe()
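The idea behind piping can be sketched in plain Python: elements are written as lines to an external program's standard input, and its output lines become the new elements. This is only an illustration of the pattern, not Spark's code. The `pipe_partition` helper is hypothetical, and a Python child process stands in for the program written in another language.

```python
import subprocess
import sys

# Stand-in for a program written in another language: a child process that
# reads lines on stdin and writes one transformed line per input to stdout.
EXTERNAL_COMMAND = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())",
]

def pipe_partition(elements, command):
    """Send each element as a line to the command; return its output lines."""
    completed = subprocess.run(
        command,
        input="".join(f"{element}\n" for element in elements),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.splitlines()

piped = pipe_partition(["spark", "streaming"], EXTERNAL_COMMAND)
print(piped)
```

The only contract between the two sides is line-oriented text, which is why any language that can read stdin and write stdout can take part.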


Integrates with Spark Ecosystem

[Figure: Spark stack with Spark Streaming, Spark SQL, MLlib, and GraphX on top of Spark Core]

Combine batch and streaming processing

Join data streams with static data sets

// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
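What the per-batch join does can be shown with ordinary collections. This is a plain-Python sketch under assumed data shapes (keyed pairs in each batch, a dictionary as the static data set); `join_batch` and `transform` are hypothetical helpers, not Spark APIs.

```python
# Static data set (hypothetical lookup table): user id -> country.
dataset = {"u1": "US", "u2": "DE"}

def join_batch(batch, static):
    """Inner-join one batch of (key, value) pairs with the static table."""
    return [(key, (value, static[key])) for key, value in batch if key in static]

def transform(stream, func):
    """Apply func to every batch, one batch at a time."""
    return [func(batch) for batch in stream]

# Hypothetical stream: a list of batches, each a list of (user id, event).
stream = [[("u1", "click"), ("u3", "view")], [("u2", "click")]]
joined = transform(stream, lambda batch: join_batch(batch, dataset))
print(joined)
```

The static data set is loaded once and reused for every batch; only the streaming side changes from one interval to the next.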



Combine machine learning with streaming

Learn models offline, apply them online

// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}
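The offline/online split can be sketched in plain Python. The example below uses a tiny one-dimensional k-means written for illustration only (it is not MLlib's `KMeans`); `train_kmeans`, `predict`, the initial centers, and all data points are assumptions made for the sketch.

```python
def nearest(centers, x):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))

def train_kmeans(points, centers, iterations=10):
    """Offline step: a few rounds of Lloyd's algorithm on 1-D points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            clusters[nearest(centers, point)].append(point)
        # Keep the old center when a cluster received no points.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers

def predict(model, x):
    """Online step: assign one event to its nearest learned center."""
    return nearest(model, x)

# Learn the model once from a static, historical data set.
model = train_kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.5], centers=[0.0, 5.0])

# Apply the fixed model to every event of every incoming batch.
stream = [[0.5, 9.9], [1.4]]
predictions = [[predict(model, event) for event in batch] for batch in stream]
print(predictions)
```

The model is read-only on the streaming side, so prediction needs no coordination between batches.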



Combine SQL with streaming

Interactively query streaming data with SQL

// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.registerTempTable("latestEvents")
}

// Interactively query table
sqlContext.sql("select * from latestEvents")
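The pattern of re-registering the newest batch under a fixed table name can be tried out with Python's built-in `sqlite3` module, used here purely as a stand-in for Spark SQL. The table schema, the `register_batch` helper, and the sample events are assumptions for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latestEvents (user_id TEXT, action TEXT)")

def register_batch(batch):
    """Replace the table contents with the newest batch of events."""
    with conn:  # commit the delete and the inserts together
        conn.execute("DELETE FROM latestEvents")
        conn.executemany("INSERT INTO latestEvents VALUES (?, ?)", batch)

# Hypothetical stream: each batch is a list of (user id, action) rows.
stream = [[("u1", "click")], [("u2", "view"), ("u3", "click")]]
for batch in stream:
    register_batch(batch)

# An interactive query always sees the most recently registered batch.
rows = conn.execute("SELECT * FROM latestEvents ORDER BY user_id").fetchall()
print(rows)
```

Queries are written against a stable name while the data behind it is swapped on every interval.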



A Brief History

Late 2011 – research idea
AMPLab, UC Berkeley

"We need to make Spark faster."

"Okay... umm, how??!?!"

Q2 2012 – prototype
Rewrote large parts of Spark core
Smallest job: 900 ms → <50 ms

Q3 2012
Spark core improvements open sourced in Spark 0.6

Feb 2013 – alpha release
7.7k lines, merged in 7 days
Released with Spark 0.7

Jan 2014 – stable release
Graduation with Spark 0.9

Current state of Spark Streaming

Adoption


Roadmap

Development


What have we added in the last year?

Python API

Core functionality in Spark 1.2, with sockets and files as sources

Kafka support in Spark 1.3

Other sources coming in future

kafka = KafkaUtils.createStream(ssc, params, …)
lines = kafka.map(lambda x: x[1])
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

Streaming MLlib algorithms

val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(4, 0.0)

// Apply model to DStreams
model.trainOn(trainingDStream)
model.predictOnValues(
  testDStream.map { lp => (lp.label, lp.features) }
).print()

Continuous learning and prediction on streaming data

StreamingLinearRegression in Spark 1.1

StreamingKMeans in Spark 1.2

StreamingLogisticRegression in Spark 1.3

https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
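To make the role of the decay factor concrete, here is a plain-Python sketch of a "forgetful" center update for a single one-dimensional cluster. It assumes the update rule described in the MLlib documentation for streaming k-means, where the old center is weighted by its point count times the decay factor; the exact bookkeeping of the weight is an assumption, and `update_center` is a hypothetical helper, not an MLlib function.

```python
def update_center(center, weight, batch_points, decay):
    """One forgetful update of a single 1-D cluster center.

    center, weight: current center and the number of points behind it.
    batch_points:   points from the new batch assigned to this cluster.
    decay:          1.0 weighs all history equally; 0.0 uses only the
                    newest batch.
    """
    count = len(batch_points)
    discounted = weight * decay  # how much the past still counts
    if count == 0:
        return center, discounted
    batch_mean = sum(batch_points) / count
    new_center = (center * discounted + batch_mean * count) / (discounted + count)
    return new_center, discounted + count

# Same batch, two decay settings (all values are made up for illustration).
remember_all = update_center(0.0, 2.0, [3.0, 3.0], decay=1.0)
forget_all = update_center(0.0, 2.0, [3.0, 3.0], decay=0.0)
print(remember_all, forget_all)
```

With a decay factor of 1.0, as in the slide's example, the center is the running mean of everything seen so far; smaller values let the model track drifting data more quickly.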

Kafka `Direct` Stream API

Earlier Receiver-based approach for Kafka

Requires replicated journals (write ahead logs) to ensure zero data loss under driver failures


http://spark.apache.org/docs/latest/streaming-kafka-integration.html

Earlier Receiver-based approach for Kafka: a Kafka Receiver built on the high-level consumer API

New direct approach for Kafka in Spark 1.3: uses the simple consumer API to read Kafka topics


New direct approach for Kafka in 1.3 – treat Kafka like a file system

No receivers!!!
Directly query Kafka for latest topic offsets, and read data like reading files
Instead of ZooKeeper, Spark Streaming keeps track of Kafka offsets
More efficient, fault-tolerant, exactly-once receiving of Kafka data
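The "treat Kafka like a file system" idea can be sketched in plain Python: each batch is defined by a fixed offset range per partition, so re-reading the same range after a failure returns exactly the same data. This is a conceptual model, not the Spark or Kafka API. The in-memory `log`, the `plan_batch` and `read_batch` helpers, and the sample messages are hypothetical.

```python
# A Kafka-like partitioned log (hypothetical data): partition id -> messages.
log = {0: ["a", "b", "c"], 1: ["x", "y"]}

def latest_offsets(log):
    """Ask the log for the current end offset of every partition."""
    return {partition: len(messages) for partition, messages in log.items()}

def plan_batch(consumed, log):
    """Define a batch as one offset range [start, end) per partition."""
    latest = latest_offsets(log)
    return {p: (consumed.get(p, 0), latest[p]) for p in log}

def read_batch(ranges, log):
    """Read exactly the planned ranges, like reading byte ranges of a file."""
    return {p: log[p][start:end] for p, (start, end) in ranges.items()}

consumed = {}  # offsets tracked by the application itself
ranges = plan_batch(consumed, log)
first = read_batch(ranges, log)
replayed = read_batch(ranges, log)  # recomputing the batch after a failure

# Advance the tracked offsets only once the batch is done.
consumed = {p: end for p, (_, end) in ranges.items()}
log[0].append("d")
second = read_batch(plan_batch(consumed, log), log)
print(first, second)
```

Because the ranges are decided before any data is read, a batch is a deterministic function of its offsets, which is what makes replay safe.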



Other Library Additions

Amazon Kinesis integration [Spark 1.1]
More fault-tolerant Flume integration [Spark 1.1]


System Infrastructure

Automated driver fault-tolerance [Spark 1.0]
Graceful shutdown [Spark 1.0]
Write Ahead Logs for zero data loss [Spark 1.2]
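The write-ahead-log idea behind the zero-data-loss item can be sketched in plain Python: a record is made durable before it is acknowledged, so a restarted process can rebuild its buffer from the log. This is only a conceptual model of the technique, not Spark's implementation. The `WriteAheadLog` and `Receiver` classes, the JSON line format, and the temporary file location are assumptions.

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Append-only log file; one JSON record per line."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # force the record to stable storage

    def recover(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as handle:
            return [json.loads(line) for line in handle]

class Receiver:
    """Buffers records in memory, but only after logging them."""

    def __init__(self, wal):
        self.wal = wal
        self.buffer = wal.recover()  # rebuild state after a restart

    def receive(self, record):
        self.wal.append(record)      # durable first
        self.buffer.append(record)   # then visible to processing
        return "ack"                 # acknowledge to the source last

wal_path = os.path.join(tempfile.mkdtemp(), "received.log")
receiver = Receiver(WriteAheadLog(wal_path))
receiver.receive({"id": 1})
receiver.receive({"id": 2})

# Simulate a crash: drop the old receiver and start a new one on the same log.
restarted = Receiver(WriteAheadLog(wal_path))
print(restarted.buffer)
```

The ordering is the point of the technique: anything that was acknowledged is guaranteed to be in the log, at the cost of one synchronous write per record in this naive form.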


Contributors to Streaming

[Bar chart: contributors to Spark Streaming per release, Spark 0.9 through Spark 1.2; y-axis from 0 to 40]

Contributors - Full Picture

[Bar chart: contributors per release, Spark 0.9 through Spark 1.2, for Streaming and for Core + Streaming (w/o SQL, MLlib, …); y-axis from 0 to 120]

All contributions to core Spark directly improve Spark Streaming

Spark Packages

More contributions from the community in spark-packages

Alternate Kafka receiver

Apache Camel receiver

Cassandra examples

http://spark-packages.org/


Who is using Spark Streaming?

Spark Summit 2014 Survey


40% of Spark users were using Spark Streaming in production or prototyping
Another 39% were evaluating it

Not using 21%

Evaluating 39%

Prototyping 31%

Production 9%

80+ known deployments

Intel China builds big data solutions for large enterprises
Multiple streaming applications for top businesses

Real-time risk analysis for a top online payment company
Real-time deal and flow metric reporting for a top online shopping company

Complicated stream processing
SQL queries on streams
Join streams with large historical datasets

> 1 TB/day passing through Spark Streaming

Stack: YARN, Spark Streaming, Kafka, RocketMQ, HBase

One of the largest publishing and education companies, looking to accelerate its push into digital learning
Needed to combine student activities and domain events to continuously update the learning model of each student
Earlier implementation was in Storm; now moved to Spark Streaming

Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing

Stack: Spark Standalone, Spark Streaming, Kafka, Cassandra, Apache Blur

More information: http://dbricks.co/1BnFZZ8

Leading advertising automation company with an exchange platform for in-feed ads
Process clickstream data for optimizing real-time bidding for ads

Stack: Mesos+Marathon, Spark Streaming, Kinesis, MySQL, Redis, RabbitMQ, SQS

Wants to learn trending movies and shows in real time
Currently in the middle of replacing one of their internal stream processing architectures with Spark Streaming
Tested resiliency of Spark Streaming with Chaos Monkey
More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Driver failures handled with Spark Standalone cluster’s supervise mode
Worker, executor and receiver failures automatically handled

Spark Streaming can handle all kinds of failures

Neuroscience @ Freeman Lab, Janelia Farm

Spark Streaming and MLlib to analyze neural activities

Laser microscope scans Zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons!

http://www.jeremyfreeman.net/share/talks/spark-summit-2014/

Streaming machine learning algorithms on time series data of every neuron
Up to 2 TB/hour and increasing with brain size
Up to 80 HPC nodes

Why are they adopting Spark Streaming?

Easy, high-level API

Unified API across batch and streaming

Integration with Spark SQL and MLlib

Ease of operations


What’s coming next?

Libraries

Operational Ease

Performance

Roadmap

Libraries
Streaming machine learning algorithms
A/B testing
Online Latent Dirichlet Allocation (LDA)
More streaming linear algorithms

Streaming + DataFrames, Streaming + SQL

Operational Ease
Better flow control
Elastic scaling
Cross-version upgradability
Improved support for non-Hadoop environments

Performance
Higher throughput, especially of stateful operations
Lower latencies

Easy deployment of streaming apps in Databricks Cloud!


You can help!

Roadmaps are heavily driven by community feedback
We have listened to community demands over the last year

Write Ahead Logs for zero data loss
New Kafka direct API

Let us know what you want to see in Spark Streaming

Spark user mailing list, tweet it to me @tathadas

Industry adoption increasing rapidly

Community contributing very actively

More libraries, operational ease, and performance in the roadmap

@tathadas


Backup slides

Typesafe survey of Spark users

2136 developers, data scientists, and other tech professionals

http://java.dzone.com/articles/apache-spark-survey-typesafe-0


65% of Spark users are interested in Spark Streaming


2/3 of Spark users want to process event streams

More use cases

•  Big data solution provider for enterprises
•  Multiple applications for different businesses
   -  Monitoring + optimizing online services of Tier-1 bank
   -  Fraudulent transaction detection for Tier-2 bank
•  Kafka → Spark Streaming → Cassandra, MongoDB
•  Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB

•  Provides data analytics solutions for Communication Service Providers
   -  4 of 5 top mobile ops, 3 of 4 top internet backbone providers
   -  Processes >50% of all US mobile traffic
•  Multiple applications for different businesses
   -  Real-time anomaly detection in cell tower traffic
   -  Real-time call quality optimizations
•  Kafka → Spark Streaming

http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark

•  Runs claims processing applications for healthcare providers
•  Predictive models can look for claims that are likely to be held up for approval
•  Spark Streaming allows model scoring in seconds instead of hours

http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims