Page 1: Spark streaming state of the union

Spark Streaming The State of the Union and the Road Beyond

Tathagata “TD” Das @tathadas

March 18, 2015

Page 2: Spark streaming state of the union

Who am I?

Project Management Committee (PMC) member of Spark

Lead developer of Spark Streaming

Formerly in AMPLab, UC Berkeley

Software developer at Databricks

Page 3: Spark streaming state of the union

What is Spark Streaming?

Page 4: Spark streaming state of the union

Spark Streaming

Scalable, fault-tolerant stream processing system

File systems

Databases

Dashboards

Flume Kinesis

HDFS/S3

Kafka

Twitter

High-level API

joins, windows, … often 5x less code

Fault-tolerant

Exactly-once semantics, even for stateful ops

Integration

Integrate with MLlib, SQL, DataFrames, GraphX

Page 5: Spark streaming state of the union

How does it work?

Receivers receive data streams and chop them up into batches

Spark processes the batches and pushes out the results

[Figure: data streams → receivers → batches → results]
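The batching step described on this slide can be sketched in plain Python, without Spark. This is a toy illustration only, assuming records arrive as (timestamp, value) pairs and a fixed batch interval; the names `chop_into_batches` and `process` are hypothetical and are not part of the Spark API.

```python
from collections import defaultdict

def chop_into_batches(records, batch_interval):
    """Group (timestamp, value) records into fixed-width time batches.

    Toy stand-in for what receivers do; this is not Spark code.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the batches in time order as (batch_index, values) pairs.
    return sorted(batches.items())

def process(batch_values):
    """Example per-batch computation: sum the values in one batch."""
    return sum(batch_values)

# Timestamps are in seconds; the batch interval is 2 seconds.
stream = [(0.1, 5), (0.9, 1), (2.3, 7), (3.8, 2), (4.0, 4)]
results = [(index, process(values))
           for index, values in chop_into_batches(stream, 2.0)]
print(results)
```

Each batch is processed independently of the others, which is what lets the engine treat a stream as a sequence of small batch jobs.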

Page 6: Spark streaming state of the union

Streaming Word Count with Kafka

// create DStream with lines from Kafka
val kafka = KafkaUtils.createStream(ssc, kafkaParams, …)

// split lines into words
val words = kafka.map(_._2).flatMap(_.split(" "))

// count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

// print some counts on screen
wordCounts.print()

// start processing the stream
ssc.start()
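One point the word count leaves implicit is that `reduceByKey` on a DStream counts words within each batch, not across the whole stream. A minimal plain-Python sketch of that per-batch behaviour follows; the sample lines and the helper `word_count` are made up for illustration and do not use Spark.

```python
from collections import Counter

def word_count(batch_lines):
    """Count words within a single batch of text lines."""
    # Equivalent of flatMap(_.split(" ")): one flat list of words.
    words = [word for line in batch_lines for word in line.split(" ")]
    # Equivalent of map(x => (x, 1)).reduceByKey(_ + _).
    return Counter(words)

# Two hypothetical batches of lines, standing in for Kafka message values.
batches = [
    ["spark streaming", "spark core"],
    ["streaming kafka"],
]
per_batch_counts = [word_count(batch) for batch in batches]
for counts in per_batch_counts:
    print(dict(counts))
```

The second batch starts from zero: "spark" does not appear in its counts even though it was seen twice in the first batch.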

Page 7: Spark streaming state of the union

Languages

Can natively use Scala, Java, and Python

Can use any other language by using RDD.pipe()

Page 8: Spark streaming state of the union

Integrates with Spark Ecosystem

Spark Core

Spark Streaming

Spark SQL MLlib GraphX

Page 9: Spark streaming state of the union

Combine batch and streaming processing

Join data streams with static data sets

// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")

// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
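What the per-batch join does can be shown with a small plain-Python sketch. It assumes each batch is a list of (key, value) pairs and the static data set is a dictionary; the `join` helper and the sample data are hypothetical, and nothing here calls Spark.

```python
def join(batch, dataset):
    """Inner join of a batch of (key, value) pairs with a static key -> value map."""
    return [(key, (value, dataset[key]))
            for key, value in batch
            if key in dataset]

# Static data set, loaded once (hypothetical user -> country lookup).
dataset = {"alice": "US", "bob": "DE"}

# Each inner list is one batch of (user, clicks) events from the stream.
batches = [
    [("alice", 3), ("carol", 1)],
    [("bob", 2), ("alice", 5)],
]

# The same static data set is reused for every batch.
joined_batches = [join(batch, dataset) for batch in batches]
print(joined_batches)
```

Events whose key is missing from the static data set ("carol" here) are dropped, as in an inner join.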

Page 10: Spark streaming state of the union

Combine machine learning with streaming

Learn models offline, apply them online

// Learn model offline
val model = KMeans.train(dataset, ...)

// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}
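The "apply online" half of this pattern is a cheap lookup per event. As a rough sketch in plain Python, assuming the offline model is just a list of cluster centers, prediction means finding the nearest center; the centers, events and `predict` helper are invented for illustration and are not MLlib code.

```python
def predict(centers, point):
    """Return the index of the center closest to the point (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)),
               key=lambda i: squared_distance(centers[i], point))

# Hypothetical centers, standing in for a model trained offline.
centers = [(0.0, 0.0), (10.0, 10.0)]

# Events arriving on the stream; each one is scored as it arrives.
events = [(1.0, 0.5), (9.0, 11.0), (4.0, 4.0)]
predictions = [predict(centers, event) for event in events]
print(predictions)
```

The model itself never changes while the stream is being scored, which is the difference from the streaming algorithms that update the model on every batch.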

Page 11: Spark streaming state of the union

Combine SQL with streaming

Interactively query streaming data with SQL

// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.registerTempTable("latestEvents")
}

// Interactively query table
sqlContext.sql("select * from latestEvents")
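The effect of re-registering the table on every batch can be imitated with Python's built-in `sqlite3` module. This is an analogy only: an in-memory SQLite table stands in for the Spark SQL temporary table, and the column names and sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def register_latest_batch(conn, batch):
    """Replace the latestEvents table with the rows of the newest batch."""
    conn.execute("DROP TABLE IF EXISTS latestEvents")
    conn.execute("CREATE TABLE latestEvents (name TEXT, clicks INTEGER)")
    conn.executemany("INSERT INTO latestEvents VALUES (?, ?)", batch)

# Two hypothetical batches arriving one after the other.
batches = [
    [("alice", 3), ("bob", 1)],
    [("carol", 7)],
]
for batch in batches:
    register_latest_batch(conn, batch)

# An interactive query sees only the most recent batch.
rows = conn.execute("select * from latestEvents").fetchall()
print(rows)
```

Because the table name is reused, a query issued at any moment returns whichever batch was registered last.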

Page 12: Spark streaming state of the union

A Brief History

Late 2011 – research idea, AMPLab, UC Berkeley

"We need to make Spark faster"

"Okay...umm, how??!?!"

Page 14: Spark streaming state of the union

A Brief History

Late 2011 – idea: AMPLab, UC Berkeley

Q2 2012 – prototype: rewrote large parts of Spark core; smallest job went from 900 ms → <50 ms

Q3 2012 – Spark core improvements open sourced in Spark 0.6

Feb 2013 – Alpha release: 7.7k lines, merged in 7 days; released with Spark 0.7

Jan 2014 – Stable release: graduation with Spark 0.9

Page 15: Spark streaming state of the union

Current state of Spark Streaming

Page 16: Spark streaming state of the union

Adoption

Roadmap

Development

Page 17: Spark streaming state of the union

What have we added in the last year?

Page 18: Spark streaming state of the union

Python API

Core functionality in Spark 1.2, with sockets and files as sources

Kafka support in Spark 1.3

Other sources coming in future

kafka = KafkaUtils.createStream(ssc, params, …)
lines = kafka.map(lambda x: x[1])
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

Page 19: Spark streaming state of the union

Streaming MLlib algorithms

Continuous learning and prediction on streaming data

StreamingLinearRegression in Spark 1.1

StreamingKMeans in Spark 1.2

StreamingLogisticRegression in Spark 1.3

https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html

val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(4, 0.0)

// Apply model to DStreams
model.trainOn(trainingDStream)
model.predictOnValues(
  testDStream.map { lp =>
    (lp.label, lp.features)
  }
).print()
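The role of the decay factor can be illustrated with a small plain-Python sketch of a single center update. It follows the update rule as described in the MLlib documentation, where the old center is weighted by its point count times the decay factor and the new batch by its own point count; Spark's actual implementation may differ in details, and the function name `update_center` is hypothetical.

```python
def update_center(center, weight, batch_points, decay_factor):
    """Apply one streaming k-means update to a single cluster center.

    center: current center (list of floats)
    weight: number of points assigned to this cluster so far
    batch_points: points from the new batch assigned to this cluster
    decay_factor: 1.0 keeps all history, 0.0 uses only the newest batch
    """
    m = len(batch_points)
    if m == 0:
        # No new points for this cluster: leave it unchanged in this sketch.
        return center, weight
    dims = len(center)
    batch_mean = [sum(p[d] for p in batch_points) / m for d in range(dims)]
    discounted = weight * decay_factor
    new_center = [
        (center[d] * discounted + batch_mean[d] * m) / (discounted + m)
        for d in range(dims)
    ]
    return new_center, weight + m

# With decay 1.0 the center is the running mean of everything seen so far.
keep_all, keep_all_weight = update_center([0.0, 0.0], 2, [[3.0, 3.0]], 1.0)
# With decay 0.0 the history is forgotten and the center jumps to the batch mean.
forget_all, _ = update_center([0.0, 0.0], 2, [[3.0, 3.0]], 0.0)
print(keep_all, forget_all)
```

A value such as `setDecayFactor(1.0)` on the slide therefore corresponds to weighting old and new data equally.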

Page 20: Spark streaming state of the union

Kafka `Direct` Stream API

Earlier Receiver-based approach for Kafka

Requires replicated journals (write ahead logs) to ensure zero data loss under driver failures

http://spark.apache.org/docs/latest/streaming-kafka-integration.html

Kafka Receiver high-level consumer
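The idea behind a write ahead log is that received data is written to durable storage before it is acknowledged, so it can be replayed after a failure. Below is a minimal single-machine sketch in plain Python, assuming a local file stands in for the replicated storage the slide refers to; the `WriteAheadLog` class and the record layout are hypothetical and are not Spark's implementation.

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Append-only log: a record is on disk before it is acknowledged."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the write to stable storage
        return True  # only now is it safe to acknowledge the source

    def replay(self):
        """Return all logged records, for example after a driver restart."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as log:
            return [json.loads(line) for line in log]

log_path = os.path.join(tempfile.mkdtemp(), "received.wal")
wal = WriteAheadLog(log_path)
for record in [{"offset": 0, "value": "a"}, {"offset": 1, "value": "b"}]:
    wal.append(record)

# Simulate a failure: drop the in-memory object, then recover from the log.
recovered = WriteAheadLog(log_path).replay()
print(recovered)
```

The cost of this design is that every record is written twice, once by the source system and once to the log.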

Page 21: Spark streaming state of the union

Kafka `Direct` Stream API

Earlier Receiver-based approach for Kafka

New direct approach for Kafka in Spark 1.3

Kafka Receiver high-level consumer

simple consumer API to read Kafka topics

Page 22: Spark streaming state of the union

Kafka `Direct` Stream API

New direct approach for Kafka in 1.3 – treat Kafka like a file system

No receivers!!!

Directly query Kafka for latest topic offsets, and read data like reading files

Instead of Zookeeper, Spark Streaming keeps track of Kafka offsets

More efficient, fault-tolerant, exactly-once receiving of Kafka data
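The "treat Kafka like a file system" idea can be sketched in plain Python: each batch is defined by a fixed offset range per partition, so re-reading a range after a failure returns exactly the same records. Python lists stand in for Kafka partitions here, and `plan_batch` and `read_range` are hypothetical helpers, not part of the Spark or Kafka APIs.

```python
def plan_batch(last_offsets, latest_offsets):
    """Compute per-partition [from, until) offset ranges for the next batch."""
    return {
        partition: (last_offsets.get(partition, 0), latest)
        for partition, latest in latest_offsets.items()
    }

def read_range(log, offset_range):
    """Read a fixed slice of a partition log, like reading a range of a file."""
    start, until = offset_range
    return log[start:until]

# Toy "Kafka": one list of messages per partition (hypothetical data).
kafka = {0: ["a", "b", "c", "d"], 1: ["x", "y"]}

# Offsets already processed, tracked by the application rather than Zookeeper.
tracked_offsets = {0: 1, 1: 0}
# Stand-in for querying Kafka for the latest offset of each partition.
latest = {p: len(log) for p, log in kafka.items()}

ranges = plan_batch(tracked_offsets, latest)
batch = {p: read_range(kafka[p], r) for p, r in ranges.items()}

# Re-reading the same ranges after a failure yields exactly the same data.
replayed = {p: read_range(kafka[p], r) for p, r in ranges.items()}

# Advance the tracked offsets once the batch has been processed.
tracked_offsets = {p: until for p, (_, until) in ranges.items()}
print(ranges, batch)
```

Because a batch is fully described by its offset ranges, no receiver has to buffer data and no separate log of received records is needed.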

Page 23: Spark streaming state of the union

Other Library Additions

Amazon Kinesis integration [Spark 1.1]

More fault-tolerant Flume integration [Spark 1.1]

Page 24: Spark streaming state of the union

System Infrastructure

Automated driver fault-tolerance [Spark 1.0]

Graceful shutdown [Spark 1.0]

Write Ahead Logs for zero data loss [Spark 1.2]

Page 25: Spark streaming state of the union

Contributors to Streaming

[Chart: number of contributors to Spark Streaming per release (Spark 0.9, Spark 1.0, Spark 1.1, Spark 1.2); y-axis from 0 to 40]

Page 26: Spark streaming state of the union

Contributors - Full Picture

[Chart: number of contributors per release (Spark 0.9, Spark 1.0, Spark 1.1, Spark 1.2), comparing Streaming with Core + Streaming (w/o SQL, MLlib, …); y-axis from 0 to 120]

All contributions to core Spark directly improve Spark Streaming

Page 27: Spark streaming state of the union

Spark Packages

More contributions from the community in spark-packages

Alternate Kafka receiver

Apache Camel receiver

Cassandra examples

http://spark-packages.org/

Page 28: Spark streaming state of the union

Who is using Spark Streaming?

Page 29: Spark streaming state of the union

Spark Summit 2014 Survey

40% of Spark users were using Spark Streaming in production or prototyping

Another 39% were evaluating it

Not using 21%

Evaluating 39%

Prototyping 31%

Production 9%

Page 30: Spark streaming state of the union

Page 31: Spark streaming state of the union

80+ known deployments

Page 32: Spark streaming state of the union

Intel China builds big data solutions for large enterprises

Multiple streaming applications for top businesses
-  Real-time risk analysis for a top online payment company
-  Real-time deal and flow metric reporting for a top online shopping company

Page 33: Spark streaming state of the union

Complicated stream processing

SQL queries on streams

Join streams with large historical datasets

> 1TB/day passing through Spark Streaming

YARN

Spark Streaming

Kafka

RocketMQ HBase

Page 34: Spark streaming state of the union

One of the largest publishing and education companies; wants to accelerate its push into digital learning

Needed to combine student activities and domain events to continuously update the learning model of each student

Earlier implementation was in Storm, but has now moved to Spark Streaming

Page 35: Spark streaming state of the union

Spark Standalone

Spark Streaming Kafka

Cassandra

Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing

Apache Blur

More information: http://dbricks.co/1BnFZZ8

Page 36: Spark streaming state of the union

Leading advertising automation company with an exchange platform for in-feed ads

Process clickstream data for optimizing real-time bidding for ads

Mesos+Marathon

Spark Streaming

Kinesis MySQL Redis

RabbitMQ SQS

Page 37: Spark streaming state of the union

Wants to learn trending movies and shows in real time

Currently in the middle of replacing one of their internal stream processing architectures with Spark Streaming

Tested resiliency of Spark Streaming with Chaos Monkey

More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Page 38: Spark streaming state of the union

Driver failures handled with Spark Standalone cluster’s supervise mode

Worker, executor and receiver failures automatically handled

Spark Streaming can handle all kinds of failures

More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Page 39: Spark streaming state of the union

Neuroscience @ Freeman Lab, Janelia Farm

Spark Streaming and MLlib to analyze neural activities

Laser microscope scans Zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons!

http://www.jeremyfreeman.net/share/talks/spark-summit-2014/

Page 40: Spark streaming state of the union

Neuroscience @ Freeman Lab, Janelia Farm

Streaming machine learning algorithms on time series data of every neuron

Up to 2TB/hour and increasing with brain size

Up to 80 HPC nodes

http://www.jeremyfreeman.net/share/talks/spark-summit-2014/

Page 41: Spark streaming state of the union

Why are they adopting Spark Streaming?

Easy, high-level API

Unified API across batch and streaming

Integration with Spark SQL and MLlib

Ease of operations

Page 42: Spark streaming state of the union

What’s coming next?

Page 43: Spark streaming state of the union

Libraries

Operational Ease

Performance

Page 44: Spark streaming state of the union

Roadmap

Libraries

Streaming machine learning algorithms
-  A/B testing
-  Online Latent Dirichlet Allocation (LDA)
-  More streaming linear algorithms

Streaming + DataFrames, Streaming + SQL

Page 45: Spark streaming state of the union

Roadmap

Operational Ease
-  Better flow control
-  Elastic scaling
-  Cross-version upgradability
-  Improved support for non-Hadoop environments

Page 46: Spark streaming state of the union

Roadmap

Performance
-  Higher throughput, especially of stateful operations
-  Lower latencies

Easy deployment of streaming apps in Databricks Cloud!

Page 47: Spark streaming state of the union

You can help!

Roadmaps are heavily driven by community feedback

We have listened to community demands over the last year
-  Write Ahead Logs for zero data loss
-  New Kafka direct API

Let us know what you want to see in Spark Streaming
-  Spark user mailing list, or tweet it to me @tathadas

Page 48: Spark streaming state of the union

Industry adoption increasing rapidly

Community contributing very actively

More libraries, operational ease and performance in the roadmap

@tathadas

Page 49: Spark streaming state of the union

Backup slides

Page 50: Spark streaming state of the union

Typesafe survey of Spark users

2136 developers, data scientists, and other tech professionals

http://java.dzone.com/articles/apache-spark-survey-typesafe-0

Page 51: Spark streaming state of the union

Typesafe survey of Spark users

65% of Spark users are interested in Spark Streaming

Page 52: Spark streaming state of the union

Typesafe survey of Spark users

2/3 of Spark users want to process event streams

Page 53: Spark streaming state of the union

More usecases

Page 54: Spark streaming state of the union

•  Big data solution provider for enterprises
•  Multiple applications for different businesses
   -  Monitoring + optimizing online services of Tier-1 bank
   -  Fraudulent transaction detection for Tier-2 bank
•  Kafka → SS → Cassandra, MongoDB
•  Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB

Page 55: Spark streaming state of the union

•  Provides data analytics solutions for Communication Service Providers
   -  4 of 5 top mobile ops, 3 of 4 top internet backbone providers
   -  Processes >50% of all US mobile traffic
•  Multiple applications for different businesses
   -  Real-time anomaly detection in cell tower traffic
   -  Real-time call quality optimizations
•  Kafka → SS

http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark

Page 56: Spark streaming state of the union

•  Runs claims processing applications for healthcare providers

http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims

•  Predictive models can look for claims that are likely to be held up for approval

•  Spark Streaming allows model scoring in seconds instead of hours