
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17


DESCRIPTION

Slides from Tathagata Das's talk "Deep Dive with Spark Streaming" at the Spark Meetup on June 17, 2013, at Plug and Play in Sunnyvale, California. Tathagata Das is the lead developer on Spark Streaming and a computer science PhD student in the UC Berkeley AMPLab.


Page 1: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Deep dive into Spark Streaming

Tathagata Das (TD)

Matei Zaharia, Haoyuan Li, Timothy Hunter, Patrick Wendell, and many others

UC BERKELEY

Page 2: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

What is Spark Streaming? Extends Spark for large-scale stream processing

Scales to 100s of nodes and achieves second-scale latencies

Efficient and fault-tolerant stateful stream processing

Integrates with Spark’s batch and interactive processing

Provides a simple batch-like API for implementing complex algorithms

Page 3: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Chop up the live stream into batches of X seconds

Spark treats each batch of data as RDDs and processes them using RDD operations

Finally, the processed results of the RDD operations are returned in batches

Page 4: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs


Batch sizes as low as ½ second, latency ~ 1 second

Potential for combining batch processing and streaming processing in the same system


Page 5: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

DStream: a sequence of distributed datasets (RDDs) representing a distributed stream of data

[Diagram: the Twitter Streaming API feeds the tweets DStream; each batch (@ t, t+1, t+2) is stored in memory as an RDD (immutable, distributed dataset)]

Page 6: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap (status => getTags(status))

transformation: modify the data in one DStream to create another DStream

[Diagram: flatMap applied to each RDD of the tweets DStream creates a new RDD of the hashTags DStream ([#cat, #dog, …]) for every batch]

Page 7: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Example 1 – Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap (status => getTags(status))

hashTags.saveAsHadoopFiles("hdfs://...")

output operation: to push data to external storage

[Diagram: for every batch, flatMap produces the hashTags DStream from the tweets DStream, and a save job writes each batch to HDFS]
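Putting the three lines of Example 1 together, a minimal end-to-end sketch might look like the following. This assumes the 2013-era Spark Streaming Scala API (package spark.streaming, with ssc.twitterStream returning a DStream of twitter4j.Status); the getTags helper, the HDFS path, and the use of saveAsTextFiles (a close stand-in for the slide's saveAsHadoopFiles on a plain string DStream) are illustrative assumptions, not the speaker's code.

import spark.streaming.{Seconds, StreamingContext}
import spark.streaming.StreamingContext._   // implicit conversions for DStream operations
import twitter4j.Status

object HashtagCollector {
  // Hypothetical helper: pull "#..." tokens out of the tweet text
  def getTags(status: Status): Seq[String] =
    status.getText.split(" ").filter(_.startsWith("#"))

  def main(args: Array[String]) {
    val Array(master, twitterUser, twitterPassword) = args

    // 1-second batches
    val ssc = new StreamingContext(master, "HashtagCollector", Seconds(1))

    val tweets   = ssc.twitterStream(twitterUser, twitterPassword)   // DStream[Status]
    val hashTags = tweets.flatMap(status => getTags(status))         // DStream[String]

    hashTags.saveAsTextFiles("hdfs://namenode:8020/hashtags", "txt") // one directory per batch

    ssc.start()   // nothing runs until the context is started
  }
}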

Page 8: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

DStream of data

Example 2 – Count the hashtags over last 1 min

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap (status => getTags(status))

val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

sliding window operation: window length = Minutes(1), sliding interval = Seconds(1)

Page 9: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Example 2 – Count the hashtags over last 1 min

val tagCounts = hashTags.window(Minutes(1), Seconds(1)).countByValue()

[Diagram: a sliding window over the hashTags DStream (batches t-1 … t+3); countByValue counts over all the data in the window to produce the tagCounts DStream]
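A hedged usage sketch of Example 2, assuming the hashTags DStream from the Example 1 sketch above and the same era of the API; the foreach call (an output operation listed later in the deck) is only an illustrative way to look at each result batch.

import spark.streaming.{Minutes, Seconds}

// Every second, count hashtag occurrences over the most recent minute of data
val tagCounts = hashTags
  .window(Minutes(1), Seconds(1))   // window length = 1 min, sliding interval = 1 s
  .countByValue()                   // DStream[(String, Long)]

// Print a small sample of each windowed count (illustrative output operation)
tagCounts.foreach(rdd => println(rdd.take(10).mkString(", ")))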

Page 10: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Key concepts

DStream – sequence of RDDs representing a stream of data
- Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets

Transformations – modify data from one DStream to another
- Standard RDD operations – map, countByValue, reduceByKey, join, …
- Stateful operations – window, countByValueAndWindow, …

Output Operations – send data to an external entity
- saveAsHadoopFiles – saves to HDFS
- foreach – do anything with each batch of results

Page 11: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Arbitrary Stateful Computations

Maintain arbitrary state, track sessions
- Maintain per-user mood as state, and update it with their tweets

moods = tweets.updateStateByKey(tweet => updateMood(tweet))

updateMood(newTweets, lastMood) => newMood

[Diagram: the moods DStream is updated from the tweets DStream at each batch (t-1 … t+3)]
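A hedged sketch of the mood example. updateStateByKey operates on key-value DStreams, so tweets are first keyed by user; Mood, the scoring rule, and the (Seq[V], Option[S]) => Option[S] update signature reflect assumptions about the era's API rather than the exact code on the slide, and stateful operators also require a checkpoint directory.

import twitter4j.Status

case class Mood(score: Double)

// Combine a user's new tweets in this batch with their previous mood
def updateMood(newTweets: Seq[Status], lastMood: Option[Mood]): Option[Mood] = {
  val delta = newTweets.map(t => if (t.getText.contains(":)")) 1.0 else -1.0).sum
  Some(Mood(lastMood.getOrElse(Mood(0.0)).score + delta))
}

// assumes import spark.streaming.StreamingContext._ for pair-DStream operations
ssc.checkpoint("hdfs://namenode:8020/checkpoints")      // required for stateful operators

val tweetsByUser = tweets.map(status => (status.getUser.getId, status))
val moods = tweetsByUser.updateStateByKey(updateMood _) // DStream[(Long, Mood)]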

Page 12: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Combine Batch and Stream Processing

Do arbitrary Spark RDD computation within a DStream

- Join incoming tweets with a spam file to filter out bad tweets

tweets.transform(tweetsRDD => {

tweetsRDD.join(spamHDFSFile).filter(...)

})
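A hedged expansion of the snippet above: RDD joins need key-value pairs, so both the tweet stream and the spam list are keyed by screen name here. The HDFS path, the keying scheme, and the use of leftOuterJoin to drop flagged users are illustrative assumptions, and sc stands for an available SparkContext.

import spark.SparkContext._   // pair-RDD operations (leftOuterJoin)

// Spam list loaded once as a batch RDD: one flagged screen name per line
val spamHDFSFile = sc.textFile("hdfs://namenode:8020/spam/users.txt")
                     .map(user => (user, true))

val cleanTweets = tweets
  .map(status => (status.getUser.getScreenName, status))
  .transform(tweetsRDD =>
    tweetsRDD.leftOuterJoin(spamHDFSFile)                    // (user, (status, Option[flag]))
             .filter { case (_, (_, flag)) => flag.isEmpty } // keep users not in the spam list
             .map { case (_, (status, _)) => status })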

Page 13: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Fault-tolerance

RDDs remember the operations that created them

Batches of input data are replicated in memory for fault-tolerance

Data lost due to worker failure can be recomputed from the replicated input data

Therefore, all transformed data is fault-tolerant

[Diagram: tweetsRDD (input data replicated in memory) → flatMap → hashTagsRDD; lost partitions are recomputed on other workers]

Page 14: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Agenda: Overview, DStream Abstraction, System Model, Persistence / Caching, RDD Checkpointing, Performance Tuning

Page 15: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Discretized Stream (DStream)

A sequence of RDDs representing a stream of data

What does it take to define a DStream?

Page 16: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

DStream Interface

The DStream interface primarily defines how to generate an RDD in each batch interval

- List of dependent (parent) DStreams
- Slide interval: the interval at which it will compute RDDs
- Function to compute the RDD at a time t
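The three pieces above can be captured in a small sketch; this is a simplified stand-in rather than the actual DStream class, and the spark.RDD and spark.streaming.{Duration, Time} imports assume the 2013-era package layout.

import spark.RDD
import spark.streaming.{Duration, Time}

// Simplified stand-in for the real DStream abstract class
abstract class DStreamSketch[T] {
  def dependencies: List[DStreamSketch[_]]       // parent DStreams
  def slideDuration: Duration                    // interval at which RDDs are computed
  def compute(validTime: Time): Option[RDD[T]]   // RDD for the batch ending at time validTime
}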

Page 17: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Example: Mapped DStream

Dependencies: Single parent DStream

Slide Interval: Same as the parent DStream

Compute function for time t: Create new RDD by applying map function on parent DStream’s RDD of time t
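Continuing the sketch above, a mapped DStream might then look like this (simplified; the real MappedDStream reuses the parent's already generated and cached RDDs rather than calling compute directly).

class MappedDStreamSketch[T, U: ClassManifest](parent: DStreamSketch[T], mapFunc: T => U)
  extends DStreamSketch[U] {
  override def dependencies  = List(parent)                  // single parent DStream
  override def slideDuration = parent.slideDuration          // same as the parent
  override def compute(validTime: Time): Option[RDD[U]] =
    parent.compute(validTime).map(rdd => rdd.map(mapFunc))   // map over the parent's RDD at time t
}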

Page 18: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Example: Windowed DStream

Window operation: gathers together data over a sliding window

Dependencies: Single parent DStream

Slide Interval: The window's sliding interval

Compute function for time t: Apply union over all the RDDs of the parent DStream between times t and (t – window length)

[Diagram: the parent DStream with the window length and sliding interval marked]

Page 19: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Example: Network Input DStream

Base class of all input DStreams that receive data from the network

Dependencies: None

Slide Interval: Batch duration in streaming context

Compute function for time t: Create a BlockRDD with all the blocks of data received in the last batch interval

Associated with a Network Receiver object

Page 20: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Network Receiver

Responsible for receiving data and pushing it into Spark’s data management layer (Block Manager)

Base class for all receivers - Kafka, Flume, etc.

Simple Interface: What to do on starting the receiver

- Helper object blockGenerator to push data into Spark

What to do on stopping the receiver

Page 21: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Example: Socket Receiver

On start:
- Connect to the remote TCP server
- While the socket is connected: receive bytes, deserialize them into Java objects, and add the objects to blockGenerator

On stop:
- Disconnect the socket
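A conceptual sketch of that receive loop. The NetworkReceiver base class and blockGenerator helper named on the previous slide are replaced here by a plain class and a local buffer stand-in so the snippet is self-contained; the object-stream deserialization is likewise an illustrative assumption.

import java.io.ObjectInputStream
import java.net.Socket
import scala.collection.mutable.ArrayBuffer

class SimpleSocketReceiver(host: String, port: Int) {
  // Stand-in for the real blockGenerator, which would push data into Spark's Block Manager
  val blockGenerator = new ArrayBuffer[AnyRef]()
  private var socket: Socket = _

  def onStart() {
    socket = new Socket(host, port)                 // connect to the remote TCP server
    val in = new ObjectInputStream(socket.getInputStream)
    while (socket.isConnected) {
      val obj = in.readObject()                     // receive bytes and deserialize into objects
      blockGenerator += obj                         // hand each object to Spark
    }
  }

  def onStop() {
    if (socket != null) socket.close()              // disconnect the socket
  }
}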

Page 22: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

DStream Graph

Spark Streaming program:

t = ssc.twitterStream(“…”).map(…)
t.foreach(…)

t1 = ssc.twitterStream(“…”)
t2 = ssc.twitterStream(“…”)
t = t1.union(t2).map(…)
t.saveAsHadoopFiles(…)
t.map(…).foreach(…)
t.filter(…).foreach(…)

[Diagram: the corresponding DStream graphs, with nodes labeled T (Twitter Input DStream), M (Mapped DStream), and FE (Foreach DStream – a dummy DStream signifying an output operation), plus union and filter nodes for the second program]

Page 23: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

DStream Graph → RDD Graphs → Spark jobs

Every interval, an RDD graph is computed from the DStream graph
For each output operation, a Spark action is created
For each action, a Spark job is created to compute it

[Diagram: the DStream graph maps to an RDD graph whose inputs are Block RDDs holding the data received in the last batch interval; each output operation becomes a Spark action, yielding 3 Spark jobs in this example]

Page 24: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Agenda: Overview, DStream Abstraction, System Model, Persistence / Caching, RDD Checkpointing, Performance Tuning

Page 25: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Components

Network Input Tracker – Keeps track of the data received by each network receiver and maps it to the corresponding input DStreams

Job Scheduler – Periodically queries the DStream graph to generate Spark jobs from received data, and hands them to Job Manager for execution

Job Manager – Maintains a job queue and executes the jobs in Spark

ssc = new StreamingContext

t = ssc.twitterStream(“…”)

t.filter(…).foreach(…)

[Diagram: your program defines the DStream graph; the driver side comprises the Spark Context, Network Input Tracker, Job Manager, Job Scheduler, Spark Client, RDD graph, Scheduler, Block manager, and Shuffle tracker; Spark Workers run Block managers and task threads under a Cluster Manager]

Page 26: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Execution Model – Receiving Data

[Diagram: on StreamingContext.start(), the driver's Network Input Tracker launches network receivers as tasks on the Spark workers; received data is pushed as blocks into the workers' Block Managers and replicated, and the BlockManagerMaster notifies the tracker of the received block IDs and their locations]

Page 27: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Execution Model – Job Scheduling

[Diagram: on the driver, the Job Scheduler uses the DStream graph and the received block IDs from the Network Input Tracker to generate jobs (as RDDs); the Job Manager keeps them in a job queue and hands them to Spark's schedulers, and the jobs are executed on the worker nodes against the blocks in their Block Managers]

Page 28: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Job Scheduling

Each output operation used generates a job
- More jobs → more time taken to process batches → higher batch duration

Job Manager decides how many concurrent Spark jobs to run
- Default is 1; can be set using the Java property spark.streaming.concurrentJobs
- If you have multiple output operations, you can try increasing this property to reduce batch processing times and so reduce the batch duration (example below)
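A minimal illustration of setting that property, 2013-era style via a Java system property before the StreamingContext is created; the value 2 and the application name are just examples.

// Allow two streaming jobs to run concurrently (default is 1)
System.setProperty("spark.streaming.concurrentJobs", "2")
val ssc = new StreamingContext(master, "MultiOutputApp", Seconds(1))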

Page 29: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Agenda: Overview, DStream Abstraction, System Model, Persistence / Caching, RDD Checkpointing, Performance Tuning

Page 30: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

DStream Persistence

If a DStream is set to persist at a storage level, then all RDDs generated by it are set to the same storage level

When to persist?
- If there are multiple transformations / actions on a DStream
- If the RDDs in a DStream are going to be used multiple times

Window-based DStreams are automatically persisted in memory
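A hedged usage example under the assumptions of the earlier sketches (the tagCounts DStream from Example 2 and the 2013-era spark.storage.StorageLevel import); persisting avoids recomputing each windowed RDD for both downstream uses.

import spark.storage.StorageLevel

// tagCounts feeds two downstream computations, so cache its RDDs
tagCounts.persist(StorageLevel.MEMORY_ONLY_SER)

val popular = tagCounts.filter { case (_, count) => count > 100 }   // heavily used tags
tagCounts.saveAsTextFiles("hdfs://namenode:8020/tagcounts", "txt")  // and also archived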

Page 31: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

DStream Persistence

Default storage level of DStreams is StorageLevel.MEMORY_ONLY_SER (i.e., in memory as serialized bytes)
- Except for input DStreams, which have StorageLevel.MEMORY_AND_DISK_SER_2
- Note the difference from an RDD's default level (no serialization)
- Serialization reduces random pauses due to GC, providing more consistent job processing times

Page 32: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Agenda: Overview, DStream Abstraction, System Model, Persistence / Caching, RDD Checkpointing, Performance Tuning

Page 33: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

What is RDD checkpointing?

Saving an RDD to HDFS to prevent the RDD graph from growing too large
- Done internally in Spark, transparent to the user program
- Done lazily, saved to HDFS the first time it is computed

red_rdd.checkpoint()

[Diagram: the contents of red_rdd are saved to an HDFS file, transparently to all child RDDs]

Page 34: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Why is RDD checkpointing necessary?

Stateful DStream operators can have infinite lineages

Large lineages lead to …
- Large closure of the RDD object → large task sizes → high task launch times
- High recovery times under failure

[Diagram: a growing lineage of state RDDs, each depending on the previous state and the new data at times t-1 … t+3]

Page 35: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Why is RDD checkpointing necessary?

Stateful DStream operators can have infinite lineages

Periodic RDD checkpointing solves this

Useful for iterative Spark programs as well

[Diagram: the same state lineage with periodic checkpoints of the state RDDs to HDFS, truncating the lineage]

Page 36: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

RDD Checkpointing

Periodicity of checkpointing determines a tradeoff
- Checkpoint too frequently: HDFS writing will slow things down
- Checkpoint too infrequently: task launch times may increase
- The default setting checkpoints at most once in 10 seconds
- Try to checkpoint once in about 10 batches (example below)
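A hedged example of wiring this up, assuming the ssc context and tagCounts DStream from the earlier sketches; the directory and interval are illustrative.

// Checkpoint directory for the streaming context (required for stateful / windowed DStreams)
ssc.checkpoint("hdfs://namenode:8020/checkpoints")

// With a 1-second batch duration, checkpoint tagCounts roughly once every 10 batches
tagCounts.checkpoint(Seconds(10))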

Page 37: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Agenda: Overview, DStream Abstraction, System Model, Persistence / Caching, RDD Checkpointing, Performance Tuning

Page 38: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Performance Tuning

Step 1: Achieve a stable configuration that can sustain the streaming workload

Step 2: Optimize for lower latency

Page 39: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Step 1: Achieving Stable Configuration

How to identify whether a configuration is stable?

Look for the following messages in the log

Total delay: 0.01500 s for job 12 of time 1371512674000 …

If the total delay is continuously increasing, the configuration is unstable: the system is unable to process data as fast as it is receiving it!

If the total delay stays roughly constant and around 2x the configured batch duration, the configuration is stable

Page 40: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Step 1: Achieving Stable Configuration

How to figure out a good stable configuration?

Start with a low data rate, small number of nodes, reasonably large batch duration (5 – 10 seconds)

Increase the data rate, number of nodes, etc.

Find the bottleneck in the job processing
- Jobs are divided into stages
- Find which stage is taking the most time

Page 41: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Step 1: Achieving Stable Configuration

How to figure out a good stable configuration?

If the first map stage on the raw data is taking the most time, then try …
- Enabling delayed scheduling by setting the property spark.locality.wait
- Splitting your data source into multiple sub-streams
- Repartitioning the raw data into many partitions as a first step

If any of the subsequent stages are taking a lot of time, try …
- Increasing the level of parallelism (i.e., increase the number of reducers)
- Adding more processors to the system

Page 42: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Step 2: Optimize for Lower Latency

Reduce batch size and find a stable configuration again
- Increase levels of parallelism, etc.

Optimize serialization overheads (see the snippet after this list)
- Consider using Kryo serialization instead of the default Java serialization for both data and tasks
- For data, set the property spark.serializer=spark.KryoSerializer
- For tasks, set spark.closure.serializer=spark.KryoSerializer

Use Spark stand-alone mode rather than Mesos
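A hedged example of applying both serialization settings, using the 2013-era property names listed above, set before the StreamingContext is created; the application name is illustrative.

// Use Kryo for data and for task closures (property names as on the slide)
System.setProperty("spark.serializer", "spark.KryoSerializer")
System.setProperty("spark.closure.serializer", "spark.KryoSerializer")
val ssc = new StreamingContext(master, "LowLatencyApp", Seconds(1))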

Page 43: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Step 2: Optimize for Lower Latency

Using the concurrent mark-sweep GC (-XX:+UseConcMarkSweepGC) is recommended
- Reduces throughput a little, but also reduces large GC pauses and may allow lower batch sizes by making processing time more consistent

Try disabling serialization in DStream/RDD persistence levels
- Increases memory consumption and the randomness of GC-related pauses, but may reduce latency by further reducing serialization overheads

For a full list of guidelines for performance tuning, see:
- Spark Tuning Guide
- Spark Streaming Tuning Guide

Page 44: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Small code base

5000 LOC for the Scala API (+ 1500 LOC for the Java API)

- Most DStream code mirrors the RDD code

Easy to take a look and contribute

Page 45: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

Future (Possible) Directions

Better master fault-tolerance

Better performance for complex queries
- Better performance for stateful processing is low-hanging fruit

Dashboard for Spark Streaming
- Continuous graphs of processing times, end-to-end latencies
- Drill-down for analyzing processing times of stages to find bottlenecks

Python API for Spark Streaming

Page 46: Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17

CONTRIBUTIONS ARE WELCOME