Computer Science 425: Distributed Systems
CS 425 / ECE 428, Fall 2013
Lecture 21: Stream Processing
Hilfi Alkaff
November 5, 2013
2013, H. Alkaff, I. Gupta

Motivation

• Many important applications must process large streams of live data and provide results in near-real-time

– Social network trends

– Website statistics

– Intrusion detection systems

– etc.

• Require large clusters to handle workloads

• Require latencies of a few seconds

Traditional Solutions (Map-Reduce)

• No partial answers

– Have to wait for the entire batch to finish

• Hard to configure when we want to add more nodes.

[Diagram: Mapper 1, Mapper 2, Reducer 1, Reducer 2, and a New Mapper being added]

The Need for a New Framework

• Scalable

• Second-scale

• Easy to program

• “Just” works

– Adding/removing nodes should be transparent

Enter Storm

• Open-sourced at http://storm-project.net/

• Most watched JVM project

• APIs for multiple languages

– Python, Ruby, Fancy

• Used by over 30 companies, including

– Twitter: for personalization, search

– Flipboard: for generating custom feeds

Storm’s Terminology

• Tuples

• Streams

• Spouts

• Bolts

• Topologies

Tuples

• Ordered list of elements

– E.g. [“Illinois”, “[email protected]”]

Streams

Unbounded sequence of tuples

[Diagram: a stream drawn as a sequence of tuples]

Spouts

Source of streams (e.g., web logs)

[Diagram: a spout emitting streams of tuples]

Bolts

Process tuples and create new streams

[Diagram: a bolt consuming a stream of tuples and emitting new streams of tuples]

Bolts (Continued)

• Operations that can be performed

– Filter

– Joins

– Apply/transform functions

– Access DBs

– Etc

• Can specify the amount of parallelism in each bolt

– How to decide which of the bolt’s tasks to route a subset of the stream to?

Stream Grouping in Bolts

• Shuffle Grouping

– Tuples are randomly distributed across the bolt’s tasks

• Fields Grouping

– Lets you group a stream by a subset of its fields (Example?)

• All Grouping

– Stream is replicated across all the bolt’s tasks (Example?)
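
The three groupings can be pictured as functions that map a tuple to the task indexes that should receive it. The following is a minimal plain-Python sketch, not Storm's API: the function names, the `field_indexes` parameter, and the use of a CRC32 hash are all assumptions made for illustration. A word count is one plausible answer to the fields-grouping "Example?" prompt, since every occurrence of a word must reach the same counting task.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: send the tuple to one task chosen at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field_indexes):
    """Fields grouping: hash the selected fields so equal keys always
    land on the same task (hypothetical hashing scheme)."""
    key = "\x1f".join(str(tup[i]) for i in field_indexes)
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]


def all_grouping(tup, num_tasks):
    """All grouping: replicate the tuple to every task."""
    return list(range(num_tasks))
```

With this sketch, `fields_grouping(("storm", 1), 4, [0])` and `fields_grouping(("storm", 7), 4, [0])` return the same task because only field 0 (the word) is hashed.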

Topologies

A directed graph of spouts and bolts

[Diagram: spouts and bolts connected by streams of tuples]

Word Count Example (Main)

Word Count Example (Spout)

Word Count Example (Bolt)

Word Count (Running)
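
As a rough illustration of the data flow in a word-count topology (spout, then a splitting bolt, then a counting bolt), here is a plain-Python sketch. It is not Storm code and does not use Storm's API; the generator-based design and every name in it (`sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout stand-in: emit one-field tuples, each holding a sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt stand-in: emit one (word,) tuple per word of each sentence."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt stand-in: keep a running count per word and emit
    (word, count) after every update."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split -> count and collect what the last bolt emits."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))
```

For example, `run_topology(["the cow", "the moon"])` yields `("the", 1)`, `("cow", 1)`, `("the", 2)`, `("moon", 1)` in that order. In a real cluster each bolt would run as several parallel tasks, with a fields grouping on the word between the two bolts.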

Storm Cluster

• Master node

– Runs a daemon called Nimbus

– Responsible for distributing code around the cluster

– Assigns tasks to machines

– Monitors for failures

• Worker node

– Runs a daemon called Supervisor

– Listens for work assigned to its machine

• Zookeeper

– Coordinates communication between Nimbus and the Supervisors

– All Nimbus and Supervisor state is kept here

Storm Cluster

[Diagram: Nimbus, three Zookeeper nodes, and four Supervisors]

Guaranteeing Message Processing

• When the spout emits a sentence tuple, the following tuples are created downstream:

– Tuple for each word in the sentence

– Tuple for the updated count for each word

• A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout

Anchoring

• When a word tuple is anchored, the spout tuple at the root of the tree is replayed later if the word tuple fails to be processed downstream

– You might not always want this feature

• An output can be anchored to more than one input tuple

– Failure of one tuple causes multiple tuples to be replayed

Miscellaneous for Storm’s Reliability API

• Emit(tuple, output)

– Each word is anchored if you specify the input tuple as the first argument to emit

• Ack(tuple)

– Acknowledge that you have finished processing a tuple

• Fail(tuple)

– Immediately fail the spout tuple at the root of the tuple tree, e.g. if there is an exception from the database

• Must remember to ack/fail each tuple

– Each pending tuple consumes memory; forgetting to ack or fail it results in memory leaks
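
The bookkeeping behind anchoring, ack, and fail can be sketched with a toy tracker. This is plain Python under stated assumptions, not Storm's implementation or API: the class `TupleTreeTracker` and its methods `emit_root`, `emit`, `ack`, and `fail` are hypothetical names, and a processing timeout is assumed to be handled by simply calling `fail` on a pending tuple.

```python
import itertools


class TupleTreeTracker:
    """Toy tuple-tree bookkeeping (hypothetical, not Storm's API)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.roots_of = {}    # tuple id -> ids of spout tuples it is anchored under
        self.pending = {}     # spout tuple id -> tuple ids not yet acked
        self.completed = []   # spout tuples whose whole tree was acked
        self.replayed = []    # spout tuples that must be replayed

    def emit_root(self):
        """A spout emits a new tuple, which starts a new tree."""
        tid = next(self._next_id)
        self.roots_of[tid] = {tid}
        self.pending[tid] = {tid}
        return tid

    def emit(self, anchors=()):
        """A bolt emits a tuple; with anchors it joins their trees,
        without anchors it is not tracked at all."""
        tid = next(self._next_id)
        roots = set()
        for anchor in anchors:
            roots |= self.roots_of.get(anchor, set())
        live = {root for root in roots if root in self.pending}
        self.roots_of[tid] = live
        for root in live:
            self.pending[root].add(tid)
        return tid

    def ack(self, tid):
        """Mark one tuple done; a tree completes when nothing is pending."""
        for root in self.roots_of.pop(tid, ()):
            waiting = self.pending.get(root)
            if waiting is None:
                continue
            waiting.discard(tid)
            if not waiting:
                del self.pending[root]
                self.completed.append(root)

    def fail(self, tid):
        """Fail every spout tuple this tuple is anchored under."""
        for root in self.roots_of.pop(tid, ()):
            if self.pending.pop(root, None) is not None:
                self.replayed.append(root)
```

A bolt would emit its anchored outputs before acking its input, so the tree never looks complete while work is still outstanding. Entries stay in `pending` until they are acked or failed, which mirrors why forgetting to do either leaks memory.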

What’s wrong with Storm?

• Mutable state can be lost due to failure!

– May update mutable state twice!

– Processes each record at least once

• Slow nodes?

– No notion of time slices
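
The double-update problem follows directly from at-least-once delivery: if a bolt updates its state but the acknowledgement is lost, the replayed tuple is applied a second time. A minimal sketch in plain Python, where the function name and the `lose_ack_for` parameter are hypothetical and the lost ack is simulated rather than caused by a real failure:

```python
from collections import Counter


def count_with_replay(words, lose_ack_for):
    """Count words under at-least-once delivery.

    `lose_ack_for` holds the positions whose ack is (hypothetically)
    lost, which causes that tuple to be replayed once.
    """
    counts = Counter()
    for index, word in enumerate(words):
        counts[word] += 1            # state is updated first...
        if index in lose_ack_for:    # ...but the ack never arrives,
            counts[word] += 1        # so the replay is counted again
    return counts
```

Here `count_with_replay(["a", "b", "a"], {0})` reports `a` three times even though it occurred twice.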

Enter Spark Streaming

• Research project from the AMPLab group at UC Berkeley

• A follow-up to the Spark research project

– Idea: cache datasets in memory, as Resilient Distributed Datasets (RDDs), for repeated queries

– Can run up to 100 times faster than Hadoop MapReduce for iterative machine learning applications

Discretized Stream Processing

• Run a streaming computation as a series of very small, deterministic batch jobs

• Chop up the live stream into batches of X seconds

• Spark treats each batch of data as RDDs and processes them using RDD operations

• Finally, the processed results of the RDD operations are returned in batches
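
The discretization step can be sketched in a few lines of plain Python. This is not Spark's API: the `(timestamp, value)` record shape and the names `discretize` and `word_counts_per_batch` are assumptions for illustration, and lists stand in for RDDs.

```python
from collections import Counter, defaultdict


def discretize(records, batch_seconds):
    """Chop (timestamp, value) records into consecutive batches of
    `batch_seconds`; intervals with no records are skipped."""
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[i] for i in sorted(batches)]


def word_counts_per_batch(records, batch_seconds):
    """Run the same deterministic batch job (a word count) on each batch
    and return the results batch by batch."""
    return [Counter(batch) for batch in discretize(records, batch_seconds)]
```

Because each batch job is deterministic and depends only on its own batch, any batch can be recomputed on its own after a failure.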

Effects of Discretization

• (+) Easy to maintain consistency

– Each time interval is processed independently

» Especially useful when you have stragglers (slow nodes)

» Not handled in Storm

• (+) Handle out-of-order records

– Can choose to wait for a limited “slack time”

– Incrementally correct late records at application level

• (+) Parallel recovery

– Expose parallelism across both partitions of an operator and time

• (-) Raises minimum latency

Fault-tolerance in Spark Streaming

• An RDD remembers the sequence of operations that created it from the original fault-tolerant input data (its “lineage graph”)

• Batches of input data are replicated in the memory of multiple worker nodes and are therefore fault-tolerant

• Data lost due to a worker failure can be recomputed from the input data
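
The lineage idea can be illustrated with a toy dataset that records how it was derived and rebuilds itself on demand. This is a plain-Python sketch, not Spark's RDD implementation: the class `LineageDataset` and its methods are hypothetical, and a simple list stands in for the replicated input batch.

```python
class LineageDataset:
    """Toy dataset that keeps its lineage (hypothetical, not Spark's RDD)."""

    def __init__(self, source, operations=()):
        self.source = source                  # stands in for replicated input data
        self.operations = tuple(operations)   # the recorded lineage
        self.cache = None                     # in-memory result, may be lost

    def map(self, fn):
        return LineageDataset(self.source, self.operations + (("map", fn),))

    def filter(self, fn):
        return LineageDataset(self.source, self.operations + (("filter", fn),))

    def compute(self):
        """Return the cached result, replaying the lineage if it is missing."""
        if self.cache is None:
            data = list(self.source)
            for kind, fn in self.operations:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self.cache = data
        return self.cache

    def lose_cache(self):
        """Simulate a worker failure that wipes the in-memory result."""
        self.cache = None
```

After `lose_cache()`, calling `compute()` again replays the recorded operations over the input and produces the same result, which is possible only because the operations are deterministic.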

Fault-tolerant Stateful Processing

• State data is not lost even if a worker node dies

– Does not change the value of your result

• Exactly-once semantics for all transformations

– No double counting!

• Recovers from faults/stragglers within 1 sec

Comparison with Storm

• Higher throughput than Storm

– Spark Streaming: 670k records/second/node

– Storm: 115k records/second/node

More information

• Storm

– https://github.com/nathanmarz/storm/wiki/Tutorial

– https://github.com/nathanmarz/storm-starter

• Spark Streaming

– http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html

• Other Streaming Frameworks

– Yahoo S4

– LinkedIn Samza