Stream Processing
Marco Serafini
COMPSCI 532, Lecture 5
Stream vs. Batch Processing
• Batch processing
  • Bounded input
  • Bounded one-shot computation
  • Bounded output
• Stream processing
  • Unbounded input: data stream
  • Unbounded computation: always on
  • Unbounded output
Advantages and Disadvantages
• Advantages of stream processing
  • Many real-time datasets are data streams (e.g. IoT)
  • Near real-time results
  • No need to accumulate data before processing
  • Streaming operators typically require less memory
• Disadvantages of stream processing
  • Need to deal with timing semantics
  • Some operators are harder to implement with streaming
    • Especially if we want operator state to be constant
    • E.g. finding the median
  • Stream algorithms are often approximations
Streaming Computation
• Dataflow graph of (possibly stateful) operators
• Data streams connecting them
  • Tuples in the stream: <key, [timestamp,] value>
  • FIFO channels
• Partitioning
  • Operators are parallelized into subtasks based on key
  • Streams are split into partitions
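A minimal sketch of key-based partitioning, assuming hash partitioning with a stable hash (CRC32). The names `partition_for` and `split_stream` are hypothetical and not taken from any particular system; real engines use their own partitioners.

```python
import zlib


def partition_for(key: str, num_subtasks: int) -> int:
    """Map a key to a subtask index with a stable hash (assumption: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_subtasks


def split_stream(tuples, num_subtasks):
    """Split a stream of <key, timestamp, value> tuples into partitions.

    Every tuple with the same key lands in the same partition, and the
    arrival order inside each partition is preserved (FIFO).
    """
    partitions = [[] for _ in range(num_subtasks)]
    for key, timestamp, value in tuples:
        partitions[partition_for(key, num_subtasks)].append((key, timestamp, value))
    return partitions


stream = [("a", 1, 10), ("b", 2, 20), ("a", 3, 30), ("c", 4, 40), ("b", 5, 50)]
parts = split_stream(stream, num_subtasks=3)
```

Because the subtask is a pure function of the key, a stateful operator can keep all state for one key on a single subtask.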
Example: Streaming Inverted Index
• What could the architecture look like?
• What kind of streams would we have?
• What kind of operators?
Windowing
• Windows create batches for bounded operations
  • Example: aggregation
• Tuples may be replicated across multiple windows (e.g. when windows overlap)
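A small sketch of windowed aggregation, assuming sliding windows of length `size` that start every `slide` time units. The function names are hypothetical. It shows how one tuple is replicated into every window that contains its timestamp.

```python
from collections import defaultdict


def assign_windows(timestamp, size, slide):
    """Return the start times of all windows [start, start + size) that
    contain the timestamp. With size == slide this is a tumbling window."""
    start = timestamp - (timestamp % slide)  # latest window start <= timestamp
    starts = []
    while start > timestamp - size:
        starts.append(start)
        start -= slide
    return sorted(starts)


def windowed_sums(tuples, size, slide):
    """Sum values per (key, window start); a tuple contributes to each
    window it is assigned to."""
    sums = defaultdict(int)
    for key, timestamp, value in tuples:
        for start in assign_windows(timestamp, size, slide):
            sums[(key, start)] += value
    return dict(sums)


result = windowed_sums([("a", 7, 1), ("a", 12, 2)], size=10, slide=5)
```

With `size=10` and `slide=5`, the tuple at time 7 falls into the windows starting at 0 and 5, and the tuple at time 12 into those starting at 5 and 10.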
Event and Processing Time
• Event time
  • Time when the event occurs
  • Associated with the event itself
  • Immutable
• Processing time
  • Time when the event is processed
  • Depends on system implementation
  • Mutable
• Q: Which one is easier to program with?
Watermarks
• Event-time processing
  • Requires reordering since event time ≠ processing time
  • Example: process all events generated in [from, to)
• Low watermarks
  • How to know that I got all events until event-time T?
  • Watermark is a special message telling us that
  • Forwarded throughout dataflow graph
  • If an operator has multiple input channels, forward minimum (earliest) low watermark across inputs
• Punctuations
  • Similar to watermarks
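The forwarding rule for operators with multiple input channels can be sketched as follows. `WatermarkTracker` is a hypothetical helper, not an API of any specific system; it assumes watermarks on each channel never move backwards.

```python
class WatermarkTracker:
    """Tracks the latest low watermark per input channel and forwards the
    minimum across all inputs whenever that minimum advances."""

    def __init__(self, input_channels):
        self.latest = {ch: float("-inf") for ch in input_channels}
        self.emitted = float("-inf")

    def on_watermark(self, channel, watermark):
        """Return the new watermark to forward downstream, or None if the
        minimum across inputs did not advance."""
        self.latest[channel] = max(self.latest[channel], watermark)
        candidate = min(self.latest.values())
        if candidate > self.emitted:
            self.emitted = candidate
            return candidate
        return None


tracker = WatermarkTracker(["left", "right"])
forwarded = [
    tracker.on_watermark("left", 10),   # right has sent nothing yet
    tracker.on_watermark("right", 5),   # minimum is now 5
    tracker.on_watermark("right", 20),  # minimum is now 10 (held back by left)
    tracker.on_watermark("left", 15),   # minimum is now 15
]
```

The slowest input channel determines how far event time has progressed for the operator as a whole.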
Triggers
• Watermarks can be too fast (when?) or too slow (when?)
• Triggers are used to process a window based on a processing time signal
• Problem: what to do with the window contents after the trigger fires?
  • Discard? Accumulate? Retract?
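The three options can be compared on a window that fires several times. This is a sketch under the assumption that the window computes a sum and that each firing sees only the tuples that arrived since the previous firing; `refine` and the mode names are hypothetical.

```python
def refine(partial_sums, mode):
    """Return the outputs produced for one window that fires once per
    entry of partial_sums (sum of the tuples seen since the last firing)."""
    outputs = []
    total = 0
    previous = None
    for partial in partial_sums:
        if mode == "discard":
            # State is dropped after each firing: emit only the new part.
            outputs.append(("emit", partial))
        elif mode == "accumulate":
            # State is kept: each firing emits the running total.
            total += partial
            outputs.append(("emit", total))
        elif mode == "retract":
            # State is kept and the previous result is explicitly withdrawn.
            total += partial
            if previous is not None:
                outputs.append(("retract", previous))
            outputs.append(("emit", total))
            previous = total
        else:
            raise ValueError(f"unknown mode: {mode}")
    return outputs


firings = [3, 4, 2]
```

Discarding keeps operator state small but leaves the downstream consumer to combine results; retracting lets a downstream aggregate undo a stale result before applying the new one.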
Spark Structured Streaming
• Define a relational query on stream
• System incrementally updates output tables
System Implementation
Stream Processing Systems
• “Pure” streaming systems
  • Tuple-at-a-time semantics
  • Example: Apache Storm, Apache Flink
• Micro-batching
  • Create small batches of inputs and then execute batched computation
  • Example: Spark Streaming
• System implementation ≠ programming semantics
Control vs. Data Messages
• Control messages are injected in the event stream
• Checkpoint markers
  • Inserting them in the stream helps take a consistent snapshot
• Watermarks for windowing
  • Inserting them in the stream allows triggering windows
• Coordination barriers
  • Inserting them in the stream allows marking events as before or after the barrier
Implementing Batch on Streaming
• DataSet abstraction in Flink
• Example
  • Q: How to implement map-reduce on Flink?
  • A: Control messages
    • Mappers send an “eof” marker to each reducer when done
    • Reducers do not process until they receive markers from all mappers
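The "eof" marker idea can be sketched with a reducer that buffers its input and only reduces once every mapper has signalled completion. This is an illustrative sketch, not Flink's implementation; `EOF`, `Reducer`, and `on_message` are hypothetical names, and the reduce function is assumed to be a per-key sum.

```python
from collections import defaultdict

EOF = object()  # control message: "this mapper has no more output"


class Reducer:
    def __init__(self, num_mappers):
        self.pending_mappers = num_mappers
        self.buffer = defaultdict(list)
        self.result = None  # stays None until all mappers are done

    def on_message(self, message):
        if message is EOF:
            self.pending_mappers -= 1
            if self.pending_mappers == 0:
                # All inputs are complete: the bounded reduce can run.
                self.result = {k: sum(v) for k, v in self.buffer.items()}
        else:
            key, value = message
            self.buffer[key].append(value)


reducer = Reducer(num_mappers=2)
for msg in [("stream", 1), ("batch", 1), EOF, ("stream", 1)]:
    reducer.on_message(msg)
result_before_last_eof = reducer.result
reducer.on_message(EOF)
```

Until the last marker arrives, the reducer cannot know whether more values for a key are still in flight, so it produces no output.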
Fault Tolerance
• Streaming: stateful operators
  • Cannot rerun the whole stream from the beginning
• Spark Streaming: lineage (Spark) + checkpoints
• Flink: periodic checkpointing of stateful operators
  • Exports an API to the application to define the state to be checkpointed
• How to checkpoint?
Distributed Checkpoints
• Uncoordinated checkpoint?
  • Domino effect
• Coordinated checkpoint
  • Consistent cut: no message is recorded as received but not sent
  • Distributed checkpointing protocol (Chandy-Lamport)
Chandy-Lamport Protocol
• Assumptions
  • Originator process starts it
  • FIFO channels
  • One checkpoint at a time
• Goal: checkpoint state + all in-flight messages
• Algorithm
  • Originator checkpoints its state and sends a checkpoint marker on each outgoing channel
  • Upon receiving the first checkpoint marker
    • Checkpoint and send a checkpoint marker on each outgoing channel
    • Record subsequent messages on each incoming channel until a checkpoint marker is received on it
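The protocol can be tried out in a small single-threaded simulation. This is a sketch under simplifying assumptions: processes hold an integer balance, messages are money transfers, channels are in-memory FIFO queues between every pair of processes, and the delivery order is chosen by hand. All class and function names are hypothetical.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, balance):
        self.balance = balance
        self.saved_state = None   # local state captured at the checkpoint
        self.recording = {}       # source -> messages recorded so far
        self.channel_state = {}   # source -> final recorded in-flight messages


class Network:
    def __init__(self, balances):
        self.procs = {name: Process(b) for name, b in balances.items()}
        names = list(balances)
        self.channels = {(s, d): deque() for s in names for d in names if s != d}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def _checkpoint(self, name):
        proc = self.procs[name]
        proc.saved_state = proc.balance
        for (src, dst), channel in self.channels.items():
            if src == name:
                channel.append(MARKER)     # marker on each outgoing channel
            if dst == name:
                proc.recording[src] = []   # start recording each incoming channel

    def start_snapshot(self, originator):
        self._checkpoint(originator)

    def deliver(self, src, dst):
        """Deliver the oldest message on channel src -> dst."""
        msg = self.channels[(src, dst)].popleft()
        proc = self.procs[dst]
        if msg == MARKER:
            if proc.saved_state is None:   # first marker: checkpoint now
                self._checkpoint(dst)
            # Marker closes this channel: its recorded messages are final.
            proc.channel_state[src] = proc.recording.pop(src)
        else:
            proc.balance += msg
            if src in proc.recording:      # in flight when the snapshot was taken
                proc.recording[src].append(msg)


def snapshot_total(net):
    """Sum of checkpointed balances plus recorded in-flight transfers."""
    total = 0
    for proc in net.procs.values():
        total += proc.saved_state
        total += sum(sum(msgs) for msgs in proc.channel_state.values())
    return total


def demo():
    net = Network({"A": 100, "B": 100, "C": 100})
    net.send("A", "B", 10)
    net.send("B", "C", 20)
    net.start_snapshot("A")
    net.send("C", "A", 5)
    order = [("A", "C"), ("B", "C"), ("A", "B"), ("A", "B"), ("C", "A"),
             ("C", "A"), ("B", "A"), ("C", "B"), ("B", "C")]
    for src, dst in order:
        net.deliver(src, dst)
    return net
```

In this run the checkpointed balances alone do not add up to the 300 units in the system; the transfers recorded as in-flight on the channels account for the difference, which is why the goal includes channel state.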
Load Balancing
• How to balance load in a tuple-at-a-time system?
  • Redistribute keys
    • Move key from one server to another
    • Requires migrating operator state
  • Replicate and aggregate
    • Multiple copies of the same operator compute partial results
    • A downstream operator aggregates them
    • Power of both choices
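A sketch of "replicate and aggregate" with two candidate workers per key. It assumes the operator is a per-key count, uses two stable hash functions (CRC32 and Adler-32) to pick the candidates, and sends each tuple to the currently less loaded of the two. The function names are hypothetical.

```python
import zlib
from collections import Counter


def two_choices(key, num_workers):
    """Two candidate workers for a key, from two different hash functions."""
    data = key.encode("utf-8")
    return zlib.crc32(data) % num_workers, zlib.adler32(data) % num_workers


def route(keys, num_workers):
    """Send each tuple to the less loaded of its two candidate workers.
    Each worker keeps only a partial count per key."""
    load = [0] * num_workers
    partial = [Counter() for _ in range(num_workers)]
    for key in keys:
        first, second = two_choices(key, num_workers)
        target = first if load[first] <= load[second] else second
        load[target] += 1
        partial[target][key] += 1
    return load, partial


def aggregate(partial):
    """Downstream operator: merge the partial counts into final counts."""
    total = Counter()
    for counts in partial:
        total.update(counts)
    return total


keys = ["hot"] * 50 + ["a", "b", "c", "d"] * 5
load, partial = route(keys, num_workers=4)
final = aggregate(partial)
```

A key's state is split over at most two workers, so no operator state has to be migrated; the cost is the extra downstream aggregation step.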
Lambda Architecture
• Compromise between accuracy and freshness
• Requires maintaining 2 platforms and 2 implementations
[Figure: Lambda architecture. The incoming data stream is written to a persistent store (e.g. Kafka). Stream processing updates approximate results in real time; batch processing periodically recreates accurate results. Both feed the result of analysis (e.g. index).]
Regulating the Data Flow
• Backpressure
  • If receiver operator cannot process inputs fast enough…
  • ... Block or slow down senders (recursively if needed)
• Intermediate buffer pools (queues)
  • Decouple communication from consuming messages
• Conflicting requirements
  • Throughput: batch output messages, don’t send one by one
  • Latency: send messages asap
  • Tradeoff: send when either
    • Max batch size reached (e.g. 1 kB) or
    • Timeout (e.g. 5 milliseconds)
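The size-or-timeout tradeoff can be sketched as an output buffer that flushes on whichever condition is met first. `OutputBuffer` is a hypothetical class; the caller passes the current time explicitly so the behavior is easy to follow, and the defaults mirror the example values above.

```python
class OutputBuffer:
    def __init__(self, max_bytes=1024, timeout_s=0.005):
        self.max_bytes = max_bytes
        self.timeout_s = timeout_s
        self.pending = []
        self.pending_bytes = 0
        self.oldest = None  # arrival time of the oldest buffered message

    def add(self, message: bytes, now: float):
        """Buffer a message; return a batch to send if the size limit is hit."""
        if not self.pending:
            self.oldest = now
        self.pending.append(message)
        self.pending_bytes += len(message)
        if self.pending_bytes >= self.max_bytes:
            return self._flush()
        return None

    def tick(self, now: float):
        """Called periodically; return a batch if the oldest message has
        waited at least timeout_s."""
        if self.pending and now - self.oldest >= self.timeout_s:
            return self._flush()
        return None

    def _flush(self):
        batch, self.pending = self.pending, []
        self.pending_bytes = 0
        self.oldest = None
        return batch


buf = OutputBuffer(max_bytes=10, timeout_s=0.005)
events = [
    buf.add(b"abcd", now=0.0),      # below size limit, nothing sent
    buf.tick(now=0.001),            # timeout not reached
    buf.tick(now=0.006),            # timeout reached: flush
    buf.add(b"123456", now=1.0),    # below size limit
    buf.add(b"7890", now=1.001),    # size limit reached: flush
]
```

The size limit bounds the per-message overhead under high load, while the timeout bounds the extra latency under low load.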
Exercise
Exercise: Online Store
• Two input streams
  • One has ad clicks: <user_ID, time, ad_ID>
  • The other has ad impressions: <ad_ID, time, user_ID>
• Design a DataFlow application that
  • Correlates ad impressions with user clicks (for billing)
  • Correlated = click happens within 10 seconds after ad
• Questions
  • Which operators? How are they partitioned?
  • Watermarks?
  • Triggers?
Possible Implementation
• Streaming operator
  • Receives both streams
  • Partitioned by ad_ID
  • Join by ad_ID and return <ad_ID, user_ID, ad_time>
• Windowing: session
  • When an ad appears, gather all clicks for 5 minutes
• Watermarks
  • Both input streams emit a low watermark every second
  • Earliest low watermark triggers the window
• Triggers: aggregate and retract
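One way to sketch the join operator for a single ad_ID partition is shown below. It is a simplified illustration rather than the full design above: it uses the 10-second correlation rule from the exercise as its window length (`max_delay`), emits each result once when the watermark closes the window (no retractions), and never garbage-collects old clicks. The class name and methods are hypothetical, and the caller is assumed to pass the minimum low watermark across both input streams.

```python
from collections import defaultdict


class AdClickJoin:
    def __init__(self, max_delay=10):
        self.max_delay = max_delay
        self.impressions = defaultdict(list)  # ad_id -> [(ad_time, user_id)]
        self.clicks = defaultdict(list)       # ad_id -> [(click_time, user_id)]

    def on_impression(self, ad_id, time, user_id):
        self.impressions[ad_id].append((time, user_id))

    def on_click(self, user_id, time, ad_id):
        self.clicks[ad_id].append((time, user_id))

    def on_watermark(self, watermark):
        """Emit <ad_ID, user_ID, ad_time> for every click that follows a
        matching impression within max_delay, for impressions whose window
        is complete (ad_time + max_delay <= watermark)."""
        results = []
        for ad_id, pending in self.impressions.items():
            still_open = []
            for ad_time, user_id in pending:
                if ad_time + self.max_delay > watermark:
                    still_open.append((ad_time, user_id))  # clicks may still arrive
                    continue
                for click_time, click_user in self.clicks.get(ad_id, []):
                    if click_user == user_id and 0 <= click_time - ad_time <= self.max_delay:
                        results.append((ad_id, user_id, ad_time))
            self.impressions[ad_id] = still_open
        return results


join = AdClickJoin(max_delay=10)
join.on_impression("ad1", 100, "u1")
join.on_click("u1", 105, "ad1")   # within 10 seconds: correlated
join.on_click("u1", 120, "ad1")   # too late
join.on_click("u2", 103, "ad1")   # different user
early = join.on_watermark(105)    # window for the impression is still open
closed = join.on_watermark(110)   # window complete: results are emitted
```

Waiting for the watermark trades latency for completeness; emitting earlier on a processing-time trigger would require an accumulate-and-retract strategy to correct results later.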