Dynamic Resource Management
In a Massively Parallel Stream
Processing Engine
KASPER GRUD SKAT MADSEN, PHD STUDENT
YONGLUAN ZHOU, ASSOCIATE PROFESSOR
LINK TO PAPER
About this presentation
This presentation was given by Kasper Grud Skat Madsen on 20/10/2015 at CIKM 2015
SIGIR graciously provided a travel grant
Massively Parallel
Stream Processing Engine
Designed to process streaming data in a highly distributed fashion
Job is modelled as DAG (O,E)
Vertices are operator instances
Edges are communication channels between vertices
Job is allocated to nodes, s.t.
Workload is reasonably balanced at submission
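As a toy illustration of the allocation above (not the paper's actual scheduler), here is a greedy placement of operator instances onto nodes so that load stays roughly balanced at submission; all operator names and load figures are invented:

```python
# Toy sketch (not the paper's scheduler): greedily place operator
# instances so that node load stays roughly balanced at submission.
# Operator names and load figures are invented.
operators = {"src": 2.0, "parse": 3.0, "join": 5.0, "agg": 4.0}  # instance -> load
edges = [("src", "parse"), ("parse", "join"), ("join", "agg")]   # channels of the DAG
nodes = ["node1", "node2"]

def allocate(operators, nodes):
    """Assign each operator instance to the currently least-loaded node."""
    load = {n: 0.0 for n in nodes}
    placement = {}
    # Heaviest instances first, so large operators spread across nodes
    for op, cost in sorted(operators.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)
        placement[op] = target
        load[target] += cost
    return placement, load

placement, load = allocate(operators, nodes)
```

With the loads above, both nodes end up with a total load of 7.0, i.e. a reasonably balanced initial allocation.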
Assumptions
Passive Fault Tolerance
Support for Horizontal Scaling (see paper for details)
Motivation & Contributions
The input rate and data distribution of streaming inputs may change over time
Need for horizontal scaling at runtime
Need for load balancing at runtime
Both issues can be handled with state migration
Migration typically incurs overhead proportional to the size of the state to migrate
It is therefore imperative to make state migration as cheap as possible
Paper contributions
Design a scheduler to let Apache Storm support horizontal scaling
Checkpoint-Assisted Low-Latency State-Migration
Checkpoint Allocation: Using correlation to do better
Fault Tolerance
Upstream FT
Operators buffer output until the downstream operators no longer depend on it
Buffers are potentially unbounded
Passive FT
Extension of upstream with checkpoints
A checkpoint is created periodically and stored on an external node
After a checkpoint is created, the buffers can safely be trimmed
Active FT
Not considered in this work
Passive FT - Example
[Figure: example job with 7 operators allocated across 3 nodes. Each operator is annotated with its last seen timestamp (Last TS, e.g. 99), its last checkpointed timestamp (Last CP, e.g. 90), and an output buffer of recent tuples (OutBuf: 101, 100, 99, 98, 97, 96, 95, …).]
Each operator maintains
Output buffer
Last seen timestamp (Last TS)
Last checkpointed timestamp (Last CP)
Checkpoints are made periodically
Upstream output buffers are trimmed based on the downstream "Last CP"
In case of fault
Checkpoint is loaded
Upstream output buffers are replayed
Regular processing can continue
Supports At-Least-Once / Exactly-Once
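The bookkeeping above can be sketched as follows for a single upstream/downstream pair; the Operator class and its fields are invented for illustration and do not correspond to Storm's API:

```python
# Sketch of the passive-FT bookkeeping for a single upstream/downstream
# pair; the Operator class and its fields are invented for illustration.
import collections

class Operator:
    def __init__(self):
        self.state = {}                      # operator state (here: key counts)
        self.out_buf = collections.deque()   # buffered output tuples (ts, key)
        self.last_ts = 0                     # last seen timestamp
        self.last_cp = 0                     # last checkpointed timestamp
        self.checkpoint = None               # stored on an external node in reality

    def process(self, ts, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.last_ts = ts
        self.out_buf.append((ts, key))       # keep output until downstream checkpoints

    def make_checkpoint(self):
        self.checkpoint = (self.last_ts, dict(self.state))
        self.last_cp = self.last_ts

    def trim(self, downstream_last_cp):
        # Output covered by the downstream "Last CP" is safe to drop
        while self.out_buf and self.out_buf[0][0] <= downstream_last_cp:
            self.out_buf.popleft()

    def recover(self):
        # On failure: reload the checkpoint; upstream then replays its buffer
        self.last_ts, self.state = self.checkpoint[0], dict(self.checkpoint[1])

upstream, downstream = Operator(), Operator()
for ts in range(1, 6):
    upstream.process(ts, "a")
    downstream.process(ts, "a")
downstream.make_checkpoint()       # downstream Last CP = 5
upstream.trim(downstream.last_cp)  # upstream buffer can now be emptied
```

After a fault, the downstream operator reloads its checkpoint and the upstream operator replays everything still in its output buffer, which is what yields at-least-once delivery.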
State-Migration
Direct State-Migration
Pause execution (op2)
Install new operator on target node (install op 2 on node 2)
Serialize state (op 2) & send to new node (node 2)
Redirect tuples to new node (node 1 → node 2)
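The four direct-migration steps can be sketched as follows; plain dicts stand in for cluster nodes, and nothing here is a Storm API:

```python
# Minimal sketch of direct state migration for operator "op2"; plain
# dicts stand in for cluster nodes, and nothing here is a Storm API.
import pickle

def direct_migrate(old_node, new_node, op_id):
    old_node["paused"].add(op_id)                          # 1. pause execution
    new_node["operators"][op_id] = None                    # 2. install operator on target
    blob = pickle.dumps(old_node["operators"].pop(op_id))  # 3. serialize state...
    new_node["operators"][op_id] = pickle.loads(blob)      #    ...and send it over
    old_node["routes"][op_id] = new_node["name"]           # 4. redirect tuples
    old_node["paused"].discard(op_id)

old = {"name": "node1", "operators": {"op2": {"a": 5}}, "paused": set(), "routes": {}}
new = {"name": "node2", "operators": {}, "paused": set(), "routes": {}}
direct_migrate(old, new, "op2")
```

Note that processing of op 2 is paused for the entire serialize-and-ship step, so the latency penalty grows with the size of the state; this is the overhead the CP-assisted variant avoids.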
CP-Assisted State-Migration
Migrate to node which contains the newest checkpoint
Redirect tuples and buffer at new node (node 1 → node 2)
Install new operator on target node & convert cp to state (install op 2 on node 2)
Replay all upstream buffers (from op 1 on node 1)
[Figure: example job with 4 operators on 3 nodes; operator 2 is migrated from node 1 to node 2.]
Checkpoint-Assisted Low-Latency State-Migration
Step 1: Migrate to the node which contains the newest checkpoint
[Figure: operator 2 is migrated from node 1 to node 2.]
Step 2: Duplicate tuples to both old and new node
[Figure: the new operator 2 buffers incoming data while the old operator 2 continues processing.]
Step 3: Convert checkpoint to state & replay buffers
[Figure: the new operator 2 converts the checkpoint to state and processes the output buffer from op 1; the old operator 2 continues processing.]
Step 4: Finish processing output buffers
[Figure: the new operator 2 finishes processing the output buffers, then processes the tuples buffered during migration; the old operator 2 continues processing.]
Step 5: When the new node has "caught up", synchronize
[Figure: operator 2 on node 2 has caught up, so a synchronization is performed. The old operator 2 stops receiving inputs and is then converted to a checkpoint.]
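The five steps can be condensed into a sketch for a simple counting operator; function and variable names are illustrative, not the paper's implementation:

```python
# Sketch of the five migration steps for a simple counting operator;
# function and variable names are illustrative, not the paper's code.
def cp_assisted_migrate(checkpoint, upstream_buffer, live_stream):
    """checkpoint: (ts, state) already stored on the target node.
    upstream_buffer: upstream output tuples replayed after the checkpoint.
    live_stream: tuples that arrive while the migration is in progress."""
    # Steps 1-2: tuples are duplicated to old and new node; the target
    # buffers them while the old operator keeps processing (no pause).
    buffered = list(live_stream)

    # Step 3: convert the local checkpoint into operator state.
    cp_ts, state = checkpoint
    state = dict(state)

    # Steps 3-4: replay the upstream output buffer, then the tuples
    # buffered during migration, skipping what the checkpoint covers.
    for ts, key in list(upstream_buffer) + buffered:
        if ts > cp_ts:
            state[key] = state.get(key, 0) + 1

    # Step 5: the target has caught up; after synchronization the old
    # operator stops receiving inputs and is converted to a checkpoint.
    return state
```

Because the old operator keeps processing until the synchronization point, the job never pauses for the full state transfer, which is where the low latency comes from.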
Optimizing checkpoint allocation
for partial checkpoints
Optimize the allocation of checkpoints such that the loads of the processing node
and the checkpointing node are negatively correlated
This is only needed when using partial checkpoints
The problem is NP-hard; heuristic solution:
Calculate the "impact" of allocating each checkpoint to each node
Calculate the "importance" of each checkpoint
Assign the checkpoint with the largest "importance"
Loop, until no more checkpoints
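One possible reading of the greedy loop, with simplified stand-ins for the paper's exact definitions (here "impact" is the Pearson correlation between an operator's load trace and a candidate node's load trace, and "importance" is the spread between an operator's best and worst placements); all traces and names are made up:

```python
# One possible reading of the greedy heuristic; "impact" and
# "importance" are simplified stand-ins for the paper's definitions.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def allocate_checkpoints(op_load, node_load, home):
    """op_load: operator -> load trace; node_load: node -> load trace;
    home: operator -> node that processes it (checkpoints go elsewhere)."""
    # "Impact" of placing an operator's checkpoint on a node: how
    # correlated that node's load is with the operator's load.
    impact = {op: {n: pearson(trace, node_load[n])
                   for n in node_load if n != home[op]}
              for op, trace in op_load.items()}
    placement = {}
    remaining = set(op_load)
    while remaining:  # loop until no more checkpoints
        # Most "important" first: largest gap between best and worst node
        op = max(remaining,
                 key=lambda o: max(impact[o].values()) - min(impact[o].values()))
        # Place on the least-correlated (ideally negatively correlated) node
        placement[op] = min(impact[op], key=impact[op].get)
        remaining.discard(op)
    return placement
```

For instance, an operator whose load trace rises while node 3's load falls would have its checkpoint placed on node 3, so that recovery work lands on a node that tends to be idle when the operator is busy.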
Evaluation – State-Migration
Executed with 25 nodes on Amazon EC2
Real dataset: Airline On-Time (provided by United States Department of Transportation)
Apache Storm with stabilization period of 500 seconds, then one migration
Evaluation – Checkpoint Allocation
Executed with 25 nodes on Amazon EC2
Real dataset: Airline On-Time (provided by United States Department of Transportation)
Job: 4 Operators, 3 Highly Correlated, 1 Highly Uncorrelated
Apache Storm executed for 48 minutes (with 15 min statistics collection)