58
Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin, and Boris Novikov JetBrains Research, Saint Petersburg State University

Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Deterministic Model for Distributed Speculative Stream Processing

DISCAN 2018

Igor Kuralenok, Artem Trofimov, Nikita Marshalkin, and Boris Novikov

JetBrains Research, Saint Petersburg State University

Page 2: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Outline

• Deterministic computations

• Stream processing computational model

• Optimistic determinism: drifting state

• Experiments

• Exactly-once on top of determinism

• Yet another experiment

2

Page 3: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Deterministic computations

Given a particular input, the same output will be produced after any number of reruns

For streaming it means: 𝐹 𝐼𝑛 … 𝐼0 : ∀𝑘, 𝐹(𝐼𝑘𝐼𝑘−1 … 𝐼0 ) = 𝐽𝑘

Usually, it is considered as: 𝐹 𝐼𝑛, 𝑆𝑛 : ∀𝑘, ∃𝑆𝑘 = 𝑆 𝐼𝑘−1 … 𝐼0 , 𝐹(𝐼𝑘, 𝑆𝑘) = 𝐽𝑘

3

Page 4: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Why is determinism important?

Determinism is a desired property in many CS areas

• Natural for users (people got used to think sequentially)

• Computations are reproducible and predictable

• Implies consistency [Stonebreaker et al. The 8 requirements of real-time stream processing. ACM SIGMOD Record 2005]

4

Page 5: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Determinism: simple way

Determinism can be easily achieved if

• All computations are sequential

• All transformations are pure functions

5

Page 6: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Stream processing

• Shared-nothing distributed runtime

• Record-at-a-time model

• Latency is a key performance metric

6 6

Page 7: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Determinism in stream processing

• It is considered that determinism is too difficult too achieve

• Systems usually provide low-level interfaces, which do not guarantee any level of determinism

• Trade-off between determinism and latency [Zacheilas et al. Maximizing Determinism in Stream Processing Under Latency Constraints. DEBS 2017]

7

Page 8: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

What is about batch processing?

• MapReduce is usually implemented deterministically

• Micro-batching (spark streaming, storm trident) is also deterministic

8

Page 9: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

The ultimate question of life, the universe, and everything

Is it possible to combine low-latency and determinism within distributed stream processing?

9

Page 10: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

The ultimate question of life, the universe, and everything

Is it possible to combine low-latency and determinism within distributed stream processing?

• In spite of asynchronous distributed processing

10

Page 11: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

The ultimate question of life, the universe, and everything

Is it possible to combine low-latency and determinism within distributed stream processing?

• In spite of asynchronous distributed processing

• Avoiding input buffering

11

Page 12: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Outline

• Deterministic computations

• Stream processing computational model

• Optimistic determinism: drifting state

• Experiments

• Exactly-once on top of determinism

• Yet another experiment

12

Page 13: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Dataflow

• Dataflow is a potentially unlimited sequence of data items • Timestamps can be assigned to data items to define an

order • Dataflow is expressed in the form of a graph • Vertices are operations, which are implemented by user-

defined functions • Edges declare an order between operations

13

Page 14: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Physical deployment

14

Page 15: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Outline

• Deterministic computations

• Stream processing computational model

• Optimistic determinism: drifting state

• Experiments

• Exactly-once on top of determinism

• Yet another experiment

15

Page 16: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

What do we require to achieve determinism?

• Total order and transformations as pure functions – We can define synthetic order by assigning timestamps at system entry

16

Page 17: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

What do we require to achieve determinism?

• Total order and transformations as pure functions – We can define synthetic order by assigning timestamps at system entry

• We need to care about the order only in the operations that are order-sensitive and before output

17

Page 18: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

What do we require to achieve determinism?

• Total order and transformations as pure functions – We can define synthetic order by assigning timestamps at system entry

• We need to care about the order only in the operations that are order-sensitive and before output

• Calculations are partitioned, and order between items from different partitions does not influence the result (if they will not be merged)

18

Page 19: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Unrealistic requirement

• Total order preservation

19

Page 20: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Unrealistic requirement

• Total order preservation

• Let’s try to rethink streaming computations

20

Page 21: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Drifting state: idea

21

𝐼𝑘 → Op 𝐼𝑘 , 𝑆𝑘 → 𝐽𝑘

↑ ↓

𝑆𝑘 𝑆𝑘+1

Page 22: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Drifting state: idea

22

𝐼𝑘 → Op 𝐼𝑘 , 𝑆𝑘 → 𝐽𝑘

↑ ↓

𝑆𝑘 𝑆𝑘+1 newState = combine(prevState, newItem) handler.update(newState) return newState

Page 23: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Drifting state: idea

23

𝐼𝑘 → Op 𝐼𝑘 , 𝑆𝑘 → 𝐽𝑘

↑ ↓

𝑆𝑘 𝑆𝑘+1

What if we put state directly into the stream?

𝐼𝑘 , 𝑆𝑘 → Op 𝐼𝑘 , 𝑆𝑘 → 𝐽𝑘 , 𝑆𝑘+1

Page 24: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Drifting state: implementation

• Any stateful transformation can be decomposed into map and windowed grouping operation with a cycle

• Map operation is stateless == order insensitive and pure

• Grouping operation is pure (and even does not contain user-define logic)

24

Page 25: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Drifting state: optimistic grouping

• Grouping operation is pure, but order-sensitive • Buffers before each grouping can increase latency [Li et al.

Out-of-order processing: a new architecture for high-performance stream systems. VLDB 2008]

• Grouping can be implemented optimistically without blocking

25

Page 26: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Drifting state: optimistic grouping

• Grouping operation is pure, but order-sensitive • Buffers before each grouping can increase latency [Li et al.

Out-of-order processing: a new architecture for high-performance stream systems. VLDB 2008]

• Grouping can be implemented optimistically without blocking

26

Page 27: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Drifting state: the only buffer

• Optimistic approach produces invalid items • Invalid items must be filtered out before they are

sent to consumer • Punctuations (low watermarks) allow releasing

items

27

Page 28: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Something is wrong here

• Dataflow graphs are cyclic

• It is unclear how to send low watermarks through the cycles

28

Page 29: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Implementation notes: acker

29

Page 30: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Implementation notes: modified acker

30

Page 31: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Discussion: drifting state pros

• Determinism is closer than you think! • We moved from “state as a special item” to “state

as an ordinary item” – Business-logic becomes stateless – All guarantees that system provides regarding

ordinary items are satisfied for state – Any stateful dataflow graph can be expressed using

drifting state model – Transformations can be non-commutative, but should

be pure – Single buffer before output per dataflow graph

31

Page 32: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Discussion: drifting state cons

• It is harder to write code

– A need for a convenient API

• Optimistic technique can potentially generate a lot of additional items

• How does drifting state behaves within real-world problem?

32

Page 33: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Outline

• Deterministic computations

• Stream processing computational model

• Optimistic determinism: drifting state

• Experiments

• Exactly-once on top of determinism

• Yet another experiment

33

Page 34: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: prototype

• FlameStream [https://github.com/flame-stream]

• Java + Akka, Zookeper

34

Page 35: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: task

Incremental inverted index building

– Requires stateful operations

– Contains network shuffle

– Workload is unbalanced due to Zipf’s law

35

Page 36: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: setup

• 10 EC2 micro instances

– 1 GB RAM

– 1 core CPU

• Wikipedia documents as a dataset

36

Page 37: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: overhead

37

Page 38: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: latency scalability

38

Page 39: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: throughput scalability

39 Nodes

Do

cum

ents

/sec

Page 40: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: comparison with conservative approach

40

• Posting lists update is order-sensitive operation

• Buffer elements before this operation

• Buffer is flushed on low watermarks

• Low watermarks are sent after each input element to minimize overhead

• [Li et al. Out-of-order processing: a new architecture for high-performance stream systems. VLDB 2008]

• Apache Flink as stream processing system

Page 41: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: comparison with conservative approach (at most once)

41

10 nodes 5 nodes

Page 42: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: comparison with conservative approach (at most once)

42

10 nodes 5 nodes

Page 43: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: comparison with conservative approach

43

Page 44: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: comparison with conservative approach

44

Do

cum

ents

/sec

Nodes

Page 45: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Drifting state: conclusion

• Determinism and low-latency is achieved

• Overhead is low

• Throughput is not significantly degraded

• Model is suitable for any stateful dataflows

– If all transformations are pure

45

Page 46: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

• Map is not pure? – Determinism is lost by definition – Correctness is lost

• Multiple input nodes?

– Timestamp = timestamp@node_id – Latency ≥ out of sync time – Possible to sync in ~10ms

• Acker fails? – Replication – Separate ackers for timestamp ranges

What if…

46

Page 47: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Outline

• Deterministic computations

• Stream processing computational model

• Optimistic determinism: drifting state

• Experiments

• Exactly-once on top of determinism

• Yet another experiment

47

Page 48: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

What do we need to add to achieve exactly-once?

• Input replay

• Restore consistent state in groupings

• Deduplicate items only at the barrier

48

Page 49: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

What do we need to add to achieve exactly-once?

• Input replay

• Restore consistent state in groupings

• Deduplicate items only at the barrier – Output items atomically

– Sink stores timestamp of the last received item

– Simply compare timestamps!

49

Page 50: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Exactly-once: discussion

• Pure streaming

• Deduplication only at the barrier

• Snapshotting and outputting are independent (are not connected into transaction)

50

Page 51: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Exactly-once: roadmap

51

Page 52: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Outline

• Deterministic computations

• Stream processing computational model

• Optimistic determinism: drifting state

• Experiments

• Exactly-once on top of determinism

• Yet another experiment

52

Page 53: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: latency (50 ms between checkpoints)

53

50 ms between state snapshots

Page 54: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: latency (1000 ms between checkpoints)

54

1000 ms between state snapshots

Page 55: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Experiments: throughput

55

Do

cum

ents

/sec

Nodes

Page 56: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Conclusions

• Single extra requirement: all transformations are pure

• Results look promising

• A lot of work

– Understand properties and limitations

– Real-life deployment

• We are open for collaboration

56

Page 57: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Future work

• Real-life deployment

• Efficient determinism and exactly-once can be used for system-level acceptance testing

– [Trofimov. Consistency maintenance in distributed analytical stream processing. ADBIS DC 2018]

57

Page 58: Deterministic Model for Distributed Speculative …Deterministic Model for Distributed Speculative Stream Processing DISCAN 2018 Igor Kuralenok, Artem Trofimov, Nikita Marshalkin,

Papers

• Kuralenok et al. FlameStream: Model and Runtime for Distributed Stream Processing. BeyondMR@SIGMOD 2018

• Kuralenok et al. Deterministic model for distributed speculative stream processing. ADBIS 2018

• Long paper about exactly-once is on the anvil

58