
Graphs as Streams: Rethinking Graph Processing in the Streaming Era


Vasia Kalavri [email protected]

@vkalavri


MODERN STREAMING TECHNOLOGY

➤ sub-second latencies

➤ high throughput

➤ dynamic topologies

➤ powerful semantics

➤ ecosystem integration


MORE THAN COUNTING WORDS

➤ Complex Event Processing

➤ Online Machine Learning

➤ Streaming SQL


WHAT ABOUT GRAPH PROCESSING?


HOW WE’VE DONE GRAPH PROCESSING SO FAR

1. Load: read the graph from disk and partition it in memory


2. Compute: read and mutate the graph state

3. Store: write the final graph state back to disk


“If what you need is to analyze a static graph over and over again, then this model is great!”


WHAT’S WRONG WITH THIS MODEL?

➤ It is slow
    ➤ wait until the computation is over before you see any result
    ➤ pre-processing and partitioning

➤ It is expensive
    ➤ lots of memory and CPU required in order to scale

➤ It requires re-computation for graph changes
    ➤ no efficient way to deal with updates


GRAPH STREAMING CHALLENGES

➤ Maintain the dynamic graph structure

➤ Provide up-to-date results with low latency

➤ Compute on fresh state only

ACADEMIA TO THE RESCUE

➤ Graph streaming in the 90s-00s
    ➤ input fits in secondary storage
    ➤ limited memory
    ➤ few passes over the input data
    ➤ compact graph representations and summaries


GRAPH SUMMARIES

➤ spanners
    ➤ connectivity, distance

➤ sparsifiers
    ➤ cut estimation

➤ neighborhood sketches

[Figure: an algorithm run on the graph yields result R1; the algorithm run on the graph summary yields result R2, with R1 ~ R2.]


BATCH CONNECTED COMPONENTS

[Figure: batch connected components on an example graph with vertices 1 to 8, shown at iterations i=0, i=1 and i=2. Every vertex starts with its own ID as its label; by i=2 vertices 1 to 5 carry label 1 and vertices 6 to 8 carry label 6.]
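The batch iterations on the preceding slides can be sketched in plain Java as repeated minimum-label propagation. This is a minimal sketch, not the talk's implementation: the class name `BatchConnectedComponents`, the `run` helper, and the edge list (read off the slides and possibly not identical to the pictured graph) are assumptions made for illustration.

```java
import java.util.Arrays;

class BatchConnectedComponents {

    /** Returns label[v] = smallest vertex ID in v's component, for vertices 1..n. */
    static int[] run(int n, int[][] edges) {
        int[] label = new int[n + 1];
        for (int v = 1; v <= n; v++) {
            label[v] = v;                       // i = 0: every vertex holds its own ID
        }
        boolean changed = true;
        while (changed) {                       // one pass of this loop = one iteration
            changed = false;
            int[] next = label.clone();         // read old labels, write new ones
            for (int[] e : edges) {
                int m = Math.min(label[e[0]], label[e[1]]);
                if (m < next[e[0]]) { next[e[0]] = m; changed = true; }
                if (m < next[e[1]]) { next[e[1]] = m; changed = true; }
            }
            label = next;
        }
        return label;
    }

    public static void main(String[] args) {
        // Assumed example graph: 8 vertices, two components.
        int[][] edges = {
            {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}, {3, 1}, {5, 2}
        };
        int[] label = run(8, edges);
        // prints [1, 1, 1, 1, 1, 6, 6, 6]
        System.out.println(Arrays.toString(Arrays.copyOfRange(label, 1, 9)));
    }
}
```

Note that every iteration rescans the full edge list, which is why this model needs the whole graph loaded before any result is available.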


STREAM CONNECTED COMPONENTS

Graph Summary: Disjoint Set (Union-Find)

➤ Only store component IDs and vertex IDs
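A disjoint-set summary of this kind can be sketched in a few lines of Java. This is an illustrative sketch under stated assumptions, not the Gelly-Stream `DisjointSet` class: the name `DisjointSetSketch`, its methods, the choice of the smallest vertex ID as the component ID, and the edge arrival order are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class DisjointSetSketch {
    // The whole summary: one parent pointer per seen vertex, no edges stored.
    private final Map<Integer, Integer> parent = new HashMap<>();

    /** Returns the component ID (root) of v, registering v on first sight. */
    int find(int v) {
        parent.putIfAbsent(v, v);
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        while (parent.get(v) != root) {         // path compression
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    /** Processes one edge of the stream; the smaller root becomes the component ID. */
    void union(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra != rb) {
            parent.put(Math.max(ra, rb), Math.min(ra, rb));
        }
    }

    /** Groups every seen vertex under its component ID. */
    Map<Integer, SortedSet<Integer>> components() {
        Map<Integer, SortedSet<Integer>> out = new TreeMap<>();
        for (int v : new ArrayList<>(parent.keySet())) {
            out.computeIfAbsent(find(v), k -> new TreeSet<>()).add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed arrival order of the example edges.
        int[][] stream = {
            {3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}
        };
        DisjointSetSketch ds = new DisjointSetSketch();
        for (int[] e : stream) {
            ds.union(e[0], e[1]);               // single pass: each edge is seen once
        }
        // prints {1=[1, 2, 3, 4, 5], 6=[6, 7, 8]}
        System.out.println(ds.components());
    }
}
```

Memory grows with the number of vertices rather than the number of edges, and an up-to-date answer is available after every edge.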

[Figure: the edges (5,4), (7,6), (8,6), (4,2), (4,3), (8,7), (4,1), (3,1) and (5,2) arrive as a stream and are folded into the disjoint-set summary one at a time. Components appear as Cid = 1 {1, 3}, Cid = 2 {2, 5, 4} and Cid = 6 {6, 7, 8}; Cid = 1 and Cid = 2 then merge, leaving Cid = 1 {1, 2, 3, 4, 5} and Cid = 6 {6, 7, 8}.]

DISTRIBUTED STREAM CONNECTED COMPONENTS


THE BAD NEWS

➤ A slightly different motivation
    ➤ finite graph stored on disk vs. unbounded graph arriving in real time
    ➤ some algorithms assume we know |V|, |E|
    ➤ most algorithms designed for single-node execution


THE GOOD NEWS

➤ A quite different reality
    ➤ memory is getting bigger
    ➤ … and cheaper
    ➤ we know how to design distributed algorithms


GELLY-STREAM: SINGLE-PASS STREAM GRAPH PROCESSING WITH APACHE FLINK


GELLY ON STREAMS

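Gelly (on DataSet)
    ➤ Static Graphs
    ➤ Multi-Pass Algorithms
    ➤ Full Computations

Gelly-Stream (on DataStream)
    ➤ Dynamic Graphs
    ➤ Single-Pass Algorithms
    ➤ Approximate Computations

[Figure: stack diagram with Deployment at the bottom, Distributed Dataflow above it, then DataSet and DataStream, with Gelly on DataSet and Gelly-Stream on DataStream.]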


DISTRIBUTED STREAM CONNECTED COMPONENTS


STREAM CONNECTED COMPONENTS WITH FLINK

    DataStream<DisjointSet> cc = edgeStream
        .keyBy(0)                                          // partition the edge stream
        .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))   // define the merging frequency
        .fold(new DisjointSet(), new UpdateCC())           // merge locally
        .flatMap(new Merger())                             // merge globally
        .setParallelism(1);
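The partition, merge locally, merge globally pattern can be illustrated without a Flink cluster. The following is a rough plain-Java simulation under stated assumptions, not the Gelly-Stream code: `PartitionedConnectedComponents`, its nested `DisjointSet`, the `run` helper, and routing edges by source vertex into a fixed number of buckets (a stand-in for `keyBy(0)`) are all hypothetical, and the list passed in plays the role of one window's worth of edges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionedConnectedComponents {

    /** Minimal disjoint set keyed by vertex ID; the smaller root wins. */
    static class DisjointSet {
        final Map<Integer, Integer> parent = new HashMap<>();

        int find(int v) {
            parent.putIfAbsent(v, v);
            int p = parent.get(v);
            if (p == v) {
                return v;
            }
            int root = find(p);
            parent.put(v, root);                // path compression
            return root;
        }

        void union(int a, int b) {
            int ra = find(a);
            int rb = find(b);
            if (ra != rb) {
                parent.put(Math.max(ra, rb), Math.min(ra, rb));
            }
        }

        /** Folds another partial summary into this one. */
        void merge(DisjointSet other) {
            for (int v : new ArrayList<>(other.parent.keySet())) {
                union(v, other.find(v));        // link v to its root in the other summary
            }
        }
    }

    static DisjointSet run(List<int[]> window, int parallelism) {
        // Partition the edge stream: route each edge by its source vertex.
        DisjointSet[] local = new DisjointSet[parallelism];
        for (int i = 0; i < parallelism; i++) {
            local[i] = new DisjointSet();
        }
        for (int[] e : window) {
            int bucket = Math.floorMod(Integer.hashCode(e[0]), parallelism);
            local[bucket].union(e[0], e[1]);    // merge locally
        }
        // Merge globally: a single task combines the partial summaries.
        DisjointSet global = new DisjointSet();
        for (DisjointSet partial : local) {
            global.merge(partial);
        }
        return global;
    }

    public static void main(String[] args) {
        List<int[]> window = Arrays.asList(
            new int[]{3, 1}, new int[]{5, 2}, new int[]{5, 4}, new int[]{7, 6},
            new int[]{8, 6}, new int[]{4, 2}, new int[]{4, 3}, new int[]{8, 7},
            new int[]{4, 1});
        DisjointSet cc = run(window, 4);
        System.out.println(cc.find(5) + " " + cc.find(8));   // prints "1 6"
    }
}
```

Only the compact summaries cross partition boundaries in the global step, so the single merging task handles far less data than the raw edge stream.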


GELLY-STREAM STATUS

➤ Properties and Metrics

➤ Transformations

➤ Aggregations

➤ Discretization

➤ Neighborhood Aggregations

➤ Graph Streaming Algorithms
    ➤ Connected Components
    ➤ Bipartiteness Check
    ➤ Window Triangle Count
    ➤ Triangle Count Estimation
    ➤ Continuous Degree Aggregate

FEELING GELLY?

➤ Gelly-Stream Repository

github.com/vasia/gelly-streaming

➤ A list of graph streaming papers

citeulike.org/user/vasiakalavri/tag/graph-streaming

➤ A related talk at FOSDEM’16

slideshare.net/vkalavri/gellystream-singlepass-graph-streaming-analytics-with-apache-flink
