GRAPHS AS STREAMS: RETHINKING GRAPH PROCESSING IN THE STREAMING ERA
Vasia Kalavri
@vkalavri
MODERN STREAMING TECHNOLOGY
➤ sub-second latencies
➤ high throughput
➤ dynamic topologies
➤ powerful semantics
➤ ecosystem integration
MORE THAN COUNTING WORDS
➤ Complex Event Processing
➤ Online Machine Learning
➤ Streaming SQL
WHAT ABOUT GRAPH PROCESSING?
HOW WE’VE DONE GRAPH PROCESSING SO FAR
1. Load: read the graph from disk and partition it in memory
2. Compute: read and mutate the graph state
3. Store: write the final graph state back to disk
“If what you need is to analyze a static graph over and over again, then this model is great!”
WHAT’S WRONG WITH THIS MODEL?
➤ It is slow
  ➤ wait until the computation is over before you see any result
  ➤ pre-processing and partitioning
➤ It is expensive
  ➤ lots of memory and CPU required in order to scale
➤ It requires re-computation for graph changes
  ➤ no efficient way to deal with updates
GRAPH STREAMING CHALLENGES
➤ Maintain the dynamic graph structure
➤ Provide up-to-date results with low latency
➤ Compute on fresh state only
ACADEMIA TO THE RESCUE
➤ Graph streaming in the 90s-00s
  ➤ input fits in secondary storage
  ➤ limited memory
  ➤ few passes over the input data
  ➤ compact graph representations and summaries
GRAPH SUMMARIES
➤ spanners
  ➤ connectivity, distance
➤ sparsifiers
  ➤ cut estimation
➤ neighborhood sketches
[Figure: an algorithm run on the graph gives result R1; the algorithm run on the graph summary gives result R2, with R1 ~ R2.]
BATCH CONNECTED COMPONENTS
[Figure sequence: iterations i=0, i=1, i=2 on a graph with vertices 1 to 8. At i=0 every vertex is labeled with its own ID; by i=2 five vertices carry label 1 and three carry label 6.]
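A minimal sketch of the batch approach, assuming the iterations on these slides are synchronous min-label propagation (the slides do not name the algorithm). The class name, method name, and edge list are illustrative assumptions; the edges are read off the edge labels drawn on the streaming slides (5-4, 7-6, 8-6, and so on), on the guess that both examples use the same graph.

```java
import java.util.Arrays;

class BatchConnectedComponents {

    /**
     * Returns label[v] for vertices 1..n, where label[v] is the smallest
     * vertex ID reachable from v. Index 0 is unused.
     */
    public static int[] components(int n, int[][] edges) {
        int[] label = new int[n + 1];
        for (int v = 1; v <= n; v++) {
            label[v] = v;                      // i=0: every vertex is its own component
        }
        boolean changed = true;
        while (changed) {                      // each loop is one full pass over the graph
            changed = false;
            int[] next = label.clone();        // synchronous update: read old labels, write new ones
            for (int[] e : edges) {
                int min = Math.min(label[e[0]], label[e[1]]);
                if (min < next[e[0]]) { next[e[0]] = min; changed = true; }
                if (min < next[e[1]]) { next[e[1]] = min; changed = true; }
            }
            label = next;
        }
        return label;
    }

    public static void main(String[] args) {
        // Assumed edge list, taken from the edge labels on the streaming slides.
        int[][] edges = {{5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}, {3, 1}, {5, 2}};
        System.out.println(Arrays.toString(components(8, edges)));
    }
}
```

Every pass touches every edge, and no label is final until the loop stops, which is the "wait until the computation is over" cost the earlier slide points at.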
STREAM CONNECTED COMPONENTS
Graph Summary: Disjoint Set (Union-Find)
➤ Only store component IDs and vertex IDs
[Figure sequence: the edges 5-4, 7-6, 8-6, 4-2, 4-3, 8-7, 4-1, 3-1, 5-2 are shown as a stream and consumed one at a time. The disjoint set grows to Cid = 1 {1, 3}, Cid = 2 {2, 5, 4} and Cid = 6 {6, 7, 8}; in the final frame Cid = 2 no longer appears and vertices 2, 5, 4 are listed with Cid = 1.]
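A minimal sketch of the disjoint-set summary, assuming integer vertex IDs and assuming the smallest vertex ID serves as the component ID (this matches the Cid values on the slides but is not stated there). The class name `DisjointSetSketch` and the arrival order of the edges are hypothetical; this is not the `DisjointSet` class used by Gelly-Stream.

```java
import java.util.HashMap;
import java.util.Map;

class DisjointSetSketch {

    // Only vertex IDs and their parent pointers are stored; edges are discarded.
    private final Map<Integer, Integer> parent = new HashMap<>();

    /** Returns the component ID (root) of v, registering v the first time it is seen. */
    public int find(int v) {
        parent.putIfAbsent(v, v);
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // Path compression: point every vertex on the path directly at the root.
        while (parent.get(v) != root) {
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    /** Consumes one edge; the smaller root becomes the ID of the merged component. */
    public void union(int u, int v) {
        int ru = find(u);
        int rv = find(v);
        if (ru == rv) {
            return;                            // edge inside an existing component: nothing to store
        }
        if (ru < rv) {
            parent.put(rv, ru);
        } else {
            parent.put(ru, rv);
        }
    }

    public static void main(String[] args) {
        DisjointSetSketch summary = new DisjointSetSketch();
        // Assumed arrival order of the edges drawn on the slides.
        int[][] stream = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        for (int[] e : stream) {
            summary.union(e[0], e[1]);         // single pass: each edge is seen once
            System.out.println("edge " + e[0] + "-" + e[1] + ": Cid(" + e[0] + ") = " + summary.find(e[0]));
        }
    }
}
```

The summary is queryable after every edge, so an up-to-date answer is available without waiting for the stream to end.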
DISTRIBUTED STREAM CONNECTED COMPONENTS
THE BAD NEWS
➤ A slightly different motivation
  ➤ finite graph stored on disk vs. unbounded graph arriving in real time
  ➤ some algorithms assume we know |V|, |E|
  ➤ most algorithms are designed for single-node execution
THE GOOD NEWS
➤ A quite different reality
  ➤ memory is getting bigger
  ➤ … and cheaper
  ➤ we know how to design distributed algorithms
GELLY-STREAM: SINGLE-PASS STREAM GRAPH PROCESSING WITH APACHE FLINK
GELLY ON STREAMS
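[Figure: the Flink stack. Deployment at the bottom, Distributed Dataflow above it, then the DataSet and DataStream APIs, with Gelly on top of DataSet and Gelly-Stream on top of DataStream.]
Gelly
➤ Static Graphs
➤ Multi-Pass Algorithms
➤ Full Computations
Gelly-Stream
➤ Dynamic Graphs
➤ Single-Pass Algorithms
➤ Approximate Computations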
STREAM CONNECTED COMPONENTS WITH FLINK
DataStream<DisjointSet> cc = edgeStream
    .keyBy(0)                                          // partition the edge stream
    .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))   // define the merging frequency
    .fold(new DisjointSet(), new UpdateCC())           // merge locally
    .flatMap(new Merger())                             // merge globally
    .setParallelism(1);
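The slides do not show the bodies of `UpdateCC` or `Merger`, so the following is only a plain-Java sketch, with no Flink dependency, of what "merge locally" and "merge globally" could look like. All names (`MergeSketch`, `foldWindow`, `mergeInto`) are hypothetical, and splitting edges by the parity of the source vertex is just a stand-in for key-based partitioning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {

    /** Root of v in the parent map p; an unseen vertex becomes its own root. */
    public static int find(Map<Integer, Integer> p, int v) {
        p.putIfAbsent(v, v);
        while (p.get(v) != v) {
            v = p.get(v);
        }
        return v;
    }

    /** Joins the components of u and v; the smaller root becomes the component ID. */
    public static void union(Map<Integer, Integer> p, int u, int v) {
        int ru = find(p, u);
        int rv = find(p, v);
        if (ru != rv) {
            p.put(Math.max(ru, rv), Math.min(ru, rv));
        }
    }

    /** "Merge locally": fold one window of edges into a partial summary. */
    public static Map<Integer, Integer> foldWindow(List<int[]> edges) {
        Map<Integer, Integer> partial = new HashMap<>();
        for (int[] e : edges) {
            union(partial, e[0], e[1]);
        }
        return partial;
    }

    /** "Merge globally": replay each (vertex, parent) pair of a partial summary into the global one. */
    public static void mergeInto(Map<Integer, Integer> global, Map<Integer, Integer> partial) {
        for (Map.Entry<Integer, Integer> e : partial.entrySet()) {
            union(global, e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        int[][] window = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        List<int[]> odd = new ArrayList<>();
        List<int[]> even = new ArrayList<>();
        for (int[] e : window) {
            (e[0] % 2 == 1 ? odd : even).add(e);   // stand-in for partitioning by source vertex
        }
        Map<Integer, Integer> global = new HashMap<>();
        mergeInto(global, foldWindow(odd));        // partial summary from partition 1
        mergeInto(global, foldWindow(even));       // partial summary from partition 2
        System.out.println("Cid(5) = " + find(global, 5) + ", Cid(8) = " + find(global, 8));
    }
}
```

Only the partial summaries travel to the single merging task, not the raw edges, which is why the final step in the slide's pipeline can run with parallelism 1.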
GELLY-STREAM STATUS
➤ Properties and Metrics
➤ Transformations
➤ Aggregations
➤ Discretization
➤ Neighborhood Aggregations
➤ Graph Streaming Algorithms
  ➤ Connected Components
  ➤ Bipartiteness Check
  ➤ Window Triangle Count
  ➤ Triangle Count Estimation
  ➤ Continuous Degree Aggregate
FEELING GELLY?
➤ Gelly-Stream Repository
github.com/vasia/gelly-streaming
➤ A list of graph streaming papers
citeulike.org/user/vasiakalavri/tag/graph-streaming
➤ A related talk at FOSDEM’16
slideshare.net/vkalavri/gellystream-singlepass-graph-streaming-analytics-with-apache-flink