Upload
voxuyen
View
219
Download
4
Embed Size (px)
Citation preview
Connected Components in MapReduce and Beyond
R. Kiveris, S. Lattanzi, V. Mirrokni, V. Rastogi, S. VassilvitskiiGoogle
Monday, June 23, 14
Modern Massive Algorithmics
Communication: – Can be the overwhelming cost – In practice constant factors matter a whole lot
Data Skew:– Most datasets are heavy tailed – Naive data distribution can be disastrous– In synchronous environments must wait for slowest shard
• “Curse of the last reducer”
2Monday, June 23, 14
Modern Massive Algorithmics
Communication: – Can be the overwhelming cost – In practice constant factors matter a whole lot
Data Skew:– Most datasets are heavy tailed – Naive data distribution can be disastrous– In synchronous environments must wait for slowest shard
• “Curse of the last reducer”
Algorithms:– Embarrassingly parallel may also be embarrassingly slow – New techniques to minimize communication & skew
3Monday, June 23, 14
Today: Graph Connectivity
Classical problem – Many parallel algorithms
• PRAM• MapReduce• Pregel• ...
– Subroutine in many other problems• MST• Clustering• Multiway cuts• ...
4Monday, June 23, 14
Today: Graph Connectivity
Classical problem – Many parallel algorithms
• PRAM• MapReduce• Pregel• ...
– Subroutine in many other problems• MST• Clustering• Multiway cuts• ...
Want to optimize for very large graphs– Billions of nodes, 100s of billions of edges – Typically sparse– Do not fit in memory (10s+ TBs)
5Monday, June 23, 14
Approach
Transform the graph into a union of stars, one for each connected component.
6Monday, June 23, 14
Approach
Transform the graph into a union of stars, one for each connected component.
Begin:– Every node has a unique id – Assigned arbitrarily
7
1
9
5
8
7
3
2
6
4
Monday, June 23, 14
Approach
Transform the graph into a union of stars, one for each connected component.
Begin:– Every node has a unique id – Assigned arbitrarily
Two Local Operations:– Only look at a node and its neighbors– Prescribe which edges should exist in the next round
8Monday, June 23, 14
Operations
– LargeStar(v): Connect all strictly larger neighbors to the min neighbor including self.
9Monday, June 23, 14
Operations
– LargeStar(v): Connect all strictly larger neighbors to the min neighbor including self.
– Do this in parallel on every node to build a new graph
10
1
9
5
7
8
1
9
5
7
8
Monday, June 23, 14
Example
11
1
9
5
8
7
3
2
6
4
1
9
5
8
7
3
2
6
4
Monday, June 23, 14
Example
12
1
9
5
8
7
3
2
6
4
5
8
7
3
2
6
4
1
9
Monday, June 23, 14
Example
13
1
9
5
8
7
3
2
6
4
5
7
2
6
4
1
9
83
Monday, June 23, 14
Example
14
1
9
5
8
7
3
2
6
4
7
2
6
4
1
9
3
5
8
Monday, June 23, 14
Example
15
1
9
5
8
7
3
2
6
4
7
2
6
4
9
3
5
1
1
8
Monday, June 23, 14
Example
16
1
9
5
8
7
3
2
6
4
7
2
6
4
9
3
51
8
Monday, June 23, 14
Example
17
1
9
5
8
7
3
2
6
4
9
51
8
7
2
3
6
4
Monday, June 23, 14
LargeStar Connectivity
Lemma: Executing LargeStar in parallel preserves connectivity
– WLOG assume A > b. If b is min neighbor of A we are done– If b has no smaller neighbors (local min) we are done– Else: A > b > c and:
• Now need to reason about connectivity of (b,c)• Show (b,c) connected by induction on node rank
18
bA
A c
b b
A c
Monday, June 23, 14
LargeStar: Reinterpretation
– LargeStar(v): Connect all strictly larger neighbors to the min neighbor including self.
– Orient all edges from larger to smaller – LargeStar = tell children to connect to smallest parent
19
1
9
5
7
8
1
9
5
7
8
Monday, June 23, 14
LargeStar Fixed Point
Fixed point if:– Every node is a local min or connected to local minima– Orient edges from larger nodes to smaller nodes
• Fixed point if graph is DAG of height 2
20
1
9
5
8
7
3
2
6
4
Monday, June 23, 14
LargeStar Fixed Point
Fixed point if:– Every node is a local min or connected to local minima– Orient edges from larger nodes to smaller nodes
• Fixed point if graph is DAG of height 2
Progress:– Every LargeStar operation reduces height by a constant factor
21
1
9
5
8
7
3
2
6
4
Monday, June 23, 14
LargeStar Fixed Point
22
1
9
5
8
7
3
2
6
4
9
51
8
7
2
3
6
4
Monday, June 23, 14
Operations
– LargeStar(v): Connect all strictly larger neighbors to the min neighbor including self.
– Do this in parallel on every node to build a new graph
23
1
9
5
7
8
1
9
5
7
8
Monday, June 23, 14
Operations
– SmallStar(v): Connect all smaller neighbors and self to the min neighbor.
24
1
9
5
78
9
8
15
7
Monday, June 23, 14
Example
25
9
8
7
3
6
4
9
51
8
7
2
3
6
4
1 5
2
Monday, June 23, 14
Example
26
9
8
7
3
6
4
97
3
1 5
2
51
8
2
6
4
Monday, June 23, 14
Example
27
97
3
6
4
97
3
5
8
2
1
51
8
2
6
4
Monday, June 23, 14
Example
28
9
97
3
5
8
1
7
3
6
4
2
6
51
8
2
4
Monday, June 23, 14
Operations
– SmallStar(v): Connect all smaller neighbors and self to the min neighbor.
– Connect all parents (and self) to the minimum parent.
29
1
9
5
78
9
8
15
7
Monday, June 23, 14
SmallStar Analysis
Lemma: SmallStar preserves connectivity – Similar argument as before
30Monday, June 23, 14
SmallStar Analysis
Lemma: SmallStar preserves connectivity – Similar argument as before
Progress:– Run LargeStar to completion– Run one iteration of SmallStar– Run LargeStar to completion again– The number of local minima (maximal nodes in the DAG) reduces by a
constant factor
31Monday, June 23, 14
Overall Algorithm
Input:– Set of edges, with a unique label per node
Repeat until convergence
Repeat until convergence
LargeStar
SmallStar
Theorem:– The above algorithm converges in O(log2 n) rounds.
32Monday, June 23, 14
Implementation
33Monday, June 23, 14
Implementation
Both can be easily implemented in MapReduce– Or Pregel, or Giraph, or ...
LargeStar:
Map (u;v):– Emit (u;v), Emit (v;u)
Reduce (u; v1,v2,...,vk):
– m = argmin label(vi)
– Emit (v,m) for all v with label(v) > label(m)
34Monday, June 23, 14
Discussion
Convergence:– log2 n: is tight– The graph is used to define communication structure from time to
time – Number of edges does not increase at every time step
35Monday, June 23, 14
Making it Practical
36Monday, June 23, 14
Making it Practical
Approach 1 (Systems):– LargeStar is equivalent to finding one of the maxima in the DAG
reachable from each vertex– Can do this with a fast distributed hash table (DHT) to “walk up to the
root”• Keep the min id of the parent in a DHT • Similar to path compression
37Monday, June 23, 14
Making it Practical
Approach 1 (Systems):– LargeStar is equivalent to finding one of the maxima in the DAG
reachable from each vertex– Can do this with a fast distributed hash table (DHT) to “walk up to the
root”• Keep the min id of the parent in a DHT • Similar to path compression
Approach 2 (Algorithms):– Instead of waiting for LargeStar to converge, just interleave LargeStar and SmallStar.
• Repeat Until Convergence:
SmallStar
LargeStar
– Can prove convergence – Appears to converge even faster (conjecture O(log(n)) rounds)
38Monday, June 23, 14
Watching Out for Skew
Final output:– Union of stars, one for each component – This should worry you!
• In case of a single component, get one star with linear degree • In case of skewed component sizes, also get one star with linear degree
39Monday, June 23, 14
Dealing with Skew
Divide the computation of the minimum
– Can do this recursively c times– Increase number of rounds by 1/c, each node’s input at most nc
40
od
g
abc
i
j
ml
n 1
h
ef
k
hd
lgc
ae
ig
b
if
k
jo
11.1
1.2
1.3
1.4
Monday, June 23, 14
But does it work?
Data (subset):– UK Web graph: 106M nodes, 6.6B edges– Google+ subgraph: 178M nodes, 2.9B edges– Keyword similarity : 371M nodes, 3.5B edges– Document similarity: 4,700M nodes, 452B edges
Algorithms:– Hash2Min (previous MapReduce state of the art)– DHT Implementation– Alternating algorithm(skew optimized & non-optimized)
Setup:– Loaded cluster, look at median running times
41Monday, June 23, 14
Speedups
42
– 20x-40x faster on the document similarity graph– Smaller improvements on smaller graphs
Monday, June 23, 14
Graph Size
43
22
24
26
28
30
32
0 2 3 4 5 6 8 9 10 11 12
Log
og t
he n
umbe
r of
edg
es
Round
Alternating Optimized AlternatingHash-To-Min
Log
of n
umbe
r of e
dges
Monday, June 23, 14
Conclusion
Connected Components– Simple, local algorithms with O(log2 n) round complexity– Communication efficient (number of edges non-increasing)– Open: Prove O(log n) – Open: Prove ~log n lower bounds!
44Monday, June 23, 14
Conclusion
Connected Components– Simple, local algorithms with O(log2 n) round complexity– Communication efficient (number of edges non-increasing)– Open: Prove O(log n) – Open: Prove ~log n lower bounds!
Algorithms:– Evolve with the underlying system architecture– Avoid embarrassingly slow embarrassingly parallel implementations
45Monday, June 23, 14
Thank You
Monday, June 23, 14