Connected Components in MapReduce and Beyondmmds-data.org/presentations/2014/vassilvitskii_mmds1… · · 2016-09-27Connected Components in MapReduce and Beyond R. Kiveris, S. Lattanzi,

Connected Components in MapReduce and Beyond

R. Kiveris, S. Lattanzi, V. Mirrokni, V. Rastogi, S. VassilvitskiiGoogle

Monday, June 23, 14

Modern Massive Algorithmics

Communication: – Can be the overwhelming cost – In practice constant factors matter a whole lot

Data Skew:– Most datasets are heavy tailed – Naive data distribution can be disastrous– In synchronous environments must wait for slowest shard

• “Curse of the last reducer”

2Monday, June 23, 14

Modern Massive Algorithmics

Communication: – Can be the overwhelming cost – In practice constant factors matter a whole lot

Data Skew:– Most datasets are heavy tailed – Naive data distribution can be disastrous– In synchronous environments must wait for slowest shard

• “Curse of the last reducer”

Algorithms:– Embarrassingly parallel may also be embarrassingly slow – New techniques to minimize communication & skew


Today: Graph Connectivity

Classical problem – Many parallel algorithms

• PRAM• MapReduce• Pregel• ...

– Subroutine in many other problems• MST• Clustering• Multiway cuts• ...


Today: Graph Connectivity

Classical problem – Many parallel algorithms

• PRAM• MapReduce• Pregel• ...

– Subroutine in many other problems• MST• Clustering• Multiway cuts• ...

Want to optimize for very large graphs– Billions of nodes, 100s of billions of edges – Typically sparse– Do not fit in memory (10s+ TBs)


Approach

Transform the graph into a union of stars, one for each connected component.


Approach


Begin:– Every node has a unique id – Assigned arbitrarily

7

1

9

5

8

7

3

2

6

4

Monday, June 23, 14

Approach


Begin:– Every node has a unique id – Assigned arbitrarily

Two Local Operations:– Only look at a node and its neighbors– Prescribe which edges should exist in the next round


Operations

– LargeStar(v): Connect all strictly larger neighbors to the min neighbor including self.


Operations


– Do this in parallel on every node to build a new graph

10

1

9

5

7

8

1

9

5

7

8

Monday, June 23, 14

Example

11

1

9

5

8

7

3

2

6

4

1

9

5

8

7

3

2

6

4

Monday, June 23, 14

Example

12

1

9

5

8

7

3

2

6

4

5

8

7

3

2

6

4

1

9

Monday, June 23, 14

Example

13

1

9

5

8

7

3

2

6

4

5

7

2

6

4

1

9

83

Monday, June 23, 14

Example

14

1

9

5

8

7

3

2

6

4

7

2

6

4

1

9

3

5

8

Monday, June 23, 14

Example

15

1

9

5

8

7

3

2

6

4

7

2

6

4

9

3

5

1

1

8

Monday, June 23, 14

Example

16

1

9

5

8

7

3

2

6

4

7

2

6

4

9

3

51

8

Monday, June 23, 14

Example

17

1

9

5

8

7

3

2

6

4

9

51

8

7

2

3

6

4

Monday, June 23, 14

LargeStar Connectivity

Lemma: Executing LargeStar in parallel preserves connectivity

– WLOG assume A > b. If b is min neighbor of A we are done– If b has no smaller neighbors (local min) we are done– Else: A > b > c and:

• Now need to reason about connectivity of (b,c)• Show (b,c) connected by induction on node rank

18

bA

A c

b b

A c

Monday, June 23, 14

LargeStar: Reinterpretation


– Orient all edges from larger to smaller – LargeStar = tell children to connect to smallest parent

19

1

9

5

7

8

1

9

5

7

8

Monday, June 23, 14

LargeStar Fixed Point

Fixed point if:– Every node is a local min or connected to local minima– Orient edges from larger nodes to smaller nodes

• Fixed point if graph is DAG of height 2

20

1

9

5

8

7

3

2

6

4

Monday, June 23, 14


Fixed point if:– Every node is a local min or connected to local minima– Orient edges from larger nodes to smaller nodes

• Fixed point if graph is DAG of height 2

Progress:– Every LargeStar operation reduces height by a constant factor

21

1

9

5

8

7

3

2

6

4

Monday, June 23, 14


22

1

9

5

8

7

3

2

6

4

9

51

8

7

2

3

6

4

Monday, June 23, 14

Operations


– Do this in parallel on every node to build a new graph

23

1

9

5

7

8

1

9

5

7

8

Monday, June 23, 14

Operations

– SmallStar(v): Connect all smaller neighbors and self to the min neighbor.

24

1

9

5

78

9

8

15

7

Monday, June 23, 14

Example

25

9

8

7

3

6

4

9

51

8

7

2

3

6

4

1 5

2

Monday, June 23, 14

Example

26

9

8

7

3

6

4

97

3

1 5

2

51

8

2

6

4

Monday, June 23, 14

Example

27

97

3

6

4

97

3

5

8

2

1

51

8

2

6

4

Monday, June 23, 14

Example

28

9

97

3

5

8

1

7

3

6

4

2

6

51

8

2

4

Monday, June 23, 14

Operations

– SmallStar(v): Connect all smaller neighbors and self to the min neighbor.

– Connect all parents (and self) to the minimum parent.

29

1

9

5

78

9

8

15

7

Monday, June 23, 14

SmallStar Analysis

Lemma: SmallStar preserves connectivity – Similar argument as before


SmallStar Analysis

Lemma: SmallStar preserves connectivity – Similar argument as before

Progress:– Run LargeStar to completion– Run one iteration of SmallStar– Run LargeStar to completion again– The number of local minima (maximal nodes in the DAG) reduces by a

constant factor


Overall Algorithm

Input:– Set of edges, with a unique label per node

Repeat until convergence

Repeat until convergence

LargeStar

SmallStar

Theorem:– The above algorithm converges in O(log2 n) rounds.


Implementation


Implementation

Both can be easily implemented in MapReduce– Or Pregel, or Giraph, or ...

LargeStar:

Map (u;v):– Emit (u;v), Emit (v;u)

Reduce (u; v1,v2,...,vk):

– m = argmin label(vi)

– Emit (v,m) for all v with label(v) > label(m)


Discussion

Convergence:– log2 n: is tight– The graph is used to define communication structure from time to

time – Number of edges does not increase at every time step


Making it Practical


Making it Practical

Approach 1 (Systems):– LargeStar is equivalent to finding one of the maxima in the DAG

reachable from each vertex– Can do this with a fast distributed hash table (DHT) to “walk up to the

root”• Keep the min id of the parent in a DHT • Similar to path compression


Making it Practical

Approach 1 (Systems):– LargeStar is equivalent to finding one of the maxima in the DAG

reachable from each vertex– Can do this with a fast distributed hash table (DHT) to “walk up to the

root”• Keep the min id of the parent in a DHT • Similar to path compression

Approach 2 (Algorithms):– Instead of waiting for LargeStar to converge, just interleave LargeStar and SmallStar.

• Repeat Until Convergence:

SmallStar

LargeStar

– Can prove convergence – Appears to converge even faster (conjecture O(log(n)) rounds)


Watching Out for Skew

Final output:– Union of stars, one for each component – This should worry you!

• In case of a single component, get one star with linear degree • In case of skewed component sizes, also get one star with linear degree


Dealing with Skew

Divide the computation of the minimum

– Can do this recursively c times– Increase number of rounds by 1/c, each node’s input at most nc

40

od

g

abc

i

j

ml

n 1

h

ef

k

hd

lgc

ae

ig

b

if

k

jo

11.1

1.2

1.3

1.4

Monday, June 23, 14

But does it work?

Data (subset):– UK Web graph: 106M nodes, 6.6B edges– Google+ subgraph: 178M nodes, 2.9B edges– Keyword similarity : 371M nodes, 3.5B edges– Document similarity: 4,700M nodes, 452B edges

Algorithms:– Hash2Min (previous MapReduce state of the art)– DHT Implementation– Alternating algorithm(skew optimized & non-optimized)

Setup:– Loaded cluster, look at median running times


Speedups

42

– 20x-40x faster on the document similarity graph– Smaller improvements on smaller graphs

Monday, June 23, 14

Graph Size

43

22

24

26

28

30

32

0 2 3 4 5 6 8 9 10 11 12

Log

og t

he n

umbe

r of

edg

es

Round

Alternating Optimized AlternatingHash-To-Min

Log

of n

umbe

r of e

dges

Monday, June 23, 14

Conclusion

Connected Components– Simple, local algorithms with O(log2 n) round complexity– Communication efficient (number of edges non-increasing)– Open: Prove O(log n) – Open: Prove ~log n lower bounds!


Conclusion

Connected Components– Simple, local algorithms with O(log2 n) round complexity– Communication efficient (number of edges non-increasing)– Open: Prove O(log n) – Open: Prove ~log n lower bounds!

Algorithms:– Evolve with the underlying system architecture– Avoid embarrassingly slow embarrassingly parallel implementations


Thank You

Monday, June 23, 14

Documents

Connected Components in MapReduce and Beyondmmds-data.org/presentations/2014/vassilvitskii_mmds1… · · 2016-09-27Connected Components in MapReduce and Beyond R. Kiveris, S. Lattanzi,