Distributed-Memory Algorithms for Cardinality Matching using Matrix Algebra
Ariful Azad, Lawrence Berkeley National Laboratory
Joint work with Aydın Buluç (LBNL) Support: DOE Office of Science
SIAM PP 2016, Paris
• Matching: a subset of pairwise non-adjacent (independent) edges, i.e., at most one matched edge is incident on each vertex.
• Maximal cardinality matching: a matching to which no edge can be added without two matched edges sharing a vertex.
• Maximum cardinality matching (MCM): a matching with the largest possible cardinality over all matchings of the graph.
A matching in a graph
[Figure: two matchings of the same bipartite graph on vertices x1-x3 and y1-y3. Left: a maximal cardinality matching of cardinality 2; right: a maximum cardinality matching of cardinality 3. Legend: matched vertex, unmatched vertex, matched edge, unmatched edge.]
Applications of matching in scientific computing
[Diagram: "use" relationships between matching algorithms and scientific-computing applications. Bipartite cardinality matching, i.e., maximum cardinality matching (MCM) and maximal cardinality matching (1/2-approx.), is used by block triangular form (BTF), sparse QR, sparse LU, graph partitioning, and KLU. Bipartite weighted matching, i.e., maximum weighted matching (MWM, exact/approx.) and maximum-weight perfect matching (MWPM, exact/approx.), is used by AMG, least squares on HBS matrices, preconditioning, and HSS-based multifrontal solvers.]
• Problem: cardinality matching in a bipartite graph
– Maximum cardinality matching (MCM)
– Maximal cardinality matching (used to initialize MCM)
• Algorithm: distributed-memory parallel algorithms
• Approach: matrix-algebraic formulations of graph primitives, inspired by GraphBLAS (http://graphblas.org/)
– More discussion on Friday (MS68): The GraphBLAS Effort: Kernels, API, and Parallel Implementations, by Aydın Buluç
• Covers two recent papers:
– Maximal matching: Azad and Buluç, IEEE CLUSTER 2015
– Maximum matching: Azad and Buluç, IPDPS 2016
Scope of this talk
MCM algorithm based on augmenting-path searches
• Augmenting path: a path that alternates between unmatched and matched edges and whose two endpoints are both unmatched.
• Algorithm: repeatedly search for augmenting paths and flip the edges along each path, increasing the cardinality of the matching by one per path.
– Algorithmic options: single-source or multi-source search, breadth-first search (BFS) or depth-first search (DFS)
[Figure: a bipartite graph on x1-x3 and y1-y3 before and after augmenting the matching along an augmenting path.]
Algorithmic landscape of cardinality matching
Duff, Kaya and Uçar (ACM TOMS 2011); Azad, Buluç, Pothen (TPDS 2016)
Maximum cardinality matching:
– Single-source augmenting-path search, DFS or BFS: O(nm)
– Multi-source augmenting-path search, DFS with lookahead (Pothen-Fan): O(nm)
– Multi-source augmenting-path search, BFS (MS-BFS): O(nm) [our focus: exposes more parallelism]
– Multi-source augmenting-path search, DFS and BFS (Hopcroft-Karp): O(√n m) [best asymptotic complexity]
– Push-relabel, label-guided FIFO search: O(nm)
Maximal cardinality matching (used to initialize MCM):
– Greedy, Karp-Sipser
– Dynamic mindegree (local): O(m)
• When the graph does not fit in the memory of a single node
• When the graph is already distributed
– Example: static pivoting in SuperLU_DIST (Li and Demmel, 2003)
– Today the graph is gathered on a single node and MC64 computes the matching, which is expensive and does not scale
The need for distributed-memory algorithms
[Plot: time to gather the graph and scatter the matching on 2048 cores of NERSC/Edison (Cray XC30).]
Distributed algorithms are cheaper and more scalable.
• Prior work: push-relabel by Langguth et al. (2011) and Karp-Sipser on general graphs by Patwary et al. (2010)
– Neither scales beyond 64 processors
• Challenges
– Long augmenting paths pass through multiple processors
– Lots of fine-grained asynchronous communication
• Here we use the graph-matrix duality and design matching algorithms from scalable matrix and vector operations
– A handful of standard operations
– Offers bulk-synchronous parallelism
– Switching among algorithms is easier
Distributed-memory cardinality matching
Two required primitives: 1. Sparse matrix-sparse vector multiply (SpMSpV)
[Figure: SpMSpV on the matrix A of a bipartite graph G(R, C, E) with row vertices r1-r5 and column vertices c1-c5. The sparse input vector fc selects the unmatched columns; in each row, the minimum product over the selected columns is retained, producing the sparse output vector fr of visited rows. Semiring option: (multiply, add) is replaced by (select2nd, min). Graph operation: traverse unvisited neighbors; matrix operation: SpMSpV.]
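To make the primitive concrete, here is a minimal serial Python sketch (an illustration, not the CombBLAS implementation); the column-wise adjacency layout cols, the frontier dictionary, and the visited_rows set are assumptions made for the example.

```python
# Minimal serial sketch of SpMSpV over the (select2nd, min) semiring.
# cols[j] lists the row indices with a nonzero in column j of A;
# 'frontier' maps each selected (unmatched) column to the value it carries.
def spmspv_select2nd_min(cols, frontier, visited_rows):
    result = {}
    for j, val in frontier.items():        # select columns
        for i in cols[j]:                  # nonzeros A(i, j)
            if i in visited_rows:          # skip rows already in a BFS tree
                continue
            # multiply = select2nd: the "product" of A(i, j) and val is just val;
            # add = min: keep the smallest contribution per row
            if i not in result or val < result[i]:
                result[i] = val
    return result

# Columns 0 and 1 are unmatched and carry their own indices as values.
cols = {0: [0, 1], 1: [1, 2], 2: [2, 3], 3: [3, 4], 4: [0, 4]}
print(spmspv_select2nd_min(cols, {0: 0, 1: 1}, visited_rows=set()))
# {0: 0, 1: 0, 2: 1}: every reached row stores its minimum "parent" column
```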
Two required primitives: 2. Inverted index of a sparse vector
[Figure: inverting a sparse vector. Input: index = child, value = parent (e.g., the values 1 1 5 1 4 at indices 1-5); output: index = parent, value = child, with duplicates removed so each parent keeps one unique child. Graph operation: swap parents and children, i.e., swap matched and unmatched edges; vector operation: Invert.]
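A matching minimal sketch of the Invert primitive on a sparse vector stored as a dictionary; keeping the first child seen per parent is one way to realize the "keep unique child" rule, and the example reuses the values from the figure above.

```python
# Minimal sketch of the Invert primitive: swap indices and values of a sparse
# vector and remove duplicates (one unique child is kept per parent).
def invert(vec):
    inverted = {}
    for child, parent in vec.items():
        if parent not in inverted:        # keep a unique child per parent
            inverted[parent] = child
    return inverted

parents = {1: 1, 2: 1, 3: 5, 4: 1, 5: 4}  # index: child, value: parent
print(invert(parents))                    # {1: 1, 5: 3, 4: 5}  (index: parent, value: child)
```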
Multi-source BFS (MS-BFS) algorithm using matrix and vector operations
[Figure: (a) a maximal matching in a bipartite graph on x1-x6 and y1-y6; (b) the alternating BFS forest grown from the unmatched columns (roots of the BFS trees). Each level alternates a sparse matrix-sparse vector multiply (SpMSpV) with an inverted index on the matching vector; vertices already claimed by a tree are not explored again, keeping the trees vertex-disjoint.]
Step 1: discover vertex-disjoint augmenting paths.
MS-BFS algorithm using matrix and vector operations
[Figure: the matching of (a) augmented along the augmenting paths found in the BFS forest; the edge flips are expressed as two Invert operations.]
Step 2: augment the matching by flipping matched and unmatched edges along the augmenting paths.
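A minimal serial sketch of one phase combining the two steps follows, under the assumption that the matching is stored as the dictionaries mate_col and mate_row; it mirrors the structure described above (SpMSpV-style row claiming, Invert-style transitions, one augmenting path kept per tree) but is not the distributed implementation.

```python
def augment_phase(col_adj, mate_col, mate_row):
    """One MS-BFS phase: grow vertex-disjoint alternating trees from all
    unmatched columns at once, keep one augmenting path per tree, and flip it."""
    parent_col = {}               # row -> column that discovered it
    path_end = {}                 # tree root (unmatched column) -> unmatched row found
    frontier = {c: c for c in col_adj if c not in mate_col}  # column -> its tree root
    while frontier:
        # SpMSpV step: frontier columns claim their unvisited neighbor rows;
        # when several columns reach the same row, only one parent is kept.
        reached = {}
        for c, root in frontier.items():
            if root in path_end:
                continue                          # this tree already found a path
            for r in col_adj[c]:
                if r not in parent_col and r not in reached:
                    reached[r] = (c, root)
        frontier = {}
        for r, (c, root) in reached.items():
            parent_col[r] = c
            if r not in mate_row and root not in path_end:
                path_end[root] = r                # augmenting path: root ... -> r
            elif r in mate_row:
                frontier[mate_row[r]] = root      # Invert step: continue from the mate
    # Step 2: flip matched and unmatched edges along every discovered path.
    for root, r in path_end.items():
        c = parent_col[r]
        while True:
            prev_r = mate_col.get(c)              # c's previous mate (None at the root)
            mate_col[c], mate_row[r] = r, c
            if prev_r is None:
                break
            r, c = prev_r, parent_col[prev_r]
    return len(path_end)                          # number of paths augmented this phase

adj = {0: [0, 1], 1: [0, 2], 2: [1]}              # columns 0..2, rows 0..2
mate_col, mate_row = {0: 0}, {0: 0}               # start from an initial matching
while augment_phase(adj, mate_col, mate_row):
    pass
print(mate_col)                                   # maximum matching: {0: 0, 1: 2, 2: 1}
```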
Distributed-memory parallelization (SpMSpV)
[Figure: the frontier vector x and the matrix A distributed over a √p × √p grid of p processors; each step computes y = A x.]
ALGORITHM:
1. Gather frontier vertices within the processor column [communication]
2. Local multiplication [computation]
3. Find the owners of the current frontier's adjacency and exchange the adjacencies within the processor row [communication]
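A hedged mpi4py sketch of the three steps above on a √p × √p grid; the dict-based local data layout and the owner_of_row helper are illustrative assumptions, and the actual code uses Combinatorial BLAS in C++.

```python
# Illustrative mpi4py sketch of the 2D SpMSpV steps (not the CombBLAS code).
# Assumptions: p is a perfect square; A_local is the local submatrix as a dict
# {(i, j): value} with global indices; x_local is the local frontier piece;
# owner_of_row(i) returns the grid column (0..q-1) owning output row i.
from mpi4py import MPI

world = MPI.COMM_WORLD
p, rank = world.Get_size(), world.Get_rank()
q = int(p ** 0.5)                                      # sqrt(p) x sqrt(p) grid
grid_row, grid_col = divmod(rank, q)
row_comm = world.Split(color=grid_row, key=grid_col)   # processors in the same grid row
col_comm = world.Split(color=grid_col, key=grid_row)   # processors in the same grid column

def spmspv_2d(A_local, x_local, owner_of_row):
    # 1. Gather frontier vertices within the processor column [communication]
    x_full = {}
    for piece in col_comm.allgather(x_local):
        x_full.update(piece)
    # 2. Local multiplication over the (select2nd, min) semiring [computation]
    y_partial = {}
    for (i, j) in A_local:
        if j in x_full:
            val = x_full[j]                            # select2nd: take the vector value
            y_partial[i] = min(y_partial.get(i, val), val)
    # 3. Route each output entry to the owner of its row and merge,
    #    exchanging within the processor row [communication]
    outboxes = [dict() for _ in range(q)]
    for i, val in y_partial.items():
        outboxes[owner_of_row(i)][i] = val
    y_local = {}
    for piece in row_comm.alltoall(outboxes):
        for i, val in piece.items():
            y_local[i] = min(y_local.get(i, val), val)
    return y_local
```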
Shared-memory parallelization (SpMSpV)
[Figure: the local submatrix and frontier x split among threads; each thread owns a row band of height n/(t√p).]
• Explicitly split the local submatrices into t (= number of threads) pieces along the rows.
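A minimal sketch of the row-wise split under the same assumed dict layout; in Python the thread pool only illustrates the partitioning (each thread writes a disjoint row band, so no locking is needed), whereas the actual implementation threads within an MPI process in C++.

```python
# Sketch of the shared-memory SpMSpV: the local submatrix is pre-split into t
# row bands; each thread multiplies its own band, so output rows are disjoint.
from concurrent.futures import ThreadPoolExecutor

def split_rows(A_local, t, nrows):
    """Pre-split {(i, j): value} into t pieces along the rows."""
    bands = [dict() for _ in range(t)]
    for (i, j), v in A_local.items():
        bands[i * t // nrows][(i, j)] = v
    return bands

def band_spmspv(band, x_full):
    y = {}
    for (i, j) in band:
        if j in x_full:
            val = x_full[j]                    # (select2nd, min) semiring as before
            y[i] = min(y.get(i, val), val)
    return y

def spmspv_threaded(bands, x_full):
    with ThreadPoolExecutor(max_workers=len(bands)) as pool:
        partials = pool.map(band_spmspv, bands, [x_full] * len(bands))
    y = {}
    for part in partials:                      # row bands are disjoint: simple union
        y.update(part)
    return y
```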
Computation and communication time of discovering vertex-disjoint augmenting paths (one phase)
SpMSpV: per-processor computation (lower bound) m/p; per-processor communication latency height · α√p; per-processor communication bandwidth β(m/√p + n/√p)
Invert: per-processor computation (lower bound) n/p; per-processor communication latency height · α√p; per-processor communication bandwidth β · n/√p
α: latency (0.25 µs to 3.7 µs MPI latency on Edison); β: inverse bandwidth (~8 GB/s MPI bandwidth on Edison); p: number of processors
n: number of vertices; m: number of edges; height: maximum height of the BFS forest
[Figure: two strategies for augmenting paths that span multiple processes (P1, P2). Left: level-synchronous, BFS-style traversal. Right: one path per process, traversed with one-sided communication via MPI Remote Memory Access (RMA).]
Special treatments for long augmenting paths
• Platform: Edison (NERSC)
– 2.4 GHz Intel Ivy Bridge processors, 24 cores (2 sockets) and 64 GB RAM per node
– Cray Aries network with a Dragonfly topology (0.25 µs to 3.7 µs MPI latency, ~8 GB/s MPI bandwidth)
– Programming environment: C++ and Cray MPI, Combinatorial BLAS library (Buluç and Gilbert, 2011)
• Input graphs
– Real matrices from the Florida sparse matrix collection and randomly generated matrices
– Matrix to bipartite graph conversion: rows become vertices of one part, columns vertices of the other part, and nonzeros become edges (a small sketch follows this slide)
Results: experimental setup
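A small sketch of the matrix-to-bipartite-graph conversion mentioned above, assuming a SciPy sparse matrix as the input format.

```python
# Sketch: convert a sparse matrix to a bipartite graph adjacency structure.
# Rows become vertices of one part, columns vertices of the other part,
# and every nonzero A(i, j) becomes an edge (row i, column j).
import scipy.sparse as sp

def matrix_to_bipartite(A):
    A = sp.csc_matrix(A)                     # column-wise access
    col_adj = {}
    for j in range(A.shape[1]):
        col_adj[j] = list(A.indices[A.indptr[j]:A.indptr[j + 1]])
    return col_adj                           # col_adj[j] = rows adjacent to column j

A = sp.random(5, 5, density=0.4, format='csc', random_state=0)
print(matrix_to_bipartite(A))
```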
• Karp-Sipser obtains the highest cardinality for many practical problems, but it runs the slowest at high concurrency
• We found that dynamic mindegree + MCM often runs the fastest at high concurrency
Impact of initialization on MCM
[Bar charts: time (sec) on 1024 cores of Edison of maximal matching plus maximum matching for four initialization strategies (Dynamic Mindegree, Karp-Sipser, Greedy, No Init) on nlpkkt200, road_usa, GL7d19, and wikipedia.]
[Plot: MCM strong scaling on real matrices (ljournal, cage15, road_usa, nlpkkt200, hugetrace, delaunay_n24, HV15R); time (sec) vs. number of cores, up to 2048.]
MCM strong scaling (real matrices): 12x-18x speedups for an ~80x increase in cores, starting from 1 node (24 cores of Edison).
To appear: Azad and Buluç, IPDPS 2016
MCM strong scaling (G500 RMAT matrices)
[Plot: MCM strong scaling on G500 RMAT matrices (Scale-26 through Scale-30); time (sec) vs. number of cores, 64 to 16384.]
Scale-30 RMAT: 2 billion vertices, 32 billion edges. Scaling continues beyond 10K cores on the large matrices.
MCM: Breakdown of runtime
[Stacked bar chart: breakdown of MCM runtime on GL7d19 into Maximal, SpMSpV, Select, Invert, Prune, and Augment components; time (sec) vs. number of cores, 48 to 2,014.]
V1 = 1.91M, V2 = 1.96M, #edges = 37M
• Similar graph-matrix transformations apply to weighted matching algorithms.
• Auction algorithm ideas [ongoing work] (a sketch follows this slide)
– Bidders bid for their most profitable objects: SpMSpV with a (select2nd, max) semiring
– An object selects the best bidder from which it received a bid: inverted index
– Dual (price) updates can be done with vector operations
Ideas for weighted matching
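A minimal serial sketch of one auction round built from the ideas above; the price vector, the eps increment, and the dict-based data layout are illustrative assumptions for this ongoing-work direction, not an implemented algorithm from the talk.

```python
# Serial sketch of one auction round for weighted bipartite matching.
# weights[c][r] = w(c, r); prices[r] = current price of object (row) r;
# owner[r] = bidder currently holding r; mate[c] = object held by bidder c.
def auction_round(weights, prices, owner, mate, eps=0.1):
    bids = {}                                        # object r -> (best bid, bidder c)
    for c, nbrs in weights.items():
        if c in mate:                                # only unassigned bidders bid
            continue
        # Bid for the most profitable object: a (select2nd, max)-style reduction
        # over the profits w(c, r) - price[r].
        profits = sorted((w - prices[r], r) for r, w in nbrs.items())
        best_profit, r_best = profits[-1]
        second_profit = profits[-2][0] if len(profits) > 1 else best_profit
        bid = prices[r_best] + (best_profit - second_profit) + eps
        if r_best not in bids or bid > bids[r_best][0]:
            bids[r_best] = (bid, c)                  # each object keeps its best bid
    # Objects accept their best bidder (inverted-index style) and raise prices.
    for r, (bid, c) in bids.items():
        if r in owner:
            del mate[owner[r]]                       # previous holder becomes unassigned
        owner[r], mate[c], prices[r] = c, r, bid
    return len(bids)                                 # 0: no bids placed, auction finished

weights = {0: {0: 2.0, 1: 1.0}, 1: {0: 1.5, 1: 1.4}}
prices, owner, mate = {0: 0.0, 1: 0.0}, {}, {}
while auction_round(weights, prices, owner, mate):
    pass
print(mate)                                          # {0: 0, 1: 1}
```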
• Summary of contributions
– Methods: distributed-memory matching algorithms based on matrix algebra
– Performance: scales up to 10K cores on large graphs
– Algorithms are easy to implement using matrix-algebraic primitives
– Source code publicly available at: http://gauss.cs.ucsb.edu/~aydin/CombBLAS/html/
• Future work
– Distributed-memory weighted matching using matrix algebra
Summary
• A. Azad and A. Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. To appear, IPDPS 2016.
• A. Azad and A. Buluç. Distributed-memory algorithms for maximal cardinality matching using matrix algebra. IEEE CLUSTER 2015.
• Langguth et al. Parallel algorithms for bipartite matching problems on distributed memory computers. Parallel Computing, 2011.
• M. Patwary, R. Bisseling, F. Manne. Parallel greedy graph matching using an edge partitioning approach. HPPA 2010.
• M. Sathe, O. Schenk, H. Burkhart. An auction-based weighted matching implementation on massively parallel architectures. Parallel Computing, 2012.
Relevant references
Thanks for your attention
Supporting slides
• Used to initialize MCM
• Example: the dynamic mindegree algorithm
– Greedy and Karp-Sipser are formulated similarly (Azad and Buluç, 2015)
Maximal matching algorithms using matrix and vector operations
[Figure: three snapshots of the dynamic mindegree algorithm on a bipartite graph on x1-x3 and y1-y3, with the current degree (d = 2 or d = 3) annotated on each vertex as edges are matched and removed.]
Graph operation → matrix operation:
– Find a neighbor with minimum degree → SpMSpV with addition = min(degree)
– Match → inverted index
– Update degrees → SpMSpV with addition = plus
(a minimal sketch of one step follows)
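A minimal serial sketch of one dynamic mindegree step following the correspondence above; the dictionary layout and the row-only degree bookkeeping are simplifying assumptions made for the example, not the parallel implementation.

```python
# Serial sketch of one dynamic mindegree step (a simplified illustration:
# only row degrees are tracked, and ties are broken arbitrarily).
def mindegree_step(col_adj, row_deg, mate_col, mate_row):
    # SpMSpV with addition = min(degree): every unmatched column picks an
    # unmatched neighbor row of minimum current degree.
    choice = {}
    for c, rows in col_adj.items():
        if c in mate_col:
            continue
        live = [r for r in rows if r not in mate_row]
        if live:
            choice[c] = min(live, key=lambda r: row_deg[r])
    # Inverted index: if several columns pick the same row, the row keeps one column.
    picked = {}
    for c, r in choice.items():
        picked.setdefault(r, c)
    for r, c in picked.items():                    # match the surviving pairs
        mate_row[r], mate_col[c] = c, r
    # SpMSpV with addition = plus: update degrees of rows that lost a neighbor.
    for c in picked.values():
        for r in col_adj[c]:
            if r not in mate_row:
                row_deg[r] -= 1
    return len(picked)                             # 0: the matching is maximal

col_adj = {0: [0, 1], 1: [0], 2: [0, 2]}
row_deg = {0: 3, 1: 1, 2: 1}
mate_col, mate_row = {}, {}
while mindegree_step(col_adj, row_deg, mate_col, mate_row):
    pass
print(mate_col)                                    # {0: 1, 1: 0, 2: 2}
```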
Maximal matching strong scaling (randomly generated RMAT graphs)
Speedup for a 16x increase in cores (1,024 to 16,384):
Graph | #vertices | #edges | Greedy | Karp-Sipser | Dynamic Mindegree
RMAT-26 | 128 million | 2 billion | 3x | none | 6x
RMAT-28 | 512 million | 8 billion | 7x | 3x | 10x
RMAT-30 | 2 billion | 32 billion | 12x | 8x | 15x
Larger graphs give higher speedups.
Strong scaling: why does dynamic mindegree scale better?
(Same speedup table as above, for a 16x increase in cores from 1,024 to 16,384: Greedy 3x / 7x / 12x, Karp-Sipser 0x / 3x / 8x, Dynamic Mindegree 6x / 10x / 15x on RMAT-26 / RMAT-28 / RMAT-30.)
[Plots: % matched and % time vs. number of cores for the maximal matching algorithms.]
• For graph-based algorithms, matching quality decreases with increasing concurrency
Graph-based vs. Matrix-based parallel algorithms