Distributed-Memory Algorithms for Cardinality Matching using Matrix Algebra
Ariful Azad, Lawrence Berkeley National Laboratory
Joint work with Aydın Buluç (LBNL) Support: DOE Office of Science
SIAM PP 2016, Paris
• Matching: a subset of pairwise non-adjacent (independent) edges, i.e., at most one matched edge is incident on each vertex.
• Maximal cardinality matching: a matching to which no edge can be added without two matched edges sharing a vertex.
• Maximum cardinality matching (MCM): a matching with the largest possible cardinality over all matchings of the graph.
A matching in a graph
[Figure: two matchings of the same bipartite graph on vertices x1-x3 and y1-y3. Left: a maximal cardinality matching of cardinality 2; right: a maximum cardinality matching of cardinality 3. Legend: matched vertex, unmatched vertex, matched edge, unmatched edge.]
Applications of matching in scientific computing
[Diagram: "use" relationships between matching algorithms and scientific-computing applications. Bipartite cardinality matching, i.e., maximum cardinality matching (MCM) and maximal cardinality matching (1/2-approx.), is used by block triangular form (BTF), sparse QR, sparse LU, graph partitioning, and KLU. Bipartite weighted matching, i.e., maximum weighted matching (MWM, exact/approx.) and maximum-weight perfect matching (MWPM, exact/approx.), is used by AMG, least squares on HBS matrices, preconditioning, and HSS-based multifrontal solvers.]
• Problem: cardinality matching in a bipartite graph
– Maximum cardinality matching (MCM)
– Maximal cardinality matching (used to initialize MCM)
• Algorithm: distributed-memory parallel algorithms
• Approach: matrix-algebraic formulations of graph primitives, inspired by GraphBLAS (http://graphblas.org/)
– More discussion on Friday (MS68): The GraphBLAS Effort: Kernels, API, and Parallel Implementations, by Aydın Buluç
• Covers two recent papers:
– Maximal matching: Azad and Buluç, IEEE CLUSTER 2015
– Maximum matching: Azad and Buluç, IPDPS 2016
Scope of this talk
MCM algorithm based on augmenting-path searches
• Augmenting path: a path that alternates between unmatched and matched edges and whose two endpoints are both unmatched.
• Algorithm: repeatedly search for augmenting paths and flip the edges along each path, increasing the cardinality of the matching by one per path.
– Algorithmic options: single-source or multi-source search, breadth-first search (BFS) or depth-first search (DFS)
[Figure: a bipartite graph on x1-x3 and y1-y3 before and after augmenting the matching along an augmenting path.]
Algorithmic landscape of cardinality matching
Duff, Kaya and Uçar (ACM TOMS 2011); Azad, Buluç, Pothen (TPDS 2016)
Maximum cardinality matching:
– Single-source augmenting-path search, DFS or BFS: O(nm)
– Multi-source augmenting-path search, DFS with lookahead (Pothen-Fan): O(nm)
– Multi-source augmenting-path search, BFS (MS-BFS): O(nm) [our focus: exposes more parallelism]
– Multi-source augmenting-path search, DFS and BFS (Hopcroft-Karp): O(√n m) [best asymptotic complexity]
– Push-relabel, label-guided FIFO search: O(nm)
Maximal cardinality matching (used to initialize MCM):
– Greedy, Karp-Sipser
– Dynamic mindegree (local): O(m)
• When the graph does not fit in the memory of a single node
• When the graph is already distributed
– Example: static pivoting in SuperLU_DIST (Li and Demmel, 2003)
– Today the graph is gathered on a single node and MC64 computes the matching, which is expensive and does not scale
The need for distributed-memory algorithms
[Plot: time to gather the graph and scatter the matching on 2048 cores of NERSC/Edison (Cray XC30).]
Distributed algorithms are cheaper and more scalable.
• Prior work: push-relabel by Langguth et al. (2011) and Karp-Sipser on general graphs by Patwary et al. (2010)
– Neither scales beyond 64 processors
• Challenges
– Long augmenting paths pass through multiple processors
– Lots of fine-grained asynchronous communication
• Here we use the graph-matrix duality and design matching algorithms from scalable matrix and vector operations
– A handful of standard operations
– Offers bulk-synchronous parallelism
– Switching among algorithms is easier
Distributed-memory cardinality matching
Two required primitives: 1. Sparse matrix-sparse vector multiply (SpMSpV)
[Figure: SpMSpV on the matrix A of a bipartite graph G(R, C, E) with row vertices r1-r5 and column vertices c1-c5. The sparse input vector fc selects the unmatched columns; in each row, the minimum product over the selected columns is retained, producing the sparse output vector fr of visited rows. Semiring option: (multiply, add) is replaced by (select2nd, min). Graph operation: traverse unvisited neighbors; matrix operation: SpMSpV.]
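To make the primitive concrete, here is a minimal serial Python sketch (an illustration, not the CombBLAS implementation); the column-wise adjacency layout cols, the frontier dictionary, and the visited_rows set are assumptions made for the example.

```python
# Minimal serial sketch of SpMSpV over the (select2nd, min) semiring.
# cols[j] lists the row indices with a nonzero in column j of A;
# 'frontier' maps each selected (unmatched) column to the value it carries.
def spmspv_select2nd_min(cols, frontier, visited_rows):
    result = {}
    for j, val in frontier.items():        # select columns
        for i in cols[j]:                  # nonzeros A(i, j)
            if i in visited_rows:          # skip rows already in a BFS tree
                continue
            # multiply = select2nd: the "product" of A(i, j) and val is just val;
            # add = min: keep the smallest contribution per row
            if i not in result or val < result[i]:
                result[i] = val
    return result

# Columns 0 and 1 are unmatched and carry their own indices as values.
cols = {0: [0, 1], 1: [1, 2], 2: [2, 3], 3: [3, 4], 4: [0, 4]}
print(spmspv_select2nd_min(cols, {0: 0, 1: 1}, visited_rows=set()))
# {0: 0, 1: 0, 2: 1}: every reached row stores its minimum "parent" column
```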
Two required primitives: 2. Inverted index of a sparse vector
[Figure: inverting a sparse vector. Input: index = child, value = parent (e.g., the values 1 1 5 1 4 at indices 1-5); output: index = parent, value = child, with duplicates removed so each parent keeps one unique child. Graph operation: swap parents and children, i.e., swap matched and unmatched edges; vector operation: Invert.]
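A matching minimal sketch of the Invert primitive on a sparse vector stored as a dictionary; keeping the first child seen per parent is one way to realize the "keep unique child" rule, and the example reuses the values from the figure above.

```python
# Minimal sketch of the Invert primitive: swap indices and values of a sparse
# vector and remove duplicates (one unique child is kept per parent).
def invert(vec):
    inverted = {}
    for child, parent in vec.items():
        if parent not in inverted:        # keep a unique child per parent
            inverted[parent] = child
    return inverted

parents = {1: 1, 2: 1, 3: 5, 4: 1, 5: 4}  # index: child, value: parent
print(invert(parents))                    # {1: 1, 5: 3, 4: 5}  (index: parent, value: child)
```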
Multi-source BFS (MS-BFS) algorithm using matrix and vector operations
[Figure: (a) a maximal matching in a bipartite graph on x1-x6 and y1-y6; (b) the alternating BFS forest grown from the unmatched columns (roots of the BFS trees). Each level alternates a sparse matrix-sparse vector multiply (SpMSpV) with an inverted index on the matching vector; vertices already claimed by a tree are not explored again, keeping the trees vertex-disjoint.]
Step 1: discover vertex-disjoint augmenting paths.
MS-BFS algorithm using matrix and vector operations
[Figure: the matching of (a) augmented along the augmenting paths found in the BFS forest; the edge flips are expressed as two Invert operations.]
Step 2: augment the matching by flipping matched and unmatched edges along the augmenting paths.
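A minimal serial sketch of one phase combining the two steps follows, under the assumption that the matching is stored as the dictionaries mate_col and mate_row; it mirrors the structure described above (SpMSpV-style row claiming, Invert-style transitions, one augmenting path kept per tree) but is not the distributed implementation.

```python
def augment_phase(col_adj, mate_col, mate_row):
    """One MS-BFS phase: grow vertex-disjoint alternating trees from all
    unmatched columns at once, keep one augmenting path per tree, and flip it."""
    parent_col = {}               # row -> column that discovered it
    path_end = {}                 # tree root (unmatched column) -> unmatched row found
    frontier = {c: c for c in col_adj if c not in mate_col}  # column -> its tree root
    while frontier:
        # SpMSpV step: frontier columns claim their unvisited neighbor rows;
        # when several columns reach the same row, only one parent is kept.
        reached = {}
        for c, root in frontier.items():
            if root in path_end:
                continue                          # this tree already found a path
            for r in col_adj[c]:
                if r not in parent_col and r not in reached:
                    reached[r] = (c, root)
        frontier = {}
        for r, (c, root) in reached.items():
            parent_col[r] = c
            if r not in mate_row and root not in path_end:
                path_end[root] = r                # augmenting path: root ... -> r
            elif r in mate_row:
                frontier[mate_row[r]] = root      # Invert step: continue from the mate
    # Step 2: flip matched and unmatched edges along every discovered path.
    for root, r in path_end.items():
        c = parent_col[r]
        while True:
            prev_r = mate_col.get(c)              # c's previous mate (None at the root)
            mate_col[c], mate_row[r] = r, c
            if prev_r is None:
                break
            r, c = prev_r, parent_col[prev_r]
    return len(path_end)                          # number of paths augmented this phase

adj = {0: [0, 1], 1: [0, 2], 2: [1]}              # columns 0..2, rows 0..2
mate_col, mate_row = {0: 0}, {0: 0}               # start from an initial matching
while augment_phase(adj, mate_col, mate_row):
    pass
print(mate_col)                                   # maximum matching: {0: 0, 1: 2, 2: 1}
```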
Distributed-memory parallelization (SpMSpV)
[Figure: the frontier vector x and the matrix A distributed over a √p × √p grid of p processors; each step computes y = A x.]
ALGORITHM:
1. Gather frontier vertices within the processor column [communication]
2. Local multiplication [computation]
3. Find the owners of the current frontier's adjacency and exchange the adjacencies within the processor row [communication]
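A hedged mpi4py sketch of the three steps above on a √p × √p grid; the dict-based local data layout and the owner_of_row helper are illustrative assumptions, and the actual code uses Combinatorial BLAS in C++.

```python
# Illustrative mpi4py sketch of the 2D SpMSpV steps (not the CombBLAS code).
# Assumptions: p is a perfect square; A_local is the local submatrix as a dict
# {(i, j): value} with global indices; x_local is the local frontier piece;
# owner_of_row(i) returns the grid column (0..q-1) owning output row i.
from mpi4py import MPI

world = MPI.COMM_WORLD
p, rank = world.Get_size(), world.Get_rank()
q = int(p ** 0.5)                                      # sqrt(p) x sqrt(p) grid
grid_row, grid_col = divmod(rank, q)
row_comm = world.Split(color=grid_row, key=grid_col)   # processors in the same grid row
col_comm = world.Split(color=grid_col, key=grid_row)   # processors in the same grid column

def spmspv_2d(A_local, x_local, owner_of_row):
    # 1. Gather frontier vertices within the processor column [communication]
    x_full = {}
    for piece in col_comm.allgather(x_local):
        x_full.update(piece)
    # 2. Local multiplication over the (select2nd, min) semiring [computation]
    y_partial = {}
    for (i, j) in A_local:
        if j in x_full:
            val = x_full[j]                            # select2nd: take the vector value
            y_partial[i] = min(y_partial.get(i, val), val)
    # 3. Route each output entry to the owner of its row and merge,
    #    exchanging within the processor row [communication]
    outboxes = [dict() for _ in range(q)]
    for i, val in y_partial.items():
        outboxes[owner_of_row(i)][i] = val
    y_local = {}
    for piece in row_comm.alltoall(outboxes):
        for i, val in piece.items():
            y_local[i] = min(y_local.get(i, val), val)
    return y_local
```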
Shared-memory parallelization (SpMSpV)
[Figure: the local submatrix and frontier x split among threads; each thread owns a row band of height n/(t√p).]
• Explicitly split the local submatrices into t (= number of threads) pieces along the rows.
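A minimal sketch of the row-wise split under the same assumed dict layout; in Python the thread pool only illustrates the partitioning (each thread writes a disjoint row band, so no locking is needed), whereas the actual implementation threads within an MPI process in C++.

```python
# Sketch of the shared-memory SpMSpV: the local submatrix is pre-split into t
# row bands; each thread multiplies its own band, so output rows are disjoint.
from concurrent.futures import ThreadPoolExecutor

def split_rows(A_local, t, nrows):
    """Pre-split {(i, j): value} into t pieces along the rows."""
    bands = [dict() for _ in range(t)]
    for (i, j), v in A_local.items():
        bands[i * t // nrows][(i, j)] = v
    return bands

def band_spmspv(band, x_full):
    y = {}
    for (i, j) in band:
        if j in x_full:
            val = x_full[j]                    # (select2nd, min) semiring as before
            y[i] = min(y.get(i, val), val)
    return y

def spmspv_threaded(bands, x_full):
    with ThreadPoolExecutor(max_workers=len(bands)) as pool:
        partials = pool.map(band_spmspv, bands, [x_full] * len(bands))
    y = {}
    for part in partials:                      # row bands are disjoint: simple union
        y.update(part)
    return y
```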
Computation and communication time of discovering vertex-disjoint augmenting paths (one phase)
SpMSpV: per-processor computation (lower bound) m/p; per-processor communication latency height · α√p; per-processor communication bandwidth β(m/√p + n/√p)
Invert: per-processor computation (lower bound) n/p; per-processor communication latency height · α√p; per-processor communication bandwidth β · n/√p
α: latency (0.25 µs to 3.7 µs MPI latency on Edison); β: inverse bandwidth (~8 GB/s MPI bandwidth on Edison); p: number of processors
n: number of vertices; m: number of edges; height: maximum height of the BFS forest
[Figure: two strategies for augmenting paths that span multiple processes (P1, P2). Left: level-synchronous, BFS-style traversal. Right: one path per process, traversed with one-sided communication via MPI Remote Memory Access (RMA).]
Special treatments for long augmenting paths
• Platform: Edison (NERSC)
– 2.4 GHz Intel Ivy Bridge processors, 24 cores (2 sockets) and 64 GB RAM per node
– Cray Aries network with a Dragonfly topology (0.25 µs to 3.7 µs MPI latency, ~8 GB/s MPI bandwidth)
– Programming environment: C++ and Cray MPI, Combinatorial BLAS library (Buluç and Gilbert, 2011)
• Input graphs
– Real matrices from the Florida sparse matrix collection and randomly generated matrices
– Matrix to bipartite graph conversion: rows become vertices of one part, columns vertices of the other part, and nonzeros become edges (a small sketch follows this slide)
Results: experimental setup
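A small sketch of the matrix-to-bipartite-graph conversion mentioned above, assuming a SciPy sparse matrix as the input format.

```python
# Sketch: convert a sparse matrix to a bipartite graph adjacency structure.
# Rows become vertices of one part, columns vertices of the other part,
# and every nonzero A(i, j) becomes an edge (row i, column j).
import scipy.sparse as sp

def matrix_to_bipartite(A):
    A = sp.csc_matrix(A)                     # column-wise access
    col_adj = {}
    for j in range(A.shape[1]):
        col_adj[j] = list(A.indices[A.indptr[j]:A.indptr[j + 1]])
    return col_adj                           # col_adj[j] = rows adjacent to column j

A = sp.random(5, 5, density=0.4, format='csc', random_state=0)
print(matrix_to_bipartite(A))
```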
• Karp-Sipser obtains the highest cardinality for many practical problems, but it runs the slowest at high concurrency
• We found that dynamic mindegree + MCM often runs the fastest at high concurrency
Impact of initialization on MCM
[Bar charts: time (sec) on 1024 cores of Edison of maximal matching plus maximum matching for four initialization strategies (Dynamic Mindegree, Karp-Sipser, Greedy, No Init) on nlpkkt200, road_usa, GL7d19, and wikipedia.]
[Plot: MCM strong scaling on real matrices (ljournal, cage15, road_usa, nlpkkt200, hugetrace, delaunay_n24, HV15R); time (sec) vs. number of cores, up to 2048.]
MCM strong scaling (real matrices): 12x-18x speedups for an ~80x increase in cores, starting from 1 node (24 cores of Edison).
To appear: Azad and Buluç, IPDPS 2016
MCM strong scaling (G500 RMAT matrices)
[Plot: MCM strong scaling on G500 RMAT matrices (Scale-26 through Scale-30); time (sec) vs. number of cores, 64 to 16384.]
Scale-30 RMAT: 2 billion vertices, 32 billion edges. Scaling continues beyond 10K cores on the large matrices.
MCM: Breakdown of runtime
[Stacked bar chart: breakdown of MCM runtime on GL7d19 into Maximal, SpMSpV, Select, Invert, Prune, and Augment components; time (sec) vs. number of cores, 48 to 2,014.]
V1 = 1.91M, V2 = 1.96M, #edges = 37M
• Similar graph-matrix transformations apply to weighted matching algorithms.
• Auction algorithm ideas [ongoing work] (a sketch follows this slide)
– Bidders bid for their most profitable objects: SpMSpV with a (select2nd, max) semiring
– An object selects the best bidder from which it received a bid: inverted index
– Dual (price) updates can be done with vector operations
Ideas for weighted matching
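A minimal serial sketch of one auction round built from the ideas above; the price vector, the eps increment, and the dict-based data layout are illustrative assumptions for this ongoing-work direction, not an implemented algorithm from the talk.

```python
# Serial sketch of one auction round for weighted bipartite matching.
# weights[c][r] = w(c, r); prices[r] = current price of object (row) r;
# owner[r] = bidder currently holding r; mate[c] = object held by bidder c.
def auction_round(weights, prices, owner, mate, eps=0.1):
    bids = {}                                        # object r -> (best bid, bidder c)
    for c, nbrs in weights.items():
        if c in mate:                                # only unassigned bidders bid
            continue
        # Bid for the most profitable object: a (select2nd, max)-style reduction
        # over the profits w(c, r) - price[r].
        profits = sorted((w - prices[r], r) for r, w in nbrs.items())
        best_profit, r_best = profits[-1]
        second_profit = profits[-2][0] if len(profits) > 1 else best_profit
        bid = prices[r_best] + (best_profit - second_profit) + eps
        if r_best not in bids or bid > bids[r_best][0]:
            bids[r_best] = (bid, c)                  # each object keeps its best bid
    # Objects accept their best bidder (inverted-index style) and raise prices.
    for r, (bid, c) in bids.items():
        if r in owner:
            del mate[owner[r]]                       # previous holder becomes unassigned
        owner[r], mate[c], prices[r] = c, r, bid
    return len(bids)                                 # 0: no bids placed, auction finished

weights = {0: {0: 2.0, 1: 1.0}, 1: {0: 1.5, 1: 1.4}}
prices, owner, mate = {0: 0.0, 1: 0.0}, {}, {}
while auction_round(weights, prices, owner, mate):
    pass
print(mate)                                          # {0: 0, 1: 1}
```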
• Summary of contributions
– Methods: distributed-memory matching algorithms based on matrix algebra
– Performance: scales up to 10K cores on large graphs
– Algorithms are easy to implement using matrix-algebraic primitives
– Source code publicly available at: http://gauss.cs.ucsb.edu/~aydin/CombBLAS/html/
• Future work
– Distributed-memory weighted matching using matrix algebra
Summary
• A. Azad and A. Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. To appear, IPDPS 2016.
• A. Azad and A. Buluç. Distributed-memory algorithms for maximal cardinality matching using matrix algebra. IEEE CLUSTER 2015.
• Langguth et al. Parallel algorithms for bipartite matching problems on distributed memory computers. Parallel Computing, 2011.
• M. Patwary, R. Bisseling, F. Manne. Parallel greedy graph matching using an edge partitioning approach. HPPA 2010.
• M. Sathe, O. Schenk, H. Burkhart. An auction-based weighted matching implementation on massively parallel architectures. Parallel Computing, 2012.
Relevant references
Thanks for your attention
Supporting slides
• Used to initialize MCM
• Example: the dynamic mindegree algorithm
– Greedy and Karp-Sipser are formulated similarly (Azad and Buluç, 2015)
Maximal matching algorithms using matrix and vector operations
[Figure: three snapshots of the dynamic mindegree algorithm on a bipartite graph on x1-x3 and y1-y3, with the current degree (d = 2 or d = 3) annotated on each vertex as edges are matched and removed.]
Graph operation → matrix operation:
– Find a neighbor with minimum degree → SpMSpV with addition = min(degree)
– Match → inverted index
– Update degrees → SpMSpV with addition = plus
(a minimal sketch of one step follows)
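A minimal serial sketch of one dynamic mindegree step following the correspondence above; the dictionary layout and the row-only degree bookkeeping are simplifying assumptions made for the example, not the parallel implementation.

```python
# Serial sketch of one dynamic mindegree step (a simplified illustration:
# only row degrees are tracked, and ties are broken arbitrarily).
def mindegree_step(col_adj, row_deg, mate_col, mate_row):
    # SpMSpV with addition = min(degree): every unmatched column picks an
    # unmatched neighbor row of minimum current degree.
    choice = {}
    for c, rows in col_adj.items():
        if c in mate_col:
            continue
        live = [r for r in rows if r not in mate_row]
        if live:
            choice[c] = min(live, key=lambda r: row_deg[r])
    # Inverted index: if several columns pick the same row, the row keeps one column.
    picked = {}
    for c, r in choice.items():
        picked.setdefault(r, c)
    for r, c in picked.items():                    # match the surviving pairs
        mate_row[r], mate_col[c] = c, r
    # SpMSpV with addition = plus: update degrees of rows that lost a neighbor.
    for c in picked.values():
        for r in col_adj[c]:
            if r not in mate_row:
                row_deg[r] -= 1
    return len(picked)                             # 0: the matching is maximal

col_adj = {0: [0, 1], 1: [0], 2: [0, 2]}
row_deg = {0: 3, 1: 1, 2: 1}
mate_col, mate_row = {}, {}
while mindegree_step(col_adj, row_deg, mate_col, mate_row):
    pass
print(mate_col)                                    # {0: 1, 1: 0, 2: 2}
```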
Maximal matching strong scaling (randomly generated RMAT graphs)
Speedup for a 16x increase in cores (1,024 to 16,384):
Graph | #vertices | #edges | Greedy | Karp-Sipser | Dynamic Mindegree
RMAT-26 | 128 million | 2 billion | 3x | none | 6x
RMAT-28 | 512 million | 8 billion | 7x | 3x | 10x
RMAT-30 | 2 billion | 32 billion | 12x | 8x | 15x
Larger graphs give higher speedups.
Strong scaling: why does dynamic mindegree scale better?
(Same speedup table as above, for a 16x increase in cores from 1,024 to 16,384: Greedy 3x / 7x / 12x, Karp-Sipser 0x / 3x / 8x, Dynamic Mindegree 6x / 10x / 15x on RMAT-26 / RMAT-28 / RMAT-30.)
[Plots: % matched and % time vs. number of cores for the maximal matching algorithms.]
• For graph-based algorithms, matching quality decreases with increasing concurrency
Graph-based vs. Matrix-based parallel algorithms