HAL Id: tel-02377874
https://hal.inria.fr/tel-02377874
Submitted on 24 Nov 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

To cite this version: Bora Uçar. Partitioning, matching, and ordering: Combinatorial scientific computing with matrices and tensors. Computer Science [cs]. ENS de Lyon, 2019. tel-02377874.



MÉMOIRE D'HABILITATION À DIRIGER DES RECHERCHES

presented on 19 September 2019

at the École Normale Supérieure de Lyon

by

Bora Uçar

Partitioning, matching, and ordering: Combinatorial scientific computing with matrices and tensors

(Partitionnement, couplage et permutation : calcul scientifique combinatoire sur des matrices et des tenseurs)

Before a jury composed of:

Rob H. Bisseling (Utrecht University, the Netherlands), Rapporteur
Nadia Brauner (Université Joseph Fourier, Grenoble, France), Examinatrice
Karen D. Devine (Sandia National Labs, Albuquerque, New Mexico, USA), Rapporteuse
Ali Pınar (Sandia National Labs, Livermore, California, USA), Examinateur
Yves Robert (ENS Lyon, France), Examinateur
Denis Trystram (Grenoble INP, France), Rapporteur


Contents

1 Introduction
  1.1 Directed graphs
  1.2 Undirected graphs
  1.3 Hypergraphs
  1.4 Sparse matrices
  1.5 Sparse tensors

I Partitioning

2 Acyclic partitioning of directed acyclic graphs
  2.1 Related work
  2.2 Directed multilevel graph partitioning
    2.2.1 Coarsening
    2.2.2 Initial partitioning
    2.2.3 Refinement
    2.2.4 Constrained coarsening and initial partitioning
  2.3 Experiments
  2.4 Summary, further notes, and references

3 Distributed memory CP decomposition
  3.1 Background and notation
    3.1.1 Hypergraph partitioning
    3.1.2 Tensor operations
  3.2 Related work
  3.3 Parallelization
    3.3.1 Coarse-grain task model
    3.3.2 Fine-grain task model
  3.4 Experiments
  3.5 Summary, further notes, and references

II Matching and related problems

4 Matchings in graphs
  4.1 Scaling matrices to doubly stochastic form
  4.2 Approximation algorithms for bipartite graphs
    4.2.1 Related work
    4.2.2 Two matching heuristics
    4.2.3 Experiments
  4.3 Approximation algorithms for general undirected graphs
    4.3.1 Related work
    4.3.2 One-Out, its analysis, and variants
    4.3.3 Experiments
  4.4 Summary, further notes, and references

5 Birkhoff-von Neumann decomposition
  5.1 The minimum number of permutation matrices
    5.1.1 Preliminaries
    5.1.2 The computational complexity of MinBvNDec
  5.2 A result on the polytope of BvN decompositions
  5.3 Two heuristics
  5.4 Experiments
  5.5 A generalization of the Birkhoff-von Neumann theorem
    5.5.1 Doubly stochastic splitting
    5.5.2 Birkhoff-von Neumann decomposition for arbitrary matrices
  5.6 Summary, further notes, and references

6 Matchings in hypergraphs
  6.1 Background and notation
  6.2 The heuristics
    6.2.1 Greedy-H
    6.2.2 Karp-Sipser-H
    6.2.3 Karp-Sipser-H-scaling
    6.2.4 Karp-Sipser-H-mindegree
    6.2.5 Bipartite-reduction
    6.2.6 Local-Search
  6.3 Experiments
  6.4 Summary, further notes, and references

III Ordering matrices and tensors

7 Elimination trees and height-reducing orderings
  7.1 Background
  7.2 The critical path length and the elimination tree height
  7.3 Reducing the elimination tree height
    7.3.1 Theoretical basis
    7.3.2 Recursive approach: Permute
    7.3.3 BBT decomposition: PermuteBBT
    7.3.4 Edge-separator-based method
    7.3.5 Vertex-separator-based method
    7.3.6 BT decomposition: PermuteBT
  7.4 Experiments
  7.5 Summary, further notes, and references

8 Sparse tensor ordering
  8.1 Three sparse tensor storage formats
  8.2 Problem definition and solutions
    8.2.1 BFS-MCS
    8.2.2 Lexi-Order
    8.2.3 Analysis
  8.3 Experiments
  8.4 Summary, further notes, and references

IV Closing

9 Concluding remarks

Bibliography


List of Algorithms

1. CP-ALS for 3rd order tensors
2. Mttkrp for 3rd order tensors
3. Coarse-grain Mttkrp for the first mode of third order tensors at process p within CP-ALS
4. Fine-grain Mttkrp for the first mode of 3rd order tensors at process p within CP-ALS
5. ScaleSK: Sinkhorn-Knopp scaling
6. OneSidedMatch for bipartite graph matching
7. TwoSidedMatch for bipartite graph matching
8. Karp-Sipser-MT for bipartite 1-out graph matching
9. One-Out for undirected graph matching
10. Karp-Sipser-One-Out for undirected 1-out graph matching
11. GreedyBvN for constructing a BvN decomposition
12. Karp-Sipser-H-scaling for hypergraph matching
13. Row-LU(A)
14. Column-LU(A)
15. Permute(A)
16. FindStrongVertexSeparatorES(G)
17. FindStrongVertexSeparatorVS(G)
18. BFS-MCS for ordering a tensor dimension
19. Lexi-Order for ordering a tensor dimension



Chapter 1

Introduction

This document investigates three classes of problems at the interplay of discrete algorithms, combinatorial optimization, and numerical methods. The general research area is called combinatorial scientific computing (CSC). In CSC, the contributions have both practical and theoretical flavors. For all problems discussed in this document, we present the design, analysis, and implementation of algorithms, along with many experiments. The theoretical results are included in this document, some with proofs; the reader is invited to consult the original papers for the omitted proofs. A similar approach is taken for presenting the experiments: while most results for observing the theoretical findings in practice are included, the reader is referred to the original papers for some other results (e.g., run time analyses).

The three problem classes are those of partitioning, matching, and ordering. We cover two problems from the partitioning class, three from the matching class, and two from the ordering class. Below, we present these seven problems after introducing the notation and recalling the related definitions. We summarize the computing contexts in which the problems arise and highlight our contributions.

1.1 Directed graphs

A directed graph G = (V, E) is a set V of vertices and a set E of directed edges of the form e = (u, v), where e is directed from u to v. A path is a sequence of edges (u1, v1) · (u2, v2) · · · with vi = ui+1. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) is of length ℓ, and it connects a sequence of ℓ + 1 vertices (u1, v1 = u2, . . . , vℓ−1 = uℓ, vℓ). A path is called simple if the connected vertices are distinct. Let u ⇝ v denote a simple path that starts from u and ends at v. We say that a vertex v is reachable from another vertex u if a path connects them. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) forms a (simple) cycle if all vi for 1 ≤ i ≤ ℓ are distinct and u1 = vℓ. A directed acyclic graph, DAG in short, is a directed graph with no cycles. A directed graph is strongly connected if every vertex is reachable from every other vertex.

A directed graph G′ = (V′, E′) is said to be a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. We call G′ the subgraph of G induced by V′, denoted by G[V′], if E′ = E ∩ (V′ × V′). For a vertex set S ⊆ V, we define G − S as the subgraph of G induced by V − S, i.e., G[V − S]. A vertex set C ⊆ V is called a strongly connected component of G if the induced subgraph G[C] is strongly connected and C is maximal, in the sense that for all C′ ⊃ C, G[C′] is not strongly connected. A transitive reduction G0 is a minimal subgraph of G such that if u is connected to v in G, then u is connected to v in G0 as well. If G is acyclic, then its transitive reduction is unique.

The path u ⇝ v represents a dependency of v on u. We say that the edge (u, v) is redundant if there exists another u ⇝ v path in the graph. That is, when we remove a redundant edge (u, v), u remains connected to v, and hence the dependency information is preserved. We use Pred[v] = {u | (u, v) ∈ E} to represent the (immediate) predecessors of a vertex v, and Succ[v] = {u | (v, u) ∈ E} to represent the (immediate) successors of v. We call the neighbors of a vertex v its immediate predecessors and immediate successors: Neigh[v] = Pred[v] ∪ Succ[v]. For a vertex u, the vertices v such that u ⇝ v are called the descendants of u. Similarly, the vertices v such that v ⇝ u are called the ancestors of the vertex u. The vertices without any predecessors (and hence ancestors) are called the sources of G, and the vertices without any successors (and hence descendants) are called the targets of G. Every vertex u has a weight, denoted by wu, and every edge (u, v) ∈ E has a cost, denoted by cuv.

A k-way partitioning of a graph G = (V, E) divides V into k disjoint subsets V1, . . . , Vk. The weight of a part Vi, denoted by w(Vi), is equal to Σ_{u ∈ Vi} wu, which is the total vertex weight in Vi. Given a partition, an edge is called a cut edge if its endpoints are in different parts. The edge cut of a partition is defined as the sum of the costs of the cut edges. Usually, a constraint on the part weights accompanies the problem. We are interested in acyclic partitions, which are defined below.

Definition 1.1 (Acyclic k-way partition). A partition {V1, . . . , Vk} of G = (V, E) is called an acyclic k-way partition if two paths u ⇝ v and v′ ⇝ u′ do not co-exist for u, u′ ∈ Vi, v, v′ ∈ Vj, and 1 ≤ i ≠ j ≤ k.

In other words, for a suitable numbering of the parts, all edges should be directed from a vertex in a part p to another vertex in a part q with p ≤ q. Let us consider the toy example shown in Fig. 1.1a. A partition of the vertices of this graph is shown in Fig. 1.1b with a dashed curve. Since there is a cut edge from s to u and another from u to t, the partition is cyclic. An acyclic partition is shown in Fig. 1.1c, where all the cut edges are directed from one part to the other.
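To make the acyclicity condition concrete, the following sketch (an illustration in Python, not taken from the thesis) builds the quotient graph induced by a partition and tests it with Kahn's algorithm: a partition is acyclic exactly when this quotient graph admits a topological order.

```python
from collections import defaultdict

def is_acyclic_partition(vertices, edges, part):
    """Check whether a vertex partition of a DAG yields an acyclic quotient graph.

    vertices: iterable of vertex ids
    edges:    iterable of directed edges (u, v)
    part:     dict mapping each vertex to its part id
    """
    # Quotient (coarse) graph on the parts: one node per part, one arc per cut edge.
    parts = {part[v] for v in vertices}
    quotient = defaultdict(set)
    for u, v in edges:
        if part[u] != part[v]:
            quotient[part[u]].add(part[v])

    # The partition is acyclic iff the quotient graph has a topological order (Kahn).
    indeg = {p: 0 for p in parts}
    for p in quotient:
        for q in quotient[p]:
            indeg[q] += 1
    stack = [p for p in parts if indeg[p] == 0]
    seen = 0
    while stack:
        p = stack.pop()
        seen += 1
        for q in quotient[p]:
            indeg[q] -= 1
            if indeg[q] == 0:
                stack.append(q)
    return seen == len(parts)

# A tiny hypothetical 3-vertex DAG: putting a and c together while b sits alone
# creates the cycle {a,c} -> {b} -> {a,c}.
print(is_acyclic_partition("abc", [("a", "b"), ("b", "c"), ("a", "c")],
                           {"a": 0, "b": 1, "c": 0}))   # False
```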


[Figure 1.1: A toy graph with six vertices and six edges, a cyclic partitioning, and an acyclic partitioning. Panels: (a) a toy graph; (b) a partition ignoring the directions, which is cyclic; (c) an acyclic partitioning.]

There is a related notion in the literature [87], called a convex partition. A partition is convex if, for all vertex pairs u, v in the same part, the vertices in any u ⇝ v path are also in the same part. Hence, if a partition is acyclic, it is also convex. On the other hand, convexity does not imply acyclicity. Figure 1.2 shows that the definitions of an acyclic partition and a convex partition are not equivalent. For the toy graph in Fig. 1.2a, there are three possible balanced partitions, shown in Figures 1.2b, 1.2c, and 1.2d. They are all convex, but only the one in Fig. 1.2d is acyclic.

Deciding on the existence of a k-way acyclic partition respecting an upper bound on the part weights and an upper bound on the cost of cut edges is NP-complete [93, ND15]. In Chapter 2, we treat this problem, which is formalized as follows.

Problem 1 (DAGP: DAG partitioning). Given a directed acyclic graph G = (V, E) with weights on the vertices and an imbalance parameter ε, find an acyclic k-way partition {V1, . . . , Vk} of V such that the balance constraints

    w(Vi) ≤ (1 + ε) · w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut, which is equal to the number of edges whose end points are in two different parts, is minimized.

The DAGP problem arises in exposing parallelism in automatic differentiation [53, Ch. 9], and particularly in the computation of the Newton step for solving nonlinear systems [51, 52]. The DAGP problem with some additional constraints is used to reason about the parallel data movement complexity and to dynamically analyze the data locality potential [84, 87]. Other important applications of the DAGP problem include (i) fusing loops for improving temporal locality and enabling streaming and array contractions in runtime systems [144, 145]; (ii) analysis of cache efficient execution of streaming applications on uniprocessors [4]; (iii) a number of circuit design applications in which the signal directions impose an acyclic partitioning requirement [54, 213].

[Figure 1.2: A toy graph on vertices a, b, c, d, two cyclic and convex partitions, and an acyclic and convex partition. Panels: (a) a toy graph; (b) a cyclic and convex partition; (c) a cyclic and convex partition; (d) an acyclic and convex partition.]

In Chapter 2, we address the DAGP problem. We adopt the multilevel approach [35, 113] for this problem. This approach is the de-facto standard for solving graph and hypergraph partitioning problems efficiently, and it is used by almost all current state-of-the-art partitioning tools [27, 40, 113, 130, 186, 199]. The algorithms following this approach have three phases: the coarsening phase, which reduces the number of vertices by clustering them; the initial partitioning phase, which finds a partition of the coarsest graph; and the uncoarsening phase, in which the initial partition is projected to the finer graphs and refined along the way, until a solution for the original graph is obtained. We focus on two-way partitioning (sometimes called bisection), as this scheme can be used in a recursive way for multiway partitioning. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. We also propose effective ways to use the standard undirected graph partitioning methods in our multilevel scheme. We perform a large set of experiments on a dataset and report significant improvements compared to the current state of the art.

A rooted tree T is an ordered triple (V, r, ρ), where V is the vertex set, r ∈ V is the root vertex, and ρ gives the immediate ancestor/descendant relation. For two vertices vi, vj such that ρ(vj) = vi, we call vi the parent vertex of vj and call vj a child vertex of vi. With this definition, the root r is the ancestor of every other node in T. The children of a vertex are called sibling vertices to each other. The height of T, denoted by h(T), is the number of vertices in the longest path from a leaf to the root r.

Let G = (V, E) be a directed graph with n vertices, which are ordered (or labeled as 1, . . . , n). The elimination tree of a directed graph is defined as follows [81]. For any i ≠ n, ρ(i) = j whenever j > i is the minimum vertex id such that the vertices i and j lie in the same strongly connected component of the subgraph G[{1, . . . , j}]; that is, i and j are in the same strongly connected component of the subgraph containing the first j vertices, and j is the smallest such number. Currently, the algorithm with the best worst-case time complexity has a run time of O(m log n) [133]; another algorithm, with a worst-case time complexity of O(mn) [83], also performs well in practice. As the definition of the elimination tree implies, the properties of the tree depend on the ordering of the graph.
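A direct transcription of this definition into code may help fix ideas. The sketch below (an illustration, far from the O(m log n) state of the art) finds, for each j, the strongly connected component of vertex j inside G[{1, . . . , j}] by intersecting forward and backward reachability, and assigns j as the parent of every not-yet-assigned smaller vertex in that component.

```python
from collections import defaultdict

def elimination_tree(n, edges):
    """Elimination tree of a directed graph on ordered vertices 1..n.

    parent[i] is the minimum j > i such that i and j are in the same strongly
    connected component of the subgraph induced by {1, ..., j}. A simple
    O(n(n+m)) sketch; vertices absent from the result are roots.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def reach(src, limit, adj):
        # Vertices reachable from src using only vertices <= limit.
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v <= limit and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    parent = {}
    for j in range(1, n + 1):
        # SCC of vertex j inside G[{1,...,j}]: reachable from j and reaching j.
        component = reach(j, j, succ) & reach(j, j, pred)
        for i in component:
            if i < j and i not in parent:
                parent[i] = j   # j is the smallest qualifying vertex seen so far
    return parent
```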

The elimination tree is a recent model playing important roles in sparse LU factorization. This tree captures the dependencies between the tasks of some well-known variants of sparse LU factorization (we review these in Chapter 7). Therefore, the height of the elimination tree corresponds to the critical path length of the task dependency graph in the corresponding parallel LU factorization methods. In Chapter 7, we investigate the problem of finding minimum height elimination trees, formally defined below, to expose a maximum degree of parallelism by minimizing the critical path length.

Problem 2 (Minimum height elimination tree). Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The minimum height of an elimination tree of a given directed graph corresponds to a directed graph parameter known as the cycle-rank [80]. The problem being NP-complete [101], we propose heuristics in Chapter 7, which generalize the most successful approaches used for symmetric matrices to unsymmetric ones.

1.2 Undirected graphs

An undirected graph G = (V, E) is a set V of vertices and a set E of edges. Bipartite graphs are undirected graphs whose vertex set can be partitioned into two sets with all edges connecting vertices in the two parts. Formally, G = (VR ∪ VC, E) is a bipartite graph, where V = VR ∪ VC is the vertex set, and for each edge e = (r, c) we have r ∈ VR and c ∈ VC. The number of edges incident on a vertex is called its degree. A path in a graph is a sequence of vertices such that each consecutive vertex pair shares an edge. A vertex is reachable from another one if there is a path between them. The connected components of a graph are the equivalence classes of vertices under the "is reachable from" relation. A cycle in a graph is a path whose start and end vertices are the same. A simple cycle is a cycle with no vertex repetitions. A tree is a connected graph with no cycles. A spanning tree of a connected graph G is a tree containing all vertices of G.

A matching M in a graph G = (V, E) is a subset of edges E such that a vertex in V is in at most one edge in M. Given a matching M, a vertex v is said to be matched by M if v is in an edge of M; otherwise, v is called unmatched. If all the vertices are matched by M, then M is said to be a perfect matching. The cardinality of a matching M, denoted by |M|, is the number of edges in M. In Chapter 4, we treat the following problem, which asks to find approximate matchings.

Problem 3 (Approximating the maximum cardinality of a matching in undirected graphs). Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.

There are a number of polynomial time algorithms to solve the maximum cardinality matching problem exactly in graphs. The lowest worst-case time complexity of the known algorithms is O(√n τ) for a graph with n vertices and τ edges. For bipartite graphs, the first of such algorithms is described by Hopcroft and Karp [117]; certain variants of the push-relabel algorithm [96] also have the same complexity (we give a classification for maximum cardinality matchings in bipartite graphs in a recent paper [132]). For general graphs, algorithms with the same complexity are known [30, 92, 174]. There is considerable interest in simpler and faster algorithms that have some approximation guarantee [148]. Such algorithms are used as a jump-start routine by the current state of the art exact maximum matching algorithms [71, 148, 169]. Furthermore, there are applications [171] where large cardinality matchings are used.

The specializations of Problem 3 for bipartite graphs and for general undirected graphs are described in Sections 4.2 and 4.3, respectively. Our focus is on approximation algorithms which expose parallelism and have linear run time in practical settings. We propose randomized heuristics and analyze their results. The proposed heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum cardinality matching on the subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge is obtained by scaling a suitable matrix to be doubly stochastic. The heuristics for bipartite graphs and for general undirected graphs share this general approach and differ in the analysis. To the best of our knowledge, the proposed heuristics have the highest constant factor approximation ratio (around 0.86 for large graphs).
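The scaling step referred to here can be done with the Sinkhorn-Knopp iteration (ScaleSK in the list of algorithms). The dense NumPy sketch below is an illustration under the assumption that the matrix has nonzero rows and columns (total support is needed for convergence); it alternately normalizes column and row sums. Roughly speaking, the scaled entries then serve as the probabilities with which the heuristics pick one edge per vertex to form the random subgraph.

```python
import numpy as np

def sinkhorn_knopp(A, iters=100, tol=1e-6):
    """Scale a nonnegative matrix toward doubly stochastic form.

    Returns diagonal scaling vectors r, c such that diag(r) @ A @ diag(c)
    has row and column sums approximately equal to one.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    r, c = np.ones(n), np.ones(n)
    for _ in range(iters):
        c = 1.0 / (A.T @ r)        # make column sums of diag(r) A diag(c) equal 1
        r = 1.0 / (A @ c)          # make row sums equal 1
        S = (A * r[:, None]) * c[None, :]
        if np.max(np.abs(S.sum(axis=0) - 1.0)) < tol:
            break
    return r, c
```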

1.3 Hypergraphs

A hypergraph H = (V, E) is defined as a set of vertices V and a set of hyperedges E. Each hyperedge is a set of vertices. A matching in a hypergraph is a set of disjoint hyperedges.

A hypergraph H = (V, E) is called d-partite and d-uniform if V = V1 ∪ · · · ∪ Vd with disjoint Vi's, and every hyperedge contains a single vertex from each Vi. In Chapter 6, we investigate the problem of finding matchings in hypergraphs, formally defined as follows.

Problem 4 (Maximum matching in d-partite, d-uniform hypergraphs). Given a d-partite, d-uniform hypergraph, find a matching of maximum size.

The problem is NP-complete for d ≥ 3; the 3-partite case is called the Max-3-DM problem [126] and is an important problem in combinatorial optimization. It is a special case of the d-Set-Packing problem [110]. It has been shown that d-Set-Packing is hard to approximate within a factor of O(d/log d) [110]. The maximum/perfect set packing problem has many applications, including combinatorial auctions [98] and personnel scheduling [91]. Matchings in hypergraphs can also be used in the coarsening phase of multilevel hypergraph partitioning tools [40]. In particular, d-uniform and d-partite hypergraphs arise in modeling tensors [134], and heuristics for Problem 4 can be used to partition these hypergraphs by plugging them into the current state of the art partitioning tools. We propose heuristics for Problem 4. We first generalize some graph matching heuristics, and we then propose a novel heuristic based on tensor scaling to improve the quality via judicious hyperedge selections. Experiments on random, synthetic, and real-life hypergraphs show that this new heuristic is highly practical and superior to the others in finding a matching with large cardinality.
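As a baseline illustration of how simple such heuristics can be, the sketch below greedily keeps every hyperedge whose vertices are all still free, in the spirit of the Greedy-H heuristic of Chapter 6 (the actual variant there may differ, e.g., in how the hyperedges are ordered). Vertex identifiers are assumed to be distinct across the d parts.

```python
def greedy_hypergraph_matching(hyperedges):
    """One-pass greedy maximal matching in a hypergraph.

    hyperedges: iterable of vertex tuples; for a d-partite, d-uniform hypergraph,
    each tuple has one (globally unique) vertex per part.
    """
    matched = set()
    matching = []
    for e in hyperedges:
        if all(v not in matched for v in e):   # all endpoints still free
            matching.append(e)
            matched.update(e)
    return matching
```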

1.4 Sparse matrices

Bold upper case Roman letters are used for matrices, as in A. Matrix elements are shown with the corresponding lowercase letters, for example aij. Matrix sizes are sometimes shown in the lower right corner, e.g., AI×J. Matlab notation is used to refer to entire rows and columns of a matrix; e.g., A(i, :) and A(:, j) refer to the ith row and jth column of A, respectively. A permutation matrix is a square {0, 1} matrix with exactly one nonzero in each row and each column. A sub-permutation matrix is a square {0, 1} matrix with at most one nonzero in a row or a column.


A graph G = (V, E) can be represented as a sparse matrix AG, called the adjacency matrix. In this matrix, an entry aij = 1 if the vertices vi and vj in G are incident on a common edge, and aij = 0 otherwise. If G is bipartite with V = VR ∪ VC, we identify each vertex in VR with a unique row of AG and each vertex in VC with a unique column of AG. If G is not bipartite, a vertex is identified with a unique row-column pair. A directed graph GD = (V, E) with vertex set V and edge set E can be associated with an n × n sparse matrix A. Here, |V| = n, and for each aij ≠ 0 with i ≠ j, we have a directed edge from vi to vj.

We also associate graphs with matrices. For a given m × n matrix A, we can associate a bipartite graph G = (VR ∪ VC, E) so that the zero-nonzero pattern of A is equivalent to AG. If the matrix is square and has a symmetric zero-nonzero pattern, we can similarly associate an undirected graph by ignoring the diagonal entries. Finally, if the matrix A is square and has an unsymmetric zero-nonzero pattern, we can associate a directed graph GD = (V, E), where (vi, vj) ∈ E iff aij ≠ 0.

An n × n matrix A ≠ 0 is said to have support if there is a perfect matching in the associated bipartite graph. An n × n matrix A is said to have total support if each edge in its bipartite graph can be put into a perfect matching. A square sparse matrix is called irreducible if its directed graph is strongly connected. A square sparse matrix A is called fully indecomposable if, for a permutation matrix Q, the matrix B = AQ has a zero-free diagonal and the directed graph associated with B is irreducible. Fully indecomposable matrices have total support, but a matrix with total support could be a block diagonal matrix where each block is fully indecomposable. For more formal definitions of support, total support, and full indecomposability, see, for example, the book by Brualdi and Ryser [34, Ch. 3 and Ch. 4].

Let A = [aij] ∈ R^{n×n}. The matrix A is said to be doubly stochastic if aij ≥ 0 for all i, j and Ae = A^T e = e, where e is the vector of all ones. By Birkhoff's Theorem [25], there exist α1, α2, . . . , αk ∈ (0, 1) with Σ_{i=1}^{k} αi = 1 and permutation matrices P1, P2, . . . , Pk such that

    A = α1 P1 + α2 P2 + · · · + αk Pk .        (1.1)

This representation is also called the Birkhoff-von Neumann (BvN) decomposition. We refer to the scalars α1, α2, . . . , αk as the coefficients of the decomposition.

Doubly stochastic matrices and their associated BvN decompositions have been used in several operations research problems and applications. Classical examples are concerned with allocating communication resources, where an input traffic is routed to an output traffic in stages [43]. Each routing stage is a (sub-)permutation matrix and is used for handling a disjoint set of communications. The number of stages corresponds to the number of (sub-)permutation matrices. A recent variant of this allocation problem arises as a routing problem in data centers [146, 163]. Heuristics for the BvN decomposition have been used in designing business strategies for e-commerce retailers [149]. A display matrix (which shows what needs to be displayed in total) is constructed, which happens to be doubly stochastic. Then a BvN decomposition of this matrix is obtained, in which each permutation matrix used is a display shown to a shopper. A small number of permutation matrices is preferred; furthermore, only slight changes between consecutive permutation matrices are desired (we do not deal with this last issue in this document).

The BvN decomposition of a doubly stochastic matrix as a convex combination of permutation matrices is not unique in general. The Marcus-Ree Theorem [170] states that there is always a decomposition with k ≤ n² − 2n + 2 permutation matrices for dense n × n matrices. Brualdi and Gibson [33] and Brualdi [32] show that for a fully indecomposable n × n sparse matrix with τ nonzeros, one has the equivalent expression k ≤ τ − 2n + 2. In Chapter 5, we investigate the problem of finding the minimum number k of permutation matrices in the representation (1.1). More formally, the MinBvNDec problem is defined as follows.

Problem 5 (MinBvNDec: a BvN decomposition with the minimum number of permutation matrices). Given a doubly stochastic matrix A, find a BvN decomposition (1.1) of A with the minimum number k of permutation matrices.

Brualdi [32, p. 197] investigates the same problem and concludes that it is a difficult problem. We continue along this line and show that the MinBvNDec problem is NP-hard. We also propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries, and we use this generalization for designing preconditioners for iterative linear system solvers.
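To give a feel for how such a decomposition can be built, here is a plain greedy sketch: repeatedly find a perfect matching on the nonzero pattern, use its smallest entry as the coefficient, and subtract. This is only an illustration under the assumption that the input is (numerically) doubly stochastic; it is not the GreedyBvN heuristic of Chapter 5, which additionally chooses a bottleneck matching to maximize the coefficient peeled off at each step.

```python
import numpy as np

def perfect_matching_on_support(A):
    """Perfect matching using only nonzero entries of the square matrix A.

    Kuhn-style augmenting-path search; returns match_col with match_col[i] the
    column matched to row i (or -1 if no perfect matching exists).
    """
    n = A.shape[0]
    match_row = [-1] * n                      # match_row[j] = row matched to column j

    def augment(i, visited):
        for j in range(n):
            if A[i, j] > 0 and j not in visited:
                visited.add(j)
                if match_row[j] == -1 or augment(match_row[j], visited):
                    match_row[j] = i
                    return True
        return False

    for i in range(n):
        augment(i, set())
    match_col = [-1] * n
    for j, i in enumerate(match_row):
        if i >= 0:
            match_col[i] = j
    return match_col

def greedy_bvn(A, tol=1e-12):
    """Greedily peel permutation matrices off a doubly stochastic matrix A.

    Returns (alphas, perms) with A ~= sum_k alphas[k] * P_k, where P_k sends
    row i to column perms[k][i].
    """
    A = np.array(A, dtype=float)
    alphas, perms = [], []
    while A.max() > tol:
        perm = perfect_matching_on_support(A)
        if -1 in perm:                        # input was not doubly stochastic
            break
        alpha = min(A[i, j] for i, j in enumerate(perm))
        for i, j in enumerate(perm):
            A[i, j] -= alpha                  # the residual stays a scaled doubly
        alphas.append(alpha)                  # stochastic matrix, so the loop can go on
        perms.append(perm)
    return alphas, perms
```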

1.5 Sparse tensors

Tensors are multidimensional arrays, generalizing matrices to higher orders. We use calligraphic font to refer to tensors, e.g., X. Let X be a d-dimensional tensor whose size is n1 × · · · × nd. As in matrices, an element of a tensor is denoted by a lowercase letter and subscripts corresponding to the indices of the element, e.g., x_{i1···id}, where ij ∈ {1, . . . , nj}. A marginal of a tensor is a (d−1)-dimensional slice of a d-dimensional tensor, obtained by fixing one of its indices. A d-dimensional tensor where the entries in each of its marginals sum to one is called d-stochastic. In a d-stochastic tensor, all dimensions necessarily have the same size n. A d-stochastic tensor where each marginal contains exactly one nonzero entry (equal to one) is called a permutation tensor.

A d-partite, d-uniform hypergraph H = (V1 ∪ · · · ∪ Vd, E) corresponds naturally to a d-dimensional tensor X by associating each vertex class with a dimension of X. Let |Vi| = ni, and let X ∈ {0, 1}^{n1×···×nd} be a tensor with nonzero elements x_{v1···vd} corresponding to the edges (v1, . . . , vd) of H. In this case, X is called the adjacency tensor of H.

Tensors are used in modeling multifactor or multirelational datasets, such as product reviews at an online store [21], natural language processing [123], and multi-sensor data in signal processing [49], among many others [142]. Tensor decomposition algorithms are used as an important tool to understand the tensors and glean hidden or latent information. One of the most common tensor decomposition approaches is the Candecomp/Parafac (CP) formulation, which approximates a given tensor as a sum of rank-one tensors. More precisely, for a given positive integer R, called the decomposition rank, the CP decomposition approximates a d-dimensional tensor by a sum of R outer products of column vectors (one for each dimension). In particular, for a three dimensional I × J × K tensor X, CP asks for a rank-R approximation X ≈ Σ_{r=1}^{R} a_r ∘ b_r ∘ c_r, with vectors a_r, b_r, and c_r of size I, J, and K, respectively. Here, ∘ denotes the outer product of the vectors, and hence x_{ijk} ≈ Σ_{r=1}^{R} a_r(i) · b_r(j) · c_r(k). By aggregating these column vectors, one forms the matrices A, B, and C of size I × R, J × R, and K × R, respectively. The matrices A, B, and C are called the factor matrices.

The computational core of the most common CP decomposition methods is an operation called the matricized tensor-times-Khatri-Rao product (Mttkrp). For the same tensor X, this core operation can be described succinctly as

    foreach x_{ijk} ∈ X do
        M_A(i, :) ← M_A(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]        (1.2)

where B and C are input matrices of sizes J × R and K × R, respectively, and M_A is the output matrix of size I × R, which is initialized to zero before the loop. For each tensor nonzero x_{ijk}, the jth row of the matrix B is multiplied element-wise with the kth row of the matrix C; the result is scaled with the tensor entry x_{ijk} and added to the ith row of M_A. In the well-known CP decomposition algorithm based on alternating least squares (CP-ALS), the two factor matrices B and C are initialized at the beginning; then the factor matrix A is computed by multiplying M_A with the (pseudo-)inverse of the Hadamard product of B^T B and C^T C. Since B^T B and C^T C are of size R × R, their multiplication and inversion do not pose any computational challenges. Afterwards, the same steps are performed to compute a new B, and then a new C, and this process of computing the new factors one at a time, by assuming the others fixed, is repeated iteratively. In Chapter 3, we investigate how to efficiently parallelize the Mttkrp operations within the CP-ALS algorithm on distributed memory systems. In particular, we investigate the following problem.

Problem 6 (Parallelizing CP decomposition). Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

More precisely, in Chapter 3, we describe parallel implementations of CP-ALS and show how the parallelization problem can be cast as a partitioning problem on hypergraphs. In this document, we do not develop heuristics for the hypergraph partitioning problem; we describe hypergraphs modeling the computations and make use of the well-established tools for partitioning these hypergraphs. The Mttkrp operation also arises in the gradient descent based methods [2] to compute the CP decomposition. Furthermore, it has been implemented as a stand-alone routine [17] to enable algorithm development for variants of the core methods [158]. That is why it has been investigated for efficient execution in different computational frameworks, such as Matlab [11, 18], MapReduce [123], shared memory [206], and distributed memory [47].
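A direct NumPy transcription of the loop (1.2) for a third-order tensor in coordinate (COO) form may help fix ideas; this is an illustrative sequential sketch, not the distributed-memory implementation of Chapter 3.

```python
import numpy as np

def mttkrp_first_mode(coords, vals, B, C, I):
    """Mttkrp for the first mode of a 3rd-order sparse tensor in COO form.

    coords: list of (i, j, k) index triples of the nonzeros
    vals:   corresponding nonzero values x_ijk
    B, C:   factor matrices of sizes J x R and K x R
    Returns M_A of size I x R, computed as in the loop (1.2).
    """
    R = B.shape[1]
    MA = np.zeros((I, R))
    for (i, j, k), x in zip(coords, vals):
        # Row-wise Hadamard product of B(j,:) and C(k,:), scaled by x_ijk.
        MA[i, :] += x * (B[j, :] * C[k, :])
    return MA

# One CP-ALS half-step for the first factor, as described above (a sketch):
#   MA = mttkrp_first_mode(coords, vals, B, C, I)
#   A = MA @ np.linalg.pinv((B.T @ B) * (C.T @ C))
```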

Consider again the Mttkrp operation and the execution of the operations as shown in the body of the displayed for-loop (1.2). As seen in this loop, the rows of the factor matrices and the output matrix are accessed many times, according to the sparsity pattern of the tensor X. In a cache-based system, the computations will be much more efficient if we take advantage of spatial and temporal locality. This would be achieved if we could reorder the nonzeros such that the multiplications involving a given row (of M_A, B, and C) are performed one after another. We need this to hold when executing the Mttkrps for computing the new factor matrices B and C as well. The issue is more easily understood in terms of the sparse matrix-vector multiplication operation y ← Ax. If we have two nonzeros in the same row, aij and aik, we would prefer an ordering that maps j and k next to each other and performs the associated scalar multiplications a_{ij′} x_{j′} and a_{ik′} x_{k′} one after the other, where j′ is the new index id for j and k′ is the new index id for k. Furthermore, we would like the same sort of locality for all indices, and for the multiplication x ← A^T y as well. We identify two related problems here. One of them is the choice of data structure for taking advantage of shared indices, and the other is to reorder the indices so that nonzeros sharing an index in one dimension have close-by indices in the other dimensions. In this document, we do not propose a data structure; our aim is to reorder the indices such that the locality is improved for three common data structures used in the current state of the art. We summarize those three common data structures in Chapter 8 and investigate the following ordering problem. Since the overall problem depends on the data structure choice, we state the problem in a general language.

Problem 7 (Tensor ordering). Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

If we consider the same problem for matrices, it can be cast as ordering the rows and the columns of a given matrix so that the nonzeros are clustered around the diagonal. This is also what we seek in the tensor case, where different data structures can benefit from the same ordering thanks to different effects. In Chapter 8, we investigate this problem algorithmically and develop fast heuristics. While the problem in the case of sparse matrices has been investigated from different angles, to the best of our knowledge, ours is the first study proposing general ordering tools for tensors, which are analogues of the well-known and widely used sparse matrix ordering heuristics such as Cuthill-McKee [55] and reverse Cuthill-McKee [95].
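For the matrix analogue mentioned here, a Cuthill-McKee-style ordering is easy to sketch. The following is an illustrative, simplified reverse Cuthill-McKee pass over an undirected graph given as adjacency lists; it is not one of the tensor ordering methods of Chapter 8, only the classical matrix heuristic they are analogues of.

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Reverse Cuthill-McKee ordering of an undirected graph.

    adj: dict mapping each vertex to a list of its neighbors. The returned vertex
    order tends to cluster the nonzeros of the corresponding symmetric matrix
    around the diagonal.
    """
    order, visited = [], set()
    # Handle every connected component, starting from a minimum-degree vertex.
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in sorted(adj[u], key=lambda x: len(adj[x])):
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
    return order[::-1]   # reversing the BFS (Cuthill-McKee) order gives RCM
```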

Part I

Partitioning


Chapter 2

Multilevel algorithms for acyclic partitioning of directed acyclic graphs

In this chapter, we address the directed acyclic graph partitioning (DAGP) problem. Most of the terms and definitions are standard, and we summarized them in Section 1.1. The formal problem definition is repeated below for convenience.

DAG partitioning problem. Given a DAG G = (V, E) and an imbalance parameter ε, find an acyclic k-way partition {V1, . . . , Vk} of V such that the balance constraints

    w(Vi) ≤ (1 + ε) · w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut is minimized.

We adopt the multilevel approach [35, 113], with the coarsening, initial partitioning, and refinement phases, for the acyclic partitioning of DAGs. We propose heuristics for these three phases (Sections 2.2.1, 2.2.2, and 2.2.3, respectively), which guarantee the acyclicity of the partitions at all phases and maintain a DAG at every level. We strove to have fast heuristics at the core. With these characterizations, the coarsening phase requires new algorithmic/theoretical reasoning, while the initial partitioning and refinement heuristics are direct adaptations of the standard methods used in undirected graph partitioning, with some differences worth mentioning. We discuss only the bisection case, as we were able to improve the direct k-way algorithms we proposed before [114] by using the bisection heuristics recursively.

The acyclicity constraint on the partitions forbids the use of the state of the art undirected graph partitioning tools. This has been recognized before, and those tools were put aside [114, 175]. While this is true, one can still try to make use of the existing undirected graph partitioning tools [113, 130, 186, 199], as they have been very well engineered. Let us assume that we have partitioned a DAG with an undirected graph partitioning tool into two parts by ignoring the directions. It is easy to detect whether the partition is cyclic, since all the edges need to go from the first part to the second one. If a partition is cyclic, we can easily fix it as follows. Let v be a vertex in the second part; we can move all vertices to which there is a path from v into the second part. This procedure breaks any cycle containing v, and hence the partition becomes acyclic. However, the edge cut may increase, and the partitions can be unbalanced. To solve the balance problem and reduce the cut, we can apply move-based refinement algorithms. After this step, the final partition meets the acyclicity and balance conditions. Depending on the structure of the input graph, it could also be a good initial partition for reducing the edge cut. Indeed, one of our most effective schemes uses an undirected graph partitioning algorithm to create a (potentially cyclic) partition, fixes the cycles in the partition, and refines the resulting acyclic partition with a novel heuristic to obtain an initial partition. We then integrate this partition within the proposed coarsening approaches to refine it at different granularities.

2.1 Related work

Fauzia et al. [87] propose a heuristic for the acyclic partitioning problem to optimize data locality when analyzing DAGs. To create partitions, the heuristic categorizes a vertex as ready to be assigned to a partition when all of the vertices it depends on have already been assigned. Vertices are assigned to the current partition set until the maximum number of vertices that would be "active" during the computation of the part reaches a specified limit, which is the cache size in their application. This implies that a part size is not limited by the sum of the total vertex weights in that part, but is a complex function that depends on an external schedule (order) of the vertices. This differs from our problem, as we limit the size of each part by the total sum of the weights of the vertices in that part.

Kernighan [137] proposes an algorithm to find a minimum edge-cut partition of the vertices of a graph into subsets whose sizes are greater than a lower bound and smaller than an upper bound. The partition needs to use a fixed vertex sequence that cannot be changed. Indeed, Kernighan's algorithm takes a topological order of the vertices of the graph as an input and partitions the vertices such that all vertices in a subset constitute a continuous block in the given topological order. This procedure is optimal for a given, fixed topological order and has a run time proportional to the number of edges in the graph, if the part weights are taken as constant. We used a modified version of this algorithm as a heuristic in the earlier version of our work [114].

Cong et al. [54] describe two approaches for obtaining acyclic partitions of directed Boolean networks modeling circuits. The first one is a single-level Fiduccia-Mattheyses (FM)-based approach. In this approach, Cong et al. generate an initial acyclic partition by splitting the list of the vertices (in a topological order) from left to right into k parts, such that the weight of each part does not violate the bound. The quality of the results is then improved with a k-way variant of the FM heuristic [88], taking the acyclicity constraint into account. Our previous work [114] employs a similar refinement heuristic. The second approach of Cong et al. is a two-level heuristic: the initial graph is first clustered with a special decomposition, and then it is partitioned using the first heuristic.

In a recent paper [175], Moreira et al. focus on an imaging and computer vision application on embedded systems and discuss acyclic partitioning heuristics. They propose a single level approach in which an initial acyclic partitioning is obtained using a topological order. To refine the partitioning, they propose four local search heuristics which respect the balance constraint and maintain the acyclicity of the partition. Three heuristics pick a vertex and move it to an eligible part if and only if the move improves the cut. These three heuristics differ in choosing the set of eligible parts for each vertex; some are very restrictive, and some allow arbitrary target parts as long as acyclicity is maintained. The fourth heuristic tentatively realizes the moves that increase the cut, in order to escape from a possible local minimum. It has been reported that this heuristic delivers better results than the others. In a follow-up paper, Moreira et al. [176] discuss a multilevel graph partitioner and an evolutionary algorithm based on this multilevel scheme. Their multilevel scheme starts with a given acyclic partition. Then, the coarsening phase contracts edges that are in the same part until there is no edge to contract. Here, matching-based heuristics from undirected graph partitioning tools are used, without taking the directions of the edges into account. Therefore, the coarsening phase can create cycles in the graph; however, the induced partitions are never cyclic. Then, an initial partition is obtained, which is refined during the uncoarsening phase with move-based heuristics. In order to guarantee acyclic partitions, the vertices that lie in cycles are not moved. In a systematic evaluation of the proposed methods, Moreira et al. note that there are many local minima and suggest using relaxed constraints in the multilevel setting. The proposed methods have high run time, as the evolutionary method of Moreira et al. is not concerned with this issue. Improvements with respect to the earlier work [175] are reported.

We propose methods to use an undirected graph partitioner to guide the multilevel partitioner. We focus on partitioning the graph into two parts, since one can handle the general case with a recursive bisection scheme. We also propose new coarsening, initial partitioning, and refinement methods specifically designed for the 2-partitioning problem. Our multilevel scheme maintains acyclic partitions and graphs through all the levels.

2.2 Directed multilevel graph partitioning

Here we summarize the proposed methods for the three phases of the multilevel scheme.

2.2.1 Coarsening

In this phase, we obtain smaller DAGs by coalescing the vertices, level by level. This phase continues until the number of vertices becomes smaller than a specified bound, or until the reduction in the number of vertices from one level to the next falls below a threshold. At each level ℓ, we start with a finer acyclic graph Gℓ, compute a valid clustering Cℓ ensuring the acyclicity, and obtain a coarser acyclic graph Gℓ+1. While our previous work [114] discussed matching based algorithms for coarsening, we present agglomerative clustering based variants here. The new variants supersede the matching based ones. Unlike the standard undirected graph case, in DAG partitioning not all vertices can be safely combined. Consider a DAG with three vertices a, b, c and three edges (a, b), (b, c), (a, c). Here, the vertices a and c cannot be combined, since that would create a cycle. We say that a set of vertices is contractible (all its vertices are matchable) if unifying them does not create a cycle. We now present a general theory about finding clusters without forming cycles, after giving some definitions.

Definition 2.1 (Clustering). A clustering of a DAG is a set of disjoint subsets of vertices. Note that we do not make any assumptions on whether the subsets are connected or not.

Definition 2.2 (Coarse graph). Given a DAG G and a clustering C of G, we let G|C denote the coarse graph created by contracting all sets of vertices of C.

The vertices of the coarse graph are the clusters in C. If (u, v) is an edge of G for two vertices u and v that are located in different clusters of C, then G|C has a (directed) edge from the vertex corresponding to u's cluster to the vertex corresponding to v's cluster.

Definition 2.3 (Feasible clustering). A feasible clustering C of a DAG G is a clustering such that G|C is acyclic.

Theorem 2.1. Let G = (V, E) be a DAG. For u, v ∈ V and (u, v) ∈ E, the coarse graph G|{u,v} is acyclic if and only if every path from u to v in G contains the edge (u, v).


Theorem 2.1, whose proof is in the journal version of this chapter [115], can be extended to a set of vertices by noting that, this time, all paths connecting two vertices of the set should contain only vertices of the set. Neither the theorem nor its extension implies an efficient algorithm, as it requires at least one transitive reduction. Furthermore, it does not describe a condition about two clusters forming a cycle, even if both are individually contractible. In order to address both of these issues, we put a constraint on the vertices that can form a cluster, based on the following definition.

Definition 2.4 (Top level value). For a DAG G = (V, E), the top level value of a vertex u ∈ V is the length of the longest path from a source of G to that vertex. The top level values of all vertices can be computed in a single traversal of the graph, with a complexity of O(|V| + |E|). We use top[u] to denote the top level of the vertex u.
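The single-traversal computation mentioned in the definition amounts to a longest-path sweep in topological order; a small illustrative sketch follows.

```python
from collections import deque, defaultdict

def top_levels(vertices, edges):
    """Top level of each vertex of a DAG: longest-path distance from any source.

    One Kahn-style topological sweep, O(|V| + |E|); sources get top level 0.
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    top = {v: 0 for v in vertices}
    queue = deque(v for v in vertices if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            top[v] = max(top[v], top[u] + 1)   # longest path so far
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return top
```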

By restricting the set of edges considered in the clustering to the edges (u, v) ∈ E such that top[u] + 1 = top[v], we ensure that no cycles are formed by contracting a unique cluster (the condition identified in Theorem 2.1 is satisfied). Let C be a clustering of the vertices. Every edge in a cluster of C being contractible is a necessary condition for C to be feasible, but not a sufficient one. More restrictions on the edges of the vertices inside the clusters are needed to ensure that C is feasible. We propose three coarsening heuristics, based on clustering sets of more than two vertices whose pair-wise top level differences are always zero or one.

CoTop: Acyclic clustering with forbidden edges

To have an efficient clustering heuristic, we restrict ourselves to static information computable in linear time. As stated in the introduction of this section, we rely on a top level difference of one (or less) for all vertices in the same cluster, and on an additional condition to ensure that there will be no cycles when a number of clusters are contracted simultaneously. In Theorem 2.2 below, we give two sufficient conditions for a clustering to be feasible (that is, the graphs at all levels are DAGs); the proof is in the associated journal paper [115].

Theorem 2.2 (Correctness of the proposed clustering). Let G = (V, E) be a DAG and C = {C1, . . . , Ck} be a clustering. If C is such that

• for any cluster Ci, for all u, v ∈ Ci, |top[u] − top[v]| ≤ 1;

• for two different clusters Ci and Cj, and for all u ∈ Ci and v ∈ Cj, either (u, v) ∉ E or top[u] ≠ top[v] − 1;

then the coarse graph G|C is acyclic.


We design a heuristic based on Theorem 2.2. This heuristic visits all vertices, which are singletons at the beginning, in an order, and adds the visited vertex to a cluster if the criteria mentioned in Theorem 2.2 are met; if not, the vertex stays as a singleton. Among the clusters satisfying the mentioned criteria, the one which saves the largest edge weight (incident on the vertices of the cluster and the vertex being visited) is chosen as the target cluster. We discuss a set of bookkeeping operations in order to implement this heuristic in linear, O(|V| + |E|), time [115, Algorithm 1].

CoCyc: Acyclic clustering with cycle detection

We now propose a less restrictive clustering algorithm which also maintains the acyclicity. We again rely on a top level difference of one (or less) for all vertices in the same cluster. Knowing this invariant, when a new vertex is added to a cluster, a cycle-detection algorithm checks that no cycles are formed when all the clusters are contracted simultaneously. This algorithm does not traverse the entire graph, thanks again to the fact that the top level difference within a cluster is at most one. Nonetheless, detecting such cycles could require a full traversal of the graph.

From Theorem 2.2, we know that if adding a vertex to a cluster whose vertices' top level values are t and t + 1 creates a cycle in the contracted graph, then this cycle goes through the vertices with top level values t or t + 1. Thus, when considering the addition of a vertex u to a cluster C containing v, we check potential cycle formations by traversing the graph starting from u in a breadth-first search (BFS) manner. Let t denote the minimum top level in C. At a vertex w, we normally add a successor y of w to a queue (as in BFS) if |top(y) − t| ≤ 1; if w is in the same cluster as one of its predecessors x, we also add x to the queue if |top(x) − t| ≤ 1. We detect a cycle if, at some point in the traversal, a vertex from cluster C is reached; if not, the cluster can be extended by adding u. In the worst case, the cycle detection algorithm completes a full graph traversal, but in practice it stops quickly and does not introduce a significant overhead. The details of this clustering scheme, whose worst-case run time complexity is O(|V|(|V| + |E|)), can be found in the original paper [115, Algorithm 2].

CoHyb: Hybrid acyclic clustering

The cycle detection based algorithm can suffer from quadratic run time for vertices with large in-degrees or out-degrees. To avoid this, we design a hybrid acyclic clustering which uses the relaxed clustering strategy with cycle detection, CoCyc, by default, and switches to the more restrictive clustering strategy, CoTop, for large degree vertices. We define a limit on the degree of a vertex (typically √|V|/10) for calling it large degree. When considering an edge (u, v) where top[u] + 1 = top[v], if the degrees of u and v do not exceed the limit, we use the cycle detection algorithm to determine whether we can contract the edge. Otherwise, if the out-degree of u or the in-degree of v is too large, the edge will be contracted if it is allowed. The complexity of this algorithm is in between those of CoTop and CoCyc, and it will likely avoid the quadratic behavior in practice.

2.2.2 Initial partitioning

After the coarsening phase, we compute an initial acyclic partitioning of the coarsest graph. We present two heuristics. One of them is akin to the greedy graph growing method used in the standard graph/hypergraph partitioning methods. The second one uses an undirected partitioning and then fixes the acyclicity of the partition. Throughout this section, we use (V0, V1) to denote the bisection of the vertices of the coarsest graph, and we ensure that there is no edge from the vertices in V1 to those in V0.

Greedy directed graph growing

One approach to compute a bisection of a directed graph is to design a greedy algorithm that moves vertices from one part to another using local information. We start with all vertices in V1 and move them towards V0 by using heaps. At any time, the vertices that can be moved to V0 are in the heap. These vertices are those whose in-neighbors are all in V0. Initially, only the sources are in the heap, and when all the in-neighbors of a vertex v are moved to V0, v is inserted into the heap. We separate this process into two phases. In the first phase, the key values of the vertices in the heap are set to the weighted sum of their incoming edges, and ties are broken in favor of the vertex closer to the first vertex moved. The first phase continues until the first part has more than 0.9 of the maximum allowed weight (modulo the maximum weight of a vertex). In the second phase, the actual gain of a vertex is used. This gain is equal to the sum of the weights of the incoming edges minus the sum of the weights of the outgoing edges. In this phase, ties are broken in favor of the heavier vertices. The second phase stops as soon as the required balance is obtained. The reason that we separate this heuristic into two phases is that, at the beginning, the gains are of no importance, and the more vertices become movable, the more flexibility the heuristic has. Yet, towards the end, the parts are fairly balanced, and using the actual gains should help keep the cut small.

Since the order of the parts is important, we also reverse the roles of the parts and the directions of the edges. That is, we put all vertices in V0 and move the vertices one by one to V1, when all out-neighbors of a vertex have been moved to V1. The proposed greedy directed graph growing heuristic returns the best of these two alternatives.
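A much-simplified sketch of the first phase of this growing process follows; it keeps only the in-edge-weight keys and omits the tie-breaking rules, the gain-based second phase, and the reversed-direction run, so it is an illustration of the movability mechanism rather than the heuristic itself.

```python
import heapq
from collections import defaultdict

def greedy_directed_growing(vertices, edges, weight, max_part_weight):
    """Grow part V0 of an acyclic bisection of a DAG greedily (first phase only).

    vertices: iterable of vertex ids; edges: iterable of (u, v, cost) triples;
    weight: dict of vertex weights. A vertex becomes movable to V0 once all of
    its in-neighbors are already in V0, so no edge can point from V1 back to V0.
    """
    succ, pred, cost = defaultdict(list), defaultdict(list), {}
    for u, v, c in edges:
        succ[u].append(v)
        pred[v].append(u)
        cost[(u, v)] = c

    part = {v: 1 for v in vertices}                       # everything starts in V1
    placed = {v: 0 for v in vertices}                     # in-neighbors already in V0
    heap = [(0.0, v) for v in vertices if not pred[v]]    # sources are movable
    heapq.heapify(heap)
    w0 = 0.0
    while heap and w0 < max_part_weight:
        _, v = heapq.heappop(heap)
        if w0 + weight[v] > max_part_weight:
            continue                                      # v no longer fits; skip it
        part[v] = 0
        w0 += weight[v]
        for x in succ[v]:
            placed[x] += 1
            if placed[x] == len(pred[x]):                 # x becomes movable
                gain = sum(cost[(p, x)] for p in pred[x])
                heapq.heappush(heap, (-gain, x))          # larger incoming weight first
    return part
```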


Undirected bisection and fixing acyclicity

In this heuristic, we partition the coarsest graph as if it were undirected and then move vertices from one part to another in case the partition was not acyclic. Let (P0, P1) denote the (not necessarily acyclic) bisection of the coarsest graph treated as if it were undirected.

The proposed approach arbitrarily designates P0 as V0 and P1 as V1. One way to fix the cycles is to move all ancestors of the vertices in V0 to V0, thereby guaranteeing that there is no edge from vertices in V1 to vertices in V0, making the bisection (V0, V1) acyclic. We do these moves in a reverse topological order. Another way to fix the acyclicity is to move all descendants of the vertices in V1 to V1, again guaranteeing an acyclic partition. We do these moves in a topological order. We then fix the possible unbalance with a refinement algorithm.

Note that we can also initially designate P1 as V0 and P0 as V1, and can again fix a potential cycle in two different ways. We try all four of these choices and return the best partition (essentially returning the best of the four ways to fix the acyclicity of (P0, P1)).
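The descendant-closure variant just described can be sketched as follows; this is an illustration assuming the input is a DAG given by vertex and edge lists, and it implements only one of the four fixes tried above (the others are symmetric).

```python
from collections import defaultdict, deque

def fix_acyclicity(vertices, edges, part):
    """Make a bisection (V0, V1) of a DAG acyclic by closing V1 under descendants.

    part maps each vertex to 0 or 1. Sweeping in topological order pulls every
    vertex reachable from a V1 vertex into V1, so afterwards no edge can go
    from V1 back to V0.
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Topological order via Kahn's algorithm.
    order, queue = [], deque(v for v in vertices if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    # Propagate membership in V1 forward along the edges.
    for u in order:
        if part[u] == 1:
            for v in succ[u]:
                part[v] = 1
    return part
```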

2.2.3 Refinement

This phase projects the partition obtained for a coarse graph to the next, finer one, and refines the partition by vertex moves. As in the standard refinement methods, the proposed heuristic is applied in a number of passes. Within a pass, we repeatedly select the vertex with the maximum move gain among those that can be moved. We tentatively realize this move if the move maintains or improves the balance. Then, the most profitable prefix of vertex moves is realized at the end of the pass. As usual, we allow each vertex to move only once in a pass. We use heaps with the gain of moves as the key value, where we keep only movable vertices. We call a vertex movable if moving it to the other part does not create a cyclic partition. We use the notation (V0, V1) to designate the acyclic bisection with no edge from vertices in V1 to vertices in V0. This means that for a vertex to move from part V0 to part V1, one of two conditions should be met: (i) either all its out-neighbors are in V1, or (ii) the vertex has no out-neighbors at all. Similarly, for a vertex to move from part V1 to part V0, one of two conditions should be met: (i) either all its in-neighbors are in V0, or (ii) the vertex has no in-neighbors at all. This is an adaptation of the boundary Fiduccia-Mattheyses heuristic [88] to directed graphs. The notion of movability being more restrictive results in an important simplification. The gain of moving a vertex v from V0 to V1 is

    Σ_{u ∈ Succ[v]} w(v, u) − Σ_{u ∈ Pred[v]} w(u, v),        (2.1)

and the negative of this value when moving it from V1 to V0. This means that the gain of a vertex is static: once a vertex is inserted in the heap with the key value (2.1), its key value is never updated. A move could render some vertices unmovable; if they were in the heap, then they should be deleted. Therefore, the heap data structure needs to support insert, delete, and extract-max operations only.
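The movability test and the static gain (2.1) are then only a few lines each; the sketch below assumes adjacency dicts succ and pred, a part dict with values 0/1, and a cost dict mapping each directed edge (u, v) to its weight w(u, v).

```python
def movable(v, part, succ, pred):
    """v can change sides without creating a cycle: all out-neighbors already in V1
    (when v is in V0), or all in-neighbors already in V0 (when v is in V1)."""
    if part[v] == 0:
        return all(part[u] == 1 for u in succ[v])
    return all(part[u] == 0 for u in pred[v])

def move_gain(v, part, succ, pred, cost):
    """Static move gain of Eq. (2.1); the sign flips for a move from V1 to V0."""
    g = sum(cost[(v, u)] for u in succ[v]) - sum(cost[(u, v)] for u in pred[v])
    return g if part[v] == 0 else -g
```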

2.2.4 Constrained coarsening and initial partitioning

There are a number of highly successful undirected graph partitioning tools [130, 186, 199]. They are not immediately usable for our purposes, as the partitions can be cyclic. Fixing such partitions, by moving vertices to break the cyclic dependencies among the parts, can increase the edge cut dramatically (with respect to the undirected cut). Consider, for example, the n × n grid graph, where the vertices are at integer positions for i = 1, . . . , n and j = 1, . . . , n, and a vertex at (i, j) is connected to (i′, j′) when |i − i′| = 1 or |j − j′| = 1, but not both. There is an acyclic orientation of this graph, called the spiral ordering, as described in Figure 2.1 for n = 8. This spiral ordering defines a total order. When the directions of the edges are ignored, we can have a bisection with perfect balance by cutting only n = 8 edges with a vertical line. This partition is cyclic, and it can be made acyclic by putting all vertices numbered greater than 32 in the second part. This partition, which puts the vertices 1-32 in the first part and the rest in the second part, is the unique acyclic bisection with perfect balance for the associated directed acyclic graph. The edge cut of the directed version is 35, as seen in the figure (gray edges). In general, one has to cut n² − 4n + 3 edges for n ≥ 8: the blue vertices on the border (excluding the corners) have one edge directed to a red vertex, the interior blue vertices have two such edges, and finally the blue vertex labeled n²/2 has three such edges.

Let us investigate the quality of the partitions in practice. We used MeTiS [130] as the undirected graph partitioner on a dataset of 94 matrices (their details are in Section 2.3). The results are given in Figure 2.2. For this preliminary experiment, we partitioned the graphs into two with the maximum allowed load imbalance ε = 3%. The output of MeTiS is acyclic for only two graphs, and the geometric mean of the normalized edge cut is 0.0012. Figure 2.2a shows the normalized edge cut and the load imbalance after fixing the cycles, while Figure 2.2b shows the two measurements after meeting the balance criteria. A normalized edge cut value is computed by normalizing the edge cut with respect to the number of edges.

In both figures, the horizontal lines mark the geometric mean of the normalized edge cuts, and the vertical lines mark the 3% imbalance ratio. In Figure 2.2a, there are 37 instances in which the load balance after fixing the cycles is feasible. The geometric mean of the normalized edge cuts in this sub-figure is 0.0045, while in the other sub-figure it is 0.0049.


Figure 2.1: 8×8 grid graph whose vertices are ordered in a spiral way; a few of the vertices are labeled. All edges are oriented from a lower numbered vertex to a higher numbered one. There is a unique bipartition with 32 vertices in each side. The edges defining the total order are shown in red and blue, except the one from 32 to 33; the cut edges are shown in gray; other internal edges are not shown.

(a) Undirected partitioning and fixing cycles. (b) Undirected partitioning, fixing cycles, and balancing.

Figure 2.2: Normalized edge cut (normalized with respect to the number of edges) and the balance obtained after using an undirected graph partitioner and fixing the cycles (left), and after ensuring balance with refinement (right).


Fixing the cycles increases the edge cut with respect to an undirected partitioning, but not catastrophically (only by 0.0045/0.0012 = 3.75 times in these experiments), and achieving balance after this step increases the cut only a little (it goes to 0.0049 from 0.0045). That is why we suggest using an undirected graph partitioner, fixing the cycles among the parts, and performing a refinement based method for load balancing as a good (initial) partitioner.

In order to refine the initial partition in a multilevel setting, we propose a scheme similar to the iterated multilevel algorithm used in the existing partitioners [40, 212]. In this scheme, first a partition P is obtained. Then, the coarsening phase starts to cluster vertices that were in the same part in P. After the coarsening, an initial partitioning is freely available by using the partition P on the coarsest graph. The refinement phase then can work as before. Moreira et al. [176] use this approach for the directed graph partitioning problem. To be more concrete, we first use an undirected graph partitioner, then fix the cycles as discussed in Section 2.2.2, and then refine this acyclic partition for balance with the proposed refinement heuristics in Section 2.2.3. We then use this acyclic partition for constrained coarsening and initial partitioning. We expect this scheme to be successful in graphs with many sources and targets, where the sources and targets can lie in any of the parts while the overall partition is acyclic. On the other hand, if a graph is such that its balanced acyclic partitions need to put the sources in one part and the targets in another part, then fixing acyclicity may result in moving many vertices. This in turn will harm the edge cut found by the undirected graph partitioner.

2.3 Experiments

The partitioning tool presented (dagP) is implemented in the C/C++ programming languages. The experiments are conducted on a computer equipped with dual 2.1 GHz Xeon E5-2683 processors and 512 GB of memory. The source code and more information are available at http://tda.gatech.edu/software/dagP.

We have performed a large set of experiments on DAG instances coming from two sources. The first set of instances is from the Polyhedral Benchmark suite (PolyBench) [196]. The second set of instances is obtained from the matrices available in the SuiteSparse Matrix Collection (UFL) [59]. From this collection, we pick all the matrices satisfying the following properties: listed as binary, square, and having at least 100000 rows and at most 2^26 nonzeros. There were a total of 95 matrices at the time of experimentation, where two matrices (ids 1514 and 2294) had the same pattern. We discard the duplicate and use the remaining 94 matrices for experiments. For each such matrix, we take the strict upper triangular part as the associated DAG instance whenever this part has more nonzeros than the lower


Graph       vertices     edges   max. deg.  avg. deg.  source  target
2mm            36500     62200          40      1.704    2100     400
3mm           111900    214600          40      1.918    3900     400
adi           596695   1059590      109760      1.776     843      28
atax          241730    385960         230      1.597   48530     230
covariance    191600    368775          70      1.925    4775    1275
doitgen       123400    237000         150      1.921    3400    3000
durbin        126246    250993         252      1.988     250     249
fdtd-2d       256479    436580          60      1.702    3579    1199
gemm         1026800   1684200          70      1.640   14600    4200
gemver        159480    259440         120      1.627   15360     120
gesummv       376000    500500         500      1.331  125250     250
heat-3d       308480    491520          20      1.593    1280     512
jacobi-1d     239202    398000         100      1.664     402     398
jacobi-2d     157808    282240          20      1.789    1008     784
lu            344520    676240          79      1.963    6400       1
ludcmp        357320    701680          80      1.964    6480       1
mvt           200800    320000         200      1.594   40800     400
seidel-2d     261520    490960          60      1.877    1600       1
symm          254020    440400         120      1.734    5680    2400
syr2k         111000    180900          60      1.630    2100     900
syrk          594480    975240          81      1.640    8040    3240
trisolv       240600    320000         399      1.330   80600       1
trmm          294570    571200          80      1.939    6570    4800

Table 2.1: DAG instances from PolyBench [196].

triangular part; otherwise, we take the strict lower triangular part. All edges have unit cost and all vertices have unit weight.

Since the proposed heuristics have a randomized behavior (the traversals used in the coarsening and refinement heuristics are randomized), we run them 10 times for each DAG instance and report the averages of these runs. We use performance profiles [69] to present the edge-cut results. A performance profile plot shows the probability that a specific method gives results within a factor θ of the best edge cut obtained by any of the methods compared in the plot. Hence, the higher and closer a plot is to the y-axis, the better the method is.
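
As an illustration of how such a profile is derived from raw edge-cut results, here is a minimal sketch; the function and array names are assumptions, not the evaluation scripts used in the thesis.

```python
import numpy as np

def performance_profile(cuts, thetas):
    """cuts[m][i]: edge cut of method m on instance i.
    Returns, for each method, the fraction of instances whose cut is within
    a factor theta of the best cut obtained by any method on that instance."""
    cuts = np.asarray(cuts, dtype=float)
    best = cuts.min(axis=0)          # best cut per instance over all methods
    ratios = cuts / best             # ratio to the best, per method and instance
    return {m: [(ratios[m] <= t).mean() for t in thetas]
            for m in range(cuts.shape[0])}
```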

We set the load imbalance parameter ε = 0.03 in the problem definition given at the beginning of the chapter for all experiments. The vertices are unit weighted; therefore, the imbalance is rarely an issue for a move-based partitioner.

Coarsening evaluation

We first evaluated the proposed coarsening heuristics (Section 5.1 of the associated paper [115]). The aim was to find an effective one to set as the default coarsening heuristic. Based on performance profiles on the edge cut, we concluded that, in general, the coarsening heuristics CoHyb and CoCyc behave similarly and are more helpful than CoTop in reducing the edge cut. We also analyzed the contraction efficiency of the proposed coarsening heuristics. It is important that the coarsening phase does not stop too early and that the


coarsest graph is small enough to be partitioned efficiently. By investigating the maximum, the average, and the standard deviation of the vertex and edge weight ratios at the coarsest graph, and the average, the minimum, and the maximum number of coarsening levels observed for the two datasets, we saw that CoCyc and CoHyb again behave similarly and provide slightly better results than CoTop. Based on all these observations, we set CoHyb as the default coarsening heuristic, as it performs better than CoTop in terms of final edge cut and is guaranteed to be more efficient than CoCyc in terms of run time.

Constrained coarsening and initial partitioning

We now investigate the effect of using undirected graph partitioners for coarsening and initial partitioning, as explained in Section 2.2.4. We compare three variants of the proposed multilevel scheme. All of them use the refinement described in Section 2.2.3 in the uncoarsening phase.

• CoHyb: this variant uses the hybrid coarsening heuristic described in Section 2.2.1 and the greedy directed graph growing heuristic described in Section 2.2.2 in the initial partitioning phase. This method does not use constrained coarsening.

• CoHyb_C: this variant uses an acyclic partition of the finest graph, obtained as outlined in Section 2.2.2, to guide the hybrid coarsening heuristic described in Section 2.2.4, and uses the greedy directed graph growing heuristic in the initial partitioning phase.

• CoHyb_CIP: this variant uses the same constrained coarsening heuristic as the previous method, but inherits the fixed acyclic partition of the finest graph as the initial partitioning.

The comparison of these three variants is given in Figure 2.3 for the whole dataset. In this figure, we see that using the constrained coarsening is always helpful, as it clearly separates CoHyb_C and CoHyb_CIP from CoHyb

after θ = 1.1. Furthermore, applying the constrained initial partitioning (on top of the constrained coarsening) brings tangible improvements.

In the light of the experiments presented here, we suggest the variant CoHyb_CIP for general problem instances.

CoHyb_CIP with respect to a single level algorithm

We compare CoHyb_CIP (constrained coarsening and initial partitioning) with a single-level algorithm that uses an undirected graph partitioning, fixes the acyclicity, and refines the partitions. This last variant is denoted as UndirFix, and it is the algorithm described in Section 2.2.2. Both variants use the same initial partitioning approach, which utilizes MeTiS [130] as


Figure 2.3: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb_CIP), using the constrained coarsening and the greedy directed graph growing (CoHyb_C), and CoHyb (no constraints).

Figure 2.4: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb_CIP) and using the same approach without coarsening (UndirFix).

the undirected partitioner. The difference between UndirFix and CoHyb_CIP is the latter's ability to refine the initial partition at multiple levels. Figure 2.4 presents this comparison. The plots show that the multilevel scheme CoHyb_CIP outperforms the single level scheme UndirFix at all appropriate ranges of θ, attesting to the importance of the multilevel scheme.

Comparison with existing work

Here we compare our approach with the evolutionary graph partitioning approach developed by Moreira et al. [175], and briefly with our previous work [114].

Figure 2.5 shows how CoHyb_CIP and CoTop compare with the evolutionary approach in terms of the edge cut on the 23 graphs of the PolyBench dataset, for the number of partitions k ∈ {2, 4, 8, 16, 32}.


Figure 2.5: Performance profiles for the edge cut obtained by CoHyb_CIP, CoTop, and Moreira et al.'s approach on the PolyBench dataset with k ∈ {2, 4, 8, 16, 32}.

We use the average edge cut value of 10 runs for CoTop and CoHyb_CIP, and the average values presented in [175] for the evolutionary algorithm. As seen in the figure, the CoTop variant of the proposed multilevel approach obtains the best results on this specific dataset (all variants of the proposed approach outperform the evolutionary approach).

In more detailed comparisons [115, Appendix A], we see that the variants CoHyb_CIP and CoTop of the proposed algorithm obtain strictly better results than the evolutionary approach in 78 and 75 instances (out of 115), respectively, when the average edge cuts are compared. We also see that CoHyb_CIP obtains 26% less edge cut than the evolutionary approach on average (geometric mean) when the average cuts are compared; when the best cuts are compared, CoHyb_CIP obtains 48% less edge cut. Moreover, CoTop obtains 37% less edge cut than the evolutionary approach when the average cuts are compared; when the best cuts are compared, CoTop obtains 41% less cut. In some instances, we see large differences between the average and the best results of CoTop and CoHyb_CIP. Combined with the observation that CoHyb_CIP yields better results in general, this suggests that the neighborhood structure can be improved (see the notion of the strength of a neighborhood [183, Section 19.6]).

The proposed approach, with all the reported variants, takes about 30 minutes to complete the whole set of experiments for this dataset, whereas the evolutionary approach is much more compute-intensive, as it has to run the multilevel partitioning algorithm numerous times to create and update the population of partitions for the evolutionary algorithm. The multilevel approach of Moreira et al. [175] is more comparable in terms of characteristics with our multilevel scheme. When we compare CoTop with the results of the multilevel algorithm by Moreira et al., our approach provides results that are 37% better on average, and the CoHyb_CIP approach provides results that are


Figure 2.6: Performance profiles of CoHyb, CoHyb_CIP, and UndirFix in terms of edge cut for the single source, single target graph dataset. The average of 5 runs is reported for each approach.

26% better on average, highlighting the fact that keeping the acyclicity of the directed graph through the multilevel process is useful.

Finally, CoTop and CoHyb_CIP also outperform the previous version of our multilevel partitioner [114], which is based on a direct k-way partitioning scheme and matching heuristics for the coarsening phase, by 45% and 35% on average, respectively, on the same dataset.

Single commodity flow-like problem instances

In many of the instances of our dataset, graphs have many source and target vertices. We investigate how our algorithm performs on problems where all source vertices should be in a given part and all target vertices should be in the other part, while also achieving balance. This is a problem close to the maximum flow problem, where we want to find the maximum flow (or minimum cut) from the sources to the targets with balance on the part weights. Furthermore, addressing this problem also provides a setting for solving partitioning problems with fixed vertices.

We used the UFL dataset, where all isolated vertices are discarded, and a single source vertex S and a single target vertex T are added to all graphs. S is connected to all original source vertices, and all original target vertices are connected to T. The cost of the new edges is set to the number of edges. A feasible partition should avoid cutting these edges and separate all sources from the targets.
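
For illustration, a minimal sketch of this augmentation under the assumption that the graph is stored as successor lists with an edge-cost dictionary and that every vertex appears as a key in succ; the names succ, cost, 'S', and 'T' are assumptions for the sketch.

```python
def add_super_terminals(succ, cost):
    """succ[v]: list of out-neighbors of v; cost[(u, v)]: cost of edge (u, v)."""
    nedges = sum(len(vs) for vs in succ.values())
    indeg = {v: 0 for v in succ}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1
    sources = [v for v in succ if indeg[v] == 0]     # original source vertices
    targets = [v for v in succ if not succ[v]]       # original target vertices
    succ['S'], succ['T'] = list(sources), []
    for s in sources:
        cost[('S', s)] = nedges                      # heavy edges: cutting them is infeasible
    for t in targets:
        succ[t].append('T')
        cost[(t, 'T')] = nedges
    return succ, cost
```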

The performance profiles of CoHyb, CoHyb_CIP, and UndirFix are given in Figure 2.6, with the edge cut as the evaluation criterion. As seen in this figure, CoHyb is the best performing variant, and UndirFix is the worst performing variant. This is interesting, as in the general setting we saw a reverse relation. The variant CoHyb_CIP performs in the middle, as it combines the other two.


Figure 2.7: Run time performance profile of CoCyc, CoHyb, CoTop, CoHyb_C, CoHyb_CIP, and UndirFix on the whole dataset. The values are the averages of 10 runs.

Run time performance

We give again a summary of the run time comparisons from the detailed version [115, Section 5.6]. Figure 2.7 shows the comparison of the five variants of the proposed multilevel scheme and the single level scheme on the whole dataset. Each algorithm is run 10 times on each graph. As expected, CoTop offers the best performance, and CoHyb offers a good trade-off between CoTop and CoCyc. An interesting remark is that these three algorithms have a better run time than the single level algorithm UndirFix. For example, on the average, CoTop is 1.44 times faster than UndirFix. This is mainly due to the cost of fixing acyclicity: undirected partitioning accounts for roughly 25% of the execution time of UndirFix, and fixing the acyclicity constitutes the remaining 75%. Finally, the variants of the multilevel algorithm using constrained coarsening heuristics provide satisfying run time performance with respect to the others.

2.4 Summary, further notes, and references

We proposed a multilevel approach for acyclic partitioning of directed acyclic graphs. This problem is close to the standard graph partitioning in that the aim is to partition the vertices into a number of parts while minimizing the edge cut and meeting a balance criterion on the part weights. Unlike the standard graph partitioning problem, the directions of the edges are important, and the resulting partitions should have acyclic dependencies.

We proposed coarsening, initial partitioning, and refinement heuristics for the target problem. The proposed heuristics take the directions of the edges into account and maintain the acyclicity throughout the three phases of the multilevel scheme. We also proposed efficient and effective approaches to use the standard undirected graph partitioning tools in the multilevel


scheme for coarsening and initial partitioning. We performed a large set of experiments on a dataset with graphs having different characteristics and evaluated different combinations of the proposed heuristics. Our experiments suggested: (i) the use of constrained coarsening and initial partitioning, where the main coarsening heuristic is a hybrid one which avoids the cycles and, in case it does not, performs a fast cycle detection (CoHyb_CIP), for the general case; (ii) a pure multilevel scheme without constrained coarsening using the hybrid coarsening heuristic (CoHyb) for the cases where a number of sources need to be separated from a number of targets; (iii) a pure multilevel scheme without constrained coarsening using the fast coarsening algorithm (CoTop) for the cases where the degrees of the vertices are small. All three approaches are shown to be more effective and efficient than the current state of the art.

Ozkaya et al. [181] make use of the proposed partitioner for scheduling directed acyclic task graphs for homogeneous computing systems. Here, the partitioner is used to obtain coarse grained tasks to enhance data locality. Lepere and Trystram [151] show that acyclic clustering is very effective in scheduling DAGs with unit task weights and large uniform edge costs. In particular, they show that there exists an acyclic clustering whose makespan is at most twice that of the best clustering.

Recall that a directed edge (u, v) is called redundant if there is a path u ⇝ v in the graph. Consider a redundant edge (u, v) and a path u ⇝ v. Let us discard the edge (u, v) and add its cost to all edges in the identified u ⇝ v path to obtain a reduced graph. In an acyclic bisection of this reduced graph, if the vertices u and v are in different parts, then the path u ⇝ v crosses the cut only once (otherwise the partition is cyclic), and the edge cut in the original graph equals the edge cut in the reduced graph. If u and v are in the same part, then all paths u ⇝ v are confined to the same part. Therefore, neither the edge (u, v) of the original graph nor any path u ⇝ v of the reduced graph contributes to the respective cuts. Hence, the edge cut in the original graph is equal to the edge cut in the reduced graph. Such reductions could potentially help reduce the run time and improve the quality of the partitions in our multilevel setting. The transitive reduction is a costly operation, and hence we do not hope to apply all reductions in the proposed approach; a subset of those should be found and applied. The effects of such reductions should be more tangible in the evolutionary algorithm [175, 176].
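
As a small illustration of one such reduction step, here is a minimal sketch assuming the edge costs are kept in a dictionary and a certifying path has already been identified; the names are illustrative only.

```python
def apply_reduction(cost, u, v, path):
    """path = [u, ..., v] is a directed path certifying that (u, v) is redundant."""
    w = cost.pop((u, v))                 # discard the redundant edge
    for a, b in zip(path, path[1:]):     # charge its cost to every edge on the path
        cost[(a, b)] += w
    return cost
```

By the argument above, the edge cut of any acyclic bisection is unchanged by this transformation, while the graph has one edge fewer.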

Chapter 3

Parallel Candecomp/Parafac decomposition of sparse tensors in distributed memory

In this chapter, we address the problem of parallelizing the computational core of the Candecomp/Parafac decomposition of sparse tensors. Tensor notation is a little less common, and the reader is invited to revise the notation and the basic definitions in Section 1.5. More related definitions are given in this chapter. Below we restate the problem for convenience.

Parallelizing CP decomposition. Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

We investigate an efficient parallelization of the Mttkrp operation in distributed memory environments for sparse tensors in the context of the CP decomposition method CP-ALS. For this purpose, we formulate two task definitions: a coarse-grain and a fine-grain one. These definitions are given by applying the owner-computes rule to a coarse-grain and a fine-grain partition of the tensor nonzeros. We define the coarse-grain partition of a tensor as a partition of one of its dimensions. In matrix terms, a coarse-grain partition corresponds to a row-wise or a column-wise partition. We define the fine-grain partition of a tensor as a partition of its nonzeros. This has the same significance in matrix terms. Based on these two task granularities, we present two parallel algorithms for the Mttkrp operation. We address the computational load balance and communication cost reduction problems for the two algorithms and present hypergraph partitioning-based models



to tackle these problems with off-the-shelf partitioning tools.

Once the Mttkrp operation is efficiently parallelized, the other parts of the CP-ALS algorithm are straightforward. Nonetheless, we design a library for the parallel CP-ALS algorithm to test the proposed Mttkrp algorithms, and report scalability results up to 1024 MPI ranks.

3.1 Background and notation

3.1.1 Hypergraph partitioning

Let H = (V, E) be a hypergraph with the vertex set V and hyperedge set E. Let us associate weights with the vertices and costs with the hyperedges: w(v) denotes the weight of a vertex v, and c_h denotes the cost of a hyperedge h. For a given integer K ≥ 2, a K-way vertex partition of a hypergraph H = (V, E) is denoted as Π = {V_1, ..., V_K}, where (i) the parts are non-empty; (ii) mutually exclusive, V_k ∩ V_ℓ = ∅ for k ≠ ℓ; and (iii) collectively exhaustive, V = ⋃ V_k.

Let w(V_k) = Σ_{v ∈ V_k} w(v) be the total weight of the vertices in the set V_k, and w_avg = w(V)/K be the average part weight. If each part V_k ∈ Π satisfies the balance criterion

\[
w(V_k) \le w_{\mathrm{avg}} (1 + \varepsilon) \quad \text{for } k = 1, 2, \ldots, K, \qquad (3.1)
\]

we say that Π is balanced, where ε represents the maximum allowed imbalance ratio.

In a partition Π, a hyperedge that has at least one vertex in a part is said to connect that part. The number of parts connected by a hyperedge h, i.e., its connectivity, is denoted as λ_h. Given a vertex partition Π of a hypergraph H = (V, E), one can measure the size of the cut induced by Π as

\[
\chi(\Pi) = \sum_{h \in E} c_h (\lambda_h - 1). \qquad (3.2)
\]

This cut measure is called the connectivity-1 cutsize metric.

Given ε > 0 and an integer K > 1, the standard hypergraph partitioning problem is defined as the task of finding a balanced partition Π with K parts such that χ(Π) is minimized. The hypergraph partitioning problem is NP-hard [150].

A variant of the above problem is the multi-constraint hypergraph partitioning [131]. In this variant, each vertex has an associated vector of weights. The partitioning objective is the same as above, and the partitioning constraint is to satisfy a balancing constraint for each weight. Let w(v, i) denote the C weights of a vertex v, for i = 1, ..., C. In this variant, the balance criterion (3.1) is rewritten as

\[
w(V_k, i) \le w_{\mathrm{avg},i} (1 + \varepsilon) \quad \text{for } k = 1, \ldots, K \text{ and } i = 1, \ldots, C, \qquad (3.3)
\]


where the ith weight w(V_k, i) of a part V_k is defined as the sum of the ith weights of the vertices in that part, w_avg,i is the average part weight for the ith weight of all vertices, and ε represents the allowed imbalance ratio.
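
For concreteness, a minimal sketch of the two quantities used throughout this chapter, the connectivity-1 cutsize (3.2) and the multi-constraint balance check (3.3), assuming parts are stored as part[v], hyperedges as vertex lists, and vertex weights as vectors; the names are illustrative.

```python
def cutsize(hyperedges, cost, part):
    """Connectivity-1 metric (3.2): sum of c_h * (lambda_h - 1)."""
    return sum(cost[h] * (len({part[v] for v in verts}) - 1)
               for h, verts in hyperedges.items())

def is_balanced(weights, part, K, eps):
    """Multi-constraint balance (3.3) with C weights per vertex."""
    C = len(next(iter(weights.values())))
    totals = [[0.0] * C for _ in range(K)]
    for v, wv in weights.items():
        for i in range(C):
            totals[part[v]][i] += wv[i]
    avg = [sum(t[i] for t in totals) / K for i in range(C)]
    return all(totals[k][i] <= avg[i] * (1 + eps)
               for k in range(K) for i in range(C))
```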

3.1.2 Tensor operations

In this chapter, we use N to denote the number of dimensions (the order) of a tensor. For the sake of simplicity, we describe all the notation and the algorithms for N = 3, even though our algorithms and implementations have no such restriction. We explicitly generalize the discussion to general order-N tensors whenever we find it necessary. A fiber in a tensor is defined by fixing every index but one; e.g., if X is a third-order tensor, X(:, j, k) is a mode-1 fiber and X(i, j, :) is a mode-3 fiber. A slice in a tensor is defined by fixing only one index; e.g., X(i, :, :) refers to the ith slice of X in mode 1. We use |X(i, :, :)| to denote the number of nonzeros in X(i, :, :).

Tensors can be matricized in any mode. This is achieved by identifying a subset of the modes of a given tensor X as the rows and the other modes of X as the columns of a matrix, and appropriately mapping the elements of X to those of the resulting matrix. We will be exclusively dealing with the matricizations of tensors along a single mode. For example, take X ∈ R^{I_1 × ··· × I_N}. Then X_(1) denotes the mode-1 matricization of X, in such a way that the rows of X_(1) correspond to the first mode of X, and the columns correspond to the remaining modes. The tensor element x_{i_1 ... i_N} corresponds to the element

\[
\Bigl( i_1,\; 1 + \sum_{j=2}^{N} \bigl[ (i_j - 1) \prod_{k=2}^{j-1} I_k \bigr] \Bigr)
\]

of X_(1). Specifically, each column of the matrix X_(1) becomes a mode-1 fiber of the tensor X. Matricizations in the other modes are defined similarly.
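
As a small worked example of this index mapping (using 0-based indices, whereas the text above uses 1-based ones; the function name is illustrative):

```python
def mode1_column(idx, dims):
    """idx = (i1, ..., iN), dims = (I1, ..., IN); returns the column of X_(1)."""
    col, stride = 0, 1
    for i, size in zip(idx[1:], dims[1:]):   # modes 2..N, with mode 2 varying fastest
        col += i * stride
        stride *= size
    return col

# Example: a 2 x 3 x 4 tensor; element (0, 2, 1) lands in column 2 + 1*3 = 5.
assert mode1_column((0, 2, 1), (2, 3, 4)) == 5
```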

For two matrices A of size I_1 × J_1 and B of size I_2 × J_2, the Kronecker product is

\[
A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1J_1}B \\ \vdots & & \vdots \\ a_{I_1 1}B & \cdots & a_{I_1 J_1}B \end{bmatrix}.
\]

For two matrices A of size I_1 × J and B of size I_2 × J, the Khatri-Rao product is

\[
A \odot B = \begin{bmatrix} A(:,1) \otimes B(:,1) & \cdots & A(:,J) \otimes B(:,J) \end{bmatrix},
\]

which is of size I_1 I_2 × J. For two matrices A and B, both of size I × J, the Hadamard product is

\[
A \ast B = \begin{bmatrix} a_{11}b_{11} & \cdots & a_{1J}b_{1J} \\ \vdots & & \vdots \\ a_{I1}b_{I1} & \cdots & a_{IJ}b_{IJ} \end{bmatrix}.
\]
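
A quick numerical illustration of the three products with NumPy; the matrices are arbitrary small examples.

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)
B = np.arange(8.0).reshape(4, 2)

kron = np.kron(A, B)                                        # Kronecker: (3*4) x (2*2)
khatri_rao = np.column_stack([np.kron(A[:, j], B[:, j])
                              for j in range(A.shape[1])])  # Khatri-Rao: (3*4) x 2
hadamard = A * A                                            # Hadamard: elementwise, same shape
```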

Recall that the CP-decomposition of rank R of a given tensor X factorizes X into a sum of R rank-one tensors. For X ∈ R^{I×J×K}, it yields

\[
X \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r, \quad \text{for } a_r \in \mathbb{R}^{I},\; b_r \in \mathbb{R}^{J},\; c_r \in \mathbb{R}^{K},
\]

where ∘ is the outer product of the vectors. Here, the matrices A = [a_1, ..., a_R], B = [b_1, ..., b_R], and C = [c_1, ..., c_R] are called the factor matrices, or factors. For N-mode tensors, we use U_1, ..., U_N to refer to the factor matrices.

We revise the Alternating Least Squares (ALS) method for obtaining a rank-R approximation of a tensor X with the CP-decomposition [38, 107]. A common formulation of CP-ALS is shown in Algorithm 1 for third order tensors. At each iteration, each factor matrix is recomputed while fixing the other two, e.g., A ← X_(1)(C ⊙ B)(BᵀB ∗ CᵀC)†. This operation is

performed in the following order: M_A = X_(1)(C ⊙ B), V = (BᵀB ∗ CᵀC)†, and then A ← M_A V. Here, V is a dense matrix of size R × R and is easy to compute. The important issue is the efficient computation of the Mttkrp operations yielding M_A; similarly, M_B = X_(2)(A ⊙ C) and M_C = X_(3)(B ⊙ A).

Algorithm 1: CP-ALS for 3rd order tensors

Input: X: a 3rd order tensor; R: the rank of approximation
Output: CP decomposition [[λ; A, B, C]]

1: repeat
2:     A ← X_(1)(C ⊙ B)(BᵀB ∗ CᵀC)†
3:     Normalize columns of A
4:     B ← X_(2)(C ⊙ A)(AᵀA ∗ CᵀC)†
5:     Normalize columns of B
6:     C ← X_(3)(B ⊙ A)(AᵀA ∗ BᵀB)†
7:     Normalize columns of C and store the norms as λ
8: until no improvement or maximum iterations reached

The sheer size of the Khatri-Rao products makes them impossible to compute explicitly; hence, efficient Mttkrp algorithms find other means to carry out the Mttkrp operation.

3.2 Related work

SPLATT [206] is an efficient implementation of the Mttkrp operation for sparse tensors on shared memory systems. SPLATT implements the Mttkrp operation based on the slices of the dimension in which the factor is updated, e.g., on the mode-1 slices when computing A ← X_(1)(C ⊙ B)(BᵀB ∗ CᵀC)†. Nonzeros of the fibers in a slice are multiplied with the corresponding rows of B, and the results are accumulated to be later scaled with the corresponding row of C to compute the row of A corresponding to the slice. Parallelization is done using OpenMP directives, and the load balance (in terms of the number of nonzeros in the slices of the mode for


which Mttkrp is computed) is achieved by using the dynamic scheduling policy. Hypergraph models are used to optimize cache performance by reducing the number of times a row of A, B, and C is accessed. Smith et al. [206] also use N-partite graphs to reorder the tensors for all dimensions. A newer version of SPLATT [204, 205] has a medium grain task definition and a distributed memory implementation.

GigaTensor [123] is an implementation of CP-ALS which follows the Map-Reduce paradigm. All important steps (the Mttkrps and the computations BᵀB ∗ CᵀC) are performed using this paradigm. A distinct advantage of GigaTensor is that, thanks to Map-Reduce, the issues of fault-tolerance, load balance, and out of core tensor data are automatically handled. The presentation [123] of GigaTensor focuses on three-mode tensors and expresses the map and the reduce functions for this case. Additional map and reduce functions are needed for higher order tensors.

DFacTo [47] is a distributed memory implementation of the Mttkrp operation. It performs two successive sparse matrix-vector multiplies (SpMV) to compute a column of the product X_(1)(C ⊙ B). A crucial observation (made also elsewhere [142]) is that this operation can be implemented as Xᵀ_(2) B(:, r), which can be reshaped into a matrix to be multiplied with C(:, r) to form the rth column of the Mttkrp. Although SpMV is a well-investigated operation, there is a peculiarity here: the result of the first SpMV forms the values of the sparse matrix used in the second one. Therefore, there are sophisticated data dependencies between the two SpMVs. Notice that DFacTo is rich in SpMV operations: there are two SpMVs per factorization rank per dimension of the input tensor. DFacTo needs to store the tensor matricized in all dimensions, i.e., X_(1), ..., X_(N). In low dimensions, this can be a slight memory overhead, yet in higher dimensions the overhead could be non-negligible. DFacTo uses MPI for parallelization, yet fully stores the factor matrices in all MPI ranks. The rows of X_(1), ..., X_(N) are blockwise distributed (statically). With this partition, each process computes the corresponding rows of the factor matrices. Finally, DFacTo performs an MPI_Allgatherv operation to communicate the new results to all processes, which results in (I_n/P) log_2 P communication volume per process (assuming a hypercube algorithm) when computing the nth factor matrix having I_n rows using P processes.

Tensor Toolbox [18] is a MATLAB toolbox for handling tensors. It provides many essential operations and enables fast and efficient realizations of complex algorithms in MATLAB for sparse tensors [17]. Among those operations, Mttkrp implementations are provided and used in the CP-ALS method. Here, each column of the output is computed by performing N − 1 sparse tensor vector multiplications. Another well-known MATLAB toolbox is the N-way toolbox [11], which is essentially for dense tensors and now incorporates support for sparse tensors [2] through Tensor Toolbox.


Tensor Toolbox and the related software provide excellent means for rapid prototyping of algorithms, and also efficient programs for tensor operations that can be handled within MATLAB.

3.3 Parallelization

A common approach in implementing the Mttkrp is to explicitly matricize a tensor across all modes and then perform the Khatri-Rao product using the matricized tensors [47, 206]. Matricizing a tensor in a mode i requires column index values up to Π_{k≠i} I_k, which can exceed the integer value limits supported by modern architectures when using tensors of higher order and very large dimensions. Also, matricizing across all modes results in N replications of a tensor, which can exceed the memory limitations. Hence, in order to be able to handle large tensors, we store them in coordinate format for the Mttkrp operation, which is also the method of choice in Tensor Toolbox [18]. In this format, each nonzero value is stored along with all of its position indices.

With a tensor stored in the coordinate format, the Mttkrp operation can be performed as shown in Algorithm 2. As seen on Line 4 of this algorithm, a row of B and a row of C are retrieved, and their Hadamard product is computed and scaled with a tensor entry to update a row of M_A. In general, for an N-mode tensor,

\[
M_{U_1}(i_1, :) \leftarrow M_{U_1}(i_1, :) + x_{i_1 i_2 \ldots i_N} \, [\, U_2(i_2, :) \ast \cdots \ast U_N(i_N, :) \,]
\]

is computed. Here, the indices of the corresponding rows of the factor matrices and M_{U_1} coincide with the indices of the unique tensor entry of the operation.

Algorithm 2: Mttkrp for 3rd order tensors

Input: X: tensor; B, C: factor matrices in all modes except the first; I_A: number of rows of the factor A; R: rank of the factors
Output: M_A = X_(1)(B ⊙ C)

1: Initialize M_A to zeros of size I_A × R
2: foreach x_{ijk} ∈ X do
4:     M_A(i, :) ← M_A(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]
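
A small sketch of this update for a tensor stored in coordinate format (0-based indices; the arrays coords and vals and the function name are assumptions for illustration, not the HyperTensor code):

```python
import numpy as np

def mttkrp_mode1(coords, vals, B, C, I_A):
    """coords: nnz x 3 integer positions (i, j, k); vals: nnz nonzero values."""
    R = B.shape[1]
    MA = np.zeros((I_A, R))
    for (i, j, k), x in zip(coords, vals):
        MA[i, :] += x * (B[j, :] * C[k, :])   # Hadamard product of two rows, scaled by x
    return MA
```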

As factor matrices are accessed row-wise, we define computational units in terms of the rows of factor matrices. It follows naturally to partition all factor matrices row-wise, and to use the same partition for the Mttkrp operation for each mode of an input tensor across all CP-ALS iterations to prevent extra communication. A crucial issue is the task definitions, as this


pertains to the issues of load balancing and communication. We identify a coarse-grain and a fine-grain task definition.

In the coarse-grain task definition, the ith atomic task consists of computing the row M_A(i, :) using the nonzeros in the tensor slice X(i, :, :) and the rows of B and C corresponding to the nonzeros in that slice. The input tensor X does not change throughout the iterations of tensor decomposition algorithms; hence, it is viable to make the whole slice X(i, :, :) available to the process holding M_A(i, :), so that the Mttkrp operation can be performed by only communicating the rows of B and C. Yet, as CP-ALS requires the Mttkrp in all modes, and each nonzero x_{ijk} belongs to the slices X(i, :, :), X(:, j, :), and X(:, :, k), we need to replicate tensor entries in the owner processes of these slices. This may require up to N times replication of the tensor, depending on its partitioning. Note that an explicit matricization always requires exactly N replications of tensor entries.

In the fine-grain task definition, an atomic task corresponds to the multiplication of a tensor entry with the Hadamard product of the corresponding rows of B and C. Here, tensor nonzeros are partitioned among processes with no replication, to induce a task partition by following the owner-computes rule. This necessitates communicating the rows of B and C that are needed by these atomic tasks. Furthermore, partial results on the rows of M_A need to be communicated, as without duplicating tensor entries we cannot, in general, compute all contributions to a row of M_A. Here, the partition of X should be useful in all modes, as the CP-ALS method requires the Mttkrp in all modes.

The coarse-grain task definition resembles the one-dimensional (1D) row-wise (or column-wise) partitioning of sparse matrices, whereas the fine-grain one resembles the two-dimensional (nonzero-based) partitioning of sparse matrices for parallel sparse matrix-vector multiply (SpMV) operations. As is confirmed for SpMV in modern applications, 1D partitioning usually leads to harder problems of load balancing and communication cost reduction. The same phenomenon is likely to be observed in tensors as well. Nonetheless, we cover the coarse-grain task definition, as it is used in the state of the art parallel Mttkrp methods [47, 206], which partition the input tensor by slices.

3.3.1 Coarse-grain task model

In the coarse-grain task model, computing the rows of M_A, M_B, and M_C are defined as the atomic tasks. Let μ_A denote the partition of the first mode's indices among the processes, i.e., μ_A(i) = p if the process p is responsible for computing M_A(i, :). Similarly, let μ_B and μ_C define the partition of the second and the third mode indices. The process owning M_A(i, :) needs the entire tensor slice X(i, :, :); similarly, the process owning M_B(j, :) needs X(:, j, :), and the owner of M_C(k, :) needs X(:, :, k). This necessitates duplication of some


tensor nonzeros to prevent unnecessary communication.

One needs to take the context of CP-ALS into account when parallelizing the Mttkrp operation. First, the output M_A of Mttkrp is transformed into A. Since A(i, :) is computed simply by multiplying M_A(i, :) with the matrix (BᵀB ∗ CᵀC)†, we make the process which owns M_A(i, :) responsible for computing A(i, :). Second, N Mttkrp operations follow one another in an iteration. Assuming that every process has the required rows of the factor matrices while executing Mttkrp for the first mode, it is advisable to implement the Mttkrp in such a way that its output M_A, after being transformed into A, is communicated. This way, all processes will have the necessary data for executing the Mttkrp for the next mode. With these in mind, the coarse-grain parallel Mttkrp method executes Algorithm 3 at each process p.

Algorithm 3: Coarse-grain Mttkrp for the first mode of third order tensors at process p within CP-ALS

Input: I_p: indices i where μ_A(i) = p; X_{I_p}: tensor slices; BᵀB and CᵀC: R × R matrices; B(J_p, :), C(K_p, :): rows of the factor matrices, where J_p and K_p correspond to the unique second and third mode indices in X_{I_p}

1:  On exit, A(I_p, :) is computed, its rows are sent to the processes needing them, and AᵀA is available
2:  Initialize M_A(I_p, :) to all zeros of size |I_p| × R
3:  foreach i ∈ I_p do
4:      foreach x_{ijk} ∈ X(i, :, :) do
6:          M_A(i, :) ← M_A(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]
8:  A(I_p, :) ← M_A(I_p, :) (BᵀB ∗ CᵀC)†
9:  foreach i ∈ I_p do
11:     Send A(i, :) to all processes having nonzeros in X(i, :, :)
12: Receive A(i, :) from μ_A(i) for each owned nonzero x_{ijk}
14: Locally compute A(I_p, :)ᵀ A(I_p, :) and all-reduce the results to form AᵀA

As seen in Algorithm 3, the process p computes M_A(i, :) for all i with μ_A(i) = p on Line 6. This is possible without any communication if we assume the preconditions that BᵀB, CᵀC, and the required rows of B and C are available. We need to satisfy this precondition for the Mttkrp operations in the remaining modes. Once M_A(I_p, :) is computed, it is multiplied on Line 8 with (BᵀB ∗ CᵀC)† to obtain A(I_p, :). Then, the process p sends the rows of A(I_p, :) to the other processes who will need them, and then receives the ones that it will need. More precisely, the process p sends A(i, :) to the process r where μ_B(j) = r or μ_C(k) = r for a nonzero x_{ijk} ∈ X, thereby satisfying the stated precondition for the remaining Mttkrp operations.


Since computing A(i, :) required BᵀB and CᵀC to be available at each process, we also need to compute AᵀA and make the result available to all processes. We perform this by computing the local multiplications A(I_p, :)ᵀ A(I_p, :) and all-reducing the partial results (Line 14).

We now investigate the problems of obtaining load balance and reducing the communication cost in an iteration of CP-ALS given in Algorithm 3, that is, when Algorithm 3 is applied N times, once for each mode. Line 6 is the computational core: the process p performs Σ_{i ∈ I_p} |X(i, :, :)| many Hadamard products (of vectors of size R) without any communication. In order to achieve load balance, one should partition the slices of the first mode equitably by taking the number of nonzeros into account. Line 8 does not involve communication, and load balance can be achieved there by assigning an almost equal number of slices to the processes. Line 14 likewise requires an almost equal number of slices at the processes for load balance. The operations at Lines 8 and 14 can be performed very efficiently using BLAS3 routines; therefore, their costs are generally negligible in comparison to the cost of the sparse, irregular computations at Line 6. There are two lines, 11 and 14, requiring communication. Line 14 requires the collective all-reduce communication, and one can rely on efficient MPI implementations. The communication at Line 11 is irregular and would be costly. Therefore, the problems of computational load balance and communication cost need to be addressed with regards to Lines 6 and 11, respectively.

Let a_i be the task of computing M_A(i, :) at Line 6. Let also T_A be the set of tasks a_i for i = 1, ..., I_A, and let b_j, c_k, T_B, and T_C be defined similarly. In the CP-ALS algorithm, there are dependencies among the triplets a_i, b_j, and c_k of the tasks for each nonzero x_{ijk}. Specifically, the task a_i depends on the tasks b_j and c_k. Similarly, b_j depends on the tasks a_i and c_k, and c_k depends on the tasks a_i and b_j. These dependencies are the source of the communication at Line 11; e.g., A(i, :) should be sent to the processes that own B(j, :) and C(k, :) for each nonzero x_{ijk} in the slice X(i, :, :). These dependencies can be modeled using graphs or hypergraphs, by identifying the tasks with vertices and their dependencies with edges and hyperedges, respectively. As is well-known in similar parallelization contexts [39, 111, 112], the standard graph partition related metric of "edge cut" would only loosely relate to the communication cost. We therefore propose a hypergraph model for modeling the communication and computational requirements of the Mttkrp operation in the context of CP-ALS.

For a given tensor X, we build the d-partite, d-uniform hypergraph H = (V, E) described in Section 1.5. In this model, the vertex set V corresponds to the set of tasks. We abuse the notation and use a_i to refer to a task and its corresponding vertex in V. We then set V = T_A ∪ T_B ∪ T_C. Since one needs load balance on Line 6 for each mode, we use vectors of size N for the vertex weights. We set w(a_i, 1) = |X(i, :, :)|, w(b_j, 2) = |X(:, j, :)|, and w(c_k, 3) = |X(:, :, k)|


(a) Coarse-grain model. (b) Fine-grain model.

Figure 3.1: Coarse and fine-grain hypergraph models for the 3 × 3 × 3 tensor X = {(1, 2, 3), (2, 3, 1), (3, 1, 2)}.

as the vertex weights in the indicated modes, and assign weight 0 in all other modes. The dependencies causing communication at Line 11 are one-to-many: for a nonzero x_{ijk}, any one of a_i, b_j, and c_k is computed by using the data associated with the other two. Following the guidelines in building the hypergraph models [209], we add a hyperedge corresponding to each task (or each row of the factor matrices) in T_A ∪ T_B ∪ T_C, in order to model the data dependencies to that specific task. We use n^a_i to denote the hyperedge corresponding to the task a_i, and define n^b_j and n^c_k similarly. For each nonzero x_{ijk}, we add the vertices a_i, b_j, and c_k to the hyperedges n^a_i, n^b_j, and n^c_k. Let J_i and K_i be all unique 2nd and 3rd mode indices, respectively, of the nonzeros in X(i, :, :). Then, the hyperedge n^a_i contains the vertex a_i, all vertices b_j for j ∈ J_i, and all vertices c_k for k ∈ K_i.

Figure 3.1a demonstrates the hypergraph model for the coarse-grain task decomposition of a sample 3 × 3 × 3 tensor with nonzeros (1, 2, 3), (2, 3, 1), and (3, 1, 2). Labels for the hyperedges, shown as small blue circles, are not provided, yet they should be clear from the context; e.g., a_i ∈ n^a_i is shown with a non-curved black line between the vertex and the hyperedge. For the first, second, and third nonzeros, we add the corresponding vertices to the related hyperedges, which are shown with green, magenta, and orange curves, respectively. It is interesting to note that the N-partite graph of Smith et al. [206] and this hypergraph model are related: their adjacency structures, when laid out as matrices, give the same matrix.

Consider now a P-way partition of the vertices of H = (V, E) and the association of each part with a unique process for obtaining a P-way parallel CP-ALS. The task a_i incurs |X(i, :, :)| Hadamard products in the first mode in the process μ_A(i), and has no other work when computing the Mttkrps M_B = X_(2)(A ⊙ C) and M_C = X_(3)(A ⊙ B) in the other modes. This observation (combined with the corresponding one for the b_j's and c_k's) shows that any


partition satisfying the balance condition for each weight component (3.3) corresponds to a balanced partition of Hadamard products (Line 6) in each mode. Now consider the messages, which are of size R, sent by the process p regarding A(i, :). These messages are needed by the processes that have slices in modes 2 and 3 containing nonzeros whose first mode index is i. These processes correspond to the parts connected by the hyperedge n^a_i. For example, in Figure 3.1a, due to the nonzero (3, 1, 2), a_3 has a dependency to b_1 and c_2, which are modeled by the curves from a_3 to n^b_1 and n^c_2. Since a_3, b_1, and c_2 reside in different parts, a_3 contributes one to the connectivity of n^b_1 and of n^c_2, which accurately captures the total communication volume by increasing the total connectivity-1 metric by two.

Notice that the part μ_A(i) is always connected by n^a_i, assuming that the slice X(i, :, :) is not empty. Therefore, minimizing the connectivity-1 metric (3.2) exactly amounts to minimizing the total communication volume required for Mttkrp in all computational phases of a CP-ALS iteration, if we set the cost c_h = R for each hyperedge h.

3.3.2 Fine-grain task model

In the fine-grain task model, we assume a partition of the nonzeros of the tensor; hence, there is no duplication. Let μ_X denote the partition of the tensor nonzeros, e.g., μ_X(i, j, k) = p if the process p holds x_{ijk}. Whenever μ_X(i, j, k) = p, the process p performs the Hadamard product of B(j, :) and C(k, :) in the first mode, of A(i, :) and C(k, :) in the second mode, and of A(i, :) and B(j, :) in the third mode, then scales the result with x_{ijk}. Let again μ_A, μ_B, and μ_C denote the partition of the indices of the corresponding modes; that is, μ_A(i) = p denotes that the process p is responsible for computing M_A(i, :). As in the coarse-grain model, we take the CP-ALS context into account while designing the Mttkrp algorithm: (i) the process which is responsible for M_A(i, :) is also held responsible for A(i, :); (ii) at the beginning, all rows of the factor matrices B(j, :) and C(k, :) where μ_X(i, j, k) = p are available to the process p; (iii) at the end, all processes get the R × R matrix AᵀA; and (iv) each process has all the rows of A that are required for the upcoming Mttkrp operations in the second or the third modes. With these in mind, the fine-grain parallel Mttkrp method executes Algorithm 4 at each process p.

In Algorithm 4, X_p and I_p denote the set of nonzeros and the set of first mode indices assigned to the process p. As seen in the algorithm, the process p has the required data to carry out all Hadamard products corresponding to all x_{ijk} ∈ X_p on Line 5. After scaling by x_{ijk}, the Hadamard products are locally accumulated in M_A, which contains a row for all i where X_p has a nonzero on the slice X(i, :, :). Notice that a process generates a partial result for each slice of the first mode in which it has a nonzero. The relevant row indices (the set of unique first mode indices in X_p) are denoted by F_p.


Algorithm 4: Fine-grain Mttkrp for the first mode of 3rd order tensors at process p within CP-ALS

Input: X_p: tensor nonzeros where μ_X(i, j, k) = p; I_p: the first mode indices where μ_A(I_p) = p; F_p: the set of unique first mode indices in X_p; BᵀB and CᵀC: R × R matrices; B(J_p, :), C(K_p, :): rows of the factor matrices, where J_p and K_p correspond to the unique second and third mode indices in X_p

1:  On exit, A(I_p, :) is computed, its rows are sent to the processes needing them, A(F_p, :) is updated, and AᵀA is available
2:  Initialize M_A(F_p, :) to all zeros of size |F_p| × R
3:  foreach x_{ijk} ∈ X_p do
5:      M_A(i, :) ← M_A(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]
6:  foreach i ∈ F_p \ I_p do
8:      Send M_A(i, :) to μ_A(i)
10: Receive contributions to M_A(i, :) for i ∈ I_p and add them up
12: A(I_p, :) ← M_A(I_p, :) (BᵀB ∗ CᵀC)†
13: foreach i ∈ I_p do
15:     Send A(i, :) to all processes having nonzeros in X(i, :, :)
16: foreach i ∈ F_p \ I_p do
18:     Receive A(i, :) from μ_A(i)
20: Locally compute A(I_p, :)ᵀ A(I_p, :) and all-reduce the results to form AᵀA


Some indices in F_p are not owned by the process p. For such an i ∈ F_p \ I_p, the process p sends its contribution to M_A(i, :) to the owner process μ_A(i) at Line 8. Again, messages of the process p destined to the same process are combined. Then, the process p receives contributions to M_A(i, :), where μ_A(i) = p, at Line 10. Each such contribution is accumulated, and the new A(i, :) is computed by a multiplication with (BᵀB ∗ CᵀC)† at Line 12. Finally, the updated A is communicated to satisfy the precondition of the upcoming Mttkrp operations. Note that the communications at Lines 15 and 18 are the duals of those at Lines 10 and 8, respectively, in the sense that the directions of the messages are reversed and the messages always pertain to the same row indices. For example, M_A(i, :) is sent to the process μ_A(i) at Line 8, and A(i, :) is received from μ_A(i) at Line 18. Hence, the process p generates a partial result for each slice X(i, :, :) on which it has a nonzero, either keeps it or sends it to the owner process μ_A(i) (at Line 8), and if so, receives the updated row A(i, :) later (at Line 18). Finally, the algorithm finishes with an all-reduce operation to make AᵀA available to all processes for the upcoming Mttkrp operations.

We now investigate the computational and communication requirements of Algorithm 4. As before, the computational core is at Line 5, where the Hadamard products of the row vectors B(j, :) and C(k, :) are computed for each nonzero x_{ijk} that a process owns, and the result is added to M_A(i, :). Therefore, each nonzero x_{ijk} incurs three Hadamard products (one per mode) in its owner process μ_X(i, j, k) in an iteration of the CP-ALS algorithm. Other computations are either efficiently handled with BLAS3 routines (Lines 12 and 20), or they (Line 10) are not as many as the number of Hadamard products. The significant communication operations are the sends at Lines 8 and 15, and the corresponding receives at Lines 10 and 18. Again, the all-reduce operation at Line 20 would be handled efficiently by the MPI implementation. Therefore, the problems of computational load balance and reduced communication cost need to be addressed by partitioning the tensor nonzeros equitably among processes (for load balance at Line 5), while trying to confine all slices of all modes to a small set of processes (to reduce communication at Lines 8 and 18).

We propose a hypergraph model for the computational load and communication volume of the fine-grain parallel Mttkrp operation in the context of CP-ALS. For a given tensor, the hypergraph H = (V, E) with the vertex set V and hyperedge set E is defined as follows. The vertex set V has four types of vertices: a_i for i = 1, ..., I_A; b_j for j = 1, ..., I_B; c_k for k = 1, ..., I_C; and the vertex x_{ijk} we define for each nonzero in X, by abusing the notation. The vertex x_{ijk} represents the Hadamard products B(j, :) ∗ C(k, :), A(i, :) ∗ C(k, :), and A(i, :) ∗ B(j, :) to be performed during the Mttkrp operations at different modes. There are three types of hyperedges in E, and they represent the dependencies of the Hadamard products on the rows of the factor matrices. The hyperedges are defined as follows:


E = E_A ∪ E_B ∪ E_C, where E_A contains a hyperedge n^a_i for each A(i, :), E_B contains a hyperedge n^b_j for each B(j, :), and E_C contains a hyperedge n^c_k for each C(k, :). Initially, n^a_i, n^b_j, and n^c_k contain only the corresponding vertices a_i, b_j, and c_k. Then, for each nonzero x_{ijk} ∈ X, we add the vertex x_{ijk} to the hyperedges n^a_i, n^b_j, and n^c_k, to model the data dependency to the corresponding rows of A, B, and C. As in the previous subsection, one needs a multi-constraint formulation, for three reasons. First, since the x_{ijk} vertices represent the Hadamard products, they should be evenly distributed among the processes. Second, as the owners of the rows of A, B, and C are possible destinations of partial results (at Line 8), it is advisable to spread them evenly. Third, an almost equal number of rows per process implies balanced memory usage and BLAS3 operations at Lines 12 and 20, even though this operation is not our main concern for the computational load. With these constraints, we associate a weight vector of size N + 1 with each vertex. For a vertex x_{ijk}, we set w(x_{ijk}, :) = [0, 0, 0, 3]; for the vertices a_i, b_j, and c_k, we set w(a_i, :) = [1, 0, 0, 0], w(b_j, :) = [0, 1, 0, 0], and w(c_k, :) = [0, 0, 1, 0].
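
For concreteness, a minimal sketch of this construction for a third order tensor in coordinate format; the data layout and names are assumptions for illustration, not the actual model-building code.

```python
def build_fine_grain_hypergraph(coords, dims):
    """coords: list of (i, j, k) nonzero positions; dims = (I_A, I_B, I_C).
    Returns vertex weight vectors and hyperedges (one per factor-matrix row)."""
    weights, nets = {}, {}
    # one vertex and one hyperedge per row of each factor matrix
    for mode, size in enumerate(dims):
        for idx in range(size):
            v = (mode, idx)                 # vertex a_i, b_j, or c_k
            w = [0, 0, 0, 0]
            w[mode] = 1                     # one unit in the mode's constraint
            weights[v] = w
            nets[v] = {v}                   # n^a_i initially holds only a_i, etc.
    # one vertex per nonzero, connected to the three row hyperedges
    for nz, (i, j, k) in enumerate(coords):
        v = ('x', nz)
        weights[v] = [0, 0, 0, 3]           # three Hadamard products per CP-ALS iteration
        for mode, idx in enumerate((i, j, k)):
            nets[(mode, idx)].add(v)
    return weights, nets
```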

In Figure 3.1b, we demonstrate the fine-grain hypergraph model for the sample tensor of the previous section. We exclude the vertex weights from the figure. Similar to the coarse-grain model, the dependency on the rows of A, B, and C is modeled by connecting each vertex x_{ijk} to the hyperedges n^a_i, n^b_j, and n^c_k, to capture the communication cost.

Consider now a P-way partition of the vertices of H = (V, E), and associate each part with a unique process to obtain a P-way parallel CP-ALS with the fine-grain Mttkrp operation. The partition of the x_{ijk} vertices uniquely partitions the Hadamard products. Satisfying the fourth balance constraint in (3.3) balances the computational loads of the processes. Similarly, satisfying the first three constraints leads to an equitable partition of the rows of the factor matrices among the processes. Let μ_X(i, j, k) = p and μ_A(i) = ℓ with ℓ ≠ p; that is, the vertex x_{ijk} and a_i are in different parts. Then, the process p sends a partial result on M_A(i, :) to the process ℓ at Line 8 of Algorithm 4. By looking at these messages carefully, we see that all processes whose corresponding parts are in the connectivity set of n^a_i will send a message to μ_A(i). Since μ_A(i) is also in this set, we see that the total volume of the send operations concerning M_A(i, :) at Line 8 is equal to λ_{n^a_i} − 1. Therefore, the connectivity-1 cutsize metric (3.2) over the hyperedges in E_A encodes the total volume of messages sent at Line 8, if we set c_h = R for all hyperedges in E_A. Since the send operations at Line 15 are duals of the send operations at Line 8, the total volume of messages sent at Line 15 for the first mode is also equal to this number. By extending this reasoning to all hyperedges, we see that the cumulative (over all modes) volume of communication is equal to the connectivity-1 cutsize metric (3.2).

The presence of multiple constraints usually causes difficulties for the current state of the art partitioning tools (Aykanat et al. [15] discuss the issue for PaToH [40]).


In preliminary experiments, we observed that this was the case for us as well. In an attempt to circumvent this issue, we designed the following alternative. We start with the same hypergraph and remove the vertices corresponding to the rows of the factor matrices. Then, we use a single constraint partitioning on the resulting hypergraph to partition the vertices of the nonzeros. We are then faced with the problem of assigning the rows of the factor matrices, given the partition on the tensor nonzeros. Any objective function (relevant to our parallelization context) in trying to solve this problem would likely be a variant of the NP-complete Bin-Packing Problem [93, p. 124], as in the SpMV case [27, 208]. Therefore, we heuristically handle this problem by making the following observations: (i) the modes can be partitioned one at a time; (ii) each row corresponds to a removed vertex for which there is a corresponding hyperedge; (iii) the partition of the x_{ijk} vertices defines a connectivity set for each hyperedge. This is essentially multi-constraint partitioning which takes advantage of the listed observations. We use the connectivity of the corresponding hyperedges as the key value of the rows to impose an order. Then, the rows are visited in decreasing order of key values. Each visited row is assigned to the least loaded part in the connectivity set of its corresponding hyperedge, if that part is not overloaded. Otherwise, the visited vertex is assigned to the least loaded part at that moment. Note that in the second case we add one to the total communication volume; otherwise, the total communication volume obtained by partitioning the x_{ijk} vertices remains the same. Note also that the key values correspond to the volume of receives at Line 10 of Algorithm 4; hence, this method also aims to balance the volume of receives at Line 10 and the volume of the dual send operations at Line 15.
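
A minimal sketch of this row-assignment heuristic for one mode; the data layout, the capacity threshold, and the names are assumptions for illustration.

```python
def assign_rows(row_nets, nonzero_part, part_loads, capacity):
    """row_nets[r]: nonzero vertices in the hyperedge of row r;
    nonzero_part[v]: part of nonzero vertex v; part_loads[p]: current row count of part p."""
    connectivity = {r: {nonzero_part[v] for v in verts}
                    for r, verts in row_nets.items()}
    assignment = {}
    # visit rows in decreasing connectivity (the key value in the text above)
    for r in sorted(row_nets, key=lambda r: len(connectivity[r]), reverse=True):
        candidates = [p for p in connectivity[r] if part_loads[p] < capacity]
        if candidates:                       # stay in the connectivity set: no extra volume
            p = min(candidates, key=lambda p: part_loads[p])
        else:                                # otherwise: globally least loaded part (+1 volume)
            p = min(part_loads, key=lambda p: part_loads[p])
        assignment[r] = p
        part_loads[p] += 1
    return assignment
```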

3.4 Experiments

We conducted our experiments on the IDRIS Ada cluster, which consists of 304 nodes, each having 128 GB of memory and four 8-core Intel Sandy Bridge E5-4650 processors running at 2.7 GHz. We ran our experiments up to 1024 cores and assigned one MPI process per core. All codes we used in our benchmarks were compiled using the Intel C++ compiler (version 14.0.1.106) with the -O3 option for compiler optimizations, -axAVX,SSE4.2 to enable SSE optimizations whenever possible, and the -mkl option to use the Intel MKL library (version 11.1.1) for LAPACK, BLAS, and VML (Vector Math Library) routines. We also compiled the DFacTo code using the Eigen (version 3.2.4) library [102] with the EIGEN_USE_MKL_ALL flag set, to ensure that it utilizes the routines provided in the Intel MKL library for the best performance. We used PaToH [40] with default options to partition the hypergraphs. We created all partitions offline and ran our experiments on these partitioned tensors on the cluster. We do not report timings for partitioning the hypergraphs, which is quite costly, for two reasons. First, in most


Tensor     I1      I2     I3      I4     nonzeros
netflix    480K    17K    2K      -      100M
NELL-B     32M     301    638K    -      78M
deli       530K    17M    24M     1K     140M
flickr     319K    28M    16M     713    112M

Table 3.1: Size of tensors used in the experiments.

applications, the tensors from the real-world data are built incrementally; hence, a partition for the updated tensor can be formed by refining the partition of the previous tensor. Also, one can decompose a tensor multiple times with different ranks of approximation. Thereby, the cost of an initial partitioning of a tensor can be amortized across multiple runs. Second, we had to partition the data on a system different from the one used for the CP-ALS runs. It would not be very useful to compare the run times from different systems, and repeating the same runs on a different system would not add much to the presented results. We also note that the proposed coarse- and fine-grain Mttkrp formulations are independent of the partitioning method.

Our dataset for the experiments consists of four sparse tensors whose sizes are given in Table 3.1; they were obtained from various resources [21, 37, 97].

We briefly describe the methods compared in the experiments. DFacTo is the CP-ALS routine we compare our methods against; it uses a block partitioning of the rows of the matricized tensors to distribute the matrix nonzeros equally among processes. In the method ht-coarsegrain-block, we operate the coarse-grain CP-ALS implementation in HyperTensor on a tensor whose slices in each mode are partitioned consecutively to ensure an equal number of nonzeros among processes, similarly to DFacTo's approach. The method ht-coarsegrain-hp runs the same CP-ALS implementation, and it uses the hypergraph partitioning. The method ht-finegrain-random partitions the tensor nonzeros as well as the rows of the factor matrices randomly to establish load balance. It uses the fine-grain CP-ALS implementation in HyperTensor. Finally, the method ht-finegrain-hp corresponds to the fine-grain CP-ALS implementation in HyperTensor operating on the tensor partitioned according to the hypergraph model proposed for the fine-grain task decomposition. We benchmark our methods against DFacTo on the Netflix and NELL-B datasets only, as DFacTo can only process 3-mode tensors. We let the CP-ALS implementations run for 20 iterations on each data set with R = 10 and record the average time spent per iteration.

In Figures 3.2a and 3.2b, we show the time spent per CP-ALS iteration on the Netflix and NELL-B tensors for all algorithms. In Figures 3.2c and 3.2d, we show the speedups per CP-ALS iteration on the Flickr and Delicious tensors for the methods excluding DFacTo. While performing the comparison with DFacTo, we preferred to show the time spent per iteration instead of the speedup, as the sequential run time of our methods differs from that of


[Figure 3.2: Time for one parallel CP-ALS iteration on Netflix and NELL-B, and speedups on Flickr and Delicious. Panels: (a) Time on Netflix, (b) Time on NELL-B, (c) Speedup on Flickr, (d) Speedup on Delicious. The x-axes show the number of MPI processes (1 to 1024); the curves correspond to ht-finegrain-hp, ht-finegrain-random, ht-coarsegrain-hp, ht-coarsegrain-block, and DFacTo.]


DFacTo. Also, in Table 3.2, we show statistics regarding the communication and computation requirements of the 512-way partitions of the Netflix tensor for all proposed methods.

We first observe in Figure 3.2a that, on Netflix, ht-finegrain-hp clearly outperforms all other methods by achieving a speedup of 194x with 512 cores over a sequential execution, whereas ht-coarsegrain-hp, ht-coarsegrain-block, DFacTo, and ht-finegrain-random could only yield 69x, 63x, 49x, and 40x speedups, respectively. Table 3.2 shows that, on a 512-way partitioning of the same tensor, ht-finegrain-random, ht-coarsegrain-block, and ht-coarsegrain-hp result in total send communication volumes of 142M, 80M, and 77M units, respectively, whereas the ht-finegrain-hp partitioning requires only 7.6M units, which explains the superior parallel performance of ht-finegrain-hp. Similarly, in Figures 3.2b, 3.2c, and 3.2d, ht-finegrain-hp achieves 81x, 129x, and 123x speedups on the NELL-B, Flickr, and Delicious tensors, while the best of all other methods could only obtain 39x, 55x, and 65x, respectively, up to 1024 cores. As seen from these figures, ht-finegrain-hp is the fastest of all proposed methods. It experiences slow-downs at 1024 cores for all instances.

One point to note in Figure 3.2b is that our Mttkrp implementation gets notably more efficient than DFacTo on NELL-B. We believe that the main reason for this difference is DFacTo's column-by-column way of computing the factor matrices. We instead compute all columns of the factor matrices simultaneously. This vector-wise mode of operation on the rows of the factors can significantly improve the data locality and cache efficiency, which leads to this performance difference. Another point in the same figure is that HyperTensor running with the ht-coarsegrain-block partitioning scales better than DFacTo, even though they use similar partitionings of the matricized tensor or the tensor. This is mainly due to the fact that DFacTo uses MPI_Allgatherv to communicate local rows of factor matrices to all processes, which incurs a significant increase in the communication volume, whereas our coarse-grain implementation performs point-to-point communications. Finally, in Figure 3.2a, DFacTo has a slight edge over our methods in sequential execution. This could be due to the fact that it requires 3(|X| + F) multiply-add operations across all modes, where F is the average number of non-empty fibers and is upper-bounded by |X|, whereas our implementation always takes 6|X| operations.

Partitioning metrics in Table 3.2 show that ht-finegrain-hp is able to successfully reduce the total communication volume, which brings about its improved scalability. However, we do not see the same effect when using hypergraph partitioning for the coarse-grain task model, due to inherent limitations of 1D partitionings. Another point to note is that, despite reducing the total communication volume explicitly, imbalance in the communication volume still exists among the processes, as this objective is not directly captured by the hypergraph models. However, the most important limitation on further scalability is the communication latency. Table 3.2 shows that


                        Comp. load         Comm. volume       Num. msg.
Mode                    Max      Avg       Max       Avg       Max     Avg

ht-finegrain-hp
1                       196672   196251    21079     6367      734     316
2                       196672   196251    18028     5899      1022    1016
3                       196672   196251    3545      2492      1022    1018

ht-finegrain-random
1                       197507   196251    272326    252118    1022    1022
2                       197507   196251    29282     22715     1022    1022
3                       197507   196251    7766      4300      1013    1003

ht-coarsegrain-hp
1                       364181   196251    302001    136741    511     511
2                       349123   196251    59523     12228     511     511
3                       737570   196251    23524     2000      511     507

ht-coarsegrain-block
1                       198602   196251    239337    142006    448     447
2                       367966   196251    33889     12458     511     445
3                       737570   196251    24659     2049      511     394

Table 3.2: Statistics for the computation and communication requirements in one CP-ALS iteration for 512-way partitionings of the Netflix tensor.

each process communicates with almost all others, especially in the small dimensions of the tensors, and the number of messages is doubled for the fine-grain methods. To alleviate this issue, one can try to limit the latency following previous work [201, 208].

3.5 Summary, further notes, and references

The pros and cons of the coarse- and fine-grain models resemble those of the 1D and 2D partitionings in the SpMV context. There are two advantages of the coarse-grain model over the fine-grain one. First, there is only one communication/computation step in the coarse-grain model, whereas the fine-grain model requires two computation and communication phases. Second, the coarse-grain hypergraph model is smaller than the fine-grain hypergraph model (one vertex per index of each dimension, versus additionally having one vertex per nonzero). However, one major limitation of the coarse-grain parallel decomposition is that restricting all nonzeros in an (N−1)-dimensional slice of an N-dimensional tensor to reside in the same process poses a significant constraint towards communication minimization: an (N−1)-dimensional data typically has a very large "surface", which necessitates gathering numerous rows of the factor matrices. As N increases, one might expect this phenomenon to increase the communication volume even further. Also, if the tensor X is not large enough in one of its modes, this poses a granularity problem, which potentially causes load imbalance and limits the


parallelism. None of these issues exists in the fine-grain task decomposition. In a more recent work [135], we extended the discussed approach for

efficient parallelization in shared and distributed memory systems. For the shared-memory computations, we proposed a novel computational scheme, called dimension trees, that significantly reduces the computational cost while offering an effective parallelism. We then performed theoretical analyses corresponding to the computational gains and the memory use. These analyses are also validated with the experiments. We also presented comparisons with the recent version of SPLATT [204, 205], which uses a medium-grain task model and associated partitioning [187]. Recent work discusses algorithms for partitioning for the medium-grain task decomposition [3].

A dimension tree is a data structure that partitions the mode indices of an N-dimensional tensor in a hierarchical manner for computing tensor decompositions efficiently. It was first used in the hierarchical Tucker format representing the hierarchical Tucker decomposition of a tensor [99], which was introduced as a computationally feasible alternative to the original Tucker decomposition for higher order tensors. Our work [135] presents a novel way of using dimension trees for computing the standard CP decomposition of tensors, with a formulation that asymptotically reduces the computational cost by memoizing certain partial computations.

For computing the CP decomposition of dense tensors, Phan et al. [189] propose a scheme that divides the tensor modes into two sets, precomputes the tensor-times-multiple-vector multiplications (TTMVs) for each mode set, and finally reuses these partial results to obtain the final Mttkrp result in each mode. This provides a factor of two improvement in the number of TTMVs over the standard approach, and our dimension tree-based framework can be considered as the generalization of this approach that provides a factor of N/log N improvement. Li et al. [153] use the same idea of storing intermediate tensors, but use a different formulation based on the tensor-times-tensor multiplication and the tensor–matrix Hadamard product operations for shared memory systems. The overall approach is similar to that proposed by Phan et al. [189], where the difference lies in the application of the method to sparse tensors and in auto-tuning to better control the memory use and the gains in the operation counts.

In most of the sparse tensors available (e.g., see FROSTT [203]), one of the dimensions is usually very small. If the indices in that dimension are vertices in a hypergraph, then we have very high degree vertices; if the indices are hyperedges, then we have hyperedges with very large sizes. Both cases are not favorable for the traditional multilevel hypergraph partitioning methods, where the efficient coarsening heuristics' run times are super-linear in terms of vertex/hyperedge degrees.

Part II

Matching and related problems


Chapter 4

Matchings in graphs

In this chapter, we address the problem of approximating the maximum cardinality of a matching in undirected graphs with fast heuristics. Most of the terms and definitions, including the problem itself, are standard; we summarized them in Section 1.2. The formal problem definition is repeated below for convenience.

Approximating the maximum cardinality of a matching in undirected graphs. Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.

We design randomized heuristics for the stated problem and analyze their approximation guarantees. Both the design and the analysis of the heuristics are based on doubly stochastic matrices (see Section 1.4) and on scaling of nonnegative matrices (in our case, the adjacency matrices) to the doubly stochastic form. We therefore first give some background on scaling matrices to doubly stochastic form.

4.1 Scaling matrices to doubly stochastic form

Any nonnegative matrix A with total support can be scaled with two positive diagonal matrices D_R and D_C such that D_R A D_C is doubly stochastic (that is, the sum of entries in any row and in any column of D_R A D_C is equal to one). If the matrix is fully indecomposable, then the scaling matrices are unique up to a scalar multiplication (if the pair D_R and D_C scale the matrix, then so do αD_R and (1/α)D_C for α > 0). If A has support but not total support, then A can be scaled to a doubly stochastic matrix, but not with two positive diagonal matrices [202] (this fact is also seen in more recent treatments [139, 141, 198]).

The Sinkhorn-Knopp algorithm [202] is a well-known method for scaling matrices to doubly stochastic form. This algorithm generates a sequence of



matrices (whose limit is doubly stochastic) by normalizing the columns and the rows of the sequence of matrices alternately, as shown in Algorithm 5. First, the initial matrix is normalized such that each column has sum one. Then, the resulting matrix is normalized so that each row has sum one, and so on. If the initial matrix is symmetric and has total support, the Sinkhorn-Knopp algorithm will find a symmetric doubly stochastic matrix at the limit, while the intermediate matrices generated in the sequence are not symmetric. Two other iterative scaling algorithms [140, 141, 198] maintain symmetry all along the way, and they are therefore more useful for symmetric matrices in practice. Efficient shared memory and distributed memory parallelizations of the Sinkhorn-Knopp and related algorithms have been discussed elsewhere [9, 41, 76].

Algorithm 5: ScaleSK: Sinkhorn-Knopp scaling

Data: A: an n × n matrix with total support; ε: the error threshold
Result: dr, dc: row/column scaling arrays

1:  for i = 1 to n do
2:      dr[i] ← 1
3:      dc[i] ← 1
4:  for j = 1 to n do
5:      csum[j] ← Σ_{i ∈ A_{*j}} a_{ij}
6:  repeat
7:      for j = 1 to n do
8:          dc[j] ← 1 / csum[j]
9:      for i = 1 to n do
10:         rsum ← Σ_{j ∈ A_{i*}} a_{ij} × dc[j]
11:         dr[i] ← 1 / rsum
12:     for j = 1 to n do
13:         csum[j] ← Σ_{i ∈ A_{*j}} dr[i] × a_{ij}
14: until max{|1 − csum[j]| : 1 ≤ j ≤ n} ≤ ε
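As an illustration, here is a compact sequential C++ sketch of Algorithm 5 on a sparse matrix stored in CSR form; the Csr structure, the function name, and the iteration cap are assumptions made for this sketch and do not come from the thesis' code.

#include <algorithm>
#include <cmath>
#include <vector>

struct Csr {                // n x n sparse matrix in CSR storage
  int n;
  std::vector<int> rowptr;  // size n + 1
  std::vector<int> colind;  // size nnz
  std::vector<double> val;  // size nnz (all 1.0 for a (0,1)-matrix)
};

void scale_sk(const Csr &A, std::vector<double> &dr, std::vector<double> &dc,
              double eps, int max_iters) {
  dr.assign(A.n, 1.0);
  dc.assign(A.n, 1.0);
  std::vector<double> csum(A.n, 0.0);
  for (int i = 0; i < A.n; ++i)  // initial column sums (lines 4-5)
    for (int p = A.rowptr[i]; p < A.rowptr[i + 1]; ++p)
      csum[A.colind[p]] += A.val[p];
  for (int it = 0; it < max_iters; ++it) {
    for (int j = 0; j < A.n; ++j) dc[j] = 1.0 / csum[j];  // lines 7-8
    for (int i = 0; i < A.n; ++i) {                        // lines 9-11
      double rsum = 0.0;
      for (int p = A.rowptr[i]; p < A.rowptr[i + 1]; ++p)
        rsum += A.val[p] * dc[A.colind[p]];
      dr[i] = 1.0 / rsum;
    }
    std::fill(csum.begin(), csum.end(), 0.0);              // lines 12-13
    for (int i = 0; i < A.n; ++i)
      for (int p = A.rowptr[i]; p < A.rowptr[i + 1]; ++p)
        csum[A.colind[p]] += dr[i] * A.val[p];
    double err = 0.0;                                      // line 14
    for (int j = 0; j < A.n; ++j) err = std::max(err, std::abs(1.0 - csum[j]));
    if (err <= eps) break;
  }
}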

4.2 Approximation algorithms for bipartite graphs

Most of the existing heuristics obtain good results in practice, but their worst-case guarantee is only around 1/2. Among those, the Karp-Sipser heuristic [128] is very well known. It finds maximum cardinality matchings in highly sparse (random) graphs and matches all but O(n^{1/5}) vertices of a random undirected graph [14] (this heuristic will be reviewed later in Section 4.2.1). Karp-Sipser also obtains very good results in practice. Currently, it is the suggested one to be used as a jump-start routine [71, 148] for exact algorithms, especially for augmenting-path based ones [132]. Algorithms that achieve an approximation ratio of 1 − 1/e, where e is the base


of the natural logarithm, are designed for the online case [129]. Many of these algorithms are sequential in nature, in that a sequence of greedy decisions is made in the light of the previously made decisions. Another heuristic is obtained by truncating the Hopcroft and Karp (HK) algorithm. HK, starting from a given matching, augments along a maximal set of shortest disjoint paths until there are no augmenting paths. If one lets HK run until the shortest augmenting paths are of length k, then a 1 − 2/k approximate matching is obtained, for k ≥ 3. The run time of this heuristic is O(τk) for a bipartite graph with τ edges.

We propose two matching heuristics (Section 4.2.2) for bipartite graphs. Both heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum matching in the selected subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge in both heuristics is obtained with a matrix scaling method. The first heuristic is shown to deliver a constant approximation guarantee of 1 − 1/e ≈ 0.632 of the maximum cardinality matching, under the condition that the scaling method has scaled the input matrix. The second one builds on top of the first one and improves the approximation ratio to 0.866, under the same condition as in the first one. Both of the heuristics are designed to be amenable to parallelization on modern multicore systems. The first heuristic does not require a conflict resolution scheme. Furthermore, it does not have any synchronization requirements. The second heuristic employs Karp-Sipser to find a matching on the selected subgraph. We show that Karp-Sipser becomes an exact algorithm on the selected subgraphs. Further analysis of the properties of those subgraphs is carried out to design a specialized implementation of Karp-Sipser for efficient parallelization. The approximation guarantees of the two proposed heuristics do not deteriorate with the increased degree of parallelization thanks to their design, which is usually not the case for parallel matching heuristics [16].

For the analysis, we assume that the two vertex classes contain the same number of vertices and that the associated adjacency matrix has total support. Under these criteria, the Sinkhorn-Knopp scaling algorithm summarized in Section 4.1 converges and obtains a doubly stochastic matrix. Later on (towards the end of Section 4.2.2 and in the experiments presented in Section 4.2.3), we discuss the bipartite graphs without the mentioned properties.

4.2.1 Related work

Parallel (exact) matching algorithms on modern architectures have been recently investigated. Azad et al. [16] study the implementations of a set of known bipartite graph matching algorithms on shared memory systems. Deveci et al. [66, 65] investigate the implementation of some known match-


ing algorithms, or their variants, on GPUs. There are quite good speedups reported in these implementations, yet there are non-trivial instances where parallelism does not help (for any of the algorithms).

Our focus is on matching heuristics that have linear run time complexity and good quality guarantees on the size of the matching. Recent surveys of matching heuristics are given by Duff et al. [71, Section 4], Langguth et al. [148], and, more recently, by Pothen et al. [195]. Two heuristics, called Greedy and Karp-Sipser, stand out and are suggested as initialization steps in the best two exact matching algorithms [132]. These two heuristics also attracted theoretical interest.

The Greedy heuristic has two variants in the literature. The first variant randomly visits the edges and matches the two endpoints of an edge if they are both available. The theoretical performance guarantee of this heuristic is 1/2, i.e., the heuristic delivers matchings of size at least half of the maximum cardinality of a matching. This has been analyzed theoretically [78] and shown to obtain results that are near the worst case on certain classes of graphs. The second variant of Greedy repeatedly selects a random vertex and matches it with a random neighbor. The matched vertices, along with the ones which become isolated, are removed from the graph, and the process continues until the whole graph is consumed. This variant also has a 1/2 worst-case approximation guarantee (see, for example, a proof by Pothen and Fan [194]), and it is somewhat better (0.5 + ε for ε ≥ 0.0000025 [13], which has been recently improved to ε ≥ 1/256 [191]).

We make use of the Karp-Sipser heuristic to design one of the proposed heuristics. We summarize a commonly used variant of Karp-Sipser here and refer the reader to the original paper [128] for details. The theoretical foundation of Karp-Sipser is that if there is a vertex v with exactly one neighbor (v is called degree-one), then there is a maximum cardinality matching in which v is matched with its neighbor. That is, matching v with its neighbor is an optimal decision. Using this observation, Karp-Sipser runs as follows. Check whether there is a degree-one vertex; if so, then match the vertex with its unique neighbor and delete both vertices (and the edges incident on them) from the graph. Continue this way until the graph has no edges (in which case we are done) or all remaining vertices have degree larger than one. In the latter case, pick a random edge, match the two endpoints of this edge, and delete those vertices and the edges incident on them. Then, repeat the whole process on the remaining graph. The phase before the first random choice of edges is called Phase 1, and the rest is called Phase 2 (where new degree-one vertices may arise). The run time of this heuristic is linear. This heuristic matches all but O(n^{1/5}) vertices of a random undirected graph [14]. One disadvantage of Karp-Sipser is that, because of the degree dependencies of the vertices on the already matched vertices, an efficient parallelism is hard to achieve (a list of degree-one vertices needs to be maintained). That is probably why some inflicted forms (successful, but


without any known quality guarantee) of this heuristic were used in recent studies [16].

Recent studies focusing on approximate matching algorithms on parallel systems include heuristics for the graph matching problem [42, 86, 105] and also heuristics used for initializing bipartite matching algorithms [16]. Lotker et al. [164] present a distributed 1 − 1/k approximate matching algorithm for bipartite graphs. This nice theoretical algorithm has O(k^3 log ∆ + k^2 log n) time steps with a message length of O(log ∆), where ∆ is the maximum degree of a vertex and n is the number of vertices in the graph. Blelloch et al. [29] propose an algorithm to compute maximal matchings (1/2 approximate) with O(τ) work and O(log^3 τ) depth, with high probability, on a bipartite graph with τ edges. This is an elaboration of the Greedy heuristic for parallel systems. Although the performance metrics work and depth are quite impressive, the approximation guarantee stays as in the serial variant. A striking property of this heuristic is that it trades parallelism and reduced work while always finding the same matching (including those found in the sequential version). Birn et al. [26] discuss maximal (1/2 approximate) matchings in O(log^2 n) time and O(τ) work in the CREW PRAM model. Practical implementations on distributed memory and GPU systems are discussed—a shared-memory implementation is left as future work in the cited paper. Patwary et al. [185] discuss an efficient implementation of Karp-Sipser on distributed memory systems and show that the communication requirement (in terms of the volume) is proportional to that of sparse matrix–vector multiplication.

The overall approach is related to a random bipartite graph model called the random k-out model [211]. In these graphs, every vertex chooses k neighbors uniformly at random from the other part of vertices. Walkup [211] shows that a 1-out subgraph of a complete bipartite graph G = (V1, V2, E), where |V1| = |V2|, asymptotically almost surely (a.a.s.) does not contain a perfect matching. He also shows that a 2-out subgraph of a complete bipartite graph a.a.s. contains a perfect matching. Karoński and Pittel [125], later with Overman [124], extend the analysis to two other cases. In the first case, the vertices that were not selected at all choose one more neighbor; in the second case, the vertices that were selected only once choose one more neighbor. The initial claim [125] was that in the first case there is a perfect matching a.a.s., which was shown to be false [124]. The existence of perfect matchings in the second case, in which the graph is still between 1-out and 2-out, has been shown recently [124].

4.2.2 Two matching heuristics

We propose two matching heuristics for the approximate maximum cardinality matching problem on bipartite graphs. The heuristics are efficiently parallelizable and have guaranteed approximation ratios. The first heuristic


does not require synchronization nor conflict resolution, assuming that the write operations to the memory are atomic (such settings are common; see a discussion in the original paper [76]). This heuristic has an approximation guarantee of around 0.632. The second heuristic is designed to obtain larger matchings compared to those obtained by the first one. This heuristic employs Karp-Sipser on a judiciously chosen subgraph of the input graph. We show that for this subgraph, Karp-Sipser is an exact algorithm, and a specialized efficient implementation is possible to obtain matchings of size around 0.866 of the maximum cardinality.

One-sided matching

The first matching heuristic we propose, OneSidedMatch, scales the given adjacency matrix A (each a_{ij} is originally either 0 or 1) and uses the scaled entries to randomly choose a column as a match for each row. The pseudocode of the heuristic is shown in Algorithm 6.

Algorithm 6: OneSidedMatch: Bipartite graphs

Data: A: an n × n (0,1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1: ⟨dr, dc⟩ ← ScaleSK(A)
2: for j = 1 to n in parallel do
3:     cmatch[j] ← NIL
4: for i = 1 to n in parallel do
5:     Pick a random column j ∈ A_{i*} by using the probability density function
           s_{ik} / Σ_{ℓ ∈ A_{i*}} s_{iℓ}   for all k ∈ A_{i*},
       where s_{ik} = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
6:     cmatch[j] ← i

OneSidedMatch first obtains the scaling vectors dr and dc corresponding to a doubly stochastic matrix S (line 1). After initializing the cmatch array, for each row i of A, the heuristic randomly chooses a column j ∈ A_{i*} based on the probabilities computed by using the corresponding scaled entries of row i. It then matches i and j. Clearly, multiple rows can choose the same column and write to the same entry in cmatch. We assume that, in the parallel shared-memory setting, one of the write operations survives, and the cmatch array defines a valid matching, i.e., {(cmatch[j], j) : cmatch[j] ≠ NIL}. We now analyze its approximation guarantee in terms of the matching cardinality.
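A minimal sequential C++ sketch of the selection step of OneSidedMatch is given below, reusing the Csr structure and the scaling vectors from the sketch after Algorithm 5; the parallel version would simply run the outer loop with a parallel for, relying on atomic writes to cmatch, as discussed later in this section.

#include <random>
#include <vector>

std::vector<int> one_sided_match(const Csr &A, const std::vector<double> &dr,
                                 const std::vector<double> &dc,
                                 std::mt19937 &rng) {
  std::vector<int> cmatch(A.n, -1);  // -1 plays the role of NIL
  for (int i = 0; i < A.n; ++i) {
    double rowsum = 0.0;             // sum of the scaled entries s_ik in row i
    for (int p = A.rowptr[i]; p < A.rowptr[i + 1]; ++p)
      rowsum += dr[i] * A.val[p] * dc[A.colind[p]];
    std::uniform_real_distribution<double> U(0.0, rowsum);
    double r = U(rng), acc = 0.0;
    for (int p = A.rowptr[i]; p < A.rowptr[i + 1]; ++p) {
      acc += dr[i] * A.val[p] * dc[A.colind[p]];
      if (acc >= r) {                // column chosen with probability s_ik / rowsum
        cmatch[A.colind[p]] = i;
        break;
      }
    }
  }
  return cmatch;
}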

Theorem 4.1. Let A be an n × n (0,1)-matrix with total support. Then, OneSidedMatch obtains a matching of size at least n(1 − 1/e) ≈ 0.632n in expectation, as n → ∞.

Proof. To compute the matching cardinality, we will count the columns that are not picked by any row and subtract that number from n. Since Σ_{k ∈ A_{i*}} s_{ik} = 1 for each row i of S, the probability that a column j is not picked by any of the rows in A_{*j} is equal to ∏_{i ∈ A_{*j}} (1 − s_{ij}). By applying the arithmetic-geometric mean inequality, we obtain

\[
\sqrt[d_j]{\prod_{i \in A_{*j}} (1 - s_{ij})} \;\le\; \frac{d_j - \sum_{i \in A_{*j}} s_{ij}}{d_j},
\]

where d_j = |A_{*j}| is the degree of column vertex j. Therefore,

\[
\prod_{i \in A_{*j}} (1 - s_{ij}) \;\le\; \left(1 - \frac{\sum_{i \in A_{*j}} s_{ij}}{d_j}\right)^{d_j}.
\]

Since S is doubly stochastic, we have Σ_{i ∈ A_{*j}} s_{ij} = 1, and

\[
\prod_{i \in A_{*j}} (1 - s_{ij}) \;\le\; \left(1 - \frac{1}{d_j}\right)^{d_j}.
\]

The function on the right-hand side above is an increasing one and has the limit

\[
\lim_{d_j \to \infty} \left(1 - \frac{1}{d_j}\right)^{d_j} = \frac{1}{e},
\]

where e is the base of the natural logarithm. By the linearity of expectation, the expected number of unmatched columns is no larger than n/e. Hence, the cardinality of the matching is no smaller than n(1 − 1/e).

In Algorithm 6, we split the rows among the threads with a parallel for construct. For each row i, the corresponding thread chooses a random number r from a uniform distribution with range (0, Σ_{k ∈ A_{i*}} s_{ik}]. Then, the last nonzero column index j for which Σ_{1 ≤ k ≤ j} s_{ik} ≤ r is found, and cmatch[j] is set to i. Since no synchronization or conflict detection is required, the heuristic promises significant speedups.

Two-sided matching

OneSidedMatch's approximation guarantee and suitable structure for parallel architectures make it a good heuristic. The natural question that follows is whether a heuristic with a better guarantee exists. Of course, the sought heuristic should also be simple and easy to parallelize. We asked "what happens if we repeat the process for the other (column) side of the bipartite graph?" The question led us to the following algorithm. Let each


row select a column, and let each column select a row. Take all these 2n choices to construct a bipartite graph G (a subgraph of the input) with 2n vertices and at most 2n edges (if i chooses j and j chooses i, we have one edge), and seek a maximum cardinality matching in G. Since the number of edges is at most 2n, any exact matching algorithm on this graph would be fast—in particular, the worst-case run time would be O(n^{1.5}) [117]. Yet, we can do better and obtain a maximum cardinality matching in linear time by running the Karp-Sipser heuristic on G, as we display in Algorithm 7.

Algorithm 7: TwoSidedMatch: Bipartite graphs

Data: A: an n × n (0,1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1: ⟨dr, dc⟩ ← ScaleSK(A)
2: for i = 1 to n in parallel do
3:     Pick a random column j ∈ A_{i*} by using the probability density function
           s_{ik} / Σ_{ℓ ∈ A_{i*}} s_{iℓ}   for all k ∈ A_{i*},
       where s_{ik} = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
4:     rchoice[i] ← j
5: for j = 1 to n in parallel do
6:     Pick a random row i ∈ A_{*j} by using the probability density function
           s_{kj} / Σ_{ℓ ∈ A_{*j}} s_{ℓj}   for all k ∈ A_{*j}
7:     cchoice[j] ← i
8: Construct a bipartite graph G = (V_R ∪ V_C, E), where
       E = {{i, rchoice[i]} : i ∈ {1, …, n}} ∪ {{cchoice[j], j} : j ∈ {1, …, n}}
9: match ← Karp-Sipser(G)

The most interesting component of TwoSidedMatch is the incorporation of the Karp-Sipser heuristic, for two reasons. First, although it is only a heuristic, Karp-Sipser computes a maximum cardinality matching on the bipartite graph G constructed in Algorithm 7. Second, although Karp-Sipser has a sequential nature, we can obtain good speedups with a specialized parallel implementation. In general, it is hard to efficiently parallelize (non-trivial) graph algorithms, and it is even harder when the overall cost is O(n), which is the case for Karp-Sipser on G. We give a series of lemmas below which enable us to use Karp-Sipser as an exact algorithm with a good shared-memory parallel performance.
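For illustration, a C++ sketch of the 1-out choices made at lines 2-7 of Algorithm 7 follows; it reuses the Csr structure and scaling vectors of the earlier sketches and assumes that At stores A by columns (the CSR of A^T), which is a convenience not present in the original description.

#include <random>
#include <vector>

// Choose a random neighbor in row i of M with probability proportional to
// the scaled entries du[i] * m_ik * dv[k].
static int pick_neighbor(const Csr &M, int i, const std::vector<double> &du,
                         const std::vector<double> &dv, std::mt19937 &rng) {
  double total = 0.0;
  for (int p = M.rowptr[i]; p < M.rowptr[i + 1]; ++p)
    total += du[i] * M.val[p] * dv[M.colind[p]];
  std::uniform_real_distribution<double> U(0.0, total);
  double r = U(rng), acc = 0.0;
  for (int p = M.rowptr[i]; p < M.rowptr[i + 1]; ++p) {
    acc += du[i] * M.val[p] * dv[M.colind[p]];
    if (acc >= r) return M.colind[p];
  }
  return M.colind[M.rowptr[i + 1] - 1];  // numerical safety fallback
}

void two_sided_choices(const Csr &A, const Csr &At,
                       const std::vector<double> &dr,
                       const std::vector<double> &dc, std::mt19937 &rng,
                       std::vector<int> &rchoice, std::vector<int> &cchoice) {
  rchoice.assign(A.n, -1);
  cchoice.assign(A.n, -1);
  for (int i = 0; i < A.n; ++i) rchoice[i] = pick_neighbor(A, i, dr, dc, rng);
  for (int j = 0; j < A.n; ++j) cchoice[j] = pick_neighbor(At, j, dc, dr, rng);
}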

The first lemma describes the structure of G constructed at line 8 of TwoSidedMatch.

Lemma 4.1. Each connected component of G constructed in Algorithm 7 contains at most one simple cycle.

Proof. A connected component M ⊆ G with n′ vertices can have at most n′ edges. Let T be a spanning tree of M. Since T contains n′ − 1 edges, the remaining edge in M can create at most one cycle when added to T.

Lemma 4.1 explains why Karp-Sipser is an exact algorithm on G. If a component does not contain a cycle, Karp-Sipser consumes all its vertices in Phase 1. Therefore, all of the matching decisions given by Karp-Sipser are optimal for this component. Assume a component contains a simple cycle. After Phase 1, the component is either consumed or, due to Lemma 4.1, it is reduced to a simple cycle. In the former case, the matching is a maximum cardinality one. In the latter case, an arbitrary edge of the cycle can be used to match a pair of vertices. This decision necessarily leads to a unique perfect matching in the remaining simple path. These two arguments can be repeated for all the connected components to see that the Karp-Sipser heuristic finds a maximum cardinality matching in G.

Algorithm 8 describes our parallel implementation, Karp-Sipser-MT. The graph is represented using a single array choice, where choice[u] is the vertex randomly chosen by u ∈ V_R ∪ V_C. The choice array is a concatenation of the arrays rchoice and cchoice set in TwoSidedMatch. Hence, an explicit graph construction for G (line 8 of Algorithm 7) is not required, and a transformation of the selected edges to a graph storage scheme is avoided. Karp-Sipser-MT uses three atomic operations for synchronization. The first operation, Add(memory, value), atomically adds a value to a memory location. It is used to compute the vertex degrees in the initial graph (line 9). The second operation, CompAndSwap(memory, value, replace), first checks whether the memory location holds value; if so, its content is replaced. The final content is returned. The third operation, AddAndFetch(memory, value), atomically adds a given value to a memory location, and the final content is returned. We will describe the use of the latter two operations later.
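The three primitives can be realized, for instance, with C++11 std::atomic as sketched below; the original implementation may well use compiler builtins instead, so this is only an illustration of the intended semantics.

#include <atomic>

// Add(memory, value): atomically add a value to a memory location.
inline void Add(std::atomic<int> &memory, int value) {
  memory.fetch_add(value);
}

// CompAndSwap(memory, value, replace): if the location holds `value`, replace
// its content; the final content of the location is returned.
inline int CompAndSwap(std::atomic<int> &memory, int value, int replace) {
  int expected = value;
  if (memory.compare_exchange_strong(expected, replace))
    return replace;   // swap happened
  return expected;    // swap did not happen; the observed content is returned
}

// AddAndFetch(memory, value): atomically add and return the final content.
inline int AddAndFetch(std::atomic<int> &memory, int value) {
  return memory.fetch_add(value) + value;
}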

Karp-Sipser-MT has two phases, which correspond to the two phases of Karp-Sipser. The first phase of Karp-Sipser-MT is similar to that of Karp-Sipser in that optimal matching decisions are made about some degree-one vertices. The second phase of Karp-Sipser-MT handles the remaining vertices very efficiently, without bothering with their degrees. The following definitions are used to clarify the difference between the original Karp-Sipser and Karp-Sipser-MT.

Definition 4.1. Given a matching and the array choice, let u be an unmatched vertex and v = choice[u]. Then, u is called


Algorithm 8: Karp-Sipser-MT: Bipartite 1-out graphs

Data: G = (V, choice[·]): the chosen vertex for each u ∈ V
Result: match[·]: the match array for u ∈ V

1:  for all u ∈ V in parallel do
2:      mark[u] ← 1
3:      deg[u] ← 1
4:      match[u] ← NIL
5:  for all u ∈ V in parallel do
6:      v ← choice[u]
7:      mark[v] ← 0
8:      if choice[v] ≠ u then
9:          Add(deg[v], 1)
10: for each vertex u in parallel do            ▷ Phase 1: out-one vertices
11:     if mark[u] = 1 then
12:         curr ← u
13:         while curr ≠ NIL do
14:             nbr ← choice[curr]
15:             if CompAndSwap(match[nbr], NIL, curr) = curr then
16:                 match[curr] ← nbr
17:                 curr ← NIL
18:                 next ← choice[nbr]
19:                 if match[next] = NIL then
20:                     if AddAndFetch(deg[next], −1) = 1 then
21:                         curr ← next
22:             else
23:                 curr ← NIL
24: for each column vertex u in parallel do     ▷ Phase 2: the rest
25:     v ← choice[u]
26:     if match[u] = NIL and match[v] = NIL then
27:         match[u] ← v
28:         match[v] ← u


• out-one, if v is unmatched and no unmatched vertex w with choice[w] = u exists;

• in-one, if v is matched and only a single unmatched vertex w with choice[w] = u exists.

The first phase of Karp-Sipser-MT (the for loop of line 10) does not track and match all degree-one vertices. Instead, only the out-one vertices are taken into account. For each such vertex u that is already out-one before Phase 1, we have mark[u] = 1. Karp-Sipser-MT visits these vertices (lines 10-11). Newly arising out-one vertices are consumed right away, without maintaining a list. The second phase of Karp-Sipser-MT (the for loop of line 24) is much simpler than that of Karp-Sipser, as the degrees of the vertices are not tracked/updated. We now discuss how these simplifications are possible while ensuring a maximum cardinality matching in G.

Observation 4.1. An out-one or an in-one vertex is a degree-one vertex in Karp-Sipser.

Observation 4.2. A degree-one vertex in Karp-Sipser is either an out-one or an in-one vertex, or it is one of the two vertices u and v in a 2-clique such that v = choice[u] and u = choice[v].

Lemma 4.2 (Proved elsewhere [76, Lemma 3]). If there exists an in-one vertex in G at any time during the execution of Karp-Sipser-MT, an out-one vertex also exists.

According to Observation 4.1, all the matching decisions given by Karp-Sipser-MT in Phase 1 are optimal, since an out-one vertex is a degree-one vertex. Observation 4.2 implies that, among all the degree-one vertices, Karp-Sipser-MT ignores only the in-ones and 2-cliques. According to Lemma 4.2, an in-one vertex cannot exist without an out-one vertex; therefore, they are handled in the same phase. The 2-cliques that survive Phase 1 are handled in Karp-Sipser-MT's Phase 2, since they can be considered as cycles.

To analyze the second phase of Karp-Sipser-MT, we will use the following lemma.

Lemma 4.3 (Proved elsewhere [76, Lemma 4]). Let G′ = (V′_R ∪ V′_C, E′) be the graph induced by the remaining vertices after the first phase of Karp-Sipser-MT. Then, the set

{(u, choice[u]) : u ∈ V′_R, choice[u] ∈ V′_C}

is a maximum cardinality matching in G′.

In the light of Observations 4.1 and 4.2 and Lemmas 4.1–4.3, Karp-Sipser-MT is an exact algorithm on the graphs created in Algorithm 7. Its worst-case (sequential) run time is linear.


Figure 4.1: A toy bipartite graph with 9 row (circles) and 9 column (squares) vertices. The edges are oriented from a vertex u to the vertex choice[u]. Assuming all the vertices are currently unmatched, matching 15-7 (or 5-13) creates two degree-one vertices. But no out-one vertex arises after matching (15-7), and only one, vertex 6, arises after matching (5-13).

Karp-Sipser-MT tracks and consumes only the out-one vertices. This brings high flexibility while executing Karp-Sipser-MT in multi-threaded environments. Consider the example in Figure 4.1. Here, after matching a pair of vertices and removing them from the graph, multiple degree-one vertices can be generated. The standard Karp-Sipser uses a list to store these new degree-one vertices. Such a list is necessary to obtain a larger matching, but the associated synchronizations while updating it in parallel will be an obstacle for efficiency. The synchronization can be avoided up to some level if one sacrifices the approximation quality by not making all optimal decisions (as in some existing work [16]). We continue with the following lemma to take advantage of the special structure of the graphs in TwoSidedMatch for parallel efficiency in Phase 1.

Lemma 4.4 (Proved elsewhere [76, Lemma 5]). Consuming an out-one vertex creates at most one new out-one vertex.

According to Lemma 4.4, Karp-Sipser-MT does not need a list to store the new out-one vertices, since the process can continue with the new out-one vertex. In a shared-memory setting, there are two concerns for the first phase from the synchronization point of view. First, multiple threads that are consuming different out-one vertices can try to match them with the same unmatched vertex. To handle such cases, Karp-Sipser-MT uses the atomic CompAndSwap operation (line 15 of Algorithm 8) and ensures that only one of these matchings will be processed. In this case, the other threads, whose matching decisions are not performed, continue with the next vertex in the for loop at line 10. The second concern is that, while consuming out-one vertices, several threads may create the same out-one vertex (and want to continue with it). For example, in Figure 4.1, when two threads consume the out-one vertices 1 and 2 at the same time, they both will try to continue with vertex 4. To handle such cases, an atomic AddAndFetch operation (line 20 of Algorithm 8) is used to synchronize the degree reduction operations on the potential out-one vertices. This approach explicitly orders the vertex consumptions and guarantees that only the thread that performs the last consumption before a new out-one vertex u appears continues with u. The other threads, which wanted to continue with the same path, stop and skip to the next unconsumed out-one vertex in the main for loop. One last concern that can be important for the parallel performance is the maximum length of such paths, since a very long path can yield a significant imbalance in the work distribution to the threads. We investigated the length of such paths experimentally (see the original paper [76]) and observed them to be too short to hurt the parallel performance.

The second phase of Karp-Sipser-MT is efficiently parallelized by using the idea in Lemma 4.3. That is, a maximum cardinality matching for the graph remaining after the first phase of Karp-Sipser-MT can be obtained via a simple parallel for construct (see line 24 of Algorithm 8).

Quality of approximation. If the initial matrix A is the n × n matrix of 1s, that is, a_{ij} = 1 for all 1 ≤ i, j ≤ n, then the doubly stochastic matrix S is such that s_{ij} = 1/n for all 1 ≤ i, j ≤ n. In this case, the graph G created by Algorithm 7 is a random 1-out bipartite graph [211]. Referring to a study by Meir and Moon [172], Karoński and Pittel [125] argue that the maximum cardinality of a matching in a random 1-out bipartite graph is 2(1 − Ω)n ≈ 0.866n in expectation, where Ω ≈ 0.567, also called Lambert's W(1), is the unique solution of the equation Ω e^Ω = 1. We obtain the same result for a random 1-out subgraph of any bipartite graph whose adjacency matrix has total support, as highlighted in the following theorem.

Theorem 4.2. Let A be the n × n adjacency matrix of a given bipartite graph, where A has total support. Then, TwoSidedMatch obtains a matching of size 2(1 − Ω)n ≈ 0.866n in expectation, as n → ∞, where Ω ≈ 0.567 is the unique solution of the equation Ω e^Ω = 1.
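As a quick numerical check of the constants in Theorem 4.2 (my arithmetic, not part of the original text):

\[
\Omega \, e^{\Omega} = 1
\;\Longrightarrow\; \Omega = W(1) \approx 0.5671,
\qquad
2\,(1 - \Omega) \approx 2 \times 0.4329 \approx 0.866 .
\]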

Theorem 4.2 contributes to the known results about the Karp-Sipser heuristic (recall that it is known to leave out O(n^{1/5}) vertices) by showing a constant approximation ratio with some preprocessing. The existence of total support does not seem to be necessary for Theorem 4.2 to hold (see the next subsection).

The proof of Theorem 4.2 is involved and can be found in the original paper [76]. It essentially adapts the original analysis of Karp and Sipser [128] to the special graphs at hand.

Discussion

The Sinkhorn-Knopp scaling algorithm converges linearly (when A has total support), where the convergence rate is equivalent to the square of the second largest singular value of the resulting doubly stochastic matrix [139]. If the matrix does not have support or total support, less is known about the Sinkhorn-Knopp scaling algorithm's convergence—we do not know how much of a reduction we will get at each step. In such cases, we are not able to bound the run time of the scaling step. However, the scaling algorithms should be run for only a few iterations in practice (see also the next paragraph), in which case the practical run time of our heuristics would be linear (in edges and vertices).

We have discussed the proposed matching heuristics while assuming that A has total support. This can be relaxed in two ways to render the overall approach practical in any bipartite graph. The first relaxation is that we do not need to run the scaling algorithms until convergence (as in some other use cases of similar algorithms [141]). If Σ_{i ∈ A_{*j}} s_{ij} ≥ α instead of Σ_{i ∈ A_{*j}} s_{ij} = 1 for all j ∈ {1, …, n}, then lim_{d→∞} (1 − α/d)^d = 1/e^α. In other words, if we apply the scaling algorithms for a few iterations, or until some relatively large error tolerance, we can still derive similar results. For example, if α = 0.92, we will have a matching of size n(1 − 1/e^{0.92}) ≈ 0.601n (larger column sums give improved ratios, but there are columns whose sum is less than one when convergence is not achieved) for OneSidedMatch. With the same α, TwoSidedMatch will obtain an approximation of 0.840. In our experiments, the number of iterations was always a few, and the proven approximation guarantees were always observed. The second relaxation is that we do not need total support; we do not even need support, nor an equal number of vertices in the two vertex classes. Since little is known about the scaling methods on such matrices, we only mention some facts and observations, and later on present some experiments to demonstrate the practicality of the proposed OneSidedMatch and TwoSidedMatch heuristics.
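A short worked instance of the relaxed guarantee for OneSidedMatch (again my arithmetic): if every column sum of the approximately scaled matrix is at least α, then

\[
\prod_{i \in A_{*j}} (1 - s_{ij})
\;\le\; \Bigl(1 - \frac{\alpha}{d_j}\Bigr)^{d_j}
\;\xrightarrow[\;d_j \to \infty\;]{}\; e^{-\alpha},
\qquad
\text{so for } \alpha = 0.92:\quad n\bigl(1 - e^{-0.92}\bigr) \approx 0.601\,n .
\]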

A sparse matrix (not necessarily square) can be permuted into a block upper triangular form using the canonical Dulmage-Mendelsohn (DM) decomposition [77]:

\[
A = \begin{pmatrix} H & * & * \\ O & S & * \\ O & O & V \end{pmatrix},
\qquad
S = \begin{pmatrix} S_1 & * \\ O & S_2 \end{pmatrix},
\]

where H (horizontal) has more columns than rows and has a matching covering all rows, S is square and has a perfect matching, and V (vertical) has more rows than columns and a matching covering all columns. The following facts about the DM decomposition are well known [192, 194]. Any of these three blocks can be void. If H is not connected, then it is block diagonal with horizontal blocks. If V is not connected, then it is block diagonal with vertical blocks. If S does not have total support, then it is in the block upper triangular form shown on the right, where S_1 and S_2 have the same structure, recursively, until each block S_i has total support. The entries in the blocks shown by "∗" cannot be put into a maximum cardinality matching. When the presented scaling methods are applied to a matrix, the entries in the "∗" blocks will tend to zero (the case of S is well documented [202]). Furthermore, the row sums of the blocks of H will be a multiple of the column sums in the same block; a similar statement holds for V; finally, S will be doubly stochastic. That is, the scaling algorithms applied to bipartite graphs without perfect matchings will zero out the entries in the irrelevant parts and identify the entries that can be put into a maximum cardinality matching.

4.2.3 Experiments

We present a selected set of results from the original paper [76]. We used a shared-memory parallel version of the Sinkhorn-Knopp algorithm with a fixed number of iterations. This way, the scaling algorithm is simplified, as no convergence check is required. Furthermore, its overhead is bounded. In most of the experiments, we use 0, 1, 5, or 10 scaling iterations, where the case of 0 corresponds to applying the matching heuristics with uniform edge selection probabilities. While these do not guarantee convergence of the Sinkhorn-Knopp scaling algorithm, they seem to be enough for our purposes.

Matching quality

We investigate the matching quality of the proposed heuristics on all square matrices with support from the SuiteSparse Matrix Collection [59] having at least 1000 non-empty rows/columns and at most 20,000,000 nonzeros (some of the matrices are also in the 10th DIMACS challenge [19]). There were 742 matrices satisfying these properties at the time of experimentation. With the OneSidedMatch heuristic, the quality guarantee of 0.632 was surpassed with 10 iterations of the scaling method in 729 matrices. With the TwoSidedMatch heuristic and the same number of iterations of the scaling method, the quality guarantee of 0.866 was surpassed in 705 matrices. Making 10 more scaling iterations smoothed out the remaining instances. The Greedy heuristic (the second variant discussed in Section 4.2.1) obtained on average (both the geometric and arithmetic means) matchings of size 0.93 of the maximum cardinality. In the worst case, 0.82 was observed.

We collect a few of the matrices from the SuiteSparse collection in Table 4.1. Figures 4.2a and 4.2b show the qualities on our test matrices. In the figures, the first columns represent the case where the neighbors are picked from a uniform distribution over the adjacency lists, i.e., the case with no scaling, hence no guarantee. The quality guarantees are achieved with only 5 scaling iterations. Even with a single iteration, the quality of TwoSidedMatch is more than 0.866 for all matrices, and that of OneSidedMatch


Name            n           # of edges    Avg deg   stdA      stdB      sprank/n
atmosmodl       1489752     10319760      6.9       0.3       0.3       1.00
audikw_1        943695      77651847      82.2      42.5      42.5      1.00
cage15          5154859     99199551      19.2      5.7       5.7       1.00
channel         4802000     85362744      17.8      1.0       1.0       1.00
europe_osm      50912018    108109320     2.1       0.5       0.5       0.99
Hamrle3         1447360     5514242       3.8       1.6       2.1       1.00
hugebubbles     21198119    63580358      3.0       0.03      0.03      1.00
kkt_power       2063494     12771361      6.2       7.45      7.45      1.00
nlpkkt240       27993600    760648352     26.7      2.22      2.22      1.00
road_usa        23947347    57708624      2.4       0.93      0.93      0.95
torso1          116158      8516500       73.3      419.59    245.44    1.00
venturiLevel3   4026819     16108474      4.0       0.12      0.12      1.00

Table 4.1: Properties of the matrices used in the comparisons. In the table, sprank is the maximum matching cardinality, Avg deg is the average vertex degree, stdA and stdB are the standard deviations of A and B vertex (row and column) degrees, respectively.

is more than 0.632 for all matrices.

Comparison of TwoSidedMatch with Karp-Sipser. Next, we analyze the performance of the proposed heuristics with respect to Karp-Sipser on a matrix class which we designed as a challenging case for Karp-Sipser. Let A be an n × n matrix, R1 be the set of A's first n/2 rows, and R2 be the set of A's last n/2 rows; similarly, let C1 and C2 be the set of the first n/2 and the set of the last n/2 columns. As Figure 4.3 shows, A has a full R1 × C1 block and an empty R2 × C2 block. The last h ≪ n rows and columns of R1 and C1, respectively, are full. Each of the blocks R1 × C2 and R2 × C1 has a nonzero diagonal. Those diagonals form a perfect matching when combined. In the sequel, a matrix whose corresponding bipartite graph has a perfect matching will be called full-sprank, and sprank-deficient otherwise.

When h ≤ 1, the Karp-Sipser heuristic consumes the whole graph during Phase 1 and finds a maximum cardinality matching. When h > 1, Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block R1 × C1 is full, it is more likely that the nonzero will be chosen from this block. Thus, a row in R1 will be matched with a column in C1, which is a bad decision, since the block R2 × C2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that TwoSidedMatch chooses an edge from that block goes to zero, as those entries cannot be in a perfect matching.
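A small C++ sketch that generates the nonzero pattern of such a hard instance, following the block description above; n is assumed even, h < n/2, and the function name is illustrative.

#include <utility>
#include <vector>

std::vector<std::pair<int, int>> hard_instance(int n, int h) {
  std::vector<std::pair<int, int>> nz;   // (row, column) coordinates
  const int half = n / 2;
  for (int i = 0; i < half; ++i)         // full R1 x C1 block
    for (int j = 0; j < half; ++j) nz.emplace_back(i, j);
  for (int i = half - h; i < half; ++i)  // last h rows of R1 are full
    for (int j = half; j < n; ++j) nz.emplace_back(i, j);
  for (int j = half - h; j < half; ++j)  // last h columns of C1 are full
    for (int i = half; i < n; ++i) nz.emplace_back(i, j);
  for (int i = 0; i < half - h; ++i) {   // diagonals of R1 x C2 and R2 x C1
    nz.emplace_back(i, half + i);        // (entries already covered above are skipped)
    nz.emplace_back(half + i, i);
  }
  return nz;
}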


[Figure 4.2: Matching qualities of (a) OneSidedMatch and (b) TwoSidedMatch on the test matrices of Table 4.1, with 0, 1, and 5 scaling iterations. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations.]

[Figure 4.3: Adjacency matrices of hard bipartite graph instances for Karp-Sipser, with row blocks R1, R2, column blocks C1, C2, and parameter h.]


                  Results with different numbers of scaling iterations
      Karp-     0          1                    5                    10
h     Sipser    Qual.      Err.      Qual.      Err.      Qual.      Err.      Qual.
2     0.782     0.522      1.3853    0.557      0.3463    0.989      0.0578    0.999
4     0.704     0.489      1.1257    0.516      0.3856    0.980      0.0604    0.997
8     0.707     0.466      0.8653    0.487      0.4345    0.946      0.0648    0.996
16    0.685     0.448      0.6373    0.458      0.4683    0.885      0.0725    0.990
32    0.670     0.447      0.4555    0.453      0.4428    0.748      0.0867    0.980

Table 4.2: Quality comparison (minimum of 10 executions for each instance) of the Karp-Sipser heuristic and TwoSidedMatch on matrices described in Fig. 4.3 with n = 3200 and h ∈ {2, 4, 8, 16, 32}.

The results of the experiments are in Table 4.2. The first column shows the h value. Then, the matching quality obtained by Karp-Sipser and by TwoSidedMatch with different numbers of scaling iterations (0, 1, 5, 10), as well as the scaling error, are given. The scaling error is the maximum difference between 1 and each row/column sum of the scaled matrix (for 0 iterations, it is equal to n − 1 for all cases). The quality of a matching is computed by dividing its cardinality by the maximum one, which is n = 3200 for these experiments. To obtain the values in each cell of the table, we run the programs 10 times and give the minimum quality (as we are investigating the worst-case behavior). The highest variances for Karp-Sipser and TwoSidedMatch were (up to four significant digits) 0.0041 and 0.0001, respectively. As expected, when h increases, Karp-Sipser performs worse, and the matching quality drops to 0.67 for h = 32. TwoSidedMatch's performance increases with the number of scaling iterations. As the experiment shows, only 5 scaling iterations are sufficient to make the proposed two-sided matching heuristic significantly better than Karp-Sipser. However, this number was not enough to reach 0.866 for the matrix with h = 32. On this matrix, with 10 iterations, only 2% of the rows/columns remain unmatched.

Matching quality on bipartite graphs without perfect matchings. We analyze the proposed heuristics on a class of random sprank-deficient square (n = 100000) and rectangular (m = 100000 and n = 120000) matrices with a uniform nonzero distribution (two more sprank-deficient matrices are used in the scalability tests as well). These matrices are generated by Matlab's sprand command (generating Erdős-Rényi random matrices [85]). The total number of nonzeros is set to be around d × 100000 for d ∈ {2, 3, 4, 5}. The top half of Table 4.3 presents the results of this experiment with square matrices, and the bottom half presents the results with the rectangular ones.

As in the previous experiments, the matching qualities in the table are the minimum of 10 executions for the corresponding instances. As Table 4.3


                       One      Two                           One      Two
                       Sided    Sided                         Sided    Sided
d   iter   sprank      Match    Match    d   iter   sprank    Match    Match

                         Square, 100000 × 100000

2   0      78225       0.770    0.912    4   0      97787     0.644    0.838
2   1      78225       0.797    0.917    4   1      97787     0.673    0.848
2   5      78225       0.850    0.939    4   5      97787     0.719    0.873
2   10     78225       0.879    0.954    4   10     97787     0.740    0.886

3   0      92786       0.673    0.851    5   0      99223     0.635    0.840
3   1      92786       0.703    0.857    5   1      99223     0.662    0.851
3   5      92786       0.756    0.884    5   5      99223     0.701    0.873
3   10     92786       0.784    0.902    5   10     99223     0.716    0.882

                       Rectangular, 100000 × 120000

2   0      87373       0.793    0.912    4   0      99115     0.729    0.899
2   1      87373       0.815    0.918    4   1      99115     0.754    0.910
2   5      87373       0.861    0.939    4   5      99115     0.792    0.933
2   10     87373       0.886    0.955    4   10     99115     0.811    0.946

3   0      96564       0.739    0.896    5   0      99761     0.725    0.905
3   1      96564       0.769    0.904    5   1      99761     0.749    0.917
3   5      96564       0.813    0.930    5   5      99761     0.781    0.936
3   10     96564       0.836    0.945    5   10     99761     0.792    0.943

Table 4.3: Matching qualities of the proposed heuristics on random sparse matrices with uniform nonzero distribution. Square matrices in the top half, rectangular matrices in the bottom half; d: average number of nonzeros per row; iter: number of scaling iterations.

shows, when the deficiency is high (correlated with small d), it is easier for our algorithms to approximate the maximum cardinality. However, when d gets larger, the algorithms require more scaling iterations. Even in this case, 5 iterations are sufficient to achieve the guaranteed qualities. In the square case, the minimum qualities achieved by OneSidedMatch and TwoSidedMatch were 0.701 and 0.873. In the rectangular case, the minimum qualities achieved by OneSidedMatch and TwoSidedMatch were 0.781 and 0.930, respectively, with 5 scaling iterations. In all cases, increased scaling iterations result in higher quality matchings, in accordance with the previous results on square matrices with perfect matchings.

Comparisons with some other codes

We run Azad et al.'s [16] variant of Karp-Sipser (ksV), the two greedy matching heuristics of Blelloch et al. [29] (inc and nd, which are referred to as "incremental" and "non-deterministic reservation" in the original paper), and the proposed OneSidedMatch and TwoSidedMatch heuristics with one and 16 threads on the matrices given in Table 4.1. We also add one more matrix, wc_50K_64, from the family shown in Fig. 4.3, with n = 50000 and h = 64.


                                   Run time in seconds
                    One thread                           16 threads
Name            ksV     inc       nd     1SD    2SD     ksV    inc      nd     1SD    2SD
atmosmodl       0.20    14.80     0.07   0.12   0.30    0.03   1.63     0.01   0.01   0.03
audikw_1        0.98    206.00    0.22   0.55   0.82    0.10   16.90    0.02   0.06   0.09
cage15          1.54    9.96      0.39   0.92   1.95    0.29   1.10     0.04   0.09   0.19
channel         1.29    105.00    0.30   0.82   1.46    0.14   9.07     0.03   0.08   0.13
europe_osm      12.03   51.80     2.06   4.89   11.97   1.20   7.12     0.21   0.46   1.05
Hamrle3         0.15    0.12      0.04   0.09   0.25    0.02   0.02     0.01   0.01   0.02
hugebubbles     8.36    4.65      1.28   3.92   10.04   0.95   0.53     0.18   0.37   0.94
kkt_power       0.55    0.68      0.09   0.19   0.45    0.08   0.16     0.01   0.02   0.05
nlpkkt240       10.88   1900.00   2.56   5.56   10.11   1.15   237.00   0.23   0.58   1.06
road_usa        6.23    3.57      0.96   2.16   5.42    0.70   0.38     0.09   0.22   0.50
torso1          0.11    7.28      0.02   0.06   0.09    0.02   1.20     0.00   0.01   0.01
venturiLevel3   0.45    2.72      0.13   0.32   0.84    0.08   0.29     0.01   0.04   0.09
wc_50K_64       7.21    3200.00   1.46   4.39   5.80    1.17   402.00   0.12   0.55   0.59

                              Quality of the obtained matching
Name            ksV     inc       nd     1SD    2SD     ksV    inc      nd     1SD    2SD
atmosmodl       1.00    1.00      1.00   0.66   0.87    0.98   1.00     0.97   0.66   0.87
audikw_1        1.00    1.00      1.00   0.64   0.87    0.95   1.00     0.97   0.64   0.87
cage15          1.00    1.00      1.00   0.64   0.87    0.95   1.00     0.91   0.64   0.87
channel         1.00    1.00      1.00   0.64   0.87    0.99   1.00     0.99   0.64   0.87
europe_osm      0.98    0.95      0.95   0.75   0.89    0.98   0.95     0.94   0.75   0.89
Hamrle3         1.00    0.84      0.84   0.75   0.90    0.97   0.84     0.86   0.75   0.90
hugebubbles     0.99    0.96      0.96   0.70   0.89    0.96   0.96     0.90   0.70   0.89
kkt_power       0.85    0.79      0.79   0.78   0.89    0.87   0.79     0.86   0.78   0.89
nlpkkt240       0.99    0.99      0.99   0.64   0.86    0.99   0.99     0.99   0.64   0.86
road_usa        0.94    0.81      0.81   0.76   0.88    0.94   0.81     0.81   0.76   0.88
torso1          1.00    1.00      1.00   0.65   0.88    0.98   1.00     0.98   0.65   0.87
venturiLevel3   1.00    1.00      1.00   0.68   0.88    0.97   1.00     0.96   0.68   0.88
wc_50K_64       0.50    0.50      0.50   0.81   0.98    0.70   0.50     0.51   0.81   0.98

Table 4.4: Run time and quality of different heuristics. OneSidedMatch and TwoSidedMatch are labeled with 1SD and 2SD, respectively, and apply two steps of scaling iterations. The heuristic ksV is from Azad et al. [16], and the heuristics inc and nd are from Blelloch et al. [29].

The OneSidedMatch and TwoSidedMatch heuristics are run with two scaling iterations, and the reported run time includes all components of the heuristics. The inc and nd heuristics are compiled with g++ 4.4.5, and ksV is compiled with gcc 4.4.5. For all the heuristics, we used -O2 optimization and the appropriate OpenMP flag for compilation. The experiments were carried out on a machine equipped with two Intel Sandybridge-EP CPUs clocked at 2.00 GHz and 256 GB of memory split across the two NUMA domains. Each CPU has eight cores (16 cores in total), and HyperThreading is enabled. Each core has its own 32 KB L1 cache and 256 KB L2 cache. The 8 cores on a CPU share a 20 MB L3 cache. The machine runs 64-bit Debian with Linux 2.6.39-bpo.2-amd64. The run times of all heuristics (in seconds) and their matching qualities are given in Table 4.4.


As seen in Table 4.4, nd is the fastest of the tested heuristics with both one thread and 16 threads on this dataset. The second fastest heuristic is OneSidedMatch. The heuristic ksV is faster than TwoSidedMatch on six instances with one thread. With 16 threads, TwoSidedMatch is almost always faster than ksV; both heuristics are quite efficient and have a maximum run time of 1.17 seconds. The heuristic inc was observed to have large run times on some matrices with relatively many nonzeros per row. All the existing heuristics are always very efficient, except inc, which had difficulties on some matrices in the dataset. In the original reference [29], inc is quite efficient for some other matrices (where the codes are compiled with Cilk). We do not see run times for nd elsewhere, but on our dataset it was always better than inc. The proposed OneSidedMatch heuristic's quality is almost always close to its theoretical limit. TwoSidedMatch also obtains quality results close to its theoretical limit. Looking at the quality of the matchings with one thread, we see that ksV obtains (almost always) the best score, except on kkt_power and wc_50K_64, where TwoSidedMatch obtains the best score. The heuristics inc and nd obtain the same score with one thread, but with 16 threads inc obtains better quality than nd almost always. OneSidedMatch never obtains the best score (TwoSidedMatch is always better), yet it is better than ksV, inc, and nd on the synthetic wc_50K_64 matrix. The TwoSidedMatch heuristic obtains better results than inc and nd on Hamrle3, kkt_power, road_usa, and wc_50K_64, with both one thread and 16 threads. Also, on hugebubbles with 16 threads, the difference between TwoSidedMatch and nd is 1% (in favor of nd).

4.3 Approximation algorithms for general undirected graphs

We extend the results on bipartite graphs of the previous section to general undirected graphs. We again select a subgraph randomly, with a probability density function obtained by scaling the adjacency matrix of the input graph, and run Karp-Sipser [128] on the selected subgraph. Again, Karp-Sipser becomes an exact algorithm on the selected subgraph, and the analysis reveals that the approximation ratio is around 0.866 of the maximum cardinality in the original graph. We also propose two other variants of this algorithm, which obtain better results both in theory and in practice. We omit most of the proofs in the presentation; they can be found in the associated paper [73].

Recall that a graph G′ = (V′, E′) is a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. Let G_kout = (V, E′) be a random subgraph of G, where each vertex in V randomly chooses k of its edges, with repetition (an edge {u, v} is included in G_kout only once, even if it is chosen multiple times). We call G_kout a k-out subgraph of G.

A permutation of [1, . . . , n] in which no element stays in its original position is called a derangement.


The total number of derangements of [1, . . . , n] is the nearest integer to n!/e [36, pp. 40–42], where e is the base of the natural logarithm.

4.3.1 Related work

The first polynomial time algorithm, with O(n²τ) complexity, for the maximum cardinality matching problem on general graphs with n vertices and τ edges was proposed by Edmonds [79]. Currently, the fastest algorithms have O(√n τ) complexity [30, 92, 174]. The first of these algorithms is by Micali and Vazirani [174]; recently, a simpler analysis of it was given by Vazirani [210]. Later, Blum [30] and Gabow and Tarjan [92] proposed algorithms with the same complexity.

Random graphs have been frequently analyzed in terms of their maximum matching cardinality. Frieze [90] shows results for undirected (general) graphs along the lines of Walkup's results for bipartite graphs [211]. In particular, he shows that if each vertex chooses one neighbor uniformly at random (among all vertices), then the constructed graph almost always does not have a perfect matching. In other words, he shows that random 1-out subgraphs of a complete graph do not have perfect matchings. Luckily, as in bipartite graphs, if each vertex chooses two neighbors, then the constructed graph almost always has a perfect matching. That is, random 2-out subgraphs of a complete graph have perfect matchings. Our contributions are about making these statements valid for any host graph (assuming the adjacency matrix has total support) and measuring the approximation guarantee of the 1-out case.

Randomized algorithms which check the existence of a perfect matching and generate perfect/maximum matchings have been proposed in the literature [45, 120, 165, 177, 178]. Lovász showed that the existence of a perfect matching can be verified in randomized O(n^ω) time, where O(n^ω) is the time complexity of the fastest matrix multiplication algorithm available [165]. More information, such as the set of allowed edges (i.e., the edges that appear in some maximum matching of a graph), can also be found with the same randomized complexity, as shown by Cheriyan [45].

To construct a maximum cardinality matching in a general, non-bipartite graph, a simple, easy to implement algorithm with O(n^{ω+1}) randomized complexity was proposed by Rabin and Vazirani [197]. The algorithm uses matrix inversion as a subroutine. Later, Mucha and Sankowski proposed the first algorithm for the same problem with O(n^ω) randomized complexity [178]. This algorithm uses expensive subroutines, such as Gaussian elimination and equivalence classes formed based on the edges that appear in some perfect matching [166]. A simpler randomized algorithm with the same complexity was proposed by Harvey [108].


4.3.2 One-Out, its analysis, and variants

We first present the main heuristic and then analyze its approximation guarantee. While the heuristic is a straightforward adaptation of its counterpart for bipartite graphs [76], the analysis is more complicated because of odd cycles. The analysis shows that random 1-out subgraphs of a given graph have a maximum matching cardinality around 0.866 − log(n)/n of the best possible; the observed performance (see Section 4.3.3) is higher. Then, two variants of the main heuristic are discussed, without proofs. They extend the 1-out subgraphs obtained by the first one and hence deliver better results theoretically.

The main heuristic One-Out

The heuristic, shown in Algorithm 9, first scales the adjacency matrix of a given undirected graph to be doubly stochastic. Based on the values in the matrix, the heuristic then randomly marks one of the neighbors of each vertex as chosen. This defines one marked edge per vertex. Then, the subgraph of the original graph containing only the marked edges is formed, yielding an undirected graph with at most n edges, each vertex having at least one edge incident on it. The Karp-Sipser heuristic is run on this subgraph and finds a maximum cardinality matching on it, as the resulting 1-out graph has unicyclic components. The approximation guarantee of the proposed heuristic is analyzed for this step. One practical improvement over the previous work is to make sure that the matching obtained is maximal, by running Karp-Sipser on the vertices that are not matched. This brings a large improvement to the cardinality of the matching in practice.
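To make the selection step concrete, the following sketch (in Python, not the code used in the thesis) draws one neighbor per vertex with probability proportional to the scaled entries and forms the edge set of the 1-out subgraph. The function name one_out_edges and the use of SciPy sparse matrices are illustrative assumptions of this sketch.

# A minimal sketch of the 1-out selection step: each vertex i picks one
# neighbor j with probability proportional to the scaled entry s_ij, and the
# union of the marked edges forms the 1-out subgraph.
# Assumes S is a symmetrically scaled sparse adjacency matrix in CSR format.
import numpy as np
import scipy.sparse as sp

def one_out_edges(S, rng=np.random.default_rng()):
    """Return the list of marked edges (i, choice[i]) of the 1-out subgraph."""
    S = sp.csr_matrix(S)
    n = S.shape[0]
    choice = np.empty(n, dtype=np.int64)
    for i in range(n):
        cols = S.indices[S.indptr[i]:S.indptr[i + 1]]   # neighbors of vertex i
        vals = S.data[S.indptr[i]:S.indptr[i + 1]]      # scaled entries s_ik
        choice[i] = rng.choice(cols, p=vals / vals.sum())
    # An edge {i, j} is kept once, even if both endpoints choose each other.
    edges = {(min(i, j), max(i, j)) for i, j in enumerate(choice)}
    return sorted(edges)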

Let us classify the edges incident on a vertex v_i as in-edges (from those neighbors that have chosen i at Line 4) and an out-edge (to the neighbor chosen by i at Line 4). Dufossé et al. [76, Lemma 3] show that, during any execution of Karp-Sipser, it suffices to consider the vertices whose in-edges are unavailable but whose out-edges are available as degree-1 vertices.

The analysis traces an execution of Karp-Sipser on the subgraph G_1out. In other words, the analysis can also be perceived as an analysis of the Karp-Sipser heuristic's performance on random 1-out graphs. Let A_1 be the set of vertices not chosen by any other vertex at Line 4 of Algorithm 9. These vertices have in-degree zero and out-degree one, and hence can be processed by Karp-Sipser. Let B_1 be the set of vertices chosen by the vertices in A_1. The vertices in B_1 can be perfectly matched with vertices in A_1, leaving some A_1 vertices unmatched and creating some new in-degree-0 vertices. We can proceed to define A_2 as the set of vertices that have in-degree 0 in V \ (A_1 ∪ B_1), define B_2 as those chosen by A_2, and so on. Formally, let B_0 be an empty set, define A_i to be the set of vertices with in-degree 0 in V \ B_{i−1}, and let B_i be the vertices chosen by those in A_i


Algorithm 9: One-Out (undirected graphs)

Data: G = (V, E) and its adjacency matrix A
Result: match[·]: the matching

1: D ← SymScale(A)
2: for i = 1 to n in parallel do
3:     Pick a random j ∈ A_{i∗} by using the probability density function
           s_{ik} / Σ_{t ∈ A_{i∗}} s_{it},   for all k ∈ A_{i∗},
       where s_{ik} = d[i] × d[k] is the corresponding entry in the scaled matrix S = DAD
4:     Mark that i chooses j
5: Construct a graph G_1out = (V, E′), where V = {1, . . . , n} and
       E′ = {(i, j) : i chose j or j chose i}
6: match ← Karp-SipserOne-Out(G_1out)
7: Make match a maximal matching

for i ≥ 1. Notice that A_i ⊆ A_{i+1} and B_i ⊆ B_{i+1}. The Karp-Sipser heuristic can process A_1, then A_2 \ A_1, and so on, until the remaining graph has cycles only. The sets A_i and B_i and their cardinalities are at the core of our analysis. We first present some facts about these sets and their cardinalities, and describe an implementation of Karp-Sipser instrumented to highlight them.

Lemma 4.5. With the definitions above, A_i ∩ B_i = ∅.

Proof. We prove this by induction. For i = 1, it clearly holds. Assume that it holds for all i < ℓ. Suppose there exists a vertex u ∈ A_ℓ ∩ B_ℓ. Because A_{ℓ−1} ∩ B_{ℓ−1} = ∅, u must necessarily belong to both A_ℓ \ A_{ℓ−1} and B_ℓ \ B_{ℓ−1}. For u to be in B_ℓ \ B_{ℓ−1}, there must exist at least one vertex v ∈ A_ℓ \ A_{ℓ−1} such that v chooses u. However, the condition for u ∈ A_ℓ is that no vertex in V \ (A_{ℓ−1} ∪ B_{ℓ−1}) has selected it. This is a contradiction, and the intersection A_ℓ ∩ B_ℓ must be empty.

Corollary 4.1. A_i ∩ B_j = ∅ for i ≤ j.

Proof. Assume A_i ∩ B_j ≠ ∅. Since A_i ⊆ A_j, we have a contradiction, as A_j ∩ B_j = ∅ by Lemma 4.5.

Thanks to Lemma 4.5 and Corollary 4.1, the sets A_i and B_i are disjoint, and they form a bipartite subgraph of G_1out for all i = 1, . . . , ℓ. The proposed version Karp-SipserOne-Out of the Karp-Sipser heuristic for 1-out graphs is shown in Algorithm 10. The degree-1 vertices are kept


in a first-in first-out priority queue Q. The queue is first initialized with A_1, and a special marker (denoted ⋆ below) is used to mark the end of A_1. Then, all vertices in A_1 are matched to some other vertices, defining B_1. When we remove two matched vertices from the graph G_1out at Lines 29 and 40, we update the degrees of their remaining neighbors and append the vertices whose degree becomes one to the queue. During Phase 1 of Karp-SipserOne-Out, we also maintain the sets of A_i and B_i vertices, while storing only the last ones. A_ℓ and B_ℓ are returned along with the number ℓ of levels, which is computed thanks to the use of the marker.

Apart from the scaling step, the proposed heuristic in Algorithm 9 has a linear worst-case time complexity of O(n + τ). The scaling step, if applied until convergence, can take more time than that. We do not suggest running it until convergence; for all practical purposes, 5 or 10 iterations of the basic method [141], or even fewer of the Newton iterations [140], seem sufficient (see the experiments). Therefore, the practical run time of the algorithm is linear.
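For illustration, a minimal sketch of a symmetric scaling step is given below. The simple fixed-point update used here is an assumption of this sketch and is not necessarily the exact iteration of [141] or [140]; it only conveys how a few iterations produce an approximately doubly stochastic DAD and how a scaling error can be measured. It assumes the matrix has no empty rows.

# A minimal, illustrative symmetric scaling sketch (SymScale-like): find a
# diagonal d such that diag(d) * A * diag(d) is nearly doubly stochastic.
import numpy as np
import scipy.sparse as sp

def sym_scale(A, iters=5):
    """Return (d, err) with diag(d) A diag(d) approximately doubly stochastic."""
    A = sp.csr_matrix(A)
    n = A.shape[0]
    d = np.ones(n)
    for _ in range(iters):
        row = A @ d               # row sums of A * diag(d); nonzero if no empty rows
        d = np.sqrt(d / row)      # push the row sums of DAD towards 1
    err = np.max(np.abs(d * (A @ d) - 1.0))   # maximum deviation from 1
    return d, err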

Analysis of One-Out

Let a_i and b_i be two random variables representing the cardinalities of A_i and B_i, respectively, in an execution of Karp-SipserOne-Out on a random 1-out graph. Then, Karp-SipserOne-Out matches b_ℓ edges in the first phase and leaves a_ℓ − b_ℓ vertices unmatched. What remains after the first phase is a set of cycles. In the bipartite graph case [76], all vertices in those cycles are matchable, and hence the cardinality of the matching was measured by n − a_ℓ + b_ℓ. Since we can possibly have odd cycles after the first phase, we cannot match all remaining vertices in the general case of undirected graphs. Let c be a random variable representing the number of odd cycles after the first phase of Karp-SipserOne-Out. Then, we have the following (obvious) lemma about the approximation guarantee of Algorithm 9.

Lemma 4.6. At the end of the execution, the number of unmatched vertices is a_ℓ − b_ℓ + c. Hence, Algorithm 9 matches at least n − (a_ℓ − b_ℓ + c) vertices.

We need to quantify a_ℓ − b_ℓ and c in Lemma 4.6. We state an upper bound on a_ℓ − b_ℓ in Lemma 4.7 and an upper bound on c in Lemma 4.8, without proofs (the reader can find the proofs in the original paper [73]). While the bound for a_ℓ − b_ℓ holds for any graph with total support presented to Algorithm 9, the bounds for c are shown for random 1-out graphs of complete graphs. By plugging in the bounds for these quantities, we obtain the following theorem on the approximation guarantee of Algorithm 9.

Theorem 4.3. Algorithm 9 obtains a matching with cardinality at least 0.866 − ⌈1.04 log(0.336n)⌉/n of the maximum, in expectation, when the input is a complete graph.


Algorithm 10: Karp-SipserOne-Out (undirected graphs)

Data: G_1out = (V, E)
Result: match[·]: the mates of vertices
Result: ℓ: the number of levels in the first phase
Result: A: the set of degree-1 vertices in the first phase
Result: B: the set of vertices matched to A vertices

 1: match[u] ← NIL for all u
 2: Q ← {v : deg(v) = 1}                          ▹ degree-1, or in-degree 0
 3: if Q = ∅ then
 4:     ℓ ← 0                                      ▹ no vertex in level 1
 5: else
 6:     ℓ ← 1
 7: Enqueue(Q, ⋆)                                  ▹ ⋆ marks the end of the first level
 8: Phase-1 ← ongoing
 9: A ← B ← ∅
10: while true do
11:     while Q ≠ ∅ do
12:         u ← Dequeue(Q)
13:         if u = ⋆ and Q = ∅ then
14:             break the while-Q-loop
15:         else if u = ⋆ then
16:             ℓ ← ℓ + 1
17:             Enqueue(Q, ⋆)                      ▹ new level formed
18:             skip to the next while-Q-iteration
19:         if match[u] ≠ NIL then
20:             skip to the next while-Q-iteration
21:         for v ∈ adj(u) do
22:             if match[v] = NIL then
23:                 match[u] ← v
24:                 match[v] ← u
25:                 if Phase-1 = ongoing then
26:                     A ← A ∪ {u}
27:                     B ← B ∪ {v}
28:                 N ← adj(v)
29:                 G_1out ← G_1out \ {u, v}
30:                 Enqueue(Q, w) for w ∈ N with deg(w) = 1
31:                 break the for-v-loop
32:         if Phase-1 = ongoing and match[u] = NIL then
33:             A ← A ∪ {u}                        ▹ u cannot be matched
34:     Phase-1 ← done
35:     if E ≠ ∅ then
36:         pick a random edge (u, v)
37:         match[u] ← v
38:         match[v] ← u
39:         N ← adj({u, v})
40:         G_1out ← G_1out \ {u, v}
41:         Enqueue(Q, w) for w ∈ N with deg(w) = 1
42:     else
43:         break the while-true loop


The theorem can also be interpreted more theoretically: a random 1-out graph has a maximum matching cardinality of at least 0.866 − ⌈1.04 log(0.336n)⌉/n of the best possible, in expectation. The bound is in close vicinity of 0.866, which was the proved bound for the bipartite case [76]. We note that in deriving this bound we assumed a random 1-out graph (as opposed to a random 1-out subgraph of a given graph) only at a single step. We leave the extension to this latter case as future work and present experiments suggesting that the bound is also achieved in this case. In Section 4.3.3 we empirically show that the same bound also holds for graphs whose corresponding matrices do not have total support.

In order to measure a_ℓ − b_ℓ, we adapted a proof from earlier work [76], which was inspired by Karp and Sipser's analysis of the first phase of their heuristic [128], and obtained the following lemma.

Lemma 4.7. a_ℓ − b_ℓ ≤ (2Ω − 1)n, where Ω ≈ 0.567 equals W(1) of Lambert's W function.

The lemma also reads as a_ℓ − b_ℓ ≤ Σ_{i=1}^{n} (2 · 0.567 − 1) ≤ 0.134 · n.

In order to achieve the result of Theorem 4.3, we need to bound c, the number of odd cycles that remain in a G_1out graph after the first phase of Karp-SipserOne-Out. We were able to bound c on random 1-out graphs (that is, we did not show it for a general graph).

Lemma 4.8. The expected number c of cycles remaining after the first phase of Karp-SipserOne-Out in random 1-out graphs is less than or equal to ⌈1.04 log(n − a_ℓ − b_ℓ)⌉, which is also less than or equal to ⌈1.04 log(0.336n)⌉.

Variants

Here we summarize two related theoretical random bipartite graph models that we adapt to the undirected case using similar algorithms. The presentation is brief and without proofs; we present experiments in Section 4.3.3.

The random (1 + e^{−1}) undirected graph model (see the bipartite version [125], summarized in Section 4.2.1) lets each vertex choose a random neighbor. Then, the vertices that have not been chosen select one more neighbor. The maximum cardinality of a matching in the subgraph consisting of the identified edges can be computed as an approximate matching in the original graph. As this heuristic uses a richer graph structure than 1-out, we expect perfect or near perfect matchings in the general case.

A model richer in edges is the random 2-out graph model. In this model, each vertex chooses two neighbors. There are two enticing characteristics of this model in the uniform bipartite case. First, the probability of having a perfect matching goes to one with the increasing number of vertices [211]. Second, there is a special algorithm for computing maximum cardinality matchings in these (bipartite) graphs [127] with high probability in linear time in expectation.
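Both models can be sketched, under the same assumptions as the 1-out sketch above (a scaled sparse matrix and SciPy), as follows; the function names are illustrative and not from the thesis code.

# A rough sketch of the two richer models: in the (1+e^-1)-out model only the
# vertices that were not chosen by anyone pick a second neighbor, while in the
# 2-out model every vertex picks two neighbors.
import numpy as np
import scipy.sparse as sp

def pick_neighbor(S, i, rng):
    cols = S.indices[S.indptr[i]:S.indptr[i + 1]]
    vals = S.data[S.indptr[i]:S.indptr[i + 1]]
    return rng.choice(cols, p=vals / vals.sum())

def k_out_edges(S, model="2-out", rng=np.random.default_rng()):
    S = sp.csr_matrix(S)
    n = S.shape[0]
    first = np.array([pick_neighbor(S, i, rng) for i in range(n)])
    edges = {(min(i, j), max(i, j)) for i, j in enumerate(first)}
    if model == "2-out":
        second_pickers = range(n)                              # everyone picks again
    else:                                                      # (1+e^-1)-out model
        second_pickers = np.setdiff1d(np.arange(n), first)     # not chosen by anyone
    for i in second_pickers:
        j = pick_neighbor(S, i, rng)
        edges.add((min(i, j), max(i, j)))
    return sorted(edges)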


4.3.3 Experiments

To understand the effectiveness and efficiency of the proposed heuristics in practice, we report the matching quality and various statistics regarding the sets that appear in our analyses in Section 4.3.2. For the experiments, we used three graph datasets. The first set is generated with matrices from the SuiteSparse Matrix Collection [59]. We investigated all n × n matrices from the collection with 50000 ≤ n ≤ 100000. For a matrix from this set, we removed the diagonal entries, symmetrized the pattern of the resulting matrix, and discarded it if it had empty rows. There were 115 matrices at the end, which we used as adjacency matrices. The graphs in the second dataset are synthetically generated to make Karp-Sipser deviate from the optimal matching as much as possible. This dataset contains five graphs with different hardness levels for Karp-Sipser. The third set contains five large real-life matrices from the SuiteSparse collection, used for measuring the run time efficiency of the proposed heuristics.

A comprehensive evaluation

We use MATLAB to measure the matching qualities on the first two datasets. For each matrix, five runs are performed with each randomized matching heuristic, and the average is reported. One, five, and ten scaling iterations are performed to evaluate the impact of the scaling method.

Table 4.5 summarizes the quality of the matchings for all the experiments on the first dataset. The matching quality is measured as the ratio of the matching cardinality to the maximum matching cardinality in the original graph. The table presents statistics for the matching qualities of Karp-Sipser performed on the original graphs (first row), on 1-out graphs (the second set of rows), on Karoński-Pittel-like (1 + e^{−1})-out graphs (the third set of rows), and on 2-out graphs (the last set of rows).

For the U-Max rows, we construct k-out graphs by using uniform probabilities while selecting the neighbors, as proposed in the literature [90, 125]. We compare the cardinality of the maximum matchings in these k-out graphs to the maximum matching cardinality in the original graphs and report the statistics. The rows St-Max report the same statistics for the k-out graphs constructed by using probabilities with t ∈ {1, 5, 10} scaling iterations. These statistics serve as upper bounds on the matching qualities of the proposed St-KS heuristics, which execute Karp-Sipser on the k-out graphs obtained with t scaling iterations. Since the St-KS heuristics use only a subgraph, the matchings they obtain are not necessarily maximal with respect to the original edge set. The proposed St-KS+ heuristics exploit this fact and apply another Karp-Sipser phase on the subgraph containing only the unmatched vertices to improve the quality of the matchings. The table does not contain St-Max rows for 1-out graphs, since Karp-Sipser is an optimal


Alg.            Min     Max     Avg    GMean    Med    StDev

Karp-Sipser    0.880   1.000   0.980   0.980   0.988   0.022

1-out
U-Max          0.168   1.000   0.846   0.837   0.858   0.091
S1-KS          0.479   1.000   0.869   0.866   0.863   0.059
S5-KS          0.839   1.000   0.885   0.884   0.865   0.044
S10-KS         0.858   1.000   0.889   0.888   0.866   0.045
S1-KS+         0.836   1.000   0.951   0.950   0.953   0.043
S5-KS+         0.865   1.000   0.958   0.957   0.968   0.036
S10-KS+        0.888   1.000   0.961   0.961   0.971   0.033

(1+e^-1)-out
U-Max          0.251   1.000   0.952   0.945   0.967   0.081
S1-Max         0.642   1.000   0.967   0.966   0.980   0.042
S5-Max         0.918   1.000   0.977   0.977   0.985   0.020
S10-Max        0.934   1.000   0.980   0.979   0.985   0.018
S1-KS          0.642   1.000   0.963   0.962   0.972   0.041
S5-KS          0.918   1.000   0.972   0.972   0.976   0.020
S10-KS         0.934   1.000   0.975   0.975   0.977   0.018
S1-KS+         0.857   1.000   0.972   0.972   0.979   0.025
S5-KS+         0.925   1.000   0.978   0.978   0.984   0.018
S10-KS+        0.939   1.000   0.980   0.980   0.985   0.016

2-out
U-Max          0.254   1.000   0.972   0.966   0.996   0.079
S1-Max         0.652   1.000   0.987   0.986   0.999   0.036
S5-Max         0.952   1.000   0.995   0.995   0.999   0.009
S10-Max        0.968   1.000   0.996   0.996   1.000   0.007
S1-KS          0.651   1.000   0.974   0.974   0.981   0.035
S5-KS          0.945   1.000   0.982   0.982   0.984   0.013
S10-KS         0.947   1.000   0.984   0.983   0.984   0.012
S1-KS+         0.860   1.000   0.980   0.979   0.984   0.020
S5-KS+         0.950   1.000   0.984   0.984   0.987   0.012
S10-KS+        0.952   1.000   0.986   0.986   0.987   0.011

Table 4.5: For each matrix in the first dataset and each proposed heuristic, five runs are performed. The statistics are computed over the means of these results: the minimum, maximum, arithmetic and geometric averages, median, and standard deviation are reported.


[Plots omitted. Panel (a): 1-out graphs, with curves for Karp-Sipser on the whole graph, OneOut 5 iters (Max = KS), and OneOut 5 iters (KS+). Panel (b): 2-out graphs, with curves for Karp-Sipser on the whole graph, TwoOut 5 iters (Max), TwoOut 5 iters (KS), and TwoOut 5 iters (KS+).]

Figure 4.4: The matching qualities sorted in increasing order for Karp-Sipser on the original graph, and S5-KS and S5-KS+ on 1-out and 2-out graphs. The figures also contain the quality of the maximum cardinality matchings in these graphs. The experiments are performed on the 115 graphs in the first dataset.

algorithm for these subgraphs.

As Table 4.5 shows, more scaling iterations increase the maximum matching cardinalities in k-out graphs. Although this is much clearer when the worst-case graphs are considered, it can also be observed in the arithmetic and geometric means. Since U-Max is the no-scaling case, the impact of the first scaling iteration (S1-KS vs. U-Max) is significant. On the other hand, the difference in matching quality between S5-KS and S10-KS is minor. Hence, five scaling iterations are deemed sufficient for the proposed heuristics in practice.

As the theory suggests, the heuristics St-KS perform well for (1 + e^{−1})-out and 2-out graphs. With t ∈ {5, 10}, their quality is almost on par with Karp-Sipser on the original graph, and even better for 2-out graphs. In addition, applying Karp-Sipser on the subgraph of unmatched vertices to obtain a maximal matching does not increase the matching quality much. Since this subgraph is small, the overhead of this extra work is not significant. Furthermore, this extra step significantly improves the matching quality for 1-out graphs, which a.a.s. do not have a perfect matching.

To better understand the practical performance of the proposed heuristics and the impact of the additional Karp-Sipser execution, we profile their performance by sorting their matching qualities in increasing order for all 115 matrices. Figures 4.4a and 4.4b plot these profiles on 1-out and 2-out graphs, respectively, for the heuristics with five scaling iterations. As the first figure shows, five iterations are sufficient to obtain 0.86 matching quality except in 17 of the 1-out experiments. The figure also shows that the maximum matching cardinality in a random 1-out graph is worse than what Karp-Sipser can obtain on the original graph. This is why, although S5-KS finds maximum matchings on


[Plot omitted: cumulative distribution function of the ratio of odd cycles to n, with the ratio ranging from 0 to 10%.]

Figure 4.5: The cumulative distribution of the ratio of the number of odd cycles remaining after the first phase of Karp-Sipser in One-Out to the number of vertices in the graph.

1-out graphs, its performance is still worse than that of Karp-Sipser. The additional Karp-Sipser in S5-KS+ closes almost all of this gap and brings the matching qualities close to those of Karp-Sipser. On the contrary, for 2-out graphs generated with five scaling iterations, the maximum matching cardinality is larger than the cardinality of the matchings found by Karp-Sipser. There is still a gap between the best possible (red line) and what Karp-Sipser can find (blue line) on 2-out graphs. We believe that this gap can be targeted by specialized, efficient, exact matching algorithms for 2-out graphs.

There are 30 rank-deficient matrices without total support among the 115 matrices in the first dataset. We observed that, even for 1-out graphs, the worst-case quality for S5-KS is 0.86 and the average is 0.93. Hence, the proposed approach also works well for rank-deficient matrices/graphs in practice.

Since the number c of odd cycles at the end of the first phase of Karp-Sipser is a performance measure, we investigate it. For each matrix, we compute the ratio c/n. We then plot the cumulative distribution of these values in Fig. 4.5. For all the experiments except one, this ratio is less than 1%. In the extreme case, the ratio increases to 8%. In that particular case, the matching found in the 1-out subgraph is maximum for the input graph (i.e., the number of odd components is also large in the original graph).

Hard instances for Karp-Sipser

Let A be an n × n symmetric matrix, V_1 be the set of A's first n/2 vertices, and V_2 be the set of A's last n/2 vertices. As Figure 4.6 shows, A has a full V_1 × V_1 block, ignoring the diagonal (i.e., a clique), and an empty V_2 × V_2 block. The last h ≪ n vertices of V_1 are connected to all the vertices in the corresponding graph. The block V_1 × V_2 has a nonzero diagonal; hence, the corresponding graph has a perfect matching. Note that the same instances are used for the bipartite case.
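The following small sketch builds these hard instances as a dense 0/1 adjacency matrix; it is for illustration only, and the function name is ours.

# A sketch that builds the adjacency matrix of the hard instances of Fig. 4.6.
import numpy as np

def hard_instance(n, h):
    """n even; returns the 0/1 adjacency matrix described in the text."""
    m = n // 2
    A = np.zeros((n, n), dtype=int)
    A[:m, :m] = 1                              # clique on V1 (diagonal cleared below)
    A[np.arange(m), m + np.arange(m)] = 1      # nonzero diagonal of the V1 x V2 block
    A[m - h:m, :] = 1                          # last h vertices of V1 see everyone
    np.fill_diagonal(A, 0)                     # no self loops
    return np.maximum(A, A.T)                  # symmetrize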

On such a matrix with h = 0, Karp-Sipser consumes the whole graph during Phase 1 and finds a maximum cardinality matching. When h > 1,


[Illustration omitted: a block matrix with a full V_1 × V_1 block, a V_1 × V_2 block with a nonzero diagonal, an empty V_2 × V_2 block, and the last h rows and columns of V_1 full.]

Figure 4.6: Adjacency matrices of hard undirected graph instances for Karp-Sipser.

                                  One-Out
                     5 iters           10 iters          20 iters
   h   Karp-Sipser  error  quality    error  quality    error  quality
   2      0.96       7.54    0.99      0.68    0.99      0.22    1.00
   8      0.78       8.52    0.97      0.78    0.99      0.23    0.99
  32      0.68       6.65    0.81      1.09    0.99      0.26    0.99
 128      0.63       3.32    0.53      1.89    0.90      0.33    0.98
 512      0.63       1.24    0.55      1.17    0.59      0.61    0.86

Table 4.6: Results for the hard instances with n = 5000 and different h values. One-Out is executed with 5, 10, and 20 scaling iterations, and the scaling errors are also reported. Averages of five runs are reported for each cell.

Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block V_1 × V_1 forms a clique, it is more likely that the nonzero will be chosen from this block. Thus, a vertex in V_1 will be matched with another vertex from V_1, which is a bad decision, since the block V_2 × V_2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that the proposed heuristic chooses an edge from that block goes to zero, as those entries cannot be in a perfect matching.

Table 4.6 shows that the quality of Karp-Sipser drops to 0.63 as h increases. In comparison, the One-Out heuristic with five scaling iterations maintains a good quality for small h values. However, better scaling with more iterations (10 and 20 for h = 128 and h = 512, respectively) is required to guarantee the desired matching quality; see the scaling errors in the table.

Experiments on large-scale graphs

These experiments were performed on a machine running 64-bit CentOS 6.5, which has 30 cores, each of which is an Intel Xeon CPU E7-4870 v2 core


                                           Karp-Sipser
Matrix              |V| (×10^6)   |E| (×10^6)   Quality    time

cage15                  5.2          94.0        1.00       5.82
dielFilterV3real        1.1          88.2        0.99       3.36
hollywood-2009          1.1         112.8        0.93       4.18
nlpkkt240              28.0         746.5        0.98      52.95
rgg_n_2_24_s0          16.8         265.1        0.98      19.49

              One-Out-S5-KS+                                Two-Out-S5-KS+
                    Execution time (seconds)                      Execution time (seconds)
Matrix  Quality  Scale  1-out    KS    KS+   Total   Quality  Scale  2-out    KS     KS+   Total
cage     0.93    0.67   0.85   0.65   0.05    2.21    0.99    0.67   1.48   1.78    0.01    3.94
diel     0.98    0.52   0.36   0.07   0.01    0.95    0.99    0.52   0.62   0.16    0.00    1.30
holly    0.91    1.12   0.45   0.06   0.02    1.65    0.95    1.12   0.76   0.13    0.01    2.01
nlpk     0.93    4.44   4.99   3.96   0.27   13.66    0.98    4.44   9.43  10.56    0.11   24.54
rgg      0.93    2.33   2.33   2.08   0.17    6.91    0.98    2.33   4.01   6.70    0.11   13.15

Table 4.7: Summary of the results with five large-scale matrices for the original Karp-Sipser and the proposed One-Out and Two-Out heuristics, with five scaling iterations and post-processing for maximal matchings. Matrix names are shortened in the lower half.

operating at 2.30 GHz. To choose five large-scale matrices from SuiteSparse, we first sorted the pattern-symmetric matrices in the collection in decreasing order of their nonzero count. We then chose the five largest matrices from different families to increase the variety of the experiments. The details of these matrices are given in Table 4.7. This table also contains the run times and matching qualities of the original Karp-Sipser and the proposed One-Out and Two-Out heuristics. The proposed heuristics use five scaling iterations and also apply Karp-Sipser at the end to ensure maximality.

The run times of the proposed heuristics are analyzed in four stages in Table 4.7. The Scale stage scales the matrix with five iterations; the k-out stage chooses the neighbors and constructs the k-out subgraph; the KS stage applies Karp-Sipser on the k-out subgraph; and KS+ is the stage for the additional Karp-Sipser at the end. The total time of these four stages is also shown for each instance. The quality results are consistent with the results obtained on the first dataset. As the table shows, the proposed heuristics are faster than the original Karp-Sipser on the input graph. For 1-out graphs, the proposed approach is 2.5–3.9 times faster than Karp-Sipser on the original graph. The speedups are between 1.5–2.6 for 2-out graphs with five iterations.

4.4 Summary, further notes, and references

We investigated two heuristics for the bipartite maximum cardinality matching problem. The first one, OneSidedMatch, is shown to have an


approximation guarantee of 1 − 1/e ≈ 0.632. The second heuristic, TwoSidedMatch, is shown to have an approximation guarantee of 0.866. Both algorithms use well-known methods to scale the sparse matrix associated with the given bipartite graph to a doubly stochastic form, whose entries are used as the probability density functions to randomly select a subset of the edges of the input graph. OneSidedMatch selects exactly n edges to construct a subgraph in which a matching of the guaranteed cardinality is identified with virtually no overhead, both in sequential and parallel execution. TwoSidedMatch selects around 2n edges and then runs the Karp-Sipser heuristic as an exact algorithm on the selected subgraph to obtain a matching of the conjectured cardinality. The subgraphs are analyzed to develop a specialized Karp-Sipser algorithm for efficient parallelization. All theoretical investigations are first performed assuming bipartite graphs with perfect matchings and that the scaling algorithms have converged. Then, theoretical arguments and experimental evidence are provided to extend the results to cover the other cases and to validate the applicability and practicality of the proposed heuristics in general settings.

The proposed OneSidedMatch and TwoSidedMatch heuristics do not guarantee the maximality of the matching found. One can visit the edges of the unmatched row vertices and match them to an available neighbor. We have carried out this greedy post-process and obtained the results shown in Figures 4.7a and 4.7b, in comparison with what was shown in Figures 4.2a and 4.2b. We see that the greedy post-process makes a significant difference for OneSidedMatch: the approximation ratios achieved are well beyond the bound 0.632 and are close to those of TwoSidedMatch. This observation attests to the efficiency of the whole approach.

We also extended the heuristics and their analysis to approximate maximum cardinality matchings in general (undirected) graphs. Since in the general case we have one class of vertices, the 1-out based approach corresponds to the TwoSidedMatch heuristic in the bipartite case. Theoretical analysis, which can be perceived as an analysis of Karp-Sipser on the random 1-out graph model, showed that the approximation guarantee is slightly less than 0.866. The losses with respect to the bipartite case are due to the existence of odd-length cycles. Our experiments suggest that one of the heuristics obtains results as good as Karp-Sipser while being faster and more reliable.

A more rigorous treatment and elaboration of the variants of the heuristics for general graphs seem worthwhile. Although Karp-Sipser works well for these graphs, we wonder if there are exact linear time algorithms for (1 + e^{−1})-out and 2-out graphs. We also plan to work on parallel implementations of these algorithms.


[Bar charts omitted. Panel (a): OneSidedMatch; panel (b): TwoSidedMatch. One group of bars per matrix (atmosmodl, audikw_1, cage15, channel, europe_osm, Hamrle3, hugebubbles, kkt_power, nlpkkt240, road_usa, torso1, venturiLevel3), with 0, 1, and 5 scaling iterations; horizontal lines at 0.632 and 0.866.]

Figure 4.7: Matching qualities obtained after making the matchings found by OneSidedMatch and TwoSidedMatch maximal. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations. Compare with Figures 4.2a and 4.2b, where maximality of the matchings was not guaranteed.


Chapter 5

Birkhoff-von Neumann decomposition of doubly stochastic matrices

In this chapter, we investigate the Birkhoff-von Neumann (BvN) decomposition described in Section 1.4. The main problem that we address is restated below for convenience.

MinBvNDec: A BvN decomposition with the minimum number of permutation matrices. Given a doubly stochastic matrix A, find a BvN decomposition

    A = α_1 P_1 + α_2 P_2 + · · · + α_k P_k                    (5.1)

of A with the minimum number k of permutation matrices.

We first show that the MinBvNDec problem is NP-hard. We then propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries, and we use this generalization to design preconditioners for iterative linear system solvers.

5.1 The minimum number of permutation matrices

We show in this section that the decision version of the problem is NP-complete. We first give some definitions and preliminary results.



5.1.1 Preliminaries

A multi-set can contain duplicate members. Two multi-sets are equivalent if they have the same set of members with the same number of repetitions.

Let A and B be two n × n matrices. We write A ⊆ B to denote that, for each nonzero entry of A, the corresponding entry of B is nonzero. In particular, if P is an n × n permutation matrix and A is a nonnegative n × n matrix, P ⊆ A also means that the entries of A at the positions corresponding to the nonzeros of P are positive. We use P ∘ A to denote the entrywise product of P and A, which selects the entries of A at the positions corresponding to the nonzero entries of P. We also use min{P ∘ A} to denote the minimum entry of A at the nonzero positions of P.

Let U be a set of positions of the nonzeros of A. Then U is called strongly stable [31] if, for each permutation matrix P ⊆ A, p_kl = 1 for at most one pair (k, l) ∈ U.

Lemma 5.1 (Brualdi [32]). Let A be a doubly stochastic matrix. Then, in a BvN decomposition of A, there are at least γ(A) permutation matrices, where γ(A) is the maximum cardinality of a strongly stable set of positions of A.

Note that γ(A) is no smaller than the maximum number of nonzeros in a row or a column of A, for any matrix A. Brualdi [32] shows that for any integer t with 1 ≤ t ≤ ⌈n/2⌉⌈(n + 1)/2⌉, there exists an n × n doubly stochastic matrix A such that γ(A) = t.

An n × n circulant matrix C is defined as follows. The first row of C is specified as c_1, . . . , c_n, and the ith row is obtained from the (i − 1)th one by a cyclic rotation to the right, for i = 2, . . . , n:

    C = [ c_1   c_2   · · ·   c_n
          c_n   c_1   · · ·   c_{n−1}
           ⋮                    ⋮
          c_2   c_3   · · ·   c_1 ] .

We state and prove an obvious lemma to be used later.

Lemma 5.2. Let C be an n × n positive circulant matrix whose first row is c_1, . . . , c_n. The matrix C′ = (1 / Σ_j c_j) C is doubly stochastic, and all BvN decompositions with n permutation matrices have the same multi-set of coefficients {c_i / Σ_j c_j : i = 1, . . . , n}.

Proof. Since the first row of C′ has all nonzero entries and only n permutation matrices are permitted, the multi-set of coefficients must be the entries in the first row.


In the lemma, if c_i = c_j for some i ≠ j, we will have the same value c_i/c for two different permutation matrices. As a sample BvN decomposition, let c = Σ_i c_i and consider (1/c) C = Σ_j (c_j/c) D_j, where D_j is the permutation matrix corresponding to the (j − 1)th diagonal: for j = 1, . . . , n, the matrix D_j has 1s at the positions (i, i + j − 1) for i = 1, . . . , n − j + 1 and at the positions (n − j + 1 + k, k) for k = 1, . . . , j − 1, where we assume that the second set is void for j = 1.

Note that the lemma is not concerned with the uniqueness of the BvN decomposition, which would need a unique set of permutation matrices as well. If all c_i's were distinct, this would have been true, and the set of D_j's described above would define the unique permutation matrices. Also, a more general variant of the lemma concerns the decompositions whose cardinality k is equal to the maximum cardinality of a strongly stable set. In this case too, the coefficients in the decomposition will correspond to the entries in a strongly stable set of cardinality k.

5.1.2 The computational complexity of MinBvNDec

Here we prove that the MinBvNDec problem is NP-complete in the strong sense, i.e., it remains NP-complete when the instance is represented in unary. It suffices to do a reduction from a strongly NP-complete problem to prove the strong NP-completeness [22, Section 6.6].

Theorem 5.1. The problem of deciding whether there is a Birkhoff-von Neumann decomposition of a given doubly stochastic matrix with k permutation matrices is strongly NP-complete.

Proof. It is clear that the problem belongs to NP, as it is easy to check in polynomial time that a given decomposition is equal to a given matrix.

To establish NP-completeness, we demonstrate a reduction from the well-known 3-PARTITION problem, which is NP-complete in the strong sense [93, p. 96]. Consider an instance of 3-PARTITION: given an array A of 3m positive integers and a positive integer B such that Σ_{i=1}^{3m} a_i = mB and, for all i, B/4 < a_i < B/2, does there exist a partition of A into m disjoint arrays S_1, . . . , S_m such that each S_i has three elements whose sum is B? Let I_1 denote an instance of 3-PARTITION.

We build the following instance I_2 of MinBvNDec corresponding to I_1 given above. Let

    M = [ (1/m) E_m          O
          O           (1/(mB)) C ] ,

where E_m is an m × m matrix whose entries are all 1, and C is a 3m × 3m circulant matrix whose first row is a_1, . . . , a_{3m}. It is easy to see that M is doubly stochastic. A solution of I_2 is a BvN decomposition of M with k = 3m permutations. We will show that I_1 has a solution if and only if I_2 has a solution.


Assume that I_1 has a solution with S_1, . . . , S_m. Let S_i = {a_{i1}, a_{i2}, a_{i3}} and observe that (1/(mB)) (a_{i1} + a_{i2} + a_{i3}) = 1/m. We identify three permutation matrices P_{id}, for d = 1, 2, 3, in C which contain a_{i1}, a_{i2}, and a_{i3}, respectively. We can write (1/m) E_m = Σ_{i=1}^{m} (1/m) D_i, where D_i is the permutation matrix corresponding to the (i − 1)th diagonal (described after the proof of Lemma 5.2). We prepend D_i to the three permutation matrices P_{id} for d = 1, 2, 3 and obtain three permutation matrices for M. We associate these three permutation matrices with α_{id} = a_{id}/(mB). Therefore, we can write

    M = Σ_{i=1}^{m} Σ_{d=1}^{3} α_{id} [ D_i   O
                                         O    P_{id} ]

and obtain a BvN decomposition of M with 3m permutation matricesAssume that I2 has a BvN decomposition with 3m permutation matrices

This also defines two BvN decompositions with 3m permutation matricesfor 1

mEm and 1mBC We now establish a correspondence between these two

BvNrsquos to finish the proof Since 1mBC is a circulant matrix with 3m nonzeros

in a row any BvN decomposition of it with 3m permutation matrices hasthe coefficients ai

mB for i = 1 3m by Lemma 52 Since ai + aj lt B we

haveai+ajmB lt 1m Therefore each entry in 1

mEm needs to be included in atleast three permutation matrices A total of 3m permutation matrices coversany row of 1

mEm say the first one Therefore for each entry in this row wehave 1

m = αi + αj + αk for i 6= j 6= k corresponding to three coefficientsused in the BvN decomposition of 1

mBC Note that these three indices i j kdefining the three coefficients used for one entry of Em cannot be used foranother entry in the same row This correspondence defines a partition ofthe 3m numbers αi for i = 1 3m into m groups with three elementseach where each group has a sum of 1m The corresponding three airsquos ina group therefore sums up to B and we have a solution to I1 concludingthe proof

52 A result on the polytope ofBvN decompositions

Let S(A) be the polytope of all BvN decompositions for a given dou-bly stochastic matrix A The extreme points of S(A) are the ones thatcannot be represented as a convex combination of other decompositionsBrualdi [32 pp 197ndash198] observes that any generalized Birkhoff heuristicobtains an extreme point of S(A) and predicts that there are extreme pointsof S(A) which cannot be obtained by a generalized Birkhoff heuristic Inthis section we substantiate this claim by showing an example

Any BvN decomposition of a given matrix A with the smallest numberof permutation matrices is an extreme point of S(A) otherwise the other


BvN decompositions expressing the said point would have a smaller number of permutation matrices.

Lemma 5.3. There are doubly stochastic matrices whose polytopes of BvN decompositions contain extreme points that cannot be obtained by a generalized Birkhoff heuristic.

We will prove the lemma by giving an example. We use computational tools based on a mixed integer linear programming (MILP) formulation of the problem of finding a BvN decomposition with the smallest number of permutation matrices. We first describe the MILP formulation.

Let A be a given n × n doubly stochastic matrix and Ω_n be the set of all n × n permutation matrices. There are n! matrices in Ω_n; for brevity, let us refer to these permutations as P_1, . . . , P_{n!}. We associate an incidence matrix M of size n² × n! with Ω_n. We fix an ordering of the entries of A so that each row of M corresponds to a unique entry in A. Each column of M corresponds to a unique permutation matrix in Ω_n. We set m_ij = 1 if the ith entry of A appears in the permutation matrix P_j, and set m_ij = 0 otherwise. Let ~a be the n²-vector containing the values of the entries of A in the fixed order. Let ~x be a vector of n! elements, x_j corresponding to the permutation matrix P_j. With these definitions, the MILP formulation for finding a BvN decomposition of A with the smallest number of permutation matrices can be written as

    minimize    Σ_{j=1}^{n!} s_j                                     (5.2)
    subject to  M~x = ~a ,                                           (5.3)
                1 ≥ x_j ≥ 0 ,        for j = 1, . . . , n! ,         (5.4)
                Σ_{j=1}^{n!} x_j = 1 ,                               (5.5)
                s_j ≥ x_j ,          for j = 1, . . . , n! ,         (5.6)
                s_j ∈ {0, 1} ,       for j = 1, . . . , n! .         (5.7)

In this MILP, s_j is a binary variable which is 1 only if x_j > 0, and 0 otherwise. The equality (5.3), the inequalities (5.4), and the equality (5.5) guarantee that we have a BvN decomposition of A. This MILP can be used only for small problems. In this MILP, we can exclude any permutation matrix P_j from a decomposition by setting s_j = 0.
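The following sketch illustrates this MILP for tiny matrices by enumerating Ω_n explicitly. PuLP is used here merely as one possible modeling layer (the experiments in the text call CPLEX via NEOS), and the function name is ours; the input is assumed to be doubly stochastic.

# An illustrative implementation of the MILP (5.2)-(5.7), usable only for
# tiny n since it enumerates all n! permutation matrices.
import itertools
import numpy as np
import pulp

def min_bvn_milp(A):
    n = A.shape[0]
    perms = list(itertools.permutations(range(n)))          # Omega_n
    prob = pulp.LpProblem("MinBvNDec", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x{j}", lowBound=0, upBound=1) for j in range(len(perms))]
    s = [pulp.LpVariable(f"s{j}", cat=pulp.LpBinary) for j in range(len(perms))]
    prob += pulp.lpSum(s)                                    # objective (5.2)
    for i in range(n):                                       # M x = a, eq. (5.3)
        for k in range(n):
            prob += pulp.lpSum(x[j] for j, p in enumerate(perms) if p[i] == k) == A[i, k]
    prob += pulp.lpSum(x) == 1                               # eq. (5.5); assumes A doubly stochastic
    for j in range(len(perms)):
        prob += s[j] >= x[j]                                 # eq. (5.6)
    prob.solve()
    return [(pulp.value(x[j]), perms[j])
            for j in range(len(perms)) if pulp.value(x[j]) > 1e-9]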

Proof of Lemma 5.3. Let the following 10 letters correspond to the numbers underneath:

    a   b   c   d   e    f    g    h     i     j
    1   2   4   8   16   32   64   128   256   512


Consider the following matrix, whose row sums and column sums are 1023, and hence can be considered as doubly stochastic:

    A = [ a+b   d+i   c+h   e+j   f+g
          e+g   a+c   b+i   d+f   h+j
          f+j   e+h   d+g   b+c   a+i
          d+h   b+f   a+j   g+i   c+e
          c+i   g+j   e+f   a+h   b+d ]                           (5.8)

Observe that the entries containing the term 2^ℓ form a permutation matrix, for ℓ = 0, . . . , 9. Therefore, this matrix has a BvN decomposition with 10 permutation matrices. We created the MILP above and found 10 as the smallest number of permutation matrices by calling the CPLEX solver [1] via the NEOS Server [58, 68, 100]. Hence, the described decomposition is an extreme point of S(A). None of the said permutation matrices annihilates any entry of the matrix. Therefore, at the first step, no entry of A gets reduced to zero, regardless of the order of the permutation matrices. Thus, this decomposition cannot be obtained by a generalized Birkhoff heuristic.

One can create a family of matrices of arbitrary sizes by embedding the same matrix into a larger one of the form

    B = [ 1023 · I    O
          O           A ] ,

where I is the identity matrix of the desired size. All BvN decompositions of B can be obtained by extending the permutation matrices in A's BvN decompositions with I. That is, for a permutation matrix P in a BvN decomposition of A,

    [ I   O
      O   P ]

is a permutation matrix in the corresponding BvN decomposition of B. Furthermore, all permutation matrices in an arbitrary BvN decomposition of B must have I as the principal sub-matrix, and the rest should correspond to a permutation matrix in A defining a BvN decomposition of A. Hence, the extreme point of S(B) corresponding to the extreme point of S(A) with 10 permutation matrices cannot be found by a generalized Birkhoff heuristic.

Let us investigate the matrix A and its BvN decomposition given in the proof of Lemma 5.3. Let P_a, P_b, . . . , P_j be the 10 permutation matrices of the decomposition, corresponding to a, b, . . . , j. We solve 10 MILPs, in each of which we set s_t = 0 for one t ∈ {a, b, . . . , j}. This way, we try to find a BvN decomposition of A without P_a, without P_b, and so on, always with the smallest number of permutation matrices. The smallest number of permutation matrices in these 10 MILPs was 11. This certifies that the only BvN decomposition with 10 permutation matrices necessarily contains P_a, P_b, . . . , P_j. It is easy to see that there is a unique solution to the equality M~x = ~a of


    A = [   3   264   132   528    96
           80     5   258    40   640
          544   144    72     6   257
          136    34   513   320    20
          260   576    48   129    10 ]

    (a) The sample matrix

    Coefficient:  129  511  257   63   33   15    7    3    2    2    1
    Row 1:          3    4    2    5    5    4    2    3    4    1    1
    Row 2:          5    5    3    1    4    1    4    2    1    2    3
    Row 3:          2    1    5    3    1    2    3    4    3    4    4
    Row 4:          1    3    4    4    2    5    1    5    5    3    2
    Row 5:          4    2    1    2    3    3    5    1    2    5    5

    (b) A BvN decomposition

Figure 5.1: The matrix A from Lemma 5.3 and a BvN decomposition with 11 permutation matrices, which can be obtained by a generalized Birkhoff heuristic. Each column in (b) corresponds to a permutation, where the first line gives the associated coefficient and the following lines give the column indices matched to the rows 1 to 5 of A.

the MILP with x_t = 0 for t ∉ {a, b, . . . , j}, as the submatrix of M containing only the corresponding 10 columns has full column rank.

Any generalized Birkhoff heuristic obtains at least 11 permutation matrices for the matrix A of the proof of Lemma 5.3. One such decomposition is shown in tabular format in Fig. 5.1. In Fig. 5.1a, we write the matrix of the lemma explicitly for convenience. Then, in Fig. 5.1b, we give a BvN decomposition. The column headers (the first line) in the table contain the coefficients of the permutation matrices. The nonzero column indices of the permutation matrices are stored by rows. For example, the first permutation has the coefficient 129, and columns 3, 5, 2, 1, and 4 are matched to the rows 1 to 5. The bold indices signify entries whose values are equal to the coefficients of the corresponding permutation matrices at the time the permutation matrices are found. Therefore, the corresponding entries become zero after the corresponding step. For example, in the first permutation, a_54 = 129.

The output of GreedyBvN for the matrix A is given in Fig. 5.2 for reference. It contains 12 permutation matrices.

5.3 Two heuristics

There are heuristics to compute a BvN decomposition of a given matrix A. In particular, the following family of heuristics is based on the constructive proof of Birkhoff. Let A^(0) = A. At every step j ≥ 1, find a permutation matrix P_j having its ones at the positions of the nonzero elements of A^(j−1), use the minimum nonzero element of A^(j−1) at the positions identified by P_j as α_j, set A^(j) = A^(j−1) − α_j P_j, and repeat the computations in the next step j + 1, until A^(j) becomes void. Any heuristic of this type is called a generalized Birkhoff heuristic.
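A compact sketch of a generalized Birkhoff heuristic is given below; it is illustrative only (dense NumPy, a perfect matching in the support found with SciPy) and not the implementation used in the experiments.

# A sketch of a generalized Birkhoff heuristic: repeatedly find a perfect
# matching in the support of the remaining matrix, subtract the minimum
# entry on that matching, and record the permutation.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def birkhoff_heuristic(A, tol=1e-12):
    A = np.array(A, dtype=float)
    n = A.shape[0]
    decomposition = []                        # list of (alpha_j, permutation) pairs
    while A.max() > tol:
        support = csr_matrix((A > tol).astype(np.uint8))
        perm = maximum_bipartite_matching(support, perm_type='column')
        alpha = A[np.arange(n), perm].min()   # minimum entry on the matching
        decomposition.append((alpha, perm.copy()))
        A[np.arange(n), perm] -= alpha        # peel off alpha * P_j
    return decomposition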


[Matrices omitted: the 12 permutation matrices of the decomposition, with coefficients 513, 257, 127, 63, 31, 15, 7, 3, 2, 2, 2, and 1.]

Figure 5.2: The output of GreedyBvN for the matrix given in the proof of Lemma 5.3.

Given an n times n matrix A we can associate a bipartite graph GA =(RcupCE) to it The rows of A correspond to the vertices in the set R andthe columns of A correspond to the vertices in the set C so that (ri cj) isin Eiff aij 6= 0 A perfect matching in GA or G in short is a set of n edges notwo sharing a row or a column vertex Therefore a perfect matching M inG defines a unique permutation matrix PM sube A

Birkhoffrsquos heuristic This is the original heuristic used in proving thata doubly stochastic matrix has a BvN decomposition described for exampleby Brualdi [32] Let a be the smallest nonzero of A(jminus1) and G(jminus1) bethe bipartite graph associated with A(jminus1) Find a perfect matching M inG(jminus1) containing a set αj larr a and Pj larr PM

GreedyBvN heuristic At every step j among all perfect matchings inG(jminus1) find one whose minimum element is the maximum That is find aperfect matching M in G(jminus1) where minPM A(jminus1) is the maximumThis ldquobottleneckrdquo matching problem is polynomial time solvable with forexample MC64 [72] In this greedy approach αj is the largest amount wecan subtract from a row and a column of A(jminus1) and we hope to obtain asmall k

Algorithm 11: GreedyBvN: A greedy heuristic for constructing a BvN decomposition

Data: A: a doubly stochastic matrix
Result: a BvN decomposition

1: k ← 0
2: while nnz(A) > 0 do
3:     k ← k + 1
4:     P_k ← the pattern of a bottleneck perfect matching M in A
5:     α_k ← min{P_k ∘ A}
6:     A ← A − α_k P_k
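For illustration, the bottleneck matching of line 4 can be realized by a search over the distinct values of A: the largest threshold for which the entries above it still contain a perfect matching gives a bottleneck matching. MC64, used in the text, computes this directly, so the following sketch only conveys the idea; the helper names are ours.

# A sketch of GreedyBvN with the bottleneck matching found by threshold search.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def bottleneck_perfect_matching(A):
    n = A.shape[0]
    values = np.unique(A[A > 0])
    lo, hi, best = 0, len(values) - 1, None
    while lo <= hi:                            # binary search on the threshold
        mid = (lo + hi) // 2
        keep = csr_matrix((A >= values[mid]).astype(np.uint8))
        perm = maximum_bipartite_matching(keep, perm_type='column')
        if (perm >= 0).all():                  # a perfect matching still exists
            best, lo = perm.copy(), mid + 1
        else:
            hi = mid - 1
    return best                                # columns matched to rows 0..n-1

def greedy_bvn(A, tol=1e-12):
    A = np.array(A, dtype=float)
    n, decomposition = A.shape[0], []
    while A.max() > tol:
        perm = bottleneck_perfect_matching(A)
        alpha = A[np.arange(n), perm].min()
        decomposition.append((alpha, perm))
        A[np.arange(n), perm] -= alpha
    return decomposition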


A^(0) =                 A^(1) =                 A^(2) =
1 4 1 0 0 0             0 4 1 0 0 0             0 4 0 0 0 0
0 1 4 1 0 0             0 1 3 1 0 0             0 0 3 1 0 0
0 0 1 4 1 0             0 0 1 3 1 0             0 0 1 2 1 0
0 0 0 1 4 1             0 0 0 1 3 1             0 0 0 1 2 1
1 0 0 0 1 4             1 0 0 0 1 3             1 0 0 0 1 2
4 1 0 0 0 1             4 0 0 0 0 1             3 0 0 0 0 1

A^(3) =                 A^(4) =                 A^(5) =
0 3 0 0 0 0             0 2 0 0 0 0             0 1 0 0 0 0
0 0 3 0 0 0             0 0 2 0 0 0             0 0 1 0 0 0
0 0 0 2 1 0             0 0 0 2 0 0             0 0 0 1 0 0
0 0 0 1 1 1             0 0 0 0 1 1             0 0 0 0 1 0
1 0 0 0 1 1             1 0 0 0 1 0             1 0 0 0 0 0
2 0 0 0 0 1             1 0 0 0 0 1             0 0 0 0 0 1

Figure 5.3: A sample matrix showing that the original Birkhoff heuristic can obtain a BvN decomposition with n permutation matrices, while the optimal one has 3.

Analysis of the two heuristics

As we will see in Section 5.4, GreedyBvN can obtain a much smaller number of permutation matrices than the Birkhoff heuristic. Here we compare these two heuristics theoretically. First, we show that the original Birkhoff heuristic does not have any constant ratio approximation guarantee. Furthermore, for an n × n matrix, its worst-case approximation ratio is Ω(n).

We begin with a small example, shown in Fig. 5.3. We decompose a 6 × 6 matrix A^(0) which has an optimal BvN decomposition with three permutation matrices: the main diagonal, the one containing the entries equal to 4, and the one containing the remaining entries. For simplicity, we used integer values in our example. However, since the row and column sums of A^(0) are equal to 6, it can be converted to a doubly stochastic matrix by dividing all the entries by 6. Instead of the optimal decomposition, the figure shows the permutation matrices obtained as the original Birkhoff heuristic would: each subtracted permutation contains the minimum possible value, 1.

In the following, we show how to generalize the idea to obtain matrices of arbitrarily large size with three permutation matrices in an optimal decomposition, while the original Birkhoff heuristic obtains n permutation


matrices

Lemma 5.4. The worst-case approximation ratio of the Birkhoff heuristic is Ω(n).

Proof. For any given integer n ≥ 3, we show that there is a matrix of size n × n whose optimal BvN decomposition has 3 permutation matrices, whereas the Birkhoff heuristic obtains a BvN decomposition with exactly n permutation matrices. The example in Fig. 5.3 is a special case, for n = 6, of the following construction process.

Let f : {1, 2, . . . , n} → {1, 2, . . . , n} be the function f(x) = (x mod n) + 1. Given a matrix M, let M′ = F(M) be another matrix containing the same set of entries, where the function f(·) is used on the coordinate indices to redistribute the entries of M on M′. That is, m_ij = m′_{f(i)f(j)}. Since f(·) is one-to-one and onto, if M is a permutation matrix, then F(M) is also a permutation matrix. We will start with a permutation matrix and run it through F for n − 1 times to obtain n permutation matrices, which are all different. By adding these permutation matrices, we will obtain a matrix A whose optimal BvN decomposition has three permutation matrices, while the n permutation matrices used to create A correspond to a decomposition that can be obtained by the Birkhoff heuristic.

Let P_1 be the permutation matrix whose ones, which are partitioned into three sets, are at the positions

    (1, 1) ,    (n, 2) ,    (2, 3), (3, 4), . . . , (n − 1, n) ,          (5.9)
    [1st set]   [2nd set]   [3rd set]

Let us use F(·) to generate a matrix sequence P_i = F(P_{i−1}) for 2 ≤ i ≤ n. For example, P_2's nonzeros are at the positions

    (2, 2) ,    (1, 3) ,    (3, 4), (4, 5), . . . , (n, 1) .
    [1st set]   [2nd set]   [3rd set]

We then add the P_i's to build the matrix

    A = P_1 + P_2 + · · · + P_n .

We have the following observations about the nonzero elements of A:

1. a_ii = 1 for all i = 1, . . . , n, and only P_i has a one at the position (i, i). These elements come from the first set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(1).

2. a_ij = 1 for all i = 1, . . . , n and j = ((i + 1) mod n) + 1, and only P_h, where h = (i mod n) + 1, has a one at the position (i, j). These elements come from the second set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(2).


3 aij = n minus 2 for all i = 1 n and j = (i mod n) + 1 where allP` for ` isin 1 n i j have a one at the position aij Theseelements are from the third set of positions of the permutation matri-ces as identified in (59) When put together these n entries form apermutation matrix P(3) multiplied by the scalar (nminus 2)

In other words we can write

A = P(1) + P(2) + (nminus 2) middotP(3)

and see that A has a BvN decomposition with three permutation matrices. We note that each row and column of A contains three nonzeros, and hence three is the smallest number of permutation matrices in a BvN decomposition of A.

Since the minimum element in A is 1 and each Pi contains one such element, the Birkhoff heuristic can obtain a decomposition using Pi for i = 1, . . . , n. Therefore, the Birkhoff heuristic's approximation ratio is no better than n/3, which can be made arbitrarily large.

We note that GreedyBvN will optimally decompose the matrix A used in the proof above.
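As an illustration, the following is a small, hypothetical NumPy sketch (the function name build_worst_case is ours) that builds the matrix A of the above proof for a given n by generating P1 and repeatedly applying the index map f; the result has exactly three nonzeros in every row and column, with values 1, 1, and n − 2.

    import numpy as np

    def build_worst_case(n):
        # Build A = P1 + ... + Pn from the proof of Lemma 5.4 (sketch).
        f = lambda x: (x % n) + 1                      # 1-based map f(x) = (x mod n) + 1
        # positions of the ones of P1, as in (5.9), using 1-based indices
        ones = [(1, 1), (n, 2)] + [(j, j + 1) for j in range(2, n)]
        A = np.zeros((n, n), dtype=int)
        for _ in range(n):                             # accumulate P1, P2 = F(P1), ..., Pn
            for (i, j) in ones:
                A[i - 1, j - 1] += 1
            ones = [(f(i), f(j)) for (i, j) in ones]
        return A

    A = build_worst_case(6)
    print(A)                                           # entries are 1's and 4's for n = 6
    print(A.sum(axis=0), A.sum(axis=1))                # all row/column sums equal n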

We now analyze the theoretical properties of GreedyBvN. We first give a lemma identifying a pattern in its output.

Lemma 5.5. The heuristic GreedyBvN obtains α1, . . . , αk in a non-increasing order α1 ≥ · · · ≥ αk, where αj is obtained at the jth step.

We observed through a series of experiments that the performance of GreedyBvN depends on the values in the matrix, and there seems to be no constant factor approximation [74].

We create a set of n × n matrices. To do that, we first fix a set of z permutation matrices C1, . . . , Cz of size n × n. These permutation matrices, with varying values of α, will be used to generate the matrices. The matrices are parameterized by the subscript i, and each Ai is created as follows: Ai = α1 · C1 + α2 · C2 + · · · + αz · Cz, where each αj for j = 1, . . . , z is a randomly chosen integer in the range [1, 2^i], and we also set a randomly chosen αj equal to 2^i to guarantee the existence of at least one large value, even in the unlikely case that all other values are not large enough. As can be seen, each Ai has the same structure and differs from the rest only in the values of the αj's that are chosen. As a consequence, they all can be decomposed by the same set of permutation matrices.
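A small, hypothetical sketch of this generator (NumPy-based; random_instance is our name, and the permutation matrices are drawn uniformly at random):

    import numpy as np

    def random_instance(n, z, i, seed=0):
        # Create A_i = sum_j alpha_j C_j with alpha_j integers in [1, 2^i] (sketch).
        rng = np.random.default_rng(seed)
        perms = [np.eye(n, dtype=np.int64)[rng.permutation(n)] for _ in range(z)]
        alphas = rng.integers(1, 2**i + 1, size=z)
        alphas[rng.integers(z)] = 2**i                 # force at least one large coefficient
        A = sum(a * P for a, P in zip(alphas, perms))
        return A / alphas.sum()                        # all row/column sums equal sum(alphas)

    A = random_instance(n=30, z=20, i=10)              # doubly stochastic by construction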

We present our results in two sets of experiments, shown in Table 5.1, for two different n. We create five random Ai for each i ∈ {10, 20, 30, 40, 50}; that is, we have five matrices for each value of the parameter i, and there are five different i. We have n = 30 and z = 20 in the first set, and n = 200 and z = 100 in the second set.


n = 30 and z = 20
                average          worst case
   i        ki      ki/z       ki      ki/z
  10        59      2.99       63      3.15
  20       105      5.29      110      5.50
  30       149      7.46      158      7.90
  40       184      9.23      191      9.55
  50       212     10.62      227     11.35

n = 200 and z = 100
                average          worst case
   i        ki      ki/z       ki      ki/z
  10       268      2.69      280      2.80
  20       487      4.88      499      4.99
  30       716      7.16      726      7.26
  40       932      9.33      947      9.47
  50      1124     11.25     1162     11.62

Table 5.1: Experiments showing the dependence of the performance of GreedyBvN on the values of the matrix elements. n is the matrix size; i ∈ {10, 20, 30, 40, 50} is the parameter for creating matrices using αj ∈ [1, 2^i]; z is the number of permutation matrices used in creating Ai. For a given pair of n and i, five experiments are run, and GreedyBvN obtains ki permutation matrices in each. The average and the maximum number of permutation matrices obtained by GreedyBvN over the five random instances are given. ki/z is a lower bound on the performance of GreedyBvN, as z ≥ Opt.

Let ki be the number of permutation matrices GreedyBvN obtains for a given Ai. The table gives the average and the maximum ki over the five different Ai for a given n and i pair. By construction, each Ai has a BvN decomposition with z permutation matrices. Since z is no smaller than the optimal value, the ratio ki/z gives a lower bound on the performance of GreedyBvN. As seen from the experiments, as i increases, the performance of GreedyBvN gets increasingly worse. This shows that a constant ratio worst-case approximation of GreedyBvN is unlikely. While the performance depends on z (for example, for small z, GreedyBvN is likely to obtain near optimal decompositions), it seems that the size of the matrix does not largely affect the relative performance of GreedyBvN.

Now we attempt to explain the above results theoretically.

Lemma 5.6. Let α1P1 + · · · + αkPk be a BvN decomposition of a given doubly stochastic matrix A with the smallest number k of permutation matrices. Then, for any BvN decomposition of A with ℓ ≥ k permutation matrices, we have ℓ ≤ k · (maxi αi)/(mini αi). If the coefficients are integers (e.g., when A is a matrix with constant row and column sums of integral values), we have ℓ ≤ k · maxi αi.

Proof. Consider a BvN decomposition α1P1 + · · · + αℓPℓ with ℓ ≥ k. Assume, without loss of generality, that α1 ≥ · · · ≥ αk and α1 ≥ · · · ≥ αℓ. We know that the coefficients of these two decompositions sum up to the same value; that is,

    ∑_{i=1}^{ℓ} αi = ∑_{i=1}^{k} αi .


Since αℓ is the smallest of the α's and α1 is the largest of the α's, we have

    ℓ · αℓ ≤ k · α1 ,

and hence ℓ/k ≤ α1/αℓ. By assuming integer values, we see that αℓ ≥ 1, and thus ℓ ≤ k · maxi αi.

This lemma evaluates the approximation guarantee of a given BvN decomposition. It does not seem very useful, because even if we have mini αi, we do not have maxi αi. Luckily, we can say more in the case of GreedyBvN.

Corollary 5.1. Let k be the smallest number of permutation matrices in a BvN decomposition of a given doubly stochastic matrix A. Let α1 and αℓ be the first and the last coefficients obtained by the heuristic GreedyBvN for decomposing A. Then ℓ ≤ k · α1/αℓ.

Proof. This is easy to see, as GreedyBvN obtains the coefficients in a non-increasing order (see Lemma 5.5), and α1 ≥ αj for all 1 ≤ j ≤ k, for any BvN decomposition containing αj.

Lemma 5.6 and Corollary 5.1 give a posteriori estimates of the performance of the heuristic GreedyBvN, in that one looks at the decomposition and tells how good it is. This can potentially reveal a good performance. For example, when GreedyBvN obtains a BvN decomposition with all coefficients equal, then we know that it is an optimal BvN decomposition. The same cannot be said for the Birkhoff heuristic, though (consider the example preceding Lemma 5.4). We also note that the ratio given in Corollary 5.1 should usually be much larger than the practical performance.

5.4 Experiments

We present results with the two heuristics. We note that the original Birkhoff heuristic was not concerned with the minimality of the number of permutation matrices. Therefore, the presented results are not for comparing the two heuristics, but for giving results with what is available. We give results on two different sets of matrices. The first set contains real-world sparse matrices preprocessed to be doubly stochastic. The second set contains a few randomly created dense doubly stochastic matrices.


                                                    Birkhoff             GreedyBvN
 matrix        n      τ        dmax   dev        Σ αi      k         Σ αi      k
 aft01         8205   125567     21   0.0e+00    0.160     2000      1.000     120
 aft02         8184   127762     82   1.0e-06    0.000     1434      1.000     234
 barth         6691    46187     13   0.0e+00    0.160     2000      1.000      71
 barth4        6019    40965     13   0.0e+00    0.140     2000      1.000      61
 bcspwr10      5300    21842     14   1.0e-06    0.380     2000      1.000      63
 bcsstk38      8032   355460    614   0.0e+00    0.000     2000      1.000     592*
 benzene       8219   242669     37   0.0e+00    0.000     2000      1.000     113
 c-29          5033    43731    481   1.0e-06    0.000     2000      1.000     870
 EX5           6545   295680     48   0.0e+00    0.020     2000      1.000     229
 EX6           6545   295680     48   1.0e-06    0.030     2000      1.000     226
 flowmeter0    9669    67391     11   0.0e+00    0.510     2000      1.000      58
 fv1           9604    85264      9   0.0e+00    0.620     2000      1.000      50
 fv2           9801    87025      9   0.0e+00    0.620     2000      1.000      52
 fxm3_6        5026    94026    129   1.0e-06    0.130     2000      1.000     383
 g3rmt3m3      5357   207695     48   1.0e-06    0.050     2000      1.000     223
 Kuu           7102   340200     98   0.0e+00    0.000     2000      1.000     330
 mplate        5962   142190     36   0.0e+00    0.030     2000      1.000     153
 n3c6-b7       6435    51480      8   0.0e+00    1.000        8      1.000       8
 nemeth02      9506   394808     52   0.0e+00    0.000     2000      1.000     109
 nemeth03      9506   394808     52   0.0e+00    0.000     2000      1.000     115
 olm5000       5000    19996      6   1.0e-06    0.750      283      1.000      14
 s1rmq4m1      5489   262411     54   0.0e+00    0.000     2000      1.000     211
 s2rmq4m1      5489   263351     54   0.0e+00    0.000     2000      1.000     208
 SiH4          5041   171903    205   0.0e+00    0.000     2000      1.000     574
 t2d_q4        9801    87025      9   2.0e-06    0.500     2000      1.000      54

Table 5.2: Birkhoff's heuristic and GreedyBvN on sparse matrices from the UFL collection. The column τ contains the number of nonzeros in a matrix. The column dmax contains the maximum number of nonzeros in a row or a column, setting up a lower bound for the number k of permutation matrices in a BvN decomposition. The column "dev" contains the maximum deviation of a row/column sum of a matrix A from 1, in other words, the value max{‖A1 − 1‖∞, ‖1ᵀA − 1ᵀ‖∞}, reported to six significant digits. The two heuristics are run to obtain at most 2000 permutation matrices or until they accumulate a sum of at least 0.9999 with the coefficients. In one matrix (marked with *), GreedyBvN obtained this number in less than dmax permutation matrices; increasing the limit to 0.999999 made it return with 908 permutation matrices.


The first set of matrices was created as follows. We selected all matrices with the following properties from the SuiteSparse Matrix Collection (UFL) [59]: square, with at least 5000 and at most 10000 rows, fully indecomposable, and with at most 50 and at least 2 nonzeros per row. This gave a set of 66 matrices. These 66 matrices are from 30 different groups of problems; we chose at most two per group to remove any bias that might arise from the group. This resulted in 32 matrices, which we preprocessed as follows to obtain a set of doubly stochastic matrices. We first took the absolute values of the entries to make the matrices nonnegative. Then, we scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of 10^-6 and at most 1000 iterations. With this setting, the scaling algorithm tries to get the maximum deviation of a row/column sum from 1 to be less than 10^-6 within 1000 iterations. In 7 matrices, the deviation was larger than 10^-4. We deemed this too big a deviation from a doubly stochastic matrix and discarded those matrices. At the end, we had 25 matrices, given in Table 5.2.

We run the two heuristics for the BvN decomposition to obtain at most 2000 permutation matrices or until they accumulate a sum of at least 0.9999 with the coefficients. The accumulated sum of coefficients ∑_{i=1}^{k} αi and the number of permutation matrices k for the Birkhoff and GreedyBvN heuristics are given in Table 5.2. As seen in this table, GreedyBvN obtains a much smaller number of permutation matrices than Birkhoff's heuristic, except for the matrix n3c6-b7. This is a special matrix with eight nonzeros in each row and column, where all nonzeros are 1/8; both heuristics find the same (minimum) number of permutation matrices. We note that the geometric mean of the ratios k/dmax is 3.4 for GreedyBvN.

The second set of matrices was created as follows. For n ∈ {100, 200, 300}, we created five matrices of size n × n whose entries are randomly chosen integers between 1 and 100 (we performed tests with integers between 1 and 20, and the results were close to what we present here). We then scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of 10^-6 and at most 1000 iterations. We run the two heuristics for the BvN decomposition such that at most n^2 − 2n + 2 permutation matrices are found or the total value of the coefficients is larger than 0.9999. We then took the average over the five instances with the same n. The results are in Table 5.3. In this table, again, we see that GreedyBvN obtains a much smaller k than Birkhoff's heuristic.

The experimental observation that GreedyBvN performs better than the Birkhoff heuristic is not surprising, for two reasons. First, as stated before, the original Birkhoff heuristic is not concerned with the number of permutation matrices. Second, the heuristic GreedyBvN is an adaptation of the Birkhoff heuristic designed to obtain large reductions at each step.


                 Birkhoff                GreedyBvN
    n       Σ αi        k           Σ αi        k
  100       0.99      9644          1.00       388
  200       0.99     39208          1.00       717
  300       1.00     88759          1.00      1042

Table 5.3: Birkhoff's heuristic and GreedyBvN on random dense matrices. The maximum deviation of a row/column sum of a matrix A from 1, that is, max{‖A1 − 1‖∞, ‖1ᵀA − 1ᵀ‖∞}, was always less than 10^-6. For each n, the results are the averages over five different instances.

5.5 A generalization of the Birkhoff-von Neumann theorem

We have exploited the BvN decomposition in the context of preconditioning for solving linear systems [23]. As the decomposition is defined for nonnegative matrices, we needed a generalization of it to cover all real matrices. To motivate the generalized decomposition theorem, we give a brief summary of the preconditioner.

5.5.1 Doubly stochastic splitting

For a given nonnegative matrix A, we first preprocess it to get a doubly stochastic matrix (whose row and column sums are one). Then, using this doubly stochastic matrix, we select some fraction of some of the nonzeros of A to be included in the preconditioner. The selection is done by taking a set of permutation matrices along with their coefficients in a BvN decomposition of the given matrix.

Assume we have a BvN decomposition of A as shown in (5.1). Then, one can pick an integer r between 1 and k − 1 and split A as

    A = M − N,                                                      (5.10)

where

    M = α1P1 + · · · + αrPr,    N = −αr+1Pr+1 − · · · − αkPk.        (5.11)

Note that M and −N are doubly substochastic matrices.

Definition 5.1. A splitting of the form (5.10), with M and N given by (5.11), is said to be a doubly substochastic splitting.

Definition 5.2. A doubly substochastic splitting A = M − N of a doubly stochastic matrix A is said to be standard if M is invertible. We will call such a splitting an SDS splitting.


In general, it is not easy to guarantee that a given doubly substochastic splitting is standard, except for some trivial situations such as the case r = 1, in which case M is always invertible. We also have a characterization for invertible M when r = 2.

Theorem 5.2. A sufficient condition for M = ∑_{i=1}^{r} αiPi to be invertible is that one of the αi with 1 ≤ i ≤ r be greater than the sum of the remaining ones.

Let A = M − N be an SDS splitting of A and consider the stationary iterative scheme

    xk+1 = Hxk + c,    H = M⁻¹N,    c = M⁻¹b,                        (5.12)

where k = 0, 1, . . ., and x0 is arbitrary. As seen in (5.12), M⁻¹ or solvers for My = z are required. As is well known, the scheme (5.12) converges to the solution of Ax = b for any x0 if and only if ρ(H) < 1. Hence, we are interested in conditions that guarantee that the spectral radius of the iteration matrix

    H = M⁻¹N = −(α1P1 + · · · + αrPr)⁻¹(αr+1Pr+1 + · · · + αkPk)

is strictly less than one. In general, this problem appears to be difficult. We have a necessary condition (Theorem 5.3) and a sufficient condition (Theorem 5.4), which is simple but restrictive.

Theorem 5.3. For the splitting A = M − N, with M = ∑_{i=1}^{r} αiPi and N = −∑_{i=r+1}^{k} αiPi, to be convergent, it must hold that ∑_{i=1}^{r} αi > ∑_{i=r+1}^{k} αi.

Theorem 5.4. Suppose that one of the αi appearing in M is greater than the sum of all the other αi. Then ρ(M⁻¹N) < 1, and the stationary iterative method (5.12) converges for all x0 to the unique solution of Ax = b.

We discuss how to build preconditioners meeting the sufficiency conditions. Since M itself is doubly stochastic (up to a scalar ratio), we can apply splitting recursively on M and obtain a special solver. Our motivation is that the preconditioner M⁻¹ can be applied to vectors via a number of highly concurrent steps, where the number of steps is controlled by the user. Therefore, the preconditioners (or the splittings) can be advantageous for use in many-core computing systems. In the context of splittings, the application of N to vectors also enjoys the same property. These motivations are shared by recent work on ILU preconditioners, where their fine-grained computation [48] and approximate application [12] are investigated for GPU-like systems.
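To make the splitting concrete, the following hypothetical NumPy sketch (the names apply, solve_M, and splitting_iteration are ours) carries out the iteration (5.12) when M and N are kept as lists of coefficient/permutation pairs; M⁻¹ is applied by an inner splitting iteration that only uses permuted vector copies, which is the highly concurrent operation referred to above. It assumes the convergence conditions of Theorems 5.3 and 5.4 hold, and that the terms of N carry the negative coefficients of (5.11).

    import numpy as np

    def apply(perms_alphas, x):
        # Compute (sum_j alpha_j P_j) x, where each P_j is stored as an index
        # vector p with (P_j x)[i] = x[p[i]].
        return sum(a * x[p] for a, p in perms_alphas)

    def solve_M(M_terms, z, inner=50):
        # Solve M y = z for M = alpha_1 P_1 + R by the inner splitting
        # alpha_1 P_1 y_new = z - R y_old, which converges when alpha_1
        # dominates the remaining coefficients (Theorem 5.4).
        (a1, p1), rest = M_terms[0], M_terms[1:]
        pinv = np.argsort(p1)              # applies P_1^T, the inverse permutation
        y = np.zeros_like(z)
        for _ in range(inner):
            y = (z - apply(rest, y))[pinv] / a1
        return y

    def splitting_iteration(M_terms, N_terms, b, x0, iters=100):
        # Stationary scheme x_{k+1} = M^{-1} (N x_k + b), cf. (5.12).
        x = x0
        for _ in range(iters):
            x = solve_M(M_terms, apply(N_terms, x) + b)
        return x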


Preconditioner construction

It is desirable to have a small number k in the Birkhoff-von Neumann decomposition while designing the preconditioner; this was our motivation to investigate the MinBvNDec problem. This is because, if we use splitting, then k determines the number of steps in which we compute the matrix-vector products. If we do not use all k permutation matrices, having a few with large coefficients should help to design the preconditioner. As stated in Lemma 5.5, GreedyBvN (Algorithm 11) obtains the coefficients in a non-increasing order. We can therefore use GreedyBvN to build an M such that it satisfies the sufficiency condition presented in Theorem 5.4. That is, we can have M with α1/(∑_{i=1}^{k} αi) > 1/2, and hence M⁻¹ can be applied with splitting iterations. For this, we start by initializing M to α1P1. Then, when a coefficient αj with j ≥ 2 is obtained at Line 5 of Algorithm 11, we check if α1 will still be larger than the sum of the other coefficients used in building M when we add αjPj to M. If so, we add αjPj to M and continue. In practice, we iterate the while loop until k is around 10 and collect the αj's as described above. Experiments on a set of challenging problems [23] show that this preconditioner is more effective than ILU(0).

5.5.2 Birkhoff-von Neumann decomposition for arbitrary matrices

In order to apply the preconditioners to all fully indecomposable sparse matrices, we generalized the BvN decomposition as follows.

Theorem 5.5. Any (real) matrix A with total support can be written as a convex combination of a set of signed, scaled permutation matrices.

Proof. Let B = abs(A), and consider R and C making RBC doubly stochastic, which has a Birkhoff-von Neumann decomposition RBC = ∑_{i=1}^{k} αiPi. Then, RAC can be expressed as

    RAC = ∑_{i=1}^{k} αiQi ,

where Qi = [q^(i)_jk]_{n×n} is obtained from Pi = [p^(i)_jk]_{n×n} as follows:

    q^(i)_jk = sign(ajk) p^(i)_jk .

The scaling matrices can be inverted to express A.
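A small, hypothetical sketch of the sign-transfer step in this proof (the name signed_bvn_terms is ours; it assumes the scaling and a BvN decomposition of R·abs(A)·C are already available, e.g., from a Sinkhorn-Knopp scaling followed by any BvN heuristic):

    import numpy as np

    def signed_bvn_terms(A, perms_alphas):
        # Given a BvN decomposition of R*abs(A)*C as (alpha_i, p_i) pairs with
        # P_i[j, p_i[j]] = 1, build the signed permutation matrices Q_i of the
        # proof of Theorem 5.5, so that R*A*C = sum_i alpha_i Q_i.
        n = A.shape[0]
        terms = []
        for a, p in perms_alphas:
            Q = np.zeros((n, n))
            for j in range(n):
                Q[j, p[j]] = np.sign(A[j, p[j]])   # q_jk = sign(a_jk) * p_jk
            terms.append((a, Q))
        return terms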

De Werra [62] presents another generalization of the Birkhoff-von Neumann theorem to arbitrary matrices. In De Werra's generalization, instead of permutation matrices, signed integral matrices are used to decompose the given matrix. The entries of the integral matrices used in the decomposition have the same sign as the corresponding entries in the original matrix, but there may be multiple nonzeros in a row or column. De Werra discusses flow-based algorithms to obtain the decompositions. While this decomposition is costlier to obtain (general flow computations are usually costlier than computing perfect matchings), it is also more general in that it can handle rectangular matrices as well.

5.6 Summary, further notes, and references

We investigated the problem of obtaining a Birkhoff-von Neumann decomposition of doubly stochastic matrices with the minimum number of permutation matrices. We showed four results regarding this problem. First, obtaining a BvN decomposition with the minimum number of permutation matrices is NP-hard. Second, there are matrices whose decompositions with the smallest number of permutation matrices cannot be obtained by any Birkhoff-like heuristic. Third, the worst-case approximation ratio of the original Birkhoff heuristic is Ω(n). On a more practical side, we proposed a natural greedy heuristic (GreedyBvN) and presented experimental results which demonstrated that this heuristic delivers results that are not far from a trivial lower bound. We theoretically investigated GreedyBvN's performance and observed that its performance depends on the values of the matrix elements; we showed a bound using the first and the last coefficients found by the heuristic. This chapter also included a generalization of the Birkhoff-von Neumann theorem, in which we are able to express any real matrix with total support as a convex combination of scaled, signed permutation matrices.

The shown bound for the performance of GreedyBvN is expected to be much larger than what one observes in practice, as the bound can even be larger than the upper bound on the number of permutation matrices. A tighter analysis should be possible to explain the practical performance of the GreedyBvN heuristic.

The heuristics we considered in this chapter were based on finding a perfect matching. Koopmans and Beckmann [143] present another constructive proof of the Birkhoff theorem. This constructive proof splits the matrix as A = βA1 + (1 − β)A2, where 0 < β < 1, and A1 and A2 are doubly stochastic matrices. A related proof is also given by Horn and Johnson [118, Theorem 8.7.1], where β = 1/2. We review these two constructive proofs.

Let C be a cycle of length 2ℓ, with ℓ vertices in each part of the bipartite graph: vertices ri, ri+1, . . . , ri+ℓ−1 on one side and ci, ci+1, . . . , ci+ℓ−1 on the other side. Let us imagine the cycle drawn on the plane so that we have horizontal edges of the form (rj, cj) for j = i, . . . , i + ℓ − 1, and slanted edges of the form (ri, ci+ℓ−1) and (rj, cj−1) for j = i + 1, . . . , i + ℓ − 1.


Figure 5.4: A small example of the Koopmans and Beckmann decomposition approach on a cycle with vertices ri, rj, ci, cj: (a) −ε1 on the horizontal edges and +ε1 on the slanted edges, with ε1 = min{aii, ajj}; (b) +ε2 on the horizontal edges and −ε2 on the slanted edges, with ε2 = min{aij, aji}.

A small example is shown in Figure 5.4. Let ε1 and ε2 be the minimum value of a horizontal and of a slanted edge, respectively. Koopmans and Beckmann proceed as follows to obtain a BvN decomposition. Consider a matrix C1 which contains −ε1 at the positions corresponding to the horizontal edges of C, ε1 at the positions corresponding to the slanted edges of C, and zeros elsewhere. Then define A1 = A + C1 (see Fig. 5.4a). It is easy to see that each element of A1 is between 0 and 1, and that the row and column sums are the same as in A. Therefore, A1 is doubly stochastic. Similarly, consider C2, which contains ε2 at the positions corresponding to the horizontal edges of C, −ε2 at the positions corresponding to the slanted edges of C, and zeros elsewhere. Then define A2 = A + C2 (see Fig. 5.4b) and observe that A2 is also doubly stochastic. By simple arithmetic, we can set β = ε2/(ε1 + ε2) and write A = βA1 + (1 − β)A2. The observation that A1 and A2 contain at least one more zero than A can then be used in an inductive hypothesis to construct a decomposition by recursively decomposing A1 and A2.

At first sight, the constructive approach of Koopmans and Beckmann does not lead to an efficient algorithm. This is because A1 and A2 can have significant overlap, and hence many permutation matrices can be shared by A1 and A2, yielding a combinatorially large space requirement or a very high run time. A variant of this algorithm needs to be explored. First, note that we can find a cycle and manipulate the weights by decreasing them along the horizontal edges and increasing them along the slanted edges (not necessarily by annihilating the smallest weight) to create another matrix. Then, we can take the modified matrix and manipulate another cycle. At one point, one can decide to stop and create A1; this way, with a suitable choice of coefficients, the decomposition can proceed. At second sight, then, one observes that the Koopmans and Beckmann method calls for creative ideas to develop efficient and effective heuristics for BvN decompositions.

Horn and Johnson, on the other hand, set ε to the smallest of ε1 and ε2. Without loss of generality, let ε1 be the smallest. Then let C be a matrix whose entries are −1, 0, 1, corresponding to the negative, zero, and positive entries of C1. Then let A1 = A + εC and A2 = A − εC. Both of these matrices have nonnegative entries less than or equal to one and have the same row and column sums as A. We can then write A = 0.5A1 + 0.5A2 and recurse on A1 and A2 to obtain a decomposition (which stops when we do not have any cycles).

Inspired by the approaches summarized above, we can optimally decompose the matrix (5.8) used in showing that there are optimal decompositions which cannot be obtained by Birkhoff-like heuristics. Let

       [ a+b    d      c      e      0    ]
       [ e      a+c    b      d      0    ]
  A1 = [ 0      e      d      b+c    a    ]
       [ d      b      a      0      c+e  ]
       [ c      0      e      a      b+d  ]

and

       [ a+b      d+2·i    c+2·h    e+2·j    2·f+2·g ]
       [ e+2·g    a+c      b+2·i    d+2·f    2·h+2·j ]
  A2 = [ 2·f+2·j  e+2·h    d+2·g    b+c      a+2·i   ]
       [ d+2·h    b+2·f    a+2·j    2·g+2·i  c+e     ]
       [ c+2·i    2·g+2·j  e+2·f    a+2·h    b+d     ]

and observe that A = 0.5A1 + 0.5A2. We can then use the permutations containing the coefficients a, b, c, d, e to decompose A1 optimally with five permutation matrices using GreedyBvN (in decreasing order: first e, then d, then c, and so on). While decomposing A2, we need to use the permutation matrices found for A1, with a careful selection of the coefficients (instead of a + b we use a, for example, for the permutation associated with a). Once we have consumed those five permutations, we can again resort to GreedyBvN to decompose the remaining matrix optimally with the five permutation matrices associated with f, g, h, i, and j. In this way, we have an optimal decomposition for the matrix (5.8) by applying GreedyBvN to A1 and carefully decomposing A2. We note that neither A1 nor A2 has the same sum as A, in contrast to the two approaches above. Can we systematize this line of reasoning to develop an effective heuristic to obtain a BvN decomposition?


Chapter 6

Matchings in hypergraphs

In this chapter, we investigate the maximum matching problem in d-partite, d-uniform hypergraphs, described in Section 1.3. The formal problem definition is repeated below for convenience.

Maximum matching in d-partite, d-uniform hypergraphs: Given a d-partite, d-uniform hypergraph, find a matching of maximum size.

We investigate effective heuristics for the problem above. This problem has been studied mostly in the context of local search algorithms [119], and the best known algorithm is due to Cygan [56], who provides a ((d + 1 + ε)/3)-approximation, building on previous work [57, 106]. It is NP-hard to approximate the 3-partite case (Max-3-DM) within 98/97 [24]. Similar bounds exist for higher dimensions: the hardness of approximation for d = 4, 5, and 6 is shown to be 54/53 − ε, 30/29 − ε, and 23/22 − ε, respectively [109].

We propose five heuristics for the above problem. The first two heuristics are adaptations of the Greedy [78] and Karp-Sipser [128] heuristics used in finding matchings in graphs (they are summarized in Chapter 4). The proposed generalizations are referred to as Greedy-H and Karp-Sipser-H. Greedy-H traverses the hyperedge list in random order and adds an edge to the matching whenever possible. Karp-Sipser-H introduces certain rules to Greedy-H to improve the cardinality. The third heuristic is inspired by a recent scaling-based approach proposed for the maximum cardinality matching problem on graphs [73, 74, 76]. The fourth heuristic is a modification of the third one that allows for a faster execution time. The last one finds a matching for a reduced (d − 1)-dimensional problem and exploits it for the original matching problem recursively. This heuristic uses an exact algorithm for the bipartite matching problem at each reduction. We perform experiments to evaluate the performance of these heuristics on special classes of random hypergraphs as well as on real-life data.

One plausible way to tackle the problem is to create the line graph G for a given hypergraph H. The line graph is created by identifying each hyperedge of H with a vertex in G, and by connecting two vertices of G with an edge iff the corresponding hyperedges share a common vertex in H. Then, successful heuristics for computing large independent sets in graphs, e.g., KaMIS [147], can be used to compute large matchings in hypergraphs. This approach, although promising quality-wise, could be impractical. This is so since building G from H requires quadratic run time (in terms of the number of hyperedges) and, more importantly, quadratic storage (again, in terms of the number of hyperedges) in the worst case. While this can be acceptable in some instances, in others it is not; we have such instances in the experiments. Notice that while a heuristic for the independent set problem can be of linear time complexity in graphs, due to our graphs being line graphs, the actual complexity could be high.

6.1 Background and notation

Franklin and Lorenz [89] show that if a nonnegative d-dimensional tensor X has the same zero-pattern as a d-stochastic tensor, one can apply a multidimensional version of the Sinkhorn-Knopp algorithm [202] (see also the description in Section 4.1) to scale X to be d-stochastic. This is accomplished by finding d vectors u(1), . . . , u(d) and scaling the nonzeros of X as

    x_{i1,...,id} · u(1)_{i1} · · · · · u(d)_{id},    for all i1, . . . , id ∈ {1, . . . , n}.

In a hypergraph, if a vertex is a member of only a single hyperedge, we call it a degree-1 vertex. Similarly, if a vertex is a member of only two hyperedges, we call it a degree-2 vertex.

In the k-out random hypergraph model given a set V of vertices eachvertex u isin V selects k hyperedges from the set Eu = e e sube V u isin e ina uniformly random fashion and the union of these edges forms E We areinterested in the d-partite d-uniform case and hence Eu = e |e cap Vi| =1 for 1 le i le d u isin e This model generalizes the random k-out bipartitegraphs [211] Devlin and Kahn [67] investigate fractional matchings in thesehypergraphs and mention in passing that k should be exponential in d toensure that a perfect matching exists
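A small, hypothetical sketch of this generator for the d-partite, d-uniform case (the name k_out_hypergraph is ours): each vertex draws k hyperedges whose remaining endpoints are chosen uniformly at random, one per part.

    import numpy as np

    def k_out_hypergraph(n, d, k, seed=0):
        # Random k-out, d-partite, d-uniform hypergraph on parts of size n (sketch).
        # A hyperedge is a d-tuple (v_1, ..., v_d) with v_i in part i.
        rng = np.random.default_rng(seed)
        edges = set()
        for part in range(d):
            for u in range(n):
                for _ in range(k):
                    e = rng.integers(n, size=d)   # random endpoint in every part ...
                    e[part] = u                   # ... except u's own part
                    edges.add(tuple(e))           # duplicates are ignored
        return edges

    E = k_out_hypergraph(n=100, d=3, k=2)
    print(len(E))                                 # roughly d * k * n, minus duplicates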

6.2 The heuristics

A matching which cannot be extended with more edges is called maximal. The heuristics proposed here find maximal matchings on d-partite, d-uniform hypergraphs. For such hypergraphs, any maximal matching is a d-approximate matching. The bound is tight and can be verified for d = 3. Let H be a 3-partite, 3 × 3 × 3 hypergraph with the following edges: e1 = (1, 1, 1), e2 = (2, 2, 2), e3 = (3, 3, 3), and e4 = (1, 2, 3). The maximum matching is {e1, e2, e3}, but the edge e4 alone forms a maximal matching.


6.2.1 Greedy-H: A greedy heuristic

Among the two variants of Greedy [78, 194], we adapt the first one, which randomly visits the edges. The adapted version is referred to as Greedy-H; it visits the hyperedges in random order and adds the visited hyperedge to the matching whenever possible. Since only maximal matchings are possible as its output, Greedy-H is a d-approximation heuristic.
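A minimal, hypothetical sketch of Greedy-H (the name greedy_h is ours; hyperedges are d-tuples with one vertex per part, as produced by the generator sketched above):

    import random

    def greedy_h(edges, seed=0):
        # Greedy-H (sketch): scan hyperedges in random order and keep an edge
        # if none of its d endpoints is already matched.
        edges = list(edges)
        random.Random(seed).shuffle(edges)
        matched = [set() for _ in range(len(edges[0]))]   # matched vertices per part
        matching = []
        for e in edges:
            if all(v not in matched[i] for i, v in enumerate(e)):
                matching.append(e)
                for i, v in enumerate(e):
                    matched[i].add(v)
        return matching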

6.2.2 Karp-Sipser-H: Adapting Karp-Sipser

The commonly used and analyzed version of the Karp-Sipser heuristic for graphs applies degree-1 reductions, as summarized in Section 4.2.1. The original version, though, applies two rules:

• At any time during the heuristic, if a degree-1 vertex appears, it is matched with its only neighbor.

• Otherwise, if a degree-2 vertex u appears with neighbors {v, w}, then u (and its edges) is removed from the current graph, and v and w are merged to create a new vertex vw whose set of neighbors is the union of those of v and w (except u). A maximum cardinality matching for the reduced graph can be extended to obtain one for the current graph by matching u with either v or w, depending on vw's match.

We now propose an adaptation of Karp-Sipser with the two rules for d-partite, d-uniform hypergraphs. The adapted version is called Karp-Sipser-H. Similar to the original one, the modified heuristic iteratively adds a random hyperedge to the matching and removes its d endpoints, as well as their hyperedges. However, the random selection is not applied whenever hyperedges defined by the following lemmas appear.

Lemma 6.1. During the heuristic, if a hyperedge e with at least d − 1 degree-1 endpoints appears, then there exists a maximum cardinality matching in the current hypergraph containing e.

Proof. Let H′ be the current hypergraph at hand and e = (u1, . . . , ud) be a hyperedge in H′ whose first d − 1 endpoints are degree-1 vertices. Let M′ be a maximum cardinality matching in H′. If e ∈ M′, we are done. Otherwise, assume that ud is the endpoint matched by a hyperedge e′ ∈ M′ (note that if ud is not matched, M′ can be extended with e). Since the ui, 1 ≤ i < d, are not matched in M′, M′ \ {e′} ∪ {e} defines a valid maximum cardinality matching for H′.

We note that it is not possible to relax the condition by using a hyperedge e with fewer than d − 1 degree-1 endpoints: in M′, two of e's higher degree endpoints could be matched with two different hyperedges, in which case the substitution done in the proof of the lemma would not be valid.


Lemma 6.2. During the heuristic, let e = (u1, . . . , ud) and e′ = (u′1, . . . , u′d) be two hyperedges sharing at least one endpoint, where, for an index set I ⊂ {1, . . . , d} of cardinality d − 1, the vertices ui, u′i for all i ∈ I only touch e and/or e′. That is, for each i ∈ I, either ui = u′i is a degree-2 vertex, or ui ≠ u′i and they are both degree-1 vertices. For j ∉ I, uj and u′j are arbitrary vertices. Then, in the current hypergraph, there exists a maximum cardinality matching having either e or e′.

Proof. Let H′ be the current hypergraph at hand, and let j ∉ I be the remaining part index. Let M′ be a maximum cardinality matching in H′. If either e ∈ M′ or e′ ∈ M′, we are done. Otherwise, ui and u′i for all i ∈ I are unmatched by M′. Furthermore, since M′ is maximal, uj must be matched by M′ (otherwise, M′ can be extended by e). Let e″ ∈ M′ be the hyperedge matching uj. Then, M′ \ {e″} ∪ {e} defines a valid maximum cardinality matching for H′.

Whenever such hyperedges appear, the rules below are applied, in this order:

• Rule-1: At any time during the heuristic, if a hyperedge e with at least d − 1 degree-1 endpoints appears, instead of a random edge, e is added to the matching and removed from the hypergraph.

• Rule-2: Otherwise, if two hyperedges e and e′ as defined in Lemma 6.2 appear, they are removed from the current hypergraph along with the endpoints ui, u′i for all i ∈ I. Then, we consider uj and u′j. If uj and u′j are distinct, they are merged to create a new vertex uju′j, whose hyperedge list is defined as the union of uj's and u′j's hyperedge lists. If uj and u′j are identical, we rename uj as uju′j. After obtaining a maximal matching on the reduced hypergraph, depending on the hyperedge matching uju′j, either e or e′ can be used to obtain a larger matching in the current hypergraph.

When Rule-2 is applied, the two hyperedges identified in Lemma 6.2 are removed from the hypergraph, and only the hyperedges containing uj and/or u′j have an update in their vertex lists. Since the original hypergraph is d-partite and d-uniform, that update is just a renaming of a vertex in the concerned hyperedges (hence, the resulting hypergraph is d-partite and d-uniform).

Although the extended rules usually lead to improved results in comparison with Greedy-H, Karp-Sipser-H still adheres to the d-approximation bound of maximal matchings. For the example given at the beginning of Section 6.2, Karp-Sipser-H generates a maximum cardinality matching by applying the first rule. However, when e5 = (2, 1, 3) and e6 = (3, 1, 3) are added to the example, neither of the two rules can be applied. As before, in case e4 is randomly selected, it alone forms a maximal matching.


6.2.3 Karp-Sipser-H-scaling: Karp-Sipser with scaling

Karp-Sipser-H can be modified for better decisions in case neither of the two rules applies. In this variant, instead of a random selection, we first scale the adjacency tensor of H and obtain an approximately d-stochastic tensor X. We then augment the matching by adding the edge which corresponds to the largest value in X when the rules do not apply. The modified heuristic is summarized in Algorithm 12.

Algorithm 12: Karp-Sipser-H-scaling (H)

Data: A d-partite, d-uniform n1 × · · · × nd hypergraph H = (V, E)
Result: A maximal matching M of H

 1: M ← ∅                                         ▷ initially, M is empty
 2: S ← ∅                                         ▷ stack for the merges of Rule-2
 3: while H is not empty do
 4:     remove isolated vertices from H
 5:     if ∃ e = (u1, . . . , ud) as in Rule-1 then
 6:         M ← M ∪ {e}                           ▷ add e to the matching
 7:         apply the reduction Rule-1 on H
 8:     else if ∃ e = (u1, . . . , ud), e′ = (u′1, . . . , u′d), and I as in Rule-2 then
 9:         let j be the part index with j ∉ I
10:         apply the reduction Rule-2 on H by introducing the vertex uju′j
11:         E′ ← {(v1, . . . , uju′j, . . . , vd) : (v1, . . . , uj, . . . , vd) ∈ E}   ▷ memorize the hyperedges of uj
12:         S.push(⟨e, e′, uju′j, E′⟩)             ▷ store the current merge
13:     else
14:         X ← Scale(adj(H))                     ▷ scale the adjacency tensor of H
15:         e ← arg max_{(u1,...,ud)} x_{u1,...,ud}   ▷ find the maximum entry in X
16:         M ← M ∪ {e}                           ▷ add e to the matching
17:         remove all hyperedges of u1, . . . , ud from E
18:         V ← V \ {u1, . . . , ud}
19: while S ≠ ∅ do
20:     ⟨e, e′, uju′j, E′⟩ ← S.pop()
21:     if uju′j is not matched by M then
22:         M ← M ∪ {e}
23:     else
24:         let e″ ∈ M be the hyperedge matching uju′j
25:         if e″ ∈ E′ then                       ▷ e″ comes from uj's hyperedge list
26:             replace uju′j in e″ with uj
27:             M ← M ∪ {e′}
28:         else                                  ▷ e″ comes from u′j's hyperedge list
29:             replace uju′j in e″ with u′j
30:             M ← M ∪ {e}

Our inspiration comes from the d = 2 case [74, 76] described in Chapter 4. By using the scaling method as a preprocessing step and choosing edges with a probability corresponding to the scaled entry, the edges which are not included in a perfect matching become less likely to be chosen. Unfortunately, for d ≥ 3 there is no equivalent of Birkhoff's theorem, as demonstrated by the following lemma, whose proof can be found in the associated technical report [75, Lemma 3].

Lemma 6.3. For d ≥ 3, there exist extreme points in the set of d-stochastic tensors which are not permutation tensors.

These extreme points can be used to generate other d-stochastic tensors as linear combinations. Due to the lemma above, we do not have the theoretical foundation to imply that hyperedges corresponding to the large entries in the scaled tensor must necessarily participate in a perfect matching. Nonetheless, the entries not in any perfect matching tend to become zero (not guaranteed for all, though). For the worst-case example of Karp-Sipser-H described above, the scaling indeed helps the entries corresponding to e4, e5, and e6 to become zero; see the discussion following the said lemma in the technical report [75].

On a d-partite, d-uniform hypergraph H = (V, E), the Sinkhorn-Knopp algorithm used for scaling operates in iterations, each of which requires O(|E| × d) time. In practice, we perform only a few iterations (e.g., 10–20). Since we can match at most |V|/d hyperedges, the overall run time cost associated with scaling is O(|V| × |E|). A straightforward implementation of the second rule can take quadratic time in the worst case of a large number of repetitive merges with a given vertex. In practice, more of a linear time behavior should be observed for the second rule.

6.2.4 Karp-Sipser-H-mindegree: Hypergraph matching via pseudo scaling

In Algorithm 12, applying scaling at every step can be very costly. Here, we propose an alternative idea, inspired by the specifics of the Sinkhorn-Knopp algorithm, to reduce the overall cost. In particular, we can mimic the first iteration of the Sinkhorn-Knopp algorithm and use the inverse of the degree of a vertex as its scaling value. This assigns each vertex a weight proportional to the inverse of its degree; hence, this is not an exact iteration of Sinkhorn-Knopp.

With this approach, each hyperedge i1, . . . , id is associated with the value 1/∏_{j=1}^{d} λ_{ij}, where λ_{ij} denotes the degree of the vertex ij from the jth part. The selection procedure is the same as that of Algorithm 12; i.e., the edge with the maximum value is added to the matching set. We refer to this algorithm as Karp-Sipser-H-mindegree, as it selects a hyperedge based on a function of the degrees of the vertices. With a straightforward implementation, finding this hyperedge takes O(|E|) time. For a better efficiency, the edges can be stored in a heap, and when the degree of a node v decreases, the increaseKey heap operation can be called for all its edges.
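A hypothetical sketch of the pseudo-scaling selection (the name mindegree_pick is ours; it only shows how the hyperedge of maximum weight 1/∏ λ is picked, not the full Karp-Sipser-H machinery):

    from collections import Counter

    def mindegree_pick(edges):
        # Pick the hyperedge maximizing 1 / prod(degrees of its endpoints) (sketch).
        # A hyperedge is a d-tuple; a vertex is identified by (part index, vertex id).
        deg = Counter((i, v) for e in edges for i, v in enumerate(e))
        def weight(e):
            w = 1.0
            for i, v in enumerate(e):
                w /= deg[(i, v)]
            return w
        return max(edges, key=weight)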

6.2.5 Bipartite-reduction: Reduction to the bipartite matching problem

A perfect matching in a d-partite, d-uniform hypergraph H remains perfect when projected onto a (d − 1)-partite, (d − 1)-uniform hypergraph obtained by removing one of H's dimensions. Matchability in (d − 1)-dimensional sub-hypergraphs has been investigated [6] to provide an equivalent of Hall's Theorem for d-partite hypergraphs. These observations lead us to propose a heuristic called Bipartite-reduction. This heuristic tackles the d-partite, d-uniform case by recursively asking for matchings in (d − 1)-partite, (d − 1)-uniform hypergraphs, and so on, until d = 2.

Let us start with the case where d = 3. Let G = (VG, EG) be the bipartite graph with the vertex set VG = V1 ∪ V2 obtained by deleting V3 from a 3-partite, 3-uniform hypergraph H = (V, E). The edge (u, v) ∈ EG iff there exists a hyperedge (u, v, z) ∈ E. One can also assign a weight function w(·) to the edges during this step, such as

    w(u, v) = |{z : (u, v, z) ∈ E}| .                                 (6.1)

A maximum weighted (product, sum, etc.) matching algorithm can be used to obtain a matching MG on G. A second bipartite graph G′ = (VG′, EG′) is then created, with VG′ = (V1 × V2) ∪ V3 and EG′ = {(uv, z) : (u, v) ∈ MG, (u, v, z) ∈ H}. Under this construction, any matching in G′ corresponds to a valid matching in H. Furthermore, if the weight function (6.1) defined above is used, the following holds (the proof is in the technical report [75, Proposition 4]).

Proposition 6.1. Let w(MG) = ∑_{(u,v)∈MG} w(u, v) be the size of the matching MG found in G. Then G′ has w(MG) edges.

Thus, by selecting a maximum weighted matching MG and maximizing w(MG), the largest number of edges will be kept in G′.
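A hypothetical sketch of the d = 3 case (the name bipartite_reduction_d3 is ours; it uses networkx's general maximum weight matching for illustration, with the weights of (6.1)):

    import networkx as nx
    from collections import Counter

    def bipartite_reduction_d3(edges):
        # Bipartite-reduction for d = 3 (sketch): match parts 1-2 with the
        # weights of (6.1), then match the resulting pairs against part 3.
        w = Counter((u, v) for (u, v, z) in edges)
        G = nx.Graph()
        for (u, v), wt in w.items():
            G.add_edge(('r', u), ('c', v), weight=wt)
        MG = nx.max_weight_matching(G)
        pairs = {(x[1], y[1]) if x[0] == 'r' else (y[1], x[1]) for x, y in MG}
        # Second bipartite graph G' on (V1 x V2) u V3; maximum cardinality matching.
        Gp = nx.Graph()
        for (u, v, z) in edges:
            if (u, v) in pairs:
                Gp.add_edge(('uv', (u, v)), ('z', z))
        Mp = nx.max_weight_matching(Gp, maxcardinality=True)
        out = []
        for x, y in Mp:
            (_, (u, v)), (_, z) = (x, y) if x[0] == 'uv' else (y, x)
            out.append((u, v, z))
        return out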

For d-dimensional matching, a similar process is followed. First, an ordering i1, i2, . . . , id of the dimensions is defined. At the jth bipartite reduction step, a matching is found between the dimension cluster i1i2 · · · ij and dimension ij+1 by similarly solving a bipartite matching instance, where the edge (u1 · · · uj, v) exists iff the vertices u1, . . . , uj were matched in the previous steps and there exists a hyperedge (u1, . . . , uj, v, zj+2, . . . , zd) in H.

Unlike the previous heuristics, Bipartite-reduction does not have any approximation guarantee. We state this with the following lemma (the proof is in the technical report [75, Lemma 5]).


Lemma 6.4. If algorithms for the maximum cardinality matching problem, or for the maximum weighted matching problem with the suggested edge weights (6.1), are used, then Bipartite-reduction has a worst-case approximation ratio of Ω(n).

6.2.6 Local-Search: Performing local search

A local search heuristic is proposed by Hurkens and Schrijver [119]. It starts from a feasible maximal matching M and performs a series of swaps until no more are possible. In a swap, k edges of M are replaced with at least k + 1 new edges from E \ M, so that the cardinality of M increases by at least one. These k edges from M can be replaced with at most d × k new edges. Hence, these edges can be found by a polynomial algorithm enumerating all the possibilities. The approximation guarantee improves with higher k values. Local search algorithms are limited in practice due to their high time complexity. The algorithm might have to examine all (|M| choose k) subsets of M to find a feasible swap at each step. The algorithm by Cygan [56], which achieves a ((d + 1 + ε)/3)-approximation, is based on a different swap scheme, but is also not suited for large hypergraphs. We implement Local-Search as an alternative from the literature to set a baseline.

6.3 Experiments

To understand the relative performance of the proposed heuristics, we conducted a wide variety of experiments with both synthetic and real-life data. The experiments were performed on a computer equipped with an Intel Core i7-7600 CPU and 16GB of RAM. We investigate the performance of the proposed heuristics Greedy-H, Karp-Sipser-H, Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction. For d = 3, we also consider Local-Search [119], which replaces one hyperedge from a maximal matching M with at least two hyperedges from E \ M to increase the cardinality of M. We did not consider local search schemes for higher dimensions or with better approximation ratios, as they are computationally too expensive. For each hypergraph, we perform ten runs of Greedy-H and Karp-Sipser-H with different random decisions and take the maximum cardinality obtained. Since Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction do not pick hyperedges randomly, we run them only once. We perform 20 steps of the scaling procedure in Karp-Sipser-H-scaling. We refer to the quality of a matching M in a hypergraph H as the ratio of M's cardinality to the size of the smallest vertex partition of H.

On random hypergraphs

We perform experiments on two classes of d-partite, d-uniform random hypergraphs, where each part has n vertices. The first class contains random k-out hypergraphs, and the second one contains sparse random hypergraphs.


n = 10                                  n = 30
       k = d^{d-3}  d^{d-2}  d^{d-1}           k = d^{d-3}  d^{d-2}  d^{d-1}
 d=2        -        0.87     1.00      d=2        -        0.84     1.00
 d=3       0.80      1.00     1.00      d=3       0.88      1.00     1.00
 d=4       1.00      1.00     1.00      d=4       0.99      1.00     1.00
 d=5       1.00      1.00               d=5

n = 20                                  n = 50
       k = d^{d-3}  d^{d-2}  d^{d-1}           k = d^{d-3}  d^{d-2}  d^{d-1}
 d=2        -        0.88     1.00      d=2        -        0.87     1.00
 d=3       0.85      1.00     1.00      d=3       0.84      1.00     1.00
 d=4       1.00      1.00     1.00      d=4        *        1.00     1.00
 d=5       1.00      1.00     1.00      d=5

Table 6.1: The average maximum matching cardinalities of five random instances, normalized by n, on random k-out, d-partite, d-uniform hypergraphs for different k, d, and n. There are no runs for k = d^{d-3} for d = 2, and the problems marked with * were not solved within 24 hours.


Random k-out d-partite d-uniform hypergraphs

Here we consider random k-out, d-partite, d-uniform hypergraphs, described in Section 6.1. Hence (ignoring the duplicate ones), these hypergraphs have around d × k × n hyperedges. These k-out, d-partite, d-uniform hypergraphs have been recently analyzed in the matching context by Devlin and Kahn [67]. They state in passing that k should be exponential in d for a perfect matching to exist with high probability. The bipartite graph variant of the same problem, i.e., with d = 2, has been extensively studied in the literature [90, 124, 125, 211].

We first investigate the existence of perfect matchings in random k-out, d-partite, d-uniform hypergraphs. For this purpose, we used CPLEX [1] to solve the hypergraph matching problem exactly and found the maximum cardinality of a matching in k-out hypergraphs with k ∈ {d^{d-3}, d^{d-2}, d^{d-1}} for d ∈ {2, . . . , 5} and n ∈ {10, 20, 30, 50}. For each (k, d, n) triple, we created five hypergraphs and computed their maximum cardinality matchings. For k = d^{d-3}, we encountered several hypergraphs with no perfect matching, especially for d = 3. The hypergraphs with k = d^{d-2} were also lacking a perfect matching for d = 2. However, all the hypergraphs we created with k = d^{d-1} had at least one. Based on these results, we experimentally confirm Devlin and Kahn's statement. We also conjecture that d^{d-1}-out random hypergraphs have perfect matchings almost surely. The average maximum matching cardinalities we obtained in this experiment are given in Table 6.1. In this table, we do not have results for k = d^{d-3} for d = 2, and the cases marked with * were not solved within 24 hours.

We now compare the performance of the proposed heuristics on random k-out, d-partite, d-uniform hypergraphs with d ∈ {3, 6, 9} and n ∈ {1000, 10000}. We tested with k values equal to powers of 2, for k ≤ d log d. The results are summarized in Figure 6.1. For each (k, d, n) triplet, we create ten random instances and present the average performance of the heuristics on them. The x-axis in each figure denotes k, and the y-axis reports the matching cardinality over n. As seen, Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree have the best performance, comfortably beating the other alternatives. For d = 3, Karp-Sipser-H-scaling dominates Karp-Sipser-H-mindegree, but when d > 3, we see that Karp-Sipser-H-mindegree has the best performance. Local-Search performs worse than Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, but better than the others. Karp-Sipser-H performs better than Greedy-H; however, their performances get closer as d increases. This is due to the fact that the conditions for Rule-1 and Rule-2 hold less often for larger d, as we have more restrictions to encounter in such cases. Bipartite-reduction has worse performance than the others, and the gap in the performance grows as d increases. This happens since at each step we impose more and more conditions on the edges involved, and there is no chance to recover from bad decisions.

Sparse random d-partite d-uniform hypergraphs

Here we consider a random d-partite, d-uniform hypergraph Hi created with i × n hyperedges. The parameters used for this experiment are i ∈ {1, 3, 5, 7}, n ∈ {4000, 8000}, and d ∈ {6, 9} (the associated paper presents further experiments with d = 3). Each Hi is created by choosing the vertices of a hyperedge uniformly at random for each dimension. We do not allow duplicate hyperedges. Another random hypergraph Hi+M is then obtained by planting a perfect matching in Hi; that is, a random perfect matching is added to the pattern of Hi. We again generate ten random instances for each parameter setting. We do not present results for Bipartite-reduction, as it was always worse than the others. The average quality of the different heuristics on these instances is shown in Figure 6.2. The experiments confirm that Karp-Sipser-H performs consistently better than Greedy-H. Furthermore, Karp-Sipser-H-scaling performs significantly better than Karp-Sipser-H. Karp-Sipser-H-scaling works even better than the local search heuristic [75, Fig. 2], and it is the only heuristic that is capable of finding the planted perfect matchings for a significant number of the runs. In particular, it finds a perfect matching on the Hi+M's in all cases except for d = 6 and i = 7. Interestingly, Karp-Sipser-H-mindegree outperforms Karp-Sipser-H-scaling on the Hi's, but is dominated on the Hi+M's, where it is the second best performing heuristic.

Evaluating algorithmic choices

We evaluate the use of scaling and the importance of Rule-1 and Rule-2.

Scaling vs no-scaling

To evaluate and emphasize the contribution of scaling better, we compare the performance of the heuristics on a particular family of d-partite, d-uniform hypergraphs whose bipartite counterparts have been used before as challenging instances for the original Karp-Sipser heuristic [76].

123

Figure 6.1: The performance of the heuristics on k-out, d-partite, d-uniform hypergraphs with n vertices in each part: (a) d = 3, (b) d = 6, (c) d = 9, each with n = 1000 (left plot) and n = 10000 (right plot). The y-axis is the ratio of the matching cardinality to n, whereas the x-axis is k. There is no Local-Search for d = 6 and d = 9.


(a) d = 6

            Hi: random hypergraph                                          Hi+M: random hypergraph + a perfect matching
         Greedy-H     Karp-Sipser-H  KS-H-scaling  KS-H-mindegree     Greedy-H     Karp-Sipser-H  KS-H-scaling  KS-H-mindegree
  i    4000  8000     4000  8000     4000  8000     4000  8000       4000  8000    4000  8000     4000  8000     4000  8000
  1    0.31  0.31     0.35  0.35     0.35  0.35     0.36  0.37       0.62  0.61    0.90  0.89     1.00  1.00     1.00  1.00
  3    0.43  0.43     0.47  0.47     0.48  0.48     0.54  0.54       0.51  0.50    0.56  0.55     1.00  1.00     0.99  0.99
  5    0.48  0.48     0.52  0.52     0.54  0.54     0.61  0.61       0.52  0.52    0.56  0.55     1.00  1.00     0.97  0.97
  7    0.52  0.52     0.55  0.55     0.59  0.59     0.66  0.66       0.54  0.54    0.57  0.57     0.84  0.80     0.71  0.70

(b) d = 9

  i    4000  8000     4000  8000     4000  8000     4000  8000       4000  8000    4000  8000     4000  8000     4000  8000
  1    0.25  0.24     0.27  0.27     0.27  0.27     0.30  0.30       0.56  0.55    0.80  0.79     1.00  1.00     1.00  1.00
  3    0.34  0.33     0.36  0.36     0.36  0.36     0.43  0.43       0.40  0.40    0.44  0.44     1.00  1.00     0.99  1.00
  5    0.38  0.37     0.40  0.40     0.41  0.41     0.48  0.48       0.41  0.40    0.43  0.43     1.00  1.00     0.99  0.99
  7    0.40  0.40     0.42  0.42     0.44  0.44     0.51  0.51       0.42  0.42    0.44  0.44     1.00  1.00     0.97  0.96

Figure 6.2: Performance comparisons on d-partite, d-uniform hypergraphs with n = 4000, 8000. Hi contains i × n random hyperedges, and Hi+M contains an additional perfect matching. Panel (a) is for d = 6 and panel (b) for d = 9; in each panel, the left half is without and the right half is with the planted matching. KS-H stands for Karp-Sipser-H.

Figure 6.3: AKS, a challenging bipartite graph instance for Karp-Sipser, with row blocks R1, R2, column blocks C1, C2, and block-size parameter t.

   t   Greedy-H   Local-Search   Karp-Sipser-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
   2     0.53        0.99            0.53                1.00                     1.00
   4     0.53        0.99            0.53                1.00                     1.00
   8     0.54        0.99            0.55                1.00                     1.00
  16     0.55        0.99            0.56                1.00                     1.00
  32     0.59        0.99            0.59                1.00                     1.00

Table 6.2: Performance of the proposed heuristics on 3-partite, 3-uniform hypergraphs corresponding to XKS, with n = 300 vertices in each part.



Let AKS be an n times n matrix discussed before in Figure 43 as a chal-lenging instance to Karp-Sipser and shown in Figure 63 for convenienceTo adapt this scheme to hypergraphstensors we generate a 3-dimensionaltensor XKS such that the nonzero pattern of each marginal of the 3rd di-mension is identical to that of AKS Table 62 shows the performance of theheuristics (ie matching cardinality normalized with n) for 3-dimensionaltensors with n = 300 and t isin 2 4 8 16 32

The use of scaling indeed reduces the influence of the misleading hyperedges in the dense block R1 × C1, and the proposed Karp-Sipser-H-scaling heuristic always finds a perfect matching, as does Karp-Sipser-H-mindegree. However, Greedy-H and Karp-Sipser-H perform significantly worse. Furthermore, Local-Search returns a 0.99-approximation in every case, because it ends up in a local optimum.

Rule-1 vs Rule-2

We finish the discussion on the synthetic data by focusing on Karp-Sipser-H. In the bipartite case, recent work [10] shows that both rules are needed to obtain perfect matchings in a class of graphs. We present a family of hypergraphs for which applying Rule-2 leads to much better performance than applying Rule-1 only.

We use Karp-Sipser-H-R1 to refer to Karp-Sipser-H without Rule-2. As before, we first describe the bipartite case. Let ARF be an n × n matrix with (ARF)ij = 1 for 1 ≤ i ≤ j ≤ n, and (ARF)2,1 = (ARF)n,n−1 = 1. That is, ARF is composed of an upper triangular matrix and two additional subdiagonal nonzeros. The first two columns and the last two rows have two nonzeros. Assume, without loss of generality, that the first two rows are merged by applying Rule-2 on the first column (which is discarded). Then, in the reduced matrix, the first column (corresponding to the second column in the original matrix) will have one nonzero. Rule-1 can now be applied, whereupon the first column in the reduced matrix will have degree one. The process continues similarly until the reduced matrix is a 2 × 2 dense block, where applying Rule-2 followed by Rule-1 yields a perfect matching. If only Rule-1 reductions are allowed, initially no reduction can be applied, and randomly chosen edges will be matched, which negatively affects the quality of the returned matching.

For higher dimensions, we proceed as follows. Let XRF be a d-dimensional n × · · · × n tensor. We set (XRF)i,j,...,j = 1 for 1 ≤ i ≤ j ≤ n, and (XRF)1,2,...,2 = (XRF)n,n−1,...,n−1 = 1. By similar reasoning, we see that Karp-Sipser-H with both reduction rules will obtain a perfect matching, whereas Karp-Sipser-H-R1 will struggle. We give some results in Table 6.3 that show the difference between the two. We test for n ∈ {1000, 2000, 4000} and d ∈ {2, 3, 6}, and report the quality of Karp-Sipser-H-R1 and the number of times that Rule-1 is applied, normalized by n.
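A small, hypothetical NumPy sketch of this construction (the name build_XRF is ours), returning XRF as a dense tensor:

    import numpy as np

    def build_XRF(n, d):
        # Build the d-dimensional tensor X_RF (sketch): x[i, j, ..., j] = 1 for
        # i <= j, plus the two extra entries (1, 2, ..., 2) and (n, n-1, ..., n-1).
        X = np.zeros((n,) * d)
        for j in range(n):                           # 0-based j corresponds to j+1
            X[(slice(0, j + 1),) + (j,) * (d - 1)] = 1   # all i <= j
        X[(0,) + (1,) * (d - 1)] = 1                 # entry (1, 2, ..., 2), 1-based
        X[(n - 1,) + (n - 2,) * (d - 1)] = 1         # entry (n, n-1, ..., n-1)
        return X

    X = build_XRF(5, 3)                              # for d = 2 this reduces to A_RF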


               d = 2               d = 3               d = 6
    n      quality   r/n       quality   r/n       quality   r/n
  1000      0.83     0.45       0.85     0.47       0.80     0.31
  2000      0.86     0.53       0.87     0.56       0.80     0.30
  4000      0.82     0.42       0.75     0.17       0.84     0.45

Table 6.3: Quality of matching and the number r of applications of Rule-1, normalized by n, in Karp-Sipser-H-R1 for hypergraphs corresponding to XRF. Karp-Sipser-H (with both rules) obtains perfect matchings.

We present the best result over 10 runs. As seen in Table 6.3, Karp-Sipser-H-R1 obtains matchings that are about 13–25% worse than Karp-Sipser-H. Furthermore, the larger the number of Rule-1 applications is, the higher the quality is.

On real-life tensors

We also evaluate the performance of the proposed heuristics on some real-life tensors selected from the FROSTT library [203]. The descriptions of the tensors are given in Table 6.4. For nips and uber, a dimension of size 17 and 24, respectively, is dropped, since it restricts the size of the maximum cardinality matching. As described before, a d-partite, d-uniform hypergraph is obtained from a d-dimensional tensor by keeping a vertex for each dimension index and a hyperedge for each nonzero. Unlike in the previous experiments, the parts of the hypergraphs obtained from the real-life tensors in Table 6.4 do not have an equal number of vertices. In this case, although the scaling algorithm works along the same lines, its output is slightly different. Let ni = |Vi| be the cardinality of the ith dimension and nmax = max_{1≤i≤d} ni be the maximum one. By slightly modifying Sinkhorn-Knopp, for each iteration of Karp-Sipser-H-scaling, we scale the tensor such that the marginals in dimension i sum up to nmax/ni instead of one. The results in Table 6.4 resemble those from the previous sections. Karp-Sipser-H-scaling has the best performance and is slightly superior to Karp-Sipser-H-mindegree. Greedy-H and Karp-Sipser-H are close to each other, and, when it is feasible, Local-Search is better than them. We also see that in these instances Bipartite-reduction exhibits a good performance: its performance is at least as good as that of Karp-Sipser-H-scaling for the first three instances, but about 10% worse for the last one.


Tensor   d   Dimensions                          Greedy-H   Local-Search   Karp-Sipser-H   Karp-Sipser-H-mindegree   Karp-Sipser-H-scaling   Bipartite-Reduction
Uber     3   183 × 1,140 × 1,717                    183         183             183                  183                     183                    183
nips     3   2,482 × 2,862 × 14,036               1,847       1,991           1,839                2,005                   2,007                  2,007
Nell-2   3   12,092 × 9,184 × 28,818              3,913       4,987           3,935                5,100                   5,154                  5,175
Enron    4   6,066 × 5,699 × 244,268 × 1,176        875          -              875                  988                   1,001                    898

Table 6.4: Four real-life tensors and the performance of the proposed heuristics on the corresponding hypergraphs. There is no result for Local-Search for Enron, as it is four dimensional. Uber has 1,117,629 nonzeros, nips has 3,101,609 nonzeros, Nell-2 has 76,879,419 nonzeros, and Enron has 54,202,099 nonzeros.

Using an independent set solver

We compare Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree with the idea of reducing the maximum d-dimensional matching problem to that of finding an independent set in the line graph of the given hypergraph. We show that this transformation can obtain good results, but is restricted because line graphs can require too much space.

We use KaMIS [147] to find independent sets in graphs. KaMIS uses a plethora of reductions and a genetic algorithm in order to return high cardinality independent sets. We use the default settings of KaMIS (where the execution time is limited to 600 seconds) and generate the line graphs with efficient sparse matrix–matrix multiplication routines. We run KaMIS, Greedy-H, Karp-Sipser-H-scaling, and Karp-Sipser-H-mindegree on a few hypergraphs from the previous tests. The results are summarized in Table 6.5. The run time of Greedy-H was less than one second in all instances. KaMIS operates in rounds, and we give the quality and the run time of the first round and of the final output. We note that KaMIS considers the time limit only after the first round has been completed. As can be seen, while the quality of KaMIS is always good, and in most cases superior to Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, it is also significantly slower (its principle is to deliver high quality results). We also observe that the pseudo scaling of Karp-Sipser-H-mindegree indeed helps to reduce the run time compared to Karp-Sipser-H-scaling.

The line graphs of the real-life instances from Table 6.4 are too large to be handled. We estimate, using known techniques [50], the number of edges in these graphs to range from 1.5 × 10^10 to 4.7 × 10^13, which translates to 126 GB to 380 TB of storage if edges are stored twice, using 4 bytes per edge.
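The line-graph construction via sparse matrix–matrix multiplication mentioned above can be sketched as follows; this is our own illustration with SciPy (not the code of the study). With the hyperedge-by-vertex incidence matrix H, the product H·Hᵀ has a nonzero in position (e, f), e ≠ f, exactly when hyperedges e and f share a vertex, that is, when {e, f} is an edge of the line graph.

    import numpy as np
    from scipy.sparse import csr_matrix

    def line_graph_edges(hyperedges, num_vertices):
        """hyperedges: list of index tuples over a common vertex numbering.
        Returns the number of edges of the line graph."""
        rows, cols = [], []
        for e, verts in enumerate(hyperedges):
            for v in verts:
                rows.append(e)
                cols.append(v)
        H = csr_matrix((np.ones(len(rows)), (rows, cols)),
                       shape=(len(hyperedges), num_vertices))
        A = H @ H.T                      # (e, f) nonzero iff hyperedges e and f intersect
        A.setdiag(0)
        A.eliminate_zeros()
        return A.nnz // 2                # each line-graph edge is counted twice

    # A toy 3-partite, 3-uniform example with 4 vertices per part (vertices 0..11).
    edges = [(0, 4, 8), (0, 5, 8), (1, 5, 9), (2, 6, 10)]
    print(line_graph_edges(edges, 12))   # prints 2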


                           line graph   KaMIS Round 1    KaMIS Output     Greedy-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
hypergraph                  gen. time   quality   time   quality   time    quality     quality      time        quality      time
8-out, n = 1000,  d = 3          10      0.98       80     0.99     600      0.86        0.98          1          0.98         1
8-out, n = 10000, d = 3         112      0.98      507     0.99     600      0.86        0.98        197          0.98         1
8-out, n = 1000,  d = 9         298      0.67      798     0.69     802      0.55        0.62          2          0.67         1
n = 8000, d = 3, H3               1      0.77       16     0.81     602      0.63        0.76          5          0.77         1
n = 8000, d = 3, H3+M             2      0.89       25     1.00     430      0.70        1.00         11          0.91         1

Table 6.5: Run time (in seconds) and performance comparisons between KaMIS, Greedy-H, and Karp-Sipser-H-scaling. The time required to create the line graphs should be added to KaMIS's overall time.

6.4 Summary, further notes, and references

We have proposed heuristics for the maximum matching problem in d-partite, d-uniform hypergraphs by generalizing existing heuristics for the maximum cardinality matching problem in graphs. The experimental analysis on various hypergraphs (both random and corresponding to real-life tensors) shows the effectiveness and efficiency of the proposed heuristics.

This being a recent study, we have much future work. On the theoretical side, we plan to investigate the stated conjecture that d^{d−1}-out random hypergraphs have perfect matchings almost always, and to analyze the theoretical guarantees of the proposed algorithms. On the practical side, we want to test the use of the proposed hypergraph matching heuristics in the coarsening phase of multilevel hypergraph partitioning tools. For this, we need to adapt the proposed heuristics to the general hypergraph case (not necessarily d-partite and d-uniform). Another future work is the investigation of the weighted hypergraph matching problem, and the generalization and use of the proposed heuristics for this problem. Solvers for this problem are computational tools used in the multiple network alignment problem [179], where a matching among the vertices of more than two graphs is computed.

Part III

Ordering matrices and tensors


Chapter 7

Elimination trees and height-reducing orderings

In this chapter, we investigate the minimum height elimination tree problem described in Section 1.1. The formal problem definition is repeated below for convenience.

Minimum height elimination tree: Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The standard elimination tree [200] has been used to expose parallelism in sparse Cholesky, LU, and QR factorizations [5, 8, 116, 160]. Roughly, a set of vertices without ancestor/descendant relations corresponds to a set of independent tasks that can be performed in parallel. Therefore, the total number of parallel steps, or the critical path length, is equal to the height of the tree on an unbounded number of processors [162, 214]. Obtaining an elimination tree with the minimum height for a given matrix is NP-complete [193]; therefore, heuristic approaches are used. One class of heuristics contents itself with graph partitioning based methods: these methods reduce some other important cost metrics in sparse Cholesky factorization, such as the fill-in and the operation count, while giving a shallow depth elimination tree [103]. When the matrix is unsymmetric, the elimination tree for LU factorization [81] is useful to expose parallelism. In this respect, the height of the tree again corresponds to the number of parallel steps, or the critical path length, for certain factorization schemes. Here, we develop heuristics to reduce the height of elimination trees for unsymmetric matrices.

Consider the directed graph G_d obtained by replacing every edge of a given undirected graph G by two directed edges pointing in opposite directions. Then, the cycle-rank (or the minimum height of an elimination tree) of G_d is closely related to the undirected graph parameter tree-depth [180, p. 128], which corresponds to the minimum height of an elimination tree for a symmetric matrix. One can exploit this correspondence in the reverse sense in an attempt to reduce the height of the elimination tree while producing an ordering for the matrix. That is, one can use the standard undirected graph model corresponding to the matrix A + Aᵀ, where the addition is pattern-wise. This approach of producing an undirected graph for a given directed graph by removing the direction of each edge is used in solvers such as MUMPS [8] and SuperLU [63]. This way, the state of the art graph partitioning based ordering methods can be used. We will compare the proposed heuristics with such methods.

The height of the elimination tree, or the critical path length, is not the only metric for efficient parallel LU factorization of unsymmetric matrices. Depending on the factorization scheme, such as Gaussian elimination with static pivoting (implemented in GESP [157]) or multifrontal based solvers (implemented, for example, in WSMP [104]), other metrics come into play, such as the fill-in, the size of blocking in GESP based solvers (e.g., SuperLU_MT [64]), the maximum/minimum size of fronts for parallelism in multifrontal solvers, or the communication. Nonetheless, the elimination tree helps to give an upper bound on the performance of the suitable parallel LU factorization schemes under a fairly standard PRAM model (assuming unit size computations, an infinite number of processors, and no memory/communication or scheduling overhead). Furthermore, our overall approach is based on obtaining a desirable form for LU factorization called the bordered block triangular form (or BBT form for short) [81]. This form simplifies many algorithmic details of different solvers [81, 82] and is likely to control the fill-in, as the nonzeros are constrained to be in the blocks of a BBT form. If one orders a matrix without a BBT form and then obtains this form, the fill-in can increase or decrease [83]. Hence, our BBT based approach is likely to be helpful for LU factorization schemes exploiting the BBT form as well, in controlling the fill-in.

7.1 Background

We define G^s = (V, E^s) as the undirected graph associated with the directed graph G, such that E^s = {{v_i, v_j} : (v_i, v_j) ∈ E}. A directed graph G is called complete if (u, v) ∈ E for all vertex pairs (u, v).

Let T = (V, r, ρ) be a rooted tree. An ordering σ : V ↔ {1, ..., n} is a bijective function. An ordering σ is called topological if σ(ρ(v_j)) > σ(v_j) for all v_j ∈ V. Finally, a postordering is a special form of topological ordering in which the vertices of each subtree are ordered consecutively.

Let A be an n × n unsymmetric matrix with m off-diagonal nonzero entries. The standard directed graph G(A) = (V, E) of A consists of the vertex set V = {1, ..., n} and the edge set E = {(i, j) : a_ij ≠ 0} of m edges. We write G_A instead of G(A) when a subgraph notion is required.


Figure 7.1: A sample 9 × 9 unsymmetric matrix A, the corresponding directed graph G(A), and the elimination tree T(A): (a) the matrix A, (b) the directed graph G(A), (c) the elimination tree T(A).

An elimination digraph G_k(A) = (V_k, E_k) is defined for 0 ≤ k < n, with the vertex set V_k = {k+1, ..., n} and the edge set

    E_k = E                                                       for k = 0,
    E_k = E_{k−1} ∪ {(i, j) : (i, k), (k, j) ∈ E_{k−1}}           for k > 0.

We also define V^k = {1, ..., k} for each 0 ≤ k < n.

We assume that A is irreducible and that the LU factorization A = LU exists, where L and U are unit lower and upper triangular, respectively. Since the factors L and U are triangular, their standard directed graph models G(L) and G(U) are acyclic. Eisenstat and Liu [81] define the elimination tree T(A) as follows. Let

    Parent(i) = min{ j : j > i and j ⇒_{G(L)} i ⇒_{G(U)} j },

where a ⇒_G b means that there is a path from a to b in the graph G; then Parent(i) corresponds to the parent of vertex i in T(A) for i < n, and for the root n we put Parent(n) = ∞.

We now illustrate some of the definitions. Figure 7.1a displays a sample matrix A with 9 rows/columns and 26 nonzeros. Figure 7.1b shows the standard directed graph representation G(A) of this matrix. Upon elimination of vertex 1, the three edges (3,2), (6,2), and (7,2) are added, and vertex 1 is removed, to build the elimination digraph G_1(A). Figure 7.1c shows the elimination tree T(A), whose height is h(T(A)) = 5. Here, the parent of vertex 1 is 3, as G_A[{1,2}] is not strongly connected but G_A[{1,2,3}] is (thanks to the cycle 1 → 2 → 3 → 1). The cross edges of G(A) with respect to the elimination tree T(A) are represented by dashed directed arrows in Figure 7.1c. The initial order is topological but not postordered. However, 8, 1, 2, 3, 5, 4, 6, 7, 9 is a postordering of T(A), and it does not respect the cross edges.
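The example above suggests a direct, if naive, way to compute the parent of a vertex: take the smallest j > i such that i and j lie in the same strongly connected component of the subgraph induced by {1, ..., j}, exactly as in the cycle 1 → 2 → 3 → 1 above. The following Python sketch (our own illustration with SciPy, with cubic worst-case cost and no claim of matching the efficient algorithms in the literature) computes the parent array using this strong-component characterization.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def etree_unsym(A):
        """Naive elimination tree of an irreducible unsymmetric matrix (0-based):
        parent[i] = smallest j > i with i and j in the same strong component of
        the subgraph of G(A) induced by {0, ..., j}; the root gets parent -1."""
        n = A.shape[0]
        parent = np.full(n, -1)
        for j in range(1, n):
            sub = A[:j + 1, :][:, :j + 1]
            _, comp = connected_components(sub, directed=True, connection='strong')
            for i in range(j):
                if parent[i] == -1 and comp[i] == comp[j]:
                    parent[i] = j
        return parent

    # Tiny example: the 3-cycle 1 -> 2 -> 3 -> 1 (0-based: 0 -> 1 -> 2 -> 0) plus the diagonal.
    A = csr_matrix(np.array([[1, 1, 0],
                             [0, 1, 1],
                             [1, 0, 1]]))
    print(etree_unsym(A))   # [2 2 -1], i.e., Parent(1) = Parent(2) = 3 in 1-based numbering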

7.2 The critical path length and the elimination tree height

We summarize the rowwise and columnwise LU factorization schemes, also known as the ikj and jik variants [70], here referred to as row-LU and column-LU, respectively. Then, we show that the critical path lengths of the parallel row-LU and column-LU factorization schemes are equal to the elimination tree height whenever the matrix has a certain form.

Let A be an n × n irreducible unsymmetric matrix, and A = LU, where L and U are unit lower and upper triangular, respectively. Algorithms 13 and 14 display the row-LU and column-LU factorization schemes, respectively. Both schemes update the given matrix A so that the U = [u_ij]_{i≤j} factor is formed by the upper triangular (including the diagonal) entries of the matrix A at the end. The L = [ℓ_ij]_{i≥j} factor is formed by the multipliers (ℓ_ik = a_ik/a_kk for i > k) and a unit diagonal. We assume a coarse-grain parallelization approach [122, 160]. In Algorithm 13, task i is defined as the task of computing row i of L and U, and is denoted by T_row(i). Similarly, in Algorithm 14, task j is defined as the task of computing column j of both the L and U factors, and is denoted by T_col(j). The dependencies between the tasks can be represented with the directed acyclic graphs of Lᵀ and of U, for the row-LU and column-LU factorizations of A, respectively.

Algorithm 13: Row-LU(A)
1  for i = 1 to n do
2      T_row(i):
3      for each k < i such that a_ik ≠ 0 do
4          for each j > k such that a_kj ≠ 0 do
5              a_ij ← a_ij − a_kj × (a_ik / a_kk)

Algorithm 14: Column-LU(A)
1  for j = 1 to n do
2      T_col(j):
3      for each k < j such that a_kj ≠ 0 do
4          for each i > k such that a_ik ≠ 0 do
5              a_ij ← a_ij − a_ik × (a_kj / a_kk)


Figure 7.2: The filled matrix L + U (a), the task dependency graph G(Lᵀ) for row-LU (b), and the task dependency graph G(U) for column-LU (c).

Figure 7.2 illustrates the mentioned dependencies. Figure 7.2a shows the matrix L + U corresponding to the factors of the sample matrix A given in Figure 7.1a. In this figure, the blue hexagons and green circles are the fill-ins in L and U, respectively. Figures 7.2b and 7.2c show the task dependency graphs for the row-LU and column-LU factorizations, respectively. The blue dashed directed edges in Figure 7.2b and the green dashed directed edges in Figure 7.2c correspond to the fill-ins. All edges in Figures 7.2b and 7.2c show dependencies. For example, in the column-LU, the second column depends on the first one, as the values in the first column of L are needed while computing the second column. This is shown in Figure 7.2c with a directed edge from the first vertex to the second one.

We now consider postorderings of the elimination tree T(A). Let T[k] denote the set of vertices in the subtree of T(A) rooted at k. For a postordering, the order of siblings is not specified in the general sense. A postordering is called upper bordered block triangular (BBT) if the topological order among the sibling vertices is respected, that is, a vertex k is numbered before a sibling vertex ℓ if there is an edge in G(A) that is directed from T[k] to T[ℓ]. Similarly, we call a postordering lower BBT if the sibling vertices are ordered in reverse topological order. As an example, the initial order of the matrix given in Figure 7.1 is neither upper nor lower BBT; however, the order 8, 6, 1, 2, 3, 5, 4, 7, 9 would be an upper BBT postordering (see Figure 7.1c). The following theorem shows the relation between a BBT postordering and the task dependencies in LU factorizations.

Theorem 7.1 ([81]). Assuming the edges are directed from child to parent, T(A) is the transitive reduction of G(Lᵀ) and of G(U) when A is upper and lower BBT ordered, respectively.

We follow this theorem with a corollary which provides the motivationfor reducing the elimination tree height

Corollary 7.1. The critical path length of the task dependency graph is equal to the elimination tree height h(T(A)) for the row-LU factorization when A is upper BBT ordered, and for the column-LU factorization when A is lower BBT ordered.

Figure 7.3: The filled matrix of the matrix Â = LU in BBT form (a), and the task dependency graph G(Lᵀ) for row-LU (b).

Figure 7.3 demonstrates the discussed points on a matrix Â, which is the matrix A of Figure 7.1a permuted with the upper BBT postordering 8, 6, 1, 2, 3, 5, 4, 7, 9. Figures 7.3a and 7.3b show the filled matrix L + U such that Â = LU, and the corresponding task dependency graph G(Lᵀ) for the row-LU factorization, respectively. As seen in Figure 7.3b, there are only tree (solid) and back (dashed) edges with respect to the elimination tree T(Â), due to the fact that T(Â) is the transitive reduction of G(Lᵀ) (see Theorem 7.1). In the figure, the blue edges refer to fill-in nonzeros. As seen in the figure, there is a path from each vertex to the root, and a critical path 1 → 3 → 5 → 7 → 9 has length 5, which coincides with the height of T(Â).

7.3 Reducing the elimination tree height

We propose a top-down approach to reorder a given matrix so as to obtain a short elimination tree. The main lines of the proposed approach form a generalization of the state of the art methods used for Cholesky factorization (which are based on nested dissection [94]). We give the algorithms and discussions for the upper BBT form; they can easily be adapted for the lower BBT form. The proofs of Proposition 7.1 and Theorems 7.2–7.4 are omitted from the presentation and can be found in the original paper [136].


7.3.1 Theoretical basis

Let A be an n × n sparse, unsymmetric, irreducible matrix and T(A) be its elimination tree. A BBT decomposition of A can be represented as a permutation of A with a permutation matrix P such that

    A_BBT =  | A_11  A_12  ···  A_1K  A_1B |
             |       A_22  ···  A_2K  A_2B |
             |             ···   ···   ··· |                     (7.1)
             |                  A_KK  A_KB |
             | A_B1  A_B2  ···  A_BK  A_BB |

where A_BBT = PAPᵀ, the number of diagonal blocks K > 1, and A_kk is irreducible for each 1 ≤ k ≤ K. The border A_B∗ is called minimal if there is no permutation matrix P′ ≠ P such that P′AP′ᵀ is a BBT decomposition and A′_B∗ ⊂ A_B∗. That is, the border A_B∗ is minimal if we cannot remove rows from A_B∗ and the corresponding columns from A_∗B, permute them together to a diagonal block, and still have a BBT form. We give the following proposition, which will be used for Theorem 7.2.

Proposition 7.1. Let A be in upper BBT form and the border A_B∗ be minimal. Then the elimination digraph G_κ(A) is complete, where κ = Σ_{k=1}^{K} |A_kk|.

The following theorem provides a basis for reducing the elimination treeheight h(T(A))

Theorem 7.2. Let A be in upper BBT form and the border A_B∗ be minimal. Then

    h(T(A)) = |A_B∗| + max_{1≤k≤K} h(T(A_kk)).

The theorem says that the critical path length in the parallel LU factorization methods summarized in Section 7.2 can be expressed recursively in terms of the border and of the critical path length of a block with the maximum height. This holds for the row-LU scheme when the matrix A is in an upper BBT form, and for the column-LU scheme when A is in a lower BBT form. Hence, having a small border size and diagonal blocks of similar sizes is likely to lead to a reduced critical path length.

7.3.2 Recursive approach: Permute

We propose a recursive approach to reorder a given matrix so that the height of the elimination tree is reduced. Algorithm 15 gives an overview of the solution framework. The main procedure, Permute, takes an irreducible unsymmetric matrix A as its input. It calls PermuteBBT to decompose A into a BBT form, A_BBT. Upon obtaining such a decomposition, the main procedure calls itself on each diagonal block A_kk recursively. Whenever the matrix becomes sufficiently small (its size gets smaller than τ), the procedure PermuteBT is called in order to compute a fine BBT decomposition in which the diagonal blocks are of unit size, here referred to as a bordered triangular (BT) decomposition, and the recursion terminates.

Algorithm 15: Permute(A)
1  if |A| < τ then                          ▷ recursion terminates
2      PermuteBT(A)
3  else
4      A_BBT ← PermuteBBT(A)                ▷ as given in (7.1)
5      for k ← 1 to K do
6          Permute(A_kk)                    ▷ subblocks of A_BBT

7.3.3 BBT decomposition: PermuteBBT

Permuting A into a BBT form translates into finding a strong (vertex) separator of G(A), where the strong separator itself corresponds to the border A_B∗. With regard to Theorem 7.2, we are particularly interested in finding minimal borders.

Let G = (V, E) be a strongly connected directed graph. A strong separator (also called a directed vertex separator [138]) is defined as a vertex set S ⊂ V such that G − S is not strongly connected. Moreover, S is said to be minimal if G − S′ is strongly connected for any proper subset S′ ⊂ S.

The algorithm that we propose to find a strong separator is based on bipartitioning of directed graphs. In the case of undirected graphs, there are typically two kinds of graph partitioning methods, which differ by their separators: edge separators and vertex separators. An edge (or vertex) separator is a set of edges (or vertices) whose removal leaves the graph disconnected.

A BBT decomposition of a matrix may have three or more subblocks (as seen in (7.1)). However, the algorithms we use to find a strong separator utilize 2-way partitioning, which in turn constructs two parts, each of which may potentially have several components. For the sake of simplicity in the presentation, we assume that both parts are strongly connected.

We introduce some notation in order to ease the discussion. For a pair of vertex subsets (U, W) of a directed graph G, we write U → W if there may be some edges from U to W but there is no edge from W to U. Besides, U ↔ W and U ↮ W are used in the natural way; that is, U ↔ W implies that there may be edges in both directions, whereas U ↮ W refers to the absence of any edge between U and W.

We cast our problem of finding strong separators as follows: for a given directed graph G, find a vertex partition Π∗ = {V1, V2, S}, where S refers to a strong separator and V1, V2 refer to vertex parts, so that S has small cardinality, the part sizes are close to each other, and either V1 → V2 or V2 → V1.


Figure 7.4: A sample 2-way partition Π_ES = {V1, V2} by edge separator (a), the bipartite graphs and covers for the directed cuts E_{2→1} and E_{1→2} (b), and the matrix view of Π_ES (c).

7.3.4 Edge-separator-based method

The first step of the method is to obtain a 2-way partition Π_ES = {V1, V2} of G by edge separator. We build two strong separators upon Π_ES and choose the one with the smaller cardinality.

For an edge separator Π_ES = {V1, V2}, the directed cut E_{i→j} is defined as the set of edges directed from V_i to V_j, i.e., E_{i→j} = {(u, v) ∈ E : u ∈ V_i, v ∈ V_j}, and we say that a vertex set S covers E_{i→j} if for any edge (u, v) ∈ E_{i→j} either u ∈ S or v ∈ S, for i ≠ j ∈ {1, 2}.

Initially, we have V1 ↔ V2. Our goal is to find a subset of vertices S ⊂ V so that either V̄1 → V̄2 or V̄2 → V̄1, where V̄1 and V̄2 are the sets of remaining vertices in V1 and V2 after the removal of the vertices in S, that is, V̄1 = V1 − S and V̄2 = V2 − S. The following theorem serves the purpose of finding such a subset S.

Theorem 7.3. Let G = (V, E) be a directed graph, Π_ES = {V1, V2} be an edge separator of G, and S ⊆ V. Then V̄_i → V̄_j if and only if S covers E_{j→i}, for i ≠ j ∈ {1, 2}.

The problem of finding a strong separator S can thus be encoded as finding a minimum vertex cover in the bipartite graph whose edge set is populated only by the edges of E_{j→i}. Algorithm 16 gives the pseudocode. As seen in the algorithm, we find two separators, S_{1→2} and S_{2→1}, which cover E_{2→1} and E_{1→2} and result in V̄1 → V̄2 and V̄2 → V̄1, respectively. At the end, we take the strong separator with the smaller cardinality.
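The FindMinCover step of Algorithm 16 below can be realized with König's theorem: in a bipartite graph, a minimum vertex cover can be read off a maximum matching. The following Python sketch (our own illustration with NetworkX; the function name and the toy cut are ours) computes such a cover for a directed cut given as a list of (u, v) edges.

    import networkx as nx

    def min_cover_of_cut(cut_edges):
        """Minimum vertex cover of the bipartite graph induced by a directed cut.
        cut_edges: list of (u, v) pairs; tails and heads form the two sides."""
        B = nx.Graph()
        tails = {('t', u) for u, _ in cut_edges}
        heads = {('h', v) for _, v in cut_edges}
        B.add_nodes_from(tails, bipartite=0)
        B.add_nodes_from(heads, bipartite=1)
        B.add_edges_from((('t', u), ('h', v)) for u, v in cut_edges)
        matching = nx.bipartite.maximum_matching(B, top_nodes=tails)
        cover = nx.bipartite.to_vertex_cover(B, matching, top_nodes=tails)
        return {name for _, name in cover}    # drop the side tags

    # Directed cut E_{1->2} = {(6,1), (6,3), (7,1)} from the running example:
    print(min_cover_of_cut([(6, 1), (6, 3), (7, 1)]))
    # prints a minimum cover of size 2 (one of {1, 6}, {6, 7}, {1, 3})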

Algorithm 16: FindStrongVertexSeparatorES(G)
1  Π_ES ← GraphBipartiteES(G)                    ▷ Π_ES = {V1, V2}
2  Π_{1→2} ← FindMinCover(G, E_{2→1})
3  Π_{2→1} ← FindMinCover(G, E_{1→2})
4  if |S_{1→2}| < |S_{2→1}| then
5      return Π_{1→2} = {V̄1, V̄2, S_{1→2}}       ▷ V̄1 → V̄2
6  else
7      return Π_{2→1} = {V̄1, V̄2, S_{2→1}}       ▷ V̄2 → V̄1

In this method, how the edge separator is found matters. The objective for Π_ES is to minimize the cutsize Ψ(Π_ES), which is typically defined over the cut edges as follows:

    Ψ_es(Π_ES) = |E_c| = |{(u, v) ∈ E : π(u) ≠ π(v)}|,                      (7.2)

where u ∈ V_{π(u)} and v ∈ V_{π(v)}, and E_c refers to the set of cut edges. We note that the use of the above metric in Algorithm 16 is a generalization of a method used for symmetric matrices [161]. However, Ψ_es(Π_ES) is only loosely related to the size of the cover found in order to build the strong separator. A more suitable metric would be the number of vertices from which a cut edge emanates, or the number of vertices at which a cut edge is directed. Such a cutsize metric can be formalized as follows:

    Ψ_cn(Π_ES) = |V_c| = |{v ∈ V : ∃(u, v) ∈ E_c}|.                         (7.3)

The following theorem shows the close relation between Ψ_cn(Π_ES) and the size of the strong separator S.

Theorem 7.4. Let G = (V, E) be a directed graph, Π_ES = {V1, V2} be an edge separator of G, and Π∗ = {V̄1, V̄2, S} be the strong separator built upon Π_ES. Then |S| ≤ Ψ_cn(Π_ES)/2.

Apart from the theorem, we also note that optimizing (7.3) is likely to yield bipartite graphs that are easier to cover than those resulting from optimizing (7.2). This is because all edges emanating from a single vertex or from many different vertices are treated the same by (7.2), whereas those emanating from a single vertex are preferred by (7.3).

We note that the cutsize metric given in (7.3) can be modeled using hypergraphs. The hypergraph that models this cutsize metric (called the column-net hypergraph model [39]) has the same vertex set as the directed graph G = (V, E), and a hyperedge for each vertex v ∈ V that connects all u ∈ V such that (u, v) ∈ E. Any partition of the vertices of this hypergraph that minimizes the number of hyperedges in the cut induces a vertex partition on G with an edge separator that minimizes the cutsize according to the metric (7.3). A similar statement holds for another hypergraph constructed by reversing the edge directions (called the row-net model [39]).
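As a concrete illustration of the column-net model described above (our own sketch, not tied to any particular partitioning tool), the hypergraph of a directed graph given by its edge list can be built as follows: the net of vertex v gathers all vertices u having an edge (u, v).

    from collections import defaultdict

    def column_net_hypergraph(edges):
        """Column-net model of a directed graph: net v -> set of vertices u with (u, v) in E."""
        nets = defaultdict(set)
        for u, v in edges:
            nets[v].add(u)
        return dict(nets)

    # Small example: edges of a directed graph on vertices 1..3.
    print(column_net_hypergraph([(1, 2), (2, 3), (1, 3), (3, 1)]))
    # {2: {1}, 3: {1, 2}, 1: {3}}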


Figure 7.5: Π_{2→1}, the strong separator built upon S_{2→1}, the cover of E_{1→2} given in Figure 7.4b (a); the matrix A_BBT obtained by Π_{2→1} (b); and the elimination tree T(A_BBT) (c).

Figures 7.4 and 7.5 give an example for finding a strong separator based on an edge separator. In Figure 7.4a, the red curve indicates the edge separator Π_ES = {V1, V2}, with V1 = {4, 6, 7, 8, 9} (the green vertices) and V2 = {1, 2, 3, 5} (the blue vertices). The cut edges of E_{2→1} = {(2,7), (5,4), (5,9)} and E_{1→2} = {(6,1), (6,3), (7,1)} are given in blue and green, respectively. We see two covers for the bipartite graphs corresponding to each of the directed cuts E_{2→1} and E_{1→2} in Figure 7.4b. As seen in Figure 7.4c, these directed cuts correspond to the nonzeros of the off-diagonal blocks in the matrix view. Figure 7.5a gives the strong separator Π_{2→1} built upon Π_ES by the cover S_{2→1} = {1, 6} of the bipartite graph corresponding to E_{1→2}, which is given on the right of Figure 7.4b. Figures 7.5b and 7.5c depict the matrix view of the strong separator Π_{2→1} and the corresponding elimination tree, respectively. It is worth mentioning that in the matrix view, V2 precedes V1, since we use Π_{2→1} instead of Π_{1→2}. As seen in Figure 7.5c, since the permuted matrix (in Figure 7.5b) has an upper BBT postordering, all cross edges of the elimination tree are headed from left to right.

7.3.5 Vertex-separator-based method

A vertex separator is a strong separator restricted so that V1 ↮ V2. This method is based on refining the separator so that either V1 → V2 or V2 → V1. The vertex separator can be viewed as a 3-way vertex partition Π_VS = {V1, V2, S}. As in the previous method based on edge separators, we build two strong separators upon Π_VS and select the one having the smaller strong separator.

The criterion used to find an initial vertex separator can be formalized as

    Ψ_vs(Π_VS) = |S|.                                                        (7.4)

Algorithm 17 gives the overall process to obtain a strong separator. This algorithm follows Algorithm 16 closely. Here, the procedure that refines the initial vertex separator, namely RefineSeparator, works as follows. It visits the separator vertices once, and moves the vertex at hand, if possible, to V1 or V2, so that V_i → V_j holds after the movement, where i → j is specified as either 1 → 2 or 2 → 1.

Algorithm 17: FindStrongVertexSeparatorVS(G)
1  Π_VS ← GraphBipartiteVS(G)                    ▷ Π_VS = {V1, V2, S}
2  Π_{1→2} ← RefineSeparator(G, Π_VS, 1→2)
3  Π_{2→1} ← RefineSeparator(G, Π_VS, 2→1)
4  if |S_{1→2}| < |S_{2→1}| then
5      return Π_{1→2} = {V̄1, V̄2, S_{1→2}}       ▷ V̄1 → V̄2
6  else
7      return Π_{2→1} = {V̄1, V̄2, S_{2→1}}       ▷ V̄2 → V̄1

7.3.6 BT decomposition: PermuteBT

The problem of permuting A into a BT form can be cast as finding a feedback vertex set, which is defined as a subset of vertices whose removal makes a directed graph acyclic. In matrix view, a feedback vertex set corresponds to the border A_B∗ when the matrix is permuted into a BBT form (7.1) in which each A_ii, for i = 1, ..., K, is of unit size.

The problem of finding a feedback vertex set of size less than a given value is NP-complete [126], but fixed-parameter tractable [44]. In the literature, there are greedy algorithms that perform well in practice [152, 184]. In this work, we adopt the algorithm proposed by Levy and Low [152].

For example, in Figure 7.5, we observe that the sets {5} and {8} are used as the feedback vertex sets for V1 and V2, respectively, which is the reason those rows/columns are permuted last in their corresponding subblocks.

7.4 Experiments

We investigate the performance of the proposed heuristic, here referred to as BBT, in terms of the elimination tree height and the ordering time. We have three variants of BBT: (i) BBT-es, based on minimizing the edge cut (7.2); (ii) BBT-cn, based on minimizing the hyperedge cut (7.3); (iii) BBT-vs, based on minimizing the vertex separator (7.4). As a baseline, we used MeTiS [130] on the symmetrized matrix A + Aᵀ. MeTiS uses state of the art methods to order a symmetric matrix and leads to short elimination trees; its use in the unsymmetric case on A + Aᵀ is a common practice. BBT is implemented in C and compiled with mex of MATLAB, which uses gcc 4.4.5. We used the MeTiS library for graph partitioning according to the cutsize metrics (7.2) and (7.4). For the edge cut metric (7.2), we give MeTiS the graph of A + Aᵀ, where the weight of the edge {i, j} is 2 (both a_ij, a_ji ≠ 0) or 1 (either a_ij ≠ 0 or a_ji ≠ 0, but not both). For the vertex separator metric (7.4), we give the graph of A + Aᵀ (with no weights) to the corresponding MeTiS procedure. We implemented the cutsize metric given in (7.3) using PaToH [40] (in the default setting) for hypergraph partitioning with the column-net model (we did not investigate the use of the row-net model). We performed the experiments on an AMD Opteron Processor 8356 with 32 GB of RAM.
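The weighted symmetrization used for the edge cut metric can be sketched as follows (our illustrative SciPy code, not the experimental driver): the edge {i, j} of the graph of A + Aᵀ gets weight 2 when both a_ij and a_ji are nonzero, and weight 1 otherwise.

    import numpy as np
    from scipy.sparse import csr_matrix

    def weighted_symmetrized_graph(A):
        """Return W with W[i,j] = 2 if both a_ij and a_ji are nonzero (i != j),
        and 1 if exactly one of them is nonzero; the diagonal is dropped."""
        P = (A != 0).astype(np.int8)       # pattern of A
        W = P + P.T                        # 2 where both directions exist, 1 otherwise
        W = W.tolil()
        W.setdiag(0)
        W = W.tocsr()
        W.eliminate_zeros()
        return W

    A = csr_matrix(np.array([[1, 1, 0],
                             [1, 1, 0],
                             [0, 1, 1]]))
    print(weighted_symmetrized_graph(A).toarray())
    # [[0 2 0]
    #  [2 0 1]
    #  [0 1 0]]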

In the experiments, for each matrix we detected dense rows and columns, where a row/column is deemed dense if it has at least 10√n nonzeros. We applied MeTiS and BBT on the submatrix of non-dense rows/columns, and produced the final ordering by appending the dense rows/columns after the permutations obtained for the submatrices. The performance results are computed as the median of eleven runs.

Recall from Section 7.3.2 that whenever the size of the input matrix is smaller than the parameter τ, the recursion terminates in the BBT variants and a feedback-vertex-set-based BT decomposition is applied. We first investigate the effect of τ, in order to suggest a concrete approach. For this purpose, we build a dataset of matrices, called UFL below. The matrices in UFL are from the SuiteSparse Matrix Collection [59] and are chosen with the following properties: square, 5K ≤ n ≤ 100K, nnz ≤ 1M, real, and not of type Graph. These criteria enable an automatic selection of matrices from different application domains without having to specify the matrices individually. Other criteria could be used; ours help to select a large set of matrices, each of which is not too small and not too large to be handled easily within MATLAB. We exclude matrices recorded as Graph in the SuiteSparse collection, because most of these matrices have nonzeros from a small set of integers (for example, {−1, 1}) and are reducible. We further detect matrices with the same nonzero pattern in the UFL dataset and keep only one matrix per unique nonzero pattern. We then preprocess the matrices by applying MC64 [72], in order to permute the columns so that the permuted matrix has a maximum diagonal product. Then, we identify the irreducible blocks using the Dulmage-Mendelsohn decomposition [77], permute the matrices to block diagonal form, and delete all entries that lie in the off-diagonal blocks. This type of preprocessing is common in similar studies [83] and corresponds to preprocessing for numerics and reducibility. After this preprocessing, we discard the matrices whose total (remaining) number of nonzeros is less than twice their size, or whose pattern symmetry is greater than 90%, where the pattern symmetry score is defined as (nnz(A) − n)/(nnz(A + Aᵀ) − n) − 0.5. We ended up with 99 matrices in the UFL dataset.

In Table 7.1, we summarize the performance of the BBT variants normalized to that of MeTiS on the UFL dataset, for the recursion termination parameter τ ∈ {3, 50, 250}. The table presents the geometric means of the performance results over all matrices of the dataset.


           Tree Height (h(A))          Ordering Time
  τ        -es    -cn    -vs          -es    -cn    -vs
  3        0.81   0.69   0.68         1.68   3.53   2.58
  50       0.84   0.71   0.72         1.26   2.71   1.91
  250      0.93   0.82   0.86         1.12   2.12   1.43

Table 7.1: Statistical indicators of the performance of the BBT variants on the UFL dataset, normalized to that of MeTiS, with the recursion termination parameter τ ∈ {3, 50, 250}.

Figure 7.6: BBT-vs / MeTiS performance on the UFL dataset as a function of the pattern symmetry: (a) normalized tree height h(A), (b) normalized ordering time. Dashed green lines mark the geometric mean of the ratios.

As seen in the table, for smaller values of τ, all BBT variants achieve shorter elimination trees, at the cost of an increased processing time to order the matrix. This is because of the fast, but probably not of very high quality, local ordering heuristic; as discussed before, this heuristic does not directly consider the height of the subtree corresponding to the submatrix as an optimization goal. Other τ values could also be tried, but we find τ = 50 to be a suitable choice in general, as it performs close to τ = 3 in quality and close to τ = 250 in efficiency.

From Table 7.1, we observe that the BBT-vs method performs between the other two BBT variants, but close to BBT-cn, in terms of the tree height. All three variants improve the height of the tree with respect to MeTiS. Specifically, at τ = 50, BBT-es, BBT-cn, and BBT-vs obtain improvements of 16%, 29%, and 28%, respectively. The graph based methods BBT-es and BBT-vs run slower than MeTiS, as they incur the additional overheads of obtaining the covers and of refining the separators of A + Aᵀ for A. For example, at τ = 50, these overheads result in 26% and 91% longer ordering times for BBT-es and BBT-vs, respectively. As expected, the hypergraph based method BBT-cn is much slower than the others (for example, 171% longer ordering time with respect to MeTiS at τ = 50). Since we identify BBT-vs as fast and of high quality (it is close to the best BBT variant in terms of the tree height and to the fastest one in terms of the run time), we display its individual performance results in detail with τ = 50 in Figure 7.6. In Figures 7.6a and 7.6b, each individual point represents the ratio BBT-vs / MeTiS in terms of tree height and ordering time, respectively. As seen in Figure 7.6a, BBT-vs reduces the elimination tree height on 85 out of the 99 matrices in the UFL dataset, and achieves a reduction of 28% in terms of the geometric mean with respect to MeTiS (the green dashed line in the figure represents the geometric mean). We observe that the pattern symmetry and the reduction in the elimination tree height correlate strongly after a certain pattern symmetry score. More precisely, the correlation coefficient between the pattern symmetry and the normalized tree height is 0.6187 for those matrices with pattern symmetry greater than 20%. This is, of course, in agreement with the fact that LU factorization methods exploiting the unsymmetric pattern have much to gain on matrices with a lower pattern symmetry score, and less to gain otherwise, compared with the alternatives. On the other hand, as seen in Figure 7.6b, BBT-vs performs consistently slower than MeTiS, with an average slowdown of 91%.

The fill-in generally increases [83] when a matrix is reordered into a BBT form, with respect to the original ordering. We noticed that BBT-vs almost always increases the number of nonzeros in L (by 157% with respect to MeTiS in a set of matrices). However, the number of nonzeros in U almost always decreases (by 15% with respect to MeTiS in the same dataset). This observation may be exploited during Gaussian elimination, where L is not stored.

7.5 Summary, further notes, and references

We investigated the elimination tree model for unsymmetric matrices as a means of capturing dependencies among the tasks in the row-LU and column-LU factorization schemes. Specifically, we focused on permuting a given unsymmetric matrix to obtain an elimination tree with reduced height, in order to expose a higher degree of parallelism by minimizing the critical path length in parallel LU factorization schemes. Based on the theoretical findings, we proposed a heuristic which orders a given matrix to a bordered block diagonal form with a small border size and blocks of similar sizes, and then locally orders each block. We presented three variants of the proposed heuristic. These three variants achieved, on average, 24%, 31%, and 35% improvements with respect to the common method of using MeTiS (a state of the art tool used for Cholesky factorization) on a small set of matrices. On a larger set of matrices, the performance improvements were 16%, 28%, and 29%. The best performing variant is about 2.71 times slower than MeTiS on the larger set, while the others are slower by factors of 1.26 and 1.91.


Although numerical methods implementing the LU factorization withthe help of the standard elimination tree are well developed those imple-menting the LU factorization using the elimination discussed in this sectionare lacking A potential implementation will have to handle many schedul-ing issues and therefore the use of a runtime system seems adequate Onthe combinatorial side we believe that there is a need for a direct methodto find strong separators which would give better performance in terms ofthe quality and the run time

Chapter 8

Sparse tensor ordering

In this chapter, we investigate the tensor ordering problem described in Section 1.5. The formal problem definition is repeated below for convenience.

Tensor ordering: Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

We propose two ordering algorithms for this problem. The first proposed heuristic, BFS-MCS, is a breadth first search (BFS)-like approach based on the maximum cardinality search family; it works on the hypergraph representation of the given tensor. The second proposed heuristic, Lexi-Order, is an extension of the doubly lexical ordering of matrices to tensors. We show the effects of these schemes on the MTTKRP performance when the tensors are stored in three existing sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). The associated paper [156] has further contributions with respect to parallelization, and includes results on parallel implementations of MTTKRP with the three formats, as well as runs with Candecomp/Parafac decomposition algorithms.

8.1 Three sparse tensor storage formats

We consider three state-of-the-art tensor formats (COO, CSF, and HiCOO), which are all for general, unstructured sparse tensors.

Coordinate format (COO)

The coordinate (COO) format is the simplest, yet arguably the most popular, format. It stores each nonzero value along with all of its position indices, as shown in Figure 8.1a: i, j, k are the indices (inds) of the nonzeros, whose values are stored in the val array.



Figure 8.1: Three common formats for sparse tensors: (a) COO, (b) CSF, (c) HiCOO (the example is originally from [155]).

Compressed sparse fibers (CSF)

Compressed Sparse Fiber (CSF) is a hierarchical, fiber-centric format that effectively generalizes the CSR matrix format to tensors. An example of its representation appears in Figure 8.1b. Conceptually, CSF organizes the nonzero indices into a tree: each level corresponds to a tensor mode, and each nonzero is a path from the root to a leaf. The indices of CSF are stored in cinds, with the pointers stored in cptrs to indicate the locations of the nonzeros at the next level. Since cptrs needs to cover the range up to nnz, we use β_int or β_long integers accordingly for differently sized tensors.

Hierarchical coordinate format (HiCOO)

The Hierarchical Coordinate (HiCOO) [155] format derives from the COO format, but improves upon it by compressing the indices in units of sparse tensor blocks. HiCOO stores a sparse tensor in a sparsely blocked pattern with a pre-specified block size B, that is, in B × ··· × B blocks. It represents every block by compactly storing its nonzero triples using fewer bits. Figure 8.1c shows the example tensor with 2 × 2 × 2 blocks (B = 2). For a third-order tensor, bi, bj, bk are block indices indexing the tensor blocks, and ei, ej, ek are element indices indexing the nonzeros within a tensor block. A bptr array stores the pointers to each block's beginning location, and val stores all the nonzero values. The block ratio α_b of a HiCOO representation of a tensor is defined as the number of blocks divided by the number of nonzeros in the tensor, and c_b denotes the average slice size per tensor block. These two key parameters describe the effectiveness of storing a tensor in the HiCOO format.
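A minimal sketch of how the block ratio α_b can be computed from a COO list of nonzero indices for a given block size B (our own Python illustration; the HiCOO construction itself involves more bookkeeping):

    def block_ratio(nonzeros, B):
        """alpha_b = (number of distinct B x ... x B blocks containing a nonzero) / nnz.
        nonzeros: list of index tuples of a sparse tensor."""
        blocks = {tuple(i // B for i in idx) for idx in nonzeros}
        return len(blocks) / len(nonzeros)

    # The 8-nonzero example tensor of Figure 8.1 (0-based indices), with B = 2:
    coo = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 0, 2),
           (2, 1, 0), (2, 2, 2), (3, 0, 1), (3, 3, 2)]
    print(block_ratio(coo, 2))   # 4 blocks / 8 nonzeros = 0.5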

Comparing the three tensor formats: HiCOO and COO treat every mode equally and do not assume any mode order, so they preserve the mode-generic orientation [155].


Figure 8.2: A sparse tensor and the associated hypergraph: (a) tensor, (b) hypergraph.

CSF, on the other hand, has a strong mode-specificity, since the tree structure implies a mode ordering for the enumeration of the nonzeros. Besides, compared to the COO format, HiCOO and CSF save storage space and memory footprint, and generally achieve higher performance.

8.2 Problem definition and solutions

Our objective is to improve the performance of Mttkrp, and therefore of CPD algorithms, based on the three tensor storage formats described above. We achieve this objective by ordering (or relabeling) the indices in one or more modes of the input tensor, so as to improve the data locality in the tensor and in the factor matrices of an Mttkrp operation. Take, for example, two nonzeros (i2, j2, k1) and (i2, j2, k2) of the third-order tensor in Figure 8.2. Relabeling k1 and k2 to two indices close to each other will potentially improve the cache hits for both the tensor and the corresponding rows of the factor matrices in the Mttkrp operation. Besides, it will also influence the tensor storage for some formats (the analysis is given in Section 8.2.3).

We propose two ordering heuristics. The aim of the heuristics is to arrange the nonzeros close to each other in all modes. If we were to look at matrices, this would correspond to reordering the rows and columns so that all nonzeros are clustered around the diagonal. This way, nonzeros in a row or column would be close to each other, and any blocking (by imposing fixed sized blocks) would have nonzero blocks only around the diagonal. The proposed heuristics are based on these observations and try to obtain a similar behavior for tensors. The output of a reordering algorithm is a set of permutations, one per mode, used to relabel the tensor indices.

8.2.1 BFS-MCS

BFS-MCS is a breadth first search (BFS)-like heuristic approach based on the maximum cardinality search family [207]. We first construct the d-partite, d-uniform hypergraph for a sparse tensor, as described in Section 1.5. Recall that in this hypergraph, the vertices are the tensor indices in all modes, and the hyperedges represent its nonzero entries. Figure 8.2 shows the hypergraph associated with a sparse tensor; the vertices are blank circles, and the hyperedges are represented by grouping vertices.

For an Nth-order sparse tensor, we need to find permutations for N modes. We determine a permutation for a given mode n (perm_n) of a tensor X ∈ R^{I_1 × ··· × I_N} as follows. Suppose that some of the indices of mode n are already ordered. Then, BFS-MCS picks the next to-be-ordered index as the one with the strongest connection to the currently ordered index set. In the case of ties, it selects the index with the smallest number of nonzeros in the corresponding sub-tensor. Intuitively, a stronger connection represents more common indices in the modes other than n among an unordered vertex and the already ordered ones. This means that more data from the factor matrices can be reused in Mttkrp, if found in cache.

Algorithm 18: BFS-MCS ordering based on maximum cardinality search for a given mode
Data: An Nth-order sparse tensor X ∈ R^{I_1 × ··· × I_N}, hypergraph G = (V, E), mode n
Result: Permutation perm_n
1   w_p(i) ← 0, for i = 1, ..., I_n
2   w_s(i) ← −|X(..., i, ...)|                  ▷ number of nonzeros with i in mode n
3   Build a max-heap H_n for the mode-n indices, with w_p and w_s as the primary and secondary keys, respectively
4   m_V(i) ← 0, for i = 1, ..., I_n
5   for i_n = 1, ..., I_n do
6       v_n^(0) ← GetHeapMax(H_n)
7       perm_n(v_n^(0)) ← i_n
8       m_V(v_n^(0)) ← 1
9       for e_n ∈ hyperedges of v_n^(0) do
10          for v ∈ e_n such that v is not in mode n and m_V(v) = 0 do
11              m_V(v) ← 1
12              for e ∈ hyperedges of v with e ≠ e_n do
13                  v_n ← the vertex in mode n of e
14                  if inHeap(v_n) then
15                      w_p(v_n) ← w_p(v_n) + 1
16                      heapUpdateKey(H_n, w_p(v_n))
17  return perm_n

The process is implemented for all mode-n indices by maintaining a max-heap with two keys. The primary key of an index is the number of connections to the currently ordered indices. The secondary key is the number of nonzeros in the corresponding (N−1)th-order sub-tensor, e.g., X(..., i_n, ...), where smaller secondary key values signify higher priority. The secondary keys are static. Algorithm 18 gives the pseudocode of BFS-MCS. The heap H_n is initially constructed (Line 3) according to the secondary keys of the mode-n indices. For each mode-n index (e.g., i_n) of this max-heap, BFS-MCS traverses all connected tensor indices in the modes other than n, calculates the number of connections of the mode-n indices, and then updates the heap using the primary and secondary keys. Let m_V denote a size-|V| array that tracks whether a vertex has been visited; its entries are initialized to zero (unvisited) and changed to 1 once visited. We record a mode-n index v_n^(0), obtained from the max-heap H_n, in the permutation array perm_n. The algorithm visits all its hyperedges (i.e., all nonzeros with i_n; Line 9) and their connected and unvisited vertices from the other (N−1) modes (Line 10). For these vertices, it again visits their hyperedges (e in Line 12) and then checks whether the connected vertices in mode n (v_n) are in the heap (Line 14). If so, the primary key (connectivity) of v_n is increased by 1, and the max-heap H_n is updated. To summarize, this algorithm locates the sub-tensors of the neighbor vertices of v_n^(0) in order to increase the primary key values of the encountered mode-n indices in H_n.

The BFS-MCS heuristic has a low time complexity. We explain it using tensor terms. The innermost heap update operation (Line 15) costs O(log(I_n)). Each sub-tensor is visited only once, with the help of the marker array m_V. For all the sub-tensors in one tensor mode, the heap update is performed only for nnz nonzeros (hyperedges). Overall, the time complexity of BFS-MCS is O(N nnz log(I_n)) for an Nth-order tensor with nnz nonzeros, when computing perm_n for mode n.

BFS-MCS does not exactly capture the memory access pattern of an Mttkrp. It treats the contributions from the indices of all modes except n equally in the connectivity; thus, the connectivity might not match the actual data reuse. Also, it uses a greedy strategy to determine the next-level vertices, which could miss optimal global orderings, as is common for greedy heuristics for hard problems.

8.2.2 Lexi-Order

A lexical ordering of an integer vector is the standard dictionary ordering of its elements, defined as follows. Given two equal-length vectors x and y, we say x ≤ y iff either (i) all elements are the same, i.e., x = y, or (ii) there exists an index j such that x(j) < y(j) and x(i) = y(i) for all 0 ≤ i < j. Consider the two 0/1 vectors x = (1, 0, 1, 0) and y = (1, 1, 0, 0); we have x ≤ y, because x(1) < y(1) and x(i) = y(i) for all 0 ≤ i < 1.

Lubiw [167] associates a vector v of size I · J with an I × J matrix A, where the matrix indices (i, j) are first sorted by i + j and then by j. A doubly lexical ordering of A is an ordering of its rows and columns which makes v lexically the largest. Every real-valued matrix has a doubly lexical ordering, and such an ordering is not unique [167]. Doubly lexical ordering has a number of applications, including the efficient recognition of totally balanced, subtree, and plaid matrices, and of (strongly) chordal graphs. To the best of our knowledge, the merits of this ordering for high performance computations have not been investigated for sparse matrices.

Figure 8.3: A doubly lexical ordering of a 0/1 matrix: (a) vector comparison, (b) the original and the ordered matrix.

We propose an iterative algorithm called matLexiOrder for the doubly lexical ordering of matrices, where in an iteration either the rows or the columns are sorted lexically. Lubiw [167, Claim 2.2] shows that exchanging two rows which are lexically out of order improves the lexical ordering of the matrix. By appealing to this result, we see that the matLexiOrder algorithm finds a doubly lexical ordering in a finite number of iterations. This also fits well with another characterization of the doubly lexical ordering [167, 182]: a matrix is in doubly lexical ordering if both the row and the column vectors are in non-increasing lexical order. A row vector is read from left to right, and a column vector is read from top to bottom. Figure 8.3 shows a doubly lexical ordering of a sample matrix. As seen in the figure, the row vectors are in non-increasing lexical order, as are the column vectors.

The known doubly lexical ordering algorithms for matrices, by Lubiw [167] and by Paige and Tarjan [182], are direct ordering methods with a run time of O(nnz log(I + J) + J) and O(nnz + I + J) space for an I × J matrix with nnz nonzeros. We find the time complexity of these algorithms to be too high for our purpose. Furthermore, the data structures are too complex to allow an efficient, generalized implementation for tensors. Since we do not aim to obtain an exact doubly lexical ordering (a close-by ordering will likely suffice to improve the Mttkrp performance), our approach is faster, while being simpler to implement.

We first describe the matLexiOrder algorithm for a sparse matrix. This aims to show the efficiency of our iterative approach compared to the others. A partition refinement technique is used to order the rows and columns alternately. Given an ordering of the rows, the columns can be sorted lexically. This is achieved by an order preserving variant of the partition refinement method [182], which is called orderlyRefine. We briefly explain this partition refinement technique for ordering the columns of a matrix. Given an I × J matrix A, all columns are initially put into a single part. Then, A's nonzeros are visited row by row. At a row i, each column part C is split into two parts, C1 = C ∩ a(i, :) and C2 = C \ a(i, :), and these two parts replace C in the order C1, C2 (empty sets are discarded). Note that this algorithm keeps the parts in a particular order, which generates an ordering of the columns. orderlyRefine is used to refine all parts that have at least one nonzero in row i in O(|a(i, :)|) time, where |a(i, :)| is the number of nonzeros of row i. Overall, matLexiOrder takes linear time, O(nnz + I + J) (for ordering both rows and columns), per iteration, and O(J) space. We also observe that only a small number of iterations is enough (as will be shown in Section 8.3.5 for tensors), yielding a more storage-efficient algorithm compared to the prior doubly lexical ordering methods [167, 182]. matLexiOrder, in particular the use of the orderlyRefine routine a few times, is sufficient for our needs.
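The order preserving refinement step can be sketched as follows; this is our own compact Python rendering of the column refinement described above (operating on rows given as lists of column indices), not the implementation evaluated in the experiments. For clarity, this sketch scans whole parts per row, rather than achieving the O(|a(i, :)|) bound.

    def orderly_refine(rows, num_cols):
        """Order preserving partition refinement of the columns.
        rows: list of iterables, rows[i] holds the column indices of row i's nonzeros.
        Returns the columns in their refined (lexical) order."""
        parts = [list(range(num_cols))]               # all columns start in a single part
        for r in rows:
            r = set(r)
            new_parts = []
            for part in parts:
                c1 = [c for c in part if c in r]      # columns with a nonzero in this row
                c2 = [c for c in part if c not in r]  # the rest
                for piece in (c1, c2):
                    if piece:
                        new_parts.append(piece)       # C is replaced by C1, C2 in order
            parts = new_parts
        return [c for part in parts for c in part]

    # Columns of the 4 x 4 matrix of Figure 8.3, rows given by their nonzero columns:
    print(orderly_refine([[1, 3], [1], [2], [3]], 4))   # [1, 3, 2, 0]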

Algorithm 19: Lexi-Order for a given mode
Data: An Nth-order sparse tensor X ∈ R^{I_1 × ··· × I_N}, mode n
Result: Permutation perm_n
    ▷ Sort all nonzeros along all but mode n
1   quickSort(X, coordCmp)
    ▷ Matricize X into X_(n)
2   r ← compose(inds([−n]), 1)
3   for m = 1, ..., nnz do
4       c ← inds(n, m)                              ▷ column index of X_(n)
5       if coordCmp(X, m, m − 1) = 1 then
6           r ← compose(inds([−n]), m)              ▷ row index of X_(n)
7       X_(n)(r, c) ← val(m)
    ▷ Apply partition refinement to the columns of X_(n)
8   perm_n ← orderlyRefine(X_(n))
9   return perm_n

    ▷ Lexical comparison of two nonzeros of X on all modes except n
10  Function coordCmp(X, m1, m2)
11      for n′ = 1, ..., N with n′ ≠ n do
12          if inds(n′, m1) < inds(n′, m2) then
13              return −1
14          if inds(n′, m1) > inds(n′, m2) then
15              return 1
16      return 0

To order tensors, we propose the Lexi-Order function as an extension of matLexiOrder. The basic idea of Lexi-Order is to determine the permutation of each tensor mode independently, while considering the order in the other modes fixed. Lexi-Order sets the indices of the mode to be ordered as the columns of a matrix, the other indices as the rows, and sorts the columns as described for matrices (with the order preserving partition refinement method). The precise algorithm appears in Algorithm 19, which we also illustrate in Figure 8.4 for mode 1. Given a mode n, Lexi-Order first builds a matricized tensor in the Compressed Sparse Row (CSR) sparse matrix format, by a call to quickSort with the comparison function coordCmp, and then by partitioning the nonzeros into row segments (Lines 3–7). This comparison function coordCmp performs a lexical comparison of all-but-mode-n indices, which enables an efficient matricization. In other words, sorting the tensor X is the same as building the matricized tensor X_(n) by rows in the fixed lexical ordering, where mode n is the column dimension and the remaining modes constitute the rows. In Figure 8.4, the sorting step orders the COO entries by (j, k) tuples, which then serve as the row indices of the matricized CSR representation of X_(1). Once the matrix is built, we construct 0/1 row vectors (as in Figure 8.4) to illustrate its nonzero distribution, on which the orderlyRefine function can be called seamlessly. Apart from the quickSort, the other parts of Lexi-Order have linear time complexity (linear in terms of the tensor storage). We use OpenMP tasks to parallelize quickSort in order to accelerate Lexi-Order.
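A compact end-to-end sketch of one Lexi-Order pass for a single mode, under the same assumptions as the previous sketches (plain Python on a COO list of index tuples, written by us for illustration only): the nonzeros are sorted by their all-but-mode-n indices, grouped into rows of the matricized tensor, and the columns (mode-n indices) are then refined in an order preserving way. For the small tensor of Figure 8.4, this reproduces the mode-1 permutation (1, 2, 3, 0) shown there.

    from itertools import groupby

    def lexi_order_pass(nonzeros, dim_n, n):
        """One Lexi-Order pass: returns perm with perm[old_index] = new_index for mode n."""
        key = lambda idx: tuple(x for m, x in enumerate(idx) if m != n)
        snz = sorted(nonzeros, key=key)                     # quickSort step
        rows = [set(idx[n] for idx in grp) for _, grp in groupby(snz, key=key)]
        parts = [list(range(dim_n))]                        # orderlyRefine on the columns
        for r in rows:
            refined = []
            for part in parts:
                c1 = [c for c in part if c in r]
                c2 = [c for c in part if c not in r]
                refined += [p for p in (c1, c2) if p]
            parts = refined
        order = [c for part in parts for c in part]
        return {old: new for new, old in enumerate(order)}

    coo = [(0, 0, 0), (0, 1, 1), (0, 1, 2), (1, 2, 2),
           (1, 3, 2), (2, 3, 2), (3, 0, 0), (3, 1, 0)]
    print(lexi_order_pass(coo, 4, 0))   # {3: 0, 0: 1, 1: 2, 2: 3}, i.e., perm = (1, 2, 3, 0)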

Figure 8.4: The steps of Lexi-Order illustrated for mode 1.

Like the BFS-MCS approach, Lexi-Order finds the permutations for the N modes of an Nth-order sparse tensor. Figure 8.5(a) illustrates the effect of Lexi-Order on an example 4 × 4 × 3 sparse tensor. The original tensor is converted to a HiCOO representation with block size B = 2, consisting of 5 blocks with at most 2 nonzeros per block. After reordering with Lexi-Order, the new HiCOO has 3 nonzero blocks with up to 4 nonzeros per block. Thus, the blocks are denser, which should exhibit better locality behavior. However, this reordering scheme is a heuristic. For example, consider the tensor of Figure 8.1; we draw another HiCOO representation of it after reordering in Figure 8.5(b). Applying Lexi-Order yields a reordered HiCOO representation with 4 blocks, which is the same as with the input ordering, although the maximum number of nonzeros per block increases to 4. For this tensor, Lexi-Order may not show a big advantage.

8.2.3 Analysis

We consider the Mttkrp operation to analyze the reordering behavior for the COO, CSF, and HiCOO formats. Recall that, for a third-order sparse tensor X, Mttkrp multiplies each nonzero entry xijk with the R-vector formed by the entry-wise product of the jth row of B and the kth row of C when computing M_A.

[Figure 8.5: Comparison of HiCOO representations before and after Lexi-Order. Each example shows the original COO and HiCOO representations, the reordered COO and HiCOO representations, and the permutations of the i, j, k indices. (a) A good example; (b) a fair example.]

The arithmetic intensity of Mttkrp algorithms on these formats is approximately 1/4 [155]. Thus, Mttkrp can be considered memory-bound on most computer architectures, especially CPU platforms.
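For reference, a minimal COO Mttkrp for the first mode of a third-order tensor can be sketched as follows (with assumed array arguments; this is not the ParTI implementation). It makes the memory-bound nature visible: each nonzero performs about 3R multiply-add flops while streaming one row of B, one row of C, and one row of the output M_A.

/* Mode-1 MTTKRP on a COO tensor: M_A(i,:) += x_ijk * (B(j,:) .* C(k,:)).
 * B is J x R, C is K x R, MA is I x R; all are dense and row-major.        */
void mttkrp_mode1_coo(long nnz, const int *i, const int *j, const int *k,
                      const double *val, const double *B, const double *C,
                      double *MA, int R)
{
    for (long q = 0; q < nnz; q++) {
        const double *Bj = B  + (long)j[q] * R;
        const double *Ck = C  + (long)k[q] * R;
        double       *Mi = MA + (long)i[q] * R;
        for (int r = 0; r < R; r++)
            Mi[r] += val[q] * Bj[r] * Ck[r];   /* 3 flops per r, per nonzero */
    }
}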

In HiCOO, smaller block ratios (αb) and larger average slice sizes per tensor block (cb) are favorable for good Mttkrp performance. Our reordering algorithms tend to pack nonzeros closely together and obtain denser blocks (larger cb), which potentially generates fewer tensor blocks (smaller αb); thus HiCOO-Mttkrp performs fewer memory accesses, enjoys more cache hits, and achieves better performance. Moreover, a smaller αb also reduces the tensor memory requirement as a side benefit [155]. That is, for the HiCOO format, a reordering scheme should increase the block density, reduce the number of blocks, and increase cache locality—three related performance metrics. Reordering is therefore more beneficial for the HiCOO format than for the COO and CSF formats.

COO stores all nonzero indices, so relabeling does not make a difference in its storage. The same is true for CSF: relabeling does not change the CSF's tree structure, although the order of nodes may change. For an Mttkrp with tensors stored in these two formats, the performance gain from reordering therefore comes only from the improved data locality. Although a theoretical analysis is hard for these formats, the potentially better data locality can still bring performance advantages, albeit smaller than HiCOO's.


Tensors     Order  Dimensions                Nnzs    Density
vast          3    165K × 11K × 2             26M    6.9 × 10^-3
nell2         3    12K × 9K × 29K             77M    2.4 × 10^-5
choa          3    712K × 10K × 767           27M    5.0 × 10^-6
darpa         3    22K × 22K × 24M            28M    2.4 × 10^-9
fb-m          3    23M × 23M × 166           100M    1.1 × 10^-9
fb-s          3    39M × 39M × 532           140M    1.7 × 10^-10
flickr        3    320K × 28M × 2M           113M    7.8 × 10^-12
deli          3    533K × 17M × 3M           140M    6.1 × 10^-12
nell1         3    29M × 21M × 25M           144M    9.1 × 10^-13
crime         4    6K × 24 × 77 × 32           5M    1.5 × 10^-2
uber          4    183 × 24 × 1140 × 1717      3M    3.9 × 10^-4
nips          4    2K × 3K × 14K × 17          3M    1.8 × 10^-6
enron         4    6K × 6K × 244K × 1K        54M    5.5 × 10^-9
flickr4d      4    320K × 28M × 2M × 731     113M    1.1 × 10^-14
deli4d        4    533K × 17M × 3M × 1K      140M    4.3 × 10^-15

Table 8.1: Description of sparse tensors.

8.3 Experiments

Platform. We perform experiments on a Linux-based Intel Xeon E5-2698 v3 multicore server with 32 physical cores distributed over two sockets, running at 2.3 GHz. We only present results for sequential runs (see the paper [156] for parallel runs). The processor microarchitecture is Haswell, with a 32 KiB L1 data cache, and the machine has 128 GiB of memory. The code artifact was written in the C language and was compiled using icc 18.0.1.

Dataset. We use sparse tensors derived from real-world applications, listed in Table 8.1 in order of decreasing nonzero density, separately for third- and fourth-order tensors. Apart from choa [188], all tensors are publicly available from FROSTT [203] and HaTen2 [121].

Configurations. We report the results under the best configurations of the following parameters for the highest Mttkrp performance with the three formats: (i) the number of reordering iterations and the block size B for the HiCOO format; (ii) tiling or not for CSF. For HiCOO, B = 128 achieves the best results in most cases, and we use five reordering iterations, which is analyzed in Section 8.3.5. All experiments use an approximate rank of R = 16. For every tensor, we use the total execution time of the Mttkrps in all modes to compute the speedup, which is the ratio of the total Mttkrp time on a randomly reordered tensor to that obtained with a specific reordering scheme. All times are averaged over five runs.

8.3.1 COO-Mttkrp with reordering

We show the effect of the two reordering approaches on the sequential COO-Mttkrp from ParTI [154] in Figure 8.6.

[Figure 8.6: Reordered COO-Mttkrp speedup over a random reordering implementation, for Lexi-Order and BFS-MCS on the third- and fourth-order tensors.]

This COO-Mttkrp is implemented in C, using the same algorithm as the Tensor Toolbox [18]. For any reordering approach, after applying BFS-MCS, Lexi-Order, or a random reordering to the input tensor, we still sort the tensor in the mode order 1, . . . , N. Observe that Lexi-Order improves the sequential COO-Mttkrp performance by 1.00–4.29× (1.79× on average), while BFS-MCS obtains 0.95–1.27× (1.10× on average). Note that Lexi-Order improves the performance of COO-Mttkrp for all tensors. We conclude that this ordering is always helpful for COO-Mttkrp, although the improvements are smaller than those for HiCOO-Mttkrp.

8.3.2 CSF-Mttkrp with reordering

We show the effect of the two reordering approaches on the sequential CSF-Mttkrp from Splatt v1.1.1 [205] in Figure 8.7. CSF-Mttkrp is set to use all CSF representations (ALLMODE) for the Mttkrps in all modes, with the tiling option on. Lexi-Order improves the sequential CSF-Mttkrp performance by 0.65–2.33× (1.50× on average); BFS-MCS improves it by 1.00–1.86× (1.22× on average). Both ordering approaches improve the performance of CSF-Mttkrp on average. While BFS-MCS is always helpful in the sequential case, Lexi-Order is unhelpful on only one tensor, crime. The improvements achieved by the two reordering approaches for CSF are smaller than those for the HiCOO and COO formats. We conclude that both reordering methods are helpful for CSF-Mttkrp, but to a lesser extent than for HiCOO- and COO-based Mttkrp.

8.3.3 HiCOO-Mttkrp with reordering

Figure 8.8 shows the speedup of the proposed reordering methods on the sequential HiCOO-Mttkrp. Lexi-Order reordering obtains 0.99–4.14× speedup (2.12× on average), while BFS-MCS reordering obtains 0.99–1.88× speedup (1.34× on average). Lexi-Order and BFS-MCS do not behave as well on fourth-order tensors as on third-order tensors.

[Figure 8.7: Reordered CSF-Mttkrp speedup over a random reordering implementation.]

[Figure 8.8: Reordered HiCOO-Mttkrp speedup over a random ordering implementation.]

Tensor flickr4d is constructed from the same data as flickr, with an extra short mode (refer to Table 8.1). Lexi-Order obtains a 4.14× speedup on flickr but only a 3.02× speedup on flickr4d. The same phenomenon is observed on the tensors deli and deli4d. This indicates that it is harder to obtain good data locality on higher-order tensors, which is corroborated by Table 8.2.

We investigate the two critical parameters of HiCOO [155]: the block ratio (αb) and the average slice size per tensor block (cb). A smaller αb and a larger cb are favorable for good HiCOO-Mttkrp performance. Table 8.2 lists the parameter values for all tensors before and after Lexi-Order, the HiCOO-Mttkrp speedup (as shown in Figure 8.8), and the ratio of the HiCOO storage after Lexi-Order to that with a random ordering. Generally, when both parameters αb and cb improve substantially, we see a good speedup and storage ratio with Lexi-Order. For the same data in different orders, e.g., flickr4d and flickr, the αb and cb values after Lexi-Order are better in the smaller-order version. This observation supports the intuition that getting good data locality is harder for higher-order tensors.

8.3.4 Reordering methods comparison

As seen above, Lexi-Order improves performance more than BFS-MCS in most cases.


Tensors     Random reordering     Lexi-Order          Speedup         Storage
            αb       cb           αb       cb         seq     omp     ratio
vast        0.004    1.758        0.004    1.562      1.01    1.03    0.999
nell2       0.020    0.314        0.008    0.074      1.04    1.04    0.966
choa        0.089    0.057        0.016    0.056      1.16    1.07    0.833
darpa       0.796    0.009        0.018    0.113      3.74    1.50    0.322
fb-m        0.985    0.008        0.086    0.021      3.84    1.21    0.335
fb-s        0.982    0.008        0.099    0.020      3.90    1.23    0.336
flickr      0.999    0.008        0.097    0.025      4.14    3.66    0.277
deli        0.988    0.008        0.501    0.010      2.24    0.83    0.634
nell1       0.998    0.008        0.744    0.009      1.70    0.70    0.812
crime       0.001    37.702       0.001    8.978      0.99    1.42    1.000
uber        0.041    0.469        0.011    0.270      1.00    0.78    0.838
nips        0.016    0.434        0.004    0.435      1.03    1.34    0.921
enron       0.290    0.017        0.045    0.030      1.25    1.36    0.573
flickr4d    0.999    0.008        0.148    0.020      3.02    11.81   0.214
deli4d      0.998    0.008        0.596    0.010      1.76    1.26    0.697

Table 8.2: HiCOO parameters before and after Lexi-Order reordering.

Compared to the reordering method used in Splatt [206], with ALLMODE set for CSF-Mttkrp (identical to that work), BFS-MCS obtains speedups of 1.04×, 1.64×, and 1.61× on the tensors nell2, nell1, and deli, respectively, while Lexi-Order obtains 1.04×, 1.70×, and 2.24×. By contrast, the speedups using graph partitioning [206] on these three tensors are 1.06×, 1.11×, and 1.19×, and those using hypergraph partitioning [206] are 1.06×, 1.12×, and 1.24×, respectively. Our BFS-MCS and Lexi-Order schemes thus both outperform graph and hypergraph partitioning [206].

The available methods in the state of the art are based on graph and hypergraph partitioning. Partitioning is a successful approach when the number of partitions is known, whereas the number of blocks in HiCOO is not known ahead of time. In our case, the partitions should also be ordered for better cache reuse. Additionally, partitioners are less effective for tensors than usual, as some dimensions can be very short (creating vertices of very high degree). That is why the proposed ordering-based methods deliver better results.

8.3.5 Effect of the number of iterations in Lexi-Order

Since Lexi-Order is iterative, we evaluate the effect of the number of iterations on HiCOO-Mttkrp performance. The results appear in Figure 8.9(a), normalized to the run time with 10 iterations. The Mttkrp time on most tensors does not vary much with different numbers of iterations, except for vast, nell2, uber, and nips. We use 5 iterations to obtain Mttkrp performance similar to that of 10 iterations with about half of the overhead (shown in Figure 8.9(b)). However, 3 or fewer iterations already give acceptable performance when users care about the pre-processing time.

[Figure 8.9: The performance and overhead of different numbers of iterations (1, 2, 3, 5, 10): (a) sequential HiCOO-Mttkrp behavior; (b) the reordering overhead.]

8.4 Summary, further notes, and references

Plenty of recent research has studied the optimization of tensor algorithms [20, 28, 46, 159, 168]. Our work sets itself apart by focusing on reordering to obtain a better nonzero structure. Various reordering methods have been considered for matrix operations [7, 173, 190, 215].

The two proposed heuristics aim to arrange the nonzeros of a given tensor close to each other in all modes. As said before, in the matrix case this corresponds to reordering the rows and columns so that all nonzeros are clustered around the diagonal. Heuristics based on partitioning and ordering have been used in the matrix case for arranging most nonzeros around the diagonal or in diagonal blocks. An obvious alternative in the matrix case is the BFS-based reverse Cuthill-McKee (RCM) heuristic. Given a pattern-symmetric matrix A, RCM finds a permutation (ordering) such that the symmetrically permuted matrix PAP^T has its nonzeros clustered around the diagonal. More precisely, RCM tries to reduce the bandwidth, defined as max{j − i : a_{ij} ≠ 0 and j > i}. The RCM heuristic can be applied to pattern-wise unsymmetric, even rectangular, matrices as well. A common approach is to create a (2 × 2)-block matrix for a given m × n matrix A as

B = \begin{pmatrix} I_m & A \\ A^T & I_n \end{pmatrix},

where I_m and I_n are the identity matrices of size m × m and n × n, respectively. Since B is pattern-symmetric, RCM can be run on it to obtain an (m+n) × (m+n) permutation matrix P.


             min     max      geomean
BFS-MCS      0.84    53.56    2.90
Lexi-Order   0.15    4.33     0.99

Table 8.3: Statistical indicators of the number of nonempty blocks obtained by BFS-MCS and Lexi-Order with respect to that of RCM on 774 matrices from the UFL collection.

The row and column permutations for A can then be obtained from P by respecting the relative order of the first m rows and that of the last n rows of B. This approach can be extended to tensors. We discuss the third-order case for simplicity. Let X be an I × J × K tensor. We create a matrix B in such a way that the first I rows/columns correspond to the first-mode indices, the following J rows/columns correspond to the second-mode indices, and the last K rows/columns correspond to the third-mode indices. Then, for each nonzero x_ijk of X, we set b_rc = 1 whenever r and c correspond to any two indices from the set {i, j, k}. This way, B is a symmetric matrix with ones on the main diagonal. Observe that when X is two dimensional (that is, a matrix), we obtain the matrix B above.
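A sketch of this construction for a third-order tensor is given below; the function and argument names are illustrative. It emits, in coordinate form, the off-diagonal pattern of B in both directions (r, c) and (c, r); duplicates would be removed before running RCM, and the unit diagonal is left implicit.

#include <stdio.h>

/* Print one symmetric pair of the pattern of B. */
static void emit_pair(int r, int c) { printf("%d %d\n", r, c); printf("%d %d\n", c, r); }

/* Rows/columns 0..I-1 are mode-1 indices, I..I+J-1 mode-2, I+J..I+J+K-1 mode-3.
 * Every nonzero x_ijk couples all pairs among {i, I+j, I+J+k}.                  */
void tensor_to_symmetric_pattern(long nnz, const int *i, const int *j,
                                 const int *k, int I, int J)
{
    for (long q = 0; q < nnz; q++) {
        int a = i[q], b = I + j[q], c = I + J + k[q];
        emit_pair(a, b);   /* couple mode-1 and mode-2 indices */
        emit_pair(a, c);   /* couple mode-1 and mode-3 indices */
        emit_pair(b, c);   /* couple mode-2 and mode-3 indices */
    }
}

We now compare the proposed reordering heuristics with RCM.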

We first compare the three methods, RCM, BFS-MCS, and Lexi-Order, on matrices from the UFL collection [59]. We tested these three heuristics on all pattern-wise unsymmetric matrices having more than 1000 and less than 5000000 rows, at least three nonzeros per row and column, less than 10000000 nonzeros, and less than 100 · √(m × n) nonzeros for a matrix with m rows and n columns. At the time of experimentation, there were 774 matrices satisfying these properties. We ordered these matrices with RCM, BFS-MCS, and Lexi-Order (5 iterations) and counted the number of nonempty blocks with a block size of 128 (in each dimension). We then normalized the results of BFS-MCS and Lexi-Order with respect to those of RCM. We give the statistical indicators of these normalized results in Table 8.3. As seen in the table, RCM is better than BFS-MCS, and Lexi-Order performs as well as RCM in reducing the number of blocks.

We next compare the three methods on 11 sparse tensors from the FROSTT [203] and HaTen2 [121] repositories, again with a block size of 128 in each dimension. Table 8.4 shows the number of nonempty blocks obtained with these methods. In this table, the number of blocks obtained by RCM is given in absolute terms, and those obtained by BFS-MCS and Lexi-Order are given relative to those of RCM. As seen in the table, RCM is never the best, and Lexi-Order is better than BFS-MCS in nine out of eleven cases. Overall, Lexi-Order and BFS-MCS obtain results whose geometric means are 0.48 and 0.89, respectively, of that of RCM, marking both as more effective than RCM. This is interesting, as we have seen that RCM and Lexi-Order have (nearly) the same performance in the matrix case.
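The block-counting metric used in Tables 8.3 and 8.4 can be computed directly from the COO representation, as in the following sketch (illustrative names; a hash set would serve equally well as the sort): each nonzero (i, j, k) is mapped to its block (⌊i/128⌋, ⌊j/128⌋, ⌊k/128⌋), and the number of distinct blocks is counted.

#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Count distinct 128x128x128 blocks of a third-order COO tensor.
 * Assumes each block coordinate fits in 21 bits (true for the tensors here). */
long count_nonempty_blocks(long nnz, const int *i, const int *j, const int *k)
{
    const int B = 128;
    uint64_t *keys = malloc(nnz * sizeof *keys);
    for (long q = 0; q < nnz; q++)
        keys[q] = ((uint64_t)(i[q] / B) << 42) |
                  ((uint64_t)(j[q] / B) << 21) |
                   (uint64_t)(k[q] / B);
    qsort(keys, nnz, sizeof *keys, cmp_u64);
    long nblocks = (nnz > 0);
    for (long q = 1; q < nnz; q++)
        if (keys[q] != keys[q - 1]) nblocks++;
    free(keys);
    return nblocks;
}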


                    RCM        BFS-MCS   Lexi-Order
lbnl-network           67208     1.11      0.68
nips                   25992     1.56      0.45
vast                  113757     1.00      0.95
nell-2                565615     0.92      1.05
flickr              28648054     0.41      0.38
flickr4d            32913028     0.61      0.51
deli                70691535     0.97      0.91
deli4d              94608531     0.81      0.88
darpa                2753737     1.65      0.19
fb-m                53184625     0.66      0.16
fb-s                71308285     0.80      0.19
geo. mean                        0.89      0.48

Table 8.4: Number of nonempty blocks obtained by RCM, BFS-MCS, and Lexi-Order on sparse tensors. The results of BFS-MCS and Lexi-Order are normalized with respect to those of RCM; for each tensor, the best result is the smaller of the two normalized values.

          random   Lexi-Order   speedup
vast        202       160        1.26
nell-2      504       433        1.16
choa        178       143        1.25
darpa       318       148        2.15
fb-m       1716       695        2.47
fb-s       2503       973        2.57
flickr     1029       582        1.77
deli       1405      1080        1.30
nell1      1845      1486        1.24

Table 8.5: The run time of TTM (for all modes) using a random ordering and Lexi-Order, and the speedup obtained with Lexi-Order, for sequential execution.


The proposed ordering methods are applicable to some other sparse tensor operations without any change. Among those operations, the tensor-times-matrix multiplication (TTM) used in computing the Tucker decomposition with higher-order orthogonal iterations [61], and the tensor-times-vector multiplications used in the higher-order power method [60], are well known. The main characteristic of these operations is that the memory access pattern of the algorithm depends on the locality of the tensor indices in each dimension. If not all dimensions have the same characteristics, a variant of the proposed methods in which only certain dimensions are ordered could potentially be used. We showcase some results for the TTM case in Table 8.5 on some of the tensors. The results are obtained on a machine with an Intel Xeon E5-2650 v4 CPU. As seen in the table, using Lexi-Order always improves the sequential run time of TTM.
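As an example of such an operation, a mode-3 tensor-times-vector product on a COO tensor can be sketched as follows (illustrative signature; the output Y of size I × J is assumed to be zero-initialized). Each nonzero reads one entry of the input vector and updates one entry of Y indexed by the remaining modes, so clustering nonzeros that are close in all modes improves reuse on both sides.

/* Mode-3 tensor-times-vector: Y(i,j) = sum_k x_ijk * v(k), with Y dense, row-major. */
void ttv_mode3_coo(long nnz, const int *i, const int *j, const int *k,
                   const double *val, const double *v, double *Y, int J)
{
    for (long q = 0; q < nnz; q++)
        Y[(long)i[q] * J + j[q]] += val[q] * v[k[q]];
}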

Recent work [46] analyzes the memory access pattern of the MTTKRP operation and proposes highly effective low-level code optimizations. The proposed approach reorganizes the code with register blocking and applies nonzero blocking in an attempt to fit rows of the factor matrices into the cache. The gain in performance is remarkable. Our ordering methods are orthogonal to these and similar optimizations. For the best performance, one should first order the tensor using the approaches proposed in this chapter and then further optimize the code using the low-level code optimizations.


Part IV

Closing


Chapter 9

Concluding remarks

We have presented a selection of partitioning, matching, and ordering problems in combinatorial scientific computing.

Partitioning problems. These problems arise in task decomposition for parallel computing, where load balance and low communication cost are the two objectives. Depending on the task definition and the interaction of the tasks, the partitioning problems use different combinatorial models. The selected problems (Chapters 2 and 3) addressed acyclic partitioning of directed acyclic graphs and partitioning of hypergraphs. While our contributions for the former problem concerned combinatorial tools for the desired partitioning objectives and constraints, those for the latter concerned the use of hypergraph models and associated partitioning tools for efficient tensor decomposition in the distributed-memory setting.

Matching problems. These problems arise in settings where agents compete for exclusive access to resources. The most common settings include identifying nonzero diagonals in sparse matrices, and routing and scheduling in data centers. A related concept is to decompose a doubly stochastic matrix as a convex combination of permutation matrices (the famous Birkhoff-von Neumann decomposition), where each permutation matrix is a perfect matching in the bipartite graph representation of the matrix. We developed approximation algorithms for matchings in graphs (Chapter 4) and effective heuristics for finding matchings in hypergraphs (Chapter 6). We also investigated (Chapter 5) the problem of finding Birkhoff-von Neumann decompositions with a small number of permutation matrices and presented complexity results and theoretical insights into the decomposition problem. On the way, we generalized the decomposition to general real matrices (not only doubly stochastic ones) having total support.



Ordering problems. These problems arise when one wants to permute sparse matrices and tensors into desirable forms. The sought forms depend, of course, on the targeted applications. By focusing on a recent LU decomposition method (or, rather, a method taking advantage of sparsity in a novel way), we developed heuristics to permute sparse matrices into a bordered block triangular form with the aim of reducing the height of the resulting elimination tree (Chapter 7). We also investigated the sparse tensor ordering problem, where we sought to cluster nonzeros around the (hyper)diagonal of a sparse tensor (Chapter 8) in order to improve the performance of certain tensor operations. To be able to address the tensor ordering problem, we proposed novel algorithms for obtaining doubly lexical orderings of sparse matrices that are simpler to implement than the known algorithms.

At the end of Chapters 2–8, we mentioned some immediate future work. Below, we state some further future work that requires considerable effort and creativity.

Most of the presented combinatorial algorithms come with applications; others are made applicable by practical and efficient implementations. Further developments on the applications' side are required for these to be used in practice. In particular, the use of the acyclic partitioner (the problem of Chapter 2) in a parallel system with the help of a runtime system requires the whole graph to be available. This is usually not the case; an auto-tuning-like approach, in which one first creates a task graph by a cold run and then partitions this graph to determine a schedule, can prove useful. In a similar vein, we mentioned the lack of proper software implementing the LU factorization method using elimination trees at the end of Chapter 7. Such an implementation will give rise to a number of combinatorial problems (e.g., the peak memory requirement in factoring a BBT-ordered matrix with elimination trees) and seems to be a new playing field.

The original Karp-Sipser heuristic for the cardinality matching problem applies degree-1 and degree-2 reduction rules. The first rule is well implemented in practical settings (see Chapters 4 and 6). On the other hand, fast implementations of the degree-2 reduction rule for undirected graphs are sought. A straightforward implementation of this rule can result in a total time of Ω(n²) in the worst case for a graph with n vertices. This is obviously too much to afford, in theory, for general sparse graphs with O(n) or O(n log n) edges. Are there worst-case linear time algorithms for implementing this rule in the context of the Karp-Sipser heuristic?

The Birkhoff-von Neumann decomposition (Chapter 5) has well-founded applications in routing. Our use of this decomposition in solving linear systems is a first attempt at making it useful for numerical algorithms. In the proposed approach, the cost of constructing a preconditioner with a few terms from a BvN decomposition is equivalent to solving a few maximum bottleneck matching problems, which can be costly in practical settings. Can we use a relaxed definition of the BvN decomposition in which sub-permutation matrices are allowed (such decompositions are used in routing problems) and make the related preconditioner more practical while retaining its numerical properties?


Bibliography

[1] IBM ILOG CPLEX Optimizer httpwww-01ibmcomsoftwareintegrationoptimizationcplex-optimizer 2010 [Cited on pp96 121]

[2] Evrim Acar Daniel M Dunlavy and Tamara G Kolda A scalable op-timization approach for fitting canonical tensor decompositions Jour-nal of Chemometrics 25(2)67ndash86 2011 [Cited on pp 11 37]

[3] Seher Acer Tugba Torun and Cevdet Aykanat Improving medium-grain partitioning for scalable sparse tensor decomposition IEEETransactions on Parallel and Distributed Systems 29(12)2814ndash2825Dec 2018 [Cited on pp 52]

[4] Kunal Agrawal Jeremy T Fineman Jordan Krage Charles E Leis-erson and Sivan Toledo Cache-conscious scheduling of streaming ap-plications In Proc Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 236ndash245 New YorkNY USA 2012 ACM [Cited on pp 4]

[5] Emmanuel Agullo Alfredo Buttari Abdou Guermouche and FlorentLopez Multifrontal QR factorization for multicore architectures overruntime systems In Euro-Par 2013 Parallel Processing LNCS pages521ndash532 Springer Berlin Heidelberg 2013 [Cited on pp 131]

[6] Ron Aharoni and Penny Haxell Hallrsquos theorem for hypergraphs Jour-nal of Graph Theory 35(2)83ndash88 2000 [Cited on pp 119]

[7] Kadir Akbudak and Cevdet Aykanat Exploiting locality in sparsematrix-matrix multiplication on many-core architectures IEEETransactions on Parallel and Distributed Systems 28(8)2258ndash2271Aug 2017 [Cited on pp 160]

[8] Patrick R Amestoy Iain S Duff and Jean-Yves LrsquoExcellent Multi-frontal parallel distributed symmetric and unsymmetric solvers Com-puter Methods in Applied Mechanics and Engineering 184(2ndash4)501ndash520 2000 [Cited on pp 131 132]

[9] Patrick R Amestoy Iain S Duff Daniel Ruiz and Bora Ucar Aparallel matrix scaling algorithm In J M Palma P R AmestoyM Dayde M Mattoso and J C Lopes editors 8th InternationalConference on High Performance Computing for Computational Sci-ence (VECPAR) volume 5336 pages 301ndash313 Springer Berlin Hei-delberg 2008 [Cited on pp 56]


[10] Michael Anastos and Alan Frieze Finding perfect matchings in ran-dom cubic graphs in linear time arXiv preprint arXiv1808008252018 [Cited on pp 125]

[11] Claus A Andersson and Rasmus Bro The N-way toolbox for MAT-LAB Chemometrics and Intelligent Laboratory Systems 52(1)1ndash42000 [Cited on pp 11 37]

[12] Hartwig Anzt Edmond Chow and Jack Dongarra Iterative sparsetriangular solves for preconditioning In Larsson Jesper Traff SaschaHunold and Francesco Versaci editors Euro-Par 2015 Parallel Pro-cessing pages 650ndash661 Springer Berlin Heidelberg 2015 [Cited onpp 107]

[13] Jonathan Aronson Martin Dyer Alan Frieze and Stephen SuenRandomized greedy matching II Random Structures amp Algorithms6(1)55ndash73 1995 [Cited on pp 58]

[14] Jonathan Aronson Alan Frieze and Boris G Pittel Maximum match-ings in sparse random graphs Karp-Sipser revisited Random Struc-tures amp Algorithms 12(2)111ndash177 1998 [Cited on pp 56 58]

[15] Cevdet Aykanat Berkant B Cambazoglu and Bora Ucar Multi-leveldirect K-way hypergraph partitioning with multiple constraints andfixed vertices Journal of Parallel and Distributed Computing 68609ndash625 2008 [Cited on pp 46]

[16] Ariful Azad Mahantesh Halappanavar Sivasankaran RajamanickamErik G Boman Arif Khan and Alex Pothen Multithreaded algo-rithms for maximum matching in bipartite graphs In IEEE 26th Inter-national Parallel Distributed Processing Symposium (IPDPS) pages860ndash872 Shanghai China 2012 [Cited on pp 57 59 66 73 74]

[17] Brett W Bader and Tamara G Kolda Efficient MATLAB compu-tations with sparse and factored tensors SIAM Journal on ScientificComputing 30(1)205ndash231 December 2007 [Cited on pp 11 37]

[18] Brett W Bader Tamara G Kolda et al Matlab tensor toolbox version 30 Available online http://www.sandia.gov/~tgkolda/TensorToolbox 2017 [Cited on pp 11 37 38 157]

[19] David A Bader Henning Meyerhenke Peter Sanders and DorotheaWagner editors Graph Partitioning and Graph Clustering - 10thDIMACS Implementation Challenge Workshop Georgia Institute ofTechnology Atlanta GA USA February 13-14 2012 Proceedingsvolume 588 of Contemporary Mathematics American MathematicalSociety 2013 [Cited on pp 69]


[20] Muthu Baskaran Tom Henretty Benoit Pradelle M HarperLangston David Bruns-Smith James Ezick and Richard LethinMemory-efficient parallel tensor decompositions In 2017 IEEE HighPerformance Extreme Computing Conference (HPEC) pages 1ndash7Sep 2017 [Cited on pp 160]

[21] James Bennett and Stan Lanning The Netflix Prize In Proceedingsof KDD cup and workshop page 35 2007 [Cited on pp 10 48]

[22] Anne Benoit Yves Robert and Frederic Vivien A Guide to AlgorithmDesign Paradigms Methods and Complexity Analysis CRC PressBoca Raton FL 2014 [Cited on pp 93]

[23] Michele Benzi and Bora Ucar Preconditioning techniques based onthe Birkhoffndashvon Neumann decomposition Computational Methods inApplied Mathematics 17201ndash215 2017 [Cited on pp 106 108]

[24] Piotr Berman and Marek Karpinski Improved approximation lowerbounds on small occurence optimization ECCC Report 2003 [Citedon pp 113]

[25] Garrett Birkhoff Tres observaciones sobre el algebra lineal Universi-dad Nacional de Tucuman Series A 5147ndash154 1946 [Cited on pp8]

[26] Marcel Birn Vitaly Osipov Peter Sanders Christian Schulz andNodari Sitchinava Efficient parallel and external matching In Fe-lix Wolf Bernd Mohr and Dieter an Mey editors Euro-Par 2013Parallel Processing pages 659ndash670 Springer Berlin Heidelberg 2013[Cited on pp 59]

[27] Rob H Bisseling and Wouter Meesen Communication balancing inparallel sparse matrix-vector multiplication Electronic Transactionson Numerical Analysis 2147ndash65 2005 [Cited on pp 4 47]

[28] Zachary Blanco Bangtian Liu and Maryam Mehri Dehnavi CSTFLarge-scale sparse tensor factorizations on distributed platforms InProceedings of the 47th International Conference on Parallel Process-ing pages 211ndash2110 New York NY USA 2018 ACM [Cited onpp 160]

[29] Guy E Blelloch Jeremy T Fineman and Julian Shun Greedy sequen-tial maximal independent set and matching are parallel on average InProceedings of the Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 308ndash317 ACM 2012[Cited on pp 59 73 74 75]


[30] Norbert Blum A new approach to maximum matching in generalgraphs In Proceedings of the 17th International Colloquium on Au-tomata Languages and Programming pages 586ndash597 London UK1990 Springer-Verlag [Cited on pp 6 76]

[31] Richard A Brualdi The diagonal hypergraph of a matrix (bipartitegraph) Discrete Mathematics 27(2)127ndash147 1979 [Cited on pp 92]

[32] Richard A Brualdi Notes on the Birkhoff algorithm for doublystochastic matrices Canadian Mathematical Bulletin 25(2)191ndash1991982 [Cited on pp 9 92 94 98]

[33] Richard A Brualdi and Peter M Gibson Convex polyhedra of doublystochastic matrices I Applications of the permanent function Jour-nal of Combinatorial Theory Series A 22(2)194ndash230 1977 [Citedon pp 9]

[34] Richard A Brualdi and Herbert J Ryser Combinatorial Matrix The-ory volume 39 of Encyclopedia of Mathematics and its ApplicationsCambridge University Press Cambridge UK New York USA Mel-bourne Australia 1991 [Cited on pp 8]

[35] Thang Nguyen Bui and Curt Jones A heuristic for reducing fill-in insparse matrix factorization In Proc 6th SIAM Conf Parallel Pro-cessing for Scientific Computing pages 445ndash452 SIAM 1993 [Citedon pp 4 15]

[36] Peter J Cameron Notes on combinatorics httpscameroncountsfileswordpresscom201311combpdf (last checked Jan 2019)2013 [Cited on pp 76]

[37] Andrew Carlson Justin Betteridge Bryan Kisiel Burr Settles Este-vam R Hruschka Jr and Tom M Mitchell Toward an architecturefor never-ending language learning In AAAI volume 5 page 3 2010[Cited on pp 48]

[38] J Douglas Carroll and Jih-Jie Chang Analysis of individual dif-ferences in multidimensional scaling via an N-way generalization ofldquoEckart-Youngrdquo decomposition Psychometrika 35(3)283ndash319 1970[Cited on pp 36]

[39] Umit V Catalyurek and Cevdet Aykanat Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplicationIEEE Transactions on Parallel and Distributed Systems 10(7)673ndash693 Jul 1999 [Cited on pp 41 140]


[40] Umit V Catalyurek and Cevdet Aykanat PaToH A Multilevel Hyper-graph Partitioning Tool Version 30 Bilkent University Departmentof Computer Engineering Ankara 06533 Turkey PaToH is availableat httpbmiosuedu~umitsoftwarehtm 1999 [Cited on pp4 7 25 47 143]

[41] Umit V Catalyurek Kamer Kaya and Bora Ucar On shared-memoryparallelization of a sparse matrix scaling algorithm In 2012 41st Inter-national Conference on Parallel Processing pages 68ndash77 Los Alami-tos CA USA September 2012 IEEE Computer Society [Cited onpp 56]

[42] Umit V Catalyurek Mehmet Deveci Kamer Kaya and Bora UcarMultithreaded clustering for multi-level hypergraph partitioning In26th IEEE International Parallel and Distributed Processing Sympo-sium (IPDPS) pages 848ndash859 Shanghai China 2012 [Cited on pp59]

[43] Cheng-Shang Chang Wen-Jyh Chen and Hsiang-Yi Huang On ser-vice guarantees for input-buffered crossbar switches A capacity de-composition approach by Birkhoff and von Neumann In Quality ofService 1999 IWQoS rsquo99 1999 Seventh International Workshop onpages 79ndash86 1999 [Cited on pp 8]

[44] Jianer Chen Yang Liu Songjian Lu Barry OrsquoSullivan and Igor Raz-gon A fixed-parameter algorithm for the directed feedback vertex setproblem Journal of the ACM 55(5)211ndash2119 2008 [Cited on pp142]

[45] Joseph Cheriyan Randomized O(M(|V |)) algorithms for problemsin matching theory SIAM Journal on Computing 26(6)1635ndash16551997 [Cited on pp 76]

[46] Jee Choi Xing Liu Shaden Smith and Tyler Simon Blocking opti-mization techniques for sparse tensor computation In 2018 IEEE In-ternational Parallel and Distributed Processing Symposium (IPDPS)pages 568ndash577 Vancouver British Columbia Canada 2018 [Citedon pp 160 163]

[47] Joon Hee Choi and S V N Vishwanathan DFacTo Distributedfactorization of tensors In Zoubin Ghahramani Max Welling CorinnaCortes Neil D Lawrence and Kilian Q Weinberger editors 27thAdvances in Neural Information Processing Systems pages 1296ndash1304Curran Associates Inc Montreal Quebec Canada 2014 [Cited onpp 11 37 38 39]


[48] Edmond Chow and Aftab Patel Fine-grained parallel incomplete LUfactorization SIAM Journal on Scientific Computing 37(2)C169ndashC193 2015 [Cited on pp 107]

[49] Andrzej Cichocki Danilo P Mandic Lieven De Lathauwer GuoxuZhou Qibin Zhao Cesar Caiafa and Anh-Huy Phan Tensor decom-positions for signal processing applications From two-way to multiwaycomponent analysis IEEE Signal Processing Magazine 32(2)145ndash163March 2015 [Cited on pp 10]

[50] Edith Cohen Structure prediction and computation of sparse matrixproducts Journal of Combinatorial Optimization 2(4)307ndash332 Dec1998 [Cited on pp 127]

[51] Thomas F Coleman and Wei Xu Parallelism in structured Newtoncomputations In Parallel Computing Architectures Algorithms andApplications ParCo 2007 pages 295ndash302 Forschungszentrum Julichand RWTH Aachen University Germany 2007 [Cited on pp 3]

[52] Thomas F Coleman and Wei Xu Fast (structured) Newton computa-tions SIAM Journal on Scientific Computing 31(2)1175ndash1191 2009[Cited on pp 3]

[53] Thomas F Coleman and Wei Xu Automatic Differentiation in MAT-LAB using ADMAT with Applications SIAM 2016 [Cited on pp3]

[54] Jason Cong Zheng Li and Rajive Bagrodia Acyclic multi-way par-titioning of Boolean networks In Proceedings of the 31st Annual De-sign Automation Conference DACrsquo94 pages 670ndash675 New York NYUSA 1994 ACM [Cited on pp 4 17]

[55] Elizabeth H Cuthill and J McKee Reducing the bandwidth of sparsesymmetric matrices In Proceedings of the 24th national conferencepages 157ndash172 New York NY USA 1969 ACM [Cited on pp 12]

[56] Marek Cygan Improved approximation for 3-dimensional matchingvia bounded pathwidth local search In Foundations of ComputerScience (FOCS) 2013 IEEE 54th Annual Symposium on pages 509ndash518 IEEE 2013 [Cited on pp 113 120]

[57] Marek Cygan Fabrizio Grandoni and Monaldo Mastrolilli How to sellhyperedges The hypermatching assignment problem In Proc of thetwenty-fourth annual ACM-SIAM symposium on Discrete algorithmspages 342ndash351 SIAM 2013 [Cited on pp 113]


[58] Joseph Czyzyk Michael P Mesnier and Jorge J More The NEOSServer IEEE Journal on Computational Science and Engineering5(3)68ndash75 1998 [Cited on pp 96]

[59] Timothy A Davis and Yifan Hu The University of Florida sparsematrix collection ACM Transactions on Mathematical Software38(1)11ndash125 2011 [Cited on pp 25 69 82 105 143 161]

[60] Lieven De Lathauwer Pierre Comon Bart De Moor and Joos Vande-walle Higher-order power methodndashApplication in independent com-ponent analysis In Proceedings NOLTArsquo95 pages 91ndash96 Las VegasUSA 1995 [Cited on pp 163]

[61] Lieven De Lathauwer Bart De Moor and Joos Vandewalle Onthe best rank-1 and rank-(R1 R2 RN ) approximation of higher-order tensors SIAM Journal on Matrix Analysis and Applications21(4)1324ndash1342 2000 [Cited on pp 163]

[62] Dominique de Werra Variations on the Theorem of BirkhoffndashvonNeumann and extensions Graphs and Combinatorics 19(2)263ndash278Jun 2003 [Cited on pp 108]

[63] James W Demmel Stanley C Eisenstat John R Gilbert Xiaoye SLi and Joseph W H Liu A supernodal approach to sparse par-tial pivoting SIAM Journal on Matrix Analysis and Applications20(3)720ndash755 1999 [Cited on pp 132]

[64] James W Demmel John R Gilbert and Xiaoye S Li An asyn-chronous parallel supernodal algorithm for sparse gaussian elimina-tion SIAM Journal on Matrix Analysis and Applications 20(4)915ndash952 1999 [Cited on pp 132]

[65] Mehmet Deveci Kamer Kaya Umit V Catalyurek and Bora UcarA push-relabel-based maximum cardinality matching algorithm onGPUs In 42nd International Conference on Parallel Processing pages21ndash29 Lyon France 2013 [Cited on pp 57]

[66] Mehmet Deveci Kamer Kaya Bora Ucar and Umit V CatalyurekGPU accelerated maximum cardinality matching algorithms for bipar-tite graphs In Felix Wolf Bernd Mohr and Dieter an Mey editorsEuro-Par 2013 Parallel Processing volume 8097 of LNCS pages 850ndash861 Springer Berlin Heidelberg 2013 [Cited on pp 57]

[67] Pat Devlin and Jeff Kahn Perfect fractional matchings in k-out hy-pergraphs arXiv preprint arXiv170303513 2017 [Cited on pp 114121]


[68] Elizabeth D Dolan The NEOS Server 40 administrative guide Tech-nical Memorandum ANLMCS-TM-250 Mathematics and ComputerScience Division Argonne National Laboratory 2001 [Cited on pp96]

[69] Elizabeth D Dolan and Jorge J More Benchmarking optimiza-tion software with performance profiles Mathematical Programming91(2)201ndash213 2002 [Cited on pp 26]

[70] Jack J Dongarra Fred G Gustavson and Alan H Karp Implement-ing linear algebra algorithms for dense matrices on a vector pipelinemachine SIAM Review 26(1)pp 91ndash112 1984 [Cited on pp 134]

[71] Iain S Duff Kamer Kaya and Bora Ucar Design implementationand analysis of maximum transversal algorithms ACM Transactionson Mathematical Software 38(2)131ndash1331 2011 [Cited on pp 656 58]

[72] Iain S Duff and Jacko Koster On algorithms for permuting largeentries to the diagonal of a sparse matrix SIAM Journal on MatrixAnalysis and Applications 22973ndash996 2001 [Cited on pp 98 143]

[73] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Ap-proximation algorithms for maximum matchings in undirected graphsIn 2018 Proceedings of the Seventh SIAM Workshop on CombinatorialScientific Computing pages 56ndash65 2018 [Cited on pp 75 79 113]

[74] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Fur-ther notes on Birkhoffndashvon Neumann decomposition of doubly stochas-tic matrices Linear Algebra and its Applications 55468ndash78 2018[Cited on pp 101 113 117]

[75] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Ef-fective heuristics for matchings in hypergraphs Research Report RR-9224 Inria Grenoble Rhone-Alpes November 2018 (will appear inSEA2) [Cited on pp 118 119 122]

[76] Fanny Dufosse Kamer Kaya and Bora Ucar Two approximationalgorithms for bipartite matching on multicore architectures Journalof Parallel and Distributed Computing 8562ndash78 2015 IPDPS 2014Selected Papers on Numerical and Combinatorial Algorithms [Citedon pp 56 60 65 66 67 69 77 79 81 113 117 125]

[77] Andrew Lloyd Dulmage and Nathan Saul Mendelsohn Coverings ofbipartite graphs Canadian Journal of Mathematics 10517ndash534 1958[Cited on pp 68 143]


[78] Martin E Dyer and Alan M Frieze Randomized greedy matchingRandom Structures amp Algorithms 2(1)29ndash45 1991 [Cited on pp 58113 115]

[79] Jack Edmonds Paths trees and flowers Canadian Journal of Math-ematics 17449ndash467 1965 [Cited on pp 76]

[80] Lawrence C Eggan Transition graphs and the star-height of regu-lar events The Michigan Mathematical Journal 10(4)385ndash397 1963[Cited on pp 5]

[81] Stanley C Eisenstat and Joseph W H Liu The theory of elimina-tion trees for sparse unsymmetric matrices SIAM Journal on MatrixAnalysis and Applications 26(3)686ndash705 2005 [Cited on pp 5 131132 133 135]

[82] Stanley C Eisenstat and Joseph W H Liu A tree-based dataflowmodel for the unsymmetric multifrontal method Electronic Transac-tions on Numerical Analysis 211ndash19 2005 [Cited on pp 132]

[83] Stanley C Eisenstat and Joseph W H Liu Algorithmic aspects ofelimination trees for sparse unsymmetric matrices SIAM Journal onMatrix Analysis and Applications 29(4)1363ndash1381 2008 [Cited onpp 5 132 143 145]

[84] Venmugil Elango Fabrice Rastello Louis-Noel Pouchet J Ramanu-jam and P Sadayappan On characterizing the data access complex-ity of programs ACM SIGPLAN Notices - POPLrsquo15 50(1)567ndash580January 2015 [Cited on pp 3]

[85] Paul Erdos and Alfred Renyi On random matrices Publicationsof the Mathematical Institute of the Hungarian Academy of Sciences8(3)455ndash461 1964 [Cited on pp 72]

[86] Bastiaan Onne Fagginger Auer and Rob H Bisseling A GPU algo-rithm for greedy graph matching In Facing the Multicore ChallengeII volume 7174 of LNCS pages 108ndash119 Springer-Verlag Berlin 2012[Cited on pp 59]

[87] Naznin Fauzia Venmugil Elango Mahesh Ravishankar J Ramanu-jam Fabrice Rastello Atanas Rountev Louis-Noel Pouchet andP Sadayappan Beyond reuse distance analysis Dynamic analysis forcharacterization of data locality potential ACM Transactions on Ar-chitecture and Code Optimization 10(4)531ndash5329 December 2013[Cited on pp 3 16]


[88] Charles M Fiduccia and Robert M Mattheyses A linear-time heuris-tic for improving network partitions In 19th Design Automation Con-ference pages 175ndash181 Las Vegas USA 1982 IEEE [Cited on pp17 22]

[89] Joel Franklin and Jens Lorenz On the scaling of multidimensional ma-trices Linear Algebra and its Applications 114717ndash735 1989 [Citedon pp 114]

[90] Alan M Frieze Maximum matchings in a class of random graphsJournal of Combinatorial Theory Series B 40(2)196ndash212 1986[Cited on pp 76 82 121]

[91] Aurelien Froger Olivier Guyon and Eric Pinson A set packing ap-proach for scheduling passenger train drivers the French experienceIn RailTokyo2015 Tokyo Japan March 2015 [Cited on pp 7]

[92] Harold N Gabow and Robert E Tarjan Faster scaling algorithms fornetwork problems SIAM Journal on Computing 18(5)1013ndash10361989 [Cited on pp 6 76]

[93] Michael R Garey and David S Johnson Computers and IntractabilityA Guide to the Theory of NP-Completeness W H Freeman amp CoNew York NY USA 1979 [Cited on pp 3 47 93]

[94] Alan George Nested dissection of a regular finite element mesh SIAMJournal on Numerical Analysis 10(2)345ndash363 1973 [Cited on pp136]

[95] Alan J George Computer implementation of the finite elementmethod PhD thesis Stanford University Stanford CA USA 1971[Cited on pp 12]

[96] Andrew V Goldberg and Robert Kennedy Global price updates helpSIAM Journal on Discrete Mathematics 10(4)551ndash572 1997 [Citedon pp 6]

[97] Olaf Gorlitz Sergej Sizov and Steffen Staab Pints Peer-to-peerinfrastructure for tagging systems In Proceedings of the 7th Interna-tional Conference on Peer-to-peer Systems IPTPSrsquo08 page 19 Berke-ley CA USA 2008 USENIX Association [Cited on pp 48]

[98] Georg Gottlob and Gianluigi Greco Decomposing combinatorial auc-tions and set packing problems Journal of the ACM 60(4)241ndash2439September 2013 [Cited on pp 7]

[99] Lars Grasedyck Hierarchical singular value decomposition of tensorsSIAM Journal on Matrix Analysis and Applications 31(4)2029ndash20542010 [Cited on pp 52]


[100] William Gropp and Jorge J More Optimization environments andthe NEOS Server In Martin D Buhman and Arieh Iserles editorsApproximation Theory and Optimization pages 167ndash182 CambridgeUniversity Press 1997 [Cited on pp 96]

[101] Hermann Gruber Digraph complexity measures and applications informal language theory Discrete Mathematics amp Theoretical Com-puter Science 14(2)189ndash204 2012 [Cited on pp 5]

[102] Gael Guennebaud Benoît Jacob et al Eigen v3 http://eigen.tuxfamily.org 2010 [Cited on pp 47]

[103] Abdou Guermouche Jean-Yves LrsquoExcellent and Gil Utard Impact ofreordering on the memory of a multifrontal solver Parallel Computing29(9)1191ndash1218 2003 [Cited on pp 131]

[104] Anshul Gupta Improved symbolic and numerical factorization algo-rithms for unsymmetric sparse matrices SIAM Journal on MatrixAnalysis and Applications 24(2)529ndash552 2002 [Cited on pp 132]

[105] Mahantesh Halappanavar John Feo Oreste Villa Antonino Tumeoand Alex Pothen Approximate weighted matching on emerging many-core and multithreaded architectures International Journal of HighPerformance Computing Applications 26(4)413ndash430 2012 [Cited onpp 59]

[106] Magnus M Halldorsson Approximating discrete collections via localimprovements In The Sixth annual ACM-SIAM symposium on Dis-crete Algorithms (SODA) volume 95 pages 160ndash169 San FranciscoCalifornia USA 1995 SIAM [Cited on pp 113]

[107] Richard A Harshman Foundations of the PARAFAC procedureModels and conditions for an ldquoexplanatoryrdquo multi-modal factor anal-ysis UCLA Working Papers in Phonetics 161ndash84 1970 [Cited onpp 36]

[108] Nicholas J A Harvey Algebraic algorithms for matching and matroidproblems SIAM Journal on Computing 39(2)679ndash702 2009 [Citedon pp 76]

[109] Elad Hazan Shmuel Safra and Oded Schwartz On the complexity ofapproximating k-dimensional matching In Approximation Random-ization and Combinatorial Optimization Algorithms and Techniquespages 83ndash97 Springer 2003 [Cited on pp 113]

[110] Elad Hazan Shmuel Safra and Oded Schwartz On the complexity ofapproximating k-set packing Computational Complexity 15(1)20ndash392006 [Cited on pp 7]


[111] Bruce Hendrickson Load balancing fictions falsehoods and fallaciesApplied Mathematical Modelling 2599ndash108 2000 [Cited on pp 41]

[112] Bruce Hendrickson and Tamara G Kolda Graph partitioning modelsfor parallel computing Parallel Computing 26(12)1519ndash1534 2000[Cited on pp 41]

[113] Bruce Hendrickson and Robert Leland The Chaco userrsquos guide ver-sion 10 Technical Report SAND93ndash2339 Sandia National Laborato-ries Albuquerque NM October 1993 [Cited on pp 4 15 16]

[114] Julien Herrmann Jonathan Kho Bora Ucar Kamer Kaya andUmit V Catalyurek Acyclic partitioning of large directed acyclicgraphs In Proceedings of the 17th IEEEACM International Sympo-sium on Cluster Cloud and Grid Computing CCGRID pages 371ndash380 Madrid Spain May 2017 [Cited on pp 15 16 17 18 28 30]

[115] Julien Herrmann M Yusuf Ozkaya Bora Ucar Kamer Kaya andUmit V Catalyurek Multilevel algorithms for acyclic partitioning ofdirected acyclic graphs SIAM Journal on Scientific Computing 2019to appear [Cited on pp 19 20 26 29 31]

[116] Jonathan D Hogg John K Reid and Jennifer A Scott Design of amulticore sparse Cholesky factorization using DAGs SIAM Journalon Scientific Computing 32(6)3627ndash3649 2010 [Cited on pp 131]

[117] John E Hopcroft and Richard M Karp An n52 algorithm for max-imum matchings in bipartite graphs SIAM Journal on Computing2(4)225ndash231 1973 [Cited on pp 6 62]

[118] Roger A Horn and Charles R Johnson Matrix Analysis CambridgeUniversity Press New York 1985 [Cited on pp 109]

[119] Cor A J Hurkens and Alexander Schrijver On the size of systemsof sets every t of which have an SDR with an application to theworst-case ratio of heuristics for packing problems SIAM Journal onDiscrete Mathematics 2(1)68ndash72 1989 [Cited on pp 113 120]

[120] Oscar H Ibarra and Shlomo Moran Deterministic and probabilisticalgorithms for maximum bipartite matching via fast matrix multipli-cation Information Processing Letters pages 12ndash15 1981 [Cited onpp 76]

[121] Inah Jeon Evangelos E Papalexakis U Kang and Christos Falout-sos Haten2 Billion-scale tensor decompositions In IEEE 31st Inter-national Conference on Data Engineering (ICDE) pages 1047ndash1058April 2015 [Cited on pp 156 161]


[122] Jochen A G Jess and Hendrikus Gerardus Maria Kees A data struc-ture for parallel LU decomposition IEEE Transactions on Comput-ers 31(3)231ndash239 1982 [Cited on pp 134]

[123] U Kang Evangelos Papalexakis Abhay Harpale and Christos Falout-sos GigaTensor Scaling tensor analysis up by 100 times - Algorithmsand discoveries In Proceedings of the 18th ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Mining KDDrsquo12 pages 316ndash324 New York NY USA 2012 ACM [Cited on pp10 11 37]

[124] Micha l Karonski Ed Overman and Boris Pittel On a perfect match-ing in a random bipartite digraph with average out-degree below twoarXiv e-prints page arXiv190305764 Mar 2019 [Cited on pp 59121]

[125] Micha l Karonski and Boris Pittel Existence of a perfect matching ina random (1 + eminus1)ndashout bipartite graph Journal of CombinatorialTheory Series B 881ndash16 May 2003 [Cited on pp 59 67 81 82121]

[126] Richard M Karp Reducibility among combinatorial problems InRaymond E Miller James W Thatcher and Jean D Bohlinger edi-tors Complexity of Computer Computations The IBM Research Sym-posiax pages 85ndash103 Springer US 1972 [Cited on pp 7 142]

[127] Richard M Karp Alexander H G Rinnooy Kan and Rakesh VVohra Average case analysis of a heuristic for the assignment prob-lem Mathematics of Operations Research 19513ndash522 1994 [Citedon pp 81]

[128] Richard M Karp and Michael Sipser Maximum matching in sparserandom graphs In 22nd Annual IEEE Symposium on Foundations ofComputer Science (FOCS) pages 364ndash375 Nashville TN USA 1981[Cited on pp 56 58 67 75 81 113]

[129] Richard M Karp Umesh V Vazirani and Vijay V Vazirani An op-timal algorithm for on-line bipartite matching In 22nd annual ACMsymposium on Theory of computing (STOC) pages 352ndash358 Balti-more MD USA 1990 [Cited on pp 57]

[130] George Karypis and Vipin Kumar MeTiS A Software Package forPartitioning Unstructured Graphs Partitioning Meshes and Comput-ing Fill-Reducing Orderings of Sparse Matrices Version 40 Univer-sity of Minnesota Department of Comp Sci and Eng Army HPCResearch Cent Minneapolis 1998 [Cited on pp 4 16 23 27 142]


[131] George Karypis and Vipin Kumar Multilevel algorithms for multi-constraint hypergraph partitioning Technical Report 99-034 Uni-versity of Minnesota Department of Computer ScienceArmy HPCResearch Center Minneapolis MN 55455 November 1998 [Cited onpp 34]

[132] Kamer Kaya Johannes Langguth Fredrik Manne and Bora UcarPush-relabel based algorithms for the maximum transversal problemComputers and Operations Research 40(5)1266ndash1275 2013 [Citedon pp 6 56 58]

[133] Kamer Kaya and Bora Ucar Constructing elimination trees for sparseunsymmetric matrices SIAM Journal on Matrix Analysis and Appli-cations 34(2)345ndash354 2013 [Cited on pp 5]

[134] Oguz Kaya and Bora Ucar Scalable sparse tensor decompositions indistributed memory systems In Proceedings of the International Con-ference for High Performance Computing Networking Storage andAnalysis SC rsquo15 pages 771ndash7711 Austin Texas 2015 ACM NewYork NY USA [Cited on pp 7]

[135] Oguz Kaya and Bora Ucar Parallel CandecompParafac decomposi-tion of sparse tensors using dimension trees SIAM Journal on Scien-tific Computing 40(1)C99ndashC130 2018 [Cited on pp 52]

[136] Enver Kayaaslan and Bora Ucar Reducing elimination tree heightfor parallel LU factorization of sparse unsymmetric matrices In 21stInternational Conference on High Performance Computing (HiPC2014) pages 1ndash10 Goa India December 2014 [Cited on pp 136]

[137] Brian W Kernighan Optimal sequential partitions of graphs Journalof the ACM 18(1)34ndash40 January 1971 [Cited on pp 16]

[138] Shiva Kintali Nishad Kothari and Akash Kumar Approximationalgorithms for directed width parameters CoRR abs11074824 2011[Cited on pp 138]

[139] Philip A Knight The SinkhornndashKnopp algorithm Convergence andapplications SIAM Journal on Matrix Analysis and Applications30(1)261ndash275 2008 [Cited on pp 55 68]

[140] Philip A Knight and Daniel Ruiz A fast algorithm for matrix bal-ancing IMA Journal of Numerical Analysis 33(3)1029ndash1047 2013[Cited on pp 56 79]

[141] Philip A Knight Daniel Ruiz and Bora Ucar A symmetry preservingalgorithm for matrix scaling SIAM Journal on Matrix Analysis andApplications 35(3)931ndash955 2014 [Cited on pp 55 56 68 79 105]


[142] Tamara G Kolda and Brett Bader Tensor decompositions and appli-cations SIAM Review 51(3)455ndash500 2009 [Cited on pp 10 37]

[143] Tjalling C Koopmans and Martin Beckmann Assignment problemsand the location of economic activities Econometrica 25(1)53ndash761957 [Cited on pp 109]

[144] Mads R B Kristensen Simon A F Lund Troels Blum and JamesAvery Fusion of parallel array operations In Proceedings of the 2016International Conference on Parallel Architectures and Compilationpages 71ndash85 New York NY USA 2016 ACM [Cited on pp 3]

[145] Mads R B Kristensen Simon A F Lund Troels Blum KennethSkovhede and Brian Vinter Bohrium A virtual machine approachto portable parallelism In Proceedings of the 2014 IEEE InternationalParallel amp Distributed Processing Symposium Workshops pages 312ndash321 Washington DC USA 2014 [Cited on pp 3]

[146] Janardhan Kulkarni Euiwoong Lee and Mohit Singh Minimum Birkhoff-von Neumann decomposition Preliminary version http://www.cs.cmu.edu/~euiwoonl/sparsebvn.pdf of the paper which appeared in Proc 19th International Conference Integer Programming and Combinatorial Optimization (IPCO 2017) Waterloo ON Canada pp 343–354 June 2017 [Cited on pp 9]

[147] Sebastian Lamm Peter Sanders Christian Schulz Darren Strash andRenato F Werneck Finding Near-Optimal Independent Sets at ScaleIn Proceedings of the 18th Workshop on Algorithm Engineering andExperiments (ALENEXrsquo16) pages 138ndash150 Arlington Virginia USA2016 SIAM [Cited on pp 114 127]

[148] Johannes Langguth Fredrik Manne and Peter Sanders Heuristicinitialization for bipartite matching problems Journal of ExperimentalAlgorithmics 1511ndash122 2010 [Cited on pp 6 56 58]

[149] Yanzhe (Murray) Lei Stefanus Jasin Joline Uichanco and AndrewVakhutinsky Randomized product display (ranking) pricing and or-der fulfillment for e-commerce retailers SSRN Electronic Journal2018 [Cited on pp 9]

[150] Thomas Lengauer Combinatorial Algorithms for Integrated CircuitLayout WileyndashTeubner Chichester UK 1990 [Cited on pp 34]

[151] Renaud Lepere and Denis Trystram A new clustering algorithm forlarge communication delays In 16th International Parallel and Dis-tributed Processing Symposium (IPDPS) page (CDROM) Fort Laud-erdale FL USA 2002 IEEE Computer Society [Cited on pp 32]


[152] Hanoch Levy and David W. Low. A contraction algorithm for finding small cycle cutsets. Journal of Algorithms, 9(4):470–493, 1988. [Cited on pp. 142]

[153] Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. Model-driven sparse CP decomposition for higher-order tensors. In IPDPS 2017, 31st IEEE International Symposium on Parallel and Distributed Processing, pages 1048–1057, Orlando, FL, USA, May 2017. [Cited on pp. 52]

[154] Jiajia Li, Yuchen Ma, and Richard Vuduc. ParTI: A Parallel Tensor Infrastructure for Multicore CPU and GPUs (Version 0.1.0). Available from https://github.com/hpcgarage/ParTI, 2016. [Cited on pp. 156]

[155] Jiajia Li, Jimeng Sun, and Richard Vuduc. HiCOO: Hierarchical storage of sparse tensors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '18, New York, NY, USA, 2018. [Cited on pp. 148, 149, 155, 158]

[156] Jiajia Li, Bora Uçar, Ümit V. Çatalyürek, Jimeng Sun, Kevin Barker, and Richard Vuduc. Efficient and effective sparse tensor reordering. In Proceedings of the ACM International Conference on Supercomputing, ICS '19, pages 227–237, New York, NY, USA, 2019. ACM. [Cited on pp. 147, 156]

[157] Xiaoye S. Li and James W. Demmel. Making sparse Gaussian elimination scalable by static pivoting. In Proc. of the 1998 ACM/IEEE Conference on Supercomputing, SC '98, pages 1–17, Washington, DC, USA, 1998. [Cited on pp. 132]

[158] Athanasios P. Liavas and Nicholas D. Sidiropoulos. Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(20):5450–5463, Oct 2015. [Cited on pp. 11]

[159] Bangtian Liu, Chengyao Wen, Anand D. Sarwate, and Maryam Mehri Dehnavi. A unified optimization approach for sparse tensor operations on GPUs. In 2017 IEEE International Conference on Cluster Computing (CLUSTER), pages 47–57, Sept 2017. [Cited on pp. 160]

[160] Joseph W. H. Liu. Computational models and task scheduling for parallel sparse Cholesky factorization. Parallel Computing, 3(4):327–342, 1986. [Cited on pp. 131, 134]

[161] Joseph W. H. Liu. A graph partitioning algorithm by node separators. ACM Transactions on Mathematical Software, 15(3):198–219, 1989. [Cited on pp. 140]


[162] Joseph W. H. Liu. Reordering sparse matrices for parallel elimination. Parallel Computing, 11(1):73–91, 1989. [Cited on pp. 131]

[163] Liang Liu, Jun (Jim) Xu, and Lance Fortnow. Quantized BvND: A better solution for optical and hybrid switching in data center networks. In 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing, pages 237–246, Dec 2018. [Cited on pp. 9]

[164] Zvi Lotker, Boaz Patt-Shamir, and Seth Pettie. Improved distributed approximate matching. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 129–136, Munich, Germany, 2008. ACM. [Cited on pp. 59]

[165] László Lovász. On determinants, matchings, and random algorithms. In FCT, pages 565–574, 1979. [Cited on pp. 76]

[166] László Lovász and Michael D. Plummer. Matching Theory. Elsevier Science Publishers, Netherlands, 1986. [Cited on pp. 76]

[167] Anna Lubiw. Doubly lexical orderings of matrices. SIAM Journal on Computing, 16(5):854–879, 1987. [Cited on pp. 151, 152, 153]

[168] Yuchen Ma, Jiajia Li, Xiaolong Wu, Chenggang Yan, Jimeng Sun, and Richard Vuduc. Optimizing sparse tensor times matrix on GPUs. Journal of Parallel and Distributed Computing, 129:99–109, 2019. [Cited on pp. 160]

[169] Jakob Magun. Greedy matching algorithms: An experimental study. Journal of Experimental Algorithmics, 3:6, 1998. [Cited on pp. 6]

[170] Marvin Marcus and R. Ree. Diagonals of doubly stochastic matrices. The Quarterly Journal of Mathematics, 10(1):296–302, 1959. [Cited on pp. 9]

[171] Nick McKeown. The iSLIP scheduling algorithm for input-queued switches. IEEE/ACM Transactions on Networking, 7(2):188–201, 1999. [Cited on pp. 6]

[172] Amram Meir and John W. Moon. The expected node-independence number of random trees. Indagationes Mathematicae, 76:335–341, 1973. [Cited on pp. 67]

[173] John Mellor-Crummey, David Whaley, and Ken Kennedy. Improving memory hierarchy performance for irregular applications using data and computation reorderings. International Journal of Parallel Programming, 29(3):217–247, Jun 2001. [Cited on pp. 160]


[174] Silvio Micali and Vijay V. Vazirani. An O(√|V| |E|) algorithm for finding maximum matching in general graphs. In 21st Annual Symposium on Foundations of Computer Science, pages 17–27, Syracuse, NY, USA, 1980. IEEE. [Cited on pp. 6, 76]

[175] Orlando Moreira, Merten Popp, and Christian Schulz. Graph partitioning with acyclicity constraints. In 16th International Symposium on Experimental Algorithms, SEA, London, UK, 2017. [Cited on pp. 16, 17, 28, 29, 32]

[176] Orlando Moreira, Merten Popp, and Christian Schulz. Evolutionary multi-level acyclic graph partitioning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO, pages 332–339, Kyoto, Japan, 2018. ACM. [Cited on pp. 17, 25, 32]

[177] Marcin Mucha and Piotr Sankowski. Maximum matchings in planar graphs via Gaussian elimination. In S. Albers and T. Radzik, editors, ESA 2004, pages 532–543, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. [Cited on pp. 76]

[178] Marcin Mucha and Piotr Sankowski. Maximum matchings via Gaussian elimination. In FOCS '04, pages 248–255, Washington, DC, USA, 2004. IEEE Computer Society. [Cited on pp. 76]

[179] Huda Nassar, Georgios Kollias, Ananth Grama, and David F. Gleich. Low rank methods for multiple network alignment. arXiv e-prints, page arXiv:1809.08198, Sep 2018. [Cited on pp. 128]

[180] Jaroslav Nešetřil and Patrice Ossona de Mendez. Tree-depth, subgraph coloring and homomorphism bounds. European Journal of Combinatorics, 27(6):1022–1041, 2006. [Cited on pp. 132]

[181] M. Yusuf Özkaya, Anne Benoit, Bora Uçar, Julien Herrmann, and Ümit V. Çatalyürek. A scalable clustering-based task scheduler for homogeneous processors using DAG partitioning. In 33rd IEEE International Parallel and Distributed Processing Symposium, pages 155–165, Rio de Janeiro, Brazil, May 2019. [Cited on pp. 32]

[182] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6):973–989, December 1987. [Cited on pp. 152, 153]

[183] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover Publications (corrected, unabridged reprint of Combinatorial Optimization: Algorithms and Complexity, originally published by Prentice-Hall Inc., New Jersey, 1982), New York, 1998. [Cited on pp. 29]


[184] Panos M. Pardalos, Tianbing Qian, and Mauricio G. C. Resende. A greedy randomized adaptive search procedure for the feedback vertex set problem. Journal of Combinatorial Optimization, 2(4):399–412, 1998. [Cited on pp. 142]

[185] Md. Mostofa Ali Patwary, Rob H. Bisseling, and Fredrik Manne. Parallel greedy graph matching using an edge partitioning approach. In Proceedings of the Fourth International Workshop on High-level Parallel Programming and Applications, HLPP '10, pages 45–54, New York, NY, USA, 2010. ACM. [Cited on pp. 59]

[186] François Pellegrini. SCOTCH 5.1 User's Guide. Laboratoire Bordelais de Recherche en Informatique (LaBRI), 2008. [Cited on pp. 4, 16, 23]

[187] Daniel M. Pelt and Rob H. Bisseling. A medium-grain method for fast 2D bipartitioning of sparse matrices. In IPDPS 2014, IEEE 28th International Parallel and Distributed Processing Symposium, pages 529–539, Phoenix, Arizona, May 2014. [Cited on pp. 52]

[188] Ioakeim Perros, Evangelos E. Papalexakis, Fei Wang, Richard Vuduc, Elizabeth Searles, Michael Thompson, and Jimeng Sun. SPARTan: Scalable PARAFAC2 for large and sparse data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 375–384, New York, NY, USA, 2017. ACM. [Cited on pp. 156]

[189] Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834–4846, Oct 2013. [Cited on pp. 52]

[190] Juan C. Pichel, Francisco F. Rivera, Marcos Fernández, and Aurelio Rodríguez. Optimization of sparse matrix–vector multiplication using reordering techniques on GPUs. Microprocessors and Microsystems, 36(2):65–77, 2012. [Cited on pp. 160]

[191] Matthias Poloczek and Mario Szegedy. Randomized greedy algorithms for the maximum matching problem with new analysis. In IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 708–717, New Brunswick, NJ, USA, 2012. [Cited on pp. 58]

[192] Alex Pothen. Sparse null bases and marriage theorems. PhD thesis, Dept. of Computer Science, Cornell University, Ithaca, New York, 1984. [Cited on pp. 68]

[193] Alex Pothen. The complexity of optimal elimination trees. Technical Report CS-88-13, Pennsylvania State University, 1988. [Cited on pp. 131]


[194] Alex Pothen and Chin-Ju Fan. Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software, 16(4):303–324, 1990. [Cited on pp. 58, 68, 115]

[195] Alex Pothen, S. M. Ferdous, and Fredrik Manne. Approximation algorithms in combinatorial scientific computing. Acta Numerica, 28:541–633, 2019. [Cited on pp. 58]

[196] Louis-Noël Pouchet. PolyBench: The polyhedral benchmark suite. URL: http://web.cse.ohio-state.edu/~pouchet/software/polybench/, 2012. [Cited on pp. 25, 26]

[197] Michael O. Rabin and Vijay V. Vazirani. Maximum matchings in general graphs through randomization. Journal of Algorithms, 10(4):557–567, December 1989. [Cited on pp. 76]

[198] Daniel Ruiz. A scaling algorithm to equilibrate both row and column norms in matrices. Technical Report TR-2001-034, RAL, 2001. [Cited on pp. 55, 56]

[199] Peter Sanders and Christian Schulz. Engineering multilevel graph partitioning algorithms. In Camil Demetrescu and Magnús M. Halldórsson, editors, Algorithms – ESA 2011, 19th Annual European Symposium, September 5–9, 2011, pages 469–480, Saarbrücken, Germany, 2011. [Cited on pp. 4, 16, 23]

[200] Robert Schreiber. A new implementation of sparse Gaussian elimination. ACM Transactions on Mathematical Software, 8(3):256–276, 1982. [Cited on pp. 131]

[201] R. Oguz Selvitopi, Muhammet Mustafa Ozdal, and Cevdet Aykanat. A novel method for scaling iterative solvers: Avoiding latency overhead of parallel sparse-matrix vector multiplies. IEEE Transactions on Parallel and Distributed Systems, 26(3):632–645, 2015. [Cited on pp. 51]

[202] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21:343–348, 1967. [Cited on pp. 55, 69, 114]

[203] Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. FROSTT: The formidable repository of open sparse tensors and tools. http://frostt.io/, 2017. [Cited on pp. 52, 126, 156, 161]

[204] Shaden Smith and George Karypis. A medium-grained algorithm for sparse tensor factorization. In 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23–27, 2016, pages 902–911, 2016. [Cited on pp. 37, 52]

[205] Shaden Smith and George Karypis. SPLATT: The Surprisingly ParalleL spArse Tensor Toolkit (Version 1.1.1). Available from https://github.com/ShadenSmith/splatt, 2016. [Cited on pp. 37, 52, 157]

[206] Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George Karypis. SPLATT: Efficient and parallel sparse tensor-matrix multiplication. In 29th IEEE International Parallel & Distributed Processing Symposium, pages 61–70, Hyderabad, India, May 2015. [Cited on pp. 11, 36, 37, 38, 39, 42, 158, 159]

[207] Robert E. Tarjan and Mihalis Yannakakis. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Computing, 13(3):566–579, 1984. [Cited on pp. 149]

[208] Bora Uçar and Cevdet Aykanat. Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies. SIAM Journal on Scientific Computing, 25(6):1837–1859, 2004. [Cited on pp. 47, 51]

[209] Bora Uçar and Cevdet Aykanat. Revisiting hypergraph models for sparse matrix partitioning. SIAM Review, 49(4):595–603, 2007. [Cited on pp. 42]

[210] Vijay V. Vazirani. An improved definition of blossoms and a simpler proof of the MV matching algorithm. CoRR, abs/1210.4594, 2012. [Cited on pp. 76]

[211] David W. Walkup. Matchings in random regular bipartite digraphs. Discrete Mathematics, 31(1):59–64, 1980. [Cited on pp. 59, 67, 76, 81, 114, 121]

[212] Chris Walshaw. Multilevel refinement for combinatorial optimisation problems. Annals of Operations Research, 131(1):325–372, Oct 2004. [Cited on pp. 25]

[213] Eric S. H. Wong, Evangeline F. Y. Young, and Wai-Kei Mak. Clustering based acyclic multi-way partitioning. In Proceedings of the 13th ACM Great Lakes Symposium on VLSI, GLSVLSI '03, pages 203–206, New York, NY, USA, 2003. ACM. [Cited on pp. 4]

[214] Tao Yang and Apostolos Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951–967, 1994. [Cited on pp. 131]


[215] Albert-Jan Nicholas Yzelman and Dirk Roose. High-level strategies for parallel shared-memory sparse matrix-vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 25(1):116–125, Jan 2014. [Cited on pp. 160]

Page 2: Partitioning, matching, and ordering: Combinatorial

MEMOIRE DrsquoHABILITATION A DIRIGER DES RECHERCHES

presente le 19 septembre 2019

a lrsquoEcole Normale Superieure de Lyon

par

Bora Ucar

Partitioning matching and ordering Combinatorialscientific computing with matrices and tensors

(Partitionnement couplage et permutation Calculscientifique combinatoire sur des matrices et des tenseurs)

Devant le jury compose de

Rob H Bisseling Utrecht University the Netherlands RapporteurNadia Brauner Universite Joseph Fourier Grenoble

FranceExaminatrice

Karen D Devine Sandia National Labs AlbuquerqueNew Mexico USA

Rapporteuse

Ali Pınar Sandia National Labs LivermoreCalifornia USA

Examinateur

Yves Robert ENS Lyon France ExaminateurDenis Trystram Grenoble INP France Rapporteur

ii



Chapter 1

Introduction

This document investigates three classes of problems at the interplay of discrete algorithms, combinatorial optimization, and numerical methods. The general research area is called combinatorial scientific computing (CSC). In CSC, the contributions have a practical and a theoretical flavor. For all problems discussed in this document, we have the design, analysis, and implementation of algorithms along with many experiments. The theoretical results are included in this document, some with proofs; the reader is invited to consult the original papers for the omitted proofs. A similar approach is taken for presenting the experiments. While most results for observing theoretical findings in practice are included, the reader is referred to the original papers for some other results (e.g., run time analysis).

The three problem classes are those of partitioning, matching, and ordering. We cover two problems from the partitioning class, three from the matching class, and two from the ordering class. Below, we present these seven problems after introducing the notation and recalling the related definitions. We summarize the computing contexts in which the problems arise and highlight our contributions.

1.1 Directed graphs

A directed graph G = (V, E) is a set V of vertices and a set E of directed edges of the form e = (u, v), where e is directed from u to v. A path is a sequence of edges (u1, v1) · (u2, v2) · · · with vi = ui+1. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) is of length ℓ, where it connects a sequence of ℓ + 1 vertices (u1, v1 = u2, . . . , vℓ−1 = uℓ, vℓ). A path is called simple if the connected vertices are distinct. Let u ⇝ v denote a simple path that starts from u and ends at v. We say that a vertex v is reachable from another vertex u if a path connects them. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) forms a (simple) cycle if all vi for 1 ≤ i ≤ ℓ are distinct and u1 = vℓ. A directed acyclic graph, DAG in short, is a directed graph with no cycles. A directed graph is strongly connected if every vertex is reachable from every other vertex.

A directed graph G′ = (V′, E′) is said to be a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. We call G′ the subgraph of G induced by V′, denoted as G[V′], if E′ = E ∩ (V′ × V′). For a vertex set S ⊆ V, we define G − S as the subgraph of G induced by V − S, i.e., G[V − S]. A vertex set C ⊆ V is called a strongly connected component of G if the induced subgraph G[C] is strongly connected and C is maximal, in the sense that for all C′ ⊃ C, G[C′] is not strongly connected. A transitive reduction G0 is a minimal subgraph of G such that if u is connected to v in G, then u is connected to v in G0 as well. If G is acyclic, then its transitive reduction is unique.

is a minimal subgraph of G such that if u is connected to v in G then u isconnected to v in G0 as well If G is acyclic then its transitive reduction isunique

The path u ⇝ v represents a dependency of v on u. We say that the edge (u, v) is redundant if there exists another u ⇝ v path in the graph. That is, when we remove a redundant (u, v) edge, u remains connected to v, and hence the dependency information is preserved. We use Pred[v] = {u | (u, v) ∈ E} to represent the (immediate) predecessors of a vertex v, and Succ[v] = {u | (v, u) ∈ E} to represent the (immediate) successors of v. We call the neighbors of a vertex v its immediate predecessors and immediate successors: Neigh[v] = Pred[v] ∪ Succ[v]. For a vertex u, the set of vertices v such that u ⇝ v are called the descendants of u. Similarly, the set of vertices v such that v ⇝ u are called the ancestors of the vertex u. The vertices without any predecessors (and hence ancestors) are called the sources of G, and vertices without any successors (and hence descendants) are called the targets of G. Every vertex u has a weight denoted by wu, and every edge (u, v) ∈ E has a cost denoted by cuv.

A k-way partitioning of a graph G = (V, E) divides V into k disjoint subsets V1, . . . , Vk. The weight of a part Vi, denoted by w(Vi), is equal to ∑_{u ∈ Vi} wu, which is the total vertex weight in Vi. Given a partition, an edge is called a cut edge if its endpoints are in different parts. The edge cut of a partition is defined as the sum of the costs of the cut edges. Usually, a constraint on the part weights accompanies the problem. We are interested in acyclic partitions, which are defined below.

Definition 1.1 (Acyclic k-way partition). A partition {V1, . . . , Vk} of G = (V, E) is called an acyclic k-way partition if two paths u ⇝ v and v′ ⇝ u′ do not co-exist for u, u′ ∈ Vi, v, v′ ∈ Vj, and 1 ≤ i ≠ j ≤ k.

In other words, for a suitable numbering of the parts, all edges should be directed from a vertex in a part p to another vertex in a part q, where p ≤ q. Let us consider a toy example shown in Fig. 1.1a. A partition of the vertices of this graph is shown in Fig. 1.1b with a dashed curve. Since there is a cut edge from s to u and another from u to t, the partition is cyclic. An acyclic partition is shown in Fig. 1.1c, where all the cut edges are from one part to the other.
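
The condition in Definition 1.1 can be checked directly: contract each part to a single node and test whether the resulting quotient graph on the parts is acyclic. The following minimal Python sketch illustrates such a check; the function name and the representation (an edge list plus a part array) are our illustrative choices, not notation from this document, and the toy data mimics the s, u, t situation described for Fig. 1.1.

def is_acyclic_partition(num_parts, edges, part):
    """Return True if the k-way partition given by `part` is acyclic.

    edges: iterable of (u, v) pairs of a DAG; part[v] is the part id of v.
    Builds the quotient graph on the parts and runs DFS-based cycle detection.
    """
    succ = [set() for _ in range(num_parts)]
    for u, v in edges:
        if part[u] != part[v]:
            succ[part[u]].add(part[v])

    WHITE, GRAY, BLACK = 0, 1, 2
    color = [WHITE] * num_parts

    def has_cycle(p):
        color[p] = GRAY
        for q in succ[p]:
            if color[q] == GRAY:              # back edge: a cycle among parts
                return True
            if color[q] == WHITE and has_cycle(q):
                return True
        color[p] = BLACK
        return False

    return not any(color[p] == WHITE and has_cycle(p) for p in range(num_parts))

# Three vertices s, u, t (as 0, 1, 2) with edges s->u, u->t, s->t.
edges = [(0, 1), (1, 2), (0, 2)]
print(is_acyclic_partition(2, edges, [0, 1, 0]))   # False: {s, t} / {u} is cyclic
print(is_acyclic_partition(2, edges, [0, 0, 1]))   # True:  {s, u} / {t} is acyclic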


Figure 1.1: A toy graph with six vertices and six edges, a cyclic partitioning, and an acyclic partitioning. ((a) the toy graph; (b) a partition ignoring the directions, which is cyclic; (c) an acyclic partitioning.)

There is a related notion in the literature [87], which is called a convex partition. A partition is convex if, for all vertex pairs u, v in the same part, the vertices in any u ⇝ v path are also in the same part. Hence, if a partition is acyclic, it is also convex. On the other hand, convexity does not imply acyclicity. Figure 1.2 shows that the definitions of an acyclic partition and a convex partition are not equivalent. For the toy graph in Fig. 1.2a, there are three possible balanced partitions, shown in Figures 1.2b, 1.2c, and 1.2d. They are all convex, but only the one in Fig. 1.2d is acyclic.

Deciding on the existence of a k-way acyclic partition respecting an upper bound on the part weights and an upper bound on the cost of cut edges is NP-complete [93, ND15]. In Chapter 2, we treat this problem, which is formalized as follows.

Problem 1 (DAGP: DAG partitioning). Given a directed acyclic graph G = (V, E) with weights on vertices and an imbalance parameter ε, find an acyclic k-way partition {V1, . . . , Vk} of V such that the balance constraints

w(Vi) ≤ (1 + ε) w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut, which is equal to the number of edges whose end points are in two different parts, is minimized.

The DAGP problem arises in exposing parallelism in automatic differentiation [53, Ch. 9], and particularly in the computation of the Newton step for solving nonlinear systems [51, 52]. The DAGP problem with some additional constraints is used to reason about the parallel data movement complexity and to dynamically analyze the data locality potential [84, 87]. Other important applications of the DAGP problem include (i) fusing loops for improving temporal locality and enabling streaming and array contractions in runtime systems [144, 145]; (ii) analysis of cache efficient execution of streaming applications on uniprocessors [4]; (iii) a number of circuit design applications in which the signal directions impose an acyclic partitioning requirement [54, 213].

Figure 1.2: A toy graph on the vertices a, b, c, d (a), two cyclic and convex partitions (b, c), and an acyclic and convex partition (d).

In Chapter 2, we address the DAGP problem. We adopt the multilevel approach [35, 113] for this problem. This approach is the de-facto standard for solving graph and hypergraph partitioning problems efficiently, and it is used by almost all current state-of-the-art partitioning tools [27, 40, 113, 130, 186, 199]. The algorithms following this approach have three phases: the coarsening phase, which reduces the number of vertices by clustering them; the initial partitioning phase, which finds a partition of the coarsest graph; and the uncoarsening phase, in which the initial partition is projected to the finer graphs and refined along the way, until a solution for the original graph is obtained. We focus on two-way partitioning (sometimes called bisection), as this scheme can be used in a recursive way for multiway partitioning. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. We also propose effective ways to use the standard undirected graph partitioning methods in our multilevel scheme. We perform a large set of experiments on a dataset and report significant improvements compared to the current state of the art.

A rooted tree T is an ordered triple (V, r, ρ), where V is the vertex set, r ∈ V is the root vertex, and ρ gives the immediate ancestor/descendant relation. For two vertices vi, vj such that ρ(vj) = vi, we call vi the parent vertex of vj and call vj a child vertex of vi. With this definition, the root r is the ancestor of every other node in T. The children vertices of a vertex are called sibling vertices to each other. The height of T, denoted as h(T), is the number of vertices in the longest path from a leaf to the root r.

Let G = (V, E) be a directed graph with n vertices which are ordered (or labeled as 1, . . . , n). The elimination tree of a directed graph is defined as follows [81]. For any i ≠ n, ρ(i) = j whenever j > i is the minimum vertex id such that the vertices i and j lie in the same strongly connected component of the subgraph G[{1, . . . , j}]; that is, i and j are in the same strongly connected component of the subgraph containing the first j vertices, and j is the smallest such number. Currently, the algorithm with the best worst-case time complexity has a run time of O(m log n) [133]; another algorithm, with a worst-case time complexity of O(mn) [83], also performs well in practice. As the definition of the elimination tree implies, the properties of the tree depend on the ordering of the graph.
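
The definition translates directly into a slow but simple procedure: for each vertex i, scan j = i + 1, i + 2, . . . and stop at the first j for which i and j lie in the same strongly connected component of the subgraph induced by the first j vertices. The Python sketch below implements this definitional algorithm with plain BFS reachability; it only illustrates the definition (it is far from the O(m log n) algorithm cited above), and the function and variable names are ours.

from collections import deque

def reaches(src, dst, succ, limit):
    """BFS from src to dst using only vertices with id <= limit."""
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in succ[u]:
            if v <= limit and v not in seen:
                seen.add(v)
                queue.append(v)
    return False

def elimination_tree(n, edges):
    """Parent map of the elimination tree for a digraph on vertices 1..n.

    parent[i] = smallest j > i such that i and j are in the same strongly
    connected component of the subgraph induced by {1, ..., j}.
    """
    succ = {u: [] for u in range(1, n + 1)}
    for u, v in edges:
        succ[u].append(v)
    parent = {}
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            # same SCC of G[{1..j}]  iff  i reaches j and j reaches i there
            if reaches(i, j, succ, j) and reaches(j, i, succ, j):
                parent[i] = j
                break
    return parent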

The elimination tree is a recent model playing important roles in sparse LU factorization. This tree captures the dependencies between the tasks of some well-known variants of sparse LU factorization (we review these in Chapter 7). Therefore, the height of the elimination tree corresponds to the critical path length of the task dependency graph in the corresponding parallel LU factorization methods. In Chapter 7, we investigate the problem of finding minimum height elimination trees, formally defined below, to expose a maximum degree of parallelism by minimizing the critical path length.

Problem 2 (Minimum height elimination tree). Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The minimum height of an elimination tree of a given directed graph corresponds to a directed graph parameter known as the cycle-rank [80]. The problem being NP-complete [101], we propose heuristics in Chapter 7 which generalize the most successful approaches used for symmetric matrices to unsymmetric ones.

1.2 Undirected graphs

An undirected graph G = (V, E) is a set V of vertices and a set E of edges. Bipartite graphs are undirected graphs where the vertex set can be partitioned into two sets with all edges connecting vertices in the two parts. Formally, G = (VR ∪ VC, E) is a bipartite graph, where V = VR ∪ VC is the vertex set, and for each edge e = (r, c) we have r ∈ VR and c ∈ VC. The number of edges incident on a vertex is called its degree. A path in a graph is a sequence of vertices such that each consecutive vertex pair share an edge. A vertex is reachable from another one if there is a path between them. The connected components of a graph are the equivalence classes of vertices under the "is reachable from" relation. A cycle in a graph is a path whose start and end vertices are the same. A simple cycle is a cycle with no vertex repetitions. A tree is a connected graph with no cycles. A spanning tree of a connected graph G is a tree containing all vertices of G.

A matching M in a graph G = (V, E) is a subset of edges E where a vertex in V is in at most one edge in M. Given a matching M, a vertex v is said to be matched by M if v is in an edge of M; otherwise, v is called unmatched. If all the vertices are matched by M, then M is said to be a perfect matching. The cardinality of a matching M, denoted by |M|, is the number of edges in M. In Chapter 4, we treat the following problem, which asks to find approximate matchings.

Problem 3 (Approximating the maximum cardinality of a matching in undirected graphs). Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.

There are a number of polynomial time algorithms to solve the maximum cardinality matching problem exactly in graphs. The lowest worst-case time complexity of the known algorithms is O(√n τ) for a graph with n vertices and τ edges. For bipartite graphs, the first of such algorithms is described by Hopcroft and Karp [117]; certain variants of the push-relabel algorithm [96] also have the same complexity (we give a classification for maximum cardinality matchings in bipartite graphs in a recent paper [132]). For general graphs, algorithms with the same complexity are known [30, 92, 174]. There is considerable interest in simpler and faster algorithms that have some approximation guarantee [148]. Such algorithms are used as a jump-start routine by the current state of the art exact maximum matching algorithms [71, 148, 169]. Furthermore, there are applications [171] where large cardinality matchings are used.
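
A classical example of such a cheap jump-start heuristic is the greedy matching that scans the edges in some order and takes an edge whenever both of its endpoints are still free; it runs in linear time and is a 1/2-approximation. The Python sketch below is only a generic illustration of this folklore heuristic (the names are ours), not one of the specific heuristics proposed in Chapter 4.

def greedy_matching(edges):
    """Scan the edges once; take an edge if both endpoints are still free.

    Returns the list of matched edges (a maximal matching, hence at least
    half the size of a maximum matching).
    """
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matched.update((u, v))
            matching.append((u, v))
    return matching

# Example: a path a-b-c-d; greedy picks (a, b) and (c, d), a maximum matching here.
print(greedy_matching([("a", "b"), ("b", "c"), ("c", "d")]))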

The specializations of Problem 3 for bipartite graphs and for general undirected graphs are described in Section 4.2 and Section 4.3, respectively. Our focus is on approximation algorithms which expose parallelism and have linear run time in practical settings. We propose randomized heuristics and analyze their results. The proposed heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum cardinality matching on the subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge is obtained by scaling a suitable matrix to be doubly stochastic. The heuristics for the bipartite graphs and the general undirected graphs share this general approach and differ in the analysis. To the best of our knowledge, the proposed heuristics have the highest constant factor approximation ratio (around 0.86 for large graphs).
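
The doubly stochastic scaling mentioned above can be obtained with the Sinkhorn-Knopp iteration [202], which alternately normalizes the rows and the columns of a nonnegative matrix. A minimal dense NumPy sketch of the idea follows; the scaling routine used in Chapter 4 works on sparse matrices and uses a proper convergence test, so this is only an illustration, with our own naming and a fixed iteration count.

import numpy as np

def sinkhorn_knopp(A, iterations=100):
    """Iteratively normalize rows and columns of a nonnegative matrix A.

    Returns a (nearly) doubly stochastic matrix; convergence requires A to
    have total support.
    """
    S = np.array(A, dtype=float)
    for _ in range(iterations):
        S = S / S.sum(axis=1, keepdims=True)   # make row sums 1
        S = S / S.sum(axis=0, keepdims=True)   # make column sums 1
    return S

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
S = sinkhorn_knopp(A)
print(S.sum(axis=0), S.sum(axis=1))   # both close to all ones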

1.3 Hypergraphs

A hypergraph H = (V, E) is defined as a set of vertices V and a set of hyperedges E. Each hyperedge is a set of vertices. A matching in a hypergraph is a set of disjoint hyperedges.

A hypergraph H = (V, E) is called d-partite and d-uniform if V = ⋃_{i=1}^{d} Vi with disjoint Vi's, and every hyperedge contains a single vertex from each Vi. In Chapter 6, we investigate the problem of finding matchings in hypergraphs, formally defined as follows.

Problem 4 (Maximum matching in d-partite d-uniform hypergraphs). Given a d-partite d-uniform hypergraph, find a matching of maximum size.

The problem is NP-complete for d ≥ 3; the 3-partite case is called the Max-3-DM problem [126] and is an important problem in combinatorial optimization. It is a special case of the d-Set-Packing problem [110]. It has been shown that d-Set-Packing is hard to approximate within a factor of O(d / log d) [110]. The maximum/perfect set packing problem has many applications, including combinatorial auctions [98] and personnel scheduling [91]. Matchings in hypergraphs can also be used in the coarsening phase of multilevel hypergraph partitioning tools [40]. In particular, d-uniform and d-partite hypergraphs arise in modeling tensors [134], and heuristics for Problem 4 can be used to partition these hypergraphs by plugging them in the current state of the art partitioning tools. We propose heuristics for Problem 4. We first generalize some graph matching heuristics, then propose a novel heuristic based on tensor scaling to improve the quality via judicious hyperedge selections. Experiments on random, synthetic, and real-life hypergraphs show that this new heuristic is highly practical and superior to the others on finding a matching with large cardinality.
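
Chapter 6 uses a greedy baseline (Greedy-H) among its heuristics. The Python sketch below shows only the generic greedy idea for d-partite, d-uniform hypergraphs, namely traversing the hyperedges in some order and taking a hyperedge whenever all of its vertices are still free; it does not claim to reproduce the exact choices (e.g., the traversal order) of the heuristic in Chapter 6, and the data layout (one vertex per part in a d-tuple) is our illustrative assumption.

def greedy_hypergraph_matching(hyperedges):
    """Greedy matching in a d-partite, d-uniform hypergraph.

    hyperedges: list of d-tuples; position i holds the vertex of part i.
    A hyperedge is taken if, in every part, its vertex is not used yet.
    """
    d = len(hyperedges[0]) if hyperedges else 0
    used = [set() for _ in range(d)]
    matching = []
    for edge in hyperedges:
        if all(edge[i] not in used[i] for i in range(d)):
            for i in range(d):
                used[i].add(edge[i])
            matching.append(edge)
    return matching

# A 3-partite example; vertex ids are local to each part.
print(greedy_hypergraph_matching([(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)]))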

1.4 Sparse matrices

Bold upper case Roman letters are used for matrices, as in A. Matrix elements are shown with the corresponding lowercase letters, for example aij. Matrix sizes are sometimes shown in the lower right corner, e.g., A_{I×J}. Matlab notation is used to refer to the entire rows and columns of a matrix; e.g., A(i, :) and A(:, j) refer to the ith row and jth column of A, respectively. A permutation matrix is a square 0-1 matrix with exactly one nonzero in each row and each column. A sub-permutation matrix is a square 0-1 matrix with at most one nonzero in a row or a column.


A graph G = (V, E) can be represented as a sparse matrix AG, called the adjacency matrix. In this matrix, an entry aij = 1 if the vertices vi and vj in G are incident on a common edge, and aij = 0 otherwise. If G is bipartite with V = VR ∪ VC, we identify each vertex in VR with a unique row of AG and each vertex in VC with a unique column of AG. If G is not bipartite, a vertex is identified with a unique row-column pair. A directed graph GD = (V, E) with vertex set V and edge set E can be associated with an n × n sparse matrix A. Here |V| = n, and for each aij ≠ 0 where i ≠ j, we have a directed edge from vi to vj.

We also associate graphs with matrices. For a given m × n matrix A, we can associate a bipartite graph G = (VR ∪ VC, E) so that the zero-nonzero pattern of A is equivalent to AG. If the matrix is square and has a symmetric zero-nonzero pattern, we can similarly associate an undirected graph by ignoring the diagonal entries. Finally, if the matrix A is square and has an unsymmetric zero-nonzero pattern, we can associate a directed graph GD = (V, E), where (vi, vj) ∈ E iff aij ≠ 0.

An n × n matrix A ≠ 0 is said to have support if there is a perfect matching in the associated bipartite graph. An n × n matrix A is said to have total support if each edge in its bipartite graph can be put into a perfect matching. A square sparse matrix is called irreducible if its directed graph is strongly connected. A square sparse matrix A is called fully indecomposable if, for a permutation matrix Q, the matrix B = AQ has a zero-free diagonal and the directed graph associated with B is irreducible. Fully indecomposable matrices have total support, but a matrix with total support could be a block diagonal matrix where each block is fully indecomposable. For more formal definitions of support, total support, and full indecomposability, see for example the book by Brualdi and Ryser [34, Ch. 3 and Ch. 4].

Let A = [aij] ∈ R^{n×n}. The matrix A is said to be doubly stochastic if aij ≥ 0 for all i, j and Ae = A^T e = e, where e is the vector of all ones. By Birkhoff's Theorem [25], there exist α1, α2, . . . , αk ∈ (0, 1) with ∑_{i=1}^{k} αi = 1 and permutation matrices P1, P2, . . . , Pk such that

A = α1 P1 + α2 P2 + · · · + αk Pk .    (1.1)

This representation is also called the Birkhoff-von Neumann (BvN) decomposition. We refer to the scalars α1, α2, . . . , αk as the coefficients of the decomposition.

Doubly stochastic matrices and their associated BvN decompositions have been used in several operations research problems and applications. Classical examples are concerned with allocating communication resources, where an input traffic is routed to an output traffic in stages [43]. Each routing stage is a (sub-)permutation matrix and is used for handling a disjoint set of communications. The number of stages corresponds to the number of (sub-)permutation matrices. A recent variant of this allocation problem arises as a routing problem in data centers [146, 163]. Heuristics for the BvN decomposition have been used in designing business strategies for e-commerce retailers [149]. A display matrix (which shows what needs to be displayed in total) is constructed, which happens to be doubly stochastic. Then a BvN decomposition of this matrix is obtained, in which each permutation matrix used is a display shown to a shopper. A small number of permutation matrices is preferred; furthermore, only slight changes between consecutive permutation matrices are desired (we do not deal with this last issue in this document).

The BvN decomposition of a doubly stochastic matrix as a convex combination of permutation matrices is not unique in general. The Marcus-Ree Theorem [170] states that there is always a decomposition with k ≤ n² − 2n + 2 permutation matrices for dense n × n matrices. Brualdi and Gibson [33] and Brualdi [32] show that for a fully indecomposable n × n sparse matrix with τ nonzeros, one has the equivalent expression k ≤ τ − 2n + 2. In Chapter 5, we investigate the problem of finding the minimum number k of permutation matrices in the representation (1.1). More formally, the MinBvNDec problem is defined as follows.

Problem 5 (MinBvNDec: A BvN decomposition with the minimum number of permutation matrices). Given a doubly stochastic matrix A, find a BvN decomposition (1.1) of A with the minimum number k of permutation matrices.

Brualdi [32, p. 197] investigates the same problem and concludes that this is a difficult problem. We continue along this line and show that the MinBvNDec problem is NP-hard. We also propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries, and use this generalization for designing preconditioners for iterative linear system solvers.
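
To make the decomposition (1.1) concrete, the classical constructive proof of Birkhoff's theorem already yields an algorithm: find a perfect matching on the nonzero entries, use it as a permutation matrix, subtract it scaled by its smallest matched entry, and repeat. The Python sketch below follows this constructive scheme; it is not the GreedyBvN heuristic of Chapter 5 (which selects the matchings more carefully), and the helper names are ours.

import numpy as np

def perfect_matching_on_nonzeros(A, tol=1e-12):
    """Column choice for each row, using only entries > tol (augmenting paths)."""
    n = A.shape[0]
    match_col = [-1] * n              # match_col[c] = row matched to column c
    def augment(r, seen):
        for c in range(n):
            if A[r, c] > tol and c not in seen:
                seen.add(c)
                if match_col[c] == -1 or augment(match_col[c], seen):
                    match_col[c] = r
                    return True
        return False
    for r in range(n):
        augment(r, set())
    perm = [-1] * n
    for c, r in enumerate(match_col):
        perm[r] = c
    return perm                        # perm[r] = column selected in row r

def bvn_decomposition(A, tol=1e-12):
    """Return [(alpha, perm), ...] with A = sum of alpha * P_perm, A doubly stochastic."""
    R = np.array(A, dtype=float)
    terms = []
    while R.max() > tol:
        perm = perfect_matching_on_nonzeros(R, tol)
        alpha = min(R[r, perm[r]] for r in range(R.shape[0]))
        for r, c in enumerate(perm):
            R[r, c] -= alpha           # peel off alpha times this permutation
        terms.append((alpha, perm))
    return terms

A = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
for alpha, perm in bvn_decomposition(A):
    print(alpha, perm)                 # two permutation matrices, each with coefficient 0.5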

1.5 Sparse tensors

Tensors are multidimensional arrays, generalizing matrices to higher orders. We use calligraphic font to refer to tensors, e.g., X. Let X be a d-dimensional tensor whose size is n1 × · · · × nd. As in matrices, an element of a tensor is denoted by a lowercase letter and subscripts corresponding to the indices of the element, e.g., x_{i1···id}, where ij ∈ {1, . . . , nj}. A marginal of a tensor is a (d−1)-dimensional slice of a d-dimensional tensor, obtained by fixing one of its indices. A d-dimensional tensor where the entries in each of its marginals sum to one is called d-stochastic. In a d-stochastic tensor, all dimensions necessarily have the same size n. A d-stochastic tensor where each marginal contains exactly one nonzero entry (equal to one) is called a permutation tensor.

A d-partite, d-uniform hypergraph H = (V1 ∪ · · · ∪ Vd, E) corresponds naturally to a d-dimensional tensor X, by associating each vertex class with a dimension of X. Let |Vi| = ni. Let X ∈ {0, 1}^{n1×···×nd} be a tensor with nonzero elements x_{v1···vd} corresponding to the edges (v1, . . . , vd) of H. In this case, X is called the adjacency tensor of H.

Tensors are used in modeling multifactor or multirelational datasets, such as product reviews at an online store [21], natural language processing [123], and multi-sensor data in signal processing [49], among many others [142]. Tensor decomposition algorithms are used as an important tool to understand the tensors and glean hidden or latent information. One of the most common tensor decomposition approaches is the Candecomp/Parafac (CP) formulation, which approximates a given tensor as a sum of rank-one tensors. More precisely, for a given positive integer R, called the decomposition rank, the CP decomposition approximates a d-dimensional tensor by a sum of R outer products of column vectors (one for each dimension). In particular, for a three dimensional I × J × K tensor X, CP asks for a rank-R approximation X ≈ ∑_{r=1}^{R} ar ∘ br ∘ cr with vectors ar, br, and cr of size I, J, and K, respectively. Here ∘ denotes the outer product of the vectors, and hence x_{ijk} ≈ ∑_{r=1}^{R} ar(i) · br(j) · cr(k). By aggregating these column vectors, one forms the matrices A, B, and C of size I × R, J × R, and K × R, respectively. The matrices A, B, and C are called the factor matrices.

The computational core of the most common CP decomposition methods is an operation called the matricized tensor-times-Khatri-Rao product (Mttkrp). For the same tensor X, this core operation can be described succinctly as

foreach x_{ijk} ∈ X do
    MA(i, :) ← MA(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]    (1.2)

where B and C are input matrices of sizes J × R and K × R, respectively; MA is the output matrix of size I × R, which is initialized to zero before the loop. For each tensor nonzero x_{ijk}, the jth row of the matrix B is multiplied element-wise with the kth row of the matrix C, the result is scaled with the tensor entry x_{ijk} and added to the ith row of MA. In the well-known CP decomposition algorithm based on alternating least squares (CP-ALS), the two factor matrices B and C are initialized at the beginning; then the factor matrix A is computed by multiplying MA with the (pseudo-)inverse of the Hadamard product of B^T B and C^T C. Since B^T B and C^T C are of size R × R, their multiplication and inversion do not pose any computational challenges. Afterwards, the same steps are performed to compute a new B and then a new C, and this process of computing the new factors one at a time by assuming the others fixed is repeated iteratively. In Chapter 3, we investigate how to efficiently parallelize the Mttkrp operations within the CP-ALS algorithm in distributed memory systems. In particular, we investigate the following problem.

Problem 6 (Parallelizing CP decomposition). Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

More precisely, in Chapter 3 we describe parallel implementations of CP-ALS and show how the parallelization problem can be cast as a partitioning problem in hypergraphs. In this document, we do not develop heuristics for the hypergraph partitioning problem; we describe hypergraphs modeling the computations and make use of the well-established tools for partitioning these hypergraphs. The Mttkrp operation also arises in the gradient descent based methods [2] to compute CP decomposition. Furthermore, it has been implemented as a stand-alone routine [17] to enable algorithm development for variants of the core methods [158]. That is why it has been investigated for efficient execution in different computational frameworks, such as Matlab [11, 18], MapReduce [123], shared memory [206], and distributed memory [47].
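
The loop (1.2) is easy to state as code for a tensor stored in coordinate (COO) form. The NumPy sketch below is a sequential illustration of the Mttkrp for the first mode of a third-order tensor, following the displayed loop; it is not the distributed-memory implementation discussed in Chapter 3, and the variable names are ours.

import numpy as np

def mttkrp_mode1(I, coords, values, B, C):
    """Mttkrp for the first mode of a third-order sparse tensor in COO form.

    coords: (i, j, k) index triples; values: the corresponding nonzeros.
    B is J x R and C is K x R. Returns MA of size I x R.
    """
    R = B.shape[1]
    MA = np.zeros((I, R))
    for (i, j, k), x in zip(coords, values):
        MA[i, :] += x * (B[j, :] * C[k, :])   # elementwise product of two rows
    return MA

# A toy 2 x 2 x 2 tensor with two nonzeros and rank R = 2.
coords = [(0, 0, 1), (1, 1, 0)]
values = [2.0, 3.0]
B = np.arange(4.0).reshape(2, 2)
C = np.ones((2, 2))
print(mttkrp_mode1(2, coords, values, B, C))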

Consider again the Mttkrp operation and the execution of the operations as shown in the body of the displayed for-loop (1.2). As seen in this loop, the rows of the factor matrices and the output matrix are accessed many times, according to the sparsity pattern of the tensor X. In a cache-based system, the computations will be much more efficient if we take advantage of spatial and temporal locality. If we can reorder the nonzeros such that the multiplications involving a given row (of MA, B, and C) are one after another, this would be achieved. We need this to hold when executing the Mttkrps for computing the new factor matrices B and C as well. The issue is more easily understood in terms of the sparse matrix-vector multiplication operation y ← Ax. If we have two nonzeros in the same row, aij and aik, we would prefer an ordering that maps j and k next to each other and performs the associated scalar multiplications a_{ij′} x_{j′} and a_{ik′} x_{k′} one after the other, where j′ is the new index id for j and k′ is the new index id for k. Furthermore, we would like the same sort of locality for all indices, and for the multiplication x ← A^T y as well. We identify two related problems here. One of them is the choice of data structure for taking advantage of shared indices, and the other is to reorder the indices so that nonzeros sharing an index in a dimension have close-by indices in the other dimensions. In this document, we do not propose a data structure; our aim is to reorder the indices such that the locality is improved for three common data structures used in the current state of the art. We summarize those three common data structures in Chapter 8 and investigate the following ordering problem. Since the overall problem is based on the data structure choice, we state the problem in a general language.

Problem 7 (Tensor ordering). Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

If we look at the same problem for matrices, the problem can be cast as ordering the rows and the columns of a given matrix so that the nonzeros are clustered around the diagonal. This is also what we seek in the tensor case, where different data structures can benefit from the same ordering thanks to different effects. In Chapter 8, we investigate this problem algorithmically and develop fast heuristics. While the problem in the case of sparse matrices has been investigated from different angles, to the best of our knowledge, ours is the first study proposing general ordering tools for tensors, which are analogues of the well-known and widely used sparse matrix ordering heuristics, such as Cuthill-McKee [55] and reverse Cuthill-McKee [95].
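
For the matrix analogy, a Cuthill-McKee-style ordering can be sketched in a few lines: run a breadth-first search over the graph of the symmetric sparsity pattern and number the vertices in the order they are visited (reversing that order gives the reverse Cuthill-McKee variant). The Python sketch below illustrates this idea on an adjacency-list representation; it omits the degree-based tie-breaking of the full heuristic, and the names are ours.

from collections import deque

def bfs_ordering(n, adjacency, start=0):
    """Number vertices of a symmetric sparsity pattern in BFS order.

    adjacency maps a vertex to its neighbors. Returns the old-to-new index
    permutation; reverse it for a reverse Cuthill-McKee style ordering.
    """
    order, seen = [], set()
    for root in [start] + list(range(n)):       # also cover disconnected patterns
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in sorted(adjacency[u]):       # plain sort instead of degree sort
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    perm = [0] * n
    for new_id, old_id in enumerate(order):
        perm[old_id] = new_id
    return perm

# A 5-vertex path given with scrambled labels; BFS from 0 relabels it contiguously.
adj = {0: [2], 2: [0, 4], 4: [2, 1], 1: [4, 3], 3: [1]}
print(bfs_ordering(5, adj))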

Part I

Partitioning

Chapter 2

Multilevel algorithms for acyclic partitioning of directed acyclic graphs

In this chapter, we address the directed acyclic graph partitioning (DAGP) problem. Most of the terms and definitions are standard; we summarized them in Section 1.1. The formal problem definition is repeated below for convenience.

DAG partitioning problem. Given a DAG G = (V, E) and an imbalance parameter ε, find an acyclic k-way partition {V1, . . . , Vk} of V such that the balance constraints

w(Vi) ≤ (1 + ε) w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut is minimized.

We adopt the multilevel approach [35, 113], with the coarsening, initial partitioning, and refinement phases, for acyclic partitioning of DAGs. We propose heuristics for these three phases (Sections 2.2.1, 2.2.2, and 2.2.3, respectively), which guarantee acyclicity of the partitions at all phases and maintain a DAG at every level. We strove to have fast heuristics at the core. With these characterizations, the coarsening phase requires new algorithmic/theoretical reasoning, while the initial partitioning and refinement heuristics are direct adaptations of the standard methods used in undirected graph partitioning, with some differences worth mentioning. We discuss only the bisection case, as we were able to improve the direct k-way algorithms we proposed before [114] by using the bisection heuristics recursively.

The acyclicity constraint on the partitions forbids the use of the state of the art undirected graph partitioning tools. This has been recognized before, and those tools were put aside [114, 175]. While this is true, one can still try to make use of the existing undirected graph partitioning tools [113, 130, 186, 199], as they have been very well engineered. Let us assume that we have partitioned a DAG with an undirected graph partitioning tool into two parts by ignoring the directions. It is easy to detect if the partition is cyclic, since all the edges need to go from the first part to the second one. If a partition is cyclic, we can easily fix it as follows. Let v be a vertex in the second part; we can move all vertices to which there is a path from v into the second part. This procedure breaks any cycle containing v, and hence the partition becomes acyclic. However, the edge cut may increase and the partitions can be unbalanced. To solve the balance problem and reduce the cut, we can apply move-based refinement algorithms. After this step, the final partition meets the acyclicity and balance conditions. Depending on the structure of the input graph, it could also be a good initial partition for reducing the edge cut. Indeed, one of our most effective schemes uses an undirected graph partitioning algorithm to create a (potentially cyclic) partition, fixes the cycles in the partition, and refines the resulting acyclic partition with a novel heuristic to obtain an initial partition. We then integrate this partition within the proposed coarsening approaches to refine it at different granularities.
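
The cycle-fixing step described above can be written compactly: every vertex reachable from a part-1 vertex is pushed into part 1 as well, so all remaining cut edges go from part 0 to part 1. Below is a minimal Python sketch of this repair for a bisection; the function name and the edge-list input are our illustrative choices, and balance is deliberately ignored here, since the text restores it later with refinement.

from collections import deque

def fix_acyclicity(n, edges, part):
    """Repair a possibly cyclic bisection of a DAG (parts 0 and 1).

    Moves every vertex reachable from a part-1 vertex into part 1, so the
    resulting partition is acyclic (it may be unbalanced).
    """
    succ = [[] for _ in range(n)]
    for u, v in edges:
        succ[u].append(v)
    part = list(part)
    queue = deque(v for v in range(n) if part[v] == 1)
    while queue:
        u = queue.popleft()
        for w in succ[u]:
            if part[w] == 0:
                part[w] = 1            # descendant of a part-1 vertex: move it
                queue.append(w)
    return part

# Toy DAG: 0 -> 1 -> 2 and 0 -> 2; the bisection {0, 2} / {1} is cyclic,
# and the repair moves vertex 2 next to vertex 1.
print(fix_acyclicity(3, [(0, 1), (1, 2), (0, 2)], [0, 1, 0]))   # [0, 1, 1]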

2.1 Related work

Fauzia et al. [87] propose a heuristic for the acyclic partitioning problem to optimize data locality when analyzing DAGs. To create partitions, the heuristic categorizes a vertex as ready to be assigned to a partition when all of the vertices it depends on have already been assigned. Vertices are assigned to the current partition set until the maximum number of vertices that would be "active" during the computation of the part reaches a specified limit, which is the cache size in their application. This implies that a part size is not limited by the sum of the total vertex weights in that part, but is a complex function that depends on an external schedule (order) of the vertices. This differs from our problem, as we limit the size of each part by the total sum of the weights of the vertices in that part.

Kernighan [137] proposes an algorithm to find a minimum edge-cut partition of the vertices of a graph into subsets of size greater than a lower bound and smaller than an upper bound. The partition needs to use a fixed vertex sequence that cannot be changed. Indeed, Kernighan's algorithm takes a topological order of the vertices of the graph as an input and partitions the vertices such that all vertices in a subset constitute a continuous block in the given topological order. This procedure is optimal for a given, fixed topological order, and has a run time proportional to the number of edges in the graph if the part weights are taken as constant. We used a modified version of this algorithm as a heuristic in the earlier version of our work [114].
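
To illustrate the flavor of such topological-order-based partitioning, the following Python sketch greedily cuts a given topological order into consecutive blocks that respect an upper bound on the part weight. This is only a simple greedy illustration, not Kernighan's algorithm nor the modified version used in the earlier work, and the names are ours.

def split_topological_order(topo_order, weights, max_part_weight):
    """Cut a topological order into consecutive blocks of bounded weight.

    Because every part is a contiguous block of a topological order, all cut
    edges point from an earlier part to a later one, so the partition is acyclic.
    """
    parts, current, current_weight = [], [], 0
    for v in topo_order:
        if current and current_weight + weights[v] > max_part_weight:
            parts.append(current)
            current, current_weight = [], 0
        current.append(v)
        current_weight += weights[v]
    if current:
        parts.append(current)
    return parts

# Unit-weight vertices in topological order, at most 2 per part.
print(split_topological_order([0, 1, 2, 3, 4], {v: 1 for v in range(5)}, 2))
# [[0, 1], [2, 3], [4]]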

Cong et al. [54] describe two approaches for obtaining acyclic partitions of directed Boolean networks modeling circuits. The first one is a single-level Fiduccia-Mattheyses (FM)-based approach. In this approach, Cong et al. generate an initial acyclic partition by splitting the list of the vertices (in a topological order) from left to right into k parts, such that the weight of each part does not violate the bound. The quality of the results is then improved with a k-way variant of the FM heuristic [88], taking the acyclicity constraint into account. Our previous work [114] employs a similar refinement heuristic. The second approach of Cong et al. is a two-level heuristic: the initial graph is first clustered with a special decomposition, and then it is partitioned using the first heuristic.

In a recent paper [175], Moreira et al. focus on an imaging and computer vision application on embedded systems and discuss acyclic partitioning heuristics. They propose a single level approach in which an initial acyclic partitioning is obtained using a topological order. To refine the partitioning, they propose four local search heuristics which respect the balance constraint and maintain the acyclicity of the partition. Three heuristics pick a vertex and move it to an eligible part if and only if the move improves the cut. These three heuristics differ in choosing the set of eligible parts for each vertex: some are very restrictive, and some allow arbitrary target parts as long as acyclicity is maintained. The fourth heuristic tentatively realizes the moves that increase the cut, in order to escape from a possible local minimum. It has been reported that this heuristic delivers better results than the others. In a follow-up paper, Moreira et al. [176] discuss a multilevel graph partitioner and an evolutionary algorithm based on this multilevel scheme. Their multilevel scheme starts with a given acyclic partition. Then, the coarsening phase contracts edges that are in the same part until there is no edge to contract. Here, matching-based heuristics from undirected graph partitioning tools are used without taking the directions of the edges into account. Therefore, the coarsening phase can create cycles in the graph; however, the induced partitions are never cyclic. Then, an initial partition is obtained, which is refined during the uncoarsening phase with move-based heuristics. In order to guarantee acyclic partitions, the vertices that lie in cycles are not moved. In a systematic evaluation of the proposed methods, Moreira et al. note that there are many local minima and suggest using relaxed constraints in the multilevel setting. The proposed methods have high run times, as the evolutionary method of Moreira et al. is not concerned with this issue. Improvements with respect to the earlier work [175] are reported.

We propose methods to use an undirected graph partitioner to guide the multilevel partitioner. We focus on partitioning the graph into two parts, since one can handle the general case with a recursive bisection scheme. We also propose new coarsening, initial partitioning, and refinement methods


specifically designed for the 2-partitioning problem. Our multilevel scheme maintains acyclic partitions and graphs through all the levels.

2.2 Directed multilevel graph partitioning

Here we summarize the proposed methods for the three phases of the multilevel scheme.

2.2.1 Coarsening

In this phase, we obtain smaller DAGs by coalescing the vertices, level by level. This phase continues until the number of vertices becomes smaller than a specified bound, or the reduction in the number of vertices from one level to the next one is lower than a threshold. At each level ℓ, we start with a finer acyclic graph Gℓ, compute a valid clustering Cℓ ensuring the acyclicity, and obtain a coarser acyclic graph Gℓ+1. While our previous work [114] discussed matching based algorithms for coarsening, we present agglomerative clustering based variants here. The new variants supersede the matching based ones. Unlike the standard undirected graph case, in DAG partitioning not all vertices can be safely combined. Consider a DAG with three vertices a, b, c and three edges (a, b), (b, c), (a, c). Here, the vertices a and c cannot be combined, since that would create a cycle. We say that a set of vertices is contractible (all its vertices are matchable) if unifying them does not create a cycle. We now present a general theory about finding clusters without forming cycles, after giving some definitions.

Definition 2.1 (Clustering). A clustering of a DAG is a set of disjoint subsets of vertices. Note that we do not make any assumptions on whether the subsets are connected or not.

Definition 2.2 (Coarse graph). Given a DAG G and a clustering C of G, we let G|C denote the coarse graph created by contracting all sets of vertices of C.

The vertices of the coarse graph are the clusters in C. If (u, v) ∈ E for two vertices u and v that are located in different clusters of C, then G|C has a (directed) edge from the vertex corresponding to u's cluster to the vertex corresponding to v's cluster.

Definition 2.3 (Feasible clustering). A feasible clustering C of a DAG G is a clustering such that G|C is acyclic.

Theorem 2.1. Let G = (V, E) be a DAG. For u, v ∈ V and (u, v) ∈ E, the coarse graph G|{{u,v}} is acyclic if and only if every path from u to v in G contains the edge (u, v).


Theorem 2.1, whose proof is in the journal version of this chapter [115], can be extended to a set of vertices by noting that this time all paths connecting two vertices of the set should contain only the vertices of the set. Neither the theorem nor its extension implies an efficient algorithm, as it requires at least one transitive reduction. Furthermore, it does not describe a condition about two clusters forming a cycle, even if both are individually contractible. In order to address both of these issues, we put a constraint on the vertices that can form a cluster, based on the following definition.

Definition 2.4 (Top level value). For a DAG G = (V, E), the top level value of a vertex u ∈ V is the length of the longest path from a source of G to that vertex. The top level values of all vertices can be computed in a single traversal of the graph with a complexity of O(|V| + |E|). We use top[u] to denote the top level of the vertex u.
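For illustration, a minimal sketch of this computation is given below; it assumes the DAG is stored as successor adjacency lists, and the function and variable names are illustrative (they are not part of dagP).

from collections import deque

def top_levels(succ):
    # succ: dict mapping each vertex to the list of its successors
    indeg = {v: 0 for v in succ}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1
    top = {v: 0 for v in succ}                 # sources have top level 0
    queue = deque(v for v in succ if indeg[v] == 0)
    while queue:                               # Kahn-style topological traversal
        u = queue.popleft()
        for v in succ[u]:
            top[v] = max(top[v], top[u] + 1)   # longest path from a source
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return top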

By restricting the set of edges considered in the clustering to the edges (u, v) ∈ E such that top[u] + 1 = top[v], we ensure that no cycles are formed by contracting a unique cluster (the condition identified in Theorem 2.1 is satisfied). Let C be a clustering of the vertices. Every edge in a cluster of C being contractible is a necessary condition for C to be feasible, but not a sufficient one. More restrictions on the edges of vertices inside the clusters should be found to ensure that C is feasible. We propose three coarsening heuristics based on clustering sets of more than two vertices, whose pairwise top level differences are always zero or one.

CoTop Acyclic clustering with forbidden edges

To have an efficient clustering heuristic, we restrict ourselves to static information computable in linear time. As stated in the introduction of this section, we rely on the top level difference of one (or less) for all vertices in the same cluster, and on an additional condition to ensure that there will be no cycles when a number of clusters are contracted simultaneously. In Theorem 2.2 below, we give two sufficient conditions for a clustering to be feasible (that is, the graphs at all levels are DAGs); the proof is in the associated journal paper [115].

Theorem 2.2 (Correctness of the proposed clustering). Let G = (V, E) be a DAG and C = {C1, . . . , Ck} be a clustering. If C is such that

• for any cluster Ci, for all u, v ∈ Ci, |top[u] − top[v]| ≤ 1;

• for two different clusters Ci and Cj, and for all u ∈ Ci and v ∈ Cj, either (u, v) ∉ E or top[u] ≠ top[v] − 1;

then the coarse graph G|C is acyclic.


We design a heuristic based on Theorem 2.2. This heuristic visits all vertices, which are singletons at the beginning, in an order, and adds the visited vertex to a cluster if the criteria mentioned in Theorem 2.2 are met; if not, the vertex stays as a singleton. Among the clusters satisfying the mentioned criteria, the one which saves the largest edge weight (incident on the vertices of the cluster and the vertex being visited) is chosen as the target cluster. We discuss a set of bookkeeping operations in order to implement this heuristic in linear, O(|V| + |E|), time [115, Algorithm 1].
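To make the acceptance test concrete, the sketch below checks whether adding a vertex u to an existing cluster keeps both conditions of Theorem 2.2. It re-scans incident edges instead of using the linear-time bookkeeping of [115, Algorithm 1], and all names are illustrative.

def can_join(u, c, cluster, members, top, succ, pred):
    # cluster[v]: cluster id of v, or None while v is still a singleton
    # members[c]: set of vertices currently in cluster c
    levels = {top[v] for v in members[c]} | {top[u]}
    if max(levels) - min(levels) > 1:            # condition 1 of Theorem 2.2
        return False
    # condition 2: no edge with consecutive top levels may connect two different clusters
    for w in members[c] | {u}:
        for y in succ[w]:
            if top[y] == top[w] + 1 and y != u and cluster[y] not in (None, c):
                return False
        for x in pred[w]:
            if top[x] == top[w] - 1 and x != u and cluster[x] not in (None, c):
                return False
    return True
# The test is meant to be applied incrementally, as the heuristic in the text
# considers the visited vertex against each candidate neighboring cluster.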

CoCyc Acyclic clustering with cycle detection

We now propose a less restrictive clustering algorithm which also maintains the acyclicity. We again rely on the top level difference of one (or less) for all vertices in the same cluster. Knowing this invariant, when a new vertex is added to a cluster, a cycle-detection algorithm checks that no cycles are formed when all the clusters are contracted simultaneously. This algorithm does not traverse the entire graph, again by using the fact that the top level difference within a cluster is at most one. Nonetheless, detecting such cycles could require a full traversal of the graph.

From Theorem 2.2, we know that if adding a vertex to a cluster whose vertices' top level values are t and t + 1 creates a cycle in the contracted graph, then this cycle goes through the vertices with top level values t or t + 1. Thus, when considering the addition of a vertex u to a cluster C containing v, we check potential cycle formations by traversing the graph starting from u in a breadth-first search (BFS) manner. Let t denote the minimum top level in C. At a vertex w, we normally add a successor y of w to a queue (as in BFS) if |top[y] − t| ≤ 1; if w is in the same cluster as one of its predecessors x, we also add x to the queue if |top[x] − t| ≤ 1. We detect a cycle if, at some point in the traversal, a vertex from cluster C is reached; if not, the cluster can be extended by adding u. In the worst case, the cycle detection algorithm completes a full graph traversal, but in practice it stops quickly and does not introduce a significant overhead. The details of this clustering scheme, whose worst case run time complexity is O(|V|(|V| + |E|)), can be found in the original paper [115, Algorithm 2].

CoHyb Hybrid acyclic clustering

The cycle detection based algorithm can suffer from quadratic run time for vertices with large in-degrees or out-degrees. To avoid this, we design a hybrid acyclic clustering which uses the relaxed clustering strategy with cycle detection, CoCyc, by default, and switches to the more restrictive clustering strategy, CoTop, for large degree vertices. We define a limit on the degree of a vertex (typically √|V|/10) for calling it large degree. When considering an edge (u, v) where top[u] + 1 = top[v], if the degrees of u and v do not exceed the limit, we use the cycle detection algorithm to determine if we can contract the edge. Otherwise, if the out-degree of u or the in-degree of v is too large, the edge will be contracted if it is allowed. The complexity of this algorithm is in between those of CoTop and CoCyc, and it will likely avoid the quadratic behavior in practice.

2.2.2 Initial partitioning

After the coarsening phase, we compute an initial acyclic partitioning of the coarsest graph. We present two heuristics. One of them is akin to the greedy graph growing method used in the standard graph/hypergraph partitioning methods. The second one uses an undirected partitioning and then fixes the acyclicity of the partitions. Throughout this section, we use (V0, V1) to denote the bisection of the vertices of the coarsest graph, and ensure that there is no edge from the vertices in V1 to those in V0.

Greedy directed graph growing

One approach to compute a bisection of a directed graph is to design a greedy algorithm that moves vertices from one part to another using local information. We start with all vertices in V1 and move vertices towards V0 by using heaps. At any time, the vertices that can be moved to V0 are in the heap. These vertices are those whose in-neighbors are all in V0. Initially, only the sources are in the heap, and when all the in-neighbors of a vertex v are moved to V0, v is inserted into the heap. We separate this process into two phases. In the first phase, the key values of the vertices in the heap are set to the weighted sum of their incoming edges, and the ties are broken in favor of the vertex closer to the first vertex moved. The first phase continues until the first part has more than 0.9 of the maximum allowed weight (modulo the maximum weight of a vertex). In the second phase, the actual gain of a vertex is used. This gain is equal to the sum of the weights of the incoming edges minus the sum of the weights of the outgoing edges. In this phase, the ties are broken in favor of the heavier vertices. The second phase stops as soon as the required balance is obtained. The reason that we separated this heuristic into two phases is that at the beginning, the gains are of no importance, and the more vertices become movable, the more flexibility the heuristic has. Yet, towards the end, the parts are fairly balanced, and using actual gains should help keep the cut small.

Since the order of the parts is important, we also reverse the roles of the parts and the directions of the edges. That is, we put all vertices in V0 and move the vertices one by one to V1, when all out-neighbors of a vertex have been moved to V1. The proposed greedy directed graph growing heuristic returns the best of these two alternatives.
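A minimal sketch of the first phase of this heuristic, with unit vertex weights and without the tie-breaking rules, is given below; the names are illustrative and the stopping weight is passed in by the caller.

import heapq

def grow_v0(succ, pred, w, target_weight):
    # succ/pred: adjacency lists covering all vertices; w[(u, v)]: edge weights
    part = {v: 1 for v in succ}                      # all vertices start in V1
    missing = {v: len(pred[v]) for v in succ}        # in-neighbors not yet in V0
    heap = [(0.0, v) for v in succ if missing[v] == 0]   # sources are movable
    heapq.heapify(heap)
    weight0 = 0
    while heap and weight0 < target_weight:
        _, v = heapq.heappop(heap)
        part[v], weight0 = 0, weight0 + 1            # move v to V0 (unit weights)
        for y in succ[v]:
            missing[y] -= 1
            if missing[y] == 0:                      # all in-neighbors of y are now in V0
                key = -sum(w[(x, y)] for x in pred[y])   # max-heap via negated key
                heapq.heappush(heap, (key, y))
    return part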


Undirected bisection and fixing acyclicity

In this heuristic, we partition the coarsest graph as if it were undirected, and then move the vertices from one part to another in case the partition was not acyclic. Let (P0, P1) denote the (not necessarily acyclic) bisection of the coarsest graph treated as if it were undirected.

The proposed approach arbitrarily designates P0 as V0 and P1 as V1. One way to fix the cycles is to move all ancestors of the vertices in V0 to V0, thereby guaranteeing that there is no edge from vertices in V1 to vertices in V0, making the bisection (V0, V1) acyclic. We do these moves in a reverse topological order. Another way to fix the acyclicity is to move all descendants of the vertices in V1 to V1, again guaranteeing an acyclic partition. We do these moves in a topological order. We then fix the possible unbalance with a refinement algorithm.

Note that we can also initially designate P1 as V0 and P0 as V1, and can again fix a potential cycle in two different ways. We try all four of these choices and return the best partition (essentially returning the best of the four ways to fix the acyclicity of (P0, P1)).
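A minimal sketch of one of the four choices (moving the descendants of V1 vertices into V1 by a sweep in topological order; names are illustrative):

def fix_acyclicity(part, topo_order, succ):
    # part[v] in {0, 1}; requirement: no edge may go from part 1 to part 0
    part = dict(part)
    for u in topo_order:              # predecessors are visited before successors
        if part[u] == 1:
            for v in succ[u]:
                part[v] = 1           # pull every descendant of a V1 vertex into V1
    return part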

2.2.3 Refinement

This phase projects the partition obtained for a coarse graph to the next finer one and refines the partition by vertex moves. As in the standard refinement methods, the proposed heuristic is applied in a number of passes. Within a pass, we repeatedly select the vertex with the maximum move gain among those that can be moved. We tentatively realize this move if the move maintains or improves the balance. Then, the most profitable prefix of vertex moves is realized at the end of the pass. As usual, we allow the vertices to move only once in a pass. We use heaps with the gain of moves as the key value, where we keep only movable vertices. We call a vertex movable if moving it to the other part does not create a cyclic partition. We use the notation (V0, V1) to designate the acyclic bisection with no edge from vertices in V1 to vertices in V0. This means that for a vertex to move from part V0 to part V1, one of the two conditions should be met: (i) either all its out-neighbors should be in V1, or (ii) the vertex has no out-neighbors at all. Similarly, for a vertex to move from part V1 to part V0, one of the two conditions should be met: (i) either all its in-neighbors should be in V0, or (ii) the vertex has no in-neighbors at all. This is an adaptation of the boundary Fiduccia-Mattheyses heuristic [88] to directed graphs. The notion of movability being more restrictive results in an important simplification. The gain of moving a vertex v from V0 to V1 is

    ∑_{u ∈ Succ[v]} w(v, u) − ∑_{u ∈ Pred[v]} w(u, v),    (2.1)


and the negative of this value when moving it from V1 to V0. This means that the gain of a vertex is static: once a vertex is inserted in the heap with the key value (2.1), its key value is never updated. A move could render some vertices unmovable; if they were in the heap, then they should be deleted. Therefore, the heap data structure needs to support insert, delete, and extract-max operations only.
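A small sketch of the movability test and the gain of Equation (2.1), assuming dictionaries of successor/predecessor lists and edge weights (illustrative names):

def movable(v, part, succ, pred):
    # (V0, V1) is acyclic with no edge from V1 to V0
    if part[v] == 0:                                   # V0 -> V1 move
        return all(part[u] == 1 for u in succ[v])
    return all(part[u] == 0 for u in pred[v])          # V1 -> V0 move

def gain(v, part, succ, pred, w):
    g = sum(w[(v, u)] for u in succ[v]) - sum(w[(u, v)] for u in pred[v])
    return g if part[v] == 0 else -g                   # Equation (2.1) and its negative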

2.2.4 Constrained coarsening and initial partitioning

There are a number of highly successful undirected graph partitioning tools [130, 186, 199]. They are not immediately usable for our purposes, as the partitions can be cyclic. Fixing such partitions by moving vertices to break the cyclic dependencies among the parts can increase the edge cut dramatically (with respect to the undirected cut). Consider, for example, the n × n grid graph, where the vertices are at integer positions for i = 1, . . . , n and j = 1, . . . , n, and a vertex at (i, j) is connected to (i′, j′) when |i − i′| = 1 or |j − j′| = 1, but not both. There is an acyclic orientation of this graph, called spiral ordering, as described in Figure 2.1 for n = 8. This spiral ordering defines a total order. When the directions of the edges are ignored, we can have a bisection with perfect balance by cutting only n = 8 edges with a vertical line. This partition is cyclic, and it can be made acyclic by putting all vertices numbered greater than 32 in the second part. This partition, which puts the vertices 1–32 in the first part and the rest in the second part, is the unique acyclic bisection with perfect balance for the associated directed acyclic graph. The edge cut in the directed version is 35, as seen in the figure (gray edges). In general, one has to cut n² − 4n + 3 edges for n ≥ 8: the blue vertices on the border (excluding the corners) have one edge directed to a red vertex, the interior blue vertices have two such edges, and finally, the blue vertex labeled n²/2 has three such edges.

Let us investigate the quality of the partitions in practice. We used MeTiS [130] as the undirected graph partitioner on a dataset of 94 matrices (their details are in Section 2.3). The results are given in Figure 2.2. For this preliminary experiment, we partitioned the graphs into two with the maximum allowed load imbalance ε = 3%. The output of MeTiS is acyclic for only two graphs, and the geometric mean of the normalized edge cut is 0.0012. Figure 2.2a shows the normalized edge cut and the load imbalance after fixing the cycles, while Figure 2.2b shows the two measurements after meeting the balance criteria. A normalized edge cut value is computed by normalizing the edge cut with respect to the number of edges.

In both figures, the horizontal lines mark the geometric mean of the normalized edge cuts, and the vertical lines mark the 3% imbalance ratio. In Figure 2.2a, there are 37 instances in which the load balance after fixing the cycles is feasible. The geometric mean of the normalized edge cuts in this sub-figure is 0.0045, while in the other sub-figure it is 0.0049.


Figure 2.1: The 8×8 grid graph whose vertices are ordered in a spiral way; a few of the vertices are labeled. All edges are oriented from a lower numbered vertex to a higher numbered one. There is a unique bipartition with 32 vertices on each side. The edges defining the total order are shown in red and blue, except the one from 32 to 33; the cut edges are shown in gray; other internal edges are not shown.

Figure 2.2: Normalized edge cut (normalized with respect to the number of edges) versus the balance obtained (a) after using an undirected graph partitioner and fixing the cycles, and (b) after additionally ensuring balance with refinement.


Fixing the cycles increases the edge cut with respect to an undirected partitioning, but not catastrophically (only by 0.0045/0.0012 = 3.75 times in these experiments), and achieving balance after this step increases the cut only a little (it goes to 0.0049 from 0.0045). That is why we suggest using an undirected graph partitioner, fixing the cycles among the parts, and performing a refinement based method for load balancing as a good (initial) partitioner.

In order to refine the initial partition in a multilevel setting, we propose a scheme similar to the iterated multilevel algorithm used in the existing partitioners [40, 212]. In this scheme, first a partition P is obtained. Then, the coarsening phase starts to cluster vertices that were in the same part in P. After the coarsening, an initial partitioning is freely available by using the partition P on the coarsest graph. The refinement phase then can work as before. Moreira et al. [176] use this approach for the directed graph partitioning problem. To be more concrete, we first use an undirected graph partitioner, then fix the cycles as discussed in Section 2.2.2, and then refine this acyclic partition for balance with the proposed refinement heuristics of Section 2.2.3. We then use this acyclic partition for constrained coarsening and initial partitioning. We expect this scheme to be successful in graphs with many sources and targets, where the sources and targets can lie in any of the parts while the overall partition is acyclic. On the other hand, if a graph is such that its balanced acyclic partitions need to put the sources in one part and the targets in another part, then fixing the acyclicity may result in moving many vertices. This, in turn, will harm the edge cut found by the undirected graph partitioner.

2.3 Experiments

The partitioning tool presented (dagP) is implemented in the C/C++ programming languages. The experiments are conducted on a computer equipped with dual 2.1 GHz Xeon E5-2683 processors and 512 GB of memory. The source code and more information are available at http://tda.gatech.edu/software/dagP/.

We have performed a large set of experiments on DAG instances coming from two sources. The first set of instances is from the Polyhedral Benchmark suite (PolyBench) [196]. The second set of instances is obtained from the matrices available in the SuiteSparse Matrix Collection (UFL) [59]. From this collection, we pick all the matrices satisfying the following properties: listed as binary, square, and having at least 100000 rows and at most 2^26 nonzeros. There were a total of 95 matrices at the time of experimentation, where two matrices (with ids 1514 and 2294) have the same pattern. We discard the duplicate and use the remaining 94 matrices for the experiments. For each such matrix, we take the strict upper triangular part as the associated DAG instance whenever this part has more nonzeros than the lower triangular part; otherwise, we take the strict lower triangular part. All edges have unit cost, and all vertices have unit weight.


Graph        #vertices   #edges   max. deg.  avg. deg.  #sources  #targets
2mm              36500    62200         40      1.704      2100       400
3mm             111900   214600         40      1.918      3900       400
adi             596695  1059590     109760      1.776       843        28
atax            241730   385960        230      1.597     48530       230
covariance      191600   368775         70      1.925      4775      1275
doitgen         123400   237000        150      1.921      3400      3000
durbin          126246   250993        252      1.988       250       249
fdtd-2d         256479   436580         60      1.702      3579      1199
gemm           1026800  1684200         70      1.640     14600      4200
gemver          159480   259440        120      1.627     15360       120
gesummv         376000   500500        500      1.331    125250       250
heat-3d         308480   491520         20      1.593      1280       512
jacobi-1d       239202   398000        100      1.664       402       398
jacobi-2d       157808   282240         20      1.789      1008       784
lu              344520   676240         79      1.963      6400         1
ludcmp          357320   701680         80      1.964      6480         1
mvt             200800   320000        200      1.594     40800       400
seidel-2d       261520   490960         60      1.877      1600         1
symm            254020   440400        120      1.734      5680      2400
syr2k           111000   180900         60      1.630      2100       900
syrk            594480   975240         81      1.640      8040      3240
trisolv         240600   320000        399      1.330     80600         1
trmm            294570   571200         80      1.939      6570      4800

Table 2.1: DAG instances from PolyBench [196].


Since the proposed heuristics have a randomized behavior (the traversals used in the coarsening and refinement heuristics are randomized), we run them 10 times for each DAG instance and report the averages of these runs. We use performance profiles [69] to present the edge-cut results. A performance profile plot shows the probability that a specific method gives results within a factor θ of the best edge cut obtained by any of the methods compared in the plot. Hence, the higher and closer a plot is to the y-axis, the better the method is.
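As an illustration of how such a plot is computed, here is a small sketch over a table of edge cuts (rows: instances, columns: methods). It is generic code, not tied to the chapter's scripts, and it assumes all cut values are positive.

import numpy as np

def performance_profile(cuts, thetas):
    # cuts: array of shape (n_instances, n_methods); smaller is better
    best = cuts.min(axis=1, keepdims=True)           # best cut per instance
    ratios = cuts / best                             # factor away from the best result
    # fraction of instances on which each method is within a factor theta of the best
    return np.array([(ratios <= t).mean(axis=0) for t in thetas])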

We set the load imbalance parameter ε = 0.03 in the problem definition given at the beginning of the chapter for all experiments. The vertices are unit weighted; therefore, the imbalance is rarely an issue for a move-based partitioner.

Coarsening evaluation

We first evaluated the proposed coarsening heuristics (Section 5.1 of the associated paper [115]). The aim was to find an effective one to set as a default coarsening heuristic. Based on performance profiles of the edge cut, we concluded that, in general, the coarsening heuristics CoHyb and CoCyc behave similarly and are more helpful than CoTop in reducing the edge cut. We also analyzed the contraction efficiency of the proposed coarsening heuristics. It is important that the coarsening phase does not stop too early and that the coarsest graph is small enough to be partitioned efficiently. By investigating the maximum, the average, and the standard deviation of the vertex and edge weight ratios at the coarsest graph, and the average, the minimum, and the maximum number of coarsening levels observed for the two datasets, we saw that CoCyc and CoHyb behave again similarly and provide slightly better results than CoTop. Based on all these observations, we set CoHyb as the default coarsening heuristic, as it performs better than CoTop in terms of final edge cut and is guaranteed to be more efficient than CoCyc in terms of run time.

Constrained coarsening and initial partitioning

We now investigate the effect of using undirected graph partitioners for coarsening and initial partitioning, as explained in Section 2.2.4. We compare three variants of the proposed multilevel scheme. All of them use the refinement described in Section 2.2.3 in the uncoarsening phase.

• CoHyb: this variant uses the hybrid coarsening heuristic described in Section 2.2.1, and the greedy directed graph growing heuristic described in Section 2.2.2 in the initial partitioning phase. This method does not use constrained coarsening.

• CoHyb_C: this variant uses an acyclic partition of the finest graph, obtained as outlined in Section 2.2.2, to guide the hybrid coarsening heuristic described in Section 2.2.4, and uses the greedy directed graph growing heuristic in the initial partitioning phase.

• CoHyb_CIP: this variant uses the same constrained coarsening heuristic as the previous method, but inherits the fixed acyclic partition of the finest graph as the initial partitioning.

The comparison of these three variants is given in Figure 2.3 for the whole dataset. In this figure, we see that using the constrained coarsening is always helpful, as it clearly separates CoHyb_C and CoHyb_CIP from CoHyb after θ = 1.1. Furthermore, applying the constrained initial partitioning (on top of the constrained coarsening) brings tangible improvements.

In the light of the experiments presented here, we suggest the variant CoHyb_CIP for general problem instances.

CoHyb_CIP with respect to a single level algorithm

We compare CoHyb_CIP (constrained coarsening and initial partitioning) with a single-level algorithm that uses an undirected graph partitioning, fixes the acyclicity, and refines the partitions. This last variant is denoted as UndirFix, and it is the algorithm described in Section 2.2.2. Both variants use the same initial partitioning approach, which utilizes MeTiS [130] as the undirected partitioner.


Figure 2.3: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb_CIP), using the constrained coarsening and the greedy directed graph growing (CoHyb_C), and CoHyb (no constraints).

Figure 2.4: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb_CIP), and using the same approach without coarsening (UndirFix).

The difference between UndirFix and CoHyb_CIP is the latter's ability to refine the initial partition at multiple levels. Figure 2.4 presents this comparison. The plots show that the multilevel scheme CoHyb_CIP outperforms the single level scheme UndirFix at all appropriate ranges of θ, attesting to the importance of the multilevel scheme.

Comparison with existing work

Here we compare our approach with the evolutionary graph partitioning approach developed by Moreira et al. [175], and briefly with our previous work [114].

Figure 2.5 shows how CoHyb_CIP and CoTop compare with the evolutionary approach in terms of the edge cut on the 23 graphs of the PolyBench dataset, for the number of partitions k ∈ {2, 4, 8, 16, 32}. We use the average edge cut value of 10 runs for CoTop and CoHyb_CIP, and the average values presented in [175] for the evolutionary algorithm. As seen in the figure, the CoTop variant of the proposed multilevel approach obtains the best results on this specific dataset (all variants of the proposed approach outperform the evolutionary approach). In more detailed comparisons [115, Appendix A], we see that the variants CoHyb_CIP and CoTop of the proposed algorithm obtain strictly better results than the evolutionary approach in 78 and 75 instances (out of 115), respectively, when the average edge cuts are compared.

Figure 2.5: Performance profiles for the edge cut obtained by CoHyb_CIP, CoTop, and Moreira et al.'s approach on the PolyBench dataset with k ∈ {2, 4, 8, 16, 32}.

In more detailed comparisons [115, Appendix A], we see that CoHyb_CIP obtains 26% less edge cut than the evolutionary approach on average (geometric mean) when the average cuts are compared; when the best cuts are compared, CoHyb_CIP obtains 48% less edge cut. Moreover, CoTop obtains 37% less edge cut than the evolutionary approach when the average cuts are compared; when the best cuts are compared, CoTop obtains 41% less cut. In some instances, we see large differences between the average and the best results of CoTop and CoHyb_CIP. Combined with the observation that CoHyb_CIP yields better results in general, this suggests that the neighborhood structure can be improved (see the notion of the strength of a neighborhood [183, Section 19.6]).

The proposed approach, with all the reported variants, takes about 30 minutes to complete the whole set of experiments for this dataset, whereas the evolutionary approach is much more compute-intensive, as it has to run the multilevel partitioning algorithm numerous times to create and update the population of partitions for the evolutionary algorithm. The multilevel approach of Moreira et al. [175] is more comparable, in terms of characteristics, with our multilevel scheme. When we compare CoTop with the results of the multilevel algorithm by Moreira et al., our approach provides results that are 37% better on average, and the CoHyb_CIP approach provides results that are 26% better on average, highlighting the fact that keeping the acyclicity of the directed graph through the multilevel process is useful.

Figure 2.6: Performance profiles of CoHyb, CoHyb_CIP, and UndirFix in terms of edge cut for the single source, single target graph dataset. The average of 5 runs is reported for each approach.

Finally, CoTop and CoHyb_CIP also outperform the previous version of our multilevel partitioner [114], which is based on a direct k-way partitioning scheme and matching heuristics for the coarsening phase, by 45% and 35% on average, respectively, on the same dataset.

Single commodity flow-like problem instances

In many of the instances of our dataset, graphs have many source and target vertices. We investigate how our algorithm performs on problems where all source vertices should be in a given part and all target vertices should be in the other part, while also achieving balance. This is a problem close to the maximum flow problem, where we want to find the maximum flow (or minimum cut) from the sources to the targets, with balance on the part weights. Furthermore, addressing this problem also provides a setting for solving partitioning problems with fixed vertices.

We used the UFL dataset, where all isolated vertices are discarded, and a single source vertex S and a single target vertex T are added to all graphs. S is connected to all original source vertices, and all original target vertices are connected to T. The cost of the new edges is set to the number of edges. A feasible partition should avoid cutting these edges and should separate all sources from the targets.

The performance profiles of CoHyb, CoHyb_CIP, and UndirFix are given in Figure 2.6, with the edge cut as the evaluation criterion. As seen in this figure, CoHyb is the best performing variant, and UndirFix is the worst performing variant. This is interesting, as in the general setting we saw a reverse relation. The variant CoHyb_CIP performs in the middle, as it combines the other two.


Figure 2.7: Run time performance profiles of CoCyc, CoHyb, CoTop, CoHyb_C, CoHyb_CIP, and UndirFix on the whole dataset. The values are the averages of 10 runs.

Run time performance

We again give a summary of the run time comparisons from the detailed version [115, Section 5.6]. Figure 2.7 shows the comparison of the five variants of the proposed multilevel scheme and the single level scheme on the whole dataset. Each algorithm is run 10 times on each graph. As expected, CoTop offers the best performance, and CoHyb offers a good trade-off between CoTop and CoCyc. An interesting remark is that these three algorithms have a better run time than the single level algorithm UndirFix. For example, on average, CoTop is 1.44 times faster than UndirFix. This is mainly due to the cost of fixing the acyclicity: undirected partitioning accounts for roughly 25% of the execution time of UndirFix, and fixing the acyclicity constitutes the remaining 75%. Finally, the variants of the multilevel algorithm using constrained coarsening heuristics provide satisfying run time performance with respect to the others.

2.4 Summary, further notes, and references

We proposed a multilevel approach for the acyclic partitioning of directed acyclic graphs. This problem is close to the standard graph partitioning in that the aim is to partition the vertices into a number of parts while minimizing the edge cut and meeting a balance criterion on the part weights. Unlike the standard graph partitioning problem, the directions of the edges are important, and the resulting partitions should have acyclic dependencies.

We proposed coarsening, initial partitioning, and refinement heuristics for the target problem. The proposed heuristics take the directions of the edges into account and maintain the acyclicity throughout the three phases of the multilevel scheme. We also proposed efficient and effective approaches to use the standard undirected graph partitioning tools in the multilevel scheme for coarsening and initial partitioning. We performed a large set of experiments on a dataset with graphs having different characteristics and evaluated different combinations of the proposed heuristics. Our experiments suggested (i) the use of constrained coarsening and initial partitioning, where the main coarsening heuristic is a hybrid one which avoids the cycles and, in case it does not, performs a fast cycle detection (CoHyb_CIP), for the general case; (ii) a pure multilevel scheme without constrained coarsening, using the hybrid coarsening heuristic (CoHyb), for the cases where a number of sources need to be separated from a number of targets; (iii) a pure multilevel scheme without constrained coarsening, using the fast coarsening algorithm (CoTop), for the cases where the degrees of the vertices are small. All three approaches are shown to be more effective and efficient than the current state of the art.

Özkaya et al. [181] make use of the proposed partitioner for scheduling directed acyclic task graphs on homogeneous computing systems. Here, the partitioner is used to obtain coarse grained tasks to enhance data locality. Lepère and Trystram [151] show that acyclic clustering is very effective in scheduling DAGs with unit task weights and large, uniform edge costs. In particular, they show that there exists an acyclic clustering whose makespan is at most twice that of the best clustering.

Recall that a directed edge (u, v) is called redundant if there is a path u ⇝ v in the graph (distinct from the edge itself). Consider a redundant edge (u, v) and a path u ⇝ v. Let us discard the edge (u, v) and add its cost to all edges in the identified u ⇝ v path, to obtain a reduced graph. In an acyclic bisection of this reduced graph, if the vertices u and v are in different parts, then the path u ⇝ v crosses the cut only once (otherwise the partition is cyclic), and the edge cut in the original graph equals the edge cut in the reduced graph. If u and v are in the same part, then all paths u ⇝ v are confined to the same part. Therefore, neither the edge (u, v) of the original graph nor any path u ⇝ v of the reduced graph contributes to the respective cuts. Hence, the edge cut in the original graph is equal to the edge cut in the reduced graph. Such reductions could potentially help reduce the run time and improve the quality of the partitions in our multilevel setting. The transitive reduction is a costly operation, and hence we do not hope to apply all reductions in the proposed approach; a subset of those should be found and applied. The effects of such reductions should be more tangible in the evolutionary algorithm [175, 176].

Chapter 3

Parallel Candecomp/Parafac decomposition of sparse tensors in distributed memory

In this chapter, we address the problem of parallelizing the computational core of the Candecomp/Parafac decomposition of sparse tensors. Tensor notation is a little less common, and the reader is invited to revise the notation and the basic definitions in Section 1.5. More related definitions are given in this chapter. Below, we restate the problem for convenience.

Parallelizing CP decomposition. Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

We investigate an efficient parallelization of the Mttkrp operation in distributed memory environments for sparse tensors, in the context of the CP decomposition method CP-ALS. For this purpose, we formulate two task definitions: a coarse-grain and a fine-grain one. These definitions are given by applying the owner-computes rule to a coarse-grain and a fine-grain partition of the tensor nonzeros. We define the coarse-grain partition of a tensor as a partition of one of its dimensions. In matrix terms, a coarse-grain partition corresponds to a row-wise or a column-wise partition. We define the fine-grain partition of a tensor as a partition of its nonzeros. This has the same significance in matrix terms. Based on these two task granularities, we present two parallel algorithms for the Mttkrp operation. We address the computational load balance and communication cost reduction problems for the two algorithms, and present hypergraph partitioning-based models


to tackle these problems with the off-the-shelf partitioning tools.

Once the Mttkrp operation is efficiently parallelized, the other parts of the CP-ALS algorithm are straightforward. Nonetheless, we design a library for the parallel CP-ALS algorithm to test the proposed Mttkrp algorithms, and report scalability results up to 1024 MPI ranks.

3.1 Background and notation

3.1.1 Hypergraph partitioning

Let H = (V, E) be a hypergraph with the vertex set V and hyperedge set E. Let us associate weights with the vertices and costs with the hyperedges: w(v) denotes the weight of a vertex v, and c_h denotes the cost of a hyperedge h. For a given integer K ≥ 2, a K-way vertex partition of a hypergraph H = (V, E) is denoted as Π = {V1, . . . , VK}, where (i) the parts are non-empty; (ii) mutually exclusive, Vk ∩ Vℓ = ∅ for k ≠ ℓ; and (iii) collectively exhaustive, V = ⋃ Vk.

Let w(Vk) = ∑_{v ∈ Vk} w(v) be the total weight of the vertices in the set Vk, and wavg = w(V)/K be the average part weight. If each part Vk ∈ Π satisfies the balance criterion

    w(Vk) ≤ wavg (1 + ε), for k = 1, 2, . . . , K,    (3.1)

we say that Π is balanced, where ε represents the maximum allowed imbalance ratio.

In a partition Π, a hyperedge that has at least one vertex in a part is said to connect that part. The number of parts connected by a hyperedge h, i.e., its connectivity, is denoted as λ_h. Given a vertex partition Π of a hypergraph H = (V, E), one can measure the size of the cut induced by Π as

    χ(Π) = ∑_{h ∈ E} c_h (λ_h − 1).    (3.2)

This cut measure is called the connectivity-1 cutsize metric. Given ε > 0 and an integer K > 1, the standard hypergraph partitioning problem is defined as the task of finding a balanced partition Π with K parts such that χ(Π) is minimized. The hypergraph partitioning problem is NP-hard [150].

A variant of the above problem is the multi-constraint hypergraph partitioning [131]. In this variant, each vertex has an associated vector of weights. The partitioning objective is the same as above, and the partitioning constraint is to satisfy a balancing constraint for each weight. Let w(v, i) denote the C weights of a vertex v, for i = 1, . . . , C. In this variant, the balance criterion (3.1) is rewritten as

    w(Vk, i) ≤ w_avg,i (1 + ε), for k = 1, . . . , K and i = 1, . . . , C,    (3.3)


where the ith weight w(Vk, i) of a part Vk is defined as the sum of the ith weights of the vertices in that part, w_avg,i is the average part weight for the ith weight of all vertices, and ε represents the allowed imbalance ratio.

3.1.2 Tensor operations

In this chapter, we use N to denote the number of dimensions (the order) of a tensor. For the sake of simplicity, we describe all the notation and the algorithms for N = 3, even though our algorithms and implementations have no such restriction. We explicitly generalize the discussion to general order-N tensors whenever we find it necessary. A fiber in a tensor is defined by fixing every index but one, e.g., if X is a third-order tensor, X(:, j, k) is a mode-1 fiber and X(i, j, :) is a mode-3 fiber. A slice in a tensor is defined by fixing only one index, e.g., X(i, :, :) refers to the ith slice of X in mode 1. We use |X(i, :, :)| to denote the number of nonzeros in X(i, :, :).

Tensors can be matricized in any mode. This is achieved by identifying a subset of the modes of a given tensor X as the rows, and the other modes of X as the columns, of a matrix, and appropriately mapping the elements of X to those of the resulting matrix. We will be exclusively dealing with the matricizations of tensors along a single mode. For example, take X ∈ ℝ^{I1×···×IN}. Then, X(1) denotes the mode-1 matricization of X, in such a way that the rows of X(1) correspond to the first mode of X and the columns correspond to the remaining modes. The tensor element x_{i1,...,iN} corresponds to the element (i1, 1 + ∑_{j=2}^{N} [(i_j − 1) ∏_{k=2}^{j−1} I_k]) of X(1). Specifically, each column of the matrix X(1) becomes a mode-1 fiber of the tensor X. Matricizations in the other modes are defined similarly.

For two matrices A of size I1 × J1 and B of size I2 × J2, the Kronecker product is

    A ⊗ B = [ a_{1,1}B   ···  a_{1,J1}B
                 ⋮               ⋮
              a_{I1,1}B  ···  a_{I1,J1}B ] .

For two matrices A of size I1 × J and B of size I2 × J, the Khatri-Rao product is

    A ⊙ B = [ A(:, 1) ⊗ B(:, 1)   ···   A(:, J) ⊗ B(:, J) ] ,

which is of size I1 I2 × J. For two matrices A and B, both of size I × J, the Hadamard product is

    A ∗ B = [ a_{1,1}b_{1,1}  ···  a_{1,J}b_{1,J}
                  ⋮                     ⋮
              a_{I,1}b_{I,1}  ···  a_{I,J}b_{I,J} ] .
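A brief numpy illustration of the three products (generic code with made-up sizes, only to fix ideas):

import numpy as np

A = np.arange(6.0).reshape(2, 3)               # 2 x 3
B = np.arange(12.0).reshape(4, 3)              # 4 x 3

kron = np.kron(A, B)                            # Kronecker product, (2*4) x (3*3)
khatri_rao = np.vstack([np.kron(A[:, r], B[:, r]) for r in range(3)]).T   # (2*4) x 3
hadamard = A * A                                # Hadamard (elementwise) product, 2 x 3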

Recall that the CP decomposition of rank R of a given tensor X factorizes X into a sum of R rank-one tensors. For X ∈ ℝ^{I×J×K}, it yields X ≈ ∑_{r=1}^{R} a_r ∘ b_r ∘ c_r, for a_r ∈ ℝ^I, b_r ∈ ℝ^J, and c_r ∈ ℝ^K, where ∘ is the outer product of the vectors. Here, the matrices A = [a_1, . . . , a_R], B = [b_1, . . . , b_R], and C = [c_1, . . . , c_R] are called the factor matrices, or factors. For N-mode tensors, we use U_1, . . . , U_N to refer to the factor matrices.

We revise the Alternating Least Squares (ALS) method for obtaining a rank-R approximation of a tensor X with the CP decomposition [38, 107]. A common formulation of CP-ALS is shown in Algorithm 1 for third order tensors. At each iteration, each factor matrix is recomputed while fixing the other two, e.g., A ← X(1)(C ⊙ B)(B^T B ∗ C^T C)†. This operation is performed in the following order: M_A = X(1)(C ⊙ B), V = (B^T B ∗ C^T C)†, and then A ← M_A V. Here, V is a dense matrix of size R × R and is easy to compute. The important issue is the efficient computation of the Mttkrp operations yielding M_A, and similarly M_B = X(2)(A ⊙ C) and M_C = X(3)(B ⊙ A).

Algorithm 1: CP-ALS for 3rd order tensors

Input: X: a 3rd order tensor; R: the rank of approximation
Output: CP decomposition [[λ; A, B, C]]

1: repeat
2:    A ← X(1)(C ⊙ B)(B^T B ∗ C^T C)†
3:    Normalize columns of A
4:    B ← X(2)(C ⊙ A)(A^T A ∗ C^T C)†
5:    Normalize columns of B
6:    C ← X(3)(B ⊙ A)(A^T A ∗ B^T B)†
7:    Normalize columns of C and store the norms as λ
8: until no improvement or maximum iterations reached

The sheer size of the Khatri-Rao products makes them impossible to compute explicitly; hence, efficient Mttkrp algorithms find other means to carry out the Mttkrp operation.

3.2 Related work

SPLATT [206] is an efficient implementation of the Mttkrp operation for sparse tensors on shared memory systems. SPLATT implements the Mttkrp operation based on the slices of the dimension in which the factor is updated, e.g., on the mode-1 slices when computing A ← X(1)(C ⊙ B)(B^T B ∗ C^T C)†. Nonzeros of the fibers in a slice are multiplied with the corresponding rows of B, and the results are accumulated to be later scaled with the corresponding row of C to compute the row of A corresponding to the slice. Parallelization is done using OpenMP directives, and the load balance (in terms of the number of nonzeros in the slices of the mode for which the Mttkrp is computed) is achieved by using the dynamic scheduling policy. Hypergraph models are used to optimize cache performance by reducing the number of times a row of A, B, and C is accessed. Smith et al. [206] also use N-partite graphs to reorder the tensors for all dimensions. A newer version of SPLATT [204, 205] has a medium grain task definition and a distributed memory implementation.

GigaTensor [123] is an implementation of CP-ALS which follows the Map-Reduce paradigm. All important steps (the Mttkrps and the computations B^T B ∗ C^T C) are performed using this paradigm. A distinct advantage of GigaTensor is that, thanks to Map-Reduce, the issues of fault tolerance, load balance, and out-of-core tensor data are automatically handled. The presentation [123] of GigaTensor focuses on three-mode tensors and expresses the map and the reduce functions for this case. Additional map and reduce functions are needed for higher order tensors.

DFacTo [47] is a distributed memory implementation of the Mttkrp operation. It performs two successive sparse matrix-vector multiplies (SpMVs) to compute a column of the product X(1)(C ⊙ B). A crucial observation (made also elsewhere [142]) is that this operation can be implemented as X^T_(2) B(:, r), which can be reshaped into a matrix to be multiplied with C(:, r) to form the rth column of the Mttkrp. Although SpMV is a well-investigated operation, there is a peculiarity here: the result of the first SpMV forms the values of the sparse matrix used in the second one. Therefore, there are sophisticated data dependencies between the two SpMVs. Notice that DFacTo is rich in SpMV operations: there are two SpMVs per factorization rank per dimension of the input tensor. DFacTo needs to store the tensor matricized in all dimensions, i.e., X(1), . . . , X(N). In low dimensions, this can be a slight memory overhead, yet in higher dimensions the overhead could be non-negligible. DFacTo uses MPI for parallelization, yet fully stores the factor matrices in all MPI ranks. The rows of X(1), . . . , X(N) are blockwise distributed (statically). With this partition, each process computes the corresponding rows of the factor matrices. Finally, DFacTo performs an MPI_Allgatherv operation to communicate the new results to all processes, which results in (In/P) log2 P communication volume per process (assuming a hypercube algorithm) when computing the nth factor matrix having In rows using P processes.

Tensor Toolbox [18] is a MATLAB toolbox for handling tensors. It provides many essential operations and enables fast and efficient realizations of complex algorithms in MATLAB for sparse tensors [17]. Among those operations, Mttkrp implementations are provided and used in the CP-ALS method. Here, each column of the output is computed by performing N − 1 sparse tensor-vector multiplications. Another well-known MATLAB toolbox is the N-way toolbox [11], which is essentially for dense tensors and now incorporates support for sparse tensors [2] through the Tensor Toolbox. Tensor Toolbox and the related software provide excellent means for rapid prototyping of algorithms, and also efficient programs for tensor operations that can be handled within MATLAB.

3.3 Parallelization

A common approach in implementing the Mttkrp is to explicitly matricize a tensor across all modes and then perform the Khatri-Rao product using the matricized tensors [47, 206]. Matricizing a tensor in a mode i requires column index values up to ∏_{k≠i} I_k, which can exceed the integer value limits supported by modern architectures when using tensors of higher order and very large dimensions. Also, matricizing across all modes results in N replications of a tensor, which can exceed the memory limitations. Hence, in order to be able to handle large tensors, we store them in coordinate format for the Mttkrp operation, which is also the method of choice in Tensor Toolbox [18]. In this format, each nonzero value is stored along with all of its position indices.

With a tensor stored in the coordinate format, the Mttkrp operation can be performed as shown in Algorithm 2. As seen on Line 4 of this algorithm, a row of B and a row of C are retrieved, and their Hadamard product is computed and scaled with a tensor entry to update a row of M_A. In general, for an N-mode tensor,

    M_{U1}(i1, :) ← M_{U1}(i1, :) + x_{i1 i2 ... iN} [U2(i2, :) ∗ ··· ∗ UN(iN, :)]

is computed. Here, the indices of the corresponding rows of the factor matrices and M_{U1} coincide with the indices of the unique tensor entry of the operation.

Algorithm 2: Mttkrp for 3rd order tensors

Input: X: tensor; B, C: factor matrices in all modes except the first; I_A: number of rows of the factor A; R: rank of the factors
Output: M_A = X(1)(B ⊙ C)

1: Initialize M_A to zeros of size I_A × R
2: foreach x_{ijk} ∈ X do
4:    M_A(i, :) ← M_A(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]

As factor matrices are accessed row-wise, we define computational units in terms of the rows of factor matrices. It follows naturally to partition all factor matrices row-wise and to use the same partition for the Mttkrp operation for each mode of an input tensor, across all CP-ALS iterations, to prevent extra communication. A crucial issue is the task definitions, as this pertains to the issues of load balancing and communication. We identify a coarse-grain and a fine-grain task definition.

In the coarse-grain task definition, the ith atomic task consists of computing the row M_A(i, :) using the nonzeros in the tensor slice X(i, :, :) and the rows of B and C corresponding to the nonzeros in that slice. The input tensor X does not change throughout the iterations of tensor decomposition algorithms; hence, it is viable to make the whole slice X(i, :, :) available to the process holding M_A(i, :), so that the Mttkrp operation can be performed by only communicating the rows of B and C. Yet, as CP-ALS requires the Mttkrp in all modes, and each nonzero x_{ijk} belongs to the slices X(i, :, :), X(:, j, :), and X(:, :, k), we need to replicate tensor entries in the owner processes of these slices. This may require up to N times replication of the tensor, depending on its partitioning. Note that an explicit matricization always requires exactly N replications of tensor entries.

In the fine-grain task definition, an atomic task corresponds to the multiplication of a tensor entry with the Hadamard product of the corresponding rows of B and C. Here, tensor nonzeros are partitioned among the processes with no replication, to induce a task partition by following the owner-computes rule. This necessitates communicating the rows of B and C that are needed by these atomic tasks. Furthermore, partial results on the rows of M_A need to be communicated, as without duplicating tensor entries, we cannot, in general, compute all contributions to a row of M_A. Here, the partition of X should be useful in all modes, as the CP-ALS method requires the Mttkrp in all modes.

The coarse-grain task definition resembles the one-dimensional (1D) row-wise (or column-wise) partitioning of sparse matrices, whereas the fine-grain one resembles the two-dimensional (nonzero-based) partitioning of sparse matrices for parallel sparse matrix-vector multiply (SpMV) operations. As is confirmed for SpMV in modern applications, 1D partitioning usually leads to harder problems of load balancing and communication cost reduction. The same phenomenon is likely to be observed in tensors as well. Nonetheless, we cover the coarse-grain task definition, as it is used in the state of the art parallel Mttkrp methods [47, 206], which partition the input tensor by slices.

3.3.1 Coarse-grain task model

In the coarse-grain task model, computing the rows of M_A, M_B, and M_C are defined as the atomic tasks. Let μ_A denote the partition of the first mode's indices among the processes, i.e., μ_A(i) = p if the process p is responsible for computing M_A(i, :). Similarly, let μ_B and μ_C define the partition of the second and the third mode indices. The process owning M_A(i, :) needs the entire tensor slice X(i, :, :); similarly, the process owning M_B(j, :) needs X(:, j, :), and the owner of M_C(k, :) needs X(:, :, k). This necessitates duplication of some tensor nonzeros to prevent unnecessary communication.

One needs to take the context of CP-ALS into account when parallelizing the Mttkrp operation. First, the output M_A of Mttkrp is transformed into A. Since A(i, :) is computed simply by multiplying M_A(i, :) with the matrix (B^T B ∗ C^T C)†, we make the process which owns M_A(i, :) responsible for computing A(i, :). Second, N Mttkrp operations follow one another in an iteration. Assuming that every process has the required rows of the factor matrices while executing the Mttkrp for the first mode, it is advisable to implement the Mttkrp in such a way that its output M_A, after being transformed into A, is communicated. This way, all processes will have the necessary data for executing the Mttkrp for the next mode. With these in mind, the coarse-grain parallel Mttkrp method executes Algorithm 3 at each process p.

Algorithm 3: Coarse-grain Mttkrp for the first mode of third order tensors at process p, within CP-ALS

Input: I_p: indices i where μ_A(i) = p; X(I_p, :, :): tensor slices; B^T B and C^T C: R × R matrices; B(J_p, :), C(K_p, :): rows of the factor matrices, where J_p and K_p correspond to the unique second and third mode indices in X(I_p, :, :)

1: ▷ On exit, A(I_p, :) is computed, its rows are sent to the processes needing them, and A^T A is available
2: Initialize M_A(I_p, :) to all zeros, of size |I_p| × R
3: foreach i ∈ I_p do
4:    foreach x_{ijk} ∈ X(i, :, :) do
6:        M_A(i, :) ← M_A(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]
8: A(I_p, :) ← M_A(I_p, :)(B^T B ∗ C^T C)†
9: foreach i ∈ I_p do
11:    Send A(i, :) to all processes having nonzeros in X(i, :, :)
12: Receive A(i, :) from μ_A(i) for each owned nonzero x_{ijk}
14: Locally compute A(I_p, :)^T A(I_p, :) and all-reduce the results to form A^T A

As seen in Algorithm 3, the process p computes M_A(i, :) for all i with μ_A(i) = p on Line 6. This is possible without any communication, if we assume the preconditions that B^T B, C^T C, and the required rows of B and C are available. We need to satisfy this precondition for the Mttkrp operations in the remaining modes. Once M_A(I_p, :) is computed, it is multiplied on Line 8 with (B^T B ∗ C^T C)† to obtain A(I_p, :). Then, the process p sends the rows of A(I_p, :) to the other processes who will need them, and then receives the ones that it will need. More precisely, the process p sends A(i, :) to the process r where μ_B(j) = r or μ_C(k) = r for a nonzero x_{ijk} ∈ X, thereby satisfying the stated precondition for the remaining Mttkrp operations. Since computing A(i, :) required B^T B and C^T C to be available at each process, we also need to compute A^T A and make the result available to all processes. We perform this by computing the local multiplications A(I_p, :)^T A(I_p, :) and all-reducing the partial results (Line 14).

We now investigate the problems of obtaining load balance and reducing the communication cost in an iteration of CP-ALS given in Algorithm 3, that is, when Algorithm 3 is applied N times, once for each mode. Line 6 is the computational core: the process p performs ∑_{i ∈ I_p} |X(i, :, :)| many Hadamard products (of vectors of size R) without any communication. In order to achieve load balance, one should partition the slices of the first mode equitably by taking the number of nonzeros into account. Line 8 does not involve communication, where load balance can be achieved by assigning an almost equal number of slices to the processes. Line 14 also requires an almost equal number of slices at the processes for load balance. The operations at Lines 8 and 14 can be performed very efficiently using BLAS3 routines; therefore, their costs are generally negligible in comparison to the cost of the sparse, irregular computations at Line 6. There are two lines, 11 and 14, requiring communication. Line 14 requires the collective all-reduce communication, and one can rely on efficient MPI implementations. The communication at Line 11 is irregular and would be costly. Therefore, the problems of computational load balance and communication cost need to be addressed with regards to Lines 6 and 11, respectively.

Let a_i be the task of computing M_A(i, :) at Line 6. Let also T_A be the set of tasks a_i for i = 1, ..., I_A, and let b_j, c_k, T_B, and T_C be defined similarly. In the CP-ALS algorithm, there are dependencies among the triplets a_i, b_j, and c_k of tasks for each nonzero x_ijk. Specifically, the task a_i depends on the tasks b_j and c_k. Similarly, b_j depends on the tasks a_i and c_k, and c_k depends on the tasks a_i and b_j. These dependencies are the source of the communication at Line 11; e.g., A(i, :) should be sent to the processes that own B(j, :) and C(k, :) for each nonzero x_ijk in the slice X_i. These dependencies can be modeled using graphs or hypergraphs by identifying the tasks with vertices and their dependencies with edges and hyperedges, respectively. As is well known in similar parallelization contexts [39, 111, 112], the standard graph-partition-related metric of "edge cut" would only loosely relate to the communication cost. We therefore propose a hypergraph model for the communication and computational requirements of the Mttkrp operation in the context of CP-ALS.

For a given tensor X, we build the d-partite, d-uniform hypergraph H = (V, E) described in Section 1.5. In this model, the vertex set V corresponds to the set of tasks. We abuse the notation and use a_i to refer to a task and its corresponding vertex in V. We then set V = T_A ∪ T_B ∪ T_C. Since one needs load balance on Line 6 for each mode, we use vectors of size N for vertex weights. We set w(a_i, 1) = |X_i|, w(b_j, 2) = |X_j|, and w(c_k, 3) = |X_k| as the vertex weights in the indicated modes, and assign weight 0 in all other modes. The dependencies causing communication at Line 11 are one-to-many: for a nonzero x_ijk, any one of a_i, b_j, and c_k is computed by using the data associated with the other two. Following the guidelines in building the hypergraph models [209], we add a hyperedge corresponding to each task (or each row of the factor matrices) in T_A ∪ T_B ∪ T_C in order to model the data dependencies to that specific task. We use n_{a_i} to denote the hyperedge corresponding to the task a_i, and define n_{b_j} and n_{c_k} similarly. For each nonzero x_ijk, we add the vertices a_i, b_j, and c_k to the hyperedges n_{a_i}, n_{b_j}, and n_{c_k}. Let J_i and K_i be all unique second and third mode indices, respectively, of the nonzeros in X_i. Then, the hyperedge n_{a_i} contains the vertex a_i, all vertices b_j for j ∈ J_i, and all vertices c_k for k ∈ K_i.

[Figure 3.1: Coarse and fine-grain hypergraph models for the 3 × 3 × 3 tensor X = {(1, 2, 3), (2, 3, 1), (3, 1, 2)}. (a) Coarse-grain model; (b) Fine-grain model.]

Figure 3.1a demonstrates the hypergraph model for the coarse-grain task decomposition of a sample 3 × 3 × 3 tensor with nonzeros (1, 2, 3), (2, 3, 1), and (3, 1, 2). Labels for the hyperedges, shown as small blue circles, are not provided, yet they should be clear from the context; e.g., a_i ∈ n_{a_i} is shown with a non-curved black line between the vertex and the hyperedge. For the first, second, and third nonzeros, we add the corresponding vertices to the related hyperedges, which are shown with green, magenta, and orange curves, respectively. It is interesting to note that the N-partite graph of Smith et al. [206] and this hypergraph model are related: their adjacency structures, when laid out as matrices, give the same matrix.
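The coarse-grain hypergraph of such an example can be assembled directly from the list of nonzeros; the sketch below (with illustrative names and a dictionary-based representation) builds the vertex weights and the hyperedge pin sets as described above.

from collections import defaultdict

def coarse_grain_hypergraph(nonzeros, N=3):
    # Vertices and hyperedges are both identified by (mode, index) pairs.
    weights = defaultdict(lambda: [0] * N)   # vertex -> weight vector of size N
    pins = defaultdict(set)                  # hyperedge -> set of vertices it connects
    for nz in nonzeros:                      # nz = (i, j, k)
        for m in range(N):
            weights[(m, nz[m])][m] += 1      # w(a_i, 1) = |X_i|, etc.; other components stay 0
        for m in range(N):                   # the hyperedge of each index of the nonzero
            for mm in range(N):              # receives all three vertices of the nonzero
                pins[(m, nz[m])].add((mm, nz[mm]))
    return weights, pins

# For the sample tensor of Figure 3.1: coarse_grain_hypergraph([(1, 2, 3), (2, 3, 1), (3, 1, 2)])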

Consider now a P-way partition of the vertices of H = (V, E), and the association of each part with a unique process for obtaining a P-way parallel CP-ALS. The task a_i incurs |X_i| Hadamard products in the first mode in the process μ_A(i), and has no other work when computing the Mttkrps M_B = X_(2)(A ⊙ C) and M_C = X_(3)(A ⊙ B) in the other modes. This observation (combined with the corresponding one for the b_j and c_k tasks) shows that any partition satisfying the balance condition for each weight component (3.3) corresponds to a balanced partition of the Hadamard products (Line 6) in each mode. Now consider the messages, which are of size R, sent by the process p regarding A(i, :). These messages are needed by the processes that have slices in modes 2 and 3 with nonzeros whose first mode index is i. These processes correspond to the parts connected by the hyperedge n_{a_i}. For example, in Figure 3.1a, due to the nonzero (3, 1, 2), a_3 has a dependency to b_1 and c_2, which is modeled by the curves from a_3 to n_{b_1} and n_{c_2}. Since a_3, b_1, and c_2 reside in different parts, a_3 contributes one to the connectivity of n_{b_1} and of n_{c_2}, which accurately captures the total communication volume by increasing the total connectivity−1 metric by two.

Notice that the part μ_A(i) is always connected by n_{a_i}, assuming that the slice X_i is not empty. Therefore, minimizing the connectivity−1 metric (3.2) exactly amounts to minimizing the total communication volume required for Mttkrp in all computational phases of a CP-ALS iteration, if we set the cost c_h = R for each hyperedge h.
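Given a mapping part[v] of vertices to parts, the modeled total communication volume is then the connectivity−1 cut size with c_h = R; a minimal sketch (reusing the pins structure of the previous snippet) is:

def total_communication_volume(pins, part, R):
    # Sum over hyperedges of (connectivity - 1), each unit of connectivity costing R words.
    volume = 0
    for vertices in pins.values():
        connectivity = len({part[v] for v in vertices})
        volume += R * (connectivity - 1)
    return volume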

3.3.2 Fine-grain task model

In the fine-grain task model, we assume a partition of the nonzeros of the tensor; hence, there is no duplication. Let μ_X denote the partition of the tensor nonzeros, e.g., μ_X(i, j, k) = p if the process p holds x_ijk. Whenever μ_X(i, j, k) = p, the process p performs the Hadamard product of B(j, :) and C(k, :) in the first mode, of A(i, :) and C(k, :) in the second mode, and of A(i, :) and B(j, :) in the third mode, and then scales the result with x_ijk. Let again μ_A, μ_B, and μ_C denote the partition on the indices of the corresponding modes; that is, μ_A(i) = p denotes that the process p is responsible for computing M_A(i, :). As in the coarse-grain model, we take the CP-ALS context into account while designing the Mttkrp algorithm: (i) the process which is responsible for M_A(i, :) is also held responsible for A(i, :); (ii) at the beginning, all rows of the factor matrices B(j, :) and C(k, :) where μ_X(i, j, k) = p are available to the process p; (iii) at the end, all processes get the R × R matrix A^T A; and (iv) each process has all the rows of A that are required for the upcoming Mttkrp operations in the second or the third modes. With these in mind, the fine-grain parallel Mttkrp method executes Algorithm 4 at each process p.

In Algorithm 4, X_p and I_p denote the set of nonzeros and the set of first-mode indices assigned to the process p. As seen in the algorithm, the process p has the required data to carry out all Hadamard products corresponding to all x_ijk ∈ X_p on Line 5. After scaling by x_ijk, the Hadamard products are locally accumulated in M_A, which contains a row for each i where X_p has a nonzero on the slice X_i. Notice that a process generates a partial result for each slice of the first mode in which it has a nonzero. The relevant row indices (the set of unique first-mode indices in X_p) are denoted by F_p.


Algorithm 4: Fine-grain Mttkrp for the first mode of third-order tensors at process p, within CP-ALS

Input: X_p: tensor nonzeros where μ_X(i, j, k) = p
       I_p: the first-mode indices where μ_A(I_p) = p
       F_p: the set of unique first-mode indices in X_p
       B^T B and C^T C: R × R matrices
       B(J_p, :), C(K_p, :): rows of the factor matrices, where J_p and K_p correspond to the unique second and third mode indices in X_p

1   On exit, A(I_p, :) is computed, its rows are sent to processes needing them, A(F_p, :) is updated, and A^T A is available
2   Initialize M_A(F_p, :) to all zeros of size |F_p| × R
3   foreach x_ijk ∈ X_p do
5     M_A(i, :) ← M_A(i, :) + x_ijk [B(j, :) ∗ C(k, :)]
6   foreach i ∈ F_p \ I_p do
8     Send M_A(i, :) to μ_A(i)
10  Receive contributions to M_A(i, :) for i ∈ I_p, and add them up
12  A(I_p, :) ← M_A(I_p, :)(B^T B ∗ C^T C)^†
13  foreach i ∈ I_p do
15    Send A(i, :) to all processes having nonzeros in X_i
16  foreach i ∈ F_p \ I_p do
18    Receive A(i, :) from μ_A(i)
20  Locally compute A(I_p, :)^T A(I_p, :) and all-reduce the results to form A^T A


Some indices in F_p are not owned by the process p. For such an i ∈ F_p \ I_p, the process p sends its contribution to M_A(i, :) to the owner process μ_A(i) at Line 8. Again, messages of the process p destined to the same process are combined. Then, the process p receives contributions to M_A(i, :) where μ_A(i) = p at Line 10. Each such contribution is accumulated, and the new A(i, :) is computed by a multiplication with (B^T B ∗ C^T C)^† at Line 12. Finally, the updated A is communicated to satisfy the precondition of the upcoming Mttkrp operations. Note that the communications at Lines 15 and 18 are the duals of those at Lines 10 and 8, respectively, in the sense that the directions of the messages are reversed and the messages always pertain to the same row indices. For example, M_A(i, :) is sent to the process μ_A(i) at Line 8, and A(i, :) is received from μ_A(i) at Line 18. Hence, the process p generates partial results for each slice X_i on which it has a nonzero, either keeps it or sends it to the owner μ_A(i) (at Line 8), and if so, receives the updated row A(i, :) later (at Line 18). Finally, the algorithm finishes with an all-reduce operation to make A^T A available to all processes for the upcoming Mttkrp operations.

We now investigate the computational and communication requirements of Algorithm 4. As before, the computational core is at Line 5, where the Hadamard products of the row vectors B(j, :) and C(k, :) are computed for each nonzero x_ijk that a process owns, and the result is added to M_A(i, :). Therefore, each nonzero x_ijk incurs three Hadamard products (one per mode) in its owner process μ_X(i, j, k) in an iteration of the CP-ALS algorithm. Other computations are either efficiently handled with BLAS3 routines (Lines 12 and 20), or they (Line 10) are not as many as the number of Hadamard products. The significant communication operations are the sends at Lines 8 and 15 and the corresponding receives at Lines 10 and 18. Again, the all-reduce operation at Line 20 would be handled efficiently by the MPI implementation. Therefore, the problems of computational load balance and reduced communication cost need to be addressed by partitioning the tensor nonzeros equitably among the processes (for load balance at Line 5), while trying to confine all slices of all modes to a small set of processes (to reduce communication at Lines 8 and 18).

We propose a hypergraph model for the computational load and communication volume of the fine-grain parallel Mttkrp operation in the context of CP-ALS. For a given tensor, the hypergraph H = (V, E) with the vertex set V and hyperedge set E is defined as follows. The vertex set V has four types of vertices: a_i for i = 1, ..., I_A; b_j for j = 1, ..., I_B; c_k for k = 1, ..., I_C; and the vertex x_ijk we define for each nonzero in X, by abusing the notation. The vertex x_ijk represents the Hadamard products B(j, :) ∗ C(k, :), A(i, :) ∗ C(k, :), and A(i, :) ∗ B(j, :) to be performed during the Mttkrp operations at different modes. There are three types of hyperedges in E, and they represent the dependencies of the Hadamard products on the rows of the factor matrices. The hyperedges are defined as follows:


E = E_A ∪ E_B ∪ E_C, where E_A contains a hyperedge n_{a_i} for each A(i, :), E_B contains a hyperedge n_{b_j} for each B(j, :), and E_C contains a hyperedge n_{c_k} for each C(k, :). Initially, n_{a_i}, n_{b_j}, and n_{c_k} contain only the corresponding vertices a_i, b_j, and c_k. Then, for each nonzero x_ijk ∈ X, we add the vertex x_ijk to the hyperedges n_{a_i}, n_{b_j}, and n_{c_k} to model the data dependency on the corresponding rows of A, B, and C. As in the previous subsection, one needs a multi-constraint formulation, for three reasons. First, since the x_ijk vertices represent the Hadamard products, they should be evenly distributed among the processes. Second, as the owners of the rows of A, B, and C are possible destinations of partial results (at Line 8), it is advisable to spread them evenly. Third, an almost equal number of rows per process implies balanced memory usage and balanced BLAS3 operations at Lines 12 and 20, even though this operation is not our main concern for the computational load. With these constraints, we associate a weight vector of size N + 1 with each vertex. For a vertex x_ijk, we set w(x_ijk) = [0, 0, 0, 3]; for the vertices a_i, b_j, and c_k, we set w(a_i) = [1, 0, 0, 0], w(b_j) = [0, 1, 0, 0], and w(c_k) = [0, 0, 1, 0].
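A corresponding sketch for assembling the fine-grain hypergraph from the nonzeros is given below (illustrative names; weight vectors of size N + 1 as just described).

from collections import defaultdict

def fine_grain_hypergraph(nonzeros, N=3):
    weights = {}
    pins = defaultdict(set)
    for nz in nonzeros:                       # nz = (i, j, k)
        xv = ('nz', nz)                       # one vertex per nonzero
        weights[xv] = [0] * N + [3]           # w(x_ijk) = [0, 0, 0, 3]
        for m, idx in enumerate(nz):
            v = (m, idx)                      # one vertex per factor-matrix row
            if v not in weights:
                weights[v] = [0] * (N + 1)
                weights[v][m] = 1             # w(a_i) = [1, 0, 0, 0], etc.
            pins[(m, idx)].add(v)             # n_ai initially contains a_i ...
            pins[(m, idx)].add(xv)            # ... plus every x_ijk whose mode-m index matches
    return weights, pins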

In Figure 3.1b, we demonstrate the fine-grain hypergraph model for the sample tensor of the previous section. We exclude the vertex weights from the figure. Similar to the coarse-grain model, the dependencies on the rows of A, B, and C are modeled by connecting each vertex x_ijk to the hyperedges n_{a_i}, n_{b_j}, and n_{c_k} to capture the communication cost.

Consider now a P-way partition of the vertices of H = (V, E), and associate each part with a unique process to obtain a P-way parallel CP-ALS with the fine-grain Mttkrp operation. The partition of the x_ijk vertices uniquely partitions the Hadamard products. Satisfying the fourth balance constraint in (3.3) balances the computational loads of the processes. Similarly, satisfying the first three constraints leads to an equitable partition of the rows of the factor matrices among the processes. Let μ_X(i, j, k) = p and μ_A(i) = ℓ, that is, the vertices x_ijk and a_i are in different parts. Then, the process p sends a partial result on M_A(i, :) to the process ℓ at Line 8 of Algorithm 4. By looking at these messages carefully, we see that all processes whose corresponding parts are in the connectivity set of n_{a_i} will send a message to μ_A(i). Since μ_A(i) is also in this set, the total volume of the send operations concerning M_A(i, :) at Line 8 is equal to λ_{n_{a_i}} − 1. Therefore, the connectivity−1 cut-size metric (3.2) over the hyperedges in E_A encodes the total volume of messages sent at Line 8, if we set c_h = R for all hyperedges in E_A. Since the send operations at Line 15 are the duals of the send operations at Line 8, the total volume of messages sent at Line 15 for the first mode is also equal to this number. By extending this reasoning to all hyperedges, we see that the cumulative (over all modes) volume of communication is equal to the connectivity−1 cut-size metric (3.2).

The presence of multiple constraints usually causes difficulties for the current state-of-the-art partitioning tools (Aykanat et al. [15] discuss the issue for PaToH [40]). In preliminary experiments, we observed that this was the case for us as well. In an attempt to circumvent this issue, we designed the following alternative. We start with the same hypergraph and remove the vertices corresponding to the rows of the factor matrices. Then, we use a single-constraint partitioning on the resulting hypergraph to partition the vertices of the nonzeros. We are then faced with the problem of assigning the rows of the factor matrices, given the partition on the tensor nonzeros. Any objective function (relevant to our parallelization context) for this problem would likely lead to a variant of the NP-complete Bin-Packing Problem [93, p. 124], as in the SpMV case [27, 208]. Therefore, we handle this problem heuristically by making the following observations: (i) the modes can be partitioned one at a time; (ii) each row corresponds to a removed vertex for which there is a corresponding hyperedge; (iii) the partition of the x_ijk vertices defines a connectivity set for each hyperedge. This is essentially a multi-constraint partitioning which takes advantage of the listed observations. We use the connectivity of the corresponding hyperedge as the key value of a row to impose an order. The rows are then visited in decreasing order of key values. Each visited row is assigned to the least loaded part in the connectivity set of its corresponding hyperedge, if that part is not overloaded. Otherwise, the visited vertex is assigned to the least loaded part at that moment. Note that in the second case we add one to the total communication volume; otherwise, the total communication volume obtained by partitioning the x_ijk vertices remains the same. Note also that the key values correspond to the volume of receives at Line 10 of Algorithm 4; hence, this method also aims to balance the volume of receives at Line 10 and the volume of the dual send operations at Line 15.
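A minimal sketch of this assignment procedure, under the assumption that a single-constraint partition part_of_nz of the nonzero vertices and a load bound max_load per part are given, could look as follows (all names are illustrative, and one mode is handled per call).

def assign_rows(row_pins, part_of_nz, num_parts, max_load):
    # row_pins maps each factor-matrix row (for one mode) to the nonzeros pinning its hyperedge.
    conn = {r: {part_of_nz[x] for x in xs} for r, xs in row_pins.items()}
    load = [0] * num_parts
    assignment = {}
    for r in sorted(conn, key=lambda r: len(conn[r]), reverse=True):   # decreasing key values
        allowed = [p for p in conn[r] if load[p] < max_load]
        if allowed:                                    # least loaded part in the connectivity set
            p = min(allowed, key=lambda q: load[q])
        else:                                          # otherwise the least loaded part overall
            p = min(range(num_parts), key=lambda q: load[q])
        assignment[r] = p
        load[p] += 1
    return assignment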

3.4 Experiments

We conducted our experiments on the IDRIS Ada cluster, which consists of 304 nodes, each having 128 GB of memory and four 8-core Intel Sandy Bridge E5-4650 processors running at 2.7 GHz. We ran our experiments on up to 1024 cores and assigned one MPI process per core. All codes we used in our benchmarks were compiled using the Intel C++ compiler (version 14.0.1.106) with the -O3 option for compiler optimizations, -axAVX,SSE4.2 to enable SSE optimizations whenever possible, and the -mkl option to use the Intel MKL library (version 11.1.1) for LAPACK, BLAS, and VML (Vector Math Library) routines. We also compiled the DFacTo code using the Eigen (version 3.2.4) library [102] with the EIGEN_USE_MKL_ALL flag set, to ensure that it utilizes the routines provided in the Intel MKL library for the best performance. We used PaToH [40] with default options to partition the hypergraphs. We created all partitions offline and ran our experiments on these partitioned tensors on the cluster. We do not report timings for partitioning the hypergraphs, which is quite costly, for two reasons. First, in most applications, the tensors from real-world data are built incrementally; hence, a partition for the updated tensor can be formed by refining the partition of the previous tensor. Also, one can decompose a tensor multiple times with different ranks of approximation. Thereby, the cost of an initial partitioning of a tensor can be amortized across multiple runs. Second, we had to partition the data on a system different from the one used for the CP-ALS runs. It would not be very useful to compare run times from different systems, and repeating the same runs on a different system would not add much to the presented results. We also note that the proposed coarse and fine-grain Mttkrp formulations are independent of the partitioning method.

Tensor     I_1     I_2     I_3     I_4    nonzeros
netflix    480K    17K     2K      -      100M
NELL-B     3.2M    301     638K    -      78M
deli       530K    17M     2.4M    1K     140M
flickr     319K    28M     1.6M    713    112M

Table 3.1: Size of the tensors used in the experiments.

Our dataset for the experiments consists of four sparse tensors, whose sizes are given in Table 3.1, obtained from various resources [21, 37, 97].

We briefly describe the methods compared in the experiments. DFacTo is the CP-ALS routine we compare our methods against; it uses a block partitioning of the rows of the matricized tensors to distribute the matrix nonzeros equally among processes. In the method ht-coarsegrain-block, we operate the coarse-grain CP-ALS implementation in HyperTensor on a tensor whose slices in each mode are partitioned consecutively to ensure an equal number of nonzeros among processes, similarly to DFacTo's approach. The method ht-coarsegrain-hp runs the same CP-ALS implementation, and it uses the hypergraph partitioning. The method ht-finegrain-random partitions the tensor nonzeros, as well as the rows of the factor matrices, randomly to establish load balance; it uses the fine-grain CP-ALS implementation in HyperTensor. Finally, the method ht-finegrain-hp corresponds to the fine-grain CP-ALS implementation in HyperTensor operating on the tensor partitioned according to the hypergraph model proposed for the fine-grain task decomposition. We benchmark our methods against DFacTo on the Netflix and NELL-B datasets only, as DFacTo can only process 3-mode tensors. We let the CP-ALS implementations run for 20 iterations on each dataset with R = 10 and record the average time spent per iteration.

In Figures 3.2a and 3.2b, we show the time spent per CP-ALS iteration on the Netflix and NELL-B tensors for all algorithms. In Figures 3.2c and 3.2d, we show the speedups per CP-ALS iteration on the Flickr and Delicious tensors for the methods excluding DFacTo. While performing the comparison with DFacTo, we preferred to show the time spent per iteration instead of the speedup, as the sequential run time of our methods differs from that of DFacTo.


[Figure 3.2: Time for parallel CP-ALS iterations on Netflix and NELL-B, and speedups on Flickr and Delicious, as functions of the number of MPI processes (1 to 1024), for the methods ht-finegrain-hp, ht-finegrain-random, ht-coarsegrain-hp, ht-coarsegrain-block, and DFacTo. (a) Time on Netflix; (b) Time on Nell-B; (c) Speedup on Flickr; (d) Speedup on Delicious.]


Also, in Table 3.2, we show the statistics regarding the communication and computation requirements of the 512-way partitions of the Netflix tensor for all proposed methods.

We first observe in Figure 3.2a that, on Netflix, ht-finegrain-hp clearly outperforms all other methods by achieving a speedup of 194x with 512 cores over a sequential execution, whereas ht-coarsegrain-hp, ht-coarsegrain-block, DFacTo, and ht-finegrain-random could only yield 69x, 63x, 49x, and 40x speedups, respectively. Table 3.2 shows that, for the 512-way partitioning of the same tensor, ht-finegrain-random, ht-coarsegrain-block, and ht-coarsegrain-hp result in total send-communication volumes of 142M, 80M, and 77M units, respectively, whereas the ht-finegrain-hp partitioning requires only 7.6M units, which explains the superior parallel performance of ht-finegrain-hp. Similarly, in Figures 3.2b, 3.2c, and 3.2d, ht-finegrain-hp achieves 81x, 129x, and 123x speedups on the NELL-B, Flickr, and Delicious tensors, while the best of all other methods could only obtain 39x, 55x, and 65x, respectively, with up to 1024 cores. As seen in these figures, ht-finegrain-hp is the fastest of all proposed methods. It experiences slow-downs at 1024 cores for all instances.

One point to note in Figure 3.2b is that our Mttkrp implementation gets notably more efficient than DFacTo on NELL-B. We believe that the main reason for this difference is DFacTo's column-by-column way of computing the factor matrices. We instead compute all columns of the factor matrices simultaneously. This vector-wise mode of operation on the rows of the factors can significantly improve the data locality and cache efficiency, which leads to this performance difference. Another point in the same figure is that HyperTensor running with the ht-coarsegrain-block partitioning scales better than DFacTo, even though they use similar partitionings of the matricized tensor or the tensor. This is mainly due to the fact that DFacTo uses MPI_Allgatherv to communicate the local rows of the factor matrices to all processes, which incurs a significant increase in the communication volume, whereas our coarse-grain implementation performs point-to-point communications. Finally, in Figure 3.2a, DFacTo has a slight edge over our methods in sequential execution. This could be due to the fact that it requires 3(|X| + F) multiply-add operations across all modes, where F is the average number of non-empty fibers and is upper-bounded by |X|, whereas our implementation always takes 6|X| operations.

Partitioning metrics in Table 3.2 show that ht-finegrain-hp is able to successfully reduce the total communication volume, which brings about its improved scalability. However, we do not see the same effect when using hypergraph partitioning for the coarse-grain task model, due to the inherent limitations of 1D partitionings. Another point to note is that, despite reducing the total communication volume explicitly, imbalance in the communication volume still exists among the processes, as this objective is not directly captured by the hypergraph models. However, the most important limitation on further scalability is the communication latency.


                          Comp. load          Comm. volume        Num. msg.
Mode                      Max       Avg       Max       Avg       Max     Avg
ht-finegrain-hp
  1                       196672    196251    21079     6367      734     316
  2                       196672    196251    18028     5899      1022    1016
  3                       196672    196251    3545      2492      1022    1018
ht-finegrain-random
  1                       197507    196251    272326    252118    1022    1022
  2                       197507    196251    29282     22715     1022    1022
  3                       197507    196251    7766      4300      1013    1003
ht-coarsegrain-hp
  1                       364181    196251    302001    136741    511     511
  2                       349123    196251    59523     12228     511     511
  3                       737570    196251    23524     2000      511     507
ht-coarsegrain-block
  1                       198602    196251    239337    142006    448     447
  2                       367966    196251    33889     12458     511     445
  3                       737570    196251    24659     2049      511     394

Table 3.2: Statistics for the computation and communication requirements in one CP-ALS iteration, for 512-way partitionings of the Netflix tensor.

Table 3.2 shows that each process communicates with almost all others, especially on the small dimensions of the tensors, and that the number of messages is doubled for the fine-grain methods. To alleviate this issue, one can try to limit the latency following previous work [201, 208].

3.5 Summary, further notes, and references

The pros and cons of the coarse and fine-grain models resemble those of the 1D and 2D partitionings in the SpMV context. There are two advantages of the coarse-grain model over the fine-grain one. First, there is only one communication/computation step in the coarse-grain model, whereas the fine-grain model requires two computation and communication phases. Second, the coarse-grain hypergraph model is smaller than the fine-grain hypergraph model (one vertex per index of each dimension, versus additionally having one vertex per nonzero). However, one major limitation of the coarse-grain parallel decomposition is that restricting all nonzeros in an (N − 1)-dimensional slice of an N-dimensional tensor to reside in the same process poses a significant constraint towards communication minimization: (N − 1)-dimensional data typically has a very large "surface", which necessitates gathering numerous rows of the factor matrices. As N increases, one might expect this phenomenon to increase the communication volume even further. Also, if the tensor X is not large enough in one of its modes, this poses a granularity problem, which potentially causes load imbalance and limits the parallelism. None of these issues exists in the fine-grain task decomposition.

In more recent work [135], we extended the discussed approach for efficient parallelization on shared and distributed memory systems. For the shared-memory computations, we proposed a novel computational scheme, called dimension trees, that significantly reduces the computational cost while offering effective parallelism. We then performed theoretical analyses of the computational gains and the memory use. These analyses were also validated with experiments. We also presented comparisons with the recent version of SPLATT [204, 205], which uses a medium-grain task model and an associated partitioning [187]. Recent work discusses algorithms for partitioning for the medium-grain task decomposition [3].

A dimension tree is a data structure that partitions the mode indices of an N-dimensional tensor in a hierarchical manner for computing tensor decompositions efficiently. It was first used in the hierarchical Tucker format representing the hierarchical Tucker decomposition of a tensor [99], which was introduced as a computationally feasible alternative to the original Tucker decomposition for higher order tensors. Our work [135] presents a novel way of using dimension trees for computing the standard CP decomposition of tensors, with a formulation that asymptotically reduces the computational cost by memorizing certain partial computations.

For computing the CP decomposition of dense tensors, Phan et al. [189] propose a scheme that divides the tensor modes into two sets, pre-computes the tensor-times-multiple-vector multiplications (TTMVs) for each mode set, and finally reuses these partial results to obtain the final Mttkrp result in each mode. This provides a factor of two improvement in the number of TTMVs over the standard approach, and our dimension tree-based framework can be considered as the generalization of this approach that provides a factor of N/log N improvement. Li et al. [153] use the same idea of storing intermediate tensors, but use a different formulation based on the tensor-times-tensor multiplication and the tensor-matrix Hadamard product operations for shared memory systems. The overall approach is similar to that proposed by Phan et al. [189], where the difference lies in the application of the method to sparse tensors and in the auto-tuning used to better control the memory use and the gains in the operation counts.

In most of the sparse tensors available (e.g., see FROSTT [203]), one of the dimensions is usually very small. If the indices in that dimension are vertices in a hypergraph, then we have very high degree vertices; if the indices are hyperedges, then we have hyperedges of very large sizes. Both cases are unfavorable for the traditional multilevel hypergraph partitioning methods, in which the run time of the efficient coarsening heuristics is super-linear in terms of the vertex/hyperedge degrees.

Part II

Matching and related problems


Chapter 4

Matchings in graphs

In this chapter, we address the problem of approximating the maximum cardinality of a matching in undirected graphs with fast heuristics. Most of the terms and definitions, including the problem itself, are standard, and were summarized in Section 1.2. The formal problem definition is repeated below for convenience.

Approximating the maximum cardinality of a matching in undirected graphs. Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.

We design randomized heuristics for the stated problem and analyze their approximation guarantees. Both the design and the analysis of the heuristics are based on doubly stochastic matrices (see Section 1.4) and on scaling nonnegative matrices (in our case, the adjacency matrices) to the doubly stochastic form. We therefore first give some background on scaling matrices to doubly stochastic form.

4.1 Scaling matrices to doubly stochastic form

Any nonnegative matrix A with total support can be scaled with two positive diagonal matrices D_R and D_C such that D_R A D_C is doubly stochastic (that is, the sum of entries in any row and in any column of D_R A D_C is equal to one). If the matrix is fully indecomposable, then the scaling matrices are unique up to a scalar multiplication (if the pair D_R and D_C scales the matrix, then so do αD_R and (1/α)D_C for α > 0). If A has support but not total support, then A can be scaled to a doubly stochastic matrix, but not with two positive diagonal matrices [202]; this fact is also seen in more recent treatments [139, 141, 198].

The Sinkhorn-Knopp algorithm [202] is a well-known method for scaling matrices to doubly stochastic form. This algorithm generates a sequence of matrices (whose limit is doubly stochastic) by normalizing the columns and the rows of the sequence of matrices alternately, as shown in Algorithm 5. First, the initial matrix is normalized such that each column has sum one. Then, the resulting matrix is normalized so that each row has sum one, and so on. If the initial matrix is symmetric and has total support, the Sinkhorn-Knopp algorithm will find a symmetric doubly stochastic matrix in the limit, while the intermediate matrices generated in the sequence are not symmetric. Two other iterative scaling algorithms [140, 141, 198] maintain symmetry all along the way and are therefore more useful for symmetric matrices in practice. Efficient shared-memory and distributed-memory parallelizations of the Sinkhorn-Knopp and related algorithms have been discussed elsewhere [9, 41, 76].

Algorithm 5: ScaleSK: Sinkhorn-Knopp scaling

Data: A, an n × n matrix with total support; ε, the error threshold
Result: dr, dc: row/column scaling arrays

1   for i = 1 to n do
2     dr[i] ← 1
3     dc[i] ← 1
4   for j = 1 to n do
5     csum[j] ← Σ_{i ∈ A_{*j}} a_ij
6   repeat
7     for j = 1 to n do
8       dc[j] ← 1/csum[j]
9     for i = 1 to n do
10      rsum ← Σ_{j ∈ A_{i*}} a_ij × dc[j]
11      dr[i] ← 1/rsum
12    for j = 1 to n do
13      csum[j] ← Σ_{i ∈ A_{*j}} dr[i] × a_ij
14  until |max{1 − csum[j] : 1 ≤ j ≤ n}| ≤ ε
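A compact dense-matrix sketch of Algorithm 5 in Python/NumPy is shown below; it assumes A is a nonnegative NumPy array with total support, and a sparse implementation would follow the same steps.

import numpy as np

def scale_sk(A, eps=1e-6, max_iters=1000):
    # Returns dr, dc such that diag(dr) @ A @ diag(dc) is (nearly) doubly stochastic.
    n = A.shape[0]
    dr = np.ones(n)
    for _ in range(max_iters):
        dc = 1.0 / (A.T @ dr)              # normalize column sums to one
        dr = 1.0 / (A @ dc)                # normalize row sums to one
        csum = (A.T @ dr) * dc             # column sums of the scaled matrix
        if np.max(np.abs(1.0 - csum)) <= eps:
            break
    return dr, dc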

4.2 Approximation algorithms for bipartite graphs

Most of the existing heuristics obtain good results in practice, but their worst-case guarantee is only around 1/2. Among those, the Karp-Sipser heuristic [128] is very well known. It finds maximum cardinality matchings in highly sparse (random) graphs and matches all but O(n^{1/5}) vertices of a random undirected graph [14] (this heuristic will be reviewed later in Section 4.2.1). Karp-Sipser also obtains very good results in practice. Currently, it is the suggested one to be used as a jump-start routine [71, 148] for exact algorithms, especially for augmenting-path based ones [132]. Algorithms that achieve an approximation ratio of 1 − 1/e, where e is the base of the natural logarithm, are designed for the online case [129]. Many of these algorithms are sequential in nature, in that a sequence of greedy decisions is made in the light of the previously made decisions. Another heuristic is obtained by truncating the Hopcroft and Karp (HK) algorithm. HK, starting from a given matching, augments along a maximal set of shortest disjoint paths until there are no augmenting paths. If one lets HK run until the shortest augmenting paths are of length k, then a 1 − 2/k approximate matching is obtained for k ≥ 3. The run time of this heuristic is O(τk) for a bipartite graph with τ edges.

We propose two matching heuristics (Section 4.2.2) for bipartite graphs. Both heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum matching in the selected subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge in both heuristics is obtained with a matrix scaling method. The first heuristic is shown to deliver a constant approximation guarantee of 1 − 1/e ≈ 0.632 of the maximum cardinality matching, under the condition that the scaling method has scaled the input matrix. The second one builds on top of the first one and improves the approximation ratio to 0.866, under the same condition as in the first one. Both heuristics are designed to be amenable to parallelization on modern multicore systems. The first heuristic does not require a conflict resolution scheme. Furthermore, it does not have any synchronization requirements. The second heuristic employs Karp-Sipser to find a matching on the selected subgraph. We show that Karp-Sipser becomes an exact algorithm on the selected subgraphs. Further analysis of the properties of those subgraphs is carried out to design a specialized implementation of Karp-Sipser for efficient parallelization. The approximation guarantees of the two proposed heuristics do not deteriorate with an increased degree of parallelization, thanks to their design, which is usually not the case for parallel matching heuristics [16].

For the analysis, we assume that the two vertex classes contain the same number of vertices and that the associated adjacency matrix has total support. Under these criteria, the Sinkhorn-Knopp scaling algorithm summarized in Section 4.1 converges in a finite number of steps and obtains a doubly stochastic matrix. Later on (towards the end of Section 4.2.2 and in the experiments presented in Section 4.2.3), we discuss bipartite graphs without the mentioned properties.

4.2.1 Related work

Parallel (exact) matching algorithms on modern architectures have recently been investigated. Azad et al. [16] study the implementations of a set of known bipartite graph matching algorithms on shared memory systems. Deveci et al. [66, 65] investigate the implementation of some known matching algorithms, or their variants, on GPUs. There are quite good speedups reported in these implementations, yet there are non-trivial instances where parallelism does not help (for any of the algorithms).

Our focus is on matching heuristics that have linear run time complexity and good quality guarantees on the size of the matching. Recent surveys of matching heuristics are given by Duff et al. [71, Section 4], Langguth et al. [148], and more recently by Pothen et al. [195]. Two heuristics, called Greedy and Karp-Sipser, stand out and are suggested as initialization steps in the best two exact matching algorithms [132]. These two heuristics have also attracted theoretical interest.

The Greedy heuristic has two variants in the literature. The first variant randomly visits the edges and matches the two endpoints of an edge if they are both available. The theoretical performance guarantee of this heuristic is 1/2; i.e., the heuristic delivers matchings of size at least half of the maximum cardinality of a matching. This has been analyzed theoretically [78] and shown to obtain results that are near the worst case on certain classes of graphs. The second variant of Greedy repeatedly selects a random vertex and matches it with a random neighbor. The matched vertices, along with the ones which become isolated, are removed from the graph, and the process continues until the whole graph is consumed. This variant also has a 1/2 worst-case approximation guarantee (see, for example, a proof by Pothen and Fan [194]), and it is somewhat better (0.5 + ε for ε ≥ 0.0000025 [13], which has recently been improved to ε ≥ 1/256 [191]).

We make use of the Karp-Sipser heuristic to design one of the proposed heuristics. We summarize a commonly used variant of Karp-Sipser here and refer the reader to the original paper [128] for details. The theoretical foundation of Karp-Sipser is that if there is a vertex v with exactly one neighbor (v is called degree-one), then there is a maximum cardinality matching in which v is matched with its neighbor. That is, matching v with its neighbor is an optimal decision. Using this observation, Karp-Sipser runs as follows. Check whether there is a degree-one vertex; if so, then match the vertex with its unique neighbor and delete both vertices (and the edges incident on them) from the graph. Continue this way until the graph has no edges (in which case we are done) or all remaining vertices have degree larger than one. In the latter case, pick a random edge, match the two endpoints of this edge, and delete those vertices and the edges incident on them. Then, repeat the whole process on the remaining graph. The phase before the first random choice of edges is called Phase 1, and the rest is called Phase 2 (where new degree-one vertices may arise). The run time of this heuristic is linear. This heuristic matches all but O(n^{1/5}) vertices of a random undirected graph [14]. One disadvantage of Karp-Sipser is that, because of the degree dependencies of the vertices on the already matched vertices, an efficient parallelism is hard to achieve (a list of degree-one vertices needs to be maintained). That is probably why some inflicted forms (successful, but without any known quality guarantee) of this heuristic were used in recent studies [16].

Recent studies focusing on approximate matching algorithms on parallel systems include heuristics for the graph matching problem [42, 86, 105], and also heuristics used for initializing bipartite matching algorithms [16]. Lotker et al. [164] present a distributed (1 − 1/k)-approximate matching algorithm for bipartite graphs. This nice theoretical algorithm has O(k^3 log ∆ + k^2 log n) time steps with a message length of O(log ∆), where ∆ is the maximum degree of a vertex and n is the number of vertices in the graph. Blelloch et al. [29] propose an algorithm to compute maximal matchings (1/2 approximate) with O(τ) work and O(log^3 τ) depth, with high probability, on a bipartite graph with τ edges. This is an elaboration of the Greedy heuristic for parallel systems. Although the performance metrics, work and depth, are quite impressive, the approximation guarantee stays as in the serial variant. A striking property of this heuristic is that it trades parallelism and reduced work while always finding the same matching (including those found in the sequential version). Birn et al. [26] discuss maximal (1/2 approximate) matchings in O(log^2 n) time and O(τ) work in the CREW PRAM model. Practical implementations on distributed memory and GPU systems are discussed; a shared memory implementation is left as future work in the cited paper. Patwary et al. [185] discuss an efficient implementation of Karp-Sipser on distributed memory systems and show that the communication requirement (in terms of the volume) is proportional to that of sparse matrix-vector multiplication.

The overall approach is related to a random bipartite graph model called the random k-out model [211]. In these graphs, every vertex chooses k neighbors uniformly at random from the other part of vertices. Walkup [211] shows that a 1-out subgraph of a complete bipartite graph G = (V_1 ∪ V_2, E), where |V_1| = |V_2|, asymptotically almost surely (a.a.s.) does not contain a perfect matching. He also shows that a 2-out subgraph of a complete bipartite graph a.a.s. contains a perfect matching. Karonski and Pittel [125], later with Overman [124], extend the analysis to two other cases. In the first case, the vertices that were not selected at all choose one more neighbor; in the second case, the vertices that were selected only once choose one more neighbor. The initial claim [125] was that in the first case there was a perfect matching a.a.s., which was shown to be false [124]. The existence of perfect matchings in the second case, in which the graph is still between 1-out and 2-out, has been shown recently [124].

4.2.2 Two matching heuristics

We propose two matching heuristics for the approximate maximum cardinality matching problem on bipartite graphs. The heuristics are efficiently parallelizable and have guaranteed approximation ratios. The first heuristic does not require synchronization nor conflict resolution, assuming that the write operations to the memory are atomic (such settings are common; see a discussion in the original paper [76]). This heuristic has an approximation guarantee of around 0.632. The second heuristic is designed to obtain larger matchings compared with those obtained by the first one. This heuristic employs Karp-Sipser on a judiciously chosen subgraph of the input graph. We show that, for this subgraph, Karp-Sipser is an exact algorithm, and a specialized, efficient implementation is possible to obtain matchings of size around 0.866 of the maximum cardinality.

One-sided matching

The first matching heuristic we propose, OneSidedMatch, scales the given adjacency matrix A (each a_ij is originally either 0 or 1) and uses the scaled entries to randomly choose a column as a match for each row. The pseudocode of the heuristic is shown in Algorithm 6.

Algorithm 6: OneSidedMatch: Bipartite graphs

Data: A, an n × n (0, 1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1   ⟨dr, dc⟩ ← ScaleSK(A)
2   for j = 1 to n in parallel do
3     cmatch[j] ← NIL
4   for i = 1 to n in parallel do
5     Pick a random column j ∈ A_{i*} by using the probability density function s_ik / Σ_{ℓ ∈ A_{i*}} s_iℓ for all k ∈ A_{i*}, where s_ik = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
6     cmatch[j] ← i

OneSidedMatch first obtains the scaling vectors dr and dc corresponding to a doubly stochastic matrix S (Line 1). After initializing the cmatch array, for each row i of A, the heuristic randomly chooses a column j ∈ A_{i*} based on the probabilities computed by using the corresponding scaled entries of row i. It then matches i and j. Clearly, multiple rows can choose the same column and write to the same entry in cmatch. We assume that, in the parallel shared-memory setting, one of the write operations survives, and the cmatch array defines a valid matching, i.e., {{cmatch[j], j} : cmatch[j] ≠ NIL}. We now analyze its approximation guarantee in terms of the matching cardinality.

Theorem 4.1. Let A be an n × n (0, 1)-matrix with total support. Then, OneSidedMatch obtains a matching of size at least n(1 − 1/e) ≈ 0.632n in expectation, as n → ∞.

Proof. To compute the matching cardinality, we will count the columns that are not picked by any row and subtract this count from n. Since Σ_{k ∈ A_{i*}} s_ik = 1 for each row i of S, the probability that a column j is not picked by any of the rows in A_{*j} is equal to ∏_{i ∈ A_{*j}} (1 − s_ij). By applying the arithmetic-geometric mean inequality, we obtain

    ( ∏_{i ∈ A_{*j}} (1 − s_ij) )^{1/d_j} ≤ ( d_j − Σ_{i ∈ A_{*j}} s_ij ) / d_j ,

where d_j = |A_{*j}| is the degree of the column vertex j. Therefore,

    ∏_{i ∈ A_{*j}} (1 − s_ij) ≤ ( 1 − Σ_{i ∈ A_{*j}} s_ij / d_j )^{d_j} .

Since S is doubly stochastic, we have Σ_{i ∈ A_{*j}} s_ij = 1, and

    ∏_{i ∈ A_{*j}} (1 − s_ij) ≤ ( 1 − 1/d_j )^{d_j} .

The function on the right-hand side is an increasing one and has the limit

    lim_{d_j → ∞} ( 1 − 1/d_j )^{d_j} = 1/e ,

where e is the base of the natural logarithm. By the linearity of expectation, the expected number of unmatched columns is no larger than n/e. Hence, the cardinality of the matching is no smaller than n(1 − 1/e).

In Algorithm 6, we split the rows among the threads with a parallel for construct. For each row i, the corresponding thread chooses a random number r from a uniform distribution with range (0, Σ_{k ∈ A_{i*}} s_ik]. Then, the last nonzero column index j for which Σ_{1 ≤ k ≤ j} s_ik ≤ r is found, and cmatch[j] is set to i. Since no synchronization or conflict detection is required, the heuristic promises significant speedups.
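A serial sketch of OneSidedMatch with this sampling scheme is given below; rows are assumed to be given as lists of column indices (a CSR-like structure), names are illustrative, and the inverse-CDF search is done with a cumulative sum.

import numpy as np

def one_sided_match(rows, dr, dc, rng=None):
    # rows[i] lists the column indices of A_i*; dr, dc come from scale_sk.
    rng = rng or np.random.default_rng()
    n = len(rows)
    cmatch = [None] * n
    for i in range(n):
        cols = rows[i]
        s = dr[i] * np.asarray([dc[k] for k in cols])    # scaled entries s_ik of row i
        cum = np.cumsum(s)
        j = cols[np.searchsorted(cum, rng.uniform(0.0, cum[-1]))]
        cmatch[j] = i                                    # later writes may overwrite earlier ones
    return cmatch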

Two-sided matching

OneSidedMatch's approximation guarantee and suitable structure for parallel architectures make it a good heuristic. The natural question that follows is whether a heuristic with a better guarantee exists. Of course, the sought heuristic should also be simple and easy to parallelize. We asked: "what happens if we repeat the process for the other (column) side of the bipartite graph?" The question led us to the following algorithm. Let each row select a column, and let each column select a row. Take all these 2n choices to construct a bipartite graph G (a subgraph of the input) with 2n vertices and at most 2n edges (if i chooses j and j chooses i, we have one edge), and seek a maximum cardinality matching in G. Since the number of edges is at most 2n, any exact matching algorithm on this graph would be fast; in particular, the worst-case run time would be O(n^{1.5}) [117]. Yet, we can do better and obtain a maximum cardinality matching in linear time by running the Karp-Sipser heuristic on G, as we display in Algorithm 7.

Algorithm 7: TwoSidedMatch: Bipartite graphs

Data: A, an n × n (0, 1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1   ⟨dr, dc⟩ ← ScaleSK(A)
2   for i = 1 to n in parallel do
3     Pick a random column j ∈ A_{i*} by using the probability density function s_ik / Σ_{ℓ ∈ A_{i*}} s_iℓ for all k ∈ A_{i*}, where s_ik = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
4     rchoice[i] ← j
5   for j = 1 to n in parallel do
6     Pick a random row i ∈ A_{*j} by using the probability density function s_kj / Σ_{ℓ ∈ A_{*j}} s_ℓj for all k ∈ A_{*j}
7     cchoice[j] ← i
8   Construct a bipartite graph G = (V_R ∪ V_C, E), where
      E = {{i, rchoice[i]} : i ∈ {1, ..., n}} ∪ {{cchoice[j], j} : j ∈ {1, ..., n}}
9   match ← Karp-Sipser(G)

The most interesting component of TwoSidedMatch is the incorporation of the Karp-Sipser heuristic, for two reasons. First, although it is only a heuristic, Karp-Sipser computes a maximum cardinality matching on the bipartite graph G constructed in Algorithm 7. Second, although Karp-Sipser has a sequential nature, we can obtain good speedups with a specialized parallel implementation. In general, it is hard to efficiently parallelize (non-trivial) graph algorithms, and it is even harder when the overall cost is O(n), which is the case for Karp-Sipser on G. We give a series of lemmas below, which enable us to use Karp-Sipser as an exact algorithm with a good shared-memory parallel performance.

The first lemma describes the structure of G constructed at Line 8 of TwoSidedMatch.

Lemma 4.1. Each connected component of G constructed in Algorithm 7 contains at most one simple cycle.

Proof. A connected component M ⊆ G with n′ vertices can have at most n′ edges. Let T be a spanning tree of M. Since T contains n′ − 1 edges, the remaining edge in M can create at most one cycle when added to T.

Lemma 4.1 explains why Karp-Sipser is an exact algorithm on G. If a component does not contain a cycle, Karp-Sipser consumes all its vertices in Phase 1. Therefore, all of the matching decisions given by Karp-Sipser are optimal for this component. Assume a component contains a simple cycle. After Phase 1, the component is either consumed, or, due to Lemma 4.1, it is reduced to a simple cycle. In the former case, the matching is a maximum cardinality one. In the latter case, an arbitrary edge of the cycle can be used to match a pair of vertices. This decision necessarily leads to a unique perfect matching in the remaining simple path. These two arguments can be repeated for all the connected components to see that the Karp-Sipser heuristic finds a maximum cardinality matching in G.

Algorithm 8 describes our parallel implementation, Karp-Sipser-MT. The graph is represented using a single array choice, where choice[u] is the vertex randomly chosen by u ∈ V_R ∪ V_C. The choice array is a concatenation of the arrays rchoice and cchoice set in TwoSidedMatch. Hence, an explicit graph construction for G (Line 8 of Algorithm 7) is not required, and a transformation of the selected edges into a graph storage scheme is avoided. Karp-Sipser-MT uses three atomic operations for synchronization. The first operation, Add(memory, value), atomically adds a value to a memory location. It is used to compute the vertex degrees in the initial graph (Line 9). The second operation, CompAndSwap(memory, value, replace), first checks whether the memory location has the value; if so, its content is replaced. The final content is returned. The third operation, AddAndFetch(memory, value), atomically adds a given value to a memory location, and the final content is returned. We will describe the use of these operations later.

Karp-Sipser-MT has two phases which correspond to the two phases ofKarp-Sipser The first phase of Karp-Sipser-MT is similar to that of Karp-Sipser in that optimal matching decisions are made about some degree-onevertices The second phase of Karp-Sipser-MT handles remaining verticesvery efficiently without bothering with their degrees The following defini-tions are used to clarify the difference between an original Karp-Sipser andKarp-Sipser-MT

Definition 41 Given a matching and the array choice let u be an un-matched vertex and v = choice[u] Then u is called

64 CHAPTER 4 MATCHINGS IN GRAPHS

Algorithm 8 Karp-Sipser-MT Bipartite 1-out graphs

Data G = V choice[middot] the chosen vertex for each u isin V Result match[middot] the match array for u isin V

1 for all u isin V in parallel do2 mark[u]larr 13 deg[u]larr 14 match[u]larr NIL

5 for all u isin V in parallel do6 v larr choice[u]7 mark[v]larr 08 if choice[v] 6= u then9 Add(deg[v] 1)

10 for each vertex u in parallel do Phase 1 out-one vertices 11 if mark[u] = 1 then12 curr larr u13 while curr 6= NIL do14 nbr larr choice[curr]15 if CompAndSwap(match[nbr]NIL curr) = curr then16 match[curr]larr nbr17 curr larr NIL18 nextlarr choice[nbr]19 if match[next] = NIL then20 if AddAndFetch(deg[next]minus1) = 1 then21 curr larr next22 else23 curr larr NIL

24 for each column vertex u in parallel do Phase 2 the rest 25 v larr choice[u]26 if match[u] = NIL and match[v] = NIL then27 match[u]larr v28 match[v]larr u

65

bull out-one if v is unmatched and no unmatched vertex w with choice[w] =u exists

bull in-one if v is matched and only a single unmatched vertex w withchoice[w] = u exists

The first phase of Karp-Sipser-MT (the for loop of Line 10) does not track and match all degree-one vertices. Instead, only the out-one vertices are taken into account. For each such vertex u that is already out-one before Phase 1, we have mark[u] = 1. Karp-Sipser-MT visits these vertices (Lines 10-11). Newly arising out-one vertices are consumed right away, without maintaining a list. The second phase of Karp-Sipser-MT (the for loop of Line 24) is much simpler than that of Karp-Sipser, as the degrees of the vertices are not tracked or updated. We now discuss how these simplifications are possible while ensuring a maximum cardinality matching in G.

Observation 41 An out-one or an in-one vertex is a degree-one vertex inKarp-Sipser

Observation 42 A degree-one vertex in Karp-Sipser is either an out-oneor an in-one vertex or it is one of the two vertices u and v in a 2-cliquesuch that v = choice[u] and u = choice[v]

Lemma 4.2 (Proved elsewhere [76, Lemma 3]). If there exists an in-one vertex in G at any time during the execution of Karp-Sipser-MT, an out-one vertex also exists.

According to Observation 4.1, all the matching decisions given by Karp-Sipser-MT in Phase 1 are optimal, since an out-one vertex is a degree-one vertex. Observation 4.2 implies that, among all the degree-one vertices, Karp-Sipser-MT ignores only the in-ones and 2-cliques. According to Lemma 4.2, an in-one vertex cannot exist without an out-one vertex; therefore, they are handled in the same phase. The 2-cliques that survive Phase 1 are handled in Karp-Sipser-MT's Phase 2, since they can be considered as cycles.

To analyze the second phase of Karp-Sipser-MT, we will use the following lemma.

Lemma 4.3 (Proved elsewhere [76, Lemma 4]). Let G′ = (V′_R ∪ V′_C, E′) be the graph induced by the remaining vertices after the first phase of Karp-Sipser-MT. Then, the set

    {{u, choice[u]} : u ∈ V′_R, choice[u] ∈ V′_C}

is a maximum cardinality matching in G′.

In the light of Observations 4.1 and 4.2 and Lemmas 4.1-4.3, Karp-Sipser-MT is an exact algorithm on the graphs created in Algorithm 7. Its worst-case (sequential) run time is linear.
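A serial Python sketch of this specialized Karp-Sipser, operating directly on the choice arrays of Algorithm 7, is given below; it follows the two phases justified by Lemmas 4.1 and 4.3 rather than the atomic-operation details of Algorithm 8, and all names are illustrative.

from collections import defaultdict, deque

def karp_sipser_1out(rchoice, cchoice):
    # Row vertices are 0..n-1 and column vertices are n..2n-1.
    n = len(rchoice)
    adj = defaultdict(set)
    for i, j in enumerate(rchoice):
        adj[i].add(n + j)
        adj[n + j].add(i)
    for j, i in enumerate(cchoice):
        adj[n + j].add(i)
        adj[i].add(n + j)
    match = {}
    # Phase 1: repeatedly match a degree-one vertex with its unique neighbor (optimal decisions).
    queue = deque(u for u in adj if len(adj[u]) == 1)
    while queue:
        u = queue.popleft()
        if u in match or not adj[u]:
            continue
        v = next(iter(adj[u]))
        match[u], match[v] = v, u
        for w in (u, v):
            for x in list(adj[w]):
                adj[x].discard(w)
                if len(adj[x]) == 1 and x not in match:
                    queue.append(x)
            adj[w].clear()
    # Phase 2: what remains is a union of simple cycles (Lemma 4.1); matching each
    # still-free row i with its own choice is a maximum matching of the remainder (Lemma 4.3).
    for i in range(n):
        j = n + rchoice[i]
        if i not in match and j not in match:
            match[i], match[j] = j, i
    return match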


Figure 4.1: A toy bipartite graph with 9 row (circle) and 9 column (square) vertices. The edges are oriented from a vertex u to the vertex choice[u]. Assuming all the vertices are currently unmatched, matching 15-7 (or 5-13) creates two degree-one vertices. But no out-one vertex arises after matching (15-7), and only one, vertex 6, arises after matching (5-13).

Karp-Sipser-MT tracks and consumes only the out-one vertices. This brings high flexibility while executing Karp-Sipser-MT in multi-threaded environments. Consider the example in Figure 4.1. Here, after matching a pair of vertices and removing them from the graph, multiple degree-one vertices can be generated. The standard Karp-Sipser uses a list to store these new degree-one vertices. Such a list is necessary to obtain a larger matching, but the associated synchronizations while updating it in parallel would be an obstacle for efficiency. The synchronization can be avoided up to some level if one sacrifices the approximation quality by not making all optimal decisions (as in some existing work [16]). We continue with the following lemma to take advantage of the special structure of the graphs in TwoSidedMatch for parallel efficiency in Phase 1.

Lemma 4.4 (Proved elsewhere [76, Lemma 5]). Consuming an out-one vertex creates at most one new out-one vertex.

According to Lemma 4.4, Karp-Sipser-MT does not need a list to store the new out-one vertices, since the process can continue with the new out-one vertex. In a shared-memory setting, there are two concerns for the first phase from the synchronization point of view. First, multiple threads consuming different out-one vertices can try to match them with the same unmatched vertex. To handle such cases, Karp-Sipser-MT uses the atomic CompAndSwap operation (Line 15 of Algorithm 8) and ensures that only one of these matchings will be processed. In this case, the other threads, whose matching decisions are not performed, continue with the next vertex in the for loop at Line 10. The second concern is that, while consuming out-one vertices, several threads may create the same out-one vertex (and want to continue with it). For example, in Figure 4.1, when two threads consume the out-one vertices 1 and 2 at the same time, they both will try to continue with vertex 4. To handle such cases, an atomic AddAndFetch operation (Line 20 of Algorithm 8) is used to synchronize the degree-reduction operations on the potential out-one vertices. This approach explicitly orders the vertex consumptions and guarantees that only the thread which performs the last consumption before a new out-one vertex u appears continues with u. The other threads, which wanted to continue with the same path, stop and skip to the next unconsumed out-one vertex in the main for loop. One last concern that can be important for the parallel performance is the maximum length of such paths, since a very long path can yield a significant imbalance in the work distribution to the threads. We investigated the length of such paths experimentally (see the original paper [76]) and observed them to be too short to hurt the parallel performance.

The second phase of Karp-Sipser-MT is efficiently parallelized by using the idea in Lemma 4.3. That is, a maximum cardinality matching for the graph remaining after the first phase of Karp-Sipser-MT can be obtained via a simple parallel for construct (see line 24 of Algorithm 8).

Quality of approximation. If the initial matrix A is the n × n matrix of 1s, that is, a_{ij} = 1 for all 1 ≤ i, j ≤ n, then the doubly stochastic matrix S is such that s_{ij} = 1/n for all 1 ≤ i, j ≤ n. In this case, the graph G created by Algorithm 7 is a random 1-out bipartite graph [211]. Referring to a study by Meir and Moon [172], Karoński and Pittel [125] argue that the maximum cardinality of a matching in a random 1-out bipartite graph is 2(1 − Ω)n ≈ 0.866n in expectation, where Ω ≈ 0.567, also called Lambert's W(1), is the unique solution of the equation Ω e^Ω = 1. We obtain the same result for random 1-out subgraphs of any bipartite graph whose adjacency matrix has total support, as highlighted in the following theorem.

Theorem 4.2. Let A be the n × n adjacency matrix of a given bipartite graph, where A has total support. Then TwoSidedMatch obtains a matching of size 2(1 − Ω)n ≈ 0.866n in expectation as n → ∞, where Ω ≈ 0.567 is the unique solution of the equation Ω e^Ω = 1.

Theorem 4.2 contributes to the known results about the Karp-Sipser heuristic (recall that it is known to leave out O(n^{1/5}) vertices) by showing a constant approximation ratio with some preprocessing. The existence of total support does not seem to be necessary for Theorem 4.2 to hold (see the next subsection).

The proof of Theorem 4.2 is involved and can be found in the original paper [76]. It essentially adapts the original analysis of Karp and Sipser [128] to the special graphs at hand.

Discussion

The Sinkhorn-Knopp scaling algorithm converges linearly (when A has total support), where the convergence rate is equivalent to the square of the second largest singular value of the resulting doubly stochastic matrix [139]. If the matrix does not have support or total support, less is known about the Sinkhorn-Knopp scaling algorithm's convergence; we do not know how much of a reduction we will get at each step. In such cases, we are not able to bound the run time of the scaling step. However, the scaling algorithms should be run for only a few iterations (see also the next paragraph) in practice, in which case the practical run time of our heuristics would be linear (in edges and vertices).
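For reference, here is a minimal sequential sketch of the Sinkhorn-Knopp iteration run for a fixed number of sweeps, with the maximum deviation of the row and column sums from one as the error measure. It is only an illustration of the scaling step (the experiments use a shared-memory parallel version); the function and variable names are ours, and empty rows or columns are assumed not to occur.

```python
import numpy as np

def sinkhorn_knopp(A, num_iters=5):
    """Scale a nonnegative matrix A toward a doubly stochastic form S = D1 A D2.
    Runs a fixed number of sweeps instead of iterating until convergence and
    reports the maximum deviation of a row/column sum from 1."""
    m, n = A.shape
    d1 = np.ones(m)                    # row scaling factors
    d2 = np.ones(n)                    # column scaling factors
    for _ in range(num_iters):
        d2 = 1.0 / (A.T @ d1)          # normalize column sums
        d1 = 1.0 / (A @ d2)            # then normalize row sums
    S = (d1[:, None] * A) * d2[None, :]
    err = max(np.abs(S.sum(axis=0) - 1).max(), np.abs(S.sum(axis=1) - 1).max())
    return S, err

# Example: the n x n matrix of 1s scales to the matrix with entries 1/n.
S, err = sinkhorn_knopp(np.ones((4, 4)), num_iters=1)
```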

We have discussed the proposed matching heuristics while assuming that A has total support. This can be relaxed in two ways to render the overall approach practical for any bipartite graph. The first relaxation is that we do not need to run the scaling algorithms until convergence (as in some other use cases of similar algorithms [141]). If Σ_{i∈A_{*j}} s_{ij} ≥ α instead of Σ_{i∈A_{*j}} s_{ij} = 1 for all j ∈ {1, . . . , n}, then lim_{d→∞} (1 − α/d)^d = 1/e^α. In other words, if we apply the scaling algorithms for a few iterations, or until some relatively large error tolerance, we can still derive similar results. For example, if α = 0.92, we will have a matching of size n(1 − 1/e^α) ≈ 0.601n for OneSidedMatch (larger column sums give improved ratios, but there are columns whose sum is less than one when convergence is not achieved). With the same α, TwoSidedMatch will obtain an approximation of 0.840. In our experiments, the number of iterations was always a few, and the proven approximation guarantees were always observed. The second relaxation is that we do not need total support; we do not even need support, nor an equal number of vertices in the two vertex classes. Since little is known about the scaling methods on such matrices, we only mention some facts and observations, and later on present some experiments to demonstrate the practicality of the proposed OneSidedMatch and TwoSidedMatch heuristics.
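The constants appearing in these guarantees can be checked numerically; a small sketch (SciPy's lambertw gives W(1)):

```python
import numpy as np
from scipy.special import lambertw

omega = lambertw(1).real        # solves omega * exp(omega) = 1, about 0.567
print(1 - 1 / np.e)             # OneSidedMatch guarantee, about 0.632
print(2 * (1 - omega))          # TwoSidedMatch guarantee, about 0.866
print(1 - np.exp(-0.92))        # OneSidedMatch bound when column sums reach 0.92, about 0.601
```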

A sparse matrix (not necessarily square) can be permuted into a block upper triangular form using the canonical Dulmage-Mendelsohn (DM) decomposition [77]

A = \begin{pmatrix} H & \ast & \ast \\ O & S & \ast \\ O & O & V \end{pmatrix}, \qquad
S = \begin{pmatrix} S_1 & \ast \\ O & S_2 \end{pmatrix},

where H (horizontal) has more columns than rows and has a matching covering all rows, S is square and has a perfect matching, and V (vertical) has more rows than columns and a matching covering all columns. The following facts about the DM decomposition are well known [192, 194]. Any of these three blocks can be void. If H is not connected, then it is block diagonal with horizontal blocks. If V is not connected, then it is block diagonal with vertical blocks. If S does not have total support, then it is in the block upper triangular form shown on the right, where S_1 and S_2 have the same structure recursively, until each block S_i has total support. The entries in the blocks shown by "∗" cannot be put into a maximum cardinality matching. When the presented scaling methods are applied to a matrix, the entries in the "∗" blocks will tend to zero (the case of S is well documented [202]). Furthermore, the row sums of the blocks of H will be a multiple of the column sums in the same block; a similar statement holds for V; finally, S will be doubly stochastic. That is, the scaling algorithms applied to bipartite graphs without perfect matchings will zero out the entries in the irrelevant parts and identify the entries that can be put into a maximum cardinality matching.

4.2.3 Experiments

We present a selected set of results from the original paper [76]. We used a shared-memory parallel version of the Sinkhorn-Knopp algorithm with a fixed number of iterations. This way, the scaling algorithm is simplified, as no convergence check is required. Furthermore, its overhead is bounded. In most of the experiments, we use 0, 1, 5, or 10 scaling iterations, where the case of 0 corresponds to applying the matching heuristics with uniform edge selection probabilities. While these do not guarantee convergence of the Sinkhorn-Knopp scaling algorithm, this seems to be enough for our purposes.

Matching quality

We investigate the matching quality of the proposed heuristics on all square matrices with support from the SuiteSparse Matrix Collection [59] having at least 1000 non-empty rows/columns and at most 20,000,000 nonzeros (some of the matrices are also in the 10th DIMACS challenge [19]). There were 742 matrices satisfying these properties at the time of experimentation. With the OneSidedMatch heuristic, the quality guarantee of 0.632 was surpassed with 10 iterations of the scaling method in 729 matrices. With the TwoSidedMatch heuristic and the same number of iterations of the scaling method, the quality guarantee of 0.866 was surpassed in 705 matrices. Making 10 more scaling iterations smoothed out the remaining instances. The Greedy heuristic (the second variant discussed in Section 4.2.1) obtained on average (both the geometric and arithmetic means) matchings of size 0.93 of the maximum cardinality. In the worst case, 0.82 was observed.

We collect a few of the matrices from the SuiteSparse collection in Table 4.1. Figures 4.2a and 4.2b show the qualities on our test matrices. In the figures, the first columns represent the case where the neighbors are picked from a uniform distribution over the adjacency lists, i.e., the case with no scaling, hence no guarantee. The quality guarantees are achieved with only 5 scaling iterations. Even with a single iteration, the quality of TwoSidedMatch is more than 0.866 for all matrices, and that of OneSidedMatch is more than 0.632 for all matrices.

Name             n          # of edges   Avg. deg.    stdA     stdB   sprank/n
atmosmodl        1489752     10319760       6.9        0.3      0.3     1.00
audikw_1          943695     77651847      82.2       42.5     42.5     1.00
cage15           5154859     99199551      19.2        5.7      5.7     1.00
channel          4802000     85362744      17.8        1.0      1.0     1.00
europe_osm      50912018    108109320       2.1        0.5      0.5     0.99
Hamrle3          1447360      5514242       3.8        1.6      2.1     1.00
hugebubbles     21198119     63580358       3.0        0.03     0.03    1.00
kkt_power        2063494     12771361       6.2        7.45     7.45    1.00
nlpkkt240       27993600    760648352      26.7        2.22     2.22    1.00
road_usa        23947347     57708624       2.4        0.93     0.93    0.95
torso1            116158      8516500      73.3      419.59   245.44    1.00
venturiLevel3    4026819     16108474       4.0        0.12     0.12    1.00

Table 4.1: Properties of the matrices used in the comparisons. In the table, sprank is the maximum matching cardinality, Avg. deg. is the average vertex degree, and stdA and stdB are the standard deviations of A and B vertex (row and column) degrees, respectively.

Comparison of TwoSidedMatch with Karp-Sipser. Next, we analyze the performance of the proposed heuristics with respect to Karp-Sipser on a matrix class which we designed as a challenging case for Karp-Sipser. Let A be an n × n matrix, R1 be the set of A's first n/2 rows, and R2 be the set of A's last n/2 rows; similarly, let C1 and C2 be the sets of the first n/2 and the last n/2 columns. As Figure 4.3 shows, A has a full R1 × C1 block and an empty R2 × C2 block. The last h ≪ n rows and columns of R1 and C1, respectively, are full. Each of the blocks R1 × C2 and R2 × C1 has a nonzero diagonal. Those diagonals form a perfect matching when combined. In the sequel, a matrix whose corresponding bipartite graph has a perfect matching will be called full-sprank, and sprank-deficient otherwise.

When h ≤ 1, the Karp-Sipser heuristic consumes the whole graph during Phase 1 and finds a maximum cardinality matching. When h > 1, Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block R1 × C1 is full, it is more likely that the nonzero will be chosen from this block. Thus, a row in R1 will be matched with a column in C1, which is a bad decision, since the block R2 × C2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that TwoSidedMatch chooses an edge from that block goes to zero, as those entries cannot be in a perfect matching.
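A small sketch of how such a hard instance can be generated; the dense NumPy construction, the function name, and the assumption that n is even are ours, for illustration only.

```python
import numpy as np

def hard_bipartite_instance(n, h):
    """0/1 matrix of Figure 4.3: full R1 x C1 block, empty R2 x C2 block,
    nonzero diagonals in R1 x C2 and R2 x C1 (together a perfect matching),
    and the last h rows/columns of R1/C1 completely full."""
    assert n % 2 == 0
    half = n // 2
    A = np.zeros((n, n), dtype=int)
    A[:half, :half] = 1                            # full R1 x C1 block
    A[:half, half:][np.diag_indices(half)] = 1     # diagonal of R1 x C2
    A[half:, :half][np.diag_indices(half)] = 1     # diagonal of R2 x C1
    A[half - h:half, :] = 1                        # last h rows of R1 are full
    A[:, half - h:half] = 1                        # last h columns of C1 are full
    return A

A = hard_bipartite_instance(3200, 8)
```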


Figure 4.2: Matching qualities of (a) OneSidedMatch and (b) TwoSidedMatch on the test matrices of Table 4.1. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations (0, 1, 5).

Figure 4.3: Adjacency matrices of hard bipartite graph instances for Karp-Sipser (blocks R1, R2 by C1, C2; h marks the full rows and columns).


                 Results with different numbers of scaling iterations
        Karp-       0           1                  5                  10
  h     Sipser    Qual.     Err.    Qual.     Err.    Qual.      Err.    Qual.
  2     0.782     0.522    1.3853   0.557    0.3463   0.989     0.0578   0.999
  4     0.704     0.489    1.1257   0.516    0.3856   0.980     0.0604   0.997
  8     0.707     0.466    0.8653   0.487    0.4345   0.946     0.0648   0.996
 16     0.685     0.448    0.6373   0.458    0.4683   0.885     0.0725   0.990
 32     0.670     0.447    0.4555   0.453    0.4428   0.748     0.0867   0.980

Table 4.2: Quality comparison (minimum of 10 executions for each instance) of the Karp-Sipser heuristic and TwoSidedMatch on matrices described in Figure 4.3 with n = 3200 and h ∈ {2, 4, 8, 16, 32}.

The results of the experiments are given in Table 4.2. The first column shows the h value. Then, the matching quality obtained by Karp-Sipser and by TwoSidedMatch with different numbers of scaling iterations (0, 1, 5, 10), as well as the scaling error, are given. The scaling error is the maximum difference between 1 and each row/column sum of the scaled matrix (for 0 iterations, it is equal to n − 1 for all cases). The quality of a matching is computed by dividing its cardinality by the maximum one, which is n = 3200 for these experiments. To obtain the values in each cell of the table, we ran the programs 10 times and give the minimum quality (as we are investigating the worst-case behavior). The highest variances for Karp-Sipser and TwoSidedMatch were (up to four significant digits) 0.0041 and 0.0001, respectively. As expected, when h increases, Karp-Sipser performs worse, and the matching quality drops to 0.67 for h = 32. TwoSidedMatch's performance increases with the number of scaling iterations. As the experiment shows, only 5 scaling iterations are sufficient to make the proposed two-sided matching heuristic significantly better than Karp-Sipser. However, this number was not enough to reach 0.866 for the matrix with h = 32. On this matrix, with 10 iterations, only 2% of the rows/columns remain unmatched.

Matching quality on bipartite graphs without perfect matchings. We analyze the proposed heuristics on a class of random sprank-deficient square (n = 100,000) and rectangular (m = 100,000 and n = 120,000) matrices with a uniform nonzero distribution (two more sprank-deficient matrices are used in the scalability tests as well). These matrices are generated by MATLAB's sprand command (generating Erdős–Rényi random matrices [85]). The total number of nonzeros is set to be around d × 100,000 for d ∈ {2, 3, 4, 5}. The top half of Table 4.3 presents the results of this experiment with square matrices, and the bottom half presents the results with the rectangular ones.
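A rough Python analogue of this generation process, using scipy.sparse.random in place of MATLAB's sprand; this is an illustrative sketch and not the script used for the experiments.

```python
import scipy.sparse as sp

def random_test_matrix(m, n, d, seed=0):
    """Erdos-Renyi-style random sparse 0/1 pattern with about d nonzeros per
    row on average (so roughly d*m nonzeros in total)."""
    A = sp.random(m, n, density=d / n, format='csr', random_state=seed)
    A.data[:] = 1.0            # keep only the sparsity pattern
    return A

A_square = random_test_matrix(100000, 100000, d=3)
A_rect = random_test_matrix(100000, 120000, d=3)
```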

As in the previous experiments, the matching qualities in the table are the minimum of 10 executions for the corresponding instances.


                            Square 100,000 × 100,000
          OneSided  TwoSided                    OneSided  TwoSided
 d  iter  sprank   Match     Match    d  iter  sprank    Match     Match
 2    0   78225    0.770     0.912    4    0   97787     0.644     0.838
 2    1   78225    0.797     0.917    4    1   97787     0.673     0.848
 2    5   78225    0.850     0.939    4    5   97787     0.719     0.873
 2   10   78225    0.879     0.954    4   10   97787     0.740     0.886
 3    0   92786    0.673     0.851    5    0   99223     0.635     0.840
 3    1   92786    0.703     0.857    5    1   99223     0.662     0.851
 3    5   92786    0.756     0.884    5    5   99223     0.701     0.873
 3   10   92786    0.784     0.902    5   10   99223     0.716     0.882

                          Rectangular 100,000 × 120,000
 2    0   87373    0.793     0.912    4    0   99115     0.729     0.899
 2    1   87373    0.815     0.918    4    1   99115     0.754     0.910
 2    5   87373    0.861     0.939    4    5   99115     0.792     0.933
 2   10   87373    0.886     0.955    4   10   99115     0.811     0.946
 3    0   96564    0.739     0.896    5    0   99761     0.725     0.905
 3    1   96564    0.769     0.904    5    1   99761     0.749     0.917
 3    5   96564    0.813     0.930    5    5   99761     0.781     0.936
 3   10   96564    0.836     0.945    5   10   99761     0.792     0.943

Table 4.3: Matching qualities of the proposed heuristics on random sparse matrices with a uniform nonzero distribution. Square matrices in the top half, rectangular matrices in the bottom half; d is the average number of nonzeros per row.

As Table 4.3 shows, when the deficiency is high (correlated with small d), it is easier for our algorithms to approximate the maximum cardinality. However, when d gets larger, the algorithms require more scaling iterations. Even in this case, 5 iterations are sufficient to achieve the guaranteed qualities. In the square case, the minimum qualities achieved by OneSidedMatch and TwoSidedMatch were 0.701 and 0.873. In the rectangular case, the minimum qualities achieved by OneSidedMatch and TwoSidedMatch were 0.781 and 0.930, respectively, with 5 scaling iterations. In all cases, increasing the number of scaling iterations results in higher quality matchings, in accordance with the previous results on square matrices with perfect matchings.

Comparisons with some other codes

We run Azad et al.'s [16] variant of Karp-Sipser (ksV), the two greedy matching heuristics of Blelloch et al. [29] (inc and nd, which are referred to as "incremental" and "non-deterministic reservation" in the original paper), and the proposed OneSidedMatch and TwoSidedMatch heuristics with one and 16 threads on the matrices given in Table 4.1. We also add one more matrix, wc_50K_64, from the family shown in Figure 4.3, with n = 50,000 and h = 64.


Run time in seconds
                          One thread                            16 threads
Name            ksV     inc      nd    1SD    2SD      ksV     inc     nd    1SD   2SD
atmosmodl      0.20    14.80    0.07   0.12   0.30     0.03    1.63   0.01   0.01  0.03
audikw_1       0.98   206.00    0.22   0.55   0.82     0.10   16.90   0.02   0.06  0.09
cage15         1.54     9.96    0.39   0.92   1.95     0.29    1.10   0.04   0.09  0.19
channel        1.29   105.00    0.30   0.82   1.46     0.14    9.07   0.03   0.08  0.13
europe_osm    12.03    51.80    2.06   4.89  11.97     1.20    7.12   0.21   0.46  1.05
Hamrle3        0.15     0.12    0.04   0.09   0.25     0.02    0.02   0.01   0.01  0.02
hugebubbles    8.36     4.65    1.28   3.92  10.04     0.95    0.53   0.18   0.37  0.94
kkt_power      0.55     0.68    0.09   0.19   0.45     0.08    0.16   0.01   0.02  0.05
nlpkkt240     10.88  1900.00    2.56   5.56  10.11     1.15  237.00   0.23   0.58  1.06
road_usa       6.23     3.57    0.96   2.16   5.42     0.70    0.38   0.09   0.22  0.50
torso1         0.11     7.28    0.02   0.06   0.09     0.02    1.20   0.00   0.01  0.01
venturiLevel3  0.45     2.72    0.13   0.32   0.84     0.08    0.29   0.01   0.04  0.09
wc_50K_64      7.21  3200.00    1.46   4.39   5.80     1.17  402.00   0.12   0.55  0.59

Quality of the obtained matching
Name            ksV    inc     nd    1SD    2SD      ksV    inc     nd    1SD   2SD
atmosmodl      1.00   1.00   1.00   0.66   0.87     0.98   1.00   0.97   0.66  0.87
audikw_1       1.00   1.00   1.00   0.64   0.87     0.95   1.00   0.97   0.64  0.87
cage15         1.00   1.00   1.00   0.64   0.87     0.95   1.00   0.91   0.64  0.87
channel        1.00   1.00   1.00   0.64   0.87     0.99   1.00   0.99   0.64  0.87
europe_osm     0.98   0.95   0.95   0.75   0.89     0.98   0.95   0.94   0.75  0.89
Hamrle3        1.00   0.84   0.84   0.75   0.90     0.97   0.84   0.86   0.75  0.90
hugebubbles    0.99   0.96   0.96   0.70   0.89     0.96   0.96   0.90   0.70  0.89
kkt_power      0.85   0.79   0.79   0.78   0.89     0.87   0.79   0.86   0.78  0.89
nlpkkt240      0.99   0.99   0.99   0.64   0.86     0.99   0.99   0.99   0.64  0.86
road_usa       0.94   0.81   0.81   0.76   0.88     0.94   0.81   0.81   0.76  0.88
torso1         1.00   1.00   1.00   0.65   0.88     0.98   1.00   0.98   0.65  0.87
venturiLevel3  1.00   1.00   1.00   0.68   0.88     0.97   1.00   0.96   0.68  0.88
wc_50K_64      0.50   0.50   0.50   0.81   0.98     0.70   0.50   0.51   0.81  0.98

Table 4.4: Run time and quality of different heuristics. OneSidedMatch and TwoSidedMatch are labeled with 1SD and 2SD, respectively, and apply two scaling iterations. The heuristic ksV is from Azad et al. [16], and the heuristics inc and nd are from Blelloch et al. [29].

The OneSidedMatch and TwoSidedMatch heuristics are run with two scaling iterations, and the reported run time includes all components of the heuristics. The inc and nd heuristics are compiled with g++ 4.4.5, and ksV is compiled with gcc 4.4.5. For all the heuristics, we used the -O2 optimization and the appropriate OpenMP flag for compilation. The experiments were carried out on a machine equipped with two Intel Sandybridge-EP CPUs clocked at 2.00 GHz and 256 GB of memory split across the two NUMA domains. Each CPU has eight cores (16 cores in total), and HyperThreading is enabled. Each core has its own 32 KB L1 cache and 256 KB L2 cache. The 8 cores on a CPU share a 20 MB L3 cache. The machine runs 64-bit Debian with Linux 2.6.39-bpo.2-amd64. The run times of all heuristics (in seconds) and their matching qualities are given in Table 4.4.


As seen in Table 4.4, nd is the fastest of the tested heuristics with both one thread and 16 threads on this data set. The second fastest heuristic is OneSidedMatch. The heuristic ksV is faster than TwoSidedMatch on six instances with one thread. With 16 threads, TwoSidedMatch is almost always faster than ksV; both heuristics are quite efficient and have a maximum run time of 1.17 seconds. The heuristic inc was observed to have large run times on some matrices with relatively many nonzeros per row. All the existing heuristics are always very efficient, except inc, which had difficulties on some matrices in the data set. In the original reference [29], inc is quite efficient for some other matrices (where the codes are compiled with Cilk). We do not see run times reported for nd elsewhere, but in our data set it was always better than inc. The proposed OneSidedMatch heuristic's quality is almost always close to its theoretical limit. TwoSidedMatch also obtains quality results close to its theoretical limit. Looking at the quality of the matchings with one thread, we see that ksV obtains (almost always) the best score, except on kkt_power and wc_50K_64, where TwoSidedMatch obtains the best score. The heuristics inc and nd obtain the same score with one thread, but with 16 threads inc obtains better quality than nd almost always. OneSidedMatch never obtains the best score (TwoSidedMatch is always better), yet it is better than ksV, inc, and nd on the synthetic wc_50K_64 matrix. The TwoSidedMatch heuristic obtains better results than inc and nd on Hamrle3, kkt_power, road_usa, and wc_50K_64 with both one thread and 16 threads. Also, on hugebubbles with 16 threads, the difference between TwoSidedMatch and nd is 1% (in favor of nd).

4.3 Approximation algorithms for general undirected graphs

We extend the results on bipartite graphs of the previous section to general undirected graphs. We again select a subgraph randomly with a probability density function obtained by scaling the adjacency matrix of the input graph, and run Karp-Sipser [128] on the selected subgraph. Again, Karp-Sipser becomes an exact algorithm on the selected subgraph, and the analysis reveals that the approximation ratio is around 0.866 of the maximum cardinality in the original graph. We also propose two other variants of this algorithm, which obtain better results both in theory and in practice. We omit most of the proofs in the presentation; they can be found in the associated paper [73].

Recall that a graph G' = (V', E') is a subgraph of G = (V, E) if V' ⊆ V and E' ⊆ E. Let G_{k-out} = (V, E') be a random subgraph of G, where each vertex in V randomly chooses k of its edges, with repetition (an edge {u, v} is included in G_{k-out} only once, even if it is chosen multiple times). We call G_{k-out} a k-out subgraph of G.
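As an illustration of this definition, here is a small Python sketch that builds a random k-out subgraph. It uses uniform choices over the adjacency lists; the One-Out heuristic presented below replaces the uniform choice with probabilities derived from the scaled adjacency matrix. The function name and representation are ours.

```python
import random

def k_out_subgraph(adj, k, seed=0):
    """Random k-out subgraph: every vertex picks k of its incident edges
    (with repetition); an edge is kept if either endpoint picked it."""
    rng = random.Random(seed)
    edges = set()
    for u, neighbors in enumerate(adj):
        for _ in range(k):
            v = rng.choice(neighbors)              # uniform pick over adj(u)
            edges.add((min(u, v), max(u, v)))      # store each undirected edge once
    return edges

# adj[u] lists the neighbors of vertex u
example_adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
print(k_out_subgraph(example_adj, k=1))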

A permutation of [1, . . . , n] in which no element stays in its original position is called a derangement. The total number of derangements of [1, . . . , n] is the nearest integer to n!/e [36, pp. 40–42], where e is the base of the natural logarithm.
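This fact is easy to check for small n; a quick sketch:

```python
from itertools import permutations
from math import factorial, e

def num_derangements(n):
    # count permutations of 0..n-1 with no fixed point
    return sum(all(p[i] != i for i in range(n)) for p in permutations(range(n)))

for n in range(1, 8):
    assert num_derangements(n) == round(factorial(n) / e)
```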

4.3.1 Related work

The first polynomial time algorithm, with O(n^2 τ) complexity, for the maximum cardinality matching problem on general graphs with n vertices and τ edges was proposed by Edmonds [79]. Currently, the fastest algorithms have O(√n τ) complexity [30, 92, 174]. The first of these algorithms is by Micali and Vazirani [174]; recently, a simpler analysis of it was given by Vazirani [210]. Later, Blum [30] and Gabow and Tarjan [92] proposed algorithms with the same complexity.

Random graphs have been frequently analyzed in terms of their maximum matching cardinality. Frieze [90] shows results for undirected (general) graphs along the lines of Walkup's results for bipartite graphs [211]. In particular, he shows that if each vertex chooses uniformly at random one neighbor (among all vertices), then the constructed graph does not have a perfect matching almost always. In other words, he shows that random 1-out subgraphs of a complete graph do not have perfect matchings. Luckily, as in bipartite graphs, if each vertex chooses two neighbors, then the constructed graph has a perfect matching almost always. That is, random 2-out subgraphs of a complete graph have perfect matchings. Our contributions are about making these statements valid for any host graph (assuming the adjacency matrix has total support) and measuring the approximation guarantee of the 1-out case.

Randomized algorithms which check the existence of a perfect matching and generate perfect/maximum matchings have been proposed in the literature [45, 120, 165, 177, 178]. Lovász showed that the existence of a perfect matching can be verified in randomized O(n^ω) time, where O(n^ω) is the time complexity of the fastest matrix multiplication algorithm available [165]. More information, such as the set of allowed edges, i.e., the edges in some maximum matching of a graph, can also be found with the same randomized complexity, as shown by Cheriyan [45].

To construct a maximum cardinality matching in a general, non-bipartite graph, a simple, easy to implement algorithm with O(n^{ω+1}) randomized complexity was proposed by Rabin and Vazirani [197]. The algorithm uses matrix inversion as a subroutine. Later, Mucha and Sankowski proposed the first algorithm for the same problem with O(n^ω) randomized complexity [178]. This algorithm uses expensive subroutines, such as Gaussian elimination and equivalence classes formed based on the edges that appear in some perfect matching [166]. A simpler randomized algorithm with the same complexity was proposed by Harvey [108].


4.3.2 One-Out, its analysis, and variants

We first present the main heuristic and then analyze its approximation guarantee. While the heuristic is a straightforward adaptation of its counterpart for bipartite graphs [76], the analysis is more complicated because of odd cycles. The analysis shows that random 1-out subgraphs of a given graph have a maximum cardinality of a matching around 0.866 − log(n)/n of the best possible; the observed performance (see Section 4.3.3) is higher. Then, two variants of the main heuristic are discussed without proofs. They extend the 1-out subgraphs obtained by the first one and hence deliver better results theoretically.

The main heuristic One-Out

The heuristic, shown in Algorithm 9, first scales the adjacency matrix of a given undirected graph to be doubly stochastic. Based on the values in the matrix, the heuristic then randomly marks one of the neighbors of each vertex as chosen. This defines one marked edge per vertex. Then, the subgraph of the original graph containing only the marked edges is formed, yielding an undirected graph with at most n edges, each vertex having at least one edge incident on it. The Karp-Sipser heuristic is run on this subgraph and finds a maximum cardinality matching on it, as the resulting 1-out graph has unicyclic components. The approximation guarantee of the proposed heuristic is analyzed for this step. One practical improvement over the previous work is to make sure that the matching obtained is a maximal one by running Karp-Sipser on the vertices that are not matched. This brings a large improvement to the cardinality of the matching in practice.

Let us classify the edges incident on a vertex v_i as in-edges (from those neighbors that have chosen i at Line 4) and an out-edge (to the neighbor chosen by i at Line 4). Dufossé et al. [76, Lemma 3] show that during any execution of Karp-Sipser it suffices to consider the vertices whose in-edges are unavailable but whose out-edges are available as degree-1 vertices.

The analysis traces an execution of Karp-Sipser on the subgraph G_{1-out}. In other words, the analysis can also be perceived as an analysis of the Karp-Sipser heuristic's performance on random 1-out graphs. Let A1 be the set of vertices not chosen by any other vertex at Line 4 of Algorithm 9. These vertices have in-degree zero and out-degree one, and hence can be processed by Karp-Sipser. Let B1 be the set of vertices chosen by the vertices in A1. The vertices in B1 can be perfectly matched with vertices in A1, leaving some A1 vertices not matched and creating some new in-degree-0 vertices. We can proceed to define A2 as the set of vertices that have in-degree 0 in V \ (A1 ∪ B1), define B2 as those chosen by A2, and so on so forth. Formally, let B0 be an empty set, define A_i to be the set of vertices with in-degree 0 in V \ B_{i−1}, and let B_i be the vertices chosen by those in A_i,


Algorithm 9: One-Out (undirected graphs)

Data: G = (V, E) and its adjacency matrix A
Result: match[·], the matching

1: D ← SymScale(A)
2: for i = 1 to n in parallel do
3:     Pick a random j ∈ A_{i*} by using the probability density function s_{ik} / Σ_{t∈A_{i*}} s_{it} for all k ∈ A_{i*}, where s_{ik} = d[i] × d[k] is the corresponding entry in the scaled matrix S = DAD
4:     Mark that i chooses j
5: Construct a graph G_{1-out} = (V, E'), where V = {1, . . . , n} and E' = {(i, j) : i chose j or j chose i}
6: match ← KarpSipserOne-Out(G_{1-out})
7: Make match a maximal matching

for i ≥ 1. Notice that A_i ⊆ A_{i+1} and B_i ⊆ B_{i+1}. The Karp-Sipser heuristic can process A1, then A2 \ A1, and so on, until the remaining graph has cycles only. The sets A_i and B_i and their cardinalities are at the core of our analysis. We first present some facts about these sets and their cardinalities, and describe an implementation of Karp-Sipser instrumented to highlight them.

Lemma 4.5. With the definitions above, A_i ∩ B_i = ∅.

Proof. We prove this by induction. For i = 1, it clearly holds. Assume that it holds for all i < ℓ. Suppose there exists a vertex u ∈ A_ℓ ∩ B_ℓ. Because A_{ℓ−1} ∩ B_{ℓ−1} = ∅, u must necessarily belong to both A_ℓ \ A_{ℓ−1} and B_ℓ \ B_{ℓ−1}. For u to be in B_ℓ \ B_{ℓ−1}, there must exist at least one vertex v ∈ A_ℓ \ A_{ℓ−1} such that v chooses u. However, the condition for u ∈ A_ℓ is that no vertex in V \ (A_{ℓ−1} ∪ B_{ℓ−1}) has selected it. This is a contradiction, and the intersection A_ℓ ∩ B_ℓ should be empty.

Corollary 4.1. A_i ∩ B_j = ∅ for i ≤ j.

Proof. Assume A_i ∩ B_j ≠ ∅. Since A_i ⊆ A_j, we have a contradiction, as A_j ∩ B_j = ∅ by Lemma 4.5.

Thanks to Lemma 4.5 and Corollary 4.1, the sets A_i and B_i are disjoint, and they form a bipartite subgraph of G_{1-out} for all i = 1, . . . , ℓ.

The proposed version Karp-SipserOne-Out of the Karp-Sipser heuristic for 1-out graphs is shown in Algorithm 10. The degree-1 vertices are kept in a first-in first-out priority queue Q. The queue is first initialized with A1, and a marker character is used to mark the end of A1. Then, all vertices in A1 are matched to some other vertices, defining B1. When we remove two matched vertices from the graph G_{1-out} at Lines 29 and 40, we update the degrees of their remaining neighbors and append the vertices whose degrees become one to the queue. During Phase 1 of Karp-SipserOne-Out, we also maintain the sets of A_i and B_i vertices, while storing only the last ones; A_ℓ and B_ℓ are returned along with the number ℓ of levels, which is computed thanks to the use of the marker.

Apart from the scaling step, the proposed heuristic in Algorithm 9 has a linear worst-case time complexity of O(n + τ). The scaling step, if applied until convergence, can take more time than that. We do not suggest running it until convergence; for all practical purposes, 5 or 10 iterations of the basic method [141], or even fewer of the Newton iterations [140], seem sufficient (see the experiments). Therefore, the practical run time of the algorithm is linear.

Analysis of One-Out

Let a_i and b_i be two random variables representing the cardinalities of A_i and B_i, respectively, in an execution of Karp-SipserOne-Out on a random 1-out graph. Then, Karp-SipserOne-Out matches b_ℓ edges in the first phase and leaves a_ℓ − b_ℓ vertices unmatched. What remains after the first phase is a set of cycles. In the bipartite graph case [76], all vertices in those cycles are matchable, and hence the cardinality of the matching was measured by n − a_ℓ + b_ℓ. Since we can possibly have odd cycles after the first phase, we cannot match all remaining vertices in the general case of undirected graphs. Let c be a random variable representing the number of odd cycles after the first phase of Karp-SipserOne-Out. Then, we have the following (obvious) lemma about the approximation guarantee of Algorithm 9.

Lemma 4.6. At the end of the execution, the number of unmatched vertices is a_ℓ − b_ℓ + c. Hence, Algorithm 9 matches at least n − (a_ℓ − b_ℓ + c) vertices.

We need to quantify a_ℓ − b_ℓ and c in Lemma 4.6. We state an upper bound on a_ℓ − b_ℓ in Lemma 4.7 and an upper bound on c in Lemma 4.8, without any proof (the reader can find the proofs in the original paper [73]). While the bound for a_ℓ − b_ℓ holds for any graph with total support presented to Algorithm 9, the bounds for c are shown for random 1-out graphs of complete graphs. By plugging in the bounds for these quantities, we obtain the following theorem on the approximation guarantee of Algorithm 9.

Theorem 4.3. Algorithm 9 obtains a matching with cardinality at least 0.866 − ⌈1.04 log(0.336n)⌉/n in expectation when the input is a complete graph.


Algorithm 10: Karp-SipserOne-Out (undirected graphs)

Data: G_{1-out} = (V, E)
Result: match[·], the mates of vertices
Result: ℓ, the number of levels in the first phase
Result: A, the set of degree-1 vertices in the first phase
Result: B, the set of vertices matched to A vertices

 1: match[u] ← NIL for all u
 2: Q ← {v : deg(v) = 1}                          ▷ degree-1, or in-degree 0
 3: if Q = ∅ then
 4:     ℓ ← 0                                     ▷ no vertex in level 1
 5: else
 6:     ℓ ← 1
 7: Enqueue(Q, mark)                              ▷ mark denotes the end of the first level
 8: Phase-1 ← ongoing
 9: A ← B ← ∅
10: while true do
11:     while Q ≠ ∅ do
12:         u ← Dequeue(Q)
13:         if u = mark and Q = ∅ then
14:             break the while-Q-loop
15:         else if u = mark then
16:             ℓ ← ℓ + 1
17:             Enqueue(Q, mark)                  ▷ a new level is formed
18:             skip to the next while-Q-iteration
19:         if match[u] ≠ NIL then
20:             skip to the next while-Q-iteration
21:         for v ∈ adj(u) do
22:             if match[v] = NIL then
23:                 match[u] ← v
24:                 match[v] ← u
25:                 if Phase-1 = ongoing then
26:                     A ← A ∪ {u}
27:                     B ← B ∪ {v}
28:                 N ← adj(v)
29:                 G_{1-out} ← G_{1-out} \ {u, v}
30:                 Enqueue(Q, w) for all w ∈ N with deg(w) = 1
31:                 break the for-v-loop
32:         if Phase-1 = ongoing and match[u] = NIL then
33:             A ← A ∪ {u}                       ▷ u cannot be matched
34:     Phase-1 ← done
35:     if E ≠ ∅ then
36:         pick a random edge (u, v)
37:         match[u] ← v
38:         match[v] ← u
39:         N ← adj({u, v})
40:         G_{1-out} ← G_{1-out} \ {u, v}
41:         Enqueue(Q, w) for all w ∈ N with deg(w) = 1
42:     else
43:         break the while-true loop
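For concreteness, the following is a compact sequential Python sketch of the same process, omitting the level bookkeeping (ℓ, A, B) that is only needed for the analysis. The adjacency-set representation and the function name are our illustrative assumptions, not the implementation used in the original work.

```python
import random
from collections import deque

def karp_sipser_1out(adj, seed=0):
    """Sequential Karp-Sipser on a 1-out graph given as adjacency sets.
    Phase 1 repeatedly matches a degree-1 vertex to its only neighbor;
    Phase 2 matches random edges on the remaining (cycle) components."""
    rng = random.Random(seed)
    adj = [set(neighbors) for neighbors in adj]   # local copy, modified in place
    match = [None] * len(adj)

    def remove_vertex(u, queue):
        for w in adj[u]:
            adj[w].discard(u)
            if len(adj[w]) == 1 and match[w] is None:
                queue.append(w)                   # w became a degree-1 vertex
        adj[u] = set()

    queue = deque(u for u in range(len(adj)) if len(adj[u]) == 1)
    while True:
        while queue:                              # Phase 1: degree-1 rule
            u = queue.popleft()
            if match[u] is not None or not adj[u]:
                continue                          # already matched or isolated
            v = next(iter(adj[u]))
            match[u], match[v] = v, u
            remove_vertex(u, queue)
            remove_vertex(v, queue)
        candidates = [u for u in range(len(adj)) if adj[u]]
        if not candidates:                        # no edges left
            break
        u = rng.choice(candidates)                # Phase 2: random edge
        v = next(iter(adj[u]))
        match[u], match[v] = v, u
        remove_vertex(u, queue)
        remove_vertex(v, queue)
    return match
```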


The theorem can also be interpreted more theoretically: a random 1-out graph has a maximum cardinality of a matching at least 0.866 − ⌈1.04 log(0.336n)⌉/n in expectation. The bound is in the close vicinity of 0.866, which was the proved bound for the bipartite case [76]. We note that in deriving this bound we assumed a random 1-out graph (as opposed to a random 1-out subgraph of a given graph) only at a single step. We leave the extension to this latter case as future work, and present experiments suggesting that the bound is also achieved for this case. In Section 4.3.3, we empirically show that the same bound also holds for graphs whose corresponding matrices do not have total support.

In order to measure a_ℓ − b_ℓ, we adapted a proof from earlier work [76], which was inspired by Karp and Sipser's analysis of the first phase of their heuristic [128], and obtained the following lemma.

Lemma 4.7. a_ℓ − b_ℓ ≤ (2Ω − 1)n, where Ω ≈ 0.567 equals W(1) of Lambert's W function.

The lemma also reads as a_ℓ − b_ℓ ≤ Σ_{i=1}^{n} (2 · 0.567 − 1) ≤ 0.134 · n.

In order to achieve the result of Theorem 4.3, we need to bound c, the number of odd cycles that remain in a G_{1-out} graph after the first phase of Karp-SipserOne-Out. We were able to bound c on random 1-out graphs (that is, we did not show it for a general graph).

Lemma 4.8. The expected number c of cycles remaining after the first phase of Karp-SipserOne-Out in random 1-out graphs is less than or equal to ⌈1.04 log(n − a_ℓ − b_ℓ)⌉, which is also less than or equal to ⌈1.04 log(0.336n)⌉.

Variants

Here, we summarize two related theoretical random bipartite graph models that we adapt to the undirected case using similar algorithms. The presentation is brief and without proofs; we will present experiments in Section 4.3.3.

The random (1 + e^{−1})-out undirected graph model (see the bipartite version [125], summarized in Section 4.2.1) lets each vertex choose a random neighbor. Then, the vertices that have not been chosen select one more neighbor. The maximum cardinality of a matching in the subgraph consisting of the identified edges can be computed as an approximate matching in the original graph. As this heuristic uses a richer graph structure than 1-out, we expect perfect or near perfect matchings in the general case.

A model richer in edges is the random 2-out graph model. In this model, each vertex chooses two neighbors. There are two enticing characteristics of this model in the uniform bipartite case. First, the probability of having a perfect matching goes to one with an increasing number of vertices [211]. Second, there is a special algorithm for computing maximum cardinality matchings in these (bipartite) graphs [127] with high probability in linear time in expectation.


4.3.3 Experiments

To understand the effectiveness and efficiency of the proposed heuristics in practice, we report the matching quality and various statistics regarding the sets that appear in our analyses in Section 4.3.2. For the experiments, we used three graph datasets. The first set is generated with matrices from the SuiteSparse Matrix Collection [59]. We investigated all n × n matrices from the collection with 50,000 ≤ n ≤ 100,000. For a matrix from this set, we removed the diagonal entries, symmetrized the pattern of the resulting matrix, and discarded the matrix if it had empty rows. There were 115 matrices at the end, which we used as adjacency matrices. The graphs in the second dataset are synthetically generated to make Karp-Sipser deviate from the optimal matching as much as possible. This dataset contains five graphs with different hardness levels for Karp-Sipser. The third set contains five large real-life matrices from the SuiteSparse collection for measuring the run time efficiency of the proposed heuristics.

A comprehensive evaluation

We use MATLAB to measure the matching qualities on the first two datasets. For each matrix, five runs are performed with each randomized matching heuristic, and the average is reported. One, five, and ten iterations are performed to evaluate the impact of the scaling method.

Table 4.5 summarizes the quality of the matchings for all the experiments on the first dataset. The matching quality is measured as the ratio of the matching cardinality to the maximum cardinality of a matching in the original graph. The table presents statistics for the matching qualities of Karp-Sipser performed on the original graphs (first row), on 1-out graphs (the second set of rows), on Karoński-Pittel-like (1 + e^{−1})-out graphs (the third set of rows), and on 2-out graphs (the last set of rows).

For the U-Max rows, we construct k-out graphs by using uniform probabilities while selecting the neighbors, as proposed in the literature [90, 125]. We compare the cardinality of the maximum matchings in these k-out graphs to the maximum matching cardinality in the original graphs and report the statistics. The rows St-Max report the same statistics for the k-out graphs constructed by using probabilities with t ∈ {1, 5, 10} scaling iterations. These statistics serve as upper bounds on the matching qualities of the proposed St-KS heuristics, which execute Karp-Sipser on the k-out graphs obtained with t scaling iterations. Since the St-KS heuristics use only a subgraph, the matchings they obtain are not maximal with respect to the original edge set. The proposed St-KS+ heuristics exploit this fact and apply another Karp-Sipser phase on the subgraph containing only the unmatched vertices to improve the quality of the matchings. The table does not contain St-Max rows for 1-out graphs, since Karp-Sipser is an optimal algorithm for these subgraphs.


                  Alg        Min     Max     Avg    GMean    Med    StDev

              Karp-Sipser   0.880   1.000   0.980   0.980   0.988   0.022

1-out         U-Max         0.168   1.000   0.846   0.837   0.858   0.091
              S1-KS         0.479   1.000   0.869   0.866   0.863   0.059
              S5-KS         0.839   1.000   0.885   0.884   0.865   0.044
              S10-KS        0.858   1.000   0.889   0.888   0.866   0.045
              S1-KS+        0.836   1.000   0.951   0.950   0.953   0.043
              S5-KS+        0.865   1.000   0.958   0.957   0.968   0.036
              S10-KS+       0.888   1.000   0.961   0.961   0.971   0.033

(1+e^{-1})-out
              U-Max         0.251   1.000   0.952   0.945   0.967   0.081
              S1-Max        0.642   1.000   0.967   0.966   0.980   0.042
              S5-Max        0.918   1.000   0.977   0.977   0.985   0.020
              S10-Max       0.934   1.000   0.980   0.979   0.985   0.018
              S1-KS         0.642   1.000   0.963   0.962   0.972   0.041
              S5-KS         0.918   1.000   0.972   0.972   0.976   0.020
              S10-KS        0.934   1.000   0.975   0.975   0.977   0.018
              S1-KS+        0.857   1.000   0.972   0.972   0.979   0.025
              S5-KS+        0.925   1.000   0.978   0.978   0.984   0.018
              S10-KS+       0.939   1.000   0.980   0.980   0.985   0.016

2-out         U-Max         0.254   1.000   0.972   0.966   0.996   0.079
              S1-Max        0.652   1.000   0.987   0.986   0.999   0.036
              S5-Max        0.952   1.000   0.995   0.995   0.999   0.009
              S10-Max       0.968   1.000   0.996   0.996   1.000   0.007
              S1-KS         0.651   1.000   0.974   0.974   0.981   0.035
              S5-KS         0.945   1.000   0.982   0.982   0.984   0.013
              S10-KS        0.947   1.000   0.984   0.983   0.984   0.012
              S1-KS+        0.860   1.000   0.980   0.979   0.984   0.020
              S5-KS+        0.950   1.000   0.984   0.984   0.987   0.012
              S10-KS+       0.952   1.000   0.986   0.986   0.987   0.011

Table 4.5: For each matrix in the first dataset and each proposed heuristic, five runs are performed. The statistics are computed over the means of these results; the minimum, maximum, arithmetic and geometric averages, median, and standard deviation are reported.


Figure 4.4: The matching qualities, sorted in increasing order, of Karp-Sipser on the original graph, S5-KS, and S5-KS+ on (a) 1-out and (b) 2-out graphs. The figures also contain the quality of the maximum cardinality matchings in these graphs. The experiments are performed on the 115 graphs in the first dataset.


As Table 4.5 shows, more scaling iterations increase the maximum matching cardinalities in k-out graphs. Although this is much clearer when the worst-case graphs are considered, it can also be observed in the arithmetic and geometric means. Since U-Max is the no-scaling case, the impact of the first scaling iteration (S1-KS vs. U-Max) is significant. On the other hand, the difference in matching quality between S5-KS and S10-KS is minor. Hence, five scaling iterations are deemed sufficient for the proposed heuristics in practice.

As the theory suggests, the heuristics St-KS perform well for (1 + e^{−1})-out and 2-out graphs. With t ∈ {5, 10}, their quality is almost on par with Karp-Sipser on the original graph, and even better for 2-out graphs. In addition, applying Karp-Sipser on the subgraph of unmatched vertices to obtain a maximal matching does not increase the matching quality much. Since this subgraph is small, the overhead of this extra work will not be significant. Furthermore, this extra step significantly improves the matching quality for 1-out graphs, which asymptotically almost surely do not have a perfect matching.

To better understand the practical performance of the proposed heuristics and the impact of the additional Karp-Sipser execution, we profile their performance by sorting their matching qualities in increasing order for all 115 matrices. Figure 4.4 plots these profiles on 1-out and 2-out graphs for the heuristics with five scaling iterations. As the first figure shows, five iterations are sufficient to obtain 0.86 matching quality, except in 17 of the 1-out experiments. The figure also shows that the maximum matching cardinality in a random 1-out graph is worse than what Karp-Sipser can obtain on the original graph. This is why, although S5-KS finds maximum matchings on 1-out graphs, its performance is still worse than Karp-Sipser. The additional Karp-Sipser in S5-KS+ closes almost all of this gap and makes the matching qualities close to those of Karp-Sipser. On the contrary, for 2-out graphs generated with five scaling iterations, the maximum matching cardinality is larger than the cardinality of the matchings found by Karp-Sipser. There is still a gap between the best possible (red line) and what Karp-Sipser can find (blue line) on 2-out graphs. We believe that this gap can be targeted by specialized efficient exact matching algorithms for 2-out graphs.

Figure 4.5: The cumulative distribution of the ratio of the number of odd cycles remaining after the first phase of Karp-Sipser in One-Out to the number of vertices in the graph.

There are 30 rank-deficient matrices without total support among the 115 matrices in the first dataset. We observed that, even for 1-out graphs, the worst-case quality for S5-KS is 0.86 and the average is 0.93. Hence, the proposed approach also works well for rank-deficient matrices/graphs in practice.

Since the number c of odd cycles at the end of the first phase of Karp-Sipser is a performance measure, we investigate it. For each matrix, we compute the ratio c/n. We then plot the cumulative distribution of these values in Figure 4.5. For all the experiments except one, this ratio is less than 1%. For the extreme case, the ratio increases to 8%. In that particular case, the matching found in the 1-out subgraph is maximum for the input graph (i.e., the number of odd components is also large in the original graph).

Hard instances for Karp-Sipser

Let A be an n × n symmetric matrix, V1 be the set of A's first n/2 vertices, and V2 be the set of A's last n/2 vertices. As Figure 4.6 shows, A has a full V1 × V1 block, ignoring the diagonal (i.e., a clique), and an empty V2 × V2 block. The last h ≪ n vertices of V1 are connected to all the vertices in the corresponding graph. The block V1 × V2 has a nonzero diagonal; hence, the corresponding graph has a perfect matching. Note that the same instances are used for the bipartite case.

On such a matrix with h = 0, Karp-Sipser consumes the whole graph during Phase 1 and finds a maximum cardinality matching. When h > 1, Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block V1 × V1 forms a clique, it is more likely that the nonzero will be chosen from this block. Thus, a vertex in V1 will be matched with another vertex from V1, which is a bad decision, since the block V2 × V2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that the proposed heuristics choose an edge from that block goes to zero, as those entries cannot be in a perfect matching.


Figure 4.6: Adjacency matrices of hard undirected graph instances for Karp-Sipser (blocks V1, V2; h marks the full rows and columns).

                                     One-Out
                   5 iterations        10 iterations       20 iterations
   h   Karp-Sipser  error  quality      error  quality      error  quality
   2      0.96       7.54    0.99        0.68    0.99        0.22    1.00
   8      0.78       8.52    0.97        0.78    0.99        0.23    0.99
  32      0.68       6.65    0.81        1.09    0.99        0.26    0.99
 128      0.63       3.32    0.53        1.89    0.90        0.33    0.98
 512      0.63       1.24    0.55        1.17    0.59        0.61    0.86

Table 4.6: Results for the hard instances with n = 5000 and different h values. One-Out is executed with 5, 10, and 20 scaling iterations, and the scaling errors are also reported. Averages of five runs are reported in each cell.


Table 4.6 shows that the quality of Karp-Sipser drops to 0.63 as h increases. In comparison, the One-Out heuristic with five scaling iterations maintains a good quality for small h values. However, a better scaling with more iterations (10 and 20 for h = 128 and h = 512, respectively) is required to guarantee the desired matching quality; see the scaling errors in the table.

Experiments on large-scale graphs

These experiments are performed on a machine running 64-bit CentOS 6.5, which has 30 cores, each of which is an Intel Xeon CPU E7-4870 v2 core operating at 2.30 GHz. To choose five large-scale matrices from SuiteSparse, we first sorted the pattern-symmetric matrices in the collection in decreasing order of their nonzero count. We then chose the five largest matrices from different families to increase the variety of experiments. The details of these matrices are given in Table 4.7. This table also contains the run times and matching qualities of the original Karp-Sipser and the proposed One-Out and Two-Out heuristics. The proposed heuristics have five scaling iterations and also apply Karp-Sipser at the end for ensuring maximality.


                                                Karp-Sipser
Matrix               |V|        |E|       Quality     time
cage15                5.2       94.0        1.00       5.82
dielFilterV3real      1.1       88.2        0.99       3.36
hollywood-2009        1.1      112.8        0.93       4.18
nlpkkt240            28.0      746.5        0.98      52.95
rgg_n_2_24_s0        16.8      265.1        0.98      19.49

                  One-Out-S5-KS+                              Two-Out-S5-KS+
                     Execution time (seconds)                    Execution time (seconds)
Matrix  Quality  Scale  1-out    KS    KS+   Total   Quality  Scale  2-out    KS    KS+   Total
cage      0.93    0.67   0.85   0.65   0.05   2.21     0.99    0.67   1.48   1.78   0.01   3.94
diel      0.98    0.52   0.36   0.07   0.01   0.95     0.99    0.52   0.62   0.16   0.00   1.30
holly     0.91    1.12   0.45   0.06   0.02   1.65     0.95    1.12   0.76   0.13   0.01   2.01
nlpk      0.93    4.44   4.99   3.96   0.27  13.66     0.98    4.44   9.43  10.56   0.11  24.54
rgg       0.93    2.33   2.33   2.08   0.17   6.91     0.98    2.33   4.01   6.70   0.11  13.15

Table 4.7: Summary of the results with five large-scale matrices for the original Karp-Sipser and the proposed One-Out and Two-Out heuristics with five scaling iterations and post-processing for maximal matchings. The vertex and edge counts |V| and |E| are in millions, and the run times are in seconds. Matrix names are shortened in the lower half.


The run times of the proposed heuristics are analyzed in four stages in Table 4.7. The Scale stage scales the matrix with five iterations; the k-out stage chooses the neighbors and constructs the k-out subgraph; the KS stage applies Karp-Sipser on the k-out subgraph; and KS+ is the stage for the additional Karp-Sipser at the end. The total time of these four stages is also shown for each instance. The quality results are consistent with the results obtained on the first dataset. As the table shows, the proposed heuristics are faster than the original Karp-Sipser on the input graph. For 1-out graphs, the proposed approach is 2.5–3.9 times faster than Karp-Sipser on the original graph. The speedups are between 1.5 and 2.6 for 2-out graphs with five iterations.

4.4 Summary, further notes, and references

We investigated two heuristics for the bipartite maximum cardinality matching problem. The first one, OneSidedMatch, is shown to have an approximation guarantee of 1 − 1/e ≈ 0.632. The second heuristic, TwoSidedMatch, is shown to have an approximation guarantee of 0.866. Both algorithms use well-known methods to scale the sparse matrix associated with the given bipartite graph to a doubly stochastic form, whose entries are used as the probability density functions to randomly select a subset of the edges of the input graph. OneSidedMatch selects exactly n edges to construct a subgraph in which a matching of the guaranteed cardinality is identified with virtually no overhead, both in sequential and parallel execution. TwoSidedMatch selects around 2n edges and then runs the Karp-Sipser heuristic as an exact algorithm on the selected subgraph to obtain a matching of conjectured cardinality. The subgraphs are analyzed to develop a specialized Karp-Sipser algorithm for efficient parallelization. All theoretical investigations are first performed assuming bipartite graphs with perfect matchings and that the scaling algorithms have converged. Then, theoretical arguments and experimental evidence are provided to extend the results to cover other cases and validate the applicability and practicality of the proposed heuristics in general settings.

The proposed OneSidedMatch and TwoSidedMatch heuristics do not guarantee maximality of the matching found. One can visit the edges of the unmatched row vertices and match them to an available neighbor. We have carried out this greedy post-process and obtained the results shown in Figures 4.7a and 4.7b, in comparison with what was shown in Figures 4.2a and 4.2b. We see that the greedy post-process makes a significant difference for OneSidedMatch: the approximation ratios achieved are well beyond the bound 0.632 and are close to those of TwoSidedMatch. This observation attests to the efficiency of the whole approach.
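The greedy post-process itself is only a few lines; here is a sketch for the bipartite case (the array and function names are illustrative assumptions).

```python
def make_maximal(row_adj, rmatch, cmatch):
    """Greedy post-processing: visit each unmatched row vertex and match it
    to any currently unmatched column neighbor. The result is maximal with
    respect to the original edge set."""
    for r, cols in enumerate(row_adj):
        if rmatch[r] is None:
            for c in cols:
                if cmatch[c] is None:
                    rmatch[r], cmatch[c] = c, r
                    break
    return rmatch, cmatch
```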

We also extended the heuristics and their analysis to approximate maximum cardinality matchings in general (undirected) graphs. Since in the general case we have one class of vertices, the 1-out based approach corresponds to the TwoSidedMatch heuristic of the bipartite case. The theoretical analysis, which can be perceived as an analysis of Karp-Sipser on the random 1-out graph model, showed that the approximation guarantee is slightly less than 0.866. The losses with respect to the bipartite case are due to the existence of odd-length cycles. Our experiments suggest that one of the heuristics obtains results as good as Karp-Sipser, while being faster and more reliable.

A more rigorous treatment and elaboration of the variants of the heuristics for general graphs seem worthwhile. Although Karp-Sipser works well for these graphs, we wonder if there are exact linear time algorithms for (1 + e^{−1})-out and 2-out graphs. We also plan to work on parallel implementations of these algorithms.


Figure 4.7: Matching qualities obtained after making the matchings found by (a) OneSidedMatch and (b) TwoSidedMatch maximal. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations (0, 1, 5). Compare with Figures 4.2a and 4.2b, where maximality of the matchings was not guaranteed.


Chapter 5

Birkhoff-von Neumann decomposition of doubly stochastic matrices

In this chapter, we investigate the Birkhoff-von Neumann (BvN) decomposition described in Section 1.4. The main problem that we address is restated below for convenience.

MinBvNDec: A BvN decomposition with the minimum number of permutation matrices. Given a doubly stochastic matrix A, find a BvN decomposition

A = α_1 P_1 + α_2 P_2 + · · · + α_k P_k     (5.1)

of A with the minimum number k of permutation matrices.

We first show that the MinBvNDec problem is NP-hard. We then propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries, and use this generalization for designing preconditioners for iterative linear system solvers.
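To make the decomposition in (5.1) concrete, here is a sketch of a greedy Birkhoff-type heuristic that repeatedly extracts a permutation matrix from the support of the remaining matrix, uses its smallest selected entry as the coefficient, and subtracts. It relies on SciPy's linear_sum_assignment to pick a permutation avoiding zeros; this is only an illustration of the decomposition, not the specific heuristic proposed later in this chapter, and all names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def greedy_bvn(A, tol=1e-12):
    """Greedy BvN-style decomposition: A ~ sum_i coeffs[i] * perms[i]."""
    A = A.astype(float).copy()
    coeffs, perms = [], []
    while A.max() > tol:
        # maximize the product of selected entries; zero entries get a huge cost
        cost = -np.log(np.where(A > tol, A, 1e-300))
        rows, cols = linear_sum_assignment(cost)
        alpha = A[rows, cols].min()
        if alpha <= tol:               # no permutation fits in the support
            break
        P = np.zeros_like(A)
        P[rows, cols] = 1.0
        coeffs.append(alpha)
        perms.append(P)
        A -= alpha * P                 # the residual stays nonnegative
    return coeffs, perms

# Example on a small doubly stochastic matrix; the coefficients sum to 1.
A = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.25, 0.50],
              [0.25, 0.25, 0.50]])
coeffs, perms = greedy_bvn(A)
print(coeffs, sum(coeffs))
```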

5.1 The minimum number of permutation matrices

We show in this section that the decision version of the problem is NP-complete. We first give some definitions and preliminary results.



5.1.1 Preliminaries

A multi-set can contain duplicate members. Two multi-sets are equivalent if they have the same set of members with the same number of repetitions.

Let A and B be two n × n matrices. We write A ⊆ B to denote that for each nonzero entry of A, the corresponding entry of B is nonzero. In particular, if P is an n × n permutation matrix and A is a nonnegative n × n matrix, P ⊆ A also means that the entries of A at the positions corresponding to the nonzeros of P are positive. We use P ◦ A to denote the entrywise product of P and A, which selects the entries of A at the positions corresponding to the nonzero entries of P. We also use min{P ◦ A} to denote the minimum entry of A at the nonzero positions of P.

Let U be a set of positions of the nonzeros of A. Then, U is called strongly stable [31] if, for each permutation matrix P ⊆ A, p_{kl} = 1 for at most one pair (k, l) ∈ U.

Lemma 5.1 (Brualdi [32]). Let A be a doubly stochastic matrix. Then, in a BvN decomposition of A, there are at least γ(A) permutation matrices, where γ(A) is the maximum cardinality of a strongly stable set of positions of A.

Note that γ(A) is no smaller than the maximum number of nonzeros in a row or a column of A, for any matrix A. Brualdi [32] shows that for any integer t with 1 ≤ t ≤ ⌈n/2⌉⌈(n + 1)/2⌉, there exists an n × n doubly stochastic matrix A such that γ(A) = t.

An n × n circulant matrix C is defined as follows. The first row of C is specified as c_1, . . . , c_n, and the ith row is obtained from the (i − 1)th one by a cyclic rotation to the right, for i = 2, . . . , n:

C = \begin{pmatrix}
c_1 & c_2 & \cdots & c_n \\
c_n & c_1 & \cdots & c_{n-1} \\
\vdots & & \ddots & \vdots \\
c_2 & c_3 & \cdots & c_1
\end{pmatrix}.

We state and prove an obvious lemma to be used later

Lemma 5.2. Let C be an n × n positive circulant matrix whose first row is c_1, . . . , c_n. The matrix C' = (1/Σ_j c_j) C is doubly stochastic, and all BvN decompositions with n permutation matrices have the same multi-set of coefficients {c_i/Σ_j c_j : i = 1, . . . , n}.

Proof. Since the first row of C' has all nonzero entries and only n permutation matrices are permitted, the multi-set of coefficients must be the entries in the first row.


In the lemma, if c_i = c_j for some i ≠ j, we will have the same value c_i/c for two different permutation matrices. As a sample BvN decomposition, let c = Σ_i c_i and consider (1/c) C = Σ_j (c_j/c) D_j, where D_j is the permutation matrix corresponding to the (j − 1)th diagonal, for j = 1, . . . , n: the matrix D_j has 1s at the positions (i, i + j − 1) for i = 1, . . . , n − j + 1 and at the positions (n − j + 1 + k, k) for k = 1, . . . , j − 1, where we assumed that the second set is void for j = 1.
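The diagonal permutation matrices D_j and this sample decomposition can be verified directly; a small sketch (the values in c_row are an arbitrary example):

```python
import numpy as np

def diagonal_permutation(n, j):
    """D_j: the permutation matrix with 1s on the (j-1)th (wrapped) diagonal."""
    D = np.zeros((n, n))
    for i in range(n):
        D[i, (i + j - 1) % n] = 1
    return D

c_row = np.array([3.0, 1.0, 2.0])                 # first row c_1, ..., c_n
n, c = len(c_row), c_row.sum()
C = np.array([[c_row[(j - i) % n] for j in range(n)] for i in range(n)])
B = sum((c_row[j - 1] / c) * diagonal_permutation(n, j) for j in range(1, n + 1))
assert np.allclose(C / c, B)                      # (1/c) C = sum_j (c_j/c) D_j
```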

Note that the lemma is not concerned with the uniqueness of the BvN decomposition, which would need a unique set of permutation matrices as well. If all c_i's were distinct, this would have been true, where the set of D_j's described above would define the unique permutation matrices. Also, a more general variant of the lemma concerns the decompositions whose cardinality k is equal to the maximum cardinality of a strongly stable set. In this case too, the coefficients in the decomposition will correspond to the entries in a strongly stable set of cardinality k.

5.1.2 The computational complexity of MinBvNDec

Here, we prove that the MinBvNDec problem is NP-complete in the strong sense, i.e., it remains NP-complete when the instance is represented in unary. It suffices to do a reduction from a strongly NP-complete problem to prove the strong NP-completeness [22, Section 6.6].

Theorem 5.1. The problem of deciding whether there is a Birkhoff-von Neumann decomposition of a given doubly stochastic matrix with k permutation matrices is strongly NP-complete.

Proof. It is clear that the problem belongs to NP, as it is easy to check in polynomial time that a given decomposition is equal to a given matrix.

To establish NP-completeness, we demonstrate a reduction from the well-known 3-PARTITION problem, which is NP-complete in the strong sense [93, p. 96]. Consider an instance of 3-PARTITION: given an array A of 3m positive integers and a positive integer B such that Σ_{i=1}^{3m} a_i = mB, and for all i it holds that B/4 < a_i < B/2, does there exist a partition of A into m disjoint arrays S_1, . . . , S_m such that each S_i has three elements whose sum is B? Let I_1 denote an instance of 3-PARTITION.

We build the following instance I_2 of MinBvNDec corresponding to I_1 given above. Let

    M = [ (1/m) E_m        O
              O       (1/(mB)) C ] ,

where E_m is an m × m matrix whose entries are 1, and C is a 3m × 3m circulant matrix whose first row is a_1, ..., a_{3m}. It is easy to see that M is doubly stochastic. A solution of I_2 is a BvN decomposition of M with k = 3m permutations. We will show that I_1 has a solution if and only if I_2 has a solution.


Assume that I_1 has a solution with S_1, ..., S_m. Let S_i = {a_{i1}, a_{i2}, a_{i3}} and observe that (1/(mB))(a_{i1} + a_{i2} + a_{i3}) = 1/m. We identify three permutation matrices P_{id}, for d = 1, 2, 3, in C which contain a_{i1}, a_{i2}, and a_{i3}, respectively. We can write (1/m) E_m = ∑_{i=1}^{m} (1/m) D_i, where D_i is the permutation matrix corresponding to the (i−1)th diagonal (described after the proof of Lemma 5.2). We prepend D_i to the three permutation matrices P_{id} for d = 1, 2, 3 and obtain three permutation matrices for M. We associate these three permutation matrices with α_{id} = a_{id}/(mB). Therefore, we can write

    M = ∑_{i=1}^{m} ∑_{d=1}^{3} α_{id} [ D_i   O
                                          O   P_{id} ]

and obtain a BvN decomposition of M with 3m permutation matrices.

Assume now that I_2 has a BvN decomposition with 3m permutation matrices. This also defines two BvN decompositions with 3m permutation matrices, for (1/m) E_m and (1/(mB)) C. We now establish a correspondence between these two decompositions to finish the proof. Since (1/(mB)) C is a circulant matrix with 3m nonzeros in a row, any BvN decomposition of it with 3m permutation matrices has the coefficients a_i/(mB) for i = 1, ..., 3m, by Lemma 5.2. Since a_i + a_j < B, we have (a_i + a_j)/(mB) < 1/m. Therefore, each entry in (1/m) E_m needs to be included in at least three permutation matrices. A total of 3m permutation matrices covers any row of (1/m) E_m, say the first one. Therefore, for each entry in this row we have 1/m = α_i + α_j + α_k, for i ≠ j ≠ k, corresponding to three coefficients used in the BvN decomposition of (1/(mB)) C. Note that these three indices i, j, k defining the three coefficients used for one entry of E_m cannot be used for another entry in the same row. This correspondence defines a partition of the 3m numbers α_i, for i = 1, ..., 3m, into m groups with three elements each, where each group has a sum of 1/m. The corresponding three a_i's in a group therefore sum up to B, and we have a solution to I_1, concluding the proof.

5.2 A result on the polytope of BvN decompositions

Let S(A) be the polytope of all BvN decompositions for a given doubly stochastic matrix A. The extreme points of S(A) are the ones that cannot be represented as a convex combination of other decompositions. Brualdi [32, pp. 197-198] observes that any generalized Birkhoff heuristic obtains an extreme point of S(A) and predicts that there are extreme points of S(A) which cannot be obtained by a generalized Birkhoff heuristic. In this section, we substantiate this claim by showing an example.

Any BvN decomposition of a given matrix A with the smallest number of permutation matrices is an extreme point of S(A); otherwise, the other BvN decompositions expressing the said point would have a smaller number of permutation matrices.

Lemma 5.3. There are doubly stochastic matrices whose polytopes of BvN decompositions contain extreme points that cannot be obtained by a generalized Birkhoff heuristic.

We will prove the lemma by giving an example. We use computational tools based on a mixed integer linear programming (MILP) formulation of the problem of finding a BvN decomposition with the smallest number of permutation matrices. We first describe the MILP formulation.

Let A be a given n × n doubly stochastic matrix, and let Ω_n be the set of all n × n permutation matrices. There are n! matrices in Ω_n; for brevity, let us refer to these permutation matrices as P_1, ..., P_{n!}. We associate an incidence matrix M of size n² × n! with Ω_n. We fix an ordering of the entries of A so that each row of M corresponds to a unique entry of A. Each column of M corresponds to a unique permutation matrix in Ω_n. We set m_{ij} = 1 if the ith entry of A appears in the permutation matrix P_j, and set m_{ij} = 0 otherwise. Let a be the n²-vector containing the values of the entries of A in the fixed order. Let x be a vector of n! elements, x_j corresponding to the permutation matrix P_j. With these definitions, a MILP formulation for finding a BvN decomposition of A with the smallest number of permutation matrices can be written as

    minimize    ∑_{j=1}^{n!} s_j                               (5.2)
    subject to  M x = a ,                                      (5.3)
                1 ≥ x_j ≥ 0        for j = 1, ..., n! ,        (5.4)
                ∑_{j=1}^{n!} x_j = 1 ,                         (5.5)
                s_j ≥ x_j          for j = 1, ..., n! ,        (5.6)
                s_j ∈ {0, 1}       for j = 1, ..., n! .        (5.7)

In this MILP, s_j is a binary variable which is 1 only if x_j > 0, and 0 otherwise. The equality (5.3), the inequalities (5.4), and the equality (5.5) guarantee that we have a BvN decomposition of A. This MILP can be used only for small problems. In this MILP, we can exclude any permutation matrix P_j from a decomposition by setting s_j = 0.
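As an illustration of the formulation (5.2)-(5.7), the following sketch builds the MILP for a small matrix by enumerating all n! permutation matrices. It uses the PuLP modeling library as an assumed stand-in for the CPLEX/NEOS setup used below; only very small n are feasible, and the function name is ours.

    import itertools
    import numpy as np
    import pulp

    def min_bvn_milp(A):
        # A: a small n x n doubly stochastic matrix (numpy array).
        n = A.shape[0]
        perms = list(itertools.permutations(range(n)))      # Omega_n, n! permutations
        prob = pulp.LpProblem("MinBvNDec", pulp.LpMinimize)
        x = [pulp.LpVariable(f"x{j}", lowBound=0, upBound=1) for j in range(len(perms))]
        s = [pulp.LpVariable(f"s{j}", cat="Binary") for j in range(len(perms))]
        prob += pulp.lpSum(s)                                # objective (5.2)
        for i in range(n):                                   # M x = a, one constraint per entry (5.3)
            for c in range(n):
                prob += pulp.lpSum(x[j] for j, p in enumerate(perms) if p[i] == c) == A[i, c]
        prob += pulp.lpSum(x) == 1                           # coefficients sum to one (5.5)
        for j in range(len(perms)):
            prob += s[j] >= x[j]                             # s_j = 1 whenever x_j > 0 (5.6)
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return int(sum(round(v.value()) for v in s))

Excluding a particular permutation P_j from the decomposition, as done below when probing the matrix of Lemma 5.3, amounts to one extra constraint of the form prob += s[j] == 0.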

Proof of Lemma 5.3. Let the following 10 letters correspond to the numbers underneath:

    a   b   c   d   e    f    g    h     i     j
    1   2   4   8   16   32   64   128   256   512


Consider the following matrix, whose row sums and column sums are 1023; hence, it can be considered as doubly stochastic:

    A = [ a+b   d+i   c+h   e+j   f+g
          e+g   a+c   b+i   d+f   h+j
          f+j   e+h   d+g   b+c   a+i
          d+h   b+f   a+j   g+i   c+e
          c+i   g+j   e+f   a+h   b+d ]                        (5.8)

Observe that the entries containing the term 2^ℓ form a permutation matrix, for ℓ = 0, ..., 9. Therefore, this matrix has a BvN decomposition with 10 permutation matrices. We created the MILP above and found 10 as the smallest number of permutation matrices by calling the CPLEX solver [1] via the NEOS Server [58, 68, 100]. Hence, the described decomposition is an extreme point of S(A). None of the said permutation matrices annihilates any entry of the matrix. Therefore, at the first step, no entry of A gets reduced to zero, regardless of the order of permutation matrices. Thus, this decomposition cannot be obtained by a generalized Birkhoff heuristic.

One can create a family of matrices with arbitrary sizes by embedding the same matrix into a larger one of the form

    B = [ 1023·I    O
             O      A ] ,

where I is the identity matrix of the desired size. All BvN decompositions of B can be obtained by extending the permutation matrices in A's BvN decompositions with I. That is, for a permutation matrix P in a BvN decomposition of A,

    [ I   O
      O   P ]

is a permutation matrix in the corresponding BvN decomposition of B. Furthermore, all permutation matrices in an arbitrary BvN decomposition of B must have I as the principal submatrix, and the rest should correspond to a permutation matrix in A, defining a BvN decomposition for A. Hence, the extreme point of S(B) corresponding to the extreme point of S(A) with 10 permutation matrices cannot be found by a generalized Birkhoff heuristic.

Let us investigate the matrix A and its BvN decomposition given in the proof of Lemma 5.3. Let P_a, P_b, ..., P_j be the 10 permutation matrices of the decomposition, corresponding to a, b, ..., j. We solve 10 MILPs in which we set s_t = 0 for one t ∈ {a, b, ..., j}. This way, we try to find a BvN decomposition of A without P_a, without P_b, and so on, always with the smallest number of permutation matrices. The smallest number of permutation matrices in these 10 MILPs was 11. This certifies that the only BvN decomposition with 10 permutation matrices necessarily contains P_a, P_b, ..., P_j. It is easy to see that there is a unique solution to the equality M x = a of the MILP with x_t = 0 for t ∉ {a, b, ..., j}, as the submatrix of M containing only the corresponding 10 columns has full column rank.

    A = [  3    264   132   528    96
          80      5   258    40   640
         544    144    72     6   257
         136     34   513   320    20
         260    576    48   129    10 ]

    (a) The sample matrix.

    coefficient:  129  511  257   63   33   15    7    3    2    2    1
    row 1:          3    4    2    5    5    4    2    3    4    1    1
    row 2:          5    5    3    1    4    1    4    2    1    2    3
    row 3:          2    1    5    3    1    2    3    4    3    4    4
    row 4:          1    3    4    4    2    5    1    5    5    3    2
    row 5:          4    2    1    2    3    3    5    1    2    5    5

    (b) A BvN decomposition.

Figure 5.1: The matrix A from Lemma 5.3 and a BvN decomposition with 11 permutation matrices, which can be obtained by a generalized Birkhoff heuristic. Each column in (b) corresponds to a permutation, where the first line gives the associated coefficient and the following lines give the column indices matched to the rows 1 to 5 of A.

Any generalized Birkhoff heuristic obtains at least 11 permutation matrices for the matrix A of the proof of Lemma 5.3. One such decomposition is shown in a tabular format in Fig. 5.1. In Fig. 5.1a, we write the matrix of the lemma explicitly for convenience. Then, in Fig. 5.1b, we give a BvN decomposition. The column headers (the first line) in the table contain the coefficients of the permutation matrices. The nonzero column indices of the permutation matrices are stored by rows. For example, the first permutation has the coefficient 129, and columns 3, 5, 2, 1, and 4 are matched to the rows 1-5. The bold indices in the original figure signify entries whose values are equal to the coefficients of the corresponding permutation matrices at the time the permutation matrices are found; the corresponding entries therefore become zero after the corresponding step. For example, in the first permutation, a_{54} = 129.

The output of GreedyBvN for the matrix A is given in Fig. 5.2 for reference. It contains 12 permutation matrices.

5.3 Two heuristics

There are heuristics to compute a BvN decomposition for a given matrix A. In particular, the following family of heuristics is based on the constructive proof of Birkhoff. Let A^(0) = A. At every step j ≥ 1, find a permutation matrix P_j having its ones at the positions of the nonzero elements of A^(j−1); use the minimum nonzero element of A^(j−1) at the positions identified by P_j as α_j; set A^(j) = A^(j−1) − α_j P_j; and repeat the computations in the next step j+1 until A^(j) becomes void. Any heuristic of this type is called a generalized Birkhoff heuristic.


Writing P[j_1, j_2, j_3, j_4, j_5] for the permutation matrix that matches rows 1-5 to columns j_1, ..., j_5, the output is

    A = 513·P[4,5,1,3,2] + 257·P[2,3,5,4,1] + 127·P[3,5,2,1,4] + 63·P[5,1,3,4,2]
      +  31·P[5,4,1,2,3] +  15·P[4,1,2,5,3] +   7·P[2,4,3,1,5] +  3·P[3,2,4,5,1]
      +   2·P[5,4,2,1,3] +   2·P[3,1,4,2,5] +   2·P[1,2,3,5,4] +  1·P[1,3,4,2,5] .

Figure 5.2: The output of GreedyBvN for the matrix given in the proof of Lemma 5.3.

Given an n times n matrix A we can associate a bipartite graph GA =(RcupCE) to it The rows of A correspond to the vertices in the set R andthe columns of A correspond to the vertices in the set C so that (ri cj) isin Eiff aij 6= 0 A perfect matching in GA or G in short is a set of n edges notwo sharing a row or a column vertex Therefore a perfect matching M inG defines a unique permutation matrix PM sube A

Birkhoff's heuristic. This is the original heuristic used in proving that a doubly stochastic matrix has a BvN decomposition, described for example by Brualdi [32]. Let a be the smallest nonzero of A^(j−1) and G^(j−1) be the bipartite graph associated with A^(j−1). Find a perfect matching M in G^(j−1) containing a; set α_j ← a and P_j ← P_M.

GreedyBvN heuristic. At every step j, among all perfect matchings in G^(j−1), find one whose minimum element is the maximum. That is, find a perfect matching M in G^(j−1) where min{P_M ∘ A^(j−1)} is the maximum. This "bottleneck" matching problem is polynomial-time solvable with, for example, MC64 [72]. In this greedy approach, α_j is the largest amount we can subtract from a row and a column of A^(j−1), and we hope to obtain a small k.

Algorithm 11: GreedyBvN: A greedy heuristic for constructing a BvN decomposition

Data: A: a doubly stochastic matrix
Result: a BvN decomposition
1: k ← 0
2: while nnz(A) > 0 do
3:    k ← k + 1
4:    P_k ← the pattern of a bottleneck perfect matching M in A
5:    α_k ← min{P_k ∘ A}
6:    A ← A − α_k P_k
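The following Python sketch mirrors Algorithm 11 on dense matrices. It is only an illustration: the bottleneck matching of Line 4 is obtained here by a binary search over the entry values with SciPy's maximum bipartite matching routine rather than with MC64, numerical tolerances are handled crudely, and the function names are ours.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import maximum_bipartite_matching

    def bottleneck_matching(A):
        # Perfect matching of A's nonzero pattern maximizing the minimum selected
        # entry, by binary search over the distinct nonzero values of A.
        vals = np.unique(A[A > 0])
        lo, hi, best = 0, len(vals) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            keep = csr_matrix((A >= vals[mid]).astype(int))
            match = maximum_bipartite_matching(keep, perm_type='column')
            if (match >= 0).all():       # perfect matching using only entries >= vals[mid]
                best, lo = match, mid + 1
            else:
                hi = mid - 1
        return best                       # best[i] is the column matched to row i

    def greedy_bvn(A, tol=1e-12):
        A = A.astype(float).copy()
        coeffs, perms = [], []
        while A.max() > tol:              # while nnz(A) > 0
            p = bottleneck_matching(A)
            alpha = min(A[i, p[i]] for i in range(A.shape[0]))   # alpha_k = min{P_k o A}
            for i in range(A.shape[0]):
                A[i, p[i]] -= alpha       # A <- A - alpha_k P_k
            coeffs.append(alpha)
            perms.append(p.copy())
        return coeffs, perms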


    A^(0) = [ 1 4 1 0 0 0        A^(1) = [ 0 4 1 0 0 0
              0 1 4 1 0 0                  0 1 3 1 0 0
              0 0 1 4 1 0                  0 0 1 3 1 0
              0 0 0 1 4 1                  0 0 0 1 3 1
              1 0 0 0 1 4                  1 0 0 0 1 3
              4 1 0 0 0 1 ]                4 0 0 0 0 1 ]

    A^(2) = [ 0 4 0 0 0 0        A^(3) = [ 0 3 0 0 0 0
              0 0 3 1 0 0                  0 0 3 0 0 0
              0 0 1 2 1 0                  0 0 0 2 1 0
              0 0 0 1 2 1                  0 0 0 1 1 1
              1 0 0 0 1 2                  1 0 0 0 1 1
              3 0 0 0 0 1 ]                2 0 0 0 0 1 ]

    A^(4) = [ 0 2 0 0 0 0        A^(5) = [ 0 1 0 0 0 0
              0 0 2 0 0 0                  0 0 1 0 0 0
              0 0 0 2 0 0                  0 0 0 1 0 0
              0 0 0 0 1 1                  0 0 0 0 1 0
              1 0 0 0 1 0                  1 0 0 0 0 0
              1 0 0 0 0 1 ]                0 0 0 0 0 1 ]

Figure 5.3: A sample matrix showing that the original Birkhoff heuristic can obtain a BvN decomposition with n permutation matrices while the optimum one has 3.

Analysis of the two heuristics

As we will see in Section 5.4, GreedyBvN can obtain a much smaller number of permutation matrices than the Birkhoff heuristic. Here we compare these two heuristics theoretically. First, we show that the original Birkhoff heuristic does not have any constant-ratio approximation guarantee. Furthermore, for an n × n matrix, its worst-case approximation ratio is Ω(n).

We begin with a small example shown in Fig. 5.3. We decompose a 6 × 6 matrix A^(0) which has an optimal BvN decomposition with three permutation matrices: the main diagonal, the one containing the entries equal to 4, and the one containing the remaining entries. For simplicity, we used integer values in our example. However, since the row and column sums of A^(0) are equal to 6, it can be converted to a doubly stochastic matrix by dividing all the entries by 6. Instead of the optimal decomposition, in the figure we obtain the permutation matrices as the original Birkhoff heuristic does. Each of the entry sets subtracted in turn is a permutation and contains the minimum possible value, 1.

In the following, we show how to generalize the idea, obtaining matrices of arbitrarily large size with three permutation matrices in an optimal decomposition, while the original Birkhoff heuristic obtains n permutation matrices.

Lemma 5.4. The worst-case approximation ratio of the Birkhoff heuristic is Ω(n).

Proof. For any given integer n ≥ 3, we show that there is a matrix of size n × n whose optimal BvN decomposition has 3 permutations, whereas the Birkhoff heuristic obtains a BvN decomposition with exactly n permutation matrices. The example in Fig. 5.3 is a special case, for n = 6, of the following construction process.

Let f : {1, 2, ..., n} → {1, 2, ..., n} be the function f(x) = (x mod n) + 1. Given a matrix M, let M′ = F(M) be another matrix containing the same set of entries, where the function f(·) is used on the coordinate indices to redistribute the entries of M on M′; that is, m_{ij} = m′_{f(i)f(j)}. Since f(·) is one-to-one and onto, if M is a permutation matrix, then F(M) is also a permutation matrix. We will start with a permutation matrix and run it through F for n−1 times to obtain n permutation matrices, which are all different. By adding these permutation matrices, we will obtain a matrix A whose optimal BvN decomposition has three permutation matrices, while the n permutation matrices used to create A correspond to a decomposition that can be obtained by the Birkhoff heuristic.

Let P_1 be the permutation matrix whose ones, which are partitioned into three sets, are at the positions

    1st set: (1, 1);   2nd set: (n, 2);   3rd set: (2, 3), (3, 4), ..., (n−1, n).     (5.9)

Let us use F(·) to generate a matrix sequence P_i = F(P_{i−1}) for 2 ≤ i ≤ n. For example, P_2's nonzeros are at the positions

    1st set: (2, 2);   2nd set: (1, 3);   3rd set: (3, 4), (4, 5), ..., (n, 1).

We then add the P_i's to build the matrix

    A = P_1 + P_2 + ··· + P_n .

We have the following observations about the nonzero elements of A:

1. a_{ii} = 1 for all i = 1, ..., n, and only P_i has a one at the position (i, i). These elements are from the first set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(1).

2. a_{ij} = 1 for all i = 1, ..., n and j = ((i+1) mod n) + 1, and only P_h, where h = (i mod n) + 1, has a one at the position (i, j). These elements are from the second set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(2).

3. a_{ij} = n − 2 for all i = 1, ..., n and j = (i mod n) + 1, where all P_ℓ for ℓ ∈ {1, ..., n} \ {i, j} have a one at the position (i, j). These elements are from the third set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(3) multiplied by the scalar (n−2).

In other words, we can write

    A = P^(1) + P^(2) + (n−2)·P^(3)

and see that A has a BvN decomposition with three permutation matrices. We note that each row and column of A contains three nonzeros, and hence three is the smallest number of permutation matrices in a BvN decomposition of A.

Since the minimum element in A is 1 and each P_i contains one such element, the Birkhoff heuristic can obtain a decomposition using P_i for i = 1, ..., n. Therefore, the Birkhoff heuristic's approximation ratio is no better than n/3, which can be made arbitrarily large.

We note that GreedyBvN will optimally decompose the matrix A used in the proof above.
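For concreteness, the construction in the proof can be scripted as below (our own illustration, with a function name of our choosing); for n = 6 it reproduces the matrix A^(0) of Fig. 5.3.

    import numpy as np

    def birkhoff_worst_case(n):
        # Build A = P_1 + ... + P_n, where P_{s+1} is P_1 with both coordinate
        # indices shifted cyclically by s (the map F applied s times).  Rows and
        # columns sum to n, so A/n is doubly stochastic; its optimal BvN
        # decomposition has 3 permutation matrices.
        base = [(0, 0), (n - 1, 1)] + [(i, i + 1) for i in range(1, n - 1)]  # positions (5.9), 0-indexed
        A = np.zeros((n, n), dtype=int)
        for s in range(n):
            for (r, c) in base:
                A[(r + s) % n, (c + s) % n] += 1
        return A

    print(birkhoff_worst_case(6))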

We now analyze the theoretical properties of GreedyBvN. We first give a lemma identifying a pattern in its output.

Lemma 5.5. The heuristic GreedyBvN obtains α_1, ..., α_k in a non-increasing order, α_1 ≥ ··· ≥ α_k, where α_j is obtained at the jth step.

We observed through a series of experiments that the performance of GreedyBvN depends on the values in the matrix, and there seems to be no constant-factor approximation [74].

We create a set of n × n matrices as follows. We first fix a set of z permutation matrices C_1, ..., C_z of size n × n. These permutation matrices, with varying values of α, will be used to generate the matrices. The matrices are parameterized by the subscript i, and each A_i is created as follows: A_i = α_1·C_1 + α_2·C_2 + ··· + α_z·C_z, where each α_j, for j = 1, ..., z, is a randomly chosen integer in the range [1, 2^i]; we also set a randomly chosen α_j equal to 2^i, to guarantee the existence of at least one large value even in the unlikely case that all other values are not large enough. As can be seen, each A_i has the same structure and differs from the rest only in the values of the α_j's that are chosen. As a consequence, they all can be decomposed by the same set of permutation matrices.

We present our results in two sets of experiments, shown in Table 5.1, for two different n. We create five random A_i for each i ∈ {10, 20, 30, 40, 50}; that is, we have five matrices for a given parameter i, and there are five different i. We have n = 30 and z = 20 in the first set, and n = 200 and z = 100 in the second set. Let k_i be the number of permutation matrices GreedyBvN obtains for a given A_i. The table gives the average and the maximum k_i of the five different A_i for a given n and i pair. By construction, each A_i has a BvN decomposition with z permutation matrices. Since z is no smaller than the optimal value, the ratio k_i/z gives a lower bound on the performance of GreedyBvN. As seen from the experiments, as i increases, the performance of GreedyBvN gets increasingly worse. This shows that a constant-ratio worst-case approximation of GreedyBvN is unlikely. While the performance depends on z (for example, for small z, GreedyBvN is likely to obtain near optimal decompositions), it seems that the size of the matrix does not largely affect the relative performance of GreedyBvN.

    n = 30 and z = 20
                  average         worst case
      i         k_i    k_i/z     k_i    k_i/z
     10          59     2.99      63     3.15
     20         105     5.29     110     5.50
     30         149     7.46     158     7.90
     40         184     9.23     191     9.55
     50         212    10.62     227    11.35

    n = 200 and z = 100
                  average         worst case
      i         k_i    k_i/z     k_i    k_i/z
     10         268     2.69     280     2.80
     20         487     4.88     499     4.99
     30         716     7.16     726     7.26
     40         932     9.33     947     9.47
     50        1124    11.25    1162    11.62

Table 5.1: Experiments showing the dependence of the performance of GreedyBvN on the values of the matrix elements. n is the matrix size; i ∈ {10, 20, 30, 40, 50} is the parameter for creating matrices using α_j ∈ [1, 2^i]; z is the number of permutation matrices used in creating A_i. There are five experiments for a given pair of n and i, in which GreedyBvN obtains k_i permutation matrices. The average and the maximum number of permutation matrices obtained by GreedyBvN over the five random instances are given. k_i/z is a lower bound on the performance of GreedyBvN, as z ≥ Opt.

Now we attempt to explain the above results theoretically

Lemma 5.6. Let α_1 P_1 + ··· + α_k P_k be a BvN decomposition of a given doubly stochastic matrix A with the smallest number k of permutation matrices. Then, for any BvN decomposition of A with ℓ ≥ k permutation matrices and coefficients α′_1, ..., α′_ℓ, we have ℓ ≤ k · (max_i α_i)/(min_i α′_i). If the coefficients are integers (e.g., when A is a matrix with constant row and column sums of integral values), we have ℓ ≤ k · max_i α_i.

Proof. Consider a BvN decomposition α′_1 P′_1 + ··· + α′_ℓ P′_ℓ with ℓ ≥ k. Assume, without loss of generality, that α_1 ≥ ··· ≥ α_k and α′_1 ≥ ··· ≥ α′_ℓ. We know that the coefficients of these two decompositions sum up to the same value; that is,

    ∑_{i=1}^{ℓ} α′_i = ∑_{i=1}^{k} α_i .


Since α′_ℓ is the smallest of the α′ and α_1 is the largest of the α, we have

    ℓ · α′_ℓ ≤ k · α_1 ,

and hence

    ℓ/k ≤ α_1/α′_ℓ .

By assuming integer values, we see that α′_ℓ ≥ 1, and thus

    ℓ ≤ k · max_i α_i .

This lemma evaluates the approximation guarantee of a given BvN decomposition. It does not seem very useful, because even if we have min_i α′_i, we do not have max_i α_i. Luckily, we can say more in the case of GreedyBvN.

Corollary 5.1. Let k be the smallest number of permutation matrices in a BvN decomposition of a given doubly stochastic matrix A. Let α_1 and α_ℓ be the first and last coefficients obtained by the heuristic GreedyBvN for decomposing A. Then ℓ ≤ k · α_1/α_ℓ.

Proof. This is easy to see, as GreedyBvN obtains the coefficients in a non-increasing order (see Lemma 5.5), and α_1 ≥ α_j for all 1 ≤ j ≤ k for any BvN decomposition containing α_j: the first coefficient of GreedyBvN is the bottleneck value of A, and any coefficient of any BvN decomposition is bounded by the minimum entry on its own permutation and hence by the bottleneck value.

Lemma 5.6 and Corollary 5.1 give a posteriori estimates of the performance of the heuristic GreedyBvN, in that one looks at the decomposition and tells how good it is. This can potentially reveal a good performance. For example, when GreedyBvN obtains a BvN decomposition with all coefficients equal, then we know that it is an optimal BvN decomposition. The same cannot be said for the Birkhoff heuristic, though (consider the example preceding Lemma 5.4). We also note that the ratio given in Corollary 5.1 should usually be much larger than the practical performance.

5.4 Experiments

We present results with the two heuristics. We note that the original Birkhoff heuristic was not concerned with the minimality of the number of permutation matrices; therefore, the presented results are not meant to compare the two heuristics, but to report what is available. We give results on two different sets of matrices. The first set contains real-world sparse matrices preprocessed to be doubly stochastic. The second set contains a few randomly created dense doubly stochastic matrices.


                                                Birkhoff        GreedyBvN
    matrix          n        τ   dmax     dev    ∑α_i     k      ∑α_i     k
    aft01        8205   125567     21  0.0e+00  0.160  2000     1.000   120
    aft02        8184   127762     82  1.0e-06  0.000  1434     1.000   234
    barth        6691    46187     13  0.0e+00  0.160  2000     1.000    71
    barth4       6019    40965     13  0.0e+00  0.140  2000     1.000    61
    bcspwr10     5300    21842     14  1.0e-06  0.380  2000     1.000    63
    bcsstk38     8032   355460    614  0.0e+00  0.000  2000     1.000   592*
    benzene      8219   242669     37  0.0e+00  0.000  2000     1.000   113
    c-29         5033    43731    481  1.0e-06  0.000  2000     1.000   870
    EX5          6545   295680     48  0.0e+00  0.020  2000     1.000   229
    EX6          6545   295680     48  1.0e-06  0.030  2000     1.000   226
    flowmeter0   9669    67391     11  0.0e+00  0.510  2000     1.000    58
    fv1          9604    85264      9  0.0e+00  0.620  2000     1.000    50
    fv2          9801    87025      9  0.0e+00  0.620  2000     1.000    52
    fxm3_6       5026    94026    129  1.0e-06  0.130  2000     1.000   383
    g3rmt3m3     5357   207695     48  1.0e-06  0.050  2000     1.000   223
    Kuu          7102   340200     98  0.0e+00  0.000  2000     1.000   330
    mplate       5962   142190     36  0.0e+00  0.030  2000     1.000   153
    n3c6-b7      6435    51480      8  0.0e+00  1.000     8     1.000     8
    nemeth02     9506   394808     52  0.0e+00  0.000  2000     1.000   109
    nemeth03     9506   394808     52  0.0e+00  0.000  2000     1.000   115
    olm5000      5000    19996      6  1.0e-06  0.750   283     1.000    14
    s1rmq4m1     5489   262411     54  0.0e+00  0.000  2000     1.000   211
    s2rmq4m1     5489   263351     54  0.0e+00  0.000  2000     1.000   208
    SiH4         5041   171903    205  0.0e+00  0.000  2000     1.000   574
    t2d_q4       9801    87025      9  2.0e-06  0.500  2000     1.000    54

Table 5.2: Birkhoff's heuristic and GreedyBvN on sparse matrices from the UFL collection. The column τ contains the number of nonzeros in a matrix. The column dmax contains the maximum number of nonzeros in a row or a column, setting up a lower bound for the number k of permutation matrices in a BvN decomposition. The column "dev" contains the maximum deviation of a row/column sum of a matrix A from 1, in other words the value max{‖A1 − 1‖_∞, ‖1^T A − 1^T‖_∞}. The two heuristics are run to obtain at most 2000 permutation matrices or until they accumulate a sum of at least 0.9999 with the coefficients. In one matrix (marked with *), GreedyBvN obtained this sum in fewer than dmax permutation matrices; increasing the limit to 0.999999 made it return with 908 permutation matrices.


The first set of matrices was created as follows. We selected all matrices with the following properties from the SuiteSparse Matrix Collection (UFL) [59]: square, with at least 5000 and at most 10000 rows, fully indecomposable, and with at most 50 and at least 2 nonzeros per row. This gave a set of 66 matrices. These 66 matrices are from 30 different groups of problems; we chose at most two per group to remove any bias that might arise from the group. This resulted in 32 matrices, which we preprocessed as follows to obtain a set of doubly stochastic matrices. We first took the absolute values of the entries to make the matrices nonnegative. Then we scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of 10^{-6} and at most 1000 iterations. With this setting, the scaling algorithm tries to get the maximum deviation of a row/column sum from 1 to be less than 10^{-6} within 1000 iterations. In 7 matrices, the deviation was larger than 10^{-4}. We deemed this too big a deviation from a doubly stochastic matrix and discarded those matrices. At the end, we had the 25 matrices given in Table 5.2.

We run the two heuristics for the BvN decomposition to obtain at most 2000 permutation matrices, or until they accumulate a sum of at least 0.9999 with the coefficients. The accumulated sum of coefficients ∑_{i=1}^{k} α_i and the number of permutation matrices k for the Birkhoff and GreedyBvN heuristics are given in Table 5.2. As seen in this table, GreedyBvN obtains a much smaller number of permutation matrices than Birkhoff's heuristic, except for the matrix n3c6-b7. This is a special matrix with eight nonzeros in each row and column, all equal to 1/8; both heuristics find the same (minimum) number of permutation matrices. We note that the geometric mean of the ratios k/dmax is 3.4 for GreedyBvN.

The second set of matrices was created as follows. For n ∈ {100, 200, 300}, we created five matrices of size n × n whose entries are randomly chosen integers between 1 and 100 (we performed tests with integers between 1 and 20, and the results were close to what we present here). We then scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of 10^{-6} and at most 1000 iterations. We run the two heuristics for the BvN decomposition such that at most n² − 2n + 2 permutation matrices are found or the total value of the coefficients is larger than 0.9999. We then took the average over the five instances with the same n. The results are in Table 5.3. In this table, again, we see that GreedyBvN obtains a much smaller k than Birkhoff's heuristic.

The experimental observation that GreedyBvN performs better than the Birkhoff heuristic is not surprising. This is for two reasons. First, as stated before, the original Birkhoff heuristic is not concerned with the number of permutation matrices. Second, the heuristic GreedyBvN is an adaptation of the Birkhoff heuristic designed to obtain large reductions at each step.


              Birkhoff            GreedyBvN
      n     ∑α_i        k       ∑α_i       k
    100     0.99     9644       1.00     388
    200     0.99    39208       1.00     717
    300     1.00    88759       1.00    1042

Table 5.3: Birkhoff's heuristic and GreedyBvN on random dense matrices. The maximum deviation of a row/column sum of a matrix A from 1, that is, max{‖A1 − 1‖_∞, ‖1^T A − 1^T‖_∞}, was always less than 10^{-6}. For each n, the results are the averages of five different instances.

5.5 A generalization of the Birkhoff-von Neumann theorem

We have exploited the BvN decomposition in the context of preconditioning for solving linear systems [23]. As the decomposition is defined for nonnegative matrices, we needed a generalization of it to cover all real matrices. To motivate the generalized decomposition theorem, we give a brief summary of the preconditioner.

5.5.1 Doubly stochastic splitting

For a given nonnegative matrix A, we first preprocess it to get a doubly stochastic matrix (whose row and column sums are one). Then, using this doubly stochastic matrix, we select some fraction of some of the nonzeros of A to be included in the preconditioner. The selection is done by taking a set of permutation matrices along with their coefficients in a BvN decomposition of the given matrix.

Assume we have a BvN decomposition of A as shown in (5.1). Then one can pick an integer r between 1 and k−1 and split A as

    A = M − N ,                                                          (5.10)

where

    M = α_1 P_1 + ··· + α_r P_r ,   N = −α_{r+1} P_{r+1} − ··· − α_k P_k .   (5.11)

Note that M and −N are doubly substochastic matrices.

Definition 5.1. A splitting of the form (5.10), with M and N given by (5.11), is said to be a doubly substochastic splitting.

Definition 5.2. A doubly substochastic splitting A = M − N of a doubly stochastic matrix A is said to be standard if M is invertible. We will call such a splitting an SDS splitting.


In general, it is not easy to guarantee that a given doubly substochastic splitting is standard, except for some trivial situations, such as the case r = 1, in which M is always invertible. We also have a characterization for invertible M when r = 2.

Theorem 5.2. A sufficient condition for M = ∑_{i=1}^{r} α_i P_i to be invertible is that one of the α_i, with 1 ≤ i ≤ r, be greater than the sum of the remaining ones.

Let A = M − N be an SDS splitting of A, and consider the stationary iterative scheme

    x_{k+1} = H x_k + c ,   H = M^{-1} N ,   c = M^{-1} b ,              (5.12)

where k = 0, 1, ..., and x_0 is arbitrary. As seen in (5.12), M^{-1} or solvers for M y = z are required. As is well known, the scheme (5.12) converges to the solution of A x = b for any x_0 if and only if ρ(H) < 1. Hence, we are interested in conditions that guarantee that the spectral radius of the iteration matrix

    H = M^{-1} N = −(α_1 P_1 + ··· + α_r P_r)^{-1} (α_{r+1} P_{r+1} + ··· + α_k P_k)

is strictly less than one. In general, this problem appears to be difficult. We have a necessary condition (Theorem 5.3) and a sufficient condition (Theorem 5.4), which is simple but restrictive.

Theorem 5.3. For the splitting A = M − N, with M = ∑_{i=1}^{r} α_i P_i and N = −∑_{i=r+1}^{k} α_i P_i, to be convergent, it must hold that ∑_{i=1}^{r} α_i > ∑_{i=r+1}^{k} α_i.

Theorem 5.4. Suppose that one of the α_i appearing in M is greater than the sum of all the other α_i. Then ρ(M^{-1}N) < 1, and the stationary iterative method (5.12) converges for all x_0 to the unique solution of A x = b.

We discuss how to build preconditioners meeting the sufficiency conditions. Since M itself is doubly stochastic (up to a scalar ratio), we can apply the splitting recursively on M and obtain a special solver. Our motivation is that the preconditioner M^{-1} can be applied to vectors via a number of highly concurrent steps, where the number of steps is controlled by the user. Therefore, the preconditioners (or the splittings) can be advantageous for use in many-core computing systems. In the context of splittings, the application of N to vectors also enjoys the same property. These motivations are shared by recent work on ILU preconditioners, where their fine-grained computation [48] and approximate application [12] are investigated for GPU-like systems.


Preconditioner construction

It is desirable to have a small number k in the Birkhoff-von Neumann decomposition while designing the preconditioner; this was our motivation to investigate the MinBvNDec problem. This is because, if we use the splitting, then k determines the number of steps in which we compute the matrix-vector products. If we do not use all k permutation matrices, having a few with large coefficients should help to design the preconditioner. As stated in Lemma 5.5, GreedyBvN (Algorithm 11) obtains the coefficients in a non-increasing order. We can therefore use GreedyBvN to build an M satisfying the sufficiency condition presented in Theorem 5.4. That is, we can have M with α_1/∑_{i=1}^{k} α_i > 1/2, and hence M^{-1} can be applied with splitting iterations. For this, we start by initializing M to α_1 P_1. Then, when a coefficient α_j, with j ≥ 2, is obtained at Line 5 of Algorithm 11, we check whether α_1 will still be larger than the sum of the other coefficients used in building M when we add α_j P_j to M. If so, we add α_j P_j to M and continue. In practice, we iterate the while loop until k is around 10 and collect the α_j's as described above. Experiments on a set of challenging problems [23] show that this preconditioner is more effective than ILU(0).
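The appeal of such an M is that both M and an approximation of M^{-1} can be applied to a vector with a handful of permutation (gather) operations. The sketch below is our own illustration, with permutations stored as index arrays: it applies M^{-1} approximately through the inner stationary iteration α_1 P_1 y_{t+1} = z − (α_2 P_2 + ··· + α_r P_r) y_t, which converges under the condition of Theorem 5.4.

    import numpy as np

    def apply_perm(p, x):
        # (P x)[i] = x[p[i]] when the permutation matrix P has its ones at (i, p[i]).
        return x[p]

    def apply_M_inverse(alphas, perms, z, iters=10):
        # M = sum_i alphas[i] * P_i with alphas[0] > sum(alphas[1:]) (Theorem 5.4).
        y = np.zeros_like(z)
        p0 = perms[0]
        for _ in range(iters):
            r = z.copy()
            for a, p in zip(alphas[1:], perms[1:]):
                r -= a * apply_perm(p, y)      # r = z - (sum_{i>=2} alpha_i P_i) y
            y_new = np.empty_like(z)
            y_new[p0] = r / alphas[0]          # solve alphas[0] * P_1 y_new = r
            y = y_new
        return y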

5.5.2 Birkhoff-von Neumann decomposition for arbitrary matrices

In order to apply the preconditioners to all fully indecomposable sparse matrices, we generalized the BvN decomposition as follows.

Theorem 5.5. Any (real) matrix A with total support can be written as a convex combination of a set of signed, scaled permutation matrices.

Proof. Let B = abs(A), and consider R and C making RBC doubly stochastic, with a Birkhoff-von Neumann decomposition RBC = ∑_{i=1}^{k} α_i P_i. Then RAC can be expressed as

    RAC = ∑_{i=1}^{k} α_i Q_i ,

where Q_i = [q^(i)_{jk}]_{n×n} is obtained from P_i = [p^(i)_{jk}]_{n×n} as follows:

    q^(i)_{jk} = sign(a_{jk}) p^(i)_{jk} .

The scaling matrices can be inverted to express A.

De Werra [62] presents another generalization of the Birkhoff-von Neumann theorem to arbitrary matrices. In De Werra's generalization, instead of permutation matrices, signed integral matrices are used to decompose the


given matrix. The entries of the integral matrices used in the decomposition have the same sign as the corresponding entries in the original matrix, but there may be multiple nonzeros in a row or column. De Werra discusses flow-based algorithms to obtain the decompositions. While the decomposition is costlier to obtain (general flow computations are usually costlier than computing perfect matchings), it is also more general in that it handles rectangular matrices as well.

5.6 Summary, further notes, and references

We investigated the problem of obtaining a Birkhoff-von Neumann decomposition of doubly stochastic matrices with the minimum number of permutation matrices. We showed four results regarding this problem. First, obtaining a BvN decomposition with the minimum number of permutation matrices is NP-hard. Second, there are matrices whose decompositions with the smallest number of permutation matrices cannot be obtained by any Birkhoff-like heuristic. Third, the worst-case approximation ratio of the original Birkhoff heuristic is Ω(n). On a more practical side, we proposed a natural greedy heuristic (GreedyBvN) and presented experimental results which demonstrated that this heuristic delivers results that are not far from a trivial lower bound. We theoretically investigated GreedyBvN's performance and observed that its performance depends on the values of the matrix elements; we showed a bound using the first and the last coefficients found by the heuristic. This chapter also included a generalization of the Birkhoff-von Neumann theorem, in which we are able to express any real matrix with total support as a convex combination of scaled, signed permutation matrices.

The shown bound for the performance of GreedyBvN is expected to be much larger than what one observes in practice, as the bound can even be larger than the upper bound on the number of permutation matrices. A tighter analysis should be possible to explain the practical performance of the GreedyBvN heuristic.

The heuristics we considered in this chapter were based on finding a perfect matching. Koopmans and Beckmann [143] present another constructive proof of the Birkhoff theorem. This constructive proof splits the matrix as A = βA_1 + (1−β)A_2, where 0 < β < 1, and A_1 and A_2 are doubly stochastic matrices. A related proof is also given by Horn and Johnson [118, Theorem 8.7.1], where β = 1/2. We review these two constructive proofs.

Let C be a cycle of length 2ℓ, with ℓ vertices in each part of the bipartite graph: vertices r_i, r_{i+1}, ..., r_{i+ℓ−1} on one side and c_i, c_{i+1}, ..., c_{i+ℓ−1} on the other side. Let us imagine the cycle drawn on the plane so that we have horizontal edges of the form (r_j, c_j) for j = i, ..., i+ℓ−1, and slanted edges of the form (r_i, c_{i+ℓ−1}) and (r_j, c_{j−1}) for j = i+1, ..., i+ℓ−1. A small example is shown in Figure 5.4.

[Figure 5.4: A small example for the Koopmans and Beckmann decomposition approach, on a cycle with row vertices r_i, r_j and column vertices c_i, c_j. (a) The horizontal edges are decreased and the slanted edges are increased by ε_1 = min{a_ii, a_jj}. (b) The horizontal edges are increased and the slanted edges are decreased by ε_2 = min{a_ij, a_ji}.]

Let ε_1 and ε_2 be the minimum values of a horizontal and a slanted edge, respectively. Koopmans and Beckmann proceed as follows to obtain a BvN decomposition. Consider a matrix C_1 which contains −ε_1 at the positions corresponding to the horizontal edges of C, ε_1 at the positions corresponding to the slanted edges of C, and zeros elsewhere. Then define A_1 = A + C_1 (see Fig. 5.4a). It is easy to see that each element of A_1 is between 0 and 1 and that the row and column sums are the same as in A. Therefore, A_1 is doubly stochastic. Similarly, consider C_2 which contains ε_2 at the positions corresponding to the horizontal edges of C, −ε_2 at the positions corresponding to the slanted edges of C, and zeros elsewhere. Then define A_2 = A + C_2 (see Fig. 5.4b), and observe that A_2 is also doubly stochastic. By simple arithmetic, we can set β = ε_2/(ε_1 + ε_2) and write A = βA_1 + (1−β)A_2. The observation that A_1 and A_2 contain at least one more zero than A can then be used in an inductive hypothesis to construct a decomposition by recursively decomposing A_1 and A_2.

At first sight, the constructive approach of Koopmans and Beckmann does not lead to an efficient algorithm. This is because A_1 and A_2 can have significant overlap, and hence many permutation matrices can be shared by A_1 and A_2, yielding a combinatorially large space requirement or a very high run time. A variant of this algorithm needs to be explored. First note that we can find a cycle and manipulate the weights, by decreasing them along the horizontal edges and increasing them along the slanted edges (not necessarily by annihilating the smallest weight), to create another matrix. Then we can take the modified matrix and manipulate another cycle. At one point, one can decide to stop and create A_1; this way, with a suitable choice of coefficients, the decomposition can proceed. At second sight, then, one observes that the Koopmans and Beckmann method calls for creative ideas for developing efficient and effective heuristics for BvN decompositions.

Horn and Johnson, on the other hand, set ε to the smallest of ε_1 and ε_2.


Without loss of generality, let ε_1 be the smallest. Then let C be a matrix whose entries are −1, 0, and 1, corresponding to the negative, zero, and positive entries of C_1, respectively. Then let A_1 = A + εC and A_2 = A − εC. Both of these matrices have nonnegative entries less than or equal to one and have the same row and column sums as A. We can then write A = 0.5A_1 + 0.5A_2 and recurse on A_1 and A_2 to obtain a decomposition (which stops when we do not have any cycles).

Inspired by the approaches summarized above, we can optimally decompose the matrix (5.8) used in showing that there are optimal decompositions which cannot be obtained by Birkhoff-like heuristics. Let

    A_1 = [ a+b     d      c      e      0
             e     a+c     b      d      0
             0      e      d     b+c     a
             d      b      a      0     c+e
             c      0      e      a     b+d ] ,

    A_2 = [ a+b     d+2i    c+2h    e+2j    2f+2g
            e+2g    a+c     b+2i    d+2f    2h+2j
            2f+2j   e+2h    d+2g    b+c     a+2i
            d+2h    b+2f    a+2j    2g+2i   c+e
            c+2i    2g+2j   e+2f    a+2h    b+d  ]

and observe that A = 0.5A_1 + 0.5A_2. We can then use the permutations containing the coefficients a, b, c, d, e to decompose A_1 optimally with five permutation matrices using GreedyBvN (in decreasing order: first e, then d, then c, and so on). While decomposing A_2, we need to use the permutation matrices found for A_1, with a careful selection of the coefficients (instead of a+b we use a, for example, for the permutation associated with a). Once we have consumed those five permutations, we can again resort to GreedyBvN to decompose the remaining matrix optimally with the five permutation matrices associated with f, g, h, i, and j. In this way, we obtain an optimal decomposition for the matrix (5.8) by applying GreedyBvN to A_1 and carefully decomposing A_2. We note that neither A_1 nor A_2 has the same sum as A, in contrast to the two approaches above. Can we systematize this line of reasoning to develop an effective heuristic for obtaining a BvN decomposition?


Chapter 6

Matchings in hypergraphs

In this chapter, we investigate the maximum matching problem in d-partite, d-uniform hypergraphs, described in Section 1.3. The formal problem definition is repeated below for convenience.

Maximum matching in d-partite d-uniform hypergraphs: Given a d-partite, d-uniform hypergraph, find a matching of maximum size.

We investigate effective heuristics for the problem above. This problem has been studied mostly in the context of local search algorithms [119], and the best known algorithm is due to Cygan [56], who provides a ((d+1+ε)/3)-approximation, building on previous work [57, 106]. It is NP-hard to approximate the 3-partite case (Max-3-DM) within 98/97 [24]. Similar bounds exist for higher dimensions: the hardness of approximation for d = 4, 5, and 6 is shown to be 54/53 − ε, 30/29 − ε, and 23/22 − ε, respectively [109].

We propose five heuristics for the above problem. The first two heuristics are adaptations of the Greedy [78] and Karp-Sipser [128] heuristics used for finding matchings in graphs (they are summarized in Chapter 4). The proposed generalizations are referred to as Greedy-H and Karp-Sipser-H. Greedy-H traverses the hyperedge list in random order and adds an edge to the matching whenever possible. Karp-Sipser-H introduces certain rules to Greedy-H to improve the cardinality. The third heuristic is inspired by a recent scaling-based approach proposed for the maximum cardinality matching problem on graphs [73, 74, 76]. The fourth heuristic is a modification of the third one that allows for faster execution. The last one finds a matching for a reduced (d−1)-dimensional problem and exploits it for the original matching problem, recursively. This heuristic uses an exact algorithm for the bipartite matching problem at each reduction. We perform experiments to evaluate the performance of these heuristics on special classes of random hypergraphs as well as real-life data.

One plausible way to tackle the problem is to create the line graph G for a given hypergraph H. The line graph is created by identifying each



hyperedge of H with a vertex in G, and by connecting two vertices of G with an edge iff the corresponding hyperedges share a common vertex in H. Then, successful heuristics for computing large independent sets in graphs, e.g., KaMIS [147], can be used to compute large matchings in hypergraphs. This approach, although promising quality-wise, could be impractical. This is so since building G from H requires quadratic run time (in terms of the number of hyperedges) and, more importantly, quadratic storage (again in terms of the number of hyperedges) in the worst case. While this can be acceptable in some instances, in others it is not; we have such instances in the experiments. Notice that while a heuristic for the independent set problem can be of linear time complexity in graphs, our graphs are line graphs, so the actual complexity could be high.

6.1 Background and notation

Franklin and Lorenz [89] show that if a nonnegative d-dimensional tensor X has the same zero-pattern as a d-stochastic tensor, one can apply a multidimensional version of the Sinkhorn-Knopp algorithm [202] (see also the description in Section 4.1) to scale X to be d-stochastic. This is accomplished by finding d vectors u^(1), ..., u^(d) and scaling the nonzeros of X as

    x_{i_1,...,i_d} · u^(1)_{i_1} ··· u^(d)_{i_d}   for all i_1, ..., i_d ∈ {1, ..., n}.
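A rough sketch of this multidimensional scaling, for a d-partite, d-uniform hypergraph given as a list of hyperedges, is shown below (our own, unoptimized illustration; the function name is ours, and the scaled value of a hyperedge is the product of its vertices' scaling factors).

    import numpy as np

    def scale_hypergraph(edges, part_sizes, iters=20):
        # edges: list of d-tuples, one vertex index per part; part_sizes: d sizes.
        d = len(part_sizes)
        u = [np.ones(n) for n in part_sizes]
        for _ in range(iters):
            for k in range(d):
                marg = np.zeros(part_sizes[k])
                for e in edges:                   # marginal sums along dimension k
                    val = 1.0
                    for j in range(d):
                        val *= u[j][e[j]]
                    marg[e[k]] += val
                nz = marg > 0
                u[k][nz] /= marg[nz]              # make the k-th marginals equal to 1
        return u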

In a hypergraph, if a vertex is a member of only a single hyperedge, we call it a degree-1 vertex. Similarly, if a vertex is a member of only two hyperedges, we call it a degree-2 vertex.

In the k-out random hypergraph model, given a set V of vertices, each vertex u ∈ V selects k hyperedges from the set E_u = {e : e ⊆ V, u ∈ e} in a uniformly random fashion, and the union of these edges forms E. We are interested in the d-partite, d-uniform case, and hence E_u = {e : |e ∩ V_i| = 1 for 1 ≤ i ≤ d, u ∈ e}. This model generalizes random k-out bipartite graphs [211]. Devlin and Kahn [67] investigate fractional matchings in these hypergraphs and mention in passing that k should be exponential in d to ensure that a perfect matching exists.
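A generator for this model might look as follows; this is a sketch under the simplifying assumption that the k picks of a vertex are drawn independently (possibly with repetition), and the function name is ours.

    import random

    def k_out_hypergraph(n, d, k, seed=0):
        # d parts of n vertices each; vertex v of part p picks k hyperedges
        # containing it, the other d-1 coordinates chosen uniformly at random.
        rng = random.Random(seed)
        edges = set()
        for p in range(d):
            for v in range(n):
                for _ in range(k):
                    e = [rng.randrange(n) for _ in range(d)]
                    e[p] = v
                    edges.add(tuple(e))
        return sorted(edges)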

6.2 The heuristics

A matching which cannot be extended with more edges is called maximal. The heuristics proposed here find maximal matchings on d-partite, d-uniform hypergraphs. For such hypergraphs, any maximal matching is a d-approximate matching. The bound is tight and can be verified for d = 3. Let H be a 3-partite, 3 × 3 × 3 hypergraph with the following edges: e_1 = (1, 1, 1), e_2 = (2, 2, 2), e_3 = (3, 3, 3), and e_4 = (1, 2, 3). The maximum matching is {e_1, e_2, e_3}, but the edge e_4 alone forms a maximal matching.


6.2.1 Greedy-H: A greedy heuristic

Among the two variants of Greedy [78, 194], we adapt the first one, which randomly visits the edges. The adapted version is referred to as Greedy-H; it visits the hyperedges in random order and adds the visited hyperedge to the matching whenever possible. Since only maximal matchings are possible as its output, Greedy-H is a d-approximation heuristic.
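A minimal sketch of Greedy-H (our own illustration, with hyperedges given as d-tuples of per-part vertex indices):

    import random

    def greedy_h(edges, d, seed=0):
        rng = random.Random(seed)
        order = list(edges)
        rng.shuffle(order)                        # visit the hyperedges in random order
        matched = [set() for _ in range(d)]       # matched vertices, per part
        M = []
        for e in order:
            if all(e[i] not in matched[i] for i in range(d)):
                M.append(e)                       # add e whenever all endpoints are free
                for i in range(d):
                    matched[i].add(e[i])
        return M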

6.2.2 Karp-Sipser-H: Adapting Karp-Sipser

The commonly used and analyzed version of the Karp-Sipser heuristic for graphs applies degree-1 reductions, as summarized in Section 4.2.1. The original version, though, applies two rules:

• At any time during the heuristic, if a degree-1 vertex appears, it is matched with its only neighbor.

• Otherwise, if a degree-2 vertex u appears with neighbors {v, w}, then u (and its edges) is removed from the current graph, and v and w are merged to create a new vertex vw whose set of neighbors is the union of those of v and w (except u). A maximum cardinality matching for the reduced graph can be extended to obtain one for the current graph by matching u with either v or w, depending on vw's match.

We now propose an adaptation of Karp-Sipser with the two rules for d-partite, d-uniform hypergraphs. The adapted version is called Karp-Sipser-H. Similar to the original one, the modified heuristic iteratively adds a random hyperedge to the matching and removes its d endpoints, as well as their hyperedges. However, the random selection is not applied whenever hyperedges defined by the following lemmas appear.

Lemma 6.1. During the heuristic, if a hyperedge e with at least d−1 degree-1 endpoints appears, then there exists a maximum cardinality matching in the current hypergraph containing e.

Proof. Let H′ be the current hypergraph at hand and e = (u_1, ..., u_d) be a hyperedge in H′ whose first d−1 endpoints are degree-1 vertices. Let M′ be a maximum cardinality matching in H′. If e ∈ M′, we are done. Otherwise, assume that u_d is the endpoint matched by a hyperedge e′ ∈ M′ (note that if u_d is not matched, M′ can be extended with e). Since u_i, 1 ≤ i < d, are not matched in M′, M′ \ {e′} ∪ {e} defines a valid maximum cardinality matching for H′.

We note that it is not possible to relax the condition by using a hyperedge e with fewer than d−1 endpoints of degree 1: in M′, two of e's higher-degree endpoints could be matched with two different hyperedges, in which case the substitution done in the proof of the lemma would not be valid.


Lemma 6.2. During the heuristic, let e = (u_1, ..., u_d) and e′ = (u′_1, ..., u′_d) be two hyperedges sharing at least one endpoint where, for an index set I ⊂ {1, ..., d} of cardinality d−1, the vertices u_i, u′_i for all i ∈ I only touch e and/or e′. That is, for each i ∈ I, either u_i = u′_i is a degree-2 vertex, or u_i ≠ u′_i and they are both degree-1 vertices. For j ∉ I, u_j and u′_j are arbitrary vertices. Then, in the current hypergraph, there exists a maximum cardinality matching having either e or e′.

Proof. Let H′ be the current hypergraph at hand, and let j ∉ I be the remaining part index. Let M′ be a maximum cardinality matching in H′. If either e ∈ M′ or e′ ∈ M′, we are done. Otherwise, u_i and u′_i, for all i ∈ I, are unmatched by M′. Furthermore, since M′ is maximal, u_j must be matched by M′ (otherwise M′ could be extended by e). Let e″ ∈ M′ be the hyperedge matching u_j. Then M′ \ {e″} ∪ {e} defines a valid maximum cardinality matching for H′.

Whenever such hyperedges appear, the rules below are applied, in the given order:

• Rule-1: At any time during the heuristic, if a hyperedge e with at least d−1 degree-1 endpoints appears, then instead of a random edge, e is added to the matching and removed from the hypergraph.

• Rule-2: Otherwise, if two hyperedges e and e′ as defined in Lemma 6.2 appear, they are removed from the current hypergraph, together with the endpoints u_i, u′_i for all i ∈ I. Then we consider u_j and u′_j. If u_j and u′_j are distinct, they are merged to create a new vertex u_ju′_j whose hyperedge list is defined as the union of u_j's and u′_j's hyperedge lists. If u_j and u′_j are identical, we rename u_j as u_ju′_j. After obtaining a maximal matching on the reduced hypergraph, depending on the hyperedge matching u_ju′_j, either e or e′ can be used to obtain a larger matching in the current hypergraph.

When Rule-2 is applied, the two hyperedges identified in Lemma 6.2 are removed from the hypergraph, and only the hyperedges containing u_j and/or u′_j have an update in their vertex lists. Since the original hypergraph is d-partite and d-uniform, that update is just a renaming of a vertex in the concerned hyperedges (hence the resulting hypergraph is d-partite and d-uniform).

Although the extended rules usually lead to improved results in comparison to Greedy-H, Karp-Sipser-H still adheres to the d-approximation bound of maximal matchings. For the example given at the beginning of Section 6.2, Karp-Sipser-H generates a maximum cardinality matching by applying the first rule. However, when e_5 = (2, 1, 3) and e_6 = (3, 1, 3) are added to the example, neither of the two rules can be applied. As before, in case e_4 is randomly selected, it alone forms a maximal matching.


6.2.3 Karp-Sipser-H-scaling: Karp-Sipser with scaling

Karp-Sipser-H can be modified to make better decisions when neither of the two rules applies. In this variant, instead of a random selection, we first scale the adjacency tensor of H and obtain an approximate d-stochastic tensor X. We then augment the matching by adding the edge which corresponds to the largest value in X when the rules do not apply. The modified heuristic is summarized in Algorithm 12.

Algorithm 12: Karp-Sipser-H-scaling(H)

Data: A d-partite, d-uniform n_1 × ··· × n_d hypergraph H = (V, E)
Result: A maximal matching M of H
 1: M ← ∅                             /* initially M is empty */
 2: S ← ∅                             /* stack for the merges of Rule-2 */
 3: while H is not empty do
 4:    remove isolated vertices from H
 5:    if ∃ e = (u_1, ..., u_d) as in Rule-1 then
 6:       M ← M ∪ {e}                 /* add e to the matching */
 7:       apply the reduction Rule-1 on H
 8:    else if ∃ e = (u_1, ..., u_d), e′ = (u′_1, ..., u′_d) and I as in Rule-2 then
 9:       let j be the part index with j ∉ I
10:       apply the reduction Rule-2 on H by introducing the vertex u_ju′_j
11:       E′ ← {(v_1, ..., u_ju′_j, ..., v_d) : (v_1, ..., u_j, ..., v_d) ∈ E}   /* memorize the hyperedges of u_j */
12:       S.push(⟨e, e′, u_ju′_j, E′⟩)   /* store the current merge */
13:    else
14:       X ← Scale(adj(H))           /* scale the adjacency tensor of H */
15:       e = (u_1, ..., u_d) ← arg max x_{u_1,...,u_d}   /* find the maximum entry in X */
16:       M ← M ∪ {e}                 /* add e to the matching */
17:       remove all hyperedges of u_1, ..., u_d from E
18:       V ← V \ {u_1, ..., u_d}
19: while S ≠ ∅ do
20:    ⟨e, e′, u_ju′_j, E′⟩ ← S.pop()
21:    if u_ju′_j is not matched by M then
22:       M ← M ∪ {e}
23:    else
24:       let e″ ∈ M be the hyperedge matching u_ju′_j
25:       if e″ ∈ E′ then             /* u_j is the really matched endpoint */
26:          replace u_ju′_j in e″ with u_j
27:          M ← M ∪ {e′}
28:       else                        /* u′_j is the really matched endpoint */
29:          replace u_ju′_j in e″ with u′_j
30:          M ← M ∪ {e}

Our inspiration comes from the d = 2 case [74, 76] described in Chapter 4. By using the scaling method as a preprocessing step and choosing


edges with a probability corresponding to the scaled entries, the edges which are not included in a perfect matching become less likely to be chosen. Unfortunately, for d ≥ 3, there is no equivalent of Birkhoff's theorem, as demonstrated by the following lemma, whose proof can be found in the associated technical report [75, Lemma 3].

Lemma 6.3. For d ≥ 3, there exist extreme points in the set of d-stochastic tensors which are not permutation tensors.

These extreme points can be used to generate other d-stochastic tensors as linear combinations. Due to the lemma above, we do not have the theoretical foundation to imply that hyperedges corresponding to the large entries in the scaled tensor must necessarily participate in a perfect matching. Nonetheless, the entries not in any perfect matching tend to become zero (though this is not guaranteed for all of them). For the worst-case example of Karp-Sipser-H described above, the scaling indeed helps the entries corresponding to e_4, e_5, and e_6 to become zero; see the discussion following the said lemma in the technical report [75].

On a d-partite, d-uniform hypergraph H = (V, E), the Sinkhorn-Knopp algorithm used for scaling operates in iterations, each of which requires O(|E| × d) time. In practice, we perform only a few iterations (e.g., 10-20). Since we can match at most |V|/d hyperedges, the overall run time cost associated with scaling is O(|V| × |E|). A straightforward implementation of the second rule can take quadratic time in the worst case of a large number of repetitive merges with a given vertex; in practice, a more linear time behavior should be observed for the second rule.

6.2.4 Karp-Sipser-H-mindegree: Hypergraph matching via pseudo scaling

In Algorithm 12, applying scaling at every step can be very costly. Here we propose an alternative idea, inspired by the specifics of the Sinkhorn-Knopp algorithm, to reduce the overall cost. In particular, we can mimic the first iteration of the Sinkhorn-Knopp algorithm and use the inverse of the degree of a vertex as its scaling value. This assigns each vertex a weight proportional to the inverse of its degree; hence, it is not an exact iteration of Sinkhorn-Knopp.

With this approach, each hyperedge (i_1, ..., i_d) is associated with the value 1/∏_{j=1}^{d} λ_{i_j}, where λ_{i_j} denotes the degree of the vertex i_j from the jth part. The selection procedure is the same as that of Algorithm 12; that is, the edge with the maximum value is added to the matching. We refer to this algorithm as Karp-Sipser-H-mindegree, as it selects a hyperedge based on a function of the degrees of its vertices. With a straightforward implementation, finding this hyperedge takes O(|E|) time. For better efficiency, the edges can be


stored in a heap; when the degree of a vertex v decreases, the increaseKey heap operation can be called for all its edges.
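The pseudo-scaling value of a hyperedge is cheap to compute from the vertex degrees, as in the sketch below (our own illustration; the heap maintenance mentioned above is omitted and the function name is ours).

    import math
    from collections import defaultdict

    def mindegree_scores(edges, d):
        deg = [defaultdict(int) for _ in range(d)]
        for e in edges:
            for j in range(d):
                deg[j][e[j]] += 1                 # lambda_{i_j}: degree of vertex i_j in part j
        return {e: 1.0 / math.prod(deg[j][e[j]] for j in range(d)) for e in edges}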

6.2.5 Bipartite-reduction: Reduction to the bipartite matching problem

A perfect matching in a d-partite, d-uniform hypergraph H remains perfect when projected onto a (d−1)-partite, (d−1)-uniform hypergraph obtained by removing one of H's dimensions. Matchability in (d−1)-dimensional sub-hypergraphs has been investigated [6] to provide an equivalent of Hall's Theorem for d-partite hypergraphs. These observations lead us to propose a heuristic called Bipartite-reduction. This heuristic tackles the d-partite, d-uniform case by recursively asking for matchings in (d−1)-partite, (d−1)-uniform hypergraphs, and so on, until d = 2.

Let us start with the case where d = 3. Let G = (V_G, E_G) be the bipartite graph with the vertex set V_G = V1 ∪ V2 obtained by deleting V3 from a 3-partite, 3-uniform hypergraph H = (V, E). The edge (u, v) is in E_G if and only if there exists a hyperedge (u, v, z) ∈ E. One can also assign a weight function w(·) to the edges during this step, such as

    w(u, v) = |{z : (u, v, z) ∈ E}| .                                  (6.1)

A maximum weighted (product, sum, etc.) matching algorithm can be used to obtain a matching M_G on G. A second bipartite graph G′ = (V_{G′}, E_{G′}) is then created with V_{G′} = (V1 × V2) ∪ V3 and E_{G′} = {(uv, z) : (u, v) ∈ M_G, (u, v, z) ∈ H}. Under this construction, any matching in G′ corresponds to a valid matching in H. Furthermore, if the weight function (6.1) defined above is used, the following holds (the proof is in the technical report [75, Proposition 4]).

Proposition 6.1. Let w(M_G) = Σ_{(u,v) ∈ M_G} w(u, v) be the size of the matching M_G found in G. Then G′ has w(M_G) edges.

Thus, by selecting a maximum weighted matching M_G and maximizing w(M_G), the largest number of edges will be kept in G′.

For d-dimensional matching, a similar process is followed. First, an ordering i_1, i_2, ..., i_d of the dimensions is defined. At the jth bipartite reduction step, a matching is found between the dimension cluster (i_1, i_2, ..., i_j) and dimension i_{j+1} by similarly solving a bipartite matching instance, where the edge (u_1 · · · u_j, v) exists if and only if the vertices u_1, ..., u_j were matched in the previous steps and there exists a hyperedge (u_1, ..., u_j, v, z_{j+2}, ..., z_d) in H.
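A possible realization of the d = 3 case is sketched below. It is an illustration only: SciPy's linear_sum_assignment stands in for a maximum weighted bipartite matching solver, and the function name and data layout are assumptions made for the example.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def bipartite_reduction_3d(hyperedges, n1, n2, n3):
        """Match V1 against V2 with the weights of Eq. (6.1), then match the
        resulting (u, v) pairs against V3; returns a matching of the hypergraph."""
        W = np.zeros((n1, n2))
        for (u, v, z) in hyperedges:          # w(u, v) = |{z : (u, v, z) in E}|
            W[u, v] += 1
        rows, cols = linear_sum_assignment(W, maximize=True)
        MG = {(u, v) for u, v in zip(rows, cols) if W[u, v] > 0}
        pairs = sorted(MG)
        index = {p: i for i, p in enumerate(pairs)}
        W2 = np.zeros((len(pairs), n3))       # second graph: matched pairs vs. V3
        for (u, v, z) in hyperedges:
            if (u, v) in index:
                W2[index[(u, v)], z] = 1
        rows2, cols2 = linear_sum_assignment(W2, maximize=True)
        return [(pairs[i][0], pairs[i][1], z)
                for i, z in zip(rows2, cols2) if W2[i, z] > 0]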

Unlike the previous heuristics, Bipartite-reduction does not have any approximation guarantee. We state this with the following lemma (the proof is in the technical report [75, Lemma 5]).


Lemma 6.4. If algorithms for the maximum cardinality matching, or for the maximum weighted matching with the suggested edge weights (6.1), are used, then Bipartite-reduction has a worst-case approximation ratio of Ω(n).

6.2.6 Local-Search: Performing local search

A local search heuristic is proposed by Hurkens and Schrijver [119]. It starts from a feasible maximal matching M and performs a series of swaps until no further swap is possible. In a swap, k edges of M are replaced with at least k + 1 new edges from E \ M, so that the cardinality of M increases by at least one. These k edges from M can be replaced with at most d × k new edges. Hence, these edges can be found by a polynomial algorithm enumerating all the possibilities. The approximation guarantee improves with higher k values. Local search algorithms are limited in practice due to their high time complexity: the algorithm might have to examine all (|M| choose k) subsets of M to find a feasible swap at each step. The algorithm by Cygan [56], which achieves a ((d + 1 + ε)/3)-approximation, is based on a different swap scheme, but is also not suited for large hypergraphs. We implement Local-Search as an alternative from the literature to set a baseline.

6.3 Experiments

To understand the relative performance of the proposed heuristics, we conducted a wide variety of experiments with both synthetic and real-life data. The experiments were performed on a computer equipped with an Intel Core i7-7600 CPU and 16 GB of RAM. We investigate the performance of the proposed heuristics Greedy-H, Karp-Sipser-H, Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction. For d = 3, we also consider Local-Search [119], which replaces one hyperedge from a maximal matching M with at least two hyperedges from E \ M to increase the cardinality of M. We did not consider local search schemes for higher dimensions or with better approximation ratios, as they are computationally too expensive. For each hypergraph, we perform ten runs of Greedy-H and Karp-Sipser-H with different random decisions and take the maximum cardinality obtained. Since Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction do not pick hyperedges randomly, we run them only once. We perform 20 steps of the scaling procedure in Karp-Sipser-H-scaling. We refer to the quality of a matching M in a hypergraph H as the ratio of M's cardinality to the size of the smallest vertex part of H.

On random hypergraphs

We perform experiments on two classes of d-partite, d-uniform random hypergraphs where each part has n vertices. The first class contains random k-out hypergraphs, and the second one contains sparse random hypergraphs.


                       k                                        k
           d    d^{d-3}  d^{d-2}  d^{d-1}           d    d^{d-3}  d^{d-2}  d^{d-1}
           2       -       0.87     1.00            2       -       0.84     1.00
           3     0.80      1.00     1.00            3     0.88      1.00     1.00
  n = 10   4     1.00      1.00     1.00   n = 30   4     0.99      1.00     1.00
           5     1.00      1.00     1.00            5               1.00     1.00
           2       -       0.88     1.00            2       -       0.87     1.00
           3     0.85      1.00     1.00            3     0.84      1.00     1.00
  n = 20   4     1.00      1.00     1.00   n = 50   4       *       1.00     1.00
           5     1.00      1.00     1.00            5

Table 6.1: The average maximum matching cardinalities (normalized by n) of five random instances of random k-out, d-partite, d-uniform hypergraphs, for different k, d, and n. There are no runs for k = d^{d-3} with d = 2, and the problems marked with * were not solved within 24 hours.


Random k-out, d-partite, d-uniform hypergraphs

Here we consider the random k-out, d-partite, d-uniform hypergraphs described in Section 6.1. Hence (ignoring the duplicate ones), these hypergraphs have around d × k × n hyperedges. These k-out, d-partite, d-uniform hypergraphs have been recently analyzed in the matching context by Devlin and Kahn [67]. They state in passing that k should be exponential in d for a perfect matching to exist with high probability. The bipartite graph variant of the same problem, i.e., with d = 2, has been extensively studied in the literature [90, 124, 125, 211].

We first investigate the existence of perfect matchings in random k-out, d-partite, d-uniform hypergraphs. For this purpose, we used CPLEX [1] to solve the hypergraph matching problem exactly and found the maximum cardinality of a matching in k-out hypergraphs with k ∈ {d^{d-3}, d^{d-2}, d^{d-1}}, for d ∈ {2, ..., 5} and n ∈ {10, 20, 30, 50}. For each (k, d, n) triple, we created five hypergraphs and computed their maximum cardinality matchings. For k = d^{d-3}, we encountered several hypergraphs with no perfect matching, especially for d = 3. The hypergraphs with k = d^{d-2} were also lacking a perfect matching for d = 2. However, all the hypergraphs we created with k = d^{d-1} had at least one. Based on these results, we experimentally confirm Devlin and Kahn's statement. We also conjecture that d^{d-1}-out random hypergraphs have perfect matchings almost surely. The average maximum matching cardinalities we obtained in this experiment are given in Table 6.1. In this table, we do not have results for k = d^{d-3} for d = 2, and the cases marked with * were not solved within 24 hours.

We now compare the performance of the proposed heuristics on random k-out, d-partite, d-uniform hypergraphs with d ∈ {3, 6, 9} and n ∈ {1000, 10000}. We tested with k values equal to powers of 2, for k ≤ d log d. The results are


summarized in Figure 6.1. For each (k, d, n) triplet, we create ten random instances and present the average performance of the heuristics on them. The x-axis in each plot denotes k, and the y-axis reports the matching cardinality over n. As seen, Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree have the best performance, comfortably beating the other alternatives. For d = 3, Karp-Sipser-H-scaling dominates Karp-Sipser-H-mindegree, but when d > 3, we see that Karp-Sipser-H-mindegree has the best performance. Local-Search performs worse than Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, but better than the others. Karp-Sipser-H performs better than Greedy-H. However, their performances get closer as d increases. This is due to the fact that the conditions for Rule-1 and Rule-2 hold less often for larger d, as there are more restrictions to meet in such cases. Bipartite-reduction has worse performance than the others, and the performance gap grows as d increases. This happens since, at each step, we impose more and more conditions on the edges involved, and there is no chance to recover from bad decisions.

Sparse random d-partite, d-uniform hypergraphs

Here we consider a random d-partite, d-uniform hypergraph H_i created with i × n hyperedges. The parameters used for this experiment are i ∈ {1, 3, 5, 7}, n ∈ {4000, 8000}, and d ∈ {6, 9} (the associated paper presents further experiments with d = 3). Each H_i is created by choosing the vertices of a hyperedge uniformly at random for each dimension. We do not allow duplicate hyperedges. Another random hypergraph H_{i+M} is then obtained by planting a perfect matching into H_i; that is, a random perfect matching is added to the pattern of H_i. We again generate ten random instances for each parameter setting. We do not present results for Bipartite-reduction, as it was always worse than the others. The average quality of the different heuristics on these instances is shown in Figure 6.2. The experiments confirm that Karp-Sipser-H performs consistently better than Greedy-H. Furthermore, Karp-Sipser-H-scaling performs significantly better than Karp-Sipser-H. Karp-Sipser-H-scaling works even better than the local search heuristic [75, Fig. 2], and it is the only heuristic that is capable of finding the planted perfect matchings for a significant number of the runs. In particular, it finds a perfect matching on the H_{i+M}'s in all cases except for when d = 6 and i = 7. Interestingly, Karp-Sipser-H-mindegree outperforms Karp-Sipser-H-scaling on the H_i's, but is dominated on the H_{i+M}'s, where it is the second best performing heuristic.

Evaluating algorithmic choices

We evaluate the use of scaling and the importance of Rule-1 and Rule-2.

Scaling vs. no-scaling

To evaluate and emphasize the contribution of scaling better, we compare the performance of the heuristics on a particular family of d-partite, d-uniform hypergraphs whose bipartite counterparts have been used before as challenging instances for the original Karp-Sipser heuristic [76].


Figure 6.1: The performance of the heuristics on k-out, d-partite, d-uniform hypergraphs with n vertices in each part; the y-axis is the ratio of the matching cardinality to n, whereas the x-axis is k. Panels: (a) d = 3, (b) d = 6, (c) d = 9, each with n = 1000 (left) and n = 10000 (right). The legends list Greedy, Karp-Sipser, Karp-Sipser-scaling, Karp-Sipser-mindegree, Local-search (d = 3 only), and Bipartite-reduction. There is no Local-Search for d = 6 and d = 9. (Plots omitted.)


(a) d = 6

H_i: random hypergraph
        Greedy-H        Karp-Sipser-H    Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
  i    4000   8000      4000   8000         4000   8000              4000   8000
  1    0.31   0.31      0.35   0.35         0.35   0.35              0.36   0.37
  3    0.43   0.43      0.47   0.47         0.48   0.48              0.54   0.54
  5    0.48   0.48      0.52   0.52         0.54   0.54              0.61   0.61
  7    0.52   0.52      0.55   0.55         0.59   0.59              0.66   0.66

H_{i+M}: random hypergraph + a perfect matching
        Greedy-H        Karp-Sipser-H    Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
  i    4000   8000      4000   8000         4000   8000              4000   8000
  1    0.62   0.61      0.90   0.89         1.00   1.00              1.00   1.00
  3    0.51   0.50      0.56   0.55         1.00   1.00              0.99   0.99
  5    0.52   0.52      0.56   0.55         1.00   1.00              0.97   0.97
  7    0.54   0.54      0.57   0.57         0.84   0.80              0.71   0.70

(b) d = 9

H_i: random hypergraph
        Greedy-H        Karp-Sipser-H    Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
  i    4000   8000      4000   8000         4000   8000              4000   8000
  1    0.25   0.24      0.27   0.27         0.27   0.27              0.30   0.30
  3    0.34   0.33      0.36   0.36         0.36   0.36              0.43   0.43
  5    0.38   0.37      0.40   0.40         0.41   0.41              0.48   0.48
  7    0.40   0.40      0.42   0.42         0.44   0.44              0.51   0.51

H_{i+M}: random hypergraph + a perfect matching
        Greedy-H        Karp-Sipser-H    Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
  i    4000   8000      4000   8000         4000   8000              4000   8000
  1    0.56   0.55      0.80   0.79         1.00   1.00              1.00   1.00
  3    0.40   0.40      0.44   0.44         1.00   1.00              0.99   1.00
  5    0.41   0.40      0.43   0.43         1.00   1.00              0.99   0.99
  7    0.42   0.42      0.44   0.44         1.00   1.00              0.97   0.96

Figure 6.2: Performance comparisons on d-partite, d-uniform hypergraphs with n ∈ {4000, 8000}. H_i contains i × n random hyperedges, and H_{i+M} contains an additional perfect matching.

Figure 6.3: A_KS, a challenging bipartite graph instance for Karp-Sipser. (Drawing omitted; the rows are split into R1 and R2, the columns into C1 and C2, with parameter t.)

   t    Greedy-H   Local-Search   Karp-Sipser-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
   2      0.53         0.99           0.53                1.00                     1.00
   4      0.53         0.99           0.53                1.00                     1.00
   8      0.54         0.99           0.55                1.00                     1.00
  16      0.55         0.99           0.56                1.00                     1.00
  32      0.59         0.99           0.59                1.00                     1.00

Table 6.2: Performance of the proposed heuristics on 3-partite, 3-uniform hypergraphs corresponding to X_KS with n = 300 vertices in each part.



Let A_KS be an n × n matrix, discussed before in Figure 4.3 as a challenging instance to Karp-Sipser, and shown in Figure 6.3 for convenience. To adapt this scheme to hypergraphs/tensors, we generate a 3-dimensional tensor X_KS such that the nonzero pattern of each marginal of the third dimension is identical to that of A_KS. Table 6.2 shows the performance of the heuristics (i.e., the matching cardinality normalized with n) for 3-dimensional tensors with n = 300 and t ∈ {2, 4, 8, 16, 32}.

The use of scaling indeed reduces the influence of the misleading hyperedges in the dense block R1 × C1, and the proposed Karp-Sipser-H-scaling heuristic always finds the perfect matching, as does Karp-Sipser-H-mindegree. However, Greedy-H and Karp-Sipser-H perform significantly worse. Furthermore, Local-Search returns a 0.99-approximation in every case, because it ends up in a local optimum.

Rule-1 vs. Rule-2

We finish the discussion on the synthetic data by focusing on Karp-Sipser-H. In the bipartite case, recent work [10] shows that both rules are needed to obtain perfect matchings in a class of graphs. We present a family of hypergraphs for which applying Rule-2 leads to much better performance than applying Rule-1 only.

We use Karp-Sipser-H-R1 to refer to Karp-Sipser-H without Rule-2. As before, we describe the bipartite case first. Let A_RF be an n × n matrix with (A_RF)_{i,j} = 1 for 1 ≤ i ≤ j ≤ n, and (A_RF)_{2,1} = (A_RF)_{n,n-1} = 1. That is, A_RF is composed of an upper triangular matrix and two additional subdiagonal nonzeros. The first two columns and the last two rows have two nonzeros. Assume, without loss of generality, that the first two rows are merged by applying Rule-2 on the first column (which is discarded). Then, in the reduced matrix, the first column (corresponding to the second column in the original matrix) will have one nonzero, and Rule-1 can now be applied, whereupon the first column of the further reduced matrix again has degree one. The process continues similarly until the reduced matrix is a 2 × 2 dense block, where applying Rule-2 followed by Rule-1 yields a perfect matching. If only Rule-1 reductions are allowed, initially no reduction can be applied, and randomly chosen edges will be matched, which negatively affects the quality of the returned matching.

For higher dimensions, we proceed as follows. Let X_RF be a d-dimensional n × · · · × n tensor. We set (X_RF)_{i,j,...,j} = 1 for 1 ≤ i ≤ j ≤ n, and (X_RF)_{1,2,...,2} = (X_RF)_{n,n-1,...,n-1} = 1. By similar reasoning, we see that Karp-Sipser-H with both reduction rules will obtain a perfect matching, whereas Karp-Sipser-H-R1 will struggle. We give some results in Table 6.3 that show the difference between the two. We test for n ∈ {1000, 2000, 4000}


                d = 2               d = 3               d = 6
     n     quality    r/n      quality    r/n      quality    r/n
   1000     0.83      0.45      0.85      0.47      0.80      0.31
   2000     0.86      0.53      0.87      0.56      0.80      0.30
   4000     0.82      0.42      0.75      0.17      0.84      0.45

Table 6.3: Quality of matching and the number r of applications of Rule-1, normalized by n, in Karp-Sipser-H-R1, for hypergraphs corresponding to X_RF. Karp-Sipser-H obtains perfect matchings on these instances.

and d ∈ {2, 3, 6}, and show the quality of Karp-Sipser-H-R1 and the number of times that Rule-1 is applied, normalized by n. We present the best result over 10 runs. As seen in Table 6.3, Karp-Sipser-H-R1 obtains matchings that are about 13–25% worse than those of Karp-Sipser-H. Furthermore, the larger the number of Rule-1 applications, the higher the quality.

On real-life tensors

We also evaluate the performance of the proposed heuristics on some real-life tensors selected from the FROSTT library [203]. The descriptions of the tensors are given in Table 6.4. For nips and uber, a dimension of size 17 and 24, respectively, is dropped, since it restricts the size of the maximum cardinality matching. As described before, a d-partite, d-uniform hypergraph is obtained from a d-dimensional tensor by keeping a vertex for each dimension index and a hyperedge for each nonzero. Unlike the previous experiments, the parts of the hypergraphs obtained from the real-life tensors in Table 6.4 do not have an equal number of vertices. In this case, although the scaling algorithm works along the same lines, its output is slightly different. Let n_i = |V_i| be the cardinality of the ith dimension and n_max = max_{1≤i≤d} n_i be the maximum one. By slightly modifying Sinkhorn-Knopp, for each iteration of Karp-Sipser-H-scaling we scale the tensor such that the marginals in dimension i sum up to n_max/n_i instead of one. The results in Table 6.4 resemble those from the previous sections: Karp-Sipser-H-scaling has the best performance and is slightly superior to Karp-Sipser-H-mindegree. Greedy-H and Karp-Sipser-H are close to each other, and, when it is feasible, Local-Search is better than them. We also see that on these instances Bipartite-reduction exhibits good performance: its performance is at least as good as that of Karp-Sipser-H-scaling for the first three instances, but about 10% worse for the last one.


                                                           Local-   Karp-     Karp-Sipser-H-  Karp-Sipser-H-  Bipartite-
Tensor   d   Dimensions                         Greedy-H   Search   Sipser-H  mindegree       scaling         Reduction
Uber     3   183 × 1,140 × 1,717                   183       183      183        183             183             183
nips     3   2,482 × 2,862 × 14,036              1,847     1,991    1,839      2,005           2,007           2,007
Nell-2   3   12,092 × 9,184 × 28,818             3,913     4,987    3,935      5,100           5,154           5,175
Enron    4   6,066 × 5,699 × 244,268 × 1,176       875        -       875        988           1,001             898

Table 6.4: Four real-life tensors and the performance of the proposed heuristics on the corresponding hypergraphs. There is no result for Local-Search on Enron, as it is four-dimensional. Uber has 1,117,629 nonzeros, nips has 3,101,609 nonzeros, Nell-2 has 76,879,419 nonzeros, and Enron has 54,202,099 nonzeros.

Using an independent set solver

We compare Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree with the idea of reducing the maximum d-dimensional matching problem to that of finding an independent set in the line graph of the given hypergraph. We show that this transformation can obtain good results, but is restricted because line graphs can require too much space.

We use KaMIS [147] to find independent sets in graphs. KaMIS uses a plethora of reductions and a genetic algorithm in order to return high cardinality independent sets. We use the default settings of KaMIS (where the execution time is limited to 600 seconds) and generate the line graphs with efficient sparse matrix–matrix multiplication routines. We run KaMIS, Greedy-H, Karp-Sipser-H-scaling, and Karp-Sipser-H-mindegree on a few hypergraphs from the previous tests. The results are summarized in Table 6.5. The run time of Greedy-H was less than one second in all instances. KaMIS operates in rounds, and we give the quality and the run time of the first round and of the final output. We note that KaMIS considers the time limit only after the first round has been completed. As can be seen, while the quality of KaMIS is always good, and in most cases superior to that of Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, it is also significantly slower (its principle is to deliver high quality results). We also observe that the pseudo scaling of Karp-Sipser-H-mindegree indeed helps to reduce the run time compared to Karp-Sipser-H-scaling.

The line graphs of the real-life instances from Table 6.4 are too large to be handled. We estimate, using known techniques [50], the number of edges in these graphs to range from 1.5 × 10^10 to 4.7 × 10^13, which translates to 126 GB to 380 TB of storage if edges are stored twice, using 4 bytes per edge.


                             line graph   KaMIS Round 1   KaMIS Output             Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
hypergraph                   gen. time    quality  time   quality  time   Greedy-H   quality  time            quality  time
8-out, n = 1000,  d = 3          10        0.98      80    0.99     600     0.86      0.98       1             0.98      1
8-out, n = 10000, d = 3         112        0.98     507    0.99     600     0.86      0.98     197             0.98      1
8-out, n = 1000,  d = 9         298        0.67     798    0.69     802     0.55      0.62       2             0.67      1
n = 8000, d = 3, H_3              1        0.77      16    0.81     602     0.63      0.76       5             0.77      1
n = 8000, d = 3, H_3+M            2        0.89      25    1.00     430     0.70      1.00      11             0.91      1

Table 6.5: Run time (in seconds) and performance comparisons between KaMIS, Greedy-H, and Karp-Sipser-H-scaling. The time required to create the line graphs should be added to KaMIS's overall time.

6.4 Summary, further notes, and references

We have proposed heuristics for the maximum matching problem in d-partite, d-uniform hypergraphs by generalizing existing heuristics for the maximum cardinality matching problem in graphs. The experimental analysis on various hypergraphs (both random and corresponding to real-life tensors) shows the effectiveness and efficiency of the proposed heuristics.

This being a recent study, we have much future work. On the theoretical side, we plan to investigate the stated conjecture that d^{d-1}-out random hypergraphs have perfect matchings almost always, and to analyze the theoretical guarantees of the proposed algorithms. On the practical side, we want to test the use of the proposed hypergraph matching heuristics in the coarsening phase of multilevel hypergraph partitioning tools. For this, we need to adapt the proposed heuristics to the general hypergraph case (not necessarily d-partite and d-uniform). Another future work is the investigation of the weighted hypergraph matching problem and the generalization and use of the proposed heuristics for this problem. Solvers for this problem are computational tools used in the multiple network alignment problem [179], where a matching among the vertices of more than two graphs is computed.

Part III

Ordering matrices and tensors


Chapter 7

Elimination trees and height-reducing orderings

In this chapter, we investigate the minimum height elimination tree problem described in Section 1.1. The formal problem definition is repeated below for convenience.

Minimum height elimination tree: Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The standard elimination tree [200] has been used to expose parallelism in sparse Cholesky, LU, and QR factorizations [5, 8, 116, 160]. Roughly, a set of vertices without ancestor/descendant relations corresponds to a set of independent tasks that can be performed in parallel. Therefore, the total number of parallel steps, or the critical path length, is equal to the height of the tree on an unbounded number of processors [162, 214]. Obtaining an elimination tree with the minimum height for a given matrix is NP-complete [193]. Therefore, heuristic approaches are used. One set of heuristic approaches is to content oneself with the graph partitioning based methods. These methods reduce some other important cost metrics in sparse Cholesky factorization, such as the fill-in and the operation count, while giving a shallow-depth elimination tree [103]. When the matrix is unsymmetric, the elimination tree for LU factorization [81] is useful to expose parallelism. In this respect, the height of the tree again corresponds to the number of parallel steps, or the critical path length, for certain factorization schemes. Here, we develop heuristics to reduce the height of elimination trees for unsymmetric matrices.

Consider the directed graph G_d obtained by replacing every edge of a given undirected graph G by two directed edges pointing in opposite directions. Then, the cycle-rank (or the minimum height of an elimination tree)



of G_d is closely related to the undirected graph parameter tree-depth [180, p. 128], which corresponds to the minimum height of an elimination tree for a symmetric matrix. One can exploit this correspondence in the reverse sense, in an attempt to reduce the height of the elimination tree while producing an ordering for the matrix. That is, one can use the standard undirected graph model corresponding to the matrix A + A^T, where the addition is pattern-wise. This approach of producing an undirected graph for a given directed graph by removing the direction of each edge is used in solvers such as MUMPS [8] and SuperLU [63]. This way, the state of the art graph partitioning based ordering methods can be used. We will compare the proposed heuristics with such methods.

The height of the elimination tree, or the critical path length, is not the only metric for efficient parallel LU factorization of unsymmetric matrices. Depending on the factorization scheme, such as Gaussian elimination with static pivoting (implemented in GESP [157]) or multifrontal based solvers (implemented, for example, in WSMP [104]), other metrics come into play (such as the fill-in, the size of the blocking in GESP based solvers, e.g., SuperLU_MT [64], the maximum/minimum size of the fronts for parallelism in multifrontal solvers, or the communication). Nonetheless, the elimination tree helps to give an upper bound on the performance of the suitable parallel LU factorization schemes under a fairly standard PRAM model (assuming unit size computations, an infinite number of processors, and no memory/communication or scheduling overhead). Furthermore, our overall approach is based on obtaining a desirable form for LU factorization, called the bordered block triangular form (or BBT form for short) [81]. This form simplifies many algorithmic details of different solvers [81, 82] and is likely to control the fill-in, as the nonzeros are constrained to be in the blocks of a BBT form. If one orders a matrix without a BBT form and then obtains this form, the fill-in can increase or decrease [83]. Hence, our BBT based approach is likely to be helpful for LU factorization schemes exploiting the BBT form as well, in controlling the fill-in.

7.1 Background

We define G_s = (V, E_s) as the undirected graph associated with the directed graph G such that E_s = {{v_i, v_j} : (v_i, v_j) ∈ E}. A directed graph G is called complete if (u, v) ∈ E for all vertex pairs (u, v).

Let T = (V, r, ρ) be a rooted tree. An ordering σ : V ↔ {1, . . . , n} is a bijective function. An ordering σ is called topological if σ(ρ(v_j)) > σ(v_j) for all v_j ∈ V. Finally, a postordering is a special form of topological ordering in which the vertices of each subtree are ordered consecutively.

Let A be an n × n unsymmetric matrix with m off-diagonal nonzero entries. The standard directed graph G(A) = (V, E) of A consists of a


Figure 7.1: A sample 9 × 9 unsymmetric matrix A, the corresponding directed graph G(A), and the elimination tree T(A). (Drawings omitted; panels: (a) the matrix A, (b) the directed graph G(A), (c) the elimination tree T(A).)

vertex set V = {1, . . . , n} and an edge set E = {(i, j) : a_ij ≠ 0} of m edges. We adopt the notation G_A instead of G(A) when a subgraph notion is required.

An elimination digraph G_k(A) = (V_k, E_k) is defined for 0 ≤ k < n, with the vertex set V_k = {k + 1, . . . , n} and the edge set

    E_k = E                                                  for k = 0,
    E_k = E_{k-1} ∪ {(i, j) : (i, k), (k, j) ∈ E_{k-1}}      for k > 0.

We also define V^k = {1, . . . , k} for each 0 ≤ k < n.

We assume that A is irreducible and that the LU factorization A = LU exists, where L and U are unit lower and upper triangular, respectively. Since the factors L and U are triangular, their standard directed graph models G(L) and G(U) are acyclic. Eisenstat and Liu [81] define the elimination tree T(A) as follows. Let

    Parent(i) = min{ j : j > i and j ⇒_{G(L)} i ⇒_{G(U)} j },

where a ⇒_G b means that there is a path from a to b in the graph G. Then, Parent(i) corresponds to the parent of vertex i in T(A) for i < n, and for the root n we set Parent(n) = ∞.

We now illustrate some of the definitions. Figure 7.1a displays a sample matrix A with 9 rows/columns and 26 nonzeros. Figure 7.1b shows the standard directed graph representation G(A) of this matrix. Upon elimination of vertex 1, the three edges (3, 2), (6, 2), and (7, 2) are added, and vertex 1 is removed, to build the elimination digraph G_1(A). On the right side, Figure 7.1c shows the elimination tree T(A), whose height is h(T(A)) = 5. Here, the parent of vertex 1 is 3, as G_A[{1, 2}] is not strongly connected but G_A[{1, 2, 3}] is (thanks to the cycle 1 → 2 → 3 → 1). The cross edges of G(A) with respect to the elimination tree T(A) are represented by dashed


directed arrows in Figure 7.1c. The initial order is topological, but it is not a postordering. However, 8, 1, 2, 3, 5, 4, 6, 7, 9 is a postordering of T(A), and it does not respect the cross edges.

7.2 The critical path length and the elimination tree height

We summarize the rowwise and columnwise LU factorization schemes, also known as the ikj and jik variants [70], here to be referred to as row-LU and column-LU, respectively. Then, we show that the critical path lengths of the parallel row-LU and column-LU factorization schemes are equivalent to the elimination tree height whenever the matrix has a certain form.

Let A be an n × n irreducible unsymmetric matrix, and let A = LU, where L and U are unit lower and upper triangular, respectively. Algorithms 13 and 14 display the row-LU and column-LU factorization schemes, respectively. Both schemes update the given matrix A, so that the factor U = [u_ij]_{i≤j} is formed by the upper triangular (including the diagonal) entries of the matrix A at the end. The factor L = [ℓ_ij]_{i≥j} is formed by the multipliers (ℓ_ik = a_ik/a_kk for i > k) and a unit diagonal. We assume a coarse-grain parallelization approach [122, 160]. In Algorithm 13, task i is defined as the task of computing row i of both L and U, and is denoted by T_row(i). Similarly, in Algorithm 14, task j is defined as the task of computing column j of both the L and U factors, and is denoted by T_col(j). The dependencies between the tasks can be represented with the directed acyclic graphs of L^T and U for the row-LU and column-LU factorizations of A, respectively.

Algorithm 13: Row-LU(A)
1: for i = 1 to n do
2:    ▷ task T_row(i)
3:    for each k < i such that a_ik ≠ 0 do
4:       for each j > k such that a_kj ≠ 0 do
5:          a_ij ← a_ij − a_kj × (a_ik / a_kk)

Algorithm 14: Column-LU(A)
1: for j = 1 to n do
2:    ▷ task T_col(j)
3:    for each k < j such that a_kj ≠ 0 do
4:       for each i > k such that a_ik ≠ 0 do
5:          a_ij ← a_ij − a_ik × (a_kj / a_kk)
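For concreteness, a dense Python sketch of the ikj (row-LU) scheme of Algorithm 13 is given below; it mirrors the task structure on a NumPy array and is not the solver code discussed in the text.

    import numpy as np

    def row_lu(A):
        """Row-LU (ikj) sketch: task T_row(i) computes row i of both L and U.
        On exit, the upper triangle of A holds U and the strict lower triangle
        holds the multipliers of L (the unit diagonal of L is implied)."""
        A = A.astype(float).copy()
        n = A.shape[0]
        for i in range(n):                       # task T_row(i)
            for k in range(i):                   # rows k < i are already computed
                if A[i, k] != 0.0:
                    A[i, k] /= A[k, k]           # multiplier l_ik
                    A[i, k+1:] -= A[i, k] * A[k, k+1:]
        return A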


Figure 7.2: The filled matrix L + U (a), and the task dependency graphs G(L^T) for row-LU (b) and G(U) for column-LU (c). (Drawings omitted.)

Figure 7.2 illustrates the mentioned dependencies. Figure 7.2a shows the matrix L + U corresponding to the factors of the sample matrix A given in Figure 7.1a. In this figure, the blue hexagons and green circles are the fill-ins in L and U, respectively. Subsequently, Figures 7.2b and 7.2c show the task dependency graphs for the row-LU and column-LU factorizations, respectively. The blue dashed directed edges in Figure 7.2b and the green dashed directed edges in Figure 7.2c correspond to the fill-ins. All edges in Figures 7.2b and 7.2c show dependencies. For example, in the column-LU factorization, the second column depends on the first one, as the values in the first column of L are needed while computing the second column. This is shown in Figure 7.2c with a directed edge from the first vertex to the second one.

We now consider postorderings of the elimination tree T(A). Let T[k] denote the set of vertices in the subtree of T(A) rooted at k. For a postordering, the order of the siblings is not specified in the general sense. A postordering is called upper bordered block triangular (BBT) if the topological order among the sibling vertices is respected; that is, a vertex k is numbered before a sibling vertex ℓ if there is an edge in G(A) that is directed from T[k] to T[ℓ]. Similarly, we call a postordering lower BBT if the sibling vertices are ordered in reverse topological order. As an example, the initial order of the matrix given in Figure 7.1 is neither upper nor lower BBT; however, the order 8, 6, 1, 2, 3, 5, 4, 7, 9 would be an upper BBT postordering (see Figure 7.1c). The following theorem shows the relation between a BBT postordering and the task dependencies in LU factorizations.

Theorem 7.1 ([81]). Assuming the edges are directed from child to parent, T(A) is the transitive reduction of G(L^T) and G(U) when A is upper and lower BBT ordered, respectively.

We follow this theorem with a corollary, which provides the motivation for reducing the elimination tree height.

Corollary 7.1. The critical path length of the task dependency graph is equal to the elimination tree height h(T(A)) for row-LU factorization when A is upper BBT ordered, and for column-LU factorization when A is lower BBT ordered.


Figure 7.3: The filled matrix of the matrix A = LU in BBT form (a), and the task dependency graph G(L^T) for row-LU (b). (Drawings omitted.)


Figure 7.3 demonstrates the discussed points on the matrix obtained by permuting the matrix A of Figure 7.1a with the upper BBT postordering 8, 6, 1, 2, 3, 5, 4, 7, 9. Figures 7.3a and 7.3b show the filled matrix L + U of the permuted matrix and the corresponding task dependency graph G(L^T) for row-LU factorization, respectively. As seen in Figure 7.3b, there are only tree (solid) and back (dashed) edges with respect to the elimination tree T(A), due to the fact that T(A) is the transitive reduction of G(L^T) (see Theorem 7.1). In the figure, the blue edges refer to fill-in nonzeros. As seen in the figure, there is a path from each vertex to the root, and a critical path 1 → 3 → 5 → 7 → 9 has length 5, which coincides with the height of T(A).

7.3 Reducing the elimination tree height

We propose a top-down approach to reorder a given matrix leading to a short elimination tree. The big lines of the proposed approach form a generalization of the state of the art methods used for Cholesky factorization (which are based on nested dissection [94]). We give the algorithms and discussions for the upper BBT form; they can be easily adapted for the lower BBT form. The proofs of Proposition 7.1 and Theorems 7.2–7.4 are omitted from the presentation and can be found in the original paper [136].


7.3.1 Theoretical basis

Let A be an n × n sparse unsymmetric irreducible matrix and T(A) be its elimination tree. In this case, a BBT decomposition of A can be represented as a permutation of A with a permutation matrix P such that

    A_BBT = \begin{pmatrix}
              A_{11} & A_{12} & \cdots & A_{1K} & A_{1B} \\
                     & A_{22} & \cdots & A_{2K} & A_{2B} \\
                     &        & \ddots & \vdots & \vdots \\
                     &        &        & A_{KK} & A_{KB} \\
              A_{B1} & A_{B2} & \cdots & A_{BK} & A_{BB}
            \end{pmatrix}                                          (7.1)

where A_BBT = PAP^T, the number of diagonal blocks is K > 1, and A_kk is irreducible for each 1 ≤ k ≤ K. The border A_{B*} is called minimal if there is no permutation matrix P′ ≠ P such that P′AP′^T is a BBT decomposition and A′_{B*} ⊂ A_{B*}. That is, the border A_{B*} is minimal if we cannot remove some columns and the corresponding rows from the border, permute them together into a diagonal block, and still have a BBT form. We give the following proposition, aimed to be used for Theorem 7.2.

Proposition 7.1. Let A be in upper BBT form and the border A_{B*} be minimal. The elimination digraph G_κ(A) is complete, where κ = Σ_{k=1}^{K} |A_kk|.

The following theorem provides a basis for reducing the elimination tree height h(T(A)).

Theorem 7.2. Let A be in upper BBT form and the border A_{B*} be minimal. Then

    h(T(A)) = |A_{B*}| + max_{1≤k≤K} h(T(A_kk)).

The theorem says that the critical path length in the parallel LU factorization methods summarized in Section 7.2 can be expressed recursively in terms of the border size and the maximum critical path length over the diagonal blocks. This holds for the row-LU scheme when the matrix A is in an upper BBT form, and for the column-LU scheme when A is in a lower BBT form. Hence, having a small border size and diagonal blocks of similar sizes is likely to lead to a reduced critical path length.

7.3.2 Recursive approach: Permute

We propose a recursive approach to reorder a given matrix so that the elimination tree height is reduced. Algorithm 15 gives an overview of the solution framework. The main procedure, Permute, takes an irreducible unsymmetric matrix A as its input. It calls PermuteBBT to decompose A into a BBT form A_BBT. Upon obtaining such a decomposition, the main procedure calls itself on each diagonal block A_kk recursively. Whenever the


matrix becomes sufficiently small (its size gets smaller than τ), the procedure PermuteBT is called in order to compute a fine BBT decomposition in which the diagonal blocks are of unit size, here to be referred to as a bordered triangular (BT) decomposition, and the recursion terminates.

Algorithm 15: Permute(A)
1: if |A| < τ then                        ▷ recursion terminates
2:    PermuteBT(A)
3: else
4:    A_BBT ← PermuteBBT(A)               ▷ as given in (7.1)
5:    for k ← 1 to K do
6:       Permute(A_kk)                    ▷ subblocks of A_BBT

7.3.3 BBT decomposition: PermuteBBT

Permuting A into a BBT form translates into finding a strong (vertex) separator of G(A), where the strong separator itself corresponds to the border A_{B*}. Regarding Theorem 7.2, we are particularly interested in finding minimal borders.

Let G = (V, E) be a strongly connected directed graph. A strong separator (also called a directed vertex separator [138]) is defined as a vertex set S ⊂ V such that G − S is not strongly connected. Moreover, S is said to be minimal if G − S′ is strongly connected for any proper subset S′ ⊂ S.

The algorithm that we propose to find a strong separator is based on bipartitioning of directed graphs. In the case of undirected graphs, there are typically two kinds of graph partitioning methods, which differ by their separators: edge separators and vertex separators. An edge (or vertex) separator is a set of edges (or vertices) whose removal leaves the graph disconnected.

A BBT decomposition of a matrix may have three or more subblocks (as seen in (7.1)). However, the algorithms we use to find a strong separator utilize 2-way partitioning, which in turn constructs two parts, each of which may potentially have several components. For the sake of simplicity in the presentation, we assume that both parts are strongly connected.

We introduce some notation in order to ease the discussion. For a pair of vertex subsets (U, W) of a directed graph G, we write U ↦ W if there may be some edges from U to W but there is no edge from W to U. Besides, U ↔ W and U ↮ W are used in the natural way; that is, U ↔ W implies that there may be edges in both directions, whereas U ↮ W refers to the absence of any edge between U and W.

We cast our problem of finding strong separators as follows: for a given directed graph G, find a vertex partition Π* = {V1, V2, S}, where S refers to a strong separator and V1, V2 refer to the vertex parts, so that S has small


Figure 7.4: A sample 2-way partition Π_ES = {V1, V2} by an edge separator (a), the bipartite graphs and covers for the directed cuts E_{2↦1} and E_{1↦2} (b), and the matrix view of Π_ES (c). (Drawings omitted.)

cardinality, the part sizes are close to each other, and either V1 ↦ V2 or V2 ↦ V1.

7.3.4 Edge-separator-based method

The first step of the method is to obtain a 2-way partition Π_ES = {V1, V2} of G by an edge separator. We build two strong separators upon Π_ES and choose the one with the smaller cardinality.

For an edge separator Π_ES = {V1, V2}, the directed cut E_{i↦j} is defined as the set of edges directed from Vi to Vj, i.e., E_{i↦j} = {(u, v) ∈ E : u ∈ Vi, v ∈ Vj}, and we say that a vertex set S covers E_{i↦j} if, for any edge (u, v) ∈ E_{i↦j}, either u ∈ S or v ∈ S, for i ≠ j ∈ {1, 2}.

Initially, we have V1 ↔ V2. Our goal is to find a subset of vertices S ⊂ V so that either V′1 ↦ V′2 or V′2 ↦ V′1, where V′1 and V′2 are the sets of remaining vertices in V1 and V2 after the removal of the vertices in S, that is, V′1 = V1 − S and V′2 = V2 − S. The following theorem serves the purpose of finding such a subset S.

Theorem 7.3. Let G = (V, E) be a directed graph, Π_ES = {V1, V2} be an edge separator of G, and S ⊆ V. Then V′_i ↦ V′_j if and only if S covers E_{j↦i}, for i ≠ j ∈ {1, 2}.

The problem of finding a strong separator S can thus be encoded as finding a minimum vertex cover in the bipartite graph whose edge set is populated only by the edges of E_{j↦i}. Algorithm 16 gives the pseudocode. As seen in the algorithm, we find two separators, S_{1↦2} and S_{2↦1}, which cover E_{2↦1}

and E_{1↦2}, and result in V′1 ↦ V′2 and V′2 ↦ V′1, respectively. At the end, we take the strong separator with the smaller cardinality.
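The cover computation can be sketched with off-the-shelf bipartite matching tools. The following Python snippet uses networkx and König's theorem to turn a maximum matching of the cut's bipartite graph into a minimum vertex cover; the graph construction and names are illustrative, not the actual FindMinCover implementation.

    import networkx as nx

    def min_cover_of_directed_cut(cut_edges):
        """Given the edges of a directed cut (e.g. E_{2->1}) as (u, v) pairs,
        return a minimum vertex cover of the induced bipartite graph."""
        if not cut_edges:
            return set()
        G = nx.Graph()
        tails = {("t", u) for (u, v) in cut_edges}   # sources of cut edges
        heads = {("h", v) for (u, v) in cut_edges}   # targets of cut edges
        G.add_nodes_from(tails, bipartite=0)
        G.add_nodes_from(heads, bipartite=1)
        G.add_edges_from((("t", u), ("h", v)) for (u, v) in cut_edges)
        matching = nx.bipartite.maximum_matching(G, top_nodes=tails)
        cover = nx.bipartite.to_vertex_cover(G, matching, top_nodes=tails)
        return {v for (_, v) in cover}               # original vertex ids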

In this method, how the edge separator is found matters. The objective for Π_ES is to minimize the cutsize Ψ(Π_ES), which is typically defined over


Algorithm 16: FindStrongVertexSeparatorES(G)
1: Π_ES ← GraphBipartiteES(G)                  ▷ Π_ES = {V1, V2}
2: Π_{1↦2} ← FindMinCover(G, E_{2↦1})
3: Π_{2↦1} ← FindMinCover(G, E_{1↦2})
4: if |S_{1↦2}| < |S_{2↦1}| then
5:    return Π_{1↦2} = {V′1, V′2, S_{1↦2}}      ▷ V′1 ↦ V′2
6: else
7:    return Π_{2↦1} = {V′1, V′2, S_{2↦1}}      ▷ V′2 ↦ V′1

the cut edges as follows

    Ψ_es(Π_ES) = |E_c| = |{(u, v) ∈ E : π(u) ≠ π(v)}|,                 (7.2)

where u ∈ V_{π(u)} and v ∈ V_{π(v)}, and E_c refers to the set of cut edges. We note that the use of the above metric in Algorithm 16 is a generalization of a method used for symmetric matrices [161]. However, Ψ_es(Π_ES) is loosely related to the size of the cover found in order to build the strong separator. A more suitable metric would be the number of vertices from which a cut edge emanates, or the number of vertices to which a cut edge is directed. Such a cutsize metric can be formalized as follows:

    Ψ_cn(Π_ES) = |V_c| = |{v ∈ V : ∃(u, v) ∈ E_c}|.                    (7.3)

The following theorem shows the close relation between Ψ_cn(Π_ES) and the size of the strong separator S.

Theorem 7.4. Let G = (V, E) be a directed graph, Π_ES = {V1, V2} be an edge separator of G, and Π* = {V1, V2, S} be the strong separator built upon Π_ES. Then |S| ≤ Ψ_cn(Π_ES)/2.

Apart from the theorem, we also note that optimizing (7.3) is likely to yield bipartite graphs that are easier to cover than those resulting from optimizing (7.2). This is because all cut edges are treated the same by (7.2), whether they emanate from a single vertex or from many different vertices, whereas those emanating from a single vertex are preferred by (7.3).
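The two metrics are easy to compute for a given 2-way partition; the small sketch below (with an illustrative edge-list and partition-array interface) contrasts them.

    def cutsizes(edges, part):
        """Compute Psi_es of Eq. (7.2) (number of cut edges) and Psi_cn of
        Eq. (7.3) (number of distinct target vertices of cut edges) for a 2-way
        partition of a directed graph; part[v] is 1 or 2 for each vertex v."""
        cut_edges = [(u, v) for (u, v) in edges if part[u] != part[v]]
        psi_es = len(cut_edges)
        psi_cn = len({v for (_, v) in cut_edges})
        return psi_es, psi_cn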

We note that the cutsize metric given in (7.3) can be modeled using hypergraphs. The hypergraph that models this cutsize metric (called the column-net hypergraph model [39]) has the same vertex set as the directed graph G = (V, E), and a hyperedge for each vertex v ∈ V that connects all u ∈ V such that (u, v) ∈ E. Any partition of the vertices of this hypergraph that minimizes the number of hyperedges in the cut induces a vertex partition on G with an edge separator that minimizes the cutsize according to the metric (7.3). A similar statement holds for another hypergraph constructed by reversing the edge directions (called the row-net model [39]).


Figure 7.5: Π_{2↦1}, the strong separator built upon S_{2↦1}, the cover of E_{1↦2} given in Figure 7.4b (a); the matrix A_BBT obtained by Π_{2↦1} (b); and the elimination tree T(A_BBT) (c). (Drawings omitted.)

Figures 7.4 and 7.5 give an example of finding a strong separator based on an edge separator. In Figure 7.4a, the red curve indicates the edge separator Π_ES = {V1, V2}, where V1 = {4, 6, 7, 8, 9} (the green vertices) and V2 = {1, 2, 3, 5} (the blue vertices). The cut edges of E_{2↦1} = {(2, 7), (5, 4), (5, 9)} and E_{1↦2} = {(6, 1), (6, 3), (7, 1)} are shown in blue and green, respectively. We see two covers for the bipartite graphs corresponding to each of the directed cuts E_{2↦1} and E_{1↦2} in Figure 7.4b. As seen in Figure 7.4c, these directed cuts correspond to the nonzeros of the off-diagonal blocks in the matrix view. Figure 7.5a gives the strong separator Π_{2↦1} built upon Π_ES by the cover S_{2↦1} = {1, 6} of the bipartite graph corresponding to E_{1↦2}, which is given on the right of Figure 7.4b. Figures 7.5b and 7.5c depict the matrix view of the strong separator Π_{2↦1} and the corresponding elimination tree, respectively. It is worth mentioning that, in the matrix view, V2 precedes V1, since we use Π_{2↦1} instead of Π_{1↦2}. As seen in Figure 7.5c, since the permuted matrix (in Figure 7.5b) has an upper BBT postordering, all cross edges of the elimination tree are directed from left to right.

7.3.5 Vertex-separator-based method

A vertex separator is a strong separator restricted so that V1 ↮ V2. This method is based on refining the separator so that either V1 ↦ V2 or V2 ↦ V1. The vertex separator can be viewed as a 3-way vertex partition Π_VS = {V1, V2, S}. As in the previous method based on edge separators, we build two strong separators upon Π_VS and select the one having the smaller cardinality.

The criterion used to find an initial vertex separator can be formalized as

    Ψ_vs(Π_VS) = |S|.                                                  (7.4)

Algorithm 17 gives the overall process to obtain a strong separator. This


algorithm follows Algorithm 16 closely. Here, the procedure that refines the initial vertex separator, namely RefineSeparator, works as follows. It visits the separator vertices once and moves the vertex at hand, if possible, to V1 or V2, so that V′_i ↦ V′_j after the movement, where i ↦ j is specified as either 1 ↦ 2 or 2 ↦ 1.

Algorithm 17: FindStrongVertexSeparatorVS(G)
1: Π_VS ← GraphBipartiteVS(G)                  ▷ Π_VS = {V1, V2, S}
2: Π_{1↦2} ← RefineSeparator(G, Π_VS, 1 ↦ 2)
3: Π_{2↦1} ← RefineSeparator(G, Π_VS, 2 ↦ 1)
4: if |S_{1↦2}| < |S_{2↦1}| then
5:    return Π_{1↦2} = {V′1, V′2, S_{1↦2}}      ▷ V′1 ↦ V′2
6: else
7:    return Π_{2↦1} = {V′1, V′2, S_{2↦1}}      ▷ V′2 ↦ V′1

7.3.6 BT decomposition: PermuteBT

The problem of permuting A into a BT form can be encoded as finding a feedback vertex set, which is defined as a subset of vertices whose removal makes a directed graph acyclic. In the matrix view, a feedback vertex set corresponds to the border A_{B*} when the matrix is permuted into a BBT form (7.1) in which each A_ii, for i = 1, . . . , K, is of unit size.

The problem of finding a feedback vertex set of size less than a given value is NP-complete [126], but fixed-parameter tractable [44]. In the literature, there are greedy algorithms that perform well in practice [152, 184]. In this work, we adopt the algorithm proposed by Levy and Low [152].
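As an illustration only, the sketch below shows a plain greedy feedback vertex set heuristic on a networkx digraph; it is not the Levy-Low algorithm adopted in this work, merely a simple stand-in that conveys the idea of removing vertices until the remaining graph becomes acyclic.

    import networkx as nx

    def greedy_feedback_vertex_set(G):
        """Generic greedy sketch (not Levy-Low): repeatedly remove a vertex that
        lies on a cycle and has the largest total degree until the directed graph
        G (a networkx DiGraph) becomes acyclic; the removed vertices form a
        feedback vertex set and would be ordered last in their diagonal block."""
        H = G.copy()
        fvs = []
        while not nx.is_directed_acyclic_graph(H):
            # vertices that still lie on a cycle of length > 1
            cyclic = set().union(*(c for c in nx.strongly_connected_components(H)
                                   if len(c) > 1))
            if not cyclic:                        # only self-loop cycles remain
                cyclic = {u for u, _ in nx.selfloop_edges(H)}
            v = max(cyclic, key=lambda u: H.in_degree(u) + H.out_degree(u))
            fvs.append(v)
            H.remove_node(v)
        return fvs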

For example, in Figure 7.5, we observe that the sets {5} and {8} are used as the feedback vertex sets for V2 and V1, respectively, which is the reason those rows/columns are permuted last in their corresponding subblocks.

7.4 Experiments

We investigate the performance of the proposed heuristic, here to be referred to as BBT, in terms of the elimination tree height and the ordering time. We have three variants of BBT: (i) BBT-es, based on minimizing the edge cut (7.2); (ii) BBT-cn, based on minimizing the hyperedge cut (7.3); (iii) BBT-vs, based on minimizing the vertex separator (7.4). As a baseline, we used MeTiS [130] on the symmetrized matrix A + A^T. MeTiS uses state of the art methods to order a symmetric matrix and leads to short elimination trees. Its use in the unsymmetric case on A + A^T is a common practice. BBT is implemented in C and compiled with mex of MATLAB, which uses gcc 4.4.5. We used the MeTiS library for graph partitioning according to the cutsize metrics (7.2)


and (7.4). For the edge cut metric (7.2), we input to MeTiS the graph of A + A^T, where the weight of the edge {i, j} is 2 (if both a_ij, a_ji ≠ 0) or 1 (if either a_ij ≠ 0 or a_ji ≠ 0, but not both). For the vertex separator metric (7.4), we input the graph of A + A^T (with no weights) to the corresponding MeTiS procedure. We implemented the cutsize metric given in (7.3) using PaToH [40] (with the default settings) for hypergraph partitioning, using the column-net model (we did not investigate the use of the row-net model). We performed the experiments on an AMD Opteron 8356 processor with 32 GB of RAM.

In the experiments, for each matrix, we detected dense rows and columns, where a row/column is deemed to be dense if it has at least 10√n nonzeros. We applied MeTiS and BBT on the submatrix of non-dense rows/columns and produced the final ordering by appending the dense rows/columns after the permutations obtained for the submatrices. The performance results are computed as the median of eleven runs.

Recall from Section 7.3.2 that, whenever the size of the input matrix is smaller than the parameter τ, the recursion terminates in the BBT variants and a feedback-vertex-set-based BT decomposition is applied. We first investigate the effect of τ in order to suggest a concrete approach. For this purpose, we build a dataset of matrices, called UFL below. The matrices in UFL are from the SuiteSparse Matrix Collection [59] and are chosen with the following properties: square, 5K ≤ n ≤ 100K, nnz ≤ 1M, real, and not of type Graph. These criteria enable an automatic selection of matrices from different application domains without having to specify the matrices individually. Other criteria could be used; ours help to select a large set of matrices, each of which is not too small and not too large to be handled easily within MATLAB. We exclude matrices recorded as Graph in the SuiteSparse collection because most of these matrices have nonzeros from a small set of integers (for example, {−1, 1}) and are reducible. We further detect matrices with the same nonzero pattern in the UFL dataset and keep only one matrix per unique nonzero pattern. We then preprocess the matrices by applying MC64 [72] in order to permute the columns so that the permuted matrix has a maximum diagonal product. Then, we identify the irreducible blocks using the Dulmage-Mendelsohn decomposition [77], permute the matrices to block diagonal form, and delete all entries that lie in the off-diagonal blocks. This type of preprocessing is common in similar studies [83] and corresponds to preprocessing for numerics and reducibility. After this preprocessing, we discard the matrices whose total (remaining) number of nonzeros is less than twice their size, or whose pattern symmetry is greater than 90%, where the pattern symmetry score is defined as (nnz(A) − n)/(nnz(A + A^T) − n) − 0.5. We ended up with 99 matrices in the UFL dataset.

In Table 7.1, we summarize the performance of the BBT variants, normalized to that of MeTiS, on the UFL dataset, for the recursion termination parameter τ ∈ {3, 50, 250}. The table presents the geometric means of the per-


               Tree height (h(A))           Ordering time
      τ       -es     -cn     -vs         -es     -cn     -vs
      3      0.81    0.69    0.68        1.68    3.53    2.58
     50      0.84    0.71    0.72        1.26    2.71    1.91
    250      0.93    0.82    0.86        1.12    2.12    1.43

Table 7.1: Statistical indicators of the performance of the BBT variants on the UFL dataset, normalized to that of MeTiS, with the recursion termination parameter τ ∈ {3, 50, 250}.

Figure 7.6: BBT-vs / MeTiS performance on the UFL dataset: (a) normalized tree height h(A) and (b) normalized ordering time, plotted against the pattern symmetry. The dashed green lines mark the geometric means of the ratios. (Plots omitted.)

formance results over all matrices of the dataset. As seen in the table, for smaller values of τ, all BBT variants achieve shorter elimination trees, at the cost of an increased processing time to order the matrix. This is because of the fast, but probably not of very high quality, local ordering heuristic; as discussed before, this heuristic does not directly consider the height of the subtree corresponding to the submatrix as an optimization goal. Other τ values could also be tried, but we find τ = 50 to be a suitable choice in general, as it performs close to τ = 3 in quality and close to τ = 250 in efficiency.

From Table 7.1, we observe that the BBT-vs method performs in the middle of the other two BBT variants, but close to BBT-cn in terms of the tree height. All three variants improve the height of the tree with respect to MeTiS. Specifically, at τ = 50, BBT-es, BBT-cn, and BBT-vs obtain improvements of 16%, 29%, and 28%, respectively. The graph based methods BBT-es and BBT-vs run slower than MeTiS, as they contain the additional overheads of obtaining the covers and refining the separators of A + A^T for A. For example, at τ = 50, the overheads result in 26% and 91% longer ordering times for BBT-es and BBT-vs, respectively. As expected, the hypergraph based method BBT-cn is much slower than the others (for example, 171% longer ordering time with respect to MeTiS with τ = 50). Since we


identify BBT-vs as fast and of high quality (it is close to the best BBT variant in terms of the tree height and the fastest one in terms of the run time), we display its individual performance results in detail, with τ = 50, in Figure 7.6. In Figures 7.6a and 7.6b, each individual point represents the ratio BBT-vs / MeTiS in terms of the tree height and the ordering time, respectively. As seen in Figure 7.6a, BBT-vs reduces the elimination tree height on 85 out of the 99 matrices in the UFL dataset and achieves a reduction of 28% in terms of the geometric mean with respect to MeTiS (the green dashed line in the figure represents the geometric mean). We observe that the pattern symmetry and the reduction in the elimination tree height correlate highly after a certain pattern symmetry score. More precisely, the correlation coefficient between the pattern symmetry and the normalized tree height is 0.6187 for those matrices with pattern symmetry greater than 20%. This is, of course, in agreement with the fact that LU factorization methods exploiting the unsymmetric pattern have much to gain on matrices with a lower pattern symmetry score, and less to gain otherwise, compared with the alternatives. On the other hand, as seen in Figure 7.6b, BBT-vs performs consistently slower than MeTiS, with an average slowdown of 91%.

The fill-in generally increases [83] when a matrix is reordered into a BBT form, with respect to the original ordering using the elimination tree. We noticed that BBT-vs almost always increases the number of nonzeros in L (by 157% with respect to MeTiS in a set of matrices). However, the number of nonzeros in U almost always decreases (by 15% with respect to MeTiS

in the same dataset). This observation may be exploited during Gaussian elimination, where L is not stored.

7.5 Summary, further notes, and references

We investigated the elimination tree model for unsymmetric matrices as a means of capturing dependencies among the tasks in the row-LU and column-LU factorization schemes. Specifically, we focused on permuting a given unsymmetric matrix to obtain an elimination tree with reduced height, in order to expose a higher degree of parallelism by minimizing the critical path length in parallel LU factorization schemes. Based on the theoretical findings, we proposed a heuristic, which orders a given matrix to a bordered block diagonal form with a small border size and blocks of similar sizes, and then locally orders each block. We presented three variants of the proposed heuristic. These three variants achieved, on average, 24%, 31%, and 35% improvements with respect to a common method of using MeTiS (a state of the art tool used for Cholesky factorization) on a small set of matrices. On a larger set of matrices, the performance improvements were 16%, 28%, and 29%. The best performing variant is about 2.71 times slower than MeTiS on the larger set, while the others are slower by factors of 1.26 and 1.91.


Although numerical methods implementing the LU factorization with the help of the standard elimination tree are well developed, those implementing the LU factorization using the elimination tree discussed in this chapter are lacking. A potential implementation will have to handle many scheduling issues, and therefore the use of a runtime system seems adequate. On the combinatorial side, we believe that there is a need for a direct method to find strong separators, which would give better performance in terms of the quality and the run time.

Chapter 8

Sparse tensor ordering

In this chapter, we investigate the tensor ordering problem described in Section 1.5. The formal problem definition is repeated below for convenience.

Tensor ordering: Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

We propose two ordering algorithms for this problem. The first proposed heuristic, BFS-MCS, is a breadth first search (BFS)-like approach based on the maximum cardinality search family; it works on the hypergraph representation of the given tensor. The second proposed heuristic, Lexi-Order, is an extension of the doubly lexical ordering of matrices to tensors. We show the effects of these schemes on the MTTKRP performance when the tensors are stored in three existing sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). The associated paper [156] has further contributions with respect to parallelization and includes results on parallel implementations of MTTKRP with the three formats, as well as runs with Candecomp/Parafac decomposition algorithms.

8.1 Three sparse tensor storage formats

We consider the three state-of-the-art tensor formats (COO, CSF, HiCOO), which are all for general, unstructured sparse tensors.

Coordinate format (COO)

The coordinate (COO) format is the simplest, yet arguably the most popular, format by far. It stores each nonzero value along with all of its position indices, as shown in Figure 8.1a; i, j, k are the indices (inds) of the nonzeros, whose values are stored in the val array.



Figure 8.1: Three common formats for sparse tensors: (a) COO, (b) CSF, (c) HiCOO (the example is originally from [155]).

Compressed sparse fibers (CSF)

Compressed Sparse Fiber (CSF) is a hierarchical, fiber-centric format that effectively generalizes the CSR matrix format to tensors. An example of its representation appears in Figure 8.1b. Conceptually, CSF organizes the nonzero indices into a tree: each level corresponds to a tensor mode, and each nonzero is a path from the root to a leaf. The indices of CSF are stored in cinds, with the pointers stored in cptrs to indicate the locations of the nonzeros at the next level. Since cptrs needs to cover the range up to nnz, we use βint or βlong accordingly for differently sized tensors.

Hierarchical coordinate format (HiCOO)

The Hierarchical Coordinate (HiCOO) [155] format derives from the COO format but improves upon it by compressing the indices in units of sparse tensor blocks. HiCOO stores a sparse tensor in a sparse-blocked pattern with a pre-specified block size B, meaning in B × · · · × B blocks. It represents every block by compactly storing its nonzero triples using fewer bits. Figure 8.1c shows the example tensor given 2 × 2 × 2 blocks (B = 2). For a third-order tensor, bi, bj, bk are block indices indexing tensor blocks; ei, ej, ek are element indices indexing nonzeros within a tensor block. A bptr array stores the pointers to each block's beginning location, and val saves all the nonzero values. The block ratio αb of a HiCOO representation of a tensor is defined as the number of blocks divided by the number of nonzeros in the tensor, and cb denotes the average slice size per tensor block. These two key parameters describe the effectiveness of storing a tensor in the HiCOO format.
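To make the blocked storage concrete, the following is a minimal sketch, not the actual HiCOO implementation, of how one COO index triple is split into block and element indices when the block size B is a power of two; the narrow element-index type mirrors the compression idea described above, and all names are illustrative.

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: split a COO index triple (i, j, k) into HiCOO-style
   block indices (i/B, j/B, k/B) and element indices (i%B, j%B, k%B),
   with B = 128 a power of two so that shifts and masks can be used and
   the element indices fit into 8 bits. */
enum { LOG_B = 7, B = 1 << LOG_B };

typedef struct {
    uint32_t bi, bj, bk; /* block indices   */
    uint8_t  ei, ej, ek; /* element indices */
} hicoo_index;

static hicoo_index split_index(uint32_t i, uint32_t j, uint32_t k)
{
    hicoo_index h;
    h.bi = i >> LOG_B;  h.ei = (uint8_t)(i & (B - 1));
    h.bj = j >> LOG_B;  h.ej = (uint8_t)(j & (B - 1));
    h.bk = k >> LOG_B;  h.ek = (uint8_t)(k & (B - 1));
    return h;
}

int main(void)
{
    hicoo_index h = split_index(300, 5, 129);
    printf("block (%u, %u, %u), element (%u, %u, %u)\n",
           h.bi, h.bj, h.bk, (unsigned)h.ei, (unsigned)h.ej, (unsigned)h.ek);
    return 0;
}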

Comparing the three tensor formats: HiCOO and COO treat every mode equally and do not assume any mode order; they preserve the mode-generic orientation [155]. CSF has a strong mode-specificity, since its tree structure implies a mode ordering for the enumeration of nonzeros. Besides, compared with the COO format, HiCOO and CSF save storage space and memory footprint, while generally achieving higher performance.


Figure 8.2: A sparse tensor and the associated hypergraph: (a) the tensor, (b) the hypergraph.


8.2 Problem definition and solutions

Our objective is to improve the performance of Mttkrp, and therefore of CPD algorithms, based on the three tensor storage formats described above. We will achieve this objective by ordering (or relabeling) the indices in one or more modes of the input tensor so as to improve the data locality in the tensor and the factor matrices of an Mttkrp operation. Take for example two nonzeros (i2, j2, k1) and (i2, j2, k2) of the third-order tensor in Figure 8.2. Relabeling k1 and k2 to two other indices close to each other will potentially improve cache hits for both the tensor and the corresponding rows of the factor matrices in the Mttkrp operation. Besides, it will also influence the tensor storage for some formats (an analysis is given in Section 8.2.3).

We propose two ordering heuristics. The aim of the heuristics is to arrange the nonzeros close to each other in all modes. If we were to look at matrices, this would correspond to reordering the rows and columns so that all nonzeros are clustered around the diagonal. This way, nonzeros in a row or column would be close to each other, and any blocking (by imposing fixed-size blocks) would have nonzero blocks only around the diagonal. The proposed heuristics are based on these observations and try to obtain a similar behavior for tensors. The output of a reordering algorithm is the set of permutations, one per mode, used to relabel the tensor indices.
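As an illustration of how these permutations are used, the sketch below relabels the indices of a tensor held in a simple COO container; the container and the names are illustrative and not tied to a particular library.

/* Illustrative COO container: inds[m][p] is the mode-m index of the p-th
   nonzero, val[p] its value. */
typedef struct {
    int     nmodes;  /* N                          */
    long    nnz;     /* number of nonzeros         */
    int   **inds;    /* nmodes arrays of length nnz */
    double *val;     /* nonzero values             */
} coo_tensor;

/* Relabel every index of every mode: perm[m][old index] = new index.
   After this pass the tensor is typically re-sorted (COO/CSF) or
   re-blocked (HiCOO) so that the storage reflects the new labels. */
void relabel(coo_tensor *X, int *const *perm)
{
    for (int m = 0; m < X->nmodes; m++)
        for (long p = 0; p < X->nnz; p++)
            X->inds[m][p] = perm[m][X->inds[m][p]];
}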

8.2.1 BFS-MCS

BFS-MCS is a breadth first search (BFS)-like heuristic approach based on the maximum cardinality search family [207]. We first construct the d-partite, d-uniform hypergraph for a sparse tensor as described in Section 1.5. Recall that in this hypergraph the vertices are the tensor indices in all modes, and the hyperedges represent the nonzero entries. Figure 8.2 shows the hypergraph associated with a sparse tensor. Here, the vertices are blank circles, and the hyperedges are represented by grouping vertices.



For an Nth-order sparse tensor, we need to find permutations for the N modes. We determine the permutation for a given mode n (perm_n) of a tensor X ∈ R^{I_1 × · · · × I_N} as follows. Suppose some of the indices of mode n are already ordered. Then, BFS-MCS picks the next to-be-ordered index as the one with the strongest connection to the currently ordered index set. In case of ties, it selects the index with the smallest number of nonzeros in the corresponding sub-tensor. Intuitively, a stronger connection represents more common indices in the modes other than n among an unordered vertex and the already ordered ones. This means that more data from the factor matrices can be reused in Mttkrp, if found in cache.

Algorithm 18: BFS-MCS ordering based on maximum cardinality search for a given mode

Data: An Nth-order sparse tensor X ∈ R^{I_1 × · · · × I_N}, its hypergraph G = (V, E), mode n
Result: Permutation perm_n

1:  w_p(i) ← 0, for i = 1, . . . , I_n
2:  w_s(i) ← −|X(. . . , i, . . .)|, for i = 1, . . . , I_n   ▷ number of nonzeros with i in mode n
3:  Build a max-heap H_n of the mode-n indices, with w_p and w_s as the primary and secondary keys, respectively
4:  m_V(v) ← 0, for all v ∈ V
5:  for i_n = 1, . . . , I_n do
6:      v_n^(0) ← GetHeapMax(H_n)
7:      perm_n(v_n^(0)) ← i_n
8:      m_V(v_n^(0)) ← 1
9:      for e_n ∈ hyperedges of v_n^(0) do
10:         for v ∈ e_n such that v is not in mode n and m_V(v) = 0 do
11:             m_V(v) ← 1
12:             for e ∈ hyperedges of v with e ≠ e_n do
13:                 v_n ← vertex in mode n of e
14:                 if inHeap(v_n) then
15:                     w_p(v_n) ← w_p(v_n) + 1
16:                     heapUpdateKey(H_n, w_p(v_n))
17: return perm_n

The process is implemented for all mode-n indices by maintaining a max-heap with two keys. The primary key of an index is the number of connections to the currently ordered indices. The secondary key is the number of nonzeros in the corresponding (N−1)th-order sub-tensor, e.g., X(. . . , i_n, . . .), where smaller secondary key values signify higher priority. The secondary keys are static. Algorithm 18 gives the pseudocode of BFS-MCS. The heap H_n is initially constructed (Line 3) according to the secondary keys of the mode-n indices. For each mode-n index (e.g., i_n) of this max-heap, BFS-MCS traverses all connected tensor indices in the modes except n, calculates the number of connections of the mode-n indices, and then updates the heap using the primary and secondary keys. Let m_V denote a size-|V| array that tracks whether a vertex has been visited; its entries are initialized to zeros (unvisited) and changed to 1 once the corresponding vertices are visited. We record a mode-n index v_n^(0), obtained from the max-heap H_n, in the permutation array perm_n. The algorithm visits all its hyperedges (i.e., all nonzeros with index i_n; Line 9) and their connected and unvisited vertices from the other (N−1) modes (Line 10). For these vertices, it again visits their hyperedges (e in Line 12) and then checks whether the connected vertices in mode n (v_n) are in the heap (Line 14). If so, the primary key (connectivity) of v_n is increased by 1, and the max-heap H_n is updated. To summarize, this algorithm locates the sub-tensors of the neighbor vertices of v_n^(0) to increase the primary key values of the encountered mode-n indices in H_n.

The BFS-MCS heuristic has low time complexity. We explain it using tensor terms. The innermost heap update operation (Line 15) costs O(log I_n). Each sub-tensor is visited only once, with the help of the marker array m_V. For all the sub-tensors in one tensor mode, the heap update is performed only for the nnz nonzeros (hyperedges). Overall, the time complexity of BFS-MCS is O(N nnz log I_n) for an Nth-order tensor with nnz nonzeros, when computing perm_n for mode n.

BFS-MCS does not exactly capture the memory access pattern of an Mttkrp. It treats the contributions from the indices of all modes except n equally in the connectivity; thus, the connectivity might not match the actual data reuse. Also, it uses a greedy strategy to determine the next-level vertices, which could miss the optimal global orderings, as is common with greedy heuristics for hard problems.

8.2.2 Lexi-Order

A lexical ordering of an integer vector is the standard dictionary ordering of its elements, defined as follows. Given two equal-length vectors x and y, we say x ≤ y iff either (i) all elements are the same, i.e., x = y, or (ii) there exists an index j such that x(j) < y(j) and x(i) = y(i) for all 0 ≤ i < j. Consider two 0/1 vectors x = (1, 0, 1, 0) and y = (1, 1, 0, 0); we have x ≤ y because x(1) < y(1) and x(i) = y(i) for all 0 ≤ i < 1.
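Written directly from this definition, a comparison routine (illustrative) looks as follows; it returns −1, 0, or 1 when x is lexically smaller than, equal to, or greater than y.

/* Lexical comparison of two equal-length integer vectors, reading the
   elements from left to right: the first differing position decides. */
int lex_cmp(const int *x, const int *y, int len)
{
    for (int i = 0; i < len; i++)
        if (x[i] != y[i])
            return x[i] < y[i] ? -1 : 1;
    return 0; /* all elements equal */
}

For the example above, lex_cmp applied to x = (1, 0, 1, 0) and y = (1, 1, 0, 0) stops at position 1 and returns −1.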

Lubiw [167] associates a vector v of size I · J with an I × J matrix A, where the matrix positions (i, j) are first sorted by i + j and then by j. A doubly lexical ordering of A is an ordering of its rows and columns which makes v lexically the largest. Every real-valued matrix has a doubly lexical ordering, and such an ordering is not unique [167]. Doubly lexical ordering has a number of applications, including the efficient recognition of totally balanced, subtree, and plaid matrices and of (strongly) chordal graphs. To the best of our knowledge, the merits of this ordering for high performance have not been investigated for sparse matrices.


Figure 8.3: A doubly lexical ordering of a 0/1 matrix: (a) vector comparison, (b) the original and the ordered matrix.


We propose an iterative algorithm called matLexiOrder for the doubly lexical ordering of matrices, where in an iteration either the rows or the columns are sorted lexically. Lubiw [167, Claim 2.2] shows that exchanging two rows which are lexically out of order improves the lexical ordering of the matrix. By appealing to this result, we see that the matLexiOrder algorithm finds a doubly lexical ordering in a finite number of iterations. This also fits well with another characterization of the doubly lexical ordering [167, 182]: a matrix is in doubly lexical ordering if both the row and the column vectors are in non-increasing lexical order. A row vector is read from left to right, and a column vector is read from top to bottom. Figure 8.3 shows a doubly lexical ordering of a sample matrix. As seen in the figure, the row vectors are in non-increasing lexical order, as are the column vectors.

The known doubly lexical ordering algorithms for matrices, by Lubiw [167] and Paige and Tarjan [182], are "direct" ordering methods with a run time of O(nnz log(I + J) + J) and O(nnz + I + J) space for an I × J matrix with nnz nonzeros. We find the time complexity of these algorithms to be too high for our purpose. Furthermore, the data structures are too complex to allow an efficient, generalized implementation for tensors. Since we do not aim to obtain an exact doubly lexical ordering (a close-by ordering will likely suffice to improve the Mttkrp performance), our approach will be faster, while being simpler to implement.

We first describe the matLexiOrder algorithm for a sparse matrix. This aims to show the efficiency of our iterative approach compared to others. A partition refinement technique is used to order the rows and columns alternately. Given an ordering of the rows, the columns can be sorted lexically. This is achieved by an order-preserving variant of the partition refinement method [182], which is called orderlyRefine. We briefly explain the partition refinement technique for ordering the columns of a matrix. Given an I × J matrix A, all columns are initially put into a single part. Then, A's nonzeros are visited row by row. At a row i, each column part C is split into two parts C1 = C ∩ a(i, :) and C2 = C \ a(i, :), and these two parts replace C in the order C1, C2 (empty sets are discarded). Note that this algorithm keeps the parts in a particular order, which generates an ordering of the columns. orderlyRefine is used to refine all parts that have at least one nonzero in row i in O(|a(i, :)|) time, where |a(i, :)| is the number of nonzeros in row i. Overall, matLexiOrder costs a linear total time of O(nnz + I + J) (for ordering rows and columns) per iteration and O(J) space. We also observe that only a small number of iterations is enough (as will be shown in Section 8.3.5 for tensors), yielding a more storage-efficient algorithm compared to the prior doubly lexical ordering methods [167, 182]. matLexiOrder, in particular the use of the orderlyRefine routine a few times, is sufficient for our needs.
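The following is an illustrative C sketch of orderlyRefine written from the description above (it is not the implementation used in the experiments): the columns are kept in an array in which every part is contiguous; for each row, the columns of that row are swapped to the front of their parts, and every touched part is then split so that C1 precedes C2.

#include <stdlib.h>

/* Order-preserving partition refinement on the columns of an I x J matrix
   given by rows in CSR form (rowptr/colind).  On return, perm[j] is the
   position of column j in the refined order. */
void orderly_refine(int I, int J, const long *rowptr, const int *colind,
                    int *perm)
{
    int *elem  = malloc(J * sizeof *elem);  /* columns; each part contiguous  */
    int *pos   = malloc(J * sizeof *pos);   /* position of a column in elem   */
    int *part  = malloc(J * sizeof *part);  /* part id of a column            */
    int *pbeg  = malloc(J * sizeof *pbeg);  /* first position of a part       */
    int *psize = malloc(J * sizeof *psize); /* size of a part                 */
    int *pmark = malloc(J * sizeof *pmark); /* columns of a part seen in row i */
    int *touched = malloc(J * sizeof *touched);
    int nparts = 1;

    for (int j = 0; j < J; j++) { elem[j] = j; pos[j] = j; part[j] = 0; }
    pbeg[0] = 0; psize[0] = J; pmark[0] = 0;

    for (int i = 0; i < I; i++) {
        int ntouched = 0;
        /* swap the columns of row i to the front of their parts */
        for (long t = rowptr[i]; t < rowptr[i + 1]; t++) {
            int c = colind[t], p = part[c];
            if (pmark[p]++ == 0) touched[ntouched++] = p;
            int dst = pbeg[p] + pmark[p] - 1, other = elem[dst];
            elem[dst] = c; elem[pos[c]] = other;
            pos[other] = pos[c]; pos[c] = dst;
        }
        /* split each touched part: the front (C1, in row i) becomes a new
           part, the rest (C2) keeps the old id and follows it in order */
        for (int t = 0; t < ntouched; t++) {
            int p = touched[t], m = pmark[p];
            pmark[p] = 0;
            if (m == psize[p]) continue;          /* nothing to split off */
            int q = nparts++;
            pbeg[q] = pbeg[p]; psize[q] = m; pmark[q] = 0;
            for (int s = pbeg[q]; s < pbeg[q] + m; s++) part[elem[s]] = q;
            pbeg[p] += m; psize[p] -= m;
        }
    }
    for (int j = 0; j < J; j++) perm[elem[j]] = j;
    free(elem); free(pos); free(part); free(pbeg);
    free(psize); free(pmark); free(touched);
}

One such pass over the rows sorts the columns lexically with respect to the given row order; alternating the roles of rows and columns gives the matLexiOrder iterations.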

Algorithm 19: Lexi-Order for a given mode

Data: An Nth-order sparse tensor X ∈ R^{I_1 × · · · × I_N}, mode n
Result: Permutation perm_n

▷ Sort all nonzeros lexically on all but mode n
1:  quickSort(X, coordCmp)
▷ Matricize X to X_(n)
2:  r ← compose(inds([−n]), 1)
3:  for m = 1, . . . , nnz do
4:      c ← inds(n, m)                     ▷ column index of X_(n)
5:      if coordCmp(X, m, m − 1) = 1 then
6:          r ← compose(inds([−n]), m)     ▷ row index of X_(n)
7:      X_(n)(r, c) ← val(m)
▷ Apply partition refinement to the columns of X_(n)
8:  perm_n ← orderlyRefine(X_(n))
9:  return perm_n

▷ Lexical comparison of the nonzeros m1 and m2 on all modes except n
10: Function coordCmp(X, m1, m2)
11:     for n′ = 1, . . . , N with n′ ≠ n do
12:         if m1(n′) < m2(n′) then
13:             return −1
14:         if m1(n′) > m2(n′) then
15:             return 1
16:     return 0

To order tensors, we propose the Lexi-Order function as an extension of matLexiOrder. The basic idea of Lexi-Order is to determine the permutation of each tensor mode independently, while considering the order of the other modes fixed. Lexi-Order sets the indices of the mode to be ordered as the columns of a matrix and the other indices as the rows, and sorts the columns as described for matrices (with the order-preserving partition refinement method). The precise algorithm appears in Algorithm 19, which we also illustrate in Figure 8.4 when applied to mode 1. Given a mode n, Lexi-Order first builds a matricized tensor in Compressed Sparse Row (CSR) sparse matrix format, by a call to quickSort with the comparison function coordCmp and then by partitioning the nonzeros into the row segments (Lines 3–7). The comparison function coordCmp does a lexical comparison of the all-but-mode-n indices, which enables efficient matricization. In other words, sorting the tensor X is the same as building the matricized tensor X_(n) by rows in the fixed lexical ordering, where mode n is the column dimension and the remaining modes constitute the rows. In Figure 8.4, the sorting step orders the COO entries by (j, k) tuples, which then serve as the row indices of the matricized CSR representation of X_(1). Once the matrix is built, we construct its 0/1 row vectors (shown in Figure 8.4 to illustrate the nonzero distribution), on which the orderlyRefine function can seamlessly be called. Apart from the quickSort, the other parts of Lexi-Order have linear time complexity (linear in terms of the tensor storage). We use OpenMP tasks to parallelize quickSort to accelerate Lexi-Order.


Figure 8.4: The steps of Lexi-Order illustrated for mode 1: the original COO tensor is quick-sorted, matricized into a CSR matrix (mode 1 as the columns), and the 0/1 nonzero distribution of its rows is refined with orderlyRefine to produce perm_i.


Like the BFS-MCS approach, Lexi-Order also finds the permutations for the N modes of an Nth-order sparse tensor. Figure 8.5(a) illustrates the effect of Lexi-Order on an example 4 × 4 × 3 sparse tensor. The original tensor is converted to a HiCOO representation with block size B = 2, consisting of 5 blocks with at most 2 nonzeros per block. After reordering with Lexi-Order, the new HiCOO has 3 nonzero blocks with up to 4 nonzeros per block. Thus, the blocks are denser, which should exhibit better locality behavior. However, this reordering scheme is heuristic. For example, consider Figure 8.1; we draw another HiCOO representation after reordering in Figure 8.5(b). Applying Lexi-Order would yield a reordered HiCOO representation with 4 blocks, which is the same as the input ordering, although the maximum number of nonzeros per block would increase to 4. For this tensor, Lexi-Order may not show a big advantage.

8.2.3 Analysis

We take the Mttkrp operation to analyze the reordering behavior for the COO, CSF, and HiCOO formats. Recall that, for a third-order sparse tensor X, Mttkrp multiplies each nonzero entry x_ijk with the R-vector formed by the entry-wise product of the jth row of B and the kth row of C when computing M_A. The arithmetic intensity of Mttkrp algorithms on these formats is approximately 1/4 [155]. Thus, Mttkrp can be considered memory-bound for most computer architectures, especially CPU platforms.
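For reference, a minimal sketch of this computation on a COO tensor (mode 1, illustrative names) is given below; B, C, and MA are dense, row-major factor matrices with R columns, and MA is assumed to be zero-initialized by the caller. The rows of B and C are fetched through the j and k indices of consecutive nonzeros, which is precisely the access pattern the proposed relabeling tries to localize.

/* MTTKRP for the first mode of a third-order COO tensor:
   MA(i,:) += val * ( B(j,:) .* C(k,:) ) for each nonzero (i, j, k, val). */
void mttkrp_coo_mode1(long nnz, const int *i, const int *j, const int *k,
                      const double *val, int R,
                      const double *B, const double *C, double *MA)
{
    for (long p = 0; p < nnz; p++) {
        const double *brow = B + (long)j[p] * R;
        const double *crow = C + (long)k[p] * R;
        double *marow = MA + (long)i[p] * R;
        for (int r = 0; r < R; r++)
            marow[r] += val[p] * brow[r] * crow[r];
    }
}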


Figure 8.5: Comparison of HiCOO representations before and after Lexi-Order: (a) a good example, (b) a fair example.


In HiCOO, smaller block ratios (αb) and larger average slice sizes per tensor block (cb) are favorable for good Mttkrp performance. Our reordering algorithms tend to pack nonzeros closely together and get denser blocks (larger cb), which potentially generates fewer tensor blocks (smaller αb); thus HiCOO-Mttkrp has fewer memory accesses, more cache hits, and better performance. Moreover, a smaller αb can also reduce the tensor memory requirement as a side benefit [155]. That is, for the HiCOO format, a reordering scheme should increase the block density, reduce the number of blocks, and increase cache locality, three related performance metrics. Reordering is more beneficial for the HiCOO format than for the COO and CSF formats.

COO stores all nonzero indices, so relabeling does not make a difference in its storage. The same is also true for CSF: relabeling does not change the CSF tree structure, while the order of the nodes may change. For an Mttkrp with tensors stored in these two formats, the performance gain from reordering comes only from the improved data locality. Though it is hard to do a theoretical analysis for them, the potentially better data locality could also bring performance advantages, likely smaller than HiCOO's.


Tensors     Order   Dimensions                   #Nonzeros   Density
vast        3       165K × 11K × 2               26M         6.9 × 10^-3
nell2       3       12K × 9K × 29K               77M         2.4 × 10^-5
choa        3       712K × 10K × 767             27M         5.0 × 10^-6
darpa       3       22K × 22K × 24M              28M         2.4 × 10^-9
fb-m        3       23M × 23M × 166              100M        1.1 × 10^-9
fb-s        3       39M × 39M × 532              140M        1.7 × 10^-10
flickr      3       320K × 28M × 2M              113M        7.8 × 10^-12
deli        3       533K × 17M × 3M              140M        6.1 × 10^-12
nell1       3       2.9M × 2.1M × 25M            144M        9.1 × 10^-13
crime       4       6K × 24 × 77 × 32            5M          1.5 × 10^-2
uber        4       183 × 24 × 1140 × 1717       3M          3.9 × 10^-4
nips        4       2K × 3K × 14K × 17           3M          1.8 × 10^-6
enron       4       6K × 6K × 244K × 1K          54M         5.5 × 10^-9
flickr4d    4       320K × 28M × 2M × 731        113M        1.1 × 10^-14
deli4d      4       533K × 17M × 3M × 1K         140M        4.3 × 10^-15

Table 8.1: Description of sparse tensors.

8.3 Experiments

Platform. We perform experiments on a Linux-based Intel Xeon E5-2698 v3 multicore server platform with 32 physical cores distributed over two sockets, each running at 2.3 GHz. We only present results for sequential runs (see the paper [156] for parallel runs). The processor microarchitecture is Haswell, with 32 KiB of L1 data cache and 128 GiB of memory. The code artifact was written in the C language and was compiled with icc 18.0.1.

Dataset. We use sparse tensors derived from real-world applications, which appear in Table 8.1, ordered by decreasing nonzero density separately for third- and fourth-order tensors. Apart from choa [188], all tensors are publicly available: FROSTT [203] and HaTen2 [121].

Configurations. We report the results under the best configurations of the following parameters for the highest Mttkrp performance with the three formats: (i) the number of reordering iterations and the block size B for the HiCOO format; (ii) tiling or not for CSF. For HiCOO, B = 128 achieves the best results in most cases, and we use five reordering iterations, which will be analyzed in Section 8.3.5. All experiments use an approximate rank of R = 16. We use the total execution time of Mttkrps in all modes for every tensor to calculate the speedup, which is the ratio of the total Mttkrp time on a randomly reordered tensor over that using a specific reordering scheme. All times are averaged over five runs.

8.3.1 COO-Mttkrp with reordering

We show the effect of the two reordering approaches on the sequential COO-Mttkrp from ParTI [154] in Figure 8.6. This COO-Mttkrp is implemented in C, using the same algorithm as in Tensor Toolbox [18]. For any reordering approach, after applying BFS-MCS, Lexi-Order, or a random reordering to the input tensor, we still sort the tensor in the mode order 1, . . . , N. Observe that Lexi-Order improves the sequential COO-Mttkrp performance by 1.00–4.29× (1.79× on average), while BFS-MCS obtains 0.95–1.27× (1.10× on average). Note that Lexi-Order improves the performance of COO-Mttkrp for all tensors. We conclude that this ordering is always helpful for COO-Mttkrp, although the improvements are smaller than what we will see for HiCOO-Mttkrp.


Figure 8.6: Reordered COO-Mttkrp speedup over a random reordering implementation.


8.3.2 CSF-Mttkrp with reordering

We show the effect of the two reordering approaches on the sequential CSF-Mttkrp from Splatt v1.1.1 [205] in Figure 8.7. CSF-Mttkrp is set to use all CSF representations (ALLMODE) for Mttkrps in all modes, with the tiling option on. Lexi-Order improves the sequential CSF-Mttkrp performance by 0.65–2.33× (1.50× on average). BFS-MCS improves the sequential CSF-Mttkrp performance by 1.00–1.86× (1.22× on average). Both ordering approaches improve the performance of CSF-Mttkrp on average. While BFS-MCS is always helpful in the sequential case, Lexi-Order is not helpful on only one tensor, crime. The improvements achieved by the two reordering approaches for CSF are smaller than those for the HiCOO and COO formats. We conclude that both reordering methods are helpful for CSF-Mttkrp, but to a lesser extent than for HiCOO- and COO-based Mttkrp.

8.3.3 HiCOO-Mttkrp with reordering

Figure 8.8 shows the speedup of the proposed reordering methods on the sequential HiCOO-Mttkrp. Lexi-Order reordering obtains 0.99–4.14× speedup (2.12× on average), while BFS-MCS reordering gets 0.99–1.88× speedup (1.34× on average). Lexi-Order and BFS-MCS do not behave as well on fourth-order tensors as on third-order tensors. Tensor flickr4d is constructed from the same data as flickr, with an extra short mode (refer to Table 8.1). Lexi-Order obtains a 4.14× speedup on flickr but only 3.02× on flickr4d. The same phenomenon is also observed on tensors deli and deli4d. This indicates that it is harder to get good data locality on higher-order tensors, which will be justified in Table 8.2.


Figure 8.7: Reordered CSF-Mttkrp speedup over a random reordering implementation.

Figure 8.8: Reordered HiCOO-Mttkrp speedup over a random ordering implementation.


We investigate the two critical parameters of HiCOO [155]: the block ratio (αb) and the average slice size per tensor block (cb). Smaller αb and larger cb are favorable for good HiCOO-Mttkrp performance. Table 8.2 lists the parameter values for all tensors before and after Lexi-Order, the HiCOO-Mttkrp speedup (as shown in Figure 8.8), and the storage ratio of HiCOO over random ordering. Generally, when both parameters αb and cb are largely improved, we see a good speedup and storage ratio using Lexi-Order. As seen for the same data in different orders, e.g., flickr4d and flickr, the αb and cb values after Lexi-Order are better in the smaller-order version. This observation supports the intuition that getting good data locality is harder for higher-order tensors.

8.3.4 Reordering methods comparison

As seen above, Lexi-Order improves performance more than BFS-MCS in most cases. We also compare against the reordering method used in Splatt [206]: setting ALLMODE (identical to the work [206]) for CSF-Mttkrp, BFS-MCS gets 1.04×, 1.64×, and 1.61× speedups on tensors nell2, nell1, and deli, respectively, and Lexi-Order obtains 1.04×, 1.70×, and 2.24× speedups. By contrast, the speedups using graph partitioning [206] on these three tensors are 1.06×, 1.11×, and 1.19×, and 1.06×, 1.12×, and 1.24× using hypergraph partitioning [206], respectively. Our BFS-MCS and Lexi-Order schemes both outperform graph and hypergraph partitioning [206].


Tensors     Random reordering    Lexi-Order          Speedup         Storage
            αb       cb          αb       cb         seq     omp     ratio
vast        0.004    1.758       0.004    1.562      1.01    1.03    0.999
nell2       0.020    0.314       0.008    0.074      1.04    1.04    0.966
choa        0.089    0.057       0.016    0.056      1.16    1.07    0.833
darpa       0.796    0.009       0.018    0.113      3.74    1.50    0.322
fb-m        0.985    0.008       0.086    0.021      3.84    1.21    0.335
fb-s        0.982    0.008       0.099    0.020      3.90    1.23    0.336
flickr      0.999    0.008       0.097    0.025      4.14    3.66    0.277
deli        0.988    0.008       0.501    0.010      2.24    0.83    0.634
nell1       0.998    0.008       0.744    0.009      1.70    0.70    0.812
crime       0.001    37.702      0.001    8.978      0.99    1.42    1.000
uber        0.041    0.469       0.011    0.270      1.00    0.78    0.838
nips        0.016    0.434       0.004    0.435      1.03    1.34    0.921
enron       0.290    0.017       0.045    0.030      1.25    1.36    0.573
flickr4d    0.999    0.008       0.148    0.020      3.02    11.81   0.214
deli4d      0.998    0.008       0.596    0.010      1.76    1.26    0.697

Table 8.2: HiCOO parameters before and after Lexi-Order reordering.


The methods available in the state of the art are based on graph and hypergraph partitioning. Partitioning is a successful approach when the number of partitions is known, while the number of blocks in HiCOO is not known ahead of time. In our case, the partitions should also be ordered for better cache reuse. Additionally, partitioners are less effective for tensors than usual, as some dimensions could be very short (creating vertices of very high degree). That is why the proposed ordering-based methods deliver better results.

8.3.5 Effect of the number of iterations in Lexi-Order

Since Lexi-Order is iterative, we evaluate the effect of the number of iterations on HiCOO-Mttkrp performance. The results appear in Figure 8.9(a), normalized to the run time with 10 iterations. Mttkrp performance on most tensors does not vary much with different numbers of iterations, except on vast, nell2, uber, and nips. We use 5 iterations to get Mttkrp performance similar to that of 10 iterations with about half of the overhead (shown in Figure 8.9(b)); 3 or fewer iterations still give an acceptable performance when users care much about the pre-processing time.


Figure 8.9: The performance and overhead of different numbers of Lexi-Order iterations (1, 2, 3, 5, 10): (a) sequential HiCOO-Mttkrp behavior, (b) the reordering overhead (times normalized to those of 10 iterations).

8.4 Summary, further notes and references

Plenty of recent research has studied the optimization of tensor algorithms [20, 28, 46, 159, 168]. Our work sets itself apart by focusing on reordering to obtain a better nonzero structure. Various reordering methods have been considered for matrix operations [7, 173, 190, 215].

The proposed two heuristics aim to arrange the nonzeros of a given tensor close to each other in all modes. As said before, in the matrix case this would correspond to reordering the rows and columns so that all nonzeros are clustered around the diagonal. Heuristics based on partitioning and ordering have been used in the matrix case for arranging most nonzeros around the diagonal or the diagonal blocks. An obvious alternative in the matrix case is the BFS-based reverse Cuthill-McKee (RCM) heuristic. Given a pattern-symmetric matrix A, RCM finds a permutation (ordering) such that the symmetrically permuted matrix PAP^T has nonzeros clustered around the diagonal. More precisely, RCM tries to reduce the bandwidth, which is defined as max{j − i : a_ij ≠ 0 and j > i}. The RCM heuristic can be applied to pattern-wise unsymmetric, even rectangular, matrices as well. A common approach is to create a (2 × 2)-block matrix for a given m × n matrix A as

B = ( I_m   A   )
    ( A^T   I_n )

where I_m and I_n are the identity matrices of size m × m and n × n, respectively. Since B is now pattern-symmetric, RCM can be run on it to obtain an (m + n) × (m + n) permutation matrix P.


             min     max      geo. mean
BFS-MCS      0.84    53.56    2.90
Lexi-Order   0.15    4.33     0.99

Table 8.3: Statistical indicators of the number of nonempty blocks obtained by BFS-MCS and Lexi-Order with respect to that of RCM on 774 matrices from the UFL collection.

The row and column permutations for A can then be obtained from P by respecting the ordering of the first m rows and the last n rows of B. This approach can be extended to tensors. We discuss the third-order case for simplicity. Let X be an I × J × K tensor. We create a matrix B in such a way that the first I rows/columns correspond to the first-mode indices, the following J rows/columns correspond to the second-mode indices, and the last K rows/columns correspond to the third-mode indices. Then, for each nonzero x_ijk of X, we set b_rc = 1 whenever r and c correspond to any two indices from the set {i, j, k}. This way, B is a symmetric matrix with ones on the main diagonal. Observe that when X is two-dimensional (that is, a matrix), we obtain the matrix B above.
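A sketch of this construction is given below (illustrative; the triplet list it produces, with duplicate entries, would then be assembled into a sparse matrix and passed to an RCM implementation).

#include <stdlib.h>

typedef struct { int r, c; } pattern_entry;

/* Build the pattern of the symmetric (I+J+K) x (I+J+K) matrix B described
   above from the COO indices of a third-order tensor; returns a triplet
   list of length 6*nnz (both triangles, duplicates not merged). */
pattern_entry *tensor_to_sym_pattern(long nnz, const int *i, const int *j,
                                     const int *k, int I, int J, long *len)
{
    pattern_entry *e = malloc(6 * nnz * sizeof *e);
    long m = 0;
    for (long p = 0; p < nnz; p++) {
        int v[3] = { i[p], I + j[p], I + J + k[p] };
        for (int s = 0; s < 3; s++)
            for (int t = 0; t < 3; t++)
                if (s != t) { e[m].r = v[s]; e[m].c = v[t]; m++; }
    }
    *len = m; /* the diagonal entries b_rr = 1 can be added separately */
    return e;
}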

We now compare the proposed reordering heuristics with RCM. We first compare the three methods RCM, BFS-MCS, and Lexi-Order on matrices available in the UFL collection [59]. We have tested these three heuristics on all pattern-wise unsymmetric matrices having more than 1,000 and less than 5,000,000 rows, at least three nonzeros per row and column, less than 10,000,000 nonzeros, and less than 100 · √(m × n) nonzeros for a matrix with m rows and n columns. At the time of experimentation, there were 774 matrices satisfying these properties. We ordered these matrices with RCM, BFS-MCS, and Lexi-Order (5 iterations) and counted the number of nonempty blocks with a block size of 128 (in each dimension). We then normalized the results of BFS-MCS and Lexi-Order with respect to those of RCM. We give the statistical indicators of these normalized results in Table 8.3. As seen in the table, RCM is better than BFS-MCS, while Lexi-Order performs as well as RCM in reducing the number of blocks.

We next compare the three methods on 11 sparse tensors from the FROSTT [203] and HaTen2 [121] repositories, again with a block size of 128 in each dimension. Table 8.4 shows the number of nonempty blocks obtained with these methods. In this table, the number of blocks obtained by RCM is given in absolute terms, and those obtained by BFS-MCS and Lexi-Order are given relative to those of RCM. As seen in the table, RCM is never the best, and Lexi-Order is better than BFS-MCS in nine out of eleven cases. Overall, Lexi-Order and BFS-MCS obtain results whose geometric means are 0.48 and 0.89, respectively, of that of RCM, marking both as more effective than RCM. This is interesting, as we have seen that RCM and Lexi-Order have (nearly) the same performance in the matrix case.


                RCM         BFS-MCS   Lexi-Order
lbnl-network    67208       1.11      0.68
nips            25992       1.56      0.45
vast            113757      1.00      0.95
nell-2          565615      0.92      1.05
flickr          28648054    0.41      0.38
flickr4d        32913028    0.61      0.51
deli            70691535    0.97      0.91
deli4d          94608531    0.81      0.88
darpa           2753737     1.65      0.19
fb-m            53184625    0.66      0.16
fb-s            71308285    0.80      0.19

geo. mean                   0.89      0.48

Table 8.4: Number of nonempty blocks obtained by RCM, BFS-MCS, and Lexi-Order on sparse tensors. The results of BFS-MCS and Lexi-Order are normalized with respect to those of RCM; for each tensor, the best result is displayed in bold.

          random    Lexi-Order    speedup
vast      2.02      1.60          1.26
nell-2    5.04      4.33          1.16
choa      1.78      1.43          1.25
darpa     3.18      1.48          2.15
fb-m      17.16     6.95          2.47
fb-s      25.03     9.73          2.57
flickr    10.29     5.82          1.77
deli      14.05     10.80         1.30
nell1     18.45     14.86         1.24

Table 8.5: The run time of TTM (for all modes) using a random ordering and Lexi-Order, and the speedup with Lexi-Order, for sequential execution.


The proposed ordering methods are applicable to some other sparse tensor operations without any change. Among those operations, the tensor-matrix multiplication (TTM) used in computing the Tucker decomposition with higher-order orthogonal iterations [61], or the tensor-vector multiplications used in the higher-order power method [60], are well known. The main characteristic of these operations is that the memory access of the algorithm depends on the locality of the tensor indices in each dimension. If not all dimensions have the same characteristics, potentially a variant of the proposed methods in which only certain dimensions are ordered could be used. We showcase some results for the TTM case in Table 8.5, on some of the tensors. The results were obtained on a machine with an Intel Xeon CPU E5-2650 v4. As seen in the table, using Lexi-Order always results in improved sequential run time for TTMs.

Recent work [46] analyzes the memory access pattern of the MTTKRP operation and proposes low-level, highly effective code optimizations. The proposed approach reorganizes the code with register blocking and applies nonzero blocking in an attempt to fit rows of the factor matrices into the cache. The gain in performance is remarkable. Our ordering methods are orthogonal to the mentioned and similar optimizations. For a much improved performance, one should first order the tensor using the approaches proposed in this chapter and then further optimize the code using the low-level code optimizations.


Part IV

Closing


Chapter 9

Concluding remarks

We have presented a selection of partitioning, matching, and ordering problems in combinatorial scientific computing.

Partitioning problems. These problems arise in task decomposition for parallel computing, where load balance and low communication cost are two objectives. Depending on the task definition and the interaction of the tasks, the partitioning problems use different combinatorial models. The selected problems (Chapters 2 and 3) addressed acyclic partitioning of directed acyclic graphs and partitioning of hypergraphs. While our contributions for the former problem concerned combinatorial tools for the desired partitioning objectives and constraints, those for the second problem concerned the use of hypergraph models and the associated partitioning tools for efficient tensor decomposition in the distributed memory setting.

Matching problems. These problems arise in settings where agents compete for exclusive access to resources. The most common settings include identifying nonzero diagonals in sparse matrices, and routing and scheduling in data centers. A related concept is to decompose a doubly stochastic matrix as a convex combination of permutation matrices (the famous Birkhoff-von Neumann decomposition), where each permutation matrix is a perfect matching in the bipartite graph representation of the matrix. We developed approximation algorithms for matchings in graphs (Chapter 4) and effective heuristics for finding matchings in hypergraphs (Chapter 6). We also investigated (Chapter 5) the problem of finding Birkhoff-von Neumann decompositions with a small number of permutation matrices, and presented complexity results and theoretical insights into the decomposition problem. On the way, we generalized the decomposition to general real matrices (not only doubly stochastic) having total support.



Ordering problems. These problems arise when one wants to permute sparse matrices and tensors into desirable forms. The sought forms depend, of course, on the targeted applications. By focusing on a recent LU decomposition method (or on a method taking advantage of the sparsity in a novel way), we developed heuristics to permute sparse matrices into a bordered block triangular form, with an aim to reduce the height of the resulting elimination tree (Chapter 7). We also investigated the sparse tensor ordering problem, where we sought to cluster nonzeros around the (hyper)diagonal of a sparse tensor (Chapter 8), in order to improve the performance of certain tensor operations. We proposed novel algorithms for obtaining doubly lexical orderings of sparse matrices, which are simpler to implement than the known algorithms, to be able to address the tensor ordering problem.

At the end of Chapters 2–8, we mentioned some immediate future work. Below, we state some more future work, which requires considerable work and creativity.

Most of the presented combinatorial algorithms come with applications. Others are made applicable by practical and efficient implementations; further developments on the applications' side are required for these to be used in practice. In particular, the use of the acyclic partitioner (the problem of Chapter 2) in a parallel system with the help of a runtime system requires the whole graph to be available. This is usually not the case; an auto-tuning-like approach, in which one first creates a task graph by a cold run and then partitions this graph to determine a schedule, can prove useful. In a similar vein, we mentioned the lack of proper software implementing the LU factorization method using elimination trees at the end of Chapter 7. Such an implementation will give rise to a number of combinatorial problems (e.g., the peak memory requirement in factoring a BBT-ordered matrix with elimination trees) and seems to be a new playing field.

The original Karp-Sipser heuristic for the cardinality matching problem applies degree-1 and degree-2 reduction rules. The first rule is well implemented in practical settings (see Chapters 4 and 6). On the other hand, fast implementations of the degree-2 reduction rule for undirected graphs are sought. A straightforward implementation of this rule can result in a worst-case Ω(n^2) total time for a graph with n vertices. This is obviously too much to afford, in theory, for general sparse graphs with O(n) or O(n log n) edges. Are there worst-case linear time algorithms for implementing this rule in the context of the Karp-Sipser heuristic?

The Birkhoff-von Neumann decomposition (Chapter 5) has well-founded applications in routing. Our use of this decomposition in solving linear systems is a first attempt at making it useful for numerical algorithms. In the proposed approach, the cost of constructing a preconditioner with a few terms from a BvN decomposition is equivalent to solving a few


maximum bottleneck matching problems; this can be costly in practical settings. Can we use a relaxed definition of the BvN decomposition in which sub-permutation matrices are allowed (such decompositions are used in routing problems), and make the related preconditioners more practical while retaining their numerical properties?


Bibliography

[1] IBM ILOG CPLEX Optimizer httpwww-01ibmcomsoftwareintegrationoptimizationcplex-optimizer 2010 [Cited on pp96 121]

[2] Evrim Acar Daniel M Dunlavy and Tamara G Kolda A scalable op-timization approach for fitting canonical tensor decompositions Jour-nal of Chemometrics 25(2)67ndash86 2011 [Cited on pp 11 37]

[3] Seher Acer Tugba Torun and Cevdet Aykanat Improving medium-grain partitioning for scalable sparse tensor decomposition IEEETransactions on Parallel and Distributed Systems 29(12)2814ndash2825Dec 2018 [Cited on pp 52]

[4] Kunal Agrawal Jeremy T Fineman Jordan Krage Charles E Leis-erson and Sivan Toledo Cache-conscious scheduling of streaming ap-plications In Proc Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 236ndash245 New YorkNY USA 2012 ACM [Cited on pp 4]

[5] Emmanuel Agullo Alfredo Buttari Abdou Guermouche and FlorentLopez Multifrontal QR factorization for multicore architectures overruntime systems In Euro-Par 2013 Parallel Processing LNCS pages521ndash532 Springer Berlin Heidelberg 2013 [Cited on pp 131]

[6] Ron Aharoni and Penny Haxell Hallrsquos theorem for hypergraphs Jour-nal of Graph Theory 35(2)83ndash88 2000 [Cited on pp 119]

[7] Kadir Akbudak and Cevdet Aykanat Exploiting locality in sparsematrix-matrix multiplication on many-core architectures IEEETransactions on Parallel and Distributed Systems 28(8)2258ndash2271Aug 2017 [Cited on pp 160]

[8] Patrick R Amestoy Iain S Duff and Jean-Yves LrsquoExcellent Multi-frontal parallel distributed symmetric and unsymmetric solvers Com-puter Methods in Applied Mechanics and Engineering 184(2ndash4)501ndash520 2000 [Cited on pp 131 132]

[9] Patrick R Amestoy Iain S Duff Daniel Ruiz and Bora Ucar Aparallel matrix scaling algorithm In J M Palma P R AmestoyM Dayde M Mattoso and J C Lopes editors 8th InternationalConference on High Performance Computing for Computational Sci-ence (VECPAR) volume 5336 pages 301ndash313 Springer Berlin Hei-delberg 2008 [Cited on pp 56]


[10] Michael Anastos and Alan Frieze Finding perfect matchings in ran-dom cubic graphs in linear time arXiv preprint arXiv1808008252018 [Cited on pp 125]

[11] Claus A Andersson and Rasmus Bro The N-way toolbox for MAT-LAB Chemometrics and Intelligent Laboratory Systems 52(1)1ndash42000 [Cited on pp 11 37]

[12] Hartwig Anzt Edmond Chow and Jack Dongarra Iterative sparsetriangular solves for preconditioning In Larsson Jesper Traff SaschaHunold and Francesco Versaci editors Euro-Par 2015 Parallel Pro-cessing pages 650ndash661 Springer Berlin Heidelberg 2015 [Cited onpp 107]

[13] Jonathan Aronson Martin Dyer Alan Frieze and Stephen SuenRandomized greedy matching II Random Structures amp Algorithms6(1)55ndash73 1995 [Cited on pp 58]

[14] Jonathan Aronson Alan Frieze and Boris G Pittel Maximum match-ings in sparse random graphs Karp-Sipser revisited Random Struc-tures amp Algorithms 12(2)111ndash177 1998 [Cited on pp 56 58]

[15] Cevdet Aykanat Berkant B Cambazoglu and Bora Ucar Multi-leveldirect K-way hypergraph partitioning with multiple constraints andfixed vertices Journal of Parallel and Distributed Computing 68609ndash625 2008 [Cited on pp 46]

[16] Ariful Azad Mahantesh Halappanavar Sivasankaran RajamanickamErik G Boman Arif Khan and Alex Pothen Multithreaded algo-rithms for maximum matching in bipartite graphs In IEEE 26th Inter-national Parallel Distributed Processing Symposium (IPDPS) pages860ndash872 Shanghai China 2012 [Cited on pp 57 59 66 73 74]

[17] Brett W Bader and Tamara G Kolda Efficient MATLAB compu-tations with sparse and factored tensors SIAM Journal on ScientificComputing 30(1)205ndash231 December 2007 [Cited on pp 11 37]

[18] Brett W Bader Tamara G Kolda et al Matlab tensor toolboxversion 30 Available online httpwwwsandiagov~tgkolda

TensorToolbox 2017 [Cited on pp 11 37 38 157]

[19] David A Bader Henning Meyerhenke Peter Sanders and DorotheaWagner editors Graph Partitioning and Graph Clustering - 10thDIMACS Implementation Challenge Workshop Georgia Institute ofTechnology Atlanta GA USA February 13-14 2012 Proceedingsvolume 588 of Contemporary Mathematics American MathematicalSociety 2013 [Cited on pp 69]


[20] Muthu Baskaran Tom Henretty Benoit Pradelle M HarperLangston David Bruns-Smith James Ezick and Richard LethinMemory-efficient parallel tensor decompositions In 2017 IEEE HighPerformance Extreme Computing Conference (HPEC) pages 1ndash7Sep 2017 [Cited on pp 160]

[21] James Bennett and Stan Lanning The Netflix Prize In Proceedingsof KDD cup and workshop page 35 2007 [Cited on pp 10 48]

[22] Anne Benoit Yves Robert and Frederic Vivien A Guide to AlgorithmDesign Paradigms Methods and Complexity Analysis CRC PressBoca Raton FL 2014 [Cited on pp 93]

[23] Michele Benzi and Bora Ucar Preconditioning techniques based onthe Birkhoffndashvon Neumann decomposition Computational Methods inApplied Mathematics 17201ndash215 2017 [Cited on pp 106 108]

[24] Piotr Berman and Marek Karpinski Improved approximation lowerbounds on small occurence optimization ECCC Report 2003 [Citedon pp 113]

[25] Garrett Birkhoff Tres observaciones sobre el algebra lineal Universi-dad Nacional de Tucuman Series A 5147ndash154 1946 [Cited on pp8]

[26] Marcel Birn Vitaly Osipov Peter Sanders Christian Schulz andNodari Sitchinava Efficient parallel and external matching In Fe-lix Wolf Bernd Mohr and Dieter an Mey editors Euro-Par 2013Parallel Processing pages 659ndash670 Springer Berlin Heidelberg 2013[Cited on pp 59]

[27] Rob H Bisseling and Wouter Meesen Communication balancing inparallel sparse matrix-vector multiplication Electronic Transactionson Numerical Analysis 2147ndash65 2005 [Cited on pp 4 47]

[28] Zachary Blanco Bangtian Liu and Maryam Mehri Dehnavi CSTFLarge-scale sparse tensor factorizations on distributed platforms InProceedings of the 47th International Conference on Parallel Process-ing pages 211ndash2110 New York NY USA 2018 ACM [Cited onpp 160]

[29] Guy E Blelloch Jeremy T Fineman and Julian Shun Greedy sequen-tial maximal independent set and matching are parallel on average InProceedings of the Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 308ndash317 ACM 2012[Cited on pp 59 73 74 75]


[30] Norbert Blum A new approach to maximum matching in generalgraphs In Proceedings of the 17th International Colloquium on Au-tomata Languages and Programming pages 586ndash597 London UK1990 Springer-Verlag [Cited on pp 6 76]

[31] Richard A Brualdi The diagonal hypergraph of a matrix (bipartitegraph) Discrete Mathematics 27(2)127ndash147 1979 [Cited on pp 92]

[32] Richard A Brualdi Notes on the Birkhoff algorithm for doublystochastic matrices Canadian Mathematical Bulletin 25(2)191ndash1991982 [Cited on pp 9 92 94 98]

[33] Richard A Brualdi and Peter M Gibson Convex polyhedra of doublystochastic matrices I Applications of the permanent function Jour-nal of Combinatorial Theory Series A 22(2)194ndash230 1977 [Citedon pp 9]

[34] Richard A Brualdi and Herbert J Ryser Combinatorial Matrix The-ory volume 39 of Encyclopedia of Mathematics and its ApplicationsCambridge University Press Cambridge UK New York USA Mel-bourne Australia 1991 [Cited on pp 8]

[35] Thang Nguyen Bui and Curt Jones A heuristic for reducing fill-in insparse matrix factorization In Proc 6th SIAM Conf Parallel Pro-cessing for Scientific Computing pages 445ndash452 SIAM 1993 [Citedon pp 4 15]

[36] Peter J Cameron Notes on combinatorics httpscameroncountsfileswordpresscom201311combpdf (last checked Jan 2019)2013 [Cited on pp 76]

[37] Andrew Carlson Justin Betteridge Bryan Kisiel Burr Settles Este-vam R Hruschka Jr and Tom M Mitchell Toward an architecturefor never-ending language learning In AAAI volume 5 page 3 2010[Cited on pp 48]

[38] J Douglas Carroll and Jih-Jie Chang Analysis of individual dif-ferences in multidimensional scaling via an N-way generalization ofldquoEckart-Youngrdquo decomposition Psychometrika 35(3)283ndash319 1970[Cited on pp 36]

[39] Umit V Catalyurek and Cevdet Aykanat Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplicationIEEE Transactions on Parallel and Distributed Systems 10(7)673ndash693 Jul 1999 [Cited on pp 41 140]


[40] Umit V Catalyurek and Cevdet Aykanat PaToH A Multilevel Hyper-graph Partitioning Tool Version 30 Bilkent University Departmentof Computer Engineering Ankara 06533 Turkey PaToH is availableat httpbmiosuedu~umitsoftwarehtm 1999 [Cited on pp4 7 25 47 143]

[41] Umit V Catalyurek Kamer Kaya and Bora Ucar On shared-memoryparallelization of a sparse matrix scaling algorithm In 2012 41st Inter-national Conference on Parallel Processing pages 68ndash77 Los Alami-tos CA USA September 2012 IEEE Computer Society [Cited onpp 56]

[42] Umit V Catalyurek Mehmet Deveci Kamer Kaya and Bora UcarMultithreaded clustering for multi-level hypergraph partitioning In26th IEEE International Parallel and Distributed Processing Sympo-sium (IPDPS) pages 848ndash859 Shanghai China 2012 [Cited on pp59]

[43] Cheng-Shang Chang Wen-Jyh Chen and Hsiang-Yi Huang On ser-vice guarantees for input-buffered crossbar switches A capacity de-composition approach by Birkhoff and von Neumann In Quality ofService 1999 IWQoS rsquo99 1999 Seventh International Workshop onpages 79ndash86 1999 [Cited on pp 8]

[44] Jianer Chen Yang Liu Songjian Lu Barry OrsquoSullivan and Igor Raz-gon A fixed-parameter algorithm for the directed feedback vertex setproblem Journal of the ACM 55(5)211ndash2119 2008 [Cited on pp142]

[45] Joseph Cheriyan Randomized O(M(|V |)) algorithms for problemsin matching theory SIAM Journal on Computing 26(6)1635ndash16551997 [Cited on pp 76]

[46] Jee Choi Xing Liu Shaden Smith and Tyler Simon Blocking opti-mization techniques for sparse tensor computation In 2018 IEEE In-ternational Parallel and Distributed Processing Symposium (IPDPS)pages 568ndash577 Vancouver British Columbia Canada 2018 [Citedon pp 160 163]

[47] Joon Hee Choi and S V N Vishwanathan DFacTo Distributedfactorization of tensors In Zoubin Ghahramani Max Welling CorinnaCortes Neil D Lawrence and Kilian Q Weinberger editors 27thAdvances in Neural Information Processing Systems pages 1296ndash1304Curran Associates Inc Montreal Quebec Canada 2014 [Cited onpp 11 37 38 39]


[48] Edmond Chow and Aftab Patel Fine-grained parallel incomplete LUfactorization SIAM Journal on Scientific Computing 37(2)C169ndashC193 2015 [Cited on pp 107]

[49] Andrzej Cichocki Danilo P Mandic Lieven De Lathauwer GuoxuZhou Qibin Zhao Cesar Caiafa and Anh-Huy Phan Tensor decom-positions for signal processing applications From two-way to multiwaycomponent analysis IEEE Signal Processing Magazine 32(2)145ndash163March 2015 [Cited on pp 10]

[50] Edith Cohen Structure prediction and computation of sparse matrixproducts Journal of Combinatorial Optimization 2(4)307ndash332 Dec1998 [Cited on pp 127]

[51] Thomas F Coleman and Wei Xu Parallelism in structured Newtoncomputations In Parallel Computing Architectures Algorithms andApplications ParCo 2007 pages 295ndash302 Forschungszentrum Julichand RWTH Aachen University Germany 2007 [Cited on pp 3]

[52] Thomas F Coleman and Wei Xu Fast (structured) Newton computa-tions SIAM Journal on Scientific Computing 31(2)1175ndash1191 2009[Cited on pp 3]

[53] Thomas F Coleman and Wei Xu Automatic Differentiation in MAT-LAB using ADMAT with Applications SIAM 2016 [Cited on pp3]

[54] Jason Cong Zheng Li and Rajive Bagrodia Acyclic multi-way par-titioning of Boolean networks In Proceedings of the 31st Annual De-sign Automation Conference DACrsquo94 pages 670ndash675 New York NYUSA 1994 ACM [Cited on pp 4 17]

[55] Elizabeth H Cuthill and J McKee Reducing the bandwidth of sparsesymmetric matrices In Proceedings of the 24th national conferencepages 157ndash172 New York NY USA 1969 ACM [Cited on pp 12]

[56] Marek Cygan Improved approximation for 3-dimensional matchingvia bounded pathwidth local search In Foundations of ComputerScience (FOCS) 2013 IEEE 54th Annual Symposium on pages 509ndash518 IEEE 2013 [Cited on pp 113 120]

[57] Marek Cygan Fabrizio Grandoni and Monaldo Mastrolilli How to sellhyperedges The hypermatching assignment problem In Proc of thetwenty-fourth annual ACM-SIAM symposium on Discrete algorithmspages 342ndash351 SIAM 2013 [Cited on pp 113]


[58] Joseph Czyzyk Michael P Mesnier and Jorge J More The NEOSServer IEEE Journal on Computational Science and Engineering5(3)68ndash75 1998 [Cited on pp 96]

[59] Timothy A Davis and Yifan Hu The University of Florida sparsematrix collection ACM Transactions on Mathematical Software38(1)11ndash125 2011 [Cited on pp 25 69 82 105 143 161]

[60] Lieven De Lathauwer Pierre Comon Bart De Moor and Joos Vande-walle Higher-order power methodndashApplication in independent com-ponent analysis In Proceedings NOLTArsquo95 pages 91ndash96 Las VegasUSA 1995 [Cited on pp 163]

[61] Lieven De Lathauwer Bart De Moor and Joos Vandewalle Onthe best rank-1 and rank-(R1 R2 RN ) approximation of higher-order tensors SIAM Journal on Matrix Analysis and Applications21(4)1324ndash1342 2000 [Cited on pp 163]

[62] Dominique de Werra Variations on the Theorem of BirkhoffndashvonNeumann and extensions Graphs and Combinatorics 19(2)263ndash278Jun 2003 [Cited on pp 108]

[63] James W Demmel Stanley C Eisenstat John R Gilbert Xiaoye SLi and Joseph W H Liu A supernodal approach to sparse par-tial pivoting SIAM Journal on Matrix Analysis and Applications20(3)720ndash755 1999 [Cited on pp 132]

[64] James W Demmel John R Gilbert and Xiaoye S Li An asyn-chronous parallel supernodal algorithm for sparse gaussian elimina-tion SIAM Journal on Matrix Analysis and Applications 20(4)915ndash952 1999 [Cited on pp 132]

[65] Mehmet Deveci Kamer Kaya Umit V Catalyurek and Bora UcarA push-relabel-based maximum cardinality matching algorithm onGPUs In 42nd International Conference on Parallel Processing pages21ndash29 Lyon France 2013 [Cited on pp 57]

[66] Mehmet Deveci Kamer Kaya Bora Ucar and Umit V CatalyurekGPU accelerated maximum cardinality matching algorithms for bipar-tite graphs In Felix Wolf Bernd Mohr and Dieter an Mey editorsEuro-Par 2013 Parallel Processing volume 8097 of LNCS pages 850ndash861 Springer Berlin Heidelberg 2013 [Cited on pp 57]

[67] Pat Devlin and Jeff Kahn Perfect fractional matchings in k-out hy-pergraphs arXiv preprint arXiv170303513 2017 [Cited on pp 114121]


[68] Elizabeth D Dolan The NEOS Server 40 administrative guide Tech-nical Memorandum ANLMCS-TM-250 Mathematics and ComputerScience Division Argonne National Laboratory 2001 [Cited on pp96]

[69] Elizabeth D Dolan and Jorge J More Benchmarking optimiza-tion software with performance profiles Mathematical Programming91(2)201ndash213 2002 [Cited on pp 26]

[70] Jack J Dongarra Fred G Gustavson and Alan H Karp Implement-ing linear algebra algorithms for dense matrices on a vector pipelinemachine SIAM Review 26(1)pp 91ndash112 1984 [Cited on pp 134]

[71] Iain S Duff Kamer Kaya and Bora Ucar Design implementationand analysis of maximum transversal algorithms ACM Transactionson Mathematical Software 38(2)131ndash1331 2011 [Cited on pp 656 58]

[72] Iain S Duff and Jacko Koster On algorithms for permuting largeentries to the diagonal of a sparse matrix SIAM Journal on MatrixAnalysis and Applications 22973ndash996 2001 [Cited on pp 98 143]

[73] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Ap-proximation algorithms for maximum matchings in undirected graphsIn 2018 Proceedings of the Seventh SIAM Workshop on CombinatorialScientific Computing pages 56ndash65 2018 [Cited on pp 75 79 113]

[74] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Fur-ther notes on Birkhoffndashvon Neumann decomposition of doubly stochas-tic matrices Linear Algebra and its Applications 55468ndash78 2018[Cited on pp 101 113 117]

[75] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Ef-fective heuristics for matchings in hypergraphs Research Report RR-9224 Inria Grenoble Rhone-Alpes November 2018 (will appear inSEA2) [Cited on pp 118 119 122]

[76] Fanny Dufosse Kamer Kaya and Bora Ucar Two approximationalgorithms for bipartite matching on multicore architectures Journalof Parallel and Distributed Computing 8562ndash78 2015 IPDPS 2014Selected Papers on Numerical and Combinatorial Algorithms [Citedon pp 56 60 65 66 67 69 77 79 81 113 117 125]

[77] Andrew Lloyd Dulmage and Nathan Saul Mendelsohn Coverings ofbipartite graphs Canadian Journal of Mathematics 10517ndash534 1958[Cited on pp 68 143]

[78] Martin E. Dyer and Alan M. Frieze. Randomized greedy matching. Random Structures & Algorithms, 2(1):29–45, 1991. [Cited on pp. 58, 113, 115]

[79] Jack Edmonds. Paths, trees, and flowers. Canadian Journal of Mathematics, 17:449–467, 1965. [Cited on pp. 76]

[80] Lawrence C. Eggan. Transition graphs and the star-height of regular events. The Michigan Mathematical Journal, 10(4):385–397, 1963. [Cited on pp. 5]

[81] Stanley C. Eisenstat and Joseph W. H. Liu. The theory of elimination trees for sparse unsymmetric matrices. SIAM Journal on Matrix Analysis and Applications, 26(3):686–705, 2005. [Cited on pp. 5, 131, 132, 133, 135]

[82] Stanley C. Eisenstat and Joseph W. H. Liu. A tree-based dataflow model for the unsymmetric multifrontal method. Electronic Transactions on Numerical Analysis, 21:1–19, 2005. [Cited on pp. 132]

[83] Stanley C. Eisenstat and Joseph W. H. Liu. Algorithmic aspects of elimination trees for sparse unsymmetric matrices. SIAM Journal on Matrix Analysis and Applications, 29(4):1363–1381, 2008. [Cited on pp. 5, 132, 143, 145]

[84] Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. On characterizing the data access complexity of programs. ACM SIGPLAN Notices - POPL '15, 50(1):567–580, January 2015. [Cited on pp. 3]

[85] Paul Erdős and Alfréd Rényi. On random matrices. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 8(3):455–461, 1964. [Cited on pp. 72]

[86] Bastiaan Onne Fagginger Auer and Rob H. Bisseling. A GPU algorithm for greedy graph matching. In Facing the Multicore Challenge II, volume 7174 of LNCS, pages 108–119. Springer-Verlag, Berlin, 2012. [Cited on pp. 59]

[87] Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, and P. Sadayappan. Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential. ACM Transactions on Architecture and Code Optimization, 10(4):53:1–53:29, December 2013. [Cited on pp. 3, 16]

[88] Charles M. Fiduccia and Robert M. Mattheyses. A linear-time heuristic for improving network partitions. In 19th Design Automation Conference, pages 175–181, Las Vegas, USA, 1982. IEEE. [Cited on pp. 17, 22]

[89] Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735, 1989. [Cited on pp. 114]

[90] Alan M. Frieze. Maximum matchings in a class of random graphs. Journal of Combinatorial Theory, Series B, 40(2):196–212, 1986. [Cited on pp. 76, 82, 121]

[91] Aurélien Froger, Olivier Guyon, and Eric Pinson. A set packing approach for scheduling passenger train drivers: the French experience. In RailTokyo2015, Tokyo, Japan, March 2015. [Cited on pp. 7]

[92] Harold N. Gabow and Robert E. Tarjan. Faster scaling algorithms for network problems. SIAM Journal on Computing, 18(5):1013–1036, 1989. [Cited on pp. 6, 76]

[93] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979. [Cited on pp. 3, 47, 93]

[94] Alan George. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis, 10(2):345–363, 1973. [Cited on pp. 136]

[95] Alan J. George. Computer implementation of the finite element method. PhD thesis, Stanford University, Stanford, CA, USA, 1971. [Cited on pp. 12]

[96] Andrew V. Goldberg and Robert Kennedy. Global price updates help. SIAM Journal on Discrete Mathematics, 10(4):551–572, 1997. [Cited on pp. 6]

[97] Olaf Görlitz, Sergej Sizov, and Steffen Staab. Pints: Peer-to-peer infrastructure for tagging systems. In Proceedings of the 7th International Conference on Peer-to-peer Systems, IPTPS '08, page 19, Berkeley, CA, USA, 2008. USENIX Association. [Cited on pp. 48]

[98] Georg Gottlob and Gianluigi Greco. Decomposing combinatorial auctions and set packing problems. Journal of the ACM, 60(4):24:1–24:39, September 2013. [Cited on pp. 7]

[99] Lars Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, 31(4):2029–2054, 2010. [Cited on pp. 52]

[100] William Gropp and Jorge J. Moré. Optimization environments and the NEOS Server. In Martin D. Buhman and Arieh Iserles, editors, Approximation Theory and Optimization, pages 167–182. Cambridge University Press, 1997. [Cited on pp. 96]

[101] Hermann Gruber. Digraph complexity measures and applications in formal language theory. Discrete Mathematics & Theoretical Computer Science, 14(2):189–204, 2012. [Cited on pp. 5]

[102] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010. [Cited on pp. 47]

[103] Abdou Guermouche, Jean-Yves L'Excellent, and Gil Utard. Impact of reordering on the memory of a multifrontal solver. Parallel Computing, 29(9):1191–1218, 2003. [Cited on pp. 131]

[104] Anshul Gupta. Improved symbolic and numerical factorization algorithms for unsymmetric sparse matrices. SIAM Journal on Matrix Analysis and Applications, 24(2):529–552, 2002. [Cited on pp. 132]

[105] Mahantesh Halappanavar, John Feo, Oreste Villa, Antonino Tumeo, and Alex Pothen. Approximate weighted matching on emerging manycore and multithreaded architectures. International Journal of High Performance Computing Applications, 26(4):413–430, 2012. [Cited on pp. 59]

[106] Magnús M. Halldórsson. Approximating discrete collections via local improvements. In The Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), volume 95, pages 160–169, San Francisco, California, USA, 1995. SIAM. [Cited on pp. 113]

[107] Richard A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970. [Cited on pp. 36]

[108] Nicholas J. A. Harvey. Algebraic algorithms for matching and matroid problems. SIAM Journal on Computing, 39(2):679–702, 2009. [Cited on pp. 76]

[109] Elad Hazan, Shmuel Safra, and Oded Schwartz. On the complexity of approximating k-dimensional matching. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 83–97. Springer, 2003. [Cited on pp. 113]

[110] Elad Hazan, Shmuel Safra, and Oded Schwartz. On the complexity of approximating k-set packing. Computational Complexity, 15(1):20–39, 2006. [Cited on pp. 7]

[111] Bruce Hendrickson. Load balancing fictions, falsehoods and fallacies. Applied Mathematical Modelling, 25:99–108, 2000. [Cited on pp. 41]

[112] Bruce Hendrickson and Tamara G. Kolda. Graph partitioning models for parallel computing. Parallel Computing, 26(12):1519–1534, 2000. [Cited on pp. 41]

[113] Bruce Hendrickson and Robert Leland. The Chaco user's guide, version 1.0. Technical Report SAND93–2339, Sandia National Laboratories, Albuquerque, NM, October 1993. [Cited on pp. 4, 15, 16]

[114] Julien Herrmann, Jonathan Kho, Bora Uçar, Kamer Kaya, and Ümit V. Çatalyürek. Acyclic partitioning of large directed acyclic graphs. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID, pages 371–380, Madrid, Spain, May 2017. [Cited on pp. 15, 16, 17, 18, 28, 30]

[115] Julien Herrmann, M. Yusuf Özkaya, Bora Uçar, Kamer Kaya, and Ümit V. Çatalyürek. Multilevel algorithms for acyclic partitioning of directed acyclic graphs. SIAM Journal on Scientific Computing, 2019, to appear. [Cited on pp. 19, 20, 26, 29, 31]

[116] Jonathan D. Hogg, John K. Reid, and Jennifer A. Scott. Design of a multicore sparse Cholesky factorization using DAGs. SIAM Journal on Scientific Computing, 32(6):3627–3649, 2010. [Cited on pp. 131]

[117] John E. Hopcroft and Richard M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973. [Cited on pp. 6, 62]

[118] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, New York, 1985. [Cited on pp. 109]

[119] Cor A. J. Hurkens and Alexander Schrijver. On the size of systems of sets every t of which have an SDR, with an application to the worst-case ratio of heuristics for packing problems. SIAM Journal on Discrete Mathematics, 2(1):68–72, 1989. [Cited on pp. 113, 120]

[120] Oscar H. Ibarra and Shlomo Moran. Deterministic and probabilistic algorithms for maximum bipartite matching via fast matrix multiplication. Information Processing Letters, pages 12–15, 1981. [Cited on pp. 76]

[121] Inah Jeon, Evangelos E. Papalexakis, U Kang, and Christos Faloutsos. HaTen2: Billion-scale tensor decompositions. In IEEE 31st International Conference on Data Engineering (ICDE), pages 1047–1058, April 2015. [Cited on pp. 156, 161]

[122] Jochen A. G. Jess and Hendrikus Gerardus Maria Kees. A data structure for parallel LU decomposition. IEEE Transactions on Computers, 31(3):231–239, 1982. [Cited on pp. 134]

[123] U Kang, Evangelos Papalexakis, Abhay Harpale, and Christos Faloutsos. GigaTensor: Scaling tensor analysis up by 100 times - Algorithms and discoveries. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 316–324, New York, NY, USA, 2012. ACM. [Cited on pp. 10, 11, 37]

[124] Michał Karoński, Ed Overman, and Boris Pittel. On a perfect matching in a random bipartite digraph with average out-degree below two. arXiv e-prints, page arXiv:1903.05764, Mar 2019. [Cited on pp. 59, 121]

[125] Michał Karoński and Boris Pittel. Existence of a perfect matching in a random (1 + e^{-1})-out bipartite graph. Journal of Combinatorial Theory, Series B, 88:1–16, May 2003. [Cited on pp. 59, 67, 81, 82, 121]

[126] Richard M. Karp. Reducibility among combinatorial problems. In Raymond E. Miller, James W. Thatcher, and Jean D. Bohlinger, editors, Complexity of Computer Computations: The IBM Research Symposia, pages 85–103. Springer US, 1972. [Cited on pp. 7, 142]

[127] Richard M. Karp, Alexander H. G. Rinnooy Kan, and Rakesh V. Vohra. Average case analysis of a heuristic for the assignment problem. Mathematics of Operations Research, 19:513–522, 1994. [Cited on pp. 81]

[128] Richard M. Karp and Michael Sipser. Maximum matching in sparse random graphs. In 22nd Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 364–375, Nashville, TN, USA, 1981. [Cited on pp. 56, 58, 67, 75, 81, 113]

[129] Richard M. Karp, Umesh V. Vazirani, and Vijay V. Vazirani. An optimal algorithm for on-line bipartite matching. In 22nd Annual ACM Symposium on Theory of Computing (STOC), pages 352–358, Baltimore, MD, USA, 1990. [Cited on pp. 57]

[130] George Karypis and Vipin Kumar. MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0. University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Cent., Minneapolis, 1998. [Cited on pp. 4, 16, 23, 27, 142]

[131] George Karypis and Vipin Kumar. Multilevel algorithms for multi-constraint hypergraph partitioning. Technical Report 99-034, University of Minnesota, Department of Computer Science/Army HPC Research Center, Minneapolis, MN 55455, November 1998. [Cited on pp. 34]

[132] Kamer Kaya, Johannes Langguth, Fredrik Manne, and Bora Uçar. Push-relabel based algorithms for the maximum transversal problem. Computers and Operations Research, 40(5):1266–1275, 2013. [Cited on pp. 6, 56, 58]

[133] Kamer Kaya and Bora Uçar. Constructing elimination trees for sparse unsymmetric matrices. SIAM Journal on Matrix Analysis and Applications, 34(2):345–354, 2013. [Cited on pp. 5]

[134] Oguz Kaya and Bora Uçar. Scalable sparse tensor decompositions in distributed memory systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 77:1–77:11, Austin, Texas, 2015. ACM, New York, NY, USA. [Cited on pp. 7]

[135] Oguz Kaya and Bora Uçar. Parallel Candecomp/Parafac decomposition of sparse tensors using dimension trees. SIAM Journal on Scientific Computing, 40(1):C99–C130, 2018. [Cited on pp. 52]

[136] Enver Kayaaslan and Bora Uçar. Reducing elimination tree height for parallel LU factorization of sparse unsymmetric matrices. In 21st International Conference on High Performance Computing (HiPC 2014), pages 1–10, Goa, India, December 2014. [Cited on pp. 136]

[137] Brian W. Kernighan. Optimal sequential partitions of graphs. Journal of the ACM, 18(1):34–40, January 1971. [Cited on pp. 16]

[138] Shiva Kintali, Nishad Kothari, and Akash Kumar. Approximation algorithms for directed width parameters. CoRR, abs/1107.4824, 2011. [Cited on pp. 138]

[139] Philip A. Knight. The Sinkhorn–Knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008. [Cited on pp. 55, 68]

[140] Philip A. Knight and Daniel Ruiz. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33(3):1029–1047, 2013. [Cited on pp. 56, 79]

[141] Philip A. Knight, Daniel Ruiz, and Bora Uçar. A symmetry preserving algorithm for matrix scaling. SIAM Journal on Matrix Analysis and Applications, 35(3):931–955, 2014. [Cited on pp. 55, 56, 68, 79, 105]

[142] Tamara G. Kolda and Brett Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009. [Cited on pp. 10, 37]

[143] Tjalling C. Koopmans and Martin Beckmann. Assignment problems and the location of economic activities. Econometrica, 25(1):53–76, 1957. [Cited on pp. 109]

[144] Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, and James Avery. Fusion of parallel array operations. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pages 71–85, New York, NY, USA, 2016. ACM. [Cited on pp. 3]

[145] Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, Kenneth Skovhede, and Brian Vinter. Bohrium: A virtual machine approach to portable parallelism. In Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages 312–321, Washington, DC, USA, 2014. [Cited on pp. 3]

[146] Janardhan Kulkarni, Euiwoong Lee, and Mohit Singh. Minimum Birkhoff-von Neumann decomposition. Preliminary version http://www.cs.cmu.edu/~euiwoonl/sparsebvn.pdf of the paper which appeared in Proc. 19th International Conference Integer Programming and Combinatorial Optimization (IPCO 2017), Waterloo, ON, Canada, pp. 343–354, June 2017. [Cited on pp. 9]

[147] Sebastian Lamm, Peter Sanders, Christian Schulz, Darren Strash, and Renato F. Werneck. Finding Near-Optimal Independent Sets at Scale. In Proceedings of the 18th Workshop on Algorithm Engineering and Experiments (ALENEX '16), pages 138–150, Arlington, Virginia, USA, 2016. SIAM. [Cited on pp. 114, 127]

[148] Johannes Langguth, Fredrik Manne, and Peter Sanders. Heuristic initialization for bipartite matching problems. Journal of Experimental Algorithmics, 15:1.1–1.22, 2010. [Cited on pp. 6, 56, 58]

[149] Yanzhe (Murray) Lei, Stefanus Jasin, Joline Uichanco, and Andrew Vakhutinsky. Randomized product display (ranking), pricing, and order fulfillment for e-commerce retailers. SSRN Electronic Journal, 2018. [Cited on pp. 9]

[150] Thomas Lengauer. Combinatorial Algorithms for Integrated Circuit Layout. Wiley–Teubner, Chichester, UK, 1990. [Cited on pp. 34]

[151] Renaud Lepère and Denis Trystram. A new clustering algorithm for large communication delays. In 16th International Parallel and Distributed Processing Symposium (IPDPS), page (CDROM), Fort Lauderdale, FL, USA, 2002. IEEE Computer Society. [Cited on pp. 32]

[152] Hanoch Levy and David W. Low. A contraction algorithm for finding small cycle cutsets. Journal of Algorithms, 9(4):470–493, 1988. [Cited on pp. 142]

[153] Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. Model-driven sparse CP decomposition for higher-order tensors. In IPDPS 2017, 31st IEEE International Symposium on Parallel and Distributed Processing, pages 1048–1057, Orlando, FL, USA, May 2017. [Cited on pp. 52]

[154] Jiajia Li, Yuchen Ma, and Richard Vuduc. ParTI!: A Parallel Tensor Infrastructure for Multicore CPU and GPUs (Version 0.1.0). Available from https://github.com/hpcgarage/ParTI, 2016. [Cited on pp. 156]

[155] Jiajia Li, Jimeng Sun, and Richard Vuduc. HiCOO: Hierarchical storage of sparse tensors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '18, New York, NY, USA, 2018. [Cited on pp. 148, 149, 155, 158]

[156] Jiajia Li, Bora Uçar, Ümit V. Çatalyürek, Jimeng Sun, Kevin Barker, and Richard Vuduc. Efficient and effective sparse tensor reordering. In Proceedings of the ACM International Conference on Supercomputing, ICS '19, pages 227–237, New York, NY, USA, 2019. ACM. [Cited on pp. 147, 156]

[157] Xiaoye S. Li and James W. Demmel. Making sparse Gaussian elimination scalable by static pivoting. In Proc. of the 1998 ACM/IEEE Conference on Supercomputing, SC '98, pages 1–17, Washington, DC, USA, 1998. [Cited on pp. 132]

[158] Athanasios P. Liavas and Nicholas D. Sidiropoulos. Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(20):5450–5463, Oct 2015. [Cited on pp. 11]

[159] Bangtian Liu, Chengyao Wen, Anand D. Sarwate, and Maryam Mehri Dehnavi. A unified optimization approach for sparse tensor operations on GPUs. In 2017 IEEE International Conference on Cluster Computing (CLUSTER), pages 47–57, Sept 2017. [Cited on pp. 160]

[160] Joseph W. H. Liu. Computational models and task scheduling for parallel sparse Cholesky factorization. Parallel Computing, 3(4):327–342, 1986. [Cited on pp. 131, 134]

[161] Joseph W. H. Liu. A graph partitioning algorithm by node separators. ACM Transactions on Mathematical Software, 15(3):198–219, 1989. [Cited on pp. 140]

[162] Joseph W. H. Liu. Reordering sparse matrices for parallel elimination. Parallel Computing, 11(1):73–91, 1989. [Cited on pp. 131]

[163] Liang Liu, Jun (Jim) Xu, and Lance Fortnow. Quantized BvND: A better solution for optical and hybrid switching in data center networks. In 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing, pages 237–246, Dec 2018. [Cited on pp. 9]

[164] Zvi Lotker, Boaz Patt-Shamir, and Seth Pettie. Improved distributed approximate matching. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 129–136, Munich, Germany, 2008. ACM. [Cited on pp. 59]

[165] László Lovász. On determinants, matchings, and random algorithms. In FCT, pages 565–574, 1979. [Cited on pp. 76]

[166] László Lovász and Michael D. Plummer. Matching Theory. Elsevier Science Publishers, Netherlands, 1986. [Cited on pp. 76]

[167] Anna Lubiw. Doubly lexical orderings of matrices. SIAM Journal on Computing, 16(5):854–879, 1987. [Cited on pp. 151, 152, 153]

[168] Yuchen Ma, Jiajia Li, Xiaolong Wu, Chenggang Yan, Jimeng Sun, and Richard Vuduc. Optimizing sparse tensor times matrix on GPUs. Journal of Parallel and Distributed Computing, 129:99–109, 2019. [Cited on pp. 160]

[169] Jakob Magun. Greedy matching algorithms: an experimental study. Journal of Experimental Algorithmics, 3:6, 1998. [Cited on pp. 6]

[170] Marvin Marcus and R. Ree. Diagonals of doubly stochastic matrices. The Quarterly Journal of Mathematics, 10(1):296–302, 1959. [Cited on pp. 9]

[171] Nick McKeown. The iSLIP scheduling algorithm for input-queued switches. IEEE/ACM Transactions on Networking, 7(2):188–201, 1999. [Cited on pp. 6]

[172] Amram Meir and John W. Moon. The expected node-independence number of random trees. Indagationes Mathematicae, 76:335–341, 1973. [Cited on pp. 67]

[173] John Mellor-Crummey, David Whaley, and Ken Kennedy. Improving memory hierarchy performance for irregular applications using data and computation reorderings. International Journal of Parallel Programming, 29(3):217–247, Jun 2001. [Cited on pp. 160]

[174] Silvio Micali and Vijay V. Vazirani. An O(√|V| |E|) algorithm for finding maximum matching in general graphs. In 21st Annual Symposium on Foundations of Computer Science, pages 17–27, Syracuse, NY, USA, 1980. IEEE. [Cited on pp. 6, 76]

[175] Orlando Moreira, Merten Popp, and Christian Schulz. Graph partitioning with acyclicity constraints. In 16th International Symposium on Experimental Algorithms, SEA, London, UK, 2017. [Cited on pp. 16, 17, 28, 29, 32]

[176] Orlando Moreira, Merten Popp, and Christian Schulz. Evolutionary multi-level acyclic graph partitioning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO, pages 332–339, Kyoto, Japan, 2018. ACM. [Cited on pp. 17, 25, 32]

[177] Marcin Mucha and Piotr Sankowski. Maximum matchings in planar graphs via Gaussian elimination. In S. Albers and T. Radzik, editors, ESA 2004, pages 532–543, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. [Cited on pp. 76]

[178] Marcin Mucha and Piotr Sankowski. Maximum matchings via Gaussian elimination. In FOCS '04, pages 248–255, Washington, DC, USA, 2004. IEEE Computer Society. [Cited on pp. 76]

[179] Huda Nassar, Georgios Kollias, Ananth Grama, and David F. Gleich. Low rank methods for multiple network alignment. arXiv e-prints, page arXiv:1809.08198, Sep 2018. [Cited on pp. 128]

[180] Jaroslav Nešetřil and Patrice Ossona de Mendez. Tree-depth, subgraph coloring and homomorphism bounds. European Journal of Combinatorics, 27(6):1022–1041, 2006. [Cited on pp. 132]

[181] M. Yusuf Özkaya, Anne Benoit, Bora Uçar, Julien Herrmann, and Ümit V. Çatalyürek. A scalable clustering-based task scheduler for homogeneous processors using DAG partitioning. In 33rd IEEE International Parallel and Distributed Processing Symposium, pages 155–165, Rio de Janeiro, Brazil, May 2019. [Cited on pp. 32]

[182] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6):973–989, December 1987. [Cited on pp. 152, 153]

[183] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover Publications (corrected, unabridged reprint of Combinatorial Optimization: Algorithms and Complexity, originally published by Prentice-Hall Inc., New Jersey, 1982), New York, 1998. [Cited on pp. 29]

[184] Panos M. Pardalos, Tianbing Qian, and Mauricio G. C. Resende. A greedy randomized adaptive search procedure for the feedback vertex set problem. Journal of Combinatorial Optimization, 2(4):399–412, 1998. [Cited on pp. 142]

[185] Md. Mostofa Ali Patwary, Rob H. Bisseling, and Fredrik Manne. Parallel greedy graph matching using an edge partitioning approach. In Proceedings of the Fourth International Workshop on High-level Parallel Programming and Applications, HLPP '10, pages 45–54, New York, NY, USA, 2010. ACM. [Cited on pp. 59]

[186] François Pellegrini. SCOTCH 5.1 User's Guide. Laboratoire Bordelais de Recherche en Informatique (LaBRI), 2008. [Cited on pp. 4, 16, 23]

[187] Daniel M. Pelt and Rob H. Bisseling. A medium-grain method for fast 2D bipartitioning of sparse matrices. In IPDPS 2014, IEEE 28th International Parallel and Distributed Processing Symposium, pages 529–539, Phoenix, Arizona, May 2014. [Cited on pp. 52]

[188] Ioakeim Perros, Evangelos E. Papalexakis, Fei Wang, Richard Vuduc, Elizabeth Searles, Michael Thompson, and Jimeng Sun. SPARTan: Scalable PARAFAC2 for large and sparse data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 375–384, New York, NY, USA, 2017. ACM. [Cited on pp. 156]

[189] Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834–4846, Oct 2013. [Cited on pp. 52]

[190] Juan C. Pichel, Francisco F. Rivera, Marcos Fernández, and Aurelio Rodríguez. Optimization of sparse matrix–vector multiplication using reordering techniques on GPUs. Microprocessors and Microsystems, 36(2):65–77, 2012. [Cited on pp. 160]

[191] Matthias Poloczek and Mario Szegedy. Randomized greedy algorithms for the maximum matching problem with new analysis. In IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 708–717, New Brunswick, NJ, USA, 2012. [Cited on pp. 58]

[192] Alex Pothen. Sparse null bases and marriage theorems. PhD thesis, Dept. Computer Science, Cornell Univ., Ithaca, New York, 1984. [Cited on pp. 68]

[193] Alex Pothen. The complexity of optimal elimination trees. Technical Report CS-88-13, Pennsylvania State Univ., 1988. [Cited on pp. 131]

[194] Alex Pothen and Chin-Ju Fan. Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software, 16(4):303–324, 1990. [Cited on pp. 58, 68, 115]

[195] Alex Pothen, S. M. Ferdous, and Fredrik Manne. Approximation algorithms in combinatorial scientific computing. Acta Numerica, 28:541–633, 2019. [Cited on pp. 58]

[196] Louis-Noël Pouchet. Polybench: The polyhedral benchmark suite. URL: http://web.cse.ohio-state.edu/~pouchet/software/polybench, 2012. [Cited on pp. 25, 26]

[197] Michael O. Rabin and Vijay V. Vazirani. Maximum matchings in general graphs through randomization. Journal of Algorithms, 10(4):557–567, December 1989. [Cited on pp. 76]

[198] Daniel Ruiz. A scaling algorithm to equilibrate both row and column norms in matrices. Technical Report TR-2001-034, RAL, 2001. [Cited on pp. 55, 56]

[199] Peter Sanders and Christian Schulz. Engineering multilevel graph partitioning algorithms. In Camil Demetrescu and Magnús M. Halldórsson, editors, Algorithms – ESA 2011, 19th Annual European Symposium, September 5-9, 2011, pages 469–480, Saarbrücken, Germany, 2011. [Cited on pp. 4, 16, 23]

[200] Robert Schreiber. A new implementation of sparse Gaussian elimination. ACM Transactions on Mathematical Software, 8(3):256–276, 1982. [Cited on pp. 131]

[201] R. Oguz Selvitopi, Muhammet Mustafa Ozdal, and Cevdet Aykanat. A novel method for scaling iterative solvers: Avoiding latency overhead of parallel sparse-matrix vector multiplies. IEEE Transactions on Parallel and Distributed Systems, 26(3):632–645, 2015. [Cited on pp. 51]

[202] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21:343–348, 1967. [Cited on pp. 55, 69, 114]

[203] Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. FROSTT: The formidable repository of open sparse tensors and tools. http://frostt.io/, 2017. [Cited on pp. 52, 126, 156, 161]

[204] Shaden Smith and George Karypis. A medium-grained algorithm for sparse tensor factorization. In 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23-27, 2016, pages 902–911, 2016. [Cited on pp. 37, 52]

[205] Shaden Smith and George Karypis. SPLATT: The Surprisingly ParalleL spArse Tensor Toolkit (Version 1.1.1). Available from https://github.com/ShadenSmith/splatt, 2016. [Cited on pp. 37, 52, 157]

[206] Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George Karypis. SPLATT: Efficient and parallel sparse tensor-matrix multiplication. In 29th IEEE International Parallel & Distributed Processing Symposium, pages 61–70, Hyderabad, India, May 2015. [Cited on pp. 11, 36, 37, 38, 39, 42, 158, 159]

[207] Robert E. Tarjan and Mihalis Yannakakis. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Computing, 13(3):566–579, 1984. [Cited on pp. 149]

[208] Bora Uçar and Cevdet Aykanat. Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies. SIAM Journal on Scientific Computing, 25(6):1837–1859, 2004. [Cited on pp. 47, 51]

[209] Bora Uçar and Cevdet Aykanat. Revisiting hypergraph models for sparse matrix partitioning. SIAM Review, 49(4):595–603, 2007. [Cited on pp. 42]

[210] Vijay V. Vazirani. An improved definition of blossoms and a simpler proof of the MV matching algorithm. CoRR, abs/1210.4594, 2012. [Cited on pp. 76]

[211] David W. Walkup. Matchings in random regular bipartite digraphs. Discrete Mathematics, 31(1):59–64, 1980. [Cited on pp. 59, 67, 76, 81, 114, 121]

[212] Chris Walshaw. Multilevel refinement for combinatorial optimisation problems. Annals of Operations Research, 131(1):325–372, Oct 2004. [Cited on pp. 25]

[213] Eric S. H. Wong, Evangeline F. Y. Young, and Wai-Kei Mak. Clustering based acyclic multi-way partitioning. In Proceedings of the 13th ACM Great Lakes Symposium on VLSI, GLSVLSI '03, pages 203–206, New York, NY, USA, 2003. ACM. [Cited on pp. 4]

[214] Tao Yang and Apostolos Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951–967, 1994. [Cited on pp. 131]

[215] Albert-Jan Nicholas Yzelman and Dirk Roose. High-level strategies for parallel shared-memory sparse matrix-vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 25(1):116–125, Jan 2014. [Cited on pp. 160]


Contents

1 Introduction 1
1.1 Directed graphs 1
1.2 Undirected graphs 5
1.3 Hypergraphs 7
1.4 Sparse matrices 7
1.5 Sparse tensors 9

I Partitioning 13

2 Acyclic partitioning of directed acyclic graphs 15
2.1 Related work 16
2.2 Directed multilevel graph partitioning 18
2.2.1 Coarsening 18
2.2.2 Initial partitioning 21
2.2.3 Refinement 22
2.2.4 Constrained coarsening and initial partitioning 23
2.3 Experiments 25
2.4 Summary, further notes and references 31

3 Distributed memory CP decomposition 33
3.1 Background and notation 34
3.1.1 Hypergraph partitioning 34
3.1.2 Tensor operations 35
3.2 Related work 36
3.3 Parallelization 38
3.3.1 Coarse-grain task model 39
3.3.2 Fine-grain task model 43
3.4 Experiments 47
3.5 Summary, further notes and references 51

II Matching and related problems 53

4 Matchings in graphs 55
4.1 Scaling matrices to doubly stochastic form 55
4.2 Approximation algorithms for bipartite graphs 56
4.2.1 Related work 57
4.2.2 Two matching heuristics 59
4.2.3 Experiments 69
4.3 Approximation algorithms for general undirected graphs 75
4.3.1 Related work 76
4.3.2 One-Out, its analysis, and variants 77
4.3.3 Experiments 82
4.4 Summary, further notes and references 87

5 Birkhoff-von Neumann decomposition 91
5.1 The minimum number of permutation matrices 91
5.1.1 Preliminaries 92
5.1.2 The computational complexity of MinBvNDec 93
5.2 A result on the polytope of BvN decompositions 94
5.3 Two heuristics 97
5.4 Experiments 103
5.5 A generalization of Birkhoff-von Neumann theorem 106
5.5.1 Doubly stochastic splitting 106
5.5.2 Birkhoff von-Neumann decomposition for arbitrary matrices 108
5.6 Summary, further notes and references 109

6 Matchings in hypergraphs 113
6.1 Background and notation 114
6.2 The heuristics 114
6.2.1 Greedy-H 115
6.2.2 Karp-Sipser-H 115
6.2.3 Karp-Sipser-H-scaling 117
6.2.4 Karp-Sipser-H-mindegree 118
6.2.5 Bipartite-reduction 119
6.2.6 Local-Search 120
6.3 Experiments 120
6.4 Summary, further notes and references 128

III Ordering matrices and tensors 129

7 Elimination trees and height-reducing orderings 131
7.1 Background 132
7.2 The critical path length and the elimination tree height 134
7.3 Reducing the elimination tree height 136
7.3.1 Theoretical basis 137
7.3.2 Recursive approach: Permute 137
7.3.3 BBT decomposition: PermuteBBT 138
7.3.4 Edge-separator-based method 139
7.3.5 Vertex-separator-based method 141
7.3.6 BT decomposition: PermuteBT 142
7.4 Experiments 142
7.5 Summary, further notes and references 145

8 Sparse tensor ordering 147
8.1 Three sparse tensor storage formats 147
8.2 Problem definition and solutions 149
8.2.1 BFS-MCS 149
8.2.2 Lexi-Order 151
8.2.3 Analysis 154
8.3 Experiments 156
8.4 Summary, further notes and references 160

IV Closing 165

9 Concluding remarks 167

Bibliography 170


List of Algorithms

1 CP-ALS for the 3rd order tensors 36
2 Mttkrp for the 3rd order tensors 38
3 Coarse-grain Mttkrp for the first mode of third order tensors at process p within CP-ALS 40
4 Fine-grain Mttkrp for the first mode of 3rd order tensors at process p within CP-ALS 44
5 ScaleSK: Sinkhorn-Knopp scaling 56
6 OneSidedMatch for bipartite graph matching 60
7 TwoSidedMatch for bipartite graph matching 62
8 Karp-Sipser-MT for bipartite 1-out graph matching 64
9 One-Out for undirected graph matching 78
10 Karp-Sipser-One-Out for undirected 1-out graph matching 80
11 GreedyBvN for constructing a BvN decomposition 98
12 Karp-Sipser-H-scaling for hypergraph matching 117
13 Row-LU(A) 134
14 Column-LU(A) 134
15 Permute(A) 138
16 FindStrongVertexSeparatorES(G) 140
17 FindStrongVertexSeparatorVS(G) 142
18 BFS-MCS for ordering a tensor dimension 150
19 Lexi-Order for ordering a tensor dimension 153

Chapter 1

Introduction

This document investigates three classes of problems at the interplay of discrete algorithms, combinatorial optimization, and numerical methods. The general research area is called combinatorial scientific computing (CSC). In CSC, the contributions have practical and theoretical flavor. For all problems discussed in this document, we have the design, analysis, and implementation of algorithms along with many experiments. The theoretical results are included in this document, some with proofs; the reader is invited to the original papers for the omitted proofs. A similar approach is taken for presenting the experiments. While most results for observing theoretical findings in practice are included, the reader is referred to the original papers for some other results (e.g., run time analysis).

The three problem classes are those of partitioning, matching, and ordering. We cover two problems from the partitioning class, three from the matching class, and two from the ordering class. Below, we present these seven problems after introducing the notation and recalling the related definitions. We summarize the computing contexts in which the problems arise and highlight our contributions.

1.1 Directed graphs

A directed graph G = (V, E) is a set V of vertices and a set E of directed edges of the form e = (u, v), where e is directed from u to v. A path is a sequence of edges (u1, v1) · (u2, v2) · · · with vi = ui+1. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) is of length ℓ, where it connects a sequence of ℓ + 1 vertices (u1, v1 = u2, . . . , vℓ−1 = uℓ, vℓ). A path is called simple if the connected vertices are distinct. Let u ⇝ v denote a simple path that starts from u and ends at v. We say that a vertex v is reachable from another vertex u if a path connects them. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) forms a (simple) cycle if all vi for 1 ≤ i ≤ ℓ are distinct and u1 = vℓ. A directed acyclic graph, DAG in short, is a directed graph with no cycles. A directed graph is strongly connected if every vertex is reachable from every other vertex.

A directed graph G′ = (V′, E′) is said to be a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. We call G′ the subgraph of G induced by V′, denoted as G[V′], if E′ = E ∩ (V′ × V′). For a vertex set S ⊆ V, we define G − S as the subgraph of G induced by V − S, i.e., G[V − S]. A vertex set C ⊆ V is called a strongly connected component of G if the induced subgraph G[C] is strongly connected and C is maximal in the sense that for all C′ ⊃ C, G[C′] is not strongly connected. A transitive reduction G0 is a minimal subgraph of G such that if u is connected to v in G, then u is connected to v in G0 as well. If G is acyclic, then its transitive reduction is unique.

The path u ⇝ v represents a dependency of v on u. We say that the edge (u, v) is redundant if there exists another u ⇝ v path in the graph. That is, when we remove a redundant (u, v) edge, u remains connected to v, and hence the dependency information is preserved. We use Pred[v] = {u | (u, v) ∈ E} to represent the (immediate) predecessors of a vertex v, and Succ[v] = {u | (v, u) ∈ E} to represent the (immediate) successors of v. We call the neighbors of a vertex v its immediate predecessors and immediate successors: Neigh[v] = Pred[v] ∪ Succ[v]. For a vertex u, the set of vertices v such that u ⇝ v are called the descendants of u. Similarly, the set of vertices v such that v ⇝ u are called the ancestors of the vertex u. The vertices without any predecessors (and hence ancestors) are called the sources of G, and vertices without any successors (and hence descendants) are called the targets of G. Every vertex u has a weight denoted by w_u, and every edge (u, v) ∈ E has a cost denoted by c_uv.

A k-way partitioning of a graph G = (V, E) divides V into k disjoint subsets V_1, . . . , V_k. The weight of a part V_i, denoted by w(V_i), is equal to ∑_{u ∈ V_i} w_u, which is the total vertex weight in V_i. Given a partition, an edge is called a cut edge if its endpoints are in different parts. The edge cut of a partition is defined as the sum of the costs of the cut edges. Usually, a constraint on the part weights accompanies the problem. We are interested in acyclic partitions, which are defined below.

Definition 1.1 (Acyclic k-way partition). A partition {V_1, . . . , V_k} of G = (V, E) is called an acyclic k-way partition if two paths u ⇝ v and v′ ⇝ u′ do not co-exist for u, u′ ∈ V_i, v, v′ ∈ V_j, and 1 ≤ i ≠ j ≤ k.

In other words, for a suitable numbering of the parts, all edges should be directed from a vertex in a part p to another vertex in a part q where p ≤ q. Let us consider a toy example shown in Fig. 1.1a. A partition of the vertices of this graph is shown in Fig. 1.1b with a dashed curve. Since there is a cut edge from s to u and another from u to t, the partition is cyclic. An acyclic partition is shown in Fig. 1.1c, where all the cut edges are from one part to the other.

Figure 1.1 (figure not shown): A toy graph with six vertices and six edges, a cyclic partitioning, and an acyclic partitioning. Panels: (a) a toy graph; (b) a partition ignoring the directions, which is cyclic; (c) an acyclic partitioning.

There is a related notion in the literature [87], which is called a convex partition. A partition is convex if, for all vertex pairs u, v in the same part, the vertices in any u ⇝ v path are also in the same part. Hence, if a partition is acyclic, it is also convex. On the other hand, convexity does not imply acyclicity. Figure 1.2 shows that the definitions of an acyclic partition and a convex partition are not equivalent. For the toy graph in Fig. 1.2a, there are three possible balanced partitions, shown in Figures 1.2b, 1.2c, and 1.2d. They are all convex, but only the one in Fig. 1.2d is acyclic.

Deciding on the existence of a k-way acyclic partition respecting an upper bound on the part weights and an upper bound on the cost of cut edges is NP-complete [93, ND15]. In Chapter 2, we treat this problem, which is formalized as follows.

Problem 1 (DAGP: DAG partitioning). Given a directed acyclic graph G = (V, E) with weights on vertices and an imbalance parameter ε, find an acyclic k-way partition {V_1, . . . , V_k} of V such that the balance constraints

    w(V_i) ≤ (1 + ε) w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut, which is equal to the number of edges whose end points are in two different parts, is minimized.
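To make the acyclicity requirement concrete, the following is a small sketch (my own illustration in Python, not code from Chapter 2) that tests whether a given partition of a DAG admits such a numbering of the parts: it builds the quotient graph of the parts from the cut edges and checks it for a topological order. The names edges, part, and is_acyclic_partition are illustrative assumptions.

    def is_acyclic_partition(edges, part, k):
        # Build the quotient graph of the parts from the cut edges only.
        succ = {p: set() for p in range(k)}
        for u, v in edges:
            if part[u] != part[v]:
                succ[part[u]].add(part[v])
        # The partition is acyclic exactly when this quotient graph has a
        # topological order (Kahn's algorithm below visits all k parts).
        indeg = {p: 0 for p in range(k)}
        for p in range(k):
            for q in succ[p]:
                indeg[q] += 1
        ready = [p for p in range(k) if indeg[p] == 0]
        visited = 0
        while ready:
            p = ready.pop()
            visited += 1
            for q in succ[p]:
                indeg[q] -= 1
                if indeg[q] == 0:
                    ready.append(q)
        return visited == k

On the toy example, the partition of Fig. 1.1b would be reported as cyclic and the one of Fig. 1.1c as acyclic.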

The DAGP problem arises in exposing parallelism in automatic differentiation [53, Ch. 9], and particularly in the computation of the Newton step for solving nonlinear systems [51, 52]. The DAGP problem with some additional constraints is used to reason about the parallel data movement complexity and to dynamically analyze the data locality potential [84, 87]. Other important applications of the DAGP problem include (i) fusing loops for improving temporal locality and enabling streaming and array contractions in runtime systems [144, 145]; (ii) analysis of cache efficient execution of streaming applications on uniprocessors [4]; and (iii) a number of circuit design applications in which the signal directions impose an acyclic partitioning requirement [54, 213].

Figure 1.2 (figure not shown): A toy graph, two cyclic and convex partitions, and an acyclic and convex partition. Panels: (a) a toy graph on the vertices a, b, c, d; (b) and (c) cyclic and convex partitions; (d) an acyclic and convex partition.

In Chapter 2, we address the DAGP problem. We adopt the multilevel approach [35, 113] for this problem. This approach is the de-facto standard for solving graph and hypergraph partitioning problems efficiently, and it is used by almost all current state-of-the-art partitioning tools [27, 40, 113, 130, 186, 199]. The algorithms following this approach have three phases: the coarsening phase, which reduces the number of vertices by clustering them; the initial partitioning phase, which finds a partition of the coarsest graph; and the uncoarsening phase, in which the initial partition is projected to the finer graphs and refined along the way, until a solution for the original graph is obtained. We focus on two-way partitioning (sometimes called bisection), as this scheme can be used in a recursive way for multiway partitioning. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. We also propose effective ways to use the standard undirected graph partitioning methods in our multilevel scheme. We perform a large set of experiments on a dataset and report significant improvements compared to the current state of the art.
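The following is a minimal sketch of this three-phase scheme, assuming Python and placeholder callables (coarsen, initial_partition, refine) standing in for the chapter's acyclicity-preserving heuristics; it only shows how the phases fit together, not the actual algorithms of Chapter 2.

    def multilevel_bisection(graph, coarsen, initial_partition, refine, coarsest_size=64):
        hierarchy = []
        # Coarsening phase: cluster vertices level by level.
        while len(graph) > coarsest_size:
            coarser, cluster_of = coarsen(graph)   # cluster_of: fine vertex -> coarse vertex
            hierarchy.append((graph, cluster_of))
            graph = coarser
        # Initial partitioning phase on the coarsest graph.
        part = initial_partition(graph)            # e.g., part: vertex -> {0, 1}
        # Uncoarsening phase: project the partition back and refine at each level.
        for finer, cluster_of in reversed(hierarchy):
            part = {v: part[cluster_of[v]] for v in finer}
            part = refine(finer, part)
        return part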

A rooted tree T is an ordered triple (V, r, ρ), where V is the vertex set, r ∈ V is the root vertex, and ρ gives the immediate ancestor/descendant relation. For two vertices v_i, v_j such that ρ(v_j) = v_i, we call v_i the parent vertex of v_j and call v_j a child vertex of v_i. With this definition, the root r is the ancestor of every other node in T. The children vertices of a vertex are called sibling vertices to each other. The height of T, denoted as h(T), is the number of vertices in the longest path from a leaf to the root r.

Let G = (V, E) be a directed graph with n vertices which are ordered (or labeled as 1, . . . , n). The elimination tree of a directed graph is defined as follows [81]. For any i ≠ n, ρ(i) = j whenever j > i is the minimum vertex id such that the vertices i and j lie in the same strongly connected component of the subgraph G[{1, . . . , j}]; that is, i and j are in the same strongly connected component of the subgraph containing the first j vertices, and j is the smallest such number. Currently, the algorithm with the best worst-case time complexity has a run time of O(m log n) [133]; another algorithm with the worst-case time complexity of O(mn) [83] also performs well in practice. As the definition of the elimination tree implies, the properties of the tree depend on the ordering of the graph.
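As an illustration of the definition only (not of the efficient algorithms cited above), the sketch below computes the elimination tree directly from the definition, assuming vertices are labeled 0, . . . , n-1 in the given order and that networkx is available for the strongly connected components.

    import networkx as nx

    def elimination_tree(G):
        # G is an nx.DiGraph on vertices 0, ..., n-1 in the ordering under
        # consideration; parent[i] is the parent of vertex i, or None for roots.
        n = G.number_of_nodes()
        parent = {i: None for i in range(n)}
        for j in range(n):
            H = G.subgraph(range(j + 1))           # the subgraph on the first j+1 vertices
            for component in nx.strongly_connected_components(H):
                if j in component:
                    for i in component:
                        if i < j and parent[i] is None:
                            parent[i] = j          # the smallest such j, since j increases
                    break
        return parent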

The elimination tree is a recent model playing important roles in sparse LU factorization. This tree captures the dependencies between the tasks of some well-known variants of sparse LU factorization (we review these in Chapter 7). Therefore, the height of the elimination tree corresponds to the critical path length of the task dependency graph in the corresponding parallel LU factorization methods. In Chapter 7, we investigate the problem of finding minimum height elimination trees, formally defined below, to expose a maximum degree of parallelism by minimizing the critical path length.

Problem 2 (Minimum height elimination tree). Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The minimum height of an elimination tree of a given directed graph corresponds to a directed graph parameter known as the cycle-rank [80]. The problem being NP-complete [101], we propose heuristics in Chapter 7, which generalize the most successful approaches used for symmetric matrices to unsymmetric ones.

1.2 Undirected graphs

An undirected graph G = (V, E) is a set V of vertices and a set E of edges. Bipartite graphs are undirected graphs where the vertex set can be partitioned into two sets, with all edges connecting vertices in the two parts. Formally, G = (V_R ∪ V_C, E) is a bipartite graph, where V = V_R ∪ V_C is the vertex set, and for each edge e = (r, c) we have r ∈ V_R and c ∈ V_C. The number of edges incident on a vertex is called its degree. A path in a graph is a sequence of vertices such that each consecutive vertex pair share an edge. A vertex is reachable from another one if there is a path between them. The connected components of a graph are the equivalence classes of vertices under the "is reachable from" relation. A cycle in a graph is a path whose start and end vertices are the same. A simple cycle is a cycle with no vertex repetitions. A tree is a connected graph with no cycles. A spanning tree of a connected graph G is a tree containing all vertices of G.

A matching M in a graph G = (V, E) is a subset of edges E where a vertex in V is in at most one edge in M. Given a matching M, a vertex v is said to be matched by M if v is in an edge of M; otherwise, v is called unmatched. If all the vertices are matched by M, then M is said to be a perfect matching. The cardinality of a matching M, denoted by |M|, is the number of edges in M. In Chapter 4, we treat the following problem, which asks to find approximate matchings.

Problem 3 (Approximating the maximum cardinality of a matching in undirected graphs). Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.
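For reference, the classical random-order greedy heuristic sketched below (not one of the heuristics proposed in this document) is the kind of simple, linear-time baseline that Problem 3 is concerned with; since the matching it returns is maximal, its cardinality is at least half the maximum.

    import random

    def greedy_matching(n, edges):
        order = list(edges)              # edges: list of (u, v) pairs on vertices 0, ..., n-1
        random.shuffle(order)            # visit the edges in random order
        matched = [False] * n
        matching = []
        for u, v in order:
            if not matched[u] and not matched[v]:
                matched[u] = matched[v] = True
                matching.append((u, v))
        return matching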

There are a number of polynomial time algorithms to solve the maximum cardinality matching problem exactly in graphs. The lowest worst-case time complexity of the known algorithms is O(√n τ) for a graph with n vertices and τ edges. For bipartite graphs, the first of such algorithms is described by Hopcroft and Karp [117]; certain variants of the push-relabel algorithm [96] also have the same complexity (we give a classification for maximum cardinality matchings in bipartite graphs in a recent paper [132]). For general graphs, algorithms with the same complexity are known [30, 92, 174]. There is considerable interest in simpler and faster algorithms that have some approximation guarantee [148]. Such algorithms are used as a jump-start routine by the current state-of-the-art exact maximum matching algorithms [71, 148, 169]. Furthermore, there are applications [171] where large cardinality matchings are used.

The specializations of Problem 3 for bipartite graphs and general undirected graphs are described in Section 4.2 and Section 4.3, respectively. Our focus is on approximation algorithms which expose parallelism and have linear run time in practical settings. We propose randomized heuristics and analyze their results. The proposed heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum cardinality matching on the subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge is obtained by scaling a suitable matrix to be doubly stochastic. The heuristics for the bipartite graphs and general undirected graphs share this general approach and differ in the analysis. To the best of our knowledge, the proposed heuristics have the highest constant factor approximation ratio (around 0.86 for large graphs).
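A minimal sketch of this general approach, assuming numpy and dense matrices for brevity (my own illustration; the actual heuristics and their analysis are in Chapter 4): Sinkhorn-Knopp-style iterations scale the adjacency matrix toward a doubly stochastic one, and each row vertex then selects one neighbor with the scaled entries as probabilities; a maximum matching of the resulting sparse "one-out" subgraph serves as the approximate matching.

    import numpy as np

    def sinkhorn_knopp(A, iterations=20):
        # A: nonnegative matrix (e.g., the adjacency matrix of a bipartite graph)
        # with total support; alternate row and column normalizations.
        S = np.array(A, dtype=float)
        for _ in range(iterations):
            S = S / S.sum(axis=1, keepdims=True)
            S = S / S.sum(axis=0, keepdims=True)
        return S

    def one_out_edges(S, rng=None):
        # Every row vertex keeps a single edge, chosen with the scaled entries
        # of its row as the probability density function.
        rng = rng or np.random.default_rng()
        n = S.shape[0]
        return [(i, rng.choice(n, p=S[i] / S[i].sum())) for i in range(n)]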

1.3 Hypergraphs

A hypergraph H = (V, E) is defined as a set of vertices V and a set of hyperedges E. Each hyperedge is a set of vertices. A matching in a hypergraph is a set of disjoint hyperedges.

A hypergraph H = (V, E) is called d-partite and d-uniform if V = ⋃_{i=1}^{d} V_i with disjoint V_i's, and every hyperedge contains a single vertex from each V_i. In Chapter 6, we investigate the problem of finding matchings in hypergraphs, formally defined as follows.

Problem 4 (Maximum matching in d-partite, d-uniform hypergraphs). Given a d-partite, d-uniform hypergraph, find a matching of maximum size.

The problem is NP-complete for d ≥ 3; the 3-partite case is called the Max-3-DM problem [126] and is an important problem in combinatorial optimization. It is a special case of the d-Set-Packing problem [110]. It has been shown that d-Set-Packing is hard to approximate within a factor of O(d / log d) [110]. The maximum/perfect set packing problem has many applications, including combinatorial auctions [98] and personnel scheduling [91]. Matchings in hypergraphs can also be used in the coarsening phase of multilevel hypergraph partitioning tools [40]. In particular, d-uniform and d-partite hypergraphs arise in modeling tensors [134], and heuristics for Problem 4 can be used to partition these hypergraphs by plugging them in the current state-of-the-art partitioning tools. We propose heuristics for Problem 4. We first generalize some graph matching heuristics, then propose a novel heuristic based on tensor scaling to improve the quality via judicious hyperedge selections. Experiments on random, synthetic, and real-life hypergraphs show that this new heuristic is highly practical and superior to the others on finding a matching with large cardinality.
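As a simple point of reference for the heuristics of Chapter 6, the following sketch (my own illustration, not the thesis code) is a Greedy-style heuristic for d-partite, d-uniform hypergraphs: it scans the hyperedges in some order and keeps every hyperedge whose d vertices are all still free, which yields a maximal matching.

    def greedy_hypergraph_matching(hyperedges):
        # Each hyperedge is a tuple of d vertices, one per part, with globally
        # distinct vertex identifiers across the parts.
        used = set()
        matching = []
        for e in hyperedges:
            if all(v not in used for v in e):
                matching.append(e)
                used.update(e)
        return matching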

1.4 Sparse matrices

Bold upper case Roman letters are used for matrices, as in A. Matrix elements are shown with the corresponding lowercase letters, for example a_ij. Matrix sizes are sometimes shown in the lower right corner, e.g., A_{I×J}. Matlab notation is used to refer to the entire rows and columns of a matrix, e.g., A(i, :) and A(:, j) refer to the ith row and jth column of A, respectively. A permutation matrix is a square {0, 1} matrix with exactly one nonzero in each row and each column. A sub-permutation matrix is a square {0, 1} matrix with at most one nonzero in a row or a column.


A graph G = (V, E) can be represented as a sparse matrix A_G, called the adjacency matrix. In this matrix, an entry a_ij = 1 if the vertices v_i and v_j in G are incident on a common edge, and a_ij = 0 otherwise. If G is bipartite with V = V_R ∪ V_C, we identify each vertex in V_R with a unique row of A_G and each vertex in V_C with a unique column of A_G. If G is not bipartite, a vertex is identified with a unique row-column pair. A directed graph G_D = (V, E) with vertex set V and edge set E can be associated with an n × n sparse matrix A. Here, |V| = n, and for each a_ij ≠ 0 where i ≠ j, we have a directed edge from v_i to v_j.
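As a small illustration of the reverse direction (assuming scipy; not code from this document), the bipartite graph associated with an m × n sparse matrix can be read off directly from the coordinate representation of its nonzeros:

    import scipy.sparse as sp

    def bipartite_graph_of(A):
        # Rows and columns become the two vertex classes; every nonzero a_ij
        # becomes an edge between row vertex i and column vertex j.
        A = sp.coo_matrix(A)
        VR = [('r', i) for i in range(A.shape[0])]
        VC = [('c', j) for j in range(A.shape[1])]
        E = [(('r', i), ('c', j)) for i, j in zip(A.row, A.col)]
        return VR, VC, E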

We also associate graphs with matrices. For a given m × n matrix A, we can associate a bipartite graph G = (V_R ∪ V_C, E) so that the zero-nonzero pattern of A is equivalent to A_G. If the matrix is square and has a symmetric zero-nonzero pattern, we can similarly associate an undirected graph by ignoring the diagonal entries. Finally, if the matrix A is square and has an unsymmetric zero-nonzero pattern, we can associate a directed graph G_D = (V, E), where (v_i, v_j) ∈ E iff a_{ij} ≠ 0.

An n × n matrix A ≠ 0 is said to have support if there is a perfect matching in the associated bipartite graph. An n × n matrix A is said to have total support if each edge in its bipartite graph can be put into a perfect matching. A square sparse matrix is called irreducible if its directed graph is strongly connected. A square sparse matrix A is called fully indecomposable if for a permutation matrix Q, the matrix B = AQ has a zero-free diagonal and the directed graph associated with B is irreducible. Fully indecomposable matrices have total support, but a matrix with total support could be a block diagonal matrix where each block is fully indecomposable. For more formal definitions of support, total support, and full indecomposability, see for example the book by Brualdi and Ryser [34, Ch. 3 and Ch. 4].
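Whether a square sparse matrix has support can be checked by computing a maximum matching in its bipartite graph and testing whether the matching is perfect. The sketch below uses SciPy's bipartite matching routine on the sparsity pattern; it is an illustration only, not code from this document.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import maximum_bipartite_matching

    def has_support(A):
        # A has support iff the bipartite graph of its nonzeros
        # (rows versus columns) admits a perfect matching.
        pattern = csr_matrix((A != 0).astype(np.int8))
        match = maximum_bipartite_matching(pattern, perm_type='column')
        return bool((match >= 0).all())

    A = np.array([[1.0, 0.0],
                  [2.0, 3.0]])
    print(has_support(A))  # True: the diagonal is a perfect matching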

Let A = [a_{ij}] ∈ ℝ^{n×n}. The matrix A is said to be doubly stochastic if a_{ij} ≥ 0 for all i, j and Ae = Aᵀe = e, where e is the vector of all ones. By Birkhoff's Theorem [25], there exist α_1, α_2, ..., α_k ∈ (0, 1) with Σ_{i=1}^{k} α_i = 1 and permutation matrices P_1, P_2, ..., P_k such that

    A = α_1 P_1 + α_2 P_2 + · · · + α_k P_k .    (1.1)

This representation is also called Birkhoff-von Neumann (BvN) decomposition. We refer to the scalars α_1, α_2, ..., α_k as the coefficients of the decomposition.

Doubly stochastic matrices and their associated BvN decompositions have been used in several operations research problems and applications. Classical examples are concerned with allocating communication resources, where an input traffic is routed to an output traffic in stages [43]. Each routing stage is a (sub-)permutation matrix and is used for handling a disjoint set of communications. The number of stages corresponds to the number of (sub-)permutation matrices. A recent variant of this allocation problem arises as a routing problem in data centers [146, 163]. Heuristics for the BvN decomposition have been used in designing business strategies for e-commerce retailers [149]. A display matrix (which shows what needs to be displayed in total) is constructed, which happens to be doubly stochastic. Then a BvN decomposition of this matrix is obtained, in which each permutation matrix used is a display shown to a shopper. A small number of permutation matrices is preferred; furthermore, only slight changes between consecutive permutation matrices are desired (we do not deal with this last issue in this document).

The BvN decomposition of a doubly stochastic matrix as a convex combination of permutation matrices is not unique in general. The Marcus-Ree Theorem [170] states that there is always a decomposition with k ≤ n² − 2n + 2 permutation matrices for dense n × n matrices. Brualdi and Gibson [33] and Brualdi [32] show that for a fully indecomposable n × n sparse matrix with τ nonzeros, one has the equivalent expression k ≤ τ − 2n + 2. In Chapter 5, we investigate the problem of finding the minimum number k of permutation matrices in the representation (1.1). More formally, the MinBvNDec problem is defined as follows.

Problem 5 (MinBvNDec: A BvN decomposition with the minimum number of permutation matrices). Given a doubly stochastic matrix A, find a BvN decomposition (1.1) of A with the minimum number k of permutation matrices.

Brualdi [32, p. 197] investigates the same problem and concludes that this is a difficult problem. We continue along this line and show that the MinBvNDec problem is NP-hard. We also propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries, and use this generalization for designing preconditioners for iterative linear system solvers.
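For intuition, any doubly stochastic matrix can be decomposed by repeatedly peeling off a permutation: find a perfect matching on the positive entries of the current matrix, use the smallest matched entry as the coefficient, and subtract. The sketch below implements this classical Birkhoff construction with SciPy; it is a generic baseline, not the heuristic proposed in Chapter 5, which selects the matchings more carefully to keep the number of permutation matrices small.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import maximum_bipartite_matching

    def greedy_bvn(A, tol=1e-12):
        # Peel permutation matrices off a doubly stochastic matrix A.
        # Returns (alphas, perms) with A ~ sum_k alphas[k] * P_k, where
        # perms[k][i] = j means P_k has its row-i nonzero in column j.
        R = A.astype(float).copy()
        n = R.shape[0]
        alphas, perms = [], []
        while R.max() > tol:
            # Perfect matching on the pattern of the remaining positive entries.
            pattern = csr_matrix((R > tol).astype(np.int8))
            match = maximum_bipartite_matching(pattern, perm_type='column')
            if (match < 0).any():    # no perfect matching left (numerical issues)
                break
            alpha = R[np.arange(n), match].min()
            alphas.append(alpha)
            perms.append(match.copy())
            R[np.arange(n), match] -= alpha    # subtract alpha * P_k
        return alphas, perms

    A = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.25, 0.25],
                  [0.0, 0.25, 0.75]])
    alphas, perms = greedy_bvn(A)
    print(alphas, [p.tolist() for p in perms])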

1.5 Sparse tensors

Tensors are multidimensional arrays, generalizing matrices to higher orders. We use calligraphic font to refer to tensors, e.g., X. Let X be a d-dimensional tensor whose size is n_1 × · · · × n_d. As in matrices, an element of a tensor is denoted by a lowercase letter and subscripts corresponding to the indices of the element, e.g., x_{i_1,...,i_d}, where i_j ∈ {1, ..., n_j}. A marginal of a tensor is a (d−1)-dimensional slice of a d-dimensional tensor, obtained by fixing one of its indices. A d-dimensional tensor where the entries in each of its marginals sum to one is called d-stochastic. In a d-stochastic tensor, all dimensions necessarily have the same size n. A d-stochastic tensor where each marginal contains exactly one nonzero entry (equal to one) is called a permutation tensor.

A d-partite, d-uniform hypergraph H = (V_1 ∪ · · · ∪ V_d, E) corresponds naturally to a d-dimensional tensor X by associating each vertex class with a dimension of X. Let |V_i| = n_i. Let X ∈ {0, 1}^{n_1×···×n_d} be a tensor with nonzero elements x_{v_1,...,v_d} corresponding to the edges (v_1, ..., v_d) of H. In this case, X is called the adjacency tensor of H.

Tensors are used in modeling multifactor or multirelational datasets, such as product reviews at an online store [21], natural language processing [123], and multi-sensor data in signal processing [49], among many others [142]. Tensor decomposition algorithms are used as an important tool to understand the tensors and glean hidden or latent information. One of the most common tensor decomposition approaches is the Candecomp/Parafac (CP) formulation, which approximates a given tensor as a sum of rank-one tensors. More precisely, for a given positive integer R called the decomposition rank, the CP decomposition approximates a d-dimensional tensor by a sum of R outer products of column vectors (one for each dimension). In particular, for a three dimensional I × J × K tensor X, CP asks for a rank-R approximation X ≈ Σ_{r=1}^{R} a_r ∘ b_r ∘ c_r, with vectors a_r, b_r, and c_r of size I, J, and K, respectively. Here ∘ denotes the outer product of the vectors, and hence x_{ijk} ≈ Σ_{r=1}^{R} a_r(i) · b_r(j) · c_r(k). By aggregating these column vectors, one forms the matrices A, B, and C of size I × R, J × R, and K × R, respectively. The matrices A, B, and C are called the factor matrices.

The computational core of the most common CP decomposition methods is an operation called the matricized tensor-times-Khatri-Rao product (Mttkrp). For the same tensor X, this core operation can be described succinctly as

    foreach x_{ijk} ∈ X do
        M_A(i, :) ← M_A(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]    (1.2)

where B and C are input matrices of sizes J × R and K × R, respectively; M_A is the output matrix of size I × R, which is initialized to zero before the loop. For each tensor nonzero x_{ijk}, the jth row of the matrix B is multiplied element-wise with the kth row of the matrix C, the result is scaled with the tensor entry x_{ijk}, and added to the ith row of M_A. In the well known CP decomposition algorithm based on alternating least squares (CP-ALS), the two factor matrices B and C are initialized at the beginning; then the factor matrix A is computed by multiplying M_A with the (pseudo-)inverse of the Hadamard product of BᵀB and CᵀC. Since BᵀB and CᵀC are of size R × R, their multiplication and inversion do not pose any computational challenges. Afterwards, the same steps are performed to compute a new B, and then a new C, and this process of computing the new factors one at a time by assuming the others fixed is repeated iteratively. In Chapter 3, we investigate how to efficiently parallelize the Mttkrp operations within the CP-ALS algorithm in distributed memory systems. In particular, we investigate the following problem:

Problem 6 (Parallelizing CP decomposition). Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

More precisely, in Chapter 3 we describe parallel implementations of CP-ALS and how the parallelization problem can be cast as a partitioning problem in hypergraphs. In this document, we do not develop heuristics for the hypergraph partitioning problem; we describe hypergraphs modeling the computations and make use of the well-established tools for partitioning these hypergraphs. The Mttkrp operation also arises in the gradient descent based methods [2] to compute CP decomposition. Furthermore, it has been implemented as a stand-alone routine [17] to enable algorithm development for variants of the core methods [158]. That is why it has been investigated for efficient execution in different computational frameworks such as Matlab [11, 18], MapReduce [123], shared memory [206], and distributed memory [47].
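To make the core operation concrete, the snippet below carries out the loop (1.2) for the first mode of a third-order tensor stored in coordinate (COO) format with NumPy. It is a sequential reference version only; the distributed-memory algorithms of Chapter 3 partition this work across processors.

    import numpy as np

    def mttkrp_mode1(I, rows, cols, tubes, vals, B, C):
        # M_A(i, :) += x_ijk * (B(j, :) * C(k, :)) for every nonzero x_ijk.
        R = B.shape[1]
        MA = np.zeros((I, R))
        for i, j, k, x in zip(rows, cols, tubes, vals):
            MA[i, :] += x * (B[j, :] * C[k, :])   # elementwise (Hadamard) product
        return MA

    # Toy tensor with two nonzeros: x_{0,1,0} = 2.0 and x_{1,0,1} = 3.0.
    rows, cols, tubes, vals = [0, 1], [1, 0], [0, 1], [2.0, 3.0]
    B = np.ones((2, 4))
    C = np.ones((2, 4))
    print(mttkrp_mode1(2, rows, cols, tubes, vals, B, C))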

Consider again the Mttkrp operation and the execution of the operations as shown in the body of the displayed for-loop (1.2). As seen in this loop, the rows of the factor matrices and the output matrix are accessed many times, according to the sparsity pattern of the tensor X. In a cache-based system, the computations will be much more efficient if we take advantage of spatial and temporal locality. This would be achieved if we could reorder the nonzeros such that the multiplications involving a given row (of M_A, B, and C) come one after another. We need this to hold when executing the Mttkrps for computing the new factor matrices B and C as well. The issue is more easily understood in terms of the sparse matrix-vector multiplication operation y ← Ax. If we have two nonzeros in the same row, a_{ij} and a_{ik}, we would prefer an ordering that maps j and k next to each other and performs the associated scalar multiplications a_{ij'}x_{j'} and a_{ik'}x_{k'} one after the other, where j' is the new index id for j and k' is the new index id for k. Furthermore, we would like the same sort of locality for all indices, and for the multiplication x ← Aᵀy as well. We identify two related problems here. One of them is the choice of data structure for taking advantage of shared indices, and the other is to reorder the indices so that nonzeros sharing an index in a dimension have close-by indices in the other dimensions. In this document, we do not propose a data structure; our aim is to reorder the indices such that the locality is improved for three common data structures used in the current state of the art. We summarize those three common data structures in Chapter 8 and investigate the following ordering problem. Since the overall problem is based on the data structure choice, we state the problem in a general language.

Problem 7 (Tensor ordering). Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

If we look at the same problem for matrices, the problem can be cast as ordering the rows and the columns of a given matrix so that the nonzeros are clustered around the diagonal. This is also what we seek in the tensor case, where different data structures can benefit from the same ordering thanks to different effects. In Chapter 8, we investigate this problem algorithmically and develop fast heuristics. While the problem in the case of sparse matrices has been investigated from different angles, to the best of our knowledge ours is the first study proposing general ordering tools for tensors, which are analogues of the well-known and widely used sparse matrix ordering heuristics such as Cuthill-McKee [55] and reverse Cuthill-McKee [95].
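For the matrix analogue mentioned above, the effect of such an ordering is easy to observe with SciPy's built-in reverse Cuthill-McKee routine, which permutes a symmetric sparse matrix so that its nonzeros move toward the diagonal. The tensor orderings of Chapter 8 aim at a similar clustering effect; the snippet below only illustrates the matrix case.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    def bandwidth(M):
        # Maximum |i - j| over the nonzeros of a sparse matrix M.
        coo = M.tocoo()
        return int(np.abs(coo.row - coo.col).max())

    A = csr_matrix(np.array([[1, 0, 0, 1, 0],
                             [0, 1, 0, 0, 1],
                             [0, 0, 1, 1, 0],
                             [1, 0, 1, 1, 0],
                             [0, 1, 0, 0, 1]]))
    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    B = A[perm, :][:, perm]              # symmetrically permuted matrix
    print(bandwidth(A), bandwidth(B))    # the bandwidth shrinks after reordering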

Part I

Partitioning


Chapter 2

Multilevel algorithms for acyclic partitioning of directed acyclic graphs

In this chapter, we address the directed acyclic graph partitioning (DAGP) problem. Most of the terms and definitions are standard, which we summarized in Section 1.1. The formal problem definition is repeated below for convenience.

DAG partitioning problem. Given a DAG G = (V, E) and an imbalance parameter ε, find an acyclic k-way partition V_1, ..., V_k of V such that the balance constraints

    w(V_i) ≤ (1 + ε) w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut is minimized.

We adopt the multilevel approach [35, 113], with the coarsening, initial partitioning, and refinement phases, for acyclic partitioning of DAGs. We propose heuristics for these three phases (Sections 2.2.1, 2.2.2, and 2.2.3, respectively), which guarantee acyclicity of the partitions at all phases and maintain a DAG at every level. We strove to have fast heuristics at the core. With these characterizations, the coarsening phase requires new algorithmic/theoretical reasoning, while the initial partitioning and refinement heuristics are direct adaptations of the standard methods used in undirected graph partitioning, with some differences worth mentioning. We discuss only the bisection case, as we were able to improve the direct k-way algorithms we proposed before [114] by using the bisection heuristics recursively.

The acyclicity constraint on the partitions forbids the use of the state of the art undirected graph partitioning tools. This has been recognized before, and those tools were put aside [114, 175]. While this is true, one can still try to make use of the existing undirected graph partitioning tools [113, 130, 186, 199], as they have been very well engineered. Let us assume that we have partitioned a DAG with an undirected graph partitioning tool into two parts by ignoring the directions. It is easy to detect if the partition is cyclic, since all the edges need to go from the first part to the second one. If a partition is cyclic, we can easily fix it as follows. Let v be a vertex in the second part; we can move all vertices to which there is a path from v into the second part. This procedure breaks any cycle containing v, and hence the partition becomes acyclic. However, the edge cut may increase and the partitions can be unbalanced. To solve the balance problem and reduce the cut, we can apply move-based refinement algorithms. After this step, the final partition meets the acyclicity and balance conditions. Depending on the structure of the input graph, it could also be a good initial partition for reducing the edge cut. Indeed, one of our most effective schemes uses an undirected graph partitioning algorithm to create a (potentially cyclic) partition, fixes the cycles in the partition, and refines the resulting acyclic partition with a novel heuristic to obtain an initial partition. We then integrate this partition within the proposed coarsening approaches to refine it at different granularities.
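The cycle-fixing step described above is simple to sketch: push every descendant of a vertex already assigned to the second part into the second part as well, so that no edge points from the second part back to the first. The Python sketch below assumes the DAG is given as successor lists and the bisection as a 0/1 array; it illustrates the idea only, not the exact routine used in our tool.

    from collections import deque

    def fix_acyclicity(succ, part):
        # Move all descendants of part-1 vertices into part 1, so that every
        # remaining cut edge goes from part 0 to part 1 (acyclic bisection).
        queue = deque(v for v in range(len(succ)) if part[v] == 1)
        while queue:
            u = queue.popleft()
            for w in succ[u]:
                if part[w] == 0:
                    part[w] = 1      # descendant of a part-1 vertex
                    queue.append(w)
        return part

    # Tiny example: edges 0->1, 1->2, 0->2; the bisection {0, 2 | 1} is cyclic.
    succ = [[1, 2], [2], []]
    print(fix_acyclicity(succ, [0, 1, 0]))  # vertex 2 is pulled into part 1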

2.1 Related work

Fauzia et al. [87] propose a heuristic for the acyclic partitioning problem to optimize data locality when analyzing DAGs. To create partitions, the heuristic categorizes a vertex as ready to be assigned to a partition when all of the vertices it depends on have already been assigned. Vertices are assigned to the current partition set until the maximum number of vertices that would be "active" during the computation of the part reaches a specified limit, which is the cache size in their application. This implies that a part size is not limited by the sum of the total vertex weights in that part, but is a complex function that depends on an external schedule (order) of the vertices. This differs from our problem, as we limit the size of each part by the total sum of the weights of the vertices in that part.

Kernighan [137] proposes an algorithm to find a minimum edge-cut partition of the vertices of a graph into subsets of size greater than a lower bound and smaller than an upper bound. The partition needs to use a fixed vertex sequence that cannot be changed. Indeed, Kernighan's algorithm takes a topological order of the vertices of the graph as an input and partitions the vertices such that all vertices in a subset constitute a continuous block in the given topological order. This procedure is optimal for a given fixed topological order, and has a run time proportional to the number of edges in the graph if the part weights are taken as constant. We used a modified version of this algorithm as a heuristic in the earlier version of our work [114].

Cong et al. [54] describe two approaches for obtaining acyclic partitions of directed Boolean networks modeling circuits. The first one is a single-level Fiduccia-Mattheyses (FM)-based approach. In this approach, Cong et al. generate an initial acyclic partition by splitting the list of the vertices (in a topological order) from left to right into k parts such that the weight of each part does not violate the bound. The quality of the results is then improved with a k-way variant of the FM heuristic [88], taking the acyclicity constraint into account. Our previous work [114] employs a similar refinement heuristic. The second approach of Cong et al. is a two-level heuristic: the initial graph is first clustered with a special decomposition, and then it is partitioned using the first heuristic.

In a recent paper [175], Moreira et al. focus on an imaging and computer vision application on embedded systems and discuss acyclic partitioning heuristics. They propose a single level approach in which an initial acyclic partitioning is obtained using a topological order. To refine the partitioning, they propose four local search heuristics which respect the balance constraint and maintain the acyclicity of the partition. Three heuristics pick a vertex and move it to an eligible part if and only if the move improves the cut. These three heuristics differ in choosing the set of eligible parts for each vertex: some are very restrictive, and some allow arbitrary target parts as long as acyclicity is maintained. The fourth heuristic tentatively realizes the moves that increase the cut in order to escape from a possible local minimum. It has been reported that this heuristic delivers better results than the others. In a follow-up paper, Moreira et al. [176] discuss a multilevel graph partitioner and an evolutionary algorithm based on this multilevel scheme. Their multilevel scheme starts with a given acyclic partition. Then the coarsening phase contracts edges that are in the same part until there is no edge to contract. Here, matching-based heuristics from undirected graph partitioning tools are used without taking the directions of the edges into account. Therefore, the coarsening phase can create cycles in the graph; however, the induced partitions are never cyclic. Then an initial partition is obtained, which is refined during the uncoarsening phase with move-based heuristics. In order to guarantee acyclic partitions, the vertices that lie in cycles are not moved. In a systematic evaluation of the proposed methods, Moreira et al. note that there are many local minima and suggest using relaxed constraints in the multilevel setting. The proposed methods have high run time, as the evolutionary method of Moreira et al. is not concerned with this issue. Improvements with respect to the earlier work [175] are reported.

We propose methods to use an undirected graph partitioner to guide the multilevel partitioner. We focus on partitioning the graph in two parts, since one can handle the general case with a recursive bisection scheme. We also propose new coarsening, initial partitioning, and refinement methods specifically designed for the 2-partitioning problem. Our multilevel scheme maintains acyclic partitions and graphs through all the levels.

2.2 Directed multilevel graph partitioning

Here we summarize the proposed methods for the three phases of the multilevel scheme.

2.2.1 Coarsening

In this phase, we obtain smaller DAGs by coalescing the vertices, level by level. This phase continues until the number of vertices becomes smaller than a specified bound, or the reduction in the number of vertices from one level to the next one is lower than a threshold. At each level ℓ, we start with a finer acyclic graph G_ℓ, compute a valid clustering C_ℓ ensuring the acyclicity, and obtain a coarser acyclic graph G_{ℓ+1}. While our previous work [114] discussed matching based algorithms for coarsening, we present agglomerative clustering based variants here. The new variants supersede the matching based ones. Unlike the standard undirected graph case, in DAG partitioning not all vertices can be safely combined. Consider a DAG with three vertices a, b, c and three edges (a, b), (b, c), (a, c). Here the vertices a and c cannot be combined, since that would create a cycle. We say that a set of vertices is contractible (all its vertices are matchable) if unifying them does not create a cycle. We now present a general theory about finding clusters without forming cycles, after giving some definitions.

Definition 2.1 (Clustering). A clustering of a DAG is a set of disjoint subsets of vertices. Note that we do not make any assumptions on whether the subsets are connected or not.

Definition 2.2 (Coarse graph). Given a DAG G and a clustering C of G, we let G|_C denote the coarse graph created by contracting all sets of vertices of C.

The vertices of the coarse graph are the clusters in C. If (u, v) ∈ G for two vertices u and v that are located in different clusters of C, then G|_C has a (directed) edge from the vertex corresponding to u's cluster to the vertex corresponding to v's cluster.

Definition 2.3 (Feasible clustering). A feasible clustering C of a DAG G is a clustering such that G|_C is acyclic.

Theorem 2.1. Let G = (V, E) be a DAG. For u, v ∈ V and (u, v) ∈ E, the coarse graph G|_{{u,v}} is acyclic if and only if every path from u to v in G contains the edge (u, v).


Theorem 2.1, whose proof is in the journal version of this chapter [115], can be extended to a set of vertices by noting that this time all paths connecting two vertices of the set should contain only the vertices of the set. Neither the theorem nor its extension implies an efficient algorithm, as it requires at least one transitive reduction. Furthermore, it does not describe a condition about two clusters forming a cycle, even if both are individually contractible. In order to address both of these issues, we put a constraint on the vertices that can form a cluster, based on the following definition.

Definition 2.4 (Top level value). For a DAG G = (V, E), the top level value of a vertex u ∈ V is the length of the longest path from a source of G to that vertex. The top level values of all vertices can be computed in a single traversal of the graph with a complexity O(|V| + |E|). We use top[u] to denote the top level of the vertex u.
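The top level values of Definition 2.4 can be obtained in one pass over a topological order, as in the sketch below for a DAG given by successor lists; this is a straightforward illustration of the O(|V| + |E|) computation.

    def top_levels(succ):
        # Length of the longest path from any source to each vertex of a DAG.
        n = len(succ)
        indeg = [0] * n
        for u in range(n):
            for w in succ[u]:
                indeg[w] += 1
        top = [0] * n
        stack = [u for u in range(n) if indeg[u] == 0]   # the sources
        while stack:                                     # topological traversal
            u = stack.pop()
            for w in succ[u]:
                top[w] = max(top[w], top[u] + 1)
                indeg[w] -= 1
                if indeg[w] == 0:
                    stack.append(w)
        return top

    print(top_levels([[1, 2], [2], []]))  # edges 0->1, 1->2, 0->2  ->  [0, 1, 2]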

By restricting the set of edges considered in the clustering to the edges (u, v) ∈ E such that top[u] + 1 = top[v], we ensure that no cycles are formed by contracting a unique cluster (the condition identified in Theorem 2.1 is satisfied). Let C be a clustering of the vertices. Every edge in a cluster of C being contractible is a necessary condition for C to be feasible, but not a sufficient one. More restrictions on the edges of vertices inside the clusters should be found to ensure that C is feasible. We propose three coarsening heuristics based on clustering sets of more than two vertices whose pairwise top level differences are always zero or one.

CoTop Acyclic clustering with forbidden edges

To have an efficient clustering heuristic, we restrict ourselves to static information computable in linear time. As stated in the introduction of this section, we rely on the top level difference of one (or less) for all vertices in the same cluster, and an additional condition to ensure that there will be no cycles when a number of clusters are contracted simultaneously. In Theorem 2.2 below, we give two sufficient conditions for a clustering to be feasible (that is, the graphs at all levels are DAGs); the proof is in the associated journal paper [115].

Theorem 2.2 (Correctness of the proposed clustering). Let G = (V, E) be a DAG and C = {C_1, ..., C_k} be a clustering. If C is such that

• for any cluster C_i, for all u, v ∈ C_i, |top[u] − top[v]| ≤ 1;

• for two different clusters C_i and C_j, and for all u ∈ C_i and v ∈ C_j, either (u, v) ∉ E or top[u] ≠ top[v] − 1;

then the coarse graph G|_C is acyclic.


We design a heuristic based on Theorem 2.2. This heuristic visits all vertices, which are singletons at the beginning, in an order, and adds the visited vertex to a cluster if the criteria mentioned in Theorem 2.2 are met; if not, the vertex stays as a singleton. Among those clusters satisfying the mentioned criteria, the one which saves the largest edge weight (incident on the vertices of the cluster and the vertex being visited) is chosen as the target cluster. We discuss a set of bookkeeping operations in order to implement this heuristic in linear, O(|V| + |E|), time [115, Algorithm 1].

CoCyc Acyclic clustering with cycle detection

We now propose a less restrictive clustering algorithm which also maintains the acyclicity. We again rely on the top level difference of one (or less) for all vertices in the same cluster. Knowing this invariant, when a new vertex is added to a cluster, a cycle-detection algorithm checks that no cycles are formed when all the clusters are contracted simultaneously. This algorithm does not traverse the entire graph, by also using the fact that the top level difference within a cluster is at most one. Nonetheless, detecting such cycles could require a full traversal of the graph.

From Theorem 2.2, we know that if adding a vertex to a cluster whose vertices' top level values are t and t + 1 creates a cycle in the contracted graph, then this cycle goes through the vertices with top level values t or t + 1. Thus, when considering the addition of a vertex u to a cluster C containing v, we check potential cycle formations by traversing the graph starting from u in a breadth-first search (BFS) manner. Let t denote the minimum top level in C. At a vertex w, we normally add a successor y of w to a queue (as in BFS) if |top(y) − t| ≤ 1; if w is in the same cluster as one of its predecessors x, we also add x to the queue if |top(x) − t| ≤ 1. We detect a cycle if at some point in the traversal a vertex from cluster C is reached; if not, the cluster can be extended by adding u. In the worst case, the cycle detection algorithm completes a full graph traversal, but in practice it stops quickly and does not introduce a significant overhead. The details of this clustering scheme, whose worst case run time complexity is O(|V|(|V| + |E|)), can be found in the original paper [115, Algorithm 2].

CoHyb Hybrid acyclic clustering

The cycle detection based algorithm can suffer from quadratic run time for vertices with large in-degrees or out-degrees. To avoid this, we design a hybrid acyclic clustering which uses the relaxed clustering strategy with cycle detection, CoCyc, by default, and switches to the more restrictive clustering strategy, CoTop, for large degree vertices. We define a limit on the degree of a vertex (typically √|V|/10) for calling it large degree. When considering an edge (u, v) where top[u] + 1 = top[v], if the degrees of u and v do not exceed the limit, we use the cycle detection algorithm to determine if we can contract the edge. Otherwise, if the out-degree of u or the in-degree of v is too large, the edge will be contracted if it is allowed. The complexity of this algorithm is in between those of CoTop and CoCyc, and it will likely avoid the quadratic behavior in practice.

2.2.2 Initial partitioning

After the coarsening phase, we compute an initial acyclic partitioning of the coarsest graph. We present two heuristics. One of them is akin to the greedy graph growing method used in the standard graph/hypergraph partitioning methods. The second one uses an undirected partitioning and then fixes the acyclicity of the partitions. Throughout this section, we use (V_0, V_1) to denote the bisection of the vertices of the coarsest graph, and ensure that there is no edge from the vertices in V_1 to those in V_0.

Greedy directed graph growing

One approach to compute a bisection of a directed graph is to design a greedy algorithm that moves vertices from one part to another using local information. We start with all vertices in V_1 and move vertices towards V_0 by using heaps. At any time, the vertices that can be moved to V_0 are in the heap. These vertices are those whose in-neighbors are all in V_0. Initially, only the sources are in the heap, and when all the in-neighbors of a vertex v are moved to V_0, v is inserted into the heap. We separate this process into two phases. In the first phase, the key-values of the vertices in the heap are set to the weighted sum of their incoming edges, and the ties are broken in favor of the vertex closer to the first vertex moved. The first phase continues until the first part has more than 0.9 of the maximum allowed weight (modulo the maximum weight of a vertex). In the second phase, the actual gain of a vertex is used. This gain is equal to the sum of the weights of the incoming edges minus the sum of the weights of the outgoing edges. In this phase, the ties are broken in favor of the heavier vertices. The second phase stops as soon as the required balance is obtained. The reason we separated this heuristic into two phases is that at the beginning the gains are of no importance, and the more vertices become movable, the more flexibility the heuristic has. Yet, towards the end, parts are fairly balanced, and using actual gains should help keep the cut small.

Since the order of the parts is important, we also reverse the roles of the parts and the directions of the edges. That is, we put all vertices in V_0 and move the vertices one by one to V_1, when all out-neighbors of a vertex have been moved to V_1. The proposed greedy directed graph growing heuristic returns the better of these two alternatives.
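A stripped-down version of this growing process, with a single phase and the gain taken as the weighted in-degree, can be sketched as follows. It only illustrates how the rule "a vertex becomes movable once all of its in-neighbors have been moved" drives the growth, and omits the two-phase keys and the tie-breaking described above.

    import heapq

    def greedy_directed_growing(succ, pred, w_edge, target_size):
        # Grow V0 from the sources of a DAG; among the movable vertices,
        # pick the one with the largest total incoming edge weight.
        n = len(succ)
        V0 = set()
        remaining = [len(pred[v]) for v in range(n)]
        heap = [(0.0, v) for v in range(n) if remaining[v] == 0]  # sources
        heapq.heapify(heap)
        while heap and len(V0) < target_size:
            _, v = heapq.heappop(heap)
            V0.add(v)
            for u in succ[v]:
                remaining[u] -= 1
                if remaining[u] == 0:            # all in-neighbors are in V0
                    gain = sum(w_edge[(p, u)] for p in pred[u])
                    heapq.heappush(heap, (-gain, u))   # max-heap via negated key
        return V0

    # Toy DAG: edges 0->1, 0->2, 1->3, 2->3, unit weights; grow a part of size 2.
    succ = [[1, 2], [3], [3], []]
    pred = [[], [0], [0], [1, 2]]
    w = {(0, 1): 1.0, (0, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0}
    print(greedy_directed_growing(succ, pred, w, 2))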


Undirected bisection and fixing acyclicity

In this heuristic, we partition the coarsest graph as if it were undirected, and then move the vertices from one part to another in case the partition was not acyclic. Let (P_0, P_1) denote the (not necessarily acyclic) bisection of the coarsest graph treated as if it were undirected.

The proposed approach arbitrarily designates P_0 as V_0 and P_1 as V_1. One way to fix the cycle is to move all ancestors of the vertices in V_0 to V_0, thereby guaranteeing that there is no edge from vertices in V_1 to vertices in V_0, making the bisection (V_0, V_1) acyclic. We do these moves in a reverse topological order. Another way to fix the acyclicity is to move all descendants of the vertices in V_1 to V_1, again guaranteeing an acyclic partition. We do these moves in a topological order. We then fix the possible unbalance with a refinement algorithm.

Note that we can also initially designate P_1 as V_0 and P_0 as V_1, and can again fix a potential cycle in two different ways. We try all four of these choices and return the best partition (essentially returning the best of the four choices to fix the acyclicity of (P_0, P_1)).

2.2.3 Refinement

This phase projects the partition obtained for a coarse graph to the next, finer one and refines the partition by vertex moves. As in the standard refinement methods, the proposed heuristic is applied in a number of passes. Within a pass, we repeatedly select the vertex with the maximum move gain among those that can be moved. We tentatively realize this move if the move maintains or improves the balance. Then, the most profitable prefix of vertex moves is realized at the end of the pass. As usual, we allow the vertices to move only once in a pass. We use heaps with the gain of moves as the key value, where we keep only movable vertices. We call a vertex movable if moving it to the other part does not create a cyclic partition. We use the notation (V_0, V_1) to designate the acyclic bisection with no edge from vertices in V_1 to vertices in V_0. This means that for a vertex to move from part V_0 to part V_1, one of two conditions should be met: (i) either all its out-neighbors should be in V_1, (ii) or the vertex has no out-neighbors at all. Similarly, for a vertex to move from part V_1 to part V_0, one of two conditions should be met: (i) either all its in-neighbors should be in V_0, (ii) or the vertex has no in-neighbors at all. This is an adaptation of the boundary Fiduccia-Mattheyses heuristic [88] to directed graphs. The notion of movability being more restrictive results in an important simplification. The gain of moving a vertex v from V_0 to V_1 is

    Σ_{u ∈ Succ[v]} w(v, u)  −  Σ_{u ∈ Pred[v]} w(u, v) ,    (2.1)

and the negative of this value when moving it from V_1 to V_0. This means that the gain of a vertex is static: once a vertex is inserted in the heap with the key value (2.1), its key-value is never updated. A move could render some vertices unmovable; if they were in the heap, then they should be deleted. Therefore, the heap data structure needs to support insert, delete, and extract-max operations only.

2.2.4 Constrained coarsening and initial partitioning

There are a number of highly successful undirected graph partitioning tools [130, 186, 199]. They are not immediately usable for our purposes, as the partitions can be cyclic. Fixing such partitions by moving vertices to break the cyclic dependencies among the parts can increase the edge cut dramatically (with respect to the undirected cut). Consider for example the n × n grid graph, where the vertices are at integer positions for i = 1, ..., n and j = 1, ..., n, and a vertex at (i, j) is connected to (i′, j′) when |i − i′| = 1 or |j − j′| = 1, but not both. There is an acyclic orientation of this graph, called spiral ordering, as described in Figure 2.1 for n = 8. This spiral ordering defines a total order. When the directions of the edges are ignored, we can have a bisection with perfect balance by cutting only n = 8 edges with a vertical line. This partition is cyclic, and it can be made acyclic by putting all vertices numbered greater than 32 in the second part. This partition, which puts the vertices 1 to 32 in the first part and the rest in the second part, is the unique acyclic bisection with perfect balance for the associated directed acyclic graph. The edge cut in the directed version is 35, as seen in the figure (gray edges). In general, one has to cut n² − 4n + 3 edges for n ≥ 8: the blue vertices on the border (excluding the corners) have one edge directed to a red vertex, the interior blue vertices have two such edges, and finally the blue vertex labeled n²/2 has three such edges.

Let us investigate the quality of the partitions in practice. We used MeTiS [130] as the undirected graph partitioner on a dataset of 94 matrices (their details are in Section 2.3). The results are given in Figure 2.2. For this preliminary experiment, we partitioned the graphs into two with the maximum allowed load imbalance ε = 3%. The output of MeTiS is acyclic for only two graphs, and the geometric mean of the normalized edge cut is 0.0012. Figure 2.2a shows the normalized edge cut and the load imbalance after fixing the cycles, while Figure 2.2b shows the two measurements after meeting the balance criteria. A normalized edge cut value is computed by normalizing the edge cut with respect to the number of edges.

In both figures, the horizontal lines mark the geometric mean of the normalized edge cuts, and the vertical lines mark the 3% imbalance ratio. In Figure 2.2a, there are 37 instances in which the load balance after fixing the cycles is feasible. The geometric mean of the normalized edge cuts in this sub-figure is 0.0045, while in the other sub-figure it is 0.0049.


Figure 2.1: 8×8 grid graph whose vertices are ordered in a spiral way; a few of the vertices are labeled. All edges are oriented from a lower numbered vertex to a higher numbered one. There is a unique bipartition with 32 vertices on each side. The edges defining the total order are shown in red and blue, except the one from 32 to 33; the cut edges are shown in gray; other internal edges are not shown.

[Figure 2.2 contains two scatter plots of normalized edge cut versus balance: (a) undirected partitioning and fixing cycles; (b) undirected partitioning, fixing cycles, and balancing.]

Figure 2.2: Normalized edge cut (normalized with respect to the number of edges) and the balance obtained after using an undirected graph partitioner and fixing the cycles (left), and after ensuring balance with refinement (right).


Fixing the cycles increases the edge cut with respect to an undirected partitioning, but not catastrophically (only by 0.0045/0.0012 = 3.75 times in these experiments), and achieving balance after this step increases the cut only a little (it goes to 0.0049 from 0.0045). That is why we suggest using an undirected graph partitioner, fixing the cycles among the parts, and performing a refinement based method for load balancing as a good (initial) partitioner.

In order to refine the initial partition in a multilevel setting, we propose a scheme similar to the iterated multilevel algorithm used in the existing partitioners [40, 212]. In this scheme, first a partition P is obtained. Then, the coarsening phase starts to cluster vertices that were in the same part in P. After the coarsening, an initial partitioning is freely available by using the partition P on the coarsest graph. The refinement phase then can work as before. Moreira et al. [176] use this approach for the directed graph partitioning problem. To be more concrete, we first use an undirected graph partitioner, then fix the cycles as discussed in Section 2.2.2, and then refine this acyclic partition for balance with the proposed refinement heuristics in Section 2.2.3. We then use this acyclic partition for constrained coarsening and initial partitioning. We expect this scheme to be successful in graphs with many sources and targets, where the sources and targets can lie in any of the parts while the overall partition is acyclic. On the other hand, if a graph is such that its balanced acyclic partitions need to put the sources in one part and the targets in another part, then fixing acyclicity may result in moving many vertices. This, in turn, will harm the edge cut found by the undirected graph partitioner.

2.3 Experiments

The partitioning tool presented (dagP) is implemented in the C/C++ programming languages. The experiments are conducted on a computer equipped with dual 2.1 GHz Xeon E5-2683 processors and 512GB memory. The source code and more information is available at http://tda.gatech.edu/software/dagP.

We have performed a large set of experiments on DAG instances coming from two sources. The first set of instances is from the Polyhedral Benchmark suite (PolyBench) [196]. The second set of instances is obtained from the matrices available in the SuiteSparse Matrix Collection (UFL) [59]. From this collection, we pick all the matrices satisfying the following properties: binary, square, and having at least 100000 rows and at most 2^26 nonzeros. There were a total of 95 matrices at the time of experimentation, two of which (ids 1514 and 2294) have the same pattern. We discard the duplicate and use the remaining 94 matrices for the experiments. For each such matrix, we take the strict upper triangular part as the associated DAG instance whenever this part has more nonzeros than the lower triangular part; otherwise, we take the strict lower triangular part. All edges have unit cost and all vertices have unit weight.


Graph        #vertices   #edges   max. deg.  avg. deg.  #sources  #targets
2mm             36500     62200        40      1.704       2100       400
3mm            111900    214600        40      1.918       3900       400
adi            596695   1059590    109760      1.776        843        28
atax           241730    385960       230      1.597      48530       230
covariance     191600    368775        70      1.925       4775      1275
doitgen        123400    237000       150      1.921       3400      3000
durbin         126246    250993       252      1.988        250       249
fdtd-2d        256479    436580        60      1.702       3579      1199
gemm          1026800   1684200        70      1.640      14600      4200
gemver         159480    259440       120      1.627      15360       120
gesummv        376000    500500       500      1.331     125250       250
heat-3d        308480    491520        20      1.593       1280       512
jacobi-1d      239202    398000       100      1.664        402       398
jacobi-2d      157808    282240        20      1.789       1008       784
lu             344520    676240        79      1.963       6400         1
ludcmp         357320    701680        80      1.964       6480         1
mvt            200800    320000       200      1.594      40800       400
seidel-2d      261520    490960        60      1.877       1600         1
symm           254020    440400       120      1.734       5680      2400
syr2k          111000    180900        60      1.630       2100       900
syrk           594480    975240        81      1.640       8040      3240
trisolv        240600    320000       399      1.330      80600         1
trmm           294570    571200        80      1.939       6570      4800

Table 2.1: DAG instances from PolyBench [196].


Since the proposed heuristics have a randomized behavior (the traversals used in the coarsening and refinement heuristics are randomized), we run them 10 times for each DAG instance and report the averages of these runs. We use performance profiles [69] to present the edge-cut results. A performance profile plot shows the probability that a specific method gives results within a factor θ of the best edge cut obtained by any of the methods compared in the plot. Hence, the higher and the closer a plot is to the y-axis, the better the method is.

We set the load imbalance parameter ε = 0.03 in the problem definition given at the beginning of the chapter for all experiments. The vertices are unit weighted; therefore, the imbalance is rarely an issue for a move-based partitioner.

Coarsening evaluation

We first evaluated the proposed coarsening heuristics (Section 5.1 of the associated paper [115]). The aim was to find an effective one to set as the default coarsening heuristic. Based on performance profiles of the edge cut, we concluded that in general the coarsening heuristics CoHyb and CoCyc behave similarly and are more helpful than CoTop in reducing the edge cut. We also analyzed the contraction efficiency of the proposed coarsening heuristics. It is important that the coarsening phase does not stop too early and that the coarsest graph is small enough to be partitioned efficiently. By investigating the maximum, the average, and the standard deviation of vertex and edge weight ratios at the coarsest graph, and the average, the minimum, and the maximum number of coarsening levels observed for the two datasets, we saw that CoCyc and CoHyb again behave similarly and provide slightly better results than CoTop. Based on all these observations, we set CoHyb as the default coarsening heuristic, as it performs better than CoTop in terms of final edge cut and is guaranteed to be more efficient than CoCyc in terms of run time.

Constrained coarsening and initial partitioning

We now investigate the effect of using undirected graph partitioners for coarsening and initial partitioning, as explained in Section 2.2.4. We compare three variants of the proposed multilevel scheme. All of them use the refinement described in Section 2.2.3 in the uncoarsening phase.

• CoHyb: this variant uses the hybrid coarsening heuristic described in Section 2.2.1 and the greedy directed graph growing heuristic described in Section 2.2.2 in the initial partitioning phase. This method does not use constrained coarsening.

• CoHyb_C: this variant uses an acyclic partition of the finest graph, obtained as outlined in Section 2.2.2, to guide the hybrid coarsening heuristic described in Section 2.2.4, and uses the greedy directed graph growing heuristic in the initial partitioning phase.

• CoHyb_CIP: this variant uses the same constrained coarsening heuristic as the previous method, but inherits the fixed acyclic partition of the finest graph as the initial partitioning.

The comparison of these three variants is given in Figure 2.3 for the whole dataset. In this figure, we see that using the constrained coarsening is always helpful, as it clearly separates CoHyb_C and CoHyb_CIP from CoHyb after θ = 1.1. Furthermore, applying the constrained initial partitioning (on top of the constrained coarsening) brings tangible improvements.

In the light of the experiments presented here, we suggest the variant CoHyb_CIP for general problem instances.

CoHyb_CIP with respect to a single level algorithm

We compare CoHyb_CIP (constrained coarsening and initial partitioning) with a single-level algorithm that uses an undirected graph partitioning, fixes the acyclicity, and refines the partitions. This last variant is denoted as UndirFix, and it is the algorithm described in Section 2.2.2. Both variants use the same initial partitioning approach, which utilizes MeTiS [130] as the undirected partitioner.


[Figure 2.3 contains a performance profile plot comparing CoHyb, CoHyb_C, and CoHyb_CIP.]

Figure 2.3: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb_CIP), using the constrained coarsening and the greedy directed graph growing (CoHyb_C), and CoHyb (no constraints).

[Figure 2.4 contains a performance profile plot comparing CoHyb_CIP and UndirFix.]

Figure 2.4: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb_CIP) and using the same approach without coarsening (UndirFix).

The difference between UndirFix and CoHyb_CIP is the latter's ability to refine the initial partition at multiple levels. Figure 2.4 presents this comparison. The plots show that the multilevel scheme CoHyb_CIP outperforms the single level scheme UndirFix at all appropriate ranges of θ, attesting to the importance of the multilevel scheme.

Comparison with existing work

Here we compare our approach with the evolutionary graph partitioning approach developed by Moreira et al. [175], and briefly with our previous work [114].

Figure 2.5 shows how CoHyb_CIP and CoTop compare with the evolutionary approach in terms of the edge cut on the 23 graphs of the PolyBench dataset, for the number of partitions k ∈ {2, 4, 8, 16, 32}.


[Figure 2.5 contains a performance profile plot comparing CoHyb_CIP, CoTop, and the approach of Moreira et al.]

Figure 2.5: Performance profiles for the edge cut obtained by CoHyb_CIP, CoTop, and Moreira et al.'s approach on the PolyBench dataset with k ∈ {2, 4, 8, 16, 32}.

We use the average edge cut value of 10 runs for CoTop and CoHyb_CIP, and the average values presented in [175] for the evolutionary algorithm. As seen in the figure, the CoTop variant of the proposed multilevel approach obtains the best results on this specific dataset (all variants of the proposed approach outperform the evolutionary approach). In more detailed comparisons [115, Appendix A], we see that the variants CoHyb_CIP and CoTop of the proposed algorithm obtain strictly better results than the evolutionary approach in 78 and 75 instances (out of 115), respectively, when the average edge cuts are compared.

In more detailed comparisons [115, Appendix A], we see that CoHyb_CIP obtains 26% less edge cut than the evolutionary approach on average (geometric mean) when the average cuts are compared; when the best cuts are compared, CoHyb_CIP obtains 48% less edge cut. Moreover, CoTop obtains 37% less edge cut than the evolutionary approach when the average cuts are compared; when the best cuts are compared, CoTop obtains 41% less cut. In some instances, we see large differences between the average and the best results of CoTop and CoHyb_CIP. Combined with the observation that CoHyb_CIP yields better results in general, this suggests that the neighborhood structure can be improved (see the notion of the strength of a neighborhood [183, Section 19.6]).

The proposed approach, with all the reported variants, takes about 30 minutes to complete the whole set of experiments for this dataset, whereas the evolutionary approach is much more compute-intensive, as it has to run the multilevel partitioning algorithm numerous times to create and update the population of partitions for the evolutionary algorithm. The multilevel approach of Moreira et al. [175] is more comparable, in terms of characteristics, with our multilevel scheme. When we compare CoTop with the results of the multilevel algorithm by Moreira et al., our approach provides results that are 37% better on average, and the CoHyb_CIP approach provides results that are 26% better on average, highlighting the fact that keeping the acyclicity of the directed graph through the multilevel process is useful.


[Figure 2.6 contains a performance profile plot comparing CoHyb, CoHyb_CIP, and UndirFix.]

Figure 2.6: Performance profiles of CoHyb, CoHyb_CIP, and UndirFix in terms of edge cut for the single source, single target graph dataset. The average of 5 runs is reported for each approach.


Finally, CoTop and CoHyb_CIP also outperform the previous version of our multilevel partitioner [114], which is based on a direct k-way partitioning scheme and matching heuristics for the coarsening phase, by 45% and 35% on average, respectively, on the same dataset.

Single commodity flow-like problem instances

In many of the instances of our dataset, graphs have many source and target vertices. We investigate how our algorithm performs on problems where all source vertices should be in a given part and all target vertices should be in the other part, while also achieving balance. This is a problem close to the maximum flow problem, where we want to find the maximum flow (or minimum cut) from the sources to the targets, with balance on part weights. Furthermore, addressing this problem also provides a setting for solving partitioning problems with fixed vertices.

We used the UFL dataset, where all isolated vertices are discarded, and a single source vertex S and a single target vertex T are added to all graphs. S is connected to all original source vertices, and all original target vertices are connected to T. The cost of the new edges is set to the number of edges. A feasible partition should avoid cutting these edges and separate all sources from the targets.

The performance profiles of CoHyb, CoHyb_CIP, and UndirFix are given in Figure 2.6, with the edge cut as the evaluation criterion. As seen in this figure, CoHyb is the best performing variant and UndirFix is the worst performing variant. This is interesting, as in the general setting we saw a reverse relation. The variant CoHyb_CIP performs in the middle, as it combines the other two.


[Figure 2.7 contains a run time performance profile comparing the six variants.]

Figure 2.7: Run time performance profile of CoCyc, CoHyb, CoTop, CoHyb_C, CoHyb_CIP, and UndirFix on the whole dataset. The values are the averages of 10 runs.

Run time performance

We again give a summary of the run time comparisons from the detailed version [115, Section 5.6]. Figure 2.7 shows the comparison of the five variants of the proposed multilevel scheme and the single level scheme on the whole dataset. Each algorithm is run 10 times on each graph. As expected, CoTop offers the best performance, and CoHyb offers a good trade-off between CoTop and CoCyc. An interesting remark is that these three algorithms have a better run time than the single level algorithm UndirFix. For example, on average, CoTop is 1.44 times faster than UndirFix. This is mainly due to the cost of fixing acyclicity: undirected partitioning accounts for roughly 25% of the execution time of UndirFix, and fixing the acyclicity constitutes the remaining 75%. Finally, the variants of the multilevel algorithm using constrained coarsening heuristics provide satisfying run time performance with respect to the others.

2.4 Summary, further notes and references

We proposed a multilevel approach for acyclic partitioning of directed acyclic graphs. This problem is close to the standard graph partitioning in that the aim is to partition the vertices into a number of parts while minimizing the edge cut and meeting a balance criterion on the part weights. Unlike the standard graph partitioning problem, the directions of the edges are important and the resulting partitions should have acyclic dependencies.

We proposed coarsening, initial partitioning, and refinement heuristics for the target problem. The proposed heuristics take the directions of the edges into account and maintain the acyclicity throughout the three phases of the multilevel scheme. We also proposed efficient and effective approaches to use the standard undirected graph partitioning tools in the multilevel scheme for coarsening and initial partitioning. We performed a large set of experiments on a dataset with graphs having different characteristics and evaluated different combinations of the proposed heuristics. Our experiments suggested: (i) the use of constrained coarsening and initial partitioning, where the main coarsening heuristic is a hybrid one which avoids the cycles and, in case it does not, performs a fast cycle detection (CoHyb_CIP), for the general case; (ii) a pure multilevel scheme without constrained coarsening, using the hybrid coarsening heuristic (CoHyb), for the cases where a number of sources need to be separated from a number of targets; (iii) a pure multilevel scheme without constrained coarsening, using the fast coarsening algorithm (CoTop), for the cases where the degrees of the vertices are small. All three approaches are shown to be more effective and efficient than the current state of the art.

Özkaya et al. [181] make use of the proposed partitioner for scheduling directed acyclic task graphs on homogeneous computing systems. Here, the partitioner is used to obtain coarse grained tasks to enhance data locality. Lepère and Trystram [151] show that acyclic clustering is very effective in scheduling DAGs with unit task weights and large uniform edge costs. In particular, they show that there exists an acyclic clustering whose makespan is at most twice that of the best clustering.

Recall that a directed edge (u, v) is called redundant if there is a path from u to v in the graph other than the edge itself. Consider a redundant edge (u, v) and a path from u to v. Let us discard the edge (u, v) and add its cost to all edges in the identified path from u to v, to obtain a reduced graph. In an acyclic bisection of this reduced graph, if the vertices u and v are in different parts, then the path from u to v crosses the cut only once (otherwise the partition is cyclic), and the edge cut in the original graph equals the edge cut in the reduced graph. If u and v are in the same part, then all paths from u to v are confined to the same part. Therefore, neither the edge (u, v) of the original graph nor any path from u to v of the reduced graph contributes to the respective cuts. Hence, the edge cut in the original graph is equal to the edge cut in the reduced graph. Such reductions could potentially help reduce the run time and improve the quality of the partitions in our multilevel setting. The transitive reduction is a costly operation, and hence we do not hope to apply all reductions in the proposed approach; a subset of those should be found and applied. The effects of such reductions should be more tangible in the evolutionary algorithm [175, 176].

Chapter 3

Parallel Candecomp/Parafac decomposition of sparse tensors in distributed memory

In this chapter, we address the problem of parallelizing the computational core of the Candecomp/Parafac decomposition of sparse tensors. Tensor notation is a little less common, and the reader is invited to revise the notation and the basic definitions in Section 1.5. More related definitions are given in this chapter. Below we restate the problem for convenience.

Parallelizing CP decomposition. Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

We investigate an efficient parallelization of the Mttkrp operation in distributed memory environments for sparse tensors, in the context of the CP decomposition method CP-ALS. For this purpose, we formulate two task definitions: a coarse-grain and a fine-grain one. These definitions are given by applying the owner-computes rule to a coarse-grain and a fine-grain partition of the tensor nonzeros. We define the coarse-grain partition of a tensor as a partition of one of its dimensions. In matrix terms, a coarse-grain partition corresponds to a row-wise or a column-wise partition. We define the fine-grain partition of a tensor as a partition of its nonzeros. This has the same significance in matrix terms. Based on these two task granularities, we present two parallel algorithms for the Mttkrp operation. We address the computational load balance and communication cost reduction problems for the two algorithms, and present hypergraph partitioning-based models to tackle these problems with off-the-shelf partitioning tools.



Once the Mttkrp operation is efficiently parallelized, the other parts of the CP-ALS algorithm are straightforward. Nonetheless, we design a library for the parallel CP-ALS algorithm to test the proposed Mttkrp algorithms, and report scalability results up to 1024 MPI ranks.

3.1 Background and notation

3.1.1 Hypergraph partitioning

Let H = (V, E) be a hypergraph with the vertex set V and hyperedge set E. Let us associate weights with the vertices and costs with the hyperedges: w(v) denotes the weight of a vertex v, and c_h denotes the cost of a hyperedge h. For a given integer K ≥ 2, a K-way vertex partition of a hypergraph H = (V, E) is denoted as Π = {V_1, ..., V_K}, where (i) the parts are non-empty; (ii) they are mutually exclusive: V_k ∩ V_ℓ = ∅ for k ≠ ℓ; and (iii) they are collectively exhaustive: V = ∪ V_k.

Let w(V_k) = Σ_{v ∈ V_k} w(v) be the total weight of the vertices in the set V_k, and w_avg = w(V)/K be the average part weight. If each part V_k ∈ Π satisfies the balance criterion

    w(V_k) ≤ w_avg (1 + ε), for k = 1, 2, ..., K,    (3.1)

we say that Π is balanced, where ε represents the maximum allowed imbalance ratio.

In a partition Π a hyperedge that has at least one vertex in a part is saidto connect that part The number of parts connected by a hyperedge h ieconnectivity is denoted as λh Given a vertex partition Π of a hypergraphH = (VE) one can measure the size of the cut induced by Π as

χ(Π) =sumhisinE

ch(λh minus 1) (32)

This cut measure is called the connectivity-1 cutsize metricGiven ε gt 0 and an integer K gt 1 the standard hypergraph partitioning

problem is defined as the task of finding a balanced partition Π with K partssuch that χ(Π) is minimized The hypergraph partitioning problem is NP-hard [150]

A variant of the above problem is the multi-constraint hypergraph parti-tioning [131] In this variant each vertex has an associated vector of weightsThe partitioning objective is the same as above and the partitioning con-straint is to satisfy a balancing constraint for each weight Let w(v i) denotethe C weights of a vertex v for i = 1 C In this variant the balancecriterion (31) is rewritten as

w(Vk i) le wavgi (1 + ε) for k = 1 K and i = 1 C (33)

35

where the ith weight w(Vk i) of a part Vk is defined as the sum of the ithweights of the vertices in that part wavgi is the average part weight for theith weight of all vertices and ε represents the allowed imbalance ratio
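To make these two criteria concrete, the short sketch below (ours, not from any particular partitioning tool) evaluates a given K-way partition: it computes the connectivity-1 cutsize (3.2) and checks the multi-constraint balance condition (3.3); the data layout is illustrative.

def connectivity_1_cutsize(hyperedges, parts):
    # hyperedges: list of (cost c_h, list of member vertices); parts[v]: part of vertex v.
    cut = 0
    for cost, members in hyperedges:
        lam = len({parts[v] for v in members})   # connectivity lambda_h of the hyperedge
        cut += cost * (lam - 1)                  # chi(Pi) = sum_h c_h (lambda_h - 1)
    return cut

def is_balanced(weights, parts, K, eps):
    # weights[v][i]: the ith weight of vertex v, for i = 0, ..., C-1.
    C = len(weights[0])
    part_w = [[0.0] * C for _ in range(K)]
    for v, wv in enumerate(weights):
        for i in range(C):
            part_w[parts[v]][i] += wv[i]
    for i in range(C):
        w_avg = sum(w[i] for w in part_w) / K
        if any(w[i] > w_avg * (1 + eps) for w in part_w):
            return False
    return True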

3.1.2 Tensor operations

In this chapter, we use N to denote the number of dimensions (the order) of a tensor. For the sake of simplicity, we describe all the notation and the algorithms for N = 3, even though our algorithms and implementations have no such restriction. We explicitly generalize the discussion to general order-N tensors whenever we find it necessary. A fiber in a tensor is defined by fixing every index but one; e.g., if X is a third-order tensor, X_{:jk} is a mode-1 fiber and X_{ij:} is a mode-3 fiber. A slice in a tensor is defined by fixing only one index; e.g., X_{i::} refers to the ith slice of X in mode 1. We use |X_{i::}| to denote the number of nonzeros in X_{i::}.

Tensors can be matricized in any mode. This is achieved by identifying a subset of the modes of a given tensor X as the rows and the other modes of X as the columns of a matrix, and appropriately mapping the elements of X to those of the resulting matrix. We will be exclusively dealing with the matricizations of tensors along a single mode. For example, take X ∈ ℝ^{I_1 × ··· × I_N}. Then X_(1) denotes the mode-1 matricization of X, in such a way that the rows of X_(1) correspond to the first mode of X and the columns correspond to the remaining modes. The tensor element x_{i_1,...,i_N} corresponds to the element (i_1, 1 + Σ_{j=2}^{N} [(i_j − 1) Π_{k=2}^{j−1} I_k]) of X_(1). Specifically, each column of the matrix X_(1) becomes a mode-1 fiber of the tensor X. Matricizations in the other modes are defined similarly.

For two matrices A of size I_1 × J_1 and B of size I_2 × J_2, the Kronecker product is

    A ⊗ B = [ a_{11} B   ···   a_{1 J_1} B
                 ⋮        ⋱        ⋮
              a_{I_1 1} B ··· a_{I_1 J_1} B ] .

For two matrices A of size I_1 × J and B of size I_2 × J, the Khatri-Rao product is

    A ⊙ B = [ A(:, 1) ⊗ B(:, 1)   ···   A(:, J) ⊗ B(:, J) ] ,

which is of size I_1 I_2 × J. For two matrices A and B, both of size I × J, the Hadamard product is

    A ∗ B = [ a_{11} b_{11}   ···   a_{1J} b_{1J}
                  ⋮            ⋱         ⋮
              a_{I1} b_{I1}   ···   a_{IJ} b_{IJ} ] .
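As a small numerical illustration (ours, using NumPy; the matrix sizes below are arbitrary), the Khatri-Rao and Hadamard products can be formed as follows.

import numpy as np

I1, I2, J = 4, 5, 3
A = np.random.rand(I1, J)
B = np.random.rand(I2, J)

# Khatri-Rao product: column-wise Kronecker products, of size (I1*I2) x J.
A_kr_B = np.vstack([np.kron(A[:, r], B[:, r]) for r in range(J)]).T
# The same product via broadcasting (row i1*I2 + i2 holds A[i1, :] * B[i2, :] column-wise).
assert np.allclose(A_kr_B, (A[:, None, :] * B[None, :, :]).reshape(I1 * I2, J))

# Hadamard product of two matrices of identical size is simply element-wise.
C = np.random.rand(I1, J)
A_had_C = A * C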

Recall that the CP-decomposition of rank R of a given tensor X factorizes X into a sum of R rank-one tensors. For X ∈ ℝ^{I×J×K}, it yields X ≈ Σ_{r=1}^{R} a_r ∘ b_r ∘ c_r, for a_r ∈ ℝ^I, b_r ∈ ℝ^J, and c_r ∈ ℝ^K, where ∘ is the outer product of the vectors. Here the matrices A = [a_1, ..., a_R], B = [b_1, ..., b_R], and C = [c_1, ..., c_R] are called the factor matrices, or factors. For N-mode tensors, we use U_1, ..., U_N to refer to the factor matrices.

We revise the Alternating Least Squares (ALS) method for obtaining a rank-R approximation of a tensor X with the CP-decomposition [38, 107]. A common formulation of CP-ALS is shown in Algorithm 1 for third order tensors. At each iteration, each factor matrix is recomputed while fixing the other two; e.g., A ← X_(1)(C ⊙ B)(B^T B ∗ C^T C)^†. This operation is performed in the following order: M_A = X_(1)(C ⊙ B), V = (B^T B ∗ C^T C)^†, and then A ← M_A V. Here, V is a dense matrix of size R × R and is easy to compute. The important issue is the efficient computation of the Mttkrp operations yielding M_A, and similarly M_B = X_(2)(A ⊙ C) and M_C = X_(3)(B ⊙ A).

Algorithm 1: CP-ALS for the 3rd order tensors

Input: X: A 3rd order tensor
       R: The rank of approximation
Output: CP decomposition [[λ; A, B, C]]

1: repeat
2:   A ← X_(1)(C ⊙ B)(B^T B ∗ C^T C)^†
3:   Normalize columns of A
4:   B ← X_(2)(C ⊙ A)(A^T A ∗ C^T C)^†
5:   Normalize columns of B
6:   C ← X_(3)(B ⊙ A)(A^T A ∗ B^T B)^†
7:   Normalize columns of C and store the norms as λ
8: until no improvement or maximum iterations reached

The sheer size of the Khatri-Rao products makes them impossible to compute explicitly; hence, efficient Mttkrp algorithms find other means to carry out the Mttkrp operation.
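For concreteness, one sweep of Algorithm 1 can be written with the Mttkrp as the only tensor-dependent kernel; the sketch below is ours and assumes a routine mttkrp(X, factors, mode), such as the coordinate-format one discussed in Section 3.3.

import numpy as np

def cp_als_iteration(mttkrp, X, A, B, C):
    # One sweep of Algorithm 1; the factor matrices are updated in place.
    lam = None
    factors = (A, B, C)
    for mode, M in enumerate(factors):
        others = [F for m, F in enumerate(factors) if m != mode]
        # V = (e.g.) (B^T B * C^T C)^dagger when updating A.
        V = np.linalg.pinv(np.multiply(*[F.T @ F for F in others]))
        M[:] = mttkrp(X, factors, mode) @ V
        lam = np.linalg.norm(M, axis=0)           # column norms
        M /= np.where(lam > 0, lam, 1.0)          # normalize columns
    return lam                                    # norms of the last mode, as in Line 7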

3.2 Related work

SPLATT [206] is an efficient implementation of the Mttkrp operation for sparse tensors on shared memory systems. SPLATT implements the Mttkrp operation based on the slices of the dimension in which the factor is updated, e.g., on the mode-1 slices when computing A ← X_(1)(C ⊙ B)(B^T B ∗ C^T C)^†. Nonzeros of the fibers in a slice are multiplied with the corresponding rows of B, and the results are accumulated to be later scaled with the corresponding row of C to compute the row of A corresponding to the slice. Parallelization is done using OpenMP directives, and load balance (in terms of the number of nonzeros in the slices of the mode for which the Mttkrp is computed) is achieved by using the dynamic scheduling policy.

Hypergraph models are used to optimize cache performance by reducing the number of times a row of A, B, or C is accessed. Smith et al. [206] also use N-partite graphs to reorder the tensors for all dimensions. A newer version of SPLATT [204, 205] has a medium grain task definition and a distributed memory implementation.

GigaTensor [123] is an implementation of CP-ALS which follows the Map-Reduce paradigm. All important steps (the Mttkrps and the computations B^T B ∗ C^T C) are performed using this paradigm. A distinct advantage of GigaTensor is that, thanks to Map-Reduce, the issues of fault-tolerance, load balance, and out of core tensor data are automatically handled. The presentation [123] of GigaTensor focuses on three-mode tensors and expresses the map and the reduce functions for this case. Additional map and reduce functions are needed for higher order tensors.

DFacTo [47] is a distributed memory implementation of the Mttkrp operation. It performs two successive sparse matrix-vector multiplies (SpMV) to compute a column of the product X_(1)(C ⊙ B). A crucial observation (made also elsewhere [142]) is that this operation can be implemented as X_(2)^T B(:, r), which can be reshaped into a matrix to be multiplied with C(:, r) to form the rth column of the Mttkrp. Although SpMV is a well-investigated operation, there is a peculiarity here: the result of the first SpMV forms the values of the sparse matrix used in the second one. Therefore, there are sophisticated data dependencies between the two SpMVs. Notice that DFacTo is rich in SpMV operations: there are two SpMVs per factorization rank per dimension of the input tensor. DFacTo needs to store the tensor matricized in all dimensions, i.e., X_(1), ..., X_(N). In low dimensions, this can be a slight memory overhead, yet in higher dimensions the overhead could be non-negligible. DFacTo uses MPI for parallelization, yet fully stores the factor matrices in all MPI ranks. The rows of X_(1), ..., X_(N) are blockwise distributed (statically). With this partition, each process computes the corresponding rows of the factor matrices. Finally, DFacTo performs an MPI_Allgatherv operation to communicate the new results to all processes, which results in (I_n/P) log_2 P communication volume per process (assuming a hypercube algorithm) when computing the nth factor matrix having I_n rows using P processes.

Tensor Toolbox [18] is a MATLAB toolbox for handling tensors. It provides many essential operations and enables fast and efficient realizations of complex algorithms in MATLAB for sparse tensors [17]. Among those operations, Mttkrp implementations are provided and used in the CP-ALS method. Here, each column of the output is computed by performing N − 1 sparse tensor vector multiplications. Another well-known MATLAB toolbox is the N-way toolbox [11], which is essentially for dense tensors and now incorporates support for sparse tensors [2] through Tensor Toolbox.

Tensor Toolbox and the related software provide excellent means for rapid prototyping of algorithms and also efficient programs for tensor operations that can be handled within MATLAB.

3.3 Parallelization

A common approach in implementing the Mttkrp is to explicitly matricize a tensor across all modes and then perform the Khatri-Rao product using the matricized tensors [47, 206]. Matricizing a tensor in a mode i requires column index values up to Π_{k≠i} I_k, which can exceed the integer value limits supported by modern architectures when using tensors of higher order and very large dimensions. Also, matricizing across all modes results in N replications of a tensor, which can exceed the memory limitations. Hence, in order to be able to handle large tensors, we store them in coordinate format for the Mttkrp operation, which is also the method of choice in Tensor Toolbox [18]. In this format, each nonzero value is stored along with all of its position indices.

With a tensor stored in the coordinate format, the Mttkrp operation can be performed as shown in Algorithm 2. As seen on Line 4 of this algorithm, a row of B and a row of C are retrieved, and their Hadamard product is computed and scaled with a tensor entry to update a row of M_A. In general, for an N-mode tensor,

    M_{U_1}(i_1, :) ← M_{U_1}(i_1, :) + x_{i_1 i_2 ... i_N} [U_2(i_2, :) ∗ ··· ∗ U_N(i_N, :)]

is computed. Here, the indices of the corresponding rows of the factor matrices and of M_{U_1} coincide with the indices of the unique tensor entry of the operation.

Algorithm 2: Mttkrp for the 3rd order tensors

Input: X: tensor
       B, C: Factor matrices in all modes except the first
       I_A: Number of rows of the factor A
       R: Rank of the factors
Output: M_A = X_(1)(B ⊙ C)

1: Initialize M_A to zeros of size I_A × R
2: foreach x_{ijk} ∈ X do
4:   M_A(i, :) ← M_A(i, :) + x_{ijk}[B(j, :) ∗ C(k, :)]
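A direct transcription of Algorithm 2 into Python/NumPy is given below as a sketch (the coordinate arrays ii, jj, kk and vals are our names, not the thesis'); the commented one-liner shows the same accumulation without the explicit loop.

import numpy as np

def mttkrp_mode1(ii, jj, kk, vals, B, C, IA):
    # M_A = X_(1)(B kr C), accumulated nonzero by nonzero as in Algorithm 2.
    R = B.shape[1]
    MA = np.zeros((IA, R))
    for x, i, j, k in zip(vals, ii, jj, kk):
        # Line 4: Hadamard product of the rows B(j,:) and C(k,:), scaled by x_ijk.
        MA[i, :] += x * (B[j, :] * C[k, :])
    # An equivalent vectorized accumulation:
    # np.add.at(MA, ii, vals[:, None] * B[jj, :] * C[kk, :])
    return MA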

As factor matrices are accessed row-wise, we define computational units in terms of the rows of factor matrices. It follows naturally to partition all factor matrices row-wise and use the same partition for the Mttkrp operation for each mode of an input tensor across all CP-ALS iterations to prevent extra communication.

A crucial issue is the task definitions, as these pertain to the issues of load balancing and communication. We identify a coarse-grain and a fine-grain task definition.

In the coarse-grain task definition, the ith atomic task consists of computing the row M_A(i, :) using the nonzeros in the tensor slice X_{i::} and the rows of B and C corresponding to the nonzeros in that slice. The input tensor X does not change throughout the iterations of tensor decomposition algorithms; hence, it is viable to make the whole slice X_{i::} available to the process holding M_A(i, :), so that the Mttkrp operation can be performed by only communicating the rows of B and C. Yet, as CP-ALS requires the Mttkrp in all modes, and each nonzero x_{ijk} belongs to the slices X_{i::}, X_{:j:}, and X_{::k}, we need to replicate tensor entries in the owner processes of these slices. This may require up to N times replication of the tensor, depending on its partitioning. Note that an explicit matricization always requires exactly N replications of tensor entries.

In the fine-grain task definition, an atomic task corresponds to the multiplication of a tensor entry with the Hadamard product of the corresponding rows of B and C. Here, tensor nonzeros are partitioned among processes with no replication, to induce a task partition by following the owner-computes rule. This necessitates communicating the rows of B and C that are needed by these atomic tasks. Furthermore, partial results on the rows of M_A need to be communicated, as without duplicating tensor entries we cannot, in general, compute all contributions to a row of M_A. Here, the partition of X should be useful in all modes, as the CP-ALS method requires the Mttkrp in all modes.

The coarse-grain task definition resembles the one-dimensional (1D) row-wise (or column-wise) partitioning of sparse matrices, whereas the fine-grain one resembles the two-dimensional (nonzero-based) partitioning of sparse matrices for parallel sparse matrix-vector multiply (SpMV) operations. As is confirmed for SpMV in modern applications, 1D partitioning usually leads to harder problems of load balancing and communication cost reduction. The same phenomenon is likely to be observed in tensors as well. Nonetheless, we cover the coarse-grain task definition, as it is used in the state of the art parallel Mttkrp methods [47, 206], which partition the input tensor by slices.

3.3.1 Coarse-grain task model

In the coarse-grain task model, computing the rows of M_A, M_B, and M_C are defined as the atomic tasks. Let μ_A denote the partition of the first mode's indices among the processes, i.e., μ_A(i) = p if the process p is responsible for computing M_A(i, :). Similarly, let μ_B and μ_C define the partition of the second and the third mode indices. The process owning M_A(i, :) needs the entire tensor slice X_{i::}; similarly, the process owning M_B(j, :) needs X_{:j:}, and the owner of M_C(k, :) needs X_{::k}.

This necessitates duplication of some tensor nonzeros to prevent unnecessary communication.

One needs to take the context of CP-ALS into account when parallelizing the Mttkrp operation. First, the output M_A of Mttkrp is transformed into A. Since A(i, :) is computed simply by multiplying M_A(i, :) with the matrix (B^T B ∗ C^T C)^†, we make the process which owns M_A(i, :) responsible for computing A(i, :). Second, N Mttkrp operations follow one another in an iteration. Assuming that every process has the required rows of the factor matrices while executing the Mttkrp for the first mode, it is advisable to implement the Mttkrp in such a way that its output M_A, after being transformed into A, is communicated. This way, all processes will have the necessary data for executing the Mttkrp for the next mode. With these in mind, the coarse-grain parallel Mttkrp method executes Algorithm 3 at each process p.

Algorithm 3: Coarse-grain Mttkrp for the first mode of third order tensors, at process p, within CP-ALS

Input: I_p: indices where μ_A(i) = p
       X_{I_p,:,:}: tensor slices
       B^T B and C^T C: R × R matrices
       B(J_p, :), C(K_p, :): Rows of the factor matrices, where J_p and K_p correspond to the unique second and third mode indices in X_{I_p,:,:}

1: On exit, A(I_p, :) is computed; its rows are sent to processes needing them; A^T A is available
2: Initialize M_A(I_p, :) to all zeros of size |I_p| × R
3: foreach i ∈ I_p do
4:   foreach x_{ijk} ∈ X_{i,:,:} do
6:     M_A(i, :) ← M_A(i, :) + x_{ijk}[B(j, :) ∗ C(k, :)]
8: A(I_p, :) ← M_A(I_p, :)(B^T B ∗ C^T C)^†
9: foreach i ∈ I_p do
11:   Send A(i, :) to all processes having nonzeros in X_{i,:,:}
12: Receive A(i, :) from μ_A(i) for each owned nonzero x_{ijk}
14: Locally compute A(I_p, :)^T A(I_p, :) and all-reduce the results to form A^T A

As seen in Algorithm 3, the process p computes M_A(i, :) for all i with μ_A(i) = p on Line 6. This is possible without any communication, if we assume the preconditions that B^T B, C^T C, and the required rows of B and C are available. We need to satisfy this precondition for the Mttkrp operations in the remaining modes. Once M_A(I_p, :) is computed, it is multiplied on Line 8 with (B^T B ∗ C^T C)^† to obtain A(I_p, :). Then, the process p sends the rows of A(I_p, :) to the other processes who will need them, and then receives the ones that it will need. More precisely, the process p sends A(i, :) to the process r where μ_B(j) = r or μ_C(k) = r for a nonzero x_{ijk} ∈ X, thereby satisfying the stated precondition for the remaining Mttkrp operations.

Since computing A(i, :) required B^T B and C^T C to be available at each process, we also need to compute A^T A and make the result available to all processes. We perform this by computing the local multiplications A(I_p, :)^T A(I_p, :) and all-reducing the partial results (Line 14).

We now investigate the problems of obtaining load balance and reducing the communication cost in an iteration of CP-ALS given in Algorithm 3, that is, when Algorithm 3 is applied N times, once for each mode. Line 6 is the computational core: the process p performs Σ_{i ∈ I_p} |X_{i,:,:}| many Hadamard products (of vectors of size R) without any communication. In order to achieve load balance, one should partition the slices of the first mode equitably by taking the number of nonzeros into account. Line 8 does not involve communication, and load balance can be achieved there by assigning an almost equal number of slices to the processes. Line 14 also requires an almost equal number of slices per process for load balance. The operations at Lines 8 and 14 can be performed very efficiently using BLAS3 routines; therefore, their costs are generally negligible in comparison to the cost of the sparse, irregular computations at Line 6. There are two lines, 11 and 14, requiring communication. Line 14 requires the collective all-reduce communication, and one can rely on efficient MPI implementations. The communication at Line 11 is irregular and would be costly. Therefore, the problems of computational load balance and communication cost need to be addressed with regards to Lines 6 and 11, respectively.

Let a_i be the task of computing M_A(i, :) at Line 6. Let also T_A be the set of tasks a_i for i = 1, ..., I_A, and let b_j, c_k, T_B, and T_C be defined similarly. In the CP-ALS algorithm, there are dependencies among the triplets a_i, b_j, and c_k of the tasks for each nonzero x_{ijk}. Specifically, the task a_i depends on the tasks b_j and c_k. Similarly, b_j depends on the tasks a_i and c_k, and c_k depends on the tasks a_i and b_j. These dependencies are the source of the communication at Line 11; e.g., A(i, :) should be sent to the processes that own B(j, :) and C(k, :) for each nonzero x_{ijk} in the slice X_{i,:,:}. These dependencies can be modeled using graphs or hypergraphs, by identifying the tasks with vertices and their dependencies with edges and hyperedges, respectively. As is well-known in similar parallelization contexts [39, 111, 112], the standard graph partition related metric of "edge cut" would loosely relate to the communication cost. We therefore propose a hypergraph model for modeling the communication and computational requirements of the Mttkrp operation in the context of CP-ALS.

For a given tensor X, we build the d-partite, d-uniform hypergraph H = (V, E) described in Section 1.5. In this model, the vertex set V corresponds to the set of tasks. We abuse the notation and use a_i to refer to a task and its corresponding vertex in V. We then set V = T_A ∪ T_B ∪ T_C. Since one needs load balance on Line 6 for each mode, we use vectors of size N for the vertex weights. We set w(a_i, 1) = |X_{i,:,:}|, w(b_j, 2) = |X_{:,j,:}|, and w(c_k, 3) = |X_{:,:,k}| as the vertex weights in the indicated modes, and assign weight 0 in all other modes.

[Figure 3.1: Coarse and fine-grain hypergraph models for the 3 × 3 × 3 tensor X = {(1, 2, 3), (2, 3, 1), (3, 1, 2)}. (a) Coarse-grain model; (b) Fine-grain model.]

The dependencies causing communication at Line 11 are one-to-many: for a nonzero x_{ijk}, any one of a_i, b_j, and c_k is computed by using the data associated with the other two. Following the guidelines in building the hypergraph models [209], we add a hyperedge corresponding to each task (or each row of the factor matrices) in T_A ∪ T_B ∪ T_C in order to model the data dependencies to that specific task. We use n^a_i to denote the hyperedge corresponding to the task a_i, and define n^b_j and n^c_k similarly. For each nonzero x_{ijk}, we add the vertices a_i, b_j, and c_k to the hyperedges n^a_i, n^b_j, and n^c_k. Let J_i and K_i be all unique 2nd and 3rd mode indices, respectively, of the nonzeros in X_{i,:,:}. Then the hyperedge n^a_i contains the vertex a_i, all vertices b_j for j ∈ J_i, and all vertices c_k for k ∈ K_i.

Figure 3.1a demonstrates the hypergraph model for the coarse-grain task decomposition of a sample 3 × 3 × 3 tensor with nonzeros (1, 2, 3), (2, 3, 1), and (3, 1, 2). Labels for the hyperedges, shown as small blue circles, are not provided, yet they should be clear from the context; e.g., a_i ∈ n^a_i is shown with a non-curved black line between the vertex and the hyperedge. For the first, second, and third nonzeros, we add the corresponding vertices to the related hyperedges, which are shown with green, magenta, and orange curves, respectively. It is interesting to note that the N-partite graph of Smith et al. [206] and this hypergraph model are related: their adjacency structures, when laid out as matrices, give the same matrix.

Consider now a P-way partition of the vertices of H = (V, E) and the association of each part with a unique process for obtaining a P-way parallel CP-ALS. The task a_i incurs |X_{i,:,:}| Hadamard products in the first mode in the process μ_A(i), and has no other work when computing the Mttkrps M_B = X_(2)(A ⊙ C) and M_C = X_(3)(A ⊙ B) in the other modes. This observation (combined with the corresponding one for the b_j's and c_k's) shows that any partition satisfying the balance condition for each weight component (3.3) corresponds to a balanced partition of the Hadamard products (Line 6) in each mode.

Now consider the messages, which are of size R, sent by the process p regarding A(i, :). These messages are needed by the processes that have slices in modes 2 and 3 containing nonzeros whose first mode index is i. These processes correspond to the parts connected by the hyperedge n^a_i. For example, in Figure 3.1a, due to the nonzero (3, 1, 2), a_3 has a dependency to b_1 and c_2, which is modeled by the curves from a_3 to n^b_1 and n^c_2. Since a_3, b_1, and c_2 reside in different parts, a_3 contributes one to the connectivity of n^b_1 and of n^c_2, which accurately captures the total communication volume by increasing the total connectivity-1 metric by two.

Notice that the part μ_A(i) is always connected by n^a_i, assuming that the slice X_{i,:,:} is not empty. Therefore, minimizing the connectivity-1 metric (3.2) exactly amounts to minimizing the total communication volume required for the Mttkrp in all computational phases of a CP-ALS iteration, if we set the cost c_h = R for each hyperedge h.

3.3.2 Fine-grain task model

In the fine-grain task model, we assume a partition of the nonzeros of the tensor; hence, there is no duplication. Let μ_X denote the partition of the tensor nonzeros, e.g., μ_X(i, j, k) = p if the process p holds x_{ijk}. Whenever μ_X(i, j, k) = p, the process p performs the Hadamard product of B(j, :) and C(k, :) in the first mode, of A(i, :) and C(k, :) in the second mode, and of A(i, :) and B(j, :) in the third mode, and then scales the result with x_{ijk}. Let again μ_A, μ_B, and μ_C denote the partitions of the indices of the corresponding modes; that is, μ_A(i) = p denotes that the process p is responsible for computing M_A(i, :). As in the coarse-grain model, we take the CP-ALS context into account while designing the Mttkrp algorithm: (i) the process which is responsible for M_A(i, :) is also held responsible for A(i, :); (ii) at the beginning, all rows of the factor matrices B(j, :) and C(k, :) where μ_X(i, j, k) = p are available to the process p; (iii) at the end, all processes get the R × R matrix A^T A; and (iv) each process has all the rows of A that are required for the upcoming Mttkrp operations in the second or the third modes. With these in mind, the fine-grain parallel Mttkrp method executes Algorithm 4 at each process p.

In Algorithm 4, X_p and I_p denote the set of nonzeros and the set of first mode indices assigned to the process p. As seen in the algorithm, the process p has the required data to carry out all Hadamard products corresponding to all x_{ijk} ∈ X_p on Line 5. After scaling by x_{ijk}, the Hadamard products are locally accumulated in M_A, which contains a row for each i where X_p has a nonzero on the slice X_{i,:,:}. Notice that a process generates a partial result for each slice of the first mode in which it has a nonzero. The relevant row indices (the set of unique first mode indices in X_p) are denoted by F_p.

Algorithm 4: Fine-grain Mttkrp for the first mode of 3rd order tensors, at process p, within CP-ALS

Input: X_p: tensor nonzeros where μ_X(i, j, k) = p
       I_p: the first mode indices where μ_A(I_p) = p
       F_p: the set of unique first mode indices in X_p
       B^T B and C^T C: R × R matrices
       B(J_p, :), C(K_p, :): Rows of the factor matrices, where J_p and K_p correspond to the unique second and third mode indices in X_p

1: On exit, A(I_p, :) is computed; its rows are sent to processes needing them; A(F_p, :) is updated; and A^T A is available
2: Initialize M_A(F_p, :) to all zeros of size |F_p| × R
3: foreach x_{ijk} ∈ X_p do
5:   M_A(i, :) ← M_A(i, :) + x_{ijk}[B(j, :) ∗ C(k, :)]
6: foreach i ∈ F_p \ I_p do
8:   Send M_A(i, :) to μ_A(i)
10: Receive contributions to M_A(i, :) for i ∈ I_p and add them up
12: A(I_p, :) ← M_A(I_p, :)(B^T B ∗ C^T C)^†
13: foreach i ∈ I_p do
15:   Send A(i, :) to all processes having nonzeros in X_{i,:,:}
16: foreach i ∈ F_p \ I_p do
18:   Receive A(i, :) from μ_A(i)
20: Locally compute A(I_p, :)^T A(I_p, :) and all-reduce the results to form A^T A

Some indices in F_p are not owned by the process p. For such an i ∈ F_p \ I_p, the process p sends its contribution to M_A(i, :) to the owner process μ_A(i) at Line 8. Again, messages of the process p destined to the same process are combined. Then, the process p receives contributions to M_A(i, :) where μ_A(i) = p at Line 10. Each such contribution is accumulated, and the new A(i, :) is computed by a multiplication with (B^T B ∗ C^T C)^† at Line 12. Finally, the updated A is communicated to satisfy the precondition of the upcoming Mttkrp operations. Note that the communications at Lines 15 and 18 are the duals of those at Lines 10 and 8, respectively, in the sense that the directions of the messages are reversed and the messages always pertain to the same row indices. For example, M_A(i, :) is sent to the process μ_A(i) at Line 8, and A(i, :) is received from μ_A(i) at Line 18. Hence, the process p generates partial results for each slice X_{i,:,:} on which it has a nonzero, either keeps it or sends it to the owner μ_A(i) (at Line 8), and if so, receives the updated row A(i, :) later (at Line 18). Finally, the algorithm finishes with an all-reduce operation to make A^T A available to all processes for the upcoming Mttkrp operations.

We now investigate the computational and communication requirements of Algorithm 4. As before, the computational core is at Line 5, where the Hadamard products of the row vectors B(j, :) and C(k, :) are computed for each nonzero x_{ijk} that a process owns, and the result is added to M_A(i, :). Therefore, each nonzero x_{ijk} incurs three Hadamard products (one per mode) in its owner process μ_X(i, j, k) in an iteration of the CP-ALS algorithm. Other computations are either efficiently handled with BLAS3 routines (Lines 12 and 20), or they (Line 10) are not as many as the number of Hadamard products. The significant communication operations are the sends at Lines 8 and 15 and the corresponding receives at Lines 10 and 18. Again, the all-reduce operation at Line 20 would be handled efficiently by the MPI implementation. Therefore, the problems of computational load balance and reduced communication cost need to be addressed by partitioning the tensor nonzeros equitably among processes (for load balance at Line 5), while trying to confine all slices of all modes to a small set of processes (to reduce communication at Lines 8 and 18).

We propose a hypergraph model for the computational load and communication volume of the fine-grain parallel Mttkrp operation in the context of CP-ALS. For a given tensor, the hypergraph H = (V, E) with the vertex set V and hyperedge set E is defined as follows. The vertex set V has four types of vertices: a_i for i = 1, ..., I_A; b_j for j = 1, ..., I_B; c_k for k = 1, ..., I_C; and the vertex x_{ijk} we define for each nonzero in X, by abusing the notation. The vertex x_{ijk} represents the Hadamard products B(j, :) ∗ C(k, :), A(i, :) ∗ C(k, :), and A(i, :) ∗ B(j, :) to be performed during the Mttkrp operations at different modes. There are three types of hyperedges in E, and they represent the dependencies of the Hadamard products on the rows of the factor matrices. The hyperedges are defined as follows.

We set E = E_A ∪ E_B ∪ E_C, where E_A contains a hyperedge n^a_i for each A(i, :), E_B contains a hyperedge n^b_j for each B(j, :), and E_C contains a hyperedge n^c_k for each C(k, :). Initially, n^a_i, n^b_j, and n^c_k contain only the corresponding vertices a_i, b_j, and c_k. Then, for each nonzero x_{ijk} ∈ X, we add the vertex x_{ijk} to the hyperedges n^a_i, n^b_j, and n^c_k to model the data dependency to the corresponding rows of A, B, and C. As in the previous subsection, one needs a multi-constraint formulation, for three reasons. First, since the x_{ijk} vertices represent the Hadamard products, they should be evenly distributed among the processes. Second, as the owners of the rows of A, B, and C are possible destinations of partial results (at Line 8), it is advisable to spread them evenly. Third, an almost equal number of rows per process implies balanced memory usage and balanced BLAS3 operations at Lines 12 and 20, even though this operation is not our main concern for the computational load. With these constraints, we associate a weight vector of size N + 1 with each vertex. For a vertex x_{ijk}, we set w(x_{ijk}) = [0, 0, 0, 3]; for the vertices a_i, b_j, and c_k, we set w(a_i) = [1, 0, 0, 0], w(b_j) = [0, 1, 0, 0], and w(c_k) = [0, 0, 1, 0].
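The construction of this model from the coordinate representation is mechanical; the following sketch (ours, with an illustrative vertex numbering) produces the weight vectors and the three hyperedge families for a third-order tensor.

def build_fine_grain_hypergraph(nonzeros, IA, IB, IC):
    # nonzeros: list of (i, j, k) triples (0-based). Vertex numbering:
    # a_i -> i, b_j -> IA + j, c_k -> IA + IB + k, and one vertex per nonzero at the end.
    nv = IA + IB + IC + len(nonzeros)
    weights = [[0, 0, 0, 0] for _ in range(nv)]
    for i in range(IA):
        weights[i][0] = 1                      # w(a_i) = [1, 0, 0, 0]
    for j in range(IB):
        weights[IA + j][1] = 1                 # w(b_j) = [0, 1, 0, 0]
    for k in range(IC):
        weights[IA + IB + k][2] = 1            # w(c_k) = [0, 0, 1, 0]
    # One hyperedge per row of A, B, and C; initially it contains only its own vertex.
    nets_a = [[i] for i in range(IA)]
    nets_b = [[IA + j] for j in range(IB)]
    nets_c = [[IA + IB + k] for k in range(IC)]
    for t, (i, j, k) in enumerate(nonzeros):
        v = IA + IB + IC + t
        weights[v][3] = 3                      # w(x_ijk) = [0, 0, 0, 3]: one Hadamard product per mode
        nets_a[i].append(v)                    # x_ijk depends on A(i,:), B(j,:), and C(k,:)
        nets_b[j].append(v)
        nets_c[k].append(v)
    return weights, nets_a + nets_b + nets_c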

In Figure 3.1b, we demonstrate the fine-grain hypergraph model for the sample tensor of the previous subsection. We exclude the vertex weights from the figure. Similar to the coarse-grain model, the dependencies on the rows of A, B, and C are modeled by connecting each vertex x_{ijk} to the hyperedges n^a_i, n^b_j, and n^c_k to capture the communication cost.

Consider now a P-way partition of the vertices of H = (V, E), and associate each part with a unique process to obtain a P-way parallel CP-ALS with the fine-grain Mttkrp operation. The partition of the x_{ijk} vertices uniquely partitions the Hadamard products. Satisfying the fourth balance constraint in (3.3) balances the computational loads of the processes. Similarly, satisfying the first three constraints leads to an equitable partition of the rows of the factor matrices among processes. Let μ_X(i, j, k) = p and μ_A(i) = ℓ; that is, the vertex x_{ijk} and a_i are in different parts. Then, the process p sends a partial result on M_A(i, :) to the process ℓ at Line 8 of Algorithm 4. By looking at these messages carefully, we see that all processes whose corresponding parts are in the connectivity set of n^a_i will send a message to μ_A(i). Since μ_A(i) is also in this set, we see that the total volume of the send operations concerning M_A(i, :) at Line 8 is equal to λ_{n^a_i} − 1. Therefore, the connectivity-1 cutsize metric (3.2) over the hyperedges in E_A encodes the total volume of messages sent at Line 8, if we set c_h = R for all hyperedges in E_A. Since the send operations at Line 15 are the duals of the send operations at Line 8, the total volume of messages sent at Line 15 for the first mode is also equal to this number. By extending this reasoning to all hyperedges, we see that the cumulative (over all modes) volume of communication is equal to the connectivity-1 cutsize metric (3.2).

The presence of multiple constraints usually causes difficulties for the current state of the art partitioning tools (Aykanat et al. [15] discuss the issue for PaToH [40]).

In preliminary experiments, we observed that this was the case for us as well. In an attempt to circumvent this issue, we designed the following alternative. We start with the same hypergraph and remove the vertices corresponding to the rows of the factor matrices. Then, we use a single constraint partitioning on the resulting hypergraph to partition the vertices of the nonzeros. We are then faced with the problem of assigning the rows of the factor matrices, given the partition of the tensor nonzeros. Any objective function (relevant to our parallelization context) in trying to solve this problem would likely be a variant of the NP-complete Bin-Packing Problem [93, p. 124], as in the SpMV case [27, 208]. Therefore, we handle this problem heuristically by making the following observations: (i) the modes can be partitioned one at a time; (ii) each row corresponds to a removed vertex for which there is a corresponding hyperedge; (iii) the partition of the x_{ijk} vertices defines a connectivity set for each hyperedge. This is essentially multi-constraint partitioning which takes advantage of the listed observations. We use the connectivity of the corresponding hyperedge as the key value of a row to impose an order. Then, the rows are visited in decreasing order of the key values. Each visited row is assigned to the least loaded part in the connectivity set of its corresponding hyperedge, if that part is not overloaded. Otherwise, the visited vertex is assigned to the least loaded part at that moment. Note that in the second case, we add one to the total communication volume; otherwise, the total communication volume obtained by partitioning the x_{ijk} vertices remains the same. Note also that the key values correspond to the volume of receives at Line 10 of Algorithm 4; hence, this method also aims to balance the volume of receives at Line 10 and the volume of the dual send operations at Line 15.
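A sketch of this assignment heuristic for one mode is given below (our rendering of the verbal description; the load bookkeeping and the overload test are simplified to unit row weights).

def assign_rows(connectivity_sets, part_loads, max_load):
    # connectivity_sets[i]: the set of parts connected by the hyperedge of row i,
    # induced by the partition of the nonzero vertices.
    assignment = {}
    # Visit rows in decreasing key value, the key being the connectivity.
    for i in sorted(connectivity_sets, key=lambda r: len(connectivity_sets[r]), reverse=True):
        best = min(connectivity_sets[i], key=lambda p: part_loads[p])
        if part_loads[best] >= max_load:
            # Overloaded: fall back to the globally least loaded part
            # (this adds one to the total communication volume).
            best = min(range(len(part_loads)), key=lambda p: part_loads[p])
        assignment[i] = best
        part_loads[best] += 1
    return assignment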

3.4 Experiments

We conducted our experiments on the IDRIS Ada cluster, which consists of 304 nodes, each having 128 GB of memory and four 8-core Intel Sandy Bridge E5-4650 processors running at 2.7 GHz. We ran our experiments on up to 1024 cores and assigned one MPI process per core. All codes we used in our benchmarks were compiled using the Intel C++ compiler (version 14.0.1.106) with the -O3 option for compiler optimizations, -axAVX,SSE4.2 to enable SSE optimizations whenever possible, and the -mkl option to use the Intel MKL library (version 11.1.1) for LAPACK, BLAS, and VML (Vector Math Library) routines. We also compiled the DFacTo code using the Eigen (version 3.2.4) library [102] with the EIGEN_USE_MKL_ALL flag set, to ensure that it utilizes the routines provided in the Intel MKL library for the best performance. We used PaToH [40] with default options to partition the hypergraphs. We created all partitions offline and ran our experiments on these partitioned tensors on the cluster. We do not report timings for partitioning the hypergraphs, which is quite costly, for two reasons. First, in most

Tensor     I_1    I_2    I_3    I_4   nonzeros
netflix    480K   17K    2K     -     100M
NELL-B     32M    301    638K   -     78M
deli       530K   17M    24M    1K    140M
flickr     319K   28M    16M    713   112M

Table 3.1: Size of tensors used in the experiments.

applications, the tensors from the real-world data are built incrementally; hence, a partition for the updated tensor can be formed by refining the partition of the previous tensor. Also, one can decompose a tensor multiple times with different ranks of approximation. Thereby, the cost of an initial partitioning of a tensor can be amortized across multiple runs. Second, we had to partition the data on a system different from the one used for the CP-ALS runs. It would not be very useful to compare run times from different systems, and repeating the same runs on a different system would not add much to the presented results. We also note that the proposed coarse and fine-grain Mttkrp formulations are independent of the partitioning method.

Our dataset for the experiments consists of four sparse tensors, whose sizes are given in Table 3.1, obtained from various resources [21, 37, 97].

We briefly describe the methods compared in the experiments. DFacTo is the CP-ALS routine we compare our methods against, which uses block partitioning of the rows of the matricized tensors to distribute matrix nonzeros equally among processes. In the method ht-coarsegrain-block, we operate the coarse-grain CP-ALS implementation in HyperTensor on a tensor whose slices in each mode are partitioned consecutively to ensure an equal number of nonzeros among processes, similarly to DFacTo's approach. The method ht-coarsegrain-hp runs the same CP-ALS implementation, and it uses the hypergraph partitioning. The method ht-finegrain-random partitions the tensor nonzeros, as well as the rows of the factor matrices, randomly to establish load balance; it uses the fine-grain CP-ALS implementation in HyperTensor. Finally, the method ht-finegrain-hp corresponds to the fine-grain CP-ALS implementation in HyperTensor operating on the tensor partitioned according to the hypergraph model proposed for the fine-grain task decomposition. We benchmark our methods against DFacTo on the Netflix and NELL-B datasets only, as DFacTo can only process 3-mode tensors. We let the CP-ALS implementations run for 20 iterations on each data set with R = 10 and record the average time spent per iteration.

In Figures 3.2a and 3.2b, we show the time spent per CP-ALS iteration on the Netflix and NELL-B tensors for all algorithms. In Figures 3.2c and 3.2d, we show the speedups per CP-ALS iteration on the Flickr and Delicious tensors for all methods excluding DFacTo. While performing the comparison with DFacTo, we preferred to show the time spent per iteration instead of the speedup, as the sequential run time of our methods differs from that of DFacTo.

[Figure 3.2: Time for parallel CP-ALS iterations on Netflix and NELL-B, and speedups on Flickr and Delicious, for 1 to 1024 MPI processes. (a) Time on Netflix; (b) Time on NELL-B; (c) Speedup on Flickr; (d) Speedup on Delicious. The compared methods are ht-finegrain-hp, ht-finegrain-random, ht-coarsegrain-hp, ht-coarsegrain-block, and (in (a) and (b)) DFacTo.]

Also, in Table 3.2, we show the statistics regarding the communication and computation requirements of the 512-way partitions of the Netflix tensor for all proposed methods.

We first observe in Figure 3.2a that, on Netflix, ht-finegrain-hp clearly outperforms all other methods by achieving a speedup of 194x with 512 cores over a sequential execution, whereas ht-coarsegrain-hp, ht-coarsegrain-block, DFacTo, and ht-finegrain-random could only yield 69x, 63x, 49x, and 40x speedups, respectively. Table 3.2 shows that, in the 512-way partitioning of the same tensor, ht-finegrain-random, ht-coarsegrain-block, and ht-coarsegrain-hp result in total send communication volumes of 142M, 80M, and 77M units, respectively, whereas the ht-finegrain-hp partitioning requires only 7.6M units, which explains the superior parallel performance of ht-finegrain-hp. Similarly, in Figures 3.2b, 3.2c, and 3.2d, ht-finegrain-hp achieves 81x, 129x, and 123x speedups on the NELL-B, Flickr, and Delicious tensors, while the best of all other methods could only obtain 39x, 55x, and 65x, respectively, with up to 1024 cores. As seen from these figures, ht-finegrain-hp is the fastest of all proposed methods. It experiences slow-downs at 1024 cores for all instances.

One point to note in Figure 3.2b is that our Mttkrp implementation gets notably more efficient than DFacTo on NELL-B. We believe that the main reason for this difference is DFacTo's column-by-column way of computing the factor matrices. We instead compute all columns of the factor matrices simultaneously. This vector-wise mode of operation on the rows of the factors can significantly improve the data locality and cache efficiency, which leads to this performance difference. Another point in the same figure is that HyperTensor running with the ht-coarsegrain-block partitioning scales better than DFacTo, even though they use similar partitionings of the matricized tensor or the tensor. This is mainly due to the fact that DFacTo uses MPI_Allgatherv to communicate local rows of the factor matrices to all processes, which incurs a significant increase in the communication volume, whereas our coarse-grain implementation performs point-to-point communications. Finally, in Figure 3.2a, DFacTo has a slight edge over our methods in sequential execution. This could be due to the fact that it requires 3(|X| + F) multiply/add operations across all modes, where F is the average number of non-empty fibers and is upper-bounded by |X|, whereas our implementation always takes 6|X| operations.

The partitioning metrics in Table 3.2 show that ht-finegrain-hp is able to successfully reduce the total communication volume, which brings about its improved scalability. However, we do not see the same effect when using hypergraph partitioning for the coarse-grain task model, due to the inherent limitations of 1D partitionings. Another point to note is that, despite reducing the total communication volume explicitly, imbalance in the communication volume still exists among the processes, as this objective is not directly captured by the hypergraph models. However, the most important limitation on further scalability is the communication latency.

                          Comp. load          Comm. volume        Num. msg
Mode                      Max       Avg       Max       Avg       Max    Avg
ht-finegrain-hp
  1                       196672    196251    21079     6367      734    316
  2                       196672    196251    18028     5899      1022   1016
  3                       196672    196251    3545      2492      1022   1018
ht-finegrain-random
  1                       197507    196251    272326    252118    1022   1022
  2                       197507    196251    29282     22715     1022   1022
  3                       197507    196251    7766      4300      1013   1003
ht-coarsegrain-hp
  1                       364181    196251    302001    136741    511    511
  2                       349123    196251    59523     12228     511    511
  3                       737570    196251    23524     2000      511    507
ht-coarsegrain-block
  1                       198602    196251    239337    142006    448    447
  2                       367966    196251    33889     12458     511    445
  3                       737570    196251    24659     2049      511    394

Table 3.2: Statistics for the computation and communication requirements in one CP-ALS iteration for 512-way partitionings of the Netflix tensor.

Table 3.2 shows that each process communicates with almost all others, especially on the small dimensions of the tensors, and that the number of messages is doubled for the fine-grain methods. To alleviate this issue, one can try to limit the latency following previous work [201, 208].

3.5 Summary, further notes and references

The pros and cons of the coarse and fine-grain models resemble those of the 1D and 2D partitionings in the SpMV context. There are two advantages of the coarse-grain model over the fine-grain one. First, there is only one communication/computation step in the coarse-grain model, whereas the fine-grain model requires two computation and communication phases. Second, the coarse-grain hypergraph model is smaller than the fine-grain hypergraph model (one vertex per index of each dimension, versus additionally having one vertex per nonzero). However, one major limitation of the coarse-grain parallel decomposition is that restricting all nonzeros in an (N − 1)-dimensional slice of an N-dimensional tensor to reside in the same process poses a significant constraint towards communication minimization: an (N − 1)-dimensional data set typically has a very large "surface", which necessitates gathering numerous rows of the factor matrices. As N increases, one might expect this phenomenon to increase the communication volume even further. Also, if the tensor X is not large enough in one of its modes, this poses a granularity problem, which potentially causes load imbalance and limits the parallelism. None of these issues exists in the fine-grain task decomposition.

In a more recent work [135], we extended the discussed approach for efficient parallelization in shared and distributed memory systems. For the shared-memory computations, we proposed a novel computational scheme, called dimension trees, that significantly reduces the computational cost while offering an effective parallelism. We then performed theoretical analyses corresponding to the computational gains and the memory use. These analyses are also validated with the experiments. We also presented comparisons with the recent version of SPLATT [204, 205], which uses a medium grain task model and associated partitioning [187]. Recent work discusses algorithms for partitioning for medium-grain task decomposition [3].

A dimension tree is a data structure that partitions the mode indices of an N-dimensional tensor in a hierarchical manner for computing tensor decompositions efficiently. It was first used in the hierarchical Tucker format representing the hierarchical Tucker decomposition of a tensor [99], which was introduced as a computationally feasible alternative to the original Tucker decomposition for higher order tensors. Our work [135] presents a novel way of using dimension trees for computing the standard CP decomposition of tensors, with a formulation that asymptotically reduces the computational cost by memoizing certain partial computations.

For computing the CP decomposition of dense tensors, Phan et al. [189] propose a scheme that divides the tensor modes into two sets, pre-computes the tensor-times-multiple-vector multiplications (TTMVs) for each mode set, and finally reuses these partial results to obtain the final Mttkrp result in each mode. This provides a factor of two improvement in the number of TTMVs over the standard approach, and our dimension tree-based framework can be considered as the generalization of this approach, providing a factor of N/log N improvement. Li et al. [153] use the same idea of storing intermediate tensors, but use a different formulation based on the tensor-times-tensor multiplication and the tensor-matrix Hadamard product operations for shared memory systems. The overall approach is similar to that proposed by Phan et al. [189], where the difference lies in the application of the method to sparse tensors and auto-tuning to better control the memory use and the gains in the operation counts.

In most of the sparse tensors available (e.g., see FROSTT [203]), one of the dimensions is usually very small. If the indices in that dimension are vertices in a hypergraph, then we have very high degree vertices; if the indices are hyperedges, then we have hyperedges with very large sizes. Both cases are not favorable for the traditional multilevel hypergraph partitioning methods, where the efficient coarsening heuristics' run times are super-linear in terms of vertex/hyperedge degrees.

Part II

Matching and related problems

Chapter 4

Matchings in graphs

In this chapter, we address the problem of approximating the maximum cardinality of a matching in undirected graphs with fast heuristics. Most of the terms and definitions, including the problem itself, are standard, and they are summarized in Section 1.2. The formal problem definition is repeated below for convenience.

Approximating the maximum cardinality of a matching in undirected graphs. Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.

We design randomized heuristics for the stated problem and analyze their approximation guarantees. Both the design and the analysis of the heuristics are based on doubly stochastic matrices (see Section 1.4) and on scaling nonnegative matrices (in our case, the adjacency matrices) to the doubly stochastic form. We therefore first give some background on scaling matrices to doubly stochastic form.

4.1 Scaling matrices to doubly stochastic form

Any nonnegative matrix A with total support can be scaled with two positive diagonal matrices D_R and D_C such that D_R A D_C is doubly stochastic (that is, the sum of entries in any row and in any column of D_R A D_C is equal to one). If the matrix is fully indecomposable, then the scaling matrices are unique up to a scalar multiplication (if the pair D_R and D_C scales the matrix, then so does the pair αD_R and (1/α)D_C for α > 0). If A has support but not total support, then A can be scaled to a doubly stochastic matrix, but not with two positive diagonal matrices [202]; this fact is also seen in more recent treatments [139, 141, 198].

The Sinkhorn-Knopp algorithm [202] is a well-known method for scaling matrices to doubly stochastic form.

This algorithm generates a sequence of matrices (whose limit is doubly stochastic) by normalizing the columns and the rows of the sequence of matrices alternately, as shown in Algorithm 5. First, the initial matrix is normalized such that each column has sum one. Then, the resulting matrix is normalized so that each row has sum one, and so on so forth. If the initial matrix is symmetric and has total support, the Sinkhorn-Knopp algorithm will find a symmetric doubly stochastic matrix at the limit, while the intermediate matrices generated in the sequence are not symmetric. Two other iterative scaling algorithms [140, 141, 198] maintain symmetry all along the way, and are therefore more useful for symmetric matrices in practice. Efficient shared memory and distributed memory parallelizations of the Sinkhorn-Knopp and related algorithms have been discussed elsewhere [9, 41, 76].

Algorithm 5: ScaleSK: Sinkhorn-Knopp scaling

Data: A: an n × n matrix with total support; ε: the error threshold
Result: dr, dc: row/column scaling arrays

1: for i = 1 to n do
2:   dr[i] ← 1
3:   dc[i] ← 1
4: for j = 1 to n do
5:   csum[j] ← Σ_{i ∈ A_{*j}} a_{ij}
6: repeat
7:   for j = 1 to n do
8:     dc[j] ← 1/csum[j]
9:   for i = 1 to n do
10:     rsum ← Σ_{j ∈ A_{i*}} a_{ij} × dc[j]
11:     dr[i] ← 1/rsum
12:   for j = 1 to n do
13:     csum[j] ← Σ_{i ∈ A_{*j}} dr[i] × a_{ij}
14: until max{|1 − csum[j]| : 1 ≤ j ≤ n} ≤ ε
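In matrix terms, each iteration alternately rescales columns and rows; a dense NumPy rendering of the same procedure (ours, for illustration only; Algorithm 5 works on the sparse pattern) is given below.

import numpy as np

def sinkhorn_knopp(A, eps=1e-8, max_iters=10000):
    # Returns dr, dc such that diag(dr) @ A @ diag(dc) is (nearly) doubly stochastic.
    # Assumes A is nonnegative and has total support.
    n = A.shape[0]
    dr = np.ones(n)
    dc = np.ones(n)
    for _ in range(max_iters):
        dc = 1.0 / (A.T @ dr)             # make the column sums one
        dr = 1.0 / (A @ dc)               # make the row sums one
        csum = (A.T @ dr) * dc            # column sums of the currently scaled matrix
        if np.max(np.abs(1.0 - csum)) <= eps:
            break
    return dr, dc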

4.2 Approximation algorithms for bipartite graphs

Most of the existing heuristics obtain good results in practice, but their worst-case guarantee is only around 1/2. Among those, the Karp-Sipser heuristic [128] is very well known. It finds maximum cardinality matchings in highly sparse (random) graphs and matches all but O(n^{1/5}) vertices of a random undirected graph [14] (this heuristic will be reviewed later in Section 4.2.1). Karp-Sipser also obtains very good results in practice. Currently, it is the suggested one to be used as a jump-start routine [71, 148] for exact algorithms, especially for augmenting-path based ones [132].

Algorithms that achieve an approximation ratio of 1 − 1/e, where e is the base of the natural logarithm, are designed for the online case [129]. Many of these algorithms are sequential in nature, in that a sequence of greedy decisions are made in the light of the previously made decisions. Another heuristic is obtained by truncating the Hopcroft and Karp (HK) algorithm. HK, starting from a given matching, augments along a maximal set of shortest disjoint paths until there are no augmenting paths. If one lets HK run until the shortest augmenting paths are of length k, then a 1 − 2/k approximate matching is obtained, for k ≥ 3. The run time of this heuristic is O(τk) for a bipartite graph with τ edges.

We propose two matching heuristics (Section 4.2.2) for bipartite graphs. Both heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum matching in the selected subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge in both heuristics is obtained with a matrix scaling method. The first heuristic is shown to deliver a constant approximation guarantee of 1 − 1/e ≈ 0.632 of the maximum cardinality matching, under the condition that the scaling method has scaled the input matrix. The second one builds on top of the first one and improves the approximation ratio to 0.866, under the same condition as in the first one. Both of the heuristics are designed to be amenable to parallelization in modern multicore systems. The first heuristic does not require a conflict resolution scheme. Furthermore, it does not have any synchronization requirements. The second heuristic employs Karp-Sipser to find a matching on the selected subgraph. We show that Karp-Sipser becomes an exact algorithm on the selected subgraphs. Further analysis of the properties of those subgraphs is carried out to design a specialized implementation of Karp-Sipser for efficient parallelization. The approximation guarantees of the two proposed heuristics do not deteriorate with the increased degree of parallelization, thanks to their design, which is usually not the case for parallel matching heuristics [16].

For the analysis, we assume that the two vertex classes contain the same number of vertices and that the associated adjacency matrix has total support. Under these criteria, the Sinkhorn-Knopp scaling algorithm summarized in Section 4.1 converges in a finite number of steps and obtains a doubly stochastic matrix. Later on (towards the end of Section 4.2.2 and in the experiments presented in Section 4.2.3), we discuss bipartite graphs without the mentioned properties.

4.2.1 Related work

Parallel (exact) matching algorithms on modern architectures have been recently investigated. Azad et al. [16] study the implementations of a set of known bipartite graph matching algorithms on shared memory systems.

Deveci et al. [66, 65] investigate the implementation of some known matching algorithms, or their variants, on GPUs. There are quite good speedups reported in these implementations, yet there are non-trivial instances where parallelism does not help (for any of the algorithms).

Our focus is on matching heuristics that have linear run time complexity and good quality guarantees on the size of the matching. Recent surveys of matching heuristics are given by Duff et al. [71, Section 4], Langguth et al. [148], and, more recently, Pothen et al. [195]. Two heuristics, called Greedy and Karp-Sipser, stand out and are suggested as initialization steps in the best two exact matching algorithms [132]. These two heuristics have also attracted theoretical interest.

The Greedy heuristic has two variants in the literature. The first variant randomly visits the edges and matches the two endpoints of an edge if they are both available. The theoretical performance guarantee of this heuristic is 1/2; i.e., the heuristic delivers matchings of size at least half of the maximum cardinality of a matching. This has been analyzed theoretically [78] and shown to obtain results that are near the worst case on certain classes of graphs. The second variant of Greedy repeatedly selects a random vertex and matches it with a random neighbor. The matched vertices, along with the ones which become isolated, are removed from the graph, and the process continues until the whole graph is consumed. This variant also has a 1/2 worst-case approximation guarantee (see, for example, a proof by Pothen and Fan [194]), and it is somewhat better (0.5 + ε for ε ≥ 0.0000025 [13], which has been recently improved to ε ≥ 1/256 [191]).

We make use of the Karp-Sipser heuristic to design one of the proposed heuristics. We summarize a commonly used variant of Karp-Sipser here and refer the reader to the original paper [128] for details. The theoretical foundation of Karp-Sipser is that if there is a vertex v with exactly one neighbor (v is called degree-one), then there is a maximum cardinality matching in which v is matched with its neighbor. That is, matching v with its neighbor is an optimal decision. Using this observation, Karp-Sipser runs as follows. Check whether there is a degree-one vertex; if so, then match the vertex with its unique neighbor and delete both vertices (and the edges incident on them) from the graph. Continue this way until the graph has no edges (in which case we are done) or all remaining vertices have degree larger than one. In the latter case, pick a random edge, match the two endpoints of this edge, and delete those vertices and the edges incident on them. Then repeat the whole process on the remaining graph. The phase before the first random choice of edges is called Phase 1, and the rest is called Phase 2 (where new degree-one vertices may arise). The run time of this heuristic is linear. This heuristic matches all but O(n^{1/5}) vertices of a random undirected graph [14]. One disadvantage of Karp-Sipser is that, because of the degree dependencies of the vertices on the already matched vertices, an efficient parallelism is hard to achieve (a list of degree-one vertices needs to be maintained).

59

without any known quality guarantee) of this heuristic were used in recentstudies [16]
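For concreteness, the variant just described can be sketched as follows. This is a simplified, sequential Python rendering; the names and the adjacency-list representation are ours, and the degree-one scan is kept naive for clarity instead of the linear-time bookkeeping mentioned above.

import random

def karp_sipser(adj):
    # adj: dict mapping each vertex to the set of its neighbors (undirected graph).
    # Returns a dict 'match' with match[u] = v for matched pairs.
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    match = {}

    def remove(u):
        # Delete u and its incident edges from the working graph.
        for w in adj.pop(u, set()):
            adj[w].discard(u)

    while any(adj.values()):
        # Phase 1 rule: matching a degree-one vertex with its unique
        # neighbor is an optimal decision.
        deg1 = [u for u, vs in adj.items() if len(vs) == 1]
        if deg1:
            u = deg1[0]
            v = next(iter(adj[u]))
        else:
            # Phase 2 rule: no degree-one vertex; pick a random edge.
            u = random.choice([x for x, vs in adj.items() if vs])
            v = random.choice(list(adj[u]))
        match[u], match[v] = v, u
        remove(u)
        remove(v)
    return match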

Recent studies focusing on approximate matching algorithms on parallel systems include heuristics for the graph matching problem [42, 86, 105] and also heuristics used for initializing bipartite matching algorithms [16]. Lotker et al. [164] present a distributed 1 − 1/k approximate matching algorithm for bipartite graphs. This nice theoretical algorithm has O(k^3 log ∆ + k^2 log n) time steps with a message length of O(log ∆), where ∆ is the maximum degree of a vertex and n is the number of vertices in the graph. Blelloch et al. [29] propose an algorithm to compute maximal matchings (1/2 approximate) with O(τ) work and O(log^3 τ) depth, with high probability, on a bipartite graph with τ edges. This is an elaboration of the Greedy heuristic for parallel systems. Although the performance metrics, work and depth, are quite impressive, the approximation guarantee stays as in the serial variant. A striking property of this heuristic is that it trades parallelism and reduced work while always finding the same matching (including those found in the sequential version). Birn et al. [26] discuss maximal (1/2 approximate) matchings in O(log^2 n) time and O(τ) work in the CREW PRAM model. Practical implementations on distributed memory and GPU systems are discussed; a shared memory implementation is left as future work in the cited paper. Patwary et al. [185] discuss an efficient implementation of Karp-Sipser on distributed memory systems and show that the communication requirement (in terms of the volume) is proportional to that of sparse matrix-vector multiplication.

The overall approach is related to a random bipartite graph model called the random k-out model [211]. In these graphs, every vertex chooses k neighbors uniformly at random from the other part of vertices. Walkup [211] shows that a 1-out subgraph of a complete bipartite graph G = (V1 ∪ V2, E), where |V1| = |V2|, asymptotically almost surely (a.a.s.) does not contain a perfect matching. He also shows that a 2-out subgraph of a complete bipartite graph a.a.s. contains a perfect matching. Karoński and Pittel [125], later with Overman [124], extend the analysis to two other cases. In the first case, the vertices that were not selected at all choose one more neighbor; in the second case, the vertices that were selected only once choose one more neighbor. The initial claim [125] was that in the first case there was a perfect matching a.a.s., which was shown to be false [124]. The existence of perfect matchings in the second case, in which the graph is still between 1-out and 2-out, has been shown recently [124].

4.2.2 Two matching heuristics

We propose two matching heuristics for the approximate maximum cardinality matching problem on bipartite graphs. The heuristics are efficiently parallelizable and have guaranteed approximation ratios. The first heuristic does not require synchronization nor conflict resolution, assuming that the write operations to the memory are atomic (such settings are common; see a discussion in the original paper [76]). This heuristic has an approximation guarantee of around 0.632. The second heuristic is designed to obtain larger matchings compared to those obtained by the first one. This heuristic employs Karp-Sipser on a judiciously chosen subgraph of the input graph. We show that for this subgraph, Karp-Sipser is an exact algorithm, and a specialized efficient implementation is possible to obtain matchings of size around 0.866 of the maximum cardinality.

One-sided matching

The first matching heuristic we propose, OneSidedMatch, scales the given adjacency matrix A (each a_{ij} is originally either 0 or 1) and uses the scaled entries to randomly choose a column as a match for each row. The pseudocode of the heuristic is shown in Algorithm 6.

Algorithm 6: OneSidedMatch (bipartite graphs)

Data: A: an n × n (0, 1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1: ⟨dr, dc⟩ ← ScaleSK(A)
2: for j = 1 to n in parallel do
3:    cmatch[j] ← NIL
4: for i = 1 to n in parallel do
5:    Pick a random column j ∈ A_{i*} by using the probability density function s_{ik} / Σ_{ℓ∈A_{i*}} s_{iℓ} for all k ∈ A_{i*}, where s_{ik} = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
6:    cmatch[j] ← i

OneSidedMatch first obtains the scaling vectors dr and dc corresponding to a doubly stochastic matrix S (line 1). After initializing the cmatch array, for each row i of A, the heuristic randomly chooses a column j ∈ A_{i*}, based on the probabilities computed by using the corresponding scaled entries of row i. It then matches i and j. Clearly, multiple rows can choose the same column and write to the same entry in cmatch. We assume that, in the parallel shared-memory setting, one of the write operations survives, and the cmatch array defines a valid matching, i.e., {(cmatch[j], j) : cmatch[j] ≠ NIL}. We now analyze its approximation guarantee in terms of the matching cardinality.

Theorem 4.1 Let A be an n × n (0, 1)-matrix with total support. Then, OneSidedMatch obtains a matching of size at least n(1 − 1/e) ≈ 0.632n in expectation, as n → ∞.

Proof. To compute the matching cardinality, we will count the columns that are not picked by any row and subtract this number from n. Since Σ_{k∈A_{i*}} s_{ik} = 1 for each row i of S, the probability that a column j is not picked by any of the rows in A_{*j} is equal to Π_{i∈A_{*j}} (1 − s_{ij}). By applying the arithmetic-geometric mean inequality, we obtain

    ( Π_{i∈A_{*j}} (1 − s_{ij}) )^{1/d_j}  ≤  ( d_j − Σ_{i∈A_{*j}} s_{ij} ) / d_j ,

where d_j = |A_{*j}| is the degree of the column vertex j. Therefore,

    Π_{i∈A_{*j}} (1 − s_{ij})  ≤  ( 1 − (Σ_{i∈A_{*j}} s_{ij}) / d_j )^{d_j} .

Since S is doubly stochastic, we have Σ_{i∈A_{*j}} s_{ij} = 1, and hence

    Π_{i∈A_{*j}} (1 − s_{ij})  ≤  ( 1 − 1/d_j )^{d_j} .

The function on the right-hand side is an increasing one and has the limit

    lim_{d_j→∞} ( 1 − 1/d_j )^{d_j} = 1/e ,

where e is the base of the natural logarithm. By the linearity of expectation, the expected number of unmatched columns is no larger than n/e. Hence, the cardinality of the matching is no smaller than n(1 − 1/e).

In Algorithm 6, we split the rows among the threads with a parallel for construct. For each row i, the corresponding thread chooses a random number r from a uniform distribution with range (0, Σ_{k∈A_{i*}} s_{ik}]. Then, the last nonzero column index j for which Σ_{1≤k≤j} s_{ik} ≤ r is found, and cmatch[j] is set to i. Since no synchronization or conflict detection is required, the heuristic promises significant speedups.
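To fix the ideas, here is a sequential Python sketch of the whole heuristic under the stated assumptions (a square (0,1)-matrix with total support and no empty rows or columns). It runs a fixed number of Sinkhorn-Knopp iterations in place of ScaleSK and then performs the sampling described above; the function names and the dense-matrix representation are illustrative only.

import numpy as np

def scale_sk(A, num_iters=10):
    # A few Sinkhorn-Knopp iterations: returns dr, dc such that
    # diag(dr) A diag(dc) is approximately doubly stochastic.
    n = A.shape[0]
    dr, dc = np.ones(n), np.ones(n)
    for _ in range(num_iters):
        dc = 1.0 / (A.T @ dr)   # make column sums (approximately) one
        dr = 1.0 / (A @ dc)     # make row sums (approximately) one
    return dr, dc

def one_sided_match(A, num_iters=10, rng=np.random.default_rng()):
    # OneSidedMatch sketch: each row picks a column with probability
    # proportional to the scaled entries; conflicting writes to a column
    # are resolved by the last writer, as in the shared-memory setting.
    n = A.shape[0]
    dr, dc = scale_sk(A, num_iters)
    cmatch = [None] * n
    for i in range(n):
        cols = np.flatnonzero(A[i])             # the set A_{i*}
        probs = dr[i] * A[i, cols] * dc[cols]   # scaled entries s_{ik}
        probs /= probs.sum()                    # normalize (scaling is approximate)
        j = rng.choice(cols, p=probs)
        cmatch[j] = i
    return cmatch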

Two-sided matching

OneSidedMatch's approximation guarantee and suitable structure for parallel architectures make it a good heuristic. The natural question that follows is whether a heuristic with a better guarantee exists. Of course, the sought heuristic should also be simple and easy to parallelize. We asked "what happens if we repeat the process for the other (column) side of the bipartite graph?" The question led us to the following algorithm. Let each row select a column and let each column select a row. Take all these 2n choices to construct a bipartite graph G (a subgraph of the input) with 2n vertices and at most 2n edges (if i chooses j and j chooses i, we have one edge), and seek a maximum cardinality matching in G. Since the number of edges is at most 2n, any exact matching algorithm on this graph would be fast; in particular, the worst-case run time would be O(n^{1.5}) [117]. Yet we can do better and obtain a maximum cardinality matching in linear time by running the Karp-Sipser heuristic on G, as we display in Algorithm 7.

Algorithm 7: TwoSidedMatch (bipartite graphs)

Data: A: an n × n (0, 1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1: ⟨dr, dc⟩ ← ScaleSK(A)
2: for i = 1 to n in parallel do
3:    Pick a random column j ∈ A_{i*} by using the probability density function s_{ik} / Σ_{ℓ∈A_{i*}} s_{iℓ} for all k ∈ A_{i*}, where s_{ik} = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
4:    rchoice[i] ← j
5: for j = 1 to n in parallel do
6:    Pick a random row i ∈ A_{*j} by using the probability density function s_{kj} / Σ_{ℓ∈A_{*j}} s_{ℓj} for all k ∈ A_{*j}
7:    cchoice[j] ← i
8: Construct a bipartite graph G = (V_R ∪ V_C, E), where
      E = {{i, rchoice[i]} : i ∈ {1, . . . , n}} ∪ {{cchoice[j], j} : j ∈ {1, . . . , n}}
9: match ← Karp-Sipser(G)

The most interesting component of TwoSidedMatch is the incorporation of the Karp-Sipser heuristic, for two reasons. First, although it is only a heuristic in general, Karp-Sipser computes a maximum cardinality matching on the bipartite graph G constructed in Algorithm 7. Second, although Karp-Sipser has a sequential nature, we can obtain good speedups with a specialized parallel implementation. In general, it is hard to efficiently parallelize (non-trivial) graph algorithms, and it is even harder when the overall cost is O(n), which is the case for Karp-Sipser on G. We give a series of lemmas below which enables us to use Karp-Sipser as an exact algorithm with a good shared-memory parallel performance.
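The graph construction of lines 1-8 of Algorithm 7 can be sketched as follows (a sequential Python sketch; the argument S stands for a scaled version of the adjacency matrix, e.g., the output of a few Sinkhorn-Knopp iterations, and the function name is ours).

import numpy as np

def one_out_graph(S, rng=np.random.default_rng()):
    # Edges of the bipartite graph G used by TwoSidedMatch.
    # S: (approximately) doubly stochastic scaling of the n x n adjacency matrix.
    # Every row picks a column and every column picks a row; the union of
    # these 2n choices has at most 2n edges, duplicates collapsing to one.
    n = S.shape[0]
    edges = set()
    for i in range(n):                       # rows choose columns
        cols = np.flatnonzero(S[i])
        j = rng.choice(cols, p=S[i, cols] / S[i, cols].sum())
        edges.add((i, j))
    for j in range(n):                       # columns choose rows
        rows = np.flatnonzero(S[:, j])
        i = rng.choice(rows, p=S[rows, j] / S[rows, j].sum())
        edges.add((i, j))
    return edges

A maximum cardinality matching of the returned edge set is exactly what Karp-Sipser computes on it, by the arguments below.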

The first lemma describes the structure of G constructed at line 8 of TwoSidedMatch.

Lemma 4.1 Each connected component of G constructed in Algorithm 7 contains at most one simple cycle.

Proof. A connected component M ⊆ G with n′ vertices can have at most n′ edges. Let T be a spanning tree of M. Since T contains n′ − 1 edges, the remaining edge of M (there is at most one) can create at most one cycle when added to T.

Lemma 4.1 explains why Karp-Sipser is an exact algorithm on G. If a component does not contain a cycle, Karp-Sipser consumes all its vertices in Phase 1. Therefore, all of the matching decisions given by Karp-Sipser are optimal for this component. Assume a component contains a simple cycle. After Phase 1, the component is either consumed or, due to Lemma 4.1, it is reduced to a simple cycle. In the former case, the matching is a maximum cardinality one. In the latter case, an arbitrary edge of the cycle can be used to match a pair of vertices. This decision necessarily leads to a unique perfect matching in the remaining simple path. These two arguments can be repeated for all the connected components to see that the Karp-Sipser heuristic finds a maximum cardinality matching in G.

Algorithm 8 describes our parallel implementation, Karp-Sipser-MT. The graph is represented using a single array choice, where choice[u] is the vertex randomly chosen by u ∈ V_R ∪ V_C. The choice array is a concatenation of the arrays rchoice and cchoice set in TwoSidedMatch. Hence, an explicit graph construction for G (line 8 of Algorithm 7) is not required, and a transformation of the selected edges to a graph storage scheme is avoided. Karp-Sipser-MT uses three atomic operations for synchronization. The first operation, Add(memory, value), atomically adds a value to a memory location. It is used to compute the vertex degrees in the initial graph (line 9). The second operation, CompAndSwap(memory, value, replace), first checks whether the memory location contains value; if so, its content is replaced by replace. The final content is returned. The third operation, AddAndFetch(memory, value), atomically adds a given value to a memory location, and the final content is returned. We will describe the use of the latter two operations later.

Karp-Sipser-MT has two phases, which correspond to the two phases of Karp-Sipser. The first phase of Karp-Sipser-MT is similar to that of Karp-Sipser in that optimal matching decisions are made about some degree-one vertices. The second phase of Karp-Sipser-MT handles the remaining vertices very efficiently, without bothering with their degrees. The following definitions are used to clarify the difference between the original Karp-Sipser and Karp-Sipser-MT.

Definition 4.1 Given a matching and the array choice, let u be an unmatched vertex and v = choice[u]. Then u is called

• out-one, if v is unmatched and no unmatched vertex w with choice[w] = u exists;

• in-one, if v is matched and only a single unmatched vertex w with choice[w] = u exists.

Algorithm 8: Karp-Sipser-MT (bipartite 1-out graphs)

Data: G = (V, choice[·]): the chosen vertex for each u ∈ V
Result: match[·]: the match array for u ∈ V

1: for all u ∈ V in parallel do
2:    mark[u] ← 1
3:    deg[u] ← 1
4:    match[u] ← NIL
5: for all u ∈ V in parallel do
6:    v ← choice[u]
7:    mark[v] ← 0
8:    if choice[v] ≠ u then
9:       Add(deg[v], 1)
10: for each vertex u in parallel do        ▷ Phase 1: out-one vertices
11:    if mark[u] = 1 then
12:       curr ← u
13:       while curr ≠ NIL do
14:          nbr ← choice[curr]
15:          if CompAndSwap(match[nbr], NIL, curr) = curr then
16:             match[curr] ← nbr
17:             curr ← NIL
18:             next ← choice[nbr]
19:             if match[next] = NIL then
20:                if AddAndFetch(deg[next], −1) = 1 then
21:                   curr ← next
22:          else
23:             curr ← NIL
24: for each column vertex u in parallel do ▷ Phase 2: the rest
25:    v ← choice[u]
26:    if match[u] = NIL and match[v] = NIL then
27:       match[u] ← v
28:       match[v] ← u

The first phase of Karp-Sipser-MT (for loop of line 10) does not track and match all degree-one vertices. Instead, only the out-one vertices are taken into account. For each such vertex u that is already out-one before Phase 1, we have mark[u] = 1. Karp-Sipser-MT visits these vertices (lines 10-11). Newly arising out-one vertices are consumed right away, without maintaining a list. The second phase of Karp-Sipser-MT (for loop of line 24) is much simpler than that of Karp-Sipser, as the degrees of the vertices are not tracked/updated. We now discuss how these simplifications are possible while ensuring a maximum cardinality matching in G.

Observation 4.1 An out-one or an in-one vertex is a degree-one vertex in Karp-Sipser.

Observation 4.2 A degree-one vertex in Karp-Sipser is either an out-one or an in-one vertex, or it is one of the two vertices u and v in a 2-clique such that v = choice[u] and u = choice[v].

Lemma 4.2 (Proved elsewhere [76, Lemma 3]) If there exists an in-one vertex in G at any time during the execution of Karp-Sipser-MT, an out-one vertex also exists.

According to Observation 4.1, all the matching decisions given by Karp-Sipser-MT in Phase 1 are optimal, since an out-one vertex is a degree-one vertex. Observation 4.2 implies that, among all the degree-one vertices, Karp-Sipser-MT ignores only the in-ones and 2-cliques. According to Lemma 4.2, an in-one vertex cannot exist without an out-one vertex; therefore, they are handled in the same phase. The 2-cliques that survive Phase 1 are handled in Karp-Sipser-MT's Phase 2, since they can be considered as cycles.

To analyze the second phase of Karp-Sipser-MT, we will use the following lemma.

Lemma 4.3 (Proved elsewhere [76, Lemma 4]) Let G′ = (V′_R ∪ V′_C, E′) be the graph induced by the remaining vertices after the first phase of Karp-Sipser-MT. Then, the set {{u, choice[u]} : u ∈ V′_R, choice[u] ∈ V′_C} is a maximum cardinality matching in G′.

In the light of Observations 4.1 and 4.2 and Lemmas 4.1-4.3, Karp-Sipser-MT is an exact algorithm on the graphs created in Algorithm 7. Its worst-case (sequential) run time is linear.


Figure 4.1: A toy bipartite graph with 9 row (circles) and 9 column (squares) vertices. The edges are oriented from a vertex u to the vertex choice[u]. Assuming all the vertices are currently unmatched, matching 15-7 (or 5-13) creates two degree-one vertices. But no out-one vertex arises after matching (15-7), and only one vertex, 6, arises after matching (5-13).

Karp-Sipser-MT tracks and consumes only the out-one vertices. This brings high flexibility while executing Karp-Sipser-MT in multi-threaded environments. Consider the example in Figure 4.1. Here, after matching a pair of vertices and removing them from the graph, multiple degree-one vertices can be generated. The standard Karp-Sipser uses a list to store these new degree-one vertices. Such a list is necessary to obtain a larger matching, but the associated synchronizations while updating it in parallel will be an obstacle for efficiency. The synchronization can be avoided up to some level if one sacrifices the approximation quality by not making all optimal decisions (as in some existing work [16]). We continue with the following lemma to take advantage of the special structure of the graphs in TwoSidedMatch for parallel efficiency in Phase 1.

Lemma 4.4 (Proved elsewhere [76, Lemma 5]) Consuming an out-one vertex creates at most one new out-one vertex.

According to Lemma 4.4, Karp-Sipser-MT does not need a list to store the new out-one vertices, since the process can continue with the new out-one vertex. In a shared-memory setting, there are two concerns for the first phase from the synchronization point of view. First, multiple threads that are consuming different out-one vertices can try to match them with the same unmatched vertex. To handle such cases, Karp-Sipser-MT uses the atomic CompAndSwap operation (line 15 of Algorithm 8) and ensures that only one of these matchings will be processed. In this case, the other threads, whose matching decisions are not performed, continue with the next vertex in the for loop at line 10. The second concern is that, while consuming out-one vertices, several threads may create the same out-one vertex (and want to continue with it). For example, in Figure 4.1, when two threads consume the out-one vertices 1 and 2 at the same time, they both will try to continue with vertex 4. To handle such cases, an atomic AddAndFetch operation (line 20 of Algorithm 8) is used to synchronize the degree reduction operations on the potential out-one vertices. This approach explicitly orders the vertex consumptions and guarantees that only the thread which performs the last consumption before a new out-one vertex u appears continues with u. The other threads, which wanted to continue with the same path, stop and skip to the next unconsumed out-one vertex in the main for loop. One last concern that can be important for the parallel performance is the maximum length of such paths, since a very long path can yield a significant imbalance in the work distribution over the threads. We investigated the length of such paths experimentally (see the original paper [76]) and observed them to be too short to hurt the parallel performance.
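The chain-following idea of Phase 1 can also be seen in the following sequential Python sketch, which mirrors lines 1-23 of Algorithm 8 without the atomic operations needed in the parallel setting (the names are ours, and only Phase 1 is shown).

def phase1_out_one(choice):
    # choice[u] is the vertex chosen by u in the 1-out graph of TwoSidedMatch.
    # Follows chains of out-one vertices without maintaining a list.
    n2 = len(choice)
    match = [None] * n2
    deg = [1] * n2               # one for the out-edge of each vertex
    mark = [True] * n2
    for u in range(n2):
        v = choice[u]
        mark[v] = False          # v is chosen by someone: not out-one initially
        if choice[v] != u:
            deg[v] += 1          # count in-edges, except the reciprocal case
    for u in range(n2):
        if not mark[u]:
            continue
        curr = u
        while curr is not None:
            nbr = choice[curr]
            nxt = None
            if match[nbr] is None:          # match curr with its choice
                match[nbr], match[curr] = curr, nbr
                cand = choice[nbr]          # nbr's out-edge disappears
                if match[cand] is None:
                    deg[cand] -= 1
                    if deg[cand] == 1:      # cand became out-one: continue with it
                        nxt = cand
            curr = nxt
    return match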

The second phase of Karp-Sipser-MT is efficiently parallelized by using the idea in Lemma 4.3. That is, a maximum cardinality matching for the graph remaining after the first phase of Karp-Sipser-MT can be obtained via a simple parallel for construct (see line 24 of Algorithm 8).

Quality of approximation. If the initial matrix A is the n × n matrix of 1s, that is a_{ij} = 1 for all 1 ≤ i, j ≤ n, then the doubly stochastic matrix S is such that s_{ij} = 1/n for all 1 ≤ i, j ≤ n. In this case, the graph G created by Algorithm 7 is a random 1-out bipartite graph [211]. Referring to a study by Meir and Moon [172], Karoński and Pittel [125] argue that the maximum cardinality of a matching in a random 1-out bipartite graph is 2(1 − Ω)n ≈ 0.866n in expectation, where Ω ≈ 0.567, also called Lambert's W(1), is the unique solution of the equation Ω e^Ω = 1. We obtain the same result for a random 1-out subgraph of any bipartite graph whose adjacency matrix has total support, as highlighted in the following theorem.

Theorem 4.2 Let A be the n × n adjacency matrix of a given bipartite graph, where A has total support. Then, TwoSidedMatch obtains a matching of size 2(1 − Ω)n ≈ 0.866n in expectation as n → ∞, where Ω ≈ 0.567 is the unique solution of the equation Ω e^Ω = 1.

Theorem 4.2 contributes to the known results about the Karp-Sipser heuristic (recall that it is known to leave out O(n^{1/5}) vertices) by showing a constant approximation ratio with some preprocessing. The existence of total support does not seem to be necessary for Theorem 4.2 to hold (see the next subsection).

The proof of Theorem 4.2 is involved and can be found in the original paper [76]. It essentially adapts the original analysis of Karp and Sipser [128] to the special graphs at hand.

Discussion

The Sinkhorn-Knopp scaling algorithm converges linearly (when A has total support), where the convergence rate is equivalent to the square of the second largest singular value of the resulting doubly stochastic matrix [139]. If the matrix does not have support or total support, less is known about the Sinkhorn-Knopp scaling algorithm's convergence; we do not know how much of a reduction we will get at each step. In such cases, we are not able to bound the run time of the scaling step. However, the scaling algorithms should be run for only a few iterations (see also the next paragraph) in practice, in which case the practical run time of our heuristics would be linear (in edges and vertices).

We have discussed the proposed matching heuristics while assuming that A has total support. This can be relaxed in two ways to render the overall approach practical in any bipartite graph. The first relaxation is that we do not need to run the scaling algorithms until convergence (as in some other use cases of similar algorithms [141]). If Σ_{i∈A_{*j}} s_{ij} ≥ α instead of Σ_{i∈A_{*j}} s_{ij} = 1 for all j ∈ {1, . . . , n}, then

    lim_{d→∞} ( 1 − α/d )^d = 1/e^α .

In other words, if we apply the scaling algorithms for a few iterations, or until some relatively large error tolerance, we can still derive similar results. For example, if α = 0.92, we will have a matching of size n(1 − 1/e^α) ≈ 0.601n for OneSidedMatch (larger column sums give improved ratios, but there are columns whose sum is less than one when convergence is not achieved). With the same α, TwoSidedMatch will obtain an approximation of 0.840. In our experiments, the number of iterations was always a few, and the proven approximation guarantees were always observed. The second relaxation is that we do not need total support; we do not even need support, nor an equal number of vertices in the two vertex classes. Since little is known about the scaling methods on such matrices, we only mention some facts and observations, and later on present some experiments to demonstrate the practicality of the proposed OneSidedMatch and TwoSidedMatch heuristics.
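As a quick numerical check of the first relaxation (a one-off computation, not part of the heuristics), the OneSidedMatch bound 1 − 1/e^α can be evaluated directly for α = 0.92:

import math

alpha = 0.92
one_sided_bound = 1.0 - 1.0 / math.exp(alpha)
print(f"OneSidedMatch bound for alpha={alpha}: {one_sided_bound:.3f}")  # ~0.601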

A sparse matrix (not necessarily square) can be permuted into a block upper triangular form using the canonical Dulmage-Mendelsohn (DM) decomposition [77]:

    A = | H  ∗  ∗ |        S = | S1  ∗  |
        | O  S  ∗ |            | O   S2 |
        | O  O  V |

where H (horizontal) has more columns than rows and has a matching covering all rows, S is square and has a perfect matching, and V (vertical) has more rows than columns and a matching covering all columns. The following facts about the DM decomposition are well known [192, 194]. Any of these three blocks can be void. If H is not connected, then it is block diagonal with horizontal blocks. If V is not connected, then it is block diagonal with vertical blocks. If S does not have total support, then it is in the block upper triangular form shown on the right, where S1 and S2 have the same structure recursively, until each block Si has total support. The entries in the blocks shown by "∗" cannot be put into a maximum cardinality matching. When the presented scaling methods are applied to a matrix, the entries in the "∗" blocks will tend to zero (the case of S is well documented [202]). Furthermore, the row sums of the blocks of H will be a multiple of the column sums in the same block; a similar statement holds for V; finally, S will be doubly stochastic. That is, the scaling algorithms applied to bipartite graphs without perfect matchings will zero out the entries in the irrelevant parts and identify the entries that can be put into a maximum cardinality matching.

4.2.3 Experiments

We present a selected set of results from the original paper [76]. We used a shared-memory parallel version of the Sinkhorn-Knopp algorithm with a fixed number of iterations. This way, the scaling algorithm is simplified, as no convergence check is required. Furthermore, its overhead is bounded. In most of the experiments, we use 0, 1, 5, or 10 scaling iterations, where the case of 0 corresponds to applying the matching heuristics with uniform edge selection probabilities. While these settings do not guarantee convergence of the Sinkhorn-Knopp scaling algorithm, they seem to be enough for our purposes.

Matching quality

We investigate the matching quality of the proposed heuristics on all square matrices with support from the SuiteSparse Matrix Collection [59] having at least 1000 non-empty rows/columns and at most 20,000,000 nonzeros (some of the matrices are also in the 10th DIMACS challenge [19]). There were 742 matrices satisfying these properties at the time of experimentation. With the OneSidedMatch heuristic, the quality guarantee of 0.632 was surpassed with 10 iterations of the scaling method in 729 matrices. With the TwoSidedMatch heuristic and the same number of iterations of the scaling method, the quality guarantee of 0.866 was surpassed in 705 matrices. Making 10 more scaling iterations smoothed out the remaining instances. The Greedy heuristic (the second variant discussed in Section 4.2.1) obtained on average (both the geometric and arithmetic means) matchings of size 0.93 of the maximum cardinality. In the worst case, 0.82 was observed.

We collect a few of the matrices from the SuiteSparse collection in Table 4.1. Figures 4.2a and 4.2b show the qualities on our test matrices. In the figures, the first columns represent the case that the neighbors are picked from a uniform distribution over the adjacency lists, i.e., the case with no scaling, hence no guarantee. The quality guarantees are achieved with only 5 scaling iterations. Even with a single iteration, the quality of TwoSidedMatch is more than 0.866 for all matrices, and that of OneSidedMatch is more than 0.632 for all matrices.

Name            n          # of edges   Avg. deg.    stdA     stdB   sprank/n
atmosmodl        1489752    10319760       6.9        0.3      0.3     1.00
audikw_1          943695    77651847      82.2       42.5     42.5     1.00
cage15           5154859    99199551      19.2        5.7      5.7     1.00
channel          4802000    85362744      17.8        1.0      1.0     1.00
europe_osm      50912018   108109320       2.1        0.5      0.5     0.99
Hamrle3          1447360     5514242       3.8        1.6      2.1     1.00
hugebubbles     21198119    63580358       3.0        0.03     0.03    1.00
kkt_power        2063494    12771361       6.2        7.45     7.45    1.00
nlpkkt240       27993600   760648352      26.7       22.2     22.2     1.00
road_usa        23947347    57708624       2.4        0.93     0.93    0.95
torso1            116158     8516500      73.3      419.59   245.44    1.00
venturiLevel3    4026819    16108474       4.0        0.12     0.12    1.00

Table 4.1: Properties of the matrices used in the comparisons. In the table, sprank is the maximum matching cardinality, Avg. deg. is the average vertex degree, and stdA and stdB are the standard deviations of the A and B vertex (row and column) degrees, respectively.

Comparison of TwoSidedMatch with Karp-Sipser. Next, we analyze the performance of the proposed heuristics with respect to Karp-Sipser on a matrix class which we designed as a challenging case for Karp-Sipser. Let A be an n × n matrix, R1 be the set of A's first n/2 rows, and R2 be the set of A's last n/2 rows; similarly, let C1 and C2 be the set of the first n/2 and the set of the last n/2 columns. As Figure 4.3 shows, A has a full R1 × C1 block and an empty R2 × C2 block. The last h ≪ n rows and columns of R1 and C1, respectively, are full. Each of the blocks R1 × C2 and R2 × C1 has a nonzero diagonal. Those diagonals form a perfect matching when combined. In the sequel, a matrix whose corresponding bipartite graph has a perfect matching will be called full-sprank, and sprank-deficient otherwise.

When h ≤ 1, the Karp-Sipser heuristic consumes the whole graph during Phase 1 and finds a maximum cardinality matching. When h > 1, Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block R1 × C1 is full, it is more likely that the nonzero will be chosen from this block. Thus, a row in R1 will be matched with a column in C1, which is a bad decision, since the block R2 × C2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that TwoSidedMatch chooses an edge from that block goes to zero, as those entries cannot be in a perfect matching.
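A generator for this family of matrices, following the description above, can be sketched as follows (the function name and the 0-based indexing are ours; n is assumed even and h at most n/2).

import numpy as np

def hard_instance(n, h):
    # Adjacency matrix of the hard instances for Karp-Sipser (Figure 4.3).
    # R1/R2 are the first/last n/2 rows, C1/C2 the first/last n/2 columns.
    assert n % 2 == 0 and 0 <= h <= n // 2
    half = n // 2
    A = np.zeros((n, n), dtype=int)
    A[:half, :half] = 1                  # full R1 x C1 block
    A[half - h:half, :] = 1              # last h rows of R1 are full
    A[:, half - h:half] = 1              # last h columns of C1 are full
    for k in range(half):
        A[k, half + k] = 1               # nonzero diagonal of R1 x C2
        A[half + k, k] = 1               # nonzero diagonal of R2 x C1
    return A                             # R2 x C2 remains empty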

Figure 4.2: Matching qualities of (a) OneSidedMatch and (b) TwoSidedMatch on the test matrices of Table 4.1. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations (0, 1, 5).

Figure 4.3: Adjacency matrices of hard bipartite graph instances for Karp-Sipser.


Results with different number of scaling iterations

          Karp-      0           1                 5                 10
  h       Sipser   Qual.     Err.    Qual.     Err.    Qual.     Err.    Qual.
  2       0.782    0.522    1.3853   0.557    0.3463   0.989    0.0578   0.999
  4       0.704    0.489    1.1257   0.516    0.3856   0.980    0.0604   0.997
  8       0.707    0.466    0.8653   0.487    0.4345   0.946    0.0648   0.996
 16       0.685    0.448    0.6373   0.458    0.4683   0.885    0.0725   0.990
 32       0.670    0.447    0.4555   0.453    0.4428   0.748    0.0867   0.980

Table 4.2: Quality comparison (minimum of 10 executions for each instance) of the Karp-Sipser heuristic and TwoSidedMatch on matrices described in Fig. 4.3, with n = 3200 and h ∈ {2, 4, 8, 16, 32}.

The results of the experiments are in Table 4.2. The first column shows the h value. Then, the matching quality obtained by Karp-Sipser and by TwoSidedMatch with different numbers of scaling iterations (0, 1, 5, 10), as well as the scaling error, are given. The scaling error is the maximum difference between 1 and each row/column sum of the scaled matrix (for 0 iterations, it is equal to n − 1 for all cases). The quality of a matching is computed by dividing its cardinality by the maximum one, which is n = 3200 for these experiments. To obtain the values in each cell of the table, we run the programs 10 times and give the minimum quality (as we are investigating the worst-case behavior). The highest variances for Karp-Sipser and TwoSidedMatch were (up to four significant digits) 0.0041 and 0.0001, respectively. As expected, when h increases, Karp-Sipser performs worse, and the matching quality drops to 0.67 for h = 32. TwoSidedMatch's performance increases with the number of scaling iterations. As the experiment shows, only 5 scaling iterations are sufficient to make the proposed two-sided matching heuristic significantly better than Karp-Sipser. However, this number was not enough to reach 0.866 for the matrix with h = 32. On this matrix, with 10 iterations, only 2% of the rows/columns remain unmatched.

Matching quality on bipartite graphs without perfect matchings. We analyze the proposed heuristics on a class of random sprank-deficient square (n = 100000) and rectangular (m = 100000 and n = 120000) matrices with a uniform nonzero distribution (two more sprank-deficient matrices are used in the scalability tests as well). These matrices are generated by Matlab's sprand command (generating Erdős-Rényi random matrices [85]). The total number of nonzeros is set to be around d × 100000 for d ∈ {2, 3, 4, 5}. The top half of Table 4.3 presents the results of this experiment with square matrices, and the bottom half presents the results with the rectangular ones.

As in the previous experiments, the matching qualities in the table are the minimum of 10 executions for the corresponding instances. As Table 4.3 shows, when the deficiency is high (correlated with small d), it is easier for our algorithms to approximate the maximum cardinality. However, when d gets larger, the algorithms require more scaling iterations. Even in this case, 5 iterations are sufficient to achieve the guaranteed qualities. In the square case, the minimum qualities achieved by OneSidedMatch and TwoSidedMatch were 0.701 and 0.873. In the rectangular case, the minimum qualities achieved by OneSidedMatch and TwoSidedMatch were 0.781 and 0.930, respectively, with 5 scaling iterations. In all cases, increased scaling iterations result in higher quality matchings, in accordance with the previous results on square matrices with perfect matchings.

                                One-     Two-                         One-     Two-
                                Sided    Sided                        Sided    Sided
 d  iter  sprank                Match    Match   d  iter  sprank      Match    Match

Square 100000 × 100000
 2   0    78225                 0.770    0.912   4   0    97787       0.644    0.838
 2   1    78225                 0.797    0.917   4   1    97787       0.673    0.848
 2   5    78225                 0.850    0.939   4   5    97787       0.719    0.873
 2  10    78225                 0.879    0.954   4  10    97787       0.740    0.886
 3   0    92786                 0.673    0.851   5   0    99223       0.635    0.840
 3   1    92786                 0.703    0.857   5   1    99223       0.662    0.851
 3   5    92786                 0.756    0.884   5   5    99223       0.701    0.873
 3  10    92786                 0.784    0.902   5  10    99223       0.716    0.882

Rectangular 100000 × 120000
 2   0    87373                 0.793    0.912   4   0    99115       0.729    0.899
 2   1    87373                 0.815    0.918   4   1    99115       0.754    0.910
 2   5    87373                 0.861    0.939   4   5    99115       0.792    0.933
 2  10    87373                 0.886    0.955   4  10    99115       0.811    0.946
 3   0    96564                 0.739    0.896   5   0    99761       0.725    0.905
 3   1    96564                 0.769    0.904   5   1    99761       0.749    0.917
 3   5    96564                 0.813    0.930   5   5    99761       0.781    0.936
 3  10    96564                 0.836    0.945   5  10    99761       0.792    0.943

Table 4.3: Matching qualities of the proposed heuristics on random sparse matrices with uniform nonzero distribution. Square matrices in the top half, rectangular matrices in the bottom half; d: average number of nonzeros per row.

Comparisons with some other codes

We run Azad et al.'s [16] variant of Karp-Sipser (ksV), the two greedy matching heuristics of Blelloch et al. [29] (inc and nd, which are referred to as "incremental" and "non-deterministic reservation" in the original paper), and the proposed OneSidedMatch and TwoSidedMatch heuristics, with one and 16 threads, on the matrices given in Table 4.1. We also add one more matrix, wc_50K_64, from the family shown in Fig. 4.3, with n = 50000 and h = 64.


Run time in seconds
                        One thread                              16 threads
Name             ksV      inc      nd    1SD    2SD      ksV      inc     nd    1SD    2SD
atmosmodl       0.20    14.80    0.07   0.12   0.30     0.03     1.63   0.01   0.01   0.03
audikw_1        0.98   206.00    0.22   0.55   0.82     0.10    16.90   0.02   0.06   0.09
cage15          1.54     9.96    0.39   0.92   1.95     0.29     1.10   0.04   0.09   0.19
channel         1.29   105.00    0.30   0.82   1.46     0.14     9.07   0.03   0.08   0.13
europe_osm     12.03    51.80    2.06   4.89  11.97     1.20     7.12   0.21   0.46   1.05
Hamrle3         0.15     0.12    0.04   0.09   0.25     0.02     0.02   0.01   0.01   0.02
hugebubbles     8.36     4.65    1.28   3.92  10.04     0.95     0.53   0.18   0.37   0.94
kkt_power       0.55     0.68    0.09   0.19   0.45     0.08     0.16   0.01   0.02   0.05
nlpkkt240      10.88  1900.00    2.56   5.56  10.11     1.15   237.00   0.23   0.58   1.06
road_usa        6.23     3.57    0.96   2.16   5.42     0.70     0.38   0.09   0.22   0.50
torso1          0.11     7.28    0.02   0.06   0.09     0.02     1.20   0.00   0.01   0.01
venturiLevel3   0.45     2.72    0.13   0.32   0.84     0.08     0.29   0.01   0.04   0.09
wc_50K_64       7.21  3200.00    1.46   4.39   5.80     1.17   402.00   0.12   0.55   0.59

Quality of the obtained matching
Name             ksV      inc      nd    1SD    2SD      ksV      inc     nd    1SD    2SD
atmosmodl       1.00     1.00    1.00   0.66   0.87     0.98     1.00   0.97   0.66   0.87
audikw_1        1.00     1.00    1.00   0.64   0.87     0.95     1.00   0.97   0.64   0.87
cage15          1.00     1.00    1.00   0.64   0.87     0.95     1.00   0.91   0.64   0.87
channel         1.00     1.00    1.00   0.64   0.87     0.99     1.00   0.99   0.64   0.87
europe_osm      0.98     0.95    0.95   0.75   0.89     0.98     0.95   0.94   0.75   0.89
Hamrle3         1.00     0.84    0.84   0.75   0.90     0.97     0.84   0.86   0.75   0.90
hugebubbles     0.99     0.96    0.96   0.70   0.89     0.96     0.96   0.90   0.70   0.89
kkt_power       0.85     0.79    0.79   0.78   0.89     0.87     0.79   0.86   0.78   0.89
nlpkkt240       0.99     0.99    0.99   0.64   0.86     0.99     0.99   0.99   0.64   0.86
road_usa        0.94     0.81    0.81   0.76   0.88     0.94     0.81   0.81   0.76   0.88
torso1          1.00     1.00    1.00   0.65   0.88     0.98     1.00   0.98   0.65   0.87
venturiLevel3   1.00     1.00    1.00   0.68   0.88     0.97     1.00   0.96   0.68   0.88
wc_50K_64       0.50     0.50    0.50   0.81   0.98     0.70     0.50   0.51   0.81   0.98

Table 4.4: Run time and quality of the different heuristics. OneSidedMatch and TwoSidedMatch are labeled with 1SD and 2SD, respectively, and apply two steps of scaling iterations. The heuristic ksV is from Azad et al. [16], and the heuristics inc and nd are from Blelloch et al. [29].

The OneSidedMatch and TwoSidedMatch heuristics are run with two scaling iterations, and the reported run time includes all components of the heuristics. The inc and nd heuristics are compiled with g++ 4.4.5, and ksV is compiled with gcc 4.4.5. For all the heuristics, we used the -O2 optimization level and the appropriate OpenMP flag for compilation. The experiments were carried out on a machine equipped with two Intel Sandybridge-EP CPUs clocked at 2.00 GHz and 256 GB of memory split across the two NUMA domains. Each CPU has eight cores (16 cores in total), and HyperThreading is enabled. Each core has its own 32 kB L1 cache and 256 kB L2 cache. The 8 cores on a CPU share a 20 MB L3 cache. The machine runs 64-bit Debian with Linux 2.6.39-bpo.2-amd64. The run times of all heuristics (in seconds) and their matching qualities are given in Table 4.4.


As seen in Table 4.4, nd is the fastest of the tested heuristics, with both one thread and 16 threads, on this data set. The second fastest heuristic is OneSidedMatch. The heuristic ksV is faster than TwoSidedMatch on six instances with one thread. With 16 threads, TwoSidedMatch is almost always faster than ksV; both heuristics are quite efficient and have a maximum run time of 1.17 seconds. The heuristic inc was observed to have large run times on some matrices with relatively many nonzeros per row. All the existing heuristics are always very efficient, except inc, which had difficulties on some matrices in the data set. In the original reference [29], inc is quite efficient for some other matrices (where the codes are compiled with Cilk). We do not see run times with nd elsewhere, but in our data set it was always better than inc. The proposed OneSidedMatch heuristic's quality is almost always close to its theoretical limit. TwoSidedMatch also obtains quality results close to its theoretical limit. Looking at the quality of the matchings with one thread, we see that ksV obtains (almost always) the best score, except on kkt_power and wc_50K_64, where TwoSidedMatch obtains the best score. The heuristics inc and nd obtain the same score with one thread, but with 16 threads, inc obtains better quality than nd almost always. OneSidedMatch never obtains the best score (TwoSidedMatch is always better), yet it is better than ksV, inc, and nd on the synthetic wc_50K_64 matrix. The TwoSidedMatch heuristic obtains better results than inc and nd on Hamrle3, kkt_power, road_usa, and wc_50K_64, with both one thread and 16 threads. Also, on hugebubbles with 16 threads, the difference between TwoSidedMatch and nd is 1% (in favor of nd).

4.3 Approximation algorithms for general undirected graphs

Recall that a graph Gprime = (V prime Eprime) is a subgraph of G = (VE) if V prime sube Vand Eprime sube E Let Gkout = (VEprime) be a random subgraph of G where eachvertex in V randomly chooses k of its edges with repetition (an edge u vis included in Gkout only once even if it is chosen multiple times) We callGkout as a k-out subgraph of G

A permutation of [1 n] in which no element stays in its original po-

76 CHAPTER 4 MATCHINGS IN GRAPHS

sition is called derangement The total number of derangements of [1 n]is denoted by n and is the nearest integer to n

e [36 pp 40ndash42] where e isthe base of the natural logarithm

4.3.1 Related work

The first polynomial time algorithm, with O(n²τ) complexity, for the maximum cardinality matching problem on general graphs with n vertices and τ edges was proposed by Edmonds [79]. Currently, the fastest algorithms have O(√n τ) complexity [30, 92, 174]. The first of these algorithms is by Micali and Vazirani [174]; recently, a simpler analysis of it was given by Vazirani [210]. Later, Blum [30] and Gabow and Tarjan [92] proposed algorithms with the same complexity.

Random graphs have been frequently analyzed in terms of their maximum matching cardinality. Frieze [90] shows results for undirected (general) graphs along the lines of Walkup's results for bipartite graphs [211]. In particular, he shows that if each vertex chooses uniformly at random one neighbor (among all vertices), then the constructed graph does not have a perfect matching almost always. In other words, he shows that random 1-out subgraphs of a complete graph do not have perfect matchings. Luckily, as in bipartite graphs, if each vertex chooses two neighbors, then the constructed graph has perfect matchings almost always. That is, random 2-out subgraphs of a complete graph have perfect matchings. Our contributions are about making these statements valid for any host graph (assuming the adjacency matrix has total support) and measuring the approximation guarantee of the 1-out case.

Randomized algorithms which check the existence of a perfect matching and generate perfect/maximum matchings have been proposed in the literature [45, 120, 165, 177, 178]. Lovász showed that the existence of a perfect matching can be verified in randomized O(n^ω) time, where O(n^ω) is the time complexity of the fastest matrix multiplication algorithm available [165]. More information, such as the set of allowed edges, i.e., the edges in some maximum matching of a graph, can also be found with the same randomized complexity, as shown by Cheriyan [45].

To construct a maximum cardinality matching in a general, non-bipartite graph, a simple, easy to implement algorithm with O(n^{ω+1}) randomized complexity was proposed by Rabin and Vazirani [197]. The algorithm uses matrix inversion as a subroutine. Later, Mucha and Sankowski proposed the first algorithm for the same problem with O(n^ω) randomized complexity [178]. This algorithm uses expensive subroutines, such as Gaussian elimination and equivalence classes formed based on the edges that appear in some perfect matching [166]. A simpler randomized algorithm with the same complexity was proposed by Harvey [108].


4.3.2 One-Out, its analysis, and variants

We first present the main heuristic and then analyze its approximation guarantee. While the heuristic is a straightforward adaptation of its counterpart for bipartite graphs [76], the analysis is more complicated because of odd cycles. The analysis shows that random 1-out subgraphs of a given graph have a maximum cardinality of a matching around 0.866 − log(n)/n of the best possible; the observed performance (see Section 4.3.3) is higher. Then, two variants of the main heuristic are discussed without proofs. They extend the 1-out subgraphs obtained by the first one, and hence deliver better results theoretically.

The main heuristic One-Out

The heuristic, shown in Algorithm 9, first scales the adjacency matrix of a given undirected graph to be doubly stochastic. Based on the values in the matrix, the heuristic then randomly marks one of the neighbors of each vertex as chosen. This defines one marked edge per vertex. Then, the subgraph of the original graph containing only the marked edges is formed, yielding an undirected graph with at most n edges, each vertex having at least one edge incident on it. The Karp-Sipser heuristic is run on this subgraph and finds a maximum cardinality matching on it, as the resulting 1-out graph has unicyclic components. The approximation guarantee of the proposed heuristic is analyzed for this step. One practical improvement over the previous work is to make sure that the matching obtained is a maximal one, by running Karp-Sipser on the vertices that are not matched. This brings a large improvement to the cardinality of the matching in practice.

Let us classify the edges incident on a vertex v_i as in-edges (from those neighbors that have chosen i at Line 4) and an out-edge (to the neighbor chosen by i at Line 4). Dufossé et al. [76, Lemma 3] show that, during any execution of Karp-Sipser, it suffices to consider the vertices whose in-edges are unavailable but whose out-edges are available as degree-1 vertices.

The analysis traces an execution of Karp-Sipser on the subgraph G_{1-out}. In other words, the analysis can also be perceived as the analysis of the Karp-Sipser heuristic's performance on random 1-out graphs. Let A1 be the set of vertices not chosen by any other vertex at Line 4 of Algorithm 9. These vertices have in-degree zero and out-degree one, and hence can be processed by Karp-Sipser. Let B1 be the set of vertices chosen by the vertices in A1. The vertices in B1 can be perfectly matched with vertices in A1, leaving some A1 vertices not matched and creating some new in-degree-0 vertices. We can proceed to define A2 to be the vertices that have in-degree 0 in V \ (A1 ∪ B1), and define B2 as those chosen by A2, and so on. Formally, let B0 be an empty set, and define A_i to be the set of vertices with in-degree 0 in V \ B_{i−1}, and B_i to be the vertices chosen by those in A_i, for i ≥ 1.


Algorithm 9: One-Out (undirected graphs)

Data: G = (V, E) and its adjacency matrix A
Result: match[·]: the matching

1: D ← SymScale(A)
2: for i = 1 to n in parallel do
3:    Pick a random j ∈ A_{i*} by using the probability density function s_{ik} / Σ_{t∈A_{i*}} s_{it} for all k ∈ A_{i*}, where s_{ik} = d[i] × d[k] is the corresponding entry in the scaled matrix S = DAD
4:    Mark that i chooses j
5: Construct a graph G_{1-out} = (V, E′), where
      V = {1, . . . , n} and E′ = {(i, j) : i chose j or j chose i}
6: match ← KarpSipserOne-Out(G_{1-out})
7: Make match a maximal matching

Notice that A_i ⊆ A_{i+1} and B_i ⊆ B_{i+1}. The Karp-Sipser heuristic can process A1, then A2 \ A1, and so on, until the remaining graph has cycles only. The sets A_i and B_i and their cardinalities are at the core of our analysis. We first present some facts about these sets and their cardinalities, and describe an implementation of Karp-Sipser instrumented to highlight them.

Lemma 4.5 With the definitions above, A_i ∩ B_i = ∅.

Proof. We prove this by induction. For i = 1, it clearly holds. Assume that it holds for all i < ℓ. Suppose there exists a vertex u ∈ A_ℓ ∩ B_ℓ. Because A_{ℓ−1} ∩ B_{ℓ−1} = ∅, u must necessarily belong to both A_ℓ \ A_{ℓ−1} and B_ℓ \ B_{ℓ−1}. For u to be in B_ℓ \ B_{ℓ−1}, there must exist at least one vertex v ∈ A_ℓ \ A_{ℓ−1} such that v chooses u. However, the condition for u ∈ A_ℓ is that no vertex in V \ (A_{ℓ−1} ∪ B_{ℓ−1}) has selected it. This is a contradiction, and the intersection A_ℓ ∩ B_ℓ must be empty.

Corollary 4.1 A_i ∩ B_j = ∅ for i ≤ j.

Proof. Assume A_i ∩ B_j ≠ ∅. Since A_i ⊆ A_j, we have a contradiction, as A_j ∩ B_j = ∅ by Lemma 4.5.

Thanks to Lemma 4.5 and Corollary 4.1, the sets A_i and B_i are disjoint, and they form a bipartite subgraph of G_{1-out}, for all i = 1, . . . , ℓ.

The proposed version, Karp-SipserOne-Out, of the Karp-Sipser heuristic for 1-out graphs is shown in Algorithm 10. The degree-1 vertices are kept in a first-in first-out priority queue Q. The queue is first initialized with A1, and a special marker is used to mark the end of A1. Then, all vertices in A1 are matched to some other vertices, defining B1. When we remove two matched vertices from the graph G_{1-out} at Lines 29 and 40, we update the degrees of their remaining neighbors and append the vertices which have degrees of one to the queue. During Phase 1 of Karp-SipserOne-Out, we also maintain the sets of A_i and B_i vertices, while storing only the last ones. A_ℓ and B_ℓ are returned, along with the number ℓ of levels, which is computed thanks to the use of the marker.

Apart from the scaling step, the proposed heuristic in Algorithm 9 has a linear worst-case time complexity of O(n + τ). The scaling step, if applied until convergence, can take more time than that. We do not suggest running it until convergence; for all practical purposes, 5 or 10 iterations of the basic method [141], or even fewer of the Newton iterations [140], seem sufficient (see the experiments). Therefore, the practical run time of the algorithm is linear.
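For illustration, a fixed-iteration sketch of the symmetric scaling step (the role played by SymScale at line 1 of Algorithm 9) is given below. It assumes a symmetric nonnegative adjacency matrix with no empty rows; the function name is ours.

import numpy as np

def sym_scale(A, num_iters=10):
    # Returns d such that diag(d) A diag(d) has row (and column) sums
    # close to one; a few iterations suffice for the heuristics.
    d = np.ones(A.shape[0])
    for _ in range(num_iters):
        r = d * (A @ d)          # current row sums of diag(d) A diag(d)
        d = d / np.sqrt(r)       # symmetric update: d <- sqrt(d / (A d))
    return d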

Analysis of One-Out

Let a_i and b_i be two random variables representing the cardinalities of A_i and B_i, respectively, in an execution of Karp-SipserOne-Out on a random 1-out graph. Then, Karp-SipserOne-Out matches b_ℓ edges in the first phase and leaves a_ℓ − b_ℓ vertices unmatched. What remains after the first phase is a set of cycles. In the bipartite graph case [76], all vertices in those cycles are matchable, and hence the cardinality of the matching was measured by n − a_ℓ + b_ℓ. Since we can possibly have odd cycles after the first phase, we cannot match all remaining vertices in the general case of undirected graphs. Let c be a random variable representing the number of odd cycles after the first phase of Karp-SipserOne-Out. Then, we have the following (obvious) lemma about the approximation guarantee of Algorithm 9.

Lemma 4.6 At the end of execution, the number of unmatched vertices is a_ℓ − b_ℓ + c. Hence, Algorithm 9 matches at least n − (a_ℓ − b_ℓ + c) vertices.

We need to quantify a_ℓ − b_ℓ and c in Lemma 4.6. We state an upper bound on a_ℓ − b_ℓ in Lemma 4.7 and an upper bound on c in Lemma 4.8, without any proof (the reader can find the proofs in the original paper [73]). While the bound for a_ℓ − b_ℓ holds for any graph with total support presented to Algorithm 9, the bounds for c are shown for random 1-out graphs of complete graphs. By plugging in the bounds for these quantities, we obtain the following theorem on the approximation guarantee of Algorithm 9.

Theorem 4.3 Algorithm 9 obtains a matching with cardinality at least 0.866 − ⌈1.04 log(0.336n)⌉/n in expectation, when the input is a complete graph.


Algorithm 10: Karp-SipserOne-Out (undirected graphs)

Data: G_{1-out} = (V, E)
Result: match[·]: the mates of vertices
Result: ℓ: the number of levels in the first phase
Result: A: the set of degree-1 vertices in the first phase
Result: B: the set of vertices matched to A vertices

1: match[u] ← NIL for all u
2: Q ← {v : deg(v) = 1}              ▷ degree-1, or in-degree 0
3: if Q = ∅ then
4:    ℓ ← 0                           ▷ no vertex in level 1
5: else
6:    ℓ ← 1
7: Enqueue(Q, marker)                 ▷ marks the end of the first level
8: Phase-1 ← ongoing
9: A ← B ← ∅
10: while true do
11:    while Q ≠ ∅ do
12:       u ← Dequeue(Q)
13:       if u = marker and Q = ∅ then
14:          break the while-Q loop
15:       else if u = marker then
16:          ℓ ← ℓ + 1
17:          Enqueue(Q, marker)        ▷ new level formed
18:          skip to the next while-Q iteration
19:       if match[u] ≠ NIL then
20:          skip to the next while-Q iteration
21:       for v ∈ adj(u) do
22:          if match[v] = NIL then
23:             match[u] ← v
24:             match[v] ← u
25:             if Phase-1 = ongoing then
26:                A ← A ∪ {u}
27:                B ← B ∪ {v}
28:             N ← adj(v)
29:             G_{1-out} ← G_{1-out} \ {u, v}
30:             Enqueue(Q, w) for w ∈ N with deg(w) = 1
31:             break the for-v loop
32:       if Phase-1 = ongoing and match[u] = NIL then
33:          A ← A ∪ {u}               ▷ u cannot be matched
34:    Phase-1 ← done
35:    if E ≠ ∅ then
36:       pick a random edge (u, v)
37:       match[u] ← v
38:       match[v] ← u
39:       N ← adj({u, v})
40:       G_{1-out} ← G_{1-out} \ {u, v}
41:       Enqueue(Q, w) for w ∈ N with deg(w) = 1
42:    else
43:       break the while-true loop


The theorem can also be interpreted more theoretically: a random 1-out graph has a maximum cardinality of a matching at least 0.866 − ⌈1.04 log(0.336n)⌉/n in expectation. The bound is in the close vicinity of 0.866, which was the proved bound for the bipartite case [76]. We note that, in deriving this bound, we assumed a random 1-out graph (as opposed to a random 1-out subgraph of a given graph) only at a single step. We leave the extension to this latter case as future work and present experiments suggesting that the bound is also achieved for this case. In Section 4.3.3, we empirically show that the same bound also holds for graphs whose corresponding matrices do not have total support.

In order to measure a_ℓ − b_ℓ, we adapted a proof from earlier work [76], which was inspired by Karp and Sipser's analysis of the first phase of their heuristic [128], and obtained the following lemma.

Lemma 4.7 a_ℓ − b_ℓ ≤ (2Ω − 1)n, where Ω ≈ 0.567 equals W(1) of Lambert's W function.

The lemma also reads as a_ℓ − b_ℓ ≤ Σ_{i=1}^{n} (2 · 0.567 − 1) ≤ 0.134 · n.

In order to achieve the result of Theorem 4.3, we need to bound c, the number of odd cycles that remain in a G_{1-out} graph after the first phase of Karp-SipserOne-Out. We were able to bound c on random 1-out graphs (that is, we did not show it for a general graph).

Lemma 4.8 The expected number c of cycles remaining after the first phase of Karp-SipserOne-Out in random 1-out graphs is less than or equal to ⌈1.04 log(n − a_ℓ − b_ℓ)⌉, which is also less than or equal to ⌈1.04 log(0.336n)⌉.

Variants

Here, we summarize two related theoretical random bipartite graph models that we adapt to the undirected case, using similar algorithms. The presentation is brief and without proofs; we will present experiments in Section 4.3.3.

The random (1 + e^{−1})-out undirected graph model (see the bipartite version [125] summarized in Section 4.2.1) lets each vertex choose a random neighbor. Then, the vertices that have not been chosen select one more neighbor. The maximum cardinality of a matching in the subgraph consisting of the identified edges can be computed as an approximate matching in the original graph. As this heuristic uses a richer graph structure than 1-out, we expect perfect or near perfect matchings in the general case.

A model richer in edges is the random 2-out graph model. In this model, each vertex chooses two neighbors. There are two enticing characteristics of this model in the uniform bipartite case. First, the probability of having a perfect matching goes to one with the increasing number of vertices [211]. Second, there is a special algorithm for computing the maximum cardinality matchings in these (bipartite) graphs [127], with high probability, in linear time in expectation.
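The k-out selection, with scaled probabilities as in One-Out, can be sketched as follows for k = 2 (an illustrative Python sketch; S stands for a symmetrically scaled adjacency matrix, and the function name is ours).

import numpy as np

def k_out_subgraph(S, k=2, rng=np.random.default_rng()):
    # Random k-out subgraph: every vertex picks k neighbors (with repetition)
    # using the scaled probabilities; repeated picks yield a single edge.
    n = S.shape[0]
    edges = set()
    for u in range(n):
        nbrs = np.flatnonzero(S[u])
        p = S[u, nbrs] / S[u, nbrs].sum()
        for v in rng.choice(nbrs, size=k, p=p):
            edges.add((min(u, int(v)), max(u, int(v))))
    return edges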


4.3.3 Experiments

To understand the effectiveness and efficiency of the proposed heuristics in practice, we report the matching quality and various statistics regarding the sets that appear in our analyses in Section 4.3.2. For the experiments, we used three graph datasets. The first set is generated with matrices from the SuiteSparse Matrix Collection [59]. We investigated all n × n matrices from the collection with 50000 ≤ n ≤ 100000. For a matrix from this set, we removed the diagonal entries, symmetrized the pattern of the resulting matrix, and discarded the matrix if it had empty rows. There were 115 matrices at the end, which we used as adjacency matrices. The graphs in the second dataset are synthetically generated to make Karp-Sipser deviate from the optimal matching as much as possible. This dataset contains five graphs with different hardness levels for Karp-Sipser. The third set contains five large real-life matrices from the SuiteSparse collection, for measuring the run time efficiency of the proposed heuristics.

A comprehensive evaluation

We use MATLAB to measure the matching qualities based on the first two datasets. For each matrix, five runs are performed with each randomized matching heuristic, and the average is reported. One, five, and ten iterations are performed to evaluate the impact of the scaling method.

Table 4.5 summarizes the quality of the matchings for all the experiments on the first dataset. The matching quality is measured as the ratio of the matching cardinality to the maximum cardinality of a matching in the original graph. The table presents statistics for matching qualities of Karp-Sipser performed on the original graphs (first row), 1-out graphs (the second set of rows), Karoński-Pittel-like (1 + e^{−1})-out graphs (the third set of rows), and 2-out graphs (the last set of rows).

For the U-Max rows, we construct k-out graphs by using uniform probabilities while selecting the neighbors, as proposed in the literature [90, 125]. We compare the cardinality of the maximum matchings in these k-out graphs to the maximum matching cardinality in the original graphs and report the statistics. The rows St-Max report the same statistics for the k-out graphs constructed by using probabilities with t ∈ {1, 5, 10} scaling iterations. These statistics serve as upper bounds on the matching qualities of the proposed St-KS heuristics, which execute Karp-Sipser on the k-out graphs obtained with t scaling iterations. Since the St-KS heuristics use only a subgraph, the matchings they obtain are not maximal with respect to the original edge set. The proposed St-KS+ heuristics exploit this fact and apply another Karp-Sipser phase on the subgraph containing only the unmatched vertices, to improve the quality of the matchings. The table does not contain St-Max rows for 1-out graphs, since Karp-Sipser is an optimal algorithm for these subgraphs.

Alg           Min     Max     Avg     GMean   Med     StDev
Karp-Sipser   0.880   1.000   0.980   0.980   0.988   0.022

1-out
 U-Max        0.168   1.000   0.846   0.837   0.858   0.091
 S1-KS        0.479   1.000   0.869   0.866   0.863   0.059
 S5-KS        0.839   1.000   0.885   0.884   0.865   0.044
 S10-KS       0.858   1.000   0.889   0.888   0.866   0.045
 S1-KS+       0.836   1.000   0.951   0.950   0.953   0.043
 S5-KS+       0.865   1.000   0.958   0.957   0.968   0.036
 S10-KS+      0.888   1.000   0.961   0.961   0.971   0.033

(1 + e^{−1})-out
 U-Max        0.251   1.000   0.952   0.945   0.967   0.081
 S1-Max       0.642   1.000   0.967   0.966   0.980   0.042
 S5-Max       0.918   1.000   0.977   0.977   0.985   0.020
 S10-Max      0.934   1.000   0.980   0.979   0.985   0.018
 S1-KS        0.642   1.000   0.963   0.962   0.972   0.041
 S5-KS        0.918   1.000   0.972   0.972   0.976   0.020
 S10-KS       0.934   1.000   0.975   0.975   0.977   0.018
 S1-KS+       0.857   1.000   0.972   0.972   0.979   0.025
 S5-KS+       0.925   1.000   0.978   0.978   0.984   0.018
 S10-KS+      0.939   1.000   0.980   0.980   0.985   0.016

2-out
 U-Max        0.254   1.000   0.972   0.966   0.996   0.079
 S1-Max       0.652   1.000   0.987   0.986   0.999   0.036
 S5-Max       0.952   1.000   0.995   0.995   0.999   0.009
 S10-Max      0.968   1.000   0.996   0.996   1.000   0.007
 S1-KS        0.651   1.000   0.974   0.974   0.981   0.035
 S5-KS        0.945   1.000   0.982   0.982   0.984   0.013
 S10-KS       0.947   1.000   0.984   0.983   0.984   0.012
 S1-KS+       0.860   1.000   0.980   0.979   0.984   0.020
 S5-KS+       0.950   1.000   0.984   0.984   0.987   0.012
 S10-KS+      0.952   1.000   0.986   0.986   0.987   0.011

Table 4.5: For each matrix in the first dataset and each proposed heuristic, five runs are performed. The statistics are computed over the means of these results; the minimum, maximum, arithmetic and geometric averages, median, and standard deviation are reported.

Figure 4.4: The matching qualities sorted in increasing order for Karp-Sipser on the original graph, S5-KS, and S5-KS+ on (a) 1-out and (b) 2-out graphs. The figures also contain the quality of the maximum cardinality matchings in these graphs. The experiments are performed on the 115 graphs in the first dataset.


As Table 4.5 shows, more scaling iterations increase the maximum matching cardinalities on k-out graphs. Although this is much clearer when the worst-case graphs are considered, it can also be observed for the arithmetic and geometric means. Since U-Max is the no-scaling case, the impact of the first scaling iteration (S1-KS vs. U-Max) is significant. On the other hand, the difference in matching quality between S5-KS and S10-KS is minor. Hence, five scaling iterations are deemed sufficient for the proposed heuristics in practice.

As the theory suggests, the heuristics St-KS perform well for (1 + e^{-1})-out and 2-out graphs. With t ∈ {5, 10}, their quality is almost on par with Karp-Sipser on the original graph, and even better for 2-out graphs. In addition, applying Karp-Sipser on the subgraph of unmatched vertices to obtain a maximal matching does not increase the matching quality much. Since this subgraph is small, the overhead of this extra work is not significant. Furthermore, this extra step significantly improves the matching quality for 1-out graphs, which a.a.s. do not have a perfect matching.

To better understand the practical performance of the proposed heuristics and the impact of the additional Karp-Sipser execution, we profile their performance by sorting their matching qualities in increasing order for all 115 matrices. Figure 4.4 plots these profiles on 1-out and 2-out graphs for the heuristics with five scaling iterations. As Figure 4.4a shows, five iterations are sufficient to obtain 0.86 matching quality except in 17 of the 1-out experiments. The figure also shows that the maximum matching cardinality in a random 1-out graph is worse than what Karp-Sipser can obtain on the original graph.

Figure 4.5: The cumulative distribution of the ratio of the number of odd cycles remaining after the first phase of Karp-Sipser in One-Out to the number of vertices in the graph.

This is why, although S5-KS finds maximum matchings on 1-out graphs, its performance is still worse than that of Karp-Sipser. The additional Karp-Sipser in S5-KS+ closes almost all of this gap and makes the matching qualities close to those of Karp-Sipser. On the contrary, for 2-out graphs generated with five scaling iterations, the maximum matching cardinality is larger than the cardinality of the matchings found by Karp-Sipser. There is still a gap between the best possible (red line) and what Karp-Sipser can find (blue line) on 2-out graphs. We believe that this gap can be targeted by specialized, efficient exact matching algorithms for 2-out graphs.

There are 30 rank-deficient matrices without total support among the 115 matrices in the first dataset. We observed that even for 1-out graphs, the worst-case quality for S5-KS is 0.86 and the average is 0.93. Hence, the proposed approach also works well for rank-deficient matrices/graphs in practice.

Since the number c of odd cycles at the end of the first phase of Karp-Sipser is a performance measure, we investigate it. For each matrix, we compute the ratio c/n. We then plot the cumulative distribution of these values in Fig. 4.5. For all the experiments except one, this ratio is less than 1%. For the extreme case, the ratio increases to 8%. In that particular case, the matching found in the 1-out subgraph is maximum for the input graph (i.e., the number of odd components is also large in the original graph).

Hard instances for Karp-Sipser

Let A be an n × n symmetric matrix, V1 be the set of A's first n/2 vertices, and V2 be the set of A's last n/2 vertices. As Figure 4.6 shows, A has a full V1 × V1 block ignoring the diagonal (i.e., a clique) and an empty V2 × V2 block. The last h ≪ n vertices of V1 are connected to all the vertices in the corresponding graph. The block V1 × V2 has a nonzero diagonal; hence the corresponding graph has a perfect matching. Note that the same instances are used for the bipartite case.
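A small sketch that builds the adjacency matrix of these hard instances (0/1 values, 0-based indices; the function name is illustrative):

import numpy as np

def hard_instance(n, h):
    """Vertices 0..n/2-1 form V1, vertices n/2..n-1 form V2.  V1 x V1 is a
    clique (full block without the diagonal), V2 x V2 is empty, V1 x V2 has
    a nonzero diagonal (a perfect matching), and the last h vertices of V1
    are connected to every vertex."""
    m = n // 2
    A = np.zeros((n, n), dtype=int)
    A[:m, :m] = 1 - np.eye(m, dtype=int)        # clique on V1
    A[np.arange(m), m + np.arange(m)] = 1        # diagonal of the V1 x V2 block
    A[m + np.arange(m), np.arange(m)] = 1        # symmetric counterpart
    A[m - h:m, :] = 1                            # last h vertices of V1: all neighbors
    A[:, m - h:m] = 1                            # keep the matrix symmetric
    np.fill_diagonal(A, 0)                       # no self loops
    return A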

On such a matrix with h = 0, Karp-Sipser consumes the whole graph during Phase 1 and finds a maximum cardinality matching.

Figure 4.6: Adjacency matrices of hard undirected graph instances for Karp-Sipser.

                          One-Out
                 5 iters          10 iters         20 iters
  h   Karp-Sipser  error quality   error quality    error quality
  2      0.96       7.54   0.99     0.68   0.99      0.22   1.00
  8      0.78       8.52   0.97     0.78   0.99      0.23   0.99
 32      0.68       6.65   0.81     1.09   0.99      0.26   0.99
128      0.63       3.32   0.53     1.89   0.90      0.33   0.98
512      0.63       1.24   0.55     1.17   0.59      0.61   0.86

Table 4.6: Results for the hard instances with n = 5000 and different h values. One-Out is executed with 5, 10, and 20 scaling iterations, and the scaling errors are also reported. Averages of five runs are reported for each cell.

When h > 1, Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block V1 × V1 forms a clique, it is more likely that the nonzero will be chosen from this block. Thus, a vertex in V1 will be matched with another vertex from V1, which is a bad decision since the block V2 × V2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that the proposed heuristics choose an edge from that block goes to zero, as those entries cannot be in a perfect matching.

Table 4.6 shows that the quality of Karp-Sipser drops to 0.63 as h increases. In comparison, the One-Out heuristic with five scaling iterations maintains a good quality for small h values. However, a better scaling with more iterations (10 and 20 for h = 128 and h = 512, respectively) is required to guarantee the desired matching quality; see the scaling error in the table.

Experiments on large-scale graphs

These experiments are performed on a machine running 64-bit CentOS 6.5, which has 30 cores, each of which is an Intel Xeon CPU E7-4870 v2 core operating at 2.30 GHz.

                                         Karp-Sipser
Matrix             |V| (M)  |E| (M)   Quality   time
cage15               5.2      94.0      1.00     5.82
dielFilterV3real     1.1      88.2      0.99     3.36
hollywood-2009       1.1     112.8      0.93     4.18
nlpkkt240           28.0     746.5      0.98    52.95
rgg_n_2_24_s0       16.8     265.1      0.98    19.49

              One-Out-S5-KS+                              Two-Out-S5-KS+
              Execution time (seconds)                    Execution time (seconds)
Matrix  Quality  Scale  1-out   KS    KS+   Total   Quality  Scale  2-out    KS    KS+   Total
cage      0.93    0.67   0.85  0.65   0.05   2.21     0.99    0.67   1.48    1.78  0.01   3.94
diel      0.98    0.52   0.36  0.07   0.01   0.95     0.99    0.52   0.62    0.16  0.00   1.30
holly     0.91    1.12   0.45  0.06   0.02   1.65     0.95    1.12   0.76    0.13  0.01   2.01
nlpk      0.93    4.44   4.99  3.96   0.27  13.66     0.98    4.44   9.43   10.56  0.11  24.54
rgg       0.93    2.33   2.33  2.08   0.17   6.91     0.98    2.33   4.01    6.70  0.11  13.15

Table 4.7: Summary of the results with five large-scale matrices for the original Karp-Sipser and the proposed One-Out and Two-Out heuristics with five scaling iterations and post-processing for maximal matchings. Matrix names are shortened in the lower half; |V| and |E| are given in millions.

To choose five large-scale matrices from SuiteSparse, we first sorted the pattern-symmetric matrices in the collection in decreasing order of their nonzero counts. We then chose the five largest matrices from different families to increase the variety of the experiments. The details of these matrices are given in Table 4.7. This table also contains the run times and matching qualities of the original Karp-Sipser and of the proposed One-Out and Two-Out heuristics. The proposed heuristics use five scaling iterations and also apply Karp-Sipser at the end to ensure maximality.

The run time of the proposed heuristics is analyzed in four stages in Table 4.7. The Scale stage scales the matrix with five iterations; the k-out stage chooses the neighbors and constructs the k-out subgraph; the KS stage applies Karp-Sipser on the k-out subgraph; and KS+ is the stage for the additional Karp-Sipser at the end. The total time of these four stages is also shown for each instance. The quality results are consistent with the results obtained on the first dataset. As the table shows, the proposed heuristics are faster than the original Karp-Sipser on the input graph. For 1-out graphs, the proposed approach is 2.5–3.9 times faster than Karp-Sipser on the original graph. The speedups are between 1.5–2.6 for 2-out graphs with five iterations.

4.4 Summary, further notes, and references

We investigated two heuristics for the bipartite maximum cardinality matching problem. The first one, OneSidedMatch, is shown to have an


approximation guarantee of 1 − 1/e ≈ 0.632. The second heuristic, TwoSidedMatch, is shown to have an approximation guarantee of 0.866. Both algorithms use well-known methods to scale the sparse matrix associated with the given bipartite graph to a doubly stochastic form, whose entries are used as the probability density functions to randomly select a subset of the edges of the input graph. OneSidedMatch selects exactly n edges to construct a subgraph in which a matching of the guaranteed cardinality is identified with virtually no overhead, both in sequential and parallel execution. TwoSidedMatch selects around 2n edges and then runs the Karp-Sipser heuristic as an exact algorithm on the selected subgraph to obtain a matching of conjectured cardinality. The subgraphs are analyzed to develop a specialized Karp-Sipser algorithm for efficient parallelization. All theoretical investigations are first performed assuming bipartite graphs with perfect matchings and that the scaling algorithms have converged. Then, theoretical arguments and experimental evidence are provided to extend the results to cover other cases and to validate the applicability and practicality of the proposed heuristics in general settings.

The proposed OneSidedMatch and TwoSidedMatch heuristics do not guarantee maximality of the matching found. One can visit the edges of the unmatched row vertices and match them to an available neighbor. We have carried out this greedy post-process and obtained the results shown in Figures 4.7a and 4.7b, in comparison with what was shown in Figures 4.2a and 4.2b. We see that the greedy post-process makes a significant difference for OneSidedMatch: the approximation ratios achieved are well beyond the bound 0.632 and are close to those of TwoSidedMatch. This observation attests to the efficiency of the whole approach.
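A minimal sketch of this greedy post-process, assuming the partial matching is given as a row-to-column array and the graph as a CSR matrix; names are illustrative.

def make_maximal(A_csr, match_row):
    """Greedily extend a partial bipartite matching to a maximal one:
    every unmatched row vertex is matched to its first still-unmatched
    column neighbor, if any.  match_row[i] is the column matched to row i,
    or -1 if row i is unmatched."""
    matched_col = set(j for j in match_row if j >= 0)
    for i in range(A_csr.shape[0]):
        if match_row[i] >= 0:
            continue
        for j in A_csr.indices[A_csr.indptr[i]:A_csr.indptr[i + 1]]:
            if j not in matched_col:
                match_row[i] = j
                matched_col.add(j)
                break
    return match_row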

We also extended the heuristics and their analysis for approximating the maximum cardinality matchings on general (undirected) graphs. Since, in the general case, we have one class of vertices, the 1-out based approach corresponds to the TwoSidedMatch heuristic in the bipartite case. The theoretical analysis, which can be perceived as an analysis of Karp-Sipser on the random 1-out graph model, showed that the approximation guarantee is slightly less than 0.866. The losses with respect to the bipartite case are due to the existence of odd-length cycles. Our experiments suggest that one of the heuristics obtains results as good as Karp-Sipser, while being faster and more reliable.

A more rigorous treatment and elaboration of the variants of the heuristics for general graphs seem worthwhile. Although Karp-Sipser works well for these graphs, we wonder if there are exact linear time algorithms for (1 + e^{-1})-out and 2-out graphs. We also plan to work on parallel implementations of these algorithms.

Figure 4.7: Matching qualities obtained after making the matchings found by (a) OneSidedMatch and (b) TwoSidedMatch maximal, on the test graphs atmosmodl, audikw_1, cage15, channel, europe_osm, Hamrle3, hugebubbles, kkt_power, nlpkkt240, road_usa, torso1, and venturiLevel3. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations (0, 1, and 5). Compare with Figures 4.2a and 4.2b, where maximality of the matchings was not guaranteed.


Chapter 5

Birkhoff-von Neumann decomposition of doubly stochastic matrices

In this chapter, we investigate the Birkhoff-von Neumann (BvN) decomposition described in Section 1.4. The main problem that we address is restated below for convenience.

MinBvNDec: A BvN decomposition with the minimum number of permutation matrices. Given a doubly stochastic matrix A, find a BvN decomposition

    A = α_1 P_1 + α_2 P_2 + · · · + α_k P_k                (5.1)

of A with the minimum number k of permutation matrices.

We first show that the MinBvNDec problem is NP-hard. We then propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries, and use this generalization for designing preconditioners for iterative linear system solvers.

5.1 The minimum number of permutation matrices

We show in this section that the decision version of the problem is NP-complete. We first give some definitions and preliminary results.



5.1.1 Preliminaries

A multi-set can contain duplicate members. Two multi-sets are equivalent if they have the same set of members with the same number of repetitions.

Let A and B be two n × n matrices. We write A ⊆ B to denote that for each nonzero entry of A, the corresponding entry of B is nonzero. In particular, if P is an n × n permutation matrix and A is a nonnegative n × n matrix, P ⊆ A also means that the entries of A at the positions corresponding to the nonzeros of P are positive. We use P ⊙ A to denote the entrywise product of P and A, which selects the entries of A at the positions corresponding to the nonzero entries of P. We also use min{P ⊙ A} to denote the minimum entry of A at the nonzero positions of P.

Let U be a set of positions of the nonzeros of A. Then U is called strongly stable [31] if, for each permutation matrix P ⊆ A, p_kl = 1 for at most one pair (k, l) ∈ U.

Lemma 5.1 (Brualdi [32]). Let A be a doubly stochastic matrix. Then, in a BvN decomposition of A, there are at least γ(A) permutation matrices, where γ(A) is the maximum cardinality of a strongly stable set of positions of A.

Note that γ(A) is no smaller than the maximum number of nonzeros in a row or a column of A, for any matrix A. Brualdi [32] shows that for any integer t with 1 ≤ t ≤ ⌈n/2⌉⌈(n + 1)/2⌉, there exists an n × n doubly stochastic matrix A such that γ(A) = t.

An n × n circulant matrix C is defined as follows. The first row of C is specified as c_1, ..., c_n, and the ith row is obtained from the (i − 1)th one by a cyclic rotation to the right, for i = 2, ..., n:

    C = [ c_1  c_2  ...  c_n     ]
        [ c_n  c_1  ...  c_{n-1} ]
        [  .         .        .  ]
        [ c_2  c_3  ...  c_1     ]

We state and prove an obvious lemma to be used later.

Lemma 5.2. Let C be an n × n positive circulant matrix whose first row is c_1, ..., c_n. The matrix C′ = (1/Σ_j c_j) C is doubly stochastic, and all BvN decompositions with n permutation matrices have the same multi-set of coefficients {c_i/Σ_j c_j : i = 1, ..., n}.

Proof. Since the first row of C′ has all nonzero entries and only n permutation matrices are permitted, the multi-set of coefficients must be the entries in the first row.


In the lemma, if c_i = c_j for some i ≠ j, we will have the same value c_i/c for two different permutation matrices. As a sample BvN decomposition, let c = Σ c_i and consider (1/c) C = Σ_j (c_j/c) D_j, where D_j is the permutation matrix corresponding to the (j − 1)th diagonal: for j = 1, ..., n, the matrix D_j has 1s at the positions (i, i + j − 1) for i = 1, ..., n − j + 1 and at the positions (n − j + 1 + k, k) for k = 1, ..., j − 1, where we assume that the second set is void for j = 1.
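A small sketch verifying this diagonal decomposition of a circulant matrix (0-based indices in code; the values of c are arbitrary illustration data):

import numpy as np

def circulant(first_row):
    """Circulant matrix whose i-th row is the first row rotated right i times."""
    n = len(first_row)
    return np.array([np.roll(first_row, i) for i in range(n)])

def diagonal_permutation(n, j):
    """Permutation matrix with ones at positions (i, (i + j) mod n), 0-based."""
    D = np.zeros((n, n), dtype=float)
    D[np.arange(n), (np.arange(n) + j) % n] = 1.0
    return D

c = np.array([3.0, 1.0, 2.0, 4.0])
C = circulant(c)
# C equals sum_j c_j D_j, i.e., C / sum(c) has a BvN decomposition over the diagonals
assert np.allclose(C, sum(c[j] * diagonal_permutation(len(c), j) for j in range(len(c))))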

Note that the lemma is not concerned with the uniqueness of the BvN decomposition, which would need a unique set of permutation matrices as well. If all c_i's were distinct, this would have been true, in which case the set of D_j's described above would define the unique permutation matrices. Also, a more general variant of the lemma concerns the decompositions whose cardinality k is equal to the maximum cardinality of a strongly stable set. In this case, too, the coefficients in the decomposition will correspond to the entries in a strongly stable set of cardinality k.

5.1.2 The computational complexity of MinBvNDec

Here we prove that the MinBvNDec problem is NP-complete in the strong sense, i.e., it remains NP-complete when the instance is represented in unary. It suffices to do a reduction from a strongly NP-complete problem to prove the strong NP-completeness [22, Section 6.6].

Theorem 5.1. The problem of deciding whether there is a Birkhoff-von Neumann decomposition of a given doubly stochastic matrix with k permutation matrices is strongly NP-complete.

Proof. It is clear that the problem belongs to NP, as it is easy to check in polynomial time that a given decomposition is equal to a given matrix.

To establish NP-completeness, we demonstrate a reduction from the well-known 3-PARTITION problem, which is NP-complete in the strong sense [93, p. 96]. Consider an instance of 3-PARTITION: given an array A of 3m positive integers and a positive integer B such that Σ_{i=1}^{3m} a_i = mB, and for all i it holds that B/4 < a_i < B/2, does there exist a partition of A into m disjoint arrays S_1, ..., S_m such that each S_i has three elements whose sum is B? Let I_1 denote an instance of 3-PARTITION.

We build the following instance I_2 of MinBvNDec corresponding to I_1 given above. Let

    M = [ (1/m) E_m      O        ]
        [     O       (1/(mB)) C  ]

where E_m is an m × m matrix whose entries are all 1, and C is a 3m × 3m circulant matrix whose first row is a_1, ..., a_{3m}. It is easy to see that M is doubly stochastic. A solution of I_2 is a BvN decomposition of M with k = 3m permutations. We will show that I_1 has a solution if and only if I_2 has a solution.


Assume that I_1 has a solution with S_1, ..., S_m. Let S_i = {a_{i1}, a_{i2}, a_{i3}} and observe that (1/(mB))(a_{i1} + a_{i2} + a_{i3}) = 1/m. We identify three permutation matrices P_{id}, for d = 1, 2, 3, in C which contain a_{i1}, a_{i2}, and a_{i3}, respectively. We can write (1/m) E_m = Σ_{i=1}^{m} (1/m) D_i, where D_i is the permutation matrix corresponding to the (i − 1)th diagonal (described after the proof of Lemma 5.2). We prepend D_i to the three permutation matrices P_{id} for d = 1, 2, 3 and obtain three permutation matrices for M. We associate these three permutation matrices with α_{id} = a_{id}/(mB). Therefore, we can write

    M = Σ_{i=1}^{m} Σ_{d=1}^{3} α_{id} [ D_i   O    ]
                                       [  O   P_{id} ]

and obtain a BvN decomposition of M with 3m permutation matrices.

Assume that I_2 has a BvN decomposition with 3m permutation matrices. This also defines two BvN decompositions with 3m permutation matrices, one for (1/m) E_m and one for (1/(mB)) C. We now establish a correspondence between these two BvN decompositions to finish the proof. Since (1/(mB)) C is a circulant matrix with 3m nonzeros in a row, any BvN decomposition of it with 3m permutation matrices has the coefficients a_i/(mB), for i = 1, ..., 3m, by Lemma 5.2. Since a_i + a_j < B, we have (a_i + a_j)/(mB) < 1/m. Therefore, each entry in (1/m) E_m needs to be included in at least three permutation matrices. A total of 3m permutation matrices covers any row of (1/m) E_m, say the first one. Therefore, for each entry in this row, we have 1/m = α_i + α_j + α_k, for i ≠ j ≠ k, corresponding to three coefficients used in the BvN decomposition of (1/(mB)) C. Note that these three indices i, j, k defining the three coefficients used for one entry of E_m cannot be used for another entry in the same row. This correspondence defines a partition of the 3m numbers α_i, for i = 1, ..., 3m, into m groups with three elements each, where each group has a sum of 1/m. The corresponding three a_i's in a group therefore sum up to B, and we have a solution to I_1, concluding the proof.

5.2 A result on the polytope of BvN decompositions

Let S(A) be the polytope of all BvN decompositions for a given doubly stochastic matrix A. The extreme points of S(A) are the ones that cannot be represented as a convex combination of other decompositions. Brualdi [32, pp. 197–198] observes that any generalized Birkhoff heuristic obtains an extreme point of S(A), and predicts that there are extreme points of S(A) which cannot be obtained by a generalized Birkhoff heuristic. In this section, we substantiate this claim by showing an example.

Any BvN decomposition of a given matrix A with the smallest number of permutation matrices is an extreme point of S(A); otherwise, the other BvN decompositions expressing the said point would have a smaller number of permutation matrices.

Lemma 5.3. There are doubly stochastic matrices whose polytopes of BvN decompositions contain extreme points that cannot be obtained by a generalized Birkhoff heuristic.

We will prove the lemma by giving an example. We use computational tools based on a mixed integer linear programming (MILP) formulation of the problem of finding a BvN decomposition with the smallest number of permutation matrices. We first describe the MILP formulation.

Let A be a given n × n doubly stochastic matrix and Ω_n be the set of all n × n permutation matrices. There are n! matrices in Ω_n; for brevity, let us refer to these permutations by P_1, ..., P_{n!}. We associate an incidence matrix M of size n² × n! with Ω_n. We fix an ordering of the entries of A so that each row of M corresponds to a unique entry in A. Each column of M corresponds to a unique permutation matrix in Ω_n. We set m_{ij} = 1 if the ith entry of A appears in the permutation matrix P_j, and set m_{ij} = 0 otherwise. Let ~a be the n²-vector containing the values of the entries of A in the fixed order. Let ~x be a vector of n! elements, x_j corresponding to the permutation matrix P_j. With these definitions, a MILP formulation for finding a BvN decomposition of A with the smallest number of permutation matrices can be written as

    minimize    Σ_{j=1}^{n!} s_j                                  (5.2)
    subject to  M~x = ~a,                                         (5.3)
                1 ≥ x_j ≥ 0,       for j = 1, ..., n!,            (5.4)
                Σ_{j=1}^{n!} x_j = 1,                             (5.5)
                s_j ≥ x_j,         for j = 1, ..., n!,            (5.6)
                s_j ∈ {0, 1},      for j = 1, ..., n!.            (5.7)

In this MILP, s_j is a binary variable which is 1 only if x_j > 0, and 0 otherwise. The equality (5.3), the inequalities (5.4), and the equality (5.5) guarantee that we have a BvN decomposition of A. This MILP can be used only for small problems. In this MILP, we can exclude any permutation matrix P_j from a decomposition by setting s_j = 0.
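The MILP above can be prototyped for very small matrices by enumerating all n! permutations explicitly. The following sketch uses the PuLP modeling package (an assumption; the experiments in this chapter used CPLEX via the NEOS Server) and is only an illustration of the formulation (5.2)-(5.7).

import itertools
import pulp

def min_bvn_milp(A):
    """Solve (5.2)-(5.7) for a tiny doubly stochastic matrix A by
    enumerating all n! permutations; feasible only for very small n."""
    n = A.shape[0]
    perms = list(itertools.permutations(range(n)))
    prob = pulp.LpProblem("MinBvNDec", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x{j}", lowBound=0, upBound=1) for j in range(len(perms))]
    s = [pulp.LpVariable(f"s{j}", cat="Binary") for j in range(len(perms))]
    prob += pulp.lpSum(s)                                   # objective (5.2)
    for i in range(n):                                      # constraints (5.3): M x = a
        for k in range(n):
            prob += pulp.lpSum(x[j] for j, p in enumerate(perms) if p[i] == k) == A[i, k]
    prob += pulp.lpSum(x) == 1                              # constraint (5.5)
    for j in range(len(perms)):
        prob += s[j] >= x[j]                                # constraints (5.6)
    prob.solve()
    return [(pulp.value(x[j]), perms[j]) for j in range(len(perms)) if pulp.value(x[j]) > 1e-9]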

Proof of Lemma 5.3. Let the following 10 letters correspond to the numbers underneath:

     a    b    c    d    e    f    g    h    i    j
     1    2    4    8   16   32   64  128  256  512


Consider the following matrix, whose row sums and column sums are 1023, hence it can be considered as doubly stochastic:

    A = [ a+b  d+i  c+h  e+j  f+g ]
        [ e+g  a+c  b+i  d+f  h+j ]
        [ f+j  e+h  d+g  b+c  a+i ]                              (5.8)
        [ d+h  b+f  a+j  g+i  c+e ]
        [ c+i  g+j  e+f  a+h  b+d ]

Observe that the entries containing the term 2^ℓ form a permutation matrix, for ℓ = 0, ..., 9. Therefore, this matrix has a BvN decomposition with 10 permutation matrices. We created the MILP above and found 10 as the smallest number of permutation matrices by calling the CPLEX solver [1] via the NEOS Server [58, 68, 100]. Hence, the described decomposition is an extreme point of S(A). None of the said permutation matrices annihilates any entry of the matrix. Therefore, at the first step, no entry of A gets reduced to zero, regardless of the order of permutation matrices. Thus, this decomposition cannot be obtained by a generalized Birkhoff heuristic.

One can create a family of matrices with arbitrary sizes by embedding the same matrix into a larger one of the form

    B = [ 1023 · I   O ]
        [    O       A ]

where I is the identity matrix of the desired size. All BvN decompositions of B can be obtained by extending the permutation matrices in A's BvN decompositions with I. That is, for a permutation matrix P in a BvN decomposition of A, the matrix with I and P as diagonal blocks (and zero off-diagonal blocks) is a permutation matrix in the corresponding BvN decomposition of B. Furthermore, all permutation matrices in an arbitrary BvN decomposition of B must have I as the principal sub-matrix, and the rest should correspond to a permutation matrix in A, defining a BvN decomposition for A. Hence, the extreme point of S(B) corresponding to the extreme point of S(A) with 10 permutation matrices cannot be found by a generalized Birkhoff heuristic.

Let us investigate the matrix A and its BvN decomposition given in the proof of Lemma 5.3. Let P_a, P_b, ..., P_j be the 10 permutation matrices of the decomposition, corresponding to a, b, ..., j. We solve 10 MILPs, in which we set s_t = 0 for one t ∈ {a, b, ..., j}. This way, we try to find a BvN decomposition of A without P_a, without P_b, and so on, always with the smallest number of permutation matrices. The smallest number of permutation matrices in these 10 MILPs was 11. This certifies that the only BvN decomposition with 10 permutation matrices necessarily contains P_a, P_b, ..., P_j. It is easy to see that there is a unique solution to the equality M~x = ~a of the MILP with x_t = 0 for t ∉ {a, b, ..., j}, as the submatrix of M containing only the corresponding 10 columns has full column rank.

    A = [   3  264  132  528   96 ]
        [  80    5  258   40  640 ]
        [ 544  144   72    6  257 ]
        [ 136   34  513  320   20 ]
        [ 260  576   48  129   10 ]

(a) The sample matrix

    129 511 257  63  33  15   7   3   2   2   1
      3   4   2   5   5   4   2   3   4   1   1
      5   5   3   1   4   1   4   2   1   2   3
      2   1   5   3   1   2   3   4   3   4   4
      1   3   4   4   2   5   1   5   5   3   2
      4   2   1   2   3   3   5   1   2   5   5

(b) A BvN decomposition

Figure 5.1: The matrix A from Lemma 5.3 and a BvN decomposition with 11 permutation matrices, which can be obtained by a generalized Birkhoff heuristic. Each column in (b) corresponds to a permutation, where the first line gives the associated coefficient and the following lines give the column indices matched to the rows 1 to 5 of A.


Any generalized Birkhoff heuristic obtains at least 11 permutation matrices for the matrix A of the proof of Lemma 5.3. One such decomposition is shown in a tabular format in Fig. 5.1. In Fig. 5.1a, we write the matrix of the lemma explicitly for convenience. Then, in Fig. 5.1b, we give a BvN decomposition. The column headers (the first line) in the table contain the coefficients of the permutation matrices. The nonzero column indices of the permutation matrices are stored by rows. For example, the first permutation has the coefficient 129, and columns 3, 5, 2, 1, and 4 are matched to the rows 1–5. The bold indices signify entries whose values are equal to the coefficients of the corresponding permutation matrices at the time the permutation matrices are found. Therefore, the corresponding entries become zero after the corresponding step. For example, in the first permutation, a_54 = 129.

The output of GreedyBvN for the matrix A is given in Fig. 5.2 for reference. It contains 12 permutation matrices.

5.3 Two heuristics

There are heuristics to compute a BvN decomposition for a given matrix A. In particular, the following family of heuristics is based on the constructive proof of Birkhoff. Let A^(0) = A. At every step j ≥ 1, find a permutation matrix P_j having its ones at the positions of the nonzero elements of A^(j−1), use the minimum nonzero element of A^(j−1) at the positions identified by P_j as α_j, set A^(j) = A^(j−1) − α_j P_j, and repeat the computations in the next step j + 1, until A^(j) becomes void. Any heuristic of this type is called a generalized Birkhoff heuristic.

    513 257 127  63  31  15   7   3   2   2   2   1
      4   2   3   5   5   4   2   3   5   3   1   1
      5   3   5   1   4   1   4   2   4   1   2   3
      1   5   2   3   1   2   3   4   2   4   3   4
      3   4   1   4   2   5   1   5   1   2   5   2
      2   1   4   2   3   3   5   1   3   5   4   5

Figure 5.2: The output of GreedyBvN for the matrix given in the proof of Lemma 5.3: a decomposition with 12 permutation matrices, shown in the same format as Fig. 5.1b (the first line gives the coefficients; the following lines give the column indices matched to the rows 1 to 5 of A).

Given an n times n matrix A we can associate a bipartite graph GA =(RcupCE) to it The rows of A correspond to the vertices in the set R andthe columns of A correspond to the vertices in the set C so that (ri cj) isin Eiff aij 6= 0 A perfect matching in GA or G in short is a set of n edges notwo sharing a row or a column vertex Therefore a perfect matching M inG defines a unique permutation matrix PM sube A

Birkhoff's heuristic. This is the original heuristic used in proving that a doubly stochastic matrix has a BvN decomposition, described for example by Brualdi [32]. Let a be the smallest nonzero of A^(j−1) and G^(j−1) be the bipartite graph associated with A^(j−1). Find a perfect matching M in G^(j−1) containing a, set α_j ← a and P_j ← P_M.

GreedyBvN heuristic. At every step j, among all perfect matchings in G^(j−1), find one whose minimum element is the maximum. That is, find a perfect matching M in G^(j−1) where min{P_M ⊙ A^(j−1)} is the maximum. This "bottleneck" matching problem is polynomial time solvable with, for example, MC64 [72]. In this greedy approach, α_j is the largest amount we can subtract from a row and a column of A^(j−1), and we hope to obtain a small k.

Algorithm 11: GreedyBvN: A greedy heuristic for constructing a BvN decomposition

Data: A, a doubly stochastic matrix
Result: a BvN decomposition
1: k ← 0
2: while nnz(A) > 0 do
3:   k ← k + 1
4:   P_k ← the pattern of a bottleneck perfect matching M in A
5:   α_k ← min{P_k ⊙ A}
6:   A ← A − α_k P_k
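A minimal sketch of GreedyBvN, assuming SciPy is available. Instead of MC64, the bottleneck matching is obtained here by binary search over the distinct nonzero values combined with a maximum-cardinality bipartite matching; the function names are illustrative.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import maximum_bipartite_matching

def bottleneck_perfect_matching(A):
    """Row-to-column perfect matching of the sparse matrix A maximizing the
    minimum selected entry (binary search over the distinct nonzero values)."""
    vals = np.unique(A.data)
    best, lo, hi = None, 0, len(vals) - 1
    coo = A.tocoo()
    while lo <= hi:
        mid = (lo + hi) // 2
        keep = coo.data >= vals[mid]
        thresholded = sp.csr_matrix((coo.data[keep], (coo.row[keep], coo.col[keep])), shape=A.shape)
        match = maximum_bipartite_matching(thresholded, perm_type='column')
        if (match >= 0).all():        # a perfect matching exists using entries >= vals[mid]
            best, lo = match, mid + 1
        else:
            hi = mid - 1
    return best

def greedy_bvn(A, tol=1e-12):
    """GreedyBvN sketch: repeatedly peel off a bottleneck perfect matching."""
    A = sp.csr_matrix(A, dtype=float)
    n = A.shape[0]
    decomposition = []
    while A.nnz > 0:
        match = bottleneck_perfect_matching(A)
        alpha = min(A[i, match[i]] for i in range(n))
        P = sp.csr_matrix((np.ones(n), (np.arange(n), match)), shape=A.shape)
        decomposition.append((float(alpha), match.copy()))
        A = A - alpha * P
        A.data[np.abs(A.data) < tol] = 0.0   # drop numerically zero entries
        A.eliminate_zeros()
    return decomposition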

A(0) =                 A(1) =                 A(2) =
[ 1 4 1 0 0 0 ]        [ 0 4 1 0 0 0 ]        [ 0 4 0 0 0 0 ]
[ 0 1 4 1 0 0 ]        [ 0 1 3 1 0 0 ]        [ 0 0 3 1 0 0 ]
[ 0 0 1 4 1 0 ]        [ 0 0 1 3 1 0 ]        [ 0 0 1 2 1 0 ]
[ 0 0 0 1 4 1 ]        [ 0 0 0 1 3 1 ]        [ 0 0 0 1 2 1 ]
[ 1 0 0 0 1 4 ]        [ 1 0 0 0 1 3 ]        [ 1 0 0 0 1 2 ]
[ 4 1 0 0 0 1 ]        [ 4 0 0 0 0 1 ]        [ 3 0 0 0 0 1 ]

A(3) =                 A(4) =                 A(5) =
[ 0 3 0 0 0 0 ]        [ 0 2 0 0 0 0 ]        [ 0 1 0 0 0 0 ]
[ 0 0 3 0 0 0 ]        [ 0 0 2 0 0 0 ]        [ 0 0 1 0 0 0 ]
[ 0 0 0 2 1 0 ]        [ 0 0 0 2 0 0 ]        [ 0 0 0 1 0 0 ]
[ 0 0 0 1 1 1 ]        [ 0 0 0 0 1 1 ]        [ 0 0 0 0 1 0 ]
[ 1 0 0 0 1 1 ]        [ 1 0 0 0 1 0 ]        [ 1 0 0 0 0 0 ]
[ 2 0 0 0 0 1 ]        [ 1 0 0 0 0 1 ]        [ 0 0 0 0 0 1 ]

Figure 5.3: A sample matrix showing that the original Birkhoff heuristic can obtain a BvN decomposition with n permutation matrices while the optimum one has 3.

Analysis of the two heuristics

As we will see in Section 5.4, GreedyBvN can obtain a much smaller number of permutation matrices compared to the Birkhoff heuristic. Here, we compare these two heuristics theoretically. First, we show that the original Birkhoff heuristic does not have any constant ratio approximation guarantee. Furthermore, for an n × n matrix, its worst-case approximation ratio is Ω(n).

We begin with a small example, shown in Fig. 5.3. We decompose a 6 × 6 matrix A^(0), which has an optimal BvN decomposition with three permutation matrices: the main diagonal, the one containing the entries equal to 4, and the one containing the remaining entries. For simplicity, we use integer values in our example. However, since the row and column sums of A^(0) are equal to 6, it can be converted to a doubly stochastic matrix by dividing all the entries by 6. Instead of the optimal decomposition, in the figure we obtain the permutation matrices as the original Birkhoff heuristic does. Each removed entry set is a permutation and contains the minimum possible value, 1.

In the following, we show how to generalize the idea for having matrices of arbitrarily large size with three permutation matrices in an optimal decomposition, while the original Birkhoff heuristic obtains n permutation matrices.

Lemma 5.4. The worst-case approximation ratio of the Birkhoff heuristic is Ω(n).

Proof. For any given integer n ≥ 3, we show that there is a matrix of size n × n whose optimal BvN decomposition has 3 permutations, whereas the Birkhoff heuristic obtains a BvN decomposition with exactly n permutation matrices. The example in Fig. 5.3 is a special case, for n = 6, of the following construction process.

Let f : {1, 2, ..., n} → {1, 2, ..., n} be the function f(x) = (x mod n) + 1. Given a matrix M, let M′ = F(M) be another matrix containing the same set of entries, where the function f(·) is used on the coordinate indices to redistribute the entries of M onto M′. That is, m_ij = m′_{f(i)f(j)}. Since f(·) is one-to-one and onto, if M is a permutation matrix, then F(M) is also a permutation matrix. We will start with a permutation matrix and run it through F for n − 1 times to obtain n permutation matrices, which are all different. By adding these permutation matrices, we will obtain a matrix A whose optimal BvN decomposition has three permutation matrices, while the n permutation matrices used to create A correspond to a decomposition that can be obtained by the Birkhoff heuristic.

Let P_1 be the permutation matrix whose ones, which are partitioned into three sets, are at the positions

    1st set: (1, 1);   2nd set: (n, 2);   3rd set: (2, 3), (3, 4), ..., (n − 1, n).      (5.9)

Let us use F(·) to generate a matrix sequence P_i = F(P_{i−1}) for 2 ≤ i ≤ n. For example, P_2's nonzeros are at the positions

    1st set: (2, 2);   2nd set: (1, 3);   3rd set: (3, 4), (4, 5), ..., (n, 1).

We then add the P_i's to build the matrix

    A = P_1 + P_2 + · · · + P_n.

We have the following observations about the nonzero elements of A.

1. a_ii = 1 for all i = 1, ..., n, and only P_i has a one at the position (i, i). These elements are from the first set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(1).

2. a_ij = 1 for all i = 1, ..., n and j = ((i + 1) mod n) + 1, and only P_h, where h = (i mod n) + 1, has a one at the position (i, j). These elements are from the second set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(2).

3. a_ij = n − 2 for all i = 1, ..., n and j = (i mod n) + 1, where all P_ℓ for ℓ ∈ {1, ..., n} \ {i, j} have a one at the position (i, j). These elements are from the third set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(3) multiplied by the scalar (n − 2).

In other words, we can write

    A = P^(1) + P^(2) + (n − 2) · P^(3)

and see that A has a BvN decomposition with three permutation matrices. We note that each row and column of A contains three nonzeros, and hence three is the smallest number of permutation matrices in a BvN decomposition of A.

Since the minimum element in A is 1 and each P_i contains one such element, the Birkhoff heuristic can obtain a decomposition using P_i for i = 1, ..., n. Therefore, the Birkhoff heuristic's approximation ratio is no better than n/3, which can be made arbitrarily large.

We note that GreedyBvN will optimally decompose the matrix A used in the proof above.
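A small sketch that builds the matrix A = P_1 + · · · + P_n of this proof (0-based indices in code; the function name is illustrative):

import numpy as np

def birkhoff_hard_instance(n):
    """Build A = P_1 + ... + P_n, where P_1 has ones at (1,1), (n,2), and
    (i, i+1) for i = 2..n-1 (1-based), and each P_{i+1} is obtained from P_i
    by mapping both indices through f(x) = (x mod n) + 1."""
    P = np.zeros((n, n), dtype=int)
    P[0, 0] = 1                      # 1st set: (1, 1)
    P[n - 1, 1] = 1                  # 2nd set: (n, 2)
    for i in range(1, n - 1):        # 3rd set: (2, 3), ..., (n-1, n)
        P[i, i + 1] = 1
    A = np.zeros((n, n), dtype=int)
    for _ in range(n):
        A += P
        P = np.roll(P, shift=(1, 1), axis=(0, 1))   # apply F: cyclically shift both indices
    return A

# For n = 6 this reproduces A(0) of Fig. 5.3: each row has entries 1, 1, and n - 2.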

We now analyze the theoretical properties of GreedyBvN. We first give a lemma identifying a pattern in its output.

Lemma 5.5. The heuristic GreedyBvN obtains α_1, ..., α_k in a non-increasing order, α_1 ≥ · · · ≥ α_k, where α_j is obtained at the jth step.

We observed through a series of experiments that the performance of GreedyBvN depends on the values in the matrix, and there seems to be no constant factor approximation [74].

We create a set of n × n matrices. To do that, we first fix a set of z permutation matrices C_1, ..., C_z of size n × n. These permutation matrices, with varying values of α, will be used to generate the matrices. The matrices are parameterized by the subscript i, and each A_i is created as follows: A_i = α_1 · C_1 + α_2 · C_2 + · · · + α_z · C_z, where each α_j for j = 1, ..., z is a randomly chosen integer in the range [1, 2^i], and we also set a randomly chosen α_j equal to 2^i to guarantee the existence of at least one large value, even in the unlikely case that all other values are not large enough. As can be seen, each A_i has the same structure and differs from the rest only in the values of the α_j's that are chosen. As a consequence, they all can be decomposed by the same set of permutation matrices.

We present our results in two sets of experiments, shown in Table 5.1, for two different n. We create five random A_i for i ∈ {10, 20, 30, 40, 50}; that is, we have five matrices with the parameter i, and there are five different i. We have n = 30 and z = 20 in the first set, and n = 200 and z = 100 in the second set. Let k_i be the number of permutation matrices GreedyBvN obtains for a given A_i.

n = 30 and z = 20
          average           worst case
  i     k_i    k_i/z      k_i    k_i/z
 10      59     2.99       63     3.15
 20     105     5.29      110     5.50
 30     149     7.46      158     7.90
 40     184     9.23      191     9.55
 50     212    10.62      227    11.35

n = 200 and z = 100
          average           worst case
  i     k_i    k_i/z      k_i    k_i/z
 10     268     2.69      280     2.80
 20     487     4.88      499     4.99
 30     716     7.16      726     7.26
 40     932     9.33      947     9.47
 50    1124    11.25     1162    11.62

Table 5.1: Experiments showing the dependence of the performance of GreedyBvN on the values of the matrix elements. n is the matrix size; i ∈ {10, 20, 30, 40, 50} is the parameter for creating matrices using α_j ∈ [1, 2^i]; z is the number of permutation matrices used in creating A_i. Five experiments are run for a given pair of n and i; GreedyBvN obtains k_i permutation matrices. The average and the maximum number of permutation matrices obtained by GreedyBvN over the five random instances are given. k_i/z is a lower bound on the performance of GreedyBvN, as z ≥ Opt.

The table gives the average and the maximum k_i of five different A_i for a given n and i pair. By construction, each A_i has a BvN decomposition with z permutation matrices. Since z is no smaller than the optimal value, the ratio k_i/z gives a lower bound on the performance of GreedyBvN. As seen from the experiments, as i increases, the performance of GreedyBvN gets increasingly worse. This shows that a constant ratio worst-case approximation of GreedyBvN is unlikely. While the performance depends on z (for example, for small z, GreedyBvN is likely to obtain near optimal decompositions), it seems that the size of the matrix does not largely affect the relative performance of GreedyBvN.

Now we attempt to explain the above results theoretically.

Lemma 5.6. Let α_1 P_1 + · · · + α_k P_k be a BvN decomposition of a given doubly stochastic matrix A with the smallest number k of permutation matrices. Then, for any BvN decomposition of A with ℓ ≥ k permutation matrices and coefficients α′_1, ..., α′_ℓ, we have ℓ ≤ k · (max_i α_i)/(min_i α′_i). If the coefficients are integers (e.g., when A is a matrix with constant row and column sums of integral values), we have ℓ ≤ k · max_i α_i.

Proof. Consider a BvN decomposition α′_1 P′_1 + · · · + α′_ℓ P′_ℓ with ℓ ≥ k. Assume, without loss of generality, that α_1 ≥ · · · ≥ α_k and α′_1 ≥ · · · ≥ α′_ℓ. We know that the coefficients of these two decompositions sum up to the same value. That is,

    Σ_{i=1}^{ℓ} α′_i = Σ_{i=1}^{k} α_i .

Since α′_ℓ is the smallest of the α′'s and α_1 is the largest of the α's, we have

    ℓ · α′_ℓ ≤ k · α_1 ,

and hence

    ℓ/k ≤ α_1/α′_ℓ .

By assuming integer values, we see that α′_ℓ ≥ 1, and thus

    ℓ ≤ k · max_i α_i .

This lemma evaluates the approximation guarantee of a given BvN decomposition. It does not seem very useful, because of the fact that even if we have min_i α′_i, we do not have max_i α_i. Luckily, we can say more in the case of GreedyBvN.

Corollary 5.1. Let k be the smallest number of permutation matrices in a BvN decomposition of a given doubly stochastic matrix A. Let α_1 and α_ℓ be the first and last coefficients obtained by the heuristic GreedyBvN for decomposing A. Then ℓ ≤ k · α_1/α_ℓ.

Proof. This is easy to see, as GreedyBvN obtains the coefficients in a non-increasing order (see Lemma 5.5), and α_1 ≥ α_j for all 1 ≤ j ≤ k, for any BvN decomposition containing α_j.

Lemma 5.6 and Corollary 5.1 give a posteriori estimates of the performance of the heuristic GreedyBvN, in that one looks at the decomposition and tells how good it is. This can potentially reveal a good performance. For example, when GreedyBvN obtains a BvN decomposition with all coefficients equal, then we know that it is an optimal BvN decomposition. The same cannot be said for the Birkhoff heuristic, though (consider the example preceding Lemma 5.4). We also note that the ratio given in Corollary 5.1 should usually be much larger than the practical performance.

5.4 Experiments

We present results with the two heuristics. We note that the original Birkhoff heuristic was not concerned with the minimality of the number of permutation matrices. Therefore, the presented results are not for comparing the two heuristics, but for giving results with what is available. We give results on two different sets of matrices. The first set contains real-world sparse matrices preprocessed to be doubly stochastic. The second set contains a few randomly created dense doubly stochastic matrices.

                                                    Birkhoff          GreedyBvN
matrix           n       τ      dmax   dev          Σ α_i     k       Σ α_i     k
aft01            8205   125567    21   0.0e+00      0.160   2000      1.000    120
aft02            8184   127762    82   1.0e-06      0.000   1434      1.000    234
barth            6691    46187    13   0.0e+00      0.160   2000      1.000     71
barth4           6019    40965    13   0.0e+00      0.140   2000      1.000     61
bcspwr10         5300    21842    14   1.0e-06      0.380   2000      1.000     63
bcsstk38         8032   355460   614   0.0e+00      0.000   2000      1.000    592*
benzene          8219   242669    37   0.0e+00      0.000   2000      1.000    113
c-29             5033    43731   481   1.0e-06      0.000   2000      1.000    870
EX5              6545   295680    48   0.0e+00      0.020   2000      1.000    229
EX6              6545   295680    48   1.0e-06      0.030   2000      1.000    226
flowmeter0       9669    67391    11   0.0e+00      0.510   2000      1.000     58
fv1              9604    85264     9   0.0e+00      0.620   2000      1.000     50
fv2              9801    87025     9   0.0e+00      0.620   2000      1.000     52
fxm3_6           5026    94026   129   1.0e-06      0.130   2000      1.000    383
g3rmt3m3         5357   207695    48   1.0e-06      0.050   2000      1.000    223
Kuu              7102   340200    98   0.0e+00      0.000   2000      1.000    330
mplate           5962   142190    36   0.0e+00      0.030   2000      1.000    153
n3c6-b7          6435    51480     8   0.0e+00      1.000      8      1.000      8
nemeth02         9506   394808    52   0.0e+00      0.000   2000      1.000    109
nemeth03         9506   394808    52   0.0e+00      0.000   2000      1.000    115
olm5000          5000    19996     6   1.0e-06      0.750    283      1.000     14
s1rmq4m1         5489   262411    54   0.0e+00      0.000   2000      1.000    211
s2rmq4m1         5489   263351    54   0.0e+00      0.000   2000      1.000    208
SiH4             5041   171903   205   0.0e+00      0.000   2000      1.000    574
t2d_q4           9801    87025     9   2.0e-06      0.500   2000      1.000     54

Table 5.2: Birkhoff's heuristic and GreedyBvN on sparse matrices from the UFL collection. The column τ contains the number of nonzeros in a matrix. The column dmax contains the maximum number of nonzeros in a row or a column, setting up a lower bound for the number k of permutation matrices in a BvN decomposition. The column "dev" contains the maximum deviation of a row/column sum of a matrix A from 1, in other words the value max{‖A1 − 1‖_∞, ‖1^T A − 1^T‖_∞}, reported to six significant digits. The two heuristics are run to obtain at most 2000 permutation matrices or until they have accumulated a sum of at least 0.9999 with the coefficients. In one matrix (marked with *), GreedyBvN obtained this number in less than dmax permutation matrices; increasing the limit to 0.999999 made it return with 908 permutation matrices.


The first set of matrices was created as follows. We selected all matrices with the following properties from the SuiteSparse Matrix Collection (UFL) [59]: square, with at least 5000 and at most 10000 rows, fully indecomposable, and with at most 50 and at least 2 nonzeros per row. This gave a set of 66 matrices. These 66 matrices are from 30 different groups of problems; we chose at most two per group to remove any bias that might arise from the group. This resulted in 32 matrices, which we preprocessed as follows to obtain a set of doubly stochastic matrices. We first took the absolute values of the entries to make the matrices nonnegative. Then we scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of 10^{-6} and at most 1000 iterations. With this setting, the scaling algorithm tries to get the maximum deviation of a row/column sum from 1 to be less than 10^{-6} within 1000 iterations. In 7 matrices, the deviation was larger than 10^{-4}. We deemed this too big a deviation from a doubly stochastic matrix and discarded those matrices. At the end, we had 25 matrices, given in Table 5.2.

We ran the two heuristics for the BvN decomposition to obtain at most 2000 permutation matrices or until they accumulated a sum of at least 0.9999 with the coefficients. The accumulated sum of coefficients Σ_{i=1}^{k} α_i and the number of permutation matrices k for the Birkhoff and GreedyBvN heuristics are given in Table 5.2. As seen in this table, GreedyBvN obtains a much smaller number of permutation matrices than Birkhoff's heuristic, except for the matrix n3c6-b7. This is a special matrix with eight nonzeros in each row and column, where all nonzeros are 1/8. Both heuristics find the same (minimum) number of permutation matrices for it. We note that the geometric mean of the ratios k/dmax is 3.4 for GreedyBvN.

The second set of matrices was created as follows. For n ∈ {100, 200, 300}, we created five matrices of size n × n whose entries are randomly chosen integers between 1 and 100 (we performed tests with integers between 1 and 20, and the results were close to what we present here). We then scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of 10^{-6} and at most 1000 iterations. We ran the two heuristics for the BvN decomposition such that at most n² − 2n + 2 permutation matrices were found, or the total value of the coefficients was larger than 0.9999. We then took the average of the five instances with the same n. The results are in Table 5.3. In this table, again, we see that GreedyBvN obtains a much smaller k than Birkhoff's heuristic.

The experimental observation that GreedyBvN performs better than the Birkhoff heuristic is not surprising. This is for two reasons. First, as stated before, the original Birkhoff heuristic is not concerned with the number of permutation matrices. Second, the heuristic GreedyBvN is an adaptation of the Birkhoff heuristic to have large reductions at each step.

           Birkhoff              GreedyBvN
  n     Σ α_i       k         Σ α_i       k
 100     0.99     9644         1.00      388
 200     0.99    39208         1.00      717
 300     1.00    88759         1.00     1042

Table 5.3: Birkhoff's heuristic and GreedyBvN on random dense matrices. The maximum deviation of a row/column sum of a matrix A from 1, that is max{‖A1 − 1‖_∞, ‖1^T A − 1^T‖_∞}, was always less than 10^{-6}. For each n, the results are the averages of five different instances.

5.5 A generalization of the Birkhoff-von Neumann theorem

We have exploited the BvN decomposition in the context of preconditioning for solving linear systems [23]. As the decomposition is defined for nonnegative matrices, we needed a generalization of it to cover all real matrices. To motivate the generalized decomposition theorem, we give a brief summary of the preconditioner.

5.5.1 Doubly stochastic splitting

For a given nonnegative matrix A, we first preprocess it to get a doubly stochastic matrix (whose row and column sums are one). Then, using this doubly stochastic matrix, we select some fraction of some of the nonzeros of A to be included in the preconditioner. The selection is done by taking a set of permutation matrices along with their coefficients in a BvN decomposition of the given matrix.

Assume we have a BvN decomposition of A as shown in (5.1). Then one can pick an integer r between 1 and k − 1 and split A as

    A = M − N,                                               (5.10)

where

    M = α_1 P_1 + · · · + α_r P_r,   N = −α_{r+1} P_{r+1} − · · · − α_k P_k.      (5.11)

Note that M and −N are doubly substochastic matrices.

Definition 5.1. A splitting of the form (5.10), with M and N given by (5.11), is said to be a doubly substochastic splitting.

Definition 5.2. A doubly substochastic splitting A = M − N of a doubly stochastic matrix A is said to be standard if M is invertible. We will call such a splitting an SDS splitting.


In general, it is not easy to guarantee that a given doubly substochastic splitting is standard, except for some trivial situations, such as the case r = 1, in which M is always invertible. We also have a characterization of an invertible M for general r.

Theorem 5.2. A sufficient condition for M = Σ_{i=1}^{r} α_i P_i to be invertible is that one of the α_i, with 1 ≤ i ≤ r, be greater than the sum of the remaining ones.

Let A = M − N be an SDS splitting of A and consider the stationary iterative scheme

    x_{k+1} = H x_k + c,   H = M^{-1} N,   c = M^{-1} b,        (5.12)

where k = 0, 1, ..., and x_0 is arbitrary. As seen in (5.12), M^{-1}, or solvers for My = z, are required. As is well known, the scheme (5.12) converges to the solution of Ax = b for any x_0 if and only if ρ(H) < 1. Hence, we are interested in conditions that guarantee that the spectral radius of the iteration matrix

    H = M^{-1} N = −(α_1 P_1 + · · · + α_r P_r)^{-1} (α_{r+1} P_{r+1} + · · · + α_k P_k)

is strictly less than one. In general, this problem appears to be difficult. We have a necessary condition (Theorem 5.3) and a sufficient condition (Theorem 5.4), which is simple but restrictive.
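A small sketch of the stationary scheme (5.12), assuming the BvN terms are given as (coefficient, permutation) pairs; for illustration M is factored directly, whereas the intended use applies M^{-1} via further splittings. Function and parameter names are illustrative.

import numpy as np

def sds_stationary_iteration(alphas, perms, b, r, num_iters=50):
    """x_{k+1} = M^{-1}(N x_k + b) for the SDS splitting A = M - N, where
    M is built from the first r BvN terms and N from the remaining ones.
    perms[j] is a permutation as a column-index array (row i -> perms[j][i]).
    Convergence requires rho(M^{-1}N) < 1 (see Theorems 5.3 and 5.4)."""
    n = len(b)
    def perm_matrix(p):
        P = np.zeros((n, n))
        P[np.arange(n), p] = 1.0
        return P
    M = sum(alphas[j] * perm_matrix(perms[j]) for j in range(r))
    N = -sum(alphas[j] * perm_matrix(perms[j]) for j in range(r, len(alphas)))
    x = np.zeros(n)
    for _ in range(num_iters):
        x = np.linalg.solve(M, N @ x + b)
    return x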

Theorem 5.3. For the splitting A = M − N, with M = Σ_{i=1}^{r} α_i P_i and N = −Σ_{i=r+1}^{k} α_i P_i, to be convergent, it must hold that Σ_{i=1}^{r} α_i > Σ_{i=r+1}^{k} α_i.

Theorem 5.4. Suppose that one of the α_i appearing in M is greater than the sum of all the other α_i. Then ρ(M^{-1}N) < 1, and the stationary iterative method (5.12) converges for all x_0 to the unique solution of Ax = b.

We discuss how to build preconditioners meeting the sufficiency conditions. Since M itself is doubly stochastic (up to a scalar ratio), we can apply splitting recursively on M and obtain a special solver. Our motivation is that the preconditioner M^{-1} can be applied to vectors via a number of highly concurrent steps, where the number of steps is controlled by the user. Therefore, the preconditioners (or the splittings) can be advantageous for use in many-core computing systems. In the context of splittings, the application of N to vectors also enjoys the same property. These motivations are shared by recent work on ILU preconditioners, where their fine-grained computation [48] and approximate application [12] are investigated for GPU-like systems.


Preconditioner construction

It is desirable to have a small number k in the Birkhoff-von Neumann decomposition while designing the preconditioner; this was our motivation to investigate the MinBvNDec problem. This is because of the fact that, if we use splitting, then k determines the number of steps in which we compute the matrix-vector products. If we do not use all k permutation matrices, having a few with large coefficients should help to design the preconditioner. As stated in Lemma 5.5, GreedyBvN (Algorithm 11) obtains the coefficients in a non-increasing order. We can therefore use GreedyBvN to build an M such that it satisfies the sufficiency condition presented in Theorem 5.4. That is, we can have M with α_1/Σ_{i=1}^{k} α_i > 1/2, and hence M^{-1} can be applied with splitting iterations. For this, we start by initializing M to α_1 P_1. Then, when a coefficient α_j, where j ≥ 2, is obtained at Line 5 of Algorithm 11, we check whether α_1 will still be larger than the sum of the other coefficients used in building M when we add α_j P_j to M. If so, we add α_j P_j to M and continue. In practice, we iterate the while loop until k is around 10 and collect the α_j's as described above. Experiments on a set of challenging problems [23] show that this preconditioner is more effective than ILU(0).
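A sketch of the selection rule just described, assuming the coefficients arrive in the non-increasing order produced by GreedyBvN; function and parameter names are illustrative.

def select_preconditioner_terms(alphas, perms, max_terms=10):
    """Keep adding (alpha_j, P_j) to M as long as the first coefficient stays
    larger than the sum of the other selected ones (the sufficiency condition
    of Theorem 5.4).  alphas are assumed non-increasing."""
    selected = [(alphas[0], perms[0])]
    rest_sum = 0.0
    for alpha, perm in zip(alphas[1:max_terms], perms[1:max_terms]):
        if alphas[0] > rest_sum + alpha:
            selected.append((alpha, perm))
            rest_sum += alpha
    return selected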

5.5.2 Birkhoff-von Neumann decomposition for arbitrary matrices

In order to apply the preconditioners to all fully indecomposable sparse matrices, we generalized the BvN decomposition as follows.

Theorem 5.5. Any (real) matrix A with total support can be written as a convex combination of a set of signed, scaled permutation matrices.

Proof. Let B = abs(A), and consider R and C making RBC doubly stochastic, which has a Birkhoff-von Neumann decomposition RBC = Σ_{i=1}^{k} α_i P_i. Then RAC can be expressed as

    RAC = Σ_{i=1}^{k} α_i Q_i,

where Q_i = [q^(i)_{jk}]_{n×n} is obtained from P_i = [p^(i)_{jk}]_{n×n} as follows:

    q^(i)_{jk} = sign(a_{jk}) p^(i)_{jk}.

The scaling matrices can be inverted to express A.
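A sketch of this construction, assuming helper functions scale (returning scaling vectors that make diag(r) abs(A) diag(c) doubly stochastic) and bvn (returning a list of (coefficient, permutation) pairs) are available; both are placeholders.

import numpy as np

def generalized_bvn(A, scale, bvn):
    """Express a real matrix A with total support via signed, scaled
    permutation matrices, following the proof of Theorem 5.5."""
    B = np.abs(A)
    r, c = scale(B)
    S = np.diag(r) @ B @ np.diag(c)          # doubly stochastic matrix RBC
    n = A.shape[0]
    terms = []
    for alpha, perm in bvn(S):
        Q = np.zeros((n, n))
        Q[np.arange(n), perm] = np.sign(A[np.arange(n), perm])   # signed permutation
        terms.append((alpha, Q))
    # A = diag(1/r) @ (sum_i alpha_i Q_i) @ diag(1/c)
    return terms, r, c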

De Werra [62] presents another generalization of the Birkhoff-von Neumann theorem to arbitrary matrices. In De Werra's generalization, instead of permutation matrices, signed integral matrices are used to decompose the given matrix. The entries of the integral matrices used in the decomposition have the same sign as the corresponding entries in the original matrix, but there may be multiple nonzeros in a row or column. De Werra discusses flow-based algorithms to obtain the decompositions. While the decomposition is costlier to obtain (general flow computations are usually costlier than computing perfect matchings), it is also more general in that it handles rectangular matrices as well.

5.6 Summary, further notes, and references

We investigated the problem of obtaining a Birkhoff-von Neumann decomposition of doubly stochastic matrices with the minimum number of permutation matrices. We showed four results regarding this problem. First, obtaining a BvN decomposition with the minimum number of permutation matrices is NP-hard. Second, there are matrices whose decompositions with the smallest number of permutation matrices cannot be obtained by any Birkhoff-like heuristic. Third, the worst-case approximation ratio of the original Birkhoff heuristic is Ω(n). On a more practical side, we proposed a natural greedy heuristic (GreedyBvN) and presented experimental results which demonstrated that this heuristic delivers results that are not far from a trivial lower bound. We theoretically investigated GreedyBvN's performance and observed that its performance depends on the values of the matrix elements. We showed a bound using the first and the last coefficients found by the heuristic. This chapter also included a generalization of the Birkhoff-von Neumann theorem, in which we are able to express any real matrix with total support as a convex combination of scaled, signed permutation matrices.

The shown bound for the performance of GreedyBvN is expected to be much larger than what one observes in practice, as the bound can even be larger than the upper bound on the number of permutation matrices. A tighter analysis should be possible to explain the practical performance of the GreedyBvN heuristic.

The heuristics we considered in this chapter were based on finding a perfect matching. Koopmans and Beckmann [143] present another constructive proof of the Birkhoff theorem. This constructive proof splits the matrix as A = βA_1 + (1 − β)A_2, where 0 < β < 1, with A_1 and A_2 being doubly stochastic matrices. A related proof is also given by Horn and Johnson [118, Theorem 8.7.1], where β = 1/2. We review these two constructive proofs.

Let C be a cycle of length 2ℓ, with ℓ vertices in each part of the bipartite graph: vertices r_i, r_{i+1}, ..., r_{i+ℓ−1} on one side and c_i, c_{i+1}, ..., c_{i+ℓ−1} on the other side. Let us imagine the cycle drawn on the plane so that we have horizontal edges of the form (r_j, c_j) for j = i, ..., i + ℓ − 1, and slanted edges of the form (r_i, c_{i+ℓ−1}) and (r_j, c_{j−1}) for j = i + 1, ..., i + ℓ − 1.

Figure 5.4: A small example for the Koopmans and Beckmann decomposition approach: (a) the horizontal edges of the cycle are decreased by ε_1 = min{a_ii, a_jj} while the slanted edges are increased by ε_1; (b) the horizontal edges are increased by ε_2 = min{a_ij, a_ji} while the slanted edges are decreased by ε_2.

A small example is shown in Figure 5.4. Let ε_1 and ε_2 be the minimum value of a horizontal and a slanted edge, respectively. Koopmans and Beckmann proceed as follows to obtain a BvN decomposition. Consider a matrix C_1 which contains −ε_1 at the positions corresponding to the horizontal edges of C, ε_1 at the positions corresponding to the slanted edges of C, and zeros elsewhere. Then define A_1 = A + C_1 (see Fig. 5.4a). It is easy to see that each element of A_1 is between 0 and 1, and its row and column sums are the same as those of A. Therefore, A_1 is doubly stochastic. Similarly, consider C_2, which contains ε_2 at the positions corresponding to the horizontal edges of C, −ε_2 at the positions corresponding to the slanted edges of C, and zeros elsewhere. Then define A_2 = A + C_2 (see Fig. 5.4b) and observe that A_2 is also doubly stochastic. By simple arithmetic, we can set β = ε_2/(ε_1 + ε_2) and write A = βA_1 + (1 − β)A_2. The observation that A_1 and A_2 contain at least one more zero than A can then be used in an inductive hypothesis to construct a decomposition by recursively decomposing A_1 and A_2.

At first sight, the constructive approach of Koopmans and Beckmann does not lead to an efficient algorithm. This is because A1 and A2 can have significant overlap, and hence many permutation matrices can be shared by A1 and A2, yielding a combinatorially large space requirement or a very high run time. A variant of this algorithm needs to be explored. First note that we can find a cycle and manipulate the weights by decreasing them along the horizontal edges and increasing them along the slanted edges (not necessarily by annihilating the smallest weight) to create another matrix. Then we can take the modified matrix and manipulate another cycle. At one point, one can decide to stop and create A1 this way; with a suitable choice of coefficients, the decomposition can thus proceed. At second sight, then, one observes that the Koopmans and Beckmann method calls for creative ideas for developing efficient and effective heuristics for BvN decompositions.

Horn and Johnson, on the other hand, set ε to the smallest of ε1 and ε2.


Without loss of generality, let ε1 be the smallest. Then let C be a matrix whose entries are −1, 0, and 1, corresponding to the negative, zero, and positive entries of C1, respectively. Then let A1 = A + εC and A2 = A − εC. Both of these matrices have nonnegative entries less than or equal to one, and have the same row and column sums as A. We can then write A = 0.5A1 + 0.5A2 and recurse on A1 and A2 to obtain a decomposition (which stops when we do not have any cycles).
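To make the splitting step concrete, the following is a minimal sketch (ours, for illustration only, and not the implementation discussed in this chapter). It assumes the matrix is a dense NumPy array and that the cycle is supplied as two index lists, rows and cols, which a driver would have to extract from the bipartite graph of the nonzeros, e.g., by a depth-first search.

import numpy as np

def split_on_cycle(A, rows, cols):
    # One Horn-Johnson splitting step.  The cycle visits rows[0], cols[0],
    # rows[1], cols[1], ...; "horizontal" positions are (rows[t], cols[t]) and
    # "slanted" ones are (rows[t], cols[t-1]), with t = 0 wrapping around.
    C = np.zeros_like(A, dtype=float)
    ell = len(rows)
    for t in range(ell):
        C[rows[t], cols[t]] -= 1.0      # -1 on horizontal positions (as in C1)
        C[rows[t], cols[t - 1]] += 1.0  # +1 on slanted positions
    touched = [(rows[t], cols[t]) for t in range(ell)] + \
              [(rows[t], cols[t - 1]) for t in range(ell)]
    eps = min(A[i, j] for i, j in touched)  # eps = min(eps1, eps2)
    return A + eps * C, A - eps * C         # A = 0.5*A1 + 0.5*A2

Recursing on the two returned matrices until no cycle remains yields a decomposition; each step annihilates at least one nonzero entry.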

Inspired by the approaches summarized above, we can optimally decompose the matrix (5.8) used in showing that there are optimal decompositions which cannot be obtained by Birkhoff-like heuristics. Let

A1 = [ a+b    d      c      e      0   ]
     [ e      a+c    b      d      0   ]
     [ 0      e      d      b+c    a   ]
     [ d      b      a      0      c+e ]
     [ c      0      e      a      b+d ]

A2 = [ a+b    d+2i   c+2h   e+2j   2f+2g ]
     [ e+2g   a+c    b+2i   d+2f   2h+2j ]
     [ 2f+2j  e+2h   d+2g   b+c    a+2i  ]
     [ d+2h   b+2f   a+2j   2g+2i  c+e   ]
     [ c+2i   2g+2j  e+2f   a+2h   b+d   ]

and observe that A = 0.5A1 + 0.5A2. We can then use the permutations containing the coefficients a, b, c, d, e to decompose A1 optimally with five permutation matrices using GreedyBvN (in decreasing order: first e, then d, then c, and so on). While decomposing A2, we need to use the permutation matrices found for A1, with a careful selection of the coefficients (instead of a + b we use a, for example, for the permutation associated with a). Once we have consumed those five permutations, we can again resort to GreedyBvN to decompose the remaining matrix optimally with the five permutation matrices associated with f, g, h, i, and j. In this way, we have an optimal decomposition for the matrix (5.8) by applying GreedyBvN to A1 and carefully decomposing A2. We note that neither A1 nor A2 has the same sum as A, in contrast to the two approaches above. Can we systematize this line of reasoning to develop an effective heuristic to obtain a BvN decomposition?


Chapter 6

Matchings in hypergraphs

In this chapter, we investigate the problem of maximum matching in d-partite, d-uniform hypergraphs, described in Section 1.3. The formal problem definition is repeated below for convenience.

Maximum matching in d-partite, d-uniform hypergraphs. Given a d-partite, d-uniform hypergraph, find a matching of maximum size.

We investigate effective heuristics for the problem above. This problem has been studied mostly in the context of local search algorithms [119], and the best known algorithm is due to Cygan [56], who provides a ((d + 1 + ε)/3)-approximation, building on previous work [57, 106]. It is NP-hard to approximate the 3-partite case (Max-3-DM) within 98/97 [24]. Similar bounds exist for higher dimensions: the hardness of approximation for d = 4, 5, and 6 is shown to be 54/53 − ε, 30/29 − ε, and 23/22 − ε, respectively [109].

We propose five heuristics for the above problem. The first two heuristics are adaptations of the Greedy [78] and Karp-Sipser [128] heuristics used in finding matchings in graphs (they are summarized in Chapter 4). The proposed generalizations are referred to as Greedy-H and Karp-Sipser-H. Greedy-H traverses the hyperedge list in random order and adds an edge to the matching whenever possible. Karp-Sipser-H introduces certain rules to Greedy-H to improve the cardinality. The third heuristic is inspired by a recent scaling-based approach proposed for the maximum cardinality matching problem on graphs [73, 74, 76]. The fourth heuristic is a modification of the third one that allows for a faster execution time. The last one finds a matching for a reduced (d − 1)-dimensional problem and exploits it for the original matching problem recursively. This heuristic uses an exact algorithm for the bipartite matching problem at each reduction. We perform experiments to evaluate the performance of these heuristics on special classes of random hypergraphs as well as on real-life data.

One plausible way to tackle the problem is to create the line graph G of a given hypergraph H. The line graph is created by identifying each hyperedge of H with a vertex in G, and by connecting two vertices of G with an edge iff the corresponding hyperedges share a common vertex in H. Then, successful heuristics for computing large independent sets in graphs, e.g., KaMIS [147], can be used to compute large matchings in hypergraphs. This approach, although promising quality-wise, could be impractical. This is so since building G from H requires quadratic run time (in terms of the number of hyperedges) and, more importantly, quadratic storage (again in terms of the number of hyperedges) in the worst case. While this can be acceptable in some instances, in some others it is not; we have such instances in the experiments. Notice that while a heuristic for the independent set problem can be of linear time complexity in graphs, due to our graphs being line graphs, the actual complexity could be high.

6.1 Background and notation

Franklin and Lorenz [89] show that if a nonnegative d-dimensional tensor X has the same zero-pattern as a d-stochastic tensor, one can apply a multidimensional version of the Sinkhorn-Knopp algorithm [202] (see also the description in Section 4.1) to scale X to be d-stochastic. This is accomplished by finding d vectors u^(1), ..., u^(d) and scaling the nonzeros of X as

    x_{i1,...,id} · u^(1)_{i1} · u^(2)_{i2} · · · u^(d)_{id},  for all i1, ..., id ∈ {1, ..., n}.
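For illustration only, a small sketch of this multidimensional scaling iteration on a sparse tensor stored as a list of (index, value) pairs follows; the function name and the data layout are assumptions made for the example, not the code used in the experiments.

import numpy as np

def sinkhorn_knopp_dstochastic(nonzeros, dims, n_iter=20):
    # nonzeros: list of (index_tuple, value); dims: sizes n_1, ..., n_d.
    # Returns one scaling vector per dimension, u^(1), ..., u^(d).
    d = len(dims)
    u = [np.ones(nk) for nk in dims]
    for _ in range(n_iter):
        for k in range(d):
            # current marginals of the scaled tensor along dimension k
            marginals = np.zeros(dims[k])
            for idx, val in nonzeros:
                scaled = val
                for j in range(d):
                    scaled *= u[j][idx[j]]
                marginals[idx[k]] += scaled
            nz = marginals > 0
            u[k][nz] /= marginals[nz]   # make the k-th marginals (about) one
    return u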

In a hypergraph, if a vertex is a member of only a single hyperedge, we call it a degree-1 vertex. Similarly, if a vertex is a member of only two hyperedges, we call it a degree-2 vertex.

In the k-out random hypergraph model, given a set V of vertices, each vertex u ∈ V selects k hyperedges from the set E_u = {e : e ⊆ V, u ∈ e} in a uniformly random fashion, and the union of these edges forms E. We are interested in the d-partite, d-uniform case, and hence E_u = {e : |e ∩ V_i| = 1 for 1 ≤ i ≤ d, u ∈ e}. This model generalizes the random k-out bipartite graphs [211]. Devlin and Kahn [67] investigate fractional matchings in these hypergraphs, and mention in passing that k should be exponential in d to ensure that a perfect matching exists.
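A small generator sketch of this model (ours, written only to make the construction concrete; the vertices of part p are the integers 0, ..., n−1):

import random

def k_out_hypergraph(n, d, k, seed=0):
    # Every vertex (p, v) chooses k hyperedges; each chosen hyperedge contains
    # (p, v) plus one uniformly random vertex from every other part.  Duplicate
    # hyperedges collapse, as in the text.
    rng = random.Random(seed)
    edges = set()
    for p in range(d):
        for v in range(n):
            for _ in range(k):
                e = [rng.randrange(n) for _ in range(d)]
                e[p] = v
                edges.add(tuple(e))
    return edges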

6.2 The heuristics

A matching which cannot be extended with more edges is called maximal. The heuristics proposed here find maximal matchings on d-partite, d-uniform hypergraphs. For such hypergraphs, any maximal matching is a d-approximate matching. The bound is tight and can be verified for d = 3. Let H be a 3-partite, 3 × 3 × 3 hypergraph with the following edges: e1 = (1, 1, 1), e2 = (2, 2, 2), e3 = (3, 3, 3), and e4 = (1, 2, 3). The maximum matching is {e1, e2, e3}, but the edge e4 alone forms a maximal matching.


6.2.1 Greedy-H: A Greedy heuristic

Among the two variants of Greedy [78, 194], we adapt the first one, which randomly visits the edges. The adapted version is referred to as Greedy-H; it visits the hyperedges in random order and adds the visited hyperedge to the matching whenever possible. Since only maximal matchings are possible as its output, Greedy-H is a d-approximation heuristic.
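A minimal sketch of Greedy-H (ours, for illustration; hyperedges are d-tuples, and a vertex is identified by its (part, index) pair):

import random

def greedy_h(edges, seed=0):
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)                      # visit hyperedges in random order
    matched, matching = set(), []
    for e in edges:
        vertices = [(p, v) for p, v in enumerate(e)]
        if all(x not in matched for x in vertices):
            matching.append(e)              # add e whenever possible
            matched.update(vertices)
    return matching                         # a maximal matching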

6.2.2 Karp-Sipser-H: Adapting Karp-Sipser

The commonly used and analyzed version of the Karp-Sipser heuristic for graphs applies degree-1 reductions, as summarized in Section 4.2.1. The original version, though, applies two rules:

• At any time during the heuristic, if a degree-1 vertex appears, it is matched with its only neighbor.

• Otherwise, if a degree-2 vertex u appears with neighbors {v, w}, then u (and its edges) is removed from the current graph, and v and w are merged to create a new vertex vw whose set of neighbors is the union of those of v and w (except u). A maximum cardinality matching for the reduced graph can be extended to obtain one for the current graph by matching u with either v or w, depending on vw's match.

We now propose an adaptation of Karp-Sipser with the two rules for d-partite, d-uniform hypergraphs. The adapted version is called Karp-Sipser-H. Similar to the original one, the modified heuristic iteratively adds a random hyperedge to the matching and removes its d endpoints, as well as their hyperedges. However, the random selection is not applied whenever hyperedges defined by the following lemmas appear.

Lemma 6.1. During the heuristic, if a hyperedge e with at least d − 1 degree-1 endpoints appears, there exists a maximum cardinality matching in the current hypergraph containing e.

Proof. Let H′ be the current hypergraph at hand and e = (u1, ..., ud) be a hyperedge in H′ whose first d − 1 endpoints are degree-1 vertices. Let M′ be a maximum cardinality matching in H′. If e ∈ M′, we are done. Otherwise, assume that ud is the endpoint matched by a hyperedge e′ ∈ M′ (note that if ud is not matched, M′ can be extended with e). Since the ui, 1 ≤ i < d, are not matched in M′, M′ \ {e′} ∪ {e} defines a valid maximum cardinality matching for H′.

We note that it is not possible to relax the condition by using a hyperedge e with fewer than d − 1 endpoints of degree-1: in M′, two of e's higher degree endpoints could be matched with two different hyperedges, in which case the substitution as done in the proof of the lemma would not be valid.


Lemma 6.2. During the heuristic, let e = (u1, ..., ud) and e′ = (u′1, ..., u′d) be two hyperedges sharing at least one endpoint, where for an index set I ⊂ {1, ..., d} of cardinality d − 1, the vertices ui, u′i for all i ∈ I only touch e and/or e′. That is, for each i ∈ I, either ui = u′i is a degree-2 vertex, or ui ≠ u′i and they are both degree-1 vertices. For j ∉ I, uj and u′j are arbitrary vertices. Then, in the current hypergraph, there exists a maximum cardinality matching having either e or e′.

Proof. Let H′ be the current hypergraph at hand and j ∉ I be the remaining part index. Let M′ be a maximum cardinality matching in H′. If either e ∈ M′ or e′ ∈ M′, we are done. Otherwise, ui and u′i for all i ∈ I are unmatched by M′. Furthermore, since M′ is maximal, uj must be matched by M′ (otherwise M′ can be extended by e). Let e″ ∈ M′ be the hyperedge matching uj. Then M′ \ {e″} ∪ {e} defines a valid maximum cardinality matching for H′.

Whenever such hyperedges appear, the rules below are applied in the given order:

• Rule-1: At any time during the heuristic, if a hyperedge e with at least d − 1 degree-1 endpoints appears, then instead of a random edge, e is added to the matching and removed from the hypergraph.

• Rule-2: Otherwise, if two hyperedges e and e′ as defined in Lemma 6.2 appear, they are removed from the current hypergraph along with the endpoints ui, u′i for all i ∈ I. Then, we consider uj and u′j. If uj and u′j are distinct, they are merged to create a new vertex uju′j, whose hyperedge list is defined as the union of uj's and u′j's hyperedge lists. If uj and u′j are identical, we rename uj as uju′j. After obtaining a maximal matching on the reduced hypergraph, depending on the hyperedge matching uju′j, either e or e′ can be used to obtain a larger matching in the current hypergraph.

When Rule-2 is applied, the two hyperedges identified in Lemma 6.2 are removed from the hypergraph, and only the hyperedges containing uj and/or u′j have an update in their vertex lists. Since the original hypergraph is d-partite and d-uniform, that update is just a renaming of a vertex in the concerned hyperedges (hence the resulting hypergraph is d-partite and d-uniform).

Although the extended rules usually lead to improved results in comparison to Greedy-H, Karp-Sipser-H still adheres to the d-approximation bound of maximal matchings. For the example given at the beginning of Section 6.2, Karp-Sipser-H generates a maximum cardinality matching by applying the first rule. However, when e5 = (2, 1, 3) and e6 = (3, 1, 3) are added to the example, neither of the two rules can be applied. As before, in case e4 is randomly selected, it alone forms a maximal matching.


6.2.3 Karp-Sipser-H-scaling: Karp-Sipser with scaling

Karp-Sipser-H can be modified to make better decisions in case neither of the two rules applies. In this variant, instead of a random selection, we first scale the adjacency tensor of H and obtain an approximately d-stochastic tensor X. We then augment the matching by adding the edge which corresponds to the largest value in X when the rules do not apply. The modified heuristic is summarized in Algorithm 12.

Algorithm 12: Karp-Sipser-H-scaling(H)

Data: A d-partite, d-uniform n1 × · · · × nd hypergraph H = (V, E)
Result: A maximal matching M of H

1:  M ← ∅                                  ▷ Initially M is empty
2:  S ← ∅                                  ▷ Stack for the merges for Rule-2
3:  while H is not empty do
4:     Remove isolated vertices from H
5:     if ∃ e = (u1, ..., ud) as in Rule-1 then
6:        M ← M ∪ {e}                       ▷ Add e to the matching
7:        Apply the reduction Rule-1 on H
8:     else if ∃ e = (u1, ..., ud), e′ = (u′1, ..., u′d), and I as in Rule-2 then
9:        Let j be the part index with j ∉ I
10:       Apply the reduction Rule-2 on H by introducing the vertex uju′j
11:       E′ ← {(v1, ..., uju′j, ..., vd) : (v1, ..., uj, ..., vd) ∈ E}   ▷ Memorize the hyperedges of uj
12:       S.push(⟨e, e′, uju′j, E′⟩)        ▷ Store the current merge
13:    else
14:       X ← Scale(adj(H))                 ▷ Scale the adjacency tensor of H
15:       e ← arg max_{(u1,...,ud)} x_{u1,...,ud}   ▷ Find the maximum entry in X
16:       M ← M ∪ {e}                       ▷ Add e to the matching
17:       Remove all hyperedges of u1, ..., ud from E
18:       V ← V \ {u1, ..., ud}
19: while S ≠ ∅ do
20:    ⟨e, e′, uju′j, E′⟩ ← S.pop()
21:    if uju′j is not matched by M then
22:       M ← M ∪ {e}
23:    else
24:       Let e″ ∈ M be the hyperedge matching uju′j
25:       if e″ ∈ E′ then                   ▷ e″ stems from a hyperedge of uj
26:          Replace uju′j in e″ with uj
27:          M ← M ∪ {e′}
28:       else
29:          Replace uju′j in e″ with u′j
30:          M ← M ∪ {e}

Our inspiration comes from the d = 2 case [74, 76] described in Chapter 4. By using the scaling method as a preprocessing step and choosing edges with a probability corresponding to the scaled entry, the edges which are not included in a perfect matching become less likely to be chosen. Unfortunately, for d ≥ 3, there is no equivalent of Birkhoff's theorem, as demonstrated by the following lemma, whose proof can be found in the associated technical report [75, Lemma 3].

Lemma 6.3. For d ≥ 3, there exist extreme points in the set of d-stochastic tensors which are not permutation tensors.

These extreme points can be used to generate other d-stochastic tensors as linear combinations. Due to the lemma above, we do not have the theoretical foundation to imply that hyperedges corresponding to the large entries in the scaled tensor must necessarily participate in a perfect matching. Nonetheless, the entries not in any perfect matching tend to become zero (this is not guaranteed for all of them, though). For the worst case example of Karp-Sipser-H described above, the scaling indeed helps the entries corresponding to e4, e5, and e6 to become zero. See the discussion following the said lemma in the technical report [75].

On a d-partite, d-uniform hypergraph H = (V, E), the Sinkhorn-Knopp algorithm used for scaling operates in iterations, each of which requires O(|E| × d) time. In practice, we perform only a few iterations (e.g., 10–20). Since we can match at most |V|/d hyperedges, the overall run time cost associated with scaling is O(|V| × |E|). A straightforward implementation of the second rule can take quadratic time in the worst case of a large number of repetitive merges with a given vertex. In practice, more of a linear time behavior should be observed for the second rule.

6.2.4 Karp-Sipser-H-mindegree: Hypergraph matching via pseudo scaling

In Algorithm 12, applying scaling at every step can be very costly. Here we propose an alternative idea, inspired by the specifics of the Sinkhorn-Knopp algorithm, to reduce the overall cost. In particular, we can mimic the first iteration of the Sinkhorn-Knopp algorithm and use the inverses of the vertex degrees as the scaling vectors. This assigns each vertex a weight proportional to the inverse of its degree; hence, it is not an exact iteration of Sinkhorn-Knopp.

With this approach, each hyperedge (i1, ..., id) is associated with the value 1/(λ_{i1} · λ_{i2} · · · λ_{id}), where λ_{ij} denotes the degree of the vertex ij from the jth part. The selection procedure is the same as that of Algorithm 12, i.e., the edge with the maximum value is added to the matching set. We refer to this algorithm as Karp-Sipser-H-mindegree, as it selects a hyperedge based on a function of the degrees of the vertices. With a straightforward implementation, finding this hyperedge takes O(|E|) time. For better efficiency, the edges can be stored in a heap, and when the degree of a node v decreases, the increaseKey heap operation can be called for all its edges.
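The selection rule itself is easy to state in code; the following sketch (ours, with an assumed degree map from (part, vertex) pairs to current degrees) returns the hyperedge with the largest pseudo-scaled value:

def pick_min_degree_edge(edges, degree):
    # Maximize prod_j 1/deg(v_j), i.e., minimize the product of the degrees
    # of the endpoints; this mimics one sweep of Sinkhorn-Knopp.
    best, best_score = None, None
    for e in edges:
        score = 1.0
        for part, v in enumerate(e):
            score /= degree[(part, v)]
        if best_score is None or score > best_score:
            best, best_score = e, score
    return best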

6.2.5 Bipartite-reduction: Reduction to the bipartite matching problem

A perfect matching in a d-partite, d-uniform hypergraph H remains perfect when projected on a (d − 1)-partite, (d − 1)-uniform hypergraph obtained by removing one of H's dimensions. Matchability in (d − 1)-dimensional sub-hypergraphs has been investigated [6] to provide an equivalent of Hall's Theorem for d-partite hypergraphs. These observations lead us to propose a heuristic called Bipartite-reduction. This heuristic tackles the d-partite, d-uniform case by recursively asking for matchings in (d − 1)-partite, (d − 1)-uniform hypergraphs, and so on, until d = 2.

Let us start with the case where d = 3. Let G = (VG, EG) be the bipartite graph with the vertex set VG = V1 ∪ V2 obtained by deleting V3 from a 3-partite, 3-uniform hypergraph H = (V, E). The edge (u, v) ∈ EG iff there exists a hyperedge (u, v, z) ∈ E. One can also assign a weight function w(·) to the edges during this step, such as

    w(u, v) = |{z : (u, v, z) ∈ E}|.    (6.1)

A maximum weighted (product, sum, etc.) matching algorithm can be used to obtain a matching MG on G. A second bipartite graph G′ = (VG′, EG′) is then created, with VG′ = (V1 × V2) ∪ V3 and EG′ = {(uv, z) : (u, v) ∈ MG, (u, v, z) ∈ H}. Under this construction, any matching in G′ corresponds to a valid matching in H. Furthermore, if the weight function (6.1) defined above is used, the following holds (the proof is in the technical report [75, Proposition 4]).

Proposition 6.1. Let w(MG) = Σ_{(u,v)∈MG} w(u, v) be the size of the matching MG found in G. Then G′ has w(MG) edges.

Thus, by selecting a maximum weighted matching MG and maximizing w(MG), the largest number of edges will be kept in G′.

For d-dimensional matching, a similar process is followed. First, an ordering i1, i2, ..., id of the dimensions is defined. At the jth bipartite reduction step, the matching is found between the dimension cluster i1i2 · · · ij and the dimension ij+1, by similarly solving a bipartite matching instance where the edge (u1 · · · uj, v) exists iff the vertices u1, ..., uj were matched in the previous steps and there exists an edge (u1, ..., uj, v, zj+2, ..., zd) in H.

Unlike the previous heuristics, Bipartite-reduction does not have any approximation guarantee. We state this with the following lemma (the proof is in the technical report [75, Lemma 5]).


Lemma 6.4. If algorithms for the maximum cardinality matching problem, or for the maximum weighted matching problem with the suggested edge weights (6.1), are used, then Bipartite-reduction has a worst-case approximation ratio of Ω(n).

6.2.6 Local-Search: Performing local search

A local search heuristic is proposed by Hurkens and Schrijver [119]. It starts from a feasible maximal matching M and performs a series of swaps until no further swap is possible. In a swap, k edges of M are replaced with at least k + 1 new edges from E \ M, so that the cardinality of M increases by at least one. These k edges from M can be replaced with at most d × k new edges; hence, these edges can be found by a polynomial algorithm enumerating all the possibilities. The approximation guarantee improves with higher k values. Local search algorithms are limited in practice due to their high time complexity: the algorithm might have to examine all (|M| choose k) subsets of M to find a feasible swap at each step. The algorithm by Cygan [56], which achieves a ((d + 1 + ε)/3)-approximation, is based on a different swap scheme, but it is also not suited for large hypergraphs. We implement Local-Search as an alternative from the literature to set a baseline.
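As a rough illustration of a single swap with k = 1, the sketch below (ours) uses a greedy realization rather than the exhaustive enumeration analyzed by Hurkens and Schrijver; hyperedges are d-tuples and vertices are (part, index) pairs.

def local_search_pass(edges, matching):
    # Try to replace one hyperedge of the matching with at least two free
    # hyperedges; returns the (possibly improved) matching and a success flag.
    def verts(e):
        return {(p, v) for p, v in enumerate(e)}

    matching = list(matching)
    for i, e in enumerate(matching):
        used = set().union(*map(verts, matching)) - verts(e)  # free e's vertices
        gained = []
        for f in edges:
            if f not in matching and verts(f).isdisjoint(used):
                gained.append(f)
                used |= verts(f)
        if len(gained) >= 2:          # feasible swap: drop e, add the gained edges
            return matching[:i] + matching[i + 1:] + gained, True
    return matching, False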

6.3 Experiments

To understand the relative performance of the proposed heuristics, we conducted a wide variety of experiments with both synthetic and real-life data. The experiments were performed on a computer equipped with an Intel Core i7-7600 CPU and 16 GB of RAM. We investigate the performance of the proposed heuristics Greedy-H, Karp-Sipser-H, Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction. For d = 3, we also consider Local-Search [119], which replaces one hyperedge from a maximal matching M with at least two hyperedges from E \ M to increase the cardinality of M. We did not consider local search schemes for higher dimensions or with better approximation ratios, as they are computationally too expensive. For each hypergraph, we perform ten runs of Greedy-H and Karp-Sipser-H with different random decisions and take the maximum cardinality obtained. Since Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction do not pick hyperedges randomly, we run them only once. We perform 20 steps of the scaling procedure in Karp-Sipser-H-scaling. We refer to the quality of a matching M in a hypergraph H as the ratio of M's cardinality to the size of the smallest vertex partition of H.

On random hypergraphs

We perform experiments on two classes of d-partite, d-uniform random hypergraphs where each part has n vertices. The first class contains random


                        k                                         k
  n = 10   d   d^{d-3}  d^{d-2}  d^{d-1}      n = 30   d   d^{d-3}  d^{d-2}  d^{d-1}
           2      -       0.87     1.00                2      -       0.84     1.00
           3    0.80      1.00     1.00                3    0.88      1.00     1.00
           4    1.00      1.00     1.00                4    0.99      1.00     1.00
           5    1.00      1.00     1.00                5    1.00      1.00

  n = 20   2      -       0.88     1.00      n = 50    2      -       0.87     1.00
           3    0.85      1.00     1.00                3    0.84      1.00     1.00
           4    1.00      1.00     1.00                4     *        1.00     1.00
           5    1.00      1.00     1.00                5

Table 6.1: The average maximum matching cardinalities, normalized by n, of five random instances of k-out, d-partite, d-uniform hypergraphs, for different k, d, and n. There are no runs for k = d^{d−3} with d = 2, and the problems marked with ∗ were not solved within 24 hours.

k-out hypergraphs, and the second one contains sparse random hypergraphs.

Random k-out, d-partite, d-uniform hypergraphs

Here we consider the random k-out, d-partite, d-uniform hypergraphs described in Section 6.1. Hence (ignoring the duplicate ones), these hypergraphs have around d × k × n hyperedges. These k-out, d-partite, d-uniform hypergraphs have been recently analyzed in the matching context by Devlin and Kahn [67]. They state in passing that k should be exponential in d for a perfect matching to exist with high probability. The bipartite graph variant of the same problem, i.e., with d = 2, has been extensively studied in the literature [90, 124, 125, 211].

We first investigate the existence of perfect matchings in random k-out, d-partite, d-uniform hypergraphs. For this purpose, we used CPLEX [1] to solve the hypergraph matching problem exactly, and found the maximum cardinality of a matching in k-out hypergraphs with k ∈ {d^{d−3}, d^{d−2}, d^{d−1}}, for d ∈ {2, ..., 5} and n ∈ {10, 20, 30, 50}. For each (k, d, n) triple, we created five hypergraphs and computed their maximum cardinality matchings. For k = d^{d−3}, we encountered several hypergraphs with no perfect matching, especially for d = 3. The hypergraphs with k = d^{d−2} were also lacking a perfect matching for d = 2. However, all the hypergraphs we created with k = d^{d−1} had at least one. Based on these results, we experimentally confirm Devlin and Kahn's statement. We also conjecture that d^{d−1}-out random hypergraphs have perfect matchings almost surely. The average maximum matching cardinalities we obtained in this experiment are given in Table 6.1. In this table, we do not have results for k = d^{d−3} for d = 2, and the cases marked with ∗ were not solved within 24 hours.

We now compare the performance of the proposed heuristics on random k-out, d-partite, d-uniform hypergraphs with d ∈ {3, 6, 9} and n ∈ {1000, 10000}. We tested with k values equal to powers of 2, for k ≤ d log d. The results are summarized in Figure 6.1. For each (k, d, n) triplet, we create ten random instances and present the average performance of the heuristics on them. The x-axis in each figure denotes k, and the y-axis reports the matching cardinality over n. As seen, Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree have the best performance, comfortably beating the other alternatives. For d = 3, Karp-Sipser-H-scaling dominates Karp-Sipser-H-mindegree, but when d > 3 we see that Karp-Sipser-H-mindegree has the best performance. Local-Search performs worse than Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, but better than the others. Karp-Sipser-H performs better than Greedy-H; however, their performances get closer as d increases. This is due to the fact that the conditions for Rule-1 and Rule-2 hold less often for larger d, as we have more restrictions to meet in such cases. Bipartite-reduction has worse performance than the others, and the gap in the performance grows as d increases. This happens since at each step we impose more and more conditions on the edges involved, and there is no chance to recover from bad decisions.

Sparse random d-partite, d-uniform hypergraphs

Here we consider a random d-partite, d-uniform hypergraph Hi created with i × n hyperedges. The parameters used for this experiment are i ∈ {1, 3, 5, 7}, n ∈ {4000, 8000}, and d ∈ {6, 9} (the associated paper presents further experiments with d = 3). Each Hi is created by choosing the vertices of a hyperedge uniformly at random for each dimension. We do not allow duplicate hyperedges. Another random hypergraph Hi+M is then obtained by planting a perfect matching in Hi, that is, a random perfect matching is added to the pattern of Hi. We again generate ten random instances for each parameter setting. We do not present results for Bipartite-reduction, as it was always worse than the others. The average quality of the different heuristics on these instances is shown in Figure 6.2. The experiments confirm that Karp-Sipser-H performs consistently better than Greedy-H. Furthermore, Karp-Sipser-H-scaling performs significantly better than Karp-Sipser-H. Karp-Sipser-H-scaling works even better than the local search heuristic [75, Fig. 2], and it is the only heuristic that is capable of finding the planted perfect matchings for a significant number of the runs. In particular, it finds a perfect matching on the Hi+M's in all cases except for when d = 6 and i = 7. Interestingly, Karp-Sipser-H-mindegree outperforms Karp-Sipser-H-scaling on the Hi's, but is dominated on the Hi+M's, where it is the second best performing heuristic.

Evaluating algorithmic choices

We evaluate the use of scaling and the importance of Rule-1 and Rule-2.

Scaling vs. no-scaling

To evaluate and emphasize the contribution of scaling better, we compare the


[Figure 6.1 consists of six plots: (a) d = 3, (b) d = 6, and (c) d = 9, each with n = 1000 (left) and n = 10000 (right), comparing Greedy, Karp-Sipser, Karp-Sipser-scaling, Karp-Sipser-mindegree, Local-search, and Bipartite-reduction.]

Figure 6.1: The performance of the heuristics on k-out, d-partite, d-uniform hypergraphs with n vertices in each part. The y-axis is the ratio of the matching cardinality to n, whereas the x-axis is k. There is no Local-Search for d = 6 and d = 9.


(a) d = 6, without (left) and with (right) the planted matching:

               Hi: random hypergraph                               Hi+M: random hypergraph + a perfect matching
       Greedy-H      KS-H        KS-H-scal.  KS-H-mindeg.    Greedy-H      KS-H        KS-H-scal.  KS-H-mindeg.
  i   4000  8000   4000  8000   4000  8000   4000  8000     4000  8000   4000  8000   4000  8000   4000  8000
  1   0.31  0.31   0.35  0.35   0.35  0.35   0.36  0.37     0.62  0.61   0.90  0.89   1.00  1.00   1.00  1.00
  3   0.43  0.43   0.47  0.47   0.48  0.48   0.54  0.54     0.51  0.50   0.56  0.55   1.00  1.00   0.99  0.99
  5   0.48  0.48   0.52  0.52   0.54  0.54   0.61  0.61     0.52  0.52   0.56  0.55   1.00  1.00   0.97  0.97
  7   0.52  0.52   0.55  0.55   0.59  0.59   0.66  0.66     0.54  0.54   0.57  0.57   0.84  0.80   0.71  0.70

(b) d = 9, without (left) and with (right) the planted matching:

               Hi: random hypergraph                               Hi+M: random hypergraph + a perfect matching
       Greedy-H      KS-H        KS-H-scal.  KS-H-mindeg.    Greedy-H      KS-H        KS-H-scal.  KS-H-mindeg.
  i   4000  8000   4000  8000   4000  8000   4000  8000     4000  8000   4000  8000   4000  8000   4000  8000
  1   0.25  0.24   0.27  0.27   0.27  0.27   0.30  0.30     0.56  0.55   0.80  0.79   1.00  1.00   1.00  1.00
  3   0.34  0.33   0.36  0.36   0.36  0.36   0.43  0.43     0.40  0.40   0.44  0.44   1.00  1.00   0.99  1.00
  5   0.38  0.37   0.40  0.40   0.41  0.41   0.48  0.48     0.41  0.40   0.43  0.43   1.00  1.00   0.99  0.99
  7   0.40  0.40   0.42  0.42   0.44  0.44   0.51  0.51     0.42  0.42   0.44  0.44   1.00  1.00   0.97  0.96

Figure 6.2: Performance comparisons on d-partite, d-uniform hypergraphs with n = 4000, 8000. Hi contains i × n random hyperedges, and Hi+M contains an additional perfect matching. (KS-H abbreviates Karp-Sipser-H.)

[Figure 6.3 depicts the bipartite graph AKS with row blocks R1, R2, column blocks C1, C2, and parameter t.]

Figure 6.3: AKS, a challenging bipartite graph instance for Karp-Sipser.

   t    Greedy-H   Local-Search   Karp-Sipser-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
   2      0.53         0.99            0.53               1.00                     1.00
   4      0.53         0.99            0.53               1.00                     1.00
   8      0.54         0.99            0.55               1.00                     1.00
  16      0.55         0.99            0.56               1.00                     1.00
  32      0.59         0.99            0.59               1.00                     1.00

Table 6.2: Performance of the proposed heuristics on 3-partite, 3-uniform hypergraphs corresponding to XKS, with n = 300 vertices in each part.


performance of the heuristics on a particular family of d-partite, d-uniform hypergraphs whose bipartite counterparts have been used before as challenging instances for the original Karp-Sipser heuristic [76].

Let AKS be an n × n matrix, discussed before in Figure 4.3 as a challenging instance for Karp-Sipser, and shown in Figure 6.3 for convenience. To adapt this scheme to hypergraphs/tensors, we generate a 3-dimensional tensor XKS such that the nonzero pattern of each marginal of the third dimension is identical to that of AKS. Table 6.2 shows the performance of the heuristics (i.e., the matching cardinality normalized by n) for 3-dimensional tensors with n = 300 and t ∈ {2, 4, 8, 16, 32}.

The use of scaling indeed reduces the influence of the misleading hyperedges in the dense block R1 × C1, and the proposed Karp-Sipser-H-scaling heuristic always finds the perfect matching, as does Karp-Sipser-H-mindegree. However, Greedy-H and Karp-Sipser-H perform significantly worse. Furthermore, Local-Search returns a 0.99-approximation in every case, because it ends up in a local optimum.

Rule-1 vs. Rule-2

We finish the discussion on the synthetic data by focusing on Karp-Sipser-H. In the bipartite case, recent work [10] shows that both rules are needed to obtain perfect matchings in a class of graphs. We present a family of hypergraphs for which applying Rule-2 leads to much better performance than applying Rule-1 only.

We use Karp-Sipser-H-R1 to refer to Karp-Sipser-H without Rule-2. As before, we describe first the bipartite case. Let ARF be an n × n matrix with (ARF)_{ij} = 1 for 1 ≤ i ≤ j ≤ n, and (ARF)_{2,1} = (ARF)_{n,n−1} = 1. That is, ARF is composed of an upper triangular matrix and two additional subdiagonal nonzeros. The first two columns and the last two rows have two nonzeros each. Assume, without loss of generality, that the first two rows are merged by applying Rule-2 on the first column (which is discarded). Then, in the reduced matrix, the first column (corresponding to the second column in the original matrix) will have a single nonzero. Rule-1 can now be applied, whereupon the first column of the resulting reduced matrix will again have degree one. The process continues similarly until the reduced matrix is a 2 × 2 dense block, where applying Rule-2 followed by Rule-1 yields a perfect matching. If only Rule-1 reductions are allowed, initially no reduction can be applied, and randomly chosen edges will be matched, which negatively affects the quality of the returned matching.

For higher dimensions, we proceed as follows. Let XRF be a d-dimensional n × · · · × n tensor. We set (XRF)_{i,j,...,j} = 1 for 1 ≤ i ≤ j ≤ n, and (XRF)_{1,2,...,2} = (XRF)_{n,n−1,...,n−1} = 1. By similar reasoning, we see that Karp-Sipser-H with both reduction rules will obtain a perfect matching, whereas Karp-Sipser-H-R1 will struggle. We give some results in Table 6.3 that show the difference between the two. We test for n ∈ {1000, 2000, 4000}


                d = 2              d = 3              d = 6
     n      quality   r/n      quality   r/n      quality   r/n
   1000      0.83     0.45      0.85     0.47      0.80     0.31
   2000      0.86     0.53      0.87     0.56      0.80     0.30
   4000      0.82     0.42      0.75     0.17      0.84     0.45

Table 6.3: Quality of matching, and the number r of applications of Rule-1 divided by n, in Karp-Sipser-H-R1 for hypergraphs corresponding to XRF. Karp-Sipser-H (with both rules) obtains perfect matchings.

and d ∈ {2, 3, 6}, and show the quality of Karp-Sipser-H-R1 and the number of times that Rule-1 is applied, divided by n. We present the best result over 10 runs. As seen in Table 6.3, Karp-Sipser-H-R1 obtains matchings that are about 13–25% worse than Karp-Sipser-H. Furthermore, the larger the number of Rule-1 applications is, the higher the quality is.

On real-life tensors

We also evaluate the performance of the proposed heuristics on some real-life tensors selected from the FROSTT library [203]. The descriptions of the tensors are given in Table 6.4. For nips and uber, a dimension of size 17 and 24, respectively, is dropped, since they restrict the size of the maximum cardinality matching. As described before, a d-partite, d-uniform hypergraph is obtained from a d-dimensional tensor by keeping a vertex for each dimension index and a hyperedge for each nonzero. Unlike in the previous experiments, the parts of the hypergraphs obtained from the real-life tensors in Table 6.4 do not have an equal number of vertices. In this case, although the scaling algorithm works along the same lines, its output is slightly different. Let ni = |Vi| be the cardinality of the ith dimension and nmax = max_{1≤i≤d} ni be the maximum one. By slightly modifying Sinkhorn-Knopp, for each iteration of Karp-Sipser-H-scaling we scale the tensor such that the marginals in dimension i sum up to nmax/ni instead of one. The results in Table 6.4 resemble those from the previous sections. Karp-Sipser-H-scaling has the best performance and is slightly superior to Karp-Sipser-H-mindegree. Greedy-H and Karp-Sipser-H are close to each other, and, when it is feasible, Local-Search is better than them. We also see that in these instances Bipartite-reduction exhibits a good performance: its performance is at least as good as that of Karp-Sipser-H-scaling for the first three instances, but about 10% worse for the last one.


 Tensor   d   Dimensions                       Greedy-H  Local-Search  KS-H    KS-H-mindegree  KS-H-scaling  Bipartite-Reduction
 Uber     3   183 × 1,140 × 1,717                  183        183        183        183             183            183
 nips     3   2,482 × 2,862 × 14,036             1,847      1,991      1,839      2,005           2,007          2,007
 Nell-2   3   12,092 × 9,184 × 28,818            3,913      4,987      3,935      5,100           5,154          5,175
 Enron    4   6,066 × 5,699 × 244,268 × 1,176      875         -         875        988           1,001            898

Table 6.4: Four real-life tensors and the performance of the proposed heuristics on the corresponding hypergraphs (KS-H abbreviates Karp-Sipser-H). There is no result for Local-Search on Enron, as it is four dimensional. Uber has 1,117,629 nonzeros, nips has 3,101,609 nonzeros, Nell-2 has 76,879,419 nonzeros, and Enron has 54,202,099 nonzeros.

Using an independent set solver

We compare Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree with the idea of reducing the maximum d-dimensional matching problem to that of finding an independent set in the line graph of the given hypergraph. We show that this transformation can obtain good results, but it is restricted because line graphs can require too much space.

We use KaMIS [147] to find independent sets in graphs. KaMIS uses a plethora of reductions and a genetic algorithm in order to return high cardinality independent sets. We use the default settings of KaMIS (where the execution time is limited to 600 seconds) and generate the line graphs with efficient sparse matrix–matrix multiplication routines. We run KaMIS, Greedy-H, Karp-Sipser-H-scaling, and Karp-Sipser-H-mindegree on a few hypergraphs from the previous tests. The results are summarized in Table 6.5. The run time of Greedy-H was less than one second in all instances. KaMIS operates in rounds, and we give the quality and the run time of the first round and of the final output. We note that KaMIS considers the time limit only after the first round has been completed. As can be seen, while the quality of KaMIS is always good, and in most cases superior to that of Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, it is also significantly slower (its principle is to deliver high quality results). We also observe that the pseudo scaling of Karp-Sipser-H-mindegree indeed helps to reduce the run time compared to Karp-Sipser-H-scaling.

The line graphs of the real-life instances from Table 6.4 are too large to be handled. We estimate, using known techniques [50], the number of edges in these graphs to range from 1.5 × 10^10 to 4.7 × 10^13, which translates to 126 GB to 380 TB of storage, if edges are stored twice using 4 bytes per edge.


                            line graph   KaMIS, Round 1   KaMIS, output            Karp-Sipser-H-   Karp-Sipser-H-
                                                                                    scaling          mindegree
 hypergraph                 gen. time    quality   time   quality   time  Greedy-H  quality  time    quality  time
 8-out, n = 1000,  d = 3         10       0.98       80    0.99      600    0.86      0.98      1      0.98     1
 8-out, n = 10000, d = 3        112       0.98      507    0.99      600    0.86      0.98    197      0.98     1
 8-out, n = 1000,  d = 9        298       0.67      798    0.69      802    0.55      0.62      2      0.67     1
 n = 8000, d = 3, H3              1       0.77       16    0.81      602    0.63      0.76      5      0.77     1
 n = 8000, d = 3, H3+M            2       0.89       25    1.00      430    0.70      1.00     11      0.91     1

Table 6.5: Run time (in seconds) and performance comparisons between KaMIS, Greedy-H, and Karp-Sipser-H-scaling. The time required to create the line graphs should be added to KaMIS's overall time.

6.4 Summary, further notes, and references

We have proposed heuristics for the problem of maximum matching in d-partite, d-uniform hypergraphs, by generalizing existing heuristics for the maximum cardinality matching problem in graphs. The experimental analysis on various hypergraphs (both random and corresponding to real-life tensors) shows the effectiveness and efficiency of the proposed heuristics.

This being a recent study, we have much future work. On the theoretical side, we plan to investigate the stated conjecture that d^{d−1}-out random hypergraphs have perfect matchings almost always, and to analyze the theoretical guarantees of the proposed algorithms. On the practical side, we want to test the use of the proposed hypergraph matching heuristics in the coarsening phase of multilevel hypergraph partitioning tools. For this, we need to adapt the proposed heuristics to the general hypergraph case (not necessarily d-partite and d-uniform). Another piece of future work is the investigation of the weighted hypergraph matching problem, and the generalization and use of the proposed heuristics for this problem. Solvers for this problem are computational tools used in the multiple network alignment problem [179], where a matching among the vertices of more than two graphs is computed.

Part III

Ordering matrices and tensors


Chapter 7

Elimination trees and height-reducing orderings

In this chapter, we investigate the minimum height elimination tree problem described in Section 1.1. The formal problem definition is repeated below for convenience.

Minimum height elimination tree. Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The standard elimination tree [200] has been used to expose parallelism in sparse Cholesky, LU, and QR factorizations [5, 8, 116, 160]. Roughly, a set of vertices without ancestor/descendant relations corresponds to a set of independent tasks that can be performed in parallel. Therefore, the total number of parallel steps, or the critical path length, is equal to the height of the tree on an unbounded number of processors [162, 214]. Obtaining an elimination tree with the minimum height for a given matrix is NP-complete [193]. Therefore, heuristic approaches are used. One set of heuristic approaches is to content oneself with the graph partitioning based methods. These methods reduce some other important cost metrics in sparse Cholesky factorization, such as the fill-in and the operation count, while giving a shallow depth elimination tree [103]. When the matrix is unsymmetric, the elimination tree for LU factorization [81] is useful to expose parallelism. In this respect, the height of the tree again corresponds to the number of parallel steps, or the critical path length, for certain factorization schemes. Here, we develop heuristics to reduce the height of elimination trees for unsymmetric matrices.

Consider the directed graph Gd obtained by replacing every edge of a given undirected graph G by two directed edges pointing in opposite directions. Then the cycle-rank (or the minimum height of an elimination tree) of Gd is closely related to the undirected graph parameter tree-depth [180, p. 128], which corresponds to the minimum height of an elimination tree for a symmetric matrix. One can exploit this correspondence in the reverse sense, in an attempt to reduce the height of the elimination tree while producing an ordering for the matrix. That is, one can use the standard undirected graph model corresponding to the matrix A + A^T, where the addition is pattern-wise. This approach of producing an undirected graph for a given directed graph by removing the direction of each edge is used in solvers such as MUMPS [8] and SuperLU [63]. This way, the state of the art graph partitioning based ordering methods can be used. We will compare the proposed heuristics with such methods.

The height of the elimination tree, or the critical path length, is not the only metric for efficient parallel LU factorization of unsymmetric matrices. Depending on the factorization scheme, such as Gaussian elimination with static pivoting (implemented in GESP [157]) or multifrontal based solvers (implemented, for example, in WSMP [104]), other metrics come into play (such as the fill-in, the size of blocking in GESP based solvers, e.g., SuperLU_MT [64], the maximum/minimum size of fronts for parallelism in multifrontal solvers, or the communication). Nonetheless, the elimination tree helps to give an upper bound on the performance of the suitable parallel LU factorization schemes under a fairly standard PRAM model (assuming unit size computations, an infinite number of processors, and no memory/communication or scheduling overhead). Furthermore, our overall approach is based on obtaining a desirable form for LU factorization, called the bordered block triangular form (or BBT form for short) [81]. This form simplifies many algorithmic details of different solvers [81, 82] and is likely to control the fill-in, as the nonzeros are constrained to be in the blocks of a BBT form. If one orders a matrix without a BBT form and then obtains this form, the fill-in can increase or decrease [83]. Hence, our BBT based approach is likely to be helpful for LU factorization schemes exploiting the BBT form, as well as in controlling the fill-in.

7.1 Background

We define Gs = (V, Es) as the undirected graph associated with the directed graph G, such that Es = {{vi, vj} : (vi, vj) ∈ E}. A directed graph G is called complete if (u, v) ∈ E for all vertex pairs (u, v).

Let T = (V, r, ρ) be a rooted tree. An ordering σ : V ↔ {1, ..., n} is a bijective function. An ordering σ is called topological if σ(ρ(vj)) > σ(vj) for all vj ∈ V. Finally, a postordering is a special form of topological ordering, in which the vertices of each subtree are ordered consecutively.

Let A be an n × n unsymmetric matrix with m off-diagonal nonzero entries. The standard directed graph G(A) = (V, E) of A consists of a


[Figure 7.1 contains three panels: (a) the matrix A, (b) the directed graph G(A), and (c) the elimination tree T(A).]

Figure 7.1: A sample 9 × 9 unsymmetric matrix A, the corresponding directed graph G(A), and the elimination tree T(A).

vertex set V = {1, ..., n} and an edge set E = {(i, j) : aij ≠ 0} of m edges. We adopt GA instead of G(A) when a subgraph notion is required.

An elimination digraph Gk(A) = (Vk, Ek) is defined for 0 ≤ k < n, with the vertex set Vk = {k + 1, ..., n} and the edge set

    Ek = E                                               for k = 0,
    Ek = Ek−1 ∪ {(i, j) : (i, k), (k, j) ∈ Ek−1}          for k > 0.

We also define V̄k = {1, ..., k} for each 0 ≤ k < n.

We assume that A is irreducible and that the LU factorization A = LU exists, where L and U are unit lower and upper triangular, respectively. Since the factors L and U are triangular, their standard directed graph models G(L) and G(U) are acyclic. Eisenstat and Liu [81] define the elimination tree T(A) as follows. Let

    Parent(i) = min{ j : j > i and j ⟹_{G(L)} i ⟹_{G(U)} j },

where a ⟹_G b means that there is a path from a to b in the graph G; then Parent(i) corresponds to the parent of vertex i in T(A) for i < n, and for the root n we put Parent(n) = ∞.

We now illustrate some of the definitions. Figure 7.1a displays a sample matrix A with 9 rows/columns and 26 nonzeros. Figure 7.1b shows the standard directed graph representation G(A) of this matrix. Upon elimination of vertex 1, the three edges (3, 2), (6, 2), and (7, 2) are added, and vertex 1 is removed, to build the elimination digraph G1(A). On the right side, Figure 7.1c shows the elimination tree T(A), whose height is h(T(A)) = 5. Here, the parent vertex of 1 is 3, as GA[{1, 2}] is not strongly connected but GA[{1, 2, 3}] is (thanks to the cycle 1 → 2 → 3 → 1). The cross edges of G(A) with respect to the elimination tree T(A) are represented by dashed directed arrows in Figure 7.1c. The initial order is topological but not postordered. However, 8, 1, 2, 3, 5, 4, 6, 7, 9 is a postordering of T(A), and it does not respect the cross edges.

7.2 The critical path length and the elimination tree height

We summarize the rowwise and columnwise LU factorization schemes, also known as the ikj and jik variants [70], here referred to as row-LU and column-LU, respectively. Then, we show that the critical path lengths for the parallel row-LU and column-LU factorization schemes are equivalent to the elimination tree height whenever the matrix has a certain form.

Let A be an n × n irreducible unsymmetric matrix and A = LU, where L and U are unit lower and upper triangular, respectively. Algorithms 13 and 14 display the row-LU and column-LU factorization schemes, respectively. Both schemes update the given matrix A, so that the factor U = [u_ij]_{i≤j} is formed by the upper triangular (including the diagonal) entries of the matrix A at the end. The factor L = [ℓ_ij]_{i≥j} is formed by the multipliers (ℓ_ik = a_ik/a_kk for i > k) and a unit diagonal. We assume a coarse-grain parallelization approach [122, 160]. In Algorithm 13, task i is defined as the task of computing row i of L and U, and is denoted as Trow(i). Similarly, in Algorithm 14, task j is defined as the task of computing column j of both the L and U factors, and is denoted as Tcol(j). The dependencies between the tasks can be represented with the directed acyclic graphs of L^T and U, for the row-LU and column-LU factorization of A, respectively.

Algorithm 13: Row-LU(A)

1: for i = 1 to n do
2:    Trow(i):
3:    for each k < i with aik ≠ 0 do
4:       for each j > k s.t. akj ≠ 0 do
5:          aij ← aij − akj × (aik / akk)

Algorithm 14: Column-LU(A)

1: for j = 1 to n do
2:    Tcol(j):
3:    for each k < j s.t. akj ≠ 0 do
4:       for each i > k s.t. aik ≠ 0 do
5:          aij ← aij − aik × (akj / akk)
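For concreteness, a small dense sketch (ours) of the row-LU scheme of Algorithm 13 follows; sparse implementations would operate on the nonzero pattern instead, but the update order is the same.

import numpy as np

def row_lu(A):
    # Overwrites a copy of A so that the upper triangle (with the diagonal)
    # holds U and the strict lower triangle holds the multipliers of L
    # (L's unit diagonal is implicit).  No pivoting; the factorization is
    # assumed to exist.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for i in range(n):                  # task Trow(i): compute row i of L and U
        for k in range(i):
            if A[i, k] != 0.0:
                A[i, k] /= A[k, k]      # multiplier l_ik = a_ik / a_kk
                A[i, k + 1:] -= A[i, k] * A[k, k + 1:]
    return A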


[Figure 7.2 contains three panels: (a) the filled matrix L + U, (b) the task dependency graph G(L^T), and (c) the task dependency graph G(U).]

Figure 7.2: The filled matrix L + U (a), and the task dependency graphs G(L^T) for row-LU (b) and G(U) for column-LU (c).

Figure 7.2 illustrates the mentioned dependencies. Figure 7.2a shows the matrix L + U corresponding to the factors of the sample matrix A given in Figure 7.1a. In this figure, the blue hexagons and green circles are the fill-ins in L and U, respectively. Subsequently, Figures 7.2b and 7.2c show the task dependency graphs for the row-LU and column-LU factorizations, respectively. The blue dashed directed edges in Figure 7.2b and the green dashed directed edges in Figure 7.2c correspond to the fill-ins. All edges in Figures 7.2b and 7.2c show dependencies. For example, in the column-LU, the second column depends on the first one, as the values in the first column of L are needed while computing the second column. This is shown in Figure 7.2c with a directed edge from the first vertex to the second one.

We now consider postorderings of the elimination tree T(A). Let T[k] denote the set of vertices in the subtree of T(A) rooted at k. For a postordering, the order of the siblings is not specified in the general sense. A postordering is called upper bordered block triangular (BBT) if the topological order among the sibling vertices is respected, that is, a vertex k is numbered before a sibling vertex ℓ if there is an edge in G(A) that is directed from T[k] to T[ℓ]. Similarly, we call a postordering a lower BBT if the sibling vertices are ordered in reverse topological order. As an example, the initial order of the matrix given in Figure 7.1 is neither upper nor lower BBT; however, the order 8, 6, 1, 2, 3, 5, 4, 7, 9 would be an upper BBT ordering (see Figure 7.1c). The following theorem shows the relation between a BBT postordering and the task dependencies in LU factorizations.

Theorem 7.1 ([81]). Assuming the edges are directed from child to parent, T(A) is the transitive reduction of G(L^T) and of G(U) when A is upper and lower BBT ordered, respectively.

We follow this theorem with a corollary which provides the motivation for reducing the elimination tree height.

Corollary 7.1. The critical path length of the task dependency graph is


[Figure 7.3 contains two panels: (a) the filled matrix L̄ + Ū, and (b) the task dependency graph G(L̄^T).]

Figure 7.3: The filled matrix of the matrix Ā = L̄Ū in BBT form (a), and the task dependency graph G(L̄^T) for row-LU (b).

equal to the elimination tree height h(T(A)) for row-LU factorization when A is upper BBT ordered, and for column-LU factorization when A is lower BBT ordered.

Figure 7.3 demonstrates the discussed points on a matrix Ā, which is the permuted A of Figure 7.1a with the upper BBT postordering 8, 6, 1, 2, 3, 5, 4, 7, 9. Figures 7.3a and 7.3b show the filled matrix L̄ + Ū such that Ā = L̄Ū, and the corresponding task dependency graph G(L̄^T) for row-LU factorization, respectively. As seen in Figure 7.3b, there are only tree (solid) and back (dashed) edges with respect to the elimination tree T(Ā), due to the fact that T(Ā) is the transitive reduction of G(L̄^T) (see Theorem 7.1). In the figure, the blue edges refer to fill-in nonzeros. As seen in the figure, there is a path from each vertex to the root, and a critical path 1 → 3 → 5 → 7 → 9 has length 5, which coincides with the height of T(Ā).
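As an aside, the height used here (the number of vertices on a longest leaf-to-root path) is easily computed from the Parent pointers of Section 7.1; the parent-array representation in the following sketch (ours) is an assumption made for the example.

def tree_height(parent):
    # parent[v] is the parent of v in the elimination tree, None for the root.
    n = len(parent)
    depth = [0] * n

    def depth_of(v):
        if depth[v] == 0:
            depth[v] = 1 if parent[v] is None else 1 + depth_of(parent[v])
        return depth[v]

    return max(depth_of(v) for v in range(n))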

7.3 Reducing the elimination tree height

We propose a top-down approach to reorder a given matrix so as to obtain a short elimination tree. The main lines of the proposed approach form a generalization of the state of the art methods used in Cholesky factorization (which are based on nested dissection [94]). We give the algorithms and discussions for the upper BBT form; they can be easily adapted for the lower BBT form. The proofs of Proposition 7.1 and Theorems 7.2–7.4 are omitted from the presentation and can be found in the original paper [136].


7.3.1 Theoretical basis

Let A be an n × n sparse unsymmetric irreducible matrix and T(A) be its elimination tree. In this case, a BBT decomposition of A can be represented as a permutation of A with a permutation matrix P such that

    ABBT = [ A11  A12  ...  A1K  A1B ]
           [      A22  ...  A2K  A2B ]
           [           ...          ]            (7.1)
           [                AKK  AKB ]
           [ AB1  AB2  ...  ABK  ABB ]

where ABBT = PAP^T, the number of diagonal blocks K > 1, and Akk is irreducible for each 1 ≤ k ≤ K. The border AB∗ is called minimal if there is no permutation matrix P′ ≠ P such that P′AP′^T is a BBT decomposition and A′B∗ ⊂ AB∗. That is, the border AB∗ is minimal if we cannot remove columns from A∗B and the corresponding rows from AB∗, permute them together to a diagonal block, and still have a BBT form. We give the following proposition, aimed to be used for Theorem 7.2.

Proposition 7.1. Let A be in upper BBT form and the border AB∗ be minimal. The elimination digraph Gκ(A) is complete, where κ = Σ_{k=1}^{K} |Akk|.

The following theorem provides a basis for reducing the elimination tree height h(T(A)).

Theorem 7.2. Let A be in upper BBT form and the border AB∗ be minimal. Then

    h(T(A)) = |AB∗| + max_{1≤k≤K} h(T(Akk)).

The theorem says that the critical path length in the parallel LU factorization methods summarized in Section 7.2 can be expressed recursively in terms of the border size and the maximum critical path length over the diagonal blocks. This holds for the row-LU scheme when the matrix A is in an upper BBT form, and for the column-LU scheme when A is in a lower BBT form. Hence, having a small border size and diagonal blocks of similar sizes is likely to lead to a reduced critical path length.

7.3.2 Recursive approach: Permute

We propose a recursive approach to reorder a given matrix so that the height of the elimination tree is reduced. Algorithm 15 gives an overview of the solution framework. The main procedure, Permute, takes an irreducible unsymmetric matrix A as its input. It calls PermuteBBT to decompose A into a BBT form ABBT. Upon obtaining such a decomposition, the main procedure calls itself on each diagonal block Akk recursively. Whenever the matrix becomes sufficiently small (its size gets smaller than τ), the procedure PermuteBT is called in order to compute a fine BBT decomposition in which the diagonal blocks are of unit size, here referred to as a bordered triangular (BT) decomposition, and the recursion terminates.

Algorithm 15: Permute(A)

1: if |A| < τ then                   ▷ Recursion terminates
2:    PermuteBT(A)
3: else
4:    ABBT ← PermuteBBT(A)           ▷ As given in (7.1)
5:    for k ← 1 to K do
6:       Permute(Akk)                ▷ Subblocks of ABBT

7.3.3 BBT decomposition: PermuteBBT

Permuting A into a BBT form translates into finding a strong (vertex) separator of G(A), where the strong separator itself corresponds to the border AB∗. Regarding Theorem 7.2, we are particularly interested in finding minimal borders.

Let G = (V, E) be a strongly connected directed graph. A strong separator (also called a directed vertex separator [138]) is defined as a vertex set S ⊂ V such that G − S is not strongly connected. Moreover, S is said to be minimal if G − S′ is strongly connected for any proper subset S′ ⊂ S.

The algorithm that we propose to find a strong separator is based on the bipartitioning of directed graphs. In the case of undirected graphs, there are typically two kinds of graph partitioning methods, which differ by their separators: edge separators and vertex separators. An edge (or vertex) separator is a set of edges (or vertices) whose removal leaves the graph disconnected.

A BBT decomposition of a matrix may have three or more subblocks (as seen in (7.1)). However, the algorithms we use to find a strong separator utilize 2-way partitioning, which in turn constructs two parts, each of which may potentially have several components. For the sake of simplicity in the presentation, we assume that both parts are strongly connected.

We introduce some notation in order to ease the discussion. For a pair of vertex subsets (U, W) of a directed graph G, we write U ↦ W if there may be some edges from U to W but there is no edge from W to U. Besides, U ↔ W and U ↮ W are used in a more natural way; that is, U ↔ W implies that there may be edges in both directions, whereas U ↮ W refers to the absence of any edge between U and W.

We cast our problem of finding strong separators as follows. For a given directed graph G, find a vertex partition Π* = {V1, V2, S}, where S refers to a strong separator and V1, V2 refer to vertex parts, so that S has small


[Figure 7.4: A sample 2-way partition Π_ES = {V1, V2} by edge separator (a); bipartite graphs and covers for the directed cuts E_{2→1} and E_{1→2} (b); and the matrix view of Π_ES (c).]

cardinality, the part sizes are close to each other, and either V1 → V2 or V2 → V1.

7.3.4 Edge-separator-based method

The first step of the method is to obtain a 2-way partition Π_ES = {V1, V2} of G by edge separator. We build two strong separators upon Π_ES and choose the one with the smaller cardinality.

For an edge separator Π_ES = {V1, V2}, the directed cut E_{i→j} is defined as the set of edges directed from Vi to Vj, i.e., E_{i→j} = {(u, v) ∈ E : u ∈ Vi, v ∈ Vj}, and we say that a vertex set S covers E_{i→j} if for any edge (u, v) ∈ E_{i→j}, either u ∈ S or v ∈ S, for i ≠ j ∈ {1, 2}.

Initially, we have V1 ↔ V2. Our goal is to find a subset of vertices S ⊂ V so that either V1 → V2 or V2 → V1, where V1 and V2 now denote the sets of remaining vertices in the two parts after the removal of the vertices in S, that is, V1 = V1 − S and V2 = V2 − S. The following theorem serves the purpose of finding such a subset S.

Theorem 7.3. Let G = (V, E) be a directed graph, Π_ES = {V1, V2} be an edge separator of G, and S ⊆ V. Now Vi → Vj if and only if S covers E_{j→i}, for i ≠ j ∈ {1, 2}.

The problem of finding a strong separator S can be encoded as finding a minimum vertex cover in the bipartite graph whose edge set is populated only by the edges of E_{j→i}. Algorithm 16 gives the pseudocode. As seen in the algorithm, we find two separators, S_{1→2} and S_{2→1}, which cover E_{2→1} and E_{1→2} and result in V1 → V2 and V2 → V1, respectively. At the end, we take the strong separator with the smaller cardinality.
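To make the encoding concrete, the following sketch (in Python, using the networkx library; the function name and the data layout are ours and not part of the thesis) builds the bipartite graph spanned by a directed cut and extracts a minimum vertex cover from a maximum matching via König's theorem, which is one standard way to realize what FindMinCover is assumed to do in Algorithm 16.

```python
import networkx as nx
from networkx.algorithms import bipartite

def cover_directed_cut(cut_edges):
    """Minimum vertex cover of the bipartite graph spanned by a directed cut.

    cut_edges: list of (u, v) pairs with u and v in different parts of the
    edge separator.  By Koenig's theorem, a maximum matching of this bipartite
    graph yields a minimum vertex cover, i.e., a smallest vertex set touching
    every cut edge."""
    B = nx.Graph()
    tails = {u for u, _ in cut_edges}
    heads = {v for _, v in cut_edges}
    B.add_nodes_from((("t", u) for u in tails), bipartite=0)
    B.add_nodes_from((("h", v) for v in heads), bipartite=1)
    B.add_edges_from((("t", u), ("h", v)) for u, v in cut_edges)
    top = {node for node in B if node[0] == "t"}
    matching = bipartite.maximum_matching(B, top_nodes=top)
    cover = bipartite.to_vertex_cover(B, matching, top_nodes=top)
    return {name for _, name in cover}   # original vertex identifiers

# A size-2 cover of E_{1->2} = {(6,1), (6,3), (7,1)} from Figure 7.4; the
# cover {1, 6} shown in the figure has the same cardinality.
print(cover_directed_cut([(6, 1), (6, 3), (7, 1)]))
```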

In this method, how the edge separator is found matters. The objective for Π_ES is to minimize the cutsize Ψ(Π_ES), which is typically defined over


Algorithm 16: FindStrongVertexSeparatorES(G)

1: Π_ES ← GraphBipartiteES(G)             ▷ Π_ES = {V1, V2}
2: Π_{1→2} ← FindMinCover(G, E_{2→1})
3: Π_{2→1} ← FindMinCover(G, E_{1→2})
4: if |S_{1→2}| < |S_{2→1}| then
5:   return Π_{1→2} = {V1, V2, S_{1→2}}   ▷ V1 → V2
6: else
7:   return Π_{2→1} = {V1, V2, S_{2→1}}   ▷ V2 → V1

the cut edges as follows:

    Ψ_es(Π_ES) = |E_c| = |{(u, v) ∈ E : π(u) ≠ π(v)}|,    (7.2)

where u ∈ V_π(u) and v ∈ V_π(v), and E_c refers to the set of cut edges. We note that the use of the above metric in Algorithm 16 is a generalization of a method used for symmetric matrices [161]. However, Ψ_es(Π_ES) is only loosely related to the size of the cover found in order to build the strong separator. A more suitable metric would be the number of vertices from which a cut edge emanates, or the number of vertices to which a cut edge is directed. Such a cutsize metric can be formalized as follows:

    Ψ_cn(Π_ES) = |V_c| = |{v ∈ V : ∃(u, v) ∈ E_c}|.    (7.3)

The following theorem shows the close relation between Ψ_cn(Π_ES) and the size of the strong separator S.

Theorem 7.4. Let G = (V, E) be a directed graph, Π_ES = {V1, V2} be an edge separator of G, and Π* = {V1, V2, S} be the strong separator built upon Π_ES. Then |S| ≤ Ψ_cn(Π_ES)/2.

Apart from the theorem, we also note that optimizing (7.3) is likely to yield bipartite graphs that are easier to cover than those resulting from optimizing (7.2). This is because all edges emanating from a single vertex or from many different vertices are considered the same with (7.2), whereas those emanating from a single vertex are preferred in (7.3).

We note that the cutsize metric given in (7.3) can be modeled using hypergraphs. The hypergraph that models this cutsize metric (called the column-net hypergraph model [39]) has the same vertex set as the directed graph G = (V, E), and a hyperedge for each vertex v ∈ V that connects all u ∈ V such that (u, v) ∈ E. Any partition of the vertices of this hypergraph that minimizes the number of hyperedges in the cut induces a vertex partition on G with an edge separator that minimizes the cutsize according to the metric (7.3). A similar statement holds for another hypergraph constructed by reversing the edge directions (called the row-net model [39]).
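For illustration, a minimal sketch of this construction (in Python; the function name and data layout are ours) builds the column-net hypergraph of a directed graph: one net per vertex v, containing every vertex u with an edge (u, v).

```python
def column_net_hypergraph(vertices, edges):
    """Column-net hypergraph of a directed graph G = (V, E).

    Returns a dict mapping each vertex v to its net, i.e., the set of vertices
    u with (u, v) in E.  Partitioning the vertices while minimizing the number
    of cut nets minimizes the cutsize metric (7.3)."""
    nets = {v: set() for v in vertices}
    for u, v in edges:
        nets[v].add(u)
    return nets

# Tiny example: with edges (6,1), (7,1), and (6,3), the net of vertex 1 is {6, 7}.
nets = column_net_hypergraph(range(1, 8), [(6, 1), (7, 1), (6, 3)])
```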


[Figure 7.5: Π_{2→1}, the strong separator built upon S_{2→1}, the cover of E_{1→2} given in Figure 7.4b (a); the matrix A_BBT obtained by Π_{2→1} (b); and the elimination tree T(A_BBT) (c).]

Figures 7.4 and 7.5 give an example of finding a strong separator based on an edge separator. In Figure 7.4a, the red curve marks the edge separator Π_ES = {V1, V2} such that V1 = {4, 6, 7, 8, 9}, the green vertices, and V2 = {1, 2, 3, 5}, the blue vertices. The cut edges of E_{2→1} = {(2, 7), (5, 4), (5, 9)} and E_{1→2} = {(6, 1), (6, 3), (7, 1)} are shown in blue and green, respectively. We see two covers for the bipartite graphs corresponding to each of the directed cuts E_{2→1} and E_{1→2} in Figure 7.4b. As seen in Figure 7.4c, these directed cuts correspond to nonzeros of the off-diagonal blocks in the matrix view. Figure 7.5a gives the strong separator Π_{2→1} built upon Π_ES by the cover S_{2→1} = {1, 6} of the bipartite graph corresponding to E_{1→2}, which is given on the right of Figure 7.4b. Figures 7.5b and 7.5c depict the matrix view of the strong separator Π_{2→1} and the corresponding elimination tree, respectively. It is worth mentioning that in the matrix view, V2 precedes V1, since we use Π_{2→1} instead of Π_{1→2}. As seen in Figure 7.5c, since the permuted matrix (in Figure 7.5b) has an upper BBT postordering, all cross edges of the elimination tree are headed from left to right.

7.3.5 Vertex-separator-based method

A vertex separator is a strong separator restricted so that V1 ↮ V2. This method is based on refining the separator so that either V1 → V2 or V2 → V1. The vertex separator can be viewed as a 3-way vertex partition Π_VS = {V1, V2, S}. As in the previous method based on edge separators, we build two strong separators upon Π_VS and select the one having the smaller strong separator.

The criterion used to find an initial vertex separator can be formalized as

    Ψ_vs(Π_VS) = |S|.    (7.4)

Algorithm 17 gives the overall process to obtain a strong separator. This algorithm follows Algorithm 16 closely. Here, the procedure that refines the initial vertex separator, namely RefineSeparator, works as follows. It visits the separator vertices once and moves the vertex at hand, if possible, to V1 or V2 so that Vi → Vj after the movement, where i → j is specified as either 1 → 2 or 2 → 1.

Algorithm 17: FindStrongVertexSeparatorVS(G)

1: Π_VS ← GraphBipartiteVS(G)             ▷ Π_VS = {V1, V2, S}
2: Π_{1→2} ← RefineSeparator(G, Π_VS, 1→2)
3: Π_{2→1} ← RefineSeparator(G, Π_VS, 2→1)
4: if |S_{1→2}| < |S_{2→1}| then
5:   return Π_{1→2} = {V1, V2, S_{1→2}}   ▷ V1 → V2
6: else
7:   return Π_{2→1} = {V1, V2, S_{2→1}}   ▷ V2 → V1

7.3.6 BT decomposition: PermuteBT

The problem of permuting A into a BT form can be cast as finding a feedback vertex set, which is defined as a subset of vertices whose removal makes a directed graph acyclic. In the matrix view, a feedback vertex set corresponds to the border A_B* when the matrix is permuted into a BBT form (7.1) in which each A_ii, for i = 1, ..., K, is of unit size.

The problem of finding a feedback vertex set of size less than a given value is NP-complete [126], but fixed-parameter tractable [44]. In the literature, there are greedy algorithms that perform well in practice [152, 184]. In this work, we adopt the algorithm proposed by Levy and Low [152].
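For illustration only, the following sketch (Python with networkx) is a generic greedy feedback vertex set heuristic, not the Levy-Low algorithm adopted here: it repeatedly deletes a highest-degree vertex lying on some remaining cycle until the graph becomes acyclic.

```python
import networkx as nx

def greedy_feedback_vertex_set(G):
    """A simple greedy feedback vertex set heuristic (illustrative only; the
    thesis adopts the Levy-Low algorithm [152] instead).  Returns a list of
    vertices whose removal leaves the directed graph acyclic."""
    H = G.copy()
    fvs = []
    while not nx.is_directed_acyclic_graph(H):
        cycle_edges = nx.find_cycle(H)                       # one remaining cycle
        on_cycle = {u for u, v in cycle_edges} | {v for u, v in cycle_edges}
        # remove the cycle vertex with the largest total degree
        v = max(on_cycle, key=lambda x: H.in_degree(x) + H.out_degree(x))
        fvs.append(v)
        H.remove_node(v)
    return fvs

# Example: a 3-cycle plus a chord; removing a single vertex suffices.
G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (2, 1)])
print(greedy_feedback_vertex_set(G))
```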

For example, in Figure 7.5, we observe that the sets {5} and {8} are used as the feedback vertex sets for V1 and V2, respectively, which is the reason those rows/columns are permuted last in their corresponding subblocks.

7.4 Experiments

We investigate the performance of the proposed heuristic, here referred to as BBT, in terms of the elimination tree height and ordering time. We have three variants of BBT: (i) BBT-es, based on minimizing the edge cut (7.2); (ii) BBT-cn, based on minimizing the hyperedge cut (7.3); (iii) BBT-vs, based on minimizing the vertex separator (7.4). As a baseline, we used MeTiS [130] on the symmetrized matrix A + A^T. MeTiS uses state-of-the-art methods to order a symmetric matrix and leads to short elimination trees; its use in the unsymmetric case on A + A^T is a common practice. BBT is implemented in C and compiled with mex of MATLAB, which uses gcc 4.4.5. We used the MeTiS library for graph partitioning according to the cutsize metrics (7.2) and (7.4). For the edge cut metric (7.2), we input to MeTiS the graph of A + A^T, where the weight of the edge {i, j} is 2 (both a_ij, a_ji ≠ 0) or 1 (either a_ij ≠ 0 or a_ji ≠ 0, but not both). For the vertex separator metric (7.4), we input the graph of A + A^T (no weights) to the corresponding MeTiS procedure. We implemented the cutsize metric given in (7.3) using PaToH [40] (in the default setting) for hypergraph partitioning with the column-net model (we did not investigate the use of the row-net model). We performed the experiments on an AMD Opteron Processor 8356 with 32 GB of RAM.

In the experiments, for each matrix, we detected dense rows and columns, where a row/column is deemed to be dense if it has at least 10√n nonzeros. We applied MeTiS and BBT on the submatrix of non-dense rows/columns and produced the final ordering by appending the dense rows/columns after the permutations obtained for the submatrices. The performance results are computed as the median of eleven runs.

Recall from Section 7.3.2 that whenever the size of the input matrix is smaller than the parameter τ, the recursion terminates in the BBT variants and a feedback-vertex-set-based BT decomposition is applied. We first investigate the effect of τ in order to suggest a concrete approach. For this purpose, we build a dataset of matrices, called UFL below. The matrices in UFL are from the SuiteSparse Matrix Collection [59] and chosen with the following properties: square, 5K ≤ n ≤ 100K, nnz ≤ 1M, real, and not of type Graph. These criteria enable an automatic selection of matrices from different application domains without having to specify the matrices individually. Other criteria could be used; ours help to select a large set of matrices, each of which is not too small and not too large to be handled easily within MATLAB. We exclude matrices recorded as Graph in the SuiteSparse collection, because most of these matrices have nonzeros from a small set of integers (for example, {−1, 1}) and are reducible. We further detect matrices with the same nonzero pattern in the UFL dataset and keep only one matrix per unique nonzero pattern. We then preprocess the matrices by applying MC64 [72] in order to permute the columns so that the permuted matrix has a maximum diagonal product. Then, we identify the irreducible blocks using the Dulmage-Mendelsohn decomposition [77], permute the matrices to block diagonal form, and delete all entries that lie in the off-diagonal blocks. This type of preprocessing is common in similar studies [83] and corresponds to preprocessing for numerics and reducibility. After this preprocessing, we discard the matrices whose total (remaining) number of nonzeros is less than twice its size, or whose pattern symmetry is greater than 90%, where the pattern symmetry score is defined as (nnz(A) − n)/(nnz(A + A^T) − n) − 0.5. We ended up with 99 matrices in the UFL dataset.

In Table 7.1, we summarize the performance of the BBT variants, normalized to that of MeTiS, on the UFL dataset, for the recursion termination parameter τ ∈ {3, 50, 250}. The table presents the geometric means of the performance results over all matrices of the dataset.


         Tree Height (h(A))        Ordering Time
  τ      -es     -cn     -vs       -es     -cn     -vs
  3      0.81    0.69    0.68      1.68    3.53    2.58
  50     0.84    0.71    0.72      1.26    2.71    1.91
  250    0.93    0.82    0.86      1.12    2.12    1.43

Table 7.1: Statistical indicators of the performance of BBT variants on the UFL dataset, normalized to that of MeTiS, with recursion termination parameter τ ∈ {3, 50, 250}.

[Figure 7.6: BBT-vs / MeTiS performance on the UFL dataset; (a) normalized tree height h(A) and (b) normalized ordering time, each plotted against pattern symmetry. Dashed green lines mark the geometric mean of the ratios.]

As seen in the table, for smaller values of τ, all BBT variants achieve shorter elimination trees at a cost of increased processing time to order the matrix. This is because the local ordering heuristic is fast but probably not of very high quality; as discussed before, it does not directly consider the height of the subtree corresponding to the submatrix as an optimization goal. Other τ values could also be tried, but we find τ = 50 to be a suitable choice in general, as it performs close to τ = 3 in quality and close to τ = 250 in efficiency.

From Table 7.1, we observe that the BBT-vs method performs in the middle of the other two BBT variants, but close to BBT-cn in terms of the tree height. All three variants improve the height of the tree with respect to MeTiS. Specifically, at τ = 50, BBT-es, BBT-cn, and BBT-vs obtain improvements of 16%, 29%, and 28%, respectively. The graph-based methods BBT-es and BBT-vs run slower than MeTiS, as they contain the additional overheads of obtaining the covers and refining the separators of A + A^T for A. For example, at τ = 50, the overheads result in 26% and 91% longer ordering times for BBT-es and BBT-vs, respectively. As expected, the hypergraph-based method BBT-cn is much slower than the others (for example, 171% longer ordering time with respect to MeTiS with τ = 50). Since we identify BBT-vs as fast and of high quality (it is close to the best BBT variant in terms of the tree height and the fastest one in terms of the run time), we display its individual performance results in detail, with τ = 50, in Figure 7.6. In Figures 7.6a and 7.6b, each individual point represents the ratio BBT-vs / MeTiS in terms of tree height and ordering time, respectively. As seen in Figure 7.6a, BBT-vs reduces the elimination tree height on 85 out of 99 matrices in the UFL dataset and achieves a reduction of 28% in terms of the geometric mean with respect to MeTiS (the green dashed line in the figure represents the geometric mean). We observe that the pattern symmetry and the reduction in the elimination tree height highly correlate after a certain pattern symmetry score. More precisely, the correlation coefficient between the pattern symmetry and the normalized tree height is 0.6187 for those matrices with pattern symmetry greater than 20%. This is of course in agreement with the fact that LU factorization methods exploiting the unsymmetric pattern have much to gain on matrices with a lower pattern symmetry score, and less to gain otherwise, compared to the alternatives. On the other hand, as seen in Figure 7.6b, BBT-vs performs consistently slower than MeTiS, with an average of 91% slowdown.

The fill-in generally increases [83] when a matrix is reordered into a BBT form, with respect to the original ordering using the elimination tree. We noticed that BBT-vs almost always increases the number of nonzeros in L (by 157% with respect to MeTiS in a set of matrices). However, the number of nonzeros in U almost always decreases (by 15% with respect to MeTiS in the same dataset). This observation may be exploited during Gaussian elimination, where L is not stored.

7.5 Summary, further notes, and references

We investigated the elimination tree model for unsymmetric matrices as a means of capturing dependencies among the tasks in row-LU and column-LU factorization schemes. Specifically, we focused on permuting a given unsymmetric matrix to obtain an elimination tree with reduced height, in order to expose a higher degree of parallelism by minimizing the critical path length in parallel LU factorization schemes. Based on the theoretical findings, we proposed a heuristic which orders a given matrix to a bordered block diagonal form with a small border size and blocks of similar sizes, and then locally orders each block. We presented three variants of the proposed heuristic. These three variants achieved, on average, 24%, 31%, and 35% improvements with respect to a common method of using MeTiS (a state-of-the-art tool used for the Cholesky factorization) on a small set of matrices. On a larger set of matrices, the performance improvements were 16%, 28%, and 29%. The best performing variant is about 2.71 times slower than MeTiS on the larger set, while the others are slower by factors of 1.26 and 1.91.


Although numerical methods implementing the LU factorization with the help of the standard elimination tree are well developed, those implementing the LU factorization using the elimination tree discussed in this chapter are lacking. A potential implementation will have to handle many scheduling issues, and therefore the use of a runtime system seems adequate. On the combinatorial side, we believe that there is a need for a direct method to find strong separators, which would give better performance in terms of quality and run time.

Chapter 8

Sparse tensor ordering

In this chapter, we investigate the tensor ordering problem described in Section 1.5. The formal problem definition is repeated below for convenience.

Tensor ordering: Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

We propose two ordering algorithms for this problem. The first proposed heuristic, BFS-MCS, is a breadth first search (BFS)-like approach based on the maximum cardinality search family, and works on the hypergraph representation of the given tensor. The second proposed heuristic, Lexi-Order, is an extension of the doubly lexical ordering of matrices to tensors. We show the effects of these schemes on the MTTKRP performance when the tensors are stored in three existing sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). The associated paper [156] has further contributions with respect to parallelization, and includes results on parallel implementations of MTTKRP with the three formats, as well as runs with Candecomp/Parafac decomposition algorithms.

8.1 Three sparse tensor storage formats

We consider the three state-of-the-art tensor formats (COO, CSF, HiCOO), which are all for general unstructured sparse tensors.

Coordinate format (COO)

The coordinate (COO) format is the simplest, yet arguably the most popular format by far. It stores each nonzero value along with all of its position indices, as shown in Figure 8.1a; i, j, k are the indices (inds) of the nonzeros stored in the val array.


[Figure 8.1: Three common formats for sparse tensors: (a) COO, (b) CSF, (c) HiCOO (the example is originally from [155]).]

Compressed sparse fibers (CSF)

Compressed Sparse Fiber (CSF) is a hierarchical, fiber-centric format that effectively generalizes the CSR matrix format to tensors. An example of its representation appears in Figure 8.1b. Conceptually, CSF organizes the nonzero indices into a tree: each level corresponds to a tensor mode, and each nonzero is a path from the root to a leaf. The indices of CSF are stored in cinds, with the pointers stored in cptrs to indicate the locations of the nonzeros at the next level. Since cptrs needs to show the range up to nnz, we use β_int or β_long accordingly for differently sized tensors.

Hierarchical coordinate format (HiCOO)

Hierarchical Coordinate (HiCOO) [155] format derives from the COO format, but improves upon it by compressing the indices in units of sparse tensor blocks. HiCOO stores a sparse tensor in a sparse-blocked pattern with a pre-specified block size B, meaning in B × ··· × B blocks. It represents every block by compactly storing its nonzero triples using fewer bits. Figure 8.1c shows the example tensor given 2 × 2 × 2 blocks (B = 2). For a third-order tensor, bi, bj, bk are block indices indexing tensor blocks; ei, ej, ek are element indices indexing nonzeros within a tensor block. A bptr array stores the pointers to each block's beginning location, and val saves all the nonzero values. The block ratio α_b of a HiCOO representation of a tensor is defined as the number of blocks divided by the number of nonzeros in the tensor; c_b denotes the average slice size per tensor block. These two key parameters describe the effectiveness of storing a tensor in the HiCOO format.
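As a small illustration (a Python/NumPy sketch; the function is ours, not ParTI's API), the block ratio α_b can be computed directly from the COO index array and a block size B:

```python
import numpy as np

def block_ratio(coords, B=128):
    """Block ratio alpha_b of a HiCOO-style blocking: the number of nonempty
    B x ... x B blocks divided by the number of nonzeros.

    coords: (nnz, N) integer array holding the COO indices of the tensor."""
    block_ids = coords // B                          # block index of each nonzero
    n_blocks = len(np.unique(block_ids, axis=0))     # distinct nonempty blocks
    return n_blocks / coords.shape[0]

# Example: the 8-nonzero tensor of Figure 8.1 with B = 2 has 4 nonempty blocks,
# giving alpha_b = 0.5.
coords = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 2],
                   [2, 1, 0], [2, 2, 2], [3, 0, 1], [3, 3, 2]])
print(block_ratio(coords, B=2))
```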

Comparing the three tensor formats: HiCOO and COO treat every mode equally and do not assume any mode order; these preserve the mode-generic


[Figure 8.2: A sparse tensor and the associated hypergraph.]

orientation [155]. CSF has a strong mode-specificity, since the tree structure implies a mode ordering for the enumeration of nonzeros. Besides, compared to the COO format, HiCOO and CSF save storage space and memory footprint, and generally achieve higher performance.

8.2 Problem definition and solutions

Our objective is to improve the performance of Mttkrp, and therefore of CPD algorithms, based on the three tensor storage formats described above. We will achieve this objective by ordering (or relabeling) the indices in one or more modes of the input tensor so as to improve the data locality in the tensor and the factor matrices of an Mttkrp operation. Take, for example, two nonzeros (i2, j2, k1) and (i2, j2, k2) of the third-order tensor in Figure 8.2. Relabeling k1 and k2 to two other indices close to each other will potentially improve cache hits for both the tensor and the corresponding rows of the factor matrices in the Mttkrp operation. Besides, it will also influence the tensor storage for some formats (the analysis is given in Section 8.2.3).

We propose two ordering heuristics. The aim of the heuristics is to arrange the nonzeros close to each other in all modes. If we were to look at matrices, this would correspond to reordering the rows and columns so that all nonzeros are clustered around the diagonal. This way, nonzeros in a row or column would be close to each other, and any blocking (by imposing fixed-size blocks) would have nonzero blocks only around the diagonal. The proposed heuristics are based on these observations and try to obtain similar behavior for tensors. The output of a reordering algorithm is one permutation per mode, used to relabel the tensor indices of that mode.
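To fix ideas, here is a minimal sketch (Python/NumPy; the function name is ours) of how such per-mode permutations are applied to relabel a COO tensor:

```python
import numpy as np

def relabel(coords, perms):
    """Relabels the indices of a COO tensor with one permutation per mode.

    coords: (nnz, N) integer array of nonzero indices.
    perms:  list of N arrays; perms[n][old_index] is the new label of that
            mode-n index, as produced by a reordering heuristic.
    The nonzero values are untouched; only the index labels change."""
    new_coords = np.empty_like(coords)
    for n, perm in enumerate(perms):
        new_coords[:, n] = perm[coords[:, n]]
    return new_coords
```

After relabeling, the tensor is re-sorted or re-blocked as required by the chosen storage format (COO, CSF, or HiCOO).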

8.2.1 BFS-MCS

BFS-MCS is a breadth first search (BFS)-like heuristic approach based on the maximum cardinality search family [207]. We first construct the d-partite, d-uniform hypergraph for a sparse tensor, as described in Section 1.5. Recall that in this hypergraph, the vertices are tensor indices in all modes, and hyperedges represent its nonzero entries. Figure 8.2 shows the hypergraph associated with a sparse tensor. Here, the vertices are blank circles, and hyperedges are represented by grouping vertices.

For an Nth-order sparse tensor, we need to find the permutations for N modes. We determine a permutation for a given mode n (perm_n) of a tensor X ∈ R^{I1 × ··· × IN} as follows. Suppose some of the indices of mode n are already ordered. Then, BFS-MCS picks the next to-be-ordered index as the one with the strongest connection to the currently ordered index set. In the case of ties, it selects the index with the smallest number of nonzeros in the corresponding sub-tensor. Intuitively, a stronger connection represents more common indices in the modes other than n among an unordered vertex and the already ordered ones. This means more data from the factor matrices can be reused in Mttkrp if found in cache.

Algorithm 18: BFS-MCS ordering based on maximum cardinality search for a given mode

Data: An Nth-order sparse tensor X ∈ R^{I1 × ··· × IN}, hypergraph G = (V, E), mode n
Result: Permutation perm_n

 1: w_p(i) ← 0, for i = 1, ..., I_n
 2: w_s(i) ← −|X(..., i, ...)|             ▷ number of nonzeros with i in mode n
 3: Build max-heap H_n for mode-n indices, with w_p and w_s as the primary and secondary keys, respectively
 4: m_V(v) ← 0, for each vertex v           ▷ visited markers
 5: for i_n = 1, ..., I_n do
 6:   v_n^(0) ← GetHeapMax(H_n)
 7:   perm_n(v_n^(0)) ← i_n
 8:   m_V(v_n^(0)) ← 1
 9:   for e_n ∈ hyperedges of v_n^(0) do
10:     for v ∈ e_n such that v is not in mode n and m_V(v) = 0 do
11:       m_V(v) ← 1
12:       for e ∈ hyperedges of v with e ≠ e_n do
13:         v_n ← vertex in mode n of e
14:         if inHeap(v_n) then
15:           w_p(v_n) ← w_p(v_n) + 1
16:           heapUpdateKey(H_n, w_p(v_n))
17: return perm_n

The process is implemented for all mode-n indices by maintaining a max-heap with two keys. The primary key of an index is the number of connections to the currently ordered indices. The secondary key is the number of nonzeros in the corresponding (N−1)th-order sub-tensor, e.g., X(..., i_n, ...), where smaller secondary key values signify higher priority. The secondary keys are static. Algorithm 18 gives the pseudocode of BFS-MCS. The heap H_n is initially constructed (Line 3) according to the secondary keys of the mode-n indices. For each mode-n index (e.g., i_n) of this max-heap, BFS-MCS traverses all connected tensor indices in the modes except n, calculates the number of connections of the mode-n indices, and then updates the heap using the primary and secondary keys. Let m_V denote a size-|V| array that tracks whether a vertex has been visited; the entries are initialized to zeros (unvisited) and changed to 1 once a vertex is visited. We record a mode-n index v_n^(0) obtained from the max-heap H_n in the permutation array perm_n. The algorithm visits all its hyperedges (i.e., all nonzeros with i_n, Line 9) and their connected and unvisited vertices from the other (N−1) modes except n (Line 10). For these vertices, it again visits their hyperedges (e in Line 12) and then checks whether the connected vertices in mode n (v_n) are in the heap (Line 14). If so, the primary key (connectivity) of v_n is increased by 1, and the max-heap (H_n) is updated. To summarize, this algorithm locates the sub-tensors of the neighbor vertices of v_n^(0) to increase the primary key values of the encountered mode-n indices in H_n.

The BFS-MCS heuristic has low time complexity. We explain it using tensor terms. The innermost heap update operation (Line 15) costs O(log(I_n)). Each sub-tensor is only visited once, with the help of the marker array m_V. For all the sub-tensors in one tensor mode, the heap update is performed only for the nnz nonzeros (hyperedges). Overall, the time complexity of BFS-MCS is O(N nnz log(I_n)) for an Nth-order tensor with nnz nonzeros, when computing perm_n for mode n.

BFS-MCS does not exactly capture the memory access pattern of an Mttkrp. It treats the contributions from the indices of all modes except n equally in the connectivity; thus, the connectivity might not match the actual data reuse. Also, it uses a greedy strategy to determine the next-level vertices, which could miss the optimal global orderings, as is common for greedy heuristics for hard problems.

8.2.2 Lexi-Order

A lexical ordering of an integer vector is the standard dictionary ordering of its elements, defined as follows. Given two equal-length vectors x and y, we say x ≤ y iff either (i) all elements are the same, i.e., x = y, or (ii) there exists an index j such that x(j) < y(j) and x(i) = y(i) for all 0 ≤ i < j. Consider two 0/1 vectors x = (1, 0, 1, 0) and y = (1, 1, 0, 0); we have x ≤ y because x(1) < y(1) and x(i) = y(i) for all 0 ≤ i < 1.

Lubiw [167] associates a vector v of size I · J with an I × J matrix A, where the matrix indices (i, j) are first sorted by i + j and then by j. A doubly lexical ordering of A is an ordering of its rows and columns which makes v lexically the largest. Every real-valued matrix has a doubly lexical ordering, and such an ordering is not unique [167]. Doubly lexical ordering has a number of applications, including the efficient recognition of totally balanced, subtree, and plaid matrices and (strongly) chordal graphs. To the


[Figure 8.3: A doubly lexical ordering of a 0/1 matrix: (a) vector comparison; (b) the original and the ordered matrix.]

best of our knowledge, the merits of this ordering for high performance have not been investigated for sparse matrices.

We propose an iterative algorithm called matLexiOrder for the doubly lexical ordering of matrices, where in an iteration either rows or columns are sorted lexically. Lubiw [167, Claim 2.2] shows that exchanging two rows which are lexically out of order improves the lexical ordering of the matrix. By appealing to this result, we see that the matLexiOrder algorithm finds a doubly lexical ordering in a finite number of iterations. This also fits well with another characterization of the doubly lexical ordering [167, 182]: a matrix is in doubly lexical ordering if both the row and column vectors are in non-increasing lexical order. A row vector is read from left to right, and a column vector is read from top to bottom. Figure 8.3 shows a doubly lexical ordering of a sample matrix. As seen in the figure, the row vectors are in non-increasing lexical order, as are the column vectors.

The known doubly lexical ordering algorithms for matrices, by Lubiw [167] and Paige and Tarjan [182], are "direct" ordering methods with a run time of O(nnz log(I + J) + J) and O(nnz + I + J) space for an I × J matrix with nnz nonzeros. We find the time complexity of these algorithms to be too high for our purpose. Furthermore, the data structures are too complex to allow an efficient generalized implementation for tensors. Since we do not aim to obtain an exact doubly lexical ordering (a close-by ordering will likely suffice to improve the Mttkrp performance), our approach will be faster, while being simpler to implement.

We first describe the matLexiOrder algorithm for a sparse matrix. This aims to show the efficiency of our iterative approach compared to others. A partition refinement technique is used to order the rows and columns alternately. Given an ordering of the rows, the columns can be sorted lexically. This is achieved by an order preserving variant of the partition refinement method [182], which is called orderlyRefine. We briefly explain the partition refinement technique for ordering the columns of a matrix. Given an I × J matrix A, all columns are initially put into a single part. Then, A's nonzeros are visited row by row. At a row i, each column part C is split into two parts, C1 = C ∩ a(i,:) and C2 = C \ a(i,:), and these two parts replace C in the order C1, C2 (empty sets are discarded). Note that this algorithm keeps the parts in a particular order, which generates an ordering of the columns. orderlyRefine is used to refine all parts that have at least one nonzero in row i in O(|a(i,:)|) time, where |a(i,:)| is the number of nonzeros of row i. Overall, matLexiOrder costs a linear total time of O(nnz + I + J) (for ordering both rows and columns) per iteration and O(J) space. We also observe that only a small number of iterations will be enough (this will be shown in Section 8.3.5 for tensors), yielding a more storage-efficient algorithm compared to the prior doubly lexical ordering methods [167, 182]. matLexiOrder, in particular the use of the orderlyRefine routine a few times, is sufficient for our needs.
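A compact sketch of the order preserving partition refinement (in Python; a simplified version for illustration, not the linear-time implementation described above) shows how the columns end up in non-increasing lexical order:

```python
def orderly_refine(n_cols, rows):
    """Order preserving partition refinement on the columns of a sparse matrix.

    rows: list over the rows, each entry holding the column indices with a
    nonzero in that row.  Returns the columns in non-increasing lexical order
    (columns read from top to bottom).  This simplified sketch rebuilds every
    part at each row, so it is not the linear-time variant of the text."""
    parts = [list(range(n_cols))]            # ordered list of column parts
    for cols in rows:
        hit = set(cols)
        new_parts = []
        for part in parts:
            c1 = [c for c in part if c in hit]       # columns with a nonzero in this row
            c2 = [c for c in part if c not in hit]   # the remaining columns
            if c1: new_parts.append(c1)
            if c2: new_parts.append(c2)
        parts = new_parts
    return [c for part in parts for c in part]

# A 4x4 example (the original 0/1 matrix of Figure 8.3), rows given as
# column-index lists; the refined column order is [1, 3, 2, 0].
print(orderly_refine(4, [[1, 3], [1], [2], [3]]))
```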

Algorithm 19: Lexi-Order for a given mode

Data: An Nth-order sparse tensor X ∈ R^{I1 × ··· × IN}, mode n
Result: Permutation perm_n

    ▷ Sort all nonzeros along all but mode n
 1: quickSort(X, coordCmp)
    ▷ Matricize X to X_(n)
 2: r ← compose(inds([−n]), 1)
 3: for m = 1, ..., nnz do
 4:   c ← inds(n, m)                        ▷ Column index of X_(n)
 5:   if coordCmp(X, m, m−1) = 1 then
 6:     r ← compose(inds([−n]), m)          ▷ Row index of X_(n)
 7:   X_(n)(r, c) ← val(m)
    ▷ Apply partition refinement to the columns of X_(n)
 8: perm_n ← orderlyRefine(X_(n))
 9: return perm_n

    ▷ Comparison of two nonzeros of X over all modes but mode n
10: Function coordCmp(X, m1, m2)
11:   for n′ = 1, ..., N with n′ ≠ n do
12:     if m1(n′) < m2(n′) then
13:       return −1
14:     if m1(n′) > m2(n′) then
15:       return 1
16:   return 0

To order tensors, we propose the Lexi-Order function as an extension of matLexiOrder. The basic idea of Lexi-Order is to determine the permutation of each tensor mode independently, while considering the order in the other modes fixed. Lexi-Order sets the indices of the mode to be ordered as the columns of a matrix, the other indices as the rows, and sorts the columns as described for matrices (with the order preserving partition refinement method). The precise algorithm appears in Algorithm 19, which we also illustrate in Figure 8.4 when applied to mode 1. Given a mode n, Lexi-Order first builds a matricized tensor in Compressed Sparse


[Figure 8.4: The steps of Lexi-Order illustrated for mode 1: the original COO tensor is quick-sorted by (j, k), matricized into a CSR matrix whose rows are the (j, k) tuples (shown with its 0/1 nonzero distribution), and orderlyRefine on the columns produces perm_i.]

Row (CSR) sparse matrix format, by a call to quickSort with the comparison function coordCmp, and then by partitioning the nonzeros into the row segments (Lines 3–7). The comparison function coordCmp does a lexical comparison of the all-but-mode-n indices, which enables efficient matricization. In other words, the sorting of the tensor X is the same as building the matricized tensor X_(n) by rows in the fixed lexical ordering, where mode n is the column dimension and the remaining modes constitute the rows. In Figure 8.4, the sorting step orders the COO entries by (j, k) tuples, which then serve as the row indices of the matricized CSR representation of X_(1). Once the matrix is built, we construct the 0/1 row vectors shown in Figure 8.4 to illustrate its nonzero distribution, on which the orderlyRefine function is then called. Apart from the quickSort, the other parts of Lexi-Order are of linear time complexity (linear in terms of the tensor storage). We use OpenMP tasks to parallelize quickSort to accelerate Lexi-Order.

Like the BFS-MCS approach, Lexi-Order also finds the permutations for the N modes of an Nth-order sparse tensor. Figure 8.5(a) illustrates the effect of Lexi-Order on an example 4 × 4 × 3 sparse tensor. The original tensor is converted to a HiCOO representation with block size B = 2, consisting of 5 blocks with at most 2 nonzeros per block. After reordering with Lexi-Order, the new HiCOO has 3 nonzero blocks with up to 4 nonzeros per block. Thus, the blocks are denser, which should exhibit better locality behavior. However, this reordering scheme is a heuristic. For example, consider Figure 8.1; we draw another HiCOO representation after reordering in Figure 8.5(b). Applying Lexi-Order would yield a reordered HiCOO representation with 4 blocks, which is the same as the input ordering, although the maximum number of nonzeros per block would increase to 4. For this tensor, Lexi-Order may not show a big advantage.

8.2.3 Analysis

We consider the Mttkrp operation to analyze the reordering behavior for the COO, CSF, and HiCOO formats. Recall that for a third-order sparse tensor X, Mttkrp multiplies each nonzero entry x_ijk with the R-vector formed by the entry-wise product of the jth row of B and the kth row of C when computing


[Figure 8.5: Comparison of HiCOO representations before and after Lexi-Order: (a) a good example; (b) a fair example.]

M_A. The arithmetic intensity of Mttkrp algorithms on these formats is approximately 1/4 [155]. Thus, Mttkrp can be considered memory-bound on most computer architectures, especially CPU platforms.
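For reference, a straightforward sketch (Python/NumPy; ours, not ParTI's implementation) of the mode-1 Mttkrp on a third-order COO tensor makes the access pattern explicit: every nonzero touches one row of the output and one row of each factor matrix, which is exactly the locality that reordering targets.

```python
import numpy as np

def mttkrp_mode1_coo(coords, vals, B, C, I):
    """Mode-1 MTTKRP, M = X_(1) (C khatri-rao B), for a third-order COO tensor.

    coords: (nnz, 3) array of (i, j, k) indices; vals: (nnz,) nonzero values.
    B: (J, R) and C: (K, R) factor matrices; I: size of the first mode."""
    R = B.shape[1]
    M = np.zeros((I, R))
    for (i, j, k), x in zip(coords, vals):
        M[i, :] += x * (B[j, :] * C[k, :])   # entry-wise product of two factor rows
    return M
```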

In HiCOO, smaller block ratios (α_b) and larger average slice sizes per tensor block (c_b) are favorable for good Mttkrp performance. Our reordering algorithms tend to closely pack nonzeros together and get denser blocks (larger c_b), which potentially generates fewer tensor blocks (smaller α_b); thus, HiCOO-Mttkrp has fewer memory accesses, more cache hits, and better performance. Moreover, a smaller α_b can also reduce the tensor memory requirement as a side benefit [155]. That is, for the HiCOO format, a reordering scheme should increase the block density, reduce the number of blocks, and increase cache locality, three related performance metrics. Reordering is more beneficial for the HiCOO format than for the COO and CSF formats.

COO stores all nonzero indices, so relabeling does not make a difference in its storage. The same is also true for CSF: relabeling does not change the CSF's tree structure, while the order of nodes may change. For an Mttkrp with tensors stored in these two formats, the performance gain from reordering comes only from the improved data locality. Though it is hard to do a theoretical analysis for them, the potentially better data locality could also bring performance advantages, but these could be smaller than HiCOO's.


Tensors     Order  Dimensions                   #Nnzs   Density
vast        3      165K × 11K × 2               26M     6.9 × 10^-3
nell2       3      12K × 9K × 29K               77M     2.4 × 10^-5
choa        3      712K × 10K × 767             27M     5.0 × 10^-6
darpa       3      22K × 22K × 24M              28M     2.4 × 10^-9
fb-m        3      23M × 23M × 166              100M    1.1 × 10^-9
fb-s        3      39M × 39M × 532              140M    1.7 × 10^-10
flickr      3      320K × 28M × 2M              113M    7.8 × 10^-12
deli        3      533K × 17M × 3M              140M    6.1 × 10^-12
nell1       3      29M × 21M × 25M              144M    9.1 × 10^-13
crime       4      6K × 24 × 77 × 32            5M      1.5 × 10^-2
uber        4      183 × 24 × 1140 × 1717       3M      3.9 × 10^-4
nips        4      2K × 3K × 14K × 17           3M      1.8 × 10^-6
enron       4      6K × 6K × 244K × 1K          54M     5.5 × 10^-9
flickr4d    4      320K × 28M × 2M × 731        113M    1.1 × 10^-14
deli4d      4      533K × 17M × 3M × 1K         140M    4.3 × 10^-15

Table 8.1: Description of sparse tensors.

8.3 Experiments

Platform. We perform experiments on a Linux-based Intel Xeon E5-2698 v3 multicore server platform with 32 physical cores distributed on two sockets, each with 2.3 GHz frequency. We only present results for sequential runs (see the paper [156] for parallel runs). The processor microarchitecture is Haswell, having a 32 KiB L1 data cache and 128 GiB memory. The code artifact was written in the C language and was compiled using icc 18.0.1.

Dataset. We use the sparse tensors derived from real-world applications that appear in Table 8.1, ordered by decreasing nonzero density, separately for third- and fourth-order tensors. Apart from choa [188], all tensors are publicly available: FROSTT [203], HaTen2 [121].

Configurations. We report the results under the best configurations of the following parameters for the highest Mttkrp performance with the three formats: (i) the number of reordering iterations and the block size B for the HiCOO format; (ii) tiling or not for CSF. For HiCOO, B = 128 achieves the best results in most cases, and we use five reordering iterations, which will be analyzed in Section 8.3.5. All experiments use an approximate rank of R = 16. We use the total execution time of Mttkrps in all modes for every tensor to calculate the speedup, which is the ratio of the total Mttkrp time on a randomly reordered tensor over that using a specific reordering scheme. All times are averaged over five runs.

8.3.1 COO-Mttkrp with reordering

We show the effect of the two reordering approaches on the sequential COO-Mttkrp from ParTI [154] in Figure 8.6. This COO-Mttkrp is imple-


[Figure 8.6: Reordered COO-Mttkrp speedup over a random reordering implementation.]

mented in C, using the same algorithm as Tensor Toolbox [18]. For any reordering approach, after doing a BFS-MCS, Lexi-Order, or random reordering on the input tensor, we still sort the tensor in the mode order of 1, ..., N. Observe that Lexi-Order improves the sequential COO-Mttkrp performance by 1.00–4.29× (1.79× on average), while BFS-MCS gets 0.95–1.27× (1.10× on average). Note that Lexi-Order improves the performance of COO-Mttkrp for all tensors. We conclude that this ordering is always helpful for COO-Mttkrp, while the improvements are less than what we saw for HiCOO-Mttkrp.

8.3.2 CSF-Mttkrp with reordering

We show the effect of the two reordering approaches on the sequential CSF-Mttkrp from Splatt v1.1.1 [205] in Figure 8.7. CSF-Mttkrp is set to use all CSF representations (ALLMODE) for Mttkrps in all modes, and with the tiling option on. Lexi-Order improves the sequential CSF-Mttkrp performance by 0.65–2.33× (1.50× on average). BFS-MCS improves the sequential CSF-Mttkrp performance by 1.00–1.86× (1.22× on average). Both ordering approaches improve the performance of CSF-Mttkrp on average. While BFS-MCS is always helpful in the sequential case, Lexi-Order is not helpful on only one tensor, crime. The improvements achieved by the two reordering approaches for CSF are less than those for the HiCOO and COO formats. We conclude that both reordering methods are helpful for CSF-Mttkrp, but to a lesser extent than for HiCOO and COO based Mttkrp.

8.3.3 HiCOO-Mttkrp with reordering

Figure 8.8 shows the speedup of the proposed reordering methods on sequential HiCOO-Mttkrp. Lexi-Order reordering obtains 0.99–4.14× speedup (2.12× on average), while BFS-MCS reordering gets 0.99–1.88× speedup (1.34× on average). Lexi-Order and BFS-MCS do not behave as well on fourth-order tensors as on third-order tensors. Tensor flickr4d is


[Figure 8.7: Reordered CSF-Mttkrp speedup over a random reordering implementation.]

[Figure 8.8: Reordered HiCOO-Mttkrp speedup over a random ordering implementation.]

constructed from the same data as flickr, with an extra short mode (refer to Table 8.1). Lexi-Order obtains a 4.14× speedup on flickr, while only a 3.02× speedup on flickr4d. The same phenomenon is also observed on tensors deli and deli4d. This phenomenon indicates that it is harder to get good data locality on higher-order tensors, which will be justified in Table 8.2.

We investigate the two critical parameters of HiCOO [155]: the block ratio (α_b) and the average slice size per tensor block (c_b). A smaller α_b and a larger c_b are favorable for good HiCOO-Mttkrp performance. Table 8.2 lists the parameter values for all tensors before and after Lexi-Order, the HiCOO-Mttkrp speedup (as shown in Figure 8.8), and the storage ratio of HiCOO over random ordering. Generally, when both parameters α_b and c_b are largely improved, we see a good speedup and storage ratio using Lexi-Order. As seen for the same data in different orders, e.g., flickr4d and flickr, the α_b and c_b values after Lexi-Order are better in the smaller-order version. This observation supports the intuition that getting good data locality is harder for higher-order tensors.

8.3.4 Reordering methods comparison

As seen above, Lexi-Order improves performance more than BFS-MCS in most cases. Compared to the reordering method used in Splatt [206],


            Random reordering   Lexi-Order          Speedup         Storage
Tensors     α_b      c_b        α_b      c_b        seq     omp     ratio
vast        0.004    1.758      0.004    1.562      1.01    1.03    0.999
nell2       0.020    0.314      0.008    0.074      1.04    1.04    0.966
choa        0.089    0.057      0.016    0.056      1.16    1.07    0.833
darpa       0.796    0.009      0.018    0.113      3.74    1.50    0.322
fb-m        0.985    0.008      0.086    0.021      3.84    1.21    0.335
fb-s        0.982    0.008      0.099    0.020      3.90    1.23    0.336
flickr      0.999    0.008      0.097    0.025      4.14    3.66    0.277
deli        0.988    0.008      0.501    0.010      2.24    0.83    0.634
nell1       0.998    0.008      0.744    0.009      1.70    0.70    0.812
crime       0.001    37.702     0.001    8.978      0.99    1.42    1.000
uber        0.041    0.469      0.011    0.270      1.00    0.78    0.838
nips        0.016    0.434      0.004    0.435      1.03    1.34    0.921
enron       0.290    0.017      0.045    0.030      1.25    1.36    0.573
flickr4d    0.999    0.008      0.148    0.020      3.02    11.81   0.214
deli4d      0.998    0.008      0.596    0.010      1.76    1.26    0.697

Table 8.2: HiCOO parameters before and after Lexi-Order reordering.

by setting ALLMODE (identical to the work [206]) for CSF-Mttkrp, BFS-MCS gets 1.04, 1.64, and 1.61× speedups on tensors nell2, nell1, and deli, respectively, and Lexi-Order obtains 1.04, 1.70, and 2.24× speedups. By contrast, the speedups using graph partitioning [206] on these three tensors are 1.06, 1.11, and 1.19×, and 1.06, 1.12, and 1.24× by using hypergraph partitioning [206], respectively. Our BFS-MCS and Lexi-Order schemes both outperform the graph and hypergraph partitionings [206].

The available methods in the state of the art are based on graph and hypergraph partitioning. Partitioning is a successful approach when the number of partitions is known, while the number of blocks in HiCOO is not known ahead of time. In our case, the partitions should also be ordered for better cache reuse. Additionally, partitioners are less effective for tensors than usual, as some dimensions could be very short (creating very high degree vertices). That is why the proposed ordering-based methods deliver better results.

8.3.5 Effect of the number of iterations in Lexi-Order

Since Lexi-Order is iterative, we evaluate the effect of the number of iterations on HiCOO-Mttkrp performance. The results appear in Figure 8.9(a), normalized to the run time with 10 iterations. Mttkrp performance on most tensors does not vary a lot for different numbers of iterations, except on vast, nell2, uber, and nips. We use 5 iterations to get Mttkrp performance similar to that of 10 iterations, with about half of the overhead (shown in Figure 8.9(b)); but 3 or fewer iterations will give an acceptable performance when users care much about the pre-processing time.


[Figure 8.9: The performance and overhead of different numbers of iterations (1, 2, 3, 5, 10): (a) sequential HiCOO-Mttkrp behavior; (b) the reordering overhead.]

8.4 Summary, further notes, and references

Plenty of recent research has studied the optimization of tensor algorithms [20, 28, 46, 159, 168]. Our work sets itself apart by focusing on reordering to get a better nonzero structure. Various reordering methods have been considered for matrix operations [7, 173, 190, 215].

The proposed two heuristics aim to arrange the nonzeros of a given tensor close to each other in all modes. As said before, in the matrix case this would correspond to reordering the rows and columns so that all nonzeros are clustered around the diagonal. Heuristics based on partitioning and ordering have been used in the matrix case for arranging most nonzeros around the diagonal or diagonal blocks. An obvious alternative in the matrix case is the BFS-based reverse Cuthill-McKee (RCM) heuristic. Given a pattern-symmetric matrix A, RCM finds a permutation (ordering) such that the symmetrically permuted matrix PAP^T has nonzeros clustered around the diagonal. More precisely, RCM tries to reduce the bandwidth, which is defined as max{j − i : a_ij ≠ 0 and j > i}. The RCM heuristic can be applied to pattern-wise unsymmetric, even rectangular, matrices as well. A common approach is to create a (2 × 2)-block matrix for a given m × n matrix A as

    B = [ I_m   A
          A^T   I_n ],

where I_m and I_n are the identity matrices of size m × m and n × n, respectively. Since B is now pattern-symmetric, RCM can be run on it to obtain an (m+n) × (m+n) permutation matrix P. The row and column permutations


              min     max      geo. mean
BFS-MCS       0.84    53.56    2.90
Lexi-Order    0.15    4.33     0.99

Table 8.3: Statistical indicators of the number of nonempty blocks obtained by BFS-MCS and Lexi-Order with respect to that of RCM, on 774 matrices from the UFL collection.

for A can then be obtained from P by respecting the ordering of the first m rows and the last n rows of B. This approach can be extended to tensors; we discuss the third-order case for simplicity. Let X be an I × J × K tensor. We create a matrix B in such a way that the first I rows/columns correspond to the first-mode indices, the following J rows/columns correspond to the second-mode indices, and the last K rows/columns correspond to the third-mode indices. Then, for each nonzero x_ijk of X, we set b_rc = 1 whenever r and c correspond to any two indices from the set {i, j, k}. This way, B is a symmetric matrix with ones on the main diagonal; a small sketch of this construction is given below. Observe that when X is two dimensional (that is, a matrix), we obtain the matrix B above. We now compare the proposed reordering heuristics with RCM.
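The following sketch (Python with SciPy; the routine and its return convention are ours, not part of the thesis) builds B from the nonzero indices of a third-order tensor and extracts one permutation per mode from the RCM ordering of B:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def rcm_tensor_orderings(coords, dims):
    """RCM-based ordering of a third-order tensor via the symmetric matrix B.

    coords: (nnz, 3) array of (i, j, k) indices; dims = (I, J, K).
    For each nonzero, b_rc is set for every pair of its three global indices
    (i, I+j, I+J+k), so B is symmetric with ones on its diagonal.
    Returns, per mode, an array mapping an old index to its new position."""
    I, J, K = dims
    n = I + J + K
    offs = np.array([0, I, I + J])
    rows, cols = [], []
    for idx in coords:
        g = idx + offs                         # global row/column ids in B
        for a in range(3):
            for b in range(3):
                rows.append(g[a]); cols.append(g[b])
    # duplicate entries are summed by coo_matrix; only the pattern matters here
    B = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()
    p = reverse_cuthill_mckee(B, symmetric_mode=True)   # p[new] = old, over B's rows
    perms = []
    for lo, hi in zip(offs, [I, I + J, n]):
        mode_old = p[(p >= lo) & (p < hi)] - lo          # the mode's old ids in RCM order
        perm = np.empty(hi - lo, dtype=int)
        perm[mode_old] = np.arange(hi - lo)              # old index -> new position
        perms.append(perm)
    return perms
```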

We first compare the three methods, RCM, BFS-MCS, and Lexi-Order, on matrices available in the UFL collection [59]. We have tested these three heuristics on all pattern-wise unsymmetric matrices having more than 1000 and less than 5000000 rows, at least three nonzeros per row and column, less than 10000000 nonzeros, and less than 100·√(m × n) nonzeros for a matrix with m rows and n columns. At the time of experimentation, there were 774 matrices satisfying these properties. We ordered these matrices with RCM, BFS-MCS, and Lexi-Order (5 iterations) and counted the number of nonempty blocks with a block size of 128 (in each dimension). We then normalized the results of BFS-MCS and Lexi-Order with respect to those of RCM. We give the statistical indicators of these normalized results in Table 8.3. As seen in the table, RCM is better than BFS-MCS, and Lexi-Order performs as well as RCM in reducing the number of blocks.

We next compare the three methods on 11 sparse tensors from the FROSTT [203] and HaTen2 [121] repositories, again with a block size of 128 in each dimension. Table 8.4 shows the number of nonempty blocks obtained with these methods. In this table, the number of blocks obtained by RCM is given in absolute terms, and those obtained by BFS-MCS and Lexi-Order are given relative to those of RCM. As seen in the table, RCM is never the best, and Lexi-Order is better than BFS-MCS in nine out of eleven cases. Overall, Lexi-Order and BFS-MCS obtain results whose geometric means are 0.48 and 0.89, respectively, of that of RCM, marking both as more effective than RCM. This is interesting, as we have seen that RCM and Lexi-Order have (nearly) the same performance in the matrix case.


                 RCM         BFS-MCS   Lexi-Order
lbnl-network     67208       1.11      0.68
nips             25992       1.56      0.45
vast             113757      1.00      0.95
nell-2           565615      0.92      1.05
flickr           28648054    0.41      0.38
flickr4d         32913028    0.61      0.51
deli             70691535    0.97      0.91
deli4d           94608531    0.81      0.88
darpa            2753737     1.65      0.19
fb-m             53184625    0.66      0.16
fb-s             71308285    0.80      0.19

geo. mean                    0.89      0.48

Table 8.4: Number of nonempty blocks obtained by RCM, BFS-MCS, and Lexi-Order on sparse tensors. The results of BFS-MCS and Lexi-Order are normalized with respect to those of RCM. For each tensor, the best result is displayed in bold.

          random   Lexi-Order   speedup
vast      2.02     1.60         1.26
nell-2    5.04     4.33         1.16
choa      1.78     1.43         1.25
darpa     3.18     1.48         2.15
fb-m      17.16    6.95         2.47
fb-s      25.03    9.73         2.57
flickr    10.29    5.82         1.77
deli      14.05    10.80        1.30
nell1     18.45    14.86        1.24

Table 8.5: The run time of TTM (for all modes) using a random ordering and Lexi-Order, and the speedup with Lexi-Order, for sequential execution.


The proposed ordering methods are applicable to some other sparse tensor operations without any change. Among those operations, the tensor-matrix multiplication (TTM) used in computing the Tucker decomposition with higher-order orthogonal iterations [61], or the tensor-vector multiplications used in the higher order power method [60], are well known. The main characteristic of these operations is that the memory access of the algorithm depends on the locality of tensor indices for each dimension. If not all dimensions have the same characteristics, potentially a variant of the proposed methods in which only certain dimensions are ordered could be used. We showcase some results for the TTM case below, in Table 8.5, on some of the tensors. The results are obtained on a machine with an Intel Xeon CPU E5-2650 v4. As seen in the table, using Lexi-Order always results in improved sequential run time for TTMs.

Recent work [46] analyses the memory access pattern of the MTTKRP operation and proposes low-level, highly effective code optimizations. The proposed approach reorganizes the code with register blocking and applies nonzero blocking in an attempt to fit rows of the factor matrices into the cache. The gain in performance is remarkable. Our ordering methods are orthogonal to the mentioned and similar optimizations. For a much improved performance, we should first order the tensor using the approaches proposed in this chapter, and then further optimize the code using the low-level code optimizations.


Part IV

Closing


Chapter 9

Concluding remarks

We have presented a selection of partitioning, matching, and ordering problems in combinatorial scientific computing.

Partitioning problems: These problems arise in task decomposition for parallel computing, where load balance and low communication cost are two objectives. Depending on the task definition and the interaction of the tasks, the partitioning problems use different combinatorial models. The selected problems (Chapters 2 and 3) addressed acyclic partitioning of directed acyclic graphs and partitioning of hypergraphs. While our contributions for the former problem concerned combinatorial tools for the desired partitioning objectives and constraints, those for the second problem concerned the use of hypergraph models and associated partitioning tools for efficient tensor decomposition in the distributed memory setting.

Matching problems: These problems arise in settings where agents compete for exclusive access to resources. The most common settings include identifying nonzero diagonals in sparse matrices, and routing and scheduling in data centers. A related concept is to decompose a doubly stochastic matrix as a convex combination of permutation matrices (the famous Birkhoff-von Neumann decomposition), where each permutation matrix is a perfect matching in the bipartite graph representation of the matrix. We developed approximation algorithms for matchings in graphs (Chapter 4) and effective heuristics for finding matchings in hypergraphs (Chapter 6). We also investigated (Chapter 5) the problem of finding Birkhoff-von Neumann decompositions with a small number of permutation matrices, and presented complexity results and theoretical insights into the decomposition problem. On the way, we generalized the decomposition to general real matrices (not only doubly stochastic) having total support.



Ordering problems: These problems arise when one wants to permute sparse matrices and tensors into desirable forms. The sought forms depend, of course, on the targeted applications. By focusing on a recent LU decomposition method (or on a method taking advantage of the sparsity in a novel way), we developed heuristics to permute sparse matrices into a bordered block triangular form with the aim of reducing the height of the resulting elimination tree (Chapter 7). We also investigated the sparse tensor ordering problem, where we sought to cluster nonzeros around the (hyper)diagonal of a sparse tensor (Chapter 8) in order to improve the performance of certain tensor operations. To address the tensor ordering problem, we proposed novel algorithms for obtaining a doubly lexical ordering of sparse matrices that are simpler to implement than the known algorithms.

At the end of Chapters 2–8, we mentioned some immediate future work. Below, we state some more future work, which requires considerable effort and creativity.

Most of the presented combinatorial algorithms come with applications. Others are made applicable by practical and efficient implementations; further developments on the applications' side are required for these to be used in practice. In particular, the use of an acyclic partitioner (the problem of Chapter 2) in a parallel system with the help of a runtime system requires the whole graph to be available. This is usually not the case; an auto-tuning-like approach, in which one first creates a task graph by a cold run and then partitions this graph to determine a schedule, can prove useful, as sketched below. In a similar vein, we mentioned the lack of proper software implementing the LU factorization method using elimination trees at the end of Chapter 7. Such an implementation will give rise to a number of combinatorial problems (e.g., the peak memory requirement in factoring a BBT-ordered matrix with elimination trees) and seems to be a new playing field.
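A minimal sketch of this cold-run idea follows, under stated assumptions: run_and_trace and acyclic_partition are placeholder names we introduce for illustration (the latter could be the partitioner of Chapter 2), and networkx is used only to hold the recorded task graph.

```python
import networkx as nx

def cold_run_then_partition(tasks, run_and_trace, acyclic_partition, k):
    """Sketch of the auto-tuning-like approach described above (our own
    illustration, not an existing tool).  run_and_trace(t) executes task t
    once and returns the tasks whose outputs it consumed; acyclic_partition
    stands for any acyclic DAG partitioner and is assumed to return a
    dictionary mapping each task to one of k parts."""
    dag = nx.DiGraph()
    for t in tasks:                          # cold run, in a valid order
        dag.add_node(t)
        for producer in run_and_trace(t):    # record discovered dependencies
            dag.add_edge(producer, t)
    parts = acyclic_partition(dag, k)        # hypothetical partitioner call
    # Acyclicity of the partition lets the parts be scheduled one after
    # another (or pipelined) in a topological order of the quotient graph.
    quotient = nx.DiGraph()
    quotient.add_nodes_from(set(parts.values()))
    quotient.add_edges_from((parts[u], parts[v])
                            for u, v in dag.edges if parts[u] != parts[v])
    return list(nx.topological_sort(quotient)), parts
```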

The original Karp-Sipser heuristic for the cardinality matching problem applies degree-1 and degree-2 reduction rules. The first rule is well implemented in practical settings (see Chapters 4 and 6). On the other hand, fast implementations of the degree-2 reduction rule for undirected graphs are sought. A straightforward implementation of this rule can result in a worst-case Ω(n²) total time for a graph with n vertices. This is obviously too much to afford in theory for general sparse graphs with O(n) or O(n log n) edges. Are there worst-case linear-time algorithms for implementing this rule in the context of the Karp-Sipser heuristic? A sketch of one reduction step follows.
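For concreteness, the following Python sketch (a simplified data layout of our own, not an implementation from the literature) performs one degree-2 reduction and shows where the per-reduction cost proportional to the merged degrees comes from.

```python
def degree2_reduce(adj, v):
    """One step of the Karp-Sipser degree-2 rule on an undirected graph
    (simplified sketch; adj maps each vertex to the set of its neighbors).
    If v has exactly two neighbors u and w, merge u and w into the single
    super-vertex u and remove v; once the reduced graph is matched, v is
    matched with whichever of u, w is left exposed.  The loop and set
    updates below cost O(deg(u) + deg(w)), which is how a naive sequence
    of such reductions can reach Omega(n^2) total time."""
    u, w = adj[v]                       # the two neighbors of v
    for x in adj[w]:                    # redirect the edges of w to u
        adj[x].discard(w)
        if x not in (u, v):
            adj[x].add(u)
            adj[u].add(x)
    adj[u].discard(v)
    adj[u].discard(w)
    del adj[v], adj[w]
    return u

# Tiny example: path a-v-b plus edge b-c; v has degree two.
adj = {'a': {'v'}, 'v': {'a', 'b'}, 'b': {'v', 'c'}, 'c': {'b'}}
print(degree2_reduce(adj, 'v'), adj)    # merges a and b into one vertex
```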

The Birkhoff-von Neumann decomposition (Chapter 5) has well-founded applications in routing. Our use of this decomposition in solving linear systems is a first attempt at making it useful for numerical algorithms. In the proposed approach, the cost of constructing a preconditioner with a few terms from a BvN decomposition is equivalent to solving a few maximum bottleneck matching problems, which can be costly in practical settings. Can we use a relaxed definition of the BvN decomposition in which sub-permutation matrices are allowed (such decompositions are used in routing problems) and make the related preconditioners more practical while retaining their numerical properties? The sketch below illustrates the baseline construction.
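To fix ideas, here is a small Python sketch of the baseline construction discussed above. It is only an illustration we add here, not the implementation of [23]; the permutations and the matrix at the bottom are hypothetical data. It assembles M from the first k terms of a decomposition and wraps a sparse factorization of M as a preconditioner for GMRES.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import gmres, splu, LinearOperator

def bvn_preconditioner(perms, alphas, k):
    """Build M = sum_{i<k} alpha_i P_i from the first k terms of a BvN-like
    decomposition and wrap a sparse LU factorization of M as a
    preconditioner.  Each permutation is an index array p with P[i, p[i]] = 1."""
    n = len(perms[0])
    M = sp.csc_matrix((n, n))
    for a, p in zip(alphas[:k], perms[:k]):
        M = M + a * sp.csc_matrix((np.ones(n), (np.arange(n), p)), shape=(n, n))
    lu = splu(M.tocsc())
    return LinearOperator((n, n), matvec=lu.solve)

# Tiny usage example with two 3x3 permutations (hypothetical data):
# A is close to 0.7*P0 + 0.3*P1, so the two-term M is a good preconditioner.
perms = [np.array([0, 1, 2]), np.array([1, 2, 0])]
alphas = [0.7, 0.3]
A = 0.7 * np.eye(3) + 0.3 * np.eye(3)[[1, 2, 0]] + 0.01 * np.ones((3, 3))
Mop = bvn_preconditioner(perms, alphas, k=2)
x, info = gmres(sp.csr_matrix(A), np.ones(3), M=Mop)
print(info, x)
```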


Bibliography

[1] IBM ILOG CPLEX Optimizer. http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/, 2010. [Cited on pp. 96, 121]

[2] Evrim Acar Daniel M Dunlavy and Tamara G Kolda A scalable op-timization approach for fitting canonical tensor decompositions Jour-nal of Chemometrics 25(2)67ndash86 2011 [Cited on pp 11 37]

[3] Seher Acer Tugba Torun and Cevdet Aykanat Improving medium-grain partitioning for scalable sparse tensor decomposition IEEETransactions on Parallel and Distributed Systems 29(12)2814ndash2825Dec 2018 [Cited on pp 52]

[4] Kunal Agrawal Jeremy T Fineman Jordan Krage Charles E Leis-erson and Sivan Toledo Cache-conscious scheduling of streaming ap-plications In Proc Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 236ndash245 New YorkNY USA 2012 ACM [Cited on pp 4]

[5] Emmanuel Agullo Alfredo Buttari Abdou Guermouche and FlorentLopez Multifrontal QR factorization for multicore architectures overruntime systems In Euro-Par 2013 Parallel Processing LNCS pages521ndash532 Springer Berlin Heidelberg 2013 [Cited on pp 131]

[6] Ron Aharoni and Penny Haxell Hallrsquos theorem for hypergraphs Jour-nal of Graph Theory 35(2)83ndash88 2000 [Cited on pp 119]

[7] Kadir Akbudak and Cevdet Aykanat Exploiting locality in sparsematrix-matrix multiplication on many-core architectures IEEETransactions on Parallel and Distributed Systems 28(8)2258ndash2271Aug 2017 [Cited on pp 160]

[8] Patrick R Amestoy Iain S Duff and Jean-Yves LrsquoExcellent Multi-frontal parallel distributed symmetric and unsymmetric solvers Com-puter Methods in Applied Mechanics and Engineering 184(2ndash4)501ndash520 2000 [Cited on pp 131 132]

[9] Patrick R Amestoy Iain S Duff Daniel Ruiz and Bora Ucar Aparallel matrix scaling algorithm In J M Palma P R AmestoyM Dayde M Mattoso and J C Lopes editors 8th InternationalConference on High Performance Computing for Computational Sci-ence (VECPAR) volume 5336 pages 301ndash313 Springer Berlin Hei-delberg 2008 [Cited on pp 56]


[10] Michael Anastos and Alan Frieze. Finding perfect matchings in random cubic graphs in linear time. arXiv preprint arXiv:1808.00825, 2018. [Cited on pp. 125]

[11] Claus A Andersson and Rasmus Bro The N-way toolbox for MAT-LAB Chemometrics and Intelligent Laboratory Systems 52(1)1ndash42000 [Cited on pp 11 37]

[12] Hartwig Anzt Edmond Chow and Jack Dongarra Iterative sparsetriangular solves for preconditioning In Larsson Jesper Traff SaschaHunold and Francesco Versaci editors Euro-Par 2015 Parallel Pro-cessing pages 650ndash661 Springer Berlin Heidelberg 2015 [Cited onpp 107]

[13] Jonathan Aronson Martin Dyer Alan Frieze and Stephen SuenRandomized greedy matching II Random Structures amp Algorithms6(1)55ndash73 1995 [Cited on pp 58]

[14] Jonathan Aronson Alan Frieze and Boris G Pittel Maximum match-ings in sparse random graphs Karp-Sipser revisited Random Struc-tures amp Algorithms 12(2)111ndash177 1998 [Cited on pp 56 58]

[15] Cevdet Aykanat Berkant B Cambazoglu and Bora Ucar Multi-leveldirect K-way hypergraph partitioning with multiple constraints andfixed vertices Journal of Parallel and Distributed Computing 68609ndash625 2008 [Cited on pp 46]

[16] Ariful Azad Mahantesh Halappanavar Sivasankaran RajamanickamErik G Boman Arif Khan and Alex Pothen Multithreaded algo-rithms for maximum matching in bipartite graphs In IEEE 26th Inter-national Parallel Distributed Processing Symposium (IPDPS) pages860ndash872 Shanghai China 2012 [Cited on pp 57 59 66 73 74]

[17] Brett W Bader and Tamara G Kolda Efficient MATLAB compu-tations with sparse and factored tensors SIAM Journal on ScientificComputing 30(1)205ndash231 December 2007 [Cited on pp 11 37]

[18] Brett W. Bader, Tamara G. Kolda, et al. Matlab Tensor Toolbox version 3.0. Available online http://www.sandia.gov/~tgkolda/TensorToolbox/, 2017. [Cited on pp. 11, 37, 38, 157]

[19] David A Bader Henning Meyerhenke Peter Sanders and DorotheaWagner editors Graph Partitioning and Graph Clustering - 10thDIMACS Implementation Challenge Workshop Georgia Institute ofTechnology Atlanta GA USA February 13-14 2012 Proceedingsvolume 588 of Contemporary Mathematics American MathematicalSociety 2013 [Cited on pp 69]


[20] Muthu Baskaran Tom Henretty Benoit Pradelle M HarperLangston David Bruns-Smith James Ezick and Richard LethinMemory-efficient parallel tensor decompositions In 2017 IEEE HighPerformance Extreme Computing Conference (HPEC) pages 1ndash7Sep 2017 [Cited on pp 160]

[21] James Bennett and Stan Lanning The Netflix Prize In Proceedingsof KDD cup and workshop page 35 2007 [Cited on pp 10 48]

[22] Anne Benoit Yves Robert and Frederic Vivien A Guide to AlgorithmDesign Paradigms Methods and Complexity Analysis CRC PressBoca Raton FL 2014 [Cited on pp 93]

[23] Michele Benzi and Bora Uçar. Preconditioning techniques based on the Birkhoff–von Neumann decomposition. Computational Methods in Applied Mathematics, 17:201–215, 2017. [Cited on pp. 106, 108]

[24] Piotr Berman and Marek Karpinski Improved approximation lowerbounds on small occurence optimization ECCC Report 2003 [Citedon pp 113]

[25] Garrett Birkhoff Tres observaciones sobre el algebra lineal Universi-dad Nacional de Tucuman Series A 5147ndash154 1946 [Cited on pp8]

[26] Marcel Birn Vitaly Osipov Peter Sanders Christian Schulz andNodari Sitchinava Efficient parallel and external matching In Fe-lix Wolf Bernd Mohr and Dieter an Mey editors Euro-Par 2013Parallel Processing pages 659ndash670 Springer Berlin Heidelberg 2013[Cited on pp 59]

[27] Rob H Bisseling and Wouter Meesen Communication balancing inparallel sparse matrix-vector multiplication Electronic Transactionson Numerical Analysis 2147ndash65 2005 [Cited on pp 4 47]

[28] Zachary Blanco Bangtian Liu and Maryam Mehri Dehnavi CSTFLarge-scale sparse tensor factorizations on distributed platforms InProceedings of the 47th International Conference on Parallel Process-ing pages 211ndash2110 New York NY USA 2018 ACM [Cited onpp 160]

[29] Guy E Blelloch Jeremy T Fineman and Julian Shun Greedy sequen-tial maximal independent set and matching are parallel on average InProceedings of the Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 308ndash317 ACM 2012[Cited on pp 59 73 74 75]


[30] Norbert Blum A new approach to maximum matching in generalgraphs In Proceedings of the 17th International Colloquium on Au-tomata Languages and Programming pages 586ndash597 London UK1990 Springer-Verlag [Cited on pp 6 76]

[31] Richard A Brualdi The diagonal hypergraph of a matrix (bipartitegraph) Discrete Mathematics 27(2)127ndash147 1979 [Cited on pp 92]

[32] Richard A Brualdi Notes on the Birkhoff algorithm for doublystochastic matrices Canadian Mathematical Bulletin 25(2)191ndash1991982 [Cited on pp 9 92 94 98]

[33] Richard A Brualdi and Peter M Gibson Convex polyhedra of doublystochastic matrices I Applications of the permanent function Jour-nal of Combinatorial Theory Series A 22(2)194ndash230 1977 [Citedon pp 9]

[34] Richard A Brualdi and Herbert J Ryser Combinatorial Matrix The-ory volume 39 of Encyclopedia of Mathematics and its ApplicationsCambridge University Press Cambridge UK New York USA Mel-bourne Australia 1991 [Cited on pp 8]

[35] Thang Nguyen Bui and Curt Jones A heuristic for reducing fill-in insparse matrix factorization In Proc 6th SIAM Conf Parallel Pro-cessing for Scientific Computing pages 445ndash452 SIAM 1993 [Citedon pp 4 15]

[36] Peter J Cameron Notes on combinatorics httpscameroncountsfileswordpresscom201311combpdf (last checked Jan 2019)2013 [Cited on pp 76]

[37] Andrew Carlson Justin Betteridge Bryan Kisiel Burr Settles Este-vam R Hruschka Jr and Tom M Mitchell Toward an architecturefor never-ending language learning In AAAI volume 5 page 3 2010[Cited on pp 48]

[38] J Douglas Carroll and Jih-Jie Chang Analysis of individual dif-ferences in multidimensional scaling via an N-way generalization ofldquoEckart-Youngrdquo decomposition Psychometrika 35(3)283ndash319 1970[Cited on pp 36]

[39] Umit V Catalyurek and Cevdet Aykanat Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplicationIEEE Transactions on Parallel and Distributed Systems 10(7)673ndash693 Jul 1999 [Cited on pp 41 140]


[40] Ümit V. Çatalyürek and Cevdet Aykanat. PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0. Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999. [Cited on pp. 4, 7, 25, 47, 143]

[41] Umit V Catalyurek Kamer Kaya and Bora Ucar On shared-memoryparallelization of a sparse matrix scaling algorithm In 2012 41st Inter-national Conference on Parallel Processing pages 68ndash77 Los Alami-tos CA USA September 2012 IEEE Computer Society [Cited onpp 56]

[42] Umit V Catalyurek Mehmet Deveci Kamer Kaya and Bora UcarMultithreaded clustering for multi-level hypergraph partitioning In26th IEEE International Parallel and Distributed Processing Sympo-sium (IPDPS) pages 848ndash859 Shanghai China 2012 [Cited on pp59]

[43] Cheng-Shang Chang Wen-Jyh Chen and Hsiang-Yi Huang On ser-vice guarantees for input-buffered crossbar switches A capacity de-composition approach by Birkhoff and von Neumann In Quality ofService 1999 IWQoS rsquo99 1999 Seventh International Workshop onpages 79ndash86 1999 [Cited on pp 8]

[44] Jianer Chen Yang Liu Songjian Lu Barry OrsquoSullivan and Igor Raz-gon A fixed-parameter algorithm for the directed feedback vertex setproblem Journal of the ACM 55(5)211ndash2119 2008 [Cited on pp142]

[45] Joseph Cheriyan Randomized O(M(|V |)) algorithms for problemsin matching theory SIAM Journal on Computing 26(6)1635ndash16551997 [Cited on pp 76]

[46] Jee Choi, Xing Liu, Shaden Smith, and Tyler Simon. Blocking optimization techniques for sparse tensor computation. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 568–577, Vancouver, British Columbia, Canada, 2018. [Cited on pp. 160, 163]

[47] Joon Hee Choi and S V N Vishwanathan DFacTo Distributedfactorization of tensors In Zoubin Ghahramani Max Welling CorinnaCortes Neil D Lawrence and Kilian Q Weinberger editors 27thAdvances in Neural Information Processing Systems pages 1296ndash1304Curran Associates Inc Montreal Quebec Canada 2014 [Cited onpp 11 37 38 39]


[48] Edmond Chow and Aftab Patel Fine-grained parallel incomplete LUfactorization SIAM Journal on Scientific Computing 37(2)C169ndashC193 2015 [Cited on pp 107]

[49] Andrzej Cichocki Danilo P Mandic Lieven De Lathauwer GuoxuZhou Qibin Zhao Cesar Caiafa and Anh-Huy Phan Tensor decom-positions for signal processing applications From two-way to multiwaycomponent analysis IEEE Signal Processing Magazine 32(2)145ndash163March 2015 [Cited on pp 10]

[50] Edith Cohen Structure prediction and computation of sparse matrixproducts Journal of Combinatorial Optimization 2(4)307ndash332 Dec1998 [Cited on pp 127]

[51] Thomas F Coleman and Wei Xu Parallelism in structured Newtoncomputations In Parallel Computing Architectures Algorithms andApplications ParCo 2007 pages 295ndash302 Forschungszentrum Julichand RWTH Aachen University Germany 2007 [Cited on pp 3]

[52] Thomas F Coleman and Wei Xu Fast (structured) Newton computa-tions SIAM Journal on Scientific Computing 31(2)1175ndash1191 2009[Cited on pp 3]

[53] Thomas F Coleman and Wei Xu Automatic Differentiation in MAT-LAB using ADMAT with Applications SIAM 2016 [Cited on pp3]

[54] Jason Cong Zheng Li and Rajive Bagrodia Acyclic multi-way par-titioning of Boolean networks In Proceedings of the 31st Annual De-sign Automation Conference DACrsquo94 pages 670ndash675 New York NYUSA 1994 ACM [Cited on pp 4 17]

[55] Elizabeth H Cuthill and J McKee Reducing the bandwidth of sparsesymmetric matrices In Proceedings of the 24th national conferencepages 157ndash172 New York NY USA 1969 ACM [Cited on pp 12]

[56] Marek Cygan Improved approximation for 3-dimensional matchingvia bounded pathwidth local search In Foundations of ComputerScience (FOCS) 2013 IEEE 54th Annual Symposium on pages 509ndash518 IEEE 2013 [Cited on pp 113 120]

[57] Marek Cygan Fabrizio Grandoni and Monaldo Mastrolilli How to sellhyperedges The hypermatching assignment problem In Proc of thetwenty-fourth annual ACM-SIAM symposium on Discrete algorithmspages 342ndash351 SIAM 2013 [Cited on pp 113]


[58] Joseph Czyzyk Michael P Mesnier and Jorge J More The NEOSServer IEEE Journal on Computational Science and Engineering5(3)68ndash75 1998 [Cited on pp 96]

[59] Timothy A Davis and Yifan Hu The University of Florida sparsematrix collection ACM Transactions on Mathematical Software38(1)11ndash125 2011 [Cited on pp 25 69 82 105 143 161]

[60] Lieven De Lathauwer, Pierre Comon, Bart De Moor, and Joos Vandewalle. Higher-order power method – Application in independent component analysis. In Proceedings NOLTA'95, pages 91–96, Las Vegas, USA, 1995. [Cited on pp. 163]

[61] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000. [Cited on pp. 163]

[62] Dominique de Werra Variations on the Theorem of BirkhoffndashvonNeumann and extensions Graphs and Combinatorics 19(2)263ndash278Jun 2003 [Cited on pp 108]

[63] James W Demmel Stanley C Eisenstat John R Gilbert Xiaoye SLi and Joseph W H Liu A supernodal approach to sparse par-tial pivoting SIAM Journal on Matrix Analysis and Applications20(3)720ndash755 1999 [Cited on pp 132]

[64] James W Demmel John R Gilbert and Xiaoye S Li An asyn-chronous parallel supernodal algorithm for sparse gaussian elimina-tion SIAM Journal on Matrix Analysis and Applications 20(4)915ndash952 1999 [Cited on pp 132]

[65] Mehmet Deveci Kamer Kaya Umit V Catalyurek and Bora UcarA push-relabel-based maximum cardinality matching algorithm onGPUs In 42nd International Conference on Parallel Processing pages21ndash29 Lyon France 2013 [Cited on pp 57]

[66] Mehmet Deveci Kamer Kaya Bora Ucar and Umit V CatalyurekGPU accelerated maximum cardinality matching algorithms for bipar-tite graphs In Felix Wolf Bernd Mohr and Dieter an Mey editorsEuro-Par 2013 Parallel Processing volume 8097 of LNCS pages 850ndash861 Springer Berlin Heidelberg 2013 [Cited on pp 57]

[67] Pat Devlin and Jeff Kahn. Perfect fractional matchings in k-out hypergraphs. arXiv preprint arXiv:1703.03513, 2017. [Cited on pp. 114, 121]


[68] Elizabeth D Dolan The NEOS Server 40 administrative guide Tech-nical Memorandum ANLMCS-TM-250 Mathematics and ComputerScience Division Argonne National Laboratory 2001 [Cited on pp96]

[69] Elizabeth D Dolan and Jorge J More Benchmarking optimiza-tion software with performance profiles Mathematical Programming91(2)201ndash213 2002 [Cited on pp 26]

[70] Jack J Dongarra Fred G Gustavson and Alan H Karp Implement-ing linear algebra algorithms for dense matrices on a vector pipelinemachine SIAM Review 26(1)pp 91ndash112 1984 [Cited on pp 134]

[71] Iain S Duff Kamer Kaya and Bora Ucar Design implementationand analysis of maximum transversal algorithms ACM Transactionson Mathematical Software 38(2)131ndash1331 2011 [Cited on pp 656 58]

[72] Iain S Duff and Jacko Koster On algorithms for permuting largeentries to the diagonal of a sparse matrix SIAM Journal on MatrixAnalysis and Applications 22973ndash996 2001 [Cited on pp 98 143]

[73] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Ap-proximation algorithms for maximum matchings in undirected graphsIn 2018 Proceedings of the Seventh SIAM Workshop on CombinatorialScientific Computing pages 56ndash65 2018 [Cited on pp 75 79 113]

[74] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Fur-ther notes on Birkhoffndashvon Neumann decomposition of doubly stochas-tic matrices Linear Algebra and its Applications 55468ndash78 2018[Cited on pp 101 113 117]

[75] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Ef-fective heuristics for matchings in hypergraphs Research Report RR-9224 Inria Grenoble Rhone-Alpes November 2018 (will appear inSEA2) [Cited on pp 118 119 122]

[76] Fanny Dufosse Kamer Kaya and Bora Ucar Two approximationalgorithms for bipartite matching on multicore architectures Journalof Parallel and Distributed Computing 8562ndash78 2015 IPDPS 2014Selected Papers on Numerical and Combinatorial Algorithms [Citedon pp 56 60 65 66 67 69 77 79 81 113 117 125]

[77] Andrew Lloyd Dulmage and Nathan Saul Mendelsohn Coverings ofbipartite graphs Canadian Journal of Mathematics 10517ndash534 1958[Cited on pp 68 143]


[78] Martin E Dyer and Alan M Frieze Randomized greedy matchingRandom Structures amp Algorithms 2(1)29ndash45 1991 [Cited on pp 58113 115]

[79] Jack Edmonds Paths trees and flowers Canadian Journal of Math-ematics 17449ndash467 1965 [Cited on pp 76]

[80] Lawrence C Eggan Transition graphs and the star-height of regu-lar events The Michigan Mathematical Journal 10(4)385ndash397 1963[Cited on pp 5]

[81] Stanley C Eisenstat and Joseph W H Liu The theory of elimina-tion trees for sparse unsymmetric matrices SIAM Journal on MatrixAnalysis and Applications 26(3)686ndash705 2005 [Cited on pp 5 131132 133 135]

[82] Stanley C Eisenstat and Joseph W H Liu A tree-based dataflowmodel for the unsymmetric multifrontal method Electronic Transac-tions on Numerical Analysis 211ndash19 2005 [Cited on pp 132]

[83] Stanley C Eisenstat and Joseph W H Liu Algorithmic aspects ofelimination trees for sparse unsymmetric matrices SIAM Journal onMatrix Analysis and Applications 29(4)1363ndash1381 2008 [Cited onpp 5 132 143 145]

[84] Venmugil Elango Fabrice Rastello Louis-Noel Pouchet J Ramanu-jam and P Sadayappan On characterizing the data access complex-ity of programs ACM SIGPLAN Notices - POPLrsquo15 50(1)567ndash580January 2015 [Cited on pp 3]

[85] Paul Erdos and Alfred Renyi On random matrices Publicationsof the Mathematical Institute of the Hungarian Academy of Sciences8(3)455ndash461 1964 [Cited on pp 72]

[86] Bastiaan Onne Fagginger Auer and Rob H Bisseling A GPU algo-rithm for greedy graph matching In Facing the Multicore ChallengeII volume 7174 of LNCS pages 108ndash119 Springer-Verlag Berlin 2012[Cited on pp 59]

[87] Naznin Fauzia Venmugil Elango Mahesh Ravishankar J Ramanu-jam Fabrice Rastello Atanas Rountev Louis-Noel Pouchet andP Sadayappan Beyond reuse distance analysis Dynamic analysis forcharacterization of data locality potential ACM Transactions on Ar-chitecture and Code Optimization 10(4)531ndash5329 December 2013[Cited on pp 3 16]


[88] Charles M Fiduccia and Robert M Mattheyses A linear-time heuris-tic for improving network partitions In 19th Design Automation Con-ference pages 175ndash181 Las Vegas USA 1982 IEEE [Cited on pp17 22]

[89] Joel Franklin and Jens Lorenz On the scaling of multidimensional ma-trices Linear Algebra and its Applications 114717ndash735 1989 [Citedon pp 114]

[90] Alan M Frieze Maximum matchings in a class of random graphsJournal of Combinatorial Theory Series B 40(2)196ndash212 1986[Cited on pp 76 82 121]

[91] Aurelien Froger Olivier Guyon and Eric Pinson A set packing ap-proach for scheduling passenger train drivers the French experienceIn RailTokyo2015 Tokyo Japan March 2015 [Cited on pp 7]

[92] Harold N Gabow and Robert E Tarjan Faster scaling algorithms fornetwork problems SIAM Journal on Computing 18(5)1013ndash10361989 [Cited on pp 6 76]

[93] Michael R Garey and David S Johnson Computers and IntractabilityA Guide to the Theory of NP-Completeness W H Freeman amp CoNew York NY USA 1979 [Cited on pp 3 47 93]

[94] Alan George Nested dissection of a regular finite element mesh SIAMJournal on Numerical Analysis 10(2)345ndash363 1973 [Cited on pp136]

[95] Alan J George Computer implementation of the finite elementmethod PhD thesis Stanford University Stanford CA USA 1971[Cited on pp 12]

[96] Andrew V Goldberg and Robert Kennedy Global price updates helpSIAM Journal on Discrete Mathematics 10(4)551ndash572 1997 [Citedon pp 6]

[97] Olaf Gorlitz Sergej Sizov and Steffen Staab Pints Peer-to-peerinfrastructure for tagging systems In Proceedings of the 7th Interna-tional Conference on Peer-to-peer Systems IPTPSrsquo08 page 19 Berke-ley CA USA 2008 USENIX Association [Cited on pp 48]

[98] Georg Gottlob and Gianluigi Greco Decomposing combinatorial auc-tions and set packing problems Journal of the ACM 60(4)241ndash2439September 2013 [Cited on pp 7]

[99] Lars Grasedyck Hierarchical singular value decomposition of tensorsSIAM Journal on Matrix Analysis and Applications 31(4)2029ndash20542010 [Cited on pp 52]


[100] William Gropp and Jorge J More Optimization environments andthe NEOS Server In Martin D Buhman and Arieh Iserles editorsApproximation Theory and Optimization pages 167ndash182 CambridgeUniversity Press 1997 [Cited on pp 96]

[101] Hermann Gruber Digraph complexity measures and applications informal language theory Discrete Mathematics amp Theoretical Com-puter Science 14(2)189ndash204 2012 [Cited on pp 5]

[102] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010. [Cited on pp. 47]

[103] Abdou Guermouche Jean-Yves LrsquoExcellent and Gil Utard Impact ofreordering on the memory of a multifrontal solver Parallel Computing29(9)1191ndash1218 2003 [Cited on pp 131]

[104] Anshul Gupta Improved symbolic and numerical factorization algo-rithms for unsymmetric sparse matrices SIAM Journal on MatrixAnalysis and Applications 24(2)529ndash552 2002 [Cited on pp 132]

[105] Mahantesh Halappanavar John Feo Oreste Villa Antonino Tumeoand Alex Pothen Approximate weighted matching on emerging many-core and multithreaded architectures International Journal of HighPerformance Computing Applications 26(4)413ndash430 2012 [Cited onpp 59]

[106] Magnus M Halldorsson Approximating discrete collections via localimprovements In The Sixth annual ACM-SIAM symposium on Dis-crete Algorithms (SODA) volume 95 pages 160ndash169 San FranciscoCalifornia USA 1995 SIAM [Cited on pp 113]

[107] Richard A Harshman Foundations of the PARAFAC procedureModels and conditions for an ldquoexplanatoryrdquo multi-modal factor anal-ysis UCLA Working Papers in Phonetics 161ndash84 1970 [Cited onpp 36]

[108] Nicholas J A Harvey Algebraic algorithms for matching and matroidproblems SIAM Journal on Computing 39(2)679ndash702 2009 [Citedon pp 76]

[109] Elad Hazan Shmuel Safra and Oded Schwartz On the complexity ofapproximating k-dimensional matching In Approximation Random-ization and Combinatorial Optimization Algorithms and Techniquespages 83ndash97 Springer 2003 [Cited on pp 113]

[110] Elad Hazan Shmuel Safra and Oded Schwartz On the complexity ofapproximating k-set packing Computational Complexity 15(1)20ndash392006 [Cited on pp 7]


[111] Bruce Hendrickson Load balancing fictions falsehoods and fallaciesApplied Mathematical Modelling 2599ndash108 2000 [Cited on pp 41]

[112] Bruce Hendrickson and Tamara G Kolda Graph partitioning modelsfor parallel computing Parallel Computing 26(12)1519ndash1534 2000[Cited on pp 41]

[113] Bruce Hendrickson and Robert Leland The Chaco userrsquos guide ver-sion 10 Technical Report SAND93ndash2339 Sandia National Laborato-ries Albuquerque NM October 1993 [Cited on pp 4 15 16]

[114] Julien Herrmann Jonathan Kho Bora Ucar Kamer Kaya andUmit V Catalyurek Acyclic partitioning of large directed acyclicgraphs In Proceedings of the 17th IEEEACM International Sympo-sium on Cluster Cloud and Grid Computing CCGRID pages 371ndash380 Madrid Spain May 2017 [Cited on pp 15 16 17 18 28 30]

[115] Julien Herrmann M Yusuf Ozkaya Bora Ucar Kamer Kaya andUmit V Catalyurek Multilevel algorithms for acyclic partitioning ofdirected acyclic graphs SIAM Journal on Scientific Computing 2019to appear [Cited on pp 19 20 26 29 31]

[116] Jonathan D Hogg John K Reid and Jennifer A Scott Design of amulticore sparse Cholesky factorization using DAGs SIAM Journalon Scientific Computing 32(6)3627ndash3649 2010 [Cited on pp 131]

[117] John E Hopcroft and Richard M Karp An n52 algorithm for max-imum matchings in bipartite graphs SIAM Journal on Computing2(4)225ndash231 1973 [Cited on pp 6 62]

[118] Roger A Horn and Charles R Johnson Matrix Analysis CambridgeUniversity Press New York 1985 [Cited on pp 109]

[119] Cor A J Hurkens and Alexander Schrijver On the size of systemsof sets every t of which have an SDR with an application to theworst-case ratio of heuristics for packing problems SIAM Journal onDiscrete Mathematics 2(1)68ndash72 1989 [Cited on pp 113 120]

[120] Oscar H Ibarra and Shlomo Moran Deterministic and probabilisticalgorithms for maximum bipartite matching via fast matrix multipli-cation Information Processing Letters pages 12ndash15 1981 [Cited onpp 76]

[121] Inah Jeon Evangelos E Papalexakis U Kang and Christos Falout-sos Haten2 Billion-scale tensor decompositions In IEEE 31st Inter-national Conference on Data Engineering (ICDE) pages 1047ndash1058April 2015 [Cited on pp 156 161]


[122] Jochen A G Jess and Hendrikus Gerardus Maria Kees A data struc-ture for parallel LU decomposition IEEE Transactions on Comput-ers 31(3)231ndash239 1982 [Cited on pp 134]

[123] U Kang Evangelos Papalexakis Abhay Harpale and Christos Falout-sos GigaTensor Scaling tensor analysis up by 100 times - Algorithmsand discoveries In Proceedings of the 18th ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Mining KDDrsquo12 pages 316ndash324 New York NY USA 2012 ACM [Cited on pp10 11 37]

[124] Michał Karoński, Ed Overman, and Boris Pittel. On a perfect matching in a random bipartite digraph with average out-degree below two. arXiv e-prints, arXiv:1903.05764, Mar 2019. [Cited on pp. 59, 121]

[125] Micha l Karonski and Boris Pittel Existence of a perfect matching ina random (1 + eminus1)ndashout bipartite graph Journal of CombinatorialTheory Series B 881ndash16 May 2003 [Cited on pp 59 67 81 82121]

[126] Richard M Karp Reducibility among combinatorial problems InRaymond E Miller James W Thatcher and Jean D Bohlinger edi-tors Complexity of Computer Computations The IBM Research Sym-posiax pages 85ndash103 Springer US 1972 [Cited on pp 7 142]

[127] Richard M Karp Alexander H G Rinnooy Kan and Rakesh VVohra Average case analysis of a heuristic for the assignment prob-lem Mathematics of Operations Research 19513ndash522 1994 [Citedon pp 81]

[128] Richard M Karp and Michael Sipser Maximum matching in sparserandom graphs In 22nd Annual IEEE Symposium on Foundations ofComputer Science (FOCS) pages 364ndash375 Nashville TN USA 1981[Cited on pp 56 58 67 75 81 113]

[129] Richard M Karp Umesh V Vazirani and Vijay V Vazirani An op-timal algorithm for on-line bipartite matching In 22nd annual ACMsymposium on Theory of computing (STOC) pages 352ndash358 Balti-more MD USA 1990 [Cited on pp 57]

[130] George Karypis and Vipin Kumar MeTiS A Software Package forPartitioning Unstructured Graphs Partitioning Meshes and Comput-ing Fill-Reducing Orderings of Sparse Matrices Version 40 Univer-sity of Minnesota Department of Comp Sci and Eng Army HPCResearch Cent Minneapolis 1998 [Cited on pp 4 16 23 27 142]


[131] George Karypis and Vipin Kumar Multilevel algorithms for multi-constraint hypergraph partitioning Technical Report 99-034 Uni-versity of Minnesota Department of Computer ScienceArmy HPCResearch Center Minneapolis MN 55455 November 1998 [Cited onpp 34]

[132] Kamer Kaya Johannes Langguth Fredrik Manne and Bora UcarPush-relabel based algorithms for the maximum transversal problemComputers and Operations Research 40(5)1266ndash1275 2013 [Citedon pp 6 56 58]

[133] Kamer Kaya and Bora Ucar Constructing elimination trees for sparseunsymmetric matrices SIAM Journal on Matrix Analysis and Appli-cations 34(2)345ndash354 2013 [Cited on pp 5]

[134] Oguz Kaya and Bora Ucar Scalable sparse tensor decompositions indistributed memory systems In Proceedings of the International Con-ference for High Performance Computing Networking Storage andAnalysis SC rsquo15 pages 771ndash7711 Austin Texas 2015 ACM NewYork NY USA [Cited on pp 7]

[135] Oguz Kaya and Bora Ucar Parallel CandecompParafac decomposi-tion of sparse tensors using dimension trees SIAM Journal on Scien-tific Computing 40(1)C99ndashC130 2018 [Cited on pp 52]

[136] Enver Kayaaslan and Bora Ucar Reducing elimination tree heightfor parallel LU factorization of sparse unsymmetric matrices In 21stInternational Conference on High Performance Computing (HiPC2014) pages 1ndash10 Goa India December 2014 [Cited on pp 136]

[137] Brian W Kernighan Optimal sequential partitions of graphs Journalof the ACM 18(1)34ndash40 January 1971 [Cited on pp 16]

[138] Shiva Kintali Nishad Kothari and Akash Kumar Approximationalgorithms for directed width parameters CoRR abs11074824 2011[Cited on pp 138]

[139] Philip A Knight The SinkhornndashKnopp algorithm Convergence andapplications SIAM Journal on Matrix Analysis and Applications30(1)261ndash275 2008 [Cited on pp 55 68]

[140] Philip A Knight and Daniel Ruiz A fast algorithm for matrix bal-ancing IMA Journal of Numerical Analysis 33(3)1029ndash1047 2013[Cited on pp 56 79]

[141] Philip A Knight Daniel Ruiz and Bora Ucar A symmetry preservingalgorithm for matrix scaling SIAM Journal on Matrix Analysis andApplications 35(3)931ndash955 2014 [Cited on pp 55 56 68 79 105]


[142] Tamara G Kolda and Brett Bader Tensor decompositions and appli-cations SIAM Review 51(3)455ndash500 2009 [Cited on pp 10 37]

[143] Tjalling C Koopmans and Martin Beckmann Assignment problemsand the location of economic activities Econometrica 25(1)53ndash761957 [Cited on pp 109]

[144] Mads R B Kristensen Simon A F Lund Troels Blum and JamesAvery Fusion of parallel array operations In Proceedings of the 2016International Conference on Parallel Architectures and Compilationpages 71ndash85 New York NY USA 2016 ACM [Cited on pp 3]

[145] Mads R B Kristensen Simon A F Lund Troels Blum KennethSkovhede and Brian Vinter Bohrium A virtual machine approachto portable parallelism In Proceedings of the 2014 IEEE InternationalParallel amp Distributed Processing Symposium Workshops pages 312ndash321 Washington DC USA 2014 [Cited on pp 3]

[146] Janardhan Kulkarni, Euiwoong Lee, and Mohit Singh. Minimum Birkhoff-von Neumann decomposition. Preliminary version http://www.cs.cmu.edu/~euiwoonl/sparsebvn.pdf of the paper which appeared in Proc. 19th International Conference on Integer Programming and Combinatorial Optimization (IPCO 2017), Waterloo, ON, Canada, pp. 343–354, June 2017. [Cited on pp. 9]

[147] Sebastian Lamm Peter Sanders Christian Schulz Darren Strash andRenato F Werneck Finding Near-Optimal Independent Sets at ScaleIn Proceedings of the 18th Workshop on Algorithm Engineering andExperiments (ALENEXrsquo16) pages 138ndash150 Arlington Virginia USA2016 SIAM [Cited on pp 114 127]

[148] Johannes Langguth Fredrik Manne and Peter Sanders Heuristicinitialization for bipartite matching problems Journal of ExperimentalAlgorithmics 1511ndash122 2010 [Cited on pp 6 56 58]

[149] Yanzhe (Murray) Lei Stefanus Jasin Joline Uichanco and AndrewVakhutinsky Randomized product display (ranking) pricing and or-der fulfillment for e-commerce retailers SSRN Electronic Journal2018 [Cited on pp 9]

[150] Thomas Lengauer Combinatorial Algorithms for Integrated CircuitLayout WileyndashTeubner Chichester UK 1990 [Cited on pp 34]

[151] Renaud Lepere and Denis Trystram A new clustering algorithm forlarge communication delays In 16th International Parallel and Dis-tributed Processing Symposium (IPDPS) page (CDROM) Fort Laud-erdale FL USA 2002 IEEE Computer Society [Cited on pp 32]


[152] Hanoch Levy and David W Low A contraction algorithm for findingsmall cycle cutsets Journal of Algorithms 9(4)470ndash493 1988 [Citedon pp 142]

[153] Jiajia Li Jee Choi Ioakeim Perros Jimeng Sun and Richard VuducModel-driven sparse CP decomposition for higher-order tensors InIPDPS 2017 31th IEEE International Symposium on Parallel andDistributed Processing pages 1048ndash1057 Orlando FL USA May2017 [Cited on pp 52]

[154] Jiajia Li, Yuchen Ma, and Richard Vuduc. ParTI: A Parallel Tensor Infrastructure for Multicore CPU and GPUs (Version 0.1.0). Available from https://github.com/hpcgarage/ParTI, 2016. [Cited on pp. 156]

[155] Jiajia Li Jimeng Sun and Richard Vuduc HiCOO Hierarchical stor-age of sparse tensors In Proceedings of the International Conferencefor High Performance Computing Networking Storage and AnalysisSCrsquo18 New York NY USA 2018 [Cited on pp 148 149 155 158]

[156] Jiajia Li Bora Ucar Umit V Catalyurek Jimeng Sun Kevin Barkerand Richard Vuduc Efficient and effective sparse tensor reordering InProceedings of the ACM International Conference on SupercomputingICS rsquo19 pages 227ndash237 New York NY USA 2019 ACM [Cited onpp 147 156]

[157] Xiaoye S Li and James W Demmel Making sparse Gaussian elim-ination scalable by static pivoting In Proc of the 1998 ACMIEEEConference on Supercomputing SC rsquo98 pages 1ndash17 Washington DCUSA 1998 [Cited on pp 132]

[158] Athanasios P Liavas and Nicholas D Sidiropoulos Parallel algorithmsfor constrained tensor factorization via alternating direction methodof multipliers IEEE Transactions on Signal Processing 63(20)5450ndash5463 Oct 2015 [Cited on pp 11]

[159] Bangtian Liu Chengyao Wen Anand D Sarwate and Maryam MehriDehnavi A unified optimization approach for sparse tensor opera-tions on GPUs In 2017 IEEE International Conference on ClusterComputing (CLUSTER) pages 47ndash57 Sept 2017 [Cited on pp 160]

[160] Joseph W H Liu Computational models and task scheduling forparallel sparse Cholesky factorization Parallel Computing 3(4)327ndash342 1986 [Cited on pp 131 134]

[161] Joseph W H Liu A graph partitioning algorithm by node separatorsACM Transactions on Mathematical Software 15(3)198ndash219 1989[Cited on pp 140]


[162] Joseph W H Liu Reordering sparse matrices for parallel eliminationParallel Computing 11(1)73 ndash 91 1989 [Cited on pp 131]

[163] Liang Liu Jun (Jim) Xu and Lance Fortnow Quantized BvND Abetter solution for optical and hybrid switching in data center net-works In 2018 IEEEACM 11th International Conference on Utilityand Cloud Computing pages 237ndash246 Dec 2018 [Cited on pp 9]

[164] Zvi Lotker Boaz Patt-Shamir and Seth Pettie Improved distributedapproximate matching In Proceedings of the Twentieth Annual Sym-posium on Parallelism in Algorithms and Architectures (SPAA) pages129ndash136 Munich Germany 2008 ACM [Cited on pp 59]

[165] Laszlo Lovasz On determinants matchings and random algorithmsIn FCT pages 565ndash574 1979 [Cited on pp 76]

[166] Laszlo Lovasz and Micheal D Plummer Matching Theory ElsevierScience Publishers Netherlands 1986 [Cited on pp 76]

[167] Anna Lubiw Doubly lexical orderings of matrices SIAM Journal onComputing 16(5)854ndash879 1987 [Cited on pp 151 152 153]

[168] Yuchen Ma Jiajia Li Xiaolong Wu Chenggang Yan Jimeng Sunand Richard Vuduc Optimizing sparse tensor times matrix on GPUsJournal of Parallel and Distributed Computing 12999ndash109 2019[Cited on pp 160]

[169] Jakob Magun Greedy matching algorithms an experimental studyJournal of Experimental Algorithmics 36 1998 [Cited on pp 6]

[170] Marvin Marcus and R Ree Diagonals of doubly stochastic matricesThe Quarterly Journal of Mathematics 10(1)296ndash302 1959 [Citedon pp 9]

[171] Nick McKeown The iSLIP scheduling algorithm for input-queuedswitches EEEACM Transactions on Networking 7(2)188ndash201 1999[Cited on pp 6]

[172] Amram Meir and John W Moon The expected node-independencenumber of random trees Indagationes Mathematicae 76335ndash3411973 [Cited on pp 67]

[173] John Mellor-Crummey David Whaley and Ken Kennedy Improvingmemory hierarchy performance for irregular applications using dataand computation reorderings International Journal of Parallel Pro-gramming 29(3)217ndash247 Jun 2001 [Cited on pp 160]


[174] Silvio Micali and Vijay V. Vazirani. An O(√|V| |E|) algorithm for finding maximum matching in general graphs. In 21st Annual Symposium on Foundations of Computer Science, pages 17–27, Syracuse, NY, USA, 1980. IEEE. [Cited on pp. 6, 76]

[175] Orlando Moreira Merten Popp and Christian Schulz Graph parti-tioning with acyclicity constraints In 16th International Symposiumon Experimental Algorithms SEA London UK 2017 [Cited on pp16 17 28 29 32]

[176] Orlando Moreira Merten Popp and Christian Schulz Evolutionarymulti-level acyclic graph partitioning In Proceedings of the Geneticand Evolutionary Computation Conference GECCO pages 332ndash339Kyoto Japan 2018 ACM [Cited on pp 17 25 32]

[177] Marcin Mucha and Piotr Sankowski Maximum matchings in planargraphs via Gaussian elimination In S Albers and T Radzik editorsESA 2004 pages 532ndash543 Berlin Heidelberg 2004 Springer BerlinHeidelberg [Cited on pp 76]

[178] Marcin Mucha and Piotr Sankowski Maximum matchings via Gaus-sian elimination In FOCS rsquo04 pages 248ndash255 Washington DC USA2004 IEEE Computer Society [Cited on pp 76]

[179] Huda Nassar, Georgios Kollias, Ananth Grama, and David F. Gleich. Low rank methods for multiple network alignment. arXiv e-prints, arXiv:1809.08198, Sep 2018. [Cited on pp. 128]

[180] Jaroslav Nesetril and Patrice Ossona de Mendez Tree-depth subgraphcoloring and homomorphism bounds European Journal of Combina-torics 27(6)1022ndash1041 2006 [Cited on pp 132]

[181] M Yusuf Ozkaya Anne Benoit Bora Ucar Julien Herrmann andUmit V Catalyurek A scalable clustering-based task scheduler forhomogeneous processors using DAG partitioning In 33rd IEEE Inter-national Parallel and Distributed Processing Symposium pages 155ndash165 Rio de Janeiro Brazil May 2019 [Cited on pp 32]

[182] Robert Paige and Robert E Tarjan Three partition refinement algo-rithms SIAM Journal on Computing 16(6)973ndash989 December 1987[Cited on pp 152 153]

[183] Chiristos H Papadimitriou and Kenneth Steiglitz Combinatorial Op-timization Algorithms and Complexity Dover Publications (Cor-rected unabridged reprint of Combinatorial Optimization Algo-rithms and Complexity originally published by Prentice-Hall Inc NewJersey 1982) New York 1998 [Cited on pp 29]


[184] Panos M Pardalos Tianbing Qian and Mauricio G C Resende Agreedy randomized adaptive search procedure for the feedback vertexset problem Journal of Combinatorial Optimization 2(4)399ndash4121998 [Cited on pp 142]

[185] Md Mostofa Ali Patwary Rob H Bisseling and Fredrik Manne Par-allel greedy graph matching using an edge partitioning approach InProceedings of the Fourth International Workshop on High-level Paral-lel Programming and Applications HLPP rsquo10 pages 45ndash54 New YorkNY USA 2010 ACM [Cited on pp 59]

[186] Francois Pellegrini SCOTCH 51 Userrsquos Guide Laboratoire Bordelaisde Recherche en Informatique (LaBRI) 2008 [Cited on pp 4 16 23]

[187] Daniel M Pelt and Rob H Bisseling A medium-grain method forfast 2D bipartitioning of sparse matrices In IPDPS2014 IEEE 28thInternational Parallel and Distributed Processing Symposium pages529ndash539 Phoenix Arizona May 2014 [Cited on pp 52]

[188] Ioakeim Perros Evangelos E Papalexakis Fei Wang Richard VuducElizabeth Searles Michael Thompson and Jimeng Sun SPARTanScalable PARAFAC2 for large and sparse data In Proceedings of the23rd ACM SIGKDD International Conference on Knowledge Discov-ery and Data Mining KDD rsquo17 pages 375ndash384 New York NY USA2017 ACM [Cited on pp 156]

[189] Anh-Huy Phan Petr Tichavsky and Andrzej Cichocki Fast alternat-ing LS algorithms for high order CANDECOMPPARAFAC tensorfactorizations IEEE Transactions on Signal Processing 61(19)4834ndash4846 Oct 2013 [Cited on pp 52]

[190] Juan C Pichel Francisco F Rivera Marcos Fernandez and AurelioRodrıguez Optimization of sparse matrixndashvector multiplication usingreordering techniques on GPUs Microprocessors and Microsystems36(2)65ndash77 2012 [Cited on pp 160]

[191] Matthias Poloczek and Mario Szegedy Randomized greedy algorithmsfor the maximum matching problem with new analysis In IEEE 53rdAnnual Sym on Foundations of Computer Science (FOCS) pages708ndash717 New Brunswick NJ USA 2012 [Cited on pp 58]

[192] Alex Pothen Sparse null bases and marriage theorems PhD the-sis Dept Computer Science Cornell Univ Ithaca New York 1984[Cited on pp 68]

[193] Alex Pothen The complexity of optimal elimination trees TechnicalReport CS-88-13 Pennsylvania State Univ 1988 [Cited on pp 131]


[194] Alex Pothen and Chin-Ju Fan Computing the block triangular formof a sparse matrix ACM Transactions on Mathematical Software16(4)303ndash324 1990 [Cited on pp 58 68 115]

[195] Alex Pothen S M Ferdous and Fredrik Manne Approximation algo-rithms in combinatorial scientific computing Acta Numerica 28541mdash633 2019 [Cited on pp 58]

[196] Louis-Noël Pouchet. Polybench: The polyhedral benchmark suite. URL: http://web.cse.ohio-state.edu/~pouchet/software/polybench/, 2012. [Cited on pp. 25, 26]

[197] Michael O Rabin and Vijay V Vazirani Maximum matchings in gen-eral graphs through randomization Journal of Algorithms 10(4)557ndash567 December 1989 [Cited on pp 76]

[198] Daniel Ruiz A scaling algorithm to equilibrate both row and columnnorms in matrices Technical Report TR-2001-034 RAL 2001 [Citedon pp 55 56]

[199] Peter Sanders and Christian Schulz Engineering multilevel graphpartitioning algorithms In Camil Demetrescu and Magnus MHalldorsson editors Algorithms ndash ESA 2011 19th Annual EuropeanSymposium September 5-9 2011 pages 469ndash480 Saarbrucken Ger-many 2011 [Cited on pp 4 16 23]

[200] Robert Schreiber A new implementation of sparse Gaussian elimi-nation ACM Transactions on Mathematical Software 8(3)256ndash2761982 [Cited on pp 131]

[201] R Oguz Selvitopi Muhammet Mustafa Ozdal and Cevdet AykanatA novel method for scaling iterative solvers Avoiding latency over-head of parallel sparse-matrix vector multiplies IEEE Transactionson Parallel and Distributed Systems 26(3)632ndash645 2015 [Cited onpp 51]

[202] Richard Sinkhorn and Paul Knopp Concerning nonnegative matri-ces and doubly stochastic matrices Pacific Journal of Mathematics21343ndash348 1967 [Cited on pp 55 69 114]

[203] Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. FROSTT: The formidable repository of open sparse tensors and tools. http://frostt.io/, 2017. [Cited on pp. 52, 126, 156, 161]

[204] Shaden Smith and George Karypis. A medium-grained algorithm for sparse tensor factorization. In 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23-27, 2016, pages 902–911, 2016. [Cited on pp. 37, 52]

[205] Shaden Smith and George Karypis. SPLATT: The Surprisingly ParalleL spArse Tensor Toolkit (Version 1.1.1). Available from https://github.com/ShadenSmith/splatt, 2016. [Cited on pp. 37, 52, 157]

[206] Shaden Smith Niranjay Ravindran Nicholas D Sidiropoulos andGeorge Karypis SPLATT Efficient and parallel sparse tensor-matrixmultiplication In 29th IEEE International Parallel amp Distributed Pro-cessing Symposium pages 61ndash70 Hyderabad India May 2015 [Citedon pp 11 36 37 38 39 42 158 159]

[207] Robert E Tarjan and Mihalis Yannakakis Simple linear-time algo-rithms to test chordality of graphs test acyclicity of hypergraphs andselectively reduce acyclic hypergraphs SIAM Journal on Computing13(3)566ndash579 1984 [Cited on pp 149]

[208] Bora Ucar and Cevdet Aykanat Encapsulating multiplecommunication-cost metrics in partitioning sparse rectangular matri-ces for parallel matrix-vector multiplies SIAM Journal on ScientificComputing 25(6)1837ndash1859 2004 [Cited on pp 47 51]

[209] Bora Ucar and Cevdet Aykanat Revisiting hypergraph models forsparse matrix partitioning SIAM Review 49(4)595ndash603 2007 [Citedon pp 42]

[210] Vijay V Vazirani An improved definition of blossoms and a simplerproof of the MV matching algorithm CoRR abs12104594 2012[Cited on pp 76]

[211] David W Walkup Matchings in random regular bipartite digraphsDiscrete Mathematics 31(1)59ndash64 1980 [Cited on pp 59 67 76 81114 121]

[212] Chris Walshaw Multilevel refinement for combinatorial optimisationproblems Annals of Operations Research 131(1)325ndash372 Oct 2004[Cited on pp 25]

[213] Eric S H Wong Evangeline F Y Young and Wai-Kei Mak Clus-tering based acyclic multi-way partitioning In Proceedings of the 13thACM Great Lakes Symposium on VLSI GLSVLSI rsquo03 pages 203ndash206New York NY USA 2003 ACM [Cited on pp 4]

[214] Tao Yang and Apostolos Gerasoulis DSC scheduling parallel tasks onan unbounded number of processors IEEE Transactions on Paralleland Distributed Systems 5(9)951ndash967 1994 [Cited on pp 131]


[215] Albert-Jan Nicholas Yzelman and Dirk Roose High-level strategiesfor parallel shared-memory sparse matrix-vector multiplication IEEETransactions on Parallel and Distributed Systems 25(1)116ndash125 Jan2014 [Cited on pp 160]



iii

iv CONTENTS

II Matching and related problems 53

4 Matchings in graphs 55

41 Scaling matrices to doubly stochastic form 55

42 Approximation algorithms for bipartite graphs 56

421 Related work 57

422 Two matching heuristics 59

423 Experiments 69

43 Approximation algorithms for general undirected graphs 75

431 Related work 76

432 One-Out its analysis and variants 77

433 Experiments 82

44 Summary further notes and references 87

5 Birkhoff-von Neumann decomposition 91

51 The minimum number of permutation matrices 91

511 Preliminaries 92

512 The computational complexity of MinBvNDec 93

52 A result on the polytope of BvN decompositions 94

53 Two heuristics 97

54 Experiments 103

55 A generalization of Birkhoff-von Neumann theorem 106

551 Doubly stochastic splitting 106

552 Birkhoff von-Neumann decomposition for arbitrary ma-trices 108

56 Summary further notes and references 109

6 Matchings in hypergraphs 113

6.1 Background and notation 114

6.2 The heuristics 114

6.2.1 Greedy-H 115

6.2.2 Karp-Sipser-H 115

6.2.3 Karp-Sipser-H-scaling 117

6.2.4 Karp-Sipser-H-mindegree 118

6.2.5 Bipartite-reduction 119

6.2.6 Local-Search 120

6.3 Experiments 120

6.4 Summary, further notes, and references 128

III Ordering matrices and tensors 129

7 Elimination trees and height-reducing orderings 131

7.1 Background 132


7.2 The critical path length and the elimination tree height 134

7.3 Reducing the elimination tree height 136

7.3.1 Theoretical basis 137

7.3.2 Recursive approach: Permute 137

7.3.3 BBT decomposition: PermuteBBT 138

7.3.4 Edge-separator-based method 139

7.3.5 Vertex-separator-based method 141

7.3.6 BT decomposition: PermuteBT 142

7.4 Experiments 142

7.5 Summary, further notes, and references 145

8 Sparse tensor ordering 147

8.1 Three sparse tensor storage formats 147

8.2 Problem definition and solutions 149

8.2.1 BFS-MCS 149

8.2.2 Lexi-Order 151

8.2.3 Analysis 154

8.3 Experiments 156

8.4 Summary, further notes, and references 160

IV Closing 165

9 Concluding remarks 167

Bibliography 170


List of Algorithms

1 CP-ALS for the 3rd order tensors 36
2 Mttkrp for the 3rd order tensors 38
3 Coarse-grain Mttkrp for the first mode of third order tensors at process p within CP-ALS 40
4 Fine-grain Mttkrp for the first mode of 3rd order tensors at process p within CP-ALS 44
5 ScaleSK: Sinkhorn-Knopp scaling 56
6 OneSidedMatch for bipartite graph matching 60
7 TwoSidedMatch for bipartite graph matching 62
8 Karp-Sipser-MT for bipartite 1-out graph matching 64
9 One-Out for undirected graph matching 78
10 Karp-Sipser-One-Out for undirected 1-out graph matching 80
11 GreedyBvN for constructing a BvN decomposition 98
12 Karp-Sipser-H-scaling for hypergraph matching 117
13 Row-LU(A) 134
14 Column-LU(A) 134
15 Permute(A) 138
16 FindStrongVertexSeparatorES(G) 140
17 FindStrongVertexSeparatorVS(G) 142
18 BFS-MCS for ordering a tensor dimension 150
19 Lexi-Order for ordering a tensor dimension 153


Chapter 1

Introduction

This document investigates three classes of problems at the interplay of discrete algorithms, combinatorial optimization, and numerical methods. The general research area is called combinatorial scientific computing (CSC). In CSC, the contributions have practical and theoretical flavor. For all problems discussed in this document, we have the design, analysis, and implementation of algorithms along with many experiments. The theoretical results are included in this document, some with proofs; the reader is invited to the original papers for the omitted proofs. A similar approach is taken for presenting the experiments: while most results for observing theoretical findings in practice are included, the reader is referred to the original papers for some other results (e.g., run time analysis).

The three problem classes are those of partitioning, matching, and ordering. We cover two problems from the partitioning class, three from the matching class, and two from the ordering class. Below, we present these seven problems after introducing the notation and recalling the related definitions. We summarize the computing contexts in which the problems arise and highlight our contributions.

1.1 Directed graphs

A directed graph G = (V, E) is a set V of vertices and a set E of directed edges of the form e = (u, v), where e is directed from u to v. A path is a sequence of edges (u1, v1) · (u2, v2) · · · with vi = ui+1. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) is of length ℓ, where it connects a sequence of ℓ + 1 vertices (u1, v1 = u2, . . . , vℓ−1 = uℓ, vℓ). A path is called simple if the connected vertices are distinct. Let u ⇝ v denote a simple path that starts from u and ends at v. We say that a vertex v is reachable from another vertex u if a path connects them. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) forms a (simple) cycle if all vi for 1 ≤ i ≤ ℓ are distinct and u1 = vℓ. A directed acyclic graph, DAG in short, is a directed graph with no cycles. A directed graph is strongly connected if every vertex is reachable from every other vertex.

A directed graph G′ = (V′, E′) is said to be a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. We call G′ the subgraph of G induced by V′, denoted as G[V′], if E′ = E ∩ (V′ × V′). For a vertex set S ⊆ V, we define G − S as the subgraph of G induced by V − S, i.e., G[V − S]. A vertex set C ⊆ V is called a strongly connected component of G if the induced subgraph G[C] is strongly connected, and C is maximal in the sense that for all C′ ⊃ C, G[C′] is not strongly connected. A transitive reduction G0 is a minimal subgraph of G such that if u is connected to v in G, then u is connected to v in G0 as well. If G is acyclic, then its transitive reduction is unique.

The path u ⇝ v represents a dependency of v to u. We say that the edge (u, v) is redundant if there exists another u ⇝ v path in the graph. That is, when we remove a redundant (u, v) edge, u remains connected to v, and hence the dependency information is preserved. We use Pred[v] = {u | (u, v) ∈ E} to represent the (immediate) predecessors of a vertex v, and Succ[v] = {u | (v, u) ∈ E} to represent the (immediate) successors of v. We call the neighbors of a vertex v its immediate predecessors and immediate successors: Neigh[v] = Pred[v] ∪ Succ[v]. For a vertex u, the set of vertices v such that u ⇝ v are called the descendants of u. Similarly, the set of vertices v such that v ⇝ u are called the ancestors of the vertex u. The vertices without any predecessors (and hence ancestors) are called the sources of G, and the vertices without any successors (and hence descendants) are called the targets of G. Every vertex u has a weight denoted by wu, and every edge (u, v) ∈ E has a cost denoted by cuv.

A k-way partitioning of a graph G = (V, E) divides V into k disjoint subsets V1, . . . , Vk. The weight of a part Vi, denoted by w(Vi), is equal to Σ_{u ∈ Vi} wu, which is the total vertex weight in Vi. Given a partition, an edge is called a cut edge if its endpoints are in different parts. The edge cut of a partition is defined as the sum of the costs of the cut edges. Usually, a constraint on the part weights accompanies the problem. We are interested in acyclic partitions, which are defined below.

Definition 1.1 (Acyclic k-way partition). A partition {V1, . . . , Vk} of G = (V, E) is called an acyclic k-way partition if two paths u ⇝ v and v′ ⇝ u′ do not co-exist for u, u′ ∈ Vi, v, v′ ∈ Vj, and 1 ≤ i ≠ j ≤ k.

In other words, for a suitable numbering of the parts, all edges should be directed from a vertex in a part p to another vertex in a part q, where p ≤ q. Let us consider a toy example shown in Fig. 1.1a. A partition of the vertices of this graph is shown in Fig. 1.1b with a dashed curve. Since there is a cut edge from s to u and another from u to t, the partition is cyclic. An acyclic partition is shown in Fig. 1.1c, where all the cut edges are from one part to the other.


Figure 1.1: A toy graph with six vertices and six edges (a), a cyclic partitioning obtained by ignoring the directions (b), and an acyclic partitioning (c).

There is a related notion in the literature [87], which is called a convex partition. A partition is convex if for all vertex pairs u, v in the same part, the vertices in any u ⇝ v path are also in the same part. Hence, if a partition is acyclic, it is also convex. On the other hand, convexity does not imply acyclicity. Figure 1.2 shows that the definitions of an acyclic partition and a convex partition are not equivalent. For the toy graph in Fig. 1.2a, there are three possible balanced partitions, shown in Figures 1.2b, 1.2c, and 1.2d. They are all convex, but only the one in Fig. 1.2d is acyclic.

Deciding on the existence of a k-way acyclic partition respecting an upper bound on the part weights and an upper bound on the cost of cut edges is NP-complete [93, ND15]. In Chapter 2, we treat this problem, which is formalized as follows.

Problem 1 (DAGP: DAG partitioning). Given a directed acyclic graph G = (V, E) with weights on vertices and an imbalance parameter ε, find an acyclic k-way partition {V1, . . . , Vk} of V such that the balance constraints

    w(Vi) ≤ (1 + ε) · w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut, which is equal to the number of edges whose end points are in two different parts, is minimized.

The DAGP problem arises in exposing parallelism in automatic differentiation [53, Ch. 9], and particularly in the computation of the Newton step for solving nonlinear systems [51, 52]. The DAGP problem with some additional constraints is used to reason about the parallel data movement complexity and to dynamically analyze the data locality potential [84, 87]. Other important applications of the DAGP problem include (i) fusing loops for improving temporal locality, and enabling streaming and array contractions in runtime systems [144, 145]; (ii) analysis of cache efficient execution of streaming applications on uniprocessors [4]; (iii) a number of circuit design applications in which the signal directions impose an acyclic partitioning requirement [54, 213].

Figure 1.2: A toy graph (a); two cyclic and convex partitions (b), (c); and an acyclic and convex partition (d).

In Chapter 2, we address the DAGP problem. We adopt the multilevel approach [35, 113] for this problem. This approach is the de-facto standard for solving graph and hypergraph partitioning problems efficiently, and it is used by almost all current state-of-the-art partitioning tools [27, 40, 113, 130, 186, 199]. The algorithms following this approach have three phases: the coarsening phase, which reduces the number of vertices by clustering them; the initial partitioning phase, which finds a partition of the coarsest graph; and the uncoarsening phase, in which the initial partition is projected to the finer graphs and refined along the way, until a solution for the original graph is obtained. We focus on two-way partitioning (sometimes called bisection), as this scheme can be used in a recursive way for multiway partitioning. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. We also propose effective ways to use the standard undirected graph partitioning methods in our multilevel scheme. We perform a large set of experiments on a dataset and report significant improvements compared to the current state of the art.

A rooted tree T is an ordered triple (V, r, ρ), where V is the vertex set, r ∈ V is the root vertex, and ρ gives the immediate ancestor/descendant relation. For two vertices vi, vj such that ρ(vj) = vi, we call vi the parent vertex of vj and call vj a child vertex of vi. With this definition, the root r is the ancestor of every other node in T. The children vertices of a vertex are called sibling vertices to each other. The height of T, denoted as h(T), is the number of vertices in the longest path from a leaf to the root r.

Let G = (V, E) be a directed graph with n vertices, which are ordered (or labeled as 1, . . . , n). The elimination tree of a directed graph is defined as follows [81]. For any i ≠ n, ρ(i) = j whenever j > i is the minimum vertex id such that the vertices i and j lie in the same strongly connected component of the subgraph G[{1, . . . , j}]; that is, i and j are in the same strongly connected component of the subgraph containing the first j vertices, and j is the smallest such number. Currently, the algorithm with the best worst-case time complexity has a run time of O(m log n) [133]; another algorithm with the worst-case time complexity of O(mn) [83] also performs well in practice. As the definition of the elimination tree implies, the properties of the tree depend on the ordering of the graph.
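To make the definition concrete, the following small sketch builds the elimination tree directly from the definition, in O(n(n + m)) time. It is only an illustration (the graph container and the networkx calls are assumptions of this sketch), not one of the faster algorithms cited above.

    # Direct construction of the elimination tree of a directed graph from the
    # definition above (vertices labeled 1..n). Illustrative sketch only.
    import networkx as nx

    def elimination_tree(G, n):
        """G: nx.DiGraph with vertices 1..n; returns the parent map rho."""
        rho = {}
        for j in range(1, n + 1):
            # Subgraph induced by the first j vertices.
            H = G.subgraph(range(1, j + 1))
            # Find the strongly connected component that contains j.
            for scc in nx.strongly_connected_components(H):
                if j in scc:
                    # Every i < j in this component without a parent gets parent j,
                    # because j is then the minimum index satisfying the definition.
                    for i in scc:
                        if i < j and i not in rho:
                            rho[i] = j
                    break
        return rho

    # Example: a 4-cycle 1 -> 2 -> 3 -> 4 -> 1 plus the edge 2 -> 1.
    G = nx.DiGraph([(1, 2), (2, 3), (3, 4), (4, 1), (2, 1)])
    print(elimination_tree(G, 4))   # {1: 2, 2: 4, 3: 4}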

The elimination tree is a recent model playing important roles in sparse LU factorization. This tree captures the dependencies between the tasks of some well-known variants of sparse LU factorization (we review these in Chapter 7). Therefore, the height of the elimination tree corresponds to the critical path length of the task dependency graph in the corresponding parallel LU factorization methods. In Chapter 7, we investigate the problem of finding minimum height elimination trees, formally defined below, to expose a maximum degree of parallelism by minimizing the critical path length.

Problem 2 (Minimum height elimination tree). Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The minimum height of an elimination tree of a given directed graph corresponds to a directed graph parameter known as the cycle-rank [80]. The problem being NP-complete [101], we propose heuristics in Chapter 7, which generalize the most successful approaches used for symmetric matrices to unsymmetric ones.

1.2 Undirected graphs

An undirected graph G = (V, E) is a set V of vertices and a set E of edges. Bipartite graphs are undirected graphs where the vertex set can be partitioned into two sets with all edges connecting vertices in the two parts. Formally, G = (VR ∪ VC, E) is a bipartite graph, where V = VR ∪ VC is the vertex set, and for each edge e = (r, c), we have r ∈ VR and c ∈ VC. The number of edges incident on a vertex is called its degree. A path in a graph is a sequence of vertices such that each consecutive vertex pair share an edge. A vertex is reachable from another one if there is a path between them. The connected components of a graph are the equivalence classes of vertices under the "is reachable from" relation. A cycle in a graph is a path whose start and end vertices are the same. A simple cycle is a cycle with no vertex repetitions. A tree is a connected graph with no cycles. A spanning tree of a connected graph G is a tree containing all vertices of G.

A matching M in a graph G = (V, E) is a subset of edges E where a vertex in V is in at most one edge in M. Given a matching M, a vertex v is said to be matched by M if v is in an edge of M; otherwise, v is called unmatched. If all the vertices are matched by M, then M is said to be a perfect matching. The cardinality of a matching M, denoted by |M|, is the number of edges in M. In Chapter 4, we treat the following problem, which asks to find approximate matchings.

Problem 3 (Approximating the maximum cardinality of a matching in undirected graphs). Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.

There are a number of polynomial time algorithms to solve the maximum cardinality matching problem exactly in graphs. The lowest worst-case time complexity of the known algorithms is O(√n τ) for a graph with n vertices and τ edges. For bipartite graphs, the first of such algorithms is described by Hopcroft and Karp [117]; certain variants of the push-relabel algorithm [96] also have the same complexity (we give a classification for maximum cardinality matchings in bipartite graphs in a recent paper [132]). For general graphs, algorithms with the same complexity are known [30, 92, 174]. There is considerable interest in simpler and faster algorithms that have some approximation guarantee [148]. Such algorithms are used as a jump-start routine by the current state of the art exact maximum matching algorithms [71, 148, 169]. Furthermore, there are applications [171] where large cardinality matchings are used.

The specializations of Problem 3 for bipartite graphs and general undirected graphs are described in Section 4.2 and Section 4.3, respectively. Our focus is on approximation algorithms which expose parallelism and have linear run time in practical settings. We propose randomized heuristics and analyze their results. The proposed heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum cardinality matching on the subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge is obtained by scaling a suitable matrix to be doubly stochastic. The heuristics for the bipartite graphs and general undirected graphs share this general approach and differ in the analysis. To the best of our knowledge, the proposed heuristics have the highest constant factor approximation ratio (around 0.86 for large graphs).
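For illustration, a minimal sketch of the scaling step could look as follows; the alternating row/column normalization is the Sinkhorn-Knopp iteration. The fixed iteration count and the dense NumPy representation are simplifications assumed here (Chapter 4 uses the ScaleSK variant listed among the algorithms).

    # Minimal sketch: Sinkhorn-Knopp scaling of a nonnegative matrix A so that
    # D1 * A * D2 has row and column sums close to one; an edge (i, j) is then
    # sampled with probability proportional to the scaled entry.
    import numpy as np

    def sinkhorn_knopp(A, iterations=100):
        d1 = np.ones(A.shape[0])
        d2 = np.ones(A.shape[1])
        for _ in range(iterations):
            d1 = 1.0 / (A @ d2)       # fix the row sums
            d2 = 1.0 / (A.T @ d1)     # fix the column sums
        return d1[:, None] * A * d2[None, :]

    A = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0]])
    S = sinkhorn_knopp(A)
    print(S.sum(axis=0), S.sum(axis=1))   # both close to all-ones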

1.3 Hypergraphs

A hypergraph H = (V, E) is defined as a set of vertices V and a set of hyperedges E. Each hyperedge is a set of vertices. A matching in a hypergraph is a set of disjoint hyperedges.

A hypergraph H = (V, E) is called d-partite and d-uniform if V = ∪_{i=1}^{d} Vi with disjoint Vi's, and every hyperedge contains a single vertex from each Vi. In Chapter 6, we investigate the problem of finding matchings in hypergraphs, formally defined as follows.

Problem 4 (Maximum matching in d-partite d-uniform hypergraphs). Given a d-partite, d-uniform hypergraph, find a matching of maximum size.

The problem is NP-complete for d ≥ 3; the 3-partite case is called the Max-3-DM problem [126] and is an important problem in combinatorial optimization. It is a special case of the d-Set-Packing problem [110]. It has been shown that d-Set-Packing is hard to approximate within a factor of O(d / log d) [110]. The maximum/perfect set packing problem has many applications, including combinatorial auctions [98] and personnel scheduling [91]. Matchings in hypergraphs can also be used in the coarsening phase of multilevel hypergraph partitioning tools [40]. In particular, d-uniform and d-partite hypergraphs arise in modeling tensors [134], and heuristics for Problem 4 can be used to partition these hypergraphs by plugging them in the current state of the art partitioning tools. We propose heuristics for Problem 4. We first generalize some graph matching heuristics, then propose a novel heuristic based on tensor scaling to improve the quality via judicious hyperedge selections. Experiments on random, synthetic, and real-life hypergraphs show that this new heuristic is highly practical and superior to the others in finding a matching with large cardinality.

1.4 Sparse matrices

Bold upper case Roman letters are used for matrices, as in A. Matrix elements are shown with the corresponding lowercase letters, for example aij. Matrix sizes are sometimes shown in the lower right corner, e.g., A_{I×J}. Matlab notation is used to refer to the entire rows and columns of a matrix; e.g., A(i, :) and A(:, j) refer to the ith row and jth column of A, respectively. A permutation matrix is a square {0, 1} matrix with exactly one nonzero in each row and each column. A sub-permutation matrix is a square {0, 1} matrix with at most one nonzero in a row or a column.


A graph G = (V, E) can be represented as a sparse matrix AG, called the adjacency matrix. In this matrix, an entry aij = 1 if the vertices vi and vj in G are incident on a common edge, and aij = 0 otherwise. If G is bipartite with V = VR ∪ VC, we identify each vertex in VR with a unique row of AG and each vertex in VC with a unique column of AG. If G is not bipartite, a vertex is identified with a unique row-column pair. A directed graph GD = (V, E) with vertex set V and edge set E can be associated with an n × n sparse matrix A. Here, |V| = n, and for each aij ≠ 0, where i ≠ j, we have a directed edge from vi to vj.

We also associate graphs with matrices. For a given m × n matrix A, we can associate a bipartite graph G = (VR ∪ VC, E) so that the zero-nonzero pattern of A is equivalent to AG. If the matrix is square and has a symmetric zero-nonzero pattern, we can similarly associate an undirected graph by ignoring the diagonal entries. Finally, if the matrix A is square and has an unsymmetric zero-nonzero pattern, we can associate a directed graph GD = (V, E), where (vi, vj) ∈ E iff aij ≠ 0.

An n × n matrix A ≠ 0 is said to have support if there is a perfect matching in the associated bipartite graph. An n × n matrix A is said to have total support if each edge in its bipartite graph can be put into a perfect matching. A square sparse matrix is called irreducible if its directed graph is strongly connected. A square sparse matrix A is called fully indecomposable if for a permutation matrix Q, the matrix B = AQ has a zero-free diagonal and the directed graph associated with B is irreducible. Fully indecomposable matrices have total support, but a matrix with total support could be a block diagonal matrix where each block is fully indecomposable. For more formal definitions of support, total support, and full indecomposability, see, for example, the book by Brualdi and Ryser [34, Ch. 3 and Ch. 4].

Let A = [aij] ∈ R^{n×n}. The matrix A is said to be doubly stochastic if aij ≥ 0 for all i, j and Ae = A^T e = e, where e is the vector of all ones. By Birkhoff's Theorem [25], there exist α1, α2, . . . , αk ∈ (0, 1) with Σ_{i=1}^{k} αi = 1 and permutation matrices P1, P2, . . . , Pk such that

    A = α1 P1 + α2 P2 + · · · + αk Pk .     (1.1)

This representation is also called the Birkhoff-von Neumann (BvN) decomposition. We refer to the scalars α1, α2, . . . , αk as the coefficients of the decomposition.

Doubly stochastic matrices and their associated BvN decompositions have been used in several operations research problems and applications. Classical examples are concerned with allocating communication resources, where an input traffic is routed to an output traffic in stages [43]. Each routing stage is a (sub-)permutation matrix and is used for handling a disjoint set of communications. The number of stages corresponds to the number of (sub-)permutation matrices. A recent variant of this allocation problem arises as a routing problem in data centers [146, 163]. Heuristics for the BvN decomposition have been used in designing business strategies for e-commerce retailers [149]. A display matrix (which shows what needs to be displayed in total) is constructed, which happens to be doubly stochastic. Then, a BvN decomposition of this matrix is obtained, in which each permutation matrix used is a display shown to a shopper. A small number of permutation matrices is preferred; furthermore, only slight changes between consecutive permutation matrices are desired (we do not deal with this last issue in this document).

The BvN decomposition of a doubly stochastic matrix as a convex combination of permutation matrices is not unique in general. The Marcus-Ree Theorem [170] states that there is always a decomposition with k ≤ n² − 2n + 2 permutation matrices for dense n × n matrices; Brualdi and Gibson [33] and Brualdi [32] show that for a fully indecomposable n × n sparse matrix with τ nonzeros, one has the equivalent expression k ≤ τ − 2n + 2. In Chapter 5, we investigate the problem of finding the minimum number k of permutation matrices in the representation (1.1). More formally, the MinBvNDec problem is defined as follows.

Problem 5 (MinBvNDec: A BvN decomposition with the minimum number of permutation matrices). Given a doubly stochastic matrix A, find a BvN decomposition (1.1) of A with the minimum number k of permutation matrices.

Brualdi [32, p. 197] investigates the same problem and concludes that this is a difficult problem. We continue along this line and show that the MinBvNDec problem is NP-hard. We also propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries, and use this generalization for designing preconditioners for iterative linear system solvers.
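As an illustration of how such a decomposition can be computed, the sketch below peels off permutation matrices greedily from a doubly stochastic matrix. It is only a simplified stand-in for the GreedyBvN heuristic of Chapter 5: it uses a maximum-weight assignment (via SciPy) rather than the bottleneck matchings used there, and the variable names are assumptions of this sketch.

    # Simplified greedy BvN sketch: repeatedly extract a permutation contained in the
    # nonzero pattern of the remaining matrix, use the smallest entry on it as the
    # coefficient, and subtract. Assumes A is doubly stochastic.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def greedy_bvn(A, tol=1e-12):
        R = A.astype(float).copy()
        decomposition = []                 # list of (alpha, permutation as column indices)
        while R.max() > tol:
            # A maximum-weight assignment tends to avoid zero entries; if the smallest
            # matched entry is (numerically) zero, we stop.
            rows, cols = linear_sum_assignment(R, maximize=True)
            alpha = R[rows, cols].min()
            if alpha <= tol:
                break
            decomposition.append((alpha, cols))
            R[rows, cols] -= alpha         # peel the permutation off
        return decomposition

    A = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]])
    for alpha, perm in greedy_bvn(A):
        print(alpha, perm)                 # two permutations, each with coefficient 0.5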

1.5 Sparse tensors

Tensors are multidimensional arrays, generalizing matrices to higher orders. We use calligraphic font to refer to tensors, e.g., X. Let X be a d-dimensional tensor whose size is n1 × · · · × nd. As in matrices, an element of a tensor is denoted by a lowercase letter and subscripts corresponding to the indices of the element, e.g., x_{i1,...,id}, where ij ∈ {1, . . . , nj}. A marginal of a tensor is a (d−1)-dimensional slice of a d-dimensional tensor, obtained by fixing one of its indices. A d-dimensional tensor where the entries in each of its marginals sum to one is called d-stochastic. In a d-stochastic tensor, all dimensions necessarily have the same size n. A d-stochastic tensor where each marginal contains exactly one nonzero entry (equal to one) is called a permutation tensor.

A d-partite, d-uniform hypergraph H = (V1 ∪ · · · ∪ Vd, E) corresponds naturally to a d-dimensional tensor X, by associating each vertex class with a dimension of X. Let |Vi| = ni. Let X ∈ {0, 1}^{n1×···×nd} be a tensor with nonzero elements x_{v1,...,vd} corresponding to the edges (v1, . . . , vd) of H. In this case, X is called the adjacency tensor of H.

Tensors are used in modeling multifactor or multirelational datasets, such as product reviews at an online store [21], natural language processing [123], and multi-sensor data in signal processing [49], among many others [142]. Tensor decomposition algorithms are used as an important tool to understand the tensors and glean hidden or latent information. One of the most common tensor decomposition approaches is the Candecomp/Parafac (CP) formulation, which approximates a given tensor as a sum of rank-one tensors. More precisely, for a given positive integer R, called the decomposition rank, the CP decomposition approximates a d-dimensional tensor by a sum of R outer products of column vectors (one for each dimension). In particular, for a three dimensional I × J × K tensor X, CP asks for a rank-R approximation X ≈ Σ_{r=1}^{R} ar ∘ br ∘ cr, with vectors ar, br, and cr of size I, J, and K, respectively. Here, ∘ denotes the outer product of the vectors, and hence x_{ijk} ≈ Σ_{r=1}^{R} ar(i) · br(j) · cr(k). By aggregating these column vectors, one forms the matrices A, B, and C of size I × R, J × R, and K × R, respectively. The matrices A, B, and C are called the factor matrices.

The computational core of the most common CP decomposition methods is an operation called the matricized tensor-times-Khatri-Rao product (Mttkrp). For the same tensor X, this core operation can be described succinctly as

    foreach x_{ijk} ∈ X do
        MA(i, :) ← MA(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]     (1.2)

where B and C are input matrices of sizes J × R and K × R, respectively; MA is the output matrix of size I × R, which is initialized to zero before the loop. For each tensor nonzero x_{ijk}, the jth row of the matrix B is multiplied element-wise with the kth row of the matrix C; the result is scaled with the tensor entry x_{ijk} and added to the ith row of MA. In the well-known CP decomposition algorithm based on alternating least squares (CP-ALS), the two factor matrices B and C are initialized at the beginning; then the factor matrix A is computed by multiplying MA with the (pseudo-)inverse of the Hadamard product of B^T B and C^T C. Since B^T B and C^T C are of size R × R, their multiplication and inversion do not pose any computational challenges. Afterwards, the same steps are performed to compute a new B, and then a new C, and this process of computing the new factors one at a time by assuming the others fixed is repeated iteratively.
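A minimal sketch of the loop (1.2) for a third-order tensor stored in coordinate (COO) form is given below; the data layout and the names are assumptions made only for illustration.

    # Mode-1 Mttkrp for a third-order tensor in COO form (illustrative sketch).
    import numpy as np

    def mttkrp_mode1(coords, vals, B, C, I):
        """coords: (nnz, 3) integer array of (i, j, k); vals: (nnz,) tensor values."""
        R = B.shape[1]
        MA = np.zeros((I, R))
        for (i, j, k), x in zip(coords, vals):
            # Row j of B times (element-wise) row k of C, scaled by the nonzero value.
            MA[i, :] += x * (B[j, :] * C[k, :])
        return MA

    # Tiny example: a 2 x 2 x 2 tensor with three nonzeros and rank R = 2.
    coords = np.array([[0, 0, 0], [0, 1, 1], [1, 1, 0]])
    vals = np.array([1.0, 2.0, 3.0])
    B = np.ones((2, 2))
    C = np.ones((2, 2))
    print(mttkrp_mode1(coords, vals, B, C, I=2))   # [[3. 3.] [3. 3.]]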

In Chapter 3, we investigate how to efficiently parallelize the Mttkrp operations with the CP-ALS algorithm in distributed memory systems. In particular, we investigate the following problem.

Problem 6 (Parallelizing CP decomposition). Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

More precisely, in Chapter 3, we describe parallel implementations of CP-ALS and how the parallelization problem can be cast as a partitioning problem in hypergraphs. In this document, we do not develop heuristics for the hypergraph partitioning problem; we describe hypergraphs modeling the computations and make use of the well-established tools for partitioning these hypergraphs. The Mttkrp operation also arises in the gradient descent based methods [2] to compute CP decomposition. Furthermore, it has been implemented as a stand-alone routine [17] to enable algorithm development for variants of the core methods [158]. That is why it has been investigated for efficient execution in different computational frameworks, such as Matlab [11, 18], MapReduce [123], shared memory [206], and distributed memory [47].

Consider again the Mttkrp operation and the execution of the operations as shown in the body of the displayed for-loop (1.2). As seen in this loop, the rows of the factor matrices and the output matrix are accessed many times, according to the sparsity pattern of the tensor X. In a cache-based system, the computations will be much more efficient if we take advantage of spatial and temporal locality. If we can reorder the nonzeros such that the multiplications involving a given row (of MA, B, and C) are one after another, this would be achieved. We need this to hold when executing the Mttkrps for computing the new factor matrices B and C as well. The issue is more easily understood in terms of the sparse matrix–vector multiplication operation y ← Ax. If we have two nonzeros in the same row, aij and aik, we would prefer an ordering that maps j and k next to each other and performs the associated scalar multiplications a_{ij′} x_{j′} and a_{ik′} x_{k′} one after the other, where j′ is the new index id for j and k′ is the new index id for k. Furthermore, we would like the same sort of locality for all indices, and for the multiplication x ← A^T y as well. We identify two related problems here. One of them is the choice of data structure for taking advantage of shared indices, and the other is to reorder the indices so that nonzeros sharing an index in a dimension have close-by indices in the other dimensions. In this document, we do not propose a data structure; our aim is to reorder the indices such that the locality is improved for three common data structures used in the current state of the art. We summarize those three common data structures in Chapter 8 and investigate the following ordering problem. Since the overall problem is based on the data structure choice, we state the problem in a general language.

Problem 7 (Tensor ordering). Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

If we look at the same problem for matrices, the problem can be cast as ordering the rows and the columns of a given matrix so that the nonzeros are clustered around the diagonal. This is also what we search for in the tensor case, where different data structures can benefit from the same ordering, thanks to different effects. In Chapter 8, we investigate this problem algorithmically and develop fast heuristics. While the problem in the case of sparse matrices has been investigated from different angles, to the best of our knowledge, ours is the first study proposing general ordering tools for tensors, which are analogues of the well-known and widely used sparse matrix ordering heuristics, such as Cuthill-McKee [55] and reverse Cuthill-McKee [95].
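As a small illustration of the matrix analogue, the following sketch reorders a sparse matrix with SciPy's reverse Cuthill-McKee routine and reports the bandwidth before and after; the example matrix is made up, and the tensor heuristics of Chapter 8 are of course not reducible to this single call.

    # Matrix analogue of Problem 7: reorder rows/columns so that nonzeros cluster
    # near the diagonal (here via reverse Cuthill-McKee).
    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    def bandwidth(A):
        rows, cols = A.nonzero()
        return int(np.max(np.abs(rows - cols))) if rows.size else 0

    A = csr_matrix(np.array([[1, 0, 0, 1],
                             [0, 1, 1, 0],
                             [0, 1, 1, 0],
                             [1, 0, 0, 1]]))
    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    B = A[perm, :][:, perm]            # apply the same permutation to rows and columns
    print(bandwidth(A), bandwidth(B))  # e.g., 3 before and 1 after reordering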

Part I

Partitioning


Chapter 2

Multilevel algorithms for acyclic partitioning of directed acyclic graphs

In this chapter, we address the directed acyclic graph partitioning (DAGP) problem. Most of the terms and definitions are standard, and we summarized them in Section 1.1. The formal problem definition is repeated below for convenience.

DAG partitioning problem. Given a DAG G = (V, E) and an imbalance parameter ε, find an acyclic k-way partition {V1, . . . , Vk} of V such that the balance constraints

    w(Vi) ≤ (1 + ε) · w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut is minimized.

We adopt the multilevel approach [35, 113], with the coarsening, initial partitioning, and refinement phases, for acyclic partitioning of DAGs. We propose heuristics for these three phases (Sections 2.2.1, 2.2.2, and 2.2.3, respectively), which guarantee acyclicity of the partitions at all phases and maintain a DAG at every level. We strove to have fast heuristics at the core. With these characterizations, the coarsening phase requires new algorithmic/theoretical reasoning, while the initial partitioning and refinement heuristics are direct adaptations of the standard methods used in undirected graph partitioning, with some differences worth mentioning. We discuss only the bisection case, as we were able to improve the direct k-way algorithms we proposed before [114] by using the bisection heuristics recursively.

The acyclicity constraint on the partitions forbids the use of the state of the art undirected graph partitioning tools. This has been recognized before, and those tools were put aside [114, 175]. While this is true, one can still try to make use of the existing undirected graph partitioning tools [113, 130, 186, 199], as they have been very well engineered. Let us assume that we have partitioned a DAG with an undirected graph partitioning tool into two parts by ignoring the directions. It is easy to detect if the partition is cyclic, since all the edges need to go from the first part to the second one. If a partition is cyclic, we can easily fix it as follows. Let v be a vertex in the second part; we can move all vertices to which there is a path from v into the second part. This procedure breaks any cycle containing v, and hence the partition becomes acyclic. However, the edge cut may increase and the partitions can be unbalanced. To solve the balance problem and reduce the cut, we can apply move-based refinement algorithms. After this step, the final partition meets the acyclicity and balance conditions. Depending on the structure of the input graph, it could also be a good initial partition for reducing the edge cut. Indeed, one of our most effective schemes uses an undirected graph partitioning algorithm to create a (potentially cyclic) partition, fixes the cycles in the partition, and refines the resulting acyclic partition with a novel heuristic to obtain an initial partition. We then integrate this partition within the proposed coarsening approaches to refine it at different granularities.

2.1 Related work

Fauzia et al. [87] propose a heuristic for the acyclic partitioning problem to optimize data locality when analyzing DAGs. To create partitions, the heuristic categorizes a vertex as ready to be assigned to a partition when all of the vertices it depends on have already been assigned. Vertices are assigned to the current partition set until the maximum number of vertices that would be "active" during the computation of the part reaches a specified limit, which is the cache size in their application. This implies that a part size is not limited by the sum of the total vertex weights in that part, but is a complex function that depends on an external schedule (order) of the vertices. This differs from our problem, as we limit the size of each part by the total sum of the weights of the vertices in that part.

Kernighan [137] proposes an algorithm to find a minimum edge-cut partition of the vertices of a graph into subsets of size greater than a lower bound and smaller than an upper bound. The partition needs to use a fixed vertex sequence that cannot be changed. Indeed, Kernighan's algorithm takes a topological order of the vertices of the graph as an input and partitions the vertices such that all vertices in a subset constitute a continuous block in the given topological order. This procedure is optimal for a given, fixed topological order and has a run time proportional to the number of edges in the graph, if the part weights are taken as constant. We used a modified version of this algorithm as a heuristic in the earlier version of our work [114].

Cong et al. [54] describe two approaches for obtaining acyclic partitions of directed Boolean networks modeling circuits. The first one is a single-level Fiduccia-Mattheyses (FM)-based approach. In this approach, Cong et al. generate an initial acyclic partition by splitting the list of the vertices (in a topological order) from left to right into k parts, such that the weight of each part does not violate the bound. The quality of the results is then improved with a k-way variant of the FM heuristic [88], taking the acyclicity constraint into account. Our previous work [114] employs a similar refinement heuristic. The second approach of Cong et al. is a two-level heuristic: the initial graph is first clustered with a special decomposition, and then it is partitioned using the first heuristic.

In a recent paper [175], Moreira et al. focus on an imaging and computer vision application on embedded systems and discuss acyclic partitioning heuristics. They propose a single level approach in which an initial acyclic partitioning is obtained using a topological order. To refine the partitioning, they propose four local search heuristics which respect the balance constraint and maintain the acyclicity of the partition. Three heuristics pick a vertex and move it to an eligible part if and only if the move improves the cut. These three heuristics differ in choosing the set of eligible parts for each vertex; some are very restrictive, and some allow arbitrary target parts as long as acyclicity is maintained. The fourth heuristic tentatively realizes the moves that increase the cut in order to escape from a possible local minimum. It has been reported that this heuristic delivers better results than the others. In a follow-up paper, Moreira et al. [176] discuss a multilevel graph partitioner and an evolutionary algorithm based on this multilevel scheme. Their multilevel scheme starts with a given acyclic partition. Then, the coarsening phase contracts edges that are in the same part until there is no edge to contract. Here, matching-based heuristics from undirected graph partitioning tools are used without taking the directions of the edges into account. Therefore, the coarsening phase can create cycles in the graph; however, the induced partitions are never cyclic. Then, an initial partition is obtained, which is refined during the uncoarsening phase with move-based heuristics. In order to guarantee acyclic partitions, the vertices that lie in cycles are not moved. In a systematic evaluation of the proposed methods, Moreira et al. note that there are many local minima and suggest using relaxed constraints in the multilevel setting. The proposed methods have high run time, as the evolutionary method of Moreira et al. is not concerned with this issue. Improvements with respect to the earlier work [175] are reported.

We propose methods to use an undirected graph partitioner to guide the multilevel partitioner. We focus on partitioning the graph into two parts, since one can handle the general case with a recursive bisection scheme. We also propose new coarsening, initial partitioning, and refinement methods specifically designed for the 2-partitioning problem. Our multilevel scheme maintains acyclic partitions and graphs through all the levels.

2.2 Directed multilevel graph partitioning

Here, we summarize the proposed methods for the three phases of the multilevel scheme.

2.2.1 Coarsening

In this phase, we obtain smaller DAGs by coalescing the vertices, level by level. This phase continues until the number of vertices becomes smaller than a specified bound, or the reduction in the number of vertices from one level to the next one is lower than a threshold. At each level ℓ, we start with a finer acyclic graph Gℓ, compute a valid clustering Cℓ ensuring the acyclicity, and obtain a coarser acyclic graph Gℓ+1. While our previous work [114] discussed matching based algorithms for coarsening, we present agglomerative clustering based variants here. The new variants supersede the matching based ones. Unlike the standard undirected graph case, in DAG partitioning not all vertices can be safely combined. Consider a DAG with three vertices a, b, c and three edges (a, b), (b, c), (a, c). Here, the vertices a and c cannot be combined, since that would create a cycle. We say that a set of vertices is contractible (all its vertices are matchable) if unifying them does not create a cycle. We now present a general theory about finding clusters without forming cycles, after giving some definitions.

Definition 2.1 (Clustering). A clustering of a DAG is a set of disjoint subsets of vertices. Note that we do not make any assumptions on whether the subsets are connected or not.

Definition 2.2 (Coarse graph). Given a DAG G and a clustering C of G, we let G|C denote the coarse graph created by contracting all sets of vertices of C.

The vertices of the coarse graph are the clusters in C. If (u, v) ∈ G for two vertices u and v that are located in different clusters of C, then G|C has a (directed) edge from the vertex corresponding to u's cluster to the vertex corresponding to v's cluster.

Definition 2.3 (Feasible clustering). A feasible clustering C of a DAG G is a clustering such that G|C is acyclic.

Theorem 2.1. Let G = (V, E) be a DAG. For u, v ∈ V and (u, v) ∈ E, the coarse graph G|{u,v} is acyclic if and only if every path from u to v in G contains the edge (u, v).


Theorem 2.1, whose proof is in the journal version of this chapter [115], can be extended to a set of vertices by noting that this time all paths connecting two vertices of the set should contain only the vertices of the set. Neither the theorem nor its extension implies an efficient algorithm, as it requires at least one transitive reduction. Furthermore, it does not describe a condition about two clusters forming a cycle, even if both are individually contractible. In order to address both of these issues, we put a constraint on the vertices that can form a cluster, based on the following definition.

Definition 2.4 (Top level value). For a DAG G = (V, E), the top level value of a vertex u ∈ V is the length of the longest path from a source of G to that vertex. The top level values of all vertices can be computed in a single traversal of the graph with a complexity of O(|V| + |E|). We use top[u] to denote the top level of the vertex u.
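A sketch of the single-traversal computation of the top level values, assuming a successor-list representation of the DAG, is as follows.

    # Top levels of a DAG in O(|V| + |E|) time with a Kahn-style topological sweep.
    from collections import deque

    def top_levels(succ):
        indeg = {v: 0 for v in succ}
        for u in succ:
            for v in succ[u]:
                indeg[v] += 1
        top = {v: 0 for v in succ}              # sources have top level 0
        queue = deque(v for v in succ if indeg[v] == 0)
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                top[v] = max(top[v], top[u] + 1)
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
        return top

    # Example: a -> b, a -> c, b -> c gives top levels 0, 1, 2.
    print(top_levels({'a': ['b', 'c'], 'b': ['c'], 'c': []}))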

By restricting the set of edges considered in the clustering to the edges (u, v) ∈ E such that top[u] + 1 = top[v], we ensure that no cycles are formed by contracting a unique cluster (the condition identified in Theorem 2.1 is satisfied). Let C be a clustering of the vertices. Every edge in a cluster of C being contractible is a necessary condition for C to be feasible, but not a sufficient one. More restrictions on the edges of vertices inside the clusters should be found to ensure that C is feasible. We propose three coarsening heuristics based on clustering sets of more than two vertices, whose pair-wise top level differences are always zero or one.

CoTop: Acyclic clustering with forbidden edges

To have an efficient clustering heuristic, we restrict ourselves to static information computable in linear time. As stated in the introduction of this section, we rely on the top level difference of one (or less) for all vertices in the same cluster, and an additional condition to ensure that there will be no cycles when a number of clusters are contracted simultaneously. In Theorem 2.2 below, we give two sufficient conditions for a clustering to be feasible (that is, the graphs at all levels are DAGs); the proof is in the associated journal paper [115].

Theorem 2.2 (Correctness of the proposed clustering). Let G = (V, E) be a DAG and C = {C1, . . . , Ck} be a clustering. If C is such that

• for any cluster Ci, for all u, v ∈ Ci, |top[u] − top[v]| ≤ 1;

• for two different clusters Ci and Cj, and for all u ∈ Ci and v ∈ Cj, either (u, v) ∉ E or top[u] ≠ top[v] − 1;

then the coarse graph G|C is acyclic.


We design a heuristic based on Theorem 2.2. This heuristic visits all vertices, which are singletons at the beginning, in an order, and adds the visited vertex to a cluster if the criteria mentioned in Theorem 2.2 are met; if not, the vertex stays as a singleton. Among those clusters satisfying the mentioned criteria, the one which saves the largest edge weight (incident on the vertices of the cluster and the vertex being visited) is chosen as the target cluster. We discuss a set of bookkeeping operations in order to implement this heuristic in linear, O(|V| + |E|), time [115, Algorithm 1].
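To spell the two conditions out, the following small checker (an illustration, not the heuristic itself) tests whether a candidate set of clusters satisfies them; singleton vertices need not be listed.

    # Checker for the two sufficient conditions of Theorem 2.2.
    def satisfies_theorem_2_2(edges, top, clusters):
        """edges: iterable of (u, v); top: dict vertex -> top level;
        clusters: list of vertex sets to be contracted (singletons may be omitted)."""
        of = {}                               # map vertex -> cluster index
        for idx, c in enumerate(clusters):
            # Condition 1: top levels inside a cluster differ by at most one.
            if max(top[u] for u in c) - min(top[u] for u in c) > 1:
                return False
            for u in c:
                of[u] = idx
        for u, v in edges:
            # Condition 2: an edge between two different clusters must not connect
            # consecutive top levels (these are the "forbidden" edges).
            if u in of and v in of and of[u] != of[v] and top[u] + 1 == top[v]:
                return False
        return True

    # Toy DAG: a->b, s->t, t->c, c->d, a->c, with top levels 0,1,0,1,2,3.
    edges = [('a', 'b'), ('s', 't'), ('t', 'c'), ('c', 'd'), ('a', 'c')]
    top = {'a': 0, 'b': 1, 's': 0, 't': 1, 'c': 2, 'd': 3}
    print(satisfies_theorem_2_2(edges, top, [{'a', 'b'}, {'c', 'd'}]))  # True
    print(satisfies_theorem_2_2(edges, top, [{'s', 't'}, {'c', 'd'}]))  # False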

CoCyc: Acyclic clustering with cycle detection

We now propose a less restrictive clustering algorithm which also maintains the acyclicity. We again rely on the top level difference of one (or less) for all vertices in the same cluster. Knowing this invariant, when a new vertex is added to a cluster, a cycle-detection algorithm checks that no cycles are formed when all the clusters are contracted simultaneously. This algorithm does not traverse the entire graph, by also using the fact that the top level difference within a cluster is at most one. Nonetheless, detecting such cycles could require a full traversal of the graph.

From Theorem 2.2, we know that if adding a vertex to a cluster whose vertices' top level values are t and t + 1 creates a cycle in the contracted graph, then this cycle goes through the vertices with top level values t or t + 1. Thus, when considering the addition of a vertex u to a cluster C containing v, we check potential cycle formations by traversing the graph starting from u in a breadth-first search (BFS) manner. Let t denote the minimum top level in C. At a vertex w, we normally add a successor y of w to a queue (as in BFS) if |top(y) − t| ≤ 1; if w is in the same cluster as one of its predecessors x, we also add x to the queue if |top(x) − t| ≤ 1. We detect a cycle if at some point in the traversal a vertex from cluster C is reached; if not, the cluster can be extended by adding u. In the worst case, the cycle detection algorithm completes a full graph traversal, but in practice, it stops quickly and does not introduce a significant overhead. The details of this clustering scheme, whose worst case run time complexity is O(|V|(|V| + |E|)), can be found in the original paper [115, Algorithm 2].

CoHyb: Hybrid acyclic clustering

The cycle detection based algorithm can suffer from quadratic run time for vertices with large in-degrees or out-degrees. To avoid this, we design a hybrid acyclic clustering which uses the relaxed clustering strategy with cycle detection, CoCyc, by default, and switches to the more restrictive clustering strategy, CoTop, for large degree vertices. We define a limit on the degree of a vertex (typically √|V|/10) for calling it large degree. When considering an edge (u, v) where top[u] + 1 = top[v], if the degrees of u and v do not exceed the limit, we use the cycle detection algorithm to determine if we can contract the edge. Otherwise, if the out-degree of u or the in-degree of v is too large, the edge will be contracted if it is allowed. The complexity of this algorithm is in between those of CoTop and CoCyc, and it will likely avoid the quadratic behavior in practice.

2.2.2 Initial partitioning

After the coarsening phase, we compute an initial acyclic partitioning of the coarsest graph. We present two heuristics. One of them is akin to the greedy graph growing method used in the standard graph/hypergraph partitioning methods. The second one uses an undirected partitioning and then fixes the acyclicity of the partitions. Throughout this section, we use (V0, V1) to denote the bisection of the vertices of the coarsest graph, and ensure that there is no edge from the vertices in V1 to those in V0.

Greedy directed graph growing

One approach to compute a bisection of a directed graph is to design a greedy algorithm that moves vertices from one part to another using local information. We start with all vertices in V1 and move vertices towards V0 by using heaps. At any time, the vertices that can be moved to V0 are in the heap. These vertices are those whose in-neighbors are all in V0. Initially, only the sources are in the heap, and when all the in-neighbors of a vertex v are moved to V0, v is inserted into the heap. We separate this process into two phases. In the first phase, the key values of the vertices in the heap are set to the weighted sum of their incoming edges, and the ties are broken in favor of the vertex closer to the first vertex moved. The first phase continues until the first part has more than 0.9 of the maximum allowed weight (modulo the maximum weight of a vertex). In the second phase, the actual gain of a vertex is used. This gain is equal to the sum of the weights of the incoming edges minus the sum of the weights of the outgoing edges. In this phase, the ties are broken in favor of the heavier vertices. The second phase stops as soon as the required balance is obtained. The reason that we separated this heuristic into two phases is that at the beginning, the gains are of no importance, and the more vertices become movable, the more flexibility the heuristic has. Yet, towards the end, the parts are fairly balanced, and using actual gains should help keeping the cut small.

Since the order of the parts is important, we also reverse the roles of the parts and the directions of the edges. That is, we put all vertices in V0 and move the vertices one by one to V1, when all out-neighbors of a vertex have been moved to V1. The proposed greedy directed graph growing heuristic returns the best of these two alternatives.
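A condensed sketch of the first alternative is shown below, assuming unit vertex weights (the balance constraint is a vertex quota) and a single gain-keyed phase; the two-phase key switch and the tie-breaking rules of the actual heuristic are omitted.

    # Greedy directed graph growing (simplified): vertices become movable to V0 once
    # all their in-neighbors are already in V0, and the best-gain vertex is moved.
    import heapq

    def greedy_grow(pred, succ, w_edge, max_part_size):
        part = {v: 1 for v in pred}                      # everything starts in V1
        remaining_preds = {v: len(pred[v]) for v in pred}
        heap = []

        def push(v):
            gain = sum(w_edge[(u, v)] for u in pred[v]) - sum(w_edge[(v, u)] for u in succ[v])
            heapq.heappush(heap, (-gain, v))

        for v in pred:
            if remaining_preds[v] == 0:                  # sources are movable initially
                push(v)
        moved = 0
        while heap and moved < max_part_size:
            _, v = heapq.heappop(heap)
            part[v] = 0                                  # move v to V0; no edge V1 -> V0 can appear
            moved += 1
            for u in succ[v]:
                remaining_preds[u] -= 1
                if remaining_preds[u] == 0:              # u becomes movable
                    push(u)
        return part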


Undirected bisection and fixing acyclicity

In this heuristic, we partition the coarsest graph as if it were undirected, and then move the vertices from one part to another in case the partition was not acyclic. Let (P0, P1) denote the (not necessarily acyclic) bisection of the coarsest graph treated as if it were undirected.

The proposed approach designates arbitrarily P0 as V0 and P1 as V1. One way to fix the cycle is to move all ancestors of the vertices in V0 to V0, thereby guaranteeing that there is no edge from vertices in V1 to vertices in V0, making the bisection (V0, V1) acyclic. We do these moves in a reverse topological order. Another way to fix the acyclicity is to move all descendants of the vertices in V1 to V1, again guaranteeing an acyclic partition. We do these moves in a topological order. We then fix the possible unbalance with a refinement algorithm.

Note that we can also initially designate P1 as V0 and P0 as V1, and can again fix a potential cycle in two different ways. We try all four of these choices and return the best partition (essentially returning the best of the four choices to fix the acyclicity of (P0, P1)).
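One of the four fixes, moving all ancestors of V0 vertices into V0 in reverse topological order, can be sketched as follows (the graph representation is an assumption of the sketch).

    # Make (V0, V1) acyclic by pulling every ancestor of a V0 vertex into V0,
    # processing vertices in reverse topological order.
    def fix_acyclicity_by_ancestors(succ, topo_order, part):
        """part: dict vertex -> 0 or 1; modified so that no edge goes from V1 to V0."""
        for v in reversed(topo_order):
            # If any successor of v ended up in V0, then v (an ancestor) must be in V0 too.
            if any(part[w] == 0 for w in succ[v]):
                part[v] = 0
        return part

    # Toy example: path a -> b -> c with a cyclic bisection {a: 0, b: 1, c: 0}.
    succ = {'a': ['b'], 'b': ['c'], 'c': []}
    print(fix_acyclicity_by_ancestors(succ, ['a', 'b', 'c'], {'a': 0, 'b': 1, 'c': 0}))
    # {'a': 0, 'b': 0, 'c': 0} -- b is pulled into V0 because its successor c is there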

2.2.3 Refinement

This phase projects the partition obtained for a coarse graph to the next finer one and refines the partition by vertex moves. As in the standard refinement methods, the proposed heuristic is applied in a number of passes. Within a pass, we repeatedly select the vertex with the maximum move gain among those that can be moved. We tentatively realize this move if the move maintains or improves the balance. Then, the most profitable prefix of vertex moves is realized at the end of the pass. As usual, we allow the vertices to move only once in a pass. We use heaps with the gain of moves as the key value, where we keep only movable vertices. We call a vertex movable if moving it to the other part does not create a cyclic partition. We use the notation (V0, V1) to designate the acyclic bisection with no edge from vertices in V1 to vertices in V0. This means that for a vertex to move from part V0 to part V1, one of the two conditions should be met: (i) either all its out-neighbors should be in V1, or (ii) the vertex has no out-neighbors at all. Similarly, for a vertex to move from part V1 to part V0, one of the two conditions should be met: (i) either all its in-neighbors should be in V0, or (ii) the vertex has no in-neighbors at all. This is an adaptation of the boundary Fiduccia-Mattheyses heuristic [88] to directed graphs. The notion of movability being more restrictive results in an important simplification. The gain of moving a vertex v from V0 to V1 is

    Σ_{u ∈ Succ[v]} w(v, u) − Σ_{u ∈ Pred[v]} w(u, v) ,     (2.1)

and the negative of this value when moving it from V1 to V0. This means that the gain of a vertex is static: once a vertex is inserted in the heap with the key value (2.1), its key value is never updated. A move could render some vertices unmovable; if they were in the heap, then they should be deleted. Therefore, the heap data structure needs to support insert, delete, and extract-max operations only.
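The movability rule and the static gain (2.1) can be written compactly as follows; this is only an illustrative fragment, and the pass and rollback logic of the full heuristic is omitted.

    # Movability test and static gain (2.1) for the directed boundary FM refinement.
    def movable(v, side, pred, succ, part):
        # From V0 to V1: all out-neighbors already in V1 (or none); symmetric otherwise.
        if side == 0:
            return all(part[u] == 1 for u in succ[v])
        return all(part[u] == 0 for u in pred[v])

    def gain(v, pred, succ, w_edge):
        # Gain of moving v from V0 to V1; use the negative when moving from V1 to V0.
        return sum(w_edge[(v, u)] for u in succ[v]) - sum(w_edge[(u, v)] for u in pred[v])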

2.2.4 Constrained coarsening and initial partitioning

There are a number of highly successful undirected graph partitioning tools [130, 186, 199]. They are not immediately usable for our purposes, as the partitions can be cyclic. Fixing such partitions by moving vertices to break the cyclic dependencies among the parts can increase the edge cut dramatically (with respect to the undirected cut). Consider, for example, the n × n grid graph, where the vertices are at integer positions for i = 1, . . . , n and j = 1, . . . , n, and a vertex at (i, j) is connected to (i′, j′) when |i − i′| = 1 or |j − j′| = 1, but not both. There is an acyclic orientation of this graph, called spiral ordering, as described in Figure 2.1 for n = 8. This spiral ordering defines a total order. When the directions of the edges are ignored, we can have a bisection with perfect balance by cutting only n = 8 edges with a vertical line. This partition is cyclic, and it can be made acyclic by putting all vertices numbered greater than 32 in the second part. This partition, which puts the vertices 1–32 in the first part and the rest in the second part, is the unique acyclic bisection with perfect balance for the associated directed acyclic graph. The edge cut in the directed version is 35, as seen in the figure (gray edges). In general, one has to cut n² − 4n + 3 edges for n ≥ 8: the blue vertices in the border (excluding the corners) have one edge directed to a red vertex; the interior blue vertices have two such edges; finally, the blue vertex labeled n²/2 has three such edges.

Let us investigate the quality of the partitions in practice. We used MeTiS [130] as the undirected graph partitioner on a dataset of 94 matrices (their details are in Section 2.3). The results are given in Figure 2.2. For this preliminary experiment, we partitioned the graphs into two with the maximum allowed load imbalance ε = 3%. The output of MeTiS is acyclic for only two graphs, and the geometric mean of the normalized edge cut is 0.0012. Figure 2.2a shows the normalized edge cut and the load imbalance after fixing the cycles, while Figure 2.2b shows the two measurements after meeting the balance criteria. A normalized edge cut value is computed by normalizing the edge cut with respect to the number of edges.

In both figures, the horizontal lines mark the geometric mean of the normalized edge cuts, and the vertical lines mark the 3% imbalance ratio. In Figure 2.2a, there are 37 instances in which the load balance after fixing the cycles is feasible. The geometric mean of the normalized edge cuts in this sub-figure is 0.0045, while in the other sub-figure, it is 0.0049.


Figure 2.1: 8×8 grid graph whose vertices are ordered in a spiral way; a few of the vertices are labeled. All edges are oriented from a lower numbered vertex to a higher numbered one. There is a unique bipartition with 32 vertices on each side. The edges defining the total order are shown in red and blue, except the one from 32 to 33; the cut edges are shown in gray; other internal edges are not shown.

Figure 2.2: Normalized edge cut (normalized with respect to the number of edges) and the balance obtained after using an undirected graph partitioner and fixing the cycles (a, left), and after additionally ensuring balance with refinement (b, right).


Fixing the cycles increases the edge cut with respect to an undirected partitioning, but not catastrophically (only by a factor of 0.0045/0.0012 = 3.75 in these experiments), and achieving balance after this step increases the cut only a little (it goes to 0.0049 from 0.0045). That is why we suggest using an undirected graph partitioner, fixing the cycles among the parts, and performing a refinement based method for load balancing as a good (initial) partitioner.

In order to refine the initial partition in a multilevel setting, we propose a scheme similar to the iterated multilevel algorithm used in the existing partitioners [40, 212]. In this scheme, first a partition P is obtained. Then, the coarsening phase starts to cluster vertices that were in the same part in P. After the coarsening, an initial partitioning is freely available by using the partition P on the coarsest graph. The refinement phase then can work as before. Moreira et al. [176] use this approach for the directed graph partitioning problem. To be more concrete, we first use an undirected graph partitioner, then fix the cycles as discussed in Section 2.2.2, and then refine this acyclic partition for balance with the proposed refinement heuristics of Section 2.2.3. We then use this acyclic partition for constrained coarsening and initial partitioning. We expect this scheme to be successful in graphs with many sources and targets, where the sources and targets can lie in any of the parts while the overall partition is acyclic. On the other hand, if a graph is such that its balanced acyclic partitions need to put the sources in one part and the targets in another part, then fixing acyclicity may result in moving many vertices. This, in turn, will harm the edge cut found by the undirected graph partitioner.

2.3 Experiments

The partitioning tool presented (dagP) is implemented in the C/C++ programming languages. The experiments are conducted on a computer equipped with dual 2.1 GHz Xeon E5-2683 processors and 512GB memory. The source code and more information is available at http://tda.gatech.edu/software/dagP.

We have performed a large set of experiments on DAG instances coming from two sources. The first set of instances is from the Polyhedral Benchmark suite (PolyBench) [196]. The second set of instances is obtained from the matrices available in the SuiteSparse Matrix Collection (UFL) [59]. From this collection, we pick all the matrices satisfying the following properties: listed as binary, square, and having at least 100000 rows and at most 2^26 nonzeros. There were a total of 95 matrices at the time of experimentation, where two matrices (ids 1514 and 2294) have the same pattern. We discard the duplicate and use the remaining 94 matrices for the experiments. For each such matrix, we take the strict upper triangular part as the associated DAG instance whenever this part has more nonzeros than the lower


Graph        #vertices     #edges  max. deg.  avg. deg.  #sources  #targets
2mm              36500      62200         40      1.704      2100       400
3mm             111900     214600         40      1.918      3900       400
adi             596695    1059590     109760      1.776       843        28
atax            241730     385960        230      1.597     48530       230
covariance      191600     368775         70      1.925      4775      1275
doitgen         123400     237000        150      1.921      3400      3000
durbin          126246     250993        252      1.988       250       249
fdtd-2d         256479     436580         60      1.702      3579      1199
gemm           1026800    1684200         70      1.640     14600      4200
gemver          159480     259440        120      1.627     15360       120
gesummv         376000     500500        500      1.331    125250       250
heat-3d         308480     491520         20      1.593      1280       512
jacobi-1d       239202     398000        100      1.664       402       398
jacobi-2d       157808     282240         20      1.789      1008       784
lu              344520     676240         79      1.963      6400         1
ludcmp          357320     701680         80      1.964      6480         1
mvt             200800     320000        200      1.594     40800       400
seidel-2d       261520     490960         60      1.877      1600         1
symm            254020     440400        120      1.734      5680      2400
syr2k           111000     180900         60      1.630      2100       900
syrk            594480     975240         81      1.640      8040      3240
trisolv         240600     320000        399      1.330     80600         1
trmm            294570     571200         80      1.939      6570      4800

Table 2.1: DAG instances from PolyBench [196].

triangular part; otherwise, we take the strict lower triangular part. All edges have unit cost and all vertices have unit weight.

Since the proposed heuristics have a randomized behavior (the traversals used in the coarsening and refinement heuristics are randomized), we run them 10 times for each DAG instance and report the averages of these runs. We use performance profiles [69] to present the edge-cut results. A performance profile plot shows the probability that a specific method gives results within a factor θ of the best edge cut obtained by any of the methods compared in the plot. Hence, the higher and closer a plot is to the y-axis, the better the method is.
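For illustration, the following small C++ sketch computes one point of such a profile from a table of edge-cut values; the data layout (cut[m][i] holds the cut of method m on instance i) and the function name are our own assumptions and not part of dagP.

// Sketch: fraction of instances on which a given method is within a factor
// theta of the best edge cut obtained by any of the compared methods.
#include <vector>
#include <algorithm>

double profilePoint(const std::vector<std::vector<double>>& cut,
                    std::size_t method, double theta) {
  const std::size_t nInstances = cut.front().size();
  std::size_t within = 0;
  for (std::size_t i = 0; i < nInstances; ++i) {
    double best = cut[0][i];
    for (const auto& row : cut) best = std::min(best, row[i]); // best cut on instance i
    if (cut[method][i] <= theta * best) ++within;              // within factor theta?
  }
  return static_cast<double>(within) / nInstances;
}

Plotting profilePoint for a range of θ values yields the curves used in the figures below.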

We set the load imbalance parameter ε = 0.03 in the problem definition given at the beginning of the chapter for all experiments. The vertices are unit weighted; therefore, the imbalance is rarely an issue for a move-based partitioner.

Coarsening evaluation

We first evaluated the proposed coarsening heuristics (Section 5.1 of the associated paper [115]). The aim was to find an effective one to set as the default coarsening heuristic. Based on performance profiles on the edge cut, we concluded that in general the coarsening heuristics CoHyb and CoCyc behave similarly and are more helpful than CoTop in reducing the edge cut. We also analyzed the contraction efficiency of the proposed coarsening heuristics. It is important that the coarsening phase does not stop too early and that the


coarsest graph is small enough to be partitioned efficiently. By investigating the maximum, the average, and the standard deviation of vertex and edge weight ratios at the coarsest graph, and the average, the minimum, and the maximum number of coarsening levels observed for the two datasets, we saw that CoCyc and CoHyb behave again similarly and provide slightly better results than CoTop. Based on all these observations, we set CoHyb as the default coarsening heuristic, as it performs better than CoTop in terms of final edge cut and is guaranteed to be more efficient than CoCyc in terms of run time.

Constrained coarsening and initial partitioning

We now investigate the effect of using undirected graph partitioners for coarsening and initial partitioning, as explained in Section 2.2.4. We compare three variants of the proposed multilevel scheme. All of them use the refinement described in Section 2.2.3 in the uncoarsening phase.

• CoHyb: this variant uses the hybrid coarsening heuristic described in Section 2.2.1 and the greedy directed graph growing heuristic described in Section 2.2.2 in the initial partitioning phase. This method does not use constrained coarsening.

• CoHyb C: this variant uses an acyclic partition of the finest graph, obtained as outlined in Section 2.2.2, to guide the hybrid coarsening heuristic described in Section 2.2.4, and uses the greedy directed graph growing heuristic in the initial partitioning phase.

• CoHyb CIP: this variant uses the same constrained coarsening heuristic as the previous method, but inherits the fixed acyclic partition of the finest graph as the initial partitioning.

The comparison of these three variants is given in Figure 2.3 for the whole dataset. In this figure, we see that using the constrained coarsening is always helpful, as it clearly separates CoHyb C and CoHyb CIP from CoHyb after θ = 1.1. Furthermore, applying the constrained initial partitioning (on top of the constrained coarsening) brings tangible improvements.

In the light of the experiments presented here, we suggest the variant CoHyb CIP for general problem instances.

CoHyb CIP with respect to a single level algorithm

We compare CoHyb CIP (constrained coarsening and initial partitioning) with a single-level algorithm that uses an undirected graph partitioning, fixes the acyclicity, and refines the partitions. This last variant is denoted as UndirFix, and it is the algorithm described in Section 2.2.2. Both variants use the same initial partitioning approach, which utilizes MeTiS [130] as the undirected partitioner.



Figure 2.3: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb CIP), using the constrained coarsening and the greedy directed graph growing (CoHyb C), and CoHyb (no constraints).


Figure 2.4: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb CIP) and using the same approach without coarsening (UndirFix).

The difference between UndirFix and CoHyb CIP is the latter's ability to refine the initial partition at multiple levels. Figure 2.4 presents this comparison. The plots show that the multilevel scheme CoHyb CIP outperforms the single-level scheme UndirFix at all appropriate ranges of θ, attesting to the importance of the multilevel scheme.

Comparison with existing work

Here we compare our approach with the evolutionary graph partitioning approach developed by Moreira et al. [175], and briefly with our previous work [114].

Figure 2.5 shows how CoHyb CIP and CoTop compare with the evolutionary approach in terms of the edge cut on the 23 graphs of the PolyBench dataset, for the number of partitions k ∈ {2, 4, 8, 16, 32}. We use the average edge cut value of 10 runs for CoTop and CoHyb CIP, and the average values presented in [175] for the evolutionary algorithm.



Figure 2.5: Performance profiles for the edge cut obtained by CoHyb CIP, CoTop, and Moreira et al.'s approach on the PolyBench dataset with k ∈ {2, 4, 8, 16, 32}.

As seen in the figure, the CoTop variant of the proposed multilevel approach obtains the best results on this specific dataset (all variants of the proposed approach outperform the evolutionary approach). In more detailed comparisons [115, Appendix A], we see that the variants CoHyb CIP and CoTop of the proposed algorithm obtain strictly better results than the evolutionary approach in 78 and 75 instances (out of 115), respectively, when the average edge cuts are compared.

In more detailed comparisons [115, Appendix A], we see that CoHyb CIP obtains 26% less edge cut than the evolutionary approach on average (geometric mean) when the average cuts are compared; when the best cuts are compared, CoHyb CIP obtains 48% less edge cut. Moreover, CoTop obtains 37% less edge cut than the evolutionary approach when the average cuts are compared; when the best cuts are compared, CoTop obtains 41% less cut. In some instances, we see large differences between the average and the best results of CoTop and CoHyb CIP. Combined with the observation that CoHyb CIP yields better results in general, this suggests that the neighborhood structure can be improved (see the notion of the strength of a neighborhood [183, Section 19.6]).

The proposed approach, with all the reported variants, takes about 30 minutes to complete the whole set of experiments for this dataset, whereas the evolutionary approach is much more compute-intensive, as it has to run the multilevel partitioning algorithm numerous times to create and update the population of partitions for the evolutionary algorithm. The multilevel approach of Moreira et al. [175] is more comparable in terms of characteristics with our multilevel scheme. When we compare CoTop with the results of the multilevel algorithm by Moreira et al., our approach provides results that are 37% better on average, and the CoHyb CIP approach provides results that are



Figure 2.6: Performance profiles of CoHyb, CoHyb CIP, and UndirFix in terms of edge cut for the single source, single target graph dataset. The average of 5 runs is reported for each approach.

26% better on average, highlighting the fact that keeping the acyclicity of the directed graph through the multilevel process is useful.

Finally, CoTop and CoHyb CIP also outperform the previous version of our multilevel partitioner [114], which is based on a direct k-way partitioning scheme and matching heuristics for the coarsening phase, by 45% and 35% on average, respectively, on the same dataset.

Single commodity flow-like problem instances

In many of the instances of our dataset, graphs have many source and target vertices. We investigate how our algorithm performs on problems where all source vertices should be in a given part and all target vertices should be in the other part, while also achieving balance. This problem is close to the maximum flow problem, where we want to find the maximum flow (or minimum cut) from the sources to the targets with balance on part weights. Furthermore, addressing this problem also provides a setting for solving partitioning problems with fixed vertices.

We used the UFL dataset, where all isolated vertices are discarded, and a single source vertex S and a single target vertex T are added to all graphs: S is connected to all original source vertices, and all original target vertices are connected to T. The cost of the new edges is set to the number of edges. A feasible partition should avoid cutting these edges and separate all sources from the targets.
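The augmentation can be sketched as follows (C++); the edge-list representation, the vertex numbering, and the function name are illustrative assumptions only.

// Sketch: add a super-source S and a super-target T to a DAG given as an
// edge list; S points to every original source, every original target points
// to T, and the new edges get cost |E| so that cutting them is discouraged.
#include <vector>

struct Edge { int u, v; long long cost; };

void addSuperSourceTarget(int n, std::vector<Edge>& edges) {
  std::vector<bool> hasIn(n, false), hasOut(n, false);
  for (const Edge& e : edges) { hasOut[e.u] = true; hasIn[e.v] = true; }
  const long long bigCost = static_cast<long long>(edges.size()); // |E|
  const int S = n, T = n + 1;                    // two new vertices
  for (int v = 0; v < n; ++v) {                  // isolated vertices assumed removed
    if (!hasIn[v])  edges.push_back({S, v, bigCost});   // S -> original source
    if (!hasOut[v]) edges.push_back({v, T, bigCost});   // original target -> T
  }
}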

The performance profiles of CoHyb, CoHyb CIP, and UndirFix are given in Figure 2.6, with the edge cut as the evaluation criterion. As seen in this figure, CoHyb is the best performing variant, and UndirFix is the worst performing variant. This is interesting, as in the general setting we saw a reverse relation. The variant CoHyb CIP performs in the middle, as it combines the other two.



Figure 2.7: Run time performance profiles of CoCyc, CoHyb, CoTop, CoHyb C, CoHyb CIP, and UndirFix on the whole dataset. The values are the averages of 10 runs.

Run time performance

We give again a summary of the run time comparisons from the detailed version [115, Section 5.6]. Figure 2.7 shows the comparison of the five variants of the proposed multilevel scheme and the single-level scheme on the whole dataset. Each algorithm is run 10 times on each graph. As expected, CoTop offers the best performance, and CoHyb offers a good trade-off between CoTop and CoCyc. An interesting remark is that these three algorithms have a better run time than the single-level algorithm UndirFix. For example, on average, CoTop is 1.44 times faster than UndirFix. This is mainly due to the cost of fixing acyclicity: undirected partitioning accounts for roughly 25% of the execution time of UndirFix, and fixing the acyclicity constitutes the remaining 75%. Finally, the variants of the multilevel algorithm using constrained coarsening heuristics provide satisfying run time performance with respect to the others.

2.4 Summary, further notes, and references

We proposed a multilevel approach for acyclic partitioning of directed acyclic graphs. This problem is close to the standard graph partitioning in that the aim is to partition the vertices into a number of parts while minimizing the edge cut and meeting a balance criterion on the part weights. Unlike the standard graph partitioning problem, the directions of the edges are important, and the resulting partitions should have acyclic dependencies.

We proposed coarsening, initial partitioning, and refinement heuristics for the target problem. The proposed heuristics take the directions of the edges into account and maintain the acyclicity throughout the three phases of the multilevel scheme. We also proposed efficient and effective approaches to use the standard undirected graph partitioning tools in the multilevel


scheme for coarsening and initial partitioning. We performed a large set of experiments on a dataset with graphs having different characteristics and evaluated different combinations of the proposed heuristics. Our experiments suggested (i) the use of constrained coarsening and initial partitioning, where the main coarsening heuristic is a hybrid one which avoids the cycles and, in case it does not, performs a fast cycle detection (CoHyb CIP), for the general case; (ii) a pure multilevel scheme without constrained coarsening using the hybrid coarsening heuristic (CoHyb) for the cases where a number of sources need to be separated from a number of targets; (iii) a pure multilevel scheme without constrained coarsening using the fast coarsening algorithm (CoTop) for the cases where the degrees of the vertices are small. All three approaches are shown to be more effective and efficient than the current state of the art.

Özkaya et al. [181] make use of the proposed partitioner for scheduling directed acyclic task graphs for homogeneous computing systems. Here, the partitioner is used to obtain coarse grained tasks to enhance data locality. Lepère and Trystram [151] show that acyclic clustering is very effective in scheduling DAGs with unit task weights and large uniform edge costs. In particular, they show that there exists an acyclic clustering whose makespan is at most twice that of the best clustering.

Recall that a directed edge (u, v) is called redundant if there is another path from u to v in the graph. Consider a redundant edge (u, v) and a path from u to v. Let us discard the edge (u, v) and add its cost to all edges in the identified path to obtain a reduced graph. In an acyclic bisection of this reduced graph, if the vertices u and v are in different parts, then the path from u to v crosses the cut only once (otherwise the partition is cyclic), and the edge cut in the original graph equals the edge cut in the reduced graph. If u and v are in the same part, then all paths from u to v are confined to the same part. Therefore, neither the edge (u, v) of the original graph nor any path from u to v of the reduced graph contributes to the respective cuts. Hence, the edge cut in the original graph is again equal to the edge cut in the reduced graph. Such reductions could potentially help reduce the run time and improve the quality of the partitions in our multilevel setting. The transitive reduction is a costly operation, and hence we do not hope to apply all reductions in the proposed approach; a subset of those should be found and applied. The effects of such reductions should be more tangible in the evolutionary algorithm [175, 176].
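To make the reduction concrete, the following C++ sketch folds the cost of one redundant edge onto a path found by a BFS; the graph representation and the function names are our own illustrative assumptions, and this is not part of the proposed partitioner.

// Sketch: if (u,v) has another u-to-v path, discard the edge (set its cost to
// zero) and add its cost to every edge of one such path, as described above.
#include <vector>
#include <queue>

struct Digraph {
  int n;
  std::vector<std::vector<std::pair<int,int>>> adj; // adj[u] = {(v, edge id)}
  std::vector<long long> cost;                      // cost[edge id]
};

// Returns the edge ids of a u-to-v path avoiding edge 'skipId', or an empty vector.
std::vector<int> findPathAvoiding(const Digraph& g, int u, int v, int skipId) {
  std::vector<int> parentEdge(g.n, -1), parentVtx(g.n, -1);
  std::vector<bool> seen(g.n, false);
  std::queue<int> q; q.push(u); seen[u] = true;
  while (!q.empty()) {
    int x = q.front(); q.pop();
    for (auto [y, id] : g.adj[x]) {
      if (id == skipId || seen[y]) continue;
      seen[y] = true; parentEdge[y] = id; parentVtx[y] = x; q.push(y);
      if (y == v) {                                  // reconstruct the path's edge ids
        std::vector<int> path;
        for (int z = v; z != u; z = parentVtx[z]) path.push_back(parentEdge[z]);
        return path;
      }
    }
  }
  return {};
}

bool foldRedundantEdge(Digraph& g, int u, int v, int edgeId) {
  std::vector<int> path = findPathAvoiding(g, u, v, edgeId);
  if (path.empty()) return false;                    // (u,v) is not redundant
  for (int id : path) g.cost[id] += g.cost[edgeId];  // fold the cost onto the path
  g.cost[edgeId] = 0;                                // or remove the edge entirely
  return true;
}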

Chapter 3

Parallel Candecomp/Parafac decomposition of sparse tensors in distributed memory

In this chapter, we address the problem of parallelizing the computational core of the Candecomp/Parafac decomposition of sparse tensors. Tensor notation is a little less common, and the reader is invited to revise the notation and the basic definitions in Section 1.5. More related definitions are given in this chapter. Below, we restate the problem for convenience.

Parallelizing CP decomposition. Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

We investigate an efficient parallelization of the Mttkrp operation in distributed memory environments for sparse tensors, in the context of the CP decomposition method CP-ALS. For this purpose, we formulate two task definitions: a coarse-grain and a fine-grain one. These definitions are given by applying the owner-computes rule to a coarse-grain and a fine-grain partition of the tensor nonzeros. We define the coarse-grain partition of a tensor as a partition of one of its dimensions; in matrix terms, a coarse-grain partition corresponds to a row-wise or a column-wise partition. We define the fine-grain partition of a tensor as a partition of its nonzeros; this has the same significance in matrix terms. Based on these two task granularities, we present two parallel algorithms for the Mttkrp operation. We address the computational load balance and communication cost reduction problems for the two algorithms and present hypergraph partitioning-based models



to tackle these problems with off-the-shelf partitioning tools.

Once the Mttkrp operation is efficiently parallelized, the other parts of the CP-ALS algorithm are straightforward. Nonetheless, we design a library for the parallel CP-ALS algorithm to test the proposed Mttkrp algorithms and report scalability results up to 1024 MPI ranks.

3.1 Background and notation

3.1.1 Hypergraph partitioning

Let H = (V, E) be a hypergraph with the vertex set V and hyperedge set E. Let us associate weights with the vertices and costs with the hyperedges: w(v) denotes the weight of a vertex v, and c_h denotes the cost of a hyperedge h. For a given integer K ≥ 2, a K-way vertex partition of a hypergraph H = (V, E) is denoted as Π = {V_1, ..., V_K}, where (i) the parts are non-empty; (ii) mutually exclusive, V_k ∩ V_ℓ = ∅ for k ≠ ℓ; and (iii) collectively exhaustive, V = ∪_k V_k.

Let w(V_k) = Σ_{v ∈ V_k} w(v) be the total weight of the vertices in the set V_k, and let w_avg = w(V)/K be the average part weight. If each part V_k ∈ Π satisfies the balance criterion

    w(V_k) ≤ w_avg (1 + ε), for k = 1, 2, ..., K,    (3.1)

we say that Π is balanced, where ε represents the maximum allowed imbalance ratio.

In a partition Π, a hyperedge that has at least one vertex in a part is said to connect that part. The number of parts connected by a hyperedge h, i.e., its connectivity, is denoted as λ_h. Given a vertex partition Π of a hypergraph H = (V, E), one can measure the size of the cut induced by Π as

    χ(Π) = Σ_{h ∈ E} c_h (λ_h − 1).    (3.2)

This cut measure is called the connectivity-1 cutsize metric.

Given ε > 0 and an integer K > 1, the standard hypergraph partitioning problem is defined as the task of finding a balanced partition Π with K parts such that χ(Π) is minimized. The hypergraph partitioning problem is NP-hard [150].
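For concreteness, the cutsize (3.2) of a given partition can be evaluated as in the following C++ sketch; the CSR-like pin structure (pins of hyperedge h are pins[xpins[h] .. xpins[h+1]-1]) mirrors the input format of common partitioners, but the exact interface is an assumption for illustration.

// Sketch: connectivity-1 cutsize of a partition; part[v] is the part of vertex v.
#include <vector>
#include <set>

long long connectivityMinusOneCut(const std::vector<int>& xpins,
                                  const std::vector<int>& pins,
                                  const std::vector<long long>& cost,
                                  const std::vector<int>& part) {
  long long cut = 0;
  const std::size_t numEdges = xpins.size() - 1;
  for (std::size_t h = 0; h < numEdges; ++h) {
    std::set<int> connected;                         // parts connected by hyperedge h
    for (int p = xpins[h]; p < xpins[h + 1]; ++p)
      connected.insert(part[pins[p]]);
    cut += cost[h] * (static_cast<long long>(connected.size()) - 1); // c_h (lambda_h - 1)
  }
  return cut;
}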

A variant of the above problem is multi-constraint hypergraph partitioning [131]. In this variant, each vertex has an associated vector of weights. The partitioning objective is the same as above, and the partitioning constraint is to satisfy a balance constraint for each weight. Let w(v, i) denote the C weights of a vertex v, for i = 1, ..., C. In this variant, the balance criterion (3.1) is rewritten as

    w(V_k, i) ≤ w_avg,i (1 + ε), for k = 1, ..., K and i = 1, ..., C,    (3.3)


where the ith weight w(V_k, i) of a part V_k is defined as the sum of the ith weights of the vertices in that part, w_avg,i is the average part weight for the ith weight of all vertices, and ε represents the allowed imbalance ratio.

3.1.2 Tensor operations

In this chapter, we use N to denote the number of dimensions (the order) of a tensor. For the sake of simplicity, we describe all the notation and the algorithms for N = 3, even though our algorithms and implementations have no such restriction. We explicitly generalize the discussion to general order-N tensors whenever we find it necessary. A fiber in a tensor is defined by fixing every index but one; e.g., if X is a third-order tensor, X(:, j, k) is a mode-1 fiber and X(i, j, :) is a mode-3 fiber. A slice in a tensor is defined by fixing only one index; e.g., X(i, :, :) refers to the ith slice of X in mode 1. We use |X(i, :, :)| to denote the number of nonzeros in X(i, :, :).

Tensors can be matricized in any mode. This is achieved by identifying a subset of the modes of a given tensor X as the rows and the other modes of X as the columns of a matrix, and appropriately mapping the elements of X to those of the resulting matrix. We will be exclusively dealing with the matricizations of tensors along a single mode. For example, take X ∈ ℝ^{I_1×···×I_N}. Then X_(1) denotes the mode-1 matricization of X, in such a way that the rows of X_(1) correspond to the first mode of X and the columns correspond to the remaining modes. The tensor element x_{i_1,...,i_N} corresponds to the element

    (i_1, 1 + Σ_{j=2}^{N} [(i_j − 1) Π_{k=2}^{j−1} I_k])

of X_(1). Specifically, each column of the matrix X_(1) becomes a mode-1 fiber of the tensor X. Matricizations in the other modes are defined similarly.
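As a small worked illustration, the C++ sketch below computes the (1-based) column index of a tensor element in the mode-1 matricization using the mapping above; the interface is an assumption made only for this example.

// Sketch: column index of element (i_1, ..., i_N) in X_(1); idx and dims are
// 1-based index values and dimension sizes I_1, ..., I_N.
#include <vector>

long long mode1Column(const std::vector<long long>& idx,
                      const std::vector<long long>& dims) {
  long long col = 1, stride = 1;                 // stride = product of I_2 .. I_{j-1}
  for (std::size_t j = 1; j < idx.size(); ++j) { // modes 2 .. N
    col += (idx[j] - 1) * stride;
    stride *= dims[j];
  }
  return col;
}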

For two matrices A of size I_1 × J_1 and B of size I_2 × J_2, the Kronecker product is

    A ⊗ B = [ a_{11}B   ···  a_{1J_1}B
                ⋮               ⋮
              a_{I_1 1}B ···  a_{I_1 J_1}B ].

For two matrices A of size I_1 × J and B of size I_2 × J, the Khatri-Rao product is

    A ⊙ B = [ A(:, 1) ⊗ B(:, 1)  ···  A(:, J) ⊗ B(:, J) ],

which is of size I_1 I_2 × J. For two matrices A and B, both of size I × J, the Hadamard product is

    A ∗ B = [ a_{11}b_{11}  ···  a_{1J}b_{1J}
                 ⋮                  ⋮
              a_{I1}b_{I1}  ···  a_{IJ}b_{IJ} ].
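To make the definition concrete, a dense C++ sketch of the Khatri-Rao product follows (row-major storage is an assumption for illustration); in the algorithms below this product is never formed explicitly.

// Sketch: Khatri-Rao product of A (I1 x J) and B (I2 x J); the result is
// (I1*I2) x J, and its column r is A(:,r) Kronecker B(:,r).
#include <vector>

std::vector<double> khatriRao(const std::vector<double>& A, int I1,
                              const std::vector<double>& B, int I2, int J) {
  std::vector<double> K(static_cast<std::size_t>(I1) * I2 * J);
  for (int i = 0; i < I1; ++i)
    for (int k = 0; k < I2; ++k)
      for (int r = 0; r < J; ++r)
        K[(static_cast<std::size_t>(i) * I2 + k) * J + r] =
            A[static_cast<std::size_t>(i) * J + r] *
            B[static_cast<std::size_t>(k) * J + r];
  return K;
}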

Recall that the CP-decomposition of rank R of a given tensor X factorizes X into a sum of R rank-one tensors. For X ∈ ℝ^{I×J×K}, it yields X ≈ Σ_{r=1}^{R} a_r ∘ b_r ∘ c_r, for a_r ∈ ℝ^I, b_r ∈ ℝ^J, and c_r ∈ ℝ^K, where ∘ is the outer product of the vectors. Here, the matrices A = [a_1, ..., a_R], B = [b_1, ..., b_R], and C = [c_1, ..., c_R] are called the factor matrices, or factors. For N-mode tensors, we use U_1, ..., U_N to refer to the factor matrices.

We revise the Alternating Least Squares (ALS) method for obtaining a rank-R approximation of a tensor X with the CP-decomposition [38, 107]. A common formulation of CP-ALS is shown in Algorithm 1 for third order tensors. At each iteration, each factor matrix is recomputed while fixing the other two; e.g., A ← X_(1)(C ⊙ B)(BᵀB ∗ CᵀC)†. This operation is performed in the following order: M_A = X_(1)(C ⊙ B), V = (BᵀB ∗ CᵀC)†, and then A ← M_A V. Here, V is a dense matrix of size R × R and is easy to compute. The important issue is the efficient computation of the Mttkrp operations yielding M_A, and similarly M_B = X_(2)(A ⊙ C) and M_C = X_(3)(B ⊙ A).

Algorithm 1: CP-ALS for 3rd order tensors

Input: X: A 3rd order tensor
       R: The rank of approximation
Output: CP decomposition [[λ; A, B, C]]

1: repeat
2:   A ← X_(1)(C ⊙ B)(BᵀB ∗ CᵀC)†
3:   Normalize columns of A
4:   B ← X_(2)(C ⊙ A)(AᵀA ∗ CᵀC)†
5:   Normalize columns of B
6:   C ← X_(3)(B ⊙ A)(AᵀA ∗ BᵀB)†
7:   Normalize columns of C and store the norms as λ
8: until no improvement or maximum iterations reached

The sheer size of the Khatri-Rao products makes them impossible to compute explicitly; hence, efficient Mttkrp algorithms find other means to carry out the Mttkrp operation.

3.2 Related work

SPLATT [206] is an efficient implementation of the Mttkrp operation for sparse tensors on shared memory systems. SPLATT implements the Mttkrp operation based on the slices of the dimension in which the factor is updated, e.g., on the mode-1 slices when computing A ← X_(1)(C ⊙ B)(BᵀB ∗ CᵀC)†. Nonzeros of the fibers in a slice are multiplied with the corresponding rows of B, and the results are accumulated to be later scaled with the corresponding row of C to compute the row of A corresponding to the slice. Parallelization is done using OpenMP directives, and the load balance (in terms of the number of nonzeros in the slices of the mode for


which the Mttkrp is computed) is achieved by using the dynamic scheduling policy. Hypergraph models are used to optimize cache performance by reducing the number of times a row of A, B, and C is accessed. Smith et al. [206] also use N-partite graphs to reorder the tensors for all dimensions. A newer version of SPLATT [204, 205] has a medium-grain task definition and a distributed memory implementation.

GigaTensor [123] is an implementation of CP-ALS which follows the Map-Reduce paradigm. All important steps (the Mttkrps and the computations BᵀB ∗ CᵀC) are performed using this paradigm. A distinct advantage of GigaTensor is that, thanks to Map-Reduce, the issues of fault tolerance, load balance, and out-of-core tensor data are automatically handled. The presentation [123] of GigaTensor focuses on three-mode tensors and expresses the map and the reduce functions for this case. Additional map and reduce functions are needed for higher order tensors.

DFacTo [47] is a distributed memory implementation of the Mttkrp operation. It performs two successive sparse matrix-vector multiplies (SpMV) to compute a column of the product X_(1)(C ⊙ B). A crucial observation (made also elsewhere [142]) is that this operation can be implemented as X_(2)ᵀ B(:, r), which can be reshaped into a matrix to be multiplied with C(:, r) to form the rth column of the Mttkrp. Although SpMV is a well-investigated operation, there is a peculiarity here: the result of the first SpMV forms the values of the sparse matrix used in the second one. Therefore, there are sophisticated data dependencies between the two SpMVs. Notice that DFacTo is rich in SpMV operations: there are two SpMVs per factorization rank per dimension of the input tensor. DFacTo needs to store the tensor matricized in all dimensions, i.e., X_(1), ..., X_(N). In low dimensions this can be a slight memory overhead, yet in higher dimensions the overhead could be non-negligible. DFacTo uses MPI for parallelization, yet fully stores the factor matrices in all MPI ranks. The rows of X_(1), ..., X_(N) are blockwise distributed (statically). With this partition, each process computes the corresponding rows of the factor matrices. Finally, DFacTo performs an MPI_Allgatherv operation to communicate the new results to all processes, which results in (I_n/P) log_2 P communication volume per process (assuming a hypercube algorithm) when computing the nth factor matrix having I_n rows using P processes.

Tensor Toolbox [18] is a MATLAB toolbox for handling tensors. It provides many essential operations and enables fast and efficient realizations of complex algorithms in MATLAB for sparse tensors [17]. Among those operations, Mttkrp implementations are provided and used in the CP-ALS method. Here, each column of the output is computed by performing N − 1 sparse tensor-vector multiplications. Another well-known MATLAB toolbox is the N-way toolbox [11], which is essentially for dense tensors and now incorporates support for sparse tensors [2] through the Tensor Toolbox. The Tensor


Toolbox and the related software provide excellent means for rapid prototyping of algorithms and also efficient programs for tensor operations that can be handled within MATLAB.

3.3 Parallelization

A common approach in implementing the Mttkrp is to explicitly matricize a tensor across all modes and then perform the Khatri-Rao product using the matricized tensors [47, 206]. Matricizing a tensor in a mode i requires column index values up to Π_{k≠i} I_k, which can exceed the integer value limits supported by modern architectures when using tensors of higher order and very large dimensions. Also, matricizing across all modes results in N replications of a tensor, which can exceed the memory limitations. Hence, in order to be able to handle large tensors, we store them in coordinate format for the Mttkrp operation, which is also the method of choice in the Tensor Toolbox [18]. In this format, each nonzero value is stored along with all of its position indices.

With a tensor stored in the coordinate format, the Mttkrp operation can be performed as shown in Algorithm 2. As seen on Line 4 of this algorithm, a row of B and a row of C are retrieved, and their Hadamard product is computed and scaled with a tensor entry to update a row of M_A. In general, for an N-mode tensor,

    M_{U_1}(i_1, :) ← M_{U_1}(i_1, :) + x_{i_1 i_2 ... i_N} [U_2(i_2, :) ∗ ··· ∗ U_N(i_N, :)]

is computed. Here, the indices of the corresponding rows of the factor matrices and M_{U_1} coincide with the indices of the unique tensor entry of the operation.

Algorithm 2: Mttkrp for 3rd order tensors

Input: X: tensor
       B, C: Factor matrices in all modes except the first
       I_A: Number of rows of the factor A
       R: Rank of the factors
Output: M_A = X_(1)(B ⊙ C)

1: Initialize M_A to zeros of size I_A × R
2: foreach x_ijk ∈ X do
4:   M_A(i, :) ← M_A(i, :) + x_ijk [B(j, :) ∗ C(k, :)]
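A sequential C++ sketch of Algorithm 2 on a coordinate-format tensor is given below; the struct-of-indices layout and row-major factor matrices are assumptions made for illustration and do not reflect the actual data structures of our implementation.

// Sketch of Algorithm 2: M_A = X_(1)(B kr C) for a 3rd order tensor in
// coordinate format; each nonzero contributes one scaled Hadamard product.
#include <vector>

struct Nonzero { int i, j, k; double val; };

void mttkrpMode1(const std::vector<Nonzero>& X,
                 const std::vector<double>& B,   // I_B x R, row-major
                 const std::vector<double>& C,   // I_C x R, row-major
                 std::vector<double>& MA,        // I_A x R, row-major, zero-initialized
                 int R) {
  for (const Nonzero& nz : X) {
    const double* brow = &B[static_cast<std::size_t>(nz.j) * R];
    const double* crow = &C[static_cast<std::size_t>(nz.k) * R];
    double* marow = &MA[static_cast<std::size_t>(nz.i) * R];
    for (int r = 0; r < R; ++r)
      marow[r] += nz.val * brow[r] * crow[r];    // M_A(i,:) += x_ijk [B(j,:) * C(k,:)]
  }
}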

As factor matrices are accessed row-wise, we define computational units in terms of the rows of factor matrices. It follows naturally to partition all factor matrices row-wise and use the same partition for the Mttkrp operation in each mode of an input tensor across all CP-ALS iterations, to prevent extra communication. A crucial issue is the task definitions, as this


pertains to the issues of load balancing and communication. We identify a coarse-grain and a fine-grain task definition.

In the coarse-grain task definition, the ith atomic task consists of computing the row M_A(i, :) using the nonzeros in the tensor slice X(i, :, :) and the rows of B and C corresponding to the nonzeros in that slice. The input tensor X does not change throughout the iterations of tensor decomposition algorithms; hence, it is viable to make the whole slice X(i, :, :) available to the process holding M_A(i, :), so that the Mttkrp operation can be performed by only communicating the rows of B and C. Yet, as CP-ALS requires the Mttkrp in all modes, and each nonzero x_ijk belongs to the slices X(i, :, :), X(:, j, :), and X(:, :, k), we need to replicate tensor entries in the owner processes of these slices. This may require up to N times replication of the tensor, depending on its partitioning. Note that an explicit matricization always requires exactly N replications of tensor entries.

In the fine-grain task definition, an atomic task corresponds to the multiplication of a tensor entry with the Hadamard product of the corresponding rows of B and C. Here, tensor nonzeros are partitioned among processes with no replication, to induce a task partition by following the owner-computes rule. This necessitates communicating the rows of B and C that are needed by these atomic tasks. Furthermore, partial results on the rows of M_A need to be communicated, as without duplicating tensor entries we cannot, in general, compute all contributions to a row of M_A. Here, the partition of X should be useful in all modes, as the CP-ALS method requires the Mttkrp in all modes.

The coarse-grain task definition resembles the one-dimensional (1D) row-wise (or column-wise) partitioning of sparse matrices, whereas the fine-grain one resembles the two-dimensional (nonzero-based) partitioning of sparse matrices for parallel sparse matrix-vector multiply (SpMV) operations. As is confirmed for SpMV in modern applications, 1D partitioning usually leads to harder problems of load balancing and communication cost reduction. The same phenomenon is likely to be observed in tensors as well. Nonetheless, we cover the coarse-grain task definition, as it is used in the state of the art parallel Mttkrp methods [47, 206], which partition the input tensor by slices.

3.3.1 Coarse-grain task model

In the coarse-grain task model, computing the rows of M_A, M_B, and M_C constitutes the atomic tasks. Let µ_A denote the partition of the first mode's indices among the processes, i.e., µ_A(i) = p if the process p is responsible for computing M_A(i, :). Similarly, let µ_B and µ_C define the partition of the second and the third mode indices. The process owning M_A(i, :) needs the entire tensor slice X(i, :, :); similarly, the process owning M_B(j, :) needs X(:, j, :), and the owner of M_C(k, :) needs X(:, :, k). This necessitates the duplication of some


tensor nonzeros to prevent unnecessary communication.

One needs to take the context of CP-ALS into account when parallelizing the Mttkrp operation. First, the output M_A of the Mttkrp is transformed into A. Since A(i, :) is computed simply by multiplying M_A(i, :) with the matrix (BᵀB ∗ CᵀC)†, we make the process which owns M_A(i, :) responsible for computing A(i, :). Second, N Mttkrp operations follow one another in an iteration. Assuming that every process has the required rows of the factor matrices while executing the Mttkrp for the first mode, it is advisable to implement the Mttkrp in such a way that its output M_A, after being transformed into A, is communicated. This way, all processes will have the necessary data for executing the Mttkrp for the next mode. With these in mind, the coarse-grain parallel Mttkrp method executes Algorithm 3 at each process p.

Algorithm 3: Coarse-grain Mttkrp for the first mode of third order tensors at process p, within CP-ALS

Input: I_p: indices where µ_A(i) = p
       X(I_p, :, :): tensor slices
       BᵀB and CᵀC: R × R matrices
       B(J_p, :), C(K_p, :): Rows of the factor matrices, where J_p and K_p correspond to the unique second and third mode indices in X(I_p, :, :)

1: On exit, A(I_p, :) is computed, its rows are sent to the processes needing them, and AᵀA is available.

2: Initialize M_A(I_p, :) to all zeros of size |I_p| × R
3: foreach i ∈ I_p do
4:   foreach x_ijk ∈ X(i, :, :) do
6:     M_A(i, :) ← M_A(i, :) + x_ijk [B(j, :) ∗ C(k, :)]
8: A(I_p, :) ← M_A(I_p, :)(BᵀB ∗ CᵀC)†
9: foreach i ∈ I_p do
11:   Send A(i, :) to all processes having nonzeros in X(i, :, :)
12: Receive A(i, :) from µ_A(i) for each owned nonzero x_ijk
14: Locally compute A(I_p, :)ᵀA(I_p, :) and all-reduce the results to form AᵀA

As seen in Algorithm 3, the process p computes M_A(i, :) for all i with µ_A(i) = p on Line 6. This is possible without any communication if we assume the preconditions that BᵀB, CᵀC, and the required rows of B and C are available. We need to satisfy this precondition for the Mttkrp operations in the remaining modes. Once M_A(I_p, :) is computed, it is multiplied on Line 8 with (BᵀB ∗ CᵀC)† to obtain A(I_p, :). Then, the process p sends the rows of A(I_p, :) to the other processes who will need them, and then receives the ones that it will need. More precisely, the process p sends A(i, :) to the process r, where µ_B(j) = r or µ_C(k) = r for a nonzero x_ijk ∈ X, thereby satisfying the stated precondition for the remaining Mttkrp


operations. Since computing A(i, :) required BᵀB and CᵀC to be available at each process, we also need to compute AᵀA and make the result available to all processes. We perform this by computing the local multiplications A(I_p, :)ᵀA(I_p, :) and all-reducing the partial results (Line 14).
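Line 14 can be realized with a local Gram-matrix product followed by an all-reduce, as in the C++/MPI sketch below; the row-major layout and the function name are illustrative assumptions.

// Sketch of Line 14: compute A(Ip,:)^T A(Ip,:) locally and sum the R x R
// partial results over all processes so that A^T A is available everywhere.
#include <mpi.h>
#include <vector>

std::vector<double> allReduceGram(const std::vector<double>& Alocal, // |Ip| x R, row-major
                                  int R, MPI_Comm comm) {
  const std::size_t nRows = Alocal.size() / R;
  std::vector<double> local(static_cast<std::size_t>(R) * R, 0.0);
  for (std::size_t i = 0; i < nRows; ++i)          // local Gram matrix
    for (int r1 = 0; r1 < R; ++r1)
      for (int r2 = 0; r2 < R; ++r2)
        local[static_cast<std::size_t>(r1) * R + r2] +=
            Alocal[i * R + r1] * Alocal[i * R + r2];
  std::vector<double> global(local.size());
  MPI_Allreduce(local.data(), global.data(), R * R, MPI_DOUBLE, MPI_SUM, comm);
  return global;                                   // A^T A on every process
}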

We now investigate the problems of obtaining load balance and reducing the communication cost in an iteration of CP-ALS given in Algorithm 3, that is, when Algorithm 3 is applied N times, once for each mode. Line 6 is the computational core: the process p performs Σ_{i ∈ I_p} |X(i, :, :)| Hadamard products (of vectors of size R) without any communication. In order to achieve load balance, one should partition the slices of the first mode equitably by taking the number of nonzeros into account. Line 8 does not involve communication, where load balance can be achieved by assigning an almost equal number of slices to the processes. Line 14 likewise requires an almost equal number of slices per process for load balance. The operations at Lines 8 and 14 can be performed very efficiently using BLAS3 routines; therefore, their costs are generally negligible in comparison to the cost of the sparse, irregular computations at Line 6. There are two lines, 11 and 14, requiring communication. Line 14 requires the collective all-reduce communication, and one can rely on efficient MPI implementations. The communication at Line 11 is irregular and would be costly. Therefore, the problems of computational load balance and communication cost need to be addressed with regards to Lines 6 and 11, respectively.

Let a_i be the task of computing M_A(i, :) at Line 6. Let also T_A be the set of tasks a_i for i = 1, ..., I_A, and let b_j, c_k, T_B, and T_C be defined similarly. In the CP-ALS algorithm, there are dependencies among the triplets a_i, b_j, and c_k of the tasks for each nonzero x_ijk. Specifically, the task a_i depends on the tasks b_j and c_k. Similarly, b_j depends on the tasks a_i and c_k, and c_k depends on the tasks a_i and b_j. These dependencies are the source of the communication at Line 11; e.g., A(i, :) should be sent to the processes that own B(j, :) and C(k, :) for each nonzero x_ijk in the slice X(i, :, :). These dependencies can be modeled using graphs or hypergraphs by identifying the tasks with vertices and their dependencies with edges and hyperedges, respectively. As is well known in similar parallelization contexts [39, 111, 112], the standard graph partitioning related metric of "edge cut" would loosely relate to the communication cost. We therefore propose a hypergraph model for the communication and computational requirements of the Mttkrp operation in the context of CP-ALS.

For a given tensor X, we build the d-partite, d-uniform hypergraph H = (V, E) described in Section 1.5. In this model, the vertex set V corresponds to the set of tasks. We abuse the notation and use a_i to refer to both a task and its corresponding vertex in V. We then set V = T_A ∪ T_B ∪ T_C. Since one needs load balance on Line 6 for each mode, we use vectors of size N for the vertex weights. We set w(a_i, 1) = |X(i, :, :)|, w(b_j, 2) = |X(:, j, :)|, and w(c_k, 3) = |X(:, :, k)|


(a) Coarse-grain model

(b) Fine-grain model

Figure 3.1: Coarse and fine-grain hypergraph models for the 3×3×3 tensor X with nonzeros {(1, 2, 3), (2, 3, 1), (3, 1, 2)}.

as the vertex weights in the indicated modes, and assign weight 0 in all other modes. The dependencies causing communication at Line 11 are one-to-many: for a nonzero x_ijk, any one of a_i, b_j, and c_k is computed by using the data associated with the other two. Following the guidelines in building the hypergraph models [209], we add a hyperedge corresponding to each task (or each row of the factor matrices) in T_A ∪ T_B ∪ T_C in order to model the data dependencies to that specific task. We use n^a_i to denote the hyperedge corresponding to the task a_i, and define n^b_j and n^c_k similarly. For each nonzero x_ijk, we add the vertices a_i, b_j, and c_k to the hyperedges n^a_i, n^b_j, and n^c_k. Let J_i and K_i be all unique 2nd and 3rd mode indices, respectively, of the nonzeros in X(i, :, :). Then the hyperedge n^a_i contains the vertex a_i, all vertices b_j for j ∈ J_i, and all vertices c_k for k ∈ K_i.

Figure 3.1a demonstrates the hypergraph model for the coarse-grain task decomposition of a sample 3 × 3 × 3 tensor with nonzeros (1, 2, 3), (2, 3, 1), and (3, 1, 2). Labels for the hyperedges, shown as small blue circles, are not provided, yet they should be clear from the context; e.g., a_i ∈ n^a_i is shown with a non-curved black line between the vertex and the hyperedge. For the first, second, and third nonzeros, we add the corresponding vertices to the related hyperedges, which are shown with green, magenta, and orange curves, respectively. It is interesting to note that the N-partite graph of Smith et al. [206] and this hypergraph model are related: their adjacency structures, when laid out as matrices, give the same matrix.

Consider now a P-way partition of the vertices of H = (V, E) and the association of each part with a unique process for obtaining a P-way parallel CP-ALS. The task a_i incurs |X(i, :, :)| Hadamard products in the first mode in the process µ_A(i), and has no other work when computing the Mttkrps M_B = X_(2)(A ⊙ C) and M_C = X_(3)(A ⊙ B) in the other modes. This observation (combined with the corresponding one for the b_js and c_ks) shows that any


partition satisfying the balance condition for each weight component (3.3) corresponds to a balanced partition of the Hadamard products (Line 6) in each mode. Now consider the messages, which are of size R, sent by the process p regarding A(i, :). These messages are needed by the processes that have slices in modes 2 and 3 with nonzeros whose first mode index is i. These processes correspond to the parts connected by the hyperedge n^a_i. For example, in Figure 3.1a, due to the nonzero (3, 1, 2), a_3 has a dependency to b_1 and c_2, which is modeled by the curves from a_3 to n^b_1 and n^c_2. Since a_3, b_1, and c_2 reside in different parts, a_3 contributes one to the connectivity of n^b_1 and n^c_2, which accurately captures the total communication volume by increasing the total connectivity-1 metric by two.

Notice that the part µ_A(i) is always connected by n^a_i, assuming that the slice X(i, :, :) is not empty. Therefore, minimizing the connectivity-1 metric (3.2) exactly amounts to minimizing the total communication volume required for the Mttkrp in all computational phases of a CP-ALS iteration, if we set the cost c_h = R for each hyperedge h.

3.3.2 Fine-grain task model

In the fine-grain task model, we assume a partition of the nonzeros of the tensor; hence, there is no duplication. Let µ_X denote the partition of the tensor nonzeros, e.g., µ_X(i, j, k) = p if the process p holds x_ijk. Whenever µ_X(i, j, k) = p, the process p performs the Hadamard product of B(j, :) and C(k, :) in the first mode, of A(i, :) and C(k, :) in the second mode, and of A(i, :) and B(j, :) in the third mode, then scales the result with x_ijk. Let again µ_A, µ_B, and µ_C denote the partition on the indices of the corresponding modes; that is, µ_A(i) = p denotes that the process p is responsible for computing M_A(i, :). As in the coarse-grain model, we take the CP-ALS context into account while designing the Mttkrp algorithm: (i) the process which is responsible for M_A(i, :) is also held responsible for A(i, :); (ii) at the beginning, all rows of the factor matrices B(j, :) and C(k, :) where µ_X(i, j, k) = p are available to the process p; (iii) at the end, all processes get the R × R matrix AᵀA; and (iv) each process has all the rows of A that are required for the upcoming Mttkrp operations in the second or the third modes. With these in mind, the fine-grain parallel Mttkrp method executes Algorithm 4 at each process p.

In Algorithm 4, X_p and I_p denote the set of nonzeros and the set of first mode indices assigned to the process p. As seen in the algorithm, the process p has the required data to carry out all Hadamard products corresponding to all x_ijk ∈ X_p on Line 5. After scaling by x_ijk, the Hadamard products are locally accumulated in M_A, which contains a row for every i where X_p has a nonzero on the slice X(i, :, :). Notice that a process generates a partial result for each slice of the first mode in which it has a nonzero. The relevant row indices (the set of unique first mode indices in X_p) are denoted by F_p.


Algorithm 4: Fine-grain Mttkrp for the first mode of 3rd order tensors at process p, within CP-ALS

Input: X_p: tensor nonzeros where µ_X(i, j, k) = p
       I_p: the first mode indices where µ_A(I_p) = p
       F_p: the set of unique first mode indices in X_p
       BᵀB and CᵀC: R × R matrices
       B(J_p, :), C(K_p, :): Rows of the factor matrices, where J_p and K_p correspond to the unique second and third mode indices in X_p

1: On exit, A(I_p, :) is computed, its rows are sent to the processes needing them, A(F_p, :) is updated, and AᵀA is available.

2: Initialize M_A(F_p, :) to all zeros of size |F_p| × R
3: foreach x_ijk ∈ X_p do
5:   M_A(i, :) ← M_A(i, :) + x_ijk [B(j, :) ∗ C(k, :)]
6: foreach i ∈ F_p \ I_p do
8:   Send M_A(i, :) to µ_A(i)
10: Receive contributions to M_A(i, :) for i ∈ I_p, and add them up
12: A(I_p, :) ← M_A(I_p, :)(BᵀB ∗ CᵀC)†
13: foreach i ∈ I_p do
15:   Send A(i, :) to all processes having nonzeros in X(i, :, :)
16: foreach i ∈ F_p \ I_p do
18:   Receive A(i, :) from µ_A(i)
20: Locally compute A(I_p, :)ᵀA(I_p, :) and all-reduce the results to form AᵀA


Some indices in F_p are not owned by the process p. For such an i ∈ F_p \ I_p, the process p sends its contribution to M_A(i, :) to the owner process µ_A(i) at Line 8. Again, messages of the process p destined to the same process are combined. Then, the process p receives contributions to M_A(i, :), where µ_A(i) = p, at Line 10. Each such contribution is accumulated, and the new A(i, :) is computed by a multiplication with (BᵀB ∗ CᵀC)† at Line 12. Finally, the updated A is communicated to satisfy the precondition of the upcoming Mttkrp operations. Note that the communications at Lines 15 and 18 are the duals of those at Lines 10 and 8, respectively, in the sense that the directions of the messages are reversed and the messages always pertain to the same row indices. For example, M_A(i, :) is sent to the process µ_A(i) at Line 8, and A(i, :) is received from µ_A(i) at Line 18. Hence, the process p generates partial results for each slice X(i, :, :) on which it has a nonzero, either keeps them or sends them to the owner µ_A(i) (at Line 8), and if so, receives the updated row A(i, :) later (at Line 18). Finally, the algorithm finishes with an all-reduce operation to make AᵀA available to all processes for the upcoming Mttkrp operations.

We now investigate the computational and communication requirements of Algorithm 4. As before, the computational core is at Line 5, where the Hadamard products of the row vectors B(j, :) and C(k, :) are computed for each nonzero x_ijk that a process owns, and the result is added to M_A(i, :). Therefore, each nonzero x_ijk incurs three Hadamard products (one per mode) in its owner process µ_X(i, j, k) in an iteration of the CP-ALS algorithm. The other computations are either efficiently handled with BLAS3 routines (Lines 12 and 20), or they (Line 10) are not as numerous as the Hadamard products. The significant communication operations are the sends at Lines 8 and 15 and the corresponding receives at Lines 10 and 18. Again, the all-reduce operation at Line 20 would be handled efficiently by the MPI implementation. Therefore, the problems of computational load balance and reduced communication cost need to be addressed by partitioning the tensor nonzeros equitably among the processes (for load balance at Line 5), while trying to confine all slices of all modes to a small set of processes (to reduce communication at Lines 8 and 18).

We propose a hypergraph model for the computational load and communication volume of the fine-grain parallel Mttkrp operation in the context of CP-ALS. For a given tensor, the hypergraph H = (V, E) with the vertex set V and hyperedge set E is defined as follows. The vertex set V has four types of vertices: a_i for i = 1, ..., I_A; b_j for j = 1, ..., I_B; c_k for k = 1, ..., I_C; and the vertex x_ijk we define for each nonzero in X, by abusing the notation. The vertex x_ijk represents the Hadamard products B(j, :) ∗ C(k, :), A(i, :) ∗ C(k, :), and A(i, :) ∗ B(j, :) to be performed during the Mttkrp operations at different modes. There are three types of hyperedges in E, and they represent the dependencies of the Hadamard products on the rows of the factor matrices. The hyperedges are defined as follows:


E = E_A ∪ E_B ∪ E_C, where E_A contains a hyperedge n^a_i for each A(i, :), E_B contains a hyperedge n^b_j for each B(j, :), and E_C contains a hyperedge n^c_k for each C(k, :). Initially, n^a_i, n^b_j, and n^c_k contain only the corresponding vertices a_i, b_j, and c_k. Then, for each nonzero x_ijk ∈ X, we add the vertex x_ijk to the hyperedges n^a_i, n^b_j, and n^c_k to model the data dependency to the corresponding rows of A, B, and C. As in the previous subsection, one needs a multi-constraint formulation, for three reasons. First, since the x_ijk vertices represent the Hadamard products, they should be evenly distributed among the processes. Second, as the owners of the rows of A, B, and C are possible destinations of partial results (at Line 8), it is advisable to spread them evenly. Third, an almost equal number of rows per process implies balanced memory usage and balanced BLAS3 operations at Lines 12 and 20, even though these operations are not our main concern for the computational load. With these constraints, we associate a weight vector of size N + 1 with each vertex. For a vertex x_ijk, we set w(x_ijk, :) = [0, 0, 0, 3]; for the vertices a_i, b_j, and c_k, we set w(a_i, :) = [1, 0, 0, 0], w(b_j, :) = [0, 1, 0, 0], and w(c_k, :) = [0, 0, 1, 0].

In Figure 3.1b, we demonstrate the fine-grain hypergraph model for the sample tensor of the previous subsection. We exclude the vertex weights from the figure. Similar to the coarse-grain model, the dependencies on the rows of A, B, and C are modeled by connecting each vertex x_ijk to the hyperedges n^a_i, n^b_j, and n^c_k to capture the communication cost.

Consider now a P-way partition of the vertices of H = (V, E), and associate each part with a unique process to obtain a P-way parallel CP-ALS with the fine-grain Mttkrp operation. The partition of the x_ijk vertices uniquely partitions the Hadamard products. Satisfying the fourth balance constraint in (3.3) balances the computational loads of the processes. Similarly, satisfying the first three constraints leads to an equitable partition of the rows of the factor matrices among the processes. Let µ_X(i, j, k) = p and µ_A(i) = ℓ, that is, the vertices x_ijk and a_i are in different parts. Then, the process p sends a partial result on M_A(i, :) to the process ℓ at Line 8 of Algorithm 4. By looking at these messages carefully, we see that all processes whose corresponding parts are in the connectivity set of n^a_i will send a message to µ_A(i). Since µ_A(i) is also in this set, we see that the total volume of the send operations concerning M_A(i, :) at Line 8 is equal to λ_{n^a_i} − 1. Therefore, the connectivity-1 cut size metric (3.2) over the hyperedges in E_A encodes the total volume of messages sent at Line 8, if we set c_h = R for all hyperedges in E_A. Since the send operations at Line 15 are the duals of the send operations at Line 8, the total volume of messages sent at Line 15 for the first mode is also equal to this number. By extending this reasoning to all hyperedges, we see that the cumulative (over all modes) volume of communication is equal to the connectivity-1 cut-size metric (3.2).

The presence of multiple constraints usually causes difficulties for the current state of the art partitioning tools (Aykanat et al. [15] discuss the issue for PaToH [40]). In preliminary experiments, we observed that this was the case for us as well. In an attempt to circumvent this issue, we designed the following alternative. We start with the same hypergraph and remove the vertices corresponding to the rows of the factor matrices. Then we use single-constraint partitioning on the resulting hypergraph to partition the vertices of the nonzeros. We are then faced with the problem of assigning the rows of the factor matrices, given the partition on the tensor nonzeros. Any objective function (relevant to our parallelization context) in trying to solve this problem would likely be a variant of the NP-complete Bin-Packing Problem [93, p. 124], as in the SpMV case [27, 208]. Therefore, we heuristically handle this problem by making the following observations: (i) the modes can be partitioned one at a time; (ii) each row corresponds to a removed vertex for which there is a corresponding hyperedge; (iii) the partition of the x_ijk vertices defines a connectivity set for each hyperedge. This is essentially multi-constraint partitioning which takes advantage of the listed observations. We use the connectivity of the corresponding hyperedge as the key value of a row to impose an order. Then, rows are visited in decreasing order of key values. Each visited row is assigned to the least loaded part in the connectivity set of its corresponding hyperedge, if that part is not overloaded. Otherwise, the visited vertex is assigned to the least loaded part at that moment. Note that in the second case, we add one to the total communication volume; otherwise, the total communication volume obtained by partitioning the x_ijk vertices remains the same. Note also that the key values correspond to the volume of receives at Line 10 of Algorithm 4; hence, this method also aims at balancing the volume of receives at Line 10 and the volume of the dual send operations at Line 15.
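The row-assignment step for one mode can be sketched as follows (C++); the inputs (connectivity sets derived from the partition of the nonzero vertices) and the capacity definition are assumptions for illustration.

// Sketch: assign the rows of one factor matrix to parts after the nonzeros are
// partitioned. Rows are visited in decreasing connectivity; each row goes to the
// least loaded part of its connectivity set unless that part is overloaded, in
// which case it goes to the globally least loaded part.
#include <vector>
#include <algorithm>
#include <numeric>

std::vector<int> assignRows(const std::vector<std::vector<int>>& connSet, // per row
                            int numParts, long long capacity) {
  const std::size_t numRows = connSet.size();
  std::vector<std::size_t> order(numRows);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return connSet[a].size() > connSet[b].size();   // key = connectivity, decreasing
  });
  std::vector<long long> load(numParts, 0);
  std::vector<int> owner(numRows, 0);
  for (std::size_t r : order) {
    int best = -1;
    for (int p : connSet[r])                        // least loaded connected part
      if (best == -1 || load[p] < load[best]) best = p;
    if (best == -1 || load[best] >= capacity)       // overloaded (or empty set)
      best = static_cast<int>(std::min_element(load.begin(), load.end()) - load.begin());
    owner[r] = best;
    ++load[best];
  }
  return owner;
}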

3.4 Experiments

We conducted our experiments on the IDRIS Ada cluster, which consists of 304 nodes, each having 128 GBs of memory and four 8-core Intel Sandy Bridge E5-4650 processors running at 2.7 GHz. We ran our experiments up to 1024 cores and assigned one MPI process per core. All codes we used in our benchmarks were compiled using the Intel C++ compiler (version 14.0.1.106) with the -O3 option for compiler optimizations, -axAVX,SSE4.2 to enable SSE optimizations whenever possible, and the -mkl option to use the Intel MKL library (version 11.1.1) for LAPACK, BLAS, and VML (Vector Math Library) routines. We also compiled the DFacTo code using the Eigen (version 3.2.4) library [102] with the EIGEN_USE_MKL_ALL flag set, to ensure that it utilizes the routines provided in the Intel MKL library for the best performance. We used PaToH [40] with default options to partition the hypergraphs. We created all partitions offline and ran our experiments on these partitioned tensors on the cluster. We do not report timings for partitioning the hypergraphs, which is quite costly, for two reasons. First, in most


Tensor I1 I2 I3 I4 nonzeros

netflix 480K 17K 2K - 100M

NELL-B 32M 301 638K - 78M

deli 530K 17M 24M 1K 140M

flickr 319K 28M 16M 713 112M

Table 3.1: Sizes of the tensors used in the experiments.

applications, the tensors from real-world data are built incrementally; hence, a partition for the updated tensor can be formed by refining the partition of the previous tensor. Also, one can decompose a tensor multiple times with different ranks of approximation. Thereby, the cost of an initial partitioning of a tensor can be amortized across multiple runs. Second, we had to partition the data on a system different from the one used for the CP-ALS runs. It would not be very useful to compare run times from different systems, and repeating the same runs on a different system would not add much to the presented results. We also note that the proposed coarse and fine-grain Mttkrp formulations are independent from the partitioning method.

Our dataset for the experiments consists of four sparse tensors whose sizes are given in Table 3.1, obtained from various resources [21, 37, 97].

We briefly describe the methods compared in the experiments. DFacTo is the CP-ALS routine we compare our methods against; it uses a block partitioning of the rows of the matricized tensors to distribute the matrix nonzeros equally among the processes. In the method ht-coarsegrain-block, we operate the coarse-grain CP-ALS implementation in HyperTensor on a tensor whose slices in each mode are partitioned consecutively to ensure an equal number of nonzeros among the processes, similarly to DFacTo's approach. The method ht-coarsegrain-hp runs the same CP-ALS implementation, and it uses the hypergraph partitioning. The method ht-finegrain-random partitions the tensor nonzeros as well as the rows of the factor matrices randomly to establish load balance. It uses the fine-grain CP-ALS implementation in HyperTensor. Finally, the method ht-finegrain-hp corresponds to the fine-grain CP-ALS implementation in HyperTensor operating on the tensor partitioned according to the hypergraph model proposed for the fine-grain task decomposition. We benchmark our methods against DFacTo on the Netflix and NELL-B datasets only, as DFacTo can only process 3-mode tensors. We let the CP-ALS implementations run for 20 iterations on each data set with R = 10 and record the average time spent per iteration.

In Figures 3.2a and 3.2b, we show the time spent per CP-ALS iteration on the Netflix and NELL-B tensors for all algorithms. In Figures 3.2c and 3.2d, we show the speedups per CP-ALS iteration on the Flickr and Delicious tensors for all methods excluding DFacTo. While performing the comparison with DFacTo, we preferred to show the time spent per iteration instead of the speedup, as the sequential run time of our methods differs from that of


(a) Time on Netflix

(b) Time on NELL-B

(c) Speedup on Flickr

(d) Speedup on Delicious

Figure 32 Time for parallel CP-ALS iteration on Netflix and NELL-B andspeedups on Flickr and Delicious


DFacTo. Also, in Table 3.2, we show statistics regarding the communication and computation requirements of the 512-way partitions of the Netflix tensor for all proposed methods.

We first observe in Figure 3.2a that, on Netflix, ht-finegrain-hp clearly outperforms all other methods by achieving a speedup of 194x with 512 cores over a sequential execution, whereas ht-coarsegrain-hp, ht-coarsegrain-block, DFacTo, and ht-finegrain-random could only yield 69x, 63x, 49x, and 40x speedups, respectively. Table 3.2 shows that, with the 512-way partitioning of the same tensor, ht-finegrain-random, ht-coarsegrain-block, and ht-coarsegrain-hp result in total send communication volumes of 142M, 80M, and 77M units, respectively, whereas the ht-finegrain-hp partitioning requires only 7.6M units, which explains the superior parallel performance of ht-finegrain-hp. Similarly, in Figures 3.2b, 3.2c, and 3.2d, ht-finegrain-hp achieves 81x, 129x, and 123x speedups on the NELL-B, Flickr, and Delicious tensors, while the best of all other methods could only obtain 39x, 55x, and 65x, respectively, up to 1024 cores. As seen from these figures, ht-finegrain-hp is the fastest of all proposed methods. It experiences slow-downs at 1024 cores for all instances.

One point to note in Figure 3.2b is that our Mttkrp implementation gets notably more efficient than DFacTo on NELL-B. We believe that the main reason for this difference is DFacTo's column-by-column way of computing the factor matrices. We instead compute all columns of the factor matrices simultaneously. This vector-wise mode of operation on the rows of the factors can significantly improve the data locality and cache efficiency, which leads to this performance difference. Another point in the same figure is that HyperTensor running with the ht-coarsegrain-block partitioning scales better than DFacTo, even though they use similar partitions of the matricized tensor or the tensor. This is mainly due to the fact that DFacTo uses MPI_Allgatherv to communicate local rows of factor matrices to all processes, which incurs a significant increase in the communication volume, whereas our coarse-grain implementation performs point-to-point communications. Finally, in Figure 3.2a, DFacTo has a slight edge over our methods in sequential execution. This could be due to the fact that it requires 3(|X| + F) multiply/add operations across all modes, where F is the average number of non-empty fibers and is upper-bounded by |X|, whereas our implementation always takes 6|X| operations.
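To illustrate the vector-wise scheme mentioned above, the following is a minimal sketch of our own (not the HyperTensor code) of the mode-1 Mttkrp for a 3-mode tensor stored in COO format; the names mttkrp_mode1, I, J, K, and vals are hypothetical. For every nonzero, all R columns of the corresponding output row are updated at once with the elementwise product of the matching rows of the other factors.

import numpy as np

def mttkrp_mode1(I, J, K, vals, B, C, dims, R):
    # Mode-1 MTTKRP: M1[i, :] += x_{ijk} * (B[j, :] ∘ C[k, :]) for every nonzero.
    # All R columns are touched together, which is the row-wise (vector-wise) scheme.
    M1 = np.zeros((dims[0], R))
    for i, j, k, x in zip(I, J, K, vals):
        M1[i, :] += x * (B[j, :] * C[k, :])
    return M1

# Toy example: a 4 x 3 x 2 tensor with 5 nonzeros and rank R = 2.
R, dims = 2, (4, 3, 2)
I = np.array([0, 1, 2, 3, 0]); J = np.array([0, 2, 1, 0, 1]); K = np.array([0, 1, 0, 1, 1])
vals = np.array([1.0, 2.0, 0.5, 3.0, 1.5])
B = np.random.rand(dims[1], R); C = np.random.rand(dims[2], R)
print(mttkrp_mode1(I, J, K, vals, B, C, dims, R))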

Partitioning metrics in Table 3.2 show that ht-finegrain-hp is able to successfully reduce the total communication volume, which brings about its improved scalability. However, we do not see the same effect when using hypergraph partitioning for the coarse-grain task model, due to inherent limitations of 1D partitionings. Another point to note is that, despite reducing the total communication volume explicitly, imbalance in the communication volume still exists among the processes, as this objective is not directly captured by the hypergraph models. However, the most important limitation on further scalability is the communication latency. Table 3.2 shows that


                        Comp. load          Comm. volume        Num. msg.
Mode                    Max       Avg       Max       Avg       Max     Avg
ht-finegrain-hp
1                       196672    196251    21079     6367      734     316
2                       196672    196251    18028     5899      1022    1016
3                       196672    196251    3545      2492      1022    1018
ht-finegrain-random
1                       197507    196251    272326    252118    1022    1022
2                       197507    196251    29282     22715     1022    1022
3                       197507    196251    7766      4300      1013    1003
ht-coarsegrain-hp
1                       364181    196251    302001    136741    511     511
2                       349123    196251    59523     12228     511     511
3                       737570    196251    23524     2000      511     507
ht-coarsegrain-block
1                       198602    196251    239337    142006    448     447
2                       367966    196251    33889     12458     511     445
3                       737570    196251    24659     2049      511     394

Table 3.2: Statistics for the computation and communication requirements in one CP-ALS iteration for 512-way partitionings of the Netflix tensor.

each process communicates with almost all others, especially on the small dimensions of the tensors, and the number of messages is doubled for the fine-grain methods. To alleviate this issue, one can try to limit the latency following previous work [201, 208].

3.5 Summary, further notes and references

The pros and cons of the coarse- and fine-grain models resemble those of the 1D and 2D partitionings in the SpMV context. There are two advantages of the coarse-grain model over the fine-grain one. First, there is only one communication/computation step in the coarse-grain model, whereas the fine-grain model requires two computation and communication phases. Second, the coarse-grain hypergraph model is smaller than the fine-grain hypergraph model (one vertex per index of each dimension, versus additionally having one vertex per nonzero). However, one major limitation of the coarse-grain parallel decomposition is that restricting all nonzeros in an (N − 1)-dimensional slice of an N-dimensional tensor to reside in the same process poses a significant constraint towards communication minimization: an (N − 1)-dimensional data typically has a very large "surface", which necessitates gathering numerous rows of the factor matrices. As N increases, one might expect this phenomenon to increase the communication volume even further. Also, if the tensor X is not large enough in one of its modes, this poses a granularity problem, which potentially causes load imbalance and limits the


parallelism. None of these issues exists in the fine-grain task decomposition.

In a more recent work [135], we extended the discussed approach for

efficient parallelization in shared and distributed memory systems. For the shared-memory computations, we proposed a novel computational scheme called dimension trees that significantly reduces the computational cost while offering effective parallelism. We then performed theoretical analyses of the computational gains and the memory use. These analyses are also validated with the experiments. We also presented comparisons with the recent version of SPLATT [204, 205], which uses a medium-grain task model and associated partitioning [187]. Recent work discusses algorithms for partitioning for the medium-grain task decomposition [3].

A dimension tree is a data structure that partitions the mode indices of an N-dimensional tensor in a hierarchical manner for computing tensor decompositions efficiently. It was first used in the hierarchical Tucker format representing the hierarchical Tucker decomposition of a tensor [99], which was introduced as a computationally feasible alternative to the original Tucker decomposition for higher order tensors. Our work [135] presents a novel way of using dimension trees for computing the standard CP decomposition of tensors, with a formulation that asymptotically reduces the computational cost by memorizing certain partial computations.

For computing the CP decomposition of dense tensors, Phan et al. [189] propose a scheme that divides the tensor modes into two sets, pre-computes the tensor-times-multiple-vector multiplications (TTMVs) for each mode set, and finally reuses these partial results to obtain the final Mttkrp result in each mode. This provides a factor of two improvement in the number of TTMVs over the standard approach, and our dimension tree-based framework can be considered as a generalization of this approach that provides a factor of N/log N improvement. Li et al. [153] use the same idea of storing intermediate tensors, but use a different formulation based on the tensor-times-tensor multiplication and the tensor–matrix Hadamard product operations for shared memory systems. The overall approach is similar to that proposed by Phan et al. [189], where the difference lies in the application of the method to sparse tensors and in auto-tuning to better control the memory use and the gains in the operation counts.
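To make the reuse idea concrete, here is a small dense-tensor sketch of our own (it is not the implementation of Phan et al. [189] nor of [135]): the two partial tensors are computed once and all four Mttkrp results are derived from them; a dimension tree generalizes this sharing to a hierarchy of partial results.

import numpy as np

# Toy dense 4-mode tensor and rank-R factors (all names are illustrative only).
I, J, K, L, R = 5, 6, 7, 8, 3
X = np.random.rand(I, J, K, L)
A, B, C, D = (np.random.rand(d, R) for d in (I, J, K, L))

# Partial result shared by modes 1 and 2: contract modes 3 and 4 once.
P12 = np.einsum('ijkl,kr,lr->ijr', X, C, D)
M1 = np.einsum('ijr,jr->ir', P12, B)   # Mttkrp for mode 1
M2 = np.einsum('ijr,ir->jr', P12, A)   # Mttkrp for mode 2

# Partial result shared by modes 3 and 4: contract modes 1 and 2 once.
P34 = np.einsum('ijkl,ir,jr->klr', X, A, B)
M3 = np.einsum('klr,lr->kr', P34, D)   # Mttkrp for mode 3
M4 = np.einsum('klr,kr->lr', P34, C)   # Mttkrp for mode 4

# Check the mode-1 result against the direct (unshared) computation.
M1_direct = np.einsum('ijkl,jr,kr,lr->ir', X, B, C, D)
assert np.allclose(M1, M1_direct)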

In most of the sparse tensors available (e.g., see FROSTT [203]), one of the dimensions is usually very small. If the indices in that dimension are vertices in a hypergraph, then we have very high degree vertices; if the indices are hyperedges, then we have hyperedges with very large sizes. Both cases are unfavorable for the traditional multilevel hypergraph partitioning methods, where the run time of efficient coarsening heuristics is super-linear in terms of vertex/hyperedge degrees.

Part II

Matching and related problems


Chapter 4

Matchings in graphs

In this chapter, we address the problem of approximating the maximum cardinality of a matching in undirected graphs with fast heuristics. Most of the terms and definitions, including the problem itself, are standard, and we summarized them in Section 1.2. The formal problem definition is repeated below for convenience.

Approximating the maximum cardinality of a matching in undirected graphs. Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.

We design randomized heuristics for the stated problem and analyze their approximation guarantees. Both the design and the analysis of the heuristics are based on doubly stochastic matrices (see Section 1.4) and on scaling nonnegative matrices (in our case, the adjacency matrices) to the doubly stochastic form. We therefore first give some background on scaling matrices to the doubly stochastic form.

4.1 Scaling matrices to doubly stochastic form

Any nonnegative matrix A with total support can be scaled with two positive diagonal matrices DR and DC such that DRADC is doubly stochastic (that is, the sum of entries in any row and in any column of DRADC is equal to one). If the matrix is fully indecomposable, then the scaling matrices are unique up to a scalar multiplication (if the pair DR and DC scale the matrix, then so do αDR and (1/α)DC for α > 0). If A has support but not total support, then A can be scaled to a doubly stochastic matrix, but not with two positive diagonal matrices [202]; this fact is also seen in more recent treatments [139, 141, 198].

The Sinkhorn-Knopp algorithm [202] is a well-known method for scaling matrices to the doubly stochastic form. This algorithm generates a sequence of

55

56 CHAPTER 4 MATCHINGS IN GRAPHS

matrices (whose limit is doubly stochastic) by normalizing the columns and the rows of the sequence of matrices alternately, as shown in Algorithm 5. First, the initial matrix is normalized such that each column has sum one. Then, the resulting matrix is normalized so that each row has sum one, and so on. If the initial matrix is symmetric and has total support, the Sinkhorn-Knopp algorithm will find a symmetric doubly stochastic matrix in the limit, while the intermediate matrices generated in the sequence are not symmetric. Two other iterative scaling algorithms [140, 141, 198] maintain symmetry all along the way and are therefore more useful for symmetric matrices in practice. Efficient shared memory and distributed memory parallelizations of the Sinkhorn-Knopp and related algorithms have been discussed elsewhere [9, 41, 76].

Algorithm 5: ScaleSK: Sinkhorn-Knopp scaling

Data: A: an n × n matrix with total support; ε: the error threshold
Result: dr, dc: row/column scaling arrays

1:  for i = 1 to n do
2:      dr[i] ← 1
3:      dc[i] ← 1
4:  for j = 1 to n do
5:      csum[j] ← Σ_{i∈A_{*j}} a_{ij}
6:  repeat
7:      for j = 1 to n do
8:          dc[j] ← 1/csum[j]
9:      for i = 1 to n do
10:         rsum ← Σ_{j∈A_{i*}} a_{ij} × dc[j]
11:         dr[i] ← 1/rsum
12:     for j = 1 to n do
13:         csum[j] ← Σ_{i∈A_{*j}} dr[i] × a_{ij}
14: until |max{1 − csum[j] : 1 ≤ j ≤ n}| ≤ ε
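A sequential Python sketch of Algorithm 5 may help fix the ideas (this is our own illustration assuming NumPy/SciPy, with variable names following the pseudocode; it is not the parallel implementation discussed above).

import numpy as np
import scipy.sparse as sp

def sinkhorn_knopp(A, eps=1e-4, max_iter=100):
    # Returns dr, dc such that diag(dr) @ A @ diag(dc) is (approximately)
    # doubly stochastic; convergence is guaranteed when A has total support.
    A = sp.csr_matrix(A, dtype=float)
    n = A.shape[0]
    dr = np.ones(n)
    csum = np.asarray(A.sum(axis=0)).ravel()   # column sums of diag(dr) @ A, dr = 1
    for _ in range(max_iter):
        dc = 1.0 / csum                        # normalize columns
        rsum = A @ dc                          # row sums of A @ diag(dc)
        dr = 1.0 / rsum                        # normalize rows
        csum = A.T @ dr                        # column sums of diag(dr) @ A
        if np.max(np.abs(1.0 - csum)) <= eps:
            break
    return dr, dc

# Example: scale a small matrix with total support.
A = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)
dr, dc = sinkhorn_knopp(A)
S = np.diag(dr) @ A @ np.diag(dc)
print(S.sum(axis=0), S.sum(axis=1))            # both close to all ones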

4.2 Approximation algorithms for bipartite graphs

Most of the existing heuristics obtain good results in practice, but their worst-case guarantee is only around 1/2. Among those, the Karp-Sipser heuristic [128] is very well known. It finds maximum cardinality matchings in highly sparse (random) graphs and matches all but O(n^{1/5}) vertices of a random undirected graph [14] (this heuristic will be reviewed later in Section 4.2.1). Karp-Sipser also obtains very good results in practice. Currently, it is the suggested one to be used as a jump-start routine [71, 148] for exact algorithms, especially for augmenting-path based ones [132]. Algorithms that achieve an approximation ratio of 1 − 1/e, where e is the base


of the natural logarithm, are designed for the online case [129]. Many of these algorithms are sequential in nature, in that a sequence of greedy decisions is made in the light of the previously made decisions. Another heuristic is obtained by truncating the Hopcroft and Karp (HK) algorithm. HK, starting from a given matching, augments along a maximal set of shortest disjoint paths until there are no augmenting paths. If one lets HK run until the shortest augmenting paths are of length k, then a 1 − 2/k approximate matching is obtained, for k ≥ 3. The run time of this heuristic is O(τk) for a bipartite graph with τ edges.

We propose two matching heuristics (Section 4.2.2) for bipartite graphs. Both heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum matching in the selected subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge in both heuristics is obtained with a matrix scaling method. The first heuristic is shown to deliver a constant approximation guarantee of 1 − 1/e ≈ 0.632 of the maximum cardinality matching, under the condition that the scaling method has scaled the input matrix. The second one builds on top of the first one and improves the approximation ratio to 0.866, under the same condition as in the first one. Both heuristics are designed to be amenable to parallelization in modern multicore systems. The first heuristic does not require a conflict resolution scheme. Furthermore, it does not have any synchronization requirements. The second heuristic employs Karp-Sipser to find a matching on the selected subgraph. We show that Karp-Sipser becomes an exact algorithm on the selected subgraphs. Further analysis of the properties of those subgraphs is carried out to design a specialized implementation of Karp-Sipser for efficient parallelization. The approximation guarantees of the two proposed heuristics do not deteriorate with the increased degree of parallelization, thanks to their design, which is usually not the case for parallel matching heuristics [16].

For the analysis, we assume that the two vertex classes contain the same number of vertices and that the associated adjacency matrix has total support. Under these criteria, the Sinkhorn-Knopp scaling algorithm summarized in Section 4.1 converges in a finite number of steps and obtains a doubly stochastic matrix. Later on (towards the end of Section 4.2.2 and in the experiments presented in Section 4.2.3), we discuss bipartite graphs without the mentioned properties.

4.2.1 Related work

Parallel (exact) matching algorithms on modern architectures have recently been investigated. Azad et al. [16] study the implementations of a set of known bipartite graph matching algorithms on shared memory systems. Deveci et al. [66, 65] investigate the implementation of some known match-


ing algorithms or their variants on GPU. There are quite good speedups reported in these implementations, yet there are non-trivial instances where parallelism does not help (for any of the algorithms).

Our focus is on matching heuristics that have linear run time complexity and good quality guarantees on the size of the matching. Recent surveys of matching heuristics are given by Duff et al. [71, Section 4], Langguth et al. [148], and more recently by Pothen et al. [195]. Two heuristics, called Greedy and Karp-Sipser, stand out and are suggested as initialization steps in the best two exact matching algorithms [132]. These two heuristics have also attracted theoretical interest.

The Greedy heuristic has two variants in the literature. The first variant randomly visits the edges and matches the two endpoints of an edge if they are both available. The theoretical performance guarantee of this heuristic is 1/2, i.e., the heuristic delivers matchings of size at least half of the maximum cardinality of a matching. This has been analyzed theoretically [78] and shown to obtain results that are near the worst case on certain classes of graphs. The second variant of Greedy repeatedly selects a random vertex and matches it with a random neighbor. The matched vertices, along with the ones which become isolated, are removed from the graph, and the process continues until the whole graph is consumed. This variant also has a 1/2 worst-case approximation guarantee (see for example a proof by Pothen and Fan [194]), and it is somewhat better (0.5 + ε for ε ≥ 0.0000025 [13], which has recently been improved to ε ≥ 1/256 [191]).

We make use of the Karp-Sipser heuristic to design one of the proposed heuristics. We summarize a commonly used variant of Karp-Sipser here and refer the reader to the original paper [128] for details. The theoretical foundation of Karp-Sipser is that, if there is a vertex v with exactly one neighbor (v is called degree-one), then there is a maximum cardinality matching in which v is matched with its neighbor. That is, matching v with its neighbor is an optimal decision. Using this observation, Karp-Sipser runs as follows. Check whether there is a degree-one vertex; if so, then match the vertex with its unique neighbor and delete both vertices (and the edges incident on them) from the graph. Continue this way until the graph has no edges (in which case we are done) or all remaining vertices have degree larger than one. In the latter case, pick a random edge, match the two endpoints of this edge, and delete those vertices and the edges incident on them. Then, repeat the whole process on the remaining graph. The phase before the first random choice of edges is called Phase 1, and the rest is called Phase 2 (where new degree-one vertices may arise). The run time of this heuristic is linear. This heuristic matches all but O(n^{1/5}) vertices of a random undirected graph [14]. One disadvantage of Karp-Sipser is that, because of the degree dependencies of the vertices on the already matched vertices, an efficient parallelism is hard to achieve (a list of degree-one vertices needs to be maintained). That is probably why some inflicted forms (successful but


without any known quality guarantee) of this heuristic were used in recent studies [16].
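For concreteness, the following is a minimal sequential sketch of the Karp-Sipser variant just described (our own illustration; the adjacency-dict representation and the function name are assumptions, not an existing implementation).

import random

def karp_sipser(adj):
    # adj: dict mapping each vertex to a set of its neighbors (undirected graph).
    # Degree-one vertices are matched first (optimal decisions); when none
    # remains, the two endpoints of a random edge are matched.
    adj = {u: set(nbrs) for u, nbrs in adj.items()}
    mate = {}
    deg_one = [u for u, nbrs in adj.items() if len(nbrs) == 1]

    def remove_vertex(u):
        for w in adj.pop(u, set()):
            adj[w].discard(u)
            if len(adj[w]) == 1:
                deg_one.append(w)

    while any(adj.values()):
        while deg_one:                          # Phase 1 style step
            u = deg_one.pop()
            if u in adj and len(adj[u]) == 1:
                v = next(iter(adj[u]))
                mate[u], mate[v] = v, u
                remove_vertex(u); remove_vertex(v)
        if not any(adj.values()):
            break
        u = random.choice([w for w, nbrs in adj.items() if nbrs])   # Phase 2 style step
        v = random.choice(list(adj[u]))
        mate[u], mate[v] = v, u
        remove_vertex(u); remove_vertex(v)
    return mate

# Toy usage on a path 1-2-3-4 plus the edge 2-4.
graph = {1: {2}, 2: {1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
print(karp_sipser(graph))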

Recent studies focusing on approximate matching algorithms on parallel systems include heuristics for the graph matching problem [42, 86, 105] and also heuristics used for initializing bipartite matching algorithms [16]. Lotker et al. [164] present a distributed 1 − 1/k approximate matching algorithm for bipartite graphs. This nice theoretical algorithm takes O(k^3 log ∆ + k^2 log n) time steps with a message length of O(log ∆), where ∆ is the maximum degree of a vertex and n is the number of vertices in the graph. Blelloch et al. [29] propose an algorithm to compute maximal matchings (1/2 approximate) with O(τ) work and O(log^3 τ) depth, with high probability, on a bipartite graph with τ edges. This is an elaboration of the Greedy heuristic for parallel systems. Although the performance metrics, work and depth, are quite impressive, the approximation guarantee stays as in the serial variant. A striking property of this heuristic is that it trades parallelism and reduced work while always finding the same matching (including those found in the sequential version). Birn et al. [26] discuss maximal (1/2 approximate) matchings in O(log^2 n) time and O(τ) work in the CREW PRAM model. Practical implementations on distributed memory and GPU systems are discussed; a shared memory implementation is left as future work in the cited paper. Patwary et al. [185] discuss an efficient implementation of Karp-Sipser on distributed memory systems and show that the communication requirement (in terms of the volume) is proportional to that of sparse matrix–vector multiplication.

The overall approach is related to a random bipartite graph model called the random k-out model [211]. In these graphs, every vertex chooses k neighbors uniformly at random from the other part of vertices. Walkup [211] shows that a 1-out subgraph of a complete bipartite graph G = (V1 ∪ V2, E), where |V1| = |V2|, asymptotically almost surely (a.a.s.) does not contain a perfect matching. He also shows that a 2-out subgraph of a complete bipartite graph a.a.s. contains a perfect matching. Karoński and Pittel [125], later with Overman [124], extend the analysis to two other cases. In the first case, the vertices that were not selected at all choose one more neighbor; in the second case, the vertices that were selected only once choose one more neighbor. The initial claim [125] was that in the first case there was a perfect matching a.a.s., which was shown to be false [124]. The existence of perfect matchings in the second case, in which the graph is still between 1-out and 2-out, has been shown recently [124].

4.2.2 Two matching heuristics

We propose two matching heuristics for the approximate maximum cardinality matching problem on bipartite graphs. The heuristics are efficiently parallelizable and have guaranteed approximation ratios. The first heuristic


does not require synchronization nor conflict resolution, assuming that the write operations to the memory are atomic (such settings are common; see a discussion in the original paper [76]). This heuristic has an approximation guarantee of around 0.632. The second heuristic is designed to obtain larger matchings compared to those obtained by the first one. This heuristic employs Karp-Sipser on a judiciously chosen subgraph of the input graph. We show that, for this subgraph, Karp-Sipser is an exact algorithm, and a specialized efficient implementation is possible to obtain matchings of size around 0.866 of the maximum cardinality.

One-sided matching

The first matching heuristic we propose, OneSidedMatch, scales the given adjacency matrix A (each aij is originally either 0 or 1) and uses the scaled entries to randomly choose a column as a match for each row. The pseudocode of the heuristic is shown in Algorithm 6.

Algorithm 6: OneSidedMatch: Bipartite graphs

Data: A: an n × n (0,1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1:  ⟨dr, dc⟩ ← ScaleSK(A)
2:  for j = 1 to n in parallel do
3:      cmatch[j] ← NIL
4:  for i = 1 to n in parallel do
5:      Pick a random column j ∈ A_{i*} by using the probability density function
            s_{ik} / Σ_{ℓ∈A_{i*}} s_{iℓ}   for all k ∈ A_{i*},
        where s_{ik} = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
6:      cmatch[j] ← i

OneSidedMatch first obtains the scaling vectors dr and dc corresponding to a doubly stochastic matrix S (line 1). After initializing the cmatch array, for each row i of A, the heuristic randomly chooses a column j ∈ A_{i*} based on the probabilities computed from the corresponding scaled entries of row i. It then matches i and j. Clearly, multiple rows can choose the same column and write to the same entry in cmatch. We assume that, in the parallel shared-memory setting, one of the write operations survives, and the cmatch array defines a valid matching, i.e., {(cmatch[j], j) : cmatch[j] ≠ NIL}. We now analyze its approximation guarantee in terms of the matching cardinality.

Theorem 4.1. Let A be an n × n (0,1)-matrix with total support. Then, OneSidedMatch obtains a matching of size at least n(1 − 1/e) ≈ 0.632n in expectation, as n → ∞.

Proof. To compute the matching cardinality, we count the columns that are not picked by any row and subtract that count from n. Since Σ_{k∈A_{i*}} s_{ik} = 1 for each row i of S, the probability that a column j is not picked by any of the rows in A_{*j} is equal to ∏_{i∈A_{*j}} (1 − s_{ij}). By applying the arithmetic-geometric mean inequality, we obtain

\[
\sqrt[d_j]{\prod_{i \in A_{*j}} (1 - s_{ij})} \le \frac{d_j - \sum_{i \in A_{*j}} s_{ij}}{d_j},
\]

where d_j = |A_{*j}| is the degree of the column vertex j. Therefore,

\[
\prod_{i \in A_{*j}} (1 - s_{ij}) \le \left(1 - \frac{\sum_{i \in A_{*j}} s_{ij}}{d_j}\right)^{d_j}.
\]

Since S is doubly stochastic, we have Σ_{i∈A_{*j}} s_{ij} = 1, and

\[
\prod_{i \in A_{*j}} (1 - s_{ij}) \le \left(1 - \frac{1}{d_j}\right)^{d_j}.
\]

The function on the right-hand side above is an increasing one and has the limit

\[
\lim_{d_j \to \infty} \left(1 - \frac{1}{d_j}\right)^{d_j} = \frac{1}{e},
\]

where e is the base of the natural logarithm. By the linearity of expectation, the expected number of unmatched columns is no larger than n/e. Hence, the cardinality of the matching is no smaller than n(1 − 1/e).

In Algorithm 6, we split the rows among the threads with a parallel for construct. For each row i, the corresponding thread chooses a random number r from a uniform distribution with range (0, Σ_{k∈A_{i*}} s_{ik}]. Then, the last nonzero column index j for which Σ_{1≤k≤j} s_{ik} ≤ r is found, and cmatch[j] is set to i. Since no synchronization or conflict detection is required, the heuristic promises significant speedups.
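A sequential sketch of OneSidedMatch with the cumulative-sum sampling just described is given below (our own illustration on a dense matrix; it reuses the sinkhorn_knopp sketch from Section 4.1 and is not the parallel code of [76]).

import numpy as np

def one_sided_match(A, dr, dc, rng=np.random.default_rng()):
    # Each row i draws one column j with probability proportional to the
    # scaled entry s_ij = dr[i] * A[i, j] * dc[j]; later writes win, as
    # concurrent writes would in the shared-memory setting.
    n = A.shape[0]
    cmatch = np.full(n, -1)                    # -1 plays the role of NIL
    for i in range(n):
        s = dr[i] * A[i, :] * dc               # scaled row i
        r = rng.uniform(0.0, s.sum())          # random number in (0, row sum]
        j = np.searchsorted(np.cumsum(s), r)   # cumulative-sum sampling
        cmatch[j] = i
    return cmatch

# Usage with the sinkhorn_knopp sketch given earlier (assumed available).
A = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)
dr, dc = sinkhorn_knopp(A)
cmatch = one_sided_match(A, dr, dc)
print(cmatch, np.count_nonzero(cmatch >= 0))   # matching and its cardinality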

Two-sided matching

OneSidedMatch's approximation guarantee and suitable structure for parallel architectures make it a good heuristic. The natural question that follows is whether a heuristic with a better guarantee exists. Of course, the sought heuristic should also be simple and easy to parallelize. We asked "what happens if we repeat the process for the other (column) side of the bipartite graph?" The question led us to the following algorithm. Let each


row select a column, and let each column select a row. Take all these 2n choices to construct a bipartite graph G (a subgraph of the input) with 2n vertices and at most 2n edges (if i chooses j and j chooses i, we have one edge), and seek a maximum cardinality matching in G. Since the number of edges is at most 2n, any exact matching algorithm on this graph would be fast; in particular, the worst-case run time would be O(n^{1.5}) [117]. Yet, we can do better and obtain a maximum cardinality matching in linear time by running the Karp-Sipser heuristic on G, as we display in Algorithm 7.

Algorithm 7: TwoSidedMatch: Bipartite graphs

Data: A: an n × n (0,1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1:  ⟨dr, dc⟩ ← ScaleSK(A)
2:  for i = 1 to n in parallel do
3:      Pick a random column j ∈ A_{i*} by using the probability density function
            s_{ik} / Σ_{ℓ∈A_{i*}} s_{iℓ}   for all k ∈ A_{i*},
        where s_{ik} = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
4:      rchoice[i] ← j
5:  for j = 1 to n in parallel do
6:      Pick a random row i ∈ A_{*j} by using the probability density function
            s_{kj} / Σ_{ℓ∈A_{*j}} s_{ℓj}   for all k ∈ A_{*j}
7:      cchoice[j] ← i
8:  Construct a bipartite graph G = (V_R ∪ V_C, E), where
        E = {{i, rchoice[i]} : i ∈ {1, ..., n}} ∪ {{cchoice[j], j} : j ∈ {1, ..., n}}
9:  match ← Karp-Sipser(G)
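The construction of G and the call to Karp-Sipser can be sketched as follows (again a sequential illustration of our own, reusing the sinkhorn_knopp and karp_sipser sketches given earlier; column vertices are relabeled n, ..., 2n−1 to keep the graph representation simple).

import numpy as np

def two_sided_match(A, dr, dc, rng=np.random.default_rng()):
    # Build the (at most) 2n-edge subgraph from the row and column choices,
    # then match it with the karp_sipser sketch.
    n = A.shape[0]
    S = dr[:, None] * A * dc[None, :]                         # scaled matrix
    sample = lambda p: np.searchsorted(np.cumsum(p), rng.uniform(0.0, p.sum()))
    rchoice = [sample(S[i, :]) for i in range(n)]             # a column per row
    cchoice = [sample(S[:, j]) for j in range(n)]             # a row per column
    adj = {u: set() for u in range(2 * n)}
    for i, j in enumerate(rchoice):
        adj[i].add(n + j); adj[n + j].add(i)
    for j, i in enumerate(cchoice):
        adj[i].add(n + j); adj[n + j].add(i)
    return karp_sipser(adj)                                   # maximum matching in G

A = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)
dr, dc = sinkhorn_knopp(A)
mate = two_sided_match(A, dr, dc)
print(sum(1 for u in mate if u < A.shape[0]))                 # matching cardinality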

The most interesting component of TwoSidedMatch is the incorporation of the Karp-Sipser heuristic, for two reasons. First, although it is only a heuristic in general, Karp-Sipser computes a maximum cardinality matching on the bipartite graph G constructed in Algorithm 7. Second, although Karp-Sipser has a sequential nature, we can obtain good speedups with a specialized parallel implementation. In general, it is hard to efficiently parallelize (non-trivial) graph algorithms, and it is even harder when the overall cost is O(n), which is the case for Karp-Sipser on G. We give a series of lemmas below which enable us to use Karp-Sipser as an exact algorithm with a good shared-memory parallel performance.

The first lemma describes the structure of G constructed at line 8 of TwoSidedMatch.

Lemma 4.1. Each connected component of G constructed in Algorithm 7 contains at most one simple cycle.

Proof. A connected component M ⊆ G with n′ vertices can have at most n′ edges. Let T be a spanning tree of M. Since T contains n′ − 1 edges, the remaining edge in M can create at most one cycle when added to T.

Lemma 4.1 explains why Karp-Sipser is an exact algorithm on G. If a component does not contain a cycle, Karp-Sipser consumes all its vertices in Phase 1. Therefore, all of the matching decisions given by Karp-Sipser are optimal for this component. Assume a component contains a simple cycle. After Phase 1, the component is either consumed or, due to Lemma 4.1, it is reduced to a simple cycle. In the former case, the matching is a maximum cardinality one. In the latter case, an arbitrary edge of the cycle can be used to match a pair of vertices. This decision necessarily leads to a unique perfect matching in the remaining simple path. These two arguments can be repeated for all the connected components to see that the Karp-Sipser heuristic finds a maximum cardinality matching in G.

Algorithm 8 describes our parallel implementation, Karp-Sipser-MT. The graph is represented using a single array choice, where choice[u] is the vertex randomly chosen by u ∈ V_R ∪ V_C. The choice array is a concatenation of the arrays rchoice and cchoice set in TwoSidedMatch. Hence, an explicit graph construction for G (line 8 of Algorithm 7) is not required, and a transformation of the selected edges to a graph storage scheme is avoided. Karp-Sipser-MT uses three atomic operations for synchronization. The first operation, Add(memory, value), atomically adds a value to a memory location. It is used to compute the vertex degrees in the initial graph (line 9). The second operation, CompAndSwap(memory, value, replace), first checks whether the memory location has the value; if so, its content is replaced. The final content is returned. The third operation, AddAndFetch(memory, value), atomically adds a given value to a memory location, and the final content is returned. We describe the use of these last two operations later.

Karp-Sipser-MT has two phases, which correspond to the two phases of Karp-Sipser. The first phase of Karp-Sipser-MT is similar to that of Karp-Sipser in that optimal matching decisions are made about some degree-one vertices. The second phase of Karp-Sipser-MT handles the remaining vertices very efficiently, without bothering with their degrees. The following definitions are used to clarify the difference between the original Karp-Sipser and Karp-Sipser-MT.

Definition 4.1. Given a matching and the array choice, let u be an unmatched vertex and v = choice[u]. Then, u is called


Algorithm 8: Karp-Sipser-MT: Bipartite 1-out graphs

Data: G = (V, choice[·]): the chosen vertex for each u ∈ V
Result: match[·]: the match array for u ∈ V

1:  for all u ∈ V in parallel do
2:      mark[u] ← 1
3:      deg[u] ← 1
4:      match[u] ← NIL
5:  for all u ∈ V in parallel do
6:      v ← choice[u]
7:      mark[v] ← 0
8:      if choice[v] ≠ u then
9:          Add(deg[v], 1)
10: for each vertex u in parallel do              ▹ Phase 1: out-one vertices
11:     if mark[u] = 1 then
12:         curr ← u
13:         while curr ≠ NIL do
14:             nbr ← choice[curr]
15:             if CompAndSwap(match[nbr], NIL, curr) = curr then
16:                 match[curr] ← nbr
17:                 curr ← NIL
18:                 next ← choice[nbr]
19:                 if match[next] = NIL then
20:                     if AddAndFetch(deg[next], −1) = 1 then
21:                         curr ← next
22:             else
23:                 curr ← NIL
24: for each column vertex u in parallel do       ▹ Phase 2: the rest
25:     v ← choice[u]
26:     if match[u] = NIL and match[v] = NIL then
27:         match[u] ← v
28:         match[v] ← u

65

• out-one, if v is unmatched and no unmatched vertex w with choice[w] = u exists;

• in-one, if v is matched and only a single unmatched vertex w with choice[w] = u exists.

The first phase of Karp-Sipser-MT (for loop of line 10) does not track and match all degree-one vertices. Instead, only the out-one vertices are taken into account. For each such vertex u that is already out-one before Phase 1, we have mark[u] = 1. Karp-Sipser-MT visits these vertices (lines 10-11). Newly arising out-one vertices are consumed right away, without maintaining a list. The second phase of Karp-Sipser-MT (for loop of line 24) is much simpler than that of Karp-Sipser, as the degrees of the vertices are not tracked/updated. We now discuss how these simplifications are possible while ensuring a maximum cardinality matching in G.

Observation 4.1. An out-one or an in-one vertex is a degree-one vertex in Karp-Sipser.

Observation 4.2. A degree-one vertex in Karp-Sipser is either an out-one or an in-one vertex, or it is one of the two vertices u and v in a 2-clique such that v = choice[u] and u = choice[v].

Lemma 4.2 (Proved elsewhere [76, Lemma 3]). If there exists an in-one vertex in G at any time during the execution of Karp-Sipser-MT, an out-one vertex also exists.

According to Observation 4.1, all the matching decisions given by Karp-Sipser-MT in Phase 1 are optimal, since an out-one vertex is a degree-one vertex. Observation 4.2 implies that, among all the degree-one vertices, Karp-Sipser-MT ignores only the in-ones and 2-cliques. According to Lemma 4.2, an in-one vertex cannot exist without an out-one vertex; therefore, they are handled in the same phase. The 2-cliques that survive Phase 1 are handled in Karp-Sipser-MT's Phase 2, since they can be considered as cycles.

To analyze the second phase of Karp-Sipser-MT, we will use the following lemma.

Lemma 4.3 (Proved elsewhere [76, Lemma 4]). Let G′ = (V′_R ∪ V′_C, E′) be the graph induced by the remaining vertices after the first phase of Karp-Sipser-MT. Then, the set

{(u, choice[u]) : u ∈ V′_R, choice[u] ∈ V′_C}

is a maximum cardinality matching in G′.

In the light of Observations 4.1 and 4.2 and Lemmas 4.1–4.3, Karp-Sipser-MT is an exact algorithm on the graphs created in Algorithm 7. Its worst-case (sequential) run time is linear.


Figure 4.1: A toy bipartite graph with 9 row (circles) and 9 column (squares) vertices. The edges are oriented from a vertex u to the vertex choice[u]. Assuming all the vertices are currently unmatched, matching 15-7 (or 5-13) creates two degree-one vertices. But no out-one vertex arises after matching (15-7), and only one vertex, 6, arises after matching (5-13).

Karp-Sipser-MT tracks and consumes only the out-one vertices. This brings high flexibility while executing Karp-Sipser-MT in multi-threaded environments. Consider the example in Figure 4.1. Here, after matching a pair of vertices and removing them from the graph, multiple degree-one vertices can be generated. The standard Karp-Sipser uses a list to store these new degree-one vertices. Such a list is necessary to obtain a larger matching, but the associated synchronizations while updating it in parallel would be an obstacle for efficiency. The synchronization can be avoided up to some level if one sacrifices the approximation quality by not making all optimal decisions (as in some existing work [16]). We continue with the following lemma to take advantage of the special structure of the graphs in TwoSidedMatch for parallel efficiency in Phase 1.

Lemma 4.4 (Proved elsewhere [76, Lemma 5]). Consuming an out-one vertex creates at most one new out-one vertex.

According to Lemma 4.4, Karp-Sipser-MT does not need a list to store the new out-one vertices, since the process can continue with the new out-one vertex. In a shared-memory setting, there are two concerns for the first phase from the synchronization point of view. First, multiple threads that are consuming different out-one vertices can try to match them with the same unmatched vertex. To handle such cases, Karp-Sipser-MT uses the atomic CompAndSwap operation (line 15 of Algorithm 8) and ensures that only one of these matchings will be processed. In this case, the other threads, whose matching decisions are not performed, continue with the next vertex in the for loop at line 10. The second concern is that, while consuming out-one vertices, several threads may create the same out-one vertex (and want to continue with it). For example, in Figure 4.1, when two threads consume the out-one vertices 1 and 2 at the same time, they both will try to continue with vertex 4. To handle such cases, an atomic AddAndFetch


operation (line 20 of Algorithm 8) is used to synchronize the degree reduction operations on the potential out-one vertices. This approach explicitly orders the vertex consumptions and guarantees that only the thread which performs the last consumption before a new out-one vertex u appears continues with u. The other threads, which wanted to continue with the same path, stop and skip to the next unconsumed out-one vertex in the main for loop. One last concern that can be important for the parallel performance is the maximum length of such paths, since a very long path can yield a significant imbalance in the work distribution to the threads. We investigated the length of such paths experimentally (see the original paper [76]) and observed them to be too short to hurt the parallel performance.

The second phase of Karp-Sipser-MT is efficiently parallelized by using the idea in Lemma 4.3. That is, a maximum cardinality matching for the graph remaining after the first phase of Karp-Sipser-MT can be obtained via a simple parallel for construct (see line 24 of Algorithm 8).

Quality of approximation. If the initial matrix A is the n × n matrix of 1s, that is, aij = 1 for all 1 ≤ i, j ≤ n, then the doubly stochastic matrix S is such that sij = 1/n for all 1 ≤ i, j ≤ n. In this case, the graph G created by Algorithm 7 is a random 1-out bipartite graph [211]. Referring to a study by Meir and Moon [172], Karoński and Pittel [125] argue that the maximum cardinality of a matching in a random 1-out bipartite graph is 2(1 − Ω)n ≈ 0.866n in expectation, where Ω ≈ 0.567, also called Lambert's W(1), is the unique solution of the equation Ω e^Ω = 1. We obtain the same result for random 1-out subgraphs of any bipartite graph whose adjacency matrix has total support, as highlighted in the following theorem.

Theorem 4.2. Let A be the n × n adjacency matrix of a given bipartite graph, where A has total support. Then, TwoSidedMatch obtains a matching of size 2(1 − Ω)n ≈ 0.866n in expectation, as n → ∞, where Ω ≈ 0.567 is the unique solution of the equation Ω e^Ω = 1.

Theorem 4.2 contributes to the known results about the Karp-Sipser heuristic (recall that it is known to leave out O(n^{1/5}) vertices) by showing a constant approximation ratio with some preprocessing. The existence of total support does not seem to be necessary for Theorem 4.2 to hold (see the next subsection).

The proof of Theorem 4.2 is involved and can be found in the original paper [76]. It essentially adapts the original analysis of Karp and Sipser [128] to the special graphs at hand.
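The constant in Theorem 4.2 is easy to check numerically; the short snippet below (an illustration of ours, assuming SciPy is available) evaluates Ω = W(1) and 2(1 − Ω).

import math
from scipy.special import lambertw

omega = lambertw(1).real              # Ω = W(1) ≈ 0.567143
print(omega * math.exp(omega))        # ≈ 1.0, the defining equation Ω e^Ω = 1
print(2 * (1 - omega))                # ≈ 0.8657, the ratio in Theorem 4.2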

Discussion

The Sinkhorn-Knopp scaling algorithm converges linearly (when A has total support), where the convergence rate is equivalent to the square of the second largest singular value of the resulting doubly stochastic matrix [139]. If the matrix does not have support or total support, less is known about the Sinkhorn-Knopp scaling algorithm's convergence; we do not know how much of a reduction we will get at each step. In such cases, we are not able to bound the run time of the scaling step. However, the scaling algorithms should be run for only a few iterations (see also the next paragraph) in practice, in which case the practical run time of our heuristics would be linear (in edges and vertices).

We have discussed the proposed matching heuristics while assuming that A has total support. This can be relaxed in two ways to render the overall approach practical in any bipartite graph. The first relaxation is that we do not need to run the scaling algorithms until convergence (as in some other use cases of similar algorithms [141]). If Σ_{i∈A_{*j}} s_{ij} ≥ α instead of Σ_{i∈A_{*j}} s_{ij} = 1 for all j ∈ {1, ..., n}, then

\[
\lim_{d \to \infty} \left(1 - \frac{\alpha}{d}\right)^{d} = \frac{1}{e^{\alpha}}.
\]

In other words, if we apply the scaling algorithms for a few iterations, or until some relatively large error tolerance, we can still derive similar results. For example, if α = 0.92, we will have a matching of size n(1 − 1/e^{0.92}) ≈ 0.601n (larger column sums give improved ratios, but there are columns whose sum is less than one when convergence is not achieved) for OneSidedMatch. With the same α, TwoSidedMatch will obtain an approximation of 0.840. In our experiments, the number of iterations was always a few, and the proven approximation guarantees were always observed. The second relaxation is that we do not need total support; we do not even need support, nor an equal number of vertices in the two vertex classes. Since little is known about the scaling methods on such matrices, we only mention some facts and observations, and later on present some experiments to demonstrate the practicality of the proposed OneSidedMatch and TwoSidedMatch heuristics.

A sparse matrix (not necessarily square) can be permuted into a block upper triangular form using the canonical Dulmage-Mendelsohn (DM) decomposition [77]:

\[
A = \begin{pmatrix} H & * & * \\ O & S & * \\ O & O & V \end{pmatrix},
\qquad
S = \begin{pmatrix} S_1 & * \\ O & S_2 \end{pmatrix},
\]

where H (horizontal) has more columns than rows and has a matching covering all rows, S is square and has a perfect matching, and V (vertical) has more rows than columns and has a matching covering all columns. The following facts about the DM decomposition are well known [192, 194]. Any of these three blocks can be void. If H is not connected, then it is block diagonal with horizontal blocks. If V is not connected, then it is block diagonal with vertical blocks. If S does not have total support, then it is in the block upper triangular form shown on the right, where S1 and S2


have the same structure recursively, until each block Si has total support. The entries in the blocks shown by "∗" cannot be put into a maximum cardinality matching. When the presented scaling methods are applied to such a matrix, the entries in the "∗" blocks will tend to zero (the case of S is well documented [202]). Furthermore, the row sums of the blocks of H will be a multiple of the column sums in the same block; a similar statement holds for V; finally, S will be doubly stochastic. That is, the scaling algorithms applied to bipartite graphs without perfect matchings will zero out the entries in the irrelevant parts and identify the entries that can be put into a maximum cardinality matching.

4.2.3 Experiments

We present a selected set of results from the original paper [76]. We used a shared-memory parallel version of the Sinkhorn-Knopp algorithm with a fixed number of iterations. This way, the scaling algorithm is simplified, as no convergence check is required; furthermore, its overhead is bounded. In most of the experiments, we use 0, 1, 5, or 10 scaling iterations, where the case of 0 corresponds to applying the matching heuristics with uniform edge selection probabilities. While these do not guarantee convergence of the Sinkhorn-Knopp scaling algorithm, they seem to be enough for our purposes.

Matching quality

We investigate the matching quality of the proposed heuristics on all square matrices with support from the SuiteSparse Matrix Collection [59] having at least 1000 non-empty rows/columns and at most 20,000,000 nonzeros (some of the matrices are also in the 10th DIMACS challenge [19]). There were 742 matrices satisfying these properties at the time of experimentation. With the OneSidedMatch heuristic, the quality guarantee of 0.632 was surpassed with 10 iterations of the scaling method in 729 matrices. With the TwoSidedMatch heuristic and the same number of scaling iterations, the quality guarantee of 0.866 was surpassed in 705 matrices. Making 10 more scaling iterations smoothed out the remaining instances. The Greedy heuristic (the second variant discussed in Section 4.2.1) obtained on average (both the geometric and arithmetic means) matchings of size 0.93 of the maximum cardinality. In the worst case, 0.82 was observed.

We collect a few of the matrices from the SuiteSparse collection in Table 4.1. Figures 4.2a and 4.2b show the qualities on our test matrices. In the figures, the first columns represent the case where the neighbors are picked from a uniform distribution over the adjacency lists, i.e., the case with no scaling, hence no guarantee. The quality guarantees are achieved with only 5 scaling iterations. Even with a single iteration, the quality of TwoSidedMatch is more than 0.866 for all matrices, and that of OneSidedMatch


Name            n          # of edges    Avg deg   stdA     stdB     sprank/n
atmosmodl       1489752    10319760      6.9       0.3      0.3      1.00
audikw_1        943695     77651847      82.2      42.5     42.5     1.00
cage15          5154859    99199551      19.2      5.7      5.7      1.00
channel         4802000    85362744      17.8      1.0      1.0      1.00
europe_osm      50912018   108109320     2.1       0.5      0.5      0.99
Hamrle3         1447360    5514242       3.8       1.6      2.1      1.00
hugebubbles     21198119   63580358      3.0       0.03     0.03     1.00
kkt_power       2063494    12771361      6.2       7.45     7.45     1.00
nlpkkt240       27993600   760648352     26.7      2.22     2.22     1.00
road_usa        23947347   57708624      2.4       0.93     0.93     0.95
torso1          116158     8516500       73.3      419.59   245.44   1.00
venturiLevel3   4026819    16108474      4.0       0.12     0.12     1.00

Table 4.1: Properties of the matrices used in the comparisons. In the table, sprank is the maximum matching cardinality, Avg deg is the average vertex degree, and stdA and stdB are the standard deviations of the A and B vertex (row and column) degrees, respectively.

is more than 0.632 for all matrices.

Comparison of TwoSidedMatch with Karp-Sipser. Next, we analyze the performance of the proposed heuristics with respect to Karp-Sipser on a matrix class which we designed as a challenging case for Karp-Sipser. Let A be an n × n matrix, R1 be the set of A's first n/2 rows, and R2 be the set of A's last n/2 rows; similarly, let C1 and C2 be the sets of the first n/2 and the last n/2 columns. As Figure 4.3 shows, A has a full R1 × C1 block and an empty R2 × C2 block. The last h ≪ n rows and columns of R1 and C1, respectively, are full. Each of the blocks R1 × C2 and R2 × C1 has a nonzero diagonal. Those diagonals form a perfect matching when combined. In the sequel, a matrix whose corresponding bipartite graph has a perfect matching will be called full-sprank, and sprank-deficient otherwise.

When h ≤ 1, the Karp-Sipser heuristic consumes the whole graph during Phase 1 and finds a maximum cardinality matching. When h > 1, Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block R1 × C1 is full, it is more likely that the nonzero will be chosen from this block. Thus, a row in R1 will be matched with a column in C1, which is a bad decision, since the block R2 × C2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that TwoSidedMatch chooses an edge from that block goes to zero, as those entries cannot be in a perfect matching.
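A small generator for these adjacency matrices is sketched below (our own illustration; the function name hard_instance is hypothetical).

import numpy as np

def hard_instance(n, h):
    # Build the matrix of Figure 4.3: R1 x C1 full, R2 x C2 empty, the last h
    # rows/columns of R1/C1 full, and nonzero diagonals in R1 x C2 and R2 x C1
    # that together form a perfect matching.
    assert n % 2 == 0 and 0 < h <= n // 2
    half = n // 2
    A = np.zeros((n, n), dtype=int)
    A[:half, :half] = 1                  # full R1 x C1 block
    A[half - h:half, :] = 1              # last h rows of R1 are full
    A[:, half - h:half] = 1              # last h columns of C1 are full
    for i in range(half):
        A[i, half + i] = 1               # diagonal of the R1 x C2 block
        A[half + i, i] = 1               # diagonal of the R2 x C1 block
    return A

print(hard_instance(8, 2))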


[Figure 4.2 appears here: two bar charts over the matrices of Table 4.1, showing the matching quality obtained with 0, 1, and 5 scaling iterations; panel (a) OneSidedMatch, with a horizontal reference line at 0.632, and panel (b) TwoSidedMatch, with a horizontal reference line at 0.866.]

Figure 4.2: Matching qualities of OneSidedMatch and TwoSidedMatch. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations.

[Figure 4.3 appears here: a schematic of the adjacency matrix with row blocks R1, R2 and column blocks C1, C2; the full R1 × C1 block, the empty R2 × C2 block, and the last h rows/columns of R1/C1 are marked.]

Figure 4.3: Adjacency matrices of hard bipartite graph instances for Karp-Sipser.


                  Results with different numbers of scaling iterations
      Karp-     0          1                  5                  10
 h    Sipser    Qual       Err      Qual      Err      Qual      Err      Qual
 2    0.782     0.522      1385.3   0.557     346.3    0.989     0.578    0.999
 4    0.704     0.489      1125.7   0.516     385.6    0.980     0.604    0.997
 8    0.707     0.466      865.3    0.487     434.5    0.946     0.648    0.996
16    0.685     0.448      637.3    0.458     468.3    0.885     0.725    0.990
32    0.670     0.447      455.5    0.453     442.8    0.748     0.867    0.980

Table 4.2: Quality comparison (minimum of 10 executions for each instance) of the Karp-Sipser heuristic and TwoSidedMatch on the matrices described in Figure 4.3, with n = 3200 and h ∈ {2, 4, 8, 16, 32}.

The results of the experiments are in Table 4.2. The first column shows the h value. Then, the matching quality obtained by Karp-Sipser and by TwoSidedMatch with different numbers of scaling iterations (0, 1, 5, 10), as well as the scaling error, are given. The scaling error is the maximum difference between 1 and each row/column sum of the scaled matrix (for 0 iterations, it is equal to n − 1 for all cases). The quality of a matching is computed by dividing its cardinality by the maximum one, which is n = 3200 for these experiments. To obtain the values in each cell of the table, we run the programs 10 times and give the minimum quality (as we are investigating the worst-case behavior). The highest variances for Karp-Sipser and TwoSidedMatch were (up to four significant digits) 0.0041 and 0.0001, respectively. As expected, when h increases, Karp-Sipser performs worse, and the matching quality drops to 0.67 for h = 32. TwoSidedMatch's performance increases with the number of scaling iterations. As the experiment shows, only 5 scaling iterations are sufficient to make the proposed two-sided matching heuristic significantly better than Karp-Sipser. However, this number was not enough to reach 0.866 for the matrix with h = 32. On this matrix, with 10 iterations, only 2% of the rows/columns remain unmatched.

Matching quality on bipartite graphs without perfect matchings. We analyze the proposed heuristics on a class of random sprank-deficient square (n = 100000) and rectangular (m = 100000 and n = 120000) matrices with a uniform nonzero distribution (two more sprank-deficient matrices are used in the scalability tests as well). These matrices are generated by Matlab's sprand command (generating Erdős-Rényi random matrices [85]). The total number of nonzeros is set to be around d × 100000, for d ∈ {2, 3, 4, 5}. The top half of Table 4.3 presents the results of this experiment with the square matrices, and the bottom half presents the results with the rectangular ones.

As in the previous experiments, the matching qualities in the table are the minimum of 10 executions for the corresponding instances. As Table 4.3


                        One      Two                           One      Two
                        Sided    Sided                         Sided    Sided
 d   iter   sprank      Match    Match    d   iter   sprank    Match    Match

 Square 100000 × 100000
 2   0      78225       0.770    0.912    4   0      97787     0.644    0.838
 2   1      78225       0.797    0.917    4   1      97787     0.673    0.848
 2   5      78225       0.850    0.939    4   5      97787     0.719    0.873
 2   10     78225       0.879    0.954    4   10     97787     0.740    0.886
 3   0      92786       0.673    0.851    5   0      99223     0.635    0.840
 3   1      92786       0.703    0.857    5   1      99223     0.662    0.851
 3   5      92786       0.756    0.884    5   5      99223     0.701    0.873
 3   10     92786       0.784    0.902    5   10     99223     0.716    0.882

 Rectangular 100000 × 120000
 2   0      87373       0.793    0.912    4   0      99115     0.729    0.899
 2   1      87373       0.815    0.918    4   1      99115     0.754    0.910
 2   5      87373       0.861    0.939    4   5      99115     0.792    0.933
 2   10     87373       0.886    0.955    4   10     99115     0.811    0.946
 3   0      96564       0.739    0.896    5   0      99761     0.725    0.905
 3   1      96564       0.769    0.904    5   1      99761     0.749    0.917
 3   5      96564       0.813    0.930    5   5      99761     0.781    0.936
 3   10     96564       0.836    0.945    5   10     99761     0.792    0.943

Table 4.3: Matching qualities of the proposed heuristics on random sparse matrices with a uniform nonzero distribution. Square matrices are in the top half, rectangular matrices in the bottom half; d is the average number of nonzeros per row.

shows, when the deficiency is high (correlated with small d), it is easier for our algorithms to approximate the maximum cardinality. However, when d gets larger, the algorithms require more scaling iterations. Even in this case, 5 iterations are sufficient to achieve the guaranteed qualities. In the square case, the minimum qualities achieved by OneSidedMatch and TwoSidedMatch were 0.701 and 0.873; in the rectangular case, they were 0.781 and 0.930, respectively, with 5 scaling iterations. In all cases, increased scaling iterations result in higher quality matchings, in accordance with the previous results on square matrices with perfect matchings.

Comparisons with some other codes

We run Azad et al.'s [16] variant of Karp-Sipser (ksV), the two greedy matching heuristics of Blelloch et al. [29] (inc and nd, which are referred to as "incremental" and "non-deterministic reservation" in the original paper), and the proposed OneSidedMatch and TwoSidedMatch heuristics with one and 16 threads on the matrices given in Table 4.1. We also add one more matrix, wc_50K_64, from the family shown in Figure 4.3, with n = 50000 and h = 64.


                                   Run time in seconds
                        One thread                         16 threads
Name            ksV     inc      nd     1SD    2SD     ksV    inc     nd     1SD    2SD
atmosmodl       0.20    14.80    0.07   0.12   0.30    0.03   1.63    0.01   0.01   0.03
audikw_1        0.98    206.00   0.22   0.55   0.82    0.10   16.90   0.02   0.06   0.09
cage15          1.54    9.96     0.39   0.92   1.95    0.29   1.10    0.04   0.09   0.19
channel         1.29    105.00   0.30   0.82   1.46    0.14   9.07    0.03   0.08   0.13
europe_osm      12.03   51.80    2.06   4.89   11.97   1.20   7.12    0.21   0.46   1.05
Hamrle3         0.15    0.12     0.04   0.09   0.25    0.02   0.02    0.01   0.01   0.02
hugebubbles     8.36    4.65     1.28   3.92   10.04   0.95   0.53    0.18   0.37   0.94
kkt_power       0.55    0.68     0.09   0.19   0.45    0.08   0.16    0.01   0.02   0.05
nlpkkt240       10.88   1900.00  2.56   5.56   10.11   1.15   237.00  0.23   0.58   1.06
road_usa        6.23    3.57     0.96   2.16   5.42    0.70   0.38    0.09   0.22   0.50
torso1          0.11    7.28     0.02   0.06   0.09    0.02   1.20    0.00   0.01   0.01
venturiLevel3   0.45    2.72     0.13   0.32   0.84    0.08   0.29    0.01   0.04   0.09
wc_50K_64       7.21    3200.00  1.46   4.39   5.80    1.17   402.00  0.12   0.55   0.59

                             Quality of the obtained matching
Name            ksV     inc     nd      1SD    2SD     ksV    inc     nd     1SD    2SD
atmosmodl       1.00    1.00    1.00    0.66   0.87    0.98   1.00    0.97   0.66   0.87
audikw_1        1.00    1.00    1.00    0.64   0.87    0.95   1.00    0.97   0.64   0.87
cage15          1.00    1.00    1.00    0.64   0.87    0.95   1.00    0.91   0.64   0.87
channel         1.00    1.00    1.00    0.64   0.87    0.99   1.00    0.99   0.64   0.87
europe_osm      0.98    0.95    0.95    0.75   0.89    0.98   0.95    0.94   0.75   0.89
Hamrle3         1.00    0.84    0.84    0.75   0.90    0.97   0.84    0.86   0.75   0.90
hugebubbles     0.99    0.96    0.96    0.70   0.89    0.96   0.96    0.90   0.70   0.89
kkt_power       0.85    0.79    0.79    0.78   0.89    0.87   0.79    0.86   0.78   0.89
nlpkkt240       0.99    0.99    0.99    0.64   0.86    0.99   0.99    0.99   0.64   0.86
road_usa        0.94    0.81    0.81    0.76   0.88    0.94   0.81    0.81   0.76   0.88
torso1          1.00    1.00    1.00    0.65   0.88    0.98   1.00    0.98   0.65   0.87
venturiLevel3   1.00    1.00    1.00    0.68   0.88    0.97   1.00    0.96   0.68   0.88
wc_50K_64       0.50    0.50    0.50    0.81   0.98    0.70   0.50    0.51   0.81   0.98

Table 4.4: Run time and quality of the different heuristics. OneSidedMatch and TwoSidedMatch are labeled with 1SD and 2SD, respectively, and apply two scaling iterations. The heuristic ksV is from Azad et al. [16], and the heuristics inc and nd are from Blelloch et al. [29].

The OneSidedMatch and TwoSidedMatch heuristics are run with two scaling iterations, and the reported run time includes all components of the heuristics. The inc and nd heuristics are compiled with g++ 4.4.5, and ksV

is compiled with gcc 4.4.5. For all the heuristics, we used -O2 optimization and the appropriate OpenMP flag for compilation. The experiments were carried out on a machine equipped with two Intel Sandy Bridge-EP CPUs clocked at 2.00 GHz and 256 GB of memory split across the two NUMA domains. Each CPU has eight cores (16 cores in total), and HyperThreading is enabled. Each core has its own 32 kB L1 cache and 256 kB L2 cache. The 8 cores on a CPU share a 20 MB L3 cache. The machine runs 64-bit Debian with Linux 2.6.39-bpo.2-amd64. The run times of all heuristics (in seconds) and their matching qualities are given in Table 4.4.


As seen in Table 4.4, nd is the fastest of the tested heuristics, with both one thread and 16 threads, on this data set. The second fastest heuristic is OneSidedMatch. The heuristic ksV is faster than TwoSidedMatch on six instances with one thread. With 16 threads, TwoSidedMatch is almost always faster than ksV; both heuristics are quite efficient and have a maximum run time of 1.17 seconds. The heuristic inc was observed to have large run times on some matrices with relatively many nonzeros per row. All the existing heuristics are always very efficient, except inc, which had difficulties on some matrices in the data set. In the original reference [29], inc is quite efficient for some other matrices (where the codes are compiled with Cilk). We do not see run times with nd elsewhere, but in our data set it was always better than inc. The proposed OneSidedMatch heuristic's quality is almost always close to its theoretical limit. TwoSidedMatch also obtains quality results close to its theoretical limit. Looking at the quality of the matchings with one thread, we see that ksV obtains (almost always) the best score, except on kkt_power and wc_50K_64, where TwoSidedMatch obtains the best score. The heuristics inc and nd obtain the same score with one thread, but with 16 threads, inc obtains better quality than nd almost always. OneSidedMatch never obtains the best score (TwoSidedMatch is always better), yet it is better than ksV, inc, and nd on the synthetic wc_50K_64 matrix. The TwoSidedMatch heuristic obtains better results than inc and nd on Hamrle3, kkt_power, road_usa, and wc_50K_64, with both one thread and 16 threads. Also, on hugebubbles with 16 threads, the difference between TwoSidedMatch and nd is 1% (in favor of nd).

4.3 Approximation algorithms for general undirected graphs

We extend the results on bipartite graphs of the previous section to general undirected graphs. We again select a subgraph randomly with a probability density function obtained by scaling the adjacency matrix of the input graph and run Karp-Sipser [128] on the selected subgraph. Again, Karp-Sipser becomes an exact algorithm on the selected subgraph, and analysis reveals that the approximation ratio is around 0.866 of the maximum cardinality in the original graph. We also propose two other variants of this algorithm, which obtain better results both in theory and in practice. We omit most of the proofs in the presentation; they can be found in the associated paper [73].

Recall that a graph G′ = (V′, E′) is a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. Let G_{k-out} = (V, E′) be a random subgraph of G, where each vertex in V randomly chooses k of its edges, with repetition (an edge {u, v} is included in G_{k-out} only once, even if it is chosen multiple times). We call G_{k-out} a k-out subgraph of G.

A permutation of [1, …, n] in which no element stays in its original position is called a derangement. The total number of derangements of [1, …, n], the subfactorial of n, is the nearest integer to n!/e [36, pp. 40-42], where e is the base of the natural logarithm.
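As a small numerical check of this formula (a worked example added here for illustration):

\[
\frac{3!}{e} \approx 2.21 \;\Rightarrow\; 2 \text{ derangements of } [1,2,3], \qquad
\frac{4!}{e} \approx 8.83 \;\Rightarrow\; 9 \text{ derangements of } [1,2,3,4].
\]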

4.3.1 Related work

The first polynomial time algorithm, with O(n²τ) complexity, for the maximum cardinality matching problem on general graphs with n vertices and τ edges was proposed by Edmonds [79]. Currently, the fastest algorithms have O(√n τ) complexity [30, 92, 174]. The first of these algorithms is by Micali and Vazirani [174]; recently, a simpler analysis of it was given by Vazirani [210]. Later, Blum [30] and Gabow and Tarjan [92] proposed algorithms with the same complexity.

Random graphs have been frequently analyzed in terms of their maximum matching cardinality. Frieze [90] shows results for undirected (general) graphs along the lines of Walkup's results for bipartite graphs [211]. In particular, he shows that if each vertex chooses uniformly at random one neighbor (among all vertices), then the constructed graph does not have a perfect matching almost always. In other words, he shows that random 1-out subgraphs of a complete graph do not have perfect matchings. Luckily, as in bipartite graphs, if each vertex chooses two neighbors, then the constructed graph has perfect matchings almost always. That is, random 2-out subgraphs of a complete graph have perfect matchings. Our contributions are about making these statements valid for any host graph (assuming the adjacency matrix has total support) and measuring the approximation guarantee of the 1-out case.

Randomized algorithms which check the existence of a perfect matching and generate perfect/maximum matchings have been proposed in the literature [45, 120, 165, 177, 178]. Lovász showed that the existence of a perfect matching can be verified in randomized O(n^ω) time, where O(n^ω) is the time complexity of the fastest matrix multiplication algorithm available [165]. More information, such as the set of allowed edges, i.e., the edges in some maximum matching of a graph, can also be found with the same randomized complexity, as shown by Cheriyan [45].

To construct a maximum cardinality matching in a general, non-bipartite graph, a simple, easy to implement algorithm with O(n^{ω+1}) randomized complexity was proposed by Rabin and Vazirani [197]. The algorithm uses matrix inversion as a subroutine. Later, Mucha and Sankowski proposed the first algorithm for the same problem with O(n^ω) randomized complexity [178]. This algorithm uses expensive subroutines, such as Gaussian elimination and equivalence classes formed based on the edges that appear in some perfect matching [166]. A simpler randomized algorithm with the same complexity was proposed by Harvey [108].


4.3.2 One-Out, its analysis, and variants

We first present the main heuristic and then analyze its approximation guarantee. While the heuristic is a straightforward adaptation of its counterpart for bipartite graphs [76], the analysis is more complicated because of odd cycles. The analysis shows that random 1-out subgraphs of a given graph have a maximum cardinality of a matching around 0.866 − log(n)/n of the best possible; the observed performance (see Section 4.3.3) is higher. Then, two variants of the main heuristic are discussed without proofs. They extend the 1-out subgraphs obtained by the first one and hence deliver better results theoretically.

The main heuristic One-Out

The heuristic, shown in Algorithm 9, first scales the adjacency matrix of a given undirected graph to be doubly stochastic. Based on the values in the matrix, the heuristic then randomly marks one of the neighbors of each vertex as chosen. This defines one marked edge per vertex. Then the subgraph of the original graph containing only the marked edges is formed, yielding an undirected graph with at most n edges, each vertex having at least one edge incident on it. The Karp-Sipser heuristic is run on this subgraph and finds a maximum cardinality matching on it, as the resulting 1-out graph has unicyclic components. The approximation guarantee of the proposed heuristic is analyzed for this step. One practical improvement over the previous work is to make sure that the matching obtained is a maximal one by running Karp-Sipser on the vertices that are not matched. This brings a large improvement to the cardinality of the matching in practice.

Let us classify the edges incident on a vertex v_i as in-edges (from those neighbors that have chosen i at Line 4) and an out-edge (to the neighbor chosen by i at Line 4). Dufossé et al. [76, Lemma 3] show that, during any execution of Karp-Sipser, it suffices to consider the vertices whose in-edges are unavailable but whose out-edges are available as degree-1 vertices.

The analysis traces an execution of Karp-Sipser on the subgraph G_{1-out}. In other words, the analysis can also be perceived as an analysis of the Karp-Sipser heuristic's performance on random 1-out graphs. Let A_1 be the set of vertices not chosen by any other vertex at Line 4 of Algorithm 9. These vertices have in-degree zero and out-degree one, and hence can be processed by Karp-Sipser. Let B_1 be the set of vertices chosen by the vertices in A_1. The vertices in B_1 can be perfectly matched with vertices in A_1, leaving some A_1 vertices not matched and creating some new in-degree-0 vertices. We can proceed to define A_2 to be the vertices that have in-degree 0 in V \ (A_1 ∪ B_1), define B_2 as those chosen by A_2, and so on so forth. Formally, let B_0 be an empty set, define A_i to be the set of vertices with in-degree 0 in V \ B_{i−1}, and let B_i be the vertices chosen by those in A_i


Algorithm 9: One-Out (undirected graphs)

Data: G = (V, E) and its adjacency matrix A
Result: match[·], the matching

1: D ← SymScale(A)
2: for i = 1 to n in parallel do
3:     Pick a random j ∈ A_{i*} by using the probability density function
           s_{ik} / Σ_{t ∈ A_{i*}} s_{it}   for all k ∈ A_{i*},
       where s_{ik} = d[i] × d[k] is the corresponding entry in the scaled matrix S = DAD
4:     Mark that i chooses j
5: Construct the graph G_{1-out} = (V, E′), where V = {1, …, n} and E′ = {(i, j) : i chose j or j chose i}
6: match ← Karp-SipserOne-Out(G_{1-out})
7: Make match a maximal matching

for i ≥ 1. Notice that A_i ⊆ A_{i+1} and B_i ⊆ B_{i+1}. The Karp-Sipser heuristic can process A_1, then A_2 \ A_1, and so on, until the remaining graph has cycles only. The sets A_i and B_i and their cardinalities are at the core of our analysis. We first present some facts about these sets and their cardinalities, and describe an implementation of Karp-Sipser instrumented to highlight them.

Lemma 4.5. With the definitions above, A_i ∩ B_i = ∅.

Proof. We prove this by induction. For i = 1, it clearly holds. Assume that it holds for all i < ℓ. Suppose there exists a vertex u ∈ A_ℓ ∩ B_ℓ. Because A_{ℓ−1} ∩ B_{ℓ−1} = ∅, u must necessarily belong to both A_ℓ \ A_{ℓ−1} and B_ℓ \ B_{ℓ−1}. For u to be in B_ℓ \ B_{ℓ−1}, there must exist at least one vertex v ∈ A_ℓ \ A_{ℓ−1} such that v chooses u. However, the condition for u ∈ A_ℓ is that no vertex in V \ (A_{ℓ−1} ∪ B_{ℓ−1}) has selected it. This is a contradiction, and the intersection A_ℓ ∩ B_ℓ should be empty.

Corollary 4.1. A_i ∩ B_j = ∅ for i ≤ j.

Proof. Assume A_i ∩ B_j ≠ ∅. Since A_i ⊆ A_j, we have a contradiction, as A_j ∩ B_j = ∅ by Lemma 4.5.

Thanks to Lemma 4.5 and Corollary 4.1, the sets A_i and B_i are disjoint, and they form a bipartite subgraph of G_{1-out} for all i = 1, …, ℓ.

The proposed version Karp-SipserOne-Out of the Karp-Sipser heuristic for 1-out graphs is shown in Algorithm 10. The degree-1 vertices are kept in a first-in first-out priority queue Q. The queue is first initialized with A_1, and a special marker is used to mark the end of A_1. Then, all vertices in A_1 are matched to some other vertices, defining B_1. When we remove two matched vertices from the graph G_{1-out} at Lines 29 and 40, we update the degrees of their remaining neighbors and append the vertices whose degrees drop to one to the queue. During Phase 1 of Karp-SipserOne-Out, we also maintain the sets of A_i and B_i vertices, while storing only the last ones; A_ℓ and B_ℓ are returned along with the number ℓ of levels, which is computed thanks to the use of the marker.

Apart from the scaling step, the proposed heuristic in Algorithm 9 has a linear worst-case time complexity of O(n + τ). The scaling step, if applied until convergence, can take more time than that. We do not suggest running it until convergence; for all practical purposes, 5 or 10 iterations of the basic method [141], or even fewer of the Newton iterations [140], seem sufficient (see the experiments). Therefore, the practical run time of the algorithm is linear.
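The following is a minimal Python sketch of the One-Out idea, added here for illustration. It assumes the graph is given as 0-based adjacency lists with no isolated vertices and that a scaling vector d has already been computed (uniform choices are used when d is not supplied); the Karp-Sipser part folds the two phases into one loop and is not the instrumented Karp-SipserOne-Out of Algorithm 10, and the final step of Algorithm 9 (making the matching maximal on the original graph) is omitted.

```python
import random

def one_out_matching(adj, d=None, rng=random):
    # adj: list of neighbor lists of an undirected graph (0-based, no isolated vertices)
    # d:   scaling vector so that s_ij = d[i]*d[j] approximates a doubly stochastic matrix
    n = len(adj)
    if d is None:
        d = [1.0] * n
    # Every vertex marks one neighbor, chosen with probability proportional to s_ij.
    chosen = [rng.choices(adj[i], weights=[d[i] * d[j] for j in adj[i]], k=1)[0]
              for i in range(n)]
    # Build the 1-out subgraph: at most n distinct marked edges.
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        nbrs[i].add(chosen[i])
        nbrs[chosen[i]].add(i)
    # Karp-Sipser on the 1-out subgraph: match a degree-1 vertex whenever one
    # exists; otherwise match the endpoints of an arbitrary remaining edge.
    match = [None] * n
    alive = {u for u in range(n) if nbrs[u]}
    while alive:
        deg1 = next((x for x in alive if len(nbrs[x]) == 1), None)
        u = deg1 if deg1 is not None else next(iter(alive))
        v = next(iter(nbrs[u]))
        match[u], match[v] = v, u
        for w in (u, v):
            for x in nbrs[w]:
                nbrs[x].discard(w)
            nbrs[w] = set()
        alive = {x for x in alive if nbrs[x]}
    return match
```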

Analysis of One-Out

Let a_i and b_i be two random variables representing the cardinalities of A_i and B_i, respectively, in an execution of Karp-SipserOne-Out on a random 1-out graph. Then, Karp-SipserOne-Out matches b_ℓ edges in the first phase and leaves a_ℓ − b_ℓ vertices unmatched. What remains after the first phase is a set of cycles. In the bipartite graph case [76], all vertices in those cycles are matchable, and hence the cardinality of the matching was measured by n − a_ℓ + b_ℓ. Since we can possibly have odd cycles after the first phase, we cannot match all remaining vertices in the general case of undirected graphs. Let c be a random variable representing the number of odd cycles after the first phase of Karp-SipserOne-Out. Then we have the following (obvious) lemma about the approximation guarantee of Algorithm 9.

Lemma 4.6. At the end of execution, the number of unmatched vertices is a_ℓ − b_ℓ + c. Hence, Algorithm 9 matches at least n − (a_ℓ − b_ℓ + c) vertices.

We need to quantify a_ℓ − b_ℓ and c in Lemma 4.6. We state an upper bound on a_ℓ − b_ℓ in Lemma 4.7 and an upper bound on c in Lemma 4.8, without any proof (the reader can find the proofs in the original paper [73]). While the bound for a_ℓ − b_ℓ holds for any graph with total support presented to Algorithm 9, the bounds for c are shown for random 1-out graphs of complete graphs. By plugging in the bounds for these quantities, we obtain the following theorem on the approximation guarantee of Algorithm 9.

Theorem 4.3. Algorithm 9 obtains a matching with cardinality at least 0.866 − ⌈1.04 log(0.336n)⌉/n in expectation, when the input is a complete graph.


Algorithm 10: Karp-SipserOne-Out (undirected graphs)

Data: G_{1-out} = (V, E)
Result: match[·], the mates of vertices
Result: ℓ, the number of levels in the first phase
Result: A, the set of degree-1 vertices in the first phase
Result: B, the set of vertices matched to A vertices

1:  match[u] ← NIL for all u
2:  Q ← {v : deg(v) = 1}                         ▷ degree-1, or in-degree 0
3:  if Q = ∅ then
4:      ℓ ← 0                                    ▷ no vertex in level 1
5:  else
6:      ℓ ← 1
7:  Enqueue(Q, marker)                           ▷ marks the end of the first level
8:  Phase-1 ← ongoing
9:  A ← B ← ∅
10: while true do
11:     while Q ≠ ∅ do
12:         u ← Dequeue(Q)
13:         if u = marker and Q = ∅ then
14:             break the while-Q loop
15:         else if u = marker then
16:             ℓ ← ℓ + 1
17:             Enqueue(Q, marker)               ▷ new level formed
18:             skip to the next while-Q iteration
19:         if match[u] ≠ NIL then
20:             skip to the next while-Q iteration
21:         for v ∈ adj(u) do
22:             if match[v] = NIL then
23:                 match[u] ← v
24:                 match[v] ← u
25:                 if Phase-1 = ongoing then
26:                     A ← A ∪ {u}
27:                     B ← B ∪ {v}
28:                 N ← adj(v)
29:                 G_{1-out} ← G_{1-out} \ {u, v}
30:                 Enqueue(Q, w) for w ∈ N with deg(w) = 1
31:                 break the for-v loop
32:         if Phase-1 = ongoing and match[u] = NIL then
33:             A ← A ∪ {u}                      ▷ u cannot be matched
34:     Phase-1 ← done
35:     if E ≠ ∅ then
36:         pick a random edge (u, v)
37:         match[u] ← v
38:         match[v] ← u
39:         N ← adj({u, v})
40:         G_{1-out} ← G_{1-out} \ {u, v}
41:         Enqueue(Q, w) for w ∈ N with deg(w) = 1
42:     else
43:         break the while-true loop


The theorem can also be interpreted more theoretically: a random 1-out graph has a maximum cardinality of a matching at least 0.866 − ⌈1.04 log(0.336n)⌉/n in expectation. The bound is in close vicinity of 0.866, which was the proved bound for the bipartite case [76]. We note that in deriving this bound we assumed a random 1-out graph (as opposed to a random 1-out subgraph of a given graph) only at a single step. We leave the extension to this latter case as future work and present experiments suggesting that the bound is also achieved for this case. In Section 4.3.3, we empirically show that the same bound also holds for graphs whose corresponding matrices do not have total support.

In order to measure a_ℓ − b_ℓ, we adapted a proof from earlier work [76], which was inspired by Karp and Sipser's analysis of the first phase of their heuristic [128], and obtained the following lemma.

Lemma 4.7. a_ℓ − b_ℓ ≤ (2Ω − 1)n, where Ω ≈ 0.567 equals W(1) of Lambert's W function.
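For reference (a short reminder added here), W(1) is the omega constant, the unique real solution of

\[
\Omega e^{\Omega} = 1, \qquad \Omega \approx 0.5671,
\]

so the constant in the lemma is 2Ω − 1 ≈ 0.134.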

The lemma also reads as a_ℓ − b_ℓ ≤ Σ_{i=1}^{n} (2 · 0.567 − 1) ≤ 0.134 · n.

In order to achieve the result of Theorem 4.3, we need to bound c, the number of odd cycles that remain in a G_{1-out} graph after the first phase of Karp-SipserOne-Out. We were able to bound c on random 1-out graphs (that is, we did not show it for a general graph).

Lemma 4.8. The expected number c of cycles remaining after the first phase of Karp-SipserOne-Out in random 1-out graphs is less than or equal to ⌈1.04 log(n − a_ℓ − b_ℓ)⌉, which is also less than or equal to ⌈1.04 log(0.336n)⌉.

Variants

Here we summarize two related theoretical random bipartite graph models that we adapt to the undirected case using similar algorithms. The presentation is brief and without proofs; we will present experiments in Section 4.3.3.

The random (1 + e^{−1}) undirected graph model (see the bipartite version [125], summarized in Section 4.2.1) lets each vertex choose a random neighbor. Then, the vertices that have not been chosen select one more neighbor. The maximum cardinality of a matching in the subgraph consisting of the identified edges can be computed as an approximate matching in the original graph. As this heuristic uses a richer graph structure than 1-out, we expect perfect or near perfect matchings in the general case.

A model richer in edges is the random 2-out graph model. In this model, each vertex chooses two neighbors. There are two enticing characteristics of this model in the uniform bipartite case. First, the probability of having a perfect matching goes to one with the increasing number of vertices [211]. Second, there is a special algorithm for computing the maximum cardinality matchings in these (bipartite) graphs [127], with high probability in linear time in expectation.


4.3.3 Experiments

To understand the effectiveness and efficiency of the proposed heuristics in practice, we report the matching quality and various statistics regarding the sets that appear in our analyses in Section 4.3.2. For the experiments, we used three graph datasets. The first set is generated with matrices from the SuiteSparse Matrix Collection [59]. We investigated all n × n matrices from the collection with 50,000 ≤ n ≤ 100,000. For a matrix from this set, we removed the diagonal entries, symmetrized the pattern of the resulting matrix, and discarded the matrix if it had empty rows. There were 115 matrices at the end, which we used as adjacency matrices. The graphs in the second dataset are synthetically generated to make Karp-Sipser deviate from the optimal matching as much as possible. This dataset contains five graphs with different hardness levels for Karp-Sipser. The third set contains five large real-life matrices from the SuiteSparse collection for measuring the run time efficiency of the proposed heuristics.

A comprehensive evaluation

We use MATLAB to measure the matching qualities on the first two datasets. For each matrix, five runs are performed with each randomized matching heuristic, and the average is reported. One, five, and ten iterations are performed to evaluate the impact of the scaling method.

Table 4.5 summarizes the quality of the matchings for all the experiments on the first dataset. The matching quality is measured as the ratio of the matching cardinality to the maximum cardinality of a matching in the original graph. The table presents statistics for matching qualities of Karp-Sipser performed on the original graphs (first row), 1-out graphs (the second set of rows), Karoński-Pittel-like (1 + e^{−1})-out graphs (the third set of rows), and 2-out graphs (the last set of rows).

For the U-Max rows, we construct k-out graphs by using uniform probabilities while selecting the neighbors, as proposed in the literature [90, 125]. We compare the cardinality of the maximum matchings in these k-out graphs to the maximum matching cardinality in the original graphs and report the statistics. The rows St-Max report the same statistics for the k-out graphs constructed by using probabilities with t ∈ {1, 5, 10} scaling iterations. These statistics serve as upper bounds on the matching qualities of the proposed St-KS heuristics, which execute Karp-Sipser on the k-out graphs obtained with t scaling iterations. Since the St-KS heuristics use only a subgraph, the matchings they obtain are not maximal with respect to the original edge set. The proposed St-KS+ heuristics exploit this fact and apply another Karp-Sipser phase on the subgraph containing only the unmatched vertices to improve the quality of the matchings. The table does not contain St-Max rows for 1-out graphs, since Karp-Sipser is an optimal


Group           Alg          Min     Max     Avg    GMean    Med    StDev

                Karp-Sipser  0.880   1.000   0.980  0.980   0.988   0.022

1-out           U-Max        0.168   1.000   0.846  0.837   0.858   0.091
                S1-KS        0.479   1.000   0.869  0.866   0.863   0.059
                S5-KS        0.839   1.000   0.885  0.884   0.865   0.044
                S10-KS       0.858   1.000   0.889  0.888   0.866   0.045
                S1-KS+       0.836   1.000   0.951  0.950   0.953   0.043
                S5-KS+       0.865   1.000   0.958  0.957   0.968   0.036
                S10-KS+      0.888   1.000   0.961  0.961   0.971   0.033

(1+e^{-1})-out  U-Max        0.251   1.000   0.952  0.945   0.967   0.081
                S1-Max       0.642   1.000   0.967  0.966   0.980   0.042
                S5-Max       0.918   1.000   0.977  0.977   0.985   0.020
                S10-Max      0.934   1.000   0.980  0.979   0.985   0.018
                S1-KS        0.642   1.000   0.963  0.962   0.972   0.041
                S5-KS        0.918   1.000   0.972  0.972   0.976   0.020
                S10-KS       0.934   1.000   0.975  0.975   0.977   0.018
                S1-KS+       0.857   1.000   0.972  0.972   0.979   0.025
                S5-KS+       0.925   1.000   0.978  0.978   0.984   0.018
                S10-KS+      0.939   1.000   0.980  0.980   0.985   0.016

2-out           U-Max        0.254   1.000   0.972  0.966   0.996   0.079
                S1-Max       0.652   1.000   0.987  0.986   0.999   0.036
                S5-Max       0.952   1.000   0.995  0.995   0.999   0.009
                S10-Max      0.968   1.000   0.996  0.996   1.000   0.007
                S1-KS        0.651   1.000   0.974  0.974   0.981   0.035
                S5-KS        0.945   1.000   0.982  0.982   0.984   0.013
                S10-KS       0.947   1.000   0.984  0.983   0.984   0.012
                S1-KS+       0.860   1.000   0.980  0.979   0.984   0.020
                S5-KS+       0.950   1.000   0.984  0.984   0.987   0.012
                S10-KS+      0.952   1.000   0.986  0.986   0.987   0.011

Table 4.5: For each matrix in the first dataset and each proposed heuristic, five runs are performed. The statistics are computed over the means of these results; the minimum, maximum, arithmetic and geometric averages, median, and standard deviation are reported.


Figure 4.4: The matching qualities, sorted in increasing order, for Karp-Sipser on the original graph, S5-KS, and S5-KS+ on (a) 1-out graphs and (b) 2-out graphs. The plots also contain the quality of the maximum cardinality matchings in these subgraphs (Max). The experiments are performed on the 115 graphs in the first dataset.

algorithm for these subgraphs.

As Table 4.5 shows, more scaling iterations increase the maximum matching cardinalities in k-out graphs. Although this is much clearer when the worst-case graphs are considered, it can also be observed in the arithmetic and geometric means. Since U-Max is the no-scaling case, the impact of the first scaling iteration (S1-KS vs. U-Max) is significant. On the other hand, the difference in matching quality between S5-KS and S10-KS is minor. Hence, five scaling iterations are deemed sufficient for the proposed heuristics in practice.

As the theory suggests, the heuristics St-KS perform well for (1 + e^{−1})-out and 2-out graphs. With t ∈ {5, 10}, their quality is almost on par with Karp-Sipser on the original graph, and even better for 2-out graphs. In addition, applying Karp-Sipser on the subgraph of unmatched vertices to obtain a maximal matching does not increase the matching quality much. Since this subgraph is small, the overhead of this extra work is not significant. Furthermore, this extra step significantly improves the matching quality for 1-out graphs, which a.a.s. do not have a perfect matching.

To better understand the practical performance of the proposed heuristics and the impact of the additional Karp-Sipser execution, we profile their performance by sorting their matching qualities in increasing order for all 115 matrices. Figure 4.4 plots these profiles on 1-out and 2-out graphs for the heuristics with five scaling iterations. As the first plot shows, five iterations are sufficient to obtain 0.86 matching quality, except in 17 of the 1-out experiments. The figure also shows that the maximum matching cardinality in a random 1-out graph is worse than what Karp-Sipser can obtain on the original graph. This is why, although S5-KS finds maximum matchings on 1-out graphs, its performance is still worse than Karp-Sipser. The additional Karp-Sipser in S5-KS+ closes almost all of this gap and makes the matching qualities close to those of Karp-Sipser. On the contrary, for 2-out graphs generated with five scaling iterations, the maximum matching cardinality is more than the cardinality of the matchings found by Karp-Sipser. There is still a gap between the best possible (red line) and what Karp-Sipser can find (blue line) on 2-out graphs. We believe that this gap can be targeted by specialized, efficient, exact matching algorithms for 2-out graphs.

Figure 4.5: The cumulative distribution of the ratio of the number of odd cycles remaining after the first phase of Karp-Sipser in One-Out to the number of vertices in the graph.

There are 30 rank-deficient matrices without total support among the 115 matrices in the first dataset. We observed that, even for 1-out graphs, the worst-case quality for S5-KS is 0.86 and the average is 0.93. Hence, the proposed approach also works well for rank-deficient matrices/graphs in practice.

Since the number c of odd cycles at the end of the first phase of Karp-Sipser is a performance measure, we investigate it. For each matrix, we compute the ratio c/n. We then plot the cumulative distribution of these values in Fig. 4.5. For all the experiments except one, this ratio is less than 1%. For the extreme case, the ratio increases to 8%. In that particular case, the matching found in the 1-out subgraph is maximum for the input graph (i.e., the number of odd components is also large in the original graph).

Hard instances for Karp-Sipser

Let A be an n × n symmetric matrix, V_1 be the set of A's first n/2 vertices, and V_2 be the set of A's last n/2 vertices. As Figure 4.6 shows, A has a full V_1 × V_1 block (ignoring the diagonal), i.e., a clique, and an empty V_2 × V_2 block. The last h ≪ n vertices of V_1 are connected to all the vertices in the corresponding graph. The block V_1 × V_2 has a nonzero diagonal; hence the corresponding graph has a perfect matching. Note that the same instances are used for the bipartite case.
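A small Python sketch of this construction follows (an illustration based only on the description above; the exact generator used in the experiments is not shown in the text):

```python
import numpy as np

def hard_instance_for_ks(n, h):
    # n is even; vertices 0..n/2-1 form V1 and n/2..n-1 form V2.
    half = n // 2
    A = np.zeros((n, n), dtype=int)
    A[:half, :half] = 1                      # V1 x V1 is a clique (diagonal cleared below)
    for i in range(half):
        A[i, half + i] = 1                   # nonzero diagonal of the V1 x V2 block
    A[half - h:half, :] = 1                  # the last h vertices of V1 see every vertex
    A = np.maximum(A, A.T)                   # symmetrize; V2 x V2 stays empty
    np.fill_diagonal(A, 0)                   # no self-loops
    return A
```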

On such a matrix with h = 0, Karp-Sipser consumes the whole graph during Phase 1 and finds a maximum cardinality matching. When h > 1,


Figure 4.6: Adjacency matrices of hard undirected graph instances for Karp-Sipser (the blocks V_1 × V_1, V_1 × V_2, and V_2 × V_2, with the last h rows/columns of V_1 full).

                              One-Out
                   5 iters          10 iters         20 iters
   h   Karp-Sipser  error  quality   error  quality   error  quality
   2      0.96       7.54   0.99      0.68   0.99      0.22   1.00
   8      0.78       8.52   0.97      0.78   0.99      0.23   0.99
  32      0.68       6.65   0.81      1.09   0.99      0.26   0.99
 128      0.63       3.32   0.53      1.89   0.90      0.33   0.98
 512      0.63       1.24   0.55      1.17   0.59      0.61   0.86

Table 4.6: Results for the hard instances with n = 5000 and different h values. One-Out is executed with 5, 10, and 20 scaling iterations, and the scaling errors are also reported. Averages over five runs are reported in each cell.

Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block V_1 × V_1 forms a clique, it is more likely that the nonzero will be chosen from this block. Thus, a vertex in V_1 will be matched with another vertex from V_1, which is a bad decision, since the block V_2 × V_2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that the proposed heuristics choose an edge from that block goes to zero, as those entries cannot be in a perfect matching.

Table 4.6 shows that the quality of Karp-Sipser drops to 0.63 as h increases. In comparison, the One-Out heuristic with five scaling iterations maintains a good quality for small h values. However, a better scaling with more iterations (10 and 20 for h = 128 and h = 512, respectively) is required to guarantee the desired matching quality; see the scaling error in the table.

Experiments on large-scale graphs

These experiments are performed on a machine running 64-bit CentOS 6.5, which has 30 cores, each of which is an Intel Xeon CPU E7-4870 v2 core


Karp-Sipser on the original graph
Matrix              |V| (×10^6)   |E| (×10^6)   Quality    time
cage15                  5.2           94.0        1.00      5.82
dielFilterV3real        1.1           88.2        0.99      3.36
hollywood-2009          1.1          112.8        0.93      4.18
nlpkkt240              28.0          746.5        0.98     52.95
rgg_n_2_24_s0          16.8          265.1        0.98     19.49

            One-Out-S5-KS+                                Two-Out-S5-KS+
                   Execution time (seconds)                      Execution time (seconds)
Matrix  Quality  Scale  1-out    KS    KS+   Total    Quality  Scale  2-out    KS    KS+   Total
cage      0.93    0.67   0.85   0.65   0.05   2.21      0.99    0.67   1.48   1.78   0.01   3.94
diel      0.98    0.52   0.36   0.07   0.01   0.95      0.99    0.52   0.62   0.16   0.00   1.30
holly     0.91    1.12   0.45   0.06   0.02   1.65      0.95    1.12   0.76   0.13   0.01   2.01
nlpk      0.93    4.44   4.99   3.96   0.27  13.66      0.98    4.44   9.43  10.56   0.11  24.54
rgg       0.93    2.33   2.33   2.08   0.17   6.91      0.98    2.33   4.01   6.70   0.11  13.15

Table 4.7: Summary of the results with five large-scale matrices for the original Karp-Sipser and the proposed One-Out and Two-Out heuristics with five scaling iterations and post-processing for maximal matchings. Matrix names are shortened in the lower half.

operating at 2.30 GHz. To choose five large-scale matrices from SuiteSparse, we first sorted the pattern-symmetric matrices in the collection in decreasing order of their nonzero count. We then chose the five largest matrices from different families, to increase the variety of experiments. The details of these matrices are given in Table 4.7. This table also contains the run times and matching qualities of the original Karp-Sipser and the proposed One-Out and Two-Out heuristics. The proposed heuristics use five scaling iterations and also apply Karp-Sipser at the end to ensure maximality.

The run time of the proposed heuristics is analyzed in four stages in Table 4.7. The Scale stage scales the matrix with five iterations, the k-out stage chooses the neighbors and constructs the k-out subgraph, the KS stage applies Karp-Sipser on the k-out subgraph, and KS+ is the stage for the additional Karp-Sipser at the end. The total time of these four stages is also shown for each instance. The quality results are consistent with the results obtained on the first dataset. As the table shows, the proposed heuristics are faster than the original Karp-Sipser on the input graph. For 1-out graphs, the proposed approach is 2.5-3.9 times faster than Karp-Sipser on the original graph. The speedups are in between 1.5-2.6 for 2-out graphs with five iterations.

4.4 Summary, further notes, and references

We investigated two heuristics for the bipartite maximum cardinality matching problem. The first one, OneSidedMatch, is shown to have an approximation guarantee of 1 − 1/e ≈ 0.632. The second heuristic, TwoSidedMatch, is shown to have an approximation guarantee of 0.866. Both algorithms use well-known methods to scale the sparse matrix associated with the given bipartite graph to a doubly stochastic form, whose entries are used as the probability density functions to randomly select a subset of the edges of the input graph. OneSidedMatch selects exactly n edges to construct a subgraph in which a matching of the guaranteed cardinality is identified with virtually no overhead, both in sequential and parallel execution. TwoSidedMatch selects around 2n edges and then runs the Karp-Sipser heuristic as an exact algorithm on the selected subgraph to obtain a matching of the conjectured cardinality. The subgraphs are analyzed to develop a specialized Karp-Sipser algorithm for efficient parallelization. All theoretical investigations are first performed assuming bipartite graphs with perfect matchings and that the scaling algorithms have converged. Then, theoretical arguments and experimental evidence are provided to extend the results to cover other cases and to validate the applicability and practicality of the proposed heuristics in general settings.

The proposed OneSidedMatch and TwoSidedMatch heuristics do not guarantee maximality of the matching found. One can visit the edges of the unmatched row vertices and match them to an available neighbor. We have carried out this greedy post-process and obtained the results shown in Figures 4.7a and 4.7b, in comparison with what was shown in Figures 4.2a and 4.2b. We see that the greedy post-process makes a significant difference for OneSidedMatch: the approximation ratios achieved are well beyond the bound 0.632 and are close to those of TwoSidedMatch. This observation attests to the efficiency of the whole approach.

We also extended the heuristics and their analysis for approximating the maximum cardinality matchings on general (undirected) graphs. Since we have, in the general case, one class of vertices, the 1-out based approach corresponds to the TwoSidedMatch heuristic in the bipartite case. Theoretical analysis, which can be perceived as an analysis of Karp-Sipser on the random 1-out graph model, showed that the approximation guarantee is slightly less than 0.866. The losses with respect to the bipartite case are due to the existence of odd-length cycles. Our experiments suggest that one of the heuristics obtains results as good as Karp-Sipser, while being faster and more reliable.

A more rigorous treatment and elaboration of the variants of the heuristics for general graphs seem worthwhile. Although Karp-Sipser works well for these graphs, we wonder if there are exact linear time algorithms for (1 + e^{−1})-out and 2-out graphs. We also plan to work on parallel implementations of these algorithms.


Figure 4.7: Matching qualities obtained after making the matchings found by (a) OneSidedMatch and (b) TwoSidedMatch maximal. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations (0, 1, and 5). Compare with Figures 4.2a and 4.2b, where maximality of the matchings was not guaranteed.


Chapter 5

Birkhoff-von Neumann decomposition of doubly stochastic matrices

In this chapter, we investigate the Birkhoff-von Neumann (BvN) decomposition described in Section 1.4. The main problem that we address is restated below for convenience.

MinBvNDec: A BvN decomposition with the minimum number of permutation matrices. Given a doubly stochastic matrix A, find a BvN decomposition

    A = α_1 P_1 + α_2 P_2 + ··· + α_k P_k                                  (5.1)

of A with the minimum number k of permutation matrices.

We first show that the MinBvNDec problem is NP-hard. We then propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries and use this generalization for designing preconditioners for iterative linear system solvers.

5.1 The minimum number of permutation matrices

We show in this section that the decision version of the problem is NP-complete. We first give some definitions and preliminary results.



5.1.1 Preliminaries

A multi-set can contain duplicate members. Two multi-sets are equivalent if they have the same set of members with the same number of repetitions.

Let A and B be two n × n matrices. We write A ⊆ B to denote that, for each nonzero entry of A, the corresponding entry of B is nonzero. In particular, if P is an n × n permutation matrix and A is a nonnegative n × n matrix, P ⊆ A also means that the entries of A at the positions corresponding to the nonzeros of P are positive. We use P ⊙ A to denote the entrywise product of P and A, which selects the entries of A at the positions corresponding to the nonzero entries of P. We also use min{P ⊙ A} to denote the minimum entry of A at the nonzero positions of P.
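A tiny numerical illustration of this notation (the values below are chosen here just for the example):

```python
import numpy as np

A = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.3, 0.2, 0.5]])   # a doubly stochastic matrix
P = np.eye(3)                     # the identity is a permutation matrix with P ⊆ A, since all a_ii > 0
PA = P * A                        # the entrywise product P ⊙ A
print(PA[P > 0].min())            # min{P ⊙ A} = 0.3, the weight a Birkhoff-type
                                  # heuristic would subtract if it picked this P
```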

Let U be a set of positions of the nonzeros of A. Then U is called strongly stable [31] if, for each permutation matrix P ⊆ A, p_{kl} = 1 for at most one pair (k, l) ∈ U.

Lemma 5.1 (Brualdi [32]). Let A be a doubly stochastic matrix. Then, in a BvN decomposition of A, there are at least γ(A) permutation matrices, where γ(A) is the maximum cardinality of a strongly stable set of positions of A.

Note that γ(A) is no smaller than the maximum number of nonzeros in a row or a column of A, for any matrix A. Brualdi [32] shows that for any integer t with 1 ≤ t ≤ ⌈n/2⌉⌈(n + 1)/2⌉, there exists an n × n doubly stochastic matrix A such that γ(A) = t.

An n × n circulant matrix C is defined as follows. The first row of C is specified as c_1, …, c_n, and the ith row is obtained from the (i − 1)th one by a cyclic rotation to the right, for i = 2, …, n:

    C = [ c_1  c_2  ⋯  c_n
          c_n  c_1  ⋯  c_{n−1}
           ⋮          ⋱  ⋮
          c_2  c_3  ⋯  c_1 ].

We state and prove an obvious lemma to be used later.

Lemma 5.2. Let C be an n × n positive circulant matrix whose first row is c_1, …, c_n. The matrix C′ = (1/Σ_j c_j) C is doubly stochastic, and all BvN decompositions with n permutation matrices have the same multi-set of coefficients {c_i/Σ_j c_j : i = 1, …, n}.

Proof. Since the first row of C′ has all nonzero entries and only n permutation matrices are permitted, the multi-set of coefficients must be the entries in the first row.


In the lemma, if c_i = c_j for some i ≠ j, we will have the same value c_i/c for two different permutation matrices. As a sample BvN decomposition, let c = Σ c_i and consider (1/c) C = Σ_j (c_j/c) D_j, where D_j is the permutation matrix corresponding to the (j − 1)th diagonal, for j = 1, …, n: the matrix D_j has 1s at the positions (i, i + j − 1) for i = 1, …, n − j + 1 and at the positions (n − j + 1 + k, k) for k = 1, …, j − 1, where we assume that the second set is void for j = 1.

Note that the lemma is not concerned with the uniqueness of the BvN decomposition, which would need a unique set of permutation matrices as well. If all c_i's were distinct, this would have been true, where the set of D_j's described above would define the unique permutation matrices. Also, a more general variant of the lemma concerns the decompositions whose cardinality k is equal to the maximum cardinality of a strongly stable set. In this case too, the coefficients in the decomposition will correspond to the entries in a strongly stable set of cardinality k.

5.1.2 The computational complexity of MinBvNDec

Here we prove that the MinBvNDec problem is NP-complete in the strong sense, i.e., it remains NP-complete when the instance is represented in unary. It suffices to do a reduction from a strongly NP-complete problem to prove the strong NP-completeness [22, Section 6.6].

Theorem 5.1. The problem of deciding whether there is a Birkhoff-von Neumann decomposition of a given doubly stochastic matrix with k permutation matrices is strongly NP-complete.

Proof. It is clear that the problem belongs to NP, as it is easy to check in polynomial time that a given decomposition is equal to a given matrix.

To establish NP-completeness, we demonstrate a reduction from the well-known 3-PARTITION problem, which is NP-complete in the strong sense [93, p. 96]. Consider an instance of 3-PARTITION: given an array A of 3m positive integers and a positive integer B such that Σ_{i=1}^{3m} a_i = mB, where for all i it holds that B/4 < a_i < B/2, does there exist a partition of A into m disjoint arrays S_1, …, S_m such that each S_i has three elements whose sum is B? Let I_1 denote an instance of 3-PARTITION.

We build the following instance I_2 of MinBvNDec corresponding to the I_1 given above. Let

    M = [ (1/m) E_m        O
              O       (1/(mB)) C ],

where E_m is an m × m matrix whose entries are all 1, and C is a 3m × 3m circulant matrix whose first row is a_1, …, a_{3m}. It is easy to see that M is doubly stochastic. A solution of I_2 is a BvN decomposition of M with k = 3m permutations. We will show that I_1 has a solution if and only if I_2 has a solution.


Assume that I_1 has a solution with S_1, …, S_m. Let S_i = {a_{i1}, a_{i2}, a_{i3}} and observe that (1/(mB))(a_{i1} + a_{i2} + a_{i3}) = 1/m. We identify three permutation matrices P_{i_d} for d = 1, 2, 3 in C which contain a_{i1}, a_{i2}, and a_{i3}, respectively. We can write (1/m) E_m = Σ_{i=1}^{m} (1/m) D_i, where D_i is the permutation matrix corresponding to the (i − 1)th diagonal (described after the proof of Lemma 5.2). We prepend D_i to the three permutation matrices P_{i_d} for d = 1, 2, 3 and obtain three permutation matrices for M. We associate these three permutation matrices with α_{i_d} = a_{i_d}/(mB). Therefore, we can write

    M = Σ_{i=1}^{m} Σ_{d=1}^{3} α_{i_d} [ D_i     O
                                           O    P_{i_d} ]

and obtain a BvN decomposition of M with 3m permutation matrices.

Assume that I_2 has a BvN decomposition with 3m permutation matrices. This also defines two BvN decompositions with 3m permutation matrices for (1/m) E_m and (1/(mB)) C. We now establish a correspondence between these two BvN decompositions to finish the proof. Since (1/(mB)) C is a circulant matrix with 3m nonzeros in a row, any BvN decomposition of it with 3m permutation matrices has the coefficients a_i/(mB) for i = 1, …, 3m, by Lemma 5.2. Since a_i + a_j < B, we have (a_i + a_j)/(mB) < 1/m. Therefore, each entry in (1/m) E_m needs to be included in at least three permutation matrices. A total of 3m permutation matrices covers any row of (1/m) E_m, say the first one. Therefore, for each entry in this row we have 1/m = α_i + α_j + α_k for i ≠ j ≠ k, corresponding to three coefficients used in the BvN decomposition of (1/(mB)) C. Note that these three indices i, j, k defining the three coefficients used for one entry of E_m cannot be used for another entry in the same row. This correspondence defines a partition of the 3m numbers α_i for i = 1, …, 3m into m groups with three elements each, where each group has a sum of 1/m. The corresponding three a_i's in a group therefore sum up to B, and we have a solution to I_1, concluding the proof.

5.2 A result on the polytope of BvN decompositions

Let S(A) be the polytope of all BvN decompositions of a given doubly stochastic matrix A. The extreme points of S(A) are the ones that cannot be represented as a convex combination of other decompositions. Brualdi [32, pp. 197-198] observes that any generalized Birkhoff heuristic obtains an extreme point of S(A) and predicts that there are extreme points of S(A) which cannot be obtained by a generalized Birkhoff heuristic. In this section, we substantiate this claim by showing an example.

Any BvN decomposition of a given matrix A with the smallest number of permutation matrices is an extreme point of S(A); otherwise, the other BvN decompositions expressing the said point would have a smaller number of permutation matrices.

Lemma 5.3. There are doubly stochastic matrices whose polytopes of BvN decompositions contain extreme points that cannot be obtained by a generalized Birkhoff heuristic.

We will prove the lemma by giving an example. We use computational tools based on a mixed integer linear programming (MILP) formulation of the problem of finding a BvN decomposition with the smallest number of permutation matrices. We first describe the MILP formulation.

Let A be a given n × n doubly stochastic matrix and Ω_n be the set of all n × n permutation matrices. There are n! matrices in Ω_n; for brevity, let us refer to these permutations by P_1, …, P_{n!}. We associate an incidence matrix M of size n² × n! with Ω_n. We fix an ordering of the entries of A so that each row of M corresponds to a unique entry in A. Each column of M corresponds to a unique permutation matrix in Ω_n. We set m_{ij} = 1 if the ith entry of A appears in the permutation matrix P_j, and set m_{ij} = 0 otherwise. Let a be the n²-vector containing the values of the entries of A in the fixed order. Let x be a vector of n! elements, x_j corresponding to the permutation matrix P_j. With these definitions, the MILP formulation for finding a BvN decomposition of A with the smallest number of permutation matrices can be written as

    minimize     Σ_{j=1}^{n!} s_j                                          (5.2)
    subject to   M x = a,                                                  (5.3)
                 1 ≥ x_j ≥ 0        for j = 1, …, n!,                      (5.4)
                 Σ_{j=1}^{n!} x_j = 1,                                     (5.5)
                 s_j ≥ x_j          for j = 1, …, n!,                      (5.6)
                 s_j ∈ {0, 1}       for j = 1, …, n!.                      (5.7)

In this MILP, s_j is a binary variable which is 1 only if x_j > 0, and 0 otherwise. The equality (5.3), the inequalities (5.4), and the equality (5.5) guarantee that we have a BvN decomposition of A. This MILP can be used only for small problems. In this MILP, we can exclude any permutation matrix P_j from a decomposition by setting s_j = 0.
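For very small matrices, the formulation can be written down directly by enumerating all n! permutations. The sketch below uses the PuLP modeling library (an assumption made for illustration; the text solved the MILP with CPLEX via the NEOS Server) and expects a doubly stochastic A given as a list of lists.

```python
from itertools import permutations
import pulp

def min_bvn_decomposition(A):
    n = len(A)
    perms = list(permutations(range(n)))             # all n! permutations: tiny n only
    prob = pulp.LpProblem("MinBvNDec", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x{j}", lowBound=0, upBound=1) for j in range(len(perms))]   # (5.4)
    s = [pulp.LpVariable(f"s{j}", cat="Binary") for j in range(len(perms))]            # (5.7)
    prob += pulp.lpSum(s)                                                              # (5.2)
    for i in range(n):                                # (5.3): M x = a, one constraint per entry of A
        for k in range(n):
            prob += pulp.lpSum(x[j] for j, p in enumerate(perms) if p[i] == k) == A[i][k]
    prob += pulp.lpSum(x) == 1                                                         # (5.5)
    for j in range(len(perms)):
        prob += s[j] >= x[j]                                                           # (5.6)
    prob.solve()
    return [(pulp.value(x[j]), perms[j]) for j in range(len(perms))
            if pulp.value(x[j]) and pulp.value(x[j]) > 1e-9]
```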

Proof of Lemma 5.3. Let the following 10 letters correspond to the numbers underneath:

    a   b   c   d    e    f    g     h     i     j
    1   2   4   8   16   32   64   128   256   512


Consider the following matrix, whose row and column sums are 1023, hence it can be considered as doubly stochastic:

    A = [ a+b  d+i  c+h  e+j  f+g
          e+g  a+c  b+i  d+f  h+j
          f+j  e+h  d+g  b+c  a+i
          d+h  b+f  a+j  g+i  c+e
          c+i  g+j  e+f  a+h  b+d ].                                       (5.8)

Observe that the entries containing the term 2^ℓ form a permutation matrix, for ℓ = 0, …, 9. Therefore, this matrix has a BvN decomposition with 10 permutation matrices. We created the MILP above and found 10 as the smallest number of permutation matrices by calling the CPLEX solver [1] via the NEOS Server [58, 68, 100]. Hence, the described decomposition is an extreme point of S(A). None of the said permutation matrices annihilates any entry of the matrix. Therefore, at the first step, no entry of A gets reduced to zero, regardless of the order of the permutation matrices. Thus, this decomposition cannot be obtained by a generalized Birkhoff heuristic.

One can create a family of matrices of arbitrarily large size by embedding the same matrix into a larger one of the form

    B = [ 1023·I   O
             O     A ],

where I is the identity matrix of the desired size. All BvN decompositions of B can be obtained by extending the permutation matrices in A's BvN decompositions with I. That is, for a permutation matrix P in a BvN decomposition of A,

    [ I  O
      O  P ]

is a permutation matrix in the corresponding BvN decomposition of B. Furthermore, all permutation matrices in an arbitrary BvN decomposition of B must have I as the principal sub-matrix, and the rest should correspond to a permutation matrix in A, defining a BvN decomposition of A. Hence, the extreme point of S(B) corresponding to the extreme point of S(A) with 10 permutation matrices cannot be found by a generalized Birkhoff heuristic.

Let us investigate the matrix A and its BvN decomposition given in the proof of Lemma 5.3. Let P_a, P_b, …, P_j be the 10 permutation matrices of the decomposition, corresponding to a, b, …, j. We solve 10 MILPs in which we set s_t = 0 for one t ∈ {a, b, …, j}. This way, we try to find a BvN decomposition of A without P_a, without P_b, and so on, always with the smallest number of permutation matrices. The smallest number of permutation matrices in these 10 MILPs was 11. This certifies that the only BvN decomposition with 10 permutation matrices necessarily contains P_a, P_b, …, P_j. It is easy to see that there is a unique solution to the equality M x = a of


(a) The sample matrix:

    A = [   3  264  132  528   96
           80    5  258   40  640
          544  144   72    6  257
          136   34  513  320   20
          260  576   48  129   10 ]

(b) A BvN decomposition:

    coefficient:  129  511  257   63   33   15    7    3    2    2    1
    row 1:          3    4    2    5    5    4    2    3    4    1    1
    row 2:          5    5    3    1    4    1    4    2    1    2    3
    row 3:          2    1    5    3    1    2    3    4    3    4    4
    row 4:          1    3    4    4    2    5    1    5    5    3    2
    row 5:          4    2    1    2    3    3    5    1    2    5    5

Figure 5.1: The matrix A from Lemma 5.3 and a BvN decomposition with 11 permutation matrices, which can be obtained by a generalized Birkhoff heuristic. Each column in (b) corresponds to a permutation, where the first line gives the associated coefficient and the following lines give the column indices matched to rows 1 to 5 of A.

the MILP with x_t = 0 for t ∉ {a, b, …, j}, as the submatrix of M containing only the corresponding 10 columns has full column rank.

Any generalized Birkhoff heuristic obtains at least 11 permutation matrices for the matrix A of the proof of Lemma 5.3. One such decomposition is shown in tabular format in Fig. 5.1. In Fig. 5.1a, we write the matrix of the lemma explicitly for convenience. Then, in Fig. 5.1b, we give a BvN decomposition. The column headers (the first line) in the table contain the coefficients of the permutation matrices. The nonzero column indices of the permutation matrices are stored by rows. For example, the first permutation has the coefficient 129, and columns 3, 5, 2, 1, and 4 are matched to rows 1-5. The bold indices signify entries whose values are equal to the coefficients of the corresponding permutation matrices at the time the permutation matrices are found. Therefore, the corresponding entries become zero after the corresponding step. For example, in the first permutation, a_{54} = 129.

The output of GreedyBvN for the matrix A is given in Fig. 5.2 for reference. It contains 12 permutation matrices.

5.3 Two heuristics

There are heuristics to compute a BvN decomposition for a given matrix A. In particular, the following family of heuristics is based on the constructive proof of Birkhoff. Let A^(0) = A. At every step j ≥ 1, find a permutation matrix P_j having its ones at the positions of the nonzero elements of A^(j−1), use the minimum nonzero element of A^(j−1) at the positions identified by P_j as α_j, set A^(j) = A^(j−1) − α_j P_j, and repeat the computations in the next step j + 1, until A^(j) becomes void. Any heuristic of this type is called a generalized Birkhoff heuristic.


A = 513·P_1 + 257·P_2 + 127·P_3 + 63·P_4 + 31·P_5 + 15·P_6 + 7·P_7 + 3·P_8 + 2·P_9 + 2·P_10 + 2·P_11 + 1·P_12, where the permutation matrices, given (as in Fig. 5.1b) by the column index matched to each of rows 1-5, are:

    coefficient:  513  257  127   63   31   15    7    3    2    2    2    1
    row 1:          4    2    3    5    5    4    2    3    5    3    1    1
    row 2:          5    3    5    1    4    1    4    2    4    1    2    3
    row 3:          1    5    2    3    1    2    3    4    2    4    3    4
    row 4:          3    4    1    4    2    5    1    5    1    2    5    2
    row 5:          2    1    4    2    3    3    5    1    3    5    4    5

Figure 5.2: The output of GreedyBvN for the matrix given in the proof of Lemma 5.3.

Given an n times n matrix A we can associate a bipartite graph GA =(RcupCE) to it The rows of A correspond to the vertices in the set R andthe columns of A correspond to the vertices in the set C so that (ri cj) isin Eiff aij 6= 0 A perfect matching in GA or G in short is a set of n edges notwo sharing a row or a column vertex Therefore a perfect matching M inG defines a unique permutation matrix PM sube A

Birkhoff's heuristic: This is the original heuristic used in proving that a doubly stochastic matrix has a BvN decomposition, described, for example, by Brualdi [32]. Let a be the smallest nonzero of A^(j−1) and G^(j−1) be the bipartite graph associated with A^(j−1). Find a perfect matching M in G^(j−1) containing a, set α_j ← a and P_j ← P_M.

GreedyBvN heuristic: At every step j, among all perfect matchings in G^(j−1), find one whose minimum element is the maximum. That is, find a perfect matching M in G^(j−1) where min{P_M ⊙ A^(j−1)} is the maximum. This "bottleneck" matching problem is polynomial time solvable, with, for example, MC64 [72]. In this greedy approach, α_j is the largest amount we can subtract from a row and a column of A^(j−1), and we hope to obtain a small k.

Algorithm 11: GreedyBvN: A greedy heuristic for constructing a BvN decomposition

Data: A, a doubly stochastic matrix
Result: a BvN decomposition

1: k ← 0
2: while nnz(A) > 0 do
3:     k ← k + 1
4:     P_k ← the pattern of a bottleneck perfect matching M in A
5:     α_k ← min{P_k ⊙ A}
6:     A ← A − α_k P_k
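A compact Python sketch of this loop is given below for illustration. It computes the bottleneck matching by a binary search over the distinct values of A combined with Kuhn's augmenting-path test for a perfect matching; this replaces the MC64 routine mentioned above and is only meant to show the logic, not to be efficient.

```python
import numpy as np

def bottleneck_matching(A):
    # Largest value t such that the bipartite graph of entries >= t has a perfect
    # matching; returns match_col with match_col[j] = row matched to column j.
    vals = np.unique(A[A > 0])

    def perfect_matching(mask):
        n = mask.shape[0]
        match_col = [-1] * n

        def augment(i, seen):
            for j in range(n):
                if mask[i, j] and not seen[j]:
                    seen[j] = True
                    if match_col[j] == -1 or augment(match_col[j], seen):
                        match_col[j] = i
                        return True
            return False

        for i in range(n):
            if not augment(i, [False] * n):
                return None
        return match_col

    lo, hi, best = 0, len(vals) - 1, None
    while lo <= hi:                              # binary search on the bottleneck value
        mid = (lo + hi) // 2
        m = perfect_matching(A >= vals[mid])
        if m is not None:
            best, lo = m, mid + 1
        else:
            hi = mid - 1
    return best

def greedy_bvn(A, tol=1e-12):
    # Peel off bottleneck matchings until A vanishes (the loop of Algorithm 11).
    A = np.array(A, dtype=float)
    decomposition = []
    while A.max() > tol:
        match_col = bottleneck_matching(A)
        P = np.zeros_like(A)
        for j, i in enumerate(match_col):
            P[i, j] = 1.0
        alpha = A[P > 0].min()
        decomposition.append((alpha, P))
        A = A - alpha * P
        A[np.abs(A) < tol] = 0.0
    return decomposition
```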


A^(0) =            A^(1) =            A^(2) =
 1 4 1 0 0 0        0 4 1 0 0 0        0 4 0 0 0 0
 0 1 4 1 0 0        0 1 3 1 0 0        0 0 3 1 0 0
 0 0 1 4 1 0        0 0 1 3 1 0        0 0 1 2 1 0
 0 0 0 1 4 1        0 0 0 1 3 1        0 0 0 1 2 1
 1 0 0 0 1 4        1 0 0 0 1 3        1 0 0 0 1 2
 4 1 0 0 0 1        4 0 0 0 0 1        3 0 0 0 0 1

A^(3) =            A^(4) =            A^(5) =
 0 3 0 0 0 0        0 2 0 0 0 0        0 1 0 0 0 0
 0 0 3 0 0 0        0 0 2 0 0 0        0 0 1 0 0 0
 0 0 0 2 1 0        0 0 0 2 0 0        0 0 0 1 0 0
 0 0 0 1 1 1        0 0 0 0 1 1        0 0 0 0 1 0
 1 0 0 0 1 1        1 0 0 0 1 0        1 0 0 0 0 0
 2 0 0 0 0 1        1 0 0 0 0 1        0 0 0 0 0 1

Figure 5.3: A sample matrix to show that the original Birkhoff heuristic can obtain a BvN decomposition with n permutation matrices while the optimum one has 3.

Analysis of the two heuristics

As we will see in Section 5.4, GreedyBvN can obtain a much smaller number of permutation matrices than the Birkhoff heuristic. Here, we compare these two heuristics theoretically. First, we show that the original Birkhoff heuristic does not have any constant ratio approximation guarantee. Furthermore, for an n × n matrix, its worst-case approximation ratio is Ω(n).

We begin with a small example, shown in Fig. 5.3. We decompose a 6 × 6 matrix A^(0), which has an optimal BvN decomposition with three permutation matrices: the main diagonal, the one containing the entries equal to 4, and the one containing the remaining entries. For simplicity, we used integer values in our example. However, since the row and column sums of A^(0) are equal to 6, it can be converted to a doubly stochastic matrix by dividing all the entries by 6. Instead of the optimal decomposition, in the figure we obtain the permutation matrices as the original Birkhoff heuristic does. Each entry set subtracted at a step is a permutation and contains the minimum possible value, 1.

In the following, we show how to generalize the idea to obtain matrices of arbitrarily large size with three permutation matrices in an optimal decomposition, while the original Birkhoff heuristic obtains n permutation matrices.

Lemma 5.4. The worst-case approximation ratio of the Birkhoff heuristic is Ω(n).

Proof. For any given integer n ≥ 3, we show that there is a matrix of size n × n whose optimal BvN decomposition has 3 permutation matrices, whereas the Birkhoff heuristic obtains a BvN decomposition with exactly n permutation matrices. The example in Fig. 5.3 is a special case, for n = 6, of the following construction process.

Let f : {1, 2, …, n} → {1, 2, …, n} be the function f(x) = (x mod n) + 1. Given a matrix M, let M′ = F(M) be another matrix containing the same set of entries, where the function f(·) is used on the coordinate indices to redistribute the entries of M on M′. That is, m_{ij} = m′_{f(i)f(j)}. Since f(·) is one-to-one and onto, if M is a permutation matrix, then F(M) is also a permutation matrix. We will start with a permutation matrix and run it through F for n − 1 times to obtain n permutation matrices, which are all different. By adding these permutation matrices, we will obtain a matrix A whose optimal BvN decomposition has three permutation matrices, while the n permutation matrices used to create A correspond to a decomposition that can be obtained by the Birkhoff heuristic.

Let P_1 be the permutation matrix whose ones, which are partitioned into three sets, are at the positions

    1st set: (1, 1);   2nd set: (n, 2);   3rd set: (2, 3), (3, 4), …, (n − 1, n).          (5.9)

Let us use F(·) to generate a matrix sequence P_i = F(P_{i−1}) for 2 ≤ i ≤ n. For example, P_2's nonzeros are at the positions

    1st set: (2, 2);   2nd set: (1, 3);   3rd set: (3, 4), (4, 5), …, (n, 1).

We then add the P_i's to build the matrix

    A = P_1 + P_2 + ··· + P_n.

We have the following observations about the nonzero elements of A:

1. a_{ii} = 1 for all i = 1, …, n, and only P_i has a one at the position (i, i). These elements are from the first set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(1).

2. a_{ij} = 1 for all i = 1, …, n and j = ((i + 1) mod n) + 1, and only P_h, where h = (i mod n) + 1, has a one at the position (i, j). These elements are from the second set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(2).


3. a_{ij} = n − 2 for all i = 1, …, n and j = (i mod n) + 1, where all P_ℓ for ℓ ∈ {1, …, n} \ {i, j} have a one at the position (i, j). These elements are from the third set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(3), multiplied by the scalar (n − 2).

In other words, we can write

    A = P^(1) + P^(2) + (n − 2) · P^(3)

and see that A has a BvN decomposition with three permutation matrices. We note that each row and column of A contains three nonzeros, and hence three is the smallest number of permutation matrices in a BvN decomposition of A.

Since the minimum element in A is 1 and each P_i contains one such element, the Birkhoff heuristic can obtain a decomposition using P_i for i = 1, …, n. Therefore, the Birkhoff heuristic's approximation is no better than n/3, which can be made arbitrarily large.
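The construction in the proof is easy to reproduce numerically. The following sketch (an illustration; 0-based indices are used internally) builds A = P_1 + ··· + P_n and, for n = 6, reproduces the matrix A^(0) of Figure 5.3.

```python
import numpy as np

def hard_instance_for_birkhoff(n):
    f = lambda x: (x % n) + 1                    # the cyclic map f on 1-based indices
    P = np.zeros((n, n), dtype=int)
    P[0, 0] = 1                                  # 1st set: (1, 1)
    P[n - 1, 1] = 1                              # 2nd set: (n, 2)
    for i in range(2, n):                        # 3rd set: (2,3), (3,4), ..., (n-1, n)
        P[i - 1, i] = 1
    A = np.zeros((n, n), dtype=int)
    for _ in range(n):                           # A = P_1 + F(P_1) + ... + F^{n-1}(P_1)
        A += P
        Q = np.zeros_like(P)
        for r, c in zip(*np.nonzero(P)):         # entry (i, j) moves to (f(i), f(j))
            Q[f(r + 1) - 1, f(c + 1) - 1] = 1
        P = Q
    return A

A = hard_instance_for_birkhoff(6)
assert (A.sum(axis=0) == 6).all() and (A.sum(axis=1) == 6).all()
print(A)   # equals A^(0) of Figure 5.3
```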

We note that GreedyBvN will optimally decompose the matrix A used in the proof above.

We now analyze the theoretical properties of GreedyBvN. We first give a lemma identifying a pattern in its output.

Lemma 5.5. The heuristic GreedyBvN obtains α_1, …, α_k in non-increasing order, α_1 ≥ ··· ≥ α_k, where α_j is obtained at the jth step.

We observed through a series of experiments that the performance of GreedyBvN depends on the values in the matrix, and there seems to be no constant factor approximation [74].

We create a set of n × n matrices. To do that, we first fix a set of z permutation matrices C_1, …, C_z of size n × n. These permutation matrices, with varying values of α, will be used to generate the matrices. The matrices are parameterized by the subscript i, and each A_i is created as follows: A_i = α_1 · C_1 + α_2 · C_2 + ··· + α_z · C_z, where each α_j for j = 1, …, z is a randomly chosen integer in the range [1, 2^i], and we also set a randomly chosen α_j equal to 2^i to guarantee the existence of at least one large value, even in the unlikely case that all other values are not large enough. As can be seen, each A_i has the same structure and differs from the rest only in the values of the α_j's that are chosen. As a consequence, they all can be decomposed by the same set of permutation matrices.

We present our results for two sets of experiments in Table 5.1, for two different n. We create five random A_i for each i ∈ {10, 20, 30, 40, 50}; that is, we have five matrices with the parameter i, and there are five different i. We have n = 30 and z = 20 in the first set, and n = 200 and z = 100 in the second set. Let k_i be the number of permutation matrices GreedyBvN


n = 30 and z = 20
            average           worst case
   i      k_i    k_i/z      k_i    k_i/z
  10       59     2.99       63     3.15
  20      105     5.29      110     5.50
  30      149     7.46      158     7.90
  40      184     9.23      191     9.55
  50      212    10.62      227    11.35

n = 200 and z = 100
            average           worst case
   i      k_i    k_i/z      k_i    k_i/z
  10      268     2.69      280     2.80
  20      487     4.88      499     4.99
  30      716     7.16      726     7.26
  40      932     9.33      947     9.47
  50     1124    11.25     1162    11.62

Table 5.1: Experiments showing the dependence of the performance of GreedyBvN on the values of the matrix elements: n is the matrix size, i ∈ {10, 20, 30, 40, 50} is the parameter for creating matrices using α_j ∈ [1, 2^i], and z is the number of permutation matrices used in creating A_i. Five experiments are performed for a given pair of n and i, and GreedyBvN obtains k_i permutation matrices. The average and the maximum number of permutation matrices obtained by GreedyBvN over the five random instances are given; k_i/z is a lower bound on the performance of GreedyBvN, as z ≥ Opt.

obtains for a given $A_i$. The table gives the average and the maximum $k_i$ of the five different $A_i$ for a given $n$ and $i$ pair. By construction, each $A_i$ has a BvN decomposition with $z$ permutation matrices. Since $z$ is no smaller than the optimal value, the ratio $k_i/z$ gives a lower bound on the performance of GreedyBvN. As seen from the experiments, as $i$ increases, the performance of GreedyBvN gets increasingly worse. This shows that a constant-ratio worst-case approximation of GreedyBvN is unlikely. While the performance depends on $z$ (for example, for small $z$, GreedyBvN is likely to obtain near-optimal decompositions), it seems that the size of the matrix does not largely affect the relative performance of GreedyBvN.

Now we attempt to explain the above results theoretically

Lemma 5.6. Let $\alpha_1 P_1 + \cdots + \alpha_k P_k$ be a BvN decomposition of a given doubly stochastic matrix $A$ with the smallest number $k$ of permutation matrices. Then, for any BvN decomposition of $A$ with $\ell \ge k$ permutation matrices and coefficients $\alpha'_1, \dots, \alpha'_\ell$, we have $\ell \le k \cdot \frac{\max_i \alpha_i}{\min_i \alpha'_i}$. If the coefficients are integers (e.g., when $A$ is a matrix with constant row and column sums of integral values), we have $\ell \le k \cdot \max_i \alpha_i$.

Proof. Consider a BvN decomposition $\alpha'_1 P'_1 + \cdots + \alpha'_\ell P'_\ell$ with $\ell \ge k$. Assume, without loss of generality, that $\alpha_1 \ge \cdots \ge \alpha_k$ and $\alpha'_1 \ge \cdots \ge \alpha'_\ell$. We know that the coefficients of these two decompositions sum up to the same value, that is,

$\sum_{i=1}^{\ell} \alpha'_i = \sum_{i=1}^{k} \alpha_i$ .

Since $\alpha'_\ell$ is the smallest of the $\alpha'_i$ and $\alpha_1$ is the largest of the $\alpha_i$, we have

$\ell \cdot \alpha'_\ell \le k \cdot \alpha_1$ ,

and hence

$\frac{\ell}{k} \le \frac{\alpha_1}{\alpha'_\ell}$ .

By assuming integer values, we see that $\alpha'_\ell \ge 1$, and thus

$\ell \le k \cdot \max_i \alpha_i$ .

This lemma evaluates the approximation guarantee of a given BvN decomposition. It does not seem very useful, because even if we have $\min_i \alpha'_i$, we do not have $\max_i \alpha_i$. Luckily, we can say more in the case of GreedyBvN.

Corollary 5.1. Let $k$ be the smallest number of permutation matrices in a BvN decomposition of a given doubly stochastic matrix $A$. Let $\alpha_1$ and $\alpha_\ell$ be the first and last coefficients obtained by the heuristic GreedyBvN for decomposing $A$. Then $\ell \le k \cdot \frac{\alpha_1}{\alpha_\ell}$.

Proof. This is easy to see, as GreedyBvN obtains the coefficients in a non-increasing order (see Lemma 5.5), and $\alpha_1 \ge \alpha_j$ for all $1 \le j \le k$ for any BvN decomposition containing $\alpha_j$.

Lemma 5.6 and Corollary 5.1 give a posteriori estimates of the performance of the heuristic GreedyBvN, in that one looks at the decomposition and tells how good it is. This can potentially reveal a good performance. For example, when GreedyBvN obtains a BvN decomposition with all coefficients equal, then we know that it is optimal. The same cannot be said for the Birkhoff heuristic, though (consider the example preceding Lemma 5.4). We also note that the ratio given in Corollary 5.1 should usually be much larger than the practical performance.

5.4 Experiments

We present results with the two heuristics. We note that the original Birkhoff heuristic was not concerned with the minimality of the number of permutation matrices. Therefore, the presented results are not for comparing the two heuristics, but for giving results with what is available. We give results on two different sets of matrices. The first set contains real-world sparse matrices preprocessed to be doubly stochastic. The second set contains a few randomly created dense doubly stochastic matrices.


                                               Birkhoff          GreedyBvN
matrix          n        τ    dmax      dev     ∑αi      k       ∑αi      k
aft01        8205   125567     21   0.0e+00   0.160   2000     1.000    120
aft02        8184   127762     82   1.0e-06   0.000   1434     1.000    234
barth        6691    46187     13   0.0e+00   0.160   2000     1.000     71
barth4       6019    40965     13   0.0e+00   0.140   2000     1.000     61
bcspwr10     5300    21842     14   1.0e-06   0.380   2000     1.000     63
bcsstk38     8032   355460    614   0.0e+00   0.000   2000     1.000    592*
benzene      8219   242669     37   0.0e+00   0.000   2000     1.000    113
c-29         5033    43731    481   1.0e-06   0.000   2000     1.000    870
EX5          6545   295680     48   0.0e+00   0.020   2000     1.000    229
EX6          6545   295680     48   1.0e-06   0.030   2000     1.000    226
flowmeter0   9669    67391     11   0.0e+00   0.510   2000     1.000     58
fv1          9604    85264      9   0.0e+00   0.620   2000     1.000     50
fv2          9801    87025      9   0.0e+00   0.620   2000     1.000     52
fxm3_6       5026    94026    129   1.0e-06   0.130   2000     1.000    383
g3rmt3m3     5357   207695     48   1.0e-06   0.050   2000     1.000    223
Kuu          7102   340200     98   0.0e+00   0.000   2000     1.000    330
mplate       5962   142190     36   0.0e+00   0.030   2000     1.000    153
n3c6-b7      6435    51480      8   0.0e+00   1.000      8     1.000      8
nemeth02     9506   394808     52   0.0e+00   0.000   2000     1.000    109
nemeth03     9506   394808     52   0.0e+00   0.000   2000     1.000    115
olm5000      5000    19996      6   1.0e-06   0.750    283     1.000     14
s1rmq4m1     5489   262411     54   0.0e+00   0.000   2000     1.000    211
s2rmq4m1     5489   263351     54   0.0e+00   0.000   2000     1.000    208
SiH4         5041   171903    205   0.0e+00   0.000   2000     1.000    574
t2d_q4       9801    87025      9   2.0e-06   0.500   2000     1.000     54

Table 5.2: Birkhoff's heuristic and GreedyBvN on sparse matrices from the UFL collection. The column τ contains the number of nonzeros in a matrix. The column dmax contains the maximum number of nonzeros in a row or a column, setting up a lower bound for the number k of permutation matrices in a BvN decomposition. The column "dev" contains the maximum deviation of a row/column sum of a matrix A from 1, in other words the value $\max\{\|A\mathbf{1} - \mathbf{1}\|_\infty, \|\mathbf{1}^T A - \mathbf{1}^T\|_\infty\}$, reported to six significant digits. The two heuristics are run to obtain at most 2000 permutation matrices or until they accumulate a sum of at least 0.9999 with the coefficients. In one matrix (marked with *), GreedyBvN reached this sum in fewer than dmax permutation matrices; increasing the limit to 0.999999 made it return with 908 permutation matrices.


The first set of matrices was created as follows. We selected all matrices with the following properties from the SuiteSparse Matrix Collection (UFL) [59]: square, with at least 5000 and at most 10000 rows, fully indecomposable, and with at most 50 and at least 2 nonzeros per row. This gave a set of 66 matrices. These 66 matrices are from 30 different groups of problems; we chose at most two per group to remove any bias that might arise from the group. This resulted in 32 matrices, which we preprocessed as follows to obtain a set of doubly stochastic matrices. We first took the absolute values of the entries to make the matrices nonnegative. Then we scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of $10^{-6}$ and at most 1000 iterations. With this setting, the scaling algorithm tries to get the maximum deviation of a row/column sum from 1 to be less than $10^{-6}$ within 1000 iterations. In 7 matrices, the deviation was larger than $10^{-4}$. We deemed this too big a deviation from a doubly stochastic matrix and discarded those matrices. At the end, we had the 25 matrices given in Table 5.2.
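For illustration, a minimal alternating row/column scaling sketch in the Sinkhorn-Knopp style is given below; it is not the algorithm of Knight et al. [141] used in the experiments, which is a more sophisticated, faster-converging method.

import numpy as np

def scale_doubly_stochastic(B, tol=1e-6, max_iters=1000):
    """Scale a nonnegative matrix with total support to be (approximately)
    doubly stochastic by alternating row and column normalizations."""
    A = np.abs(np.array(B, dtype=float))
    n = A.shape[0]
    r = np.ones(n)
    S = A
    for _ in range(max_iters):
        c = 1.0 / (A.T @ r)                 # fix the column sums
        r = 1.0 / (A @ c)                   # fix the row sums
        S = (A * r[:, None]) * c[None, :]
        dev = max(np.abs(S.sum(axis=0) - 1).max(),
                  np.abs(S.sum(axis=1) - 1).max())
        if dev < tol:
            break
    return S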

We run the two heuristics for the BvN decomposition to obtain at most 2000 permutation matrices or until they accumulate a sum of at least 0.9999 with the coefficients. The accumulated sum of coefficients $\sum_{i=1}^{k} \alpha_i$ and the number of permutation matrices $k$ for the Birkhoff and GreedyBvN heuristics are given in Table 5.2. As seen in this table, GreedyBvN obtains a much smaller number of permutation matrices than Birkhoff's heuristic, except for the matrix n3c6-b7. This is a special matrix with eight nonzeros in each row and column, all nonzeros being 1/8; both heuristics find the same (minimum) number of permutation matrices. We note that the geometric mean of the ratios $k/d_{\max}$ is 3.4 for GreedyBvN.

The second set of matrices was created as follows. For $n \in \{100, 200, 300\}$, we created five matrices of size $n \times n$ whose entries are randomly chosen integers between 1 and 100 (we performed tests with integers between 1 and 20, and the results were close to what we present here). We then scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of $10^{-6}$ and at most 1000 iterations. We run the two heuristics for the BvN decomposition such that at most $n^2 - 2n + 2$ permutation matrices are found, or until the total value of the coefficients is larger than 0.9999. We then took the average over the five instances with the same $n$. The results are in Table 5.3. In this table, again, we see that GreedyBvN obtains a much smaller $k$ than Birkhoff's heuristic.

The experimental observation that GreedyBvN performs better than the Birkhoff heuristic is not surprising, for two reasons. First, as stated before, the original Birkhoff heuristic is not concerned with the number of permutation matrices. Second, the heuristic GreedyBvN is an adaptation of the Birkhoff heuristic designed to achieve large reductions at each step.


           Birkhoff           GreedyBvN
  n      ∑αi       k        ∑αi       k
 100     0.99    9644       1.00     388
 200     0.99   39208       1.00     717
 300     1.00   88759       1.00    1042

Table 5.3: Birkhoff's heuristic and GreedyBvN on random dense matrices. The maximum deviation of a row/column sum of a matrix A from 1, that is, $\max\{\|A\mathbf{1} - \mathbf{1}\|_\infty, \|\mathbf{1}^T A - \mathbf{1}^T\|_\infty\}$, was always less than $10^{-6}$. For each $n$, the results are the averages of five different instances.

5.5 A generalization of the Birkhoff-von Neumann theorem

We have exploited the BvN decomposition in the context of preconditioning for solving linear systems [23]. As the decomposition is defined for nonnegative matrices, we needed a generalization of it to cover all real matrices. To motivate the generalized decomposition theorem, we give a brief summary of the preconditioner.

5.5.1 Doubly stochastic splitting

For a given nonnegative matrix A, we first preprocess it to get a doubly stochastic matrix (whose row and column sums are one). Then, using this doubly stochastic matrix, we select some fraction of some of the nonzeros of A to be included in the preconditioner. The selection is done by taking a set of permutation matrices along with their coefficients in a BvN decomposition of the given matrix.

Assume we have a BvN decomposition of $A$ as shown in (5.1). Then, one can pick an integer $r$ between 1 and $k - 1$ and split $A$ as

$A = M - N$ ,   (5.10)

where

$M = \alpha_1 P_1 + \cdots + \alpha_r P_r$ ,  $N = -\alpha_{r+1} P_{r+1} - \cdots - \alpha_k P_k$ .   (5.11)

Note that $M$ and $-N$ are doubly substochastic matrices.

Definition 5.1. A splitting of the form (5.10), with $M$ and $N$ given by (5.11), is said to be a doubly substochastic splitting.

Definition 5.2. A doubly substochastic splitting $A = M - N$ of a doubly stochastic matrix $A$ is said to be standard if $M$ is invertible. We will call such a splitting an SDS splitting.


In general, it is not easy to guarantee that a given doubly substochastic splitting is standard, except for some trivial situations, such as the case $r = 1$, in which $M$ is always invertible. We also have a characterization of invertible $M$ when $r = 2$.

Theorem 5.2. A sufficient condition for $M = \sum_{i=1}^{r} \alpha_i P_i$ to be invertible is that one of the $\alpha_i$ with $1 \le i \le r$ be greater than the sum of the remaining ones.

Let $A = M - N$ be an SDS splitting of $A$, and consider the stationary iterative scheme

$x_{k+1} = H x_k + c$ ,  $H = M^{-1} N$ ,  $c = M^{-1} b$ ,   (5.12)

where $k = 0, 1, \dots$ and $x_0$ is arbitrary. As seen in (5.12), $M^{-1}$ or solvers for $M y = z$ are required. As is well known, the scheme (5.12) converges to the solution of $Ax = b$ for any $x_0$ if and only if $\rho(H) < 1$. Hence, we are interested in conditions that guarantee that the spectral radius of the iteration matrix

$H = M^{-1} N = -(\alpha_1 P_1 + \cdots + \alpha_r P_r)^{-1} (\alpha_{r+1} P_{r+1} + \cdots + \alpha_k P_k)$

is strictly less than one. In general, this problem appears to be difficult. We have a necessary condition (Theorem 5.3) and a sufficient condition (Theorem 5.4), which is simple but restrictive.

Theorem 5.3. For the splitting $A = M - N$, with $M = \sum_{i=1}^{r} \alpha_i P_i$ and $N = -\sum_{i=r+1}^{k} \alpha_i P_i$, to be convergent, it must hold that $\sum_{i=1}^{r} \alpha_i > \sum_{i=r+1}^{k} \alpha_i$.

Theorem 5.4. Suppose that one of the $\alpha_i$ appearing in $M$ is greater than the sum of all the other $\alpha_i$. Then $\rho(M^{-1}N) < 1$, and the stationary iterative method (5.12) converges for all $x_0$ to the unique solution of $Ax = b$.

We discuss how to build preconditioners meeting the sufficiency condition. Since M itself is doubly stochastic (up to a scalar ratio), we can apply the splitting recursively on M and obtain a special solver. Our motivation is that the preconditioner $M^{-1}$ can be applied to vectors via a number of highly concurrent steps, where the number of steps is controlled by the user. Therefore, the preconditioners (or the splittings) can be advantageous for use in many-core computing systems. In the context of splittings, the application of N to vectors also enjoys the same property. These motivations are shared by recent work on ILU preconditioners, where fine-grained computation [48] and approximate application [12] are investigated for GPU-like systems.
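A minimal sketch of how such a splitting can be applied is given below, assuming the permutation matrices are stored as index arrays and that the dominance condition of Theorem 5.4 holds; it is not the preconditioner code of [23]. Each matrix-vector product with M or N reduces to a few independent permute-and-scale steps.

import numpy as np

def apply_perm_sum(alphas, perms, x):
    """Compute (sum_j alphas[j] * P_j) @ x, where P_j has its ones at (i, perms[j][i])."""
    y = np.zeros_like(x, dtype=float)
    for alpha, p in zip(alphas, perms):
        y += alpha * x[np.asarray(p)]
    return y

def solve_with_M(alphas, perms, z, inner_iters=20):
    """Approximately solve M y = z for M = sum_j alphas[j] * P_j, assuming
    alphas[0] > sum(alphas[1:]) so that the inner splitting iteration converges."""
    p0 = np.asarray(perms[0])
    inv_p0 = np.argsort(p0)                    # applies P_1^{-1}
    y = np.zeros_like(z, dtype=float)
    for _ in range(inner_iters):
        residual = z - apply_perm_sum(alphas[1:], perms[1:], y)
        y = residual[inv_p0] / alphas[0]       # solve alpha_1 * P_1 y = residual
    return y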


Preconditioner construction

It is desirable to have a small number $k$ in the Birkhoff-von Neumann decomposition while designing the preconditioner; this was our motivation to investigate the MinBvNDec problem. This is because, if we use the splitting, then $k$ determines the number of steps in which we compute the matrix-vector products. If we do not use all $k$ permutation matrices, having a few with large coefficients should help to design the preconditioner. As stated in Lemma 5.5, GreedyBvN (Algorithm 11) obtains the coefficients in a non-increasing order. We can therefore use GreedyBvN to build an $M$ such that it satisfies the sufficiency condition presented in Theorem 5.4. That is, we can have $M$ with $\alpha_1 / \sum_{i=1}^{k} \alpha_i > 1/2$, and hence $M^{-1}$ can be applied with splitting iterations. For this, we start by initializing $M$ to $\alpha_1 P_1$. Then, when a coefficient $\alpha_j$ with $j \ge 2$ is obtained at Line 5 of Algorithm 11, we check whether $\alpha_1$ will still be larger than the sum of the other coefficients used in building $M$ when we add $\alpha_j P_j$ to $M$. If so, we add $\alpha_j P_j$ to $M$ and continue. In practice, we iterate the while loop until $k$ is around 10 and collect the $\alpha_j$'s as described above. Experiments on a set of challenging problems [23] show that this preconditioner is more effective than ILU(0).

5.5.2 Birkhoff-von Neumann decomposition for arbitrary matrices

In order to apply the preconditioners to all fully indecomposable sparse matrices, we generalized the BvN decomposition as follows.

Theorem 5.5. Any (real) matrix $A$ with total support can be written as a convex combination of a set of signed, scaled permutation matrices.

Proof. Let $B = \mathrm{abs}(A)$, and consider $R$ and $C$ making $RBC$ doubly stochastic, which has a Birkhoff-von Neumann decomposition $RBC = \sum_{i=1}^{k} \alpha_i P_i$. Then $RAC$ can be expressed as

$RAC = \sum_{i=1}^{k} \alpha_i Q_i$ ,

where $Q_i = [q^{(i)}_{jk}]_{n \times n}$ is obtained from $P_i = [p^{(i)}_{jk}]_{n \times n}$ as follows:

$q^{(i)}_{jk} = \mathrm{sign}(a_{jk})\, p^{(i)}_{jk}$ .

The scaling matrices can be inverted to express $A$.
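The sign-transfer step in this proof can be illustrated with the following small sketch, assuming NumPy and that a permutation matrix is given as an index array listing the column of the one in each row.

import numpy as np

def signed_permutation(A, perm):
    """Build Q_i from a permutation P_i of the BvN decomposition of R*abs(A)*C
    by copying the signs of A at the permutation positions (assumed nonzero)."""
    n = A.shape[0]
    rows = np.arange(n)
    Q = np.zeros((n, n))
    Q[rows, perm] = np.sign(A[rows, perm])
    return Q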

De Werra [62] presents another generalization of the Birkhoff-von Neumann theorem to arbitrary matrices. In De Werra's generalization, instead of permutation matrices, signed integral matrices are used to decompose the given matrix. The entries of the integral matrices used in the decomposition have the same sign as the corresponding entries in the original matrix, but there may be multiple nonzeros in a row or column. De Werra discusses flow-based algorithms to obtain the decompositions. While the decomposition is costlier to obtain (general flow computations are usually costlier than computing perfect matchings), it is also more general, in that it handles rectangular matrices as well.

5.6 Summary, further notes and references

We investigated the problem of obtaining a Birkhoff-von Neumann decomposition of doubly stochastic matrices with the minimum number of permutation matrices. We showed four results regarding this problem. First, obtaining a BvN decomposition with the minimum number of permutation matrices is NP-hard. Second, there are matrices whose decompositions with the smallest number of permutation matrices cannot be obtained by any Birkhoff-like heuristic. Third, the worst-case approximation ratio of the original Birkhoff heuristic is Ω(n). On a more practical side, we proposed a natural greedy heuristic (GreedyBvN) and presented experimental results which demonstrated that this heuristic delivers results that are not far from a trivial lower bound. We theoretically investigated GreedyBvN's performance and observed that its performance depends on the values of the matrix elements. We showed a bound using the first and the last coefficients found by the heuristic. This chapter also included a generalization of the Birkhoff-von Neumann theorem, in which we are able to express any real matrix with total support as a convex combination of scaled, signed permutation matrices.

The shown bound for the performance of GreedyBvN is expected to be much larger than what one observes in practice, as the bound can even be larger than the upper bound on the number of permutation matrices. A tighter analysis should be possible to explain the practical performance of the GreedyBvN heuristic.

The heuristics we considered in this chapter were based on finding a perfect matching. Koopmans and Beckmann [143] present another constructive proof of the Birkhoff theorem. This constructive proof splits the matrix as $A = \beta A_1 + (1 - \beta) A_2$, where $0 < \beta < 1$, and $A_1$ and $A_2$ are doubly stochastic matrices. A related proof is also given by Horn and Johnson [118, Theorem 8.7.1], where $\beta = 1/2$. We review these two constructive proofs.

Let $C$ be a cycle of length $2\ell$, with $\ell$ vertices in each part of the bipartite graph: vertices $r_i, r_{i+1}, \dots, r_{i+\ell-1}$ on one side and $c_i, c_{i+1}, \dots, c_{i+\ell-1}$ on the other side. Let us imagine the cycle drawn on the plane so that we have horizontal edges of the form $(r_j, c_j)$ for $j = i, \dots, i + \ell - 1$, and slanted edges of the form $(r_i, c_{i+\ell-1})$ and $(r_j, c_{j-1})$ for $j = i + 1, \dots, i + \ell - 1$. A small example is shown in Figure 5.4.


Figure 5.4: A small example for the Koopmans and Beckmann decomposition approach: (a) $\varepsilon_1 = \min\{a_{ii}, a_{jj}\}$; (b) $\varepsilon_2 = \min\{a_{ij}, a_{ji}\}$.

Let $\varepsilon_1$ and $\varepsilon_2$ be the minimum value of a horizontal and a slanted edge, respectively. Koopmans and Beckmann proceed as follows to obtain a BvN decomposition. Consider a matrix $C_1$ which contains $-\varepsilon_1$ at the positions corresponding to the horizontal edges of $C$, $\varepsilon_1$ at the positions corresponding to the slanted edges of $C$, and zeros elsewhere. Then define $A_1 = A + C_1$ (see Fig. 5.4a). It is easy to see that each element of $A_1$ is between 0 and 1, and that the row and column sums are the same as in $A$. Therefore, $A_1$ is doubly stochastic. Similarly, consider $C_2$, which contains $\varepsilon_2$ at the positions corresponding to the horizontal edges of $C$, $-\varepsilon_2$ at the positions corresponding to the slanted edges of $C$, and zeros elsewhere. Then define $A_2 = A + C_2$ (see Fig. 5.4b), and observe that $A_2$ is also doubly stochastic. By simple arithmetic, we can set $\beta = \frac{\varepsilon_2}{\varepsilon_1 + \varepsilon_2}$ and write $A = \beta A_1 + (1 - \beta) A_2$. The observation that $A_1$ and $A_2$ contain at least one more zero than $A$ can then be used in an inductive hypothesis to construct a decomposition by recursively decomposing $A_1$ and $A_2$.

At first sight, the constructive approach of Koopmans and Beckmann does not lead to an efficient algorithm. This is because $A_1$ and $A_2$ can have a significant overlap, and hence many permutation matrices can be shared by $A_1$ and $A_2$, yielding a combinatorially large space requirement or a very high run time. A variant of this algorithm needs to be explored. First, note that we can find a cycle and manipulate the weights by decreasing them along the horizontal edges and increasing them along the slanted edges (not necessarily by annihilating the smallest weight) to create another matrix. Then we can take the modified matrix and manipulate another cycle. At one point, one can decide to stop and create $A_1$; this way, with a suitable choice of coefficients, the decomposition can proceed. At second sight, then, one observes that the Koopmans and Beckmann method calls for creative ideas for developing efficient and effective heuristics for BvN decompositions.

Horn and Johnson, on the other hand, set $\varepsilon$ to the smallest of $\varepsilon_1$ and $\varepsilon_2$. Without loss of generality, let $\varepsilon_1$ be the smallest. Then let $C$ be a matrix whose entries are $-1, 0, 1$, corresponding to the negative, zero, and positive entries of $C_1$. Then let $A_1 = A + \varepsilon C$ and $A_2 = A - \varepsilon C$. Both of these matrices have nonnegative entries less than or equal to one, and have the same row and column sums as $A$. We can then write $A = 0.5 A_1 + 0.5 A_2$ and recurse on $A_1$ and $A_2$ to obtain a decomposition (which stops when we do not have any cycles).

Inspired by the approaches summarized above, we can optimally decompose the matrix (5.8) used in showing that there are optimal decompositions which cannot be obtained by Birkhoff-like heuristics. Let

$A_1 = \begin{bmatrix} a+b & d & c & e & 0 \\ e & a+c & b & d & 0 \\ 0 & e & d & b+c & a \\ d & b & a & 0 & c+e \\ c & 0 & e & a & b+d \end{bmatrix}$ ,

$A_2 = \begin{bmatrix} a+b & d+2i & c+2h & e+2j & 2f+2g \\ e+2g & a+c & b+2i & d+2f & 2h+2j \\ 2f+2j & e+2h & d+2g & b+c & a+2i \\ d+2h & b+2f & a+2j & 2g+2i & c+e \\ c+2i & 2g+2j & e+2f & a+2h & b+d \end{bmatrix}$ ,

and observe that $A = 0.5 A_1 + 0.5 A_2$. We can then use the permutations containing the coefficients $a, b, c, d, e$ to decompose $A_1$ optimally with five permutation matrices using GreedyBvN (in the decreasing order: first $e$, then $d$, then $c$, and so on). While decomposing $A_2$, we need to use the permutation matrices found for $A_1$, with a careful selection of the coefficients (instead of $a + b$, we use $a$, for example, for the permutation associated with $a$). Once we have consumed those five permutations, we can again resort to GreedyBvN to decompose the remaining matrix optimally with the five permutation matrices associated with $f, g, h, i$, and $j$. In this way, we have an optimal decomposition for the matrix (5.8) by applying GreedyBvN to $A_1$ and carefully decomposing $A_2$. We note that neither $A_1$ nor $A_2$ has the same sum as $A$, in contrast to the two approaches above. Can we systematize this line of reasoning to develop an effective heuristic to obtain a BvN decomposition?


Chapter 6

Matchings in hypergraphs

In this chapter, we investigate the maximum matching problem in d-partite, d-uniform hypergraphs, described in Section 1.3. The formal problem definition is repeated below for convenience.

Maximum matching in d-partite d-uniform hypergraphs: Given a d-partite, d-uniform hypergraph, find a matching of maximum size.

We investigate effective heuristics for the problem above. This problem has been studied mostly in the context of local search algorithms [119], and the best known algorithm is due to Cygan [56], who provides a $(d + 1 + \varepsilon)/3$-approximation, building on previous work [57, 106]. It is NP-hard to approximate the 3-partite case (Max-3-DM) within 98/97 [24]. Similar bounds exist for higher dimensions: the hardness of approximation for d = 4, 5, and 6 is shown to be $54/53 - \varepsilon$, $30/29 - \varepsilon$, and $23/22 - \varepsilon$, respectively [109].

We propose five heuristics for the above problem. The first two heuristics are adaptations of the Greedy [78] and Karp-Sipser [128] heuristics used in finding matchings in graphs (they are summarized in Chapter 4). The proposed generalizations are referred to as Greedy-H and Karp-Sipser-H. Greedy-H traverses the hyperedge list in random order and adds an edge to the matching whenever possible. Karp-Sipser-H introduces certain rules to Greedy-H to improve the cardinality. The third heuristic is inspired by a recent scaling-based approach proposed for the maximum cardinality matching problem on graphs [73, 74, 76]. The fourth heuristic is a modification of the third one that allows for a faster execution time. The last one finds a matching for a reduced (d − 1)-dimensional problem and exploits it for the original matching problem recursively. This heuristic uses an exact algorithm for the bipartite matching problem at each reduction. We perform experiments to evaluate the performance of these heuristics on special classes of random hypergraphs as well as real-life data.

One plausible way to tackle the problem is to create the line graph $G$ for a given hypergraph $H$. The line graph is created by identifying each hyperedge of $H$ with a vertex in $G$, and by connecting two vertices of $G$ with an edge iff the corresponding hyperedges share a common vertex in $H$. Then, successful heuristics for computing large independent sets in graphs, e.g., KaMIS [147], can be used to compute large matchings in hypergraphs. This approach, although promising quality-wise, could be impractical. This is so since building $G$ from $H$ requires quadratic run time (in terms of the number of hyperedges) and, more importantly, quadratic storage (again, in terms of the number of hyperedges) in the worst case. While this can be acceptable in some instances, in others it is not; we have such instances in the experiments. Notice that, while a heuristic for the independent set problem can be of linear time complexity in graphs, due to our graphs being line graphs the actual complexity could be high.

6.1 Background and notation

Franklin and Lorenz [89] show that if a nonnegative d-dimensional tensor $X$ has the same zero-pattern as a d-stochastic tensor, one can apply a multidimensional version of the Sinkhorn-Knopp algorithm [202] (see also the description in Section 4.1) to scale $X$ to be d-stochastic. This is accomplished by finding d vectors $u^{(1)}, \dots, u^{(d)}$ and scaling the nonzeros of $X$ as $x_{i_1 \dots i_d} \cdot u^{(1)}_{i_1} \cdots u^{(d)}_{i_d}$ for all $i_1, \dots, i_d \in \{1, \dots, n\}$.
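A compact sketch of this multidimensional scaling, assuming the tensor is stored as a list of nonzero index tuples with values, is given below.

import numpy as np

def tensor_sinkhorn(indices, values, sizes, iters=20):
    """indices: (nnz, d) int array; values: (nnz,) positive floats;
    sizes: length-d list of dimension sizes. Returns the d scaling vectors."""
    d = len(sizes)
    u = [np.ones(nk) for nk in sizes]
    for _ in range(iters):
        for k in range(d):
            scaled = values.copy()
            for m in range(d):
                if m != k:                       # exclude dimension k's own factor
                    scaled *= u[m][indices[:, m]]
            marg = np.zeros(sizes[k])
            np.add.at(marg, indices[:, k], scaled)   # marginal sums along dim k
            u[k] = np.where(marg > 0, 1.0 / marg, 0.0)
    return u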

In a hypergraph, if a vertex is a member of only a single hyperedge, we call it a degree-1 vertex. Similarly, if a vertex is a member of only two hyperedges, we call it a degree-2 vertex.

In the k-out random hypergraph model, given a set $V$ of vertices, each vertex $u \in V$ selects $k$ hyperedges from the set $E_u = \{e : e \subseteq V, u \in e\}$ in a uniformly random fashion, and the union of these edges forms $E$. We are interested in the d-partite, d-uniform case, and hence $E_u = \{e : |e \cap V_i| = 1$ for $1 \le i \le d$, $u \in e\}$. This model generalizes the random k-out bipartite graphs [211]. Devlin and Kahn [67] investigate fractional matchings in these hypergraphs, and mention in passing that $k$ should be exponential in $d$ to ensure that a perfect matching exists.
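A sketch of a generator for this model, assuming each part has n vertices and that vertices are identified by their index within a part, is as follows.

import numpy as np

def k_out_hypergraph(n, d, k, rng=np.random.default_rng(0)):
    """Each vertex draws k hyperedges uniformly among those containing it,
    with one vertex per part; duplicate hyperedges are removed."""
    edges = set()
    for part in range(d):
        for u in range(n):
            for _ in range(k):
                e = list(rng.integers(0, n, size=d))  # random vertex in each part
                e[part] = u                           # the edge must contain u
                edges.add(tuple(int(v) for v in e))
    return edges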

6.2 The heuristics

A matching which cannot be extended with more edges is called maximal. The heuristics proposed here find maximal matchings on d-partite, d-uniform hypergraphs. For such hypergraphs, any maximal matching is a d-approximate matching. The bound is tight and can be verified for d = 3. Let $H$ be a 3-partite, $3 \times 3 \times 3$ hypergraph with the following edges: $e_1 = (1, 1, 1)$, $e_2 = (2, 2, 2)$, $e_3 = (3, 3, 3)$, and $e_4 = (1, 2, 3)$. The maximum matching is $\{e_1, e_2, e_3\}$, but the edge $e_4$ alone forms a maximal matching.


6.2.1 Greedy-H: A greedy heuristic

Among the two variants of Greedy [78, 194], we adapt the first one, which randomly visits the edges. The adapted version is referred to as Greedy-H; it visits the hyperedges in random order and adds the visited hyperedge to the matching whenever possible. Since only maximal matchings are possible as its output, Greedy-H is a d-approximation heuristic.
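A direct sketch of Greedy-H, assuming hyperedges are stored as tuples with one vertex per part, is given below.

import random

def greedy_h(edges, d, seed=0):
    """Visit hyperedges in random order; add one whenever all its endpoints are free."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    matched = [set() for _ in range(d)]   # matched vertices, per part
    matching = []
    for e in edges:
        if all(e[i] not in matched[i] for i in range(d)):
            matching.append(e)
            for i in range(d):
                matched[i].add(e[i])
    return matching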

6.2.2 Karp-Sipser-H: Adapting Karp-Sipser

The commonly used and analyzed version of the Karp-Sipser heuristic for graphs applies degree-1 reductions, as summarized in Section 4.2.1. The original version, though, applies two rules:

• At any time during the heuristic, if a degree-1 vertex appears, it is matched with its only neighbor.

• Otherwise, if a degree-2 vertex $u$ appears with neighbors $\{v, w\}$, then $u$ (and its edges) is removed from the current graph, and $v$ and $w$ are merged to create a new vertex $vw$ whose set of neighbors is the union of those of $v$ and $w$ (except $u$). A maximum cardinality matching for the reduced graph can be extended to obtain one for the current graph by matching $u$ with either $v$ or $w$, depending on $vw$'s match.

We now propose an adaptation of Karp-Sipser with the two rules for d-partite, d-uniform hypergraphs. The adapted version is called Karp-Sipser-H. Similar to the original one, the modified heuristic iteratively adds a random hyperedge to the matching and removes its d endpoints, as well as their hyperedges. However, the random selection is not applied whenever hyperedges defined by the following lemmas appear.

Lemma 6.1. During the heuristic, if a hyperedge $e$ with at least $d - 1$ degree-1 endpoints appears, there exists a maximum cardinality matching in the current hypergraph containing $e$.

Proof. Let $H'$ be the current hypergraph at hand, and let $e = (u_1, \dots, u_d)$ be a hyperedge in $H'$ whose first $d - 1$ endpoints are degree-1 vertices. Let $M'$ be a maximum cardinality matching in $H'$. If $e \in M'$, we are done. Otherwise, assume that $u_d$ is the endpoint matched by a hyperedge $e' \in M'$ (note that if $u_d$ is not matched, $M'$ can be extended with $e$). Since $u_i$, $1 \le i < d$, are not matched in $M'$, $M' \setminus \{e'\} \cup \{e\}$ defines a valid maximum cardinality matching for $H'$.

We note that it is not possible to relax the condition by using a hyperedge $e$ with fewer than $d - 1$ endpoints of degree-1: in $M'$, two of $e$'s higher-degree endpoints could be matched with two different hyperedges, in which case the substitution done in the proof of the lemma will not be valid.


Lemma 6.2. During the heuristic, let $e = (u_1, \dots, u_d)$ and $e' = (u'_1, \dots, u'_d)$ be two hyperedges sharing at least one endpoint, where, for an index set $I \subset \{1, \dots, d\}$ of cardinality $d - 1$, the vertices $u_i, u'_i$ for all $i \in I$ only touch $e$ and/or $e'$. That is, for each $i \in I$, either $u_i = u'_i$ is a degree-2 vertex, or $u_i \ne u'_i$ and they are both degree-1 vertices. For $j \notin I$, $u_j$ and $u'_j$ are arbitrary vertices. Then, in the current hypergraph, there exists a maximum cardinality matching having either $e$ or $e'$.

Proof. Let $H'$ be the current hypergraph at hand, and let $j \notin I$ be the remaining part index. Let $M'$ be a maximum cardinality matching in $H'$. If either $e \in M'$ or $e' \in M'$, we are done. Otherwise, $u_i$ and $u'_i$ for all $i \in I$ are unmatched by $M'$. Furthermore, since $M'$ is maximal, $u_j$ must be matched by $M'$ (otherwise, $M'$ can be extended by $e$). Let $e'' \in M'$ be the hyperedge matching $u_j$. Then $M' \setminus \{e''\} \cup \{e\}$ defines a valid maximum cardinality matching for $H'$.

Whenever such hyperedges appear, the rules below are applied in the same order:

• Rule-1: At any time during the heuristic, if a hyperedge $e$ with at least $d - 1$ degree-1 endpoints appears, instead of a random edge, $e$ is added to the matching and removed from the hypergraph.

• Rule-2: Otherwise, if two hyperedges $e$ and $e'$ as defined in Lemma 6.2 appear, they are removed from the current hypergraph, along with the endpoints $u_i, u'_i$ for all $i \in I$. Then we consider $u_j$ and $u'_j$. If $u_j$ and $u'_j$ are distinct, they are merged to create a new vertex $u_j u'_j$, whose hyperedge list is defined as the union of $u_j$'s and $u'_j$'s hyperedge lists. If $u_j$ and $u'_j$ are identical, we rename $u_j$ as $u_j u'_j$. After obtaining a maximal matching on the reduced hypergraph, depending on the hyperedge matching $u_j u'_j$, either $e$ or $e'$ can be used to obtain a larger matching in the current hypergraph.

When Rule-2 is applied, the two hyperedges identified in Lemma 6.2 are removed from the hypergraph, and only the hyperedges containing $u_j$ and/or $u'_j$ have an update in their vertex lists. Since the original hypergraph is d-partite and d-uniform, that update is just a renaming of a vertex in the concerned hyperedges (hence, the resulting hypergraph is d-partite and d-uniform).

Although the extended rules usually lead to improved results in comparison to Greedy-H, Karp-Sipser-H still adheres to the d-approximation bound of maximal matchings. For the example given at the beginning of Section 6.2, Karp-Sipser-H generates a maximum cardinality matching by applying the first rule. However, when $e_5 = (2, 1, 3)$ and $e_6 = (3, 1, 3)$ are added to the example, neither of the two rules can be applied. As before, in case $e_4$ is randomly selected, it alone forms a maximal matching.


6.2.3 Karp-Sipser-H-scaling: Karp-Sipser with scaling

Karp-Sipser-H can be modified for better decisions in case neither of the two rules applies. In this variant, instead of a random selection, we first scale the adjacency tensor of $H$ and obtain an approximate d-stochastic tensor $X$. We then augment the matching by adding the edge which corresponds to the largest value in $X$ when the rules do not apply. The modified heuristic is summarized in Algorithm 12.

Algorithm 12: Karp-Sipser-H-scaling (H)

Data: A d-partite, d-uniform $n_1 \times \cdots \times n_d$ hypergraph $H = (V, E)$
Result: A maximal matching $M$ of $H$

 1  $M \leftarrow \emptyset$                      /* initially M is empty */
 2  $S \leftarrow \emptyset$                      /* stack for the merges of Rule-2 */
 3  while $H$ is not empty do
 4      remove isolated vertices from $H$
 5      if $\exists e = (u_1, \dots, u_d)$ as in Rule-1 then
 6          $M \leftarrow M \cup \{e\}$           /* add e to the matching */
 7          apply the reduction Rule-1 on $H$
 8      else if $\exists e = (u_1, \dots, u_d)$, $e' = (u'_1, \dots, u'_d)$ and $I$ as in Rule-2 then
 9          let $j$ be the part index with $j \notin I$
10          apply the reduction Rule-2 on $H$ by introducing the vertex $u_j u'_j$
11          $E' \leftarrow \{(v_1, \dots, u_j u'_j, \dots, v_d) :$ for all $(v_1, \dots, u_j, \dots, v_d) \in E\}$   /* memorize the hyperedges of $u_j$ */
12          $S$.push($\langle e, e', u_j u'_j, E' \rangle$)   /* store the current merge */
13      else
14          $X \leftarrow$ Scale(adj($H$))        /* scale the adjacency tensor of H */
15          $e \leftarrow \arg\max_{(u_1, \dots, u_d)} (x_{u_1 \dots u_d})$   /* find the max in X */
16          $M \leftarrow M \cup \{e\}$           /* add e to the matching */
17          remove all hyperedges of $u_1, \dots, u_d$ from $E$
18          $V \leftarrow V \setminus \{u_1, \dots, u_d\}$
19  while $S \ne \emptyset$ do
20      $\langle e, e', u_j u'_j, E' \rangle \leftarrow S$.pop()
21      if $u_j u'_j$ is not matched by $M$ then
22          $M \leftarrow M \cup \{e\}$
23      else
24          let $e'' \in M$ be the hyperedge matching $u_j u'_j$
25          if $e'' \in E'$ then
26              replace $u_j u'_j$ in $e''$ with $u_j$
27              $M \leftarrow M \cup \{e'\}$
28          else
29              replace $u_j u'_j$ in $e''$ with $u'_j$
30              $M \leftarrow M \cup \{e\}$

Our inspiration comes from the d = 2 case [74, 76] described in Chapter 4. By using the scaling method as a preprocessing step and choosing edges with a probability corresponding to the scaled entry, the edges which are not included in a perfect matching become less likely to be chosen. Unfortunately, for $d \ge 3$ there is no equivalent of Birkhoff's theorem, as demonstrated by the following lemma, whose proof can be found in the associated technical report [75, Lemma 3].

Lemma 6.3. For $d \ge 3$, there exist extreme points in the set of d-stochastic tensors which are not permutation tensors.

These extreme points can be used to generate other d-stochastic tensors as linear combinations. Due to the lemma above, we do not have the theoretical foundation to imply that hyperedges corresponding to the large entries in the scaled tensor must necessarily participate in a perfect matching. Nonetheless, the entries not in any perfect matching tend to become zero (though this is not guaranteed for all of them). For the worst-case example of Karp-Sipser-H described above, the scaling indeed helps the entries corresponding to $e_4$, $e_5$, and $e_6$ to become zero; see the discussion following the said lemma in the technical report [75].

On a d-partite, d-uniform hypergraph $H = (V, E)$, the Sinkhorn-Knopp algorithm used for scaling operates in iterations, each of which requires $O(|E| \times d)$ time. In practice, we perform only a few iterations (e.g., 10–20). Since we can match at most $|V|/d$ hyperedges, the overall run time cost associated with scaling is $O(|V| \times |E|)$. A straightforward implementation of the second rule can take quadratic time in the worst case of a large number of repetitive merges with a given vertex. In practice, more of a linear time behavior should be observed for the second rule.

6.2.4 Karp-Sipser-H-mindegree: Hypergraph matching via pseudo scaling

In Algorithm 12, applying scaling at every step can be very costly. Here we propose an alternative idea, inspired by the specifics of the Sinkhorn-Knopp algorithm, to reduce the overall cost. In particular, we can mimic the first iteration of the Sinkhorn-Knopp algorithm to discover that we can use the inverse of the degree of a vertex as its scaling value. This assigns each vertex a weight proportional to the inverse of its degree; hence it is not an exact iteration of Sinkhorn-Knopp.

With this approach, each hyperedge $(i_1, \dots, i_d)$ is associated with the value $1 / \prod_{j=1}^{d} \lambda_{i_j}$, where $\lambda_{i_j}$ denotes the degree of the vertex $i_j$ from the $j$th part. The selection procedure is the same as that of Algorithm 12; i.e., the edge with the maximum value is added to the matching set. We refer to this algorithm as Karp-Sipser-H-mindegree, as it selects a hyperedge based on a function of the degrees of its vertices. With a straightforward implementation, finding this hyperedge takes $O(|E|)$ time. For better efficiency, the edges can be stored in a heap, and when the degree of a vertex $v$ decreases, the increaseKey heap operation can be called for all its edges.
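The selection rule of Karp-Sipser-H-mindegree can be sketched as follows; in the actual heuristic, the degrees must be those of the current (reduced) hypergraph, not the original one.

from collections import Counter
from math import prod

def pick_min_degree_product(edges, d):
    """Return the hyperedge maximizing the product of inverse vertex degrees,
    i.e., minimizing the product of the degrees of its endpoints."""
    degree = [Counter(e[i] for e in edges) for i in range(d)]
    return min(edges, key=lambda e: prod(degree[i][e[i]] for i in range(d)))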

6.2.5 Bipartite-reduction: Reduction to the bipartite matching problem

A perfect matching in a d-partite, d-uniform hypergraph $H$ remains perfect when projected onto a $(d-1)$-partite, $(d-1)$-uniform hypergraph obtained by removing one of $H$'s dimensions. Matchability in $(d-1)$-dimensional sub-hypergraphs has been investigated [6] to provide an equivalent of Hall's Theorem for d-partite hypergraphs. These observations lead us to propose a heuristic called Bipartite-reduction. This heuristic tackles the d-partite, d-uniform case by recursively asking for matchings in $(d-1)$-partite, $(d-1)$-uniform hypergraphs, and so on, until $d = 2$.

Let us start with the case where d = 3. Let $G = (V_G, E_G)$ be the bipartite graph with the vertex set $V_G = V_1 \cup V_2$ obtained by deleting $V_3$ from a 3-partite, 3-uniform hypergraph $H = (V, E)$. The edge $(u, v)$ is in $E_G$ iff there exists a hyperedge $(u, v, z) \in E$. One can also assign a weight function $w(\cdot)$ to the edges during this step, such as

$w(u, v) = |\{z : (u, v, z) \in E\}|$ .   (6.1)

A maximum weighted (product, sum, etc.) matching algorithm can be used to obtain a matching $M_G$ on $G$. A second bipartite graph $G' = (V_{G'}, E_{G'})$ is then created with $V_{G'} = (V_1 \times V_2) \cup V_3$ and $E_{G'} = \{(uv, z) : (u, v) \in M_G, (u, v, z) \in H\}$. Under this construction, any matching in $G'$ corresponds to a valid matching in $H$. Furthermore, if the weight function (6.1) defined above is used, the following holds (the proof is in the technical report [75, Proposition 4]).

Proposition 6.1. Let $w(M_G) = \sum_{(u,v) \in M_G} w(u, v)$ be the size of the matching $M_G$ found in $G$. Then $G'$ has $w(M_G)$ edges.

Thus, by selecting a maximum weighted matching $M_G$ and maximizing $w(M_G)$, the largest number of edges will be kept in $G'$.

For d-dimensional matching, a similar process is followed. First, an ordering $i_1, i_2, \dots, i_d$ of the dimensions is defined. At the $j$th bipartite reduction step, a matching is found between the dimension cluster $i_1 i_2 \cdots i_j$ and dimension $i_{j+1}$ by similarly solving a bipartite matching instance, where the edge $(u_1 \cdots u_j, v)$ exists iff the vertices $u_1, \dots, u_j$ were matched in the previous steps and there exists an edge $(u_1, \dots, u_j, v, z_{j+2}, \dots, z_d)$ in $H$.
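A small, dense sketch of Bipartite-reduction for d = 3, assuming SciPy's linear_sum_assignment for the weighted bipartite matchings, is given below; it is only sensible for small vertex sets and is not the implementation used in the experiments.

import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_reduction_3d(edges, n1, n2, n3):
    # Step 1: project onto (V1, V2) with weights w(u, v) = |{z : (u, v, z) in E}|.
    W = np.zeros((n1, n2))
    for (u, v, z) in edges:
        W[u, v] += 1
    rows, cols = linear_sum_assignment(W, maximize=True)
    pair_of = {u: v for u, v in zip(rows, cols) if W[u, v] > 0}
    # Step 2: match the chosen (u, v) pairs against V3.
    pairs = list(pair_of.items())
    index = {p: i for i, p in enumerate(pairs)}
    W2 = np.zeros((len(pairs), n3))
    for (u, v, z) in edges:
        if pair_of.get(u) == v:
            W2[index[(u, v)], z] += 1
    r2, c2 = linear_sum_assignment(W2, maximize=True)
    return [(pairs[i][0], pairs[i][1], z) for i, z in zip(r2, c2) if W2[i, z] > 0]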

Unlike the previous heuristics, Bipartite-reduction does not have any approximation guarantee. We state this with the following lemma (the proof is in the technical report [75, Lemma 5]).


Lemma 6.4. If algorithms for the maximum cardinality matching or the maximum weighted matching problem with the suggested edge weights (6.1) are used, then Bipartite-reduction has a worst-case approximation ratio of $\Omega(n)$.

6.2.6 Local-Search: Performing local search

A local search heuristic is proposed by Hurkens and Schrijver [119]. It starts from a feasible maximal matching $M$ and performs a series of swaps until no more are possible. In a swap, $k$ edges of $M$ are replaced with at least $k + 1$ new edges from $E \setminus M$, so that the cardinality of $M$ increases by at least one. These $k$ edges from $M$ can be replaced with at most $d \times k$ new edges; hence, these edges can be found by a polynomial algorithm enumerating all the possibilities. The approximation guarantee improves with higher $k$ values. Local search algorithms are limited in practice due to their high time complexity: the algorithm might have to examine all $\binom{|M|}{k}$ subsets of $M$ to find a feasible swap at each step. The algorithm by Cygan [56], which achieves a $\frac{d+1+\varepsilon}{3}$-approximation, is based on a different swap scheme, but is also not suited for large hypergraphs. We implement Local-Search as an alternative from the literature to set a baseline.

6.3 Experiments

To understand the relative performance of the proposed heuristics, we conducted a wide variety of experiments with both synthetic and real-life data. The experiments were performed on a computer equipped with an Intel Core i7-7600 CPU and 16GB RAM. We investigate the performance of the proposed heuristics Greedy-H, Karp-Sipser-H, Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction. For d = 3, we also consider Local-Search [119], which replaces one hyperedge from a maximal matching $M$ with at least two hyperedges from $E \setminus M$ to increase the cardinality of $M$. We did not consider local search schemes for higher dimensions or with better approximation ratios, as they are computationally too expensive. For each hypergraph, we perform ten runs of Greedy-H and Karp-Sipser-H with different random decisions and take the maximum cardinality obtained. Since Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction do not pick hyperedges randomly, we run them only once. We perform 20 steps of the scaling procedure in Karp-Sipser-H-scaling. We refer to the quality of a matching $M$ in a hypergraph $H$ as the ratio of $M$'s cardinality to the size of the smallest vertex partition of $H$.

On random hypergraphs

We perform experiments on two classes of d-partite, d-uniform random hypergraphs, where each part has $n$ vertices. The first class contains random


               n = 10                           n = 30
        k = d^{d-3}  d^{d-2}  d^{d-1}    k = d^{d-3}  d^{d-2}  d^{d-1}
 d = 2       -         0.87     1.00         -          0.84     1.00
 d = 3      0.80       1.00     1.00        0.88        1.00     1.00
 d = 4      1.00       1.00     1.00        0.99        1.00     1.00
 d = 5      1.00       1.00     1.00        1.00        1.00

               n = 20                           n = 50
        k = d^{d-3}  d^{d-2}  d^{d-1}    k = d^{d-3}  d^{d-2}  d^{d-1}
 d = 2       -         0.88     1.00         -          0.87     1.00
 d = 3      0.85       1.00     1.00        0.84        1.00     1.00
 d = 4      1.00       1.00     1.00         *          1.00     1.00
 d = 5      1.00       1.00     1.00

Table 6.1: The average maximum matching cardinalities over $n$ of five random instances of k-out, d-partite, d-uniform hypergraphs, for different $k$, $d$, and $n$. There are no runs for $k = d^{d-3}$ for $d = 2$, and the problems marked with * (or left blank) were not solved within 24 hours.

k-out hypergraphs, and the second one contains sparse random hypergraphs.

Random k-out, d-partite, d-uniform hypergraphs. Here we consider the random k-out, d-partite, d-uniform hypergraphs described in Section 6.1. Hence (ignoring the duplicate ones), these hypergraphs have around $d \times k \times n$ hyperedges. These k-out, d-partite, d-uniform hypergraphs have been recently analyzed in the matching context by Devlin and Kahn [67]. They state in passing that $k$ should be exponential in $d$ for a perfect matching to exist with high probability. The bipartite graph variant of the same problem, i.e., with d = 2, has been extensively studied in the literature [90, 124, 125, 211].

We first investigate the existence of perfect matchings in random k-out, d-partite, d-uniform hypergraphs. For this purpose, we used CPLEX [1] to solve the hypergraph matching problem exactly, and found the maximum cardinality of a matching in k-out hypergraphs with $k \in \{d^{d-3}, d^{d-2}, d^{d-1}\}$ for $d \in \{2, \dots, 5\}$ and $n \in \{10, 20, 30, 50\}$. For each $(k, d, n)$ triple, we created five hypergraphs and computed their maximum cardinality matchings. For $k = d^{d-3}$, we encountered several hypergraphs with no perfect matching, especially for $d = 3$. The hypergraphs with $k = d^{d-2}$ were also lacking a perfect matching for $d = 2$. However, all the hypergraphs we created with $k = d^{d-1}$ had at least one. Based on these results, we experimentally confirm Devlin and Kahn's statement. We also conjecture that $d^{d-1}$-out random hypergraphs have perfect matchings almost surely. The average maximum matching cardinalities we obtained in this experiment are given in Table 6.1. In this table, we do not have results for $k = d^{d-3}$ for $d = 2$, and the cases marked with * were not solved within 24 hours.

We now compare the performance of the proposed heuristics on random k-out, d-partite, d-uniform hypergraphs with $d \in \{3, 6, 9\}$ and $n \in \{1000, 10000\}$. We tested with $k$ values equal to powers of 2, for $k \le d \log d$. The results are summarized in Figure 6.1. For each $(k, d, n)$ triplet, we create ten random instances and present the average performance of the heuristics on them. The x-axis in each figure denotes $k$, and the y-axis reports the matching cardinality over $n$. As seen, Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree have the best performance, comfortably beating the other alternatives. For $d = 3$, Karp-Sipser-H-scaling dominates Karp-Sipser-H-mindegree, but when $d > 3$ we see that Karp-Sipser-H-mindegree has the best performance. Local-Search performs worse than Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, but better than the others. Karp-Sipser-H performs better than Greedy-H; however, their performances get closer as $d$ increases. This is due to the fact that the conditions for Rule-1 and Rule-2 hold less often for larger $d$, as we have more restrictions to meet in such cases. Bipartite-reduction has worse performance than the others, and the gap in the performance grows as $d$ increases. This happens since, at each step, we impose more and more conditions on the edges involved, and there is no chance to recover from bad decisions.

Sparse random d-partite, d-uniform hypergraphs. Here we consider random d-partite, d-uniform hypergraphs $H_i$ created with $i \times n$ hyperedges. The parameters used for this experiment are $i \in \{1, 3, 5, 7\}$, $n \in \{4000, 8000\}$, and $d \in \{6, 9\}$ (the associated paper presents further experiments with $d = 3$). Each $H_i$ is created by choosing the vertices of a hyperedge uniformly at random for each dimension. We do not allow duplicate hyperedges. Another random hypergraph $H_{i+M}$ is then obtained by planting a perfect matching in $H_i$; that is, a random perfect matching is added to the pattern of $H_i$. We again generate ten random instances for each parameter setting. We do not present results for Bipartite-reduction, as it was always worse than the others. The average quality of the different heuristics on these instances is shown in Figure 6.2. The experiments confirm that Karp-Sipser-H performs consistently better than Greedy-H. Furthermore, Karp-Sipser-H-scaling performs significantly better than Karp-Sipser-H. Karp-Sipser-H-scaling works even better than the local search heuristic [75, Fig. 2], and it is the only heuristic that is capable of finding the planted perfect matchings for a significant number of the runs. In particular, it finds a perfect matching on the $H_{i+M}$'s in all cases except for $d = 6$ and $i = 7$. Interestingly, Karp-Sipser-H-mindegree outperforms Karp-Sipser-H-scaling on the $H_i$'s, but is dominated on the $H_{i+M}$'s, where it is the second best performing heuristic.

Evaluating algorithmic choices

We evaluate the use of scaling and the importance of Rule-1 and Rule-2

Scaling vs. no-scaling. To evaluate and emphasize the contribution of scaling better, we compare the


Figure 6.1: The performance of the heuristics on k-out, d-partite, d-uniform hypergraphs with $n$ vertices in each part: (a) $d = 3$, (b) $d = 6$, (c) $d = 9$, each with $n = 1000$ (left panel) and $n = 10000$ (right panel). The y-axis is the ratio of the matching cardinality to $n$, whereas the x-axis is $k$. There is no Local-Search for $d = 6$ and $d = 9$.


(a) d = 6

  Hi: random hypergraph
        Greedy-H       Karp-Sipser-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
   i    4000   8000     4000   8000       4000   8000              4000   8000
   1    0.31   0.31     0.35   0.35       0.35   0.35              0.36   0.37
   3    0.43   0.43     0.47   0.47       0.48   0.48              0.54   0.54
   5    0.48   0.48     0.52   0.52       0.54   0.54              0.61   0.61
   7    0.52   0.52     0.55   0.55       0.59   0.59              0.66   0.66

  Hi+M: random hypergraph + a perfect matching
        Greedy-H       Karp-Sipser-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
   i    4000   8000     4000   8000       4000   8000              4000   8000
   1    0.62   0.61     0.90   0.89       1.00   1.00              1.00   1.00
   3    0.51   0.50     0.56   0.55       1.00   1.00              0.99   0.99
   5    0.52   0.52     0.56   0.55       1.00   1.00              0.97   0.97
   7    0.54   0.54     0.57   0.57       0.84   0.80              0.71   0.70

(b) d = 9

  Hi: random hypergraph
        Greedy-H       Karp-Sipser-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
   i    4000   8000     4000   8000       4000   8000              4000   8000
   1    0.25   0.24     0.27   0.27       0.27   0.27              0.30   0.30
   3    0.34   0.33     0.36   0.36       0.36   0.36              0.43   0.43
   5    0.38   0.37     0.40   0.40       0.41   0.41              0.48   0.48
   7    0.40   0.40     0.42   0.42       0.44   0.44              0.51   0.51

  Hi+M: random hypergraph + a perfect matching
        Greedy-H       Karp-Sipser-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
   i    4000   8000     4000   8000       4000   8000              4000   8000
   1    0.56   0.55     0.80   0.79       1.00   1.00              1.00   1.00
   3    0.40   0.40     0.44   0.44       1.00   1.00              0.99   1.00
   5    0.41   0.40     0.43   0.43       1.00   1.00              0.99   0.99
   7    0.42   0.42     0.44   0.44       1.00   1.00              0.97   0.96

Figure 6.2: Performance comparisons on d-partite, d-uniform hypergraphs with n = 4000, 8000. Hi contains i × n random hyperedges, and Hi+M contains an additional perfect matching. Panel (a) is for d = 6 and panel (b) for d = 9, without (Hi) and with (Hi+M) the planted matching.

Figure 6.3: $A_{KS}$, a challenging bipartite graph instance for Karp-Sipser, with row blocks $R_1, R_2$, column blocks $C_1, C_2$, and parameter $t$.

                      Local-    Karp-      Karp-Sipser-H-   Karp-Sipser-H-
   t    Greedy-H      Search    Sipser-H   scaling          mindegree
   2      0.53         0.99      0.53       1.00             1.00
   4      0.53         0.99      0.53       1.00             1.00
   8      0.54         0.99      0.55       1.00             1.00
  16      0.55         0.99      0.56       1.00             1.00
  32      0.59         0.99      0.59       1.00             1.00

Table 6.2: Performance of the proposed heuristics on 3-partite, 3-uniform hypergraphs corresponding to $X_{KS}$, with n = 300 vertices in each part.


performance of the heuristics on a particular family of d-partite, d-uniform hypergraphs whose bipartite counterparts have been used before as challenging instances for the original Karp-Sipser heuristic [76].

Let $A_{KS}$ be an $n \times n$ matrix, discussed before in Figure 4.3 as a challenging instance for Karp-Sipser, and shown in Figure 6.3 for convenience. To adapt this scheme to hypergraphs/tensors, we generate a 3-dimensional tensor $X_{KS}$ such that the nonzero pattern of each marginal of the 3rd dimension is identical to that of $A_{KS}$. Table 6.2 shows the performance of the heuristics (i.e., matching cardinality normalized by $n$) for 3-dimensional tensors with $n = 300$ and $t \in \{2, 4, 8, 16, 32\}$.

The use of scaling indeed reduces the influence of the misleading hyperedges in the dense block $R_1 \times C_1$, and the proposed Karp-Sipser-H-scaling heuristic always finds the perfect matching, as does Karp-Sipser-H-mindegree. However, Greedy-H and Karp-Sipser-H perform significantly worse. Furthermore, Local-Search returns a 0.99-approximation in every case, because it ends up in a local optimum.

Rule-1 vs. Rule-2. We finish the discussion on the synthetic data by focusing on Karp-Sipser-H. In the bipartite case, recent work [10] shows that both rules are needed to obtain perfect matchings in a class of graphs. We present a family of hypergraphs for which applying Rule-2 leads to much improved performance compared with applying Rule-1 only.

We use Karp-Sipser-H-R1 to refer to Karp-Sipser-H without Rule-2. As before, we describe the bipartite case first. Let $A_{RF}$ be an $n \times n$ matrix with $(A_{RF})_{ij} = 1$ for $1 \le i \le j \le n$, and $(A_{RF})_{21} = (A_{RF})_{n,n-1} = 1$. That is, $A_{RF}$ is composed of an upper triangular matrix and two additional subdiagonal nonzeros. The first two columns and the last two rows have two nonzeros. Assume, without loss of generality, that the first two rows are merged by applying Rule-2 on the first column (which is discarded). Then, in the reduced matrix, the first column (corresponding to the second column in the original matrix) will have one nonzero. Rule-1 can now be applied, whereupon the first column in the reduced matrix will have degree one. The process continues similarly until the reduced matrix is a $2 \times 2$ dense block, where applying Rule-2 followed by Rule-1 yields a perfect matching. If only Rule-1 reductions are allowed, initially no reduction can be applied, and randomly chosen edges will be matched, which negatively affects the quality of the returned matching.

For higher dimensions, we proceed as follows. Let $X_{RF}$ be a d-dimensional $n \times \cdots \times n$ tensor. We set $(X_{RF})_{i j \cdots j} = 1$ for $1 \le i \le j \le n$, and $(X_{RF})_{1 2 \cdots 2} = (X_{RF})_{n, n-1, \dots, n-1} = 1$. By similar reasoning, we see that Karp-Sipser-H with both reduction rules will obtain a perfect matching, whereas Karp-Sipser-H-R1 will struggle. We give some results in Table 6.3 that show the difference between the two. We test for $n \in \{1000, 2000, 4000\}$


                 d = 2               d = 3               d = 6
    n       quality    r/n      quality    r/n      quality    r/n
  1000       0.83      0.45      0.85      0.47      0.80      0.31
  2000       0.86      0.53      0.87      0.56      0.80      0.30
  4000       0.82      0.42      0.75      0.17      0.84      0.45

Table 6.3: Quality of matching and the number $r$ of applications of Rule-1 over $n$ in Karp-Sipser-H-R1, for hypergraphs corresponding to $X_{RF}$. Karp-Sipser obtains perfect matchings.

and $d \in \{2, 3, 6\}$, and show the quality of Karp-Sipser-H-R1 and the number of times that Rule-1 is applied, over $n$. We present the best result over 10 runs. As seen in Table 6.3, Karp-Sipser-H-R1 obtains matchings that are about 13–25% worse than Karp-Sipser-H. Furthermore, the larger the number of Rule-1 applications is, the higher the quality is.

On real-life tensors

We also evaluate the performance of the proposed heuristics on some real-life tensors selected from the FROSTT library [203]. The descriptions of the tensors are given in Table 6.4. For nips and uber, a dimension of size 17 and 24, respectively, is dropped, since it restricts the size of the maximum cardinality matching. As described before, a d-partite, d-uniform hypergraph is obtained from a d-dimensional tensor by keeping a vertex for each dimension index and a hyperedge for each nonzero. Unlike in the previous experiments, the parts of the hypergraphs obtained from the real-life tensors in Table 6.4 do not have an equal number of vertices. In this case, although the scaling algorithm works along the same lines, its output is slightly different. Let $n_i = |V_i|$ be the cardinality of the $i$th dimension and $n_{\max} = \max_{1 \le i \le d} n_i$ be the maximum one. By slightly modifying Sinkhorn-Knopp, for each iteration of Karp-Sipser-H-scaling we scale the tensor such that the marginals in dimension $i$ sum up to $n_{\max}/n_i$ instead of one. The results in Table 6.4 resemble those from the previous sections. Karp-Sipser-H-scaling has the best performance and is slightly superior to Karp-Sipser-H-mindegree. Greedy-H and Karp-Sipser-H are close to each other and, when it is feasible, Local-Search is better than them. We also see that in these instances Bipartite-reduction exhibits a good performance: its performance is at least as good as Karp-Sipser-H-scaling for the first three instances, but about 10% worse for the last one.


                                                      Local-   Karp-      Karp-Sipser-H-  Karp-Sipser-H-  Bipartite-
 Tensor   d   Dimensions                   Greedy-H   Search   Sipser-H   mindegree       scaling         Reduction
 Uber     3   183 × 1 140 × 1 717             183       183      183         183             183             183
 nips     3   2 482 × 2 862 × 14 036        1 847     1 991    1 839       2 005           2 007           2 007
 Nell-2   3   12 092 × 9 184 × 28 818       3 913     4 987    3 935       5 100           5 154           5 175
 Enron    4   6 066 × 5 699 × 244 268         875        -       875         988           1 001             898
              × 1 176

Table 6.4: Four real-life tensors and the performance of the proposed heuristics on the corresponding hypergraphs. There is no result for Local-Search for Enron, as it is four-dimensional. Uber has 1 117 629 nonzeros, nips has 3 101 609 nonzeros, Nell-2 has 76 879 419 nonzeros, and Enron has 54 202 099 nonzeros.

Using an independent set solver

We compare Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree with the idea of reducing the maximum d-dimensional matching problem to that of finding an independent set in the line graph of the given hypergraph. We show that this transformation can obtain good results, but is restricted because line graphs can require too much space.

We use KaMIS [147] to find independent sets in graphs. KaMIS uses a plethora of reductions and a genetic algorithm in order to return high-cardinality independent sets. We use the default settings of KaMIS (where the execution time is limited to 600 seconds) and generate the line graphs with efficient sparse matrix–matrix multiplication routines. We run KaMIS, Greedy-H, Karp-Sipser-H-scaling, and Karp-Sipser-H-mindegree on a few hypergraphs from the previous tests. The results are summarized in Table 6.5. The run time of Greedy-H was less than one second in all instances. KaMIS operates in rounds, and we give the quality and the run time of the first round and of the final output. We note that KaMIS considers the time limit only after the first round has been completed. As can be seen, while the quality of KaMIS is always good, and in most cases superior to Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, it is also significantly slower (its principle is to deliver high-quality results). We also observe that the pseudo scaling of Karp-Sipser-H-mindegree indeed helps to reduce the run time compared to Karp-Sipser-H-scaling.

The line graphs of the real-life instances from Table 6.4 are too large to be handled. We estimate, using known techniques [50], the number of edges in these graphs to range from 1.5 × 10^10 to 4.7 × 10^13, which translates to 126 GB to 380 TB of storage if edges are stored twice, using 4 bytes per edge.


                           line graph   KaMIS Round 1   KaMIS Output    Greedy-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
hypergraph                 gen. time    quality  time   quality  time   quality      quality  time           quality  time
8-out, n = 1000,  d = 3        10         0.98     80     0.99    600     0.86         0.98      1             0.98      1
8-out, n = 10000, d = 3       112         0.98    507     0.99    600     0.86         0.98    197             0.98      1
8-out, n = 1000,  d = 9       298         0.67    798     0.69    802     0.55         0.62      2             0.67      1
n = 8000, d = 3, H3             1         0.77     16     0.81    602     0.63         0.76      5             0.77      1
n = 8000, d = 3, H3+M           2         0.89     25     1.00    430     0.70         1.00     11             0.91      1

Table 6.5: Run time (in seconds) and performance comparisons between KaMIS, Greedy-H, and Karp-Sipser-H-scaling. The time required to create the line graphs should be added to KaMIS's overall time.

6.4 Summary, further notes and references

We have proposed heuristics for the maximum matching problem in d-partite, d-uniform hypergraphs by generalizing existing heuristics for the maximum cardinality matching in graphs. The experimental analysis on various hypergraphs (both random and corresponding to real-life tensors) shows the effectiveness and efficiency of the proposed heuristics.

This being a recent study, we have much future work. On the theoretical side, we plan to investigate the stated conjecture that d_{d−1}-out random hypergraphs have perfect matchings almost always, and to analyze the theoretical guarantees of the proposed algorithms. On the practical side, we want to test the use of the proposed hypergraph matching heuristics in the coarsening phase of multilevel hypergraph partitioning tools. For this, we need to adapt the proposed heuristics to the general hypergraph case (not necessarily d-partite and d-uniform). Another future work is the investigation of the weighted hypergraph matching problem, and the generalization and use of the proposed heuristics for this problem. Solvers for this problem are computational tools used in the multiple network alignment problem [179], where a matching among the vertices of more than two graphs is computed.

Part III

Ordering matrices and tensors


Chapter 7

Elimination trees and height-reducing orderings

In this chapter, we investigate the minimum height elimination tree problem described in Section 1.1. The formal problem definition is repeated below for convenience.

Minimum height elimination tree: Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The standard elimination tree [200] has been used to expose parallelism in sparse Cholesky, LU, and QR factorizations [5, 8, 116, 160]. Roughly, a set of vertices without ancestor/descendant relations corresponds to a set of independent tasks that can be performed in parallel. Therefore, the total number of parallel steps, or the critical path length, is equal to the height of the tree on an unbounded number of processors [162, 214]. Obtaining an elimination tree with the minimum height for a given matrix is NP-complete [193]. Therefore, heuristic approaches are used. One set of heuristic approaches is to content oneself with the graph partitioning based methods. These methods reduce some other important cost metrics in sparse Cholesky factorization, such as the fill-in and the operation count, while giving a shallow depth elimination tree [103]. When the matrix is unsymmetric, the elimination tree for LU factorization [81] is useful to expose parallelism. In this respect, the height of the tree again corresponds to the number of parallel steps or the critical path length for certain factorization schemes. Here, we develop heuristics to reduce the height of elimination trees for unsymmetric matrices.

Consider the directed graph G_d obtained by replacing every edge of a given undirected graph G by two directed edges pointing in opposite directions. Then, the cycle-rank (or the minimum height of an elimination tree) of G_d is closely related to the undirected graph parameter tree-depth [180, p. 128], which corresponds to the minimum height of an elimination tree for a symmetric matrix. One can exploit this correspondence in the reverse sense in an attempt to reduce the height of the elimination tree while producing an ordering for the matrix. That is, one can use the standard undirected graph model corresponding to the matrix A + A^T, where the addition is pattern-wise. This approach of producing an undirected graph for a given directed graph by removing the direction of each edge is used in solvers such as MUMPS [8] and SuperLU [63]. This way, the state of the art graph partitioning based ordering methods can be used. We will compare the proposed heuristics with such methods.

The height of the elimination tree, or the critical path length, is not the only metric for efficient parallel LU factorization of unsymmetric matrices. Depending on the factorization scheme, such as Gaussian elimination with static pivoting (implemented in GESP [157]) or multifrontal based solvers (implemented, for example, in WSMP [104]), other metrics come into play (such as the fill-in, the size of blocking in GESP based solvers, e.g., SuperLU_MT [64], the maximum/minimum size of fronts for parallelism in multifrontal solvers, or the communication). Nonetheless, the elimination tree helps to give an upper bound on the performance of the suitable parallel LU factorization schemes under a fairly standard PRAM model (assuming unit size computations, an infinite number of processors, and no memory/communication or scheduling overhead). Furthermore, our overall approach is based on obtaining a desirable form for LU factorization, called bordered block triangular form (or BBT form for short) [81]. This form simplifies many algorithmic details of different solvers [81, 82] and is likely to control the fill-in, as the nonzeros are constrained to be in the blocks of a BBT form. If one orders a matrix without a BBT form and then obtains this form, the fill-in can increase or decrease [83]. Hence, our BBT based approach is likely to be helpful for LU factorization schemes exploiting the BBT form as well, in controlling the fill-in.

7.1 Background

We define G_s = (V, E_s) as the undirected graph associated with the directed graph G, such that E_s = {{v_i, v_j} : (v_i, v_j) ∈ E}. A directed graph G is called complete if (u, v) ∈ E for all vertex pairs (u, v).

Let T = (V, r, ρ) be a rooted tree. An ordering σ : V ↔ {1, ..., n} is a bijective function. An ordering σ is called topological if σ(ρ(v_j)) > σ(v_j) for all v_j ∈ V. Finally, a postordering is a special form of topological ordering in which the vertices of each subtree are ordered consecutively.

Let A be an n × n unsymmetric matrix with m off-diagonal nonzero entries. The standard directed graph G(A) = (V, E) of A consists of a vertex set V = {1, ..., n} and an edge set E = {(i, j) : a_ij ≠ 0} of m edges. We adopt G_A instead of G(A) when a subgraph notion is required.

[Figure 7.1: A sample 9 × 9 unsymmetric matrix A (a), the corresponding directed graph G(A) (b), and the elimination tree T(A) (c).]

An elimination digraph G_k(A) = (V_k, E_k) is defined for 0 ≤ k < n, with the vertex set V_k = {k + 1, ..., n} and the edge set

    E_k = E                                                        if k = 0,
    E_k = E_{k−1} ∪ {(i, j) : (i, k), (k, j) ∈ E_{k−1}}             if k > 0.

We also define V^k = {1, ..., k} for each 0 ≤ k < n.

We assume that A is irreducible and that the LU-factorization A = LU exists, where L and U are unit lower and upper triangular, respectively. Since the factors L and U are triangular, their standard directed graph models G(L) and G(U) are acyclic. Eisenstat and Liu [81] define the elimination tree T(A) as follows. Let

    Parent(i) = min{ j : j > i and j ⟹_{G(L)} i ⟹_{G(U)} j },

where a ⟹_G b means that there is a path from a to b in the graph G; then Parent(i) corresponds to the parent of vertex i in T(A) for i < n, and for the root n we put Parent(n) = ∞.

We now illustrate some of the definitions. Figure 7.1a displays a sample matrix A with 9 rows/columns and 26 nonzeros. Figure 7.1b shows the standard directed graph representation G(A) of this matrix. Upon elimination of vertex 1, the three edges (3, 2), (6, 2), and (7, 2) are added, and vertex 1 is removed to build the elimination digraph G_1(A). On the right side, Figure 7.1c shows the elimination tree T(A), whose height is h(T(A)) = 5. Here, the parent of vertex 1 is 3, as G_A[{1, 2}] is not strongly connected but G_A[{1, 2, 3}] is (thanks to the cycle 1 → 2 → 3 → 1). The cross edges of G(A) with respect to the elimination tree T(A) are represented by dashed


directed arrows in Figure 7.1c. The initial order is topological but not postordered. However, 8, 1, 2, 3, 5, 4, 6, 7, 9 is a postordering of T(A), and it does not respect the cross edges.

7.2 The critical path length and the elimination tree height

We summarize the rowwise and columnwise LU factorization schemes, also known as the ikj and jik variants [70], here to be referred to as row-LU and column-LU, respectively. Then, we show that the critical path lengths for the parallel row-LU and column-LU factorization schemes are equivalent to the elimination tree height whenever the matrix has a certain form.

Let A be an n × n irreducible unsymmetric matrix and A = LU, where L and U are unit lower and upper triangular, respectively. Algorithms 13 and 14 display the row-LU and column-LU factorization schemes, respectively. Both schemes update the given matrix A so that the U = [u_ij]_{i≤j} factor is formed by the upper triangular (including the diagonal) entries of the matrix A at the end. The L = [ℓ_ij]_{i≥j} factor is formed by the multipliers (ℓ_ik = a_ik/a_kk for i > k) and a unit diagonal. We assume a coarse-grain parallelization approach [122, 160]. In Algorithm 13, task i is defined as the task of computing row i of L and U, and is denoted as T_row(i). Similarly, in Algorithm 14, task j is defined as the task of computing column j of both the L and U factors, and is denoted as T_col(j). The dependencies between the tasks can be represented with the directed acyclic graphs of L^T and U for the row-LU and column-LU factorizations of A, respectively.

Algorithm 13: Row-LU(A)
1: for i = 1 to n do
2:   ▷ task T_row(i)
3:   for each k < i such that a_ik ≠ 0 do
4:     for each j > k such that a_kj ≠ 0 do
5:       a_ij ← a_ij − a_kj × (a_ik / a_kk)

Algorithm 14: Column-LU(A)
1: for j = 1 to n do
2:   ▷ task T_col(j)
3:   for each k < j such that a_kj ≠ 0 do
4:     for each i > k such that a_ik ≠ 0 do
5:       a_ij ← a_ij − a_ik × (a_kj / a_kk)
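As an illustration of the update pattern behind Algorithm 13, the following is a minimal dense sketch of the ikj (row-LU) scheme; it is not the sparse implementation discussed in the text, and it assumes that no pivoting is needed.

/* Sketch of the row-LU (ikj) scheme of Algorithm 13 on a dense n-by-n
 * matrix stored row-major in a[]; the sparse row-LU only visits the
 * nonzero a_ik and a_kj. After the loops, the strict lower triangle of a
 * holds the multipliers (L) and the upper triangle holds U. */
void row_lu(int n, double *a) {
    for (int i = 0; i < n; i++) {        /* task T_row(i): compute row i of L and U */
        for (int k = 0; k < i; k++) {    /* rows k < i must already be factored   */
            double lik = a[i * n + k] / a[k * n + k];
            a[i * n + k] = lik;          /* multiplier, an entry of L */
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= lik * a[k * n + j];
        }
    }
}

Task T_row(i) only needs the rows k < i that have already been computed, which is exactly the dependency structure captured by G(L^T).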


[Figure 7.2: The filled matrix L + U (a), the task dependency graph G(L^T) for row-LU (b), and the task dependency graph G(U) for column-LU (c).]

Figure 7.2 illustrates the mentioned dependencies. Figure 7.2a shows the matrix L + U corresponding to the factors of the sample matrix A given in Figure 7.1a. In this figure, the blue hexagons and the green circles are the fill-ins in L and U, respectively. Subsequently, Figures 7.2b and 7.2c show the task dependency graphs for the row-LU and column-LU factorizations, respectively. The blue dashed directed edges in Figure 7.2b and the green dashed directed edges in Figure 7.2c correspond to the fill-ins. All edges in Figures 7.2b and 7.2c show dependencies. For example, in the column-LU, the second column depends on the first one, as the values in the first column of L are needed while computing the second column. This is shown in Figure 7.2c with a directed edge from the first vertex to the second one.

We now consider postorderings of the elimination tree T(A). Let T[k] denote the set of vertices in the subtree of T(A) rooted at k. For a postordering, the order of siblings is not specified in the general sense. A postordering is called upper bordered block triangular (BBT) if the topological order among the sibling vertices is respected, that is, a vertex k is numbered before a sibling vertex ℓ if there is an edge in G(A) that is directed from T[k] to T[ℓ]. Similarly, we call a postordering a lower BBT if the sibling vertices are ordered in reverse topological order. As an example, the initial order of the matrix given in Figure 7.1 is neither upper nor lower BBT; however, the order 8, 6, 1, 2, 3, 5, 4, 7, 9 would be an upper BBT postordering (see Figure 7.1c). The following theorem shows the relation between a BBT postordering and the task dependencies in LU factorizations.

Theorem 7.1 ([81]). Assuming the edges are directed from child to parent, T(A) is the transitive reduction of G(L^T) and G(U) when A is upper and lower BBT ordered, respectively.

We follow this theorem with a corollary, which provides the motivation for reducing the elimination tree height.

Corollary 7.1. The critical path length of the task dependency graph is equal to the elimination tree height h(T(A)) for row-LU factorization when A is upper BBT ordered, and for column-LU factorization when A is lower BBT ordered.

[Figure 7.3: The filled matrix L + U of the matrix Ã = LU in BBT form (a) and the task dependency graph G(L^T) for row-LU (b).]

Figure 7.3 demonstrates the discussed points on a matrix Ã, which is the permuted A of Figure 7.1a with the upper BBT postordering 8, 6, 1, 2, 3, 5, 4, 7, 9. Figures 7.3a and 7.3b show the filled matrix L + U such that Ã = LU, and the corresponding task dependency graph G(L^T) for row-LU factorization, respectively. As seen in Figure 7.3b, there are only tree (solid) and back (dashed) edges with respect to the elimination tree T(Ã), due to the fact that T(Ã) is the transitive reduction of G(L^T) (see Theorem 7.1). In the figure, the blue edges refer to fill-in nonzeros. As seen in the figure, there is a path from each vertex to the root, and a critical path 1 → 3 → 5 → 7 → 9 has length 5, which coincides with the height of T(Ã).

7.3 Reducing the elimination tree height

We propose a top-down approach to reorder a given matrix, leading to a short elimination tree. The main lines of the proposed approach form a generalization of the state of the art methods used in Cholesky factorization (which are based on nested dissection [94]). We give the algorithms and discussions for the upper BBT form; they can be easily adapted for the lower BBT form. The proofs of Proposition 7.1 and Theorems 7.2–7.4 are omitted from the presentation and can be found in the original paper [136].


7.3.1 Theoretical basis

Let A be an n × n sparse unsymmetric irreducible matrix and T(A) be its elimination tree. In this case, a BBT decomposition of A can be represented as a permutation of A with a permutation matrix P such that

            ⎡ A_11  A_12  ⋯  A_1K  A_1B ⎤
            ⎢       A_22  ⋯  A_2K  A_2B ⎥
    A_BBT = ⎢              ⋱   ⋮     ⋮  ⎥        (7.1)
            ⎢                 A_KK  A_KB ⎥
            ⎣ A_B1  A_B2  ⋯  A_BK  A_BB ⎦

where A_BBT = PAP^T, the number of diagonal blocks K > 1, and A_kk is irreducible for each 1 ≤ k ≤ K. The border A_B∗ is called minimal if there is no permutation matrix P′ ≠ P such that P′AP′^T is a BBT decomposition and A′_B∗ ⊂ A_B∗. That is, the border A_B∗ is minimal if we cannot remove columns from A_B∗ and the corresponding rows from A_B∗, permute them together to a diagonal block, and still have a BBT form. We give the following proposition, aimed to be used for Theorem 7.2.

Proposition 7.1. Let A be in upper BBT form and the border A_B∗ be minimal. The elimination digraph G_κ(A) is complete, where κ = Σ_{k=1}^{K} |A_kk|.

The following theorem provides a basis for reducing the elimination tree height h(T(A)).

Theorem 7.2. Let A be in upper BBT form and the border A_B∗ be minimal. Then

    h(T(A)) = |A_B∗| + max_{1≤k≤K} h(T(A_kk)).

The theorem says that the critical path length in the parallel LU factorization methods summarized in Section 7.2 can be expressed recursively in terms of the border and the critical path length of a block with the maximum size. This holds for the row-LU scheme when the matrix A is in an upper BBT form, and for the column-LU scheme when A is in a lower BBT form. Hence, having a small border size and diagonal blocks of similar sizes is likely to lead to a reduced critical path length.

7.3.2 Recursive approach: Permute

We propose a recursive approach to reorder a given matrix so that the elimination tree height is reduced. Algorithm 15 gives an overview of the solution framework. The main procedure, Permute, takes an irreducible unsymmetric matrix A as its input. It calls PermuteBBT to decompose A into a BBT form A_BBT. Upon obtaining such a decomposition, the main procedure calls itself on each diagonal block A_kk recursively. Whenever the


matrix becomes sufficiently small (its size gets smaller than τ), the procedure PermuteBT is called in order to compute a fine BBT decomposition in which the diagonal blocks are of unit size, here to be referred to as a bordered triangular (BT) decomposition, and the recursion terminates.

Algorithm 15: Permute(A)
1: if |A| < τ then                      ▷ recursion terminates
2:   PermuteBT(A)
3: else
4:   A_BBT ← PermuteBBT(A)              ▷ as given in (7.1)
5:   for k ← 1 to K do
6:     Permute(A_kk)                    ▷ subblocks of A_BBT
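The recursion of Algorithm 15 can be pictured with the following C-like skeleton. It is only a sketch of the control flow: the Matrix type and the helpers permute_bbt and permute_bt are hypothetical placeholders, not the actual implementation.

/* Sketch of the recursive framework of Algorithm 15; types and helpers are hypothetical. */
#define TAU 50   /* recursion termination threshold used later in the experiments */

typedef struct Matrix Matrix;            /* an irreducible sparse block           */
extern int  matrix_size(const Matrix *A);
extern int  permute_bbt(Matrix *A, Matrix ***blocks);  /* returns K, fills the diagonal blocks */
extern void permute_bt(Matrix *A);       /* feedback-vertex-set based BT form      */

void permute(Matrix *A) {
    if (matrix_size(A) < TAU) {          /* small enough: bordered triangular form */
        permute_bt(A);
        return;
    }
    Matrix **blocks;                     /* diagonal blocks A_kk of the BBT form   */
    int K = permute_bbt(A, &blocks);
    for (int k = 0; k < K; k++)          /* recurse independently on each block    */
        permute(blocks[k]);
}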

7.3.3 BBT decomposition: PermuteBBT

Permuting A into a BBT form translates into finding a strong (vertex) separator of G(A), where the strong separator itself corresponds to the border A_B∗. Regarding Theorem 7.2, we are particularly interested in finding minimal borders.

Let G = (V, E) be a strongly connected directed graph. A strong separator (also called a directed vertex separator [138]) is defined as a vertex set S ⊂ V such that G − S is not strongly connected. Moreover, S is said to be minimal if G − S′ is strongly connected for any proper subset S′ ⊂ S.

The algorithm that we propose to find a strong separator is based on bipartitioning of directed graphs. In the case of undirected graphs, there are typically two kinds of graph partitioning methods, which differ by their separators: edge separators and vertex separators. An edge (or vertex) separator is a set of edges (or vertices) whose removal leaves the graph disconnected.

A BBT decomposition of a matrix may have three or more subblocks (as seen in (7.1)). However, the algorithms we use to find a strong separator utilize 2-way partitioning, which in turn constructs two parts, each of which may potentially have several components. For the sake of simplicity in the presentation, we assume that both parts are strongly connected.

We introduce some notation in order to ease the discussion. For a pair of vertex subsets (U, W) of a directed graph G, we write U → W if there may be some edges from U to W but there is no edge from W to U. Besides, U ↔ W and U ↮ W are used in the more natural way, that is, U ↔ W implies that there may be edges in both directions, whereas U ↮ W refers to the absence of edges between U and W.

We cast our problem of finding strong separators as follows. For a given directed graph G, find a vertex partition Π∗ = {V_1, V_2, S}, where S refers to a strong separator and V_1, V_2 refer to vertex parts, so that S has small cardinality, the part sizes are close to each other, and either V_1 → V_2 or V_2 → V_1.

[Figure 7.4: A sample 2-way partition Π_ES = {V_1, V_2} by edge separator (a), the bipartite graphs and covers for the directed cuts E_{2→1} and E_{1→2} (b), and the matrix view of Π_ES (c).]

7.3.4 Edge-separator-based method

The first step of the method is to obtain a 2-way partition Π_ES = {V_1, V_2} of G by an edge separator. We build two strong separators upon Π_ES and choose the one with the smaller cardinality.

For an edge separator Π_ES = {V_1, V_2}, the directed cut E_{i→j} is defined as the set of edges directed from V_i to V_j, i.e., E_{i→j} = {(u, v) ∈ E : u ∈ V_i, v ∈ V_j}, and we say that a vertex set S covers E_{i→j} if for any edge (u, v) ∈ E_{i→j}, either u ∈ S or v ∈ S, for i ≠ j ∈ {1, 2}.

Initially, we have V_1 ↔ V_2. Our goal is to find a subset of vertices S ⊂ V so that either V̄_1 → V̄_2 or V̄_2 → V̄_1, where V̄_1 and V̄_2 are the sets of remaining vertices in V_1 and V_2 after the removal of the vertices in S, that is, V̄_1 = V_1 − S and V̄_2 = V_2 − S. The following theorem serves the purpose of finding such a subset S.

Theorem 7.3. Let G = (V, E) be a directed graph, Π_ES = {V_1, V_2} be an edge separator of G, and S ⊆ V. Then V̄_i → V̄_j if and only if S covers E_{j→i}, for i ≠ j ∈ {1, 2}.

The problem of finding a strong separator S can be encoded as finding a minimum vertex cover in the bipartite graph whose edge set is populated only by the edges of E_{j→i}. Algorithm 16 gives the pseudocode. As seen in the algorithm, we find two separators, S_{1→2} and S_{2→1}, which cover E_{2→1} and E_{1→2} and result in V̄_1 → V̄_2 and V̄_2 → V̄_1, respectively. At the end, we take the strong separator with the smaller cardinality.
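The cover computation can be done with any bipartite matching code. The sketch below, written for illustration and not taken from the actual implementation, computes a minimum vertex cover of the bipartite graph of one directed cut by first finding a maximum matching with augmenting paths (Kuhn's algorithm) and then applying König's theorem; the resulting cover is a candidate strong separator.

/* Sketch: minimum vertex cover of the bipartite graph induced by a
 * directed cut, as needed by FindMinCover in Algorithm 16. The nl left
 * vertices are the tails and the nr right vertices the heads of the cut
 * edges; adj_ptr/adj give the right neighbors of each left vertex in a
 * CSR-like layout. Koenig's construction: the cover is (left vertices not
 * reachable) plus (right vertices reachable) by alternating paths from the
 * unmatched left vertices. */
#include <stdlib.h>
#include <string.h>

static int augment(int u, const int *adj_ptr, const int *adj,
                   int *match_r, char *seen_r) {
    for (int p = adj_ptr[u]; p < adj_ptr[u + 1]; p++) {
        int v = adj[p];
        if (seen_r[v]) continue;
        seen_r[v] = 1;
        if (match_r[v] < 0 || augment(match_r[v], adj_ptr, adj, match_r, seen_r)) {
            match_r[v] = u;
            return 1;
        }
    }
    return 0;
}

static void alt_dfs(int u, const int *adj_ptr, const int *adj,
                    const int *match_r, char *reach_l, char *reach_r) {
    reach_l[u] = 1;
    for (int p = adj_ptr[u]; p < adj_ptr[u + 1]; p++) {
        int v = adj[p];
        if (reach_r[v]) continue;
        reach_r[v] = 1;                       /* reached through a non-matching edge */
        if (match_r[v] >= 0 && !reach_l[match_r[v]])
            alt_dfs(match_r[v], adj_ptr, adj, match_r, reach_l, reach_r);
    }
}

long min_vertex_cover(int nl, int nr, const int *adj_ptr, const int *adj,
                      char *in_cover_l, char *in_cover_r) {
    int  *match_r   = malloc(nr * sizeof(int));
    char *seen_r    = malloc(nr);
    char *reach_l   = calloc(nl, 1);
    char *reach_r   = calloc(nr, 1);
    char *matched_l = calloc(nl, 1);
    long  size = 0;
    for (int v = 0; v < nr; v++) match_r[v] = -1;
    for (int u = 0; u < nl; u++) {            /* Kuhn's maximum matching */
        memset(seen_r, 0, nr);
        augment(u, adj_ptr, adj, match_r, seen_r);
    }
    for (int v = 0; v < nr; v++)
        if (match_r[v] >= 0) matched_l[match_r[v]] = 1;
    for (int u = 0; u < nl; u++)              /* alternating reachability */
        if (!matched_l[u]) alt_dfs(u, adj_ptr, adj, match_r, reach_l, reach_r);
    for (int u = 0; u < nl; u++) { in_cover_l[u] = !reach_l[u]; size += in_cover_l[u]; }
    for (int v = 0; v < nr; v++) { in_cover_r[v] = reach_r[v];  size += in_cover_r[v]; }
    free(match_r); free(seen_r); free(reach_l); free(reach_r); free(matched_l);
    return size;
}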

In this method, how the edge separator is found matters.


Algorithm 16: FindStrongVertexSeparatorES(G)
1: Π_ES ← GraphBipartiteES(G)                      ▷ Π_ES = {V_1, V_2}
2: Π_{1→2} ← FindMinCover(G, E_{2→1})
3: Π_{2→1} ← FindMinCover(G, E_{1→2})
4: if |S_{1→2}| < |S_{2→1}| then
5:   return Π_{1→2} = {V̄_1, V̄_2, S_{1→2}}          ▷ V̄_1 → V̄_2
6: else
7:   return Π_{2→1} = {V̄_1, V̄_2, S_{2→1}}          ▷ V̄_2 → V̄_1

The objective for Π_ES is to minimize the cutsize Ψ(Π_ES), which is typically defined over the cut edges as follows:

    Ψ_es(Π_ES) = |E_c| = |{(u, v) ∈ E : π(u) ≠ π(v)}|,        (7.2)

where u ∈ V_{π(u)} and v ∈ V_{π(v)}, and E_c refers to the set of cut edges. We note that the use of the above metric in Algorithm 16 is a generalization of a method used for symmetric matrices [161]. However, Ψ_es(Π_ES) is loosely related to the size of the cover found in order to build the strong separator. A more suitable metric would be the number of vertices from which a cut edge emanates, or the number of vertices to which a cut edge is directed. Such a cutsize metric can be formalized as follows:

    Ψ_cn(Π_ES) = |V_c| = |{v ∈ V : ∃(u, v) ∈ E_c}|.        (7.3)

The following theorem shows the close relation between Ψ_cn(Π_ES) and the size of the strong separator S.

Theorem 7.4. Let G = (V, E) be a directed graph, Π_ES = {V_1, V_2} be an edge separator of G, and Π∗ = {V̄_1, V̄_2, S} be the strong separator built upon Π_ES. Then |S| ≤ Ψ_cn(Π_ES)/2.

Apart from the theorem, we also note that optimizing (7.3) is likely to yield bipartite graphs that are easier to cover than those resulting from optimizing (7.2). This is because all edges emanating from a single vertex or from many different vertices are considered as the same with (7.2), whereas those emanating from a single vertex are preferred in (7.3).

We note that the cutsize metric given in (7.3) can be modeled using hypergraphs. The hypergraph that models this cutsize metric (called the column-net hypergraph model [39]) has the same vertex set as the directed graph G = (V, E), and a hyperedge for each vertex v ∈ V that connects all u ∈ V such that (u, v) ∈ E. Any partition of the vertices of this hypergraph that minimizes the number of hyperedges in the cut induces a vertex partition on G with an edge separator that minimizes the cutsize according to the metric (7.3). A similar statement holds for another hypergraph constructed by reversing the edge directions (called the row-net model [39]).
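Building the column-net hypergraph of a directed graph amounts to transposing its adjacency structure: the net of vertex v gathers the tails of all edges entering v. Below is a small sketch under the assumption that the graph is given in CSR form with 0-based indices; the names are illustrative.

/* Sketch: column-net hypergraph of a directed graph in CSR form.
 * Net v contains every u with (u, v) in E; this is exactly the transposed
 * CSR (i.e., CSC) structure, so building the nets is a transposition. */
#include <stdlib.h>

void column_net_hypergraph(int n, const int *xadj, const int *adjncy,
                           int **net_ptr_out, int **net_ind_out) {
    int m = xadj[n];
    int *net_ptr = calloc(n + 1, sizeof(int));
    int *net_ind = malloc(m * sizeof(int));
    for (int p = 0; p < m; p++)              /* count incoming edges per net */
        net_ptr[adjncy[p] + 1]++;
    for (int v = 0; v < n; v++)              /* prefix sums give net starts  */
        net_ptr[v + 1] += net_ptr[v];
    for (int u = 0; u < n; u++)              /* scatter tails into the nets  */
        for (int p = xadj[u]; p < xadj[u + 1]; p++) {
            int v = adjncy[p];
            net_ind[net_ptr[v]++] = u;
        }
    for (int v = n; v > 0; v--)              /* restore the net pointers     */
        net_ptr[v] = net_ptr[v - 1];
    net_ptr[0] = 0;
    *net_ptr_out = net_ptr;
    *net_ind_out = net_ind;
}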


[Figure 7.5: Π_{2→1}, the strong separator built upon S_{2→1}, the cover of E_{1→2} given in Figure 7.4b (a); the matrix A_BBT obtained by Π_{2→1} (b); and the elimination tree T(A_BBT) (c).]

Figures 7.4 and 7.5 give an example of finding a strong separator based on an edge separator. In Figure 7.4a, the red curve implies the edge separator Π_ES = {V_1, V_2}, such that V_1 = {4, 6, 7, 8, 9}, the green vertices, and V_2 = {1, 2, 3, 5}, the blue vertices. The cut edges of E_{2→1} = {(2, 7), (5, 4), (5, 9)} and E_{1→2} = {(6, 1), (6, 3), (7, 1)} are given in blue and green, respectively. We see two covers, for the bipartite graphs corresponding to each of the directed cuts E_{2→1} and E_{1→2}, in Figure 7.4b. As seen in Figure 7.4c, these directed cuts refer to nonzeros of the off-diagonal blocks in the matrix view. Figure 7.5a gives the strong separator Π_{2→1}, built upon Π_ES by the cover S_{2→1} = {1, 6} of the bipartite graph corresponding to E_{1→2}, which is given on the right of Figure 7.4b. Figures 7.5b and 7.5c depict the matrix view of the strong separator Π_{2→1} and the corresponding elimination tree, respectively. It is worth mentioning that, in the matrix view, V_2 precedes V_1, since we use Π_{2→1} instead of Π_{1→2}. As seen in Figure 7.5c, since the permuted matrix (in Figure 7.5b) has an upper BBT postordering, all cross edges of the elimination tree are headed from left to right.

7.3.5 Vertex-separator-based method

A vertex separator is a strong separator restricted so that V_1 ↮ V_2. This method is based on refining the separator so that either V_1 → V_2 or V_2 → V_1. The vertex separator can be viewed as a 3-way vertex partition Π_VS = {V_1, V_2, S}. As in the previous method based on edge separators, we build two strong separators upon Π_VS and select the one having the smaller strong separator.

The criterion used to find an initial vertex separator can be formalized as

    Ψ_vs(Π_VS) = |S|.        (7.4)

Algorithm 17 gives the overall process to obtain a strong separator. This algorithm follows Algorithm 16 closely. Here, the procedure that refines the initial vertex separator, namely RefineSeparator, works as follows. It visits the separator vertices once and moves the vertex at hand, if possible, to V_1 or V_2, so that V_i → V_j after the movement, where i → j is specified as either 1 → 2 or 2 → 1.

Algorithm 17: FindStrongVertexSeparatorVS(G)
1: Π_VS ← GraphBipartiteVS(G)                      ▷ Π_VS = {V_1, V_2, S}
2: Π_{1→2} ← RefineSeparator(G, Π_VS, 1→2)
3: Π_{2→1} ← RefineSeparator(G, Π_VS, 2→1)
4: if |S_{1→2}| < |S_{2→1}| then
5:   return Π_{1→2} = {V_1, V_2, S_{1→2}}           ▷ V_1 → V_2
6: else
7:   return Π_{2→1} = {V_1, V_2, S_{2→1}}           ▷ V_2 → V_1

7.3.6 BT decomposition: PermuteBT

The problem of permuting A into a BT form can be decoded as finding a feedback vertex set, which is defined as a subset of vertices whose removal makes a directed graph acyclic. In the matrix view, a feedback vertex set corresponds to the border A_B∗ when the matrix is permuted into a BBT form (7.1) in which each A_ii, for i = 1, ..., K, is of unit size.

The problem of finding a feedback vertex set of size less than a given value is NP-complete [126], but fixed-parameter tractable [44]. In the literature, there are greedy algorithms that perform well in practice [152, 184]. In this work, we adopt the algorithm proposed by Levy and Low [152].

For example, in Figure 7.5, we observe that the sets {5} and {8} are used as the feedback vertex sets for V_1 and V_2, respectively, which is the reason those rows/columns are permuted last in their corresponding subblocks.

7.4 Experiments

We investigate the performance of the proposed heuristic, here to be referred to as BBT, in terms of the elimination tree height and the ordering time. We have three variants of BBT: (i) BBT-es, based on minimizing the edge cut (7.2); (ii) BBT-cn, based on minimizing the hyperedge cut (7.3); (iii) BBT-vs, based on minimizing the vertex separator (7.4). As a baseline, we used MeTiS [130] on the symmetrized matrix A + A^T. MeTiS uses state of the art methods to order a symmetric matrix and leads to short elimination trees. Its use in the unsymmetric case on A + A^T is a common practice. BBT is implemented in C and compiled with mex of MATLAB, which uses gcc 4.4.5. We used the MeTiS library for graph partitioning according to the cutsize metrics (7.2) and (7.4). For the edge cut metric (7.2), we input to MeTiS the graph of A + A^T, where the weight of the edge {i, j} is 2 (both a_ij, a_ji ≠ 0) or 1 (either a_ij ≠ 0 or a_ji ≠ 0, but not both). For the vertex separator metric (7.4), we input the graph of A + A^T (no weights) to the corresponding MeTiS procedure. We implemented the cutsize metric given in (7.3) using PaToH [40] (in the default setting) for hypergraph partitioning with the column-net model (we did not investigate the use of the row-net model). We performed the experiments on an AMD Opteron Processor 8356 with 32 GB of RAM.

In the experiments, for each matrix, we detected dense rows and columns, where a row/column is deemed to be dense if it has at least 10√n nonzeros. We applied MeTiS and BBT on the submatrix of non-dense rows/columns and produced the final ordering by appending the dense rows/columns after the permutations obtained for the submatrices. The performance results are computed as the median of eleven runs.
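A hedged sketch of this preprocessing step is given below for a CSR matrix; the 10√n threshold is the one stated above, while the function name and the output convention (a 0/1 marker per row) are illustrative.

/* Sketch: mark rows of an n-by-n CSR matrix as dense when they hold at
 * least 10*sqrt(n) nonzeros. Columns can be handled the same way on the
 * transposed (CSC) structure. */
#include <math.h>

int mark_dense_rows(int n, const int *row_ptr, char *is_dense) {
    double threshold = 10.0 * sqrt((double)n);
    int count = 0;
    for (int i = 0; i < n; i++) {
        int nnz_row = row_ptr[i + 1] - row_ptr[i];
        is_dense[i] = (nnz_row >= threshold);
        count += is_dense[i];
    }
    return count;   /* number of dense rows found */
}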

Recall from Section 7.3.2 that, whenever the size of the input matrix is smaller than the parameter τ, the recursion terminates in the BBT variants and a feedback-vertex-set-based BT decomposition is applied. We first investigate the effect of τ in order to suggest a concrete approach. For this purpose, we build a dataset of matrices, called UFL below. The matrices in UFL are from the SuiteSparse Matrix Collection [59] and are chosen with the following properties: square, 5K ≤ n ≤ 100K, nnz ≤ 1M, real, and not of type Graph. These criteria enable an automatic selection of matrices from different application domains without having to specify the matrices individually. Other criteria could be used; ours help to select a large set of matrices, each of which is not too small and not too large to be handled easily within MATLAB. We exclude matrices recorded as Graph in the SuiteSparse collection because most of these matrices have nonzeros from a small set of integers (for example, {−1, 1}) and are reducible. We further detect matrices with the same nonzero pattern in the UFL dataset and keep only one matrix per unique nonzero pattern. We then preprocess the matrices by applying MC64 [72] in order to permute the columns so that the permuted matrix has a maximum diagonal product. Then, we identify the irreducible blocks using the Dulmage-Mendelsohn decomposition [77], permute the matrices to block diagonal form, and delete all entries that lie in the off-diagonal blocks. This type of preprocessing is common in similar studies [83], and corresponds to preprocessing for numerics and reducibility. After this preprocessing, we discard the matrices whose total (remaining) number of nonzeros is less than twice its size, or whose pattern symmetry is greater than 90%, where the pattern symmetry score is defined as (nnz(A) − n)/(nnz(A + A^T) − n) − 0.5. We ended up with 99 matrices in the UFL dataset.

In Table 7.1, we summarize the performance of the BBT variants as normalized to that of MeTiS on the UFL dataset, for the recursion termination parameter τ ∈ {3, 50, 250}. The table presents the geometric means of the performance results over all matrices of the dataset.


          Tree Height (h(A))          Ordering Time
   τ      -es     -cn     -vs         -es     -cn     -vs
   3      0.81    0.69    0.68        1.68    3.53    2.58
   50     0.84    0.71    0.72        1.26    2.71    1.91
   250    0.93    0.82    0.86        1.12    2.12    1.43

Table 7.1: Statistical indicators of the performance of the BBT variants on the UFL dataset, normalized to that of MeTiS, with the recursion termination parameter τ ∈ {3, 50, 250}.

[Figure 7.6: BBT-vs / MeTiS performance on the UFL dataset: (a) normalized tree height h(A) and (b) normalized ordering time, both plotted against the pattern symmetry. Dashed green lines mark the geometric mean of the ratios.]

As seen in the table, for smaller values of τ, all BBT variants achieve shorter elimination trees, at the cost of an increased processing time to order the matrix. This is because of the fast, but probably not very high quality, local ordering heuristic: as discussed before, the heuristic here does not directly consider the height of the subtree corresponding to the submatrix as an optimization goal. Other τ values could also be tried, but we find τ = 50 to be a suitable choice in general, as it performs close to τ = 3 in quality and close to τ = 250 in efficiency.

From Table 7.1, we observe that the BBT-vs method performs in the middle of the other two BBT variants, but close to BBT-cn in terms of the tree height. All three variants improve the height of the tree with respect to MeTiS. Specifically, at τ = 50, BBT-es, BBT-cn, and BBT-vs obtain improvements of 16%, 29%, and 28%, respectively. The graph based methods BBT-es and BBT-vs run slower than MeTiS, as they contain the additional overheads of obtaining the covers and refining the separators of A + A^T for A. For example, at τ = 50, the overheads result in 26% and 91% longer ordering time for BBT-es and BBT-vs, respectively. As expected, the hypergraph based method BBT-cn is much slower than the others (for example, 171% longer ordering time with respect to MeTiS with τ = 50). Since we identify BBT-vs as fast and of high quality (it is close to the best BBT variant in terms of the tree height, and the fastest one in terms of the run time), we display its individual performance results in detail, with τ = 50, in Figure 7.6. In Figures 7.6a and 7.6b, each individual point represents the ratio BBT-vs / MeTiS in terms of the tree height and the ordering time, respectively. As seen in Figure 7.6a, BBT-vs reduces the elimination tree height on 85 out of 99 matrices in the UFL dataset, and achieves a reduction of 28% in terms of the geometric mean with respect to MeTiS (the green dashed line in the figure represents the geometric mean). We observe that the pattern symmetry and the reduction in the elimination tree height highly correlate after a certain pattern symmetry score. More precisely, the correlation coefficient between the pattern symmetry and the normalized tree height is 0.6187 for those matrices with pattern symmetry greater than 20%. This is of course in agreement with the fact that LU factorization methods exploiting the unsymmetric pattern have much to gain on matrices with a lower pattern symmetry score, and less to gain otherwise, compared with the alternatives. On the other hand, as seen in Figure 7.6b, BBT-vs performs consistently slower than MeTiS, with an average of 91% slowdown.

The fill-in generally increases [83] when a matrix is reordered into a BBT form, with respect to the original ordering using the elimination tree. We noticed that BBT-vs almost always increases the number of nonzeros in L (by 15.7% with respect to MeTiS in a set of matrices). However, the number of nonzeros in U almost always decreases (by 1.5% with respect to MeTiS in the same dataset). This observation may be exploited during Gaussian elimination where L is not stored.

7.5 Summary, further notes and references

We investigated the elimination tree model for unsymmetric matrices as a means of capturing dependencies among the tasks in the row-LU and column-LU factorization schemes. Specifically, we focused on permuting a given unsymmetric matrix to obtain an elimination tree with reduced height, in order to expose a higher degree of parallelism by minimizing the critical path length in parallel LU factorization schemes. Based on the theoretical findings, we proposed a heuristic which orders a given matrix to a bordered block diagonal form with a small border size and blocks of similar sizes, and then locally orders each block. We presented three variants of the proposed heuristic. These three variants achieved, on average, 24%, 31%, and 35% improvements with respect to a common method of using MeTiS (a state of the art tool used for the Cholesky factorization) on a small set of matrices. On a larger set of matrices, the performance improvements were 16%, 28%, and 29%. The best performing variant is about 2.71 times slower than MeTiS on the larger set, while the others are slower by factors of 1.26 and 1.91.


Although the numerical methods implementing the LU factorization with the help of the standard elimination tree are well developed, those implementing the LU factorization using the elimination tree discussed in this chapter are lacking. A potential implementation will have to handle many scheduling issues, and therefore the use of a runtime system seems adequate. On the combinatorial side, we believe that there is a need for a direct method to find strong separators, which would give better performance in terms of the quality and the run time.

Chapter 8

Sparse tensor ordering

In this chapter, we investigate the tensor ordering problem described in Section 1.5. The formal problem definition is repeated below for convenience.

Tensor ordering: Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

We propose two ordering algorithms for this problem. The first proposed heuristic, BFS-MCS, is a breadth first search (BFS)-like approach based on the maximum cardinality search family, and works on the hypergraph representation of the given tensor. The second proposed heuristic, Lexi-Order, is an extension of the doubly lexical ordering of matrices to tensors. We show the effects of these schemes on the MTTKRP performance when the tensors are stored in three existing sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). The associated paper [156] has further contributions with respect to parallelization, and includes results on parallel implementations of MTTKRP with the three formats, as well as runs with Candecomp/Parafac decomposition algorithms.

8.1 Three sparse tensor storage formats

We consider three state-of-the-art tensor formats (COO, CSF, and HiCOO), which are all for general unstructured sparse tensors.

Coordinate format (COO)

The coordinate (COO) format is the simplest, yet arguably the most popular format by far. It stores each nonzero value along with all of its position indices, as shown in Figure 8.1a; i, j, k are the indices (inds) of the nonzeros stored in the val array.



[Figure 8.1: Three common formats for sparse tensors: (a) COO, (b) CSF, (c) HiCOO (the example is originally from [155]).]

Compressed sparse fibers (CSF)

Compressed Sparse Fiber (CSF) is a hierarchical, fiber-centric format that effectively generalizes the CSR matrix format to tensors. An example of its representation appears in Figure 8.1b. Conceptually, CSF organizes the nonzero indices into a tree: each level corresponds to a tensor mode, and each nonzero is a path from the root to a leaf. The indices of CSF are stored in cinds, with the pointers stored in cptrs to indicate the locations of the nonzeros at the next level. Since cptrs needs to show the range up to nnz, we use β_int or β_long accordingly for differently sized tensors.

Hierarchical coordinate format (HiCOO)

The Hierarchical Coordinate (HiCOO) [155] format derives from the COO format, but improves upon it by compressing the indices in units of sparse tensor blocks. HiCOO stores a sparse tensor in a sparse-blocked pattern with a pre-specified block size B, meaning in B × ··· × B blocks. It represents every block by compactly storing its nonzero triples using fewer bits. Figure 8.1c shows the example tensor with 2 × 2 × 2 blocks (B = 2). For a third-order tensor, bi, bj, bk are block indices indexing tensor blocks; ei, ej, ek are element indices indexing nonzeros within a tensor block. A bptr array stores the pointer to each block's beginning location, and val stores all the nonzero values. The block ratio α_b of a HiCOO representation of a tensor is defined as the number of blocks divided by the number of nonzeros in the tensor; c_b denotes the average slice size per tensor block. These two key parameters describe the effectiveness of storing a tensor in the HiCOO format.
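The index compression can be sketched as follows: with a power-of-two block size B, each COO index splits into a block index and an element index, and the block ratio α_b follows from counting the distinct blocks. This is only an illustrative fragment for third-order tensors (the array names are made up); it assumes the nonzeros have already been sorted so that the nonzeros of a block are consecutive.

/* Sketch: split COO indices into HiCOO-style block and element indices
 * for a third-order tensor, and compute the block ratio alpha_b.
 * B must be a power of two; indices are 0-based and sorted by block. */
#include <stdint.h>

double coo_to_blocks(int64_t nnz, const int *i, const int *j, const int *k,
                     int B, int *bi, int *bj, int *bk,
                     uint8_t *ei, uint8_t *ej, uint8_t *ek) {
    int shift = 0;
    while ((1 << shift) < B) shift++;          /* B = 2^shift */
    int mask = B - 1;
    int64_t nblocks = 0;
    for (int64_t m = 0; m < nnz; m++) {
        bi[m] = i[m] >> shift;  ei[m] = (uint8_t)(i[m] & mask);
        bj[m] = j[m] >> shift;  ej[m] = (uint8_t)(j[m] & mask);
        bk[m] = k[m] >> shift;  ek[m] = (uint8_t)(k[m] & mask);
        if (m == 0 || bi[m] != bi[m-1] || bj[m] != bj[m-1] || bk[m] != bk[m-1])
            nblocks++;                          /* a new block starts here */
    }
    return (double)nblocks / (double)nnz;       /* block ratio alpha_b */
}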

Comparing the three tensor formats: HiCOO and COO treat every mode equally and do not assume any mode order; they preserve the mode-generic orientation [155]. CSF has a strong mode-specificity, since the tree structure implies a mode ordering for the enumeration of nonzeros. Besides, compared to the COO format, HiCOO and CSF save storage space and memory footprint, and generally achieve higher performance.

[Figure 8.2: A sparse tensor and the associated hypergraph.]

8.2 Problem definition and solutions

Our objective is to improve the performance of Mttkrp, and therefore of CPD algorithms, based on the three tensor storage formats described above. We will achieve this objective by ordering (or relabeling) the indices in one or more modes of the input tensor, so as to improve the data locality in the tensor and the factor matrices of an Mttkrp operation. Take, for example, two nonzeros (i_2, j_2, k_1) and (i_2, j_2, k_2) of the third-order tensor in Figure 8.2. Relabeling k_1 and k_2 to two other indices close to each other will potentially improve cache hits for both the tensor and the corresponding rows of the factor matrices in the Mttkrp operation. Besides, it will also influence the tensor storage for some formats (the analysis is given in Section 8.2.3).

We propose two ordering heuristics. The aim of the heuristics is to arrange the nonzeros close to each other in all modes. If we were to look at matrices, this would correspond to reordering the rows and columns so that all nonzeros are clustered around the diagonal. This way, nonzeros in a row or column would be close to each other, and any blocking (by imposing fixed sized blocks) would have nonzero blocks only around the diagonal. The proposed heuristics are based on these observations and try to obtain a similar behavior for tensors. The output of a reordering algorithm is the set of permutations for all modes, which is used to relabel the tensor indices.
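Applying the output of either heuristic is a single pass over the nonzeros: every index of every mode is mapped through the permutation of its mode. The sketch below (illustrative names, 0-based indices) shows this relabeling for a COO tensor.

/* Sketch: relabel a COO tensor with per-mode permutations.
 * ind[m*N + n] is the mode-n index of the m-th nonzero;
 * perm[n][i] is the new label of index i in mode n. */
void relabel_coo(long nnz, int N, int *ind, int *const *perm) {
    for (long m = 0; m < nnz; m++)
        for (int n = 0; n < N; n++)
            ind[m * N + n] = perm[n][ind[m * N + n]];
    /* for HiCOO, the nonzeros are subsequently re-sorted and re-blocked */
}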

8.2.1 BFS-MCS

BFS-MCS is a breadth first search (BFS)-like heuristic approach based on the maximum cardinality search family [207]. We first construct the d-partite, d-uniform hypergraph for a sparse tensor, as described in Section 1.5. Recall that in this hypergraph, the vertices are the tensor indices in all modes, and the hyperedges represent its nonzero entries. Figure 8.2 shows the hypergraph associated with a sparse tensor. Here, the vertices are blank circles, and the hyperedges are represented by grouping vertices.

For an Nth-order sparse tensor, we need to find the permutations for N modes. We determine a permutation for a given mode n (perm_n) of a tensor X ∈ R^{I_1 × ··· × I_N} as follows. Suppose some of the indices of mode n are already ordered. Then, BFS-MCS picks the next to-be-ordered index as the one with the strongest connection to the currently ordered index set. In the case of ties, it selects the index with the smallest number of nonzeros in the corresponding sub-tensor. Intuitively, a stronger connection represents more common indices in the modes other than n among an unordered vertex and the already ordered ones. This means that more data from the factor matrices can be reused in Mttkrp, if found in cache.

Algorithm 18: BFS-MCS ordering based on maximum cardinality search for a given mode
Data: An Nth-order sparse tensor X ∈ R^{I_1 × ··· × I_N}, hypergraph G = (V, E), mode n
Result: Permutation perm_n
1: w_p(i) ← 0, for i = 1, ..., I_n
2: w_s(i) ← −|X(..., i, ...)|                  ▷ number of nonzeros with index i in mode n
3: Build a max-heap H_n for the mode-n indices, with w_p and w_s as the primary and secondary keys, respectively
4: m_V(i) ← 0, for i = 1, ..., I_n
5: for i_n = 1, ..., I_n do
6:   v_n^(0) ← GetHeapMax(H_n)
7:   perm_n(v_n^(0)) ← i_n
8:   m_V(v_n^(0)) ← 1
9:   for e_n ∈ hyperedges of v_n^(0) do
10:    for v ∈ e_n, where v is not in mode n and m_V(v) = 0 do
11:      m_V(v) ← 1
12:      for e ∈ hyperedges of v, with e ≠ e_n do
13:        v_n ← vertex in mode n of e
14:        if inHeap(v_n) then
15:          w_p(v_n) ← w_p(v_n) + 1
16:          heapUpdateKey(H_n, w_p(v_n))
17: return perm_n

The process is implemented for all mode-n indices by maintaining a max-heap with two keys. The primary key of an index is the number of connections to the currently ordered indices. The secondary key is the number of nonzeros in the corresponding (N − 1)th-order sub-tensor, e.g., X(..., i_n, ...), where smaller secondary key values signify higher priority. The secondary keys are static. Algorithm 18 gives the pseudocode of BFS-MCS. The heap H_n is initially constructed (Line 3) according to the secondary keys of the mode-n indices. For each mode-n index (e.g., i_n) of this max-heap, BFS-MCS traverses all connected tensor indices in the modes except n, calculates the number of connections of the mode-n indices, and then updates the heap using the primary and secondary keys. Let m_V denote a size-|V| array that tracks whether a vertex has been visited. Its entries are initialized to zero (unvisited) and changed to 1 once a vertex is visited. We record a mode-n index v_n^(0) obtained from the max-heap H_n into the permutation array perm_n. The algorithm visits all its hyperedges (i.e., all nonzeros with i_n, Line 9) and their connected and unvisited vertices from the other (N − 1) modes except n (Line 10). For these vertices, it again visits their hyperedges (e in Line 12), and then checks whether the connected vertices in mode n (v_n) are in the heap (Line 14). If so, the primary key (connectivity) of v_n is increased by 1, and the max-heap (H_n) is updated. To summarize, this algorithm locates the sub-tensors of the neighbor vertices of v_n^(0) to increase the primary key values of the encountered mode-n indices in H_n.

The BFS-MCS heuristic has low time complexity. We explain it using tensor terms. The innermost heap update operation (Line 15) costs O(log(I_n)). Each sub-tensor is only visited once, with the help of the marker array m_V. For all the sub-tensors in one tensor mode, the heap update is only performed for nnz nonzeros (hyperedges). Overall, the time complexity of BFS-MCS is O(N nnz log(I_n)) for an Nth-order tensor with nnz nonzeros, when computing perm_n for mode n.

BFS-MCS does not exactly capture the memory access pattern of an Mttkrp. It treats the contributions from the indices of all modes except n equally in the connectivity, thus the connectivity might not match the actual data reuse. Also, it uses a greedy strategy to determine the next-level vertices, which could miss the optimal global ordering, as is common for greedy heuristics for hard problems.

8.2.2 Lexi-Order

A lexical ordering of an integer vector is the standard dictionary ordering of its elements, defined as follows. Given two equal-length vectors x and y, we say x ≤ y iff either (i) all elements are the same, i.e., x = y, or (ii) there exists an index j such that x(j) < y(j) and x(i) = y(i) for all 0 ≤ i < j. Consider two {0, 1} vectors x = (1, 0, 1, 0) and y = (1, 1, 0, 0); we have x ≤ y, because x(1) < y(1) and x(i) = y(i) for all 0 ≤ i < 1.
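The comparison just described is the usual lexicographic test; a small sketch of it, on plain integer arrays with 0-based indexing, is given below for reference.

/* Sketch: lexicographic comparison of two equal-length integer vectors.
 * Returns -1 if x < y, 1 if x > y, and 0 if they are equal. */
int lex_cmp(const int *x, const int *y, int len) {
    for (int i = 0; i < len; i++) {
        if (x[i] < y[i]) return -1;   /* first differing position decides */
        if (x[i] > y[i]) return 1;
    }
    return 0;
}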

Lubiw [167] associates a vector v of size I · J with an I × J matrix A, where the matrix entries (i, j) are first sorted by i + j and then by j. A doubly lexical ordering of A is an ordering of its rows and columns which makes v lexically the largest. Every real-valued matrix has a doubly lexical ordering, and such an ordering is not unique [167]. Doubly lexical ordering has a number of applications, including the efficient recognition of totally balanced, subtree, and plaid matrices, and of (strongly) chordal graphs.


[Figure 8.3: A doubly lexical ordering of a {0, 1} matrix: (a) vector comparison, (b) the original and the ordered matrix.]

To the best of our knowledge, the merits of this ordering for high performance have not been investigated for sparse matrices.

We propose an iterative algorithm, called matLexiOrder, for the doubly lexical ordering of matrices, where in an iteration either the rows or the columns are sorted lexically. Lubiw [167, Claim 2.2] shows that exchanging two rows which are lexically out of order improves the lexical ordering of the matrix. By appealing to this result, we see that the matLexiOrder algorithm finds a doubly lexical ordering in a finite number of iterations. This also fits well with another characterization of the doubly lexical ordering [167, 182]: a matrix is in doubly lexical ordering if both the row and column vectors are in non-increasing lexical order. A row vector is read from left to right, and a column vector is read from top to bottom. Figure 8.3 shows a doubly lexical ordering of a sample matrix. As seen in the figure, the row vectors are in non-increasing lexical order, as are the column vectors.

The known doubly lexical ordering algorithms for matrices, by Lubiw [167] and Paige and Tarjan [182], are "direct" ordering methods with a run time of O(nnz log(I + J) + J) and O(nnz + I + J) space, for an I × J matrix with nnz nonzeros. We find the time complexity of these algorithms to be too high for our purpose. Furthermore, the data structures are too complex to allow an efficient generalized implementation for tensors. Since we do not aim to obtain an exact doubly lexical ordering (a close-by ordering will likely suffice to improve the Mttkrp performance), our approach will be faster, while being simpler to implement.

We first describe the matLexiOrder algorithm for a sparse matrix. This aims to show the efficiency of our iterative approach compared to the others. A partition refinement technique is used to order the rows and columns alternately. Given an ordering of the rows, the columns can be sorted lexically. This is achieved by an order preserving variant of the partition refinement method [182], which is called orderlyRefine. We briefly explain the partition refinement technique for ordering the columns of a matrix. Given an I × J matrix A, all columns are initially put into a single part. Then, A's nonzeros are visited row by row. At a row i, each column part C is split into two parts, C_1 = C ∩ a(i, :) and C_2 = C \ a(i, :), and these two parts replace C in the order C_1, C_2 (empty sets are discarded). Note that this algorithm keeps the parts in a particular order, which generates an ordering of the columns. orderlyRefine is used to refine all parts that have at least one nonzero in row i, in O(|a(i, :)|) time, where |a(i, :)| is the number of nonzeros of row i. Overall, matLexiOrder costs a linear total time of O(nnz + I + J) (for the rows and columns ordering) per iteration, and O(J) space. We also observe that only a small number of iterations will be enough (as will be shown in Section 8.3.5 for tensors), yielding a more storage-efficient algorithm compared to the prior doubly lexical ordering methods [167, 182]. matLexiOrder, in particular the use of the orderlyRefine routine a few times, is sufficient for our needs.
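A minimal sketch of the order preserving refinement step is given below; it refines an ordered column partition with one row of a CSR matrix. The data structure (an ordered array of columns plus part ids) and the function name are illustrative simplifications: this simple version costs O(J) per row, whereas the actual orderlyRefine achieves the O(|a(i, :)|) bound per row with more careful bookkeeping.

/* Sketch: one order-preserving refinement step on an ordered column
 * partition. order[] lists the J columns in their current order, part[]
 * gives the part id of each column, and parts appear consecutively in
 * order[]. Columns of a part that have a nonzero in row i are moved in
 * front of the others of the same part, preserving relative order, so a
 * touched part is split into (C1 = C intersect a(i,:), C2 = C \ a(i,:)). */
#include <stdlib.h>
#include <string.h>

void orderly_refine(int J, int *order, int *part, int *nparts,
                    const int *row_cols, int row_nnz) {
    char *in_row = calloc(J, 1);            /* columns with a nonzero in row i */
    for (int p = 0; p < row_nnz; p++) in_row[row_cols[p]] = 1;

    int *new_order = malloc(J * sizeof(int));
    int *new_part  = malloc(J * sizeof(int));
    int pos = 0, new_id = -1;
    for (int start = 0; start < J; ) {      /* walk the parts, in order */
        int end = start;
        while (end < J && part[order[end]] == part[order[start]]) end++;
        /* two stable passes: first the touched columns, then the untouched */
        for (int phase = 0; phase < 2; phase++) {
            int opened = 0;
            for (int t = start; t < end; t++) {
                int c = order[t];
                if ((phase == 0) == (in_row[c] != 0)) {
                    if (!opened) { new_id++; opened = 1; }
                    new_order[pos++] = c;
                    new_part[c] = new_id;
                }
            }
        }
        start = end;
    }
    memcpy(order, new_order, J * sizeof(int));
    memcpy(part, new_part, J * sizeof(int));
    *nparts = new_id + 1;
    free(in_row); free(new_order); free(new_part);
}

Calling this routine once per row, over all rows, gives the column ordering; alternating the roles of rows and columns yields the matLexiOrder iterations.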

Algorithm 19: Lexi-Order for a given mode
Data: An Nth-order sparse tensor X ∈ R^{I_1 × ··· × I_N}, mode n
Result: Permutation perm_n
   ▷ Sort all nonzeros along all but mode n
1: quickSort(X, coordCmp)
   ▷ Matricize X to X_(n)
2: r ← compose(inds([−n]), 1)
3: for m = 1, ..., nnz do
4:   c ← inds(n, m)                           ▷ column index of X_(n)
5:   if coordCmp(X, m, m − 1) = 1 then
6:     r ← compose(inds([−n]), m)             ▷ row index of X_(n)
7:   X_(n)(r, c) ← val(m)
   ▷ Apply partition refinement to the columns of X_(n)
8: perm_n ← orderlyRefine(X_(n))
9: return perm_n

   ▷ Comparison function for two nonzero indices of X, lexically on all modes except n
10: function coordCmp(X, m_1, m_2)
11:   for n′ = 1, ..., N with n′ ≠ n do
12:     if m_1(n′) < m_2(n′) then
13:       return −1
14:     if m_1(n′) > m_2(n′) then
15:       return 1
16:   return 0

To order tensors, we propose the Lexi-Order function as an extension of matLexiOrder. The basic idea of Lexi-Order is to determine the permutation of each tensor mode independently, while considering the order in the other modes fixed. Lexi-Order sets the indices of the mode to be ordered as the columns of a matrix, the other indices as the rows, and sorts the columns as described for matrices (with the order preserving partition refinement method). The precise algorithm appears in Algorithm 19, which we also illustrate in Figure 8.4 when applied to mode 1. Given a mode n, Lexi-Order first builds a matricized tensor in Compressed Sparse Row (CSR) sparse matrix format, by a call to quickSort with the comparison function coordCmp, and then by partitioning the nonzeros into the row segments (Lines 3–7). The comparison function coordCmp does a lexical comparison of the all-but-mode-n indices, which enables an efficient matricization. In other words, the sorting of the tensor X is the same as building the matricized tensor X_(n) by rows in the fixed lexical ordering, where mode n is the column dimension and the remaining modes constitute the rows. In Figure 8.4, the sorting step orders the COO entries by (j, k) tuples, which then serve as the row indices of the matricized CSR representation of X_(1). Once the matrix is built, we construct {0, 1} row vectors, as in Figure 8.4, to illustrate its nonzero distribution, on which the orderlyRefine function can be called seamlessly. Apart from the quickSort, the other parts of Lexi-Order are of linear time complexity (linear in terms of the tensor storage). We use OpenMP tasks to parallelize quickSort to accelerate Lexi-Order.

[Figure 8.4: The steps of Lexi-Order illustrated for mode 1.]

Like the BFS-MCS approach, Lexi-Order also finds the permutations for the N modes of an Nth-order sparse tensor. Figure 8.5(a) illustrates the effect of Lexi-Order on an example 4 × 4 × 3 sparse tensor. The original tensor is converted to a HiCOO representation with block size B = 2, consisting of 5 blocks with at most 2 nonzeros per block. After reordering with Lexi-Order, the new HiCOO has 3 nonzero blocks with up to 4 nonzeros per block. Thus, the blocks are denser, which should exhibit better locality behavior. However, this reordering scheme is a heuristic. For example, consider the tensor of Figure 8.1, for which we draw another HiCOO representation after reordering in Figure 8.5(b). Applying Lexi-Order yields a reordered HiCOO representation with 4 blocks, which is the same as with the input ordering, although the maximum number of nonzeros per block increases to 4. For this tensor, Lexi-Order may not show a big advantage.

8.2.3 Analysis

We take the Mttkrp operation to analyze the reordering behavior for the COO, CSF, and HiCOO formats. Recall that, for a third-order sparse tensor X, Mttkrp multiplies each nonzero entry x_ijk with the R-vector formed by the entry-wise product of the jth row of B and the kth row of C when computing M_A. The arithmetic intensity of the Mttkrp algorithms on these formats is approximately 1/4 [155]. Thus, Mttkrp can be considered memory-bound for most computer architectures, especially CPU platforms.

[Figure 8.5: Comparison of HiCOO representations before and after Lexi-Order: (a) a good example, (b) a fair example.]

In HiCOO, smaller block ratios (αb) and larger average slice sizes per tensor block (cb) are favorable for good Mttkrp performance. Our reordering algorithms tend to pack nonzeros closely together and obtain denser blocks (larger cb), which potentially generates fewer tensor blocks (smaller αb); thus HiCOO-Mttkrp has fewer memory accesses, more cache hits, and better performance. Moreover, a smaller αb also reduces the tensor memory requirement as a side benefit [155]. That is, for the HiCOO format, a reordering scheme should increase the block density, reduce the number of blocks, and increase cache locality, which are three related performance metrics. Reordering is more beneficial for the HiCOO format than for the COO and CSF formats.

COO stores all nonzero indices, so relabeling does not make a difference in its storage. The same is true for CSF: relabeling does not change the CSF's tree structure, although the order of the nodes may change. For an Mttkrp with tensors stored in these two formats, the performance gain from reordering comes only from the improved data locality. Though a theoretical analysis is hard for these formats, the potentially better data locality can also bring performance advantages, but likely smaller than HiCOO's.


Tensors     Order  Dimensions                  Nnzs   Density
vast        3      165K × 11K × 2              26M    6.9 × 10^-3
nell2       3      12K × 9K × 29K              77M    2.4 × 10^-5
choa        3      712K × 10K × 767            27M    5.0 × 10^-6
darpa       3      22K × 22K × 24M             28M    2.4 × 10^-9
fb-m        3      23M × 23M × 166             100M   1.1 × 10^-9
fb-s        3      39M × 39M × 532             140M   1.7 × 10^-10
flickr      3      320K × 28M × 2M             113M   7.8 × 10^-12
deli        3      533K × 17M × 3M             140M   6.1 × 10^-12
nell1       3      29M × 21M × 25M             144M   9.1 × 10^-13
crime       4      6K × 24 × 77 × 32           5M     1.5 × 10^-2
uber        4      183 × 24 × 1140 × 1717      3M     3.9 × 10^-4
nips        4      2K × 3K × 14K × 17          3M     1.8 × 10^-6
enron       4      6K × 6K × 244K × 1K         54M    5.5 × 10^-9
flickr4d    4      320K × 28M × 2M × 731       113M   1.1 × 10^-14
deli4d      4      533K × 17M × 3M × 1K        140M   4.3 × 10^-15

Table 8.1: Description of sparse tensors.

8.3 Experiments

Platform. We perform experiments on a Linux-based Intel Xeon E5-2698 v3 multicore server platform with 32 physical cores distributed over two sockets, running at 2.3 GHz. We only present results for sequential runs (see the paper [156] for parallel runs). The processor microarchitecture is Haswell, having a 32 KiB L1 data cache, and the machine has 128 GiB of memory. The code artifact was written in the C language and was compiled using icc 18.0.1.

Dataset. We use sparse tensors derived from real-world applications, which appear in Table 8.1 ordered by decreasing nonzero density, separately for third- and fourth-order tensors. Apart from choa [188], all tensors are publicly available: FROSTT [203] and HaTen2 [121].

Configurations. We report the results under the best configurations of the following parameters for the highest Mttkrp performance with the three formats: (i) the number of reordering iterations and the block size B for the HiCOO format; (ii) tiling or not for CSF. For HiCOO, B = 128 achieves the best results in most cases, and we use five reordering iterations, which will be analyzed in Section 8.3.5. All experiments use an approximate rank of R = 16. We use the total execution time of Mttkrps in all modes for every tensor to calculate the speedup, which is the ratio of the total Mttkrp time on a randomly reordered tensor over that using a specific reordering scheme. All times are averaged over five runs.
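Stated as a formula (with notation introduced here only for precision), the speedup reported for a given tensor and reordering scheme is

\[
\text{speedup} \;=\; \frac{\sum_{n=1}^{N} T_{\text{random}}(n)}{\sum_{n=1}^{N} T_{\text{reordered}}(n)} ,
\]

where T(n) denotes the time of the mode-n Mttkrp and both totals are averaged over five runs.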

8.3.1 COO-Mttkrp with reordering

We show the effect of the two reordering approaches on sequential COO-Mttkrp from ParTI [154] in Figure 8.6.


[Bar chart not reproducible here: speedups between 0× and 5× of Lexi-Order and BFS-MCS for each third- and fourth-order tensor.]

Figure 8.6: Reordered COO-Mttkrp speedup over a random reordering implementation.

This COO-Mttkrp is implemented in C using the same algorithm as Tensor Toolbox [18]. For any reordering approach, after applying BFS-MCS, Lexi-Order, or random reordering to the input tensor, we still sort the tensor in the mode order 1, ..., N. Observe that Lexi-Order improves sequential COO-Mttkrp performance by 1.00–4.29× (1.79× on average), while BFS-MCS obtains 0.95–1.27× (1.10× on average). Note that Lexi-Order improves the performance of COO-Mttkrp on all tensors. We conclude that this ordering is always helpful for COO-Mttkrp, although the improvements are smaller than what we saw for HiCOO-Mttkrp.

8.3.2 CSF-Mttkrp with reordering

We show the effect of the two reordering approaches on sequential CSF-Mttkrp from Splatt v1.1.1 [205] in Figure 8.7. CSF-Mttkrp is set to use all CSF representations (ALLMODE) for Mttkrps in all modes, with the tiling option on. Lexi-Order improves sequential CSF-Mttkrp performance by 0.65–2.33× (1.50× on average); BFS-MCS improves it by 1.00–1.86× (1.22× on average). Both reordering approaches improve the performance of CSF-Mttkrp on average. While BFS-MCS is always helpful in the sequential case, Lexi-Order is unhelpful on only one tensor, crime. The improvements achieved by the two reordering approaches for CSF are smaller than those for the HiCOO and COO formats. We conclude that both reordering methods are helpful for CSF-Mttkrp, but to a lesser extent than for HiCOO- and COO-based Mttkrp.

8.3.3 HiCOO-Mttkrp with reordering

Figure 8.8 shows the speedup of the proposed reordering methods on sequential HiCOO-Mttkrp. Lexi-Order reordering obtains 0.99–4.14× speedup (2.12× on average), while BFS-MCS reordering gets 0.99–1.88× speedup (1.34× on average). Lexi-Order and BFS-MCS do not behave as well on fourth-order tensors as on third-order tensors.


[Bar chart not reproducible here: speedups between 0× and 2.5× of Lexi-Order and BFS-MCS for each third- and fourth-order tensor.]

Figure 8.7: Reordered CSF-Mttkrp speedup over a random reordering implementation.

[Bar chart not reproducible here: speedups between 0× and 5× of Lexi-Order and BFS-MCS for each third- and fourth-order tensor.]

Figure 8.8: Reordered HiCOO-Mttkrp speedup over a random ordering implementation.

Tensor flickr4d is constructed from the same data as flickr, with an extra short mode (refer to Table 8.1). Lexi-Order obtains a 4.14× speedup on flickr but only a 3.02× speedup on flickr4d. The same phenomenon is also observed on tensors deli and deli4d. This indicates that it is harder to get good data locality on higher-order tensors, which will be justified in Table 8.2.

We investigate the two critical parameters of HiCOO [155]: the block ratio (αb) and the average slice size per tensor block (cb). A smaller αb and a larger cb are favorable for good HiCOO-Mttkrp performance. Table 8.2 lists the parameter values for all tensors before and after Lexi-Order, the HiCOO-Mttkrp speedup (as shown in Figure 8.8), and the storage ratio of the reordered HiCOO over that with a random ordering. Generally, when both parameters αb and cb are largely improved, we see a good speedup and storage ratio with Lexi-Order. As seen for the same data in different orders, e.g., flickr4d and flickr, the αb and cb values after Lexi-Order are better in the smaller-order version. This observation supports the intuition that getting good data locality is harder for higher-order tensors.

8.3.4 Reordering methods comparison

As seen above, Lexi-Order improves performance more than BFS-MCS in most cases.


            Random reordering   Lexi-Order          Speedup         Storage
Tensors     αb      cb          αb      cb          seq     omp     ratio
vast        0.004   1.758       0.004   1.562       1.01    1.03    0.999
nell2       0.020   0.314       0.008   0.074       1.04    1.04    0.966
choa        0.089   0.057       0.016   0.056       1.16    1.07    0.833
darpa       0.796   0.009       0.018   0.113       3.74    1.50    0.322
fb-m        0.985   0.008       0.086   0.021       3.84    1.21    0.335
fb-s        0.982   0.008       0.099   0.020       3.90    1.23    0.336
flickr      0.999   0.008       0.097   0.025       4.14    3.66    0.277
deli        0.988   0.008       0.501   0.010       2.24    0.83    0.634
nell1       0.998   0.008       0.744   0.009       1.70    0.70    0.812
crime       0.001   37.702      0.001   8.978       0.99    1.42    1.000
uber        0.041   0.469       0.011   0.270       1.00    0.78    0.838
nips        0.016   0.434       0.004   0.435       1.03    1.34    0.921
enron       0.290   0.017       0.045   0.030       1.25    1.36    0.573
flickr4d    0.999   0.008       0.148   0.020       3.02    11.81   0.214
deli4d      0.998   0.008       0.596   0.010       1.76    1.26    0.697

Table 8.2: HiCOO parameters before and after Lexi-Order reordering.

Compared to the reordering method used in Splatt [206], by setting ALLMODE for CSF-Mttkrp (as in that work), BFS-MCS gets 1.04, 1.64, and 1.61× speedups on tensors nell2, nell1, and deli, respectively, and Lexi-Order obtains 1.04, 1.70, and 2.24× speedups. By contrast, the speedups using graph partitioning [206] on these three tensors are 1.06, 1.11, and 1.19×, and those using hypergraph partitioning [206] are 1.06, 1.12, and 1.24×, respectively. Our BFS-MCS and Lexi-Order schemes both outperform the graph and hypergraph partitionings of [206].

The available methods in the state of the art are based on graph and hypergraph partitioning. Partitioning is a successful approach when the number of partitions is known, whereas the number of blocks in HiCOO is not known ahead of time. In our case, the partitions should also be ordered for better cache reuse. Additionally, partitioners are less effective for tensors than usual, as some dimensions can be very short (creating very high-degree vertices). That is why the proposed ordering-based methods deliver better results.

8.3.5 Effect of the number of iterations in Lexi-Order

Since Lexi-Order is iterative, we evaluate the effect of the number of iterations on HiCOO-Mttkrp performance. The results appear in Figure 8.9(a), normalized to the run time obtained with 10 iterations. Mttkrp performance on most tensors does not vary much with the number of iterations, except on vast, nell2, uber, and nips. We use 5 iterations to get Mttkrp performance similar to that of 10 iterations with about half of the overhead (shown in Figure 8.9(b)); 3 or fewer iterations still give acceptable performance when users care more about the pre-processing time.


[Bar charts not reproducible here: run times with 1, 2, 3, 5, and 10 Lexi-Order iterations, normalized to the 10-iteration value, for each third- and fourth-order tensor. (a) Sequential HiCOO-Mttkrp behavior. (b) The reordering overhead.]

Figure 8.9: The performance and overhead of different numbers of iterations.

8.4 Summary, further notes, and references

Plenty of recent research has studied the optimization of tensor algorithms [20, 28, 46, 159, 168]. Our work sets itself apart by focusing on reordering to obtain a better nonzero structure. Various reordering methods have been considered for matrix operations [7, 173, 190, 215].

The two proposed heuristics aim to arrange the nonzeros of a given tensor close to each other in all modes. As said before, in the matrix case this corresponds to reordering the rows and columns so that all nonzeros are clustered around the diagonal. Heuristics based on partitioning and ordering have been used in the matrix case for arranging most nonzeros around the diagonal or in diagonal blocks. An obvious alternative in the matrix case is the BFS-based reverse Cuthill-McKee (RCM) heuristic. Given a pattern-symmetric matrix A, RCM finds a permutation (ordering) such that the symmetrically permuted matrix PAP^T has its nonzeros clustered around the diagonal. More precisely, RCM tries to reduce the bandwidth, which is defined as max{j − i : a_{ij} ≠ 0 and j > i}. The RCM heuristic can be applied to pattern-wise unsymmetric, even rectangular, matrices as well. A common approach is to create a (2 × 2)-block matrix for a given m × n matrix A as

\[
B = \begin{pmatrix} I_m & A \\ A^{T} & I_n \end{pmatrix},
\]

where I_m and I_n are the identity matrices of size m × m and n × n, respectively. Since B is pattern-symmetric, RCM can be run on it to obtain an (m + n) × (m + n) permutation matrix P. The row and column permutations


              min     max     geo. mean
BFS-MCS       0.84    53.56   2.90
Lexi-Order    0.15    4.33    0.99

Table 8.3: Statistical indicators of the number of nonempty blocks obtained by BFS-MCS and Lexi-Order with respect to that of RCM, on 774 matrices from the UFL collection.

for A can then be obtained from P by respecting the ordering of the first m rows and the last n rows of B. This approach can be extended to tensors; we discuss the third-order case for simplicity. Let X be an I × J × K tensor. We create a matrix B in such a way that the first I rows/columns correspond to the first-mode indices, the following J rows/columns correspond to the second-mode indices, and the last K rows/columns correspond to the third-mode indices. Then, for each nonzero x_{ijk} of X, we set b_{rc} = 1 whenever r and c correspond to any two indices from the set {i, j, k}. This way, B is a symmetric matrix with ones on the main diagonal. Observe that when X is two-dimensional (that is, a matrix), we obtain the matrix B above. We now compare the proposed reordering heuristics with RCM.
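For reference, the construction of B from a third-order tensor can be written explicitly as follows (the offset indexing is made explicit here; it is implied but not spelled out in the verbal description above):

\[
B \in \{0,1\}^{(I+J+K)\times(I+J+K)}, \qquad
b_{rc} = 1 \iff r = c \ \text{ or } \ r, c \in \{\, i,\; I+j,\; I+J+k \,\} \ \text{ for some nonzero } x_{ijk},
\]

where the diagonal ones play the role of the identity blocks of the matrix case.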

We first compare the three methods, RCM, BFS-MCS, and Lexi-Order, on matrices from the UFL collection [59]. We tested these three heuristics on all pattern-wise unsymmetric matrices having more than 1,000 and less than 5,000,000 rows, at least three nonzeros per row and column, less than 10,000,000 nonzeros, and less than 100 · √(m · n) nonzeros for a matrix with m rows and n columns. At the time of experimentation, there were 774 matrices satisfying these properties. We ordered these matrices with RCM, BFS-MCS, and Lexi-Order (5 iterations) and counted the number of nonempty blocks with a block size of 128 (in each dimension). We then normalized the results of BFS-MCS and Lexi-Order with respect to those of RCM. We give the statistical indicators of these normalized results in Table 8.3. As seen in the table, RCM is better than BFS-MCS, and Lexi-Order performs as well as RCM in reducing the number of blocks.

We next compare the three methods on 11 sparse tensors from the FROSTT [203] and HaTen2 [121] repositories, again with a block size of 128 in each dimension. Table 8.4 shows the number of nonempty blocks obtained with these methods. In this table, the number of blocks obtained by RCM is given in absolute terms, and those obtained by BFS-MCS and Lexi-Order are given relative to those of RCM. As seen in the table, RCM is never the best, and Lexi-Order is better than BFS-MCS in nine out of eleven cases. Overall, Lexi-Order and BFS-MCS obtain results whose geometric means are 0.48 and 0.89, respectively, of that of RCM, marking both as more effective than RCM. This is interesting, as we have seen that RCM and Lexi-Order have (nearly) the same performance in the matrix case.


                RCM         BFS-MCS   Lexi-Order
lbnl-network    67208       1.11      0.68
nips            25992       1.56      0.45
vast            113757      1.00      0.95
nell-2          565615      0.92      1.05
flickr          28648054    0.41      0.38
flickr4d        32913028    0.61      0.51
deli            70691535    0.97      0.91
deli4d          94608531    0.81      0.88
darpa           2753737     1.65      0.19
fb-m            53184625    0.66      0.16
fb-s            71308285    0.80      0.19

geo. mean                   0.89      0.48

Table 8.4: Number of nonempty blocks obtained by RCM, BFS-MCS, and Lexi-Order on sparse tensors. The results of BFS-MCS and Lexi-Order are normalized with respect to those of RCM. For each tensor, the best result is displayed in bold in the original table.

          random   Lexi-Order   speedup
vast      2.02     1.60         1.26
nell-2    5.04     4.33         1.16
choa      1.78     1.43         1.25
darpa     3.18     1.48         2.15
fb-m      17.16    6.95         2.47
fb-s      25.03    9.73         2.57
flickr    10.29    5.82         1.77
deli      14.05    10.80        1.30
nell1     18.45    14.86        1.24

Table 8.5: The run time of TTM (for all modes) using a random ordering and Lexi-Order, and the speedup with Lexi-Order, for sequential execution.


The proposed ordering methods are applicable to some other sparse tensor operations without any change. Among those operations, tensor-times-matrix multiplication (TTM), used in computing the Tucker decomposition with higher-order orthogonal iterations [61], and tensor-vector multiplications, used in the higher-order power method [60], are well known. The main characteristic of these operations is that the memory access pattern of the algorithm depends on the locality of tensor indices in each dimension. If not all dimensions have the same characteristics, a variant of the proposed methods in which only certain dimensions are ordered could potentially be used. We showcase some results for the TTM case in Table 8.5 on some of the tensors. The results are obtained on a machine with an Intel Xeon E5-2650 v4 CPU. As seen in the table, using Lexi-Order always improves the sequential run time of TTMs.
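For completeness, the TTM operation in question can be stated as follows (a standard definition, with a convention fixed here rather than taken from the text): for a third-order tensor X of size I × J × K and a matrix U with K rows and R columns, the mode-3 product Y has entries

\[
y_{ijr} \;=\; \sum_{k=1}^{K} x_{ijk}\, u_{kr}, \qquad 1 \le i \le I,\ 1 \le j \le J,\ 1 \le r \le R ,
\]

so each nonzero x_{ijk} reads the kth row of U and updates the (i, j) fiber of Y; as with Mttkrp, the benefit of ordering comes from reusing these rows and fibers across nearby nonzeros.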

Recent work [46] analyzes the memory access pattern of the Mttkrp operation and proposes low-level, highly effective code optimizations. The proposed approach reorganizes the code with register blocking and applies nonzero blocking in an attempt to fit rows of the factor matrices into the cache. The gain in performance is remarkable. Our ordering methods are orthogonal to these and similar optimizations. For much improved performance, one should first order the tensor using the approaches proposed in this chapter, and then further optimize the code using the low-level code optimizations.


Part IV

Closing


Chapter 9

Concluding remarks

We have presented a selection of partitioning, matching, and ordering problems in combinatorial scientific computing.

Partitioning problems. These problems arise in task decomposition for parallel computing, where load balance and low communication cost are two objectives. Depending on the task definition and the interaction of the tasks, the partitioning problems use different combinatorial models. The selected problems (Chapters 2 and 3) addressed acyclic partitioning of directed acyclic graphs and partitioning of hypergraphs. While our contributions for the former problem concerned combinatorial tools for the desired partitioning objectives and constraints, those for the latter concerned the use of hypergraph models and associated partitioning tools for efficient tensor decomposition in the distributed-memory setting.

Matching problems. These problems arise in settings where agents compete for exclusive access to resources. The most common settings include identifying nonzero diagonals in sparse matrices, and routing and scheduling in data centers. A related concept is to decompose a doubly stochastic matrix as a convex combination of permutation matrices (the famous Birkhoff-von Neumann decomposition), where each permutation matrix is a perfect matching in the bipartite graph representation of the matrix. We developed approximation algorithms for matchings in graphs (Chapter 4) and effective heuristics for finding matchings in hypergraphs (Chapter 6). We also investigated (Chapter 5) the problem of finding Birkhoff-von Neumann decompositions with a small number of permutation matrices, and presented complexity results and theoretical insights into the decomposition problem. On the way, we generalized the decomposition to general real matrices (not only doubly stochastic ones) having total support.



Ordering problems. These problems arise when one wants to permute sparse matrices and tensors into desirable forms. The sought forms depend, of course, on the targeted applications. By focusing on a recent LU decomposition method (or, rather, on a method taking advantage of sparsity in a novel way), we developed heuristics to permute sparse matrices into a bordered block triangular form, with the aim of reducing the height of the resulting elimination tree (Chapter 7). We also investigated the sparse tensor ordering problem, where we sought to cluster nonzeros around the (hyper)diagonal of a sparse tensor (Chapter 8) in order to improve the performance of certain tensor operations. To address the tensor ordering problem, we proposed novel algorithms for obtaining a doubly lexical ordering of sparse matrices that are simpler to implement than the known algorithms.

At the end of Chapters 2–8, we mentioned some immediate future work. Below we state some further future work, which requires considerable effort and creativity.

Most of the presented combinatorial algorithms come with applications. Others are made applicable by practical and efficient implementations; further developments on the applications' side are required for these to be used in practice. In particular, the use of the acyclic partitioner (the problem of Chapter 2) in a parallel system with the help of a runtime system requires the whole graph to be available. This is usually not the case; an auto-tuning-like approach, in which one first creates a task graph by a cold run and then partitions this graph to determine a schedule, can prove useful. In a similar vein, we mentioned the lack of proper software implementing the LU factorization method using elimination trees at the end of Chapter 7. Such an implementation will give rise to a number of combinatorial problems (e.g., the peak memory requirement in factoring a BBT-ordered matrix with elimination trees) and seems to be a new playing field.

The original Karp-Sipser heuristic for the cardinality matching problem applies degree-1 and degree-2 reduction rules. The first rule is well implemented in practical settings (see Chapters 4 and 6). On the other hand, fast implementations of the degree-2 reduction rule for undirected graphs are sought. A straightforward implementation of this rule can result in a worst-case Ω(n²) total time for a graph with n vertices. This is obviously too much to afford in theory for general sparse graphs with O(n) or O(n log n) edges. Are there worst-case linear-time algorithms for implementing this rule in the context of the Karp-Sipser heuristic?

The Birkhoff-von Neumann decomposition (Chapter 5) has well-founded applications in routing. Our use of this decomposition in solving linear systems is a first attempt at making it useful for numerical algorithms. In the proposed approach, the cost of constructing a preconditioner with a few terms from a BvN decomposition is equivalent to solving a few


maximum bottleneck matching problems; this can be costly in practical settings. Can we use a relaxed definition of the BvN decomposition in which sub-permutation matrices are allowed (such decompositions are used in routing problems), and make the related preconditioners more practical while retaining their numerical properties?


Bibliography

[1] IBM ILOG CPLEX Optimizer. http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/, 2010. [Cited on pp. 96, 121]

[2] Evrim Acar Daniel M Dunlavy and Tamara G Kolda A scalable op-timization approach for fitting canonical tensor decompositions Jour-nal of Chemometrics 25(2)67ndash86 2011 [Cited on pp 11 37]

[3] Seher Acer Tugba Torun and Cevdet Aykanat Improving medium-grain partitioning for scalable sparse tensor decomposition IEEETransactions on Parallel and Distributed Systems 29(12)2814ndash2825Dec 2018 [Cited on pp 52]

[4] Kunal Agrawal Jeremy T Fineman Jordan Krage Charles E Leis-erson and Sivan Toledo Cache-conscious scheduling of streaming ap-plications In Proc Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 236ndash245 New YorkNY USA 2012 ACM [Cited on pp 4]

[5] Emmanuel Agullo Alfredo Buttari Abdou Guermouche and FlorentLopez Multifrontal QR factorization for multicore architectures overruntime systems In Euro-Par 2013 Parallel Processing LNCS pages521ndash532 Springer Berlin Heidelberg 2013 [Cited on pp 131]

[6] Ron Aharoni and Penny Haxell Hallrsquos theorem for hypergraphs Jour-nal of Graph Theory 35(2)83ndash88 2000 [Cited on pp 119]

[7] Kadir Akbudak and Cevdet Aykanat Exploiting locality in sparsematrix-matrix multiplication on many-core architectures IEEETransactions on Parallel and Distributed Systems 28(8)2258ndash2271Aug 2017 [Cited on pp 160]

[8] Patrick R Amestoy Iain S Duff and Jean-Yves LrsquoExcellent Multi-frontal parallel distributed symmetric and unsymmetric solvers Com-puter Methods in Applied Mechanics and Engineering 184(2ndash4)501ndash520 2000 [Cited on pp 131 132]

[9] Patrick R Amestoy Iain S Duff Daniel Ruiz and Bora Ucar Aparallel matrix scaling algorithm In J M Palma P R AmestoyM Dayde M Mattoso and J C Lopes editors 8th InternationalConference on High Performance Computing for Computational Sci-ence (VECPAR) volume 5336 pages 301ndash313 Springer Berlin Hei-delberg 2008 [Cited on pp 56]


[10] Michael Anastos and Alan Frieze Finding perfect matchings in ran-dom cubic graphs in linear time arXiv preprint arXiv1808008252018 [Cited on pp 125]

[11] Claus A Andersson and Rasmus Bro The N-way toolbox for MAT-LAB Chemometrics and Intelligent Laboratory Systems 52(1)1ndash42000 [Cited on pp 11 37]

[12] Hartwig Anzt Edmond Chow and Jack Dongarra Iterative sparsetriangular solves for preconditioning In Larsson Jesper Traff SaschaHunold and Francesco Versaci editors Euro-Par 2015 Parallel Pro-cessing pages 650ndash661 Springer Berlin Heidelberg 2015 [Cited onpp 107]

[13] Jonathan Aronson Martin Dyer Alan Frieze and Stephen SuenRandomized greedy matching II Random Structures amp Algorithms6(1)55ndash73 1995 [Cited on pp 58]

[14] Jonathan Aronson Alan Frieze and Boris G Pittel Maximum match-ings in sparse random graphs Karp-Sipser revisited Random Struc-tures amp Algorithms 12(2)111ndash177 1998 [Cited on pp 56 58]

[15] Cevdet Aykanat Berkant B Cambazoglu and Bora Ucar Multi-leveldirect K-way hypergraph partitioning with multiple constraints andfixed vertices Journal of Parallel and Distributed Computing 68609ndash625 2008 [Cited on pp 46]

[16] Ariful Azad Mahantesh Halappanavar Sivasankaran RajamanickamErik G Boman Arif Khan and Alex Pothen Multithreaded algo-rithms for maximum matching in bipartite graphs In IEEE 26th Inter-national Parallel Distributed Processing Symposium (IPDPS) pages860ndash872 Shanghai China 2012 [Cited on pp 57 59 66 73 74]

[17] Brett W Bader and Tamara G Kolda Efficient MATLAB compu-tations with sparse and factored tensors SIAM Journal on ScientificComputing 30(1)205ndash231 December 2007 [Cited on pp 11 37]

[18] Brett W. Bader, Tamara G. Kolda, et al. Matlab Tensor Toolbox, version 3.0. Available online, http://www.sandia.gov/~tgkolda/TensorToolbox, 2017. [Cited on pp. 11, 37, 38, 157]

[19] David A Bader Henning Meyerhenke Peter Sanders and DorotheaWagner editors Graph Partitioning and Graph Clustering - 10thDIMACS Implementation Challenge Workshop Georgia Institute ofTechnology Atlanta GA USA February 13-14 2012 Proceedingsvolume 588 of Contemporary Mathematics American MathematicalSociety 2013 [Cited on pp 69]


[20] Muthu Baskaran Tom Henretty Benoit Pradelle M HarperLangston David Bruns-Smith James Ezick and Richard LethinMemory-efficient parallel tensor decompositions In 2017 IEEE HighPerformance Extreme Computing Conference (HPEC) pages 1ndash7Sep 2017 [Cited on pp 160]

[21] James Bennett and Stan Lanning The Netflix Prize In Proceedingsof KDD cup and workshop page 35 2007 [Cited on pp 10 48]

[22] Anne Benoit Yves Robert and Frederic Vivien A Guide to AlgorithmDesign Paradigms Methods and Complexity Analysis CRC PressBoca Raton FL 2014 [Cited on pp 93]

[23] Michele Benzi and Bora Ucar Preconditioning techniques based onthe Birkhoffndashvon Neumann decomposition Computational Methods inApplied Mathematics 17201ndash215 2017 [Cited on pp 106 108]

[24] Piotr Berman and Marek Karpinski Improved approximation lowerbounds on small occurence optimization ECCC Report 2003 [Citedon pp 113]

[25] Garrett Birkhoff Tres observaciones sobre el algebra lineal Universi-dad Nacional de Tucuman Series A 5147ndash154 1946 [Cited on pp8]

[26] Marcel Birn Vitaly Osipov Peter Sanders Christian Schulz andNodari Sitchinava Efficient parallel and external matching In Fe-lix Wolf Bernd Mohr and Dieter an Mey editors Euro-Par 2013Parallel Processing pages 659ndash670 Springer Berlin Heidelberg 2013[Cited on pp 59]

[27] Rob H Bisseling and Wouter Meesen Communication balancing inparallel sparse matrix-vector multiplication Electronic Transactionson Numerical Analysis 2147ndash65 2005 [Cited on pp 4 47]

[28] Zachary Blanco Bangtian Liu and Maryam Mehri Dehnavi CSTFLarge-scale sparse tensor factorizations on distributed platforms InProceedings of the 47th International Conference on Parallel Process-ing pages 211ndash2110 New York NY USA 2018 ACM [Cited onpp 160]

[29] Guy E Blelloch Jeremy T Fineman and Julian Shun Greedy sequen-tial maximal independent set and matching are parallel on average InProceedings of the Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 308ndash317 ACM 2012[Cited on pp 59 73 74 75]


[30] Norbert Blum A new approach to maximum matching in generalgraphs In Proceedings of the 17th International Colloquium on Au-tomata Languages and Programming pages 586ndash597 London UK1990 Springer-Verlag [Cited on pp 6 76]

[31] Richard A Brualdi The diagonal hypergraph of a matrix (bipartitegraph) Discrete Mathematics 27(2)127ndash147 1979 [Cited on pp 92]

[32] Richard A Brualdi Notes on the Birkhoff algorithm for doublystochastic matrices Canadian Mathematical Bulletin 25(2)191ndash1991982 [Cited on pp 9 92 94 98]

[33] Richard A Brualdi and Peter M Gibson Convex polyhedra of doublystochastic matrices I Applications of the permanent function Jour-nal of Combinatorial Theory Series A 22(2)194ndash230 1977 [Citedon pp 9]

[34] Richard A Brualdi and Herbert J Ryser Combinatorial Matrix The-ory volume 39 of Encyclopedia of Mathematics and its ApplicationsCambridge University Press Cambridge UK New York USA Mel-bourne Australia 1991 [Cited on pp 8]

[35] Thang Nguyen Bui and Curt Jones A heuristic for reducing fill-in insparse matrix factorization In Proc 6th SIAM Conf Parallel Pro-cessing for Scientific Computing pages 445ndash452 SIAM 1993 [Citedon pp 4 15]

[36] Peter J Cameron Notes on combinatorics httpscameroncountsfileswordpresscom201311combpdf (last checked Jan 2019)2013 [Cited on pp 76]

[37] Andrew Carlson Justin Betteridge Bryan Kisiel Burr Settles Este-vam R Hruschka Jr and Tom M Mitchell Toward an architecturefor never-ending language learning In AAAI volume 5 page 3 2010[Cited on pp 48]

[38] J Douglas Carroll and Jih-Jie Chang Analysis of individual dif-ferences in multidimensional scaling via an N-way generalization ofldquoEckart-Youngrdquo decomposition Psychometrika 35(3)283ndash319 1970[Cited on pp 36]

[39] Umit V Catalyurek and Cevdet Aykanat Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplicationIEEE Transactions on Parallel and Distributed Systems 10(7)673ndash693 Jul 1999 [Cited on pp 41 140]


[40] Umit V. Catalyurek and Cevdet Aykanat. PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0. Bilkent University, Department of Computer Engineering, Ankara, 06533 Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm, 1999. [Cited on pp. 4, 7, 25, 47, 143]

[41] Umit V Catalyurek Kamer Kaya and Bora Ucar On shared-memoryparallelization of a sparse matrix scaling algorithm In 2012 41st Inter-national Conference on Parallel Processing pages 68ndash77 Los Alami-tos CA USA September 2012 IEEE Computer Society [Cited onpp 56]

[42] Umit V Catalyurek Mehmet Deveci Kamer Kaya and Bora UcarMultithreaded clustering for multi-level hypergraph partitioning In26th IEEE International Parallel and Distributed Processing Sympo-sium (IPDPS) pages 848ndash859 Shanghai China 2012 [Cited on pp59]

[43] Cheng-Shang Chang Wen-Jyh Chen and Hsiang-Yi Huang On ser-vice guarantees for input-buffered crossbar switches A capacity de-composition approach by Birkhoff and von Neumann In Quality ofService 1999 IWQoS rsquo99 1999 Seventh International Workshop onpages 79ndash86 1999 [Cited on pp 8]

[44] Jianer Chen Yang Liu Songjian Lu Barry OrsquoSullivan and Igor Raz-gon A fixed-parameter algorithm for the directed feedback vertex setproblem Journal of the ACM 55(5)211ndash2119 2008 [Cited on pp142]

[45] Joseph Cheriyan Randomized O(M(|V |)) algorithms for problemsin matching theory SIAM Journal on Computing 26(6)1635ndash16551997 [Cited on pp 76]

[46] Jee Choi Xing Liu Shaden Smith and Tyler Simon Blocking opti-mization techniques for sparse tensor computation In 2018 IEEE In-ternational Parallel and Distributed Processing Symposium (IPDPS)pages 568ndash577 Vancouver British Columbia Canada 2018 [Citedon pp 160 163]

[47] Joon Hee Choi and S V N Vishwanathan DFacTo Distributedfactorization of tensors In Zoubin Ghahramani Max Welling CorinnaCortes Neil D Lawrence and Kilian Q Weinberger editors 27thAdvances in Neural Information Processing Systems pages 1296ndash1304Curran Associates Inc Montreal Quebec Canada 2014 [Cited onpp 11 37 38 39]


[48] Edmond Chow and Aftab Patel Fine-grained parallel incomplete LUfactorization SIAM Journal on Scientific Computing 37(2)C169ndashC193 2015 [Cited on pp 107]

[49] Andrzej Cichocki Danilo P Mandic Lieven De Lathauwer GuoxuZhou Qibin Zhao Cesar Caiafa and Anh-Huy Phan Tensor decom-positions for signal processing applications From two-way to multiwaycomponent analysis IEEE Signal Processing Magazine 32(2)145ndash163March 2015 [Cited on pp 10]

[50] Edith Cohen Structure prediction and computation of sparse matrixproducts Journal of Combinatorial Optimization 2(4)307ndash332 Dec1998 [Cited on pp 127]

[51] Thomas F Coleman and Wei Xu Parallelism in structured Newtoncomputations In Parallel Computing Architectures Algorithms andApplications ParCo 2007 pages 295ndash302 Forschungszentrum Julichand RWTH Aachen University Germany 2007 [Cited on pp 3]

[52] Thomas F Coleman and Wei Xu Fast (structured) Newton computa-tions SIAM Journal on Scientific Computing 31(2)1175ndash1191 2009[Cited on pp 3]

[53] Thomas F Coleman and Wei Xu Automatic Differentiation in MAT-LAB using ADMAT with Applications SIAM 2016 [Cited on pp3]

[54] Jason Cong Zheng Li and Rajive Bagrodia Acyclic multi-way par-titioning of Boolean networks In Proceedings of the 31st Annual De-sign Automation Conference DACrsquo94 pages 670ndash675 New York NYUSA 1994 ACM [Cited on pp 4 17]

[55] Elizabeth H Cuthill and J McKee Reducing the bandwidth of sparsesymmetric matrices In Proceedings of the 24th national conferencepages 157ndash172 New York NY USA 1969 ACM [Cited on pp 12]

[56] Marek Cygan Improved approximation for 3-dimensional matchingvia bounded pathwidth local search In Foundations of ComputerScience (FOCS) 2013 IEEE 54th Annual Symposium on pages 509ndash518 IEEE 2013 [Cited on pp 113 120]

[57] Marek Cygan Fabrizio Grandoni and Monaldo Mastrolilli How to sellhyperedges The hypermatching assignment problem In Proc of thetwenty-fourth annual ACM-SIAM symposium on Discrete algorithmspages 342ndash351 SIAM 2013 [Cited on pp 113]


[58] Joseph Czyzyk Michael P Mesnier and Jorge J More The NEOSServer IEEE Journal on Computational Science and Engineering5(3)68ndash75 1998 [Cited on pp 96]

[59] Timothy A Davis and Yifan Hu The University of Florida sparsematrix collection ACM Transactions on Mathematical Software38(1)11ndash125 2011 [Cited on pp 25 69 82 105 143 161]

[60] Lieven De Lathauwer Pierre Comon Bart De Moor and Joos Vande-walle Higher-order power methodndashApplication in independent com-ponent analysis In Proceedings NOLTArsquo95 pages 91ndash96 Las VegasUSA 1995 [Cited on pp 163]

[61] Lieven De Lathauwer Bart De Moor and Joos Vandewalle Onthe best rank-1 and rank-(R1 R2 RN ) approximation of higher-order tensors SIAM Journal on Matrix Analysis and Applications21(4)1324ndash1342 2000 [Cited on pp 163]

[62] Dominique de Werra Variations on the Theorem of BirkhoffndashvonNeumann and extensions Graphs and Combinatorics 19(2)263ndash278Jun 2003 [Cited on pp 108]

[63] James W Demmel Stanley C Eisenstat John R Gilbert Xiaoye SLi and Joseph W H Liu A supernodal approach to sparse par-tial pivoting SIAM Journal on Matrix Analysis and Applications20(3)720ndash755 1999 [Cited on pp 132]

[64] James W Demmel John R Gilbert and Xiaoye S Li An asyn-chronous parallel supernodal algorithm for sparse gaussian elimina-tion SIAM Journal on Matrix Analysis and Applications 20(4)915ndash952 1999 [Cited on pp 132]

[65] Mehmet Deveci Kamer Kaya Umit V Catalyurek and Bora UcarA push-relabel-based maximum cardinality matching algorithm onGPUs In 42nd International Conference on Parallel Processing pages21ndash29 Lyon France 2013 [Cited on pp 57]

[66] Mehmet Deveci Kamer Kaya Bora Ucar and Umit V CatalyurekGPU accelerated maximum cardinality matching algorithms for bipar-tite graphs In Felix Wolf Bernd Mohr and Dieter an Mey editorsEuro-Par 2013 Parallel Processing volume 8097 of LNCS pages 850ndash861 Springer Berlin Heidelberg 2013 [Cited on pp 57]

[67] Pat Devlin and Jeff Kahn Perfect fractional matchings in k-out hy-pergraphs arXiv preprint arXiv170303513 2017 [Cited on pp 114121]


[68] Elizabeth D Dolan The NEOS Server 40 administrative guide Tech-nical Memorandum ANLMCS-TM-250 Mathematics and ComputerScience Division Argonne National Laboratory 2001 [Cited on pp96]

[69] Elizabeth D Dolan and Jorge J More Benchmarking optimiza-tion software with performance profiles Mathematical Programming91(2)201ndash213 2002 [Cited on pp 26]

[70] Jack J Dongarra Fred G Gustavson and Alan H Karp Implement-ing linear algebra algorithms for dense matrices on a vector pipelinemachine SIAM Review 26(1)pp 91ndash112 1984 [Cited on pp 134]

[71] Iain S Duff Kamer Kaya and Bora Ucar Design implementationand analysis of maximum transversal algorithms ACM Transactionson Mathematical Software 38(2)131ndash1331 2011 [Cited on pp 656 58]

[72] Iain S Duff and Jacko Koster On algorithms for permuting largeentries to the diagonal of a sparse matrix SIAM Journal on MatrixAnalysis and Applications 22973ndash996 2001 [Cited on pp 98 143]

[73] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Ap-proximation algorithms for maximum matchings in undirected graphsIn 2018 Proceedings of the Seventh SIAM Workshop on CombinatorialScientific Computing pages 56ndash65 2018 [Cited on pp 75 79 113]

[74] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Fur-ther notes on Birkhoffndashvon Neumann decomposition of doubly stochas-tic matrices Linear Algebra and its Applications 55468ndash78 2018[Cited on pp 101 113 117]

[75] Fanny Dufosse Kamer Kaya Ioannis Panagiotas and Bora Ucar Ef-fective heuristics for matchings in hypergraphs Research Report RR-9224 Inria Grenoble Rhone-Alpes November 2018 (will appear inSEA2) [Cited on pp 118 119 122]

[76] Fanny Dufosse Kamer Kaya and Bora Ucar Two approximationalgorithms for bipartite matching on multicore architectures Journalof Parallel and Distributed Computing 8562ndash78 2015 IPDPS 2014Selected Papers on Numerical and Combinatorial Algorithms [Citedon pp 56 60 65 66 67 69 77 79 81 113 117 125]

[77] Andrew Lloyd Dulmage and Nathan Saul Mendelsohn Coverings ofbipartite graphs Canadian Journal of Mathematics 10517ndash534 1958[Cited on pp 68 143]


[78] Martin E Dyer and Alan M Frieze Randomized greedy matchingRandom Structures amp Algorithms 2(1)29ndash45 1991 [Cited on pp 58113 115]

[79] Jack Edmonds Paths trees and flowers Canadian Journal of Math-ematics 17449ndash467 1965 [Cited on pp 76]

[80] Lawrence C Eggan Transition graphs and the star-height of regu-lar events The Michigan Mathematical Journal 10(4)385ndash397 1963[Cited on pp 5]

[81] Stanley C Eisenstat and Joseph W H Liu The theory of elimina-tion trees for sparse unsymmetric matrices SIAM Journal on MatrixAnalysis and Applications 26(3)686ndash705 2005 [Cited on pp 5 131132 133 135]

[82] Stanley C Eisenstat and Joseph W H Liu A tree-based dataflowmodel for the unsymmetric multifrontal method Electronic Transac-tions on Numerical Analysis 211ndash19 2005 [Cited on pp 132]

[83] Stanley C Eisenstat and Joseph W H Liu Algorithmic aspects ofelimination trees for sparse unsymmetric matrices SIAM Journal onMatrix Analysis and Applications 29(4)1363ndash1381 2008 [Cited onpp 5 132 143 145]

[84] Venmugil Elango Fabrice Rastello Louis-Noel Pouchet J Ramanu-jam and P Sadayappan On characterizing the data access complex-ity of programs ACM SIGPLAN Notices - POPLrsquo15 50(1)567ndash580January 2015 [Cited on pp 3]

[85] Paul Erdos and Alfred Renyi On random matrices Publicationsof the Mathematical Institute of the Hungarian Academy of Sciences8(3)455ndash461 1964 [Cited on pp 72]

[86] Bastiaan Onne Fagginger Auer and Rob H Bisseling A GPU algo-rithm for greedy graph matching In Facing the Multicore ChallengeII volume 7174 of LNCS pages 108ndash119 Springer-Verlag Berlin 2012[Cited on pp 59]

[87] Naznin Fauzia Venmugil Elango Mahesh Ravishankar J Ramanu-jam Fabrice Rastello Atanas Rountev Louis-Noel Pouchet andP Sadayappan Beyond reuse distance analysis Dynamic analysis forcharacterization of data locality potential ACM Transactions on Ar-chitecture and Code Optimization 10(4)531ndash5329 December 2013[Cited on pp 3 16]


[88] Charles M Fiduccia and Robert M Mattheyses A linear-time heuris-tic for improving network partitions In 19th Design Automation Con-ference pages 175ndash181 Las Vegas USA 1982 IEEE [Cited on pp17 22]

[89] Joel Franklin and Jens Lorenz On the scaling of multidimensional ma-trices Linear Algebra and its Applications 114717ndash735 1989 [Citedon pp 114]

[90] Alan M Frieze Maximum matchings in a class of random graphsJournal of Combinatorial Theory Series B 40(2)196ndash212 1986[Cited on pp 76 82 121]

[91] Aurelien Froger Olivier Guyon and Eric Pinson A set packing ap-proach for scheduling passenger train drivers the French experienceIn RailTokyo2015 Tokyo Japan March 2015 [Cited on pp 7]

[92] Harold N Gabow and Robert E Tarjan Faster scaling algorithms fornetwork problems SIAM Journal on Computing 18(5)1013ndash10361989 [Cited on pp 6 76]

[93] Michael R Garey and David S Johnson Computers and IntractabilityA Guide to the Theory of NP-Completeness W H Freeman amp CoNew York NY USA 1979 [Cited on pp 3 47 93]

[94] Alan George Nested dissection of a regular finite element mesh SIAMJournal on Numerical Analysis 10(2)345ndash363 1973 [Cited on pp136]

[95] Alan J George Computer implementation of the finite elementmethod PhD thesis Stanford University Stanford CA USA 1971[Cited on pp 12]

[96] Andrew V Goldberg and Robert Kennedy Global price updates helpSIAM Journal on Discrete Mathematics 10(4)551ndash572 1997 [Citedon pp 6]

[97] Olaf Gorlitz Sergej Sizov and Steffen Staab Pints Peer-to-peerinfrastructure for tagging systems In Proceedings of the 7th Interna-tional Conference on Peer-to-peer Systems IPTPSrsquo08 page 19 Berke-ley CA USA 2008 USENIX Association [Cited on pp 48]

[98] Georg Gottlob and Gianluigi Greco Decomposing combinatorial auc-tions and set packing problems Journal of the ACM 60(4)241ndash2439September 2013 [Cited on pp 7]

[99] Lars Grasedyck Hierarchical singular value decomposition of tensorsSIAM Journal on Matrix Analysis and Applications 31(4)2029ndash20542010 [Cited on pp 52]


[100] William Gropp and Jorge J More Optimization environments andthe NEOS Server In Martin D Buhman and Arieh Iserles editorsApproximation Theory and Optimization pages 167ndash182 CambridgeUniversity Press 1997 [Cited on pp 96]

[101] Hermann Gruber Digraph complexity measures and applications informal language theory Discrete Mathematics amp Theoretical Com-puter Science 14(2)189ndash204 2012 [Cited on pp 5]

[102] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010. [Cited on pp. 47]

[103] Abdou Guermouche Jean-Yves LrsquoExcellent and Gil Utard Impact ofreordering on the memory of a multifrontal solver Parallel Computing29(9)1191ndash1218 2003 [Cited on pp 131]

[104] Anshul Gupta Improved symbolic and numerical factorization algo-rithms for unsymmetric sparse matrices SIAM Journal on MatrixAnalysis and Applications 24(2)529ndash552 2002 [Cited on pp 132]

[105] Mahantesh Halappanavar John Feo Oreste Villa Antonino Tumeoand Alex Pothen Approximate weighted matching on emerging many-core and multithreaded architectures International Journal of HighPerformance Computing Applications 26(4)413ndash430 2012 [Cited onpp 59]

[106] Magnus M Halldorsson Approximating discrete collections via localimprovements In The Sixth annual ACM-SIAM symposium on Dis-crete Algorithms (SODA) volume 95 pages 160ndash169 San FranciscoCalifornia USA 1995 SIAM [Cited on pp 113]

[107] Richard A Harshman Foundations of the PARAFAC procedureModels and conditions for an ldquoexplanatoryrdquo multi-modal factor anal-ysis UCLA Working Papers in Phonetics 161ndash84 1970 [Cited onpp 36]

[108] Nicholas J A Harvey Algebraic algorithms for matching and matroidproblems SIAM Journal on Computing 39(2)679ndash702 2009 [Citedon pp 76]

[109] Elad Hazan Shmuel Safra and Oded Schwartz On the complexity ofapproximating k-dimensional matching In Approximation Random-ization and Combinatorial Optimization Algorithms and Techniquespages 83ndash97 Springer 2003 [Cited on pp 113]

[110] Elad Hazan Shmuel Safra and Oded Schwartz On the complexity ofapproximating k-set packing Computational Complexity 15(1)20ndash392006 [Cited on pp 7]


[111] Bruce Hendrickson Load balancing fictions falsehoods and fallaciesApplied Mathematical Modelling 2599ndash108 2000 [Cited on pp 41]

[112] Bruce Hendrickson and Tamara G Kolda Graph partitioning modelsfor parallel computing Parallel Computing 26(12)1519ndash1534 2000[Cited on pp 41]

[113] Bruce Hendrickson and Robert Leland The Chaco userrsquos guide ver-sion 10 Technical Report SAND93ndash2339 Sandia National Laborato-ries Albuquerque NM October 1993 [Cited on pp 4 15 16]

[114] Julien Herrmann Jonathan Kho Bora Ucar Kamer Kaya andUmit V Catalyurek Acyclic partitioning of large directed acyclicgraphs In Proceedings of the 17th IEEEACM International Sympo-sium on Cluster Cloud and Grid Computing CCGRID pages 371ndash380 Madrid Spain May 2017 [Cited on pp 15 16 17 18 28 30]

[115] Julien Herrmann M Yusuf Ozkaya Bora Ucar Kamer Kaya andUmit V Catalyurek Multilevel algorithms for acyclic partitioning ofdirected acyclic graphs SIAM Journal on Scientific Computing 2019to appear [Cited on pp 19 20 26 29 31]

[116] Jonathan D Hogg John K Reid and Jennifer A Scott Design of amulticore sparse Cholesky factorization using DAGs SIAM Journalon Scientific Computing 32(6)3627ndash3649 2010 [Cited on pp 131]

[117] John E Hopcroft and Richard M Karp An n52 algorithm for max-imum matchings in bipartite graphs SIAM Journal on Computing2(4)225ndash231 1973 [Cited on pp 6 62]

[118] Roger A Horn and Charles R Johnson Matrix Analysis CambridgeUniversity Press New York 1985 [Cited on pp 109]

[119] Cor A J Hurkens and Alexander Schrijver On the size of systemsof sets every t of which have an SDR with an application to theworst-case ratio of heuristics for packing problems SIAM Journal onDiscrete Mathematics 2(1)68ndash72 1989 [Cited on pp 113 120]

[120] Oscar H Ibarra and Shlomo Moran Deterministic and probabilisticalgorithms for maximum bipartite matching via fast matrix multipli-cation Information Processing Letters pages 12ndash15 1981 [Cited onpp 76]

[121] Inah Jeon Evangelos E Papalexakis U Kang and Christos Falout-sos Haten2 Billion-scale tensor decompositions In IEEE 31st Inter-national Conference on Data Engineering (ICDE) pages 1047ndash1058April 2015 [Cited on pp 156 161]


[122] Jochen A G Jess and Hendrikus Gerardus Maria Kees A data struc-ture for parallel LU decomposition IEEE Transactions on Comput-ers 31(3)231ndash239 1982 [Cited on pp 134]

[123] U Kang Evangelos Papalexakis Abhay Harpale and Christos Falout-sos GigaTensor Scaling tensor analysis up by 100 times - Algorithmsand discoveries In Proceedings of the 18th ACM SIGKDD Interna-tional Conference on Knowledge Discovery and Data Mining KDDrsquo12 pages 316ndash324 New York NY USA 2012 ACM [Cited on pp10 11 37]

[124] Micha l Karonski Ed Overman and Boris Pittel On a perfect match-ing in a random bipartite digraph with average out-degree below twoarXiv e-prints page arXiv190305764 Mar 2019 [Cited on pp 59121]

[125] Micha l Karonski and Boris Pittel Existence of a perfect matching ina random (1 + eminus1)ndashout bipartite graph Journal of CombinatorialTheory Series B 881ndash16 May 2003 [Cited on pp 59 67 81 82121]

[126] Richard M Karp Reducibility among combinatorial problems InRaymond E Miller James W Thatcher and Jean D Bohlinger edi-tors Complexity of Computer Computations The IBM Research Sym-posiax pages 85ndash103 Springer US 1972 [Cited on pp 7 142]

[127] Richard M Karp Alexander H G Rinnooy Kan and Rakesh VVohra Average case analysis of a heuristic for the assignment prob-lem Mathematics of Operations Research 19513ndash522 1994 [Citedon pp 81]

[128] Richard M Karp and Michael Sipser Maximum matching in sparserandom graphs In 22nd Annual IEEE Symposium on Foundations ofComputer Science (FOCS) pages 364ndash375 Nashville TN USA 1981[Cited on pp 56 58 67 75 81 113]

[129] Richard M Karp Umesh V Vazirani and Vijay V Vazirani An op-timal algorithm for on-line bipartite matching In 22nd annual ACMsymposium on Theory of computing (STOC) pages 352ndash358 Balti-more MD USA 1990 [Cited on pp 57]

[130] George Karypis and Vipin Kumar MeTiS A Software Package forPartitioning Unstructured Graphs Partitioning Meshes and Comput-ing Fill-Reducing Orderings of Sparse Matrices Version 40 Univer-sity of Minnesota Department of Comp Sci and Eng Army HPCResearch Cent Minneapolis 1998 [Cited on pp 4 16 23 27 142]


[131] George Karypis and Vipin Kumar Multilevel algorithms for multi-constraint hypergraph partitioning Technical Report 99-034 Uni-versity of Minnesota Department of Computer ScienceArmy HPCResearch Center Minneapolis MN 55455 November 1998 [Cited onpp 34]

[132] Kamer Kaya Johannes Langguth Fredrik Manne and Bora UcarPush-relabel based algorithms for the maximum transversal problemComputers and Operations Research 40(5)1266ndash1275 2013 [Citedon pp 6 56 58]

[133] Kamer Kaya and Bora Ucar Constructing elimination trees for sparseunsymmetric matrices SIAM Journal on Matrix Analysis and Appli-cations 34(2)345ndash354 2013 [Cited on pp 5]

[134] Oguz Kaya and Bora Ucar Scalable sparse tensor decompositions indistributed memory systems In Proceedings of the International Con-ference for High Performance Computing Networking Storage andAnalysis SC rsquo15 pages 771ndash7711 Austin Texas 2015 ACM NewYork NY USA [Cited on pp 7]

[135] Oguz Kaya and Bora Ucar Parallel CandecompParafac decomposi-tion of sparse tensors using dimension trees SIAM Journal on Scien-tific Computing 40(1)C99ndashC130 2018 [Cited on pp 52]

[136] Enver Kayaaslan and Bora Ucar Reducing elimination tree heightfor parallel LU factorization of sparse unsymmetric matrices In 21stInternational Conference on High Performance Computing (HiPC2014) pages 1ndash10 Goa India December 2014 [Cited on pp 136]

[137] Brian W Kernighan Optimal sequential partitions of graphs Journalof the ACM 18(1)34ndash40 January 1971 [Cited on pp 16]

[138] Shiva Kintali Nishad Kothari and Akash Kumar Approximationalgorithms for directed width parameters CoRR abs11074824 2011[Cited on pp 138]

[139] Philip A Knight The SinkhornndashKnopp algorithm Convergence andapplications SIAM Journal on Matrix Analysis and Applications30(1)261ndash275 2008 [Cited on pp 55 68]

[140] Philip A Knight and Daniel Ruiz A fast algorithm for matrix bal-ancing IMA Journal of Numerical Analysis 33(3)1029ndash1047 2013[Cited on pp 56 79]

[141] Philip A Knight Daniel Ruiz and Bora Ucar A symmetry preservingalgorithm for matrix scaling SIAM Journal on Matrix Analysis andApplications 35(3)931ndash955 2014 [Cited on pp 55 56 68 79 105]


[142] Tamara G Kolda and Brett Bader Tensor decompositions and appli-cations SIAM Review 51(3)455ndash500 2009 [Cited on pp 10 37]

[143] Tjalling C Koopmans and Martin Beckmann Assignment problemsand the location of economic activities Econometrica 25(1)53ndash761957 [Cited on pp 109]

[144] Mads R B Kristensen Simon A F Lund Troels Blum and JamesAvery Fusion of parallel array operations In Proceedings of the 2016International Conference on Parallel Architectures and Compilationpages 71ndash85 New York NY USA 2016 ACM [Cited on pp 3]

[145] Mads R B Kristensen Simon A F Lund Troels Blum KennethSkovhede and Brian Vinter Bohrium A virtual machine approachto portable parallelism In Proceedings of the 2014 IEEE InternationalParallel amp Distributed Processing Symposium Workshops pages 312ndash321 Washington DC USA 2014 [Cited on pp 3]

[146] Janardhan Kulkarni, Euiwoong Lee, and Mohit Singh. Minimum Birkhoff-von Neumann decomposition. Preliminary version http://www.cs.cmu.edu/~euiwoonl/sparsebvn.pdf of the paper which appeared in Proc. 19th International Conference on Integer Programming and Combinatorial Optimization (IPCO 2017), Waterloo, ON, Canada, pp. 343–354, June 2017. [Cited on pp. 9]

[147] Sebastian Lamm Peter Sanders Christian Schulz Darren Strash andRenato F Werneck Finding Near-Optimal Independent Sets at ScaleIn Proceedings of the 18th Workshop on Algorithm Engineering andExperiments (ALENEXrsquo16) pages 138ndash150 Arlington Virginia USA2016 SIAM [Cited on pp 114 127]

[148] Johannes Langguth Fredrik Manne and Peter Sanders Heuristicinitialization for bipartite matching problems Journal of ExperimentalAlgorithmics 1511ndash122 2010 [Cited on pp 6 56 58]

[149] Yanzhe (Murray) Lei Stefanus Jasin Joline Uichanco and AndrewVakhutinsky Randomized product display (ranking) pricing and or-der fulfillment for e-commerce retailers SSRN Electronic Journal2018 [Cited on pp 9]

[150] Thomas Lengauer Combinatorial Algorithms for Integrated CircuitLayout WileyndashTeubner Chichester UK 1990 [Cited on pp 34]

[151] Renaud Lepere and Denis Trystram A new clustering algorithm forlarge communication delays In 16th International Parallel and Dis-tributed Processing Symposium (IPDPS) page (CDROM) Fort Laud-erdale FL USA 2002 IEEE Computer Society [Cited on pp 32]


[152] Hanoch Levy and David W Low A contraction algorithm for findingsmall cycle cutsets Journal of Algorithms 9(4)470ndash493 1988 [Citedon pp 142]

[153] Jiajia Li Jee Choi Ioakeim Perros Jimeng Sun and Richard VuducModel-driven sparse CP decomposition for higher-order tensors InIPDPS 2017 31th IEEE International Symposium on Parallel andDistributed Processing pages 1048ndash1057 Orlando FL USA May2017 [Cited on pp 52]

[154] Jiajia Li, Yuchen Ma, and Richard Vuduc. ParTI: A Parallel Tensor Infrastructure for Multicore CPU and GPUs (Version 0.1.0). Available from https://github.com/hpcgarage/ParTI, 2016. [Cited on pp. 156]

[155] Jiajia Li Jimeng Sun and Richard Vuduc HiCOO Hierarchical stor-age of sparse tensors In Proceedings of the International Conferencefor High Performance Computing Networking Storage and AnalysisSCrsquo18 New York NY USA 2018 [Cited on pp 148 149 155 158]

[156] Jiajia Li Bora Ucar Umit V Catalyurek Jimeng Sun Kevin Barkerand Richard Vuduc Efficient and effective sparse tensor reordering InProceedings of the ACM International Conference on SupercomputingICS rsquo19 pages 227ndash237 New York NY USA 2019 ACM [Cited onpp 147 156]

[157] Xiaoye S Li and James W Demmel Making sparse Gaussian elim-ination scalable by static pivoting In Proc of the 1998 ACMIEEEConference on Supercomputing SC rsquo98 pages 1ndash17 Washington DCUSA 1998 [Cited on pp 132]

[158] Athanasios P Liavas and Nicholas D Sidiropoulos Parallel algorithmsfor constrained tensor factorization via alternating direction methodof multipliers IEEE Transactions on Signal Processing 63(20)5450ndash5463 Oct 2015 [Cited on pp 11]

[159] Bangtian Liu Chengyao Wen Anand D Sarwate and Maryam MehriDehnavi A unified optimization approach for sparse tensor opera-tions on GPUs In 2017 IEEE International Conference on ClusterComputing (CLUSTER) pages 47ndash57 Sept 2017 [Cited on pp 160]

[160] Joseph W H Liu Computational models and task scheduling forparallel sparse Cholesky factorization Parallel Computing 3(4)327ndash342 1986 [Cited on pp 131 134]

[161] Joseph W H Liu A graph partitioning algorithm by node separatorsACM Transactions on Mathematical Software 15(3)198ndash219 1989[Cited on pp 140]


[162] Joseph W. H. Liu. Reordering sparse matrices for parallel elimination. Parallel Computing, 11(1):73–91, 1989. [Cited on pp. 131]

[163] Liang Liu, Jun (Jim) Xu, and Lance Fortnow. Quantized BvND: A better solution for optical and hybrid switching in data center networks. In 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing, pages 237–246, Dec 2018. [Cited on pp. 9]

[164] Zvi Lotker, Boaz Patt-Shamir, and Seth Pettie. Improved distributed approximate matching. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 129–136, Munich, Germany, 2008. ACM. [Cited on pp. 59]

[165] László Lovász. On determinants, matchings, and random algorithms. In FCT, pages 565–574, 1979. [Cited on pp. 76]

[166] László Lovász and Michael D. Plummer. Matching Theory. Elsevier Science Publishers, Netherlands, 1986. [Cited on pp. 76]

[167] Anna Lubiw. Doubly lexical orderings of matrices. SIAM Journal on Computing, 16(5):854–879, 1987. [Cited on pp. 151, 152, 153]

[168] Yuchen Ma, Jiajia Li, Xiaolong Wu, Chenggang Yan, Jimeng Sun, and Richard Vuduc. Optimizing sparse tensor times matrix on GPUs. Journal of Parallel and Distributed Computing, 129:99–109, 2019. [Cited on pp. 160]

[169] Jakob Magun. Greedy matching algorithms: an experimental study. Journal of Experimental Algorithmics, 3:6, 1998. [Cited on pp. 6]

[170] Marvin Marcus and R. Ree. Diagonals of doubly stochastic matrices. The Quarterly Journal of Mathematics, 10(1):296–302, 1959. [Cited on pp. 9]

[171] Nick McKeown. The iSLIP scheduling algorithm for input-queued switches. IEEE/ACM Transactions on Networking, 7(2):188–201, 1999. [Cited on pp. 6]

[172] Amram Meir and John W. Moon. The expected node-independence number of random trees. Indagationes Mathematicae, 76:335–341, 1973. [Cited on pp. 67]

[173] John Mellor-Crummey, David Whaley, and Ken Kennedy. Improving memory hierarchy performance for irregular applications using data and computation reorderings. International Journal of Parallel Programming, 29(3):217–247, Jun 2001. [Cited on pp. 160]


[174] Silvio Micali and Vijay V. Vazirani. An O(√|V| · |E|) algorithm for finding maximum matching in general graphs. In 21st Annual Symposium on Foundations of Computer Science, pages 17–27, Syracuse, NY, USA, 1980. IEEE. [Cited on pp. 6, 76]

[175] Orlando Moreira, Merten Popp, and Christian Schulz. Graph partitioning with acyclicity constraints. In 16th International Symposium on Experimental Algorithms, SEA, London, UK, 2017. [Cited on pp. 16, 17, 28, 29, 32]

[176] Orlando Moreira, Merten Popp, and Christian Schulz. Evolutionary multi-level acyclic graph partitioning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO, pages 332–339, Kyoto, Japan, 2018. ACM. [Cited on pp. 17, 25, 32]

[177] Marcin Mucha and Piotr Sankowski. Maximum matchings in planar graphs via Gaussian elimination. In S. Albers and T. Radzik, editors, ESA 2004, pages 532–543, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. [Cited on pp. 76]

[178] Marcin Mucha and Piotr Sankowski. Maximum matchings via Gaussian elimination. In FOCS '04, pages 248–255, Washington, DC, USA, 2004. IEEE Computer Society. [Cited on pp. 76]

[179] Huda Nassar, Georgios Kollias, Ananth Grama, and David F. Gleich. Low rank methods for multiple network alignment. arXiv e-prints, page arXiv:1809.08198, Sep 2018. [Cited on pp. 128]

[180] Jaroslav Nešetřil and Patrice Ossona de Mendez. Tree-depth, subgraph coloring and homomorphism bounds. European Journal of Combinatorics, 27(6):1022–1041, 2006. [Cited on pp. 132]

[181] M. Yusuf Özkaya, Anne Benoit, Bora Uçar, Julien Herrmann, and Ümit V. Çatalyürek. A scalable clustering-based task scheduler for homogeneous processors using DAG partitioning. In 33rd IEEE International Parallel and Distributed Processing Symposium, pages 155–165, Rio de Janeiro, Brazil, May 2019. [Cited on pp. 32]

[182] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6):973–989, December 1987. [Cited on pp. 152, 153]

[183] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover Publications (corrected, unabridged reprint of Combinatorial Optimization: Algorithms and Complexity, originally published by Prentice-Hall Inc., New Jersey, 1982), New York, 1998. [Cited on pp. 29]


[184] Panos M. Pardalos, Tianbing Qian, and Mauricio G. C. Resende. A greedy randomized adaptive search procedure for the feedback vertex set problem. Journal of Combinatorial Optimization, 2(4):399–412, 1998. [Cited on pp. 142]

[185] Md. Mostofa Ali Patwary, Rob H. Bisseling, and Fredrik Manne. Parallel greedy graph matching using an edge partitioning approach. In Proceedings of the Fourth International Workshop on High-level Parallel Programming and Applications, HLPP '10, pages 45–54, New York, NY, USA, 2010. ACM. [Cited on pp. 59]

[186] François Pellegrini. SCOTCH 5.1 User's Guide. Laboratoire Bordelais de Recherche en Informatique (LaBRI), 2008. [Cited on pp. 4, 16, 23]

[187] Daniel M. Pelt and Rob H. Bisseling. A medium-grain method for fast 2D bipartitioning of sparse matrices. In IPDPS 2014, IEEE 28th International Parallel and Distributed Processing Symposium, pages 529–539, Phoenix, Arizona, May 2014. [Cited on pp. 52]

[188] Ioakeim Perros, Evangelos E. Papalexakis, Fei Wang, Richard Vuduc, Elizabeth Searles, Michael Thompson, and Jimeng Sun. SPARTan: Scalable PARAFAC2 for large and sparse data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 375–384, New York, NY, USA, 2017. ACM. [Cited on pp. 156]

[189] Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834–4846, Oct 2013. [Cited on pp. 52]

[190] Juan C. Pichel, Francisco F. Rivera, Marcos Fernández, and Aurelio Rodríguez. Optimization of sparse matrix–vector multiplication using reordering techniques on GPUs. Microprocessors and Microsystems, 36(2):65–77, 2012. [Cited on pp. 160]

[191] Matthias Poloczek and Mario Szegedy. Randomized greedy algorithms for the maximum matching problem with new analysis. In IEEE 53rd Annual Sym. on Foundations of Computer Science (FOCS), pages 708–717, New Brunswick, NJ, USA, 2012. [Cited on pp. 58]

[192] Alex Pothen. Sparse null bases and marriage theorems. PhD thesis, Dept. Computer Science, Cornell Univ., Ithaca, New York, 1984. [Cited on pp. 68]

[193] Alex Pothen. The complexity of optimal elimination trees. Technical Report CS-88-13, Pennsylvania State Univ., 1988. [Cited on pp. 131]


[194] Alex Pothen and Chin-Ju Fan. Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software, 16(4):303–324, 1990. [Cited on pp. 58, 68, 115]

[195] Alex Pothen, S. M. Ferdous, and Fredrik Manne. Approximation algorithms in combinatorial scientific computing. Acta Numerica, 28:541–633, 2019. [Cited on pp. 58]

[196] Louis-Noël Pouchet. Polybench: The polyhedral benchmark suite. URL: http://web.cse.ohio-state.edu/~pouchet/software/polybench, 2012. [Cited on pp. 25, 26]

[197] Michael O. Rabin and Vijay V. Vazirani. Maximum matchings in general graphs through randomization. Journal of Algorithms, 10(4):557–567, December 1989. [Cited on pp. 76]

[198] Daniel Ruiz. A scaling algorithm to equilibrate both row and column norms in matrices. Technical Report TR-2001-034, RAL, 2001. [Cited on pp. 55, 56]

[199] Peter Sanders and Christian Schulz. Engineering multilevel graph partitioning algorithms. In Camil Demetrescu and Magnús M. Halldórsson, editors, Algorithms – ESA 2011, 19th Annual European Symposium, September 5-9, 2011, pages 469–480, Saarbrücken, Germany, 2011. [Cited on pp. 4, 16, 23]

[200] Robert Schreiber. A new implementation of sparse Gaussian elimination. ACM Transactions on Mathematical Software, 8(3):256–276, 1982. [Cited on pp. 131]

[201] R. Oğuz Selvitopi, Muhammet Mustafa Özdal, and Cevdet Aykanat. A novel method for scaling iterative solvers: Avoiding latency overhead of parallel sparse-matrix vector multiplies. IEEE Transactions on Parallel and Distributed Systems, 26(3):632–645, 2015. [Cited on pp. 51]

[202] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21:343–348, 1967. [Cited on pp. 55, 69, 114]

[203] Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. FROSTT: The formidable repository of open sparse tensors and tools. http://frostt.io/, 2017. [Cited on pp. 52, 126, 156, 161]

[204] Shaden Smith and George Karypis. A medium-grained algorithm for sparse tensor factorization. In 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23-27, 2016, pages 902–911, 2016. [Cited on pp. 37, 52]

[205] Shaden Smith and George Karypis. SPLATT: The Surprisingly ParalleL spArse Tensor Toolkit (Version 1.1.1). Available from https://github.com/ShadenSmith/splatt, 2016. [Cited on pp. 37, 52, 157]

[206] Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George Karypis. SPLATT: Efficient and parallel sparse tensor-matrix multiplication. In 29th IEEE International Parallel & Distributed Processing Symposium, pages 61–70, Hyderabad, India, May 2015. [Cited on pp. 11, 36, 37, 38, 39, 42, 158, 159]

[207] Robert E. Tarjan and Mihalis Yannakakis. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Computing, 13(3):566–579, 1984. [Cited on pp. 149]

[208] Bora Uçar and Cevdet Aykanat. Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies. SIAM Journal on Scientific Computing, 25(6):1837–1859, 2004. [Cited on pp. 47, 51]

[209] Bora Uçar and Cevdet Aykanat. Revisiting hypergraph models for sparse matrix partitioning. SIAM Review, 49(4):595–603, 2007. [Cited on pp. 42]

[210] Vijay V. Vazirani. An improved definition of blossoms and a simpler proof of the MV matching algorithm. CoRR, abs/1210.4594, 2012. [Cited on pp. 76]

[211] David W. Walkup. Matchings in random regular bipartite digraphs. Discrete Mathematics, 31(1):59–64, 1980. [Cited on pp. 59, 67, 76, 81, 114, 121]

[212] Chris Walshaw. Multilevel refinement for combinatorial optimisation problems. Annals of Operations Research, 131(1):325–372, Oct 2004. [Cited on pp. 25]

[213] Eric S. H. Wong, Evangeline F. Y. Young, and Wai-Kei Mak. Clustering based acyclic multi-way partitioning. In Proceedings of the 13th ACM Great Lakes Symposium on VLSI, GLSVLSI '03, pages 203–206, New York, NY, USA, 2003. ACM. [Cited on pp. 4]

[214] Tao Yang and Apostolos Gerasoulis. DSC: scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951–967, 1994. [Cited on pp. 131]


[215] Albert-Jan Nicholas Yzelman and Dirk Roose. High-level strategies for parallel shared-memory sparse matrix-vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 25(1):116–125, Jan 2014. [Cited on pp. 160]



II Matching and related problems 53

4 Matchings in graphs 55

41 Scaling matrices to doubly stochastic form 55

42 Approximation algorithms for bipartite graphs 56

421 Related work 57

422 Two matching heuristics 59

423 Experiments 69

43 Approximation algorithms for general undirected graphs 75

431 Related work 76

432 One-Out its analysis and variants 77

433 Experiments 82

44 Summary further notes and references 87

5 Birkhoff-von Neumann decomposition 91

51 The minimum number of permutation matrices 91

511 Preliminaries 92

512 The computational complexity of MinBvNDec 93

52 A result on the polytope of BvN decompositions 94

53 Two heuristics 97

54 Experiments 103

55 A generalization of Birkhoff-von Neumann theorem 106

551 Doubly stochastic splitting 106

552 Birkhoff von-Neumann decomposition for arbitrary matrices 108

56 Summary further notes and references 109

6 Matchings in hypergraphs 113

61 Background and notation 114

62 The heuristics 114

621 Greedy-H 115

622 Karp-Sipser-H 115

623 Karp-Sipser-H-scaling 117

624 Karp-Sipser-H-mindegree 118

625 Bipartite-reduction 119

626 Local-Search 120

63 Experiments 120

64 Summary further notes and references 128

III Ordering matrices and tensors 129

7 Elimination trees and height-reducing orderings 131

71 Background 132


72 The critical path length and the elimination tree height 134
73 Reducing the elimination tree height 136

731 Theoretical basis 137
732 Recursive approach Permute 137
733 BBT decomposition PermuteBBT 138
734 Edge-separator-based method 139
735 Vertex-separator-based method 141
736 BT decomposition PermuteBT 142

74 Experiments 142
75 Summary further notes and references 145

8 Sparse tensor ordering 147
81 Three sparse tensor storage formats 147
82 Problem definition and solutions 149

821 BFS-MCS 149
822 Lexi-Order 151
823 Analysis 154

83 Experiments 156
84 Summary further notes and references 160

IV Closing 165

9 Concluding remarks 167

Bibliography 170


List of Algorithms

1 CP-ALS for the 3rd order tensors 36
2 Mttkrp for the 3rd order tensors 38
3 Coarse-grain Mttkrp for the first mode of third order tensors at process p within CP-ALS 40
4 Fine-grain Mttkrp for the first mode of 3rd order tensors at process p within CP-ALS 44
5 ScaleSK: Sinkhorn-Knopp scaling 56
6 OneSidedMatch for bipartite graph matching 60
7 TwoSidedMatch for bipartite graph matching 62
8 Karp-Sipser-MT for bipartite 1-out graph matching 64
9 One-Out for undirected graph matching 78
10 Karp-Sipser One-Out for undirected 1-out graph matching 80
11 GreedyBvN for constructing a BvN decomposition 98
12 Karp-Sipser-H-scaling for hypergraph matching 117
13 Row-LU(A) 134
14 Column-LU(A) 134
15 Permute(A) 138
16 FindStrongVertexSeparatorES(G) 140
17 FindStrongVertexSeparatorVS(G) 142
18 BFS-MCS for ordering a tensor dimension 150
19 Lexi-Order for ordering a tensor dimension 153



Chapter 1

Introduction

This document investigates three classes of problems at the interplay of discrete algorithms, combinatorial optimization, and numerical methods. The general research area is called combinatorial scientific computing (CSC). In CSC, the contributions have practical and theoretical flavor. For all problems discussed in this document, we have the design, analysis, and implementation of algorithms along with many experiments. The theoretical results are included in this document, some with proofs; the reader is invited to the original papers for the omitted proofs. A similar approach is taken for presenting the experiments. While most results for observing theoretical findings in practice are included, the reader is referred to the original papers for some other results (e.g., run time analysis).

The three problem classes are that of partitioning, matching, and ordering. We cover two problems from the partitioning class, three from the matching class, and two from the ordering class. Below, we present these seven problems after introducing the notation and recalling the related definitions. We summarize the computing contexts in which the problems arise and highlight our contributions.

1.1 Directed graphs

A directed graph G = (V, E) is a set V of vertices and a set E of directed edges of the form e = (u, v), where e is directed from u to v. A path is a sequence of edges (u1, v1) · (u2, v2) · · · with vi = ui+1. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) is of length ℓ, where it connects a sequence of ℓ + 1 vertices (u1, v1 = u2, . . . , vℓ−1 = uℓ, vℓ). A path is called simple if the connected vertices are distinct. Let u ⇝ v denote a simple path that starts from u and ends at v. We say that a vertex v is reachable from another vertex u if a path connects them. A path ((u1, v1) · (u2, v2) · · · (uℓ, vℓ)) forms a (simple) cycle if all vi for 1 ≤ i ≤ ℓ are distinct and u1 = vℓ. A directed acyclic graph, DAG in short, is a directed graph with no cycles. A directed graph is strongly connected if every vertex is reachable from every other vertex.

A directed graph G′ = (V′, E′) is said to be a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. We call G′ the subgraph of G induced by V′, denoted as G[V′], if E′ = E ∩ (V′ × V′). For a vertex set S ⊆ V, we define G − S as the subgraph of G induced by V − S, i.e., G[V − S]. A vertex set C ⊆ V is called a strongly connected component of G if the induced subgraph G[C] is strongly connected and C is maximal in the sense that for all C′ ⊃ C, G[C′] is not strongly connected. A transitive reduction G0 is a minimal subgraph of G such that if u is connected to v in G then u is connected to v in G0 as well. If G is acyclic, then its transitive reduction is unique.

The path u ⇝ v represents a dependency of v to u. We say that the edge (u, v) is redundant if there exists another u ⇝ v path in the graph. That is, when we remove a redundant (u, v) edge, u remains connected to v, and hence the dependency information is preserved. We use Pred[v] = {u | (u, v) ∈ E} to represent the (immediate) predecessors of a vertex v, and Succ[v] = {u | (v, u) ∈ E} to represent the (immediate) successors of v. We call the neighbors of a vertex v its immediate predecessors and immediate successors: Neigh[v] = Pred[v] ∪ Succ[v]. For a vertex u, the set of vertices v such that u ⇝ v are called the descendants of u. Similarly, the set of vertices v such that v ⇝ u are called the ancestors of the vertex u. The vertices without any predecessors (and hence ancestors) are called the sources of G, and vertices without any successors (and hence descendants) are called the targets of G. Every vertex u has a weight denoted by wu, and every edge (u, v) ∈ E has a cost denoted by cuv.

A k-way partitioning of a graph G = (V, E) divides V into k disjoint subsets V1, . . . , Vk. The weight of a part Vi, denoted by w(Vi), is equal to ∑u∈Vi wu, which is the total vertex weight in Vi. Given a partition, an edge is called a cut edge if its endpoints are in different parts. The edge cut of a partition is defined as the sum of the costs of the cut edges. Usually, a constraint on the part weights accompanies the problem. We are interested in acyclic partitions, which are defined below.

Definition 1.1 (Acyclic k-way partition). A partition V1, . . . , Vk of G = (V, E) is called an acyclic k-way partition if two paths u ⇝ v and v′ ⇝ u′ do not co-exist for u, u′ ∈ Vi, v, v′ ∈ Vj, and 1 ≤ i ≠ j ≤ k.

In other words, for a suitable numbering of the parts, all edges should be directed from a vertex in a part p to another vertex in a part q where p ≤ q. Let us consider a toy example shown in Fig. 1.1a. A partition of the vertices of this graph is shown in Fig. 1.1b with a dashed curve. Since there is a cut edge from s to u and another from u to t, the partition is cyclic. An acyclic partition is shown in Fig. 1.1c, where all the cut edges are from one part to the other.
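
For concreteness, a bisection can be checked against Definition 1.1 directly: the two cut directions must not both occur. The sketch below is only illustrative (the function name and input representation are ours) and handles the two-part case; it also reports the edge cut for unit edge costs.

    def check_bisection(edges, part):
        # part maps every vertex to 0 or 1; edges is a list of directed (u, v) pairs.
        # A bisection is acyclic exactly when the cut edges do not cross in both
        # directions, i.e., the parts can be numbered so that all cut edges go
        # from the first part to the second one.
        forward = any(part[u] == 0 and part[v] == 1 for u, v in edges)
        backward = any(part[u] == 1 and part[v] == 0 for u, v in edges)
        cut = sum(1 for u, v in edges if part[u] != part[v])
        return not (forward and backward), cut

On the toy example, the partition of Fig. 1.1b has cut edges in both directions and is rejected, while that of Fig. 1.1c is accepted.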

[Figure 1.1: A toy graph with six vertices and six edges, a cyclic partitioning, and an acyclic partitioning. (a) A toy graph; (b) a partition ignoring the directions, which is cyclic; (c) an acyclic partitioning.]

There is a related notion in the literature [87], which is called a convex partition. A partition is convex if, for all vertex pairs u, v in the same part, the vertices in any u ⇝ v path are also in the same part. Hence, if a partition is acyclic, it is also convex. On the other hand, convexity does not imply acyclicity. Figure 1.2 shows that the definitions of an acyclic partition and a convex partition are not equivalent. For the toy graph in Fig. 1.2a, there are three possible balanced partitions, shown in Figures 1.2b, 1.2c, and 1.2d. They are all convex, but only the one in Fig. 1.2d is acyclic.

Deciding on the existence of a k-way acyclic partition respecting an upper bound on the part weights and an upper bound on the cost of cut edges is NP-complete [93, ND15]. In Chapter 2, we treat this problem, which is formalized as follows.

Problem 1 (DAGP: DAG partitioning). Given a directed acyclic graph G = (V, E) with weights on vertices and an imbalance parameter ε, find an acyclic k-way partition V1, . . . , Vk of V such that the balance constraints

    w(Vi) ≤ (1 + ε) · w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut, which is equal to the number of edges whose end points are in two different parts, is minimized.

The DAGP problem arises in exposing parallelism in automatic differentiation [53, Ch. 9], and particularly in the computation of the Newton step for solving nonlinear systems [51, 52]. The DAGP problem with some additional constraints is used to reason about the parallel data movement complexity and to dynamically analyze the data locality potential [84, 87]. Other important applications of the DAGP problem include (i) fusing loops for improving temporal locality and enabling streaming and array contractions in runtime systems [144, 145]; (ii) analysis of cache efficient execution of streaming applications on uniprocessors [4]; (iii) a number of circuit design applications in which the signal directions impose an acyclic partitioning requirement [54, 213].

[Figure 1.2: A toy graph (on vertices a, b, c, d), two cyclic and convex partitions, and an acyclic and convex partition. (a) A toy graph; (b) a cyclic and convex partition; (c) a cyclic and convex partition; (d) an acyclic and convex partition.]

In Chapter 2, we address the DAGP problem. We adopt the multilevel approach [35, 113] for this problem. This approach is the de-facto standard for solving graph and hypergraph partitioning problems efficiently, and is used by almost all current state-of-the-art partitioning tools [27, 40, 113, 130, 186, 199]. The algorithms following this approach have three phases: the coarsening phase, which reduces the number of vertices by clustering them; the initial partitioning phase, which finds a partition of the coarsest graph; and the uncoarsening phase, in which the initial partition is projected to the finer graphs and refined along the way, until a solution for the original graph is obtained. We focus on two-way partitioning (sometimes called bisection), as this scheme can be used in a recursive way for multiway partitioning. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. We also propose effective ways to use the standard undirected graph partitioning methods in our multilevel scheme. We perform a large set of experiments on a dataset and report significant improvements compared to the current state of the art.

A rooted tree T is an ordered triple (V, r, ρ), where V is the vertex set, r ∈ V is the root vertex, and ρ gives the immediate ancestor/descendant relation. For two vertices vi, vj such that ρ(vj) = vi, we call vi the parent vertex of vj and call vj a child vertex of vi. With this definition, the root r is the ancestor of every other node in T. The children vertices of a vertex are called sibling vertices to each other. The height of T, denoted as h(T), is the number of vertices in the longest path from a leaf to the root r.

Let G = (V, E) be a directed graph with n vertices, which are ordered (or labeled as 1, . . . , n). The elimination tree of a directed graph is defined as follows [81]. For any i ≠ n, ρ(i) = j whenever j > i is the minimum vertex id such that the vertices i and j lie in the same strongly connected component of the subgraph G[{1, . . . , j}]; that is, i and j are in the same strongly connected component of the subgraph containing the first j vertices, and j is the smallest such number. Currently, the algorithm with the best worst-case time complexity has a run time of O(m log n) [133]; another algorithm with the worst-case time complexity of O(mn) [83] also performs well in practice. As the definition of the elimination tree implies, the properties of the tree depend on the ordering of the graph.
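
As an illustration only, the definition can be transcribed directly into code; the sketch below (our own function name, assuming networkx is available) is far less efficient than the O(m log n) algorithm cited above and is meant only to mirror the definition.

    import networkx as nx

    def elimination_tree_parent(G, order):
        # G is a networkx.DiGraph; order lists its vertices as 1, ..., n.
        # parent[i] is the smallest j > i such that i and j belong to the same
        # strongly connected component of the subgraph induced by the first j
        # vertices; vertices with no such j are roots of the forest.
        parent = {}
        n = len(order)
        for i in range(n - 1):
            for j in range(i + 1, n):
                H = G.subgraph(order[:j + 1])
                if any(order[i] in c and order[j] in c
                       for c in nx.strongly_connected_components(H)):
                    parent[order[i]] = order[j]
                    break
        return parent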

The elimination tree is a recent model playing important roles in sparse LU factorization. This tree captures the dependencies between the tasks of some well-known variants of sparse LU factorization (we review these in Chapter 7). Therefore, the height of the elimination tree corresponds to the critical path length of the task dependency graph in the corresponding parallel LU factorization methods. In Chapter 7, we investigate the problem of finding minimum height elimination trees, formally defined below, to expose a maximum degree of parallelism by minimizing the critical path length.

Problem 2 (Minimum height elimination tree). Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The minimum height of an elimination tree of a given directed graph corresponds to a directed graph parameter known as the cycle-rank [80]. The problem being NP-complete [101], we propose heuristics in Chapter 7, which generalize the most successful approaches used for symmetric matrices to unsymmetric ones.

1.2 Undirected graphs

An undirected graph G = (V, E) is a set V of vertices and a set E of edges. Bipartite graphs are undirected graphs where the vertex set can be partitioned into two sets with all edges connecting vertices in the two parts. Formally, G = (VR ∪ VC, E) is a bipartite graph, where V = VR ∪ VC is the vertex set, and for each edge e = (r, c), we have r ∈ VR and c ∈ VC. The number of edges incident on a vertex is called its degree. A path in a graph is a sequence of vertices such that each consecutive vertex pair share an edge. A vertex is reachable from another one if there is a path between them. The connected components of a graph are the equivalence classes of vertices under the "is reachable from" relation. A cycle in a graph is a path whose start and end vertices are the same. A simple cycle is a cycle with no vertex repetitions. A tree is a connected graph with no cycles. A spanning tree of a connected graph G is a tree containing all vertices of G.

A matching M in a graph G = (V, E) is a subset of edges E where a vertex in V is in at most one edge in M. Given a matching M, a vertex v is said to be matched by M if v is in an edge of M; otherwise, v is called unmatched. If all the vertices are matched by M, then M is said to be a perfect matching. The cardinality of a matching M, denoted by |M|, is the number of edges in M. In Chapter 4, we treat the following problem, which asks to find approximate matchings.

Problem 3 (Approximating the maximum cardinality of a matching in undirected graphs). Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.

There are a number of polynomial time algorithms to solve the maximum cardinality matching problem exactly in graphs. The lowest worst-case time complexity of the known algorithms is O(√n · τ) for a graph with n vertices and τ edges. For bipartite graphs, the first of such algorithms is described by Hopcroft and Karp [117]; certain variants of the push-relabel algorithm [96] also have the same complexity (we give a classification for maximum cardinality matchings in bipartite graphs in a recent paper [132]). For general graphs, algorithms with the same complexity are known [30, 92, 174]. There is considerable interest in simpler and faster algorithms that have some approximation guarantee [148]. Such algorithms are used as a jump-start routine by the current state of the art exact maximum matching algorithms [71, 148, 169]. Furthermore, there are applications [171] where large cardinality matchings are used.
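
An example of such a cheap jump-start routine (not one of the heuristics proposed in this document) is a greedy maximal matching, grown in a single scan of the edges; any maximal matching has at least half the maximum cardinality. The function name below is ours.

    def greedy_maximal_matching(edges):
        # Scan the edges once (possibly in a random order) and take an edge
        # whenever both of its endpoints are still free.
        matched, matching = set(), []
        for u, v in edges:
            if u not in matched and v not in matched:
                matching.append((u, v))
                matched.update((u, v))
        return matching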

The specializations of Problem 3 for bipartite graphs and for general undirected graphs are described in Section 4.2 and Section 4.3, respectively. Our focus is on approximation algorithms which expose parallelism and have linear run time in practical settings. We propose randomized heuristics and analyze their results. The proposed heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum cardinality matching on the subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge is obtained by scaling a suitable matrix to be doubly stochastic. The heuristics for the bipartite graphs and general undirected graphs share this general approach and differ in the analysis. To the best of our knowledge, the proposed heuristics have the highest constant factor approximation ratio (around 0.86 for large graphs).
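
The gist of the randomized construction can be sketched as follows for the bipartite case. The function below (a sketch with our own names, assuming NumPy, a dense nonnegative matrix, and total support) only performs the scaling and the random edge choices; the heuristics of Chapter 4 then compute a maximum matching of the chosen subgraph exactly.

    import numpy as np

    def one_out_edges(A, scaling_iters=20, seed=0):
        # Scale A towards a doubly stochastic matrix with a few Sinkhorn-Knopp
        # iterations, then let every row vertex pick one column vertex with
        # probability proportional to the scaled entries of its row.
        rng = np.random.default_rng(seed)
        S = np.array(A, dtype=float)
        for _ in range(scaling_iters):
            S /= S.sum(axis=1, keepdims=True)   # normalize rows
            S /= S.sum(axis=0, keepdims=True)   # normalize columns
        cols = [rng.choice(S.shape[1], p=S[i] / S[i].sum())
                for i in range(S.shape[0])]
        return [(i, int(j)) for i, j in enumerate(cols)]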

1.3 Hypergraphs

A hypergraph H = (V, E) is defined as a set of vertices V and a set of hyperedges E. Each hyperedge is a set of vertices. A matching in a hypergraph is a set of disjoint hyperedges.

A hypergraph H = (V, E) is called d-partite and d-uniform if V = V1 ∪ · · · ∪ Vd with disjoint Vi's, and every hyperedge contains a single vertex from each Vi. In Chapter 6, we investigate the problem of finding matchings in hypergraphs, formally defined as follows.

Problem 4 (Maximum matching in d-partite d-uniform hypergraphs). Given a d-partite d-uniform hypergraph, find a matching of maximum size.

The problem is NP-complete for d ≥ 3; the 3-partite case is called the Max-3-DM problem [126] and is an important problem in combinatorial optimization. It is a special case of the d-Set-Packing problem [110]. It has been shown that d-Set-Packing is hard to approximate within a factor of O(d / log d) [110]. The maximum/perfect set packing problem has many applications, including combinatorial auctions [98] and personnel scheduling [91]. Matchings in hypergraphs can also be used in the coarsening phase of multilevel hypergraph partitioning tools [40]. In particular, d-uniform and d-partite hypergraphs arise in modeling tensors [134], and heuristics for Problem 4 can be used to partition these hypergraphs by plugging them in the current state of the art partitioning tools. We propose heuristics for Problem 4. We first generalize some graph matching heuristics, then propose a novel heuristic based on tensor scaling to improve the quality via judicious hyperedge selections. Experiments on random, synthetic, and real-life hypergraphs show that this new heuristic is highly practical and superior to the others on finding a matching with large cardinality.
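
A baseline in the spirit of the greedy heuristics that Chapter 6 generalizes is shown below (a sketch with names of our own choosing); the scaling-based heuristic differs mainly in how the next hyperedge is chosen.

    def greedy_hypergraph_matching(hyperedges):
        # hyperedges is an iterable of vertex tuples; vertices from different
        # parts of a d-partite hypergraph must carry distinct identifiers.
        # A hyperedge is taken whenever it is disjoint from those already taken.
        used, matching = set(), []
        for e in hyperedges:
            if all(v not in used for v in e):
                matching.append(e)
                used.update(e)
        return matching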

1.4 Sparse matrices

Bold upper case Roman letters are used for matrices, as in A. Matrix elements are shown with the corresponding lowercase letters, for example aij. Matrix sizes are sometimes shown in the lower right corner, e.g., A_{I×J}. Matlab notation is used to refer to the entire rows and columns of a matrix, e.g., A(i, :) and A(:, j) refer to the ith row and jth column of A, respectively. A permutation matrix is a square {0, 1} matrix with exactly one nonzero in each row and each column. A sub-permutation matrix is a square {0, 1} matrix with at most one nonzero in a row or a column.


A graph G = (V, E) can be represented as a sparse matrix AG, called the adjacency matrix. In this matrix, an entry aij = 1 if the vertices vi and vj in G are incident on a common edge, and aij = 0 otherwise. If G is bipartite with V = VR ∪ VC, we identify each vertex in VR with a unique row of AG and each vertex in VC with a unique column of AG. If G is not bipartite, a vertex is identified with a unique row-column pair. A directed graph GD = (V, E) with vertex set V and edge set E can be associated with an n × n sparse matrix A. Here |V| = n, and for each aij ≠ 0, where i ≠ j, we have a directed edge from vi to vj.

We also associate graphs with matrices. For a given m × n matrix A, we can associate a bipartite graph G = (VR ∪ VC, E) so that the zero-nonzero pattern of A is equivalent to AG. If the matrix is square and has a symmetric zero-nonzero pattern, we can similarly associate an undirected graph by ignoring the diagonal entries. Finally, if the matrix A is square and has an unsymmetric zero-nonzero pattern, we can associate a directed graph GD = (V, E), where (vi, vj) ∈ E iff aij ≠ 0.

An n × n matrix A ≠ 0 is said to have support if there is a perfect matching in the associated bipartite graph. An n × n matrix A is said to have total support if each edge in its bipartite graph can be put into a perfect matching. A square sparse matrix is called irreducible if its directed graph is strongly connected. A square sparse matrix A is called fully indecomposable if, for a permutation matrix Q, the matrix B = AQ has a zero-free diagonal and the directed graph associated with B is irreducible. Fully indecomposable matrices have total support, but a matrix with total support could be a block diagonal matrix where each block is fully indecomposable. For more formal definitions of support, total support, and full indecomposability, see for example the book by Brualdi and Ryser [34, Ch. 3 and Ch. 4].

Let A = [aij] ∈ R^{n×n}. The matrix A is said to be doubly stochastic if aij ≥ 0 for all i, j and Ae = A^T e = e, where e is the vector of all ones. By Birkhoff's Theorem [25], there exist α1, α2, . . . , αk ∈ (0, 1) with ∑_{i=1}^{k} αi = 1 and permutation matrices P1, P2, . . . , Pk such that

    A = α1 P1 + α2 P2 + · · · + αk Pk .    (1.1)

This representation is also called the Birkhoff-von Neumann (BvN) decomposition. We refer to the scalars α1, α2, . . . , αk as the coefficients of the decomposition.
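
To make the decomposition (1.1) concrete, the sketch below repeatedly extracts a permutation from the support of the remaining matrix and subtracts it, scaled by its smallest entry along the permutation. This mirrors the overall structure of the greedy heuristic studied in Chapter 5 but not its exact permutation-selection rule; it assumes SciPy and a dense doubly stochastic input, and the names are ours.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import maximum_bipartite_matching

    def bvn_decompose(A, tol=1e-12):
        # By Birkhoff's theorem, the nonzero pattern of a doubly stochastic
        # matrix always contains a permutation, so the matching below is perfect.
        A = np.array(A, dtype=float)
        n = A.shape[0]
        coeffs, perms = [], []
        while A.max() > tol:
            pattern = csr_matrix((A > tol).astype(int))
            perm = maximum_bipartite_matching(pattern, perm_type='column')
            alpha = A[np.arange(n), perm].min()   # coefficient of this permutation
            coeffs.append(float(alpha))
            perms.append(perm.copy())             # perm[i] = column matched to row i
            A[np.arange(n), perm] -= alpha        # at least one entry becomes zero
        return coeffs, perms

Since each step zeroes at least one entry, the loop stops after at most as many steps as there are nonzeros.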

Doubly stochastic matrices and their associated BvN decompositions have been used in several operations research problems and applications. Classical examples are concerned with allocating communication resources, where an input traffic is routed to an output traffic in stages [43]. Each routing stage is a (sub-)permutation matrix and is used for handling a disjoint set of communications. The number of stages corresponds to the number of (sub-)permutation matrices. A recent variant of this allocation problem arises as a routing problem in data centers [146, 163]. Heuristics for the BvN decomposition have been used in designing business strategies for e-commerce retailers [149]. A display matrix (which shows what needs to be displayed in total) is constructed, which happens to be doubly stochastic. Then a BvN decomposition of this matrix is obtained, in which each permutation matrix used is a display shown to a shopper. A small number of permutation matrices is preferred; furthermore, only slight changes between consecutive permutation matrices are desired (we do not deal with this last issue in this document).

The BvN decomposition of a doubly stochastic matrix as a convex combination of permutation matrices is not unique in general. The Marcus-Ree Theorem [170] states that there is always a decomposition with k ≤ n² − 2n + 2 permutation matrices for dense n × n matrices. Brualdi and Gibson [33] and Brualdi [32] show that for a fully indecomposable n × n sparse matrix with τ nonzeros, one has the equivalent expression k ≤ τ − 2n + 2. In Chapter 5, we investigate the problem of finding the minimum number k of permutation matrices in the representation (1.1). More formally, the MinBvNDec problem is defined as follows.

Problem 5 (MinBvNDec: A BvN decomposition with the minimum number of permutation matrices). Given a doubly stochastic matrix A, find a BvN decomposition (1.1) of A with the minimum number k of permutation matrices.

Brualdi [32, p. 197] investigates the same problem and concludes that this is a difficult problem. We continue along this line and show that the MinBvNDec problem is NP-hard. We also propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries, and use this generalization for designing preconditioners for iterative linear system solvers.

1.5 Sparse tensors

Tensors are multidimensional arrays, generalizing matrices to higher orders. We use calligraphic font to refer to tensors, e.g., X. Let X be a d-dimensional tensor whose size is n1 × · · · × nd. As in matrices, an element of a tensor is denoted by a lowercase letter and subscripts corresponding to the indices of the element, e.g., xi1···id, where ij ∈ {1, . . . , nj}. A marginal of a tensor is a (d−1)-dimensional slice of a d-dimensional tensor, obtained by fixing one of its indices. A d-dimensional tensor where the entries in each of its marginals sum to one is called d-stochastic. In a d-stochastic tensor, all dimensions necessarily have the same size n. A d-stochastic tensor where each marginal contains exactly one nonzero entry (equal to one) is called a permutation tensor.

A d-partite, d-uniform hypergraph H = (V1 ∪ · · · ∪ Vd, E) corresponds naturally to a d-dimensional tensor X, by associating each vertex class with a dimension of X. Let |Vi| = ni. Let X ∈ {0, 1}^{n1×···×nd} be a tensor with nonzero elements xv1···vd corresponding to the edges (v1, . . . , vd) of H. In this case, X is called the adjacency tensor of H.

Tensors are used in modeling multifactor or multirelational datasets, such as product reviews at an online store [21], natural language processing [123], and multi-sensor data in signal processing [49], among many others [142]. Tensor decomposition algorithms are used as an important tool to understand the tensors and glean hidden or latent information. One of the most common tensor decomposition approaches is the Candecomp/Parafac (CP) formulation, which approximates a given tensor as a sum of rank-one tensors. More precisely, for a given positive integer R, called the decomposition rank, the CP decomposition approximates a d-dimensional tensor by a sum of R outer products of column vectors (one for each dimension). In particular, for a three dimensional I × J × K tensor X, CP asks for a rank-R approximation X ≈ ∑_{r=1}^{R} ar ∘ br ∘ cr, with vectors ar, br, and cr of size I, J, and K, respectively. Here ∘ denotes the outer product of the vectors, and hence xijk ≈ ∑_{r=1}^{R} ar(i) · br(j) · cr(k). By aggregating these column vectors, one forms the matrices A, B, and C of size I × R, J × R, and K × R, respectively. The matrices A, B, and C are called the factor matrices.

The computational core of the most common CP decomposition methods is an operation called the matricized tensor-times-Khatri-Rao product (Mttkrp). For the same tensor X, this core operation can be described succinctly as

    foreach xijk ∈ X do
        MA(i, :) ← MA(i, :) + xijk [B(j, :) ∗ C(k, :)]    (1.2)

where B and C are input matrices of sizes J × R and K × R, respectively, and MA is the output matrix of size I × R, which is initialized to zero before the loop. For each tensor nonzero xijk, the jth row of the matrix B is multiplied element-wise with the kth row of the matrix C, the result is scaled with the tensor entry xijk, and added to the ith row of MA. In the well-known CP decomposition algorithm based on alternating least squares (CP-ALS), the two factor matrices B and C are initialized at the beginning; then the factor matrix A is computed by multiplying MA with the (pseudo-)inverse of the Hadamard product of B^T B and C^T C. Since B^T B and C^T C are of size R × R, their multiplication and inversion do not pose any computational challenges. Afterwards, the same steps are performed to compute a new B and then a new C, and this process of computing the new factors one at a time, by assuming the others fixed, is repeated iteratively. In Chapter 3, we investigate how to efficiently parallelize the Mttkrp operations within the CP-ALS algorithm on distributed memory systems. In particular, we investigate the following problem.

Problem 6 (Parallelizing CP decomposition). Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

More precisely, in Chapter 3, we describe parallel implementations of CP-ALS and how the parallelization problem can be cast as a partitioning problem in hypergraphs. In this document, we do not develop heuristics for the hypergraph partitioning problem; we describe hypergraphs modeling the computations and make use of the well-established tools for partitioning these hypergraphs. The Mttkrp operation also arises in the gradient descent based methods [2] to compute CP decomposition. Furthermore, it has been implemented as a stand-alone routine [17] to enable algorithm development for variants of the core methods [158]. That is why it has been investigated for efficient execution in different computational frameworks such as Matlab [11, 18], MapReduce [123], shared memory [206], and distributed memory [47].
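
A direct, sequential transcription of the loop (1.2) for a third order tensor held in coordinate format might look as follows (a sketch with hypothetical argument names, assuming NumPy); the parallel question of Chapter 3 is how to split the nonzeros, and hence these row updates, over processors.

    import numpy as np

    def mttkrp_mode1(coords, vals, B, C, I):
        # coords holds the (i, j, k) indices of the nonzeros and vals their values;
        # B and C are the J x R and K x R factor matrices.  Each nonzero adds the
        # elementwise product of a row of B and a row of C, scaled by the value,
        # to one row of the I x R output.
        MA = np.zeros((I, B.shape[1]))
        for (i, j, k), x in zip(coords, vals):
            MA[i, :] += x * (B[j, :] * C[k, :])
        return MA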

Consider again the Mttkrp operation and the execution of the operations as shown in the body of the displayed for-loop (1.2). As seen in this loop, the rows of the factor matrices and the output matrix are accessed many times, according to the sparsity pattern of the tensor X. In a cache-based system, the computations will be much more efficient if we take advantage of spatial and temporal locality. If we can reorder the nonzeros such that the multiplications involving a given row (of MA, B, and C) are one after another, this would be achieved. We need this to hold when executing the Mttkrps for computing the new factor matrices B and C as well. The issue is more easily understood in terms of the sparse matrix–vector multiplication operation y ← Ax. If we have two nonzeros in the same row, aij and aik, we would prefer an ordering that maps j and k next to each other and performs the associated scalar multiplications aij′ xj′ and aik′ xk′ one after the other, where j′ is the new index id for j and k′ is the new index id for k. Furthermore, we would like the same sort of locality for all indices, and for the multiplication x ← A^T y as well. We identify two related problems here. One of them is the choice of data structure for taking advantage of shared indices, and the other is to reorder the indices so that nonzeros sharing an index in a dimension have close-by indices in the other dimensions. In this document, we do not propose a data structure; our aim is to reorder the indices such that the locality is improved for three common data structures used in the current state of the art. We summarize those three common data structures in Chapter 8 and investigate the following ordering problem. Since the overall problem is based on the data structure choice, we state the problem in a general language.

Problem 7 (Tensor ordering). Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

If we look at the same problem for matrices, the problem can be cast as ordering the rows and the columns of a given matrix so that the nonzeros are clustered around the diagonal. This is also what we seek in the tensor case, where different data structures can benefit from the same ordering thanks to different effects. In Chapter 8, we investigate this problem algorithmically and develop fast heuristics. While the problem in the case of sparse matrices has been investigated from different angles, to the best of our knowledge, ours is the first study proposing general ordering tools for tensors, analogues of the well-known and widely used sparse matrix ordering heuristics such as Cuthill-McKee [55] and reverse Cuthill-McKee [95].
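
For the matrix analogue, the locality goal can be made tangible with reverse Cuthill-McKee as implemented in SciPy: the bandwidth (the largest distance of a nonzero to the diagonal) typically shrinks after the symmetric permutation. This sketch (with our own function name) only illustrates that goal; it is not one of the tensor methods of Chapter 8.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    def bandwidth_before_after(A):
        # A is a sparse matrix with a symmetric pattern; report the bandwidth
        # before and after the reverse Cuthill-McKee ordering.
        A = csr_matrix(A)
        rows, cols = A.nonzero()
        p = reverse_cuthill_mckee(A, symmetric_mode=True)
        newpos = np.empty(A.shape[0], dtype=int)
        newpos[p] = np.arange(A.shape[0])       # new position of each old index
        before = int(np.abs(rows - cols).max())
        after = int(np.abs(newpos[rows] - newpos[cols]).max())
        return before, after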

Part I

Partitioning


Chapter 2

Multilevel algorithms for acyclic partitioning of directed acyclic graphs

In this chapter, we address the directed acyclic graph partitioning (DAGP) problem. Most of the terms and definitions are standard, which we summarized in Section 1.1. The formal problem definition is repeated below for convenience.

DAG partitioning problem. Given a DAG G = (V, E) and an imbalance parameter ε, find an acyclic k-way partition V1, . . . , Vk of V such that the balance constraints

    w(Vi) ≤ (1 + ε) · w(V) / k

are satisfied for 1 ≤ i ≤ k, and the edge cut is minimized.

We adopt the multilevel approach [35, 113], with the coarsening, initial partitioning, and refinement phases, for acyclic partitioning of DAGs. We propose heuristics for these three phases (Sections 2.2.1, 2.2.2, and 2.2.3, respectively), which guarantee acyclicity of the partitions at all phases and maintain a DAG at every level. We strove to have fast heuristics at the core. With these characterizations, the coarsening phase requires new algorithmic/theoretical reasoning, while the initial partitioning and refinement heuristics are direct adaptations of the standard methods used in undirected graph partitioning, with some differences worth mentioning. We discuss only the bisection case, as we were able to improve the direct k-way algorithms we proposed before [114] by using the bisection heuristics recursively.

The acyclicity constraint on the partitions forbids the use of the state of the art undirected graph partitioning tools. This has been recognized before, and those tools were put aside [114, 175]. While this is true, one can still try to make use of the existing undirected graph partitioning tools [113, 130, 186, 199], as they have been very well engineered. Let us assume that we have partitioned a DAG with an undirected graph partitioning tool into two parts by ignoring the directions. It is easy to detect if the partition is cyclic, since all the edges need to go from the first part to the second one. If a partition is cyclic, we can easily fix it as follows. Let v be a vertex in the second part; we can move all vertices to which there is a path from v into the second part. This procedure breaks any cycle containing v, and hence the partition becomes acyclic. However, the edge cut may increase and the partitions can be unbalanced. To solve the balance problem and reduce the cut, we can apply move-based refinement algorithms. After this step, the final partition meets the acyclicity and balance conditions. Depending on the structure of the input graph, it could also be a good initial partition for reducing the edge cut. Indeed, one of our most effective schemes uses an undirected graph partitioning algorithm to create a (potentially cyclic) partition, fixes the cycles in the partition, and refines the resulting acyclic partition with a novel heuristic to obtain an initial partition. We then integrate this partition within the proposed coarsening approaches to refine it at different granularities.
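
The fixing step just described admits a very short implementation; the sketch below (with names of our own) moves every vertex reachable from the second part into the second part, leaving the balance and the cut to the subsequent refinement.

    from collections import deque

    def fix_acyclicity(succ, part):
        # succ[u] lists the immediate successors of u; part maps vertices to 0 or 1,
        # where edges are meant to go from part 0 to part 1.  Any vertex reachable
        # from a part-1 vertex is pulled into part 1, so no edge can point back
        # from part 1 to part 0 afterwards.
        queue = deque(v for v in part if part[v] == 1)
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                if part[v] == 0:
                    part[v] = 1
                    queue.append(v)
        return part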

2.1 Related work

Fauzia et al. [87] propose a heuristic for the acyclic partitioning problem to optimize data locality when analyzing DAGs. To create partitions, the heuristic categorizes a vertex as ready to be assigned to a partition when all of the vertices it depends on have already been assigned. Vertices are assigned to the current partition set until the maximum number of vertices that would be "active" during the computation of the part reaches a specified limit, which is the cache size in their application. This implies that a part size is not limited by the sum of the total vertex weights in that part, but is a complex function that depends on an external schedule (order) of the vertices. This differs from our problem, as we limit the size of each part by the total sum of the weights of the vertices in that part.

Kernighan [137] proposes an algorithm to find a minimum edge-cut partition of the vertices of a graph into subsets of size greater than a lower bound and smaller than an upper bound. The partition needs to use a fixed vertex sequence that cannot be changed. Indeed, Kernighan's algorithm takes a topological order of the vertices of the graph as an input and partitions the vertices such that all vertices in a subset constitute a continuous block in the given topological order. This procedure is optimal for a given, fixed topological order and has a run time proportional to the number of edges in the graph, if the part weights are taken as constant. We used a modified version of this algorithm as a heuristic in the earlier version of our work [114].

Cong et al. [54] describe two approaches for obtaining acyclic partitions of directed Boolean networks modeling circuits. The first one is a single-level Fiduccia-Mattheyses (FM)-based approach. In this approach, Cong et al. generate an initial acyclic partition by splitting the list of the vertices (in a topological order) from left to right into k parts such that the weight of each part does not violate the bound. The quality of the results is then improved with a k-way variant of the FM heuristic [88], taking the acyclicity constraint into account. Our previous work [114] employs a similar refinement heuristic. The second approach of Cong et al. is a two-level heuristic: the initial graph is first clustered with a special decomposition, and then it is partitioned using the first heuristic.

In a recent paper [175], Moreira et al. focus on an imaging and computer vision application on embedded systems and discuss acyclic partitioning heuristics. They propose a single level approach in which an initial acyclic partitioning is obtained using a topological order. To refine the partitioning, they propose four local search heuristics which respect the balance constraint and maintain the acyclicity of the partition. Three heuristics pick a vertex and move it to an eligible part if and only if the move improves the cut. These three heuristics differ in choosing the set of eligible parts for each vertex; some are very restrictive, and some allow arbitrary target parts as long as acyclicity is maintained. The fourth heuristic tentatively realizes the moves that increase the cut, in order to escape from a possible local minimum. It has been reported that this heuristic delivers better results than the others. In a follow-up paper, Moreira et al. [176] discuss a multilevel graph partitioner and an evolutionary algorithm based on this multilevel scheme. Their multilevel scheme starts with a given acyclic partition. Then the coarsening phase contracts edges that are in the same part until there is no edge to contract. Here, matching-based heuristics from undirected graph partitioning tools are used without taking the directions of the edges into account. Therefore, the coarsening phase can create cycles in the graph; however, the induced partitions are never cyclic. Then an initial partition is obtained, which is refined during the uncoarsening phase with move-based heuristics. In order to guarantee acyclic partitions, the vertices that lie in cycles are not moved. In a systematic evaluation of the proposed methods, Moreira et al. note that there are many local minima and suggest using relaxed constraints in the multilevel setting. The proposed methods have high run time, as the evolutionary method of Moreira et al. is not concerned with this issue. Improvements with respect to the earlier work [175] are reported.

We propose methods to use an undirected graph partitioner to guide the multilevel partitioner. We focus on partitioning the graph in two parts, since one can handle the general case with a recursive bisection scheme. We also propose new coarsening, initial partitioning, and refinement methods specifically designed for the 2-partitioning problem. Our multilevel scheme maintains acyclic partitions and graphs through all the levels.

2.2 Directed multilevel graph partitioning

Here, we summarize the proposed methods for the three phases of the multilevel scheme.

2.2.1 Coarsening

In this phase, we obtain smaller DAGs by coalescing the vertices, level by level. This phase continues until the number of vertices becomes smaller than a specified bound, or the reduction on the number of vertices from one level to the next one is lower than a threshold. At each level ℓ, we start with a finer acyclic graph Gℓ, compute a valid clustering Cℓ ensuring the acyclicity, and obtain a coarser acyclic graph Gℓ+1. While our previous work [114] discussed matching-based algorithms for coarsening, we present agglomerative clustering-based variants here. The new variants supersede the matching-based ones. Unlike the standard undirected graph case, in DAG partitioning not all vertices can be safely combined. Consider a DAG with three vertices a, b, c and three edges (a, b), (b, c), (a, c). Here, the vertices a and c cannot be combined, since that would create a cycle. We say that a set of vertices is contractible (all its vertices are matchable) if unifying them does not create a cycle. We now present a general theory about finding clusters without forming cycles, after giving some definitions.

Definition 2.1 (Clustering). A clustering of a DAG is a set of disjoint subsets of vertices. Note that we do not make any assumptions on whether the subsets are connected or not.

Definition 2.2 (Coarse graph). Given a DAG G and a clustering C of G, we let G|C denote the coarse graph created by contracting all sets of vertices of C.

The vertices of the coarse graph are the clusters in C. If (u, v) ∈ G for two vertices u and v that are located in different clusters of C, then G|C has a (directed) edge from the vertex corresponding to u's cluster to the vertex corresponding to v's cluster.

Definition 2.3 (Feasible clustering). A feasible clustering C of a DAG G is a clustering such that G|C is acyclic.

Theorem 2.1. Let G = (V, E) be a DAG. For u, v ∈ V and (u, v) ∈ E, the coarse graph G|{u,v} is acyclic if and only if every path from u to v in G contains the edge (u, v).


Theorem 2.1, whose proof is in the journal version of this chapter [115], can be extended to a set of vertices by noting that, this time, all paths connecting two vertices of the set should contain only the vertices of the set. Neither the theorem nor its extension implies an efficient algorithm, as it requires at least one transitive reduction. Furthermore, it does not describe a condition about two clusters forming a cycle, even if both are individually contractible. In order to address both of these issues, we put a constraint on the vertices that can form a cluster, based on the following definition.

Definition 2.4 (Top level value). For a DAG G = (V, E), the top level value of a vertex u ∈ V is the length of the longest path from a source of G to that vertex. The top level values of all vertices can be computed in a single traversal of the graph with a complexity O(|V| + |E|). We use top[u] to denote the top level of the vertex u.

By restricting the set of edges considered in the clustering to the edges (u, v) ∈ E such that top[u] + 1 = top[v], we ensure that no cycles are formed by contracting a unique cluster (the condition identified in Theorem 2.1 is satisfied). Let C be a clustering of the vertices. Every edge in a cluster of C being contractible is a necessary condition for C to be feasible, but not a sufficient one. More restrictions on the edges of vertices inside the clusters should be found to ensure that C is feasible. We propose three coarsening heuristics based on clustering sets of more than two vertices, whose pairwise top level differences are always zero or one.
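To make the first step concrete, the following sketch (an illustrative assumption about data structures, not the dagP implementation itself) computes the top level values with a single Kahn-style traversal and then collects the edges (u, v) with top[u] + 1 = top[v], which are the only candidates considered for clustering.

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

// A minimal sketch: DAG in adjacency-list form, vertices 0..n-1.
// Computes top[u] = length of the longest path from any source to u,
// then collects the edges (u, v) with top[u] + 1 == top[v], which are
// the only edges considered when forming clusters.
std::vector<std::pair<int,int>> eligibleEdges(
    const std::vector<std::vector<int>>& succ) {
  const int n = static_cast<int>(succ.size());
  std::vector<int> indeg(n, 0), top(n, 0);
  for (int u = 0; u < n; ++u)
    for (int v : succ[u]) ++indeg[v];

  std::queue<int> ready;                 // sources have top level 0
  for (int u = 0; u < n; ++u)
    if (indeg[u] == 0) ready.push(u);

  while (!ready.empty()) {               // single O(|V|+|E|) traversal
    int u = ready.front(); ready.pop();
    for (int v : succ[u]) {
      top[v] = std::max(top[v], top[u] + 1);
      if (--indeg[v] == 0) ready.push(v);
    }
  }

  std::vector<std::pair<int,int>> edges;
  for (int u = 0; u < n; ++u)
    for (int v : succ[u])
      if (top[u] + 1 == top[v]) edges.push_back({u, v});
  return edges;
}
```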

CoTop: Acyclic clustering with forbidden edges

To have an efficient clustering heuristic, we restrict ourselves to static information computable in linear time. As stated in the introduction of this section, we rely on the top level difference of one (or less) for all vertices in the same cluster, and an additional condition to ensure that there will be no cycles when a number of clusters are contracted simultaneously. In Theorem 2.2 below, we give two sufficient conditions for a clustering to be feasible (that is, the graphs at all levels are DAGs); the proof is in the associated journal paper [115].

Theorem 2.2 (Correctness of the proposed clustering). Let G = (V, E) be a DAG and C = {C1, ..., Ck} be a clustering. If C is such that

• for any cluster Ci, for all u, v ∈ Ci, |top[u] − top[v]| ≤ 1;

• for two different clusters Ci and Cj, and for all u ∈ Ci and v ∈ Cj, either (u, v) ∉ E or top[u] ≠ top[v] − 1;

then the coarse graph G|C is acyclic.


We design a heuristic based on Theorem 2.2. This heuristic visits all vertices, which are singletons at the beginning, in an order, and adds the visited vertex to a cluster if the criteria mentioned in Theorem 2.2 are met; if not, the vertex stays as a singleton. Among those clusters satisfying the mentioned criteria, the one which saves the largest edge weight (incident on the vertices of the cluster and the vertex being visited) is chosen as the target cluster. We discuss a set of bookkeeping operations in order to implement this heuristic in linear, O(|V| + |E|), time [115, Algorithm 1].

CoCyc: Acyclic clustering with cycle detection

We now propose a less restrictive clustering algorithm which also maintains the acyclicity. We again rely on the top level difference of one (or less) for all vertices in the same cluster. Knowing this invariant, when a new vertex is added to a cluster, a cycle-detection algorithm checks that no cycles are formed when all the clusters are contracted simultaneously. This algorithm usually does not traverse the entire graph, as it also exploits the fact that the top level difference within a cluster is at most one; nonetheless, detecting such cycles could, in the worst case, require a full traversal of the graph.

From Theorem 2.2, we know that if adding a vertex to a cluster whose vertices' top level values are t and t + 1 creates a cycle in the contracted graph, then this cycle goes through the vertices with top level values t or t + 1. Thus, when considering the addition of a vertex u to a cluster C containing v, we check potential cycle formations by traversing the graph starting from u in a breadth-first search (BFS) manner. Let t denote the minimum top level in C. At a vertex w, we normally add a successor y of w to a queue (as in BFS) if |top(y) − t| ≤ 1; if w is in the same cluster as one of its predecessors x, we also add x to the queue if |top(x) − t| ≤ 1. We detect a cycle if, at some point in the traversal, a vertex from cluster C is reached; if not, the cluster can be extended by adding u. In the worst case, the cycle detection algorithm completes a full graph traversal, but in practice it stops quickly and does not introduce a significant overhead. The details of this clustering scheme, whose worst-case run time complexity is O(|V|(|V| + |E|)), can be found in the original paper [115, Algorithm 2].

CoHyb: Hybrid acyclic clustering

The cycle detection based algorithm can suffer from quadratic run time for vertices with large in-degrees or out-degrees. To avoid this, we design a hybrid acyclic clustering which uses the relaxed clustering strategy with cycle detection, CoCyc, by default, and switches to the more restrictive clustering strategy CoTop for large degree vertices. We define a limit on the degree of a vertex (typically √|V|/10) for calling it large degree. When considering an edge (u, v), where top[u] + 1 = top[v], if the degrees of u and v do not exceed the limit, we use the cycle detection algorithm to determine if we can contract the edge; otherwise, if the out-degree of u or the in-degree of v is too large, the edge will be contracted if it is allowed. The complexity of this algorithm is in between those of CoTop and CoCyc, and it will likely avoid the quadratic behavior in practice.

2.2.2 Initial partitioning

After the coarsening phase, we compute an initial acyclic partitioning of the coarsest graph. We present two heuristics. One of them is akin to the greedy graph growing method used in the standard graph/hypergraph partitioning methods. The second one uses an undirected partitioning and then fixes the acyclicity of the partitions. Throughout this section, we use (V0, V1) to denote the bisection of the vertices of the coarsest graph, and ensure that there is no edge from the vertices in V1 to those in V0.

Greedy directed graph growing

One approach to compute a bisection of a directed graph is to design a greedy algorithm that moves vertices from one part to another using local information. We start with all vertices in V1 and move vertices towards V0 by using heaps. At any time, the vertices that can be moved to V0 are in the heap. These vertices are those whose in-neighbors are all in V0. Initially, only the sources are in the heap, and when all the in-neighbors of a vertex v are moved to V0, v is inserted into the heap. We separate this process into two phases. In the first phase, the key values of the vertices in the heap are set to the weighted sum of their incoming edges, and the ties are broken in favor of the vertex closer to the first vertex moved. The first phase continues until the first part has more than 0.9 of the maximum allowed weight (modulo the maximum weight of a vertex). In the second phase, the actual gain of a vertex is used. This gain is equal to the sum of the weights of the incoming edges minus the sum of the weights of the outgoing edges. In this phase, the ties are broken in favor of the heavier vertices. The second phase stops as soon as the required balance is obtained. The reason that we separated this heuristic into two phases is that, at the beginning, the gains are of no importance, and the more vertices become movable, the more flexibility the heuristic has. Yet, towards the end, the parts are fairly balanced, and using the actual gains should help keep the cut small.

Since the order of the parts is important, we also reverse the roles of the parts and the directions of the edges. That is, we put all vertices in V0 and move the vertices one by one to V1, when all out-neighbors of a vertex have been moved to V1. The proposed greedy directed graph growing heuristic returns the best of these two alternatives.
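The following condensed sketch illustrates the forward direction of this heuristic under simplifying assumptions (unit vertex weights, the second-phase gain used throughout, and no tie-breaking); it is meant to show the mechanics of the heap of movable vertices rather than the actual dagP code.

```cpp
#include <queue>
#include <utility>
#include <vector>

// Condensed sketch of greedy directed graph growing (forward direction only).
// Adjacency lists hold (neighbor, edge weight) pairs. part[v] == 1 initially
// for all v; vertices are moved to part 0 one by one until the weight bound.
std::vector<int> growV0(const std::vector<std::vector<std::pair<int,double>>>& succ,
                        const std::vector<std::vector<std::pair<int,double>>>& pred,
                        int maxPart0Weight) {
  const int n = static_cast<int>(succ.size());
  std::vector<int> part(n, 1), inV0(n, 0);
  auto gain = [&](int v) {                          // in-weight minus out-weight
    double g = 0.0;
    for (const auto& e : pred[v]) g += e.second;
    for (const auto& e : succ[v]) g -= e.second;
    return g;
  };
  std::priority_queue<std::pair<double,int>> heap;  // movable vertices only
  for (int v = 0; v < n; ++v)
    if (pred[v].empty()) heap.push({gain(v), v});   // sources are movable first
  int weight0 = 0;
  while (weight0 < maxPart0Weight && !heap.empty()) {
    int v = heap.top().second;
    heap.pop();
    part[v] = 0; ++weight0;
    for (const auto& e : succ[v])                   // a successor becomes movable
      if (++inV0[e.first] == static_cast<int>(pred[e.first].size()))
        heap.push({gain(e.first), e.first});        // ... once all its in-neighbors are in V0
  }
  return part;
}
```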


Undirected bisection and fixing acyclicity

In this heuristic, we partition the coarsest graph as if it were undirected, and then move the vertices from one part to another in case the partition is not acyclic. Let (P0, P1) denote the (not necessarily acyclic) bisection of the coarsest graph treated as if it were undirected.

The proposed approach arbitrarily designates P0 as V0 and P1 as V1. One way to fix the cycles is to move all ancestors of the vertices in V0 to V0, thereby guaranteeing that there is no edge from vertices in V1 to vertices in V0, making the bisection (V0, V1) acyclic. We do these moves in a reverse topological order. Another way to fix the acyclicity is to move all descendants of the vertices in V1 to V1, again guaranteeing an acyclic partition. We do these moves in a topological order. We then fix the possible imbalance with a refinement algorithm.
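A minimal sketch of the first fixing strategy (moving ancestors) is given below, assuming a topological order of the DAG is available; the names and the data layout are illustrative, not the dagP API.

```cpp
#include <vector>

// Fix a (possibly cyclic) undirected bisection: make part 0 closed under
// ancestors by sweeping the vertices in reverse topological order. After the
// sweep there is no edge from part 1 to part 0, so the bisection is acyclic.
// 'order' is a topological order of the DAG; part[v] is 0 or 1 on input.
void fixAcyclicityByAncestors(const std::vector<int>& order,
                              const std::vector<std::vector<int>>& succ,
                              std::vector<int>& part) {
  for (auto it = order.rbegin(); it != order.rend(); ++it) {
    int v = *it;
    for (int s : succ[v])
      if (part[s] == 0) { part[v] = 0; break; }   // v is an ancestor of a V0 vertex
  }
}
```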

Note that we can also initially designate P1 as V0 and P0 as V1, and can again fix a potential cycle in two different ways. We try all four of these choices and return the best partition (essentially returning the best of the four ways to fix the acyclicity of (P0, P1)).

2.2.3 Refinement

This phase projects the partition obtained for a coarse graph to the next finer one and refines the partition by vertex moves. As in the standard refinement methods, the proposed heuristic is applied in a number of passes. Within a pass, we repeatedly select the vertex with the maximum move gain among those that can be moved. We tentatively realize this move if the move maintains or improves the balance. Then, the most profitable prefix of vertex moves is realized at the end of the pass. As usual, we allow the vertices to move only once in a pass. We use heaps with the gain of moves as the key value, where we keep only movable vertices. We call a vertex movable if moving it to the other part does not create a cyclic partition. We use the notation (V0, V1) to designate the acyclic bisection with no edge from vertices in V1 to vertices in V0. This means that for a vertex to move from part V0 to part V1, one of two conditions should be met: (i) either all its out-neighbors should be in V1, or (ii) the vertex has no out-neighbors at all. Similarly, for a vertex to move from part V1 to part V0, one of two conditions should be met: (i) either all its in-neighbors should be in V0, or (ii) the vertex has no in-neighbors at all. This is an adaptation of the boundary Fiduccia-Mattheyses heuristic [88] to directed graphs. The notion of movability being more restrictive results in an important simplification. The gain of moving a vertex v from V0 to V1 is

    ∑_{u ∈ Succ[v]} w(v, u)  −  ∑_{u ∈ Pred[v]} w(u, v),        (2.1)

and the negative of this value when moving it from V1 to V0. This means that the gain of a vertex is static: once a vertex is inserted in the heap with the key value (2.1), its key value is never updated. A move could render some vertices unmovable; if they were in the heap, then they should be deleted. Therefore, the heap data structure needs to support insert, delete, and extract-max operations only.
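The movability test and the static gain of (2.1) can be sketched as follows, assuming an edge-weighted adjacency-list representation (illustrative names, not the dagP API):

```cpp
#include <utility>
#include <vector>

// part[v] is 0 or 1; (V0, V1) is acyclic with no edge from V1 to V0.
// A vertex in V0 is movable to V1 only if all its out-neighbors are already
// in V1 (or it has none); the symmetric condition holds for moves to V0.
bool movable(int v, const std::vector<int>& part,
             const std::vector<std::vector<std::pair<int,double>>>& succ,
             const std::vector<std::vector<std::pair<int,double>>>& pred) {
  const auto& nbrs = (part[v] == 0) ? succ[v] : pred[v];
  const int required = 1 - part[v];
  for (const auto& e : nbrs)
    if (part[e.first] != required) return false;  // a neighbor is still on the wrong side
  return true;
}

// Static gain of Eq. (2.1) for a move from V0 to V1; negate it for V1 to V0.
double moveGain(int v,
                const std::vector<std::vector<std::pair<int,double>>>& succ,
                const std::vector<std::vector<std::pair<int,double>>>& pred) {
  double g = 0.0;
  for (const auto& e : succ[v]) g += e.second;    // w(v, u) terms
  for (const auto& e : pred[v]) g -= e.second;    // w(u, v) terms
  return g;
}
```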

2.2.4 Constrained coarsening and initial partitioning

There are a number of highly successful undirected graph partitioning tools [130, 186, 199]. They are not immediately usable for our purposes, as the partitions can be cyclic. Fixing such partitions by moving vertices to break the cyclic dependencies among the parts can increase the edge cut dramatically (with respect to the undirected cut). Consider, for example, the n × n grid graph, where the vertices are at integer positions for i = 1, ..., n and j = 1, ..., n, and a vertex at (i, j) is connected to (i′, j′) when |i − i′| = 1 or |j − j′| = 1, but not both. There is an acyclic orientation of this graph, called spiral ordering, as described in Figure 2.1 for n = 8. This spiral ordering defines a total order. When the directions of the edges are ignored, we can have a bisection with perfect balance by cutting only n = 8 edges with a vertical line. This partition is cyclic, and it can be made acyclic by putting all vertices numbered greater than 32 in the second part. This partition, which puts the vertices 1–32 in the first part and the rest in the second part, is the unique acyclic bisection with perfect balance for the associated directed acyclic graph. The edge cut in the directed version is 35, as seen in the figure (gray edges). In general, one has to cut n² − 4n + 3 edges for n ≥ 8: the blue vertices on the border (excluding the corners) have one edge directed to a red vertex; the interior blue vertices have two such edges; finally, the blue vertex labeled n²/2 has three such edges.

Let us investigate the quality of the partitions in practice. We used MeTiS [130] as the undirected graph partitioner on a dataset of 94 matrices (their details are in Section 2.3). The results are given in Figure 2.2. For this preliminary experiment, we partitioned the graphs into two with the maximum allowed load imbalance ε = 3%. The output of MeTiS is acyclic for only two graphs, and the geometric mean of the normalized edge cut is 0.0012. Figure 2.2a shows the normalized edge cut and the load imbalance after fixing the cycles, while Figure 2.2b shows the two measurements after meeting the balance criteria. A normalized edge cut value is computed by normalizing the edge cut with respect to the number of edges.

In both figures, the horizontal lines mark the geometric mean of the normalized edge cuts, and the vertical lines mark the 3% imbalance ratio. In Figure 2.2a, there are 37 instances in which the load balance after fixing the cycles is feasible. The geometric mean of the normalized edge cuts in this sub-figure is 0.0045, while in the other sub-figure it is 0.0049.


Figure 2.1: 8×8 grid graph whose vertices are ordered in a spiral way; a few of the vertices are labeled. All edges are oriented from a lower numbered vertex to a higher numbered one. There is a unique bipartition with 32 vertices in each side. The edges defining the total order are shown in red and blue, except the one from 32 to 33; the cut edges are shown in gray; other internal edges are not shown.

Figure 2.2: Normalized edge cut (normalized with respect to the number of edges) versus load balance: (a) after using an undirected graph partitioner and fixing the cycles; (b) after undirected partitioning, fixing the cycles, and ensuring balance with refinement.


Fixing the cycles increases the edge cut with respect to an undirected partitioning, but not catastrophically (only by 0.0045/0.0012 = 3.75 times in these experiments), and achieving balance after this step increases the cut only a little (it goes from 0.0045 to 0.0049). That is why we suggest using an undirected graph partitioner, fixing the cycles among the parts, and performing a refinement based method for load balancing as a good (initial) partitioner.

In order to refine the initial partition in a multilevel setting, we propose a scheme similar to the iterated multilevel algorithm used in the existing partitioners [40, 212]. In this scheme, first a partition P is obtained. Then, the coarsening phase starts to cluster vertices that were in the same part in P. After the coarsening, an initial partitioning is freely available by using the partition P on the coarsest graph. The refinement phase then can work as before. Moreira et al. [176] use this approach for the directed graph partitioning problem. To be more concrete, we first use an undirected graph partitioner, then fix the cycles as discussed in Section 2.2.2, and then refine this acyclic partition for balance with the proposed refinement heuristics of Section 2.2.3. We then use this acyclic partition for constrained coarsening and initial partitioning. We expect this scheme to be successful in graphs with many sources and targets, where the sources and targets can lie in any of the parts while the overall partition is acyclic. On the other hand, if a graph is such that its balanced acyclic partitions need to put the sources in one part and the targets in another part, then fixing the acyclicity may result in moving many vertices. This, in turn, will harm the edge cut found by the undirected graph partitioner.

2.3 Experiments

The partitioning tool presented (dagP) is implemented in the C/C++ programming languages. The experiments are conducted on a computer equipped with dual 2.1 GHz Xeon E5-2683 processors and 512 GB of memory. The source code and more information are available at http://tda.gatech.edu/software/dagP.

We have performed a large set of experiments on DAG instances coming from two sources. The first set of instances is from the Polyhedral Benchmark suite (PolyBench) [196]. The second set of instances is obtained from the matrices available in the SuiteSparse Matrix Collection (UFL) [59]. From this collection, we pick all the matrices satisfying the following properties: listed as binary, square, and having at least 100000 rows and at most 2^26 nonzeros. There were a total of 95 matrices at the time of experimentation, where two matrices (ids 1514 and 2294) had the same pattern. We discard the duplicate and use the remaining 94 matrices for the experiments. For each such matrix, we take the strict upper triangular part as the associated DAG instance whenever this part has more nonzeros than the lower triangular part; otherwise, we take the strict lower triangular part. All edges have unit cost, and all vertices have unit weight.


Graph        #vertices    #edges   max. deg.  avg. deg.  #sources  #targets
2mm              36500     62200         40      1.704      2100       400
3mm             111900    214600         40      1.918      3900       400
adi             596695   1059590     109760      1.776       843        28
atax            241730    385960        230      1.597     48530       230
covariance      191600    368775         70      1.925      4775      1275
doitgen         123400    237000        150      1.921      3400      3000
durbin          126246    250993        252      1.988       250       249
fdtd-2d         256479    436580         60      1.702      3579      1199
gemm           1026800   1684200         70      1.640     14600      4200
gemver          159480    259440        120      1.627     15360       120
gesummv         376000    500500        500      1.331    125250       250
heat-3d         308480    491520         20      1.593      1280       512
jacobi-1d       239202    398000        100      1.664       402       398
jacobi-2d       157808    282240         20      1.789      1008       784
lu              344520    676240         79      1.963      6400         1
ludcmp          357320    701680         80      1.964      6480         1
mvt             200800    320000        200      1.594     40800       400
seidel-2d       261520    490960         60      1.877      1600         1
symm            254020    440400        120      1.734      5680      2400
syr2k           111000    180900         60      1.630      2100       900
syrk            594480    975240         81      1.640      8040      3240
trisolv         240600    320000        399      1.330     80600         1
trmm            294570    571200         80      1.939      6570      4800

Table 2.1: DAG instances from PolyBench [196].

Since the proposed heuristics have a randomized behavior (the traversals used in the coarsening and refinement heuristics are randomized), we run them 10 times for each DAG instance and report the averages of these runs. We use performance profiles [69] to present the edge-cut results. A performance profile plot shows the probability that a specific method gives results within a factor θ of the best edge cut obtained by any of the methods compared in the plot. Hence, the higher and the closer a plot is to the y-axis, the better the method is.
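For illustration, a point of a performance profile curve can be computed from a table of edge-cut results as in the following sketch (an assumed data layout, not tied to any particular tool):

```cpp
#include <algorithm>
#include <vector>

// Illustration only: given cuts[m][i] = edge cut of method m on instance i,
// return the fraction of instances on which method m is within a factor
// theta of the best cut obtained by any of the compared methods.
double profilePoint(const std::vector<std::vector<double>>& cuts,
                    int m, double theta) {
  const int nInst = static_cast<int>(cuts[0].size());
  int count = 0;
  for (int i = 0; i < nInst; ++i) {
    double best = cuts[0][i];
    for (const auto& row : cuts) best = std::min(best, row[i]);
    if (cuts[m][i] <= theta * best) ++count;
  }
  return static_cast<double>(count) / nInst;
}
```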

We set the load imbalance parameter ε = 0.03 in the problem definition given at the beginning of the chapter for all experiments. The vertices are unit weighted; therefore, the imbalance is rarely an issue for a move-based partitioner.

Coarsening evaluation

We first evaluated the proposed coarsening heuristics (Section 5.1 of the associated paper [115]). The aim was to find an effective one to set as the default coarsening heuristic. Based on performance profiles of the edge cut, we concluded that, in general, the coarsening heuristics CoHyb and CoCyc behave similarly and are more helpful than CoTop in reducing the edge cut. We also analyzed the contraction efficiency of the proposed coarsening heuristics. It is important that the coarsening phase does not stop too early and that the coarsest graph is small enough to be partitioned efficiently. By investigating the maximum, the average, and the standard deviation of the vertex and edge weight ratios at the coarsest graph, and the average, the minimum, and the maximum number of coarsening levels observed for the two datasets, we saw that CoCyc and CoHyb behave again similarly and provide slightly better results than CoTop. Based on all these observations, we set CoHyb as the default coarsening heuristic, as it performs better than CoTop in terms of the final edge cut and is guaranteed to be more efficient than CoCyc in terms of run time.

Constrained coarsening and initial partitioning

We now investigate the effect of using undirected graph partitioners for coarsening and initial partitioning, as explained in Section 2.2.4. We compare three variants of the proposed multilevel scheme. All of them use the refinement described in Section 2.2.3 in the uncoarsening phase:

• CoHyb: this variant uses the hybrid coarsening heuristic described in Section 2.2.1 and the greedy directed graph growing heuristic described in Section 2.2.2 in the initial partitioning phase. This method does not use constrained coarsening.

• CoHyb_C: this variant uses an acyclic partition of the finest graph, obtained as outlined in Section 2.2.2, to guide the hybrid coarsening heuristic as described in Section 2.2.4, and uses the greedy directed graph growing heuristic in the initial partitioning phase.

• CoHyb_CIP: this variant uses the same constrained coarsening heuristic as the previous method, but inherits the fixed acyclic partition of the finest graph as the initial partitioning.

The comparison of these three variants is given in Figure 2.3 for the whole dataset. In this figure, we see that using the constrained coarsening is always helpful, as it clearly separates CoHyb_C and CoHyb_CIP from CoHyb after θ = 1.1. Furthermore, applying the constrained initial partitioning (on top of the constrained coarsening) brings tangible improvements.

In the light of the experiments presented here, we suggest the variant CoHyb_CIP for general problem instances.

CoHyb_CIP with respect to a single-level algorithm

We compare CoHyb_CIP (constrained coarsening and initial partitioning) with a single-level algorithm that uses an undirected graph partitioning, fixes the acyclicity, and refines the partitions. This last variant is denoted as UndirFix, and it is the algorithm described in Section 2.2.2. Both variants use the same initial partitioning approach, which utilizes MeTiS [130] as the undirected partitioner. The difference between UndirFix and CoHyb_CIP is the latter's ability to refine the initial partition at multiple levels.


Figure 2.3: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb_CIP), using the constrained coarsening and the greedy directed graph growing (CoHyb_C), and CoHyb (no constraints).

Figure 2.4: Performance profiles for the edge cut obtained by the proposed multilevel algorithm using the constrained coarsening and partitioning (CoHyb_CIP) and using the same approach without coarsening (UndirFix).

Figure 2.4 presents this comparison. The plots show that the multilevel scheme CoHyb_CIP outperforms the single-level scheme UndirFix at all appropriate ranges of θ, attesting to the importance of the multilevel scheme.

Comparison with existing work

Here, we compare our approach with the evolutionary graph partitioning approach developed by Moreira et al. [175], and briefly with our previous work [114].

Figure 2.5 shows how CoHyb_CIP and CoTop compare with the evolutionary approach in terms of the edge cut on the 23 graphs of the PolyBench dataset, for the number of partitions k ∈ {2, 4, 8, 16, 32}. We use the average edge cut value of 10 runs for CoTop and CoHyb_CIP, and the average values presented in [175] for the evolutionary algorithm.


Figure 2.5: Performance profiles for the edge cut obtained by CoHyb_CIP, CoTop, and Moreira et al.'s approach on the PolyBench dataset with k ∈ {2, 4, 8, 16, 32}.

As seen in the figure, the CoTop variant of the proposed multilevel approach obtains the best results on this specific dataset (all variants of the proposed approach outperform the evolutionary approach). In more detailed comparisons [115, Appendix A], we see that the variants CoHyb_CIP and CoTop of the proposed algorithm obtain strictly better results than the evolutionary approach in 78 and 75 instances (out of 115), respectively, when the average edge cuts are compared.

In more detailed comparisons [115, Appendix A], we see that CoHyb_CIP obtains 26% less edge cut than the evolutionary approach on average (geometric mean) when the average cuts are compared; when the best cuts are compared, CoHyb_CIP obtains 48% less edge cut. Moreover, CoTop obtains 37% less edge cut than the evolutionary approach when the average cuts are compared; when the best cuts are compared, CoTop obtains 41% less cut. In some instances, we see large differences between the average and the best results of CoTop and CoHyb_CIP. Combined with the observation that CoHyb_CIP yields better results in general, this suggests that the neighborhood structure can be improved (see the notion of the strength of a neighborhood [183, Section 19.6]).

The proposed approach, with all the reported variants, takes about 30 minutes to complete the whole set of experiments for this dataset, whereas the evolutionary approach is much more compute-intensive, as it has to run the multilevel partitioning algorithm numerous times to create and update the population of partitions for the evolutionary algorithm. The multilevel approach of Moreira et al. [175] is more comparable, in terms of characteristics, with our multilevel scheme.


Figure 2.6: Performance profiles of CoHyb, CoHyb_CIP, and UndirFix in terms of edge cut for the single-source, single-target graph dataset. The average of 5 runs is reported for each approach.

When we compare CoTop with the results of the multilevel algorithm by Moreira et al., our approach provides results that are 37% better on average, and the CoHyb_CIP approach provides results that are 26% better on average, highlighting the fact that keeping the acyclicity of the directed graph through the multilevel process is useful.

Finally, CoTop and CoHyb_CIP also outperform the previous version of our multilevel partitioner [114], which is based on a direct k-way partitioning scheme and matching heuristics for the coarsening phase, by 45% and 35% on average, respectively, on the same dataset.

Single commodity flow-like problem instances

In many of the instances of our dataset, graphs have many source and target vertices. We investigate how our algorithm performs on problems where all source vertices should be in a given part and all target vertices should be in the other part, while also achieving balance. This problem is close to the maximum flow problem, where we want to find the maximum flow (or minimum cut) from the sources to the targets, with balance on the part weights. Furthermore, addressing this problem also provides a setting for solving partitioning problems with fixed vertices.

We used the UFL dataset, where all isolated vertices are discarded, and a single source vertex S and a single target vertex T are added to all graphs. S is connected to all original source vertices, and all original target vertices are connected to T. The cost of the new edges is set to the number of edges. A feasible partition should avoid cutting these edges and separate all sources from the targets.

The performance profiles of CoHyb, CoHyb_CIP, and UndirFix are given in Figure 2.6, with the edge cut as the evaluation criterion. As seen in this figure, CoHyb is the best performing variant, and UndirFix is the worst performing one. This is interesting, as in the general setting we saw the reverse relation. The variant CoHyb_CIP performs in the middle, as it combines the other two.


Figure 2.7: Run time performance profiles of CoCyc, CoHyb, CoTop, CoHyb_C, CoHyb_CIP, and UndirFix on the whole dataset. The values are the averages of 10 runs.

Run time performance

We again give a summary of the run time comparisons from the detailed version [115, Section 5.6]. Figure 2.7 shows the comparison of the five variants of the proposed multilevel scheme and the single-level scheme on the whole dataset. Each algorithm is run 10 times on each graph. As expected, CoTop offers the best performance, and CoHyb offers a good trade-off between CoTop and CoCyc. An interesting remark is that these three algorithms have a better run time than the single-level algorithm UndirFix. For example, on average, CoTop is 1.44 times faster than UndirFix. This is mainly due to the cost of fixing the acyclicity: undirected partitioning accounts for roughly 25% of the execution time of UndirFix, and fixing the acyclicity constitutes the remaining 75%. Finally, the variants of the multilevel algorithm using constrained coarsening heuristics provide satisfying run time performance with respect to the others.

2.4 Summary, further notes, and references

We proposed a multilevel approach for the acyclic partitioning of directed acyclic graphs. This problem is close to the standard graph partitioning problem in that the aim is to partition the vertices into a number of parts while minimizing the edge cut and meeting a balance criterion on the part weights. Unlike the standard graph partitioning problem, the directions of the edges are important, and the resulting partitions should have acyclic dependencies.

We proposed coarsening, initial partitioning, and refinement heuristics for the target problem. The proposed heuristics take the directions of the edges into account and maintain the acyclicity throughout the three phases of the multilevel scheme. We also proposed efficient and effective approaches to use the standard undirected graph partitioning tools in the multilevel scheme for coarsening and initial partitioning. We performed a large set of experiments on a dataset with graphs having different characteristics and evaluated different combinations of the proposed heuristics. Our experiments suggested (i) the use of constrained coarsening and initial partitioning, where the main coarsening heuristic is a hybrid one which avoids the cycles and, in case it does not, performs a fast cycle detection (CoHyb_CIP), for the general case; (ii) a pure multilevel scheme without constrained coarsening, using the hybrid coarsening heuristic (CoHyb), for the cases where a number of sources need to be separated from a number of targets; (iii) a pure multilevel scheme without constrained coarsening, using the fast coarsening algorithm (CoTop), for the cases where the degrees of the vertices are small. All three approaches are shown to be more effective and efficient than the current state of the art.

Özkaya et al. [181] make use of the proposed partitioner for scheduling directed acyclic task graphs on homogeneous computing systems. Here, the partitioner is used to obtain coarse-grained tasks to enhance data locality. Lepère and Trystram [151] show that acyclic clustering is very effective in scheduling DAGs with unit task weights and large, uniform edge costs. In particular, they show that there exists an acyclic clustering whose makespan is at most twice that of the best clustering.

Recall that a directed edge (u, v) is called redundant if there is another path u ⇝ v in the graph. Consider a redundant edge (u, v) and a path u ⇝ v. Let us discard the edge (u, v) and add its cost to all edges in the identified u ⇝ v path, to obtain a reduced graph. In an acyclic bisection of this reduced graph, if the vertices u and v are in different parts, then the path u ⇝ v crosses the cut only once (otherwise the partition is cyclic), and the edge cut in the original graph equals the edge cut in the reduced graph. If u and v are in the same part, then all paths u ⇝ v are confined to the same part. Therefore, neither the edge (u, v) of the original graph nor any path u ⇝ v of the reduced graph contributes to the respective cuts. Hence, the edge cut in the original graph is equal to the edge cut in the reduced graph. Such reductions could potentially help reduce the run time and improve the quality of the partitions in our multilevel setting. The transitive reduction is a costly operation, and hence we do not hope to apply all reductions in the proposed approach; a subset of those should be found and applied. The effects of such reductions should be more tangible in the evolutionary algorithm [175, 176].

Chapter 3

Parallel Candecomp/Parafac decomposition of sparse tensors in distributed memory

In this chapter, we address the problem of parallelizing the computational core of the Candecomp/Parafac decomposition of sparse tensors. Tensor notation is a little less common, and the reader is invited to revise the notation and the basic definitions in Section 1.5. More related definitions are given in this chapter. Below, we restate the problem for convenience.

Parallelizing CP decomposition. Given a sparse tensor X, the rank R of the required decomposition, and the number k of processors, partition X among the processors so as to achieve load balance and reduce the communication cost in a parallel implementation of Mttkrp.

We investigate an efficient parallelization of the Mttkrp operation in distributed memory environments for sparse tensors, in the context of the CP decomposition method CP-ALS. For this purpose, we formulate two task definitions: a coarse-grain and a fine-grain one. These definitions are given by applying the owner-computes rule to a coarse-grain and a fine-grain partition of the tensor nonzeros. We define the coarse-grain partition of a tensor as a partition of one of its dimensions; in matrix terms, a coarse-grain partition corresponds to a row-wise or a column-wise partition. We define the fine-grain partition of a tensor as a partition of its nonzeros; this has the same significance in matrix terms. Based on these two task granularities, we present two parallel algorithms for the Mttkrp operation. We address the computational load balance and communication cost reduction problems for the two algorithms, and present hypergraph partitioning-based models to tackle these problems with the off-the-shelf partitioning tools.

Once the Mttkrp operation is efficiently parallelized, the other parts of the CP-ALS algorithm are straightforward. Nonetheless, we design a library for the parallel CP-ALS algorithm to test the proposed Mttkrp algorithms, and report scalability results up to 1024 MPI ranks.

3.1 Background and notation

3.1.1 Hypergraph partitioning

Let H = (V, E) be a hypergraph with the vertex set V and hyperedge set E. Let us associate weights with the vertices and costs with the hyperedges: w(v) denotes the weight of a vertex v, and c_h denotes the cost of a hyperedge h. For a given integer K ≥ 2, a K-way vertex partition of a hypergraph H = (V, E) is denoted as Π = {V1, ..., VK}, where (i) the parts are non-empty; (ii) mutually exclusive, Vk ∩ Vℓ = ∅ for k ≠ ℓ; and (iii) collectively exhaustive, V = ⋃ Vk.

Let w(Vk) = ∑_{v ∈ Vk} w(v) be the total weight of the vertices in the set Vk, and let w_avg = w(V)/K be the average part weight. If each part Vk ∈ Π satisfies the balance criterion

    w(Vk) ≤ w_avg (1 + ε),   for k = 1, 2, ..., K,        (3.1)

we say that Π is balanced, where ε represents the maximum allowed imbalance ratio.

In a partition Π, a hyperedge that has at least one vertex in a part is said to connect that part. The number of parts connected by a hyperedge h, i.e., its connectivity, is denoted as λ_h. Given a vertex partition Π of a hypergraph H = (V, E), one can measure the size of the cut induced by Π as

    χ(Π) = ∑_{h ∈ E} c_h (λ_h − 1).        (3.2)

This cut measure is called the connectivity−1 cutsize metric. Given ε > 0 and an integer K > 1, the standard hypergraph partitioning problem is defined as the task of finding a balanced partition Π with K parts such that χ(Π) is minimized. The hypergraph partitioning problem is NP-hard [150].
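As a small illustration of these definitions (not tied to any particular partitioning tool), the following sketch evaluates the balance criterion (3.1) and the connectivity−1 cutsize (3.2) for a given partition vector:

```cpp
#include <unordered_set>
#include <vector>

// part[v] in {0, ..., K-1}; hyperedges given as vertex lists with costs.
// Returns the connectivity-1 cutsize of Eq. (3.2); 'balanced' is set to true
// if every part weight is at most (1 + eps) times the average (Eq. (3.1)).
double cutsizeAndBalance(const std::vector<int>& part, int K,
                         const std::vector<double>& vertexWeight,
                         const std::vector<std::vector<int>>& hyperedges,
                         const std::vector<double>& cost,
                         double eps, bool& balanced) {
  std::vector<double> partWeight(K, 0.0);
  double total = 0.0;
  for (size_t v = 0; v < part.size(); ++v) {
    partWeight[part[v]] += vertexWeight[v];
    total += vertexWeight[v];
  }
  balanced = true;
  for (int k = 0; k < K; ++k)
    if (partWeight[k] > (total / K) * (1.0 + eps)) balanced = false;

  double chi = 0.0;
  for (size_t h = 0; h < hyperedges.size(); ++h) {
    std::unordered_set<int> connected;                      // parts touched by h
    for (int v : hyperedges[h]) connected.insert(part[v]);
    chi += cost[h] * (static_cast<double>(connected.size()) - 1.0);  // c_h (lambda_h - 1)
  }
  return chi;
}
```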

A variant of the above problem is the multi-constraint hypergraph partitioning [131]. In this variant, each vertex has an associated vector of weights. The partitioning objective is the same as above, and the partitioning constraint is to satisfy a balancing constraint for each weight. Let w(v, i) denote the C weights of a vertex v, for i = 1, ..., C. In this variant, the balance criterion (3.1) is rewritten as

    w(Vk, i) ≤ w_avg,i (1 + ε),   for k = 1, ..., K and i = 1, ..., C,        (3.3)

where the ith weight w(Vk, i) of a part Vk is defined as the sum of the ith weights of the vertices in that part, w_avg,i is the average part weight for the ith weights of all vertices, and ε represents the allowed imbalance ratio.

3.1.2 Tensor operations

In this chapter, we use N to denote the number of dimensions (the order) of a tensor. For the sake of simplicity, we describe all the notation and the algorithms for N = 3, even though our algorithms and implementations have no such restriction. We explicitly generalize the discussion to general order-N tensors whenever we find it necessary. A fiber in a tensor is defined by fixing every index but one; e.g., if X is a third-order tensor, X_{:jk} is a mode-1 fiber and X_{ij:} is a mode-3 fiber. A slice in a tensor is defined by fixing only one index; e.g., X_{i::} refers to the ith slice of X in mode 1. We use |X_{i::}| to denote the number of nonzeros in X_{i::}.

Tensors can be matricized in any mode. This is achieved by identifying a subset of the modes of a given tensor X as the rows and the other modes of X as the columns of a matrix, and appropriately mapping the elements of X to those of the resulting matrix. We will be exclusively dealing with the matricizations of tensors along a single mode. For example, take X ∈ R^{I1×···×IN}. Then X(1) denotes the mode-1 matricization of X in such a way that the rows of X(1) correspond to the first mode of X, and the columns correspond to the remaining modes. The tensor element x_{i1,...,iN} corresponds to the element (i1, 1 + ∑_{j=2}^{N} [(i_j − 1) ∏_{k=2}^{j−1} I_k]) of X(1). Specifically, each column of the matrix X(1) becomes a mode-1 fiber of the tensor X. Matricizations in the other modes are defined similarly.

For two matrices A of size I1 × J1 and B of size I2 × J2, the Kronecker product is

    A ⊗ B = [ a_{11}B    ···  a_{1,J1}B
                 ⋮        ⋱       ⋮
              a_{I1,1}B  ···  a_{I1,J1}B ].

For two matrices A of size I1 × J and B of size I2 × J, the Khatri-Rao product is

    A ⊙ B = [ A(:, 1) ⊗ B(:, 1)   ···   A(:, J) ⊗ B(:, J) ],

which is of size I1 I2 × J. For two matrices A and B, both of size I × J, the Hadamard product is

    A ∗ B = [ a_{11}b_{11}  ···  a_{1J}b_{1J}
                   ⋮        ⋱        ⋮
              a_{I1}b_{I1}  ···  a_{IJ}b_{IJ} ].
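For concreteness, a small dense sketch of the Khatri-Rao and Hadamard products follows (illustrative only; as discussed later, the parallel algorithms never form the Khatri-Rao product explicitly):

```cpp
#include <vector>
using Matrix = std::vector<std::vector<double>>;  // dense, row-major

// Khatri-Rao product of A (I1 x J) and B (I2 x J): result is (I1*I2) x J,
// whose column j is the Kronecker product of column j of A and column j of B.
Matrix khatriRao(const Matrix& A, const Matrix& B) {
  const size_t I1 = A.size(), I2 = B.size(), J = A[0].size();
  Matrix K(I1 * I2, std::vector<double>(J));
  for (size_t i1 = 0; i1 < I1; ++i1)
    for (size_t i2 = 0; i2 < I2; ++i2)
      for (size_t j = 0; j < J; ++j)
        K[i1 * I2 + i2][j] = A[i1][j] * B[i2][j];
  return K;
}

// Hadamard (entrywise) product of two I x J matrices.
Matrix hadamard(const Matrix& A, const Matrix& B) {
  Matrix H = A;
  for (size_t i = 0; i < A.size(); ++i)
    for (size_t j = 0; j < A[0].size(); ++j)
      H[i][j] = A[i][j] * B[i][j];
  return H;
}
```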

Recall that the CP decomposition of rank R of a given tensor X factorizes X into a sum of R rank-one tensors. For X ∈ R^{I×J×K}, it yields X ≈ ∑_{r=1}^{R} a_r ∘ b_r ∘ c_r, for a_r ∈ R^I, b_r ∈ R^J, and c_r ∈ R^K, where ∘ is the outer product of the vectors. Here, the matrices A = [a_1, ..., a_R], B = [b_1, ..., b_R], and C = [c_1, ..., c_R] are called the factor matrices, or factors. For N-mode tensors, we use U_1, ..., U_N to refer to the factor matrices.

We revise the Alternating Least Squares (ALS) method for obtaining a rank-R approximation of a tensor X with the CP decomposition [38, 107]. A common formulation of CP-ALS is shown in Algorithm 1 for third-order tensors. At each iteration, each factor matrix is recomputed while fixing the other two, e.g., A ← X(1)(C ⊙ B)(BᵀB ∗ CᵀC)†. This operation is performed in the following order: MA = X(1)(C ⊙ B), V = (BᵀB ∗ CᵀC)†, and then A ← MA V. Here, V is a dense matrix of size R × R and is easy to compute. The important issue is the efficient computation of the Mttkrp operations yielding MA, and similarly MB = X(2)(A ⊙ C) and MC = X(3)(B ⊙ A).

Algorithm 1: CP-ALS for 3rd order tensors

Input: X: a 3rd order tensor
       R: the rank of the approximation
Output: CP decomposition [[λ; A, B, C]]

1: repeat
2:   A ← X(1)(C ⊙ B)(BᵀB ∗ CᵀC)†
3:   Normalize the columns of A
4:   B ← X(2)(C ⊙ A)(AᵀA ∗ CᵀC)†
5:   Normalize the columns of B
6:   C ← X(3)(B ⊙ A)(AᵀA ∗ BᵀB)†
7:   Normalize the columns of C and store the norms as λ
8: until no improvement or maximum iterations reached

The sheer size of the Khatri-Rao products makes them impossible to compute explicitly; hence, efficient Mttkrp algorithms find other means to carry out the Mttkrp operation.

3.2 Related work

SPLATT [206] is an efficient implementation of the Mttkrp operation for sparse tensors on shared memory systems. SPLATT implements the Mttkrp operation based on the slices of the dimension in which the factor is updated, e.g., on the mode-1 slices when computing A ← X(1)(C ⊙ B)(BᵀB ∗ CᵀC)†. Nonzeros of the fibers in a slice are multiplied with the corresponding rows of B, and the results are accumulated to be later scaled with the corresponding row of C, to compute the row of A corresponding to the slice. Parallelization is done using OpenMP directives, and the load balance (in terms of the number of nonzeros in the slices of the mode for which Mttkrp is computed) is achieved by using the dynamic scheduling policy. Hypergraph models are used to optimize cache performance by reducing the number of times a row of A, B, and C is accessed. Smith et al. [206] also use N-partite graphs to reorder the tensors for all dimensions. A newer version of SPLATT [204, 205] has a medium-grain task definition and a distributed memory implementation.

GigaTensor [123] is an implementation of CP-ALS which follows the Map-Reduce paradigm. All important steps (the Mttkrps and the computations BᵀB ∗ CᵀC) are performed using this paradigm. A distinct advantage of GigaTensor is that, thanks to Map-Reduce, the issues of fault tolerance, load balance, and out-of-core tensor data are automatically handled. The presentation [123] of GigaTensor focuses on three-mode tensors and expresses the map and the reduce functions for this case. Additional map and reduce functions are needed for higher order tensors.

DFacTo [47] is a distributed memory implementation of the Mttkrp operation. It performs two successive sparse matrix-vector multiplies (SpMV) to compute a column of the product X(1)(C ⊙ B). A crucial observation (made also elsewhere [142]) is that this operation can be implemented as Xᵀ(2) B(:, r), which can be reshaped into a matrix to be multiplied with C(:, r) to form the rth column of the Mttkrp. Although SpMV is a well-investigated operation, there is a peculiarity here: the result of the first SpMV forms the values of the sparse matrix used in the second one. Therefore, there are sophisticated data dependencies between the two SpMVs. Notice that DFacTo is rich in SpMV operations: there are two SpMVs per factorization rank per dimension of the input tensor. DFacTo needs to store the tensor matricized in all dimensions, i.e., X(1), ..., X(N). In low dimensions, this can be a slight memory overhead, yet in higher dimensions the overhead could be non-negligible. DFacTo uses MPI for parallelization, yet fully stores the factor matrices in all MPI ranks. The rows of X(1), ..., X(N) are blockwise distributed (statically). With this partition, each process computes the corresponding rows of the factor matrices. Finally, DFacTo performs an MPI_Allgatherv operation to communicate the new results to all processes, which results in (In/P) log2 P communication volume per process (assuming a hypercube algorithm) when computing the nth factor matrix having In rows, using P processes.

Tensor Toolbox [18] is a MATLAB toolbox for handling tensors. It provides many essential operations and enables fast and efficient realizations of complex algorithms in MATLAB for sparse tensors [17]. Among those operations, Mttkrp implementations are provided and used in the CP-ALS method. Here, each column of the output is computed by performing N − 1 sparse tensor-vector multiplications. Another well-known MATLAB toolbox is the N-way Toolbox [11], which is essentially for dense tensors and now incorporates support for sparse tensors [2] through Tensor Toolbox. Tensor Toolbox and the related software provide excellent means for rapid prototyping of algorithms, and also efficient programs for tensor operations that can be handled within MATLAB.

3.3 Parallelization

A common approach in implementing the Mttkrp is to explicitly matricize a tensor across all modes and then perform the Khatri-Rao product using the matricized tensors [47, 206]. Matricizing a tensor in a mode i requires column index values up to ∏_{k≠i} I_k, which can exceed the integer value limits supported by modern architectures when using tensors of higher order and very large dimensions. Also, matricizing across all modes results in N replications of a tensor, which can exceed the memory limitations. Hence, in order to be able to handle large tensors, we store them in coordinate format for the Mttkrp operation, which is also the method of choice in Tensor Toolbox [18]. In this format, each nonzero value is stored along with all of its position indices.

With a tensor stored in the coordinate format, the Mttkrp operation can be performed as shown in Algorithm 2. As seen on Line 4 of this algorithm, a row of B and a row of C are retrieved, and their Hadamard product is computed and scaled with a tensor entry to update a row of MA. In general, for an N-mode tensor,

    MU1(i1, :) ← MU1(i1, :) + x_{i1,i2,...,iN} [U2(i2, :) ∗ ··· ∗ UN(iN, :)]

is computed. Here, the indices of the corresponding rows of the factor matrices and of MU1 coincide with the indices of the unique tensor entry of the operation.

Algorithm 2: Mttkrp for 3rd order tensors

Input: X: the tensor
       B, C: factor matrices in all modes except the first
       IA: the number of rows of the factor A
       R: the rank of the factors
Output: MA = X(1)(B ⊙ C)

1: Initialize MA to zeros, of size IA × R
2: foreach x_{ijk} ∈ X do
4:   MA(i, :) ← MA(i, :) + x_{ijk} [B(j, :) ∗ C(k, :)]
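A sequential sketch of Algorithm 2 for a third-order tensor stored in coordinate format is given below (illustrative data structures; the distributed-memory algorithms of the following sections add the necessary communication):

```cpp
#include <vector>

// Sequential sketch of Algorithm 2. Factor matrices are stored densely,
// row-major, with R columns; the tensor is a list of coordinate entries.
struct Coord { int i, j, k; double val; };

std::vector<double> mttkrpMode1(const std::vector<Coord>& X,
                                const std::vector<double>& B,   // J x R
                                const std::vector<double>& C,   // K x R
                                int IA, int R) {
  std::vector<double> MA(static_cast<size_t>(IA) * R, 0.0);
  for (const Coord& nz : X) {
    const double* brow = &B[static_cast<size_t>(nz.j) * R];
    const double* crow = &C[static_cast<size_t>(nz.k) * R];
    double* marow = &MA[static_cast<size_t>(nz.i) * R];
    for (int r = 0; r < R; ++r)        // MA(i,:) += x_ijk * (B(j,:) * C(k,:))
      marow[r] += nz.val * brow[r] * crow[r];
  }
  return MA;
}
```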

As the factor matrices are accessed row-wise, we define the computational units in terms of the rows of the factor matrices. It follows naturally to partition all factor matrices row-wise and to use the same partition for the Mttkrp operation for each mode of an input tensor, across all CP-ALS iterations, to prevent extra communication. A crucial issue is the task definition, as this pertains to the issues of load balancing and communication. We identify a coarse-grain and a fine-grain task definition.

In the coarse-grain task definition, the ith atomic task consists of computing the row MA(i, :) using the nonzeros in the tensor slice X_{i::} and the rows of B and C corresponding to the nonzeros in that slice. The input tensor X does not change throughout the iterations of the tensor decomposition algorithms; hence, it is viable to make the whole slice X_{i::} available to the process holding MA(i, :), so that the Mttkrp operation can be performed by only communicating the rows of B and C. Yet, as CP-ALS requires the Mttkrp in all modes, and each nonzero x_{ijk} belongs to the slices X_{i::}, X_{:j:}, and X_{::k}, we need to replicate tensor entries in the owner processes of these slices. This may require up to N times replication of the tensor, depending on its partitioning. Note that an explicit matricization always requires exactly N replications of the tensor entries.

In the fine-grain task definition, an atomic task corresponds to the multiplication of a tensor entry with the Hadamard product of the corresponding rows of B and C. Here, the tensor nonzeros are partitioned among the processes with no replication, to induce a task partition by following the owner-computes rule. This necessitates communicating the rows of B and C that are needed by these atomic tasks. Furthermore, partial results on the rows of MA need to be communicated, as without duplicating tensor entries, we cannot, in general, compute all contributions to a row of MA. Here, the partition of X should be useful in all modes, as the CP-ALS method requires the Mttkrp in all modes.

The coarse-grain task definition resembles the one-dimensional (1D) row-wise (or column-wise) partitioning of sparse matrices, whereas the fine-grain one resembles the two-dimensional (nonzero-based) partitioning of sparse matrices for parallel sparse matrix-vector multiply (SpMV) operations. As is confirmed for SpMV in modern applications, 1D partitioning usually leads to harder problems of load balancing and communication cost reduction. The same phenomenon is likely to be observed in tensors as well. Nonetheless, we cover the coarse-grain task definition, as it is used in the state-of-the-art parallel Mttkrp methods [47, 206], which partition the input tensor by slices.

3.3.1 Coarse-grain task model

In the coarse-grain task model, the computations of the rows of MA, MB, and MC are defined as the atomic tasks. Let μA denote the partition of the first mode's indices among the processes, i.e., μA(i) = p if the process p is responsible for computing MA(i, :). Similarly, let μB and μC define the partition of the second and the third mode indices. The process owning MA(i, :) needs the entire tensor slice X_{i::}; similarly, the process owning MB(j, :) needs X_{:j:}, and the owner of MC(k, :) needs X_{::k}. This necessitates the duplication of some tensor nonzeros to prevent unnecessary communication.

One needs to take the context of CP-ALS into account when parallelizing the Mttkrp operation. First, the output MA of Mttkrp is transformed into A. Since A(i, :) is computed simply by multiplying MA(i, :) with the matrix (BᵀB ∗ CᵀC)†, we make the process which owns MA(i, :) responsible for computing A(i, :). Second, N Mttkrp operations follow one another in an iteration. Assuming that every process has the required rows of the factor matrices while executing the Mttkrp for the first mode, it is advisable to implement the Mttkrp in such a way that its output MA, after being transformed into A, is communicated. This way, all processes will have the necessary data for executing the Mttkrp for the next mode. With these in mind, the coarse-grain parallel Mttkrp method executes Algorithm 3 at each process p.

Algorithm 3: Coarse-grain Mttkrp for the first mode of third order tensors, at process p, within CP-ALS

Input: Ip: indices i with μA(i) = p
       X_{Ip}: tensor slices
       BᵀB and CᵀC: R × R matrices
       B(Jp, :), C(Kp, :): rows of the factor matrices, where Jp and Kp correspond to the unique second and third mode indices in X_{Ip}
1: (On exit, A(Ip, :) is computed, its rows are sent to the processes needing them, and AᵀA is available.)
2: Initialize MA(Ip, :) to all zeros, of size |Ip| × R
3: foreach i ∈ Ip do
4:   foreach x_{ijk} ∈ X_{i::} do
6:     MA(i, :) ← MA(i, :) + x_{ijk}[B(j, :) ∗ C(k, :)]
8: A(Ip, :) ← MA(Ip, :)(BᵀB ∗ CᵀC)†
9: foreach i ∈ Ip do
11:   Send A(i, :) to all processes having nonzeros in X_{i::}
12: Receive A(i, :) from μA(i) for each owned nonzero x_{ijk}
14: Locally compute A(Ip, :)ᵀA(Ip, :) and all-reduce the results to form AᵀA

As seen in Algorithm 3, the process p computes MA(i, :) for all i with μA(i) = p on Line 6. This is possible without any communication if we assume the preconditions that BᵀB, CᵀC, and the required rows of B and C are available. We need to satisfy this precondition for the Mttkrp operations in the remaining modes. Once MA(Ip, :) is computed, it is multiplied on Line 8 with (BᵀB ∗ CᵀC)† to obtain A(Ip, :). Then, the process p sends the rows of A(Ip, :) to the other processes that will need them, and receives the ones that it will need. More precisely, the process p sends A(i, :) to the process r where μB(j) = r or μC(k) = r for a nonzero x_{ijk} ∈ X, thereby satisfying the stated precondition for the remaining Mttkrp operations. Since computing A(i, :) requires BᵀB and CᵀC to be available at each process, we also need to compute AᵀA and make the result available to all processes. We perform this by computing the local multiplications A(Ip, :)ᵀA(Ip, :) and all-reducing the partial results (Line 14).
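As a small illustration of Line 14 (a sketch only, with a simplifying row-major layout assumption for the locally owned rows), each process can form its local Gram matrix and combine the partial results with a single all-reduce:

```cpp
#include <mpi.h>
#include <vector>

// Sketch of Line 14: given the locally owned rows A(Ip, :) stored row-major,
// compute A(Ip,:)^T A(Ip,:) and all-reduce so every process obtains A^T A.
std::vector<double> allReduceGram(const std::vector<double>& Alocal,
                                  int localRows, int R, MPI_Comm comm) {
  std::vector<double> local(R * R, 0.0), global(R * R, 0.0);
  for (int i = 0; i < localRows; ++i)
    for (int r1 = 0; r1 < R; ++r1)
      for (int r2 = 0; r2 < R; ++r2)
        local[r1 * R + r2] += Alocal[i * R + r1] * Alocal[i * R + r2];
  MPI_Allreduce(local.data(), global.data(), R * R,
                MPI_DOUBLE, MPI_SUM, comm);
  return global;
}
```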

We now investigate the problems of obtaining load balance and reducing the communication cost in an iteration of CP-ALS given in Algorithm 3, that is, when Algorithm 3 is applied N times, once for each mode. Line 6 is the computational core: the process p performs ∑_{i ∈ Ip} |X_{i::}| many Hadamard products (of vectors of size R) without any communication. In order to achieve load balance, one should partition the slices of the first mode equitably, by taking the number of nonzeros into account. Line 8 does not involve communication, where load balance can be achieved by assigning an almost equal number of slices to the processes. Line 14 also requires an almost equal number of slices at the processes for load balance. The operations at Lines 8 and 14 can be performed very efficiently using BLAS3 routines; therefore, their costs are generally negligible in comparison to the cost of the sparse, irregular computations at Line 6. There are two lines, 11 and 14, requiring communication. Line 14 requires the collective all-reduce communication, and one can rely on efficient MPI implementations. The communication at Line 11 is irregular and would be costly. Therefore, the problems of computational load balance and communication cost need to be addressed with regard to Lines 6 and 11, respectively.

Let a_i be the task of computing MA(i, :) at Line 6. Let also TA be the set of tasks a_i for i = 1, ..., IA, and let b_j, c_k, TB, and TC be defined similarly. In the CP-ALS algorithm, there are dependencies among the triplets a_i, b_j, and c_k of the tasks for each nonzero x_{ijk}. Specifically, the task a_i depends on the tasks b_j and c_k. Similarly, b_j depends on the tasks a_i and c_k, and c_k depends on the tasks a_i and b_j. These dependencies are the source of the communication at Line 11; e.g., A(i, :) should be sent to the processes that own B(j, :) and C(k, :) for each nonzero x_{ijk} in the slice X_{i::}. These dependencies can be modeled using graphs or hypergraphs, by identifying the tasks with vertices and their dependencies with edges and hyperedges, respectively. As is well known in similar parallelization contexts [39, 111, 112], the standard graph partitioning related metric of "edge cut" would loosely relate to the communication cost. We therefore propose a hypergraph model for modeling the communication and computational requirements of the Mttkrp operation in the context of CP-ALS.

For a given tensor X, we build the d-partite, d-uniform hypergraph H = (V, E) described in Section 1.5. In this model, the vertex set V corresponds to the set of tasks. We abuse the notation and use a_i to refer to both a task and its corresponding vertex in V. We then set V = TA ∪ TB ∪ TC. Since one needs load balance on Line 6 for each mode, we use vectors of size N for the vertex weights. We set w(a_i, 1) = |X_{i::}|, w(b_j, 2) = |X_{:j:}|, and w(c_k, 3) = |X_{::k}| as the vertex weights in the indicated modes, and assign weight 0 in all other modes.


Figure 3.1: Coarse-grain (a) and fine-grain (b) hypergraph models for the 3×3×3 tensor X with nonzeros at positions (1, 2, 3), (2, 3, 1), and (3, 1, 2).

as the vertex weights in the indicated modes, and assign weight 0 in all other modes. The dependencies causing communication at Line 11 are one-to-many: for a nonzero x_ijk, any one of a_i, b_j, and c_k is computed by using the data associated with the other two. Following the guidelines in building the hypergraph models [209], we add a hyperedge corresponding to each task (or each row of the factor matrices) in T_A ∪ T_B ∪ T_C in order to model the data dependencies to that specific task. We use n_{a_i} to denote the hyperedge corresponding to the task a_i, and define n_{b_j} and n_{c_k} similarly. For each nonzero x_ijk, we add the vertices a_i, b_j, and c_k to the hyperedges n_{a_i}, n_{b_j}, and n_{c_k}. Let J_i and K_i be all unique 2nd and 3rd mode indices, respectively, of the nonzeros in X_{i::}. Then, the hyperedge n_{a_i} contains the vertex a_i, all vertices b_j for j ∈ J_i, and all vertices c_k for k ∈ K_i.
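As a concrete illustration of this construction for third-order tensors, the following Python sketch builds the vertex weights and hyperedges from a list of (i, j, k) index triples; the tagged-tuple identifiers and the function name are ours, not part of any partitioning tool.

    from collections import defaultdict

    def build_coarse_hypergraph(nonzeros):
        # Coarse-grain model: one vertex and one hyperedge per index of every mode;
        # each nonzero (i, j, k) puts a_i, b_j, c_k into the hyperedges n_{a_i}, n_{b_j}, n_{c_k}.
        weights = defaultdict(lambda: [0, 0, 0])   # multi-constraint vertex weights
        nets = defaultdict(set)                    # hyperedge -> set of member vertices
        for (i, j, k) in nonzeros:
            ai, bj, ck = ('a', i), ('b', j), ('c', k)
            weights[ai][0] += 1                    # w(a_i, 1) = |X_{i::}|
            weights[bj][1] += 1                    # w(b_j, 2) = |X_{:j:}|
            weights[ck][2] += 1                    # w(c_k, 3) = |X_{::k}|
            for net in (ai, bj, ck):
                nets[net].update((ai, bj, ck))
        return weights, nets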

Figure 3.1a demonstrates the hypergraph model for the coarse-grain task decomposition of a sample 3×3×3 tensor with nonzeros (1,2,3), (2,3,1), and (3,1,2). Labels for the hyperedges, shown as small blue circles, are not provided, yet they should be clear from the context; e.g., a_i ∈ n_{a_i} is shown with a straight black line between the vertex and the hyperedge. For the first, second, and third nonzeros, we add the corresponding vertices to the related hyperedges, which is shown with green, magenta, and orange curves, respectively. It is interesting to note that the N-partite graph of Smith et al. [206] and this hypergraph model are related: their adjacency structures, when laid out as matrices, give the same matrix.

Consider now a P-way partition of the vertices of H = (V, E) and the association of each part with a unique process for obtaining a P-way parallel CP-ALS. The task a_i incurs |X_{i::}| Hadamard products in the first mode in the process μ_A(i), and has no other work when computing the Mttkrps M_B = X_(2)(A ⊙ C) and M_C = X_(3)(A ⊙ B) in the other modes. This observation (combined with the corresponding one for the b_j's and c_k's) shows that any


partition satisfying the balance condition for each weight component (3.3) corresponds to a balanced partition of the Hadamard products (Line 6) in each mode. Now consider the messages, which are of size R, sent by the process p regarding A(i,:). These messages are needed by the processes that have slices in modes 2 and 3 with nonzeros whose first-mode index is i. These processes correspond to the parts connected by the hyperedge n_{a_i}. For example, in Figure 3.1a, due to the nonzero (3,1,2), a_3 has a dependency to b_1 and c_2, which is modeled by the curves from a_3 to n_{b_1} and n_{c_2}. Since a_3, b_1, and c_2 reside in different parts, a_3 contributes one to the connectivity of n_{b_1} and of n_{c_2}, which accurately captures the total communication volume by increasing the total connectivity-1 metric by two.

Notice that the part μ_A(i) is always connected by n_{a_i}, assuming that the slice X_{i::} is not empty. Therefore, minimizing the connectivity-1 metric (3.2) exactly amounts to minimizing the total communication volume required for Mttkrp in all computational phases of a CP-ALS iteration, if we set the cost c_h = R for each hyperedge h.
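The objective that the partitioner minimizes can be checked directly on this model: given a vertex-to-part assignment, the connectivity-1 metric (3.2) with c_h = R counts the communication volume described above. The sketch below assumes the nets structure of the previous example and a dictionary part mapping every vertex to its part.

    def connectivity_minus_one(nets, part, R=1):
        # Sum over hyperedges of c_h * (lambda_h - 1), where lambda_h is the number
        # of distinct parts connected by hyperedge h.
        total = 0
        for h, vertices in nets.items():
            parts = {part[v] for v in vertices}
            total += R * (len(parts) - 1)
        return total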

3.3.2 Fine-grain task model

In the fine-grain task model, we assume a partition of the nonzeros of the tensor; hence, there is no duplication. Let μ_X denote the partition of the tensor nonzeros, e.g., μ_X(i,j,k) = p if the process p holds x_ijk. Whenever μ_X(i,j,k) = p, the process p performs the Hadamard product of B(j,:) and C(k,:) in the first mode, of A(i,:) and C(k,:) in the second mode, and of A(i,:) and B(j,:) in the third mode, and then scales the result with x_ijk. Let again μ_A, μ_B, and μ_C denote the partitions on the indices of the corresponding modes; that is, μ_A(i) = p denotes that the process p is responsible for computing M_A(i,:). As in the coarse-grain model, we take the CP-ALS context into account while designing the Mttkrp algorithm: (i) the process which is responsible for M_A(i,:) is also held responsible for A(i,:); (ii) at the beginning, all rows of the factor matrices B(j,:) and C(k,:) where μ_X(i,j,k) = p are available to the process p; (iii) at the end, all processes get the R × R matrix A^TA; and (iv) each process has all the rows of A that are required for the upcoming Mttkrp operations in the second or the third modes. With these in mind, the fine-grain parallel Mttkrp method executes Algorithm 4 at each process p.

In Algorithm 4, X_p and I_p denote the set of nonzeros and the set of first-mode indices assigned to the process p. As seen in the algorithm, the process p has the required data to carry out all Hadamard products corresponding to all x_ijk ∈ X_p on Line 5. After scaling by x_ijk, the Hadamard products are locally accumulated in M_A, which contains a row for each i where X_p has a nonzero on the slice X_{i::}. Notice that a process generates a partial result for each slice of the first mode in which it has a nonzero. The relevant row indices (the set of unique first-mode indices in X_p) are denoted by F_p.


Algorithm 4: Fine-grain Mttkrp for the first mode of 3rd-order tensors at process p within CP-ALS

Input: X_p: tensor nonzeros where μ_X(i,j,k) = p;
       I_p: the first-mode indices where μ_A(I_p) = p;
       F_p: the set of unique first-mode indices in X_p;
       B^TB and C^TC: R × R matrices;
       B(J_p,:), C(K_p,:): rows of the factor matrices, where J_p and K_p correspond to the unique second- and third-mode indices in X_p.
1:  On exit, A(I_p,:) is computed and its rows are sent to the processes needing them, A(F_p,:) is updated, and A^TA is available.
2:  Initialize M_A(F_p,:) to all zeros, of size |F_p| × R
3:  foreach x_ijk ∈ X_p do
5:      M_A(i,:) ← M_A(i,:) + x_ijk [B(j,:) ∗ C(k,:)]
6:  foreach i ∈ F_p \ I_p do
8:      Send M_A(i,:) to μ_A(i)
10: Receive contributions to M_A(i,:) for i ∈ I_p and add them up
12: A(I_p,:) ← M_A(I_p,:)(B^TB ∗ C^TC)^†
13: foreach i ∈ I_p do
15:     Send A(i,:) to all processes having nonzeros in X_{i::}
16: foreach i ∈ F_p \ I_p do
18:     Receive A(i,:) from μ_A(i)
20: Locally compute A(I_p,:)^T A(I_p,:) and all-reduce the results to form A^TA


Some indices in F_p are not owned by the process p. For such an i ∈ F_p \ I_p, the process p sends its contribution to M_A(i,:) to the owner process μ_A(i) at Line 8. Again, messages of the process p destined to the same process are combined. Then, the process p receives contributions to M_A(i,:), where μ_A(i) = p, at Line 10. Each such contribution is accumulated, and the new A(i,:) is computed by a multiplication with (B^TB ∗ C^TC)^† at Line 12. Finally, the updated A is communicated to satisfy the precondition of the upcoming Mttkrp operations. Note that the communications at Lines 15 and 18 are the duals of those at Lines 10 and 8, respectively, in the sense that the directions of the messages are reversed and the messages always pertain to the same row indices. For example, M_A(i,:) is sent to the process μ_A(i) at Line 8, and A(i,:) is received from μ_A(i) at Line 18. Hence, the process p generates partial results for each slice X_{i::} on which it has a nonzero, either keeps it or sends it to the owner μ_A(i) (at Line 8), and, if so, receives the updated row A(i,:) later (at Line 18). Finally, the algorithm finishes with an all-reduce operation to make A^TA available to all processes for the upcoming Mttkrp operations.

We now investigate the computational and communication requirements of Algorithm 4. As before, the computational core is at Line 5, where the Hadamard products of the row vectors B(j,:) and C(k,:) are computed for each nonzero x_ijk that a process owns, and the result is added to M_A(i,:). Therefore, each nonzero x_ijk incurs three Hadamard products (one per mode) in its owner process μ_X(i,j,k) in an iteration of the CP-ALS algorithm. Other computations are either efficiently handled with BLAS3 routines (Lines 12 and 20), or they (Line 10) are not as numerous as the Hadamard products. The significant communication operations are the sends at Lines 8 and 15 and the corresponding receives at Lines 10 and 18. Again, the all-reduce operation at Line 20 would be handled efficiently by the MPI implementation. Therefore, the problems of computational load balance and reduced communication cost need to be addressed by partitioning the tensor nonzeros equitably among the processes (for load balance at Line 5), while trying to confine all slices of all modes to a small set of processes (to reduce communication at Lines 8 and 18).

We propose a hypergraph model for the computational load and communication volume of the fine-grain parallel Mttkrp operation in the context of CP-ALS. For a given tensor, the hypergraph H = (V, E) with the vertex set V and hyperedge set E is defined as follows. The vertex set V has four types of vertices: a_i for i = 1, ..., I_A; b_j for j = 1, ..., I_B; c_k for k = 1, ..., I_C; and the vertex x_ijk that we define for each nonzero in X, by abusing the notation. The vertex x_ijk represents the Hadamard products B(j,:) ∗ C(k,:), A(i,:) ∗ C(k,:), and A(i,:) ∗ B(j,:) to be performed during the Mttkrp operations at the different modes. There are three types of hyperedges in E, and they represent the dependencies of the Hadamard products on the rows of the factor matrices. The hyperedges are defined as follows:


E = E_A ∪ E_B ∪ E_C, where E_A contains a hyperedge n_{a_i} for each A(i,:), E_B contains a hyperedge n_{b_j} for each B(j,:), and E_C contains a hyperedge n_{c_k} for each C(k,:). Initially, n_{a_i}, n_{b_j}, and n_{c_k} contain only the corresponding vertices a_i, b_j, and c_k. Then, for each nonzero x_ijk ∈ X, we add the vertex x_ijk to the hyperedges n_{a_i}, n_{b_j}, and n_{c_k} to model the data dependency to the corresponding rows of A, B, and C. As in the previous subsection, one needs a multi-constraint formulation, for three reasons. First, since the x_ijk vertices represent the Hadamard products, they should be evenly distributed among the processes. Second, as the owners of the rows of A, B, and C are possible destinations of partial results (at Line 8), it is advisable to spread them evenly. Third, an almost equal number of rows per process implies balanced memory usage and BLAS3 operations at Lines 12 and 20, even though this operation is not our main concern for the computational load. With these constraints, we associate a weight vector of size N + 1 with each vertex. For a vertex x_ijk, we set w(x_ijk) = [0, 0, 0, 3]; for the vertices a_i, b_j, and c_k, we set w(a_i) = [1, 0, 0, 0], w(b_j) = [0, 1, 0, 0], and w(c_k) = [0, 0, 1, 0].
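The fine-grain model can be sketched analogously; here every nonzero becomes a vertex with weight vector [0, 0, 0, 3], each hyperedge initially contains only its own row vertex, and, again, the identifiers and function name below are ours.

    from collections import defaultdict

    def build_fine_hypergraph(nonzeros):
        # Fine-grain model: one vertex per nonzero plus one vertex and one hyperedge
        # per row of A, B, and C; weight vectors of size N + 1 = 4.
        weights = {}
        nets = defaultdict(set)
        for (i, j, k) in nonzeros:
            ai, bj, ck, x = ('a', i), ('b', j), ('c', k), ('x', i, j, k)
            weights[x] = [0, 0, 0, 3]              # three Hadamard products, one per mode
            weights.setdefault(ai, [1, 0, 0, 0])
            weights.setdefault(bj, [0, 1, 0, 0])
            weights.setdefault(ck, [0, 0, 1, 0])
            for row_vertex in (ai, bj, ck):
                nets[row_vertex].add(row_vertex)   # n_{a_i} initially contains a_i, etc.
                nets[row_vertex].add(x)            # x_ijk joins n_{a_i}, n_{b_j}, n_{c_k}
        return weights, nets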

In Figure 3.1b, we demonstrate the fine-grain hypergraph model for the sample tensor of the previous section. We exclude the vertex weights from the figure. Similar to the coarse-grain model, the dependencies on the rows of A, B, and C are modeled by connecting each vertex x_ijk to the hyperedges n_{a_i}, n_{b_j}, and n_{c_k} to capture the communication cost.

Consider now a P-way partition of the vertices of H = (V, E), and associate each part with a unique process to obtain a P-way parallel CP-ALS with the fine-grain Mttkrp operation. The partition of the x_ijk vertices uniquely partitions the Hadamard products. Satisfying the fourth balance constraint in (3.3) balances the computational loads of the processes. Similarly, satisfying the first three constraints leads to an equitable partition of the rows of the factor matrices among the processes. Let μ_X(i,j,k) = p and μ_A(i) = ℓ, that is, the vertices x_ijk and a_i are in different parts. Then, the process p sends a partial result on M_A(i,:) to the process ℓ at Line 8 of Algorithm 4. By looking at these messages carefully, we see that all processes whose corresponding parts are in the connectivity set of n_{a_i} will send a message to μ_A(i). Since μ_A(i) is also in this set, we see that the total volume of send operations concerning M_A(i,:) at Line 8 is equal to λ_{n_{a_i}} − 1. Therefore, the connectivity-1 cut-size metric (3.2) over the hyperedges in E_A encodes the total volume of messages sent at Line 8, if we set c_h = R for all hyperedges in E_A. Since the send operations at Line 15 are the duals of the send operations at Line 8, the total volume of messages sent at Line 15 for the first mode is also equal to this number. By extending this reasoning to all hyperedges, we see that the cumulative (over all modes) volume of communication is equal to the connectivity-1 cut-size metric (3.2).

The presence of multiple constraints usually causes difficulties for the current state of the art partitioning tools (Aykanat et al. [15] discuss the issue for PaToH [40]). In preliminary experiments, we observed that this was the case for us as well. In an attempt to circumvent this issue, we designed the following alternative. We start with the same hypergraph and remove the vertices corresponding to the rows of the factor matrices. Then, we use single-constraint partitioning on the resulting hypergraph to partition the vertices of the nonzeros. We are then faced with the problem of assigning the rows of the factor matrices, given the partition of the tensor nonzeros. Any objective function (relevant to our parallelization context) in trying to solve this problem would likely be a variant of the NP-complete Bin-Packing Problem [93, p. 124], as in the SpMV case [27, 208]. Therefore, we handle this problem heuristically by making the following observations: (i) the modes can be partitioned one at a time; (ii) each row corresponds to a removed vertex for which there is a corresponding hyperedge; (iii) the partition of the x_ijk vertices defines a connectivity set for each hyperedge. This is essentially multi-constraint partitioning that takes advantage of the listed observations. We use the connectivity of the corresponding hyperedge as the key value of a row to impose an order. Then, the rows are visited in decreasing order of key values. Each visited row is assigned to the least loaded part in the connectivity set of its corresponding hyperedge, if that part is not overloaded. Otherwise, the visited row is assigned to the least loaded part at that moment. Note that in the second case we add one to the total communication volume; otherwise, the total communication volume obtained by partitioning the x_ijk vertices remains the same. Note also that the key values correspond to the volume of receives at Line 10 of Algorithm 4; hence, this method also aims to balance the volume of receives at Line 10 and the volume of the dual send operations at Line 15.
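A sketch of this row-assignment pass for one mode is given below; connectivity_sets maps each row (hyperedge) to the set of parts touched by its nonzero vertices, load holds the current part loads, and max_load is the allowed per-part bound. All of these names are ours.

    def assign_rows(connectivity_sets, load, max_load):
        # Visit rows in decreasing order of hyperedge connectivity; place each row on the
        # least loaded part of its connectivity set unless that part is overloaded, in
        # which case the globally least loaded part is used (adding one to the volume).
        assignment = {}
        rows = sorted(connectivity_sets, key=lambda r: len(connectivity_sets[r]), reverse=True)
        for r in rows:
            best = min(connectivity_sets[r], key=lambda p: load[p])
            if load[best] >= max_load:
                best = min(load, key=load.get)     # fall back to the least loaded part overall
            assignment[r] = best
            load[best] += 1
        return assignment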

3.4 Experiments

We conducted our experiments on the IDRIS Ada cluster, which consists of 304 nodes, each having 128 GB of memory and four 8-core Intel Sandy Bridge E5-4650 processors running at 2.7 GHz. We ran our experiments up to 1024 cores and assigned one MPI process per core. All codes we used in our benchmarks were compiled using the Intel C++ compiler (version 14.0.1.106) with the -O3 option for compiler optimizations, -axAVX,SSE4.2 to enable SSE optimizations whenever possible, and the -mkl option to use the Intel MKL library (version 11.1.1) for LAPACK, BLAS, and VML (Vector Math Library) routines. We also compiled the DFacTo code using the Eigen (version 3.2.4) library [102] with the EIGEN_USE_MKL_ALL flag set, to ensure that it utilizes the routines provided in the Intel MKL library for the best performance. We used PaToH [40] with default options to partition the hypergraphs. We created all partitions offline and ran our experiments on these partitioned tensors on the cluster. We do not report timings for partitioning the hypergraphs, which is quite costly, for two reasons. First, in most


Tensor    I_1    I_2   I_3    I_4   nonzeros
netflix   480K   17K   2K     -     100M
NELL-B    3.2M   301   638K   -     78M
deli      530K   17M   2.4M   1K    140M
flickr    319K   28M   1.6M   713   112M

Table 3.1: Size of the tensors used in the experiments.

applications, the tensors from real-world data are built incrementally; hence, a partition for the updated tensor can be formed by refining the partition of the previous tensor. Also, one can decompose a tensor multiple times with different ranks of approximation. Thereby, the cost of an initial partitioning of a tensor can be amortized across multiple runs. Second, we had to partition the data on a system different from the one used for the CP-ALS runs. It would not be very useful to compare run times from different systems, and repeating the same runs on a different system would not add much to the presented results. We also note that the proposed coarse- and fine-grain Mttkrp formulations are independent of the partitioning method.

Our dataset for the experiments consists of four sparse tensors, whose sizes are given in Table 3.1, obtained from various resources [21, 37, 97].

We briefly describe the methods compared in the experiments. DFacTo is the CP-ALS routine we compare our methods against; it uses a block partitioning of the rows of the matricized tensors to distribute the matrix nonzeros equally among the processes. In the method ht-coarsegrain-block, we operate the coarse-grain CP-ALS implementation in HyperTensor on a tensor whose slices in each mode are partitioned consecutively to ensure an equal number of nonzeros among the processes, similarly to DFacTo's approach. The method ht-coarsegrain-hp runs the same CP-ALS implementation, and it uses the hypergraph partitioning. The method ht-finegrain-random partitions the tensor nonzeros, as well as the rows of the factor matrices, randomly to establish load balance. It uses the fine-grain CP-ALS implementation in HyperTensor. Finally, the method ht-finegrain-hp corresponds to the fine-grain CP-ALS implementation in HyperTensor operating on the tensor partitioned according to the hypergraph model proposed for the fine-grain task decomposition. We benchmark our methods against DFacTo on the Netflix and NELL-B datasets only, as DFacTo can only process 3-mode tensors. We let the CP-ALS implementations run for 20 iterations on each data set with R = 10 and record the average time spent per iteration.

In Figures 3.2a and 3.2b, we show the time spent per CP-ALS iteration on the Netflix and NELL-B tensors for all algorithms. In Figures 3.2c and 3.2d, we show the speedups per CP-ALS iteration on the Flickr and Delicious tensors for the methods excluding DFacTo. While performing the comparison with DFacTo, we preferred to show the time spent per iteration instead of the speedup, as the sequential run time of our methods differs from that of


[Figure 3.2: Time for a parallel CP-ALS iteration on Netflix (a) and NELL-B (b), and speedups on Flickr (c) and Delicious (d). Each plot compares ht-finegrain-hp, ht-finegrain-random, ht-coarsegrain-hp, ht-coarsegrain-block, and, where applicable, DFacTo over 1 to 1024 MPI processes; the y-axis shows the average time (in seconds) per CP-ALS iteration in (a) and (b), and the speedup over the sequential execution of a CP-ALS iteration in (c) and (d).]


DFacTo. Also, in Table 3.2, we show the statistics regarding the communication and computation requirements of the 512-way partitions of the Netflix tensor for all proposed methods.

We first observe in Figure 3.2a that, on Netflix, ht-finegrain-hp clearly outperforms all other methods by achieving a speedup of 194x with 512 cores over a sequential execution, whereas ht-coarsegrain-hp, ht-coarsegrain-block, DFacTo, and ht-finegrain-random could only yield 69x, 63x, 49x, and 40x speedups, respectively. Table 3.2 shows that, on the 512-way partitioning of the same tensor, ht-finegrain-random, ht-coarsegrain-block, and ht-coarsegrain-hp result in total send communication volumes of 142M, 80M, and 77M units, respectively, whereas the ht-finegrain-hp partitioning requires only 7.6M units, which explains the superior parallel performance of ht-finegrain-hp. Similarly, in Figures 3.2b, 3.2c, and 3.2d, ht-finegrain-hp achieves 81x, 129x, and 123x speedups on the NELL-B, Flickr, and Delicious tensors, while the best of all other methods could only obtain 39x, 55x, and 65x, respectively, up to 1024 cores. As seen from these figures, ht-finegrain-hp is the fastest of all proposed methods. It experiences slow-downs at 1024 cores for all instances.

One point to note in Figure 3.2b is that our Mttkrp implementation gets notably more efficient than DFacTo on NELL-B. We believe that the main reason for this difference is DFacTo's column-by-column way of computing the factor matrices. We instead compute all columns of the factor matrices simultaneously. This vector-wise mode of operation on the rows of the factors can significantly improve the data locality and cache efficiency, which leads to this performance difference. Another point in the same figure is that HyperTensor running with the ht-coarsegrain-block partitioning scales better than DFacTo, even though they use similar partitionings of the matricized tensor or the tensor. This is mainly due to the fact that DFacTo uses MPI_Allgatherv to communicate the local rows of the factor matrices to all processes, which incurs a significant increase in the communication volume, whereas our coarse-grain implementation performs point-to-point communications. Finally, in Figure 3.2a, DFacTo has a slight edge over our methods in sequential execution. This could be due to the fact that it requires 3(|X| + F) multiply-add operations across all modes, where F is the average number of non-empty fibers and is upper-bounded by |X|, whereas our implementation always takes 6|X| operations.

The partitioning metrics in Table 3.2 show that ht-finegrain-hp is able to successfully reduce the total communication volume, which brings about its improved scalability. However, we do not see the same effect when using hypergraph partitioning for the coarse-grain task model, due to the inherent limitations of 1D partitionings. Another point to note is that, despite reducing the total communication volume explicitly, imbalance in the communication volume still exists among the processes, as this objective is not directly captured by the hypergraph models. However, the most important limitation on further scalability is the communication latency. Table 3.2 shows that


                          Comp. load          Comm. volume        Num. msg.
Mode                      Max       Avg       Max       Avg       Max     Avg
ht-finegrain-hp
  1                       196672    196251    21079     6367      734     316
  2                       196672    196251    18028     5899      1022    1016
  3                       196672    196251    3545      2492      1022    1018
ht-finegrain-random
  1                       197507    196251    272326    252118    1022    1022
  2                       197507    196251    29282     22715     1022    1022
  3                       197507    196251    7766      4300      1013    1003
ht-coarsegrain-hp
  1                       364181    196251    302001    136741    511     511
  2                       349123    196251    59523     12228     511     511
  3                       737570    196251    23524     2000      511     507
ht-coarsegrain-block
  1                       198602    196251    239337    142006    448     447
  2                       367966    196251    33889     12458     511     445
  3                       737570    196251    24659     2049      511     394

Table 3.2: Statistics on the computation and communication requirements in one CP-ALS iteration, for 512-way partitionings of the Netflix tensor.

each process communicates with almost all others, especially in the small dimensions of the tensors, and the number of messages is doubled for the fine-grain methods. To alleviate this issue, one can try to limit the latency following previous work [201, 208].

3.5 Summary, further notes, and references

The pros and cons of the coarse- and fine-grain models resemble those of the 1D and 2D partitionings in the SpMV context. There are two advantages of the coarse-grain model over the fine-grain one. First, there is only one communication/computation step in the coarse-grain model, whereas the fine-grain model requires two computation and communication phases. Second, the coarse-grain hypergraph model is smaller than the fine-grain hypergraph model (one vertex per index of each dimension, versus additionally having one vertex per nonzero). However, one major limitation of the coarse-grain parallel decomposition is that restricting all nonzeros in an (N−1)-dimensional slice of an N-dimensional tensor to reside in the same process poses a significant constraint towards communication minimization: an (N−1)-dimensional piece of data typically has a very large "surface", which necessitates gathering numerous rows of the factor matrices. As N increases, one might expect this phenomenon to increase the communication volume even further. Also, if the tensor X is not large enough in one of its modes, this poses a granularity problem, which potentially causes load imbalance and limits the


parallelism. None of these issues exists in the fine-grain task decomposition.

In a more recent work [135], we extended the discussed approach for efficient parallelization in shared and distributed memory systems. For the shared-memory computations, we proposed a novel computational scheme, called dimension trees, that significantly reduces the computational cost while offering an effective parallelism. We then performed theoretical analyses of the corresponding computational gains and the memory use. These analyses are also validated with the experiments. We also presented comparisons with the recent version of SPLATT [204, 205], which uses a medium-grain task model and associated partitioning [187]. Recent work discusses algorithms for partitioning for the medium-grain task decomposition [3].

A dimension tree is a data structure that partitions the mode indices of an N-dimensional tensor in a hierarchical manner for computing tensor decompositions efficiently. It was first used in the hierarchical Tucker format representing the hierarchical Tucker decomposition of a tensor [99], which was introduced as a computationally feasible alternative to the original Tucker decomposition for higher order tensors. Our work [135] presents a novel way of using dimension trees for computing the standard CP decomposition of tensors, with a formulation that asymptotically reduces the computational cost by memorizing certain partial computations.

For computing the CP decomposition of dense tensors, Phan et al. [189] propose a scheme that divides the tensor modes into two sets, pre-computes the tensor-times-multiple-vector multiplications (TTMVs) for each mode set, and finally reuses these partial results to obtain the final Mttkrp result in each mode. This provides a factor of two improvement in the number of TTMVs over the standard approach, and our dimension tree-based framework can be considered as the generalization of this approach, providing a factor of N/log N improvement. Li et al. [153] use the same idea of storing intermediate tensors, but use a different formulation based on the tensor-times-tensor multiplication and the tensor-matrix Hadamard product operations for shared memory systems. The overall approach is similar to that proposed by Phan et al. [189], where the difference lies in the application of the method to sparse tensors and in auto-tuning to better control the memory use and the gains in the operation counts.

In most of the sparse tensors available (e.g., see FROSTT [203]), one of the dimensions is usually very small. If the indices in that dimension are vertices in a hypergraph, then we have very high degree vertices; if the indices are hyperedges, then we have hyperedges of very large sizes. Both cases are unfavorable for the traditional multilevel hypergraph partitioning methods, whose efficient coarsening heuristics have run times that are super-linear in terms of the vertex/hyperedge degrees.

Part II

Matching and related problems


Chapter 4

Matchings in graphs

In this chapter, we address the problem of approximating the maximum cardinality of a matching in undirected graphs with fast heuristics. Most of the terms and definitions, including the problem itself, are standard, and are summarized in Section 1.2. The formal problem definition is repeated below for convenience.

Approximating the maximum cardinality of a matching in undirected graphs. Given an undirected graph G = (V, E), design a fast (near linear time) heuristic to obtain a matching M of large cardinality with an approximation guarantee.

We design randomized heuristics for the stated problem and analyze their approximation guarantees. Both the design and the analysis of the heuristics are based on doubly stochastic matrices (see Section 1.4) and on scaling nonnegative matrices (in our case, the adjacency matrices) to the doubly stochastic form. We therefore first give some background on scaling matrices to doubly stochastic form.

4.1 Scaling matrices to doubly stochastic form

Any nonnegative matrix A with total support can be scaled with two positive diagonal matrices D_R and D_C such that D_R A D_C is doubly stochastic (that is, the sum of the entries in any row and in any column of D_R A D_C is equal to one). If the matrix is fully indecomposable, then the scaling matrices are unique up to a scalar multiplication (if the pair D_R and D_C scale the matrix, then so do αD_R and (1/α)D_C for α > 0). If A has support but not total support, then A can be scaled to a doubly stochastic matrix, but not with two positive diagonal matrices [202]; this fact is also seen in more recent treatments [139, 141, 198].

The Sinkhorn-Knopp algorithm [202] is a well-known method for scaling matrices to doubly stochastic form. This algorithm generates a sequence of



matrices (whose limit is doubly stochastic) by normalizing the columns and the rows of the sequence of matrices alternately, as shown in Algorithm 5. First, the initial matrix is normalized such that each column has sum one. Then, the resulting matrix is normalized so that each row has sum one, and so on, so forth. If the initial matrix is symmetric and has total support, the Sinkhorn-Knopp algorithm will find a symmetric doubly stochastic matrix at the limit, while the intermediate matrices generated in the sequence are not symmetric. Two other iterative scaling algorithms [140, 141, 198] maintain symmetry all along the way, and are therefore more useful for symmetric matrices in practice. Efficient shared memory and distributed memory parallelizations of the Sinkhorn-Knopp and related algorithms have been discussed elsewhere [9, 41, 76].

Algorithm 5: ScaleSK: Sinkhorn-Knopp scaling

Data: A: an n × n matrix with total support; ε: the error threshold
Result: dr, dc: row/column scaling arrays

1:  for i = 1 to n do
2:      dr[i] ← 1
3:      dc[i] ← 1
4:  for j = 1 to n do
5:      csum[j] ← Σ_{i∈A_{*j}} a_ij
6:  repeat
7:      for j = 1 to n do
8:          dc[j] ← 1/csum[j]
9:      for i = 1 to n do
10:         rsum ← Σ_{j∈A_{i*}} a_ij × dc[j]
11:         dr[i] ← 1/rsum
12:     for j = 1 to n do
13:         csum[j] ← Σ_{i∈A_{*j}} dr[i] × a_ij
14: until |max{1 − csum[j] : 1 ≤ j ≤ n}| ≤ ε
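For illustration, a compact dense-matrix version of Algorithm 5 in Python/NumPy follows; a practical implementation would, of course, operate only on the sparse nonzero pattern, and the stopping test mirrors Line 14.

    import numpy as np

    def scale_sk(A, eps=1e-6, max_iters=1000):
        # Sinkhorn-Knopp scaling: return dr, dc such that diag(dr) A diag(dc)
        # is (approximately) doubly stochastic; A is a nonnegative array.
        n = A.shape[0]
        dr = np.ones(n)
        dc = np.ones(n)
        for _ in range(max_iters):
            dc = 1.0 / (A.T @ dr)                  # make the column sums one
            dr = 1.0 / (A @ dc)                    # make the row sums one
            csum = (A.T @ dr) * dc                 # column sums of the scaled matrix
            if np.max(np.abs(1.0 - csum)) <= eps:
                break
        return dr, dc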

4.2 Approximation algorithms for bipartite graphs

Most of the existing heuristics obtain good results in practice, but their worst-case guarantee is only around 1/2. Among those, the Karp-Sipser heuristic [128] is very well known. It finds maximum cardinality matchings in highly sparse (random) graphs and matches all but O(n^{1/5}) vertices of a random undirected graph [14] (this heuristic will be reviewed later in Section 4.2.1). Karp-Sipser also obtains very good results in practice. Currently, it is the suggested one to be used as a jump-start routine [71, 148] for exact algorithms, especially for augmenting-path based ones [132]. Algorithms that achieve an approximation ratio of 1 − 1/e, where e is the base of the natural logarithm, are designed for the online case [129]. Many of these algorithms are sequential in nature, in that a sequence of greedy decisions are made in the light of the previously made decisions. Another heuristic is obtained by truncating the Hopcroft and Karp (HK) algorithm. HK, starting from a given matching, augments along a maximal set of shortest disjoint paths until there are no augmenting paths. If one lets HK run until the shortest augmenting paths are of length k, then a (1 − 2/k)-approximate matching is obtained for k ≥ 3. The run time of this heuristic is O(τk) for a bipartite graph with τ edges.

We propose two matching heuristics (Section 4.2.2) for bipartite graphs. Both heuristics construct a subgraph of the input graph by randomly choosing some edges. They then obtain a maximum cardinality matching in the selected subgraph and return it as an approximate matching for the input graph. The probability density function for choosing a given edge in both heuristics is obtained with a matrix scaling method. The first heuristic is shown to deliver a constant approximation guarantee of 1 − 1/e ≈ 0.632 of the maximum cardinality matching, under the condition that the scaling method has scaled the input matrix. The second one builds on top of the first one and improves the approximation ratio to 0.866 under the same condition as in the first one. Both of the heuristics are designed to be amenable to parallelization on modern multicore systems. The first heuristic does not require a conflict resolution scheme. Furthermore, it does not have any synchronization requirements. The second heuristic employs Karp-Sipser to find a matching on the selected subgraph. We show that Karp-Sipser becomes an exact algorithm on the selected subgraphs. Further analysis of the properties of those subgraphs is carried out to design a specialized implementation of Karp-Sipser for efficient parallelization. The approximation guarantees of the two proposed heuristics do not deteriorate with the increased degree of parallelization, thanks to their design, which is usually not the case for parallel matching heuristics [16].

For the analysis, we assume that the two vertex classes contain the same number of vertices and that the associated adjacency matrix has total support. Under these criteria, the Sinkhorn-Knopp scaling algorithm summarized in Section 4.1 converges in a finite number of steps and obtains a doubly stochastic matrix. Later on (towards the end of Section 4.2.2 and in the experiments presented in Section 4.2.3), we discuss bipartite graphs without the mentioned properties.

4.2.1 Related work

Parallel (exact) matching algorithms on modern architectures have been recently investigated. Azad et al. [16] study the implementations of a set of known bipartite graph matching algorithms on shared memory systems. Deveci et al. [66, 65] investigate the implementation of some known matching algorithms, or their variants, on GPUs. There are quite good speedups reported in these implementations, yet there are non-trivial instances where parallelism does not help (for any of the algorithms).

Our focus is on matching heuristics that have linear run time complexity and good quality guarantees on the size of the matching. Recent surveys of matching heuristics are given by Duff et al. [71, Section 4], Langguth et al. [148], and, more recently, by Pothen et al. [195]. Two heuristics, called Greedy and Karp-Sipser, stand out and are suggested as initialization steps in the best two exact matching algorithms [132]. These two heuristics have also attracted theoretical interest.

The Greedy heuristic has two variants in the literature. The first variant randomly visits the edges and matches the two endpoints of an edge if they are both available. The theoretical performance guarantee of this heuristic is 1/2, i.e., the heuristic delivers matchings of size at least half of the maximum cardinality of a matching. This has been analyzed theoretically [78] and shown to obtain results near the worst case on certain classes of graphs. The second variant of Greedy repeatedly selects a random vertex and matches it with a random neighbor. The matched vertices, along with the ones which become isolated, are removed from the graph, and the process continues until the whole graph is consumed. This variant also has a 1/2 worst-case approximation guarantee (see, for example, a proof by Pothen and Fan [194]), and it is somewhat better (0.5 + ε for ε ≥ 0.0000025 [13], which has recently been improved to ε ≥ 1/256 [191]).

We make use of the Karp-Sipser heuristic to design one of the proposed heuristics. We summarize a commonly used variant of Karp-Sipser here and refer the reader to the original paper [128] for details. The theoretical foundation of Karp-Sipser is that if there is a vertex v with exactly one neighbor (v is called degree-one), then there is a maximum cardinality matching in which v is matched with its neighbor. That is, matching v with its neighbor is an optimal decision. Using this observation, Karp-Sipser runs as follows. Check whether there is a degree-one vertex; if so, then match the vertex with its unique neighbor and delete both vertices (and the edges incident on them) from the graph. Continue this way until the graph has no edges (in which case we are done) or all remaining vertices have degree larger than one. In the latter case, pick a random edge, match the two endpoints of this edge, and delete those vertices and the edges incident on them. Then, repeat the whole process on the remaining graph. The phase before the first random choice of an edge is called Phase 1, and the rest is called Phase 2 (where new degree-one vertices may arise). The run time of this heuristic is linear. This heuristic matches all but O(n^{1/5}) vertices of a random undirected graph [14]. One disadvantage of Karp-Sipser is that, because of the degree dependencies of the vertices on the already matched vertices, an efficient parallelism is hard to achieve (a list of degree-one vertices needs to be maintained). That is probably why some inflicted forms (successful, but without any known quality guarantee) of this heuristic were used in recent studies [16].

Recent studies focusing on approximate matching algorithms on parallel systems include heuristics for the graph matching problem [42, 86, 105] and also heuristics used for initializing bipartite matching algorithms [16]. Lotker et al. [164] present a distributed (1 − 1/k)-approximate matching algorithm for bipartite graphs. This nice theoretical algorithm takes O(k^3 log Δ + k^2 log n) time steps with a message length of O(log Δ), where Δ is the maximum degree of a vertex and n is the number of vertices in the graph. Blelloch et al. [29] propose an algorithm to compute maximal matchings (1/2-approximate) with O(τ) work and O(log^3 τ) depth, with high probability, on a bipartite graph with τ edges. This is an elaboration of the Greedy heuristic for parallel systems. Although the performance metrics, work and depth, are quite impressive, the approximation guarantee stays as in the serial variant. A striking property of this heuristic is that it trades parallelism and reduced work while always finding the same matching (including the one found in the sequential version). Birn et al. [26] discuss maximal (1/2-approximate) matchings in O(log^2 n) time and O(τ) work in the CREW PRAM model. Practical implementations on distributed memory and GPU systems are discussed; a shared memory implementation is left as future work in the cited paper. Patwary et al. [185] discuss an efficient implementation of Karp-Sipser on distributed memory systems and show that the communication requirement (in terms of the volume) is proportional to that of a sparse matrix-vector multiplication.

The overall approach is related to a random bipartite graph model called the random k-out model [211]. In these graphs, every vertex chooses k neighbors uniformly at random from the other part of the vertices. Walkup [211] shows that a 1-out subgraph of a complete bipartite graph G = (V_1, V_2, E) with |V_1| = |V_2| asymptotically almost surely (a.a.s.) does not contain a perfect matching. He also shows that a 2-out subgraph of a complete bipartite graph a.a.s. contains a perfect matching. Karoński and Pittel [125], later with Overman [124], extend the analysis to two other cases. In the first case, the vertices that were not selected at all choose one more neighbor; in the second case, the vertices that were selected only once choose one more neighbor. The initial claim [125] was that in the first case there was a perfect matching a.a.s., which was shown to be false [124]. The existence of perfect matchings in the second case, in which the graph is still between 1-out and 2-out, has been shown recently [124].

4.2.2 Two matching heuristics

We propose two matching heuristics for the approximate maximum cardinality matching problem on bipartite graphs. The heuristics are efficiently parallelizable and have guaranteed approximation ratios. The first heuristic does not require synchronization nor conflict resolution, assuming that the write operations to the memory are atomic (such settings are common; see a discussion in the original paper [76]). This heuristic has an approximation guarantee of around 0.632. The second heuristic is designed to obtain larger matchings compared to those obtained by the first one. This heuristic employs Karp-Sipser on a judiciously chosen subgraph of the input graph. We show that for this subgraph, Karp-Sipser is an exact algorithm, and a specialized efficient implementation is possible to obtain matchings of size around 0.866 of the maximum cardinality.

One-sided matching

The first matching heuristic we propose, OneSidedMatch, scales the given adjacency matrix A (each a_ij is originally either 0 or 1) and uses the scaled entries to randomly choose a column as a match for each row. The pseudocode of the heuristic is shown in Algorithm 6.

Algorithm 6: OneSidedMatch: Bipartite graphs

Data: A: an n × n (0,1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1: ⟨dr, dc⟩ ← ScaleSK(A)
2: for j = 1 to n in parallel do
3:     cmatch[j] ← NIL
4: for i = 1 to n in parallel do
5:     Pick a random column j ∈ A_{i*} by using the probability density function
           s_ik / Σ_{ℓ∈A_{i*}} s_iℓ,  for all k ∈ A_{i*},
       where s_ik = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
6:     cmatch[j] ← i

OneSidedMatch first obtains the scaling vectors dr and dc corresponding to a doubly stochastic matrix S (Line 1). After initializing the cmatch array, for each row i of A, the heuristic randomly chooses a column j ∈ A_{i*} based on the probabilities computed by using the corresponding scaled entries of row i. It then matches i and j. Clearly, multiple rows can choose the same column and write to the same entry of cmatch. We assume that, in the parallel shared-memory setting, one of the write operations survives, and the cmatch array defines a valid matching, i.e., {{cmatch[j], j} : cmatch[j] ≠ NIL}. We now analyze its approximation guarantee in terms of the matching cardinality.

Theorem 4.1. Let A be an n × n (0,1)-matrix with total support. Then, OneSidedMatch obtains a matching of size at least n(1 − 1/e) ≈ 0.632n in expectation, as n → ∞.

Proof. To compute the matching cardinality, we will count the columns that are not picked by any row and subtract this count from n. Since Σ_{k∈A_{i*}} s_ik = 1 for each row i of S, the probability that a column j is not picked by any of the rows in A_{*j} is equal to ∏_{i∈A_{*j}} (1 − s_ij). By applying the arithmetic-geometric mean inequality, we obtain

    ( ∏_{i∈A_{*j}} (1 − s_ij) )^{1/d_j}  ≤  ( d_j − Σ_{i∈A_{*j}} s_ij ) / d_j ,

where d_j = |A_{*j}| is the degree of the column vertex j. Therefore,

    ∏_{i∈A_{*j}} (1 − s_ij)  ≤  ( 1 − (Σ_{i∈A_{*j}} s_ij) / d_j )^{d_j} .

Since S is doubly stochastic, we have Σ_{i∈A_{*j}} s_ij = 1, and

    ∏_{i∈A_{*j}} (1 − s_ij)  ≤  ( 1 − 1/d_j )^{d_j} .

The function on the right-hand side is an increasing one and has the limit

    lim_{d_j→∞} ( 1 − 1/d_j )^{d_j} = 1/e ,

where e is the base of the natural logarithm. By the linearity of expectation, the expected number of unmatched columns is no larger than n/e. Hence, the cardinality of the matching is no smaller than n(1 − 1/e).

In Algorithm 6, we split the rows among the threads with a parallel for construct. For each row i, the corresponding thread chooses a random number r from a uniform distribution with range (0, Σ_{k∈A_{i*}} s_ik]. Then, the last nonzero column index j for which Σ_{1≤k≤j} s_ik ≤ r is found, and cmatch[j] is set to i. Since no synchronization or conflict detection is required, the heuristic promises significant speedups.
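For illustration, a sequential Python/NumPy version of OneSidedMatch on a dense 0-1 matrix follows; it reuses the scale_sk sketch above and draws each row's column by inverting the cumulative distribution, as described in the preceding paragraph.

    import numpy as np

    def one_sided_match(A, seed=None):
        # Sequential sketch of Algorithm 6: scale A, then let every row pick a column
        # with probability proportional to the scaled entries of that row.
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        dr, dc = scale_sk(A)                       # scaling vectors (see the earlier sketch)
        cmatch = [None] * n
        for i in range(n):
            cols = np.flatnonzero(A[i, :])         # candidate columns A_{i*}
            probs = dr[i] * A[i, cols] * dc[cols]  # scaled entries s_ik
            cum = np.cumsum(probs)
            r = rng.uniform(0.0, cum[-1])          # r in (0, sum_k s_ik]
            j = cols[np.searchsorted(cum, r)]      # first index whose cumulative sum reaches r
            cmatch[j] = i
        return cmatch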

Two-sided matching

OneSidedMatch's approximation guarantee and suitable structure for parallel architectures make it a good heuristic. The natural question that follows is whether a heuristic with a better guarantee exists. Of course, the sought heuristic should also be simple and easy to parallelize. We asked "what happens if we repeat the process for the other (column) side of the bipartite graph?" The question led us to the following algorithm. Let each row select a column, and let each column select a row. Take all these 2n choices to construct a bipartite graph G (a subgraph of the input) with 2n vertices and at most 2n edges (if i chooses j and j chooses i, we have one edge), and seek a maximum cardinality matching in G. Since the number of edges is at most 2n, any exact matching algorithm on this graph would be fast; in particular, the worst case run time would be O(n^{1.5}) [117]. Yet we can do better and obtain a maximum cardinality matching in linear time by running the Karp-Sipser heuristic on G, as we display in Algorithm 7.

Algorithm 7: TwoSidedMatch: Bipartite graphs

Data: A: an n × n (0,1)-matrix with total support
Result: cmatch[·]: the rows matched to columns

1: ⟨dr, dc⟩ ← ScaleSK(A)
2: for i = 1 to n in parallel do
3:     Pick a random column j ∈ A_{i*} by using the probability density function
           s_ik / Σ_{ℓ∈A_{i*}} s_iℓ,  for all k ∈ A_{i*},
       where s_ik = dr[i] × dc[k] is the corresponding entry in the scaled matrix S = D_R A D_C
4:     rchoice[i] ← j
5: for j = 1 to n in parallel do
6:     Pick a random row i ∈ A_{*j} by using the probability density function
           s_kj / Σ_{ℓ∈A_{*j}} s_ℓj,  for all k ∈ A_{*j}
7:     cchoice[j] ← i
8: Construct a bipartite graph G = (V_R ∪ V_C, E), where
       E = {{i, rchoice[i]} : i ∈ {1, ..., n}} ∪ {{cchoice[j], j} : j ∈ {1, ..., n}}
9: match ← Karp-Sipser(G)

The most interesting component of TwoSidedMatch is the incorporation of the Karp-Sipser heuristic, for two reasons. First, although it is only a heuristic in general, Karp-Sipser computes a maximum cardinality matching on the bipartite graph G constructed in Algorithm 7. Second, although Karp-Sipser has a sequential nature, we can obtain good speedups with a specialized parallel implementation. In general, it is hard to efficiently parallelize (non-trivial) graph algorithms, and it is even harder when the overall cost is O(n), which is the case for Karp-Sipser on G. We give a series of lemmas below which enables us to use Karp-Sipser as an exact algorithm with a good shared-memory parallel performance.

The first lemma describes the structure of G constructed at Line 8 of TwoSidedMatch.

Lemma 4.1. Each connected component of G constructed in Algorithm 7 contains at most one simple cycle.

Proof. A connected component M ⊆ G with n′ vertices can have at most n′ edges. Let T be a spanning tree of M. Since T contains n′ − 1 edges, the remaining edge in M can create at most one cycle when added to T.

Lemma 4.1 explains why Karp-Sipser is an exact algorithm on G. If a component does not contain a cycle, Karp-Sipser consumes all its vertices in Phase 1. Therefore, all of the matching decisions given by Karp-Sipser are optimal for this component. Assume a component contains a simple cycle. After Phase 1, the component is either consumed, or, due to Lemma 4.1, it is reduced to a simple cycle. In the former case, the matching is a maximum cardinality one. In the latter case, an arbitrary edge of the cycle can be used to match a pair of vertices. This decision necessarily leads to a unique perfect matching in the remaining simple path. These two arguments can be repeated for all the connected components to see that the Karp-Sipser heuristic finds a maximum cardinality matching in G.
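A sequential sketch of this specialization is given below: Phase 1 peels degree-one vertices, and whatever survives is a union of cycles, on which matching every remaining row vertex with its choice is optimal (in the spirit of Lemma 4.3 below). The representation follows Algorithm 7 (rchoice and cchoice), and the function name is ours; Algorithm 8 is the actual parallel implementation.

    from collections import defaultdict, deque

    def karp_sipser_1out(rchoice, cchoice):
        # Maximum cardinality matching in the 1-out graph G built in Algorithm 7.
        n = len(rchoice)
        adj = defaultdict(set)
        for i in range(n):                         # the (at most) 2n chosen edges
            adj[('r', i)].add(('c', rchoice[i])); adj[('c', rchoice[i])].add(('r', i))
            adj[('c', i)].add(('r', cchoice[i])); adj[('r', cchoice[i])].add(('c', i))
        match = {}
        ones = deque(v for v in adj if len(adj[v]) == 1)
        while ones:                                # Phase 1: peel degree-one vertices (optimal)
            u = ones.popleft()
            if u in match or not adj[u]:
                continue
            v = next(iter(adj[u]))
            match[u], match[v] = v, u
            for w in (u, v):                       # remove the matched pair from the graph
                for x in list(adj[w]):
                    adj[x].discard(w)
                    if len(adj[x]) == 1 and x not in match:
                        ones.append(x)
                adj[w].clear()
        for i in range(n):                         # Phase 2: the remainder is a union of cycles;
            u, v = ('r', i), ('c', rchoice[i])     # matching each row with its choice is optimal
            if u not in match and v not in match:
                match[u], match[v] = v, u
        return match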

Algorithm 8 describes our parallel implementation, Karp-Sipser-MT. The graph is represented using a single array choice, where choice[u] is the vertex randomly chosen by u ∈ V_R ∪ V_C. The choice array is a concatenation of the arrays rchoice and cchoice set in TwoSidedMatch. Hence, an explicit graph construction for G (Line 8 of Algorithm 7) is not required, and a transformation of the selected edges to a graph storage scheme is avoided. Karp-Sipser-MT uses three atomic operations for synchronization. The first operation, Add(memory, value), atomically adds a value to a memory location. It is used to compute the vertex degrees in the initial graph (Line 9). The second operation, CompAndSwap(memory, value, replace), first checks whether the memory location has the value; if so, its content is replaced. The final content is returned. The third operation, AddAndFetch(memory, value), atomically adds a given value to a memory location, and the final content is returned. We will describe the use of these two operations later.

Karp-Sipser-MT has two phases, which correspond to the two phases of Karp-Sipser. The first phase of Karp-Sipser-MT is similar to that of Karp-Sipser, in that optimal matching decisions are made about some degree-one vertices. The second phase of Karp-Sipser-MT handles the remaining vertices very efficiently, without bothering with their degrees. The following definitions are used to clarify the difference between the original Karp-Sipser and Karp-Sipser-MT.

Definition 4.1. Given a matching and the array choice, let u be an unmatched vertex and v = choice[u]. Then u is called


Algorithm 8: Karp-Sipser-MT: Bipartite 1-out graphs

Data: G = (V, choice[·]): the chosen vertex for each u ∈ V
Result: match[·]: the match array, for u ∈ V

1:  for all u ∈ V in parallel do
2:      mark[u] ← 1
3:      deg[u] ← 1
4:      match[u] ← NIL
5:  for all u ∈ V in parallel do
6:      v ← choice[u]
7:      mark[v] ← 0
8:      if choice[v] ≠ u then
9:          Add(deg[v], 1)
10: for each vertex u in parallel do                ▷ Phase 1: out-one vertices
11:     if mark[u] = 1 then
12:         curr ← u
13:         while curr ≠ NIL do
14:             nbr ← choice[curr]
15:             if CompAndSwap(match[nbr], NIL, curr) = curr then
16:                 match[curr] ← nbr
17:                 curr ← NIL
18:                 next ← choice[nbr]
19:                 if match[next] = NIL then
20:                     if AddAndFetch(deg[next], −1) = 1 then
21:                         curr ← next
22:             else
23:                 curr ← NIL
24: for each column vertex u in parallel do         ▷ Phase 2: the rest
25:     v ← choice[u]
26:     if match[u] = NIL and match[v] = NIL then
27:         match[u] ← v
28:         match[v] ← u


• out-one, if v is unmatched and no unmatched vertex w with choice[w] = u exists;

• in-one, if v is matched and only a single unmatched vertex w with choice[w] = u exists.

The first phase of Karp-Sipser-MT (the for loop of Line 10) does not track and match all degree-one vertices. Instead, only the out-one vertices are taken into account. For each such vertex u that is already out-one before Phase 1, we have mark[u] = 1. Karp-Sipser-MT visits these vertices (Lines 10-11). Newly arising out-one vertices are consumed right away, without maintaining a list. The second phase of Karp-Sipser-MT (the for loop of Line 24) is much simpler than that of Karp-Sipser, as the degrees of the vertices are not tracked/updated. We now discuss how these simplifications are possible while ensuring a maximum cardinality matching in G.

Observation 4.1. An out-one or an in-one vertex is a degree-one vertex in Karp-Sipser.

Observation 4.2. A degree-one vertex in Karp-Sipser is either an out-one or an in-one vertex, or it is one of the two vertices u and v in a 2-clique such that v = choice[u] and u = choice[v].

Lemma 4.2 (Proved elsewhere [76, Lemma 3]). If there exists an in-one vertex in G at any time during the execution of Karp-Sipser-MT, an out-one vertex also exists.

According to Observation 4.1, all the matching decisions given by Karp-Sipser-MT in Phase 1 are optimal, since an out-one vertex is a degree-one vertex. Observation 4.2 implies that, among all the degree-one vertices, Karp-Sipser-MT ignores only the in-ones and the 2-cliques. According to Lemma 4.2, an in-one vertex cannot exist without an out-one vertex; therefore, they are handled in the same phase. The 2-cliques that survive Phase 1 are handled in Karp-Sipser-MT's Phase 2, since they can be considered as cycles.

To analyze the second phase of Karp-Sipser-MT, we will use the following lemma.

Lemma 4.3 (Proved elsewhere [76, Lemma 4]). Let G′ = (V′_R ∪ V′_C, E′) be the graph induced by the remaining vertices after the first phase of Karp-Sipser-MT. Then, the set

    {(u, choice[u]) : u ∈ V′_R, choice[u] ∈ V′_C}

is a maximum cardinality matching in G′.

In the light of Observations 4.1 and 4.2 and Lemmas 4.1-4.3, Karp-Sipser-MT is an exact algorithm on the graphs created in Algorithm 7. Its worst-case (sequential) run time is linear.


Figure 4.1: A toy bipartite graph with 9 row (circles) and 9 column (squares) vertices. The edges are oriented from a vertex u to the vertex choice[u]. Assuming all the vertices are currently unmatched, matching 15-7 (or 5-13) creates two degree-one vertices. But no out-one vertex arises after matching (15-7), and only one vertex, 6, arises after matching (5-13).

Karp-Sipser-MT tracks and consumes only the out-one vertices. This brings high flexibility while executing Karp-Sipser-MT in multi-threaded environments. Consider the example in Figure 4.1. Here, after matching a pair of vertices and removing them from the graph, multiple degree-one vertices can be generated. The standard Karp-Sipser uses a list to store these new degree-one vertices. Such a list is necessary to obtain a larger matching, but the associated synchronizations while updating it in parallel would be an obstacle to efficiency. The synchronization can be avoided up to some level if one sacrifices the approximation quality by not making all optimal decisions (as in some existing work [16]). We continue with the following lemma to take advantage of the special structure of the graphs in TwoSidedMatch for parallel efficiency in Phase 1.

Lemma 4.4 (Proved elsewhere [76, Lemma 5]). Consuming an out-one vertex creates at most one new out-one vertex.

According to Lemma 4.4, Karp-Sipser-MT does not need a list to store the new out-one vertices, since the process can continue with the new out-one vertex. In a shared-memory setting, there are two concerns for the first phase from the synchronization point of view. First, multiple threads consuming different out-one vertices can try to match them with the same unmatched vertex. To handle such cases, Karp-Sipser-MT uses the atomic CompAndSwap operation (Line 15 of Algorithm 8) and ensures that only one of these matchings will be processed. In this case, the other threads, whose matching decisions are not performed, continue with the next vertex in the for loop at Line 10. The second concern is that, while consuming out-one vertices, several threads may create the same out-one vertex (and want to continue with it). For example, in Figure 4.1, when two threads consume the out-one vertices 1 and 2 at the same time, they both will try to continue with vertex 4. To handle such cases, an atomic AddAndFetch operation (Line 20 of Algorithm 8) is used to synchronize the degree reduction operations on the potential out-one vertices. This approach explicitly orders the vertex consumptions and guarantees that only the thread which performs the last consumption before a new out-one vertex u appears continues with u. The other threads, which wanted to continue with the same path, stop and skip to the next unconsumed out-one vertex in the main for loop. One last concern that can be important for the parallel performance is the maximum length of such paths, since a very long path can yield a significant imbalance in the work distribution to the threads. We investigated the length of such paths experimentally (see the original paper [76]) and observed them to be too short to hurt the parallel performance.

The second phase of Karp-Sipser-MT is efficiently parallelized by using the idea in Lemma 4.3. That is, a maximum cardinality matching for the graph remaining after the first phase of Karp-Sipser-MT can be obtained via a simple parallel for construct (see Line 24 of Algorithm 8).

Quality of approximation. If the initial matrix A is the n × n matrix of 1s, that is, a_ij = 1 for all 1 ≤ i, j ≤ n, then the doubly stochastic matrix S is such that s_ij = 1/n for all 1 ≤ i, j ≤ n. In this case, the graph G created by Algorithm 7 is a random 1-out bipartite graph [211]. Referring to a study by Meir and Moon [172], Karoński and Pittel [125] argue that the maximum cardinality of a matching in a random 1-out bipartite graph is 2(1 − Ω)n ≈ 0.866n in expectation, where Ω ≈ 0.567, also called Lambert's W(1), is the unique solution of the equation Ωe^Ω = 1. We obtain the same result for a random 1-out subgraph of any bipartite graph whose adjacency matrix has total support, as highlighted in the following theorem.

Theorem 4.2. Let A be the n × n adjacency matrix of a given bipartite graph, where A has total support. Then, TwoSidedMatch obtains a matching of size 2(1 − Ω)n ≈ 0.866n in expectation as n → ∞, where Ω ≈ 0.567 is the unique solution of the equation Ωe^Ω = 1.

Theorem 42 contributes to the known results about the Karp-Sipserheuristic (recall that it is known to leave out O(n15) vertices) by show-ing a constant approximation ratio with some preprocessing The existenceof total support does not seem to be necessary for the Theorem 42 to hold(see the next subsection)

The proof of Theorem 42 is involved and can be found in the originalpaper [76] It essentially adapts the original analysis of Karp and Sipser [128]to the special graphs at hand

Discussion

The Sinkhorn-Knopp scaling algorithm converges linearly (when A has total support), where the convergence rate is given by the square of the second largest singular value of the resulting doubly stochastic matrix [139]. If the matrix does not have support or total support, less is known about the Sinkhorn-Knopp scaling algorithm's convergence: we do not know how much of a reduction we will get at each step. In such cases, we are not able to bound the run time of the scaling step. However, the scaling algorithms should be run for only a few iterations in practice (see also the next paragraph), in which case the practical run time of our heuristics would be linear (in edges and vertices).

We have discussed the proposed matching heuristics while assuming that A has total support. This can be relaxed in two ways to render the overall approach practical in any bipartite graph. The first relaxation is that we do not need to run the scaling algorithms until convergence (as in some other use cases of similar algorithms [141]). If ∑_{i ∈ A_{*j}} s_ij ≥ α, instead of ∑_{i ∈ A_{*j}} s_ij = 1, for all j ∈ {1, …, n}, then lim_{d→∞} (1 − α/d)^d = 1/e^α. In other words, if we apply the scaling algorithms for a few iterations, or until some relatively large error tolerance, we can still derive similar results. For example, if α = 0.92, we will have a matching of size n(1 − 1/e^α) ≈ 0.601n (larger column sums give improved ratios, but there are columns whose sum is less than one when convergence is not achieved) for OneSidedMatch. With the same α, TwoSidedMatch will obtain an approximation of 0.840. In our experiments, the number of iterations was always a few, and the proven approximation guarantees were always observed. The second relaxation is that we do not need total support; we do not even need support, nor an equal number of vertices in the two vertex classes. Since little is known about the scaling methods on such matrices, we only mention some facts and observations, and later on present some experiments to demonstrate the practicality of the proposed OneSidedMatch and TwoSidedMatch heuristics.
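As an illustration of the first relaxation, the following is a minimal sketch of the Sinkhorn-Knopp iteration run for a fixed, small number of steps on a sparse matrix; the function name and the returned quantities are choices made for this sketch, not the implementation used in the experiments.

import numpy as np
import scipy.sparse as sp

def sinkhorn_knopp(A, iters=5):
    """Alternately normalize rows and columns of A for a fixed number of
    iterations; return scaling vectors dr, dc and the remaining row-sum
    error of S = diag(dr) * A * diag(dc)."""
    A = sp.csr_matrix(A, dtype=float)
    dr = np.ones(A.shape[0])
    dc = np.ones(A.shape[1])
    for _ in range(iters):
        dr = 1.0 / (A @ dc)        # make row sums of diag(dr) A diag(dc) equal to 1
        dc = 1.0 / (A.T @ dr)      # then make column sums equal to 1
    S = sp.diags(dr) @ A @ sp.diags(dc)
    err = np.abs(np.asarray(S.sum(axis=1)).ravel() - 1).max()
    return dr, dc, err

In the experiments reported below, the number of such iterations is fixed to 0, 1, 5, or 10.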

A sparse matrix (not necessarily square) can be permuted into a block upper triangular form using the canonical Dulmage-Mendelsohn (DM) decomposition [77]:

A = \begin{pmatrix} H & * & * \\ O & S & * \\ O & O & V \end{pmatrix}, \qquad S = \begin{pmatrix} S_1 & * \\ O & S_2 \end{pmatrix},

where H (horizontal) has more columns than rows and has a matching covering all rows, S is square and has a perfect matching, and V (vertical) has more rows than columns and a matching covering all columns. The following facts about the DM decomposition are well known [192, 194]. Any of these three blocks can be void. If H is not connected, then it is block diagonal with horizontal blocks. If V is not connected, then it is block diagonal with vertical blocks. If S does not have total support, then it is in the block upper triangular form shown on the right, where S_1 and S_2 have the same structure recursively, until each block S_i has total support. The entries in the blocks shown by "*" cannot be put into a maximum cardinality matching. When the presented scaling methods are applied to a matrix, the entries in the "*" blocks will tend to zero (the case of S is well documented [202]). Furthermore, the row sums of the blocks of H will be a multiple of the column sums in the same block; a similar statement holds for V; finally, S will be doubly stochastic. That is, the scaling algorithms applied to bipartite graphs without perfect matchings will zero out the entries in the irrelevant parts and identify the entries that can be put into a maximum cardinality matching.

4.2.3 Experiments

We present a selected set of results from the original paper [76]. We used a shared-memory parallel version of the Sinkhorn-Knopp algorithm with a fixed number of iterations. This way, the scaling algorithm is simplified, as no convergence check is required; furthermore, its overhead is bounded. In most of the experiments we use 0, 1, 5, or 10 scaling iterations, where the case of 0 corresponds to applying the matching heuristics with uniform edge selection probabilities. While these do not guarantee convergence of the Sinkhorn-Knopp scaling algorithm, they seem to be enough for our purposes.

Matching quality

We investigate the matching quality of the proposed heuristics on all square matrices with support from the SuiteSparse Matrix Collection [59] having at least 1000 non-empty rows/columns and at most 20,000,000 nonzeros (some of the matrices are also in the 10th DIMACS challenge [19]). There were 742 matrices satisfying these properties at the time of experimentation. With the OneSidedMatch heuristic, the quality guarantee of 0.632 was surpassed with 10 iterations of the scaling method in 729 matrices. With the TwoSidedMatch heuristic and the same number of iterations of the scaling method, the quality guarantee of 0.866 was surpassed in 705 matrices. Performing 10 more scaling iterations smoothed out the remaining instances. The Greedy heuristic (the second variant discussed in Section 4.2.1) obtained on average (both the geometric and arithmetic means) matchings of size 0.93 of the maximum cardinality; in the worst case, 0.82 was observed.

We collect a few of the matrices from the SuiteSparse collection in Table 4.1. Figures 4.2a and 4.2b show the qualities on our test matrices. In the figures, the first columns represent the case where the neighbors are picked from a uniform distribution over the adjacency lists, i.e., the case with no scaling, hence no guarantee. The quality guarantees are achieved with only 5 scaling iterations. Even with a single iteration, the quality of TwoSidedMatch is more than 0.866 for all matrices, and that of OneSidedMatch


                                           Avg                        sprank
Name                    n     # of edges   deg     stdA      stdB       /n
atmosmodl         1489752     10319760      6.9     0.3       0.3      1.00
audikw_1           943695     77651847     82.2    42.5      42.5      1.00
cage15            5154859     99199551     19.2     5.7       5.7      1.00
channel           4802000     85362744     17.8     1.0       1.0      1.00
europe_osm       50912018    108109320      2.1     0.5       0.5      0.99
Hamrle3           1447360      5514242      3.8     1.6       2.1      1.00
hugebubbles      21198119     63580358      3.0     0.03      0.03     1.00
kkt_power         2063494     12771361      6.2     7.45      7.45     1.00
nlpkkt240        27993600    760648352     26.7     2.22      2.22     1.00
road_usa         23947347     57708624      2.4     0.93      0.93     0.95
torso1             116158      8516500     73.3   419.59    245.44     1.00
venturiLevel3     4026819     16108474      4.0     0.12      0.12     1.00

Table 4.1: Properties of the matrices used in the comparisons. In the table, sprank is the maximum matching cardinality, Avg deg is the average vertex degree, and stdA and stdB are the standard deviations of A and B vertex (row and column) degrees, respectively.

is more than 0.632 for all matrices.

Comparison of TwoSidedMatch with Karp-Sipser. Next, we analyze the performance of the proposed heuristics with respect to Karp-Sipser on a matrix class which we designed as a challenging case for Karp-Sipser. Let A be an n × n matrix, R1 be the set of A's first n/2 rows, and R2 be the set of A's last n/2 rows; similarly, let C1 and C2 be the set of the first n/2 and the set of the last n/2 columns. As Figure 4.3 shows, A has a full R1 × C1 block and an empty R2 × C2 block. The last h ≪ n rows and columns of R1 and C1, respectively, are full. Each of the blocks R1 × C2 and R2 × C1 has a nonzero diagonal; those diagonals form a perfect matching when combined. In the sequel, a matrix whose corresponding bipartite graph has a perfect matching will be called full-sprank, and sprank-deficient otherwise.

When h ≤ 1, the Karp-Sipser heuristic consumes the whole graph during Phase 1 and finds a maximum cardinality matching. When h > 1, Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block R1 × C1 is full, it is more likely that the nonzero will be chosen from this block. Thus, a row in R1 will be matched with a column in C1, which is a bad decision, since the block R2 × C2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that TwoSidedMatch chooses an edge from that block goes to zero, as those entries cannot be in a perfect matching.
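For concreteness, the following is a minimal sketch of how such a test matrix can be generated; the function name and the use of NumPy/SciPy are choices made for this illustration, not the original experimental code.

import numpy as np
import scipy.sparse as sp

def hard_instance(n=3200, h=8):
    """Build the n-by-n 0/1 matrix sketched in Figure 4.3 (a hard case for
    Karp-Sipser): R1 x C1 full, R2 x C2 empty, the last h rows/columns of
    R1/C1 full, and nonzero diagonals in R1 x C2 and R2 x C1."""
    k = n // 2
    A = np.zeros((n, n), dtype=int)
    A[:k, :k] = 1                      # full R1 x C1 block
    A[k - h:k, :] = 1                  # last h rows of R1 are full
    A[:, k - h:k] = 1                  # last h columns of C1 are full
    A[:k, k:] |= np.eye(k, dtype=int)  # diagonal of R1 x C2
    A[k:, :k] |= np.eye(k, dtype=int)  # diagonal of R2 x C1
    return sp.csr_matrix(A)

With n = 3200 and h in {2, 4, 8, 16, 32}, this is the family used in Table 4.2 below.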


[Figure 4.2: Matching qualities of (a) OneSidedMatch and (b) TwoSidedMatch on the test matrices of Table 4.1. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations (0, 1, 5). Plots omitted in this text rendering.]

[Figure 4.3: Adjacency matrices of hard bipartite graph instances for Karp-Sipser, with row blocks R1, R2, column blocks C1, C2, and the last h rows/columns of R1/C1 full. Drawing omitted in this text rendering.]


             Results with different numbers of scaling iterations
      Karp-     0 iters      1 iteration        5 iterations      10 iterations
  h   Sipser     Qual        Err      Qual      Err      Qual      Err      Qual
  2   0.782      0.522     13.853    0.557     3.463    0.989     0.578    0.999
  4   0.704      0.489     11.257    0.516     3.856    0.980     0.604    0.997
  8   0.707      0.466      8.653    0.487     4.345    0.946     0.648    0.996
 16   0.685      0.448      6.373    0.458     4.683    0.885     0.725    0.990
 32   0.670      0.447      4.555    0.453     4.428    0.748     0.867    0.980

Table 4.2: Quality comparison (minimum of 10 executions for each instance) of the Karp-Sipser heuristic and TwoSidedMatch on matrices described in Fig. 4.3, with n = 3200 and h ∈ {2, 4, 8, 16, 32}.

The results of the experiments are in Table 4.2. The first column shows the h value. Then, the matching quality obtained by Karp-Sipser and by TwoSidedMatch with different numbers of scaling iterations (0, 1, 5, 10), as well as the scaling error, are given. The scaling error is the maximum difference between 1 and each row/column sum of the scaled matrix (for 0 iterations, it is equal to n − 1 for all cases). The quality of a matching is computed by dividing its cardinality by the maximum one, which is n = 3200 for these experiments. To obtain the values in each cell of the table, we run the programs 10 times and give the minimum quality (as we are investigating the worst-case behavior). The highest variances for Karp-Sipser and TwoSidedMatch were (up to four significant digits) 0.0041 and 0.0001, respectively. As expected, when h increases, Karp-Sipser performs worse, and the matching quality drops to 0.67 for h = 32. TwoSidedMatch's performance increases with the number of scaling iterations. As the experiment shows, only 5 scaling iterations are sufficient to make the proposed two-sided matching heuristic significantly better than Karp-Sipser. However, this number was not enough to reach 0.866 for the matrix with h = 32. On this matrix, with 10 iterations, only 2% of the rows/columns remain unmatched.

Matching quality on bipartite graphs without perfect matchings

We analyze the proposed heuristics on a class of random sprank-deficient square (n = 100000) and rectangular (m = 100000 and n = 120000) matrices with a uniform nonzero distribution (two more sprank-deficient matrices are used in the scalability tests as well). These matrices are generated by Matlab's sprand command (generating Erdős–Rényi random matrices [85]). The total number of nonzeros is set to be around d × 100000, for d ∈ {2, 3, 4, 5}. The top half of Table 4.3 presents the results of this experiment with square matrices, and the bottom half presents the results with the rectangular ones.
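A minimal sketch of how such instances can be generated, using SciPy's sparse random generator as a stand-in for Matlab's sprand; the function name is a choice made here.

import scipy.sparse as sp

def random_deficient_instance(m=100000, n=120000, d=3, seed=0):
    """Erdős-Rényi-like random 0/1 sparse matrix with about d*m nonzeros,
    mimicking Matlab's sprand(m, n, d/n); for small d such matrices are
    sprank-deficient with high probability."""
    A = sp.random(m, n, density=d / n, format='csr', random_state=seed)
    A.data[:] = 1.0            # keep only the sparsity pattern
    return A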

As in the previous experiments, the matching qualities in the table are the minimum of 10 executions for the corresponding instances. As Table 4.3


                  One-     Two-                          One-     Two-
                  Sided    Sided                         Sided    Sided
 d  iter  sprank  Match    Match     d  iter  sprank     Match    Match

                         Square, 100000 × 100000
 2    0   78225   0.770    0.912     4    0   97787      0.644    0.838
 2    1   78225   0.797    0.917     4    1   97787      0.673    0.848
 2    5   78225   0.850    0.939     4    5   97787      0.719    0.873
 2   10   78225   0.879    0.954     4   10   97787      0.740    0.886
 3    0   92786   0.673    0.851     5    0   99223      0.635    0.840
 3    1   92786   0.703    0.857     5    1   99223      0.662    0.851
 3    5   92786   0.756    0.884     5    5   99223      0.701    0.873
 3   10   92786   0.784    0.902     5   10   99223      0.716    0.882

                      Rectangular, 100000 × 120000
 2    0   87373   0.793    0.912     4    0   99115      0.729    0.899
 2    1   87373   0.815    0.918     4    1   99115      0.754    0.910
 2    5   87373   0.861    0.939     4    5   99115      0.792    0.933
 2   10   87373   0.886    0.955     4   10   99115      0.811    0.946
 3    0   96564   0.739    0.896     5    0   99761      0.725    0.905
 3    1   96564   0.769    0.904     5    1   99761      0.749    0.917
 3    5   96564   0.813    0.930     5    5   99761      0.781    0.936
 3   10   96564   0.836    0.945     5   10   99761      0.792    0.943

Table 4.3: Matching qualities of the proposed heuristics on random sparse matrices with uniform nonzero distribution; square matrices in the top half, rectangular matrices in the bottom half. d: average number of nonzeros per row.

shows, when the deficiency is high (correlated with small d), it is easier for our algorithms to approximate the maximum cardinality. However, when d gets larger, the algorithms require more scaling iterations. Even in this case, 5 iterations are sufficient to achieve the guaranteed qualities. In the square case, the minimum quality achieved by OneSidedMatch and TwoSidedMatch was 0.701 and 0.873, respectively. In the rectangular case, the minimum quality achieved by OneSidedMatch and TwoSidedMatch was 0.781 and 0.930, respectively, with 5 scaling iterations. In all cases, increasing the number of scaling iterations results in higher quality matchings, in accordance with the previous results on square matrices with perfect matchings.

Comparisons with some other codes

We run Azad et al.'s [16] variant of Karp-Sipser (ksV), the two greedy matching heuristics of Blelloch et al. [29] (inc and nd, which are referred to as "incremental" and "non-deterministic reservation" in the original paper), and the proposed OneSidedMatch and TwoSidedMatch heuristics with one and 16 threads on the matrices given in Table 4.1. We also add one more matrix, wc_50K_64, from the family shown in Fig. 4.3, with n = 50000 and h = 64.


                                   Run time in seconds
                        One thread                               16 threads
Name             ksV      inc      nd    1SD    2SD      ksV      inc      nd    1SD    2SD
atmosmodl       0.20    14.80    0.07   0.12   0.30     0.03     1.63    0.01   0.01   0.03
audikw_1        0.98   206.00    0.22   0.55   0.82     0.10    16.90    0.02   0.06   0.09
cage15          1.54     9.96    0.39   0.92   1.95     0.29     1.10    0.04   0.09   0.19
channel         1.29   105.00    0.30   0.82   1.46     0.14     9.07    0.03   0.08   0.13
europe_osm     12.03    51.80    2.06   4.89  11.97     1.20     7.12    0.21   0.46   1.05
Hamrle3         0.15     0.12    0.04   0.09   0.25     0.02     0.02    0.01   0.01   0.02
hugebubbles     8.36     4.65    1.28   3.92  10.04     0.95     0.53    0.18   0.37   0.94
kkt_power       0.55     0.68    0.09   0.19   0.45     0.08     0.16    0.01   0.02   0.05
nlpkkt240      10.88  1900.00    2.56   5.56  10.11     1.15   237.00    0.23   0.58   1.06
road_usa        6.23     3.57    0.96   2.16   5.42     0.70     0.38    0.09   0.22   0.50
torso1          0.11     7.28    0.02   0.06   0.09     0.02     1.20    0.00   0.01   0.01
venturiLevel3   0.45     2.72    0.13   0.32   0.84     0.08     0.29    0.01   0.04   0.09
wc_50K_64       7.21  3200.00    1.46   4.39   5.80     1.17   402.00    0.12   0.55   0.59

                              Quality of the obtained matching
Name             ksV      inc      nd    1SD    2SD      ksV      inc      nd    1SD    2SD
atmosmodl       1.00     1.00    1.00   0.66   0.87     0.98     1.00    0.97   0.66   0.87
audikw_1        1.00     1.00    1.00   0.64   0.87     0.95     1.00    0.97   0.64   0.87
cage15          1.00     1.00    1.00   0.64   0.87     0.95     1.00    0.91   0.64   0.87
channel         1.00     1.00    1.00   0.64   0.87     0.99     1.00    0.99   0.64   0.87
europe_osm      0.98     0.95    0.95   0.75   0.89     0.98     0.95    0.94   0.75   0.89
Hamrle3         1.00     0.84    0.84   0.75   0.90     0.97     0.84    0.86   0.75   0.90
hugebubbles     0.99     0.96    0.96   0.70   0.89     0.96     0.96    0.90   0.70   0.89
kkt_power       0.85     0.79    0.79   0.78   0.89     0.87     0.79    0.86   0.78   0.89
nlpkkt240       0.99     0.99    0.99   0.64   0.86     0.99     0.99    0.99   0.64   0.86
road_usa        0.94     0.81    0.81   0.76   0.88     0.94     0.81    0.81   0.76   0.88
torso1          1.00     1.00    1.00   0.65   0.88     0.98     1.00    0.98   0.65   0.87
venturiLevel3   1.00     1.00    1.00   0.68   0.88     0.97     1.00    0.96   0.68   0.88
wc_50K_64       0.50     0.50    0.50   0.81   0.98     0.70     0.50    0.51   0.81   0.98

Table 4.4: Run time and quality of different heuristics. OneSidedMatch and TwoSidedMatch are labeled with 1SD and 2SD, respectively, and apply two scaling iterations. The heuristic ksV is from Azad et al. [16], and the heuristics inc and nd are from Blelloch et al. [29].

The OneSidedMatch and TwoSidedMatch heuristics are run with two scaling iterations, and the reported run time includes all components of the heuristics. The inc and nd heuristics are compiled with g++ 4.4.5, and ksV is compiled with gcc 4.4.5. For all the heuristics, we used -O2 optimization and the appropriate OpenMP flag for compilation. The experiments were carried out on a machine equipped with two Intel Sandybridge-EP CPUs clocked at 2.00 GHz and 256 GB of memory split across the two NUMA domains. Each CPU has eight cores (16 cores in total) and HyperThreading is enabled. Each core has its own 32 kB L1 cache and 256 kB L2 cache. The 8 cores on a CPU share a 20 MB L3 cache. The machine runs 64-bit Debian with Linux 2.6.39-bpo.2-amd64. The run time of all heuristics (in seconds) and their matching qualities are given in Table 4.4.


As seen in Table 4.4, nd is the fastest of the tested heuristics with both one thread and 16 threads on this data set. The second fastest heuristic is OneSidedMatch. The heuristic ksV is faster than TwoSidedMatch on six instances with one thread. With 16 threads, TwoSidedMatch is almost always faster than ksV; both heuristics are quite efficient and have a maximum run time of 1.17 seconds. The heuristic inc was observed to have large run times on some matrices with relatively many nonzeros per row. All the existing heuristics are always very efficient, except inc, which had difficulties on some matrices in the data set. In the original reference [29], inc is quite efficient for some other matrices (where the codes are compiled with Cilk). We do not see run times for nd elsewhere, but in our data set it was always better than inc. The proposed OneSidedMatch heuristic's quality is almost always close to its theoretical limit. TwoSidedMatch also obtains quality results close to its theoretical limit. Looking at the quality of the matchings with one thread, we see that ksV obtains (almost always) the best score, except on kkt_power and wc_50K_64, where TwoSidedMatch obtains the best score. The heuristics inc and nd obtain the same score with one thread, but with 16 threads, inc obtains better quality than nd almost always. OneSidedMatch never obtains the best score (TwoSidedMatch is always better), yet it is better than ksV, inc, and nd on the synthetic wc_50K_64 matrix. The TwoSidedMatch heuristic obtains better results than inc and nd on Hamrle3, kkt_power, road_usa, and wc_50K_64 with both one thread and 16 threads. Also, on hugebubbles with 16 threads, the difference between TwoSidedMatch and nd is 1% (in favor of nd).

4.3 Approximation algorithms for general undirected graphs

We extend the results on bipartite graphs of the previous section to general undirected graphs. We again select a subgraph randomly with a probability density function obtained by scaling the adjacency matrix of the input graph, and run Karp-Sipser [128] on the selected subgraph. Again, Karp-Sipser becomes an exact algorithm on the selected subgraph, and analysis reveals that the approximation ratio is around 0.866 of the maximum cardinality in the original graph. We also propose two other variants of this algorithm, which obtain better results both in theory and in practice. We omit most of the proofs in the presentation; they can be found in the associated paper [73].

Recall that a graph G′ = (V′, E′) is a subgraph of G = (V, E) if V′ ⊆ V and E′ ⊆ E. Let G_{kout} = (V, E′) be a random subgraph of G where each vertex in V randomly chooses k of its edges, with repetition (an edge {u, v} is included in G_{kout} only once, even if it is chosen multiple times). We call G_{kout} a k-out subgraph of G.

A permutation of [1, …, n] in which no element stays in its original position is called a derangement. The total number of derangements of [1, …, n] is denoted by !n and is the nearest integer to n!/e [36, pp. 40–42], where e is the base of the natural logarithm.

4.3.1 Related work

The first polynomial time algorithm, with O(n²τ) complexity, for the maximum cardinality matching problem on general graphs with n vertices and τ edges was proposed by Edmonds [79]. Currently, the fastest algorithms have O(√n τ) complexity [30, 92, 174]. The first of these algorithms is by Micali and Vazirani [174]; recently, a simpler analysis of it was given by Vazirani [210]. Later, Blum [30] and Gabow and Tarjan [92] proposed algorithms with the same complexity.

Random graphs have been frequently analyzed in terms of their maximum matching cardinality. Frieze [90] shows results for undirected (general) graphs along the lines of Walkup's results for bipartite graphs [211]. In particular, he shows that if each vertex chooses uniformly at random one neighbor (among all vertices), then the constructed graph does not have a perfect matching almost always. In other words, he shows that random 1-out subgraphs of a complete graph do not have perfect matchings. Luckily, as in bipartite graphs, if each vertex chooses two neighbors, then the constructed graph has a perfect matching almost always. That is, random 2-out subgraphs of a complete graph have perfect matchings. Our contributions are about making these statements valid for any host graph (assuming the adjacency matrix has total support) and measuring the approximation guarantee of the 1-out case.

Randomized algorithms which check the existence of a perfect matching and generate perfect/maximum matchings have been proposed in the literature [45, 120, 165, 177, 178]. Lovász showed that the existence of a perfect matching can be verified in randomized O(n^ω) time, where O(n^ω) is the time complexity of the fastest matrix multiplication algorithm available [165]. More information, such as the set of allowed edges, i.e., the edges in some maximum matching of a graph, can also be found with the same randomized complexity, as shown by Cheriyan [45].

To construct a maximum cardinality matching in a general, non-bipartite graph, a simple, easy to implement algorithm with O(n^{ω+1}) randomized complexity was proposed by Rabin and Vazirani [197]. The algorithm uses matrix inversion as a subroutine. Later, Mucha and Sankowski proposed the first algorithm for the same problem with O(n^ω) randomized complexity [178]. This algorithm uses expensive subroutines, such as Gaussian elimination and equivalence classes formed based on the edges that appear in some perfect matching [166]. A simpler randomized algorithm with the same complexity was proposed by Harvey [108].


4.3.2 One-Out, its analysis, and variants

We first present the main heuristic and then analyze its approximation guarantee. While the heuristic is a straightforward adaptation of its counterpart for bipartite graphs [76], the analysis is more complicated because of odd cycles. The analysis shows that random 1-out subgraphs of a given graph have a maximum cardinality of a matching around 0.866 − log(n)/n of the best possible; the observed performance (see Section 4.3.3) is higher. Then, two variants of the main heuristic are discussed without proofs. They extend the 1-out subgraphs obtained by the first one and hence deliver better results theoretically.

The main heuristic: One-Out

The heuristic, shown in Algorithm 9, first scales the adjacency matrix of a given undirected graph to be doubly stochastic. Based on the values in the matrix, the heuristic then randomly marks one of the neighbors of each vertex as chosen. This defines one marked edge per vertex. Then, the subgraph of the original graph containing only the marked edges is formed, yielding an undirected graph with at most n edges, each vertex having at least one edge incident on it. The Karp-Sipser heuristic is run on this subgraph and finds a maximum cardinality matching on it, as the resulting 1-out graph has unicyclic components. The approximation guarantee of the proposed heuristic is analyzed for this step. One practical improvement over the previous work is to make sure that the matching obtained is a maximal one, by running Karp-Sipser on the vertices that are not matched. This brings in a large improvement to the cardinality of the matching in practice.
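A minimal sketch of the neighbor-selection step (lines 2-4 of Algorithm 9) on a symmetric sparse matrix, assuming the scaling vector d has already been computed (for instance with a few Sinkhorn-Knopp-like iterations); the function name and return format are choices made for this illustration.

import numpy as np
import scipy.sparse as sp

def one_out_edges(A, d, seed=0):
    """For each vertex i, pick one neighbor j with probability proportional to
    the scaled entry s_ij = d[i] * a_ij * d[j]; return the edge set of the
    1-out subgraph."""
    A = sp.csr_matrix(A)
    rng = np.random.default_rng(seed)
    edges = set()
    for i in range(A.shape[0]):
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
        vals = A.data[A.indptr[i]:A.indptr[i + 1]] * d[i] * d[cols]
        j = rng.choice(cols, p=vals / vals.sum())
        edges.add((min(i, j), max(i, j)))   # keep {i, j} once even if chosen twice
    return edges

Karp-Sipser run on the returned subgraph is then an exact algorithm, since every component of a 1-out graph contains at most one cycle.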

Let us classify the edges incident on a vertex v_i as in-edges (from those neighbors that have chosen i at Line 4) and an out-edge (to the neighbor chosen by i at Line 4). Dufossé et al. [76, Lemma 3] show that during any execution of Karp-Sipser, it suffices to consider the vertices whose in-edges are unavailable but whose out-edges are available as degree-1 vertices.

The analysis traces an execution of Karp-Sipser on the subgraph G_{1out}. In other words, the analysis can also be perceived as an analysis of the Karp-Sipser heuristic's performance on random 1-out graphs. Let A1 be the set of vertices not chosen by any other vertex at Line 4 of Algorithm 9. These vertices have in-degree zero and out-degree one, and hence can be processed by Karp-Sipser. Let B1 be the set of vertices chosen by the vertices in A1. The vertices in B1 can be perfectly matched with vertices in A1, leaving some A1 vertices not matched and creating some new in-degree-0 vertices. We can proceed to define A2 as the vertices that have in-degree 0 in V \ (A1 ∪ B1), and define B2 as those chosen by A2, and so on so forth. Formally, let B0 be an empty set, and define A_i to be the set of vertices with in-degree 0 in V \ B_{i−1}, and B_i to be the vertices chosen by those in A_i,


Algorithm 9: One-Out (undirected graphs)

Data: G = (V, E) and its adjacency matrix A
Result: match[·], the matching

1: D ← SymScale(A)
2: for i = 1 to n in parallel do
3:     Pick a random j ∈ A_{i*} by using the probability density function
           s_{ik} / Σ_{t ∈ A_{i*}} s_{it}    for all k ∈ A_{i*},
       where s_{ik} = d[i] × d[k] is the corresponding entry in the scaled matrix S = DAD
4:     Mark that i chooses j
5: Construct a graph G_{1out} = (V, E), where
       V = {1, …, n},  E = {(i, j) : i chose j or j chose i}
6: match ← KarpSipserOne-Out(G_{1out})
7: Make match a maximal matching

for i ≥ 1. Notice that A_i ⊆ A_{i+1} and B_i ⊆ B_{i+1}. The Karp-Sipser heuristic can process A1, then A2 \ A1, and so on, until the remaining graph has cycles only. The sets A_i and B_i and their cardinalities are at the core of our analysis. We first present some facts about these sets and their cardinalities, and describe an implementation of Karp-Sipser instrumented to highlight them.

Lemma 4.5 With the definitions above, A_i ∩ B_i = ∅.

Proof. We prove this by induction. For i = 1, it clearly holds. Assume that it holds for all i < ℓ. Suppose there exists a vertex u ∈ A_ℓ ∩ B_ℓ. Because A_{ℓ−1} ∩ B_{ℓ−1} = ∅, u must necessarily belong to both A_ℓ \ A_{ℓ−1} and B_ℓ \ B_{ℓ−1}. For u to be in B_ℓ \ B_{ℓ−1}, there must exist at least one vertex v ∈ A_ℓ \ A_{ℓ−1} such that v chooses u. However, the condition for u ∈ A_ℓ is that no vertex in V \ (A_{ℓ−1} ∪ B_{ℓ−1}) has selected it. This is a contradiction, and the intersection A_ℓ ∩ B_ℓ should be empty.

Corollary 4.1 A_i ∩ B_j = ∅ for i ≤ j.

Proof. Assume A_i ∩ B_j ≠ ∅. Since A_i ⊆ A_j, we have a contradiction, as A_j ∩ B_j = ∅ by Lemma 4.5.

Thanks to Lemma 4.5 and Corollary 4.1, the sets A_i and B_i are disjoint, and they form a bipartite subgraph of G_{1out} for all i = 1, …, ℓ. The proposed version Karp-SipserOne-Out of the Karp-Sipser heuristic for 1-out graphs is shown in Algorithm 10. The degree-1 vertices are kept


in a first-in first-out priority queue Q. The queue is first initialized with A1, and the character '∗' is used to mark the end of A1. Then, all vertices in A1 are matched to some other vertices, defining B1. When we remove two matched vertices from the graph G_{1out} at Lines 29 and 40, we update the degrees of their remaining neighbors and append the vertices whose degrees become one to the queue. During Phase 1 of Karp-SipserOne-Out, we also maintain the sets of A_i and B_i vertices, while storing only the last ones; A_ℓ and B_ℓ are returned along with the number ℓ of levels, which is computed thanks to the use of the marker.

Apart from the scaling step, the proposed heuristic in Algorithm 9 has a linear worst-case time complexity of O(n + τ). The scaling step, if applied until convergence, can take more time than that. We do not suggest running it until convergence; for all practical purposes, 5 or 10 iterations of the basic method [141], or even fewer of the Newton iterations [140], seem sufficient (see experiments). Therefore, the practical run time of the algorithm is linear.

Analysis of One-Out

Let a_i and b_i be two random variables representing the cardinalities of A_i and B_i, respectively, in an execution of Karp-SipserOne-Out on a random 1-out graph. Then, Karp-SipserOne-Out matches b_ℓ edges in the first phase and leaves a_ℓ − b_ℓ vertices unmatched. What remains after the first phase is a set of cycles. In the bipartite graph case [76], all vertices in those cycles are matchable, and hence the cardinality of the matching was measured by n − a_ℓ + b_ℓ. Since we can possibly have odd cycles after the first phase, we cannot match all remaining vertices in the general case of undirected graphs. Let c be a random variable representing the number of odd cycles after the first phase of Karp-SipserOne-Out. Then, we have the following (obvious) lemma about the approximation guarantee of Algorithm 9.

Lemma 4.6 At the end of execution, the number of unmatched vertices is a_ℓ − b_ℓ + c. Hence, Algorithm 9 matches at least n − (a_ℓ − b_ℓ + c) vertices.

We need to quantify a_ℓ − b_ℓ and c in Lemma 4.6. We state an upper bound on a_ℓ − b_ℓ in Lemma 4.7 and an upper bound on c in Lemma 4.8, without any proof (the reader can find the proofs in the original paper [73]). While the bound for a_ℓ − b_ℓ holds for any graph with total support presented to Algorithm 9, the bounds for c are shown for random 1-out graphs of complete graphs. By plugging in the bounds for these quantities, we obtain the following theorem on the approximation guarantee of Algorithm 9.

Theorem 4.3 Algorithm 9 obtains a matching with cardinality at least 0.866 − ⌈1.04 log(0.336n)⌉/n in expectation when the input is a complete graph.


Algorithm 10: Karp-SipserOne-Out (undirected graphs)

Data: G_{1out} = (V, E)
Result: match[·], the mates of vertices
Result: ℓ, the number of levels in the first phase
Result: A, the set of degree-1 vertices in the first phase
Result: B, the set of vertices matched to A vertices

 1: match[u] ← NIL for all u
 2: Q ← {v : deg(v) = 1}                       ▷ degree-1, or in-degree 0
 3: if Q = ∅ then
 4:     ℓ ← 0                                  ▷ no vertex in level 1
 5: else
 6:     ℓ ← 1
 7: Enqueue(Q, ∗)                              ▷ ∗ marks the end of the first level
 8: Phase-1 ← ongoing
 9: A ← B ← ∅
10: while true do
11:     while Q ≠ ∅ do
12:         u ← Dequeue(Q)
13:         if u = ∗ and Q = ∅ then
14:             break the while-Q loop
15:         else if u = ∗ then
16:             ℓ ← ℓ + 1
17:             Enqueue(Q, ∗)                  ▷ new level formed
18:             skip to the next while-Q iteration
19:         if match[u] ≠ NIL then
20:             skip to the next while-Q iteration
21:         for v ∈ adj(u) do
22:             if match[v] = NIL then
23:                 match[u] ← v
24:                 match[v] ← u
25:                 if Phase-1 = ongoing then
26:                     A ← A ∪ {u}
27:                     B ← B ∪ {v}
28:                 N ← adj(v)
29:                 G_{1out} ← G_{1out} \ {u, v}
30:                 Enqueue(Q, w) for each w ∈ N with deg(w) = 1
31:                 break the for-v loop
32:         if Phase-1 = ongoing and match[u] = NIL then
33:             A ← A ∪ {u}                    ▷ u cannot be matched
34:     Phase-1 ← done
35:     if E ≠ ∅ then
36:         pick a random edge (u, v)
37:         match[u] ← v
38:         match[v] ← u
39:         N ← adj({u, v})
40:         G_{1out} ← G_{1out} \ {u, v}
41:         Enqueue(Q, w) for each w ∈ N with deg(w) = 1
42:     else
43:         break the while-true loop


The theorem can also be interpreted more theoretically: a random 1-out graph has a maximum cardinality of a matching of at least 0.866 − ⌈1.04 log(0.336n)⌉/n in expectation. The bound is in the close vicinity of 0.866, which was the proved bound for the bipartite case [76]. We note that, in deriving this bound, we assumed a random 1-out graph (as opposed to a random 1-out subgraph of a given graph) only at a single step. We leave the extension to this latter case as future work and present experiments suggesting that the bound is also achieved for this case. In Section 4.3.3, we empirically show that the same bound also holds for graphs whose corresponding matrices do not have total support.

In order to measure a_ℓ − b_ℓ, we adapted a proof from earlier work [76], which was inspired by Karp and Sipser's analysis of the first phase of their heuristic [128], and obtained the following lemma.

Lemma 4.7 a_ℓ − b_ℓ ≤ (2Ω − 1)n, where Ω ≈ 0.567 equals W(1) of Lambert's W function.

The lemma also reads as a_ℓ − b_ℓ ≤ Σ_{i=1}^{n} (2 · 0.567 − 1) ≤ 0.134 · n.

In order to achieve the result of Theorem 4.3, we need to bound c, the number of odd cycles that remain in a G_{1out} graph after the first phase of Karp-SipserOne-Out. We were able to bound c on random 1-out graphs (that is, we did not show it for a general graph).

Lemma 4.8 The expected number c of cycles remaining after the first phase of Karp-SipserOne-Out in random 1-out graphs is less than or equal to ⌈1.04 log(n − a_ℓ − b_ℓ)⌉, which is also less than or equal to ⌈1.04 log(0.336n)⌉.

Variants

Here we summarize two related theoretical random bipartite graph models that we adapt to the undirected case, using similar algorithms. The presentation is brief and without proofs; we will present experiments in Section 4.3.3.

The random (1 + e^{-1})-out undirected graph model (see the bipartite version [125], summarized in Section 4.2.1) lets each vertex choose a random neighbor. Then, the vertices that have not been chosen select one more neighbor. The maximum cardinality of a matching in the subgraph consisting of the identified edges can be computed as an approximate matching in the original graph. As this heuristic uses a richer graph structure than 1-out, we expect perfect or near perfect matchings in the general case.

A model richer in edges is the random 2-out graph model. In this model, each vertex chooses two neighbors. There are two enticing characteristics of this model in the uniform bipartite case. First, the probability of having a perfect matching goes to one with the increasing number of vertices [211]. Second, there is a special algorithm for computing the maximum cardinality matchings in these (bipartite) graphs [127], with high probability, in linear time in expectation.


4.3.3 Experiments

To understand the effectiveness and efficiency of the proposed heuristics in practice, we report the matching quality and various statistics regarding the sets that appear in our analyses in Section 4.3.2. For the experiments, we used three graph datasets. The first set is generated with matrices from the SuiteSparse Matrix Collection [59]. We investigated all n × n matrices from the collection with 50000 ≤ n ≤ 100000. For a matrix from this set, we removed the diagonal entries, symmetrized the pattern of the resulting matrix, and discarded the matrix if it had empty rows. There were 115 matrices at the end, which we used as adjacency matrices. The graphs in the second dataset are synthetically generated to make Karp-Sipser deviate from the optimal matching as much as possible. This dataset contains five graphs with different hardness levels for Karp-Sipser. The third set contains five large real-life matrices from the SuiteSparse collection, for measuring the run time efficiency of the proposed heuristics.

A comprehensive evaluation

We use MATLAB to measure the matching qualities based on the first two datasets. For each matrix, five runs are performed with each randomized matching heuristic, and the average is reported. One, five, and ten iterations are performed to evaluate the impact of the scaling method.

Table 4.5 summarizes the quality of the matchings for all the experiments on the first dataset. The matching quality is measured as the ratio of the matching cardinality to the maximum cardinality of a matching in the original graph. The table presents statistics for matching qualities of Karp-Sipser performed on the original graphs (first row), 1-out graphs (the second set of rows), Karoński-Pittel-like (1 + e^{-1})-out graphs (the third set of rows), and 2-out graphs (the last set of rows).

For the U-Max rows, we construct k-out graphs by using uniform probabilities while selecting the neighbors, as proposed in the literature [90, 125]. We compare the cardinality of the maximum matchings in these k-out graphs to the maximum matching cardinality in the original graphs and report the statistics. The rows St-Max report the same statistics for the k-out graphs constructed by using probabilities with t ∈ {1, 5, 10} scaling iterations. These statistics serve as upper bounds on the matching qualities of the proposed St-KS heuristics, which execute Karp-Sipser on the k-out graphs obtained with t scaling iterations. Since the St-KS heuristics use only a subgraph, the matchings they obtain are not maximal with respect to the original edge set. The proposed St-KS+ heuristics exploit this fact and apply another Karp-Sipser phase on the subgraph containing only the unmatched vertices, to improve the quality of the matchings. The table does not contain St-Max rows for 1-out graphs, since Karp-Sipser is an optimal


              Alg          Min     Max     Avg    GMean    Med    StDev

              Karp-Sipser  0.880   1.000   0.980   0.980   0.988   0.022

1-out         U-Max        0.168   1.000   0.846   0.837   0.858   0.091
              S1-KS        0.479   1.000   0.869   0.866   0.863   0.059
              S5-KS        0.839   1.000   0.885   0.884   0.865   0.044
              S10-KS       0.858   1.000   0.889   0.888   0.866   0.045
              S1-KS+       0.836   1.000   0.951   0.950   0.953   0.043
              S5-KS+       0.865   1.000   0.958   0.957   0.968   0.036
              S10-KS+      0.888   1.000   0.961   0.961   0.971   0.033

(1+e^-1)-out  U-Max        0.251   1.000   0.952   0.945   0.967   0.081
              S1-Max       0.642   1.000   0.967   0.966   0.980   0.042
              S5-Max       0.918   1.000   0.977   0.977   0.985   0.020
              S10-Max      0.934   1.000   0.980   0.979   0.985   0.018
              S1-KS        0.642   1.000   0.963   0.962   0.972   0.041
              S5-KS        0.918   1.000   0.972   0.972   0.976   0.020
              S10-KS       0.934   1.000   0.975   0.975   0.977   0.018
              S1-KS+       0.857   1.000   0.972   0.972   0.979   0.025
              S5-KS+       0.925   1.000   0.978   0.978   0.984   0.018
              S10-KS+      0.939   1.000   0.980   0.980   0.985   0.016

2-out         U-Max        0.254   1.000   0.972   0.966   0.996   0.079
              S1-Max       0.652   1.000   0.987   0.986   0.999   0.036
              S5-Max       0.952   1.000   0.995   0.995   0.999   0.009
              S10-Max      0.968   1.000   0.996   0.996   1.000   0.007
              S1-KS        0.651   1.000   0.974   0.974   0.981   0.035
              S5-KS        0.945   1.000   0.982   0.982   0.984   0.013
              S10-KS       0.947   1.000   0.984   0.983   0.984   0.012
              S1-KS+       0.860   1.000   0.980   0.979   0.984   0.020
              S5-KS+       0.950   1.000   0.984   0.984   0.987   0.012
              S10-KS+      0.952   1.000   0.986   0.986   0.987   0.011

Table 4.5: For each matrix in the first dataset and each proposed heuristic, five runs are performed. The statistics are computed over the means of these results; the minimum, maximum, arithmetic and geometric averages, median, and standard deviation are reported.


[Figure 4.4: The matching qualities, sorted in increasing order, for Karp-Sipser on the original graph, S5-KS, and S5-KS+ on (a) 1-out and (b) 2-out graphs. The figures also contain the quality of the maximum cardinality matchings in these graphs. The experiments are performed on the 115 graphs in the first dataset. Plots omitted in this text rendering.]

algorithm for these subgraphs.

As Table 4.5 shows, more scaling iterations increase the maximum matching cardinalities in k-out graphs. Although this is much clearer when the worst-case graphs are considered, it can also be observed in the arithmetic and geometric means. Since U-Max is the no-scaling case, the impact of the first scaling iteration (S1-KS vs. U-Max) is significant. On the other hand, the difference in matching quality between S5-KS and S10-KS is minor. Hence, five scaling iterations are deemed sufficient for the proposed heuristics in practice.

As the theory suggests, the heuristics St-KS perform well for (1 + e^{-1})-out and 2-out graphs. With t ∈ {5, 10}, their quality is almost on par with Karp-Sipser on the original graph, and even better for 2-out graphs. In addition, applying Karp-Sipser on the subgraph of unmatched vertices to obtain a maximal matching does not increase the matching quality much. Since this subgraph is small, the overhead of this extra work will not be significant. Furthermore, this extra step significantly improves the matching quality for 1-out graphs, which a.a.s. do not have a perfect matching.

To better understand the practical performance of the proposed heuristics and the impact of the additional Karp-Sipser execution, we profile their performance by sorting their matching qualities in increasing order for all 115 matrices. Figure 4.4 plots these profiles on 1-out and 2-out graphs for the heuristics with five scaling iterations. As the first plot shows, five iterations are sufficient to obtain a matching quality of 0.86, except in 17 of the 1-out experiments. The figure also shows that the maximum matching cardinality in a random 1-out graph is worse than what Karp-Sipser can obtain on the original graph. This is why, although S5-KS finds maximum matchings on


[Figure 4.5: The cumulative distribution of the ratio of the number of odd cycles remaining after the first phase of Karp-Sipser in One-Out to the number of vertices in the graph. Plot omitted in this text rendering.]

1-out graphs, its performance is still worse than that of Karp-Sipser. The additional Karp-Sipser in S5-KS+ closes almost all of this gap and makes the matching qualities close to those of Karp-Sipser. On the contrary, for 2-out graphs generated with five scaling iterations, the maximum matching cardinality is more than the cardinality of the matchings found by Karp-Sipser. There is still a gap between the best possible (red line) and what Karp-Sipser can find (blue line) on 2-out graphs. We believe that this gap can be targeted by specialized, efficient, exact matching algorithms for 2-out graphs.

There are 30 rank-deficient matrices without total support among the 115 matrices in the first dataset. We observed that, even for 1-out graphs, the worst-case quality for S5-KS is 0.86 and the average is 0.93. Hence, the proposed approach also works well for rank-deficient matrices/graphs in practice.

Since the number c of odd cycles at the end of the first phase of Karp-Sipser is a performance measure, we investigate it. For each matrix, we compute the ratio c/n. We then plot the cumulative distribution of these values in Fig. 4.5. For all the experiments except one, this ratio is less than 1%. For the extreme case, the ratio increases to 8%. In that particular case, the matching found in the 1-out subgraph is maximum for the input graph (i.e., the number of odd components is also large in the original graph).

Hard instances for Karp-Sipser

Let A be an n × n symmetric matrix, V1 be the set of A's first n/2 vertices, and V2 be the set of A's last n/2 vertices. As Figure 4.6 shows, A has a full V1 × V1 block, ignoring the diagonal (i.e., a clique), and an empty V2 × V2 block. The last h ≪ n vertices of V1 are connected to all the vertices in the corresponding graph. The block V1 × V2 has a nonzero diagonal; hence, the corresponding graph has a perfect matching. Note that the same instances were used for the bipartite case.

On such a matrix with h = 0, Karp-Sipser consumes the whole graph during Phase 1 and finds a maximum cardinality matching. When h > 1,


[Figure 4.6: Adjacency matrices of hard undirected graph instances for Karp-Sipser, with vertex blocks V1 and V2 and the last h vertices of V1 fully connected. Drawing omitted in this text rendering.]

                                  One-Out
           Karp-       5 iterations       10 iterations      20 iterations
   h       Sipser      error  quality     error  quality     error  quality
   2       0.96        7.54   0.99        0.68   0.99        0.22   1.00
   8       0.78        8.52   0.97        0.78   0.99        0.23   0.99
  32       0.68        6.65   0.81        1.09   0.99        0.26   0.99
 128       0.63        3.32   0.53        1.89   0.90        0.33   0.98
 512       0.63        1.24   0.55        1.17   0.59        0.61   0.86

Table 4.6: Results for the hard instances with n = 5000 and different h values. One-Out is executed with 5, 10, and 20 scaling iterations, and the scaling errors are also reported. Averages of five runs are reported for each cell.

Phase 1 immediately ends, since there is no degree-one vertex. In Phase 2, the first edge (nonzero) consumed by Karp-Sipser is selected from a uniform distribution over the nonzeros whose corresponding rows and columns are still unmatched. Since the block V1 × V1 forms a clique, it is more likely that the nonzero will be chosen from this block. Thus, a vertex in V1 will be matched with another vertex from V1, which is a bad decision, since the block V2 × V2 is completely empty. Hence, we expect a decrease in the performance of Karp-Sipser as h increases. On the other hand, the probability that the proposed heuristics choose an edge from that block goes to zero, as those entries cannot be in a perfect matching.

Table 4.6 shows that the quality of Karp-Sipser drops to 0.63 as h increases. In comparison, the One-Out heuristic with five scaling iterations maintains a good quality for small h values. However, a better scaling with more iterations (10 and 20 for h = 128 and h = 512, respectively) is required to guarantee the desired matching quality; see the scaling errors in the table.

Experiments on large-scale graphs

These experiments are performed on a machine running 64-bit CentOS 6.5, which has 30 cores, each of which is an Intel Xeon CPU E7-4870 v2 core


                                          Karp-Sipser
Matrix              |V| (×10^6)   |E| (×10^6)   Quality   time (s)
cage15                   5.2          94.0        1.00       5.82
dielFilterV3real         1.1          88.2        0.99       3.36
hollywood-2009           1.1         112.8        0.93       4.18
nlpkkt240               28.0         746.5        0.98      52.95
rgg_n_2_24_s0           16.8         265.1        0.98      19.49

               One-Out-S5-KS+                               Two-Out-S5-KS+
               Execution time (seconds)                     Execution time (seconds)
Matrix  Quality  Scale  1-out    KS   KS+   Total   Quality  Scale  2-out     KS   KS+   Total
cage      0.93    0.67   0.85   0.65  0.05   2.21     0.99    0.67   1.48    1.78  0.01   3.94
diel      0.98    0.52   0.36   0.07  0.01   0.95     0.99    0.52   0.62    0.16  0.00   1.30
holly     0.91    1.12   0.45   0.06  0.02   1.65     0.95    1.12   0.76    0.13  0.01   2.01
nlpk      0.93    4.44   4.99   3.96  0.27  13.66     0.98    4.44   9.43   10.56  0.11  24.54
rgg       0.93    2.33   2.33   2.08  0.17   6.91     0.98    2.33   4.01    6.70  0.11  13.15

Table 4.7: Summary of the results with five large-scale matrices for the original Karp-Sipser and the proposed One-Out and Two-Out heuristics with five scaling iterations and post-processing for maximal matchings. Matrix names are shortened in the lower half.

operating at 2.30 GHz. To choose five large-scale matrices from SuiteSparse, we first sorted the pattern-symmetric matrices in the collection in decreasing order of their nonzero count. We then chose the five largest matrices from different families, to increase the variety of the experiments. The details of these matrices are given in Table 4.7. This table also contains the run times and matching qualities of the original Karp-Sipser and the proposed One-Out and Two-Out heuristics. The proposed heuristics use five scaling iterations and also apply Karp-Sipser at the end, to ensure maximality.

The run time of the proposed heuristics is analyzed in four stages in Table 4.7. The Scale stage scales the matrix with five iterations, the k-out stage chooses the neighbors and constructs the k-out subgraph, the KS stage applies Karp-Sipser on the k-out subgraph, and KS+ is the stage for the additional Karp-Sipser at the end. The total time of these four stages is also shown for each instance. The quality results are consistent with the results obtained on the first dataset. As the table shows, the proposed heuristics are faster than the original Karp-Sipser on the input graph. For 1-out graphs, the proposed approach is 2.5-3.9 times faster than Karp-Sipser on the original graph. The speedups are between 1.5 and 2.6 for 2-out graphs with five iterations.

4.4 Summary, further notes, and references

We investigated two heuristics for the bipartite maximum cardinality matching problem. The first one, OneSidedMatch, is shown to have an approximation guarantee of 1 − 1/e ≈ 0.632. The second heuristic, TwoSidedMatch, is shown to have an approximation guarantee of 0.866. Both algorithms use well-known methods to scale the sparse matrix associated with the given bipartite graph to a doubly stochastic form, whose entries are used as the probability density functions to randomly select a subset of the edges of the input graph. OneSidedMatch selects exactly n edges to construct a subgraph in which a matching of the guaranteed cardinality is identified with virtually no overhead, both in sequential and parallel execution. TwoSidedMatch selects around 2n edges and then runs the Karp-Sipser heuristic as an exact algorithm on the selected subgraph to obtain a matching of the conjectured cardinality. The subgraphs are analyzed to develop a specialized Karp-Sipser algorithm for efficient parallelization. All theoretical investigations are first performed assuming bipartite graphs with perfect matchings and that the scaling algorithms have converged. Then, theoretical arguments and experimental evidence are provided to extend the results to cover other cases and to validate the applicability and practicality of the proposed heuristics in general settings.

The proposed OneSidedMatch and TwoSidedMatch heuristics do not guarantee maximality of the matching found. One can visit the edges of the unmatched row vertices and match them to an available neighbor. We have carried out this greedy post-process and obtained the results shown in Figures 4.7a and 4.7b, in comparison with what was shown in Figures 4.2a and 4.2b. We see that the greedy post-process makes a significant difference for OneSidedMatch: the approximation ratios achieved are well beyond the bound 0.632 and are close to those of TwoSidedMatch. This observation attests to the efficiency of the whole approach.
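For illustration, a minimal sketch of this greedy post-processing on a CSR pattern matrix; the function name and the data layout (arrays of mates indexed by row and by column) are choices made here, not the exact implementation used in the experiments.

import scipy.sparse as sp

def make_maximal(A, match_row, match_col):
    """Visit unmatched rows and match each one to any free column neighbor,
    turning the matching produced by the heuristics into a maximal matching."""
    A = sp.csr_matrix(A)
    for i in range(A.shape[0]):
        if match_row[i] is not None:
            continue
        for j in A.indices[A.indptr[i]:A.indptr[i + 1]]:
            if match_col[j] is None:
                match_row[i], match_col[j] = j, i
                break
    return match_row, match_col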

We also extended the heuristics and their analysis to approximate maximum cardinality matchings in general (undirected) graphs. Since we have, in the general case, one class of vertices, the 1-out based approach corresponds to the TwoSidedMatch heuristic of the bipartite case. The theoretical analysis, which can be perceived as an analysis of Karp-Sipser on the random 1-out graph model, showed that the approximation guarantee is slightly less than 0.866. The losses with respect to the bipartite case are due to the existence of odd-length cycles. Our experiments suggest that one of the heuristics obtains results as good as Karp-Sipser while being faster and more reliable.

A more rigorous treatment and elaboration of the variants of the heuristics for general graphs seem worthwhile. Although Karp-Sipser works well for these graphs, we wonder if there are exact linear time algorithms for (1 + e^{-1})-out and 2-out graphs. We also plan to work on parallel implementations of these algorithms.


[Figure 4.7: Matching qualities obtained after making the matchings found by (a) OneSidedMatch and (b) TwoSidedMatch maximal. The horizontal lines are at 0.632 and 0.866, respectively, which are the approximation guarantees for the heuristics. The legends contain the number of scaling iterations (0, 1, 5). Compare with Figures 4.2a and 4.2b, where maximality of the matchings was not guaranteed. Plots omitted in this text rendering.]


Chapter 5

Birkhoff-von Neumann decomposition of doubly stochastic matrices

In this chapter, we investigate the Birkhoff-von Neumann (BvN) decomposition described in Section 1.4. The main problem that we address is restated below for convenience.

MinBvNDec: A BvN decomposition with the minimum number of permutation matrices. Given a doubly stochastic matrix A, find a BvN decomposition

A = α1 P1 + α2 P2 + · · · + αk Pk     (5.1)

of A with the minimum number k of permutation matrices.

We first show that the MinBvNDec problem is NP-hard. We then propose a heuristic for obtaining a BvN decomposition with a small number of permutation matrices. We investigate some of the properties of the heuristic theoretically and experimentally. In this chapter, we also give a generalization of the BvN decomposition for real matrices with possibly negative entries, and use this generalization for designing preconditioners for iterative linear system solvers.

5.1 The minimum number of permutation matrices

We show in this section that the decision version of the problem is NP-complete. We first give some definitions and preliminary results.



5.1.1 Preliminaries

A multi-set can contain duplicate members. Two multi-sets are equivalent if they have the same set of members with the same number of repetitions.

Let A and B be two n × n matrices. We write A ⊆ B to denote that, for each nonzero entry of A, the corresponding entry of B is nonzero. In particular, if P is an n × n permutation matrix and A is a nonnegative n × n matrix, P ⊆ A also means that the entries of A at the positions corresponding to the nonzeros of P are positive. We use P ∘ A to denote the entrywise product of P and A, which selects the entries of A at the positions corresponding to the nonzero entries of P. We also use min{P ∘ A} to denote the minimum entry of A at the nonzero positions of P.

Let U be a set of positions of the nonzeros of A. Then U is called strongly stable [31] if, for each permutation matrix P ⊆ A, p_kl = 1 for at most one pair (k, l) ∈ U.

Lemma 5.1 (Brualdi [32]) Let A be a doubly stochastic matrix. Then, in a BvN decomposition of A, there are at least γ(A) permutation matrices, where γ(A) is the maximum cardinality of a strongly stable set of positions of A.

Note that γ(A) is no smaller than the maximum number of nonzeros in a row or a column of A, for any matrix A. Brualdi [32] shows that for any integer t with 1 ≤ t ≤ ⌈n/2⌉⌈(n + 1)/2⌉, there exists an n × n doubly stochastic matrix A such that γ(A) = t.

An n × n circulant matrix C is defined as follows. The first row of C is specified as c_1, …, c_n, and the ith row is obtained from the (i − 1)th one by a cyclic rotation to the right, for i = 2, …, n:

C = \begin{pmatrix} c_1 & c_2 & \cdots & c_n \\ c_n & c_1 & \cdots & c_{n-1} \\ \vdots & & \ddots & \vdots \\ c_2 & c_3 & \cdots & c_1 \end{pmatrix}.

We state and prove an obvious lemma to be used later.

Lemma 5.2 Let C be an n × n positive circulant matrix whose first row is c_1, …, c_n. The matrix C′ = (1/∑_j c_j) C is doubly stochastic, and all BvN decompositions with n permutation matrices have the same multi-set of coefficients {c_i/∑_j c_j : i = 1, …, n}.

Proof. Since the first row of C′ has all nonzero entries and only n permutation matrices are permitted, the multi-set of coefficients must be the entries in the first row.


In the lemma, if c_i = c_j for some i ≠ j, we will have the same value c_i/c for two different permutation matrices. As a sample BvN decomposition, let c = ∑_i c_i and consider (1/c) C = ∑_j (c_j/c) D_j, where D_j is the permutation matrix corresponding to the (j − 1)th diagonal, for j = 1, …, n: the matrix D_j has 1s at the positions (i, i + j − 1) for i = 1, …, n − j + 1, and at the positions (n − j + 1 + k, k) for k = 1, …, j − 1, where we assume that the second set is void for j = 1.

Note that the lemma is not concerned with the uniqueness of the BvN decomposition, which would need a unique set of permutation matrices as well. If all the c_i were distinct, this would have been true, and the set of D_j described above would define the unique permutation matrices. Also, a more general variant of the lemma concerns the decompositions whose cardinality k is equal to the maximum cardinality of a strongly stable set. In this case too, the coefficients in the decomposition will correspond to the entries in a strongly stable set of cardinality k.

5.1.2 The computational complexity of MinBvNDec

Here we prove that the MinBvNDec problem is NP-complete in the strong sense, i.e., it remains NP-complete when the instance is represented in unary. It suffices to do a reduction from a strongly NP-complete problem to prove the strong NP-completeness [22, Section 6.6].

Theorem 5.1 The problem of deciding whether there is a Birkhoff-von Neumann decomposition of a given doubly stochastic matrix with k permutation matrices is strongly NP-complete.

Proof. It is clear that the problem belongs to NP, as it is easy to check in polynomial time that a given decomposition is equal to a given matrix.

To establish NP-completeness, we demonstrate a reduction from the well-known 3-PARTITION problem, which is NP-complete in the strong sense [93, p. 96]. Consider an instance of 3-PARTITION: given an array A of 3m positive integers and a positive integer B such that ∑_{i=1}^{3m} a_i = mB, where for all i it holds that B/4 < a_i < B/2, does there exist a partition of A into m disjoint arrays S_1, …, S_m such that each S_i has three elements whose sum is B? Let I_1 denote an instance of 3-PARTITION.

We build the following instance I_2 of MinBvNDec corresponding to I_1 given above. Let

M = \begin{pmatrix} \frac{1}{m} E_m & O \\ O & \frac{1}{mB} C \end{pmatrix},

where E_m is an m × m matrix whose entries are 1, and C is a 3m × 3m circulant matrix whose first row is a_1, …, a_{3m}. It is easy to see that M is doubly stochastic. A solution of I_2 is a BvN decomposition of M with k = 3m permutations. We will show that I_1 has a solution if and only if I_2 has a solution.


Assume that I_1 has a solution with S_1, …, S_m. Let S_i = {a_{i1}, a_{i2}, a_{i3}} and observe that (1/(mB))(a_{i1} + a_{i2} + a_{i3}) = 1/m. We identify three permutation matrices P_{id}, for d = 1, 2, 3, in C which contain a_{i1}, a_{i2}, and a_{i3}, respectively. We can write (1/m) E_m = ∑_{i=1}^{m} (1/m) D_i, where D_i is the permutation matrix corresponding to the (i − 1)th diagonal (described after the proof of Lemma 5.2). We prepend D_i to the three permutation matrices P_{id} for d = 1, 2, 3 and obtain three permutation matrices for M. We associate these three permutation matrices with α_{id} = a_{id}/(mB). Therefore, we can write

M = \sum_{i=1}^{m} \sum_{d=1}^{3} \alpha_{id} \begin{pmatrix} D_i & \\ & P_{id} \end{pmatrix}

and obtain a BvN decomposition of M with 3m permutation matrices.

Assume that I_2 has a BvN decomposition with 3m permutation matrices. This also defines two BvN decompositions with 3m permutation matrices, for (1/m) E_m and (1/(mB)) C. We now establish a correspondence between these two BvNs to finish the proof. Since (1/(mB)) C is a circulant matrix with 3m nonzeros in a row, any BvN decomposition of it with 3m permutation matrices has the coefficients a_i/(mB), for i = 1, …, 3m, by Lemma 5.2. Since a_i + a_j < B, we have (a_i + a_j)/(mB) < 1/m. Therefore, each entry in (1/m) E_m needs to be included in at least three permutation matrices. A total of 3m permutation matrices covers any row of (1/m) E_m, say the first one. Therefore, for each entry in this row we have 1/m = α_i + α_j + α_k for i ≠ j ≠ k, corresponding to three coefficients used in the BvN decomposition of (1/(mB)) C. Note that these three indices i, j, k defining the three coefficients used for one entry of E_m cannot be used for another entry in the same row. This correspondence defines a partition of the 3m numbers α_i, for i = 1, …, 3m, into m groups with three elements each, where each group has a sum of 1/m. The corresponding three a_i's in a group therefore sum up to B, and we have a solution to I_1, concluding the proof.

5.2 A result on the polytope of BvN decompositions

Let S(A) be the polytope of all BvN decompositions for a given doubly stochastic matrix A. The extreme points of S(A) are the ones that cannot be represented as a convex combination of other decompositions. Brualdi [32, pp. 197-198] observes that any generalized Birkhoff heuristic obtains an extreme point of S(A), and predicts that there are extreme points of S(A) which cannot be obtained by a generalized Birkhoff heuristic. In this section, we substantiate this claim by showing an example.

Any BvN decomposition of a given matrix A with the smallest number of permutation matrices is an extreme point of S(A); otherwise, the other BvN decompositions expressing the said point would have a smaller number of permutation matrices.

Lemma 5.3. There are doubly stochastic matrices whose polytopes of BvN decompositions contain extreme points that cannot be obtained by a generalized Birkhoff heuristic.

We will prove the lemma by giving an example. We use computational tools based on a mixed integer linear programming (MILP) formulation of the problem of finding a BvN decomposition with the smallest number of permutation matrices. We first describe the MILP formulation.

Let A be a given n × n doubly stochastic matrix and Ω_n be the set of all n × n permutation matrices. There are n! matrices in Ω_n; for brevity, let us refer to these permutations as P_1, ..., P_{n!}. We associate an incidence matrix M of size n² × n! with Ω_n. We fix an ordering of the entries of A so that each row of M corresponds to a unique entry in A. Each column of M corresponds to a unique permutation matrix in Ω_n. We set m_{ij} = 1 if the ith entry of A appears in the permutation matrix P_j, and set m_{ij} = 0 otherwise. Let a be the n²-vector containing the values of the entries of A in the fixed order. Let x be a vector of n! elements, x_j corresponding to the permutation matrix P_j. With these definitions, an MILP formulation for finding a BvN decomposition of A with the smallest number of permutation matrices can be written as

minimize    ∑_{j=1}^{n!} s_j                              (5.2)
subject to  M x = a                                       (5.3)
            1 ≥ x_j ≥ 0        for j = 1, ..., n!         (5.4)
            ∑_{j=1}^{n!} x_j = 1                          (5.5)
            s_j ≥ x_j          for j = 1, ..., n!         (5.6)
            s_j ∈ {0, 1}       for j = 1, ..., n!         (5.7)

In this MILP, s_j is a binary variable which is 1 only if x_j > 0, and 0 otherwise. The equality (5.3), the inequalities (5.4), and the equality (5.5) guarantee that we have a BvN decomposition of A. This MILP can be used only for small problems. In this MILP, we can exclude any permutation matrix P_j from a decomposition by setting s_j = 0.
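For illustration, the MILP (5.2)-(5.7) can be expressed with the HiGHS-based solver bundled in SciPy. The sketch below is only usable for tiny n, since it enumerates all n! permutation matrices explicitly; the function name min_bvn_milp is ours.

import itertools
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def min_bvn_milp(A):
    # Enumerate all n! permutations and build the incidence matrix M:
    # one row per entry of A (row-major order), one column per permutation.
    n = A.shape[0]
    perms = list(itertools.permutations(range(n)))
    nf = len(perms)                                  # n!
    M = np.zeros((n * n, nf))
    for j, p in enumerate(perms):
        for i, col in enumerate(p):
            M[i * n + col, j] = 1.0
    a = A.reshape(-1)
    # Variables are [x_1..x_{n!}, s_1..s_{n!}]; the objective counts the s_j.
    c = np.concatenate([np.zeros(nf), np.ones(nf)])
    # Equalities (5.3) and (5.5): M x = a and sum_j x_j = 1.
    A_eq = np.hstack([np.vstack([M, np.ones((1, nf))]), np.zeros((n * n + 1, nf))])
    b_eq = np.concatenate([a, [1.0]])
    eq = LinearConstraint(A_eq, b_eq, b_eq)
    # Inequality (5.6) written as x_j - s_j <= 0.
    le = LinearConstraint(np.hstack([np.eye(nf), -np.eye(nf)]), -np.inf, 0.0)
    integrality = np.concatenate([np.zeros(nf), np.ones(nf)])  # s_j integral
    res = milp(c=c, constraints=[eq, le], integrality=integrality,
               bounds=Bounds(0.0, 1.0))
    x = res.x[:nf]
    return [(x[j], perms[j]) for j in range(nf) if x[j] > 1e-9]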

Proof of Lemma 5.3. Let the following 10 letters correspond to the numbers underneath:

a    b    c    d    e     f     g     h      i      j
1    2    4    8    16    32    64    128    256    512


Consider the following matrix, whose row sums and column sums are 1023, hence it can be considered as doubly stochastic:

A = [ a+b   d+i   c+h   e+j   f+g
      e+g   a+c   b+i   d+f   h+j
      f+j   e+h   d+g   b+c   a+i
      d+h   b+f   a+j   g+i   c+e
      c+i   g+j   e+f   a+h   b+d ]                    (5.8)

Observe that the entries containing the term 2^ℓ form a permutation matrix, for ℓ = 0, ..., 9. Therefore, this matrix has a BvN decomposition with 10 permutation matrices. We created the MILP above and found 10 as the smallest number of permutation matrices by calling the CPLEX solver [1] via the NEOS Server [58, 68, 100]. Hence, the described decomposition is an extreme point of S(A). None of the said permutation matrices annihilates any entry of the matrix. Therefore, at the first step, no entry of A gets reduced to zero, regardless of the order of the permutation matrices. Thus, this decomposition cannot be obtained by a generalized Birkhoff heuristic.

One can create a family of matrices with arbitrary sizes by embedding the same matrix into a larger one of the form

B = [ 1023 · I   O
      O          A ]

where I is the identity matrix of the desired size. All BvN decompositions of B can be obtained by extending the permutation matrices in A's BvN decompositions with I. That is, for a permutation matrix P in a BvN decomposition of A, the matrix [ I  O ; O  P ] is a permutation matrix in the corresponding BvN decomposition of B. Furthermore, all permutation matrices in an arbitrary BvN decomposition of B must have I as the principal submatrix, and the rest should correspond to a permutation matrix in A defining a BvN decomposition for A. Hence, the extreme point of S(B) corresponding to the extreme point of S(A) with 10 permutation matrices cannot be found by a generalized Birkhoff heuristic.

Let us investigate the matrix A and its BvN decomposition given in the proof of Lemma 5.3. Let P_a, P_b, ..., P_j be the 10 permutation matrices of the decomposition, corresponding to a, b, ..., j. We solve 10 MILPs in which we set s_t = 0 for one t ∈ {a, b, ..., j}. This way, we try to find a BvN decomposition of A without P_a, without P_b, and so on, always with the smallest number of permutation matrices. The smallest number of permutation matrices in these 10 MILPs was 11. This certifies that the only BvN decomposition with 10 permutation matrices necessarily contains P_a, P_b, ..., P_j. It is easy to see that there is a unique solution to the equality M x = a of


A = [   3   264   132   528    96
       80     5   258    40   640
      544   144    72     6   257
      136    34   513   320    20
      260   576    48   129    10 ]

(a) The sample matrix

coefficient:  129  511  257   63   33   15    7    3    2    2    1
row 1:          3    4    2    5    5    4    2    3    4    1    1
row 2:          5    5    3    1    4    1    4    2    1    2    3
row 3:          2    1    5    3    1    2    3    4    3    4    4
row 4:          1    3    4    4    2    5    1    5    5    3    2
row 5:          4    2    1    2    3    3    5    1    2    5    5

(b) A BvN decomposition

Figure 5.1: The matrix A from Lemma 5.3 and a BvN decomposition with 11 permutation matrices, which can be obtained by a generalized Birkhoff heuristic. Each column in (b) corresponds to a permutation, where the first line gives the associated coefficient and the following lines give the column indices matched to rows 1 to 5 of A.

the MILP with x_t = 0 for t ∉ {a, b, ..., j}, as the submatrix of M containing only the corresponding 10 columns has full column rank.

Any generalized Birkhoff heuristic obtains at least 11 permutation matrices for the matrix A of the proof of Lemma 5.3. One such decomposition is shown in a tabular format in Fig. 5.1. In Fig. 5.1a, we write the matrix of the lemma explicitly for convenience. Then, in Fig. 5.1b, we give a BvN decomposition. The column headers (the first line) in the table contain the coefficients of the permutation matrices. The nonzero column indices of the permutation matrices are stored by rows. For example, the first permutation has the coefficient 129, and columns 3, 5, 2, 1, and 4 are matched to rows 1-5. The bold indices in the original figure signify entries whose values are equal to the coefficients of the corresponding permutation matrices at the time the permutation matrices are found. Therefore, the corresponding entries become zero after the corresponding step. For example, in the first permutation, a_{54} = 129.

The output of GreedyBvN for the matrix A is given in Fig. 5.2 for reference. It contains 12 permutation matrices.

5.3 Two heuristics

There are heuristics to compute a BvN decomposition for a given matrix A. In particular, the following family of heuristics is based on the constructive proof of Birkhoff. Let A^(0) = A. At every step j ≥ 1, find a permutation matrix P_j having its ones at the positions of nonzero elements of A^(j−1), use the minimum nonzero element of A^(j−1) at the positions identified by P_j as α_j, set A^(j) = A^(j−1) − α_j P_j, and repeat the computations in the next step j + 1, until A^(j) becomes void. Any heuristic of this type is called a generalized Birkhoff heuristic.


Figure 5.2: The output of GreedyBvN for the matrix given in the proof of Lemma 5.3: a decomposition with 12 permutation matrices, whose coefficients are 513, 257, 127, 63, 31, 15, 7, 3, 2, 2, 2, and 1.

Given an n × n matrix A, we can associate a bipartite graph G_A = (R ∪ C, E) with it. The rows of A correspond to the vertices in the set R, and the columns of A correspond to the vertices in the set C, so that (r_i, c_j) ∈ E iff a_{ij} ≠ 0. A perfect matching in G_A, or G in short, is a set of n edges, no two sharing a row or a column vertex. Therefore, a perfect matching M in G defines a unique permutation matrix P_M ⊆ A.

Birkhoff's heuristic. This is the original heuristic used in proving that a doubly stochastic matrix has a BvN decomposition, described for example by Brualdi [32]. Let a be the smallest nonzero of A^(j−1) and G^(j−1) be the bipartite graph associated with A^(j−1). Find a perfect matching M in G^(j−1) containing a; set α_j ← a and P_j ← P_M.

GreedyBvN heuristic. At every step j, among all perfect matchings in G^(j−1), find one whose minimum element is the maximum. That is, find a perfect matching M in G^(j−1) where the minimum entry of A^(j−1) at the positions of P_M is the maximum. This "bottleneck" matching problem is polynomial-time solvable with, for example, MC64 [72]. In this greedy approach, α_j is the largest amount we can subtract from a row and a column of A^(j−1), and we hope to obtain a small k.

Algorithm 11: GreedyBvN: A greedy heuristic for constructing a BvN decomposition

Data: A, a doubly stochastic matrix
Result: a BvN decomposition
1: k ← 0
2: while nnz(A) > 0 do
3:   k ← k + 1
4:   P_k ← the pattern of a bottleneck perfect matching M in A
5:   α_k ← the minimum entry of A at the positions of P_k
6:   A ← A − α_k P_k
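A Python sketch of GreedyBvN is given below. The bottleneck perfect matching is computed here by a binary search over the distinct values combined with SciPy's maximum_bipartite_matching, rather than with MC64 as in the text; the function names are ours.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

def bottleneck_perfect_matching(A):
    # Largest threshold t such that the entries >= t still contain a
    # perfect matching; assumes A is (nearly) doubly stochastic so that
    # its nonzero pattern has total support.
    n = A.shape[0]
    vals = np.unique(A[A > 0])
    lo, hi, best = 0, len(vals) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        mask = csr_matrix((A >= vals[mid]).astype(int))
        match = maximum_bipartite_matching(mask, perm_type='column')
        if (match >= 0).all():       # perfect matching exists at this threshold
            best, lo = match, mid + 1
        else:
            hi = mid - 1
    return best                      # best[i] = column matched to row i

def greedy_bvn(A, tol=1e-12):
    # Repeatedly peel off the bottleneck matching, as in Algorithm 11.
    A = A.astype(float).copy()
    coeffs, perms = [], []
    while A.max() > tol:
        match = bottleneck_perfect_matching(A)
        alpha = A[np.arange(A.shape[0]), match].min()
        P = np.zeros_like(A)
        P[np.arange(A.shape[0]), match] = 1.0
        coeffs.append(alpha)
        perms.append(P)
        A -= alpha * P
        A[A < tol] = 0.0             # clean up round-off
    return coeffs, perms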


A^(0) = [ 1 4 1 0 0 0
          0 1 4 1 0 0
          0 0 1 4 1 0
          0 0 0 1 4 1
          1 0 0 0 1 4
          4 1 0 0 0 1 ]

The figure also shows the intermediate matrices A^(1), ..., A^(5), obtained as the Birkhoff heuristic subtracts, at each step, a permutation matrix whose minimum entry is 1.

Figure 5.3: A sample matrix showing that the original Birkhoff heuristic can obtain a BvN decomposition with n permutation matrices while the optimal one has 3.

Analysis of the two heuristics

As we will see in Section 5.4, GreedyBvN can obtain a much smaller number of permutation matrices compared to the Birkhoff heuristic. Here we compare these two heuristics theoretically. First, we show that the original Birkhoff heuristic does not have any constant-ratio approximation guarantee. Furthermore, for an n × n matrix, its worst-case approximation ratio is Ω(n).

We begin with a small example, shown in Fig. 5.3. We decompose a 6 × 6 matrix A^(0) which has an optimal BvN decomposition with three permutation matrices: the main diagonal, the one containing the entries equal to 4, and the one containing the remaining entries. For simplicity, we used integer values in our example; however, since the row and column sums of A^(0) are equal to 6, it can be converted to a doubly stochastic matrix by dividing all the entries by 6. Instead of the optimal decomposition, in the figure we obtain the permutation matrices as the original Birkhoff heuristic does: each removed permutation contains the minimum possible value, 1.

In the following, we show how to generalize the idea to obtain matrices of arbitrarily large size with three permutation matrices in an optimal decomposition, while the original Birkhoff heuristic obtains n permutation matrices.

Lemma 5.4. The worst-case approximation ratio of the Birkhoff heuristic is Ω(n).

Proof. For any given integer n ≥ 3, we show that there is a matrix of size n × n whose optimal BvN decomposition has 3 permutation matrices, whereas the Birkhoff heuristic obtains a BvN decomposition with exactly n permutation matrices. The example in Fig. 5.3 is a special case, for n = 6, of the following construction process.

Let f: {1, ..., n} → {1, ..., n} be the function f(x) = (x mod n) + 1. Given a matrix M, let M′ = F(M) be another matrix containing the same set of entries, where the function f(·) is used on the coordinate indices to redistribute the entries of M on M′. That is, m_{ij} = m′_{f(i),f(j)}. Since f(·) is one-to-one and onto, if M is a permutation matrix, then F(M) is also a permutation matrix. We will start with a permutation matrix and run it through F for n − 1 times to obtain n permutation matrices, which are all different. By adding these permutation matrices, we will obtain a matrix A whose optimal BvN decomposition has three permutation matrices, while the n permutation matrices used to create A correspond to a decomposition that can be obtained by the Birkhoff heuristic.

Let P_1 be the permutation matrix whose ones, which are partitioned into three sets, are at the positions

1st set: (1, 1);   2nd set: (n, 2);   3rd set: (2, 3), (3, 4), ..., (n − 1, n).          (5.9)

Let us use F(·) to generate a matrix sequence P_i = F(P_{i−1}) for 2 ≤ i ≤ n. For example, P_2's nonzeros are at the positions

1st set: (2, 2);   2nd set: (1, 3);   3rd set: (3, 4), (4, 5), ..., (n, 1).

We then add the P_i's to build the matrix

A = P_1 + P_2 + ··· + P_n.

We have the following observations about the nonzero elements of A

1. a_{ii} = 1 for all i = 1, ..., n, and only P_i has a one at the position (i, i). These elements are from the first set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(1).

2. a_{ij} = 1 for all i = 1, ..., n and j = ((i + 1) mod n) + 1, and only P_h, where h = (i mod n) + 1, has a one at the position (i, j). These elements are from the second set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(2).


3. a_{ij} = n − 2 for all i = 1, ..., n and j = (i mod n) + 1, where all P_ℓ for ℓ ∈ {1, ..., n} \ {i, j} have a one at the position (i, j). These elements are from the third set of positions of the permutation matrices, as identified in (5.9). When put together, these n entries form a permutation matrix P^(3) multiplied by the scalar (n − 2).

In other words, we can write

A = P^(1) + P^(2) + (n − 2) · P^(3),

and see that A has a BvN decomposition with three permutation matrices. We note that each row and column of A contains three nonzeros, and hence three is the smallest number of permutation matrices in a BvN decomposition of A.

Since the minimum element in A is 1, and each P_i contains one such element, the Birkhoff heuristic can obtain a decomposition using P_i, for i = 1, ..., n. Therefore, the Birkhoff heuristic's approximation ratio is no better than n/3, which can be made arbitrarily large.
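A small sketch that builds the matrix A = P_1 + ··· + P_n of the proof; for n = 6 it reproduces A^(0) of Fig. 5.3. The map f applied to both coordinates corresponds to rolling the permutation matrix by one position along each axis. The function name is ours.

import numpy as np

def birkhoff_worst_case(n):
    # Start from P_1 with ones at (1,1), (n,2), and (2,3), (3,4), ..., (n-1,n)
    # (0-based below), then sum P_1 and its n-1 images under F.
    P = np.zeros((n, n), dtype=int)
    P[0, 0] = 1                       # 1st set: (1, 1)
    P[n - 1, 1] = 1                   # 2nd set: (n, 2)
    for i in range(1, n - 1):         # 3rd set: (2, 3), ..., (n-1, n)
        P[i, i + 1] = 1
    A = np.zeros((n, n), dtype=int)
    for _ in range(n):
        A += P
        # F shifts both row and column indices cyclically by one.
        P = np.roll(np.roll(P, 1, axis=0), 1, axis=1)
    return A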

We note that GreedyBvN will optimally decompose the matrix A used in the proof above.

We now analyze the theoretical properties of GreedyBvN. We first give a lemma identifying a pattern in its output.

Lemma 5.5. The heuristic GreedyBvN obtains α_1, ..., α_k in non-increasing order, α_1 ≥ ··· ≥ α_k, where α_j is obtained at the jth step.

We observed through a series of experiments that the performance of GreedyBvN depends on the values in the matrix, and there seems to be no constant factor approximation [74].

We create a set of n × n matrices as follows. We first fix a set of z permutation matrices C_1, ..., C_z of size n × n. These permutation matrices, with varying values of α, are used to generate the matrices. The matrices are parameterized by the subscript i, and each A_i is created as A_i = α_1 · C_1 + α_2 · C_2 + ··· + α_z · C_z, where each α_j, for j = 1, ..., z, is a randomly chosen integer in the range [1, 2^i], and we also set one randomly chosen α_j equal to 2^i to guarantee the existence of at least one large value, even in the unlikely case that all other values are not large enough. As can be seen, each A_i has the same structure and differs from the rest only in the values of the α_j's that are chosen. As a consequence, they all can be decomposed by the same set of permutation matrices.

We present our results in two sets of experiments, shown in Table 5.1, for two different n. We create five random A_i for each i ∈ {10, 20, 30, 40, 50}; that is, we have five matrices with the parameter i, and there are five different i. We have n = 30 and z = 20 in the first set, and n = 200 and z = 100 in the second set. Let k_i be the number of permutation matrices GreedyBvN


n = 30 and z = 20                      n = 200 and z = 100
       average        worst case              average        worst case
  i    ki     ki/z    ki     ki/z       i     ki     ki/z    ki      ki/z
 10    59     2.99    63     3.15      10    268     2.69    280     2.80
 20   105     5.29   110     5.50      20    487     4.88    499     4.99
 30   149     7.46   158     7.90      30    716     7.16    726     7.26
 40   184     9.23   191     9.55      40    932     9.33    947     9.47
 50   212    10.62   227    11.35      50   1124    11.25   1162    11.62

Table 5.1: Experiments showing the dependence of the performance of GreedyBvN on the values of the matrix elements. Here n is the matrix size, i ∈ {10, 20, 30, 40, 50} is the parameter for creating matrices using α_j ∈ [1, 2^i], and z is the number of permutation matrices used in creating A_i. There are five experiments for a given pair of n and i, and GreedyBvN obtains k_i permutation matrices. The average and the maximum number of permutation matrices obtained by GreedyBvN over the five random instances are given. The ratio k_i/z is a lower bound on the performance of GreedyBvN, as z ≥ Opt.

obtains for a given A_i. The table gives the average and the maximum k_i of the five different A_i for a given n and i pair. By construction, each A_i has a BvN decomposition with z permutation matrices. Since z is no smaller than the optimal value, the ratio k_i/z gives a lower bound on the performance of GreedyBvN. As seen from the experiments, as i increases, the performance of GreedyBvN gets increasingly worse. This shows that a constant-ratio worst-case approximation of GreedyBvN is unlikely. While the performance depends on z (for example, for small z, GreedyBvN is likely to obtain near-optimal decompositions), it seems that the size of the matrix does not largely affect the relative performance of GreedyBvN.

Now we attempt to explain the above results theoretically

Lemma 5.6. Let α_1 P_1 + ··· + α_k P_k be a BvN decomposition of a given doubly stochastic matrix A with the smallest number k of permutation matrices. Then, for any BvN decomposition of A with ℓ ≥ k permutation matrices and coefficients α′_1, ..., α′_ℓ, we have ℓ ≤ k · (max_i α_i) / (min_i α′_i). If the coefficients are integers (e.g., when A is a matrix with constant row and column sums of integral values), we have ℓ ≤ k · max_i α_i.

Proof. Consider a BvN decomposition α′_1 P′_1 + ··· + α′_ℓ P′_ℓ with ℓ ≥ k. Assume, without loss of generality, that α_1 ≥ ··· ≥ α_k and α′_1 ≥ ··· ≥ α′_ℓ. We know that the coefficients of these two decompositions sum up to the same value; that is,

∑_{i=1}^ℓ α′_i = ∑_{i=1}^k α_i .


Since α′_ℓ is the smallest of the α′_i and α_1 is the largest of the α_i, we have

ℓ · α′_ℓ ≤ k · α_1 ,

and hence

ℓ/k ≤ α_1 / α′_ℓ .

Assuming integer values, we see that α′_ℓ ≥ 1, and thus

ℓ ≤ k · max_i α_i .

This lemma evaluates the approximation guarantee of a given BvN decomposition. It does not seem very useful on its own, because even if we have min_i α′_i, we do not have max_i α_i. Luckily, we can say more in the case of GreedyBvN.

Corollary 5.1. Let k be the smallest number of permutation matrices in a BvN decomposition of a given doubly stochastic matrix A. Let α_1 and α_ℓ be the first and the last coefficients obtained by the heuristic GreedyBvN for decomposing A. Then ℓ ≤ k · α_1/α_ℓ.

Proof. This is easy to see, as GreedyBvN obtains the coefficients in non-increasing order (see Lemma 5.5), and α_1 ≥ α_j for all 1 ≤ j ≤ k, for any BvN decomposition containing α_j.

Lemma 5.6 and Corollary 5.1 give a posteriori estimates of the performance of the heuristic GreedyBvN, in that one looks at the decomposition and tells how good it is. This can potentially reveal a good performance: for example, when GreedyBvN obtains a BvN decomposition with all coefficients equal, then we know that it is an optimal one. The same cannot be said for the Birkhoff heuristic (consider the example preceding Lemma 5.4). We also note that the ratio given in Corollary 5.1 will usually be much larger than the practical performance.

5.4 Experiments

We present results with the two heuristics. We note that the original Birkhoff heuristic was not concerned with the minimality of the number of permutation matrices; therefore, the presented results are not for comparing the two heuristics, but for giving results with what is available. We give results on two different sets of matrices. The first set contains real-world sparse matrices preprocessed to be doubly stochastic. The second set contains a few randomly created dense doubly stochastic matrices.


                                            Birkhoff          GreedyBvN
matrix        n      τ       dmax  dev       ∑αi     k        ∑αi     k
aft01        8205   125567     21  0.0e+00   0.160   2000     1.000   120
aft02        8184   127762     82  1.0e-06   0.000   1434     1.000   234
barth        6691    46187     13  0.0e+00   0.160   2000     1.000    71
barth4       6019    40965     13  0.0e+00   0.140   2000     1.000    61
bcspwr10     5300    21842     14  1.0e-06   0.380   2000     1.000    63
bcsstk38     8032   355460    614  0.0e+00   0.000   2000     1.000   592*
benzene      8219   242669     37  0.0e+00   0.000   2000     1.000   113
c-29         5033    43731    481  1.0e-06   0.000   2000     1.000   870
EX5          6545   295680     48  0.0e+00   0.020   2000     1.000   229
EX6          6545   295680     48  1.0e-06   0.030   2000     1.000   226
flowmeter0   9669    67391     11  0.0e+00   0.510   2000     1.000    58
fv1          9604    85264      9  0.0e+00   0.620   2000     1.000    50
fv2          9801    87025      9  0.0e+00   0.620   2000     1.000    52
fxm3_6       5026    94026    129  1.0e-06   0.130   2000     1.000   383
g3rmt3m3     5357   207695     48  1.0e-06   0.050   2000     1.000   223
Kuu          7102   340200     98  0.0e+00   0.000   2000     1.000   330
mplate       5962   142190     36  0.0e+00   0.030   2000     1.000   153
n3c6-b7      6435    51480      8  0.0e+00   1.000      8     1.000     8
nemeth02     9506   394808     52  0.0e+00   0.000   2000     1.000   109
nemeth03     9506   394808     52  0.0e+00   0.000   2000     1.000   115
olm5000      5000    19996      6  1.0e-06   0.750    283     1.000    14
s1rmq4m1     5489   262411     54  0.0e+00   0.000   2000     1.000   211
s2rmq4m1     5489   263351     54  0.0e+00   0.000   2000     1.000   208
SiH4         5041   171903    205  0.0e+00   0.000   2000     1.000   574
t2d_q4       9801    87025      9  2.0e-06   0.500   2000     1.000    54

Table 5.2: Birkhoff's heuristic and GreedyBvN on sparse matrices from the UFL collection. The column τ contains the number of nonzeros in a matrix. The column dmax contains the maximum number of nonzeros in a row or a column, setting up a lower bound for the number k of permutation matrices in a BvN decomposition. The column "dev" contains the maximum deviation of a row/column sum of a matrix A from 1, in other words the value max{‖A1 − 1‖_∞, ‖1^T A − 1^T‖_∞}. The two heuristics are run to obtain at most 2000 permutation matrices, or until they accumulate a sum of at least 0.9999 with the coefficients. In one matrix (marked with *), GreedyBvN obtained this number in less than dmax permutation matrices; increasing the limit to 0.999999 made it return with 908 permutation matrices.


The first set of matrices was created as follows. We selected all matrices with the following properties from the SuiteSparse Matrix Collection (UFL) [59]: square, with at least 5000 and at most 10000 rows, fully indecomposable, and with at most 50 and at least 2 nonzeros per row. This gave a set of 66 matrices. These 66 matrices are from 30 different groups of problems; we chose at most two per group to remove any bias that might arise from the group. This resulted in 32 matrices, which we preprocessed as follows to obtain a set of doubly stochastic matrices. We first took the absolute values of the entries to make the matrices nonnegative. Then, we scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of 10^{-6} and at most 1000 iterations. With this setting, the scaling algorithm tries to get the maximum deviation of a row/column sum from 1 to be less than 10^{-6} within 1000 iterations. In 7 matrices, the deviation was larger than 10^{-4}. We deemed this too big a deviation from a doubly stochastic matrix and discarded those matrices. At the end, we had 25 matrices, given in Table 5.2.
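For illustration, a plain Sinkhorn-Knopp iteration can serve as a simple stand-in for the scaling algorithm of Knight et al. [141] used in the experiments; the sketch below (our naming) alternately normalizes the column and row sums of a nonnegative matrix whose pattern has total support.

import numpy as np

def sinkhorn_knopp(B, tol=1e-6, max_iter=1000):
    # Returns r, c such that diag(r) @ B @ diag(c) is nearly doubly stochastic.
    n = B.shape[0]
    r = np.ones(n)
    c = np.ones(n)
    for _ in range(max_iter):
        c = 1.0 / (B.T @ r)           # fix column sums
        r = 1.0 / (B @ c)             # fix row sums
        S = (B * r[:, None]) * c[None, :]
        err = max(np.abs(S.sum(axis=0) - 1).max(),
                  np.abs(S.sum(axis=1) - 1).max())
        if err < tol:
            break
    return r, c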

We ran the two heuristics for the BvN decomposition to obtain at most 2000 permutation matrices, or until they accumulated a sum of at least 0.9999 with the coefficients. The accumulated sum of coefficients ∑_{i=1}^k α_i and the number of permutation matrices k for the Birkhoff and GreedyBvN heuristics are given in Table 5.2. As seen in this table, GreedyBvN obtains a much smaller number of permutation matrices than Birkhoff's heuristic, except for the matrix n3c6-b7. This is a special matrix with eight nonzeros in each row and column, all of which are 1/8; both heuristics find the same (minimum) number of permutation matrices. We note that the geometric mean of the ratios k/dmax is 3.4 for GreedyBvN.

The second set of matrices was created as follows. For n ∈ {100, 200, 300}, we created five matrices of size n × n whose entries are randomly chosen integers between 1 and 100 (we performed tests with integers between 1 and 20, and the results were close to what we present here). We then scaled them to be doubly stochastic using the algorithm of Knight et al. [141] with a tolerance of 10^{-6} and at most 1000 iterations. We ran the two heuristics for the BvN decomposition such that at most n² − 2n + 2 permutation matrices are found, or until the total value of the coefficients is larger than 0.9999. We then took the average over the five instances with the same n. The results are in Table 5.3. In this table, again we see that GreedyBvN obtains a much smaller k than Birkhoff's heuristic.

The experimental observation that GreedyBvN performs better than the Birkhoff heuristic is not surprising, for two reasons. First, as stated before, the original Birkhoff heuristic is not concerned with the number of permutation matrices. Second, the heuristic GreedyBvN is an adaptation of the Birkhoff heuristic designed to achieve large reductions at each step.


        Birkhoff             GreedyBvN
  n     ∑αi       k          ∑αi      k
 100    0.99     9644        1.00     388
 200    0.99    39208        1.00     717
 300    1.00    88759        1.00    1042

Table 5.3: Birkhoff's heuristic and GreedyBvN on random dense matrices. The maximum deviation of a row/column sum of a matrix A from 1, that is max{‖A1 − 1‖_∞, ‖1^T A − 1^T‖_∞}, was always less than 10^{-6}. For each n, the results are the averages over five different instances.

5.5 A generalization of the Birkhoff-von Neumann theorem

We have exploited the BvN decomposition in the context of preconditioning for solving linear systems [23]. As the decomposition is defined for nonnegative matrices, we needed a generalization of it to cover all real matrices. To motivate the generalized decomposition theorem, we give a brief summary of the preconditioner.

5.5.1 Doubly stochastic splitting

For a given nonnegative matrix A, we first preprocess it to get a doubly stochastic matrix (whose row and column sums are one). Then, using this doubly stochastic matrix, we select some fraction of some of the nonzeros of A to be included in the preconditioner. The selection is done by taking a set of permutation matrices, along with their coefficients, from a BvN decomposition of the given matrix.

Assume we have a BvN decomposition of A as shown in (5.1). Then, one can pick an integer r between 1 and k − 1 and split A as

A = M − N,                                               (5.10)

where

M = α_1 P_1 + ··· + α_r P_r,   N = −α_{r+1} P_{r+1} − ··· − α_k P_k.     (5.11)

Note that M and −N are doubly substochastic matrices.

Definition 5.1. A splitting of the form (5.10), with M and N given by (5.11), is said to be a doubly substochastic splitting.

Definition 5.2. A doubly substochastic splitting A = M − N of a doubly stochastic matrix A is said to be standard if M is invertible. We will call such a splitting an SDS splitting.


In general, it is not easy to guarantee that a given doubly substochastic splitting is standard, except for some trivial situations, such as the case r = 1, in which M is always invertible. We also have a characterization of invertible M when r = 2.

Theorem 5.2. A sufficient condition for M = ∑_{i=1}^r α_i P_i to be invertible is that one of the α_i, with 1 ≤ i ≤ r, be greater than the sum of the remaining ones.

Let A = M − N be an SDS splitting of A and consider the stationary iterative scheme

x_{k+1} = H x_k + c,   H = M^{-1} N,   c = M^{-1} b,        (5.12)

where k = 0, 1, ..., and x_0 is arbitrary. As seen in (5.12), M^{-1}, or solvers for My = z, are required. As is well known, the scheme (5.12) converges to the solution of Ax = b for any x_0 if and only if ρ(H) < 1. Hence, we are interested in conditions that guarantee that the spectral radius of the iteration matrix

H = M^{-1} N = −(α_1 P_1 + ··· + α_r P_r)^{-1} (α_{r+1} P_{r+1} + ··· + α_k P_k)

is strictly less than one. In general, this problem appears to be difficult. We have a necessary condition (Theorem 5.3) and a sufficient condition (Theorem 5.4), which is simple but restrictive.

Theorem 5.3. For the splitting A = M − N, with M = ∑_{i=1}^r α_i P_i and N = −∑_{i=r+1}^k α_i P_i, to be convergent, it must hold that ∑_{i=1}^r α_i > ∑_{i=r+1}^k α_i.

Theorem 5.4. Suppose that one of the α_i appearing in M is greater than the sum of all the other α_i. Then ρ(M^{-1}N) < 1, and the stationary iterative method (5.12) converges for all x_0 to the unique solution of Ax = b.

We now discuss how to build preconditioners meeting the sufficiency condition. Since M itself is doubly stochastic (up to a scalar ratio), we can apply the splitting recursively on M and obtain a special solver. Our motivation is that the preconditioner M^{-1} can be applied to vectors via a number of highly concurrent steps, where the number of steps is controlled by the user. Therefore, the preconditioners (or the splittings) can be advantageous for use in many-core computing systems. In the context of splittings, the application of N to vectors also enjoys the same property. These motivations are shared by recent work on ILU preconditioners, where their fine-grained computation [48] and approximate application [12] are investigated for GPU-like systems.


Preconditioner construction

It is desirable to have a small number k in the Birkhoff-von Neumann decomposition while designing the preconditioner; this was our motivation to investigate the MinBvNDec problem. This is because, if we use splitting, then k determines the number of steps in which we compute the matrix-vector products. If we do not use all k permutation matrices, having a few with large coefficients should help to design the preconditioner. As stated in Lemma 5.5, GreedyBvN (Algorithm 11) obtains the coefficients in non-increasing order. We can therefore use GreedyBvN to build an M that satisfies the sufficiency condition presented in Theorem 5.4. That is, we can have M with α_1 / (∑_{i=1}^k α_i) > 1/2, and hence M^{-1} can be applied with splitting iterations. For this, we start by initializing M to α_1 P_1. Then, when a coefficient α_j, for j ≥ 2, is obtained at Line 5 of Algorithm 11, we check whether α_1 will still be larger than the sum of the other coefficients used in building M when we add α_j P_j to M. If so, we add α_j P_j to M and continue. In practice, we iterate the while loop until k is around 10 and collect the α_j's as described above. Experiments on a set of challenging problems [23] show that this preconditioner is more effective than ILU(0).
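A minimal sketch of this selection rule, assuming the (coefficient, permutation matrix) pairs arrive from GreedyBvN in non-increasing order of the coefficients; the function name and the limit parameter are ours.

def build_splitting_M(coeffs_and_perms, limit=10):
    # Keep permutation matrices as long as the first coefficient stays
    # larger than the sum of the others, so that Theorem 5.4 applies to M.
    kept = []
    first, rest_sum = None, 0.0
    for alpha, P in coeffs_and_perms:
        if first is None:
            first = alpha
            kept.append((alpha, P))
        elif first > rest_sum + alpha and len(kept) < limit:
            rest_sum += alpha
            kept.append((alpha, P))
        # otherwise skip this permutation and keep scanning
    return kept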

5.5.2 Birkhoff-von Neumann decomposition for arbitrary matrices

In order to apply the preconditioners to all fully indecomposable sparse matrices, we generalized the BvN decomposition as follows.

Theorem 5.5. Any (real) matrix A with total support can be written as a convex combination of a set of signed, scaled permutation matrices.

Proof. Let B = abs(A), and consider diagonal matrices R and C making RBC doubly stochastic, so that it has a Birkhoff-von Neumann decomposition RBC = ∑_{i=1}^k α_i P_i. Then, RAC can be expressed as

RAC = ∑_{i=1}^k α_i Q_i ,

where Q_i = [q^{(i)}_{jk}]_{n×n} is obtained from P_i = [p^{(i)}_{jk}]_{n×n} as follows:

q^{(i)}_{jk} = sign(a_{jk}) p^{(i)}_{jk} .

The scaling matrices can be inverted to express A.
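The sign transfer in the proof is a one-liner once a BvN decomposition of the scaled nonnegative matrix R|A|C is available. The sketch below (our naming) assumes such a decomposition is given as a list of (α_i, P_i) pairs.

import numpy as np

def signed_bvn(A, bvn_of_scaled_abs):
    # Turn each permutation matrix P_i into the signed permutation Q_i
    # carrying the signs of A, so that RAC = sum_i alpha_i Q_i.
    sign = np.sign(A)
    return [(alpha, P * sign) for alpha, P in bvn_of_scaled_abs]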

De Werra [62] presents another generalization of the Birkhoff-von Neumann theorem to arbitrary matrices. In De Werra's generalization, instead of permutation matrices, signed integral matrices are used to decompose the given matrix. The entries of the integral matrices used in the decomposition have the same sign as the corresponding entries in the original matrix, but there may be multiple nonzeros in a row or column. De Werra discusses flow-based algorithms to obtain the decompositions. While the decomposition is costlier to obtain (general flow computations are usually costlier than computing perfect matchings), it is also more general in that it handles rectangular matrices as well.

5.6 Summary, further notes, and references

We investigated the problem of obtaining a Birkhoff-von Neumann decomposition of doubly stochastic matrices with the minimum number of permutation matrices. We showed four results regarding this problem. First, obtaining a BvN decomposition with the minimum number of permutation matrices is NP-hard. Second, there are matrices whose decompositions with the smallest number of permutation matrices cannot be obtained by any Birkhoff-like heuristic. Third, the worst-case approximation ratio of the original Birkhoff heuristic is Ω(n). On the more practical side, we proposed a natural greedy heuristic (GreedyBvN) and presented experimental results which demonstrated that this heuristic delivers results that are not far from a trivial lower bound. We theoretically investigated GreedyBvN's performance and observed that its performance depends on the values of the matrix elements; we showed a bound using the first and the last coefficients found by the heuristic. This chapter also included a generalization of the Birkhoff-von Neumann theorem, in which we are able to express any real matrix with total support as a convex combination of scaled, signed permutation matrices.

The bound shown for the performance of GreedyBvN is expected to be much larger than what one observes in practice, as the bound can even be larger than the upper bound on the number of permutation matrices. A tighter analysis should be possible to explain the practical performance of the GreedyBvN heuristic.

The heuristics we considered in this chapter were based on finding a perfect matching. Koopmans and Beckmann [143] present another constructive proof of the Birkhoff theorem. This constructive proof splits the matrix as A = βA_1 + (1 − β)A_2, where 0 < β < 1, and A_1 and A_2 are doubly stochastic matrices. A related proof is also given by Horn and Johnson [118, Theorem 8.7.1], where β = 1/2. We review these two constructive proofs.

Let C be a cycle of length 2ℓ, with ℓ vertices in each part of the bipartite graph, with vertices r_i, r_{i+1}, ..., r_{i+ℓ−1} on one side and c_i, c_{i+1}, ..., c_{i+ℓ−1} on the other side. Let us imagine the cycle drawn on the plane so that we have horizontal edges of the form (r_j, c_j) for j = i, ..., i + ℓ − 1, and slanted edges of the form (r_i, c_{i+ℓ−1}) and (r_j, c_{j−1}) for j = i + 1, ..., i + ℓ − 1. A


Figure 5.4: A small example for the Koopmans and Beckmann decomposition approach on a cycle through rows r_i, r_j and columns c_i, c_j: (a) subtract ε_1 = min{a_ii, a_jj} from the horizontal edges and add it to the slanted edges; (b) add ε_2 = min{a_ij, a_ji} to the horizontal edges and subtract it from the slanted edges.

small example is shown in Figure 5.4. Let ε_1 and ε_2 be the minimum value of a horizontal and of a slanted edge, respectively. Koopmans and Beckmann proceed as follows to obtain a BvN decomposition. Consider a matrix C_1 which contains −ε_1 at the positions corresponding to the horizontal edges of C, ε_1 at the positions corresponding to the slanted edges of C, and zeros elsewhere. Then define A_1 = A + C_1 (see Fig. 5.4a). It is easy to see that each element of A_1 is between 0 and 1, and that the row and column sums are the same as in A. Therefore, A_1 is doubly stochastic. Similarly, consider C_2, which contains ε_2 at the positions corresponding to the horizontal edges of C, −ε_2 at the positions corresponding to the slanted edges of C, and zeros elsewhere. Then define A_2 = A + C_2 (see Fig. 5.4b) and observe that A_2 is also doubly stochastic. By simple arithmetic, we can set β = ε_2/(ε_1 + ε_2) and write A = βA_1 + (1 − β)A_2. The observation that A_1 and A_2 contain at least one more zero than A can then be used in an inductive hypothesis to construct a decomposition by recursively decomposing A_1 and A_2.

At first sight, the constructive approach of Koopmans and Beckmann does not lead to an efficient algorithm. This is because A_1 and A_2 can have significant overlap, and hence many permutation matrices can be shared by A_1 and A_2, yielding a combinatorially large space requirement or a very high run time. A variant of this algorithm needs to be explored. First note that we can find a cycle and manipulate the weights by decreasing them along the horizontal edges and increasing them along the slanted edges (not necessarily by annihilating the smallest weight) to create another matrix. Then, we can take the modified matrix and manipulate another cycle. At one point, one can decide to stop and create A_1; this way, with a suitable choice of coefficients, the decomposition can proceed. At second sight, then, one observes that the Koopmans and Beckmann method calls for creative ideas for developing efficient and effective heuristics for BvN decompositions.

Horn and Johnson, on the other hand, set ε to the smallest of ε_1 and ε_2.


Without loss of generality, let ε_1 be the smallest. Then let C be a matrix whose entries are −1, 0, 1, corresponding to the negative, zero, and positive entries of C_1. Then let A_1 = A + εC and A_2 = A − εC. Both of these matrices have nonnegative entries less than or equal to one, and have the same row and column sums as A. We can then write A = 0.5A_1 + 0.5A_2 and recurse on A_1 and A_2 to obtain a decomposition (which stops when we do not have any cycles).

Inspired by the approaches summarized above, we can optimally decompose the matrix (5.8) used in showing that there are optimal decompositions which cannot be obtained by Birkhoff-like heuristics. Let

A_1 = [ a+b   d     c     e     0
        e     a+c   b     d     0
        0     e     d     b+c   a
        d     b     a     0     c+e
        c     0     e     a     b+d ]

A_2 = [ a+b       d+2·i     c+2·h     e+2·j     2·f+2·g
        e+2·g     a+c       b+2·i     d+2·f     2·h+2·j
        2·f+2·j   e+2·h     d+2·g     b+c       a+2·i
        d+2·h     b+2·f     a+2·j     2·g+2·i   c+e
        c+2·i     2·g+2·j   e+2·f     a+2·h     b+d ]

and observe that A = 0.5A_1 + 0.5A_2. We can then use the permutations containing the coefficients a, b, c, d, e to decompose A_1 optimally with five permutation matrices using GreedyBvN (in decreasing order: first e, then d, then c, and so on). While decomposing A_2, we need to use the permutation matrices found for A_1, with a careful selection of the coefficients (for example, instead of a + b we use a for the permutation associated with a). Once we have consumed those five permutations, we can again resort to GreedyBvN to decompose the remaining matrix optimally with the five permutation matrices associated with f, g, h, i, and j. In this way, we have an optimal decomposition for the matrix (5.8) by applying GreedyBvN to A_1 and carefully decomposing A_2. We note that neither A_1 nor A_2 has the same sum as A, in contrast to the two approaches above. Can we systematize this line of reasoning to develop an effective heuristic to obtain a BvN decomposition?


Chapter 6

Matchings in hypergraphs

In this chapter, we investigate the maximum matching problem in d-partite, d-uniform hypergraphs, described in Section 1.3. The formal problem definition is repeated below for convenience.

Maximum matching in d-partite d-uniform hypergraphs: Given a d-partite, d-uniform hypergraph, find a matching of maximum size.

We investigate effective heuristics for the problem above. This problem has been studied mostly in the context of local search algorithms [119], and the best known algorithm is due to Cygan [56], who provides a ((d + 1 + ε)/3)-approximation, building on previous work [57, 106]. It is NP-hard to approximate the 3-partite case (Max-3-DM) within 98/97 [24]. Similar bounds exist for higher dimensions: the hardness of approximation for d = 4, 5, and 6 is shown to be 54/53 − ε, 30/29 − ε, and 23/22 − ε, respectively [109].

We propose five heuristics for the above problem. The first two heuristics are adaptations of the Greedy [78] and Karp-Sipser [128] heuristics used for finding matchings in graphs (they are summarized in Chapter 4). The proposed generalizations are referred to as Greedy-H and Karp-Sipser-H. Greedy-H traverses the hyperedge list in random order and adds an edge to the matching whenever possible. Karp-Sipser-H introduces certain rules to Greedy-H to improve the cardinality. The third heuristic is inspired by a recent scaling-based approach proposed for the maximum cardinality matching problem on graphs [73, 74, 76]. The fourth heuristic is a modification of the third one that allows for a faster execution time. The last one finds a matching for a reduced (d − 1)-dimensional problem and exploits it for the original matching problem recursively. This heuristic uses an exact algorithm for the bipartite matching problem at each reduction. We perform experiments to evaluate the performance of these heuristics on special classes of random hypergraphs as well as real-life data.

One plausible way to tackle the problem is to create the line graph G of a given hypergraph H. The line graph is created by identifying each hyperedge of H with a vertex in G, and by connecting two vertices of G with an edge iff the corresponding hyperedges share a common vertex in H. Then, successful heuristics for computing large independent sets in graphs, e.g., KaMIS [147], can be used to compute large matchings in hypergraphs. This approach, although promising quality-wise, could be impractical. This is so since building G from H requires quadratic run time (in terms of the number of hyperedges) and, more importantly, quadratic storage (again, in terms of the number of hyperedges) in the worst case. While this can be acceptable in some instances, in others it is not; we have such instances in the experiments. Notice that while a heuristic for the independent set problem can be of linear time complexity in graphs, the actual complexity could be high here because our graphs are line graphs.

6.1 Background and notation

Franklin and Lorenz [89] show that if a nonnegative d-dimensional tensor X has the same zero-pattern as a d-stochastic tensor, one can apply a multidimensional version of the Sinkhorn-Knopp algorithm [202] (see also the description in Section 4.1) to scale X to be d-stochastic. This is accomplished by finding d vectors u^(1), ..., u^(d) and scaling the nonzeros of X as x_{i_1,...,i_d} · u^(1)_{i_1} · ··· · u^(d)_{i_d} for all i_1, ..., i_d ∈ {1, ..., n}.
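A dense sketch of this multidimensional scaling, assuming the tensor has total support so that no marginal sum vanishes; the function name is ours and the implementation is not optimized.

import numpy as np

def tensor_sinkhorn(X, iters=20):
    # Find scaling vectors u[0], ..., u[d-1] so that scaling the entries of X
    # by the products of the u's makes every mode-k marginal (close to) 1.
    d = X.ndim
    u = [np.ones(X.shape[k]) for k in range(d)]
    for _ in range(iters):
        for k in range(d):
            S = X.copy()
            for m in range(d):
                shape = [1] * d
                shape[m] = X.shape[m]
                S = S * u[m].reshape(shape)
            axes = tuple(m for m in range(d) if m != k)
            sums = S.sum(axis=axes)       # current mode-k marginals
            u[k] = u[k] / sums            # normalize them to 1
    return u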

In a hypergraph, if a vertex is a member of only a single hyperedge, we call it a degree-1 vertex. Similarly, if a vertex is a member of only two hyperedges, we call it a degree-2 vertex.

In the k-out random hypergraph model, given a set V of vertices, each vertex u ∈ V selects k hyperedges from the set E_u = {e : e ⊆ V, u ∈ e} in a uniformly random fashion, and the union of these edges forms E. We are interested in the d-partite, d-uniform case, and hence E_u = {e : |e ∩ V_i| = 1 for 1 ≤ i ≤ d, u ∈ e}. This model generalizes random k-out bipartite graphs [211]. Devlin and Kahn [67] investigate fractional matchings in these hypergraphs and mention in passing that k should be exponential in d to ensure that a perfect matching exists.

6.2 The heuristics

A matching which cannot be extended with more hyperedges is called maximal. The heuristics proposed here find maximal matchings on d-partite, d-uniform hypergraphs. For such hypergraphs, any maximal matching is a d-approximate matching. The bound is tight and can be verified for d = 3. Let H be a 3-partite 3 × 3 × 3 hypergraph with the following edges: e_1 = (1, 1, 1), e_2 = (2, 2, 2), e_3 = (3, 3, 3), and e_4 = (1, 2, 3). The maximum matching is {e_1, e_2, e_3}, but the edge e_4 alone forms a maximal matching.


6.2.1 Greedy-H: A greedy heuristic

Among the two variants of Greedy [78, 194], we adapt the first one, which randomly visits the edges. The adapted version is referred to as Greedy-H; it visits the hyperedges in random order and adds the visited hyperedge to the matching whenever possible. Since only maximal matchings are possible as its output, Greedy-H is a d-approximation heuristic.
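A sketch of Greedy-H, representing a d-partite, d-uniform hypergraph as a list of d-tuples of vertex identifiers (one identifier per part); the function name is ours.

import random

def greedy_h(edges, num_parts):
    # Visit hyperedges in random order; add one to the matching whenever
    # all of its d endpoints are still free.
    matched = [set() for _ in range(num_parts)]
    matching = []
    order = list(range(len(edges)))
    random.shuffle(order)
    for idx in order:
        e = edges[idx]
        if all(e[p] not in matched[p] for p in range(num_parts)):
            matching.append(e)
            for p in range(num_parts):
                matched[p].add(e[p])
    return matching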

6.2.2 Karp-Sipser-H: Adapting Karp-Sipser

The commonly used and analyzed version of the Karp-Sipser heuristic for graphs applies only degree-1 reductions, as summarized in Section 4.2.1. The original version, though, applies two rules:

- At any time during the heuristic, if a degree-1 vertex appears, it is matched with its only neighbor.

- Otherwise, if a degree-2 vertex u appears with neighbors {v, w}, then u (and its edges) is removed from the current graph, and v and w are merged to create a new vertex vw, whose set of neighbors is the union of those of v and w (except u). A maximum cardinality matching for the reduced graph can be extended to obtain one for the current graph by matching u with either v or w, depending on vw's match.

We now propose an adaptation of Karp-Sipser with the two rules to d-partite, d-uniform hypergraphs. The adapted version is called Karp-Sipser-H. Similar to the original one, the modified heuristic iteratively adds a random hyperedge to the matching and removes its d endpoints, as well as their hyperedges. However, the random selection is not applied whenever hyperedges defined by the following lemmas appear.

Lemma 6.1. During the heuristic, if a hyperedge e with at least d − 1 degree-1 endpoints appears, there exists a maximum cardinality matching in the current hypergraph containing e.

Proof. Let H′ be the current hypergraph at hand and e = (u_1, ..., u_d) be a hyperedge in H′ whose first d − 1 endpoints are degree-1 vertices. Let M′ be a maximum cardinality matching in H′. If e ∈ M′, we are done. Otherwise, assume that u_d is the endpoint matched by a hyperedge e′ ∈ M′ (note that if u_d is not matched, M′ can be extended with e). Since u_i, 1 ≤ i < d, are not matched in M′, M′ \ {e′} ∪ {e} defines a valid maximum cardinality matching for H′.

We note that it is not possible to relax the condition by using a hyperedge e with fewer than d − 1 endpoints of degree 1: in M′, two of e's higher-degree endpoints could be matched with two different hyperedges, in which case the substitution as done in the proof of the lemma would not be valid.


Lemma 6.2. During the heuristic, let e = (u_1, ..., u_d) and e′ = (u′_1, ..., u′_d) be two hyperedges sharing at least one endpoint, where, for an index set I ⊂ {1, ..., d} of cardinality d − 1, the vertices u_i, u′_i for all i ∈ I only touch e and/or e′. That is, for each i ∈ I, either u_i = u′_i is a degree-2 vertex, or u_i ≠ u′_i and they are both degree-1 vertices. For j ∉ I, u_j and u′_j are arbitrary vertices. Then, in the current hypergraph, there exists a maximum cardinality matching having either e or e′.

Proof. Let H′ be the current hypergraph at hand, and let j ∉ I be the remaining part index. Let M′ be a maximum cardinality matching in H′. If either e ∈ M′ or e′ ∈ M′, we are done. Otherwise, u_i and u′_i for all i ∈ I are unmatched by M′. Furthermore, since M′ is maximal, u_j must be matched by M′ (otherwise M′ could be extended by e). Let e″ ∈ M′ be the hyperedge matching u_j. Then M′ \ {e″} ∪ {e} defines a valid maximum cardinality matching for H′.

Whenever such hyperedges appear, the rules below are applied in the same order:

- Rule-1: At any time during the heuristic, if a hyperedge e with at least d − 1 degree-1 endpoints appears, then instead of a random edge, e is added to the matching and removed from the hypergraph.

- Rule-2: Otherwise, if two hyperedges e and e′ as defined in Lemma 6.2 appear, they are removed from the current hypergraph, together with the endpoints u_i, u′_i for all i ∈ I. Then, we consider u_j and u′_j. If u_j and u′_j are distinct, they are merged to create a new vertex u_ju′_j, whose hyperedge list is defined as the union of u_j's and u′_j's hyperedge lists. If u_j and u′_j are identical, we rename u_j as u_ju′_j. After obtaining a maximal matching on the reduced hypergraph, depending on the hyperedge matching u_ju′_j, either e or e′ can be used to obtain a larger matching in the current hypergraph.

When Rule-2 is applied, the two hyperedges identified in Lemma 6.2 are removed from the hypergraph, and only the hyperedges containing u_j and/or u′_j have an update in their vertex lists. Since the original hypergraph is d-partite and d-uniform, that update is just a renaming of a vertex in the concerned hyperedges (hence the resulting hypergraph is d-partite and d-uniform).

Although the extended rules usually lead to improved results in comparison to Greedy-H, Karp-Sipser-H still adheres to the d-approximation bound of maximal matchings. For the example given at the beginning of Section 6.2, Karp-Sipser-H generates a maximum cardinality matching by applying the first rule. However, when e_5 = (2, 1, 3) and e_6 = (3, 1, 3) are added to the example, neither of the two rules can be applied. As before, in case e_4 is randomly selected, it alone forms a maximal matching.


6.2.3 Karp-Sipser-H-scaling: Karp-Sipser with scaling

Karp-Sipser-H can be modified to make better decisions when neither of the two rules applies. In this variant, instead of a random selection, we first scale the adjacency tensor of H and obtain an approximately d-stochastic tensor X. We then augment the matching by adding the edge which corresponds to the largest value in X when the rules do not apply. The modified heuristic is summarized in Algorithm 12.

Algorithm 12: Karp-Sipser-H-scaling (H)

Data: A d-partite, d-uniform n_1 × ··· × n_d hypergraph H = (V, E)
Result: A maximal matching M of H
 1: M ← ∅                                     /* Initially M is empty */
 2: S ← ∅                                     /* Stack for the merges of Rule-2 */
 3: while H is not empty do
 4:   remove isolated vertices from H
 5:   if ∃ e = (u_1, ..., u_d) as in Rule-1 then
 6:     M ← M ∪ {e}                           /* Add e to the matching */
 7:     Apply the reduction Rule-1 on H
 8:   else if ∃ e = (u_1, ..., u_d), e′ = (u′_1, ..., u′_d) and I as in Rule-2 then
 9:     Let j be the part index with j ∉ I
10:     Apply the reduction Rule-2 on H by introducing the vertex u_ju′_j
11:     E′ ← {(v_1, ..., u_ju′_j, ..., v_d) : (v_1, ..., u_j, ..., v_d) ∈ E}   /* memorize the hyperedges of u_j */
12:     S.push(⟨e, e′, u_ju′_j, E′⟩)          /* Store the current merge */
13:   else
14:     X ← Scale(adj(H))                     /* Scale the adjacency tensor of H */
15:     e ← arg max_{(u_1,...,u_d)} x_{u_1,...,u_d}   /* Find the maximum entry in X */
16:     M ← M ∪ {e}                           /* Add e to the matching */
17:     Remove all hyperedges of u_1, ..., u_d from E
18:     V ← V \ {u_1, ..., u_d}
19: while S ≠ ∅ do
20:   ⟨e, e′, u_ju′_j, E′⟩ ← S.pop()
21:   if u_ju′_j is not matched by M then
22:     M ← M ∪ {e}
23:   else
24:     Let e″ ∈ M be the hyperedge matching u_ju′_j
25:     if e″ ∈ E′ then
26:       Replace u_ju′_j in e″ with u_j
27:       M ← M ∪ {e′}
28:     else
29:       Replace u_ju′_j in e″ with u′_j
30:       M ← M ∪ {e}

118 CHAPTER 6 MATCHINGS IN HYPERGRAPHS

edges with a probability corresponding to the scaled entry the edges whichare not included in a perfect matching become less likely to be chosen Un-fortunately for d ge 3 there is no equivalent of Birkhoffrsquos theorem as demon-strated by the following lemma whose proof can be found in the associatedtechnical report [75 Lemma 3]

Lemma 63 For d ge 3 there exist extreme points in the set of d-stochastictensors which are not permutation tensors

These extreme points can be used to generate other d-stochastic ten-sors as linear combinations Due to the lemma above we do not have thetheoretical foundation to imply that hyperedges corresponding to the largeentries in the scaled tensor must necessarily participate in a perfect match-ing Nonetheless the entries not in any perfect matching tend to becomezero (not guaranteed for all though) For the worst case example of Karp-Sipser-H described above the scaling indeed helps the entries correspondingto e4 e5 and e6 to become zero See the discussion following the said lemmain the technical report [75]

On a d-partite d-uniform hypergraph H = (VE) the Sinkhorn-Knoppalgorithm used for scaling operates in iterations each of which requiresO(|E| times d) time In practice we perform only a few iterations (eg 10ndash20)Since we can match at most |V |d hyperedges the overall run time costassociated with scaling is O(|V | times |E|) A straightforward implementationof the second rule can take quadratic time in the worst case of a large numberof repetitive merges with a given vertex In practice more of a linear timebehavior should be observed for the second rule

6.2.4 Karp-Sipser-H-mindegree: Hypergraph matching via pseudo scaling

In Algorithm 12, applying scaling at every step can be very costly. Here we propose an alternative idea, inspired by the specifics of the Sinkhorn-Knopp algorithm, to reduce the overall cost. In particular, we can mimic the first iteration of the Sinkhorn-Knopp algorithm, which amounts to using the inverse of the degree of a vertex as its scaling value. This assigns each vertex a weight proportional to the inverse of its degree; hence it is not an exact iteration of Sinkhorn-Knopp.

With this approach, each hyperedge {i_1, ..., i_d} is associated with the value 1/∏_{j=1}^d λ_{i_j}, where λ_{i_j} denotes the degree of the vertex i_j from the jth part. The selection procedure is the same as that of Algorithm 12; i.e., the edge with the maximum value is added to the matching. We refer to this algorithm as Karp-Sipser-H-mindegree, as it selects a hyperedge based on a function of the degrees of the vertices. With a straightforward implementation, finding this hyperedge takes O(|E|) time. For better efficiency, the edges can be


stored in a heap, and when the degree of a vertex v decreases, the increaseKey heap operation can be called for all its edges.
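The selection rule alone can be sketched as follows (our naming); a heap-based implementation would replace the linear scan.

from collections import Counter

def mindegree_edge(edges, num_parts):
    # Score each remaining hyperedge by 1 / (product of its endpoints'
    # degrees) and return the highest-scoring one.
    degree = [Counter(e[p] for e in edges) for p in range(num_parts)]
    def score(e):
        prod = 1
        for p in range(num_parts):
            prod *= degree[p][e[p]]
        return 1.0 / prod
    return max(edges, key=score)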

6.2.5 Bipartite-reduction: Reduction to the bipartite matching problem

A perfect matching in a d-partite, d-uniform hypergraph H remains perfect when projected onto a (d − 1)-partite, (d − 1)-uniform hypergraph obtained by removing one of H's dimensions. Matchability in (d − 1)-dimensional sub-hypergraphs has been investigated [6] to provide an equivalent of Hall's Theorem for d-partite hypergraphs. These observations lead us to propose a heuristic called Bipartite-reduction. This heuristic tackles the d-partite, d-uniform case by recursively asking for matchings in (d − 1)-partite, (d − 1)-uniform hypergraphs, and so on, until d = 2.

Let us start with the case where d = 3. Let G = (V_G, E_G) be the bipartite graph with the vertex set V_G = V_1 ∪ V_2, obtained by deleting V_3 from a 3-partite, 3-uniform hypergraph H = (V, E). The edge (u, v) ∈ E_G iff there exists a hyperedge (u, v, z) ∈ E. One can also assign a weight function w(·) to the edges during this step, such as

w(u, v) = |{z : (u, v, z) ∈ E}|.                         (6.1)

A maximum weighted (product, sum, etc.) matching algorithm can be used to obtain a matching M_G on G. A second bipartite graph G′ = (V_{G′}, E_{G′}) is then created with V_{G′} = (V_1 × V_2) ∪ V_3 and E_{G′} = {(uv, z) : (u, v) ∈ M_G, (u, v, z) ∈ H}. Under this construction, any matching in G′ corresponds to a valid matching in H. Furthermore, if the weight function (6.1) defined above is used, the following holds (the proof is in the technical report [75, Proposition 4]).

Proposition 61 Let w(MG) =sum

(uv)isinMGw(u v) be the size of the match-

ing MG found in G Then Gprime has w(MG) edges

Thus, by selecting a maximum weighted matching M_G and maximizing w(M_G), the largest number of edges will be kept in G′.

For d-dimensional matching, a similar process is followed. First, an ordering i_1, i_2, ..., i_d of the dimensions is defined. At the jth bipartite reduction step, the matching is found between the dimension cluster i_1 i_2 · · · i_j and the dimension i_{j+1}, by similarly solving a bipartite matching instance in which the edge (u_1 · · · u_j, v) exists iff the vertices u_1, ..., u_j were matched in the previous steps and there exists an edge (u_1, ..., u_j, v, z_{j+2}, ..., z_d) in H.
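The sketch below illustrates the d = 3 case described above, assuming networkx is available for the maximum weight matching on G; the weights follow (6.1), and the second bipartite graph G′ pairs each matched (u, v) with the candidate z vertices. Function names are ours, and this is only an illustration, not the implementation evaluated in the experiments.

    import networkx as nx
    from collections import Counter

    def bipartite_reduction_3d(hyperedges):
        # hyperedges: list of (u, v, z) with u in V1, v in V2, z in V3.
        # Stage 1: bipartite graph G on V1 x V2, weighted as in (6.1).
        w = Counter((u, v) for (u, v, z) in hyperedges)
        G = nx.Graph()
        G.add_weighted_edges_from(((("u", u), ("v", v), wt) for (u, v), wt in w.items()))
        MG = nx.max_weight_matching(G, maxcardinality=True)
        matched = {frozenset(e) for e in MG}
        # Stage 2: bipartite graph G' between matched (u, v) pairs and V3.
        Gp = nx.Graph()
        for (u, v, z) in hyperedges:
            if frozenset((("u", u), ("v", v))) in matched:
                Gp.add_edge(("uv", u, v), ("z", z))
        Mp = nx.max_weight_matching(Gp, maxcardinality=True)
        # Convert the second matching back to hyperedges of H.
        result = []
        for a, b in Mp:
            uv, z = (a, b) if a[0] == "uv" else (b, a)
            result.append((uv[1], uv[2], z[1]))
        return result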

Unlike the previous heuristics, Bipartite-reduction does not have any approximation guarantee. We state this with the following lemma (the proof is in the technical report [75, Lemma 5]).


Lemma 6.4. If algorithms for the maximum cardinality matching problem, or for the maximum weighted matching problem with the suggested edge weights (6.1), are used, then Bipartite-reduction has a worst-case approximation ratio of Ω(n).

6.2.6 Local-Search: Performing local search

A local search heuristic is proposed by Hurkens and Schrijver [119]. It starts from a feasible maximal matching M and performs a series of swaps until no further swap is possible. In a swap, k edges of M are replaced with at least k + 1 new edges from E \ M, so that the cardinality of M increases by at least one. These k edges from M can be replaced with at most d × k new edges. Hence, these edges can be found by a polynomial algorithm enumerating all the possibilities. The approximation guarantee improves with higher k values. Local search algorithms are limited in practice due to their high time complexity. The algorithm might have to examine all (|M| choose k) subsets of M to find a feasible swap at each step. The algorithm by Cygan [56], which achieves a ((d + 1 + ε)/3)-approximation, is based on a different swap scheme, but it is also not suited for large hypergraphs. We implement Local-Search as an alternative from the literature to set a baseline.

6.3 Experiments

To understand the relative performance of the proposed heuristics, we conducted a wide variety of experiments with both synthetic and real-life data. The experiments were performed on a computer equipped with an Intel Core i7-7600 CPU and 16 GB of RAM. We investigate the performance of the proposed heuristics Greedy-H, Karp-Sipser-H, Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction. For d = 3, we also consider Local-Search [119], which replaces one hyperedge from a maximal matching M with at least two hyperedges from E \ M to increase the cardinality of M. We did not consider local search schemes for higher dimensions or with better approximation ratios, as they are computationally too expensive. For each hypergraph, we perform ten runs of Greedy-H and Karp-Sipser-H with different random decisions and take the maximum cardinality obtained. Since Karp-Sipser-H-scaling, Karp-Sipser-H-mindegree, and Bipartite-reduction do not pick hyperedges randomly, we run them only once. We perform 20 steps of the scaling procedure in Karp-Sipser-H-scaling. We refer to the quality of a matching M in a hypergraph H as the ratio of M's cardinality to the size of the smallest vertex partition of H.

On random hypergraphs

We perform experiments on two classes of d-partite, d-uniform random hypergraphs where each part has n vertices. The first class contains random k-out hypergraphs, and the second one contains sparse random hypergraphs.

               n = 10                            n = 30
   d    d^{d-3}  d^{d-2}  d^{d-1}       d    d^{d-3}  d^{d-2}  d^{d-1}
   2       -      0.87     1.00         2       -      0.84     1.00
   3     0.80     1.00     1.00         3     0.88     1.00     1.00
   4     1.00     1.00     1.00         4     0.99     1.00     1.00
   5     1.00     1.00     1.00         5     1.00     1.00

               n = 20                            n = 50
   d    d^{d-3}  d^{d-2}  d^{d-1}       d    d^{d-3}  d^{d-2}  d^{d-1}
   2       -      0.88     1.00         2       -      0.87     1.00
   3     0.85     1.00     1.00         3     0.84     1.00     1.00
   4     1.00     1.00     1.00         4       *      1.00     1.00
   5     1.00     1.00     1.00         5

Table 6.1: The average maximum matching cardinalities, over n, of five random instances of random k-out, d-partite, d-uniform hypergraphs for different k, d, and n. There are no runs for k = d^{d-3} with d = 2, and the problems marked with * were not solved within 24 hours.

Random k-out, d-partite, d-uniform hypergraphs. Here we consider the random k-out, d-partite, d-uniform hypergraphs described in Section 6.1. Hence (ignoring the duplicate ones), these hypergraphs have around d × k × n hyperedges. These k-out, d-partite, d-uniform hypergraphs have been recently analyzed in the matching context by Devlin and Kahn [67]. They state in passing that k should be exponential in d for a perfect matching to exist with high probability. The bipartite graph variant of the same problem, i.e., with d = 2, has been extensively studied in the literature [90, 124, 125, 211].
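The sketch below shows one plausible reading of this construction, in which each vertex of each part creates k hyperedges by picking the other endpoints uniformly at random, and duplicates are removed; this matches the stated count of roughly d × k × n hyperedges. Names are ours.

    import random

    def random_k_out_hypergraph(n, d, k, seed=0):
        # Each of the d*n vertices creates k hyperedges; the other d-1 endpoints are
        # chosen uniformly at random. Returns a set of d-tuples (duplicates removed).
        rng = random.Random(seed)
        edges = set()
        for dim in range(d):
            for v in range(n):
                for _ in range(k):
                    edge = [rng.randrange(n) for _ in range(d)]
                    edge[dim] = v                  # the owner keeps its own index
                    edges.add(tuple(edge))
        return edges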

We first investigate the existence of perfect matchings in random k-out, d-partite, d-uniform hypergraphs. For this purpose, we used CPLEX [1] to solve the hypergraph matching problem exactly and found the maximum cardinality of a matching in k-out hypergraphs with k ∈ {d^{d-3}, d^{d-2}, d^{d-1}}, for d ∈ {2, ..., 5} and n ∈ {10, 20, 30, 50}. For each (k, d, n) triple, we created five hypergraphs and computed their maximum cardinality matchings. For k = d^{d-3}, we encountered several hypergraphs with no perfect matching, especially for d = 3. The hypergraphs with k = d^{d-2} were also lacking a perfect matching for d = 2. However, all the hypergraphs we created with k = d^{d-1} had at least one. Based on these results, we experimentally confirm Devlin and Kahn's statement. We also conjecture that d^{d-1}-out random hypergraphs have perfect matchings almost surely. The average maximum matching cardinalities we obtained in this experiment are given in Table 6.1. In this table, we do not have results for k = d^{d-3} for d = 2, and the cases marked with * were not solved within 24 hours.

We now compare the performance of the proposed heuristics on random k-out, d-partite, d-uniform hypergraphs with d ∈ {3, 6, 9} and n ∈ {1000, 10000}. We tested with k values equal to powers of 2, for k ≤ d log d. The results are summarized in Figure 6.1.


For each (k, d, n) triplet, we create ten random instances and present the average performance of the heuristics on them. The x-axis in each figure denotes k, and the y-axis reports the matching cardinality over n. As seen, Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree have the best performance, comfortably beating the other alternatives. For d = 3, Karp-Sipser-H-scaling dominates Karp-Sipser-H-mindegree, but when d > 3 we see that Karp-Sipser-H-mindegree has the best performance. Local-Search performs worse than Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, but better than the others. Karp-Sipser-H performs better than Greedy-H; however, their performances get closer as d increases. This is due to the fact that the conditions for Rule-1 and Rule-2 hold less often for larger d, as there are more restrictions to meet in such cases. Bipartite-reduction has worse performance than the others, and the gap in performance grows as d increases. This happens since, at each step, we impose more and more conditions on the edges involved, and there is no chance to recover from bad decisions.

Sparse random d-partite, d-uniform hypergraphs. Here we consider a random d-partite, d-uniform hypergraph H_i created with i × n hyperedges. The parameters used for this experiment are i ∈ {1, 3, 5, 7}, n ∈ {4000, 8000}, and d ∈ {6, 9} (the associated paper presents further experiments with d = 3). Each H_i is created by choosing the vertices of a hyperedge uniformly at random for each dimension. We do not allow duplicate hyperedges. Another random hypergraph H_{i+M} is then obtained by planting a perfect matching in H_i; that is, a random perfect matching is added to the pattern of H_i. We again generate ten random instances for each parameter setting. We do not present results for Bipartite-reduction, as it was always worse than the others. The average quality of the different heuristics on these instances is shown in Figure 6.2. The experiments confirm that Karp-Sipser-H performs consistently better than Greedy-H. Furthermore, Karp-Sipser-H-scaling performs significantly better than Karp-Sipser-H. Karp-Sipser-H-scaling works even better than the local search heuristic [75, Fig. 2], and it is the only heuristic that is capable of finding the planted perfect matchings for a significant number of the runs. In particular, it finds a perfect matching on the H_{i+M}'s in all cases except for when d = 6 and i = 7. Interestingly, Karp-Sipser-H-mindegree outperforms Karp-Sipser-H-scaling on the H_i's, but is dominated on the H_{i+M}'s, where it is the second best performing heuristic.

Evaluating algorithmic choices

We evaluate the use of scaling and the importance of Rule-1 and Rule-2.

Scaling vs. no-scaling.

[Figure 6.1: The performance of the heuristics on k-out, d-partite, d-uniform hypergraphs with n vertices in each part. The y-axis is the ratio of matching cardinality to n, and the x-axis is k. Panels: (a) d = 3, (b) d = 6, (c) d = 9, each with n = 1000 (left) and n = 10000 (right); the curves are Greedy, Karp-Sipser, Karp-Sipser-scaling, Karp-Sipser-mindegree, Local-Search, and Bipartite-reduction. There is no Local-Search for d = 6 and d = 9.]


(a) d = 6, without (left) and with (right) the planted matching

        H_i: random hypergraph                                  H_{i+M}: random hypergraph + a perfect matching
 i   4000  8000 | 4000  8000 | 4000  8000 | 4000  8000     4000  8000 | 4000  8000 | 4000  8000 | 4000  8000
 1   0.31  0.31 | 0.35  0.35 | 0.35  0.35 | 0.36  0.37     0.62  0.61 | 0.90  0.89 | 1.00  1.00 | 1.00  1.00
 3   0.43  0.43 | 0.47  0.47 | 0.48  0.48 | 0.54  0.54     0.51  0.50 | 0.56  0.55 | 1.00  1.00 | 0.99  0.99
 5   0.48  0.48 | 0.52  0.52 | 0.54  0.54 | 0.61  0.61     0.52  0.52 | 0.56  0.55 | 1.00  1.00 | 0.97  0.97
 7   0.52  0.52 | 0.55  0.55 | 0.59  0.59 | 0.66  0.66     0.54  0.54 | 0.57  0.57 | 0.84  0.80 | 0.71  0.70

(b) d = 9, without (left) and with (right) the planted matching

        H_i: random hypergraph                                  H_{i+M}: random hypergraph + a perfect matching
 i   4000  8000 | 4000  8000 | 4000  8000 | 4000  8000     4000  8000 | 4000  8000 | 4000  8000 | 4000  8000
 1   0.25  0.24 | 0.27  0.27 | 0.27  0.27 | 0.30  0.30     0.56  0.55 | 0.80  0.79 | 1.00  1.00 | 1.00  1.00
 3   0.34  0.33 | 0.36  0.36 | 0.36  0.36 | 0.43  0.43     0.40  0.40 | 0.44  0.44 | 1.00  1.00 | 0.99  1.00
 5   0.38  0.37 | 0.40  0.40 | 0.41  0.41 | 0.48  0.48     0.41  0.40 | 0.43  0.43 | 1.00  1.00 | 0.99  0.99
 7   0.40  0.40 | 0.42  0.42 | 0.44  0.44 | 0.51  0.51     0.42  0.42 | 0.44  0.44 | 1.00  1.00 | 0.97  0.96

Figure 6.2: Performance comparisons on d-partite, d-uniform hypergraphs with n = 4000, 8000. H_i contains i × n random hyperedges, and H_{i+M} contains an additional perfect matching. In each half, the four column groups correspond (left to right) to Greedy-H, Karp-Sipser-H, Karp-Sipser-H-scaling, and Karp-Sipser-H-mindegree, with sub-columns for n = 4000 and n = 8000.

[Figure 6.3: A_KS, a challenging bipartite graph instance for Karp-Sipser; its rows are split into blocks R1, R2 and its columns into blocks C1, C2, with a parameter t.]

   t   Greedy-H   Local-Search   Karp-Sipser-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
   2     0.53        0.99            0.53               1.00                     1.00
   4     0.53        0.99            0.53               1.00                     1.00
   8     0.54        0.99            0.55               1.00                     1.00
  16     0.55        0.99            0.56               1.00                     1.00
  32     0.59        0.99            0.59               1.00                     1.00

Table 6.2: Performance of the proposed heuristics on 3-partite, 3-uniform hypergraphs corresponding to X_KS with n = 300 vertices in each part.


To better evaluate and emphasize the contribution of scaling, we compare the performance of the heuristics on a particular family of d-partite, d-uniform hypergraphs whose bipartite counterparts have been used before as challenging instances for the original Karp-Sipser heuristic [76].

Let A_KS be an n × n matrix, discussed before in Figure 4.3 as a challenging instance for Karp-Sipser, and shown in Figure 6.3 for convenience. To adapt this scheme to hypergraphs/tensors, we generate a 3-dimensional tensor X_KS such that the nonzero pattern of each marginal of the third dimension is identical to that of A_KS. Table 6.2 shows the performance of the heuristics (i.e., matching cardinality normalized by n) for 3-dimensional tensors with n = 300 and t ∈ {2, 4, 8, 16, 32}.

The use of scaling indeed reduces the influence of the misleading hyperedges in the dense block R1 × C1, and the proposed Karp-Sipser-H-scaling heuristic always finds the perfect matching, as does Karp-Sipser-H-mindegree. However, Greedy-H and Karp-Sipser-H perform significantly worse. Furthermore, Local-Search returns a 0.99-approximation in every case, because it ends up in a local optimum.

Rule-1 vs. Rule-2. We finish the discussion on the synthetic data by focusing on Karp-Sipser-H. In the bipartite case, recent work [10] shows that both rules are needed to obtain perfect matchings in a class of graphs. We present a family of hypergraphs for which applying Rule-2 leads to much better performance than applying Rule-1 only.

We use Karp-Sipser-H-R1 to refer to Karp-Sipser-H without Rule-2. As before, we describe the bipartite case first. Let A_RF be an n × n matrix with (A_RF)_{ij} = 1 for 1 ≤ i ≤ j ≤ n, and (A_RF)_{2,1} = (A_RF)_{n,n-1} = 1. That is, A_RF is composed of an upper triangular matrix and two additional subdiagonal nonzeros. The first two columns and the last two rows have two nonzeros each. Assume, without loss of generality, that the first two rows are merged by applying Rule-2 on the first column (which is discarded). Then, in the reduced matrix, the first column (corresponding to the second column in the original matrix) will have one nonzero. Rule-1 can now be applied, whereupon the first column in the reduced matrix will again have degree one. The process continues similarly until the reduced matrix is a 2 × 2 dense block, where applying Rule-2 followed by Rule-1 yields a perfect matching. If only Rule-1 reductions are allowed, initially no reduction can be applied, and randomly chosen edges will be matched, which negatively affects the quality of the returned matching.
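The pattern of A_RF is easy to construct for checking the reduction sequence described above by hand; the following small sketch (names are ours) builds it.

    import numpy as np

    def build_a_rf(n):
        # Upper triangular pattern plus the two subdiagonal entries (2,1) and (n, n-1)
        # from the text (1-based in the text, stored 0-based here).
        A = np.zeros((n, n), dtype=int)
        for i in range(n):
            A[i, i:] = 1                 # a_ij = 1 for i <= j
        A[1, 0] = 1                      # (A_RF)_{2,1}
        A[n - 1, n - 2] = 1              # (A_RF)_{n,n-1}
        return A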

For higher dimensions, we proceed as follows. Let X_RF be a d-dimensional n × · · · × n tensor. We set (X_RF)_{i,j,...,j} = 1 for 1 ≤ i ≤ j ≤ n, and (X_RF)_{1,2,...,2} = (X_RF)_{n,n-1,...,n-1} = 1. By similar reasoning, we see that Karp-Sipser-H with both reduction rules will obtain a perfect matching, whereas Karp-Sipser-H-R1 will struggle. We give some results in Table 6.3 that show the difference between the two. We test for n ∈ {1000, 2000, 4000} and d ∈ {2, 3, 6}, and show the quality of Karp-Sipser-H-R1 and the number of times that Rule-1 is applied, divided by n. We present the best result over 10 runs. As seen in Table 6.3, Karp-Sipser-H-R1 obtains matchings that are about 13–25% worse than Karp-Sipser-H. Furthermore, the larger the number of Rule-1 applications is, the higher the quality is.

               d = 2              d = 3              d = 6
    n      quality   r/n     quality   r/n     quality   r/n
  1000      0.83     0.45     0.85     0.47     0.80     0.31
  2000      0.86     0.53     0.87     0.56     0.80     0.30
  4000      0.82     0.42     0.75     0.17     0.84     0.45

Table 6.3: Quality of matching and the number r of applications of Rule-1, divided by n, in Karp-Sipser-H-R1 for hypergraphs corresponding to X_RF. Karp-Sipser-H (with both rules) obtains perfect matchings.

On real-life tensors

We also evaluate the performance of the proposed heuristics on some real-life tensors selected from the FROSTT library [203]. The descriptions of the tensors are given in Table 6.4. For nips and uber, a dimension of size 17 and 24, respectively, is dropped, since it restricts the size of the maximum cardinality matching. As described before, a d-partite, d-uniform hypergraph is obtained from a d-dimensional tensor by keeping a vertex for each dimension index and a hyperedge for each nonzero. Unlike in the previous experiments, the parts of the hypergraphs obtained from the real-life tensors in Table 6.4 do not have an equal number of vertices. In this case, although the scaling algorithm works along the same lines, its output is slightly different. Let n_i = |V_i| be the cardinality of the ith dimension and n_max = max_{1≤i≤d} n_i be the maximum one. By slightly modifying Sinkhorn-Knopp, for each iteration of Karp-Sipser-H-scaling we scale the tensor such that the marginals in dimension i sum up to n_max/n_i instead of one. The results in Table 6.4 resemble those from the previous sections. Karp-Sipser-H-scaling has the best performance and is slightly superior to Karp-Sipser-H-mindegree. Greedy-H and Karp-Sipser-H are close to each other and, when it is feasible, Local-Search is better than them. We also see that on these instances Bipartite-reduction exhibits a good performance: its performance is at least as good as Karp-Sipser-H-scaling for the first three instances, but about 10% worse for the last one.


Tensor   d   Dimensions                        Greedy-H  Local-Search  Karp-Sipser-H  Karp-Sipser-H-mindegree  Karp-Sipser-H-scaling  Bipartite-Reduction
Uber     3   183 × 1,140 × 1,717                  183        183            183               183                     183                  183
nips     3   2,482 × 2,862 × 14,036             1,847      1,991          1,839             2,005                   2,007                2,007
Nell-2   3   12,092 × 9,184 × 28,818            3,913      4,987          3,935             5,100                   5,154                5,175
Enron    4   6,066 × 5,699 × 244,268 × 1,176      875         -             875               988                   1,001                  898

Table 6.4: Four real-life tensors and the performance of the proposed heuristics on the corresponding hypergraphs. There is no result for Local-Search on Enron, as it is four dimensional. Uber has 1,117,629 nonzeros, nips has 3,101,609 nonzeros, Nell-2 has 76,879,419 nonzeros, and Enron has 54,202,099 nonzeros.

Using an independent set solver

We compare Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree with the idea of reducing the maximum d-dimensional matching problem to that of finding an independent set in the line graph of the given hypergraph. We show that this transformation can obtain good results, but is restricted because line graphs can require too much space.

We use KaMIS [147] to find independent sets in graphs. KaMIS uses a plethora of reductions and a genetic algorithm in order to return high cardinality independent sets. We use the default settings of KaMIS (where the execution time is limited to 600 seconds) and generate the line graphs with efficient sparse matrix–matrix multiplication routines. We run KaMIS, Greedy-H, Karp-Sipser-H-scaling, and Karp-Sipser-H-mindegree on a few hypergraphs from the previous tests. The results are summarized in Table 6.5. The run time of Greedy-H was less than one second in all instances. KaMIS operates in rounds, and we give the quality and the run time of the first round and of the final output. We note that KaMIS considers the time limit only after the first round has been completed. As can be seen, while the quality of KaMIS is always good, and in most cases superior to Karp-Sipser-H-scaling and Karp-Sipser-H-mindegree, it is also significantly slower (its principle is to deliver high quality results). We also observe that the pseudo scaling of Karp-Sipser-H-mindegree indeed helps to reduce the run time compared to Karp-Sipser-H-scaling.

The line graphs of the real-life instances from Table 6.4 are too large to be handled. We estimate, using known techniques [50], the number of edges in these graphs to range from 1.5 × 10^10 to 4.7 × 10^13, which translates to 126 GB to 380 TB of storage if edges are stored twice, using 4 bytes per edge.
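A much cruder bound than the estimation technique of [50] can already show why these line graphs are infeasible: two hyperedges are adjacent in the line graph iff they share a vertex, so summing, over all vertices, the number of pairs of incident hyperedges gives an upper bound on the edge count. The sketch below computes this bound; it is our illustration, not the estimator used above.

    from collections import Counter

    def line_graph_edge_upper_bound(edges, d):
        # Upper bound on the number of line-graph edges of a d-partite, d-uniform
        # hypergraph: sum of C(deg(v), 2) over all vertices (pairs sharing several
        # vertices are counted more than once).
        degree = Counter((dim, e[dim]) for e in edges for dim in range(d))
        return sum(deg * (deg - 1) // 2 for deg in degree.values())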


                            line graph   KaMIS Round 1    KaMIS Output    Greedy-H   Karp-Sipser-H-scaling   Karp-Sipser-H-mindegree
hypergraph                   gen. time   quality  time    quality  time    quality     quality     time        quality     time
8-out, n = 1000, d = 3           10        0.98     80      0.99    600      0.86        0.98         1          0.98        1
8-out, n = 10000, d = 3         112        0.98    507      0.99    600      0.86        0.98       197          0.98        1
8-out, n = 1000, d = 9          298        0.67    798      0.69    802      0.55        0.62         2          0.67        1
n = 8000, d = 3, H_3              1        0.77     16      0.81    602      0.63        0.76         5          0.77        1
n = 8000, d = 3, H_{3+M}          2        0.89     25      1.00    430      0.70        1.00        11          0.91        1

Table 6.5: Run time (in seconds) and performance comparisons between KaMIS, Greedy-H, and Karp-Sipser-H-scaling. The time required to create the line graphs should be added to KaMIS's overall time.

6.4 Summary, further notes, and references

We have proposed heuristics for the problem of finding maximum matchings in d-partite, d-uniform hypergraphs by generalizing existing heuristics for the maximum cardinality matching problem in graphs. The experimental analysis on various hypergraphs (both random and corresponding to real-life tensors) shows the effectiveness and efficiency of the proposed heuristics.

This being a recent study, we have much future work. On the theoretical side, we plan to investigate the stated conjecture that d^{d-1}-out random hypergraphs have perfect matchings almost always, and to analyze the theoretical guarantees of the proposed algorithms. On the practical side, we want to test the use of the proposed hypergraph matching heuristics in the coarsening phase of multilevel hypergraph partitioning tools. For this, we need to adapt the proposed heuristics to the general hypergraph case (not necessarily d-partite and d-uniform). Another future work is the investigation of the weighted hypergraph matching problem and the generalization and use of the proposed heuristics for this problem. Solvers for this problem are computational tools used in the multiple network alignment problem [179], where a matching among the vertices of more than two graphs is computed.

Part III

Ordering matrices and tensors


Chapter 7

Elimination trees and height-reducing orderings

In this chapter, we investigate the minimum height elimination tree problem described in Section 1.1. The formal problem definition is repeated below for convenience.

Minimum height elimination tree: Given a directed graph, find an ordering of its vertices so that the associated elimination tree has the minimum height.

The standard elimination tree [200] has been used to expose parallelism in sparse Cholesky, LU, and QR factorizations [5, 8, 116, 160]. Roughly, a set of vertices without ancestor/descendant relations corresponds to a set of independent tasks that can be performed in parallel. Therefore, the total number of parallel steps, or the critical path length, is equal to the height of the tree on an unbounded number of processors [162, 214]. Obtaining an elimination tree with the minimum height for a given matrix is NP-complete [193]; therefore, heuristic approaches are used. One set of heuristic approaches relies on graph partitioning based methods. These methods reduce some other important cost metrics in sparse Cholesky factorization, such as the fill-in and the operation count, while giving a shallow depth elimination tree [103]. When the matrix is unsymmetric, the elimination tree for LU factorization [81] is useful to expose parallelism. In this respect, the height of the tree again corresponds to the number of parallel steps, or the critical path length, for certain factorization schemes. Here we develop heuristics to reduce the height of elimination trees for unsymmetric matrices.

Consider the directed graph G_d obtained by replacing every edge of a given undirected graph G by two directed edges pointing in opposite directions. Then the cycle-rank (or the minimum height of an elimination tree) of G_d is closely related to the undirected graph parameter tree-depth [180, p. 128], which corresponds to the minimum height of an elimination tree for a symmetric matrix. One can exploit this correspondence in the reverse sense, in an attempt to reduce the height of the elimination tree while producing an ordering for the matrix. That is, one can use the standard undirected graph model corresponding to the matrix A + A^T, where the addition is pattern-wise. This approach of producing an undirected graph for a given directed graph by removing the direction of each edge is used in solvers such as MUMPS [8] and SuperLU [63]. This way, the state of the art graph partitioning based ordering methods can be used. We will compare the proposed heuristics with such methods.

The height of the elimination tree, or the critical path length, is not the only metric for efficient parallel LU factorization of unsymmetric matrices. Depending on the factorization schemes, such as Gaussian elimination with static pivoting (implemented in GESP [157]) or multifrontal based solvers (implemented, for example, in WSMP [104]), other metrics come into play (such as the fill-in, the size of blocking in GESP based solvers, e.g., SuperLU_MT [64], the maximum/minimum size of fronts for parallelism in multifrontal solvers, or the communication). Nonetheless, the elimination tree helps to give an upper bound on the performance of the suitable parallel LU factorization schemes under a fairly standard PRAM model (assuming unit size computations, an infinite number of processors, and no memory/communication or scheduling overhead). Furthermore, our overall approach is based on obtaining a desirable form for LU factorization, called the bordered block triangular form (or BBT form for short) [81]. This form simplifies many algorithmic details of different solvers [81, 82] and is likely to control the fill-in, as the nonzeros are constrained to be in the blocks of a BBT form. If one orders a matrix without a BBT form and then obtains this form, the fill-in can increase or decrease [83]. Hence, our BBT based approach is likely to be helpful for LU factorization schemes exploiting the BBT form, as well as in controlling the fill-in.

7.1 Background

We define G_s = (V, E_s) as the undirected graph associated with the directed graph G, such that E_s = {{v_i, v_j} : (v_i, v_j) ∈ E}. A directed graph G is called complete if (u, v) ∈ E for all vertex pairs (u, v).

Let T = (V, r, ρ) be a rooted tree with root r and parent function ρ. An ordering σ : V ↔ {1, ..., n} is a bijective function. An ordering σ is called topological if σ(ρ(v_j)) > σ(v_j) for all v_j ∈ V. Finally, a postordering is a special form of topological ordering in which the vertices of each subtree are ordered consecutively.

Let A be an n × n unsymmetric matrix with m off-diagonal nonzero entries. The standard directed graph G(A) = (V, E) of A consists of a vertex set V = {1, ..., n} and an edge set E = {(i, j) : a_ij ≠ 0} of m edges. We adopt G_A instead of G(A) when a subgraph notion is required.

[Figure 7.1: A sample 9 × 9 unsymmetric matrix A, the corresponding directed graph G(A), and the elimination tree T(A); panels: (a) the matrix A, (b) the directed graph G(A), (c) the elimination tree T(A).]

An elimination digraph G_k(A) = (V_k, E_k) is defined for 0 ≤ k < n with the vertex set V_k = {k + 1, ..., n} and the edge set

    E_k = E                                                     for k = 0,
    E_k = E_{k-1} ∪ {(i, j) : (i, k), (k, j) ∈ E_{k-1}}         for k > 0.

We also define V^k = {1, ..., k} for each 0 ≤ k < n.

We assume that A is irreducible and that the LU factorization A = LU exists, where L and U are unit lower and upper triangular, respectively. Since the factors L and U are triangular, their standard directed graph models G(L) and G(U) are acyclic. Eisenstat and Liu [81] define the elimination tree T(A) as follows. Let

    Parent(i) = min{ j : j > i and j ⇒_{G(L)} i ⇒_{G(U)} j },

where a ⇒_G b means that there is a path from a to b in the graph G; then Parent(i) corresponds to the parent of vertex i in T(A) for i < n, and for the root n we put Parent(n) = ∞.

We now illustrate some of the definitions. Figure 7.1a displays a sample matrix A with 9 rows/columns and 26 nonzeros. Figure 7.1b shows the standard directed graph representation G(A) of this matrix. Upon elimination of vertex 1, the three edges (3, 2), (6, 2), and (7, 2) are added, and vertex 1 is removed to build the elimination digraph G_1(A). On the right side, Figure 7.1c shows the elimination tree T(A), whose height is h(T(A)) = 5. Here, the parent of vertex 1 is 3, as G_A[{1, 2}] is not strongly connected but G_A[{1, 2, 3}] is (thanks to the cycle 1 → 2 → 3 → 1). The cross edges of G(A) with respect to the elimination tree T(A) are represented by dashed directed arrows in Figure 7.1c. The initial order is topological but not postordered. However, 8, 1, 2, 3, 5, 4, 6, 7, 9 is a postordering of T(A), and it does not respect the cross edges.

7.2 The critical path length and the elimination tree height

We summarize the rowwise and columnwise LU factorization schemes, also known as the ikj and jik variants [70], here to be referred to as row-LU and column-LU, respectively. Then we show that the critical path lengths for the parallel row-LU and column-LU factorization schemes are equivalent to the elimination tree height whenever the matrix has a certain form.

Let A be an n × n irreducible unsymmetric matrix and A = LU, where L and U are unit lower and upper triangular, respectively. Algorithms 13 and 14 display the row-LU and column-LU factorization schemes, respectively. Both schemes update the given matrix A so that the U = [u_ij]_{i≤j} factor is formed by the upper triangular (including the diagonal) entries of the matrix A at the end. The L = [ℓ_ij]_{i≥j} factor is formed by the multipliers (ℓ_ik = a_ik/a_kk for i > k) and a unit diagonal. We assume a coarse-grain parallelization approach [122, 160]. In Algorithm 13, task i is defined as the task of computing row i of L and U, and it is denoted by T_row(i). Similarly, in Algorithm 14, task j is defined as the task of computing column j of both the L and U factors, and it is denoted by T_col(j). The dependencies between the tasks can be represented with the directed acyclic graphs of L^T and U for the row-LU and column-LU factorizations of A, respectively.

Algorithm 13: Row-LU(A)
1: for i = 1 to n do
2:     ▷ T_row(i)
3:     for each k < i such that a_ik ≠ 0 do
4:         for each j > k such that a_kj ≠ 0 do
5:             a_ij ← a_ij − a_kj × (a_ik / a_kk)

Algorithm 14: Column-LU(A)
1: for j = 1 to n do
2:     ▷ T_col(j)
3:     for each k < j such that a_kj ≠ 0 do
4:         for each i > k such that a_ik ≠ 0 do
5:             a_ij ← a_ij − a_ik × (a_kj / a_kk)
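The following dense, in-place sketch mirrors the two update orders of Algorithms 13 and 14 (the ikj and jik variants), ignoring sparsity and pivoting; it is only meant to make the loop structures and the task boundaries concrete. As in the algorithms, the strictly lower part is left unscaled, so both routines should produce the same filled matrix, with U on and above the diagonal.

    import numpy as np

    def row_lu(A):
        # ikj variant (Algorithm 13): task T_row(i) updates row i using rows k < i.
        A = A.astype(float).copy()
        n = A.shape[0]
        for i in range(n):
            for k in range(i):
                if A[i, k] != 0.0:
                    A[i, k + 1:] -= A[k, k + 1:] * (A[i, k] / A[k, k])
        return A

    def column_lu(A):
        # jik variant (Algorithm 14): task T_col(j) updates column j using columns k < j.
        A = A.astype(float).copy()
        n = A.shape[0]
        for j in range(n):
            for k in range(j):
                if A[k, j] != 0.0:
                    A[k + 1:, j] -= A[k + 1:, k] * (A[k, j] / A[k, k])
        return A
    # row_lu(A) and column_lu(A) agree entrywise; below the diagonal, the multiplier
    # l_ik is recovered as the stored value divided by a_kk.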

[Figure 7.2: The filled matrix L + U (a), and the task dependency graphs G(L^T) for row-LU (b) and G(U) for column-LU (c).]

Figure 7.2 illustrates the mentioned dependencies. Figure 7.2a shows the matrix L + U corresponding to the factors of the sample matrix A given in Figure 7.1a. In this figure, the blue hexagons and green circles are the fill-ins in L and U, respectively. Subsequently, Figures 7.2b and 7.2c show the task dependency graphs for the row-LU and column-LU factorizations, respectively. The blue dashed directed edges in Figure 7.2b and the green dashed directed edges in Figure 7.2c correspond to the fill-ins. All edges in Figures 7.2b and 7.2c show dependencies. For example, in the column-LU factorization, the second column depends on the first one, as the values in the first column of L are needed while computing the second column. This is shown in Figure 7.2c with a directed edge from the first vertex to the second one.

We now consider postorderings of the elimination tree T(A). Let T[k] denote the set of vertices in the subtree of T(A) rooted at k. For a postordering, the order of the siblings is not specified in general. A postordering is called upper bordered block triangular (BBT) if the topological order among the sibling vertices is respected; that is, a vertex k is numbered before a sibling vertex ℓ if there is an edge in G(A) directed from T[k] to T[ℓ]. Similarly, we call a postordering a lower BBT if the sibling vertices are ordered in reverse topological order. As an example, the initial order of the matrix given in Figure 7.1 is neither upper nor lower BBT; however, the order 8, 6, 1, 2, 3, 5, 4, 7, 9 would be an upper BBT postordering (see Figure 7.1c). The following theorem shows the relation between a BBT postordering and the task dependencies in LU factorizations.

Theorem 7.1 ([81]). Assuming the edges are directed from child to parent, T(A) is the transitive reduction of G(L^T) and of G(U) when A is upper and lower BBT ordered, respectively.

We follow this theorem with a corollary which provides the motivation for reducing the elimination tree height.

Corollary 7.1. The critical path length of the task dependency graph is equal to the elimination tree height h(T(A)) for row-LU factorization when A is upper BBT ordered, and for column-LU factorization when A is lower BBT ordered.

[Figure 7.3: The filled matrix of the matrix A = LU in BBT form (a), and the task dependency graph G(L^T) for row-LU (b).]

Figure 7.3 demonstrates the discussed points on the matrix obtained by permuting the A of Figure 7.1a with the upper BBT postordering 8, 6, 1, 2, 3, 5, 4, 7, 9. Figures 7.3a and 7.3b show the filled matrix L + U of this permuted matrix and the corresponding task dependency graph G(L^T) for row-LU factorization, respectively. As seen in Figure 7.3b, there are only tree (solid) and back (dashed) edges with respect to the elimination tree T(A), due to the fact that T(A) is the transitive reduction of G(L^T) (see Theorem 7.1). In the figure, the blue edges refer to fill-in nonzeros. As seen in the figure, there is a path from each vertex to the root, and a critical path 1 → 3 → 5 → 7 → 9 has length 5, which coincides with the height of T(A).

7.3 Reducing the elimination tree height

We propose a top-down approach to reorder a given matrix leading to a short elimination tree. The main lines of the proposed approach form a generalization of the state of the art methods used in Cholesky factorization (which are based on nested dissection [94]). We give the algorithms and discussions for the upper BBT form; they can be easily adapted for the lower BBT form. The proofs of Proposition 7.1 and Theorems 7.2–7.4 are omitted from the presentation and can be found in the original paper [136].


7.3.1 Theoretical basis

Let A be an n × n sparse unsymmetric irreducible matrix and T(A) be its elimination tree. In this case, a BBT decomposition of A can be represented as a permutation of A with a permutation matrix P such that

              | A_11  A_12  ···  A_1K  A_1B |
              |       A_22  ···  A_2K  A_2B |
    A_BBT  =  |              ⋱    ⋮      ⋮  |        (7.1)
              |                  A_KK  A_KB |
              | A_B1  A_B2  ···  A_BK  A_BB |

where A_BBT = PAP^T, the number of diagonal blocks is K > 1, and A_kk is irreducible for each 1 ≤ k ≤ K. The border A_B∗ is called minimal if there is no permutation matrix P′ ≠ P such that P′AP′^T is a BBT decomposition and A′_B∗ ⊂ A_B∗. That is, the border A_B∗ is minimal if we cannot remove columns from A_∗B and the corresponding rows from A_B∗, permute them together to a diagonal block, and still have a BBT form. We give the following proposition, aimed to be used for Theorem 7.2.

Proposition 7.1. Let A be in upper BBT form and the border A_B∗ be minimal. The elimination digraph G_κ(A) is complete, where κ = Σ_{k=1}^{K} |A_kk|.

The following theorem provides a basis for reducing the elimination tree height h(T(A)).

Theorem 7.2. Let A be in upper BBT form and the border A_B∗ be minimal. Then

    h(T(A)) = |A_B∗| + max_{1≤k≤K} h(T(A_kk)).

The theorem says that the critical path length in the parallel LU factorization methods summarized in Section 7.2 can be expressed recursively in terms of the border size and the largest critical path length among the diagonal blocks. This holds for the row-LU scheme when the matrix A is in an upper BBT form, and for the column-LU scheme when A is in a lower BBT form. Hence, having a small border and diagonal blocks of similar sizes is likely to lead to a reduced critical path length.

7.3.2 Recursive approach: Permute

We propose a recursive approach to reorder a given matrix so that the height of its elimination tree is reduced. Algorithm 15 gives an overview of the solution framework. The main procedure Permute takes an irreducible unsymmetric matrix A as its input. It calls PermuteBBT to decompose A into a BBT form A_BBT. Upon obtaining such a decomposition, the main procedure calls itself on each diagonal block A_kk recursively. Whenever the matrix becomes sufficiently small (its size gets smaller than τ), the procedure PermuteBT is called in order to compute a fine BBT decomposition in which the diagonal blocks are of unit size, here to be referred to as a bordered triangular (BT) decomposition, and the recursion terminates.

Algorithm 15: Permute(A)
1: if |A| < τ then                      ▷ recursion terminates
2:     PermuteBT(A)
3: else
4:     A_BBT ← PermuteBBT(A)            ▷ as given in (7.1)
5:     for k ← 1 to K do
6:         Permute(A_kk)                ▷ subblocks of A_BBT

7.3.3 BBT decomposition: PermuteBBT

Permuting A into a BBT form translates into finding a strong (vertex) separator of G(A), where the strong separator itself corresponds to the border A_B∗. In view of Theorem 7.2, we are particularly interested in finding minimal borders.

Let G = (V, E) be a strongly connected directed graph. A strong separator (also called a directed vertex separator [138]) is defined as a vertex set S ⊂ V such that G − S is not strongly connected. Moreover, S is said to be minimal if G − S′ is strongly connected for any proper subset S′ ⊂ S.

The algorithm that we propose to find a strong separator is based on bipartitioning of directed graphs. In the case of undirected graphs, there are typically two kinds of graph partitioning methods, which differ by their separators: edge separators and vertex separators. An edge (or vertex) separator is a set of edges (or vertices) whose removal leaves the graph disconnected.

A BBT decomposition of a matrix may have three or more subblocks (as seen in (7.1)). However, the algorithms we use to find a strong separator utilize 2-way partitioning, which in turn constructs two parts, each of which may potentially have several components. For the sake of simplicity in the presentation, we assume that both parts are strongly connected.

We introduce some notation in order to ease the discussion. For a pair of vertex subsets (U, W) of the directed graph G, we write U → W if there may be some edges from U to W but there is no edge from W to U. Besides, U ↔ W and U ↮ W are used in the natural way; that is, U ↔ W implies that there may be edges in both directions, whereas U ↮ W refers to the absence of any edge between U and W.

We cast our problem of finding strong separators as follows. For a given directed graph G, find a vertex partition Π∗ = {V_1, V_2, S}, where S refers to a strong separator and V_1, V_2 refer to vertex parts, so that S has small cardinality, the part sizes are close to each other, and either V_1 → V_2 or V_2 → V_1.

[Figure 7.4: A sample 2-way partition Π_ES = {V_1, V_2} by an edge separator (a), the bipartite graphs and covers for the directed cuts E_{2→1} and E_{1→2} (b), and the matrix view of Π_ES (c).]

7.3.4 Edge-separator-based method

The first step of the method is to obtain a 2-way partition Π_ES = {V_1, V_2} of G by an edge separator. We build two strong separators upon Π_ES and choose the one with the smaller cardinality.

For an edge separator Π_ES = {V_1, V_2}, the directed cut E_{i→j} is defined as the set of edges directed from V_i to V_j, i.e., E_{i→j} = {(u, v) ∈ E : u ∈ V_i, v ∈ V_j}, and we say that a vertex set S covers E_{i→j} if for any edge (u, v) ∈ E_{i→j}, either u ∈ S or v ∈ S, for i ≠ j ∈ {1, 2}.

Initially, we have V_1 ↔ V_2. Our goal is to find a subset of vertices S ⊂ V so that either V̄_1 → V̄_2 or V̄_2 → V̄_1, where V̄_1 and V̄_2 are the sets of remaining vertices in V_1 and V_2 after the removal of the vertices in S, that is, V̄_1 = V_1 − S and V̄_2 = V_2 − S. The following theorem serves the purpose of finding such a subset S.

Theorem 7.3. Let G = (V, E) be a directed graph, Π_ES = {V_1, V_2} be an edge separator of G, and S ⊆ V. Then V̄_i → V̄_j if and only if S covers E_{j→i}, for i ≠ j ∈ {1, 2}.

The problem of finding a strong separator S can thus be encoded as finding a minimum vertex cover in the bipartite graph whose edge set is populated only by the edges of E_{j→i}. Algorithm 16 gives the pseudocode. As seen in the algorithm, we find two separators, S_{1→2} and S_{2→1}, which cover E_{2→1} and E_{1→2} and result in V̄_1 → V̄_2 and V̄_2 → V̄_1, respectively. At the end, we take the strong separator with the smaller cardinality.
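The cover step used inside Algorithm 16 below can be realized with any bipartite matching code via König's theorem (a minimum vertex cover of a bipartite graph is obtained from a maximum matching). The following is a sketch of that step assuming networkx; the vertex and function names are ours.

    import networkx as nx
    from networkx.algorithms import bipartite

    def cover_directed_cut(edges, part, i, j):
        # Minimum vertex set covering E_{i->j} = {(u, v) in E : part[u] == i, part[v] == j}.
        # Tail and head copies of a vertex are kept distinct so the graph is bipartite.
        B = nx.Graph()
        tails = set()
        for (u, v) in edges:
            if part[u] == i and part[v] == j:
                B.add_edge(("tail", u), ("head", v))
                tails.add(("tail", u))
        if B.number_of_edges() == 0:
            return set()
        matching = bipartite.hopcroft_karp_matching(B, top_nodes=tails)
        cover = bipartite.to_vertex_cover(B, matching, top_nodes=tails)
        return {name for (_, name) in cover}       # back to original vertex labels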

Algorithm 16: FindStrongVertexSeparatorES(G)
1: Π_ES ← GraphBipartiteES(G)                  ▷ Π_ES = {V_1, V_2}
2: Π_{1→2} ← FindMinCover(G, E_{2→1})
3: Π_{2→1} ← FindMinCover(G, E_{1→2})
4: if |S_{1→2}| < |S_{2→1}| then
5:     return Π_{1→2} = {V̄_1, V̄_2, S_{1→2}}      ▷ V̄_1 → V̄_2
6: else
7:     return Π_{2→1} = {V̄_1, V̄_2, S_{2→1}}      ▷ V̄_2 → V̄_1

In this method, how the edge separator is found matters. The objective for Π_ES is to minimize the cutsize Ψ(Π_ES), which is typically defined over the cut edges as follows:

    Ψ_es(Π_ES) = |E_c| = |{(u, v) ∈ E : π(u) ≠ π(v)}| ,        (7.2)

where u ∈ V_{π(u)} and v ∈ V_{π(v)}, and E_c refers to the set of cut edges. We note that the use of the above metric in Algorithm 16 is a generalization of a method used for symmetric matrices [161]. However, Ψ_es(Π_ES) is loosely related to the size of the cover found in order to build the strong separator. A more suitable metric would be the number of vertices from which a cut edge emanates, or the number of vertices to which a cut edge is directed. Such a cutsize metric can be formalized as follows:

    Ψ_cn(Π_ES) = |V_c| = |{v ∈ V : ∃(u, v) ∈ E_c}| .        (7.3)

The following theorem shows the close relation between Ψ_cn(Π_ES) and the size of the strong separator S.

Theorem 7.4. Let G = (V, E) be a directed graph, Π_ES = {V_1, V_2} be an edge separator of G, and Π∗ = {V̄_1, V̄_2, S} be the strong separator built upon Π_ES. Then |S| ≤ Ψ_cn(Π_ES)/2.

Apart from the theorem, we also note that optimizing (7.3) is likely to yield bipartite graphs that are easier to cover than those resulting from optimizing (7.2). This is because of the fact that all edges emanating from a single vertex or from many different vertices are considered the same with (7.2), whereas those emanating from a single vertex are preferred in (7.3).

We note that the cutsize metric given in (7.3) can be modeled using hypergraphs. The hypergraph that models this cutsize metric (called the column-net hypergraph model [39]) has the same vertex set as the directed graph G = (V, E), and a hyperedge for each vertex v ∈ V that connects all u ∈ V such that (u, v) ∈ E. Any partition of the vertices of this hypergraph that minimizes the number of hyperedges in the cut induces a vertex partition on G with an edge separator that minimizes the cutsize according to the metric (7.3). A similar statement holds for another hypergraph constructed by reversing the edge directions (called the row-net model [39]).

[Figure 7.5: Π_{2→1}, the strong separator built upon S_{2→1}, the cover of E_{1→2} given in Figure 7.4b (a); the matrix A_BBT obtained by Π_{2→1} (b); and the corresponding elimination tree T(A_BBT) (c).]

Figures 7.4 and 7.5 give an example of finding a strong separator based on an edge separator. In Figure 7.4a, the red curve marks the edge separator Π_ES = {V_1, V_2} such that V_1 = {4, 6, 7, 8, 9} (the green vertices) and V_2 = {1, 2, 3, 5} (the blue vertices). The cut edges of E_{2→1} = {(2, 7), (5, 4), (5, 9)} and E_{1→2} = {(6, 1), (6, 3), (7, 1)} are shown in blue and green, respectively. We see two covers for the bipartite graphs corresponding to each of the directed cuts E_{2→1} and E_{1→2} in Figure 7.4b. As seen in Figure 7.4c, these directed cuts correspond to the nonzeros of the off-diagonal blocks in the matrix view. Figure 7.5a gives the strong separator Π_{2→1} built upon Π_ES by the cover S_{2→1} = {1, 6} of the bipartite graph corresponding to E_{1→2}, which is given on the right of Figure 7.4b. Figures 7.5b and 7.5c depict the matrix view of the strong separator Π_{2→1} and the corresponding elimination tree, respectively. It is worth mentioning that, in the matrix view, V_2 precedes V_1, since we use Π_{2→1} instead of Π_{1→2}. As seen in Figure 7.5c, since the permuted matrix (in Figure 7.5b) has an upper BBT postordering, all cross edges of the elimination tree are directed from left to right.

7.3.5 Vertex-separator-based method

A vertex separator is a strong separator restricted so that V_1 ↮ V_2. This method is based on refining the separator so that either V_1 → V_2 or V_2 → V_1. The vertex separator can be viewed as a 3-way vertex partition Π_VS = {V_1, V_2, S}. As in the previous method based on edge separators, we build two strong separators upon Π_VS and select the one having the smaller separator.

The criterion used to find an initial vertex separator can be formalized as

    Ψ_vs(Π_VS) = |S| .        (7.4)

Algorithm 17 gives the overall process to obtain a strong separator. This algorithm follows Algorithm 16 closely. Here, the procedure that refines the initial vertex separator, namely RefineSeparator, works as follows. It visits the separator vertices once and moves the vertex at hand, if possible, to V_1 or V_2, so that V_i → V_j holds after the movement, where i → j is specified as either 1 → 2 or 2 → 1.

Algorithm 17: FindStrongVertexSeparatorVS(G)
1: Π_VS ← GraphBipartiteVS(G)                  ▷ Π_VS = {V_1, V_2, S}
2: Π_{1→2} ← RefineSeparator(G, Π_VS, 1→2)
3: Π_{2→1} ← RefineSeparator(G, Π_VS, 2→1)
4: if |S_{1→2}| < |S_{2→1}| then
5:     return Π_{1→2} = {V_1, V_2, S_{1→2}}      ▷ V_1 → V_2
6: else
7:     return Π_{2→1} = {V_1, V_2, S_{2→1}}      ▷ V_2 → V_1

7.3.6 BT decomposition: PermuteBT

The problem of permuting A into a BT form can be cast as finding a feedback vertex set, which is defined as a subset of vertices whose removal makes a directed graph acyclic. In the matrix view, a feedback vertex set corresponds to the border A_B∗ when the matrix is permuted into a BBT form (7.1) in which each A_ii, for i = 1, ..., K, is of unit size.

The problem of finding a feedback vertex set of size less than a given value is NP-complete [126], but fixed-parameter tractable [44]. In the literature, there are greedy algorithms that perform well in practice [152, 184]. In this work, we adopt the algorithm proposed by Levy and Low [152].

For example, in Figure 7.5 we observe that the sets {5} and {8} are used as the feedback vertex sets of the two diagonal blocks, which is why those rows/columns are permuted last within their corresponding subblocks.
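For illustration only, the sketch below shows a simple greedy feedback vertex set heuristic (this is not the Levy–Low algorithm [152] adopted in this work): while the digraph contains a cycle, remove the cycle vertex of largest total degree. It assumes networkx and a DiGraph input.

    import networkx as nx

    def greedy_feedback_vertex_set(G):
        # Returns a list of vertices whose removal makes the DiGraph G acyclic.
        H = G.copy()
        fvs = []
        while True:
            try:
                cycle = nx.find_cycle(H)           # list of (u, v) edges on some cycle
            except nx.NetworkXNoCycle:
                return fvs
            v = max((u for u, _ in cycle),
                    key=lambda u: H.in_degree(u) + H.out_degree(u))
            fvs.append(v)
            H.remove_node(v)

    # Example: a single 3-cycle needs one vertex removed.
    # greedy_feedback_vertex_set(nx.DiGraph([(1, 2), (2, 3), (3, 1)]))  ->  one vertex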

7.4 Experiments

We investigate the performance of the proposed heuristic, referred to here as BBT, in terms of the elimination tree height and the ordering time. We have three variants of BBT: (i) BBT-es, based on minimizing the edge cut (7.2); (ii) BBT-cn, based on minimizing the hyperedge cut (7.3); and (iii) BBT-vs, based on minimizing the vertex separator (7.4). As a baseline, we used MeTiS [130] on the symmetrized matrix A + A^T. MeTiS uses state of the art methods to order a symmetric matrix and leads to short elimination trees; its use in the unsymmetric case on A + A^T is a common practice. BBT is implemented in C and compiled with mex of MATLAB, which uses gcc 4.4.5. We used the MeTiS library for graph partitioning according to the cutsize metrics (7.2) and (7.4). For the edge cut metric (7.2), we give MeTiS the graph of A + A^T, where the weight of the edge {i, j} is 2 (both a_ij, a_ji ≠ 0) or 1 (either a_ij ≠ 0 or a_ji ≠ 0, but not both). For the vertex separator metric (7.4), we give the graph of A + A^T (with no weights) to the corresponding MeTiS procedure. We implemented the cutsize metric given in (7.3) using PaToH [40] (with the default settings) for hypergraph partitioning using the column-net model (we did not investigate the use of the row-net model). We performed the experiments on an AMD Opteron Processor 8356 with 32 GB of RAM.

In the experiments, for each matrix we detected the dense rows and columns, where a row/column is deemed to be dense if it has at least 10√n nonzeros. We applied MeTiS and BBT on the submatrix of non-dense rows/columns, and produced the final ordering by appending the dense rows/columns after the permutations obtained for the submatrices. The performance results are computed as the median of eleven runs.

Recall from Section 7.3.2 that whenever the size of the input matrix is smaller than the parameter τ, the recursion terminates in the BBT variants and a feedback-vertex-set-based BT decomposition is applied. We first investigate the effect of τ in order to suggest a concrete approach. For this purpose, we build a dataset of matrices, called UFL below. The matrices in UFL are from the SuiteSparse Matrix Collection [59] and are chosen with the following properties: square, 5K ≤ n ≤ 100K, nnz ≤ 1M, real, and not of type Graph. These criteria enable an automatic selection of matrices from different application domains without having to specify the matrices individually. Other criteria could be used; ours help to select a large set of matrices, each of which is not too small and not too large to be handled easily within MATLAB. We exclude matrices recorded as Graph in the SuiteSparse collection, because most of these matrices have nonzeros from a small set of integers (for example, {−1, 1}) and are reducible. We further detect matrices with the same nonzero pattern in the UFL dataset and keep only one matrix per unique nonzero pattern. We then preprocess the matrices by applying MC64 [72] in order to permute the columns so that the permuted matrix has a maximum diagonal product. Then, we identify the irreducible blocks using the Dulmage-Mendelsohn decomposition [77], permute the matrices to block diagonal form, and delete all entries that lie in the off-diagonal blocks. This type of preprocessing is common in similar studies [83] and corresponds to preprocessing for numerics and reducibility. After this preprocessing, we discard the matrices whose total (remaining) number of nonzeros is less than twice their size, or whose pattern symmetry is greater than 90%, where the pattern symmetry score is defined as (nnz(A) − n)/(nnz(A + A^T) − n) − 0.5. We ended up with 99 matrices in the UFL dataset.

In Table 7.1 we summarize the performance of the BBT variants, normalized to that of MeTiS, on the UFL dataset for the recursion termination parameter τ ∈ {3, 50, 250}. The table presents the geometric means of the performance results over all matrices of the dataset.

           Tree Height (h(A))        Ordering Time
    τ      -es    -cn    -vs        -es    -cn    -vs
    3     0.81   0.69   0.68       1.68   3.53   2.58
   50     0.84   0.71   0.72       1.26   2.71   1.91
  250     0.93   0.82   0.86       1.12   2.12   1.43

Table 7.1: Statistical indicators of the performance of the BBT variants on the UFL dataset, normalized to that of MeTiS, with the recursion termination parameter τ ∈ {3, 50, 250}.

[Figure 7.6: BBT-vs / MeTiS performance on the UFL dataset: (a) normalized tree height h(A) and (b) normalized ordering time, both plotted against the pattern symmetry. Dashed green lines mark the geometric mean of the ratios.]

As seen in the table, for smaller values of τ, all BBT variants achieve shorter elimination trees at a cost of increased time to order the matrix. This is because of the fast, but probably not of very high quality, local ordering heuristic: as discussed before, the heuristic here does not directly consider the height of the subtree corresponding to the submatrix as an optimization goal. Other τ values could also be tried, but we find τ = 50 to be a suitable choice in general, as it performs close to τ = 3 in quality and close to τ = 250 in efficiency.

From Table 7.1 we observe that the BBT-vs method performs in the middle of the other two BBT variants, but close to BBT-cn, in terms of the tree height. All three variants improve the height of the tree with respect to MeTiS. Specifically, at τ = 50, BBT-es, BBT-cn, and BBT-vs obtain improvements of 16%, 29%, and 28%, respectively. The graph based methods BBT-es and BBT-vs run slower than MeTiS, as they contain the additional overheads of obtaining the covers and refining the separators of A + A^T for A. For example, at τ = 50, the overheads result in 26% and 171% longer ordering time for BBT-es and BBT-vs, respectively. As expected, the hypergraph based method BBT-cn is much slower than the others (for example, 112% longer ordering time with respect to MeTiS with τ = 50). Since we identify BBT-vs as fast and of high quality (it is close to the best BBT variant in terms of the tree height and the fastest one in terms of the run time), we display its individual performance results in detail with τ = 50 in Figure 7.6. In Figures 7.6a and 7.6b, each individual point represents the ratio BBT-vs / MeTiS in terms of the tree height and the ordering time, respectively. As seen in Figure 7.6a, BBT-vs reduces the elimination tree height on 85 out of the 99 matrices in the UFL dataset, and achieves a reduction of 28% in terms of the geometric mean with respect to MeTiS (the green dashed line in the figure represents the geometric mean). We observe that the pattern symmetry and the reduction in the elimination tree height highly correlate after a certain pattern symmetry score. More precisely, the correlation coefficient between the pattern symmetry and the normalized tree height is 0.6187 for those matrices with pattern symmetry greater than 20%. This is of course in agreement with the fact that LU factorization methods exploiting the unsymmetric pattern have much to gain, relative to the alternatives, on matrices with a lower pattern symmetry score, and less to gain otherwise. On the other hand, as seen in Figure 7.6b, BBT-vs performs consistently slower than MeTiS, with an average of 91% slowdown.

The fill-in generally increases [83] when a matrix is reordered into a BBT form, with respect to the original ordering using the elimination tree. We noticed that BBT-vs almost always increases the number of nonzeros in L (by 157% with respect to MeTiS in a set of matrices). However, the number of nonzeros in U almost always decreases (by 15% with respect to MeTiS in the same dataset). This observation may be exploited during Gaussian elimination, where L is not stored.

7.5 Summary, further notes, and references

We investigated the elimination tree model for unsymmetric matrices as a means of capturing dependencies among the tasks in row-LU and column-LU factorization schemes. Specifically, we focused on permuting a given unsymmetric matrix to obtain an elimination tree with reduced height, in order to expose a higher degree of parallelism by minimizing the critical path length in parallel LU factorization schemes. Based on the theoretical findings, we proposed a heuristic which orders a given matrix to a bordered block diagonal form with a small border size and blocks of similar sizes, and then locally orders each block. We presented three variants of the proposed heuristic. These three variants achieved, on average, 24%, 31%, and 35% improvements with respect to a common method of using MeTiS (a state of the art tool used for the Cholesky factorization) on a small set of matrices. On a larger set of matrices, the performance improvements were 16%, 28%, and 29%. The best performing variant is about 2.71 times slower than MeTiS on the larger set, while the others are slower by factors of 1.26 and 1.91.


Although numerical methods implementing the LU factorization with the help of the standard elimination tree are well developed, those implementing the LU factorization using the elimination tree discussed in this chapter are lacking. A potential implementation will have to handle many scheduling issues, and therefore the use of a runtime system seems adequate. On the combinatorial side, we believe that there is a need for a direct method to find strong separators, which would give better performance in terms of the quality and the run time.

Chapter 8

Sparse tensor ordering

In this chapter, we investigate the tensor ordering problem described in Section 1.5. The formal problem definition is repeated below for convenience.

Tensor ordering: Given a sparse tensor X, find an ordering of the indices in each dimension so that the nonzeros of X have indices close to each other.

We propose two ordering algorithms for this problem. The first proposed heuristic, BFS-MCS, is a breadth first search (BFS)-like approach based on the maximum cardinality search family, and it works on the hypergraph representation of the given tensor. The second proposed heuristic, Lexi-Order, is an extension of the doubly lexical ordering of matrices to tensors. We show the effects of these schemes on the MTTKRP performance when the tensors are stored in three existing sparse tensor formats: coordinate (COO), compressed sparse fiber (CSF), and hierarchical coordinate (HiCOO). The associated paper [156] has further contributions with respect to parallelization, and includes results on parallel implementations of MTTKRP with the three formats, as well as runs with Candecomp/Parafac decomposition algorithms.

8.1 Three sparse tensor storage formats

We consider three state-of-the-art tensor formats (COO, CSF, and HiCOO), which are all designed for general, unstructured sparse tensors.

Coordinate format (COO)

The coordinate (COO) format is the simplest, yet arguably the most popular, format by far. It stores each nonzero value along with all of its position indices, as shown in Figure 8.1a; i, j, k are the indices (inds) of the nonzeros stored in the val array.



[Figure 8.1: Three common formats for sparse tensors, illustrated on an example tensor (the example is originally from [155]): (a) COO stores the index triples (i, j, k) and val per nonzero; (b) CSF stores the index tree via cptrs/cinds and val; (c) HiCOO stores block pointers bptr, block indices (bi, bj, bk), element indices (ei, ej, ek), and val.]

Compressed sparse fibers (CSF)

Compressed Sparse Fiber (CSF) is a hierarchical, fiber-centric format that effectively generalizes the CSR matrix format to tensors. An example of its representation appears in Figure 8.1b. Conceptually, CSF organizes the nonzero indices into a tree: each level corresponds to a tensor mode, and each nonzero is a path from the root to a leaf. The indices of CSF are stored in cinds, with the pointers stored in cptrs to indicate the locations of the nonzeros at the next level. Since cptrs needs to cover the range up to nnz, we use β_int or β_long accordingly for differently sized tensors.

Hierarchical coordinate format (HiCOO)

The Hierarchical Coordinate (HiCOO) [155] format derives from the COO format but improves upon it by compressing the indices in units of sparse tensor blocks. HiCOO stores a sparse tensor in a sparse-blocked pattern with a pre-specified block size B, meaning in B × · · · × B blocks. It represents every block by compactly storing its nonzero triples using fewer bits. Figure 8.1c shows the example tensor given 2 × 2 × 2 blocks (B = 2). For a third-order tensor, bi, bj, bk are block indices indexing tensor blocks; ei, ej, ek are element indices indexing nonzeros within a tensor block. A bptr array stores the pointers to each block's beginning location, and val saves all the nonzero values. The block ratio αb of a HiCOO representation of a tensor is defined as the number of blocks divided by the number of nonzeros in the tensor, and cb denotes the average slice size per tensor block. These two key parameters describe the effectiveness of storing a tensor in the HiCOO format.
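A minimal sketch of how the block ratio αb could be computed from a COO index array, assuming the same block size B in every mode (the function name and interface are illustrative, not from the actual implementation):

import numpy as np

def block_ratio(inds, B=128):
    # inds: an (nnz x N) integer array, one row per nonzero.
    # A nonzero belongs to the block obtained by integer-dividing its indices by B;
    # alpha_b is the number of distinct nonempty blocks over the number of nonzeros.
    blocks = {tuple(row // B) for row in np.asarray(inds)}
    return len(blocks) / len(inds)

# Example: four nonzeros of a third-order tensor falling into two blocks.
inds = np.array([[0, 0, 0], [0, 1, 1], [129, 0, 0], [129, 1, 1]])
print(block_ratio(inds, B=128))   # 2 blocks / 4 nonzeros = 0.5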

Comparing the three tensor formats: HiCOO and COO treat every mode equally and do not assume any mode order; these preserve the mode-generic orientation [155]. CSF has a strong mode-specificity, since the tree structure implies a mode ordering for the enumeration of nonzeros. Besides, compared to the COO format, HiCOO and CSF save storage space and memory footprint, and generally achieve higher performance.

[Figure 8.2 shows (a) a sparse tensor with four nonzeros a = (i1, j1, k1), b = (i1, j2, k1), c = (i2, j2, k1), d = (i2, j2, k2), and (b) the associated hypergraph on the vertices i1, i2, j1, j2, k1, k2, with one hyperedge per nonzero.]

Figure 8.2: A sparse tensor and the associated hypergraph.

8.2 Problem definition and solutions

Our objective is to improve the performance of Mttkrp, and therefore of CPD algorithms, based on the three tensor storage formats described above. We will achieve this objective by ordering (or relabeling) the indices in one or more modes of the input tensor so as to improve the data locality in the tensor and the factor matrices of an Mttkrp operation. Take, for example, two nonzeros (i2, j2, k1) and (i2, j2, k2) of the third-order tensor in Figure 8.2. Relabeling k1 and k2 to two other indices close to each other will potentially improve cache hits for both the tensor and the corresponding rows of the factor matrices in the Mttkrp operation. Besides, it will also influence the tensor storage for some formats (the analysis appears in Section 8.2.3).

We propose two ordering heuristics. The aim of the heuristics is to arrange the nonzeros close to each other in all modes. If we were to look at matrices, this would correspond to reordering the rows and columns so that all nonzeros are clustered around the diagonal. This way, nonzeros in a row or column would be close to each other, and any blocking (by imposing fixed sized blocks) would have nonzero blocks only around the diagonal. The proposed heuristics are based on these observations and try to obtain a similar behavior for tensors. The output of a reordering algorithm is a set of permutations, one per mode, which are used to relabel the tensor indices.
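For concreteness, a small sketch of how such per-mode permutations could be applied to relabel the indices of a COO tensor (illustrative names; perms[n][old] = new is the assumed convention):

import numpy as np

def relabel(inds, perms):
    # inds: (nnz x N) integer index array; perms: one permutation per mode.
    out = np.empty_like(inds)
    for n, perm in enumerate(perms):
        out[:, n] = np.asarray(perm)[inds[:, n]]   # map old index -> new index
    return out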

8.2.1 BFS-MCS

BFS-MCS is a breadth first search (BFS)-like heuristic approach based on the maximum cardinality search family [207]. We first construct the d-partite, d-uniform hypergraph for a sparse tensor as described in Section 1.5. Recall that in this hypergraph, the vertices are the tensor indices in all modes, and hyperedges represent its nonzero entries. Figure 8.2 shows the hypergraph associated with a sparse tensor; here, the vertices are blank circles, and hyperedges are represented by grouping vertices.

For an Nth-order sparse tensor, we need to find the permutations for N modes. We determine a permutation for a given mode n (perm_n) of a tensor X ∈ R^{I1×···×IN} as follows. Suppose some of the indices of mode n are already ordered. Then, BFS-MCS picks the next to-be-ordered index as the one with the strongest connection to the currently ordered index set. In the case of ties, it selects the index with the smallest number of nonzeros in the corresponding sub-tensor. Intuitively, a stronger connection represents more common indices in the modes other than n among an unordered vertex and the already ordered ones. This means more data from the factor matrices can be reused in Mttkrp, if found in cache.

Algorithm 18: BFS-MCS ordering based on maximum cardinality search for a given mode

Data: An Nth-order sparse tensor X ∈ R^{I1×···×IN}, hypergraph G = (V, E), mode n
Result: Permutation perm_n

1  wp(i) ← 0, for i = 1, ..., In
2  ws(i) ← −|X(..., i, ...)|        ▷ number of nonzeros with i in mode n
3  Build a max-heap Hn for the mode-n indices, with wp and ws as the primary and secondary keys, respectively
4  mV(i) ← 0, for i = 1, ..., In
5  for in = 1, ..., In do
6      v_n^(0) ← GetHeapMax(Hn)
7      perm_n(v_n^(0)) ← in
8      mV(v_n^(0)) ← 1
9      for en ∈ hyperedges of v_n^(0) do
10         for v ∈ en, v is not in mode n and mV(v) = 0 do
11             mV(v) ← 1
12             for e ∈ hyperedges of v and e ≠ en do
13                 vn ← vertex in mode n of e
14                 if inHeap(vn) then
15                     wp(vn) ← wp(vn) + 1
16                     heapUpdateKey(Hn, wp(vn))
17 return perm_n

The process is implemented for all mode-n indices by maintaining a max-heap with two keys. The primary key of an index is the number of connections to the currently ordered indices. The secondary key is the number of nonzeros in the corresponding (N−1)th-order sub-tensor, e.g., X(..., i_n, ...), where smaller secondary key values signify higher priority. The secondary keys are static. Algorithm 18 gives the pseudocode of BFS-MCS. The heap Hn is initially constructed (Line 3) according to the secondary keys of the mode-n indices. For each mode-n index (e.g., i_n) of this max-heap, BFS-MCS traverses all connected tensor indices in the modes except n, calculates the number of connections of the mode-n indices, and then updates the heap using the primary and secondary keys. Let mV denote a size-|V| array that tracks whether a vertex has been visited; its entries are initialized to zeros (unvisited) and changed to 1 once visited. We record a mode-n index v_n^(0) obtained from the max-heap Hn in the permutation array perm_n. The algorithm visits all its hyperedges (i.e., all nonzeros with i_n, Line 9) and their connected and unvisited vertices from the other (N−1) modes (Line 10). For these vertices, it again visits their hyperedges (e in Line 12) and then checks whether the connected vertices in mode n (vn) are in the heap (Line 14). If so, the primary key (connectivity) of vn is increased by 1, and the max-heap Hn is updated. To summarize, this algorithm locates the sub-tensors of the neighbor vertices of v_n^(0) to increase the primary key values of the encountered mode-n indices in Hn.

The BFS-MCS heuristic has low time complexity; we explain it using tensor terms. The innermost heap update operation (Line 15) costs O(log(In)). Each sub-tensor is visited only once, with the help of the marker array mV. For all the sub-tensors in one tensor mode, the heap update is performed only for the nnz nonzeros (hyperedges). Overall, the time complexity of BFS-MCS is O(N nnz log(In)) for an Nth-order tensor with nnz nonzeros, when computing perm_n for mode n.

BFS-MCS does not exactly capture the memory access pattern of an Mttkrp. It treats the contributions from the indices of all modes except n equally in the connectivity, thus the connectivity might not match the actual data reuse. Also, it uses a greedy strategy to determine the next-level vertices, which could miss the optimal global orderings, as is common for greedy heuristics for hard problems.

8.2.2 Lexi-Order

A lexical ordering of an integer vector is the standard dictionary ordering of its elements, defined as follows. Given two equal-length vectors x and y, we say x ≤ y iff either (i) all elements are the same, i.e., x = y, or (ii) there exists an index j such that x(j) < y(j) and x(i) = y(i) for all 0 ≤ i < j. Consider two 0/1 vectors x = (1, 0, 1, 0) and y = (1, 1, 0, 0); we have x ≤ y because x(1) < y(1) and x(i) = y(i) for all 0 ≤ i < 1.
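This is exactly the order realized by the built-in lexicographic comparison of tuples in, e.g., Python, which gives a quick sanity check of the definition:

x = (1, 0, 1, 0)
y = (1, 1, 0, 0)
print(x <= y)   # True: the vectors first differ at position 1, where 0 < 1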

Lubiw [167] associates a vector v of size I · J with an I × J matrix A, where the matrix indices (i, j) are first sorted by i + j and then by j. A doubly lexical ordering of A is an ordering of its rows and columns which makes v lexically the largest. Every real-valued matrix has a doubly lexical ordering, and such an ordering is not unique [167]. Doubly lexical ordering has a number of applications, including the efficient recognition of totally balanced, subtree, and plaid matrices and of (strongly) chordal graphs. To the best of our knowledge, the merits of this ordering for high performance have not been investigated for sparse matrices.

[Figure 8.3 shows (a) the vector comparison (1, 1, 0, 0) > (1, 0, 1, 0) and (b) an original matrix with rows (0 1 0 1), (0 1 0 0), (0 0 1 0), (0 0 0 1) and its ordered form with rows (1 1 0 0), (1 0 0 0), (0 1 0 0), (0 0 1 0).]

Figure 8.3: A doubly lexical ordering of a 0/1 matrix.

We propose an iterative algorithm called matLexiOrder for doubly lexical ordering of matrices, where in an iteration either the rows or the columns are sorted lexically. Lubiw [167, Claim 2.2] shows that exchanging two rows which are lexically out of order improves the lexical ordering of the matrix. By appealing to this result, we see that the matLexiOrder algorithm finds a doubly lexical ordering in a finite number of iterations. This also fits well with another characterization of the doubly lexical ordering [167, 182]: a matrix is in doubly lexical ordering if both the row and column vectors are in non-increasing lexical order, where a row vector is read from left to right and a column vector is read from top to bottom. Figure 8.3 shows a doubly lexical ordering of a sample matrix. As seen in the figure, the row vectors are in non-increasing lexical order, as are the column vectors.

The known doubly lexical ordering algorithms for matrices, by Lubiw [167] and Paige and Tarjan [182], are "direct" ordering methods with a run time of O(nnz log(I + J) + J) and O(nnz + I + J) space for an I × J matrix with nnz nonzeros. We find the time complexity of these algorithms to be too high for our purpose. Furthermore, the data structures are too complex to allow an efficient, generalized implementation for tensors. Since we do not aim to obtain an exact doubly lexical ordering (a close-by ordering will likely suffice to improve the Mttkrp performance), our approach will be faster while being simpler to implement.

We first describe the matLexiOrder algorithm for a sparse matrix; this aims to show the efficiency of our iterative approach compared to others. A partition refinement technique is used to order the rows and columns alternately. Given an ordering of the rows, the columns can be sorted lexically. This is achieved by an order preserving variant of the partition refinement method [182], which is called orderlyRefine. We briefly explain the partition refinement technique for ordering the columns of a matrix. Given an I × J matrix A, all columns are initially put into a single part. Then, A's nonzeros are visited row by row. At a row i, each column part C is split into two parts C1 = C ∩ a(i, :) and C2 = C \ a(i, :), and these two parts replace C in the order C1, C2 (empty sets are discarded). Note that this algorithm keeps the parts in a particular order, which generates an ordering of the columns. orderlyRefine is used to refine all parts that have at least one nonzero in row i in O(|a(i, :)|) time, where |a(i, :)| is the number of nonzeros of row i. Overall, matLexiOrder costs a linear total time of O(nnz + I + J) (for the row and column orderings) per iteration and O(J) space. We also observe that only a small number of iterations will be enough (as will be shown in Section 8.3.5 for tensors), yielding a more storage-efficient algorithm compared to the prior doubly lexical ordering methods [167, 182]. matLexiOrder, in particular the use of the orderlyRefine routine a few times, is sufficient for our needs.
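A minimal sketch of the order preserving partition refinement on the columns of a sparse matrix, with the matrix given as a list of rows, each row being the set of column indices holding a nonzero. This simple version scans whole parts and is therefore not the O(|a(i, :)|)-per-row implementation described above; the names are illustrative:

def orderly_refine(rows, ncols):
    parts = [list(range(ncols))]                 # a single ordered part initially
    for a_i in rows:                             # visit the nonzeros row by row
        a_i = set(a_i)
        new_parts = []
        for part in parts:
            c1 = [c for c in part if c in a_i]        # C1 = C intersected with a(i, :)
            c2 = [c for c in part if c not in a_i]    # C2 = C minus a(i, :)
            new_parts.extend(p for p in (c1, c2) if p)  # keep the order C1, C2
        parts = new_parts
    return [c for part in parts for c in part]   # columns read off in part order

# The matrix of Figure 8.3, rows listed top to bottom by their column indices.
rows = [[1, 3], [1], [2], [3]]
print(orderly_refine(rows, 4))   # [1, 3, 2, 0]: a lexical order of the columns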

Algorithm 19: Lexi-Order for a given mode

Data: An Nth-order sparse tensor X ∈ R^{I1×···×IN}, mode n
Result: Permutation perm_n

   ▷ Sort all nonzeros along all but mode n
1  quickSort(X, coordCmp)
   ▷ Matricize X to X(n)
2  r ← compose(inds([−n]), 1)
3  for m = 1, ..., nnz do
4      c ← inds(n, m)                      ▷ column index of X(n)
5      if coordCmp(X, m, m−1) = 1 then
6          r ← compose(inds([−n]), m)      ▷ row index of X(n)
7      X(n)(r, c) ← val(m)
   ▷ Apply partition refinement to the columns of X(n)
8  perm_n ← orderlyRefine(X(n))
9  return perm_n

   ▷ Comparison function for two nonzeros of X on all indices but those of mode n
10 Function coordCmp(X, m1, m2)
11     for n′ = 1, ..., N with n′ ≠ n do
12         if m1(n′) < m2(n′) then
13             return −1
14         if m1(n′) > m2(n′) then
15             return 1
16     return 0

To order tensors, we propose the Lexi-Order function as an extension of matLexiOrder. The basic idea of Lexi-Order is to determine the permutation of each tensor mode independently, while considering the order in the other modes fixed. Lexi-Order sets the indices of the mode to be ordered as the columns of a matrix and the other indices as the rows, and sorts the columns as described for matrices (with the order preserving partition refinement method). The precise algorithm appears in Algorithm 19, which we also illustrate in Figure 8.4 when applied to mode 1. Given a mode n, Lexi-Order first builds a matricized tensor in Compressed Sparse Row (CSR) sparse matrix format, by a call to quickSort with the comparison function coordCmp and then by partitioning the nonzeros into the row segments (Lines 3–7). The comparison function coordCmp does a lexical comparison of the all-but-mode-n indices, which enables efficient matricization. In other words, sorting the tensor X is the same as building the matricized tensor X(n) by rows in the fixed lexical ordering, where mode n is the column dimension and the remaining modes constitute the rows. In Figure 8.4, the sorting step orders the COO entries by (j, k) tuples, which then serve as the row indices of the matricized CSR representation of X(1). Once the matrix is built, we construct the 0/1 row vectors shown in Figure 8.4 to illustrate its nonzero distribution, on which the orderlyRefine function can be called directly. Apart from the quickSort, the other parts of Lexi-Order are of linear time complexity (linear in terms of the tensor storage). We use OpenMP tasks to parallelize quickSort to accelerate Lexi-Order.

[Figure 8.4 illustrates, for mode 1, the original COO, the quickSort step producing the sorted COO, the matricization into a CSR matrix whose rows are the (j, k) tuples, the 0/1 representation of its nonzero distribution, and the permutation perm_i produced by orderlyRefine.]

Figure 8.4: The steps of Lexi-Order illustrated for mode 1.
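A rough sketch of one Lexi-Order pass for a single mode, reusing the orderly_refine sketch given earlier. This only illustrates the idea and is not the actual implementation (which works in place on CSR structures); here np.unique plays the role of the sort-and-group step:

import numpy as np

def lexi_order_mode(inds, dim_n, n):
    inds = np.asarray(inds)
    other = [m for m in range(inds.shape[1]) if m != n]
    # Group nonzeros by their indices in all modes but n: these groups are the
    # rows of the matricized tensor X(n), in lexical order of the row keys.
    rows, row_ids = np.unique(inds[:, other], axis=0, return_inverse=True)
    row_sets = [[] for _ in range(len(rows))]
    for r, c in zip(np.ravel(row_ids), inds[:, n]):
        row_sets[r].append(int(c))
    order = orderly_refine(row_sets, dim_n)   # mode-n indices in their new order
    perm = np.empty(dim_n, dtype=int)
    perm[order] = np.arange(dim_n)            # perm[old index] = new index
    return perm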

Like the BFS-MCS approach, Lexi-Order finds the permutations for the N modes of an Nth-order sparse tensor. Figure 8.5(a) illustrates the effect of Lexi-Order on an example 4 × 4 × 3 sparse tensor. The original tensor is converted to a HiCOO representation with block size B = 2, consisting of 5 blocks with at most 2 nonzeros per block. After reordering with Lexi-Order, the new HiCOO has 3 nonzero blocks with up to 4 nonzeros per block. Thus, the blocks are denser, which should exhibit better locality behavior. However, this reordering scheme is a heuristic. For example, consider the tensor of Figure 8.1, whose HiCOO representation after reordering is drawn in Figure 8.5(b). Applying Lexi-Order yields a reordered HiCOO representation with 4 blocks, the same as for the input ordering, although the maximum number of nonzeros per block increases to 4. For this tensor, Lexi-Order may not show a big advantage.

[Figure 8.5 shows, side by side, the original COO, the mode permutations produced by Lexi-Order, the reordered COO, and the original and reordered HiCOO representations for (a) a good example and (b) a fair example.]

Figure 8.5: Comparison of HiCOO representations before and after Lexi-Order.

8.2.3 Analysis

We take the Mttkrp operation to analyze the reordering behavior for the COO, CSF, and HiCOO formats. Recall that for a third-order sparse tensor X, Mttkrp multiplies each nonzero entry x_ijk with the R-vector formed by the entry-wise product of the jth row of B and the kth row of C when computing MA. The arithmetic intensity of Mttkrp algorithms on these formats is approximately 1/4 [155]. Thus, Mttkrp can be considered memory-bound on most computer architectures, especially CPU platforms.
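For reference, a straightforward COO Mttkrp sketch for a third-order tensor and the first mode, matching the description above (MA, B, C are dense factor matrices with R columns; this is an illustration, not the ParTI code):

import numpy as np

def mttkrp_coo(inds, vals, B, C, I, R):
    MA = np.zeros((I, R))
    for (i, j, k), v in zip(inds, vals):
        # multiply the nonzero with the entry-wise product of B(j, :) and C(k, :)
        MA[i, :] += v * B[j, :] * C[k, :]
    return MA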

In HiCOO, smaller block ratios (αb) and larger average slice sizes per tensor block (cb) are favorable for good Mttkrp performance. Our reordering algorithms tend to pack nonzeros closely together and obtain denser blocks (larger cb), which potentially generates fewer tensor blocks (smaller αb); thus HiCOO-Mttkrp has fewer memory accesses, more cache hits, and better performance. Moreover, a smaller αb also reduces the tensor memory requirement as a side benefit [155]. That is, for the HiCOO format, a reordering scheme should increase the block density, reduce the number of blocks, and increase cache locality, three related performance metrics. Reordering is more beneficial for the HiCOO format than for the COO and CSF formats.

COO stores all nonzero indices, so relabeling does not make a difference in its storage. The same is also true for CSF: relabeling does not change the CSF's tree structure, while the order of nodes may change. For an Mttkrp with tensors stored in these two formats, the performance gain from reordering comes only from the improved data locality. Though it is hard to carry out a theoretical analysis for them, the potentially better data locality could also bring performance advantages, albeit likely smaller than HiCOO's.


Tensors     Order   Dimensions                #Nnzs   Density

vast        3       165K × 11K × 2            26M     6.9 × 10^-3
nell2       3       12K × 9K × 29K            77M     2.4 × 10^-5
choa        3       712K × 10K × 767          27M     5.0 × 10^-6
darpa       3       22K × 22K × 24M           28M     2.4 × 10^-9
fb-m        3       23M × 23M × 166           100M    1.1 × 10^-9
fb-s        3       39M × 39M × 532           140M    1.7 × 10^-10
flickr      3       320K × 28M × 2M           113M    7.8 × 10^-12
deli        3       533K × 17M × 3M           140M    6.1 × 10^-12
nell1       3       2.9M × 2.1M × 25M         144M    9.1 × 10^-13

crime       4       6K × 24 × 77 × 32         5M      1.5 × 10^-2
uber        4       183 × 24 × 1140 × 1717    3M      3.9 × 10^-4
nips        4       2K × 3K × 14K × 17        3M      1.8 × 10^-6
enron       4       6K × 6K × 244K × 1K       54M     5.5 × 10^-9
flickr4d    4       320K × 28M × 2M × 731     113M    1.1 × 10^-14
deli4d      4       533K × 17M × 3M × 1K      140M    4.3 × 10^-15

Table 8.1: Description of sparse tensors.

8.3 Experiments

Platform. We perform the experiments on a Linux-based Intel Xeon E5-2698 v3 multicore server platform with 32 physical cores distributed over two sockets, each running at 2.3 GHz. We only present results for sequential runs (see the paper [156] for parallel runs). The processor microarchitecture is Haswell, having 32 KiB of L1 data cache and 128 GiB of memory. The code artifact was written in the C language and was compiled using icc 18.0.1.

Dataset. We use the sparse tensors, derived from real-world applications, that appear in Table 8.1, ordered by decreasing nonzero density separately for third- and fourth-order tensors. Apart from choa [188], all tensors are publicly available: FROSTT [203], HaTen2 [121].

Configurations. We report the results under the best configurations of the following parameters for the highest Mttkrp performance with the three formats: (i) the number of reordering iterations and the block size B for the HiCOO format; (ii) tiling or not for CSF. For HiCOO, B = 128 achieves the best results in most cases, and we use five reordering iterations, which will be analyzed in Section 8.3.5. All experiments use an approximate rank of R = 16. We use the total execution time of Mttkrps in all modes for every tensor to calculate the speedup, which is the ratio of the total Mttkrp time on a randomly reordered tensor over that obtained using a specific reordering scheme. All times are averaged over five runs.

8.3.1 COO-Mttkrp with reordering

We show the effect of the two reordering approaches on the sequential COO-Mttkrp from ParTI [154] in Figure 8.6. This COO-Mttkrp is implemented in C using the same algorithm as Tensor Toolbox [18]. For any reordering approach, after applying BFS-MCS, Lexi-Order, or a random reordering to the input tensor, we still sort the tensor in the mode order 1, ..., N. Observe that Lexi-Order improves the sequential COO-Mttkrp performance by 1.00–4.29× (1.79× on average), while BFS-MCS obtains 0.95–1.27× (1.10× on average). Note that Lexi-Order improves the performance of COO-Mttkrp on all tensors. We conclude that this ordering is always helpful for COO-Mttkrp, while the improvements are smaller than those we will see for HiCOO-Mttkrp (Section 8.3.3).

[Figure 8.6 plots, for each tensor (3-D and 4-D), the speedup of Lexi-Order and BFS-MCS over a random ordering.]

Figure 8.6: Reordered COO-Mttkrp speedup over a random reordering implementation.

8.3.2 CSF-Mttkrp with reordering

We show the effect of the two reordering approaches on the sequential CSF-Mttkrp from Splatt v1.1.1 [205] in Figure 8.7. CSF-Mttkrp is set to use all CSF representations (ALLMODE) for the Mttkrps in all modes, and with the tiling option on. Lexi-Order improves the sequential CSF-Mttkrp performance by 0.65–2.33× (1.50× on average); BFS-MCS improves it by 1.00–1.86× (1.22× on average). Both ordering approaches improve the performance of CSF-Mttkrp on average. While BFS-MCS is always helpful in the sequential case, Lexi-Order is unhelpful on only one tensor, crime. The improvements achieved by the two reordering approaches for CSF are smaller than those for the HiCOO and COO formats. We conclude that both reordering methods are helpful for CSF-Mttkrp, but to a lesser extent than for HiCOO and COO based Mttkrp.

[Figure 8.7 plots, for each tensor (3-D and 4-D), the speedup of Lexi-Order and BFS-MCS over a random ordering.]

Figure 8.7: Reordered CSF-Mttkrp speedup over a random reordering implementation.

8.3.3 HiCOO-Mttkrp with reordering

Figure 8.8 shows the speedup of the proposed reordering methods on the sequential HiCOO-Mttkrp. Lexi-Order reordering obtains 0.99–4.14× speedup (2.12× on average), while BFS-MCS reordering obtains 0.99–1.88× speedup (1.34× on average). Lexi-Order and BFS-MCS do not behave as well on fourth-order tensors as on third-order tensors. Tensor flickr4d is constructed from the same data as flickr, with an extra short mode (refer to Table 8.1). Lexi-Order obtains a 4.14× speedup on flickr but only a 3.02× speedup on flickr4d. The same phenomenon is also observed on tensors deli and deli4d. This indicates that it is harder to get good data locality on higher-order tensors, which will be justified with Table 8.2.

[Figure 8.8 plots, for each tensor (3-D and 4-D), the speedup of Lexi-Order and BFS-MCS over a random ordering.]

Figure 8.8: Reordered HiCOO-Mttkrp speedup over a random ordering implementation.

We investigate the two critical parameters of HiCOO [155]: the block ratio (αb) and the average slice size per tensor block (cb). A smaller αb and a larger cb are favorable for good HiCOO-Mttkrp performance. Table 8.2 lists the parameter values for all tensors before and after Lexi-Order, the HiCOO-Mttkrp speedup (as shown in Figure 8.8), and the storage ratio of HiCOO over random ordering. Generally, when both parameters αb and cb are largely improved, we see a good speedup and storage ratio with Lexi-Order. As seen for the same data in different orders, e.g., flickr4d and flickr, the αb and cb values after Lexi-Order are better in the smaller order version. This observation supports the intuition that getting good data locality is harder for higher-order tensors.

Tensors     Random reordering     Lexi-Order         Speedup           Storage
            αb       cb           αb       cb        seq      omp      ratio

vast        0.004    1.758        0.004    1.562     1.01     1.03     0.999
nell2       0.020    0.314        0.008    0.074     1.04     1.04     0.966
choa        0.089    0.057        0.016    0.056     1.16     1.07     0.833
darpa       0.796    0.009        0.018    0.113     3.74     1.50     0.322
fb-m        0.985    0.008        0.086    0.021     3.84     1.21     0.335
fb-s        0.982    0.008        0.099    0.020     3.90     1.23     0.336
flickr      0.999    0.008        0.097    0.025     4.14     3.66     0.277
deli        0.988    0.008        0.501    0.010     2.24     0.83     0.634
nell1       0.998    0.008        0.744    0.009     1.70     0.70     0.812

crime       0.001    37.702       0.001    8.978     0.99     1.42     1.000
uber        0.041    0.469        0.011    0.270     1.00     0.78     0.838
nips        0.016    0.434        0.004    0.435     1.03     1.34     0.921
enron       0.290    0.017        0.045    0.030     1.25     1.36     0.573
flickr4d    0.999    0.008        0.148    0.020     3.02     11.81    0.214
deli4d      0.998    0.008        0.596    0.010     1.76     1.26     0.697

Table 8.2: HiCOO parameters before and after Lexi-Order reordering.

8.3.4 Reordering methods comparison

As seen above, Lexi-Order improves performance more than BFS-MCS in most cases. Compared to the reordering method used in Splatt [206], by setting ALLMODE (identical to the work [206]) for CSF-Mttkrp, BFS-MCS obtains 1.04×, 1.64×, and 1.61× speedups on the tensors nell2, nell1, and deli, respectively, and Lexi-Order obtains 1.04×, 1.70×, and 2.24× speedups. By contrast, the speedups using graph partitioning [206] on these three tensors are 1.06×, 1.11×, and 1.19×, and 1.06×, 1.12×, and 1.24× using hypergraph partitioning [206], respectively. Our BFS-MCS and Lexi-Order schemes both outperform the graph and hypergraph partitionings of [206].

The methods available in the state of the art are based on graph and hypergraph partitioning. Partitioning is a successful approach when the number of partitions is known, whereas the number of blocks in HiCOO is not known ahead of time. In our case, the partitions should also be ordered for better cache reuse. Additionally, partitioners are less effective for tensors than usual, as some dimensions could be very short (creating vertices of very high degree). That is why the proposed ordering-based methods deliver better results.

8.3.5 Effect of the number of iterations in Lexi-Order

Since Lexi-Order is iterative, we evaluate the effect of the number of iterations on the HiCOO-Mttkrp performance. The results appear in Figure 8.9(a), normalized to the run time with 10 iterations. Mttkrp performance on most tensors does not vary much with the number of iterations, except on vast, nell2, uber, and nips. We use 5 iterations to get Mttkrp performance similar to that with 10 iterations, at about half of the overhead (shown in Figure 8.9(b)). However, 3 or fewer iterations already give acceptable performance when users care much about the pre-processing time.


[Figure 8.9 has two panels, each showing normalized values per tensor (3-D and 4-D) for 1, 2, 3, 5, and 10 iterations: (a) sequential HiCOO-Mttkrp behavior; (b) the reordering overhead.]

Figure 8.9: The performance and overhead of different numbers of iterations.

8.4 Summary, further notes, and references

Plenty of recent research has studied the optimization of tensor algorithms [20, 28, 46, 159, 168]. Our work sets itself apart by focusing on reordering to obtain a better nonzero structure. Various reordering methods have been considered for matrix operations [7, 173, 190, 215].

The proposed two heuristics aim to arrange the nonzeros of a given tensor close to each other in all modes. As said before, in the matrix case this would correspond to reordering the rows and columns so that all nonzeros are clustered around the diagonal. Heuristics based on partitioning and ordering have been used in the matrix case for arranging most nonzeros around the diagonal or diagonal blocks. An obvious alternative in the matrix case is the BFS-based reverse Cuthill-McKee (RCM) heuristic. Given a pattern-symmetric matrix A, RCM finds a permutation (ordering) such that the symmetrically permuted matrix PAP^T has its nonzeros clustered around the diagonal. More precisely, RCM tries to reduce the bandwidth, which is defined as max{j − i : a_ij ≠ 0 and j > i}. The RCM heuristic can be applied to pattern-wise unsymmetric, even rectangular, matrices as well. A common approach is to create a (2 × 2)-block matrix for a given m × n matrix A as

    B = [ I_m   A   ]
        [ A^T   I_n ]

where I_m and I_n are the identity matrices of size m × m and n × n, respectively. Since B is now pattern-symmetric, RCM can be run on it to obtain an (m + n) × (m + n) permutation matrix P. The row and column permutations for A can then be obtained from P by respecting the ordering of the first m rows and the last n rows of B.
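A small sketch (assuming SciPy) of this bipartite-to-symmetric construction: form B from a rectangular pattern matrix A and run RCM on it; the per-side orderings are read off the resulting permutation.

import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(6, 4, density=0.3, format='csr')   # stand-in for a sparse pattern matrix
m, n = A.shape
B = sp.bmat([[sp.identity(m), A], [A.T, sp.identity(n)]], format='csr')
p = reverse_cuthill_mckee(B, symmetric_mode=True)
row_perm = [i for i in p if i < m]            # order of the m rows of A
col_perm = [i - m for i in p if i >= m]       # order of the n columns of A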


              min     max      geo. mean

BFS-MCS       0.84    53.56    2.90
Lexi-Order    0.15    4.33     0.99

Table 8.3: Statistical indicators of the number of nonempty blocks obtained by BFS-MCS and Lexi-Order with respect to that of RCM on 774 matrices from the UFL collection.

This approach can be extended to tensors. We discuss the third-order case for simplicity. Let X be an I × J × K tensor. We create a matrix B in such a way that the first I rows/columns correspond to the first-mode indices, the following J rows/columns correspond to the second-mode indices, and the last K rows/columns correspond to the third-mode indices. Then, for each nonzero x_ijk of X, we set b_rc = 1 whenever r and c correspond to any two indices from the set {i, j, k}. This way, B is a symmetric matrix with ones on the main diagonal. Observe that when X is two-dimensional (that is, a matrix), we obtain the matrix B above. We now compare the proposed reordering heuristics with RCM.
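A sketch (assuming SciPy; names are illustrative) of this tensor-to-matrix construction followed by RCM, returning one ordering per mode by respecting the order of the overall permutation:

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

def rcm_on_tensor(inds, dims):
    I, J, K = dims
    offs = [0, I, I + J]                       # offsets of the three modes in B
    rows, cols = [], []
    for (i, j, k) in inds:
        v = [i + offs[0], j + offs[1], k + offs[2]]
        for r in v:                            # connect every pair of indices of the nonzero
            for c in v:
                rows.append(r)
                cols.append(c)
    n = I + J + K
    B = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()
    p = reverse_cuthill_mckee(B, symmetric_mode=True)
    # split p back into per-mode orderings, respecting its order
    return ([x for x in p if x < I],
            [x - I for x in p if I <= x < I + J],
            [x - I - J for x in p if x >= I + J])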

We first compare the three methods, RCM, BFS-MCS, and Lexi-Order, on matrices available in the UFL collection [59]. We tested these three heuristics on all pattern-wise unsymmetric matrices having more than 1000 and less than 5,000,000 rows, at least three nonzeros per row and column, less than 10,000,000 nonzeros, and less than 100 · √(m · n) nonzeros for a matrix with m rows and n columns. At the time of experimentation, there were 774 matrices satisfying these properties. We ordered these matrices with RCM, BFS-MCS, and Lexi-Order (5 iterations) and counted the number of nonempty blocks with a block size of 128 (in each dimension). We then normalized the results of BFS-MCS and Lexi-Order with respect to those of RCM. We give the statistical indicators of these normalized results in Table 8.3. As seen in the table, RCM is better than BFS-MCS, while Lexi-Order performs as well as RCM in reducing the number of blocks.

We next compare the three methods on 11 sparse tensors from the FROSTT [203] and HaTen2 [121] repositories, again with a block size of 128 in each dimension. Table 8.4 shows the number of nonempty blocks obtained with these methods. In this table, the number of blocks obtained by RCM is given in absolute terms, and those obtained by BFS-MCS and Lexi-Order are given relative to those of RCM. As seen in the table, RCM is never the best, and Lexi-Order is better than BFS-MCS in nine out of eleven cases. Overall, Lexi-Order and BFS-MCS obtain results whose geometric means are 0.48 and 0.89, respectively, of that of RCM, marking both as more effective than RCM. This is interesting, as we have seen that RCM and Lexi-Order have (nearly) the same performance in the matrix case.


                 RCM          BFS-MCS    Lexi-Order

lbnl-network     67,208       1.11       0.68
nips             25,992       1.56       0.45
vast             113,757      1.00       0.95
nell-2           565,615      0.92       1.05
flickr           28,648,054   0.41       0.38
flickr4d         32,913,028   0.61       0.51
deli             70,691,535   0.97       0.91
deli4d           94,608,531   0.81       0.88
darpa            2,753,737    1.65       0.19
fb-m             53,184,625   0.66       0.16
fb-s             71,308,285   0.80       0.19

geo. mean                     0.89       0.48

Table 8.4: Number of nonempty blocks obtained by RCM, BFS-MCS, and Lexi-Order on sparse tensors. The results of BFS-MCS and Lexi-Order are normalized with respect to those of RCM. For each tensor, the best result is displayed in bold.

          random    Lexi-Order    speedup

vast      2.02      1.60          1.26
nell-2    5.04      4.33          1.16
choa      1.78      1.43          1.25
darpa     3.18      1.48          2.15
fb-m      17.16     6.95          2.47
fb-s      25.03     9.73          2.57
flickr    10.29     5.82          1.77
deli      14.05     10.80         1.30
nell1     18.45     14.86         1.24

Table 8.5: The run time of TTM (for all modes) using a random ordering and Lexi-Order, and the speedup with Lexi-Order, for sequential execution.


The proposed ordering methods are applicable to some other sparse tensor operations without any change. Among those operations, the tensor-matrix multiplication (TTM) used in computing the Tucker decomposition with higher-order orthogonal iterations [61], or the tensor-vector multiplications used in the higher order power method [60], are well known. The main characteristic of these operations is that the memory access of the algorithm depends on the locality of the tensor indices in each dimension. If not all dimensions have the same characteristics, potentially a variant of the proposed methods in which only certain dimensions are ordered could be used. We showcase some results for the TTM case in Table 8.5 on some of the tensors. The results are obtained on a machine with an Intel Xeon CPU E5-2650v4. As seen in the table, using Lexi-Order always results in improved sequential run time for TTMs.

Recent work [46] analyzes the memory access pattern of the MTTKRP operation and proposes low-level, highly effective code optimizations. The proposed approach reorganizes the code with register blocking and applies nonzero blocking in an attempt to fit rows of the factor matrices into the cache. The gain in performance is remarkable. Our ordering methods are orthogonal to the mentioned and similar optimizations. For a much improved performance, one should first order the tensor using the approaches proposed in this chapter and then further optimize the code using the low-level code optimizations.


Part IV

Closing


Chapter 9

Concluding remarks

We have presented a selection of partitioning, matching, and ordering problems in combinatorial scientific computing.

Partitioning problems: These problems arise in task decomposition for parallel computing, where load balance and low communication cost are two objectives. Depending on the task definition and the interaction of the tasks, the partitioning problems use different combinatorial models. The selected problems (Chapters 2 and 3) addressed acyclic partitioning of directed acyclic graphs and partitioning of hypergraphs. While our contributions for the former problem concerned combinatorial tools for the desired partitioning objectives and constraints, those for the second problem concerned the use of hypergraph models and associated partitioning tools for efficient tensor decomposition in the distributed memory setting.

Matching problems: These problems arise in settings where agents compete for exclusive access to resources. The most common settings include identifying nonzero diagonals in sparse matrices, and routing and scheduling in data centers. A related concept is to decompose a doubly stochastic matrix as a convex combination of permutation matrices (the famous Birkhoff-von Neumann decomposition), where each permutation matrix is a perfect matching in the bipartite graph representation of the matrix. We developed approximation algorithms for matchings in graphs (Chapter 4) and effective heuristics for finding matchings in hypergraphs (Chapter 6). We also investigated (Chapter 5) the problem of finding Birkhoff-von Neumann decompositions with a small number of permutation matrices, and presented complexity results and theoretical insights into the decomposition problem. On the way, we generalized the decomposition to general real matrices (not only doubly stochastic ones) having total support.


Ordering problems: These problems arise when one wants to permute sparse matrices and tensors into desirable forms. The sought forms depend, of course, on the targeted applications. By focusing on a recent LU decomposition method (or on a method taking advantage of the sparsity in a novel way), we developed heuristics to permute sparse matrices into a bordered block triangular form, with the aim of reducing the height of the resulting elimination tree (Chapter 7). We also investigated the sparse tensor ordering problem, where we sought to cluster nonzeros around the (hyper)diagonal of a sparse tensor (Chapter 8) in order to improve the performance of certain tensor operations. We proposed novel algorithms for obtaining a doubly lexical ordering of sparse matrices, simpler to implement than the known algorithms, in order to address the tensor ordering problem.

At the end of Chapters 2–8, we mentioned some immediate future work. Below, we state some further directions of future work, which require considerable effort and creativity.

Most of the presented combinatorial algorithms come with applications; the others are made applicable by practical and efficient implementations. Further developments on the applications' side are required for these to be used in practice. In particular, the use of the acyclic partitioner (the problem of Chapter 2) in a parallel system with the help of a runtime system requires the whole graph to be available. This is usually not the case; an auto-tuning-like approach, in which one first creates a task graph by a cold run and then partitions this graph to determine a schedule, can prove useful. In a similar vein, we mentioned the lack of proper software implementing the LU factorization method using elimination trees at the end of Chapter 7. Such an implementation will give rise to a number of combinatorial problems (e.g., the peak memory requirement in factoring a BBT-ordered matrix with elimination trees) and seems to be a new playing field.

The original Karp-Sipser heuristic for the cardinality matching problem applies degree-1 and degree-2 reduction rules. The first rule is well implemented in practical settings (see Chapters 4 and 6). On the other hand, fast implementations of the degree-2 reduction rule for undirected graphs are sought. A straightforward implementation of this rule can result in a worst-case Ω(n²) total time for a graph with n vertices. This is obviously too much to afford in theory for general sparse graphs with O(n) or O(n log n) edges. Are there worst-case linear time algorithms for implementing this rule in the context of the Karp-Sipser heuristic?

The Birkhoff-von Neumann decomposition (Chapter 5) has well founded applications in routing. Our use of this decomposition in solving linear systems is a first attempt at making it useful for numerical algorithms. In the proposed approach, the cost of constructing a preconditioner with a few terms from a BvN decomposition is equivalent to solving a few maximum bottleneck matching problems; this can be costly in practical settings. Can we use a relaxed definition of the BvN decomposition in which sub-permutation matrices are allowed (such decompositions are used in routing problems) and make the related preconditioner more practical, while retaining its numerical properties?


Bibliography

[1] IBM ILOG CPLEX Optimizer. http://www-01.ibm.com/software/integration/optimization/cplex-optimizer, 2010. [Cited on pp. 96, 121]

[2] Evrim Acar Daniel M Dunlavy and Tamara G Kolda A scalable op-timization approach for fitting canonical tensor decompositions Jour-nal of Chemometrics 25(2)67ndash86 2011 [Cited on pp 11 37]

[3] Seher Acer Tugba Torun and Cevdet Aykanat Improving medium-grain partitioning for scalable sparse tensor decomposition IEEETransactions on Parallel and Distributed Systems 29(12)2814ndash2825Dec 2018 [Cited on pp 52]

[4] Kunal Agrawal Jeremy T Fineman Jordan Krage Charles E Leis-erson and Sivan Toledo Cache-conscious scheduling of streaming ap-plications In Proc Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 236ndash245 New YorkNY USA 2012 ACM [Cited on pp 4]

[5] Emmanuel Agullo Alfredo Buttari Abdou Guermouche and FlorentLopez Multifrontal QR factorization for multicore architectures overruntime systems In Euro-Par 2013 Parallel Processing LNCS pages521ndash532 Springer Berlin Heidelberg 2013 [Cited on pp 131]

[6] Ron Aharoni and Penny Haxell Hallrsquos theorem for hypergraphs Jour-nal of Graph Theory 35(2)83ndash88 2000 [Cited on pp 119]

[7] Kadir Akbudak and Cevdet Aykanat Exploiting locality in sparsematrix-matrix multiplication on many-core architectures IEEETransactions on Parallel and Distributed Systems 28(8)2258ndash2271Aug 2017 [Cited on pp 160]

[8] Patrick R Amestoy Iain S Duff and Jean-Yves LrsquoExcellent Multi-frontal parallel distributed symmetric and unsymmetric solvers Com-puter Methods in Applied Mechanics and Engineering 184(2ndash4)501ndash520 2000 [Cited on pp 131 132]

[9] Patrick R Amestoy Iain S Duff Daniel Ruiz and Bora Ucar Aparallel matrix scaling algorithm In J M Palma P R AmestoyM Dayde M Mattoso and J C Lopes editors 8th InternationalConference on High Performance Computing for Computational Sci-ence (VECPAR) volume 5336 pages 301ndash313 Springer Berlin Hei-delberg 2008 [Cited on pp 56]


[10] Michael Anastos and Alan Frieze Finding perfect matchings in ran-dom cubic graphs in linear time arXiv preprint arXiv1808008252018 [Cited on pp 125]

[11] Claus A Andersson and Rasmus Bro The N-way toolbox for MAT-LAB Chemometrics and Intelligent Laboratory Systems 52(1)1ndash42000 [Cited on pp 11 37]

[12] Hartwig Anzt Edmond Chow and Jack Dongarra Iterative sparsetriangular solves for preconditioning In Larsson Jesper Traff SaschaHunold and Francesco Versaci editors Euro-Par 2015 Parallel Pro-cessing pages 650ndash661 Springer Berlin Heidelberg 2015 [Cited onpp 107]

[13] Jonathan Aronson Martin Dyer Alan Frieze and Stephen SuenRandomized greedy matching II Random Structures amp Algorithms6(1)55ndash73 1995 [Cited on pp 58]

[14] Jonathan Aronson Alan Frieze and Boris G Pittel Maximum match-ings in sparse random graphs Karp-Sipser revisited Random Struc-tures amp Algorithms 12(2)111ndash177 1998 [Cited on pp 56 58]

[15] Cevdet Aykanat Berkant B Cambazoglu and Bora Ucar Multi-leveldirect K-way hypergraph partitioning with multiple constraints andfixed vertices Journal of Parallel and Distributed Computing 68609ndash625 2008 [Cited on pp 46]

[16] Ariful Azad Mahantesh Halappanavar Sivasankaran RajamanickamErik G Boman Arif Khan and Alex Pothen Multithreaded algo-rithms for maximum matching in bipartite graphs In IEEE 26th Inter-national Parallel Distributed Processing Symposium (IPDPS) pages860ndash872 Shanghai China 2012 [Cited on pp 57 59 66 73 74]

[17] Brett W Bader and Tamara G Kolda Efficient MATLAB compu-tations with sparse and factored tensors SIAM Journal on ScientificComputing 30(1)205ndash231 December 2007 [Cited on pp 11 37]

[18] Brett W Bader, Tamara G Kolda, et al. Matlab tensor toolbox version 3.0. Available online, http://www.sandia.gov/~tgkolda/TensorToolbox, 2017. [Cited on pp 11 37 38 157]

[19] David A Bader Henning Meyerhenke Peter Sanders and DorotheaWagner editors Graph Partitioning and Graph Clustering - 10thDIMACS Implementation Challenge Workshop Georgia Institute ofTechnology Atlanta GA USA February 13-14 2012 Proceedingsvolume 588 of Contemporary Mathematics American MathematicalSociety 2013 [Cited on pp 69]


[20] Muthu Baskaran Tom Henretty Benoit Pradelle M HarperLangston David Bruns-Smith James Ezick and Richard LethinMemory-efficient parallel tensor decompositions In 2017 IEEE HighPerformance Extreme Computing Conference (HPEC) pages 1ndash7Sep 2017 [Cited on pp 160]

[21] James Bennett and Stan Lanning The Netflix Prize In Proceedingsof KDD cup and workshop page 35 2007 [Cited on pp 10 48]

[22] Anne Benoit Yves Robert and Frederic Vivien A Guide to AlgorithmDesign Paradigms Methods and Complexity Analysis CRC PressBoca Raton FL 2014 [Cited on pp 93]

[23] Michele Benzi and Bora Ucar Preconditioning techniques based onthe Birkhoffndashvon Neumann decomposition Computational Methods inApplied Mathematics 17201ndash215 2017 [Cited on pp 106 108]

[24] Piotr Berman and Marek Karpinski Improved approximation lowerbounds on small occurence optimization ECCC Report 2003 [Citedon pp 113]

[25] Garrett Birkhoff Tres observaciones sobre el algebra lineal Universi-dad Nacional de Tucuman Series A 5147ndash154 1946 [Cited on pp8]

[26] Marcel Birn Vitaly Osipov Peter Sanders Christian Schulz andNodari Sitchinava Efficient parallel and external matching In Fe-lix Wolf Bernd Mohr and Dieter an Mey editors Euro-Par 2013Parallel Processing pages 659ndash670 Springer Berlin Heidelberg 2013[Cited on pp 59]

[27] Rob H Bisseling and Wouter Meesen Communication balancing inparallel sparse matrix-vector multiplication Electronic Transactionson Numerical Analysis 2147ndash65 2005 [Cited on pp 4 47]

[28] Zachary Blanco Bangtian Liu and Maryam Mehri Dehnavi CSTFLarge-scale sparse tensor factorizations on distributed platforms InProceedings of the 47th International Conference on Parallel Process-ing pages 211ndash2110 New York NY USA 2018 ACM [Cited onpp 160]

[29] Guy E Blelloch Jeremy T Fineman and Julian Shun Greedy sequen-tial maximal independent set and matching are parallel on average InProceedings of the Twenty-fourth Annual ACM Symposium on Par-allelism in Algorithms and Architectures pages 308ndash317 ACM 2012[Cited on pp 59 73 74 75]


[30] Norbert Blum A new approach to maximum matching in generalgraphs In Proceedings of the 17th International Colloquium on Au-tomata Languages and Programming pages 586ndash597 London UK1990 Springer-Verlag [Cited on pp 6 76]

[31] Richard A Brualdi The diagonal hypergraph of a matrix (bipartitegraph) Discrete Mathematics 27(2)127ndash147 1979 [Cited on pp 92]

[32] Richard A Brualdi Notes on the Birkhoff algorithm for doublystochastic matrices Canadian Mathematical Bulletin 25(2)191ndash1991982 [Cited on pp 9 92 94 98]

[33] Richard A Brualdi and Peter M Gibson Convex polyhedra of doublystochastic matrices I Applications of the permanent function Jour-nal of Combinatorial Theory Series A 22(2)194ndash230 1977 [Citedon pp 9]

[34] Richard A Brualdi and Herbert J Ryser Combinatorial Matrix The-ory volume 39 of Encyclopedia of Mathematics and its ApplicationsCambridge University Press Cambridge UK New York USA Mel-bourne Australia 1991 [Cited on pp 8]

[35] Thang Nguyen Bui and Curt Jones A heuristic for reducing fill-in insparse matrix factorization In Proc 6th SIAM Conf Parallel Pro-cessing for Scientific Computing pages 445ndash452 SIAM 1993 [Citedon pp 4 15]

[36] Peter J Cameron. Notes on combinatorics. https://cameroncounts.files.wordpress.com/2013/11/comb.pdf (last checked Jan 2019), 2013. [Cited on pp 76]

[37] Andrew Carlson Justin Betteridge Bryan Kisiel Burr Settles Este-vam R Hruschka Jr and Tom M Mitchell Toward an architecturefor never-ending language learning In AAAI volume 5 page 3 2010[Cited on pp 48]

[38] J Douglas Carroll and Jih-Jie Chang Analysis of individual dif-ferences in multidimensional scaling via an N-way generalization ofldquoEckart-Youngrdquo decomposition Psychometrika 35(3)283ndash319 1970[Cited on pp 36]

[39] Umit V Catalyurek and Cevdet Aykanat Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplicationIEEE Transactions on Parallel and Distributed Systems 10(7)673ndash693 Jul 1999 [Cited on pp 41 140]


[40] Umit V Catalyurek and Cevdet Aykanat PaToH A Multilevel Hyper-graph Partitioning Tool Version 30 Bilkent University Departmentof Computer Engineering Ankara 06533 Turkey PaToH is availableat httpbmiosuedu~umitsoftwarehtm 1999 [Cited on pp4 7 25 47 143]

[41] Umit V Catalyurek Kamer Kaya and Bora Ucar On shared-memoryparallelization of a sparse matrix scaling algorithm In 2012 41st Inter-national Conference on Parallel Processing pages 68ndash77 Los Alami-tos CA USA September 2012 IEEE Computer Society [Cited onpp 56]

[42] Umit V Catalyurek Mehmet Deveci Kamer Kaya and Bora UcarMultithreaded clustering for multi-level hypergraph partitioning In26th IEEE International Parallel and Distributed Processing Sympo-sium (IPDPS) pages 848ndash859 Shanghai China 2012 [Cited on pp59]

[43] Cheng-Shang Chang Wen-Jyh Chen and Hsiang-Yi Huang On ser-vice guarantees for input-buffered crossbar switches A capacity de-composition approach by Birkhoff and von Neumann In Quality ofService 1999 IWQoS rsquo99 1999 Seventh International Workshop onpages 79ndash86 1999 [Cited on pp 8]

[44] Jianer Chen Yang Liu Songjian Lu Barry OrsquoSullivan and Igor Raz-gon A fixed-parameter algorithm for the directed feedback vertex setproblem Journal of the ACM 55(5)211ndash2119 2008 [Cited on pp142]

[45] Joseph Cheriyan Randomized O(M(|V |)) algorithms for problemsin matching theory SIAM Journal on Computing 26(6)1635ndash16551997 [Cited on pp 76]

[46] Jee Choi Xing Liu Shaden Smith and Tyler Simon Blocking opti-mization techniques for sparse tensor computation In 2018 IEEE In-ternational Parallel and Distributed Processing Symposium (IPDPS)pages 568ndash577 Vancouver British Columbia Canada 2018 [Citedon pp 160 163]

[47] Joon Hee Choi and S V N Vishwanathan DFacTo Distributedfactorization of tensors In Zoubin Ghahramani Max Welling CorinnaCortes Neil D Lawrence and Kilian Q Weinberger editors 27thAdvances in Neural Information Processing Systems pages 1296ndash1304Curran Associates Inc Montreal Quebec Canada 2014 [Cited onpp 11 37 38 39]


[48] Edmond Chow and Aftab Patel Fine-grained parallel incomplete LUfactorization SIAM Journal on Scientific Computing 37(2)C169ndashC193 2015 [Cited on pp 107]

[49] Andrzej Cichocki Danilo P Mandic Lieven De Lathauwer GuoxuZhou Qibin Zhao Cesar Caiafa and Anh-Huy Phan Tensor decom-positions for signal processing applications From two-way to multiwaycomponent analysis IEEE Signal Processing Magazine 32(2)145ndash163March 2015 [Cited on pp 10]

[50] Edith Cohen Structure prediction and computation of sparse matrixproducts Journal of Combinatorial Optimization 2(4)307ndash332 Dec1998 [Cited on pp 127]

[51] Thomas F Coleman and Wei Xu Parallelism in structured Newtoncomputations In Parallel Computing Architectures Algorithms andApplications ParCo 2007 pages 295ndash302 Forschungszentrum Julichand RWTH Aachen University Germany 2007 [Cited on pp 3]

[52] Thomas F Coleman and Wei Xu Fast (structured) Newton computa-tions SIAM Journal on Scientific Computing 31(2)1175ndash1191 2009[Cited on pp 3]

[53] Thomas F Coleman and Wei Xu Automatic Differentiation in MAT-LAB using ADMAT with Applications SIAM 2016 [Cited on pp3]

[54] Jason Cong Zheng Li and Rajive Bagrodia Acyclic multi-way par-titioning of Boolean networks In Proceedings of the 31st Annual De-sign Automation Conference DACrsquo94 pages 670ndash675 New York NYUSA 1994 ACM [Cited on pp 4 17]

[55] Elizabeth H Cuthill and J McKee Reducing the bandwidth of sparsesymmetric matrices In Proceedings of the 24th national conferencepages 157ndash172 New York NY USA 1969 ACM [Cited on pp 12]

[56] Marek Cygan Improved approximation for 3-dimensional matchingvia bounded pathwidth local search In Foundations of ComputerScience (FOCS) 2013 IEEE 54th Annual Symposium on pages 509ndash518 IEEE 2013 [Cited on pp 113 120]

[57] Marek Cygan Fabrizio Grandoni and Monaldo Mastrolilli How to sellhyperedges The hypermatching assignment problem In Proc of thetwenty-fourth annual ACM-SIAM symposium on Discrete algorithmspages 342ndash351 SIAM 2013 [Cited on pp 113]


[58] Joseph Czyzyk Michael P Mesnier and Jorge J More The NEOSServer IEEE Journal on Computational Science and Engineering5(3)68ndash75 1998 [Cited on pp 96]

[59] Timothy A Davis and Yifan Hu The University of Florida sparsematrix collection ACM Transactions on Mathematical Software38(1)11ndash125 2011 [Cited on pp 25 69 82 105 143 161]

[60] Lieven De Lathauwer Pierre Comon Bart De Moor and Joos Vande-walle Higher-order power methodndashApplication in independent com-ponent analysis In Proceedings NOLTArsquo95 pages 91ndash96 Las VegasUSA 1995 [Cited on pp 163]

[61] Lieven De Lathauwer Bart De Moor and Joos Vandewalle Onthe best rank-1 and rank-(R1 R2 RN ) approximation of higher-order tensors SIAM Journal on Matrix Analysis and Applications21(4)1324ndash1342 2000 [Cited on pp 163]

[62] Dominique de Werra Variations on the Theorem of BirkhoffndashvonNeumann and extensions Graphs and Combinatorics 19(2)263ndash278Jun 2003 [Cited on pp 108]

[63] James W Demmel Stanley C Eisenstat John R Gilbert Xiaoye SLi and Joseph W H Liu A supernodal approach to sparse par-tial pivoting SIAM Journal on Matrix Analysis and Applications20(3)720ndash755 1999 [Cited on pp 132]

[64] James W Demmel John R Gilbert and Xiaoye S Li An asyn-chronous parallel supernodal algorithm for sparse gaussian elimina-tion SIAM Journal on Matrix Analysis and Applications 20(4)915ndash952 1999 [Cited on pp 132]

[65] Mehmet Deveci Kamer Kaya Umit V Catalyurek and Bora UcarA push-relabel-based maximum cardinality matching algorithm onGPUs In 42nd International Conference on Parallel Processing pages21ndash29 Lyon France 2013 [Cited on pp 57]

[66] Mehmet Deveci Kamer Kaya Bora Ucar and Umit V CatalyurekGPU accelerated maximum cardinality matching algorithms for bipar-tite graphs In Felix Wolf Bernd Mohr and Dieter an Mey editorsEuro-Par 2013 Parallel Processing volume 8097 of LNCS pages 850ndash861 Springer Berlin Heidelberg 2013 [Cited on pp 57]

[67] Pat Devlin and Jeff Kahn Perfect fractional matchings in k-out hy-pergraphs arXiv preprint arXiv170303513 2017 [Cited on pp 114121]


[68] Elizabeth D. Dolan. The NEOS Server 4.0 administrative guide. Technical Memorandum ANL/MCS-TM-250, Mathematics and Computer Science Division, Argonne National Laboratory, 2001. [Cited on pp. 96]

[69] Elizabeth D. Dolan and Jorge J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201–213, 2002. [Cited on pp. 26]

[70] Jack J. Dongarra, Fred G. Gustavson, and Alan H. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91–112, 1984. [Cited on pp. 134]

[71] Iain S. Duff, Kamer Kaya, and Bora Uçar. Design, implementation, and analysis of maximum transversal algorithms. ACM Transactions on Mathematical Software, 38(2):13:1–13:31, 2011. [Cited on pp. 6, 56, 58]

[72] Iain S. Duff and Jacko Koster. On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM Journal on Matrix Analysis and Applications, 22:973–996, 2001. [Cited on pp. 98, 143]

[73] Fanny Dufossé, Kamer Kaya, Ioannis Panagiotas, and Bora Uçar. Approximation algorithms for maximum matchings in undirected graphs. In 2018 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing, pages 56–65, 2018. [Cited on pp. 75, 79, 113]

[74] Fanny Dufossé, Kamer Kaya, Ioannis Panagiotas, and Bora Uçar. Further notes on Birkhoff–von Neumann decomposition of doubly stochastic matrices. Linear Algebra and its Applications, 554:68–78, 2018. [Cited on pp. 101, 113, 117]

[75] Fanny Dufossé, Kamer Kaya, Ioannis Panagiotas, and Bora Uçar. Effective heuristics for matchings in hypergraphs. Research Report RR-9224, Inria Grenoble Rhône-Alpes, November 2018. (will appear in SEA2). [Cited on pp. 118, 119, 122]

[76] Fanny Dufossé, Kamer Kaya, and Bora Uçar. Two approximation algorithms for bipartite matching on multicore architectures. Journal of Parallel and Distributed Computing, 85:62–78, 2015. IPDPS 2014 Selected Papers on Numerical and Combinatorial Algorithms. [Cited on pp. 56, 60, 65, 66, 67, 69, 77, 79, 81, 113, 117, 125]

[77] Andrew Lloyd Dulmage and Nathan Saul Mendelsohn. Coverings of bipartite graphs. Canadian Journal of Mathematics, 10:517–534, 1958. [Cited on pp. 68, 143]

[78] Martin E. Dyer and Alan M. Frieze. Randomized greedy matching. Random Structures & Algorithms, 2(1):29–45, 1991. [Cited on pp. 58, 113, 115]

[79] Jack Edmonds. Paths, trees, and flowers. Canadian Journal of Mathematics, 17:449–467, 1965. [Cited on pp. 76]

[80] Lawrence C. Eggan. Transition graphs and the star-height of regular events. The Michigan Mathematical Journal, 10(4):385–397, 1963. [Cited on pp. 5]

[81] Stanley C. Eisenstat and Joseph W. H. Liu. The theory of elimination trees for sparse unsymmetric matrices. SIAM Journal on Matrix Analysis and Applications, 26(3):686–705, 2005. [Cited on pp. 5, 131, 132, 133, 135]

[82] Stanley C. Eisenstat and Joseph W. H. Liu. A tree-based dataflow model for the unsymmetric multifrontal method. Electronic Transactions on Numerical Analysis, 21:1–19, 2005. [Cited on pp. 132]

[83] Stanley C. Eisenstat and Joseph W. H. Liu. Algorithmic aspects of elimination trees for sparse unsymmetric matrices. SIAM Journal on Matrix Analysis and Applications, 29(4):1363–1381, 2008. [Cited on pp. 5, 132, 143, 145]

[84] Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. On characterizing the data access complexity of programs. ACM SIGPLAN Notices – POPL'15, 50(1):567–580, January 2015. [Cited on pp. 3]

[85] Paul Erdős and Alfréd Rényi. On random matrices. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 8(3):455–461, 1964. [Cited on pp. 72]

[86] Bastiaan Onne Fagginger Auer and Rob H. Bisseling. A GPU algorithm for greedy graph matching. In Facing the Multicore Challenge II, volume 7174 of LNCS, pages 108–119. Springer-Verlag, Berlin, 2012. [Cited on pp. 59]

[87] Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, and P. Sadayappan. Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential. ACM Transactions on Architecture and Code Optimization, 10(4):53:1–53:29, December 2013. [Cited on pp. 3, 16]

[88] Charles M. Fiduccia and Robert M. Mattheyses. A linear-time heuristic for improving network partitions. In 19th Design Automation Conference, pages 175–181, Las Vegas, USA, 1982. IEEE. [Cited on pp. 17, 22]

[89] Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735, 1989. [Cited on pp. 114]

[90] Alan M. Frieze. Maximum matchings in a class of random graphs. Journal of Combinatorial Theory, Series B, 40(2):196–212, 1986. [Cited on pp. 76, 82, 121]

[91] Aurélien Froger, Olivier Guyon, and Eric Pinson. A set packing approach for scheduling passenger train drivers: the French experience. In RailTokyo2015, Tokyo, Japan, March 2015. [Cited on pp. 7]

[92] Harold N. Gabow and Robert E. Tarjan. Faster scaling algorithms for network problems. SIAM Journal on Computing, 18(5):1013–1036, 1989. [Cited on pp. 6, 76]

[93] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979. [Cited on pp. 3, 47, 93]

[94] Alan George. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis, 10(2):345–363, 1973. [Cited on pp. 136]

[95] Alan J. George. Computer implementation of the finite element method. PhD thesis, Stanford University, Stanford, CA, USA, 1971. [Cited on pp. 12]

[96] Andrew V. Goldberg and Robert Kennedy. Global price updates help. SIAM Journal on Discrete Mathematics, 10(4):551–572, 1997. [Cited on pp. 6]

[97] Olaf Görlitz, Sergej Sizov, and Steffen Staab. PINTS: Peer-to-peer infrastructure for tagging systems. In Proceedings of the 7th International Conference on Peer-to-peer Systems, IPTPS'08, page 19, Berkeley, CA, USA, 2008. USENIX Association. [Cited on pp. 48]

[98] Georg Gottlob and Gianluigi Greco. Decomposing combinatorial auctions and set packing problems. Journal of the ACM, 60(4):24:1–24:39, September 2013. [Cited on pp. 7]

[99] Lars Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, 31(4):2029–2054, 2010. [Cited on pp. 52]

[100] William Gropp and Jorge J. Moré. Optimization environments and the NEOS Server. In Martin D. Buhman and Arieh Iserles, editors, Approximation Theory and Optimization, pages 167–182. Cambridge University Press, 1997. [Cited on pp. 96]

[101] Hermann Gruber. Digraph complexity measures and applications in formal language theory. Discrete Mathematics & Theoretical Computer Science, 14(2):189–204, 2012. [Cited on pp. 5]

[102] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010. [Cited on pp. 47]

[103] Abdou Guermouche, Jean-Yves L'Excellent, and Gil Utard. Impact of reordering on the memory of a multifrontal solver. Parallel Computing, 29(9):1191–1218, 2003. [Cited on pp. 131]

[104] Anshul Gupta. Improved symbolic and numerical factorization algorithms for unsymmetric sparse matrices. SIAM Journal on Matrix Analysis and Applications, 24(2):529–552, 2002. [Cited on pp. 132]

[105] Mahantesh Halappanavar, John Feo, Oreste Villa, Antonino Tumeo, and Alex Pothen. Approximate weighted matching on emerging manycore and multithreaded architectures. International Journal of High Performance Computing Applications, 26(4):413–430, 2012. [Cited on pp. 59]

[106] Magnús M. Halldórsson. Approximating discrete collections via local improvements. In The Sixth annual ACM-SIAM symposium on Discrete Algorithms (SODA), volume 95, pages 160–169, San Francisco, California, USA, 1995. SIAM. [Cited on pp. 113]

[107] Richard A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970. [Cited on pp. 36]

[108] Nicholas J. A. Harvey. Algebraic algorithms for matching and matroid problems. SIAM Journal on Computing, 39(2):679–702, 2009. [Cited on pp. 76]

[109] Elad Hazan, Shmuel Safra, and Oded Schwartz. On the complexity of approximating k-dimensional matching. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 83–97. Springer, 2003. [Cited on pp. 113]

[110] Elad Hazan, Shmuel Safra, and Oded Schwartz. On the complexity of approximating k-set packing. Computational Complexity, 15(1):20–39, 2006. [Cited on pp. 7]

[111] Bruce Hendrickson. Load balancing fictions, falsehoods and fallacies. Applied Mathematical Modelling, 25:99–108, 2000. [Cited on pp. 41]

[112] Bruce Hendrickson and Tamara G. Kolda. Graph partitioning models for parallel computing. Parallel Computing, 26(12):1519–1534, 2000. [Cited on pp. 41]

[113] Bruce Hendrickson and Robert Leland. The Chaco user's guide, version 1.0. Technical Report SAND93–2339, Sandia National Laboratories, Albuquerque, NM, October 1993. [Cited on pp. 4, 15, 16]

[114] Julien Herrmann, Jonathan Kho, Bora Uçar, Kamer Kaya, and Ümit V. Çatalyürek. Acyclic partitioning of large directed acyclic graphs. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID, pages 371–380, Madrid, Spain, May 2017. [Cited on pp. 15, 16, 17, 18, 28, 30]

[115] Julien Herrmann, M. Yusuf Özkaya, Bora Uçar, Kamer Kaya, and Ümit V. Çatalyürek. Multilevel algorithms for acyclic partitioning of directed acyclic graphs. SIAM Journal on Scientific Computing, 2019. To appear. [Cited on pp. 19, 20, 26, 29, 31]

[116] Jonathan D. Hogg, John K. Reid, and Jennifer A. Scott. Design of a multicore sparse Cholesky factorization using DAGs. SIAM Journal on Scientific Computing, 32(6):3627–3649, 2010. [Cited on pp. 131]

[117] John E. Hopcroft and Richard M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973. [Cited on pp. 6, 62]

[118] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, New York, 1985. [Cited on pp. 109]

[119] Cor A. J. Hurkens and Alexander Schrijver. On the size of systems of sets every t of which have an SDR, with an application to the worst-case ratio of heuristics for packing problems. SIAM Journal on Discrete Mathematics, 2(1):68–72, 1989. [Cited on pp. 113, 120]

[120] Oscar H. Ibarra and Shlomo Moran. Deterministic and probabilistic algorithms for maximum bipartite matching via fast matrix multiplication. Information Processing Letters, pages 12–15, 1981. [Cited on pp. 76]

[121] Inah Jeon, Evangelos E. Papalexakis, U Kang, and Christos Faloutsos. HaTen2: Billion-scale tensor decompositions. In IEEE 31st International Conference on Data Engineering (ICDE), pages 1047–1058, April 2015. [Cited on pp. 156, 161]

[122] Jochen A. G. Jess and Hendrikus Gerardus Maria Kees. A data structure for parallel LU decomposition. IEEE Transactions on Computers, 31(3):231–239, 1982. [Cited on pp. 134]

[123] U Kang, Evangelos Papalexakis, Abhay Harpale, and Christos Faloutsos. GigaTensor: Scaling tensor analysis up by 100 times – Algorithms and discoveries. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 316–324, New York, NY, USA, 2012. ACM. [Cited on pp. 10, 11, 37]

[124] Michał Karoński, Ed Overman, and Boris Pittel. On a perfect matching in a random bipartite digraph with average out-degree below two. arXiv e-prints, page arXiv:1903.05764, Mar 2019. [Cited on pp. 59, 121]

[125] Michał Karoński and Boris Pittel. Existence of a perfect matching in a random (1 + e^{-1})–out bipartite graph. Journal of Combinatorial Theory, Series B, 88:1–16, May 2003. [Cited on pp. 59, 67, 81, 82, 121]

[126] Richard M. Karp. Reducibility among combinatorial problems. In Raymond E. Miller, James W. Thatcher, and Jean D. Bohlinger, editors, Complexity of Computer Computations, The IBM Research Symposia Series, pages 85–103. Springer US, 1972. [Cited on pp. 7, 142]

[127] Richard M. Karp, Alexander H. G. Rinnooy Kan, and Rakesh V. Vohra. Average case analysis of a heuristic for the assignment problem. Mathematics of Operations Research, 19:513–522, 1994. [Cited on pp. 81]

[128] Richard M. Karp and Michael Sipser. Maximum matching in sparse random graphs. In 22nd Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 364–375, Nashville, TN, USA, 1981. [Cited on pp. 56, 58, 67, 75, 81, 113]

[129] Richard M. Karp, Umesh V. Vazirani, and Vijay V. Vazirani. An optimal algorithm for on-line bipartite matching. In 22nd annual ACM symposium on Theory of computing (STOC), pages 352–358, Baltimore, MD, USA, 1990. [Cited on pp. 57]

[130] George Karypis and Vipin Kumar. MeTiS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0. University of Minnesota, Department of Comp. Sci. and Eng., Army HPC Research Cent., Minneapolis, 1998. [Cited on pp. 4, 16, 23, 27, 142]

[131] George Karypis and Vipin Kumar. Multilevel algorithms for multi-constraint hypergraph partitioning. Technical Report 99-034, University of Minnesota, Department of Computer Science/Army HPC Research Center, Minneapolis, MN 55455, November 1998. [Cited on pp. 34]

[132] Kamer Kaya, Johannes Langguth, Fredrik Manne, and Bora Uçar. Push-relabel based algorithms for the maximum transversal problem. Computers and Operations Research, 40(5):1266–1275, 2013. [Cited on pp. 6, 56, 58]

[133] Kamer Kaya and Bora Uçar. Constructing elimination trees for sparse unsymmetric matrices. SIAM Journal on Matrix Analysis and Applications, 34(2):345–354, 2013. [Cited on pp. 5]

[134] Oğuz Kaya and Bora Uçar. Scalable sparse tensor decompositions in distributed memory systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 77:1–77:11, Austin, Texas, 2015. ACM, New York, NY, USA. [Cited on pp. 7]

[135] Oğuz Kaya and Bora Uçar. Parallel Candecomp/Parafac decomposition of sparse tensors using dimension trees. SIAM Journal on Scientific Computing, 40(1):C99–C130, 2018. [Cited on pp. 52]

[136] Enver Kayaaslan and Bora Uçar. Reducing elimination tree height for parallel LU factorization of sparse unsymmetric matrices. In 21st International Conference on High Performance Computing (HiPC 2014), pages 1–10, Goa, India, December 2014. [Cited on pp. 136]

[137] Brian W. Kernighan. Optimal sequential partitions of graphs. Journal of the ACM, 18(1):34–40, January 1971. [Cited on pp. 16]

[138] Shiva Kintali, Nishad Kothari, and Akash Kumar. Approximation algorithms for directed width parameters. CoRR, abs/1107.4824, 2011. [Cited on pp. 138]

[139] Philip A. Knight. The Sinkhorn–Knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008. [Cited on pp. 55, 68]

[140] Philip A. Knight and Daniel Ruiz. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33(3):1029–1047, 2013. [Cited on pp. 56, 79]

[141] Philip A. Knight, Daniel Ruiz, and Bora Uçar. A symmetry preserving algorithm for matrix scaling. SIAM Journal on Matrix Analysis and Applications, 35(3):931–955, 2014. [Cited on pp. 55, 56, 68, 79, 105]

[142] Tamara G. Kolda and Brett Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009. [Cited on pp. 10, 37]

[143] Tjalling C. Koopmans and Martin Beckmann. Assignment problems and the location of economic activities. Econometrica, 25(1):53–76, 1957. [Cited on pp. 109]

[144] Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, and James Avery. Fusion of parallel array operations. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pages 71–85, New York, NY, USA, 2016. ACM. [Cited on pp. 3]

[145] Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, Kenneth Skovhede, and Brian Vinter. Bohrium: A virtual machine approach to portable parallelism. In Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, pages 312–321, Washington, DC, USA, 2014. [Cited on pp. 3]

[146] Janardhan Kulkarni, Euiwoong Lee, and Mohit Singh. Minimum Birkhoff-von Neumann decomposition. Preliminary version http://www.cs.cmu.edu/~euiwoonl/sparsebvn.pdf of the paper which appeared in Proc. 19th International Conference Integer Programming and Combinatorial Optimization (IPCO 2017), Waterloo, ON, Canada, pp. 343–354, June 2017. [Cited on pp. 9]

[147] Sebastian Lamm, Peter Sanders, Christian Schulz, Darren Strash, and Renato F. Werneck. Finding Near-Optimal Independent Sets at Scale. In Proceedings of the 18th Workshop on Algorithm Engineering and Experiments (ALENEX'16), pages 138–150, Arlington, Virginia, USA, 2016. SIAM. [Cited on pp. 114, 127]

[148] Johannes Langguth, Fredrik Manne, and Peter Sanders. Heuristic initialization for bipartite matching problems. Journal of Experimental Algorithmics, 15:1.1–1.22, 2010. [Cited on pp. 6, 56, 58]

[149] Yanzhe (Murray) Lei, Stefanus Jasin, Joline Uichanco, and Andrew Vakhutinsky. Randomized product display (ranking), pricing, and order fulfillment for e-commerce retailers. SSRN Electronic Journal, 2018. [Cited on pp. 9]

[150] Thomas Lengauer. Combinatorial Algorithms for Integrated Circuit Layout. Wiley–Teubner, Chichester, UK, 1990. [Cited on pp. 34]

[151] Renaud Lepère and Denis Trystram. A new clustering algorithm for large communication delays. In 16th International Parallel and Distributed Processing Symposium (IPDPS), page (CDROM), Fort Lauderdale, FL, USA, 2002. IEEE Computer Society. [Cited on pp. 32]

[152] Hanoch Levy and David W. Low. A contraction algorithm for finding small cycle cutsets. Journal of Algorithms, 9(4):470–493, 1988. [Cited on pp. 142]

[153] Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. Model-driven sparse CP decomposition for higher-order tensors. In IPDPS 2017, 31st IEEE International Symposium on Parallel and Distributed Processing, pages 1048–1057, Orlando, FL, USA, May 2017. [Cited on pp. 52]

[154] Jiajia Li, Yuchen Ma, and Richard Vuduc. ParTI: A Parallel Tensor Infrastructure for Multicore CPU and GPUs (Version 0.1.0). Available from https://github.com/hpcgarage/ParTI, 2016. [Cited on pp. 156]

[155] Jiajia Li, Jimeng Sun, and Richard Vuduc. HiCOO: Hierarchical storage of sparse tensors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '18, New York, NY, USA, 2018. [Cited on pp. 148, 149, 155, 158]

[156] Jiajia Li, Bora Uçar, Ümit V. Çatalyürek, Jimeng Sun, Kevin Barker, and Richard Vuduc. Efficient and effective sparse tensor reordering. In Proceedings of the ACM International Conference on Supercomputing, ICS '19, pages 227–237, New York, NY, USA, 2019. ACM. [Cited on pp. 147, 156]

[157] Xiaoye S. Li and James W. Demmel. Making sparse Gaussian elimination scalable by static pivoting. In Proc. of the 1998 ACM/IEEE Conference on Supercomputing, SC '98, pages 1–17, Washington, DC, USA, 1998. [Cited on pp. 132]

[158] Athanasios P. Liavas and Nicholas D. Sidiropoulos. Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(20):5450–5463, Oct 2015. [Cited on pp. 11]

[159] Bangtian Liu, Chengyao Wen, Anand D. Sarwate, and Maryam Mehri Dehnavi. A unified optimization approach for sparse tensor operations on GPUs. In 2017 IEEE International Conference on Cluster Computing (CLUSTER), pages 47–57, Sept 2017. [Cited on pp. 160]

[160] Joseph W. H. Liu. Computational models and task scheduling for parallel sparse Cholesky factorization. Parallel Computing, 3(4):327–342, 1986. [Cited on pp. 131, 134]

[161] Joseph W. H. Liu. A graph partitioning algorithm by node separators. ACM Transactions on Mathematical Software, 15(3):198–219, 1989. [Cited on pp. 140]

[162] Joseph W. H. Liu. Reordering sparse matrices for parallel elimination. Parallel Computing, 11(1):73–91, 1989. [Cited on pp. 131]

[163] Liang Liu, Jun (Jim) Xu, and Lance Fortnow. Quantized BvND: A better solution for optical and hybrid switching in data center networks. In 2018 IEEE/ACM 11th International Conference on Utility and Cloud Computing, pages 237–246, Dec 2018. [Cited on pp. 9]

[164] Zvi Lotker, Boaz Patt-Shamir, and Seth Pettie. Improved distributed approximate matching. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 129–136, Munich, Germany, 2008. ACM. [Cited on pp. 59]

[165] László Lovász. On determinants, matchings, and random algorithms. In FCT, pages 565–574, 1979. [Cited on pp. 76]

[166] László Lovász and Michael D. Plummer. Matching Theory. Elsevier Science Publishers, Netherlands, 1986. [Cited on pp. 76]

[167] Anna Lubiw. Doubly lexical orderings of matrices. SIAM Journal on Computing, 16(5):854–879, 1987. [Cited on pp. 151, 152, 153]

[168] Yuchen Ma, Jiajia Li, Xiaolong Wu, Chenggang Yan, Jimeng Sun, and Richard Vuduc. Optimizing sparse tensor times matrix on GPUs. Journal of Parallel and Distributed Computing, 129:99–109, 2019. [Cited on pp. 160]

[169] Jakob Magun. Greedy matching algorithms: an experimental study. Journal of Experimental Algorithmics, 3:6, 1998. [Cited on pp. 6]

[170] Marvin Marcus and R. Ree. Diagonals of doubly stochastic matrices. The Quarterly Journal of Mathematics, 10(1):296–302, 1959. [Cited on pp. 9]

[171] Nick McKeown. The iSLIP scheduling algorithm for input-queued switches. IEEE/ACM Transactions on Networking, 7(2):188–201, 1999. [Cited on pp. 6]

[172] Amram Meir and John W. Moon. The expected node-independence number of random trees. Indagationes Mathematicae, 76:335–341, 1973. [Cited on pp. 67]

[173] John Mellor-Crummey, David Whaley, and Ken Kennedy. Improving memory hierarchy performance for irregular applications using data and computation reorderings. International Journal of Parallel Programming, 29(3):217–247, Jun 2001. [Cited on pp. 160]

[174] Silvio Micali and Vijay V. Vazirani. An O(√|V| |E|) algorithm for finding maximum matching in general graphs. In 21st Annual Symposium on Foundations of Computer Science, pages 17–27, Syracuse, NY, USA, 1980. IEEE. [Cited on pp. 6, 76]

[175] Orlando Moreira, Merten Popp, and Christian Schulz. Graph partitioning with acyclicity constraints. In 16th International Symposium on Experimental Algorithms, SEA, London, UK, 2017. [Cited on pp. 16, 17, 28, 29, 32]

[176] Orlando Moreira, Merten Popp, and Christian Schulz. Evolutionary multi-level acyclic graph partitioning. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO, pages 332–339, Kyoto, Japan, 2018. ACM. [Cited on pp. 17, 25, 32]

[177] Marcin Mucha and Piotr Sankowski. Maximum matchings in planar graphs via Gaussian elimination. In S. Albers and T. Radzik, editors, ESA 2004, pages 532–543, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. [Cited on pp. 76]

[178] Marcin Mucha and Piotr Sankowski. Maximum matchings via Gaussian elimination. In FOCS '04, pages 248–255, Washington, DC, USA, 2004. IEEE Computer Society. [Cited on pp. 76]

[179] Huda Nassar, Georgios Kollias, Ananth Grama, and David F. Gleich. Low rank methods for multiple network alignment. arXiv e-prints, page arXiv:1809.08198, Sep 2018. [Cited on pp. 128]

[180] Jaroslav Nešetřil and Patrice Ossona de Mendez. Tree-depth, subgraph coloring and homomorphism bounds. European Journal of Combinatorics, 27(6):1022–1041, 2006. [Cited on pp. 132]

[181] M. Yusuf Özkaya, Anne Benoit, Bora Uçar, Julien Herrmann, and Ümit V. Çatalyürek. A scalable clustering-based task scheduler for homogeneous processors using DAG partitioning. In 33rd IEEE International Parallel and Distributed Processing Symposium, pages 155–165, Rio de Janeiro, Brazil, May 2019. [Cited on pp. 32]

[182] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6):973–989, December 1987. [Cited on pp. 152, 153]

[183] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover Publications (corrected, unabridged reprint of Combinatorial Optimization: Algorithms and Complexity, originally published by Prentice-Hall Inc., New Jersey, 1982), New York, 1998. [Cited on pp. 29]

[184] Panos M. Pardalos, Tianbing Qian, and Mauricio G. C. Resende. A greedy randomized adaptive search procedure for the feedback vertex set problem. Journal of Combinatorial Optimization, 2(4):399–412, 1998. [Cited on pp. 142]

[185] Md. Mostofa Ali Patwary, Rob H. Bisseling, and Fredrik Manne. Parallel greedy graph matching using an edge partitioning approach. In Proceedings of the Fourth International Workshop on High-level Parallel Programming and Applications, HLPP '10, pages 45–54, New York, NY, USA, 2010. ACM. [Cited on pp. 59]

[186] François Pellegrini. SCOTCH 5.1 User's Guide. Laboratoire Bordelais de Recherche en Informatique (LaBRI), 2008. [Cited on pp. 4, 16, 23]

[187] Daniel M. Pelt and Rob H. Bisseling. A medium-grain method for fast 2D bipartitioning of sparse matrices. In IPDPS 2014, IEEE 28th International Parallel and Distributed Processing Symposium, pages 529–539, Phoenix, Arizona, May 2014. [Cited on pp. 52]

[188] Ioakeim Perros, Evangelos E. Papalexakis, Fei Wang, Richard Vuduc, Elizabeth Searles, Michael Thompson, and Jimeng Sun. SPARTan: Scalable PARAFAC2 for large and sparse data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 375–384, New York, NY, USA, 2017. ACM. [Cited on pp. 156]

[189] Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834–4846, Oct 2013. [Cited on pp. 52]

[190] Juan C. Pichel, Francisco F. Rivera, Marcos Fernández, and Aurelio Rodríguez. Optimization of sparse matrix–vector multiplication using reordering techniques on GPUs. Microprocessors and Microsystems, 36(2):65–77, 2012. [Cited on pp. 160]

[191] Matthias Poloczek and Mario Szegedy. Randomized greedy algorithms for the maximum matching problem with new analysis. In IEEE 53rd Annual Sym. on Foundations of Computer Science (FOCS), pages 708–717, New Brunswick, NJ, USA, 2012. [Cited on pp. 58]

[192] Alex Pothen. Sparse null bases and marriage theorems. PhD thesis, Dept. Computer Science, Cornell Univ., Ithaca, New York, 1984. [Cited on pp. 68]

[193] Alex Pothen. The complexity of optimal elimination trees. Technical Report CS-88-13, Pennsylvania State Univ., 1988. [Cited on pp. 131]

[194] Alex Pothen and Chin-Ju Fan. Computing the block triangular form of a sparse matrix. ACM Transactions on Mathematical Software, 16(4):303–324, 1990. [Cited on pp. 58, 68, 115]

[195] Alex Pothen, S. M. Ferdous, and Fredrik Manne. Approximation algorithms in combinatorial scientific computing. Acta Numerica, 28:541–633, 2019. [Cited on pp. 58]

[196] Louis-Noël Pouchet. Polybench: The polyhedral benchmark suite. URL: http://web.cse.ohio-state.edu/~pouchet/software/polybench, 2012. [Cited on pp. 25, 26]

[197] Michael O. Rabin and Vijay V. Vazirani. Maximum matchings in general graphs through randomization. Journal of Algorithms, 10(4):557–567, December 1989. [Cited on pp. 76]

[198] Daniel Ruiz. A scaling algorithm to equilibrate both row and column norms in matrices. Technical Report TR-2001-034, RAL, 2001. [Cited on pp. 55, 56]

[199] Peter Sanders and Christian Schulz. Engineering multilevel graph partitioning algorithms. In Camil Demetrescu and Magnús M. Halldórsson, editors, Algorithms – ESA 2011, 19th Annual European Symposium, September 5-9, 2011, pages 469–480, Saarbrücken, Germany, 2011. [Cited on pp. 4, 16, 23]

[200] Robert Schreiber. A new implementation of sparse Gaussian elimination. ACM Transactions on Mathematical Software, 8(3):256–276, 1982. [Cited on pp. 131]

[201] R. Oğuz Selvitopi, Muhammet Mustafa Özdal, and Cevdet Aykanat. A novel method for scaling iterative solvers: Avoiding latency overhead of parallel sparse-matrix vector multiplies. IEEE Transactions on Parallel and Distributed Systems, 26(3):632–645, 2015. [Cited on pp. 51]

[202] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21:343–348, 1967. [Cited on pp. 55, 69, 114]

[203] Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. FROSTT: The formidable repository of open sparse tensors and tools. http://frostt.io/, 2017. [Cited on pp. 52, 126, 156, 161]

[204] Shaden Smith and George Karypis. A medium-grained algorithm for sparse tensor factorization. In 2016 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016, Chicago, IL, USA, May 23-27, 2016, pages 902–911, 2016. [Cited on pp. 37, 52]

[205] Shaden Smith and George Karypis. SPLATT: The Surprisingly ParalleL spArse Tensor Toolkit (Version 1.1.1). Available from https://github.com/ShadenSmith/splatt, 2016. [Cited on pp. 37, 52, 157]

[206] Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George Karypis. SPLATT: Efficient and parallel sparse tensor-matrix multiplication. In 29th IEEE International Parallel & Distributed Processing Symposium, pages 61–70, Hyderabad, India, May 2015. [Cited on pp. 11, 36, 37, 38, 39, 42, 158, 159]

[207] Robert E. Tarjan and Mihalis Yannakakis. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Computing, 13(3):566–579, 1984. [Cited on pp. 149]

[208] Bora Uçar and Cevdet Aykanat. Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies. SIAM Journal on Scientific Computing, 25(6):1837–1859, 2004. [Cited on pp. 47, 51]

[209] Bora Uçar and Cevdet Aykanat. Revisiting hypergraph models for sparse matrix partitioning. SIAM Review, 49(4):595–603, 2007. [Cited on pp. 42]

[210] Vijay V. Vazirani. An improved definition of blossoms and a simpler proof of the MV matching algorithm. CoRR, abs/1210.4594, 2012. [Cited on pp. 76]

[211] David W. Walkup. Matchings in random regular bipartite digraphs. Discrete Mathematics, 31(1):59–64, 1980. [Cited on pp. 59, 67, 76, 81, 114, 121]

[212] Chris Walshaw. Multilevel refinement for combinatorial optimisation problems. Annals of Operations Research, 131(1):325–372, Oct 2004. [Cited on pp. 25]

[213] Eric S. H. Wong, Evangeline F. Y. Young, and Wai-Kei Mak. Clustering based acyclic multi-way partitioning. In Proceedings of the 13th ACM Great Lakes Symposium on VLSI, GLSVLSI '03, pages 203–206, New York, NY, USA, 2003. ACM. [Cited on pp. 4]

[214] Tao Yang and Apostolos Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951–967, 1994. [Cited on pp. 131]

[215] Albert-Jan Nicholas Yzelman and Dirk Roose. High-level strategies for parallel shared-memory sparse matrix-vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 25(1):116–125, Jan 2014. [Cited on pp. 160]
