Parallel Numerical Algorithms
Chapter 3 – Dense Linear Systems
Section 3.1 – Vector and Matrix Products

Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign
CS 554 / CSE 512

Source: solomonik.cs.illinois.edu/teaching/cs554/slides/slides_05.pdf



Outline

1 BLAS

2 Inner Product

3 Outer Product

4 Matrix-Vector Product

5 Matrix-Matrix Product


Basic Linear Algebra Subprograms

Basic Linear Algebra Subprograms (BLAS) are building blocks for many other matrix computations

BLAS encapsulate basic operations on vectors and matrices so they can be optimized for a particular computer architecture while high-level routines that call them remain portable

BLAS offer good opportunities for optimizing utilization of the memory hierarchy

Generic BLAS are available from netlib, and many computer vendors provide custom versions optimized for their particular systems


Examples of BLAS

Level  Work    Example  Function
1      O(n)    daxpy    Scalar × vector + vector
               ddot     Inner product
               dnrm2    Euclidean vector norm
2      O(n^2)  dgemv    Matrix-vector product
               dtrsv    Triangular solve
               dger     Outer product
3      O(n^3)  dgemm    Matrix-matrix product
               dtrsm    Multiple triangular solves
               dsyrk    Symmetric rank-k update

The effective time per flop decreases with level:

γ1 > γ2 ≫ γ3

where γk is the effective sec/flop achieved by BLAS level k
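As a purely illustrative sketch, one routine from each level can be mimicked in plain Python; the names echo daxpy, dgemv, and dgemm, but these are not the optimized Fortran BLAS themselves:

```python
# Illustrative stand-ins for one routine from each BLAS level.

def daxpy(alpha, x, y):
    # Level 1: y <- alpha*x + y -- Theta(n) flops on Theta(n) data
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def dgemv(A, x):
    # Level 2: y <- A x -- Theta(n^2) flops on Theta(n^2) data
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dgemm(A, B):
    # Level 3: C <- A B -- Theta(n^3) flops on Theta(n^2) data,
    # the only level whose flop count exceeds its data volume,
    # which is why gamma_3 can be driven far below gamma_1
    n, m = len(A), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(m)] for i in range(n)]
```

The comments note the flop-to-data ratio at each level, which is what makes level-3 routines so much more amenable to memory-hierarchy optimization.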


Inner Product

Inner product of two n-vectors x and y is given by

x^T y = Σ_{i=1}^{n} xi yi

Computation of the inner product requires n multiplications and n − 1 additions

M1 = Θ(n), Q1 = Θ(n), T1 = Θ(γn)

Effectively as hard as a scalar reduction; can be done via binary or binomial tree summation
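A minimal serial sketch of the tree-summation idea — the recursion depth, not actual parallelism here, models the Θ(log n) critical path:

```python
def tree_sum(values):
    # Pairwise (binary-tree) summation: the two halves could be summed
    # concurrently, giving a Theta(log n) critical path instead of the
    # Theta(n) depth of a left-to-right running sum.
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])

def inner_product(x, y):
    # n local multiplications followed by a tree reduction
    return tree_sum([xi * yi for xi, yi in zip(x, y)])
```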


Parallel Algorithm

Partition

For i = 1, . . . , n, fine-grain task i stores xi and yi, and computes their product xi yi

Communicate

Sum reduction over n fine-grain tasks

[Figure: nine fine-grain tasks, task i holding the product xi yi, combined by a sum reduction]


Fine-Grain Parallel Algorithm

zi = xi yi                                      { local scalar product }
reduce zi across all tasks i = 1, . . . , n     { sum reduction }


Agglomeration and Mapping

Agglomerate

Combine k components of both x and y to form each coarse-grain task, which computes inner product of these subvectors

Communication becomes sum reduction over n/k coarse-grain tasks

Map

Assign (n/k)/p coarse-grain tasks to each of p processors, for a total of n/p components of x and y per processor

[Figure: fine-grain products xi yi agglomerated into coarse-grain tasks, each summing its k local products before the reduction]


Coarse-Grain Parallel Algorithm

zi = x[i]^T y[i]                                    { local inner product }
reduce zi across all processors i = 1, . . . , p    { sum reduction }

[ x[i] – subvector of x assigned to processor i ]
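A serial sketch of the coarse-grain algorithm; the helper is hypothetical, it assumes p divides n, and the final `sum` stands in for the reduction over processors:

```python
def coarse_inner_product(x, y, p):
    # Each of p "processors" owns a contiguous subvector of n/p
    # components and computes a local inner product z_i = x[i]^T y[i].
    n = len(x)
    b = n // p  # block size; assumes p divides n
    partials = [sum(x[k] * y[k] for k in range(r * b, (r + 1) * b))
                for r in range(p)]
    # Sum reduction over the p partial results gives x^T y.
    return sum(partials)
```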


Performance

The parallel costs (Lp, Wp, Fp) for the inner product are given by:

Computational cost Fp = Θ(n/p) regardless of network

The latency and bandwidth costs depend on the network:

1-D mesh: Lp, Wp = Θ(p)

2-D mesh: Lp, Wp = Θ(√p)

hypercube: Lp, Wp = Θ(log p)

For a hypercube or fully connected network, the time is

Tp = αLp + βWp + γFp = Θ(α log(p) + γn/p)

Efficiency and scaling are the same as for a binary tree sum
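These cost expressions are easy to tabulate; a sketch with constants suppressed and unit-free α, γ:

```python
import math

def t_inner_hypercube(n, p, alpha, gamma):
    # T_p = Theta(alpha*log p + gamma*n/p): tree reduction on a hypercube
    return alpha * math.log2(p) + gamma * n / p

def t_inner_1d_mesh(n, p, alpha, gamma):
    # On a 1-D mesh the reduction itself takes Theta(p) steps
    return alpha * p + gamma * n / p
```

For example, with n = 2^20 and α/γ = 100, the hypercube time keeps falling with p long after the 1-D mesh time has turned around, previewing the scalability limits derived next.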


Inner product on 1-D Mesh

For a 1-D mesh, total time is Tp = Θ(γn/p + αp)

To determine strong scalability, we set efficiency to a constant and solve for ps:

const = E_ps = T1/(ps T_ps) = Θ( γn / (γn + α ps^2) ) = Θ( 1 / (1 + (α/γ) ps^2/n) )

which yields ps = Θ(√((γ/α)n))

1-D mesh is weakly scalable to pw = Θ((γ/α)n) processors:

E_pw(pw·n) = Θ( 1 / (1 + (α/γ) pw^2/(pw·n)) ) = Θ( 1 / (1 + (α/γ) pw/n) )


Inner product on 2-D Mesh

For a 2-D mesh, total time is Tp = Θ(γn/p + α√p)

To determine strong scalability, we set efficiency to a constant and solve for ps:

const = E_ps = T1/(ps T_ps) = Θ( γn / (γn + α ps^{3/2}) ) = Θ( 1 / (1 + (α/γ) ps^{3/2}/n) )

which yields ps = Θ((γ/α)^{2/3} n^{2/3})

2-D mesh is weakly scalable to pw = Θ((γ/α)^2 n^2), since

E_pw(pw·n) = Θ( 1 / (1 + (α/γ) pw^{3/2}/(pw·n)) ) = Θ( 1 / (1 + (α/γ)√pw/n) )


Outer Product

Outer product of two n-vectors x and y is the n × n matrix Z = x y^T whose (i, j) entry is zij = xi yj

For example,

[x1]  [y1]^T   [x1y1  x1y2  x1y3]
[x2]  [y2]   = [x2y1  x2y2  x2y3]
[x3]  [y3]     [x3y1  x3y2  x3y3]

Computation of the outer product requires n^2 multiplications,

M1 = Θ(n^2), Q1 = Θ(n^2), T1 = Θ(γn^2)

(in this case, we should treat M1 as the output size or define the problem as in the BLAS: Z = Z_input + x y^T)
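A direct serial sketch of the definition:

```python
def outer_product(x, y):
    # Z = x y^T with z_ij = x_i * y_j: n^2 multiplications, no additions
    return [[xi * yj for yj in y] for xi in x]
```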


Parallel Algorithm

Partition

For i, j = 1, . . . , n, fine-grain task (i, j) computes and stores zij = xi yj, yielding 2-D array of n^2 fine-grain tasks

Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either

for some j, task (i, j) stores xi and task (j, i) stores yi, or
task (i, i) stores both xi and yi, i = 1, . . . , n

Communicate

For i = 1, . . . , n, task that stores xi broadcasts it to all other tasks in ith task row

For j = 1, . . . , n, task that stores yj broadcasts it to all other tasks in jth task column


Fine-Grain Tasks and Communication

[Figure: 6 × 6 array of fine-grain outer-product tasks, task (i, j) computing xi yj; xi is broadcast along task row i and yj along task column j]


Fine-Grain Parallel Algorithm

broadcast xi to tasks (i, k), k = 1, . . . , n   { horizontal broadcast }
broadcast yj to tasks (k, j), k = 1, . . . , n   { vertical broadcast }
zij = xi yj                                      { local scalar product }


Agglomeration

Agglomerate

With n × n array of fine-grain tasks, natural strategies are

2-D: Combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)^2 coarse-grain tasks

1-D column: Combine n fine-grain tasks in each column into a coarse-grain task, yielding n coarse-grain tasks

1-D row: Combine n fine-grain tasks in each row into a coarse-grain task, yielding n coarse-grain tasks


2-D Agglomeration

Each task that stores a portion of x must broadcast its subvector to all other tasks in its task row

Each task that stores a portion of y must broadcast its subvector to all other tasks in its task column


2-D Agglomeration

[Figure: 2-D agglomeration of the 6 × 6 task array into a 3 × 3 array of coarse-grain tasks, each a 2 × 2 block of fine-grain tasks]


1-D Agglomeration

If either x or y is stored in one task, then a broadcast is required to communicate the needed values to all other tasks

If either x or y is distributed across tasks, then a multinode broadcast is required to communicate the needed values to other tasks


1-D Column Agglomeration

[Figure: 1-D column agglomeration of the 6 × 6 task array — each coarse-grain task holds one column of fine-grain tasks]


1-D Row Agglomeration

[Figure: 1-D row agglomeration of the 6 × 6 task array — each coarse-grain task holds one row of fine-grain tasks]


Mapping

Map

2-D: Assign (n/k)^2/p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 2-D mesh

1-D: Assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating target network as 1-D mesh


2-D Agglomeration with Block Mapping

[Figure: 2-D agglomeration of the outer-product tasks with block mapping onto a process grid]


1-D Column Agglomeration with Block Mapping

[Figure: 1-D column agglomeration with block mapping — contiguous columns of the task array assigned to each process]


1-D Row Agglomeration with Block Mapping

[Figure: 1-D row agglomeration with block mapping — contiguous rows of the task array assigned to each process]


Coarse-Grain Parallel Algorithm

broadcast x[i] to ith process row        { horizontal broadcast }
broadcast y[j] to jth process column     { vertical broadcast }
Z[i][j] = x[i] y[j]^T                    { local outer product }

[ Z[i][j] – submatrix of Z assigned to process (i, j) by the mapping ]


Performance

The parallel costs (Lp, Wp, Fp) for the outer product are:

Computational cost Fp = Θ(n^2/p) regardless of network

The latency and bandwidth costs can be derived from the cost of broadcast/allgather:

1-D agglomeration: Lp = Θ(log p), Wp = Θ(n)

2-D agglomeration: Lp = Θ(log p), Wp = Θ(n/√p)

For 1-D agglomeration, execution time is

Tp^{1-D} = Tp^{allgather}(n) + Θ(γn^2/p) = Θ(α log(p) + βn + γn^2/p)

For 2-D agglomeration, execution time is

Tp^{2-D} = 2 T_{√p}^{bcast}(n/√p) + Θ(γn^2/p) = Θ(α log(p) + βn/√p + γn^2/p)
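The two time models can be compared directly; a sketch with constants suppressed:

```python
import math

def t_outer_1d(n, p, alpha, beta, gamma):
    # 1-D agglomeration: allgather of n words plus local work
    return alpha * math.log2(p) + beta * n + gamma * n * n / p

def t_outer_2d(n, p, alpha, beta, gamma):
    # 2-D agglomeration: two broadcasts of n/sqrt(p) words plus local work
    return alpha * math.log2(p) + beta * n / math.sqrt(p) + gamma * n * n / p
```

The 2-D bandwidth term shrinks as p grows while the 1-D term stays Θ(n), which is why 2-D agglomeration scales further.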


Outer Product Strong Scaling

1-D agglomeration is strongly scalable to

ps = Θ(min((γ/α)n^2 / log((γ/α)n^2), (γ/β)n))

processors, since

E_ps^{1-D} = Θ(1 / (1 + (α/γ) log(ps) ps/n^2 + (β/γ) ps/n))

2-D agglomeration is strongly scalable to

ps = Θ(min((γ/α)n^2 / log((γ/α)n^2), (γ/β)^2 n^2))

processors, since

E_ps^{2-D} = Θ(1 / (1 + (α/γ) log(ps) ps/n^2 + (β/γ)√ps/n))


Outer Product Weak Scaling

1-D agglomeration is weakly scalable to

pw = Θ(min(2^{(γ/α)n^2}, (γ/β)^2 n^2))

processors, since

E_pw^{1-D}(√pw · n) = Θ(1 / (1 + (α/γ) log(pw)/n^2 + (β/γ)√pw/n))

2-D agglomeration is weakly scalable to

pw = Θ(2^{(γ/α)n^2})

processors, since

E_pw^{2-D}(√pw · n) = Θ(1 / (1 + (α/γ) log(pw)/n^2 + (β/γ)/n))


Memory Requirements

The memory requirements are dominated by storing Z, which in practice makes the outer product a poor primitive (local flop-to-byte ratio of 1)

If possible, Z should be operated on as it is computed, e.g. if we really need

vi = Σ_j f(xi yj) for some scalar function f

If Z does not need to be stored, vector storage dominates:

Without further modification, the 1-D algorithm requires Mp = Θ(np) overall storage for the vector

Without further modification, the 2-D algorithm requires Mp = Θ(n√p) overall storage for the vector
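The fused pattern above can be sketched directly — v is accumulated as entries of Z are generated, so Z is never materialized (f is any scalar function supplied by the caller):

```python
def fused_outer_reduce(x, y, f):
    # v_i = sum_j f(x_i * y_j), consuming each z_ij as it is produced;
    # storage is Theta(n) instead of the Theta(n^2) needed to hold Z.
    return [sum(f(xi * yj) for yj in y) for xi in x]
```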


Matrix-Vector Product

Consider the matrix-vector product

y = Ax

where A is an n × n matrix and x and y are n-vectors

Components of vector y are given by the inner products

yi = Σ_{j=1}^{n} aij xj,  i = 1, . . . , n

The sequential memory, work, and time are

M1 = Θ(n^2), Q1 = Θ(n^2), T1 = Θ(γn^2)


Parallel Algorithm

Partition

For i, j = 1, . . . , n, fine-grain task (i, j) stores aij and computes aij xj, yielding 2-D array of n^2 fine-grain tasks

Assuming no replication of data, at most 2n fine-grain tasks store components of x and y, say either

for some j, task (j, i) stores xi and task (i, j) stores yi, or
task (i, i) stores both xi and yi, i = 1, . . . , n

Communicate

For j = 1, . . . , n, task that stores xj broadcasts it to all other tasks in jth task column

For i = 1, . . . , n, sum reduction over ith task row gives yi


Fine-Grain Tasks and Communication

[Figure: 6 × 6 array of fine-grain matrix-vector tasks, task (i, j) holding aij and computing aij xj; diagonal task (i, i) holds xi and yi]


Fine-Grain Parallel Algorithm

broadcast xj to tasks (k, j), k = 1, . . . , n    { vertical broadcast }
yi = aij xj                                       { local scalar product }
reduce yi across tasks (i, k), k = 1, . . . , n   { horizontal sum reduction }


Agglomeration

Agglomerate

With n × n array of fine-grain tasks, natural strategies are

2-D: Combine k × k subarray of fine-grain tasks to form each coarse-grain task, yielding (n/k)^2 coarse-grain tasks

1-D column: Combine n fine-grain tasks in each column into a coarse-grain task, yielding n coarse-grain tasks

1-D row: Combine n fine-grain tasks in each row into a coarse-grain task, yielding n coarse-grain tasks


2-D Agglomeration

Subvector of x broadcast along each task column

Each task computes local matrix-vector product of its submatrix of A with its subvector of x

Sum reduction along each task row produces subvector of result y


2-D Agglomeration

[Figure: 2-D agglomeration of the matrix-vector task array into 2 × 2 blocks of fine-grain tasks]


1-D Agglomeration

1-D column agglomeration

Each task computes the product of its component of x times its column of the matrix, with no communication required

Sum reduction across tasks then produces y

1-D row agglomeration

If x is stored in one task, then a broadcast is required to communicate the needed values to all other tasks

If x is distributed across tasks, then a multinode broadcast is required to communicate the needed values to other tasks

Each task computes the inner product of its row of A with the entire vector x to produce its component of y


1-D Column Agglomeration

[Figure: 1-D column agglomeration — each coarse-grain task holds one column of A and the corresponding component of x]


1-D Row Agglomeration

[Figure: 1-D row agglomeration — each coarse-grain task holds one row of A and computes one component of y]


1-D Agglomeration

Column and row algorithms are dual to each other

Column algorithm begins with communication-free local vector scaling (daxpy) computations, combined across processors by a reduction

Row algorithm begins with a broadcast, followed by communication-free local inner-product (ddot) computations
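The duality is easy to see in serial sketches — identical results, different loop orders, and hence different communication patterns once the data is distributed:

```python
def matvec_rows(A, x):
    # Row algorithm: y_i is a ddot of row i with all of x
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_columns(A, x):
    # Column algorithm: y accumulates daxpy updates y += x_j * A[:, j];
    # with distributed columns, the partial y's are combined by a reduction
    y = [0] * len(A)
    for j, xj in enumerate(x):
        for i in range(len(A)):
            y[i] += A[i][j] * xj
    return y
```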


Mapping

Map

2-D: Assign (n/k)^2/p coarse-grain tasks to each of p processes using any desired mapping in each dimension, treating target network as 2-D mesh

1-D: Assign n/p coarse-grain tasks to each of p processes using any desired mapping, treating target network as 1-D mesh


2-D Agglomeration with Block Mapping

[Figure: 2-D agglomeration of the matrix-vector tasks with block mapping onto a process grid]


1-D Column Agglomeration with Block Mapping

[Figure: fine-grain tasks a_ij x_j for a 6 × 6 matrix, agglomerated by columns under 1-D column block mapping]


1-D Row Agglomeration with Block Mapping

[Figure: fine-grain tasks a_ij x_j for a 6 × 6 matrix, agglomerated by rows under 1-D row block mapping]


Coarse-Grain Parallel Algorithm

broadcast x[j] to jth process column    { vertical broadcast }
y[i] = A[i][j] x[j]                     { local matrix-vector product }
reduce y[i] across ith process row      { horizontal sum reduction }
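The coarse-grain algorithm above can be sketched (not from the slides) as a sequential Python simulation: a hypothetical 2 × 2 process grid is modeled by nested loops, and the broadcast and reduction are modeled by plain indexing.

```python
# Sketch: simulating the coarse-grain 2-D matrix-vector algorithm on a
# 2x2 process grid for a 4x4 matrix. Process (i,j) owns block A[i][j].

def block(M, i, j, b):
    """Extract the b-by-b block (i,j) of matrix M."""
    return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]

n, q = 4, 2            # matrix size; process grid is q x q
b = n // q             # block size per process
A = [[i*n + j + 1 for j in range(n)] for i in range(n)]
x = [1, 2, 3, 4]

y = [0] * n
for i in range(q):                     # process row i
    for j in range(q):                 # process column j
        Aij = block(A, i, j, b)
        xj = x[j*b:(j+1)*b]            # x[j] broadcast down column j
        for r in range(b):             # local matrix-vector product
            partial = sum(Aij[r][c] * xj[c] for c in range(b))
            y[i*b + r] += partial      # reduction across process row i

# Check against the serial product
assert y == [sum(A[r][c] * x[c] for c in range(n)) for r in range(n)]
```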


Performance

The parallel costs (Lp, Wp, Fp) for the matrix-vector product are

Computational cost Fp = Θ(n^2/p) regardless of network

Communication costs can be derived from the cost of collectives

1-D agglomeration: Lp = Θ(log p), Wp = Θ(n)

2-D agglomeration: Lp = Θ(log p), Wp = Θ(n/√p)

For 1-D row agglomeration, perform allgather,

T^{1-D}_p = T^allgather_p(n) + Θ(γ n^2/p) = Θ(α log(p) + β n + γ n^2/p)

For 2-D agglomeration, perform broadcast and reduction,

T^{2-D}_p = T^bcast_{√p}(n/√p) + T^reduce_{√p}(n/√p) + Θ(γ n^2/p)
          = Θ(α log(p) + β n/√p + γ n^2/p)
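To see how the two models compare, a minimal sketch evaluating them for illustrative (assumed) machine parameters α, β, γ, with all hidden constants taken as 1:

```python
# Sketch: the 1-D and 2-D matrix-vector time models with assumed
# latency (alpha), inverse bandwidth (beta), and flop time (gamma).
from math import log2, sqrt

def t_1d(n, p, alpha=1e-6, beta=1e-9, gamma=1e-11):
    return alpha*log2(p) + beta*n + gamma*n*n/p

def t_2d(n, p, alpha=1e-6, beta=1e-9, gamma=1e-11):
    return alpha*log2(p) + beta*n/sqrt(p) + gamma*n*n/p

# 2-D agglomeration moves less data (beta term n/sqrt(p) vs n),
# so it is faster under this model:
assert t_2d(10**4, 64) < t_1d(10**4, 64)
```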


Matrix-Matrix Product

Consider matrix-matrix product

C = AB

where A, B, and result C are n × n matrices

Entries of matrix C are given by

c_ij = Σ_{k=1}^n a_ik b_kj,  i, j = 1, . . . , n

Each of the n^2 entries of C requires n multiply-add operations, so model serial time as

T1 = γ n^3
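The model counts the n multiply-adds behind each of the n^2 entries; a minimal serial sketch:

```python
# Sketch: the serial triple loop behind the T1 = gamma*n^3 model --
# each of the n^2 entries c_ij accumulates n multiply-adds.

def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):        # n multiply-adds per entry
                C[i][j] += A[i][k] * B[k][j]
    return C

# Multiplying by the identity returns the matrix unchanged
assert matmul([[1, 2], [3, 4]], [[1, 0], [0, 1]]) == [[1, 2], [3, 4]]
```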


Matrix-Matrix Product

Matrix-matrix product can be viewed as

n^2 inner products, or

sum of n outer products, or

n matrix-vector products

and each viewpoint yields different algorithm

One way to derive parallel algorithms for matrix-matrix product is to apply parallel algorithms already developed for inner product, outer product, or matrix-vector product

However, considering the problem as a whole yields the best algorithms


Parallel Algorithm

Partition

For i, j, k = 1, . . . , n, fine-grain task (i, j, k) computes the product a_ik b_kj, yielding a 3-D array of n^3 fine-grain tasks

Assuming no replication of data, at most 3n^2 fine-grain tasks store entries of A, B, or C; say task (i, j, j) stores a_ij, task (i, j, i) stores b_ij, and task (i, j, k) stores c_ij for i, j = 1, . . . , n and some fixed k

[Figure: 3-D array of fine-grain tasks with axes i, j, k]

We refer to subsets of tasks along the i, j, and k dimensions as rows, columns, and layers, respectively, so the kth column of A and the kth row of B are stored in the kth layer of tasks


Parallel Algorithm

Communicate

Broadcast entries of jth column of A horizontally along each task row in jth layer

Broadcast entries of ith row of B vertically along each task column in ith layer

For i, j = 1, . . . , n, result c_ij is given by sum reduction over tasks (i, j, k), k = 1, . . . , n


Fine-Grain Algorithm

broadcast a_ik to tasks (i, q, k), q = 1, . . . , n    { horizontal broadcast }
broadcast b_kj to tasks (q, j, k), q = 1, . . . , n    { vertical broadcast }
c_ij = a_ik b_kj                                       { local scalar product }
reduce c_ij across tasks (i, j, q), q = 1, . . . , n   { lateral sum reduction }


Agglomeration

Agglomerate

With n × n × n array of fine-grain tasks, natural strategies are

3-D: Combine q × q × q subarray of fine-grain tasks

2-D: Combine q × q × n subarray of fine-grain tasks, eliminating sum reductions

1-D column: Combine n × 1 × n subarray of fine-grain tasks, eliminating vertical broadcasts and sum reductions

1-D row: Combine 1 × n × n subarray of fine-grain tasks, eliminating horizontal broadcasts and sum reductions


Mapping

Map

Corresponding mapping strategies are

3-D: Assign (n/q)^3/p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 3-D mesh

2-D: Assign (n/q)^2/p coarse-grain tasks to each of p processors using any desired mapping in each dimension, treating target network as 2-D mesh

1-D: Assign n/p coarse-grain tasks to each of p processors using any desired mapping, treating target network as 1-D mesh


Agglomeration with Block Mapping

[Figure: block mappings of the n × n × n task array under 1-D row, 1-D column, 2-D, and 3-D agglomerations]


Coarse-Grain 3-D Parallel Algorithm

broadcast A[i][k] to ith processor row       { horizontal broadcast }
broadcast B[k][j] to jth processor column    { vertical broadcast }
C[i][j] = A[i][k] B[k][j]                    { local matrix product }
reduce C[i][j] across processor layers       { lateral sum reduction }
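As a sketch (assumptions: a 2 × 2 × 2 process grid simulated by loops, broadcasts and reductions modeled by indexing), the 3-D algorithm can be checked against a serial product:

```python
# Sketch: simulating the coarse-grain 3-D algorithm on a 2x2x2 process
# grid (p = 8) for 4x4 matrices. Process (i,j,k) multiplies block
# A[i][k] by block B[k][j]; summing over k models the lateral reduction.

def block(M, i, j, b):
    return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]

def block_mul(X, Y):
    b = len(X)
    return [[sum(X[r][t] * Y[t][c] for t in range(b)) for c in range(b)]
            for r in range(b)]

n, q = 4, 2
b = n // q
A = [[i*n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % 3 for j in range(n)] for i in range(n)]

C = [[0] * n for _ in range(n)]
for i in range(q):
    for j in range(q):
        for k in range(q):                     # layer k of the task grid
            P = block_mul(block(A, i, k, b), block(B, k, j, b))
            for r in range(b):                 # lateral sum reduction
                for c in range(b):
                    C[i*b + r][j*b + c] += P[r][c]

serial = [[sum(A[r][t] * B[t][c] for t in range(n)) for c in range(n)]
          for r in range(n)]
assert C == serial
```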


Agglomeration with Block Mapping

The figures show the same 2 × 2 block product under each scheme:

[ A11 A12 ] [ B11 B12 ]   [ A11 B11 + A12 B21   A11 B12 + A12 B22 ]
[ A21 A22 ] [ B21 B22 ] = [ A21 B11 + A22 B21   A21 B12 + A22 B22 ]

with 1-D column, 1-D row, and 2-D agglomerations differing only in which blocks of A, B, and C each process owns.


Coarse-Grain 2-D Parallel Algorithm

allgather A[i][j] in ith processor row       { horizontal broadcast }
allgather B[i][j] in jth processor column    { vertical broadcast }
C[i][j] = 0
for k = 1, . . . , √p
    C[i][j] = C[i][j] + A[i][k] B[k][j]      { sum local products }
end


SUMMA Algorithm

Algorithm just described requires excessive memory, since each process accumulates √p blocks of both A and B

One way to reduce memory requirements is to

broadcast blocks of A successively across processor rows

broadcast blocks of B successively across processor columns

C[i][j] = 0
for k = 1, . . . , √p
    broadcast A[i][k] in ith processor row       { horizontal broadcast }
    broadcast B[k][j] in jth processor column    { vertical broadcast }
    C[i][j] = C[i][j] + A[i][k] B[k][j]          { sum local products }
end
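A sequential sketch of the SUMMA loop (the 2 × 2 grid and the broadcasts are simulated; only the current blocks A[i][k] and B[k][j] are referenced at step k, which is what keeps the memory footprint small):

```python
# Sketch: SUMMA on a 2x2 process grid. At step k each process touches
# only the blocks A[i][k] and B[k][j] it would receive by broadcast.

def block(M, i, j, b):
    return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]

n, q = 4, 2
b = n // q
A = [[i + 2*j for j in range(n)] for i in range(n)]
B = [[i*j + 1 for j in range(n)] for i in range(n)]

C = [[0] * n for _ in range(n)]
for k in range(q):                        # one broadcast step per k
    for i in range(q):
        for j in range(q):
            Aik = block(A, i, k, b)       # broadcast in processor row i
            Bkj = block(B, k, j, b)       # broadcast in processor column j
            for r in range(b):
                for c in range(b):
                    C[i*b + r][j*b + c] += sum(
                        Aik[r][t] * Bkj[t][c] for t in range(b))

serial = [[sum(A[r][t] * B[t][c] for t in range(n)) for c in range(n)]
          for r in range(n)]
assert C == serial
```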


SUMMA Algorithm

[Figure: SUMMA on 16 CPUs (4 × 4 grid): at each step, a block column of A is broadcast across processor rows and a block row of B is broadcast down processor columns]


Cannon Algorithm

Another approach, due to Cannon (1969), is to circulate blocks of B vertically and blocks of A horizontally in ring fashion

Blocks of both matrices must be initially aligned using circular shifts so that correct blocks meet as needed

Requires less memory than SUMMA and replaces line broadcasts with shifts, lowering the number of messages
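Cannon's shifts can be modeled with index arithmetic: after the initial alignment, at step s the process at (i, j) holds blocks A[i][k] and B[k][j] with k = (i + j + s) mod q. A sequential sketch under that assumption:

```python
# Sketch: Cannon's algorithm on a q x q block grid, modeling the skew
# and the per-step circular shifts with index arithmetic.

def block(M, i, j, b):
    return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]

n, q = 6, 3
b = n // q
A = [[(i*j) % 5 for j in range(n)] for i in range(n)]
B = [[(i + 2*j) % 4 for j in range(n)] for i in range(n)]

C = [[0] * n for _ in range(n)]
for step in range(q):
    for i in range(q):
        for j in range(q):
            k = (i + j + step) % q        # aligned block index
            Aik = block(A, i, k, b)       # A block after horizontal shifts
            Bkj = block(B, k, j, b)       # B block after vertical shifts
            for r in range(b):
                for c in range(b):
                    C[i*b + r][j*b + c] += sum(
                        Aik[r][t] * Bkj[t][c] for t in range(b))

serial = [[sum(A[r][t] * B[t][c] for t in range(n)) for c in range(n)]
          for r in range(n)]
assert C == serial
```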


Cannon Algorithm

[Figure: alignment and shift steps of Cannon's algorithm on a square processor grid]


Fox Algorithm

It is possible to mix techniques from SUMMA and Cannon's algorithm:

circulate blocks of B in ring fashion vertically along processor columns step by step, so that each block of B comes in conjunction with the appropriate block of A broadcast at that same step

This algorithm is due to Fox et al.

Shifts, especially in Cannon's algorithm, are harder to generalize to nonsquare processor grids than the collectives in algorithms like SUMMA


Execution Time for 3-D Agglomeration

For 3-D agglomeration, computing each of p blocks C[i][j] requires a matrix-matrix product of two (n/p^{1/3}) × (n/p^{1/3}) blocks, so

Fp = (n/p^{1/3})^3 = n^3/p

On 3-D mesh, each broadcast or reduction takes time

T^bcast_{p^{1/3}}((n/p^{1/3})^2) = O(α log p + β n^2/p^{2/3})

Total time is therefore

Tp = O(α log p + β n^2/p^{2/3} + γ n^3/p)

However, overall memory footprint is

Mp = Θ(p (n/p^{1/3})^2) = Θ(p^{1/3} n^2)


Strong Scalability of 3-D Agglomeration

The 3-D agglomeration efficiency is given by

Ep(n) = T1(n)/(p Tp(n)) = O(1/(1 + (α/γ) p log p/n^3 + (β/γ) p^{1/3}/n))

For strong scaling to ps processors we need Eps(n) = Θ(1), so 3-D agglomeration strong scales to

ps = O(min((γ/α) n^3/log(n), (γ/β) n^3)) processors


Weak Scalability of 3-D Agglomeration

For weak scaling (with constant input size per processor) to pw processors, we need Epw(n√pw) = Θ(1), which holds

However, note that the algorithm is not memory-efficient, as Mp = Θ(p^{1/3} n^2); keeping the memory footprint per processor constant requires growing n with p^{1/3}

Scaling with constant memory footprint per processor (Mp/p = const) is possible to pm processors, where Epm(n pm^{1/3}) = Θ(1), which holds for

pm = Θ(2^{(γ/α) n^3}) processors

The isoefficiency function is Q̃(p) = Θ(p log(p))


Execution Time for 2-D Agglomeration

For 2-D agglomeration (SUMMA), computation of each block C[i][j] requires √p matrix-matrix products of (n/√p) × (n/√p) blocks, so

Fp = √p (n/√p)^3 = n^3/p

For broadcasts among rows and columns of the processor grid, communication time is

2√p T^bcast_{√p}(n^2/p) = Θ(α √p log(p) + β n^2/√p)

Total time is therefore

Tp = O(α √p log(p) + β n^2/√p + γ n^3/p)

The algorithm is memory-efficient, Mp = Θ(n^2)
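The three time models can be compared numerically; a sketch with illustrative (assumed) α, β, γ and all hidden constants taken as 1:

```python
# Sketch: the 1-D, 2-D (SUMMA), and 3-D matrix-matrix time models.
from math import log2, sqrt

def t_matmul(n, p, scheme, alpha=1e-6, beta=1e-9, gamma=1e-11):
    if scheme == "1d":
        return alpha*log2(p) + beta*n*n + gamma*n**3/p
    if scheme == "2d":
        return alpha*sqrt(p)*log2(p) + beta*n*n/sqrt(p) + gamma*n**3/p
    if scheme == "3d":
        return alpha*log2(p) + beta*n*n/p**(2/3) + gamma*n**3/p

# With many processors the bandwidth term dominates, so the schemes
# order 3-D < 2-D < 1-D under this model:
n, p = 4096, 4096
assert t_matmul(n, p, "3d") < t_matmul(n, p, "2d") < t_matmul(n, p, "1d")
```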


Strong Scalability of 2-D Agglomeration

The 2-D agglomeration efficiency is given by

Ep(n) = T1(n)/(p Tp(n)) = O(1/(1 + (α/γ) p^{3/2} log p/n^3 + (β/γ) √p/n))

For strong scaling to ps processors we need Eps(n) = Θ(1), so 2-D agglomeration strong scales to

ps = O(min((γ/α) n^2/log(n)^{2/3}, (γ/β) n^2)) processors

For weak scaling to pw processors with n^2/p matrix elements per processor, we need Epw(√pw n) = Θ(1), so 2-D agglomeration (SUMMA) weak scales to

pw = O(2^{(γ/α) n^3}) processors

Cannon's algorithm achieves unconditional weak scalability


Scalability for 1-D Agglomeration

For 1-D agglomeration on 1-D mesh, total time is

Tp = O(α log(p) + β n^2 + γ n^3/p)

The corresponding efficiency is

Ep = O(1/(1 + (α/γ) p log(p)/n^3 + (β/γ) p/n))

Strong scalability is possible to at most ps = O((γ/β) n) processors

Weak scalability is possible to at most pw = O(√((γ/β) n)) processors


Rectangular Matrix Multiplication

If C is m × n, A is m × k, and B is k × n, choosing a 3-D grid that optimizes volume-to-surface-area ratio yields bandwidth cost

Wp(m, n, k) = O( min_{p1 p2 p3 = p} [ mk/(p1 p2) + kn/(p1 p3) + mn/(p2 p3) ] )
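Since the grid sizes p1, p2, p3 are the only free parameters, the minimization can be sketched as a brute-force search over factorizations of p (the helper best_grid is hypothetical, not from the slides):

```python
# Sketch: brute-force search over processor grids p1*p2*p3 = p for the
# bandwidth expression mk/(p1*p2) + kn/(p1*p3) + mn/(p2*p3).

def best_grid(m, n, k, p):
    best = None
    for p1 in range(1, p + 1):
        if p % p1:
            continue
        for p2 in range(1, p // p1 + 1):
            if (p // p1) % p2:
                continue
            p3 = p // (p1 * p2)
            w = m*k/(p1*p2) + k*n/(p1*p3) + m*n/(p2*p3)
            if best is None or w < best[0]:
                best = (w, (p1, p2, p3))
    return best

# A very rectangular product (large k) puts all processors on the
# dimension that shrinks the two big faces mk and kn:
w, grid = best_grid(100, 100, 10000, 64)
assert grid == (64, 1, 1)
```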


Communication vs. Memory Tradeoff

Communication cost for 2-D algorithms for matrix-matrix product is optimal, assuming no replication of storage

If explicit replication of storage is allowed, then lower communication volume is possible via 3-D algorithm

Generally, we assign an n/p1 × n/p2 × n/p3 computation subvolume to each processor

The largest face of the subvolume gives the communication cost; the smallest face gives the minimal memory usage

    can keep smallest face local while successively computing slices of subvolume


Leveraging Additional Memory in Matrix Multiplication

Provided M̄ total available memory, 2-D and 3-D algorithms achieve bandwidth cost

Wp(n, M̄) = ∞              if M̄ < n^2
Wp(n, M̄) = n^2/√p          if M̄ = Θ(n^2)
Wp(n, M̄) = n^2/p^{2/3}     if M̄ = Θ(n^2 p^{1/3})

For general M̄, it is possible to pick a processor grid to achieve

Wp(n, M̄) = O(n^3/(√p M̄^{1/2}) + n^2/p^{2/3})

and an overall execution time of

Tp(n, M̄) = O(α(log p + n^3 √p/M̄^{3/2}) + β Wp(n, M̄) + γ n^3/p)
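A sketch checking that the general bound degenerates to the 2-D and 3-D cases above (constants taken as 1):

```python
# Sketch: the general bandwidth bound W(n, p, Mbar) and its two
# limiting memory regimes.
from math import isclose, sqrt

def W(n, p, Mbar):
    if Mbar < n*n:
        return float("inf")           # problem does not fit in memory
    return n**3 / sqrt(p * Mbar) + n**2 / p**(2/3)

n, p = 512, 64
# Minimal memory Mbar = n^2 recovers the 2-D cost n^2/sqrt(p)
# (plus the lower-order n^2/p^(2/3) term):
assert isclose(W(n, p, n*n), n*n/sqrt(p) + n*n/p**(2/3))
# Mbar = n^2 p^(1/3) recovers the 3-D cost Theta(n^2/p^(2/3)):
assert isclose(W(n, p, n*n * p**(1/3)), 2 * n*n/p**(2/3))
```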


Strong Scaling using All Available Memory

The efficiency of the algorithm for a given amount of total memory M̄p is

Ep(n, M̄p) = O(1/(1 + (α/γ)(p log p/n^3 + p^{3/2}/M̄p^{3/2}) + (β/γ)(√p/M̄p^{1/2} + p^{1/3}/n)))

For strong scaling, assuming M̄p = p M̄1, we need

Eps(n, ps M̄1) = T1(n, M̄1)/(ps Tps(n, ps M̄1)) = Θ(1)

In this case, we obtain

ps = Θ(min((γ/α) n^3/log(n), (γ/β) n^3))

as good as the 3-D algorithm, but more memory-efficient


References

R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar, A three-dimensional approach to parallel matrix multiplication, IBM J. Res. Dev., 39:575-582, 1995

J. Berntsen, Communication efficient matrix multiplication on hypercubes, Parallel Comput. 12:335-342, 1989

J. Demmel, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger, Communication-optimal parallel recursive rectangular matrix multiplication, IPDPS, 2013

J. W. Demmel, M. T. Heath, and H. A. van der Vorst, Parallel numerical linear algebra, Acta Numerica 2:111-197, 1993


R. Dias da Cunha, A benchmark study based on the parallel computation of the vector outer-product A = uv^T operation, Concurrency: Practice and Experience 9:803-819, 1997

G. C. Fox, S. W. Otto, and A. J. G. Hey, Matrix algorithms on a hypercube I: matrix multiplication, Parallel Comput. 4:17-31, 1987

D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distrib. Comput. 64:1017-1026, 2004

S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distrib. Comput. 4(2):133-172, 1987


S. L. Johnsson, Minimizing the communication time for matrix multiplication on multiprocessors, Parallel Comput. 19:1235-1257, 1993

B. Lipshitz, Communication-avoiding parallel recursive algorithms for matrix multiplication, Tech. Rept. UCB/EECS-2013-100, University of California at Berkeley, May 2013

O. McBryan and E. F. Van de Velde, Matrix and vector operations on hypercube parallel processors, Parallel Comput. 5:117-126, 1987

R. A. Van De Geijn and J. Watts, SUMMA: Scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience 9(4):255-274, 1997
