Parallel Numerics, WT 2013/2014 - 2 Elementary Linear Algebra Problems


Level-2 BLAS

• Matrix-vector operations with $O(n^2)$ operations (sequentially)

• BLAS notation:

    S   single precision
    GE  general matrix
    MV  matrix-vector product

• Together these define SGEMV, the matrix-vector product: $y = \alpha A x + \beta y$

• Other Level-2 BLAS: solving a triangular system $Lx = b$ with triangular matrix $L$.



Level-3 BLAS

• Matrix-matrix operations with $O(n^3)$ operations (sequentially)

• BLAS notation:

    S   single precision
    GE  general matrix
    MM  matrix-matrix product

• Together these define SGEMM, the matrix-matrix product: $C = \alpha A B + \beta C$


Granularity for BLAS

BLAS level   operation      formula      memory       granularity

BLAS-1       AXPY: 2n       αx + y       2n + 1       < 1
BLAS-2       GEMV: 2n^2     αAx + βy     n^2 + 2n     ≈ 2
BLAS-3       GEMM: 2n^3     αAB + βC     4n^2         n/2

Granularity is the ratio of operations to memory accesses. BLAS-3 has the best operations-to-memory ratio!
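The granularity column can be checked numerically. The following is a minimal sketch that simply divides the operation counts by the memory counts from the table; the function names `granularity` and `blas_granularity` are hypothetical helpers, not part of any BLAS interface.

```python
def granularity(ops, memory):
    """Ratio of arithmetic operations to memory accesses."""
    return ops / memory


def blas_granularity(n):
    # Operation and memory counts are taken from the table above.
    return {
        "AXPY": granularity(2 * n, 2 * n + 1),
        "GEMV": granularity(2 * n**2, n**2 + 2 * n),
        "GEMM": granularity(2 * n**3, 4 * n**2),
    }


g = blas_granularity(1000)
for name, value in g.items():
    print(f"{name}: {value:.3f}")
```

Only the GEMM ratio grows with n, which is why blocked algorithms try to spend most of their time in Level-3 operations.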




2.2. Analysis of the Matrix-Vector-Product

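$$A = (a_{ij})_{i=1,\dots,n;\ j=1,\dots,m} \in \mathbb{R}^{n \times m}, \quad b \in \mathbb{R}^m, \quad c \in \mathbb{R}^n$$

2.2.1. Vectorization

$$\begin{pmatrix} c_1 \\ \vdots \\ c_n \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} \cdot \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix} = \begin{pmatrix} a_{11} b_1 + \cdots + a_{1m} b_m \\ \vdots \\ a_{n1} b_1 + \cdots + a_{nm} b_m \end{pmatrix}$$

$$= \begin{pmatrix} \sum_{j=1}^{m} a_{1j} b_j \\ \vdots \\ \sum_{j=1}^{m} a_{nj} b_j \end{pmatrix} = \sum_{j=1}^{m} b_j \begin{pmatrix} a_{1j} \\ \vdots \\ a_{nj} \end{pmatrix}$$

The first form consists of n DOT-products of length m; the second form consists of m SAXPYs of length n (GAXPY).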


Pseudocode: ij-form

c = 0;
for i = 1,...,n
    for j = 1,...,m
        c_i = c_i + a_ij b_j
    end
end

The inner loop is a DOT-product: $c_i = A_{i\bullet} b$, the DOT-product of the i-th row of A with the vector b.


Pseudocode: ji-form

c = 0;
for j = 1,...,m
    for i = 1,...,n
        c_i = c_i + a_ij b_j
    end
end

• SAXPY: the inner loop updates the vector c with the j-th column of A

• GAXPY:

    – Sequence of SAXPYs related to the same vector
    – Advantage: the vector c that is updated can be kept in fast memory

• No additional data transfer
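The two loop orders can be sketched in plain Python as follows. This is only an illustration of the access patterns (the function names `matvec_ij` and `matvec_ji` are made up here, and the matrix is assumed to be a list of rows); it makes no claim about performance in an interpreted language.

```python
def matvec_ij(A, b):
    """ij-form: one DOT-product of length m per row of A."""
    n, m = len(A), len(b)
    c = [0.0] * n
    for i in range(n):
        for j in range(m):
            c[i] += A[i][j] * b[j]
    return c


def matvec_ji(A, b):
    """ji-form (GAXPY): one SAXPY of length n per column of A."""
    n, m = len(A), len(b)
    c = [0.0] * n
    for j in range(m):
        for i in range(n):
            c[i] += A[i][j] * b[j]
    return c


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
b = [1.0, 0.0, -1.0]
print(matvec_ij(A, b), matvec_ji(A, b))
```

Both functions compute the same vector c; they differ only in which index runs in the inner loop.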




GAXPY (repetition)

• SAXPY:

    y := y + αx

• GAXPY:

    y = y_0
    for i = 1 : n
        y := y + α_i x_i
    end

• Series of SAXPYs regarding the same vector y.

• length(GAXPY) = length(y)

• Advantage: less data transfer!


2.2.2. Parallelization by Building Blocks

Reduce the matrix-vector product to smaller matrix-vector products.

$\{1, 2, \dots, n\} = \langle 1, n \rangle = I_1 \cup I_2 \cup \dots \cup I_R$, disjoint: $I_j \cap I_k = \emptyset$ for $j \neq k$

$\{1, 2, \dots, m\} = \langle 1, m \rangle = J_1 \cup J_2 \cup \dots \cup J_S$, disjoint: $J_j \cap J_k = \emptyset$ for $j \neq k$

Use a 2-dimensional array of processors $P_{rs}$.

$P_{rs}$ gets the matrix block $A_{rs} := A(I_r, J_s)$, $b_s := b(J_s)$, $c_r := c(I_r)$.

$$c_r = \sum_{s=1}^{S} A_{rs} b_s =: \sum_{s=1}^{S} c_r^{(s)}$$


Pseudocode

for r = 1,...,R
    for s = 1,...,S
        c_r^(s) = A_rs b_s;
    end
end

Small, independent matrix-vector products. No communication necessary during the computations!

for r = 1,...,R
    c_r = 0
    for s = 1,...,S
        c_r = c_r + c_r^(s);
    end
end

Blockwise collection and addition of vectors. Rowwise communication! Fan-in.
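The two phases can be simulated sequentially. In this sketch each pair (r, s) stands in for a processor $P_{rs}$; the helper `split` and the function `blocked_matvec` are hypothetical names, and contiguous index sets are an assumption made for simplicity (any disjoint partition would do).

```python
def split(n, parts):
    """Split range(n) into `parts` contiguous, disjoint index lists."""
    bounds = [n * p // parts for p in range(parts + 1)]
    return [list(range(bounds[p], bounds[p + 1])) for p in range(parts)]


def blocked_matvec(A, b, R, S):
    n, m = len(A), len(b)
    I, J = split(n, R), split(m, S)
    # Phase 1: independent partial products c_r^(s) = A_rs b_s (no communication).
    partial = {}
    for r in range(R):
        for s in range(S):
            partial[r, s] = [sum(A[i][j] * b[j] for j in J[s]) for i in I[r]]
    # Phase 2: fan-in along s, i.e. rowwise collection and addition.
    c = [0.0] * n
    for r in range(R):
        for s in range(S):
            for local, i in enumerate(I[r]):
                c[i] += partial[r, s][local]
    return c


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
b = [1, 2, 3]
print(blocked_matvec(A, b, R=2, S=2))
```

With S = 1 the second phase is trivial, and with R = 1 all partial vectors are summed into one full-length result.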




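Blocking: Special Cases

S = 1: The computation of $A_{i\bullet} b$ is vectorizable by GAXPYs.

$$c = \begin{pmatrix} A_{1\bullet} \\ A_{2\bullet} \\ \vdots \end{pmatrix} \cdot b = \begin{pmatrix} A_{1\bullet} b \\ A_{2\bullet} b \\ \vdots \end{pmatrix}$$

No communication is necessary between the processors $P_1, \dots, P_R$.

R = 1: The products $A_{\bullet j} b_j$ are independent.

$$c = (A_{\bullet 1} \,|\, A_{\bullet 2} \,|\, \cdots) \cdot \begin{pmatrix} b_1 \\ b_2 \\ \vdots \end{pmatrix} = A_{\bullet 1} b_1 + A_{\bullet 2} b_2 + \dots$$

Then the partial results are collected from the processors $P_1, \dots, P_S$ (fan-in). The final sum is computed in one processor and is vectorizable by GAXPYs.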


Rules

1. Inner loops of a program should be simple and vectorizable.

2. Outer loops of a program should be substantial, independent, and parallelizable.

3. Reuse data (cache, minimal data transfer, blocking).


2.2.3. c = Ab for Banded Matrix

• Bandwidth β (symmetric)

• 2β+1 diagonals: main diag. + β subdiag. + β superdiag.

• β = 1: tridiagonal




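Notation: Banded Matrices $A$ and $\tilde A$

$$A = \begin{pmatrix}
a_{11} & \cdots & a_{1,\beta+1} & 0 & \cdots & 0 \\
\vdots & a_{22} & \ddots & \ddots & \ddots & \vdots \\
a_{\beta+1,1} & \ddots & \ddots & \ddots & \ddots & 0 \\
0 & \ddots & \ddots & \ddots & \ddots & a_{n-\beta,n} \\
\vdots & \ddots & \ddots & \ddots & a_{n-1,n-1} & \vdots \\
0 & \cdots & 0 & a_{n,n-\beta} & \cdots & a_{nn}
\end{pmatrix}$$

→

$$\tilde A = \begin{pmatrix}
\tilde a_{1,0} & \cdots & \tilde a_{1,\beta} & 0 & \cdots & 0 \\
\vdots & \tilde a_{2,0} & \ddots & \ddots & \ddots & \vdots \\
\tilde a_{\beta+1,-\beta} & \ddots & \ddots & \ddots & \ddots & 0 \\
0 & \ddots & \ddots & \ddots & \ddots & \tilde a_{n-\beta,\beta} \\
\vdots & \ddots & \ddots & \ddots & \tilde a_{n-1,0} & \vdots \\
0 & \cdots & 0 & \tilde a_{n,-\beta} & \cdots & \tilde a_{n,0}
\end{pmatrix}$$

The entries of $\tilde A$ are the entries of $A$, indexed by row $i$ and diagonal offset $s$ instead of row and column.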


c = Ab for Banded Matrix

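Storing the entries diagonalwise: an $n \times (2\beta + 1)$ matrix instead of $n^2$ entries.

$$\tilde a_{i,s} = a_{i,i+s} \quad \text{for row } i = 1, \dots, n$$

$$1 \le i + s \le n \quad \text{and} \quad -\beta \le s \le \beta \quad \text{and} \quad 1 \le i \le n$$

In row $i$: $1 - i \le s \le n - i$ and $-\beta \le s \le \beta$, hence

$$s \in [l_i, r_i] = [\max\{-\beta, 1 - i\}, \min\{\beta, n - i\}]$$

In diagonal $s$: $1 - s \le i \le n - s$ and $1 \le i \le n$, hence

$$i \in [l_s, r_s] = [\max\{1, 1 - s\}, \min\{n, n - s\}]$$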


Computation of the mtx-vec product based on the storage scheme on vector CPUs

For $i = 1, \dots, n$:

$$c_i = A_{i\bullet} b = \sum_j a_{ij} b_j = \sum_{s=l_i}^{r_i} a_{i,i+s} b_{i+s} = \sum_{s=l_i}^{r_i} \tilde a_{i,s} b_{i+s}$$

• General TRIAD, no SAXPY:

    for s = −β : β
        for i = max{1 − s, 1} : min{n − s, n}
            c_i = c_i + ã_{i,s} b_{i+s}
        end
    end

• or, partial DOT-product:

    for i = 1 : n
        for s = max{−β, 1 − i} : min{β, n − i}
            c_i = c_i + ã_{i,s} b_{i+s}
        end
    end

• Sparsity ⇒ fewer operations, but also loss of efficiency.
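Both loop orders can be sketched in Python with 0-based indices. The column layout `At[i][s + beta]` for the diagonalwise storage and the function names are assumptions chosen for this illustration; the index bounds are the 0-based counterparts of $[l_i, r_i]$ and $[l_s, r_s]$.

```python
def to_band(A, beta):
    """Store the band of A diagonalwise: At[i][s + beta] = A[i][i + s]."""
    n = len(A)
    At = [[0.0] * (2 * beta + 1) for _ in range(n)]
    for i in range(n):
        for s in range(max(-beta, -i), min(beta, n - 1 - i) + 1):
            At[i][s + beta] = A[i][i + s]
    return At


def band_matvec_triad(At, b, beta):
    """Diagonals s in the outer loop, rows i in the inner loop (general TRIAD)."""
    n = len(b)
    c = [0.0] * n
    for s in range(-beta, beta + 1):
        for i in range(max(0, -s), min(n, n - s)):
            c[i] += At[i][s + beta] * b[i + s]
    return c


def band_matvec_dot(At, b, beta):
    """Rows i in the outer loop, diagonals s in the inner loop (partial DOT-products)."""
    n = len(b)
    c = [0.0] * n
    for i in range(n):
        for s in range(max(-beta, -i), min(beta, n - 1 - i) + 1):
            c[i] += At[i][s + beta] * b[i + s]
    return c


A = [[2, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 2]]
b = [1, 2, 3, 4]
At = to_band(A, beta=1)
print(band_matvec_triad(At, b, 1), band_matvec_dot(At, b, 1))
```

In the TRIAD version the inner loop has length close to n, while the partial DOT-products have length at most 2β + 1.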



Band Ab in Parallel

• Partitioning:

$$\langle 1, n \rangle = \bigcup_{r=1}^{R} I_r, \quad \text{disjoint}$$

    for i ∈ I_r
        c_i = Σ_{s = l_i}^{r_i} ã_{i,s} b_{i+s}
    end

• Processor $P_r$ gets the rows belonging to the index set $I_r := [m_r, M_r]$ in order to compute its part of the final vector c.

• What part of the vector b does processor $P_r$ need in order to compute its part of c?


Band Ab in Parallel

• Necessary for $I_r$: $b_j = b_{i+s}$ with

$$j = i + s \ge m_r + \max\{-\beta, 1 - m_r\} = \max\{m_r - \beta, 1\}$$

$$j = i + s \le M_r + r_{M_r} = M_r + \min\{\beta, n - M_r\} = \min\{M_r + \beta, n\}$$

• Processor $P_r$ with index set $I_r$ needs from b the indices

$$j \in [\max\{1, m_r - \beta\}, \min\{n, M_r + \beta\}]$$
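As a small worked example of this index range, the sketch below evaluates it for a hypothetical partition of n = 10 rows with bandwidth β = 2. The function name `needed_b_range` and the chosen partition are illustrative assumptions; indices are 1-based and inclusive, as on the slide.

```python
def needed_b_range(m_r, M_r, beta, n):
    """Inclusive, 1-based range of indices of b needed for rows m_r..M_r."""
    return max(1, m_r - beta), min(n, M_r + beta)


n, beta = 10, 2
partition = [(1, 3), (4, 7), (8, 10)]  # hypothetical index sets I_r = [m_r, M_r]
ranges = [needed_b_range(m_r, M_r, beta, n) for m_r, M_r in partition]
print(ranges)
```

Neighbouring processors overlap in at most 2β entries of b, so only the boundary parts have to be exchanged.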




2.3. Analysis of Matrix-Matrix Product

$$A = (a_{ij})_{i=1,\dots,n;\ j=1,\dots,m} \in \mathbb{R}^{n \times m}, \quad B = (b_{ij})_{i=1,\dots,m;\ j=1,\dots,q} \in \mathbb{R}^{m \times q},$$

$$C = AB = (c_{ij})_{i=1,\dots,n;\ j=1,\dots,q} \in \mathbb{R}^{n \times q}$$

for i = 1 : n
    for j = 1 : q
        c_ij = Σ_{k=1}^{m} a_ik b_kj
    end
end


2.3.1. Vectorization

• Algorithm 1: (ijk)-Form:

    for i = 1 : n
        for j = 1 : q
            for k = 1 : m
                c_ij = c_ij + a_ik b_kj
            end
        end
    end

The innermost loop is a DOT-product of length m: $c_{ij} = A_{i\bullet} B_{\bullet j}$ for all i, j.

• All entries $c_{ij}$ are fully computed, one after another.

• Access to A and C is rowwise, to B columnwise (depends on the innermost loops!)


Other View on the Matrix-Matrix Product

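Matrix A considered as a combination of columns or rows:

$$A = A_1 e_1^T + \dots + A_m e_m^T = (A_1\ 0\ \cdots) + (0\ A_2\ 0\ \cdots) + \dots + (\cdots\ 0\ A_m)$$

$$= e_1 a_1 + \dots + e_n a_n = \begin{pmatrix} a_1 \\ 0 \\ \vdots \end{pmatrix} + \dots + \begin{pmatrix} \vdots \\ 0 \\ a_n \end{pmatrix}$$

Here $A_j$ denotes the j-th column and $a_i$ the i-th row of A. With $b_k$ denoting the k-th row of B:

$$AB = \sum_{j=1}^{m} A_j e_j^T \sum_{k=1}^{m} e_k b_k = \sum_{k,j} A_j (e_j^T e_k) b_k = \sum_{k=1}^{m} \underbrace{A_k b_k}_{\text{full } n \times q \text{ matrices}}$$

AB is thus a sum of full matrices $A_k b_k$, each the outer product of the k-th column of A and the k-th row of B.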
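The outer-product view translates directly into a loop over k with a rank-1 update in each step. The following is a minimal sketch on lists of rows; `matmul_outer` is a name chosen here for illustration.

```python
def matmul_outer(A, B):
    """C = sum over k of (k-th column of A) times (k-th row of B)."""
    n, m, q = len(A), len(B), len(B[0])
    C = [[0.0] * q for _ in range(n)]
    for k in range(m):
        col = [A[i][k] for i in range(n)]  # A_k, the k-th column of A
        row = B[k]                         # b_k, the k-th row of B
        for i in range(n):
            for j in range(q):
                C[i][j] += col[i] * row[j]  # rank-1 update with a full n x q matrix
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_outer(A, B))
```

Every entry of C is touched in every step k, so the entries are only final after the last outer product has been added.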




Algorithm 2: (jki)-Form

for j = 1,...,q
    for k = 1,...,m
        for i = 1,...,n
            c_ij = c_ij + a_ik b_kj
        end
    end
end

• Vector update: $c_{\bullet j} = c_{\bullet j} + a_{\bullet k} b_{kj}$

• Sequence of SAXPYs for the same vector: $c_{\bullet j} = \sum_k b_{kj} a_{\bullet k}$

• C is computed columnwise; access to A is columnwise. Access to B is columnwise, but delayed.


Algorithm 3: (kji)-Form

for k = 1,...,m
    for j = 1,...,q
        for i = 1,...,n
            c_ij = c_ij + a_ik b_kj
        end
    end
end

• Vector update: $c_{\bullet j} = c_{\bullet j} + a_{\bullet k} b_{kj}$

• Sequence of SAXPYs for different vectors $c_{\bullet j}$ (no GAXPY)

• Access to A is columnwise. Access to B is rowwise and delayed. C is computed with intermediate values $c_{ij}^{(k)}$, which are computed columnwise.


Overview of Different Forms

                       ijk      ikj      kij      jik      jki      kji
                       Alg. 1                              Alg. 2   Alg. 3
Access to A by         row      -        -        row      column   column
Access to B by         column   row      row      column   -        -
Computation of C by    row      row      row      column   column   column
Computation of c_ij    direct   delayed  delayed  direct   delayed  delayed
Vector operation       DOT      GAXPY    SAXPY    DOT      GAXPY    SAXPY
Vector length          m        q        q        m        n        n

Better: GAXPY (longer vector length). Access the matrices according to the storage scheme (rowwise or columnwise).
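All six forms perform the same updates and differ only in the nesting of the loops. The sketch below makes this explicit by building the loop nest from a string such as "jki"; the function `matmul` and its `order` parameter are hypothetical and only meant to show that every ordering yields the same C.

```python
from itertools import product


def matmul(A, B, order="ijk"):
    """Matrix product with the loop nesting given by `order` (outer to inner)."""
    n, m, q = len(A), len(B), len(B[0])
    extent = {"i": n, "j": q, "k": m}
    C = [[0.0] * q for _ in range(n)]
    # product() varies its last argument fastest, so order[2] is the innermost loop.
    for triple in product(*(range(extent[name]) for name in order)):
        idx = dict(zip(order, triple))
        i, j, k = idx["i"], idx["j"], idx["k"]
        C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
for order in ("ijk", "jki", "kji"):  # Algorithms 1, 2 and 3
    print(order, matmul(A, B, order))
```

In floating-point arithmetic all six orders add the terms of each $c_{ij}$ in the same sequence k = 1, ..., m, so the results agree exactly; only the memory access pattern changes.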




2.3.2. Matrix-Matrix Product in Parallel

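$$\langle 1, n \rangle = \bigcup_{r=1}^{R} I_r, \quad \langle 1, m \rangle = \bigcup_{s=1}^{S} K_s, \quad \langle 1, q \rangle = \bigcup_{t=1}^{T} J_t$$

Distribute the blocks relative to the index sets $I_r$, $K_s$, and $J_t$ to the processor array $P_{rst}$:

1. Processor $P_{rst}$ computes a small matrix-matrix product. All processors work in parallel:

$$c_{rt}^{(s)} = A_{rs} B_{st}$$

2. Compute the sum by fan-in in s:

$$c_{rt} = \sum_{s=1}^{S} c_{rt}^{(s)}$$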
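The two steps can be simulated sequentially, with each triple (r, s, t) standing in for one processor $P_{rst}$. The helpers `split` and `blocked_matmul` are hypothetical names, and contiguous index sets are an assumption made for simplicity.

```python
def split(n, parts):
    """Split range(n) into `parts` contiguous, disjoint index lists."""
    bounds = [n * p // parts for p in range(parts + 1)]
    return [list(range(bounds[p], bounds[p + 1])) for p in range(parts)]


def blocked_matmul(A, B, R, S, T):
    n, m, q = len(A), len(B), len(B[0])
    I, K, J = split(n, R), split(m, S), split(q, T)
    # Step 1: every (r, s, t) computes its small product A_rs B_st independently.
    partial = {}
    for r in range(R):
        for s in range(S):
            for t in range(T):
                partial[r, s, t] = [
                    [sum(A[i][k] * B[k][j] for k in K[s]) for j in J[t]]
                    for i in I[r]
                ]
    # Step 2: fan-in in s, adding the partial blocks c_rt^(s) into c_rt.
    C = [[0.0] * q for _ in range(n)]
    for (r, s, t), block in partial.items():
        for li, i in enumerate(I[r]):
            for lj, j in enumerate(J[t]):
                C[i][j] += block[li][lj]
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(blocked_matmul(A, B, R=2, S=2, T=2))
```

Setting S = 1 removes the second step entirely, since each block of C is then computed by a single processor.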




Mtx-Mtx in Parallel: Special Case S = 1

• Each processor $P_{rt}$ can compute its part $c_{rt}$ of C independently, without communication.

• Each processor needs

    – the full block of rows of A, relative to index set $I_r$, and
    – the full block of columns of B, relative to index set $J_t$,

  to compute $c_{rt}$ relative to rows $I_r$ and columns $J_t$.


Mtx-Mtx in Parallel: Special Case S = 1

• With $n \cdot q$ processors, each processor has to compute one DOT-product with $O(m)$ parallel time steps:

$$c_{rt} = \sum_{k=1}^{m} a_{rk} b_{kt}$$

• Fan-in by $m \cdot nq$ additional processors for all DOT-products reduces the number of parallel time steps to $O(\log(m))$.
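The logarithmic step count of the fan-in can be illustrated by pairwise (tree) summation. This is a sequential sketch that only counts the levels of the tree; `fan_in_sum` is a hypothetical name and the input is assumed to be non-empty.

```python
def fan_in_sum(values):
    """Pairwise (tree) summation; returns the sum and the number of parallel steps."""
    values = list(values)
    steps = 0
    while len(values) > 1:
        # All pair sums within one level are independent of each other.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element is passed on unchanged
        values = paired
        steps += 1
    return values[0], steps


a_row = [1, 2, 3, 4]
b_col = [5, 6, 7, 8]
# The m products are independent; the fan-in then needs ceil(log2(m)) levels.
dot, steps = fan_in_sum(x * y for x, y in zip(a_row, b_col))
print(dot, steps)
```

For m = 4 the products are combined in two levels instead of three sequential additions.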




1D-Parallelization of A · B

• 1D: p processors in a linear arrangement; each processor gets the full A and a column slice of B, computing the related column slice of C = AB

• Communication: $N^2 p$ for A and $(N \cdot \frac{N}{p}) \cdot p = N^2$ for B

• Granularity: $\dfrac{N^3}{N^2 (1 + p)} = \dfrac{N}{1 + p}$

• Blocking only in i, the columns of B!

    for i = 1 : n
        for j = 1 : n
            for k = 1 : n
                C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}
            end
        end
    end


2D-Parallelization of A · B

• 2D: p processors in a square arrangement, $q := \sqrt{p}$; each processor gets a row slice of A and a column slice of B, computing a full subblock of C = AB

• Communication: $N^2 \sqrt{p}$ for A and $N^2 \sqrt{p}$ for B

• Granularity: $\dfrac{N^3}{2 N^2 \sqrt{p}} = \dfrac{N}{2 \sqrt{p}}$

• Blocking in i and j, the columns of B and the rows of A!

    for i = 1 : n
        for j = 1 : n
            for k = 1 : n
                C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}
            end
        end
    end


3D-Parallelization of A · B

• 3D: p processors in a cubic arrangement ($q = p^{1/3}$); each processor gets a subblock of A and a subblock of B, computing part of a subblock of C = AB. An additional fan-in collects the parts into the full subblock of C.

• Communication: $N^2 p^{1/3}$ for A and for B ($= p \cdot \frac{N^2}{p^{2/3}} = p \cdot \text{blocksize}$), fan-in: $N^2 p^{1/3}$

• Granularity: $\dfrac{N^3}{3 N^2 p^{1/3}} = \dfrac{N}{3 p^{1/3}}$

• Blocking in i, j, and k!

    for i = 1 : n
        for j = 1 : n
            for k = 1 : n
                C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}
            end
        end
    end
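The granularity formulas of the 1D, 2D, and 3D schemes can be compared numerically. The sketch below evaluates operations divided by communication volume, using the counts stated on these slides; the function names and the sample values N = 1200 and p = 64 are arbitrary choices for illustration.

```python
def granularity_1d(N, p):
    """N^3 operations over N^2 * p (for A) plus N^2 (for B) communicated entries."""
    return N**3 / (N**2 * (1 + p))


def granularity_2d(N, p):
    """N^3 operations over N^2 * sqrt(p) communicated entries each for A and B."""
    return N**3 / (2 * N**2 * p**0.5)


def granularity_3d(N, p):
    """N^3 operations over N^2 * p^(1/3) entries each for A, B and the fan-in."""
    return N**3 / (3 * N**2 * p ** (1 / 3))


N, p = 1200, 64
for name, g in (("1D", granularity_1d), ("2D", granularity_2d), ("3D", granularity_3d)):
    print(f"{name}: {g(N, p):.2f}")
```

For these sample values the ratio improves from roughly 18 (1D) to 75 (2D) to about 100 (3D).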



3 Linear Systems of Equations with Dense Matrices

Contents

1 Introduction
    1.1 Computer Science Aspects
    1.2 Numerical Problems
    1.3 Graphs
    1.4 Loop Manipulations

2 Elementary Linear Algebra Problems
    2.1 BLAS: Basic Linear Algebra Subroutines
    2.2 Matrix-Vector Operations
    2.3 Matrix-Matrix-Product

3 Linear Systems of Equations with Dense Matrices
    3.1 Gaussian Elimination
    3.2 Parallelization
    3.3 QR-Decomposition with Householder matrices

4 Sparse Matrices
    4.1 General Properties, Storage
    4.2 Sparse Matrices and Graphs
    4.3 Reordering
    4.4 Gaussian Elimination for Sparse Matrices

5 Iterative Methods for Sparse Matrices
    5.1 Stationary Methods
    5.2 Nonstationary Methods
    5.3 Preconditioning

6 Domain Decomposition
    6.1 Overlapping Domain Decomposition
    6.2 Non-overlapping Domain Decomposition
    6.3 Schur Complements

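3.1. Linear Systems of Equations with Dense Matrices

3.1.1. Gaussian Elimination: Basic Properties

• Linear system of equations:

$$\begin{aligned} a_{11} x_1 + \dots + a_{1n} x_n &= b_1 \\ &\;\;\vdots \\ a_{n1} x_1 + \dots + a_{nn} x_n &= b_n \end{aligned}$$

• Solve $Ax = b$:

$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}$$

• Generate simpler linear equations (matrices). Transform A into triangular form: $A = A^{(1)} \to A^{(2)} \to \dots \to A^{(n)} = U$.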

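Transformation to Upper Triangular Form

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$

The row transformations $(2) \to (2) - \frac{a_{21}}{a_{11}} \cdot (1), \ \dots, \ (n) \to (n) - \frac{a_{n1}}{a_{11}} \cdot (1)$ lead to

$$A^{(2)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & a_{32}^{(2)} & a_{33}^{(2)} & \cdots & a_{3n}^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & a_{n2}^{(2)} & a_{n3}^{(2)} & \cdots & a_{nn}^{(2)} \end{pmatrix}$$

Next transformations: $(3) \to (3) - \frac{a_{32}^{(2)}}{a_{22}^{(2)}} \cdot (2), \ \dots, \ (n) \to (n) - \frac{a_{n2}^{(2)}}{a_{22}^{(2)}} \cdot (2)$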

Transformation to Triangular Form (cont.)

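$$A^{(3)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & a_{n3}^{(3)} & \cdots & a_{nn}^{(3)} \end{pmatrix}$$

Next transformations: $(4) \to (4) - \frac{a_{43}^{(3)}}{a_{33}^{(3)}} \cdot (3), \ \dots, \ (n) \to (n) - \frac{a_{n3}^{(3)}}{a_{33}^{(3)}} \cdot (3)$

$$A^{(n)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & a_{nn}^{(n)} \end{pmatrix} = U$$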

Pseudocode Gaussian Elimination (GE)

Simplification: assume that no pivoting is necessary:

$$a_{kk}^{(k)} \neq 0 \quad \text{or} \quad |a_{kk}^{(k)}| \ge \rho > 0 \quad \text{for } k = 1, 2, \dots, n$$

for k = 1 : n − 1
    for i = k + 1 : n
        l_{i,k} = a_{i,k} / a_{k,k}
    end
    for i = k + 1 : n
        for j = k + 1 : n
            a_{i,j} = a_{i,j} − l_{i,k} · a_{k,j}
        end
    end
end

In practice:

• Include pivoting and include the right-hand side b.

• A triangular system in U still remains to be solved!
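The pseudocode can be turned into a short runnable sketch. Like the pseudocode, it assumes that no pivoting is necessary; the function name `lu_no_pivot` and the explicit zero-pivot check are additions made for this illustration.

```python
def lu_no_pivot(A):
    """Return (L, U) with A = L U, assuming all pivots are nonzero."""
    n = len(A)
    U = [list(map(float, row)) for row in A]  # working copy, becomes A^(n) = U
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        if U[k][k] == 0:
            raise ZeroDivisionError("zero pivot: pivoting would be required")
        # Multipliers l_ik of elimination step k.
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
        # Update of the trailing submatrix.
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]
            U[i][k] = 0.0  # entry eliminated in this step
    return L, U


A = [[2, 1, 1],
     [4, 3, 3],
     [8, 7, 9]]
L, U = lu_no_pivot(A)
print(L)
print(U)
```

The multipliers are stored in a separate matrix here for readability; implementations commonly overwrite the eliminated entries of A with them instead.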



Intermediate Systems

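$A^{(k)}$, $k = 1, 2, \dots, n$, with $A = A^{(1)}$ and $U = A^{(n)}$:

$$A^{(k)} = \begin{pmatrix}
a_{11}^{(1)} & \cdots & a_{1,k-1}^{(1)} & a_{1,k}^{(1)} & \cdots & a_{1,n}^{(1)} \\
0 & \ddots & \vdots & \vdots & & \vdots \\
\vdots & \ddots & a_{k-1,k-1}^{(k-1)} & a_{k-1,k}^{(k-1)} & \cdots & a_{k-1,n}^{(k-1)} \\
0 & \cdots & 0 & a_{k,k}^{(k)} & \cdots & a_{k,n}^{(k)} \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & a_{n,k}^{(k)} & \cdots & a_{n,n}^{(k)}
\end{pmatrix}$$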

Define Auxiliary Matrices

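$$L = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ l_{2,1} & 1 & \ddots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ l_{n,1} & \cdots & l_{n,n-1} & 1 \end{pmatrix} \quad \text{and} \quad U = A^{(n)}$$

$L_k$ is zero except for the multipliers $l_{k+1,k}, \dots, l_{n,k}$ in column k below the diagonal:

$$L_k := \begin{pmatrix}
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & \cdots & 0 & l_{k+1,k} & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & l_{n,k} & 0 & \cdots & 0
\end{pmatrix}, \quad L = I + \sum_k L_k$$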

Elimination Step in Terms of Auxiliary Matrices

$$A^{(k+1)} = (I - L_k) \cdot A^{(k)} = A^{(k)} - L_k \cdot A^{(k)}$$

$$U = A^{(n)} = (I - L_{n-1}) \cdot A^{(n-1)} = \dots = (I - L_{n-1}) \cdots (I - L_1) A^{(1)} = \tilde L \cdot A$$

$$\tilde L := (I - L_{n-1}) \cdots (I - L_1)$$

$A = \tilde L^{-1} \cdot U$ with U upper triangular and $\tilde L$ lower triangular

• Theorem 2: $\tilde L^{-1} = L$ and therefore $A = LU$.

• Advantage: Every further problem $Ax = b_j$ can be reduced to $(LU)x = b_j$ for arbitrary j.

• Solve two triangular problems: $(LU)x = Ly = b$ and $Ux = y$.
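Once the factors are known, each right-hand side costs only two triangular solves. The following sketch uses a small example factorization that is written down by hand for illustration; the function names are hypothetical.

```python
def forward_substitution(L, b):
    """Solve L y = b for a lower triangular matrix L."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    return y


def back_substitution(U, y):
    """Solve U x = y for an upper triangular matrix U."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x


def solve(L, U, b):
    """Solve (L U) x = b via L y = b followed by U x = y."""
    return back_substitution(U, forward_substitution(L, b))


# Example factors of A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]].
L = [[1, 0, 0], [2, 1, 0], [4, 3, 1]]
U = [[2, 1, 1], [0, 1, 1], [0, 0, 2]]
for b in ([4, 10, 24], [2, 4, 8]):  # several right-hand sides, one factorization
    print(solve(L, U, b))
```

The factorization costs $O(n^3)$ operations once, whereas each additional right-hand side costs only $O(n^2)$.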



Theorem 2: $\tilde L^{-1} = L$ → $A = LU$

For $i \le j$: $L_i \cdot L_j = 0$

$$(I + L_j)(I - L_j) = I + L_j - L_j - L_j^2 = I \;\Rightarrow\; (I - L_j)^{-1} = I + L_j$$

$$(I + L_i)(I + L_j) = I + L_i + L_j + \underbrace{L_i L_j}_{= 0} = I + L_i + L_j$$

$$\tilde L^{-1} = [(I - L_{n-1}) \cdots (I - L_1)]^{-1} = (I - L_1)^{-1} \cdots (I - L_{n-1})^{-1}$$

$$= (I + L_1)(I + L_2) \cdots (I + L_{n-1}) = I + L_1 + L_2 + \dots + L_{n-1} = L$$
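The identities used in this proof can be checked on a small example. The sketch below uses integer multipliers so that all comparisons are exact; the helper names (`matmul`, `add`, `sub`, `identity`, `column_part`) and the example matrix are assumptions made for illustration.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def column_part(L, k):
    """L_k: the multipliers of column k (0-based) below the diagonal, zero elsewhere."""
    n = len(L)
    return [[L[i][j] if (j == k and i > k) else 0 for j in range(n)]
            for i in range(n)]


L = [[1, 0, 0], [2, 1, 0], [4, 3, 1]]
I = identity(3)
zero = [[0] * 3 for _ in range(3)]
L1, L2 = column_part(L, 0), column_part(L, 1)

print(matmul(L1, L2) == zero)                       # L_i L_j = 0 for i <= j
print(matmul(add(I, L1), sub(I, L1)) == I)          # (I - L_1)^{-1} = I + L_1
print(matmul(add(I, L1), add(I, L2)) == L)          # product equals I + L_1 + L_2 = L
```

The order of the factors matters: $L_2 L_1$ is not zero in this example, which is why the inverse $L$ has the simple additive form while $\tilde L$ itself does not.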



3.2. GE in Parallel: Blockwise

Main idea: Blocking of GE to avoid data transfer between processors.

Basic Concepts:

Replace GE or a large LU-decomposition of the full matrix by small intermediate steps (a sequence of small block operations):

• Solving a collection of small triangular systems $L U_k = B_k$ (parallelism in the columns of U)

• Updating matrices $A \to A - LU$ (also easy to parallelize)

• Small LU-decompositions $B = LU$ (parallelism in the rows of B)

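How to Choose Blocks in L/U Satisfying LU = A

$$\begin{pmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}$$

$$= \begin{pmatrix} L_{11} U_{11} & L_{11} U_{12} & L_{11} U_{13} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} & L_{21} U_{13} + L_{22} U_{23} \\ L_{31} U_{11} & L_{31} U_{12} + L_{32} U_{22} & * \end{pmatrix}$$

Different ways of computing L and U, depending on

• the start (assume the first entry/row/column of L/U as given)

• how to compute a new entry/row/column of L/U

• the update of the block structure of L/U by grouping into

    – known blocks
    – blocks newly to compute
    – blocks to be computed later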

Crout Form



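Crout Form (cont.)

1. Solve, from the (2,2) and (3,2) blocks of LU = A,

$$\begin{pmatrix} L_{22} \\ L_{32} \end{pmatrix} U_{22} = \begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix} - \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{12}$$

by a small LU-decomposition of the modified part of A → $L_{22}$, $L_{32}$, and $U_{22}$.

2. Solve, from the (2,3) block of LU = A,

$$L_{22} U_{23} = A_{23} - L_{21} U_{13}$$

by solving small triangular systems of equations in $L_{22}$ → $U_{23}$.

Initial steps:

$$L_{11} U_{11} = A_{11}, \quad \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{11} = \begin{pmatrix} A_{21} \\ A_{31} \end{pmatrix}, \quad L_{11} (U_{12}\ U_{13}) = (A_{12}\ A_{13})$$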

New Partitioning

• Combine the already computed parts from the second column of L and the second row of U into the first column of L and the first row of U.

• Split the parts $L_{33}$ and $U_{33}$, ignored until now, into new columns/rows.

• Repeat this overall procedure until L and U are fully computed.

Block Structure

[Figure: intermediate block structure. Solve for the red blocks.]

[Figure: reconfigured block structure. Repeat until done.]

Left Looking GE

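• Solve $L_{11} U_{12} = A_{12}$ by a couple of parallel triangular solves, and

$$\begin{pmatrix} L_{22} \\ L_{32} \end{pmatrix} U_{22} = \begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix} - \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{12} =: \begin{pmatrix} \tilde A_{22} \\ \tilde A_{32} \end{pmatrix}$$

i.e. update part of A and perform a small LU-decomposition.

• Reorder the blocks and repeat until ready. Start: $L_{11} U_{11} = A_{11}$, $L_{21} U_{11} = A_{21}$, and $L_{31} U_{11} = A_{31}$.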


[Figure: block structure. Solve for the red blocks; repeat until done.]

Right Looking GE

New blocking:

• Start with $L_{11} U_{11} = A_{11}$ (small LU-decomposition).

• The equations $L_{21} U_{11} = A_{21}$ and $L_{11} U_{12} = A_{12}$ give $L_{21}$ and $U_{12}$ by triangular solves.

• It remains $L_{22} U_{22} = A_{22} - L_{21} U_{12} =: \tilde A_{22}$.

• To compute the LU-decomposition of the modified block $\tilde A_{22}$, repeat the 2×2 blocking for $\tilde A_{22}$ and apply the procedure recursively.
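The right-looking steps can be sketched as an in-place loop over diagonal blocks. This is a minimal sequential illustration without pivoting; the function name `blocked_lu`, the block size parameter `bs`, and the packed return format (U in the upper triangle, the multipliers of the unit lower triangular L below the diagonal) are choices made here, not taken from the slides. The recursion on the modified trailing block is written as an iteration.

```python
def blocked_lu(A, bs):
    """Right-looking blocked LU without pivoting; returns L and U packed in one matrix."""
    n = len(A)
    M = [list(map(float, row)) for row in A]
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        # 1. Small LU-decomposition of the diagonal block: L11 U11 = A11.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                M[i][k] /= M[k][k]
                for j in range(k + 1, k1):
                    M[i][j] -= M[i][k] * M[k][j]
        # 2a. Triangular solves L11 U12 = A12 (the columns j are independent).
        for j in range(k1, n):
            for i in range(k0, k1):
                for k in range(k0, i):
                    M[i][j] -= M[i][k] * M[k][j]
        # 2b. Triangular solves L21 U11 = A21 (the rows i are independent).
        for i in range(k1, n):
            for k in range(k0, k1):
                for t in range(k0, k):
                    M[i][k] -= M[i][t] * M[t][k]
                M[i][k] /= M[k][k]
        # 3. Matrix update A22 := A22 - L21 U12 (the matrix-matrix part).
        for i in range(k1, n):
            for j in range(k1, n):
                for k in range(k0, k1):
                    M[i][j] -= M[i][k] * M[k][j]
    return M


A = [[2, 1, 1],
     [4, 3, 3],
     [8, 7, 9]]
print(blocked_lu(A, bs=2))
```

For this example every block size gives the same packed factors; in an optimized code step 3 would be a single Level-3 BLAS call and would dominate the run time.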




[Figure: block structure. Solve for the blue block and both red blocks; repeat until done.]

Comparison and Overview

• In comparison, all methods

    – have nearly the same efficiency in parallel
    – but perform better (sequentially or in parallel) than the unblocked variants because they are based on BLAS-3.

• Elementary steps of all blocking methods:

    – Matrix-matrix product and sum (easy to parallelize)
    – A couple of triangular solves (easy to parallelize)
    – Small LU-decomposition (parallelizable for long rows)

• Crout and right looking are slightly better because they spend more flops in matrix updates and fewer in triangular solves and LU-decompositions, respectively.
