BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product
Level-2 BLAS
• Matrix-vector operations with $O(n^2)$ operations (sequentially)
• BLAS notation:
    S   single precision
    GE  general matrix
    MV  matrix-vector
  Together these define SGEMV, the matrix-vector product: $y = \alpha A x + \beta y$
• Other Level-2 BLAS: solving a triangular system $Lx = b$ with triangular matrix $L$.
Parallel Numerics, WT 2013/2014 2 Elementary Linear Algebra Problems

Level-3 BLAS
• Matrix-matrix operations with $O(n^3)$ operations (sequentially)
• BLAS notation:
    S   single precision
    GE  general matrix
    MM  matrix-matrix
  Together these define SGEMM, the matrix-matrix product: $C = \alpha A B + \beta C$

Granularity for BLAS
BLAS level   Operation (flops)   Formula      Memory       Granularity
BLAS-1       AXPY: 2n            αx + y       2n + 1       < 1
BLAS-2       GEMV: 2n^2          αAx + βy     n^2 + 2n     ≈ 2
BLAS-3       GEMM: 2n^3          αAB + βC     4n^2         n/2
BLAS-3 has the best operations-to-memory ratio!
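The granularity column is simply the operation count divided by the memory count. A minimal sketch that evaluates this ratio, assuming the counts listed in the table; the function name `granularity` is illustrative and not part of BLAS:

```python
def granularity(level: int, n: int) -> float:
    """Operations-to-memory ratio for AXPY (1), GEMV (2) and GEMM (3),
    using the operation and memory counts from the table."""
    ops = {1: 2 * n, 2: 2 * n**2, 3: 2 * n**3}[level]
    mem = {1: 2 * n + 1, 2: n**2 + 2 * n, 3: 4 * n**2}[level]
    return ops / mem


for level in (1, 2, 3):
    print(f"BLAS-{level}: {granularity(level, 1000):.3f}")
```

Only the BLAS-3 ratio grows with n, which is why blocked algorithms try to spend most of their time in GEMM-like kernels.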

2.2. Analysis of the Matrix-Vector-Product
$A = (a_{ij})_{i=1,\dots,n;\, j=1,\dots,m} \in \mathbb{R}^{n \times m}$, $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$
2.2.1. Vectorization
$$
\begin{pmatrix} c_1 \\ \vdots \\ c_n \end{pmatrix}
= \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}
\begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix}
= \begin{pmatrix} a_{11} b_1 + \cdots + a_{1m} b_m \\ \vdots \\ a_{n1} b_1 + \cdots + a_{nm} b_m \end{pmatrix}
$$
$$
= \begin{pmatrix} \sum_{j=1}^{m} a_{1j} b_j \\ \vdots \\ \sum_{j=1}^{m} a_{nj} b_j \end{pmatrix}
= \sum_{j=1}^{m} b_j \begin{pmatrix} a_{1j} \\ \vdots \\ a_{nj} \end{pmatrix}
$$
Left form: n DOT-products of length m. Right form: m SAXPYs of length n (GAXPY).

Pseudocode: ij-form
c = 0;
for i = 1,...,n
    for j = 1,...,m
        c_i = c_i + a_ij b_j        (inner loop: DOT-product)
    end
end
$c_i = A_{i\bullet} b$: DOT-product of the i-th row of A with vector b

Pseudocode: ji-form
c = 0;
for j = 1,...,m
    for i = 1,...,n
        c_i = c_i + a_ij b_j
    end
end
• SAXPY: updating vector c with the j-th column of A
• GAXPY:
  – Sequence of SAXPYs related to the same vector
  – Advantage: the vector c that is updated can be kept in fast memory
  – No additional data transfer
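The two loop orders can be compared directly. The following is a minimal sketch in plain Python (nested lists stand in for matrices; the function names `matvec_ij` and `matvec_ji` are illustrative). Both compute the same vector and differ only in the order in which the entries of A are visited:

```python
def matvec_ij(A, b):
    """ij-form: each c[i] is a DOT-product of row i of A with b."""
    n, m = len(A), len(b)
    c = [0.0] * n
    for i in range(n):
        for j in range(m):
            c[i] += A[i][j] * b[j]
    return c


def matvec_ji(A, b):
    """ji-form (GAXPY): c is updated by one SAXPY per column of A."""
    n, m = len(A), len(b)
    c = [0.0] * n
    for j in range(m):
        for i in range(n):
            c[i] += A[i][j] * b[j]
    return c


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [1.0, 0.0, -1.0]
print(matvec_ij(A, b), matvec_ji(A, b))
```

In the ji-form the inner loop touches the whole vector c for every column, which is the situation where keeping c in fast memory pays off.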

GAXPY (repetition)
• SAXPY:
    $y := y + \alpha x$
• GAXPY:
    y = y_0
    for i = 1 : n
        y := y + α_i x_i
    end
• Series of SAXPYs regarding the same vector y.
• length(GAXPY) = length(y)
• Advantage: less data transfer!

2.2.2. Parallelization by Building Blocks
Reduce the matrix-vector product to smaller matrix-vector products.
$\{1, 2, \dots, n\} = \langle 1, n \rangle = I_1 \cup I_2 \cup \dots \cup I_R$, disjoint: $I_j \cap I_k = \emptyset$ for $j \neq k$
$\{1, 2, \dots, m\} = \langle 1, m \rangle = J_1 \cup J_2 \cup \dots \cup J_S$, disjoint: $J_j \cap J_k = \emptyset$ for $j \neq k$
Use a 2-dimensional array of processors $P_{rs}$.
$P_{rs}$ gets matrix block $A_{rs} := A(I_r, J_s)$, $b_s := b(J_s)$, $c_r := c(I_r)$.
$$c_r = \sum_{s=1}^{S} A_{rs} b_s =: \sum_{s=1}^{S} c_r^{(s)}$$

Pseudocode
for r = 1,...,R
    for s = 1,...,S
        c_r^(s) = A_rs b_s;
    end
end
Small, independent matrix-vector products. No communication necessary during computations!

for r = 1,...,R
    c_r = 0
    for s = 1,...,S
        c_r = c_r + c_r^(s);
    end
end
Blockwise collection and addition of vectors. Rowwise communication! Fan-in.
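The two phases can be imitated sequentially. This is a sketch under simplifying assumptions: the index sets are contiguous ranges of nearly equal size, and a dictionary of partial results stands in for the processor array $P_{rs}$. The helper names `split` and `block_matvec` are illustrative.

```python
def split(n, parts):
    """Partition range(n) into `parts` contiguous, disjoint index lists."""
    bounds = [k * n // parts for k in range(parts + 1)]
    return [list(range(bounds[k], bounds[k + 1])) for k in range(parts)]


def block_matvec(A, b, R, S):
    n, m = len(A), len(b)
    I, J = split(n, R), split(m, S)
    # Phase 1: independent partial products c_r^(s) = A_rs b_s (no communication).
    partial = {
        (r, s): [sum(A[i][j] * b[j] for j in J[s]) for i in I[r]]
        for r in range(R)
        for s in range(S)
    }
    # Phase 2: fan-in over s within each block row r.
    c = [0.0] * n
    for r in range(R):
        for s in range(S):
            for pos, i in enumerate(I[r]):
                c[i] += partial[(r, s)][pos]
    return c


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
b = [1, 1, 1]
print(block_matvec(A, b, R=2, S=2))
```

On a real machine each `(r, s)` entry of phase 1 would be computed by a different processor, and phase 2 would be a reduction along each processor row.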

Blocking: Special Cases
S = 1: The computation of $A_{i\bullet} b$ is vectorizable by GAXPYs.
$$c = \begin{pmatrix} A_{1\bullet} \\ A_{2\bullet} \\ \vdots \end{pmatrix} \cdot b = \begin{pmatrix} A_{1\bullet} b \\ A_{2\bullet} b \\ \vdots \end{pmatrix}$$
No communication necessary between processors $P_1, \dots, P_R$.
R = 1: The products $A_{\bullet j} b_j$ are independent.
$$c = (A_{\bullet 1} \,|\, A_{\bullet 2} \,|\, \cdots) \cdot \begin{pmatrix} b_1 \\ b_2 \\ \vdots \end{pmatrix} = A_{\bullet 1} b_1 + A_{\bullet 2} b_2 + \dots$$
Then collection of partial results from processors $P_1, \dots, P_S$ (fan-in). Final sum in one processor: vectorizable by GAXPYs.

Rules
1. Inner loops of a program should be simple and vectorizable.
2. Outer loops of a program should be substantial, independent, and parallelizable.
   (Figure: nested loops; the outer loop is labeled "substantial and parallelizable", the inner loop "simple, vectorizable".)
3. Reuse of data (cache, minimal data transfer, blocking).
2.2.3. c = Ab for Banded Matrix
• Bandwidth β (symmetric)
• 2β+1 diagonals: main diag. + β subdiag. + β superdiag.
• β = 1: tridiagonal
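
Notation: Banded Matrices $A$ and $\tilde{A}$
$$
A = \begin{pmatrix}
a_{11} & \cdots & a_{1,\beta+1} & 0 & \cdots & 0 \\
\vdots & a_{22} & \ddots & \ddots & \ddots & \vdots \\
a_{\beta+1,1} & \ddots & \ddots & \ddots & \ddots & 0 \\
0 & \ddots & \ddots & \ddots & \ddots & a_{n-\beta,n} \\
\vdots & \ddots & \ddots & \ddots & a_{n-1,n-1} & \vdots \\
0 & \cdots & 0 & a_{n,n-\beta} & \cdots & a_{nn}
\end{pmatrix}
$$
$$
\to \quad \tilde{A} = \begin{pmatrix}
\tilde{a}_{1,0} & \cdots & \tilde{a}_{1,\beta} & 0 & \cdots & 0 \\
\vdots & \tilde{a}_{2,0} & \ddots & \ddots & \ddots & \vdots \\
\tilde{a}_{\beta+1,-\beta} & \ddots & \ddots & \ddots & \ddots & 0 \\
0 & \ddots & \ddots & \ddots & \ddots & \tilde{a}_{n-\beta,\beta} \\
\vdots & \ddots & \ddots & \ddots & \tilde{a}_{n-1,0} & \vdots \\
0 & \cdots & 0 & \tilde{a}_{n,-\beta} & \cdots & \tilde{a}_{n,0}
\end{pmatrix}
$$
The second index of $\tilde{a}_{i,s}$ is the diagonal offset $s = j - i$ instead of the column index $j$.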

c = Ab for Banded Matrix
Storing entries diagonalwise: an $n \times (2\beta + 1)$ matrix instead of an $n \times n$ matrix.
$\tilde{a}_{i,s} = a_{i,i+s}$ for row $i = 1, \dots, n$
Admissible indices: $1 \le i + s \le n$ and $-\beta \le s \le \beta$ and $1 \le i \le n$
• In row i: $1 - i \le s \le n - i$ and $-\beta \le s \le \beta$, hence
    $s \in [l_i, r_i] = [\max\{-\beta, 1 - i\}, \min\{\beta, n - i\}]$
• In diagonal s: $1 - s \le i \le n - s$ and $1 \le i \le n$, hence
    $i \in [l_s, r_s] = [\max\{1, 1 - s\}, \min\{n, n - s\}]$

Computation of the matrix-vector product based on the storage scheme on vector CPUs
For $i = 1, \dots, n$:
$$c_i = A_{i\bullet} b = \sum_j a_{ij} b_j = \sum_{s=l_i}^{r_i} a_{i,i+s} b_{i+s} = \sum_{s=l_i}^{r_i} \tilde{a}_{i,s} b_{i+s}$$
• General TRIAD, no SAXPY:
    for s = −β : β
        for i = max{1 − s, 1} : min{n − s, n}
            c_i = c_i + ã_{i,s} b_{i+s}
        end
    end
• or, partial DOT-product:
    for i = 1 : n
        for s = max{−β, 1 − i} : min{β, n − i}
            c_i = c_i + ã_{i,s} b_{i+s}
        end
    end
• Sparsity ⇒ fewer operations, but also loss of efficiency.
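The diagonal-wise storage and the TRIAD loop can be sketched in plain Python. The code uses 0-based indices, so the bounds are shifted by one compared with the formulas on the slides; `to_band` and `band_matvec` are illustrative names, and the column shift `s + beta` is an assumption about how the offset is mapped to a non-negative array index.

```python
def to_band(A, beta):
    """Store the band of A diagonalwise: band[i][s + beta] = A[i][i + s]."""
    n = len(A)
    band = [[0.0] * (2 * beta + 1) for _ in range(n)]
    for i in range(n):
        for s in range(max(-beta, -i), min(beta, n - 1 - i) + 1):
            band[i][s + beta] = A[i][i + s]
    return band


def band_matvec(band, b, beta):
    """TRIAD form: outer loop over diagonals s, inner loop over rows i."""
    n = len(b)
    c = [0.0] * n
    for s in range(-beta, beta + 1):
        for i in range(max(0, -s), min(n, n - s)):
            c[i] += band[i][s + beta] * b[i + s]
    return c


# Tridiagonal example (beta = 1).
A = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
b = [1, 2, 3, 4]
print(band_matvec(to_band(A, 1), b, 1))
```

Each inner loop has length at most n and works on three different vectors (c, one diagonal, a shifted b), which is why it is a TRIAD rather than a SAXPY.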

Band Ab in Parallel
• Partitioning: $\langle 1, n \rangle = \bigcup_{r=1}^{R} I_r$, disjoint
    for i ∈ I_r
        c_i = Σ_{s = l_i}^{r_i} ã_{i,s} b_{i+s}
    end
• Processor $P_r$ gets the rows belonging to index set $I_r := [m_r, M_r]$ in order to compute its part of the final vector c.
• What part of vector b does processor $P_r$ need in order to compute its part of c?

Band Ab in Parallel
• Necessary for $I_r$: $b_j = b_{i+s}$ with
    $j = i + s \ge m_r + l_{m_r} = m_r + \max\{-\beta, 1 - m_r\} = \max\{m_r - \beta, 1\}$
    $j = i + s \le M_r + r_{M_r} = M_r + \min\{\beta, n - M_r\} = \min\{M_r + \beta, n\}$
• Processor $P_r$ with index set $I_r$ needs from b the indices
    $j \in [\max\{1, m_r - \beta\}, \min\{n, M_r + \beta\}]$
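The index range is easy to evaluate for a concrete partitioning. A small sketch with 1-based, inclusive indices as on the slide; the function name `needed_b_range` and the example sizes are illustrative:

```python
def needed_b_range(m_r, M_r, beta, n):
    """1-based inclusive range of indices j of b needed for rows m_r..M_r
    of a banded matrix with bandwidth beta and dimension n."""
    return max(1, m_r - beta), min(n, M_r + beta)


# n = 10, bandwidth 2, three processors owning rows 1..3, 4..6 and 7..10.
for rows in ((1, 3), (4, 6), (7, 10)):
    print(rows, "->", needed_b_range(*rows, beta=2, n=10))
```

Neighboring processors therefore overlap in at most β entries of b on each side, so only those boundary entries have to be exchanged.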

2.3. Analysis of Matrix-Matrix Product
$A = (a_{ij})_{i=1,\dots,n;\, j=1,\dots,m} \in \mathbb{R}^{n \times m}$, $B = (b_{ij})_{i=1,\dots,m;\, j=1,\dots,q} \in \mathbb{R}^{m \times q}$,
$C = AB = (c_{ij})_{i=1,\dots,n;\, j=1,\dots,q} \in \mathbb{R}^{n \times q}$
for i = 1 : n
    for j = 1 : q
        c_ij = Σ_{k=1}^{m} a_ik b_kj
    end
end

2.3.1. Vectorization
• Algorithm 1: (ijk)-form:
    for i = 1 : n
        for j = 1 : q
            for k = 1 : m
                c_ij = c_ij + a_ik b_kj        (inner loop: DOT-product of length m)
            end
        end
    end
    $c_{ij} = A_{i\bullet} B_{\bullet j}$ for all i, j
• All entries $c_{ij}$ are fully computed, one after another.
• Access to A and C is rowwise, to B columnwise (depends on the innermost loops!)

Other View on the Matrix-Matrix Product
Matrix A considered as a combination of columns or rows ($A_j$: j-th column of A, $a_i$: i-th row of A, $e_j$: j-th unit vector):
$$A = A_1 e_1^T + \dots + A_m e_m^T = (A_1 \; 0 \; \cdots) + (0 \; A_2 \; 0 \; \cdots) + \dots + (\cdots \; 0 \; A_m)$$
$$= e_1 a_1 + \dots + e_n a_n = \begin{pmatrix} a_1 \\ 0 \\ \vdots \end{pmatrix} + \dots + \begin{pmatrix} \vdots \\ 0 \\ a_n \end{pmatrix}$$
With $b_k$ denoting the k-th row of B:
$$AB = \Big( \sum_{j=1}^{m} A_j e_j^T \Big) \Big( \sum_{k=1}^{m} e_k b_k \Big) = \sum_{k,j} A_j (e_j^T e_k) b_k = \sum_{k=1}^{m} A_k b_k$$
The product is thus a sum of full $n \times q$ matrices $A_k b_k$, each the outer product of the k-th column of A and the k-th row of B.

Algorithm 2: (jki)-Form
for j = 1,...,q
    for k = 1,...,m
        for i = 1,...,n
            c_ij = c_ij + a_ik b_kj
        end
    end
end
• Vector update: $c_{\bullet j} = c_{\bullet j} + a_{\bullet k} b_{kj}$
• Sequence of SAXPYs for the same vector: $c_{\bullet j} = \sum_k b_{kj} a_{\bullet k}$
• C computed columnwise; access to A columnwise. Access to B columnwise, but delayed.

Algorithm 3: (kji)-Form
for k = 1,...,m
    for j = 1,...,q
        for i = 1,...,n
            c_ij = c_ij + a_ik b_kj
        end
    end
end
• Vector update: $c_{\bullet j} = c_{\bullet j} + a_{\bullet k} b_{kj}$
• Sequence of SAXPYs for different vectors $c_{\bullet j}$ (no GAXPY)
• Access to A columnwise. Access to B rowwise + delayed. C computed with intermediate values $c_{ij}^{(k)}$, which are computed columnwise.
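All loop orders perform the same updates $c_{ij} = c_{ij} + a_{ik} b_{kj}$ and differ only in the order of memory accesses. A minimal sketch that makes the loop order a parameter; the function `matmul` and its `order` argument are illustrative, and pure Python of course shows only the equivalence of the results, not the performance differences:

```python
def matmul(A, B, order="ijk"):
    """Triple-loop matrix product; `order` is any permutation of 'ijk'
    and names the loops from outermost to innermost."""
    n, m, q = len(A), len(B), len(B[0])
    C = [[0.0] * q for _ in range(n)]
    ranges = {"i": range(n), "j": range(q), "k": range(m)}
    for x in ranges[order[0]]:
        for y in ranges[order[1]]:
            for z in ranges[order[2]]:
                idx = dict(zip(order, (x, y, z)))
                i, j, k = idx["i"], idx["j"], idx["k"]
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 2], [3, 4], [5, 6]]      # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]        # 2 x 3
for order in ("ijk", "jki", "kji"):
    print(order, matmul(A, B, order))
```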

Overview of Different Forms
                      ijk      ikj      kij      jik      jki      kji
                      Alg. 1                              Alg. 2   Alg. 3
Access to A by        row      -        -        row      column   column
Access to B by        column   row      row      column   -        -
Computation of C      row      row      row      column   column   column
Computation of c_ij   direct   delayed  delayed  direct   delayed  delayed
Vector operation      DOT      GAXPY    SAXPY    DOT      GAXPY    SAXPY
Vector length         m        q        q        m        n        n
Better: GAXPY (longer vector length). Access to matrices according to the storage scheme (rowwise or columnwise).

2.3.2. Matrix-Matrix Product in Parallel
$$\langle 1, n \rangle = \bigcup_{r=1}^{R} I_r, \quad \langle 1, m \rangle = \bigcup_{s=1}^{S} K_s, \quad \langle 1, q \rangle = \bigcup_{t=1}^{T} J_t$$
Distribute the blocks relative to index sets $I_r$, $K_s$, and $J_t$ to the processor array $P_{rst}$:
1. Processor $P_{rst}$ computes a small matrix-matrix product. All processors in parallel: $c_{rt}^{(s)} = A_{rs} B_{st}$
2. Compute the sum by fan-in in s: $c_{rt} = \sum_{s=1}^{S} c_{rt}^{(s)}$
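The two steps can be imitated sequentially in plain Python. This is a sketch under the assumption of contiguous index sets of nearly equal size; the helper names `split` and `block_matmul` are illustrative, and the list `partials` stands in for the results held by the processors $P_{rst}$, $s = 1, \dots, S$:

```python
def split(n, parts):
    """Partition range(n) into `parts` contiguous, disjoint ranges."""
    return [range(k * n // parts, (k + 1) * n // parts) for k in range(parts)]


def block_matmul(A, B, R, S, T):
    n, m, q = len(A), len(B), len(B[0])
    I, K, J = split(n, R), split(m, S), split(q, T)
    C = [[0.0] * q for _ in range(n)]
    for r in range(R):
        for t in range(T):
            # Step 1: independent partial blocks c_rt^(s) = A_rs B_st.
            partials = [
                [[sum(A[i][k] * B[k][j] for k in K[s]) for j in J[t]] for i in I[r]]
                for s in range(S)
            ]
            # Step 2: fan-in over s into the block c_rt.
            for part in partials:
                for a, i in enumerate(I[r]):
                    for c, j in enumerate(J[t]):
                        C[i][j] += part[a][c]
    return C


A = [[1, 2], [3, 4], [5, 6]]      # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]        # 2 x 3
print(block_matmul(A, B, R=2, S=2, T=2))
```

With S = 1 the list `partials` has a single entry and step 2 needs no communication, which is the special case discussed next.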

Mtx-Mtx in Parallel: Special Case S = 1
• Each processor $P_{rt}$ can compute its part of C, $c_{rt}$, independently without communication.
• Each processor needs
  – the full block of rows of A, relative to index set $I_r$, and
  – the full block of columns of B, relative to index set $J_t$,
  to compute $c_{rt}$ relative to rows $I_r$ and columns $J_t$.

Mtx-Mtx in Parallel: Special Case S = 1
• With $n \cdot q$ processors, each processor has to compute one DOT-product in $O(m)$ parallel time steps:
    $c_{rt} = \sum_{k=1}^{m} a_{rk} b_{kt}$
• Fan-in by $m \cdot nq$ additional processors for all DOT-products reduces the number of parallel time steps to $O(\log(m))$.

1D-Parallelization of A · B
• 1D: p processors in a linear arrangement; each processor gets the full A and a column slice of B, computing the related column slice of C = AB
• Communication: $N^2 p$ for A and $(N \cdot \frac{N}{p}) \cdot p = N^2$ for B
• Granularity: $\frac{N^3}{N^2 (1 + p)} = \frac{N}{1 + p}$
• Blocking only in i, the columns of B!
    for i = 1 : n
        for j = 1 : n
            for k = 1 : n
                C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}

2D-Parallelization of A · B
• 2D: p processors in a square arrangement, $q := \sqrt{p}$; each processor gets a row slice of A and a column slice of B, computing a full subblock of C = AB
• Communication: $N^2 \sqrt{p}$ for A and $N^2 \sqrt{p}$ for B
• Granularity: $\frac{N^3}{2 N^2 \sqrt{p}} = \frac{N}{2 \sqrt{p}}$
• Blocking in i and j, the columns of B and the rows of A!
    for i = 1 : n
        for j = 1 : n
            for k = 1 : n
                C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}

3D-Parallelization of A · B
• 3D: p processors in a cubic arrangement ($q = p^{1/3}$); each processor gets a subblock of A and a subblock of B, computing part of a subblock of C = AB. Additional fan-in to collect the parts into the full subblock of C.
• Communication: $N^2 p^{1/3}$ for A and for B $\big(= p \cdot \frac{N^2}{p^{2/3}} = p \cdot \text{blocksize}\big)$; fan-in: $N^2 p^{1/3}$
• Granularity: $\frac{N^3}{3 N^2 p^{1/3}} = \frac{N}{3 p^{1/3}}$
• Blocking in i, j, and k!
    for i = 1 : n
        for j = 1 : n
            for k = 1 : n
                C_{j,i} = C_{j,i} + A_{j,k} B_{k,i}
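The three granularity formulas can be compared for concrete values. A small sketch that evaluates the expressions from the 1D, 2D, and 3D slides; the function names and the sample values N = 1200, p = 64 are illustrative:

```python
def granularity_1d(N, p):
    """N^3 operations over N^2 (1 + p) communicated entries."""
    return N / (1 + p)


def granularity_2d(N, p):
    """N^3 operations over 2 N^2 sqrt(p) communicated entries."""
    return N / (2 * p ** 0.5)


def granularity_3d(N, p):
    """N^3 operations over 3 N^2 p^(1/3) communicated entries."""
    return N / (3 * p ** (1 / 3))


N, p = 1200, 64
for name, g in (("1D", granularity_1d), ("2D", granularity_2d), ("3D", granularity_3d)):
    print(name, round(g(N, p), 2))
```

For a fixed problem size and processor count, the ratio improves from 1D to 2D to 3D because the communication volume grows only like $p^{1/2}$ or $p^{1/3}$ instead of $p$.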

3 Linear Systems of Equations with Dense Matrices

Contents
1 Introduction
    1.1 Computer Science Aspects
    1.2 Numerical Problems
    1.3 Graphs
    1.4 Loop Manipulations
2 Elementary Linear Algebra Problems
    2.1 BLAS: Basic Linear Algebra Subroutines
    2.2 Matrix-Vector Operations
    2.3 Matrix-Matrix-Product
3 Linear Systems of Equations with Dense Matrices
    3.1 Gaussian Elimination
    3.2 Parallelization
    3.3 QR-Decomposition with Householder matrices
4 Sparse Matrices
    4.1 General Properties, Storage
    4.2 Sparse Matrices and Graphs
    4.3 Reordering
    4.4 Gaussian Elimination for Sparse Matrices
5 Iterative Methods for Sparse Matrices
    5.1 Stationary Methods
    5.2 Nonstationary Methods
    5.3 Preconditioning
6 Domain Decomposition
    6.1 Overlapping Domain Decomposition
    6.2 Non-overlapping Domain Decomposition
    6.3 Schur Complements

3.1. Linear Systems of Equations with Dense Matrices
3.1.1. Gaussian Elimination: Basic Properties
• Linear system of equations:
$$\begin{aligned} a_{11} x_1 + \dots + a_{1n} x_n &= b_1 \\ &\;\;\vdots \\ a_{n1} x_1 + \dots + a_{nn} x_n &= b_n \end{aligned}$$
• Solve $Ax = b$:
$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}$$
• Generate simpler linear equations (matrices). Transform A into triangular form: $A = A^{(1)} \to A^{(2)} \to \dots \to A^{(n)} = U$.
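
Transformation to Upper Triangular Form
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$
Row transformations: $(2) \to (2) - \frac{a_{21}}{a_{11}} \cdot (1), \; \dots, \; (n) \to (n) - \frac{a_{n1}}{a_{11}} \cdot (1)$
leads to
$$A^{(2)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & a_{32}^{(2)} & a_{33}^{(2)} & \cdots & a_{3n}^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & a_{n2}^{(2)} & a_{n3}^{(2)} & \cdots & a_{nn}^{(2)} \end{pmatrix}$$
Next transformations: $(3) \to (3) - \frac{a_{32}^{(2)}}{a_{22}^{(2)}} \cdot (2), \; \dots, \; (n) \to (n) - \frac{a_{n2}^{(2)}}{a_{22}^{(2)}} \cdot (2)$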
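
Transformation to Triangular Form (cont.)
$$A^{(3)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & a_{n3}^{(3)} & \cdots & a_{nn}^{(3)} \end{pmatrix}$$
Next transformations: $(4) \to (4) - \frac{a_{43}^{(3)}}{a_{33}^{(3)}} \cdot (3), \; \dots, \; (n) \to (n) - \frac{a_{n3}^{(3)}}{a_{33}^{(3)}} \cdot (3)$
$$A^{(n)} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} & \cdots & a_{2n}^{(2)} \\ 0 & 0 & a_{33}^{(3)} & \cdots & a_{3n}^{(3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & a_{nn}^{(n)} \end{pmatrix} = U$$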

Pseudocode Gaussian Elimination (GE)
Simplification: assume that no pivoting is necessary:
$a_{kk}^{(k)} \neq 0$ or $|a_{kk}^{(k)}| \ge \rho > 0$ for $k = 1, 2, \dots, n$
for k = 1 : n − 1
    for i = k + 1 : n
        l_{i,k} = a_{i,k} / a_{k,k}
    end
    for i = k + 1 : n
        for j = k + 1 : n
            a_{i,j} = a_{i,j} − l_{i,k} · a_{k,j}
        end
    end
end
In practice:
• Include pivoting and include the right-hand side b.
• A triangular system in U still has to be solved!
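The pseudocode translates almost line by line into Python. A minimal sketch without pivoting, so it assumes that every pivot $a_{kk}^{(k)}$ is nonzero; the function name `lu_no_pivot` is illustrative, and the multipliers are returned as a separate unit lower triangular matrix rather than stored in place:

```python
def lu_no_pivot(A):
    """Gaussian elimination without pivoting.
    Returns (L, U) with unit lower triangular L holding the multipliers
    l_ik and upper triangular U = A^(n). Assumes nonzero pivots."""
    n = len(A)
    U = [row[:] for row in A]                      # work on a copy of A
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]            # multiplier l_ik
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]       # row update
            U[i][k] = 0.0                          # eliminated entry
    return L, U


A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
L, U = lu_no_pivot(A)
print(L)
print(U)
```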

Intermediate Systems
$A^{(k)}$, $k = 1, 2, \dots, n$, with $A = A^{(1)}$ and $U = A^{(n)}$:
$$A^{(k)} = \begin{pmatrix}
a_{11}^{(1)} & \cdots & a_{1,k-1}^{(1)} & a_{1,k}^{(1)} & \cdots & a_{1,n}^{(1)} \\
0 & \ddots & \vdots & \vdots & \ddots & \vdots \\
\vdots & \ddots & a_{k-1,k-1}^{(k-1)} & a_{k-1,k}^{(k-1)} & \cdots & a_{k-1,n}^{(k-1)} \\
0 & \cdots & 0 & a_{k,k}^{(k)} & \cdots & a_{k,n}^{(k)} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & a_{n,k}^{(k)} & \cdots & a_{n,n}^{(k)}
\end{pmatrix}$$
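
Define Auxiliary Matrices
$$L = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ l_{2,1} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ l_{n,1} & \cdots & l_{n,n-1} & 1 \end{pmatrix} \quad \text{and} \quad U = A^{(n)}$$
$$L_k := \begin{pmatrix}
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & \cdots & 0 & l_{k+1,k} & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & l_{n,k} & 0 & \cdots & 0
\end{pmatrix}, \qquad L = I + \sum_k L_k$$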

Elimination Step in Terms of Auxiliary Matrices
$A^{(k+1)} = (I - L_k) \cdot A^{(k)} = A^{(k)} - L_k \cdot A^{(k)}$
$U = A^{(n)} = (I - L_{n-1}) \cdot A^{(n-1)} = \dots = (I - L_{n-1}) \cdots (I - L_1) A^{(1)} = \tilde{L} \cdot A$
$\tilde{L} := (I - L_{n-1}) \cdots (I - L_1)$
$A = \tilde{L}^{-1} \cdot U$ with U upper triangular and $\tilde{L}$ lower triangular
• Theorem 2: $\tilde{L}^{-1} = L$ and therefore $A = LU$.
• Advantage: Every further problem $Ax = b_j$ can be reduced to $(LU)x = b_j$ for arbitrary j.
• Solve two triangular problems: $(LU)x = Ly = b$ and $Ux = y$.
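Once L and U are known, each right-hand side costs only two triangular solves. A minimal sketch of forward and backward substitution in plain Python; the function names are illustrative, L is assumed to have a unit diagonal and U a nonzero diagonal, and the example factors belong to the matrix A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]:

```python
def forward_sub(L, b):
    """Solve L y = b for unit lower triangular L."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    return y


def back_sub(U, y):
    """Solve U x = y for upper triangular U with nonzero diagonal."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x


L = [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [4.0, 3.0, 1.0]]
U = [[2.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
b = [4.0, 10.0, 24.0]             # equals A applied to (1, 1, 1)
y = forward_sub(L, b)             # L y = b
x = back_sub(U, y)                # U x = y
print(y, x)
```

For several right-hand sides $b_j$ the factorization is computed once and only the two substitutions are repeated.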

Theorem 2: $\tilde{L}^{-1} = L$ → $A = LU$
For $i \le j$: $L_i \cdot L_j = 0$, hence
$(I + L_j)(I - L_j) = I + L_j - L_j - L_j^2 = I \;\Rightarrow\; (I - L_j)^{-1} = I + L_j$
$(I + L_i)(I + L_j) = I + L_i + L_j + L_i L_j = I + L_i + L_j$
$\tilde{L}^{-1} = [(I - L_{n-1}) \cdots (I - L_1)]^{-1} = (I - L_1)^{-1} \cdots (I - L_{n-1})^{-1}$
$= (I + L_1)(I + L_2) \cdots (I + L_{n-1}) = I + L_1 + L_2 + \dots + L_{n-1} = L$
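The identities used in the proof can be checked numerically for a small case. A sketch for n = 3 with hand-picked multipliers (2, 4 in the first column, 3 in the second); the helper names `matmul` and `add` are illustrative:

```python
def matmul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def add(X, Y, sign=1):
    """Entrywise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


n = 3
I = [[float(i == j) for j in range(n)] for i in range(n)]
zero = [[0.0] * n for _ in range(n)]
# L_k holds the multipliers of step k in column k, below the diagonal.
L1 = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
L2 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

assert matmul(L1, L2) == zero                      # L_i L_j = 0 for i <= j
assert matmul(add(I, L1), add(I, L1, -1)) == I     # (I + L_1)(I - L_1) = I
L = matmul(add(I, L1), add(I, L2))                 # (I + L_1)(I + L_2)
assert L == add(add(I, L1), L2)                    # = I + L_1 + L_2
print(L)
```

The order of the factors matters: the product `matmul(L2, L1)` is not zero, which is why the inverse has to be expanded as $(I + L_1)(I + L_2) \cdots$ and not in the reverse order.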

3.2. GE in Parallel: Blockwise
Main idea: Blocking of GE to avoid data transfer between processors.
Basic concepts:
Replace GE or a large LU-decomposition of the full matrix by small intermediate steps (a sequence of small block operations):
• Solving a collection of small triangular systems $L U_k = B_k$ (parallelism in columns of U)
• $A \to A - LU$ updating matrices (also easy to parallelize)
• Small LU-decompositions $B = LU$ (parallelism in rows of B)

How to Choose Blocks in L/U Satisfying LU = A
$$\begin{pmatrix} L_{11} & 0 & 0 \\ L_{21} & L_{22} & 0 \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}$$
$$= \begin{pmatrix} L_{11} U_{11} & L_{11} U_{12} & L_{11} U_{13} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} & L_{21} U_{13} + L_{22} U_{23} \\ L_{31} U_{11} & L_{31} U_{12} + L_{32} U_{22} & * \end{pmatrix}$$
Different ways of computing L and U depending on
• start (assume first entry/row/column of L/U as given)
• how to compute a new entry/row/column of L/U
• update of the block structure of L/U by grouping into
  – known blocks
  – blocks newly to compute
  – blocks to be computed later
Crout Form
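
Crout Form (cont.)
1. Solve
   $$\begin{pmatrix} L_{22} \\ L_{32} \end{pmatrix} U_{22} = \begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix} - \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{12}$$
   by a small LU-decomposition of the modified part of A → $L_{22}$, $L_{32}$, and $U_{22}$.
2. Solve
   $$L_{22} U_{23} = A_{23} - L_{21} U_{13}$$
   by solving small triangular systems of equations in $L_{22}$ → $U_{23}$.
Initial steps:
$$L_{11} U_{11} = A_{11}, \quad \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{11} = \begin{pmatrix} A_{21} \\ A_{31} \end{pmatrix}, \quad L_{11} (U_{12} \; U_{13}) = (A_{12} \; A_{13})$$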

New Partitioning
• Combine the already computed parts from the second column of L and the second row of U into the first column of L and the first row of U.
• Split the parts $L_{33}$ and $U_{33}$, ignored until now, into new columns/rows.
• Repeat this overall procedure until L and U are fully computed.

Block Structure
• Intermediate block structure (figure): solve for the red blocks.
• Reconfigure the block structure (figure).
• Repeat until done.

Left Looking GE
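• Solve $L_{11} U_{12} = A_{12}$ by a couple of parallel triangular solves, and
  $$\begin{pmatrix} L_{22} \\ L_{32} \end{pmatrix} U_{22} = \begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix} - \begin{pmatrix} L_{21} \\ L_{31} \end{pmatrix} U_{12} =: \begin{pmatrix} \tilde{A}_{22} \\ \tilde{A}_{32} \end{pmatrix}$$
  i.e. update part of A and perform a small LU-decomposition.
• Reorder blocks and repeat until ready. Start: $L_{11} U_{11} = A_{11}$, $L_{21} U_{11} = A_{21}$, and $L_{31} U_{11} = A_{31}$.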

Block Structure
• Intermediate block structure (figure): solve for the red blocks.
• Reconfigure the block structure (figure).
• Repeat until done.

Right Looking GE
New blocking (2 × 2 blocks):
• Start with $L_{11} U_{11} = A_{11}$ (small LU-decomposition).
• The equations $L_{21} U_{11} = A_{21}$ and $L_{11} U_{12} = A_{12}$ give $L_{21}$ and $U_{12}$ by triangular solves.
• It remains $L_{22} U_{22} = A_{22} - L_{21} U_{12} =: \tilde{A}_{22}$
• To compute the LU-decomposition of the modified $\tilde{A}_{22}$, repeat the 2 × 2 blocking for $\tilde{A}_{22}$ and apply recursively.
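The right-looking scheme can be sketched in plain Python. This is an illustration under several assumptions: no pivoting, a fixed block size `nb`, and a loop over the diagonal blocks in place of the recursion (the two are equivalent, since each recursion level only continues with the trailing block). The function name `block_lu_right_looking` is illustrative; a production code would call BLAS-3 routines for steps 2 and 3.

```python
def block_lu_right_looking(A, nb):
    """Right-looking block LU without pivoting; returns (L, U)."""
    n = len(A)
    A = [row[:] for row in A]                      # factors overwrite a copy
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # 1. Small LU-decomposition of the diagonal block: L11 U11 = A11.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                A[i][k] /= A[k][k]
                for j in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
        # 2a. Triangular solves L11 U12 = A12 (U12 overwrites A12).
        for j in range(k1, n):
            for i in range(k0, k1):
                for k in range(k0, i):
                    A[i][j] -= A[i][k] * A[k][j]
        # 2b. Triangular solves L21 U11 = A21 (L21 overwrites A21).
        for i in range(k1, n):
            for j in range(k0, k1):
                for k in range(k0, j):
                    A[i][j] -= A[i][k] * A[k][j]
                A[i][j] /= A[j][j]
        # 3. Matrix update A22 <- A22 - L21 U12 (matrix-matrix product).
        for i in range(k1, n):
            for j in range(k1, n):
                for k in range(k0, k1):
                    A[i][j] -= A[i][k] * A[k][j]
    L = [[A[i][j] if j < i else float(i == j) for j in range(n)] for i in range(n)]
    U = [[A[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U


A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
L, U = block_lu_right_looking(A, nb=2)
print(L)
print(U)
```

Step 3 is where most of the floating-point work happens for large matrices, which matches the remark that the blocked variants are based on BLAS-3.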

Block Structure
• Intermediate block structure (figure): solve for the blue block and both red blocks.
• Reconfigure the block structure (figure).
• Repeat until done.

Comparison and Overview
• In comparison, all methods
  – have nearly the same efficiency in parallel,
  – but better performance (sequential or parallel) than the unblocked variants because they are based on BLAS-3.
• Elementary steps of all blocking methods:
  – Matrix-matrix product and sum (easy to parallelize)
  – Couple of triangular solves (easy to parallelize)
  – Small LU-decomposition (parallelizable for long rows)
• Crout and right looking are slightly better because they spend more flops in matrix updates and need fewer triangular solves and LU-decompositions, respectively.