Dense Linear Algebra
Sathish Vadhiyar
Gaussian Elimination - Review
Version 1
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n                  /* for each row j below row i */
        for k = i to n                /* add a multiple of row i to row j */
            A(j, k) = A(j, k) - A(j, i)/A(i, i) * A(i, k)
[Figure: the matrix at step i — zeros below the diagonal in the columns before i, pivot A(i, i), with j marking the row and k the column of the update region]
Gaussian Elimination - Review
Version 2 – Remove A(j, i)/A(i, i) from the inner loop
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n                  /* for each row j below row i */
        m = A(j, i) / A(i, i)
        for k = i to n
            A(j, k) = A(j, k) - m * A(i, k)
Gaussian Elimination - Review
Version 3 – Don't compute what we already know
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n                  /* for each row j below row i */
        m = A(j, i) / A(i, i)
        for k = i+1 to n              /* entries in column i are known to be zero */
            A(j, k) = A(j, k) - m * A(i, k)
Gaussian Elimination - Review
Version 4 – Store multipliers m below the diagonal
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n                  /* for each row j below row i */
        A(j, i) = A(j, i) / A(i, i)   /* multiplier stored in place of the zero */
        for k = i+1 to n
            A(j, k) = A(j, k) - A(j, i) * A(i, k)
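The Version 4 loop nest can be written out as runnable code. The sketch below is illustrative (the name `lu_inplace` is not from the slides) and assumes all pivots are nonzero, i.e. no pivoting:

```python
# A minimal sketch of Version 4: in-place LU factorization without pivoting.
# The multipliers are stored below the diagonal, overwriting the zeros.
def lu_inplace(A):
    """Overwrite A with L (strictly below the diagonal, unit diagonal implied)
    and U (upper triangle). Assumes every pivot A[i][i] is nonzero."""
    n = len(A)
    for i in range(n - 1):
        for j in range(i + 1, n):
            A[j][i] = A[j][i] / A[i][i]          # multiplier, stored in place
            for k in range(i + 1, n):
                A[j][k] -= A[j][i] * A[i][k]     # eliminate below the pivot
    return A
```

After the call, the unit lower triangle of the result is L and the upper triangle is U, so the factorization can be checked by rebuilding L*U.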
GE - Runtime
Divisions:                       1 + 2 + 3 + … + (n-1) ≈ n²/2
Multiplications / subtractions:  1² + 2² + 3² + … + (n-1)² = n³/3 – n²/2 (approx., each)
Total:                           ≈ 2n³/3 flops
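As a quick sanity check (not from the slides), these counts can be tallied directly from the Version 4 loop nest:

```python
# Count the operations performed by the triply nested GE loop for an n x n
# matrix: one division per (i, j) pair, one multiply and one subtract per k.
def ge_op_counts(n):
    divisions = 0
    mult_subs = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            divisions += 1                 # A(j,i) / A(i,i)
            mult_subs += 2 * (n - i - 1)   # one multiply + one subtract per k
    return divisions, mult_subs
```

For n = 100 this gives 4950 divisions (= n(n-1)/2 ≈ n²/2) and 656700 multiply/subtract operations, close to the 2n³/3 ≈ 666667 estimate.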
Parallel GE
1st step – 1-D block partitioning: the n columns are divided among the p processors in contiguous blocks of n/p columns each.
1D block partitioning - Steps
1. Divisions:                        n²/2 (approx.)
2. Broadcast:                        x·log p + y·log(p-1) + z·log(p-2) + … + log 1 < n²·log p
3. Multiplications and subtractions: (n-1)·n/p + (n-2)·n/p + … + 1·1 ≈ n³/p
Runtime: < n²/2 + n²·log p + n³/p
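The three terms above can be folded into a rough, illustrative cost model (the function and its unit-cost parameters are hypothetical, not from the slides): each outer step i performs n-i divisions on one processor, broadcasts a multiplier column of length n-i, and distributes the 2(n-i)² update flops over p processors.

```python
import math

# A hypothetical cost model for the 1-D partitioning steps above.
# t_div, t_bcast, t_flop are illustrative per-operation unit costs.
def ge_1d_model(n, p, t_div=1.0, t_bcast=1.0, t_flop=1.0):
    total = 0.0
    for i in range(1, n):
        below = n - i
        total += t_div * below                       # divisions in column i
        total += t_bcast * below * math.log2(p)      # broadcast of multipliers
        total += t_flop * 2 * below * below / p      # distributed update
    return total
```

Increasing p shrinks the update term (the n³/p part) while the division term stays serial, which is the motivation for the 2-D partitioning that follows.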
2-D block partitioning
To speed up the divisions.
[Figure: the matrix distributed over a P × Q grid of processors]
2D block partitioning - Steps
1. Broadcast of A(k, k):             log Q
2. Divisions:                        n²/Q (approx.)
3. Broadcast of multipliers:         x·log P + y·log(P-1) + z·log(P-2) + … = (n²/Q)·log P
4. Multiplications and subtractions: n³/(P·Q) (approx.)
Problem with block partitioning for GE
Once a processor's block is finished, that processor remains idle for the rest of the execution.
Solution? – cyclic partitioning.
Onto cyclic
The block partitioning algorithms waste processor cycles: there is no load balancing throughout the algorithm.
Onto cyclic
[Figure: column-to-processor assignments — cyclic (0 1 2 3 0 1 2 3 …), 1-D block-cyclic, 2-D block-cyclic]
Cyclic: load balance.
1-D block-cyclic: load balance and block operations, but column factorization is a bottleneck.
2-D block-cyclic: has everything.
Block cyclic
Having blocks in a processor can lead to block-based operations (block matrix multiply etc.).
Block-based operations lead to high performance.
Operations can be split into 3 categories based on the number of operations per memory reference.
These are referred to as BLAS levels.
Basic Linear Algebra Subroutines (BLAS) – 3 levels of operations.
The memory hierarchy is exploited more efficiently by the higher-level BLAS.
BLAS level               Example                          Memory refs.   Flops   Flops/Memory refs.
Level-1 (vector)         y = y + a·x;  z = y·x               3n            2n          2/3
Level-2 (matrix-vector)  y = y + A·x;  A = A + α·x·yᵀ        n²            2n²          2
Level-3 (matrix-matrix)  C = C + A·B                         4n²           2n³         n/2
Gaussian Elimination - Review
Version 4 – Store multipliers m below the diagonal
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n
        A(j, i) = A(j, i) / A(i, i)
        for k = i+1 to n
            A(j, k) = A(j, k) - A(j, i) * A(i, k)
What GE really computes:
for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) / A(i, i)
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)
[Figure: at step i — pivot A(i, i), multiplier column A(i+1:n, i), pivot row A(i, i+1:n), trailing submatrix A(i+1:n, i+1:n); finished multipliers fill the earlier columns]
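The vectorized formulation above can be sketched directly (assuming NumPy; the name `ge_vectorized` is illustrative, and no pivoting is performed): a column scaling followed by a rank-1 update of the trailing matrix.

```python
import numpy as np

# Vectorized GE: each step is a column scaling plus a rank-1 (outer-product)
# update of the trailing submatrix. Multipliers overwrite the lower triangle.
def ge_vectorized(A):
    """In-place LU without pivoting; assumes nonzero pivots."""
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                               # scale the column
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
    return A
```

The result matches the scalar Version 4 loop: the unit lower triangle holds L and the upper triangle holds U.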
BLAS1 and BLAS2 operations: the column scaling is BLAS1 (vector), and the rank-1 update of the trailing matrix is BLAS2 (matrix-vector).
Converting BLAS2 to BLAS3
Use blocking to obtain optimized matrix multiplies (BLAS3).
Matrix multiplies via delayed updates:
save several updates to the trailing matrix,
then apply them together as a single matrix multiply.
Modified GE using BLAS3
Courtesy: Dr. Jack Dongarra
for ib = 1 to n-1 step b              /* process matrix b columns at a time */
    end = ib + b - 1
    /* Apply the BLAS2 version of GE to get A(ib:n, ib:end) factored.
       Let LL denote the strictly lower triangular portion of A(ib:end, ib:end)
       plus the identity (i.e. the unit lower triangular factor of the block) */
    A(ib:end, end+1:n) = LL⁻¹ · A(ib:end, end+1:n)
                                      /* update next b rows of U */
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) · A(ib:end, end+1:n)
                                      /* apply delayed updates with a single matrix multiply */
[Figure: blocked factorization at panel ib:end, block size b — completed parts of L and U, diagonal block A(ib:end, ib:end), block column A(end+1:n, ib:end), trailing matrix A(end+1:n, end+1:n)]
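The blocked, delayed-update scheme can be sketched as follows (assuming NumPy; `lu_blocked` is an illustrative name and no pivoting is done). The panel is factored with the unblocked BLAS2 loop, the next b rows of U come from a triangular solve, and the trailing matrix gets one matrix multiply.

```python
import numpy as np

# Blocked right-looking LU without pivoting, block size b.
def lu_blocked(A, b=2):
    n = A.shape[0]
    for ib in range(0, n - 1, b):
        end = min(ib + b, n)                 # exclusive index, NumPy-style
        # 1. Factor the panel A[ib:n, ib:end] with the unblocked algorithm
        for i in range(ib, end):
            if i + 1 < n:
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
        if end < n:
            # LL = unit lower triangular part of the panel's diagonal block
            LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
            # 2. Update the next b rows of U by a triangular solve
            A[ib:end, end:] = np.linalg.solve(LL, A[ib:end, end:])
            # 3. Delayed update of the trailing matrix: one matrix multiply
            A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
    return A
```

For a matrix that needs no pivoting (e.g. diagonally dominant), the result agrees with the unblocked factorization; only the order of the updates changes, trading many BLAS2 rank-1 updates for one BLAS3 multiply per panel.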
GE: Miscellaneous
GE with Partial Pivoting
• 1D block-column partitioning: which is better, column or row pivoting?
• Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1).
• The exchange information is passed to the other processes by piggybacking on the multiplier broadcast.
• Row pivoting involves a distributed search and exchange: O(n/P) + O(log P).
• 2D block partitioning: the pivot search can be restricted to a limited number of columns.
Triangular Solve – Unit Upper Triangular Matrix
Sequential complexity: O(n²)
Complexity of the parallel algorithm with 2D block partitioning (√P × √P): O(n²)/√P
Thus (parallel GE / parallel TS) < (sequential GE / sequential TS).
Overall (GE + TS): O(n³/P)
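The sequential O(n²) triangular solve can be sketched as back-substitution (illustrative code, not from the slides); with a unit diagonal, no divisions are needed:

```python
# Back-substitution for a unit upper triangular system U x = b.
# The doubly nested loop performs about n^2/2 multiply-subtract pairs: O(n^2).
def unit_upper_solve(U, b):
    """Solve U x = b where U is unit upper triangular (diagonal entries = 1)."""
    n = len(b)
    x = b[:]                        # copy; overwritten with the solution
    for i in range(n - 1, -1, -1):
        for k in range(i + 1, n):
            x[i] -= U[i][k] * x[k]  # subtract already-solved components
        # no division: the diagonal entry is 1
    return x
```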