Dense Linear Algebra
Sathish Vadhiyar
Gaussian Elimination - Review
Version 1
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n                  /* for each row j below row i */
        for k = i to n                /* add a multiple of row i to row j */
            A(j, k) = A(j, k) - A(j, i)/A(i, i) * A(i, k)
[Figure: the matrix at step i — zeros below the diagonal in the columns before i, pivot A(i, i), with j marking the row and k the column of the update region]
Gaussian Elimination - Review
Version 2 – Remove A(j, i)/A(i, i) from the inner loop
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n                  /* for each row j below row i */
        m = A(j, i) / A(i, i)
        for k = i to n
            A(j, k) = A(j, k) - m * A(i, k)
Gaussian Elimination - Review
Version 3 – Don't compute what we already know
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n                  /* for each row j below row i */
        m = A(j, i) / A(i, i)
        for k = i+1 to n              /* entries in column i are known to be zero */
            A(j, k) = A(j, k) - m * A(i, k)
Gaussian Elimination - Review
Version 4 – Store multipliers m below the diagonal
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n                  /* for each row j below row i */
        A(j, i) = A(j, i) / A(i, i)   /* multiplier stored in place of the zero */
        for k = i+1 to n
            A(j, k) = A(j, k) - A(j, i) * A(i, k)
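The Version 4 loop nest can be written out as runnable code. The sketch below is illustrative (the name `lu_inplace` is not from the slides) and assumes all pivots are nonzero, i.e. no pivoting:

```python
# A minimal sketch of Version 4: in-place LU factorization without pivoting.
# The multipliers are stored below the diagonal, overwriting the zeros.
def lu_inplace(A):
    """Overwrite A with L (strictly below the diagonal, unit diagonal implied)
    and U (upper triangle). Assumes every pivot A[i][i] is nonzero."""
    n = len(A)
    for i in range(n - 1):
        for j in range(i + 1, n):
            A[j][i] = A[j][i] / A[i][i]          # multiplier, stored in place
            for k in range(i + 1, n):
                A[j][k] -= A[j][i] * A[i][k]     # eliminate below the pivot
    return A
```

After the call, the unit lower triangle of the result is L and the upper triangle is U, so the factorization can be checked by rebuilding L*U.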
GE - Runtime
Divisions:                       1 + 2 + 3 + … + (n-1) ≈ n²/2
Multiplications / subtractions:  1² + 2² + 3² + … + (n-1)² = n³/3 – n²/2 (approx., each)
Total:                           ≈ 2n³/3 flops
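As a quick sanity check (not from the slides), these counts can be tallied directly from the Version 4 loop nest:

```python
# Count the operations performed by the triply nested GE loop for an n x n
# matrix: one division per (i, j) pair, one multiply and one subtract per k.
def ge_op_counts(n):
    divisions = 0
    mult_subs = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            divisions += 1                 # A(j,i) / A(i,i)
            mult_subs += 2 * (n - i - 1)   # one multiply + one subtract per k
    return divisions, mult_subs
```

For n = 100 this gives 4950 divisions (= n(n-1)/2 ≈ n²/2) and 656700 multiply/subtract operations, close to the 2n³/3 ≈ 666667 estimate.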
Parallel GE
1st step – 1-D block partitioning: the n columns are divided among the p processors in contiguous blocks of n/p columns each.
1D block partitioning - Steps
1. Divisions:                        n²/2 (approx.)
2. Broadcast:                        x·log p + y·log(p-1) + z·log(p-2) + … + log 1 < n²·log p
3. Multiplications and subtractions: (n-1)·n/p + (n-2)·n/p + … + 1·1 ≈ n³/p
Runtime: < n²/2 + n²·log p + n³/p
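The three terms above can be folded into a rough, illustrative cost model (the function and its unit-cost parameters are hypothetical, not from the slides): each outer step i performs n-i divisions on one processor, broadcasts a multiplier column of length n-i, and distributes the 2(n-i)² update flops over p processors.

```python
import math

# A hypothetical cost model for the 1-D partitioning steps above.
# t_div, t_bcast, t_flop are illustrative per-operation unit costs.
def ge_1d_model(n, p, t_div=1.0, t_bcast=1.0, t_flop=1.0):
    total = 0.0
    for i in range(1, n):
        below = n - i
        total += t_div * below                       # divisions in column i
        total += t_bcast * below * math.log2(p)      # broadcast of multipliers
        total += t_flop * 2 * below * below / p      # distributed update
    return total
```

Increasing p shrinks the update term (the n³/p part) while the division term stays serial, which is the motivation for the 2-D partitioning that follows.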
2-D block partitioning
To speed up the divisions.
[Figure: the matrix distributed over a P × Q grid of processors]
2D block partitioning - Steps
1. Broadcast of A(k, k):             log Q
2. Divisions:                        n²/Q (approx.)
3. Broadcast of multipliers:         x·log P + y·log(P-1) + z·log(P-2) + … = (n²/Q)·log P
4. Multiplications and subtractions: n³/(P·Q) (approx.)
Problem with block partitioning for GE
Once a processor's block is finished, that processor remains idle for the rest of the execution.
Solution? – cyclic partitioning.
Onto cyclic
The block partitioning algorithms waste processor cycles: there is no load balancing throughout the algorithm.
Onto cyclic
[Figure: column-to-processor assignments — cyclic (0 1 2 3 0 1 2 3 …), 1-D block-cyclic, 2-D block-cyclic]
Cyclic: load balance.
1-D block-cyclic: load balance and block operations, but column factorization is a bottleneck.
2-D block-cyclic: has everything.
Block cyclic
Having blocks in a processor can lead to block-based operations (block matrix multiply etc.).
Block-based operations lead to high performance.
Operations can be split into 3 categories based on the number of operations per memory reference.
These are referred to as BLAS levels.
Basic Linear Algebra Subroutines (BLAS) – 3 levels of operations.
The memory hierarchy is exploited more efficiently by the higher-level BLAS.
BLAS level               Example                          Memory refs.   Flops   Flops/Memory refs.
Level-1 (vector)         y = y + a·x;  z = y·x               3n            2n          2/3
Level-2 (matrix-vector)  y = y + A·x;  A = A + α·x·yᵀ        n²            2n²          2
Level-3 (matrix-matrix)  C = C + A·B                         4n²           2n³         n/2
Gaussian Elimination - Review
Version 4 – Store multipliers m below the diagonal
for each column i, zero it out below the diagonal by adding multiples of row i to later rows:
for i = 1 to n-1
    for j = i+1 to n
        A(j, i) = A(j, i) / A(i, i)
        for k = i+1 to n
            A(j, k) = A(j, k) - A(j, i) * A(i, k)
What GE really computes:
for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) / A(i, i)
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)
[Figure: at step i — pivot A(i, i), multiplier column A(i+1:n, i), pivot row A(i, i+1:n), trailing submatrix A(i+1:n, i+1:n); finished multipliers fill the earlier columns]
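The vectorized formulation above can be sketched directly (assuming NumPy; the name `ge_vectorized` is illustrative, and no pivoting is performed): a column scaling followed by a rank-1 update of the trailing matrix.

```python
import numpy as np

# Vectorized GE: each step is a column scaling plus a rank-1 (outer-product)
# update of the trailing submatrix. Multipliers overwrite the lower triangle.
def ge_vectorized(A):
    """In-place LU without pivoting; assumes nonzero pivots."""
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                               # scale the column
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-1 update
    return A
```

The result matches the scalar Version 4 loop: the unit lower triangle holds L and the upper triangle holds U.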
BLAS1 and BLAS2 operations: the column scaling is BLAS1 (vector), and the rank-1 update of the trailing matrix is BLAS2 (matrix-vector).
Converting BLAS2 to BLAS3
Use blocking to obtain optimized matrix multiplies (BLAS3).
Matrix multiplies via delayed updates:
save several updates to the trailing matrix,
then apply them together as a single matrix multiply.
Modified GE using BLAS3
Courtesy: Dr. Jack Dongarra
for ib = 1 to n-1 step b              /* process matrix b columns at a time */
    end = ib + b - 1
    /* Apply the BLAS2 version of GE to get A(ib:n, ib:end) factored.
       Let LL denote the strictly lower triangular portion of A(ib:end, ib:end)
       plus the identity (i.e. the unit lower triangular factor of the block) */
    A(ib:end, end+1:n) = LL⁻¹ · A(ib:end, end+1:n)
                                      /* update next b rows of U */
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) · A(ib:end, end+1:n)
                                      /* apply delayed updates with a single matrix multiply */
[Figure: blocked factorization at panel ib:end, block size b — completed parts of L and U, diagonal block A(ib:end, ib:end), block column A(end+1:n, ib:end), trailing matrix A(end+1:n, end+1:n)]
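The blocked, delayed-update scheme can be sketched as follows (assuming NumPy; `lu_blocked` is an illustrative name and no pivoting is done). The panel is factored with the unblocked BLAS2 loop, the next b rows of U come from a triangular solve, and the trailing matrix gets one matrix multiply.

```python
import numpy as np

# Blocked right-looking LU without pivoting, block size b.
def lu_blocked(A, b=2):
    n = A.shape[0]
    for ib in range(0, n - 1, b):
        end = min(ib + b, n)                 # exclusive index, NumPy-style
        # 1. Factor the panel A[ib:n, ib:end] with the unblocked algorithm
        for i in range(ib, end):
            if i + 1 < n:
                A[i+1:, i] /= A[i, i]
                A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
        if end < n:
            # LL = unit lower triangular part of the panel's diagonal block
            LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
            # 2. Update the next b rows of U by a triangular solve
            A[ib:end, end:] = np.linalg.solve(LL, A[ib:end, end:])
            # 3. Delayed update of the trailing matrix: one matrix multiply
            A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
    return A
```

For a matrix that needs no pivoting (e.g. diagonally dominant), the result agrees with the unblocked factorization; only the order of the updates changes, trading many BLAS2 rank-1 updates for one BLAS3 multiply per panel.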
GE: Miscellaneous
GE with Partial Pivoting
• 1D block-column partitioning: which is better, column or row pivoting?
• Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1).
• The exchange information is passed to the other processes by piggybacking on the multiplier broadcast.
• Row pivoting involves a distributed search and exchange: O(n/P) + O(log P).
• 2D block partitioning: the pivot search can be restricted to a limited number of columns.
Triangular Solve – Unit Upper Triangular Matrix
Sequential complexity: O(n²)
Complexity of the parallel algorithm with 2D block partitioning (√P × √P): O(n²)/√P
Thus (parallel GE / parallel TS) < (sequential GE / sequential TS).
Overall (GE + TS): O(n³/P)
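The sequential O(n²) triangular solve can be sketched as back-substitution (illustrative code, not from the slides); with a unit diagonal, no divisions are needed:

```python
# Back-substitution for a unit upper triangular system U x = b.
# The doubly nested loop performs about n^2/2 multiply-subtract pairs: O(n^2).
def unit_upper_solve(U, b):
    """Solve U x = b where U is unit upper triangular (diagonal entries = 1)."""
    n = len(b)
    x = b[:]                        # copy; overwritten with the solution
    for i in range(n - 1, -1, -1):
        for k in range(i + 1, n):
            x[i] -= U[i][k] * x[k]  # subtract already-solved components
        # no division: the diagonal entry is 1
    return x
```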