Communication costs of LU decomposition algorithms for banded matrices
Razvan Carbunescu
12/02/2011


Outline (1/2)

• Sequential general LU factorization (GETRF) and lower bounds
    • Definitions and lower bounds
    • LAPACK algorithm
    • Communication cost
    • Summary

• Sequential banded LU factorization (GBTRF) and lower bounds
    • Definitions and lower bounds
    • Banded format
    • LAPACK algorithm
    • Communication cost
    • Summary

• Sequential LU summary

Outline (2/2)

• Parallel LU definitions and lower bounds

• Parallel Cholesky algorithms (Saad, Schultz ’85)

• SPIKE Cholesky algorithm (Sameh ’85)

• Parallel banded LU factorization (PGBTRF)
    • ScaLAPACK algorithm
    • Communication cost
    • Summary

• Parallel banded LU and Cholesky summary

• Future work

• General summary

GETRF – Definitions and Lower Bounds

• Variables:

n - size of the matrix

r - block size (panel width)

i - current panel number

M - size of fast memory

• GETRF fits the pattern of three nested loops and has the usual lower bounds:
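The slide's own formulas did not survive extraction. As a rough sketch, the bounds usually quoted for three-nested-loop algorithms are Ω(#flops / √M) words moved and Ω(#flops / M^(3/2)) messages; the code below assumes that standard form, drops constant factors, and uses a hypothetical function name and hypothetical example sizes.

```python
import math


def three_loop_lower_bounds(flops, fast_memory_words):
    """Return (words, messages) lower bounds, up to constant factors.

    Assumed form (not recovered from the slide):
      words    = Omega(flops / sqrt(M))
      messages = Omega(flops / M^(3/2))
    """
    words = flops / math.sqrt(fast_memory_words)
    messages = flops / fast_memory_words ** 1.5
    return words, messages


# Dense LU on an n x n matrix performs on the order of n^3 flops.
n, M = 4096, 2 ** 20  # hypothetical matrix size and fast-memory size in words
words, messages = three_loop_lower_bounds(n ** 3, M)
print(f"words >= ~{words:.3g}, messages >= ~{messages:.3g}")
```

Passing a different flop count (for example one proportional to n·b² for a banded matrix) reuses the same function.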

GETRF - Communication assumptions

• BLAS2 LU on an (m x n) matrix takes

• TRSM on an (n x m) matrix with LL of size (n x n) takes

• GEMM of the form (m x n) - (m x k) * (k x n) takes

[Figures: BLAS2 LU of an m x n panel P into L and U; TRSM computing U = LL^-1 A; GEMM update of A with L (m x k) and U (k x n)]

GETRF – LAPACK algorithm

• For each panel block:

1) Factorize panel (n x r)
2) Permute matrix
3) Compute U update (TRSM) of size r x (n-ir) with LL of size r x r
4) Compute GEMM update of size:
   (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))
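The four steps above can be sketched as a right-looking blocked LU with partial pivoting. This is a plain-Python illustration under assumptions, not the LAPACK routine: the name `blocked_lu` is hypothetical, row swaps are applied eagerly to whole rows, and explicit loops stand in for the TRSM and GEMM BLAS calls.

```python
def blocked_lu(a, r):
    """Blocked LU with partial pivoting on a square matrix (list of lists).

    Returns (lu, piv): lu holds unit-lower L below the diagonal and U on and
    above it; piv[i] is the original index of row i, so A[piv] = L * U.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]
    piv = list(range(n))
    for k in range(0, n, r):
        end = min(k + r, n)
        # 1) Factorize the panel (rows k.., columns k..end-1) column by column.
        for j in range(k, end):
            p = max(range(j, n), key=lambda i: abs(lu[i][j]))
            if lu[p][j] == 0.0:
                raise ValueError("matrix is singular")
            # 2) Permute: swapping whole rows also applies the pivot to the
            #    columns left and right of the panel.
            lu[j], lu[p] = lu[p], lu[j]
            piv[j], piv[p] = piv[p], piv[j]
            for i in range(j + 1, n):
                lu[i][j] /= lu[j][j]
                for c in range(j + 1, end):
                    lu[i][c] -= lu[i][j] * lu[j][c]
        # 3) TRSM: U12 = LL^-1 * A12, forward substitution with unit-lower LL.
        for i in range(k + 1, end):
            for t in range(k, i):
                for c in range(end, n):
                    lu[i][c] -= lu[i][t] * lu[t][c]
        # 4) GEMM: A22 = A22 - L21 * U12 on the trailing matrix.
        for i in range(end, n):
            for t in range(k, end):
                for c in range(end, n):
                    lu[i][c] -= lu[i][t] * lu[t][c]
    return lu, piv


if __name__ == "__main__":
    lu, piv = blocked_lu([[2, 1, 1], [4, 3, 3], [8, 7, 9]], 2)
    print(piv)
```

In a tuned implementation steps 3 and 4 are single BLAS3 calls, which is where the panel width r trades panel cost against update cost.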

GETRF – LAPACK algorithm (1/4)

• Factorize panel P

Words:

Total words:

[Figure: panel P of size (n-(i-1)r) x r factored into L and U]

GETRF – LAPACK algorithm (2/4)

• Permute matrix with pivot information from panel

Words:

Total words:

GETRF – LAPACK algorithm (3/4)

• Compute U update (TRSM) of size r x (n-ir) with LL of size r x r

Words:

Total words:

[Figure: U = LL^-1 A, with LL of size r x r and A, U of size r x (n-ir)]

GETRF – LAPACK algorithm (4/4)

• Compute GEMM update of size (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))

Words:

Total words:

[Figure: A = A - L U, with A of size (n-ir) x (n-ir), L of size (n-ir) x r and U of size r x (n-ir)]

GETRF – Communication cost

• Communication cost

• Simplified in big-O notation we get:

GETRF - General LU Summary

• General LU lower bounds are:

• LAPACK LU algorithm gives:

GBTRF - Banded LU factorization

• Variables:

n - size of the matrix

b - matrix bandwidth

r - block size (panel width)

M - size of fast memory

• Also fits the three-nested-loops lower bounds:
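The banded formulas were lost with the slide image. One plausible reading, stated here as an assumption rather than as the slide's content, plugs the banded flop count (on the order of n b^2) into the generic three-nested-loops bounds, with constants dropped:

```latex
% Assumption: banded LU with bandwidth b performs \Theta(n b^2) flops,
% and the generic bounds \Omega(\#flops / \sqrt{M}) and
% \Omega(\#flops / M^{3/2}) apply.
\[
  W_{\text{words}} = \Omega\!\left(\frac{n b^{2}}{\sqrt{M}}\right),
  \qquad
  S_{\text{messages}} = \Omega\!\left(\frac{n b^{2}}{M^{3/2}}\right)
\]
```

When b is small relative to \sqrt{M}, the cost of simply reading the \Theta(n b) band entries can dominate these expressions.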

Banded Format

• GBTRF uses a special “banded format”

• Packed data format that stores mostly nonzero data and very few zeros

• columns map to columns; diagonals map to rows

• easy to retrieve a square block of the original A by using a leading dimension of lda – 1
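A minimal sketch of this mapping, assuming the LAPACK general band layout used as input to GBTRF (2*kl + ku + 1 storage rows, with A(i, j) stored at row kl + ku + i - j of column j in 0-based indexing). The function names are hypothetical. The second function illustrates the "lda – 1" remark: read column-major, the same storage addresses A(i, j) like a dense matrix whose leading dimension is one less than that of the band array.

```python
def to_band_storage(a, kl, ku):
    """Pack a dense banded matrix (kl sub-, ku super-diagonals) into band storage."""
    n = len(a)
    ldab = 2 * kl + ku + 1  # the extra kl rows leave room for pivoting fill-in
    ab = [[0.0] * n for _ in range(ldab)]
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            # Column j stays column j; diagonal (i - j) becomes a storage row.
            ab[kl + ku + i - j][j] = float(a[i][j])
    return ab


def dense_lookup(ab, kl, ku, i, j):
    """Read in-band A[i][j] through a column-major view with stride ldab - 1."""
    ldab = len(ab)
    n = len(ab[0])
    flat = [ab[row][col] for col in range(n) for row in range(ldab)]
    return flat[(kl + ku) + i + j * (ldab - 1)]


if __name__ == "__main__":
    a = [[1, 2, 0, 0], [3, 4, 5, 0], [0, 6, 7, 8], [0, 0, 9, 10]]
    ab = to_band_storage(a, 1, 1)
    for row in ab:
        print(row)
    print(dense_lookup(ab, 1, 1, 2, 1))
```

`dense_lookup` is only meaningful for entries inside the band; outside it the computed offset lands on unrelated storage.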

Banded Format

[Figure: conceptual band matrix versus the actual packed storage array]

• Because of the format, the updates of U and of the Schur complement are split into multiple stages for the parts of the band matrix near the edges of the storage array

GBTRF Algorithm

• For each panel block

1) Factorize panel of size b x r
2) Permute rest of matrix affected by panel
3) Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)
4) Compute U update (TRSM) of size r x r with LL of size (r x r)
5) Compute 4 GEMM updates of sizes:
   (b-2r) x (b-2r) - ((b-2r) x r) * (r x (b-2r))
   (b-2r) x r - ((b-2r) x r) * (r x r)
   r x (b-2r) - (r x r) * (r x (b-2r))
   r x r - (r x r) * (r x r)
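As a small arithmetic check on the sizes listed above, the sketch below enumerates the two TRSM and four GEMM operations for one panel and counts their flops. The function names and the (m, k, n) tuple convention are assumptions for illustration, as is the restriction to 0 < 2r < b so that every block has positive size.

```python
def gbtrf_panel_ops(b, r):
    """List the TRSM and GEMM shapes for one GBTRF panel (illustrative only)."""
    if not 0 < 2 * r < b:
        raise ValueError("this sketch assumes 0 < 2r < b")
    w = b - 2 * r
    # TRSM entries: (order of triangular LL, size of the other dimension).
    trsm = [(r, w), (r, r)]
    # GEMM entries (m, k, n): C (m x n) = C - A (m x k) * B (k x n).
    gemm = [(w, r, w), (w, r, r), (r, r, w), (r, r, r)]
    return trsm, gemm


def panel_flops(b, r):
    """Leading-order flop counts: t^2 * c per TRSM, 2*m*k*n per GEMM."""
    trsm, gemm = gbtrf_panel_ops(b, r)
    trsm_flops = sum(t * t * c for t, c in trsm)
    gemm_flops = sum(2 * m * k * n for m, k, n in gemm)
    return trsm_flops, gemm_flops


if __name__ == "__main__":
    print(gbtrf_panel_ops(10, 2))
    print(panel_flops(10, 2))
```

The four GEMM outputs together tile a (b-r) x (b-r) block, so their combined cost is 2r(b-r)^2 flops: one Schur complement update split into pieces.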

GBTRF – LAPACK algorithm (1/8)

• Factorize panel P (size b x r)

Words:

Total words:

GBTRF – LAPACK algorithm (2/8)

• Apply permutations

Words:

Total words:

GBTRF – LAPACK algorithm (3/8)

• Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)

Words:

Total words:

GBTRF – LAPACK algorithm (4/8)

• Compute U update (TRSM) of size r x r with LL of size (r x r)

Words:

Total words:

GBTRF – LAPACK algorithm (5/8)

• Compute GEMM update of size (b-2r) x (b-2r) - ((b-2r) x r) * (r x (b-2r))

Words:

Total words:

GBTRF – LAPACK algorithm (6/8)

• Compute GEMM update of size (b-2r) x r - ((b-2r) x r) * (r x r)

Words:

Total words:

GBTRF – LAPACK algorithm (7/8)

• Compute GEMM update of size r x (b-2r) - (r x r) * (r x (b-2r))

Words:

Total words:

GBTRF – LAPACK algorithm (8/8)

• Compute GEMM update of size r x r - (r x r) * (r x r)

Words:

Total words:

GBTRF communication cost

• A full cost would be:

• If we choose r < b/3, this simplifies the leading terms to:

• Since r < b, the other option is b/3 < r < b, in which case we get:

GBTRF - Banded LU Summary

• Banded LU lower bounds are:

• LAPACK banded LU algorithm gives:

Sequential Summary

Parallel banded LU - Definitions

• Variables:

n - size of the matrix

p - number of processors

b - matrix bandwidth

M - size of fast memory

Parallel banded LU – Lower Bounds

• Assuming the banded matrix is distributed in a 1D layout across n

• Lower bounds:

[Figure: neighboring processors P(i-1) and P(i) in the 1D layout]
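A 1D layout across n can be pictured as each processor owning a contiguous range of columns. The sketch below assumes a balanced block split, which is an illustrative choice rather than anything stated on the slide, and the function name is hypothetical.

```python
def block_ranges(n, p):
    """Split columns 0..n-1 into p contiguous half-open ranges [start, stop)."""
    base, extra = divmod(n, p)
    ranges = []
    start = 0
    for rank in range(p):
        # The first `extra` processors take one additional column each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for rank, (lo, hi) in enumerate(block_ranges(10, 4)):
        print(f"P({rank}) owns columns {lo}..{hi - 1}")
```

With bandwidth b much smaller than n/p, each processor's block couples only to its immediate neighbors, which is the P(i-1), P(i) relationship the figure showed.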

Parallel banded algorithms – (Saad ’85)

• (Saad, Schultz ’85) presents a computation and communication analysis of banded Cholesky (LLT) solvers on a 1D ring, a 2D torus and an n-D hypercube, as well as a pipelined approach

• While this is a different computation from LU, Cholesky can be viewed as a minimum cost for LU, since it requires neither pivoting nor the computation of U while still being a form of Gaussian elimination

• Since most parallel banded algorithms also increase the amount of computation, the computation is compared across the algorithms as well, in terms of multiplicative factors on the leading term

Parallel banded algorithms – RIGBE

Parallel banded algorithms – BIGBE

Parallel banded algorithms – HBGE

• Same algorithm as BIGBE, but the 2D grid is embedded in the hypercube to allow for lower communication costs

Parallel banded algorithms – WFGE

• Uses the 2D cyclic layout and then performs operations diagonally

Parallel banded algorithms – (Saad ’85)

• Parallel band LU lower bounds:

• Banded Cholesky algorithms:

Parallel banded algorithms – SPIKE (1/3)

• Another parallel banded implementation is the SPIKE algorithm (Lawrie, Sameh ’84), a Cholesky solver, which is a special case of Gaussian elimination

• This factorization and solve algorithm is extended to a pivoting LU implementation in (Sameh ’05)

Parallel banded algorithms – SPIKE (2/3)

Parallel banded algorithms – SPIKE (3/3)

• Parallel band LU lower bounds:

• SPIKE Cholesky algorithm:

PGBTRF – Data Layout

• Adopts the same banded layout as the sequential algorithm, with slightly larger band storage (4b instead of 3b) and a 1D block distribution

[Figure: band storage of width n and height 2b + 2b, split column-wise among processors P1, P2, P3, P4]

PGBTRF – Algorithm

• Description from ScaLAPACK code

1) Compute fully independent band LU factorizations of the submatrices located in local memory.

2) Pass the upper triangular matrix from the end of the local storage on to the next processor.

3) From the local factorization and the upper triangular matrix, form a reduced blocked bidiagonal system and store the extra data in Af (extra storage).

4) Solve the reduced blocked bidiagonal system to compute the extra factors and store them in Af.

PGBTRF – Communication cost

• Parallel band LU lower bounds:

• ScaLAPACK band LU algorithm:

Parallel Summary

• Lower bounds

• (Saad ’85)

• SPIKE

• ScaLAPACK

Future Work

• Checking the lower bounds and implementation details of applying CALU to the panel in the LAPACK algorithm

• Investigating parallel band LU lower bounds for an exact cost

• Heterogeneous analysis of the implemented MAGMA sgbtrf and lower bounds for a heterogeneous model

• Looking at Nested Dissection as another Divide and Conquer method for parallel banded LU

• Analyzing the cost of applying a parallel banded algorithm to the sequential model, to see if we can reduce communication by increasing computation

General Summary

Questions?