Block LU Factorization Lecture 24 MA471 Fall 2003


Example Case

1) Suppose we are faced with the solution of a linear system Ax=b

2) Further suppose:
   1) A is large (dim(A)>10,000)
   2) A is dense
   3) A is full
   4) We have a sequence of different b vectors.


Problems

• Suppose we are able to compute the matrix
  – It costs N^2 doubles to store the matrix
  – E.g. for N=100,000 we require 8×10^10 bytes (80 GB, or about 74.5 GiB) of storage for the matrix alone.
  – 32 bit processors are limited to 4 gigabytes of memory
  – Most desktops (even 64 bit) do not have that much memory
  – What to do?
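The storage estimate can be checked with a few lines of arithmetic. This is a minimal sketch in Python (not part of the lecture code); the function name `dense_matrix_bytes` and the 4-by-4 processor grid are illustrative assumptions.

```python
def dense_matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed to store a dense n-by-n matrix of 8-byte doubles."""
    return n * n * bytes_per_entry

n = 100_000
total = dense_matrix_bytes(n)
print(f"{total / 1e9:.1f} GB")      # decimal gigabytes: 80.0 GB
print(f"{total / 2**30:.1f} GiB")   # binary gibibytes: 74.5 GiB

# Assumption for illustration: the matrix is split evenly over a
# 4-by-4 processor grid, one (n/4)-by-(n/4) block per processor.
per_proc = dense_matrix_bytes(n // 4)
print(f"{per_proc / 2**30:.2f} GiB per processor")
```

Splitting the matrix over more processors reduces the memory needed per processor quadratically in the grid dimension.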


Divide and Conquer

P0  P1  P2  P3
P4  P5  P6  P7
P8  P9  P10 P11
P12 P13 P14 P15

One approach is to assume we have a square number of processors. We then divide the matrix into blocks, storing one block per processor.
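The block-to-processor layout above can be written down as a small index mapping. The sketch below assumes row-major numbering of the processors (as in the grid shown) and that the grid dimension divides the matrix size evenly; the helper names `grid_size`, `owner` and `block_range` are hypothetical.

```python
import math

def grid_size(num_procs):
    """Side length q of a q-by-q processor grid."""
    q = math.isqrt(num_procs)
    if q * q != num_procs:
        raise ValueError("number of processors must be a perfect square")
    return q

def owner(block_row, block_col, q):
    """Rank of the processor that stores block (block_row, block_col)."""
    return block_row * q + block_col

def block_range(index, n, q):
    """Half-open range of global rows (or columns) covered by block `index`."""
    size = n // q  # assumes q divides n evenly
    return index * size, (index + 1) * size

q = grid_size(16)
print(owner(1, 2, q))           # 6, i.e. P6 in the 4-by-4 layout
print(block_range(1, 1000, q))  # (250, 500)
```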


Back to the Linear System

• We are now faced with LU factorization of a distributed matrix.

• This calls for a modified LU routine which acts on blocks of the matrix.

• We will demonstrate this algorithm for one level.

• i.e. we need to construct matrices L,U such that A=LU and we only store single blocks of A,L,U on any processor.


Constructing the Block LU Factorization

[ A00 A01 A02 ]   [ L00  0  0 ]   [ U00 U01 U02 ]
[ A10 A11 A12 ] = [ L10  1  0 ] * [  0  ?11 ?12 ]
[ A20 A21 A22 ]   [ L20  0  1 ]   [  0  ?21 ?22 ]

First we LU factorize A00 and look for the above block factorization. However, we need to figure out what each of the entries is:

A00 = L00*U00 (compute L00, U00 by LU factorization)

A01 = L00*U01 => U01 = L00\A01
A02 = L00*U02 => U02 = L00\A02

A10 = L10*U00 => L10 = A10/U00
A20 = L20*U00 => L20 = A20/U00

A11 = L10*U01 + ?11 => ?11 = A11 - L10*U01
A12 = L10*U02 + ?12 => ?12 = A12 - L10*U02
A21 = L20*U01 + ?21 => ?21 = A21 - L20*U01
A22 = L20*U02 + ?22 => ?22 = A22 - L20*U02

In the general case:
Anm = Ln0*U0m + ?nm => ?nm = Anm - Ln0*U0m


Summary First Stage


First step: LU factorize the uppermost diagonal block, A00 = L00*U00

Second step:
   a) compute U0n = L00\A0n, n>0
   b) compute Ln0 = An0/U00, n>0

Third step: compute ?nm = Anm - Ln0*U0m, (n,m>0)
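The three steps can be sketched serially in plain Python, with each block held as a small list of rows. This is an illustration under stated assumptions, not the lecture's blocklu.m: the diagonal block is factorized without pivoting, and the helper names (`lu_nopivot`, `left_solve_lower`, `right_solve_upper`, `first_stage`) are hypothetical.

```python
def matmul(X, Y):
    """Dense matrix product X*Y for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def matsub(X, Y):
    """Entrywise difference X - Y."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lu_nopivot(A):
    """Doolittle LU without pivoting: A = L*U with L unit lower triangular."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def left_solve_lower(L, B):
    """Return X with L*X = B (L\\B in Matlab notation), L lower triangular."""
    n, m = len(B), len(B[0])
    X = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = sum(L[i][k] * X[k][j] for k in range(i))
            X[i][j] = (B[i][j] - s) / L[i][i]
    return X

def right_solve_upper(B, U):
    """Return X with X*U = B (B/U in Matlab notation), U upper triangular."""
    n, m = len(B), len(B[0])
    X = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = sum(X[i][k] * U[k][j] for k in range(j))
            X[i][j] = (B[i][j] - s) / U[j][j]
    return X

def first_stage(A):
    """Apply the first stage in place to a block matrix A[n][m]."""
    q = len(A)
    L00, U00 = lu_nopivot(A[0][0])                    # first step
    for m in range(1, q):
        A[0][m] = left_solve_lower(L00, A[0][m])      # U0m = L00\A0m
    for n in range(1, q):
        A[n][0] = right_solve_upper(A[n][0], U00)     # Ln0 = An0/U00
    for n in range(1, q):
        for m in range(1, q):                         # ?nm = Anm - Ln0*U0m
            A[n][m] = matsub(A[n][m], matmul(A[n][0], A[0][m]))
    A[0][0] = (L00, U00)
    return A

# 2-by-2 grid of 2-by-2 blocks (values chosen so the arithmetic is exact).
A = [[[[2, 1], [4, 5]], [[2, 4], [7, 11]]],
     [[[2, 4], [6, 9]], [[10, 9], [15, 20]]]]
first_stage(A)
L00, U00 = A[0][0]
print("U01 =", A[0][1])
print("L10 =", A[1][0])
print("?11 =", A[1][1])
```

After the call, the first block row holds the U blocks, the first block column holds the L blocks, and the remaining blocks hold the updated ?nm that the next stage works on.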


Now Factorize Lower SE Block

[ ?11 ?12 ]   [ L11  0 ]   [ U11 U12  ]
[ ?21 ?22 ] = [ L21  1 ] * [  0  ??22 ]

We repeat the previous algorithm, this time on the two-by-two SE block.


End Result

[ A00 A01 A02 ]   [ L00  0   0  ]   [ U00 U01 U02 ]
[ A10 A11 A12 ] = [ L10 L11  0  ] * [  0  U11 U12 ]
[ A20 A21 A22 ]   [ L20 L21 L22 ]   [  0   0  U22 ]
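Repeating the stage on each successive SE block gives the full factorization. The following is a serial sketch in Python, assuming no pivoting and a single array in which L (unit diagonal implied) and U overwrite A; the names `blocked_lu_inplace`, `unpack` and the block size `bs` are hypothetical, and this is not the lecture's Matlab code.

```python
def blocked_lu_inplace(A, bs):
    """Blocked LU without pivoting, overwriting A with L and U.

    On return the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle, including the diagonal, holds U.
    """
    n = len(A)
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        # First step: LU factorize the diagonal block.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                A[i][k] /= A[k][k]
                for j in range(k + 1, k1):
                    A[i][j] -= A[i][k] * A[k][j]
        # Second step (a): U blocks to the right, Ukm = Lkk\Akm.
        for j in range(k1, n):
            for i in range(k0, k1):
                for p in range(k0, i):
                    A[i][j] -= A[i][p] * A[p][j]
        # Second step (b): L blocks below, Lnk = Ank/Ukk.
        for i in range(k1, n):
            for j in range(k0, k1):
                for p in range(k0, j):
                    A[i][j] -= A[i][p] * A[p][j]
                A[i][j] /= A[j][j]
        # Third step: update the SE block, Anm <- Anm - Lnk*Ukm.
        for i in range(k1, n):
            for j in range(k1, n):
                for p in range(k0, k1):
                    A[i][j] -= A[i][p] * A[p][j]
    return A

def unpack(A):
    """Split the packed result into explicit L and U matrices."""
    n = len(A)
    L = [[A[i][j] if j < i else float(i == j) for j in range(n)]
         for i in range(n)]
    U = [[A[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U

A = [[2.0, 1.0, 2.0, 4.0],
     [4.0, 5.0, 7.0, 11.0],
     [2.0, 4.0, 7.0, 8.0],
     [6.0, 9.0, 16.0, 23.0]]
original = [row[:] for row in A]
L, U = unpack(blocked_lu_inplace(A, bs=2))
product = [[sum(L[i][k] * U[k][j] for k in range(4)) for j in range(4)]
           for i in range(4)]
print(product == original)  # True here: every intermediate value is exact
```

For general floating-point data the comparison should use a tolerance, and a production code would need pivoting for stability.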


Matlab Version


Parallel Algorithm

P0 P1 P2
P3 P4 P5
P6 P7 P8

P0: A00 = L00*U00 (compute L00, U00 by LU factorization)

P1: U01 = L00\A01
P2: U02 = L00\A02

P3: L10 = A10/U00
P6: L20 = A20/U00

P4: A11 <- A11 - L10*U01
P5: A12 <- A12 - L10*U02
P7: A21 <- A21 - L20*U01
P8: A22 <- A22 - L20*U02

In the general case:
Anm = Ln0*U0m + ?nm => ?nm = Anm - Ln0*U0m


Parallel Communication

Blocks held by each processor after the first stage (the A blocks shown have been overwritten by the updated blocks):

L00,U00  U01  U02
L10      A11  A12
L20      A21  A22


Communication Summary

Each block computed in the first stage must be sent to the processors that use it:

P0: sends L00 to P1,P2; sends U00 to P3,P6

P1: sends U01 to P4,P7
P2: sends U02 to P5,P8

P3: sends L10 to P4,P5
P6: sends L20 to P7,P8
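The send pattern generalizes to any stage k on a q-by-q grid: the diagonal processor sends its L block along its row and its U block down its column, each U block travels down its column, and each L block travels along its row. The sketch below lists the messages, assuming row-major processor numbering and that each block is sent by the processor that computed it (L20 by P6, for instance); `stage_messages` is a hypothetical helper, not part of the lecture's MPI code.

```python
def stage_messages(k, q):
    """List (sender, receivers, payload) for stage k on a q-by-q grid."""
    def rank(r, c):
        return r * q + c

    rest = range(k + 1, q)
    msgs = []
    if k + 1 < q:
        # Diagonal block: L along the block row, U down the block column.
        msgs.append((rank(k, k), [rank(k, m) for m in rest], f"L{k}{k}"))
        msgs.append((rank(k, k), [rank(n, k) for n in rest], f"U{k}{k}"))
    for m in rest:  # U blocks travel down their column
        msgs.append((rank(k, m), [rank(n, m) for n in rest], f"U{k}{m}"))
    for n in rest:  # L blocks travel along their row
        msgs.append((rank(n, k), [rank(n, m) for m in rest], f"L{n}{k}"))
    return msgs

for stage in range(2):
    print(f"stage {stage}:")
    for sender, receivers, payload in stage_messages(stage, 3):
        targets = ",".join(f"P{r}" for r in receivers)
        print(f"  P{sender}: sends {payload} to {targets}")
```

The last stage (k = q-1) needs no messages, since only the final diagonal block remains to be factorized.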


Upshot

[Figure: Upshot timeline of the run, with regions labelled a to h]

Notes:
1) I added an MPI_Barrier purely to separate the LU factorization and the backsolve.
2) In terms of efficiency we can see that quite a bit of time is spent in MPI_Wait compared to compute time.
3) The compute part of this code can be optimized much more, which would make the parallel efficiency even worse.

1st stage:
(a) P0: sends L00 to P1,P2; sends U00 to P3,P6
(b) P1: sends U01 to P4,P7
(c) P2: sends U02 to P5,P8
(d) P3: sends L10 to P4,P5
(e) P6: sends L20 to P7,P8

2nd stage:
(f) P4: sends L11 to P5; sends U11 to P7
(g) P5: sends U12 to P8
(h) P7: sends L21 to P8


Block Back Solve

• After factorization we are left with the task of using the distributed L and U to compute the backsolve:

L00,U00  U01      U02
L10      L11,U11  U12
L20      L21      L22,U22

Block distribution of L and U

P0 P1 P2
P3 P4 P5
P6 P7 P8


Recall

• Given an LU factorization of A, namely L,U such that A=LU

• Then we can solve Ax=b by
  – y=L\b
  – x=U\y
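The two triangular solves can be sketched as forward and back substitution. The Python functions below (`forward_sub`, `back_sub`) are illustrative helpers for small dense factors, and the 2-by-2 factors are made-up example data; the loop shows how one factorization is reused for a sequence of different b vectors.

```python
def forward_sub(L, b):
    """Solve L*y = b for lower triangular L (y = L\\b)."""
    y = []
    for i, row in enumerate(L):
        s = sum(row[k] * y[k] for k in range(i))
        y.append((b[i] - s) / row[i])
    return y

def back_sub(U, y):
    """Solve U*x = y for upper triangular U (x = U\\y)."""
    n = len(U)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

# Example factors: L*U = [[2, 1], [4, 5]].
L = [[1.0, 0.0], [2.0, 1.0]]
U = [[2.0, 1.0], [0.0, 3.0]]

# The factorization is computed once and reused for every right-hand side.
for b in ([3.0, 9.0], [2.0, 4.0]):
    y = forward_sub(L, b)
    x = back_sub(U, y)
    print(b, "->", x)
```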


Distributed Back Solve

[ L00  0   0  ] [ y0 ]   [ b0 ]
[ L10 L11  0  ] [ y1 ] = [ b1 ]
[ L20 L21 L22 ] [ y2 ]   [ b2 ]

P0: solve L00*y0 = b0; send y0 to P3,P6
P3: send L10*y0 to P4
P4: solve L11*y1 = b1 - L10*y0; send y1 to P7
P6: send L20*y0 to P8
P7: send L21*y1 to P8
P8: solve L22*y2 = b2 - L20*y0 - L21*y1

Results: y0 on P0, y1 on P4, y2 on P8

P0 P1 P2
P3 P4 P5
P6 P7 P8
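The processor steps above amount to a block forward substitution. The following serial Python sketch mirrors them: each off-diagonal product L[n][k]*y[k] stands for a message sent to the diagonal processor of block row n. The function names and the nested-list storage of the blocks are assumptions made for illustration, not the layout used in parlufact2.c.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def forward_sub(L, b):
    """Solve L*y = b for a lower triangular block L."""
    y = []
    for i, row in enumerate(L):
        s = sum(row[k] * y[k] for k in range(i))
        y.append((b[i] - s) / row[i])
    return y

def block_forward_solve(L, b):
    """Solve the block lower triangular system L*y = b.

    L[n][k] (k <= n) are the blocks of L and b[n] the matching pieces
    of the right-hand side.
    """
    y = []
    for n in range(len(b)):
        rhs = b[n][:]
        for k in range(n):
            # On the grid this product is formed by the processor that
            # owns L[n][k] and sent along row n to the diagonal processor.
            contrib = matvec(L[n][k], y[k])
            rhs = [r - c for r, c in zip(rhs, contrib)]
        # The diagonal processor solves with its own block.
        y.append(forward_sub(L[n][n], rhs))
    return y

L00 = [[1.0, 0.0], [2.0, 1.0]]
L10 = [[1.0, 1.0], [3.0, 2.0]]
L11 = [[1.0, 0.0], [2.0, 1.0]]
L = [[L00], [L10, L11]]
b = [[1.0, 4.0], [6.0, 17.0]]
y = block_forward_solve(L, b)
print(y)
```

Block row n cannot start its solve until y[k] is known for every k < n, which is the dependency chain that serializes the back solve across the diagonal processors.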


Matlab Code


Back Solve

After the factorization we computed a solution to Ax=b

This consists of two distributed block triangular systems to solve


Barrier Between Back Solves

This time I inserted an MPI_Barrier call between the backsolves. This highlights the serial nature of the backsolves.


Example Code

http://www.math.unm.edu/~timwar/MA471F03/blocklu.m

http://www.math.unm.edu/~timwar/MA471F03/parlufact2.c