Block LU Factorization


1/22

Block LU Factorization

Lecture 24

MA471 Fall 2003


2/22

Example Case

1) Suppose we are faced with the solution of a linear system Ax=b.

2) Further suppose:

1) A is large (dim(A)>10,000)

2) A is dense

3) A is full

4) We have a sequence of different b vectors.


3/22

Problems

Suppose we are able to compute the matrix.

It costs N^2 doubles (8 bytes each) to store the matrix.

E.g. for N=100,000 we require 80 gigabytes (about 74.5 GiB) of storage for the matrix alone.

32 bit processors are limited to 4 gigabytes of memory.

Most desktops (even 64 bit) do not have 80 gigabytes.
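
The storage estimate can be checked with a few lines of arithmetic. This is a minimal sketch assuming 8-byte doubles; the helper name dense_matrix_bytes is made up for illustration.

```python
import math

BYTES_PER_DOUBLE = 8  # assumption: IEEE double precision

def dense_matrix_bytes(n):
    """Bytes needed to store a dense n-by-n matrix of doubles."""
    return n * n * BYTES_PER_DOUBLE

n = 100_000
total = dense_matrix_bytes(n)
print(total / 1e9)     # size in decimal gigabytes
print(total / 2**30)   # size in binary gibibytes

# Largest N whose dense matrix still fits in a 4 GiB (32 bit) address space.
n_max = math.isqrt(4 * 2**30 // BYTES_PER_DOUBLE)
print(n_max)
```

On a 32 bit machine the largest dense matrix that fits is therefore a little over 23,000 rows, well short of N=100,000.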

What to do?


4/22

Divide and Conquer

P0 P1 P2 P3

P4 P5 P6 P7

P8 P9 P10 P11

P12 P13 P14 P15

One approach is to assume we have a square number of processors.

We then divide the matrix into blocks storing one block per processor.
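
A minimal sketch of this block distribution, assuming processors are numbered row by row as in the picture and that N is divisible by the grid size. The function name block_layout and the dictionary keys are hypothetical.

```python
import math

def block_layout(n, nprocs):
    """Map each rank on a q x q processor grid to the block it stores."""
    q = math.isqrt(nprocs)
    if q * q != nprocs:
        raise ValueError("number of processors must be a perfect square")
    if n % q != 0:
        raise ValueError("for simplicity, n must be divisible by the grid size")
    nb = n // q  # rows (and columns) per block
    layout = {}
    for rank in range(nprocs):
        brow, bcol = divmod(rank, q)  # row-major numbering: P0..P3 form the first row
        layout[rank] = {
            "block": (brow, bcol),
            "rows": (brow * nb, (brow + 1) * nb),   # half-open range of matrix rows
            "cols": (bcol * nb, (bcol + 1) * nb),   # half-open range of matrix columns
        }
    return layout

layout = block_layout(n=100_000, nprocs=16)
print(layout[5])

# Storage per processor for one block of 8-byte doubles.
block_bytes = (100_000 // 4) ** 2 * 8
print(block_bytes / 1e9)
```

With 16 processors and N=100,000, each processor holds a 25,000 by 25,000 block, which is 5 gigabytes instead of 80.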


5/22

Back to the Linear System

We are now faced with LU factorization of a distributed matrix.

This calls for a modified LU routine which acts on blocks of the matrix.

We will demonstrate this algorithm for one level.

i.e. we need to construct matrices L,U such that A=LU and we only store single blocks of A,L,U on any processor.


6/22

Constructing the Block LU Factorization

[ A00 A01 A02 ]   [ L00  0   0 ]   [ U00 U01 U02 ]
[ A10 A11 A12 ] = [ L10  I   0 ] * [  0  ?11 ?12 ]
[ A20 A21 A22 ]   [ L20  0   I ]   [  0  ?21 ?22 ]

(I denotes an identity block.)

First we LU factorize A00 and look for the above block factorization.

However, we need to figure out what each of the entries is:

A00 = L00*U00 (compute L00, U00 by LU factorization)

A01 = L00*U01 => U01 = L00\A01

A02 = L00*U02 => U02 = L00\A02

A10 = L10*U00 => L10 = A10/U00

A20 = L20*U00 => L20 = A20/U00

A11 = L10*U01 + ?11 => ?11 = A11 - L10*U01 ...


7/22

cont

A00 = L00*U00 (compute L00, U00 by LU factorization)

A01 = L00*U01 => U01 = L00\A01

A02 = L00*U02 => U02 = L00\A02

A10 = L10*U00 => L10 = A10/U00

A20 = L20*U00 => L20 = A20/U00

A11 = L10*U01 + ?11 => ?11 = A11 - L10*U01

A12 = L10*U02 + ?12 => ?12 = A12 - L10*U02

A21 = L20*U01 + ?21 => ?21 = A21 - L20*U01

A22 = L20*U02 + ?22 => ?22 = A22 - L20*U02

In the general case:

Anm = Ln0*U0m + ?nm => ?nm = Anm - Ln0*U0m
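
Substituting the definitions of Ln0 and U0m into the general case shows that the updated block is the Schur complement of A00. This is a standard identity written in the notation of these slides, with \tilde{A}_{nm} standing for the block marked ?nm:

```latex
\begin{aligned}
L_{n0} &= A_{n0}\,U_{00}^{-1}, \qquad U_{0m} = L_{00}^{-1}A_{0m}, \\
\tilde{A}_{nm} &= A_{nm} - L_{n0}U_{0m}
  = A_{nm} - A_{n0}\,U_{00}^{-1}L_{00}^{-1}A_{0m}
  = A_{nm} - A_{n0}\,A_{00}^{-1}A_{0m}, \qquad n,m>0 .
\end{aligned}
```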


8/22

Summary First Stage

[ A00 A01 A02 ]   [ L00  0   0 ]   [ U00 U01 U02 ]
[ A10 A11 A12 ] = [ L10  I   0 ] * [  0  ?11 ?12 ]
[ A20 A21 A22 ]   [ L20  0   I ]   [  0  ?21 ?22 ]

First step: LU factorize the uppermost diagonal block, A00 = L00*U00

Second step: a) compute U0n = L00\A0n, n>0

b) compute Ln0 = An0/U00, n>0

Third step: compute ?nm = Anm - Ln0*U0m, (n,m>0)


9/22

Now Factorize Lower SE Block

[ ?11 ?12 ]   [ L11  0 ]   [ U11  U12 ]
[ ?21 ?22 ] = [ L21  I ] * [  0  ??22 ]

We repeat the previous algorithm, this time on the two by two SE block.


10/22

End Result

[ A00 A01 A02 ]   [ L00  0   0  ]   [ U00 U01 U02 ]
[ A10 A11 A12 ] = [ L10 L11  0  ] * [  0  U11 U12 ]
[ A20 A21 A22 ]   [ L20 L21 L22 ]   [  0   0  U22 ]
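
The stages can be combined into one serial loop over the diagonal blocks. The sketch below is not the course's blocklu.m; it is a dependency-free Python illustration that assumes no pivoting is needed (the test matrix is made diagonally dominant for that reason). All function names are hypothetical.

```python
def matmul(X, Y):
    """Dense product X*Y of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lu_nopivot(A):
    """LU without pivoting: A = L*U with unit lower triangular L."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            U[i][k] = 0.0
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def left_solve(L, B):
    """X = L\\B for lower triangular L (forward substitution on each column)."""
    X = [[0.0] * len(B[0]) for _ in L]
    for i in range(len(L)):
        for j in range(len(B[0])):
            s = sum(L[i][k] * X[k][j] for k in range(i))
            X[i][j] = (B[i][j] - s) / L[i][i]
    return X

def right_solve(B, U):
    """X = B/U for upper triangular U, i.e. X*U = B."""
    X = [[0.0] * len(U) for _ in B]
    for i in range(len(B)):
        for j in range(len(U)):
            s = sum(X[i][k] * U[k][j] for k in range(j))
            X[i][j] = (B[i][j] - s) / U[j][j]
    return X

def block_lu(A):
    """A[n][m] is block (n,m); returns block grids L and U with A = L*U."""
    P, nb = len(A), len(A[0][0])
    S = [[[row[:] for row in blk] for blk in brow] for brow in A]  # working copy
    zero = [[0.0] * nb for _ in range(nb)]
    L = [[zero] * P for _ in range(P)]
    U = [[zero] * P for _ in range(P)]
    for k in range(P):
        L[k][k], U[k][k] = lu_nopivot(S[k][k])            # first step
        for m in range(k + 1, P):
            U[k][m] = left_solve(L[k][k], S[k][m])        # Ukm = Lkk\Akm
        for n in range(k + 1, P):
            L[n][k] = right_solve(S[n][k], U[k][k])       # Lnk = Ank/Ukk
        for n in range(k + 1, P):
            for m in range(k + 1, P):                     # ?nm = Anm - Lnk*Ukm
                prod = matmul(L[n][k], U[k][m])
                S[n][m] = [[a - b for a, b in zip(ra, rb)]
                           for ra, rb in zip(S[n][m], prod)]
    return L, U

def split(A, P):
    """Cut a dense matrix into a P x P grid of equal square blocks."""
    nb = len(A) // P
    return [[[row[m * nb:(m + 1) * nb] for row in A[n * nb:(n + 1) * nb]]
             for m in range(P)] for n in range(P)]

def assemble(B):
    """Glue a grid of blocks back into one dense matrix."""
    return [sum((blk[r] for blk in brow), []) for brow in B for r in range(len(brow[0]))]

N, P = 6, 3
A = [[1.0 / (1 + abs(i - j)) + (N if i == j else 0.0) for j in range(N)] for i in range(N)]
Lb, Ub = block_lu(split(A, P))
L, U = assemble(Lb), assemble(Ub)
LU = matmul(L, U)
err = max(abs(LU[i][j] - A[i][j]) for i in range(N) for j in range(N))
print(err)  # should be at rounding-error level
```

In the parallel version each processor would run only the line of the loop body that touches its own block; here a single process plays all nine roles in turn.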


11/22

Matlab Version


12/22

Parallel Algorithm

P0 P1 P2

P3 P4 P5

P6 P7 P8

P0: A00 = L00*U00 (compute L00, U00 by LU factorization)

P1: U01 = L00\A01

P2: U02 = L00\A02

P3: L10 = A10/U00

P6: L20 = A20/U00

P4: A11


13/22

Parallel Communication

[ L00,U00  U01  U02 ]
[ L10      A11  A12 ]
[ L20      A21  A22 ]

P0: L00,U00 = lu(A00)

P1: U01 = L00\A01

P2: U02 = L00\A02

P3: L10 = A10/U00

P6: L20 = A20/U00

P4: A11


14/22

Communication

Summary

P0: L00,U00 = lu(A00)

P1: U01 = L00\A01

P2: U02 = L00\A02

P3: L10 = A10/U00

P6: L20 = A20/U00

P4: A11


15/22

Upshot

Notes:

1) I added an MPI_Barrier purely to separate the LU factorization and the backsolve.

2) In terms of efficiency we can see that quite a bit of time is spent in MPI_Wait compared to compute time.

3) The compute part of this code can be optimized much more, making the parallel efficiency even worse.

1st stage:

(a) P0: sends L00 to P1,P2; sends U00 to P3,P6

(b) P1: sends U01 to P4,P7

(c) P2: sends U02 to P5,P8

(d) P3: sends L10 to P4,P5

(e) P6: sends L20 to P7,P8

2nd stage:

(f) P4: sends L11 to P5; sends U11 to P7

(g) P5: sends U12 to P8

(h) P7: sends L21 to P8
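
The message pattern follows directly from where each block lives: block (n,m) sits on processor 3n+m, the diagonal owner sends its L factor along its row and its U factor down its column, U blocks travel down their column and L blocks travel along their row. A small sketch that generates the list for any stage, assuming row-major processor numbering; the function name stage_messages is hypothetical and no MPI calls are made.

```python
def stage_messages(k, q):
    """Return (sender, block name, receivers) triples for stage k on a q x q grid."""
    def rank(n, m):
        return n * q + m  # row-major numbering of the processor grid

    msgs = []
    right = [rank(k, m) for m in range(k + 1, q)]   # rest of block row k
    below = [rank(n, k) for n in range(k + 1, q)]   # rest of block column k
    if right:
        msgs.append((rank(k, k), f"L{k}{k}", right))
    if below:
        msgs.append((rank(k, k), f"U{k}{k}", below))
    for m in range(k + 1, q):
        # Ukm is needed by every block below it in column m.
        msgs.append((rank(k, m), f"U{k}{m}", [rank(n, m) for n in range(k + 1, q)]))
    for n in range(k + 1, q):
        # Lnk is needed by every block to its right in row n.
        msgs.append((rank(n, k), f"L{n}{k}", [rank(n, m) for m in range(k + 1, q)]))
    return msgs

for stage in range(3):
    for sender, name, receivers in stage_messages(stage, 3):
        print(f"stage {stage + 1}: P{sender} sends {name} to",
              ",".join(f"P{r}" for r in receivers))
```

The last stage produces no messages, since only the final diagonal block remains to be factorized.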


16/22

Block Back Solve

After factorization we are left with the task of using the distributed L and U to compute the backsolve.

Block distribution of L and U:

P0: L00,U00   P1: U01       P2: U02

P3: L10       P4: L11,U11   P5: U12

P6: L20       P7: L21       P8: L22,U22


17/22

Recall

Given an LU factorization of A, namely L,U such that A=LU,

we can solve Ax=b by

y=L\b

x=U\y
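
For a single (undistributed) matrix the two triangular solves are forward and back substitution. A minimal sketch in plain Python with a small hand-made L and U; the function names are hypothetical.

```python
def forward_sub(L, b):
    """y = L\\b for lower triangular L."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    return y

def back_sub(U, y):
    """x = U\\y for upper triangular U."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

L = [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [4.0, 3.0, 1.0]]
U = [[2.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
b = [7.0, 19.0, 49.0]   # equals (L*U) applied to x = [1, 2, 3]

y = forward_sub(L, b)
x = back_sub(U, y)
print(y, x)
```

Once L and U are known, each new b vector costs only these two sweeps, which is why the factorization pays off for a sequence of right-hand sides.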


18/22

Distributed Back Solve

[ L00  0   0  ]   [ y0 ]   [ b0 ]
[ L10 L11  0  ] * [ y1 ] = [ b1 ]
[ L20 L21 L22 ]   [ y2 ]   [ b2 ]

P0: solve L00*y0 = b0; send y0 to P3,P6

P3: send L10*y0 to P4

P4: solve L11*y1 = b1-L10*y0; send y1 to P7

P6: send L20*y0 to P8

P7: send L21*y1 to P8

P8: solve L22*y2 = b2-L20*y0-L21*y1

Results: y0 on P0, y1 on P4, y2 on P8

P0 P1 P2

P3 P4 P5

P6 P7 P8
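
The same sequence of steps, written serially: off-diagonal owners contribute products Lnm*ym and diagonal owners do the triangular solves. This is a sketch in plain Python with made-up 2 by 2 blocks. The comments name the processor that would do each piece of work on the 3 by 3 grid, but nothing is actually communicated.

```python
def forward_sub(L, b):
    """y = L\\b for a lower triangular block L."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    return y

def matvec(M, v):
    """Product of a block M with a vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def block_forward_solve(L, b):
    """Solve the block lower triangular system L*y = b, one block row at a time."""
    y = []
    for n in range(len(b)):
        rhs = b[n][:]
        for m in range(n):
            # Owner of Lnm (P3, P6, P7 on the 3x3 grid) forms Lnm*ym and
            # sends it to the diagonal owner of block row n.
            contrib = matvec(L[n][m], y[m])
            rhs = [r - c for r, c in zip(rhs, contrib)]
        # Diagonal owner (P0, P4, P8) solves with Lnn and passes yn down its column.
        y.append(forward_sub(L[n][n], rhs))
    return y

L = [
    [[[1, 0], [2, 1]], None, None],
    [[[1, 1], [0, 1]], [[1, 0], [3, 1]], None],
    [[[2, 0], [1, 1]], [[0, 1], [1, 0]], [[1, 0], [1, 1]]],
]
b = [[1, 4], [6, 15], [11, 17]]   # built so that y = [[1, 2], [3, 4], [5, 6]]

y = block_forward_solve(L, b)
print(y)
```

Block row n cannot start its solve until every earlier yn is available, so the diagonal processors work one after another rather than at the same time.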


19/22

Matlab Code (Back Solve)


20/22

Back Solve

After the factorization we compute a solution to Ax=b.

This consists of solving two distributed block triangular systems.
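
The second system, U*x = y, is the mirror image of the lower triangular one: it starts at the last block row and works upward. A serial sketch with made-up 2 by 2 blocks, assuming the block distribution of U shown earlier (U01 on P1, U02 on P2, U12 on P5); the function names are hypothetical.

```python
def back_sub(U, y):
    """x = U\\y for an upper triangular block U."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

def matvec(M, v):
    """Product of a block M with a vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def block_back_solve(U, y):
    """Solve the block upper triangular system U*x = y, last block row first."""
    P = len(y)
    x = [None] * P
    for n in reversed(range(P)):
        rhs = y[n][:]
        for m in range(n + 1, P):
            # Owner of Unm (P1, P2, P5 on the 3x3 grid) contributes Unm*xm.
            contrib = matvec(U[n][m], x[m])
            rhs = [r - c for r, c in zip(rhs, contrib)]
        # Diagonal owner (P8, then P4, then P0) solves with Unn.
        x[n] = back_sub(U[n][n], rhs)
    return x

U = [
    [[[2, 1], [0, 1]], [[1, 0], [0, 1]], [[0, 1], [1, 0]]],
    [None, [[1, 1], [0, 2]], [[1, 0], [2, 1]]],
    [None, None, [[1, 2], [0, 1]]],
]
y = [[13, 11], [12, 24], [17, 6]]   # built so that x = [[1, 2], [3, 4], [5, 6]]

x = block_back_solve(U, y)
print(x)
```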


21/22

Barrier Between Back Solves

This time I inserted an MPI_Barrier call between the backsolves.

This highlights the serial nature of the backsolves.


22/22

Example Code

http://www.math.unm.edu/~timwar/MA471F03/blocklu.m

http://www.math.unm.edu/~timwar/MA471F03/parlufact2.c