CSC 7600 Lecture 16: Applied Parallel Algorithms 2, Spring 2011. HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS. APPLIED PARALLEL ALGORITHMS 2. Prof. Thomas Sterling, Dr. Hartmut Kaiser, Department of Computer Science, Louisiana State University, March 18, 2011.


Page 1: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

APPLIED PARALLEL ALGORITHMS 2

Prof. Thomas Sterling, Dr. Hartmut Kaiser, Department of Computer Science, Louisiana State University, March 18, 2011

Page 2: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Puzzle of the Day

• Some nice ways to get something different from what was intended:

2

if (a = 0) { … }      /* a always equals 0, but the block will never be executed */

if (0 < a < 5) { … }  /* this "boolean" is always true! [think: (0 < a) < 5] */

if (a =! 0) { … }     /* a always ends up equal to 1, as this is compiled as (a = !0), an assignment, rather than (a != 0) or (a == !0) */

Page 3: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

3

Page 4: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

4

Page 5: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

5

Parallel Matrix Processing & Locality

• Maximize locality
  – Spatial locality
    • A variable is likely to be used if neighboring data has been used
    • Exploits unit or uniform stride access patterns
    • Exploits cache line length
    • Adjacent blocks minimize message traffic
      – Depends on the volume-to-surface ratio
  – Temporal locality
    • A variable is likely to be reused if it has already been used recently
    • Exploits cache loads and the LRU (least recently used) replacement policy
    • Exploits register allocation
  – Granularity
    • Maximizes length of local computation
    • Reduces number of messages
    • Maximizes length of individual messages
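As a concrete illustration of these locality ideas in C (not from the slides; the array names and tile size below are arbitrary assumptions), a row-major traversal gives unit-stride access, and a blocked loop nest reuses cache-resident data:

/* Spatial locality: the inner loop walks memory with unit stride. */
#define N 1024
void row_sums(double a[N][N], double s[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s[i] += a[i][j];
}

/* Temporal locality / granularity: a blocked (tiled) transpose touches a
   small tile repeatedly while it is still in cache. B is assumed to be
   chosen so that two B x B tiles fit in cache. */
#define B 64
void transpose_blocked(double a[N][N], double t[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    t[j][i] = a[i][j];
}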

Page 6: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

6

Array Decomposition

• Simple MPI example
• Master-worker data partitioning and distribution
  – Array decomposition
  – Uniformly distributes parts of the array among workers (and master)
  – A kind of static load balancing
    • Assumes equal work on equal data set sizes
• Demonstrates
  – Data partitioning
  – Data distribution
  – Coarse-grain parallel execution (no communication between tasks)
  – Reduction operator
  – Master-worker control model

Page 7: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

7

Array Decomposition Layout

• Dimensions
  – 1 dimension: linear (dot product)
  – 2 dimensions: "2-D" (matrix operations)
  – 3 dimensions (higher-order models)
  – Impacts the surface-to-volume ratio for inter-process communication
• Distribution
  – Block
    • Minimizes messaging
    • Maximizes message size
  – Cyclic
    • Improves load balancing
• Memory layout
  – C vs. FORTRAN (row-major vs. column-major storage order)
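To make the block vs. cyclic distinction concrete, a small sketch (hypothetical helper names; assumes n elements, p ranks, and p dividing n) of how a global index maps to an owning rank under each scheme:

/* Block distribution: contiguous chunks of n/p elements per rank. */
int owner_block(int i, int n, int p) { return i / (n / p); }

/* Cyclic distribution: element i goes to rank i mod p, which spreads
   expensive indices across ranks and improves load balance. */
int owner_cyclic(int i, int p) { return i % p; }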

Page 8: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

8

Array Decomposition

[Figure: the complete array is partitioned into per-task chunks; a sum is accumulated from each part]

Page 9: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

9

Array Decomposition

Demonstrates simple data decomposition:
– The master initializes the array and then distributes an equal portion of the array to each of the other tasks.
– The other tasks receive their portion of the array and perform an addition operation on each array element.
– Each task maintains the sum for its portion of the array.
– The master task does likewise with its portion of the array.
– As each of the non-master tasks finishes, it sends its updated portion of the array to the master.
– An MPI collective communication call is used to collect the sums maintained by each task.
– Finally, the master task displays selected parts of the final array and the global sum of all array elements.
– Assumption: the array can be equally divided among the group.
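The distribution described above is done with explicit sends and receives in the code that follows; the same pattern could also be written with MPI collectives. A minimal sketch for comparison (an assumption, not the course code; mychunk is a hypothetical per-task buffer of chunksize floats):

MPI_Scatter(data, chunksize, MPI_FLOAT,        /* master's full array      */
            mychunk, chunksize, MPI_FLOAT,     /* each task's local chunk  */
            MASTER, MPI_COMM_WORLD);
/* ... each task updates mychunk and accumulates its local mysum ... */
MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);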

Page 10: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

10

Flowchart for Array Decomposition ("master" and "workers")

[Flowchart: all tasks (master and workers) initialize the MPI environment; the master initializes the array, partitions it into workloads, and sends a workload to each worker; the workers receive their work; every task (master included) calculates the sum for its array chunk; the workers send their sums back and the master receives the results; a reduction operator sums up the results; the master prints the results; end]

Page 11: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

11

Array Decomposition (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define ARRAYSIZE 16000000
#define MASTER 0

float data[ARRAYSIZE];

int main (int argc, char **argv)
{
  int numtasks, taskid, rc, dest, offset, i, j, tag1, tag2, source, chunksize;
  float mysum, sum;
  float update(int myoffset, int chunk, int myid);
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  if (numtasks % 4 != 0) {
    printf("Quitting. Number of MPI tasks must be divisible by 4.\n"); /* for equal distribution of workload */
    MPI_Abort(MPI_COMM_WORLD, rc);
    exit(0);
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
  printf("MPI task %d has started...\n", taskid);

  chunksize = (ARRAYSIZE / numtasks);
  tag2 = 1;
  tag1 = 2;

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Workload to be processed by each processor

Page 12: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

12

Array Decomposition (source code)

  if (taskid == MASTER) {
    sum = 0;
    for (i=0; i<ARRAYSIZE; i++) {
      data[i] = i * 1.0;
      sum = sum + data[i];
    }
    printf("Initialized array sum = %e\n", sum);

    offset = chunksize;
    for (dest=1; dest<numtasks; dest++) {
      MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
      MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
      printf("Sent %d elements to task %d offset= %d\n", chunksize, dest, offset);
      offset = offset + chunksize;
    }
    offset = 0;

    mysum = update(offset, chunksize, taskid);

    for (i=1; i<numtasks; i++) {
      source = i;
      MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
      MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
    }

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Initialize array

Array[0] -> Array[offset-1] is processed by master

Send workloads to respective processors

Master computes local sum

Master receives summation computed by workers

Page 13: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

13

Array Decomposition (source code)

    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);

    printf("Sample results: \n");
    offset = 0;
    for (i=0; i<numtasks; i++) {
      for (j=0; j<5; j++)
        printf(" %e", data[offset+j]);
      printf("\n");
      offset = offset + chunksize;
    }
    printf("*** Final sum= %e ***\n", sum);
  } /* end of master section */

  if (taskid > MASTER) {
    /* Receive my portion of array from the master task */
    source = MASTER;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);

    mysum = update(offset, chunksize, taskid);

    /* Send my results back to the master task */
    dest = MASTER;
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);

    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
  } /* end of non-master */

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Master computes the SUM of all workloads

Worker processes receive work chunks from master

Each worker computes local sum

Send local sum to master process

Page 14: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

14

Array Decomposition (source code)

  MPI_Finalize();

} /* end of main */

float update(int myoffset, int chunk, int myid) {
  int i;
  float mysum;
  /* Perform addition to each of my array elements and keep my sum */
  mysum = 0;
  for (i=myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
  }
  printf("Task %d mysum = %e\n", myid, mysum);
  return(mysum);
}

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Page 15: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

15

Demo : Array Decomposition

[lsu00@master array_decomposition]$ mpiexec -np 4 ./array
MPI task 0 has started...
MPI task 2 has started...
MPI task 1 has started...
MPI task 3 has started...
Initialized array sum = 1.335708e+14
Sent 4000000 elements to task 1 offset= 4000000
Sent 4000000 elements to task 2 offset= 8000000
Task 1 mysum = 4.884048e+13
Sent 4000000 elements to task 3 offset= 12000000
Task 2 mysum = 7.983003e+13
Task 0 mysum = 1.598859e+13
Task 3 mysum = 1.161867e+14
Sample results:
 0.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00
 8.000000e+06 8.000002e+06 8.000004e+06 8.000006e+06 8.000008e+06
 1.600000e+07 1.600000e+07 1.600000e+07 1.600001e+07 1.600001e+07
 2.400000e+07 2.400000e+07 2.400000e+07 2.400001e+07 2.400001e+07
*** Final sum= 2.608458e+14 ***

Output from arete for a 4 processor run.

Page 16: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

16

Page 17: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose

• The transpose of the (m x n) matrix A is the (n x m) matrix A^T formed by interchanging the rows and columns, such that row i of A becomes column i of the transposed matrix: (A^T)_{ij} = a_{ji}

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}
\qquad
A^T = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{pmatrix}

Examples:

A = \begin{pmatrix} 1 & 3 & 4 \\ 0 & 1 & 0 \end{pmatrix}, \quad
A^T = \begin{pmatrix} 1 & 0 \\ 3 & 1 \\ 4 & 0 \end{pmatrix}
\qquad\qquad
A = \begin{pmatrix} 1 & 3 \\ 2 & 5 \end{pmatrix}, \quad
A^T = \begin{pmatrix} 1 & 2 \\ 3 & 5 \end{pmatrix}

17

Page 18: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - OpenMP

18

#include <stdio.h>
#include <sys/time.h>
#include <omp.h>
#define SIZE 4

main()
{
  int i, j;
  float Matrix[SIZE][SIZE], Trans[SIZE][SIZE];

  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Matrix[i][j] = (i * j) * 5 + i;
  }
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      Trans[i][j] = 0.0;
  }

Initialize source matrix

Initialize results matrix

Page 19: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - OpenMP

19

#pragma omp parallel for private(j)
  for (i = 0; i < SIZE; i++)
    for (j = 0; j < SIZE; j++)
      Trans[j][i] = Matrix[i][j];

  printf("The Input Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Matrix[i][j]);
    printf("\n");
  }
  printf("\nThe Transpose Matrix Is \n");
  for (i = 0; i < SIZE; i++) {
    for (j = 0; j < SIZE; j++)
      printf("%f \t", Trans[i][j]);
    printf("\n");
  }
  return 0;
}

Perform transpose in parallel using omp parallel for

Page 20: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose – OpenMP (DEMO)

20

[LSU760000@n01 matrix_transpose]$ ./omp_mtrans

The Input Matrix Is
0.000000  0.000000   0.000000   0.000000
1.000000  6.000000   11.000000  16.000000
2.000000  12.000000  22.000000  32.000000
3.000000  18.000000  33.000000  48.000000

The Transpose Matrix Is
0.000000  1.000000   2.000000   3.000000
0.000000  6.000000   12.000000  18.000000
0.000000  11.000000  22.000000  33.000000
0.000000  16.000000  32.000000  48.000000

Page 21: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - MPI

21

#include <stdio.h>
#include "mpi.h"
#define N 4

int A[N][N];

void fill_matrix()
{
  int i, j;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      A[i][j] = i * N + j;
}

void print_matrix()
{
  int i, j;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++)
      printf("%d ", A[i][j]);
    printf("\n");
  }
}

Initialize source matrix

Page 22: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - MPI

22

main(int argc, char* argv[])
{
  int r, i;
  MPI_Status st;
  MPI_Datatype typ;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &r);

  if (r == 0) {
    fill_matrix();
    printf("\n Source:\n");
    print_matrix();
    MPI_Type_contiguous(N * N, MPI_INT, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Send(&(A[0][0]), 1, typ, 1, 0, MPI_COMM_WORLD);
  }

Creating a custom MPI datatype to store the local workloads

Page 23: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose - MPI

23

  else if (r == 1) {
    MPI_Type_vector(N, 1, N, MPI_INT, &typ);
    MPI_Type_hvector(N, 1, sizeof(int), typ, &typ);
    MPI_Type_commit(&typ);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Recv(&(A[0][0]), 1, typ, 0, 0, MPI_COMM_WORLD, &st);
    printf("\n Transposed:\n");
    print_matrix();
  }

  MPI_Finalize();
}

Creates a vector datatype of N blocks of blocklength 1 with a stride of N, i.e. one column of the matrix

The MPI_Type_hvector datatype allows for an on-the-fly transpose of the matrix
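MPI_Type_hvector is deprecated in later MPI versions; an equivalent way to receive a transposed matrix (a sketch under the same N x N int layout, not the course code) is to resize the column type so that consecutive elements start one int apart:

MPI_Datatype col, coltype;
MPI_Type_vector(N, 1, N, MPI_INT, &col);                 /* one column: N ints, stride N */
MPI_Type_create_resized(col, 0, sizeof(int), &coltype);  /* extent = 1 int               */
MPI_Type_commit(&coltype);
/* receiving N of these writes incoming row k into column k, i.e. a transpose */
MPI_Recv(&(A[0][0]), N, coltype, 0, 0, MPI_COMM_WORLD, &st);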

Page 24: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Matrix Transpose – MPI (DEMO)

24

[LSU760000@n01 matrix_transpose]$ mpiexec -np 2 ./mpi_mtrans

Source:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

Transposed:
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15

Page 25: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

25

Page 26: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Linear Systems

a11 x1 + a12 x2 + a13 x3 = b1
a21 x1 + a22 x2 + a23 x3 = b2
a31 x1 + a32 x2 + a33 x3 = b3

or, in matrix form,

\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}

Solve Ax = b, where A is an n x n matrix and b is an n x 1 column vector

www.cs.princeton.edu/courses/archive/fall07/cos323/

26

Page 27: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Fundamental operations:
  1. Replace one equation with a linear combination of other equations
  2. Interchange two equations
  3. Re-label two variables
• Combine these to reduce the system to a trivial one
• The simplest variant only uses operation #1, but better stability is obtained by also adding
  – #2, or
  – #2 and #3

www.cs.princeton.edu/courses/archive/fall07/cos323/

27

Page 28: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Solve:
  2x1 + 3x2 = 7
  4x1 + 5x2 = 13

• Can be represented as the augmented matrix

  \left(\begin{array}{cc|c} 2 & 3 & 7 \\ 4 & 5 & 13 \end{array}\right)

• Goal: reduce the LHS to an identity matrix, leaving the solutions in the RHS:

  \left(\begin{array}{cc|c} 1 & 0 & ? \\ 0 & 1 & ? \end{array}\right)

www.cs.princeton.edu/courses/archive/fall07/cos323/

28

Page 29: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Basic operation 1: replace any row by a linear combination with any other row:

  Replace row 1 with (1/2) * row 1 + 0 * row 2  (Row1 = (Row1)/2):

  \left(\begin{array}{cc|c} 2 & 3 & 7 \\ 4 & 5 & 13 \end{array}\right)
  \;\rightarrow\;
  \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 4 & 5 & 13 \end{array}\right)

• Replace row 2 with row 2 - 4 * row 1  (Row2 = Row2 - (4*Row1)):

  \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & -1 & -1 \end{array}\right)

• Negate row 2  (Row2 = (-1)*Row2):

  \left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & 1 & 1 \end{array}\right)

www.cs.princeton.edu/courses/archive/fall07/cos323/

29

Page 30: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Gauss-Jordan Elimination

• Replace row1 with row1 – 3/2 * row2

• Solution:

x1 = 2, x2 = 1

\left(\begin{array}{cc|c} 1 & 3/2 & 7/2 \\ 0 & 1 & 1 \end{array}\right)
\;\rightarrow\;
\left(\begin{array}{cc|c} 1 & 0 & 2 \\ 0 & 1 & 1 \end{array}\right)

www.cs.princeton.edu/courses/archive/fall07/cos323/

30

Row1 = Row1 – (3/2)* Row2
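The row operations used above translate directly into a short serial routine. A sketch (assumptions: C99, an augmented n x (n+1) matrix with the right-hand side in column n, nonzero pivots, no pivoting; illustrative, not the course code):

/* Gauss-Jordan: reduce the left n x n block to the identity;
   column n then holds the solution. */
void gauss_jordan(int n, double a[n][n + 1]) {
    for (int i = 0; i < n; i++) {
        double piv = a[i][i];                                      /* assumed nonzero  */
        for (int k = i; k <= n; k++) a[i][k] /= piv;               /* scale pivot row  */
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double f = a[j][i];
            for (int k = i; k <= n; k++) a[j][k] -= f * a[i][k];   /* eliminate        */
        }
    }
}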

Page 31: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Pivoting

• Consider this system:

  \begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix}
  \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
  \begin{pmatrix} 2 \\ 8 \end{pmatrix}

• We immediately run into a problem: the algorithm wants us to divide by zero!

• More subtle version: the same system with the zero pivot replaced by a very small (but nonzero) value

• The pivot, or pivot element, is the element of a matrix which is selected first by an algorithm to do computation

• The pivot entry is usually required to be at least distinct from zero, and often distant from it

• Select the largest element in the matrix and swap columns and rows to bring this element to the 'right' position: full (complete) pivoting

www.cs.princeton.edu/courses/archive/fall07/cos323/

31

Page 32: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Pivoting

• Consider this system:

  \left(\begin{array}{cc|c} 0 & 1 & 1 \\ 3 & 2 & 8 \end{array}\right)

• Pivoting:
  – Swap rows 1 and 2:

  \left(\begin{array}{cc|c} 3 & 2 & 8 \\ 0 & 1 & 1 \end{array}\right)

  – And continue to solve as shown before:

  \left(\begin{array}{cc|c} 1 & 0 & 2 \\ 0 & 1 & 1 \end{array}\right)
  \qquad x1 = 2, x2 = 1

www.cs.princeton.edu/courses/archive/fall07/cos323/

32

Page 33: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Pivoting: Example

• Division by small numbers leads to round-off error in computer arithmetic
• Consider the following system:
  0.0001 x1 + x2 = 1.000
  x1 + x2 = 2.000

  \begin{pmatrix} 0.0001 & 1 \\ 1 & 1 \end{pmatrix}
  \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
  \begin{pmatrix} 1 \\ 2 \end{pmatrix}

• Exact solution: x1 = 1.0001 and x2 = 0.9999
• Say we round off after 3 digits after the decimal point
• Multiply the first equation by 10^4 and subtract it from the second equation:
  (1 - 1) x1 + (1 - 10^4) x2 = 2 - 10^4
• But, in finite precision with only 3 digits:
  – 1 - 10^4 = -0.9999E+4 ~ -0.999E+4
  – 2 - 10^4 = -0.9998E+4 ~ -0.999E+4
• Therefore, x2 = 1 and x1 = 0 (from the first equation)
• Very far from the real solution!

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

33

Page 34: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Partial Pivoting

• Partial pivoting doesn't look for the largest element in the whole matrix, but just for the largest element in the 'current' column

• Swap rows to bring the corresponding row to the 'right' position

• Partial pivoting is generally sufficient to adequately reduce round-off error

• Complete pivoting is usually not necessary to ensure numerical stability

• Due to the additional computations it introduces, complete pivoting may not always be the most appropriate pivoting strategy

34

http://www.amath.washington.edu/~bloss/amath352_lectures/

Page 35: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Partial Pivoting

• One can just swap rows:
  x1 + x2 = 2.000
  0.0001 x1 + x2 = 1.000
• Multiplying the first equation by 0.0001 and subtracting it from the second equation gives:
  (1 - 0.0001) x2 = 1 - 0.0002
  0.9999 x2 = 0.9998  =>  x2 ≈ 1
  and then x1 ≈ 1
• The final solution is much closer to the real solution.

• Partial pivoting
  – For numerical stability, one doesn't go in order, but picks as the next row the one among rows i to n that has the largest element in column i
  – This row is swapped with row i (along with the corresponding elements of the right-hand side) before the subtractions
    • The swap is not done in memory but rather one keeps an indirection array
• Total pivoting
  – Look for the greatest element ANYWHERE in the matrix
  – Swap columns
  – Swap rows
• Numerical stability is really a difficult field

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

35
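The pivot-selection rule described above amounts to one scan down the current column. A sketch (hypothetical helper, assuming C99 and fabs from math.h; not the course code):

#include <math.h>
/* Partial pivoting: return the row r >= i with the largest |a[r][i]|.
   The caller then swaps rows i and r (and the right-hand side entries),
   or records the swap in an indirection array, before eliminating column i. */
int pivot_row(int n, int i, double a[n][n]) {
    int r = i;
    for (int j = i + 1; j < n; j++)
        if (fabs(a[j][i]) > fabs(a[r][i])) r = j;
    return r;
}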

Page 36: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Partial Pivoting

36

http://www.amath.washington.edu/~bloss/amath352_lectures/

Page 37: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Special Cases

• Common special cases:
  – Tri-diagonal systems
    • Only the main diagonal plus one diagonal above and one below
    • Solve using: Gauss-Jordan
  – Lower triangular systems (L)
    • Solve using: forward substitution
  – Upper triangular systems (U)
    • Solve using: backward substitution

Tri-diagonal system:

\begin{pmatrix} a_{11} & a_{12} & 0 & 0 \\ a_{21} & a_{22} & a_{23} & 0 \\ 0 & a_{32} & a_{33} & a_{34} \\ 0 & 0 & a_{43} & a_{44} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} =
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}

Lower triangular system, solved by forward substitution:

\begin{pmatrix} a_{11} & 0 & 0 & 0 \\ a_{21} & a_{22} & 0 & 0 \\ a_{31} & a_{32} & a_{33} & 0 \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} =
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}

x_1 = b_1 / a_{11}, \quad x_2 = (b_2 - a_{21} x_1) / a_{22}, \quad x_3 = (b_3 - a_{31} x_1 - a_{32} x_2) / a_{33}, \;\ldots

Upper triangular system, solved by backward substitution:

\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\ 0 & a_{22} & a_{23} & a_{24} & a_{25} \\ 0 & 0 & a_{33} & a_{34} & a_{35} \\ 0 & 0 & 0 & a_{44} & a_{45} \\ 0 & 0 & 0 & 0 & a_{55} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix} =
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{pmatrix}

x_5 = b_5 / a_{55}, \quad x_4 = (b_4 - a_{45} x_5) / a_{44}, \;\ldots

www.cs.princeton.edu/courses/archive/fall07/cos323/

37
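The substitution formulas above correspond to two short loops. A sketch (assumptions: C99, nonzero diagonal entries; not the course code):

/* Forward substitution for a lower triangular system L x = b. */
void forward_subst(int n, double L[n][n], double b[n], double x[n]) {
    for (int i = 0; i < n; i++) {
        double s = b[i];
        for (int j = 0; j < i; j++) s -= L[i][j] * x[j];
        x[i] = s / L[i][i];
    }
}

/* Backward substitution for an upper triangular system U x = b. */
void backward_subst(int n, double U[n][n], double b[n], double x[n]) {
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++) s -= U[i][j] * x[j];
        x[i] = s / U[i][i];
    }
}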

Page 38: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

38

Page 39: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Solving Linear Systems of Eq.

• Methods for solving linear systems
  – The need to solve linear systems arises in an estimated 75% of all scientific computing problems [Dahlquist 1974]
• Gaussian Elimination is perhaps the most well-known method
  – Based on the fact that the solution of a linear system is invariant under scaling and under row additions
    • One can multiply a row of the matrix by a constant as long as one multiplies the corresponding element of the right-hand side by the same constant
    • One can add a row of the matrix to another one as long as one adds the corresponding elements of the right-hand side
  – Idea: scale and add equations so as to transform matrix A into an upper triangular matrix:

[Figure: A x = b with A transformed into upper triangular form; equation n-i has i unknowns]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

39

Page 40: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Gaussian Elimination

\begin{pmatrix} 1 & 1 & 1 \\ 1 & -2 & 2 \\ 1 & 2 & -1 \end{pmatrix} x =
\begin{pmatrix} 0 \\ 4 \\ 2 \end{pmatrix}

Subtract row 1 from rows 2 and 3:

\begin{pmatrix} 1 & 1 & 1 \\ 0 & -3 & 1 \\ 0 & 1 & -2 \end{pmatrix} x =
\begin{pmatrix} 0 \\ 4 \\ 2 \end{pmatrix}

Multiply row 3 by 3 and add row 2:

\begin{pmatrix} 1 & 1 & 1 \\ 0 & -3 & 1 \\ 0 & 0 & -5 \end{pmatrix} x =
\begin{pmatrix} 0 \\ 4 \\ 10 \end{pmatrix}

Solving the equations in reverse order (backsolving):

-5x3 = 10           =>  x3 = -2
-3x2 + x3 = 4       =>  x2 = -2
x1 + x2 + x3 = 0    =>  x1 = 4

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

40

Page 41: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Gaussian Elimination

• The algorithm goes through the matrix from the top-left corner to the bottom-right corner
• The ith step eliminates the non-zero sub-diagonal elements in column i, subtracting the ith row scaled by aji/aii from row j, for j = i+1, ..., n

[Figure: at step i, the rows above the pivot row i hold values already computed; the entries below the diagonal in column i are to be zeroed; the trailing block holds values yet to be updated]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

41

Page 42: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Sequential Gaussian Elimination

Simple sequential algorithm

// for each column i
// zero it out below the diagonal by adding
// multiples of row i to later rows
for i = 1 to n-1
  // for each row j below row i
  for j = i+1 to n
    // add a multiple of row i to row j
    for k = i to n
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

• Several "tricks" that do not change the spirit of the algorithm but make the implementation easier and/or more efficient (see the sketch below):
  – The right-hand side is typically kept in column n+1 of the matrix, and one speaks of an augmented matrix
  – Compute the A(j,i)/A(i,i) term outside of the innermost loop

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

42
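A C sketch of the same procedure with both tricks applied (assumptions: C99, 0-indexed, right-hand side stored in column n of an augmented n x (n+1) matrix, no pivoting; illustrative, not the course code):

void gaussian_eliminate(int n, double a[n][n + 1]) {
    for (int i = 0; i < n - 1; i++) {            /* for each column i        */
        for (int j = i + 1; j < n; j++) {        /* for each row j below i   */
            double m = a[j][i] / a[i][i];        /* multiplier, hoisted      */
            for (int k = i; k <= n; k++)         /* includes the RHS column  */
                a[j][k] -= m * a[i][k];
        }
    }
}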

Page 43: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Parallel Gaussian Elimination?

• Assume that we have one processor per matrix element
  – Reduction: to find the max aji
  – Broadcast: the max aji is needed to compute the scaling factor
  – Compute: independent computation of the scaling factor
  – Broadcasts: every update needs the scaling factor and the element from the pivot row
  – Compute: independent computations

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

43

Page 44: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

LU Factorization

• Gaussian Elimination is simple, but:
  – What if we have to solve many Ax = b systems for different values of b?
    • This happens a LOT in real applications
• Another method is the "LU Factorization" (LU Decomposition)
  • Ax = b
  • Say we could rewrite A = L U, where L is a lower triangular matrix and U is an upper triangular matrix: O(n^3)
  • Then Ax = b is written L U x = b
  • Solve L y = b: O(n^2)
  • Solve U x = y: O(n^2)

[Figure: the two triangular systems; in L y = b equation i has i unknowns, in U x = y equation n-i has i unknowns; triangular system solves are easy]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

44

Page 45: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

LU Factorization: Principle

• It works just like Gaussian Elimination, but instead of zeroing out elements, one "saves" the scaling coefficients.
• Magically, A = L x U !
• Should be done with pivoting as well

A = \begin{pmatrix} 1 & 2 & -1 \\ 4 & 3 & 1 \\ 2 & 2 & 3 \end{pmatrix}

Gaussian elimination, saving the scaling factors (shown in bold) in the zeroed positions:

\begin{pmatrix} 1 & 2 & -1 \\ \mathbf{4} & -5 & 5 \\ 2 & 2 & 3 \end{pmatrix}
\;\rightarrow\;
\begin{pmatrix} 1 & 2 & -1 \\ \mathbf{4} & -5 & 5 \\ \mathbf{2} & -2 & 5 \end{pmatrix}
\;\rightarrow\;
\begin{pmatrix} 1 & 2 & -1 \\ \mathbf{4} & -5 & 5 \\ \mathbf{2} & \mathbf{2/5} & 3 \end{pmatrix}

L = \begin{pmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 2 & 2/5 & 1 \end{pmatrix}
\qquad
U = \begin{pmatrix} 1 & 2 & -1 \\ 0 & -5 & 5 \\ 0 & 0 & 3 \end{pmatrix}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

45

Page 46: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

LU Factorization

[Figure: column k of the matrix; the part of column k below the diagonal stores the scaling factors]

LU-sequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      aik ← -aik / akk
    for j = k+1 to n-1
      // Task Tkj: update of column j
      for i = k+1 to n-1
        aij ← aij + aik * akj
  }
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

• We're going to look at the simplest possible version
  – No pivoting: it just creates a bunch of indirections that are easy but make the code look complicated without changing the overall principle

46
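For comparison, an in-place C sketch of the factorization (assumptions: C99, no pivoting, nonzero pivots; it stores the positive multipliers rather than the negated ones used in the pseudocode above, so the strict lower triangle ends up holding L's sub-diagonal entries and the upper triangle holds U, as in the worked 3 x 3 example):

void lu_inplace(int n, double a[n][n]) {
    for (int k = 0; k < n - 1; k++) {
        for (int i = k + 1; i < n; i++)
            a[i][k] /= a[k][k];                   /* save scaling factor l_ik  */
        for (int j = k + 1; j < n; j++)
            for (int i = k + 1; i < n; i++)
                a[i][j] -= a[i][k] * a[k][j];     /* update trailing submatrix */
    }
}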

Page 47: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

LU Factorization

• We're going to look at the simplest possible version
  – No pivoting: it just creates a bunch of indirections that are easy but make the code look complicated without changing the overall principle

LU-sequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      aik ← -aik / akk
    for j = k+1 to n-1
      // Task Tkj: update of column j
      for i = k+1 to n-1
        aij ← aij + aik * akj
  }
}

[Figure: element aij in the trailing submatrix is updated using aik (from column k) and akj (from row k)]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

47

Page 48: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Parallel LU on a ring

• Since the algorithm operates by columns from left to right, we should distribute columns to processors
• Principle of the algorithm
  – At each step, the processor that owns column k does the "prepare" task and then broadcasts the bottom part of column k to all others
    • Annoying if the matrix is stored in row-major fashion
    • Remember that one is free to store the matrix in any way one wants, as long as it is coherent and the right output is generated
  – After the broadcast, the other processors can then update their data
• Assume there is a function alloc(k) that returns the rank of the processor that owns column k
  – Basically so that we don't clutter our program with too many global-to-local index translations
• In fact, we will first write everything in terms of global indices, so as to avoid all annoying index arithmetic

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

48

Page 49: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

LU-broadcast algorithm

LU-broadcast(A,n) {
  q ← MY_NUM()
  p ← NUM_PROCS()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← aik ← -aik / akk
    broadcast(alloc(k), buffer, n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          aij ← aij + buffer[i-k-1] * akj
  }
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

49

Page 50: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Dealing with local indices

• Assume that p divides n
• Each processor needs to store r = n/p columns, and its local indices go from 0 to r-1
• After step k, only columns with indices greater than k will be used
• Simple idea: use a local index, l, that everyone initializes to 0
• At step k, processor alloc(k) increases its local index so that next time it will point to its next local column

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

50
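Under the cyclic distribution used in the following slides, both mappings are one-liners. A sketch (hypothetical helpers, assuming p divides n and 0-indexed columns):

int alloc(int k, int p)       { return k % p; }   /* rank that owns global column k          */
int local_index(int k, int p) { return k / p; }   /* where column k sits on its owning rank  */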

Page 51: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

LU-broadcast algorithm

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (alloc(k) == q)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
    l ← l+1
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

51

Page 52: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Bad load balancing

P1 P2 P3 P4

[Figure: block distribution of columns over P1..P4; the leftmost processors' columns are already done while a single processor is working on its block, so the remaining processors sit idle]

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

52

Page 53: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Good Load Balancing?

[Figure: cyclic distribution of columns; the already-done columns are interleaved across all processors, so every processor is still working on some of its columns]

Cyclic distribution

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

53

Page 54: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Load-balanced program

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
    l ← l+1
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

54

Page 55: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Performance Analysis

• How long does this code take to run?
  – This is not an easy question because there are many tasks and many communications
• A little bit of analysis shows that the execution time is the sum of three terms:
  – n-1 communications: n L + (n^2/2) b + O(1)
  – n-1 column preparations: (n^2/2) w' + O(1)
  – column updates: (n^3/3p) w + O(n^2)
• Therefore, the execution time is O(n^3/p)
  – Note that the sequential time is O(n^3)
• Therefore, we have perfect asymptotic efficiency!
  – This is good, but isn't always the best in practice
• How can we improve this algorithm?

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

55

Page 56: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Pipelining on the Ring

• So far, in the algorithm we’ve used a simple broadcast• Nothing was specific to being on a ring of processors

and it’s portable – in fact you could just write raw MPI that just looks like our

pseudo-code and have a very limited, inefficient for small n, LU factorization that works only for some number of processors

• But it’s not efficient– The n-1 communication steps are not overlapped with

computations– Therefore Amdahl’s law, etc.

• Turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation– It almost looks like inserting the source code from the broadcast

code we saw at the very beginning throughout the LU code

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

56

Page 57: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Previous program

...
double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
    l ← l+1
  broadcast(alloc(k), buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

57

Page 58: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

LU-pipeline algorithm

double a[n-1][r-1];

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k == q mod p)
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,k] ← -a[i,l] / a[k,l]
    l ← l+1
    send(buffer, n-k-1)
  else
    recv(buffer, n-k-1)
    if (q ≠ k-1 mod p)
      send(buffer, n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08

58

Page 59: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Topics

• Array Decomposition
• Matrix Transpose
• Gauss-Jordan Elimination
• LU Decomposition
• Summary Materials for Test

59

Page 60: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

Summary : Material for the Test

• Matrix Transpose: Slides 17-23
• Gauss-Jordan: Slides 26-30
• Pivoting: Slides 31-37
• Special Cases (forward & backward substitution): Slide 35
• LU Decomposition: Slides 44-58

60

Page 61: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS APPLIED PARALLEL ALGORITHMS 2

CSC 7600 Lecture 16: Applied Parallel Algorithms 2 Spring 2011

61