Slide 1

High Performance Computing: Concepts, Methods & Means

Parallel Algorithms 1

Prof. Thomas Sterling
Department of Computer Science

Louisiana State University

March 1st, 2007

Slide 2

Topics

• Introduction

• Array Decomposition

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Matrix Multiplication

• N-Body Problem

• Summary – Materials for Test

Slide 3

Topics

• Introduction
• Array Decomposition

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Matrix Multiplication

• N-Body Problem

• Summary – Materials for Test

Slide 4

“embarrassingly parallel”

• Common phrase
  – poorly defined
  – widely used
• Suggests lots and lots of parallelism
  – with essentially no inter-task communication or coordination
  – Highly partitionable workload with minimal overhead
• “almost embarrassingly parallel”
  – Same as above, but
  – Requires a master to launch many tasks
  – Requires a master to collect final results of tasks
  – Sometimes still referred to as “embarrassingly parallel”

Slide 5

Basic Parallel (MPI) Program Steps

• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and exchange intermediate data results
  – Optional for non-embarrassingly parallel (cooperative)
• Detect “stop” condition
  – Maybe implicit with a barrier etc.
• Aggregate final results
  – Often a reduction operator
• Output results and error code
• Terminate and return to OS
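A minimal sketch of these steps in MPI C (illustrative only, not from the original slides; the "core computation" is a placeholder):

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank, size;
        double local = 0.0, global = 0.0;

        MPI_Init(&argc, &argv);                    /* initialize execution environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* establish logical bindings */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* distribute data and work (placeholder), then compute in parallel */
        local = (double) rank;                     /* stand-in for the core computation */

        /* aggregate final results with a reduction operator */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("result = %f\n", global);   /* output results */

        MPI_Finalize();                            /* terminate and return to OS */
        return 0;
    }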

Slide 6

Parallel Programming

• Goals
  – Correctness
  – Reduction in execution time
  – Efficiency
  – Scalability
  – Increased problem size and richness of models
• Objectives
  – Expose parallelism
    • Algorithm design
  – Distribute work uniformly
    • Data decomposition and allocation
    • Dynamic load balancing
  – Minimize overhead of synchronization and communication
    • Coarse granularity
    • Big messages
  – Minimize redundant work
    • Still sometimes better than communication

Slide 7

Topics

• Introduction

• Array Decomposition

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Matrix Multiplication

• N-Body Problem

• Summary – Materials for Test

Slide 8

Array Decomposition

• Simple MPI Example
• Master-Worker Data Partitioning and Distribution
  – Array decomposition
  – Uniformly distributes parts of the array among workers (and master)
  – A kind of static load balancing
    • Assumes equal work on equal data set sizes
• Demonstrates
  – Data partitioning
  – Data distribution
  – Coarse grain parallel execution
    • No communication between tasks
  – Reduction operator
  – Master-worker control model

Slide 9

Array Decomposition (LLNL Web)

Demonstrates simple data decomposition:
  – The master initializes the array and then distributes an equal portion of the array among the other tasks.
  – The other tasks receive their portion of the array and perform an addition operation on each array element.
  – Each task maintains the sum for its portion of the array.
  – The master task does likewise with its portion of the array.
  – As each of the non-master tasks finishes, it sends its updated portion of the array to the master.
  – An MPI collective communication call is used to collect the sums maintained by each task.
  – Finally, the master task displays selected parts of the final array and the global sum of all array elements.
  – Assumption: the array can be equally divided among the group.

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_array.c

Slide 10

Array Decomposition

• Array Processing: A Parallel Solution
• Array elements are distributed so that each processor owns a portion of the array (a subarray).
• Independent calculation of array elements ensures there is no need for communication between tasks.
• The distribution scheme is chosen by other criteria, e.g. unit stride (stride of 1) through the subarrays. Unit stride maximizes cache/memory usage.
• Since it is desirable to have unit stride through the subarrays, the choice of a distribution scheme depends on the programming language. See the Block-Cyclic Distribution Diagram for the options.
• After the array is distributed, each task executes the portion of the loop corresponding to the data it owns. For example, with Fortran block distribution:

    do j = mystart, myend
      do i = 1, n
        a(i,j) = fcn(i,j)
      end do
    end do

• Notice that only the outer loop variables are different from the serial solution.
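For comparison, a small illustrative C version of the same block distribution (not from the original slide; fcn, the task count, and the task id are stand-ins chosen for the example):

    #include <stdio.h>
    #define N 8

    double fcn(int i, int j) { return i + 0.1 * j; }   /* stand-in for the real computation */

    int main(void)
    {
        double a[N][N];
        int numtasks = 4, taskid = 2;                  /* assumed values for illustration */
        int rows_per_task = N / numtasks;              /* assume N divides evenly */
        int mystart = taskid * rows_per_task;
        int myend   = mystart + rows_per_task;         /* exclusive upper bound */

        /* only the outer loop bounds differ from the serial version */
        for (int i = mystart; i < myend; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = fcn(i, j);

        printf("task %d filled rows %d..%d\n", taskid, mystart, myend - 1);
        return 0;
    }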

Slide 11

Array Decomposition Locality

• Maximize locality
  – Spatial locality
    • Variable likely to be used if neighbor data is used
    • Exploits unit or uniform stride access patterns
    • Exploits cache line length
    • Adjacent blocks minimize message traffic
  – Temporal locality
    • Variable likely to be reused if already recently used
    • Exploits cache loads and LRU (least recently used) replacement policy
    • Exploits register allocation
  – Granularity
    • Maximizes length of local computation
    • Reduces number of messages
    • Maximizes length of individual messages

Slide 12

Array Decomposition Layout

• Dimensions
  – 1 dimension: linear
  – 2 dimensions: “2-D”
  – 3 dimensions
  – Impacts surface-to-volume ratio for inter-process communications
• Distribution
  – Block
    • Minimizes messaging
    • Maximizes message size
  – Cyclic
    • Improves load balancing
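As a small illustration (not from the slides), the owning task of a global index i under the two 1-D distributions can be computed as follows; N, P and the example values are assumptions:

    #include <stdio.h>

    /* owner of global index i under the two 1-D distributions (assumes N % P == 0) */
    int block_owner(int i, int N, int P)  { return i / (N / P); }   /* contiguous chunks */
    int cyclic_owner(int i, int P)        { return i % P; }         /* round-robin */

    int main(void)
    {
        int N = 8, P = 4, i;
        for (i = 0; i < N; i++)
            printf("i=%d  block owner=%d  cyclic owner=%d\n",
                   i, block_owner(i, N, P), cyclic_owner(i, P));
        return 0;
    }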

Slide 13

Array Decomposition

(Figure: a complete array partitioned into chunks; the sum is accumulated from each part.)

Slide 14

Flowchart for Array Decomposition

“master”:
  Initialize MPI Environment
  Initialize Array
  Partition Array into workloads
  Send Workload to “workers”
  Calculate Sum for its own array chunk
  Recv. results (sums) from the “workers”
  Reduction Operator to Sum up results
  Print results
  End

“workers” (each):
  Initialize MPI Environment
  Recv. work
  Calculate Sum for array chunk
  Send Sum
  End

Slide 15

Array Decomposition (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#define ARRAYSIZE 16000000   /* matrix size 4000x4000 */
#define MASTER    0

float data[ARRAYSIZE];

int main(argc, argv)
int argc;
char *argv[];
{
  int numtasks, taskid, rc, dest, offset, i, j, tag1, tag2, source, chunksize;
  float mysum, sum;
  float update(int myoffset, int chunk, int myid);
  MPI_Status status;

  /***** Initializing the MPI Environment *****/
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  if (numtasks % 4 != 0) {
    printf("Quitting. Number of MPI tasks must be divisible by 4.\n");  /* for equal distribution of workload */
    MPI_Abort(MPI_COMM_WORLD, rc);
    exit(0);
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
  printf("MPI task %d has started...\n", taskid);
  chunksize = (ARRAYSIZE / numtasks);
  tag2 = 1;
  tag1 = 2;

Slide 16

Array Decomposition (source code)

  /***** Master: initializes the array and fills it with dummy values *****/
  if (taskid == MASTER) {
    /* Initialize the array */
    sum = 0;
    for (i = 0; i < ARRAYSIZE; i++) {
      data[i] = i * 1.0;
      sum = sum + data[i];
    }
    printf("Initialized array sum = %e\n", sum);

    /* Partition & send each task its portion of the array - master keeps 1st part */
    offset = chunksize;
    for (dest = 1; dest < numtasks; dest++) {
      MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
      MPI_Send(&data[offset], chunksize, MPI_FLOAT, dest, tag2, MPI_COMM_WORLD);
      printf("Sent %d elements to task %d offset= %d\n", chunksize, dest, offset);
      offset = offset + chunksize;
    }

    /* Master does its part of the work */
    offset = 0;
    mysum = update(offset, chunksize, taskid);

    /* Wait to receive results from each task */
    for (i = 1; i < numtasks; i++) {
      source = i;
      MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
      MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
    }

Slide 17

Array Decomposition (source code)

    /* Get final sum and print sample results */
    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
    printf("Sample results: \n");
    offset = 0;
    for (i = 0; i < numtasks; i++) {
      for (j = 0; j < 5; j++)
        printf(" %e", data[offset + j]);
      printf("\n");
      offset = offset + chunksize;
    }
    printf("*** Final sum= %e ***\n", sum);
  } /* end of master section */

  /***** Non-master tasks only *****/
  if (taskid > MASTER) {
    /* Receive my portion of array from the master task */
    source = MASTER;
    MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);
    MPI_Recv(&data[offset], chunksize, MPI_FLOAT, source, tag2, MPI_COMM_WORLD, &status);
    mysum = update(offset, chunksize, taskid);

    /* Send my results back to the master task */
    dest = MASTER;
    MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);
    MPI_Send(&data[offset], chunksize, MPI_FLOAT, MASTER, tag2, MPI_COMM_WORLD);
    MPI_Reduce(&mysum, &sum, 1, MPI_FLOAT, MPI_SUM, MASTER, MPI_COMM_WORLD);
  } /* end of non-master */

Slide 18

Array Decomposition (source code)

  MPI_Finalize();
} /* end of main */

/** calculates the sum of the array chunk **/
float update(int myoffset, int chunk, int myid)
{
  int i;
  float mysum;
  /* Perform addition to each of my array elements and keep my sum */
  mysum = 0;
  for (i = myoffset; i < myoffset + chunk; i++) {
    data[i] = data[i] + i * 1.0;
    mysum = mysum + data[i];
  }
  printf("Task %d mysum = %e\n", myid, mysum);
  return (mysum);
}

Slide 19

Demo : Array Decomposition

/home/cdekate/pa/array_decomposition
Wed Feb 28 09:08:52 CST 2007
MPI task 0 has started...
MPI task 2 has started...
MPI task 1 has started...
MPI task 3 has started...
Initialized array sum = 1.335708e+14
Sent 4000000 elements to task 1 offset= 4000000
Sent 4000000 elements to task 2 offset= 8000000
Task 1 mysum = 4.884048e+13
Sent 4000000 elements to task 3 offset= 12000000
Task 2 mysum = 7.983003e+13
Task 0 mysum = 1.598859e+13
Task 3 mysum = 1.161867e+14
Sample results:
 0.000000e+00 2.000000e+00 4.000000e+00 6.000000e+00 8.000000e+00
 8.000000e+06 8.000002e+06 8.000004e+06 8.000006e+06 8.000008e+06
 1.600000e+07 1.600000e+07 1.600000e+07 1.600001e+07 1.600001e+07
 2.400000e+07 2.400000e+07 2.400000e+07 2.400001e+07 2.400001e+07
*** Final sum= 2.608458e+14 ***
Wed Feb 28 09:08:53 CST 2007

Output from Celeritas for a 4 processor run.

Slide 20

Topics

• Introduction

• Array Decomposition

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Matrix Multiplication

• N-Body Problem

• Summary – Materials for Test

Slides 21 to 27 (figure-only slides, including the Mandelbrot load-balancing discussion referenced in the summary; no text captured)

Slide 28

Flowchart for Mandelbrot Set Generation

All tasks (“master” and “workers”):
  Initialize MPI Environment
  Create Local Workload buffer
  Isolate work regions
  Calculate Mandelbrot set values across work region

“master” (task 0):
  Write result from task 0 to file
  Recv. results from “workers”
  Concatenate results to file
  End

“workers”:
  Send result to “master”
  End

Slide 29

Mandelbrot Sets (source code)

#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <mpi.h>

typedef struct complex {
  double real;
  double imag;
} Complex;

/* Calculate Pixel Values */
int cal_pixel(Complex c)
{
  int count, max_iter;
  Complex z;
  double temp, lengthsq;
  max_iter = 256;
  z.real = 0;
  z.imag = 0;
  count = 0;

  /* Core Mandelbrot Algorithm */
  do {
    temp = z.real * z.real - z.imag * z.imag + c.real;
    z.imag = 2 * z.real * z.imag + c.imag;
    z.real = temp;
    lengthsq = z.real * z.real + z.imag * z.imag;
    count++;
  } while ((lengthsq < 4.0) && (count < max_iter));
  return (count);
}

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Slide 30

Mandelbrot Sets (source code)

#define MASTERPE 0

int main(int argc, char **argv)
{
  FILE *file;
  int i, j;
  Complex c;
  int tmp;
  double *data_l, *data_l_tmp;
  int nx, ny;
  int mystrt, myend;
  int nrows_l;
  int nprocs, mype;
  MPI_Status status;

  /***** Initializing MPI Environment *****/
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &mype);

  /***** Pass in the dimension (X,Y) of the area to cover *****/
  if (argc != 3) {
    int err = 0;
    printf("argc %d\n", argc);
    if (mype == MASTERPE) {
      printf("usage: mandelbrot nx ny");
      MPI_Abort(MPI_COMM_WORLD, err);
    }
  }

  /* get command line args */
  nx = atoi(argv[1]);
  ny = atoi(argv[2]);

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Slide 31

Mandelbrot Sets (source code)

  /* assume divides equally */
  /***** Mark work region across processors *****/
  nrows_l = nx / nprocs;

  /* create buffer for local work only */
  data_l = (double *) malloc(nrows_l * ny * sizeof(double));
  data_l_tmp = data_l;

  /* calculate each processor's region of work */
  /***** Distribute region across processors *****/
  /***** All processors, including the “master” and “workers”, perform the following *****/
  mystrt = mype * nrows_l;
  myend  = mystrt + nrows_l - 1;

  /* calc each proc's coordinates and call local mandelbrot set function */
  for (i = mystrt; i <= myend; ++i) {
    c.real = i / ((double) nx) * 4. - 2.;
    for (j = 0; j < ny; ++j) {
      c.imag = j / ((double) ny) * 4. - 2.;
      tmp = cal_pixel(c);       /***** Calculate value for each coordinate point *****/
      *data_l++ = (double) tmp;
    }
  }
  data_l = data_l_tmp;

  /***** If “master”, begin writing output and then wait for “workers” to send their results *****/
  if (mype == MASTERPE) {
    file = fopen("mandelbrot.bin_0000", "w");
    printf("nrows_l, ny %d %d\n", nrows_l, ny);
    fwrite(data_l, nrows_l * ny, sizeof(double), file);
    fclose(file);

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Slide 32

Mandelbrot Sets (source code)

    for (i = 1; i < nprocs; ++i) {
      MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
      printf("received message from proc %d\n", i);
      file = fopen("mandelbrot.bin_0000", "a");
      fwrite(data_l, nrows_l * ny, sizeof(double), file);
      fclose(file);
    }
  }
  /***** If “worker”, send the calculated pixel array to the “master” *****/
  else {
    MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
  }

  MPI_Finalize();
}

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Slide 33

Demo : Mandelbrot Sets

Slide 34

Topics

• Introduction

• Array Decomposition

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Matrix Multiplication

• N-Body Problem

• Summary – Materials for Test

Slides 35 to 37 (figure-only slides introducing the Monte Carlo method; no text captured)

Slide 38

Monte Carlo

• The value of PI can be calculated in a number of ways. Consider the following method of approximating PI:
  – Inscribe a circle in a square
  – Randomly generate points in the square
  – Determine the number of points in the square that are also in the circle
  – Let r be the number of points in the circle divided by the number of points in the square
  – PI ~ 4 r
  – Note that the more points generated, the better the approximation
• Algorithm:

    npoints = 10000
    circle_count = 0
    do j = 1, npoints
      generate 2 random numbers between 0 and 1
      xcoordinate = random1 ; ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
        then circle_count = circle_count + 1
    end do
    PI = 4.0*circle_count/npoints
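A direct serial C rendering of this pseudocode (an illustrative sketch, separate from the MPI program shown on the later slides; the seed and point count are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int npoints = 10000, circle_count = 0, j;
        double x, y;

        srand(12345);                            /* arbitrary seed */
        for (j = 0; j < npoints; j++) {
            x = (double) rand() / RAND_MAX;      /* random point in the unit square */
            y = (double) rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)            /* inside the quarter circle */
                circle_count++;
        }
        printf("pi ~ %f\n", 4.0 * circle_count / npoints);
        return 0;
    }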

Slide 39

Monte Carlo Pi Computation

Server:
  Initialize MPI Environment
  Receive Request
  Compute Random Array
  Send Array to Requestor
  Last Request? (if not, receive the next request; if yes, Finalize MPI)

Master:
  Initialize MPI Environment
  Broadcast Error Bound
  Send Request to Server
  Receive Random Array
  Perform Computations
  Propagate Number of Points (Allreduce)
  Output Partial Result
  Stop Condition Satisfied? (if not, send the next request; if yes, Print Statistics and Finalize MPI)

Worker:
  Initialize MPI Environment
  Receive Error Bound
  Send Request to Server
  Receive Random Array
  Perform Computations
  Propagate Number of Points (Allreduce)
  Stop Condition Satisfied? (if not, send the next request; if yes, Finalize MPI)

Slide 40

Creating Custom Communicators

• Communicators define groups and the access patterns among them

• Default communicator is MPI_COMM_WORLD
• Some algorithms demand more sophisticated control of communications to take advantage of reduction operators
• MPI permits creation of custom communicators
  – MPI_Comm_create
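A minimal sketch of this pattern, as used in the source code on the following slides: build a group that excludes one rank (here the last rank, which will act as a server) and create a communicator over the remaining ranks. Variable names are illustrative.

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int numprocs, myid, excluded[1];
        MPI_Group world_group, worker_group;
        MPI_Comm workers;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        MPI_Comm_group(MPI_COMM_WORLD, &world_group);             /* group of all ranks */
        excluded[0] = numprocs - 1;                               /* rank to leave out */
        MPI_Group_excl(world_group, 1, excluded, &worker_group);
        MPI_Comm_create(MPI_COMM_WORLD, worker_group, &workers);  /* custom communicator */
        MPI_Group_free(&worker_group);

        if (myid != numprocs - 1)
            printf("rank %d belongs to the worker communicator\n", myid);

        MPI_Finalize();
        return 0;
    }

Collectives such as MPI_Allreduce can then run over the new communicator only, leaving the excluded rank free to serve requests.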

Slide 41

Monte Carlo : Pi (source code)

#include <stdio.h>
#include <math.h>
#include "mpi.h"
#define CHUNKSIZE 1000
/* We'd like a value that gives the maximum value returned by the function
   random, but no such value is *portable*.  RAND_MAX is available on many
   systems but is not always the correct value for random (it isn't for
   Solaris).  The value ((unsigned(1)<<31)-1) is common but not guaranteed */
#define INT_MAX 1000000000

/* message tags */
#define REQUEST 1
#define REPLY   2

int main( int argc, char *argv[] )
{
  int iter;
  int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
  double x, y, Pi, error, epsilon;
  int numprocs, myid, server, totalin, totalout, workerid;
  int rands[CHUNKSIZE], request;
  MPI_Comm world, workers;
  MPI_Group world_group, worker_group;
  MPI_Status status;

  /* Initialize MPI Environment */
  MPI_Init(&argc, &argv);
  world = MPI_COMM_WORLD;
  MPI_Comm_size(world, &numprocs);
  MPI_Comm_rank(world, &myid);

Slide 42

Monte Carlo : Pi (source code)

  server = numprocs - 1;   /* last proc is server */
  if (myid == 0)
    sscanf( argv[1], "%lf", &epsilon );

  /* Broadcast Error Bounds */
  MPI_Bcast( &epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD );
  MPI_Comm_group( world, &world_group );
  ranks[0] = server;
  MPI_Group_excl( world_group, 1, ranks, &worker_group );
  MPI_Comm_create( world, worker_group, &workers );   /* Creating a custom communicator */
  MPI_Group_free( &worker_group );

  /* Server Block */
  if (myid == server) {   /* I am the rand server */
    do {
      /* Recv. request for random number array */
      MPI_Recv( &request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status );
      if (request) {
        /* Computing the random array */
        for (i = 0; i < CHUNKSIZE; ) {
          rands[i] = random();
          if (rands[i] <= INT_MAX) i++;
        }
        /* Send random number array */
        MPI_Send( rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world );
      }
    } while ( request > 0 );
  }   /* End Server Block */
  else {   /* Begin Worker Block */
    request = 1;
    done = in = out = 0;
    max = INT_MAX;   /* max int, for normalization */
    MPI_Send( &request, 1, MPI_INT, server, REQUEST, world );
    MPI_Comm_rank( workers, &workerid );
    iter = 0;

Slide 43

Monte Carlo : Pi (source code)

    while (!done) {
      iter++;
      request = 1;

      /* Recv. random array from server */
      MPI_Recv( rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status );

      /* Derive random (x,y) coordinates for each pair of items in the random array */
      for (i = 0; i < CHUNKSIZE - 1; ) {
        x = (((double) rands[i++]) / max) * 2 - 1;
        y = (((double) rands[i++]) / max) * 2 - 1;
        if (x * x + y * y < 1.0) in++;
        else out++;
      }

      /* Perform reduction across the worker communicator */
      MPI_Allreduce( &in, &totalin, 1, MPI_INT, MPI_SUM, workers );
      MPI_Allreduce( &out, &totalout, 1, MPI_INT, MPI_SUM, workers );
      Pi = (4.0 * totalin) / (totalin + totalout);          /* Compute PI */
      error = fabs( Pi - 3.141592653589793238462643 );      /* Calculate error */
      done = (error < epsilon || (totalin + totalout) > 1000000);
      request = (done) ? 0 : 1;
      if (myid == 0) {   /* If “Master”: print current value of PI */
        printf( "\rpi = %23.20f", Pi );
        MPI_Send( &request, 1, MPI_INT, server, REQUEST, world );
      }
      else {             /* If “Worker”: request new array if not finished */
        if (request)
          MPI_Send( &request, 1, MPI_INT, server, REQUEST, world );
      }
    }
    MPI_Comm_free( &workers );
  }

Slide 44

Monte Carlo : Pi (source code)

  if (myid == 0) {   /* If “Master”: print results */
    printf( "\npoints: %d\nin: %d, out: %d, <ret> to exit\n",
            totalin + totalout, totalin, totalout );
    getchar();
  }
  MPI_Finalize();
}

Slide 45

Demo : Monte Carlo Pi

> mpirun -np 4 monte 1e-20
pi = 3.14164517741129456496
points: 1000500
in: 785804, out: 214696

Slide 46

Topics

• Introduction

• Array Decomposition

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Matrix Multiplication

• N-Body Problem

• Summary – Materials for Test

Slide 47

Matrix Multiplication (LLNL Web)

• MPI Example: MPI Matrix Multiply - C Version
• In this code, the master task distributes a matrix multiply operation to numtasks-1 worker tasks.
• NOTE: C and Fortran versions of this code differ because of the way arrays are stored/passed. C arrays are row-major order but Fortran arrays are column-major order.

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Slide 48

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Matrices — A Review
An n x m matrix

Slide 49

Matrix Addition

Involves adding corresponding elements of each matrix to form the result matrix.

Given the elements of A as a_{i,j} and the elements of B as b_{i,j}, each element of C is computed as

    c_{i,j} = a_{i,j} + b_{i,j}


Slide 50

Matrix Multiplication

Multiplication of two matrices, A and B, produces the matrix C whose elements, c_{i,j} (0 <= i < n, 0 <= j < m), are computed as follows:

    c_{i,j} = Σ_{k=0}^{l-1} a_{i,k} · b_{k,j}

where A is an n x l matrix and B is an l x m matrix.


Slide 51

Matrix multiplication, C = A x B


Slide 52

Matrix-Vector Multiplication
c = A x b

Matrix-vector multiplication follows directly from the definition of matrix-matrix multiplication by making B an n x 1 matrix (vector). The result is an n x 1 matrix (vector).

Slide 53

Relationship of Matrices to Linear Equations

A system of linear equations can be written in matrix form:

Ax = b

Matrix A holds the a constants

x is a vector of the unknowns

b is a vector of the b constants.


Slide 54

Implementing Matrix Multiplication
Sequential Code

Assume throughout that the matrices are square (n x n matrices). The sequential code to compute A x B could simply be

    for (i = 0; i < n; i++)
      for (j = 0; j < n; j++) {
        c[i][j] = 0;
        for (k = 0; k < n; k++)
          c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.


Slide 55

Partitioning into Submatrices

Suppose the matrix is divided into s^2 submatrices. Each submatrix has n/s x n/s elements. Using the notation A_{p,q} for the submatrix in submatrix row p and submatrix column q:

    for (p = 0; p < s; p++)
      for (q = 0; q < s; q++) {
        Cp,q = 0;                      /* clear elements of submatrix */
        for (r = 0; r < m; r++)        /* submatrix multiplication and */
          Cp,q = Cp,q + Ap,r * Br,q;   /* add to accumulating submatrix */
      }

The line

    Cp,q = Cp,q + Ap,r * Br,q;

means multiply submatrix A_{p,r} and B_{r,q} using matrix multiplication and add to submatrix C_{p,q} using matrix addition. Known as block matrix multiplication.
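An illustrative serial C sketch of block matrix multiplication (not from the slides; assumes an N x N matrix with N divisible by the number of blocks S per dimension):

    #include <stdio.h>
    #define N  8
    #define S  2            /* number of submatrices per dimension */
    #define BS (N / S)      /* submatrix (block) size */

    /* C = A x B using block (submatrix) multiplication */
    void block_matmul(double a[N][N], double b[N][N], double c[N][N])
    {
        int p, q, r, i, j, k;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                c[i][j] = 0.0;                               /* clear C */
        for (p = 0; p < S; p++)
            for (q = 0; q < S; q++)
                for (r = 0; r < S; r++)                      /* Cp,q += Ap,r * Br,q */
                    for (i = 0; i < BS; i++)
                        for (j = 0; j < BS; j++)
                            for (k = 0; k < BS; k++)
                                c[p*BS+i][q*BS+j] += a[p*BS+i][r*BS+k] * b[r*BS+k][q*BS+j];
    }

    int main(void)
    {
        double a[N][N], b[N][N], c[N][N];
        int i, j;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = i * j; }
        block_matmul(a, b, c);
        printf("c[0][1] = %.1f\n", c[0][1]);
        return 0;
    }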


Slide 56

Block Matrix Multiplication


Slide 57 (figure-only slide; no text captured)

Slide 58

Direct Implementation

One processor to compute each element of C – n^2 processors would be needed. One row of elements of A and one column of elements of B needed. Some of same elements sent to more than one processor. Can use submatrices.


Slide 59

Performance Improvement

Using tree construction, n numbers can be added in log n steps using n processors:

Computational time complexity of O(log n) using n^3 processors.


Slide 60

Flowchart for Matrix Multiplication

“master”:
  Initialize MPI Environment
  Initialize Array
  Partition Array into workloads
  Send Workload to “workers”
  Wait for “workers” to finish task
  Recv. results
  Print results
  End

“workers” (each):
  Initialize MPI Environment
  Recv. work
  Calculate matrix product
  Send result
  End

Slide 61

Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 62              /* number of rows in matrix A */
#define NCA 15              /* number of columns in matrix A */
#define NCB 7               /* number of columns in matrix B */
#define MASTER 0            /* taskid of first task */
#define FROM_MASTER 1       /* setting a message type */
#define FROM_WORKER 2       /* setting a message type */

int main(argc, argv)
int argc;
char *argv[];
{
  int numtasks,              /* number of tasks in partition */
      taskid,                /* a task identifier */
      numworkers,            /* number of worker tasks */
      source,                /* task id of message source */
      dest,                  /* task id of message destination */
      mtype,                 /* message type */
      rows,                  /* rows of matrix A sent to each worker */
      averow, extra, offset, /* used to determine rows sent to each worker */
      i, j, k, rc;           /* misc */
  double a[NRA][NCA],        /* matrix A to be multiplied */
         b[NCA][NCB],        /* matrix B to be multiplied */
         c[NRA][NCB];        /* result matrix C */
  MPI_Status status;

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Slide 62

Matrix Multiplication (source code)

  /* Initialize MPI Environment */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  if (numtasks < 2) {
    printf("Need at least two MPI tasks. Quitting...\n");
    MPI_Abort(MPI_COMM_WORLD, rc);
    exit(1);
  }
  numworkers = numtasks - 1;

  /* Master block */
  if (taskid == MASTER) {
    printf("mpi_mm has started with %d tasks.\n", numtasks);
    printf("Initializing arrays...\n");
    for (i = 0; i < NRA; i++)
      for (j = 0; j < NCA; j++)
        a[i][j] = i + j;            /* Initialize array a */
    for (i = 0; i < NCA; i++)
      for (j = 0; j < NCB; j++)
        b[i][j] = i * j;            /* Initialize array b */

    /* Send matrix data to the worker tasks */
    averow = NRA / numworkers;      /* fraction of array to be processed by “workers” */
    extra  = NRA % numworkers;
    offset = 0;
    mtype  = FROM_MASTER;           /* Message tag */

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Slide 63

Matrix Multiplication (source code)

    for (dest = 1; dest <= numworkers; dest++) {
      /* To each worker send: start point, number of rows to process, and sub-arrays to process */
      rows = (dest <= extra) ? averow + 1 : averow;
      printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
      MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      offset = offset + rows;
    }

    /* Receive results from worker tasks */
    mtype = FROM_WORKER;   /* Message tag for messages sent by “workers” */
    for (i = 1; i <= numworkers; i++) {
      source = i;
      /* offset stores the (processing) starting point of the work chunk */
      MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      /* The array C contains the product of sub-array A and the array B */
      MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
      printf("Received results from task %d\n", source);
    }
    printf("******************************************************\n");
    printf("Result Matrix:\n");
    for (i = 0; i < NRA; i++) {
      printf("\n");
      for (j = 0; j < NCB; j++)
        printf("%6.2f   ", c[i][j]);
    }
    printf("\n******************************************************\n");
    printf("Done.\n");
  }

Slide 64

Matrix Multiplication (source code)

  /**************************** worker task ************************************/
  if (taskid > MASTER) {
    mtype = FROM_MASTER;
    MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
    MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

    for (k = 0; k < NCB; k++)
      for (i = 0; i < rows; i++) {
        c[i][k] = 0.0;
        for (j = 0; j < NCA; j++)
          /* Calculate the product and store the result in C */
          c[i][k] = c[i][k] + a[i][j] * b[j][k];
      }

    mtype = FROM_WORKER;
    MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
    MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
    /* Worker sends the resultant array to the master */
    MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
  }
  MPI_Finalize();
}

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Slide 65

Demo : Matrix Multiplication

[cdekate@compute-0-6 matrix_multiplication]$ mpirun -np 4 ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Sending 21 rows to task 1 offset=0
Sending 21 rows to task 2 offset=21
Sending 20 rows to task 3 offset=42
Received results from task 1
Received results from task 2
Received results from task 3
******************************************************
Result Matrix:
  0.00  1015.00  2030.00  3045.00  4060.00  5075.00  6090.00
  0.00  1120.00  2240.00  3360.00  4480.00  5600.00  6720.00
  0.00  1225.00  2450.00  3675.00  4900.00  6125.00  7350.00
  0.00  1330.00  2660.00  3990.00  5320.00  6650.00  7980.00
  0.00  1435.00  2870.00  4305.00  5740.00  7175.00  8610.00
  0.00  1540.00  3080.00  4620.00  6160.00  7700.00  9240.00
  0.00  1645.00  3290.00  4935.00  6580.00  8225.00  9870.00
  .........
  0.00  6475.00 12950.00 19425.00 25900.00 32375.00 38850.00
  0.00  6580.00 13160.00 19740.00 26320.00 32900.00 39480.00
  0.00  6685.00 13370.00 20055.00 26740.00 33425.00 40110.00
  0.00  6790.00 13580.00 20370.00 27160.00 33950.00 40740.00
  0.00  6895.00 13790.00 20685.00 27580.00 34475.00 41370.00
  0.00  7000.00 14000.00 21000.00 28000.00 35000.00 42000.00
  0.00  7105.00 14210.00 21315.00 28420.00 35525.00 42630.00
  0.00  7210.00 14420.00 21630.00 28840.00 36050.00 43260.00
  0.00  7315.00 14630.00 21945.00 29260.00 36575.00 43890.00
  0.00  7420.00 14840.00 22260.00 29680.00 37100.00 44520.00
******************************************************
Done.
[cdekate@compute-0-6 matrix_multiplication]$

Slide 66

Topics

• Introduction

• Array Decomposition

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Matrix Multiplication

• N-Body Problem

• Summary – Materials for Test

Slide 67

N Bodies

OU Supercomputing Center for Education & Research

Slide 68


N-Body Problems

An N-body problem is a problem involving N “bodies” – that is, particles (e.g., stars, atoms) – each of which applies a force to all of the others.

For example, if you have N stars, then each of the N stars exerts a force (gravity) on all of the other N–1 stars.

Likewise, if you have N atoms, then every atom exerts a force (nuclear) on all of the other N–1 atoms.

Slide 69

1-Body Problem

When N is 1, you have a simple 1-Body Problem: a single particle, with no forces acting on it.

Given the particle’s position P and velocity V at some time t0, you can trivially calculate the particle’s position at time t0+Δt:

P(t0+Δt) = P(t0) + VΔt

V(t0+Δt) = V(t0)
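In code this update is trivial; a one-dimensional sketch (illustrative, not from the slides):

    #include <stdio.h>

    typedef struct { double p, v; } Body1D;   /* position and velocity, one dimension */

    /* advance a force-free particle from t0 to t0 + dt */
    Body1D step(Body1D b, double dt)
    {
        b.p = b.p + b.v * dt;   /* P(t0+dt) = P(t0) + V*dt */
        /* b.v unchanged:          V(t0+dt) = V(t0)        */
        return b;
    }

    int main(void)
    {
        Body1D b = { 0.0, 2.0 };
        b = step(b, 0.5);
        printf("p=%f v=%f\n", b.p, b.v);
        return 0;
    }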


Slide 70

2-Body Problem

When N is 2, you have – surprise! – a 2-Body Problem: exactly two particles, each exerting a force that acts on the other.

The relationship between the 2 particles can be expressed as a differential equation that can be solved analytically, producing a closed-form solution.

So, given the particles’ initial positions and velocities, you can immediately calculate their positions and velocities at any later time.


Slide 71

N-Body Problems

For N of 3 or more, no one knows how to solve the equations to get a closed form solution.

So, numerical simulation is pretty much the only way to study groups of 3 or more bodies.

Popular applications of N-body codes include astronomy and chemistry.

Note that, for N bodies, there are on the order of N^2 forces, denoted O(N^2).


Slide 72

N Bodies


Slide 73

N-Body Problems

Given N bodies, each body exerts a force on all of the other N–1 bodies.

Therefore, there are N • (N–1) forces in total.

You can also think of this as (N • (N–1))/2 forces, in the sense that the force from particle A to particle B is the same (except in the opposite direction) as the force from particle B to particle A.


Slide 74

Aside: Big-O Notation

Let’s say that you have some task to perform on a certain number of things, and that the task takes a certain amount of time to complete.

Let’s say that the amount of time can be expressed as a polynomial on the number of things to perform the task on.

For example, the amount of time it takes to read a book might be proportional to the number of words, plus the amount of time it takes to sit in your favorite easy chair.

C1 . N + C2


Slide 75

Big-O: Dropping the Low Term

C1 . N + C2

When N is very large, the time spent settling into your easy chair becomes such a small proportion of the total time that it’s virtually zero.

So from a practical perspective, for large N, the polynomial reduces to:

C1 . N

In fact, for any polynomial, all of the terms except the highest-order term are irrelevant, for large N.


Slide 76

Big-O: Dropping the Constant

C1 . N

Computers get faster and faster all the time. And there are many different flavors of computers, having many different speeds.

So, computer scientists don't care about the constant, only about the order of the highest-order term of the polynomial.

They indicate this with Big-O notation: O(N)

This is often said as: “of order N.”


Slide 77

N-Body Problems

Given N bodies, each body exerts a force on all of the other N–1 bodies.

Therefore, there are N • (N–1) forces in total.

In Big-O notation, that's O(N^2) forces to calculate.

So, calculating the forces takes O(N^2) time to execute.

But, there are only N particles, each taking up the same amount of memory, so we say that N-body codes are of:

• O(N) spatial complexity (memory)
• O(N^2) time complexity


Slide 78

O(N^2) Forces

(Figure: particle A and the forces between A and all other particles.)

Note that this picture shows only the forces between A and everyone else.


Slide 79

How to Calculate?

Whatever your physics is, you have some function, F(A,B), that expresses the force between two bodies A and B.

For example,

F(A,B) = G · mA · mB / dist(A,B)^2

where G is the gravitational constant and m is the mass of the particle in question.

If you have all of the forces for every pair of particles, then you can calculate their sum, obtaining the force on every particle.
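An illustrative serial sketch of this O(N^2) force accumulation in 2-D (not from the slides; gravitational form, no softening, arbitrary units):

    #include <math.h>
    #include <stdio.h>

    typedef struct { double x, y, mass, fx, fy; } Body;

    /* accumulate the net force on every body from every other body: O(N^2) pairs */
    void compute_forces(Body b[], int n, double G)
    {
        int i, j;
        for (i = 0; i < n; i++) { b[i].fx = 0.0; b[i].fy = 0.0; }
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y;
                double r2 = dx*dx + dy*dy;                   /* dist(A,B)^2 */
                double f  = G * b[i].mass * b[j].mass / r2;  /* force magnitude */
                double r  = sqrt(r2);
                b[i].fx += f * dx / r;                       /* resolve along x and y */
                b[i].fy += f * dy / r;
            }
    }

    int main(void)
    {
        Body b[2] = { {0, 0, 1.0, 0, 0}, {1, 0, 2.0, 0, 0} };
        compute_forces(b, 2, 1.0);
        printf("force on body 0: (%f, %f)\n", b[0].fx, b[0].fy);
        return 0;
    }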


Slide 80

How to Parallelize?

Okay, so let’s say you have a nice serial (single-CPU) code that does an N-body calculation.

How are you going to parallelize it? You could:
• have a master feed particles to processes;
• have a master feed interactions to processes;
• have each process decide on its own subset of the particles, and then share around the forces;
• have each process decide its own subset of the interactions, and then share around the forces.


Slide 81

Do You Need a Master?

Let's say that you have N bodies, and therefore you have ½N(N-1) interactions (every particle interacts with all of the others, but you don't need to calculate both A→B and B→A).

Do you need a master?

Well, can each processor determine on its own either (a) which of the bodies to process, or (b) which of the interactions?

If the answer is yes, then you don’t need a master.


Slide 82

Parallelize How?

Suppose you have P processors.

Should you parallelize:
• by assigning a subset of N/P of the bodies to each processor, or
• by assigning a subset of ½N(N-1)/P of the interactions to each processor?


Slide 83

N-Body “Pipeline” Implementation Flowchart

Initialize MPI environment
Create ring communicator
Initialize particle parameters
Repeat for each iteration:
  Copy local particle data to send buffer
  Repeat until particles from all remote nodes have been processed:
    Initiate transmission of send buffer to the RIGHT neighbor in ring
    Initiate reception of data from the LEFT neighbor in ring
    Compute forces between local and send buffer particles
    Wait for message exchange to complete
    Copy particle data from receive buffer to send buffer
  Update positions of local particles
All iterations done? If yes, Finalize MPI

Slide 84

N-Body (source code)

#include "mpi.h"#include <stdlib.h>#include <stdio.h>#include <string.h>#include <math.h>

/* Pipeline version of the algorithm... *//* we really need the velocities as well… */

/* Simplified structure describing parameters of a single particle */typedef struct { double x, y, z; double mass; } Particle;/* We use leapfrog for the time integration ... */

/* Structure to hold force components and old position coordinates of a particle */typedef struct { double xold, yold, zold; double fx, fy, fz; } ParticleV;

void InitParticles( Particle[], ParticleV [], int );double ComputeForces( Particle [], Particle [], ParticleV [], int );double ComputeNewPos( Particle [], ParticleV [], int, double, MPI_Comm );

#define MAX_PARTICLES 4000#define MAX_P 128

Slide 85

N-Body (source code)

main( int argc, char *argv[] )
{
  Particle  particles[MAX_PARTICLES];   /* Particles on ALL nodes */
  ParticleV pv[MAX_PARTICLES];          /* Particle velocity */
  Particle  sendbuf[MAX_PARTICLES],     /* Pipeline buffers */
            recvbuf[MAX_PARTICLES];
  MPI_Request request[2];
  int  counts[MAX_P],                   /* Number on each processor */
       displs[MAX_P];                   /* Offsets into particles */
  int  rank, size, npart, i, j,
       offset;                          /* location of local particles */
  int  totpart,                         /* total number of particles */
       cnt;                             /* number of times in loop */
  MPI_Datatype particletype;
  double sim_t;                         /* Simulation time */
  double time;                          /* Computation time */
  int  pipe, left, right, periodic;
  MPI_Comm commring;
  MPI_Status statuses[2];

  /* Initialize MPI Environment */
  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  MPI_Comm_size( MPI_COMM_WORLD, &size );

  /* Create 1-dimensional periodic Cartesian communicator (a ring) */
  periodic = 1;
  MPI_Cart_create( MPI_COMM_WORLD, 1, &size, &periodic, 1, &commring );

Slide 86

N-Body (source code)

  MPI_Cart_shift( commring, 0, 1, &left, &right );   /* Find the closest neighbors in ring */

  /* Calculate local fraction of particles */
  if (argc < 2) {
    fprintf( stderr, "Usage: %s n\n", argv[0] );
    MPI_Abort( MPI_COMM_WORLD, 1 );
  }
  npart = atoi(argv[1]) / size;
  if (npart * size > MAX_PARTICLES) {
    fprintf( stderr, "%d is too many; max is %d\n", npart*size, MAX_PARTICLES );
    MPI_Abort( MPI_COMM_WORLD, 1 );
  }
  MPI_Type_contiguous( 4, MPI_DOUBLE, &particletype );   /* Data type corresponding to Particle struct */
  MPI_Type_commit( &particletype );

  /* Get the sizes and displacements */
  MPI_Allgather( &npart, 1, MPI_INT, counts, 1, MPI_INT, commring );
  displs[0] = 0;
  for (i = 1; i < size; i++)
    displs[i] = displs[i-1] + counts[i-1];
  totpart = displs[size-1] + counts[size-1];

  /* Generate the initial values */
  InitParticles( particles, pv, npart );
  offset = displs[rank];
  cnt    = 10;
  time   = MPI_Wtime();
  sim_t  = 0.0;

  /* Begin simulation loop */
  while (cnt--) {
    double max_f, max_f_seg;

Slide 87

N-Body (source code)

    /* Load the initial send buffer */
    memcpy( sendbuf, particles, npart * sizeof(Particle) );
    max_f = 0.0;
    for (pipe = 0; pipe < size; pipe++) {
      if (pipe != size-1) {
        /* Initiate send to the “right” neighbor, while receiving from the “left” */
        MPI_Isend( sendbuf, npart, particletype, right, pipe, commring, &request[0] );
        MPI_Irecv( recvbuf, npart, particletype, left,  pipe, commring, &request[1] );
      }
      /* Compute forces */
      max_f_seg = ComputeForces( particles, sendbuf, pv, npart );
      if (max_f_seg > max_f) max_f = max_f_seg;

      /* Wait for updates to complete and copy received particles to the send buffer */
      if (pipe != size-1) MPI_Waitall( 2, request, statuses );
      memcpy( sendbuf, recvbuf, counts[pipe] * sizeof(Particle) );
    }
    /* Compute the changes in position using the already calculated forces */
    sim_t += ComputeNewPos( particles, pv, npart, max_f, commring );

    /* We could do graphics here (move particles on the display) */
  }
  time = MPI_Wtime() - time;
  if (rank == 0) {
    printf( "Computed %d particles in %f seconds\n", totpart, time );
  }
  MPI_Finalize();
  return 0;
}

Slide 88

N-Body (source code)

/* Initialize particle positions, masses and forces */
void InitParticles( Particle particles[], ParticleV pv[], int npart )
{
  int i;
  for (i = 0; i < npart; i++) {
    particles[i].x    = drand48();
    particles[i].y    = drand48();
    particles[i].z    = drand48();
    particles[i].mass = 1.0;
    pv[i].xold = particles[i].x;
    pv[i].yold = particles[i].y;
    pv[i].zold = particles[i].z;
    pv[i].fx   = 0;
    pv[i].fy   = 0;
    pv[i].fz   = 0;
  }
}

/* Compute forces (2-D only) */
double ComputeForces( Particle myparticles[], Particle others[], ParticleV pv[], int npart )
{
  double max_f, rmin;
  int i, j;
  max_f = 0.0;
  for (i = 0; i < npart; i++) {
    double xi, yi, mi, rx, ry, mj, r, fx, fy;
    rmin = 100.0;
    xi = myparticles[i].x;
    yi = myparticles[i].y;
    fx = 0.0;
    fy = 0.0;

Slide 89

N-Body (source code)

    for (j = 0; j < npart; j++) {
      rx = xi - others[j].x;
      ry = yi - others[j].y;
      mj = others[j].mass;
      r  = rx * rx + ry * ry;
      /* ignore overlap and same particle */
      if (r == 0.0) continue;
      if (r < rmin) rmin = r;
      /* compute forces */
      r = r * sqrt(r);
      fx -= mj * rx / r;
      fy -= mj * ry / r;
    }
    pv[i].fx += fx;
    pv[i].fy += fy;
    /* Compute a rough estimate of (1/m)|df / dx| */
    fx = sqrt(fx*fx + fy*fy) / rmin;
    if (fx > max_f) max_f = fx;
  }
  return max_f;
}

/* Update particle positions (2-D only) */
double ComputeNewPos( Particle particles[], ParticleV pv[], int npart, double max_f, MPI_Comm commring )
{
  int i;
  double a0, a1, a2;
  static double dt_old = 0.001, dt = 0.001;
  double dt_est, new_dt, dt_new;

Slide 90

N-Body (source code)

  /* integration is a0 * x^+ + a1 * x + a2 * x^- = f / m */
  a0 = 2.0 / (dt * (dt + dt_old));
  a2 = 2.0 / (dt_old * (dt + dt_old));
  a1 = -(a0 + a2);   /* also -2/(dt*dt_old) */
  for (i = 0; i < npart; i++) {
    double xi, yi;
    /* Very, very simple leapfrog time integration.  We use a variable
       step version to simplify time-step control. */
    xi = particles[i].x;
    yi = particles[i].y;
    particles[i].x = (pv[i].fx - a1 * xi - a2 * pv[i].xold) / a0;
    particles[i].y = (pv[i].fy - a1 * yi - a2 * pv[i].yold) / a0;
    pv[i].xold = xi;
    pv[i].yold = yi;
    pv[i].fx   = 0;
    pv[i].fy   = 0;
  }
  /* Recompute a time step.  Stability criteria is roughly
     2/sqrt(1/m |df/dx|) >= dt.  We leave a little room */
  dt_est = 1.0 / sqrt(max_f);
  if (dt_est < 1.0e-6) dt_est = 1.0e-6;
  MPI_Allreduce( &dt_est, &dt_new, 1, MPI_DOUBLE, MPI_MIN, commring );
  /* Modify time step */
  if (dt_new < dt) {
    dt_old = dt;
    dt = dt_new;
  } else if (dt_new > 4.0 * dt) {
    dt_old = dt;
    dt *= 2.0;
  }
  return dt_old;
}

Slide 91

Demo : N-Body Problem

> mpirun -np 4 nbodypipe 4000
Computed 4000 particles in 1.119051 seconds

Slide 92

Topics

• Introduction

• Array Decomposition

• Mandelbrot Sets

• Monte Carlo : PI Calculation

• Matrix Multiplication

• N-Body Problem

• Summary – Materials for Test

Slide 93

Summary – Material for the Test

• Introduction – Slides: 4, 5, 6
• Array decomposition – Slides: 11, 12
• Mandelbrot load balancing – Slides: 25, 26
• Monte Carlo create Communicators – Slides: 40, 42
• Matrix Multiply basic algorithm – Slides: 49 – 54
