
Page 1:

CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

APPLIED PARALLEL ALGORITHMS 1

Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 10th, 2011

Page 2:

Dr. Hartmut Kaiser

Center for Computation & Technology

R315 Johnston

[email protected]


Page 3:

Puzzle of the Day

• What’s the difference between the following valid C function declarations:

void foo();
void foo(void);
void foo(...);

Page 4:

Puzzle of the Day

• What’s the difference between the following valid C function declarations:

void foo();     any number of parameters
void foo(void); no parameters
void foo(...);  any number of parameters

• What’s the difference between the following valid C++ function declarations:

void foo();
void foo(void);
void foo(...);

Page 5:

Puzzle of the Day

• What’s the difference between the following valid C function declarations:

void foo();     any number of parameters
void foo(void); no parameters
void foo(...);  any number of parameters

• What’s the difference between the following valid C++ function declarations:

void foo();     no parameters
void foo(void); no parameters
void foo(...);  any number of parameters
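A minimal C sketch (illustrative, not from the lecture) showing the practical consequence: with an empty parameter list a C compiler accepts a call with any arguments, while a C++ compiler treats the same declaration as taking none.

#include <stdio.h>

void foo();      /* C: unspecified parameters -- callers may pass anything */
void bar(void);  /* C and C++: explicitly takes no parameters              */

void foo() { puts("in foo"); }
void bar(void) { puts("in bar"); }

int main(void)
{
    foo(1, 2.0, "x");   /* compiles as C (behavior is undefined if the
                           definition disagrees); rejected as C++         */
    bar();              /* bar(1) is an error in both languages           */
    return 0;
}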

Page 6:

Topics

• Introduction
• Mandelbrot Sets
• Monte Carlo: Pi Calculation
• Vector Dot-Product
• Matrix Multiplication

Page 7:

Topics

• Introduction
• Mandelbrot Sets
• Monte Carlo: Pi Calculation
• Vector Dot-Product
• Matrix Multiplication

Page 8:

Parallel Programming

• Goals
  – Correctness
  – Reduction in execution time
  – Efficiency
  – Scalability
  – Increased problem size and richness of models

• Objectives
  – Expose parallelism
    • Algorithm design
  – Distribute work uniformly
    • Data decomposition and allocation
    • Dynamic load balancing
  – Minimize overhead of synchronization and communication
    • Coarse granularity
    • Big messages
  – Minimize redundant work
    • Still sometimes better than communication

Page 9:

Basic Parallel (MPI) Program Steps

• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and exchange intermediate data results
  – Optional; needed for non-embarrassingly parallel (cooperative) computations
• Detect “stop” condition
  – May be implicit, e.g. with a barrier
• Aggregate final results
  – Often a reduction operation
• Output results and error code
• Terminate and return to OS
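As an illustrative sketch (not the lecture's code), a minimal MPI program that walks through these steps, summing the integers 0..99 across ranks:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i, local = 0, total = 0;

    MPI_Init(&argc, &argv);                  /* initialize execution environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* establish logical bindings       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = rank; i < 100; i += size)       /* distribute data and work         */
        local += i;                          /* core computation in parallel     */

    /* aggregate final results: a reduction operator */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %d\n", total);         /* output results                   */

    MPI_Finalize();                          /* terminate and return to OS       */
    return 0;
}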

Page 10:

“embarrassingly parallel”

• Common phrase
  – poorly defined,
  – widely used

• Suggests lots and lots of parallelism
  – with essentially no inter-task communication or coordination
  – Highly partitionable workload with minimal overhead

• “almost embarrassingly parallel”
  – Same as above, but
  – Requires master to launch many tasks
  – Requires master to collect final results of tasks
  – Sometimes still referred to as “embarrassingly parallel”

Page 11:

Topics

• Introduction
• Mandelbrot Sets
• Monte Carlo: Pi Calculation
• Vector Dot-Product
• Matrix Multiplication

Page 12:

Mandelbrot set

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Page 13:

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Mandelbrot Set

Set of points in a complex plane that are quasi-stable (will increase and decrease, but not exceed some limit) when computed by iterating the function

z_{k+1} = z_k^2 + c

where z_{k+1} is the (k + 1)th iteration of the complex number z = (a + bi) and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero.

Iterations are continued until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector given by

|z| = sqrt(a^2 + b^2)

Page 14:

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Sequential routine computing the value of one point, returning the number of iterations:

typedef struct {
    float real;
    float imag;
} complex;

int cal_pixel(complex c)
{
    int count, max;
    complex z;
    float temp, lengthsq;

    max = 256;
    z.real = 0;
    z.imag = 0;
    count = 0;               /* number of iterations */
    do {
        temp = z.real * z.real - z.imag * z.imag + c.real;
        z.imag = 2 * z.real * z.imag + c.imag;
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;
        count++;
    } while ((lengthsq < 4.0) && (count < max));
    return count;
}

Page 15:

Parallelizing Mandelbrot Set Computation

Static Task Assignment

Simply divide the region into a fixed number of parts, each computed by a separate processor.

Not very successful because different regions require different numbers of iterations and time.

Dynamic Task Assignment

Have processors request new regions after computing previous regions.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
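A hedged sketch of the master's side of dynamic task assignment (illustrative only; NX, NY, WORK_TAG, and STOP_TAG are names invented here, and result bookkeeping is elided): rows are handed out one at a time, and the next row goes to whichever worker returns first.

/* Master: hand out one row index at a time; workers echo the row index
   back in the message tag along with the computed row of data.          */
int row = 0, active = 0;
double results[NY];
MPI_Status st;

for (int w = 1; w < nprocs && row < NX; ++w, ++row, ++active)
    MPI_Send(&row, 1, MPI_INT, w, WORK_TAG, MPI_COMM_WORLD);

while (active > 0) {
    MPI_Recv(results, NY, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &st);          /* st.MPI_TAG = finished row  */
    --active;
    if (row < NX) {                         /* more work: reuse this rank */
        MPI_Send(&row, 1, MPI_INT, st.MPI_SOURCE, WORK_TAG, MPI_COMM_WORLD);
        ++row; ++active;
    } else {                                /* no work left: stop signal  */
        MPI_Send(&row, 1, MPI_INT, st.MPI_SOURCE, STOP_TAG, MPI_COMM_WORLD);
    }
}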

Page 16:

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Dynamic Task Assignment: Work Pool / Processor Farms

Page 17:

Flowchart for Mandelbrot Set Generation

“master”: initialize MPI environment → create local workload buffer → isolate work regions → calculate Mandelbrot set values across the work region → write result from task 0 to file → receive results from “workers” → concatenate results to file → end.

“workers” (each): initialize MPI environment → create local workload buffer → isolate work regions → calculate Mandelbrot set values across the work region → send result to “master” → end.

Page 18:

Mandelbrot Sets (source code)

#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <mpi.h>

typedef struct complex {
    double real;
    double imag;
} Complex;

int cal_pixel(Complex c)
{
    int count, max_iter;
    Complex z;
    double temp, lengthsq;

    max_iter = 256;
    z.real = 0;
    z.imag = 0;
    count = 0;
    do {
        temp = z.real * z.real - z.imag * z.imag + c.real;
        z.imag = 2 * z.real * z.imag + c.imag;
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;
        count++;
    } while ((lengthsq < 4.0) && (count < max_iter));
    return count;
}

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

cal_pixel() runs on every worker process and calculates the iteration count for every pixel.

Page 19:

Mandelbrot Sets (source code)

#define MASTERPE 0

int main(int argc, char **argv)
{
    FILE *file;
    int i, j;
    int tmp;
    Complex c;
    double *data_l, *data_l_tmp;
    int nx, ny;
    int mystrt, myend;
    int nrows_l;
    int nprocs, mype;
    MPI_Status status;

    /***** Initializing MPI Environment *****/

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mype);

    /***** Pass in the dimension (X,Y) of the area to cover *****/

    if (argc != 3) {
        int err = 0;
        printf("argc %d\n", argc);
        if (mype == MASTERPE) {
            printf("usage: mandelbrot nx ny");
            MPI_Abort(MPI_COMM_WORLD, err);
        }
    }

    /* get command line args */
    nx = atoi(argv[1]);
    ny = atoi(argv[2]);

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Initialize the MPI environment.

Check that the input arguments (the x,y dimensions of the region to be processed) are passed.

Page 20:

Mandelbrot Sets (source code)

    /* assume nx divides equally */
    nrows_l = nx / nprocs;
    mystrt = mype * nrows_l;
    myend = mystrt + nrows_l - 1;

    /* create buffer for local work only */
    data_l = (double *) malloc(nrows_l * ny * sizeof(double));
    data_l_tmp = data_l;

    /* calc each proc's coordinates and call local mandelbrot value generation function */
    for (i = mystrt; i <= myend; ++i) {
        c.real = i / ((double) nx) * 4. - 2.;
        for (j = 0; j < ny; ++j) {
            c.imag = j / ((double) ny) * 4. - 2.;
            tmp = cal_pixel(c);
            *data_l++ = (double) tmp;
        }
    }
    data_l = data_l_tmp;

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Determine the dimensions of the work to be performed by each concurrent task.

Local tasks calculate the coordinates for each pixel in the local region. For each pixel, the cal_pixel() function is called and the corresponding value is calculated.

Page 21:

Mandelbrot Sets (source code)

    if (mype == MASTERPE) {
        file = fopen("mandelbrot.bin_0000", "w");
        printf("nrows_l, ny %d %d\n", nrows_l, ny);
        fwrite(data_l, nrows_l * ny, sizeof(double), file);
        fclose(file);
        for (i = 1; i < nprocs; ++i) {
            MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
            printf("received message from proc %d\n", i);
            file = fopen("mandelbrot.bin_0000", "a");
            fwrite(data_l, nrows_l * ny, sizeof(double), file);
            fclose(file);
        }
    } else {
        MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}

Source : http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

The master process opens a file to store output into and stores its values in the file. The master then waits to receive the values computed by each of the worker processes.

Worker processes send the computed Mandelbrot values of their region to the master process.

Page 22:

Demo : Mandelbrot Sets

Page 23:

Demo: Mandelbrot Sets


Page 24:

Topics

• Introduction
• Mandelbrot Sets
• Monte Carlo: Pi Calculation
• Vector Dot-Product
• Matrix Multiplication

Page 25:

Page 26:

Monte Carlo Simulation

• Used when it is infeasible or impossible to compute an exact result with a deterministic algorithm

• Especially useful in
  – Studying systems with a large number of coupled degrees of freedom
    • Fluids, disordered materials, strongly coupled solids, cellular structures
  – Modeling phenomena with significant uncertainty in inputs
    • The calculation of risk in business
  – These methods are also widely used in mathematics
    • The evaluation of definite integrals, particularly multidimensional integrals with complicated boundary conditions

Page 27:

Monte Carlo Simulation

• No single approach; a multitude of different methods

• Usually follows the pattern:
  – Define a domain of possible inputs
  – Generate inputs randomly from the domain
  – Perform a deterministic computation using the inputs
  – Aggregate the results of the individual computations into the final result

• Example: calculate Pi

Page 28:

Monte Carlo: Algorithm for Pi

• The value of Pi can be calculated in a number of ways. Consider the following method of approximating Pi: inscribe a circle in a square
• Randomly generate points in the square
• Determine the number of points in the square that are also in the circle
• Let r be the number of points in the circle divided by the number of points in the square
• PI ≈ 4r
• Note that the more points generated, the better the approximation
• Algorithm:

  npoints = 10000
  circle_count = 0
  do j = 1, npoints
      generate 2 random numbers between 0 and 1
      xcoordinate = random1; ycoordinate = random2
      if (xcoordinate, ycoordinate) inside circle
          then circle_count = circle_count + 1
  end do
  PI = 4.0 * circle_count / npoints
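A minimal serial C sketch of this algorithm (illustrative; the OpenMP and MPI versions on the following pages parallelize the same loop):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int npoints = 1000000;
    int j, circle_count = 0;

    srand(42);
    for (j = 0; j < npoints; j++) {
        /* sample the unit square; the quarter circle x^2 + y^2 <= 1
           covers pi/4 of its area */
        double x = (double) rand() / RAND_MAX;
        double y = (double) rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            circle_count++;
    }
    printf("estimate of pi is %g\n", 4.0 * circle_count / npoints);
    return 0;
}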

Page 29:

Page 30:

OpenMP Pi Calculation

Master thread: initialize variables → initialize the OpenMP parallel environment → calculate Pi → print the value of Pi.

N worker threads (each): generate random X,Y → calculate Z = X^2 + Y^2 → if the point lies within the circle (Y), count++; otherwise (N) continue → reduction (∑) of the per-thread counts.

Page 31:

OpenMP Calculating Pi

#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define SEED 42

int main(int argc, char* argv[])
{
    int niter = 0;
    double x, y;
    int i, tid, count = 0;   /* # of points in the 1st quadrant of unit circle */
    double z;
    double pi;
    time_t rawtime;
    struct tm * timeinfo;

    printf("Enter the number of iterations used to estimate pi: ");
    scanf("%d", &niter);
    time(&rawtime);
    timeinfo = localtime(&rawtime);

Seed for generating random numbers.

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Page 32:

OpenMP Calculating Pi

    printf("The current date/time is: %s", asctime(timeinfo));

    /* initialize random numbers */
    srand(SEED);
#pragma omp parallel for private(x,y,z,tid) reduction(+:count)
    for (i = 0; i < niter; i++) {
        x = (double) rand() / RAND_MAX;
        y = (double) rand() / RAND_MAX;
        z = (x*x + y*y);
        if (z <= 1) count++;
        if (i == (niter/6)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == (niter/3)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == (niter/2)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Initialize the random number generator; srand() is used to seed the random numbers generated by rand().

Randomly generate x,y points.

Initialize the OpenMP parallel for with reduction (∑).

Calculate x^2 + y^2 and check if the point lies within the circle; if yes, increment count.

Page 33:

Calculating Pi

        if (i == (2*niter/3)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == (5*niter/6)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == niter-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
    }

    time(&rawtime);
    timeinfo = localtime(&rawtime);
    printf("The current date/time is: %s", asctime(timeinfo));
    printf(" the total count is %i\n", count);
    pi = (double) count / niter * 4;
    printf("# of trials= %d , estimate of pi is %g \n", niter, pi);
    return 0;
}

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Calculate Pi based on the aggregate count of the points that lie within the circle.

Page 34:

Demo : OpenMP Pi

[cdekate@celeritas l13]$ ./omcpi
Enter the number of iterations used to estimate pi: 100000
The current date/time is: Tue Mar 4 05:53:52 2008
 thread 0 just did iteration 16665 the count is 13124
 thread 1 just did iteration 33332 the count is 6514
 thread 1 just did iteration 49999 the count is 19609
 thread 2 just did iteration 66665 the count is 13048
 thread 3 just did iteration 83332 the count is 6445
 thread 3 just did iteration 99999 the count is 19489
The current date/time is: Tue Mar 4 05:53:52 2008
 the total count is 78320
# of trials= 100000 , estimate of pi is 3.1328
[cdekate@celeritas l13]$

Page 35:

Creating Custom Communicators

• Communicators define groups and the access patterns among them
• The default communicator is MPI_COMM_WORLD
• Some algorithms demand more sophisticated control of communications to take advantage of reduction operators
• MPI permits creation of custom communicators: MPI_Comm_create
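A short sketch of the call sequence (mirroring what the Pi example on the following pages does): exclude the server rank from MPI_COMM_WORLD and build a "workers" communicator from the remaining ranks; `server` is assumed to hold the excluded rank.

MPI_Group world_group, worker_group;
MPI_Comm  workers;
int ranks[1];

ranks[0] = server;                                  /* rank(s) to exclude */
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Group_excl(world_group, 1, ranks, &worker_group);
MPI_Comm_create(MPI_COMM_WORLD, worker_group, &workers);
MPI_Group_free(&worker_group);

/* Collectives such as MPI_Allreduce(..., workers) now involve only the
   worker ranks; on the excluded rank, workers is MPI_COMM_NULL.         */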

Page 36:

MPI Monte Carlo Pi Computation

Server: initialize MPI environment → receive request → compute random array → send array to requestor → loop until the last request, then finalize MPI.

Master: initialize MPI environment → broadcast error bound → send request to server → receive random array → perform computations → propagate number of points (Allreduce) → output partial result → repeat until the stop condition is satisfied, then print statistics and finalize MPI.

Worker: initialize MPI environment → receive error bound → send request to server → receive random array → perform computations → propagate number of points (Allreduce) → repeat until the stop condition is satisfied, then finalize MPI.

Page 37:

Monte Carlo : MPI - Pi (source code)

#include <stdio.h>
#include <math.h>
#include "mpi.h"

#define CHUNKSIZE 1000
#define INT_MAX 1000000000
#define REQUEST 1
#define REPLY 2

int main(int argc, char *argv[])
{
    int iter;
    int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
    double x, y, Pi, error, epsilon;
    int numprocs, myid, server, totalin, totalout, workerid;
    int rands[CHUNKSIZE], request;
    MPI_Comm world, workers;
    MPI_Group world_group, worker_group;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    world = MPI_COMM_WORLD;
    MPI_Comm_size(world, &numprocs);
    MPI_Comm_rank(world, &myid);

Initialize MPI environment

Page 38:

Monte Carlo : MPI - Pi (source code)

    server = numprocs - 1;          /* last proc is server */
    if (myid == 0)
        sscanf(argv[1], "%lf", &epsilon);
    MPI_Bcast(&epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Comm_group(world, &world_group);
    ranks[0] = server;
    MPI_Group_excl(world_group, 1, ranks, &worker_group);
    MPI_Comm_create(world, worker_group, &workers);
    MPI_Group_free(&worker_group);

    if (myid == server) {
        do {
            MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status);
            if (request) {
                for (i = 0; i < CHUNKSIZE; ) {
                    rands[i] = random();
                    if (rands[i] <= INT_MAX) i++;
                }
                /* Send random number array */
                MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world);
            }
        } while (request > 0);
    } else {                        /* Begin worker block */
        request = 1;
        done = in = out = 0;
        max = INT_MAX;              /* max int, for normalization */
        MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
        MPI_Comm_rank(workers, &workerid);
        iter = 0;

Broadcast Error Bounds: epsilon

Create a custom communicator

Server process: 1. receives a request to generate random numbers, 2. computes the random number array, 3. sends the array to the requestor.

Worker process: requests the server to generate a random number array.

Page 39:

Monte Carlo : MPI - Pi (source code)

        while (!done) {
            iter++;
            request = 1;
            /* Recv. random array from server */
            MPI_Recv(rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status);
            for (i = 0; i < CHUNKSIZE - 1; ) {
                x = (((double) rands[i++]) / max) * 2 - 1;
                y = (((double) rands[i++]) / max) * 2 - 1;
                if (x*x + y*y < 1.0) in++;
                else out++;
            }
            MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
            MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
            Pi = (4.0 * totalin) / (totalin + totalout);
            error = fabs(Pi - 3.141592653589793238462643);
            done = (error < epsilon || (totalin + totalout) > 1000000);
            request = (done) ? 0 : 1;
            if (myid == 0) {   /* If "Master": print current value of Pi */
                printf("\rpi = %23.20f", Pi);
                MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
            } else {           /* If "Worker": request new array if not finished */
                if (request)
                    MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
            }
        }
        MPI_Comm_free(&workers);
    }

Worker : Receive random number array from the Server

Worker: for each pair x,y in the random number array, calculate the coordinates.

Determine whether the point is inside or outside the circle.

Print the current value of Pi and request more work.

Compute the value of Pi and check whether the error is within the threshold.

Page 40:

Monte Carlo : MPI - Pi (source code)

    if (myid == 0) {   /* If "Master": print results */
        printf("\npoints: %d\nin: %d, out: %d, <ret> to exit\n",
               totalin + totalout, totalin, totalout);
        getchar();
    }
    MPI_Finalize();
}

Print the final value of PI

Page 41:

Demo : MPI Monte Carlo, Pi

> mpirun –np 4 monte 1e-20
pi = 3.14164517741129456496
points: 1000500
in: 785804, out: 214696

Page 42:

Topics

• Introduction
• Mandelbrot Sets
• Monte Carlo: Pi Calculation
• Vector Dot-Product
• Matrix Multiplication

Page 43:

Vector Dot Product

• Multiplication of 2 vectors followed by summation:

A[i]: X1, X2, X3, X4, X5, …, Xn
B[i]: Y1, Y2, Y3, Y4, Y5, …, Yn

A ∙ B = Σ (i = 1 to n) A[i] * B[i] = X1*Y1 + X2*Y2 + X3*Y3 + X4*Y4 + X5*Y5 + … + Xn*Yn

Page 44:

OpenMP Dot Product : using Reduction

Master thread: initialize variables → initialize the OpenMP parallel environment → REDUCTION (∑) → print value of dot product.

Master thread and N worker threads: each calculates its local computations, which feed the reduction.

The workload and schedule are determined by OpenMP during runtime.

Page 45:

OpenMP Dot Product

#include <omp.h>
#include <stdio.h>

int main()
{
    int i, n, chunk;
    float a[16], b[16], result;

    n = 16;
    chunk = 4;
    result = 0.0;
    for (i = 0; i < n; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }
#pragma omp parallel for default(shared) private(i) \
    schedule(static,chunk) reduction(+:result)
    for (i = 0; i < n; i++)
        result = result + (a[i] * b[i]);
    printf("Final result= %f\n", result);
}

Reduction example with summation, where the result of the reduction operation stores the dot product of the two vectors: ∑ a[i]*b[i].

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 46:

Demo: Dot Product using Reduction

[cdekate@celeritas l12]$ ./reduction
   a[i]       b[i]       a[i]*b[i]
 0.000000   0.000000     0.000000
 1.000000   2.000000     2.000000
 2.000000   4.000000     8.000000
 3.000000   6.000000    18.000000
 4.000000   8.000000    32.000000
 5.000000  10.000000    50.000000
 6.000000  12.000000    72.000000
 7.000000  14.000000    98.000000
 8.000000  16.000000   128.000000
 9.000000  18.000000   162.000000
10.000000  20.000000   200.000000
11.000000  22.000000   242.000000
12.000000  24.000000   288.000000
13.000000  26.000000   338.000000
14.000000  28.000000   392.000000
15.000000  30.000000   450.000000
Final result= 2480.000000
[cdekate@celeritas l12]$

Page 47:

MPI Dot Product Computation

Master: initialize variables → initialize MPI environment → broadcast size of vectors → get vector A and distribute partitioned vector A → get vector B and distribute partitioned vector B → calculate dot product for the local workload → REDUCTION (∑) → print result.

Worker: initialize variables → initialize MPI environment → receive size of vectors → receive local workload for vector A → receive local workload for vector B → calculate dot product for the local workload → REDUCTION (∑).

Page 48:

MPI Dot Product

#include <stdio.h>
#include "mpi.h"

#define MAX_LOCAL_ORDER 100

int main(int argc, char* argv[])
{
    float local_x[MAX_LOCAL_ORDER];
    float local_y[MAX_LOCAL_ORDER];
    int n;
    int n_bar;    /* = n/p */
    float dot;
    int p;
    int my_rank;
    void Read_vector(char* prompt, float local_v[], int n_bar, int p, int my_rank);
    float Parallel_dot(float local_x[], float local_y[], int n_bar);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank == 0) {
        printf("Enter the order of the vectors\n");
        scanf("%d", &n);
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

Initialize MPI Environment

Broadcast the order of vectors across the workers

Parallel Programming with MPI, by Peter Pacheco

Page 49:

MPI Dot Product

    n_bar = n / p;

    Read_vector("the first vector", local_x, n_bar, p, my_rank);
    Read_vector("the second vector", local_y, n_bar, p, my_rank);

    dot = Parallel_dot(local_x, local_y, n_bar);

    if (my_rank == 0)
        printf("The dot product is %f\n", dot);

    MPI_Finalize();
}   /* main */

void Read_vector(
        char* prompt     /* in  */,
        float local_v[]  /* out */,
        int n_bar        /* in  */,
        int p            /* in  */,
        int my_rank      /* in  */)
{
    int i, q;

Receive and distribute the two vectors

Calculate the parallel dot product for local workloads

Master: Print the result of the dot product

Parallel Programming with MPI, by Peter Pacheco

Page 50:

MPI Dot Product

    float temp[MAX_LOCAL_ORDER];
    MPI_Status status;

    if (my_rank == 0) {
        printf("Enter %s\n", prompt);
        for (i = 0; i < n_bar; i++)
            scanf("%f", &local_v[i]);
        for (q = 1; q < p; q++) {
            for (i = 0; i < n_bar; i++)
                scanf("%f", &temp[i]);
            MPI_Send(temp, n_bar, MPI_FLOAT, q, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(local_v, n_bar, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
    }
}   /* Read_vector */

float Serial_dot(
        float x[]  /* in */,

MASTER: get the input from the user and prepare the local workload.

Get the input from the user; load balance in real time by storing the work chunks in an array and sending the array to the worker nodes for processing.

Worker: receive the local workload to be processed.

Serial_dot(): calculates the dot product on local arrays.

Parallel Programming with MPI, by Peter Pacheco

Page 51:

MPI Dot Product

        float y[]  /* in */,
        int n      /* in */)
{
    int i;
    float sum = 0.0;

    for (i = 0; i < n; i++)
        sum = sum + x[i] * y[i];
    return sum;
}   /* Serial_dot */

float Parallel_dot(
        float local_x[]  /* in */,
        float local_y[]  /* in */,
        int n_bar        /* in */)
{
    float local_dot;
    float dot = 0.0;

    local_dot = Serial_dot(local_x, local_y, n_bar);
    MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    return dot;
}   /* Parallel_dot */

Serial_dot(): calculates the dot product on local arrays.

Parallel_dot(): calls Serial_dot() to perform the dot product on the local workload, then computes the summation using a collective MPI_Reduce call (MPI_SUM).

Parallel Programming with MPI, by Peter Pacheco

Page 52:

Demo: MPI Dot Product

[cdekate@celeritas l13]$ mpirun …. ./mpi_dot
Enter the order of the vectors
16
Enter the first vector
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Enter the second vector
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
The dot product is 2480.000000
[cdekate@celeritas l13]$

Page 53:

Topics

• Introduction
• Mandelbrot Sets
• Monte Carlo: Pi Calculation
• Vector Dot-Product
• Matrix Multiplication

Page 54:

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Matrix Vector Multiplication

Page 55:

Matrix-Vector Multiplication: c = A × b
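The slide's figure is not recoverable from the text; as an illustrative sketch, c = A × b in C for an n × n matrix A stored in row-major order:

/* c[i] = sum over j of A[i][j] * b[j] -- each c[i] is an independent
   dot product, which is what makes this easy to parallelize.          */
void mat_vec(int n, const double A[], const double b[], double c[])
{
    for (int i = 0; i < n; i++) {
        c[i] = 0.0;
        for (int j = 0; j < n; j++)
            c[i] += A[i * n + j] * b[j];
    }
}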

Page 56:

Implementing Matrix Multiplication: Sequential Code

Assume throughout that the matrices are square (n x n matrices). The sequential code to compute A x B could simply be

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        c[i][j] = 0;
        for (k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

This algorithm requires n^3 multiplications and n^3 additions, leading to a sequential time complexity of O(n^3). Very easy to parallelize.

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Page 57:

Implementing Matrix Multiplication

• With n processors (and n x n matrices), we can obtain:
  • Time complexity of O(n^2) with n processors
    • Each instance of the inner loop is independent and can be done by a separate processor
  • Time complexity of O(n) with n^2 processors
    • One element of A and B assigned to each processor
    • Cost optimal since O(n^3) = n x O(n^2) = n^2 x O(n)
  • Time complexity of O(log n) with n^3 processors
    • By parallelizing the inner loop
    • Not cost-optimal since O(n^3) < n^3 x O(log n)
• O(log n) lower bound for parallel matrix multiplication.

Page 58:

Block Matrix Multiplication

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Partitioning into sub-matrices
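A hedged sketch of the idea (illustrative; assumes the block size S divides N and that c[][] starts zeroed): the matrices are processed in S × S sub-matrices, so each (ib, jb) block of C could be assigned to a different processor.

#define N 8
#define S 2    /* block size; S is assumed to divide N */

void block_matmul(double a[N][N], double b[N][N], double c[N][N])
{
    /* C_ib,jb += A_ib,kb * B_kb,jb over all kb, block by block */
    for (int ib = 0; ib < N; ib += S)
        for (int jb = 0; jb < N; jb += S)
            for (int kb = 0; kb < N; kb += S)
                for (int i = ib; i < ib + S; i++)
                    for (int j = jb; j < jb + S; j++)
                        for (int k = kb; k < kb + S; k++)
                            c[i][j] += a[i][k] * b[k][j];
}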

Page 59:

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Matrix Multiplication

Page 60:

Performance Improvement

Using tree construction, n numbers can be added in O(log n) steps (using n^3 processors):

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
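The tree figure is not in the extracted text; as an illustrative sketch, a pairwise reduction in C that sums n numbers in O(log n) levels, where all additions within one level are independent and could run in parallel:

#include <stdio.h>

/* Level with stride s adds pairs s apart; log2(n) levels in total. */
double tree_sum(double x[], int n)
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];      /* independent within a level */
    return x[0];
}

int main(void)
{
    double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %g\n", tree_sum(v, 8));   /* prints 36 */
    return 0;
}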

Page 61:

OpenMP: Flowchart for Matrix Multiplication

Initialize variables & matrices → initialize the OpenMP environment → each thread computes the matrix product for its local workload → print results.

The schedule and workload chunk size are determined based on user preferences during compile/run time. Since each thread works on a portion of the array and updates different parts of the same array, synchronization is not needed.

Page 62:

OpenMP Matrix Multiplication

#include <stdio.h>
#include <omp.h>

/* Main Program */
int main()
{
    int NoofRows_A, NoofCols_A, NoofRows_B, NoofCols_B, i, j, k;
    NoofRows_A = NoofCols_A = NoofRows_B = NoofCols_B = 4;
    float Matrix_A[NoofRows_A][NoofCols_A];
    float Matrix_B[NoofRows_B][NoofCols_B];
    float Result[NoofRows_A][NoofCols_B];

    /* Matrix_A elements */
    for (i = 0; i < NoofRows_A; i++) {
        for (j = 0; j < NoofCols_A; j++)
            Matrix_A[i][j] = i + j;
    }
    /* Matrix_B elements */
    for (i = 0; i < NoofRows_B; i++) {
        for (j = 0; j < NoofCols_B; j++)
            Matrix_B[i][j] = i + j;
    }
    printf("The Matrix_A Is \n");

Initialize the two matrices A[][] & B[][] with the sum of their index values.

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 63:

OpenMP Matrix Multiplication

    for (i = 0; i < NoofRows_A; i++) {
        for (j = 0; j < NoofCols_A; j++)
            printf("%f \t", Matrix_A[i][j]);
        printf("\n");
    }
    printf("The Matrix_B Is \n");
    for (i = 0; i < NoofRows_B; i++) {
        for (j = 0; j < NoofCols_B; j++)
            printf("%f \t", Matrix_B[i][j]);
        printf("\n");
    }
    for (i = 0; i < NoofRows_A; i++) {
        for (j = 0; j < NoofCols_B; j++) {
            Result[i][j] = 0.0;
        }
    }
#pragma omp parallel for private(j,k)
    for (i = 0; i < NoofRows_A; i = i + 1)
        for (j = 0; j < NoofCols_B; j = j + 1)
            for (k = 0; k < NoofCols_A; k = k + 1)
                Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j];
    printf("\nThe Matrix Computation Result Is \n");

Initialize the result matrix with 0.0.

Print the matrices for debugging purposes.

Using the OpenMP parallel for directive, calculate the product of the two matrices. Load balancing is done based on the values of the OpenMP environment variables and the number of threads.

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 64:

OpenMP Matrix Multiplication

    for (i = 0; i < NoofRows_A; i = i + 1) {
        for (j = 0; j < NoofCols_B; j = j + 1)
            printf("%f ", Result[i][j]);
        printf("\n");
    }
}

SRC : https://computing.llnl.gov/tutorials/openMP/

Page 65:

DEMO : OpenMP Matrix Multiplication

[cdekate@celeritas l13]$ ./omp_mm
The Matrix_A Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000
The Matrix_B Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000

The Matrix Computation Result Is
14.000000 20.000000 26.000000 32.000000
20.000000 30.000000 40.000000 50.000000
26.000000 40.000000 54.000000 68.000000
32.000000 50.000000 68.000000 86.000000
[cdekate@celeritas l13]$

Page 66:

Flowchart for MPI Matrix Multiplication

“master”: initialize MPI environment → initialize array → partition array into workloads → send workloads to “workers” → wait for “workers” to finish task → receive results → print results → end.

“workers” (each): initialize MPI environment → receive work → calculate matrix product → send result → end.

Page 67:

Matrix Multiplication (source code)

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 4          /* number of rows in matrix A    */
#define NCA 4          /* number of columns in matrix A */
#define NCB 4          /* number of columns in matrix B */
#define MASTER 0       /* taskid of first task          */
#define FROM_MASTER 1  /* setting a message type        */
#define FROM_WORKER 2  /* setting a message type        */

int main(int argc, char *argv[])
{
    int numtasks,              /* number of tasks in partition */
        taskid,                /* a task identifier */
        numworkers,            /* number of worker tasks */
        source,                /* task id of message source */
        dest,                  /* task id of message destination */
        mtype,                 /* message type */
        rows,                  /* rows of matrix A sent to each worker */
        averow, extra, offset, /* used to determine rows sent to each worker */
        i, j, k, rc;           /* misc */
    double a[NRA][NCA],        /* matrix A to be multiplied */
           b[NCA][NCB],        /* matrix B to be multiplied */
           c[NRA][NCB];        /* result matrix C */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Initialize the MPI environment


Page 68:

Matrix Multiplication (source code)

    if (numtasks < 2) {
        printf("Need at least two MPI tasks. Quitting...\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
        exit(1);
    }
    numworkers = numtasks - 1;

    if (taskid == MASTER) {
        for (i = 0; i < NRA; i++)
            for (j = 0; j < NCA; j++) {
                a[i][j] = i + j + 1;
                b[i][j] = i + j + 1;
            }
        printf("Matrix A :: \n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", a[i][j]);
        }
        printf("Matrix B :: \n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", b[i][j]);
        }

        averow = NRA / numworkers;
        extra = NRA % numworkers;
        offset = 0;
        mtype = FROM_MASTER;

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

MASTER: Initialize the matrix A & B

Print the two matrices for Debugging purposes

Calculate the number of rows to be processed by each worker.

Calculate the number of overflow rows to be processed additionally by each worker.

Page 69:

Matrix Multiplication (source code)

        for (dest = 1; dest <= numworkers; dest++) {
            /* To each worker send: start point, number of rows to process,
               and sub-arrays to process */
            rows = (dest <= extra) ? averow + 1 : averow;
            printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
            MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            offset = offset + rows;
        }

        /* Receive results from worker tasks */
        mtype = FROM_WORKER;   /* message tag for messages sent by "workers" */
        for (i = 1; i <= numworkers; i++) {
            source = i;
            /* offset stores the (processing) starting point of the work chunk */
            MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
            printf("Received results from task %d\n", source);
        }

        printf("******************************************************\n");
        printf("Result Matrix:\n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", c[i][j]);
        }
        printf("\n******************************************************\n");
        printf("Done.\n");
    }

MASTER: send the workload chunk across to each of the workers.

MASTER: receive the computed chunks from the workers; c[][] contains the matrix products calculated for each workload chunk by the corresponding worker.

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Page 70:

Matrix Multiplication (source code)

/**************************** worker task ************************************/
    if (taskid > MASTER) {
        mtype = FROM_MASTER;
        MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

        for (k = 0; k < NCB; k++)
            for (i = 0; i < rows; i++) {
                c[i][k] = 0.0;
                for (j = 0; j < NCA; j++)
                    /* Calculate the product and store the result in C */
                    c[i][k] = c[i][k] + a[i][j] * b[j][k];
            }
        mtype = FROM_WORKER;
        MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        /* Worker sends the resultant array to the master */
        MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
    }
    MPI_Finalize();
}

Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

WORKER: Receive the workload to be processed by each worker

Calculate the matrix product and store the result in c[][].

Send the computed results array to the Master


Page 71:

Demo : Matrix Multiplication

[cdekate@celeritas matrix_multiplication]$ mpirun -np 4 -machinefile ~/hosts ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Matrix A ::
  1.00   2.00   3.00   4.00
  2.00   3.00   4.00   5.00
  3.00   4.00   5.00   6.00
  4.00   5.00   6.00   7.00
Matrix B ::
  1.00   2.00   3.00   4.00
  2.00   3.00   4.00   5.00
  3.00   4.00   5.00   6.00
  4.00   5.00   6.00   7.00
Sending 2 rows to task 1 offset=0
Sending 1 rows to task 2 offset=2
Sending 1 rows to task 3 offset=3
Received results from task 1
Received results from task 2
Received results from task 3
Result Matrix:
 30.00  40.00  50.00  60.00
 40.00  54.00  68.00  82.00
 50.00  68.00  86.00 104.00
 60.00  82.00 104.00 126.00
[cdekate@celeritas matrix_multiplication]$

Page 72: