CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
APPLIED PARALLEL ALGORITHMS 1
Prof. Thomas Sterling
Dr. Hartmut Kaiser
Department of Computer Science
Louisiana State University
March 10th, 2011
Dr. Hartmut Kaiser
Center for Computation & Technology
R315 Johnston
Puzzle of the Day
• What’s the difference between the following valid C function declarations:
void foo(); unspecified number of parameters (no parameters from C23 on)
void foo(void); no parameters
void foo(...); any number of parameters (accepted only from C23 on; earlier C standards require a named parameter before the ellipsis)
• What’s the difference between the following valid C++ function declarations:
void foo(); no parameters
void foo(void); no parameters
void foo(...); any number of parameters
Topics
• Introduction
• Mandelbrot Sets
• Monte Carlo: PI Calculation
• Vector Dot-Product
• Matrix Multiplication
Parallel Programming
• Goals
  – Correctness
  – Reduction in execution time
  – Efficiency
  – Scalability
  – Increased problem size and richness of models
• Objectives
  – Expose parallelism
    • Algorithm design
  – Distribute work uniformly
    • Data decomposition and allocation
    • Dynamic load balancing
  – Minimize overhead of synchronization and communication
    • Coarse granularity
    • Big messages
  – Minimize redundant work
    • Still sometimes better than communication
Basic Parallel (MPI) Program Steps
• Establish logical bindings
• Initialize application execution environment
• Distribute data and work
• Perform core computations in parallel (across nodes)
• Synchronize and exchange intermediate data results
  – Optional: only for non-embarrassingly parallel (cooperative) computations
• Detect “stop” condition
  – May be implicit with a barrier etc.
• Aggregate final results
  – Often a reduction operator
• Output results and error code
• Terminate and return to OS
“Embarrassingly parallel”
• Common phrase
  – Poorly defined
  – Widely used
• Suggests lots and lots of parallelism
  – With essentially no inter-task communication or coordination
  – Highly partitionable workload with minimal overhead
• “Almost embarrassingly parallel”
  – Same as above, but
  – Requires master to launch many tasks
  – Requires master to collect final results of tasks
  – Sometimes still referred to as “embarrassingly parallel”
Mandelbrot set
Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.
Mandelbrot Set
Set of points in a complex plane that are quasi-stable (will increase and decrease, but not exceed some limit) when computed by iterating the function
$z_{k+1} = z_k^2 + c$
where $z_{k+1}$ is the (k + 1)th iteration of the complex number z = (a + bi) and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero.
Iterations continue until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector given by
$z_{\text{length}} = \sqrt{a^2 + b^2}$
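As a quick check of the definition, the iteration can be carried out by hand for two sample points. The values c = 1 and c = -1 are chosen here purely for illustration and do not come from the slides.

```latex
\begin{aligned}
c = 1:\quad  & z_0 = 0,\; z_1 = 0^2 + 1 = 1,\; z_2 = 1^2 + 1 = 2,\; z_3 = 2^2 + 1 = 5
  && \Rightarrow\ |z_3| > 2 \ \text{(escapes, so the point is outside the set)} \\
c = -1:\quad & z_0 = 0,\; z_1 = 0^2 - 1 = -1,\; z_2 = (-1)^2 - 1 = 0,\; z_3 = -1,\;\dots
  && \Rightarrow\ |z_k| \le 1 \ \text{for all } k \ \text{(stays bounded)}
\end{aligned}
```

The first point stops after three iterations, while the second would run until the iteration limit is reached. This difference in work per point is what makes load balancing matter when the computation is parallelized.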
Sequential routine computing the value of one point, returning the number of iterations

typedef struct {
    float real;
    float imag;
} complex;

int cal_pixel(complex c)
{
    int count, max;
    complex z;
    float temp, lengthsq;

    max = 256;
    z.real = 0; z.imag = 0;
    count = 0;                  /* number of iterations */
    do {
        temp = z.real * z.real - z.imag * z.imag + c.real;
        z.imag = 2 * z.real * z.imag + c.imag;
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;
        count++;
    } while ((lengthsq < 4.0) && (count < max));
    return count;
}
Parallelizing Mandelbrot Set Computation
Static Task Assignment
Simply divide the region into a fixed number of parts, each computed by a separate processor.
Not very successful because different regions require different numbers of iterations and time.
Dynamic Task Assignment
Have each processor request a new region after computing its previous region.
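The difference between the two strategies can be illustrated without any message passing. The following is a minimal sketch, not code from the lecture: it assumes each task has a known cost, ignores communication time, and compares the finish time of a fixed block split with that of an idealized work pool. The function names, `MAX_WORKERS`, and the sample costs in the usage note are all made up for illustration.

```c
#include <stddef.h>

#define MAX_WORKERS 64   /* illustrative upper bound on the worker count */

/* Finish time when the n tasks are cut into p contiguous blocks of equal
   length (assumes p divides n). The slowest worker determines the result. */
double static_makespan(const double *cost, size_t n, size_t p)
{
    if (p == 0)
        return -1.0;
    size_t block = n / p;
    double worst = 0.0;
    for (size_t w = 0; w < p; ++w) {
        double busy = 0.0;
        for (size_t t = w * block; t < (w + 1) * block; ++t)
            busy += cost[t];
        if (busy > worst)
            worst = busy;
    }
    return worst;
}

/* Finish time when tasks are handed out one at a time, in order, to the
   worker that becomes idle first (an idealized work pool). */
double dynamic_makespan(const double *cost, size_t n, size_t p)
{
    double busy[MAX_WORKERS] = {0};
    if (p == 0 || p > MAX_WORKERS)
        return -1.0;
    for (size_t t = 0; t < n; ++t) {
        size_t idle = 0;                 /* worker that is free earliest */
        for (size_t w = 1; w < p; ++w)
            if (busy[w] < busy[idle])
                idle = w;
        busy[idle] += cost[t];
    }
    double worst = 0.0;
    for (size_t w = 0; w < p; ++w)
        if (busy[w] > worst)
            worst = busy[w];
    return worst;
}
```

With eight tasks costing 1, 1, 1, 1, 8, 8, 8, 8 and two workers, the static split gives one worker all the expensive tasks and finishes at time 32, while the work pool finishes at time 18. A real work pool also pays for each request and reply, which this sketch leaves out.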
Dynamic Task Assignment: Work Pool/Processor Farms
Flowchart for Mandelbrot Set Generation
[Flowchart: the “master” and each of the “workers” perform the same first four steps in parallel: Initialize MPI Environment → Create Local Workload buffer → Isolate work regions → Calculate Mandelbrot set values across work region. Each worker then performs Send result to “master”. The master performs Write result from task 0 to file → Recv. results from “workers” → Concatenate results to file → End.]
Mandelbrot Sets (source code)

#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
#include <mpi.h>

typedef struct complex {
    double real;
    double imag;
} Complex;

int cal_pixel(Complex c)
{
    int count, max_iter;
    Complex z;
    double temp, lengthsq;

    max_iter = 256;
    z.real = 0;
    z.imag = 0;
    count = 0;
    do {
        temp = z.real * z.real - z.imag * z.imag + c.real;
        z.imag = 2 * z.real * z.imag + c.imag;
        z.real = temp;
        lengthsq = z.real * z.real + z.imag * z.imag;
        count++;
    } while ((lengthsq < 4.0) && (count < max_iter));
    return (count);
}

Source: http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
cal_pixel() runs on every worker process and calculates the iteration $z_{k+1} = z_k^2 + c$ for every pixel
Mandelbrot Sets (source code)

#define MASTERPE 0

int main(int argc, char **argv)
{
    FILE *file;
    int i, j;
    int tmp;
    Complex c;
    double *data_l, *data_l_tmp;
    int nx, ny;
    int mystrt, myend;
    int nrows_l;
    int nprocs, mype;
    MPI_Status status;

    /***** Initializing MPI Environment *****/
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mype);

    /***** Pass in the dimension (X,Y) of the area to cover *****/
    if (argc != 3) {
        int err = 0;
        printf("argc %d\n", argc);
        if (mype == MASTERPE) {
            printf("usage: mandelbrot nx ny");
            MPI_Abort(MPI_COMM_WORLD, err);
        }
    }
    /* get command line args */
    nx = atoi(argv[1]);
    ny = atoi(argv[2]);

Source: http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
Initialize MPI Environment
Check whether the input arguments (the x,y dimensions of the region to be processed) are passed
Mandelbrot Sets (source code)

    /* assume divides equally */
    nrows_l = nx/nprocs;
    mystrt = mype*nrows_l;
    myend = mystrt + nrows_l - 1;

    /* create buffer for local work only */
    data_l = (double *) malloc(nrows_l * ny * sizeof(double));
    data_l_tmp = data_l;

    /* calc each procs coordinates and call local mandelbrot value generation function */
    for (i = mystrt; i <= myend; ++i) {
        c.real = i/((double) nx) * 4. - 2. ;
        for (j = 0; j < ny; ++j) {
            c.imag = j/((double) ny) * 4. - 2. ;
            tmp = cal_pixel(c);
            *data_l++ = (double) tmp;
        }
    }
    data_l = data_l_tmp;

Source: http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
Determining the dimensions of the work to be performed by each concurrent task.
Local tasks calculate the coordinates for each pixel in the local region. For each pixel, the cal_pixel() function is called and the corresponding value is calculated.
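The source code computes `nrows_l = nx/nprocs` under the comment "assume divides equally", so any leftover rows would be skipped when `nx` is not a multiple of `nprocs`. One common way to lift that assumption is sketched below. The helper name `block_range` and its half-open range convention are assumptions made for this example, not part of the original program.

```c
/* Compute the half-open row range [*start, *end) owned by `rank` when n rows
   are split over p processes. Every rank gets n / p rows, and the first
   n % p ranks get one extra row each, so all rows are covered exactly once. */
void block_range(int n, int p, int rank, int *start, int *end)
{
    int base = n / p;     /* rows that every rank receives */
    int extra = n % p;    /* leftover rows, one each for the lowest ranks */

    *start = rank * base + (rank < extra ? rank : extra);
    *end = *start + base + (rank < extra ? 1 : 0);
}
```

For 10 rows on 4 processes this yields the ranges [0,3), [3,6), [6,8), and [8,10). If the Mandelbrot program used such a split, the master would also have to receive a different element count from each worker instead of the fixed `nrows_l * ny`.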
Mandelbrot Sets (source code)

    if (mype == MASTERPE) {
        file = fopen("mandelbrot.bin_0000", "w");
        printf("nrows_l, ny %d %d\n", nrows_l, ny);
        fwrite(data_l, nrows_l*ny, sizeof(double), file);
        fclose(file);
        for (i = 1; i < nprocs; ++i) {
            MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
            printf("received message from proc %d\n", i);
            file = fopen("mandelbrot.bin_0000", "a");
            fwrite(data_l, nrows_l*ny, sizeof(double), file);
            fclose(file);
        }
    } else {
        MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}

Source: http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/
Master process opens a file to store output into and stores its values in the file
Master then waits to receive values computed by each of the worker processes
Worker processes send computed Mandelbrot values of their region to the master process
Demo: Mandelbrot Sets
Monte Carlo Simulation
• Used when it is infeasible or impossible to compute an exact result with a deterministic algorithm
• Especially useful in
  – Studying systems with a large number of coupled degrees of freedom
    • Fluids, disordered materials, strongly coupled solids, cellular structures
  – For modeling phenomena with significant uncertainty in inputs
    • The calculation of risk in business
  – These methods are also widely used in mathematics
    • The evaluation of definite integrals, particularly multidimensional integrals with complicated boundary conditions
Monte Carlo Simulation
• No single approach, multitude of different methods
• Usually follows pattern
  – Define a domain of possible inputs
  – Generate inputs randomly from the domain
  – Perform a deterministic computation using the inputs
  – Aggregate the results of the individual computations into the final result
• Example: calculate Pi
Monte Carlo: Algorithm for Pi
• The value of PI can be calculated in a number of ways. Consider the following method of approximating PI: inscribe a circle in a square
• Randomly generate points in the square
• Determine the number of points in the square that are also in the circle
• Let r be the number of points in the circle divided by the number of points in the square
• PI ≈ 4r
• Note that the more points generated, the better the approximation
• Algorithm:
npoints = 10000
circle_count = 0
do j = 1,npoints
    generate 2 random numbers between 0 and 1
    xcoordinate = random1 ; ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle
    then circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
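The pseudocode above translates almost line for line into C. The version below is a serial sketch under two assumptions that are not in the slides: the random numbers come from a small hand-written linear congruential generator (so that results are reproducible across platforms), and points are drawn from the unit square, so the test region is the quarter circle x² + y² ≤ 1. The names `next_uniform` and `estimate_pi` are illustrative.

```c
#include <stdint.h>

/* Small 64-bit linear congruential generator (Knuth's MMIX constants).
   Returns a double in [0, 1) built from the top 53 bits of the state. */
static double next_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0;   /* divide by 2^53 */
}

/* Estimate pi from npoints random points in the unit square: the fraction
   that lands inside the quarter circle x^2 + y^2 <= 1 approaches pi / 4. */
double estimate_pi(long npoints, uint64_t seed)
{
    uint64_t state = seed;
    long circle_count = 0;

    for (long j = 0; j < npoints; ++j) {
        double x = next_uniform(&state);
        double y = next_uniform(&state);
        if (x * x + y * y <= 1.0)
            ++circle_count;
    }
    return 4.0 * (double)circle_count / (double)npoints;
}
```

Calling `estimate_pi(1000000, 42)` should give a value close to 3.14. The statistical error shrinks only with the square root of the number of points, which is why the method benefits from running many independent samples in parallel.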
OpenMP Pi Calculation
[Flowchart: the master thread performs Initialize variables → Initialize OpenMP parallel environment. The master thread and N worker threads then each perform Generate random X,Y → Calculate Z = X^2 + Y^2 → If point lies within the circle (Y: Count++; N: skip). The per-thread counts are combined by Reduction ∑, after which the master thread performs Calculate PI → Print value of pi.]
OpenMP Calculating Pi

#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#define SEED 42

int main(int argc, char *argv[])
{
    int niter = 0;
    double x, y;
    int i, tid, count = 0;   /* # of points in the 1st quadrant of unit circle */
    double z;
    double pi;
    time_t rawtime;
    struct tm *timeinfo;

    printf("Enter the number of iterations used to estimate pi: ");
    scanf("%d", &niter);
    time(&rawtime);
    timeinfo = localtime(&rawtime);

Seed for generating random number
http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML
OpenMP Calculating Pi

    printf("The current date/time is: %s", asctime(timeinfo));
    /* initialize random numbers */
    srand(SEED);
#pragma omp parallel for private(x,y,z,tid) reduction(+:count)
    for (i = 0; i < niter; i++) {
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        z = (x*x + y*y);
        if (z <= 1) count++;
        if (i == (niter/6)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == (niter/3)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == (niter/2)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML
Initialize random number generator; srand is used to seed the random numbers generated by rand()
Initialize OpenMP parallel for with reduction (∑)
Randomly generate x,y points
Calculate x^2+y^2 and check if it lies within the circle; if yes then increment count
Calculating Pi

        if (i == (2*niter/3)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == (5*niter/6)-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
        if (i == niter-1) {
            tid = omp_get_thread_num();
            printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
        }
    }
    time(&rawtime);
    timeinfo = localtime(&rawtime);
    printf("The current date/time is: %s", asctime(timeinfo));
    printf(" the total count is %i\n", count);
    pi = (double)count/niter*4;
    printf("# of trials= %d , estimate of pi is %g \n", niter, pi);
    return 0;
}

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML
Calculate PI based on the aggregate count of the points that lie within the circle
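The listing calls `rand()` from every thread. `rand()` keeps a single hidden state, and the C standard does not require it to be free of data races, so concurrent calls may race or be serialized depending on the library. A possible alternative, sketched here as an assumption rather than as the lecture's method, gives each thread its own generator state. The xorshift generator, the seed mixing constants, and the name `count_hits` are illustrative choices; the `_OPENMP` guard lets the same file compile and run serially when OpenMP is disabled.

```c
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* One xorshift64 step; returns a double in [0, 1). State must be nonzero. */
static double xorshift_uniform(uint64_t *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    return (double)(*s >> 11) / 9007199254740992.0;   /* divide by 2^53 */
}

/* Count how many of niter random points fall inside the quarter circle.
   Every thread draws from a private generator, so no state is shared. */
long count_hits(long niter, uint64_t seed)
{
    long count = 0;
#pragma omp parallel reduction(+:count)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        /* Derive a distinct, nonzero starting state for each thread. */
        uint64_t state = (seed + 1) * 0x9E3779B97F4A7C15ULL
                       + (uint64_t)tid * 0xBF58476D1CE4E5B9ULL;
        if (state == 0)
            state = 1;

        long i;
#pragma omp for
        for (i = 0; i < niter; i++) {
            double x = xorshift_uniform(&state);
            double y = xorshift_uniform(&state);
            if (x * x + y * y <= 1.0)
                count++;
        }
    }
    return count;
}
```

The estimate is then `4.0 * count_hits(niter, seed) / niter`. Because the streams depend on the thread count, the exact hit count changes with the number of threads, but the estimate should stay close to pi.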
Demo: OpenMP Pi

[cdekate@celeritas l13]$ ./omcpi
Enter the number of iterations used to estimate pi: 100000
The current date/time is: Tue Mar 4 05:53:52 2008
 thread 0 just did iteration 16665 the count is 13124
 thread 1 just did iteration 33332 the count is 6514
 thread 1 just did iteration 49999 the count is 19609
 thread 2 just did iteration 66665 the count is 13048
 thread 3 just did iteration 83332 the count is 6445
 thread 3 just did iteration 99999 the count is 19489
The current date/time is: Tue Mar 4 05:53:52 2008
 the total count is 78320
# of trials= 100000 , estimate of pi is 3.1328
[cdekate@celeritas l13]$
Creating Custom Communicators
• Communicators define groups and the access patterns among them
• Default communicator is MPI_COMM_WORLD
• Some algorithms demand more sophisticated control of communications to take advantage of reduction operators
• MPI permits creation of custom communicators
  – MPI_Comm_create
MPI Monte Carlo Pi Computation
[Flowchart with three roles.
Server: Initialize MPI Environment → Receive Request → Compute Random Array → Send Array to Requestor → Last Request? (N: back to Receive Request; Y: Finalize MPI).
Master: Initialize MPI Environment → Broadcast Error Bound → Send Request to Server → Receive Random Array → Perform Computations → Propagate Number of Points (Allreduce) → Output Partial Result → Stop Condition Satisfied? (N: back to Send Request to Server; Y: Print Statistics → Finalize MPI).
Worker: Initialize MPI Environment → Receive Error Bound → Send Request to Server → Receive Random Array → Perform Computations → Propagate Number of Points (Allreduce) → Stop Condition Satisfied? (N: back to Send Request to Server; Y: Finalize MPI).]
Monte Carlo : MPI - Pi (source code)

#include <stdio.h>
#include <stdlib.h>   /* random() */
#include <math.h>
#include "mpi.h"
#define CHUNKSIZE 1000
#define INT_MAX 1000000000
#define REQUEST 1
#define REPLY 2

int main(int argc, char *argv[])
{
    int iter;
    int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
    double x, y, Pi, error, epsilon;
    int numprocs, myid, server, totalin, totalout, workerid;
    int rands[CHUNKSIZE], request;
    MPI_Comm world, workers;
    MPI_Group world_group, worker_group;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    world = MPI_COMM_WORLD;
    MPI_Comm_size(world, &numprocs);
    MPI_Comm_rank(world, &myid);

Initialize MPI environment
Monte Carlo : MPI - Pi (source code)

    server = numprocs-1;   /* last proc is server */
    if (myid == 0)
        sscanf(argv[1], "%lf", &epsilon);
    MPI_Bcast(&epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Comm_group(world, &world_group);
    ranks[0] = server;
    MPI_Group_excl(world_group, 1, ranks, &worker_group);
    MPI_Comm_create(world, worker_group, &workers);
    MPI_Group_free(&worker_group);
    if (myid == server) {
        do {
            MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST, world, &status);
            if (request) {
                for (i = 0; i < CHUNKSIZE; ) {
                    rands[i] = random();
                    if (rands[i] <= INT_MAX) i++;
                }
                /* Send random number array */
                MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE, REPLY, world);
            }
        } while (request > 0);
    } else {   /* Begin Worker Block */
        request = 1;
        done = in = out = 0;
        max = INT_MAX;   /* max int, for normalization */
        MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
        MPI_Comm_rank(workers, &workerid);
        iter = 0;

Broadcast Error Bounds: epsilon
Create a custom communicator
Server process: 1. Receives a request to generate a random number array, 2. Computes the random number array, 3. Sends the array to the requestor
Worker process: Requests the server to generate a random number array
Monte Carlo : MPI - Pi (source code)

        while (!done) {
            iter++;
            request = 1;
            /* Recv. random array from server */
            MPI_Recv(rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status);
            for (i = 0; i < CHUNKSIZE-1; ) {
                x = (((double) rands[i++])/max) * 2 - 1;
                y = (((double) rands[i++])/max) * 2 - 1;
                if (x*x + y*y < 1.0)
                    in++;
                else
                    out++;
            }
            MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
            MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
            Pi = (4.0*totalin)/(totalin + totalout);
            error = fabs(Pi-3.141592653589793238462643);
            done = (error < epsilon || (totalin+totalout) > 1000000);
            request = (done) ? 0 : 1;
            if (myid == 0) {   /* If "Master" : Print current value of PI */
                printf("\rpi = %23.20f", Pi);
                MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
            } else {           /* If "Worker" : Request new array if not finished */
                if (request)
                    MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
            }
        }
        MPI_Comm_free(&workers);
    }

Worker: Receive random number array from the server
Worker: For each pair of x,y in the random number array, calculate the coordinates
Determine if the point is inside or outside of the circle
Compute the value of pi and check if the error is within the threshold
Print current value of PI and request more work
Monte Carlo : MPI - Pi (source code)

    if (myid == 0) {   /* If "Master" : Print Results */
        printf("\npoints: %d\nin: %d, out: %d, <ret> to exit\n", totalin+totalout, totalin, totalout);
        getchar();
    }
    MPI_Finalize();
}

Print the final value of PI
Demo: MPI Monte Carlo, Pi

> mpirun -np 4 monte 1e-20
pi = 3.14164517741129456496
points: 1000500
in: 785804, out: 214696
Vector Dot Product
• Multiplication of 2 vectors followed by summation
For A = (X_1, X_2, …, X_n) and B = (Y_1, Y_2, …, Y_n):
$A \cdot B = \sum_{i=1}^{n} A[i] \cdot B[i] = X_1 Y_1 + X_2 Y_2 + X_3 Y_3 + \dots + X_n Y_n$
OpenMP Dot Product: using Reduction
[Flowchart: the master thread performs Initialize variables → Initialize OpenMP parallel environment. The master thread and N worker threads each perform Calculate local computations; workload and schedule are determined by OpenMP during runtime. The partial results are combined by REDUCTION ∑, and the master thread performs Print value of Dot Product.]
OpenMP Dot Product

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, n, chunk;
    float a[16], b[16], result;

    n = 16;
    chunk = 4;
    result = 0.0;
    for (i = 0; i < n; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }
#pragma omp parallel for default(shared) private(i) \
        schedule(static,chunk) reduction(+:result)
    for (i = 0; i < n; i++)
        result = result + (a[i] * b[i]);
    printf("Final result= %f\n", result);
    return 0;
}

Reduction example with summation, where the result of the reduction operation stores the dot product of two vectors, ∑a[i]*b[i]
SRC: https://computing.llnl.gov/tutorials/openMP/
Demo: Dot Product using Reduction

[cdekate@celeritas l12]$ ./reduction
a[i]         b[i]         a[i]*b[i]
0.000000     0.000000     0.000000
1.000000     2.000000     2.000000
2.000000     4.000000     8.000000
3.000000     6.000000     18.000000
4.000000     8.000000     32.000000
5.000000     10.000000    50.000000
6.000000     12.000000    72.000000
7.000000     14.000000    98.000000
8.000000     16.000000    128.000000
9.000000     18.000000    162.000000
10.000000    20.000000    200.000000
11.000000    22.000000    242.000000
12.000000    24.000000    288.000000
13.000000    26.000000    338.000000
14.000000    28.000000    392.000000
15.000000    30.000000    450.000000
Final result= 2480.000000
[cdekate@celeritas l12]$
MPI Dot Product Computation
[Flowchart with two roles.
Master: Initialize Variables → Initialize MPI Environment → Broadcast Size of Vectors → Get Vector A & Distribute Partitioned Vector A → Get Vector B & Distribute Partitioned Vector B → Calculate dot-product for local workloads → REDUCTION ∑ → Print Result.
Worker: Initialize Variables → Initialize MPI environment → Receive Size of vectors → Receive local workload for Vector A → Receive local workload for Vector B → Calculate dot-product for local workloads → REDUCTION ∑.]
MPI Dot Product

#include <stdio.h>
#include "mpi.h"
#define MAX_LOCAL_ORDER 100

int main(int argc, char* argv[])
{
    float local_x[MAX_LOCAL_ORDER];
    float local_y[MAX_LOCAL_ORDER];
    int n;
    int n_bar;   /* = n/p */
    float dot;
    int p;
    int my_rank;
    void Read_vector(char* prompt, float local_v[], int n_bar, int p, int my_rank);
    float Parallel_dot(float local_x[], float local_y[], int n_bar);

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank == 0) {
        printf("Enter the order of the vectors\n");
        scanf("%d", &n);
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

Initialize MPI Environment
Broadcast the order of vectors across the workers
Source: Parallel Programming with MPI, by Peter Pacheco
MPI Dot Product

    n_bar = n/p;
    Read_vector("the first vector", local_x, n_bar, p, my_rank);
    Read_vector("the second vector", local_y, n_bar, p, my_rank);

    dot = Parallel_dot(local_x, local_y, n_bar);

    if (my_rank == 0)
        printf("The dot product is %f\n", dot);

    MPI_Finalize();
} /* main */

void Read_vector(
        char* prompt     /* in  */,
        float local_v[]  /* out */,
        int n_bar        /* in  */,
        int p            /* in  */,
        int my_rank      /* in  */)
{
    int i, q;

Receive and distribute the two vectors
Calculate the parallel dot product for local workloads
Master: Print the result of the dot product
Source: Parallel Programming with MPI, by Peter Pacheco
MPI Dot Product

    float temp[MAX_LOCAL_ORDER];
    MPI_Status status;

    if (my_rank == 0) {
        printf("Enter %s\n", prompt);
        for (i = 0; i < n_bar; i++)
            scanf("%f", &local_v[i]);
        for (q = 1; q < p; q++) {
            for (i = 0; i < n_bar; i++)
                scanf("%f", &temp[i]);
            MPI_Send(temp, n_bar, MPI_FLOAT, q, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(local_v, n_bar, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
    }
} /* Read_vector */

float Serial_dot(
        float x[]  /* in */,

MASTER: Get the input from the user and prepare the local workload
Get the input from the user and load balance in real time by storing the work chunks in an array and sending the array to the worker nodes for processing
Worker: Receive the local workload to be processed
Serial_dot(): calculates the dot product on local arrays
Source: Parallel Programming with MPI, by Peter Pacheco
MPI Dot Product

        float y[]  /* in */,
        int n      /* in */)
{
    int i;
    float sum = 0.0;

    for (i = 0; i < n; i++)
        sum = sum + x[i]*y[i];
    return sum;
} /* Serial_dot */

float Parallel_dot(
        float local_x[]  /* in */,
        float local_y[]  /* in */,
        int n_bar        /* in */)
{
    float local_dot;
    float dot = 0.0;

    local_dot = Serial_dot(local_x, local_y, n_bar);
    MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    return dot;
} /* Parallel_dot */

Serial_dot(): calculates the dot product on local arrays
Parallel_dot(): calls Serial_dot() to perform the dot product for the local workload
Calculate the dot product and calculate the summation using the collective MPI_Reduce call (SUM)
Source: Parallel Programming with MPI, by Peter Pacheco
Demo: MPI Dot Product

[cdekate@celeritas l13]$ mpirun …. ./mpi_dot
Enter the order of the vectors
16
Enter the first vector
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Enter the second vector
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
The dot product is 2480.000000
[cdekate@celeritas l13]$
Matrix-Vector Multiplication
c = A × b
Implementing Matrix Multiplication
Sequential Code
Assume throughout that the matrices are square (n × n matrices). The sequential code to compute A × B could simply be

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        c[i][j] = 0;
        for (k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

This algorithm requires $n^3$ multiplications and $n^3$ additions, leading to a sequential time complexity of $O(n^3)$. Very easy to parallelize.
CSC 7600 Lecture 15: Applied Parallel Algorithms 1, Spring 2011
Implementing Matrix Multiplication
• With n processors (and n x n matrices), we can obtain:• Time complexity of O(n2) with n processors• Each instance of inner loop is independent and can be done by a
separate processor
• Time complexity of O(n) with n2 processors• One element of A and B assigned to each processor.• Cost optimal since O(n3) = n x O(n2) = n2 x O(n).
• Time complexity of O(log n) with n3 processors• By parallelizing the inner loop. • Not cost-optimal since O(n3) < n3 x O(log n).
• O(log n) lower bound for parallel matrix multiplication.
Block Matrix Multiplication
Partitioning into submatrices
Matrix Multiplication
Performance Improvement
Using a tree construction, n numbers can be added in O(log n) steps (using n^3 processors in total, n for each of the n^2 elements of the result).
OpenMP: Flowchart for Matrix Multiplication
• Initialize variables & matrices
• Initialize OpenMP Environment
• Compute the matrix product for the local workload (one such step per thread, executed in parallel)
• Print Results
Schedule and workload chunk size are determined based on user preferences at compile time or run time.
Since each thread works on its own portion of the array and updates different parts of the same array, synchronization is not needed.
OpenMP Matrix Multiplication
#include <stdio.h>
#include <omp.h>

/* Main Program */
int main(void)
{
    int NoofRows_A, NoofCols_A, NoofRows_B, NoofCols_B, i, j, k;

    NoofRows_A = NoofCols_A = NoofRows_B = NoofCols_B = 4;

    float Matrix_A[NoofRows_A][NoofCols_A];
    float Matrix_B[NoofRows_B][NoofCols_B];
    float Result[NoofRows_A][NoofCols_B];

    /* Matrix_A Elements */
    for (i = 0; i < NoofRows_A; i++) {
        for (j = 0; j < NoofCols_A; j++)
            Matrix_A[i][j] = i + j;
    }

    /* Matrix_B Elements */
    for (i = 0; i < NoofRows_B; i++) {
        for (j = 0; j < NoofCols_B; j++)
            Matrix_B[i][j] = i + j;
    }

    printf("The Matrix_A Is \n");
Initialize the two matrices A[][] & B[][] with the sum of their index values
SRC : https://computing.llnl.gov/tutorials/openMP/
OpenMP Matrix Multiplication
    for (i = 0; i < NoofRows_A; i++) {
        for (j = 0; j < NoofCols_A; j++)
            printf("%f \t", Matrix_A[i][j]);
        printf("\n");
    }

    printf("The Matrix_B Is \n");
    for (i = 0; i < NoofRows_B; i++) {
        for (j = 0; j < NoofCols_B; j++)
            printf("%f \t", Matrix_B[i][j]);
        printf("\n");
    }

    /* Clear the result matrix before accumulating into it */
    for (i = 0; i < NoofRows_A; i++) {
        for (j = 0; j < NoofCols_B; j++) {
            Result[i][j] = 0.0;
        }
    }

    /* Rows of Result are divided among the threads; j and k are private to each thread */
#pragma omp parallel for private(j,k)
    for (i = 0; i < NoofRows_A; i = i + 1)
        for (j = 0; j < NoofCols_B; j = j + 1)
            for (k = 0; k < NoofCols_A; k = k + 1)
                Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j];

    printf("\nThe Matrix Computation Result Is \n");
Print the matrices for debugging purposes
Initialize the results matrix with 0.0
Using the OpenMP parallel for directive: calculate the product of the two matrices. Load balancing is done based on the values of the OpenMP environment variables and the number of threads.
OpenMP Matrix Multiplication
    for (i = 0; i < NoofRows_A; i = i + 1) {
        for (j = 0; j < NoofCols_B; j = j + 1)
            printf("%f ", Result[i][j]);
        printf("\n");
    }
    return 0;
}
DEMO : OpenMP Matrix Multiplication
[cdekate@celeritas l13]$ ./omp_mm
The Matrix_A Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000
The Matrix_B Is
0.000000 1.000000 2.000000 3.000000
1.000000 2.000000 3.000000 4.000000
2.000000 3.000000 4.000000 5.000000
3.000000 4.000000 5.000000 6.000000

The Matrix Computation Result Is
14.000000 20.000000 26.000000 32.000000
20.000000 30.000000 40.000000 50.000000
26.000000 40.000000 54.000000 68.000000
32.000000 50.000000 68.000000 86.000000
[cdekate@celeritas l13]$
Flowchart for MPI Matrix Multiplication
“master”
• Initialize MPI Environment
• Initialize Array
• Partition Array into workloads
• Send Workload to “workers”
• Wait for “workers” to finish task
• Receive results
• Print results
• End
“workers” (each worker, in parallel)
• Initialize MPI Environment
• Receive work
• Calculate matrix product
• Send result
Matrix Multiplication (source code)
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define NRA 4          /* number of rows in matrix A */
#define NCA 4          /* number of columns in matrix A */
#define NCB 4          /* number of columns in matrix B */
#define MASTER 0       /* taskid of first task */
#define FROM_MASTER 1  /* setting a message type */
#define FROM_WORKER 2  /* setting a message type */

int main(int argc, char *argv[])
{
    int numtasks,              /* number of tasks in partition */
        taskid,                /* a task identifier */
        numworkers,            /* number of worker tasks */
        source,                /* task id of message source */
        dest,                  /* task id of message destination */
        mtype,                 /* message type */
        rows,                  /* rows of matrix A sent to each worker */
        averow, extra, offset, /* used to determine rows sent to each worker */
        i, j, k, rc;           /* misc */
    double a[NRA][NCA],        /* matrix A to be multiplied */
           b[NCA][NCB],        /* matrix B to be multiplied */
           c[NRA][NCB];        /* result matrix C */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
Source : http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c
Initialize the MPI environment
Matrix Multiplication (source code)
    if (numtasks < 2) {
        printf("Need at least two MPI tasks. Quitting...\n");
        rc = 1;                          /* error code handed to MPI_Abort */
        MPI_Abort(MPI_COMM_WORLD, rc);
        exit(1);
    }
    numworkers = numtasks - 1;

    if (taskid == MASTER) {
        /* A and B have the same shape here (NRA == NCA == NCB), so one loop nest fills both */
        for (i = 0; i < NRA; i++)
            for (j = 0; j < NCA; j++) {
                a[i][j] = i + j + 1;
                b[i][j] = i + j + 1;
            }

        printf("Matrix A :: \n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCA; j++)
                printf("%6.2f ", a[i][j]);
        }

        printf("Matrix B :: \n");
        for (i = 0; i < NCA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", b[i][j]);
        }

        averow = NRA / numworkers;   /* rows every worker gets */
        extra = NRA % numworkers;    /* leftover rows */
        offset = 0;
        mtype = FROM_MASTER;
MASTER: Initialize the matrices A & B
Print the two matrices for debugging purposes
Calculate the number of rows to be processed by each worker (averow)
Calculate the number of leftover rows (extra); the first "extra" workers each process one additional row
Matrix Multiplication (source code)
        for (dest = 1; dest <= numworkers; dest++) {
            /* To each worker send: start point, number of rows to process,
               and sub-arrays to process */
            rows = (dest <= extra) ? averow + 1 : averow;
            printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
            MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            offset = offset + rows;
        }

        /* Receive results from worker tasks */
        mtype = FROM_WORKER;   /* Message tag for messages sent by "workers" */
        for (i = 1; i <= numworkers; i++) {
            source = i;
            /* offset stores the (processing) starting point of work chunk */
            MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
            printf("Received results from task %d\n", source);
        }

        printf("******************************************************\n");
        printf("Result Matrix:\n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", c[i][j]);
        }
        printf("\n******************************************************\n");
        printf("Done.\n");
    }
MASTER: Send the workload chunk to each of the workers
MASTER: Receive the results from the workers. c[][] contains the matrix products calculated for each workload chunk by the corresponding worker
Matrix Multiplication (source code)
    /**************************** worker task ************************************/
    if (taskid > MASTER) {
        mtype = FROM_MASTER;
        MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

        for (k = 0; k < NCB; k++)
            for (i = 0; i < rows; i++) {
                c[i][k] = 0.0;
                for (j = 0; j < NCA; j++)
                    /* Calculate the product and store result in C */
                    c[i][k] = c[i][k] + a[i][j] * b[j][k];
            }

        mtype = FROM_WORKER;
        MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        /* Worker sends the resultant array to the master */
        MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
WORKER: Receive the workload to be processed by each worker
Calculate the matrix product and store the result in c[][]
Send the computed results array to the master
Demo : Matrix Multiplication
[cdekate@celeritas matrix_multiplication]$ mpirun -np 4 -machinefile ~/hosts ./mpi_mm
mpi_mm has started with 4 tasks.
Initializing arrays...
Matrix A ::
  1.00   2.00   3.00   4.00
  2.00   3.00   4.00   5.00
  3.00   4.00   5.00   6.00
  4.00   5.00   6.00   7.00
Matrix B ::
  1.00   2.00   3.00   4.00
  2.00   3.00   4.00   5.00
  3.00   4.00   5.00   6.00
  4.00   5.00   6.00   7.00
Sending 2 rows to task 1 offset=0
Sending 1 rows to task 2 offset=2
Sending 1 rows to task 3 offset=3
Received results from task 1
Received results from task 2
Received results from task 3
Result Matrix:
 30.00  40.00  50.00  60.00
 40.00  54.00  68.00  82.00
 50.00  68.00  86.00 104.00
 60.00  82.00 104.00 126.00
[cdekate@celeritas matrix_multiplication]$