MPI and OpenMP
By: Jesus Caban
and
Matt McKnight
What is MPI?
• MPI: Message Passing Interface
  – It is not a new programming language; it is a library of functions that can be called from C/Fortran/Python
  – Successor to PVM (Parallel Virtual Machine)
  – Developed by an open, international forum with representation from industry, academia, and government laboratories
What is it for?
• Allows data to be passed between processes in a distributed-memory environment
• Provides source-code portability
• Allows efficient implementations
• Offers a great deal of functionality
• Supports heterogeneous parallel architectures
MPI Communicator
• Idea:
  – A group of processes that are allowed to communicate with each other
• Most often used communicator: MPI_COMM_WORLD
• Note the MPI naming convention:
  MPI_XXX                      (constants)
  var = MPI_Xxx(parameters);   (function calls)
  MPI_Xxx(parameters);
Getting Started
• Include MPI header file
• Initialize MPI environment
• Work: make message-passing calls (Send, Receive)
• Terminate MPI environment
Include File
Include MPI header file
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
  …
}
Initialize MPI
Initialize MPI environment
int main(int argc, char** argv)
{
  int numtasks, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  ...
}
Initialize MPI (cont.)
MPI_Init(&argc, &argv)
  No MPI functions may be called before this call.

MPI_Comm_size(MPI_COMM_WORLD, &numtasks)
  A communicator is a collection of processes that can send messages to each other. MPI_COMM_WORLD is a predefined communicator that consists of all the processes running when program execution begins.

MPI_Comm_rank(MPI_COMM_WORLD, &rank)
  Lets a process find out its rank within the communicator.
Terminate MPI environment
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
  …
  MPI_Finalize();
}

No MPI functions may be called after this call.
Let’s work with MPI
Work: make message-passing calls (Send, Receive)

if (my_rank != 0) {
  MPI_Send(data, strlen(data)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
} else {
  MPI_Recv(data, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
}
Work (cont.)
int MPI_Send(void* message, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

int MPI_Recv(void* message, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status* status)
Hello World!!

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
  int my_rank, p, source, dest, tag = 0;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  if (my_rank != 0) {
    /* Create message and send it to process 0 */
    sprintf(message, "Hello from process %d!", my_rank);
    dest = 0;
    MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  } else {
    /* Process 0 receives one message from every other process */
    for (source = 1; source < p; source++) {
      MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }
  MPI_Finalize();
  return 0;
}
Compile and Run MPI
• Compile
  – gcc -o hello.exe mpi_hello.c -lmpi
  – mpicc -o hello.exe mpi_hello.c
• Run
  – mpirun -np 5 hello.exe
• Output
  $ mpirun -np 5 hello.exe
  Hello from process 1!
  Hello from process 2!
  Hello from process 3!
  Hello from process 4!
More MPI Functions
• MPI_Bcast(void *m, int s, MPI_Datatype dt, int root, MPI_Comm comm)
  – Sends a copy of the data in m on the process with rank root to each process in the communicator.
• MPI_Reduce(void *operand, void *result, int count, MPI_Datatype datatype, MPI_Op operator, int root, MPI_Comm comm)
  – Combines the operands stored in the memory referenced by operand using operation operator and stores the result in result on process root.
• double MPI_Wtime(void)
  – Returns a double-precision value that represents the number of seconds that have elapsed since some point in the past.
• MPI_Barrier(MPI_Comm comm)
  – Each process in comm blocks until every process in comm has called it.
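For illustration, here is a minimal sketch (not from the original slides; the variable names and the value of n are made up) showing how these four calls fit together: process 0 broadcasts a value, every process contributes to a reduction, and MPI_Barrier/MPI_Wtime bracket the timed region.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, np, n = 0;
    double local, total, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (rank == 0) n = 100;                       /* only root knows n initially     */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* now every process has n         */

    MPI_Barrier(MPI_COMM_WORLD);                  /* line everyone up before timing  */
    t0 = MPI_Wtime();

    local = (double)(rank + 1) * n;               /* each process computes something */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    t1 = MPI_Wtime();
    if (rank == 0)
        printf("total = %f, time = %f secs\n", total, t1 - t0);

    MPI_Finalize();
    return 0;
}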
More Examples
• Trapezoidal Rule:
  – Integral from a to b of a nonnegative function f(x)
  – Approach: estimate the area by partitioning the region into regular geometric shapes, then add up the areas of those shapes (a parallel sketch of this rule follows below)
• Compute Pi
  f(x) = 4/(1 + x^2),   pi = Integral{0->1} f(x) dx
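A hedged sketch of the parallel trapezoidal rule described above (the integrand, interval, and variable names are illustrative assumptions): each process applies the rule to its own sub-interval and MPI_Reduce adds up the pieces on process 0.

#include <stdio.h>
#include <mpi.h>

#define f(x) (4./(1. + (x)*(x)))           /* integrand; Integral{0->1} f(x) dx = pi */

int main(int argc, char** argv)
{
    int rank, np, i, n = 1024, local_n;     /* n = total number of trapezoids        */
    double a = 0.0, b = 1.0, h, local_a, local_b, local_sum, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    h = (b - a) / n;                        /* width of one trapezoid                 */
    local_n = n / np;                       /* trapezoids per process (assumes np divides n) */
    local_a = a + rank * local_n * h;       /* left endpoint of this process's piece  */
    local_b = local_a + local_n * h;

    /* trapezoidal rule on [local_a, local_b] */
    local_sum = (f(local_a) + f(local_b)) / 2.0;
    for (i = 1; i < local_n; i++)
        local_sum += f(local_a + i * h);
    local_sum *= h;

    /* add the partial integrals on process 0 */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("With %d trapezoids, estimate = %.14f\n", n, total);

    MPI_Finalize();
    return 0;
}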
Compute PI

#include <stdio.h>
#include <math.h>
#include "mpi.h"
#define PI      3.141592653589793238462643
#define PI_STR "3.141592653589793238462643"
#define MAXLEN 40
#define f(x) (4./(1.+ (x)*(x)))

int main(int argc, char *argv[])
{
  int N = 0, rank, nprocrs, i, answer = 1;
  double mypi, pi, h, sum, x, starttime, endtime, runtime, runtime_max;
  char buff[MAXLEN];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("Procr %d saying hello\n", rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocrs);
  if (rank == 0)
    printf("Using a total of %d CPUs\n", nprocrs);

  while (answer) {
    if (rank == 0) {
      printf("This program computes pi as "
             "4.*Integral{0->1}[1/(1+x^2)]\n");
      printf("(Using PI = %s)\n", PI_STR);
      printf("Input the Number of intervals: N = ");
      fgets(buff, MAXLEN, stdin);
      sscanf(buff, "%d", &N);
      printf("pi will be computed with %d intervals on %d processors.\n", N, nprocrs);
    }

    /* Procr 0 = P(0) gives N to all other processors */
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (N <= 0)
      goto end_program;

    starttime = MPI_Wtime();
    sum = 0.0;
    h = 1./N;
    for (i = 1 + rank; i <= N; i += nprocrs) {
      x = h*(i - 0.5);
      sum += f(x);
    }
    mypi = sum*h;
    endtime = MPI_Wtime();
    runtime = endtime - starttime;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&runtime, &runtime_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    printf("Procr %d: runtime = %f\n", rank, runtime);
    fflush(stdout);

    if (rank == 0) {
      printf("For %d intervals, pi = %.14lf, error = %g\n", N, pi, fabs(pi - PI));
      printf("computed in = %f secs\n", runtime_max);
      fflush(stdout);
      printf("Do you wish to try another run? (y=1;n=0) ");
      fgets(buff, MAXLEN, stdin);
      sscanf(buff, "%d", &answer);
    }

    /* processors wait while P(0) gets new input from user */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Bcast(&answer, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (!answer)
      break;
  }

end_program:
  printf("\nProcr %d: Saying good-bye!\n", rank);
  if (rank == 0)
    printf("\nEND PROGRAM\n");
  MPI_Finalize();
  return 0;
}
Compile and Run Example 2
• Compile
  – gcc -o pi.exe pi.c -lmpi
• Run
  $ mpirun -np 2 pi.exe
  Procr 1 saying hello
  Procr 0 saying hello
  Using a total of 2 CPUs
  This program computes pi as 4.*Integral{0->1}[1/(1+x^2)]
  (Using PI = 3.141592653589793238462643)
  Input the Number of intervals: N = 10
  pi will be computed with 10 intervals on 2 processors.
  Procr 0: runtime = 0.000003
  Procr 1: runtime = 0.000003
  For 10 intervals, pi = 3.14242598500110, error = 0.000833331
  computed in = 0.000003 secs
What is OpenMP?
• Similar to MPI, but used for shared-memory parallelism
• Simple set of directives
• Incremental parallelism
• Unfortunately, it only works with proprietary compilers…
Compilers and Platforms
• Fujitsu/Lahey Fortran, C and C++
– Intel Linux Systems – Sun Solaris Systems
• HP HP-UX PA-RISC/Itanium – Fortran – C – aC++
• HP Tru64 Unix – Fortran – C – C++
• IBM XL Fortran and C from IBM – IBM AIX Systems
• Intel C++ and Fortran Compilers from Intel – Intel IA32 Linux Systems – Intel IA32 Windows Systems – Intel Itanium-based Linux Systems – Intel Itanium-based Windows Systems
• Guide Fortran and C/C++ from Intel's KAI Software Lab – Intel Linux Systems – Intel Windows Systems
• PGF77 and PGF90 Compilers from The Portland Group, Inc. (PGI) – Intel Linux Systems – Intel Solaris Systems – Intel Windows/NT Systems
• SGI MIPSpro 7.4 Compilers – SGI IRIX Systems
• Sun Microsystems Sun ONE Studio 8, Compiler Collection, Fortran 95, C, and C++ – Sun Solaris Platforms – Compiler Collection Portal
• VAST from Veridian Pacific-Sierra Research – IBM AIX Systems – Intel IA32 Linux Systems – Intel Windows/NT Systems – SGI IRIX Systems – Sun Solaris Systems
taken from www.openmp.org
How do you use OpenMP?
• C/C++ API
• parallel Construct – when a region of the program can be executed by multiple parallel threads, this fundamental construct starts the parallel execution.

  #pragma omp parallel [clause[ [, ]clause] …] new-line
    structured-block

  The clause is one of the following:
    if (scalar-expression)
    private (variable-list)
    firstprivate (variable-list)
    default (shared | none)
    shared (variable-list)
    copyin (variable-list)
    reduction (operator : variable-list)
    num_threads (integer-expression)
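A minimal sketch of the parallel construct with a couple of these clauses (the thread count and variable names are illustrative); every thread in the team executes the structured block once.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nthreads = 4;

    /* Start a team of threads; each thread gets its own copy of tid. */
    #pragma omp parallel num_threads(nthreads) default(shared)
    {
        int tid = omp_get_thread_num();          /* private to each thread */
        printf("Hello from thread %d of %d\n", tid, omp_get_num_threads());
    }
    return 0;
}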
• for Construct
  – Defines an iterative work-sharing construct in which the iterations of the associated loop are divided among the threads and executed in parallel
• sections Construct
  – Identifies a noniterative work-sharing construct that specifies a set of structured blocks to be divided among the threads; each section is executed once by one thread in the team
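A hedged sketch of both work-sharing constructs inside a single parallel region (the arrays and arithmetic are illustrative): the for loop is split across the threads, and each section is picked up by one thread.

#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N], b[N];
    int i;

    #pragma omp parallel
    {
        /* for construct: loop iterations are divided among the threads */
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = i * 0.5;

        /* sections construct: each section is executed once by one thread */
        #pragma omp sections
        {
            #pragma omp section
            { b[0] = a[0] + a[1]; }      /* one thread does this            */

            #pragma omp section
            { b[1] = a[N-1] + a[N-2]; }  /* possibly another thread does it */
        }
    }
    printf("b[0] = %f, b[1] = %f\n", b[0], b[1]);
    return 0;
}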
Fundamental Constructs
• single Construct
  – Associates a structured block's execution with only one thread in the team
• parallel for Construct
  – Shortcut for a parallel region containing only one for directive
• parallel sections Construct
  – Shortcut for a parallel region containing only a single sections directive
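For example, a minimal sketch of the parallel for shortcut (the loop is an illustrative saxpy-style update): one directive both creates the team and divides the iterations.

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double a = 2.0;
    int i;

    /* parallel for: create a team and split the iterations in one directive */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}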
Master and Synchronization Directives
• master Construct
  – Specifies a structured block that is executed by the master thread of the team
• critical Construct
  – Restricts execution of the associated structured block to a single thread at a time
• barrier Directive
  – Synchronizes all threads in a team: when this directive is encountered, each thread waits until all the others have reached this point
• atomic Construct
  – Ensures that a specific memory location is updated atomically (only one thread is allowed write access at a time)
• flush Directive
  – Specifies a "cross-thread" sequence point at which all threads in a team are ensured a "clean" view of certain objects in memory
• ordered Construct
  – A structured block following this directive iterates in the same order as if it were executed in a sequential loop
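A hedged sketch combining several of these directives (the counters are illustrative): atomic protects a single update, critical protects a larger block, and the barrier guarantees the master thread prints only after every thread has finished.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int total = 0;
    double max_val = 0.0;

    #pragma omp parallel
    {
        double local = (double)omp_get_thread_num();

        /* atomic: a single memory update protected against races */
        #pragma omp atomic
        total += 1;

        /* critical: a larger block executed by one thread at a time */
        #pragma omp critical
        {
            if (local > max_val)
                max_val = local;
        }

        /* barrier: nobody proceeds until every thread has updated the counters */
        #pragma omp barrier

        /* master: only the master thread prints the final values */
        #pragma omp master
        printf("threads = %d, max thread id = %f\n", total, max_val);
    }
    return 0;
}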
Data
• How do we control data in this SMP environment?
  – threadprivate Directive
    • Makes file-scope and namespace-scope variables private to a thread
• Data-Sharing Attribute Clauses
  – private – private to each thread
  – firstprivate – private, initialized with the value of the original variable
  – lastprivate – private; the value from the sequentially last iteration (or section) is copied back to the original variable
  – shared – shared among all threads
  – default – lets the user set the default data-sharing attributes
  – reduction – perform a reduction on scalar variables
  – copyin – assign the master thread's value to the threadprivate variables of the other threads
  – copyprivate – broadcast the value of a private variable from one member of a team to the others
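A hedged sketch of the private, firstprivate, and reduction clauses on a single loop (the numbers are illustrative):

#include <stdio.h>

int main(void)
{
    int i, offset = 10;
    double tmp, sum = 0.0;

    /* private:      tmp is per-thread scratch storage (uninitialized on entry)   */
    /* firstprivate: every thread starts with its own offset, initialized to 10   */
    /* reduction:    each thread keeps a private partial sum, combined at the end */
    #pragma omp parallel for private(tmp) firstprivate(offset) reduction(+:sum)
    for (i = 0; i < 1000; i++) {
        tmp = i + offset;
        sum += tmp;
    }

    printf("sum = %f\n", sum);   /* sum of (i + 10) for i = 0..999 */
    return 0;
}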
Scalability test on SGI Origin 2000
Timing results of the dot product test in milliseconds for n = 16 * 1024.
www.public.iastate.edu/~grl/HFP1/hpf.openmp.mpi.June6.2002.html
Timing results of matrix times matrix test in milliseconds for n = 128
www.public.iastate.edu/~grl/HFP1/hpf.openmp.mpi.June6.2002.html
Architecture comparison
From http://www.csm.ornl.gov/~dunigan/sgi/
References
• Book: Parallel Programming with MPI, Peter Pacheco
• www-unix.mcs.anl.gov/mpi
• http://alliance.osc.edu/impi/
• http://rocs.acomp.usf.edu/tut/mpi.php
• http://www.lam-mpi.org/tutorials/nd/
• www.openmp.org
• www.public.iastate.edu/~grl/HFP1/hpf.openmp.mpi.June6.2002.html