MPI and OpenMP
By: Jesus Caban
and
Matt McKnight
What is MPI?
• MPI: Message Passing Interface
  – It is not a new programming language; it is a library of functions that can be called from C/Fortran/Python
  – Successor to PVM (Parallel Virtual Machine)
  – Developed by an open, international forum with representation from industry, academia, and government laboratories
What is it for?
• Allows data to be passed between processes in a distributed-memory environment
• Provides source-code portability
• Allows efficient implementations
• Offers a great deal of functionality
• Supports heterogeneous parallel architectures
MPI Communicator
• Idea:
  – A group of processes that are allowed to communicate with each other
• Most often used communicator: MPI_COMM_WORLD
• Note the MPI naming convention:
  MPI_XXX                      (constants)
  var = MPI_Xxx(parameters);   (function calls)
  MPI_Xxx(parameters);
Getting Started
• Include MPI header file
• Initialize MPI environment
• Work: make message-passing calls (Send, Receive)
• Terminate MPI environment
Include File
Include MPI header file
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
  …
}
Initialize MPI
Initialize MPI environment
int main(int argc, char** argv)
{
  int numtasks, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  ...
}
Initialize MPI (cont.)
MPI_Init(&argc, &argv)
  No MPI functions may be called before this call.

MPI_Comm_size(MPI_COMM_WORLD, &numtasks)
  A communicator is a collection of processes that can send messages to each other. MPI_COMM_WORLD is a predefined communicator that consists of all the processes running when program execution begins.

MPI_Comm_rank(MPI_COMM_WORLD, &rank)
  Lets a process find out its rank within the communicator.
Terminate MPI environment
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
  …
  MPI_Finalize();
}

No MPI functions may be called after this call.
Let’s work with MPI
Work: make message-passing calls (Send, Receive)

if (my_rank != 0) {
  MPI_Send(data, strlen(data)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
} else {
  MPI_Recv(data, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
}
Work (cont.)
int MPI_Send(void* message, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

int MPI_Recv(void* message, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status* status)
Hello World!!

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
  int my_rank, p, source, dest, tag = 0;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  if (my_rank != 0) {
    /* Create message and send it to process 0 */
    sprintf(message, "Hello from process %d!", my_rank);
    dest = 0;
    MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
  } else {
    /* Process 0 receives one message from every other process */
    for (source = 1; source < p; source++) {
      MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }
  MPI_Finalize();
  return 0;
}
Compile and Run MPI
• Compile
  – gcc -o hello.exe mpi_hello.c -lmpi
  – mpicc -o hello.exe mpi_hello.c
• Run
  – mpirun -np 5 hello.exe
• Output
  $ mpirun -np 5 hello.exe
  Hello from process 1!
  Hello from process 2!
  Hello from process 3!
  Hello from process 4!
More MPI Functions
• MPI_Bcast(void *m, int s, MPI_Datatype dt, int root, MPI_Comm comm)
  – Sends a copy of the data in m on the process with rank root to each process in the communicator.
• MPI_Reduce(void *operand, void *result, int count, MPI_Datatype datatype, MPI_Op operator, int root, MPI_Comm comm)
  – Combines the operands stored in the memory referenced by operand using operation operator and stores the result in result on process root.
• double MPI_Wtime(void)
  – Returns a double-precision value that represents the number of seconds that have elapsed since some point in the past.
• MPI_Barrier(MPI_Comm comm)
  – Each process in comm blocks until every process in comm has called it.
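For illustration, here is a minimal sketch (not from the original slides; the variable names and the value of n are made up) showing how these four calls fit together: process 0 broadcasts a value, every process contributes to a reduction, and MPI_Barrier/MPI_Wtime bracket the timed region.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, np, n = 0;
    double local, total, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (rank == 0) n = 100;                       /* only root knows n initially     */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* now every process has n         */

    MPI_Barrier(MPI_COMM_WORLD);                  /* line everyone up before timing  */
    t0 = MPI_Wtime();

    local = (double)(rank + 1) * n;               /* each process computes something */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    t1 = MPI_Wtime();
    if (rank == 0)
        printf("total = %f, time = %f secs\n", total, t1 - t0);

    MPI_Finalize();
    return 0;
}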
More Examples
• Trapezoidal Rule:
  – Integral from a to b of a nonnegative function f(x)
  – Approach: estimate the area by partitioning the region into regular geometric shapes, then add up the areas of those shapes (a parallel sketch of this rule follows below)
• Compute Pi
  f(x) = 4/(1 + x^2),   pi = Integral{0->1} f(x) dx
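A hedged sketch of the parallel trapezoidal rule described above (the integrand, interval, and variable names are illustrative assumptions): each process applies the rule to its own sub-interval and MPI_Reduce adds up the pieces on process 0.

#include <stdio.h>
#include <mpi.h>

#define f(x) (4./(1. + (x)*(x)))           /* integrand; Integral{0->1} f(x) dx = pi */

int main(int argc, char** argv)
{
    int rank, np, i, n = 1024, local_n;     /* n = total number of trapezoids        */
    double a = 0.0, b = 1.0, h, local_a, local_b, local_sum, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    h = (b - a) / n;                        /* width of one trapezoid                 */
    local_n = n / np;                       /* trapezoids per process (assumes np divides n) */
    local_a = a + rank * local_n * h;       /* left endpoint of this process's piece  */
    local_b = local_a + local_n * h;

    /* trapezoidal rule on [local_a, local_b] */
    local_sum = (f(local_a) + f(local_b)) / 2.0;
    for (i = 1; i < local_n; i++)
        local_sum += f(local_a + i * h);
    local_sum *= h;

    /* add the partial integrals on process 0 */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("With %d trapezoids, estimate = %.14f\n", n, total);

    MPI_Finalize();
    return 0;
}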
Compute PI

#include <stdio.h>
#include <math.h>
#include "mpi.h"
#define PI      3.141592653589793238462643
#define PI_STR "3.141592653589793238462643"
#define MAXLEN 40
#define f(x) (4./(1.+ (x)*(x)))

int main(int argc, char *argv[])
{
  int N = 0, rank, nprocrs, i, answer = 1;
  double mypi, pi, h, sum, x, starttime, endtime, runtime, runtime_max;
  char buff[MAXLEN];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("Procr %d saying hello\n", rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocrs);
  if (rank == 0)
    printf("Using a total of %d CPUs\n", nprocrs);

  while (answer) {
    if (rank == 0) {
      printf("This program computes pi as "
             "4.*Integral{0->1}[1/(1+x^2)]\n");
      printf("(Using PI = %s)\n", PI_STR);
      printf("Input the Number of intervals: N = ");
      fgets(buff, MAXLEN, stdin);
      sscanf(buff, "%d", &N);
      printf("pi will be computed with %d intervals on %d processors.\n", N, nprocrs);
    }

    /* Procr 0 = P(0) gives N to all other processors */
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (N <= 0)
      goto end_program;

    starttime = MPI_Wtime();
    sum = 0.0;
    h = 1./N;
    for (i = 1 + rank; i <= N; i += nprocrs) {
      x = h*(i - 0.5);
      sum += f(x);
    }
    mypi = sum*h;
    endtime = MPI_Wtime();
    runtime = endtime - starttime;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&runtime, &runtime_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    printf("Procr %d: runtime = %f\n", rank, runtime);
    fflush(stdout);

    if (rank == 0) {
      printf("For %d intervals, pi = %.14lf, error = %g\n", N, pi, fabs(pi - PI));
      printf("computed in = %f secs\n", runtime_max);
      fflush(stdout);
      printf("Do you wish to try another run? (y=1;n=0) ");
      fgets(buff, MAXLEN, stdin);
      sscanf(buff, "%d", &answer);
    }

    /* processors wait while P(0) gets new input from user */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Bcast(&answer, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (!answer)
      break;
  }

end_program:
  printf("\nProcr %d: Saying good-bye!\n", rank);
  if (rank == 0)
    printf("\nEND PROGRAM\n");
  MPI_Finalize();
  return 0;
}
Compile and Run Example 2
• Compile
  – gcc -o pi.exe pi.c -lmpi
• Run
  $ mpirun -np 2 pi.exe
  Procr 1 saying hello
  Procr 0 saying hello
  Using a total of 2 CPUs
  This program computes pi as 4.*Integral{0->1}[1/(1+x^2)]
  (Using PI = 3.141592653589793238462643)
  Input the Number of intervals: N = 10
  pi will be computed with 10 intervals on 2 processors.
  Procr 0: runtime = 0.000003
  Procr 1: runtime = 0.000003
  For 10 intervals, pi = 3.14242598500110, error = 0.000833331
  computed in = 0.000003 secs
What is OpenMP?
• Similar to MPI, but used for shared-memory parallelism
• Simple set of directives
• Incremental parallelism
• Unfortunately, it only works with proprietary compilers…
Compilers and Platforms
• Fujitsu/Lahey Fortran, C and C++
– Intel Linux Systems – Sun Solaris Systems
• HP HP-UX PA-RISC/Itanium – Fortran – C – aC++
• HP Tru64 Unix – Fortran – C – C++
• IBM XL Fortran and C from IBM – IBM AIX Systems
• Intel C++ and Fortran Compilers from Intel – Intel IA32 Linux Systems – Intel IA32 Windows Systems – Intel Itanium-based Linux Systems – Intel Itanium-based Windows Systems
• Guide Fortran and C/C++ from Intel's KAI Software Lab – Intel Linux Systems – Intel Windows Systems
• PGF77 and PGF90 Compilers from The Portland Group, Inc. (PGI) – Intel Linux Systems – Intel Solaris Systems – Intel Windows/NT Systems
• SGI MIPSpro 7.4 Compilers – SGI IRIX Systems
• Sun Microsystems Sun ONE Studio 8, Compiler Collection, Fortran 95, C, and C++ – Sun Solaris Platforms – Compiler Collection Portal
• VAST from Veridian Pacific-Sierra Research – IBM AIX Systems – Intel IA32 Linux Systems – Intel Windows/NT Systems – SGI IRIX Systems – Sun Solaris Systems
taken from www.openmp.org
How do you use OpenMP?
• C/C++ API
• parallel Construct – when a region of the program can be executed by multiple parallel threads, this fundamental construct starts the parallel execution.

  #pragma omp parallel [clause[ [, ]clause] …] new-line
    structured-block

  The clause is one of the following:
    if (scalar-expression)
    private (variable-list)
    firstprivate (variable-list)
    default (shared | none)
    shared (variable-list)
    copyin (variable-list)
    reduction (operator : variable-list)
    num_threads (integer-expression)
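A minimal sketch of the parallel construct with a couple of these clauses (the thread count and variable names are illustrative); every thread in the team executes the structured block once.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nthreads = 4;

    /* Start a team of threads; each thread gets its own copy of tid. */
    #pragma omp parallel num_threads(nthreads) default(shared)
    {
        int tid = omp_get_thread_num();          /* private to each thread */
        printf("Hello from thread %d of %d\n", tid, omp_get_num_threads());
    }
    return 0;
}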
• for Construct
  – Defines an iterative work-sharing construct in which the iterations of the associated loop are divided among the threads and executed in parallel
• sections Construct
  – Identifies a noniterative work-sharing construct that specifies a set of structured blocks to be divided among the threads; each section is executed once by one thread in the team
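A hedged sketch of both work-sharing constructs inside a single parallel region (the arrays and arithmetic are illustrative): the for loop is split across the threads, and each section is picked up by one thread.

#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N], b[N];
    int i;

    #pragma omp parallel
    {
        /* for construct: loop iterations are divided among the threads */
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = i * 0.5;

        /* sections construct: each section is executed once by one thread */
        #pragma omp sections
        {
            #pragma omp section
            { b[0] = a[0] + a[1]; }      /* one thread does this            */

            #pragma omp section
            { b[1] = a[N-1] + a[N-2]; }  /* possibly another thread does it */
        }
    }
    printf("b[0] = %f, b[1] = %f\n", b[0], b[1]);
    return 0;
}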
Fundamental Constructs
• single Construct
  – Associates a structured block's execution with only one thread in the team
• parallel for Construct
  – Shortcut for a parallel region containing only one for directive
• parallel sections Construct
  – Shortcut for a parallel region containing only a single sections directive
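For example, a minimal sketch of the parallel for shortcut (the loop is an illustrative saxpy-style update): one directive both creates the team and divides the iterations.

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double a = 2.0;
    int i;

    /* parallel for: create a team and split the iterations in one directive */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}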
Master and Synchronization Directives
• master Construct
  – Specifies a structured block that is executed by the master thread of the team
• critical Construct
  – Restricts execution of the associated structured block to a single thread at a time
• barrier Directive
  – Synchronizes all threads in a team: when this directive is encountered, each thread waits until all the others have reached this point
• atomic Construct
  – Ensures that a specific memory location is updated atomically (only one thread is allowed write access at a time)
• flush Directive
  – Specifies a "cross-thread" sequence point at which all threads in a team are ensured a "clean" view of certain objects in memory
• ordered Construct
  – A structured block following this directive iterates in the same order as if it were executed in a sequential loop
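A hedged sketch combining several of these directives (the counters are illustrative): atomic protects a single update, critical protects a larger block, and the barrier guarantees the master thread prints only after every thread has finished.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int total = 0;
    double max_val = 0.0;

    #pragma omp parallel
    {
        double local = (double)omp_get_thread_num();

        /* atomic: a single memory update protected against races */
        #pragma omp atomic
        total += 1;

        /* critical: a larger block executed by one thread at a time */
        #pragma omp critical
        {
            if (local > max_val)
                max_val = local;
        }

        /* barrier: nobody proceeds until every thread has updated the counters */
        #pragma omp barrier

        /* master: only the master thread prints the final values */
        #pragma omp master
        printf("threads = %d, max thread id = %f\n", total, max_val);
    }
    return 0;
}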
Data
• How do we control data in this SMP environment?
  – threadprivate Directive
    • Makes file-scope and namespace-scope variables private to a thread
• Data-Sharing Attribute Clauses
  – private – private to each thread
  – firstprivate – private, initialized with the value of the original variable
  – lastprivate – private; the value from the sequentially last iteration (or section) is copied back to the original variable
  – shared – shared among all threads
  – default – lets the user set the default data-sharing attributes
  – reduction – perform a reduction on scalar variables
  – copyin – assign the master thread's value to the threadprivate variables of the other threads
  – copyprivate – broadcast the value of a private variable from one member of a team to the others
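A hedged sketch of the private, firstprivate, and reduction clauses on a single loop (the numbers are illustrative):

#include <stdio.h>

int main(void)
{
    int i, offset = 10;
    double tmp, sum = 0.0;

    /* private:      tmp is per-thread scratch storage (uninitialized on entry)   */
    /* firstprivate: every thread starts with its own offset, initialized to 10   */
    /* reduction:    each thread keeps a private partial sum, combined at the end */
    #pragma omp parallel for private(tmp) firstprivate(offset) reduction(+:sum)
    for (i = 0; i < 1000; i++) {
        tmp = i + offset;
        sum += tmp;
    }

    printf("sum = %f\n", sum);   /* sum of (i + 10) for i = 0..999 */
    return 0;
}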
Scalability test on SGI Origin 2000
Timing results of the dot product test in milliseconds for n = 16 * 1024.
www.public.iastate.edu/~grl/HFP1/hpf.openmp.mpi.June6.2002.html
Timing results of matrix times matrix test in milliseconds for n = 128
www.public.iastate.edu/~grl/HFP1/hpf.openmp.mpi.June6.2002.html
Architecture comparison
From http://www.csm.ornl.gov/~dunigan/sgi/
References
• Book: Parallel Programming with MPI, Peter Pacheco
• www-unix.mcs.anl.gov/mpi
• http://alliance.osc.edu/impi/
• http://rocs.acomp.usf.edu/tut/mpi.php
• http://www.lam-mpi.org/tutorials/nd/
• www.openmp.org
• www.public.iastate.edu/~grl/HFP1/hpf.openmp.mpi.June6.2002.html