Parallel computing (2)


Introduction to Parallel Computing

Part IIb

What is MPI?

Message Passing Interface (MPI) is a standardised interface. Using this interface, several implementations have been made. The MPI standard specifies three forms of subroutine interfaces:
(1) Language independent notation;
(2) Fortran notation;
(3) C notation.

MPI Features

MPI implementations provide:

• Abstraction of hardware implementation
• Synchronous communication
• Asynchronous communication
• File operations
• Time measurement operations

Implementations

MPICH       Unix / Windows NT
MPICH-T3E   Cray T3E
LAM         Unix / SGI Irix / IBM AIX
Chimp       SunOS / AIX / Irix / HP-UX
WinMPI      Windows 3.1 (no network req.)

Programming with MPI

Programming with MPI differs from the traditional approach in three respects:

1. Use of the MPI library
2. Compiling
3. Running

Compiling (1)

Once a program is written, compiling it differs slightly from the normal situation. Although the details differ between MPI implementations, there are two frequently used approaches.

Compiling (2)

First approach:

$ gcc myprogram.c -o myexecutable -lmpi

Second approach:

$ mpicc myprogram.c -o myexecutable

Running (1)

In order to run an MPI-enabled application, we should generally use the command 'mpirun':

$ mpirun -np x myexecutable <parameters>

where x is the number of processes to use, and <parameters> are the arguments to the executable, if any.

Running (2)

The ‘mpirun’ program takes care of creating processes on the selected processors. By default, ‘mpirun’ decides which processors to use; this is usually determined by a global configuration file. It is possible to specify processors explicitly, but the specification may only be used as a hint.
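For example, with MPICH the processors to use can be suggested through a machine file. The option name and the file name 'hosts' below are assumptions; other implementations and versions use different mechanisms:

$ cat hosts
node1
node2
node3
node4
$ mpirun -np 4 -machinefile hosts myexecutable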

MPI Programming (1)

Implementations of MPI support Fortran, C, or both. Here we only consider programming using the C libraries. The first step in writing a program using MPI is to include the correct header:

#include "mpi.h"

MPI Programming (2)

#include "mpi.h"

int main (int argc, char *argv[])
{
    …
    MPI_Init (&argc, &argv);
    …
    MPI_Finalize();
    return …;
}

MPI_Init

int MPI_Init (int *argc, char ***argv)

The MPI_Init procedure should be called before any other MPI procedure (except MPI_Initialized). It must be called exactly once, at program initialisation. It removes the arguments that are used by MPI from the argument array.
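Since MPI_Initialized is the only routine that may be called before MPI_Init, it can be used to guard against calling MPI_Init twice. A minimal sketch (the helper name is made up for this example):

#include "mpi.h"

/* Hypothetical helper: initialise MPI only if that has not happened yet */
int ensure_mpi_started (int *argc, char ***argv)
{
    int initialised;

    MPI_Initialized (&initialised);   /* allowed before MPI_Init */
    if (!initialised)
        return MPI_Init (argc, argv);
    return MPI_SUCCESS;
}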

MPI_Finalize

int MPI_Finalize (void)

This routine cleans up all MPI state. It should be the last MPI routine called in a program; no other MPI routine may be called after MPI_Finalize. Pending communication should be finished before finalisation.

Using multiple processes

When running an MPI-enabled program using multiple processes, each process runs an identical copy of the program, so there must be a way for a process to know which one it is. This situation is comparable to programming with the ‘fork’ statement. MPI defines two subroutines that can be used for this.

MPI_Comm_size

int MPI_Comm_size (MPI_Comm comm, int *size)

This call returns the number of processes involved in a communicator. To find out how many processes are used in total, call this function with the predefined global communicator MPI_COMM_WORLD.

MPI_Comm_rank

int MPI_Comm_rank (MPI_Comm comm, int *rank)

This procedure determines the rank (index) of the calling process in the communicator. Each process is assigned a unique number within a communicator.

MPI_COMM_WORLD

MPI communicators are used to specify which processes a communication applies to. A communicator is shared by a group of processes. The predefined MPI_COMM_WORLD applies to all processes. Communicators can be duplicated, created and deleted. For most applications, use of MPI_COMM_WORLD suffices.
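As a sketch of how communicators can be created and deleted, the standard routines MPI_Comm_dup, MPI_Comm_split and MPI_Comm_free could be used as follows; splitting the processes into even and odd ranks is only an illustration:

MPI_Comm dupWorld, halfWorld;
int rank;

MPI_Comm_rank (MPI_COMM_WORLD, &rank);

/* A duplicate of MPI_COMM_WORLD containing the same processes */
MPI_Comm_dup (MPI_COMM_WORLD, &dupWorld);

/* Two new communicators: one for the even ranks, one for the odd ranks */
MPI_Comm_split (MPI_COMM_WORLD, rank % 2, rank, &halfWorld);

/* Delete the communicators once they are no longer needed */
MPI_Comm_free (&dupWorld);
MPI_Comm_free (&halfWorld);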

Example ‘Hello World!’

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int size, rank;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    printf ("Hello world! from processor (%d/%d)\n", rank+1, size);

    MPI_Finalize();

    return 0;
}

Running ‘Hello World!’

$ mpicc -o hello hello.c
$ mpirun -np 3 hello
Hello world! from processor (1/3)
Hello world! from processor (2/3)
Hello world! from processor (3/3)
$ _

MPI_Send

int MPI_Send (void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )

Synchronously sends a message to dest. Data is found in buf, that contains count elements of datatype. To identify the send, a tag has to be specified. The destination dest is the processor rank in communicator comm.

MPI_Recv

int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Synchronously receives a message from source. The buffer must be able to hold count elements of datatype. The status field is filled with status information. MPI_Recv and MPI_Send calls should match: equal tag, count and datatype.

Datatypes

MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double

(http://www-jics.cs.utk.edu/MPI/MPIguide/MPIguide.html)

Example send / receive

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    MPI_Status s;
    int size, rank, i, j;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0)   // Master process
    {
        printf ("Receiving data . . .\n");
        for (i = 1; i < size; i++)
        {
            MPI_Recv ((void *)&j, 1, MPI_INT, i, 0xACE5, MPI_COMM_WORLD, &s);
            printf ("[%d] sent %d\n", i, j);
        }
    }
    else
    {
        j = rank * rank;
        MPI_Send ((void *)&j, 1, MPI_INT, 0, 0xACE5, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Running send / receive

$ mpicc -o sendrecv sendrecv.c
$ mpirun -np 4 sendrecv
Receiving data . . .
[1] sent 1
[2] sent 4
[3] sent 9
$ _

MPI_Bcast

int MPI_Bcast (void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

Synchronously broadcasts a message from root to all processors in communicator comm (including itself). Buffer is used as source in the root processor, as destination in the others.

MPI_Barrier

int MPI_Barrier (MPI_Comm comm)

Blocks until all processes defined in comm have reached this routine. Use this routine to synchronize processes.

Example broadcast / barrier

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int rank, i;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0)
        i = 27;
    MPI_Bcast ((void *)&i, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf ("[%d] i = %d\n", rank, i);

    // Wait for every process to reach this code
    MPI_Barrier (MPI_COMM_WORLD);

    MPI_Finalize();

    return 0;
}

Running broadcast / barrier

$ mpicc -o broadcast broadcast.c
$ mpirun -np 3 broadcast
[0] i = 27
[1] i = 27
[2] i = 27
$ _

MPI_Sendrecv

int MPI_Sendrecv (void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

int MPI_Sendrecv_replace (void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

Send and receive in one call (the second variant uses a single buffer for both).
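A sketch of a typical use of MPI_Sendrecv_replace: shifting a value around a ring of processes, where every process sends to its right neighbour and receives from its left neighbour using one buffer (the tag 0 is arbitrary):

int rank, size, value;
MPI_Status status;

MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);

value = rank;

/* Send to the right neighbour, receive from the left one, reusing the buffer */
MPI_Sendrecv_replace ((void *)&value, 1, MPI_INT,
                      (rank + 1) % size, 0,          /* dest, sendtag   */
                      (rank + size - 1) % size, 0,   /* source, recvtag */
                      MPI_COMM_WORLD, &status);

/* value now holds the rank of the left neighbour */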

Other useful routines

• MPI_Scatter
• MPI_Gather
• MPI_Type_vector
• MPI_Type_commit (see the derived-datatype sketch below)
• MPI_Reduce / MPI_Allreduce
• MPI_Op_create
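As an illustration of the derived-datatype routines in the list above, the sketch below builds a datatype describing one column of an N x N matrix of doubles stored row by row. The size N, the destination rank and the tag are arbitrary choices for this example:

#define N 4                      /* arbitrary matrix size for this sketch */

double matrix[N][N];
MPI_Datatype columnType;

/* N blocks of 1 double, N doubles apart in memory: one matrix column */
MPI_Type_vector (N, 1, N, MPI_DOUBLE, &columnType);
MPI_Type_commit (&columnType);

/* Send the second column (column index 1) to process 1 */
MPI_Send ((void *)&matrix[0][1], 1, columnType, 1, 0, MPI_COMM_WORLD);

MPI_Type_free (&columnType);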

Example scatter / reduce

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int data[] = {1, 2, 3, 4, 5, 6, 7};   // Size must be >= #processors
    int rank, i = -1, j = -1;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Scatter ((void *)data, 1, MPI_INT, (void *)&i, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf ("[%d] Received i = %d\n", rank, i);

    MPI_Reduce ((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD, 0, MPI_COMM_WORLD);

    printf ("[%d] j = %d\n", rank, j);

    MPI_Finalize();

    return 0;
}

Running scatter / reduce

$ mpicc -o scatterreduce scatterreduce.c
$ mpirun -np 4 scatterreduce
[0] Received i = 1
[0] j = 24
[1] Received i = 2
[1] j = -1
[2] Received i = 3
[2] j = -1
[3] Received i = 4
[3] j = -1
$ _

Some reduce operations

MPI_MAX    Maximum value
MPI_MIN    Minimum value
MPI_SUM    Sum of values
MPI_PROD   Product of values
MPI_LAND   Logical AND
MPI_BAND   Bit-wise AND
MPI_LOR    Logical OR
MPI_BOR    Bit-wise OR
MPI_LXOR   Logical exclusive OR
MPI_BXOR   Bit-wise exclusive OR
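Besides these predefined operations, MPI_Op_create (listed under the useful routines above) makes it possible to define reductions of your own. A minimal sketch, assuming a commutative 'maximum of absolute values' operation on MPI_INT; the function and variable names are made up, and MPI_Init / MPI_Finalize are omitted:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

/* User-defined reduction: element-wise maximum of absolute values */
void absmax (void *in, void *inout, int *len, MPI_Datatype *type)
{
    int k;
    int *a = (int *)in, *b = (int *)inout;

    for (k = 0; k < *len; k++)
        if (abs (a[k]) > abs (b[k]))
            b[k] = a[k];
}

void reduce_absmax (void)   /* assumes MPI has already been initialised */
{
    int rank, local, result = 0;
    MPI_Op op;

    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    local = rank - 2;                     /* some local value per process */

    MPI_Op_create (absmax, 1 /* commutative */, &op);
    MPI_Reduce ((void *)&local, (void *)&result, 1, MPI_INT, op, 0, MPI_COMM_WORLD);
    MPI_Op_free (&op);

    if (rank == 0)
        printf ("absmax = %d\n", result);
}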

Measuring running time

double MPI_Wtime (void);

double timeStart, timeEnd;
...
timeStart = MPI_Wtime();
// Code to measure time for goes here.
timeEnd = MPI_Wtime();
...
printf ("Running time = %f seconds\n", timeEnd - timeStart);

Parallel sorting (1)

Sorting a sequence of numbers using the binary-sort method. This method divides a given sequence into two halves (until only one element remains) and sorts both halves recursively. The two halves are then merged together to form a sorted sequence.

Binary sort pseudo-code

sorted-sequence BinarySort (sequence)
{
    if (# elements in sequence > 1)
    {
        seqA = first half of sequence
        seqB = second half of sequence
        BinarySort (seqA);
        BinarySort (seqB);
        sorted-sequence = merge (seqA, seqB);
    }
    else
        sorted-sequence = sequence
}
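A minimal sequential C sketch of the scheme above; the names follow the pseudo-code, and the merge step uses a temporary buffer of the same size as the array:

#include <string.h>

/* Merge the adjacent sorted halves a[lo..mid-1] and a[mid..hi-1] */
static void merge (int *a, int lo, int mid, int hi, int *tmp)
{
    int i = lo, j = mid, k = lo;

    while (i < mid && j < hi)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid)
        tmp[k++] = a[i++];
    while (j < hi)
        tmp[k++] = a[j++];

    memcpy (a + lo, tmp + lo, (hi - lo) * sizeof (int));
}

/* Sort a[lo..hi-1]: split, sort both halves recursively, merge */
static void BinarySort (int *a, int lo, int hi, int *tmp)
{
    if (hi - lo > 1)
    {
        int mid = lo + (hi - lo) / 2;
        BinarySort (a, lo, mid, tmp);
        BinarySort (a, mid, hi, tmp);
        merge (a, lo, mid, hi, tmp);
    }
}

Calling BinarySort (a, 0, n, tmp) with a temporary array tmp of n elements sorts a in place.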

Merge two sorted sequences

[Figure: step-by-step merging of two sorted sequences into one sorted sequence.]

Example binary sort

[Figure: recursive splitting of the sequence 1 7 5 2 8 4 6 3 into single elements, followed by merging back into the sorted sequence 1 2 3 4 5 6 7 8.]

Parallel sorting (2)

This way of dividing work and gathering the results is quite a natural fit for a parallel implementation: divide the work in two and give each half to a processor, then have each of these processors divide its work again, until either the data cannot be split any further or no more processors are available.

Implementation problems

• The number of processors may not be a power of two
• The number of elements may not be a power of two
• How to achieve an even workload? (see the sketch below)
• The data size may be less than the number of processors
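One common way to handle an element count N that is not a multiple of the number of processes p (see the workload item above) is to give the first N mod p processes one extra element. A sketch, where N, p and the rank r are assumed to be known:

/* Share of process r when N elements are divided over p processes */
int base   = N / p;                              /* minimum number of elements   */
int rest   = N % p;                              /* elements left over           */
int count  = base + (r < rest ? 1 : 0);          /* my number of elements        */
int offset = r * base + (r < rest ? r : rest);   /* global index of my first one */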

Parallel matrix multiplication

We use the following partitioning of data (p=4)

[Figure: block partitioning of the matrices over processes P1, P2, P3, P4.]

Implementation

1. Master (process 0) reads data
2. Master sends size of data to slaves
3. Slaves allocate memory
4. Master broadcasts second matrix to all other processes
5. Master sends respective parts of first matrix to all other processes
6. Every process performs its local multiplication
7. All slave processes send back their result
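A sketch of these steps for square n x n matrices of doubles, assuming for brevity that n is divisible by the number of processes. It uses MPI_Scatter and MPI_Gather for steps 5 and 7 instead of individual sends, and leaves the reading of the data (step 1) out; the function and variable names are made up:

#include <stdlib.h>
#include "mpi.h"

/* C = A * B; the full matrices A, B and C only need to be valid on rank 0 */
void multiply (int n, double *A, double *B, double *C)
{
    int rank, size, rows, i, j, k;
    double *localA, *localC;

    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    MPI_Bcast ((void *)&n, 1, MPI_INT, 0, MPI_COMM_WORLD);          /* 2. size      */
    rows = n / size;

    if (rank != 0)                                                   /* 3. allocate  */
        B = malloc (n * n * sizeof (double));
    localA = malloc (rows * n * sizeof (double));
    localC = malloc (rows * n * sizeof (double));

    MPI_Bcast ((void *)B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);     /* 4. matrix B  */

    MPI_Scatter ((void *)A, rows * n, MPI_DOUBLE,                    /* 5. rows of A */
                 (void *)localA, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < rows; i++)                                       /* 6. local multiplication */
        for (j = 0; j < n; j++)
        {
            localC[i * n + j] = 0.0;
            for (k = 0; k < n; k++)
                localC[i * n + j] += localA[i * n + k] * B[k * n + j];
        }

    MPI_Gather ((void *)localC, rows * n, MPI_DOUBLE,                /* 7. results   */
                (void *)C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free (localA);
    free (localC);
    if (rank != 0)
        free (B);
}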

Multiplication 1000 x 1000

[Figure: 1000 x 1000 matrix multiplication, running time (s) versus number of processors, comparing the measured time Tp with the ideal time T1 / p.]

Multiplication 5000 x 5000

[Figure: 5000 x 5000 matrix multiplication, running time (s) versus number of processors, comparing the measured time Tp with the ideal time T1 / p.]

Gaussian elimination

We use the following partitioning of data (p=4)

[Figure: block partitioning of the matrix over processes P1, P2, P3, P4.]

Implementation (1)

1. Master reads both matrices
2. Master sends size of matrices to slaves
3. Slaves calculate their part and allocate memory
4. Master sends each slave its respective part
5. Set sweeping row to 0 in all processes
6. Sweep matrix (see next sheet)
7. Slaves send back their result

Implementation (2)

While sweeping row not past final row do:
A. Have every process decide whether they own the current sweeping row
B. The owner sends a copy of the row to every other process
C. All processes sweep their part of the matrix using the current row
D. Sweeping row is incremented
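A sketch of this loop, assuming every process stores 'rows' consecutive rows of an n-column matrix in 'localM' (row-major), starting at global row 'offset', with the same number of rows per process. The names are assumptions and pivoting is ignored:

#include <stdlib.h>
#include <string.h>
#include "mpi.h"

/* Forward elimination sweep; assumes MPI is initialised and the data is distributed */
void sweep (double *localM, int n, int rows, int offset, int rank)
{
    int r, owner, i, j;
    double factor;
    double *pivotRow = malloc (n * sizeof (double));

    for (r = 0; r < n; r++)                        /* current sweeping row      */
    {
        owner = r / rows;                          /* A. which process owns it? */

        if (rank == owner)                         /* B. owner provides the row */
            memcpy (pivotRow, &localM[(r - offset) * n], n * sizeof (double));
        MPI_Bcast ((void *)pivotRow, n, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        for (i = 0; i < rows; i++)                 /* C. sweep the local rows   */
        {
            if (offset + i <= r)                   /* only rows below row r     */
                continue;
            factor = localM[i * n + r] / pivotRow[r];
            for (j = r; j < n; j++)
                localM[i * n + j] -= factor * pivotRow[j];
        }
    }                                              /* D. next sweeping row      */

    free (pivotRow);
}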

Programming hints

• Keep it simple!
• Avoid deadlocks
• Write robust code even at cost of speed
• Design in advance, debugging is more difficult (printing output is different)
• Error handling requires synchronisation, you can’t just exit the program

References (1)

MPI Forum Home Page
http://www.mpi-forum.org/index.html

Beginners guide to MPI (see also /MPI/)
http://www-jics.cs.utk.edu/MPI/MPIguide/MPIguide.html

MPICH
http://www-unix.mcs.anl.gov/mpi/mpich/

References (2)

Miscellaneous

http://www.erc.msstate.edu/labs/hpcl/projects/mpi/
http://nexus.cs.usfca.edu/mpi/
http://www-unix.mcs.anl.gov/~gropp/
http://www.epm.ornl.gov/~walker/mpitutorial/
http://www.lam-mpi.org/
http://epcc.ed.ac.uk/chimp/
http://www-unix.mcs.anl.gov/mpi/www/www3/

Recommended