Introduction to Parallel Computing Part IIb

Page 1: Parallel computing(2)

Introduction to Parallel Computing

Part IIb

Page 2: Parallel computing(2)

What is MPI?

Message Passing Interface (MPI) is a standardised interface. Using this interface, several implementations have been made. The MPI standard specifies three forms of subroutine interfaces:
(1) Language-independent notation;
(2) Fortran notation;
(3) C notation.

Page 3: Parallel computing(2)

MPI Features

MPI implementations provide:

• Abstraction of hardware implementation
• Synchronous communication
• Asynchronous communication
• File operations
• Time measurement operations

Page 4: Parallel computing(2)

Implementations

MPICH       Unix / Windows NT
MPICH-T3E   Cray T3E
LAM         Unix / SGI Irix / IBM AIX
Chimp       SunOS / AIX / Irix / HP-UX
WinMPI      Windows 3.1 (no network required)

Page 5: Parallel computing(2)

Programming with MPI

What is the difference between programming using the traditional approach and the MPI approach? Three things change:

1. Use of the MPI library
2. Compiling
3. Running

Page 6: Parallel computing(2)

Compiling (1)

Once a program is written, compiling it works a little differently from the normal situation. Although details differ between MPI implementations, there are two frequently used approaches.

Page 7: Parallel computing(2)

Compiling (2)

First approach

$ gcc myprogram.c -o myexecutable -lmpi

Second approach

$ mpicc myprogram.c -o myexecutable

Page 8: Parallel computing(2)

Running (1)

In order to run an MPI-enabled application, we generally use the command ‘mpirun’:

$ mpirun -np x myexecutable <parameters>

where x is the number of processes to use, and <parameters> are the arguments to the executable, if any.

Page 9: Parallel computing(2)

Running (2)

The ‘mpirun’ program takes care of creating the processes on the selected processors. By default, ‘mpirun’ decides which processors to use; this is usually determined by a global configuration file. It is possible to specify processors explicitly, but such a specification may only be used as a hint.

Page 10: Parallel computing(2)

MPI Programming (1)

Implementations of MPI support Fortran, C, or both. Here we only consider programming using the C library. The first step in writing a program using MPI is to include the correct header:

#include "mpi.h"

Page 11: Parallel computing(2)

MPI Programming (2)

#include "mpi.h"

int main (int argc, char *argv[])
{
    …
    MPI_Init (&argc, &argv);
    …
    MPI_Finalize ();
    return …;
}

Page 12: Parallel computing(2)

MPI_Init

int MPI_Init (int *argc, char ***argv)

The MPI_Init procedure should be called before any other MPI procedure (except MPI_Initialized). It must be called exactly once, at program initialisation. It removes the arguments that are used by MPI from the argument array.

Page 13: Parallel computing(2)

MPI_Finalize

int MPI_Finalize (void)

This routine cleans up all MPI state. It should be the last MPI routine called in a program; no other MPI routine may be called after MPI_Finalize. Pending communication should be finished before finalisation.

Page 14: Parallel computing(2)

Using multiple processes

When running an MPI-enabled program using multiple processes, each process runs an identical copy of the program, so there must be a way for a process to know which one it is. This situation is comparable to programming with the ‘fork’ statement. MPI defines two subroutines that can be used for this.

Page 15: Parallel computing(2)

MPI_Comm_size

int MPI_Comm_size (MPI_Comm comm, int *size)

This call returns the number of processes involved in a communicator. To find out how many processes are used in total, call this function with the predefined global communicator MPI_COMM_WORLD.

Page 16: Parallel computing(2)

MPI_Comm_rank

int MPI_Comm_rank (MPI_Comm comm, int *rank)

This procedure determines the rank (index) of the calling process in the communicator. Each process is assigned a unique number within a communicator.

Page 17: Parallel computing(2)

MPI_COMM_WORLD

MPI communicators are used to specify which processes a communication applies to. A communicator is shared by a group of processes. The predefined MPI_COMM_WORLD applies to all processes. Communicators can be created, duplicated and deleted. For most applications, use of MPI_COMM_WORLD suffices.
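The slides give no communicator example, so here is a minimal sketch (an addition, not the authors' code) of creating and deleting one: MPI_Comm_split divides MPI_COMM_WORLD into a communicator of even ranks and one of odd ranks, and MPI_Comm_free deletes it again.

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int worldrank, subrank;
    MPI_Comm subcomm;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &worldrank);

    // Processes with the same colour (0 = even, 1 = odd) end up in the same communicator
    MPI_Comm_split (MPI_COMM_WORLD, worldrank % 2, worldrank, &subcomm);
    MPI_Comm_rank (subcomm, &subrank);

    printf ("World rank %d has rank %d in its sub-communicator\n", worldrank, subrank);

    MPI_Comm_free (&subcomm);
    MPI_Finalize ();
    return 0;
}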

Page 18: Parallel computing(2)

Example ‘Hello World!’

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int size, rank;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    printf ("Hello world! from processor (%d/%d)\n", rank + 1, size);

    MPI_Finalize ();

    return 0;
}

Page 19: Parallel computing(2)

Running ‘Hello World!’

$ mpicc -o hello hello.c
$ mpirun -np 3 hello
Hello world! from processor (1/3)
Hello world! from processor (2/3)
Hello world! from processor (3/3)
$ _

Page 20: Parallel computing(2)

MPI_Send

int MPI_Send (void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

Synchronously sends a message to dest. The data is found in buf, which contains count elements of datatype. To identify the send, a tag has to be specified. The destination dest is the process rank in communicator comm.

Page 21: Parallel computing(2)

MPI_Recv

int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Synchronously receives a message from source. The buffer must be able to hold count elements of datatype. The status field is filled with status information. MPI_Recv and MPI_Send calls should match: equal tag, count and datatype.

Page 22: Parallel computing(2)

Datatypes

MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double

(http://www-jics.cs.utk.edu/MPI/MPIguide/MPIguide.html)

Page 23: Parallel computing(2)

Example send / receive

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    MPI_Status s;
    int size, rank, i, j;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0)   // Master process
    {
        printf ("Receiving data . . .\n");
        for (i = 1; i < size; i++)
        {
            MPI_Recv ((void *)&j, 1, MPI_INT, i, 0xACE5, MPI_COMM_WORLD, &s);
            printf ("[%d] sent %d\n", i, j);
        }
    }
    else
    {
        j = rank * rank;
        MPI_Send ((void *)&j, 1, MPI_INT, 0, 0xACE5, MPI_COMM_WORLD);
    }

    MPI_Finalize ();
    return 0;
}

Page 24: Parallel computing(2)

Running send / receive

$ mpicc -o sendrecv sendrecv.c
$ mpirun -np 4 sendrecv
Receiving data . . .
[1] sent 1
[2] sent 4
[3] sent 9
$ _

Page 25: Parallel computing(2)

MPI_Bcast

int MPI_Bcast (void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

Synchronously broadcasts a message from root to all processes in communicator comm (including itself). The buffer is used as the source in the root process and as the destination in the others.

Page 26: Parallel computing(2)

MPI_Barrier

int MPI_Barrier (MPI_Comm comm)

Blocks until all processes defined in comm have reached this routine. Use this routine to synchronise processes.

Page 27: Parallel computing(2)

Example broadcast / barrier

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int rank, i;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    if (rank == 0)
        i = 27;
    MPI_Bcast ((void *)&i, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf ("[%d] i = %d\n", rank, i);

    // Wait for every process to reach this code
    MPI_Barrier (MPI_COMM_WORLD);

    MPI_Finalize ();

    return 0;
}

Page 28: Parallel computing(2)

Running broadcast / barrier

$ mpicc -o broadcast broadcast.c
$ mpirun -np 3 broadcast
[0] i = 27
[1] i = 27
[2] i = 27
$ _

Page 29: Parallel computing(2)

MPI_Sendrecv

int MPI_Sendrecv (void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

int MPI_Sendrecv_replace (void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

Combined send and receive (the second variant uses a single buffer for both).
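The slides give no MPI_Sendrecv example, so here is a minimal sketch (an addition, not the authors' code): every process passes its rank to the next process in a ring and receives from the previous one, without risking deadlock.

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int size, rank, left, right, sendval, recvval;
    MPI_Status s;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    right = (rank + 1) % size;          // destination in the ring
    left  = (rank + size - 1) % size;   // source in the ring
    sendval = rank;

    MPI_Sendrecv ((void *)&sendval, 1, MPI_INT, right, 0,
                  (void *)&recvval, 1, MPI_INT, left,  0,
                  MPI_COMM_WORLD, &s);

    printf ("[%d] received %d from [%d]\n", rank, recvval, left);

    MPI_Finalize ();
    return 0;
}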

Page 30: Parallel computing(2)

Other useful routines

• MPI_Scatter
• MPI_Gather
• MPI_Type_vector
• MPI_Type_commit
• MPI_Reduce / MPI_Allreduce
• MPI_Op_create

(A short sketch using MPI_Gather and MPI_Allreduce follows below.)
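Here is a minimal sketch (an addition, not the authors' code) of two of the routines listed above: MPI_Gather collects one value from every process on the root, and MPI_Allreduce leaves the reduced result on all processes.

#include <stdio.h>
#include "mpi.h"

#define MAXPROCS 64   // hypothetical upper bound on the number of processes

int main (int argc, char *argv[])
{
    int size, rank, i, sum, ranks[MAXPROCS];

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    // Gather every rank into the array on process 0
    MPI_Gather ((void *)&rank, 1, MPI_INT, (void *)ranks, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (i = 0; i < size; i++)
            printf ("Gathered rank %d\n", ranks[i]);

    // Every process obtains the sum of all ranks
    MPI_Allreduce ((void *)&rank, (void *)&sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf ("[%d] sum of ranks = %d\n", rank, sum);

    MPI_Finalize ();
    return 0;
}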

Page 31: Parallel computing(2)

Example scatter / reduce

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int data[] = {1, 2, 3, 4, 5, 6, 7};   // Size must be >= #processors
    int rank, i = -1, j = -1;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Scatter ((void *)data, 1, MPI_INT, (void *)&i, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf ("[%d] Received i = %d\n", rank, i);

    MPI_Reduce ((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD, 0, MPI_COMM_WORLD);

    printf ("[%d] j = %d\n", rank, j);

    MPI_Finalize ();

    return 0;
}

Page 32: Parallel computing(2)

Running scatter / reduce

$ mpicc -o scatterreduce scatterreduce.c
$ mpirun -np 4 scatterreduce
[0] Received i = 1
[0] j = 24
[1] Received i = 2
[1] j = -1
[2] Received i = 3
[2] j = -1
[3] Received i = 4
[3] j = -1
$ _

Page 33: Parallel computing(2)

Some reduce operations

MPI_MAX    Maximum value
MPI_MIN    Minimum value
MPI_SUM    Sum of values
MPI_PROD   Product of values
MPI_LAND   Logical AND
MPI_BAND   Bit-wise AND
MPI_LOR    Logical OR
MPI_BOR    Bit-wise OR
MPI_LXOR   Logical exclusive OR
MPI_BXOR   Bit-wise exclusive OR

Page 34: Parallel computing(2)

Measuring running time

double MPI_Wtime (void);

double timeStart, timeEnd;
...
timeStart = MPI_Wtime ();
// Code to measure goes here.
timeEnd = MPI_Wtime ();
...
printf ("Running time = %f seconds\n", timeEnd - timeStart);
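As a small extension (an assumption, not shown on the slides), an MPI_Barrier before and after the measured region makes all processes time the same work, and only process 0 needs to print the result:

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
    int rank;
    double timeStart, timeEnd;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Barrier (MPI_COMM_WORLD);   // start together
    timeStart = MPI_Wtime ();

    // Code to measure goes here.

    MPI_Barrier (MPI_COMM_WORLD);   // wait for the slowest process
    timeEnd = MPI_Wtime ();

    if (rank == 0)
        printf ("Running time = %f seconds\n", timeEnd - timeStart);

    MPI_Finalize ();
    return 0;
}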

Page 35: Parallel computing(2)

Parallel sorting (1)

Sorting a sequence of numbers using the binary-sort method. This method divides a given sequence into two halves (until only one element remains) and sorts both halves recursively. The two halves are then merged together to form a sorted sequence.

Page 36: Parallel computing(2)

Binary sort pseudo-code

sorted-sequence BinarySort (sequence)
{
    if (# elements in sequence > 1)
    {
        seqA = first half of sequence
        seqB = second half of sequence
        seqA = BinarySort (seqA);
        seqB = BinarySort (seqB);
        sorted-sequence = merge (seqA, seqB);
    }
    else
        sorted-sequence = sequence
}
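A minimal sequential C sketch of this pseudo-code (an addition, not the slides' code), using a temporary buffer for the merge step:

#include <stdio.h>
#include <string.h>

// Merge the sorted ranges [lo, mid) and [mid, hi) of seq via tmp
static void merge (int *seq, int *tmp, int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (seq[i] <= seq[j]) ? seq[i++] : seq[j++];
    while (i < mid) tmp[k++] = seq[i++];
    while (j < hi)  tmp[k++] = seq[j++];
    memcpy (seq + lo, tmp + lo, (hi - lo) * sizeof (int));
}

// Sort the range [lo, hi) of seq by recursive halving and merging
static void BinarySort (int *seq, int *tmp, int lo, int hi)
{
    if (hi - lo > 1)
    {
        int mid = (lo + hi) / 2;
        BinarySort (seq, tmp, lo, mid);   // first half
        BinarySort (seq, tmp, mid, hi);   // second half
        merge (seq, tmp, lo, mid, hi);
    }
}

int main (void)
{
    int seq[] = {1, 7, 5, 2, 8, 4, 6, 3};
    int tmp[8], i;

    BinarySort (seq, tmp, 0, 8);
    for (i = 0; i < 8; i++)
        printf ("%d ", seq[i]);
    printf ("\n");
    return 0;
}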

Page 37: Parallel computing(2)

Merge two sorted sequences

[Figure: two sorted sequences are merged element by element into a single sorted sequence.]

Page 38: Parallel computing(2)

Example binary-sort

[Figure: recursion tree of binary-sort on the sequence 1 7 5 2 8 4 6 3; the sequence is repeatedly split into halves and the halves are merged back into the sorted sequence 1 2 3 4 5 6 7 8.]

Page 39: Parallel computing(2)

Parallel sorting (2)

This way of dividing work and gathering the results is quite natural for a parallel implementation. Divide the work in two and hand the halves to two processors. Have each of these processors divide its work again, until either the data cannot be split any further or no more processors are available.
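One possible way to pair processes for this recursive split (an assumption, not a scheme given on the slides) is to let a process that still owns a block hand half of it to a partner whose rank differs by a power of two; the loop below only prints who would send to whom:

#include <stdio.h>

int main (void)
{
    int size = 6;      // hypothetical number of processes
    int levels = 0, d, rank;

    while ((1 << levels) < size)
        levels++;      // number of splitting levels: ceil(log2(size))

    for (d = levels - 1; d >= 0; d--)
        for (rank = 0; rank < size; rank++)
            if (rank % (1 << (d + 1)) == 0 && rank + (1 << d) < size)
                printf ("Level %d: process %d gives half of its data to process %d\n",
                        levels - 1 - d, rank, rank + (1 << d));
    return 0;
}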

Page 40: Parallel computing(2)

Implementation problems

• Number of processors may not be a power of two
• Number of elements may not be a power of two
• How to achieve an even workload?
• Data size may be less than the number of processors

Page 41: Parallel computing(2)

Parallel matrix multiplication

We use the following partitioning of data (p=4)

[Figure: partitioning of the data over the processes P1, P2, P3, P4.]

Page 42: Parallel computing(2)

Implementation

1. Master (process 0) reads data
2. Master sends size of data to slaves
3. Slaves allocate memory
4. Master broadcasts second matrix to all other processes
5. Master sends respective parts of first matrix to all other processes
6. Every process performs its local multiplication
7. All slave processes send back their result

(A sketch of this scheme in C follows below.)
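A minimal sketch of this scheme (an addition, not the authors' program), assuming an N x N problem where N is divisible by the number of processes; for brevity it uses MPI_Scatter and MPI_Gather instead of explicit sends and fixes the size at compile time, so steps 2 and 3 are implicit:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define N 8   // hypothetical matrix size, divisible by the number of processes

int main (int argc, char *argv[])
{
    int size, rank, rows, i, j, k;
    double *A = NULL, *C = NULL, *B, *Apart, *Cpart;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    rows  = N / size;                          // rows of the first matrix per process
    B     = malloc (N * N * sizeof (double));
    Apart = malloc (rows * N * sizeof (double));
    Cpart = malloc (rows * N * sizeof (double));

    if (rank == 0)                             // 1. master reads (here: fills) the data
    {
        A = malloc (N * N * sizeof (double));
        C = malloc (N * N * sizeof (double));
        for (i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    // 4. master broadcasts the second matrix to all processes
    MPI_Bcast ((void *)B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // 5. master distributes the row blocks of the first matrix
    MPI_Scatter ((void *)A, rows * N, MPI_DOUBLE, (void *)Apart, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // 6. every process multiplies its row block with the second matrix
    for (i = 0; i < rows; i++)
        for (j = 0; j < N; j++)
        {
            double sum = 0.0;
            for (k = 0; k < N; k++)
                sum += Apart[i * N + k] * B[k * N + j];
            Cpart[i * N + j] = sum;
        }

    // 7. the partial results are collected on the master
    MPI_Gather ((void *)Cpart, rows * N, MPI_DOUBLE, (void *)C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf ("C[0][0] = %f\n", C[0]);

    MPI_Finalize ();
    return 0;
}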

Page 43: Parallel computing(2)

Multiplication 1000 x 1000

[Graph: 1000 x 1000 matrix multiplication; running time in seconds versus the number of processors (0 to 60), comparing the measured time Tp with the ideal time T1 / p.]

Page 44: Parallel computing(2)

Multiplication 5000 x 5000

[Graph: 5000 x 5000 matrix multiplication; running time in seconds versus the number of processors (0 to 35), comparing the measured time Tp with the ideal time T1 / p.]

Page 45: Parallel computing(2)

Gaussian elimination

We use the following partitioning of data (p=4)

[Figure: partitioning of the matrix over the processes P1, P2, P3, P4.]

Page 46: Parallel computing(2)

Implementation (1)

1. Master reads both matrices
2. Master sends size of matrices to slaves
3. Slaves calculate their part and allocate memory
4. Master sends each slave its respective part
5. Set sweeping row to 0 in all processes
6. Sweep matrix (see next sheet)
7. Slaves send back their result

Page 47: Parallel computing(2)

Implementation (2)

While the sweeping row is not past the final row do
A. Have every process decide whether it owns the current sweeping row
B. The owner sends a copy of the row to every other process
C. All processes sweep their part of the matrix using the current row
D. The sweeping row is incremented

(A sketch of this loop in C follows below.)
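A minimal sketch of this sweep loop (an addition, not the authors' code), assuming N rows distributed in contiguous blocks of N / size rows per process and a matrix stored row-wise as doubles; a real program would receive its block from the master instead of filling it locally:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define N 8   // hypothetical matrix dimension, divisible by the number of processes

int main (int argc, char *argv[])
{
    int size, rank, rows, first, i, j, k, owner;
    double *mat, *pivot, factor;

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &size);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    rows  = N / size;          // rows owned by this process
    first = rank * rows;       // global index of the first owned row
    mat   = malloc (rows * N * sizeof (double));
    pivot = malloc (N * sizeof (double));

    for (i = 0; i < rows * N; i++)          // placeholder data
        mat[i] = (double)(rank + i % N + 1);

    for (k = 0; k < N; k++)                 // the sweeping row
    {
        owner = k / rows;                   // A. who owns the current sweeping row?
        if (rank == owner)
            for (j = 0; j < N; j++)
                pivot[j] = mat[(k - first) * N + j];

        // B. the owner sends a copy of the row to every other process
        MPI_Bcast ((void *)pivot, N, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        // C. every process sweeps its own rows below row k
        for (i = 0; i < rows; i++)
            if (first + i > k && pivot[k] != 0.0)
            {
                factor = mat[i * N + k] / pivot[k];
                for (j = k; j < N; j++)
                    mat[i * N + j] -= factor * pivot[j];
            }
    }                                       // D. the sweeping row is incremented

    if (rank == 0)
        printf ("First local element after sweeping: %f\n", mat[0]);

    MPI_Finalize ();
    return 0;
}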

Page 48: Parallel computing(2)

Programming hints

• Keep it simple!
• Avoid deadlocks
• Write robust code, even at the cost of speed
• Design in advance; debugging is more difficult (printing output is different)
• Error handling requires synchronisation; you can't just exit the program

Page 49: Parallel computing(2)

References (1)

MPI Forum Home Page
http://www.mpi-forum.org/index.html

Beginners guide to MPI (see also /MPI/)
http://www-jics.cs.utk.edu/MPI/MPIguide/MPIguide.html

MPICH
http://www-unix.mcs.anl.gov/mpi/mpich/

Page 50: Parallel computing(2)

References (2)

Miscellaneous

http://www.erc.msstate.edu/labs/hpcl/projects/mpi/
http://nexus.cs.usfca.edu/mpi/
http://www-unix.mcs.anl.gov/~gropp/
http://www.epm.ornl.gov/~walker/mpitutorial/
http://www.lam-mpi.org/
http://epcc.ed.ac.uk/chimp/
http://www-unix.mcs.anl.gov/mpi/www/www3/

Page 51: Parallel computing(2)