
Page 1: High Performance Parallel Programming

High Performance Parallel Programming

Dirk van der Knijff
Advanced Research Computing

Information Division

Page 2: High Performance Parallel Programming


High Performance Parallel Programming

• Lecture 4: Message Passing Interface 3

Page 3: High Performance Parallel Programming


So Far...

• Messages
  – source, dest, data, tag, communicator
• Communicators
  – MPI_COMM_WORLD
• Point-to-point communications
  – different modes - standard, synchronous, buffered, ready
  – blocking vs non-blocking
• Derived datatypes
  – construct then commit

Page 4: High Performance Parallel Programming


Ping-pong exercise: program

/**********************************************************************
 * This file has been written as a sample solution to an exercise in a
 * course given at the Edinburgh Parallel Computing Centre. It is made
 * freely available with the understanding that every copy of this file
 * must include this header and that EPCC takes no responsibility for
 * the use of the enclosed teaching material.
 *
 * Authors: Joel Malard, Alan Simpson
 *
 * Contact: [email protected]
 *
 * Purpose: A program to experiment with point-to-point
 *          communications.
 *
 * Contents: C source code.
 *
 ********************************************************************/

Page 5: High Performance Parallel Programming

#include <stdio.h>
#include <mpi.h>

#define proc_A 0
#define proc_B 1
#define ping 101
#define pong 101

float buffer[100000];
long float_size;

void processor_A (void), processor_B (void);

void main ( int argc, char *argv[] )
{
    int ierror, rank, size;
    extern long float_size;

    MPI_Init(&argc, &argv);
    MPI_Type_extent(MPI_FLOAT, &float_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == proc_A)
        processor_A();
    else if (rank == proc_B)
        processor_B();

    MPI_Finalize();
}

Page 6: High Performance Parallel Programming

void processor_A( void )
{
    int i, length, ierror;
    MPI_Status status;
    double start, finish, time;
    extern float buffer[100000];
    extern long float_size;

    printf("Length\tTotal Time\tTransfer Rate\n");

    for (length = 1; length <= 100000; length += 1000) {
        start = MPI_Wtime();
        for (i = 1; i <= 100; i++) {
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_B, ping,
                      MPI_COMM_WORLD);
            MPI_Recv(buffer, length, MPI_FLOAT, proc_B, pong,
                     MPI_COMM_WORLD, &status);
        }
        finish = MPI_Wtime();
        time = finish - start;
        printf("%d\t%f\t%f\n", length, time/200.,
               (float)(2 * float_size * 100 * length)/time);
    }
}

Page 7: High Performance Parallel Programming

void processor_B( void )
{
    int i, length, ierror;
    MPI_Status status;
    extern float buffer[100000];

    for (length = 1; length <= 100000; length += 1000) {
        for (i = 1; i <= 100; i++) {
            MPI_Recv(buffer, length, MPI_FLOAT, proc_A, ping,
                     MPI_COMM_WORLD, &status);
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_A, pong,
                      MPI_COMM_WORLD);
        }
    }
}

Page 8: High Performance Parallel Programming


Ping-pong exercise: results

[Chart: Ping_pong performance - total time (seconds, left axis, 0 to 0.0045) and transfer rate (MBytes/sec, right axis, 0 to 9) versus message length.]

Page 9: High Performance Parallel Programming


Ping-pong exercise: results 2

[Chart: Ping_pong performance - total time (seconds, left axis, 0 to 0.06) and transfer rate (MBytes/sec, right axis, 0 to 12) versus message length (1 to 90001).]

Page 10: High Performance Parallel Programming


Running ping-pong

compile:
    mpicc ping_pong.c -o ping_pong

submit:
    qsub ping_pong.sh

where ping_pong.sh is:
    #PBS -q exclusive
    #PBS -l nodes=2
    cd <your sub_directory>
    mpirun ping_pong

Page 11: High Performance Parallel Programming


Collective communication

• Communications involving a group of processes
• Called by all processes in a communicator
  – for sub-groups need to form a new communicator
• Examples
  – Barrier synchronisation
  – Broadcast, Scatter, Gather
  – Global sum, Global maximum, etc.

Page 12: High Performance Parallel Programming


Characteristics

• Collective action over a communicator
• All processes must communicate
• Synchronisation may or may not occur
• All collective operations are blocking
• No tags
• Receive buffers must be exactly the right size
• Collective communications and point-to-point communications cannot interfere

Page 13: High Performance Parallel Programming


MPI_Barrier

• Blocks each calling process until all other members have also called it.
• Generally used to synchronise between phases of a program
• Only one argument - no data is exchanged

MPI_Barrier(comm)
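
For illustration (this example is not from the original slides), a barrier separating two phases of a program might look like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d: phase 1 done\n", rank);

    /* no process starts phase 2 until every process has reached this point */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank %d: phase 2 starting\n", rank);

    MPI_Finalize();
    return 0;
}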

Page 14: High Performance Parallel Programming


Broadcast

• Copies data from a specified root process to all other processes in the communicator
  – all processes must specify the same root
  – other arguments same as for point-to-point
  – datatypes and sizes must match

MPI_Bcast(buffer, count, datatype, root, comm)

• Note: MPI does not support a multicast function
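
As a minimal illustration (not part of the original slides), rank 0 broadcasts a small array and every process ends up with the same values:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, data[4] = { 0, 0, 0, 0 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                /* only the root fills the buffer */
        data[0] = 10; data[1] = 20; data[2] = 30; data[3] = 40;
    }

    /* every process calls MPI_Bcast, all specifying the same root (0) */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d now has %d %d %d %d\n",
           rank, data[0], data[1], data[2], data[3]);

    MPI_Finalize();
    return 0;
}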

Page 15: High Performance Parallel Programming


Scatter, Gather

• Scatter and Gather are inverse operations
• Note that all processes partake - even root

Scatter:
  before:  the root holds  a b c d e ; the other processes hold nothing
  after:   the processes hold  a | b | c | d | e  (one element each)

Page 16: High Performance Parallel Programming


Gather

Gather:
  before:  the processes hold  a | b | c | d | e  (one element each)
  after:   the root holds  a b c d e ; each process keeps its own element

Page 17: High Performance Parallel Programming


MPI_Scatter, MPI_Gather

MPI_Scatter(sendbuf, sendcount, sendtype,
            recvbuf, recvcount, recvtype, root, comm)

MPI_Gather(sendbuf, sendcount, sendtype,
           recvbuf, recvcount, recvtype, root, comm)

• sendcount in scatter and recvcount in gather refer to the size of each individual message
  (sendtype = recvtype => sendcount = recvcount)
• total type signatures must match
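
A minimal MPI_Scatter sketch (not from the slides; the next slide shows the matching gather example): the root hands one int to each process, so sendcount is 1 - the size of each individual message.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, gsize, i, mine;
    int *sendbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    if (rank == 0) {                           /* only the root needs a send buffer */
        sendbuf = (int *)malloc(gsize * sizeof(int));
        for (i = 0; i < gsize; i++)
            sendbuf[i] = 100 + i;
    }

    /* each process receives exactly one int from the root's buffer */
    MPI_Scatter(sendbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d got %d\n", rank, mine);

    if (rank == 0)
        free(sendbuf);
    MPI_Finalize();
    return 0;
}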

Page 18: High Performance Parallel Programming


Example

MPI_Comm comm;
int gsize, sendarray[100];
int root, myrank, *rbuf;
MPI_Datatype rtype;
...
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &gsize);
MPI_Type_contiguous(100, MPI_INT, &rtype);
MPI_Type_commit(&rtype);
if (myrank == root) {
    rbuf = (int *)malloc(gsize*100*sizeof(int));
}
/* recvcount is 1 because rtype is 100 contiguous MPI_INTs, so the total
   type signature (100 ints per process) still matches the send side */
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);

Page 19: High Performance Parallel Programming


More routines

MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)

MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)

Allgather: before, the processes hold  a | b | c | d | e ; after, every process holds  a b c d e.

Alltoall:  before, process i holds row i of
             a b c d e / f g h i j / k l m n o / p q r s t / u v w x y ;
           after, process i holds column i:
             a f k p u / b g l q v / c h m r w / d i n s x / e j o t y.

Page 20: High Performance Parallel Programming


Vector routines

MPI_Scatterv(sendbuf, sendcounts, displs, sendtype,
             recvbuf, recvcount, recvtype, root, comm)

MPI_Gatherv(sendbuf, sendcount, sendtype,
            recvbuf, recvcounts, displs, recvtype, root, comm)

MPI_Allgatherv(sendbuf, sendcount, sendtype,
               recvbuf, recvcounts, displs, recvtype, comm)

MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype,
              recvbuf, recvcounts, rdispls, recvtype, comm)

• Allow send/recv to be from/to non-contiguous locations in an array
• Useful if sending different counts at different times (see the sketch below)
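
Not from the slides, but as a sketch of why the vector forms exist: here each rank contributes a different number of ints (rank r sends r+1) and the root lays the pieces out using recvcounts and displs.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, gsize, i, mycount, total = 0;
    int *sendbuf, *recvbuf = NULL, *recvcounts = NULL, *displs = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    mycount = rank + 1;                        /* rank r contributes r+1 ints */
    sendbuf = (int *)malloc(mycount * sizeof(int));
    for (i = 0; i < mycount; i++)
        sendbuf[i] = rank;

    if (rank == 0) {                           /* only the root needs the receive layout */
        recvcounts = (int *)malloc(gsize * sizeof(int));
        displs     = (int *)malloc(gsize * sizeof(int));
        for (i = 0; i < gsize; i++) {
            recvcounts[i] = i + 1;             /* how many ints rank i sends */
            displs[i]     = total;             /* offset of rank i's block in recvbuf */
            total        += recvcounts[i];
        }
        recvbuf = (int *)malloc(total * sizeof(int));
    }

    MPI_Gatherv(sendbuf, mycount, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < total; i++)
            printf("%d ", recvbuf[i]);
        printf("\n");
        free(recvbuf); free(recvcounts); free(displs);
    }

    free(sendbuf);
    MPI_Finalize();
    return 0;
}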

Page 21: High Performance Parallel Programming


Global reduction routines

• Used to compute a result which depends on data distributed over a number of processes
• Examples:
  – global sum or product
  – global maximum or minimum
  – global user-defined operation
• Operation should be associative
  – aside: floating-point operations are technically not associative, and while we usually don't care, this can affect results in parallel programs (see the sketch below)
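
A quick serial illustration of that aside (not part of the course material): summing the same three numbers in a different order gives different single-precision results, which is why a parallel reduction need not bit-match a serial sum.

#include <stdio.h>

int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    /* (a + b) + c = 1.0, but a + (b + c) = 0.0 in single precision:
       b + c rounds back to -1.0e8f, so the contribution of c is lost */
    printf("(a+b)+c = %f\n", (a + b) + c);
    printf("a+(b+c) = %f\n", a + (b + c));
    return 0;
}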

Page 22: High Performance Parallel Programming


Global reduction (cont.)

MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)

• combines count elements from each sendbuf using op and leaves the results in recvbuf on process root
• e.g.

MPI_Reduce(&s, &r, 2, MPI_INT, MPI_SUM, 1, comm)

[Figure: each of five processes holds two values in s; after the call the root (rank 1) holds the element-wise sums (8 and 9 in the example) in r.]

Page 23: High Performance Parallel Programming


Reduction operators

MPI_MAX      Maximum
MPI_MIN      Minimum
MPI_SUM      Sum
MPI_PROD     Product
MPI_LAND     Logical AND
MPI_BAND     Bitwise AND
MPI_LOR      Logical OR
MPI_BOR      Bitwise OR
MPI_LXOR     Logical XOR
MPI_BXOR     Bitwise XOR
MPI_MAXLOC   Max value and location
MPI_MINLOC   Min value and location
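
As an illustration of the location operators (this example is not in the original slides), MPI_MAXLOC works on value/index pairs such as the predefined MPI_DOUBLE_INT type:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    struct { double value; int rank; } in, out;   /* layout matches MPI_DOUBLE_INT */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in.value = (double)((rank * 37) % 11);        /* some per-rank value */
    in.rank  = rank;                              /* its "location" */

    /* root 0 ends up with the largest value and the rank that owned it */
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max value %f found on rank %d\n", out.value, out.rank);

    MPI_Finalize();
    return 0;
}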

Page 24: High Performance Parallel Programming


User-defined operators

In C the operator is defined as a function of type

typedef void MPI_User_function(void *invec, void *inoutvec,
                               int *len, MPI_Datatype *datatype);

In Fortran you must write a function as

function <user_function>(invec(*), inoutvec(*), len, type)

where the function has the following schema

for (i = 1 to len)
    inoutvec(i) = inoutvec(i) op invec(i)

Then

MPI_Op_create(user_function, commute, op)

returns a handle op of type MPI_Op
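
A sketch of the whole sequence in C (not from the slides; absmax is just an illustrative operation): define the MPI_User_function, create the operator, use it in a reduction, then free it.

#include <stdio.h>
#include <mpi.h>

/* element-wise maximum of absolute values: inoutvec(i) = inoutvec(i) op invec(i) */
void absmax(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int i;
    double *in = (double *)invec, *inout = (double *)inoutvec;
    double a, b;

    for (i = 0; i < *len; i++) {
        a = (in[i] < 0.0) ? -in[i] : in[i];
        b = (inout[i] < 0.0) ? -inout[i] : inout[i];
        inout[i] = (a > b) ? a : b;
    }
}

int main(int argc, char *argv[])
{
    int rank;
    double s, r;
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op_create(absmax, 1, &op);          /* 1 = the operation commutes */

    s = (rank % 2) ? -(double)rank : (double)rank;
    MPI_Reduce(&s, &r, 1, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("largest |value| = %f\n", r);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}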

Page 25: High Performance Parallel Programming


Variants

MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm)

• All processes involved receive identical results

MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, datatype, op, comm)

• Acts as if a reduce were performed and then each process receives recvcounts(myrank) elements of the result.
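
For illustration (not from the slides), MPI_Allreduce is the usual way to form a global sum that every process then needs locally - no separate broadcast is required:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = rank + 1.0;                    /* each process's contribution */

    /* every process receives the identical global sum */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d: global sum = %f\n", rank, size, global);

    MPI_Finalize();
    return 0;
}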

Page 26: High Performance Parallel Programming


Reduce-scatter

int *s, *r;
int rc[5] = { 1, 2, 0, 1, 1 };   /* how many result elements each rank receives */
int rank, gsize;
...
MPI_Reduce_scatter(s, r, rc, MPI_INT, MPI_SUM, comm);

[Figure: each of the five processes holds five values in s:
   1 1 2 1 3
   1 2 1 2 2
   1 3 1 1 2
   2 1 1 2 1
   2 2 1 3 1
 The element-wise sums are 7 9 6 9 9. With rc = {1, 2, 0, 1, 1}, rank 0 receives 7,
 rank 1 receives 9 6, rank 2 receives nothing, rank 3 receives 9 and rank 4 receives 9.]

Page 27: High Performance Parallel Programming


Scan

MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm)

• Performs a prefix reduction on the data across the group:
  recvbuf on rank myrank = op applied to sendbuf on ranks 0, 1, ..., myrank

MPI_Scan(&s, &r, 5, MPI_INT, MPI_SUM, comm);

[Figure: each of the five processes again holds five values in s:
   1 1 2 1 3
   1 2 1 2 2
   1 3 1 1 2
   2 1 1 2 1
   2 2 1 3 1
 After the scan, process i holds the element-wise sum over processes 0..i:
   1 1 2 1 3
   2 3 3 3 5
   3 6 4 4 7
   5 7 5 6 8
   7 9 6 9 9 ]

Page 28: High Performance Parallel Programming


Further topics

• Error-handling
  – Errors are handled by an error handler
  – MPI_ERRORS_ARE_FATAL - default for MPI_COMM_WORLD
  – MPI_ERRORS_RETURN - MPI state is undefined
  – MPI_Error_string(errorcode, string, resultlen)
• Message probing (see the sketch below)
  – Messages can be probed
  – Note - wildcard reads may receive a different message
  – blocking and non-blocking
• Persistent communications
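
As a sketch of message probing (not from the slides; it assumes at least two processes): MPI_Probe plus MPI_Get_count lets the receiver size its buffer before calling MPI_Recv. Because of the wildcard caveat above, the receive uses the source and tag from the probed status so it matches the same message.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, count;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int msg[3] = { 7, 8, 9 };
        MPI_Send(msg, 3, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int *buf;

        /* block until a matching message is available, without receiving it */
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);

        buf = (int *)malloc(count * sizeof(int));

        /* receive exactly the message that was probed */
        MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, &status);

        printf("rank 1 received %d ints, first = %d\n", count, buf[0]);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}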

Page 29: High Performance Parallel Programming


Assignment 2

• Write a general procedure to multiply 2 matrices.
• Start with
  – http://www.hpc.unimelb.edu.au/cs/assignment2/
• This is a harness for last year's assignment
  – Last year I asked them to optimise first
  – This year just parallelize
• Next Tuesday I will discuss strategies
  – That doesn't mean don't start now…
  – Ideas available in various places…

Page 30: High Performance Parallel Programming


High Performance Parallel Programming

Tomorrow - matrix multiplication