High Performance Parallel Programming
Dirk van der Knijff
Advanced Research Computing
Information Division

Lecture 4: Message Passing Interface 3
So far
• Messages
  – source, dest, data, tag, communicator
• Communicators
  – MPI_COMM_WORLD
• Point-to-point communications
  – different modes - standard, synchronous, buffered, ready
  – blocking vs non-blocking
• Derived datatypes
  – construct then commit
Ping-pong exercise: program

/**********************************************************************
* This file has been written as a sample solution to an exercise in a
* course given at the Edinburgh Parallel Computing Centre. It is made
* freely available with the understanding that every copy of this file
* must include this header and that EPCC takes no responsibility for
* the use of the enclosed teaching material.
*
* Authors: Joel Malard, Alan Simpson
*
* Contact: [email protected]
*
* Purpose: A program to experiment with point-to-point
* communications.
*
* Contents: C source code.
*
********************************************************************/
#include <stdio.h>
#include <mpi.h>
#define proc_A 0
#define proc_B 1
#define ping 101
#define pong 101
float buffer[100000];
MPI_Aint float_size;              /* extent of MPI_FLOAT in bytes */

void processor_A(void), processor_B(void);
int main( int argc, char *argv[] )
{
    int rank;
    extern MPI_Aint float_size;

    MPI_Init(&argc, &argv);
    /* MPI_Type_extent is the MPI-1 call; MPI-2 renames it MPI_Type_get_extent */
    MPI_Type_extent(MPI_FLOAT, &float_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == proc_A)
        processor_A();
    else if (rank == proc_B)
        processor_B();
    MPI_Finalize();
    return 0;
}
void processor_A( void )
{
    int i, length;
    MPI_Status status;
    double start, finish, time;
    extern float buffer[100000];
    extern MPI_Aint float_size;

    printf("Length\tTotal Time\tTransfer Rate\n");
    for (length = 1; length <= 100000; length += 1000) {
        start = MPI_Wtime();
        for (i = 1; i <= 100; i++) {
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_B, ping,
                      MPI_COMM_WORLD);
            MPI_Recv(buffer, length, MPI_FLOAT, proc_B, pong,
                     MPI_COMM_WORLD, &status);
        }
        finish = MPI_Wtime();
        time = finish - start;
        /* 200 = 100 iterations x 2 messages each; rate is in bytes/second */
        printf("%d\t%f\t%f\n", length, time/200.,
               (float)(2 * float_size * 100 * length)/time);
    }
}
void processor_B( void )
{
    int i, length;
    MPI_Status status;
    extern float buffer[100000];

    for (length = 1; length <= 100000; length += 1000) {
        for (i = 1; i <= 100; i++) {
            MPI_Recv(buffer, length, MPI_FLOAT, proc_A, ping,
                     MPI_COMM_WORLD, &status);
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_A, pong,
                      MPI_COMM_WORLD);
        }
    }
}
Ping-pong exercise: results

[Chart: Ping_pong performance - Total Time (seconds, 0 to 0.0045) and Transfer Rate (MBytes/sec, 0 to 9) versus Message Length]
Ping-pong exercise: results 2

[Chart: Ping_pong performance - Total Time (seconds, 0 to 0.06) and Transfer Rate (MBytes/sec, 0 to 12) versus Message Length (1 to 90001)]
Running ping-pong
compile:
    mpicc ping_pong.c -o ping_pong
submit:
    qsub ping_pong.sh
where ping_pong.sh is
    #PBS -q exclusive
    #PBS -l nodes=2
    cd <your sub_directory>
    mpirun ping_pong
Collective communication
• Communications involving a group of processes
• Called by all processes in a communicator
  – for sub-groups you need to form a new communicator
• Examples
  – Barrier synchronisation
  – Broadcast, Scatter, Gather
  – Global sum, global maximum, etc.
Characteristics
• Collective action over a communicator
• All processes must communicate
• Synchronisation may or may not occur
• All collective operations are blocking
• No tags
• Receive buffers must be exactly the right size
• Collective and point-to-point communications cannot interfere
MPI_Barrier
• Blocks each calling process until all other members have also called it
• Generally used to synchronise between phases of a program
• Only one argument - no data is exchanged

MPI_Barrier(comm)
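A minimal runnable sketch of the usual phase-separation pattern; the printed "phases" stand in for real work:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d: phase one\n", rank);   /* stands in for real work */
    MPI_Barrier(MPI_COMM_WORLD);            /* every rank waits here   */
    printf("rank %d: phase two\n", rank);   /* no rank starts early    */

    MPI_Finalize();
    return 0;
}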
Broadcast
• Copies data from a specified root process to all other processes in the communicator
  – all processes must specify the same root
  – other arguments same as for point-to-point
  – datatypes and sizes must match

MPI_Bcast(buffer, count, datatype, root, comm)

• Note: MPI does not support a multicast function
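A sketch of the call pattern (the value 42 and root 0 are arbitrary choices): every rank makes the same MPI_Bcast call, and only the root's buffer is read.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 42;                               /* only the root has the value */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d: n = %d\n", rank, n);     /* now every rank has it */

    MPI_Finalize();
    return 0;
}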
Scatter, Gather
• Scatter and Gather are inverse operations
• Note that all processes take part - even the root

Scatter:
before:  P0: a b c d e   P1: -   P2: -   P3: -   P4: -
after:   P0: a           P1: b   P2: c   P3: d   P4: e
Gather:
before:  P0: a           P1: b   P2: c   P3: d   P4: e
after:   P0: a b c d e   P1: b   P2: c   P3: d   P4: e
MPI_Scatter, MPI_Gather

MPI_Scatter(sendbuf, sendcount, sendtype,
            recvbuf, recvcount, recvtype, root, comm)

MPI_Gather(sendbuf, sendcount, sendtype,
           recvbuf, recvcount, recvtype, root, comm)

• sendcount in scatter and recvcount in gather refer to the size of each individual message
  (sendtype = recvtype => sendcount = recvcount)
• total type signatures must match
Example

MPI_Comm comm;
int gsize, sendarray[100];
int root, myrank, *rbuf;
MPI_Datatype rtype;
...
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &gsize);
MPI_Type_contiguous(100, MPI_INT, &rtype);
MPI_Type_commit(&rtype);
if (myrank == root) {
    rbuf = (int *)malloc(gsize*100*sizeof(int));
}
/* recvcount is 1 because rtype already describes a block of 100 ints */
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);
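For comparison, a minimal MPI_Scatter counterpart to the gather example above - a sketch assuming root 0 and one int per process:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int i, rank, gsize, mine;
    int *sendbuf = NULL;                      /* only significant on the root */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    if (rank == 0) {
        sendbuf = (int *)malloc(gsize * sizeof(int));
        for (i = 0; i < gsize; i++)
            sendbuf[i] = 10 * i;              /* rank i will receive 10*i */
    }
    MPI_Scatter(sendbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, mine);

    if (rank == 0) free(sendbuf);
    MPI_Finalize();
    return 0;
}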
More routines

MPI_Allgather(sendbuf, sendcount, sendtype,
              recvbuf, recvcount, recvtype, comm)

MPI_Alltoall(sendbuf, sendcount, sendtype,
             recvbuf, recvcount, recvtype, comm)

Allgather:
before:  P0: a   P1: b   P2: c   P3: d   P4: e
after:   every process: a b c d e

Alltoall (element-wise transpose):
before:  P0: a b c d e   P1: f g h i j   P2: k l m n o   P3: p q r s t   P4: u v w x y
after:   P0: a f k p u   P1: b g l q v   P2: c h m r w   P3: d i n s x   P4: e j o t y
Vector routines

MPI_Scatterv(sendbuf, sendcounts, displs, sendtype,
             recvbuf, recvcount, recvtype, root, comm)

MPI_Gatherv(sendbuf, sendcount, sendtype,
            recvbuf, recvcounts, displs, recvtype, root, comm)

MPI_Allgatherv(sendbuf, sendcount, sendtype,
               recvbuf, recvcounts, displs, recvtype, comm)

MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype,
              recvbuf, recvcounts, rdispls, recvtype, comm)

• Allow sends/receives to be from/to non-contiguous locations in an array
• Useful if sending different counts at different times
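A sketch of MPI_Gatherv where rank i contributes i+1 values; the counts and displacements are computed on the root (assumes at most 16 ranks):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int i, rank, gsize, total = 0;
    int mydata[16];                           /* assumes gsize <= 16 */
    int *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    for (i = 0; i <= rank; i++)               /* rank i sends i+1 copies of i */
        mydata[i] = rank;

    if (rank == 0) {
        recvcounts = (int *)malloc(gsize * sizeof(int));
        displs     = (int *)malloc(gsize * sizeof(int));
        for (i = 0; i < gsize; i++) {
            recvcounts[i] = i + 1;            /* how much rank i sends     */
            displs[i]     = total;            /* where it lands in recvbuf */
            total        += recvcounts[i];
        }
        recvbuf = (int *)malloc(total * sizeof(int));
    }
    MPI_Gatherv(mydata, rank + 1, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < total; i++)
            printf("%d ", recvbuf[i]);        /* 0 1 1 2 2 2 ... */
        printf("\n");
        free(recvbuf); free(recvcounts); free(displs);
    }
    MPI_Finalize();
    return 0;
}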
Global reduction routines
• Used to compute a result which depends on data distributed over a number of processes
• Examples:
  – global sum or product
  – global maximum or minimum
  – global user-defined operation
• Operation should be associative
  – aside: floating-point operations are technically not associative; we usually don't care, but this can change results in parallel programs
Global reduction (cont.)

MPI_Reduce(sendbuf, recvbuf, count, datatype, op,
           root, comm)

• combines count elements from each sendbuf using op and leaves the result in recvbuf on process root
• e.g.

MPI_Reduce(&s, &r, 2, MPI_INT, MPI_SUM, 1, comm)

before:  s = (2,3)   s = (1,1)   s = (3,2)   s = (1,1)   s = (1,2)   on ranks 0..4
after:   r = (8,9)   on rank 1 (the root); the reduction is element-wise
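A runnable sketch of the same pattern (the values are arbitrary; the root is rank 1 as in the picture above):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, s[2], r[2];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    s[0] = rank;                              /* r[0] will hold the sum of ranks */
    s[1] = 2 * rank;                          /* r[1] will hold twice that sum   */
    MPI_Reduce(s, r, 2, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);

    if (rank == 1)                            /* r is only defined on the root */
        printf("r = (%d, %d)\n", r[0], r[1]);

    MPI_Finalize();
    return 0;
}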
Reduction operators

MPI_MAX      Maximum
MPI_MIN      Minimum
MPI_SUM      Sum
MPI_PROD     Product
MPI_LAND     Logical AND
MPI_BAND     Bitwise AND
MPI_LOR      Logical OR
MPI_BOR      Bitwise OR
MPI_LXOR     Logical XOR
MPI_BXOR     Bitwise XOR
MPI_MAXLOC   Max value and location
MPI_MINLOC   Min value and location
User-defined operators

In C the operator is defined as a function of type

typedef void MPI_User_function(void *invec, void *inoutvec,
                               int *len, MPI_Datatype *datatype);

In Fortran you must write a function

function <user_function>(invec(*), inoutvec(*), len, type)

where the function follows the schema

for (i = 1 to len)
    inoutvec(i) = inoutvec(i) op invec(i)

Then

MPI_Op_create(user_function, commute, op)

returns a handle op of type MPI_Op
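A sketch in C with a hypothetical operation, element-wise maximum of absolute values; the function ignores its datatype argument, so it is only valid for MPI_INT:

#include <stdio.h>
#include <mpi.h>

/* follows the schema above: inoutvec(i) = inoutvec(i) op invec(i) */
void absmax(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int i;
    int *in = (int *)invec, *inout = (int *)inoutvec;
    for (i = 0; i < *len; i++) {
        int a = in[i]    < 0 ? -in[i]    : in[i];
        int b = inout[i] < 0 ? -inout[i] : inout[i];
        inout[i] = a > b ? a : b;
    }
}

int main(int argc, char *argv[])
{
    int rank, s, r;
    MPI_Op myop;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op_create(absmax, 1, &myop);   /* 1 => the operation commutes */
    s = (rank % 2) ? -rank : rank;     /* mix of signs across ranks   */
    MPI_Reduce(&s, &r, 1, MPI_INT, myop, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("largest |value| = %d\n", r);

    MPI_Op_free(&myop);                /* release the handle */
    MPI_Finalize();
    return 0;
}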
Variants

MPI_Allreduce(sendbuf, recvbuf, count, datatype,
              op, comm)

• All processes involved receive identical results

MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts,
                   datatype, op, comm)

• Acts as if a reduce were performed and then each process receives recvcounts(myrank) elements of the result
Reduce-scatter

int *s, *r;
int rc[5] = {1, 2, 0, 1, 1};
...
MPI_Reduce_scatter(s, r, rc, MPI_INT, MPI_SUM, comm);

before (5 elements of s on each of 5 ranks):
P0: 1 1 2 1 3
P1: 1 2 1 2 2
P2: 1 3 1 1 2
P3: 2 1 1 2 1
P4: 2 2 1 3 1
element-wise sums: 7 9 6 9 9
after (rc = 1,2,0,1,1):  P0: 7   P1: 9 6   P2: (nothing)   P3: 9   P4: 9
Scan

MPI_Scan(sendbuf, recvbuf, count, datatype, op,
         comm)

• Performs an inclusive prefix reduction across the group: process i receives the element-wise reduction of the sendbufs of processes 0..i

MPI_Scan(&s, &r, 5, MPI_INT, MPI_SUM, comm);

before (5 elements of s on each of 5 ranks):
P0: 1 1 2 1 3
P1: 1 2 1 2 2
P2: 1 3 1 1 2
P3: 2 1 1 2 1
P4: 2 2 1 3 1
after (r holds the running element-wise sums):
P0: 1 1 2 1 3
P1: 2 3 3 3 5
P2: 3 6 4 4 7
P3: 5 7 5 6 8
P4: 7 9 6 9 9
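A one-element runnable sketch of the same idea (each rank's contribution, i+1, is an arbitrary choice):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, s, r;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    s = rank + 1;                             /* rank i contributes i+1 */
    MPI_Scan(&s, &r, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: prefix sum = %d\n", rank, r);   /* 1, 3, 6, 10, ... */

    MPI_Finalize();
    return 0;
}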
Further topics
• Error handling
  – errors are handled by an error handler
  – MPI_ERRORS_ARE_FATAL - default for MPI_COMM_WORLD
  – MPI_ERRORS_RETURN - errors return a code, but the MPI state is then undefined
  – MPI_Error_string(errorcode, string, resultlen)
• Message probing (a sketch follows below)
  – messages can be probed before being received
  – note: after a wildcard probe, a receive may match a different message, so take the source and tag from the status
  – blocking and non-blocking versions
• Persistent communications
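A sketch of blocking probing, assuming two ranks: the receiver sizes its buffer from the probed message instead of guessing.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, count;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int data[3] = {1, 2, 3};
        MPI_Send(data, 3, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int *buf;
        /* probe first, then size the receive buffer to fit */
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);
        buf = (int *)malloc(count * sizeof(int));
        /* receive exactly the message that was probed */
        MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, &status);
        printf("received %d ints\n", count);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}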
Assignment 2
• Write a general procedure to multiply 2 matrices
• Start with
  – http://www.hpc.unimelb.edu.au/cs/assignment2/
• This is a harness for last year's assignment
  – last year I asked them to optimise first
  – this year just parallelise
• Next Tuesday I will discuss strategies
  – that doesn't mean don't start now...
  – ideas are available in various places...

Tomorrow - matrix multiplication