CS 770G - Parallel Algorithms in Scientific Computing
Lecture 3, May 14, 2001
Message-Passing II: MPI Programming
2
References
• Using MPI: Portable Parallel Programming with the Message-Passing Interface, Gropp, Lusk, Skjellum, MIT Press.
• Parallel Programming with MPI, Pacheco, Morgan Kaufmann.
3
Message Passing Programming
• Separate processors.
• Separate address spaces.
• Procs execute independently and concurrently.
• Procs transfer data cooperatively.
• Single Program Multiple Data (SPMD)
– All procs are executing the same program, but operating on different data.
• Multiple Program Multiple Data (MPMD)
– Different procs may be executing a different program.
• Common software tools: PVM, MPI.
4
What is MPI? Why?
• A message-passing library specification
– Message-passing model
– Not a compiler specification
– Not a specific product
• For parallel computers, clusters & heterogeneous networks.
• Designed to permit the development of parallel software libraries.
• Designed to provide access to advanced parallel hardware for:
– End users
– Library writers
– Tool developers
5
Who Designed MPI?
• Broad participation.
• Vendors:
– IBM, Intel, TMC, Meiko, Cray, Convex, nCube.
• Library writers:
– PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda.
• Application specialists and consultants:
– Companies: ARCO, KAI, NAG, Parasoft, Shell, …
– Labs: ANL, LANL, LLNL, ORNL, SNL, …
– Universities: almost 20.
6
Why Use MPI?
• Standardization
– The only message-passing library that can be considered a standard.
• Portability
– There is no need to modify the source when porting code from one platform to another that supports MPI.
• Performance
– Vendor implementations should be able to exploit native hardware to optimize performance.
7
Why Use MPI? (cont.)
• Availability
– A variety of implementations are available, both vendor and public domain, e.g. MPICH implemented by ANL.
• Functionality
– It provides more than 100 subroutine calls.
8
Features of MPI
• General
– Communicators combine context and group for message security.
– Thread safety.
• Point-to-point communication:
– Structured buffers and derived datatypes, heterogeneity.
– Modes: standard, synchronous, ready (to allow access to fast protocols), buffered.
• Collective communication:
– Both built-in & user-defined collective operators.
– Large number of data movement routines.
– Subgroups defined directly or by topology.
9
Is MPI Large or Small?
• MPI is large -- over 100 functions
– Extensive functionality requires many functions.
• MPI is small -- 6 functions:
– MPI_Init: initialize MPI.
– MPI_Comm_size: find out how many procs there are.
– MPI_Comm_rank: find out which proc I am.
– MPI_Send: send a message.
– MPI_Recv: receive a message.
– MPI_Finalize: terminate MPI.
10
Is MPI Large or Small? (cont.)
• MPI is just right
– One can access flexibility when it is required.
– One need not master all parts of MPI to use it.
11
Send & Receive
• Cooperative data transfer:
• To (from) whom is data sent (received)?
• What is sent?
• How does the receiver identify it?

[Diagram: Proc 0 --Send--> Data … Data --Receive--> Proc 1]
12
Message Passing: Send
Syntax of MPI_Send:
MPI_Send (address, count, datatype, dest, tag, comm)
• (address, count) = a contiguous area in memory containing the message to be sent.
• datatype = type of data, e.g. integer, double precision.– Message length = count * sizeof(datatype)
• dest = integer identifier representing the proc to receive the message.
• tag = nonnegative integer that the destination can use to selectively screen messages.
• comm = communicator = group of procs.
13
Message Passing: Receive
Syntax of MPI_Recv:
MPI_Recv (address, count, datatype, source, tag,comm, status)
• address, count, datatype, tag, comm are the same as in MPI_Send.
• source = integer identifier representing the proc sending the message.
• status = information about the message that is received.
14
SPMD
• Proc 0 & proc 1 are actually performing different operations.
• However, it is not necessary to write separate programs for each proc.
• Typically, use a conditional statement and the proc id to identify the job of each proc. Example:

    int a[10];
    if (my_id == 0)
        MPI_Send(a, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (my_id == 1)
        MPI_Recv(a, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
15
Deadlock
• Example: exchange data between 2 procs:

    Proc 0: MPI_Send (Data 1), then MPI_Recv (Data 2)
    Proc 1: MPI_Send (Data 2), then MPI_Recv (Data 1)

• MPI_Send is a synchronous operation. If there is no system buffering, it keeps waiting until a matching receive is posted.
16
Deadlock (cont.)
• Both procs are waiting for each other → deadlock.
• However, the exchange is OK if system buffering exists → an unsafe program.
17
Deadlock (cont.)
• Note: MPI_Recv is blocking and nonbuffered.
• A real deadlock:

    Proc 0: MPI_Recv, then MPI_Send
    Proc 1: MPI_Recv, then MPI_Send

• Can be fixed by reordering the comm.:

    Proc 0: MPI_Send, then MPI_Recv
    Proc 1: MPI_Recv, then MPI_Send
18
Buffered / Nonbuffered Comm.
• No buffering (phone call)
– Proc 0 initiates the send request and rings Proc 1. It waits until Proc 1 is ready to receive. The transmission then starts.
– Synchronous comm.: completes only when the message has been received by the receiving proc.
• Buffering (beeper)
– The message to be sent (by Proc 0) is copied to a system-controlled block of memory (buffer).
– Proc 0 can continue executing the rest of its program.
– When Proc 1 is ready to receive the message, the system copies the buffered message to Proc 1.
– Asynchronous comm.: may complete even though the receiving proc has not received the message.
19
Buffered Comm.
• Buffering requires system resources, e.g. memory, and can be slower if the receiving proc is ready at the time of the send request.
• Application buffer: address space that holds the data.
• System buffer: system space for storing messages. In buffered comm., data in the application buffer is copied to/from the system buffer.
• MPI allows comm. in buffered mode: MPI_Bsend, MPI_Ibsend.
• The user allocates the buffer by: MPI_Buffer_attach(buffer, buffer_size)
• Free the buffer by MPI_Buffer_detach.
20
Blocking / Nonblocking Comm.
• Blocking comm. (McDonald's)
– The receiving proc has to wait if the message is not ready.
– Different from synchronous comm.: Proc 0 may have already buffered the message to the system and Proc 1 is ready, but the interconnection network is busy.
• Nonblocking comm. (In & Out)
– Proc 1 checks with the system whether the message has arrived yet. If not, it continues doing other stuff. Otherwise, it gets the message from the system.
– Useful when computation and comm. can be performed at the same time.
21
Blocking / Nonblocking Comm. (cont.)
• MPI allows both nonblocking send & receive: MPI_Isend, MPI_Irecv.
• A nonblocking send identifies an area in memory to serve as a send buffer. Processing continues immediately without waiting for the message to be copied out of the application buffer.
• The program should not modify the application buffer until the nonblocking send has completed.
• Nonblocking comm. can be combined with no buffering: MPI_Issend, or with buffering: MPI_Ibsend.
• Use MPI_Wait or MPI_Test to determine whether the nonblocking send or receive has completed.
22
Example: Blocking vs Nonblocking
• Data exchange in a ring topology (P0 → P1 → P2 → P3 → P0).
• Blocking version:

    for (i = 0; i < p; i++) {
        send_offset = ((my_id - i + p) % p) * blksize;
        recv_offset = ((my_id - i - 1 + p) % p) * blksize;
        MPI_Send(y + send_offset, blksize, MPI_FLOAT, (my_id + 1) % p, 0, ring_comm);
        MPI_Recv(y + recv_offset, blksize, MPI_FLOAT, (my_id - 1 + p) % p, 0, ring_comm, &status);
    }
23
Example: Blocking vs Nonblocking (cont.)
• Nonblocking version:

    send_offset = my_id * blksize;
    recv_offset = ((my_id - 1 + p) % p) * blksize;
    for (i = 0; i < p - 1; i++) {
        MPI_Isend(y + send_offset, blksize, MPI_FLOAT, (my_id + 1) % p, 0, ring_comm, &send_request);
        MPI_Irecv(y + recv_offset, blksize, MPI_FLOAT, (my_id - 1 + p) % p, 0, ring_comm, &recv_request);
        send_offset = ((my_id - i - 1 + p) % p) * blksize;
        recv_offset = ((my_id - i - 2 + p) % p) * blksize;
        MPI_Wait(&send_request, &status);
        MPI_Wait(&recv_request, &status);
    }
24
Summary: Comm. Modes
• 4 comm. modes in MPI: standard, buffered, synchronous, ready. Each can be either blocking or nonblocking.
• In standard mode (MPI_Send, MPI_Recv), it is up to the system to decide whether messages should be buffered.
• In synchronous mode, a send won't complete until a matching receive has been posted and has begun receiving the data.
– MPI_Ssend, MPI_Issend.
– No system buffering.
25
Summary: Comm. Modes (cont.)
• In buffered mode, the completion of a send does not depend on the existence of a matching receive.
– MPI_Bsend, MPI_Ibsend.
– System buffering via MPI_Buffer_attach & MPI_Buffer_detach.
• Ready mode not discussed.
26
Collective Communications
Comm. patterns involving all the procs; usually more than 2.
• MPI_Barrier: synchronize all procs.
• Broadcast (MPI_Bcast)
• Reduction (MPI_Reduce)
– All procs contribute data that is combined using a binary op.
– E.g. max, min, sum, etc.
– One proc obtains the final answer.
• Allreduce (MPI_Allreduce)
– Same as MPI_Reduce, but every proc contains the final answer.
– Conceptually, MPI_Allreduce = MPI_Reduce + MPI_Bcast, but more efficient.
27
An Implementation
• Tree-structured comm. (find the max among procs):
• Only log p stages of comm. are needed.
• Not necessarily optimal on a particular architecture.
[Figure: tree-structured max reduction among P0..P7; at each of the log p = 3 stages, pairs of procs combine their values, and P0 ends up with the overall max, 7.]
28
Example 1: Hello, world!
• #include "mpi.h": basic MPI definitions and data types.
• MPI_Init starts MPI.
• MPI_Finalize exits MPI.
• Note: all non-MPI routines are local; thus printf runs on each proc.
    #include "mpi.h"
    #include <stdio.h>

    int main(argc, argv)
    int argc;
    char **argv;
    {
        MPI_Init(&argc, &argv);
        printf("Hello, world!\n");
        MPI_Finalize();
        return 0;
    }
29
Example 2: "Advanced" Hello, world!
• MPI_Comm_rank returns the proc id.
• MPI_Comm_size returns the # of procs.
• Note: on some parallel systems, only a few designated procs can do I/O.
• What will the output look like?
    int main(argc, argv)
    int argc;
    char **argv;
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello, world! I am %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }
30
Example: Calculate π
• Well-known formula:

    ∫[0,1] 4 / (1 + x²) dx = π.

• Numerical integration (Trapezoidal rule): partition [a,b] as a = x0 < x1 < … < xn-1 < xn = b. Then

    ∫[a,b] f(x) dx ≈ h [ (1/2) f(x0) + f(x1) + … + f(xn-1) + (1/2) f(xn) ],

    xi = a + ih,  h = (b − a)/n,  n = # of subintervals.
31
Example: Calculate π (cont.)
• A sequential function Trap(a,b,n) approximates the integral from a to b of f(x) using the trap rule with n subintervals:

    double Trap(a, b, n)
    double a, b;
    int n;
    {
        double integral, x, h;
        int i;

        h = (b - a) / n;
        integral = h * (f(a) + f(b)) / 2;
        for (i = 1; i < n; i++) {
            x = a + i * h;
            integral = integral + h * f(x);
        }
        return integral;
    }
32
Parallelizing Trap
• Divide the interval [a,b] into p equal subintervals.
• Each proc calculates its local approx. integral using the trap rule simultaneously.
• Finally, combine the local values to obtain the total integral.

[Figure: the domain from a = x0 to b = xn split among procs; Proc 0 handles up to x_{n/p}, Proc 1 up to x_{2n/p}, …, Proc k up to x_{(k+1)n/p}, …, Proc p-1 up to xn.]
33
Parallel Trap Program
    /* Start up MPI */
    MPI_Init(&argc, &argv);

    /* Find out how many procs */
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Determine my proc id */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    /* Apply Trap rule locally */
    n = 128; a = 0; b = 1; h = (b - a) / n;
    n_k = n / p;
    a_k = a + my_id * n_k * h;
    b_k = a_k + n_k * h;
    integral = Trap(a_k, b_k, n_k);
34
Parallel Trap Program (cont.)
    /* Sum up the integrals */
    if (my_id == 0) {
        total = integral;
        for (k = 1; k < p; k++) {
            MPI_Recv(&integral, 1, MPI_DOUBLE, k, tag, MPI_COMM_WORLD, &status);
            total = total + integral;
        }
    } else {
        MPI_Send(&integral, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
    }

    /* Close MPI */
    MPI_Finalize();
35
Parallelizing Trap Program (cont.)
• Can/should replace MPI_Send & MPI_Recv by MPI_Reduce.
• Embarrassingly parallel -- no comm. needed during the computations of the local approx. integrals.
36
Wildcards
• MPI_ANY_TAG, MPI_ANY_SOURCE– MPI_Recv can use it for the tag and source input
arguments.
• May use status output argument to determine the actual source and tag.
• In C, the last parameter of MPI_Recv, status, is a struct with at least 2 members:
– status -> MPI_TAG
– status -> MPI_SOURCE
• They return the rank of the proc that sent the message (MPI_SOURCE), and the tag number (MPI_TAG).
37
Timing
• MPI_Wtime() returns the wall-clock time.

    double start, finish, time;

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    ...
    MPI_Barrier(MPI_COMM_WORLD);
    finish = MPI_Wtime();
    time = finish - start;
38
MPI Data Structures
• Suppose in the previous program, proc 0 reads in the values of a, b, & n from standard input, and then broadcasts the values to the other procs.
• Consequently, we need to perform MPI_Bcast 3 times.
• Sending a message is expensive in a parallel environment → minimize latency.
• We can reduce the overhead cost by sending the 3 values in a single message.
• 3 approaches: count, derived datatype, MPI_Pack/Unpack.
39
(I) count + datatype
• In MPI_Send (MPI_Recv, MPI_Bcast, …), we specify the length of the data by count.
• Thus, we may group data items having the same datatype: store the data in contiguous memory locations (e.g. an array).
• Unfortunately, in our case, a & b are doubles but n is an integer.
40
(II) Derived Data Type
• Define an MPI datatype consisting of 2 doubles and 1 integer.
• Use MPI_Type_struct.
• A general MPI datatype is a sequence of pairs:

    {(t0, d0), (t1, d1), …, (tn, dn)}

– ti = MPI datatype
– di = displacement in bytes relative to the starting address of the message
• E.g. suppose a, b, n are stored at addresses 10, 25, 30. Then the derived datatype is:

    {(MPI_DOUBLE, 0), (MPI_DOUBLE, 15), (MPI_INT, 20)}
41
(II) Derived Data Type (cont.)
    double a, b;
    int n, blklen[3];
    MPI_Datatype newtype, type[3];
    MPI_Aint disp[3], base, address;

    n = 128; a = 0; b = 1;
    blklen[0] = blklen[1] = blklen[2] = 1;
    type[0] = type[1] = MPI_DOUBLE; type[2] = MPI_INT;
    disp[0] = 0;
    MPI_Address(&a, &base);
    MPI_Address(&b, &address);
    disp[1] = address - base;
    MPI_Address(&n, &address);
    disp[2] = address - base;
    MPI_Type_struct(3, blklen, disp, type, &newtype);
    MPI_Type_commit(&newtype);   /* required before the type is used */
    MPI_Bcast(&a, 1, newtype, 0, MPI_COMM_WORLD);
42
(III) MPI_Pack / Unpack
• MPI_Pack stores noncontiguous data in contiguous memory locations.
• MPI_Unpack copies data from a contiguous buffer into noncontiguous memory locations (the reverse of MPI_Pack).

    char buffer[buffer_size];
    int position;

    position = 0;
    MPI_Pack(&a, 1, MPI_DOUBLE, buffer, buffer_size, &position, MPI_COMM_WORLD);
    MPI_Pack(&b, 1, MPI_DOUBLE, buffer, buffer_size, &position, MPI_COMM_WORLD);
    MPI_Pack(&n, 1, MPI_INT, buffer, buffer_size, &position, MPI_COMM_WORLD);
    MPI_Bcast(buffer, buffer_size, MPI_PACKED, 0, MPI_COMM_WORLD);
43
MPI Topology
• Cartesian meshes are very common in parallel programs solving PDEs.
• In such programs, the comm. patterns closely resemble the computational meshes.
• The mapping of the comm. topology to the hardware topology can be made in many ways; some are better than others.
• Thus, MPI allows the vendor to help optimize this mapping.
• Two types of virtual topologies in MPI: Cartesian & graph topology.
44
Cartesian Topology
• Create a 4×3 2D-mesh topology:
• MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, periods, reorder, new_comm);
– ndim = # of dimensions = 2
– dims = # of procs in each direction
– dims[0] = 4, dims[1] = 3

    (0,0) (1,0) (2,0) (3,0)
    (0,1) (1,1) (2,1) (3,1)
    (0,2) (1,2) (2,2) (3,2)
45
Cartesian Topology (cont.)
• MPI_Cart_create(MPI_COMM_WORLD, ndim, dims, periods, reorder, new_comm);
– periods indicates whether the procs at the ends are connected. Useful for periodic domains.
– periods[0] = periods[1] = 0;
– reorder indicates whether to allow the system to optimize the mapping of grid procs to the underlying physical procs by reordering.
46
Other Functions
• MPI_Comm_rank(new_comm, my_new_rank)
– The ranking may have been changed by reordering.
• MPI_Cart_coords(new_comm, my_new_rank, ndim, coordinates);
– Given the rank, it returns the array coordinates containing the coordinates of the proc.
• MPI_Cart_rank returns the rank when given the coordinates.
• MPI_Cart_get returns dims, periods, coords.
• MPI_Cart_shift returns the ranks of the source & dest procs in a shift operation.
47
Communicators
• Example: MPI_COMM_WORLD.
• Communicator = group + context.
• A communicator consists of:
– A group of procs.
– A set of comm. channels between these procs.
– Each communicator has its own set of channels.
– Messages sent with one communicator cannot be received by another.
• Enables development of safe software libraries.
– A library uses a private comm. domain.
• Sometimes, restricting comm. to a subgroup is useful, e.g. broadcasting messages across a row or down a column of grid procs.
48
Why Communicators?
• Conflicts among MPI calls by users and libraries.
• E.g. Sub1 & Sub2 are from 2 different libraries.
• Correct execution of library calls:
[Figure: three procs P0, P1, P2. Each proc first makes its Sub1 calls (a mix of send(0), send(1), send(2), recv(0), recv(1), recv(2), recv(any)) and then its Sub2 calls. When the procs stay in step, each recv(any) in Sub1 happens to match a send from Sub1.]
49
Why Communicators ? (cont.)
• Incorrect execution of library calls:

[Figure: the same calls, but one proc falls behind, so a recv(any) posted in Sub1 matches a message sent by Sub2, i.e. by a different library. Separate communicators prevent this cross-matching.]
50
MPI Group
• A group = a set of procs.
• Create a group by:
– MPI_Group_incl: includes specific members.
– MPI_Group_excl: excludes specific members.
– MPI_Group_union: forms the union of two groups.
– MPI_Group_intersection: forms the intersection of two groups.
• MPI_Comm_group: get an existing group.
• MPI_Group_free: free a group.
51
MPI-2
• Extensions to MPI 1.1 and MPI 1.2.
• Major topics being discussed:– Dynamic process management.
– Client/server.
– Real-time extensions.
– "One-sided" communications.
– Portable access to MPI state (for debuggers).
– Language bindings for C++ and Fortran 90.