Distributed Programming

Concurrent and Distributed Programming
http://fmt.cs.utwente.nl/courses/cdp/
CDP #7 / HC 7 - Tuesday, 10 January 2012
Jaco van de Pol
http://fmt.cs.utwente.nl/~vdpol/
Overview of Lecture 7

Message Passing Interface (MPI)
Implementation: Mays and Musts
MPI in Java: MPJ

Lecture heavily based on material from:
Ganesh Gopalakrishnan (University of Utah)
Jan Lemeire (Vrije Universiteit Brussel)
Stefan Blom (UT)
Matrix Multiplication (MMUL)
Parallel Implementation
$\underbrace{\begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}}_{A} \cdot \underbrace{\begin{pmatrix} 2 \\ 1 \end{pmatrix}}_{b} = \underbrace{\begin{pmatrix} 5 \\ 8 \end{pmatrix}}_{r}$

parallel implementation:

int A11=1, A12=3, b1=2;
int A21=2, A22=4, b2=1;
int r1, r2;

worker 1: r1 := A11*b1 + A12*b2;    worker 2: r2 := A21*b1 + A22*b2;
Parallel MMUL

shared memory (visible to both workers):
A11=1, A12=3, b1=2
A21=2, A22=4, b2=1
int r1, r2

worker 1: r1 = A11*b1 + A12*b2    worker 2: r2 = A21*b1 + A22*b2
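For concreteness, the shared-memory version can be written in plain Java with one thread per row of A. This sketch is not from the slides; all names are chosen here:

public class ParallelMmul {
    // Shared read-only inputs; each worker writes only its own result slot.
    static final int A11 = 1, A12 = 3, A21 = 2, A22 = 4;
    static final int b1 = 2, b2 = 1;
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread worker1 = new Thread(() -> { r1 = A11 * b1 + A12 * b2; });
        Thread worker2 = new Thread(() -> { r2 = A21 * b1 + A22 * b2; });
        worker1.start(); worker2.start();
        worker1.join(); worker2.join();                      // wait for both rows
        System.out.println("r = (" + r1 + ", " + r2 + ")"); // prints r = (5, 8)
    }
}

The join() calls also establish the happens-before ordering needed to read r1 and r2 safely from the main thread.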
Distributed Memory MMUL
Worker 1 owns row 1 of A and b
Worker 2 owns row 2 of A and b
chan c1to2 = [1] of int, c2to1 = [1] of int

worker 1:
  int A11=1, A12=3, b1=2
  int r1, b2'
  c1to2 ! b1
  c2to1 ? b2'
  r1 = A11*b1 + A12*b2'

worker 2:
  int A21=2, A22=4, b2=1
  int r2, b1'
  c2to1 ! b2
  c1to2 ? b1'
  r2 = A21*b1' + A22*b2
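The same exchange can be sketched with the MPJ point-to-point calls introduced at the end of this lecture (a sketch assuming two workers with ranks 0 and 1, running between MPJ.init and MPJ.finish; send/recv signatures as in the ring example later on). Rank 1 receives before sending, so the sketch does not rely on the runtime buffering the first message:

// Each rank owns one row of A and one entry of b (rank 0: b1=2, rank 1: b2=1).
int rank = MPJ.COMM_WORLD.rank();
int tag = 0;
int[] mine  = { rank == 0 ? 2 : 1 };   // my entry of b
int[] other = new int[1];              // the peer's entry of b

if (rank == 0) {
    MPJ.COMM_WORLD.send(mine, 0, 1, MPJ.INT, 1, tag);
    MPJ.COMM_WORLD.recv(other, 0, 1, MPJ.INT, 1, tag);
    int r1 = 1 * mine[0] + 3 * other[0];   // row 1 of A times b
} else {
    MPJ.COMM_WORLD.recv(other, 0, 1, MPJ.INT, 0, tag);
    MPJ.COMM_WORLD.send(mine, 0, 1, MPJ.INT, 0, tag);
    int r2 = 2 * other[0] + 4 * mine[0];   // row 2 of A times b
}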
Optimized DM MMUL
Important principle in Message-Passing Concurrency:
overlap computation and communication!

chan c1to2 = [1] of int, c2to1 = [1] of int

worker 1:
  int A11=1, A12=3, b1=2
  int r1, b2'
  c1to2 ! b1
  r1 = A11*b1
  c2to1 ? b2'
  r1 = r1 + A12*b2'

worker 2:
  int A21=2, A22=4, b2=1
  int r2, b1'
  c2to1 ! b2
  r2 = A22*b2
  c1to2 ? b1'
  r2 = A21*b1' + r2

(cf. Amdahl's Law: time spent blocked on communication acts as a sequential fraction that bounds the achievable speedup)
Message-Passing Paradigm
Partitioned Address Space
  Each worker has its own private address space
  Typically 1 worker per processor

Supports only explicit parallelization
  Adds complexity to programming
  Encourages locality of data access

Often Single Program Multiple Data (SPMD)
  Same code executed by all processes; identical except for the master
  (see the skeleton sketch below)

Loosely synchronous paradigm: between interactions (through messages), tasks execute asynchronously
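A minimal SPMD skeleton in MPJ (the Java MPI binding used at the end of this lecture): every worker runs the same main and branches on its rank. The class name and the branch bodies are placeholders chosen for this sketch:

import ibis.mpj.MPJ;
import ibis.mpj.MPJException;

public class SpmdSkeleton {
    public static void main(String[] args) throws MPJException {
        MPJ.init(args);                    // join the pool of workers
        int rank = MPJ.COMM_WORLD.rank();  // my identity: 0 .. size-1
        if (rank == 0) {
            // master-only part, e.g. distributing work and collecting results
        } else {
            // code for the other workers
        }
        MPJ.finish();
    }
}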
[Kumar, p. 233]
Blocking Send and Receive
Non-Blocking Send and Receive
Buffered Non-Blocking Send and Receive

Copying without involving the CPU (Direct Memory Access)
Message Passing Interface (MPI)
MPI is a standard for writing parallel programs portably
Distributed Memory Programming Model
Every Worker executes the same program (SPMD)
Widely used in High-Performance Computing (HPC)
http://www.mpi-forum.org/
Goals:
  introduce the basic principles of MPI
  create awareness of potential problems and their solutions
Fundamental MPI Operations
MPI 2.0: over 300 functions
Only four fundamental operations:
  (immediate) send
  (immediate) receive
  wait
  barrier
MPI_Irecv (Non-Blocking Receive)
MPI_Irecv (source, msg_buf, req, ...)
  Non-blocking call returning a request handle req
  Starts a mem-to-mem copy from the source process into msg_buf

MPI_Wait (req) awaits completion
  When the wait unblocks, msg_buf is ready for consumption

Source can be a specific ID ("rank") or the "wildcard" * (aka ANY_SOURCE):
  receive from any eligible (matching) sender

The I in Irecv stands for Immediate.

A blocking receive can be defined as:

MPI_Recv (source, msg_buf) {
  request req;
  MPI_Irecv (source, msg_buf, req);
  MPI_Wait (req);
}
MPI_Isend (Non-Blocking Send)
MPI_Isend (destination, msg_buf, req, ...)
  Non-blocking call returning a request handle req
  Starts a mem-to-mem copy from this process to destination
  May benefit from runtime buffering

MPI_Wait (req) awaits completion
  The sender can reuse msg_buf when the wait unblocks
  Completion means hand-off to MPI: there is no guarantee that the message was received at the destination!

A blocking send can be defined as:

MPI_Send (destination, msg_buf) {
  request req;
  MPI_Isend (destination, msg_buf, req);
  MPI_Wait (req);
}
MPI Send Modes
Blocking vs. Non-Blocking: does the Send call return immediately to the caller?
  MPI_Send: blocking, but not synchronous
  MPI_Isend: non-blocking; returns even without a matching receive

Synchronous vs. Asynchronous:
  MPI_Ssend: completes only when the receiver starts receiving
  ... and many other subtle variations: MPI_Bsend, MPI_Rsend, ...
Message Order
Message order between two fixed workers is preserved
No guarantees between multiple workers
Message order between two fixed workers is preserved:

P0                P1
Send(1, 1)        Recv(0, x)
Send(1, 2)        Recv(0, y)

always x = 1, y = 2

With a third worker there is no such guarantee:

P0                P1                P2
Send(1, 1)        Recv(*, x)        Send(1, 2)
                  Recv(*, y)

x = 1, y = 2  or  x = 2, y = 1
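The pairwise FIFO guarantee, expressed with the MPJ calls used later in this lecture (a sketch for two workers, running between MPJ.init and MPJ.finish; in the Send(dest, value) notation above, this is P0 sending the values 1 and 2 to P1):

int rank = MPJ.COMM_WORLD.rank();
int tag = 0;
int[] buf = new int[1];

if (rank == 0) {                 // P0: two sends to the same destination
    buf[0] = 1; MPJ.COMM_WORLD.send(buf, 0, 1, MPJ.INT, 1, tag);
    buf[0] = 2; MPJ.COMM_WORLD.send(buf, 0, 1, MPJ.INT, 1, tag);
} else if (rank == 1) {          // P1: messages arrive in sending order
    MPJ.COMM_WORLD.recv(buf, 0, 1, MPJ.INT, 0, tag);
    int x = buf[0];              // always 1
    MPJ.COMM_WORLD.recv(buf, 0, 1, MPJ.INT, 0, tag);
    int y = buf[0];              // always 2
    System.err.println("x=" + x + ", y=" + y);
}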
MPI Example: Scenario 1

P0                      P1                   P2
Irecv(from:*, r1)       //Sleep(3)           Sleep(3)
Irecv(from:2, r2)       Isend(to:0, r1)      Isend(to:0, r1)
Isend(to:2, r3)         waits                Irecv(from:0, r2)
Irecv(from:*, r4)                            Isend(to:0, r3)
waits                                        waits

Deadlock-free: P2 is delayed by its Sleep(3), so P1's send matches P0's wildcard receive r1; P2's two sends then match Irecv(from:2, r2) and Irecv(from:*, r4), and P0's Isend(to:2, r3) matches P2's Irecv(from:0, r2).
MPI Example: Scenario 2

P0                      P1                   P2
Irecv(from:*, r1)       Sleep(3)             //Sleep(3)
Irecv(from:2, r2)       Isend(to:0, r1)      Isend(to:0, r1)
Isend(to:2, r3)         waits                Irecv(from:0, r2)
Irecv(from:*, r4)                            Isend(to:0, r3)
waits                                        waits

Deadlock: now P1 is delayed, so P2's first send matches P0's wildcard receive r1. P0 then waits for a message from 2 to complete r2, while P2 waits on Irecv(from:0, r2) for a send that P0 only issues later, and P1's send cannot match Irecv(from:2): the processes block.
Barrier
Programming construct that synchronizes the actions of its participants
All participants must enter and then leave the barrier
The first participant may leave only after the last has entered.
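In MPJ one would expect a collective barrier on the communicator; a minimal usage sketch, assuming a barrier() method on MPJ.COMM_WORLD that mirrors MPI_Barrier (the method name is an assumption, not verified against the ibis.mpj API; the phase methods are hypothetical application code):

computePhase1();            // hypothetical application code
MPJ.COMM_WORLD.barrier();   // assumed MPJ counterpart of MPI_Barrier
// no worker starts phase 2 before every worker has finished phase 1
computePhase2();            // hypothetical application code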
Barrier Example

worker 1:        worker 2:
a                b
Barrier          Barrier
c                d

[Figure: reduced state space: a and b interleave in either order, then c and d interleave, i.e. the barrier separates the two phases. Full state space: the same, but with each worker's explicit "enter" and "leave" barrier events also interleaved, e.g. one worker may enter the barrier while the other is still computing.]
Barriers versus Wait: beware
Consider the following program; it consists of three workers, each running their own code:

P0: Isend0(to:1, h0); Barrier0; Wait(h0)
P1: Irecv(from:*, h1); Barrier1; Wait(h1)
P2: Barrier2; Isend2(to:1, h2); Wait(h2)

What scenarios are possible in this program?
Note: going through the barrier means the Isend/Irecv calls were issued.
This does NOT imply that the data transfer actually happened.
Actually, P1 may receive P2's Isend2!!
Data seems to "travel through the barrier" in the wrong direction!

(Excerpt from the accompanying paper, on its Figure 4: Illustration of Barrier Semantics and the POE Algorithm.)

While these rules match the rules followed by other languages and libraries in supporting their barrier operations, in the case of MPI it is possible for a process Pi to have a non-blocking operation OP before its barrier call, and another process Pj to have an operation OP' after Pj's matching barrier call, where OP can observe the execution of OP'. This means that OP can, in effect, complete after Pi's barrier has been invoked. Such behaviors are allowed in MPI to keep it a high-performance API.

Figure 4 illustrates the scenario just discussed. In this example, one Isend issued by P0, shown as Isend0(to:1, &h0), and another issued by P2, shown as Isend2(to:1, &h2), target a wildcard Irecv issued by P1, shown as Irecv(from:*, h1). The following execution is possible: (i) Isend0(to:1, &h0) is issued, (ii) Irecv(from:*, h1) is issued, (iii) each process fully executes its own barrier, and this "collective operation" finishes, (iv) Isend2(to:1, h2) is issued, (v) now both sends and the receive are alive, and hence Isend0 and Isend2 become dependent (non-deterministic matches with Irecv), requiring a dynamic algorithm to pursue both matches. Notice that Isend0 can finish before Barrier0 and Irecv can finish after Barrier1. We sometimes refer to the placement of barriers as in Figure 4 as "crooked barriers", because the barriers are used even though there are instructions around them that observe each other.

To handle this example properly, the POE algorithm of ISP cannot assume that Isend0 is the only sender that can match Irecv. Therefore ISP must somehow delay issuing Irecv into the MPI runtime until all its matchers are known. In this example, ISP will collect Isend0 and Irecv, then issue all the Barriers, then encounter Isend2, and only then act on the receive: ISP will issue Irecv with arguments 0 and 2, respectively, in two consecutive replays of the whole program. Thus, the inner workings of ISP's POE algorithm are as follows:

• Collect wildcard receives
• Delay non-blocking sends that may match the receives
• Issue other non-interfering operations (such as the Barriers)
• Finally, when forced (in the example of Figure 4, the forcing occurs when all processes are at their Wait statements), perform the dynamic rewritings, followed by the two replays

The example of Figure 4 illustrates many aspects of POE:

• Delaying of operations: Irecv was delayed in this example
• Out-of-order issue: the Barriers were issued out of program order
• Being forced to compute the full extent of non-determinism: when all processes are at fence instructions (instructions such as Wait and Barrier), the POE algorithm knows that none of the sends coming after the fence can match the wildcard Irecv; this allows POE to compute the full extent of non-determinism on-the-fly
• Replay: for each case of non-determinism, ISP only remembers the choices by keeping a stack trail; the full MPI runtime state cannot be remembered (it is too voluminous, spread over clusters, and consists of the entire state of a large machine), so ISP always restarts from the known initial state, computes up to a choice point, and then branches differently
The Rules

Barriers:
  a Barrier happens before the following statements of all procs

Data-dependencies for read/write:
  a Recv can never return before the matching Send has started

Within one worker:
  two sends to the same destination arrive in the same order
  receiving from the same source preserves the message order
  (unless different tags are used; see later)

Wait:
  a wait after an Irecv means: the message has arrived in the buffer
  a wait after an Isend means: the buffer space can be reused (!)
MPI Quiz

Is the following sequence deadlock-free?

P0: Irecv(from:0,rh); Barrier; Isend(to:0,sh); Wait(rh); Wait(sh);

No, not guaranteed (but likely it is). Implementations are allowed (but not expected) to only actually start a send request when the corresponding Wait is issued.

The following is deadlock-free for sure:

P0: Irecv(from:0,rh); Barrier; Isend(to:0,sh); Wait(sh); Wait(rh);

The following deadlocks for sure:

P0: Irecv(from:0,rh); Wait(rh); Barrier; Isend(to:0,sh); Wait(sh);
Barrier Implementation: Broadcast
Too expensive for many workers: consider the message count
Message complexity: O(N^2) - too bad; bandwidth becomes the bottleneck (each of the N workers sends to all N-1 others)
Latency: O(1) - quite OK (theoretically)

void msg_barrier_bcast () {
  request rq[Nworkers];
  message msg[Nworkers];
  for (int i=0; i<Nworkers; ++i) if (i!=me) { Irecv (i, msg[i], rq[i]); }
  for (int i=0; i<Nworkers; ++i) if (i!=me) { Send (i, "barrier"); }
  for (int i=0; i<Nworkers; ++i) if (i!=me) { Wait (rq[i]); }
}
Barrier Implementation: Ring
Too expensive for many workers: latency
Message complexity: O(N) - this seems rather optimal
Latency: O(N) - this is really killing performance!

void msg_barrier_ring () {
  message msg;
  if (me == 0) { Send (Nworkers-1, "enter"); }
  Recv (*, msg);
  if (me != 0) { Send (me-1, msg); }
  else { Send (Nworkers-1, "leave"); }
  Recv (*, msg);
  if (me != 0) { Send (me-1, msg); }
}

(The "enter" token travels once around the ring, so worker 0 learns that everyone has entered; the "leave" token then travels around to release all workers.)
Barrier Implementation: D&C
Assume N = 2^k workers: Divide & Conquer

Acceptable for many workers:
Message complexity: O(N log N) - acceptable
Latency: O(log N) - acceptable

void msg_barrier_dc () {
  for (int i=0; i<k; ++i) {
    request rq;
    message msg;
    Irecv (XOR(me, 1<<i), msg, rq);
    Send (XOR(me, 1<<i), "barrier");
    Wait (rq);
  }
}

(In round i, every worker exchanges a message with the partner whose rank differs in bit i; after k = log2(N) rounds, every worker has transitively heard from all others.)
[Figure: divide-and-conquer barrier for 2^k = 8 workers P0..P7; in round i, each worker Pm exchanges a message with its partner P(XOR(m, 2^i)).]
MPI Tags: Message Types
Every MPI message contains both data and a numeric tag
Messages with different tags are allowed to overtake at the request of the receiver
Tags: poor man’s message namespaces, e.g., for MPI libraries
P0                      P1
Send(1, tag:0, 1)       Recv(*, tag:*, x)
Send(1, tag:0, 2)       Recv(*, tag:*, y)

always x = 1, y = 2

P0                      P1
Send(1, tag:0, 1)       Recv(*, tag:1, x)
Send(1, tag:1, 2)       Recv(*, tag:0, y)

x = 2, y = 1 or deadlock
(deadlock if the first Send blocks without being buffered, since P1 first waits for the message with tag 1)
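The overtaking case, expressed with the MPJ calls used later in this lecture (a sketch for two workers, running between MPJ.init and MPJ.finish; as on the slide, it can deadlock if the runtime does not buffer the first send):

int rank = MPJ.COMM_WORLD.rank();
int[] buf = new int[1];

if (rank == 0) {                 // P0: tag 0 first, then tag 1
    buf[0] = 1; MPJ.COMM_WORLD.send(buf, 0, 1, MPJ.INT, 1, 0);
    buf[0] = 2; MPJ.COMM_WORLD.send(buf, 0, 1, MPJ.INT, 1, 1);
} else if (rank == 1) {          // P1 asks for tag 1 first, letting it overtake
    MPJ.COMM_WORLD.recv(buf, 0, 1, MPJ.INT, MPJ.ANY_SOURCE, 1);
    int x = buf[0];              // x = 2
    MPJ.COMM_WORLD.recv(buf, 0, 1, MPJ.INT, MPJ.ANY_SOURCE, 0);
    int y = buf[0];              // y = 1
    System.err.println("x=" + x + ", y=" + y);
}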
MPI Communicators: Connections
Like TCP allows multiple connections, MPI allows multiple communicators

A communicator is a handle for a group of processes

Using different communicators is required to allow overtaking:

P0                           P1
Isend(comm:0, 1, 1, h)       Recv(comm:1, *, x)
Send(comm:1, tag:0, 2)       Recv(comm:0, *, y)
Wait(h)

always x = 2, y = 1

P0                           P1
Send(comm:0, 1, 1)           Recv(comm:1, *, x)
Send(comm:1, 1, 2)           Recv(comm:0, *, y)

x = 2, y = 1 or deadlock
MPJ: MPI in Java
MPJ is a Java binding for MPI
Implemented by the Ibis project of the Vrije Universiteit Amsterdam
MPJ Example: Hello World

import ibis.mpj.MPJ;
import ibis.mpj.MPJException;

public class MPIHello {
  public static void main(String[] args) throws MPJException {
    MPJ.init(args);
    int size = MPJ.COMM_WORLD.size();
    int rank = MPJ.COMM_WORLD.rank();
    String who = "worker " + rank + "/" + size + ": ";
    System.err.println(who + "hello world!");
    MPJ.finish();
  }
}

Output:
worker 0/2: hello world!
worker 1/2: hello world!
MPJ Example: Ring
Send messages in a ring
...
int msg[] = new int[1];
int tag = 42;
int l = msg.length;

if (rank == 0)
  MPJ.COMM_WORLD.send(msg, 0, l, MPJ.INT, size-1, tag);

ibis.mpj.Status stat;
stat = MPJ.COMM_WORLD.recv(msg, 0, l, MPJ.INT, MPJ.ANY_SOURCE, tag);
int src = stat.getSource();
System.err.println(who + "received " + msg[0] + " from " + src);

++msg[0];
if (rank != 0)
  MPJ.COMM_WORLD.send(msg, 0, l, MPJ.INT, rank-1, tag);
...

(send arguments: buffer, offset, count, type, target rank, tag)
MPJ Setup (Unix)
Use JDK 1.6.0 from Sun/Oracle

Create a directory cdp/ in your home directory

Download and unzip mpj-2.2.zip to cdp/ (creates mpj-2.2/)

Add to your startup scripts:

export MPJ_HOME=$HOME/cdp/mpj-2.2
export PATH=$PATH:$MPJ_HOME/bin
export CLASSPATH=$MPJ_HOME/lib/mpj-2.2.jar:.
MPJ Setup (Windows)
Use JDK 1.6.0 from Sun/Oracle

Create a directory cdp/

Download and unzip mpj-2.2.zip to cdp/ (creates mpj-2.2/)

Execute in a command shell:

set MPJ_HOME="location of the mpj-2.2 directory"
set CLASSPATH=%CLASSPATH%;"%MPJ_HOME%"\lib\mpj-2.2.jar
set PATH=%PATH%;"%MPJ_HOME%"\bin

Persist the settings: http://www.support.tabs3.com/main/R10463.htm
MPJ Compiling & Running
Compile programs with javac MPIHello.java
Create a file ibis.properties:

ibis.server.address=localhost
# number of workers:
ibis.pool.size=2
ibis.pool.name=cdpdemo

Run mpj-server --events in a new terminal

Run two instances of mpj-run MPIHello