Distributed Programming

Concurrent and Distributed Programming
http://fmt.cs.utwente.nl/courses/cdp/
CDP #7 / HC 7 - Tuesday, 10 January 2012
Jaco van de Pol
http://fmt.cs.utwente.nl/~vdpol/
Overview of Lecture 7

Message Passing Interface (MPI)
Implementation: Mays and Musts
MPI in Java: MPJ

Lecture heavily based on material from:
Ganesh Gopalakrishnan (University of Utah)
Jan Lemeire (Vrije Universiteit Brussel)
Stefan Blom (UT)
Matrix Multiplication (MMUL)
Parallel Implementation
$\underbrace{\begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}}_{A} \cdot \underbrace{\begin{pmatrix} 2 \\ 1 \end{pmatrix}}_{b} = \underbrace{\begin{pmatrix} 5 \\ 8 \end{pmatrix}}_{r}$

parallel implementation:

int A11=1, A12=3, b1=2;
int A21=2, A22=4, b2=1;
int r1, r2;

worker 1: r1 := A11*b1 + A12*b2;    worker 2: r2 := A21*b1 + A22*b2;
Parallel MMUL

shared memory (visible to both workers):
A11=1, A12=3, b1=2
A21=2, A22=4, b2=1
int r1, r2

worker 1: r1 = A11*b1 + A12*b2    worker 2: r2 = A21*b1 + A22*b2
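For concreteness, the shared-memory version can be written in plain Java with one thread per row of A. This sketch is not from the slides; all names are chosen here:

public class ParallelMmul {
    // Shared read-only inputs; each worker writes only its own result slot.
    static final int A11 = 1, A12 = 3, A21 = 2, A22 = 4;
    static final int b1 = 2, b2 = 1;
    static int r1, r2;

    public static void main(String[] args) throws InterruptedException {
        Thread worker1 = new Thread(() -> { r1 = A11 * b1 + A12 * b2; });
        Thread worker2 = new Thread(() -> { r2 = A21 * b1 + A22 * b2; });
        worker1.start(); worker2.start();
        worker1.join(); worker2.join();                      // wait for both rows
        System.out.println("r = (" + r1 + ", " + r2 + ")"); // prints r = (5, 8)
    }
}

The join() calls also establish the happens-before ordering needed to read r1 and r2 safely from the main thread.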
Distributed Memory MMUL
Worker 1 owns row 1 of A and b
Worker 2 owns row 2 of A and b
chan c1to2 = [1] of int, c2to1 = [1] of int

worker 1:
  int A11=1, A12=3, b1=2
  int r1, b2'
  c1to2 ! b1
  c2to1 ? b2'
  r1 = A11*b1 + A12*b2'

worker 2:
  int A21=2, A22=4, b2=1
  int r2, b1'
  c2to1 ! b2
  c1to2 ? b1'
  r2 = A21*b1' + A22*b2
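The same exchange can be sketched with the MPJ point-to-point calls introduced at the end of this lecture (a sketch assuming two workers with ranks 0 and 1, running between MPJ.init and MPJ.finish; send/recv signatures as in the ring example later on). Rank 1 receives before sending, so the sketch does not rely on the runtime buffering the first message:

// Each rank owns one row of A and one entry of b (rank 0: b1=2, rank 1: b2=1).
int rank = MPJ.COMM_WORLD.rank();
int tag = 0;
int[] mine  = { rank == 0 ? 2 : 1 };   // my entry of b
int[] other = new int[1];              // the peer's entry of b

if (rank == 0) {
    MPJ.COMM_WORLD.send(mine, 0, 1, MPJ.INT, 1, tag);
    MPJ.COMM_WORLD.recv(other, 0, 1, MPJ.INT, 1, tag);
    int r1 = 1 * mine[0] + 3 * other[0];   // row 1 of A times b
} else {
    MPJ.COMM_WORLD.recv(other, 0, 1, MPJ.INT, 0, tag);
    MPJ.COMM_WORLD.send(mine, 0, 1, MPJ.INT, 0, tag);
    int r2 = 2 * other[0] + 4 * mine[0];   // row 2 of A times b
}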
Optimized DM MMUL
Important principle in Message-Passing Concurrency:
overlap computation and communication!

chan c1to2 = [1] of int, c2to1 = [1] of int

worker 1:
  int A11=1, A12=3, b1=2
  int r1, b2'
  c1to2 ! b1
  r1 = A11*b1
  c2to1 ? b2'
  r1 = r1 + A12*b2'

worker 2:
  int A21=2, A22=4, b2=1
  int r2, b1'
  c2to1 ! b2
  r2 = A22*b2
  c1to2 ? b1'
  r2 = A21*b1' + r2

(cf. Amdahl's Law: time spent blocked on communication acts as a sequential fraction that bounds the achievable speedup)
Message-Passing Paradigm
Partitioned Address Space
  Each worker has its own private address space
  Typically 1 worker per processor

Supports only explicit parallelization
  Adds complexity to programming
  Encourages locality of data access

Often Single Program Multiple Data (SPMD)
  Same code executed by all processes; identical except for the master
  (see the skeleton sketch below)

Loosely synchronous paradigm: between interactions (through messages), tasks execute asynchronously
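A minimal SPMD skeleton in MPJ (the Java MPI binding used at the end of this lecture): every worker runs the same main and branches on its rank. The class name and the branch bodies are placeholders chosen for this sketch:

import ibis.mpj.MPJ;
import ibis.mpj.MPJException;

public class SpmdSkeleton {
    public static void main(String[] args) throws MPJException {
        MPJ.init(args);                    // join the pool of workers
        int rank = MPJ.COMM_WORLD.rank();  // my identity: 0 .. size-1
        if (rank == 0) {
            // master-only part, e.g. distributing work and collecting results
        } else {
            // code for the other workers
        }
        MPJ.finish();
    }
}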
[Kumar, p. 233]
Blocking Send and Receive
Non-Blocking Send and Receive
Buffered Non-Blocking Send and Receive

Copying without involving the CPU (Direct Memory Access)
Message Passing Interface (MPI)
MPI is a standard for writing parallel programs portably
Distributed Memory Programming Model
Every Worker executes the same program (SPMD)
Widely used in High-Performance Computing (HPC)
http://www.mpi-forum.org/
Goals:
  introduce the basic principles of MPI
  create awareness of potential problems and their solutions
Fundamental MPI Operations
MPI 2.0: over 300 functions
Only four fundamental operations:
  (immediate) send
  (immediate) receive
  wait
  barrier
MPI_Irecv (Non-Blocking Receive)
MPI_Irecv (source, msg_buf, req, ...)
  Non-blocking call returning a request handle req
  Starts a mem-to-mem copy from the source process into msg_buf

MPI_Wait (req) awaits completion
  When the wait unblocks, msg_buf is ready for consumption

Source can be a specific ID ("rank") or the "wildcard" * (aka ANY_SOURCE):
  receive from any eligible (matching) sender

The I in Irecv stands for Immediate.

A blocking receive can be defined as:

MPI_Recv (source, msg_buf) {
  request req;
  MPI_Irecv (source, msg_buf, req);
  MPI_Wait (req);
}
MPI_Isend (Non-Blocking Send)
MPI_Isend (destination, msg_buf, req, ...)
  Non-blocking call returning a request handle req
  Starts a mem-to-mem copy from this process to destination
  May benefit from runtime buffering

MPI_Wait (req) awaits completion
  The sender can reuse msg_buf when the wait unblocks
  Completion means hand-off to MPI: there is no guarantee that the message was received at the destination!

A blocking send can be defined as:

MPI_Send (destination, msg_buf) {
  request req;
  MPI_Isend (destination, msg_buf, req);
  MPI_Wait (req);
}
MPI Send Modes
Blocking vs. Non-Blocking: does the Send call return immediately to the caller?
  MPI_Send: blocking, but not synchronous
  MPI_Isend: non-blocking; returns even without a matching receive

Synchronous vs. Asynchronous:
  MPI_Ssend: completes only when the receiver starts receiving
  ... and many other subtle variations: MPI_Bsend, MPI_Rsend, ...
Message Order
Message order between two fixed workers is preserved
No guarantees between multiple workers
Message order between two fixed workers is preserved:

P0                P1
Send(1, 1)        Recv(0, x)
Send(1, 2)        Recv(0, y)

always x = 1, y = 2

With a third worker there is no such guarantee:

P0                P1                P2
Send(1, 1)        Recv(*, x)        Send(1, 2)
                  Recv(*, y)

x = 1, y = 2  or  x = 2, y = 1
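The pairwise FIFO guarantee, expressed with the MPJ calls used later in this lecture (a sketch for two workers, running between MPJ.init and MPJ.finish; in the Send(dest, value) notation above, this is P0 sending the values 1 and 2 to P1):

int rank = MPJ.COMM_WORLD.rank();
int tag = 0;
int[] buf = new int[1];

if (rank == 0) {                 // P0: two sends to the same destination
    buf[0] = 1; MPJ.COMM_WORLD.send(buf, 0, 1, MPJ.INT, 1, tag);
    buf[0] = 2; MPJ.COMM_WORLD.send(buf, 0, 1, MPJ.INT, 1, tag);
} else if (rank == 1) {          // P1: messages arrive in sending order
    MPJ.COMM_WORLD.recv(buf, 0, 1, MPJ.INT, 0, tag);
    int x = buf[0];              // always 1
    MPJ.COMM_WORLD.recv(buf, 0, 1, MPJ.INT, 0, tag);
    int y = buf[0];              // always 2
    System.err.println("x=" + x + ", y=" + y);
}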
MPI Example: Scenario 1

P0                      P1                   P2
Irecv(from:*, r1)       //Sleep(3)           Sleep(3)
Irecv(from:2, r2)       Isend(to:0, r1)      Isend(to:0, r1)
Isend(to:2, r3)         waits                Irecv(from:0, r2)
Irecv(from:*, r4)                            Isend(to:0, r3)
waits                                        waits

Deadlock-free: P2 is delayed by its Sleep(3), so P1's send matches P0's wildcard receive r1; P2's two sends then match Irecv(from:2, r2) and Irecv(from:*, r4), and P0's Isend(to:2, r3) matches P2's Irecv(from:0, r2).
MPI Example: Scenario 2

P0                      P1                   P2
Irecv(from:*, r1)       Sleep(3)             //Sleep(3)
Irecv(from:2, r2)       Isend(to:0, r1)      Isend(to:0, r1)
Isend(to:2, r3)         waits                Irecv(from:0, r2)
Irecv(from:*, r4)                            Isend(to:0, r3)
waits                                        waits

Deadlock: now P1 is delayed, so P2's first send matches P0's wildcard receive r1. P0 then waits for a message from 2 to complete r2, while P2 waits on Irecv(from:0, r2) for a send that P0 only issues later, and P1's send cannot match Irecv(from:2): the processes block.
Barrier
Programming construct that synchronizes the actions of its participants
All participants must enter and then leave the barrier
The first participant may leave only after the last has entered.
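In MPJ one would expect a collective barrier on the communicator; a minimal usage sketch, assuming a barrier() method on MPJ.COMM_WORLD that mirrors MPI_Barrier (the method name is an assumption, not verified against the ibis.mpj API; the phase methods are hypothetical application code):

computePhase1();            // hypothetical application code
MPJ.COMM_WORLD.barrier();   // assumed MPJ counterpart of MPI_Barrier
// no worker starts phase 2 before every worker has finished phase 1
computePhase2();            // hypothetical application code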
Barrier Example

worker 1:        worker 2:
a                b
Barrier          Barrier
c                d

[Figure: reduced state space: a and b interleave in either order, then c and d interleave, i.e. the barrier separates the two phases. Full state space: the same, but with each worker's explicit "enter" and "leave" barrier events also interleaved, e.g. one worker may enter the barrier while the other is still computing.]
Barriers versus Wait: beware
Consider the following program; it consists of three workers, each running their own code:

P0: Isend0(to:1, h0); Barrier0; Wait(h0)
P1: Irecv(from:*, h1); Barrier1; Wait(h1)
P2: Barrier2; Isend2(to:1, h2); Wait(h2)

What scenarios are possible in this program?
Note: going through the barrier means the Isend/Irecv calls were issued.
This does NOT imply that the data transfer actually happened.
Actually, P1 may receive P2's Isend2!!
Data seems to "travel through the barrier" in the wrong direction!

(Excerpt from the accompanying paper, on its Figure 4: Illustration of Barrier Semantics and the POE Algorithm.)

While these rules match the rules followed by other languages and libraries in supporting their barrier operations, in the case of MPI it is possible for a process Pi to have a non-blocking operation OP before its barrier call, and another process Pj to have an operation OP' after Pj's matching barrier call, where OP can observe the execution of OP'. This means that OP can, in effect, complete after Pi's barrier has been invoked. Such behaviors are allowed in MPI to keep it a high-performance API.

Figure 4 illustrates the scenario just discussed. In this example, one Isend issued by P0, shown as Isend0(to:1, &h0), and another issued by P2, shown as Isend2(to:1, &h2), target a wildcard Irecv issued by P1, shown as Irecv(from:*, h1). The following execution is possible: (i) Isend0(to:1, &h0) is issued, (ii) Irecv(from:*, h1) is issued, (iii) each process fully executes its own barrier, and this "collective operation" finishes, (iv) Isend2(to:1, h2) is issued, (v) now both sends and the receive are alive, and hence Isend0 and Isend2 become dependent (non-deterministic matches with Irecv), requiring a dynamic algorithm to pursue both matches. Notice that Isend0 can finish before Barrier0 and Irecv can finish after Barrier1. We sometimes refer to the placement of barriers as in Figure 4 as "crooked barriers", because the barriers are used even though there are instructions around them that observe each other.

To handle this example properly, the POE algorithm of ISP cannot assume that Isend0 is the only sender that can match Irecv. Therefore ISP must somehow delay issuing Irecv into the MPI runtime until all its matchers are known. In this example, ISP will collect Isend0 and Irecv, then issue all the Barriers, then encounter Isend2, and only then act on the receive: ISP will issue Irecv with arguments 0 and 2, respectively, in two consecutive replays of the whole program. Thus, the inner workings of ISP's POE algorithm are as follows:

• Collect wildcard receives
• Delay non-blocking sends that may match the receives
• Issue other non-interfering operations (such as the Barriers)
• Finally, when forced (in the example of Figure 4, the forcing occurs when all processes are at their Wait statements), perform the dynamic rewritings, followed by the two replays

The example of Figure 4 illustrates many aspects of POE:

• Delaying of operations: Irecv was delayed in this example
• Out-of-order issue: the Barriers were issued out of program order
• Being forced to compute the full extent of non-determinism: when all processes are at fence instructions (instructions such as Wait and Barrier), the POE algorithm knows that none of the sends coming after the fence can match the wildcard Irecv; this allows POE to compute the full extent of non-determinism on-the-fly
• Replay: for each case of non-determinism, ISP only remembers the choices by keeping a stack trail; the full MPI runtime state cannot be remembered (it is too voluminous, spread over clusters, and consists of the entire state of a large machine), so ISP always restarts from the known initial state, computes up to a choice point, and then branches differently
The Rules

Barriers:
  a Barrier happens before the following statements of all procs

Data-dependencies for read/write:
  a Recv can never return before the matching Send has started

Within one worker:
  two sends to the same destination arrive in the same order
  receiving from the same source preserves the message order
  (unless different tags are used; see later)

Wait:
  a wait after an Irecv means: the message has arrived in the buffer
  a wait after an Isend means: the buffer space can be reused (!)
MPI Quiz

Is the following sequence deadlock-free?

P0: Irecv(from:0,rh); Barrier; Isend(to:0,sh); Wait(rh); Wait(sh);

No, not guaranteed (but likely it is). Implementations are allowed (but not expected) to only actually start a send request when the corresponding Wait is issued.

The following is deadlock-free for sure:

P0: Irecv(from:0,rh); Barrier; Isend(to:0,sh); Wait(sh); Wait(rh);

The following deadlocks for sure:

P0: Irecv(from:0,rh); Wait(rh); Barrier; Isend(to:0,sh); Wait(sh);
Barrier Implementation: Broadcast
Too expensive for many workers: consider the message count
Message complexity: O(N^2) - too bad; bandwidth becomes the bottleneck (each of the N workers sends to all N-1 others)
Latency: O(1) - quite OK (theoretically)

void msg_barrier_bcast () {
  request rq[Nworkers];
  message msg[Nworkers];
  for (int i=0; i<Nworkers; ++i) if (i!=me) { Irecv (i, msg[i], rq[i]); }
  for (int i=0; i<Nworkers; ++i) if (i!=me) { Send (i, "barrier"); }
  for (int i=0; i<Nworkers; ++i) if (i!=me) { Wait (rq[i]); }
}
Barrier Implementation: Ring
Too expensive for many workers: latency
Message complexity: O(N) - this seems rather optimal
Latency: O(N) - this is really killing performance!

void msg_barrier_ring () {
  message msg;
  if (me == 0) { Send (Nworkers-1, "enter"); }
  Recv (*, msg);
  if (me != 0) { Send (me-1, msg); }
  else { Send (Nworkers-1, "leave"); }
  Recv (*, msg);
  if (me != 0) { Send (me-1, msg); }
}

(The "enter" token travels once around the ring, so worker 0 learns that everyone has entered; the "leave" token then travels around to release all workers.)
Barrier Implementation: D&C
Assume N = 2^k workers: Divide & Conquer

Acceptable for many workers:
Message complexity: O(N log N) - acceptable
Latency: O(log N) - acceptable

void msg_barrier_dc () {
  for (int i=0; i<k; ++i) {
    request rq;
    message msg;
    Irecv (XOR(me, 1<<i), msg, rq);
    Send (XOR(me, 1<<i), "barrier");
    Wait (rq);
  }
}

(In round i, every worker exchanges a message with the partner whose rank differs in bit i; after k = log2(N) rounds, every worker has transitively heard from all others.)
[Figure: divide-and-conquer barrier for 2^k = 8 workers P0..P7; in round i, each worker Pm exchanges a message with its partner P(XOR(m, 2^i)).]
MPI Tags: Message Types
Every MPI message contains both data and a numeric tag
Messages with different tags are allowed to overtake at the request of the receiver
Tags: poor man’s message namespaces, e.g., for MPI libraries
P0                      P1
Send(1, tag:0, 1)       Recv(*, tag:*, x)
Send(1, tag:0, 2)       Recv(*, tag:*, y)

always x = 1, y = 2

P0                      P1
Send(1, tag:0, 1)       Recv(*, tag:1, x)
Send(1, tag:1, 2)       Recv(*, tag:0, y)

x = 2, y = 1 or deadlock
(deadlock if the first Send blocks without being buffered, since P1 first waits for the message with tag 1)
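The overtaking case, expressed with the MPJ calls used later in this lecture (a sketch for two workers, running between MPJ.init and MPJ.finish; as on the slide, it can deadlock if the runtime does not buffer the first send):

int rank = MPJ.COMM_WORLD.rank();
int[] buf = new int[1];

if (rank == 0) {                 // P0: tag 0 first, then tag 1
    buf[0] = 1; MPJ.COMM_WORLD.send(buf, 0, 1, MPJ.INT, 1, 0);
    buf[0] = 2; MPJ.COMM_WORLD.send(buf, 0, 1, MPJ.INT, 1, 1);
} else if (rank == 1) {          // P1 asks for tag 1 first, letting it overtake
    MPJ.COMM_WORLD.recv(buf, 0, 1, MPJ.INT, MPJ.ANY_SOURCE, 1);
    int x = buf[0];              // x = 2
    MPJ.COMM_WORLD.recv(buf, 0, 1, MPJ.INT, MPJ.ANY_SOURCE, 0);
    int y = buf[0];              // y = 1
    System.err.println("x=" + x + ", y=" + y);
}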
MPI Communicators: Connections
Like TCP allows multiple connections, MPI allows multiple communicators

A communicator is a handle for a group of processes

Using different communicators is required to allow overtaking:

P0                           P1
Isend(comm:0, 1, 1, h)       Recv(comm:1, *, x)
Send(comm:1, tag:0, 2)       Recv(comm:0, *, y)
Wait(h)

always x = 2, y = 1

P0                           P1
Send(comm:0, 1, 1)           Recv(comm:1, *, x)
Send(comm:1, 1, 2)           Recv(comm:0, *, y)

x = 2, y = 1 or deadlock
MPJ: MPI in Java
MPJ is a Java binding for MPI
Implemented by the Ibis project of the Vrije Universiteit Amsterdam
MPJ Example: Hello World

import ibis.mpj.MPJ;
import ibis.mpj.MPJException;

public class MPIHello {
  public static void main(String[] args) throws MPJException {
    MPJ.init(args);
    int size = MPJ.COMM_WORLD.size();
    int rank = MPJ.COMM_WORLD.rank();
    String who = "worker " + rank + "/" + size + ": ";
    System.err.println(who + "hello world!");
    MPJ.finish();
  }
}

Output:
worker 0/2: hello world!
worker 1/2: hello world!
MPJ Example: Ring
Send messages in a ring
...
int msg[] = new int[1];
int tag = 42;
int l = msg.length;

if (rank == 0)
  MPJ.COMM_WORLD.send(msg, 0, l, MPJ.INT, size-1, tag);

ibis.mpj.Status stat;
stat = MPJ.COMM_WORLD.recv(msg, 0, l, MPJ.INT, MPJ.ANY_SOURCE, tag);
int src = stat.getSource();
System.err.println(who + "received " + msg[0] + " from " + src);

++msg[0];
if (rank != 0)
  MPJ.COMM_WORLD.send(msg, 0, l, MPJ.INT, rank-1, tag);
...

(send arguments: buffer, offset, count, type, target rank, tag)
MPJ Setup (Unix)
Use JDK 1.6.0 from Sun/Oracle

Create a directory cdp/ in your home directory

Download and unzip mpj-2.2.zip to cdp/ (creates mpj-2.2/)

Add to your startup scripts:

export MPJ_HOME=$HOME/cdp/mpj-2.2
export PATH=$PATH:$MPJ_HOME/bin
export CLASSPATH=$MPJ_HOME/lib/mpj-2.2.jar:.
MPJ Setup (Windows)
Use JDK 1.6.0 from Sun/Oracle

Create a directory cdp/

Download and unzip mpj-2.2.zip to cdp/ (creates mpj-2.2/)

Execute in a command shell:

set MPJ_HOME="location of the mpj-2.2 directory"
set CLASSPATH=%CLASSPATH%;"%MPJ_HOME%"\lib\mpj-2.2.jar
set PATH=%PATH%;"%MPJ_HOME%"\bin

Persist the settings: http://www.support.tabs3.com/main/R10463.htm
MPJ Compiling & Running
Compile programs with javac MPIHello.java
Create a file ibis.properties:

ibis.server.address=localhost
# number of workers:
ibis.pool.size=2
ibis.pool.name=cdpdemo

Run mpj-server --events in a new terminal

Run two instances of mpj-run MPIHello