1
Lecture 4, Part 2: MPI Point-to-Point Communication
2
Realizing Message Passing
Separate network from processor
Separate user memory from system memory
[Diagram: node 0 and node 1, each with a processing element (PE), network interface (NI), user memory, and system memory, connected through the network]
3
Communication Modes for “Send”
Blocking/Non-blocking: timing regarding the use of the user message buffer
Ready: timing regarding the invocation of send and receive
Buffered: user/system buffer allocation
4
Communication Modes for “Send”
Synchronous/Asynchronous: timing regarding the invocation of send and receive plus the execution of the receive operation
Local/Non-local: completion is independent of / dependent on the execution of another user process
5
Messaging Semantics
[Diagram: sender and receiver with user-space and system-space buffers; paths illustrate blocking/nonblocking, synchronous/asynchronous, and ready vs. not-ready transfers]
6
Blocking/Non-blocking Send
Blocking send: the call does not return until the message data have been safely stored away, so the sender is free to access and overwrite the send buffer.
The message might be copied directly into the matching receive buffer, or it may be copied into a temporary system buffer even if no matching receive has been invoked.
Local (completion does not depend on the execution of another user process)
7
Blocking Receive -- MPI_Recv
Returns when the receive is locally complete
The message buffer can be read after the call returns
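A minimal sketch of the blocking pair in C (the two-rank setup, message size, and tag are assumptions for illustration):

    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, buf[4] = {1, 2, 3, 4};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* returns once buf is safe to reuse (copied to a system buffer or delivered) */
            MPI_Send(buf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* returns once the message is in buf, so buf can be read immediately */
            MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }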
8
Nonblocking Send -- MPI_Isend
Non-blocking, asynchronous: does not block waiting for the receive (returns “immediately”)
Check for completion with MPI_Wait( ) before reusing the buffer
MPI_Wait( ) returns when the message has been safely sent, not when it has been received.
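A sketch of the pattern in C (buf, count, dest, and tag are assumed to be defined elsewhere):

    MPI_Request req;
    MPI_Isend(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
    /* ... computation that does not modify buf ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* buf may be reused only after this returns */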
9
Non-blocking Receive -- MPI_Irecv
Returns “immediately”; the message buffer should not be read after return until local completion has been checked
MPI_Wait(..): blocks until the communication is complete
MPI_Waitall(..): blocks until all communication operations in a given list have completed
10
Non-blocking Receive -- MPI_Irecv
MPI_Irecv(buf, count, datatype, source, tag, comm, REQUEST): REQUEST can be used to query the status of the communication
MPI_Wait(REQUEST, status): returns only when REQUEST is complete
MPI_Waitall(count, array_of_requests, ..): waits for the completion of all REQUESTs in the array
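A sketch of the receive side in C (rbuf and maxcount are assumptions; the status query shows one way to use the completed REQUEST):

    MPI_Request req;
    MPI_Status  status;
    int received;
    MPI_Irecv(rbuf, maxcount, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &req);
    /* ... computation that does not read rbuf ... */
    MPI_Wait(&req, &status);                    /* local completion: rbuf is now valid */
    MPI_Get_count(&status, MPI_INT, &received); /* number of elements actually received */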
11
Nonblocking Communication
Improves performance by overlapping communication and computation
Requires an intelligent communication interface (messaging co-processor, as used in SP2, Paragon, CS-2, Myrinet, ATM)
[Timeline diagram: startup and transfer phases of successive messages, with computation added to overlap the transfers]
12
Ready Send -- MPI_Rsend( )
Receive must be posted before message arrives. Otherwise, the operation is erroneous and its outcome is undefined.
Non-local (completion depends on the starting time of the receiving process)
Incurs synchronization overhead.
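One way to use MPI_Rsend safely is sketched below in C; the barrier is only an illustrative device (two ranks and the buffers sbuf/rbuf are assumptions) to guarantee the receive is posted before the ready send starts:

    if (rank == 1)
        MPI_Irecv(rbuf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);  /* post the receive early */
    MPI_Barrier(MPI_COMM_WORLD);            /* ensures the receive is posted on rank 1 */
    if (rank == 0)
        MPI_Rsend(sbuf, n, MPI_INT, 1, 0, MPI_COMM_WORLD);        /* erroneous if no receive is posted */
    else if (rank == 1)
        MPI_Wait(&req, MPI_STATUS_IGNORE);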
13
Buffered Send -- MPI_Bsend( )
Explicitly buffers messages on the sending side
The user allocates the buffer (MPI_Buffer_attach( ))
Useful when the programmer wants to control buffer usage, e.g., when writing new communication libraries
14
Buffered Send -- MPI_Bsend( )
[Diagram: the sending node's PE and NI with user and system memory; the message is staged in a user-allocated buffer]
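A C sketch of attaching a user buffer and sending through it (n, sbuf, dest, and tag are assumptions for illustration):

    int size;
    char *attach_buf;
    MPI_Pack_size(n, MPI_INT, MPI_COMM_WORLD, &size);
    size += MPI_BSEND_OVERHEAD;                 /* per-message bookkeeping space */
    attach_buf = malloc(size);
    MPI_Buffer_attach(attach_buf, size);
    MPI_Bsend(sbuf, n, MPI_INT, dest, tag, MPI_COMM_WORLD);  /* copies into the attached buffer */
    MPI_Buffer_detach(&attach_buf, &size);      /* waits until buffered messages are delivered */
    free(attach_buf);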
15
Synchronous Send -- MPI_Ssend( )
Does not return until the matching receive has started to receive the message
The send buffer can be reused once the send operation has completed
Non-local (completion depends on the receiver having started the matching receive)
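A minimal C sketch (two ranks, buf, and n are assumptions):

    if (rank == 0)
        /* completes only after rank 1 has started the matching receive */
        MPI_Ssend(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);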
16
Standard Send -- MPI_Send( )
Standard send: behavior depends on the implementation (usually synchronous, blocking, and non-local)
Safe to reuse the buffer when MPI_Send( ) returns
May block until the message is received (depends on the implementation)
17
Standard Send -- MPI_Send( )
A good implementation:
Short messages: send immediately and buffer if no receive is posted; the goal is to reduce latency, and the cost of buffering is unimportant
Large messages: use the rendezvous protocol (request, reply, then send: wait for the matching receive before sending the data)
18
How to Exchange Data
Simple (code on node 0):
  sid = MPI_Isend(buf1, node1)
  rid = MPI_Irecv(buf2, node1)
  ..... computation ......
  call MPI_Wait(sid)
  call MPI_Wait(rid)
For maximum performance (see the C sketch below):
  ids(1) = MPI_Isend(buf1, node1)
  ids(2) = MPI_Irecv(buf2, node1)
  ..... computation ......
  call MPI_Waitall(2, ids)
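The same exchange written against the C bindings might look like this (the peer rank, buffer names, sizes, and tag are assumptions):

    MPI_Request ids[2];
    MPI_Isend(buf1, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &ids[0]);
    MPI_Irecv(buf2, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &ids[1]);
    /* ... computation ... */
    MPI_Waitall(2, ids, MPI_STATUSES_IGNORE);   /* both operations complete here */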
19
Model and Measure p2p Communication in MPI
Data transfer time = latency + message size / bandwidth
Latency (T0) is the startup time, independent of the message size (but it depends on the communication mode/protocol)
Bandwidth (B) is the number of bytes transferred per second (limited by the memory access rate and the network transmission rate)
20
Latency and Bandwidth
For short messages: latency dominates the transfer time
For long messages: the bandwidth term dominates the transfer time
Critical message size n_1/2 = latency x bandwidth (the size at which latency equals message size / bandwidth)
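For instance, taking the SP2 figures from the performance table below (T0 = 61 µs, B = 33 MB/s), a rough estimate is
  n_1/2 = T0 x B = 61e-6 s x 33e6 bytes/s ≈ 2,000 bytes,
so on that machine messages much shorter than about 2 KB are latency-dominated.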
21
Measure p2p Performance
One-way time = round-trip (ping-pong) time / 2
[Diagram: one process does send then recv, the other does recv then send]
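A C sketch of the timing loop (NITER, n, and buf are assumptions; averaging over many iterations hides clock resolution):

    double t0 = MPI_Wtime();
    for (int i = 0; i < NITER; i++) {
        if (rank == 0) {                           /* ping */
            MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {                    /* pong */
            MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * NITER);  /* estimated one-way transfer time */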
22
Some MPI Performance Results
Machine T0 (microsec) B (MB/s)
T3D 54 120
SP2 61 33
Paragon 75 36
PowerChallenge 15 61
23
Protocols
Rendezvous
Eager
Mixed
Pull (get)
24
Rendezvous
Algorithm: the sender sends a request-to-send; the receiver acknowledges; the sender sends the data
No buffering required; high latency (three steps); high bandwidth (no extra buffer copy)
25
Eager
Algorithm: the sender sends the data immediately; the message usually must be buffered
May be transferred directly if the receive is already posted
Features: low latency; low bandwidth (extra buffer copy)
26
Mixed
Algorithm: eager for short messages; rendezvous for long messages; switch protocols near n_1/2
27
Mixed
Features: low latency for latency-dominated (short) messages; high bandwidth for bandwidth-dominated (long) messages; reasonable memory management; non-ideal performance for some messages near n_1/2
28
Pull (Get) Protocol
One-sided communication; used in shared-memory machines
29
MPICH p2p on SGI
Ping-pong time on the SGI Power Challenge (configuration: -arch=IRIX64 -device=ch_lfshmem -comm=shared); each interval is 128 bytes
[Plot: wall-clock time (µs) vs. packet size (bytes), 0-2304 bytes; minimum and average curves]
Default: 0-1024 bytes: short; 1024 bytes-128 KB: eager; > 128 KB: rendezvous; MPID_PKT_MAX_DATA_SIZE = 256
Short protocol: the data are carried inside the message header
30
Let MPID_PKT_MAX_DATA_SIZE = 256
Ping-pong time on the SGI Power Challenge (configuration: -arch=IRIX64 -device=ch_lfshmem -comm=shared; MPID_PKT_MAX_DATA_SIZE set to 256 and long_len < 1024); each interval is 128 bytes
[Plot: wall-clock time (µs) vs. packet size (bytes), 0-2304 bytes; minimum and average curves, with the short, eager, and rendezvous regions marked]
31
MPI-FM (HPVM: Fast Messages) Performance
[Bar charts: one-way latency (µs, 0-250, lower is better) and bandwidth (MB/s, 0-300, higher is better) for HPVM, Power Challenge, SP-2, T3E, Origin 2K, and Beowulf]
Note: Supercomputer measurements taken by NAS, JPL, and HLRS (Germany)
32
MPI Collective Operations
MPI_Alltoall(v)
MPI_Alltoall is an extension of MPI_Allgather to the case where each process sends distinct data to each of the receivers. The j-th block of data sent from process i is received by process j and is placed in the i-th block of the receive buffer of process j.
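A small C sketch with one int per destination (the payload values are arbitrary and only mark where each block came from):

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int *sendbuf = malloc(nprocs * sizeof(int));
    int *recvbuf = malloc(nprocs * sizeof(int));
    for (int j = 0; j < nprocs; j++)
        sendbuf[j] = 100 * rank + j;              /* block addressed to process j */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    /* recvbuf[i] now holds the block that process i addressed to this rank */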
MPI_Alltoall(v)
[Figure: data layout across eight processes before and after alltoall; block j of process i moves to block i of process j]
Define ij to be the i-th block of data of process j.
MPI_Alltoall(v)
Current implementation: process j sends ij directly to process i
[Figure: send and receive buffers of processes 0-7; each process j sends block ij directly to process i]
MPI_Alltoall(v)
Current implementation: process j sends ij directly to process i
[Figure: send and receive buffers of processes 0-7 after the exchange; every receive buffer holds one block from each process]