SCTP versus TCP for MPI
Brad Penoff, Humaira Kamal, Alan Wagner
Department of Computer Science, University of British Columbia
Outline
Self introduction
Research background
Research presentation:
SCTP & MPI background
MPI over SCTP design
Design features
Results
Conclusions
Who am I?
Born and raised in the Columbus area
OSU alumnus
Europa alumnus
Worked a few years
Grad student finishing my MSc at UBC
Who do I work with?
Alan Wagner (Prof, UBC)
Humaira Kamal (PhD, UBC)
Mike Yao Chen Tsai (MSc, UBC)
Edith Vong (BSc, UBC)
Randall Stewart (Cisco)
What field do we work in?
Parallel computing: concurrently utilize multiple resources
1 cook vs. 8 cooks: time saved
What field do we work in?
Message passing programming model
Message Passing Interface (MPI)
• Standardized API for applications

Process 0:
  ...
  result = compute();
  MPI_Send(proc1, result, ...);
  ...

Process 1:
  ...
  local_answer = solve();
  MPI_Recv(proc0, otherResult, ...);
  result = local_answer - otherResult;
  ...

(a message travels from Process 0 to Process 1)
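For concreteness, the sketch above can be fleshed out into a complete MPI program. This is a minimal illustration, not the talk's code; compute() and solve() are stood in for by constants, and it assumes the usual mpicc/mpirun toolchain.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                /* Process 0 */
        int result = 42;            /* stands in for result = compute(); */
        MPI_Send(&result, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {         /* Process 1 */
        int local_answer = 58;      /* stands in for local_answer = solve(); */
        int otherResult;
        MPI_Recv(&otherResult, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("result = %d\n", local_answer - otherResult);
    }
    MPI_Finalize();
    return 0;
}

Run with, e.g., mpirun -np 2 ./a.out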
What field do we work in? Middleware for MPI
Glues necessary components together for a parallel environment:
Job Scheduler
Process Manager
MPI Parallel Library
(together these form the MPI Middleware, sitting between the Parallel Applications above and the Resources below)
What field do we work in?
Parallel library component
Implements the MPI API for various interconnects:
• Shared memory
• Myrinet
• Infiniband
• Specialized hardware (BlueGene/L, ASCI Red, etc.)
What field do we work in?
TCP/IP protocol stack as the interconnect
Stream Control Transmission Protocol (SCTP)

Application
Transport:  TCP, UDP, SCTP
Network:    IP
Link:       Ethernet (device driver and interface card)
SCTP versus TCP for MPI
Brad Penoff, Humaira Kamal, Alan Wagner
Department of Computer Science, University of British Columbia
Supercomputing 2005, Seattle, Washington USA
What are MPI and SCTP?
Message Passing Interface (MPI)
Library that is widely used to parallelize scientific and compute-intensive programs
Stream Control Transmission Protocol (SCTP)
General-purpose unicast transport protocol for IP network data communications
Recently standardized by the IETF
Can be used anywhere TCP is used

Question: Can we take advantage of SCTP features to better support parallel applications using MPI?
Communicating MPI Processes
TCP is often used as the transport protocol for MPI:
each side runs an MPI Process over the MPI API, over TCP, over IP.
Our work replaces the TCP layer with SCTP on both sides.
SCTP Key Features
Reliable in-order delivery, flow control, full duplex transfer
Selective ACK is built into the protocol
TCP-like congestion control
SCTP Key Features
Message oriented
Use of associations
Multihoming
Multiple streams within an association
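As a concrete illustration of the streams feature, here is a minimal sketch of opening an SCTP socket and requesting multiple streams. It assumes the sctp_* API from lksctp-tools on Linux or the FreeBSD KAME stack; the stream counts are arbitrary.

#include <netinet/in.h>
#include <netinet/sctp.h>
#include <string.h>
#include <sys/socket.h>

int open_sctp_socket(void) {
    /* One-to-many style socket: one descriptor, many associations. */
    int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);

    struct sctp_initmsg initmsg;
    memset(&initmsg, 0, sizeof(initmsg));
    initmsg.sinit_num_ostreams  = 10;  /* ask for 10 outbound streams */
    initmsg.sinit_max_instreams = 10;  /* accept up to 10 inbound streams */
    setsockopt(sd, IPPROTO_SCTP, SCTP_INITMSG, &initmsg, sizeof(initmsg));
    return sd;
}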
Associations and Multihoming
Primary address, heartbeats, retransmissions, failover, user-adjustable controls, CMT (concurrent multipath transfer)

Example: Node 0 (NIC1, NIC2) and Node 1 (NIC3, NIC4) are connected through two networks, 207.10.x.x and 168.1.x.x. Node 0 has an interface on each network (IP 207.10.40.1 and 168.1.140.10), and so does Node 1 (207.10.3.20 and 168.1.10.30); a single association can use both paths.
Logical View of Multiple Streams in an Association
Endpoint X and Endpoint Y each have outbound and inbound sides of Stream 1, Stream 2, and Stream 3; a SEND on one of X's outbound streams arrives at Y's matching inbound stream, and vice versa.
Partially Ordered User Messages Sent on Different Streams
Endpoint X sends five messages to Endpoint Y across three streams.
Send order: Msg A, Msg B, Msg C, Msg D, Msg E.
SCTP guarantees ordering only within each stream, so several delivery orders are possible at Y:
The messages can be received in the same order as they were sent (the only order TCP allows).
They can also arrive in alternative receive orders: any interleaving that preserves the per-stream send order is valid.
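A minimal sketch of what "different streams" means at the API level, assuming the standard sctp_sendmsg() call: each send names a stream number, and only sends on the same stream are ordered relative to each other, so Msg B may be delivered before Msg A if Msg A is delayed.

#include <netinet/in.h>
#include <netinet/sctp.h>
#include <string.h>

void send_on_streams(int sd, struct sockaddr_in *peer) {
    const char *msg_a = "Msg A", *msg_b = "Msg B";
    /* Msg A on stream 0 */
    sctp_sendmsg(sd, msg_a, strlen(msg_a),
                 (struct sockaddr *)peer, sizeof(*peer),
                 0 /* ppid */, 0 /* flags */, 0 /* stream no */,
                 0 /* ttl */, 0 /* context */);
    /* Msg B on stream 1: not ordered relative to Msg A */
    sctp_sendmsg(sd, msg_b, strlen(msg_b),
                 (struct sockaddr *)peer, sizeof(*peer),
                 0, 0, 1, 0, 0);
}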
MPI API Implementation
Message matching is done based on Tag, Rank and Context (TRC)
Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered
Use of wildcards for receive

MPI_Send(msg,count,type,dest-rank,tag,context)
MPI_Recv(msg,count,type,source-rank,tag,context)

Format of an MPI message: Envelope (Context, Rank, Tag) + Payload
MPI Messages Using Same Context, Two Processes

Process X:
MPI_Send(Msg_1,Tag_A)
MPI_Send(Msg_2,Tag_B)
MPI_Send(Msg_3,Tag_A)

Process Y:
MPI_Irecv(..ANY_TAG..)

Msg_1, Msg_2, and Msg_3 travel from X to Y. Out-of-order messages with the same tag violate MPI semantics: the wildcard receive must match Msg_1 before Msg_3.
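The semantics are easiest to see from the receiver side. A minimal sketch (receiver only, assuming process X is rank 0): a wildcard receive must still match same-tag messages from one sender in their send order, and the MPI_Status reveals which tag matched.

#include <mpi.h>
#include <stdio.h>

void receive_any(void) {
    char buf[64];
    MPI_Status status;
    MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    /* If Msg_1 (Tag_A) was sent before Msg_3 (Tag_A), this receive must
     * match Msg_1; delivering Msg_3 first would violate MPI semantics. */
    printf("matched tag %d from rank %d\n", status.MPI_TAG, status.MPI_SOURCE);
}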
MPI API Implementation
Request Progression Layer
Short messages vs. long messages

Application layer: a receive request is issued and goes onto the Receive Request Queue
SCTP layer (socket): an incoming message is received and matched against the Receive Request Queue, or placed on the Unexpected Message Queue (runtime)
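A sketch of the two-queue matching idea behind this layer. The struct and function names here are hypothetical, for illustration only; LAM's actual RPI structures differ, and wildcard matching is omitted.

#include <stddef.h>

struct envelope { int context; int rank; int tag; };

struct msg {
    struct envelope env;
    void *payload;
    struct msg *next;
};

struct msg *recv_request_queue;    /* receives posted by the application */
struct msg *unexpected_msg_queue;  /* messages with no posted receive yet */

static int matches(const struct envelope *a, const struct envelope *b) {
    /* wildcards (MPI_ANY_SOURCE / MPI_ANY_TAG) omitted for brevity */
    return a->context == b->context && a->rank == b->rank && a->tag == b->tag;
}

void on_incoming(struct msg *m) {
    for (struct msg **p = &recv_request_queue; *p; p = &(*p)->next) {
        if (matches(&(*p)->env, &m->env)) {
            /* deliver payload into the posted receive, unlink the request */
            *p = (*p)->next;
            return;
        }
    }
    m->next = unexpected_msg_queue;  /* no match yet: park the message */
    unexpected_msg_queue = m;
}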
MPI over SCTP: Design and Implementation
LAM (Local Area Multicomputer) is an open source implementation of the MPI library
Origins at the Ohio Supercomputing Center
We redesigned the LAM TCP RPI module to use SCTP
The RPI module is responsible for maintaining state information of all requests

Challenges:
Lack of documentation
Code examination
• Our document is linked off the LAM/MPI website
Extensive instrumentation
• Diagnostic traces
Identification of problems in the SCTP protocol
Using SCTP for MPI
Striking similarities between SCTP and MPI:

MPI              SCTP
Context          One-to-Many Socket
Rank / Source    Association
Message Tags     Streams
Implementation Issues
Maintaining State Information
Maintain state appropriately for each request function to work with the one-to-many style
Message Demultiplexing (sketched below)
Extend RPI initialization to map associations to ranks
Demultiplex each incoming message to direct it to the proper receive function
Concurrency and SCTP Streams
Consistently map MPI tag-rank-context to SCTP streams, maintaining proper MPI semantics
Resource Management
Make RPI more message-driven
Eliminate the use of the select() system call, making the implementation more scalable
Eliminate the need to maintain a large number of socket descriptors
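A sketch of the demultiplexing step on a one-to-many socket, assuming the standard sctp_recvmsg() call. Every incoming message carries its association id and stream number, so one socket can serve all peers without select() over N descriptors. rank_of_assoc() and dispatch_to_receive_function() are hypothetical placeholders for the RPI's lookup and dispatch logic.

#include <netinet/in.h>
#include <netinet/sctp.h>
#include <string.h>

int  rank_of_assoc(sctp_assoc_t assoc);                   /* hypothetical */
void dispatch_to_receive_function(int rank, int stream,
                                  const void *data, size_t len); /* hypothetical */

void progress(int sd) {
    char buf[65536];
    struct sockaddr_in from;
    socklen_t fromlen = sizeof(from);
    struct sctp_sndrcvinfo sinfo;
    int flags = 0;

    memset(&sinfo, 0, sizeof(sinfo));
    ssize_t n = sctp_recvmsg(sd, buf, sizeof(buf),
                             (struct sockaddr *)&from, &fromlen,
                             &sinfo, &flags);
    if (n <= 0) return;

    /* association -> MPI rank, stream -> tag class */
    int rank   = rank_of_assoc(sinfo.sinfo_assoc_id);
    int stream = sinfo.sinfo_stream;
    dispatch_to_receive_function(rank, stream, buf, (size_t)n);
}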
Implementation Issues
Eliminating Race Conditions
Finding solutions for race conditions due to added concurrency
Use of a barrier after the association setup phase
Reliability
Modify the out-of-band daemons and the request progression interface (RPI) to use a common transport layer protocol, allowing all components of LAM to multihome successfully
Support for large messages
Devised a long-message protocol (sketched below) to handle messages larger than the socket send buffer
Experiments with different SCTP stacks
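The talk does not spell out the long-message protocol, but the core idea of pushing a message in pieces no larger than the send buffer can be sketched as follows. send_chunk() is a hypothetical wrapper around the actual socket send; flow control and would-block handling are omitted.

#include <sys/types.h>

ssize_t send_chunk(int sd, const char *p, size_t len);  /* hypothetical */

int send_long(int sd, const char *msg, size_t total, size_t chunk) {
    size_t off = 0;
    while (off < total) {
        size_t n = (total - off < chunk) ? total - off : chunk;
        ssize_t sent = send_chunk(sd, msg + off, n);
        if (sent < 0) return -1;  /* error/would-block handling omitted */
        off += (size_t)sent;      /* advance by what the buffer accepted */
    }
    return 0;
}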
Features of Design
Scalability
Head-of-Line Blocking
Scalability
TCP: each MPI process needs N - 1 sockets (one per peer)
SCTP: one one-to-many socket per MPI process
Head-of-Line Blocking
Process X posts an MPI_Irecv for Tag_A and another for Tag_B; Process Y does an MPI_Send for each.
SCTP: Msg_A and Msg_B travel on different streams, so Msg_B is delivered even while Msg_A is delayed.
TCP: both messages share one byte stream, so a delayed Msg_A blocks Msg_B.
P0:
MPI_Irecv(P1, MPI_ANY_TAG)
MPI_Irecv(P1, MPI_ANY_TAG)
MPI_Waitany()
Compute()
MPI_Waitall()

P1:
- - -
MPI_Send(Msg-A, P0, tag-A)
MPI_Send(Msg-B, P0, tag-B)
- - -

Execution time on P0 with TCP: Msg-B arrives first but sits in the socket buffer behind the missing Msg-A; MPI_Waitany cannot complete until Msg-A arrives, and only then do Compute and MPI_Waitall run.

Execution time on P0 with SCTP: Msg-B is delivered on its own stream as soon as it arrives, MPI_Waitany completes with Msg-B, and Compute overlaps with the late arrival of Msg-A before MPI_Waitall.
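The P0/P1 pattern above, written out as a compilable MPI program (an illustrative reconstruction; message contents and the Compute() body are stand-ins). With tags mapped to distinct SCTP streams, a loss that delays Msg-A does not prevent MPI_Waitany from completing with Msg-B.

#include <mpi.h>

#define TAG_A 1
#define TAG_B 2

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* P0 */
        char a[1024], b[1024];
        MPI_Request reqs[2];
        int done;
        MPI_Irecv(a, sizeof(a), MPI_CHAR, 1, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(b, sizeof(b), MPI_CHAR, 1, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitany(2, reqs, &done, MPI_STATUS_IGNORE);
        /* Compute() here overlaps with the still-outstanding receive */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {                 /* P1 */
        char msg_a[1024] = "Msg-A", msg_b[1024] = "Msg-B";
        MPI_Send(msg_a, sizeof(msg_a), MPI_CHAR, 0, TAG_A, MPI_COMM_WORLD);
        MPI_Send(msg_b, sizeof(msg_b), MPI_CHAR, 0, TAG_B, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}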
Limitations
Comprehensive CRC32c checksum; offload to the NIC is not yet commonly available
SCTP bundles messages together, so it might not always be able to pack a full MTU
The SCTP stacks are at an early stage and will improve over time
Performance is stack dependent (the Linux lksctp stack is much slower than the FreeBSD KAME stack)
Experiments
Controlled environment: eight nodes, Dummynet
Used standard benchmarks as well as real-world programs
Fair comparison: identical buffer sizes, Nagle disabled, SACK on, no multihoming, CRC32c off
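For example, making the Nagle setting comparable on both transports is one setsockopt() per socket (a sketch; error checking omitted):

#include <netinet/in.h>
#include <netinet/sctp.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void disable_nagle(int tcp_sd, int sctp_sd) {
    int one = 1;
    /* small messages go out immediately on both transports */
    setsockopt(tcp_sd,  IPPROTO_TCP,  TCP_NODELAY,  &one, sizeof(one));
    setsockopt(sctp_sd, IPPROTO_SCTP, SCTP_NODELAY, &one, sizeof(one));
}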
Experiments: Benchmarks
[Chart] MPBench Ping Pong Test under no loss: throughput normalized to the LAM_TCP values (y-axis, roughly 0 to 1.4) against message size in bytes (up to ~128 KB), for LAM_SCTP and LAM_TCP.
NAS Benchmarks
The NAS benchmarks approximate real-world parallel scientific applications
We experimented with a suite of 7 benchmarks and 4 data set sizes
SCTP performance was comparable to TCP for large data sets
Latency Tolerant Programs
Bulk Farm Processor program
Real-world application
Non-blocking communication
Overlap computation with communication
Use of multiple tags
Farm Program - Short Messages
LAM_SCTP versus LAM_TCP for Farm Program (Message Size: Short, Fanout: 10)
Total run time (seconds) by loss rate:

Loss Rate   LAM_SCTP   LAM_TCP
0%          8.7        6.2
1%          11.7       88.1
2%          16.0       154.7
Head-of-line blocking - Short messages
LAM_SCTP 10 streams versus LAM_SCTP 1 stream for Farm Program (Message Size: Short, Fanout: 10)
Total run time (seconds) by loss rate:

Loss Rate   10 Streams   1 Stream
0%          8.7          9.3
1%          11.7         11.0
2%          16.0         21.6
Conclusions
SCTP is better suited for MPI:
Avoids unnecessary head-of-line blocking through its use of streams
Increased fault tolerance in the presence of multihomed hosts
Built-in security features
Robust under loss
SCTP might be key to moving MPI programs from LANs to WANs
Future Work
Release the LAM SCTP RPI module at SC|05
Incorporate our work into Open MPI and/or MPICH2
Modify real applications to use tags as streams
More information about our work is at:
http://www.cs.ubc.ca/labs/dsg/mpi-sctp/
Thank you!
Extra Slides
Partially Ordered User Messages Sent on Different Streams
At the SCTP layer, each user message is assigned a stream number (SNo), fragmented into data chunks, and placed on the data chunk queue; data chunks and control chunks are then bundled into SCTP packets and passed to the IP layer.
Added Security
SCTP's use of a signed cookie in the four-way handshake between P0 and P1:
INIT, INIT-ACK, COOKIE-ECHO, COOKIE-ACK
User data can be piggy-backed on the third and fourth legs.
Added Security
32-bit verification tag guards against reset attacks
Autoclose feature
No half-closed state
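The autoclose feature, for instance, is a single socket option on a one-to-many socket (a sketch; an idle association is shut down automatically after the given number of seconds):

#include <netinet/in.h>
#include <netinet/sctp.h>
#include <sys/socket.h>

void enable_autoclose(int sd, int seconds) {
    /* 0 disables autoclose; any positive value is an idle timeout */
    setsockopt(sd, IPPROTO_SCTP, SCTP_AUTOCLOSE, &seconds, sizeof(seconds));
}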
Farm Program - Long Messages
LAM_SCTP versus LAM_TCP for Farm Program (Message Size: Long, Fanout: 10)
Total run time (seconds) by loss rate:

Loss Rate   LAM_SCTP   LAM_TCP
0%          79         129
1%          786        3103
2%          1585       6414
Head-of-line blocking - Long messages
LAM_SCTP 10 streams versus LAM_SCTP 1 stream for Farm Program (Message Size: Long, Fanout: 10)
Total run time (seconds) by loss rate:

Loss Rate   10 Streams   1 Stream
0%          79           79
1%          786          1000
2%          1585         1942
Experiments: Benchmarks
SCTP outperformed TCP under loss for the ping pong test.
[Chart] Throughput of Ping-pong with 30K messages (bytes/second) at 1% and 2% loss rates, SCTP vs. TCP.
[Chart] Throughput of Ping-pong with 300K messages (bytes/second) at 1% and 2% loss rates, SCTP vs. TCP.