SCTP versus TCP for MPI
Brad Penoff, Humaira Kamal, Alan Wagner
Department of Computer Science, University of British Columbia

Page 1:

SCTP versus TCP for MPI

Brad Penoff, Humaira Kamal, Alan Wagner
Department of Computer Science

University of British Columbia

Page 2:

Outline

Self introduction
Research background
Research presentation:
SCTP & MPI background
MPI over SCTP design
Design features
Results
Conclusions

Page 3:

Who am I?

Born and raised in the Columbus area
OSU alumnus
Europa alumnus
Worked a few years
Grad student finishing my MSc at UBC

Page 4:

UBC

Page 5:

Who do I work with?

Alan Wagner (Prof, UBC)
Humaira Kamal (PhD, UBC)
Mike Yao Chen Tsai (MSc, UBC)
Edith Vong (BSc, UBC)
Randall Stewart (Cisco)

Page 6:

What field do we work in?

Parallel computing: concurrently utilize multiple resources

Page 7:

What field do we work in?

Parallel computing: concurrently utilize multiple resources

1 cook vs. 8 cooks: time saved

Page 10:

What field do we work in?

Message-passing programming model: Message Passing Interface (MPI)

• Standardized API for applications

Process 0:
    ...
    result = compute();
    MPI_Send(proc1, result, ...);
    ...

Process 1:
    ...
    local_answer = solve();
    MPI_Recv(proc0, otherResult, ...);
    result = local_answer - otherResult;
    ...

A message travels from Process 0 to Process 1.
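The exchange above can be mimicked with plain queues. This is a minimal sketch of the send/receive pattern, not real MPI: the names `mpi_send`/`mpi_recv` and the numeric stand-ins for `compute()` and `solve()` are invented for illustration.

```python
import queue

# One inbox per process; mpi_send/mpi_recv are stand-ins for MPI_Send/MPI_Recv.
inbox = {0: queue.Queue(), 1: queue.Queue()}

def mpi_send(dest_rank, payload):
    inbox[dest_rank].put(payload)      # deliver into the destination's inbox

def mpi_recv(my_rank):
    return inbox[my_rank].get()        # block until a message arrives

# Process 0: compute a partial result and send it to process 1.
result = 40                            # stands in for compute()
mpi_send(1, result)

# Process 1: combine its local answer with the received result.
local_answer = 42                      # stands in for solve()
other_result = mpi_recv(1)
result = local_answer - other_result   # 42 - 40 = 2
```

The point is only the pattern: the sender's data materializes at the receiver, and the receive blocks until it does.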

Page 11:

What field do we work in? Middleware for MPI

Glues the necessary components together for the parallel environment:

Job Scheduler
Process Manager
MPI Parallel Library

Diagram: the parallel applications sit on top of the MPI middleware, which manages the underlying resources.


Page 13:

What field do we work in?

Parallel library component: implements the MPI API for various interconnects
• Shared memory
• Myrinet
• Infiniband
• Specialized hardware (BlueGene/L, ASCI Red, etc.)

Page 14:

What field do we work in?

The TCP/IP protocol stack as the interconnect: Stream Control Transmission Protocol

Application
Transport: TCP, UDP, SCTP
Network: IP
Link: Ethernet (device driver and interface card)

Page 15:

SCTP versus TCP for MPI

Brad Penoff, Humaira Kamal, Alan Wagner
Department of Computer Science

University of British Columbia

Supercomputing 2005, Seattle, Washington USA

Page 16:

What is MPI and SCTP?

Message Passing Interface (MPI): a library that is widely used to parallelize scientific and compute-intensive programs

Stream Control Transmission Protocol (SCTP): a general-purpose unicast transport protocol for IP network data communications
Recently standardized by the IETF
Can be used anywhere TCP is used

Page 17:

What is MPI and SCTP?

Question: Can we take advantage of SCTP features to better support parallel applications using MPI?

Page 18:

Communicating MPI Processes

TCP is often used as the transport protocol for MPI.

Diagram: two MPI processes, each running the stack MPI API over TCP (or SCTP) over IP, communicating through the transport layer.

Page 19:

SCTP Key Features

Reliable in-order delivery, flow control, full-duplex transfer

Selective ACK is built into the protocol

TCP-like congestion control

Page 20:

SCTP Key Features

Message oriented

Use of associations

Multihoming

Multiple streams within an association
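A rough model of stream semantics: messages on the same stream are delivered in send order, while different streams are independent. This is a toy simulation under those assumptions, not the SCTP API; the class and method names are invented.

```python
from collections import defaultdict

class Association:
    """Toy model of an SCTP association: per-stream FIFO delivery,
    no ordering constraint across streams."""
    def __init__(self):
        self.streams = defaultdict(list)

    def send(self, stream_no, msg):
        self.streams[stream_no].append(msg)

    def deliver(self, stream_no):
        # Pops the oldest undelivered message on that stream only;
        # other streams are unaffected.
        return self.streams[stream_no].pop(0)

assoc = Association()
assoc.send(1, "Msg A")
assoc.send(1, "Msg B")
assoc.send(2, "Msg C")

# Stream 2 can deliver before stream 1 finishes: a partial order.
first = assoc.deliver(2)   # "Msg C"
second = assoc.deliver(1)  # "Msg A"
```

The partial order is the key property the following slides illustrate: per-stream order is guaranteed, global order is not.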

Page 21:

Associations and Multihoming

Primary address
Heartbeats
Retransmissions
Failover
User-adjustable controls
CMT (concurrent multipath transfer)

Diagram: Node 0 (NIC1, NIC2) and Node 1 (NIC3, NIC4), each attached to both network 207.10.x.x and network 168.1.x.x (addresses 207.10.40.1, 168.1.10.30, 207.10.3.20, 168.1.140.10).

Page 22:

Logical View of Multiple Streams in an Association

Diagram: endpoints X and Y each SEND on three outbound streams (Stream 1-3) and RECEIVE on three inbound streams.

Page 23:

Partially Ordered User Messages Sent on Different Streams

Diagram: endpoint X sends five messages (Msg A through Msg E) to endpoint Y, spread across Streams 1-3. Send order: Msg A, Msg B, Msg C, Msg D, Msg E.


Page 34:

Partially Ordered User Messages Sent on Different Streams

Receive order: the messages can be received in the same order as they were sent (as would be required in TCP).

Page 35:

Partially Ordered User Messages Sent on Different Streams

Alternative receive order: since order is only guaranteed within a stream, messages on different streams may be delivered in a different relative order than they were sent.


Page 39:

MPI API Implementation

Message matching is done based on Tag, Rank and Context (TRC).

Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered.

Use of wildcards for receive

MPI_Send(msg,count,type,dest-rank,tag,context)

MPI_Recv(msg,count,type,source-rank,tag,context)

Format of an MPI message: Envelope (Context, Rank, Tag) followed by the Payload
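The TRC matching described above can be sketched directly. This is a simplification of envelope matching for illustration; `ANY` stands in for the MPI wildcard constants, and the function name is invented:

```python
ANY = object()  # stands in for MPI_ANY_TAG / MPI_ANY_SOURCE

def matches(request, envelope):
    """A posted receive (tag, rank, context) matches an incoming
    envelope if each field is equal or the request used a wildcard.
    Context has no wildcard in MPI, so it must match exactly."""
    req_tag, req_rank, req_ctx = request
    env_tag, env_rank, env_ctx = envelope
    return ((req_tag is ANY or req_tag == env_tag) and
            (req_rank is ANY or req_rank == env_rank) and
            req_ctx == env_ctx)
```

A receive posted with a wildcard tag accepts any tag from the matching source and context, which is exactly why arrival order of same-TRC messages matters in the slides that follow.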

Page 40:

MPI Messages Using Same Context, Two Processes

Process X issues MPI_Send(Msg_1, Tag_A), MPI_Send(Msg_2, Tag_B), MPI_Send(Msg_3, Tag_A); Process Y posts MPI_Irecv(..ANY_TAG..). The diagrams show the three messages arriving at Y in two different orders.

Page 41:

MPI Messages Using Same Context, Two Processes

Out-of-order messages with the same tags violate MPI semantics.

Page 42:

MPI API Implementation

Request Progression Layer

Short messages vs. long messages

Diagram: at the application layer, a receive request is issued and placed on the Receive Request Queue inside the MPI implementation; at the SCTP layer, an incoming message arrives on the socket and is matched against that queue, or placed on the Unexpected Message Queue if no request matches.

Page 43:

MPI over SCTP: Design and Implementation

LAM (Local Area Multicomputer) is an open-source implementation of the MPI library, with origins at the Ohio Supercomputing Center.

We redesigned the LAM TCP RPI module to use SCTP.

The RPI module is responsible for maintaining the state information of all requests.

Page 44:

MPI over SCTP: Design and Implementation

Challenges:
Lack of documentation
Code examination
• Our document is linked off the LAM/MPI website
Extensive instrumentation
• Diagnostic traces
Identification of problems in the SCTP protocol

Page 45:

Using SCTP for MPI

Striking similarities between SCTP and MPI:

MPI                SCTP
Context            One-to-many socket
Rank / Source      Association
Message tags       Streams
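The table's tag-to-stream row can be made concrete. Assuming messages with the same tag must stay ordered relative to each other, a fixed hash of the tag onto the association's stream range preserves that while spreading different tags across streams. A sketch; the stream count and function name are illustrative, not taken from the LAM module:

```python
def tag_to_stream(tag, num_streams=10):
    """Map an MPI tag to an SCTP stream number.  Messages with the
    same tag always land on the same stream, so their relative order
    is preserved, while different tags can land on different streams
    and avoid head-of-line blocking between unrelated messages."""
    return tag % num_streams
```

Any deterministic function of the tag works; the modulus is just the simplest one that keeps the mapping stable across both endpoints.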

Page 46:

Implementation Issues

Maintaining state information
Maintain state appropriately for each request function to work with the one-to-many style.

Message demultiplexing
Extend RPI initialization to map associations to ranks.
Demultiplex each incoming message to direct it to the proper receive function.

Concurrency and SCTP streams
Consistently map MPI tag-rank-context to SCTP streams, maintaining proper MPI semantics.

Resource management
Make the RPI more message-driven.
Eliminate the use of the select() system call, making the implementation more scalable and removing the need to maintain a large number of socket descriptors.

Page 47:

Implementation Issues

Eliminating race conditions
Find solutions for race conditions due to the added concurrency.
Use a barrier after the association setup phase.

Reliability
Modify the out-of-band daemons and the request progression interface (RPI) to use a common transport protocol, allowing all components of LAM to multihome successfully.

Support for large messages
Devise a long-message protocol to handle messages larger than the socket send buffer.

Experiments with different SCTP stacks

Page 48:

Features of Design

Scalability

Head-of-Line Blocking

Page 49:

Scalability

Diagram: four MPI processes fully connected over TCP; each process needs N - 1 sockets (one per peer).

Page 50:

Scalability

Diagram: the same four MPI processes over SCTP; each process needs only 1 one-to-many socket.

Page 51:

Head-of-Line Blocking

Process X sends Msg_A with Tag_A and Msg_B with Tag_B; Process Y posts two MPI_Irecvs. Over SCTP, Msg_B is delivered even while Msg_A is delayed; over TCP, Msg_B is blocked behind Msg_A.
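The difference can be simulated: with one TCP-like byte stream, anything behind a missing message is held back, while with per-tag SCTP-like streams only the missing message's own stream is blocked. A toy model under those assumptions, not either protocol's real API:

```python
def deliverable(sent, lost, per_stream):
    """Return the messages the receiver can consume right now.
    `sent` lists (stream, msg) pairs in send order; `lost` holds
    messages still missing in transit.  per_stream=False models a
    single ordered byte stream (TCP-like): everything behind a
    missing message is blocked.  per_stream=True models independent
    streams (SCTP-like): only the missing message's stream blocks."""
    out, blocked_streams, blocked_all = [], set(), False
    for stream, msg in sent:
        if msg in lost:
            blocked_all = True
            blocked_streams.add(stream)
            continue
        if per_stream:
            if stream not in blocked_streams:
                out.append(msg)
        elif not blocked_all:
            out.append(msg)
    return out

sends = [("A", "Msg_A"), ("B", "Msg_B")]
# Msg_A is lost in transit; Msg_B has already arrived.
tcp_view = deliverable(sends, lost={"Msg_A"}, per_stream=False)
sctp_view = deliverable(sends, lost={"Msg_A"}, per_stream=True)
```

Here `tcp_view` is empty while `sctp_view` already contains Msg_B, which is the head-of-line blocking difference the slide diagrams.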

Page 52:

P0:
MPI_Irecv(P1, MPI_ANY_TAG)
MPI_Irecv(P1, MPI_ANY_TAG)
MPI_Waitany()
Compute()
MPI_Waitall()

P1:
- - -
MPI_Send(Msg-A, P0, tag-A)
MPI_Send(Msg-B, P0, tag-B)
- - -


Page 58:

P0:
MPI_Irecv(P1, MPI_ANY_TAG)
MPI_Irecv(P1, MPI_ANY_TAG)
MPI_Waitany()
Compute()
MPI_Waitall()

P1:
- - -
MPI_Send(Msg-A, P0, tag-A)
MPI_Send(Msg-B, P0, tag-B)
- - -

Execution time on P0 under TCP: Msg-B arrives first but sits in the socket buffer until Msg-A arrives, so MPI_Waitany, Compute and MPI_Waitall all complete late.

Execution time on P0 under SCTP: Msg-B is delivered as soon as it arrives, so MPI_Waitany completes immediately and Compute overlaps with the wait for Msg-A, reaching MPI_Waitall earlier.

Page 59:

Limitations

Comprehensive CRC32c checksum: offload to the NIC is not yet commonly available

SCTP bundles messages together, so it might not always be able to pack a full MTU

The SCTP stack is in its early stages and will improve over time

Performance is stack-dependent (Linux lksctp stack << FreeBSD KAME stack)

Page 60:

Experiments

Controlled environment: eight nodes, Dummynet

Used standard benchmarks as well as real-world programs

Fair comparison: matched buffer sizes, Nagle disabled, SACK on, no multihoming, CRC32c off

Page 61:

Experiments: Benchmarks

Chart: MPBench ping-pong test under no loss. Throughput of LAM_SCTP normalized to the LAM_TCP values, for message sizes from 1 to 131069 bytes.

Page 62:

NAS Benchmarks

The NAS benchmarks approximate real-world parallel scientific applications.

We experimented with a suite of 7 benchmarks and 4 data set sizes.

SCTP performance was comparable to TCP for large data sets.

Page 63:

Latency Tolerant Programs

Bulk Farm Processor program
Real-world application
Non-blocking communication
Overlap computation with communication
Use of multiple tags

Page 64:

Farm Program - Short Messages

LAM_SCTP versus LAM_TCP for the Farm Program. Message size: short, fanout: 10. Total run time in seconds by loss rate:

Loss rate  LAM_SCTP  LAM_TCP
0%         8.7       6.2
1%         11.7      88.1
2%         16.0      154.7

Page 65:

Head-of-line blocking - Short messages

LAM_SCTP with 10 streams versus 1 stream for the Farm Program. Message size: short, fanout: 10. Total run time in seconds by loss rate:

Loss rate  10 Streams  1 Stream
0%         8.7         9.3
1%         11.7        11.0
2%         16.0        21.6

Page 66:

Conclusions

SCTP is better suited for MPI:
Avoids unnecessary head-of-line blocking through its use of streams
Increased fault tolerance in the presence of multihomed hosts
Built-in security features
Robust under loss

SCTP might be key to moving MPI programs from LANs to WANs.

Page 67:

Future Work

Release LAM SCTP RPI module at SC|05

Incorporate our work into Open MPI and/or MPICH2

Modify real applications to use tags as streams

Page 68:

More information about our work is at:

http://www.cs.ubc.ca/labs/dsg/mpi-sctp/

Thank you!

Page 69:

Extra Slides

Page 70:

Partially Ordered User Messages Sent on Different Streams

Diagram of the SCTP send path: user messages are tagged with a message stream number (SNo), fragmented into data chunks, placed on the data chunk queue, bundled together with chunks from the control chunk queue, and handed to the IP layer as SCTP packets.
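The two stages of that send path, fragmenting a user message into chunks and then bundling chunks into packets, can be sketched as follows. The chunk size and MTU values are made up for illustration, and real SCTP also adds per-chunk headers that this toy model ignores:

```python
def fragment(message, chunk_size):
    """Split one user message into data chunks of at most chunk_size bytes."""
    return [message[i:i + chunk_size]
            for i in range(0, len(message), chunk_size)]

def bundle(chunks, mtu):
    """Greedily pack chunks into packets whose payload stays within the MTU."""
    packets, current, used = [], [], 0
    for chunk in chunks:
        if used + len(chunk) > mtu and current:
            packets.append(current)        # packet full: start a new one
            current, used = [], 0
        current.append(chunk)
        used += len(chunk)
    if current:
        packets.append(current)
    return packets

chunks = fragment(b"x" * 1000, chunk_size=400)   # chunks of 400, 400, 200 bytes
packets = bundle(chunks, mtu=900)                # two packets: 800 and 200 bytes
```

Bundling is also why SCTP may not always fill a full MTU, as noted under Limitations: the greedy packing can leave slack when chunk sizes do not divide the MTU evenly.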

Page 71:

Added Security

P0 P1

INIT

INIT-ACK

COOKIE-ECHO

COOKIE-ACK

User data can be piggy-backed on third and fourth leg

SCTP’s Use of Signed Cookie

Page 72:

Added Security

32-bit verification tag guards against reset attacks
Autoclose feature
No half-closed state

Page 73:

Farm Program - Long Messages

LAM_SCTP versus LAM_TCP for the Farm Program. Message size: long, fanout: 10. Total run time in seconds by loss rate:

Loss rate  LAM_SCTP  LAM_TCP
0%         79        129
1%         786       3103
2%         1585      6414

Page 74:

Head-of-line blocking - Long messages

LAM_SCTP with 10 streams versus 1 stream for the Farm Program. Message size: long, fanout: 10. Total run time in seconds by loss rate:

Loss rate  10 Streams  1 Stream
0%         79          79
1%         786         1000
2%         1585        1942

Page 75:

Experiments: Benchmarks

SCTP outperformed TCP under loss for the ping-pong test.

Page 76:

Experiments: Benchmarks

Chart: throughput (bytes/second) of the ping-pong test with 30K messages at two loss rates, SCTP versus TCP.

Page 77:

Experiments: Benchmarks

Chart: throughput (bytes/second) of the ping-pong test with 300K messages at two loss rates, SCTP versus TCP.