Slide 1: Flow control: a prerequisite for Ethernet scaling

• GROMACS and network congestion
• a network stress test with MPI_Alltoall
• IEEE 802.3x = ???
• high-level flow control / ordered communication

Carsten Kutzner, David van der Spoel, Erik Lindahl, Martin Fechner, Bert L. de Groot, Helmut Grubmüller
Slide 2: GROMACS 3.3 GigE speedup, Aquaporin-1 benchmark (80k atoms)

[Figure: speedup vs. number of CPUs (1–32) for GROMACS 3.3; two panels: 1 CPU/node and 2 CPUs/node; clusters: Orca, Coral, Slomo, ...]
Slide 3: GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Figure: bar chart of speedup (0–12): Infiniband on 2.2 GHz Opterons (RZG)* vs. Myrinet and Ethernet on 3.06 GHz Xeons (Orca)]

* benchmark done by Renate Dohmen, RZG
Slide 4: Where is the problem?

[Diagram: two MD time steps, each split into short-range Coulomb (SR Coul) and PME parts, connected by an MPI_Alltoall]

• delays of 0.25 s in communication-intensive program parts (MPI_Alltoall, MPI_Allreduce, MPI_Bcast, consecutive MPI_Sendrecvs)
Slide 5: Network congestion

[Diagram: six nodes (0–5) attached to a full-duplex (FDX) switch; several senders target the same receiver]

• If more packets arrive than a receiver can process, packets are dropped (and later retransmitted)
Slide 7: Common MPI_Alltoall implementations: the order of message transfers is effectively random!

LAM:

    /* Initiate all send/recv to/from others. */
    for (i loops over processes) {
        MPI_Recv_init(from i ...);
        MPI_Send_init(to i ...);
    }
    /* Start all the requests. */
    MPI_Startall(...);
    /* Wait for them all. */
    MPI_Waitall(...);

MPICH:

    /* do the communication -- post *all* sends and receives: */
    for (i = 0; i < size; i++) {
        /* Performance fix sent in by Duncan Grove <[email protected]>.
           Instead of posting irecvs and isends from rank=0 to size-1,
           scatter the destinations so that messages don't all go to
           rank 0 first. Thanks Duncan! */
        dest = (rank + i) % size;
        MPI_Irecv(from dest);
    }
    for (i = 0; i < size; i++) {
        dest = (rank + i) % size;
        MPI_Isend(to dest);
    }
    /* ... then wait for *all* of them to finish: */
    MPI_Waitall(...);
Slide 8: A network stress test with MPI_Alltoall

[Diagram: MPI_Alltoall as a parallel matrix transpose; before: p0 holds A0 A1 A2 A3, p1 holds B0 B1 B2 B3, p2 holds C0 C1 C2 C3, p3 holds D0 D1 D2 D3; after: p0 holds A0 B0 C0 D0, p1 holds A1 B1 C1 D1, p2 holds A2 B2 C2 D2, p3 holds A3 B3 C3 D3]

N = 4 ... 64, S = 4 bytes ... 2 MB

• MPI_Alltoall, a parallel matrix transpose
• each of N procs sends N-1 messages of size S
• N*(N-1) messages transferred in time Δt
• transmission rate T per proc: T = (N-1)·S / Δt
Slide 9: LAM MPI_Alltoall

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes], log-log, 10^1 to 10^6 bytes; curves for 4, 8, 16, and 32 CPUs; 1 CPU/node; T averaged over 25 calls (bars: min/max)]
Slide 10: LAM MPI_Alltoall

[Figure: as on slide 9 (1 CPU/node, T averaged over 25 calls, bars: min/max), annotated with the message sizes below]

Data transferred within the FFTW all-to-all:

Aquaporin-1, PME grid 90x88x80 = 633600 floats:

    CPUs    bytes to each CPU
    2       649440
    4       173184
    6       73800
    8       43296
    16      11808
    32      2952

Guanylin, PME grid 30x30x21 = 18900 floats:

    CPUs    bytes to each CPU
    2       19800
    4       5632
    6       2200
    8       1408
Slide 12: LAM MPI_Alltoall

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes]; left panel: 1 CPU/node (4, 8, 16, 32 CPUs); right panel: 2 CPUs/node (4, 8, 16, 32, 64 CPUs)]
Slide 14: LAM vs. MPICH (2 CPUs/node)

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32, and 64 CPUs; left: LAM, peaking at 49 MB/s; right: MPICH, peaking at 36 MB/s; theoretical limit Tmax = 58 MB/s]
Slide 15: IEEE 802.3x flow control

• IEEE 802.3x¹ defines a low-level flow control mechanism to prevent network congestion
• flow of ... messages/data/TCP packets
• receivers may send PAUSE frames to tell the source to stop sending
• reduces packet loss!

[Diagram: six nodes (0–5) on a full-duplex (FDX) switch; the oversubscribed receiver sends PAUSE! frames back towards the senders]

¹ IEEE, 2000. Information technology – Telecommunications and information exchange between systems – Local and metropolitan area networks – Specific requirements – Part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications. IEEE 802.3, New York, Annex 31B: MAC control PAUSE operation.
Slide 19: MPI_Alltoall performance (2 CPUs/node)

[Figure: four panels of transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32, and 64 CPUs; columns: LAM and MPICH; rows: no flow control vs. with IEEE 802.3x flow control]
Slide 21: GROMACS 3.3 GigE speedup

[Figure: speedup vs. number of CPUs (1–32) for GROMACS 3.3 with and without flow control (fc); left: 1 CPU/node, Sp16 = 9.1; right: 2 CPUs/node, Sp16 = 7.4]
Slide 22: High-level flow control

• low-level flow control does not guarantee good performance for 16+ CPUs
• implement our own all-to-all
• idea: arrange communication in separated phases, each free of congestion
• experimental MPI prototype: CC-MPI²

² Karwande et al., 2005. An MPI prototype for compiled communication on Ethernet switched clusters. J Parallel Distrib Comput 65.

[Diagram: the all-to-all matrix transpose from slide 8, with N = 4 ... 64 and S = 4 bytes ... 2 MB]
Slide 23: High-level flow control: ordered all-to-all for single-CPU nodes

[Diagram: the all-to-all among five CPUs (0–4) decomposed into four communication phases A–D; in each phase every CPU exchanges data with a different partner, so no receiver is oversubscribed]

    for (i = 0; i < ncpu; i++) /* loop over all CPUs */
    {
        /* send to destination CPU while receiving from source CPU: */
        dest   = (cpuid + i) % ncpu;
        source = (ncpu + cpuid - i) % ncpu;
        MPI_Sendrecv(send to dest, recv from source);
        /* separate the communication phases: */
        if (i < ncpu - 1)
            MPI_Barrier(comm);
    }
Slide 24: High-level flow control: ordered all-to-all for multi-CPU nodes

    /* loop over all nodes */
    for (i = 0; i < nnodes; i++)
    {
        /* send to destination node while receiving from source node: */
        destnode   = (nodeid + i) % nnodes;
        sourcenode = (nnodes + nodeid - i) % nnodes;
        /* loop over CPUs on a node: */
        for (j = 0; j < cpus_per_node; j++)
        {
            /* source and destination CPU: */
            destcpu   = destnode * cpus_per_node + j;
            sourcecpu = sourcenode * cpus_per_node + j;
            MPI_Irecv(from sourcecpu ...);
            MPI_Isend(to destcpu ...);
        }
        /* wait for communication to finish: */
        MPI_Waitall(...);
        MPI_Waitall(...);
        /* separate the communication phases: */
        if (i < nnodes - 1)
            MPI_Barrier(...);
    }
Slide 26: High- vs. low-level flow control – LAM

[Figure: four panels of transmission rate per CPU [MB/s] vs. message size [bytes]; top row, with IEEE 802.3x flow control: LAM 1 CPU/node (4–32 CPUs), peak 68.2 MB/s, and LAM 2 CPUs/node (4–64 CPUs), peak 49.0 MB/s; bottom row, with high-level flow control: LAM Sendrecv all-to-all, 1 CPU/node, peak 69.6 MB/s, and LAM Isend/Irecv all-to-all, 2 CPUs/node, peak 45.3 MB/s; theoretical limits: 117 MB/s (1 CPU/node) and 58 MB/s (2 CPUs/node)]
Slide 28: High- vs. low-level flow control – MPICH

[Figure: four panels of transmission rate per CPU [MB/s] vs. message size [bytes]; top row, with IEEE 802.3x flow control: MPICH 1 CPU/node (4–32 CPUs), peak 43.6 MB/s, and MPICH 2 CPUs/node (4–64 CPUs), peak 36.3 MB/s; bottom row, with high-level flow control: MPICH Sendrecv all-to-all, 1 CPU/node, peak 30.9 MB/s, and MPICH Isend/Irecv all-to-all, 2 CPUs/node, peak 34.9 MB/s; theoretical limits: 117 MB/s and 58 MB/s]
Slide 29: GROMACS 3.3 GigE speedup: MPI_Alltoall vs. ordered all-to-alls

[Figure: speedup vs. number of CPUs (1–32) for GROMACS 3.3 (gmx 3.3), with flow control (fc), and with the modified all-to-all (mod); left: 1 CPU/node, Sendrecv scheme, high-level flow control only; right: 2 CPUs/node, Isend/Irecv scheme, high-level flow control only]
Slides 30-32: GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Figure: bar chart of speedup (0–12): Infiniband on 2.2 GHz Opterons (RZG) vs. Myrinet and Ethernet on 3.06 GHz Xeons (Orca); successive builds add the Ethernet result with flow control and the 1-CPU reference]
Slide 33: Conclusions

• when running GROMACS on >2 Ethernet-connected nodes, flow control is needed to avoid network congestion
• implemented a congestion-free all-to-all for multi-CPU nodes
• the highest network throughput is probably reached by combining high- and low-level flow control
• the findings can be applied to other parallel applications that need all-to-all communication
Slide 34: Outlook

• MPICH1-1.2.6 → MPICH2-1.0.2
• LAM-7.1.1 → OpenMPI-1.0rc1
• allocation technique for PME/PP splitting on multi-CPU nodes

[Figure: speedup vs. number of CPUs (1–32) for GROMACS 3.3 with flow control (fc); left: 1 CPU/node, Sp16 = 9.1; right: 2 CPUs/node, Sp16 = 7.4; PP and PME process groups indicated]