
Page 1: Flow control: a prerequisite for Ethernet scaling

• GROMACS and network congestion

• a network stress-test with MPI_Alltoall

• IEEE 802.3x = ???

• high-level flow control / ordered communication

Carsten Kutzner, David van der Spoel, Erik Lindahl, Martin Fechner, Bert L. de Groot, Helmut Grubmüller

Page 2: GROMACS 3.3 GigE speedup, Aquaporin-1 benchmark (80k atoms)

[Figure: speedup vs. number of CPUs (1-32) for gmx 3.3; left panel: 1 CPU/node, right panel: 2 CPUs/node; clusters: Orca, Coral, Slomo, ...]

Page 3: GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Figure: speedup (0-12) for Infiniband (2.2 GHz Opterons, RZG)*, Myrinet, and Ethernet (3.06 GHz Xeons, Orca); * benchmark done by Renate Dohmen, RZG]

Page 4: Where is the problem?

[Diagram: timeline of two time steps, each with short-range Coulomb (SR Coul) and PME parts; the MPI_Alltoall is called within PME]

• delays of 0.25 s in communication-intensive program parts (MPI_Alltoall, MPI_Allreduce, MPI_Bcast, consecutive MPI_Sendrecvs)

Page 5: Network congestion

[Diagram: six nodes (0-5) attached to a full-duplex (FDX) switch]

• If more packets arrive than a receiver can process, packets are dropped (and later retransmitted)

Page 6: Network congestion

• If more packets arrive than a receiver can process, packets are dropped (and later retransmitted)

[Diagram: six nodes (0-5) on a full-duplex (FDX) switch; packets converge on one receiver and are dropped]

Page 7: Common MPI_Alltoall implementations: the order of message transfers is random!

LAM:

    /* Initiate all send/recv to/from others. */
    for (i loops over processes) {
        MPI_Recv_init(from i ...);
        MPI_Send_init(to i ...);
    }
    /* Start all the requests. */
    MPI_Startall(...);
    /* Wait for them all. */
    MPI_Waitall(...);

MPICH:

    /* do the communication -- post *all* sends and receives: */
    for (i=0; i<size; i++) {
        /* Performance fix sent in by Duncan Grove <[email protected]>.
           Instead of posting irecvs and isends from rank=0 to size-1,
           scatter the destinations so that messages don't all go to
           rank 0 first. Thanks Duncan! */
        dest = (rank+i) % size;
        MPI_Irecv(from dest);
    }
    for (i=0; i<size; i++) {
        dest = (rank+i) % size;
        MPI_Isend(to dest);
    }
    /* ... then wait for *all* of them to finish: */
    MPI_Waitall(...);

[Diagram: six nodes (0-5) on a full-duplex (FDX) switch]

Page 8: A network stress-test with MPI_Alltoall

before:  p0: A0 A1 A2 A3    after:  p0: A0 B0 C0 D0
         p1: B0 B1 B2 B3            p1: A1 B1 C1 D1
         p2: C0 C1 C2 C3            p2: A2 B2 C2 D2
         p3: D0 D1 D2 D3            p3: A3 B3 C3 D3

N = 4 ... 64, S = 4 bytes ... 2 MB

• MPI_Alltoall, a parallel matrix transpose

• each of N procs sends N-1 messages of size S

• N*(N-1) messages transferred in time Δt

• transmission rate T per proc: T = (N-1)·S / Δt (see the benchmark sketch below)
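
A minimal sketch of such a stress test is given below (an illustrative assumption of how the measurement could be coded, not the original benchmark): every process participates in an MPI_Alltoall with S bytes per destination, the call is repeated 25 times, and the per-process rate T = (N-1)·S/Δt is printed.

    /* alltoall_bench.c -- minimal MPI_Alltoall stress test (illustrative sketch).
     * Build:  mpicc alltoall_bench.c -o alltoall_bench
     * Run:    mpirun -np <N> ./alltoall_bench <message size S in bytes>       */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nrep = 25;                        /* average over 25 calls   */
        int       S    = (argc > 1) ? atoi(argv[1]) : 4096;  /* bytes per dest */
        int       rank, nprocs, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char *sendbuf = calloc((size_t)S * nprocs, 1);
        char *recvbuf = calloc((size_t)S * nprocs, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (i = 0; i < nrep; i++)
            MPI_Alltoall(sendbuf, S, MPI_BYTE, recvbuf, S, MPI_BYTE, MPI_COMM_WORLD);
        double dt = (MPI_Wtime() - t0) / nrep;      /* time per all-to-all     */

        /* per-process transmission rate: N-1 off-process messages of S bytes  */
        if (rank == 0)
            printf("N = %d  S = %d bytes  T = %.2f MB/s per process\n",
                   nprocs, S, (nprocs - 1) * (double)S / dt / 1e6);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

Running this with mpirun -np N over a range of S values yields curves of the kind shown on the following pages.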

Page 9: LAM MPI_Alltoall

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32 CPUs, 1 CPU/node; T averaged over 25 calls (bars: min/max)]

Page 10: LAM MPI_Alltoall

1 CPU/node, T averaged over 25 calls (bars: min/max)

data transferred within the FFTW all-to-all:

Aquaporin-1, PME grid 90x88x80 = 633600 floats:
  CPUs    bytes to each CPU
    2     649440
    4     173184
    6      73800
    8      43296
   16      11808
   32       2952

Guanylin, PME grid 30x30x21 = 18900 floats:
  CPUs    bytes to each CPU
    2      19800
    4       5632
    6       2200
    8       1408

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32 CPUs]

Page 11: LAM MPI_Alltoall

1 CPU/node, T averaged over 25 calls (bars: min/max)

[Figure: same plot, with the per-CPU FFTW all-to-all message sizes marked: 173184, 73800, and 43296 bytes (Aquaporin-1) and 2200 and 1408 bytes (Guanylin); data tables as on the previous page]

Page 12: LAM MPI_Alltoall

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes]; left: 1 CPU/node (4, 8, 16, 32 CPUs), right: 2 CPUs/node (4, 8, 16, 32, 64 CPUs)]

Page 13: LAM vs. MPICH (2 CPUs/node)

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32, 64 CPUs; left: LAM, right: MPICH]

Page 14: LAM vs. MPICH (2 CPUs/node)

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32, 64 CPUs; peak rates: LAM 49 MB/s, MPICH 36 MB/s]

theoretical limit Tmax = 58 MB/s for two processes per node (roughly half of the ~117 MB/s that a single process can push through a Gigabit Ethernet link)

Page 15: IEEE 802.3x flow control

• IEEE 802.3x¹ defines a low-level flow control mechanism to prevent network congestion

• flow of ... messages/data/TCP packets

• receivers may send PAUSE frames to tell the source to stop sending

• reduce packet loss!

[Diagram: six nodes (0-5) on a full-duplex (FDX) switch]

¹ IEEE, 2000. Information technology – Telecommunications and information exchange between systems – Local and metropolitan area networks – Specific requirements – Part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications. IEEE 802.3, New York, Annex 31B: MAC control PAUSE operation.

Page 16: IEEE 802.3x flow control (cont.)

[Diagram: six nodes (0-5) on a full-duplex (FDX) switch; packets converge on one receiver]


Page 17: IEEE 802.3x flow control (cont.)

[Diagram: as before; the overloaded receiver sends a PAUSE frame]


Page 18: IEEE 802.3x flow control (cont.)

[Diagram: as before; PAUSE frames are now sent on two links]


Page 19: MPI_Alltoall performance (2 CPUs/node)

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32, 64 CPUs; columns: LAM, MPICH; rows: no flow control, with IEEE 802.3x flow control]

Page 20: GROMACS 3.3 GigE speedup

[Figure: speedup vs. number of CPUs (1-32) for gmx 3.3 without and with flow control (fc); left panel: 1 CPU/node, right panel: 2 CPUs/node]

Page 21: GROMACS 3.3 GigE speedup

[Figure: as on the previous page; with flow control the speedup on 16 CPUs reaches Sp16 = 9.1 (1 CPU/node) and Sp16 = 7.4 (2 CPUs/node)]

Page 22: High-level flow control

• low-level flow control does not guarantee good performance for 16+ CPUs

• implement own all-to-all

• idea: arrange communication in separated phases, each free of congestion

• experimental MPI prototype: CC-MPI²

² Karwande et al., 2005. An MPI prototype for compiled communication on Ethernet switched clusters. J Parallel Distrib Comput 65.

[All-to-all illustration as on Page 8; N = 4 ... 64, S = 4 bytes ... 2 MB]

Page 23: High-level flow control: ordered all-to-all for single-CPU nodes

[All-to-all illustration as on Page 8 (N = 4 ... 64, S = 4 bytes ... 2 MB); diagram: the exchange is organized in phases A, B, C, D; in each phase every node (0-4) sends to one node and receives from another, so no link is oversubscribed]

    for (i=0; i<ncpu; i++)      /* loop over all CPUs */
    {
        /* send to destination CPU while receiving from source CPU: */
        dest   = (cpuid+i) % ncpu;
        source = (ncpu+cpuid-i) % ncpu;
        MPI_Sendrecv(send to dest, recv from source);
        /* separate the communication phases: */
        if (i < ncpu-1)
            MPI_Barrier(comm);
    }
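
For reference, a compilable version of this Sendrecv scheme could look like the sketch below (assuming one MPI process per node and contiguous, equal-sized blocks; the function name and its arguments are illustrative, not the original GROMACS or CC-MPI code):

    /* Illustrative sketch of the ordered Sendrecv all-to-all (not the original
     * code). sendbuf holds ncpu blocks of blocksize bytes; block k is destined
     * for rank k. In phase i, rank r sends to (r+i) % ncpu and receives from
     * (ncpu+r-i) % ncpu; a barrier separates the phases.                       */
    #include <mpi.h>

    void ordered_alltoall_sendrecv(char *sendbuf, char *recvbuf,
                                   int blocksize, MPI_Comm comm)
    {
        int i, cpuid, ncpu;

        MPI_Comm_rank(comm, &cpuid);
        MPI_Comm_size(comm, &ncpu);

        for (i = 0; i < ncpu; i++)                /* loop over all CPUs         */
        {
            /* send to destination CPU while receiving from source CPU:         */
            int dest   = (cpuid + i) % ncpu;
            int source = (ncpu + cpuid - i) % ncpu;

            MPI_Sendrecv(sendbuf + (size_t)dest * blocksize,   blocksize, MPI_BYTE,
                         dest,   0,
                         recvbuf + (size_t)source * blocksize, blocksize, MPI_BYTE,
                         source, 0, comm, MPI_STATUS_IGNORE);

            /* separate the communication phases:                               */
            if (i < ncpu - 1)
                MPI_Barrier(comm);
        }
    }

The barrier after each phase keeps the phases strictly separated, so in every phase each node sends to exactly one destination and receives from exactly one source, which is what keeps the switch free of congestion.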

Page 24: High-level flow control: ordered all-to-all for multi-CPU nodes

    /* loop over all nodes */
    for (i=0; i<nnodes; i++)
    {
        /* send to destination node while receiving from source node: */
        destnode   = (nodeid+i) % nnodes;
        sourcenode = (nnodes+nodeid-i) % nnodes;
        /* loop over CPUs on a node: */
        for (j=0; j<cpus_per_node; j++)
        {
            /* source and destination CPU: */
            destcpu   = destnode*cpus_per_node + j;
            sourcecpu = sourcenode*cpus_per_node + j;
            MPI_Irecv(from sourcecpu ...);
            MPI_Isend(to destcpu ...);
        }
        /* wait for communication to finish: */
        MPI_Waitall(...);
        MPI_Waitall(...);
        /* separate the communication phases: */
        if (i<nnodes-1)
            MPI_Barrier(...);
    }
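
Again only as a sketch (assuming ranks are numbered consecutively within nodes, every node runs cpus_per_node processes, and all blocks have equal size; names are illustrative, not the original implementation), a compilable rendering of this Isend/Irecv scheme:

    /* Illustrative sketch of the ordered all-to-all for multi-CPU nodes (not the
     * original code). In phase i every node exchanges data with one destination
     * node and one source node; within the phase each process posts one Irecv
     * and one Isend per process of that node pair.                             */
    #include <mpi.h>
    #include <stdlib.h>

    void ordered_alltoall_nodes(char *sendbuf, char *recvbuf,
                                int blocksize, int cpus_per_node, MPI_Comm comm)
    {
        int i, j, cpuid, ncpu;

        MPI_Comm_rank(comm, &cpuid);
        MPI_Comm_size(comm, &ncpu);

        int nnodes = ncpu / cpus_per_node;
        int nodeid = cpuid / cpus_per_node;

        MPI_Request *recvreq = malloc(cpus_per_node * sizeof(MPI_Request));
        MPI_Request *sendreq = malloc(cpus_per_node * sizeof(MPI_Request));

        for (i = 0; i < nnodes; i++)              /* loop over all nodes        */
        {
            /* send to destination node while receiving from source node:       */
            int destnode   = (nodeid + i) % nnodes;
            int sourcenode = (nnodes + nodeid - i) % nnodes;

            for (j = 0; j < cpus_per_node; j++)   /* loop over CPUs on a node   */
            {
                /* source and destination CPU:                                  */
                int destcpu   = destnode   * cpus_per_node + j;
                int sourcecpu = sourcenode * cpus_per_node + j;

                MPI_Irecv(recvbuf + (size_t)sourcecpu * blocksize, blocksize,
                          MPI_BYTE, sourcecpu, 0, comm, &recvreq[j]);
                MPI_Isend(sendbuf + (size_t)destcpu * blocksize, blocksize,
                          MPI_BYTE, destcpu, 0, comm, &sendreq[j]);
            }
            /* wait for this phase's communication to finish:                   */
            MPI_Waitall(cpus_per_node, recvreq, MPI_STATUSES_IGNORE);
            MPI_Waitall(cpus_per_node, sendreq, MPI_STATUSES_IGNORE);

            /* separate the communication phases:                               */
            if (i < nnodes - 1)
                MPI_Barrier(comm);
        }
        free(recvreq);
        free(sendreq);
    }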

Page 25: High- vs. low-level flow control – LAM

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes]; top row, with IEEE 802.3x flow control: LAM, 1 CPU/node (4, 8, 16, 32 CPUs) and LAM, 2 CPUs/node (4, 8, 16, 32, 64 CPUs); bottom row, with high-level flow control: LAM, Sendrecv all-to-all, 1 CPU/node and LAM, Isend/Irecv all-to-all, 2 CPUs/node]

Page 26: High- vs. low-level flow control – LAM

[Figure: same four panels as on the previous page; peak rates with IEEE 802.3x flow control: 68.2 MB/s (1 CPU/node) and 49.0 MB/s (2 CPUs/node); with the high-level ordered all-to-alls: 69.6 MB/s and 45.3 MB/s; theoretical limits: 117 MB/s (1 CPU/node) and 58 MB/s (2 CPUs/node)]

Page 27: High- vs. low-level flow control – MPICH

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes]; top row, with IEEE 802.3x flow control: MPICH, 1 CPU/node (4, 8, 16, 32 CPUs) and MPICH, 2 CPUs/node (4, 8, 16, 32, 64 CPUs); bottom row, with high-level flow control: MPICH, Sendrecv all-to-all, 1 CPU/node and MPICH, Isend/Irecv all-to-all, 2 CPUs/node]

Page 28: High- vs. low-level flow control – MPICH

[Figure: same four panels as on the previous page; peak rates with IEEE 802.3x flow control: 43.6 MB/s (1 CPU/node) and 36.3 MB/s (2 CPUs/node); with the high-level ordered all-to-alls: 30.9 MB/s and 34.9 MB/s; theoretical limits: 117 MB/s (1 CPU/node) and 58 MB/s (2 CPUs/node)]

Page 29: GROMACS 3.3 GigE speedup

MPI_Alltoall vs. ordered all-to-alls

[Figure: speedup vs. number of CPUs (1-32) for gmx 3.3 (MPI_Alltoall; fc = with flow control) and for the modified version (mod) with the ordered all-to-alls, high-level flow control only; left panel: 1 CPU/node, Sendrecv scheme; right panel: 2 CPUs/node, Isend/Irecv scheme]

Page 30: GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Figure: speedup (0-12) for Infiniband (2.2 GHz Opterons, RZG), Myrinet, and Ethernet (3.06 GHz Xeons, Orca)]

Page 31: GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Figure: as on the previous page, with an additional bar labeled 'flow control']

Page 32: GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Figure: as on the previous page, with additional bars labeled 'flow control' and '1 CPU']

Page 33: Conclusions

• when running GROMACS on more than 2 Ethernet-connected nodes, flow control is needed to avoid network congestion

• we implemented a congestion-free all-to-all for multi-CPU nodes

• the highest network throughput is probably reached by a combination of high- and low-level flow control

• the findings can be applied to other parallel applications that need all-to-all communication

Page 34: Outlook

• MPICH1-1.2.6 → MPICH2-1.0.2
• LAM-7.1.1 → OpenMPI-1.0rc1

• allocation technique for PME/PP splitting on multi-CPU nodes

[Figure: GigE speedup plots for gmx 3.3 with flow control (fc; Sp16 = 9.1 at 1 CPU/node, Sp16 = 7.4 at 2 CPUs/node) and a sketch of splitting the CPUs into PP and PME groups]