Slide 1: Flow control: a prerequisite for Ethernet scaling

• GROMACS and network congestion
• a network stress test with MPI_Alltoall
• IEEE 802.3x = ???
• high-level flow control / ordered communication

Carsten Kutzner, David van der Spoel, Erik Lindahl, Martin Fechner, Bert L. de Groot, Helmut Grubmüller
Slide 2: GROMACS 3.3 GigE speedup, Aquaporin-1 benchmark (80k atoms)

[Figure: speedup vs. number of CPUs (1–32) for GROMACS 3.3; two panels: 1 CPU/node and 2 CPUs/node; clusters: Orca, Coral, Slomo, ...]
Slide 3: GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Figure: bar chart of speedup (0–12): Infiniband on 2.2 GHz Opterons (RZG)* vs. Myrinet and Ethernet on 3.06 GHz Xeons (Orca)]

* benchmark done by Renate Dohmen, RZG
Slide 4: Where is the problem?

[Diagram: two MD time steps, each split into short-range Coulomb (SR Coul) and PME parts, connected by an MPI_Alltoall]

• delays of 0.25 s in communication-intensive program parts (MPI_Alltoall, MPI_Allreduce, MPI_Bcast, consecutive MPI_Sendrecvs)
Slide 5: Network congestion

[Diagram: six nodes (0–5) attached to a full-duplex (FDX) switch; several senders target the same receiver]

• If more packets arrive than a receiver can process, packets are dropped (and later retransmitted)
Slide 7: Common MPI_Alltoall implementations: the order of message transfers is effectively random!

LAM:

    /* Initiate all send/recv to/from others. */
    for (i loops over processes) {
        MPI_Recv_init(from i ...);
        MPI_Send_init(to i ...);
    }
    /* Start all the requests. */
    MPI_Startall(...);
    /* Wait for them all. */
    MPI_Waitall(...);

MPICH:

    /* do the communication -- post *all* sends and receives: */
    for (i = 0; i < size; i++) {
        /* Performance fix sent in by Duncan Grove <[email protected]>.
           Instead of posting irecvs and isends from rank=0 to size-1,
           scatter the destinations so that messages don't all go to
           rank 0 first. Thanks Duncan! */
        dest = (rank + i) % size;
        MPI_Irecv(from dest);
    }
    for (i = 0; i < size; i++) {
        dest = (rank + i) % size;
        MPI_Isend(to dest);
    }
    /* ... then wait for *all* of them to finish: */
    MPI_Waitall(...);
Slide 8: A network stress test with MPI_Alltoall

[Diagram: MPI_Alltoall as a parallel matrix transpose; before: p0 holds A0 A1 A2 A3, p1 holds B0 B1 B2 B3, p2 holds C0 C1 C2 C3, p3 holds D0 D1 D2 D3; after: p0 holds A0 B0 C0 D0, p1 holds A1 B1 C1 D1, p2 holds A2 B2 C2 D2, p3 holds A3 B3 C3 D3]

N = 4 ... 64, S = 4 bytes ... 2 MB

• MPI_Alltoall, a parallel matrix transpose
• each of N procs sends N-1 messages of size S
• N*(N-1) messages transferred in time Δt
• transmission rate T per proc: T = (N-1)·S / Δt
Slide 9: LAM MPI_Alltoall

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes], log-log, 10^1 to 10^6 bytes; curves for 4, 8, 16, and 32 CPUs; 1 CPU/node; T averaged over 25 calls (bars: min/max)]
Slide 10: LAM MPI_Alltoall

[Figure: as on slide 9 (1 CPU/node, T averaged over 25 calls, bars: min/max), annotated with the message sizes below]

Data transferred within the FFTW all-to-all:

Aquaporin-1, PME grid 90x88x80 = 633600 floats:

    CPUs    bytes to each CPU
    2       649440
    4       173184
    6       73800
    8       43296
    16      11808
    32      2952

Guanylin, PME grid 30x30x21 = 18900 floats:

    CPUs    bytes to each CPU
    2       19800
    4       5632
    6       2200
    8       1408
Slide 12: LAM MPI_Alltoall

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes]; left panel: 1 CPU/node (4, 8, 16, 32 CPUs); right panel: 2 CPUs/node (4, 8, 16, 32, 64 CPUs)]
Slide 14: LAM vs. MPICH (2 CPUs/node)

[Figure: transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32, and 64 CPUs; left: LAM, peaking at 49 MB/s; right: MPICH, peaking at 36 MB/s; theoretical limit Tmax = 58 MB/s]
Slide 15: IEEE 802.3x flow control

• IEEE 802.3x¹ defines a low-level flow control mechanism to prevent network congestion
• flow of ... messages/data/TCP packets
• receivers may send PAUSE frames to tell the source to stop sending
• reduces packet loss!

[Diagram: six nodes (0–5) on a full-duplex (FDX) switch; the oversubscribed receiver sends PAUSE! frames back towards the senders]

¹ IEEE, 2000. Information technology – Telecommunications and information exchange between systems – Local and metropolitan area networks – Specific requirements – Part 3: Carrier sense multiple access with collision detection (CSMA/CD) access method and physical layer specifications. IEEE 802.3, New York, Annex 31B: MAC control PAUSE operation.
Slide 19: MPI_Alltoall performance (2 CPUs/node)

[Figure: four panels of transmission rate per CPU [MB/s] vs. message size [bytes] for 4, 8, 16, 32, and 64 CPUs; columns: LAM and MPICH; rows: no flow control vs. with IEEE 802.3x flow control]
Slide 21: GROMACS 3.3 GigE speedup

[Figure: speedup vs. number of CPUs (1–32) for GROMACS 3.3 with and without flow control (fc); left: 1 CPU/node, Sp16 = 9.1; right: 2 CPUs/node, Sp16 = 7.4]
Slide 22: High-level flow control

• low-level flow control does not guarantee good performance for 16+ CPUs
• implement our own all-to-all
• idea: arrange communication in separated phases, each free of congestion
• experimental MPI prototype: CC-MPI²

² Karwande et al., 2005. An MPI prototype for compiled communication on Ethernet switched clusters. J Parallel Distrib Comput 65.

[Diagram: the all-to-all matrix transpose from slide 8, with N = 4 ... 64 and S = 4 bytes ... 2 MB]
Slide 23: High-level flow control: ordered all-to-all for single-CPU nodes

[Diagram: the all-to-all among five CPUs (0–4) decomposed into four communication phases A–D; in each phase every CPU exchanges data with a different partner, so no receiver is oversubscribed]

    for (i = 0; i < ncpu; i++) /* loop over all CPUs */
    {
        /* send to destination CPU while receiving from source CPU: */
        dest   = (cpuid + i) % ncpu;
        source = (ncpu + cpuid - i) % ncpu;
        MPI_Sendrecv(send to dest, recv from source);
        /* separate the communication phases: */
        if (i < ncpu - 1)
            MPI_Barrier(comm);
    }
Slide 24: High-level flow control: ordered all-to-all for multi-CPU nodes

    /* loop over all nodes */
    for (i = 0; i < nnodes; i++)
    {
        /* send to destination node while receiving from source node: */
        destnode   = (nodeid + i) % nnodes;
        sourcenode = (nnodes + nodeid - i) % nnodes;
        /* loop over CPUs on a node: */
        for (j = 0; j < cpus_per_node; j++)
        {
            /* source and destination CPU: */
            destcpu   = destnode * cpus_per_node + j;
            sourcecpu = sourcenode * cpus_per_node + j;
            MPI_Irecv(from sourcecpu ...);
            MPI_Isend(to destcpu ...);
        }
        /* wait for communication to finish: */
        MPI_Waitall(...);
        MPI_Waitall(...);
        /* separate the communication phases: */
        if (i < nnodes - 1)
            MPI_Barrier(...);
    }
Slide 26: High- vs. low-level flow control – LAM

[Figure: four panels of transmission rate per CPU [MB/s] vs. message size [bytes]; top row, with IEEE 802.3x flow control: LAM 1 CPU/node (4–32 CPUs), peak 68.2 MB/s, and LAM 2 CPUs/node (4–64 CPUs), peak 49.0 MB/s; bottom row, with high-level flow control: LAM Sendrecv all-to-all, 1 CPU/node, peak 69.6 MB/s, and LAM Isend/Irecv all-to-all, 2 CPUs/node, peak 45.3 MB/s; theoretical limits: 117 MB/s (1 CPU/node) and 58 MB/s (2 CPUs/node)]
Slide 28: High- vs. low-level flow control – MPICH

[Figure: four panels of transmission rate per CPU [MB/s] vs. message size [bytes]; top row, with IEEE 802.3x flow control: MPICH 1 CPU/node (4–32 CPUs), peak 43.6 MB/s, and MPICH 2 CPUs/node (4–64 CPUs), peak 36.3 MB/s; bottom row, with high-level flow control: MPICH Sendrecv all-to-all, 1 CPU/node, peak 30.9 MB/s, and MPICH Isend/Irecv all-to-all, 2 CPUs/node, peak 34.9 MB/s; theoretical limits: 117 MB/s and 58 MB/s]
Slide 29: GROMACS 3.3 GigE speedup: MPI_Alltoall vs. ordered all-to-alls

[Figure: speedup vs. number of CPUs (1–32) for GROMACS 3.3 (gmx 3.3), with flow control (fc), and with the modified all-to-all (mod); left: 1 CPU/node, Sendrecv scheme, high-level flow control only; right: 2 CPUs/node, Isend/Irecv scheme, high-level flow control only]
Slides 30-32: GROMACS 3.3 GigE speedup compared to other interconnects (8x2 CPUs)

[Figure: bar chart of speedup (0–12): Infiniband on 2.2 GHz Opterons (RZG) vs. Myrinet and Ethernet on 3.06 GHz Xeons (Orca); successive builds add the Ethernet result with flow control and the 1-CPU reference]
Slide 33: Conclusions

• when running GROMACS on >2 Ethernet-connected nodes, flow control is needed to avoid network congestion
• implemented a congestion-free all-to-all for multi-CPU nodes
• the highest network throughput is probably reached by combining high- and low-level flow control
• the findings can be applied to other parallel applications that need all-to-all communication
Slide 34: Outlook

• MPICH1-1.2.6 → MPICH2-1.0.2
• LAM-7.1.1 → OpenMPI-1.0rc1
• allocation technique for PME/PP splitting on multi-CPU nodes

[Figure: speedup vs. number of CPUs (1–32) for GROMACS 3.3 with flow control (fc); left: 1 CPU/node, Sp16 = 9.1; right: 2 CPUs/node, Sp16 = 7.4; PP and PME process groups indicated]