PFLDNet, Argonne, Feb 2004. R. Hughes-Jones, Manchester
UDP Performance and PCI-X Activity of the Intel 10 Gigabit Ethernet Adapter on:
HP rx2600 Dual Itanium 2, SuperMicro P4DP8-G2 Dual Xeon, Dell PowerEdge 2650 Dual Xeon
Richard Hughes-Jones
Many people helped, including:
• Sverre Jarp and Glen Hisdal, CERN Open Lab
• Sylvain Ravot, Olivier Martin and Elise Guyot, DataTAG project
• Les Cottrell, Connie Logg and Gary Buhrmaster, SLAC
• Stephen Dallison, MB-NG
Outline:
• Introduction
• 10 GigE on Itanium IA64
• 10 GigE on Xeon IA32
• 10 GigE on Dell Xeon IA32
• Tuning the PCI-X bus
• SC2003 Phoenix
Latency & Throughput Measurements
UDP/IP packets are sent between back-to-back systems: similar processing to TCP/IP, but with no flow control or congestion avoidance algorithms. The UDPmon test program was used.
Latency
• Round-trip times measured using Request-Response UDP frames (see the sketch at the end of this slide)
• Latency measured as a function of frame size
• Slope s given by the per-byte times along the path: mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
• Intercept indicates processing times + HW latencies
• Histograms of 'singleton' measurements
UDP Throughput
• Send a controlled stream of UDP frames spaced at regular intervals
• Vary the frame size and the frame transmit spacing, and measure:
• The time of first and last frames received
• The number of packets received, lost, and out of order
• Histogram of the inter-packet spacing of received packets
• Packet loss pattern
• 1-way delay
• CPU load
• Number of interrupts
Latency tells us about:
• Behavior of the IP stack
• The way the HW operates
• Interrupt coalescence
Throughput tells us about:
• Behavior of the IP stack
• The way the HW operates
• Capacity & available throughput of the LAN / MAN / WAN
The slope is the sum over the data paths of the time to transfer each byte:
s = Σ_{data paths} dt/db
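For concreteness, a minimal sketch of the Request-Response latency measurement (not UDPmon itself). The peer is assumed to echo each UDP frame back; the address, port, sizes and repeat count are all illustrative, and a real tool would histogram the individual RTTs rather than just average them:

```c
/* Sketch of the Request-Response latency measurement (not UDPmon
 * itself).  The peer at 192.168.0.2:5001 is assumed to echo each
 * UDP frame back; address, port, sizes and counts are illustrative. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    char frame[16384];
    struct sockaddr_in peer = { 0 };
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5001);                 /* illustrative */
    inet_pton(AF_INET, "192.168.0.2", &peer.sin_addr);

    for (int size = 64; size <= 1400; size += 64) {
        double sum = 0.0;
        for (int i = 0; i < 1000; i++) {           /* 'singleton' RTTs */
            double t0 = now_us();
            sendto(s, frame, size, 0,
                   (struct sockaddr *)&peer, sizeof peer);
            recv(s, frame, sizeof frame, 0);       /* wait for the echo */
            sum += now_us() - t0;                  /* histogram these   */
        }
        /* the slope of mean RTT vs size gives s; the intercept gives
           the fixed processing + hardware latency */
        printf("%4d bytes: mean RTT %.1f us\n", size, sum / 1000.0);
    }
    close(s);
    return 0;
}
```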
The Throughput Measurements
UDP Throughput: send a controlled stream of UDP frames spaced at regular intervals (n bytes per frame, a set number of packets, a chosen wait time between them). The test proceeds as follows (a sketch of the paced sender follows):
• Zero remote statistics (OK / done handshake)
• Send the data frames at regular intervals, recording the time to send; the receiver records the time to receive and histograms the inter-packet times
• Signal the end of test (OK / done handshake)
• Get remote statistics: the receiver sends back the number received, the number lost + the loss pattern, the number out-of-order, CPU load & number of interrupts, and the 1-way delay
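A sketch of the paced-stream sender at the heart of this test (again not UDPmon itself; names and constants are illustrative). Frame spacings below ~10 µs are usually achieved by busy-waiting on a monotonic clock rather than sleeping:

```c
/* Send n_packets frames of n_bytes, one every wait_ns nanoseconds.
 * A sequence number at the front of each frame lets the receiver
 * count loss, spot reordering and histogram inter-packet spacing. */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

void send_stream(int sock, const struct sockaddr_in *peer,
                 int n_bytes, int n_packets, long long wait_ns)
{
    static char frame[16384];
    long long next = now_ns();

    for (int i = 0; i < n_packets; i++) {
        memcpy(frame, &i, sizeof i);       /* sequence number */
        sendto(sock, frame, n_bytes, 0,
               (const struct sockaddr *)peer, sizeof *peer);
        next += wait_ns;
        while (now_ns() < next)            /* spin until the next
                                              transmit instant */
            ;
    }
}
```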
The PCI Bus & Gigabit Ethernet Measurements
PCI activity measured with a logic analyzer:
• PCI probe cards in the sending PC
• Gigabit Ethernet fiber probe card
• PCI probe cards in the receiving PC
[Figure: test setup. Each PC has CPU, memory, chipset and NIC; the PCI probe cards on both PCs and the Gigabit Ethernet fibre probe all feed the logic analyser display.]
Example: The 1 Gigabit NIC, Intel PRO/1000
Latency, throughput and PCI bus activity were measured.
[Figure: throughput, 'gig6-7 Intel pci 66 MHz 27nov02': received wire rate (0-1000 Mbit/s) vs transmit time per frame (0-40 µs), for frame sizes from 50 to 1472 bytes.]
Motherboard: Supermicro P4DP6; Chipset: E7500 (Plumas); CPU: Dual Xeon 2.2 GHz with 512k L2 cache; Mem bus 400 MHz; PCI-X 64 bit 66 MHz; HP Linux Kernel 2.4.19 SMP; MTU 1500 bytes; NIC: Intel PRO/1000 XT
[Figure: latency vs message length (0-3000 bytes), Intel 64 bit 66 MHz; fits y = 0.0093x + 194.67 and y = 0.0149x + 201.75 µs.]
[Figure: latency histograms N(t) for 64, 512, 1024 and 1400 byte messages, Intel 64 bit 66 MHz, roughly 170-230 µs.]
[Figure: PCI bus activity, showing the send transfer and the receive transfer.]
Data Flow: SuperMicro 370DLE, SysKonnect NIC
Motherboard: SuperMicro 370DLE; Chipset: ServerWorks III LE; CPU: PIII 800 MHz; PCI: 64 bit 66 MHz; RedHat 7.1, Kernel 2.4.14
1400 bytes sent, wait 100 µs: ~8 µs for the send or receive transfer, stack & application overhead ~10 µs per node, ~36 µs in total on the trace.
[Figure: logic-analyser trace showing the send PCI (CSR setup, then the send transfer), the packet on the Ethernet fibre, and the receive PCI (receive transfer).]
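As a rough cross-check (my arithmetic, not from the slide): a 64 bit, 66 MHz PCI bus moves at most about 528 MByte/s, so the raw data phase for a 1400 byte frame is only

$$ t \approx \frac{1400\ \text{bytes}}{528\ \text{MByte/s}} \approx 2.7\ \mu\text{s}, $$

which suggests that much of the ~8 µs seen on the bus is descriptor and CSR traffic around the data burst itself.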
10 Gigabit Ethernet NIC with the PCI-X probe card.
Intel PRO/10GbE LR Adapter in the HP rx2600 system
10 GigE on Itanium IA64: UDP Latency
Motherboard: HP rx2600 IA64; Chipset: HP zx1; CPU: Dual Itanium 2 1 GHz with 512k L2 cache; Mem bus dual 622 MHz, 4.3 GByte/s; PCI-X 133 MHz; HP Linux Kernel 2.5.72 SMP
Intel PRO/10GbE LR Server Adapter; NIC driver with RxIntDelay=0 XsumRX=1 XsumTX=1 RxDescriptors=2048 TxDescriptors=2048; MTU 1500 bytes
Latency 100 µs & very well behaved. Latency slope 0.0033 µs/byte; back-to-back expect 0.00268 µs/byte (PCI 0.00188, 10GigE 0.0008, PCI 0.00188).
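One consistent reading of these numbers (my interpretation, not spelled out on the slide): each element of the back-to-back path contributes its time per byte, with PCI-X at 133 MHz, 64 bit moving about 1.06 GByte/s and 10 GigE moving 1.25 GByte/s:

$$ s \approx \underbrace{0.00094}_{\text{PCI-X send}} + \underbrace{0.0008}_{\text{10 GigE}} + \underbrace{0.00094}_{\text{PCI-X recv}} = 0.00268\ \mu\text{s/byte}, $$

in which case the 0.00188 µs/byte quoted for PCI would be the two PCI-X crossings combined.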
10 GigE on Itanium IA64: Latency Histograms
Double-peak structure, with the peaks separated by 3-4 µs; the peaks are ~1-2 µs wide. Similar to that observed with 1 Gbit Ethernet NICs on IA32 architectures.
10 GigE on Itanium IA64: UDP Throughput
HP Linux Kernel 2.5.72 SMP; MTU 16114 bytes; max throughput 5.749 Gbit/s
Interrupt on every packet; no packet loss in 10M packets
Sending host: 1 CPU is idle; for 14000-16080 byte packets, one CPU is 40% in kernel mode; as the packet size decreases, the load rises to ~90% for packets of 4000 bytes or less
Receiving host: both CPUs busy; 16114 bytes, 40% kernel mode; small packets, 80% kernel mode
TCP gensink data rate was 745 MBytes/s = 5.96 Gbit/s
[Figures: 'Oplab29-30 10GE Xsum 512kbuf MTU16114 30Jul03': received wire rate (0-6000 Mbit/s), receiver % CPU kernel, and sender % CPU kernel vs spacing between frames (0-40 µs), for packet sizes from 1472 to 16080 bytes.]
10 GigE on Itanium IA64: UDP Throughput [04]
HP Linux Kernel 2.6.1 #17 SMP; MTU 16114 bytes; max throughput 5.81 Gbit/s
Interrupt on every packet; some packet loss for packets < 4000 bytes
Sending host: 1 CPU is idle, but the load swaps between CPUs; for 14000-16080 byte packets, one CPU is 20-30% in kernel mode; as the packet size decreases, the load rises to ~90% for packets of 4000 bytes or less
Receiving host: 1 CPU is idle, but the load swaps between CPUs; 16114 bytes, 40% kernel mode; small packets, 70% kernel mode
[Figures: 'Openlab98-99 10GE MTU16114 12 Feb04': received wire rate (0-6000 Mbit/s), sender % CPU kernel, and receiver % CPU kernel vs spacing between frames (0-40 µs), for packet sizes from 1472 to 16080 bytes.]
10 GigE on Itanium IA64: PCI-X bus Activity
16080 byte packets every 200 µs; Intel PRO/10GbE LR Server Adapter; MTU 16114; setpci -s 02:1.0 e6.b=2e (22 26 2a) sets mmrbc to 4096 bytes (512 1024 2048), decoded below
PCI-X signals, transmit (memory to NIC): interrupt and processing 48.4 µs after start; the data transfer takes ~22 µs; data transfer rate over PCI-X 5.86 Gbit/s
The transfer of 16114 bytes is made up of 4 PCI-X sequences of ~4.55 µs, separated by gaps of ~700 ns; each sequence contains 16 PCI-X bursts of 256 bytes, for a sequence length of 4096 bytes (= mmrbc)
[Figure: analyser trace: CSR access, then PCI-X sequences of 4096 bytes in 256-byte bursts, with 700 ns gaps between sequences.]
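The byte written by setpci is the low byte of the PCI-X command register, and mmrbc sits in bits 3:2 of it, encoded as 512 × 2^n; this matches the 22/26/2a/2e to 512/1024/2048/4096 pairing quoted above. A one-line decode (a sketch, assuming that bit layout):

```c
/* Decode Max Memory Read Byte Count from the byte written with
 * "setpci -s 02:1.0 e6.b=<val>": bits 3:2 give 512 << n, so
 * 0x22, 0x26, 0x2a, 0x2e select 512, 1024, 2048, 4096 bytes. */
unsigned mmrbc_bytes(unsigned char pcix_cmd_lo)
{
    return 512u << ((pcix_cmd_lo >> 2) & 0x3);
}
```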
PCI-X signals, receive (NIC to memory): interrupt on every packet; the data transfer takes ~18.4 µs; data transfer rate over PCI-X 7.014 Gbit/s. Note: receive is faster than transmit, cf. the 1 GigE NICs.
[Figure: analyser traces: transmit transfer of 16114 bytes in 256-byte PCI-X bursts, with interrupt and CSR access marked; receive transfer of 16114 bytes in 512-byte PCI-X bursts.]
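The quoted PCI-X rates follow directly from the transfer times (my arithmetic): 16114 bytes is 128912 bits, so

$$ \frac{128912\ \text{bits}}{22\ \mu\text{s}} \approx 5.86\ \text{Gbit/s}, \qquad \frac{128912\ \text{bits}}{18.4\ \mu\text{s}} \approx 7.0\ \text{Gbit/s}. $$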
10 GigE on Xeon IA32: UDP Latency
Motherboard: Supermicro P4DP8-G2; Chipset: Intel E7500 (Plumas); CPU: Dual Xeon 2.2 GHz with 512k L2 cache; Mem bus 400 MHz; PCI-X 133 MHz; RedHat Kernel 2.4.21 SMP; Intel(R) PRO/10GbE Network Driver v1.0.45
Intel PRO/10GbE LR Server Adapter; NIC driver with RxIntDelay=0 XsumRX=1 XsumTX=1 RxDescriptors=2048 TxDescriptors=2048; MTU 1500 bytes
Latency 144 µs & reasonably behaved. Latency slope 0.0032 µs/byte; back-to-back expect 0.00268 µs/byte.
10 GigE on Xeon IA32: Latency Histograms
Double-peak structure, with the peaks separated by 3-4 µs; the peaks are ~1-2 µs wide. Similar to that observed with 1 Gbit Ethernet NICs on IA32 architectures.
10 GigE on Xeon IA32: Throughput
MTU 16114 bytes; max throughput 2.75 Gbit/s with mmrbc 512 bytes, 3.97 Gbit/s with mmrbc 4096 bytes
Interrupt on every packet; no packet loss in 10M packets
Sending host: for closely spaced packets, the other CPU is ~60-70% in kernel mode
Receiving host: small packets, 80% in kernel mode; >9000 bytes, ~50% in kernel mode
[Figures: 'DataTAG3-6 10GE MTU16114' / 'Dt3-6 10GE Xsum 512kbuf MTU16114 31Jul03': received wire rate (0-3500 Mbit/s), receiver % CPU kernel, and sender % CPU kernel vs spacing between frames (0-40 µs), for packet sizes from 1472 to 16000 bytes.]
10 GigE on Xeon IA32: PCI-X bus Activity
16080 byte packets every 200 µs; Intel PRO/10GbE LR; mmrbc 512 bytes
PCI-X signals, transmit (memory to NIC): interrupt and processing 70 µs after start; the data transfer takes ~44.7 µs; data transfer rate over PCI-X 2.88 Gbit/s
PCI-X signals, receive (NIC to memory): interrupt on every packet; the data transfer takes ~18.29 µs; data transfer rate over PCI-X 7.014 Gbit/s, the same as on the Itanium
[Figure: analyser traces of the transmit and receive transfers of 16114 bytes (transmit in 256-byte PCI-X bursts), with interrupt and CSR access marked.]
10 GigE on Dell Xeon: Throughput
MTU 16114 bytes; max throughput 5.4 Gbit/s
Interrupt on every packet; some packet loss for packets < 4000 bytes
Sending host: for closely spaced packets, one CPU is ~70% in kernel mode; CPU usage swaps between CPUs
Receiving host: 1 CPU is idle, but CPU usage swaps; for closely spaced packets, ~80% in kernel mode
[Figures: 'SLAC Dell 10GE MTU16114': received wire rate (0-6000 Mbit/s), sender % CPU kernel, and receiver % CPU kernel vs spacing between frames (0-40 µs), for packet sizes from 1472 to 16080 bytes.]
10 GigE on Dell Xeon: UDP Latency
Motherboard: Dell PowerEdge 2650; Chipset: Intel E7500 (Plumas); CPU: Dual Xeon 3.06 GHz with 512k L2 cache; Mem bus 533 MHz; PCI-X 133 MHz; RedHat Kernel 2.4.20 altAIMD; Intel(R) PRO/10GbE Network Driver v1.0.45
Intel PRO/10GbE LR Server Adapter; NIC driver with RxIntDelay=0 XsumRX=1 XsumTX=1 RxDescriptors=2048 TxDescriptors=2048; MTU 16114 bytes
Latency 36 µs, with some steps. Latency slope 0.0017 µs/byte; back-to-back expect 0.00268 µs/byte.
[Figures: 'SLAC Dell 10 GE': latency vs message length, for 0-1400 bytes with fit y = 0.0017x + 36.579 µs (average and minimum times), and for 0-12000 bytes with latency 0-80 µs.]
10 GigEthernet: Throughput
A 1500 byte MTU gives ~2 Gbit/s; a 16114 byte MTU (max user length 16080 bytes) was used.
DataTAG Supermicro PCs: Dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes: wire-rate throughput of 2.9 Gbit/s
SLAC Dell PCs: Dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes: wire rate of 5.4 Gbit/s
CERN OpenLab HP Itanium PCs: Dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes: wire rate of 5.7 Gbit/s
[Figure: 'an-al 10GE Xsum 512kbuf MTU16114 27Oct03': received wire rate (0-6000 Mbit/s) vs spacing between frames (0-40 µs), for packet sizes from 1472 to 16080 bytes.]
Tuning PCI-X: Variation of mmrbc, IA32
16080 byte packets every 200 µs; Intel PRO/10GbE LR Adapter
[Figure: analyser traces for mmrbc = 512, 1024, 2048 and 4096 bytes, showing CSR access, PCI-X sequences, the data transfer, and the interrupt & CSR update.]
PCI-X bus occupancy vs mmrbc: the plot shows the measured times, times based on the PCI-X times from the logic analyser, and the expected throughput (a toy model follows the figure).
[Figure: measured PCI-X transfer time and expected time (0-50 µs, left axis), and the rate from the expected time with the max PCI-X throughput (0-9 Gbit/s, right axis), vs Max Memory Read Byte Count (0-5000).]
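A sketch of the kind of occupancy model behind the 'expected time' curve, built only from the analyser numbers quoted earlier (~4.55 µs per 4096-byte sequence, ~700 ns gaps). Treating the per-byte data time as constant and charging one gap per extra sequence is my simplification, not necessarily the authors' exact model:

```c
/* Model PCI-X occupancy for one packet: the transfer is split into
 * ceil(len / mmrbc) sequences; each byte costs the per-byte data time
 * and each extra sequence adds a fixed gap. */
#include <stdio.h>

#define SEQ_US_PER_BYTE (4.55 / 4096.0)  /* ~4.55 us per 4096-byte sequence */
#define GAP_US          0.70             /* gap between sequences           */

static double pcix_transfer_us(int len, int mmrbc)
{
    int n_seq = (len + mmrbc - 1) / mmrbc;            /* ceiling division */
    return len * SEQ_US_PER_BYTE + (n_seq - 1) * GAP_US;
}

int main(void)
{
    for (int mmrbc = 512; mmrbc <= 4096; mmrbc *= 2) {
        double t = pcix_transfer_us(16114, mmrbc);
        printf("mmrbc %4d: %5.1f us  %.2f Gbit/s\n",
               mmrbc, t, 16114 * 8e-3 / t);           /* kbit/us = Gbit/s */
    }
    return 0;
}
```

At mmrbc 512 this model charges 31 gaps, about 22 µs of dead time per 16114-byte packet, which is why small mmrbc values cost so much bus occupancy.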
Tuning PCI-X: Throughput vs mmrbc
[Figures: PCI-X transfer rate vs Max Memory Read Byte Count (0-5000), showing the measured rate, the rate from the expected time, and the max PCI-X throughput, for the DataTAG IA32, 'Std UDP Kernel 2.4.22 Dell Intel10GE' and 'Kernel 2.6.1#17 HP Itanium Intel10GE Feb04' systems; plus % CPU kernel mode (local sending, remote receiving) for the HP Itanium.]
DataTag IA32, 2.2 GHz Xeon, 400 MHz FSB: 2.7-4.0 Gbit/s
SLAC Dell, 3.0 GHz Xeon, 533 MHz FSB: 3.5-5.2 Gbit/s
OpenLab IA64, 1.0 GHz Itanium, 622 MHz FSB: 3.2-5.7 Gbit/s
CPU load shown is for 1 CPU, not the average
[Figures: % CPU kernel mode (local sending, remote receiving) vs Max Memory Read Byte Count for the same systems.]
10 GigEthernet at SC2003 BW Challenge
Three server systems with 10 GigEthernet NICs; used the DataTAG altAIMD stack with 9000 byte MTU; sent mem-mem iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
• Palo Alto PAIX: rtt 17 ms, window 30 MB; shared with the Caltech booth; 4.37 Gbit/s HSTCP I=5%, then 2.87 Gbit/s I=16% (the fall corresponds to 10 Gbit/s on the link); 3.3 Gbit/s Scalable I=8%; tested 2 flows, sum 1.9 Gbit/s I=39%
• Chicago Starlight: rtt 65 ms, window 60 MB; Phoenix CPU 2.2 GHz; 3.1 Gbit/s HSTCP I=1.6%
• Amsterdam SARA: rtt 175 ms, window 200 MB; Phoenix CPU 2.2 GHz; 4.35 Gbit/s HSTCP I=6.9%, very stable
Both the Chicago and Amsterdam flows used Abilene to Chicago.
[Figures: 10 Gbit/s throughput from SC2003, 19 Nov 03, 15:59-17:25: (1) to PAIX, showing the router traffic to LA/PAIX and the Phoenix-PAIX HS-TCP and Scalable-TCP flows; (2) to Chicago & Amsterdam, showing the router traffic to Abilene and the Phoenix-Chicago and Phoenix-Amsterdam flows.]
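The windows quoted above are, to a rounding, the bandwidth-delay products of a 10 Gbit/s path (my arithmetic, not from the slide):

$$ 10\ \text{Gbit/s} \times 17\ \text{ms} \approx 21\ \text{MB}, \quad 10 \times 65\ \text{ms} \approx 81\ \text{MB}, \quad 10 \times 175\ \text{ms} \approx 219\ \text{MB}, $$

so the 30, 60 and 200 MB windows give each flow of the order of one BDP of outstanding data.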
Summary & Conclusions
• The Intel PRO/10GbE LR Adapter and driver gave stable throughput and worked well
• A large MTU is needed (9000 or 16114 bytes): 1500 bytes gives only ~2 Gbit/s
• PCI-X tuning: mmrbc = 4096 bytes increased throughput by 55% (3.2 to 5.7 Gbit/s); PCI-X sequences are clear on transmit, with gaps of ~950 ns; transmission (22 µs) takes longer than receiving (18 µs): Tx rate 5.85 Gbit/s, Rx rate 7.0 Gbit/s on the Itanium (PCI-X max 8.5 Gbit/s)
• CPU load is considerable: 60% on the Xeon, 40% on the Itanium; the bandwidth of the memory system is important, as the data crosses it 3 times; results are sensitive to OS / driver updates
• More study needed
Test setup with the CERN Open Lab Itanium systems
1 GigE on IA32: PCI bus Activity at 33 & 66 MHz
1472 byte packets every 15 µs; Intel PRO/1000
PCI 64 bit 33 MHz: 82% usage
PCI 64 bit 66 MHz: 65% usage; the data transfers take half as long
[Figure: send and receive PCI traces at both bus speeds, showing the send setup, data transfers and receive transfers.]