J. Bunn, D. Nae, H. Newman, S. Ravot, X. Su, Y. Xia
California Institute of Technology

High Speed WAN Data Transfers for Science
Session: Recent Results

TERENA Networking Conference 2005
Poznan, Poland, 6-9 June
AGENDA
TCP performance over high-speed, high-latency networks
End-system performance
Next-generation network
Advanced R&D projects
Single TCP stream performance under periodic losses
[Figure: Effect of packet loss. Bandwidth utilization (%) versus packet loss frequency (%) for a LAN (RTT = 0.04 ms) and a WAN (RTT = 120 ms), with 1 Gbps of available bandwidth. At a loss rate of 0.01%, LAN utilization is 99% while WAN utilization is only 1.2%.]
TCP throughput is much more sensitive to packet loss on WANs than on LANs.
TCP's congestion control algorithm (AIMD) is not well suited to gigabit networks; the effect of packet loss can be disastrous.
TCP is inefficient on networks with a high bandwidth-delay product.
The performance outlook for computational grids is poor if we continue to rely solely on the widely deployed TCP Reno.
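As a rough cross-check (my own illustration, not part of the original slides), the well-known Mathis approximation for TCP Reno throughput, rate ≈ 1.22 · MSS / (RTT · sqrt(p)), reproduces the utilization figures quoted above:

import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=1.22):
    """Approximate steady-state TCP Reno throughput (Mathis et al., 1997)."""
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))

link_bps = 1e9   # 1 Gbps of available bandwidth
loss = 1e-4      # 0.01% packet loss
for name, rtt_s in [("LAN", 0.04e-3), ("WAN", 120e-3)]:
    bw = min(mathis_throughput_bps(1460, rtt_s, loss), link_bps)
    print(f"{name}: ~{100 * bw / link_bps:.0f}% of the 1 Gbps link")
# LAN: ~100%, WAN: ~1% -- consistent with the 99% / 1.2% utilization in the figure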
Single TCP stream between Caltech and CERN
Available (PCI-X) bandwidth = 8.5 Gbps
RTT = 250 ms (16,000 km), 9000-byte MTU
15 minutes to increase the throughput from 3 to 6 Gbps
Sending station: Tyan S2882 motherboard, 2x Opteron 2.4 GHz, 2 GB DDR
Receiving station: CERN OpenLab HP rx4640, 4x 1.5 GHz Itanium-2, zx1 chipset, 8 GB memory
Network adapter: S2io 10 GbE
[Throughput plot annotations: burst of packet losses, single packet loss, CPU load = 100%]
Responsiveness

Path                  Bandwidth   RTT (ms)   MTU (bytes)   Time to recover
LAN                   10 Gb/s     1          1500          430 ms
Geneva-Chicago        10 Gb/s     120        1500          1 hr 32 min
Geneva-Los Angeles    1 Gb/s      180        1500          23 min
Geneva-Los Angeles    10 Gb/s     180        1500          3 hr 51 min
Geneva-Los Angeles    10 Gb/s     180        9000          38 min
Geneva-Los Angeles    10 Gb/s     180        64k (TSO)     5 min
Geneva-Tokyo          1 Gb/s      300        1500          1 hr 04 min
A large MTU accelerates the growth of the congestion window:
The time to recover from a packet loss decreases with a larger MTU.
A larger MTU reduces the per-frame overhead (saves CPU cycles, reduces the number of packets).

Time to recover from a single packet loss:
T = C · RTT² / (2 · MSS), where C is the capacity of the link.
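A minimal sketch (my own arithmetic, not from the slides) applying this formula to some of the paths in the Responsiveness table; small differences from the quoted times come from the exact MSS and header sizes assumed:

# Recovery time after a single loss under AIMD: T = C * RTT^2 / (2 * MSS)
def recovery_time_s(capacity_bps, rtt_s, mss_bytes):
    return capacity_bps * rtt_s ** 2 / (2 * mss_bytes * 8)

paths = [
    ("LAN",                           10e9, 0.001, 1500),
    ("Geneva-Chicago",                10e9, 0.120, 1500),
    ("Geneva-Los Angeles (9000 MTU)", 10e9, 0.180, 9000),
]
for name, capacity, rtt, mtu in paths:
    t = recovery_time_s(capacity, rtt, mtu - 40)   # MSS ~ MTU minus 40 B of IP/TCP headers
    print(f"{name}: {t:.1f} s")
# ~0.4 s for the LAN, ~1.7 h for Geneva-Chicago, ~38 min for Geneva-LA with a 9000 B MTU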
Fairness: TCP Reno MTU & RTT bias
Two TCP streams share a 1 Gbps bottleneck between CERN (GVA) and Starlight (CHI), RTT = 117 ms.
MTU = 1500 bytes: average throughput over a period of 4000 s = 50 Mb/s
MTU = 9000 bytes: average throughput over a period of 4000 s = 698 Mb/s
A factor of 14!
Connections with a large MTU increase their rate quickly and grab most of the available bandwidth (see the toy AIMD sketch below).
[Testbed diagram: Host #1 and Host #2 at each site, connected by 1 GE links through GbE switches and routers over a 2.5 Gbps POS link between GVA and CHI; the two streams share a 1 Gbps bottleneck.]
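The MSS dependence is easy to see in a toy AIMD model (my own simplified illustration, not a reproduction of the measurement): both flows add one MSS worth of window per RTT and halve on a loss, so the 9000-byte-MTU flow grows its window about six times faster in bytes and ends up with most of the bottleneck.

# Toy AIMD model: two flows share a 1 Gbps bottleneck; each adds one MSS of
# window per RTT and halves its window when the combined rate exceeds capacity.
CAPACITY_BPS = 1e9
RTT = 0.117                                       # seconds
mss = {"1500 B MTU": 1460, "9000 B MTU": 8960}    # MSS in bytes
win = dict(mss)                                   # start each flow at one MSS
delivered = {name: 0.0 for name in mss}

t, duration = 0.0, 4000.0
while t < duration:
    rates = {n: win[n] * 8 / RTT for n in mss}    # bits/s during this RTT
    if sum(rates.values()) > CAPACITY_BPS:        # shared loss event
        for n in mss:
            win[n] /= 2
    else:
        for n in mss:
            delivered[n] += rates[n] * RTT        # bits actually delivered
            win[n] += mss[n]                      # additive increase: +1 MSS per RTT
    t += RTT

for n in mss:
    print(f"{n}: average ~{delivered[n] / duration / 1e6:.0f} Mb/s")
# The large-MTU flow captures a several-times-larger share of the bottleneck.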
TCP Variants
HSTCP, Scalable TCP, Westwood+, BIC TCP, H-TCP
FAST TCP:
Based on TCP Vegas
Uses end-to-end delay and loss to dynamically adjust the congestion window
Defines an explicit equilibrium
7.3 Gbps between CERN and Caltech, sustained for hours
FAST vs. Reno:
[Figure: capacity = 1 Gbps; 180 ms round-trip latency; 1 flow; 3600 s runs. Bandwidth utilization: Linux TCP 19%, Linux TCP (optimized) 27%, FAST 95%.]
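For reference, the published FAST TCP window update (Jin, Wei and Low, 2004) steers the window toward an equilibrium set by the measured queueing delay. The snippet below is a schematic of that rule with illustrative parameter values and a toy delay model, not the implementation used in these tests:

def fast_window_update(w, base_rtt, rtt, alpha=200, gamma=0.5):
    """One FAST update step: move the window toward the point where roughly
    `alpha` packets are queued along the path (the explicit equilibrium)."""
    target = (base_rtt / rtt) * w + alpha
    return (1 - gamma) * w + gamma * target

# Toy example: queueing delay grows with the window, so the window converges
# to an equilibrium instead of oscillating the way a loss-driven AIMD flow does.
w, base_rtt = 10.0, 0.180
for _ in range(60):
    rtt = base_rtt + w / 5000.0        # hypothetical delay model for illustration
    w = fast_window_update(w, base_rtt, rtt)
print(f"equilibrium window ~ {w:.0f} packets")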
TCP Variants Performance
Tests between CERN and Caltech
Capacity = OC-192, 9.5 Gbps; 264 ms round-trip latency; 1 flow
Sending station: Tyan S2882 motherboard, 2x Opteron 2.4 GHz, 2 GB DDR
Receiving station (CERN OpenLab): HP rx4640, 4x 1.5 GHz Itanium-2, zx1 chipset, 8 GB memory
Network adapter: Neterion 10 GbE NIC
Measured throughput: 3.0 Gbps with standard Linux TCP, 4.1-5.0 Gbps with Westwood+ and BIC TCP, and 7.3 Gbps with FAST TCP.
Internet2 Land Speed Record (LSR): Redefining the Role and Limits of TCP
Product of transfer speed and distance using standard Internet (TCP/IP) protocols
Single stream: 7.5 Gbps x 16 kkm with Linux, July 2004
IPv4 multi-stream record with FAST TCP: 6.86 Gbps x 27 kkm, November 2004
IPv6 record: 5.11 Gbps between Geneva and Starlight, January 2005
Concentrating now on reliable Terabyte-scale file transfers
Disk-to-disk marks: 536 MByte/s (Windows); 500 MByte/s (Linux)
Note the system issues: PCI-X bus, network interface, disk I/O controllers, CPU, drivers
http://www.guinnessworldrecords.com/
[Figure: Internet2 LSR history, throughput x distance in petabit-meters/sec, with the November 2004 record network. Milestones: 0.4 Gbps x 12272 km, 0.9 Gbps x 10978 km, 2.5 Gbps x 10037 km, 5.4 Gbps x 7067 km, 5.6 Gbps x 10949 km, 4.2 Gbps x 16343 km, 6.6 Gbps x 16500 km; current single IPv4 TCP stream record 7.21 Gbps x 20675 km (7.2 Gbps x 20.7 kkm). Blue entries = HEP.]
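The LSR metric is simply throughput multiplied by path length. As a quick check (my own arithmetic, not from the slides), the 7.21 Gbps x 20675 km single-stream mark works out to roughly 149 petabit-meters per second, consistent with the scale of the chart:

# LSR metric = throughput (bit/s) x distance (m), reported in petabit-meters/s
def lsr_petabit_m_per_s(gbps, km):
    return gbps * 1e9 * km * 1e3 / 1e15

print(lsr_petabit_m_per_s(7.5, 16000))    # ~120 Pb*m/s (single stream with Linux, Jul 2004)
print(lsr_petabit_m_per_s(7.21, 20675))   # ~149 Pb*m/s (single IPv4 stream, Nov 2004)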
High Throughput Disk-to-Disk Transfers: From 0.1 to 1 GByte/sec
Server hardware (rather than the network) is the bottleneck:
Write/read and transmit tasks share the same limited resources: CPU, PCI-X bus, memory, I/O chipset
PCI-X bus bandwidth: 8.5 Gbps (133 MHz x 64 bit)
Neterion SR 10 GbE NIC:
Single 10 GbE NIC: 7.5 Gbit/s memory-to-memory, with 52% CPU utilization
2x 10 GbE NICs (802.3ad link aggregation): 11.1 Gbit/s memory-to-memory
1 TB at 536 MByte/s from CERN to Caltech
Performance in this range (from 100 MByte/s up to 1 GByte/s) is required to build a responsive Grid-based processing and analysis system for the LHC.
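A quick back-of-the-envelope check (my own arithmetic, not from the slides) of how the quoted rates sit against the bus limit:

# PCI-X theoretical bandwidth and how the quoted transfer rates compare to it
pcix_bps = 133e6 * 64                  # 133 MHz x 64-bit bus ~ 8.5 Gbps
print(f"PCI-X: {pcix_bps / 1e9:.1f} Gbps")

disk_rate_Bps = 536e6                  # 536 MByte/s disk-to-disk, CERN -> Caltech
print(f"536 MByte/s = {disk_rate_Bps * 8 / 1e9:.1f} Gbps on the wire")

print(f"1 TB at that rate takes ~{1e12 / disk_rate_Bps / 60:.0f} minutes")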
SC2004: High Speed TeraByte Transfers for Physics
Demonstrating that many 10 Gbps wavelengths can be used efficiently over continental and transoceanic distances
A preview of the globally distributed grid system now being developed in preparation for the next generation of high-energy physics experiments at CERN's Large Hadron Collider (LHC)
Monitoring the WAN performance using the MonALISA agent-based system
Major partners: Caltech, FNAL, SLAC
Major sponsors: Cisco, S2io, HP, Newisys
Major networks: NLR, Abilene, ESnet, LHCnet, AMPATH, TeraGrid
Bandwidth Challenge award: the 101 Gigabit-per-second mark
SC2004 Network (I)
80 10 GbE ports, 50 10 GbE network adapters, 10 Cisco 7600/6500 switches dedicated to the experiment; the backbone for the experiment was built in two days.

101 Gigabit-per-second mark
[Figure: aggregate throughput during the Bandwidth Challenge, peaking at 101 Gbps; dips caused by unstable end-systems; traffic stops at the end of the demo. Source: Bandwidth Challenge committee]
UltraLight: Developing Advanced Network Services for Data-Intensive HEP Applications
UltraLight: a next-generation hybrid packet- and circuit-switched network infrastructure
Packet-switched: a cost-effective solution; requires ultrascale protocols to share 10G links efficiently and fairly
Circuit-switched: scheduled or sudden "overflow" demands handled by provisioning additional wavelengths; use path diversity, e.g. across the US, the Atlantic, Canada, ...
Extend and augment existing grid computing infrastructures (currently focused on CPU/storage) to include the network as an integral component
Using MonALISA to monitor and manage global systems
Partners: Caltech, UF, FIU, UMich, SLAC, FNAL, MIT/Haystack; CERN, NLR, CENIC, Internet2; Translight, UKLight, NetherLight; UvA, UCL, KEK, Taiwan
Strong support from Cisco
UltraLight 10 GbE OXC @ Caltech
L1, L2 and L3 services
Hybrid packet- and circuit-switched PoP
Control plane is L3
LambdaStation
A joint Fermilab and Caltech project
Enabling HEP applications to send high-throughput traffic between mass storage systems across advanced network paths
Dynamic path provisioning across USNet, UltraLight, NLR; plus an ESnet/Abilene "standard path"
DoE funded
LambdaStation: Results
The plot demonstrates the functionality: alternating use of lightpaths and conventional routed networks.
[Plot of GridFTP transfer traffic]
Summary
For many years the Wide Area Network has been the bottleneck; this is no longer the case in many countries, which makes the deployment of a data-intensive Grid infrastructure possible.
Some transport protocol issues still need to be resolved; however, there are many encouraging signs that practical solutions may now be in sight.
The 1 GByte/sec disk-to-disk challenge. Today: 1 TB at 536 MByte/s from CERN to Caltech.
Next-generation network and Grid systems: UltraLight and LambdaStation
The integrated, managed network
Extend and augment existing grid computing infrastructures (currently focused on CPU/storage) to include the network as an integral component.
Thanks & Questions