Realization and Utilization of High-BW TCP on Real Applications
Kei Hiraki
Data Reservoir / GRAPE-DR Project
The University of Tokyo
Computing Systems for Real Scientists
• Fast CPUs, huge memory and disks, good graphics
  – Cluster technology, DSM technology, graphics processors
  – Grid technology
• Very fast remote file access
  – Global file systems, data-parallel file systems, replication facilities
• Transparency to local computation
  – No complex middleware; at most small modifications to existing software
• Real scientists are not computer scientists
• Computer scientists are not a work force for real scientists
Objectives of Data Reservoir / GRAPE-DR (1)
• Sharing scientific data between distant research institutes
  – Physics, astronomy, earth science, simulation data
• Very high-speed single-file transfer on Long Fat pipe Networks
  – > 10 Gbps, > 20,000 km, > 400 ms RTT
• High utilization of available bandwidth
  – Transferred file data rate > 90% of available bandwidth
  – Including header overheads and initial negotiation overheads
Objectives of Data Reservoir / GRAPE-DR (2)
• GRAPE-DR: a very high-speed attached processor for a server
  – 2004 – 2008
  – Successor of the GRAPE-6 astronomical simulator
• 2 PFLOPS on a 128-node cluster system
  – 1 GFLOPS / processor
  – 1024 processors / chip
  – 8 chips / PCI card
  – 2 PCI cards / server
  – 2 M processors / system
Data-intensive scientific computation through global networks
[Figure: data sources such as GRAPE-6, the Belle experiments, the X-ray astronomy satellite ASUKA, the SUBARU Telescope, the Nobeyama Radio Observatory (VLBI), nuclear experiments, and the Digital Sky Survey feed Data Reservoirs over a very high-speed network; distributed shared files give local accesses to data analysis at the University of Tokyo.]
Basic Architecture
[Figure: two Data Reservoirs with cache disks, connected over a high-latency, very-high-bandwidth network; data are distributed and shared (DSM-like architecture); each site sees only local file accesses, while the sites exchange data by disk-block-level parallel, multi-stream transfer.]
File accesses on Data Reservoir
[Figure: scientific detectors and user programs access file servers through an IP switch (1st-level striping); the file servers stripe data over disk servers (2nd-level striping); disk access is by iSCSI; servers are IBM x345 (2.6 GHz x 2).]
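The figure only names the two striping levels; the sketch below shows one plausible mapping of a logical block onto the hierarchy. The stripe widths and the round-robin layout are assumptions for illustration, not the actual Data Reservoir code.

```python
# Illustrative two-level striping map (assumed layout, not the Data Reservoir implementation).
# Logical blocks are striped first across file servers (1st level), and each file
# server then stripes its share across its disk servers / iSCSI targets (2nd level).

N_FILE_SERVERS = 4      # 1st-level stripe width (assumption)
N_DISK_SERVERS = 4      # 2nd-level stripe width per file server (assumption)

def locate_block(lba: int):
    """Map a logical block address to (file_server, disk_server, local_block)."""
    file_server = lba % N_FILE_SERVERS              # 1st-level striping
    per_fs_block = lba // N_FILE_SERVERS
    disk_server = per_fs_block % N_DISK_SERVERS     # 2nd-level striping
    local_block = per_fs_block // N_DISK_SERVERS    # block index on the iSCSI target
    return file_server, disk_server, local_block

if __name__ == "__main__":
    for lba in range(8):
        print(lba, locate_block(lba))
```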
Global Data Transfer
[Figure: at each site, scientific detectors and user programs sit above file servers and an IP switch; the disk servers at the two sites exchange data directly by iSCSI bulk transfer across the global network.]
Problems found in 1st generation Data Reservoir
• Low TCP bandwidth due to packet losses
  – TCP congestion window size control
  – Very slow recovery from the fast-recovery phase (> 20 min)
• Unbalance among parallel iSCSI streams
  – Packet scheduling by switches and routers
  – The user and other network users care only about the total behavior of the parallel TCP streams
Fast Ethernet vs. GbE
• iperf for 30 seconds
• Min/Avg throughput: Fast Ethernet > GbE
[Chart: throughput in Mbps (min/max/avg), 0–180 Mbps scale, for Fast Ethernet with TxQ100 and TxQ5000 vs. GbE with TxQ100 and TxQ25000.]
Packet Transmission Rate
• Bursty behavior
  – Transmission within 20 ms of the 200 ms RTT
  – Idle for the remaining 180 ms
  – (a rough burst-rate estimate follows the chart below)
[Chart: packets transmitted vs. time, 4.5 s to 8.5 s, on the US-to-Tokyo path; the trace shows short bursts separated by idle periods and marks the point where packet loss occurred.]
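Sending the whole window within 20 ms of a 200 ms RTT implies an instantaneous rate roughly ten times the average. A rough check, assuming a 500 Mbps average rate (the rate figure used later in the talk):

```python
# Rough burst-rate estimate for the bursty transmission pattern above.
rtt = 0.200          # seconds
burst_time = 0.020   # the window is transmitted within 20 ms
avg_rate = 500e6     # bits per second (assumed average rate)

window_bits = avg_rate * rtt              # one congestion window per RTT
burst_rate = window_bits / burst_time     # instantaneous rate during the burst
print(f"burst rate ~ {burst_rate/1e9:.1f} Gbps ({burst_rate/avg_rate:.0f}x the average)")
# ~5.0 Gbps, 10x the average: switch and router buffers overflow, causing the observed loss
```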
Packet Spacing
• Ideal story
  – Transmit a packet every RTT/cwnd
  – 24 μs interval for 500 Mbps (MTU 1500 B); the calculation appears after the figure
  – High load for a software-only implementation
  – Low overhead because it is used only in the slow-start phase
[Figure: packets 0–8 sent back-to-back within one RTT (bursty case) vs. packets 0–8 spaced at intervals of RTT/cwnd (paced case).]
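A quick check of the 24 μs figure, assuming one 1500-byte packet is sent per interval:

```python
# Inter-packet interval needed to pace 500 Mbps with 1500-byte packets.
mtu_bits = 1500 * 8
target_rate = 500e6               # bits per second
interval = mtu_bits / target_rate
print(f"{interval * 1e6:.0f} us")  # 24 us, matching the figure on the slide
```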
Example Case of 8 IPG
• Successful fast retransmit
  – Smooth transition to congestion avoidance
  – Congestion avoidance takes 28 minutes to recover to 550 Mbps (a rough calculation follows the chart below)
[Chart: IPG 8; transfer rate vs. time over 0–35 s (0–140 scale).]
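The 28-minute figure is what standard congestion avoidance predicts: cwnd grows by roughly one segment per RTT, so reaching the window needed for 550 Mbps at a ~200 ms RTT takes thousands of RTTs. A rough check, assuming 1500-byte segments and growth from a near-empty window:

```python
# Why recovery to 550 Mbps takes tens of minutes in standard congestion avoidance.
# Assumptions: ~200 ms RTT, 1500-byte segments, cwnd grows ~1 segment per RTT from ~0.
rtt = 0.200                 # seconds
mss_bits = 1500 * 8
target_rate = 550e6         # bits per second

target_cwnd = target_rate * rtt / mss_bits   # segments needed in flight
recovery_time = target_cwnd * rtt            # one RTT per added segment
print(f"cwnd ~ {target_cwnd:.0f} segments, recovery ~ {recovery_time/60:.0f} minutes")
# ~9167 segments, ~31 minutes: consistent with the observed 28 minutes
```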
Best Case of 1023B IPG
• Like the Fast Ethernet case
  – Proper transmission rate
• Spurious retransmits due to reordering
[Chart: IPG 1023 B; transfer rate vs. time over 0–35 s (0–600 scale).]
Unbalance within parallel TCP streams
• Unbalance among parallel iSCSI streams
  – Packet scheduling by switches and routers
  – Meaningless unfairness among the parallel streams
  – The user and other network users care only about the total behavior of the parallel TCP streams
• Our approach (see the sketch after the figure)
  – Keep Σcwnd_i constant, so the aggregate stays fair to other users of the network
  – Balance the individual cwnd_i by communicating between the parallel TCP streams
[Figure: per-stream bandwidth vs. time; unbalanced streams on the left, balanced streams sharing the same total on the right.]
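A minimal sketch of the rebalancing idea, assuming each stream exposes its congestion window and a coordinator can adjust it. The function name and the move-halfway-to-the-mean policy are illustrative assumptions, not the actual Data Reservoir implementation; the key property is that the sum of the windows is preserved.

```python
# Illustrative rebalancing of parallel TCP streams: keep the sum of the congestion
# windows constant (fair to other users) while pulling each stream toward an equal share.

def rebalance(cwnds: list[float]) -> list[float]:
    total = sum(cwnds)                       # sum of cwnd_i is preserved
    share = total / len(cwnds)               # target: equal share per stream
    # Move each window halfway toward the mean; the deltas sum to zero,
    # so the aggregate window (and aggregate rate) is unchanged.
    return [c + 0.5 * (share - c) for c in cwnds]

if __name__ == "__main__":
    cwnds = [120.0, 10.0, 40.0, 30.0]        # windows (in segments) after scheduler-induced skew
    balanced = rebalance(cwnds)
    print(sum(cwnds), sum(balanced))         # total unchanged: 200.0 200.0
    print(balanced)                          # [85.0, 30.0, 45.0, 40.0]
```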
3rd Generation Data Reservoir
• Hardware and software basis for 100 Gbps distributed data-sharing systems
• 10 Gbps disk data transfer by a single Data Reservoir server
• Transparent support for multiple filesystems (detection of modified disk blocks)
• Hardware (FPGA) implementation of inter-layer coordination mechanisms
• 10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger
Utilization of 10Gbps network
• A single-box 10 Gbps Data Reservoir server
  – Quad Opteron server with multiple PCI-X buses (prototype: SUN V40z server)
  – Two Chelsio T110 TCP-offloading NICs
  – Disk arrays for the necessary disk bandwidth
  – Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
[Figure: single-box server; a quad-Opteron SUN V40z running Linux 2.6.6 and the Data Reservoir software, with two Chelsio T110 TCP NICs and two Ultra320 SCSI adapters on separate PCI-X buses; the NICs connect over 10GBASE-SR to a 10 G Ethernet switch.]
Tokyo-CERN experiment (Oct. 2004)
• CERN – Amsterdam – Chicago – Seattle – Tokyo
  – SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
  – 18,500 km WAN PHY connection
• Performance results
  – 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
  – 7.53 Gbps (TCP payload), 8 KB jumbo frames, iperf
  – 8.8 Gbps disk-to-disk performance
    • 9 servers, 36 disks
    • 36 parallel TCP streams
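For reference, a rough upper bound on TCP payload rate from per-frame overheads alone. The figures assume a 10 Gbps Ethernet line rate, 20-byte IPv4 and 20-byte TCP headers plus 12 bytes of TCP options, and 38 bytes of per-frame Ethernet overhead (header, FCS, preamble, inter-frame gap); the WAN PHY / SONET path used in the real experiment has a lower usable line rate, so the achievable bound there is lower still.

```python
# Rough TCP-payload efficiency from per-frame overheads (assumptions noted above).
def payload_efficiency(mtu: int) -> float:
    ip_tcp_overhead = 20 + 20 + 12     # IPv4 + TCP + timestamps option
    eth_overhead = 14 + 4 + 8 + 12     # Ethernet header + FCS + preamble + inter-frame gap
    return (mtu - ip_tcp_overhead) / (mtu + eth_overhead)

for mtu in (1500, 8192):
    print(mtu, f"{10.0 * payload_efficiency(mtu):.2f} Gbps max payload at 10 Gbps line rate")
# ~9.41 Gbps for MTU 1500 and ~9.89 Gbps for 8 KB jumbo frames; the measured 7.2-7.5 Gbps
# is therefore limited by the end systems and the WAN PHY line rate, not by header overhead.
```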
Tokyo-CERN Network connection
[Map of the network used in the experiment: the path runs Tokyo – Seattle – Vancouver – Calgary – Minneapolis – Chicago – Amsterdam – Geneva over the IEEAF, CANARIE (CA*net 4), and SURFnet networks; end systems sit at Tokyo and Geneva, and every intermediate point is an L1 or L2 switch.]
Network topology of CERN-Tokyo experiment
[Figure: Data Reservoirs at the University of Tokyo and at CERN (Geneva). Each end has IBM x345 servers (dual Intel Xeon 2.4 GHz, 2 GB memory; Linux 2.6.6 on Nos. 2-7, Linux 2.4.x on No. 1) on GbE, plus an Opteron server (dual Opteron 248, 2.2 GHz, 1 GB memory, Linux 2.6.6) with a Chelsio T110 NIC. Site switches include a Fujitsu XG800 12-port switch, Foundry BI MG8, Foundry FESX448, Foundry NetIron 40G, and Extreme Summit 400. The 10GBASE-LW path runs T-LEX (Tokyo) – Pacific Northwest Gigapop (Seattle) – StarLight (Chicago) – NetherLight (Amsterdam) over WIDE/IEEAF, CA*net 4, and SURFnet.]
LSR experiments
• Targets
  – > 30,000 km LSR distance
  – L3 switching at Chicago and Amsterdam
  – Period of the experiment
    • 12/20 – 1/3
    • Holiday season, when the public research networks are mostly idle
• System configuration
  – A pair of Opteron servers with Chelsio T110 NICs (at N-Otemachi)
  – Another pair of Opteron servers with Chelsio T110 NICs for competing-traffic generation
  – ClearSight 10 Gbps packet analyzer for packet capture
Network used in the experiment
[Figure 2. Network connection: the LSR path crosses Tokyo, Seattle, Vancouver, Calgary, Minneapolis, Chicago, Amsterdam, and New York over the IEEAF/Tyco/WIDE, CANARIE (CA*net 4), SURFnet, APAN/JGN2, and Abilene networks; nodes are either routers / L3 switches or L1 / L2 switches.]
Single-stream TCP: Tokyo – Chicago – Amsterdam – NY – Chicago – Tokyo
[Figure: route of the single-stream TCP LSR run. Opteron servers with Chelsio T110 NICs, a Fujitsu XG800 switch, and a ClearSight 10 Gbps capture unit sit at Tokyo (T-LEX). The path crosses the Pacific on the IEEAF/Tyco/WIDE WAN PHY link to Seattle (Pacific Northwest Gigapop), continues on CANARIE OC-192 via Vancouver, Calgary, and Minneapolis to Chicago (StarLight, Force10 E1200), then on SURFnet OC-192 across the Atlantic to Amsterdam (NetherLight; University of Amsterdam Force10 E600), back to New York (MANLAN), and returns over Abilene (T640, HDXc) and APAN/JGN2 TransPAC (Procket 8801/8812, Cisco 12416, Cisco 6509) to Tokyo; ONS 15454 and OME 6550 SONET equipment carries the CANARIE and SURFnet segments.]
Network Traffic on routers and switches
[Traffic graphs for the submitted run, observed on the following routers and switches:]
• StarLight: Force10 E1200
• University of Amsterdam: Force10 E600
• Abilene: T640 (NYCM to CHIN)
• TransPAC: Procket 8801
Summary
• Single-stream TCP
  – We removed the TCP-related difficulties
  – Now I/O bus bandwidth is the bottleneck
  – Cheap, simple servers can enjoy a 10 Gbps network
• Lack of methodology for debugging high-performance networks
  – 3 days of debugging (working overnight)
  – 1 day of stable operation (usable for measurements)
  – Networks seem to fatigue: some trouble always happens
  – We need something more effective
• Detailed issues
  – Flow control (and QoS)
  – Buffer size and policy
  – Optical level setting
Systems used in long-distance TCP experiments
CERN, Pittsburgh, Tokyo
Efficient and effective utilization of High-speed internet
• Efficient and effective utilization of a 10 Gbps network is still very difficult
• PHY, MAC, data link, and switches
  – 10 Gbps is ready to use
• Network interface adapters
  – 8 Gbps is ready to use, 10 Gbps in several months
  – Proper offloading, RDMA implementation
• I/O bus of a server
  – 20 Gbps is necessary to drive a 10 Gbps network
• Drivers, operating system
  – Too many interrupts, buffer memory management
• File system
  – Slow NFS service
  – Consistency problems
Difficulty in 10 Gbps Data Reservoir
• Disk-to-disk single-stream TCP data transfer
  – High CPU utilization (performance limited by the CPU)
    • Too many context switches
    • Too many interrupts from the network adapter (> 30,000/s)
    • Data copies from buffer to buffer
• I/O bus bottleneck
  – PCI-X/133: maximum 7.6 Gbps data transfer
    • Waiting for PCI-X/266 or PCI Express x8 or x16 NICs
  – Disk performance
    • Performance limit of the RAID adapter
    • Number of disks needed for the data transfer (> 40 disks are required)
• File system
  – High bandwidth in file service is more difficult than data sharing
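The 7.6 Gbps figure follows from the raw 64-bit/133 MHz PCI-X rate minus bus protocol overhead; a quick check (the ~10% overhead factor is an assumption for illustration):

```python
# Raw and effective PCI-X 64-bit / 133 MHz bandwidth.
bus_width_bits = 64
clock_hz = 133e6
raw = bus_width_bits * clock_hz          # ~8.5 Gbps raw
effective = raw * 0.9                    # assumed ~10% lost to bus protocol and arbitration
print(f"raw {raw/1e9:.2f} Gbps, effective ~{effective/1e9:.1f} Gbps")
# raw 8.51 Gbps, effective ~7.7 Gbps: consistent with the 7.6 Gbps quoted above,
# and below the ~20 Gbps of I/O bandwidth needed for full-rate 10 GbE plus disk traffic.
```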
High-speed IP networks in supercomputing (GRAPE-DR project)
• World's fastest computing system
  – 2 PFLOPS in 2008 (performance on actual application programs)
• Construction of a general-purpose massively parallel architecture
  – Low power consumption at PFLOPS-range performance
  – An MPP architecture more general-purpose than vector architecture
• Use of commodity networks for interconnection
  – 10 Gbps optical network (2008) + MEMS switches
  – 100 Gbps optical network (2010)
[Chart: peak FLOPS vs. year, 1970 to 2050, on a log scale up to 10^30 FLOPS, with curves for processor chips and parallel processors; marked points include the Earth Simulator (40 TFLOPS), the GRAPE-DR target (2 PFLOPS), and the KEISOKU supercomputer (10 PFLOPS).]
GRAPE-DR architecture
[Figure: a GRAPE-DR chip: 512 PEs, each with an integer ALU, a floating-point ALU, and local memory, connected by an on-chip network to on-chip shared memory and to the outside world.]
• Massively parallel processor
• Pipelined connection of a large number of PEs
• SIMASD (Single Instruction on Multiple and Shared Data); see the toy sketch below
  – Every instruction operates on data in the local memories and the shared memory
  – An extension of vector architecture
• Issues
  – Compiler for the SIMASD architecture (currently under development: flat-C)
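As an illustration of the SIMASD idea (a toy model, not GRAPE-DR code or its instruction set): every PE executes the same instruction, combining an operand from its own local memory with an operand broadcast from shared memory.

```python
# Toy model of SIMASD execution: one instruction, applied by every PE in lockstep
# to its local-memory operand together with a shared-memory (broadcast) operand.
# Purely illustrative; not the GRAPE-DR instruction set.

def simasd_step(local_mem: list[float], shared_operand: float, op) -> list[float]:
    """All PEs execute the same op on (local, shared) in lockstep."""
    return [op(x, shared_operand) for x in local_mem]

if __name__ == "__main__":
    local = [1.0, 2.0, 3.0, 4.0]                           # one value per PE's local memory
    result = simasd_step(local, 10.0, lambda a, b: a * b)  # broadcast multiply
    print(result)                                          # [10.0, 20.0, 30.0, 40.0]
```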
Hierarchical construction of GRAPE-DR
• 512 PEs / chip, 512 GFLOPS / chip
• 2K PEs / PCI board, 2 TFLOPS / PCI board
• 8K PEs / server, 8 TFLOPS / server
• 1M PEs / node, 1 PFLOPS / node
• 2M PEs / system, 2 PFLOPS / system
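The hierarchy is consistent at 1 GFLOPS per PE. A quick check of the multipliers implied by the slide (assuming binary K/M, i.e. 2K = 2048 and 1M = 2^20):

```python
# Multipliers implied by the GRAPE-DR hierarchy, at 1 GFLOPS per PE throughout.
pe_per_chip = 512
chips_per_board = 2048 // pe_per_chip         # 4 chips per PCI board
boards_per_server = 8192 // 2048              # 4 boards per server
servers_per_node = (1 << 20) // 8192          # 128 servers per node
nodes_per_system = (2 << 20) // (1 << 20)     # 2 nodes per system
total_pe = (pe_per_chip * chips_per_board * boards_per_server
            * servers_per_node * nodes_per_system)
print(chips_per_board, boards_per_server, servers_per_node, nodes_per_system)
print(f"{total_pe} PEs -> {total_pe * 1e9 / 1e15:.0f} PFLOPS at 1 GFLOPS per PE")
# 4 4 128 2 ; 2097152 PEs -> 2 PFLOPS
```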
Network architecture inside a GRAPE-DR system
[Figure: network inside a GRAPE-DR system. Components: AMD-based servers with memory on their memory buses and an adaptive compiler; optical interfaces; a MEMS-based optical switch; a 100 Gbps iSCSI server and IP storage system; a highly functional router to the outside IP network; a total-system conductor for dynamic optimization; KOE.]
Fujitsu Computer Technologies, LTD