Realization and Utilization of High-BW TCP on Real Applications
Kei Hiraki
Data Reservoir / GRAPE-DR Project
The University of Tokyo
Computing Systems for Real Scientists
• Fast CPUs, huge memory and disks, good graphics
  – Cluster technology, DSM technology, graphics processors
  – Grid technology
• Very fast remote file access
  – Global file systems, data-parallel file systems, replication facilities
• Transparency to local computation
  – No complex middleware; at most small modifications to existing software
• Real scientists are not computer scientists
• Computer scientists are not a work force for real scientists
Objectives of Data Reservoir / GRAPE-DR (1)
• Sharing scientific data between distant research institutes
  – Physics, astronomy, earth science, simulation data
• Very high-speed single-file transfer on Long Fat pipe Networks
  – > 10 Gbps, > 20,000 km, > 400 ms RTT
• High utilization of available bandwidth
  – Transferred file data rate > 90% of available bandwidth
  – Including header overheads and initial negotiation overheads
Objectives of Data Reservoir / GRAPE-DR (2)
• GRAPE-DR: a very high-speed attached processor for a server
  – 2004 – 2008
  – Successor of the GRAPE-6 astronomical simulator
• 2 PFLOPS on a 128-node cluster system
  – 1 GFLOPS / processor
  – 1024 processors / chip
  – 8 chips / PCI card
  – 2 PCI cards / server
  – 2 M processors / system
Data-intensive scientific computation through global networks
[Figure: data sources such as GRAPE-6, the Belle experiments, the X-ray astronomy satellite ASUKA, the SUBARU Telescope, the Nobeyama Radio Observatory (VLBI), nuclear experiments, and the Digital Sky Survey feed Data Reservoirs over a very high-speed network; distributed shared files give local accesses to data analysis at the University of Tokyo.]
Basic Architecture
[Figure: two Data Reservoirs with cache disks, connected over a high-latency, very-high-bandwidth network; data are distributed and shared (DSM-like architecture); each site sees only local file accesses, while the sites exchange data by disk-block-level parallel, multi-stream transfer.]
File accesses on Data Reservoir
[Figure: scientific detectors and user programs access file servers through an IP switch (1st-level striping); the file servers stripe data over disk servers (2nd-level striping); disk access is by iSCSI; servers are IBM x345 (2.6 GHz x 2).]
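The figure only names the two striping levels; the sketch below shows one plausible mapping of a logical block onto the hierarchy. The stripe widths and the round-robin layout are assumptions for illustration, not the actual Data Reservoir code.

```python
# Illustrative two-level striping map (assumed layout, not the Data Reservoir implementation).
# Logical blocks are striped first across file servers (1st level), and each file
# server then stripes its share across its disk servers / iSCSI targets (2nd level).

N_FILE_SERVERS = 4      # 1st-level stripe width (assumption)
N_DISK_SERVERS = 4      # 2nd-level stripe width per file server (assumption)

def locate_block(lba: int):
    """Map a logical block address to (file_server, disk_server, local_block)."""
    file_server = lba % N_FILE_SERVERS              # 1st-level striping
    per_fs_block = lba // N_FILE_SERVERS
    disk_server = per_fs_block % N_DISK_SERVERS     # 2nd-level striping
    local_block = per_fs_block // N_DISK_SERVERS    # block index on the iSCSI target
    return file_server, disk_server, local_block

if __name__ == "__main__":
    for lba in range(8):
        print(lba, locate_block(lba))
```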
Global Data Transfer
[Figure: at each site, scientific detectors and user programs sit above file servers and an IP switch; the disk servers at the two sites exchange data directly by iSCSI bulk transfer across the global network.]
Problems found in 1st generation Data Reservoir
• Low TCP bandwidth due to packet losses
  – TCP congestion window size control
  – Very slow recovery from the fast-recovery phase (> 20 min)
• Unbalance among parallel iSCSI streams
  – Packet scheduling by switches and routers
  – The user and other network users care only about the total behavior of the parallel TCP streams
Fast Ethernet vs. GbE
• iperf for 30 seconds
• Min/Avg throughput: Fast Ethernet > GbE
[Chart: throughput in Mbps (min/max/avg), 0–180 Mbps scale, for Fast Ethernet with TxQ100 and TxQ5000 vs. GbE with TxQ100 and TxQ25000.]
Packet Transmission Rate
• Bursty behavior
  – Transmission within 20 ms of the 200 ms RTT
  – Idle for the remaining 180 ms
  – (a rough burst-rate estimate follows the chart below)
[Chart: packets transmitted vs. time, 4.5 s to 8.5 s, on the US-to-Tokyo path; the trace shows short bursts separated by idle periods and marks the point where packet loss occurred.]
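Sending the whole window within 20 ms of a 200 ms RTT implies an instantaneous rate roughly ten times the average. A rough check, assuming a 500 Mbps average rate (the rate figure used later in the talk):

```python
# Rough burst-rate estimate for the bursty transmission pattern above.
rtt = 0.200          # seconds
burst_time = 0.020   # the window is transmitted within 20 ms
avg_rate = 500e6     # bits per second (assumed average rate)

window_bits = avg_rate * rtt              # one congestion window per RTT
burst_rate = window_bits / burst_time     # instantaneous rate during the burst
print(f"burst rate ~ {burst_rate/1e9:.1f} Gbps ({burst_rate/avg_rate:.0f}x the average)")
# ~5.0 Gbps, 10x the average: switch and router buffers overflow, causing the observed loss
```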
Packet Spacing
• Ideal story
  – Transmit a packet every RTT/cwnd
  – 24 μs interval for 500 Mbps (MTU 1500 B); the calculation appears after the figure
  – High load for a software-only implementation
  – Low overhead because it is used only in the slow-start phase
[Figure: packets 0–8 sent back-to-back within one RTT (bursty case) vs. packets 0–8 spaced at intervals of RTT/cwnd (paced case).]
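A quick check of the 24 μs figure, assuming one 1500-byte packet is sent per interval:

```python
# Inter-packet interval needed to pace 500 Mbps with 1500-byte packets.
mtu_bits = 1500 * 8
target_rate = 500e6               # bits per second
interval = mtu_bits / target_rate
print(f"{interval * 1e6:.0f} us")  # 24 us, matching the figure on the slide
```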
Example Case of 8 IPG
• Successful fast retransmit
  – Smooth transition to congestion avoidance
  – Congestion avoidance takes 28 minutes to recover to 550 Mbps (a rough calculation follows the chart below)
[Chart: IPG 8; transfer rate vs. time over 0–35 s (0–140 scale).]
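The 28-minute figure is what standard congestion avoidance predicts: cwnd grows by roughly one segment per RTT, so reaching the window needed for 550 Mbps at a ~200 ms RTT takes thousands of RTTs. A rough check, assuming 1500-byte segments and growth from a near-empty window:

```python
# Why recovery to 550 Mbps takes tens of minutes in standard congestion avoidance.
# Assumptions: ~200 ms RTT, 1500-byte segments, cwnd grows ~1 segment per RTT from ~0.
rtt = 0.200                 # seconds
mss_bits = 1500 * 8
target_rate = 550e6         # bits per second

target_cwnd = target_rate * rtt / mss_bits   # segments needed in flight
recovery_time = target_cwnd * rtt            # one RTT per added segment
print(f"cwnd ~ {target_cwnd:.0f} segments, recovery ~ {recovery_time/60:.0f} minutes")
# ~9167 segments, ~31 minutes: consistent with the observed 28 minutes
```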
Best Case of 1023B IPG
• Like the Fast Ethernet case
  – Proper transmission rate
• Spurious retransmits due to reordering
[Chart: IPG 1023 B; transfer rate vs. time over 0–35 s (0–600 scale).]
Unbalance within parallel TCP streams
• Unbalance among parallel iSCSI streams
  – Packet scheduling by switches and routers
  – Meaningless unfairness among the parallel streams
  – The user and other network users care only about the total behavior of the parallel TCP streams
• Our approach (see the sketch after the figure)
  – Keep Σcwnd_i constant, so the aggregate stays fair to other users of the network
  – Balance the individual cwnd_i by communicating between the parallel TCP streams
[Figure: per-stream bandwidth vs. time; unbalanced streams on the left, balanced streams sharing the same total on the right.]
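A minimal sketch of the rebalancing idea, assuming each stream exposes its congestion window and a coordinator can adjust it. The function name and the move-halfway-to-the-mean policy are illustrative assumptions, not the actual Data Reservoir implementation; the key property is that the sum of the windows is preserved.

```python
# Illustrative rebalancing of parallel TCP streams: keep the sum of the congestion
# windows constant (fair to other users) while pulling each stream toward an equal share.

def rebalance(cwnds: list[float]) -> list[float]:
    total = sum(cwnds)                       # sum of cwnd_i is preserved
    share = total / len(cwnds)               # target: equal share per stream
    # Move each window halfway toward the mean; the deltas sum to zero,
    # so the aggregate window (and aggregate rate) is unchanged.
    return [c + 0.5 * (share - c) for c in cwnds]

if __name__ == "__main__":
    cwnds = [120.0, 10.0, 40.0, 30.0]        # windows (in segments) after scheduler-induced skew
    balanced = rebalance(cwnds)
    print(sum(cwnds), sum(balanced))         # total unchanged: 200.0 200.0
    print(balanced)                          # [85.0, 30.0, 45.0, 40.0]
```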
3rd Generation Data Reservoir
• Hardware and software basis for 100 Gbps distributed data-sharing systems
• 10 Gbps disk data transfer by a single Data Reservoir server
• Transparent support for multiple filesystems (detection of modified disk blocks)
• Hardware (FPGA) implementation of inter-layer coordination mechanisms
• 10 Gbps Long Fat pipe Network emulator and 10 Gbps data logger
Utilization of 10Gbps network
• A single-box 10 Gbps Data Reservoir server
  – Quad Opteron server with multiple PCI-X buses (prototype: SUN V40z server)
  – Two Chelsio T110 TCP-offloading NICs
  – Disk arrays for the necessary disk bandwidth
  – Data Reservoir software (iSCSI daemon, disk driver, data transfer manager)
[Figure: single-box server; a quad-Opteron SUN V40z running Linux 2.6.6 and the Data Reservoir software, with two Chelsio T110 TCP NICs and two Ultra320 SCSI adapters on separate PCI-X buses; the NICs connect over 10GBASE-SR to a 10 G Ethernet switch.]
Tokyo-CERN experiment (Oct. 2004)
• CERN – Amsterdam – Chicago – Seattle – Tokyo
  – SURFnet – CA*net 4 – IEEAF/Tyco – WIDE
  – 18,500 km WAN PHY connection
• Performance results
  – 7.21 Gbps (TCP payload), standard Ethernet frame size, iperf
  – 7.53 Gbps (TCP payload), 8 KB jumbo frames, iperf
  – 8.8 Gbps disk-to-disk performance
    • 9 servers, 36 disks
    • 36 parallel TCP streams
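For reference, a rough upper bound on TCP payload rate from per-frame overheads alone. The figures assume a 10 Gbps Ethernet line rate, 20-byte IPv4 and 20-byte TCP headers plus 12 bytes of TCP options, and 38 bytes of per-frame Ethernet overhead (header, FCS, preamble, inter-frame gap); the WAN PHY / SONET path used in the real experiment has a lower usable line rate, so the achievable bound there is lower still.

```python
# Rough TCP-payload efficiency from per-frame overheads (assumptions noted above).
def payload_efficiency(mtu: int) -> float:
    ip_tcp_overhead = 20 + 20 + 12     # IPv4 + TCP + timestamps option
    eth_overhead = 14 + 4 + 8 + 12     # Ethernet header + FCS + preamble + inter-frame gap
    return (mtu - ip_tcp_overhead) / (mtu + eth_overhead)

for mtu in (1500, 8192):
    print(mtu, f"{10.0 * payload_efficiency(mtu):.2f} Gbps max payload at 10 Gbps line rate")
# ~9.41 Gbps for MTU 1500 and ~9.89 Gbps for 8 KB jumbo frames; the measured 7.2-7.5 Gbps
# is therefore limited by the end systems and the WAN PHY line rate, not by header overhead.
```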
Tokyo-CERN Network connection
[Map of the network used in the experiment: the path runs Tokyo – Seattle – Vancouver – Calgary – Minneapolis – Chicago – Amsterdam – Geneva over the IEEAF, CANARIE (CA*net 4), and SURFnet networks; end systems sit at Tokyo and Geneva, and every intermediate point is an L1 or L2 switch.]
Network topology of CERN-Tokyo experiment
[Figure: Data Reservoirs at the University of Tokyo and at CERN (Geneva). Each end has IBM x345 servers (dual Intel Xeon 2.4 GHz, 2 GB memory; Linux 2.6.6 on Nos. 2-7, Linux 2.4.x on No. 1) on GbE, plus an Opteron server (dual Opteron 248, 2.2 GHz, 1 GB memory, Linux 2.6.6) with a Chelsio T110 NIC. Site switches include a Fujitsu XG800 12-port switch, Foundry BI MG8, Foundry FESX448, Foundry NetIron 40G, and Extreme Summit 400. The 10GBASE-LW path runs T-LEX (Tokyo) – Pacific Northwest Gigapop (Seattle) – StarLight (Chicago) – NetherLight (Amsterdam) over WIDE/IEEAF, CA*net 4, and SURFnet.]
LSR experiments
• Targets
  – > 30,000 km LSR distance
  – L3 switching at Chicago and Amsterdam
  – Period of the experiment
    • 12/20 – 1/3
    • Holiday season, when the public research networks are mostly idle
• System configuration
  – A pair of Opteron servers with Chelsio T110 NICs (at N-Otemachi)
  – Another pair of Opteron servers with Chelsio T110 NICs for competing-traffic generation
  – ClearSight 10 Gbps packet analyzer for packet capture
Network used in the experiment
[Figure 2. Network connection: the LSR path crosses Tokyo, Seattle, Vancouver, Calgary, Minneapolis, Chicago, Amsterdam, and New York over the IEEAF/Tyco/WIDE, CANARIE (CA*net 4), SURFnet, APAN/JGN2, and Abilene networks; nodes are either routers / L3 switches or L1 / L2 switches.]
Single-stream TCP: Tokyo – Chicago – Amsterdam – NY – Chicago – Tokyo
[Figure: route of the single-stream TCP LSR run. Opteron servers with Chelsio T110 NICs, a Fujitsu XG800 switch, and a ClearSight 10 Gbps capture unit sit at Tokyo (T-LEX). The path crosses the Pacific on the IEEAF/Tyco/WIDE WAN PHY link to Seattle (Pacific Northwest Gigapop), continues on CANARIE OC-192 via Vancouver, Calgary, and Minneapolis to Chicago (StarLight, Force10 E1200), then on SURFnet OC-192 across the Atlantic to Amsterdam (NetherLight; University of Amsterdam Force10 E600), back to New York (MANLAN), and returns over Abilene (T640, HDXc) and APAN/JGN2 TransPAC (Procket 8801/8812, Cisco 12416, Cisco 6509) to Tokyo; ONS 15454 and OME 6550 SONET equipment carries the CANARIE and SURFnet segments.]
Network Traffic on routers and switches
[Traffic graphs for the submitted run, observed on the following routers and switches:]
• StarLight: Force10 E1200
• University of Amsterdam: Force10 E600
• Abilene: T640 (NYCM to CHIN)
• TransPAC: Procket 8801
Summary
• Single-stream TCP
  – We removed the TCP-related difficulties
  – Now I/O bus bandwidth is the bottleneck
  – Cheap, simple servers can enjoy a 10 Gbps network
• Lack of methodology for debugging high-performance networks
  – 3 days of debugging (working overnight)
  – 1 day of stable operation (usable for measurements)
  – Networks seem to fatigue: some trouble always happens
  – We need something more effective
• Detailed issues
  – Flow control (and QoS)
  – Buffer size and policy
  – Optical level setting
Systems used in long-distance TCP experiments
CERN, Pittsburgh, Tokyo
Efficient and effective utilization of High-speed internet
• Efficient and effective utilization of a 10 Gbps network is still very difficult
• PHY, MAC, data link, and switches
  – 10 Gbps is ready to use
• Network interface adapters
  – 8 Gbps is ready to use, 10 Gbps in several months
  – Proper offloading, RDMA implementation
• I/O bus of a server
  – 20 Gbps is necessary to drive a 10 Gbps network
• Drivers, operating system
  – Too many interrupts, buffer memory management
• File system
  – Slow NFS service
  – Consistency problems
Difficulty in 10 Gbps Data Reservoir
• Disk-to-disk single-stream TCP data transfer
  – High CPU utilization (performance limited by the CPU)
    • Too many context switches
    • Too many interrupts from the network adapter (> 30,000/s)
    • Data copies from buffer to buffer
• I/O bus bottleneck
  – PCI-X/133: maximum 7.6 Gbps data transfer
    • Waiting for PCI-X/266 or PCI Express x8 or x16 NICs
  – Disk performance
    • Performance limit of the RAID adapter
    • Number of disks needed for the data transfer (> 40 disks are required)
• File system
  – High bandwidth in file service is more difficult than data sharing
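The 7.6 Gbps figure follows from the raw 64-bit/133 MHz PCI-X rate minus bus protocol overhead; a quick check (the ~10% overhead factor is an assumption for illustration):

```python
# Raw and effective PCI-X 64-bit / 133 MHz bandwidth.
bus_width_bits = 64
clock_hz = 133e6
raw = bus_width_bits * clock_hz          # ~8.5 Gbps raw
effective = raw * 0.9                    # assumed ~10% lost to bus protocol and arbitration
print(f"raw {raw/1e9:.2f} Gbps, effective ~{effective/1e9:.1f} Gbps")
# raw 8.51 Gbps, effective ~7.7 Gbps: consistent with the 7.6 Gbps quoted above,
# and below the ~20 Gbps of I/O bandwidth needed for full-rate 10 GbE plus disk traffic.
```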
High-speed IP networks in supercomputing (GRAPE-DR project)
• World's fastest computing system
  – 2 PFLOPS in 2008 (performance on actual application programs)
• Construction of a general-purpose massively parallel architecture
  – Low power consumption at PFLOPS-range performance
  – An MPP architecture more general-purpose than vector architecture
• Use of commodity networks for interconnection
  – 10 Gbps optical network (2008) + MEMS switches
  – 100 Gbps optical network (2010)
[Chart: peak FLOPS vs. year, 1970 to 2050, on a log scale up to 10^30 FLOPS, with curves for processor chips and parallel processors; marked points include the Earth Simulator (40 TFLOPS), the GRAPE-DR target (2 PFLOPS), and the KEISOKU supercomputer (10 PFLOPS).]
GRAPE-DR architecture
[Figure: a GRAPE-DR chip: 512 PEs, each with an integer ALU, a floating-point ALU, and local memory, connected by an on-chip network to on-chip shared memory and to the outside world.]
• Massively parallel processor
• Pipelined connection of a large number of PEs
• SIMASD (Single Instruction on Multiple and Shared Data); see the toy sketch below
  – Every instruction operates on data in the local memories and the shared memory
  – An extension of vector architecture
• Issues
  – Compiler for the SIMASD architecture (currently under development: flat-C)
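As an illustration of the SIMASD idea (a toy model, not GRAPE-DR code or its instruction set): every PE executes the same instruction, combining an operand from its own local memory with an operand broadcast from shared memory.

```python
# Toy model of SIMASD execution: one instruction, applied by every PE in lockstep
# to its local-memory operand together with a shared-memory (broadcast) operand.
# Purely illustrative; not the GRAPE-DR instruction set.

def simasd_step(local_mem: list[float], shared_operand: float, op) -> list[float]:
    """All PEs execute the same op on (local, shared) in lockstep."""
    return [op(x, shared_operand) for x in local_mem]

if __name__ == "__main__":
    local = [1.0, 2.0, 3.0, 4.0]                           # one value per PE's local memory
    result = simasd_step(local, 10.0, lambda a, b: a * b)  # broadcast multiply
    print(result)                                          # [10.0, 20.0, 30.0, 40.0]
```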
Hierarchical construction of GRAPE-DR
• 512 PEs / chip, 512 GFLOPS / chip
• 2K PEs / PCI board, 2 TFLOPS / PCI board
• 8K PEs / server, 8 TFLOPS / server
• 1M PEs / node, 1 PFLOPS / node
• 2M PEs / system, 2 PFLOPS / system
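The hierarchy is consistent at 1 GFLOPS per PE. A quick check of the multipliers implied by the slide (assuming binary K/M, i.e. 2K = 2048 and 1M = 2^20):

```python
# Multipliers implied by the GRAPE-DR hierarchy, at 1 GFLOPS per PE throughout.
pe_per_chip = 512
chips_per_board = 2048 // pe_per_chip         # 4 chips per PCI board
boards_per_server = 8192 // 2048              # 4 boards per server
servers_per_node = (1 << 20) // 8192          # 128 servers per node
nodes_per_system = (2 << 20) // (1 << 20)     # 2 nodes per system
total_pe = (pe_per_chip * chips_per_board * boards_per_server
            * servers_per_node * nodes_per_system)
print(chips_per_board, boards_per_server, servers_per_node, nodes_per_system)
print(f"{total_pe} PEs -> {total_pe * 1e9 / 1e15:.0f} PFLOPS at 1 GFLOPS per PE")
# 4 4 128 2 ; 2097152 PEs -> 2 PFLOPS
```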
Network architecture inside a GRAPE-DR system
[Figure: network inside a GRAPE-DR system. Components: AMD-based servers with memory on their memory buses and an adaptive compiler; optical interfaces; a MEMS-based optical switch; a 100 Gbps iSCSI server and IP storage system; a highly functional router to the outside IP network; a total-system conductor for dynamic optimization; KOE.]
Fujitsu Computer Technologies, LTD