1
End-host Route Selection in the CHEETAH Networking Solution
Zhanxiang Huang
05/01/2006
Advisor: Malathi Veeraraghavan
Master’s Project Presentation
Acknowledgement: This work was carried out under the sponsorship of NSF ITR-0312376, NSF ANI-0335190, NSF ANI-0087487, and DOE DE-FG02-04ER25640 grants.
2
Outline
• CHEETAH project overview
• End-host route selection problem
• Model-based solution
• Measurement-based solution
• Conclusion and future work
3
Circuit-switched High-speed End-to-End Transport ArcHitecture (CHEETAH)
Goal: high-speed, rate-guaranteed end-to-end circuits with call-by-call bandwidth sharing
[Figure: existing options shown for comparison. Connectionless best-effort Internet (congestion, delay, jitter, loss); long-term leased line (under-utilized and expensive); telephony network (64 kbps circuits, end-to-end connection).]
4
CHEETAH Applications
• Applications:
– video telephony
– high-speed file transfer
– remote visualization
• Especially in the eScience community, e.g. the Terascale Supernova Initiative (TSI) project
5
Current CHEETAH Network
• high-speed network
• dynamic signaling scheme
• end-host software
[Figure: network diagram. SN16000 switches, each with a control card, a signaling engine, OC192 cards and GbE/10GbE cards, interconnected by OC-192 links. Labeled sites and hosts: ORNL (Cray X-1), Atlanta, NC, NCSU, UVA, CUNY, GTech.]
6
CHEETAH End-host Software Architecture
– OCS: checks Optical Connection Service availability.
– Routing Decision: chooses between the circuit and the Internet path for each file transfer.
– RSVP-TE Module: dynamically provisions circuits.
– C-TCP: transport-layer protocol optimized for circuits.
[Figure: two end-hosts, each running an application over TCP/IP alongside the CHEETAH software (OCS Client, Routing Decision, RSVP-TE Module, C-TCP). Each end-host has two NICs (NIC I and NIC II), connecting it to the Internet and to the CHEETAH network.]
7
Circuit or Internet Path?
• Circuit setup requests may be denied.
• The choice depends on the data transfer delays on the two paths.
An extreme example: transfer a 1K-byte file using TCP.
– Internet (best-effort path): round trip time = 24 ms, bottleneck link rate = 100 Mbps. Internet transfer delay is about 100 ms.
– CHEETAH network (circuit): round trip time = 8 ms, circuit rate = 1 Gbps, setup delay = 5 seconds. Circuit transfer delay is about 5.1 seconds.
8
What Determines Data Transfer Delays?
• Over paths:
– Circuit:
  • Circuit rate
  • Round trip time
  • Setup delay
– Internet:
  • Round trip time
  • Bottleneck link rate
  • Packet loss rate
• At end-hosts:
– Transport layer protocol and parameter settings
– OS process scheduling
– Hard disk throughput
9
How to Estimate Data Transfer Delays?
• Model-based solution
– Construct mathematical models for computing file transfer delays over the circuit and Internet paths.
• Measurement-based solution
– Estimate file transfer delays based on delay measurements of past file transfers.
10
Model-based Solution
• Modeling TCP delay over the Internet path
– TCP Reno delay model [UMass98]
• Modeling delay over the CHEETAH circuit
– Let $P_b$ be the call blocking probability
– Average delay over the circuit is
$$(1 - P_b)\,(\text{setup\_delay} + \text{transfer\_delay\_over\_circuit}) + P_b\,(\text{average\_setup\_failure\_delay} + \text{delay\_over\_Internet})$$
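As a rough illustration of the average-delay expression on this slide, the sketch below weights the successful-setup case and the blocked case by the call blocking probability. The function name, parameter names and the sample numbers are illustrative assumptions, not values from the presentation.

```python
def average_circuit_delay(p_b, setup_delay, circuit_transfer_delay,
                          setup_failure_delay, internet_delay):
    """Expected delay (seconds) when a circuit setup is attempted first."""
    if not 0.0 <= p_b <= 1.0:
        raise ValueError("p_b must be a probability in [0, 1]")
    # Setup succeeds: pay the setup delay, then transfer over the circuit.
    success = setup_delay + circuit_transfer_delay
    # Setup is blocked: pay the failure delay, then fall back to the Internet.
    blocked = setup_failure_delay + internet_delay
    return (1.0 - p_b) * success + p_b * blocked


# Hypothetical inputs chosen only to exercise the formula.
print(average_circuit_delay(0.1, 5.0, 0.5, 5.0, 8.0))
```

With a blocking probability of 0 the result reduces to setup delay plus circuit transfer delay; with a blocking probability of 1 it reduces to the failure delay plus the Internet delay.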
11
Inputs to Delay Models
• Inputs to TCP Reno delay model:
– File size
– Bottleneck link rate
– Round trip time
– Packet loss rate
– Initial congestion window size
– Sender and receiver buffer sizes
• Inputs to circuit delay model:
– File size
– Circuit rate
– Round trip time over the circuit path
– Round trip time over the signaling path
– Call processing delay at each switch
– Signaling engine call load
– Number of switches on the path
– Call blocking probability
12
Limitations of the Model-based Solution
• Packet loss rate is difficult to measure. (Tools I tested include Sting, iperf, ping and badabing, among others.)
• Call blocking probability and signaling engine call load are likewise difficult to measure.
• Many TCP variants are emerging, but there are no delay models for them yet.
– e.g. BIC-TCP has been included in Linux kernel 2.6 but has not been modeled yet.
13
Measurement-based Solution
• Assumptions
– Fixed circuit rates, e.g. 1 Gbps, 100 Mbps, …
– The number of destinations with which an end-host typically communicates is not large.
– Internet traffic has repeating patterns over time, which means that during a specific time period the round trip time, packet loss rate and call blocking probability are likely to be the same.
[Figure: delay versus file size for the circuit and the Internet path; the two curves intersect at the crossover file size.]
Idea: Discretize time and file size. In each time slot, for each destination and each circuit rate, measure the delays of file transfers over both paths to find the crossover file size.
14
Active and Passive Measurements
• Active measurements
– Traffic is injected into the network explicitly for the purpose of obtaining measurements.
• Passive measurements
– Data is collected under normal network usage.
15
A Best-case Active-measurement Experiment
Best-case means packet loss rate and call blocking probability are equal to zero. TCP buffers are set to Bandwidth Delay Product values.
Drawback: significant measurement traffic overhead
16
Active Measurements
Delays on the Internet path and the circuit are random variables, DI and DC.
1. Find an interval (min, max) that contains the crossover file size;
2. Measure delays on both paths for file size mid=(min+max)/2;
3. If |E(DI)-E(DC)|<e (e being a small threshold), crossover=mid and the search stops;
4. If E(DI)>E(DC), max=mid;
5. If E(DI)<E(DC), min=mid;
6. Go to 2;
[Figure: delay versus file size for the circuit and the Internet path, with min, mid and max marked around the crossover file size.]
Drawback: measurement traffic overhead. Let M be the initial max file size and N be the initial min file size. Traffic size = O(M*log(M-N)).
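The bisection steps on this slide can be sketched as follows. This is a minimal illustration, assuming the caller supplies two measurement callbacks that return a list of delay samples for a given file size. The names `find_crossover`, `fake_internet` and `fake_circuit` are hypothetical, and the two synthetic delay models stand in for real transfers.

```python
from statistics import mean


def find_crossover(measure_internet, measure_circuit, lo, hi, eps, max_rounds=60):
    """Bisect on file size until the mean delays on both paths differ by < eps."""
    for _ in range(max_rounds):
        mid = (lo + hi) / 2
        d_i = mean(measure_internet(mid))  # estimate of E(DI) at size mid
        d_c = mean(measure_circuit(mid))   # estimate of E(DC) at size mid
        if abs(d_i - d_c) < eps:
            return mid
        if d_i > d_c:
            hi = mid  # Internet is slower at mid, so the crossover is smaller
        else:
            lo = mid  # circuit is slower at mid, so the crossover is larger
    return (lo + hi) / 2


# Synthetic stand-ins for real measurements (assumption): pure
# propagation-plus-transmission delays, with a 5 s circuit setup delay.
def fake_internet(size_bytes):
    return [0.024 + size_bytes * 8 / 100e6]


def fake_circuit(size_bytes):
    return [5.0 + 0.008 + size_bytes * 8 / 1e9]


crossover = find_crossover(fake_internet, fake_circuit, 0.0, 1e9, eps=1e-3)
print(f"crossover ~ {crossover / 1e6:.1f} MB")
```

Each round transfers a file of size mid over both paths, which is where the measurement traffic overhead noted on the slide comes from.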
17
Passive Measurements
1. Initialize (min, max) with (0, +inf).
2. If file size < min, choose the Internet;
3. If file size > max, choose the circuit;
4. If min <= file size <= max, choose each path with probability ½. Record the data transfer delays.
5. Once there are sufficient records to compute Pr(DI-DC>0) for a file size in (min, max), adjust min or max based on Pr(DI-DC>0).
[Figure: p = Pr(DI-DC>0) versus file size, with 0, 1/2 and 1 marked on the p axis and min, crossover and max marked on the file-size axis.]
(Note that min and max are file sizes in application queries, and that DI and DC are assumed to follow normal distributions.)
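A minimal sketch of the passive scheme above. The path choice follows steps 2 to 4 directly. The slide does not say how min and max are adjusted from Pr(DI-DC>0), so the threshold rule in `adjust_bounds` (and its default of 0.9) is purely an assumption for illustration.

```python
import math
import random


def choose_path(file_size, lo, hi, rng=random):
    """Steps 2-4: pick a path for one transfer given the current (min, max)."""
    if file_size < lo:
        return "internet"
    if file_size > hi:
        return "circuit"
    # Inside the uncertain interval: explore both paths with probability 1/2.
    return "internet" if rng.random() < 0.5 else "circuit"


def adjust_bounds(lo, hi, file_size, pr_internet_slower, threshold=0.9):
    """Step 5 under an ASSUMED rule: tighten a bound only when confident."""
    if pr_internet_slower >= threshold:
        hi = min(hi, file_size)   # circuit clearly wins at this size
    elif pr_internet_slower <= 1.0 - threshold:
        lo = max(lo, file_size)   # Internet clearly wins at this size
    return lo, hi


lo, hi = 0.0, math.inf            # step 1
lo, hi = adjust_bounds(lo, hi, 70, 0.95)
lo, hi = adjust_bounds(lo, hi, 50, 0.05)
print(lo, hi, choose_path(60, lo, hi))
```

Because no traffic is injected, the interval narrows only as fast as applications happen to request transfers of sizes inside it.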
18
Hybrid Measurements
• Fast startup
– Find the bottleneck link rate of the Internet path and the circuit setup delay through either passive or active measurement.
– Solve the following equation for “file_size”:
$$\text{estimated\_setup\_delay} + \frac{\text{file\_size}}{\text{circuit\_rate}} = \frac{\text{file\_size}}{\text{Internet\_path\_bottleneck\_link\_rate}}$$
– Initialize (min, max) with (file_size/2, file_size*2).
• Use active measurements when initiated by administrator users.
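The fast-startup equation has a closed-form solution, file_size = setup_delay / (1/bottleneck_rate - 1/circuit_rate), which the sketch below evaluates. Function names and units (bits, bits per second, seconds) are assumptions made for this example.

```python
def estimate_crossover_bits(setup_delay_s, circuit_rate_bps, bottleneck_rate_bps):
    """Solve setup + size/circuit_rate = size/bottleneck_rate for size (bits)."""
    if circuit_rate_bps <= bottleneck_rate_bps:
        # No positive solution: the circuit never catches up with the Internet path.
        raise ValueError("circuit rate must exceed the Internet bottleneck rate")
    return setup_delay_s / (1.0 / bottleneck_rate_bps - 1.0 / circuit_rate_bps)


def initial_interval(file_size):
    """Initial (min, max) search interval around the estimate."""
    return file_size / 2, file_size * 2


# Rates and setup delay taken from the deck's running example.
size_bits = estimate_crossover_bits(5.0, 1e9, 100e6)
lo, hi = initial_interval(size_bits)
print(f"estimate: {size_bits / 8 / 1e6:.1f} MB")
```

For a 5-second setup delay, a 1 Gbps circuit and a 100 Mbps bottleneck, the estimate is roughly 69 MB, which ignores round trip times and TCP dynamics and therefore only seeds the interval.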
19
Bookkeeping Data Structure
Time Slot | Destination | Circuit Rate | Crossover File Size | Transfer Delay Records (File Size, DI (sec), DC (sec))
02:00 – 03:00 Sunday | 128.109.34.22 | 1Gbps | 50MByte – 70MByte | 50MByte, 5.081, 5.715
 | | | | 60MByte, 5.060, 5.066
 | | | | 70MByte, 5.033, 4.002
 | | | | …
… | | | |
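One way to hold the bookkeeping table in memory is a dictionary keyed by (time slot, destination, circuit rate). The sketch below is an assumption about layout, not the project's actual database schema; the class and field names are hypothetical, and the values are the sample row from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DelayRecord:
    file_size_mbytes: float
    d_internet_s: float   # DI: measured delay over the Internet path
    d_circuit_s: float    # DC: measured delay over the circuit


@dataclass
class CrossoverEntry:
    crossover_min_mbytes: float   # lower bound of the crossover interval
    crossover_max_mbytes: float   # upper bound of the crossover interval
    records: List[DelayRecord] = field(default_factory=list)


# Key: (time slot, destination IP, circuit rate)
Key = Tuple[str, str, str]
db: Dict[Key, CrossoverEntry] = {}

key = ("Sunday 02:00-03:00", "128.109.34.22", "1Gbps")
db[key] = CrossoverEntry(50, 70, [
    DelayRecord(50, 5.081, 5.715),
    DelayRecord(60, 5.060, 5.066),
    DelayRecord(70, 5.033, 4.002),
])

entry = db[key]
print(entry.crossover_min_mbytes, entry.crossover_max_mbytes, len(entry.records))
```

A lookup then costs one dictionary access per application query.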
20
Interaction Between CHEETAH Software Modules and Applications
[Figure: block diagram. Components: Application; Administrator; Routing Decision Module, containing the RD Database, Decision-making (Thread 1), Measurement Monitor (Thread 2) and Active Measurement Scheduler (Thread 3), with a Query Interface, Report Interface, Admin Interface and SysCall Interface; RD API; RSVP API; RSVP / C-TCP Modules; TCP; Measurement Tools. Arrow labels: query, reply, update, trigger, report delays, report blocks, report delays or bandwidth. Steps are numbered 1 to 7.]
21
Evaluation
• Experiment setup
– The Routing Decision server and an application run on a Linux 2.6 box with 2 Xeon 2.8 GHz CPUs and 1 GB memory.
– The application queries with the parameters <128.109.34.22, 1Gbps circuit rate, 1GByte file size, time slot 02:00 Sunday>. The database has an entry corresponding to this IP address and time slot.
– Internet path: bottleneck link rate = 100 Mbps; round trip time = 24 ms. Circuit: round trip time = 8 ms.
• Delay
– An application submits 100 queries.
– Mean query delay = 0.0055 sec < round trip time << 5 sec (the average setup delay).
– Query delay standard deviation = 2.3608e-004 sec < 0.3 ms
22
Conclusion and Future Work
• Conclusion
– The measurement-based solution is better than the model-based solution:
  • Adaptive to new TCP variants
  • Adaptive to traffic pattern changes
  • Adaptive to hardware or software configuration changes
  • Low overhead
• Future work
– Scalability issues
  • For a computer that communicates with a large number of end-hosts (e.g. a web server), we can separate the RD module from the computer and run a separate RD server for it.
  • For computers in the same LAN with the same hardware and software configurations, we can create one RD server for the whole LAN.
23
Reference
[CHEETAH] M. Veeraraghavan, X. Zheng, H. Lee, M. Gardner, W. Feng, CHEETAH: Circuit-switched High-speed End-to-End Transport ArcHitecture, Proc. of Opticomm 2003, Oct. 13-17, 2003. Dallas, TX, Won Best Student Paper Award.
[C-TCP] A. P. Mudambi, X. Zheng, and M. Veeraraghavan, A Transport Protocol for Dedicated End-to-End Circuits, accepted by ICC 2006.
[UMass98] J. Padhye, V. Firoiu, D. Towsley and J. Kurose. Modeling TCP throughput: A simple model and its empirical validation. In SIGCOMM ’98, September 1998.
24
Backup Slides
25
How to compute Pr(DI-DC>0)?
• Assume the delays observed on the Internet path and the circuit are normally distributed random variables, DI and DC. Each file size has its own pair of these random variables.
$$E(D_I - D_C) = E(D_I) - E(D_C)$$
$$V(D_I - D_C) = V(D_I) + V(D_C)$$
$$n = \left(\frac{2\, z_{\alpha}\, \sigma}{w}\right)^2$$
where $z$ is the standard normal distribution, $\sigma$ is the sample standard deviation, $\alpha$ is the confidence level and $w$ is the width of the confidence interval.
[Figure: density P(DI-DC) plotted against DI-DC, with 0 and E(DI-DC) marked on the horizontal axis.]
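Under the normality assumption on this slide, Pr(DI-DC>0) follows from the mean and variance of the difference. The sketch below estimates both from delay samples, assuming DI and DC are independent (so the variances add). The function name is hypothetical and the sample delays in the usage lines are made up.

```python
from statistics import NormalDist, mean, variance


def prob_internet_slower(d_internet, d_circuit):
    """Estimate Pr(DI - DC > 0) from delay samples, assuming normality."""
    mu = mean(d_internet) - mean(d_circuit)          # E(DI - DC)
    var = variance(d_internet) + variance(d_circuit)  # V(DI - DC), independence assumed
    sigma = var ** 0.5
    if sigma == 0:
        # Degenerate case: no spread in the samples.
        return 1.0 if mu > 0 else 0.0
    return 1.0 - NormalDist(mu, sigma).cdf(0.0)


# Hypothetical delay samples (seconds) for one file size.
print(prob_internet_slower([5.1, 5.3, 5.2], [5.0, 5.4, 5.1]))
```

A value near 1/2 indicates that the file size is close to the crossover.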
26
CHEETAH network
[Figure: testbed topology, by Xuan Zheng. Sites: ORNL, Atlanta, NC, UVa, CUNY. Labeled CHEETAH switches: cheetah-nc, Cheetah-atl and Cheetah-ornl, with OC192, 10GbE and GbE interfaces and an OC-192 lambda. End hosts on 1G links: Orbitty compute nodes Compute-0-0 to Compute-0-4 (152.48.249.2 to 152.48.249.6), Wukong (152.48.249.102), Wuneng (152.48.249.103), Zelda1 to Zelda5 (10.0.0.11 to 10.0.0.15), a UVa host and a CUNY host. Other labeled equipment: Centuar FastIron FESX448, MCNC Catalyst 7600, UVa Catalyst 4948, two Juniper T320, Force10 E300 switch, WASH HOPI Force10, WASH Abilene T640, NCSU M20, CUNY Foundry, NYC HOPI Force10, X1(E), UCNS. Link types in the legend: direct fibers, VLANs (3x1G VLAN), MPLS tunnels (2x1G MPLS tunnels).]
27
Delay model
The average delay using a CHEETAH circuit is:
$$E[T_{cheetah}] = (1 - P_b)\,(E[T_{setup}] + E[T_{circuit}]) + P_b\,(E[T_{fail}] + E[T_{tcp}^{primary}]) \qquad (1)$$
Comparing (1) with $E[T_{tcp}^{primary}]$, we get:
$$(1) - E[T_{tcp}^{primary}] = (1 - P_b)\,(E[T_{setup}] + E[T_{circuit}] - E[T_{tcp}^{primary}]) + P_b\,E[T_{fail}] \qquad (2)$$
Approximating $E[T_{fail}]$ by $E[T_{setup}]$, we get:
$$(2) \approx (1 - P_b)\,(E[T_{circuit}] - E[T_{tcp}^{primary}]) + E[T_{setup}] \qquad (3)$$
If (3)<0 then the application should try to set up a circuit; otherwise it should use the primary access link.
In other words, if
$$E[T_{tcp}^{primary}] - E[T_{circuit}] > \frac{E[T_{setup}]}{1 - P_b} \qquad (4)$$
then attempt circuit setup; otherwise resort to the TCP/IP Internet path.
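The decision rule at the end of this slide (attempt a circuit when the expected TCP delay exceeds the expected circuit transfer delay by more than the expected setup delay divided by one minus the blocking probability) can be written as a small predicate. The function name and the sample inputs are illustrative assumptions.

```python
def should_attempt_circuit(e_tcp, e_circuit, e_setup, p_b):
    """Return True when the rule favors attempting a circuit setup.

    e_tcp     -- expected transfer delay over the Internet path (s)
    e_circuit -- expected transfer delay over the circuit, excluding setup (s)
    e_setup   -- expected circuit setup delay (s)
    p_b       -- call blocking probability
    """
    if not 0.0 <= p_b <= 1.0:
        raise ValueError("p_b must be a probability in [0, 1]")
    if p_b == 1.0:
        return False  # every setup attempt is blocked
    return e_tcp - e_circuit > e_setup / (1.0 - p_b)


# Hypothetical expected delays for a large and a small file.
print(should_attempt_circuit(80.0, 8.0, 5.0, 0.1))   # large file
print(should_attempt_circuit(0.1, 0.01, 5.0, 0.1))   # small file
```

A higher blocking probability raises the bar, since a failed attempt wastes roughly one setup delay before falling back to the Internet path.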
28
Circuit delay model (1)
(a) $P_b$ is the call blocking probability.
(b) $$E[T_{circuit}] = \frac{T_{prop}^{circuit}}{2} + \frac{f}{r_c},$$
in which
$f$ is the size of the file to transfer,
$r_c$ is the data rate of the circuit, and
$T_{prop}^{circuit}$ is the round-trip propagation delay of the circuit.
29
Circuit delay model (2)
(c) $$E[T_{setup}] = \frac{m_{sig}}{r_s}\left[1 + \frac{\rho_{sig}}{2(1-\rho_{sig})}\right](k+1) + \left[1 + \frac{\rho_{sp}}{2(1-\rho_{sp})}\right] k\,T_{sp} + T_{prop}^{signaling}, \qquad (6)$$
in which
$m_{sig}$ is the cumulative size of signaling messages used in circuit setup,
$r_s$ is the signaling link rate, assuming all the signaling links have the same rate,
$\rho_{sig}$ is the traffic load of the M/D/1 queue model of the signaling link,
$k$ is the number of switches on the circuit path,
$T_{sp}$ is the call-processing delay incurred at each switch,
$\rho_{sp}$ is the traffic load of the M/D/1 queue model of the signaling processor, and
$T_{prop}^{signaling}$ is the round-trip propagation delay of the circuit's signaling path.
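A sketch of the setup-delay model on this slide, assuming the expression reads as k+1 signaling-link terms plus k switch-processing terms plus the signaling-path propagation delay, each queueing term being the M/D/1 mean time in system (service time multiplied by 1 + rho/(2(1-rho))). Function names, units and the sample inputs are assumptions for illustration.

```python
def md1_sojourn_factor(rho):
    """Mean time in an M/D/1 system, in units of the service time."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("load rho must be in [0, 1)")
    return 1.0 + rho / (2.0 * (1.0 - rho))


def expected_setup_delay(m_sig_bits, r_s_bps, rho_sig, k, t_sp, rho_sp,
                         t_prop_signaling):
    """E[T_setup] in seconds under the assumed reading of the slide's formula."""
    # Signaling messages cross k + 1 links, each modeled as an M/D/1 queue.
    transmission = (k + 1) * (m_sig_bits / r_s_bps) * md1_sojourn_factor(rho_sig)
    # Each of the k switches processes the call, again modeled as M/D/1.
    processing = k * t_sp * md1_sojourn_factor(rho_sp)
    return transmission + processing + t_prop_signaling


# Hypothetical inputs: 8000 bits of signaling, 1 Mbps links, 2 switches.
print(expected_setup_delay(8000, 1e6, 0.3, 2, 0.1, 0.3, 0.02))
```

With both loads at zero, the result is just the transmission, processing and propagation times added together.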
30
TCP-Reno delay model (1)
(d) $$E[T_{tcp}^{primary}] = E[T_{ss}] + E[T_{loss}] + E[T_{ca}] + E[T_{delack}]$$
$E[T_{delack}]$ depends on the specific operating system. Approximate $E[T_{delack}]$ as 100 ms for BSD-derived stacks and 150 ms for Windows.
(i) Calculate $E[T_{ss}]$
$$E[T_{ss}] = \begin{cases} RTT \cdot \left[\log_\gamma\left(\dfrac{E[d_{ss}](\gamma - 1)}{w_1} + 1\right)\right], & \text{when } E[W_{ss}] \le W_{max} \\[2ex] RTT \cdot \left[\log_\gamma\left(\dfrac{W_{max}}{w_1}\right) + 1 + \dfrac{1}{W_{max}}\left(E[d_{ss}] - \dfrac{\gamma W_{max} - w_1}{\gamma - 1}\right)\right], & \text{otherwise} \end{cases}$$
in which
$RTT$ = [the round trip delay]
$\gamma = 1 + 1/b$ = [the rate of exponential growth of cwnd during slow start]
$$E[d_{ss}] = \frac{[1 - (1-p)^d](1-p)}{p} + 1$$
$w_1$ = [sender's initial cwnd size]
$W_{max}$ = [the maximum window we would expect TCP to achieve at the end of slow start]
$d$ = [number of segments to send]
$p$ = [data segment loss rate]
$$E[W_{ss}] = \frac{E[d_{ss}](\gamma - 1)}{\gamma} + \frac{w_1}{\gamma}$$
$b$ = [number of segments to send a delayed ACK]
31
TCP-Reno delay model (2)
(ii) Calculate $E[T_{loss}]$
$$E[T_{loss}] = l_{ss}\left[Q(p, E[W_{ss}])\,E[Z^{TO}] + \left(1 - Q(p, E[W_{ss}])\right) RTT\right]$$
$$l_{ss} = 1 - (1-p)^d$$
$$Q(p, w) = \min\left(1, \frac{1 + (1-p)^3\left[1 - (1-p)^{w-3}\right]}{\left[1 - (1-p)^w\right] / \left[1 - (1-p)^3\right]}\right)$$
$$E[Z^{TO}] = \frac{G(p)\,T_0}{1 - p}$$
$$G(p) = 1 + \sum_{i=1}^{6} 2^{i-1} p^i$$
$T_0$ = [the average duration of the first TO in a sequence of one or more successive timeouts].
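The loss-recovery term of the TCP Reno delay model can be evaluated numerically as sketched below, assuming the standard forms G(p) = 1 + p + 2p^2 + ... + 32p^6 and Q(p, w) as the probability that a loss is recovered by a timeout rather than by fast retransmit. Function names and the sample inputs are illustrative assumptions; p must lie strictly between 0 and 1 and w must be positive.

```python
def g(p):
    """G(p) = 1 + sum_{i=1..6} 2^(i-1) p^i, from exponential timeout backoff."""
    return 1.0 + sum(2 ** (i - 1) * p ** i for i in range(1, 7))


def q(p, w):
    """Q(p, w): probability that a loss in a window of w segments causes a timeout."""
    if not 0.0 < p < 1.0 or w <= 0:
        raise ValueError("need 0 < p < 1 and w > 0")
    a = 1.0 - p
    numerator = 1.0 + a ** 3 * (1.0 - a ** (w - 3))
    denominator = (1.0 - a ** w) / (1.0 - a ** 3)
    return min(1.0, numerator / denominator)


def expected_loss_delay(p, d, w_ss, rtt, t0):
    """E[T_loss] for a d-segment transfer with expected slow-start window w_ss."""
    l_ss = 1.0 - (1.0 - p) ** d           # probability slow start ends with a loss
    e_z_to = g(p) * t0 / (1.0 - p)        # expected duration of a timeout sequence
    q_val = q(p, w_ss)
    return l_ss * (q_val * e_z_to + (1.0 - q_val) * rtt)


# Hypothetical inputs: 1% loss, 100 segments, window 20, RTT 24 ms, T0 = 1 s.
print(expected_loss_delay(0.01, 100, 20, 0.024, 1.0))
```

The result is sensitive to p, which is one reason the deck notes that an accurate packet loss rate is hard to obtain.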
32
TCP-Reno delay model (3)
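(iii) Calculate $E[T_{ca}]$
$$E[T_{ca}] = E[d_{ca}] \,/\, R(p, RTT, T_0, W_{max})$$
$$E[d_{ca}] = d - E[d_{ss}]$$
$$W(p) = \frac{2 + b}{3b} + \sqrt{\frac{8(1-p)}{3bp} + \left(\frac{2 + b}{3b}\right)^2}$$
$$R(p, RTT, T_0, W_{max}) = \begin{cases} \dfrac{\dfrac{1-p}{p} + \dfrac{W(p)}{2} + Q(p, W(p))}{RTT\left(\dfrac{b}{2} W(p) + 1\right) + \dfrac{Q(p, W(p))\,G(p)\,T_0}{1-p}}, & \text{when } W(p) < W_{max} \\[3ex] \dfrac{\dfrac{1-p}{p} + \dfrac{W_{max}}{2} + Q(p, W_{max})}{RTT\left(\dfrac{b}{8} W_{max} + \dfrac{1-p}{p\,W_{max}} + 2\right) + \dfrac{Q(p, W_{max})\,G(p)\,T_0}{1-p}}, & \text{otherwise} \end{cases}$$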
33
Binary Search Algorithm for Determining the Crossover File Size for One Destination
[Flowchart. Recoverable steps: Start; start the setup delay timer and call the bandwidth requester; on setup success, stop the setup delay timer (after too many failures, End); initialize sl, s and su from setup_delay*circuit_rate and set cover = false; time a transfer of a file of size s over the circuit and compute the circuit throughput; tear down the circuit; time a transfer of a file of size s over the Internet and compute the Internet throughput; compare the Internet and circuit throughputs; if | T_Internet - T_Circuit | < delta, the crossover file size is s and the DB is updated (End); otherwise update sl or su (su = 2*su while cover is false, cover = true once the interval brackets the crossover), set s = (sl+su)/2 and repeat.]
s denotes the file size, sl the lower bound of s, su the upper bound of s, cover whether or not (sl, su) has covered the crossover file size, and delta the threshold for the difference between circuit and Internet throughputs.
34
Measurement example room in
35
Experiment setup
Host: mvstu6
CPU: 2 CPUs, each an Intel(R) Xeon(TM) CPU 2.80GHz with 1024KB cache
Memory: 1GB
Hard disk: 1 MegaRAID, Model: LD 0 RAID0 69G
OS: 2.6.12-1.1381_FC3smp
File system: EXT3
NIC: Intel PRO/1000 Single Port Adapters working at rate 100Mbps, Full Duplex
36
Acronyms
• CHEETAH – Circuit-switched High-speed End-to-End Transport ArcHitecture
• PLR – Packet Loss Rate
• SD – Setup/Teardown Delay
• RTT – Round Trip Time
• AB – Available Bandwidth
• GMPLS – Generalized Multi-Protocol Label Switching
• SONET – Synchronous Optical NETwork
• SDH – Synchronous Digital Hierarchy