Efficient and Robust Traffic Engineering in a Dynamic Environment
Ph.D. Dissertation Defense
Hao Wang
Advisor: Y. Richard Yang
Committee: Joan Feigenbaum, Jennifer Rexford (Princeton), Avi Silberschatz
04/19/23 2
Collaborators (2003 – 2008)
Richard Alimi, Alex Gerber (AT&T Research), Albert Greenberg (Microsoft Research), Paul H. Liu, Zheng Ma, Lili Qiu (UT Austin), Jia Wang (AT&T Research), Ye Wang, Haiyong Xie, Y. Richard Yang, Yin Zhang (UT Austin)
04/19/23 3
Internet as a Social Infrastructure
Growing fast
  Annual traffic growth 2007 – 2008 [TeleGeography Research]: US domestic, 112%
  IPTV 2007 – 2011: tenfold increase
Applications with more stringent requirements, e.g.,
  IPTV / VoD
  VoIP, telecommuting / video-conferencing
04/19/23 4
Internet Routing
Determines paths from source to destination
For applications: delay, loss, delay jitter, …
For ISPs:
  Efficiency of resource usage
  Capability of failure recovery
  Scalability
Routing is becoming more important
  High cost of network assets
  Highly competitive nature of the ISP market
04/19/23 5
Traffic Engineering (TE)
Objective: efficient & reliable routing
Input: Topology, Traffic, ISP Objective, App. Requirement
Output: Routing
[Figure: example network with nodes A through F and traffic split ratios 0.8 and 0.2]
04/19/23 6
Challenges in a Dynamic Environment
Traffic fluctuates
Topology changes
Multiple ISP objectives
[Figure: Traffic Engineering takes Traffic, Topology, ISP Objective, and App. Requirement as inputs and produces Routing]
04/19/23 7
Challenge due to Traffic Dynamics
Traffic fluctuates, e.g.,
  Diurnal patterns
  Worms/viruses, DoS attacks, flash crowds
  BGP routing changes, load balancing by multihomed customers, TE by peers, failures in other networks
Implications: can lead to long delay, high loss, reduced throughput
04/19/23 8
Challenge due to Topology Dynamics
Topology changes, e.g.,
  Maintenance, failures, misconfigurations
  Accidents / disasters, e.g., 675,000 excavation accidents in 2004 [Common Ground Alliance]
  Network cable cuts every few days …
Implications: substantial disruption to the Internet
  E.g., on Jan. 9, 2006, two link failures in a major US ISP led to disconnection of millions of wireless users and partition of many office networks
04/19/23 9
Challenge due to ISP Objectives
Network traffic & topology are relatively stable most of the time
Unexpected scenarios do happen
  Many unexpected scenarios happen when service is most valuable: disasters, flash crowds, …
  Unhandled dynamics => violation of SLA
  Customers remember bad experiences very well
How to balance between:
  Common-case performance
  Unexpected-case performance
04/19/23 12
Summary of Contributions
Traffic Variations, Topology Changes, ISP Objectives
Network Infrastructure
Common-case Optimization with Penalty Envelope [SIGCOMM’06]
Interdomain Reliability Service [SIGCOMM’07]
Resilient Routing Reconfiguration
04/19/23 13
What’s Next
Traffic Variations, Topology Changes, ISP Objectives
Network Infrastructure
Common-case Optimization with Penalty Envelope
Interdomain Reliability Service
Resilient Routing Reconfiguration
04/19/23 14
Challenge: Unpredictable Traffic
Internet traffic is highly unpredictable!
  Can be relatively stable most of the time …
  However, it usually contains spikes that ramp up extremely quickly; we saw traffic spikes in the traces of several networks
Unpredictable traffic variations have been observed and studied by other researchers [Teixeira et al. ’04, Uhlig & Bonaventure ’02, Xu et al. ’05]
Confirmed by operators of several large networks via email survey
04/19/23 15
Previous TE Approaches
Prediction-based TE
Examples:
  Off-line: single predicted TM [Sharad et al. ’05]; multiple predicted TMs [Zhang et al. ’05]
  On-line: MATE [Elwalid et al. ’01] & TeXCP [Kandula et al. ’05]
Pro: works great when traffic is predictable
Con: may pay a high penalty when real traffic deviates substantially from the prediction
04/19/23 16
Previous TE Approaches (cont’d)
Traffic-oblivious routing
Examples:
  Oblivious routing [Racke ’02, Azar et al. ’03, Applegate et al. ’03]
  Valiant load-balancing [Kodialam et al. ’05, Zhang & McKeown ’04]
Pro: provides worst-case performance bounds
Con: may be sub-optimal for normal traffic
  The optimal oblivious ratio of several real network topologies studied in [Applegate et al. ’03] is ~2
  An over-provisioning rate of ~2 is still too high
04/19/23 17
Our Approach: COPE
Objectives
  Good common-case performance
  Tight worst-case guarantee
Our approach
  Common-case optimization: optimize for predicted, normal traffic load
  Penalty envelope: bound worst-case performance for the unexpected
04/19/23 18
COPE Illustrated
Common-case Optimization with Penalty Envelope
[Figure: sets C and X; optimize for set C, bound for set X]
\[
\min_f \max_{d \in C} P_C(f, d) \quad \text{s.t.} \quad (1)\ f \text{ is a routing}; \quad (2)\ \forall x \in X:\ P_X(f, x) \le PE
\]
C: common-case (predicted) TMs
X: all TMs of interest
P_C(f,d): common-case penalty function
P_X(f,x): worst-case penalty function
PE: penalty envelope
04/19/23 19
COPE in Perspective
Spectrum of TE with unpredictable traffic (position controllable by penalty envelope):
  Prediction-based TE: best common-case + poor worst-case
  COPE: good common-case + bounded worst-case
  Oblivious routing: poor common-case + best worst-case
The worst unexpected case is too unlikely to occur: too wasteful to “optimize” for the worst case (at the cost of poor common-case performance)
There are enough unexpected cases: a penalty envelope is required
04/19/23 20
Model: More Details
Network topology: graph G = (V, E)
  V: set of routers
  E: set of network links; link (i,j) has capacity c_ij
Traffic matrices (TMs)
  A TM is a set of demands: d = { d_ab | a, b ∈ V }
  d_ab: traffic demand from a to b
Link-based routing
  f = { f_ab(i,j) | a, b ∈ V, (i,j) ∈ E }
  f_ab(i,j): the fraction of demand from a to b (i.e., d_ab) that is routed through link (i,j)
04/19/23 21
Routing Performance Metric
Maximum Link Utilization (MLU):
\[
U(f, d) = \max_{(i,j) \in E} \sum_{a,b \in V} d_{ab}\, f_{ab}(i,j) / c_{ij}
\]
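The MLU definition can be evaluated directly from a link-based routing. The following is a minimal sketch, assuming dictionaries keyed by O-D pair and by link; the function name and the toy three-link topology are illustrative and do not come from the talk.

```python
from collections import defaultdict


def max_link_utilization(capacity, demands, routing):
    """Return U(f, d) = max over links of (sum_ab d_ab * f_ab(i,j)) / c_ij.

    capacity: {(i, j): c_ij}
    demands:  {(a, b): d_ab}
    routing:  {(a, b): {(i, j): f_ab(i, j)}}  (fractions in [0, 1])
    """
    load = defaultdict(float)
    for od_pair, volume in demands.items():
        for link, fraction in routing.get(od_pair, {}).items():
            load[link] += volume * fraction
    return max(load[link] / capacity[link] for link in capacity)


# Hypothetical toy network: 75% of A->B goes direct, 25% detours via C.
capacity = {("A", "B"): 10.0, ("A", "C"): 10.0, ("C", "B"): 10.0}
demands = {("A", "B"): 8.0}
routing = {("A", "B"): {("A", "B"): 0.75, ("A", "C"): 0.25, ("C", "B"): 0.25}}
mlu = max_link_utilization(capacity, demands, routing)
```

Here the direct link carries 6 of its 10 units of capacity, so it determines the MLU.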
04/19/23 22
Penalty Envelope – A Case Study
X: the set of TMs with access capacity constraints
P_X(f, x): MLU of routing f on demand x
PE: upper bound on MLU
Direct formulation of ∀x ∈ X: P_X(f, x) ≤ PE:
\[
\forall \text{ link } l,\ \forall d_{ab} \text{ such that } \forall a \in V:\ \sum_{b \in V} d_{ab} \le ECR_a,\ \sum_{b \in V} d_{ba} \le ICR_a: \qquad \sum_{a,b \in V} d_{ab}\, f_{ab}(l) / c(l) \le PE
\]
Problem: infinite # of cases
04/19/23 23
Slave LP
Test whether the constraint is satisfied by solving the following LP for each link l:
\[
\begin{aligned}
r^* = \max\ & \sum_{a,b \in V} d_{ab}\, f_{ab}(l) / c(l) \\
\text{subject to: } & \forall a, b \in V:\ d_{ab} \ge 0, \\
& \forall a \in V:\ \sum_{b \in V} d_{ab} \le ECR_a,\ \sum_{b \in V} d_{ba} \le ICR_a
\end{aligned}
\]
Constraint satisfied if slave LP objective r* ≤ PE
Slave LP can serve as separation oracle in the Ellipsoid Method: still not fast enough
Can we use the Interior-Point Method?
04/19/23 24
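Dual of Slave LP
Introduce dual variables λ_la, μ_la for the access capacity constraints
Dual function of the slave LP:
\[
q_l(\lambda_l, \mu_l) = \sup_{d_{ab} \ge 0} \Big[ \sum_{a,b \in V} d_{ab}\, f_{ab}(l) / c(l) - \sum_{a \in V} \lambda_{la} \Big( \sum_{b \in V} d_{ab} - ECR_a \Big) - \sum_{a \in V} \mu_{la} \Big( \sum_{b \in V} d_{ba} - ICR_a \Big) \Big], \quad \lambda_{la} \ge 0,\ \mu_{la} \ge 0
\]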
04/19/23 25
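Dual of Slave LP
With dual variables λ_la, μ_la for the access capacity constraints, the dual function is
\[
q_l(\lambda_l, \mu_l) =
\begin{cases}
\sum_{a \in V} (\lambda_{la} ECR_a + \mu_{la} ICR_a), & \text{if } f_{ab}(l) / c(l) \le \lambda_{la} + \mu_{lb},\ \forall a, b \in V \\
+\infty, & \text{otherwise}
\end{cases}
\]
Since no duality gap, r* ≤ PE iff
\[
\exists\, \lambda_{la} \ge 0,\ \mu_{la} \ge 0\ (a \in V): \quad \forall a, b \in V:\ f_{ab}(l) / c(l) \le \lambda_{la} + \mu_{lb}, \qquad \sum_{a \in V} (\lambda_{la} ECR_a + \mu_{la} ICR_a) \le PE
\]
Polynomial # of constraints, can use the Interior-Point Method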
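The dual condition replaces infinitely many traffic matrices with a finite certificate per link. The sketch below only checks a given certificate; it does not solve the LP. It assumes the two dual variables per node are written lam[a] and mu[a] (names chosen here for illustration), and the two-node example is hypothetical.

```python
def certifies_envelope(link_fraction, link_capacity, ecr, icr, lam, mu, pe, tol=1e-12):
    """Check whether (lam, mu) proves that link l stays within PE for every
    TM satisfying the access capacity constraints.

    link_fraction: {(a, b): f_ab(l)} for the link under test (missing = 0)
    ecr, icr, lam, mu: {node: value}
    """
    nodes = list(ecr)
    # Dual variables must be non-negative.
    if any(lam[a] < 0 or mu[a] < 0 for a in nodes):
        return False
    # Dual feasibility: f_ab(l) / c(l) <= lam_a + mu_b for all a, b.
    for a in nodes:
        for b in nodes:
            if link_fraction.get((a, b), 0.0) / link_capacity > lam[a] + mu[b] + tol:
                return False
    # Dual objective must not exceed the penalty envelope.
    bound = sum(lam[a] * ecr[a] + mu[a] * icr[a] for a in nodes)
    return bound <= pe + tol


# Hypothetical example: all A->B traffic uses a link of capacity 10,
# and every access capacity is 5, so the worst-case utilization is 0.5.
frac = {("A", "B"): 1.0}
ecr = {"A": 5.0, "B": 5.0}
icr = {"A": 5.0, "B": 5.0}
lam = {"A": 0.1, "B": 0.0}
mu = {"A": 0.0, "B": 0.0}
ok = certifies_envelope(frac, 10.0, ecr, icr, lam, mu, pe=0.5)
```

In the full optimization the dual variables become LP variables alongside the routing, which is what keeps the number of constraints polynomial.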
04/19/23 26
Evaluations
Evaluated algorithms
  COPE: COPE with P_C(f,d) = PR(f,d) (i.e., performance ratio)
  COPE-MLU: COPE with P_C(f,d) = U(f,d) (i.e., max link utilization)
  Oblivious routing: min_f max_x PR(f,x) (COPE with envelope multiplier β = 1)
  Dynamic: optimize routing for TM in previous interval
  Peak: peak interval of previous day + previous/same days in last week
  Multi: all intervals in previous day + previous/same days in last week
  Optimal: requires an oracle
Datasets
  US-ISP: hourly PoP-level TMs for a tier-1 ISP (1 month in 2005); optimal oblivious ratio: 2.045; default penalty envelope: 2.5
  Abilene: 5-min router-level TMs on Abilene (6 months: Mar – Sep. 2004); optimal oblivious ratio: 1.853; default penalty envelope: 2.0
04/19/23 27
US-ISP: Performance Ratio
[Figure: performance ratio (y-axis, 1 to 5) vs. intervals sorted by performance ratio (x-axis, 0 to 700) for multi, peak, dynamic, oblivious, COPE-MLU, and COPE]
Performance ratio = (MLU of the algorithm) / (MLU of optimal routing)
Common cases: COPE is close to dynamic and much better than the others
Unexpected cases: COPE beats even oblivious and is much better than the others
04/19/23 28
Abilene: Performance Ratio
[Figure: performance ratio (y-axis, 1 to 8) vs. sorted interval (x-axis, 0 to 2000) for multi, peak, dynamic, oblivious, and COPE]
Common cases: COPE is close to dynamic and much better than the others
Unexpected cases: COPE is close to oblivious and much better than the others
04/19/23 29
Abilene: MLU in Unexpected Cases
[Figure: MLU (y-axis, 0 to 2.5) vs. interval (x-axis, 120 to 200) for multi, peak, dynamic, oblivious, COPE, and optimal]
Unexpected cases: COPE is close to oblivious and much better than the others
04/19/23 30
US-ISP: Sensitivity to PE
[Figure: performance ratio (y-axis, 1 to 2.2) vs. intervals sorted by performance ratio (x-axis, 0 to 250) for oblivious and for PE = 2.2, 2.4, 2.6, 2.8]
Even a small margin in PE can significantly improve the common-case performance
04/19/23 31
COPE Summary
COPE = Common-case Optimization with Penalty Envelope
COPE works!
  Common cases: close to optimal; much better than oblivious routing and prediction-based TE with comparable overhead
  Unexpected cases: much better than prediction-based TE, and may sometimes beat oblivious routing
  Even a small margin in PE improves common-case performance a lot
COPE as an optimization paradigm also applies to many other contexts, e.g., interdomain
04/19/23 32
What’s Next
Traffic Variations, Topology Changes, ISP Objectives
Network Infrastructure
Common-case Optimization with Penalty Envelope
Interdomain Reliability Service
Resilient Routing Reconfiguration
04/19/23 33
“Any future Internet should attain the highest possible level of availability, so that it can be used for mission-critical activities, and it can serve the nation in times of crisis”
- GENI, 2006
04/19/23 34
“The 3 elements which carriers are most concerned about when deploying communication services are:
  Network reliability
  Network usability
  Network fault processing capabilities”
-Telemark, 2006
The top 3 all belong to reliability!
04/19/23 35
Reliability Needs Redundancy
Over-provisioning of bandwidth
Diversity of physical connectivity
Challenge: significant investments
  Extra equipment for over-provisioning
  Expense & difficulty in obtaining rights-of-way for diversity
04/19/23 36
REIN Overview
REliability as an INterdomain Service
Objective
  Increase the redundancy available to an IP network at low cost
Basic idea
  Observation: IP networks overlap, yet they differ
  IP networks provide redundancy for each other
  Effects: sharing improves reliability and reduces costs
  Analogy: insurance, airline alliance
04/19/23 37
Example: Jan. 9, 2006 of a Major US ISP
[Figure: map with Stockton, Rialto, El Paso, Oroville, Los Angeles, and Dallas; labels: AT&T, interdomain bypass path]
04/19/23 38
How to Make REIN Work: the Details
Why would IP networks share interdomain bypass paths?
  Peering, cost-free, or customer-provider
What is the signaling protocol to share these paths?
  Manual, new protocol, BGP communities
How can an interdomain bypass path be used in the intradomain forwarding path?
  Interdomain GMPLS, IP tunnels
04/19/23 39
Evaluation
Dataset
  US-ISP: hourly PoP-level TMs for a tier-1 ISP (1 month in 2007)
  Abilene: 5-min router-level TMs on Abilene (6 months: Mar – Sep. 2004)
  RocketFuel: PoP-level topologies
TE algorithms
  TE-R (robust) [this is COPE with REIN awareness]
  Oblivious routing/bypassing (oblivious)
  Constrained Shortest Path First rerouting (CSPF)
  Flow-based optimal routing (optimal)
04/19/23 40
Why REIN: Connectivity Improvements
Without REIN, as many as 60% of links have connectivity < 3, esp. in some smaller networks
With <= 7 REIN routes, this drops to < 10%
04/19/23 41
Why REIN: Overload Prevention (Abilene 2-link)
[Figure: Abilene bottleneck link traffic intensity: 2-link failures, Tuesday, August 31, 2004]
Without REIN, even optimal routing overloads bottleneck links by ~300%.
With 10 interdomain bypass paths of 2 Gbps each, REIN reduces MLU to ~80%.
04/19/23 42
REIN Summary
An interdomain service to improve the redundancy of IP networks at low cost
Significantly improves network reliability, esp. when used with COPE to utilize network resources under failures
04/19/23 43
Outline
Challenges to TE in dynamic environment
COPE
REIN
From algorithms to a toolkit
04/19/23 44
Torte: A Toolkit for Optimal & Robust TE
Inputs: network topology, traffic demand matrices, penalty envelopes, protected link sets
Pipeline:
  1. Compute TE routing (output: link-based routing f)
  2. Convert link-based to path-based (output: paths with traffic split ratios)
  3. Generate router configuration
Outputs: global MPLS configuration, explicit path configuration, LSP configuration, backup LSP/tunnel configuration; output as Juniper/Cisco configurations
http://www-net.cs.yale.edu/projects/torte/tools/torte.tar.gz
04/19/23 45
Link-based vs. Path-based Routing
Our algorithms compute link-based traffic split ratios (for each O-D pair)
  e.g., f_{newy,losa}(chic, hous) = 0.25 means link chic→hous carries 25% of the traffic from newy to losa
Current MPLS-enabled routers support only path-based traffic split ratios
  Traffic load on equal-cost LSPs is (typically) proportional to the bandwidth requested by each LSP
  Juniper J-, M-, T-series
  Cisco 7200, 7500, 12000 series
04/19/23 46
From Link-based to Path-based
Use the coverage-based method developed at Yale LANS to convert link-based routing to path-based routing [Zheng et al., 2007]
A flow decomposition algorithm selects paths one by one until reaching the required coverage (for the O-D pair)
04/19/23 47
An Example
[Figure: example; courtesy of Zheng Ma]
04/19/23 48
Performance Bound of Path Generation
Theorem: Given a link-based routing f and a q-coverage path set for f, there is a path-based routing over the q-coverage path set such that, for any input, the MLU under the path-based routing is at most 1/q times the MLU achieved by f.
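To make the coverage idea concrete, the sketch below greedily decomposes the link-based fractions of a single O-D pair into paths until the selected paths carry at least a fraction q of the flow, then renormalizes the path weights into split ratios. This is a generic flow decomposition under assumptions made here (heaviest-link-first depth-first search, hypothetical function names and toy flow); the actual coverage-based method may select paths differently.

```python
def _find_path(residual, src, dst, eps):
    """Depth-first search over links with positive residual flow,
    exploring heavier outgoing links first."""
    stack = [[src]]
    seen = set()
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        outgoing = sorted(
            (amount, j) for (i, j), amount in residual.items()
            if i == node and amount > eps and j not in seen
        )
        for _, j in outgoing:  # heaviest link is pushed last, popped first
            stack.append(path + [j])
    return None


def select_paths(flow, src, dst, coverage, eps=1e-9):
    """flow: {(i, j): fraction of the O-D demand on link (i, j)}.
    Returns ([(path, split_ratio), ...], covered_fraction)."""
    residual = dict(flow)
    chosen, covered = [], 0.0
    while covered < coverage - eps:
        path = _find_path(residual, src, dst, eps)
        if path is None:
            break
        links = list(zip(path, path[1:]))
        amount = min(residual[link] for link in links)  # path bottleneck
        for link in links:
            residual[link] -= amount
        chosen.append((path, amount))
        covered += amount
    if not chosen:
        return [], 0.0
    return [(path, amount / covered) for path, amount in chosen], covered


# Hypothetical unit flow from s to t over three routes (0.6, 0.3, 0.1).
flow = {("s", "a"): 0.6, ("a", "t"): 0.6, ("s", "b"): 0.3,
        ("b", "t"): 0.3, ("s", "t"): 0.1}
paths, covered = select_paths(flow, "s", "t", coverage=0.9)
```

Because each selected path carries at most its share of f and the ratios are divided by the covered fraction (at least q), no link load grows by more than a factor of 1/q, which is the intuition behind the bound in the theorem.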
04/19/23 49
Summary of Contributions
Traffic Variations, Topology Changes, ISP Objectives
Network Infrastructure
Common-case Optimization with Penalty Envelope [SIGCOMM’06]
Interdomain Reliability Service [SIGCOMM’07]
Resilient Routing Reconfiguration
04/19/23 50
Future Directions
Problem of pure MPLS TE
  May need to configure many LSPs
    80 core routers, ~5 LSPs per O-D pair on average
    Total of 80 x 80 x 5 = 32000 LSPs!
  Configuration overhead
  Router memory requirement
Possible solutions
  Pick heavy-hitter O-D pairs; others will just use OSPF default routing
  Hybrid OSPF/MPLS TE
    How to optimize OSPF other than using heuristics?
04/19/23 51
Future Directions (cont’d)
Online adaptation with penalty envelope
  Network and/or traffic evolve over time
  The “normal network load” may change dramatically on a medium time scale
    Diurnal pattern
    Weekly pattern
Possible solutions
  Gradually adapt to changes in network/traffic conditions
  Difficulty: maintaining robustness during the course of adaptation
04/19/23 52
Thank you!
Backup Slides
04/19/23 55
04/19/23 56
Outline
Challenges to TE in dynamic environment
Common-case Optimization with Penalty Envelope Use traffic variation as target application
Interdomain Reliability Service
Implementation
04/19/23 57
Routing Performance Metrics
Maximum Link Utilization (MLU):
\[
U(f, d) = \max_{(i,j) \in E} \sum_{a,b \in V} d_{ab}\, f_{ab}(i,j) / c_{ij}
\]
Optimal Utilization:
\[
OU(d) = \min_{f \text{ is a routing}} U(f, d)
\]
Performance Ratio:
\[
PR(f, d) = U(f, d) / OU(d)
\]
04/19/23 58
COPE Instantiation
C: convex hull of multiple past TMs
  A linear predictor predicts the next TM as a convex combination of past TMs (e.g., EWMA)
  Aggregation of all possible linear predictors => the convex hull
X: all possible non-negative TMs
  Can add access capacity constraints or use a bigger convex hull
P_C(f,d): penalty function for common cases
  maximum link utilization: U(f,d)
  performance ratio: PR(f,d)
P_X(f,x): penalty function for worst cases
  maximum link utilization: U(f,x)
  performance ratio: PR(f,x)
\[
\min_f \max_{d \in C} P_C(f, d) \quad \text{s.t.} \quad (1)\ f \text{ is a routing; and } (2)\ \forall x \in X:\ P_X(f, x) \le PE
\]
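The claim that an EWMA prediction is a convex combination of past TMs can be checked numerically. This is a minimal sketch, assuming a finite history in which the oldest TM absorbs the remaining weight (equivalent to initializing the EWMA recursion with the oldest TM); the function names and sample values are illustrative.

```python
def ewma_weights(num_past, alpha):
    """Weights on past TMs, most recent first. The oldest TM absorbs the
    leftover mass, so the weights are non-negative and sum to 1."""
    weights = [alpha * (1 - alpha) ** k for k in range(num_past - 1)]
    weights.append((1 - alpha) ** (num_past - 1))
    return weights


def predict_tm(past_tms, alpha):
    """past_tms: list of {(a, b): demand}, most recent first.
    Returns the EWMA prediction, a point inside their convex hull."""
    weights = ewma_weights(len(past_tms), alpha)
    od_pairs = set().union(*past_tms)
    return {
        od: sum(w * tm.get(od, 0.0) for w, tm in zip(weights, past_tms))
        for od in od_pairs
    }


# Hypothetical two-interval history for a single O-D pair.
past = [{("a", "b"): 4.0}, {("a", "b"): 8.0}]
predicted = predict_tm(past, alpha=0.5)
```

Since any such predictor lands inside the convex hull, optimizing over the whole hull covers every choice of alpha at once.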
04/19/23 59
Choosing the Penalty Envelope
\[
PE = \beta \cdot \min_f \max_{x \in X} P_X(f, x)
\]
β ≥ 1 controls the size of PE w.r.t. the optimal worst-case penalty
  β = 1 => oblivious routing
  β = ∞ => prediction-based TE
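As a worked example, write β for the multiplier on the optimal worst-case penalty and assume the default envelopes in the evaluation are measured against the optimal oblivious ratio (the optimal worst-case penalty when P_X is the performance ratio). The reported defaults then correspond to fairly small margins:

```latex
\[
\beta_{\text{US-ISP}} = \frac{2.5}{2.045} \approx 1.22,
\qquad
\beta_{\text{Abilene}} = \frac{2.0}{1.853} \approx 1.08
\]
```

That is, roughly 22% and 8% of slack over the oblivious optimum, respectively.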
04/19/23 60
COPE Implementation
1. Collect TMs continuously
2. Compute COPE routing for the next day by solving a linear program (LP)
   Common-case optimization
     Common case: convex hull of multiple past TMs (all TMs in previous day + same/previous days in last week)
     Minimize either MLU or PR over the convex hull
   Penalty envelope
     Bounded PR over all possible nonnegative TMs
   See dissertation for details of our LP formulation
3. Install COPE routing
   Currently done once per day => an off-line solution
04/19/23 61
US-ISP: Maximum Link Utilization
[Figure: MLU after normalization (y-axis, 0 to 6) vs. interval (x-axis, 0 to 500) for multi, peak, dynamic, oblivious, COPE-MLU, and COPE]
Common cases: COPE is close to dynamic and much better than the others
Unexpected cases: COPE beats even oblivious and is much better than the others
04/19/23 62
Abilene: MLU in Common Cases
[Figure: MLU (y-axis, 0 to 0.2) vs. interval (x-axis, 100 to 280) for multi, peak, dynamic, oblivious, COPE, and optimal]
Common cases: COPE is close to optimal/dynamic and much better than the others
04/19/23 63
COPE with Interdomain Routing
Motivation
  Changes in availability of interdomain routes can cause significant shifts of traffic within the domain
  E.g., when a peering link fails, all traffic through that link is rerouted
Challenges
  Point-to-multipoint demands => need to find splitting ratios among exit points
  The set of exit points may change => topology itself is dynamic
  Too many prefixes => cannot enumerate all possible exit point changes
04/19/23 64
COPE with Interdomain Routing: A Two-Step Approach
1. Apply COPE on an extended topology to derive good splitting ratios
   • Group destination prefixes with the same set of exit points into a virtual node
   • Derive pseudo demands destined to each virtual node by merging demands to prefixes that belong to this virtual node
   • Connect each virtual node to the corresponding peers using virtual links with infinite BW
   • Compute the extended topology G’ = intradomain topology + peers + peering links + virtual nodes + virtual links
   • Apply COPE to compute routing on G’ for the pseudo demands
   • Derive splitting ratios based on the routes
2. Apply COPE on point-to-point demands to compute intradomain routing
   • Use the splitting ratios obtained in Step 1 to map point-to-multipoint demands into point-to-point demands
[Figure: extended topology showing the intradomain topology, peers, and virtual nodes]
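The mapping in Step 2 is a weighted expansion of each point-to-multipoint demand across its exit points. The sketch below assumes the splitting ratios from Step 1 are already available per (ingress, virtual node) pair; the function name, router names, and volumes are hypothetical.

```python
def to_point_to_point(p2mp_demands, split_ratios):
    """Map point-to-multipoint demands to point-to-point demands.

    p2mp_demands: {(ingress, virtual_node): volume}
    split_ratios: {(ingress, virtual_node): {exit_point: ratio}}, where the
                  ratios of each entry sum to 1
    Returns {(ingress, exit_point): volume}.
    """
    p2p = {}
    for (ingress, group), volume in p2mp_demands.items():
        for exit_point, ratio in split_ratios[(ingress, group)].items():
            key = (ingress, exit_point)
            p2p[key] = p2p.get(key, 0.0) + volume * ratio
    return p2p


# Hypothetical example: router r1 sends to two prefix groups (virtual nodes).
demands = {("r1", "v1"): 10.0, ("r1", "v2"): 4.0}
ratios = {("r1", "v1"): {"p1": 0.75, "p2": 0.25},
          ("r1", "v2"): {"p2": 1.0}}
p2p = to_point_to_point(demands, ratios)
```

Demands from different virtual nodes that leave through the same exit point are summed, so the second COPE run sees ordinary intradomain O-D demands.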
04/19/23 65
Preliminary Evaluation
COPE can significantly limit the impact of peering link failures
[Figure: MLU (y-axis, 0 to 0.2) vs. interval (x-axis, 0 to 200), with links 11 and 15 failing; curves: COPE PE=2.2 with failures, COPE PE=2.2 no failure, optimal with failures, optimal no failure]
04/19/23 66
Outline
Challenges to TE in dynamic environment
COPE
Interdomain Reliability Service Use failures as target application
Implementation
04/19/23 67
To Improve Reliability, We Need
Network redundancy
  Over-provisioning of bandwidth
  Diversity of physical connectivity
  Challenge: significant investments
    Extra equipment for over-provisioning
    Expense & difficulty of obtaining rights-of-way for diversity
  Our solution: REIN
Also, good traffic engineering algorithms for reliability
  Challenge: scalable, efficient, and fast response to topology changes [Resilient Routing Reconf.]
04/19/23 68
REIN Business Model: Three Possibilities
Peering
  Mutual backup w/o financial settlement
  Incentive: improve reliability of both at low cost
  Symmetry in backup path provisioning & usage
Cost-free
  One-sided, volunteer and/or public service
Customer-Provider
  Fixed or usage-based pricing
  Pricing should limit abuse
04/19/23 69
Interdomain Bypass Path Signaling
Many possibilities, e.g.,
  Manual configuration
  A new protocol
  Utilize BGP communities
04/19/23 70
REIN Data Forwarding
Main capability needed: allow traffic to leave and re-enter a network
  Not supported under the hierarchical routing of the current Internet
REIN forwarding mechanism
  Interdomain GMPLS
  IP tunneling
  Either way, only need agreement b/w neighboring networks
  Incrementally deployable
04/19/23 71
BGP Bypass Path Signaling
B provides interdomain bypass paths to A. Task of A: discover a path to a1 through B
BGP announcement format: Dest. / AS path / Bypass path / Tag
  Additional attributes: desired starting point (e.g., a2), bw, etc.
[Figure: Network A (routers a1, a2, a3) and Network B (routers b1, b2, b3)]
  A announces: a1 / A / a1 / REIN_PATH_REQ
  b1 RIB: a1 / A / a1 / REIN_PATH_REQ
  REIN local policy computes bypass paths to export, e.g., lightly-loaded paths
  B announces: a1 / BA / b2,b1,a1 / REIN_PATH_AVAIL
  a2 RIB: a1 / BA / b2,b1,a1 / -
04/19/23 72
Further Optimization
After an IP network imports a set of such paths, how does it effectively utilize them in routing computation?
How to minimize the number of such paths?
04/19/23 73
Further Optimization: Minimize Interdomain Bypass Paths
Motivation
  REIN may provide many alternatives
  Only a few may be necessary
  Reduce configuration overhead & respect budget constraints
Step 1: Connectivity objective
  Preset connectivity requirement
  Cost associated w/ interdomain paths
  Meet connectivity requirement while minimizing total cost
  Formulated as a Mixed Integer Program (MIP)
Step 2: TE objective
  Sort interdomain paths according to a scoring function
  Greedy selection until TE has desired performance
04/19/23 74
Traffic Engineering for Reliability (TE-R)
Objectives
  Efficient utilization of all redundant resources
  Scalable & implementable in the current Internet
  Protection: fast rerouting for common failure scenarios
  Restoration: routing convergence for non-common failure scenarios
  VPN QoS guarantee, if possible
[Figure: network topology for TE-R with nodes a1, a2, a3, intradomain links, and REIN virtual links]
04/19/23 75
Our TE-R Algorithm: Features
Robust normal-case routing f*
  Based on COPE [Wang et al. ’06]
  Guarantee bandwidth provisioning for hose-model VPN under f*
Robust fast rerouting under failures on top of f*
  VPN traffic purely intradomain if possible
Novel coverage-based techniques for computational feasibility and implementability
  Use flow-based routing to compute optimal solution
  Coverage to generate implementation with performance guarantee
For details, please see dissertation.
04/19/23 76
Why We Need TE-R (Abilene 1-link failure)
[Figure: Abilene bottleneck link traffic intensity: 1-link failures, Tuesday, August 31, 2004]
CSPF overloads the bottleneck link by ~300%, whereas robust TE-R successfully reroutes all traffic
04/19/23 77
Existing Reliability Techniques
Network redundancy techniques
  Link layer techniques, e.g., SONET rings
    Pro: fast response; Con: expensive [Giroire et al. ’03]
  IP restoration
    Online: MATE [Elwalid et al. ’01] & TeXCP [Kandula et al. ’05]
    Offline: optimization of IGP weights [Fortz & Thorup ’03, Nucci et al. ’03]
    Pro: inexpensive; Con: slow response
  MPLS protection [RFC 3469]
    Path protection: end-to-end
    Link protection: fast rerouting (FRR)
    Supported by modern routers, fast response & affordable costs
  All techniques depend on available network redundancy
TE for reliability
  Restorable bandwidth guaranteed connections [Kar et al. ’03, Kodialam et al. ’04, Kodialam & Lakshman ’03]
  Oblivious fast rerouting [Applegate et al. ’04]
Overlay for reliability
  RON [Anderson et al. ’01], application-level source routing [Gummadi et al. ’04]
  Pro: does not require cooperation from the backbone
  Con: slower response time (depends on transport layer timeouts), less visibility into the network
04/19/23 78
REIN Path Advertisement Message
04/19/23 79
Outline
Challenges to TE in dynamic environment
TE with dynamic traffic
TE with dynamic topology
  REIN
  R3
Implementation of proposed TE algorithms
04/19/23 80
R3 Motivation
Network topologies are constantly changing
  Unexpected failures
  Scheduled maintenance
Multiple changes may overlap in time
  Changes happen frequently
  Some changes take a long time to recover from
    Fiber cuts
    Maintenance
An IP network usually operates with multiple (simultaneous) topology changes
04/19/23 81
Existing TE Approaches
IP fast re-routing
  IPFRR
  Multi-topology OSPF
  Failure-carrying Packets (FCP)
Pros:
  Fast response
  Scalable: e.g., FCP works as long as a network remains connected
Cons:
  Only guarantees connectivity
  No guarantee of quality of response, esp. congestion
04/19/23 82
Existing TE Approaches (cont’d)
MPLS fast re-routing
  Optimal demand-oblivious restoration [Applegate et al. ’04]
  COPE reliability routing [REIN TE-R]
Pros:
  Fast response
  Efficiency: provides some guarantee for quality of response
Cons:
  Re-optimizes routing for each change scenario
  Not scalable
    On-demand computation: slow response time
    Pre-computation: need to keep many protection routings in routers
  Disruption even to existing traffic not affected by changes
04/19/23 83
Our Approach: R3
Objectives
  Fast response + little disruption to existing traffic
  Guarantee for quality of response
  Scalability (not too many protection routings kept in routers)
Our approach
  R3 = Resilient Routing Reconfiguration
  A single protection routing reconfigured according to current topology (initial + changes)
  Pre-computed protection + simple reconfiguration
  Performance bound on congestion
  A single protection routing works for many change scenarios
04/19/23 84
R3 Basic Idea
Consider link (i, j)
  If (i,j) fails, at most 10 Mbps of traffic needs re-routing
  If the network can route 10 Mbps of additional, fictional traffic from i to j, failure recovery is guaranteed
    At least as much traffic can be carried w/o using (i,j)
    For proof, see dissertation
[Figure: two panels for link (i,j) with Cap = 10 Mbps and Demand = 6 Mbps; one panel labeled Addl. 4 Mbps, the other Addl. 6 Mbps]
04/19/23 85
R3: Topology-uncertainty Demand
Topology uncertainty => traffic uncertainty
  Link capacities as upper bound of traffic to re-route
  Other choices: a given percent of link capacities
Exponential number of topology changes => convex set of topology-uncertainty demands
  Integer relaxation
    Change s1 ~ topology-uncertainty demand x1
    Change s2 ~ topology-uncertainty demand x2
    Change s1 or change s2 ~ x1 or x2 => θ x1 + (1 − θ) x2, θ ∈ [0,1]
04/19/23 86
R3: Routing Reconfiguration
Link e = (i, j)
  Let g_e be the routing of the topology-uncertainty demand from i to j
  The protection routing when e fails is a re-normalization of g_e after removing e:
    g'_e(e) = 0 (no traffic goes on e)
    g'_e(e’) = g_e(e’) / [1 – g_e(e)] for e’ ≠ e (scale up to a unit flow)
Multiple link failures: reconfigure one by one
  Final result is independent of the order of reconfiguration
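The re-normalization step for a single failed link can be sketched as follows. The code covers only the rescaling of the failed link's own protection routing stated on the slide; how the other routings are updated after a failure is not shown here. The function name and the three-link example are hypothetical.

```python
def reconfigure(protection, failed):
    """Re-normalize the protection routing of link `failed` after it fails.

    protection: {link: fraction of the i->j protection demand on that link}
    Returns a new routing with zero flow on the failed link and every other
    fraction scaled by 1 / (1 - g_e(e)), so the detour carries a unit flow.
    """
    on_failed = protection.get(failed, 0.0)
    if on_failed >= 1.0:
        raise ValueError("protection routing has no detour around the failed link")
    scale = 1.0 / (1.0 - on_failed)
    return {
        link: (0.0 if link == failed else fraction * scale)
        for link, fraction in protection.items()
    }


# Hypothetical example: 20% of the protection demand for (i, j) stays on
# (i, j) itself, and 80% detours through k.
g = {("i", "j"): 0.2, ("i", "k"): 0.8, ("k", "j"): 0.8}
after_failure = reconfigure(g, ("i", "j"))
```

After the failure the detour through k carries the whole unit flow, which is the "scale up to a unit flow" step.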
04/19/23 87
Evaluation Methodology
Dataset
  US-ISP: hourly PoP-level TMs for a tier-1 ISP (1 month in 2007)
  RocketFuel: PoP-level gravity model TMs
Topology changes
  All single- and two-link changes
  Sampled three- and four-link changes
  For US-ISP, single-link + single-maintenance
TE algorithms
  Base routing: R3, OSPF
  Protection routing: R3, OSPF reconvergence, OSPF link detour, FCP
  Flow-based optimal routing (optimal) used as reference
04/19/23 88
US-ISP: Worst-case with One Change
R3 and OSPF+R3 achieve close-to-optimal performance
All others lead to a much higher level of traffic intensity
04/19/23 89
US-ISP: One Change Performance Ratio
R3 and OSPF+R3 consistently perform within 30% of optimal
All others lead to a much higher performance penalty (>= 260%)
04/19/23 90
US-ISP: Two and Three Changes
R3 and OSPF+R3 significantly outperform other algorithms
04/19/23 91
RocketFuel SBC: Two and Three Changes
R3 significantly outperforms other algorithms, even OSPF+R3
For SBC, integrated R3 base + protection is necessary
04/19/23 92
RocketFuel Level-3: Two and Three Changes
R3 and OSPF+R3 have similar performance and significantly outperform other algorithms
For Level-3, a good OSPF + R3 is enough
04/19/23 93
R3 Summary
R3
  A single protection routing that can be reconfigured to recover from multiple topology changes
  Simple reconfiguration upon changes
  Performance guarantee
  A single protection routing works for many possible changes
Ongoing & future work
  Implementation of routing reconfiguration on routers
  Preventing transient loops during simultaneous reconfigurations
04/19/23 94
Generate MPLS configurations
Configure LSPs for each O-D pair
  Create explicit LSP path
  Requested bandwidth of the LSP is proportional to the (path-based) traffic split ratio
Configure backup LSPs/tunnels for each protected link (to implement R3)
  Create explicit backup LSP/tunnel
  Requested bandwidth of the backup LSP/tunnel is proportional to the backup traffic split ratio
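Since routers split load across LSPs in proportion to requested bandwidth, the split ratios only need to be turned into proportional bandwidth requests. A minimal sketch, assuming a hypothetical base bandwidth per O-D pair and made-up LSP names; it produces numbers only and does not emit any vendor configuration syntax.

```python
def requested_bandwidths(split_ratios, base_bandwidth):
    """Assign each LSP a requested bandwidth proportional to its split ratio.

    split_ratios:   {lsp_name: ratio}; ratios are normalized, so they need
                    not sum exactly to 1
    base_bandwidth: total bandwidth to divide among the LSPs (any unit)
    """
    total = sum(split_ratios.values())
    if total <= 0:
        raise ValueError("split ratios must have a positive sum")
    return {lsp: base_bandwidth * ratio / total
            for lsp, ratio in split_ratios.items()}


# Hypothetical O-D pair with three LSPs and a base of 1000 bandwidth units.
ratios = {"lsp-1": 0.5, "lsp-2": 0.25, "lsp-3": 0.25}
bandwidths = requested_bandwidths(ratios, 1000.0)
```

The same computation would apply to backup LSPs/tunnels, using the backup traffic split ratios instead.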
04/19/23 95
Future Directions on COPE
COPE with OSPF
COPE with online TE
COPE for other network optimization problems
04/19/23 96
Future Directions on REIN
A thorough study of the effects of cross-provider shared-risk link group data
Effectiveness of REIN on smaller IP networks
Improving TE robustness under dynamic topology
04/19/23 97
Our Approaches
REIN: cost-effective way to increase redundancy
R3: scalable, congestion-free fast rerouting
04/19/23 98
Summary of Contributions
Traffic Variations, Topology Changes, ISP Objectives
Network infrastructure
Traditional TE approaches
Interdomain Reliability Service
Common-case Optimization with Penalty Envelope
Resilient Routing Service
Resilient Routing Reconfiguration
04/19/23 102
Why REIN: Overload Prevention (US-ISP failure log)
[Figure: improvement of traffic intensity by REIN for a week in January 2007 for US-ISP]
REIN can reduce normalized traffic intensity by 118% ~ 35%, depending on the TE-R algorithms used.