Self-healing in Routing: Failure Analysis, and Improvements
Qi Li
Tsinghua University
Aug. 28, 2008
Aug. 28, 2007 AsiaFI, Student Workshop 2
Outline
Problem Statement
Analysis of Self-Healing Routing
Existing Improvement Solutions
Our Self-Healing Solution
Conclusion and Future Work
Problem Statement
Routing (intra- and inter-domain) is a critical element of the Internet infrastructure
How robust is it against large-scale failures and attacks?
  Cisco routers caused a major outage in Japan in 2007
  An earthquake in Taiwan damaged undersea cables in 2006
We need to improve it, but how?
Internet Routing
Not a homogeneous network
  A network of autonomous systems (ASes)
  Each AS is under the control of an ISP
  Large variation in AS sizes: typically heavy-tailed
Inter-AS routing
  Border Gateway Protocol (BGP), a path-vector algorithm
  Serious scalability and recovery issues
Intra-AS routing
  Several algorithms; they usually work fine
  Central control, smaller network, …
Measurements – Prefix Growth
Table sizes grow 2x faster than real growth
One (conservative) analysis predicts 2M entries in 10 years
Measurements – BGP Updates
Distribution of Updates – Main Observation
Most of the network is very stable
Parts of the network are very unstable
Everybody pays for the instability
Problem is getting worse
Routing Failure Causes
Large-area router/link damage (e.g., earthquake)
Large-scale failure due to a buggy software update
High-bandwidth cable cuts
Router configuration errors
  Aggregation of large un-owned IP blocks, which happens when prefixes are aggregated for efficiency
  Incorrect policy settings resulting in large-scale delivery failures
Network-wide congestion (DoS attack)
Malicious route advertisements via worms
Outline
Problem Statement
Analysis of Self-Healing Routing
Existing Improvement Solutions
Our Self-Healing Solution
Conclusion and Future Work
Existing Routing Protocols
Normal process of IP-based self-healing routing
  Failure detection
  Failure notification
  Forwarding path re-computation
Convergence time of existing routing protocols
  RIP: hundreds of seconds, suffers from count-to-infinity
  OSPF: tens of seconds
  BGP: several minutes or longer; may fail to converge because of policy conflicts
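A minimal sketch of why RIP can take so long to recover: a simplified count-to-infinity simulation on an assumed three-router chain A - B - C, with no split horizon. The topology, the function name, and the strictly alternating update order are illustrative assumptions and do not come from the slides; 16 is RIP's standard "unreachable" metric.

```python
INFINITY = 16  # RIP treats a metric of 16 as unreachable


def count_to_infinity(infinity=INFINITY):
    """Simulate the chain A - B - C after link B-C fails (no split horizon).

    Returns a list of (cost_at_A, cost_at_B) to destination C: the state
    right after the failure, then the state after each update round.
    """
    cost_a = 2                 # A reached C via B before the failure
    cost_b = infinity          # B has just detected that link B-C is down
    history = [(cost_a, cost_b)]
    while cost_a < infinity or cost_b < infinity:
        # B hears A's stale advertisement and believes A can reach C.
        cost_b = min(infinity, cost_a + 1)
        # A routes via B, so it must follow B's newly advertised cost.
        cost_a = min(infinity, cost_b + 1)
        history.append((cost_a, cost_b))
    return history


if __name__ == "__main__":
    for round_no, (a, b) in enumerate(count_to_infinity()):
        print(f"round {round_no}: A={a} B={b}")
```

Under these assumptions the costs climb by two per round until both routers reach 16. Because RIP routers exchange periodic updates (every 30 seconds by default), each round can cost tens of seconds, which is consistent with recovery times of hundreds of seconds.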
The State Transition under Failure
A simple state transition to analyze the routing convergence.
The Problems of Transient Failures
Routing blackhole
  Traffic is silently dropped without informing the source that the data did not reach its intended recipient.
Routing loop
  The path to a particular destination forms a loop.
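Both transient failures can be illustrated by following a packet hop by hop through per-router forwarding state. The sketch below is only an illustration: the `next_hop` dictionary format and the function name are assumptions, and a real router would drop a looping packet when its TTL expires rather than detect the loop.

```python
def trace_packet(next_hop, source, destination):
    """Follow next hops toward `destination` and classify the outcome.

    `next_hop` maps each router to its next hop for this destination;
    a missing entry or None means the router has no route.
    Returns "delivered", "blackhole", or "loop".
    """
    visited = set()
    node = source
    while node != destination:
        if node in visited:
            return "loop"          # the packet came back to a router it already crossed
        visited.add(node)
        node = next_hop.get(node)
        if node is None:
            return "blackhole"     # no route: the packet is silently dropped
    return "delivered"


if __name__ == "__main__":
    print(trace_packet({"A": "B", "B": "C"}, "A", "C"))   # consistent tables
    print(trace_packet({"A": "B", "B": None}, "A", "C"))  # B lost its route
    print(trace_packet({"A": "B", "B": "A"}, "A", "C"))   # A and B point at each other
```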
Outline
Problem Statement
Analysis of Self-Healing Routing
Existing Improvement Solutions
Our Self-Healing Solution
Conclusion and Future Work
Traditional Fast Reroute Solutions
The major improvement in intra-domain routing is fast reroute (FRR).
SONET rings significantly reduce recovery time, but they are expensive.
FRR with MPLS-TE is hard to deploy because it introduces much complexity into the core network.
IP-FRR, developed by the IETF, still has some shortcomings; e.g., LFA (loop-free alternate) needs a neighbor whose shortest path does not contain the failed node.
Layer 3 tunnels provide pre-computed path protection, but may not eliminate the routing loops introduced by tunneling.
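The LFA limitation can be made concrete with the basic loop-free inequality from the IETF IP-FRR specification (RFC 5286): a neighbor N of router S is a loop-free alternate for destination D if dist(N, D) < dist(N, S) + dist(S, D). The sketch below assumes a connected graph given as nested dictionaries of link costs; the function names and both example topologies are illustrative, and node protection would need an additional check that is not shown.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra: shortest-path cost from `source` to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def loop_free_alternates(graph, s, d):
    """Neighbors of `s` that satisfy the basic loop-free condition toward `d`."""
    dist = {n: shortest_distances(graph, n) for n in graph}
    # Primary next hops are the neighbors lying on a shortest path from s to d.
    primary = {n for n, w in graph[s].items() if w + dist[n][d] == dist[s][d]}
    return sorted(
        n for n in graph[s]
        if n not in primary and dist[n][d] < dist[n][s] + dist[s][d]
    )


if __name__ == "__main__":
    triangle = {"S": {"N": 1, "D": 1}, "N": {"S": 1, "D": 1}, "D": {"S": 1, "N": 1}}
    print(loop_free_alternates(triangle, "S", "D"))
    # With an expensive N-D link, N's own shortest path to D goes back through S.
    skewed = {"S": {"N": 1, "D": 1}, "N": {"S": 1, "D": 3}, "D": {"S": 1, "N": 3}}
    print(loop_free_alternates(skewed, "S", "D"))
```

In the second topology no alternate exists, which is the shortcoming noted above: LFA coverage depends on the topology and link costs.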
State Transition of Improved Solution
State transition with protection and damping: improving availability and stability.
BGP Fast Convergence Solutions
Major problem in BGP
  Theoretical analysis and measurement results indicate that path exploration in a path-vector protocol prolongs routing convergence.
Several solutions address this problem:
  RCN (Root Cause Notification) carries root-cause information in BGP updates, so that obsolete routes are eliminated and only valid alternative routes are chosen and propagated.
  Ghost Flushing improves BGP convergence by expediting the removal of outdated "ghost" information in the Internet.
Drawbacks …
  Network fail-over events in GF, transient routing problems, …
Outline
Problem Statement
Analysis of Self-Healing Routing
Existing Improvement Solutions
Our Self-Healing Solution
  Requirements of Solution
  Routing Protection
  Evaluation Metrics
Conclusion and Future Work
Self-healing Routing
The goal of self-healing routing
  After a link or a node is destroyed, the network can restore or repair routes by itself
Self-healing routing approaches
  Routing restoration (fast routing convergence)
    Attempts to find a new path on demand to restore connectivity when a failure occurs.
  Routing protection
    Based on fixed, predetermined failure recovery: a working path is set up for traffic forwarding together with an alternate protection path.
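The difference between the two approaches can be sketched on a small unweighted graph: protection computes the alternate path before any failure, while restoration searches for a path only after the failure is known. Everything here is an illustrative assumption (the adjacency-list format, the function names, hop-count shortest paths, and the four-node topology); removing the primary path's links is a simple heuristic that does not always find a disjoint backup even when one exists.

```python
from collections import deque


def bfs_path(adj, src, dst, banned=frozenset()):
    """Shortest hop-count path from src to dst that avoids `banned` links."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev and frozenset((u, v)) not in banned:
                prev[v] = u
                queue.append(v)
    return None  # dst is unreachable


def protect(adj, src, dst):
    """Protection: pre-compute a working path and a link-disjoint backup."""
    primary = bfs_path(adj, src, dst)
    links = {frozenset(pair) for pair in zip(primary, primary[1:])}
    backup = bfs_path(adj, src, dst, banned=links)
    return primary, backup


def restore(adj, src, dst, failed_link):
    """Restoration: search for a new path on demand, after the failure."""
    return bfs_path(adj, src, dst, banned={frozenset(failed_link)})


if __name__ == "__main__":
    adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
    print(protect(adj, "A", "D"))
    print(restore(adj, "A", "D", ("B", "D")))
```

With protection, switching to the backup after a failure is a lookup; with restoration, the path search (and in a real network, the routing convergence behind it) sits on the recovery path.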
Requirements of Solution
Simplicity
  The solution should be simple and not add much complexity to core networks; MPLS, by contrast, requires its own underlying infrastructure.
Easy deployment and management
  MPLS-based solutions are not good candidates because it is hard to pre-compute a backup path for every node.
Efficiency
  Protection should not have to cover 100% of the network, especially when multiple failures happen.
Incremental deployment support
  This is an important factor when designing a novel routing protocol, because we cannot assume it will be deployed everywhere at once.
Requirements of Solution (cont.)
Business model support
  The solution should take into account the business model of path protection in production networks.
  To protect unstable network areas and backbone areas, contracts between different ISPs should be signed to guarantee routing availability in these areas.
Low cost
  The path protection solution should provide routes without heavy computation or additional computing power on routers, and should guarantee packet delivery performance with low packet loss.
  The solution should cover protection under both short-term and long-term network failures.
Principle of our solution
The key idea of routing protection is a tradeoff between the additional cost introduced by tunneling and the packet loss caused by failures.
Fast failure detection
  Bidirectional Forwarding Detection (BFD) is applied directly, for its simplicity, fast detection, and easy implementation, and because it requires no change to existing routing protocols.
Path protection technique
  Although two types of routing must be considered, intra-domain and inter-domain, there is no need to provide separate path protection techniques for different routing instances.
  To eliminate the problems introduced by L3 tunnels, we choose L2TP as the protection technique.
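How fast BFD detects a failure can be estimated from its timer negotiation. In asynchronous mode (RFC 5880), the local detection time is the detect multiplier received from the remote system times the agreed transmit interval, which is the larger of the local Required Min RX Interval and the remote Desired Min TX Interval. The sketch below only does this arithmetic; the function name and the 50 ms / 100 ms timer values are illustrative assumptions, not settings from the slides.

```python
def bfd_detection_time_ms(remote_detect_mult, local_required_min_rx_ms,
                          remote_desired_min_tx_ms):
    """Detection time of a BFD session in asynchronous mode, in milliseconds."""
    # The remote system may not send faster than we are willing to receive.
    agreed_tx_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    # The session is declared down after `remote_detect_mult` missed intervals.
    return remote_detect_mult * agreed_tx_ms


if __name__ == "__main__":
    print(bfd_detection_time_ms(3, 50, 50))    # both ends agree on 50 ms
    print(bfd_detection_time_ms(3, 100, 50))   # the receiver limits the rate to 100 ms
```

With such timers, detection takes a few hundred milliseconds at most, compared with the seconds to minutes that routing protocols need on their own.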
Principle of our solution (cont.)
Tunnel deactivation
  Tunnels should be deactivated when a short-term failure recovers or when routing converges again after a long-term failure, e.g., for loop avoidance or performance. A tunnel deactivation mechanism is therefore essential to guarantee normal data forwarding.
LAC: L2TP Access Concentrator
LNS: L2TP Network Server
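The activation and deactivation rules can be sketched as a two-state machine. This is a hypothetical illustration, not the authors' implementation: the class name and the event names (`failure_detected`, `link_recovered`, `routes_converged`) are assumptions chosen to mirror the cases described above, and no L2TP signalling is modelled.

```python
class ProtectionTunnel:
    """Tracks whether the protection tunnel is currently carrying traffic."""

    def __init__(self):
        self.active = False
        self.log = []  # actions taken, in order

    def on_event(self, event):
        """Apply one event and return whether the tunnel is active afterwards."""
        if event == "failure_detected" and not self.active:
            # E.g. the failure detector reports the protected path as down.
            self.active = True
            self.log.append("activate")
        elif event in ("link_recovered", "routes_converged") and self.active:
            # Short-term failure repaired, or routing reconverged after a
            # long-term failure: fall back to normal forwarding.
            self.active = False
            self.log.append("deactivate")
        return self.active


if __name__ == "__main__":
    tunnel = ProtectionTunnel()
    for event in ["failure_detected", "link_recovered",
                  "failure_detected", "routes_converged"]:
        print(event, "->", "tunnel active" if tunnel.on_event(event) else "normal forwarding")
```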
Evaluation metrics of routing system
Two metrics to evaluate a routing system
  Availability: the ability of the routing system to deliver packets normally, whether or not network failures happen.
  Stability: the routing dynamics of the routing system, whether or not network failures happen.
Routing paths provided by tunnels guarantee routing availability, while delaying route updates during long-term failures and eliminating route updates during short-term failures improve the stability of the routing system.
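One possible way to quantify the two metrics, offered only as an assumption since the slides define them qualitatively: availability as the fraction of sampled time slots in which packets were deliverable, and stability as the number of route updates generated (fewer is more stable). The sample timelines below are made up for illustration and are not measurement results.

```python
def evaluate(samples):
    """Return (availability, update_count) for a timeline.

    `samples` is a list of (deliverable, updates) pairs, one per time slot:
    whether packets could be delivered in that slot, and how many route
    updates were sent in it.
    """
    delivered = sum(1 for ok, _ in samples if ok)
    updates = sum(n for _, n in samples)
    return delivered / len(samples), updates


if __name__ == "__main__":
    # Hypothetical short-term failure lasting three slots, without protection:
    # traffic is lost, and a withdrawal plus a re-announcement are sent.
    unprotected = [(True, 0), (False, 1), (False, 0), (False, 0), (True, 1)] + [(True, 0)] * 5
    # Same failure with a protection tunnel: traffic keeps flowing and the
    # updates are suppressed because the failure is short-lived.
    protected = [(True, 0)] * 10
    print(evaluate(unprotected))
    print(evaluate(protected))
```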
Outline
Problem Statement
Analysis of Self-Healing Routing
Existing Improvement Solutions
Our Self-Healing Solution
Conclusion and Future Work
Conclusion and Future Work
There are many interesting problems in the Internet
  Routing issues in the Internet are being actively addressed.
  Many of the problems are hard: there are no easy solutions, and tradeoffs have to be made.
Our solution addresses the self-healing problems of routing well.
Future work
  Further study and measurement of our solution
  Development of the prototype and experimental analysis on CERNET2