Self-healing in Routing: Failure Analysis, and Improvements Qi Li Tsinghua University Aug. 28, 2008


Self-healing in Routing: Failure Analysis, and Improvements

Qi Li

Tsinghua University

Aug. 28, 2008


Aug. 28, 2007 AsiaFI, Student Workshop 2

Outline

- Problem Statement
- Analysis of Self-Healing Routing
- Existing Improvement Solutions
- Our Self-Healing Solution
- Conclusion and Future Work


Problem Statement

Routing (intra- and inter-domain) is a critical element of the Internet infrastructure.

How robust is it against large-scale failures and attacks?
- Cisco routers caused a major outage in Japan in 2007
- An earthquake in Taiwan caused undersea cable damage in 2006

We need to improve it, but how?


Internet Routing

Not a homogeneous network
- A network of autonomous systems (ASes)
- Each AS is under the control of an ISP
- Large variation in AS sizes, typically heavy-tailed

Inter-AS routing
- Border Gateway Protocol (BGP)
- A path-vector algorithm
- Serious scalability/recovery issues

Intra-AS routing
- Several algorithms; they usually work fine
- Central control, smaller network, …


Measurements – Prefix Growth

- Table sizes grow 2x faster than real growth
- One (conservative) analysis predicts 2M entries in 10 years


Measurements – BGP Updates


Distribution of Updates – Main Observation

Most of the network is very stable

Parts of the network are very unstable

Everybody pays for the instability

Problem is getting worse


Routing Failure Causes

- Large-area router/link damage (e.g., earthquake)
- Large-scale failure due to a buggy software update
- High-bandwidth cable cuts
- Router configuration errors
  - Aggregation of large un-owned IP blocks (happens when prefixes are aggregated for efficiency)
  - Incorrect policy settings resulting in large-scale delivery failures
- Network-wide congestion (DoS attack)
- Malicious route advertisements via worms


Outline

- Problem Statement
- Analysis of Self-Healing Routing
- Existing Improvement Solutions
- Our Self-Healing Solution
- Conclusion and Future Work


Existing Routing Protocols

Normal process of IP-based self-healing routing
- Failure detection
- Failure notification
- Forwarding path re-computation

Convergence of existing routing protocols
- RIP: hundreds of seconds; suffers from count-to-infinity
- OSPF: tens of seconds
- BGP: several minutes or longer; may fail to converge because of policy conflicts
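The three steps listed above can be sketched for a link-state protocol as follows. This is a minimal illustration, not any protocol's implementation: the four-node topology, its link costs, and the helper names `shortest_paths` and `fail_link` are assumptions made for the example. Detection and notification are modeled simply as removing the failed link from the graph, after which every router re-runs Dijkstra.

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra: return {node: (cost, first_hop)} as seen from source."""
    result = {}
    heap = [(0, source, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in result:
            continue  # already settled with a lower or equal cost
        result[node] = (cost, first_hop)
        for nbr, weight in graph.get(node, {}).items():
            if nbr not in result:
                # The first hop is the neighbor itself when leaving the source.
                hop = nbr if node == source else first_hop
                heapq.heappush(heap, (cost + weight, nbr, hop))
    return result

def fail_link(graph, u, v):
    """Return a copy of graph without the link u-v (detection + notification)."""
    return {n: {m: w for m, w in nbrs.items() if {n, m} != {u, v}}
            for n, nbrs in graph.items()}

# Hypothetical topology: A-B-D is the short path, A-C-D the longer one.
topology = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "D": 1},
    "C": {"A": 2, "D": 2},
    "D": {"B": 1, "C": 2},
}

before = shortest_paths(topology, "A")
after = shortest_paths(fail_link(topology, "B", "D"), "A")
print("next hop to D before failure:", before["D"][1])
print("next hop to D after re-computation:", after["D"][1])
```

Until the re-computation has finished on every router, forwarding tables may disagree, which is where the transient problems discussed later come from.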


The State Transition under Failure

A simple state-transition model for analyzing routing convergence.


The Problems of Transient Failures

Routing blackhole
- Traffic is silently dropped without informing the source that the data did not reach its intended recipient.

Routing loop
- The path to a particular destination forms a loop.
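Both transient problems can be detected by following next-hop entries hop by hop. The sketch below assumes a hypothetical representation in which each router holds a dictionary mapping destinations to next hops; the function name `trace` and the example tables are illustrative only.

```python
def trace(next_hop, src, dst):
    """Follow per-router next-hop entries toward dst and classify the outcome.

    Returns (outcome, visited) where outcome is 'delivered', 'blackhole'
    (some router has no entry for dst) or 'loop' (a router is revisited).
    """
    node, visited = src, [src]
    while node != dst:
        nxt = next_hop.get(node, {}).get(dst)
        if nxt is None:
            return "blackhole", visited
        if nxt in visited:
            return "loop", visited + [nxt]
        visited.append(nxt)
        node = nxt
    return "delivered", visited

# Hypothetical tables during convergence after router B loses its link to D.
# B has already withdrawn its route; A has not yet learned of the failure.
blackhole_tables = {"A": {"D": "B"}, "B": {}}
# B has switched to A as its next hop while A still points at B.
loop_tables = {"A": {"D": "B"}, "B": {"D": "A"}}
# Converged state: A reaches D through C.
converged_tables = {"A": {"D": "C"}, "C": {"D": "D"}}

for tables in (blackhole_tables, loop_tables, converged_tables):
    print(trace(tables, "A", "D"))
```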


Outline

- Problem Statement
- Analysis of Self-Healing Routing
- Existing Improvement Solutions
- Our Self-Healing Solution
- Conclusion and Future Work


Traditional Fast Reroute Solutions

The major improvement in intra-domain routing is fast reroute (FRR).
- SONET rings significantly reduce recovery time, but they are expensive.
- FRR with MPLS-TE is hard to deploy because it introduces much complexity into the core network.
- IP-FRR, developed by the IETF, still has some shortcomings; e.g., a loop-free alternate (LFA) requires a neighbor whose shortest path to the destination does not contain the failed node.
- Layer 3 tunnels provide pre-computed path protection, but may not eliminate the routing loops introduced by tunneling.
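The LFA shortcoming mentioned above can be made concrete with the basic loop-free condition from RFC 5286: a neighbor N of source S is a usable alternate for destination D only if dist(N, D) < dist(N, S) + dist(S, D). The sketch below checks that inequality on two small topologies. The topologies, link costs, and function names are assumptions for illustration; the second topology differs from the first only in one link cost, and that is enough to leave S with no alternate.

```python
import itertools

def all_pairs(graph):
    """Floyd-Warshall shortest-path distances for a small weighted graph."""
    nodes = list(graph)
    d = {a: {b: 0 if a == b else graph[a].get(b, float("inf")) for b in nodes}
         for a in nodes}
    # product() varies the last element fastest, so k is the outermost loop.
    for k, i, j in itertools.product(nodes, repeat=3):
        if d[i][k] + d[k][j] < d[i][j]:
            d[i][j] = d[i][k] + d[k][j]
    return d

def loop_free_alternates(graph, s, dst, primary):
    """Neighbors of s (other than the primary next hop) that satisfy
    dist(N, dst) < dist(N, s) + dist(s, dst)."""
    d = all_pairs(graph)
    return [n for n in graph[s]
            if n != primary and d[n][dst] < d[n][s] + d[s][dst]]

def topology(n_to_d_cost):
    """Hypothetical square: S-E-D is the primary path, N is the other neighbor."""
    return {
        "S": {"E": 1, "N": 1},
        "E": {"S": 1, "D": 1},
        "N": {"S": 1, "D": n_to_d_cost},
        "D": {"E": 1, "N": n_to_d_cost},
    }

print(loop_free_alternates(topology(2), "S", "D", primary="E"))
print(loop_free_alternates(topology(3), "S", "D", primary="E"))
```

With cost 3 on the N-D link, N's own shortest path to D may run back through S, so forwarding to N could loop and no alternate is available.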


State Transition of Improved Solution

State transition with protection and damping: improving availability and stability.


BGP Fast Convergence Solutions

Major problem in BGP
- Theoretical analysis and measurement results indicate that path exploration in path-vector protocols prolongs routing convergence.

Several solutions address this problem:
- RCN (Root Cause Notification) carries root-cause information in BGP updates, so that all obsolete routes are eliminated and only valid alternative routes are chosen and propagated.
- Ghost Flushing (GF) improves BGP convergence by expediting the removal of outdated “ghost” information in the Internet.

Drawbacks …
- Network fail-over events in GF
- Transient routing problems …
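The difference between plain path exploration and the root-cause idea can be sketched on a router's set of candidate AS paths. This is a simplified model, not the RCN message format: the AS numbers, the candidate paths, and the helper names are assumptions, and route selection is reduced to "shortest AS path wins".

```python
def uses_link(path, link):
    """True if the AS path traverses the given (undirected) inter-AS link."""
    a, b = link
    return any({u, v} == {a, b} for u, v in zip(path, path[1:]))

def best_path(candidates):
    """Simplified BGP decision: prefer the shortest AS path."""
    return min(candidates, key=len) if candidates else None

def purge_by_root_cause(candidates, failed_link):
    """Root-cause idea: drop every candidate that depends on the failed link."""
    return [p for p in candidates if not uses_link(p, failed_link)]

# Hypothetical candidate paths toward AS 0 held by one router.
candidates = [(1, 0), (2, 1, 0), (4, 6, 7, 0)]
failed_link = (1, 0)

# Plain path-vector behavior: only the withdrawn path is removed, so the
# router falls back to another path that is already obsolete.
explored = best_path([p for p in candidates if p != (1, 0)])
print("without root cause:", explored, "obsolete:", uses_link(explored, failed_link))

# With root-cause information, all obsolete paths are removed in one step.
print("with root cause:", best_path(purge_by_root_cause(candidates, failed_link)))
```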


Outline

- Problem Statement
- Analysis of Self-Healing Routing
- Existing Improvement Solutions
- Our Self-Healing Solution
  - Requirements of Solution
  - Routing Protection
  - Evaluation Metrics
- Conclusion and Future Work


Self-healing Routing

The goal of self-healing routing
- After a link or a node is destroyed, the network can restore or repair routes by itself.

Self-healing routing approaches
- Routing restoration (fast routing convergence)
  - Attempts to find a new path on demand to restore connectivity when a failure occurs.
- Routing protection
  - Based on fixed, predetermined failure recovery: a working path is set up for traffic forwarding, together with an alternate protection path.


Requirements of Solution

Simplicity
- The solution should be simple and not add much complexity to core networks, whereas MPLS requires a fundamental infrastructure of its own.

Easy deployment and management
- MPLS-based solutions are poor candidates because it is hard to pre-compute backup paths for every node.

Efficiency
- Protection should not be deployed to cover 100% of the network, especially when multiple failures happen.

Incremental deployment support
- This is an important factor when designing a novel routing protocol, because we cannot expect to deploy it everywhere at once.


Requirements of Solution (cont.)

Business model support
- The solution should take into account the business model for applying path protection in production networks.
- To protect unstable areas and backbone areas of the network, contracts between different ISPs should be signed to guarantee routing availability in these areas.

Low cost
- The path protection solution should provide routes without heavy computation or additional computing power on routers, and should guarantee packet delivery performance with low packet loss.
- The solution should cover protection under both short-term and long-term network failures.


Principle of our solution

The key idea of routing protection is a tradeoff between the additional cost introduced by tunneling and the packet loss caused by failures.

Fast failure detection
- Bidirectional Forwarding Detection (BFD) is applied directly, for its simplicity, fast detection, and easy implementation, and because it requires no change to existing routing protocols.

Path protection technique
- Although two types of routing must be considered, intra-domain and inter-domain, there is no need to provide separate path protection techniques for different routing instances.
- To eliminate the problems introduced by L3 tunnels, we choose L2TP as the protection technique.
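To give a sense of how fast BFD detection can be, the sketch below computes the detection time for BFD asynchronous mode as defined in RFC 5880: the detect multiplier received from the remote system times the agreed transmit interval, which is the larger of the local Required Min RX Interval and the remote Desired Min TX Interval. The timer values in the example are illustrative assumptions, not values taken from the slides.

```python
def bfd_detection_time_ms(remote_detect_mult,
                          remote_desired_min_tx_ms,
                          local_required_min_rx_ms):
    """Detection time in BFD asynchronous mode (RFC 5880, section 6.8.4).

    The remote system transmits at the slower of what it wants to send and
    what the local system is willing to receive; a session is declared down
    after remote_detect_mult such intervals pass without a packet.
    """
    agreed_interval_ms = max(remote_desired_min_tx_ms, local_required_min_rx_ms)
    return remote_detect_mult * agreed_interval_ms

# Illustrative settings: 50 ms timers on both sides, multiplier 3.
detection_ms = bfd_detection_time_ms(3, 50, 50)
print(f"failure detected after at most about {detection_ms} ms")
```

With these assumed timers the detection time is well under a second, compared with the tens of seconds quoted earlier for OSPF convergence.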


Principle of our solution (cont.)

Tunnel deactivation
- Tunnels should be deactivated when a short-term failure recovers or when routing converges again after a long-term failure, e.g., for loop avoidance or performance. A tunnel deactivation mechanism is therefore essential to guarantee normal data forwarding.

LAC: L2TP Access Concentrator

LNS: L2TP Network Server
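The activation and deactivation rules described above can be sketched as a two-state machine. The state and event names are hypothetical labels chosen for this example; the slides do not define them, and the actual L2TP signaling between LAC and LNS is not modeled.

```python
from enum import Enum

class ForwardingState(Enum):
    NORMAL = "normal"      # traffic follows the regular routed path
    TUNNELED = "tunneled"  # traffic is diverted into the protection tunnel

# Hypothetical events: a failure activates the tunnel; either recovery of a
# short-term failure or re-convergence after a long-term failure removes it.
TRANSITIONS = {
    (ForwardingState.NORMAL, "failure_detected"): ForwardingState.TUNNELED,
    (ForwardingState.TUNNELED, "link_recovered"): ForwardingState.NORMAL,
    (ForwardingState.TUNNELED, "routing_converged"): ForwardingState.NORMAL,
}

def step(state, event):
    """Apply one event; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

def run(events, state=ForwardingState.NORMAL):
    """Replay a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state

print(run(["failure_detected"]))
print(run(["failure_detected", "link_recovered"]))
print(run(["failure_detected", "routing_converged"]))
```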


Evaluation metrics of routing system

Two metrics to evaluate a routing system
- Availability: the ability of the routing system to keep delivering packets normally whether or not network failures happen.
- Stability: the routing dynamics of the routing system whether or not network failures happen.

Routing paths provided by tunnels guarantee routing availability, while delaying route updates during long-term failures and eliminating route updates during short-term failures improve the stability of the routing system.
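The slides define the two metrics qualitatively. One possible way to quantify them, given here purely as an assumption, is to measure availability as the fraction of an observation window in which packets are deliverable and stability as the number of route updates seen in that window. The scenario values below are illustrative, not measurements.

```python
def availability(window_s, outage_s):
    """Fraction of the observation window during which packets were deliverable."""
    return 1.0 - outage_s / window_s

def update_count(route_updates):
    """Number of route updates in the window; fewer updates means more stable."""
    return len(route_updates)

# Hypothetical short-term failure observed over a 60 s window.
# Unprotected: packets are lost until the protocol re-converges (30 s assumed),
# and the failure and recovery each trigger a route update.
unprotected = (availability(60, 30), update_count(["withdraw", "announce"]))
# Protected: traffic moves into the tunnel once the failure is detected
# (0.15 s assumed) and the route updates are suppressed.
protected = (availability(60, 0.15), update_count([]))

print("unprotected (availability, updates):", unprotected)
print("protected   (availability, updates):", protected)
```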


Outline

- Problem Statement
- Analysis of Self-Healing Routing
- Existing Improvement Solutions
- Our Self-Healing Solution
- Conclusion and Future Work


Conclusion and Future Work

There are many interesting problems in the Internet
- Routing issues in the Internet are being actively addressed.
- Many of the problems are hard: there are no easy solutions, and tradeoffs have to be made.

Our solution addresses the self-healing problems of routing well.

Future work
- Further study and measurement of our solution
- Development of the prototype and experimental analysis on CERNET2


Thanks

Q&A
