Towards an Internet that “Never Fails” Hari Balakrishnan MIT Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru


Towards an Internet that “Never Fails”

Hari Balakrishnan, MIT

Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru


What We Should Aim Toward

• Carrier airlines (2002 FAA Fact Book)
  - 41 accidents, 6.7 million flights (five “nines” availability)

• 911 phone service (1993 NRIC report)
  - 29 minutes downtime per year per line (four “nines” availability)

• Standard phone service (various sources)
  - 53 minutes downtime per year per line (four “nines” availability)

• The Internet?
  - One to two “nines”
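
The “nines” figures on this slide follow from simple arithmetic. The sketch below is illustrative only: it assumes a 365-day year and treats an airline accident as one "unavailable" flight, and the helper names `nines` and `downtime_minutes_per_year` are hypothetical.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def nines(unavailability: float) -> float:
    """Number of 'nines' of availability: -log10 of the unavailable fraction."""
    return -math.log10(unavailability)


def downtime_minutes_per_year(n_nines: float) -> float:
    """Yearly downtime implied by a given number of nines."""
    return MINUTES_PER_YEAR * 10 ** (-n_nines)


# Figures quoted on the slide.
airline = nines(41 / 6.7e6)               # accidents per flight
phone_911 = nines(29 / MINUTES_PER_YEAR)  # 29 minutes of downtime per year
phone_std = nines(53 / MINUTES_PER_YEAR)  # 53 minutes of downtime per year

for label, value in [("airlines", airline),
                     ("911", phone_911),
                     ("standard phone", phone_std)]:
    print(f"{label}: {value:.2f} nines")

# Convert a target number of nines back into yearly downtime.
for n in (1, 2, 4, 5):
    hours = downtime_minutes_per_year(n) / 60
    print(f"{n} nines: {hours:.1f} hours of downtime per year")
```

Under these assumptions, one nine allows 876 hours (36.5 days) of downtime per year and two nines allow 87.6 hours, while four and five nines allow roughly 53 minutes and 5 minutes respectively.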


Example Catastrophic Failures

“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.”
-- news.com, April 25, 1997

“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.”
-- wired.com, January 25, 2001

“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue."”
-- cnn.com, October 3, 2002

“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).”
-- dslreports.com, February 23, 2004


NANOG List Failure “Analysis”

[Figure: bar chart of the number of NANOG threads over each stated period (y-axis, 0 to 90) by failure category: Filtering, Route Leaks, Route Hijacks, Route Instability, Routing Loops, Blackholes. Periods: 1994-1997, 1998-2001, 2001-2004.]

Note: Only includes problems openly discussed on this list.

More than 70% of threads discussing failures related to router configuration or route announcement problems


Faults and Failures

• Fault = Underlying defect in a component that causes it to violate a specification
  - Latent or Active (i.e., cause errors)

• Unmasked faults (errors) cause failures
  - Failure of subsystem (spec violation) causes fault in system

• Internet faults occur for complex reasons
  - Hardware, software, protocol, design, implementation, operational faults: could be triggered by malice

• Internet failure: A cannot communicate with B


Three Directions

• Configuration as programming
  - Defines BGP behavior
  - Tools to cope with routing complexity

• Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful

• End-to-end routing
  - Exposing multiple paths to end systems (and stubs)


Today: Reactive Operation

• Problems cause downtime

• Problems often not immediately apparent

What happens if I tweak this policy…?


Coping with Complexity

• View configuration as (distributed) programming
  - Large-scale: over 1M lines of code in some networks

• Programming tools to reduce fault frequency
  - Static analysis can detect many faults [rcc]
  - Sandboxing to overcome current “stimulus-response” reasoning [FR03]

• Centralize configuration platform
  - More “intentional” config specs
  - Push configs to routers
  - Push routes to routers [RCP:F+04]
  - Use static analysis and sandboxing tools


Proactive Operation with rcc

http://nms.csail.mit.edu/rcc

• Represent complex, distributed configuration

• Define a correctness specification

• Map specification to constraints

[Figure: workflow Configure, Detect Faults (rcc), Deploy. Diagram labels for rcc: Distributed router configurations (Single AS), Normalized Representation, Correctness Specification, Constraints, Faults.]
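
The three steps listed on this slide can be sketched in miniature. This is not rcc itself: the `Router` record, its field names, and the two constraints (duplicate loopback addresses and references to undefined filters, both fault types that appear in this talk) are hypothetical stand-ins for a normalized representation and a constraint set.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical normalized view of one router's configuration."""
    name: str
    loopback: str
    defined_filters: set = field(default_factory=set)
    # Maps a BGP neighbor address to the filter name applied to that session.
    session_filters: dict = field(default_factory=dict)


def find_faults(routers):
    """Check all routers of one AS against two illustrative constraints."""
    faults = []

    # Constraint 1: loopback addresses must be unique across the AS.
    by_loopback = defaultdict(list)
    for r in routers:
        by_loopback[r.loopback].append(r.name)
    for addr, names in sorted(by_loopback.items()):
        if len(names) > 1:
            faults.append(("duplicate-loopback", addr, tuple(sorted(names))))

    # Constraint 2: every filter applied to a session must be defined.
    for r in routers:
        for neighbor, filt in sorted(r.session_filters.items()):
            if filt not in r.defined_filters:
                faults.append(("undefined-filter", r.name, (neighbor, filt)))

    return faults


# Made-up configurations for three routers in a single AS.
routers = [
    Router("r1", "10.0.0.1", {"customers"}, {"192.0.2.1": "customers"}),
    Router("r2", "10.0.0.1", {"peers"}, {"192.0.2.9": "peer-in"}),
    Router("r3", "10.0.0.3"),
]
faults = find_faults(routers)
for fault in faults:
    print(fault)
```

The point of the design is that both checks run over the whole AS before anything is deployed, so a fault that spans several routers (such as the shared loopback on r1 and r2) is visible even though each router's configuration looks fine in isolation.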


Correctness Specification

Path Visibility: Every destination with a usable path has a route advertisement
  - If there exists a path, then there exists a route
  - Example violation: Signaling partition

Route Validity: Every route advertisement corresponds to a usable path
  - If there exists a route, then there exists a path
  - Example violation: Routing loop
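
The two properties can be read as set-containment checks, and a signaling partition can be detected from the session graph. The sketch below is a simplified illustration with made-up data. It assumes a full-mesh iBGP design with no route reflectors or confederations, in which every pair of routers needs a direct session because iBGP-learned routes are not re-advertised to other iBGP peers. The function names are hypothetical.

```python
from itertools import combinations


def check_spec(usable_paths, advertised_routes):
    """Return (visibility_violations, validity_violations) for one router.

    usable_paths: destinations this router could actually reach.
    advertised_routes: destinations this router has learned a route for.
    """
    visibility = usable_paths - advertised_routes  # path exists, no route
    validity = advertised_routes - usable_paths    # route exists, no path
    return visibility, validity


def missing_ibgp_sessions(routers, sessions):
    """Router pairs lacking a session, assuming a full-mesh iBGP design."""
    have = {frozenset(s) for s in sessions}
    return [pair for pair in combinations(sorted(routers), 2)
            if frozenset(pair) not in have]


visibility, validity = check_spec(
    usable_paths={"d1", "d2", "d3"},
    advertised_routes={"d1", "d3", "d4"},
)
print("no route despite a path:", sorted(visibility))
print("route without a path:", sorted(validity))

gaps = missing_ibgp_sessions(
    routers={"r1", "r2", "r3"},
    sessions=[("r1", "r2"), ("r2", "r3")],
)
print("missing iBGP sessions:", gaps)
```

In this toy input, r1 and r3 have no session, so under the full-mesh assumption a route learned externally at r1 never reaches r3: a path exists but no route does, which is a path visibility violation.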


Results: Faults across 17 ASes

[Figure: bar chart of the number of ASes (y-axis, 0 to 10) in which each fault type appeared: iBGP Signaling Partition, Duplicate Loopback, Incomplete iBGP Session, Inconsistent Export, Inconsistent Import, Transit Between Peers, Undefined Filter, Incomplete Filter. Faults are grouped under Route Validity and Path Visibility.]

Every AS had faults, regardless of network size

Most faults can be attributed to distributed configuration


Three Directions

• Configuration as programming
  - Tools to cope with routing complexity

• Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful

• End-to-end routing
  - Exposing multiple paths to end systems


Prefixes are too coarse-grained

Validity: If a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn

70% of intra-AS failures not visible in BGP [FABK03]


…but they are also too fine-grained!

• ~70% of discontiguous prefix pairs from the same AS are announced from the same location

• Allocation explains about 60% of these cases:
  - Registries often allocate discontiguous address blocks to a single AS on the same day

• Routes for these prefixes will “flap” together.
  - 135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)

Route objects should correspond to an “atom” of hosts that share fate
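
A small sketch of why prefixes are the wrong unit here: the two example blocks cannot be merged into one prefix, yet they can be grouped into one fate-sharing "atom". The site labels and the extra 192.0.2.0/24 prefix are invented for illustration, and grouping by announcing location is only one possible assumption about what defines an atom.

```python
import ipaddress
from collections import defaultdict

agere = ipaddress.ip_network("135.36.0.0/16")
lucent = ipaddress.ip_network("135.12.0.0/14")

# The blocks are discontiguous: collapsing them still yields two networks,
# so prefix aggregation cannot express that they share fate.
collapsed = list(ipaddress.collapse_addresses([agere, lucent]))
print("after aggregation:", [str(n) for n in collapsed])

# Hypothetical observations: prefix -> label of the location announcing it.
announced_from = {
    agere: "site-A",
    lucent: "site-A",
    ipaddress.ip_network("192.0.2.0/24"): "site-B",
}

# Group prefixes that are announced from the same place into one atom.
atoms = defaultdict(list)
for prefix, site in announced_from.items():
    atoms[site].append(prefix)

for site, prefixes in sorted(atoms.items()):
    print(site, [str(p) for p in sorted(prefixes)])
```

With atoms as the route object, a single update could cover both site-A blocks, instead of two prefix routes that flap together.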


Proposal: Atomic Interdomain Protocol (AIP)

• Exterminate prefixes

• Name “atomic domains” (AD) directly
  - Addressing, forwarding and routing on ADs
  - Like current AS numbers, but finer-grained
  - Example: MIT, Microsoft Redmond, one PoP of a large ISP, …

• Flat AD IDs can carry cryptographic meaning
  - Self-certifying (hash of public key)

• End-system addresses have the form [AD : LocalID]
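
A minimal sketch of a self-certifying identifier, assuming SHA-256 as the hash and raw bytes as the key encoding. The slide says only "hash of public key", so both choices, along with the `Address` type and function names, are illustrative assumptions rather than part of the proposal.

```python
import hashlib
from typing import NamedTuple


class Address(NamedTuple):
    ad: str        # self-certifying atomic-domain identifier
    local_id: str  # identifier meaningful only inside that AD


def ad_id(public_key: bytes) -> str:
    """Derive an AD identifier as a hash of the domain's public key."""
    return hashlib.sha256(public_key).hexdigest()


def key_matches_ad(public_key: bytes, claimed_ad: str) -> bool:
    """Check, without any registry, that a key belongs to the claimed AD."""
    return ad_id(public_key) == claimed_ad


# Placeholder bytes standing in for a real encoded public key.
domain_key = b"example-public-key-bytes"
addr = Address(ad=ad_id(domain_key), local_id="host-17")

print(f"[{addr.ad[:16]}... : {addr.local_id}]")
print(key_matches_ad(domain_key, addr.ad))         # the domain's own key
print(key_matches_ad(b"some-other-key", addr.ad))  # a different key
```

Because the identifier is derived from the key, any party holding the key can recompute the hash and verify the binding locally. This is what lets a flat ID "carry cryptographic meaning".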


Summary

It’s worth shooting for a two or three order-of-magnitude improvement in Internet availability

It’s possible to get four or five nines of Internet availability, if we:
  - Develop tools to cope with configuration complexity
  - Develop a failure-atomic routing system
  - Expose multiple IP-layer paths to higher layers