Feb15.ppt

★ Detecting BGP Configuration Faults with Static Analysis (Nick Feamster et al.)
★ IP Fault Localization via Risk Modeling (Ramana Rao Kompella et al.)
★ Finding a Needle in a Haystack: Pinpointing … (Jian Wu et al.)

Presented by Mikyung Han

Detecting BGP Configuration Faults with Static Analysis

Nick Feamster, Hari Balakrishnan

2nd Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, May 2005

★ Best Paper Award

3/53

The Internet is increasingly becoming part of the mission-critical infrastructure (a public utility!).

Big problem: Very poor understanding of how to manage it.

Is correctness really that important?

4/53

Why does routing go wrong?

Complex policies
  Competing / cooperating networks
  Each with only limited visibility
Large scale
  Tens of thousands of networks
  …each with hundreds of routers
  …each routing to hundreds of thousands of IP prefixes

5/53

What can go wrong?

Two-thirds of the problems are caused by configuration of the routing protocol

Some things are out of the hands of networking research

But…

6/53

Categories of BGP Configurations

Ranking: route selection

Dissemination: internal route advertisement

Filtering: route advertisement

Customer

Competitor

Primary

Backup

More flexibility brings more COMPLEXITY!

7/53

These problems are real

“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.”
-- news.com, April 25, 1997

“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.”
-- wired.com, January 25, 2001

“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue."”
-- cnn.com, October 3, 2002

“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).”
-- dslreports.com, February 23, 2004

8/53

Routing Faults Discussed on NANOG Mailing List

[Figure: bar chart of the number of threads over the stated period (1994-1997, 1998-2001, 2001-2004) for each fault type: filtering, route leaks, route hijacks, route instability, routing loops, blackholes.]

9/53

Why is routing hard to get right?

Defining correctness is hard
Interactions cause unintended consequences
  Each network independently configured
  Unintended policy interactions
Operators make mistakes
  Configuration is difficult
  Complex policies, distributed configuration

10/53

Today: Tweak-N-Pray

Problems cause downtime
Problems often not immediately apparent

What happens if I tweak this policy…?

[Flowchart: Configure → Observe → Desired effect? No: Revert. Yes: Wait for next problem.]

11/53

Goal: Proactive Approach

Idea: Analyze configuration before deployment

[Diagram: Configure → Detect Faults (rcc) → Deploy]

Many faults can be detected with static analysis.

12/53

Router Configuration Checker (rcc)

A tool that finds faults in BGP configuration with static analysis
  Does not require additional work from operators

Detects
  Path visibility faults
  Route validity faults
Only detects faults within a single AS
Only detects faults that cause persistent failures

13/53

What is so cool about rcc?

Finds faults proactively before deployment

Just convenient for now
  BGP might need a high-level specification of policies in the future
  To do so,
    A high-level specification language is needed
    Network operators need to learn and deploy it
    Even so, they may well write it incorrectly!

No additional work from network operators!

14/53

rcc Overview

[Diagram: Distributed router configurations (single AS) → Normalized Representation → “rcc”, which applies Constraints derived from the Correctness Specification → Faults]

Challenges
  Analyzing complex, distributed configuration
  Defining a correctness specification
  Mapping specification to constraints

15/53

rcc Implementation

[Diagram: Distributed router configurations (Cisco, Avici, Juniper, Procket, etc.) → Preprocessor (more parsable version) → Parser → Normalized Representation in a relational database (MySQL, offline) → Verifier, which applies Constraints → Faults]

The verifier runs simple queries (select, join, etc.)
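To make the "constraints as simple queries" idea concrete, here is a minimal sketch that flags sessions referencing a filter that is never defined. It uses Python's built-in sqlite3 module rather than MySQL, and the `sessions` and `filters` tables, their columns, and the sample rows are all hypothetical; the slides do not show rcc's actual schema.

```python
import sqlite3

# Hypothetical normalized representation; not rcc's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (router TEXT, neighbor TEXT, import_filter TEXT);
CREATE TABLE filters  (router TEXT, name TEXT);
INSERT INTO sessions VALUES
  ('r1', '10.0.0.2', 'PEER-IN'),
  ('r2', '10.0.0.6', 'CUST-IN');
INSERT INTO filters VALUES ('r1', 'PEER-IN');
""")

# Constraint: every filter a session refers to must be defined on that router.
# A LEFT JOIN keeps sessions with no matching filter row (f.name is NULL).
undefined = conn.execute("""
SELECT s.router, s.neighbor, s.import_filter
FROM sessions AS s
LEFT JOIN filters AS f
  ON f.router = s.router AND f.name = s.import_filter
WHERE f.name IS NULL
ORDER BY s.router
""").fetchall()

for router, neighbor, name in undefined:
    print(f"{router}: session to {neighbor} uses undefined filter {name}")
```

Each further constraint would be one more query over the same tables, which is presumably why a relational store is convenient here.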

16/53

Which faults does rcc detect?

Faults found by rcc

Latent faults

Potentially active faults

End-to-end failures

17/53

Correctness Specification

Safety
  The protocol converges to a stable path assignment for every possible initial state and message ordering
  The protocol does not oscillate

Path Visibility
  Every destination with a usable path has a route advertisement
  If there exists a path, then there exists a route
  Example violation: Network partition

Route Validity
  Every route advertisement corresponds to a usable path
  If there exists a route, then there exists a path
  Example violation: Routing loop

18/53

Path Visibility in iBGP

[Diagram: “iBGP” sessions among route reflectors (RR) and their clients (c)]

Default: “Full mesh” iBGP. Doesn’t scale.

Large ASes use “Route reflection”
  Route reflector: non-client routes over client sessions; client routes over all sessions
  Client: don’t re-advertise iBGP routes.

19/53

iBGP Fault Example

Network Partition
  W learns r1 via eBGP
  X does not readvertise to other iBGP sessions
  Then Y and Z won’t learn r1 to d

Suboptimal Routing
  Even if Y and Z learn a route to d via eBGP, this would be worse than r1 learned by W

20/53

iBGP Signaling: Static Check

Theorem. Suppose the iBGP reflector-client relationship graph contains no cycles. Then, path visibility is satisfied if, and only if, the set of routers that are not route reflector clients forms a full mesh.

rcc checks whether the iBGP signaling graph G is connected and acyclic, and whether the routers at the top layer of G form a full mesh.
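The static check can be sketched in a few lines. This is an illustration under stated assumptions, not rcc's code: the function name `check_ibgp`, the `client_of` and `peers` data structures, and the router names are hypothetical. Connectivity is not tested separately because, once the reflector-client graph is acyclic, every client reaches some top-layer router by following its reflectors, so a full mesh at the top layer connects the whole graph.

```python
from itertools import combinations

def check_ibgp(routers, client_of, peers):
    """Return a list of fault descriptions (empty if the check passes).

    client_of: dict mapping a client to the set of its route reflectors.
    peers: set of frozenset({a, b}) for ordinary (non-client) iBGP sessions.
    """
    # Step 1: the reflector-client relationship graph must be acyclic (DFS).
    state = {}

    def has_cycle(r):
        if state.get(r) == "done":
            return False
        if state.get(r) == "visiting":
            return True  # reached a router already on the current DFS path
        state[r] = "visiting"
        if any(has_cycle(rr) for rr in client_of.get(r, ())):
            return True
        state[r] = "done"
        return False

    if any(has_cycle(r) for r in routers):
        return ["reflector-client cycle"]

    # Step 2: routers that are nobody's client (the top layer) need a full mesh.
    top = sorted(r for r in routers if not client_of.get(r))
    return [f"missing iBGP session {a}-{b}"
            for a, b in combinations(top, 2)
            if frozenset((a, b)) not in peers]

routers = ["RR1", "RR2", "RR3", "c1", "c2"]
client_of = {"c1": {"RR1"}, "c2": {"RR2"}}
peers = {frozenset(("RR1", "RR2")), frozenset(("RR2", "RR3"))}
faults = check_ibgp(routers, client_of, peers)
print(faults)
```

In this example the three reflectors form the top layer, and the RR1-RR3 session is missing, so the check reports a potential path visibility fault.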

21/53

Route Validity: Policy Related Problems

rcc operates without a specification of the intended policy
  For convenience’s sake

rcc forms beliefs
  Assume intended policies conform to best common practice
  Analyze the configuration for common patterns and look for deviations from those patterns

Still useful, but some false positives

22/53

Route Validity: Best Common Practice

A route learned from a peer should not be re-advertised to another peer
  Ex: Ensuring no routes learned from WorldCom propagate to Sprint

An AS should advertise routes with equally good attributes to each peer at every peering point
  Violations occur
    when routers in the AS have different policy sets for the same peer
    when there exists an iBGP signaling partition

23/53

Route Validity: Configuration Anomalies

When the configurations for sessions at different routers to a neighboring AS are the same except at one or two routers, rcc reports faults!

False positives, of course …
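A rough sketch of this "deviation from the common pattern" heuristic follows. It is not rcc's implementation: the function `find_anomalies`, the `max_deviants` parameter, and the policy summaries are hypothetical, and a policy is reduced here to a simple tuple of filter names.

```python
from collections import Counter

def find_anomalies(policies, max_deviants=2):
    """policies: dict mapping router -> hashable summary of its policy
    toward one neighboring AS. Returns routers that deviate from the
    majority, but only when the deviants are few (one or two by default)."""
    counts = Counter(policies.values())
    majority, majority_count = counts.most_common(1)[0]
    deviants = sorted(r for r, p in policies.items() if p != majority)
    if 0 < len(deviants) <= max_deviants and majority_count > len(deviants):
        return deviants
    return []

policies = {
    "r1": ("PEER-IN", "PEER-OUT"),
    "r2": ("PEER-IN", "PEER-OUT"),
    "r3": ("PEER-IN", "PEER-OUT"),
    "r4": ("PEER-IN", "PEER-OUT-OLD"),  # the odd one out
}
anomalies = find_anomalies(policies)
print(anomalies)
```

The odd router may be configured that way on purpose, which is where the false positives come from.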

24/53

Analyzing Real-World Configuration

Downloaded by 70 network operators; some of them shared their configurations
  Reluctant to share
    because it's proprietary
    because they don’t like researchers finding faults on their network

Detected more than 1000 previously undiscovered faults in 17 ASes

25/53

Summary: Faults across 17 ASes

[Figure: bar chart of the number of ASes (0 to 10) in which each fault type was found. Fault types: iBGP signaling partition, duplicate loopback, incomplete iBGP session, inconsistent export, inconsistent import, transit between peers, undefined filter, incomplete filter. Bars are grouped into Route Validity and Path Visibility.]

Every AS had faults, regardless of network size
Most faults can be attributed to distributed configuration

26/53

rcc: Take-home lessons

Better intra-AS route dissemination protocol needed
  Current route reflection causes many faults!

BGP needs to be configured with a centralized, higher-level specification language
  Current distributed, low-level nature introduces complexity, obscurity, and the possibility of misconfiguration

But! There is a trade-off with flexibility and expressiveness

27/53

Discussion

Strength
  Proves static configuration analysis uncovers many errors
  Identifies major causes of error
    Distributed configuration
    Intra-AS dissemination is too complex
    Mechanistic expression of policy

Weakness
  rcc is not sound or complete
  More room for improvement on ‘beliefs’

IP Fault Localization via Risk Modeling

Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, Alex C. Snoeren

2nd ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), Boston, MA, May 2005

29/53

IP Network Fault-Tolerance

[Diagram: Alice and Eve communicate across the Internet through routers; when an IP fault (X) occurs, traffic takes an alternate path.]

IP networks are designed to be fault-tolerant!

Any failure that causes an IP link to fail is termed an “IP fault”

30/53

Fault Repair

Fast repair is necessary because
  Probability of a simultaneous failure increases with down-time
  Expensive to provision too many alternate paths

Fault localization is a bottleneck for fault repair!

31/53

What makes fault localization hard?

A typical Tier-I ISP network has
  About a thousand routers
  A few thousand IP links
  Tens of thousands of optical components
  About 50-100 thousand miles of optical fiber
  Complicated topologies (mesh, ring, etc.)

Current alarms do not indicate root cause
Often problematic to monitor actual component failure
Failure alerts can get lost

Operators need an automated tool for fast fault localization

32/53

Key Ideas: Shared Risk!

Risk modeling to localize faults across the IP and optical layers

SRLG: Shared Risk Link Group
  A physical object represents a shared risk for a group of logical entities at the IP layer

SCORE: Spatial Correlation Engine
  Cross-correlates dynamic fault information from two disparate network layers

33/53

Logical/Physical IP Network

[Map: QWEST IP network, with nodes including Los Angeles, San Jose, Washington, Atlanta, and Houston]

34/53

Logical/Physical IP Network

[Diagram: Atlanta and Houston routers connected through DWDM equipment with O-E-O conversion. Failed links are marked X, and the DWDM they share is labeled SHARED RISK (“DWDM failed?”).]

Links that share a “Shared Risk” form a Shared Risk Link Group (SRLG)

35/53

Various types of SRLGs

Physical Shared Risks
  SONET (e.g. DWDM, ADM, Optical Amplifiers)
  Fiber
  Fiber Span
  Router
  Module
  Port

Logical Shared Risks
  Autonomous System
  OSPF Areas

36/53

SRLG Prevalence

[Figure: CDF of SRLG cardinality (number of links per group, log scale from 1 to 1000), with curves for fiber spans, fiber, SONET network elements, ports, router modules, routers, areas, and the aggregated database.]

At least 47% of all SRLGs have at least two links
More than 85% of OSPF areas have at least 10 links

Source: section of the AT&T backbone network

37/53

Problem Formulation

A set of links C = {c_1, c_2, …, c_n}

A set of risk groups G = {G_1, G_2, …, G_m}
  G_i = {c_i1, c_i2, …, c_ik}, such that the c_ix are likely to fail simultaneously

An observation
  O = {c_e1, c_e2, …, c_em}

Find a hypothesis H
  H = {G_h1, G_h2, …, G_hk} which explains O
  Every member of O belongs to at least one member of H, and all the members of a given group G_hi belong to O

Many Hs!

Occam’s Razor: let’s not assume more than what is necessary
Simplicity is the best
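The definition of "explains" and the Occam's Razor preference can be written down directly for a toy instance. The sketch below is a brute-force illustration, not the method used by the authors. The risk groups are the listed entries of the example SRLG database from the following slide, and the observation {L0, L2, L4, L5} is an assumption made for this example; the function names are hypothetical.

```python
from itertools import combinations

GROUPS = {
    "R0": {"L0", "L1"}, "R1": {"L0", "L2", "L3", "L4"}, "R2": {"L4", "L5"},
    "R3": {"L3", "L5", "L6"}, "R4": {"L1", "L2", "L6"},
    "D1": {"L0", "L1", "L2"}, "D2": {"L3", "L5", "L6"}, "D3": {"L3", "L4", "L5"},
    "F0": {"L0", "L1"}, "F1": {"L0", "L2"},
}
OBSERVATION = {"L0", "L2", "L4", "L5"}  # assumed set of failed links

def explains(hypothesis, groups, observation):
    """H explains O when the union of the groups in H is exactly O:
    every observed link is covered, and every link of a chosen group failed."""
    covered = set().union(*(groups[g] for g in hypothesis))
    return covered == set(observation)

def smallest_hypothesis(groups, observation):
    """Occam's Razor by exhaustive search: try hypotheses of size 1, 2, ...
    This is exponential in the number of groups, so it only suits toy inputs."""
    names = sorted(groups)
    for size in range(1, len(names) + 1):
        for candidate in combinations(names, size):
            if explains(candidate, groups, observation):
                return set(candidate)
    return None

best = smallest_hypothesis(GROUPS, OBSERVATION)
print(best)
```

The exponential search is what motivates the greedy approximation on the later slides.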

38/53

SRLG Database
  R0 – {L0,L1}
  R1 – {L0,L2,L3,L4}
  R2 – {L4,L5}
  R3 – {L3,L5,L6}
  R4 – {L1,L2,L6}
  D1 – {L0,L1,L2}
  D2 – {L3,L5,L6}
  D3 – {L3,L4,L5}
  F0 – {L0,L1}
  F1 – {L0,L2}
  …

[Diagram: example topology with routers R0-R4, IP links L0-L6, DWDM systems D1-D3, and fiber spans F0-F7]

39/53

Bipartite Graph Formulation

[Diagram: bipartite graph with risk groups (R0-R4, DWDM1-DWDM3, Fiber Span 0, Fiber Span 1) on one side and links L0-L6 on the other; failed links and a candidate failed group are marked X.]

Observation: temporally correlated
Hypothesis: possible explanation

40/53

Bipartite Graph Formulation

[Diagram: the same bipartite graph, with more than one risk group marked X.]

Hypothesis: can contain multiple simultaneous failures

Set cover of a given observation: NP-hard

41/53

Greedy Approximation

[Diagram: bipartite graph of risk groups and links L0-L6, with the observed failed links marked X.]

Hit ratio of R0 = |G_i ∩ O| / |G_i| = 1/2 = 50%
Coverage ratio of R0 = |G_i ∩ O| / |O| = 1/4 = 25%

42/53

Greedy Approximation

[Diagram: the same bipartite graph, with the failed links marked X.]

Group   Hit ratio   Coverage ratio
R0      50%         25%
R1      75%         75%
R2      100%        50%
R3      33%         25%
R4      66%         50%
D1      66%         50%
D2      33%         25%
D3      66%         50%
F0      50%         25%
F1      100%        50%

Out of all groups with hit ratio 100%, pick the group with maximum coverage
Prune the links associated with this group and add this group to the hypothesis
Repeat with the pruned observation until no observation remains unexplained
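The three steps above can be sketched as follows. This is a minimal illustration rather than SCORE's implementation: only a subset of the example SRLG database is used, the observation {L0, L2, L4, L5} is an assumption chosen for this example, and ties in coverage are broken by dictionary order.

```python
GROUPS = {
    "R0": {"L0", "L1"}, "R1": {"L0", "L2", "L3", "L4"}, "R2": {"L4", "L5"},
    "R3": {"L3", "L5", "L6"}, "D1": {"L0", "L1", "L2"},
    "D3": {"L3", "L4", "L5"}, "F0": {"L0", "L1"}, "F1": {"L0", "L2"},
}
OBSERVATION = {"L0", "L2", "L4", "L5"}  # assumed set of failed links

def hit_ratio(group, observation):
    """Fraction of the group's links that were observed to fail."""
    return len(group & observation) / len(group)

def coverage_ratio(group, observation):
    """Fraction of the observation that the group accounts for."""
    return len(group & observation) / len(observation)

def greedy_hypothesis(groups, observation):
    unexplained = set(observation)
    hypothesis = []
    while unexplained:
        # Candidates: every link of the group failed, and the group still
        # explains at least one unexplained link.
        candidates = [name for name, links in groups.items()
                      if hit_ratio(links, observation) == 1.0
                      and links & unexplained]
        if not candidates:
            break  # nothing left that fits; some links stay unexplained
        best = max(candidates, key=lambda n: len(groups[n] & unexplained))
        hypothesis.append(best)
        unexplained -= groups[best]  # prune the explained links
    return hypothesis, unexplained

hypothesis, leftover = greedy_hypothesis(GROUPS, OBSERVATION)
print(hypothesis, leftover)
```

With this input, R0 has a hit ratio of 50% and a coverage ratio of 25%, and the two groups with a 100% hit ratio (R2 and F1) together explain all four links.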

43/53

Modeling Imperfections

Ideally, if a shared component fails, all associated links fail

Not true in practice sometimes
  Failure message could get lost! (transported by UDP)
  Inaccurate modeling of risk groups

Solution: use an error threshold for the hit ratios
  Accounts for losses in data
  Accounts for inaccurate modeling of SRLGs

44/53

Modified Greedy Approximation

Select groups that have hit ratio > error threshold
Out of these groups, identify the group with maximum coverage
Prune the set of links that are explained by this group
Recursively repeat the above steps until all links are fully explained
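A sketch of the thresholded variant is shown below, again as an illustration under assumptions rather than SCORE's code. The scenario is hypothetical: component R1 = {L0, L2, L3, L4} fails but the notification for L3 is lost, so only three links are observed. Hit ratios are computed against the full observation and coverage against the still-unexplained links, which is one possible reading of the slide.

```python
GROUPS = {
    "R1": {"L0", "L2", "L3", "L4"}, "R2": {"L4", "L5"},
    "D3": {"L3", "L4", "L5"}, "F1": {"L0", "L2"},
}

def greedy_with_threshold(groups, observation, threshold):
    observation = set(observation)
    unexplained = set(observation)
    hypothesis = []
    while unexplained:
        # Accept groups whose hit ratio exceeds the error threshold.
        candidates = [name for name, links in groups.items()
                      if len(links & observation) / len(links) > threshold
                      and links & unexplained]
        if not candidates:
            break
        best = max(candidates, key=lambda n: len(groups[n] & unexplained))
        hypothesis.append(best)
        unexplained -= groups[best]
    return hypothesis, unexplained

observed = {"L0", "L2", "L4"}  # L3's failure message was lost
relaxed = greedy_with_threshold(GROUPS, observed, 0.7)
strict = greedy_with_threshold(GROUPS, observed, 0.99)
print(relaxed, strict)
```

With a threshold of 0.7, R1 (hit ratio 75%) is accepted and explains everything; with a near-perfect threshold, only F1 qualifies and L4 is left unexplained.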

45/53

SCORE Spatial Correlation Module

Intelligence is built onto the SRLG database and reflected in the SCORE queries
Obtains minimum set hypothesis

46/53

SCORE System Architecture

[Diagram: router syslogs, SNMP traps, and SONET PM data each pass through a data translator into the fault localization policies layer, which issues multiple queries to the spatial correlation engine (SCORE). SCORE is backed by the SRLG database; results are presented via WWW.]

SCORE API
  Input: <Ckt1, Ckt2 ..>, error threshold
  Output: <Grp1, Grp2..>

Fault localization policies
  1. Event clustering
     - captures events close together in time
  2. Localization heuristics
     - uses multiple error thresholds; outputs H with min cost (|H|/eThresh)
     - queries clustered events with similar signature

47/53

Evaluation : Artificial Faults

Artificially generated faults, but real SRLG database from (a section of) the AT&T backbone network
Picked a set of components to fail
Observation then fed to SCORE
  No losses in data, no database inconsistency
Hypothesis compared with injected faults

48/53

Perfect Fault Notification

[Figure: fraction of correct hypotheses (0.8 to 1) versus number of simultaneously induced failures (0 to 20), with curves for FIBERSPAN, PORT, MODULE, ROUTER, AREA, SONET, and Aggregated.]

Accuracy greater than 95% for 5 failures

49/53

Imperfect Fault Notifications

[Figure: fraction of correct hypotheses (0.65 to 1) versus loss probability (0.05 to 0.3) at eThresh 0.6, with curves for one to five simultaneous failures.]

Almost linear accuracy tradeoff with loss probability

50/53

Evaluation: Real Faults

A set of 18 faults studied and diagnosed
  Where the root cause was well known

One case study
  OSPF area-wide problem that affected about 70 links
  SCORE identified about 20 SRLG groups as the hypothesis
  Further analysis revealed that the error was due to incorrect SRLG modeling
  Relaxing the error threshold to 0.7 brought it down to 4
  Only OSPF interfaces with MPLS enabled were affected by the protocol bug

51/53

Evaluation: Real Faults

Similarly, SCORE uncovered
  Database problems
  Missing error reports from certain links
  Other inconsistencies

Shows how error thresholds are effective in uncovering these inconsistencies and data losses

52/53

Localization Precision

[Figure: CDF of localization fraction (0 to 1).]

About 40% of faults could be localized to less than 5% of components
About 80% of faults could be localized to less than 10% of components

53/53

Discussion

Strength
  Captured the spatial correlation between IP links
  Database inconsistencies are resolved in SCORE using a simple error threshold scheme

Weakness
  Fails to model either very high-level or very low-level risk groups
  Extremely hard to select a single error threshold for all observations!
  Needs more intelligent heuristics for the fault localization policy

Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network

Jian Wu, Z. Morley Mao, Jennifer Rexford, Jia Wang

Proc. Networked Systems Design and Implementation, May 2005

55/53

Challenges & Goals

Large volume of BGP updates
  Millions daily, very bursty
  Too much for an operator to manage

Different from root-cause analysis
  Identify changes and their effects
  Focus on actionable events
  Diagnose causes only in/near the AS

Goal
  Convert millions of BGP updates into a few dozen actionable reports!

56/53

System Architecture

[Diagram: BGP updates (10^6) from the routers (labeled E and BR) → BGP Update Grouping → events (10^5), plus persistent flapping prefixes (10^1) → Event Classification → “typed” events → Event Correlation → clusters (10^3), plus frequent flapping prefixes (10^1) → Traffic Impact Prediction, using Netflow data from the routers → large disruptions (10^1)]

57/53

Grouping BGP Update into Events

Challenge: A single routing change
  leads to multiple update messages
  affects routing decisions at multiple routers

Solution:
  • Group all updates for a prefix with inter-arrival < 70 seconds
  • Flag prefixes with changes lasting > 10 minutes

[Diagram: BGP updates from the routers → BGP Update Grouping → events and persistent flapping prefixes]
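The grouping rule can be sketched as below, using the 70-second and 10-minute values from the slide. The input format (timestamp in seconds plus prefix), the function name `group_updates`, and the choice to keep flagged events in the event list are assumptions made for illustration.

```python
from collections import defaultdict

GAP_SECONDS = 70          # updates closer together than this share an event
PERSISTENT_SECONDS = 600  # events lasting longer than this are flagged

def group_updates(updates):
    """updates: iterable of (timestamp_seconds, prefix).
    Returns (events, persistent), where each event is (prefix, start, end)
    and persistent is the set of prefixes with an event over 10 minutes."""
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)

    events = []
    for prefix, times in by_prefix.items():
        times.sort()
        start = prev = times[0]
        for ts in times[1:]:
            if ts - prev >= GAP_SECONDS:  # quiet period: close the event
                events.append((prefix, start, prev))
                start = ts
            prev = ts
        events.append((prefix, start, prev))

    persistent = {p for p, start, end in events
                  if end - start > PERSISTENT_SECONDS}
    return events, persistent

updates = [(0, "10.0.0.0/8"), (30, "10.0.0.0/8"), (200, "10.0.0.0/8")]
updates += [(t, "192.0.2.0/24") for t in range(0, 661, 60)]  # flaps for 11 min
events, persistent = group_updates(updates)
print(events, persistent)
```

The first prefix yields two separate events, while the second keeps updating every 60 seconds and is flagged as persistently flapping.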

58/53


Challenge: Major concerns in network management
  Changes in reachability
  Heavy load of routing messages on the routers
  Change of flow of traffic through the network

Solution: classify events by severity of their impacts

[Diagram: events → “typed” events]

59/53

Event Correlation

Challenge: A single routing change affects multiple destination prefixes

Solution: group events of same type that occur close in time

[Diagram: “typed” events → Event Correlation → clusters]

60/53

Statistics on Event Classification

                               Events    Updates
No Disruption                  50.3%     48.6%
Internal Disruption            15.6%     3.4%
Single External Disruption     20.7%     7.9%
Multiple External Disruption   7.4%      18.2%
Loss/Gain of Reachability      6.0%      21.9%

First 3 categories have significant variations from day to day
Updates per event depends on the type of events and the number of affected routers

61/53

Traffic Impact Prediction

Challenge: Routing changes have different impacts on the network, depending on the popularity of the destinations

Solution: weigh each cluster by traffic volume

[Diagram: clusters + Netflow data from the routers → Traffic Impact Prediction → large disruptions]
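Weighing clusters by traffic volume could look like the sketch below. The data layout is an assumption: `clusters` maps a cluster id to its affected prefixes, and `traffic_bytes` stands in for per-prefix volumes that would be derived from Netflow data. The function name and sample values are hypothetical.

```python
def rank_clusters(clusters, traffic_bytes, top_n=10):
    """Return up to top_n (cluster_id, traffic_share) pairs, largest first.
    traffic_share is the fraction of total traffic destined to the
    prefixes in the cluster."""
    total = sum(traffic_bytes.values())
    if total == 0:
        return []
    weights = {
        cid: sum(traffic_bytes.get(p, 0) for p in prefixes) / total
        for cid, prefixes in clusters.items()
    }
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

clusters = {"c1": {"10.0.0.0/8", "172.16.0.0/12"}, "c2": {"192.0.2.0/24"}}
traffic_bytes = {"10.0.0.0/8": 100, "172.16.0.0/12": 100, "192.0.2.0/24": 800}
ranking = rank_clusters(clusters, traffic_bytes)
print(ranking)
```

Here the single-prefix cluster ranks first because it carries 80% of the traffic, even though the other cluster affects more prefixes.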

62/53

Conclusion

BGP anomaly detection
  Fast, online fashion
  Significant information reduction (to a few dozen actionable reports!)

Uncovered important network behaviors
  Persistent flapping prefixes
  Hot-potato changes
  Session resets and interface failures
