
Characterization of Failures in the Sprint IP backbone



Athina Markopoulou, Gianluca Iannaccone, Supratik Bhattacharyya, Chen-Nee Chuah, Christophe Diot


Why Study Backbone Failures?

• Backbone networks provide excellent traditional QoS

• E.g. SLAs in Sprint's IP network
  – 0.3% packet loss
  – 55 msec delay in continental USA
  – 99.9% port availability

• Failures are poorly understood…
  – although they happen every day


Link Failures

• Definitions
  – IP link: adjacency between two IS-IS routers
  – Link failure: loss of this adjacency

• Possible reasons
  – Fiber cuts, optical equipment failures, router problems, human error or mis-configuration, maintenance, …

• Impact on the IP layer
  – Topology changes & routing re-configuration


Dealing with Failures

• Potential impact on availability
  – Forwarding disrupted during route re-convergence
  – Overload/congestion on backup paths

• Network design becomes hard
  – Protection mechanisms
  – Topology design
  – Capacity provisioning
  – Timer tuning

• A failure model is needed!


Outline

• Motivation

• Contributions
  – Measurement collection
  – Failure classification
  – Modeling of each class

• Conclusion


Measurements of Link Failures at the IS-IS Level

• IS-IS listeners collect flooded LSPs (link-state PDUs)

[Figure: Routers A and B are connected by link A-B; a listener records the flooded LSPs. The failure duration runs from the first "link down" LSPs (from A: link to B is down; from B: link to A is down) to the matching "link up" LSPs (from A: link to B is up; from B: link to A is up).]

• Record: (link A-B, router A, router B, start time, end time)
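A minimal sketch of how a listener's LSP stream could be paired into such records. The data layout is hypothetical and simplified: one announcement per link rather than one per router end, so the record omits the two reporting routers that the slide's tuple includes.

```python
def pair_failures(lsp_events):
    """Pair 'down'/'up' LSP announcements into (link, start, end) records.
    lsp_events: time-ordered (time, link, state) tuples, state in {'down','up'}.
    Simplified sketch: one announcement per link, not per router end."""
    down_since = {}   # link -> time of the first 'down' LSP
    records = []
    for t, link, state in lsp_events:
        if state == "down" and link not in down_since:
            down_since[link] = t
        elif state == "up" and link in down_since:
            records.append((link, down_since.pop(link), t))
    return records

# A failure on link A-B announced down at t=100 and up at t=160:
events = [(100, "A-B", "down"), (160, "A-B", "up")]
print(pair_failures(events))  # [('A-B', 100, 160)]
```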


Data Set: US Failures, Apr. - Nov. 2002

[Figure: number of failures per day, April-November 2002]


Classification Methodology

• Data Set
  – Maintenance (scheduled)
  – Unplanned failures
    • Shared link failures (happen simultaneously on 2 or more links)
    • Individual link failures (happen on only one link)


Maintenance

Weekly maintenance window (Mondays 5am-2pm UTC) accounts for 20% of failures
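Scheduled failures can be filtered out with a simple timestamp check; a sketch, with the window boundaries taken from the slide:

```python
from datetime import datetime, timezone

def in_maintenance_window(ts):
    """True if ts falls in the weekly maintenance window:
    Mondays, 05:00-14:00 UTC (5am-2pm)."""
    ts = ts.astimezone(timezone.utc)
    return ts.weekday() == 0 and 5 <= ts.hour < 14

# Monday 2002-04-08 06:30 UTC is inside the window; the next day is not.
print(in_maintenance_window(datetime(2002, 4, 8, 6, 30, tzinfo=timezone.utc)))  # True
print(in_maintenance_window(datetime(2002, 4, 9, 6, 30, tzinfo=timezone.utc)))  # False
```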


Data Set, Revisited


Excluding Maintenance: Unplanned failures



Shared failures (I): simultaneous

• 2+ links go down/up at the exact same time
  – 16.5% of all unplanned failures

• For every such event
  – all links connected to the same router

• Indicates a router-related cause

[Figure: Router 0 connects through a single linecard to Routers 1, 2, and 3; one LSP from Router 0 reports all 3 links down, and a later LSP from Router 0 reports all 3 links up again.]
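Detecting such events can be sketched as grouping failure records by identical (start, end) timestamps and checking for a shared router. The record layout and names below are hypothetical:

```python
from collections import defaultdict

def simultaneous_events(failures):
    """Group failure records that share the exact same (start, end) times.
    failures: list of (router_a, router_b, start, end) records."""
    groups = defaultdict(list)
    for ra, rb, start, end in failures:
        groups[(start, end)].append((ra, rb))
    return [links for links in groups.values() if len(links) >= 2]

def common_router(event):
    """Router(s) shared by every link in the event, if any;
    a non-empty result suggests a router-related cause."""
    shared = set(event[0])
    for link in event[1:]:
        shared &= set(link)
    return shared

recs = [("R0", "R1", 10, 20), ("R0", "R2", 10, 20), ("R0", "R3", 10, 20)]
events = simultaneous_events(recs)
print(common_router(events[0]))  # {'R0'}
```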


Shared failures (II): overlapping

• Some links fail almost simultaneously
  – 14.3% of all failures are grouped in overlapping events

• In 80% of overlapping events
  – links have no common router
  – but all links share multiple optical components
  – a good indication of an optical-related failure

[Figure: Links 1, 2, and 3 share an optical component; their failures overlap in time and are grouped into one event using timing windows W1 and W2.]
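The grouping above can be sketched as clustering failures whose start times fall close together. This simplification uses a single timing window rather than the two windows (W1, W2) on the slide, and the window value is hypothetical:

```python
def overlapping_events(failures, window):
    """Greedily group failures whose start times lie within `window`
    seconds of the previous failure's start in the group.
    failures: list of (link, start, end) records."""
    events, current = [], []
    last_start = None
    for link, start, end in sorted(failures, key=lambda f: f[1]):
        if last_start is not None and start - last_start > window:
            events.append(current)
            current = []
        current.append((link, start, end))
        last_start = start
    if current:
        events.append(current)
    return events

recs = [("L1", 0, 9), ("L2", 1, 8), ("L3", 2, 7), ("L4", 100, 110)]
evs = overlapping_events(recs, window=5)
print([len(e) for e in evs])  # [3, 1]
```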


Classification

• Data Set
  – Maintenance
  – Unplanned failures (100%)
    • Shared link failures (30.8%)
      – Simultaneous → router related (16.5%)
      – Overlapping → optical related (11.4%) or unspecified
    • Individual link failures (69.2%)


Individual Link Failures

[Figure: individual link failures per day, April-November 2002]

After excluding maintenance and shared (router, optical)


High vs. Low Failure Links

• Normalized number of failures per link

• High degree of heterogeneity

• Roughly two power-laws

• A few (2.5%) links account for half of individual failures


High Failure Links

• Components may be old or undergoing upgrades

[Figure: failures per day on high-failure links, April-November 2002]


Low Failure Links


Classification Summary

• Data Set → Unplanned failures (100%)
  – Shared link failures (30.8%)
    • Simultaneous → router related: 16.5%
    • Overlapping → optical related: 11.4%; unspecified cause: 2.9%
  – Individual link failures (69.2%)
    • High failure links: 38.5%
    • Low failure links: 30.7%


Characterize each class

1. How often?

2. How are they spread across links?
   – e.g. low failure links, revisited

3. How long do they last?


1. How often?
Time between two successive failures, on any link

• Weibull: F(x) = 1 − exp(−(x/scale)^shape)
• Low autocorrelation
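Inter-failure times under this model can be drawn with the standard library; a sketch with hypothetical shape/scale values (not the fitted parameters from the study):

```python
import random

random.seed(7)
shape, scale = 0.8, 600.0   # hypothetical parameters, not the fitted values

# Draw inter-failure times from F(x) = 1 - exp(-(x/scale)^shape);
# random.weibullvariate takes (scale, shape) in that order.
samples = [random.weibullvariate(scale, shape) for _ in range(200_000)]

# Sanity check: the empirical CDF at x = scale should approach
# F(scale) = 1 - e^(-1) ~ 0.632, for any shape.
f_emp = sum(s <= scale for s in samples) / len(samples)
print(round(f_emp, 2))  # ~0.63
```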


2. How are they spread across links?

• Order links in decreasing number of failures
• Link with rank l has n_l failures, s.t. n_l ∝ l^(−1.35)

If
• long time T
• independent links
Then
• given that there is a failure, it happens on link l with probability
  p_l^cond = n_l / (n_1 + n_2 + … + n_L)
• and p_l ≈ n_l / T
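The rank model above can be sketched numerically: with a hypothetical population of L links and the slide's exponent, the conditional probabilities follow directly, and a few top-ranked links carry far more than their proportional share of failures.

```python
EXPONENT = -1.35   # from the rank/failure-count fit on the slide
L = 1000           # hypothetical number of links

# n_l proportional to l^(-1.35); p_l^cond = n_l / (n_1 + ... + n_L)
n = [l ** EXPONENT for l in range(1, L + 1)]
total = sum(n)
p = [n_l / total for n_l in n]

assert abs(sum(p) - 1.0) < 1e-9                     # valid distribution
assert all(p[i] >= p[i + 1] for i in range(L - 1))  # decreasing in rank

# Concentration: the top 2.5% of links carry far more than a 2.5% share.
top = int(0.025 * L)
print(round(sum(p[:top]), 2))
```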


Characterizing the Properties of Each Class

Class              | 1. Time between failures             | 2. Num. of failures per link (or num. of events per router) | 3. Duration | 4. Num. of links in the same event
Optical related    | Network-wide: Weibull, low autocorr. | –         | Empirical | Empirical
Router related     | Network-wide: Weibull, low autocorr. | Power-law | Empirical | Empirical
High failure links | Per link: Empirical                  | Power-law | Empirical | N/A
Low failure links  | Network-wide: Weibull, low autocorr. | Power-law | Empirical | N/A


Failure Durations – all groups


Conclusion

• Summary
  – Measurements
  – Classification
  – Characterization

• Implications
  – Contribution: a failure model
  – Problem area: network reliability


[email protected]

Thank you!