
Page 1

Detecting network outages using different sources of data

TMA Experts Summit, Paris, France

Cristel Pelsser

University of Strasbourg / ICube

June 2019

Page 2

Some perspective on:

• From unsolicited traffic
  Detecting Outages using Internet Background Radiation. Andreas Guillot (U. Strasbourg), Romain Fontugne (IIJ), Philipp Winter (CAIDA), Pascal Merindol (U. Strasbourg), Alistair King (CAIDA), Alberto Dainotti (CAIDA), Cristel Pelsser (U. Strasbourg). TMA 2019.

• From highly distributed permanent TCP connections
  Disco: Fast, Good, and Cheap Outage Detection. Anant Shah (Colorado State U.), Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (U. Strasbourg), Randy Bush (IIJ, Arrcus). TMA 2017.

• From large-scale traceroute measurements
  Pinpointing Anomalies in Large-Scale Traceroute Measurements. Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (U. Strasbourg), Randy Bush (IIJ, Arrcus). IMC 2017.

Page 3

Understanding Internet health? (Motivation)

• To speed up failure identification and thus recovery

• To identify weak areas and thus guide network design

Page 4

Understanding Internet health? (Problem 1)

Manual observations and operations

• Traceroute / Ping / Operators' group mailing lists

• Time consuming

• Slow process

• Limited visibility

→ Our goal: Automatically pinpoint network disruptions (i.e., congestion and network disconnections)

Page 5

Understanding Internet health? (Problem 2)

A single viewpoint is not enough

→ Our goal: mine results from deployed platforms

→ Cooperative and distributed approach

→ Using existing data, no added burden on the network

Page 6

Outage detection from unsolicited traffic

Page 7

Dataset: Internet Background Radiation

[Diagram] P1 is advertised to the Internet

Page 8

Dataset: Internet Background Radiation

[Diagram] P1 is advertised to the Internet; the telescope receives scans and responses to spoofed traffic

Page 9

Dataset: Internet Background Radiation

[Diagram] A remote host sends spoofed traffic with source addresses in P1

Page 10

Dataset: Internet Background Radiation

[Diagram] Hosts receiving the spoofed traffic respond toward P1

Page 11

Dataset: IP count time series (per country or AS)

Use cases: attacks, censorship, local outage detection

[Figure 1: Egyptian revolution — number of unique source IPs over time, 2011-01-14 to 2011-02-07]

⇒ More than 60,000 time series in the CAIDA telescope data.

We treat drops in the time series as indicators of an outage.
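As an illustration of how such a series can be built, here is a minimal Python sketch that bins darknet packets into per-AS unique-source-IP counts; the record fields (ts, src_ip, src_asn) and the 5-minute bin width are assumptions, not the actual CAIDA telescope schema.

    from collections import defaultdict

    BIN = 300  # bin width in seconds (assumed; the telescope granularity may differ)

    def unique_ip_series(packets):
        """packets: iterable of (ts, src_ip, src_asn) records from the telescope."""
        bins = defaultdict(set)                    # (asn, bin start) -> source IPs
        for ts, src_ip, src_asn in packets:
            bins[(src_asn, int(ts) // BIN * BIN)].add(src_ip)
        series = defaultdict(dict)                 # asn -> {bin start: unique IP count}
        for (asn, t), ips in bins.items():
            series[asn][t] = len(ips)
        return series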

Page 12

Current methodology used by IODA

Detecting outages using fixed thresholds

Page 13

Our goal

Detecting outages using dynamic thresholds

Page 14

Outage detection process

[Figure: the original time series (number of unique source IPs, 2011-01-14 to 2011-02-07) split into training, validation, and test periods]

Page 15

Outage detection process

[Figure: training, calibration, and test periods showing the original time series, the predicted time series, and the prediction confidence interval]

Page 16

Outage detection process

[Figure: training, validation, and test periods showing the original and predicted time series]

• When the real data falls outside the prediction interval, we raise an alarm.

• We want a prediction model that is robust to the seasonality and noise in the data → we use the SARIMA model¹.

¹More details on the methodology on Wednesday.
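A minimal sketch of this alarm logic with statsmodels' SARIMAX class (which implements SARIMA); the hourly bins, the daily season of 24, and the (1,1,1)(1,1,1) orders are illustrative assumptions, not Chocolatine's actual parameters.

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def detect_drops(train, test, season=24, alpha=0.05):
        """train/test: 1-D numpy arrays of unique-source-IP counts per bin."""
        model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, season))
        res = model.fit(disp=False)
        forecast = res.get_forecast(steps=len(test))
        lower = forecast.conf_int(alpha=alpha)[:, 0]   # lower prediction bound
        # Alarm on every bin where the observed count drops below the interval.
        return np.where(test < lower)[0]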

Page 17

Validation: ground truth

Characteristics

• 130 known outages

• Multiple spatial scales
  • Countries
  • Regions
  • Autonomous Systems

• Multiple durations (from an hour to a week)

• Multiple causes (intentional or unintentional)

Page 18

Evaluating our solution

Objectives

• Identifying the minimal number of IP addresses

• Identifying a good threshold

Threshold

• TPR of 90% and FPR of 2%

[Figure 2: ROC curve — all time series, series with < 20 IPs, series with > 20 IPs; thresholds at 2 sigma (95%), 3 sigma (99.5%), and 5 sigma (99.99%)]
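To pick the operating threshold from such a curve, a scikit-learn sketch, assuming binary per-window labels (1 = known outage) and an anomaly score such as the distance below the prediction interval; these inputs are assumptions for illustration:

    import numpy as np
    from sklearn.metrics import roc_curve

    def pick_threshold(labels, scores, target_tpr=0.90):
        fpr, tpr, thr = roc_curve(labels, scores)
        i = int(np.argmax(tpr >= target_tpr))   # first threshold reaching the TPR
        return thr[i], tpr[i], fpr[i]           # e.g. aiming for TPR 90%, FPR ~2%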

Page 19

Comparing our proposal (Chocolatine) to CAIDA's tools

• More events detected than the simplistic thresholding technique (DN)

• Higher overlap with the other detection techniques

• Not a complete overlap → differences in dataset coverage → different sensitivities to outages

[Venn diagrams of detected events: DN vs. BGP vs. active probing (AP), with region counts 17, 644, 633, 36, 985, 3, 15; Chocolatine vs. BGP vs. AP, with region counts 251, 445, 440, 235, 489, 196, 511]

Page 20

Outage detection from highly distributed permanent TCP connections

Page 21

Proposed Approach

Disco:

• Monitor long-running TCP connections and synchronous disconnections from a related network/area

• We apply Disco to RIPE Atlas data, where probes are widely distributed at the edge and behind NATs/CGNs, providing visibility Trinocular may not have

→ Outage = synchronous disconnections from the same topological/geographical area

Page 22

Assumptions / Design Choices

Rely on TCP disconnects

• Hence the granularity of detection depends on TCP timeouts

Bursts of disconnections are indicators of an interesting outage

• While there may be non-bursty outages that are interesting, Disco is designed to detect large synchronous disconnections

Page 23

Proposed System: Disco & Atlas

RIPE Atlas platform

• 10k probes worldwide

• Persistent connections with RIPE controllers

• Continuous traceroute measurements (see outages from the inside)

→ Dataset: stream of probe connections/disconnections (from 2011 to 2016)

Page 24

Disco Overview

1. Split the disconnection stream into sub-streams (AS, country, geo-proximate 50 km radius), as sketched below

2. Burst modeling and outage detection

3. Aggregation and outage reporting
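A sketch of step 1, assuming hypothetical per-probe metadata fields (asn, country, grid_cell for the ~50 km geo bucket); the real Disco keys may be structured differently:

    from collections import defaultdict

    def split_streams(disconnections, probes):
        """disconnections: (probe_id, ts) events; probes: probe_id -> metadata."""
        streams = defaultdict(list)
        for probe_id, ts in disconnections:
            meta = probes[probe_id]
            streams[("as", meta["asn"])].append(ts)
            streams[("country", meta["country"])].append(ts)
            streams[("geo", meta["grid_cell"])].append(ts)  # ~50 km radius bucket
        return streams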

Page 25

Why Burst Modeling?

Goal: how to find synchronous disconnections?

• Time series conceal temporal characteristics

• A burst model estimates the disconnection arrival rate at any time

Implementation: Kleinberg burst model²

²J. Kleinberg. "Bursty and hierarchical structure in streams", Data Mining and Knowledge Discovery, 2003.
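For reference, a compact Viterbi decoding of Kleinberg's burst automaton over inter-arrival gaps; the parameters s and gamma are illustrative defaults, not the values used in Disco.

    import math

    def kleinberg_bursts(times, s=2.0, gamma=1.0, k=2):
        """Burst level per inter-arrival gap (needs >= 2 distinct timestamps)."""
        times = sorted(times)
        gaps = [b - a for a, b in zip(times, times[1:])]
        n, span = len(gaps), times[-1] - times[0]
        rates = [(n / span) * s ** i for i in range(k)]  # state i fires s^i faster

        def gap_cost(i, x):    # -log density of gap x under state i (exponential)
            return rates[i] * x - math.log(rates[i])

        def move_cost(i, j):   # climbing to a burstier state is penalised
            return gamma * math.log(n) * (j - i) if j > i else 0.0

        best = [0.0] + [math.inf] * (k - 1)              # start in the base state
        paths = [[] for _ in range(k)]
        for x in gaps:
            nbest, npaths = [], []
            for j in range(k):
                c, i = min((best[i] + move_cost(i, j), i) for i in range(k))
                nbest.append(c + gap_cost(j, x))
                npaths.append(paths[i] + [j])
            best, paths = nbest, npaths
        return paths[min(range(k), key=best.__getitem__)]

Gaps decoded into state 1 (or higher) mark periods where disconnections arrive markedly faster than the stream's average rate, i.e., candidate outages.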

Page 26

Burst modeling: Example

• A monkey caused a blackout in Kenya at 8:30 UTC on June 7, 2016

• The same day, RIPE rebooted its controllers

Page 27

Results

Outage detection:

• Atlas probe disconnections from 2011 to 2016

• Disco found 443 significant outages

Outage characterization and validation:

• Traceroute results from probes (buffered if no connectivity)

• Outage detection results from Trinocular

Page 28

Validation (Traceroute)

Comparison to traceroutes:

• Can probes in detected outages reach the traceroute destinations?
  → Velocity ratio: proportion of completed traceroutes in a given time

[Figure: probability mass function of the average velocity ratio R during normal periods vs. detected outages]

→ Velocity ratio ≤ 0.5 for 95% of detected outages
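A sketch of the velocity-ratio computation under assumed record shapes; normalising the window's completion rate by the probes' long-term rate (so values near 1 are normal) is my reading of the "average velocity ratio" axis, not a definition taken from the paper.

    def velocity_ratio(results, t0, t1):
        """results: (ts, completed) pairs for traceroutes from the affected probes."""
        window = [done for ts, done in results if t0 <= ts < t1]
        overall = sum(done for _, done in results) / len(results)
        if not window or overall == 0:
            return None
        return (sum(window) / len(window)) / overall   # <= 0.5 suggests an outage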

Page 29

Validation (Trinocular)

Comparison to Trinocular (2015):

• Disco found 53 outages in 2015

• Corresponding to 851 /24s (only 43% are responsive to ICMP)

Results for /24s reported by Disco and pinged by Trinocular:

• 33/53 are also found by Trinocular

• 9/53 are missed by Trinocular (average outage duration < 1 hr)

• The other outages are partially detected by Trinocular

23 outages found by Trinocular are missed by Disco

• Disconnections are not very bursty in these cases

→ Disco's precision: 95%, recall: 67%

Page 30

Outage detection from large-scale traceroute measurements

Page 31

Dataset: RIPE Atlas traceroutes

Two repetitive large-scale measurements

• Builtin: traceroute every 30 minutes to all DNS root servers (≈ 500 server instances)

• Anchoring: traceroute every 15 minutes to 189 collaborative servers

Analyzed dataset

• May to December 2015

• 2.8 billion IPv4 traceroutes

• 1.2 billion IPv6 traceroutes

Page 32

Monitor delays with traceroute?

Traceroute to "www.target.com"

Round Trip Time (RTT) between B and C?

Report abnormal RTT between B and C?

Page 33

Monitor delays with traceroute?

Challenges:

• Noisy data

• Traffic asymmetry

[Figure: RTT (ms) vs. number of hops for traceroutes from CZ to BD]


Page 35

What is the RTT between B and C?

RTT_C − RTT_B = RTT_CB?

Page 36

What is the RTT between B and C?

RTT_C − RTT_B = RTT_CB?

• No!

• Traffic is asymmetric

• RTT_B and RTT_C take different return paths!

• Differential RTT: ∆_CB = RTT_C − RTT_B = d_BC + e_p, i.e., the BC link delay plus an error term from the differing return paths
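A sketch of how the ∆_CB samples can be collected from traceroutes, assuming each traceroute is given as a list of (hop IP, RTT) pairs; the input format is an assumption for illustration:

    def diff_rtt_samples(traceroutes, b_ip, c_ip):
        """Collect delta_CB = RTT_C - RTT_B over traceroutes crossing link (B, C)."""
        samples = []
        for hops in traceroutes:          # hops: list of (ip, rtt) pairs
            rtt = dict(hops)
            if b_ip in rtt and c_ip in rtt:
                samples.append(rtt[c_ip] - rtt[b_ip])
        return samples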


Page 38

Problem with differential RTT

Monitoring ∆_CB over time:

[Figure: ∆RTT time series fluctuating between 0 and 30 ms]

→ Delay change on BC? CD? DA? BA?

Page 39

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = x0

Page 40

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = {x0, x1}

Page 41

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = {x0, x1, x2, x3, x4}

Page 42

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = {x0, x1, x2, x3, x4}

Median ∆_CB:

• Stable if only a few return-path delays change

• Fluctuates if the delay on BC changes

Page 43

Median Diff. RTT: Tier-1 link, 2 weeks of data, 95 probes

[Figure: 130.117.0.250 (Cogent, Zurich) - 154.54.38.50 (Cogent, Munich), June 2-14, 2015. Raw differential RTT values span roughly ±400 ms, while the median differential RTT stays within about 4.8-5.6 ms of the normal reference.]

• Stable despite noisy RTTs (not true for the average)

• Normally distributed

Page 44

Detecting congestion

[Figure: 72.52.92.14 (HE, Frankfurt) - 80.81.192.154 (DE-CIX (RIPE)), Nov 26 - Dec 1, 2015: median differential RTT, normal reference, and detected anomalies]

Significant RTT changes: confidence interval not overlapping with the normal reference
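A sketch of this alarm rule, using a standard non-parametric (order-statistic) confidence interval for the median; z = 1.96 gives roughly 95% coverage. The exact interval construction used in the paper may differ.

    import math

    def median_ci(samples, z=1.96):
        """Median and order-statistic confidence interval for the median."""
        xs = sorted(samples)
        n, mid, half = len(xs), len(xs) // 2, z * math.sqrt(len(xs)) / 2
        lo = xs[max(0, math.floor(mid - half))]
        hi = xs[min(n - 1, math.ceil(mid + half))]
        return lo, xs[mid], hi

    def is_anomalous(window, reference):
        w_lo, _, w_hi = median_ci(window)     # current window of diff-RTT samples
        r_lo, _, r_hi = median_ci(reference)  # samples from the normal reference
        return w_hi < r_lo or w_lo > r_hi     # intervals do not overlap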

Page 45

Results

Analyzed dataset

• Atlas builtin/anchoring measurements

• From May to December 2015

• 2.8 billion IPv4 traceroutes

• 1.2 billion IPv6 traceroutes

• Observed 262k IPv4 and 42k IPv6 links (core links)

We found a lot of congested links! Let's look at one example.

Page 46

Case study: Telekom Malaysia BGP leak


Page 50

Case study: Telekom Malaysia BGP leak

Not only with Google... but about 170k prefixes!

Page 51

Congestion in Level3

Rerouted traffic congested Level3 (120 reported links)

• Example: 229 ms increase between two routers in London!

[Figure: 67.16.133.130 - 67.17.106.150, June 8-13, 2015: median differential RTT, normal reference, and detected anomalies]

Page 52

Congestion in Level3

Reported links in London:

[Map: reported links, distinguishing delay increase from delay & packet loss]

→ Traffic staying within the UK/Europe may also be affected

Page 53

But why did we look at that?

Per-AS alarm for delay

Page 54

Conclusions and perspectives (1)

We proposed 3 different techniques to detect outages for 3 different sources of data

• Each source of data has its own coverage
  • Core links (congestion and failures)
  • Prefix, country, region, and AS disconnections

Page 55

Conclusions and perspectives (1)

We proposed 3 different techniques to detect outages for 3 different sources of data

• Each source of data has its own coverage
  • Core links (congestion and failures)
  • Prefix, country, region, and AS disconnections

• Each source of data has its own noise and properties
  • Identifying a suitable model is a challenge

Page 56

Conclusions and perspectives (2)

There is no substantial, state-of-the-art ground truth to validate the results. We resort to

• comparing different techniques with different coverages

• evaluations based on partial ground truth

• characterizations of the detected outages based on the detection algorithm used

Page 57

Turn this

Page 58

Into this

http://ihr.iijlab.net