
Page 1

Detecting network outages using different sources of data

TMA Experts Summit, Paris, France

Cristel Pelsser

University of Strasbourg / ICube

June 2019

Page 2

Some perspective on:

• From unsolicited traffic
  Detecting Outages using Internet Background Radiation. Andreas Guillot (U. Strasbourg), Romain Fontugne (IIJ), Philipp Winter (CAIDA), Pascal Merindol (U. Strasbourg), Alistair King (CAIDA), Alberto Dainotti (CAIDA), Cristel Pelsser (U. Strasbourg). TMA 2019.

• From highly distributed permanent TCP connections
  Disco: Fast, Good, and Cheap Outage Detection. Anant Shah (Colorado State U.), Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (U. Strasbourg), Randy Bush (IIJ, Arrcus). TMA 2017.

• From large-scale traceroute measurements
  Pinpointing Anomalies in Large-Scale Traceroute Measurements. Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (U. Strasbourg), Randy Bush (IIJ, Arrcus). IMC 2017.

Page 3

Understanding Internet health? (Motivation)

• To speed up failure identification and thus recovery

• To identify weak areas and thus guide network design

Page 4

Understanding Internet health? (Problem 1)

Manual observations and operations

• Traceroute / Ping / Operators' group mailing lists

• Time consuming

• Slow process

• Limited visibility

→ Our goal: Automatically pinpoint network disruptions (i.e., congestion and network disconnections)

Page 5

Understanding Internet health? (Problem 2)

A single viewpoint is not enough

→ Our goal: mine results from deployed platforms

→ Cooperative and distributed approach

→ Using existing data, no added burden on the network

Page 6

Outage detection from unsolicited traffic

Page 7

Dataset: Internet Background Radiation

[Diagram] P1 is advertised to the Internet

Page 8

Dataset: Internet Background Radiation

[Diagram] P1 is advertised to the Internet; the telescope receives scans and responses to spoofed traffic

Page 9

Dataset: Internet Background Radiation

[Diagram] A remote host sends spoofed traffic with source addresses in P1

Page 10

Dataset: Internet Background Radiation

[Diagram] Hosts receiving the spoofed traffic respond toward P1

Page 11

Dataset: IP count time series (per country or AS)

Use cases: attacks, censorship, local outage detection

[Figure 1: Egyptian revolution — number of unique source IPs over time, 2011-01-14 to 2011-02-07]

⇒ More than 60,000 time series in the CAIDA telescope data.

We treat drops in the time series as indicators of an outage.
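As an illustration of how such a series can be built, here is a minimal Python sketch that bins darknet packets into per-AS unique-source-IP counts; the record fields (ts, src_ip, src_asn) and the 5-minute bin width are assumptions, not the actual CAIDA telescope schema.

    from collections import defaultdict

    BIN = 300  # bin width in seconds (assumed; the telescope granularity may differ)

    def unique_ip_series(packets):
        """packets: iterable of (ts, src_ip, src_asn) records from the telescope."""
        bins = defaultdict(set)                    # (asn, bin start) -> source IPs
        for ts, src_ip, src_asn in packets:
            bins[(src_asn, int(ts) // BIN * BIN)].add(src_ip)
        series = defaultdict(dict)                 # asn -> {bin start: unique IP count}
        for (asn, t), ips in bins.items():
            series[asn][t] = len(ips)
        return series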

Page 12

Current methodology used by IODA

Detecting outages using fixed thresholds

Page 13

Our goal

Detecting outages using dynamic thresholds

Page 14

Outage detection process

[Figure: the original time series (number of unique source IPs, 2011-01-14 to 2011-02-07) split into training, validation, and test periods]

Page 15

Outage detection process

[Figure: training, calibration, and test periods showing the original time series, the predicted time series, and the prediction confidence interval]

Page 16

Outage detection process

[Figure: training, validation, and test periods showing the original and predicted time series]

• When the real data falls outside the prediction interval, we raise an alarm.

• We want a prediction model that is robust to the seasonality and noise in the data → we use the SARIMA model¹.

¹More details on the methodology on Wednesday.
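A minimal sketch of this alarm logic with statsmodels' SARIMAX class (which implements SARIMA); the hourly bins, the daily season of 24, and the (1,1,1)(1,1,1) orders are illustrative assumptions, not Chocolatine's actual parameters.

    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    def detect_drops(train, test, season=24, alpha=0.05):
        """train/test: 1-D numpy arrays of unique-source-IP counts per bin."""
        model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, season))
        res = model.fit(disp=False)
        forecast = res.get_forecast(steps=len(test))
        lower = forecast.conf_int(alpha=alpha)[:, 0]   # lower prediction bound
        # Alarm on every bin where the observed count drops below the interval.
        return np.where(test < lower)[0]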

Page 17

Validation: ground truth

Characteristics

• 130 known outages

• Multiple spatial scales
  • Countries
  • Regions
  • Autonomous Systems

• Multiple durations (from an hour to a week)

• Multiple causes (intentional or unintentional)

Page 18

Evaluating our solution

Objectives

• Identifying the minimal number of IP addresses

• Identifying a good threshold

Threshold

• TPR of 90% and FPR of 2%

[Figure 2: ROC curve — all time series, series with < 20 IPs, series with > 20 IPs; thresholds at 2 sigma (95%), 3 sigma (99.5%), and 5 sigma (99.99%)]
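To pick the operating threshold from such a curve, a scikit-learn sketch, assuming binary per-window labels (1 = known outage) and an anomaly score such as the distance below the prediction interval; these inputs are assumptions for illustration:

    import numpy as np
    from sklearn.metrics import roc_curve

    def pick_threshold(labels, scores, target_tpr=0.90):
        fpr, tpr, thr = roc_curve(labels, scores)
        i = int(np.argmax(tpr >= target_tpr))   # first threshold reaching the TPR
        return thr[i], tpr[i], fpr[i]           # e.g. aiming for TPR 90%, FPR ~2%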

Page 19

Comparing our proposal (Chocolatine) to CAIDA's tools

• More events detected than the simplistic thresholding technique (DN)

• Higher overlap with the other detection techniques

• Not a complete overlap → differences in dataset coverage → different sensitivities to outages

[Venn diagrams of detected events: DN vs. BGP vs. active probing (AP), with region counts 17, 644, 633, 36, 985, 3, 15; Chocolatine vs. BGP vs. AP, with region counts 251, 445, 440, 235, 489, 196, 511]

Page 20

Outage detection from highly distributed permanent TCP connections

Page 21

Proposed Approach

Disco:

• Monitor long-running TCP connections and synchronous disconnections from a related network/area

• We apply Disco to RIPE Atlas data, where probes are widely distributed at the edge and behind NATs/CGNs, providing visibility Trinocular may not have

→ Outage = synchronous disconnections from the same topological/geographical area

Page 22

Assumptions / Design Choices

Rely on TCP disconnects

• Hence the granularity of detection depends on TCP timeouts

Bursts of disconnections are indicators of an interesting outage

• While there may be non-bursty outages that are interesting, Disco is designed to detect large synchronous disconnections

Page 23

Proposed System: Disco & Atlas

RIPE Atlas platform

• 10k probes worldwide

• Persistent connections with RIPE controllers

• Continuous traceroute measurements (see outages from the inside)

→ Dataset: stream of probe connections/disconnections (from 2011 to 2016)

Page 24

Disco Overview

1. Split the disconnection stream into sub-streams (AS, country, geo-proximate 50 km radius), as sketched below

2. Burst modeling and outage detection

3. Aggregation and outage reporting
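A sketch of step 1, assuming hypothetical per-probe metadata fields (asn, country, grid_cell for the ~50 km geo bucket); the real Disco keys may be structured differently:

    from collections import defaultdict

    def split_streams(disconnections, probes):
        """disconnections: (probe_id, ts) events; probes: probe_id -> metadata."""
        streams = defaultdict(list)
        for probe_id, ts in disconnections:
            meta = probes[probe_id]
            streams[("as", meta["asn"])].append(ts)
            streams[("country", meta["country"])].append(ts)
            streams[("geo", meta["grid_cell"])].append(ts)  # ~50 km radius bucket
        return streams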

Page 25

Why Burst Modeling?

Goal: how to find synchronous disconnections?

• Time series conceal temporal characteristics

• A burst model estimates the disconnection arrival rate at any time

Implementation: Kleinberg burst model²

²J. Kleinberg. "Bursty and hierarchical structure in streams", Data Mining and Knowledge Discovery, 2003.
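For reference, a compact Viterbi decoding of Kleinberg's burst automaton over inter-arrival gaps; the parameters s and gamma are illustrative defaults, not the values used in Disco.

    import math

    def kleinberg_bursts(times, s=2.0, gamma=1.0, k=2):
        """Burst level per inter-arrival gap (needs >= 2 distinct timestamps)."""
        times = sorted(times)
        gaps = [b - a for a, b in zip(times, times[1:])]
        n, span = len(gaps), times[-1] - times[0]
        rates = [(n / span) * s ** i for i in range(k)]  # state i fires s^i faster

        def gap_cost(i, x):    # -log density of gap x under state i (exponential)
            return rates[i] * x - math.log(rates[i])

        def move_cost(i, j):   # climbing to a burstier state is penalised
            return gamma * math.log(n) * (j - i) if j > i else 0.0

        best = [0.0] + [math.inf] * (k - 1)              # start in the base state
        paths = [[] for _ in range(k)]
        for x in gaps:
            nbest, npaths = [], []
            for j in range(k):
                c, i = min((best[i] + move_cost(i, j), i) for i in range(k))
                nbest.append(c + gap_cost(j, x))
                npaths.append(paths[i] + [j])
            best, paths = nbest, npaths
        return paths[min(range(k), key=best.__getitem__)]

Gaps decoded into state 1 (or higher) mark periods where disconnections arrive markedly faster than the stream's average rate, i.e., candidate outages.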

Page 26

Burst modeling: Example

• A monkey caused a blackout in Kenya at 8:30 UTC on June 7, 2016

• The same day, RIPE rebooted its controllers

Page 27

Results

Outage detection:

• Atlas probe disconnections from 2011 to 2016

• Disco found 443 significant outages

Outage characterization and validation:

• Traceroute results from probes (buffered if no connectivity)

• Outage detection results from Trinocular

Page 28

Validation (Traceroute)

Comparison to traceroutes:

• Can probes in detected outages reach the traceroute destinations?
  → Velocity ratio: proportion of completed traceroutes in a given time

[Figure: probability mass function of the average velocity ratio R during normal periods vs. detected outages]

→ Velocity ratio ≤ 0.5 for 95% of detected outages
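A sketch of the velocity-ratio computation under assumed record shapes; normalising the window's completion rate by the probes' long-term rate (so values near 1 are normal) is my reading of the "average velocity ratio" axis, not a definition taken from the paper.

    def velocity_ratio(results, t0, t1):
        """results: (ts, completed) pairs for traceroutes from the affected probes."""
        window = [done for ts, done in results if t0 <= ts < t1]
        overall = sum(done for _, done in results) / len(results)
        if not window or overall == 0:
            return None
        return (sum(window) / len(window)) / overall   # <= 0.5 suggests an outage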

Page 29

Validation (Trinocular)

Comparison to Trinocular (2015):

• Disco found 53 outages in 2015

• Corresponding to 851 /24s (only 43% are responsive to ICMP)

Results for /24s reported by Disco and pinged by Trinocular:

• 33/53 are also found by Trinocular

• 9/53 are missed by Trinocular (average outage duration < 1 hr)

• The other outages are partially detected by Trinocular

23 outages found by Trinocular are missed by Disco

• Disconnections are not very bursty in these cases

→ Disco's precision: 95%, recall: 67%

Page 30

Outage detection from large-scale traceroute measurements

Page 31

Dataset: RIPE Atlas traceroutes

Two repetitive large-scale measurements

• Builtin: traceroute every 30 minutes to all DNS root servers (≈ 500 server instances)

• Anchoring: traceroute every 15 minutes to 189 collaborative servers

Analyzed dataset

• May to December 2015

• 2.8 billion IPv4 traceroutes

• 1.2 billion IPv6 traceroutes

Page 32

Monitor delays with traceroute?

Traceroute to "www.target.com"

Round Trip Time (RTT) between B and C?

Report abnormal RTT between B and C?

Page 33

Monitor delays with traceroute?

Challenges:

• Noisy data

• Traffic asymmetry

[Figure: RTT (ms) vs. number of hops for traceroutes from CZ to BD]


Page 35

What is the RTT between B and C?

RTT_C − RTT_B = RTT_CB?

Page 36

What is the RTT between B and C?

RTT_C − RTT_B = RTT_CB?

• No!

• Traffic is asymmetric

• RTT_B and RTT_C take different return paths!

• Differential RTT: ∆_CB = RTT_C − RTT_B = d_BC + e_p, i.e., the BC link delay plus an error term from the differing return paths
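A sketch of how the ∆_CB samples can be collected from traceroutes, assuming each traceroute is given as a list of (hop IP, RTT) pairs; the input format is an assumption for illustration:

    def diff_rtt_samples(traceroutes, b_ip, c_ip):
        """Collect delta_CB = RTT_C - RTT_B over traceroutes crossing link (B, C)."""
        samples = []
        for hops in traceroutes:          # hops: list of (ip, rtt) pairs
            rtt = dict(hops)
            if b_ip in rtt and c_ip in rtt:
                samples.append(rtt[c_ip] - rtt[b_ip])
        return samples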


Page 38

Problem with differential RTT

Monitoring ∆_CB over time:

[Figure: ∆RTT time series fluctuating between 0 and 30 ms]

→ Delay change on BC? CD? DA? BA?

Page 39

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = x0

Page 40

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = {x0, x1}

Page 41

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = {x0, x1, x2, x3, x4}

Page 42

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = {x0, x1, x2, x3, x4}

Median ∆_CB:

• Stable if only a few return-path delays change

• Fluctuates if the delay on BC changes

Page 43

Median Diff. RTT: Tier-1 link, 2 weeks of data, 95 probes

[Figure: 130.117.0.250 (Cogent, Zurich) - 154.54.38.50 (Cogent, Munich), June 2-14, 2015. Raw differential RTT values span roughly ±400 ms, while the median differential RTT stays within about 4.8-5.6 ms of the normal reference.]

• Stable despite noisy RTTs (not true for the average)

• Normally distributed

Page 44

Detecting congestion

[Figure: 72.52.92.14 (HE, Frankfurt) - 80.81.192.154 (DE-CIX (RIPE)), Nov 26 - Dec 1, 2015: median differential RTT, normal reference, and detected anomalies]

Significant RTT changes: confidence interval not overlapping with the normal reference
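A sketch of this alarm rule, using a standard non-parametric (order-statistic) confidence interval for the median; z = 1.96 gives roughly 95% coverage. The exact interval construction used in the paper may differ.

    import math

    def median_ci(samples, z=1.96):
        """Median and order-statistic confidence interval for the median."""
        xs = sorted(samples)
        n, mid, half = len(xs), len(xs) // 2, z * math.sqrt(len(xs)) / 2
        lo = xs[max(0, math.floor(mid - half))]
        hi = xs[min(n - 1, math.ceil(mid + half))]
        return lo, xs[mid], hi

    def is_anomalous(window, reference):
        w_lo, _, w_hi = median_ci(window)     # current window of diff-RTT samples
        r_lo, _, r_hi = median_ci(reference)  # samples from the normal reference
        return w_hi < r_lo or w_lo > r_hi     # intervals do not overlap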

Page 45

Results

Analyzed dataset

• Atlas builtin/anchoring measurements

• From May to December 2015

• 2.8 billion IPv4 traceroutes

• 1.2 billion IPv6 traceroutes

• Observed 262k IPv4 and 42k IPv6 links (core links)

We found a lot of congested links! Let's look at one example.

Page 46

Case study: Telekom Malaysia BGP leak


Page 50

Case study: Telekom Malaysia BGP leak

Not only with Google... but about 170k prefixes!

Page 51

Congestion in Level3

Rerouted traffic congested Level3 (120 reported links)

• Example: 229 ms increase between two routers in London!

[Figure: 67.16.133.130 - 67.17.106.150, June 8-13, 2015: median differential RTT, normal reference, and detected anomalies]

Page 52

Congestion in Level3

Reported links in London:

[Map: reported links, distinguishing delay increase from delay & packet loss]

→ Traffic staying within the UK/Europe may also be affected

Page 53

But why did we look at that?

Per-AS alarm for delay

Page 54

Conclusions and perspectives (1)

We proposed 3 different techniques to detect outages for 3 different sources of data

• Each source of data has its own coverage
  • Core links (congestion and failures)
  • Prefix, country, region, and AS disconnections

Page 55

Conclusions and perspectives (1)

We proposed 3 different techniques to detect outages for 3 different sources of data

• Each source of data has its own coverage
  • Core links (congestion and failures)
  • Prefix, country, region, and AS disconnections

• Each source of data has its own noise and properties
  • Identifying a suitable model is a challenge

Page 56

Conclusions and perspectives (2)

There is no substantial, state-of-the-art ground truth to validate the results. We resort to

• comparing different techniques with different coverages

• evaluations based on partial ground truth

• characterizations of the detected outages based on the detection algorithm used

Page 57

Turn this

Page 58

Into this

http://ihr.iijlab.net