Detecting network outages using different
sources of data
TMA Experts Summit, Paris, France
Cristel Pelsser
University of Strasbourg / ICube
June, 2019
Some perspective on:
• From unsolicited traffic
  Detecting Outages using Internet Background Radiation. Andreas Guillot (U. Strasbourg), Romain Fontugne (IIJ), Philipp Winter (CAIDA), Pascal Merindol (U. Strasbourg), Alistair King (CAIDA), Alberto Dainotti (CAIDA), Cristel Pelsser (U. Strasbourg). TMA 2019.
• From highly distributed permanent TCP connections
  Disco: Fast, Good, and Cheap Outage Detection. Anant Shah (Colorado State U.), Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (U. Strasbourg), Randy Bush (IIJ, Arrcus). TMA 2017.
• From large-scale traceroute measurements
  Pinpointing Anomalies in Large-Scale Traceroute Measurements. Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (U. Strasbourg), Randy Bush (IIJ, Arrcus). IMC 2017.
Understanding Internet health? (Motivation)
• To speed up failure identification and thus recovery
• To identify weak areas and thus guide network design
Understanding Internet health? (Problem 1)
Manual observations and operations
• Traceroute / Ping / Operators’ group mailing lists
• Time consuming
• Slow process
• Limited visibility
→ Our goal: automatically pinpoint network disruptions (i.e., congestion and network disconnections)
Understanding Internet health? (Problem 2)
A single viewpoint is not enough
→ Our goal: mine results from deployed platforms
→ Cooperative and distributed approach
→ Using existing data, no added burden to the network
Outage detection from unsolicited traffic
Dataset: Internet Background Radiation
A prefix P1 is advertised to the Internet and attracts unsolicited traffic:
• Scans
• Responses to spoofed traffic: a host elsewhere sends traffic with a source address in P1, and the recipient responds toward P1
Dataset: IP count time series (per country or AS)
Use cases: detecting attacks, censorship, and local outages
[Figure 1: number of unique source IPs over time during the Egyptian revolution, Jan 14 to Feb 07, 2011]
⇒ More than 60,000 time series in the CAIDA telescope data.
We use drops in the time series as indicators of an outage.
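As a toy illustration of how such a series can be built, the sketch below counts unique source IPs per time bin; the packet field layout and the 5-minute bin are assumptions, not the actual CAIDA/IODA pipeline.

```python
# Sketch: unique-source-IP time series from darknet packets.
# Field layout and the 5-minute bin are assumptions, not the
# actual CAIDA/IODA pipeline.
from collections import defaultdict

def ip_count_series(packets, bin_seconds=300):
    """packets: iterable of (timestamp, src_ip, group) tuples, where
    group is e.g. a country code or origin AS. Returns the number of
    unique source IPs per (group, time bin)."""
    bins = defaultdict(set)
    for ts, src_ip, group in packets:
        bins[(group, int(ts // bin_seconds))].add(src_ip)
    return {key: len(ips) for key, ips in bins.items()}
```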
Current methodology used by IODA
Detecting outages using fixed thresholds
Our goal
Detecting outages using dynamic thresholds
Outage detection process
[Figure: the IP-count time series split into training, calibration, and test periods, with the predicted time series and its confidence interval overlaid on the original data]
• When the real data falls outside the prediction interval, we raise an alarm.
• We want a prediction model that is robust to the seasonality and noise in the data → we use the SARIMA model¹.
¹ More details on the methodology on Wednesday.
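A minimal sketch of this detection step, assuming hourly IP counts in a pandas Series and statsmodels' SARIMAX; the model orders, the 70/30 split, and alpha are illustrative choices, not the exact Chocolatine configuration.

```python
# Sketch: SARIMA-based alarm on an IP-count series. Model orders,
# split point and alpha are illustrative, not the exact Chocolatine
# configuration.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def flag_outages(series: pd.Series, season: int = 24, alpha: float = 0.05):
    """Fit SARIMA on a training prefix, then flag test points falling
    below the lower bound of the prediction interval."""
    split = int(len(series) * 0.7)
    train, test = series[:split], series[split:]
    model = SARIMAX(train, order=(1, 0, 1),
                    seasonal_order=(1, 1, 1, season)).fit(disp=False)
    forecast = model.get_forecast(steps=len(test))
    lower = forecast.conf_int(alpha=alpha).iloc[:, 0]
    # An alarm is raised where the observed IP count drops below the
    # prediction interval.
    return test[test.values < lower.values]
```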
Validation: ground truth
Characteristics
• 130 known outages
• Multiple spatial scales
• Countries
• Regions
• Autonomous Systems
• Multiple durations (from an hour to a week)
• Multiple causes (intentional or unintentional)
Evaluating our solution
Objectives
• Identify the minimal number of IP addresses needed
• Identify a good threshold
Chosen threshold: TPR of 90% at an FPR of 2%
[Figure 2: ROC curves for all time series, series with < 20 IPs, and series with > 20 IPs, with operating points at 2σ (95%), 3σ (99.5%), and 5σ (99.99%)]
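For intuition, here is how such an operating point can be read off a ROC sweep with scikit-learn; the scores and labels below are synthetic placeholders, not the ground-truth dataset.

```python
# Sketch: picking a detection threshold from labelled outages
# (synthetic placeholder data; the paper's exact scoring differs).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# anomaly_score: e.g. how many sigmas the observed IP count falls
# below the prediction; labels: 1 = point belongs to a known outage.
labels = rng.integers(0, 2, size=1000)
anomaly_score = labels * 3 + rng.normal(size=1000)

fpr, tpr, thresholds = roc_curve(labels, anomaly_score)
ok = fpr <= 0.02                       # keep the FPR at most 2%
best = np.argmax(tpr[ok])              # highest TPR under that constraint
print("threshold:", thresholds[ok][best], "TPR:", tpr[ok][best])
```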
Comparing our proposal (Chocolatine) to CAIDA’s tools
• More events detected than with the simplistic thresholding technique (DN)
• Higher overlap with the other detection techniques
• The overlap is not complete → differences in dataset coverage and different sensitivities to outages
[Figure: Venn diagrams of detected events, DN vs. BGP vs. active probing (AP) on the left, Chocolatine vs. BGP vs. AP on the right]
Outage detection from highly distributed permanent TCP connections
Proposed Approach
Disco:
• Monitor long-running TCP connections and synchronous disconnections from a related network/area
• We apply Disco to RIPE Atlas data, where probes are widely distributed at the edge and behind NATs/CGNs, providing visibility that Trinocular may not have
→ Outage = synchronous disconnections from the same topological/geographical area
Assumptions / Design Choices
Rely on TCP disconnects
• Hence the granularity of detection depends on TCP timeouts
Bursts of disconnections are indicators of interesting outages
• While there may be non-bursty outages of interest, Disco is designed to detect large synchronous disconnections
Proposed System: Disco & Atlas
RIPE Atlas platform
• 10k probes worldwide
• Persistent connections to RIPE controllers
• Continuous traceroute measurements (see outages from the inside)
→ Dataset: stream of probe connections/disconnections (from 2011 to 2016)
Disco Overview
1. Split the disconnection stream into sub-streams (AS, country, geo-proximate 50 km radius) (see the sketch after this list)
2. Burst modeling and outage detection
3. Aggregation and outage reporting
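A minimal sketch of step 1; the event field names (probe_id, asn, country, geo_cell) are assumptions about the input format, with geo_cell standing in for the 50 km-radius grouping.

```python
# Sketch: splitting the disconnection stream into sub-streams.
# Event field names are assumptions about the input format.
from collections import defaultdict

def split_substreams(events):
    """events: iterable of dicts like
    {"probe_id": 1, "ts": ..., "asn": ..., "country": ..., "geo_cell": ...}
    Returns one list of disconnection timestamps per aggregate."""
    streams = defaultdict(list)
    for ev in events:
        streams[("as", ev["asn"])].append(ev["ts"])
        streams[("country", ev["country"])].append(ev["ts"])
        # geo_cell stands in for the 50 km-radius geographic grouping.
        streams[("geo", ev["geo_cell"])].append(ev["ts"])
    return streams
```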
Why Burst Modeling?
Goal: how to find synchronous disconnections?
• Time series conceal temporal characteristics
• A burst model estimates the disconnection arrival rate at any time
Implementation: Kleinberg burst model²
² J. Kleinberg. “Bursty and hierarchical structure in streams”, Data Mining and Knowledge Discovery, 2003.
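For intuition, a minimal two-state version of Kleinberg's model over a sub-stream's disconnection timestamps; the parameters s and gamma are illustrative, not Disco's tuned values.

```python
# Sketch: minimal two-state Kleinberg burst model over disconnection
# inter-arrival gaps (s and gamma are illustrative parameters).
import math

def kleinberg_bursts(timestamps, s=2.0, gamma=1.0):
    """Return one state per gap: 0 = baseline rate, 1 = burst."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    n, span = len(gaps), timestamps[-1] - timestamps[0]
    alpha = [n / span, s * n / span]          # event rate in each state
    # Negative log-likelihood of gap x under an exponential
    # inter-arrival time with rate alpha[i].
    cost = lambda i, x: alpha[i] * x - math.log(alpha[i])
    trans = gamma * math.log(n)               # cost of entering the burst state
    # Viterbi: dp[t][i] = min cost of a state sequence ending in state i.
    dp = [[0.0, 0.0] for _ in gaps]
    prev = [[0, 0] for _ in gaps]
    dp[0] = [cost(0, gaps[0]), cost(1, gaps[0]) + trans]
    for t in range(1, n):
        for i in (0, 1):
            cands = [dp[t - 1][j] + (trans if i > j else 0.0) for j in (0, 1)]
            prev[t][i] = cands.index(min(cands))
            dp[t][i] = min(cands) + cost(i, gaps[t])
    # Trace back the cheapest state sequence.
    states = [min((0, 1), key=lambda i: dp[-1][i])]
    for t in range(n - 1, 0, -1):
        states.append(prev[t][states[-1]])
    return states[::-1]
```

Gaps labelled 1 are periods where disconnections arrive at s times the average rate, i.e. candidate synchronous outages.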
Burst modeling: Example
• A monkey caused a blackout in Kenya at 8:30 UTC on June 7, 2016
• The same day, RIPE rebooted its controllers
Results
Outage detection:
• Atlas probe disconnections from 2011 to 2016
• Disco found 443 significant outages
Outage characterization and validation:
• Traceroute results from probes (buffered when there is no connectivity)
• Outage detection results from Trinocular
Validation (Traceroute)
Comparison to traceroutes:
• Can probes in detected outages reach their traceroute destinations?
→ Velocity ratio: proportion of completed traceroutes in a given time
[Figure: probability mass function of the average velocity ratio R for normal periods vs. detected outages]
→ Velocity ratio ≤ 0.5 for 95% of detected outages
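A sketch of how such a ratio can be computed per probe; normalizing by the probe's usual completion rate is an assumption about the exact definition behind the figure's R.

```python
# Sketch: velocity ratio for one probe in one time window.
# Normalizing by the probe's usual completion rate is an assumption
# about the exact definition used in the paper.
def velocity_ratio(completed_in_window, window_seconds, normal_rate):
    """Completed traceroutes per second during the window, divided by
    the probe's normal completion rate. Values near 1 mean business as
    usual; values near 0 mean the probe could not complete traceroutes,
    consistent with a real outage."""
    if normal_rate <= 0:
        return float("nan")
    return (completed_in_window / window_seconds) / normal_rate
```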
Validation (Trinocular)
Comparison to Trinocular (2015):
• Disco found 53 outages in 2015
• Corresponding to 851 /24s (only 43% are responsive to ICMP)
Results for /24s reported by Disco and pinged by Trinocular:
• 33/53 are also found by Trinocular
• 9/53 are missed by Trinocular (average outage duration < 1 hr)
• The other outages are partially detected by Trinocular
23 outages found by Trinocular are missed by Disco
• Disconnections are not very bursty in these cases
→ Disco’s precision: 95%, recall: 67%
Outage detection from large-scale traceroute measurements
Dataset: RIPE Atlas traceroutes
Two repetitive large-scale measurements
• Builtin: traceroute every 30 minutes to all DNS root servers (≈ 500 server instances)
• Anchoring: traceroute every 15 minutes to 189 collaborative servers
Analyzed dataset
• May to December 2015
• 2.8 billion IPv4 traceroutes
• 1.2 billion IPv6 traceroutes
Monitor delays with traceroute?
Traceroute to “www.target.com”
• What is the Round Trip Time (RTT) between B and C?
• Can we report abnormal RTTs between B and C?
Monitor delays with traceroute?
Challenges:
• Noisy data
• Traffic asymmetry
[Figure: RTT (ms) vs. number of hops for traceroutes from CZ to BD]
What is the RTT between B and C?
RTT_C − RTT_B = RTT_CB?
• No!
• Traffic is asymmetric
• RTT_B and RTT_C take different return paths!
• Differential RTT: Δ_CB = RTT_C − RTT_B = d_BC + e_p, where d_BC is the delay on link BC and e_p is a probe-dependent error due to the differing return paths
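A sketch of collecting differential-RTT samples for a link (B, C) from parsed traceroutes; the input layout ({"hops": [{"ip": ..., "rtts": [...]}]}) is an assumed, simplified format, not the RIPE Atlas API schema.

```python
# Sketch: differential RTT samples Delta_CB = RTT_C - RTT_B for a
# link (B, C). The traceroute dict layout is an assumed, simplified
# format, not the actual RIPE Atlas schema.
def differential_rtts(traceroute, ip_b, ip_c):
    """Yield one Delta_CB sample per (RTT_B, RTT_C) pair when both
    routers appear in the traceroute."""
    rtts = {hop["ip"]: hop["rtts"] for hop in traceroute["hops"]}
    for rtt_b in rtts.get(ip_b, []):
        for rtt_c in rtts.get(ip_c, []):
            # Each sample still carries the probe-dependent error e_p;
            # the median over many probes with different return paths
            # is robust to a delay change on any single return path.
            yield rtt_c - rtt_b
```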
Problem with differential RTT
Monitoring Δ_CB over time:
[Figure: Δ_CB time series showing a delay change]
→ Is the delay change on BC? CD? DA? AB?
Proposed Approach: Use probes with different return paths
Differential RTT: Δ_CB = {x0, x1, x2, x3, x4}, one sample per probe
Median Δ_CB:
• Stable if the delay changes on only a few return paths
• Fluctuates if the delay on BC changes
Median Diff. RTT: Tier 1 link, 2 weeks of data, 95 probes
[Figure: raw differential RTT values and the median differential RTT with its normal reference for 130.117.0.250 (Cogent, Zurich) - 154.54.38.50 (Cogent, Munich), Jun 02 to Jun 14, 2015]
• Stable despite noisy RTTs (not true for the average)
• Normally distributed
Detecting congestion
[Figure: median differential RTT, normal reference, and detected anomalies for 72.52.92.14 (HE, Frankfurt) - 80.81.192.154 (DE-CIX (RIPE)), Nov 26 to Dec 01, 2015]
Significant RTT changes: the confidence interval does not overlap with the normal reference
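A sketch of that test, using a bootstrap confidence interval for the median; the bootstrap is an assumption here, the paper derives a comparable interval around the median.

```python
# Sketch: flag a time bin when the confidence interval of its median
# differential RTT does not overlap the normal reference interval.
# The bootstrap CI is an assumption, not the paper's exact interval.
import random
import statistics

def median_ci(samples, n_boot=500, level=0.95):
    """Bootstrap confidence interval for the median of samples."""
    meds = sorted(
        statistics.median(random.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    return meds[lo_idx], meds[n_boot - 1 - lo_idx]

def is_anomalous(bin_samples, ref_low, ref_high):
    """True when the bin's median diff-RTT CI and the normal
    reference interval [ref_low, ref_high] do not overlap."""
    low, high = median_ci(bin_samples)
    return high < ref_low or low > ref_high
```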
Results
Analyzed dataset
• Atlas builtin/anchoring measurements
• From May to Dec. 2015
• 2.8 billion IPv4 traceroutes
• 1.2 billion IPv6 traceroutes
• Observed 262k IPv4 and 42k IPv6 core links
We found many congested links! Let’s look at one example.
Case study: Telekom Malaysia BGP leak
[Figures: the leak illustrated, first with Google’s prefixes]
Not only with Google... but about 170k prefixes!
Congestion in Level3
Rerouted traffic congested Level3 (120 reported links)
• Example: a 229 ms increase between two routers in London!
[Figure: median differential RTT, normal reference, and detected anomalies for 67.16.133.130 - 67.17.106.150, Jun 08 to Jun 13, 2015]
Congestion in Level3
Reported links in London:
[Figure: map of the reported links, distinguishing delay increases from delay & packet loss]
→ Traffic staying within the UK/Europe may also be affected
But why did we look at that?
Per-AS alarm for delay
Conclusions and perspectives (1)
We proposed 3 different techniques to detect outages from 3 different sources of data
• Each source of data has its own coverage
  • Core links (congestion and failures)
  • Prefix, country, region, and AS disconnections
• Each source of data has its own noise and properties
  • Identifying the suitable model is a challenge
Conclusions and perspectives (2)
There is no substantial, state-of-the-art ground truth against which to validate the results. We resort to:
• comparing different techniques with different coverages
• evaluating on the basis of partial ground truth
• characterizing the detected outages based on the detection algorithm used
Turn this