Pythia: Detection, Localization, and Diagnosis
of Performance Problems using perfSONAR
Partha Kanuparthy, Constantine Dovrolis (PI)
Georgia Institute of Technology
Intro
Pythia is a data-analysis tool
data from perfSONAR
Focus: performance problems
Funded by DoE: 3 yrs (since Sept’11)
This talk: detection, localization, WiP
Pythia: one tool, three objectives
Detection: “noticeable loss rate between ORNL and SLAC on 07/11/11 at 09:00:02 EDT”
Localization: “it happened at DENV-SLAC link”
Diagnosis: “it was due to insufficient router buffers”
Detection
First step: “Is there a problem?”
Look for deviations from baseline
but not monitor-related events!
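A minimal sketch of "look for deviations from baseline", assuming the baseline is the median one-way delay of the series and a fixed threshold in milliseconds. The function name `flag_deviations` and the 5 ms threshold are illustrative assumptions; the slides do not say how Pythia estimates the baseline.

```python
from statistics import median

def flag_deviations(delays_ms, threshold_ms=5.0):
    """Mark samples whose delay exceeds the baseline by more than threshold_ms.

    Assumption: the baseline is the median of the whole series.
    """
    baseline = median(delays_ms)
    return [d - baseline > threshold_ms for d in delays_ms]

# Hypothetical OWAMP one-way delays (ms) with a short excursion in the middle.
delays = [20.1, 20.0, 20.2, 20.1, 35.4, 41.0, 38.7, 20.3, 20.1, 20.0]
flags = flag_deviations(delays)
print(flags)
```

The median is used here because a few large excursions barely move it, which is one plausible way to keep the baseline stable; other estimators would work as well.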
[Figure: Congestion: NY-CLEV. Delay (ms) vs. seq., "Possible-congestion in path NEWY_OWAMP_ES_NET to CLEV_OWAMP_ES_NET"]
[Figure: Monitor event: ALBU-ATL. Delay (ms) vs. seq., "context-switch in path ALBU_OWAMP_ES_NET to ATLA_OWAMP_ES_NET"; 2.5s rise above baseline!]
Detection Implementation
A single pass through OWAMP timeseries
Discard monitor-related events
Discard level-shifts (e.g., NTP synchronization; TODO: detecting routing changes)
The rest are network performance problems!
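One way to separate a level-shift from a transient excursion, sketched under stated assumptions: a level-shift moves the median delay to a new value and then stays flat, whereas congestion produces variable queueing delay that returns to baseline. The function `is_level_shift` and its parameters (`window`, `shift_ms`, `spread_ms`) are hypothetical and not taken from Pythia.

```python
from statistics import median

def is_level_shift(delays_ms, change_idx, window=20, shift_ms=2.0, spread_ms=1.0):
    """Return True if the delay settles at a new, flat level after change_idx.

    Assumptions: compare `window` samples before and after the change point;
    a shift needs a median change above shift_ms and a flat "after" window.
    """
    before = delays_ms[max(0, change_idx - window):change_idx]
    after = delays_ms[change_idx:change_idx + window]
    if len(before) < 2 or len(after) < 2:
        return False  # not enough context to decide
    shifted = abs(median(after) - median(before)) > shift_ms
    stable = (max(after) - min(after)) <= spread_ms
    return shifted and stable

# Hypothetical series: a 4 ms step (e.g., a clock adjustment) vs. a short ramp.
step = [10.0] * 30 + [14.0] * 30
ramp = [10.0] * 30 + [10.0 + i for i in range(1, 11)] + [10.0] * 30
print(is_level_shift(step, 30), is_level_shift(ramp, 30))
```

Events that pass this check would be discarded; the remaining excursions are the candidates for network performance problems.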
[Figure: NTP level-shifts. Delay (ms) vs. seq., "levelshift in path ALBU_OWAMP_ES_NET to ATLA_OWAMP_ES_NET"]
[Figure: Congestion: NY-CLEV. Delay (ms) vs. seq., "Possible-congestion in path NEWY_OWAMP_ES_NET to CLEV_OWAMP_ES_NET" (two panels)]
Detection: In Practice
Detection outputs congestion events
> 10s long
start, end timestamps
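A sketch of how flagged samples might be grouped into events with start and end timestamps, keeping only those that last at least 10 seconds. The function `extract_events` and its input format (parallel lists of timestamps and boolean flags) are assumptions for illustration.

```python
def extract_events(timestamps, flags, min_duration_s=10.0):
    """Group consecutive flagged samples into (start, end) events.

    Only events lasting at least min_duration_s are returned.
    """
    events = []
    start = None
    last = None
    for t, flagged in zip(timestamps, flags):
        if flagged:
            if start is None:
                start = t  # first flagged sample opens an event
            last = t
        elif start is not None:
            if last - start >= min_duration_s:
                events.append((start, last))
            start = None  # an unflagged sample closes the event
    if start is not None and last - start >= min_duration_s:
        events.append((start, last))  # event still open at end of series
    return events

# Hypothetical one-sample-per-second series with a 12 s and a 3 s excursion.
timestamps = list(range(30))
flags = [5 <= t <= 17 or 22 <= t <= 25 for t in timestamps]
events = extract_events(timestamps, flags)
print(events)
```

In this example only the 12-second excursion survives the duration filter; the 3-second one is dropped.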
ESnet data: 12 days, 33 monitors
Internet2 data: 22 days, 9 monitors
Network     Monitor events   Congestion events   Congestion events / path / day
ESnet       2.2 million      933                 0.1
Internet2   11,200           2268                1.4
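The per-path, per-day rates can be checked with a short calculation, assuming every ordered pair of monitors forms one measured path (a full mesh). That assumption is not stated on the slides; it is used here because it reproduces the reported figures.

```python
def events_per_path_per_day(events, monitors, days):
    """Average number of congestion events per path per day."""
    paths = monitors * (monitors - 1)  # assumes a full mesh of directed paths
    return events / (paths * days)

esnet = events_per_path_per_day(933, 33, 12)
internet2 = events_per_path_per_day(2268, 9, 22)
print(f"ESnet: {esnet:.3f}, Internet2: {internet2:.3f}")
```

Under this assumption ESnet has 1056 paths and Internet2 has 72, and the rates round to 0.1 and 1.4 events per path per day.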
How long are congestion events?
ESnet, I2: 90% of events were 10-20s long
this is sufficient to affect app-performance
delay increases by tens of milliseconds
some events are common across paths
[Figure: CDF of the duration (in seconds) of congestion events that are at least 10 seconds long, for ESnet and Internet2]
Are lossy events common?
Answer: No
ESnet: no lossy congestion events
Internet2: 6 of 2268 lossy congestion events
< 0.1% loss rate as sampled by OWAMP
[Figure: Internet2: CDF of the percentage of packets lost in lossy congestion events]
Localization
Follow-up to detection: “Which link is bad?”
Link/path performance levels discrete: e.g., high delay, medium delay, low delay
Localization: minimum number of bad links that can explain bad paths
use greedy heuristic to solve iteratively
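A simplified sketch of the greedy idea, assuming binary good/bad paths rather than the discrete delay levels mentioned above. Links that appear on any good path are treated as good; the heuristic then repeatedly picks the remaining link that explains the most unexplained bad paths. The function `localize_bad_links` and the toy topology are hypothetical, and this is a generic set-cover heuristic rather than Pythia's actual algorithm.

```python
def localize_bad_links(paths, bad_paths):
    """Greedily pick a small set of links that explains all bad paths.

    paths: dict mapping path name -> set of link names on that path.
    bad_paths: set of path names observed to be bad.
    """
    good_links = set()
    for name, links in paths.items():
        if name not in bad_paths:
            good_links |= links  # a link on a good path is assumed good

    # Only bad paths with at least one candidate link can be explained.
    unexplained = {n for n in bad_paths if paths[n] - good_links}
    chosen = []
    while unexplained:
        counts = {}
        for name in unexplained:
            for link in paths[name] - good_links:
                counts[link] = counts.get(link, 0) + 1
        # Ties are broken alphabetically to keep the result deterministic.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        unexplained = {n for n in unexplained if best not in paths[n]}
    return chosen

# Toy topology: four paths through a hub B; both paths ending at C are bad.
paths = {
    "A-C": {"A-B", "B-C"},
    "A-D": {"A-B", "B-D"},
    "E-C": {"E-B", "B-C"},
    "E-D": {"E-B", "B-D"},
}
print(localize_bad_links(paths, {"A-C", "E-C"}))
```

Here the single link B-C explains both bad paths, since every other link also lies on a good path.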
Localizing Bad Links
ESnet: 9 congestion events
1 bad link localized for each
up to 75 paths affected by an event:
[Figure: number of paths affected per bad link. Bad links: washcr1-aofacr2.es.net, starcr1-chicr1.es.net, bnlmr2-bnlowamp.es.net, clevcr1-ip-bostcr1.es.net, clevcr1-ip-bostcr1.es.net, bostcr1-ip-aofacr2.es.net, chiccr1-ip-clevcr1.es.net, clevcr1-ip-chiccr1.es.net]
[Figure: one-way delay (in ms) vs. time (in seconds)]
Localization: Internet2
Internet2: 266 congestion events in 22 days
3 bad links: 1 case
2 bad links: 6 cases
1 bad link: rest
a few bad links account for 90% of events:
ge-6-2-0.0.rtr.kans (58% of events)
ge-1-2-0.0.rtr.chic (25%)
xe-1-1-0.0.rtr.hous (6%)
[Figure: link events per day vs. day since 23rd Feb 2011, for ge-6-2-0.0.rtr.kans, ge-1-2-0.0.rtr.chic, and xe-1-1-0.0.rtr.hous]
Timeline of bad links: peaks around 7th March 2011
Case Study
Internet2: event with two bad links
28th Feb 2011, 00:10:51 GMT
Localized bad links: ge-6-2-0.0-rtr.KANS, ge-6-1-0.0-rtr.LOSA
Predicted bad link performance (avg): 26ms and 57ms
[Figure: one-way delay (in ms) vs. time (in seconds) for paths CHIC_LAT to LOSA_LAT, HOUS_LAT to LOSA_LAT, and ATLA_LAT to KANS_LAT]
Diagnosis
“What was the problem?”
Match signatures to identify known problems
delays, losses, reordering, etc.
Unknown signatures:
store in database for future diagnosis
operators can “label” them for reference
Work in progress
Pythia: Real-time System
Centralized process talks to perfSONAR MAs to collect data
Work in progress
[Diagram: perfSONAR MAs (traceroute MA, OWAMP MA 1, OWAMP MA 2, OWAMP MA 3, BWCTL MA, ... MA) feed the Pythia server, which runs Preprocessing, Detection, Localization, and Diagnosis]
Pre-processing
traceroute: compensate for “* * *”s
Clock skew: use a 1000s window to normalize delays
Clock offset between monitors: use a 2s window to identify congestion events
Identify simultaneous events across paths
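A sketch of how events on different paths might be grouped as simultaneous, using the 2s window to tolerate clock offset between monitors. The function `group_simultaneous`, the `(path, start_time)` input format, and the choice to measure the window from the first event in each group are all assumptions for illustration.

```python
def group_simultaneous(events, window_s=2.0):
    """Group events whose start times fall within window_s of a group's first event.

    events: list of (path_name, start_time_in_seconds) tuples.
    Returns a list of groups, each a list of path names.
    """
    groups = []
    for path, start in sorted(events, key=lambda e: e[1]):
        if groups and start - groups[-1]["first"] <= window_s:
            groups[-1]["paths"].append(path)
        else:
            groups.append({"first": start, "paths": [path]})
    return [g["paths"] for g in groups]

# Hypothetical event start times (seconds) reported on four paths.
events = [
    ("CHIC-LOSA", 100.0),
    ("HOUS-LOSA", 100.8),
    ("ATLA-KANS", 101.5),
    ("NEWY-CLEV", 250.0),
]
groups = group_simultaneous(events)
print(groups)
```

Each group of paths can then be passed to localization as the set of bad paths for one event.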