
Page 1: Crowdsourcing  Service-Level  Network Event Detection

http://aqualab.cs.northwestern.edu

David Choffnes (Northwestern University*), Fabián Bustamante (Northwestern University), Zihui Ge (AT&T Labs)

*currently at U. Washington

Crowdsourcing Service-Level Network Event Detection

Page 2: Crowdsourcing  Service-Level  Network Event Detection

David Choffnes 2

Internet driven by services
Internet activity is increasingly driven by services
– VoIP, video streaming, games, file downloads

User experience as a key benchmark
– Largely determined by the frequency, duration, and severity of network problems (events)

To minimize impact on users
– Identify problems affecting end-to-end performance
– Online, reliable detection and isolation, potentially across networks

Crowdsourcing Network Monitoring

Page 3: Crowdsourcing  Service-Level  Network Event Detection

Detecting network events
Variety of existing detection approaches
– Internal link/router monitoring
  • Limited to a single administrative domain
– BGP monitoring
  • Identifies only control-plane issues
– Distributed active probing
  • Overhead/cost scales with the size of the monitored network

Limited or no visibility into end-to-end performance
– Particularly problematic for edge networks

Need a scalable solution that captures what user sees

Page 4: Crowdsourcing  Service-Level  Network Event Detection

Crowdsourcing Event Monitoring (CEM)
Push monitoring to the edge systems themselves
– Monitor from inside network-intensive applications
– Detect drops in performance

If enough hosts see
• the same performance problem,
• at the same time,
• in the same network…
the problem is likely to be the network.

Page 5: Crowdsourcing  Service-Level  Network Event Detection

Outline
Crowdsourcing network monitoring
– General approach
– Case study using confirmed network problems
– Wide-area evaluation

System implementation
– BitTorrent extension (NEWS) installed by >48k users

Conclusion

Page 6: Crowdsourcing  Service-Level  Network Event Detection

System requirements
Scalability
– Use passive monitoring; fully distribute detection

Localization in time and space
– Online detection
– Isolation to network regions

Privacy
– Use only network location, not identity

Reliability from uncontrolled hosts
– Probability analysis to identify likely real events

Adoption
– Build inside popular applications and/or use incentives

Page 7: Crowdsourcing  Service-Level  Network Event Detection

Approach and architecture

[Diagram: performance signals (e.g., upload/download rate) feed local event detection inside the distributed system]

Page 8: Crowdsourcing  Service-Level  Network Event Detection

Approach and architecture
Local detection
– Passively monitor local performance information (signals)
  • General (e.g., transfer rates) and application-specific (e.g., content availability in BitTorrent)
– Detect drops in performance
  • E.g., a dropped video frame, a sudden drop in throughput
  • Filter out cases that are normal application behavior
    – E.g., a BitTorrent peer finishes downloading but still seeds
– Publish information only about these suspected local events

Page 9: Crowdsourcing  Service-Level  Network Event Detection

System architecture

[Diagram: performance signals feed local event detection; suspected local events are published to distributed storage within the distributed system]

Page 10: Crowdsourcing  Service-Level  Network Event Detection

Approach and architecture
Group corroboration
– Gather information about local events in the same network
– Identify synchronous problems that are unlikely to occur by chance
– Use a likelihood ratio to distinguish network events from coincidence

Who can identify network events?
– Each user can detect events separately
– Any third party with access to the distributed storage can do the same (e.g., network operators)

Page 11: Crowdsourcing  Service-Level  Network Event Detection

System architecture

[Diagram: performance signals feed local event detection; confirmed local events go to distributed storage; group corroboration combines local and remote events read from the distributed storage; an ISP operator can tap the distributed storage]

Page 12: Crowdsourcing  Service-Level  Network Event Detection

Evaluating the approach
Participatory monitoring challenges
– Needs large-scale adoption, active users, distributed storage
– Edge traces are rare

P2P applications are a natural fit
– Used worldwide; generate diverse flows
– BitTorrent is one of the most popular
  • Consumes large amounts of bandwidth
  • The Vuze client allows extensibility; piggyback on existing users
  • Built-in distributed storage (DHT)

Ono dataset for traces
– Installed by more than 1 million BitTorrent users
– Network and BitTorrent-specific information from hundreds of thousands of users worldwide

Page 13: Crowdsourcing  Service-Level  Network Event Detection

CEM Case Study
Evaluate the effectiveness of our approach using BitTorrent
How (well) does it work?
– Case study: British Telecom (BT) network
– Provides confirmed events through a Web interface
– 27 April 2009, 3:54 PM
  • “We are aware of a network problem which may be affecting access to the internet in certain areas. Our engineers are working to resolve the problem. We apologize for any inconvenience this may cause.”
  • Resolved: 27 April 2009, 8:50 PM
– Similar to events seen in other networks

“Enough users complained about the network being slow and we’re looking into it.”
“As of 9 PM, we’re pretty sure we fixed the problem so we marked it resolved.”

Page 14: Crowdsourcing  Service-Level  Network Event Detection

Local detection in BitTorrent
Peers monitor multiple performance signals
– General, e.g., transfer rates, number of connected peers
– Protocol-specific, e.g., torrent availability

Detect drops in throughput as local events
Individual signals are noisy
– Uncontrolled duration
– Wide range of values

Page 15: Crowdsourcing  Service-Level  Network Event Detection

Moving-average smoothing reveals events

[Plot annotations: performance drops around 10:54; further drop at 14:50; final recovery at ~17:30]

Page 16: Crowdsourcing  Service-Level  Network Event Detection

Group corroboration
Given locally detected events, why would they occur at the same time?
1. Service-specific problems (e.g., no seeder)
   Use application-level information
2. Coincidence (e.g., noisy local detection)
   Use the union probability
3. Problem isolated to one or more networks
   Group hosts according to network location

Page 17: Crowdsourcing  Service-Level  Network Event Detection

Coincidence in network events
Coincidence assumes that local events occur independently
Calculate this using the union probability (Pu)
– P(Lh): probability of host h seeing a local event

For large n, the likelihood of coincidence is very small

Page 18: Crowdsourcing  Service-Level  Network Event Detection

Likelihood ratio
Are detected network problems occurring more often than chance?
Comparing probabilities
– Coincidence (Pu)
– Network (Pe)
  • Measures how often synchronized events occur in the same network

Likelihood ratio: LR = Pe/Pu
LR > 1: events are more likely due to the network than to chance
– Empirically derive a stricter LR threshold
– Use LR as a tuning knob to control the rate of event detection
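The ratio can be sketched in a few lines. This is an illustration, not the paper's exact formulation: the slide's Pu ("union probability") is approximated here by the joint probability that a specific set of independent hosts all fire in the same window, Pe is taken as the empirically measured rate of such synchronized events, and all function names and sample numbers are invented.

```python
from math import prod

def coincidence_probability(host_event_rates):
    """Approximation of Pu: chance that these hosts all see a local event
    in the same window purely by coincidence, assuming independence."""
    return prod(host_event_rates)

def likelihood_ratio(pe, host_event_rates):
    """LR = Pe / Pu; LR > 1 suggests a genuine network event."""
    return pe / coincidence_probability(host_event_rates)

# Three hosts that each see a local event in ~2% of windows on their own:
rates = [0.02, 0.02, 0.02]
pe = 0.001  # measured rate of 3-way synchronized events in this network
print(likelihood_ratio(pe, rates) > 1)  # True: far more often than chance
```

With these made-up numbers LR is on the order of 10², so such synchronized drops would clear even a threshold much stricter than LR > 1.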

Page 19: Crowdsourcing  Service-Level  Network Event Detection

Likelihood ratios in BT Yahoo

[Plots for two detection settings: W=10, σ=1.5 and W=20, σ=2.0. A congestion event appears after recovery; most events are no more likely than chance; all events with LR > 1 correspond to actual network events.]

Page 20: Crowdsourcing  Service-Level  Network Event Detection

Wide-area evaluation
Gold standard: false positives/negatives…
– Almost no ISPs want to publish records of network problems
– Existing approaches do not target service-level events
– In short, there is no “ground truth”
– Affects all research in this domain

What we can do
– Find ISPs reporting network events via public interfaces
– Work with ISPs under NDAs

Compare our approach with ISP information
– Only works where we have coverage (users)

Page 21: Crowdsourcing  Service-Level  Network Event Detection

Evaluation criteria
Coverage
– Confirmed events
– Number of networks covered worldwide
– Cross-network events

Efficiency
– Event detection rate
– Overhead

Page 22: Crowdsourcing  Service-Level  Network Event Detection

Effectiveness – BT Yahoo
One month of data from BT Yahoo
– Detected: 181 events
– 54 occur during confirmed problems
– The remainder are not necessarily false positives
  • Even if they were, that is about 4 events per day

Page 23: Crowdsourcing  Service-Level  Network Event Detection

Effectiveness – North American ISP
One month of outage data
– Results vary with the number of subscribers (S) in each region
– S > 10,000
  • We detect 50% (38% more may be detectable, but we don’t have enough users to confirm)
– 1,000 < S < 10,000
  • 67% may be detected, but without sufficient corroboration

Page 24: Crowdsourcing  Service-Level  Network Event Detection

Robustness to parameters
Robust to various detection settings and populations
– The number of users in a network is not strongly correlated with the number of events detected
– Network problems are detected only 2% of the time for small MA deviations, 0.75% of the time for large ones
  • Can be filtered with the likelihood-ratio threshold

[Plots: sensitive local detection (MA settings: 1.5σ, w=10) vs. less sensitive detection (MA settings: 2.2σ, w=20)]

Page 25: Crowdsourcing  Service-Level  Network Event Detection

Summary
Service-level monitoring through crowdsourcing
– Push monitoring to applications at the edge
– Scalable, distributed approach to detection
– Evaluation using large-scale P2P trace data

Page 26: Crowdsourcing  Service-Level  Network Event Detection

NEWS implementation and deployment
Plugin for the Vuze BitTorrent client
– More than 48,000 installs
– Core classes for event detection are only ~1,000 LOC
– Much more code for the UI and user notifications

Event detection
– Local detection based on 15-second samples of performance information
  • Transfer rates
  • Torrent state (leech/seed)
  • Content availability
– Group corroboration and localization
  • Publishes event information to the built-in DHT
  • Uses BGP prefix and ASN information (already collected by Vuze) for group corroboration
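The corroboration step above can be sketched as bucketing published events by network location and detection window, then keeping buckets with enough distinct reporters. This is a hypothetical sketch, not the NEWS source: the `corroborate` function, the tuple layout, and the `min_hosts` threshold are all illustrative.

```python
from collections import defaultdict

WINDOW = 15  # seconds, matching the 15-second sampling interval

def corroborate(events, min_hosts=3):
    """events: iterable of (host_id, asn, timestamp) tuples read from the DHT.
    Returns the (asn, window) buckets where at least min_hosts distinct
    hosts reported a local event."""
    buckets = defaultdict(set)
    for host_id, asn, ts in events:
        # Group by network (here ASN; a BGP prefix works the same way)
        # and by coarse time window, so near-simultaneous reports collide.
        buckets[(asn, int(ts) // WINDOW)].add(host_id)
    return {key: hosts for key, hosts in buckets.items()
            if len(hosts) >= min_hosts}

events = [("h1", 3356, 100), ("h2", 3356, 101), ("h3", 3356, 102),
          ("h4", 7018, 100)]
print(corroborate(events))  # only the AS3356 bucket has 3 corroborating hosts
```

A lone reporter in AS7018 is dropped, while three hosts in AS3356 within one window survive as a candidate network event for the likelihood-ratio test.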


Page 27: Crowdsourcing  Service-Level  Network Event Detection

Food for thought
Open issues
– Which network groupings are best?
  • Whois listings, topologies, ISP-specific…
– Where is the ground truth?
  • Crowdsourcing event labeling (Newsight)
– Can we apply these principles to other services?
  • VoIP, video streaming, CDNs


Page 28: Crowdsourcing  Service-Level  Network Event Detection

Questions?

Page 29: Crowdsourcing  Service-Level  Network Event Detection

Backups

Page 30: Crowdsourcing  Service-Level  Network Event Detection

Related work
Crowdsourcing
– Human computation [von Ahn]
– Intrusion detection [Dash et al.]

Events detected
– Layer-3 and below [Lakhina et al., Mahajan et al.]
– End-to-end [Madhyastha et al., Zhang et al.]

Monitoring location
– In-network (NetFlow)
– Distributed probing [Feamster et al., Katz-Bassett et al.]
– Edge systems [Shavitt et al., Simpson et al.]

Measurement technique
– Active [Andersen et al.]
– Passive [Zhang et al., Casado et al., …]

Page 31: Crowdsourcing  Service-Level  Network Event Detection

Snapshot of what we collect
Started (proper) collection in December 2007
Daily stats (approximate)
– 3 to 4 GB of compressed data
– About 10 to 20 GB of raw data
– 2.5–3M traceroutes
– 100–150M connection samples

[Plots of daily collection volumes: per-connection samples (y-axis up to 250M), per-download samples (up to 160M), and traceroutes (up to 7M); x-axis tick labels run 9-Feb-2005 through 27-Jan-2006]

Page 32: Crowdsourcing  Service-Level  Network Event Detection

Wide-area events
Detected problems in the US, Europe, and Asia
Identified potential cross-network events
– Use ISP relationships and correlate per-ISP events
– Detected cases in seven countries

Page 33: Crowdsourcing  Service-Level  Network Event Detection

Robustness to parameters
Robust to various detection settings and populations
– The number of users in a network is not strongly correlated with the number of events detected

[Plots: networks ordered by # users vs. ordered by # events]

Page 34: Crowdsourcing  Service-Level  Network Event Detection

How much does it cost?
Simulate using 3 stddev thresholds and 2 window sizes in parallel
– Allows NEWS to detect multiple types of events
– Model caching in the DHT

Count the number of DHT operations at any time
Goals
– Low cost (does not affect the user’s transfers)
– Privacy preserving
– No reliance on infrastructure

Page 35: Crowdsourcing  Service-Level  Network Event Detection

Events at each time step
Reads and writes are clustered
– Expected for a system that detects events
– Diurnal pattern

Page 36: Crowdsourcing  Service-Level  Network Event Detection

How much does it cumulatively cost?
One read every 10 seconds, one write every two minutes (spread over hundreds of users)
Reasonable load on a DHT
– Kademlia caches values close to the host
– 38 bytes per read/write, about 4 B/s of overhead


Page 37: Crowdsourcing  Service-Level  Network Event Detection

Strawman cost
Decentralized approach without summaries
– 13 signals × 4 bytes = 52 bytes, every 15 seconds
– Sharing incurs O(N²) cost
– 1,000 hosts: 34.6 MB/s

Centralized approach to collecting data
– 13 signals × 4 bytes = 52 bytes, every 15 seconds
– 1,000 hosts: 4 KB/s
– Plus CPU/memory costs for processing this data
– Ignores important issues
  • Who hosts this?
  • Privacy?
  • Cross-network events?

Page 38: Crowdsourcing  Service-Level  Network Event Detection

NEWS UI
Notifications through a non-intrusive interface
List of events for historical information

Page 39: Crowdsourcing  Service-Level  Network Event Detection

NEWS Usage

Page 40: Crowdsourcing  Service-Level  Network Event Detection

The view from 1 million users

[Map of user counts by region: 231,000; 547,000; 35,000; 1,096]

Page 41: Crowdsourcing  Service-Level  Network Event Detection

Extension to multiple signals
Leverage independent signals at each host
– For example, upload rate and download rate
– It is even more unlikely that both signals are affected at the same time by coincidence

Page 42: Crowdsourcing  Service-Level  Network Event Detection

Detecting throughput drops
Moving average
– Low cost
– Few parameters
– Well understood

Key parameters
– Window size, deviation

Approach
– Find the mean for each window
– Find how much the next sample deviates from the mean

Key question
– Can we find good window sizes and threshold deviations?
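The approach above can be sketched in a few lines. This is a minimal illustration of a moving-average deviation detector, not the NEWS implementation; the function name and the sample throughput values are invented.

```python
from collections import deque
from statistics import mean, pstdev

def detect_drops(samples, window=10, sigma=1.5):
    """Flag indices where a sample falls more than `sigma` standard
    deviations below the mean of the preceding `window` samples."""
    recent = deque(maxlen=window)
    events = []
    for i, s in enumerate(samples):
        if len(recent) == window:
            m, sd = mean(recent), pstdev(recent)
            if sd > 0 and s < m - sigma * sd:
                events.append(i)
        recent.append(s)  # slide the window forward
    return events

# Steady ~100 KB/s throughput, then a sharp drop starting at index 12:
rates = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 100, 99, 40, 42]
print(detect_drops(rates))  # [12, 13]
```

The window size trades responsiveness for noise tolerance, and sigma sets the deviation threshold — exactly the two parameters the slide identifies as the tuning question.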


Page 43: Crowdsourcing  Service-Level  Network Event Detection

Relative likelihood of coincidence
Simulate different numbers of hosts seeing events independently, with a normally distributed probability

The more peers that see an event at the same time, the less likely it occurs by coincidence
– Five orders of magnitude between 3 peers and 9 peers corroborating an event
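The intuition above can be checked with a simpler model than the slide's simulation: if each of n online peers independently sees a local event in a given window with probability p, the chance that exactly k peers coincide is the binomial pmf, which falls off steeply in k. The n and p below are made up for illustration, not taken from the paper.

```python
from math import comb

def p_coincide(k, n=30, p=0.01):
    """Probability that exactly k of n independent peers each see a
    local event in the same window (binomial pmf)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

ratio = p_coincide(3) / p_coincide(9)
print(ratio > 1e5)  # True: 9-way coincidence is many orders of
                    # magnitude less likely than 3-way
```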

Page 44: Crowdsourcing  Service-Level  Network Event Detection

Ground truth is hard to come by
Can we crowdsource event labeling?
Make the information available to the community

Newsight
http://aqualab.cs.northwestern.edu/projects/news/newsight.html