Data Streaming Algorithms for Efficient and Accurate Estimation of Flow Size Distribution

Abhishek Kumar, Minho Sung, Jun (Jim) Xu
Networking and Telecommunications Group, College of Computing, Georgia Institute of Technology
{akumar,mhsung,jx}@cc.gatech.edu

Jia Wang
AT&T Labs - Research
[email protected]



Problem Statement

Problem: To estimate the probability distribution of flow sizes. In other words, for each positive integer i, estimate n_i, the number of flows of size i.


[Figure: Flow Distribution. Log-log plot of frequency versus flow size.]


Definition of Flow: All packets with the same flow-label. The flow-label can be defined as any combination of fields from the IP header, e.g. <Source IP, Source Port, Dest. IP, Dest. Port, Protocol>.
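To make the target quantity concrete, the following minimal Python sketch computes the exact values n_i from a list of packets by keeping one counter per flow. This is the ground-truth definition only, not the streaming method of the talk, which avoids per-flow state. The `FlowLabel` type, the function name, and the sample packets are hypothetical names introduced for illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class FlowLabel(NamedTuple):
    """Hypothetical 5-tuple flow label built from header fields."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int


def flow_size_distribution(packets: Iterable[FlowLabel]) -> dict[int, int]:
    """Return {i: n_i}, the number of flows that contain exactly i packets."""
    flow_sizes = Counter(packets)               # flow label -> packets in that flow
    return dict(Counter(flow_sizes.values()))   # flow size i -> number of flows n_i


# Made-up packets: one flow of size 2 and one flow of size 1.
packets = [
    FlowLabel("10.0.0.1", 1234, "10.0.0.9", 80, 6),
    FlowLabel("10.0.0.1", 1234, "10.0.0.9", 80, 6),
    FlowLabel("10.0.0.2", 5353, "10.0.0.9", 53, 17),
]
distribution = flow_size_distribution(packets)
```

Exact counting of this kind needs memory proportional to the number of flows, which is the cost the counter-array approach in the rest of the talk is meant to avoid.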


Overview

Motivation

Related work: Inversion of sampled traces

System Model

Estimating total flows and flows of size 1

A holistic approach to estimate the entire distribution

A multiresolution extension of the mechanism


Motivation

• Knowledge of flow-distribution allows us to infer the usage pattern of the network, in terms of:

– The access bandwidth of the user population.

– Application types.

• It also helps in detecting anomalous events such as:

– Incipient worm infections.

– DDoS attacks.

– Route flapping.

• Enables other measurement applications, such as traffic matrix estimation.


Related work: Inverting sampled packet traces.

• Current measurement boxes collect traces via packet sampling.

• The approach is to invert the sampled distribution to obtain the actual distribution [Duffield et al., SIGCOMM’03].

• High estimation errors due to low sampling rates.

• Practical limitations to inverting sampled traffic [Hohn & Veitch, IMC’03].


Solution Architecture – System Model

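[Figure: System model. Headers from the packet stream are passed to the Online Streaming Module (1. Update), which produces the raw streaming result (2. Raw streaming result) for the Offline Processing Module, which outputs the flow distribution (3. Flow distribution).]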


Solution Architecture — Insertion Module

• Measurement proceeds in epochs (e.g. 100 seconds).

• Maintain an array of counters in fast memory (SRAM).

• For each packet, a counter is chosen via hashing, and incremented.

• No attempt to detect or resolve collisions.

• Data collection is lossy (erroneous), but very fast.

• At the end of the epoch, the counter array is paged to disk.
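The insertion steps above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the `CounterArray` class, its method names, and the use of CRC32 as the hash function are assumptions made here, and a Python list stands in for the SRAM counter array.

```python
import zlib


class CounterArray:
    """Hypothetical lossy counter array: one hashed increment per packet."""

    def __init__(self, m: int) -> None:
        self.counters = [0] * m   # stands in for m counters in fast memory

    def _index(self, flow_label: bytes) -> int:
        # CRC32 is a stand-in hash chosen only for this sketch.
        return zlib.crc32(flow_label) % len(self.counters)

    def update(self, flow_label: bytes) -> None:
        # No collision detection: flows sharing an index share a counter.
        self.counters[self._index(flow_label)] += 1

    def end_epoch(self) -> list[int]:
        # Hand the array to offline processing and start a fresh epoch.
        snapshot = self.counters
        self.counters = [0] * len(snapshot)
        return snapshot


array = CounterArray(m=8)
for label in [b"flow-a", b"flow-a", b"flow-b"]:
    array.update(label)
snapshot = array.end_epoch()
```

The per-packet work is a single hash and a single increment, which is why collisions are left unresolved online and handled statistically offline.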

Array of counters

[Figure: animation of the array of counters attached to the processor. A packet arrives, a location in the array is chosen by hashing the flow label, and the counter at that location is incremented. Over successive packets the counters shown read 1, then 1 and 1, then 1 and 2, and finally 1 and 3, where the last increment is marked as a collision.]

Array of counters — Implementation

• Efficient Implementation of a Statistics Counter Architecture. [Ramabhadran & Varghese, SIGMETRICS’03]

• Small (7-bit) counter in fast memory.

• Large (32 or 64 bit) counter in slow memory.

• Perfectly fits our requirements.


Solution Architecture — Estimation Module

• The counter array is processed to obtain the Counter Value Distribution: {m_0 = # counters with value 0, y_1 = # counters with value 1, ..., y_z = # counters with value z}.

• Use Bayesian statistics to derive the following quantities from the counter value distribution:

– The total no. of flows, n.

– The total no. of flows with exactly one packet, n_1.

– The flow distribution, φ.
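As a small illustration of the first step, the counter value distribution can be obtained with a single histogram pass over the paged-out array. The function name and the sample array below are hypothetical; the return value separates m_0 from the y_v values to match the notation above.

```python
from collections import Counter


def counter_value_distribution(counters):
    """Return (m0, y): m0 counts zero-valued counters, y maps v > 0 to y_v."""
    histogram = Counter(counters)
    m0 = histogram.pop(0, 0)   # number of counters with value 0
    return m0, dict(histogram)


# Made-up counter array with m = 8 counters.
m0, y = counter_value_distribution([0, 0, 0, 0, 1, 1, 2, 3])
```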


The shape of the “Counter Value Distribution”

[Figure: log-log plot of frequency versus flow size, showing the actual flow distribution and the raw counter value distributions for m = 1024K, m = 512K, m = 256K, and m = 128K.]

The distribution of flow sizes and raw counter values (both x and y axes are in log-scale). m = number of counters.


Estimating the no. of total flows, n, and flows of size one, n_1

• Let the total number of counters be m.

• The number of flows hashing to any counter c is modeled by a Poisson random variable with parameter λ = n/m.

• There is a simple estimator for the total number of flows: $\hat{n} = m \ln(m / m_0)$.

• This is a standard result, first used for Probabilistic Counting by Whang et al. [ACM Trans. Database Sys. 1990].

• We extend this result to derive an estimator for the total number of flows of size 1:

$$\hat{n}_1 = y_1 e^{\hat{n}/m} \qquad (1)$$

• Extremely accurate estimates (within ±1%).

• It is difficult to extend this approach for arbitrary i.
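The two estimators above translate directly into code. The sketch below is a hedged illustration: the function names and the tiny sample array are made up, and the guard for m_0 = 0 (where the logarithm is undefined) is an assumption about how one might handle a saturated array.

```python
import math


def estimate_total_flows(m: int, m0: int) -> float:
    """n_hat = m * ln(m / m0), where m0 is the number of zero-valued counters."""
    if m0 == 0:
        raise ValueError("no empty counters: the estimator is undefined")
    return m * math.log(m / m0)


def estimate_size_one_flows(y1: int, n_hat: float, m: int) -> float:
    """n1_hat = y1 * exp(n_hat / m), where y1 is the number of counters with value 1."""
    return y1 * math.exp(n_hat / m)


# Made-up counter array with m = 8, m0 = 4, y1 = 2.
counters = [0, 0, 0, 0, 1, 1, 2, 3]
m = len(counters)
m0 = counters.count(0)
y1 = counters.count(1)
n_hat = estimate_total_flows(m, m0)
n1_hat = estimate_size_one_flows(y1, n_hat, m)
```

Substituting the first estimator into the second gives n1_hat = y1 * m / m0, so in this toy array the two counters with value 1 are scaled up by a factor of 2 to account for size-1 flows hidden by collisions.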


Estimating the entire flow distribution, φ

• Begin with a guess of the flow distribution, φ_old.

• Based on this φ_old, compute the various possible ways of “splitting” a particular counter value and the respective probabilities of such events.

• This allows us to compute a refined estimate of the flow distribution, φ_new.

• Use φ_old ← φ_new for the next iteration.

• Repeating this multiple times allows the estimate to converge to a local maximum.

• This is an instance of Expectation Maximization.
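One iteration of this procedure can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: it assumes that the flows of size i hashing to a counter are Poisson with rate n·φ_i/m (consistent with the Poisson model used for the estimator of n), it enumerates every integer partition of each counter value, and it is therefore practical only for small counter values. The names `partitions` and `em_step` are hypothetical.

```python
from collections import Counter
from math import factorial


def partitions(v, max_part=None):
    """Yield each multiset of positive integers summing to v as a non-increasing tuple."""
    if max_part is None:
        max_part = v
    if v == 0:
        yield ()
        return
    for first in range(min(v, max_part), 0, -1):
        for rest in partitions(v - first, first):
            yield (first,) + rest


def em_step(counter_hist, phi_old, n, m):
    """One iteration: split each counter value according to phi_old, return phi_new.

    counter_hist maps a counter value v > 0 to y_v; phi_old maps flow size to probability.
    """
    credit = Counter()
    for v, y_v in counter_hist.items():
        weighted = []
        for split in partitions(v):
            multiplicity = Counter(split)
            # Assumed Poisson weight of the split. The common factor exp(-n/m)
            # is omitted because it cancels when normalising over splits of v.
            weight = 1.0
            for size, count in multiplicity.items():
                rate = n * phi_old.get(size, 0.0) / m
                weight *= rate ** count / factorial(count)
            weighted.append((multiplicity, weight))
        total = sum(weight for _, weight in weighted)
        if total == 0:
            continue  # phi_old gives zero probability to every split of v
        for multiplicity, weight in weighted:
            for size, count in multiplicity.items():
                credit[size] += y_v * (weight / total) * count
    norm = sum(credit.values())
    if norm == 0:
        return dict(phi_old)
    return {size: value / norm for size, value in credit.items()}


# Toy input: a single counter with value 2 and a uniform guess over sizes 1 and 2.
phi = em_step({2: 1}, {1: 0.5, 2: 0.5}, n=2, m=2)
```

In the toy call, the split 2 = 2 has weight 0.5 and the split 2 = 1 + 1 has weight 0.125, so the counter is attributed 80% to a single flow of size 2 and 20% to two flows of size 1. Repeated calls that feed the returned distribution back in as `phi_old` correspond to the iteration described above.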


Estimating the entire flow distribution — an example

• For example, a counter value of 3 could be caused by three events:

– 3 = 3 (no hash collision);

– 3 = 1 + 2 (a flow of size 1 colliding with a flow of size 2);

– 3 = 1 + 1 + 1 (three flows of size 1 hashed to the same location).

• Suppose the probabilities of these three events are 0.5, 0.3, and 0.2 respectively, and there are 1000 counters with value 3.

• Then we estimate that 500, 300, and 200 counters split in the three above ways, respectively.

• So we credit 300 * 1 + 200 * 3 = 900 to n_1, the count of size 1 flows, and credit 300 and 500 to n_2 and n_3, respectively.
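The arithmetic of this example can be checked with a few lines of Python. The split probabilities and the count of 1000 counters are the values given above; the variable names are illustrative.

```python
from collections import Counter

# Each way of splitting a counter value of 3, with its given probability.
splits = {(3,): 0.5, (1, 2): 0.3, (1, 1, 1): 0.2}
counters_with_value_3 = 1000

credit = Counter()
for split, probability in splits.items():
    expected_counters = probability * counters_with_value_3
    for flow_size in split:
        # Every flow in the split earns credit for its own size.
        credit[flow_size] += expected_counters
```

After the loop, `credit[1]`, `credit[2]`, and `credit[3]` hold the contributions to n_1, n_2, and n_3 from counters with value 3 (900, 300, and 500, up to floating-point rounding).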


Evaluation — Before and after running the Estimation algorithm

[Figure: log-log plot of frequency versus flow size, comparing the actual flow distribution, the raw counter values, and the estimation using our algorithm.]


Estimation for small flow sizes

[Figure: log-log plot of frequency versus flow size for flow sizes 1 to 10, comparing the actual flow distribution, the raw counter values, and the estimation using our algorithm.]


Sampling vs. array of counters – Web traffic.

[Figure: log-log plot of frequency versus flow size, comparing the actual flow distribution, the distributions inferred from sampling with N=10 and N=100, and the estimation using our algorithm.]


Replot with bucket-based smoothing – Web traffic.

[Figure: log-log plot of frequency versus flow size, comparing the actual flow distribution, the distributions inferred from sampling with N=10 and N=100, and the estimation using our algorithm.]


Sampling vs. array of counters – DNS traffic.

[Figure: log-log plot of frequency versus flow size, comparing the actual flow distribution, the distributions inferred from sampling with N=10 and N=100, and the estimation using our algorithm.]


Variability in the number of flows

• The total number of flows can change by orders of magnitude.

• Our mechanism is sensitive to the “load factor” n/m.

[Figure: two plots. The first is a log-log plot of frequency versus flow size showing the actual flow distribution and curves for m = 1024K, 512K, 256K, and 128K. The second plots WMRD against iteration (0 to 20) for m = 1024K, 512K, 256K, and 128K.]

• To accommodate all possible values of n, we extend our work to a multi-resolution version.


The multi-resolution array of counters

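[Figure: the multi-resolution array of counters, showing arrays A_1, A_2, ..., A_r, A_{r+1} and the labels R_1, R_2, ..., R_{r-1}, R_r, R_{r+1}, with sizes expressed in terms of m and powers of 2.]

• The multi-resolution array of counters allows our scheme to operate for any value of n, with graceful degradation in accuracy for a large number of flows.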


Original and estimated distributions using MRAC.

[Figure: two log-log plots of frequency versus flow size, each comparing the actual flow distribution with the estimation using our algorithm. Panel (c): Trace “Long” (563,080 flows). Panel (d): Trace “Short” (55,515 flows).]


Conclusions

• Data-streaming based solution for estimating the flow distribution.

• Lossy data structure + Bayesian statistics = Accurate streaming

• Fast but “lossy” data-collection.

• Estimation using EM algorithm.

• An order of magnitude of improvement in estimation accuracy over sampling based solutions.

• Multiresolution version allows a tradeoff between storage cost and accuracy.


Future Work

• Reusing EM results from preceding epochs to speed up the computation.

• Evaluating the suitability of the mechanism for other “uncommon” distributions.

• Estimating the flow-size distribution of various subpopulations.

• Data-Streaming solutions to other traffic monitoring problems.


Thank You !
