19
Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories David Muelas , Miguel Gordo, Jos´ e Luis Garc´ ıa-Dorado, Jorge E. L´ opez de Vergara Email: {dav.muelas, jl.garcia, jorge.lopez vergara}@uam.es, [email protected] Universidad Aut´onoma de Madrid CNSM 2015 – November 2015

Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

Embed Size (px)

Citation preview

Page 1: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

Dictyogram: a Statistical Approach for theDefinition and Visualization of Network Flow

Categories

David Muelas, Miguel Gordo, Jose Luis Garcıa-Dorado,Jorge E. Lopez de Vergara

Email: {dav.muelas, jl.garcia, jorge.lopez vergara}@uam.es,[email protected]

Universidad Autonoma de Madrid

CNSM 2015 – November 2015

Page 2: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

Network Health Check

Network managers must monitor network vital signs to assure it ishealthy:

(a) ECG

00:00:00 03:20:00 06:40:00 10:00:00 13:20:00 16:40:00 20:00:00 23:20:000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Cat1 Cat2 Cat3 Cat4 Cat5 Cat6 Cat7 Cat8 Cat9 Cat10

(b) Dictyogram (Normalized version)

But. . . What exactly is Dictyogram?

Page 3: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

Dictyogram (from δικτυo, network in Greek): Method tographically trace the network flow behavior versus time. Itsgraphical results can be like a network electrogram, showing itsvital signs.

Page 4: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Outline

1 IntroductionContextOur Goals

2 Method definitionProbability integral transformModeling CDFs

3 Experimental resultsModel evaluationDictyogram visualization

4 Conclusions

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 4

Page 5: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

ContextOur Goals

Context

Network flow-based monitoring has been proven useful todetect network intrusion, malfunction, or other types ofanomalies.

Unfortunately, network managers have to deal with tons ofmeasurement data, and its interpretation has become achallenge.

Data summaries: difficult to reach a good trade-off betweendetail and simplifications: insufficient data can lead torestricted or even erroneous conclusions.

Not only the measurements are important from the point ofview of network management: the application of suitabletechniques improves the quality and depth of the knowledgethat can be extracted from measurements.

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 5

Page 6: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

ContextOur Goals

Our Goals

Our proposal is intended to ease network managers’ work byproposing a novel approach to study the behavior of network flowcharacteristics. Our main goal is to define comprehensivesummaries of network flow data:

Our approach is based in the study of different flowcharacteristics’ ECDFs — e.g., flow size or durationdistributions.

Using those ECDFs, we define flow categories using theintegral probability transform — e.g., using decile delimitedintervals.

As we will see, this approach improves the detection of networkanomalies and the visualization of network state.

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 6

Page 7: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Probability integral transformModeling CDFs

Method description

Probability integral transform:

Let X be a continuous random variable with cumulativedistribution function FX . Then FX (X ) follows a uniformdistribution on [0, 1].

(b)

0

0.5

1(a)

Ci = F

X

−1(P

i)

Pi

And them, we define flow categories using a set of probabilitylevels using the CDF of certain flow characteristics.

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 7

Page 8: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Probability integral transformModeling CDFs

Keep an eye on the hypotheses!

25 30 350

0.2

0.4

0.6

0.8

1

(b)

02004006000

0.2

0.4

0.6

0.8

1

(a)

(c) Gaussian

0 20 40 600

0.2

0.4

0.6

0.8

1

(b)

051015200

0.2

0.4

0.6

0.8

1

(a)

(d) Poisson

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 8

Page 9: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Probability integral transformModeling CDFs

How can we model an CDF?

Glivenko-Cantelli theorem: the ECDF converges to the CDFas the number of observations increases.

Nonetheless, computational cost increases when weaccumulate all the values of the characteristic under analysis.

Alternative approach: Functional Data Analysis:

Mean Function: FmeanX =

1

n

n∑i=1

FXi

Problem: not robust

Functional Depth:

Maximum depth observation.Median Function (it is the function that maximizes thefunctional depth we use).

Problem: more computationally expensive

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 9

Page 10: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Probability integral transformModeling CDFs

Dataset for the evaluation

To asses the advantages of our method, we have use a realdataset:

Flow records, Spanish Academic Network: more than onemillion users, more than 7 years of data.

Exporters: 5 Netflow exporters, different geographicallocations (all of them in Spain).

Packet level sampling: rate of one out of 100 packets.

Period selected for the evaluation of the CDF estimationmethods: 30 days.

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 10

Page 11: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Probability integral transformModeling CDFs

Analyzing ECDFs to get a model of the typical behavior

101

102

103

104

105

106

107

108

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

X: 40Y: 0.9

X: 44Y: 0.8

X: 53Y: 0.7

X: 80Y: 0.6

X: 149Y: 0.5

X: 501Y: 0.4

X: 1452Y: 0.3

X: 1500Y: 0.2

X: 3000Y: 0.1

Flow size (bytes)

P (

X >

x)

Mean

Deepest

Median

Figure: Comparison between observed CCDFs (orange line, no marker)for Exporter A, and models obtained using the mean (blue line, circles),deepest (black line, diamonds) and median (red line, triangles) functions.

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 11

Page 12: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Model evaluationDictyogram visualization

Empirical comparison (I)

0 5 10 15 20 25 300

5

10x 10

5

A

0 5 10 15 20 25 300

5

10x 10

6

B

0 5 10 15 20 25 300

5

10x 10

7

C

0 5 10 15 20 25 300

5

10x 10

6

D

0 5 10 15 20 25 300

5x 10

6

E

Mean Deepest Median

Figure: Evolution of the Pearson’s test-statistic for all exporters. (Less isbetter.)

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 12

Page 13: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Model evaluationDictyogram visualization

Empirical comparison (II)

Table: Summary of the evaluation of the different methods to estimatethe CDF.

Exporter Method # Best

AMean function 0Deepest obs. 3

Median function 25

BMean function 0Deepest obs. 6

Median function 22

CMean function 20Deepest obs. 8

Median function 0

DMean function 0Deepest obs. 23

Median function 5

EMean function 0Deepest obs. 28

Median function 0

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 13

Page 14: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Model evaluationDictyogram visualization

Final visualization of Dictyogram

03:00:00 06:00:00 09:00:00 12:00:00 15:00:00 18:00:00 21:00:000

2

4x 10

4(a) Mean

Co

ncu

rre

nt

flo

ws f

or

ea

ch

ca

teg

ory

03:00:00 06:00:00 09:00:00 12:00:00 15:00:00 18:00:00 21:00:000

2

4x 10

4(b) Deepest Observation

Time of day

03:00:00 06:00:00 09:00:00 12:00:00 15:00:00 18:00:00 21:00:000

2

4x 10

4(c) Median

1

1

1

2

2

2

Figure: Dictyogram representation of fi (t) with their respective sizeintervals delimited by the deciles given by (a) mean, (b) deepest observedECDF, and (c) median.

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 14

Page 15: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Model evaluationDictyogram visualization

Final visualization of Dictyogram

00:00:00 03:20:00 06:40:00 10:00:00 13:20:00 16:40:00 20:00:00 23:20:000

0.5

1

1.5

2

2.5

3

3.5

4x 10

4

1 2

Figure: Zoom in the median.

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 15

Page 16: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Key remarks

Our method:

Is manager friendly: it provides Statistical summaries basedon certain probability levels, which eases the study of theflows traversing the network.

Links statistical properties to time evolution: it eases thedetection of changes in the statistical properties of thecharacteristics under analysis.

Improves network flow data visualization: it lets controlthe resolution of the visualization of the distribution thatnetwork flow characteristics follow.

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 16

Page 17: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Future work

We plan to:

Study how to summarize several different network behaviors ina multivariate uniform distribution, and use other well-knowndistributions (and not only uniform) for signatures.

Study the distribution of the Pearson’s test-statistic to detectanomalous events.

Test the stability of the estimation of the CDF ( to definesome criteria to recalibrate the model).

Explore other representations with higher dimensionality.

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 17

Page 18: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Thank you!

Questions?

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 18

Page 19: Dictyogram: a Statistical Approach for the Definition and Visualization of Network Flow Categories

IntroductionMethod definition

Experimental resultsConclusions

Annex: Functional depth

We use the definition given by:

MSn,H(x) = min{SLn(x), ILn(x)} (1)

where

SLn(x) = 1nλ(I )

n∑i=1

λ{t ∈ I : x(t) ≤ xi (t)}

ILn(x) = 1nλ(I )

n∑i=1

λ{t ∈ I : x(t) ≥ xi (t)} (2)

With it, we consider:

Maximum depth observation.

Median Function (it is the function that maximizes thefunctional depth we use).

D. Muelas, M. Gordo, J.L. Garcıa-Dorado, J.E. Lopez de Vergara Dictyogram 19