INDIANA UNIVERSITY: FlowRank. Presentation by ANML, July 2004

About the Presenter

• Mark Meiss
• Academic Background:
  – B.S. Mathematics, B.S. Computer Science
  – Ph.D. student in Department of Computer Science
• Research interests:
  – Structural analysis of network traffic data
  – High-performance file transfer protocols
  – Autonomous information retrieval agents

About the Presenter

• Professional Experience:
  – Over 10 years in software development
  – With IU IT Services since 1997
  – Worked with Bloomington NOC
  – First employee of ANML
  – Developed Animated Traffic Map, Router Proxy, Tsunami file transfer protocol, etc.

PageRank

• PageRank is a Web page ranking system invented by Brin and Page of Google
  – Attempts to measure importance of a Web page
  – Pages gain rank by being pointed to by many pages and by being pointed to by pages with high rank
  – Calculated offline using an iterative algorithm
  – Examines only the connections in the Web, not the content of the pages

Technical Details of PageRank

• A given set of Web pages creates an implied directed graph of connections
  – The graph has an edge from page A to page B if page A links to page B
• This graph can be represented as a matrix
  – If entry (i, j) is non-zero, page i links to page j
  – Sparse representation is necessary
    • Google’s matrix has over 1,000,000,000,000,000,000 entries
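The sparse representation described above can be sketched in a few lines. This is only an illustration under assumptions: the page names and the `build_link_graph` helper are hypothetical, and a dictionary of sets stands in for whatever sparse format a real system would use.

```python
from collections import defaultdict

def build_link_graph(links):
    """Store only the non-zero entries: page -> set of pages it links to."""
    graph = defaultdict(set)
    for src, dst in links:
        graph[src].add(dst)
        graph[dst]  # touching the key makes link targets appear as nodes too
    return dict(graph)

# Hypothetical toy Web of four pages.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("C", "D")]
graph = build_link_graph(links)

n = len(graph)
stored = sum(len(targets) for targets in graph.values())
print(f"{n} pages, {n * n} matrix entries, {stored} stored")
```

The full matrix grows with the square of the number of pages, while the stored entries grow only with the number of links.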

Technical Details of PageRank

• Problem with “dangling links”
  – These are links to pages that contain no links of their own
  – These pages absorb PageRank without distributing it to other pages
• Solution is to say that a page without outbound links actually links to every page with equal probability
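A minimal sketch of the dangling-link fix, assuming the graph is stored as a dictionary mapping each page to the set of pages it links to. The `transition_row` helper and the toy graph are hypothetical names used only for illustration.

```python
def transition_row(page, graph):
    """Return {target: probability} for one random-walk step from `page`."""
    pages = list(graph)
    targets = graph[page]
    if not targets:
        # Dangling page: treat it as linking to every page with equal probability.
        return {p: 1.0 / len(pages) for p in pages}
    return {t: 1.0 / len(targets) for t in targets}

# Hypothetical toy graph; page "D" has no outbound links.
graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A", "D"}, "D": set()}
print(transition_row("C", graph))
print(transition_row("D", graph))
```

With this rule every row of the transition matrix sums to 1, so no page absorbs rank without passing it on.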

Calculating PageRank

• We can think of the connectivity matrix as defining a Markov model that generates a random list of Web pages
  – In other words, we can use the matrix to make a random walk of the Web
• The PageRank vector is the principal (dominant) eigenvector of the normalized connectivity matrix
  – In other words, each entry is the long-run probability that we’re at that page during our random walk
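The random-walk view suggests a simple iterative calculation: start with equal probability on every page and repeatedly apply one step of the walk until the values stop changing. The sketch below is an assumption-laden illustration, not the presenter's code. In particular, the `damping` parameter (a random jump to any page, with 0.85 as a commonly used value) comes from the published PageRank formulation and is not mentioned on these slides, and the `pagerank` function name and toy graph are hypothetical.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration over a {page: set_of_linked_pages} graph."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by dangling pages is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for t in graph[p]:
                new[t] += damping * rank[p] / len(graph[p])
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank

# Hypothetical toy graph.
graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A", "D"}, "D": set()}
rank = pagerank(graph)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```

The scores sum to 1, so each can be read as the fraction of time the random walk spends on that page.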

Vulnerability of PageRank

• PageRank was first published in 1998
• Since then, it has been shown to be vulnerable to “clique attacks”
  – Unsavory Web site owner buys 75 domains
  – Home page on each domain points to each of the other domains
  – All of the domains thus rise in PageRank score
• Google blacklists Web sites for this

FlowRank

• Netflow records also create an implied connectivity matrix
  – We can create an edge from host A to host B if host A transmits data to host B

• The vulnerability to a clique attack becomes a detector of peer-to-peer applications and social networks!

Weighted PageRank

• The volume of data in a flow is an important characteristic of the traffic
  – We modify the basic PageRank algorithm by weighting all entries based on traffic volume
  – This new algorithm still converges, but the final values have a significantly different distribution
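One way to read "weighting all entries based on traffic volume" is to let each host pass on its rank in proportion to the bytes it sent to each peer. The slides do not give the exact weighting scheme, so the sketch below is an assumption: the `weighted_rank` function, the `(src, dst, byte_count)` record layout, the host names, and the damping factor are all hypothetical.

```python
from collections import defaultdict

def weighted_rank(flows, damping=0.85, tol=1e-10, max_iter=1000):
    """flows: iterable of (src, dst, byte_count). Returns {host: score}."""
    volume = defaultdict(lambda: defaultdict(float))
    hosts = set()
    for src, dst, nbytes in flows:
        volume[src][dst] += nbytes
        hosts.update((src, dst))
    hosts = sorted(hosts)
    n = len(hosts)
    sent = {h: sum(volume[h].values()) for h in hosts}
    rank = {h: 1.0 / n for h in hosts}
    for _ in range(max_iter):
        # Hosts that sent nothing spread their rank evenly, like dangling pages.
        idle = sum(rank[h] for h in hosts if sent[h] == 0)
        new = {h: (1.0 - damping) / n + damping * idle / n for h in hosts}
        for src in hosts:
            if sent[src] == 0:
                continue
            for dst, nbytes in volume[src].items():
                # Share of src's rank is proportional to bytes sent to dst.
                new[dst] += damping * rank[src] * nbytes / sent[src]
        change = sum(abs(new[h] - rank[h]) for h in hosts)
        rank = new
        if change < tol:
            break
    return rank

# Hypothetical flows: host "a" sends far more data to "b" than to "c".
flows = [("a", "b", 9000), ("a", "c", 1000), ("b", "a", 500), ("c", "a", 500)]
scores = weighted_rank(flows)
print({h: round(s, 4) for h, s in scores.items()})
```

In the unweighted version hosts "b" and "c" would score the same. With byte weighting "b" scores higher, which is one small example of how the distribution of values changes.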

So What’s It Good For?

• These are potential applications; this research is just starting
  – Automatic detection of peer-to-peer applications or “bot networks”
  – Heuristic for node importance in visualization tools
  – Heuristic for ordering importance of IDS anomalies

Rethinking the Edges

• In theory, every TCP connection between host A and host B involves two flows
  – One from host A to host B
  – One from host B to host A
• Due to sampling, we often catch only one of the two
  – This interferes with the operation of FlowRank

• When we see a flow from host A to host B, why should the edge go from A to B and not from B to A?
  – We can try to identify which host is the client (initiator of the connection) and which is the server (receiver of the connection)
  – We can make a good guess at this by studying the relative frequency of the ports used
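The port-frequency guess can be sketched as follows. Service ports tend to recur across many flows, while client-side ports rarely repeat, so the endpoint using the more frequent port is taken to be the server. This is an illustrative heuristic under assumptions, not necessarily the exact rule used in the research: the record layout, the addresses, and the `orient_flows` helper are hypothetical, and ties are resolved arbitrarily in favor of the destination.

```python
from collections import Counter

# Hypothetical flow records: (src_host, src_port, dst_host, dst_port).
flows = [
    ("10.0.0.5", 51234, "10.0.0.9", 80),
    ("10.0.0.9", 80, "10.0.0.6", 40211),
    ("10.0.0.7", 33100, "10.0.0.9", 80),
    ("10.0.0.8", 22, "10.0.0.5", 51877),
    ("10.0.0.6", 40990, "10.0.0.8", 22),
]

def orient_flows(flows):
    """Return (client, server) pairs, guessing the server by port frequency."""
    port_counts = Counter()
    for _src, sport, _dst, dport in flows:
        port_counts[sport] += 1
        port_counts[dport] += 1
    pairs = []
    for src, sport, dst, dport in flows:
        if port_counts[sport] > port_counts[dport]:
            pairs.append((dst, src))  # source port is the common one: src is the server
        else:
            pairs.append((src, dst))  # destination port is the common one (or a tie)
    return pairs

pairs = orient_flows(flows)
for client, server in pairs:
    print(f"client {client} -> server {server}")
```

Because both directions of a connection map to the same (client, server) pair, a flow caught in only one direction by sampling still yields a consistently oriented edge.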

• This client/server classification seems to greatly increase the utility of the connectivity graph

• Examining the connectivity graph over time can give us an idea of the type of application that runs on a TCP port

Visualization

• We build the entries for the connectivity matrix by assigning an index to each IP address
  – The first host to show up is index 1, etc.
• Suppose there is a flow from 127.54.1.3 to 10.99.4.63
  – 127.54.1.3 may get index 314
  – 10.99.4.63 may get index 57
  – Then entry (314, 57) in the matrix will be non-zero

• We can see this matrix using the “spy” command in Matlab
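The index assignment can be sketched as below. It is a minimal illustration under assumptions: the `index_flows` helper is hypothetical, the third address is made up, and giving the source its index before the destination within a flow is an arbitrary choice.

```python
def index_flows(flows):
    """Assign 1-based indices in order of first appearance.

    Returns (index, entries), where entries is the set of non-zero
    (row, column) positions in the connectivity matrix.
    """
    index = {}
    entries = set()
    for src, dst in flows:
        for host in (src, dst):
            if host not in index:
                index[host] = len(index) + 1
        entries.add((index[src], index[dst]))
    return index, entries

flows = [
    ("127.54.1.3", "10.99.4.63"),
    ("10.99.4.63", "10.1.1.1"),
    ("127.54.1.3", "10.1.1.1"),
]
index, entries = index_flows(flows)
print(index)
print(sorted(entries))
```

The resulting set of (row, column) pairs is exactly what a sparsity plot displays. For readers working in Python rather than Matlab, matplotlib's `pyplot.spy` draws a comparable picture.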

Problems

• Assigning the indices in order of occurrence
  – Makes the non-zero entries in the graph grow down and to the right over time
  – Concentrates high-traffic nodes in the upper left
  – Exposes artifacts of netflow sampling

• Static image gives very little temporal information

Solutions

• After generating the full index for a set of data, we can randomize its order
  – Tends to separate high-traffic nodes
  – Avoids sampling artifacts
• We can include a temporal element as well
  – Produce a movie with a sliding window of netflow traffic
  – For example, use a 1-hour window and 15-minute increments for each frame
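Both ideas can be sketched briefly: a generator that yields the flows for each frame, and a helper that shuffles the host indices. The function names, the `(timestamp, src, dst)` record layout, and the fixed random seed are assumptions made for illustration. Timestamps are taken to be in seconds.

```python
import random

def sliding_windows(flows, window=3600, step=900):
    """Yield (start_time, flows_in_window) for each frame.

    Defaults correspond to a 1-hour window advanced in 15-minute steps.
    """
    if not flows:
        return
    first = min(t for t, _s, _d in flows)
    last = max(t for t, _s, _d in flows)
    start = first
    while start <= last:
        frame = [f for f in flows if start <= f[0] < start + window]
        yield start, frame
        start += step

def shuffled_index(hosts, seed=0):
    """Map each host to a random 1-based index (fixed seed keeps frames consistent)."""
    hosts = sorted(set(hosts))
    order = list(range(1, len(hosts) + 1))
    random.Random(seed).shuffle(order)
    return dict(zip(hosts, order))

# Hypothetical flows at 0 s, 1000 s and 4000 s.
flows = [(0, "a", "b"), (1000, "b", "c"), (4000, "c", "a")]
frames = list(sliding_windows(flows))
print([len(frame) for _start, frame in frames])  # [2, 2, 1, 1, 1]

index = shuffled_index(h for _t, s, d in flows for h in (s, d))
print(index)
```

Building the shuffled index once for the whole data set, rather than per frame, keeps each host at the same matrix position throughout the movie.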

[Interlude]

…video demonstration…

Problems

• We can’t see the FlowRank data

• We can’t highlight the importance of any particular node

• We can’t generate a video file in a convenient codec using Matlab

Solutions

• Write a frame rendering program and save each frame as a .PNG file
  – Use the mplayer system to create a DivX file
• Use the FlowRank vector to modify the size of a flow in the frame
  – Size of a flow is proportional to the number of standard deviations between the mean FlowRank and (src+dst)/2, the average FlowRank of the flow’s two endpoints
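The sizing rule can be written out as a short calculation. This is a sketch under assumptions: the `flow_sizes` helper and the `scale` parameter are hypothetical, the deviation is taken as an absolute value, and the population standard deviation is used. The slides do not specify either of the last two choices.

```python
from statistics import mean, pstdev

def flow_sizes(flowrank, flows, scale=1.0):
    """Return {(src, dst): drawn size} from a {host: FlowRank} mapping."""
    values = list(flowrank.values())
    mu = mean(values)
    sigma = pstdev(values)
    sizes = {}
    for src, dst in flows:
        pair_avg = (flowrank[src] + flowrank[dst]) / 2
        # Number of standard deviations between the pair average and the mean.
        deviations = abs(pair_avg - mu) / sigma if sigma > 0 else 0.0
        sizes[(src, dst)] = scale * deviations
    return sizes

# Hypothetical FlowRank values for four hosts.
flowrank = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
sizes = flow_sizes(flowrank, [("a", "b"), ("c", "d"), ("a", "d")])
for flow, size in sizes.items():
    print(flow, round(size, 3))
```

A flow whose endpoints average out to the mean FlowRank, such as ("a", "d") here, is drawn at minimal size, while flows between unusually ranked hosts stand out.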

[Interlude]


Another Quick Fix

• We don’t know whether a flow is important because of its source (server), its destination (client), or both

• Solution: Give each flow a red component and a blue component
  – A red flow is important because of the server
  – A blue flow is important because of the client
  – A magenta flow is important because of both
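The color scheme amounts to mapping two scores onto two color channels. In the sketch below, the `flow_color` helper and the scaling of each channel by a maximum score are assumptions, since the slides only describe the resulting colors.

```python
def flow_color(server_score, client_score, max_score):
    """Return an (R, G, B) tuple with 0-255 channels.

    Red encodes the importance of the server, blue that of the client.
    """
    if max_score <= 0:
        return (0, 0, 0)
    red = round(255 * min(server_score / max_score, 1.0))
    blue = round(255 * min(client_score / max_score, 1.0))
    return (red, 0, blue)

print(flow_color(1.0, 0.0, 1.0))  # important server only
print(flow_color(0.0, 1.0, 1.0))  # important client only
print(flow_color(1.0, 1.0, 1.0))  # both important
```

Red and blue together give magenta, so a single pixel color shows which endpoint makes the flow important.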

[Interlude]


Evaluating FlowRank

• How can we show that FlowRank is a useful metric for distinguishing traffic?

• We need some empirical way of measuring its utility

• It has to be useful enough to justify the (considerable) computational expense of calculating it

Experimental Setup

• Split large volume of TCP netflow data into 65,536 bins, one for each port

• Compute an n-dimensional statistical profile for each port (n is currently around 20)
  – Also compute an (n+m)-dimensional profile, where the extra dimensions are based on FlowRank statistics

• Apply clustering and classification algorithms (SVM, k-means, etc.) to each set of profiles

• Examine the differences between the two sets
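The first two steps of this setup can be sketched on a toy scale. Everything specific here is an assumption: the slides do not list the roughly 20 profile dimensions, so the three features below (flow count, mean bytes, distinct hosts) are placeholders, as are the single FlowRank-based feature, the choice of the destination port for binning, and the `port_profiles` helper. The clustering and classification step is left out.

```python
from collections import defaultdict
from statistics import mean

def port_profiles(flows, flowrank=None):
    """flows: iterable of (src, dst, dst_port, nbytes). Returns {port: profile}."""
    bins = defaultdict(list)
    for src, dst, port, nbytes in flows:
        bins[port].append((src, dst, nbytes))
    profiles = {}
    for port, records in bins.items():
        sizes = [n for _s, _d, n in records]
        hosts = sorted({h for s, d, _n in records for h in (s, d)})
        profile = [len(records), mean(sizes), len(hosts)]
        if flowrank is not None:
            # Extra FlowRank-based dimension (the "+m" part, here m = 1).
            profile.append(mean(flowrank[h] for h in hosts))
        profiles[port] = profile
    return profiles

# Hypothetical flows and FlowRank values.
flows = [("a", "s", 80, 1000), ("b", "s", 80, 3000), ("a", "t", 22, 500)]
flowrank = {"a": 0.2, "b": 0.2, "s": 0.5, "t": 0.1}

base = port_profiles(flows)
extended = port_profiles(flows, flowrank)
print(base)
print(extended)
```

Running the same clustering or classification algorithm on `base` and on `extended` would then show whether the FlowRank dimensions change how the ports group together.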

Structural Visualization

• It would be nice to examine the connectivity matrix as an actual graph

• This presents major problems
  – Because of port-scanning, crawling, etc., most data contains a single large component containing over 2/3 of all the edges, plus some noise
  – Optimal graph layout is an NP-hard problem
  – Current graph layout packages can’t handle hundreds of thousands of nodes (with some limited exceptions)
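The size of the largest component can be checked before attempting a layout. The sketch below uses a union-find structure and treats edges as undirected, which is an assumption about what "component" means here. The `largest_component_edge_share` helper and the toy edge list are hypothetical.

```python
from collections import Counter

def largest_component_edge_share(edges):
    """Return the fraction of edges that fall in the largest connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the components of the two endpoints of every edge.
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Count edges per component, identified by the component's root.
    per_component = Counter(find(a) for a, _b in edges)
    return max(per_component.values()) / len(edges) if edges else 0.0

# Hypothetical edge list: one triangle plus one isolated pair.
edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
share = largest_component_edge_share(edges)
print(share)  # 0.75
```

A share above 2/3, as reported for most of the data, means a layout algorithm spends nearly all of its effort on one dense component.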

Structural Visualization

• Moving the visualization to 3D gives layout algorithms another degree of freedom

• Also allows for better interactive navigation of the data (virtual fly-bys, etc.)

• We have had some early success with the Tulip package

[Interlude]


Future Directions

• Real-time visualization

• Anomaly detection

• Tunneled traffic detection

• Intent profiling

Your Ideas are Valued!

• Please share any thoughts, criticisms, or questions you may have!

• E-mail: [email protected]
