Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts

Using Structure Indices for Efficient Approximation of Network Properties

Matthew J. Rattigan, Marc Maier, and David Jensen

University of Massachusetts Amherst

Data Mining, November 27, 2006

Deborah Stoffer

The Problem
- Recent research works with very large networks
  - Millions of nodes
- Calculating network statistics on very large networks can be difficult
  - Shortest paths
  - Betweenness centrality: the proportion of all shortest paths in the network that run through a given node
  - Closeness centrality: the average distance from the given node to every other node in the network

The Problem
- The most efficient known algorithms for calculating betweenness centrality and closeness centrality are O(ne + n^2 log n)
  - n: number of nodes
  - e: number of edges
- Calculations for path finding can have even higher complexity
  - Require bidirectional breadth-first search
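For reference, a bidirectional breadth-first search can be sketched as follows. This is a minimal illustration, not the presenters' implementation; it assumes an unweighted, undirected graph stored as a dict of adjacency lists, and the function name and the "explored" count (nodes discovered from either side) are choices made here.

```python
def bidirectional_bfs(adj, s, t):
    """Return (shortest path length, nodes discovered), or (None, ...) if no path."""
    if s == t:
        return 0, 1
    dist = [{s: 0}, {t: 0}]      # distances found from s and from t
    frontier = [[s], [t]]        # current BFS level on each side
    while frontier[0] and frontier[1]:
        # Expand the smaller frontier by one full level.
        side = 0 if len(frontier[0]) <= len(frontier[1]) else 1
        mine, other = dist[side], dist[1 - side]
        best, next_level = None, []
        for u in frontier[side]:
            for v in adj[u]:
                if v in other:
                    # The two searches meet across edge (u, v).
                    cand = mine[u] + 1 + other[v]
                    best = cand if best is None else min(best, cand)
                elif v not in mine:
                    mine[v] = mine[u] + 1
                    next_level.append(v)
        if best is not None:
            return best, len(dist[0]) + len(dist[1])
        frontier[side] = next_level
    return None, len(dist[0]) + len(dist[1])
```

On large graphs the number of discovered nodes is the cost that an index is meant to reduce.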

The Problem
- Example: Rexa citation graph
  - Papers in computer science and related fields
  - Largest connected component contains 165,000 nodes (papers) and 321,000 edges (citations)
  - Finding a path of length 15 requires the exploration of 65,000 nodes

The Problem

Network Structure Index (NSI)
- Similar to the type of index commonly used to speed queries in modern database systems
- Can be constructed once for a given graph and then used to speed the calculations of many measures on the graph

Two components of an NSI
- Set of annotations on every node in the network that provide information about relative or absolute location
  - For G(V,E) the annotations define A: V → S, where S is an arbitrarily complex "annotation space"
- A distance function that uses the annotations to define graph distance between pairs of nodes by mapping pairs of node annotations to a positive real number
  - D: S × S → R

Types of Network Structure Indices
- All Pairs Shortest Path (APSP)
- Degree
- Landmark
- Global Network Positioning (GNP)
- Zone
- Distance to Zone (DTZ)

All Pairs Shortest Path NSI
- Node annotations
  - Consist of an n x n matrix (n = |V|) containing the optimal path distances between all pairs of nodes
- Distance function
  - A simple lookup in the matrix
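A minimal sketch of the APSP index for an unweighted graph, assuming a dict-of-adjacency-lists representation. The function names are hypothetical, and the matrix is stored as nested dicts rather than a dense array. It runs one breadth-first search per node, which is why this index does not scale to very large graphs.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def build_apsp_index(adj):
    """Annotation step: one row of the n x n distance matrix per node."""
    return {u: bfs_distances(adj, u) for u in adj}

def apsp_distance(index, s, t):
    """Distance function: a plain lookup; unreachable pairs map to infinity."""
    return index[s].get(t, float("inf"))
```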

Degree NSI
- Node annotations
  - Annotate each node with its undirected degree within the graph
- Distance function between source node s and target node t
  - D_Degree(s, t) = 2n - degree(s) - degree(t)
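The Degree NSI is simple enough to write out directly. The sketch below assumes an undirected graph stored as a dict of adjacency lists; the function names are hypothetical. Note that the value is not an estimate of hop distance: pairs of high-degree nodes simply receive smaller values.

```python
def annotate_degree(adj):
    """Annotation A: V -> S, where S is the node's undirected degree."""
    return {u: len(neighbors) for u, neighbors in adj.items()}

def degree_distance(annotations, s, t):
    """D_Degree(s, t) = 2n - degree(s) - degree(t), with n = |V|."""
    n = len(annotations)
    return 2 * n - annotations[s] - annotations[t]
```

For a star with one hub and three leaves (n = 4), the hub-to-leaf value is 8 - 3 - 1 = 4 and the leaf-to-leaf value is 8 - 1 - 1 = 6.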

Landmark NSI
- Randomly designate a small number of nodes in the network to serve as navigational beacons
- Node annotations
  - Annotate nodes in the graph by flooding out from each landmark and recording the graph distance to each node in the network
  - Gives a vector of graph distances for each node
- Distance function

Landmark NSI
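The landmark annotation step can be sketched as one breadth-first flood per landmark. The slide's own distance function did not survive in this text, so `landmark_upper_bound` below is an assumption added for illustration: it routes through the best landmark, which by the triangle inequality never underestimates the true distance. All names are hypothetical.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def annotate_landmarks(adj, num_landmarks, seed=0):
    """Pick random landmarks and give every node a vector of distances to them."""
    rng = random.Random(seed)
    landmarks = rng.sample(list(adj), num_landmarks)
    floods = [bfs_distances(adj, l) for l in landmarks]
    annotations = {u: [f.get(u, float("inf")) for f in floods] for u in adj}
    return landmarks, annotations

def landmark_upper_bound(annotations, s, t):
    """ASSUMPTION, not the slide's formula: shortest detour via any landmark."""
    return min(a + b for a, b in zip(annotations[s], annotations[t]))
```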

Global Network Positioning NSI
- Node annotation
  - Annotation uses a nonlinear optimization algorithm to create a multidimensional coordinate system that encodes the location of each node within the network
- Distance function is the Manhattan distance between node pairs
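Once coordinates exist, the GNP distance function is a one-liner. The sketch below assumes the coordinates have already been produced by some optimizer (not shown) and are stored as a dict from node to a tuple of floats; the function name and the sample coordinates are hypothetical.

```python
def manhattan_distance(coords, s, t):
    """Sum of absolute coordinate differences between the annotations of s and t."""
    return sum(abs(a - b) for a, b in zip(coords[s], coords[t]))

# Hypothetical 2-dimensional coordinates for two nodes.
coords = {"a": (0.0, 1.0), "b": (3.0, -1.0)}
d_ab = manhattan_distance(coords, "a", "b")  # |0 - 3| + |1 - (-1)| = 5.0
```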

Zone NSI
- Node annotations
  - Each node is annotated with a d-dimensional vector of zone labels
- Distance function

Zone NSI Algorithm
- For d dimensions
  - Randomly select k seed nodes, assign them zone labels 1 through k, and place them in the labeled set
  - Place all other nodes in the unlabeled set
  - While the unlabeled set is not empty
    - Randomly select a node l from the labeled set
    - Randomly select a node u from the unlabeled set that is a neighbor to l
    - Assign u to the same zone as l and move it to the labeled set
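The labeling loop above can be sketched as follows, assuming a connected, undirected graph stored as a dict of adjacency lists. The function names are hypothetical. One implementation choice made here: labeled nodes with no unlabeled neighbors left are dropped from the candidate list when drawn, which only skips draws that could not label anything.

```python
import random

def zone_labels(adj, k, rng):
    """One dimension: grow k zones outward from random seeds."""
    seeds = rng.sample(list(adj), k)
    zone = {seed: label for label, seed in enumerate(seeds, start=1)}
    candidates = list(seeds)  # labeled nodes that may still have unlabeled neighbors
    while len(zone) < len(adj) and candidates:
        i = rng.randrange(len(candidates))
        l = candidates[i]
        unlabeled = [v for v in adj[l] if v not in zone]
        if not unlabeled:
            # l is exhausted: remove it in O(1) by swapping with the last entry.
            candidates[i] = candidates[-1]
            candidates.pop()
            continue
        u = rng.choice(unlabeled)
        zone[u] = zone[l]      # u joins the zone of l
        candidates.append(u)
    return zone

def annotate_zones(adj, d, k, seed=0):
    """Annotation: a d-dimensional vector of zone labels per node."""
    rng = random.Random(seed)
    dims = [zone_labels(adj, k, rng) for _ in range(d)]
    return {u: [z.get(u) for z in dims] for u in adj}
```

Because each dimension uses fresh random seeds, two nodes that share a zone in one dimension may be separated in another.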

Zone NSI

Distance to Zone (DTZ) NSI
- Hybrid between Landmark and Zone NSIs
- Node annotations
  - Divide the graph into zones and for each node u and zone Z calculate the distance from u to the closest node in Z
- Distance function

Distance to Zone (DTZ) NSI
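The DTZ annotation can be computed with one multi-source breadth-first search per zone. The sketch below covers only the annotation step (the distance function is not reproduced in this text) and assumes the zone assignment is already available as a dict from node to zone label; the names are hypothetical.

```python
from collections import deque

def distance_to_zone(adj, zone_of):
    """For each node u and zone Z, the hop distance from u to the closest node in Z."""
    annotations = {u: {} for u in adj}
    for z in set(zone_of.values()):
        # Start the search from every member of zone z at distance 0.
        dist = {u: 0 for u in adj if zone_of[u] == z}
        queue = deque(dist)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for u in adj:
            annotations[u][z] = dist.get(u, float("inf"))
    return annotations
```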

Complexity of Different NSIs

Search Performance
- Optimality of the lengths of paths found
  - Path ratio
  - pf is the length of the found paths
  - po is the length of the optimal paths
  - r is the number of randomly selected pairs of nodes in the graph
  - P = 1.0 indicates an NSI that finds optimal paths
  - P >> 1.0 indicates a poorly performing NSI

Search Performance
- Performance gain
  - Exploration ratio
  - ef is the number of nodes explored by best-first search
  - eb is the number of nodes that are explored using a bidirectional breadth-first search
  - r is the number of pairs of nodes in the graph
  - E values close to zero indicate good search performance
  - E values greater than 1.0 indicate poor search performance
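A best-first search guided by an NSI distance function might look like the sketch below. It is an illustration under stated assumptions: the graph is a dict of adjacency lists, `distance(v, t)` stands in for any NSI distance function, and "explored" counts nodes popped from the priority queue. The found path is not guaranteed to be optimal, which is what the path ratio measures.

```python
import heapq
from itertools import count

def best_first_search(adj, s, t, distance):
    """Greedy search toward t; returns (found path length, nodes explored)."""
    tie = count()                        # tie-breaker so the heap never compares nodes
    parent = {s: None}
    heap = [(distance(s, t), next(tie), s)]
    explored = 0
    while heap:
        _, _, u = heapq.heappop(heap)
        explored += 1
        if u == t:
            length = 0
            while parent[u] is not None:  # walk back to s, counting edges
                u = parent[u]
                length += 1
            return length, explored
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                heapq.heappush(heap, (distance(v, t), next(tie), v))
    return None, explored
```

On an 8-node ring with the true ring distance as the guide, searching from node 0 to node 3 explores only the four nodes on the path.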

Search Performance
- NSIs evaluated on synthetic graphs
  - Random
  - Rewired lattices
  - Forest Fire

Search Performance


Constant Time Distance Estimation
- Can sometimes use an NSI to directly estimate the graph distance between any two nodes
- Can use the DTZ annotation distance to estimate actual graph distances
  - Annotate the graph as described for the DTZ NSI
  - Randomly sample p pairs of nodes in the graph and perform breadth-first search to obtain their exact graph distance
  - Use linear regression to obtain an equation for estimated distance
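The calibration procedure can be sketched as an ordinary least-squares fit of exact distance against annotation distance. This is a minimal illustration: `annotation_distance` is a hypothetical callable standing in for the DTZ distance function, the single-predictor form `slope * x + intercept` is an assumption, and sampled pairs with no connecting path are skipped.

```python
import random
from collections import deque

def bfs_distance(adj, s, t):
    """Exact hop distance from s to t, or None if t is unreachable."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def fit_distance_estimator(adj, annotation_distance, p, seed=0):
    """Sample p pairs, then fit exact = slope * annotation + intercept."""
    rng = random.Random(seed)
    nodes = list(adj)
    xs, ys = [], []
    for _ in range(p):
        s, t = rng.sample(nodes, 2)
        exact = bfs_distance(adj, s, t)
        if exact is not None:
            xs.append(annotation_distance(s, t))
            ys.append(exact)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    # The returned estimator needs no search: one annotation lookup per query.
    return lambda s, t: slope * annotation_distance(s, t) + intercept
```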

Constant Time Distance Estimation


Constant Time Distance Estimation
- Simple distance can be used to produce a wide variety of attributes on nodes, which can be used by data mining algorithms that analyze graphs
  - Label nodes with their distance to a particular node in a graph
    - How close is each actor to Kevin Bacon?
  - Label nodes with the minimum or maximum distance to one of a set of designated nodes
    - How close is each actor to an Academy Award winner?

Closeness Centrality
- Measures the proximity of a given node in a network to every other node
- Important to social network dynamics
- Accurate estimates of closeness centrality are often impossible to calculate for large data sets
- Using an NSI for path finding can estimate closeness centrality efficiently
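One way to turn path finding into a closeness estimate is to average path lengths over a random sample of target nodes. The sketch below assumes closeness is the average distance defined earlier in the talk; `path_length` is a hypothetical callable that would wrap NSI-guided search, and the sampling scheme is an assumption rather than the presenters' procedure.

```python
import random

def estimate_closeness(nodes, node, path_length, sample_size, seed=0):
    """Average path length from node to a random sample of other nodes."""
    rng = random.Random(seed)
    others = [v for v in nodes if v != node]
    sample = rng.sample(others, min(sample_size, len(others)))
    lengths = [path_length(node, v) for v in sample]
    return sum(lengths) / len(lengths)
```

Lower values mean the node is closer, on average, to the rest of the network.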

Closeness Centrality

Closeness Centrality
- A measure of centrality can be used to produce attributes on nodes that may be useful to knowledge discovery algorithms
  - Determine the closeness of every node to a collection of key nodes
    - Closeness to all winners of Academy Awards for best actor in the past 10 years
  - Constrain closeness calculations for members of clusters
    - Closeness rank of an actor within their movie industry
  - Weight closeness based on the attributes of the outlying nodes
    - Closeness to winners of Academy Awards weighted by how recently the award was won

Betweenness Centrality
- Measures the number of short paths on which a given node lies
- Important to social network dynamics
- Accurate estimates of betweenness centrality are often impossible to calculate for large data sets

Betweenness Centrality
- Can estimate betweenness using the paths identified through NSI navigation
  - Randomly sample pairs of nodes and discover the shortest path between them
  - Count the number of times each node in the graph appears on one of these paths to obtain a betweenness ranking
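The sampling procedure can be sketched as follows. Plain breadth-first search stands in for NSI navigation here (an assumption, so the example stays self-contained), and only interior nodes of each path are counted, which is a choice made in this sketch. The function names are hypothetical.

```python
import random
from collections import Counter, deque

def bfs_path(adj, s, t):
    """One shortest path from s to t as a list of nodes, or None."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

def estimate_betweenness_ranking(adj, num_pairs, find_path=bfs_path, seed=0):
    """Rank nodes by how often they lie on paths between sampled pairs."""
    rng = random.Random(seed)
    nodes = list(adj)
    counts = Counter({u: 0 for u in nodes})
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)
        path = find_path(adj, s, t)
        if path:
            counts.update(path[1:-1])  # interior nodes only (a choice made here)
    return [u for u, _ in counts.most_common()]
```

Passing an NSI-guided search as `find_path` would replace the exact search with the approximate one described in the talk.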

Betweenness Centrality

Betweenness Centrality
- A high betweenness score can indicate a bridge between two communities
  - An actor that has played in movies belonging to different movie industries
- Betweenness centrality can be used to create features on nodes that are useful for data mining
  - Calculate betweenness centrality for particular groups of nodes
    - Actors that sit between winners of Academy Awards for best picture and the IMDb's "Bottom 100", the worst 100 movies as voted by users of the Internet Movie Database

Conclusions
- The Zone and DTZ NSIs allow efficient and accurate estimation of path lengths between arbitrary nodes in a network
- Efficient calculations of network statistics allow a broader range of potential approaches to knowledge discovery
- The space of potential NSIs has not been exhaustively researched
- NSIs could have other applications
  - Finding connection subgraphs
  - Approximating neighborhood functions

Questions?
