Measuring and Extracting Proximity in Networks
Yehuda Koren, Stephen North and Chris Volinsky
KDD 2006, Philadelphia
Outline
• What is proximity and why do we care?
• What are the qualities of a good proximity measure?
• A series of proposals
• Our proposal: Cycle-Free Effective Conductance
• Extraction of proximity graphs
• Applying CFEC to large graphs
• Applications: Call detail, IMDB, DBLP
• Summary and Extensions
http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl
What is Proximity?
• What is the distance between two nodes in a social network?
• proximity [prox·im·i·ty || prɑk'sɪmətɪ /prɒ-]n. adjacency, nearness, closeness, vicinity
What is proximity good for?
• Missing data
• Link prediction
• Indirect relations
• Information sharing
• Viral marketing
• Identifying clusters
Our Goals
• Measure and visualize proximity between nodes.
• Measurement should have the following qualities:
– “Close” nodes are intuitive
  • Short graph distance
  • Multiple paths
  • High weights on edges
  • Low-degree nodes in the paths
– Monotonicity
– Generalizes to n > 2 endpoints.
Our goals
• Explain proximity by extracting proximity subgraphs that are readily visualized and contain a large percentage of overall proximity.
• Idea comes from “connection subgraphs” (Faloutsos, McCurley and Tomkins 2004): the small subgraph that best captures the connections between two nodes of the graph
[Figure: two example graphs, labeled Prox = .0053 and Prox = .0048]
Large social networks
• Proximity is relevant in all social networks; listed below are a few we have played with.
• For now, we consider these as undirected graphs (stay tuned).

data source    |V|     |E|
co-authors     438K    31M
actor-actor    896K    1.1M
phone calls    300M    1000M
IM             200M    800M
Measuring proximity
• Many proposals in the literature (n.b. Liben-Nowell and Kleinberg 2003)
• Graph distance: shortest path
– Doesn’t account for multiple paths or high-degree nodes
• Maximum network flow
– Disregards path length and high-degree nodes; depends on bottlenecks
• Electrical networks, or “effective conductance” (e.g. Doyle and Snell 1984)
– High-degree nodes still a problem
When is the electric current analogy misleading?
[Figure: two networks, one labeled “Noise?” and the other “Significant connection”]
• Same current flow in both cases!
• Degree-1 nodes are neutral (attract no flow)
Sink-augmented effective conductance [Faloutsos, McCurley & Tomkins, KDD 2004]
• Connect all nodes to a grounded universal sink (with 0V)
• Tax each node: deliver a portion of the flow to the sink
• No nodes of degree 1 (above problem solved)
• Penalizes long paths
• How do we set the taxing system?
• Doesn’t generalize to n > 2
• No monotonicity…
Universal sink and (non-)monotonicity
With universal sink – no monotonicity:
• For larger networks, proximity tends to zero creating a “size bias”.
• Adding s—t paths can either increase or decrease proximity!
[Figure: proximity (y-axis) plotted against network size (x-axis)]
Electrical networks = random walks
• Current-flow notions have direct random walk interpretation
• Take a random walk starting at s, following edges of the graph with probability proportional to their weight (conductance).
• Let Deg(s), the degree of s, be the number of random walks originating at s. Then:
– The escape probability, EP(s→t), is the probability that a walk originating at s will reach t before visiting s again, and
– The effective conductance between s and t is EC(s,t) = EP(s→t) * Deg(s)
With the random walk perspective, you can see that degree-1 nodes have no influence: a walk that enters one can only step straight back.
By discouraging “backtracking”, we can properly account for high-degree nodes.
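The random-walk reading of effective conductance can be checked numerically on a toy graph. The following is a minimal sketch, not the authors' code: the function names, the dict-of-dicts graph layout, and the two example graphs are assumptions made for illustration. It estimates the escape probability by iteratively relaxing the "reach t before s" probabilities and shows that attaching a degree-1 node leaves the effective conductance unchanged.

```python
def escape_probability(graph, s, t, tol=1e-12, max_iter=100_000):
    """Probability that a walk leaving s reaches t before returning to s.

    graph: dict node -> dict neighbor -> edge weight (undirected, symmetric).
    """
    deg = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    # h[v] = probability that a walk currently at v hits t before s.
    h = {v: 0.0 for v in graph}
    h[t] = 1.0
    for _ in range(max_iter):
        delta = 0.0
        for v in graph:
            if v == s or v == t:
                continue  # boundary values stay fixed at 0 and 1
            new = sum(w * h[u] for u, w in graph[v].items()) / deg[v]
            delta = max(delta, abs(new - h[v]))
            h[v] = new
        if delta < tol:
            break
    # First step out of s, then "hit t before s" from wherever we land.
    return sum(w * h[u] for u, w in graph[s].items()) / deg[s]


def effective_conductance(graph, s, t):
    """EC(s,t) = EP(s->t) * Deg(s)."""
    return sum(graph[s].values()) * escape_probability(graph, s, t)


# Hypothetical toy graphs: a chain s - a - t, and the same chain with a
# dead-end node b hanging off a.
chain = {"s": {"a": 1.0}, "a": {"s": 1.0, "t": 1.0}, "t": {"a": 1.0}}
with_leaf = {
    "s": {"a": 1.0},
    "a": {"s": 1.0, "t": 1.0, "b": 1.0},
    "b": {"a": 1.0},
    "t": {"a": 1.0},
}
ec_chain = effective_conductance(chain, "s", "t")
ec_leaf = effective_conductance(with_leaf, "s", "t")
```

Both values come out at 0.5 (two unit conductances in series), which is the "degree-1 nodes are neutral" behaviour described above.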
Our proximity: cycle free effective conductance
• The cycle-free escape probability, CFEP(s→t), is the probability that a random walk originating at s will reach t without visiting any node more than once
• Multiplying by the degree of the source gives an absolute quantity (accounting for the number of "actually initiated" walks):
• The cycle-free effective conductance between s and t: CFEC(s,t) = CFEP(s→t) * Deg(s)
[Figure: two example graphs, one with higher red→green c.f. escape probability and one with lower red→green c.f. escape probability]
Properties of CFEC as a proximity measure:
• Accounts for multiple paths
• Favors short paths
• Penalizes high-degree nodes
• Penalizes dead-end paths
• Parameter free
• Has the “right” monotonicity
• Accommodates edge directions
• Has a natural extension to multiple endpoints
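For very small graphs, the cycle-free escape probability can be computed by brute force, which makes the "penalizes dead-end paths" property concrete. This is an illustrative sketch only: the function names, graph layout, and toy graphs are assumptions, and exhaustive enumeration of simple paths is exponential, so it is not how large graphs would be handled.

```python
def cycle_free_escape_probability(graph, s, t):
    """Sum of probabilities of all simple (cycle-free) walks from s to t.

    graph: dict node -> dict neighbor -> edge weight (undirected, symmetric).
    Each step from v to u has probability w(v,u) / deg(v).
    """
    deg = {v: sum(nbrs.values()) for v, nbrs in graph.items()}

    def walk(v, visited):
        if v == t:
            return 1.0
        total = 0.0
        for u, w in graph[v].items():
            if u not in visited:  # never revisit a node
                total += (w / deg[v]) * walk(u, visited | {u})
        return total

    return walk(s, {s})


def cfec(graph, s, t):
    """CFEC(s,t) = CFEP(s->t) * Deg(s)."""
    return sum(graph[s].values()) * cycle_free_escape_probability(graph, s, t)


# Hypothetical toy graphs: a chain s - a - t, and the same chain with a
# dead-end node b attached to a.
chain = {"s": {"a": 1.0}, "a": {"s": 1.0, "t": 1.0}, "t": {"a": 1.0}}
with_leaf = {
    "s": {"a": 1.0},
    "a": {"s": 1.0, "t": 1.0, "b": 1.0},
    "b": {"a": 1.0},
    "t": {"a": 1.0},
}
cfec_chain = cfec(chain, "s", "t")     # 1 * (1/2)
cfec_leaf = cfec(with_leaf, "s", "t")  # 1 * (1/3): the dead end absorbs walks
```

On the chain the value is 1/2; adding the dead-end node lowers it to 1/3, whereas plain effective conductance would not change.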
Computing CFEC
• Unlike previous measures, exact computation is impractical
• Practically, we can estimate it extremely well
• Probability of paths declines exponentially (e.g., the 100th path is 10^6 times less probable than the first one)
• Estimate using the most probable paths:

$$P_{\text{c.f.esc}}(s \to t) = \sum_{p \,\in\, \text{simple paths } [s \to t]} \operatorname{prob}(p)$$

$$P_{\text{c.f.esc}}(s \to t) \approx \sum_{p \,\in\, \text{highly probable simple paths } [s \to t]} \operatorname{prob}(p)$$
Finding k most probable paths
• Finding k shortest simple paths takes O(k|E|log|E|) time [Katoh, Ibaraki and Mine, 1982]
• For an edge u-v of weight w(u,v), define its length:

$$l(u,v) = -\log \frac{w(u,v)}{\sqrt{\deg(u)\,\deg(v)}}$$

• Edge lengths are positive
• For a path p from s to t with total length l(p): exp(-l(p)) = C * prob(p), where the constant C is the same for every path from s to t
• Short path = highly probable path
• Stop path computation when the probability drops below 10^-6 of that of the first path
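The link between edge lengths and path probabilities can be sketched in a few lines. The code below is an assumption-laden illustration: it uses the edge length -log(w(u,v) / sqrt(deg(u) deg(v))), a simple best-first enumeration of simple paths rather than the Katoh, Ibaraki and Mine algorithm, and hypothetical names (`most_probable_paths`, `rel_cutoff`, `max_paths`) plus a made-up five-node graph.

```python
import heapq
import math


def edge_length(graph, deg, u, v):
    """Length of edge u-v: -log( w(u,v) / sqrt(deg(u) * deg(v)) )."""
    return -math.log(graph[u][v] / math.sqrt(deg[u] * deg[v]))


def most_probable_paths(graph, s, t, rel_cutoff=1e-6, max_paths=100):
    """Return simple s-t paths as (path, prob), most probable first.

    Best-first search over partial simple paths: because edge lengths are
    non-negative, complete paths are popped in order of increasing length,
    i.e. decreasing probability. Worst-case cost is exponential.
    """
    deg = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    c = math.sqrt(deg[s] / deg[t])  # exp(-length(p)) = c * prob(p)
    heap = [(0.0, (s,))]
    found = []
    while heap and len(found) < max_paths:
        length, path = heapq.heappop(heap)
        last = path[-1]
        if last == t:
            prob = math.exp(-length) / c
            if found and prob < rel_cutoff * found[0][1]:
                break  # every remaining path is even less probable
            found.append((path, prob))
            continue
        for nxt in graph[last]:
            if nxt not in path:  # keep the path simple
                step = edge_length(graph, deg, last, nxt)
                heapq.heappush(heap, (length + step, path + (nxt,)))
    return found


# Hypothetical unit-weight graph with four simple s-t paths.
G = {
    "s": {"a": 1.0, "b": 1.0},
    "a": {"s": 1.0, "t": 1.0, "b": 1.0},
    "b": {"s": 1.0, "a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "t": 1.0},
    "t": {"a": 1.0, "c": 1.0},
}
paths = most_probable_paths(G, "s", "t")
cfep_estimate = sum(prob for _, prob in paths)
cfec_estimate = sum(G["s"].values()) * cfep_estimate
```

On this graph the search returns s-a-t first (probability 1/6), and summing all four path probabilities gives a cycle-free escape probability of 1/3. With a tighter `max_paths` the same sum becomes the truncated estimate over the most probable paths.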
Extracting proximity graphs
Recall FMT’04 “connection subgraphs”, the small subgraph that best captures the connections between two nodes of the graph
Extracting proximity graphs
• Achieve an efficient balance between “size” and “proximity” by maximizing the ratio:

$$\frac{\mathrm{CFEC}(s \to t)^{\alpha}}{|\text{subgraph}|}$$

• Larger α emphasizes proximity → larger subgraph
– α=0 returns the shortest path
– α=∞ returns all paths
Extracting proximity graphs
• We already have the collection R_k of shortest paths {P1,P2,…,Pk}
• Find the subset of the paths that maximizes

$$\frac{\mathrm{CFEC}(s \to t)^{\alpha}}{|\text{subgraph}|}$$
… and combine the selected paths into a “proximity graph”
• This is an NP-hard problem, but recall that we have a list of paths sorted by probability
• Use a branch and bound path merging algorithm
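The trade-off controlled by α can be illustrated with an exhaustive search over a handful of paths. This is a deliberately simplified sketch under several assumptions, not the branch and bound algorithm mentioned above: the objective is taken to be CFEC^α divided by the number of distinct nodes, the CFEC of a merged subgraph is approximated by Deg(s) times the sum of the selected path probabilities (ignoring any extra simple paths created where merged paths cross), and the function name and example paths are hypothetical.

```python
from itertools import combinations


def best_path_subset(paths, deg_s, alpha):
    """Pick the subset of paths maximizing CFEC_estimate**alpha / size.

    paths: list of (tuple_of_nodes, probability).
    deg_s: weighted degree of the source node.
    Exhaustive search over all non-empty subsets: toy sizes only.
    """
    best_score, best_subset = -1.0, None
    for r in range(1, len(paths) + 1):
        for subset in combinations(paths, r):
            nodes = set()
            for nodes_on_path, _ in subset:
                nodes.update(nodes_on_path)
            # Simplifying assumption: proximity of the merged subgraph is
            # the sum of the selected paths' probabilities times Deg(s).
            cfec_estimate = deg_s * sum(prob for _, prob in subset)
            score = cfec_estimate ** alpha / len(nodes)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score


# Hypothetical input: four s-t paths sorted by probability, Deg(s) = 2.
paths = [
    (("s", "a", "t"), 1 / 6),
    (("s", "b", "c", "t"), 1 / 12),
    (("s", "b", "a", "t"), 1 / 18),
    (("s", "a", "b", "c", "t"), 1 / 36),
]
small, _ = best_path_subset(paths, deg_s=2, alpha=0)
large, _ = best_path_subset(paths, deg_s=2, alpha=50)
```

With α = 0 only the size matters and the single shortest path is selected; with a large α the proximity term dominates and all four paths are kept, matching the two limiting cases listed for α.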
Working with large graphs
• Dealing with the full graph is sometimes infeasible and usually unnecessary
• Prior to running the algorithm, we construct a candidate graph in main memory (also FMT ’04).
[Figure: full network (N ~ 350M) → candidate graph (N ~ 10,000) → proximity graph (N ~ 20), with endpoints S and T]
Finding the candidate graph
[Figure sequence: starting from S and T, the candidate graph grows by adding nodes i at increasing distance, with Dist(S,i) and Dist(T,i) = 2, 3, 4, 5, …, 12; the shortest S–T path has length 10]
• Stop adding nodes when path probabilities are below ε
• Any path through an unscanned node is likely to be low probability
• Once we have this candidate graph, apply the CFEC algorithm to extract the proximity graph.
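The idea of growing a candidate graph outward from both endpoints can be sketched as a layered breadth-first expansion. This is a simplified stand-in, not the procedure from the talk: it stops on a node budget (the hypothetical `max_nodes` parameter) rather than on a path-probability threshold ε, it ignores edge weights, and the function name and example graph are assumptions.

```python
def candidate_nodes(adj, s, t, max_nodes):
    """Grow a node set in breadth-first layers around both s and t.

    adj: dict node -> iterable of neighbors.
    Expansion stops when the budget max_nodes is reached or no unscanned
    neighbors remain.
    """
    seen = {s, t}
    frontier = [s, t]
    while frontier and len(seen) < max_nodes:
        next_frontier = []
        for v in frontier:
            for u in adj[v]:
                if u not in seen and len(seen) < max_nodes:
                    seen.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return seen


# Hypothetical example: a line graph 0 - 1 - ... - 9 with endpoints 0 and 9.
line = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
cand = candidate_nodes(line, s=0, t=9, max_nodes=6)
```

Here the budget of six nodes yields {0, 1, 2, 7, 8, 9}: two layers around each endpoint. The subgraph induced on the returned nodes would then be handed to the path-finding step.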
Summary: Proximity Graphs
• We have a measure of proximity which fulfills our desired criteria
– Intuitive sense of closeness
– Generalizes to n>2
– Parameter free
• Using this measure of proximity we can efficiently extract the proximity graph.
• Let’s apply to real data
Application: call detail
• AT&T’s call detail graph is large (350M nodes, several billion edges).
• To calculate proximity, we just need an adjacency list
– Dynamic, efficient creation of adjacency lists for transaction graphs (Cortes, Pregibon, and Volinsky 2003)
• Select a random sample of 2000 residential TNs and calculate proximity between them.
– We found a path for 1808 of them
– For those with a path, we calculated proximity and rendered a proximity graph.
Building Proximity Graphs
[Figure: full network (N ~ 350M) → candidate graph (N ~ 10,000) → proximity graph (N ~ 20)]
Distribution of proximities in phone-call network
Application: call detail
• Capturing proximity in a proximity graph…
• Studying α
– Low α: smaller graphs, less proximity captured.
– α = 10 seems to give a good tradeoff
[Figure: % captured proximity and # graphs, plotted against size of graph]
Proximity as link predictor
• Calculate proximities for a sample of pairs in the network that have never communicated.
• Look in the future to see which of these communicate in the next time period t.
• Did those that eventually communicate have closer proximities?
• i.e. is proximity predictive of future communication?
Mean log proximity:
– Communicators = -2.4
– Non-comm. = -5.9
Using Visualization
• Different visualizations bring out different aspects of the proximity graph, especially for n>2.
• Using a hierarchical layout for n=2 shows different eras of movie stars
Prox webpage: http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl
Summary
• Proposed cycle free effective conductance (CFEC) with a random walk interpretation to measure “proximity” in social networks and other ad-hoc networks
• Described a way of approximating CFEC
• Described a way of visualizing CFEC as a subgraph
• Extended the method to external datasets
• Showed empirical evidence for its utility
Extensions
• Compare to other proximity measures (Katz, PageRank, and other methods compared in Liben-Nowell and Kleinberg (2003))
• Quantify proximity across different kinds of networks
• Extend c.f. effective conductance to:
– Multiple endpoints (already demonstrated)
– Directed edges (future work: use k-shortest paths in a directed graph, alg. due to Hershberger et al.)