Web Structure Mining: Link Prediction
Outline
1. Link prediction problem
2. Proximity measures
3. Prediction by supervised learning
Link prediction
‣ Link prediction. Given a snapshot of a dynamic network at time t, predict the edges added in the interval (t, t′)
‣ Link completion. Given a network, infer links that are consistent with the structure, but missing
‣ Link reliability. Estimate the reliability of given links in the network
‣ What to predict?
‣ Link existence
‣ Link weight
‣ Link type
1 - Link prediction
Link prediction
‣ Number of missing edges = |V| · (|V| − 1)/2 − |E|
‣ In sparse graphs, |E| ≪ |V|²
‣ The probability of a correct random guess is therefore O(1/|V|²)
[Figure: example graph with nodes 1–8]
Scoring algorithm
‣ Link prediction by proximity scoring:
1. For each pair of nodes, compute a proximity score c(v, v′)
2. Sort all pairs by decreasing score
3. Select the top n pairs (or those above some threshold) as new links
‣ Many metrics are summarised in: David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58, 7 (May 2007), 1019–1031.
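The three steps above can be sketched in Python. This is an illustrative pipeline, not part of the original slides: it assumes an undirected graph stored as a dict of neighbour sets and uses common neighbours as the example score c(v, v′); any proximity measure with the same signature plugs in.

```python
from itertools import combinations

def predict_links(adj, score, n):
    """Score every non-adjacent pair, sort by decreasing score, keep the top n."""
    # Candidate pairs: all node pairs not already linked.
    candidates = [(u, v) for u, v in combinations(sorted(adj), 2)
                  if v not in adj[u]]
    # Step 1: compute the proximity score c(v, v') for each pair.
    scored = [((u, v), score(adj, u, v)) for u, v in candidates]
    # Step 2: sort all pairs by decreasing score.
    scored.sort(key=lambda t: -t[1])
    # Step 3: select the top n pairs as predicted new links.
    return [pair for pair, _ in scored[:n]]

def common_neighbours(adj, u, v):
    return len(adj[u] & adj[v])

# Toy undirected graph as a dict of neighbour sets.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
print(predict_links(adj, common_neighbours, 2))   # → [(1, 4), (2, 4)]
```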
Scoring functions
‣ Based on the local neighbourhood of v_i and v_j
‣ Number of common neighbours: |N(v_i) ∩ N(v_j)|
‣ Jaccard’s coefficient: |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|
‣ Adamic / Adar: Σ_{v ∈ N(v_i) ∩ N(v_j)} 1 / log |N(v)|
2 - Proximity measures
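A minimal sketch of these three neighbourhood scores, again assuming a dict-of-sets adjacency. The degree-1 guard in Adamic/Adar is a practical assumption to avoid dividing by log(1) = 0.

```python
import math

def common_neighbours(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    # Shared neighbours weighted by rarity; skip degree-1 neighbours
    # to avoid dividing by log(1) = 0.
    return sum(1 / math.log(len(adj[w]))
               for w in adj[u] & adj[v] if len(adj[w]) > 1)

adj = {'a': {'b', 'c'}, 'b': {'a', 'c', 'd'},
       'c': {'a', 'b', 'd'}, 'd': {'b', 'c'}}
print(common_neighbours(adj, 'a', 'd'))   # → 2
print(jaccard(adj, 'a', 'd'))             # → 1.0
```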
Scoring functions
‣ Based on paths and ensembles of paths between v_i and v_j
‣ Shortest path: −min{ l : |paths_ij^(l)| > 0 }
‣ Katz score: Σ_{l=1}^{∞} β^l · |paths_ij^(l)|, with β < 1 damping longer paths
‣ Personalized (rooted) PageRank: PR = α (D⁻¹A)ᵀ PR + (1 − α) e_i, where e_i is the restart vector concentrated on v_i
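A truncated version of the Katz sum can be sketched as below. Note that powers of the adjacency matrix count walks of length l, which is how |paths_ij^(l)| is usually instantiated in practice; the beta and max_len values are illustrative choices, not from the slides.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz(A, i, j, beta=0.1, max_len=5):
    """Truncated Katz score: sum of beta**l * (number of length-l walks)."""
    score, power = 0.0, A
    for l in range(1, max_len + 1):
        score += (beta ** l) * power[i][j]   # power == A**l at this point
        power = matmul(power, A)
    return score

# Path graph 0 - 1 - 2: the only contribution up to length 2 is the
# single length-2 walk from node 0 to node 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(katz(A, 0, 2, beta=0.5, max_len=2))   # → 0.25
```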
Scoring functions
‣ Expected number of random-walk steps:
‣ Hitting time: −H_ij
‣ Commute time: −(H_ij + H_ji)
‣ Normalized hitting / commute time: −(H_ij π_j + H_ji π_i), with π_i (resp. π_j) the stationary probability of v_i (resp. v_j)
‣ SimRank: SimRank(v_i, v_j) = γ · ( Σ_{a ∈ N_i} Σ_{b ∈ N_j} SimRank(a, b) ) / ( |N_i| · |N_j| )
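SimRank is typically computed by fixed-point iteration starting from SimRank(v, v) = 1. The sketch below is illustrative and assumes undirected neighbour sets (the original formulation uses in-neighbours on directed graphs) and a decay of γ = 0.8:

```python
def simrank(adj, gamma=0.8, iters=10):
    """Fixed-point iteration of the SimRank recurrence on a small graph."""
    nodes = sorted(adj)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif adj[u] and adj[v]:
                    total = sum(sim[(a, b)] for a in adj[u] for b in adj[v])
                    new[(u, v)] = gamma * total / (len(adj[u]) * len(adj[v]))
                else:
                    new[(u, v)] = 0.0   # a node with no neighbours
        sim = new
    return sim

# Two nodes whose only neighbour is the same node end up with similarity gamma.
sim = simrank({'a': {'c'}, 'b': {'c'}, 'c': set()})
print(sim[('a', 'b')])   # → 0.8
```

This O(n² · d²) loop is only viable on small graphs, which is consistent with the scalability caveats later in the slides.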
Scoring functions
‣ Preferential attachment (two alternative versions):
‣ k_i · k_j = |N_i| · |N_j|
‣ k_i + k_j = |N_i| + |N_j|
‣ Clustering coefficient (two alternative versions):
‣ CC(v_i) · CC(v_j)
‣ CC(v_i) + CC(v_j)
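Both preferential-attachment variants and the local clustering coefficient are cheap to compute; a sketch, once more assuming a dict-of-sets adjacency (the toy graph is illustrative):

```python
from itertools import combinations

def preferential_attachment(adj, u, v, product=True):
    ku, kv = len(adj[u]), len(adj[v])
    return ku * kv if product else ku + kv

def clustering_coefficient(adj, u):
    """Fraction of pairs of neighbours of u that are themselves linked."""
    k = len(adj[u])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[u], 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

# Triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(preferential_attachment(adj, 1, 4))   # → 2
print(clustering_coefficient(adj, 1))       # → 1.0
```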
Some results
[Figures: empirical comparison of the proximity measures]
Source: The link-prediction problem for social networks
Take-away message
‣ Node-based topological similarity measures (common neighbours, Jaccard, Adamic/Adar, preferential attachment) perform best but do not scale well
‣ Path-based topological similarity measures (Katz, hitting time, rooted PageRank) are preferable when dealing with relatively big networks (>10K vertices)
Binary classification
‣ A challenging classification problem:
‣ A very large number of possible edges (quadratic in the number of nodes)
‣ Highly unbalanced class distribution
‣ Positive examples: linear growth with the number of nodes
‣ Negative examples: quadratic growth with the number of nodes
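A common way to cope with this imbalance is to keep every positive pair and subsample the negatives. The sketch below is illustrative; the `neg_ratio` parameter is an assumption, not from the slides:

```python
import random
from itertools import combinations

def sample_training_pairs(adj, neg_ratio=1, seed=0):
    """Keep every positive pair (existing edge) and subsample the negatives,
    taking neg_ratio negatives per positive to rebalance the classes."""
    rng = random.Random(seed)
    pairs = list(combinations(sorted(adj), 2))
    positives = [(u, v) for u, v in pairs if v in adj[u]]
    negatives = [(u, v) for u, v in pairs if v not in adj[u]]
    k = min(len(negatives), neg_ratio * len(positives))
    return positives, rng.sample(negatives, k)

adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
pos, neg = sample_training_pairs(adj)
print(len(pos), len(neg))   # balanced: 2 positives, 2 sampled negatives
```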
3 - Prediction by supervised learning
A very challenging problem
Source: M. Rattigan, D. Jensen. The case for anomalous link discovery. ACM SIGKDD Explorations Newsletter, vol. 7, no. 2, pp. 41–47, 2005
Link prediction by supervised learning
‣ Supervised learning process:
1. Feature generation
2. Model training
3. Testing
‣ Features:
‣ Topological proximity features
‣ Aggregated features
‣ Content based node proximity features
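Step 1 (feature generation) can combine several of the proximity measures from the previous section into one vector per node pair, which then feeds any off-the-shelf binary classifier. A sketch of the feature map; the particular four features chosen here are an illustrative selection:

```python
import math

def pair_features(adj, u, v):
    """Topological proximity feature vector for the candidate pair (u, v)."""
    cn = adj[u] & adj[v]
    union = adj[u] | adj[v]
    return [
        len(cn),                                        # common neighbours
        len(cn) / len(union) if union else 0.0,         # Jaccard
        sum(1 / math.log(len(adj[w]))                   # Adamic/Adar
            for w in cn if len(adj[w]) > 1),
        len(adj[u]) * len(adj[v]),                      # preferential attachment
    ]

adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(pair_features(adj, 'a', 'd'))
```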
Evaluation
‣ Simple "hold-out set" evaluation
‣ A more sophisticated evaluation method (e.g. cross-validation) is preferable
[Figure: the whole graph next to the training graph, from which some edges have been removed for evaluation]
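The hold-out evaluation amounts to hiding a fraction of the edges, training on the remaining graph, and checking whether the hidden edges are recovered. A sketch of the edge split; the `test_frac` value is an illustrative choice:

```python
import random

def holdout_split(edges, test_frac=0.25, seed=0):
    """Hide a fraction of the edges: the kept edges form the training graph,
    the hidden ones are the positives the predictor should recover."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    cut = int(len(edges) * (1 - test_frac))
    return edges[:cut], edges[cut:]

edges = [(i, i + 1) for i in range(8)]
train, test = holdout_split(edges)
print(len(train), len(test))   # → 6 2
```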
Evaluation metrics
‣ Precision, recall, F-measure
‣ True positive rate (TPR), false positive rate (FPR), ROC curve, AUC
Precision = TP / (TP + FP),  Recall = TP / (TP + FN)

F = 2 · Precision · Recall / (Precision + Recall)

TPR = TP / (TP + FN),  FPR = FP / (FP + TN)
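These metrics follow directly from the confusion-matrix counts; a sketch with guards against empty denominators (the example counts are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def tpr_fpr(tp, fp, fn, tn):
    tpr = tp / (tp + fn) if tp + fn else 0.0   # same as recall / sensitivity
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Illustrative counts: 6 hidden edges recovered, 2 false alarms,
# 4 hidden edges missed, 88 non-edges correctly left out.
print(precision_recall_f1(6, 2, 4))   # precision 0.75, recall 0.6
print(tpr_fpr(6, 2, 4, 88))
```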