Web Structure Mining: Link Prediction
Outline
1. Link prediction problem
2. Proximity measures
3. Prediction by supervised learning
Link prediction
‣ Link prediction. Given a snapshot of a dynamic network at time t, predict the edges added in the interval (t, t′)
‣ Link completion. Given a network, infer links that are consistent with the structure, but missing
‣ Link reliability. Estimate the reliability of given links in the network
‣ What to predict?
‣ Link existence
‣ Link weight
‣ Link type
1 - Link prediction
Link prediction
‣ Number of missing edges = |V| · (|V| − 1)/2 − |E|
‣ In sparse graphs, |E| ≪ |V|²
‣ The probability of a correct random guess is therefore O(1/|V|²)
[Figure: example graph with nodes 1–8]
Scoring algorithm
‣ Link prediction by proximity scoring:
1. For each pair of nodes, compute a proximity score c(v, v′)
2. Sort all pairs by decreasing score
3. Select the top n pairs (or those above some threshold) as new links
‣ Many metrics are summarised in: David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58, 7 (May 2007), 1019–1031.
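The three steps above can be sketched in Python. This is an illustrative pipeline, not part of the original slides: it assumes an undirected graph stored as a dict of neighbour sets and uses common neighbours as the example score c(v, v′); any proximity measure with the same signature plugs in.

```python
from itertools import combinations

def predict_links(adj, score, n):
    """Score every non-adjacent pair, sort by decreasing score, keep the top n."""
    # Candidate pairs: all node pairs not already linked.
    candidates = [(u, v) for u, v in combinations(sorted(adj), 2)
                  if v not in adj[u]]
    # Step 1: compute the proximity score c(v, v') for each pair.
    scored = [((u, v), score(adj, u, v)) for u, v in candidates]
    # Step 2: sort all pairs by decreasing score.
    scored.sort(key=lambda t: -t[1])
    # Step 3: select the top n pairs as predicted new links.
    return [pair for pair, _ in scored[:n]]

def common_neighbours(adj, u, v):
    return len(adj[u] & adj[v])

# Toy undirected graph as a dict of neighbour sets.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
print(predict_links(adj, common_neighbours, 2))   # → [(1, 4), (2, 4)]
```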
Scoring functions
‣ Based on the local neighbourhood of v_i and v_j
‣ Number of common neighbours: |N(v_i) ∩ N(v_j)|
‣ Jaccard’s coefficient: |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|
‣ Adamic / Adar: Σ_{v ∈ N(v_i) ∩ N(v_j)} 1 / log |N(v)|
2 - Proximity measures
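A minimal sketch of these three neighbourhood scores, again assuming a dict-of-sets adjacency. The degree-1 guard in Adamic/Adar is a practical assumption to avoid dividing by log(1) = 0.

```python
import math

def common_neighbours(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    # Shared neighbours weighted by rarity; skip degree-1 neighbours
    # to avoid dividing by log(1) = 0.
    return sum(1 / math.log(len(adj[w]))
               for w in adj[u] & adj[v] if len(adj[w]) > 1)

adj = {'a': {'b', 'c'}, 'b': {'a', 'c', 'd'},
       'c': {'a', 'b', 'd'}, 'd': {'b', 'c'}}
print(common_neighbours(adj, 'a', 'd'))   # → 2
print(jaccard(adj, 'a', 'd'))             # → 1.0
```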
Scoring functions
‣ Based on paths and ensembles of paths between v_i and v_j
‣ Shortest path: −min{ l : |paths_ij^(l)| > 0 }
‣ Katz score: Σ_{l=1}^{∞} β^l · |paths_ij^(l)|, with β < 1 damping longer paths
‣ Personalized (rooted) PageRank: PR = α (D⁻¹A)ᵀ PR + (1 − α) e_i, where e_i is the restart vector concentrated on v_i
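A truncated version of the Katz sum can be sketched as below. Note that powers of the adjacency matrix count walks of length l, which is how |paths_ij^(l)| is usually instantiated in practice; the beta and max_len values are illustrative choices, not from the slides.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz(A, i, j, beta=0.1, max_len=5):
    """Truncated Katz score: sum of beta**l * (number of length-l walks)."""
    score, power = 0.0, A
    for l in range(1, max_len + 1):
        score += (beta ** l) * power[i][j]   # power == A**l at this point
        power = matmul(power, A)
    return score

# Path graph 0 - 1 - 2: the only contribution up to length 2 is the
# single length-2 walk from node 0 to node 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(katz(A, 0, 2, beta=0.5, max_len=2))   # → 0.25
```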
Scoring functions
‣ Expected number of random-walk steps:
‣ Hitting time: −H_ij
‣ Commute time: −(H_ij + H_ji)
‣ Normalized hitting / commute time: −(H_ij π_j + H_ji π_i), with π_i (resp. π_j) the stationary probability of v_i (resp. v_j)
‣ SimRank: SimRank(v_i, v_j) = γ · ( Σ_{a ∈ N_i} Σ_{b ∈ N_j} SimRank(a, b) ) / ( |N_i| · |N_j| )
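SimRank is typically computed by fixed-point iteration starting from SimRank(v, v) = 1. The sketch below is illustrative and assumes undirected neighbour sets (the original formulation uses in-neighbours on directed graphs) and a decay of γ = 0.8:

```python
def simrank(adj, gamma=0.8, iters=10):
    """Fixed-point iteration of the SimRank recurrence on a small graph."""
    nodes = sorted(adj)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif adj[u] and adj[v]:
                    total = sum(sim[(a, b)] for a in adj[u] for b in adj[v])
                    new[(u, v)] = gamma * total / (len(adj[u]) * len(adj[v]))
                else:
                    new[(u, v)] = 0.0   # a node with no neighbours
        sim = new
    return sim

# Two nodes whose only neighbour is the same node end up with similarity gamma.
sim = simrank({'a': {'c'}, 'b': {'c'}, 'c': set()})
print(sim[('a', 'b')])   # → 0.8
```

This O(n² · d²) loop is only viable on small graphs, which is consistent with the scalability caveats later in the slides.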
Scoring functions
‣ Preferential attachment (two alternative versions):
‣ k_i · k_j = |N_i| · |N_j|
‣ k_i + k_j = |N_i| + |N_j|
‣ Clustering coefficient (two alternative versions):
‣ CC(v_i) · CC(v_j)
‣ CC(v_i) + CC(v_j)
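Both preferential-attachment variants and the local clustering coefficient are cheap to compute; a sketch, once more assuming a dict-of-sets adjacency (the toy graph is illustrative):

```python
from itertools import combinations

def preferential_attachment(adj, u, v, product=True):
    ku, kv = len(adj[u]), len(adj[v])
    return ku * kv if product else ku + kv

def clustering_coefficient(adj, u):
    """Fraction of pairs of neighbours of u that are themselves linked."""
    k = len(adj[u])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[u], 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

# Triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(preferential_attachment(adj, 1, 4))   # → 2
print(clustering_coefficient(adj, 1))       # → 1.0
```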
Some results
[Figures: empirical comparison of the proximity measures]
Source: The link-prediction problem for social networks
Take-away message
‣ Node-based topological similarity measures (common neighbours, Jaccard, Adamic/Adar, preferential attachment) perform best but do not scale well
‣ Path-based topological similarity measures (Katz, hitting time, rooted PageRank) are preferable when dealing with relatively big networks (>10K vertices)
Binary classification
‣ A challenging classification problem:
‣ A very large number of possible edges (quadratic in the number of nodes)
‣ Highly unbalanced class distribution
‣ Positive examples: linear growth with the number of nodes
‣ Negative examples: quadratic growth with the number of nodes
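A common way to cope with this imbalance is to keep every positive pair and subsample the negatives. The sketch below is illustrative; the `neg_ratio` parameter is an assumption, not from the slides:

```python
import random
from itertools import combinations

def sample_training_pairs(adj, neg_ratio=1, seed=0):
    """Keep every positive pair (existing edge) and subsample the negatives,
    taking neg_ratio negatives per positive to rebalance the classes."""
    rng = random.Random(seed)
    pairs = list(combinations(sorted(adj), 2))
    positives = [(u, v) for u, v in pairs if v in adj[u]]
    negatives = [(u, v) for u, v in pairs if v not in adj[u]]
    k = min(len(negatives), neg_ratio * len(positives))
    return positives, rng.sample(negatives, k)

adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
pos, neg = sample_training_pairs(adj)
print(len(pos), len(neg))   # balanced: 2 positives, 2 sampled negatives
```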
3 - Prediction by supervised learning
A very challenging problem
Source: M. Rattigan, D. Jensen. The case for anomalous link discovery. ACM SIGKDD Explorations Newsletter, vol. 7, no. 2, pp. 41–47, 2005
Link prediction by supervised learning
‣ Supervised learning process:
1. Feature generation
2. Model training
3. Testing
‣ Features:
‣ Topological proximity features
‣ Aggregated features
‣ Content based node proximity features
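Step 1 (feature generation) can combine several of the proximity measures from the previous section into one vector per node pair, which then feeds any off-the-shelf binary classifier. A sketch of the feature map; the particular four features chosen here are an illustrative selection:

```python
import math

def pair_features(adj, u, v):
    """Topological proximity feature vector for the candidate pair (u, v)."""
    cn = adj[u] & adj[v]
    union = adj[u] | adj[v]
    return [
        len(cn),                                        # common neighbours
        len(cn) / len(union) if union else 0.0,         # Jaccard
        sum(1 / math.log(len(adj[w]))                   # Adamic/Adar
            for w in cn if len(adj[w]) > 1),
        len(adj[u]) * len(adj[v]),                      # preferential attachment
    ]

adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(pair_features(adj, 'a', 'd'))
```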
Evaluation
‣ Simple "hold-out set" evaluation
‣ A more sophisticated evaluation method (e.g. cross-validation) is preferable
[Figure: the whole graph next to the training graph, from which some edges have been removed for evaluation]
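The hold-out evaluation amounts to hiding a fraction of the edges, training on the remaining graph, and checking whether the hidden edges are recovered. A sketch of the edge split; the `test_frac` value is an illustrative choice:

```python
import random

def holdout_split(edges, test_frac=0.25, seed=0):
    """Hide a fraction of the edges: the kept edges form the training graph,
    the hidden ones are the positives the predictor should recover."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    cut = int(len(edges) * (1 - test_frac))
    return edges[:cut], edges[cut:]

edges = [(i, i + 1) for i in range(8)]
train, test = holdout_split(edges)
print(len(train), len(test))   # → 6 2
```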
Evaluation metrics
‣ Precision, recall, F-measure
‣ True positive rate (TPR), false positive rate (FPR), ROC curve, AUC
Precision = TP / (TP + FP),  Recall = TP / (TP + FN)

F = 2 · Precision · Recall / (Precision + Recall)

TPR = TP / (TP + FN),  FPR = FP / (FP + TN)
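These metrics follow directly from the confusion-matrix counts; a sketch with guards against empty denominators (the example counts are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def tpr_fpr(tp, fp, fn, tn):
    tpr = tp / (tp + fn) if tp + fn else 0.0   # same as recall / sensitivity
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Illustrative counts: 6 hidden edges recovered, 2 false alarms,
# 4 hidden edges missed, 88 non-edges correctly left out.
print(precision_recall_f1(6, 2, 4))   # precision 0.75, recall 0.6
print(tpr_fpr(6, 2, 4, 88))
```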