18
Web Structure Mining Link Prediction

Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Web Structure MiningLink Prediction

Page 2: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Outline1. Link prediction problem

2. Proximity measures

3. Prediction by supervised learning

2

Page 3: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Link prediction‣ Link prediction. Given a snapshot of a dynamic network at

time t, predict edges added in the interval (t,t’)

‣ Link completion. Given a network, infer links that are consistent with the structure, but missing

‣ Link reliability. Estimate the reliability of given links in the network

‣ What to predict?‣ Link existence

‣ Link weight

‣ Link type

3

1 - Link prediction

Page 4: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Link prediction

‣ Number of missing edges = |V| (|V| - 1)/2 - |E|

‣ In sparse graphs, |E| << |V|2

‣ Probability of correct random guess O(1/|V|2)

4

1 - Link prediction

8

1

46

3

7 5

2

Page 5: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Scoring algorithm

‣ Link prediction by proximity scoring 1. For each pair of nodes compute proximity score c(v,v’)

2. Sort all pairs by the decreasing score

3. Select top n pairs (or above some threshold) as new links

‣ Many metrics have been summarised in : David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58, 7 (May 2007), 1019-1031.

5

1 - Link prediction

Page 6: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Scoring functions‣ Based on the local neighbourhood of vi and vj

‣ Number of common neighbours

‣ Jaccard’s coefficient

‣ Adamic / Adar

6

2 - Proximity measures

|N (vi) \N (vj)|

|N (vi) \N (vj)||N (vi) [N (vj)|

X

v 2N (vi)\N (vi)

1

log|N (v)|

Page 7: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Scoring functions‣ Based on paths and ensemble of paths between vi and vj

‣ Shortest path

‣ Katz score

‣ Personalized (rooted) PageRank

7

2 - Proximity measures

�min{pathij > 0}

1X

l=1

�(l)|paths(l)ij |

PR = ↵(D�1A)TPR+ (1� ↵)

Page 8: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Scoring functions‣ Expected number of random walk steps:

‣ Hitting time:

‣ Commute time:

‣ Normalized hitting / commute time:

‣ SimRank:

8

2 - Proximity measures

�Hij

�(Hij +Hji)

�(Hij⇡j +Hji⇡i)with ⇡i (resp. ⇡j) be the stationary probability of vi (resp. vj)

SimRank(vi, vj) = �.

Pa2Ni

Pb2Nj

SimRank(a, b)

|Ni|.|Nj |

Page 9: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Scoring functions‣ Preferential attachment (2 alternative versions)

‣ Clustering coefficient ‣

9

2 - Proximity measures

ki . kj = |Ni| . |Nj |

ki + kj = |Ni| + |Nj |

CC(vi) . CC(vj)

CC(vi) + CC(vj)

Page 10: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Some results

10

2 - Proximity measures

Source: The link-prediction problem for social networks

Page 11: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Some results

11

2 - Proximity measures

Source: The link-prediction problem for social networks

Page 12: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Some results

2 - Proximity measures

12 Source: The link-prediction problem for social networks

Page 13: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Take away message

‣ Node-based topological similarity measures (common neighbours, Jaccard, Adamic/Adar, preferential attachment) perform the best but does not scale well

‣ Path-based topological similarity measure (Katz, Hitting time, rooted PageRank) have to be preferred when dealing with relatively big networks (>10K vertices)

2 - Proximity measures

13

Page 14: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Binary classification‣ A challenging classification problem:

‣ A very large number of possible edges (quadratic in number of nodes)

‣ Highly unbalanced class distribution

‣ Positive examples : linear growth with number of nodes

‣ Negative example : quadratic growth with number of nodes

3 - Prediction by supervised learning

14

Page 15: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

A very challenging problem

3 - Prediction by supervised learning

15

Source: M. Rattigan, D. Jensen. The case for anomalous link discovery. ACM SIGKDD Explorations Newsletter. v 7, n 2, pp 41-47, 2005

Page 16: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Link prediction by supervised learning

‣ Supervised learning process 1. Feature generation

2. Model training

3. Testing

‣ Features ‣ Topological proximity features

‣ Aggregated features

‣ Content based node proximity features

3 - Prediction by supervised learning

16

Page 17: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Evaluation

3 - Prediction by supervised learning

17

‣ Simple « hold out set » evaluation

‣ More sophisticated evaluation method is preferable (cross-validation)

8

1

46

3

7 5

2

8

1

46

3

7 5

2

Whole graph Training graph

Page 18: Web Structure Mining - IRITYoann.Pitarch/Docs/M2Stats/... · Evaluation metrics 3 - Prediction by supervised learning 18 ‣ Precision, recall, F-measure ‣ True rate positive (TPR),

Evaluation metrics

3 - Prediction by supervised learning

18

‣ Precision, recall, F-measure

‣ True rate positive (TPR), False positive rate (FPR), ROC curve, AUC

Precision =TP

TP + FP

, Recall =TP

TP + FN

F =2 . P recision .Recall

Precision+Recall

TPR =TP

TP + FN, FPR =

FP

FP + TN