Link Reconstruction from Partial Information Gong Xiaofeng, Li Kun & C. H. Lai TSL@NUS


General situations where problems may arise

Observed network (adjacency matrix A, of size N x N, filled with 0s and 1s). Scenarios:

A) No side information: statistical analysis, clustering, modeling, process, etc.
B) Some links are uncertain (positions known): link reconstruction problem, based on a model or a similarity measure.
C) Some 1s are set to 0s (positions unknown): variant problem of link reconstruction, possibly related to link prediction.
D) Network is subject to change: one kind of prediction problem (link prediction), node prediction, network evolution, etc.

B.1 Problem of network reconstruction

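Figure: example network with nodes 1 to 5 and its adjacency matrix (rows and columns labelled 5, 4, 3, 2, 1 in the slide); "?" marks an unknown link, drawn as a dashed arrow.

    0 1 0 ? 0
    1 0 ? 0 1
    0 ? 0 1 1
    ? 0 1 0 ?
    0 1 1 ? 0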

Guess the values (0 or 1) of the dashed arrows.

There are some unknown links, which may be corrupted, missing, or impossible to measure at the time.

Presumptions:
o Network has structure.
o Unknown links are fairly sampled.
o Number of unknown links is small.

B.2 Procedures of reconstruction of links

Available information
-> fitted probabilistic model P (N x N)
-> connection probability p(i,j) of each unknown link (i,j)
-> determine a threshold of connection probability pt
-> set (i,j) to 1 if p(i,j) > pt, and to 0 otherwise

Flow diagram: observed network -> model function (parameters, optimization) -> connection probability -> threshold -> reconstruction or prediction. The stages are grouped as modeling and prediction.
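The thresholding step of the procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `reconstruct`, the use of `np.nan` to mark unknown links, and the probability values in `P` are all assumptions made up for the example.

```python
import numpy as np

def reconstruct(A_obs, P, p_t):
    """Fill unknown entries (np.nan) of A_obs with 1 if P > p_t, else 0."""
    A_rec = A_obs.copy()
    unknown = np.isnan(A_obs)                      # positions of the "?" entries
    A_rec[unknown] = (P[unknown] > p_t).astype(float)
    return A_rec

# 5-node example matrix from B.1; np.nan stands for an unknown link "?"
nan = np.nan
A_obs = np.array([
    [0,   1,   0,   nan, 0],
    [1,   0,   nan, 0,   1],
    [0,   nan, 0,   1,   1],
    [nan, 0,   1,   0,   nan],
    [0,   1,   1,   nan, 0],
])

# Hypothetical fitted connection probabilities (symmetric, invented for illustration)
P = np.full((5, 5), 0.1)
P[0, 3] = P[3, 0] = 0.8
P[1, 2] = P[2, 1] = 0.3
P[3, 4] = P[4, 3] = 0.6

A_rec = reconstruct(A_obs, P, p_t=0.5)
```

Known entries are left untouched; only the unknown positions are decided by the threshold.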

B.3 Reformulated signal detection problem

Observed network -> 3 types of signals: 0, 1 and ?.
Fitted model -> connection probabilities, P0 and P1.
Signals (P?) to be classified -> ?

Problem: given the connection probability P?, determine the type of signal (0 or 1).

Assumption under a certain model: unknown links do not significantly influence the reliability of the fitted model (P0 and P1), i.e., the connection probability P? of any unknown link can be regarded as being sampled from P0 or P1.

Searching for an optimal detection scheme, e.g., the Neyman-Pearson criterion.

Observation (data): connection probability (p)
Hypotheses: H0: 0-link and H1: 1-link
Data space E: R0 and R1, acceptance regions
Decision D: D0 (accept H0) and D1 (accept H1)

B.4 An equivalent hypothesis testing problem

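Decision rule:

$$D = \begin{cases} D_1, & p \in R_1 \\ D_0, & p \in R_0 \end{cases} \qquad R_0 \cup R_1 = E, \quad R_0 \cap R_1 = \emptyset$$

False-alarm, miss and detection probabilities:

$$P_f = P(D_1 \mid H_0), \qquad P_m = P(D_0 \mid H_1), \qquad P_d = P(D_1 \mid H_1) = 1 - P_m$$

Neyman-Pearson criterion:

$$\min_{D(y)} P(D_0 \mid H_1) \quad \text{for a fixed false-alarm level } P(D_1 \mid H_0) = P_f$$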
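A minimal sketch of a Neyman-Pearson style choice of threshold, assuming the decision is restricted to threshold tests on p (accept H1 when p > pt) and that samples of the connection probability under H0 are available. The function name `np_threshold`, the level `alpha`, and the sample values are hypothetical.

```python
import numpy as np

def np_threshold(p0, alpha):
    """Smallest observed threshold whose empirical false-alarm rate is <= alpha.

    Among threshold tests "accept H1 if p > pt", the smallest admissible pt
    keeps the miss probability as low as the constraint allows.
    """
    candidates = np.unique(p0)                 # sorted unique values
    for t in candidates:
        if np.mean(p0 > t) <= alpha:           # empirical P(D1 | H0)
            return float(t)
    return float(candidates[-1])

# Hypothetical connection probabilities of known 0-links and 1-links
p0 = np.array([0.05, 0.1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9])
p1 = np.array([0.55, 0.65, 0.8, 0.95])

p_t = np_threshold(p0, alpha=0.2)
false_alarm = np.mean(p0 > p_t)                # empirical P(D1 | H0)
miss = np.mean(p1 <= p_t)                      # empirical P(D0 | H1)
```

Lowering `alpha` pushes the threshold up and trades false alarms for misses.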

B.5 Measuring reconstruction performance

Contingency table (or confusion matrix):

                                 actual value
                          p                     n                     total
    prediction    p'      True Positive (TP)    False Positive (FP)   P'
    outcome       n'      False Negative (FN)   True Negative (TN)    N'
                  total   P                     N

Statistics defined:

Sensitivity or True Positive Rate (TPR): TPR = TP/P = TP/(TP+FN)
False Positive Rate (FPR): FPR = FP/N = FP/(FP+TN)
Accuracy (ACC): ACC = (TP+TN)/(P+N)
True Negative Rate or Specificity (SPC): SPC = TN/N = 1 - FPR
Positive Predictive Value (PPV): PPV = TP/(TP+FP)
Receiver Operating Characteristic (ROC): TPR vs. FPR
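The statistics above translate directly into code. The sketch below is illustrative only; the function name `confusion_metrics` and the counts passed to it are made up.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the statistics defined above from the four confusion counts."""
    p = tp + fn            # actual positives P
    n = fp + tn            # actual negatives N
    return {
        "TPR": tp / p,                     # sensitivity
        "FPR": fp / n,
        "ACC": (tp + tn) / (p + n),
        "SPC": tn / n,                     # specificity, equals 1 - FPR
        "PPV": tp / (tp + fp),
    }

# Hypothetical counts for illustration
m = confusion_metrics(tp=8, fp=2, fn=2, tn=88)
```

Sweeping the decision threshold and plotting the resulting (FPR, TPR) pairs gives the ROC curve.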

B.6 Relation to performance measures

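Figure: densities f0(p) and f1(p) of the connection probabilities (f_i, i = 0, 1) with the threshold pt; the areas under the two curves on either side of pt form the regions R1, R2, R3 and R4.

$$TP = S(R_2), \qquad FN = S(R_3), \qquad FP = S(R_4), \qquad TN = S(R_1)$$

Here S(R_k) is read as the area of region R_k.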

B.7 Criterion of MAP

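For the reconstruction problem, we choose the criterion that maximizes the a posteriori probability of the two hypotheses:

$$D = \begin{cases} D_0, & P(H_0 \mid p) > P(H_1 \mid p) \\ D_1, & P(H_1 \mid p) > P(H_0 \mid p) \end{cases}$$

By Bayes' rule, with c the observation, this is a likelihood-ratio test:

$$P(H_i \mid c) = \frac{P(c \mid H_i)\, P(H_i)}{P(c)}, \qquad L(c) = \frac{P(c \mid H_1)}{P(c \mid H_0)} \underset{D_0}{\overset{D_1}{\gtrless}} \lambda_{MAP} = \frac{P(H_0)}{P(H_1)}$$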
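A small sketch of the MAP decision, written both as a comparison of posteriors and as a likelihood-ratio test against the ratio of priors. The function names and the numerical likelihoods and priors are hypothetical and chosen only to show that the two forms agree.

```python
def map_decide(lik0, lik1, prior0, prior1):
    """Return 1 (accept H1) if P(H1|c) > P(H0|c), else 0.

    P(c) is common to both posteriors, so it cancels in the comparison.
    """
    return 1 if lik1 * prior1 > lik0 * prior0 else 0

def likelihood_ratio_decide(lik0, lik1, prior0, prior1):
    """Same decision expressed as L(c) = lik1 / lik0 against prior0 / prior1."""
    return 1 if lik1 / lik0 > prior0 / prior1 else 0

# Hypothetical values: links are rare, so the prior of H1 is small
prior0, prior1 = 0.96, 0.04
weak = map_decide(lik0=0.2, lik1=3.0, prior0=prior0, prior1=prior1)
strong = map_decide(lik0=0.2, lik1=6.0, prior0=prior0, prior1=prior1)
```

With a sparse network the prior ratio is large, so a link is accepted only when the likelihood ratio is correspondingly large.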

A.1 Probabilistic model of structured networks

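For node k (k = i or j), define an attribute vector $w_k = [w_{k1}, w_{k2}, \dots, w_{km}]^T$.

C: connection matrix; A: adjacency matrix.

The model assigns each node pair a connection probability $p_{ij} = \Pr(A_{ij} = 1)$ through a function of the attributes $w_i$ and $w_j$. [The remaining equations of this slide are not recoverable from the extracted text.]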

A.2 Estimate model parameters (MLE)

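Log-likelihood (sum over node pairs):

$$L = \ln \Pr(A \mid w) = \sum_{j,k} \left[ A_{jk} \ln p_{jk} + (1 - A_{jk}) \ln (1 - p_{jk}) \right]$$

Gradient:

$$\frac{\partial L}{\partial w} = \sum_{j,k} \frac{A_{jk} - p_{jk}}{p_{jk} (1 - p_{jk})} \frac{\partial p_{jk}}{\partial w}$$

Iterated gradient-based optimization:
1) start from initial $W_0$
2) simultaneously update W along the gradient
3) cease iterating when stopping conditions are met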
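The three optimization steps can be sketched with plain gradient ascent. The connection-probability function used here, a logistic function of the inner product of node attributes plus a bias `b`, is an assumption made for illustration and is not the slides' model function; the step size, tolerance and example matrix are likewise hypothetical.

```python
import numpy as np

def connection_prob(W, b):
    # Assumed logistic form p_jk = 1 / (1 + exp(-(w_j . w_k + b)))
    return 1.0 / (1.0 + np.exp(-(W @ W.T + b)))

def log_likelihood(A, W, b):
    iu = np.triu_indices_from(A, k=1)          # count each node pair once
    p = np.clip(connection_prob(W, b)[iu], 1e-12, 1 - 1e-12)
    a = A[iu]
    return float(np.sum(a * np.log(p) + (1 - a) * np.log(1 - p)))

def fit(A, m=2, step=0.05, max_iter=2000, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((A.shape[0], m))   # 1) start from initial W0
    b = 0.0
    prev = log_likelihood(A, W, b)
    for _ in range(max_iter):
        R = A - connection_prob(W, b)          # residuals A_jk - p_jk
        np.fill_diagonal(R, 0.0)               # no self-links
        grad_W = R @ W                         # dL/dW for the logistic form
        grad_b = R.sum() / 2.0                 # dL/db, each pair counted once
        W, b = W + step * grad_W, b + step * grad_b   # 2) simultaneous update
        cur = log_likelihood(A, W, b)
        if abs(cur - prev) < tol:              # 3) stopping condition
            break
        prev = cur
    return W, b

# Small symmetric example network (hypothetical data)
A = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
], dtype=float)

W0 = 0.1 * np.random.default_rng(0).standard_normal((5, 2))
ll_start = log_likelihood(A, W0, 0.0)
W_fit, b_fit = fit(A)
ll_end = log_likelihood(A, W_fit, b_fit)
```

For the logistic form the factor p(1-p) in the general gradient cancels, which is why only the residual A - p appears in the update.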

B.8 Example network

B.9 Density function of connection probabilities

Figure: probability density functions f1(p) and 1/r f0(p) versus connection probability (p).

B.10 MAP detector minimizes average error

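$$M = P(D_0 \mid H_1) + \tfrac{1}{r} P(D_1 \mid H_0)$$

$$P(D_0 \mid H_1) = \int_0^{p_t} f_1(p)\, dp = F_1(p_t), \qquad P(D_1 \mid H_0) = \int_{p_t}^{1} f_0(p)\, dp = 1 - \int_0^{p_t} f_0(p)\, dp = 1 - F_0(p_t)$$

$$\min_{p_t} M = F_1(p_t) + \tfrac{1}{r} \left( 1 - F_0(p_t) \right)$$

$$\frac{\partial M}{\partial p_t} = 0 \;\Rightarrow\; f_1(p_t) - \tfrac{1}{r} f_0(p_t) = 0$$

The weight 1/r is the scaling that appears on f0 and 1 - F0 in the plotted curves.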

The density function is usually jagged and difficult to work with, so the distribution function is preferred. Consider the minimum average error (cost).
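Working with distribution functions rather than densities, the operating point can be located by scanning candidate thresholds and minimizing F1(pt) + 1/r (1 - F0(pt)), the sum plotted in B.11. The sketch below uses empirical distribution functions; the function name `best_threshold` and the sample values are hypothetical, and r is simply passed in as a parameter.

```python
import numpy as np

def best_threshold(p0, p1, r):
    """Minimize M(pt) = F1(pt) + (1/r) * (1 - F0(pt)) over observed values."""
    p0, p1 = np.sort(p0), np.sort(p1)
    grid = np.unique(np.concatenate(([0.0], p0, p1)))        # candidate pt
    F0 = np.searchsorted(p0, grid, side="right") / p0.size   # empirical F0(pt)
    F1 = np.searchsorted(p1, grid, side="right") / p1.size   # empirical F1(pt)
    M = F1 + (1.0 - F0) / r
    k = int(np.argmin(M))
    return float(grid[k]), float(M[k])

# Hypothetical connection probabilities of 0-links (p0) and 1-links (p1)
p0 = np.array([0.05, 0.1, 0.15, 0.2, 0.3, 0.6])
p1 = np.array([0.4, 0.7, 0.8, 0.9])

p_t, M_min = best_threshold(p0, p1, r=1.0)
```

Because the empirical distribution functions are step functions, the minimum is attained at one of the observed values, so a scan over them is sufficient.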

B.11 Distribution of connection probabilities

Figure: scaled probability distribution functions F1(p), 1/r (1-F0(p)), and their sum F1(p) + 1/r (1-F0(p)), versus connection probability (p).

B.12 Generalizability of algorithm

Figure (two panels): probability density function (pdf) versus connection probability (p). One panel, on a logarithmic vertical axis, compares F0(p) with F0m(p); the other compares F1(p) with F1m(p).

Do the unknown links approximately follow the same distribution?

Possible reasons for the unfavorable burst at the tail: a source of model error.

B.13 Robustness of algorithm

Figure (two panels): scaled probability distribution functions 1/r (1-F0(p)) and 1/r (1-F0m(p)), and probability distribution functions F1(p) and F1m(p), versus connection probability (p), each shown at 5%, 10%, 15% and 20%.

Sensitive to the number of unknown links?

B.14 Comparison of operation points

Figure: scaled probability distribution functions versus connection probability (p). Curves: F1(p), 1/r (1-F0(p)), F1m(p), 1/r (1-F0m(p)), F1(p) + 1/r (1-F0(p)), and F1m(p) + 1/r (1-F0m(p)).

B.15 Reconstruction results

USAir Network, 10% missed

    P       N        ACC (%)   TP/P (%)   TN/N (%)   TP/(TP+FP) (%)
    201     5293     98.13     80.60      98.79      71.68
    222     5272     98.13     80.63      98.86      74.90
    192     5302     98.11     75.52      98.92      71.78
    224     5270     98.25     80.80      98.99      77.35
    235     5259     98.13     75.32      99.14      79.73
    217     5277     98.38     78.34      99.20      80.19
    204     5290     98.31     77.45      99.11      77.07
    192     5302     98.25     71.88      99.21      76.67
    231     5263     98.16     77.06      99.09      78.76
    217     5277     97.93     71.89      99.00      74.64
    213.5   5280.5   98.18     76.95      99.03      76.28    (column means)
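As a consistency check on the table, ACC and TP/(TP+FP) can be recomputed from P, N, TP/P and TN/N using the definitions of B.5. The sketch below does this for the first row; the helper name `implied_acc_ppv` is hypothetical, and because the tabulated rates are rounded, agreement is only expected to within rounding.

```python
def implied_acc_ppv(P, N, tpr_pct, tnr_pct):
    """Recover ACC and PPV (in percent) from P, N, TP/P and TN/N."""
    tp = tpr_pct / 100 * P          # implied true positives
    tn = tnr_pct / 100 * N          # implied true negatives
    fp = N - tn                     # implied false positives
    acc = 100 * (tp + tn) / (P + N)
    ppv = 100 * tp / (tp + fp)
    return acc, ppv

# First row of the table: P = 201, N = 5293, TP/P = 80.60 %, TN/N = 98.79 %
acc, ppv = implied_acc_ppv(201, 5293, 80.60, 98.79)
```

The recomputed values can then be compared against the tabulated 98.13 and 71.68.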

C.1 A variant problem of link reconstruction

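Observed network -> 2 types of signals, 0 and 1.

Figure: the 5-node example network (nodes 1 to 5; rows and columns labelled 5, 4, 3, 2, 1 in the slide). Matrix with the affected positions marked "?":

    0 1 0 ? 0
    1 0 ? 0 1
    0 ? 0 1 1
    ? 0 1 0 ?
    0 1 1 ? 0

Matrix as actually observed, with those positions reading 0:

    0 1 0 0 0
    1 0 0 0 1
    0 0 0 1 1
    0 0 1 0 0
    0 1 1 0 0

Some 0s were originally 1s but have been set to 0s. Their positions are unknown, and their number may be known or unknown.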

C.2 Procedures for the variant problem

Available information
-> fitted probabilistic model P (N x N)
-> connection probability p(i,j) of each 0-link (i,j)
-> (a) number (M) unknown: determine a threshold of connection probability pt, then set (i,j) to 1 if p(i,j) > pt, and to 0 otherwise
-> (b) number (M) known: scoring, i.e., rank the connection probabilities of the candidate links (all 0-links), then set the M links with the highest scores to 1s
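Case (b) of the procedure, ranking the candidate 0-links and keeping the M highest scores, can be sketched as below. The function name `top_m_links` and the probabilities in `P` are hypothetical; the observed matrix is the 5-node example with the affected entries reading 0.

```python
import numpy as np

def top_m_links(A_obs, P, M):
    """Return the M candidate 0-links (i < j) with the highest score."""
    iu, ju = np.triu_indices_from(A_obs, k=1)     # each node pair once
    cand = A_obs[iu, ju] == 0                     # candidates: all 0-links
    iu, ju, scores = iu[cand], ju[cand], P[iu, ju][cand]
    order = np.argsort(-scores, kind="stable")[:M]
    return [(int(iu[k]), int(ju[k])) for k in order]

A_obs = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
])

# Hypothetical fitted connection probabilities (symmetric)
P = np.full((5, 5), 0.1)
P[0, 3] = P[3, 0] = 0.8
P[3, 4] = P[4, 3] = 0.6
P[1, 2] = P[2, 1] = 0.3

restored = top_m_links(A_obs, P, M=2)
```

Case (a) differs only in the last step: instead of taking the top M, every candidate with score above pt is set to 1.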

C.3 Algorithm based on common neighbor

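$$p_{ij} = p_{ji} = n_{ij} / n_{\max}$$

with $n_{ij}$ the number of common neighbors of nodes i and j and $n_{\max}$ its maximum value.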
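A minimal sketch of the common-neighbor score, assuming the maximum is taken over all pairs of distinct nodes; that reading, the function name `common_neighbor_scores`, and the example matrix are assumptions made for illustration.

```python
import numpy as np

def common_neighbor_scores(A):
    """Score p_ij = n_ij / n_max from a symmetric 0/1 adjacency matrix."""
    n = (A @ A).astype(float)      # n[i, j] = number of common neighbors
    np.fill_diagonal(n, 0.0)       # ignore i == j (that entry is the degree)
    n_max = n.max()
    return n / n_max if n_max > 0 else n

A = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
])

p_cn = common_neighbor_scores(A)
```

The matrix product counts, for every pair, the nodes adjacent to both, so no explicit loop over neighbors is needed.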

Figure (two panels): scaled distribution functions F1, 1/r (1-F0), and F1 + 1/r (1-F0), versus connection probability (p).

C.4 Comparison between two methods

Figure (two panels): probability density functions (f1 and 1/r f0) and scaled distribution functions (F1 and 1/r (1-F0)) versus connection probability (p), each for the common-neighbors method and the model-based method.

Figure: distribution functions versus connection probability (p) for the common-neighbors and model-based methods (red: 20%, blue: 5%).

C.5 Generalizability and robustness of algorithms

Figure: number of hits versus number of predicted links, for common neighbors, the probabilistic model, and a perfect algorithm.
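A hits-versus-number-of-predicted-links curve of this kind can be computed by sorting the candidates by score and accumulating the true links among the top k. The sketch below is illustrative; the function names and the scores are hypothetical.

```python
import numpy as np

def hits_curve(scores, is_true_link):
    """hits[k-1] = number of true links among the k highest-scored candidates."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return np.cumsum(np.asarray(is_true_link, dtype=int)[order])

def perfect_curve(n_candidates, n_true):
    """A perfect algorithm ranks every true link first."""
    return np.minimum(np.arange(1, n_candidates + 1), n_true)

# Hypothetical scores for five candidate links and whether each is a true link
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
truth = [1, 0, 1, 0, 1]

hits = hits_curve(scores, truth)
ideal = perfect_curve(len(scores), sum(truth))
```

The gap between the two curves at a given number of predicted links shows how far a method is from ideal ranking.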

C.6 Reconstruction performance by ranking

Figure: sparsity plot of the adjacency matrix (nz = 1740).

D.1 Problem of link prediction

Procedure is identical to that of the variant link reconstruction problem.

Figure: number of hits versus number of links predicted, for common neighbor, model based, and a perfect algorithm. Econophysics co-authorship network (N=506, m=519, nL=379).

Figure: sparsity plot of the adjacency matrix (nz = 1038).

D.2 Factors to affect prediction performance

Problem of generalizability: a) size of the training set, or time span of prediction; b) time-varying growth mechanism.

Figure (two panels): probability density functions f1, f0 and fn, and scaled distribution functions F1, 1/r (1-F0) and Fn, versus connection probability (p).

D.3 Effects of training set size

Assume the new links are known and examine the variant problem above: the training data set is not able to capture the underlying distribution faithfully, either because its size is too small or because the growth rule is time dependent.

Figure (two panels): probability density functions (F1, Fn, F0) and scaled distribution functions (F1, Fn, 1/r (1-F0)) versus connection probability (p).

Conclusions

The problem of network reconstruction is studied thoroughly. Under a more general framework, the problem can be reformulated as a hypothesis testing problem, which gives deeper insight into the problem and enables us to relate the reconstruction performance of various methods to quantities at a more fundamental level.

THANK YOU