On the bias of Breadth First Search (BFS) and of other graph sampling techniques

Maciej Kurant (EPFL / UCI)

Joint work with: Athina Markopoulou (UCI), Patrick Thiran (EPFL).

08 Sep 2010, ITC’22, Amsterdam, Netherlands

2

Breadth First Search (BFS)

[Figure: example graph with nodes A to K. BFS visits the nodes in the order D, B, A, C, F, E, I, H, G, K, J.]

Not feasible for huge online graphs! E.g., a full BFS of the friendship graph of Facebook would require 200TB of HTML traffic.

3

BFS sample of a large graph

[Figure: example graph with nodes A to K and the BFS order D, B, A, C, F, E, I, H, G, K, J, with the sampling budget marked on that order.]
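A BFS sample is simply a BFS that stops once the sampling budget is spent. A minimal sketch, assuming the graph is available as an adjacency dictionary; the function name `bfs_sample` and the toy graph are illustrative, not the graph from the figure:

```python
from collections import deque

def bfs_sample(adjacency, start, budget):
    """Return the first `budget` nodes discovered by BFS from `start`."""
    if budget <= 0:
        return []
    sampled = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(sampled) < budget:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                sampled.append(neighbor)
                queue.append(neighbor)
                if len(sampled) == budget:
                    break  # budget spent: stop exploring
    return sampled

# Hypothetical toy graph (assumption, for illustration only).
toy = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
partial = bfs_sample(toy, "a", 3)
full = bfs_sample(toy, "a", 10)
```

With a budget of 3 the sample contains only the start node and its two neighbors; with a budget at least as large as the graph, the sample is the full BFS order.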

4

Why sample with BFS?

• BFS is a well-known textbook technique

• BFS sample is a nice looking graph
  – E.g., BFS of a lattice is a lattice
  – We can study its topological characteristics, which is not possible with random walks

• It is used in practice:
  – Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong, “Analysis of Topological Characteristics of Huge Online Social Networking Services,” in Proc. of WWW, 2007.
  – A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and S. Bhattacharjee, “Measurement and Analysis of Online Social Networks,” in Proc. of IMC, 2007.
  – C. Wilson, B. Boe, A. Sala, K. P. Puttaswamy, and B. Y. Zhao, “User interactions in social networks and their implications,” in Proc. of EuroSys, 2009.

5

Our BFS samples of Facebook.

BFS sample size: 100K nodes. Facebook size: 500M nodes.

[Figure: node degree distributions. qk: observed node degree distribution; pk: real node degree distribution.]

Observed average node degree: $\sum_k k\,q_k = 323$

Real average node degree: $\sum_k k\,p_k = 94$

Why?

6

Our Goal

Graph traversals on RG(pk): for BFS, qk(f) = ?

[Figure: plot whose legend marks the real average node degree and the real average squared node degree.]

This bias has been empirically observed in the past, but never formally analyzed.

7

Graph model RG(pk)

• Random graph RG(pk) with a given node degree distribution pk

• Can be generated by the configuration model:
  Example: |V| = 4 and pk: p1 = p2 = p3 = p4 = 0.25

[Figure: the nodes of the example with their ‘stubs’.]
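The configuration model can be sketched as follows: every node gets as many stubs as its degree, and the stubs are paired uniformly at random. The function name `configuration_model` and the concrete degree sequence (one node of each degree 1 to 4, one realization consistent with the example pk) are assumptions made for illustration:

```python
import random

def configuration_model(degrees, rng):
    """Pair stubs uniformly at random and return the edges (u, v).

    The result is a multigraph: self-loops and parallel edges may occur.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("the total number of stubs must be even")
    # Node v contributes degrees[v] stubs.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # After a uniform shuffle, consecutive stubs form a uniform random matching.
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

degrees = [1, 2, 3, 4]  # assumed realization: one node of each degree
edges = configuration_model(degrees, random.Random(0))

# Recount the degrees from the edges; a self-loop counts twice.
realized = [0] * len(degrees)
for u, v in edges:
    realized[u] += 1
    realized[v] += 1
```

Whatever the random pairing, the realized degrees match the requested ones, which is the defining property of the model.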

8

Approach 1: Brute force

Generate all possible graphs, and ... No way!!

Remedy: “The Principle of Deferred Decisions”

So we can generate the graph ‘on the fly’, while exploring it!
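Generating the graph ‘on the fly’ can be sketched as a BFS that decides the partner of a stub only at the moment the traversal follows it. This is a minimal illustration under the configuration model; the function name `bfs_on_the_fly` and the bookkeeping via `remaining` are assumptions, not the authors' code:

```python
import random
from collections import deque

def bfs_on_the_fly(degrees, start, budget, rng):
    """BFS over a configuration-model graph whose edges are drawn lazily.

    Returns the nodes in the order they were sampled.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("the total number of stubs must be even")
    if budget <= 0:
        return []
    nodes = range(len(degrees))
    remaining = list(degrees)  # number of unmatched stubs per node
    sampled, seen, queue = [start], {start}, deque([start])
    while queue and len(sampled) < budget:
        node = queue.popleft()
        while remaining[node] > 0 and len(sampled) < budget:
            remaining[node] -= 1  # the stub we are about to follow
            # Deferred decision: the partner stub is drawn only now,
            # uniformly among all stubs that are still unmatched.
            partner = rng.choices(nodes, weights=remaining)[0]
            remaining[partner] -= 1
            if partner not in seen:
                seen.add(partner)
                sampled.append(partner)
                queue.append(partner)
    return sampled

order = bfs_on_the_fly([1, 2, 3, 4], start=3, budget=4, rng=random.Random(1))
```

Because every stub is matched to a uniformly chosen unmatched stub, the explored part has the same distribution as if the whole graph had been generated first; the unexplored part is never built.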

9

Approach 2: The Principle of Deferred Decisions

[Figure: step-by-step exploration of a small example with nodes u, v and w, annotated with the probabilities Pr(X_i = ·) of each step, where X_i is the i-th sampled node.]

This does not scale! (because of dependencies between stubs)

* we assumed that the generated graph is connected

10

Approach 3: Breaking the stub dependencies

[Figure: nodes v1, v2, v3 and v4 with their numbered stubs, placed on a time axis t running from 0 to 1.]

Originally proposed in: J. H. Kim, “Poisson cloning model for random graphs,” International Congress of Mathematicians (ICM), 2006 (preprint in 2004).

Developed in: D. Achlioptas, A. Clauset, D. Kempe, and C. Moore, “On the bias of traceroute sampling: or, power-law degree distributions in regular graphs,” in STOC, 2005.

(both in a different context)
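One way to see what placing stubs on a time axis buys is a small Monte Carlo check. The sketch assumes (this is an assumption about the model, not a statement from the slide) that every stub receives an independent index drawn uniformly from [0, 1] and that a node counts as sampled before time t as soon as one of its stubs has an index below t. Under that assumption, a node with k stubs is sampled before t with probability 1 − (1 − t)^k, independently of all other nodes. The helper names are illustrative:

```python
import random

def sampled_before(k, t, rng):
    """True if at least one of k independent uniform stub indices is below t."""
    return any(rng.random() < t for _ in range(k))

def estimate(k, t, trials, rng):
    """Empirical frequency of `sampled_before` over `trials` repetitions."""
    hits = sum(sampled_before(k, t, rng) for _ in range(trials))
    return hits / trials

rng = random.Random(42)
k, t = 3, 0.2
empirical = estimate(k, t, 20000, rng)
closed_form = 1 - (1 - t) ** k  # about 0.488 for k = 3, t = 0.2
```

The empirical frequency should land close to the closed form; the point is that the probability depends only on the node's own degree and on t.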



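12

Node v of degree k is sampled before time t:

$$\Pr(v \text{ is sampled before } t) = 1 - (1-t)^k$$

Expected number of nodes of degree k sampled before t:

$$p_k\,|V|\,\bigl(1 - (1-t)^k\bigr)$$

where $p_k\,|V|$ is the number of nodes of degree k.

Degree distribution expected to be observed:

$$q_k(t) = \frac{p_k\,\bigl(1 - (1-t)^k\bigr)}{\sum_l p_l\,\bigl(1 - (1-t)^l\bigr)}$$

Expected fraction of nodes sampled:

$$f(t) = \sum_k p_k\,\bigl(1 - (1-t)^k\bigr)$$

$t(f)$ is well defined.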
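These formulas can be evaluated numerically: the expected observed distribution is q_k = p_k (1 − (1 − t)^k) divided by the sum of the same expression over all degrees, the expected sampled fraction is f(t) = Σ_k p_k (1 − (1 − t)^k), and t(f) is obtained by inverting f(t). The sketch below is an illustration, not the authors' code; it assumes every degree is at least 1 and inverts f(t) by bisection, which is a choice made here. All function names and the example pk are assumptions:

```python
def sampled_fraction(p, t):
    """f(t) = sum_k p_k (1 - (1 - t)^k); `p` maps degree k to p_k."""
    return sum(pk * (1 - (1 - t) ** k) for k, pk in p.items())

def time_for_fraction(p, f, tol=1e-12):
    """Invert f(t) by bisection on [0, 1]; f(t) is increasing in t."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sampled_fraction(p, mid) < f:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def observed_distribution(p, f):
    """q_k at the time t(f) when a fraction f of the nodes has been sampled."""
    t = time_for_fraction(p, f)
    weights = {k: pk * (1 - (1 - t) ** k) for k, pk in p.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def mean_degree(dist):
    """Average degree of a degree distribution given as {k: probability}."""
    return sum(k * prob for k, prob in dist.items())

p = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}   # example pk (assumption)
small = mean_degree(observed_distribution(p, 1e-6))
half = mean_degree(observed_distribution(p, 0.5))
full = mean_degree(observed_distribution(p, 1.0))
```

For this pk the real average degree is 2.5. For a very small f the observed average approaches Σ k² p_k / Σ k p_k = 7.5 / 2.5 = 3.0, since 1 − (1 − t)^k ≈ k t for small t, so nodes are observed in proportion to their degree. For f = 1 it returns to 2.5, and intermediate fractions fall in between.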


13

The analysis is exactly the same for other graph traversal techniques:

• BFS
• DFS
• Forest Fire
• Snowball Sampling
• Node sampling weighted by degrees
• …


14

Theory vs Simulations

Simulations on a power-law random graph with 10K nodes.

[Figure: degree distribution, theory vs simulations, including the corrected distribution.]

15

What if the graph is not random?

Random graph RG(pk): Purely random, given the degree distribution pk.

Assortative RG(pk): Nodes of similar degree are more likely to connect.

Average node degree, BFS:

  expected    338
  sampled     324
  corrected    85
  real         94


18




Summary

[Figure labels: Random Walk; MHRW, RWRW.]

• For small sample size (for f→0), BFS has the same bias as RW (also in our Facebook measurements).

• This bias monotonically decreases with f. We found analytically the shape of this curve.

• For large sample size (for f→1), BFS becomes unbiased.

Thank you!
