1
On the bias of Breadth First Search (BFS)
and of other graph sampling techniques
Maciej Kurant (EPFL / UCI)
Joint work with: Athina Markopoulou (UCI),
Patrick Thiran (EPFL).
08 Sep 2010, ITC’22, Amsterdam, Netherlands
2
Breadth First Search (BFS)
[Figure: example graph on nodes A to K. BFS visiting order: D, B, A, C, F, E, I, H, G, K, J]
Not feasible for huge online graphs! E.g., a full BFS of the friendship graph of Facebook would require 200TB of HTML traffic.
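A minimal sketch of BFS used as a sampler, stopped after a fixed number of nodes. The function name `bfs_order`, the `max_nodes` parameter and the toy adjacency list are illustrative assumptions, not taken from the slides.

```python
from collections import deque


def bfs_order(adj, start, max_nodes=None):
    """Return nodes in the order BFS visits them, starting at `start`.

    `adj` maps each node to a list of its neighbours.
    `max_nodes` (hypothetical parameter) stops the crawl early,
    which is how BFS is used as a sampling technique.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue and (max_nodes is None or len(order) < max_nodes):
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:  # enqueue each node only once
                seen.add(w)
                queue.append(w)
    return order


# Toy graph (an assumption for illustration only).
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
sample = bfs_order(toy, "a", max_nodes=3)
```

Stopping early is what makes the sample biased: high-degree nodes have more edges through which they can be discovered first.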
4
Why sample with BFS?
• BFS is a well known textbook technique
• BFS sample is a nice-looking graph
– E.g., BFS of a lattice is a lattice
– We can study its topological characteristics, which is not possible with random walks
• It is used in practice:
– Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong, “Analysis of Topological Characteristics of Huge Online Social Networking Services,” in Proc. of WWW, 2007.
– A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and S. Bhattacharjee, “Measurement and Analysis of Online Social Networks,” in Proc. of IMC, 2007.
– C. Wilson, B. Boe, A. Sala, K. P. Puttaswamy, and B. Y. Zhao, “User interactions in social networks and their implications,” in Proc. of EuroSys, 2009.
5
Our BFS samples of Facebook.
BFS sample size: 100K nodes. Facebook size: 500M nodes.
Observed average node degree: $\sum_k k\,q_k = 323$
Real average node degree: $\langle k \rangle = \sum_k k\,p_k = 94$
Why?
$q_k$ - observed node degree distribution
$p_k$ - real node degree distribution
6
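Our Goal
$\langle k \rangle$ - real average node degree
$\langle k^2 \rangle$ - real average squared node degree
Graph traversals on RG(pk): $q_k(f) = ?$
[Figure: plot over the fraction f of sampled nodes, starting at f = 0; the BFS curve is marked "?"]
This bias has been empirically observed in the past, but never formally analyzed.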
7
Graph model RG(pk)
• Random graph RG(pk) with a given node degree distribution pk
• Can be generated by the configuration model
Example: |V| = 4 and pk: p1 = p2 = p3 = p4 = 0.25
[Figure: the four nodes of the example with their ‘stubs’]
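A minimal sketch of the configuration model, assuming the standard construction: every node of degree k gets k stubs, and the stubs are paired uniformly at random. The function name is illustrative, and the result may contain self-loops and multi-edges, which this sketch does not remove.

```python
import random


def configuration_model(degrees, rng=None):
    """Pair up stubs uniformly at random.

    `degrees[v]` is the degree of node v. Returns a list of edges (u, v).
    Self-loops and multi-edges are possible and are kept as-is.
    """
    rng = rng or random.Random()
    # One list entry per stub, labelled by the node that owns it.
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    if len(stubs) % 2:
        raise ValueError("the sum of degrees must be even")
    rng.shuffle(stubs)
    # After shuffling, consecutive stubs form a uniform random matching.
    return list(zip(stubs[0::2], stubs[1::2]))


# Slide example: |V| = 4 with one node of each degree 1, 2, 3, 4.
edges = configuration_model([1, 2, 3, 4], random.Random(0))
```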
8
Approach 1: Brute force
Generate all possible graphs, and ... No way!!
Remedy: “The Principle of Deferred Decisions”
So we can generate the graph ‘on the fly’, while exploring it!
9
[Figure and worked example, not recoverable from the extraction: step-by-step computation of $\Pr(X_i = u)$ on a small graph with nodes u, v, w, where $X_i$ is the i-th sampled node]
Approach 2: The Principle of Deferred Decisions
This does not scale! (because of dependencies between stubs)
* we assumed that the generated graph is connected
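A minimal simulation sketch of generating the graph ‘on the fly’: stubs are matched only when BFS reaches their owner, so the graph is never built in advance. The function name and the list-based stub pool are assumptions chosen for brevity (the pool operations make it quadratic). It simulates the process only and is not the analysis from the slides.

```python
import random
from collections import deque


def bfs_deferred(degrees, start, rng=None):
    """BFS on a configuration-model graph whose edges are drawn lazily.

    A stub is matched to a uniformly chosen free stub only at the moment
    BFS processes its owner. Returns the BFS visiting order of the
    component that contains `start`.
    """
    rng = rng or random.Random()
    if sum(degrees) % 2:
        raise ValueError("the sum of degrees must be even")
    # Pool of free (not yet matched) stubs, labelled by owner node.
    free = [v for v, k in enumerate(degrees) for _ in range(k)]
    seen, order, queue = {start}, [], deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        while v in free:
            free.remove(v)                         # take one free stub of v
            w = free.pop(rng.randrange(len(free)))  # uniform random partner
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


order = bfs_deferred([1, 2, 3, 4], start=3, rng=random.Random(0))
```

Because the total number of stubs is even, the pool is never empty when a partner is drawn.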
10
[Figure: nodes v1 to v4 with their numbered stubs placed along a time axis t running from 0 to 1]
Originally proposed in: J. H. Kim, “Poisson cloning model for random graphs,” International Congress of Mathematicians (ICM), 2006 (preprint in 2004).
Developed in: D. Achlioptas, A. Clauset, D. Kempe, and C. Moore, “On the bias of traceroute sampling: or, power-law degree distributions in regular graphs,” in STOC, 2005.
(both in a different context)
Approach 3: Breaking the stub dependencies
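12
Node $v$ of degree $k$ is sampled before time $t$:
$\Pr(v \text{ sampled before } t) = 1 - (1-t)^k$
Expected number of nodes of degree $k$ sampled before $t$:
$p_k |V| \, \bigl(1 - (1-t)^k\bigr)$, where $p_k |V|$ is the number of nodes of degree $k$
Degree distribution expected to be observed:
$q_k(t) = \dfrac{p_k \bigl(1 - (1-t)^k\bigr)}{\sum_l p_l \bigl(1 - (1-t)^l\bigr)}$
Expected fraction of nodes sampled:
$f(t) = 1 - \sum_k p_k (1-t)^k$
$t = t(f)$ is well defined.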
Approach 3: Breaking the stub dependencies
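A minimal numerical sketch of these formulas, assuming $f(t) = 1 - \sum_k p_k (1-t)^k$ and $q_k(t) \propto p_k (1 - (1-t)^k)$, with $t(f)$ obtained by bisection. The function names, the bisection tolerance and the toy distribution are illustrative assumptions. All degrees are assumed to be at least 1 and f must be greater than 0.

```python
def fraction_sampled(p, t):
    """f(t) = 1 - sum_k p_k (1 - t)^k, with p a dict {degree: probability}."""
    return 1.0 - sum(pk * (1.0 - t) ** k for k, pk in p.items())


def time_for_fraction(p, f, tol=1e-12):
    """Invert f(t) by bisection on [0, 1]; f(t) is increasing in t."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fraction_sampled(p, mid) < f:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def observed_distribution(p, f):
    """q_k(f): degree distribution expected in a sample of fraction f."""
    t = time_for_fraction(p, f)
    w = {k: pk * (1.0 - (1.0 - t) ** k) for k, pk in p.items()}
    z = sum(w.values())
    return {k: wk / z for k, wk in w.items()}


def mean_degree(dist):
    return sum(k * pk for k, pk in dist.items())


# Toy distribution (an assumption): half the nodes have degree 1, half degree 3.
p = {1: 0.5, 3: 0.5}
small = mean_degree(observed_distribution(p, 1e-6))  # close to <k^2>/<k> = 2.5
large = mean_degree(observed_distribution(p, 1.0))   # close to <k> = 2.0
```

For this toy distribution the expected observed average degree moves from about 2.5 for a very small sample to about 2.0 for a full crawl.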
13
The analysis is exactly the same for other graph traversal techniques:
• BFS
• DFS
• Forest Fire
• Snowball Sampling
• Node sampling weighted by degrees
• …
Approach 3: Breaking the stub dependencies
14
Simulations on a power law random graph with 10K nodes
Theory vs. simulations
[Figure: degree distribution, theory against simulations, together with the corrected distribution]
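One possible way to correct a sampled degree distribution is to invert the RG(pk) relation $q_k \propto p_k (1 - (1-t)^k)$. This is a sketch under that assumption, not necessarily the exact procedure behind the slide. With $\hat p_k \propto q_k / (1 - (1-t)^k)$, the condition $f(t) = f$ reduces to $\sum_k q_k / (1 - (1-t)^k) = 1/f$, which is solved for $t$ by bisection. The function name is illustrative, and all degrees are assumed to be at least 1.

```python
def correct_distribution(q, f, tol=1e-12):
    """Estimate p_k from the observed q_k and the sampled fraction f.

    Assumes the graph behaves like RG(pk). `q` is a dict
    {degree: observed probability}; f must satisfy 0 < f <= 1.
    """
    def inverse_fraction(t):
        # Equals 1 / f(t) when p_k is proportional to q_k / (1 - (1-t)^k).
        return sum(qk / (1.0 - (1.0 - t) ** k) for k, qk in q.items())

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if inverse_fraction(mid) > 1.0 / f:  # t is still too small
            lo = mid
        else:
            hi = mid
    t = (lo + hi) / 2
    w = {k: qk / (1.0 - (1.0 - t) ** k) for k, qk in q.items()}
    z = sum(w.values())
    return {k: wk / z for k, wk in w.items()}


# Toy check (an assumption): p = {1: 0.5, 3: 0.5} observed at t = 0.2,
# which gives f = 0.344 and q proportional to {1: 0.1, 3: 0.244}.
q = {1: 0.1 / 0.344, 3: 0.244 / 0.344}
p_hat = correct_distribution(q, 0.344)
```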
15
What if the graph is not random?
Random graph RG(pk): Purely random, given the degree distribution pk.
Assortative RG(pk): Nodes of similar degree are more likely to connect.
Average node degree $\langle k \rangle$ in BFS:
expected: 338
sampled: 324
corrected: 85
real: 94
18
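Summary
$\langle k \rangle$ - real average node degree
$\langle k^2 \rangle$ - real average squared node degree
Graph traversals on RG(pk):
• For small sample size (f→0), BFS has the same bias as Random Walk (RW). (Also in our Facebook measurements.)
• This bias decreases monotonically with f. We found analytically the shape of this curve.
• For large sample size (f→1), BFS becomes unbiased.
[Figure: observed average node degree as a function of f, with curves labelled "Graph traversals", "Random Walk" (degree-biased, at $\langle k^2 \rangle / \langle k \rangle$) and "MHRW, RWRW" (unbiased, at $\langle k \rangle$)]
Thank you!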