
Sampling Massive Online Graphs: Challenges, Techniques, and Applications to Facebook

Maciej Kurant (UC Irvine)

Joint work with:

Minas Gjoka (UC Irvine), Athina Markopoulou (UC Irvine), Carter T. Butts (UC Irvine), Patrick Thiran (EPFL).

14 Nov, 2011, KTH

Why study Online Social Networks (OSNs)?

Engineering
• Search engine accuracy
• Better spam filters
• Efficient data centers
• New apps/Third party services
• Offload 3G operators
• …

Social Media
• Predict the spread and importance of information
• Social filters
• …

Social Sciences
• Great source of data for studying the structure of society, online behavior, …

Marketing
• Influential users
• Recommendations
• Ad placement
• …

Large scale data mining
• Understand user communication patterns, community structure
• “Human sensors”

Privacy

OSNs cover 50% of the world’s Internet users: more than 1 billion users (October 2011).

[Figure: active users of the largest OSNs: 800 million, 200 million, 66 million, 50 million, and 34 million; the network logos are not recoverable.]

Facebook:
• 800+ million users
• 150 friends each (on average)
• 8 bytes (64 bits) per user ID

The raw connectivity data, with no attributes:
• 800M x 150 x 8 B = 960 GB
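The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the three figures from the slide; the variable names are illustrative, and decimal gigabytes (10^9 bytes) are assumed.

```python
# Back-of-the-envelope size of the raw friendship data (figures from the slide).
users = 800_000_000      # 800+ million users
avg_friends = 150        # average number of friends per user
bytes_per_id = 8         # one 64-bit user ID per friend-list entry

# Every user stores one ID per friend, so each friendship is counted twice.
raw_bytes = users * avg_friends * bytes_per_id
raw_gb = raw_bytes / 10**9   # assumption: decimal gigabytes

print(f"{raw_gb:.0f} GB")
```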

To get this data, one would have to download:
• 200 TB of HTML data!

This is neither feasible nor practical. Solution: Sampling!

Sampling

Objective:
• Node attributes
• Topology
• Graph size
• Evolution in time
• Random node selection
• …

What:
• Nodes
• Edges
• Subgraphs

How:
• Directly (often not possible)
• Exploration (OSNs; P2P, distributed systems; WWW; “offline” social networks)

Related work

Random Walks in graph sampling:
• WWW [Henzinger et al. 2000, Baykan et al. 2009]
• P2P [Gkantsidis et al. 2004, Stutzbach et al. 2006, Rasti et al. 2009]
• OSN [Rasti et al. 2008, Krishnamurthy et al. 2008]
• “Offline” social networks [Salganik et al. 2004, Volz et al. 2008]

Random Walks mixing improvements:
• Random jumps [Henzinger et al. 2000, Avrachenkov et al. 2010]
• Fastest Mixing Markov Chain [Boyd et al. 2004]
• Multiple dependent walks [Ribeiro et al. 2010]

BFS and other traversals in graph sampling:
• Najork et al. 2001, Achlioptas et al. 2005, Leskovec et al. 2006, Mislove et al. 2007, Cha 2007, Ahn et al. 2007, Wilson et al. 2009, Viswanath 2009, Ye et al. 2010, Gile and Handcock 2011

Measurement/Characterization studies of OSNs:
• Cyworld, Orkut, Myspace, Flickr, Youtube [Mislove et al. 2007, …]
• Facebook [Krishnamurthy et al. 2008, Wilson et al. 2009, …]

Independence sampling:
• Hansen-Hurwitz estimator [Hansen and Hurwitz 1943]
• Stratified sampling [Neyman 1934]

Outline

Introduction

Sampling with replacements (Random Walks):
• MHRW vs RWRW
• Multigraph Sampling
• Stratified Weighted Random Walk (S-WRW)

Sampling without replacements (Traversals):
• The bias of BFS (and of DFS/RDS/…)

Estimation from a sample

Conclusion and Future Directions


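Random Walk in Facebook

A simple random walk samples node v with probability proportional to its degree: Pr(sampling v) ~ kv, where kv is the degree of node v.

[Figure: node degree distribution in Facebook. pk - real node degree distribution; qk - observed node degree distribution.]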
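The degree bias of a simple random walk can be reproduced on a tiny example. The sketch below is illustrative only: the five-node graph, the helper name `random_walk` and the seed are assumptions, not taken from the Facebook measurements. It compares the empirical visit frequency of each node with kv divided by the sum of all degrees.

```python
import random
from collections import Counter

def random_walk(adj, start, steps, rng):
    """Simple random walk: at each step move to a uniformly chosen neighbor."""
    v = start
    visited = []
    for _ in range(steps):
        v = rng.choice(adj[v])
        visited.append(v)
    return visited

# Hypothetical toy graph: hub 0 linked to nodes 1..4, plus the edge 1-2.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}

rng = random.Random(0)
walk = random_walk(adj, start=0, steps=200_000, rng=rng)
counts = Counter(walk)

total_degree = sum(len(nbrs) for nbrs in adj.values())
for v in sorted(adj):
    observed = counts[v] / len(walk)          # empirical sampling frequency
    expected = len(adj[v]) / total_degree     # k_v / sum of all degrees
    print(v, round(observed, 3), round(expected, 3))
```

The hub is visited roughly four times as often as a leaf, which is exactly the bias that the observed degree distribution qk inherits.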

How to get an unbiased sample?

Metropolis-Hastings Random Walk (MHRW):
• The sample S = D, A, A, C, … is asymptotically uniform.

Re-Weighted Random Walk (RWRW):
• Collect a classic (biased) RW sample…
• Now apply the Hansen-Hurwitz estimator.
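Both approaches can be sketched in a few lines. The code below assumes the textbook forms of the two techniques: MHRW proposes a uniformly chosen neighbor w of the current node u and accepts it with probability min(1, ku/kw); RWRW weights every node of a plain RW sample by 1/kv (Hansen-Hurwitz). The toy graph and all function names are hypothetical, and the target quantity is the average node degree.

```python
import random

def mhrw_step(adj, u, rng):
    """One MHRW step targeting the uniform distribution over nodes."""
    w = rng.choice(adj[u])                        # propose a uniformly chosen neighbor
    if rng.random() < len(adj[u]) / len(adj[w]):  # accept with probability min(1, k_u / k_w)
        return w
    return u                                      # rejected: stay put, u is sampled again

def hansen_hurwitz_mean(sample, value, degree):
    """RWRW: weight every sampled node by 1 / degree to undo the RW bias."""
    num = sum(value(v) / degree(v) for v in sample)
    den = sum(1.0 / degree(v) for v in sample)
    return num / den

# Hypothetical toy graph; its true average degree is 10 / 5 = 2.0.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}
degree = lambda v: len(adj[v])

rng = random.Random(1)
mhrw_sample, rw_sample = [], []
u = v = 0
for _ in range(200_000):
    u = mhrw_step(adj, u, rng)
    v = rng.choice(adj[v])          # classic (biased) random walk
    mhrw_sample.append(u)
    rw_sample.append(v)

mhrw_estimate = sum(map(degree, mhrw_sample)) / len(mhrw_sample)
rwrw_estimate = hansen_hurwitz_mean(rw_sample, degree, degree)
naive_rw_estimate = sum(map(degree, rw_sample)) / len(rw_sample)
print(mhrw_estimate, rwrw_estimate, naive_rw_estimate)
```

On this graph the uncorrected RW average converges to 2.6 (the sum of squared degrees over the sum of degrees), while both MHRW and RWRW approach the true value 2.0.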

Facebook results

[Figure: results with and without bias correction.]

The correction also removes the bias of all other metrics.

MHRW or RWRW?

RWRW is better than MHRW
• RWRW requires 1.5 to 7 times fewer samples than MHRW, i.e., it is 1.5-7 times more efficient.
• Intuition?

However:
• Pathological counter-examples exist.
• MHRW is easier to use (it does not require reweighting).

[1] Minas Gjoka, Maciej Kurant, Carter T. Butts and Athina Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM 2010.

Online Convergence Diagnostics

• Inferences assume that samples are drawn from the stationary distribution.
• No ground truth is available in practice.
• MCMC literature: online diagnostics.

Acceptable convergence between 500 and 3000 iterations (depending on the property of interest) [1].
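As an illustration of an online diagnostic from the MCMC literature, the sketch below computes a simplified Geweke-style score. The choice of this particular diagnostic is an assumption made for the example, and the simplification is deliberate: plain sample variances are used, so autocorrelation within the walk is ignored, whereas the full diagnostic relies on spectral variance estimates. The function name and the 10% / 50% window sizes are conventional defaults, not values from the talk.

```python
import math
import statistics

def geweke_z(values, first=0.1, last=0.5):
    """Simplified Geweke score: compare the mean of the first 10% of the chain
    with the mean of the last 50%. Plain sample variances are used, so
    autocorrelation is ignored (an intentional simplification)."""
    n = len(values)
    a = values[: int(first * n)]
    b = values[n - int(last * n):]
    spread = math.sqrt(statistics.variance(a) / len(a) +
                       statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / spread

# A sequence that keeps drifting gets a large |z|; a stable one scores near 0.
print(geweke_z(list(range(100))))
print(geweke_z([0, 1] * 50))
```

A common rule of thumb treats |z| below roughly 2 as no evidence against convergence for the tracked property (e.g., the node degree along the walk).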


Multigraph sampling

E.g., in LastFM, users are linked by several relations, each forming its own graph:
• Friends
• Events
• Groups

G* = Friends + Events + Groups (G* is a multigraph)

[2] Minas Gjoka, Carter T. Butts, Maciej Kurant, Athina Markopoulou, “Multigraph Sampling of Online Social Networks”, JSAC 2011.

Efficient implementation (saves bandwidth), from the current node H:
1) Select relation graph Gi with probability deg(H, Gi) / deg(H, G*).
2) Within Gi, choose an edge uniformly at random, i.e., with probability 1/deg(H, Gi).
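The two-stage selection can be sketched as follows. This is a minimal illustration, assuming each relation graph is available as an adjacency dictionary; the function name `multigraph_step` and the toy relations are hypothetical. The union multigraph G* is never built: only the per-relation degrees of the current node are needed to choose a relation.

```python
import random

def multigraph_step(relations, node, rng):
    """One random-walk step on the union multigraph G*, without materializing it.
    relations: list of adjacency dicts, one per relation graph G_i."""
    degrees = [len(g.get(node, ())) for g in relations]
    # 1) pick relation G_i with probability deg(node, G_i) / deg(node, G*)
    g = rng.choices(relations, weights=degrees, k=1)[0]
    # 2) within G_i, pick one incident edge uniformly at random
    return rng.choice(g[node])

# Hypothetical relations: in "friends" alone, nodes c and d are isolated.
friends = {"a": ["b"], "b": ["a"]}
events = {"b": ["c"], "c": ["b"]}
groups = {"b": ["c"], "c": ["b", "d"], "d": ["c"]}
relations = [friends, events, groups]

rng = random.Random(0)
node, walk = "a", []
for _ in range(1000):
    node = multigraph_step(relations, node, rng)
    walk.append(node)

print(sorted(set(walk)))
```

Since every edge of G* is chosen with the same probability 1/deg(node, G*), the walk is a simple random walk on the multigraph, and nodes that are isolated in one relation remain reachable through the others.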

Applied to LastFM:
- better coverage of previously isolated nodes
- better estimates of distributions and means


Not all nodes are equal

Node categories: irrelevant, important, (equally) important (e.g., China and Sweden).

Stratification under Weighted Independence Sampler (WIS)
(node size is proportional to its sampling probability)

But graph exploration techniques have to follow the links!

Trade-off between
• ideal (WIS) sampling weights
• fast convergence

Enforcing WIS weights may lead to slow (or no) convergence.

Assumption: On sampling a node, we learn the categories of its neighbors.

Fastest Mixing Markov Chain [Boyd et al. 2004]

Stratified Weighted Random Walk (S-WRW)

1) Measurement objective
E.g., compare the size of red and green categories.

2) Category weights optimal under WIS
Stratified sampling theory + information collected by pilot RW.

3) Modified category weights
Problem 1: Poor or no connectivity.
Solution: Small weight > 0 for irrelevant categories. f* - the fraction of time we plan to spend in irrelevant nodes (e.g., 1%).
Problem 2: “Black holes”.
Solution: Limit the weight of tiny relevant categories. Γ - maximal factor by which we can increase edge weights (e.g., 100 times).

4) Edge weights in G
Target edge weights, using vol(green) from the pilot RW.
Resolve conflicts: arithmetic mean, geometric mean, max, …

5) WRW sample

6) Final result
Hansen-Hurwitz estimator
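The last steps of the pipeline (edge weights, WRW sample, Hansen-Hurwitz correction) can be sketched as below. This is a simplified illustration and not the S-WRW weight computation itself: the category weights are hard-coded hypothetical values rather than derived from a pilot RW, f* and Γ are not modeled, and conflicts are resolved with the geometric mean. The toy graph and all names are assumptions.

```python
import math
import random

def build_edge_weights(adj, category, cat_weight):
    """Edge weight = geometric mean of the two endpoint category weights."""
    return {u: [math.sqrt(cat_weight[category[u]] * cat_weight[category[v]])
                for v in adj[u]]
            for u in adj}

def weighted_random_walk(adj, weights, start, steps, rng):
    """WRW: move to a neighbor with probability proportional to the edge weight."""
    u, sample = start, []
    for _ in range(steps):
        u = rng.choices(adj[u], weights=weights[u], k=1)[0]
        sample.append(u)
    return sample

def category_fractions(sample, weights, category):
    """Hansen-Hurwitz: node v is sampled with probability ~ w(v), the sum of
    its edge weights, so each sampled node counts 1 / w(v)."""
    totals = {}
    for v in sample:
        c = category[v]
        totals[c] = totals.get(c, 0.0) + 1.0 / sum(weights[v])
    norm = sum(totals.values())
    return {c: x / norm for c, x in totals.items()}

# Hypothetical toy graph with two relevant categories and one irrelevant one.
adj = {0: [1, 5, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
category = {0: "green", 1: "green", 2: "red",
            3: "irrelevant", 4: "irrelevant", 5: "irrelevant"}
cat_weight = {"green": 1.0, "red": 4.0, "irrelevant": 0.1}  # illustrative values

weights = build_edge_weights(adj, category, cat_weight)
sample = weighted_random_walk(adj, weights, 0, 200_000, random.Random(2))
fractions = category_fractions(sample, weights, category)
share_relevant = sum(category[v] != "irrelevant" for v in sample) / len(sample)
print(fractions, share_relevant)
```

The walk spends most of its time in relevant nodes, yet the reweighted estimate still recovers the true category sizes (2/6 green, 1/6 red, 3/6 irrelevant), because every sample is divided by its own sampling weight.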

Colleges in Facebook

[Figure: comparison of several versions of S-WRW with Random Walk (RW).]

Samples in colleges: 86% of S-WRW, 9% of RW. This is because S-WRW avoids irrelevant categories.

The difference is larger (100x) for small colleges. This is due to S-WRW’s stratification.

RW required 10-15 times more samples than S-WRW to achieve the same accuracy.

[3] Maciej Kurant, Minas Gjoka, Carter T. Butts and Athina Markopoulou, “Walking on a Graph with a Magnifying Glass”, SIGMETRICS 2011.

Sampling with replacements: Summary

RWRW is 1.5-7 times more efficient than MHRW [1]
• counter-examples exist

Multigraph Sampling [2]
• walking on multiple relations improves efficiency

Stratified Weighted Random Walk [3]
• oversamples relevant regions, undersamples irrelevant regions
• 10-15 fold savings in sampling cost

Online Convergence Diagnostics [1]


Sampling without replacements (Traversals)

Examples:
• BFS (Breadth-First Search)
• DFS (Depth-First Search)
• Forest Fire
• RDS (Respondent-Driven Sampling)
• Snowball sampling
• …

Why sample with BFS?
• BFS is a well known textbook technique
• BFS sample is a nice looking graph
• It is used in practice [Ahn et al. 2007, Mislove et al. 2007, Wilson et al. 2009]

BFS in Facebook

BFS (Breadth First Search) with f = 0.5% of nodes sampled

[Figure: node degree distribution observed by BFS, annotated with the real average node degree, the real average squared node degree, and “(338 for RW)”.]

This bias has been empirically observed in the past [Najork et al. 2001].

Our goals:
• Formally analyze the bias of BFS (challenging due to dependencies)
• Correct for this bias
• (no new sampling method proposed)

Goal: Analyze the bias of BFS

Graph traversals on RG(pk): what is the expected observed degree distribution qk(f), as a function of the fraction f of sampled nodes?

Graph model RG(pk)

• Random graph RG(pk) with a given node degree distribution pk (or degree sequence)
• Can be generated by the configuration model, which pairs ‘stubs’ (half-edges) uniformly at random
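A minimal sketch of the configuration model follows, assuming the standard construction: every node of degree k contributes k stubs, the stubs are shuffled, and consecutive stubs are paired into edges. The function name is hypothetical, and the result may contain self-loops and multi-edges, as usual for this model.

```python
import random

def configuration_model(degrees, rng):
    """Pair stubs uniformly at random; returns a list of edges (u, v).
    Self-loops and multi-edges are possible."""
    if sum(degrees) % 2:
        raise ValueError("the sum of degrees must be even")
    # Node v appears once per stub, i.e., degrees[v] times.
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list form an edge.
    return list(zip(stubs[0::2], stubs[1::2]))

degrees = [3, 2, 2, 1]
edges = configuration_model(degrees, random.Random(0))
print(edges)
```

Whatever the pairing, each node ends up with exactly its prescribed number of edge endpoints, which is what makes the model a convenient generator for RG(pk).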

Approach 1: Brute force

Generate all possible graphs, and ... No way!

Remedy: “The Principle of Deferred Decisions”

So we can generate the graph ‘on the fly’, while exploring it!

Approach 2: The Principle of Deferred Decisions

[Figure: worked example with nodes u, v, w and the probabilities for the i-th sampled node; the formulas are not recoverable.]

This does not scale! (because of dependencies between stubs)

* we assumed that the generated graph is connected

Approach 2b: Breaking the stub dependencies

[Figure: time axis t, from 0 to 1.]

Originally proposed in:
J. H. Kim, “Poisson cloning model for random graphs,” International Congress of Mathematicians (ICM), 2006 (preprint in 2004).

Developed in:
D. Achlioptas, A. Clauset, D. Kempe, and C. Moore, “On the bias of traceroute sampling: or, power-law degree distributions in regular graphs,” in STOC, 2005.

(both in a different context)

Node v of degree k is not sampled before time t:
$$\Pr(v \text{ not sampled before } t) = (1-t)^k$$

Expected number of nodes of degree k sampled before time t:
$$n_k \left(1-(1-t)^k\right), \qquad n_k = \text{number of nodes of degree } k$$

Degree distribution expected to be observed:
$$q_k(t) = \frac{p_k \left(1-(1-t)^k\right)}{\sum_l p_l \left(1-(1-t)^l\right)}$$

Expected fraction of sampled nodes (so t(f) is well defined):
$$f(t) = 1 - \sum_k p_k (1-t)^k$$

Corrected node degree distribution, where $\hat{q}_k$ is the degree distribution observed in the sample:
$$\hat{p}_k = \frac{\hat{q}_k / \left(1-(1-t(f))^k\right)}{\sum_l \hat{q}_l / \left(1-(1-t(f))^l\right)}$$
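The quantities on this slide can be evaluated numerically. The sketch below assumes the RG(pk) result that a node of degree k remains unsampled before time t with probability (1-t)^k; all function names are hypothetical, degrees are assumed to be at least 1, and t is treated as known when applying the correction (in practice it has to be inferred from the sampled fraction f). The degree distribution used is an arbitrary example.

```python
def sampled_fraction(p, t):
    """f(t) = 1 - sum_k p_k (1 - t)^k : expected fraction of sampled nodes."""
    return 1.0 - sum(pk * (1.0 - t) ** k for k, pk in p.items())

def expected_observed(p, t):
    """q_k(t) = p_k (1 - (1-t)^k) / sum_l p_l (1 - (1-t)^l)."""
    raw = {k: pk * (1.0 - (1.0 - t) ** k) for k, pk in p.items()}
    norm = sum(raw.values())
    return {k: r / norm for k, r in raw.items()}

def time_for_fraction(p, f, tol=1e-12):
    """Invert f(t) by bisection; f(t) increases from 0 to 1 on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sampled_fraction(p, mid) < f:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def corrected_distribution(q, t):
    """p_k proportional to q_k / (1 - (1-t)^k), then normalized."""
    raw = {k: qk / (1.0 - (1.0 - t) ** k) for k, qk in q.items()}
    norm = sum(raw.values())
    return {k: r / norm for k, r in raw.items()}

p = {1: 0.5, 2: 0.3, 10: 0.2}          # hypothetical true degree distribution
t = time_for_fraction(p, 0.1)          # time at which 10% of nodes are sampled
q = expected_observed(p, t)            # biased toward high degrees
p_hat = corrected_distribution(q, t)   # recovers p
print(q, p_hat)
```

Evaluating `expected_observed` for t close to 0 gives a distribution proportional to k·pk, and for t = 1 it returns pk itself, which matches the two limiting cases f→0 and f→1.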

Main results

• For small sample size (for f→0), BFS has the same bias as RW.
• This bias monotonically decreases with f. We found analytically the shape of this curve.
• For large sample size (for f→1), BFS becomes unbiased.
• Under RG(pk), all traversals are subject to exactly the same bias.

What if the graph is not random?

[Figure legend: expected, sampled, true, corrected; MHRW and RWRW shown for comparison.]

[4] Maciej Kurant, Athina Markopoulou, Patrick Thiran, “On the Bias of BFS”, JSAC 2011.

Python code available at: http://mkurant.com/publications

Sampling without replacements: Summary

A difficult problem
• Dependencies between samples

We computed analytically the bias of BFS in RG(pk) [4]
• Initial bias the same as that of RW
• Same bias for all traversals (BFS, DFS, RDS, …) under RG(pk)
• A bias correction procedure
• Works well for real-life graphs

If possible, prefer methods with replacements.


1) Local properties

Node properties:
• Community membership information
• Privacy settings
• Names
• …

Local topology properties:
• Node degree distribution
• Assortativity
• Clustering coefficient
• …

Example: Privacy Awareness in Facebook ’09

Privacy Awareness (PA) - fraction of users that change the default privacy settings.

2) Estimating the graph size

• Counts repeated nodes: “Reversed Birthday Paradox”
• Work in progress
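The idea of counting repeated nodes can be illustrated with the classic collision estimator. The sketch assumes a uniform, independent sample drawn with replacement, which is a simplification and not the estimator under development in the talk: among n samples from N nodes, the expected number of colliding pairs is n(n-1)/(2N), so N is approximately n(n-1)/(2C) for C observed collisions. The function name is hypothetical.

```python
from collections import Counter

def estimate_size_from_collisions(sample):
    """N is approximately n (n - 1) / (2 C), where C is the number of pairs of
    samples that hit the same node (assumes uniform sampling with replacement)."""
    n = len(sample)
    collisions = sum(c * (c - 1) // 2 for c in Counter(sample).values())
    if collisions == 0:
        raise ValueError("no repeated nodes: the sample is too small")
    return n * (n - 1) / (2 * collisions)

# 10 samples with two colliding pairs (nodes 1 and 2 each appear twice).
sample = [1, 2, 3, 1, 4, 5, 2, 6, 7, 8]
print(estimate_size_from_collisions(sample))   # 10 * 9 / (2 * 2) = 22.5
```

For samples collected by a random walk, each sample would additionally have to be reweighted by its sampling probability, which is not shown here.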

3) Coarse-grained topology

Weight of the link between categories A and B: the probability that a random node in A is a neighbor of a random node in B (estimator given in [5]).

From a randomly sampled set of nodes we infer a valid topology!

[5] M. Kurant, M. Gjoka, Y. Wang, Z. W. Almquist, C. T. Butts, A. Markopoulou, “Coarse-Grained Topology Estimation”, arXiv:1105.5488
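To make the definition concrete, the sketch below computes the category-to-category weight on a fully known toy graph as the number of edges between A and B divided by |A|·|B|, i.e., the probability that a random node in A is a neighbor of a random node in B. This only illustrates the quantity being estimated; it is not the sample-based estimator of [5], and the function name and data are hypothetical.

```python
def category_graph(edges, category):
    """w(A, B) = |E_AB| / (|A| * |B|) for every pair of distinct categories
    that share at least one edge."""
    size = {}
    for c in category.values():
        size[c] = size.get(c, 0) + 1
    cross = {}
    for u, v in edges:
        a, b = category[u], category[v]
        if a != b:                            # only edges between categories
            key = tuple(sorted((a, b)))
            cross[key] = cross.get(key, 0) + 1
    return {key: n / (size[key[0]] * size[key[1]]) for key, n in cross.items()}

category = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (4, 5)]
print(category_graph(edges, category))   # 3 cross edges / (2 * 3) = 0.5
```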

geosocialmap.com

Public and private colleges in the USA (geosocialmap.com)

The world according to Facebook (geosocialmap.com)

Strong clusters among Middle Eastern countries:
• Saudi Arabia
• United Arab Emirates
• Lebanon
• Jordan
• Israel

Summary

Random Walks (with replacements)
• RWRW > MHRW [1]
• Convergence Diagnostics
• Multigraph sampling [2]
• Stratified WRW [3]

Traversals (no replacements) [4]

Coarse-grained topologies [5]

References
[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case Study of Unbiased Sampling of OSNs”, INFOCOM 2010.
[2] M. Gjoka, C. T. Butts, M. Kurant and A. Markopoulou, “Multigraph Sampling of Online Social Networks”, JSAC 2011.
[3] M. Kurant, M. Gjoka, C. T. Butts and A. Markopoulou, “Walking on a Graph with a Magnifying Glass”, SIGMETRICS 2011.
[4] M. Kurant, A. Markopoulou and P. Thiran, “On the bias of BFS (Breadth First Search)”, JSAC 2011.
[5] M. Kurant, M. Gjoka, Y. Wang, Z. W. Almquist, C. T. Butts, A. Markopoulou, “Coarse-Grained Topology Estimation”, arXiv:1105.5488.
[6] Datasets available from: http://odysseas.calit2.uci.edu/osn

Thank you
mkurant.com
