1
Influence at Scale: Distributed Computation of Complex Contagion in Networks Brendan Lucier, Joel Oren, Yaron Singer Microsoft Research, University of Toronto, Harvard University Motivation โ€ข Information requirements: how much of the graph do we need to examine? (query complexity). โ€ข Efficient parallel computation: how to distribute computation; maintain high-volume intermediate data. The Framework The influence model: The Independent Cascade Model [KKTโ€™03] โ€ข Input: edge-weighted graph = (, ,p), initial seeds 0 โŠ† . โ€ข At every synchronous step >0, every node previously infected node โˆˆ โˆ’ โˆ’1 successfully infects an uninfected neighbor w.p. . +1 -- set of infected nodes at step +1. โ€ข The influence function: 0 = [ ] The MRC parallel computation model [KSVโ€™10] โ€ข Inspired by the MapReduce paradigm. โ€ข Synchronous round computation on ; on tuples. On every round: 1. Map: apply local transformation on tuples in a streaming fashion. 2. Reduce: polytime computation on aggregates of tuples with same key. โ€ข Constraintaints: 1โˆ’ machines and space per machine; log rounds. โ€ข Graph access model: Link server. The Information Requirements of Influence Estimation โ€ข Question: how much of the graph to we need to know in order to estimate the influence? โ€ข Concretely: what is the query complexity of approximating ( 0 )? Theorem: Let โˆˆ 0, 1 3 , โˆˆ [0, 1 2 ]. For large enough = ||, โˆƒ = (, , ) for which getting an estimate ( 0 ) of ( 0 ) s.t. 1โˆ’ 0 โ‰ค 0 โ‰ค ( 0 ) with confidence โ‰ฅ1โˆ’ requires ฮฉ((1 โˆ’ 2) link server queries. Main algorithm: Intermediate Layer; Estimating the Expectation Main intuitions: 1. High/low influence few/many samples needed, 2. Can compute the Riemann integral for CDF ( 0 = 0 โˆž 1 โˆ’ () ) Lemma (informal):w.h.p., VerifyGuess returns True if > . Main Algorithm: Top Layer; Iterating on Candidate Values Theorem: For any โˆˆ (0, 1 4 ), InfEst provides a (1 + 8)-approx. w.h.p. Algorithm: Bottom Layer; Bounded Samples Highlights: Input: Seeds 0 โŠ† , link-server oracle (โ‹…), - # samples, bound . 1. Simultaneous prob-BFS in MapReduce, one layer per round. 2. Finalize a sample when at least nodes infected. 3. Return fraction of samples that reached bound. Theorem (informal): 1) If is suff. small, # of tuples is ฮฉ( 1.5 log 4 ). 2) If = log , algorithm takes polylog rounds. Experimental Results 1 2 5 3 6 7 4 0 L-S () + () Running time Approx. ratios (Epinions) Heuristics (Epinions) Scalable estimation of the influence of individuals in massive social networks

Influence at Scale: Distributed Computation of Complex

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Influence at Scale: Distributed Computation of Complex

Influence at Scale: Distributed Computation of Complex Contagion in Networks

Brendan Lucier, Joel Oren, Yaron Singer Microsoft Research, University of Toronto, Harvard University

Motivation

โ€ข Information requirements: how much of the graph do we need to examine? (query complexity).โ€ข Efficient parallel computation: how to distribute computation; maintain high-volume intermediate data.

The FrameworkThe influence model: The Independent Cascade Model [KKTโ€™03]โ€ข Input: edge-weighted graph ๐บ = (๐‘‰, ๐ธ,p), initial seeds ๐‘†0 โŠ† ๐‘‰.โ€ข At every synchronous step ๐‘ก > 0, every node previously infected node

๐‘ข โˆˆ ๐‘†๐‘ก โˆ’ ๐‘†๐‘กโˆ’1 successfully infects an uninfected neighbor ๐‘ฃ w.p. ๐‘๐‘ข๐‘ฃ.๐‘†๐‘ก+1 -- set of infected nodes at step ๐‘ก + 1.

โ€ข The influence function: ๐‘“ ๐‘†0 = ๐ธ[๐‘†๐‘ก]

The MRC parallel computation model [KSVโ€™10]โ€ข Inspired by the MapReduce paradigm.โ€ข Synchronous round computation on ๐‘ ๐‘˜; ๐‘ฃ on tuples. On every round:

1. Map: apply local transformation on tuples in a streaming fashion.2. Reduce: polytime computation on aggregates of tuples with same

key.โ€ข Constraintaints: ๐‘1โˆ’๐œ– machines and space per machine; log๐‘๐‘ rounds.

โ€ข Graph access model: Link server.

The Information Requirements of Influence Estimation

โ€ข Question: how much of the graph to we need to know in order to estimate the influence?

โ€ข Concretely: what is the query complexity of approximating ๐‘“(๐‘†0)?

Theorem: Let ๐œ– โˆˆ 0,1

3, ๐›ฟ โˆˆ [0,

1

2]. For large enough ๐‘› = |๐‘‰|, โˆƒ๐บ = (๐‘‰, ๐ธ, ๐’‘)

for which getting an estimate ๐‘“(๐‘†0) of ๐‘“(๐‘†0) s.t. 1 โˆ’ ๐œ– ๐‘“ ๐‘†0 โ‰ค ๐‘“ ๐‘†0 โ‰ค ๐‘“(๐‘†0) with confidence โ‰ฅ 1 โˆ’ ๐›ฟ requires ฮฉ((1 โˆ’ 2๐›ฟ) ๐‘› link server queries.

Main algorithm: Intermediate Layer; Estimating the Expectation

Main intuitions:1. High/low influence few/many samples needed,

2. Can compute the Riemann integral for CDF (๐‘“ ๐‘†0 = 0

โˆž1 โˆ’ ๐น(๐ผ) ๐‘‘๐ผ)

Lemma (informal):w.h.p., VerifyGuess returns True if ๐œ > ๐‘–๐‘›๐‘“๐‘™๐‘ข๐‘’๐‘›๐‘๐‘’.

Main Algorithm: Top Layer; Iterating on Candidate Values

Theorem: For any ๐œ– โˆˆ (0,1

4), InfEst provides a (1 + 8๐œ–)-approx. w.h.p.

Algorithm: Bottom Layer; Bounded SamplesHighlights: Input: Seeds ๐‘†0 โŠ† ๐‘‰, link-server oracle ๐‘„(โ‹…), ๐ฟ- # samples, bound ๐‘ก.1. Simultaneous prob-BFS in MapReduce, one layer per round.2. Finalize a sample when at least ๐‘ก nodes infected.3. Return fraction of samples that reached bound.

Theorem (informal): 1) If ๐ฟ is suff. small, # of tuples is ฮฉ(๐‘š1.5 log4 ๐‘š).2) If ๐‘‘๐‘–๐‘Ž๐‘š ๐บ = log๐‘ ๐‘›, algorithm takes polylog rounds.

Experimental Results

1

2

5

3

6

7

4

๐บ

๐‘

๐‘Ž

๐‘ฅ

๐‘ ๐‘‘

๐‘†0

๐‘๐‘ฅ๐‘

๐‘๐‘ฅ๐‘Ž๐‘๐‘๐‘‘๐‘๐‘๐‘

๐‘๐‘‘๐‘Ž

L-S๐‘„(๐‘ข)

๐‘+(๐‘ข)

๐‘

Running timeApprox. ratios (Epinions)

Heuristics(Epinions)

Scalable estimation of the influence of individuals in massive social networks