Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Page 1: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Information Diffusion
Mary McGlohon
CMU 10-802
3/23/10

Page 2: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Outline
• Intro: Models for diffusion
  ▫ Epidemiological: SIS/SIR/SIRS
  ▫ Threshold models
• Case studies
  ▫ SIR: Info diffusion in blogs
  ▫ SIS: Cascades in blogs
  ▫ Timing: Cascades in chain letters
  ▫ A closer look: Network-based Marketing

Page 3: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Epidemiological: SIS
• Susceptible, Infected, Susceptible
  ▫ Infected for t_I timesteps
  ▫ While infected, transmits with probability b
  ▫ After t_I steps, returns to susceptible

Page 4: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Epidemiological: SIR
• Susceptible, Infected, Removed
  ▫ Infected for t_I timesteps
  ▫ While infected, transmits with probability b
  ▫ After t_I steps, goes to removed/recovered

Page 5: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Epidemiological: SIRS
• Susceptible, Infected, Removed, Susceptible
  ▫ Combination of SIS+SIR
  ▫ After t_I steps, goes to removed/recovered
  ▫ After t_R steps, returns to susceptible
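The three models above differ only in what happens after the infectious period, so they can be sketched with one discrete-time simulator. This is a minimal illustration, not code from the lecture: the function name `simulate_epidemic`, the `model` switch, and the toy graphs are assumptions; `b`, `t_I`, and `t_R` follow the slides.

```python
import random

def simulate_epidemic(neighbors, seeds, b, t_I, model="SIR", t_R=0, steps=30, seed=0):
    """Discrete-time SIS / SIR / SIRS on a graph given as {node: [neighbors]}.

    Returns the number of infected nodes after each timestep.
    """
    rng = random.Random(seed)
    state = {v: "S" for v in neighbors}
    clock = {v: 0 for v in neighbors}  # steps spent in the current I or R state
    for v in seeds:
        state[v] = "I"
    history = []
    for _ in range(steps):
        # 1. Transmission: each infected node tries each susceptible neighbor.
        newly_infected = set()
        for v, s in state.items():
            if s == "I":
                for u in neighbors[v]:
                    if state[u] == "S" and rng.random() < b:
                        newly_infected.add(u)
        # 2. Ageing: leave I after t_I steps; for SIRS, leave R after t_R steps.
        for v in neighbors:
            if state[v] == "I":
                clock[v] += 1
                if clock[v] >= t_I:
                    state[v] = "S" if model == "SIS" else "R"
                    clock[v] = 0
            elif state[v] == "R" and model == "SIRS":
                clock[v] += 1
                if clock[v] >= t_R:
                    state[v] = "S"
                    clock[v] = 0
        # 3. New infections take effect last, so they start with a fresh clock.
        for u in newly_infected:
            state[u] = "I"
            clock[u] = 0
        history.append(sum(1 for s in state.values() if s == "I"))
    return history

# Toy graphs (hypothetical): a 4-node path and a 2-node pair.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pair = {0: [1], 1: [0]}
sir_run = simulate_epidemic(path, [0], b=1.0, t_I=1, model="SIR", steps=5)
sis_run = simulate_epidemic(pair, [0], b=1.0, t_I=1, model="SIS", steps=5)
```

With certain transmission, the SIR run burns through the path once and dies out, while the SIS run keeps bouncing between the two nodes.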

Page 6: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Epidemiological: Networks
• Historically, SIS/SIR assumed a person could infect anybody else (a full clique). There is an epidemic threshold in SIS.
• For random power-law networks, threshold = 0 [Pastor-Satorras + Vespignani]
  ▫ (But not for PL networks with high clustering coefficients [Eguíluz and Klemm])
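As background for the zero-threshold claim, the standard heterogeneous mean-field result for SIS on uncorrelated random networks can be written as follows. The notation is introduced here and is not from the slides: $\lambda_c$ is the critical spreading rate, $k$ the node degree, and $\gamma$ the power-law exponent.

```latex
\lambda_c = \frac{\langle k \rangle}{\langle k^2 \rangle},
\qquad
P(k) \propto k^{-\gamma},\ 2 < \gamma \le 3
\;\Longrightarrow\;
\langle k^2 \rangle \to \infty
\;\Longrightarrow\;
\lambda_c \to 0 .
```

In words: when the degree variance diverges in the infinite-size limit, any positive spreading rate can sustain an epidemic.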

Page 7: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Threshold Models
• Each node in the network has a threshold, and each of its neighbors carries a weight
• If the total weight of a node's adopted neighbors reaches its threshold, the node adopts.
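The adoption rule above can be sketched as a fixed-point iteration. This is one common reading (a linear threshold model), offered as an assumption rather than the lecture's exact definition; the names `linear_threshold`, `in_weights`, and the example weights are hypothetical.

```python
def linear_threshold(in_weights, thresholds, seeds):
    """in_weights[v] = {u: w_uv}: influence of neighbor u on node v.

    A node adopts once the summed weight of its adopted neighbors
    reaches thresholds[v]. Returns the final set of adopters.
    """
    adopted = set(seeds)
    changed = True
    while changed:  # repeat until no further node crosses its threshold
        changed = False
        for v, nbrs in in_weights.items():
            if v in adopted:
                continue
            influence = sum(w for u, w in nbrs.items() if u in adopted)
            if influence >= thresholds[v]:
                adopted.add(v)
                changed = True
    return adopted

# Hypothetical 4-node example: "a" is the initial adopter.
in_weights = {
    "a": {},
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {"c": 0.2},
}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5}
adopters = linear_threshold(in_weights, thresholds, seeds={"a"})
```

Here "b" adopts because of "a" alone, "c" adopts only once both "a" and "b" have adopted, and "d" never reaches its threshold.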

Page 9: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Info Diffusion in Blogs
• D. Gruhl, R. Guha, D. Liben-Nowell, A. Tomkins. Information Diffusion Through Blogspace. In WWW '04 (2004).
• Goal: How do topics trend in blogs, and how can we model diffusion of topics?

Page 10: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Info Diffusion in Blogs
• Data: Crawled 11K blogs, 400K posts.
• Found 340 topics:
  ▫ apple arianna ashcroft astronaut blair boykin bustamante chibi china davis diana farfarello guantanamo harvard kazaa longhorn schwarzenegger udell siegfried wildfires zidane gizmodo microsoft saddam

Page 11: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Info Diffusion in Blogs
• Topics = Chatter + Spikes
  ▫ Chatter: Alzheimer
  ▫ Spike: Chibi
  ▫ Spiky Chatter: Microsoft

Page 12: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Info Diffusion in Blogs
• Modeled as SIR
  ▫ Some set of authors is infected to write about a topic
  ▫ Then propagate, as others write new posts on that topic
  ▫ Measure the topic over time and other properties
• Fit using EM
  ▫ Compute probability of propagation along each edge

Page 13: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Info Diffusion in Blogs
• Validation:
  ▫ Synthetic
     Used a modified Erdős-Rényi graph and simulated propagation over it
     Found that EM was able to identify transmission along most edges
  ▫ Real
     Found "internet-only" topics
     Looked at the most highly ranked expected transmission links; identified a real link in 90% of cases

Page 14: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Info Diffusion in Blogs
• Limitations of SIR
  ▫ No multiple postings
  ▫ No "stickiness" (which topics resonate with whom)
  ▫ No time-limiting factor in topics
  ▫ "Closed world assumption"
     No outside influences after initial infection

Page 16: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, Matthew Hurst. Cascading Behavior in Large Blog Graphs: Patterns and a Model. In SIAM International Conference on Data Mining (SDM '07) (2007).
• Goal: What do cascades (conversation trees) in blogs look like, and how can we model them?

Page 17: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• Data:
  ▫ Gathered from August-September 2005
  ▫ Used set of 44,362 blogs, 2.4 million posts
  ▫ 245,404 blog-to-blog links
[Figure: number of posts per day (time in 1-day bins); x-axis marks Jul 4, Aug 1, Sep 29]

Page 18: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
[Figure: a blogosphere of four blogs (B1-B4) and the cascades extracted from it over posts a-e; example shapes labeled "Star" and "Chain"]
• What is the timing of links?
• What are cascade sizes?
• What are cascade shapes?

Page 19: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• What is the timing of links?
• Does popularity decay at a constant rate?
• With an exponential ("half life")?
[Figure panels: linear-linear scale, log-linear scale, log-log scale]

Page 20: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• Observation: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is:
  $p(t_p + \Delta) \propto \Delta^{-1.5}$
[Figure: log(# in-links) vs. log(days after post), slope = -1.5; also shown on linear-linear scale]
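A slope like the -1.5 above is typically read off a log-log plot. The sketch below shows that idea with an ordinary least-squares fit on synthetic, noise-free data; the helper name `loglog_slope` and the data are assumptions, and least squares on log-log axes is only a rough estimator for real, noisy counts.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

# Synthetic in-link counts that decay exactly as (days after post)^-1.5.
days = list(range(1, 31))
links = [1000 * d ** -1.5 for d in days]
slope = loglog_slope(days, links)
```

On this idealized input the recovered slope matches the exponent used to generate the data; an exponential ("half life") decay would instead be a straight line on log-linear axes.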

Page 21: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• How are cascade sizes distributed?
• Geometric distribution?
[Figure panels: linear-linear scale, log-linear scale, log-log scale; example cascade over posts a-e]

Page 22: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• Q: What size distribution do cascades follow? Are large cascades frequent?
• Observation: The probability of observing a cascade of n blog posts follows a Zipf distribution:
  $p(n) \propto n^{-2}$
[Figure: log(count) vs. log(cascade size in # of nodes), slope = -2; example cascade over posts a-e]
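To make "are large cascades frequent?" concrete, the sketch below compares the tail of a truncated Zipf distribution with exponent 2 against a geometric distribution that has the same probability of size 1. The truncation at 10,000 and the helper names are assumptions chosen for illustration.

```python
def zipf_pmf(exponent, n_max):
    """Normalized p(n) proportional to n^-exponent for n = 1..n_max."""
    weights = [n ** -exponent for n in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def tail(pmf, n):
    """P(size >= n) for a pmf indexed from size 1."""
    return sum(pmf[n - 1:])

pmf = zipf_pmf(2.0, 10_000)
zipf_tail_100 = tail(pmf, 100)

# Geometric distribution with the same P(size = 1): P(size >= 100) = q^99.
q = 1 - pmf[0]
geometric_tail_100 = q ** 99
```

Under the power law, cascades of 100 or more posts are rare but far from negligible, whereas the matched geometric distribution makes them vanishingly unlikely.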

Page 23: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• How are cascade shapes distributed?
• More stars? More chains?
[Figure: example cascade over posts a-e]

Page 24: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• Q: What is the distribution of particular cascade shapes?
• Observation: Stars and chains in blog cascades also follow a power law, with different exponents (star -3.1, chain -8.5).
[Figures: log(count) vs. log(size of chain in # nodes), a = -8.5; log(count) vs. log(size of star in # nodes), a = -3.1]

Page 25: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• Based on SIS model in epidemiology
  ▫ Randomly pick a blog to infect, add its post to the cascade
  ▫ Infect each in-linked neighbor with a given infection probability
  ▫ Add infected neighbors' posts to the cascade.
  ▫ Set the old infected node to uninfected.
[Figure, built up over slides 25-31: blogs B1-B4, with labels p1,1, p2,1, p4,1 added step by step]
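The four steps of this SIS-style cascade generation can be sketched as below. It is an illustration under assumptions, not the authors' implementation: the infection probability is called `beta` here, posts are represented as `(blog, step)` pairs, the `max_steps` cap is added only to guarantee termination, and the toy link structure is hypothetical.

```python
import random

def generate_cascade(in_links, beta, rng, max_steps=100):
    """in_links[b] = blogs that link to blog b (its in-linked neighbors).

    Returns (start_blog, edges), where each edge is (child_post, parent_post)
    and a post is a (blog, step) pair.
    """
    start = rng.choice(sorted(in_links))  # randomly pick a blog to infect
    frontier = [(start, 0)]               # currently infected posts
    edges = []
    for step in range(1, max_steps + 1):
        if not frontier:
            break
        next_frontier = []
        for parent in frontier:
            for nbr in in_links[parent[0]]:
                if rng.random() < beta:   # infect each in-linked neighbor
                    child = (nbr, step)
                    edges.append((child, parent))  # add its post to the cascade
                    next_frontier.append(child)
        # Old infected nodes drop out here, so they may be infected again later.
        frontier = next_frontier
    return start, edges

# Hypothetical 4-blog network.
blog_in_links = {"B1": ["B2", "B4"], "B2": ["B3"], "B3": [], "B4": ["B1"]}
start, edges = generate_cascade(blog_in_links, beta=0.3, rng=random.Random(0))
```

Running this many times and tallying cascade sizes and shapes is one way to compare generated cascades with observed ones.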

Page 32: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• Most frequent cascades: model vs. data
[Figures: log(count) vs. log(cascade size in # nodes); log(count) vs. log(star size); log(count) vs. log(chain size); each comparing data and model]

Page 33: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Cascades in Blogs
• Limitations of SIS
  ▫ Closed world assumption
  ▫ Forced to set infection probability low to avoid large epidemics; possibly limits stars.
  ▫ No time limit; possibly overestimates chains.

Page 35: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Chain Letter Cascades
• David Liben-Nowell, Jon Kleinberg. Tracing the Flow of Information on a Global Scale Using Internet Chain-Letter Data. Proceedings of the National Academy of Sciences, Vol. 105, No. 12 (March 2008), pp. 4633-4638.
• Goal: How can we trace the path of a meme, and explain these paths?

Page 36: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Chain Letter Cascades
• Data: NPR chain letter records.
  ▫ People directed to sign and send back to admin
  ▫ Had several copies of lists, with overlaps
  ▫ Reconstructed the trees using edit distance
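For readers unfamiliar with edit distance, here is a minimal Levenshtein implementation that works on strings or on lists of signer names. The slide does not say exactly how the authors applied edit distance during reconstruction, so this only illustrates the distance itself; the example names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

# Two hypothetical copies of a signature list; the second has one extra signer.
copy_a = ["ann", "bob", "cy"]
copy_b = ["ann", "bob", "cy", "dee"]
list_distance = edit_distance(copy_a, copy_b)

# The same function tolerates small spelling differences between names.
name_distance = edit_distance("Kleinberg", "Klienberg")
```

A small distance between two copies suggests that one list extends the other or that they share a long common prefix.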

Page 37: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Chain Letter Cascades
• A reconstruction:

Page 38: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Chain Letter Cascades
• The tree:

Page 39: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Chain Letter Cascades
• How to model?
  ▫ These trees have much longer paths
  ▫ 2 considerations
     Spatial distance (geographic)
     Timing

Page 40: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Chain Letter Cascades
• Model: based on a delay distribution
• Nodes reply-to-all, so latecomers just append.

Page 41: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Chain Letter Cascades
• Validation: Simulated on a real social network (LiveJournal), produced similar trees.
• Limitations:
  ▫ The chain letter mechanism is a somewhat nontraditional form of diffusion
  ▫ Closed-world assumption is perhaps OK

Page 43: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• Shawndra Hill, Foster Provost, Chris Volinsky. Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, Vol. 21, No. 2 (2006), pp. 256-275.
• Question: Is there statistical evidence that network linkage directly affects product adoption?

Page 44: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• Data: Direct-mail marketing campaign for adopting a new communications service.
  ▫ 21 target segments, millions of customers
  ▫ Divided based on:
     Loyalty
     Previous adoptions
     Predictive scores based on other demographics
     Different marketing campaigns (postcards, calls)

Page 45: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing

Page 46: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• Hypothesis: A customer who has had direct communication with a subscriber is more likely to adopt.
  ▫ Data: (incomplete) network information
     ID of users, Timestamp, Duration
• To test, added an "NN" (network neighbor) flag to the features if a customer had communicated with a subscriber. (0.3% overall)

Page 47: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• Created baseline statistical model based on node attributes.
  ▫ "Loyalty": how the consumer used services in the past
  ▫ Geographic: city, state, etc.
  ▫ Demographic: census-type data, credit score
• Added a variable for NN, performed logistic regression on each segment, with response variable being "take rate".

Page 48: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• Log-odds ratio for NN variable

Page 49: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• Take rates
• Lift ratios
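The quantities named on these slides can be computed from simple counts. The sketch below uses made-up numbers and an unadjusted 2x2 comparison; the study itself reports log-odds from a logistic regression that also controls for other attributes, so treat this only as a definition check. All function names and counts are hypothetical.

```python
import math

def take_rate(adopters, targeted):
    """Fraction of targeted customers who adopted."""
    return adopters / targeted

def lift_ratio(rate_group, rate_baseline):
    """How many times higher the group's take rate is than the baseline's."""
    return rate_group / rate_baseline

def log_odds_ratio(adopt_nn, total_nn, adopt_non, total_non):
    """Unadjusted log-odds ratio of adoption, NN versus non-NN."""
    odds_nn = adopt_nn / (total_nn - adopt_nn)
    odds_non = adopt_non / (total_non - adopt_non)
    return math.log(odds_nn / odds_non)

# Hypothetical counts: 30 of 1,000 NN customers adopt; 100 of 10,000 non-NN do.
rate_nn = take_rate(30, 1000)
rate_non = take_rate(100, 10000)
lift = lift_ratio(rate_nn, rate_non)
lor = log_odds_ratio(30, 1000, 100, 10000)
```

A lift above 1 (equivalently, a positive log-odds ratio) means network neighbors adopt at a higher rate than other customers in this toy example.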

Page 50: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• Added a "segment 22" consisting of only NN, but made up of less promising customers.

Page 51: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• What about causality? What if the adoption is due to homophily?
• To address this, sampled from the non-NN customers to build a data set similar to the NN group.
• Performed logistic regression, which showed that network impact is highest for the least loyal group.
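One simple way to build a comparison group that resembles the NN group is stratified sampling on an attribute such as loyalty. The slide does not specify the matching procedure actually used, so the sketch below is only an assumed illustration; `matched_sample`, the `loyalty` field, and the rows are hypothetical.

```python
import random
from collections import Counter, defaultdict

def matched_sample(nn_rows, non_nn_rows, key, rng):
    """Sample non-NN rows so that each stratum (value of `key`) has as many
    rows as the NN group does, capped by how many non-NN rows are available."""
    wanted = Counter(row[key] for row in nn_rows)
    pool = defaultdict(list)
    for row in non_nn_rows:
        pool[row[key]].append(row)
    sample = []
    for stratum, count in wanted.items():
        candidates = pool[stratum]
        sample.extend(rng.sample(candidates, min(count, len(candidates))))
    return sample

# Hypothetical customers: loyalty level is the only matching attribute here.
nn_group = [{"loyalty": 1}, {"loyalty": 1}, {"loyalty": 3}]
non_nn_group = [{"id": i, "loyalty": level}
                for i, level in enumerate([1, 1, 1, 2, 2, 3, 3])]
control = matched_sample(nn_group, non_nn_group, "loyalty", random.Random(0))
```

Fitting the same regression on the NN group and on such a matched control makes it less likely that a difference in take rate is explained by the matched attribute alone.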

Page 52: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• Lift curve for NN

Page 53: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• What about other network features?
  ▫ Degree, transactions, connectedness, etc.
• Added network features to the existing regression model, tested lift.

Page 54: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Network-Based Marketing
• Lift in sales for both models

Page 55: Information Diffusion Mary McGlohon CMU 10-802 3/23/10

Conclusion
• Several ways of approaching the study of diffusion
• No model is perfect. Considerations:
  ▫ Closed world assumption vs. external effects
  ▫ Homophily and node attributes
  ▫ Network structure
• Network information is valuable, but (usually) does not account for everything.