Information Diffusion
Mary McGlohon
CMU 10-802
3/23/10
Outline
•Intro: Models for diffusion
  ▫Epidemiological: SIS/SIR/SIRS
  ▫Threshold models
•Case studies
  ▫SIR: Info diffusion in blogs
  ▫SIS: Cascades in blogs
  ▫Timing: Cascades in chain letters
  ▫A closer look: Network-based Marketing
Epidemiological: SIS
•Susceptible, Infected, Susceptible
  ▫Infected for tI timesteps
  ▫While infected, transmits with probability b
  ▫After tI steps, returns to susceptible
Epidemiological: SIR
•Susceptible, Infected, Removed
  ▫Infected for tI timesteps
  ▫While infected, transmits with probability b
  ▫After tI steps, goes to removed/recovered
Epidemiological: SIRS
•Susceptible, Infected, Removed, Susceptible
  ▫Combination of SIS+SIR
  ▫After tI steps, goes to removed/recovered
  ▫After tR steps, returns to susceptible
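The three models differ only in what happens when the infection timer runs out, so one small simulator can cover all of them. The sketch below is illustrative and not taken from the lecture: the function name `simulate`, the adjacency-dict graph format, and the synchronous update order are assumptions, while `b`, `t_i` and `t_r` stand for the b, tI and tR of the slides.

```python
import random
from collections import Counter

def simulate(adj, seeds, b, t_i, t_r=None, model="SIS", steps=20, rng=None):
    """Discrete-time SIS / SIR / SIRS on a graph given as {node: [neighbors]}.

    b   : per-step transmission probability along each edge
    t_i : number of steps a node stays infected
    t_r : number of steps a node stays removed (used by SIRS only)
    Returns one (S, I, R) count tuple per timestep.
    """
    rng = rng or random.Random(0)
    state = {v: "S" for v in adj}
    clock = {v: 0 for v in adj}          # steps left in the current I or R state
    for v in seeds:
        state[v], clock[v] = "I", t_i
    history = []
    for _ in range(steps):
        # Infected nodes try to infect each susceptible neighbor.
        exposed = set()
        for v in adj:
            if state[v] == "I":
                for u in adj[v]:
                    if state[u] == "S" and rng.random() < b:
                        exposed.add(u)
        # Advance timers; a clock of None marks permanent removal.
        for v in adj:
            if state[v] == "S" or clock[v] is None:
                continue
            clock[v] -= 1
            if clock[v] > 0:
                continue
            if state[v] == "I" and model == "SIS":
                state[v] = "S"
            elif state[v] == "I":        # SIR or SIRS: infected -> removed
                state[v] = "R"
                clock[v] = t_r if model == "SIRS" else None
            else:                        # SIRS: removed -> susceptible
                state[v] = "S"
        for u in exposed:
            state[u], clock[u] = "I", t_i
        counts = Counter(state.values())
        history.append((counts["S"], counts["I"], counts["R"]))
    return history

# A 5-node clique, matching the "anybody can infect anybody" assumption.
clique = {i: [j for j in range(5) if j != i] for i in range(5)}
sir_history = simulate(clique, [0], b=1.0, t_i=1, model="SIR", steps=3)
sirs_history = simulate(clique, [0], b=0.0, t_i=1, t_r=2, model="SIRS", steps=3)
```

Switching `model` between "SIS", "SIR" and "SIRS" changes only the branch taken when a timer expires.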
Epidemiological: Networks
•Historically, SIS/SIR assumed a person could infect anybody else (a full clique). There is an epidemic threshold in SIS.
•For random power-law networks, threshold=0 [Pastor-Satorras+Vespignani]
  ▫(But not for PL networks with high clustering coefficients [Eguíluz and Klemm])
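One common way to see why the threshold vanishes is the heterogeneous mean-field approximation for SIS on uncorrelated networks, where the critical spreading rate is ⟨k⟩/⟨k²⟩. That formula is brought in here as background and is not stated on the slides. The sketch below (the function name `mean_field_threshold` and the degree lists are made up for illustration) shows how a single hub already pushes the estimate toward zero.

```python
def mean_field_threshold(degrees):
    """Mean-field SIS threshold estimate <k> / <k^2> for a degree sequence."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(k * k for k in degrees) / n
    return mean_k / mean_k2

regular = [4] * 100               # every node has degree 4
with_hub = [4] * 99 + [400]       # same network plus one high-degree hub

regular_threshold = mean_field_threshold(regular)
hub_threshold = mean_field_threshold(with_hub)
```

In a power-law degree distribution with a heavy enough tail, ⟨k²⟩ grows without bound as the network grows, which is the intuition behind a threshold of 0.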
Threshold Models
•Each node in the network has a weighted threshold
•If the number of adopted neighbors reaches the threshold, the node adopts.
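A minimal sketch of a threshold cascade, assuming each threshold is a fraction of a node's neighbors and all neighbors count equally (the slide does not fix how the threshold is weighted). The function name `threshold_cascade` and the toy path graph are hypothetical.

```python
def threshold_cascade(adj, thresholds, seeds):
    """Return the final set of adopters.

    adj        : {node: [neighbors]}
    thresholds : {node: fraction of neighbors that must have adopted}
    seeds      : initial adopters
    """
    adopted = set(seeds)
    changed = True
    while changed:                       # repeat until no node changes state
        changed = False
        for v in adj:
            if v in adopted or not adj[v]:
                continue
            frac = sum(u in adopted for u in adj[v]) / len(adj[v])
            if frac >= thresholds[v]:
                adopted.add(v)
                changed = True
    return adopted

# Path graph 0 - 1 - 2 - 3 with node 0 as the only seed.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
low = threshold_cascade(path, {v: 0.5 for v in path}, {0})
high = threshold_cascade(path, {v: 0.6 for v in path}, {0})
```

With thresholds of 0.5 the adoption travels down the whole path; raising them to 0.6 stops it at the seed, since node 1 never sees more than half of its neighbors adopt.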
Info Diffusion in Blogs
•D. Gruhl, R. Guha, D. Liben-Nowell, A. Tomkins. Information Diffusion Through Blogspace. In WWW '04 (2004).
•Goal: How do topics trend in blogs, and how can we model diffusion of topics?
Info Diffusion in Blogs
•Data: Crawled 11K blogs, 400K posts.
•Found 340 topics:
  ▫apple arianna ashcroft astronaut blair boykin bustamante chibi china davis diana farfarello guantanamo harvard kazaa longhorn schwarzenegger udell siegfried wildfires zidane gizmodo microsoft saddam
Info Diffusion in Blogs
•Topics = Chatter + Spikes
  ▫Chatter: Alzheimer
  ▫Spike: Chibi
  ▫Spiky Chatter: Microsoft
Info Diffusion in Blogs
•Modeled as SIR
  ▫Some set of authors is infected to write about a topic
  ▫Then propagate, as others write new posts on that topic
  ▫Measure the topic over time and other properties
•Fit using EM
  ▫Compute probability of propagation along each edge
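To make the EM step concrete, here is a deliberately simplified sketch and not the paper's algorithm: it assumes a newly infected author gets exactly one chance to infect each neighbor, at the next timestep, and ignores reading delays. The function name `em_edge_probs` and the episode format (one dict of infection times per topic) are assumptions for illustration.

```python
from collections import defaultdict

def em_edge_probs(edges, episodes, iters=50, init=0.5):
    """Estimate a transmission probability for each directed edge (u, v).

    edges    : list of (u, v) pairs
    episodes : list of {node: infection_time}, one dict per topic
    """
    p = {e: init for e in edges}
    parents = defaultdict(list)
    for u, v in edges:
        parents[v].append(u)
    for _ in range(iters):
        credit = defaultdict(float)   # expected number of successful transmissions
        trials = defaultdict(int)     # times u had a chance to infect v
        for times in episodes:
            for u, v in edges:
                if u not in times:
                    continue
                t_u, t_v = times[u], times.get(v)
                if t_v is not None and t_v <= t_u:
                    continue          # v was already infected, so no trial
                trials[(u, v)] += 1
                if t_v == t_u + 1:
                    # E-step: split credit among parents infected at time t_u.
                    active = [w for w in parents[v] if times.get(w) == t_u]
                    miss = 1.0
                    for w in active:
                        miss *= 1.0 - p[(w, v)]
                    if miss < 1.0:
                        credit[(u, v)] += p[(u, v)] / (1.0 - miss)
        # M-step: expected successes divided by number of trials.
        p = {e: (credit[e] / trials[e] if trials[e] else p[e]) for e in edges}
    return p

# Toy data: "a" posts first in four topics; "b" follows one step later in three.
toy_episodes = [{"a": 0, "b": 1}, {"a": 0, "b": 1}, {"a": 0, "b": 1}, {"a": 0}]
toy_probs = em_edge_probs([("a", "b")], toy_episodes)
```

With a single candidate parent the estimate reduces to the observed success rate, 3 out of 4 here; the E-step only matters when several infected neighbors could have caused the same post.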
Info Diffusion in Blogs
•Validation:
  ▫Synthetic
      Used modified Erdős-Rényi graph, created propagation
      Found that EM was able to identify transmission of most edges
  ▫Real
      Found “internet-only” topics
      Looked at most highly ranked expected transmission links, identified a real link in 90% of cases
Info Diffusion in Blogs
•Limitations of SIR
  ▫No multiple postings
  ▫No “stickiness”, which topics resonate with whom
  ▫No time limiting factor in topics
  ▫“Closed world assumption”
      No outside influences after initial infection
Cascades in Blogs
•Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, Matthew Hurst. Cascading Behavior in Large Blog Graphs: Patterns and a Model. In SIAM International Conference on Data Mining (SDM '07) (2007).
•Goal: What do cascades (conversation trees) in blogs look like, and how can we model them?
Cascades in Blogs
•Data:
  ▫Gathered from August-September 2005
  ▫Used set of 44,362 blogs, 2.4 million posts
  ▫245,404 blog-to-blog links
[Figure: number of posts over time; x-axis: Time [1 day], ticks at Jul 4, Aug 1, Sep 29; y-axis: Number of posts]
Cascades in Blogs
[Figure: a blogosphere of blogs B1-B4 and the cascades extracted from it (posts a-e), with example “Star” and “Chain” shapes]
•What is the timing of links?
•What are cascade sizes?
•What are cascade shapes?
Cascades in Blogs
•What is the timing of links?
•Does popularity decay at a constant rate?
•With an exponential (“half life”)?
[Figure panels: Linear-linear scale, Log-linear scale, Log-log scale]
Cascades in Blogs
•Observation: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is:
    $p(t_p + \Delta) \propto \Delta^{-1.5}$
[Figure: log(# in-links) vs. log(days after post), slope = -1.5; inset on linear-linear scale]
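A power law appears as a straight line on a log-log plot, so its exponent can be read off as the slope. The sketch below fits that slope by ordinary least squares on synthetic counts; the function name `loglog_slope` and the data are made up, and a least-squares fit is only a rough estimator compared with maximum-likelihood methods.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    num = sum((a - mean_x) * (c - mean_y) for a, c in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

# Synthetic in-link counts that decay exactly as delta^-1.5.
days = list(range(1, 31))
links = [1000.0 * d ** -1.5 for d in days]
slope = loglog_slope(days, links)
```

On real link counts the points scatter around the line, and the tail usually needs binning or truncation before the fit is meaningful.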
Cascades in Blogs
•How are cascade sizes distributed?
•Geometric distribution?
[Figure panels: Linear-linear scale, Log-linear scale, Log-log scale; example cascade with posts a-e]
Cascades in Blogs
•Q: What size distribution do cascades follow? Are large cascades frequent?
•Observation: The probability of observing a cascade of n blog posts follows a Zipf distribution:
    $p(n) \propto n^{-2}$
[Figure: log(Count) vs. log(Cascade size) (# of nodes), slope = -2]
Cascades in Blogs
•How are cascade shapes distributed?
•More stars? More chains?
[Figure: log(Count) vs. log(Size) of chain (# nodes), a = -8.5; log(Count) vs. log(Size) of star (# nodes), a = -3.1]
Cascades in Blogs
•Q: What is the distribution of particular cascade shapes?
•Observation: Stars and chains in blog cascades also follow a power law, with different exponents (star -3.1, chain -8.5).
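Counting stars and chains first requires deciding which shape a cascade has. A minimal sketch, assuming each cascade is a tree given as (child post, parent post) links; the function name `cascade_shape` and the labels it returns are hypothetical, not the paper's definitions.

```python
from collections import Counter

def cascade_shape(edges):
    """Classify one cascade tree given as (child, parent) links.

    Assumes the links form a single tree in which every child has one parent.
    """
    children = Counter(parent for _, parent in edges)
    n_nodes = len({node for edge in edges for node in edge})
    if n_nodes < 3:
        return "trivial"                 # too small to tell the shapes apart
    if len(children) == 1:
        return "star"                    # every link points at the same post
    if max(children.values()) == 1:
        return "chain"                   # every post is linked by at most one post
    return "other"

star = cascade_shape([("b", "a"), ("c", "a"), ("d", "a")])
chain = cascade_shape([("b", "a"), ("c", "b")])
mixed = cascade_shape([("b", "a"), ("c", "a"), ("d", "b")])
```

Tallying the sizes of the cascades labeled "star" and "chain" gives the two count distributions whose exponents are compared above.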
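Cascades in Blogs
•Based on SIS model in epidemiology
  ▫Randomly pick blog to infect, add post to cascade
  ▫Infect each in-linked neighbor with the model's infection probability
  ▫Add infected neighbors’ posts to cascade.
  ▫Set old infected node to uninfected.
[Figure: example network of blogs B1-B4; successive animation frames add posts p1,1, p2,1 and p4,1 to the cascade]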
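The four steps above can be sketched as a short generator. This is an illustration under stated assumptions, not the authors' code: the function name `generate_cascade`, the parameter name `beta` for the infection probability, the `in_links` dict format, and the `max_posts` safety cap are all hypothetical.

```python
import random

def generate_cascade(in_links, beta, rng=None, max_posts=10_000):
    """Generate one cascade as a list of (blog, parent_post_index) posts.

    in_links : {blog: [blogs that link to it]}
    beta     : probability of infecting each in-linked neighbor
    """
    rng = rng or random.Random(0)
    start = rng.choice(sorted(in_links))     # randomly pick a blog to infect
    posts = [(start, None)]                  # the root post has no parent
    frontier = [0]                           # posts of currently infected blogs
    while frontier and len(posts) < max_posts:
        next_frontier = []
        for idx in frontier:
            blog = posts[idx][0]
            for neighbor in in_links.get(blog, []):
                if rng.random() < beta:
                    posts.append((neighbor, idx))    # neighbor's post joins the cascade
                    next_frontier.append(len(posts) - 1)
        # Old infected blogs drop out of the frontier, i.e. become susceptible again.
        frontier = next_frontier
    return posts

blogs = {"B1": ["B2"], "B2": ["B1"]}
single = generate_cascade(blogs, beta=0.0)
capped = generate_cascade(blogs, beta=1.0, max_posts=5)
```

Because blogs return to the susceptible state, the same blog can appear in a cascade more than once, which is why the cap is needed when `beta` is high.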
Cascades in Blogs
•Most frequent cascades: model vs. data
[Figure: log(Count) vs. log(Cascade size) (# nodes), log(Star size), and log(Chain size), each comparing Data and Model]
Cascades in Blogs
•Limitations of SIS
  ▫Closed world assumption
  ▫Forced to set infection probability low to avoid large epidemics, which possibly limits stars.
  ▫No time limit, possibly overestimates chains.
Chain Letter Cascades
•David Liben-Nowell, Jon Kleinberg. Tracing the Flow of Information on a Global Scale Using Internet Chain-Letter Data. Proceedings of the National Academy of Sciences, Vol. 105, No. 12 (March 2008), pp. 4633-4638.
•Goal: How can we trace the path of a meme, and explain these paths?
Chain Letter Cascades
•Data: NPR chain letter records.
  ▫People directed to sign and send back to admin
  ▫Had several copies of lists, overlaps
  ▫Reconstructed the trees using edit distance
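For reference, a standard Levenshtein edit distance over sequences, which could compare two signature lists name by name. The slide does not say exactly how the paper applies edit distance, so this is only the textbook definition; the function name `edit_distance` and the sample lists are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # delete x
                cur[j - 1] + 1,              # insert y
                prev[j - 1] + (x != y),      # substitute, free if equal
            ))
        prev = cur
    return prev[-1]

# Two copies of a signature list, one missing a signer.
copy_one = ["alice", "bob", "carol"]
copy_two = ["alice", "carol"]
distance = edit_distance(copy_one, copy_two)
```

Because the function works on any sequence, it can also be applied to individual names to tolerate typos between copies.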
Chain Letter Cascades
•A reconstruction: [figure not preserved]
Chain Letter Cascades
•The tree: [figure not preserved]
Chain Letter Cascades
•How to model?
  ▫These trees have much longer paths
  ▫2 considerations
      Spatial distance (geographic)
      Timing
Chain Letter Cascades
•Model: based on a delay distribution
•Nodes reply-to-all, so latecomers just append.
Chain Letter Cascades
•Validation: Simulated on a real social network (LiveJournal), produced similar trees.
•Limitations:
  ▫The chain letter mechanism is a somewhat nontraditional form of diffusion
  ▫Closed-world assumption is perhaps OK
Network-Based Marketing
•Shawndra Hill, Foster Provost, Chris Volinsky. Network-Based Marketing: Identifying Likely Adopters via Consumer Networks. Statistical Science, Vol. 21, No. 2 (2006), pp. 256-275.
•Question: Is there statistical evidence that network linkage directly affects product adoption?
Network-Based Marketing
•Data: Direct-mail marketing campaign for adopting a new communications service.
  ▫21 target segments, millions of customers
  ▫Divided based on:
      Loyalty
      Previous adoptions
      Predictive scores based on other demographics
      Different marketing campaigns (postcards, calls)
Network-Based Marketing
•Hypothesis: A customer who has had direct communication with a subscriber is more likely to adopt.
  ▫Data: (incomplete) network information
      ID of users, Timestamp, Duration
•To test, added a “NN” (network neighbor) flag to features if a customer had communicated with a subscriber. (0.3% overall)
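Computing the NN flag amounts to a set lookup over the communication records. A minimal sketch, assuming records of the form (caller ID, callee ID, timestamp, duration); the function name `network_neighbor_flags` and the sample IDs are hypothetical.

```python
def network_neighbor_flags(customers, subscribers, calls):
    """Return {customer: 1 if they communicated with a subscriber, else 0}.

    calls : iterable of (caller_id, callee_id, timestamp, duration)
    """
    subscribers = set(subscribers)
    contacts = set()                     # everyone who talked to a subscriber
    for caller, callee, _timestamp, _duration in calls:
        if caller in subscribers:
            contacts.add(callee)
        if callee in subscribers:
            contacts.add(caller)
    return {c: int(c in contacts) for c in customers}

calls = [
    ("s1", "c1", 1001, 60),              # subscriber s1 called customer c1
    ("c2", "c3", 1002, 30),              # no subscriber involved
    ("c3", "s1", 1003, 45),              # customer c3 called subscriber s1
]
flags = network_neighbor_flags(["c1", "c2", "c3"], ["s1"], calls)
```

The resulting 0/1 column is what gets added to the other customer features before fitting the model.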
Network-Based Marketing
•Created baseline statistical model based on node attributes.
  ▫“Loyalty”: how consumer used services in past
  ▫Geographic: city, state, etc.
  ▫Demographic: census-type data, credit score
•Added a variable for NN, performed logistic regression on each segment, with response variable being “take rate”.
Network-Based Marketing
•Log-odds ratio for NN variable [figure not preserved]
Network-Based Marketing
•Take rates
•Lift ratios
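The quantities named on these slides can be illustrated from raw counts. This sketch uses a plain 2x2 comparison with made-up numbers; the study's log-odds ratios come from logistic regression that also controls for the other attributes, which this does not do. The function names are hypothetical.

```python
import math

def take_rate(adopters, targeted):
    """Fraction of targeted customers who adopted."""
    return adopters / targeted

def lift_ratio(nn_adopters, nn_total, other_adopters, other_total):
    """Take rate of network neighbors relative to everyone else."""
    return take_rate(nn_adopters, nn_total) / take_rate(other_adopters, other_total)

def log_odds_ratio(nn_adopters, nn_total, other_adopters, other_total):
    """Log of the ratio between adoption odds for NN and non-NN customers."""
    odds_nn = nn_adopters / (nn_total - nn_adopters)
    odds_other = other_adopters / (other_total - other_adopters)
    return math.log(odds_nn / odds_other)

# Made-up counts: 10 of 1,000 NN customers adopt vs. 30 of 10,000 others.
lift = lift_ratio(10, 1000, 30, 10000)
lor = log_odds_ratio(10, 1000, 30, 10000)
```

A lift ratio above 1, or equivalently a positive log-odds ratio, means network neighbors adopted at a higher rate than the comparison group.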
Network-Based Marketing
•Added a “segment 22” consisting of only NN, but made up of less promising customers.
Network-Based Marketing
•What about causality? What if the adoption is due to homophily?
•To address this, sampled from the non-NN customers to build a data set similar to the NN group.
•Performed logistic regression, showed that network impact is highest for the least loyal group.
Network-Based Marketing
•Lift curve for NN [figure not preserved]
Network-Based Marketing
•What about other network features?
  ▫Degree, transactions, connectedness, etc.
•Added network features to existing regression model, tested lift.
Network-Based Marketing
•Lift in sales for both models [figure not preserved]
Conclusion
•Several ways of approaching the study of diffusion
•No model is perfect. Considerations:
  ▫Closed world assumption vs. external effects
  ▫Homophily and node attributes
  ▫Network structure
•Network information is valuable, but (usually) does not account for everything.