33
Determining Factors of Information Spreading Processes Dashun Wang 2011-06-07

Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Determining Factors of Information Spreading Processes

Dashun Wang

2011-06-07!

Page 2: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

How a specific piece of information spreads?

Viral Marketing

Although message infection and propagation can bequite involved processes, population-level analyses de-scribe viral propagation as a function of the basic repro-ductive number R0, i.e., the average number of secondarycases generated by each informed individual [11,14]. IfR0 > 1, propagation reaches the tipping point where infor-mation reaches a significant fraction of the target popula-tion, but if R0 < 1, propagation dies quickly. Secondaryspreaders passed the message to !r ! 2:96 individuals onaverage. Just a fraction ! ! 0:0879 of those receiving itwere infected by the viral process and forwarded the mes-sage again. Thus, the average of secondary cases perinfected individual R0 ! !!r ’ 0:26 is below the tippingpoint. While R0 is small, the large heterogeneity in theindividual values of r and ! (see Fig. 2) led to a bigvariation in the cascade sizes found in our experiments[20]. Such heterogeneity was not enough to sustain thespreading, and all cascades stopped propagating after afinite number of recommendation steps. The campaignwas a viral success [11], nevertheless, as the number ofindividuals virally reached was 4 times that of seeds.

A striking feature of the viral cascades found was thescarcity of loops, triangles, or closed paths (Fig. 1). Emailredundancy (i.e., the fraction of emails sent to alreadyinformed individuals) was just 0.74% as cascades weremostly treelike shaped. This fact has also been found inrecommendation cascades of online retailers [21] or infor-mation cascades in the blogosphere [13]. However, socialnetworks are locally dense, and a large fraction of linksconnects members of communities or groups internally[22] anticipating a larger email redundancy. A possibleexplanation is that viral spreaders assume that the groupfrom whence a message came knows it already and avoidthat community, as suggested in Ref. [10] for chain letters.This self-avoiding feature may reduce the impact on infor-mation diffusion of the social network local structure [2] infavor of midrange to global topology properties.

In line with the studies mentioned in the introduction,our viral marketing campaigns show also high heteroge-

neity in the response time "R at the individual level:Participants forward the message after !"R ! 1:5 days onaverage, with a large standard deviation of#"R ! 5:5 days.Some participants resend the invitation email as much as"R ! 69 days after receiving it. Our data are fully consis-tent with a log-normal distribution for the distribution ofresponse times P""R#. Power-law distributions or exponen-tial distributions systematically over- and underestimate(respectively) the frequency of large response times (seeFig. 2). Moreover, response time does not show statisticalcorrelation with the number of recommendations made byparticipants (see Fig. 2). Thus, the delay "R in forwarding amessage and the number of recommendations sent r resultfrom seemingly independent decisions.At the collective level, we find an extraordinary behavior

of information diffusion: If i"t# is the average fraction (overall cascades) of informed individuals forwarding the mes-sage at time t, its dynamics follows an unexpectedly slowpace (see Fig. 3). This is in striking contrast with tradi-tional epidemic models [14] where the dynamics of i"t# inthe cascades is modeled by the growth equation

di

dt! $0i"t# ) i"t# $ i"0#e$0t; (1)

where $0 ! "R0 % 1#= !"R is the naive approximation to theMalthusian rate parameter of the population. Equation (1)is the simplest version of more complicated models such asthe Bass model of innovations diffusion [5] or thesusceptible-infected-removed (SIR) epidemic model [14]used to model information propagation in social networks[1,3]. Equation (1) is based on the assumption that infec-tion, or information diffusion, happens mostly around time"R ’ !"R and that new infections by individuals that havealready infected others are very unlikely for "R & !"R.

0,01 1!R,i

100

101

102

ri

10!3 10!210!1 100

101 102

x (days)

10!2

10!1

100

P(!

R >

x)

FIG. 2 (color online). Complementary cumulative distributionof the response time "R in the viral marketing campaigns(circles). The solid line shows fits to a log-normal distributionwith % ! 5:547 and #2 ! 4:519 (black) and to an exponentialdistribution (red) and a power-law distribution (blue) with exp-onent %1:48. Inset: Scatter plot of the number of recommenda-tions sent by each participant ri vs her response time "R;i (dots).The red solid line is a running average of r as a function of "R.

FIG. 1. Cascade with 122 nodes and 6 propagation steps foundin the experiments. It starts out of a seed in the center (black) andgrows by propagation through viral nodes (gray).

PRL 103, 038702 (2009) P HY S I CA L R EV I EW LE T T E R Sweek ending17 JULY 2009

038702-2

Chain Letters

and those runs in which the tree fails to reach this size are omitted.Omitting simulated trees that fail to grow large enough is consistentwith our goal of studying properties of information diffusionconditional on reaching a large population; in real life, mostcirculated e-mail messages never spread widely, but we are inter-ested in the structure of those that do.

Our models all incorporate the following two principles: Manyrecipients may choose not to forward the letter at all, and only a fewrecipients will choose to post the letter publicly. Thus, we introducea discard-rate parameter !, specifying the probability that a givenrecipient discards the message and takes no further action on it, anda post-rate parameter ", specifying the probability that eachrecipient publicly posts his or her copy of the letter. In keeping withfindings from earlier experiments based on e-mail forwarding (27),we set the discard-rate to the default value 0.65, although we findthat reasonable variations do not qualitatively change the findings.The post-rate is a parameter that we will more explicitly vary. Publicposting is the only means by which portions of the tree becomeobservable: When a recipient posts the letter, his or her full pathfrom the root becomes visible, and hence in general a node on thetree is observable at the end of the process if and only if one of itsdescendants posted a copy of the letter. We will be studying thestructure of the observable portions of the trees produced by ourmodels, as we do with real chain letters.

We first consider a model based on a direct application of theseprobabilistic ingredients. We choose a random root node andconstruct a tree in unit time steps. In each step, each new recipientof the letter discards it independently with probability ! andotherwise forwards it to all neighbors (posting with probability ").Any neighbor w that has not received the letter previously becomesa new recipient in the next time step; if w receives the letter frommultiple senders, it chooses one of these senders arbitrarily as itsparent in the tree. Finally, once the process terminates, we look atthe observable portion of the tree, consisting of the union of allpaths from the root to the nodes that posted their copy of the letter.

Although such a model is very natural, it produces trees thatcompare poorly to the real chain-letter data. Simulating this modelon the LJ network, the observable portion of the tree has a mediandepth 5.0, width 9,625, and single-child fraction 19.04% (averagedover 10 independent runs) with " ! 0.10, and very similar prop-erties for other small values of ". This wide divergence from the realdata cannot be remedied simply by having recipients send to asmaller set of neighbors; if each recipient who forwards does so toa random subset of 4 or 5 of his or her neighbors, then the widthremains in the thousands, the median depth remains "50, and thesingle-child fraction remains"70%. The central problem is that thisstyle of random epidemic process seems unable to produce treeswhose observable portions are very large, yet with a number ofchildren per node so highly concentrated around 1.

Fig. 3. Close-up of a portion of the tree in Fig. 2.

4636 ! www.pnas.org"cgi"doi"10.1073"pnas.0708471105 Liben-Nowell and Kleinberg

José Luis Iribarren and E. Moro, Phys. Rev. Lett. 2009D. Liben-Nowell and J. Kleinberg, Proc Natl Acad Sci 2008

Page 3: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Viral Marketing

Although message infection and propagation can bequite involved processes, population-level analyses de-scribe viral propagation as a function of the basic repro-ductive number R0, i.e., the average number of secondarycases generated by each informed individual [11,14]. IfR0 > 1, propagation reaches the tipping point where infor-mation reaches a significant fraction of the target popula-tion, but if R0 < 1, propagation dies quickly. Secondaryspreaders passed the message to !r ! 2:96 individuals onaverage. Just a fraction ! ! 0:0879 of those receiving itwere infected by the viral process and forwarded the mes-sage again. Thus, the average of secondary cases perinfected individual R0 ! !!r ’ 0:26 is below the tippingpoint. While R0 is small, the large heterogeneity in theindividual values of r and ! (see Fig. 2) led to a bigvariation in the cascade sizes found in our experiments[20]. Such heterogeneity was not enough to sustain thespreading, and all cascades stopped propagating after afinite number of recommendation steps. The campaignwas a viral success [11], nevertheless, as the number ofindividuals virally reached was 4 times that of seeds.

A striking feature of the viral cascades found was thescarcity of loops, triangles, or closed paths (Fig. 1). Emailredundancy (i.e., the fraction of emails sent to alreadyinformed individuals) was just 0.74% as cascades weremostly treelike shaped. This fact has also been found inrecommendation cascades of online retailers [21] or infor-mation cascades in the blogosphere [13]. However, socialnetworks are locally dense, and a large fraction of linksconnects members of communities or groups internally[22] anticipating a larger email redundancy. A possibleexplanation is that viral spreaders assume that the groupfrom whence a message came knows it already and avoidthat community, as suggested in Ref. [10] for chain letters.This self-avoiding feature may reduce the impact on infor-mation diffusion of the social network local structure [2] infavor of midrange to global topology properties.

In line with the studies mentioned in the introduction,our viral marketing campaigns show also high heteroge-

neity in the response time "R at the individual level:Participants forward the message after !"R ! 1:5 days onaverage, with a large standard deviation of#"R ! 5:5 days.Some participants resend the invitation email as much as"R ! 69 days after receiving it. Our data are fully consis-tent with a log-normal distribution for the distribution ofresponse times P""R#. Power-law distributions or exponen-tial distributions systematically over- and underestimate(respectively) the frequency of large response times (seeFig. 2). Moreover, response time does not show statisticalcorrelation with the number of recommendations made byparticipants (see Fig. 2). Thus, the delay "R in forwarding amessage and the number of recommendations sent r resultfrom seemingly independent decisions.At the collective level, we find an extraordinary behavior

of information diffusion: If i"t# is the average fraction (overall cascades) of informed individuals forwarding the mes-sage at time t, its dynamics follows an unexpectedly slowpace (see Fig. 3). This is in striking contrast with tradi-tional epidemic models [14] where the dynamics of i"t# inthe cascades is modeled by the growth equation

di

dt! $0i"t# ) i"t# $ i"0#e$0t; (1)

where $0 ! "R0 % 1#= !"R is the naive approximation to theMalthusian rate parameter of the population. Equation (1)is the simplest version of more complicated models such asthe Bass model of innovations diffusion [5] or thesusceptible-infected-removed (SIR) epidemic model [14]used to model information propagation in social networks[1,3]. Equation (1) is based on the assumption that infec-tion, or information diffusion, happens mostly around time"R ’ !"R and that new infections by individuals that havealready infected others are very unlikely for "R & !"R.

0,01 1!R,i

100

101

102

ri

10!3 10!210!1 100

101 102

x (days)

10!2

10!1

100

P(!

R >

x)

FIG. 2 (color online). Complementary cumulative distributionof the response time "R in the viral marketing campaigns(circles). The solid line shows fits to a log-normal distributionwith % ! 5:547 and #2 ! 4:519 (black) and to an exponentialdistribution (red) and a power-law distribution (blue) with exp-onent %1:48. Inset: Scatter plot of the number of recommenda-tions sent by each participant ri vs her response time "R;i (dots).The red solid line is a running average of r as a function of "R.

FIG. 1. Cascade with 122 nodes and 6 propagation steps foundin the experiments. It starts out of a seed in the center (black) andgrows by propagation through viral nodes (gray).

PRL 103, 038702 (2009) P HY S I CA L R EV I EW LE T T E R Sweek ending17 JULY 2009

038702-2

Chain Letters

and those runs in which the tree fails to reach this size are omitted.Omitting simulated trees that fail to grow large enough is consistentwith our goal of studying properties of information diffusionconditional on reaching a large population; in real life, mostcirculated e-mail messages never spread widely, but we are inter-ested in the structure of those that do.

Our models all incorporate the following two principles: Manyrecipients may choose not to forward the letter at all, and only a fewrecipients will choose to post the letter publicly. Thus, we introducea discard-rate parameter !, specifying the probability that a givenrecipient discards the message and takes no further action on it, anda post-rate parameter ", specifying the probability that eachrecipient publicly posts his or her copy of the letter. In keeping withfindings from earlier experiments based on e-mail forwarding (27),we set the discard-rate to the default value 0.65, although we findthat reasonable variations do not qualitatively change the findings.The post-rate is a parameter that we will more explicitly vary. Publicposting is the only means by which portions of the tree becomeobservable: When a recipient posts the letter, his or her full pathfrom the root becomes visible, and hence in general a node on thetree is observable at the end of the process if and only if one of itsdescendants posted a copy of the letter. We will be studying thestructure of the observable portions of the trees produced by ourmodels, as we do with real chain letters.

We first consider a model based on a direct application of theseprobabilistic ingredients. We choose a random root node andconstruct a tree in unit time steps. In each step, each new recipientof the letter discards it independently with probability ! andotherwise forwards it to all neighbors (posting with probability ").Any neighbor w that has not received the letter previously becomesa new recipient in the next time step; if w receives the letter frommultiple senders, it chooses one of these senders arbitrarily as itsparent in the tree. Finally, once the process terminates, we look atthe observable portion of the tree, consisting of the union of allpaths from the root to the nodes that posted their copy of the letter.

Although such a model is very natural, it produces trees thatcompare poorly to the real chain-letter data. Simulating this modelon the LJ network, the observable portion of the tree has a mediandepth 5.0, width 9,625, and single-child fraction 19.04% (averagedover 10 independent runs) with " ! 0.10, and very similar prop-erties for other small values of ". This wide divergence from the realdata cannot be remedied simply by having recipients send to asmaller set of neighbors; if each recipient who forwards does so toa random subset of 4 or 5 of his or her neighbors, then the widthremains in the thousands, the median depth remains "50, and thesingle-child fraction remains"70%. The central problem is that thisstyle of random epidemic process seems unable to produce treeswhose observable portions are very large, yet with a number ofchildren per node so highly concentrated around 1.

Fig. 3. Close-up of a portion of the tree in Fig. 2.

4636 ! www.pnas.org"cgi"doi"10.1073"pnas.0708471105 Liben-Nowell and Kleinberg

José Luis Iribarren and E. Moro, Phys. Rev. Lett. 2009D. Liben-Nowell and J. Kleinberg, Proc Natl Acad Sci 2008

What affects information spreading processes?

Page 4: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

•Promote strategies to achieve expected coverage

Information Spreading in Context

•Prediction models for information flow

•Protect digital information leakage

•Assisting users to disseminate information more efficiently

Page 5: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Data

Detailed traces of social interactions

Forwarded Emails

Spreading Processes

A B CAug 5, 09:30:12 “data request” Aug 5, 09:53:00 “Fw: data request”

Spreader ReceiverInitiator

Page 6: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Data• Emails from 8000+ Employees

• 2000+ Fw threads

• Information about the individuals

e.g.: performance, dept, job role

• Content of the emails

social network

ensemble of trees

individual characteristics

what the information is about

A B CAug 5, 09:30:12 “data request” Aug 5, 09:53:00 “Fw: data request”

Spreader ReceiverInitiator

Page 7: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Research Focus• To Whom one spreads the information• Waiting time

Microscopic Level

Macroscopic Level

A B CAug 5, 09:30:12 “data request” Aug 5, 09:53:00 “Fw: data request”

Spreader ReceiverInitiator

Aug 5, 09:30:12 "data request"

Aug 5, 09:53:00 "Fw: data request"

Aug 6, 14:21:53 "Fw: Fw: data request"

• To how many people• Overall coverage

Page 8: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Microscopic Level

A B CAug 5, 09:30:12 “data request” Aug 5, 09:53:00 “Fw: data request”

Spreader ReceiverInitiator

•The Underlying Social Network•Information Content and Expertise•Organizational Context•Individual Characteristics

PFw(q)/P rand(q)Probability Ratio:

P rand(q)

PFw(q) Probability of having q in forwarded emails

Same probability in normal emails

1

Page 9: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

A B CAug 5, 09:30:12 “data request” Aug 5, 09:53:00 “Fw: data request”

Spreader ReceiverInitiator

The Underlying Social Network

!""

!"!

!"#

!"$

!"%

"&'

!

!&'

w(B!)C*

+,-./.01023

!""

!"!

!"#

!"$

!"%

"&'

!

!&'

w(A!)B*

+,-./.01023

a b

mor

e lik

ely

weak strong

w(A→ B) w(B → C)

Page 10: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

A B CAug 5, 09:30:12 “data request” Aug 5, 09:53:00 “Fw: data request”

Spreader ReceiverInitiator

! " #!"#$%&$'()*')+,$-./0/$123"2405 !"#$%&$'()%*)''$-!"#$./0/$123"2405

!"#$%&$# '$($)*$#+,)-)%-.#

Figure 2: An illustrative example of an informationpathway.

experts to experts. The e!ciency of the spreading process isa"ected by departmental structure, but little by individualperformance. These findings can guide us to build bettersocial and collaborative environments, design applicationsassisting users to disseminate information more e!ciently,and develop strategies to protect digital information leakageand predictive tools for recommendation systems.

4.1 The Underlying Social NetworksHow information spreads may be influenced by the un-

derlying social network, and understanding the interplaybetween the social network and spreading process is veryimportant. First, it has a number of implications in varioussocial systems, such as promoting new strategies in viralmarketing by taking into account the e"ect of the networktopology. Second, it plays an important role in assessingthe choice of models, arguing whether a flu-like epidemi-ology model, which directly relies on the topology of thenetwork, is suitable for modeling the information flow, (seeSection 5).

We start by building a social network among our usersby aggregating email communications over a one year pe-riod. We add a link between two users if there has been atleast one email communication between them. The weightof the link, w(i ! j), is asymmetric, defined as the numberof emails sent from user i to j. As we are mostly inter-ested in the connectivity between individuals, we focus onthe static picture of the network rather than the dynamicsof the network evolution.

We show in Fig. 3 the probability ratio of email forward-ing activity2 as a function of the weight of the links betweeninitiators and spreaders, w(A ! B), and spreaders and re-ceivers, w(B ! C), defined in the information pathways inFig. 2. A positive slope would indicate that information ismore likely to flow through strong ties, whereas a negativeslope shows that weaker connections are more favorable forthe spreading processes. Surprisingly, we observe that theinformation is more likely to spread initially via weak tiesand then gets passed through strong connections, strong ev-idence of information routing by spreaders choosing socialneighbors of di"erent closeness.

This raises an interesting question: how well are the in-formation spreaders connected in the network? Are they arandom sample of individuals or are they a biased sampleof more central social hubs? We show in Fig. 4 the degreedistribution P (k) of nodes in the whole social network asgrey circles and the sample of spreaders as orange crosses.Interestingly, we find that spreaders show nearly the same

2the probability ratio of email forwarding as a function ofquantity q is obtained by PFw(q)/P rand(q), where PFw(q)is the probability of having q in forwarded emails, whileP rand(q) is the same probability for overall emails. A valueequals to 1 would indicate PFw(q) is about what you wouldexpect normally.

! " #! " #

! "

Figure 3: The probability ratio of email forwardingas a function of the weight of links between A and Band B and C respectively in the information path-ways. Information spreading undergoes an interest-ing re-routing process, from weak links to strongties.

100

101

102

103

10!6

10!4

10!2

100

k

P(k

)

total usersusers as B

Figure 4: The degree distribution of the whole net-work and the group of the spreaders. The spreadershave comparable connectivity to randomly sampledindividuals in the network.

distribution of connectivity as a random sample of individ-uals from the network.

4.2 Information Content and ExpertiseAn important question about information spreading is

how the process depends on the relevance of the content ofthe information to the individual’s expertise. Here, we ex-plore this issue using the available message content. As men-tioned previously, the content of each email in our datasetis represented as term frequencies. We build a vocabularyvector !vi = "s1, s2, ..., sn# for each user i by looking at thecontent of all the emails sent by i, where the length of !vi isthe total number of meaningful words that have appearedin all emails for all users, thus is the same length for everyuser. The j-th element sj is the score of the j-th word cal-culated by TF-IDF. The vector !vi will provide a measureof ranked “buzzwords” for user i, which serves as an indica-tor of the individual’s expertise, since previous studies haveshown emails are a primary form of communication withinbig corporations [36, 6]. Next, we build a vector !vl for eachemail l following the same procedure, where sj is the TF-IDF score of the j-th word in !vl. Therefore, !vl will give usa measure of the content of email l, accounting for overlycommon words and overly rare words. Then the similaritybetween the content of the information l and the individuali’s expertise is defined as the cosine similarity of the twovectors, Si,l = !vi · !vl/($!vi$ $!vl$). We show in Fig. 5, howthe probability ratio of information spreading changes infunction of Si,l for user i as (a) spreaders and (b) receivers,respectively. The probability ratio anti-correlates with Si,l,

�vi

�vl

Information Content and Expertise

Page 11: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

A B CAug 5, 09:30:12 “data request” Aug 5, 09:53:00 “Fw: data request”

Spreader ReceiverInitiator

mor

e lik

ely

non-expert expert

0 0.2 0.4 0.6 0.8 10.5

1

1.5

2

2.5

3

3.5

Si,l for user i as receivers

Pro

ba

bili

ty

0 0.2 0.4 0.6 0.8 10

0.5

1

1.5

Si,l for user i as spreaders

Pro

ba

bili

ty

a b

! " #!"#$%&$'()*')+,$-./0/$123"2405 !"#$%&$'()%*)''$-!"#$./0/$123"2405

!"#$%&$# '$($)*$#+,)-)%-.#

Figure 2: An illustrative example of an informationpathway.

experts to experts. The e!ciency of the spreading process isa"ected by departmental structure, but little by individualperformance. These findings can guide us to build bettersocial and collaborative environments, design applicationsassisting users to disseminate information more e!ciently,and develop strategies to protect digital information leakageand predictive tools for recommendation systems.

4.1 The Underlying Social NetworksHow information spreads may be influenced by the un-

derlying social network, and understanding the interplaybetween the social network and spreading process is veryimportant. First, it has a number of implications in varioussocial systems, such as promoting new strategies in viralmarketing by taking into account the e"ect of the networktopology. Second, it plays an important role in assessingthe choice of models, arguing whether a flu-like epidemi-ology model, which directly relies on the topology of thenetwork, is suitable for modeling the information flow, (seeSection 5).

We start by building a social network among our usersby aggregating email communications over a one year pe-riod. We add a link between two users if there has been atleast one email communication between them. The weightof the link, w(i ! j), is asymmetric, defined as the numberof emails sent from user i to j. As we are mostly inter-ested in the connectivity between individuals, we focus onthe static picture of the network rather than the dynamicsof the network evolution.

We show in Fig. 3 the probability ratio of email forward-ing activity2 as a function of the weight of the links betweeninitiators and spreaders, w(A ! B), and spreaders and re-ceivers, w(B ! C), defined in the information pathways inFig. 2. A positive slope would indicate that information ismore likely to flow through strong ties, whereas a negativeslope shows that weaker connections are more favorable forthe spreading processes. Surprisingly, we observe that theinformation is more likely to spread initially via weak tiesand then gets passed through strong connections, strong ev-idence of information routing by spreaders choosing socialneighbors of di"erent closeness.

This raises an interesting question: how well are the in-formation spreaders connected in the network? Are they arandom sample of individuals or are they a biased sampleof more central social hubs? We show in Fig. 4 the degreedistribution P (k) of nodes in the whole social network asgrey circles and the sample of spreaders as orange crosses.Interestingly, we find that spreaders show nearly the same

2the probability ratio of email forwarding as a function ofquantity q is obtained by PFw(q)/P rand(q), where PFw(q)is the probability of having q in forwarded emails, whileP rand(q) is the same probability for overall emails. A valueequals to 1 would indicate PFw(q) is about what you wouldexpect normally.

! " #! " #

! "

Figure 3: The probability ratio of email forwardingas a function of the weight of links between A and Band B and C respectively in the information path-ways. Information spreading undergoes an interest-ing re-routing process, from weak links to strongties.

100

101

102

103

10!6

10!4

10!2

100

k

P(k

)

total usersusers as B

Figure 4: The degree distribution of the whole net-work and the group of the spreaders. The spreadershave comparable connectivity to randomly sampledindividuals in the network.

distribution of connectivity as a random sample of individ-uals from the network.

4.2 Information Content and ExpertiseAn important question about information spreading is

how the process depends on the relevance of the content ofthe information to the individual’s expertise. Here, we ex-plore this issue using the available message content. As men-tioned previously, the content of each email in our datasetis represented as term frequencies. We build a vocabularyvector !vi = "s1, s2, ..., sn# for each user i by looking at thecontent of all the emails sent by i, where the length of !vi isthe total number of meaningful words that have appearedin all emails for all users, thus is the same length for everyuser. The j-th element sj is the score of the j-th word cal-culated by TF-IDF. The vector !vi will provide a measureof ranked “buzzwords” for user i, which serves as an indica-tor of the individual’s expertise, since previous studies haveshown emails are a primary form of communication withinbig corporations [36, 6]. Next, we build a vector !vl for eachemail l following the same procedure, where sj is the TF-IDF score of the j-th word in !vl. Therefore, !vl will give usa measure of the content of email l, accounting for overlycommon words and overly rare words. Then the similaritybetween the content of the information l and the individuali’s expertise is defined as the cosine similarity of the twovectors, Si,l = !vi · !vl/($!vi$ $!vl$). We show in Fig. 5, howthe probability ratio of information spreading changes infunction of Si,l for user i as (a) spreaders and (b) receivers,respectively. The probability ratio anti-correlates with Si,l,

�vi

�vl

Information Content and Expertise

Page 12: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

! "# "! $#

"

$

%

&

!

'()*+,-.*/(-0123456

-

-

/74

,!/74Coordinator

Gatekeeper

Consultant

Liaison

Representative

A

B

C

A B

C

A

B

C

A

B

C

A

B

C

within

across

social bottleneck:information experiences delay in inter-dept flows

−6 −4 −2 0 2 4 60

1

2

3

4

Level differece btwn A and B

Prob

abilit

y

0 5 10 150

1

2

organizational distance btwn A and B

Prob

abilit

y

non-homophily effect

on “formal” network

faster

Organizational Context

Technology Research (DMR-0426737), and IIS-0513650 pro-grams and the Defense Threat Reduction Agency AwardHDTRA1-08-1-0027.

7. REFERENCES[1] L. Adamic and E. Adar. How to search a social

network. Social Networks, 27(3):187 – 203, 2005.[2] E. Adar and L. A. Adamic. Tracking information

epidemics in blogspace. In Web Intelligence, pages207–214, 2005.

[3] R. Albert and A.-L. Barabasi. Statistical mechanics ofcomplex networks. Reviews of modern physics,74(1):47–97, 2002.

[4] A.-L. Barabasi. The origin of bursts and heavy tails inhuman dynamics. Nature, 435(7039):207–211, 2005.

[5] A.-L. Barabasi and R. Albert. Emergence of scaling inrandom networks. Science, 286(5439):509, 1999.

[6] N. K. Baym, Y. B. Zhang, and M.-C. Lin. SocialInteractions Across Media. New Media & Society,6(3):299–318, 2004.

[7] L. Briesemeister, P. Lincoln, and P. Porras. Epidemicprofiles and defense of scale-free networks. WORM2003, Oct. 27 2003.

[8] G. Caldarelli. Scale-Free Networks. Oxford UniversityPress, 2007.

[9] A. Clauset, C. Shalizi, and M. Newman. Power-lawdistributions in empirical data. SIAM review,51(4):661–703, 2009.

[10] P. Domingos and M. Richardson. Mining the networkvalue of customers. In KDD, pages 57–66, 2001.

[11] H. Ebel, L.-I. Mielsch, and S. Bornholdt. Scale-freetopology of e-mail networks. Phys. Rev. E,66(3):035103, Sep 2002.

[12] J.-P. Eckmann, E. Moses, and D. Sergi. Entropy ofdialogues creates coherent structures in e-mail tra!c.Proceedings of the National Academy of Sciences of theUnited States of America, 101(40):14333–14337, 2004.

[13] B. Golub and M. Jackson. Using selection bias toexplain the observed structure of Internet di"usions.Proceedings of the National Academy of Sciences,107(24):10833, 2010.

[14] R. Gould and R. Fernandez. Structures of mediation:A formal approach to brokerage in transactionnetworks. Sociological Methodology, 19(1989):89–126,1989.

[15] M. S. Granovetter. The strength of weak ties. TheAmerican Journal of Sociology, 78(6):1360–1380, May,1973.

[16] D. Gruhl, R. V. Guha, D. Liben-Nowell, andA. Tomkins. Information di"usion through blogspace.In WWW, pages 491–501, 2004.

[17] T. Harris. The theory of branching processes. DoverPubns, 2002.

[18] H. W. Hethcote. The mathematics of infectiousdiseases. SIAM Review, 42:599–653, 2000.

[19] J. L. Iribarren and E. Moro. Impact of human activitypatterns on the dynamics of information di"usion.Phys. Rev. Lett., 103(3):038702, Jul 2009.

[20] T. Karagiannis and M. Vojnovic. Behavioral profilesfor advanced email features. In WWW, pages 711–720,2009.

[21] D. Kempe, J. M. Kleinberg, and E. Tardos.Maximizing the spread of influence through a socialnetwork. In KDD, pages 137–146, 2003.

[22] G. Kossinets and D. J. Watts. Empirical Analysis ofan Evolving Social Network. Science, 311(5757):88–90,2006.

[23] R. Kumar, M. Mahdian, and M. McGlohon. Dynamicsof conversations. In KDD, pages 553–562, 2010.

[24] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins.On the bursty evolution of blogspace. In WWW, pages159–178, 2005.

[25] J. Leskovec, L. A. Adamic, and B. A. Huberman. Thedynamics of viral marketing. In ACM/EC, pages228–237, 2006.

[26] J. Leskovec, M. McGlohon, C. Faloutsos, N. S. Glance,and M. Hurst. Patterns of cascading behavior in largeblog graphs. In SDM, 2007.

[27] D. Liben-Nowell and J. Kleinberg. Tracinginformation flow on a global scale using Internetchain-letter data. Proceedings of the National Academyof Sciences, 105(12):4633–4638, 2008.

[28] H. Smith, Y. Rogers, and M. Brady. Managing one’ssocial network: Does age make a di"erence. In In:Proc. Interact 2003, IOS, pages 551–558. Press, 2003.

[29] M. A. Smith, J. Ubois, and B. M. Gross. Forwardthinking. In CEAS, 2005.

[30] D. Strang and S. A. Soule. Di"usion in organizationsand social movements: From hybrid corn to poisonpills. Annual Review of Sociology, 24(1):265–290, 1998.

[31] H. Tong, B. A. Prakash, C. Tsourakakis,T. Eliassi-Rad, C. Faloutsos, and D. H. P. Chau. Onthe vulnerability of large graphs. In ICDM, 2010.

[32] T. W. Valente. Network models of the di"usion ofinnovations. Computational & MathematicalOrganization Theory, 2:163–164, 1996.

[33] A. Vazquez, J. Oliveira, Z. Dezso, K. Goh, I. Kondor,and A.-L. Barabasi. Modeling bursts and heavy tailsin human dynamics. Physical Review E, 73(3):36127,2006.

[34] Y. Wang, D. Chakrabarti, C. Wang, and C. Faloutsos.Epidemic spreading in real networks: An eigenvalueviewpoint. SRDS, 2003.

[35] H. Watson and F. Galton. On the probability of theextinction of families. Journal of the AnthropologicalInstitute of Great Britain and Ireland, pages 138–144,1875.

[36] B. Wellman. The Internet in Everyday Life (TheInformation Age). Blackwell Publishers, December2002.

[37] Z. Wen and C.-Y. Lin. On the quality of inferringinterests from social neighbors. In KDD, pages373–382, 2010.

[38] F. Wu, B. A. Huberman, L. A. Adamic, and J. R.Tyler. Information flow in social groups. Physica A:Statistical and Theoretical Physics, 337(1-2):327 – 335,2004.

[39] L. Wu, C.-Y. Lin, S. Aral, and E. Brynjolfsson. Valueof social network – a large-scale analysis on networkstructure impact to financial revenue of informationtechnology consultants. In The Winter Conference onBusiness Intelligence, 2009.

Page 13: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

faster

! " # $

%&"!'

"!!"

"!!

"!"

()*+,*-./0)&,+&1&23)*3

4)56./&76-)&89,2*3:

! " # $

%&"!'

"!!"

"!!

"!"

()*+,*-./0)&,+&1&23)*3

4)56./&76-)&89,2*3:

a b

revenue generated

A B CAug 5, 09:30:12 “data request” Aug 5, 09:53:00 “Fw: data request”

Spreader ReceiverInitiator

The information waiting time appears to be const. for both initiators and spreaders, independent of individual performance

Individual Characterisitcs

Page 14: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Macroscopic Level

Aug 5, 09:30:12 "data request"

Aug 5, 09:53:00 "Fw: data request"

Aug 6, 14:21:53 "Fw: Fw: data request"

•Structural Properties of the spreading processes•What’s the best model for the observed structures

1.To how many people one would forward the information2.What is the overall coverage

Page 15: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Galton-Watson Branching Process

and those runs in which the tree fails to reach this size are omitted.Omitting simulated trees that fail to grow large enough is consistentwith our goal of studying properties of information diffusionconditional on reaching a large population; in real life, mostcirculated e-mail messages never spread widely, but we are inter-ested in the structure of those that do.

Our models all incorporate the following two principles: Manyrecipients may choose not to forward the letter at all, and only a fewrecipients will choose to post the letter publicly. Thus, we introducea discard-rate parameter !, specifying the probability that a givenrecipient discards the message and takes no further action on it, anda post-rate parameter ", specifying the probability that eachrecipient publicly posts his or her copy of the letter. In keeping withfindings from earlier experiments based on e-mail forwarding (27),we set the discard-rate to the default value 0.65, although we findthat reasonable variations do not qualitatively change the findings.The post-rate is a parameter that we will more explicitly vary. Publicposting is the only means by which portions of the tree becomeobservable: When a recipient posts the letter, his or her full pathfrom the root becomes visible, and hence in general a node on thetree is observable at the end of the process if and only if one of itsdescendants posted a copy of the letter. We will be studying thestructure of the observable portions of the trees produced by ourmodels, as we do with real chain letters.

We first consider a model based on a direct application of theseprobabilistic ingredients. We choose a random root node andconstruct a tree in unit time steps. In each step, each new recipientof the letter discards it independently with probability ! andotherwise forwards it to all neighbors (posting with probability ").Any neighbor w that has not received the letter previously becomesa new recipient in the next time step; if w receives the letter frommultiple senders, it chooses one of these senders arbitrarily as itsparent in the tree. Finally, once the process terminates, we look atthe observable portion of the tree, consisting of the union of allpaths from the root to the nodes that posted their copy of the letter.

Although such a model is very natural, it produces trees thatcompare poorly to the real chain-letter data. Simulating this modelon the LJ network, the observable portion of the tree has a mediandepth 5.0, width 9,625, and single-child fraction 19.04% (averagedover 10 independent runs) with " ! 0.10, and very similar prop-erties for other small values of ". This wide divergence from the realdata cannot be remedied simply by having recipients send to asmaller set of neighbors; if each recipient who forwards does so toa random subset of 4 or 5 of his or her neighbors, then the widthremains in the thousands, the median depth remains "50, and thesingle-child fraction remains"70%. The central problem is that thisstyle of random epidemic process seems unable to produce treeswhose observable portions are very large, yet with a number ofchildren per node so highly concentrated around 1.

Fig. 3. Close-up of a portion of the tree in Fig. 2.

4636 ! www.pnas.org"cgi"doi"10.1073"pnas.0708471105 Liben-Nowell and Kleinberg

the regimes of the Galton–Watson process seems to fit when welook at the unconditional distribution—is overcome by condition-ing our subcritical process on the rare event of growing large, asLiben-Nowell and Kleinberg also do in their model. The maindifference between their approach and ours is that we do notexplicitly model the network or detailed mechanics of the distri-bution process. We focus only on the random variable describinghow many children each node has and on the selection bias.Those two ingredients alone suffice to produce a conditionaldistribution concentrated on trees with the right shapes.

Whereas the specifics of the model and analysis that follow areparticular to the Galton–Watson process, the broader point isworth emphasizing. Large-scale network phenomena that we ob-serve may not be typical instances of the processes that generatedthem but instead exceptional realizations. Although implicationsof selection bias are well understood in some settings, they havenot traditionally played a significant role at the dataset level insocial network analysis. Our study of the chain letters providesa particularly stark example of how this perspective can, in asimple model, explain a great deal about the observations. Thispoints to the need for a richer theoretical understanding of howselection modifies the structure of important classical processes.

ModelWe begin by stating our formal model of chain letter propagationand discussing the fitting of its key parameters. Let X be aGalton–Watson random tree generated by the branching processstarting at one root where the probability of any node having kchildren is p!k" and the distribution is identical across nodes. Thisdistribution is the fundamental parameter in the model. It is asimple matter to fit it by using maximum-likelihood estimationgiven an observed tree. The key fact is that, because the numberof children is conditionally independent and identically distribu-ted across nodes (given the past), this fitting goes through evenwhen there is a size selection bias in the trees that are observed.

Let L!p; x" be the probability of observing a specific tree xunder the model—that is, the probability that X # x. For anyrooted tree x, let f !k; x" refer to the total number of nodes inx with k children. It follows directly that L!p; x" #

Qkp!k"f !k;x",

and so the log-likelihood function is

!!p; x" # f !0; x" log!1 !

!

k>0

p!k""$!

k>0

f !k; x" log p!k":

Maximizing the log-likelihood with respect to p and denoting themaximizer p, we see that if f !k; x" # 0, then p!k" # 0; otherwise,setting the derivative with respect to p!k" to be equal to 0 impliesthat, for every k,

f !k; x"f !0; x"

# p!k"p!0"

[noting that f !0; x" > 0 for any finite tree]. Therefore, for everyk " 1,

p!k" # f !k; x"#!

k

f !k; x":

In other words, the estimated probability of having k children isjust the fraction of nodes with k children in the data. It is straight-forward to verify that p!k" yields a global maximum of the like-lihood function. The same procedure can be carried out on manytrees at once.

It is worth noting that we did not explicitly model the observa-tion process here—as Liben-Nowell and Kleinberg did—in whichsome but not all nodes post the chain letter on the Internet whereit can then be found and its propagation traced back to the root. Itturns out that this aspect of the process can be omitted withoutloss of generality. Formally, our approach corresponds to defin-

ing a node as an observable node—that is, a node that forwardedthe letter and one of whose descendants posted a later version. Aslong as the number of children and the decision of whether topost are independent of each other and identically distributedacross nodes, then the random variables describing how manyobservable children each node has also satisfy the assumptionsof the simple Galton–Watson process, albeit with a differentdistribution.

We also did not explicitly include a model of the network overwhich the diffusion is happening. The reason is that the only thingthat matters for the Galton–Watson process is the number ofobservable children that each node has, and our reduced-formapproach focuses on this process rather than on the network.Although the network structure will certainly influence the off-spring randomvariable, themechanismsof that can be complicated.What kinds of network processes underlie p is an interesting ques-tion whose analysis is the focus of Liben-Nowell and Kleinberg’swork mentioned earlier. In the Discussion and SI Text, we proposeone micromodel that can generate our Galton–Watson processwith an offspring distribution matching the data.

ResultsWe applied this fitting procedure to an example from theSI Appendix of ref. 6. Specifically, we used the National PublicRadio (NPR) petition, whose observed portion had three compo-nents and used the function f for the component with 2,442observed nodes. Of course, the real NPR dissemination tree pre-sumably had only one component and the pieces in the recon-struction arose from an inability to reconstruct the tree all theway to its origin. We simply take one of the subtrees as a singleinstance of the diffusion process.

The distribution p that was estimated is reported in Table 1. Itsexpectation is 2;441#2;442 $ 0.9996. Indeed, it is immediate toverify on the basis of the formulas above that the estimated dis-tribution p will have expectation !n ! 1"#n in any tree of n nodes.The implications of this rather simple fact deserve some com-ment. It entails that maximum-likelihood estimation of the kindwe perform above, on the basis of a finite tree, will always inferthe process to be subcritical. The extent of subcriticality (the gapbetween the expected number of children and 1) will depend onthe size of the observed tree. On the other hand, a confidenceinterval for the expected number of children per node wouldinclude values exceeding unity. In any case, this feature of theestimator—always estimating !n ! 1"#n as the expected numberof children—is an artifact of using a very simple time-homoge-neous branching process. Including realistic features that limitthe spread of a chain letter once it reaches a large size would,we conjecture, lead to a different maximum-likelihood estimatorof this quantity, which could exceed unity even for finite trees.

After estimating the distribution, we simulated the branchingprocess with this distribution and analyzed only realizationswhose sizes were between those of the largest and smallestobserved components in the NPR data—between 2,442 and3,250 nodes. We generated 10,000 of these realizations. The mostrelevant histograms from the analysis are shown in Fig. 1. Thestatistics we compute for each tree are the median node depth(distance from the origin) as well as the width (maximum numberof nodes at the same depth).

Table 1. The distribution of the number of children per node,estimated from the data

k p!k"

0 0.02461 0.95252 0.02173 0.0012"4 0

10834 % www.pnas.org/cgi/doi/10.1073/pnas.1000814107 Golub and Jackson

the regimes of the Galton–Watson process seems to fit when welook at the unconditional distribution—is overcome by condition-ing our subcritical process on the rare event of growing large, asLiben-Nowell and Kleinberg also do in their model. The maindifference between their approach and ours is that we do notexplicitly model the network or detailed mechanics of the distri-bution process. We focus only on the random variable describinghow many children each node has and on the selection bias.Those two ingredients alone suffice to produce a conditionaldistribution concentrated on trees with the right shapes.

Whereas the specifics of the model and analysis that follow areparticular to the Galton–Watson process, the broader point isworth emphasizing. Large-scale network phenomena that we ob-serve may not be typical instances of the processes that generatedthem but instead exceptional realizations. Although implicationsof selection bias are well understood in some settings, they havenot traditionally played a significant role at the dataset level insocial network analysis. Our study of the chain letters providesa particularly stark example of how this perspective can, in asimple model, explain a great deal about the observations. Thispoints to the need for a richer theoretical understanding of howselection modifies the structure of important classical processes.

ModelWe begin by stating our formal model of chain letter propagationand discussing the fitting of its key parameters. Let X be aGalton–Watson random tree generated by the branching processstarting at one root where the probability of any node having kchildren is p!k" and the distribution is identical across nodes. Thisdistribution is the fundamental parameter in the model. It is asimple matter to fit it by using maximum-likelihood estimationgiven an observed tree. The key fact is that, because the numberof children is conditionally independent and identically distribu-ted across nodes (given the past), this fitting goes through evenwhen there is a size selection bias in the trees that are observed.

Let L!p; x" be the probability of observing a specific tree xunder the model—that is, the probability that X # x. For anyrooted tree x, let f !k; x" refer to the total number of nodes inx with k children. It follows directly that L!p; x" #

Qkp!k"f !k;x",

and so the log-likelihood function is

!!p; x" # f !0; x" log!1 !

!

k>0

p!k""$!

k>0

f !k; x" log p!k":

Maximizing the log-likelihood with respect to p and denoting themaximizer p, we see that if f !k; x" # 0, then p!k" # 0; otherwise,setting the derivative with respect to p!k" to be equal to 0 impliesthat, for every k,

f !k; x"f !0; x"

# p!k"p!0"

[noting that f !0; x" > 0 for any finite tree]. Therefore, for everyk " 1,

p!k" # f !k; x"#!

k

f !k; x":

In other words, the estimated probability of having k children isjust the fraction of nodes with k children in the data. It is straight-forward to verify that p!k" yields a global maximum of the like-lihood function. The same procedure can be carried out on manytrees at once.

It is worth noting that we did not explicitly model the observa-tion process here—as Liben-Nowell and Kleinberg did—in whichsome but not all nodes post the chain letter on the Internet whereit can then be found and its propagation traced back to the root. Itturns out that this aspect of the process can be omitted withoutloss of generality. Formally, our approach corresponds to defin-

ing a node as an observable node—that is, a node that forwardedthe letter and one of whose descendants posted a later version. Aslong as the number of children and the decision of whether topost are independent of each other and identically distributedacross nodes, then the random variables describing how manyobservable children each node has also satisfy the assumptionsof the simple Galton–Watson process, albeit with a differentdistribution.

We also did not explicitly include a model of the network overwhich the diffusion is happening. The reason is that the only thingthat matters for the Galton–Watson process is the number ofobservable children that each node has, and our reduced-formapproach focuses on this process rather than on the network.Although the network structure will certainly influence the off-spring randomvariable, themechanismsof that can be complicated.What kinds of network processes underlie p is an interesting ques-tion whose analysis is the focus of Liben-Nowell and Kleinberg’swork mentioned earlier. In the Discussion and SI Text, we proposeone micromodel that can generate our Galton–Watson processwith an offspring distribution matching the data.

ResultsWe applied this fitting procedure to an example from theSI Appendix of ref. 6. Specifically, we used the National PublicRadio (NPR) petition, whose observed portion had three compo-nents and used the function f for the component with 2,442observed nodes. Of course, the real NPR dissemination tree pre-sumably had only one component and the pieces in the recon-struction arose from an inability to reconstruct the tree all theway to its origin. We simply take one of the subtrees as a singleinstance of the diffusion process.

The distribution p that was estimated is reported in Table 1. Itsexpectation is 2;441#2;442 $ 0.9996. Indeed, it is immediate toverify on the basis of the formulas above that the estimated dis-tribution p will have expectation !n ! 1"#n in any tree of n nodes.The implications of this rather simple fact deserve some com-ment. It entails that maximum-likelihood estimation of the kindwe perform above, on the basis of a finite tree, will always inferthe process to be subcritical. The extent of subcriticality (the gapbetween the expected number of children and 1) will depend onthe size of the observed tree. On the other hand, a confidenceinterval for the expected number of children per node wouldinclude values exceeding unity. In any case, this feature of theestimator—always estimating !n ! 1"#n as the expected numberof children—is an artifact of using a very simple time-homoge-neous branching process. Including realistic features that limitthe spread of a chain letter once it reaches a large size would,we conjecture, lead to a different maximum-likelihood estimatorof this quantity, which could exceed unity even for finite trees.

After estimating the distribution, we simulated the branchingprocess with this distribution and analyzed only realizationswhose sizes were between those of the largest and smallestobserved components in the NPR data—between 2,442 and3,250 nodes. We generated 10,000 of these realizations. The mostrelevant histograms from the analysis are shown in Fig. 1. Thestatistics we compute for each tree are the median node depth(distance from the origin) as well as the width (maximum numberof nodes at the same depth).

Table 1. The distribution of the number of children per node,estimated from the data

k p!k"

0 0.02461 0.95252 0.02173 0.0012"4 0

10834 % www.pnas.org/cgi/doi/10.1073/pnas.1000814107 Golub and Jackson

each node randomly draw a number of children from

a given distribution

Is there a model that could fit the structures of the spreading process?

p(κ)

Page 16: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Galton-Watson Branching Process

and those runs in which the tree fails to reach this size are omitted.Omitting simulated trees that fail to grow large enough is consistentwith our goal of studying properties of information diffusionconditional on reaching a large population; in real life, mostcirculated e-mail messages never spread widely, but we are inter-ested in the structure of those that do.

Our models all incorporate the following two principles: Manyrecipients may choose not to forward the letter at all, and only a fewrecipients will choose to post the letter publicly. Thus, we introducea discard-rate parameter !, specifying the probability that a givenrecipient discards the message and takes no further action on it, anda post-rate parameter ", specifying the probability that eachrecipient publicly posts his or her copy of the letter. In keeping withfindings from earlier experiments based on e-mail forwarding (27),we set the discard-rate to the default value 0.65, although we findthat reasonable variations do not qualitatively change the findings.The post-rate is a parameter that we will more explicitly vary. Publicposting is the only means by which portions of the tree becomeobservable: When a recipient posts the letter, his or her full pathfrom the root becomes visible, and hence in general a node on thetree is observable at the end of the process if and only if one of itsdescendants posted a copy of the letter. We will be studying thestructure of the observable portions of the trees produced by ourmodels, as we do with real chain letters.

We first consider a model based on a direct application of theseprobabilistic ingredients. We choose a random root node andconstruct a tree in unit time steps. In each step, each new recipientof the letter discards it independently with probability ! andotherwise forwards it to all neighbors (posting with probability ").Any neighbor w that has not received the letter previously becomesa new recipient in the next time step; if w receives the letter frommultiple senders, it chooses one of these senders arbitrarily as itsparent in the tree. Finally, once the process terminates, we look atthe observable portion of the tree, consisting of the union of allpaths from the root to the nodes that posted their copy of the letter.

Although such a model is very natural, it produces trees thatcompare poorly to the real chain-letter data. Simulating this modelon the LJ network, the observable portion of the tree has a mediandepth 5.0, width 9,625, and single-child fraction 19.04% (averagedover 10 independent runs) with " ! 0.10, and very similar prop-erties for other small values of ". This wide divergence from the realdata cannot be remedied simply by having recipients send to asmaller set of neighbors; if each recipient who forwards does so toa random subset of 4 or 5 of his or her neighbors, then the widthremains in the thousands, the median depth remains "50, and thesingle-child fraction remains"70%. The central problem is that thisstyle of random epidemic process seems unable to produce treeswhose observable portions are very large, yet with a number ofchildren per node so highly concentrated around 1.

Fig. 3. Close-up of a portion of the tree in Fig. 2.

4636 ! www.pnas.org"cgi"doi"10.1073"pnas.0708471105 Liben-Nowell and Kleinberg

the regimes of the Galton–Watson process seems to fit when welook at the unconditional distribution—is overcome by condition-ing our subcritical process on the rare event of growing large, asLiben-Nowell and Kleinberg also do in their model. The maindifference between their approach and ours is that we do notexplicitly model the network or detailed mechanics of the distri-bution process. We focus only on the random variable describinghow many children each node has and on the selection bias.Those two ingredients alone suffice to produce a conditionaldistribution concentrated on trees with the right shapes.

Whereas the specifics of the model and analysis that follow areparticular to the Galton–Watson process, the broader point isworth emphasizing. Large-scale network phenomena that we ob-serve may not be typical instances of the processes that generatedthem but instead exceptional realizations. Although implicationsof selection bias are well understood in some settings, they havenot traditionally played a significant role at the dataset level insocial network analysis. Our study of the chain letters providesa particularly stark example of how this perspective can, in asimple model, explain a great deal about the observations. Thispoints to the need for a richer theoretical understanding of howselection modifies the structure of important classical processes.

ModelWe begin by stating our formal model of chain letter propagationand discussing the fitting of its key parameters. Let X be aGalton–Watson random tree generated by the branching processstarting at one root where the probability of any node having kchildren is p!k" and the distribution is identical across nodes. Thisdistribution is the fundamental parameter in the model. It is asimple matter to fit it by using maximum-likelihood estimationgiven an observed tree. The key fact is that, because the numberof children is conditionally independent and identically distribu-ted across nodes (given the past), this fitting goes through evenwhen there is a size selection bias in the trees that are observed.

Let L!p; x" be the probability of observing a specific tree xunder the model—that is, the probability that X # x. For anyrooted tree x, let f !k; x" refer to the total number of nodes inx with k children. It follows directly that L!p; x" #

Qkp!k"f !k;x",

and so the log-likelihood function is

!!p; x" # f !0; x" log!1 !

!

k>0

p!k""$!

k>0

f !k; x" log p!k":

Maximizing the log-likelihood with respect to p and denoting themaximizer p, we see that if f !k; x" # 0, then p!k" # 0; otherwise,setting the derivative with respect to p!k" to be equal to 0 impliesthat, for every k,

f !k; x"f !0; x"

# p!k"p!0"

[noting that f !0; x" > 0 for any finite tree]. Therefore, for everyk " 1,

p!k" # f !k; x"#!

k

f !k; x":

In other words, the estimated probability of having k children isjust the fraction of nodes with k children in the data. It is straight-forward to verify that p!k" yields a global maximum of the like-lihood function. The same procedure can be carried out on manytrees at once.

It is worth noting that we did not explicitly model the observa-tion process here—as Liben-Nowell and Kleinberg did—in whichsome but not all nodes post the chain letter on the Internet whereit can then be found and its propagation traced back to the root. Itturns out that this aspect of the process can be omitted withoutloss of generality. Formally, our approach corresponds to defin-

ing a node as an observable node—that is, a node that forwardedthe letter and one of whose descendants posted a later version. Aslong as the number of children and the decision of whether topost are independent of each other and identically distributedacross nodes, then the random variables describing how manyobservable children each node has also satisfy the assumptionsof the simple Galton–Watson process, albeit with a differentdistribution.

We also did not explicitly include a model of the network overwhich the diffusion is happening. The reason is that the only thingthat matters for the Galton–Watson process is the number ofobservable children that each node has, and our reduced-formapproach focuses on this process rather than on the network.Although the network structure will certainly influence the off-spring randomvariable, themechanismsof that can be complicated.What kinds of network processes underlie p is an interesting ques-tion whose analysis is the focus of Liben-Nowell and Kleinberg’swork mentioned earlier. In the Discussion and SI Text, we proposeone micromodel that can generate our Galton–Watson processwith an offspring distribution matching the data.

ResultsWe applied this fitting procedure to an example from theSI Appendix of ref. 6. Specifically, we used the National PublicRadio (NPR) petition, whose observed portion had three compo-nents and used the function f for the component with 2,442observed nodes. Of course, the real NPR dissemination tree pre-sumably had only one component and the pieces in the recon-struction arose from an inability to reconstruct the tree all theway to its origin. We simply take one of the subtrees as a singleinstance of the diffusion process.

The distribution p that was estimated is reported in Table 1. Itsexpectation is 2;441#2;442 $ 0.9996. Indeed, it is immediate toverify on the basis of the formulas above that the estimated dis-tribution p will have expectation !n ! 1"#n in any tree of n nodes.The implications of this rather simple fact deserve some com-ment. It entails that maximum-likelihood estimation of the kindwe perform above, on the basis of a finite tree, will always inferthe process to be subcritical. The extent of subcriticality (the gapbetween the expected number of children and 1) will depend onthe size of the observed tree. On the other hand, a confidenceinterval for the expected number of children per node wouldinclude values exceeding unity. In any case, this feature of theestimator—always estimating !n ! 1"#n as the expected numberof children—is an artifact of using a very simple time-homoge-neous branching process. Including realistic features that limitthe spread of a chain letter once it reaches a large size would,we conjecture, lead to a different maximum-likelihood estimatorof this quantity, which could exceed unity even for finite trees.

After estimating the distribution, we simulated the branchingprocess with this distribution and analyzed only realizationswhose sizes were between those of the largest and smallestobserved components in the NPR data—between 2,442 and3,250 nodes. We generated 10,000 of these realizations. The mostrelevant histograms from the analysis are shown in Fig. 1. Thestatistics we compute for each tree are the median node depth(distance from the origin) as well as the width (maximum numberof nodes at the same depth).

Table 1. The distribution of the number of children per node,estimated from the data

k p!k"

0 0.02461 0.95252 0.02173 0.0012"4 0

10834 % www.pnas.org/cgi/doi/10.1073/pnas.1000814107 Golub and Jackson

the regimes of the Galton–Watson process seems to fit when welook at the unconditional distribution—is overcome by condition-ing our subcritical process on the rare event of growing large, asLiben-Nowell and Kleinberg also do in their model. The maindifference between their approach and ours is that we do notexplicitly model the network or detailed mechanics of the distri-bution process. We focus only on the random variable describinghow many children each node has and on the selection bias.Those two ingredients alone suffice to produce a conditionaldistribution concentrated on trees with the right shapes.

Whereas the specifics of the model and analysis that follow areparticular to the Galton–Watson process, the broader point isworth emphasizing. Large-scale network phenomena that we ob-serve may not be typical instances of the processes that generatedthem but instead exceptional realizations. Although implicationsof selection bias are well understood in some settings, they havenot traditionally played a significant role at the dataset level insocial network analysis. Our study of the chain letters providesa particularly stark example of how this perspective can, in asimple model, explain a great deal about the observations. Thispoints to the need for a richer theoretical understanding of howselection modifies the structure of important classical processes.

ModelWe begin by stating our formal model of chain letter propagationand discussing the fitting of its key parameters. Let X be aGalton–Watson random tree generated by the branching processstarting at one root where the probability of any node having kchildren is p!k" and the distribution is identical across nodes. Thisdistribution is the fundamental parameter in the model. It is asimple matter to fit it by using maximum-likelihood estimationgiven an observed tree. The key fact is that, because the numberof children is conditionally independent and identically distribu-ted across nodes (given the past), this fitting goes through evenwhen there is a size selection bias in the trees that are observed.

Let L!p; x" be the probability of observing a specific tree xunder the model—that is, the probability that X # x. For anyrooted tree x, let f !k; x" refer to the total number of nodes inx with k children. It follows directly that L!p; x" #

Qkp!k"f !k;x",

and so the log-likelihood function is

!!p; x" # f !0; x" log!1 !

!

k>0

p!k""$!

k>0

f !k; x" log p!k":

Maximizing the log-likelihood with respect to p and denoting themaximizer p, we see that if f !k; x" # 0, then p!k" # 0; otherwise,setting the derivative with respect to p!k" to be equal to 0 impliesthat, for every k,

f !k; x"f !0; x"

# p!k"p!0"

[noting that f !0; x" > 0 for any finite tree]. Therefore, for everyk " 1,

p!k" # f !k; x"#!

k

f !k; x":

In other words, the estimated probability of having k children isjust the fraction of nodes with k children in the data. It is straight-forward to verify that p!k" yields a global maximum of the like-lihood function. The same procedure can be carried out on manytrees at once.

It is worth noting that we did not explicitly model the observa-tion process here—as Liben-Nowell and Kleinberg did—in whichsome but not all nodes post the chain letter on the Internet whereit can then be found and its propagation traced back to the root. Itturns out that this aspect of the process can be omitted withoutloss of generality. Formally, our approach corresponds to defin-

ing a node as an observable node—that is, a node that forwardedthe letter and one of whose descendants posted a later version. Aslong as the number of children and the decision of whether topost are independent of each other and identically distributedacross nodes, then the random variables describing how manyobservable children each node has also satisfy the assumptionsof the simple Galton–Watson process, albeit with a differentdistribution.

We also did not explicitly include a model of the network overwhich the diffusion is happening. The reason is that the only thingthat matters for the Galton–Watson process is the number ofobservable children that each node has, and our reduced-formapproach focuses on this process rather than on the network.Although the network structure will certainly influence the off-spring randomvariable, themechanismsof that can be complicated.What kinds of network processes underlie p is an interesting ques-tion whose analysis is the focus of Liben-Nowell and Kleinberg’swork mentioned earlier. In the Discussion and SI Text, we proposeone micromodel that can generate our Galton–Watson processwith an offspring distribution matching the data.

ResultsWe applied this fitting procedure to an example from theSI Appendix of ref. 6. Specifically, we used the National PublicRadio (NPR) petition, whose observed portion had three compo-nents and used the function f for the component with 2,442observed nodes. Of course, the real NPR dissemination tree pre-sumably had only one component and the pieces in the recon-struction arose from an inability to reconstruct the tree all theway to its origin. We simply take one of the subtrees as a singleinstance of the diffusion process.

The distribution p that was estimated is reported in Table 1. Itsexpectation is 2;441#2;442 $ 0.9996. Indeed, it is immediate toverify on the basis of the formulas above that the estimated dis-tribution p will have expectation !n ! 1"#n in any tree of n nodes.The implications of this rather simple fact deserve some com-ment. It entails that maximum-likelihood estimation of the kindwe perform above, on the basis of a finite tree, will always inferthe process to be subcritical. The extent of subcriticality (the gapbetween the expected number of children and 1) will depend onthe size of the observed tree. On the other hand, a confidenceinterval for the expected number of children per node wouldinclude values exceeding unity. In any case, this feature of theestimator—always estimating !n ! 1"#n as the expected numberof children—is an artifact of using a very simple time-homoge-neous branching process. Including realistic features that limitthe spread of a chain letter once it reaches a large size would,we conjecture, lead to a different maximum-likelihood estimatorof this quantity, which could exceed unity even for finite trees.

After estimating the distribution, we simulated the branchingprocess with this distribution and analyzed only realizationswhose sizes were between those of the largest and smallestobserved components in the NPR data—between 2,442 and3,250 nodes. We generated 10,000 of these realizations. The mostrelevant histograms from the analysis are shown in Fig. 1. Thestatistics we compute for each tree are the median node depth(distance from the origin) as well as the width (maximum numberof nodes at the same depth).

Table 1. The distribution of the number of children per node,estimated from the data

k p!k"

0 0.02461 0.95252 0.02173 0.0012"4 0

10834 % www.pnas.org/cgi/doi/10.1073/pnas.1000814107 Golub and Jackson

each node randomly draw a number of children from

a given distribution

Is there a model that could fit the structures of the spreading process?

D. Liben-Nowell and J. Kleinberg, Proc Natl Acad Sci 2008B. Golub and M. O. Jackson, Proc Natl Acad Sci 2010

p(κ)

Page 17: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Aug 5, 09:30:12 "data request"

Aug 5, 09:53:00 "Fw: data request"

Aug 6, 14:21:53 "Fw: Fw: data request"

size = 8width = 4depth = 3

Galton-Watson Branching Process

0 2 5 10 15 2010−4

10−3

10−2

10−1

100

depthP(

dept

h)

empiricalmodel

100 101 102 103 10410−6

10−4

10−2

100

size

P(si

ze)

empiricalmodel

100 101 102 10310−6

10−4

10−2

100

width

P(w

idth

)

empiricalmodel

Tree size, width, and depth

Page 18: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

0 2 5 10 15 2010−4

10−3

10−2

10−1

100

depth

P(de

pth)

empiricalmodel

100 101 102 103 10410−6

10−4

10−2

100

size

P(si

ze)

empiricalmodel

100 101 102 10310−6

10−4

10−2

100

widthP(

wid

th)

empiricalmodel

Tree size, width, and depth

Page 19: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

0 2 5 10 15 2010−4

10−3

10−2

10−1

100

depth

P(de

pth)

empiricalmodel

100 101 102 103 10410−6

10−4

10−2

100

size

P(si

ze)

empiricalmodel

100 101 102 10310−6

10−4

10−2

100

widthP(

wid

th)

empiricalmodel

Aug 5, 09:30:12 "data request"

Aug 5, 09:53:00 "Fw: data request"

Aug 6, 14:21:53 "Fw: Fw: data request"

Anomaly #1 Ultra Shallow!

Tree size, width, and depth

Page 20: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

and those runs in which the tree fails to reach this size are omitted.Omitting simulated trees that fail to grow large enough is consistentwith our goal of studying properties of information diffusionconditional on reaching a large population; in real life, mostcirculated e-mail messages never spread widely, but we are inter-ested in the structure of those that do.

Our models all incorporate the following two principles: Manyrecipients may choose not to forward the letter at all, and only a fewrecipients will choose to post the letter publicly. Thus, we introducea discard-rate parameter !, specifying the probability that a givenrecipient discards the message and takes no further action on it, anda post-rate parameter ", specifying the probability that eachrecipient publicly posts his or her copy of the letter. In keeping withfindings from earlier experiments based on e-mail forwarding (27),we set the discard-rate to the default value 0.65, although we findthat reasonable variations do not qualitatively change the findings.The post-rate is a parameter that we will more explicitly vary. Publicposting is the only means by which portions of the tree becomeobservable: When a recipient posts the letter, his or her full pathfrom the root becomes visible, and hence in general a node on thetree is observable at the end of the process if and only if one of itsdescendants posted a copy of the letter. We will be studying thestructure of the observable portions of the trees produced by ourmodels, as we do with real chain letters.

We first consider a model based on a direct application of theseprobabilistic ingredients. We choose a random root node andconstruct a tree in unit time steps. In each step, each new recipientof the letter discards it independently with probability ! andotherwise forwards it to all neighbors (posting with probability ").Any neighbor w that has not received the letter previously becomesa new recipient in the next time step; if w receives the letter frommultiple senders, it chooses one of these senders arbitrarily as itsparent in the tree. Finally, once the process terminates, we look atthe observable portion of the tree, consisting of the union of allpaths from the root to the nodes that posted their copy of the letter.

Although such a model is very natural, it produces trees thatcompare poorly to the real chain-letter data. Simulating this modelon the LJ network, the observable portion of the tree has a mediandepth 5.0, width 9,625, and single-child fraction 19.04% (averagedover 10 independent runs) with " ! 0.10, and very similar prop-erties for other small values of ". This wide divergence from the realdata cannot be remedied simply by having recipients send to asmaller set of neighbors; if each recipient who forwards does so toa random subset of 4 or 5 of his or her neighbors, then the widthremains in the thousands, the median depth remains "50, and thesingle-child fraction remains"70%. The central problem is that thisstyle of random epidemic process seems unable to produce treeswhose observable portions are very large, yet with a number ofchildren per node so highly concentrated around 1.

Fig. 3. Close-up of a portion of the tree in Fig. 2.

4636 ! www.pnas.org"cgi"doi"10.1073"pnas.0708471105 Liben-Nowell and Kleinberg

Aug 5, 09:30:12 "data request"

Aug 5, 09:53:00 "Fw: data request"

Aug 6, 14:21:53 "Fw: Fw: data request"

P (κ | d)

Galton-Watson Branching Process

the regimes of the Galton–Watson process seems to fit when welook at the unconditional distribution—is overcome by condition-ing our subcritical process on the rare event of growing large, asLiben-Nowell and Kleinberg also do in their model. The maindifference between their approach and ours is that we do notexplicitly model the network or detailed mechanics of the distri-bution process. We focus only on the random variable describinghow many children each node has and on the selection bias.Those two ingredients alone suffice to produce a conditionaldistribution concentrated on trees with the right shapes.

Whereas the specifics of the model and analysis that follow areparticular to the Galton–Watson process, the broader point isworth emphasizing. Large-scale network phenomena that we ob-serve may not be typical instances of the processes that generatedthem but instead exceptional realizations. Although implicationsof selection bias are well understood in some settings, they havenot traditionally played a significant role at the dataset level insocial network analysis. Our study of the chain letters providesa particularly stark example of how this perspective can, in asimple model, explain a great deal about the observations. Thispoints to the need for a richer theoretical understanding of howselection modifies the structure of important classical processes.

ModelWe begin by stating our formal model of chain letter propagationand discussing the fitting of its key parameters. Let X be aGalton–Watson random tree generated by the branching processstarting at one root where the probability of any node having kchildren is p!k" and the distribution is identical across nodes. Thisdistribution is the fundamental parameter in the model. It is asimple matter to fit it by using maximum-likelihood estimationgiven an observed tree. The key fact is that, because the numberof children is conditionally independent and identically distribu-ted across nodes (given the past), this fitting goes through evenwhen there is a size selection bias in the trees that are observed.

Let L!p; x" be the probability of observing a specific tree xunder the model—that is, the probability that X # x. For anyrooted tree x, let f !k; x" refer to the total number of nodes inx with k children. It follows directly that L!p; x" #

Qkp!k"f !k;x",

and so the log-likelihood function is

!!p; x" # f !0; x" log!1 !

!

k>0

p!k""$!

k>0

f !k; x" log p!k":

Maximizing the log-likelihood with respect to p and denoting themaximizer p, we see that if f !k; x" # 0, then p!k" # 0; otherwise,setting the derivative with respect to p!k" to be equal to 0 impliesthat, for every k,

f !k; x"f !0; x"

# p!k"p!0"

[noting that f !0; x" > 0 for any finite tree]. Therefore, for everyk " 1,

p!k" # f !k; x"#!

k

f !k; x":

In other words, the estimated probability of having k children isjust the fraction of nodes with k children in the data. It is straight-forward to verify that p!k" yields a global maximum of the like-lihood function. The same procedure can be carried out on manytrees at once.

It is worth noting that we did not explicitly model the observa-tion process here—as Liben-Nowell and Kleinberg did—in whichsome but not all nodes post the chain letter on the Internet whereit can then be found and its propagation traced back to the root. Itturns out that this aspect of the process can be omitted withoutloss of generality. Formally, our approach corresponds to defin-

ing a node as an observable node—that is, a node that forwardedthe letter and one of whose descendants posted a later version. Aslong as the number of children and the decision of whether topost are independent of each other and identically distributedacross nodes, then the random variables describing how manyobservable children each node has also satisfy the assumptionsof the simple Galton–Watson process, albeit with a differentdistribution.

We also did not explicitly include a model of the network overwhich the diffusion is happening. The reason is that the only thingthat matters for the Galton–Watson process is the number ofobservable children that each node has, and our reduced-formapproach focuses on this process rather than on the network.Although the network structure will certainly influence the off-spring randomvariable, themechanismsof that can be complicated.What kinds of network processes underlie p is an interesting ques-tion whose analysis is the focus of Liben-Nowell and Kleinberg’swork mentioned earlier. In the Discussion and SI Text, we proposeone micromodel that can generate our Galton–Watson processwith an offspring distribution matching the data.

ResultsWe applied this fitting procedure to an example from theSI Appendix of ref. 6. Specifically, we used the National PublicRadio (NPR) petition, whose observed portion had three compo-nents and used the function f for the component with 2,442observed nodes. Of course, the real NPR dissemination tree pre-sumably had only one component and the pieces in the recon-struction arose from an inability to reconstruct the tree all theway to its origin. We simply take one of the subtrees as a singleinstance of the diffusion process.

The distribution p that was estimated is reported in Table 1. Itsexpectation is 2;441#2;442 $ 0.9996. Indeed, it is immediate toverify on the basis of the formulas above that the estimated dis-tribution p will have expectation !n ! 1"#n in any tree of n nodes.The implications of this rather simple fact deserve some com-ment. It entails that maximum-likelihood estimation of the kindwe perform above, on the basis of a finite tree, will always inferthe process to be subcritical. The extent of subcriticality (the gapbetween the expected number of children and 1) will depend onthe size of the observed tree. On the other hand, a confidenceinterval for the expected number of children per node wouldinclude values exceeding unity. In any case, this feature of theestimator—always estimating !n ! 1"#n as the expected numberof children—is an artifact of using a very simple time-homoge-neous branching process. Including realistic features that limitthe spread of a chain letter once it reaches a large size would,we conjecture, lead to a different maximum-likelihood estimatorof this quantity, which could exceed unity even for finite trees.

After estimating the distribution, we simulated the branchingprocess with this distribution and analyzed only realizationswhose sizes were between those of the largest and smallestobserved components in the NPR data—between 2,442 and3,250 nodes. We generated 10,000 of these realizations. The mostrelevant histograms from the analysis are shown in Fig. 1. Thestatistics we compute for each tree are the median node depth(distance from the origin) as well as the width (maximum numberof nodes at the same depth).

Table 1. The distribution of the number of children per node,estimated from the data

k p!k"

0 0.02461 0.95252 0.02173 0.0012"4 0

10834 % www.pnas.org/cgi/doi/10.1073/pnas.1000814107 Golub and Jackson

the regimes of the Galton–Watson process seems to fit when welook at the unconditional distribution—is overcome by condition-ing our subcritical process on the rare event of growing large, asLiben-Nowell and Kleinberg also do in their model. The maindifference between their approach and ours is that we do notexplicitly model the network or detailed mechanics of the distri-bution process. We focus only on the random variable describinghow many children each node has and on the selection bias.Those two ingredients alone suffice to produce a conditionaldistribution concentrated on trees with the right shapes.

Whereas the specifics of the model and analysis that follow areparticular to the Galton–Watson process, the broader point isworth emphasizing. Large-scale network phenomena that we ob-serve may not be typical instances of the processes that generatedthem but instead exceptional realizations. Although implicationsof selection bias are well understood in some settings, they havenot traditionally played a significant role at the dataset level insocial network analysis. Our study of the chain letters providesa particularly stark example of how this perspective can, in asimple model, explain a great deal about the observations. Thispoints to the need for a richer theoretical understanding of howselection modifies the structure of important classical processes.

ModelWe begin by stating our formal model of chain letter propagationand discussing the fitting of its key parameters. Let X be aGalton–Watson random tree generated by the branching processstarting at one root where the probability of any node having kchildren is p!k" and the distribution is identical across nodes. Thisdistribution is the fundamental parameter in the model. It is asimple matter to fit it by using maximum-likelihood estimationgiven an observed tree. The key fact is that, because the numberof children is conditionally independent and identically distribu-ted across nodes (given the past), this fitting goes through evenwhen there is a size selection bias in the trees that are observed.

Let L!p; x" be the probability of observing a specific tree xunder the model—that is, the probability that X # x. For anyrooted tree x, let f !k; x" refer to the total number of nodes inx with k children. It follows directly that L!p; x" #

Qkp!k"f !k;x",

and so the log-likelihood function is

!!p; x" # f !0; x" log!1 !

!

k>0

p!k""$!

k>0

f !k; x" log p!k":

Maximizing the log-likelihood with respect to p and denoting themaximizer p, we see that if f !k; x" # 0, then p!k" # 0; otherwise,setting the derivative with respect to p!k" to be equal to 0 impliesthat, for every k,

f !k; x"f !0; x"

# p!k"p!0"

[noting that f !0; x" > 0 for any finite tree]. Therefore, for everyk " 1,

p!k" # f !k; x"#!

k

f !k; x":

In other words, the estimated probability of having k children isjust the fraction of nodes with k children in the data. It is straight-forward to verify that p!k" yields a global maximum of the like-lihood function. The same procedure can be carried out on manytrees at once.

It is worth noting that we did not explicitly model the observa-tion process here—as Liben-Nowell and Kleinberg did—in whichsome but not all nodes post the chain letter on the Internet whereit can then be found and its propagation traced back to the root. Itturns out that this aspect of the process can be omitted withoutloss of generality. Formally, our approach corresponds to defin-

ing a node as an observable node—that is, a node that forwardedthe letter and one of whose descendants posted a later version. Aslong as the number of children and the decision of whether topost are independent of each other and identically distributedacross nodes, then the random variables describing how manyobservable children each node has also satisfy the assumptionsof the simple Galton–Watson process, albeit with a differentdistribution.

We also did not explicitly include a model of the network overwhich the diffusion is happening. The reason is that the only thingthat matters for the Galton–Watson process is the number ofobservable children that each node has, and our reduced-formapproach focuses on this process rather than on the network.Although the network structure will certainly influence the off-spring randomvariable, themechanismsof that can be complicated.What kinds of network processes underlie p is an interesting ques-tion whose analysis is the focus of Liben-Nowell and Kleinberg’swork mentioned earlier. In the Discussion and SI Text, we proposeone micromodel that can generate our Galton–Watson processwith an offspring distribution matching the data.

ResultsWe applied this fitting procedure to an example from theSI Appendix of ref. 6. Specifically, we used the National PublicRadio (NPR) petition, whose observed portion had three compo-nents and used the function f for the component with 2,442observed nodes. Of course, the real NPR dissemination tree pre-sumably had only one component and the pieces in the recon-struction arose from an inability to reconstruct the tree all theway to its origin. We simply take one of the subtrees as a singleinstance of the diffusion process.

The distribution p that was estimated is reported in Table 1. Itsexpectation is 2;441#2;442 $ 0.9996. Indeed, it is immediate toverify on the basis of the formulas above that the estimated dis-tribution p will have expectation !n ! 1"#n in any tree of n nodes.The implications of this rather simple fact deserve some com-ment. It entails that maximum-likelihood estimation of the kindwe perform above, on the basis of a finite tree, will always inferthe process to be subcritical. The extent of subcriticality (the gapbetween the expected number of children and 1) will depend onthe size of the observed tree. On the other hand, a confidenceinterval for the expected number of children per node wouldinclude values exceeding unity. In any case, this feature of theestimator—always estimating !n ! 1"#n as the expected numberof children—is an artifact of using a very simple time-homoge-neous branching process. Including realistic features that limitthe spread of a chain letter once it reaches a large size would,we conjecture, lead to a different maximum-likelihood estimatorof this quantity, which could exceed unity even for finite trees.

After estimating the distribution, we simulated the branchingprocess with this distribution and analyzed only realizationswhose sizes were between those of the largest and smallestobserved components in the NPR data—between 2,442 and3,250 nodes. We generated 10,000 of these realizations. The mostrelevant histograms from the analysis are shown in Fig. 1. Thestatistics we compute for each tree are the median node depth(distance from the origin) as well as the width (maximum numberof nodes at the same depth).

Table 1. The distribution of the number of children per node,estimated from the data

k p!k"

0 0.02461 0.95252 0.02173 0.0012"4 0

10834 % www.pnas.org/cgi/doi/10.1073/pnas.1000814107 Golub and Jackson

d=0

d=1

d=2

d=3

Stage Dependence

p(κ)

Page 21: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

and those runs in which the tree fails to reach this size are omitted.Omitting simulated trees that fail to grow large enough is consistentwith our goal of studying properties of information diffusionconditional on reaching a large population; in real life, mostcirculated e-mail messages never spread widely, but we are inter-ested in the structure of those that do.

Our models all incorporate the following two principles: Manyrecipients may choose not to forward the letter at all, and only a fewrecipients will choose to post the letter publicly. Thus, we introducea discard-rate parameter !, specifying the probability that a givenrecipient discards the message and takes no further action on it, anda post-rate parameter ", specifying the probability that eachrecipient publicly posts his or her copy of the letter. In keeping withfindings from earlier experiments based on e-mail forwarding (27),we set the discard-rate to the default value 0.65, although we findthat reasonable variations do not qualitatively change the findings.The post-rate is a parameter that we will more explicitly vary. Publicposting is the only means by which portions of the tree becomeobservable: When a recipient posts the letter, his or her full pathfrom the root becomes visible, and hence in general a node on thetree is observable at the end of the process if and only if one of itsdescendants posted a copy of the letter. We will be studying thestructure of the observable portions of the trees produced by ourmodels, as we do with real chain letters.

We first consider a model based on a direct application of theseprobabilistic ingredients. We choose a random root node andconstruct a tree in unit time steps. In each step, each new recipientof the letter discards it independently with probability ! andotherwise forwards it to all neighbors (posting with probability ").Any neighbor w that has not received the letter previously becomesa new recipient in the next time step; if w receives the letter frommultiple senders, it chooses one of these senders arbitrarily as itsparent in the tree. Finally, once the process terminates, we look atthe observable portion of the tree, consisting of the union of allpaths from the root to the nodes that posted their copy of the letter.

Although such a model is very natural, it produces trees thatcompare poorly to the real chain-letter data. Simulating this modelon the LJ network, the observable portion of the tree has a mediandepth 5.0, width 9,625, and single-child fraction 19.04% (averagedover 10 independent runs) with " ! 0.10, and very similar prop-erties for other small values of ". This wide divergence from the realdata cannot be remedied simply by having recipients send to asmaller set of neighbors; if each recipient who forwards does so toa random subset of 4 or 5 of his or her neighbors, then the widthremains in the thousands, the median depth remains "50, and thesingle-child fraction remains"70%. The central problem is that thisstyle of random epidemic process seems unable to produce treeswhose observable portions are very large, yet with a number ofchildren per node so highly concentrated around 1.

Fig. 3. Close-up of a portion of the tree in Fig. 2.

4636 ! www.pnas.org"cgi"doi"10.1073"pnas.0708471105 Liben-Nowell and Kleinberg

Aug 5, 09:30:12 "data request"

Aug 5, 09:53:00 "Fw: data request"

Aug 6, 14:21:53 "Fw: Fw: data request"

P (κ | d)

Galton-Watson Branching Process

the regimes of the Galton–Watson process seems to fit when welook at the unconditional distribution—is overcome by condition-ing our subcritical process on the rare event of growing large, asLiben-Nowell and Kleinberg also do in their model. The maindifference between their approach and ours is that we do notexplicitly model the network or detailed mechanics of the distri-bution process. We focus only on the random variable describinghow many children each node has and on the selection bias.Those two ingredients alone suffice to produce a conditionaldistribution concentrated on trees with the right shapes.

Whereas the specifics of the model and analysis that follow areparticular to the Galton–Watson process, the broader point isworth emphasizing. Large-scale network phenomena that we ob-serve may not be typical instances of the processes that generatedthem but instead exceptional realizations. Although implicationsof selection bias are well understood in some settings, they havenot traditionally played a significant role at the dataset level insocial network analysis. Our study of the chain letters providesa particularly stark example of how this perspective can, in asimple model, explain a great deal about the observations. Thispoints to the need for a richer theoretical understanding of howselection modifies the structure of important classical processes.

ModelWe begin by stating our formal model of chain letter propagationand discussing the fitting of its key parameters. Let X be aGalton–Watson random tree generated by the branching processstarting at one root where the probability of any node having kchildren is p!k" and the distribution is identical across nodes. Thisdistribution is the fundamental parameter in the model. It is asimple matter to fit it by using maximum-likelihood estimationgiven an observed tree. The key fact is that, because the numberof children is conditionally independent and identically distribu-ted across nodes (given the past), this fitting goes through evenwhen there is a size selection bias in the trees that are observed.

Let L!p; x" be the probability of observing a specific tree xunder the model—that is, the probability that X # x. For anyrooted tree x, let f !k; x" refer to the total number of nodes inx with k children. It follows directly that L!p; x" #

Qkp!k"f !k;x",

and so the log-likelihood function is

!!p; x" # f !0; x" log!1 !

!

k>0

p!k""$!

k>0

f !k; x" log p!k":

Maximizing the log-likelihood with respect to p and denoting themaximizer p, we see that if f !k; x" # 0, then p!k" # 0; otherwise,setting the derivative with respect to p!k" to be equal to 0 impliesthat, for every k,

f !k; x"f !0; x"

# p!k"p!0"

[noting that f !0; x" > 0 for any finite tree]. Therefore, for everyk " 1,

p!k" # f !k; x"#!

k

f !k; x":

In other words, the estimated probability of having k children isjust the fraction of nodes with k children in the data. It is straight-forward to verify that p!k" yields a global maximum of the like-lihood function. The same procedure can be carried out on manytrees at once.

It is worth noting that we did not explicitly model the observa-tion process here—as Liben-Nowell and Kleinberg did—in whichsome but not all nodes post the chain letter on the Internet whereit can then be found and its propagation traced back to the root. Itturns out that this aspect of the process can be omitted withoutloss of generality. Formally, our approach corresponds to defin-

ing a node as an observable node—that is, a node that forwardedthe letter and one of whose descendants posted a later version. Aslong as the number of children and the decision of whether topost are independent of each other and identically distributedacross nodes, then the random variables describing how manyobservable children each node has also satisfy the assumptionsof the simple Galton–Watson process, albeit with a differentdistribution.

We also did not explicitly include a model of the network overwhich the diffusion is happening. The reason is that the only thingthat matters for the Galton–Watson process is the number ofobservable children that each node has, and our reduced-formapproach focuses on this process rather than on the network.Although the network structure will certainly influence the off-spring randomvariable, themechanismsof that can be complicated.What kinds of network processes underlie p is an interesting ques-tion whose analysis is the focus of Liben-Nowell and Kleinberg’swork mentioned earlier. In the Discussion and SI Text, we proposeone micromodel that can generate our Galton–Watson processwith an offspring distribution matching the data.

ResultsWe applied this fitting procedure to an example from theSI Appendix of ref. 6. Specifically, we used the National PublicRadio (NPR) petition, whose observed portion had three compo-nents and used the function f for the component with 2,442observed nodes. Of course, the real NPR dissemination tree pre-sumably had only one component and the pieces in the recon-struction arose from an inability to reconstruct the tree all theway to its origin. We simply take one of the subtrees as a singleinstance of the diffusion process.

The distribution p that was estimated is reported in Table 1. Itsexpectation is 2;441#2;442 $ 0.9996. Indeed, it is immediate toverify on the basis of the formulas above that the estimated dis-tribution p will have expectation !n ! 1"#n in any tree of n nodes.The implications of this rather simple fact deserve some com-ment. It entails that maximum-likelihood estimation of the kindwe perform above, on the basis of a finite tree, will always inferthe process to be subcritical. The extent of subcriticality (the gapbetween the expected number of children and 1) will depend onthe size of the observed tree. On the other hand, a confidenceinterval for the expected number of children per node wouldinclude values exceeding unity. In any case, this feature of theestimator—always estimating !n ! 1"#n as the expected numberof children—is an artifact of using a very simple time-homoge-neous branching process. Including realistic features that limitthe spread of a chain letter once it reaches a large size would,we conjecture, lead to a different maximum-likelihood estimatorof this quantity, which could exceed unity even for finite trees.

After estimating the distribution, we simulated the branchingprocess with this distribution and analyzed only realizationswhose sizes were between those of the largest and smallestobserved components in the NPR data—between 2,442 and3,250 nodes. We generated 10,000 of these realizations. The mostrelevant histograms from the analysis are shown in Fig. 1. Thestatistics we compute for each tree are the median node depth(distance from the origin) as well as the width (maximum numberof nodes at the same depth).

Table 1. The distribution of the number of children per node,estimated from the data

k p!k"

0 0.02461 0.95252 0.02173 0.0012"4 0

10834 % www.pnas.org/cgi/doi/10.1073/pnas.1000814107 Golub and Jackson

the regimes of the Galton–Watson process seems to fit when welook at the unconditional distribution—is overcome by condition-ing our subcritical process on the rare event of growing large, asLiben-Nowell and Kleinberg also do in their model. The maindifference between their approach and ours is that we do notexplicitly model the network or detailed mechanics of the distri-bution process. We focus only on the random variable describinghow many children each node has and on the selection bias.Those two ingredients alone suffice to produce a conditionaldistribution concentrated on trees with the right shapes.

Whereas the specifics of the model and analysis that follow areparticular to the Galton–Watson process, the broader point isworth emphasizing. Large-scale network phenomena that we ob-serve may not be typical instances of the processes that generatedthem but instead exceptional realizations. Although implicationsof selection bias are well understood in some settings, they havenot traditionally played a significant role at the dataset level insocial network analysis. Our study of the chain letters providesa particularly stark example of how this perspective can, in asimple model, explain a great deal about the observations. Thispoints to the need for a richer theoretical understanding of howselection modifies the structure of important classical processes.

ModelWe begin by stating our formal model of chain letter propagationand discussing the fitting of its key parameters. Let X be aGalton–Watson random tree generated by the branching processstarting at one root where the probability of any node having kchildren is p!k" and the distribution is identical across nodes. Thisdistribution is the fundamental parameter in the model. It is asimple matter to fit it by using maximum-likelihood estimationgiven an observed tree. The key fact is that, because the numberof children is conditionally independent and identically distribu-ted across nodes (given the past), this fitting goes through evenwhen there is a size selection bias in the trees that are observed.

Let L!p; x" be the probability of observing a specific tree xunder the model—that is, the probability that X # x. For anyrooted tree x, let f !k; x" refer to the total number of nodes inx with k children. It follows directly that L!p; x" #

Qkp!k"f !k;x",

and so the log-likelihood function is

!!p; x" # f !0; x" log!1 !

!

k>0

p!k""$!

k>0

f !k; x" log p!k":

Maximizing the log-likelihood with respect to p and denoting themaximizer p, we see that if f !k; x" # 0, then p!k" # 0; otherwise,setting the derivative with respect to p!k" to be equal to 0 impliesthat, for every k,

f !k; x"f !0; x"

# p!k"p!0"

[noting that f !0; x" > 0 for any finite tree]. Therefore, for everyk " 1,

p!k" # f !k; x"#!

k

f !k; x":

In other words, the estimated probability of having k children isjust the fraction of nodes with k children in the data. It is straight-forward to verify that p!k" yields a global maximum of the like-lihood function. The same procedure can be carried out on manytrees at once.

It is worth noting that we did not explicitly model the observa-tion process here—as Liben-Nowell and Kleinberg did—in whichsome but not all nodes post the chain letter on the Internet whereit can then be found and its propagation traced back to the root. Itturns out that this aspect of the process can be omitted withoutloss of generality. Formally, our approach corresponds to defin-

ing a node as an observable node—that is, a node that forwardedthe letter and one of whose descendants posted a later version. Aslong as the number of children and the decision of whether topost are independent of each other and identically distributedacross nodes, then the random variables describing how manyobservable children each node has also satisfy the assumptionsof the simple Galton–Watson process, albeit with a differentdistribution.

We also did not explicitly include a model of the network overwhich the diffusion is happening. The reason is that the only thingthat matters for the Galton–Watson process is the number ofobservable children that each node has, and our reduced-formapproach focuses on this process rather than on the network.Although the network structure will certainly influence the off-spring randomvariable, themechanismsof that can be complicated.What kinds of network processes underlie p is an interesting ques-tion whose analysis is the focus of Liben-Nowell and Kleinberg’swork mentioned earlier. In the Discussion and SI Text, we proposeone micromodel that can generate our Galton–Watson processwith an offspring distribution matching the data.

ResultsWe applied this fitting procedure to an example from theSI Appendix of ref. 6. Specifically, we used the National PublicRadio (NPR) petition, whose observed portion had three compo-nents and used the function f for the component with 2,442observed nodes. Of course, the real NPR dissemination tree pre-sumably had only one component and the pieces in the recon-struction arose from an inability to reconstruct the tree all theway to its origin. We simply take one of the subtrees as a singleinstance of the diffusion process.

The distribution p that was estimated is reported in Table 1. Itsexpectation is 2;441#2;442 $ 0.9996. Indeed, it is immediate toverify on the basis of the formulas above that the estimated dis-tribution p will have expectation !n ! 1"#n in any tree of n nodes.The implications of this rather simple fact deserve some com-ment. It entails that maximum-likelihood estimation of the kindwe perform above, on the basis of a finite tree, will always inferthe process to be subcritical. The extent of subcriticality (the gapbetween the expected number of children and 1) will depend onthe size of the observed tree. On the other hand, a confidenceinterval for the expected number of children per node wouldinclude values exceeding unity. In any case, this feature of theestimator—always estimating !n ! 1"#n as the expected numberof children—is an artifact of using a very simple time-homoge-neous branching process. Including realistic features that limitthe spread of a chain letter once it reaches a large size would,we conjecture, lead to a different maximum-likelihood estimatorof this quantity, which could exceed unity even for finite trees.

After estimating the distribution, we simulated the branchingprocess with this distribution and analyzed only realizationswhose sizes were between those of the largest and smallestobserved components in the NPR data—between 2,442 and3,250 nodes. We generated 10,000 of these realizations. The mostrelevant histograms from the analysis are shown in Fig. 1. Thestatistics we compute for each tree are the median node depth(distance from the origin) as well as the width (maximum numberof nodes at the same depth).

Table 1. The distribution of the number of children per node,estimated from the data

k p!k"

0 0.02461 0.95252 0.02173 0.0012"4 0

10834 % www.pnas.org/cgi/doi/10.1073/pnas.1000814107 Golub and Jackson

d=0

d=1

d=2

d=3

p(κ)

Stage Dependence

100 101 10210−6

10−4

10−2

100

κ

P(κ|d)

~3.5

~2.5

d=0d=1d=2d=3

Page 22: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Aug 5, 09:30:12 "data request"

Aug 5, 09:53:00 "Fw: data request"

Aug 6, 14:21:53 "Fw: Fw: data request"

P (κ | d)

d=0

d=1

d=2

d=3

Anomaly #2 Stage Dependence

Stage Dependence

100 101 10210−6

10−4

10−2

100

κ

P(κ|d)

~3.5

~2.5

d=0d=1d=2d=3

Page 23: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Aug 5, 09:30:12 "data request"

Aug 5, 09:53:00 "Fw: data request"

Aug 6, 14:21:53 "Fw: Fw: data request"

Stage DependenceUltra Shallow

100 101 10210−6

10−4

10−2

100

κ

P(κ|d)

~3.5

~2.5

d=0d=1d=2d=3

0 2 5 10 15 2010−4

10−3

10−2

10−1

100

depth

P(de

pth)

empiricalmodel

Empirical Observations

Page 24: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

100 101 10210−6

10−4

10−2

100

κ

P(κ|d)

~3.5

~2.5

d=0d=1d=2d=3

Modeling the spreading processes

Page 25: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

100 101 10210−6

10−4

10−2

100

κ

P(κ|d)

~3.5

~2.5

d=0d=1d=2d=3

κ vs. k

Figure 13: Distribution of branching factors ! con-ditioned on the degree of the node k. The branchingfactors are independent on the degree connectivity.

underlying social network. As shown in Fig. 4, the degreedistribution, P (k), is also fat tailed [5, 3, 8]. Indeed, in-dividuals are connected di!erently in the network. Whilemost people have only a few connections, there are a no-table number of individuals who have many social neigh-bors. This raises an important question: to what extentdoes the information spreading process depend on the un-derlying social network? First o!, the branching factor ! foran individual in the spreading process is upper-bounded bythe total number of connections s/he has. Yet beyond that,it is important to inspect whether there is a correlation be-tween k and !. This question has a number of importantimplications. In the viral marketing case for example, wherethe underlying social network is usually not visible, the cor-relation between k and ! will tell us whether it is a goodmarketing strategy to carefully choose the seed populationsto spread an advertisement. A positive correlation suggeststhat it does matter who you choose to start the spreading,as social hubs would tend to send the information to morepeople. Yet if the correlation is not so strong, one couldargue that perhaps it is not so important how one choosesthe seed population. Another example comes from the dif-ference between the spreading of information and diseases.Indeed, diseases spread from a seed to many others throughnetworks, bearing high level similarity to the spreading ofinformation. The models of epidemics commonly rely oninfection rates, where better connected nodes infect moreneighbors, corresponding to a strong correlation between kand !. Therefore, understanding to what extent informa-tion spreading relies on the context of underlying networkwould quantitatively assess the di!erence between these twospreading processes, arguing whether the existing epidemicmodels are applicable to the spreading of information.

The correlation between k and ! can be examined by em-pirically measuring the conditional probability P (! | k). In-deed, as P (!) =

!P (! | k)P (k)dk, if ! is largely uncorre-

lated with k, P (! | k) can therefore be factored out of theintegral, giving P (!) = P (! | k), leading to a data collapsewhen plotting P (!) in di!erent curves by grouping individ-uals of similar k. We show in Fig. 13 the conditional prob-ability P (! | k) for two di!erent stages of spreading (d = 0vs. d = 1), respectively. Surprisingly, we observe very goodcollapse for di!erent k in both figures, which indicates thatthere is no direct correlation between k and !. The breadthof the dissemination of information is independent of theconnectivity k of individuals. This indicates, while to whoma user forwards the information indeed depends on the un-derlying social network (as shown in Sec. 4.1), to how manypeople (!) one would forward the information does not.

100

101

102

10!6

10!4

10!2

100

n

Pn(n

)

Figure 14: The distribution of the number of recip-ients for each email Pn(n) is fat-tailed.

100

101

102

10!6

10!4

10!2

100

!

P(!

)

d=0d=1

Figure 15: P (!) for d = 0 and d = 1, model pre-diction (solid lines) vs. experiment measure (scat-tered squares and circles). Our model well capturesthe stage dependence phenomenon of informationspreading.

The surprising independence of node properties of the in-formation spreading process leads us to question its depen-dence on the media properties of email systems. There-fore, we model the spreading processes by mimicking theway emails are sent. Indeed, an important feature of emailcommunication, distinguishing it from other forms of com-munication, like cell phones, is the ability to send a messageto multiple recipients at the same time.

Therefore, the distribution of the number of recipients forall the emails being sent should follow some non-trivial form,other than "(1) in cell phones, i.e., each phone call is made toone recipient only. Let us denote the distribution for emailssystem as Pn(n) for now, where n represents the number ofrecipients in each email. While some emails are forwarded,many more are not. The easiest way to look at email for-warding is to treat it as an independent decision makingprocess, where each recipient with probability p forwardsthe information, or probability 1!p does nothing. As emailforwarding represents a small fraction of overall email traf-fic, p should be a small number. When a recipient decidesto forward the email, s/he draws a random number fromthe distribution Pn to decide how many people to send theemails. So the distribution of branching factors should fol-low the same distribution as Pn, from which the randomnumbers were drawn, giving P (!) = Pn(!). However, thisshould only hold for the case of d > 0. Indeed, as our studyis focused on the emails that are forwarded, there shouldbe an extra term for correcting this conditional probabilitywhen d = 0. That is, the original emails with more recipi-

Figure 13: Distribution of branching factors ! con-ditioned on the degree of the node k. The branchingfactors are independent on the degree connectivity.

underlying social network. As shown in Fig. 4, the degreedistribution, P (k), is also fat tailed [5, 3, 8]. Indeed, in-dividuals are connected di!erently in the network. Whilemost people have only a few connections, there are a no-table number of individuals who have many social neigh-bors. This raises an important question: to what extentdoes the information spreading process depend on the un-derlying social network? First o!, the branching factor ! foran individual in the spreading process is upper-bounded bythe total number of connections s/he has. Yet beyond that,it is important to inspect whether there is a correlation be-tween k and !. This question has a number of importantimplications. In the viral marketing case for example, wherethe underlying social network is usually not visible, the cor-relation between k and ! will tell us whether it is a goodmarketing strategy to carefully choose the seed populationsto spread an advertisement. A positive correlation suggeststhat it does matter who you choose to start the spreading,as social hubs would tend to send the information to morepeople. Yet if the correlation is not so strong, one couldargue that perhaps it is not so important how one choosesthe seed population. Another example comes from the dif-ference between the spreading of information and diseases.Indeed, diseases spread from a seed to many others throughnetworks, bearing high level similarity to the spreading ofinformation. The models of epidemics commonly rely oninfection rates, where better connected nodes infect moreneighbors, corresponding to a strong correlation between kand !. Therefore, understanding to what extent informa-tion spreading relies on the context of underlying networkwould quantitatively assess the di!erence between these twospreading processes, arguing whether the existing epidemicmodels are applicable to the spreading of information.

The correlation between k and ! can be examined by em-pirically measuring the conditional probability P (! | k). In-deed, as P (!) =

!P (! | k)P (k)dk, if ! is largely uncorre-

lated with k, P (! | k) can therefore be factored out of theintegral, giving P (!) = P (! | k), leading to a data collapsewhen plotting P (!) in di!erent curves by grouping individ-uals of similar k. We show in Fig. 13 the conditional prob-ability P (! | k) for two di!erent stages of spreading (d = 0vs. d = 1), respectively. Surprisingly, we observe very goodcollapse for di!erent k in both figures, which indicates thatthere is no direct correlation between k and !. The breadthof the dissemination of information is independent of theconnectivity k of individuals. This indicates, while to whoma user forwards the information indeed depends on the un-derlying social network (as shown in Sec. 4.1), to how manypeople (!) one would forward the information does not.

100

101

102

10!6

10!4

10!2

100

n

Pn(n

)

Figure 14: The distribution of the number of recip-ients for each email Pn(n) is fat-tailed.

100

101

102

10!6

10!4

10!2

100

!

P(!

)

d=0d=1

Figure 15: P (!) for d = 0 and d = 1, model pre-diction (solid lines) vs. experiment measure (scat-tered squares and circles). Our model well capturesthe stage dependence phenomenon of informationspreading.

The surprising independence of node properties of the in-formation spreading process leads us to question its depen-dence on the media properties of email systems. There-fore, we model the spreading processes by mimicking theway emails are sent. Indeed, an important feature of emailcommunication, distinguishing it from other forms of com-munication, like cell phones, is the ability to send a messageto multiple recipients at the same time.

Therefore, the distribution of the number of recipients forall the emails being sent should follow some non-trivial form,other than "(1) in cell phones, i.e., each phone call is made toone recipient only. Let us denote the distribution for emailssystem as Pn(n) for now, where n represents the number ofrecipients in each email. While some emails are forwarded,many more are not. The easiest way to look at email for-warding is to treat it as an independent decision makingprocess, where each recipient with probability p forwardsthe information, or probability 1!p does nothing. As emailforwarding represents a small fraction of overall email traf-fic, p should be a small number. When a recipient decidesto forward the email, s/he draws a random number fromthe distribution Pn to decide how many people to send theemails. So the distribution of branching factors should fol-low the same distribution as Pn, from which the randomnumbers were drawn, giving P (!) = Pn(!). However, thisshould only hold for the case of d > 0. Indeed, as our studyis focused on the emails that are forwarded, there shouldbe an extra term for correcting this conditional probabilitywhen d = 0. That is, the original emails with more recipi-

if independent

Modeling the spreading processes

Page 26: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

100 101 10210−6

10−4

10−2

100

κ

P(κ|d)

~3.5

~2.5

d=0d=1d=2d=3

κ vs. k

Figure 13: Distribution of branching factors ! con-ditioned on the degree of the node k. The branchingfactors are independent on the degree connectivity.

underlying social network. As shown in Fig. 4, the degreedistribution, P (k), is also fat tailed [5, 3, 8]. Indeed, in-dividuals are connected di!erently in the network. Whilemost people have only a few connections, there are a no-table number of individuals who have many social neigh-bors. This raises an important question: to what extentdoes the information spreading process depend on the un-derlying social network? First o!, the branching factor ! foran individual in the spreading process is upper-bounded bythe total number of connections s/he has. Yet beyond that,it is important to inspect whether there is a correlation be-tween k and !. This question has a number of importantimplications. In the viral marketing case for example, wherethe underlying social network is usually not visible, the cor-relation between k and ! will tell us whether it is a goodmarketing strategy to carefully choose the seed populationsto spread an advertisement. A positive correlation suggeststhat it does matter who you choose to start the spreading,as social hubs would tend to send the information to morepeople. Yet if the correlation is not so strong, one couldargue that perhaps it is not so important how one choosesthe seed population. Another example comes from the dif-ference between the spreading of information and diseases.Indeed, diseases spread from a seed to many others throughnetworks, bearing high level similarity to the spreading ofinformation. The models of epidemics commonly rely oninfection rates, where better connected nodes infect moreneighbors, corresponding to a strong correlation between kand !. Therefore, understanding to what extent informa-tion spreading relies on the context of underlying networkwould quantitatively assess the di!erence between these twospreading processes, arguing whether the existing epidemicmodels are applicable to the spreading of information.

The correlation between k and ! can be examined by em-pirically measuring the conditional probability P (! | k). In-deed, as P (!) =

!P (! | k)P (k)dk, if ! is largely uncorre-

lated with k, P (! | k) can therefore be factored out of theintegral, giving P (!) = P (! | k), leading to a data collapsewhen plotting P (!) in di!erent curves by grouping individ-uals of similar k. We show in Fig. 13 the conditional prob-ability P (! | k) for two di!erent stages of spreading (d = 0vs. d = 1), respectively. Surprisingly, we observe very goodcollapse for di!erent k in both figures, which indicates thatthere is no direct correlation between k and !. The breadthof the dissemination of information is independent of theconnectivity k of individuals. This indicates, while to whoma user forwards the information indeed depends on the un-derlying social network (as shown in Sec. 4.1), to how manypeople (!) one would forward the information does not.

100

101

102

10!6

10!4

10!2

100

n

Pn(n

)

Figure 14: The distribution of the number of recip-ients for each email Pn(n) is fat-tailed.

100

101

102

10!6

10!4

10!2

100

!

P(!

)

d=0d=1

Figure 15: P (!) for d = 0 and d = 1, model pre-diction (solid lines) vs. experiment measure (scat-tered squares and circles). Our model well capturesthe stage dependence phenomenon of informationspreading.

The surprising independence of node properties of the in-formation spreading process leads us to question its depen-dence on the media properties of email systems. There-fore, we model the spreading processes by mimicking theway emails are sent. Indeed, an important feature of emailcommunication, distinguishing it from other forms of com-munication, like cell phones, is the ability to send a messageto multiple recipients at the same time.

Therefore, the distribution of the number of recipients forall the emails being sent should follow some non-trivial form,other than "(1) in cell phones, i.e., each phone call is made toone recipient only. Let us denote the distribution for emailssystem as Pn(n) for now, where n represents the number ofrecipients in each email. While some emails are forwarded,many more are not. The easiest way to look at email for-warding is to treat it as an independent decision makingprocess, where each recipient with probability p forwardsthe information, or probability 1!p does nothing. As emailforwarding represents a small fraction of overall email traf-fic, p should be a small number. When a recipient decidesto forward the email, s/he draws a random number fromthe distribution Pn to decide how many people to send theemails. So the distribution of branching factors should fol-low the same distribution as Pn, from which the randomnumbers were drawn, giving P (!) = Pn(!). However, thisshould only hold for the case of d > 0. Indeed, as our studyis focused on the emails that are forwarded, there shouldbe an extra term for correcting this conditional probabilitywhen d = 0. That is, the original emails with more recipi-

Figure 13: Distribution of branching factors ! con-ditioned on the degree of the node k. The branchingfactors are independent on the degree connectivity.

underlying social network. As shown in Fig. 4, the degreedistribution, P (k), is also fat tailed [5, 3, 8]. Indeed, in-dividuals are connected di!erently in the network. Whilemost people have only a few connections, there are a no-table number of individuals who have many social neigh-bors. This raises an important question: to what extentdoes the information spreading process depend on the un-derlying social network? First o!, the branching factor ! foran individual in the spreading process is upper-bounded bythe total number of connections s/he has. Yet beyond that,it is important to inspect whether there is a correlation be-tween k and !. This question has a number of importantimplications. In the viral marketing case for example, wherethe underlying social network is usually not visible, the cor-relation between k and ! will tell us whether it is a goodmarketing strategy to carefully choose the seed populationsto spread an advertisement. A positive correlation suggeststhat it does matter who you choose to start the spreading,as social hubs would tend to send the information to morepeople. Yet if the correlation is not so strong, one couldargue that perhaps it is not so important how one choosesthe seed population. Another example comes from the dif-ference between the spreading of information and diseases.Indeed, diseases spread from a seed to many others throughnetworks, bearing high level similarity to the spreading ofinformation. The models of epidemics commonly rely oninfection rates, where better connected nodes infect moreneighbors, corresponding to a strong correlation between kand !. Therefore, understanding to what extent informa-tion spreading relies on the context of underlying networkwould quantitatively assess the di!erence between these twospreading processes, arguing whether the existing epidemicmodels are applicable to the spreading of information.

The correlation between k and ! can be examined by em-pirically measuring the conditional probability P (! | k). In-deed, as P (!) =

!P (! | k)P (k)dk, if ! is largely uncorre-

lated with k, P (! | k) can therefore be factored out of theintegral, giving P (!) = P (! | k), leading to a data collapsewhen plotting P (!) in di!erent curves by grouping individ-uals of similar k. We show in Fig. 13 the conditional prob-ability P (! | k) for two di!erent stages of spreading (d = 0vs. d = 1), respectively. Surprisingly, we observe very goodcollapse for di!erent k in both figures, which indicates thatthere is no direct correlation between k and !. The breadthof the dissemination of information is independent of theconnectivity k of individuals. This indicates, while to whoma user forwards the information indeed depends on the un-derlying social network (as shown in Sec. 4.1), to how manypeople (!) one would forward the information does not.

100

101

102

10!6

10!4

10!2

100

n

Pn(n

)

Figure 14: The distribution of the number of recip-ients for each email Pn(n) is fat-tailed.

100

101

102

10!6

10!4

10!2

100

!

P(!

)

d=0d=1

Figure 15: P (!) for d = 0 and d = 1, model pre-diction (solid lines) vs. experiment measure (scat-tered squares and circles). Our model well capturesthe stage dependence phenomenon of informationspreading.

The surprising independence of node properties of the in-formation spreading process leads us to question its depen-dence on the media properties of email systems. There-fore, we model the spreading processes by mimicking theway emails are sent. Indeed, an important feature of emailcommunication, distinguishing it from other forms of com-munication, like cell phones, is the ability to send a messageto multiple recipients at the same time.

Therefore, the distribution of the number of recipients forall the emails being sent should follow some non-trivial form,other than "(1) in cell phones, i.e., each phone call is made toone recipient only. Let us denote the distribution for emailssystem as Pn(n) for now, where n represents the number ofrecipients in each email. While some emails are forwarded,many more are not. The easiest way to look at email for-warding is to treat it as an independent decision makingprocess, where each recipient with probability p forwardsthe information, or probability 1!p does nothing. As emailforwarding represents a small fraction of overall email traf-fic, p should be a small number. When a recipient decidesto forward the email, s/he draws a random number fromthe distribution Pn to decide how many people to send theemails. So the distribution of branching factors should fol-low the same distribution as Pn, from which the randomnumbers were drawn, giving P (!) = Pn(!). However, thisshould only hold for the case of d > 0. Indeed, as our studyis focused on the emails that are forwarded, there shouldbe an extra term for correcting this conditional probabilitywhen d = 0. That is, the original emails with more recipi-

if independent

•Modeling: disease vs. information•Practice: choosing the seeds

Modeling the spreading processes

Page 27: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

100 101 10210−6

10−4

10−2

100

κ

P(κ|d)

~3.5

~2.5

d=0d=1d=2d=3

κ vs. k

Figure 13: Distribution of branching factors ! con-ditioned on the degree of the node k. The branchingfactors are independent on the degree connectivity.

underlying social network. As shown in Fig. 4, the degreedistribution, P (k), is also fat tailed [5, 3, 8]. Indeed, in-dividuals are connected di!erently in the network. Whilemost people have only a few connections, there are a no-table number of individuals who have many social neigh-bors. This raises an important question: to what extentdoes the information spreading process depend on the un-derlying social network? First o!, the branching factor ! foran individual in the spreading process is upper-bounded bythe total number of connections s/he has. Yet beyond that,it is important to inspect whether there is a correlation be-tween k and !. This question has a number of importantimplications. In the viral marketing case for example, wherethe underlying social network is usually not visible, the cor-relation between k and ! will tell us whether it is a goodmarketing strategy to carefully choose the seed populationsto spread an advertisement. A positive correlation suggeststhat it does matter who you choose to start the spreading,as social hubs would tend to send the information to morepeople. Yet if the correlation is not so strong, one couldargue that perhaps it is not so important how one choosesthe seed population. Another example comes from the dif-ference between the spreading of information and diseases.Indeed, diseases spread from a seed to many others throughnetworks, bearing high level similarity to the spreading ofinformation. The models of epidemics commonly rely oninfection rates, where better connected nodes infect moreneighbors, corresponding to a strong correlation between kand !. Therefore, understanding to what extent informa-tion spreading relies on the context of underlying networkwould quantitatively assess the di!erence between these twospreading processes, arguing whether the existing epidemicmodels are applicable to the spreading of information.

The correlation between k and ! can be examined by em-pirically measuring the conditional probability P (! | k). In-deed, as P (!) =

!P (! | k)P (k)dk, if ! is largely uncorre-

lated with k, P (! | k) can therefore be factored out of theintegral, giving P (!) = P (! | k), leading to a data collapsewhen plotting P (!) in di!erent curves by grouping individ-uals of similar k. We show in Fig. 13 the conditional prob-ability P (! | k) for two di!erent stages of spreading (d = 0vs. d = 1), respectively. Surprisingly, we observe very goodcollapse for di!erent k in both figures, which indicates thatthere is no direct correlation between k and !. The breadthof the dissemination of information is independent of theconnectivity k of individuals. This indicates, while to whoma user forwards the information indeed depends on the un-derlying social network (as shown in Sec. 4.1), to how manypeople (!) one would forward the information does not.

100

101

102

10!6

10!4

10!2

100

n

Pn(n

)

Figure 14: The distribution of the number of recip-ients for each email Pn(n) is fat-tailed.

100

101

102

10!6

10!4

10!2

100

!

P(!

)

d=0d=1

Figure 15: P (!) for d = 0 and d = 1, model pre-diction (solid lines) vs. experiment measure (scat-tered squares and circles). Our model well capturesthe stage dependence phenomenon of informationspreading.

The surprising independence of node properties of the in-formation spreading process leads us to question its depen-dence on the media properties of email systems. There-fore, we model the spreading processes by mimicking theway emails are sent. Indeed, an important feature of emailcommunication, distinguishing it from other forms of com-munication, like cell phones, is the ability to send a messageto multiple recipients at the same time.

Therefore, the distribution of the number of recipients forall the emails being sent should follow some non-trivial form,other than "(1) in cell phones, i.e., each phone call is made toone recipient only. Let us denote the distribution for emailssystem as Pn(n) for now, where n represents the number ofrecipients in each email. While some emails are forwarded,many more are not. The easiest way to look at email for-warding is to treat it as an independent decision makingprocess, where each recipient with probability p forwardsthe information, or probability 1!p does nothing. As emailforwarding represents a small fraction of overall email traf-fic, p should be a small number. When a recipient decidesto forward the email, s/he draws a random number fromthe distribution Pn to decide how many people to send theemails. So the distribution of branching factors should fol-low the same distribution as Pn, from which the randomnumbers were drawn, giving P (!) = Pn(!). However, thisshould only hold for the case of d > 0. Indeed, as our studyis focused on the emails that are forwarded, there shouldbe an extra term for correcting this conditional probabilitywhen d = 0. That is, the original emails with more recipi-

Figure 13: Distribution of branching factors ! con-ditioned on the degree of the node k. The branchingfactors are independent on the degree connectivity.

underlying social network. As shown in Fig. 4, the degreedistribution, P (k), is also fat tailed [5, 3, 8]. Indeed, in-dividuals are connected di!erently in the network. Whilemost people have only a few connections, there are a no-table number of individuals who have many social neigh-bors. This raises an important question: to what extentdoes the information spreading process depend on the un-derlying social network? First o!, the branching factor ! foran individual in the spreading process is upper-bounded bythe total number of connections s/he has. Yet beyond that,it is important to inspect whether there is a correlation be-tween k and !. This question has a number of importantimplications. In the viral marketing case for example, wherethe underlying social network is usually not visible, the cor-relation between k and ! will tell us whether it is a goodmarketing strategy to carefully choose the seed populationsto spread an advertisement. A positive correlation suggeststhat it does matter who you choose to start the spreading,as social hubs would tend to send the information to morepeople. Yet if the correlation is not so strong, one couldargue that perhaps it is not so important how one choosesthe seed population. Another example comes from the dif-ference between the spreading of information and diseases.Indeed, diseases spread from a seed to many others throughnetworks, bearing high level similarity to the spreading ofinformation. The models of epidemics commonly rely oninfection rates, where better connected nodes infect moreneighbors, corresponding to a strong correlation between kand !. Therefore, understanding to what extent informa-tion spreading relies on the context of underlying networkwould quantitatively assess the di!erence between these twospreading processes, arguing whether the existing epidemicmodels are applicable to the spreading of information.

The correlation between k and ! can be examined by em-pirically measuring the conditional probability P (! | k). In-deed, as P (!) =

!P (! | k)P (k)dk, if ! is largely uncorre-

lated with k, P (! | k) can therefore be factored out of theintegral, giving P (!) = P (! | k), leading to a data collapsewhen plotting P (!) in di!erent curves by grouping individ-uals of similar k. We show in Fig. 13 the conditional prob-ability P (! | k) for two di!erent stages of spreading (d = 0vs. d = 1), respectively. Surprisingly, we observe very goodcollapse for di!erent k in both figures, which indicates thatthere is no direct correlation between k and !. The breadthof the dissemination of information is independent of theconnectivity k of individuals. This indicates, while to whoma user forwards the information indeed depends on the un-derlying social network (as shown in Sec. 4.1), to how manypeople (!) one would forward the information does not.

100

101

102

10!6

10!4

10!2

100

n

Pn(n

)

Figure 14: The distribution of the number of recip-ients for each email Pn(n) is fat-tailed.

100

101

102

10!6

10!4

10!2

100

!

P(!

)

d=0d=1

Figure 15: P (!) for d = 0 and d = 1, model pre-diction (solid lines) vs. experiment measure (scat-tered squares and circles). Our model well capturesthe stage dependence phenomenon of informationspreading.

The surprising independence of node properties of the in-formation spreading process leads us to question its depen-dence on the media properties of email systems. There-fore, we model the spreading processes by mimicking theway emails are sent. Indeed, an important feature of emailcommunication, distinguishing it from other forms of com-munication, like cell phones, is the ability to send a messageto multiple recipients at the same time.

Therefore, the distribution of the number of recipients forall the emails being sent should follow some non-trivial form,other than "(1) in cell phones, i.e., each phone call is made toone recipient only. Let us denote the distribution for emailssystem as Pn(n) for now, where n represents the number ofrecipients in each email. While some emails are forwarded,many more are not. The easiest way to look at email for-warding is to treat it as an independent decision makingprocess, where each recipient with probability p forwardsthe information, or probability 1!p does nothing. As emailforwarding represents a small fraction of overall email traf-fic, p should be a small number. When a recipient decidesto forward the email, s/he draws a random number fromthe distribution Pn to decide how many people to send theemails. So the distribution of branching factors should fol-low the same distribution as Pn, from which the randomnumbers were drawn, giving P (!) = Pn(!). However, thisshould only hold for the case of d > 0. Indeed, as our studyis focused on the emails that are forwarded, there shouldbe an extra term for correcting this conditional probabilitywhen d = 0. That is, the original emails with more recipi-

if independent

100 101 10210 6

10 4

10 2

100

P(|k

)

k~15k~50k~110k~200k>250

100 101 10210 6

10 4

10 2

100

P(|k

)

k~15k~50k~110k~200k>250

a bd=0 d=1

•Modeling: disease vs. information•Practice: choosing the seeds

Modeling the spreading processes

Page 28: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

#recipients p

1-p

Forward

Do nothing

100

101

102

103

10!6

10!4

10!2

100

size

P(s

ize)

empiricalmodel

100

101

102

103

10!6

10!4

10!2

100

width

P(w

idth

)

empiricalmodel

0 1 2 3 4 510

!4

10!3

10!2

10!1

100

depth

P(d

epth

)

empiricalmodel

Figure 16: Size, width, and depth distributions of model prediction (triangles) with empirical observations(squares). The model matches well with observations. Note the last point in depth distribution is biased byempirical finite size e!ect, lower bounded by N!1.

ents are more likely to get forwarded, as there will be morepeople to make a decision whether or not to pass on theinformation. Following this mechanism, the distribution ofbranching factors at depth 0, P (! | d = 0), follows

P (! | d = 0) = A (1! (1! p)!)Pn(!)

= A!1! e! ln(1!p)

"Pn(!)

(1)

where A is the normalization factor, whereas P (! | d > 0)follows

P (! | d > 0) = Pn(!) (2)

In the limit of p " 0, to the leading power, the relationshipbetween the scaling exponent "0 of P (! | d = 0), and "1 ofP (! | d = 1), follows the simple relation "1 = "0 + 1 if ! #!1/ ln(1!p) $ 1/p, and "1 = "0 if ! % !1/ ln(1!p) $ 1/p.

Both parameter p and function Pn can be measured inde-pendently from our data, yielding p = 0.012 and a fat-taileddistribution Pn (Fig. 14). We can therefore simulate thedistributions of size, width, and depth using these two mea-sured parameters. The results are shown in Fig. 16, withobservations as squares and model predictions as triangles.Surprisingly, they all match the empirical observations verywell. The distribution of size and width follows a powerlaw, with an exponent of 2.63 and 2.51, very close to theempirical observations (2.67 and 2.53). Indeed, both dis-tributions pass the two-sample Kolmogorov-Smirnov tests,with p-values equal to 1. Furthermore, the observation ofstage dependence could be verified analytically by pluggingthe parameters into eqs. (1) and (2), as plotted in blue andred lines in Fig. 15, respectively. It is also very well capturedby the model.

The model we described above for email forwarding pro-cesses is purely stochastic and has two parameters, p and Pn,which are measured from our email dataset independently.Perhaps unexpectedly, such a simple model explains a greatdeal of observations. This, together with Fig. 13, indicatesthat, despite the complexity in real life, the macroscopicstructures of information spreading processes are largely in-dependent of contextual information and can be well cap-tured and explained via simple machanisms.

6. CONCLUSIONS AND FUTURE WORKApplications of social systems rely on our understanding

of information spreading patterns. In this work, by com-

bining two related but distinct large scale datasets, we ad-dress the factors that govern information spreading at bothmicroscopic and macroscopic levels. We found, microscopi-cally, whom the information flows to indeed depends on thestructure of the underlying social network, individual ex-pertise and organizational hierarchy. The performance ofindividuals has little influence on the e!ciency of spread-ing, yet departmental constraints do slow down the process.At the macroscopic level, however, although seemingly com-plex, the structural properties of spreading trees, i.e., tohow many people a user forwards the information and thetotal coverage the information reaches, can be well capturedby a simple stochastic branching model, indicating that thespreading process follows a random yet reproducible pat-tern, largely independent of context. We believe that ourfindings could guide users to build better social and collab-orative applications, design tools and strategies to spreadinformation more e!ciently, improve information security,develop predictive tools for recommendation systems, andmore.

Future directions mainly fall into two lines. The first isto develop a better prediction model for information flow.Indeed, upon understanding to whom one forwards informa-tion, when one would forward it, and to how many people,the question thereafter is can we build a better predictionmodel of the flows? The second direction is about the muta-tion of information. People sometimes add extra informationor express opinions about existing information when passingalong the originals to others. How does information mutatealong the way? How does the mutation of information a"ectthe patterns of spreading? These questions stand as miss-ing chapters in our understanding of spreading processes.Indeed, with the availability of large-scale email datasets,thorough inspection of the email message contents will re-veal the dynamics of information itself, which in turn canyield better predictive tools for information spreading.

AcknowledgementThe authors thank James P. Bagrow,Nick Blumm, Helena Buhr, U Kang, Yu-Ru Lin, YelenaMejova, Jiang Yang, and anonymous reviewers for manyuseful discussions and insightful suggestions. This work wassupported by the Network Science Collaborative Technol-ogy Alliance sponsored by the U.S. Army Research Labo-ratory under Agreement Number W911NF-09-2-0053; theJames S. McDonnell Foundation 21st Century Initiative inStudying Complex Systems; the US O!ce of Naval ResearchAward (N000141010968); the NSF within the Information

Modeling the spreading processes

Pn

Page 29: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

#recipients

100

101

102

103

10!6

10!4

10!2

100

size

P(s

ize)

empiricalmodel

100

101

102

103

10!6

10!4

10!2

100

width

P(w

idth

)

empiricalmodel

0 1 2 3 4 510

!4

10!3

10!2

10!1

100

depth

P(d

epth

)

empiricalmodel

Figure 16: Size, width, and depth distributions of model prediction (triangles) with empirical observations(squares). The model matches well with observations. Note the last point in depth distribution is biased byempirical finite size e!ect, lower bounded by N!1.

ents are more likely to get forwarded, as there will be morepeople to make a decision whether or not to pass on theinformation. Following this mechanism, the distribution ofbranching factors at depth 0, P (! | d = 0), follows

P (! | d = 0) = A (1! (1! p)!)Pn(!)

= A!1! e! ln(1!p)

"Pn(!)

(1)

where A is the normalization factor, whereas P (! | d > 0)follows

P (! | d > 0) = Pn(!) (2)

In the limit of p " 0, to the leading power, the relationshipbetween the scaling exponent "0 of P (! | d = 0), and "1 ofP (! | d = 1), follows the simple relation "1 = "0 + 1 if ! #!1/ ln(1!p) $ 1/p, and "1 = "0 if ! % !1/ ln(1!p) $ 1/p.

Both parameter p and function Pn can be measured inde-pendently from our data, yielding p = 0.012 and a fat-taileddistribution Pn (Fig. 14). We can therefore simulate thedistributions of size, width, and depth using these two mea-sured parameters. The results are shown in Fig. 16, withobservations as squares and model predictions as triangles.Surprisingly, they all match the empirical observations verywell. The distribution of size and width follows a powerlaw, with an exponent of 2.63 and 2.51, very close to theempirical observations (2.67 and 2.53). Indeed, both dis-tributions pass the two-sample Kolmogorov-Smirnov tests,with p-values equal to 1. Furthermore, the observation ofstage dependence could be verified analytically by pluggingthe parameters into eqs. (1) and (2), as plotted in blue andred lines in Fig. 15, respectively. It is also very well capturedby the model.

The model we described above for email forwarding pro-cesses is purely stochastic and has two parameters, p and Pn,which are measured from our email dataset independently.Perhaps unexpectedly, such a simple model explains a greatdeal of observations. This, together with Fig. 13, indicatesthat, despite the complexity in real life, the macroscopicstructures of information spreading processes are largely in-dependent of contextual information and can be well cap-tured and explained via simple machanisms.

6. CONCLUSIONS AND FUTURE WORKApplications of social systems rely on our understanding

of information spreading patterns. In this work, by com-

bining two related but distinct large scale datasets, we ad-dress the factors that govern information spreading at bothmicroscopic and macroscopic levels. We found, microscopi-cally, whom the information flows to indeed depends on thestructure of the underlying social network, individual ex-pertise and organizational hierarchy. The performance ofindividuals has little influence on the e!ciency of spread-ing, yet departmental constraints do slow down the process.At the macroscopic level, however, although seemingly com-plex, the structural properties of spreading trees, i.e., tohow many people a user forwards the information and thetotal coverage the information reaches, can be well capturedby a simple stochastic branching model, indicating that thespreading process follows a random yet reproducible pat-tern, largely independent of context. We believe that ourfindings could guide users to build better social and collab-orative applications, design tools and strategies to spreadinformation more e!ciently, improve information security,develop predictive tools for recommendation systems, andmore.

Future directions mainly fall into two lines. The first isto develop a better prediction model for information flow.Indeed, upon understanding to whom one forwards informa-tion, when one would forward it, and to how many people,the question thereafter is can we build a better predictionmodel of the flows? The second direction is about the muta-tion of information. People sometimes add extra informationor express opinions about existing information when passingalong the originals to others. How does information mutatealong the way? How does the mutation of information a"ectthe patterns of spreading? These questions stand as miss-ing chapters in our understanding of spreading processes.Indeed, with the availability of large-scale email datasets,thorough inspection of the email message contents will re-veal the dynamics of information itself, which in turn canyield better predictive tools for information spreading.

AcknowledgementThe authors thank James P. Bagrow,Nick Blumm, Helena Buhr, U Kang, Yu-Ru Lin, YelenaMejova, Jiang Yang, and anonymous reviewers for manyuseful discussions and insightful suggestions. This work wassupported by the Network Science Collaborative Technol-ogy Alliance sponsored by the U.S. Army Research Labo-ratory under Agreement Number W911NF-09-2-0053; theJames S. McDonnell Foundation 21st Century Initiative inStudying Complex Systems; the US O!ce of Naval ResearchAward (N000141010968); the NSF within the Information

when d=0

100

101

102

103

10!6

10!4

10!2

100

size

P(s

ize)

empiricalmodel

100

101

102

103

10!6

10!4

10!2

100

width

P(w

idth

)

empiricalmodel

0 1 2 3 4 510

!4

10!3

10!2

10!1

100

depth

P(d

epth

)

empiricalmodel

Figure 16: Size, width, and depth distributions of model prediction (triangles) with empirical observations(squares). The model matches well with observations. Note the last point in depth distribution is biased byempirical finite size e!ect, lower bounded by N!1.

ents are more likely to get forwarded, as there will be morepeople to make a decision whether or not to pass on theinformation. Following this mechanism, the distribution ofbranching factors at depth 0, P (! | d = 0), follows

P (! | d = 0) = A (1! (1! p)!)Pn(!)

= A!1! e! ln(1!p)

"Pn(!)

(1)

where A is the normalization factor, whereas P (! | d > 0)follows

P (! | d > 0) = Pn(!) (2)

In the limit of p " 0, to the leading power, the relationshipbetween the scaling exponent "0 of P (! | d = 0), and "1 ofP (! | d = 1), follows the simple relation "1 = "0 + 1 if ! #!1/ ln(1!p) $ 1/p, and "1 = "0 if ! % !1/ ln(1!p) $ 1/p.

Both parameter p and function Pn can be measured inde-pendently from our data, yielding p = 0.012 and a fat-taileddistribution Pn (Fig. 14). We can therefore simulate thedistributions of size, width, and depth using these two mea-sured parameters. The results are shown in Fig. 16, withobservations as squares and model predictions as triangles.Surprisingly, they all match the empirical observations verywell. The distribution of size and width follows a powerlaw, with an exponent of 2.63 and 2.51, very close to theempirical observations (2.67 and 2.53). Indeed, both dis-tributions pass the two-sample Kolmogorov-Smirnov tests,with p-values equal to 1. Furthermore, the observation ofstage dependence could be verified analytically by pluggingthe parameters into eqs. (1) and (2), as plotted in blue andred lines in Fig. 15, respectively. It is also very well capturedby the model.

The model we described above for email forwarding pro-cesses is purely stochastic and has two parameters, p and Pn,which are measured from our email dataset independently.Perhaps unexpectedly, such a simple model explains a greatdeal of observations. This, together with Fig. 13, indicatesthat, despite the complexity in real life, the macroscopicstructures of information spreading processes are largely in-dependent of contextual information and can be well cap-tured and explained via simple machanisms.

6. CONCLUSIONS AND FUTURE WORKApplications of social systems rely on our understanding

of information spreading patterns. In this work, by com-

bining two related but distinct large scale datasets, we ad-dress the factors that govern information spreading at bothmicroscopic and macroscopic levels. We found, microscopi-cally, whom the information flows to indeed depends on thestructure of the underlying social network, individual ex-pertise and organizational hierarchy. The performance ofindividuals has little influence on the e!ciency of spread-ing, yet departmental constraints do slow down the process.At the macroscopic level, however, although seemingly com-plex, the structural properties of spreading trees, i.e., tohow many people a user forwards the information and thetotal coverage the information reaches, can be well capturedby a simple stochastic branching model, indicating that thespreading process follows a random yet reproducible pat-tern, largely independent of context. We believe that ourfindings could guide users to build better social and collab-orative applications, design tools and strategies to spreadinformation more e!ciently, improve information security,develop predictive tools for recommendation systems, andmore.

Future directions mainly fall into two lines. The first isto develop a better prediction model for information flow.Indeed, upon understanding to whom one forwards informa-tion, when one would forward it, and to how many people,the question thereafter is can we build a better predictionmodel of the flows? The second direction is about the muta-tion of information. People sometimes add extra informationor express opinions about existing information when passingalong the originals to others. How does information mutatealong the way? How does the mutation of information a"ectthe patterns of spreading? These questions stand as miss-ing chapters in our understanding of spreading processes.Indeed, with the availability of large-scale email datasets,thorough inspection of the email message contents will re-veal the dynamics of information itself, which in turn canyield better predictive tools for information spreading.

AcknowledgementThe authors thank James P. Bagrow,Nick Blumm, Helena Buhr, U Kang, Yu-Ru Lin, YelenaMejova, Jiang Yang, and anonymous reviewers for manyuseful discussions and insightful suggestions. This work wassupported by the Network Science Collaborative Technol-ogy Alliance sponsored by the U.S. Army Research Labo-ratory under Agreement Number W911NF-09-2-0053; theJames S. McDonnell Foundation 21st Century Initiative inStudying Complex Systems; the US O!ce of Naval ResearchAward (N000141010968); the NSF within the Information

Modeling the spreading processes

p

1-p

Forward

Do nothingPn

Page 30: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

#recipients

100

101

102

103

10!6

10!4

10!2

100

size

P(s

ize)

empiricalmodel

100

101

102

103

10!6

10!4

10!2

100

width

P(w

idth

)

empiricalmodel

0 1 2 3 4 510

!4

10!3

10!2

10!1

100

depth

P(d

epth

)

empiricalmodel

Figure 16: Size, width, and depth distributions of model prediction (triangles) with empirical observations(squares). The model matches well with observations. Note the last point in depth distribution is biased byempirical finite size e!ect, lower bounded by N!1.

ents are more likely to get forwarded, as there will be morepeople to make a decision whether or not to pass on theinformation. Following this mechanism, the distribution ofbranching factors at depth 0, P (! | d = 0), follows

P (! | d = 0) = A (1! (1! p)!)Pn(!)

= A!1! e! ln(1!p)

"Pn(!)

(1)

where A is the normalization factor, whereas P (! | d > 0)follows

P (! | d > 0) = Pn(!) (2)

In the limit of p " 0, to the leading power, the relationshipbetween the scaling exponent "0 of P (! | d = 0), and "1 ofP (! | d = 1), follows the simple relation "1 = "0 + 1 if ! #!1/ ln(1!p) $ 1/p, and "1 = "0 if ! % !1/ ln(1!p) $ 1/p.

Both parameter p and function Pn can be measured inde-pendently from our data, yielding p = 0.012 and a fat-taileddistribution Pn (Fig. 14). We can therefore simulate thedistributions of size, width, and depth using these two mea-sured parameters. The results are shown in Fig. 16, withobservations as squares and model predictions as triangles.Surprisingly, they all match the empirical observations verywell. The distribution of size and width follows a powerlaw, with an exponent of 2.63 and 2.51, very close to theempirical observations (2.67 and 2.53). Indeed, both dis-tributions pass the two-sample Kolmogorov-Smirnov tests,with p-values equal to 1. Furthermore, the observation ofstage dependence could be verified analytically by pluggingthe parameters into eqs. (1) and (2), as plotted in blue andred lines in Fig. 15, respectively. It is also very well capturedby the model.

The model we described above for email forwarding pro-cesses is purely stochastic and has two parameters, p and Pn,which are measured from our email dataset independently.Perhaps unexpectedly, such a simple model explains a greatdeal of observations. This, together with Fig. 13, indicatesthat, despite the complexity in real life, the macroscopicstructures of information spreading processes are largely in-dependent of contextual information and can be well cap-tured and explained via simple machanisms.

6. CONCLUSIONS AND FUTURE WORKApplications of social systems rely on our understanding

of information spreading patterns. In this work, by com-

bining two related but distinct large scale datasets, we ad-dress the factors that govern information spreading at bothmicroscopic and macroscopic levels. We found, microscopi-cally, whom the information flows to indeed depends on thestructure of the underlying social network, individual ex-pertise and organizational hierarchy. The performance ofindividuals has little influence on the e!ciency of spread-ing, yet departmental constraints do slow down the process.At the macroscopic level, however, although seemingly com-plex, the structural properties of spreading trees, i.e., tohow many people a user forwards the information and thetotal coverage the information reaches, can be well capturedby a simple stochastic branching model, indicating that thespreading process follows a random yet reproducible pat-tern, largely independent of context. We believe that ourfindings could guide users to build better social and collab-orative applications, design tools and strategies to spreadinformation more e!ciently, improve information security,develop predictive tools for recommendation systems, andmore.

Future directions mainly fall into two lines. The first isto develop a better prediction model for information flow.Indeed, upon understanding to whom one forwards informa-tion, when one would forward it, and to how many people,the question thereafter is can we build a better predictionmodel of the flows? The second direction is about the muta-tion of information. People sometimes add extra informationor express opinions about existing information when passingalong the originals to others. How does information mutatealong the way? How does the mutation of information a"ectthe patterns of spreading? These questions stand as miss-ing chapters in our understanding of spreading processes.Indeed, with the availability of large-scale email datasets,thorough inspection of the email message contents will re-veal the dynamics of information itself, which in turn canyield better predictive tools for information spreading.

AcknowledgementThe authors thank James P. Bagrow,Nick Blumm, Helena Buhr, U Kang, Yu-Ru Lin, YelenaMejova, Jiang Yang, and anonymous reviewers for manyuseful discussions and insightful suggestions. This work wassupported by the Network Science Collaborative Technol-ogy Alliance sponsored by the U.S. Army Research Labo-ratory under Agreement Number W911NF-09-2-0053; theJames S. McDonnell Foundation 21st Century Initiative inStudying Complex Systems; the US O!ce of Naval ResearchAward (N000141010968); the NSF within the Information

when d=0

100

101

102

103

10!6

10!4

10!2

100

size

P(s

ize)

empiricalmodel

100

101

102

103

10!6

10!4

10!2

100

width

P(w

idth

)

empiricalmodel

0 1 2 3 4 510

!4

10!3

10!2

10!1

100

depth

P(d

epth

)

empiricalmodel

Figure 16: Size, width, and depth distributions of model prediction (triangles) with empirical observations(squares). The model matches well with observations. Note the last point in depth distribution is biased byempirical finite size e!ect, lower bounded by N!1.

ents are more likely to get forwarded, as there will be morepeople to make a decision whether or not to pass on theinformation. Following this mechanism, the distribution ofbranching factors at depth 0, P (! | d = 0), follows

P (! | d = 0) = A (1! (1! p)!)Pn(!)

= A!1! e! ln(1!p)

"Pn(!)

(1)

where A is the normalization factor, whereas P (! | d > 0)follows

P (! | d > 0) = Pn(!) (2)

In the limit of p " 0, to the leading power, the relationshipbetween the scaling exponent "0 of P (! | d = 0), and "1 ofP (! | d = 1), follows the simple relation "1 = "0 + 1 if ! #!1/ ln(1!p) $ 1/p, and "1 = "0 if ! % !1/ ln(1!p) $ 1/p.

Both parameter p and function Pn can be measured inde-pendently from our data, yielding p = 0.012 and a fat-taileddistribution Pn (Fig. 14). We can therefore simulate thedistributions of size, width, and depth using these two mea-sured parameters. The results are shown in Fig. 16, withobservations as squares and model predictions as triangles.Surprisingly, they all match the empirical observations verywell. The distribution of size and width follows a powerlaw, with an exponent of 2.63 and 2.51, very close to theempirical observations (2.67 and 2.53). Indeed, both dis-tributions pass the two-sample Kolmogorov-Smirnov tests,with p-values equal to 1. Furthermore, the observation ofstage dependence could be verified analytically by pluggingthe parameters into eqs. (1) and (2), as plotted in blue andred lines in Fig. 15, respectively. It is also very well capturedby the model.

The model we described above for email forwarding pro-cesses is purely stochastic and has two parameters, p and Pn,which are measured from our email dataset independently.Perhaps unexpectedly, such a simple model explains a greatdeal of observations. This, together with Fig. 13, indicatesthat, despite the complexity in real life, the macroscopicstructures of information spreading processes are largely in-dependent of contextual information and can be well cap-tured and explained via simple machanisms.

6. CONCLUSIONS AND FUTURE WORKApplications of social systems rely on our understanding

of information spreading patterns. In this work, by com-

bining two related but distinct large scale datasets, we ad-dress the factors that govern information spreading at bothmicroscopic and macroscopic levels. We found, microscopi-cally, whom the information flows to indeed depends on thestructure of the underlying social network, individual ex-pertise and organizational hierarchy. The performance ofindividuals has little influence on the e!ciency of spread-ing, yet departmental constraints do slow down the process.At the macroscopic level, however, although seemingly com-plex, the structural properties of spreading trees, i.e., tohow many people a user forwards the information and thetotal coverage the information reaches, can be well capturedby a simple stochastic branching model, indicating that thespreading process follows a random yet reproducible pat-tern, largely independent of context. We believe that ourfindings could guide users to build better social and collab-orative applications, design tools and strategies to spreadinformation more e!ciently, improve information security,develop predictive tools for recommendation systems, andmore.

Future directions mainly fall into two lines. The first isto develop a better prediction model for information flow.Indeed, upon understanding to whom one forwards informa-tion, when one would forward it, and to how many people,the question thereafter is can we build a better predictionmodel of the flows? The second direction is about the muta-tion of information. People sometimes add extra informationor express opinions about existing information when passingalong the originals to others. How does information mutatealong the way? How does the mutation of information a"ectthe patterns of spreading? These questions stand as miss-ing chapters in our understanding of spreading processes.Indeed, with the availability of large-scale email datasets,thorough inspection of the email message contents will re-veal the dynamics of information itself, which in turn canyield better predictive tools for information spreading.

AcknowledgementThe authors thank James P. Bagrow,Nick Blumm, Helena Buhr, U Kang, Yu-Ru Lin, YelenaMejova, Jiang Yang, and anonymous reviewers for manyuseful discussions and insightful suggestions. This work wassupported by the Network Science Collaborative Technol-ogy Alliance sponsored by the U.S. Army Research Labo-ratory under Agreement Number W911NF-09-2-0053; theJames S. McDonnell Foundation 21st Century Initiative inStudying Complex Systems; the US O!ce of Naval ResearchAward (N000141010968); the NSF within the Information

measured independently from the data

Modeling the spreading processes

p

1-p

Forward

Do nothingPn

Page 31: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

100 101 10210−6

10−4

10−2

100

κ

P(κ)

d=0d=1

100

101

102

103

10!6

10!4

10!2

100

size

P(s

ize)

empiricalmodel

100

101

102

103

10!6

10!4

10!2

100

width

P(w

idth

)

empiricalmodel

0 1 2 3 4 510

!4

10!3

10!2

10!1

100

depth

P(d

epth

)

empiricalmodel

Figure 16: Size, width, and depth distributions of model prediction (triangles) with empirical observations(squares). The model matches well with observations. Note the last point in depth distribution is biased byempirical finite size e!ect, lower bounded by N!1.

ents are more likely to get forwarded, as there will be morepeople to make a decision whether or not to pass on theinformation. Following this mechanism, the distribution ofbranching factors at depth 0, P (! | d = 0), follows

P (! | d = 0) = A (1! (1! p)!)Pn(!)

= A!1! e! ln(1!p)

"Pn(!)

(1)

where A is the normalization factor, whereas P (! | d > 0)follows

P (! | d > 0) = Pn(!) (2)

In the limit of p " 0, to the leading power, the relationshipbetween the scaling exponent "0 of P (! | d = 0), and "1 ofP (! | d = 1), follows the simple relation "1 = "0 + 1 if ! #!1/ ln(1!p) $ 1/p, and "1 = "0 if ! % !1/ ln(1!p) $ 1/p.

Both parameter p and function Pn can be measured inde-pendently from our data, yielding p = 0.012 and a fat-taileddistribution Pn (Fig. 14). We can therefore simulate thedistributions of size, width, and depth using these two mea-sured parameters. The results are shown in Fig. 16, withobservations as squares and model predictions as triangles.Surprisingly, they all match the empirical observations verywell. The distribution of size and width follows a powerlaw, with an exponent of 2.63 and 2.51, very close to theempirical observations (2.67 and 2.53). Indeed, both dis-tributions pass the two-sample Kolmogorov-Smirnov tests,with p-values equal to 1. Furthermore, the observation ofstage dependence could be verified analytically by pluggingthe parameters into eqs. (1) and (2), as plotted in blue andred lines in Fig. 15, respectively. It is also very well capturedby the model.

The model we described above for email forwarding pro-cesses is purely stochastic and has two parameters, p and Pn,which are measured from our email dataset independently.Perhaps unexpectedly, such a simple model explains a greatdeal of observations. This, together with Fig. 13, indicatesthat, despite the complexity in real life, the macroscopicstructures of information spreading processes are largely in-dependent of contextual information and can be well cap-tured and explained via simple machanisms.

6. CONCLUSIONS AND FUTURE WORKApplications of social systems rely on our understanding

of information spreading patterns. In this work, by com-

bining two related but distinct large scale datasets, we ad-dress the factors that govern information spreading at bothmicroscopic and macroscopic levels. We found, microscopi-cally, whom the information flows to indeed depends on thestructure of the underlying social network, individual ex-pertise and organizational hierarchy. The performance ofindividuals has little influence on the e!ciency of spread-ing, yet departmental constraints do slow down the process.At the macroscopic level, however, although seemingly com-plex, the structural properties of spreading trees, i.e., tohow many people a user forwards the information and thetotal coverage the information reaches, can be well capturedby a simple stochastic branching model, indicating that thespreading process follows a random yet reproducible pat-tern, largely independent of context. We believe that ourfindings could guide users to build better social and collab-orative applications, design tools and strategies to spreadinformation more e!ciently, improve information security,develop predictive tools for recommendation systems, andmore.

Future directions mainly fall into two lines. The first isto develop a better prediction model for information flow.Indeed, upon understanding to whom one forwards informa-tion, when one would forward it, and to how many people,the question thereafter is can we build a better predictionmodel of the flows? The second direction is about the muta-tion of information. People sometimes add extra informationor express opinions about existing information when passingalong the originals to others. How does information mutatealong the way? How does the mutation of information a"ectthe patterns of spreading? These questions stand as miss-ing chapters in our understanding of spreading processes.Indeed, with the availability of large-scale email datasets,thorough inspection of the email message contents will re-veal the dynamics of information itself, which in turn canyield better predictive tools for information spreading.

AcknowledgementThe authors thank James P. Bagrow,Nick Blumm, Helena Buhr, U Kang, Yu-Ru Lin, YelenaMejova, Jiang Yang, and anonymous reviewers for manyuseful discussions and insightful suggestions. This work wassupported by the Network Science Collaborative Technol-ogy Alliance sponsored by the U.S. Army Research Labo-ratory under Agreement Number W911NF-09-2-0053; theJames S. McDonnell Foundation 21st Century Initiative inStudying Complex Systems; the US O!ce of Naval ResearchAward (N000141010968); the NSF within the Information

a simple stochastic model captures a great deals of empirical observations

Modeling the spreading processes

Stage Dependence

Ultra Shallow

Page 32: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Conclusion

• At the macroscopic level, the structures of spreading processes are largely independent of context.

• At the microscopic level, information spreading is indeed highly dependent on social context as well as individualsʼ behavioral profiles.

Page 33: Determining Factors of Information Spreading ProcessesR in the viral marketing campaigns (circles). The solid line shows fits to a log-normal distribution with %^ ¼ 5:547 and #^

Acknowledgement

Ching-Yung Lin

Zhen WenHanghang Tong

Chaoming SongAlbert-László Barabási !

Information Spreading in Context. In Proceedings of the 20th international conference on World Wide Web (WWW '11)