
Community Structure and Information Flow in Usenet: Improving Analysis with a Thread Ownership Model

Mary McGlohon, Carnegie Mellon*
Matthew Hurst, Microsoft

*work completed at Microsoft


Motivation

• Comparing communities of online social networks may lend insight into how groups form and thrive
• We would also like to understand how information diffuses between groups

[Figure: Collaborations at Santa Fe Institute (Girvan & Newman)]


Why Usenet?

• We delve into these questions by analyzing data from Usenet
• Public
• Can be analyzed over a long time period
• Has pre-defined, hierarchical community structure
• Two main goals:
  ▫ Compare different group activity (size, reciprocity)
  ▫ Observe diffusion between groups


Data

• Posts from 200 politically-oriented newsgroups (bulletin boards)
  ▫ “polit” in name
• January 2004-June 2008
• Several countries, states/provinces, and topics
• 19.6 million unique articles, 6.2 million cross-posted

[Figure: a thread of posts, labeled "Parent" and "Replies"]


Cross-posting

• A large percentage of articles are cross-posted to multiple groups.
• Somebody reading one group may “reply-to-all”, such that all groups see it.

Major issue: many articles are cross-posted to multiple groups. Where is conversation truly occurring?

[Figure: an example thread whose posts are labeled with the groups they were sent to: {alt.politics, us.politics}, {alt.politics, us.politics, pa.politics}, {alt.politics, us.politics, pa.politics}, {alt.politics, us.politics}]


Outline

• Motivation
• Data description
• Structural Analysis
  ▫ Size
  ▫ Reciprocity
  ▫ Similarity
• Ownership model
  ▫ Effects of Cross-posting
  ▫ Information Flow based on Ownership
  ▫ Similarity


Structural Analysis

• We hope to compare the structure of communities by answering the following questions:
• How do edges form?
• How does the reciprocity of groups compare?
• How can we measure similarity?


Sizes of groups

• How do edges form?

•  To answer, we make a network of authors for each group

•  If a1 has replied to a2 at any point, there is an edge from a1 to a2
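The edge rule above can be sketched in a few lines of Python. This is only an illustration: the input format (a list of `(replier, parent_author)` pairs for one group) and the choice to skip self-replies are assumptions, not details given in the slides.

```python
def build_reply_network(replies):
    """Build a directed author network from (replier, parent_author) pairs.

    An edge a1 -> a2 exists if a1 has replied to a2 at least once;
    repeated replies between the same pair do not add further edges.
    """
    edges = set()
    for replier, parent_author in replies:
        if replier != parent_author:  # assumption: self-replies are ignored
            edges.add((replier, parent_author))
    # Only authors that take part in at least one reply appear as nodes.
    nodes = {author for edge in edges for author in edge}
    return nodes, edges


# Hypothetical reply records for a single newsgroup.
replies = [("a1", "a2"), ("a1", "a2"), ("a2", "a1"), ("a3", "a1")]
nodes, edges = build_reply_network(replies)
print(len(nodes), len(edges))
```

Building one such network per newsgroup gives the author and edge counts that the following slide compares across groups.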


Sizes of groups

•  Power law-like relationship between number of authors and number of edges.

•  Similar to densification law [Leskovec+05], only with individual networks instead of snapshots of a network over time.
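One simple way to check for a power law-like relationship is to fit a straight line to log(edges) against log(authors), with one point per group. The sketch below uses ordinary least squares; the slides do not say how the relationship was fitted, and the data here is made up so that the expected slope is known.

```python
import math


def fit_power_law_exponent(sizes):
    """Least-squares fit of log(edges) = slope * log(authors) + intercept.

    `sizes` is a list of (number_of_authors, number_of_edges) pairs,
    one pair per group. Returns (slope, intercept).
    """
    xs = [math.log(authors) for authors, _ in sizes]
    ys = [math.log(edges) for _, edges in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up group sizes that follow edges = 2 * authors**1.5 exactly.
sizes = [(n, 2 * n ** 1.5) for n in (10, 100, 1000, 10000)]
slope, intercept = fit_power_law_exponent(sizes)
print(round(slope, 3))
```

A slope above 1 would mean that larger groups have more edges per author, which is the sense in which the relationship resembles densification.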

[Figure: scatter plot of log(number of edges) against log(number of authors); labeled groups include alt.politics and tw.bbs.politics. Inset axes: log(nodes) and log(edges), with time labels t=2004 and t=2008.]


Reciprocity

• Which groups have highest reciprocity?
• Reciprocity: percentage of reply-edges that are mutual
• Top 10 were European newsgroups (up to 0.58):
  ▫ hun.politika
  ▫ relcom.politics
  ▫ hsv.politics
  ▫ italia.modena.politica
  ▫ se.politik
  ▫ it.discussioni.leggende.metropolitane
  ▫ ukr.politics
  ▫ yu.forum.politika
  ▫ ni.politics
  ▫ swnet.politik
• Lowest reciprocity occurred in tw.bbs.* (<0.1)
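As a concrete reading of this definition, reciprocity can be computed as the fraction of directed reply-edges whose reverse edge also exists. The sketch assumes edges are `(replier, parent_author)` pairs with no self-loops; the sample edges are hypothetical.

```python
def reciprocity(edges):
    """Fraction of directed edges (a, b) for which (b, a) is also an edge."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for a, b in edge_set if (b, a) in edge_set)
    return mutual / len(edge_set)


# Hypothetical group: a1 and a2 reply to each other, a3 is never answered.
edges = {("a1", "a2"), ("a2", "a1"), ("a3", "a1"), ("a3", "a2")}
r = reciprocity(edges)  # 2 of the 4 edges are mutual
print(r)
```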


Similarity

• How can we measure similarity between groups?
• Use Jaccard coefficient for cross-posts:

  (# shared articles (cross-posts) between the 2 groups) / (total number of articles in the 2 groups)

• Can do the same with shared authors
• Highest similarity ~0.54 (bc.politics and ont.politics)
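The Jaccard coefficient is the size of the intersection divided by the size of the union. A minimal sketch, assuming each group is represented as a set of article IDs (the IDs below are hypothetical); the same function works on sets of author names.

```python
def jaccard(items_a, items_b):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical article IDs seen in two groups; m3 and m4 are cross-posts.
group_1 = {"m1", "m2", "m3", "m4"}
group_2 = {"m3", "m4", "m5"}
sim = jaccard(group_1, group_2)  # 2 shared articles out of 5 distinct ones
print(sim)
```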


Similarity

• Each group is a node
• Edge drawn if similarity > 0.1 (thick edge > 0.2)
• Form clusters: parties, US regional, countries, alt.politics subgroups
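The thresholding rule can be sketched as follows. The 0.1 and 0.2 cut-offs come from the slide; the function name, the input format (a dict from group name to a set of article IDs), and the sample data are assumptions for illustration.

```python
from itertools import combinations


def similarity_edges(group_articles, thin=0.1, thick=0.2):
    """Return {(g1, g2): "thin" or "thick"} for sufficiently similar groups.

    Similarity is the Jaccard coefficient of the two groups' article sets.
    """
    edges = {}
    for g1, g2 in combinations(sorted(group_articles), 2):
        a, b = group_articles[g1], group_articles[g2]
        union = a | b
        sim = len(a & b) / len(union) if union else 0.0
        if sim > thick:
            edges[(g1, g2)] = "thick"
        elif sim > thin:
            edges[(g1, g2)] = "thin"
    return edges


# Hypothetical article IDs per group.
group_articles = {
    "alt.politics": set(range(0, 10)),   # articles 0..9
    "us.politics": set(range(5, 15)),    # articles 5..14
    "pa.politics": set(range(13, 23)),   # articles 13..22
}
graph = similarity_edges(group_articles)
print(graph)
```

Here alt.politics and us.politics share 5 of 15 distinct articles (thick edge), pa.politics and us.politics share 2 of 18 (thin edge), and alt.politics and pa.politics share none (no edge).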


Parties/topics


US States


English-speaking countries


alt.politics.*


Outline

• Motivation
• Data Description
• Structural Analysis
  ▫ Size
  ▫ Reciprocity
  ▫ Similarity
• Ownership model
  ▫ Information Flow based on Ownership
  ▫ Similarity


Problem: Excessive cross-posting

• We just saw that there is significant overlap between groups in terms of articles
• However, cross-posting occurs often between unrelated groups (“edges below threshold”)
• We would like to find out in which group the activity is truly occurring
• How can we trace this?


Solution: Thread Ownership

• Answer: Assign “ownership” based on the authors of the posts
• First, assign authors to groups based on devotion
  ▫ Devotion(a,g): what percentage of an author a’s posts are exclusively posted to a given group g
• For each post, normalize devotion among groups where the post occurs.
  ▫ Group with highest devotion score for the author has more “ownership” of a post


Example: Thread Ownership

• Suppose in the data authors have the following numbers of non-cross-posts in each group:

            alt.politics   us.politics   pa.politics
  Author 1       6              4             0
  Author 2       0              1             3
  Author 3       0              1             2

• Then, they form a thread (ownership values are listed in the same order as the post's groups):

  Post     Posted to                                    Ownership
  Post 1   {alt.politics, us.politics}                  {0.6, 0.4}
  Post 2   {alt.politics, us.politics}                  {0, 1}
  Post 3   {alt.politics, us.politics, pa.politics}     {0, 0.25, 0.75}
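The ownership values in this example can be reproduced with a short sketch. Two things here are assumptions rather than statements from the slides: devotion is computed as an author's non-cross-posts in a group divided by all of that author's non-cross-posts, and the second and third posts are attributed to Author 3 and Author 2 respectively (the slide does not label the authors, but this assignment matches the printed values). The even split when an author has no devotion to any of a post's groups is also an illustrative choice.

```python
def devotion(exclusive_counts):
    """Share of an author's non-cross-posted articles that fall in each group."""
    total = sum(exclusive_counts.values())
    if total == 0:
        return {g: 0.0 for g in exclusive_counts}
    return {g: n / total for g, n in exclusive_counts.items()}


def ownership(exclusive_counts, post_groups):
    """Normalize the author's devotion over the groups a post was sent to."""
    dev = devotion(exclusive_counts)
    scores = {g: dev.get(g, 0.0) for g in post_groups}
    total = sum(scores.values())
    if total == 0:
        # Assumption: with no evidence either way, split ownership evenly.
        return {g: 1.0 / len(post_groups) for g in post_groups}
    return {g: s / total for g, s in scores.items()}


# Non-cross-post counts from the table in the example.
authors = {
    "Author 1": {"alt.politics": 6, "us.politics": 4, "pa.politics": 0},
    "Author 2": {"alt.politics": 0, "us.politics": 1, "pa.politics": 3},
    "Author 3": {"alt.politics": 0, "us.politics": 1, "pa.politics": 2},
}

post_1 = ownership(authors["Author 1"], ["alt.politics", "us.politics"])
post_2 = ownership(authors["Author 3"], ["alt.politics", "us.politics"])
post_3 = ownership(authors["Author 2"],
                   ["alt.politics", "us.politics", "pa.politics"])
print(post_1, post_2, post_3)
```

Under these assumptions the three results match the example: 0.6 and 0.4 for the first post, 0 and 1 for the second, and 0, 0.25 and 0.75 for the third.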


Real thread

•  Initially cross-posted to several groups (including talk.politics.misc), 38 groups in total

• Ownership concentrated in seattle.politics and or.politics

•  Subject: “Kiss the National Parks Good-Bye”


Applications of thread ownership

• Ownership model aids in analyzing threads
  ▫ Influence between groups: How are threads discovered and posted to new groups?
  ▫ Similarity of groups: How can ownership help us more precisely state when two groups are similar?


Information flow between groups

• How are threads discovered and posted to new groups?
• Idea: Extend ownership to influence
• How often does an author in group 1 respond to a post they found in group 2?
  ▫ Author a finds parent post p_p by browsing group g_p
  ▫ Author a writes child post p_c to group g_c
  ▫ Then, we say g_p influences g_c

  Influence(g_p, g_c) = Devotion(a, g_p) * Devotion(a, g_c)

  (each devotion is normalized among the groups of the corresponding post, as in the ownership model)

• This helps pinpoint when an author decides to cross-post late in the thread

[Figure: a parent post in {alt.politics, us.politics} and a reply in {alt.politics, us.politics, pa.politics}]


Example: Ownership-based influence

            alt.politics   us.politics   pa.politics
  Author 1       6              4             0
  Author 2       0              1             3

• Author 2 sees parent post, posted to {alt.politics, us.politics}
• Replies, adding pa.politics, so the reply goes to {alt.politics, us.politics, pa.politics}
• Since Author 2 is not devoted to alt.politics, he was most likely influenced by us.politics
• Influence(us.politics, pa.politics) = 1 * 0.75 = 0.75
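The 0.75 in this example can be reproduced with the sketch below. It reads the influence formula as using the replying author's devotion normalized over the parent post's groups and over the reply's groups, which is what the numbers 1 and 0.75 imply; that reading, the function names, and the choice to score every (parent group, child group) pair are assumptions.

```python
def normalized_devotion(exclusive_counts, groups):
    """Author's devotion restricted to `groups` and rescaled to sum to 1.

    Normalizing raw non-cross-post counts gives the same result as
    normalizing devotion, since devotion is the counts divided by a constant.
    """
    scores = {g: exclusive_counts.get(g, 0) for g in groups}
    total = sum(scores.values())
    if total == 0:
        # Assumption: with no evidence either way, split evenly.
        return {g: 1.0 / len(groups) for g in groups}
    return {g: s / total for g, s in scores.items()}


def influence(exclusive_counts, parent_groups, child_groups):
    """Score every (parent group, child group) pair for one reply."""
    parent_own = normalized_devotion(exclusive_counts, parent_groups)
    child_own = normalized_devotion(exclusive_counts, child_groups)
    return {
        (gp, gc): parent_own[gp] * child_own[gc]
        for gp in parent_groups
        for gc in child_groups
    }


# Author 2's non-cross-post counts from the table above.
author_2 = {"alt.politics": 0, "us.politics": 1, "pa.politics": 3}
scores = influence(
    author_2,
    parent_groups=["alt.politics", "us.politics"],
    child_groups=["alt.politics", "us.politics", "pa.politics"],
)
print(scores[("us.politics", "pa.politics")])
```

Every pair with alt.politics as the parent group scores 0 here, which matches the observation that Author 2 most likely found the thread through us.politics.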


Who influences whom?

• Information often diffuses from major to minor groups


Ownership-based Similarity

• Q: How can ownership help us more precisely state when two groups are similar?
• A: Use “shared ownership” instead of shared posts

[Figure: clusters labeled "Western states" and "Eastern states"]
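The slides do not give a formula for "shared ownership". One possible reading, shown here purely as an assumption, is a weighted Jaccard coefficient: for each post, take the smaller of the two groups' ownership values as the shared part and the larger as the total, then divide the sums. The function name and the sample ownership values are illustrative.

```python
def shared_ownership_similarity(post_ownerships, g1, g2):
    """Weighted Jaccard over per-post ownership (an assumed formulation).

    `post_ownerships` is a list of dicts mapping group name to that group's
    ownership share of one post; groups a post was not sent to count as 0.
    """
    shared = total = 0.0
    for own in post_ownerships:
        o1, o2 = own.get(g1, 0.0), own.get(g2, 0.0)
        shared += min(o1, o2)
        total += max(o1, o2)
    return shared / total if total else 0.0


# Hypothetical ownership values for three cross-posted articles.
post_ownerships = [
    {"alt.politics": 0.6, "us.politics": 0.4},
    {"alt.politics": 0.0, "us.politics": 1.0},
    {"alt.politics": 0.0, "us.politics": 0.25, "pa.politics": 0.75},
]
sim = shared_ownership_similarity(post_ownerships, "alt.politics", "us.politics")
print(round(sim, 3))
```

In this toy data all three posts are shared between alt.politics and us.politics, so a plain Jaccard coefficient on posts would be 1.0, while the ownership-weighted value is much lower because alt.politics owns little of the activity.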


Applications and future work

• Potential Applications
  ▫ Link prediction
  ▫ Information retrieval and relevance
  ▫ Ownership for email lists
• Future Work
  ▫ Using comparative measures to predict whether a group will continue


Related work: Discussion Groups

• Backstrom, L.; Kumar, R.; Marlow, C.; Novak, J.; and Tomkins, A. 2008. Preferential behavior in online groups. WSDM ’08
• Gomez, V.; Kaltenbrunner, A.; and Lopez, V. 2008. Statistical analysis of the social network and discussion threads in Slashdot. WWW ’08
• Mishne, G., and Glance, N. 2006. Leave a reply: An analysis of weblog comments. WWE ’06
• Turner, T. C.; Smith, M. A.; Fisher, D.; and Welser, H. T. 2005. Picturing Usenet: Mapping computer-mediated collective action. Journal of Computer-Mediated Communication 10(4).
• Viegas, F. B., and Smith, M. 2004. Newsgroup crowds and authorlines: visualizing the activity of individuals in conversational cyberspaces. HICSS 2004


Related work: Information Diffusion

• Kossinets, G.; Kleinberg, J.; and Watts, D. 2008. The structure of information pathways in a social communication network. KDD ’08
• Leskovec, J.; Kleinberg, J.; and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. KDD ’05
• Liben-Nowell, D., and Kleinberg, J. 2008. Tracing the flow of information on a global scale using Internet chain-letter data. PNAS 105(12):4633–4638.


Conclusions

• Case study of nearly 200 newsgroups, including 19 million unique posts
• Demonstrated “densification” law as it applies to different groups
• Compared groups in terms of reciprocity and shared posts/authors
• Proposed thread ownership model to cut down on “noise” from cross-posts
• Applied ownership to diffusion between groups and to group similarity


Contact info

• Mary McGlohon
• www.cs.cmu.edu/~mmcgloho
• Matthew Hurst
• datamining.typepad.com
• Special thanks to Christos Faloutsos, Michael Gamon, Kathy Gill, Christian Konig, Alexei Maykov, Purna Sarkar, Hassan Sayyadi, Marc Smith