Topical Semantics of Twitter Links MARCH 23, 2011 In-seok An SNU Internet Database Lab. Michael J. Welch, Uri Schonfeld Yahoo! Inc. , UCLA Computer Science Dept WSDM`11


Page 1: Topical Semantics of Twitter Links

Topical Semantics of Twitter Links

March 23, 2011
In-seok An
SNU Internet Database Lab.

Michael J. Welch, Uri Schonfeld
Yahoo! Inc., UCLA Computer Science Dept.

WSDM '11

Page 2: Topical Semantics of Twitter Links

Outline
Introduction
– Twitter
Modeling Twitter
Analysis of The Graph
Exploring Link Semantics
Experiments on Link Semantics
Conclusion

Page 3: Topical Semantics of Twitter Links

Introduction
Twitter
– Microblogging site
– 10th worldwide in total traffic
– 28 million unique monthly visitors
– Provider of information for breaking news events

Page 4: Topical Semantics of Twitter Links

Introduction
Simple graphical modeling for the Web
– Text-based pages connected by hyperlinks (directed edges)
– Applied to Twitter, it will fail to capture all that this information has to offer
– Produces less than ideal results
A rich graphical model for Twitter
– Multiple semantic edges
    Follow, RT, Mention, List
– Not all edges are created equal
In this paper
– Web graph vs. Twitter graph
– Follow link vs. Retweet link

Page 5: Topical Semantics of Twitter Links

Introduction
Twitter
– Blogging platform
    Maximum of 140 characters
    Micro-blogging platform
– Multiple interfaces
    Web, SMS, mobile application, instant messaging, etc.

Page 6: Topical Semantics of Twitter Links

Introduction
Twitter
Dual role
– Reader
    A user may choose to follow another user’s posts
    – Accessible via a private stream (timeline)
    – Sorted by their publication timestamp
    Friends / followers
– Writer
    Posting messages
    Retweeting messages
    Replying to or mentioning other Twitter users

Page 7: Topical Semantics of Twitter Links

Introduction
Twitter
Mention
– A user is referred to by their username prefixed with the character “@”
Retweet
– A user chooses to repeat another user’s post
– New style retweet
– Old style retweet

Page 8: Topical Semantics of Twitter Links

Introduction
Twitter
List
– Added in late 2009
– Allows users to construct and organize a group of users referred to as a list
– Helps a user focus on the posts of certain subsets of their friends
Two broad categories
– Topical lists
    Centered around the discussion of common interests or subjects
    e.g. “politics”
– Classification lists
    Formed to group users who share a common trait
    e.g. “Celebrities”, “professional athletes”
– Lists generate meaningful, manually created categorizations of users

Page 9: Topical Semantics of Twitter Links

Outline
Introduction
Modeling Twitter
– The Full Twitter Graph Model
– Additional Twitter Information
– The Simplified Twitter Graph
Analysis of The Graph
Exploring Link Semantics
Experiments on Link Semantics
Conclusion

Page 10: Topical Semantics of Twitter Links

Modeling Twitter
Web graph model
– Nodes
    Web pages
– Edges
    Hyperlinks connecting them
– Enables the application of many graph analysis techniques
    Inlink & outlink distributions
    PageRank
N by N matrix M
– The Web graph is commonly represented as a matrix
– N is the number of pages on the web
– $M_{ij} = \frac{1}{c_j}$
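
A minimal sketch of how the matrix M could be built from a list of hyperlinks. It assumes the usual PageRank convention, which the slide does not spell out: c_j is the number of outlinks of page j, and M_ij is non-zero only when page j links to page i. The function name `web_matrix` and the three-page example are hypothetical.

```python
def web_matrix(n, links):
    """Return the n x n matrix M with M[i][j] = 1/c_j when page j links to page i.

    Assumption: c_j is the number of distinct outlinks of page j.
    """
    unique_links = set(links)          # ignore duplicate hyperlinks
    outlinks = [0] * n                 # c_j for every page j
    for src, _dst in unique_links:
        outlinks[src] += 1
    M = [[0.0] * n for _ in range(n)]
    for src, dst in unique_links:
        M[dst][src] = 1.0 / outlinks[src]
    return M


# Hypothetical 3-page web: page 0 links to pages 1 and 2, page 1 links to page 2.
M = web_matrix(3, [(0, 1), (0, 2), (1, 2)])
```

Under this convention every column that belongs to a page with at least one outlink sums to 1, which is what lets PageRank treat M as a transition matrix.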

Page 11: Topical Semantics of Twitter Links

Modeling Twitter

The Full Twitter Graph Model
The Twitter graph is inherently more complex
– At least two different types of entities (nodes)
    Users and Tweets
– At least four types of relationships (edges)
    Follows, Publish, Retweets and Mentions
Twitter Graph Edges
– Follow edge
    User a follows the posts of user b
– Publish edge
    Authorship of the post
– Retweet edge
    Post a is a retweet of post b
– Mention edge
    Post a mentions user b

Page 12: Topical Semantics of Twitter Links

Modeling Twitter

The Full Twitter Graph Model
Matrix representation of the Twitter graph
– Identical to the Web graph
– (|U| + |P|) by (|U| + |P|) matrix
    |U| : the number of users
    |P| : the number of posts
– A non-zero value in $T_{ij}$
    Represents an edge between node i and node j
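
The matrix representation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: users are indexed first and posts after them, every edge is stored as an unweighted 1, and T[i][j] is set when there is an edge from node j to node i (mirroring the Web matrix orientation). The function name and the toy data are hypothetical.

```python
def full_twitter_matrix(users, posts, follows, publishes, retweets, mentions):
    """Build the (|U| + |P|) x (|U| + |P|) matrix T of the full Twitter graph.

    Assumed orientation: T[i][j] = 1 when there is an edge from node j to node i.
    """
    index = {("user", u): k for k, u in enumerate(users)}
    index.update({("post", p): len(users) + k for k, p in enumerate(posts)})
    n = len(index)
    T = [[0] * n for _ in range(n)]

    def add_edge(src, dst):
        T[index[dst]][index[src]] = 1

    for a, b in follows:        # user a follows user b
        add_edge(("user", a), ("user", b))
    for u, p in publishes:      # user u is the author of post p
        add_edge(("user", u), ("post", p))
    for a, b in retweets:       # post a is a retweet of post b
        add_edge(("post", a), ("post", b))
    for p, u in mentions:       # post p mentions user u
        add_edge(("post", p), ("user", u))
    return index, T


# Hypothetical toy graph with two users and two posts.
index, T = full_twitter_matrix(
    users=["a", "b"],
    posts=["p1", "p2"],
    follows=[("a", "b")],
    publishes=[("b", "p1"), ("a", "p2")],
    retweets=[("p2", "p1")],
    mentions=[("p1", "a")],
)
```

A single matrix keeps the model "identical to the Web graph", at the cost of losing the edge type; keeping one matrix per edge type would be an equally reasonable design.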

Page 13: Topical Semantics of Twitter Links

Modeling Twitter

Additional Twitter Information
Time
– Twitter includes timestamp information
    When each post was written
    When accounts were created
– When a follow link was created
    No explicit way to determine
    Can be approximated with repeated crawling
– Valuable for studying factors such as
    Evolution of the graph
    Charting popularity over time

Page 14: Topical Semantics of Twitter Links

Modeling Twitter

Additional Twitter Information
Hyperlinks
– Standard hyperlinks embedded in the posts
– Third node type
    Web page
    Uniquely identified by a URL
– Difficulty modeling hyperlinks in Twitter
    Common use of URL shortening services
    – TinyURL and bit.ly
    Prevents making use of keywords or other interesting artifacts the URL may contain directly
    Makes additional processing of the data necessary

Page 15: Topical Semantics of Twitter Links

Modeling Twitter

Additional Twitter Information
Post Content
– Use the content of a post
    To extract metadata
    – User name mentions
    – Identification of retweets
– Remaining textual content of a post
    Useful for determining the topics of interest to a user as well
– Difficulties
    Small size of the posts
    – Sparsity of data
    – Sparsity of tokens
    Frequent use of nonstandard shorthand notation
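
Metadata extraction from post text might look like the sketch below. The regular expressions are assumptions: they treat "@name" as a mention and the old-style "RT @name" prefix as a retweet marker, and they do not reflect Twitter's exact username rules or the new-style retweet, which is recorded outside the text.

```python
import re

# Assumed patterns: "@name" is a mention, "RT @name" marks an old-style retweet.
MENTION = re.compile(r"@(\w+)")
OLD_STYLE_RETWEET = re.compile(r"\bRT\s+@(\w+)")


def extract_metadata(text):
    """Return the mentioned usernames and, if present, the retweeted author."""
    retweet = OLD_STYLE_RETWEET.search(text)
    return {
        "mentions": MENTION.findall(text),
        "retweet_of": retweet.group(1) if retweet else None,
    }


# Hypothetical posts.
retweet_post = extract_metadata("RT @alice: great talk by @bob")
plain_post = extract_metadata("just landed in seoul")
```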

Page 16: Topical Semantics of Twitter Links

Modeling Twitter

The Simplified Twitter Graph
Simplified Twitter Graph
– Only includes user nodes
– Still captures the most important information
    From the original representation as it pertains to the users
– The user-user follow links remain
    As they are in the Full Twitter graph
– Add retweet edges to the simplified Twitter Graph
    If user a retweets user b at least one time
    – There is a retweet edge from user a to user b
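
A minimal sketch of collapsing the full graph into the simplified one. The input shapes (pairs of identifiers) and the function name are assumptions made for illustration; the rule itself is the one stated on the slide: one retweet edge from a to b if a retweeted b at least once.

```python
def simplify(follows, publishes, retweets):
    """Collapse the full graph into user-user follow and retweet edge sets.

    follows:   (user a, user b) pairs, kept unchanged
    publishes: (user, post) pairs giving each post's author
    retweets:  (post a, post b) pairs, post a being a retweet of post b
    """
    author = {post: user for user, post in publishes}
    retweet_edges = set()
    for post_a, post_b in retweets:
        a, b = author.get(post_a), author.get(post_b)
        if a is not None and b is not None:
            retweet_edges.add((a, b))  # a set keeps one edge however often a retweets b
    return set(follows), retweet_edges


# Hypothetical data: user "a" retweets user "b" twice.
follow_edges, retweet_edges = simplify(
    follows=[("a", "b")],
    publishes=[("b", "p1"), ("a", "p2"), ("a", "p3")],
    retweets=[("p2", "p1"), ("p3", "p1")],
)
```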

Page 17: Topical Semantics of Twitter Links

Outline
Introduction
Modeling Twitter
Analysis of The Graph
– Link Distributions
– Graph Formation
Exploring Link Semantics
Experiments on Link Semantics
Conclusion

Page 18: Topical Semantics of Twitter Links

Analysis of The Graph
Data specification
– Collected between October 2009 and January 2010
– 1.1 million Twitter users
– More than 273 million follow edges
– 2.9 million retweet edges
Crawling method
– Beginning with an initial seed set of the top 1000 users on twitterholic.com
– Crawling in a BFS manner
– Traversing the follow links in a forward direction
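
The crawling method can be sketched as a breadth-first traversal. This is not the authors' crawler: `get_friends` is a hypothetical callback standing in for whatever returns the users a given user follows, and `max_users` is an assumed stopping condition.

```python
from collections import deque


def bfs_crawl(seeds, get_friends, max_users):
    """Visit users breadth-first, following follow links in the forward direction."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    visited = []
    while queue and len(visited) < max_users:
        user = queue.popleft()
        visited.append(user)
        for friend in get_friends(user):   # users that `user` follows
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return visited


# Hypothetical in-memory follow graph standing in for the live site.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
crawl_order = bfs_crawl(["a"], lambda u: toy_graph.get(u, []), max_users=10)
```

Because only forward follow links are traversed, users that nobody reachable from the seeds follows are never visited, which is worth keeping in mind when reading the distributions that follow.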

Page 19: Topical Semantics of Twitter Links

Analysis of The Graph

Link Distributions
Follow Edges
– Power-law distribution
– Two abnormal spikes in the outlink distribution
    20 friends
    – Twitter provides an initial set of 20 “recommended” users to follow
    2000 friends
    – Twitter places restrictions on following more than 2000 users
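
A short sketch of how the follow outlink distribution could be tabulated from an edge list, which is where spikes such as those at 20 and 2000 friends would show up. The function name and sample edges are hypothetical.

```python
from collections import Counter


def outlink_distribution(follow_edges):
    """Map each friend count to the number of users who follow that many users."""
    friends_per_user = Counter(a for a, _b in set(follow_edges))
    return Counter(friends_per_user.values())


# Hypothetical edges (a, b) meaning "a follows b"; the duplicate is ignored.
distribution = outlink_distribution(
    [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("a", "b")]
)
```

Plotting the resulting counts on log-log axes is the usual way to check for power-law behavior by eye.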

Page 20: Topical Semantics of Twitter Links

Analysis of The Graph

Link Distributions
Retweet Edges
– Retweet inlinks
    Power-law distribution
– Retweet outlinks
    Do not follow a power-law distribution
– While the number of friends one has is generally power-law distributed, the number of users one finds truly interesting does not appear to scale in a similar fashion

Page 21: Topical Semantics of Twitter Links

Analysis of The Graph

Link Distributions
Posting Frequency
– 417,613 users who published at least one tweet
– Most recent 200 posts per user
– 58,000 users published only a single post during the month
– A large number of users wrote more than 100 posts

Page 22: Topical Semantics of Twitter Links

Analysis of The Graph

Graph Formation
Readers and Writers
– Three potential scenarios
    A user acts primarily as a reader
    – Few or no posts
    A user frequently retweets posts
    – Writes little to no original content
    A user contributes significant new content
– User’s reading and writing behavior (scatter plot)
    Each dot : unique user
    X-axis : # of posts published by friends
    Y-axis : # of posts published by user
    Shade : originality
    – The lighter shades indicate less originality
    Size : PageRank of each user (based on follow edges)

Page 23: Topical Semantics of Twitter Links

Analysis of The Graph

Graph Formation
General trend
– For users who post very frequently
    A larger fraction of their posts are actually retweets
– Many users retweeted at least one post which they did not read from one of their friends
    Despite the explicit friendship links available in the site structure, it is still not possible to know exactly what a user reads
    – Many websites are adding modules which display Twitter results

Page 24: Topical Semantics of Twitter Links

Outline
Introduction
Modeling Twitter
Analysis of The Graph
Exploring Link Semantics
– Retweet vs. Follow based Ranking
– Link Virality
Experiments on Link Semantics
Conclusion

Page 25: Topical Semantics of Twitter Links

Exploring Link Semantics
Web graph
– A link from page a to page b
    Endorsement of the quality of page b
    To some extent, its relevance to page a
Twitter graph
– Follow link
    Endorsement of quality or interest
    The actual semantics of the link
    – User a, acting as a reader, is interested in user b acting as a writer
– Retweet link
    Endorsement of quality
    – User is interested in the topic
    – User expects his readers to be interested in this post
    Retweet edge signifies a connection from user a as a writer to user b as a writer

Page 26: Topical Semantics of Twitter Links

Exploring Link Semantics

Retweet vs. Follow based Ranking

PageRank based on two edge types
– Retweet-based
    Simple power-law distribution
– Follow-based
    Two different segments with different power-law coefficients
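
To make the comparison concrete, here is a minimal power-iteration PageRank that can be run once over follow edges and once over retweet edges. The damping factor 0.85, the fixed iteration count, and the uniform redistribution of rank from users without outlinks are conventional assumptions, not values taken from the slides; the toy graph is hypothetical.

```python
def pagerank(nodes, edges, damping=0.85, iters=50):
    """Plain PageRank over directed (source, target) edges by power iteration."""
    nodes = list(nodes)
    out = {u: [] for u in nodes}
    for a, b in set(edges):
        out[a].append(b)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Rank held by users without outlinks is spread uniformly (assumption).
        dangling = sum(rank[u] for u in nodes if not out[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out[u]:
                new_rank[v] += damping * rank[u] / len(out[u])
        rank = new_rank
    return rank


# Hypothetical graph: "c" is followed by everyone, "b" is retweeted by everyone.
users = ["a", "b", "c"]
follow_rank = pagerank(users, [("a", "c"), ("b", "c")])
retweet_rank = pagerank(users, [("a", "b"), ("c", "b")])
```

Even on this toy input the two edge types produce different top-ranked users, which is the effect the slides go on to describe for celebrities versus news sources.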

Page 27: Topical Semantics of Twitter Links

Exploring Link Semantics

Retweet vs. Follow based Ranking

PageRank over Retweet links vs. Follow links
– Follow links
    Twitter-recommended celebrities (barackobama)
    – Rich-get-richer phenomenon
    Top-ranked users have a lower rank in RT-based PageRank
– Retweet links
    Tweetmeme
    – Social bookmarking site
    Top-ranked users have a lower rank in Follow-based PageRank

Page 28: Topical Semantics of Twitter Links

Exploring Link Semantics

Retweet vs. Follow based Ranking

Follow-based
– Public figures or celebrities
Retweet-based
– News-generating entities
Aplusk is the only user who appears in the top 10 for both rankings
These ranks can be affected by spam or marketing techniques
– ddlovatoRT simply retweets all posts mentioning Demi Lovato
– Twitter’s research team estimates that less than 1% of Tweets are now spam

Page 29: Topical Semantics of Twitter Links

Exploring Link Semantics

Link Virality
Retweet Virality and Follow Virality are defined using the following sets
– RoF(u) : the set of users from whom u has seen at least one post via a retweet
– FoF(u) : the set of all users who are reachable by traversing exactly two directed follow edges
– Fr(u) : the set of users whom user u follows
Retweet Virality is consistently higher than Follow Virality
– Retweets demonstrate a stronger notion of importance or influence to users
– Users are more likely to follow people they see retweeted than those who are merely “Friends of Friends”
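
The exact virality formulas are not preserved in this transcript, so the sketch below is only one plausible reading and should be treated as an assumption: virality is taken to be the fraction of a candidate set (RoF(u) or FoF(u)) that u also follows, i.e. that overlaps Fr(u). The function names and toy data are hypothetical.

```python
def friends_of_friends(u, friends):
    """FoF(u): users reached by traversing exactly two directed follow edges from u."""
    return {w for v in friends.get(u, ()) for w in friends.get(v, ())}


def virality(candidates, followed):
    """Assumed definition: share of the candidate set that the user also follows."""
    candidates = set(candidates)
    if not candidates:
        return 0.0
    return len(candidates & set(followed)) / len(candidates)


# Hypothetical data: Fr(u) = {a, b, c}; u saw retweeted posts from c and x.
friends = {"u": {"a", "b", "c"}, "a": {"c", "d"}, "b": {"d", "e"}}
rof_u = {"c", "x"}
fof_u = friends_of_friends("u", friends)
retweet_virality = virality(rof_u, friends["u"])
follow_virality = virality(fof_u, friends["u"])
```

Under this reading, a higher retweet virality means that a user exposed via a retweet is more likely to end up followed than a user who is merely two follow hops away.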

Page 30: Topical Semantics of Twitter Links

Outline
Introduction
Modeling Twitter
Analysis of The Graph
Exploring Link Semantics
Experiments on Link Semantics
– Empirical Results
– Topic Sensitive PageRank
Conclusion

Page 31: Topical Semantics of Twitter Links

Experiments on Link Semantics
Topical relevance
– Follow links quickly diffuse into a broad range of topics
– Retweet links remain more concentrated on the original topic
Data
– 1.1 million users
– 273 million follow edges
– 2.9 million retweet edges

Page 32: Topical Semantics of Twitter Links

Experiments on Link Semantics

Empirical Results
Empirical evaluation
– Starting from a seed set of users
    Members of the same topical list
    – photography and design
– Generate two sets of users
    At least one seed member follows them
    At least one seed member has retweeted one of their posts
– Random sample of 25 users from each of these sets
– Manually assessed them for topical relevance
Result
– # of relevant users in the follow-generated samples: 4 and 5
– # of relevant users in the retweet-generated samples: 19 and 20
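
The candidate generation and sampling steps can be sketched as below. The data layout, the function names, and the fixed random seed are assumptions for illustration; the manual relevance assessment is of course not something code can reproduce.

```python
import random


def candidate_sets(seeds, follow_edges, retweet_edges):
    """Users followed by, and users retweeted by, at least one seed member."""
    seeds = set(seeds)
    followed = {b for a, b in follow_edges if a in seeds}
    retweeted = {b for a, b in retweet_edges if a in seeds}
    return followed, retweeted


def sample_users(users, k=25, seed=0):
    """Draw up to k users at random; the seed only makes the sketch repeatable."""
    pool = sorted(users)
    return random.Random(seed).sample(pool, min(k, len(pool)))


# Hypothetical seed list and edges.
followed, retweeted = candidate_sets(
    seeds=["s1", "s2"],
    follow_edges=[("s1", "a"), ("s2", "b"), ("x", "c")],
    retweet_edges=[("s1", "b"), ("y", "d")],
)
follow_sample = sample_users(followed, k=25)
```

For scale, the reported counts out of 25 correspond to 16% and 20% relevant users in the follow-generated samples, against 76% and 80% in the retweet-generated samples.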

Page 33: Topical Semantics of Twitter Links

Experiments on Link Semantics

Topic Sensitive PageRank
PageRank
– Recursive ranking formula
– A page is as important as the pages pointing to it
Topic Sensitive PageRank (TSPR) [1]
– Biased PageRank
    Generates query-specific importance scores for pages at query time
– We use topic sensitive PageRank to quantify the difference in topical relevance carried by follow and retweet links

[1] T. H. Haveliwala. Topic-sensitive PageRank. WWW 2002.
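
A minimal sketch of a biased (personalized) PageRank in which random jumps land only on the seed users of a topic. The damping factor, the iteration count, and sending the rank of users without outlinks back to the seeds are assumptions, and the helper `top_non_seed` is a hypothetical convenience for reading off the highest-ranked users outside the seed set.

```python
def topic_sensitive_pagerank(nodes, edges, seeds, damping=0.85, iters=100):
    """PageRank whose teleport distribution is uniform over the seed users only."""
    nodes = list(nodes)
    seeds = set(seeds)
    teleport = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    out = {u: [] for u in nodes}
    for a, b in set(edges):
        out[a].append(b)
    rank = dict(teleport)
    for _ in range(iters):
        # Rank stuck on users without outlinks returns to the seeds (assumption).
        dangling = sum(rank[u] for u in nodes if not out[u])
        new_rank = {u: (1.0 - damping + damping * dangling) * teleport[u] for u in nodes}
        for u in nodes:
            for v in out[u]:
                new_rank[v] += damping * rank[u] / len(out[u])
        rank = new_rank
    return rank


def top_non_seed(rank, seeds, k):
    """The k highest-ranked users that are not seed users."""
    candidates = [u for u in rank if u not in set(seeds)]
    return sorted(candidates, key=lambda u: rank[u], reverse=True)[:k]


# Hypothetical graph: only "a" and "b" are reachable from the seed "s".
tspr = topic_sensitive_pagerank(
    nodes=["s", "a", "b", "x", "c"],
    edges=[("s", "a"), ("a", "b"), ("x", "c")],
    seeds=["s"],
)
top_users = top_non_seed(tspr, ["s"], k=2)
```

Running the same function once with follow edges and once with retweet edges gives the two rankings whose topical relevance is compared.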

Page 34: Topical Semantics of Twitter Links

Experiments on Link Semantics

Topic Sensitive PageRank
Experiments
– Beginning with a topical Twitter list
– Compute topic sensitive PageRank for
    Follow edges
    Retweet edges
– If the links carry the topicality well
    The high-ranking users are likely to be topically relevant to the original seed topic
– Evaluate the resulting highest-ranked users for relevance to the original topic with a user survey

Page 35: Topical Semantics of Twitter Links

Experiments on Link Semantics

Topic Sensitive PageRank
Experimental Setup
– Collected 9 topical lists from listorious.com
    19 ~ 437 users
    – Average 155, median 49
    Seed users have an average of 14,284 followers
– Compute personalized PageRank
– Selected the 30 highest-ranking non-seed users
– Conduct a survey
    Participants were shown a topic description and the 30 highest-ranked users for either a follow-based or a retweet-based PageRank
    Ordered randomly
    Mixed with a random set of 10 of the seed users for that topic
    Make a binary judgment of each user’s relevance
    A total of 12 people participated in the survey
    Each list was evaluated by at least 2 people

Page 36: Topical Semantics of Twitter Links

Experiments on Link Semantics

Topic Sensitive PageRank
Accuracy of the highly ranked users
– Precision
    The average relevancy of a set of users
– Relevance
    The fraction of users who were judged relevant by at least one survey taker
– $R_k(U)$ : the set of users from U judged relevant in evaluation k of a particular list
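
The formulas for the two measures are not preserved in this transcript. The sketch below follows the verbal definitions and is an assumption about their exact form: precision averages |R_k(U)| / |U| over the evaluations k, and relevance is the share of U that appears in at least one R_k(U). Names and data are hypothetical.

```python
def precision(U, evaluations):
    """Average over evaluations k of |R_k(U)| / |U| (assumed form)."""
    U = set(U)
    return sum(len(set(R_k) & U) / len(U) for R_k in evaluations) / len(evaluations)


def relevance(U, evaluations):
    """Fraction of U judged relevant by at least one survey taker (assumed form)."""
    U = set(U)
    judged_relevant = set().union(*evaluations) & U
    return len(judged_relevant) / len(U)


# Hypothetical survey: four ranked users, two evaluations of the same list.
ranked_users = ["a", "b", "c", "d"]
evaluations = [{"a", "b"}, {"b", "c"}]
p = precision(ranked_users, evaluations)
r = relevance(ranked_users, evaluations)
```

Relevance can never be lower than precision under these definitions, since any user counted in some R_k(U) is also counted in the union.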

Page 37: Topical Semantics of Twitter Links

Experiments on Link Semantics

Topic Sensitive PageRank
Result
– Precision can be improved by simply using retweet links instead of follow links
    Precision of top-ranked users improved by over 30%

Page 38: Topical Semantics of Twitter Links

Experiments on Link Semantics

Topic Sensitive PageRank
Cohesiveness of Seed
– To verify the seed users
    Include 10 randomly selected seed users for each evaluation
Result
– Average precision : 0.931
    Minimum of 0.838
    Maximum of 1.0
– The seed users represented their topics well
– Our survey takers understood and agreed upon the topic definitions

Page 39: Topical Semantics of Twitter Links

Conclusion
We have described a detailed model of Twitter as a graph
– Key statistics about the graph
– Provided some initial insights as to how the graph forms
Important distinctions between edge types in the graph
– Follow and retweet
– The varying semantics and properties of these edges will have significant implications for graph algorithms such as PageRank
– Retweet edges preserve topical relevance
    Better than follow edges