Topical Semantics of Twitter Links
Michael J. Welch, Uri Schonfeld
Yahoo! Inc., UCLA Computer Science Dept
WSDM '11
March 23, 2011
In-seok An
SNU Internet Database Lab.

Outline
Introduction
– Twitter
Modeling Twitter
Analysis of The Graph
Exploring Link Semantics
Experiments on Link Semantics
Conclusion

Introduction
Twitter
– Microblogging site
– 10th worldwide in total traffic
– 28 million unique monthly visitors
– Provider of information for breaking news events

Introduction
Simple graphical modeling for the Web
– Text-based pages connected by hyperlinks (directed edges)
– Applied to Twitter, it will fail to capture all that Twitter's information has to offer
– Produces less than ideal results
A rich graphical model for Twitter
– Multiple semantic edges
    Follow, RT, Mention, List
– Not all edges are created equal
In this paper
– Web graph vs. Twitter graph
– Follow link vs. Retweet link

Introduction
Twitter
– Blogging platform
    Maximum of 140 characters
    Micro-blogging platform
– Multiple interfaces
    Web, SMS, mobile application, instant messaging, etc.

Introduction
Dual role
– Reader
    A user may choose to follow another user’s posts
    – Accessible via a private stream (timeline)
    – Sorted by their publication timestamp
    Friends / followers
– Writer
    Posting messages
    Retweeting messages
    Replying to or mentioning other users

Introduction
Mention
– User is referred to by their username prefixed with the character “@”
Retweet
– A user chooses to repeat another user’s post
– New style retweet
– Old style retweet

Introduction
List
– Added in late 2009
– Allows users to construct and organize a group of users referred to as a list
– Helps a user focus on the posts of certain subsets of their friends
Two broad categories
– Topical lists
    Centered around the discussion of common interests or subjects
    “politics”
– Classification lists
    Formed to group users who share a common trait
    “Celebrities”, “professional athletes”
– Lists generate meaningful, manually created categorizations of users

Outline
Introduction
Modeling Twitter
– The Full Twitter Graph Model
– Additional Twitter Information
– The Simplified Twitter Graph
Analysis of The Graph
Exploring Link Semantics
Experiments on Link Semantics
Conclusion

Modeling Twitter
Web graph model
– Nodes
    Web pages
– Edges
    Hyperlinks connecting them
– Enables the application of many graph analysis techniques
    Inlink & outlink distributions
    PageRank
N by N matrix M
– The Web graph is commonly represented as a matrix
– N is the number of pages on the Web
– $M_{ij} = 1/c_j$ if page j links to page i (0 otherwise), where $c_j$ is the number of outlinks of page j

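A minimal sketch of how such a matrix could be built from a list of hyperlinks. It assumes the usual PageRank convention that M[i][j] = 1/c_j when page j links to page i, with c_j the number of distinct outlinks of page j. The function name `web_matrix` and the integer page ids are illustrative and do not come from the slides.

```python
def web_matrix(n, links):
    """Build the N x N Web graph matrix as a list of lists.

    n     -- number of pages, identified by the integers 0 .. n-1
    links -- iterable of (src, dst) pairs, one per hyperlink
    """
    # Collect the distinct outlink targets of every page.
    outlinks = [set() for _ in range(n)]
    for src, dst in links:
        outlinks[src].add(dst)

    # M[i][j] = 1 / c_j if page j links to page i, else 0 (assumed convention).
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        c_j = len(outlinks[j])
        for i in outlinks[j]:
            M[i][j] = 1.0 / c_j
    return M


# Page 0 links to pages 1 and 2; page 1 links to page 2.
M = web_matrix(3, [(0, 1), (0, 2), (1, 2)])
```

Each column of a page with at least one outlink sums to 1, which is what lets PageRank treat M as a transition matrix.
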
Modeling Twitter
The Full Twitter Graph Model
The Twitter graph is inherently more complex
– At least two different types of entities (nodes)
    Users and Tweets
– At least four types of relationships (edges)
    Follows, Publish, Retweets and Mentions
Twitter graph edges
– Follow edge
    User a follows the posts of user b
– Publish edge
    Authorship of the post
– Retweet edge
    Post a is a retweet of post b
– Mention edge
    Post a mentions user b

Modeling Twitter
The Full Twitter Graph Model
Matrix representation of the Twitter graph
– Identical to the Web graph
– (|U| + |P|) by (|U| + |P|) matrix T
    |U| : the number of users
    |P| : the number of posts
– A non-zero value in $T_{ij}$
    Represents an edge between node i and node j

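The (|U| + |P|) by (|U| + |P|) layout can be sketched as follows. This is only an illustration: storing the edge type as a small integer code, the direction of the publish edge (user to post), and all names (`EdgeType`, `build_twitter_matrix`, the sample users and posts) are assumptions, not details given in the slides.

```python
from enum import IntEnum


class EdgeType(IntEnum):
    FOLLOW = 1   # user a -> user b
    PUBLISH = 2  # user -> post (direction assumed here)
    RETWEET = 3  # post a -> post b
    MENTION = 4  # post a -> user b


def build_twitter_matrix(users, posts, edges):
    """Return (T, index): the square matrix and the node-to-row mapping.

    edges -- iterable of (edge_type, src_node, dst_node), where a node is
             ("user", name) or ("post", post_id).
    """
    # Users occupy rows 0 .. |U|-1, posts occupy rows |U| .. |U|+|P|-1.
    index = {}
    for u in users:
        index[("user", u)] = len(index)
    for p in posts:
        index[("post", p)] = len(index)

    n = len(index)
    T = [[0] * n for _ in range(n)]
    for edge_type, src, dst in edges:
        # A non-zero entry marks an edge from node i to node j.
        T[index[src]][index[dst]] = int(edge_type)
    return T, index


T, index = build_twitter_matrix(
    users=["alice", "bob"],
    posts=["p1", "p2"],
    edges=[
        (EdgeType.FOLLOW, ("user", "alice"), ("user", "bob")),
        (EdgeType.PUBLISH, ("user", "bob"), ("post", "p1")),
        (EdgeType.PUBLISH, ("user", "alice"), ("post", "p2")),
        (EdgeType.RETWEET, ("post", "p2"), ("post", "p1")),
    ],
)
```

A real data set of this size would need a sparse representation; the dense list of lists is used here only to mirror the matrix on the slide.
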
Modeling Twitter
Additional Twitter Information
Time
– Twitter includes timestamp information
    When each post was written
    When accounts were created
– When a follow link was created
    No explicit way to determine
    Can be approximated with repeated crawling
– Valuable for studying factors such as
    Evolution of the graph
    Charting popularity over time

Modeling Twitter
Additional Twitter Information
Hyperlinks
– Standard hyperlinks embedded in the posts
– Third node type
    Web page
    Uniquely identified by a URL
– Difficulty modeling hyperlinks in Twitter
    Common use of URL shortening services
    – TinyURL and bit.ly
    Prevents making use of keywords or other interesting artifacts the URL may contain directly
    Makes additional processing of the data necessary

Modeling Twitter
Additional Twitter Information
Post Content
– Use the content of a post
    To extract metadata
    – User name mentions
    – Identification of retweets
– Remaining textual content of a post
    Can also be used to determine the topics of interest to a user
– Difficulties
    Small size of the posts
    – Sparsity of data
    – Sparsity of tokens
    Frequent use of nonstandard shorthand notation

Modeling Twitter
The Simplified Twitter Graph
Simplified Twitter Graph
– Only includes user nodes
– Still captures the most important information
    From the original representation as it pertains to the users
– The user-user follow links remain
    As they are in the Full Twitter graph
– Add retweet edges to the simplified Twitter graph
    If user a retweets user b at least one time
    – There is a retweet edge from user a to user b

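The reduction from the full graph to the user-only graph can be sketched as below. The input shapes (a set of follow pairs, a post-to-author mapping, and a list of post-level retweet pairs) and the function name `simplify` are assumptions made for illustration.

```python
def simplify(follows, author_of, post_retweets):
    """Collapse the full graph to user nodes only.

    follows       -- set of (a, b) pairs: user a follows user b
    author_of     -- dict mapping post id -> authoring user
    post_retweets -- iterable of (post_a, post_b): post_a is a retweet of post_b
    Returns (follow_edges, retweet_edges), both sets of user-user pairs.
    """
    # Follow links are already user-to-user, so they carry over unchanged.
    follow_edges = set(follows)

    # A retweet edge a -> b exists if user a retweeted user b at least once;
    # using a set collapses repeated retweets into a single edge.
    retweet_edges = set()
    for post_a, post_b in post_retweets:
        retweet_edges.add((author_of[post_a], author_of[post_b]))
    return follow_edges, retweet_edges


follow_edges, retweet_edges = simplify(
    follows={("alice", "bob")},
    author_of={"p1": "bob", "p2": "alice", "p3": "alice"},
    post_retweets=[("p2", "p1"), ("p3", "p1")],
)
```

Here alice retweets bob twice, yet the simplified graph holds one retweet edge from alice to bob.
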
Outline
Introduction
Modeling Twitter
Analysis of The Graph
– Link Distributions
– Graph Formation
Exploring Link Semantics
Experiments on Link Semantics
Conclusion

Analysis of The Graph
Data specification
– Collected between October 2009 and January 2010
– 1.1 million Twitter users
– More than 273 million follow edges
– 2.9 million retweet edges
Crawling method
– Beginning with an initial seed set of the top 1000 users on twitterholic.com
– Crawling in a BFS manner
– Traversing the follow links in a forward direction

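The crawling strategy described above (seed set, breadth-first order, follow links traversed forward) can be sketched as follows. The `get_friends` callable stands in for whatever fetches the accounts a user follows; it, the `max_users` cap, and the toy graph are hypothetical and not part of the original study.

```python
from collections import deque


def crawl_bfs(seeds, get_friends, max_users=None):
    """Breadth-first crawl over follow links in the forward direction.

    seeds       -- iterable of starting users
    get_friends -- callable: user -> iterable of users that user follows
    max_users   -- optional cap on the number of users expanded
    Returns (visit_order, follow_edges).
    """
    seen = set()
    queue = deque()
    for s in seeds:
        if s not in seen:
            seen.add(s)
            queue.append(s)

    visit_order = []
    follow_edges = []
    while queue:
        if max_users is not None and len(visit_order) >= max_users:
            break
        user = queue.popleft()
        visit_order.append(user)
        for friend in get_friends(user):
            follow_edges.append((user, friend))  # user follows friend
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return visit_order, follow_edges


# Toy follow graph standing in for the live service.
toy = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
order, edges = crawl_bfs(["a"], lambda u: toy.get(u, []))
```

Because only forward links are followed, users that nobody in the crawl frontier follows are never reached, which biases the sample toward well-connected accounts.
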
Analysis of The Graph
Link Distributions
Follow Edges
– Power-law distribution
– Two abnormal spikes in the outlink distribution
    20 friends
    – Twitter provides an initial set of 20 “recommended” users to follow
    2000 friends
    – Twitter places restrictions on following more than 2000 users

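A rough sketch of how an outlink distribution and its power-law exponent could be inspected. Fitting a least-squares line in log-log space is a crude estimator chosen here for brevity; it is an assumption of this example, not the method used in the paper, and the function names are illustrative.

```python
import math
from collections import Counter


def outlink_histogram(follow_edges):
    """Map 'number of friends' -> 'number of users with that many friends'.

    follow_edges -- iterable of unique (follower, followed) pairs
    """
    friends_per_user = Counter(src for src, _dst in follow_edges)
    return Counter(friends_per_user.values())


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(degree)."""
    points = [
        (math.log(degree), math.log(count))
        for degree, count in histogram.items()
        if degree > 0 and count > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


hist = outlink_histogram([("a", "b"), ("a", "c"), ("b", "c")])
```

Spikes such as the ones at 20 and 2000 friends would show up as histogram entries far above the fitted line at those degrees.
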
Analysis of The Graph
Link Distributions
Retweet Edges
– Retweet inlink
    Power-law distribution
– Retweet outlink
    Does not follow a power-law distribution
– While the number of friends one has is generally power-law, the number of users one finds truly interesting does not appear to scale in a similar fashion

Analysis of The Graph
Link Distributions
Posting Frequency
– 417,613 users who published at least one tweet
– Most recent 200 posts per user
– 58,000 users published only a single post during the month
– A large number of users wrote more than 100 posts

Analysis of The Graph
Graph Formation
Readers and Writers
– Three potential scenarios
    A user acts primarily as a reader
    – Few or no posts
    A user frequently retweets posts
    – Writes little to no original content
    A user contributes significant new content
– User’s reading and writing behavior
    Each dot : unique user
    X-axis : # of posts published by friends
    Y-axis : # of posts published by user
    Shade : originality
    – The lighter shades indicate less originality
    Size : PageRank of each user (based on follow edges)

Analysis of The Graph
Graph Formation
General trend
– For users who post very frequently
    A larger fraction of their posts are actually retweets
– Many users retweeted at least one post which they did not read from one of their friends
    Despite the explicit friendship links available in the site structure, it is still not possible to know exactly what a user reads
– Many websites are adding modules which display Twitter results

Outline
Introduction
Modeling Twitter
Analysis of The Graph
Exploring Link Semantics
– Retweet vs. Follow based Ranking
– Link Virality
Experiments on Link Semantics
Conclusion

Exploring Link Semantics
Web graph
– A link from page a to page b
    Endorsement of the quality of page b
    Extent of its relevance to page a
Twitter graph
– Follow link
    Endorsement of quality or interest
    The actual semantics of the link
    – User a, acting as a reader, is interested in user b acting as a writer
– Retweet link
    Endorsement of quality
    – User is interested in the topic
    – User expects his readers to be interested in this post
    Retweet edge signifies a connection from user a as a writer to user b as a writer

Exploring Link Semantics
Retweet vs. Follow based Ranking
PageRank based on two edge types
– Retweet-based
    Simple power-law distribution
– Follow-based
    Two different segments with different power-law coefficients

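To make the comparison concrete, the sketch below runs a plain power-iteration PageRank once over follow edges and once over retweet edges of the same user set. The damping factor of 0.85, the uniform redistribution of rank from users with no outgoing edges, and the toy data are assumptions of this example; the slides do not state the parameters used.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over a directed edge list.

    nodes -- iterable of node ids (every edge endpoint must be listed)
    edges -- iterable of (src, dst) pairs
    Returns a dict node -> score; the scores sum to 1.
    """
    nodes = list(nodes)
    n = len(nodes)
    out = {u: set() for u in nodes}
    for src, dst in edges:
        out[src].add(dst)

    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outlinks is spread uniformly (assumption).
        dangling = sum(rank[u] for u in nodes if not out[u])
        nxt = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
        rank = nxt
    return rank


users = ["a", "b", "c"]
follow_rank = pagerank(users, [("a", "c"), ("b", "c")])   # both follow c
retweet_rank = pagerank(users, [("c", "a")])              # c retweets a
```

In this toy graph user c leads the follow-based ranking while user a leads the retweet-based one, the kind of disagreement between the two rankings that the following slides discuss.
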
Exploring Link Semantics
Retweet vs. Follow based Ranking
PageRank over Retweet links vs. Follow links
– Follow links
    Twitter recommended celebrities (barackobama)
    – Rich get richer phenomenon
    Top ranker has lower rank in RT-based PageRank
– Retweet links
    Tweetmeme
    – Social bookmarking site
    Top ranker has lower rank in Follow-based PageRank

Exploring Link Semantics
Retweet vs. Follow based Ranking
Follow-based
– Public figures or celebrities
Retweet-based
– News generating entities
Aplusk is the only user who appears in the top 10 for both rankings
These ranks can be affected by spam or marketing techniques
– ddlovatoRT simply retweets all posts mentioning Demi Lovato
– Twitter’s research team estimates that less than 1% of Tweets are now spam

Exploring Link Semantics
Link Virality
Retweet Virality
– (formula not preserved)
Follow Virality
– (formula not preserved)
– RoF(u) : the set of users from whom u has seen at least one post via a retweet
– FoF(u) : the set of all users who are reachable by traversing exactly two directed follow edges
– Fr(u) : the set of users whom user u follows
Retweet Virality is consistently higher than Follow Virality
– Retweets demonstrate a stronger notion of importance or influence to users
– Users are more likely to follow people they see retweeted than those who are merely “Friends of Friends”

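The two virality formulas did not survive in this transcript, so the sketch below is only one plausible reading of the definitions of RoF(u), FoF(u) and Fr(u): virality is taken to be the fraction of a candidate set that u also follows. That ratio, the exclusion of u from its own candidate sets, and all function names are assumptions, not the paper's stated formulas.

```python
def friends_of_friends(u, follows):
    """FoF(u): users reachable from u over exactly two follow edges."""
    fof = set()
    for v in follows.get(u, set()):
        fof |= follows.get(v, set())
    fof.discard(u)  # assumption: u is not a candidate for itself
    return fof


def retweeted_by_friends(u, follows, retweets):
    """RoF(u): users whose posts u has seen via a retweet by someone u follows."""
    rof = set()
    for v in follows.get(u, set()):
        rof |= retweets.get(v, set())
    rof.discard(u)  # assumption: u is not a candidate for itself
    return rof


def virality(u, follows, candidates):
    """Assumed measure: share of the candidate set that u follows."""
    if not candidates:
        return 0.0
    return len(candidates & follows.get(u, set())) / len(candidates)


# follows[x] = Fr(x); retweets[x] = users whom x has retweeted.
follows = {"u": {"v", "w"}, "v": {"w", "x", "y"}, "w": set()}
retweets = {"v": {"w"}, "w": {"z"}}

follow_virality = virality("u", follows, friends_of_friends("u", follows))
retweet_virality = virality(
    "u", follows, retweeted_by_friends("u", follows, retweets)
)
```

Under this reading, a higher retweet virality means that the users someone is exposed to through retweets are followed more often than those who are merely two follow hops away.
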
Outline
Introduction
Modeling Twitter
Analysis of The Graph
Exploring Link Semantics
Experiments on Link Semantics
– Empirical Results
– Topic Sensitive PageRank
Conclusion

Experiments on Link Semantics
Topical relevance
– Follow links quickly diffuse into a broad range of topics
– Retweet links remain more concentrated on the original topic
Data
– 1.1 million users
– 273 million follow edges
– 2.9 million retweet edges

Experiments on Link Semantics
Empirical Results
Empirical evaluation
– Starting from a seed set of users
    Members of the same topical list
    – photography and design
– Generate two sets of users
    At least one seed member follows them
    At least one seed member has retweeted one of their posts
– Random sample of 25 users from each of these sets
– Manually assessed them for topical relevance
Result
– # of relevant users in the follow-generated samples: 4 and 5
– # of relevant users in the retweet-generated samples: 19 and 20

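The candidate generation and sampling steps above can be sketched as follows. Removing the seed users from the candidate sets, the fixed random seed, and the function names are assumptions made for this illustration; the manual relevance assessment itself is not modeled.

```python
import random


def candidate_sets(seeds, follows, retweets):
    """Return (followed, retweeted) candidate sets for a seed list.

    follows[s]  -- set of users that seed s follows
    retweets[s] -- set of users whose posts seed s has retweeted
    """
    seeds = set(seeds)
    followed = set()
    retweeted = set()
    for s in seeds:
        followed |= follows.get(s, set())
        retweeted |= retweets.get(s, set())
    # Assumption: seed members themselves are not candidates.
    return followed - seeds, retweeted - seeds


def sample_users(candidates, k=25, rng_seed=0):
    """Draw up to k users uniformly at random, reproducibly."""
    pool = sorted(candidates)  # random.sample needs a sequence
    rng = random.Random(rng_seed)
    return rng.sample(pool, min(k, len(pool)))


follows = {"s1": {"a", "b", "s2"}, "s2": {"b", "c"}}
retweets = {"s1": {"c"}}
followed, retweeted = candidate_sets(["s1", "s2"], follows, retweets)
follow_sample = sample_users(followed)
retweet_sample = sample_users(retweeted)
```

Each sample would then be judged by hand for topical relevance, as in the counts reported on the slide.
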
Experiments on Link Semantics
Topic Sensitive PageRank
PageRank
– Recursive ranking formula
– A page is as important as the pages pointing to it
Topic Sensitive PageRank (TSPR) [1]
– Biased PageRank
    Generates query-specific importance scores for pages at query time
– We use topic sensitive PageRank to quantify the difference in topical relevance carried by follow and retweet links
[1] T. H. Haveliwala. Topic-sensitive PageRank. WWW 2002.

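A minimal sketch of a biased PageRank in which the random jump always lands on a seed user, which is the general idea behind topic sensitive PageRank. The damping factor, the choice to send rank from users without outgoing edges back to the seeds, and the names `topic_sensitive_pagerank` and `top_non_seed` are assumptions of this example, not parameters reported in the slides.

```python
def topic_sensitive_pagerank(nodes, edges, seeds, damping=0.85, iterations=100):
    """PageRank whose teleport distribution is uniform over the seed set.

    nodes -- iterable of node ids (every edge endpoint and seed must be listed)
    edges -- iterable of (src, dst) pairs
    seeds -- non-empty iterable of seed node ids
    """
    nodes = list(nodes)
    seeds = set(seeds)
    out = {u: set() for u in nodes}
    for src, dst in edges:
        out[src].add(dst)

    teleport = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Rank of nodes without outlinks returns to the seeds (assumption).
        dangling = sum(rank[u] for u in nodes if not out[u])
        jump = (1.0 - damping) + damping * dangling
        nxt = {u: jump * teleport[u] for u in nodes}
        for u in nodes:
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
        rank = nxt
    return rank


def top_non_seed(rank, seeds, k):
    """The k highest-scoring users that are not in the seed set."""
    seeds = set(seeds)
    ranked = sorted((u for u in rank if u not in seeds), key=lambda u: -rank[u])
    return ranked[:k]


rank = topic_sensitive_pagerank(
    nodes=["a", "b", "c", "d"], edges=[("a", "b"), ("b", "c")], seeds=["a"]
)
best = top_non_seed(rank, ["a"], 2)
```

Running the same function once on follow edges and once on retweet edges, from the same seed list, yields the two rankings that can then be compared for topical relevance.
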
Experiments on Link Semantics
Topic Sensitive PageRank
Experiments
– Beginning with a topical Twitter list
– Compute topic sensitive PageRank for
    Follow edges
    Retweet edges
– If the links carry the topicality well
    The high-ranking users are likely to be topically relevant to the original seed topic
– Evaluate the resulting highest ranked users for relevance to the original topic with a user survey

Experiments on Link Semantics
Topic Sensitive PageRank
Experimental Setup
– Collected 9 topical lists from listorious.com
    19 ~ 437 users
    – Average 155, median 49
    Seed users have an average of 14,284 followers
– Compute personalized PageRank
– Selected the 30 highest ranking non-seed users
– Conduct a survey
    Participants were shown a topic description and the 30 highest ranked users for either a follow-based or a retweet-based PageRank
    Ordered randomly
    Mixed with a random set of 10 of the seed users for that topic
    Make a binary judgment of each user’s relevance
    A total of 12 people participated in the survey
    Each list was evaluated by at least 2 people

Experiments on Link Semantics
Topic Sensitive PageRank
Accuracy of the highly ranked users
– Precision
    The average relevancy of a set of users
– Relevance
    The fraction of users who were judged relevant by at least one survey taker
– $R_k(U)$ : the set of users from U judged relevant in evaluation k of a particular list

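The two measures can be sketched directly from their verbal definitions. The exact formulas are not preserved in this transcript, so the averaging over evaluations in `precision` and the union over evaluations in `relevance` are assumptions drawn from the wording above; the function names and sample data are illustrative.

```python
def precision(users, evaluations):
    """Average, over evaluations k, of |R_k(U)| / |U| (assumed form).

    users       -- the set U of users shown to survey takers
    evaluations -- list of sets; evaluations[k] is R_k(U)
    """
    users = set(users)
    per_eval = [len(set(r_k) & users) / len(users) for r_k in evaluations]
    return sum(per_eval) / len(per_eval)


def relevance(users, evaluations):
    """Fraction of U judged relevant by at least one survey taker."""
    users = set(users)
    judged_relevant = set().union(*evaluations) & users
    return len(judged_relevant) / len(users)


U = {"a", "b", "c", "d"}
R = [{"a", "b"}, {"b", "c"}]  # two evaluations of the same list
p = precision(U, R)   # (2/4 + 2/4) / 2 = 0.5
r = relevance(U, R)   # |{a, b, c}| / 4 = 0.75
```

Relevance is never lower than precision under these definitions, since a user counted in any single evaluation is also counted in the union.
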
Experiments on Link Semantics
Topic Sensitive PageRank
Result
– Precision can be improved by simply using retweet links instead of follow links
    Precision of top ranked users improved by over 30%

Experiments on Link Semantics
Topic Sensitive PageRank
Cohesiveness of Seed
– To verify the seed users
    Include 10 randomly selected seed users for each evaluation
Result
– Average Precision : 0.931
    Minimum of 0.838
    Maximum of 1.0
– The seed users represented their topics well
– Our survey takers understood and agreed upon the topic definitions

Conclusion
We have described a detailed model of Twitter as a graph
– Key statistics about the graph
– Provided some initial insights as to how the graph forms
Important distinctions between edge types in the graph
– Follow and retweet
– The varying semantics and properties of these edges will have significant implications for graph algorithms such as PageRank
– Retweet edges preserve topical relevance
    Better than follow edges