Social Tag Prediction
Paul Heymann, Daniel Ramage, and Hector Garcia-Molina
Stanford University
SIGIR 2008
Outline
Introduction
Preliminaries
Dataset
Tag Prediction Using Page Information
Tag Prediction Using Tags
Conclusions
Introduction
Social tagging allows users to contribute metadata to large and dynamic corpora
Social tag prediction problem
– Given a set of objects and a set of tags, can we predict whether a given tag could/should be applied to a particular object?
Benefits of Predicting Social Tags
At a fundamental level, we gain insights into the “information content” of tags
– If tags are easy to predict from other content, they add little value
At a practical level, a tag predictor can enhance a social tagging site in a variety of ways
– Increase recall of single tag queries/feeds
– Inter-user agreement
– Tag disambiguation
– Bootstrapping
– System suggestion
Preliminaries
A post is a set of triples (t_i, u_j, o_k) indicating that a user u_j annotated object o_k with a set of tags
Imagining that a tag either does or does not describe an object, we define three relations:
– R_p = a set of (t, o) pairs where each pair means that tag t positively describes object o
– R_n = a set of (t, o) pairs where each pair means that tag t negatively describes (i.e., does not describe) object o
– R_a = a set of (t, u, o) triples where each triple means that user u annotated object o with tag t
T_100 = the 100 most frequent tags
Operators and Examples
Two standard relational algebra operators
– σ_c selects tuples from a relation where a particular condition c holds (WHERE in SQL)
– π_p projects a relation into a smaller number of attributes (SELECT in SQL)
Example: for a web page o_bagels about a downtown bagel shop and a web page o_pizza about a pizzeria,
R_p = (t_bagels, o_bagels), (t_shop, o_bagels), (t_downtown, o_bagels), (t_pizza, o_pizza), (t_pizzeria, o_pizza)
R_n = (t_pizzeria, o_bagels), (t_pizza, o_bagels), (t_bagels, o_pizza) …
π_t(σ_{o_bagels}(R_p)) = tags which positively describe o_bagels = (t_bagels, t_shop, t_downtown)
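The selection and projection operators can be tried out on the bagel/pizza example. The sketch below is only an illustration: it models a relation as a Python set of tuples, and the helper names `select` and `project` are hypothetical, not part of the original slides.

```python
# Relations are modeled as sets of tuples (an assumption for illustration).
# R_p holds (tag, object) pairs taken from the slide's example.
R_p = {
    ("bagels", "o_bagels"),
    ("shop", "o_bagels"),
    ("downtown", "o_bagels"),
    ("pizza", "o_pizza"),
    ("pizzeria", "o_pizza"),
}


def select(relation, condition):
    """sigma_c: keep only the tuples for which condition(row) is true."""
    return {row for row in relation if condition(row)}


def project(relation, *positions):
    """pi_p: keep only the attributes at the given tuple positions."""
    return {tuple(row[i] for i in positions) for row in relation}


# pi_t(sigma_{o_bagels}(R_p)): the tags that positively describe o_bagels.
about_bagels = select(R_p, lambda row: row[1] == "o_bagels")
tags_for_bagels = {tag for (tag,) in project(about_bagels, 0)}
print(sorted(tags_for_bagels))
```

Because the relations are sets, projection removes duplicates automatically, which matches the set semantics of relational algebra.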
Dataset
The base: Stanford Tag Crawl Dataset
– Gathered from del.icio.us
– Consists of 2,549,282 unique URLs with their posts
– Anchor text and link information for each URL
Experimental dataset construction
– Aiming to approximate R_p and R_n
– Assume that if (t_i, o_k) ∈ π_(t,o)(R_a) then (t_i, o_k) ∈ R_p
    – The reverse is not true
– Filter the dataset by postcount(o_k) = |π_u(σ_{o_k}(R_a))|
    – Assume that as postcount(o_k) increases, R_p is better approximated by R_a
    – Filtering threshold = 100
– 62,000 URLs in the filtered set
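The postcount filter can be sketched as follows. This is a minimal illustration that assumes R_a is available as a set of (tag, user, object) triples. The function names, the toy triples, and the use of "at least the threshold" rather than "strictly above it" are all assumptions, since the slides do not say which comparison was used.

```python
from collections import defaultdict


def postcount(R_a):
    """Map each object o to |pi_u(sigma_o(R_a))|: the number of distinct users who annotated it."""
    users_by_object = defaultdict(set)
    for tag, user, obj in R_a:
        users_by_object[obj].add(user)
    return {obj: len(users) for obj, users in users_by_object.items()}


def filter_by_postcount(R_a, threshold=100):
    """Keep only triples whose object has postcount >= threshold (the comparison is an assumption)."""
    counts = postcount(R_a)
    return {(t, u, o) for (t, u, o) in R_a if counts[o] >= threshold}


# Toy data, made up for illustration; a tiny threshold keeps the example readable.
R_a = {
    ("python", "u1", "o1"),
    ("code", "u1", "o1"),
    ("python", "u2", "o1"),
    ("news", "u3", "o2"),
}
kept = filter_by_postcount(R_a, threshold=2)
print(sorted(kept))
```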
Probability of Adding “New” Tags
Figure: Average new tags (in T_100) versus number of posts
Comparison between Popular Tags
Table: The top/bottom tags in T_100 to be added after the 100th bookmark. The top 15 tags are relatively ambiguous and personal.
Features for SVM
Page text features
– Bag of words
Anchor text
– Bag of words
– Text within 15 words of inlinks to the URL
– Use only URLs with at least 100 inlinks as examples
Surrounding hosts
– Hosts/domains of backlinks
– Hosts/domains of the URL
– Hosts/domains of forward links
For each feature type, the top 1000 features selected by mutual information are used
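Feature selection by mutual information can be sketched in a few lines. The version below assumes binary features (a word is present in a document or not) and a binary label (the tag applies or not). The slides do not specify the exact formulation used, so the function names and the toy documents are illustrative only.

```python
import math
from collections import Counter


def mutual_information(docs, labels, feature):
    """Mutual information (in bits) between presence of `feature` and the binary label."""
    n = len(docs)
    joint = Counter((feature in doc, label) for doc, label in zip(docs, labels))
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (xx, _), c in joint.items() if xx == x) / n
        p_y = sum(c for (_, yy), c in joint.items() if yy == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def top_features(docs, labels, k=1000):
    """Return the k features with the highest mutual information (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda f: (-mutual_information(docs, labels, f), f))
    return ranked[:k]


# Toy bag-of-words documents; the label says whether a hypothetical tag "bagels" applies.
docs = [{"bagel", "shop"}, {"bagel", "fresh"}, {"pizza", "shop"}, {"pizza", "oven"}]
labels = [True, True, False, False]
print(top_features(docs, labels, k=2))
```

In this toy set, "shop" appears equally often with and without the tag, so it carries no information and ranks last.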
Experiment Setup
Binary tag classification by SVM for T_100
– SVMlight and SVMperf with a linear kernel
Data splits
– Full/Full: 11/16 of the positive/negative examples for training and the rest for testing
    – Evaluated by precision-recall break-even point (PRBEP) instead of accuracy
– 200/200: randomly select 200 positive/negative examples for training and the same for testing
    – Evaluated by accuracy
    – Provided as an imperfect indication of how predictable a tag is due to its “information content” rather than the distribution of examples in the system
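The precision-recall break-even point is the point where precision equals recall. One common way to compute it, sketched below, is to rank the test examples by classifier score and take the precision of the top k, where k is the number of positive examples: at that cutoff precision and recall share the same numerator and denominator. The function name and the toy scores are assumptions for illustration, not the authors' evaluation code.

```python
def prbep(scores, labels):
    """Precision among the top-k scored examples, with k = number of positives.

    At this cutoff precision and recall are equal, which is the break-even point.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("PRBEP is undefined without positive examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = sum(label for _, label in ranked[:n_pos])
    return true_pos / n_pos


# Toy classifier scores and gold labels (1 = the tag applies), made up for illustration.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
print(prbep(scores, labels))
```

Unlike accuracy, this measure does not reward a classifier for always predicting the majority class, which matters when a tag is rare.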
Order of Predictability
Predictability = PRBEP (Full/Full) + Prec@10% (Full/Full) + Accuracy (200/200)
Figure: Tags in T_100 in increasing order of predictability from left to right.
Discussion
What precision can we get at the PRBEP?
– 60% for page text, 58% for anchor text, and 51% for surrounding hosts
– Much better than chance given a majority of tags in T_100 occur on less than 15% of documents
What precision can we get with low recall?
– 90% for all features and 92.5% for page text in Prec@10% (Full/Full)
Which page information is best for predicting tags?
– Page text > anchor text > surrounding hosts
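Prec@10% is read here as precision at 10% recall, which is an assumption based on the "low recall" question above. Under that reading, a sketch of the computation is to walk down the ranked list until the required share of positives has been found and report the precision at that rank. The function name and toy data are hypothetical.

```python
import math


def precision_at_recall(scores, labels, recall_level=0.10):
    """Precision at the first rank where recall reaches `recall_level`."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("recall is undefined without positive examples")
    # Number of positives that must be retrieved to reach the recall level.
    needed = max(1, math.ceil(recall_level * n_pos))
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos >= needed:
            return true_pos / rank


# Toy scores and labels, made up for illustration.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [0, 1, 1, 0, 1]
print(precision_at_recall(scores, labels, recall_level=0.5))
```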
What makes a tag predictable? (1/2)
Entropy measure:
What makes a tag predictable? (2/2)
Figure: Tag popularity is positively correlated with PRBEP in the Full/Full split
Tag Prediction Using Tags
Between about 30 and 50 percent of URLs posted to del.icio.us have only 1 or 2 bookmarks
– Recall for single tag queries will be low
The question: given a small number of tags, how much can we expand this set of tags in a high precision manner?
– Similar to market-basket data mining
    – A large set of items and a large set of baskets, each of which contains a small set of items
    – The goal is to find correlations between sets of items
– The baskets are URLs and the items are tags
Association Rules
Support: the number of baskets containing both X and Y
Confidence: P(Y|X) (How likely is Y given X?)
Interest: P(Y|X) - P(Y) (How much more common is X&Y than expected by chance?)
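The three measures can be computed directly from their definitions. The sketch below treats each basket as the set of tags on one URL. The function name `rule_stats` and the toy baskets are made up for illustration.

```python
def rule_stats(baskets, X, Y):
    """Return (support, confidence, interest) for the rule X -> Y.

    support    = number of baskets containing both X and Y
    confidence = P(Y | X)
    interest   = P(Y | X) - P(Y)
    """
    X, Y = frozenset(X), frozenset(Y)
    n = len(baskets)
    n_x = sum(1 for b in baskets if X <= b)
    n_y = sum(1 for b in baskets if Y <= b)
    support = sum(1 for b in baskets if (X | Y) <= b)
    confidence = support / n_x if n_x else 0.0
    interest = confidence - n_y / n
    return support, confidence, interest


# Toy baskets: each set holds the tags of one hypothetical URL.
baskets = [
    {"linux", "opensource", "software"},
    {"linux", "opensource"},
    {"linux", "howto"},
    {"software", "opensource"},
    {"news"},
]
support, confidence, interest = rule_stats(baskets, {"linux"}, {"opensource"})
print(support, round(confidence, 3), round(interest, 3))
```

A positive interest means Y is more likely when X is present than it is overall. An interest near zero means the rule says little beyond Y's base rate.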
Found Association Rules
Observed relations: type-of, various forms, translations, etc.
Found Association Rules
Random sampling of the top 8000 rules of length 3 or less
Simulation of Tag Expansion
About 50,000 URLs for training and 10,000 URLs for testing
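The slides do not describe the simulation procedure in detail, so the following is only a guess at its general shape: rules mined from the training URLs are applied to the few known tags of a test URL, and the added tags are checked against the tags the URL actually has. The rule format, the confidence cutoff, the function names, and the toy rules are all assumptions.

```python
def expand_tags(known_tags, rules, min_confidence=0.5):
    """Return tags suggested by rules whose antecedent is contained in known_tags.

    rules: iterable of (antecedent frozenset, consequent tag, confidence).
    Rules are applied in a single pass; suggested tags do not trigger further rules.
    """
    known = set(known_tags)
    added = set()
    for antecedent, consequent, confidence in rules:
        if confidence >= min_confidence and antecedent <= known and consequent not in known:
            added.add(consequent)
    return added


def expansion_precision(added, true_tags):
    """Fraction of added tags that really belong to the URL (None if nothing was added)."""
    if not added:
        return None
    return len(added & set(true_tags)) / len(added)


# Hypothetical rules and a hypothetical test URL, made up for illustration.
rules = [
    (frozenset({"linux"}), "opensource", 0.8),
    (frozenset({"linux"}), "ubuntu", 0.3),
    (frozenset({"linux", "howto"}), "tutorial", 0.6),
]
added = expand_tags({"linux", "howto"}, rules)
precision = expansion_precision(added, {"linux", "howto", "opensource", "reference"})
print(sorted(added), precision)
```

Raising `min_confidence` trades the number of added tags for precision, which is the trade-off the expansion question is about.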
Conclusions
Our tag prediction results suggest three insights:
– Many tags on the web do not contribute substantial additional information beyond page text, anchor text, and surrounding hosts.
– The predictability of a tag is negatively correlated with its entropy, when our classifiers are given balanced training data. When considering tags in their natural distributions, data sparsity issues tend to dominate.
– Association rules can increase recall on single tag queries. We found association rules linking languages, super/subconcepts, and other relationships.