Social Tag Prediction
Paul Heymann, Daniel Ramage, and Hector Garcia-Molina
Stanford University
SIGIR 2008


Outline

– Introduction
– Preliminaries
– Dataset
– Tag Prediction Using Page Information
– Tag Prediction Using Tags
– Conclusions

Introduction

Social tagging allows users to contribute metadata to large and dynamic corpora

Social tag prediction problem
– Given a set of objects and a set of tags, can we predict whether a given tag could/should be applied to a particular object?

Benefits of Predicting Social Tags

At a fundamental level, we gain insights into the “information content” of tags
– If tags are easy to predict from other content, they add little value

At a practical level, a tag predictor can enhance a social tagging site in a variety of ways
– Increase recall of single tag queries/feeds
– Inter-user agreement
– Tag disambiguation
– Bootstrapping
– System suggestion

Preliminaries

A post is a set of triples (t_i, u_j, o_k) indicating that a user u_j annotated object o_k with a set of tags

Assuming that a tag either does or does not describe an object, there are three relations:
– R_p = a set of (t, o) pairs where each pair means that tag t positively describes object o
– R_n = a set of (t, o) pairs where each pair means that tag t negatively describes (i.e., does not describe) object o
– R_a = a set of (t, u, o) triples where each triple means that user u annotated object o with tag t

T_100 = the 100 most frequent tags

Operators and Examples

Two standard relational algebra operators
– σ_c selects tuples from a relation where a particular condition c holds (WHERE in SQL)
– π_p projects a relation into a smaller number of attributes (SELECT in SQL)

Example: for a web page o_bagels about a downtown bagel shop and a web page o_pizza about a pizzeria,

R_p = (t_bagels, o_bagels), (t_shop, o_bagels), (t_downtown, o_bagels), (t_pizza, o_pizza), (t_pizzeria, o_pizza)

R_n = (t_pizzeria, o_bagels), (t_pizza, o_bagels), (t_bagels, o_pizza), …

π_t(σ_{o_bagels}(R_p)) = tags which positively describe o_bagels
= (t_bagels, t_shop, t_downtown)
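The selection and projection example can be tried out directly on a small in-memory relation. The following is only a sketch: it models R_p as a Python set of (tag, object) tuples, and the helper names `select`, `project`, and `tags_for` are illustrative assumptions, not part of the talk.

```python
# R_p modeled as a set of (tag, object) tuples (toy data from the example).
R_p = {
    ("bagels", "o_bagels"),
    ("shop", "o_bagels"),
    ("downtown", "o_bagels"),
    ("pizza", "o_pizza"),
    ("pizzeria", "o_pizza"),
}


def select(relation, condition):
    """sigma_c: keep only the tuples for which condition(tuple) holds."""
    return {row for row in relation if condition(row)}


def project(relation, *indices):
    """pi_p: keep only the attributes at the given tuple positions."""
    return {tuple(row[i] for i in indices) for row in relation}


def tags_for(relation, obj):
    """pi_t(sigma_obj(relation)): tags that describe the given object."""
    selected = select(relation, lambda row: row[1] == obj)
    return {tag for (tag,) in project(selected, 0)}


bagel_tags = tags_for(R_p, "o_bagels")
print(sorted(bagel_tags))
```

Because relations are sets, projection removes duplicate rows automatically, which matches the set semantics of relational algebra.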

Dataset

The base: Stanford Tag Crawl Dataset
– Gathered from del.icio.us
– Consists of 2,549,282 unique URLs with their posts
– Anchor text and link information for each URL

Experimental dataset construction
– Aiming to approximate R_p and R_n
– Assume that if (t_i, o_k) ∈ π_(t,o)(R_a) then (t_i, o_k) ∈ R_p
   The reverse is not true
– Filter the dataset by postcount(o_k) = |π_u(σ_{o_k}(R_a))|
   Assume that as postcount(o_k) increases, R_p is approximated increasingly well by R_a
   Filtering threshold = 100
– 62,000 URLs in the filtered set
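The postcount filter can be sketched as follows, assuming R_a is held in memory as a set of (tag, user, object) triples. The function names, the toy data, and the choice of "greater than or equal to" for the threshold comparison are assumptions made for illustration; the slide only states that the threshold is 100.

```python
from collections import defaultdict


def postcount(R_a):
    """Map each object o to |pi_u(sigma_o(R_a))|, its number of distinct users."""
    users = defaultdict(set)
    for tag, user, obj in R_a:
        users[obj].add(user)
    return {obj: len(us) for obj, us in users.items()}


def filter_by_postcount(R_a, threshold=100):
    """Keep only annotations of objects with at least `threshold` posts
    (inclusive comparison is an assumption)."""
    counts = postcount(R_a)
    return {(t, u, o) for (t, u, o) in R_a if counts[o] >= threshold}


def approx_positive(R_a):
    """pi_(t,o)(R_a): the (tag, object) pairs used to approximate R_p."""
    return {(t, o) for (t, u, o) in R_a}


# Toy data with a threshold of 2 instead of 100.
toy_R_a = {
    ("news", "u1", "o1"),
    ("blog", "u2", "o1"),
    ("news", "u2", "o1"),
    ("food", "u1", "o2"),
}
filtered = filter_by_postcount(toy_R_a, threshold=2)
print(sorted(approx_positive(filtered)))
```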

Probability of Adding “New” Tags

Figure: Average new tags (in T_100) versus number of posts

Comparison between Popular Tags

Table: The top/bottom tags in T_100 to be added after the 100th bookmark. The top 15 tags are relatively ambiguous and personal.

Tag Prediction Using Page Information

Features for SVM

Page text features
– Bag of words

Anchor text
– Bag of words
– Text within 15 words of inlinks to the URL
– Use only URLs with at least 100 inlinks as examples

Surrounding hosts
– Hosts/domains of backlinks
– Hosts/domains of the URL
– Hosts/domains of forward links

For each feature type, the top 1000 features selected by mutual information are used
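Feature selection by mutual information can be illustrated with a small sketch. It assumes binary features (a word is either present in a document or not) and a binary label (the tag applies or not), and it measures mutual information in bits. The function names and the toy documents are hypothetical, and the exact estimator used in the original experiments is not stated on the slide.

```python
import math
from collections import Counter


def mutual_information(docs, labels, feature):
    """Mutual information (in bits) between presence of `feature` and the label.

    docs is a list of sets of features; labels is a parallel list of booleans.
    """
    n = len(docs)
    joint = Counter((feature in doc, label) for doc, label in zip(docs, labels))
    mi = 0.0
    for (f, y), count in joint.items():
        p_fy = count / n
        p_f = sum(c for (ff, _), c in joint.items() if ff == f) / n
        p_y = sum(c for (_, yy), c in joint.items() if yy == y) / n
        mi += p_fy * math.log2(p_fy / (p_f * p_y))
    return mi


def top_k_features(docs, labels, k=1000):
    """Return the k features with the highest mutual information (ties by name)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: (-mutual_information(docs, labels, f), f))
    return ranked[:k]


# Toy example: does the tag "programming" apply to each document?
toy_docs = [
    {"python", "code"},
    {"python", "snake"},
    {"recipe", "food"},
    {"food", "code"},
]
toy_labels = [True, True, False, False]
print(top_k_features(toy_docs, toy_labels, k=2))
```

In this toy data, "python" and "food" each split the documents exactly along the label, so they rank highest, while "code" appears equally often in both classes and carries no information.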

Experiment Setup

Binary tag classification by SVM for T_100
– SVMlight and SVMperf with a linear kernel

Data splits
– Full/Full: 11/16 of the positive/negative examples for training and the rest for testing
   Evaluated by precision-recall break-even point (PRBEP) instead of accuracy
– 200/200: randomly select 200 positive/negative examples for training and the same for testing
   Evaluated by accuracy
   Provided as an imperfect indication of how predictable a tag is due to its “information content” rather than the distribution of examples in the system
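For readers unfamiliar with PRBEP, the sketch below shows one common way to compute it: rank the test examples by classifier score and take the precision among the top k, where k is the number of positive examples. At that cutoff precision and recall are equal by construction. The function name and the toy scores are assumptions for illustration; this is not the evaluation code used in the experiments.

```python
def prbep(scores, labels):
    """Precision-recall break-even point for one binary classifier.

    scores: classifier scores, higher means more likely positive.
    labels: parallel list of 0/1 (or bool) ground-truth labels.
    """
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        return 0.0
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Among the top n_pos examples, hits / n_pos is both precision and recall.
    hits = sum(1 for _, y in ranked[:n_pos] if y)
    return hits / n_pos


toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
toy_truth = [1, 0, 1, 1, 0, 0]
print(prbep(toy_scores, toy_truth))
```

Unlike accuracy, this measure is not inflated when a tag is rare: a classifier that always answers "no" gets high accuracy on a tag that occurs on few documents, but a PRBEP of zero.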

Order of Predictability

Predictability = PRBEP (Full/Full) + Prec@10% (Full/Full) + Accuracy (200/200)

Figure: Tags in T_100 in increasing order of predictability from left to right.

Discussions

What precision can we get at the PRBEP?
– 60% for page text, 58% for anchor text, and 51% for surrounding hosts
– Much better than chance given a majority of tags in T_100 occur on less than 15% of documents

What precision can we get with low recall?
– 90% for all features and 92.5% for page text in Prec@10% (Full/Full)

Which page information is best for predicting tags?
– Page text > anchor text > surrounding hosts

What makes a tag predictable? (1/2)

Entropy measure: [formula not recoverable from the source]

What makes a tag predictable? (2/2)

Figure: Tag popularity is positively correlated with PRBEP in the Full/Full split

Tag Prediction Using Tags

Between about 30 and 50 percent of URLs posted to del.icio.us have only 1 or 2 bookmarks
– Recall for single tag queries will be low

The question: given a small number of tags, how much can we expand this set of tags in a high precision manner?
– Similar to market-basket data mining
   A large set of items and a large set of baskets, each of which contains a small set of items
   The goal is to find correlations between sets of items
– The baskets are URLs and the items are tags

Association Rules

Support: the number of baskets containing both X and Y

Confidence: P(Y | X) (How likely is Y given X?)

Interest: P(Y | X) − P(Y) (How much more common is X & Y than expected by chance?)
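The three measures can be computed directly from a list of baskets. The sketch below treats each basket as the set of tags on one URL; the function name `rule_stats` and the toy baskets are hypothetical, and probabilities are estimated as simple relative frequencies.

```python
def rule_stats(baskets, X, Y):
    """Support, confidence, and interest of the rule X -> Y.

    baskets: list of sets of tags (one set per URL).
    X, Y: iterables of tags forming the rule's left and right sides.
    """
    X, Y = frozenset(X), frozenset(Y)
    n = len(baskets)
    n_x = sum(1 for b in baskets if X <= b)          # baskets containing X
    n_y = sum(1 for b in baskets if Y <= b)          # baskets containing Y
    n_xy = sum(1 for b in baskets if (X | Y) <= b)   # baskets containing both
    confidence = n_xy / n_x if n_x else 0.0          # estimate of P(Y | X)
    p_y = n_y / n if n else 0.0                      # estimate of P(Y)
    return {
        "support": n_xy,
        "confidence": confidence,
        "interest": confidence - p_y,
    }


toy_baskets = [
    {"linux", "ubuntu"},
    {"linux", "opensource"},
    {"ubuntu", "linux", "howto"},
    {"recipes"},
]
stats = rule_stats(toy_baskets, {"ubuntu"}, {"linux"})
print(stats)
```

In the toy data every URL tagged "ubuntu" is also tagged "linux", so the confidence is 1.0. The interest is lower (0.25) because "linux" is already common, appearing on three of the four URLs.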

Found Association Rules

Observed relations: type-of, various forms, translations, etc.

Random sampling of the top 8000 rules of length 3 or less

Simulation of Tag Expansion

About 50,000 URLs for training and 10,000 URLs for testing
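The slide does not spell out the expansion procedure, so the following is only one plausible sketch of how mined rules might be applied to a URL that has few tags: fire every rule whose left side is contained in the known tags and whose confidence meets a threshold. The function name, the single-pass behavior, the rule format, and the made-up confidence values are all assumptions.

```python
def expand_tags(tags, rules, min_confidence=0.5):
    """Expand a tag set with one pass over association rules.

    rules: iterable of (antecedent, consequent, confidence) triples, where
    antecedent and consequent are sets of tags (hypothetical format).
    """
    tags = set(tags)
    added = set()
    for antecedent, consequent, confidence in rules:
        # A rule fires only if it is confident enough and its whole
        # left side is already among the original tags.
        if confidence >= min_confidence and set(antecedent) <= tags:
            added |= set(consequent) - tags
    return tags | added


# Toy rules with invented confidence values.
toy_rules = [
    ({"ubuntu"}, {"linux"}, 0.9),
    ({"linux"}, {"opensource"}, 0.4),
    ({"recipes"}, {"food"}, 0.8),
]
expanded = expand_tags({"ubuntu"}, toy_rules, min_confidence=0.5)
print(sorted(expanded))
```

Raising `min_confidence` trades recall for precision: fewer rules fire, but the tags they add are more likely to be correct.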

Conclusions

Our tag prediction results suggest three insights:
– Many tags on the web do not contribute substantial additional information beyond page text, anchor text, and surrounding hosts.
– The predictability of a tag is negatively correlated with its entropy when our classifiers are given balanced training data. When considering tags in their natural distributions, data sparsity issues tend to dominate.
– Association rules can increase recall on single tag queries. We found association rules linking languages, super/subconcepts, and other relationships.
