Slides for the paper presentation at CLEF 2011. Amsterdam, The Netherlands
Filter keywords and majority class strategies for company name
disambiguation on Twitter
Damiano Spina, Enrique Amigó and Julio Gonzalo
{damiano,enrique,julio}@lsi.uned.es
UNED NLP & IR Group
CLEF 2011 Conference September 19-22, Amsterdam
Goal
• Two signals coming from intuition:
– Filter keywords
– Majority Class
• Do they help characterize and solve the problem?
WePS-3 Online Reputation Management Task
Tweets for query «jaguar»
• related tweets = 8
• unrelated tweets = 2
• Related ratio = 8/(8+2) = 0.8
Tweets for query «orange»
• related tweets = 0
• unrelated tweets = 10
• Related ratio = 0
Tweets for query «apple»
• related tweets = 5
• unrelated tweets = 5
• Related ratio = 0.5
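The related ratio in these three examples is the share of a query's tweets annotated as related. A minimal sketch, assuming the annotations are available as per-query counts (the function name `related_ratio` is illustrative and does not come from the slides):

```python
def related_ratio(related: int, unrelated: int) -> float:
    """Fraction of a query's tweets that are annotated as related."""
    total = related + unrelated
    if total == 0:
        raise ValueError("no annotated tweets for this query")
    return related / total


# Counts taken from the three example queries on the slides.
examples = {"jaguar": (8, 2), "orange": (0, 10), "apple": (5, 5)}
ratios = {query: related_ratio(r, u) for query, (r, u) in examples.items()}
```

A ratio near 0 or 1 marks a skewed test case, and a ratio near 0.5 marks a balanced one.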
Fingerprint representation
WePS-3 Task 2 Systems
Filter keywords
Tweets for query «apple»
• positive keyword: store
– 4 tweets annotated as «related»
• negative keyword: eating
– 2 tweets annotated as «unrelated»
• Accuracy = 1.0
• Recall = 60%
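The «apple» figures can be reproduced with a small sketch. It reads recall as coverage (the share of tweets matched by at least one filter keyword) and measures accuracy over the covered tweets only, which is consistent with 6 of 10 tweets being covered on the slide. The tweet texts and the function name `evaluate_filter_keywords` are hypothetical; only the counts come from the slide.

```python
def evaluate_filter_keywords(tweets, positive, negative):
    """Return (accuracy, recall) of keyword-based labelling.

    tweets: list of (text, gold_label) pairs, gold_label being
    "related" or "unrelated".
    Recall is read here as coverage: the share of tweets matched by a keyword.
    Accuracy is computed over the covered tweets only.
    """
    covered = correct = 0
    for text, gold in tweets:
        tokens = set(text.lower().split())
        if tokens & positive:        # assumption: positive keywords win ties
            predicted = "related"
        elif tokens & negative:
            predicted = "unrelated"
        else:
            continue                 # no filter keyword: tweet stays unclassified
        covered += 1
        correct += predicted == gold
    accuracy = correct / covered if covered else 0.0
    recall = covered / len(tweets) if tweets else 0.0
    return accuracy, recall


# Hypothetical tweets that mirror the counts on the slide (5 related, 5 unrelated).
tweets = (
    [("new apple store opening", "related")] * 4
    + [("apple keynote tonight", "related")]
    + [("eating an apple", "unrelated")] * 2
    + [("apple pie recipe", "unrelated")] * 3
)
accuracy, recall = evaluate_filter_keywords(tweets, {"store"}, {"eating"})
```

With these inputs the two keywords cover six tweets and label all six correctly, so the function returns an accuracy of 1.0 and a recall of 0.6.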
Manual keywords (perfect for a Web user)
Company name | Positive keywords | Negative keywords
amazon | electronics, books, apparel, computers, buy | river, rainforest, deforestation, bolivian, brazilian
fox | tv, broadcast, shows, episodes, fringe, bones | animal, terrier, hunting, volkswagen, racing
ford | motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric | tom, harrison, henry, glenn, gucci
Oracle keywords (perfect on Twitter)
Company name | Positive keywords | Negative keywords
amazon | sale, books, deal, deals, gift | followdaibosyu, pest, plug, brothers, pirotta
fox | money, weather, leader, denouncing, viewers | megan, matthew, lazy, valley, michael
ford | mustang, focus, hybrid, motor, truck | tom, harrison, rob, bring, coppola
Upper bound of Filter Keywords
Oracle keywords
– 5 oracle keywords ≈ 30% recall
– 20 oracle keywords ≈ 50% recall
Manual keywords
– ≈10 per company
– 14.61% recall (vs. 39.97% with 10 oracle keywords)
– 0.86 accuracy
Twitter ≠ Web
Majority Class
Tweets for query «jaguar»
• related tweets = 8
• unrelated tweets = 2
• Related ratio = 8/(8+2) = 0.8
• Accuracy = 0.80
• Recall = 100%
Upper bound of Majority Class
• For each test case / company name
– all unrelated or all related (winner-takes-all)
• Optimal decision
– 0.80 accuracy
– ≈ best manual system (0.83)
– > best automatic system (0.75)
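The optimal winner-takes-all decision can be sketched from the related ratio alone: labelling every tweet of a company with its majority class yields an accuracy of max(r, 1 − r). The code below is an illustration on the three example queries from the earlier slides. Macro-averaging over companies is an assumption, and the 0.80 figure on the slide refers to the full WePS-3 test set, not to these three queries.

```python
def winner_takes_all_accuracy(related_ratio: float) -> float:
    """Best accuracy when every tweet of a company gets the same label."""
    return max(related_ratio, 1.0 - related_ratio)


def upper_bound(ratios):
    """Macro-average of the per-company optimal accuracies (an assumption)."""
    return sum(winner_takes_all_accuracy(r) for r in ratios.values()) / len(ratios)


# Related ratios of the three example queries from the earlier slides.
ratios = {"jaguar": 0.8, "orange": 0.0, "apple": 0.5}
per_company = {name: winner_takes_all_accuracy(r) for name, r in ratios.items()}
average = upper_bound(ratios)
```

Skewed cases such as «orange» are solved perfectly by this decision, while balanced cases such as «apple» stay at chance level.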
Filter keywords + majority class upperbound
[Diagram: Tweets, Filter keywords (oracle or manual), Majority Class?]
(1) winner-takes-all
[Diagram: Tweets, Filter keywords (oracle or manual), Majority Class]
(2) winner-takes-remainder
[Diagram: Tweets, Majority Class, Filter keywords (oracle or manual)]
(3) bootstrapping
[Diagram: Tweets, Machine learning (training, application), Filter keywords (oracle or manual)]
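The slides name the three strategies without spelling them out. The sketch below gives one plausible reading of the first two, assuming the filter keywords have already labelled a seed subset of a company's tweets: winner-takes-all propagates the seed's majority class to every tweet, and winner-takes-remainder keeps the seed labels and assigns the majority class only to the uncovered tweets. All function names, the tie rule, and the seed data are hypothetical. Bootstrapping is omitted because it depends on the classifier trained on the seed.

```python
from collections import Counter


def majority(labels):
    """Most frequent seed label (assumption: ties go to 'related')."""
    counts = Counter(labels)
    return "related" if counts["related"] >= counts["unrelated"] else "unrelated"


def winner_takes_all(seed_labels, n_tweets):
    """Label every tweet of the company with the seed's majority class."""
    return [majority(seed_labels.values())] * n_tweets


def winner_takes_remainder(seed_labels, n_tweets):
    """Keep the seed labels and give the majority class to the other tweets."""
    winner = majority(seed_labels.values())
    return [seed_labels.get(i, winner) for i in range(n_tweets)]


# Hypothetical seed: tweet index -> label assigned by a filter keyword.
seed = {0: "related", 1: "related", 2: "unrelated"}
all_labels = winner_takes_all(seed, 5)
remainder_labels = winner_takes_remainder(seed, 5)
```

The two strategies differ only on the seed tweets themselves: here tweet 2 keeps its «unrelated» label under winner-takes-remainder but is overridden under winner-takes-all.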
Filter keywords + majority class
≈ ‘all related’ baseline
Filter keywords + majority class baseline
[Diagram: Terms, Keyword Classification, Filter keywords (automatic)]
• Automatic Discovery of Filter Keywords:
– 13 term features:
• 3 collection-based features
• 6 Web-based features
• 4 features expanded by co-occurrence
– 3 classification methods:
• Machine learning (neural net + all features)
• Heuristic (2 features: col_c_specificity + cooc_om_assoc)
• Hybrid (neural net + heuristic’s features)
Automatic Tweets Classification
[Bar chart: accuracy of the WePS-3 systems (manual), the WePS-3 systems (automatic), and the Filter keywords + Majority Class baseline; bar values: 0.83, 0.75, 0.73, 0.63, 0.56, 0.48]
Conclusions
• Fingerprint representation
– Behaviour of binary classification systems on skewed datasets
– Baselines independent of corpus
• Twitter ≠ Web
– Oracle keywords ≠ Manual keywords
• Filter keywords & majority class strategies
– Useful signals to help solve the problem
– Both signals alone already give competitive performance