Filter keywords and majority class strategies for company name disambiguation on Twitter Damiano Spina, Enrique Amigó and Julio Gonzalo {damiano,enrique,julio}@lsi.uned.es UNED NLP & IR Group CLEF 2011 Conference September 19-22, Amsterdam

Slides for the paper presentation at CLEF 2011. Amsterdam, The Netherlands

Filter keywords and majority class strategies for company name disambiguation on Twitter

Damiano Spina, Enrique Amigó and Julio Gonzalo

{damiano,enrique,julio}@lsi.uned.es

UNED NLP & IR Group

CLEF 2011 Conference September 19-22, Amsterdam

Goal

• Two signals coming from intuition:

– Filter keywords

– Majority Class

• Do they help characterize and solve the problem?

WePS-3 Online Reputation Management Task

Tweets for query «jaguar»

• related tweets = 8

• unrelated tweets = 2

• Related ratio = 8/(8+2) = 0.8

Tweets for query «orange»

• related tweets = 0

• unrelated tweets = 10

• Related ratio = 0

Tweets for query «apple»

• related tweets = 5

• unrelated tweets = 5

• Related ratio = 0.5
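
The related ratio used in the three examples above can be written as a one-line computation. The following is a minimal sketch; the function name `related_ratio` and the `examples` dictionary are illustrative and do not come from the slides.

```python
def related_ratio(related: int, unrelated: int) -> float:
    """Fraction of a company's annotated tweets that are related to it."""
    total = related + unrelated
    if total == 0:
        raise ValueError("no annotated tweets for this query")
    return related / total


# (related, unrelated) counts taken from the three example queries.
examples = {"jaguar": (8, 2), "orange": (0, 10), "apple": (5, 5)}
ratios = {name: related_ratio(r, u) for name, (r, u) in examples.items()}
print(ratios)
```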

Fingerprint representation

WePS-3 Task 2 Systems

Filter keywords

Tweets for query «apple»

• positive keyword: store

– 4 tweets annotated as «related»

• negative keyword: eating

– 2 tweets annotated as «unrelated»

• Accuracy = 1.0

• Recall = 60%
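
A minimal sketch of how the «apple» figures can be reproduced. It assumes that "recall" here means the fraction of tweets covered by at least one filter keyword, and that accuracy is measured only over the covered tweets. The ten tweets are invented for illustration (5 related, 5 unrelated, as in the «apple» example); they are not taken from the WePS-3 data, and the function names are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def classify_with_keywords(
    tweet: str, positive: Set[str], negative: Set[str]
) -> Optional[bool]:
    """True = related, False = unrelated, None = no keyword decides."""
    tokens = set(tweet.lower().split())
    has_pos = bool(tokens & positive)
    has_neg = bool(tokens & negative)
    if has_pos and not has_neg:
        return True
    if has_neg and not has_pos:
        return False
    return None  # uncovered, or conflicting keywords


def accuracy_and_recall(
    tweets: List[str], labels: List[bool], positive: Set[str], negative: Set[str]
) -> Tuple[float, float]:
    predictions = [classify_with_keywords(t, positive, negative) for t in tweets]
    covered = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    recall = len(covered) / len(labels)  # share of tweets a keyword covers
    accuracy = sum(p == y for p, y in covered) / len(covered) if covered else 0.0
    return accuracy, recall


# Invented toy tweets (assumption): 5 related, then 5 unrelated.
tweets = [
    "new apple store opening downtown",
    "queue at the apple store today",
    "apple store genius bar was helpful",
    "bought it at the apple store",
    "apple announces new iphone",
    "eating an apple for breakfast",
    "love eating apple pie",
    "apple juice is the best",
    "picked a red apple from the tree",
    "an apple a day",
]
labels = [True] * 5 + [False] * 5

accuracy, recall = accuracy_and_recall(tweets, labels, {"store"}, {"eating"})
print(accuracy, recall)
```

With these toy tweets, the keyword «store» covers four related tweets and «eating» covers two unrelated ones, so six of ten tweets are covered (recall 0.6) and all six are labelled correctly (accuracy 1.0).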

Manual keywords (perfect for a Web user)

• amazon

– Positive keywords: electronics, books, apparel, computers, buy

– Negative keywords: river, rainforest, deforestation, bolivian, brazilian

• fox

– Positive keywords: tv, broadcast, shows, episodes, fringe, bones

– Negative keywords: animal, terrier, hunting, volkswagen, racing

• ford

– Positive keywords: motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric

– Negative keywords: tom, harrison, henry, glenn, gucci

Oracle keywords (perfect on Twitter)

• amazon

– Positive keywords: sale, books, deal, deals, gift

– Negative keywords: followdaibosyu, pest, plug, brothers, pirotta

• fox

– Positive keywords: money, weather, leader, denouncing, viewers

– Negative keywords: megan, matthew, lazy, valley, michael

• ford

– Positive keywords: mustang, focus, hybrid, motor, truck

– Negative keywords: tom, harrison, rob, bring, coppola

Upper bound of Filter Keywords

Oracle keywords

– 5 oracle keywords ≈ 30% recall

– 20 oracle keywords ≈ 50% recall

Manual keywords

– ≈10 per company

– 14.61% recall (vs. 39.97% with 10 oracle keywords)

– 0.86 accuracy

Twitter ≠ Web
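
One way to picture how oracle keywords and their recall could be obtained is sketched below. The slides do not spell out the selection procedure, so this is an assumption: a term counts as an oracle keyword if it occurs in tweets of only one class, and the top k are chosen by the number of tweets they occur in. The function names and the five toy tweets are invented for illustration.

```python
from collections import Counter
from typing import List, Set, Tuple


def oracle_keywords(
    tweets: List[str], labels: List[bool], k: int
) -> Tuple[Set[str], Set[str]]:
    """Pick the k most frequent terms that occur in a single class only."""
    df_related: Counter = Counter()
    df_unrelated: Counter = Counter()
    for text, is_related in zip(tweets, labels):
        target = df_related if is_related else df_unrelated
        target.update(set(text.lower().split()))  # document frequency

    pure = {}
    for term, n in df_related.items():
        if term not in df_unrelated:
            pure[term] = (n, True)
    for term, n in df_unrelated.items():
        if term not in df_related:
            pure[term] = (n, False)

    # Most frequent first; alphabetical order breaks ties deterministically.
    ranked = sorted(pure.items(), key=lambda kv: (-kv[1][0], kv[0]))[:k]
    positive = {term for term, (_, related) in ranked if related}
    negative = {term for term, (_, related) in ranked if not related}
    return positive, negative


def coverage(tweets: List[str], keywords: Set[str]) -> float:
    """Share of tweets containing at least one keyword (recall in the slides' sense)."""
    hit = sum(1 for t in tweets if set(t.lower().split()) & keywords)
    return hit / len(tweets)


# Invented toy data (assumption), not WePS-3 tweets.
toy_tweets = [
    "apple store sale",
    "apple store open",
    "eating apple pie",
    "eating an apple",
    "apple news",
]
toy_labels = [True, True, False, False, True]

positive, negative = oracle_keywords(toy_tweets, toy_labels, k=2)
print(positive, negative, coverage(toy_tweets, positive | negative))
```

Because the keywords are chosen from the annotated tweets themselves, they are perfectly accurate by construction, which is why they serve as an upper bound rather than a usable system.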

Majority Class

Tweets for query «jaguar»

• related tweets = 8

• unrelated tweets = 2

• Related ratio = 8/(8+2) = 0.8

• Accuracy = 0.80

• Recall = 100%

Upper bound of Majority Class

winner-takes-all

• For each test case / company name

– all unrelated or all related

• Optimal decision

– 0.80 accuracy

– ≈ best manual system (0.83)

– > best automatic system (0.75)
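
The optimal winner-takes-all decision can be sketched as follows: for each company, label every tweet with whichever class is larger, so the per-company accuracy is max(ratio, 1 − ratio). Averaging per-company accuracies is an assumption of this sketch, and the three ratios are the ones from the «jaguar», «orange» and «apple» examples, not the full test set behind the 0.80 figure.

```python
from typing import Dict


def winner_takes_all_accuracy(related_ratio: float) -> float:
    """Accuracy when all tweets of a company get the majority class."""
    return max(related_ratio, 1.0 - related_ratio)


def upper_bound(ratios: Dict[str, float]) -> float:
    """Mean per-company accuracy under the optimal all-or-nothing decision."""
    scores = [winner_takes_all_accuracy(r) for r in ratios.values()]
    return sum(scores) / len(scores)


# Related ratios from the three example queries in the slides.
example_ratios = {"jaguar": 0.8, "orange": 0.0, "apple": 0.5}
print(upper_bound(example_ratios))
```

A balanced company such as «apple» contributes only 0.5, while a fully skewed one such as «orange» contributes 1.0, so the bound is high when most companies are skewed towards one class.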

Filter keywords + majority class upper bound

[Diagram labels: Tweets; Filter keywords (oracle or manual); Majority Class?]

(1) winner-takes-all

[Diagram labels: Tweets; Filter keywords (oracle or manual); Majority Class]

(2) winner-takes-remainder

[Diagram labels: Tweets; Majority Class; Filter keywords (oracle or manual)]
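
The first two strategies can be contrasted in a short sketch. It rests on one reading of the strategy names, which the slide text itself does not confirm: in winner-takes-all, the majority class among keyword-labelled tweets is assigned to every tweet of the company; in winner-takes-remainder, keyword-labelled tweets keep their label and only the uncovered tweets receive the majority class. Function names and the tie-breaking rule are assumptions.

```python
from typing import List, Optional


def majority_label(seed: List[Optional[bool]]) -> bool:
    """Majority class among keyword-labelled tweets (None = uncovered)."""
    votes = [s for s in seed if s is not None]
    # Ties and empty seeds default to "related": an arbitrary choice here.
    return sum(votes) * 2 >= len(votes)


def winner_takes_all(seed: List[Optional[bool]]) -> List[bool]:
    """Every tweet gets the majority class, overriding keyword labels."""
    winner = majority_label(seed)
    return [winner] * len(seed)


def winner_takes_remainder(seed: List[Optional[bool]]) -> List[bool]:
    """Keyword labels are kept; only uncovered tweets get the majority class."""
    winner = majority_label(seed)
    return [winner if s is None else s for s in seed]


# Keyword output shaped like the «apple» example:
# 4 tweets labelled related, 2 labelled unrelated, 4 left uncovered.
seed = [True] * 4 + [False] * 2 + [None] * 4
print(winner_takes_all(seed))
print(winner_takes_remainder(seed))
```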

(3) bootstrapping

[Diagram labels: Tweets; Filter keywords (oracle or manual); Machine learning (training, application)]
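
The bootstrapping idea, as the diagram labels suggest it, is that tweets labelled by filter keywords train a classifier, which is then applied to the remaining tweets. The sketch below uses a small bag-of-words Naive Bayes model as a stand-in; the slides do not say which learner is used, and the function names and toy tweets are invented for illustration.

```python
import math
from collections import Counter
from typing import List, Optional


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bootstrap(tweets: List[str], seed: List[Optional[bool]]) -> List[bool]:
    """Train on keyword-labelled tweets (seed), then label the uncovered ones."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, label in zip(tweets, seed):
        if label is not None:
            counts[label].update(tokenize(text))
            docs[label] += 1
    if docs[True] == 0 or docs[False] == 0:
        raise ValueError("need seed tweets from both classes")

    vocab = set(counts[True]) | set(counts[False])
    total_docs = docs[True] + docs[False]

    def log_score(text: str, label: bool) -> float:
        total = sum(counts[label].values())
        score = math.log(docs[label] / total_docs)  # class prior
        for tok in tokenize(text):
            if tok in vocab:  # words unseen in training are ignored
                # Laplace-smoothed word likelihood
                score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        return score

    return [
        label if label is not None
        else log_score(text, True) >= log_score(text, False)
        for text, label in zip(tweets, seed)
    ]


# Invented toy tweets (assumption); the last two are not covered by keywords.
toy = [
    "apple store opens",
    "visit the apple store",
    "eating an apple",
    "eating apple pie",
    "new apple store",
    "apple pie recipe",
]
toy_seed = [True, True, False, False, None, None]
print(bootstrap(toy, toy_seed))
```

Unlike the two majority-class strategies, this lets uncovered tweets of the same company end up in different classes, at the cost of depending on how representative the keyword-labelled seed is.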

Filter keywords + majority class

≈ ‘all related’ baseline

Filter keywords + majority class baseline

[Diagram labels: Terms; Keyword Classification; Filter keywords (automatic)]

• Automatic Discovery of Filter Keywords:

– 13 term features: 3 collection-based, 6 Web-based, 4 expanded by co-occurrence

– 3 classification methods: Machine learning (Neural net + all features); Heuristic (2 features: col_c_specificity + cooc_om_assoc); Hybrid (Neural net + heuristic’s features)

Automatic Tweets Classification

[Bar chart of accuracy. Series: WePS-3 systems (manual); WePS-3 systems (automatic); Filter keywords + Majority Class baseline. Values shown: 0.83, 0.75, 0.73, 0.63, 0.56, 0.48]

Conclusions

• Fingerprint representation

– Behaviour of binary classification systems on skewed datasets

– Baselines independent of corpus

• Twitter ≠ Web

– Oracle keywords ≠ Manual keywords

• Filter keywords & majority class strategies

– Useful signals to help solve the problem

– Both signals alone already give competitive performance

Filter keywords and majority class strategies for company name disambiguation on Twitter

CLEF 2011 Conference September 19-22, Amsterdam

Damiano Spina, Enrique Amigó and Julio Gonzalo

{damiano,enrique,julio}@lsi.uned.es

UNED NLP & IR Group