Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Leon Derczynski, Kalina Bontcheva, Ian Roberts
“I strongly recommend this paper”
“It is therefore a very useful resource”
“Impact of resources: 5; Overall recommendation: 5; Reviewer Confidence: 5”
wow
so review
very paper
much japan
Most of our language tech was trained on news
The bias is:
- middle class
- white
- working age
- educated
- male
- 1980s/1990s
- from the US
- journalist
- following AP guidelines
Your phone rewards you if you talk and write like the news
(and that's ok... sort of)
Photo © Michael Jang 1983

... and punishes you when you don't.
(not cool!)
The REAL problem:
Our studies have centred on a tiny, over-biased set of data.
There is no variation! (analyse some WSJ if you are not convinced...)
It's time to up our game; social media is a cheap & unprecedented resource
(e.g. Baldwin @ WNUT15; Hovy @ ACL15)
Social media is incredibly powerful:
- a sample of all global discourse
- warns of earthquakes
- sends fire engines
- predicts virus outbreaks (e.g. WNV)

Traditional tools have awful performance:
- Stanford NER: 40% F1
- single-topic recall 66%... cross-topic 33%
(see the sketch below)
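To make the degradation concrete, here is a minimal sketch of pointing an off-the-shelf, newswire-trained tagger at a tweet using NLTK's Stanford NER wrapper. The model and jar paths are assumptions (point them at your own Stanford NER download); the 40% F1 figure comes from the evaluation, not from this snippet.

    # A minimal sketch, not the evaluation pipeline from the paper.
    # Paths below are assumptions -- adjust to your Stanford NER download.
    from nltk.tag.stanford import StanfordNERTagger

    tagger = StanfordNERTagger(
        'english.all.3class.distsim.crf.ser.gz',  # newswire-trained CRF model
        'stanford-ner.jar',
    )

    # A newswire-style sentence vs. a real tweet: entities in the tweet
    # (like the show name "KKTNY") tend to come back tagged 'O'.
    print(tagger.tag('Barack Obama visited London today .'.split()))
    print(tagger.tag('KKTNY in 45 min !!!!!'.split()))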
What kind of entities do we find in social media?
High variety – ages quickly

     News                                      Tweets
PER  Politicians, business leaders,            Sportsmen, actors, TV personalities,
     journalists, celebrities                  celebrities, names of friends
LOC  Countries, cities, rivers, and other      Restaurants, bars, local landmarks/areas,
     places related to current affairs         cities, rarely countries
ORG  Public and private companies,             Bands, internet companies, sports clubs
     government organisations
Why a new corpus?
Existing ones are tiny, and hyperfocused.

Name       Tokens  Schema         Annotation        Notes
UMBC       7K      PLO            Crowd             Low IAA
Ritter     46K     Freebase       Expert, single    No IAA
Microsoft  12K     PLO + Product  ?                 Private
MSM        29K     PLO + Misc     Expert, multiple  No hashtags / usernames
What kind of variance do we see?

Temporal:
- concept drift over time
- daily cycles (work, family, socialising)
- weekly cycles
- time of year (seasonal behaviours)

Spatial:
- many different anglophone regions
- different surface forms in each
- different signifiers (LLC – Ltd. – DAC)

Social:
- WSJ readers and writers
- net celebrities
- TV characters
Corpus design:

Temporal:
- drawn over six years, from a Twitter archive
- selected over multiple temporal cycles

Spatial:
- spread over six anglophone regions: UK, US, IE, CA, NZ, AU

Social:
- general segment
- selection for news
- selection for commentary
Annotation problems

Workflow: crowdsourcing platform interfaces are a pain, and we're not in the USA, so no MTurk access.

Solution:
- GATE Crowdsourcing plugin: load corpus, set up task, add API key, launch job, done!
- Automatic result collection & alignment
- Even Java/Swing is prettier than MTurk's back end
Annotation problems

Task design: lots of training required, and many entity types.

Solution: brief instructions, a clean interface, and annotating just one entity type at a time
- pricey, but way better and, overall, quicker
Annotation problems

Annotator recall is a pretty serious problem.
People have limited knowledge and limited world experience.
Expert annotators are actually not good – we're desperately overfit.
Don't believe me? Who can explain this real document? "KKTNY in 45 min!!!!!"

Solution: ignore traditional IAA.
Pool the results for "max recall" – rare knowledge ≠ wrong knowledge (see the sketch below).
Post-solution: an expert adjudication step.
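A minimal sketch of what "max recall" pooling could look like: take the union of the entity spans proposed by all annotators, then hand the pooled set to an expert adjudicator. The span representation and exact-span deduplication here are illustrative assumptions, not the paper's precise procedure.

    # A minimal sketch of max-recall pooling. Assumption: spans are
    # (start, end, type) character-offset triples over the same document.
    from typing import List, Set, Tuple

    Span = Tuple[int, int, str]  # (start offset, end offset, entity type)

    def pool_max_recall(annotator_spans: List[Set[Span]]) -> Set[Span]:
        """Union every annotator's spans: rare knowledge != wrong knowledge."""
        pooled: Set[Span] = set()
        for spans in annotator_spans:
            pooled |= spans
        return pooled

    # Three crowd workers, each spotting different entities in one tweet:
    workers = [
        {(0, 5, "ORG")},                   # only one worker recognises the name
        {(9, 11, "LOC")},
        {(0, 5, "ORG"), (9, 11, "LOC")},
    ]
    candidates = pool_max_recall(workers)  # the expert then adjudicates these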
Annotation problems

The crowd can be pretty dumb. Not its fault – we gave it no education.
People need a precise idea of the task.

Solution 1: ensure workers get a good score on known data first.
Lace the text with gold data, for monitoring & feedback (sketched below).

Solution 2: keep the task focused (just one entity type).
Give instructions & examples.
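A minimal sketch of the gold-lacing idea, under assumed data structures: seed gold-labelled items into each worker's queue, score workers against them, and only keep workers above a threshold. The record layout and the 70% cut-off are assumptions for illustration.

    # A minimal sketch of monitoring workers via gold-seeded items.
    # The 0.7 threshold and the record layout are illustrative assumptions.
    from collections import defaultdict

    def score_workers(judgements, gold):
        """judgements: (worker_id, item_id, label) triples;
        gold: {item_id: label} for the seeded items only."""
        hits, seen = defaultdict(int), defaultdict(int)
        for worker, item, label in judgements:
            if item in gold:                  # only gold items are scored
                seen[worker] += 1
                hits[worker] += (label == gold[item])
        return {w: hits[w] / seen[w] for w in seen}

    def trusted(judgements, gold, threshold=0.7):
        """Workers accurate enough on gold to keep annotating."""
        return {w for w, acc in score_workers(judgements, gold).items()
                if acc >= threshold}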
Results – annotator quality

Experts are consistent, but don't get far.
The crowd is varied and inconsistent, but gets superior recall.
Remember, recall is the problem with social media!

Group   Recall over final annotations  F1 IAA
Expert  0.309                          0.835
Crowd   0.837                          0.350
Results: size

Name                        Tokens  Schema         Annotation        Notes
UMBC                        7K      PLO            Crowd             Low IAA
Ritter                      46K     Freebase       Expert, single    No IAA
Microsoft                   12K     PLO + Product  ?                 Private
MSM                         29K     PLO + Misc     Expert, multiple  No hashtags / usernames
BTC (Broad Twitter Corpus)  165K    PLO            Expert + Crowd    Source JSON available

Documents       9,551
Tokens        165,739
Person          5,271
Location        3,114
Organisation    3,732
Total          12,117
Results: diversity
[Figure: distribution by region]
Sorry Botswana, Bahamas, South Africa, Malta... looking forward to seeing you crowdsource!

Results: diversity
[Figure: distribution by year, and month]

Results: diversity
[Figure: distribution by day of month, weekday, and time of day]
Results: IAA

Adjudication is the agreement with max-recall.
Naïve is micro-averaged lenient match (sketched below).
Note that max-recall performs very well (according to the expert...)

Level         Adjudication  Naïve
Whole doc     0.839         n/a
Person        0.920         0.799
Location      0.963         0.861
Organisation  0.936         0.954
All           0.940         0.877
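For the naïve figure, here is a minimal sketch of what micro-averaged lenient matching could look like. The overlap rule (same type plus any character overlap) is an assumption about the slide's shorthand, not the paper's exact definition.

    # A minimal sketch of lenient (overlap-based) matching, micro-averaged.
    # Assumption: "lenient" = same entity type + any character overlap;
    # spans are (start, end, type) triples over one document collection.
    def overlaps(a, b):
        return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

    def lenient_f1(pred, gold):
        """Micro-averaged lenient F1 between two annotators' span sets."""
        if not pred or not gold:
            return 0.0
        p = sum(any(overlaps(s, g) for g in gold) for s in pred) / len(pred)
        r = sum(any(overlaps(g, s) for s in pred) for g in gold) / len(gold)
        return 2 * p * r / (p + r) if p + r else 0.0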
Results: popular surface forms

CoNLL is:
- ancient
- US- and int.rel.-centric
- about cricket???
Results: long tail steepness

Tail vs. head tells us something about diversity. If a few forms make up many mentions, the corpus is more boring:
- less variety (qualitative)
- harder to generalise about (maths!)

We bisect at the h-index point, and compare proportions (see the sketch below).
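A minimal sketch of the head/tail bisection, assuming the h-index point means the largest h such that at least h distinct surface forms each occur at least h times (the Hirsch-style definition); the mention counter at the bottom is a stand-in for however you read the corpus.

    # A minimal sketch of bisecting a surface-form distribution at its
    # h-index point. Assumption: h = largest h such that at least h
    # distinct forms each appear >= h times.
    from collections import Counter

    def h_index_point(counts: Counter) -> int:
        freqs = sorted(counts.values(), reverse=True)
        h = 0
        for i, f in enumerate(freqs, start=1):
            if f >= i:
                h = i
        return h

    def head_tail_proportions(counts: Counter):
        """Share of total mentions in the head (top-h forms) vs. the tail."""
        freqs = sorted(counts.values(), reverse=True)
        h = h_index_point(counts)
        total = sum(freqs)
        head = sum(freqs[:h])
        return head / total, (total - head) / total

    # e.g. counts = Counter(m.lower() for m in corpus_mentions)
    # a larger head share => a more repetitive, less diverse corpus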
Corpus distribution

Totally legal to give the source text; it's under 50K tweets.
Formats:
- JSON
- GATE docs
- CoNLL

All intermediate crowdsourcing data is included in the GATE docs.
Available before Dec 16.
To be extra sure, it's also available as "rehydratable standoff" (sketched below).
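"Rehydratable standoff" suggests shipping tweet IDs plus offset annotations and re-fetching the text yourself. A minimal sketch of consuming such a release, where fetch_tweet_text is a hypothetical stand-in for whatever Twitter API client you use, and the record layout is an assumption.

    # A minimal sketch of rehydrating a standoff release.
    # Assumptions: each record carries a tweet ID plus character-offset
    # entity annotations; fetch_tweet_text is a hypothetical stand-in
    # for your Twitter API client of choice.
    def rehydrate(records, fetch_tweet_text):
        for rec in records:
            text = fetch_tweet_text(rec["tweet_id"])
            if text is None:
                continue  # tweet deleted or protected; skip it
            yield {
                "text": text,
                "entities": [(text[s:e], etype)
                             for (s, e, etype) in rec["entities"]],
            }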
Thanks! And thank you everyone!
References: Alonso & Lease 2011; Balog et al. 2012; Bollacker et al. 2008; Bontcheva et al. 2013; Bontcheva et al. 2014a; Bontcheva et al. 2014b; Callison-Burch & Dredze 2010; Coppersmith et al. 2014; De Choudhury et al. 2013; Difallah et al. 2013; Eisenstein et al. 2010; Eisenstein 2013; Finin et al. 2010; Fromreide et al. 2014; Hovy 2010; Hovy et al. 2013; Hu et al. 2013; Kedzie et al. 2015; Kergl et al. 2014; Khanna et al. 2010; Liu et al. 2011; Lui & Baldwin 2012; Magdy & Elsayed 2016; Mascaro & Goggins 2012; Masud et al. 2010; Morris et al. 2012; Mostafa 2013; Neubig et al. 2011; O'Connor et al. 2010; Ritter et al. 2011; Rose et al. 2002; Rowe et al. 2013; Sabou et al. 2014; Tjong Kim Sang et al. 2003; Tufekci 2014; Tumasjan et al. 2010