Big data is having a disruptive impact across the sciences. Human annotation of semantic interpretation tasks is a critical part of big data semantics, but it is based on an antiquated ideal of a single correct truth that needs to be similarly disrupted. We expose seven myths about human annotation, most of which derive from that antiquated ideal of truth, and dispel these myths with examples from our research. We propose a new theory of truth, CrowdTruth, based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations.
Truth is a Lie. CrowdTruth:
The 7 Myths of Human Annotation
Lora Aroyo
Human annotation of semantic interpretation tasks is a critical part of cognitive systems engineering
– standard practice is based on the antiquated ideal of a single correct truth
– 7 myths of human annotation
– a new theory of truth: CrowdTruth
Take Home Message
I amar prestar aen...
• the amount of data & scale of computation available have increased by a previously inconceivable amount
• CS & AI have moved out of thought problems into empirical science
• current methods pre-date this fundamental shift
• the ideal of "one truth" is a lie
• crowdsourcing & semantics together correct the fallacy and improve analytic systems
The world has changed: there is a need to form a new theory of truth, appropriate to cognitive systems
Semantic interpretation is needed in all sciences
– data is abstracted into categories
– patterns, correlations, associations & implications are extracted
Cognitive Computing: providing some way of scalable semantic interpretation
Semantic Interpretation
• Humans analyze examples: annotations form the ground truth = the correct output for each example
• Machines learn from the examples
• Ground Truth Quality:
– measured by inter-annotator agreement
– founded on the ideal of a single, universally constant truth
– high agreement = high quality; disagreement must be eliminated
Traditional Human Annotation
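The traditional quality measure can be made concrete. Below is a minimal, illustrative Python sketch (not from the talk; all names and data are hypothetical) of the two standard statistics: raw percent agreement between two annotators, and Cohen's kappa, which corrects that figure for chance agreement.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators give the same label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    po = percent_agreement(a, b)
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    # chance agreement: probability both pick the same label independently
    pe = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Two hypothetical annotators labelling 6 sentences for a binary TREAT relation
ann1 = ["yes", "yes", "no", "yes", "no", "no"]
ann2 = ["yes", "no", "no", "yes", "no", "yes"]
print(percent_agreement(ann1, ann2))  # 4 matches out of 6 ≈ 0.667
print(cohens_kappa(ann1, ann2))       # chance-corrected ≈ 0.333
```

In the traditional view, a low kappa like this would be treated as a defect to be engineered away; the rest of the talk argues it is often signal.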
Current gold standard acquisition & quality evaluation are outdated
• Cognitive Computing increases the need for machines to handle data at scale
• this results in an increasing need for new gold standards able to measure machine performance on tasks that require semantic interpretation
Need for Change
The New Ground Truth is CrowdTruth
• One truth: data collection efforts assume one correct interpretation for every example
• All examples are created equal: ground truth treats all examples the same – they either match the correct result or not
• Detailed guidelines help: if examples cause disagreement, add instructions to limit interpretations
• Disagreement is bad: increase the quality of annotation data by reducing disagreement among the annotators
• One is enough: most annotated examples are evaluated by one person
• Experts are better: annotators with domain knowledge provide better annotations
• Once done, forever valid: annotations are not updated; new data is not aligned with previous data
7 Myths
These myths directly influence the practice of collecting human-annotated data; they need to be revisited in the context of a changing world & in the face of a new theory of truth (CrowdTruth)
current ground truth collection efforts assume one correct interpretation for every example
the ideal of one truth is a fallacy for semantic interpretation and needs to be changed
1. One Truth
What if there are MORE?
Cluster 1: passionate, rousing, confident, boisterous, rowdy
Cluster 2: rollicking, cheerful, fun, sweet, amiable, good-natured
Cluster 3: literate, poignant, wistful, bittersweet, autumnal, brooding
Cluster 4: humorous, silly, campy, quirky, whimsical, witty, wry
Cluster 5: aggressive, fiery, tense, anxious, intense, volatile, visceral
Other: does not fit into any of the 5 clusters
Choose one:
Which mood is most appropriate for each song?
one truth?
Results in:
(Lee and Hu 2012)
• typically annotators are asked whether a binary property holds for each example
• often they are not given a chance to say that the property may partially hold, or holds but is not clearly expressed
• the mathematics of using ground truth treats every example the same – it either matches the correct result or not
• poor-quality examples tend to generate high disagreement
disagreement allows us to weight sentences, giving the ability to train & evaluate a machine more flexibly
2. All Examples Are Created Equal
What if they are DIFFERENT?
ANTIBIOTICS are the first line treatment for indications of TYPHUS. With ANTIBIOTICS in short supply, DDT was used during World War II to control the insect vectors of TYPHUS.
clearly treats
disagreement can indicate vagueness & ambiguity of sentences
less clear treats
Is the TREAT relation expressed between the highlighted terms?
equal training data?
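One way to act on the difference between clear and vague sentences is to weight each training example by how much the crowd agrees on it, rather than forcing a single gold label. A minimal sketch (illustrative only; the actual CrowdTruth metrics use cosine similarity over annotation vectors, not this simple majority fraction):

```python
from collections import Counter

def sentence_clarity(labels):
    """Fraction of workers agreeing with the most popular annotation.
    High value = clear sentence; near 0.5 (binary case) = ambiguous."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical worker votes on whether the TREAT relation is expressed
clear_sentence = ["yes"] * 14 + ["no"] * 1   # "clearly treats"
vague_sentence = ["yes"] * 8 + ["no"] * 7    # "less clear treats"

w_clear = sentence_clarity(clear_sentence)   # 14/15 ≈ 0.93
w_vague = sentence_clarity(vague_sentence)   # 8/15 ≈ 0.53
# these scores can then scale each example's contribution to a training loss,
# so ambiguous sentences pull on the model less than unambiguous ones
```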
• Perfuming agreement scores by forcing annotators to make choices they may think are not valid
• Low annotator agreement is addressed with detailed guidelines for annotators to consistently handle the cases that generate disagreement
• Removing potential signal on examples that are ambiguous
precise annotation guidelines do eliminate disagreement but do not increase quality
3. Detailed Guidelines Help
What if they HURT?
disagreement can indicate problems with the task
Instructions: Your task is to listen to the following 30-second music clips and select the most appropriate mood cluster that represents the mood of the music. Try to think about the mood carried by the music and please try to ignore any lyrics. If you feel the music does not fit into any of the 5 clusters, please select "Other". The descriptions of the clusters are provided in the panel at the top of the page for your reference. Answer the questions carefully. Your work will not be accepted if your answers are inconsistent and/or incomplete.
Which mood cluster is most appropriate for a song?
restricting guidelines help? (Lee and Hu 2012)
• rather than accepting disagreement as a natural property of semantic interpretation,
• disagreement is traditionally considered a measure of poor quality because:
– the task is poorly defined, or
– annotators lack training
this makes the elimination of disagreement the GOAL
4. Disagreement is Bad
What if it is GOOD?
ANTIBIOTICS are the first line treatment for indications of TYPHUS. → agreement 95%
Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects. → agreement 80%
With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS. → agreement 50%
disagreement bad?
disagreement can reflect the degree of clarity in a sentence
Does each sentence express the TREAT relation?
• over 90% of annotated examples are seen by only 1-2 annotators
• a small number overlap, to measure agreement
five or six popular interpretations can't be captured by one or two people
5. One is Enough
What if it is NOT ENOUGH?
One Quality?
accumulated results for each relation across all the sentences
20 workers/sentence (and higher) yields the same relative disagreement
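Why one or two annotators are not enough: with a handful of workers, the observed majority fraction is a biased and noisy estimate of the population's actual split of interpretations. A small simulation (an illustrative assumption, not the talk's data: workers vote independently with a fixed population-level probability) shows the estimate settling only as the number of workers grows:

```python
import random

random.seed(0)

def simulate_agreement(p_yes, n_workers, n_trials=2000):
    """Average observed majority-vote fraction when n_workers each vote
    'yes' independently with probability p_yes."""
    total = 0.0
    for _ in range(n_trials):
        yes = sum(random.random() < p_yes for _ in range(n_workers))
        total += max(yes, n_workers - yes) / n_workers
    return total / n_trials

# A sentence where ~70% of the annotator population would answer 'yes':
# tiny panels systematically overestimate agreement (2 workers can only
# report 0.5 or 1.0); larger panels converge toward the true 0.70 split
for n in (2, 5, 10, 20, 40):
    print(n, round(simulate_agreement(0.7, n), 3))
```

This is consistent with the observation above that beyond roughly 20 workers per sentence, the relative disagreement stops changing.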
• conventional wisdom: human annotators with domain knowledge provide better annotated data, e.g. medical texts should be annotated by medical experts
• but experts are expensive & don't scale
multiple perspectives on data can be useful, beyond what experts believe is salient or correct
6. Experts Are Better
What if the CROWD IS BETTER?
experts better than crowd?
• 91% of expert annotations are covered by the crowd
• expert annotators reach agreement in only 30% of cases
• the most popular crowd vote covers 95% of this expert annotation agreement
What is the (medical) relation between the highlighted (medical) terms?
• perspectives change over time – old training data might contain examples that later become invalid or only partially valid
• continuous collection of training data over time allows the adaptation of gold standards to changing times, e.g. the popularity of music, or levels of education
7. Once Done, Forever Valid
What if VALIDITY CHANGES?
OSAMA BIN LADEN used money from his own construction company to support the MUJAHIDEEN in Afghanistan against Soviet forces.
forever valid? both types should be valid – two roles for the same entity
– adaptation of gold standards to changing times
1990: hero 2011: terrorist
Which are mentions of terrorists in this sentence?
crowdtruth.org (image: Jean-Marc Côté, 1899)
• annotator disagreement is signal, not noise
• it is indicative of the variation in human semantic interpretation of signs
• it can indicate ambiguity, vagueness, similarity, over-generality, as well as quality
http://crowd-watson.nl
The Team 2013
The Crew 2014
The (almost complete) Team 2014
crowdtruth.org
lora-aroyo.org slideshare.com/laroyo
@laroyo