Crowd Truth: Harnessing Disagreement in Crowdsourcing
Gathering gold standard annotations for relation extraction
Lora Aroyo and Chris Welty (Crowd Truth for Cognitive Computing)

CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session



Page 1: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session


Page 2: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session


•  An open-domain question-answering machine, given
   – rich natural-language questions
   – over a broad domain of knowledge
•  Won a two-game Jeopardy! match against the all-time winners, viewed by over 50,000,000 people

Page 3: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session


Cognitive Computing

EXPANDS human cognition: it makes the jobs we do easier, like a cognitive prosthesis, especially when dealing with massive data, or data that requires human interpretation.

LEARNS as you use it: most machine errors are easy for a human to detect, and we can instrument the usage of systems to better understand both the system and the problem it solves.

INTERACTS naturally: we need to bring machines closer to their users; we have adapted ourselves to them enough, and they should understand natural language, spoken or written, and be able to process images and videos. These simple human problems are extremely complex for machines, but they are hallmarks of a new computing era.

Page 4: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session


Watson MD

•  Adapt Watson to medical QA: now answering medical questions!
•  Mainly an NLP task
•  Cognitive computing systems need human-annotated data for training, testing, and evaluation

The human annotation task is one of semantic interpretation.

Page 5: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

NLP Tasks

Gadolinium agents are useful for patients with renal impairment, but in patients with severe renal failure requiring dialysis it presents a risk of nephrogenic systemic fibrosis.

Mention detection: find the spans (begin, end) of relevant medical terms (factors) in a passage.
Factor typing: find the type of each mention.

[NER figure: mentions in the passage labeled with types such as substance, disorder, treatment]

Page 6: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

NLP Tasks

Gadolinium agents are useful for patients with renal impairment, but in patients with severe renal failure requiring dialysis it presents a risk of nephrogenic systemic fibrosis.

Mention detection: find the spans (begin, end) of relevant medical terms (factors) in a passage.
Factor typing: find the type of each mention.
Factor (entity) identification: find the corresponding IDs for a mentioned factor in a knowledge base.

[Figure: mentions linked to knowledge-base IDs C0016911, C1408325, C0035078, C1619692, C0019004]

Page 7: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

NLP Tasks

Gadolinium agents are useful for patients with renal impairment, but in patients with severe renal failure requiring dialysis it presents a risk of nephrogenic systemic fibrosis.

Mention detection: find the spans (begin, end) of relevant medical terms (factors) in a passage.
Factor typing: find the type of each mention.
Factor (entity) identification: find the corresponding IDs for a mentioned factor in a knowledge base.
Relation detection: find the relations expressed in a passage between factors.

[Figure: relations among the mentions, e.g. cause, treats, contra-indicates]

Page 8: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

NLP Tasks

Gadolinium agents are useful for patients with renal impairment, but in patients with severe renal failure requiring dialysis it presents a risk of nephrogenic systemic fibrosis.

Mention detection: find the spans (begin, end) of relevant medical terms (factors) in a passage.
Factor typing: find the type of each mention.
Factor (entity) identification: find the corresponding IDs for a mentioned factor in a knowledge base.
Relation detection: find the relations expressed in a passage between factors.
Coreference: find the mentions in a sentence that refer to the same factor.

Page 9: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session


Gold Standard Assumption

•  Cognitive systems need to be told what is right and what is wrong: a gold standard, or ground truth.
•  Performance is measured on test sets vetted by human experts → never perfect, always improving against test data.
•  Historically, gold standards are created assuming that for each annotated instance there is a single right answer.
•  Gold standard quality is measured by inter-annotator agreement → this does not account for perspectives, or for reasonable alternative interpretations.
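Inter-annotator agreement is typically computed with a chance-corrected statistic; a minimal sketch of Cohen's kappa for two annotators (the labels below are purely hypothetical):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # expected agreement if both annotators labeled independently
    # at their own marginal label frequencies
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# hypothetical labels from two annotators on six relation instances
a = ["cause", "treats", "cause", "none", "treats", "cause"]
b = ["cause", "treats", "side-effect", "none", "cause", "cause"]
print(round(cohens_kappa(a, b), 2))  # 0.5
```

A kappa of 1 means perfect agreement; values near 0 mean agreement no better than chance. The single-number summary is exactly what hides the "reasonable alternative interpretations" the slide objects to.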

Page 10: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

but people don’t always agree…

Page 11: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Disagreement

Gadolinium agents are useful for patients with renal impairment, but in patients with severe renal failure requiring dialysis there is a risk of nephrogenic systemic fibrosis.

[One annotator reads the relation here as: cause]

Page 12: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Disagreement

Gadolinium agents are useful for patients with renal impairment, but in patients with severe renal failure requiring dialysis there is a risk of nephrogenic systemic fibrosis.

[Another annotator reads the relation here as: side-effect]

The human annotation task is one of semantic interpretation.

Page 13: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Why do people disagree?

[Triangle with vertices: Sentence, Relation, Worker]

Page 14: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Key Question

How do we represent and measure disagreement in a way that it can be harnessed?

Page 15: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Why do people disagree?

The Triangle of Reference: Sign, Referent, Observer.

Page 16: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Position

Maybe this disagreement is a signal, and not noise? Can we harness it?

Page 17: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session


Crowd Truth

Annotator disagreement is signal, not noise. It is indicative of the variation in human semantic interpretation of signs, and can indicate ambiguity, vagueness, over-generality, etc.

Page 18: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Approach Principles

1.  Understand the range of disagreements by creating a space of possibilities with frequencies and similarities.
2.  Tolerate, capture and exploit disagreement.
3.  Score machine output based on where it falls in this space.
4.  Be adaptable to new annotation tasks.

Page 19: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Crowd Watson

•  Crowdsourcing gold standard data
   •  for training Watson in the medical domain, as well as for event extraction, image annotation, video tagging and summarization
•  Crowdsourcing for domain adaptation
   •  how to rapidly acquire knowledge for new domains
•  Platforms
   •  CrowdFlower, Amazon Mechanical Turk
   •  crowdsourcing games with a purpose, e.g. Dr. Watson, Waisda?

Page 20: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session


Relation Extraction: crowdsourcing gold standard data

•  Relations overlap in meaning
•  Sentences are vague and ambiguous
•  Experts have different interpretations

Page 21: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

In distant supervision, we take arguments that are known to be related by a target relation in a knowledge base, and we find all sentences in a corpus that mention both arguments.
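The paragraph above can be sketched in a few lines; the knowledge-base triple and the tiny corpus here are purely illustrative:

```python
# A minimal distant-supervision sketch. The KB triple and corpus are made up.
kb = {("gadolinium", "nephrogenic systemic fibrosis"): "cause"}

corpus = [
    "Gadolinium agents are useful for patients with renal impairment, but in "
    "patients with severe renal failure requiring dialysis there is a risk of "
    "nephrogenic systemic fibrosis.",
    "Nephrogenic systemic fibrosis is a rare disorder of the skin.",
]

def distant_candidates(kb, corpus):
    """Pair each KB-related argument pair with every sentence mentioning both.
    The labels are noisy by construction: co-occurrence does not guarantee
    the relation is actually expressed in that sentence."""
    for (arg1, arg2), relation in kb.items():
        for sentence in corpus:
            lowered = sentence.lower()
            if arg1 in lowered and arg2 in lowered:
                yield sentence, relation

candidates = list(distant_candidates(kb, corpus))
print(len(candidates))  # only the first sentence mentions both arguments
```

The noise in these automatically labeled candidates is precisely where human annotation, and hence human disagreement, enters.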

Page 22: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session


Page 23: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Representation: Worker Vector

Each worker's annotation of a sentence is a binary vector over the relations, with a 1 for each relation the worker selected.

[Figure: one worker vector with three relations marked 1]

Page 24: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Representation: Sentence Vector

The sentence vector for a sentence is the element-wise sum of all of its workers' vectors:

0 1 1 0 0 4 3 0 0 5 1 0

[Figure: the individual binary worker vectors stacked and summed into the sentence vector]
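A minimal sketch of this aggregation; the five worker vectors below are illustrative, chosen to reproduce the sentence vector on the slide:

```python
def sentence_vector(worker_vectors):
    """Element-wise sum of the binary worker vectors for one sentence."""
    return [sum(column) for column in zip(*worker_vectors)]

# illustrative binary worker vectors over 12 relations: a 1 marks each
# relation that worker selected for the sentence
workers = [
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
]
print(sentence_vector(workers))  # [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
```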

Page 25: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Disagreement for Sentence Clarity

Feeling the way the CHEST expands (PALPATION) can identify areas of the lung that are full of fluid.

Is CHEST related to PALPATION? Candidate relations include diagnose, location, associated with, is_a, other, and part_of.

Sentence vector: 0 0 0 2 3 0 0 0 1 0 0 4 4 1

The unclear relationship between the two arguments is reflected in the disagreement.

Page 26: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Disagreement for Sentence Clarity

Redness (HYPERAEMIA), irritation (chemosis) and watering (epiphora) of the eyes are symptoms common to all forms of CONJUNCTIVITIS.

Is HYPERAEMIA related to CONJUNCTIVITIS? Candidate relations include cause and symptom.

Sentence vector: 0 0 0 1 0 0 0 0 13 0 0 0 0 0

The clearly expressed relation between the two arguments is reflected in the agreement.

Page 27: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Sentence-Relation Score

Measures how clearly a sentence expresses a relation: the cosine between the sentence vector and the unit vector for the relation.

Sentence vector: 0 1 1 0 0 4 3 0 0 5 1 0
Unit vector for relation R6 → cosine = .55
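Since the unit vector for a relation has a single non-zero component, the cosine reduces to the relation's count divided by the sentence vector's Euclidean norm. A sketch that reproduces the .55 on the slide:

```python
import math

def sentence_relation_score(sentence_vec, r):
    """Cosine between the sentence vector and the unit vector for relation r:
    the r-th component divided by the vector's Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in sentence_vec))
    return sentence_vec[r] / norm if norm else 0.0

sentence = [0, 1, 1, 0, 0, 4, 3, 0, 0, 5, 1, 0]
# R6 is the sixth relation, index 5 here: 4 / sqrt(53)
print(round(sentence_relation_score(sentence, 5), 2))  # 0.55
```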

Page 28: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Worker Disagreement

Measured per worker: the worker-sentence disagreement is the average of the cosines between the worker's sentence vector and the sentence vector (e.g. 0 1 1 0 0 4 3 0 0 5 1 0), over the sentences the worker annotated.

Page 29: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Crowd Truth Metrics: Relation Extraction

Three parts to understanding human interpretations:
§  Sentences
   •  How good is a sentence for the relation extraction task?
§  Workers
   •  How well does a worker understand the sentence?
§  Relations
   •  Is the meaning of the relation clear?
   •  How ambiguous / confusable is it?

Page 30: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Crowd Truth Metrics: Based on the Triangle of Reference

Three parts to understanding human interpretations:
§  Signs
   •  How good is a sign for conveying information?
§  People
   •  How well does a person understand the sign?
§  Ontology
   •  Are the distinctions of the ontology clear?
   •  How ambiguous / confusable are they?

Page 31: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

The Dark Side of Crowdsourcing Disagreement

•  Spammers generate disagreement for the wrong reasons
•  Most spam detection requires a gold standard
•  Worker-sentence disagreement: the average of all the cosines between each worker's sentence vector and the full sentence vector (minus that worker). Indicates how much a worker disagrees with the crowd on a sentence basis.
•  Worker-worker disagreement: a pairwise confusion matrix between workers, and the average agreement across the matrix for each worker. Indicates whether there are consistently like-minded workers.
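A sketch of the worker-sentence measure as defined above, with the worker's own contribution removed before comparing; the workers and vectors are illustrative:

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def worker_sentence_cosine(annotations, worker):
    """Average cosine between a worker's vector and the sentence vector with
    that worker's own contribution removed; a low average flags a worker who
    consistently disagrees with the crowd."""
    scores = []
    for per_worker in annotations:  # one {worker: binary vector} dict per sentence
        if worker not in per_worker:
            continue
        rest = [sum(col) for col in
                zip(*(vec for w, vec in per_worker.items() if w != worker))]
        scores.append(cosine(per_worker[worker], rest))
    return sum(scores) / len(scores)

# illustrative annotations over three relations; "w3" is off-pattern twice
annotations = [
    {"w1": [1, 0, 0], "w2": [1, 0, 0], "w3": [0, 0, 1]},
    {"w1": [0, 1, 0], "w2": [0, 1, 1], "w3": [1, 0, 0]},
]
print(worker_sentence_cosine(annotations, "w3"))  # 0.0
```

Note the leave-one-out step: comparing against the full sum would let a prolific worker agree with themselves.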

Page 32: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Harnessing Disagreement

•  Sentence-relation score: measured for each relation on each sentence as the cosine of the unit vector for the relation with the sentence vector.
•  Sentence clarity: for each sentence, the max relation score for that sentence. If all the workers selected the same relation for a sentence, the max score is 1, indicating a clear sentence.
•  Relation similarity: the pairwise conditional probability that if relation Ri is annotated in a sentence, then Rj is as well. Indicates how confusable the linguistic expression of two relations is.
•  Relation ambiguity: the max relation similarity for a relation. If a relation is clear, the score is low.
•  Relation clarity: the max sentence-relation score for a relation over all sentences. A high clarity score means it is at least possible to express the relation clearly.
•  Worker quality: the average cosine of the worker vector with the sentence vector, over all sentences the worker annotated.
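A few of these metrics sketched directly from the definitions above; the sentence vectors are illustrative:

```python
import math

def sentence_clarity(sentence_vec):
    """Max sentence-relation score: 1.0 when all workers chose one relation."""
    norm = math.sqrt(sum(v * v for v in sentence_vec))
    return max(sentence_vec) / norm if norm else 0.0

def relation_similarity(sentence_vectors, ri, rj):
    """P(Rj annotated | Ri annotated), estimated over all sentences."""
    with_ri = [v for v in sentence_vectors if v[ri] > 0]
    if not with_ri:
        return 0.0
    return sum(1 for v in with_ri if v[rj] > 0) / len(with_ri)

def relation_ambiguity(sentence_vectors, ri):
    """Max similarity of Ri with any other relation; low means unambiguous."""
    n = len(sentence_vectors[0])
    return max(relation_similarity(sentence_vectors, ri, rj)
               for rj in range(n) if rj != ri)

# illustrative sentence vectors over three relations
vectors = [[3, 0, 1], [2, 1, 0], [0, 4, 0]]
print(sentence_clarity([0, 0, 13, 0]))  # 1.0 for a unanimous sentence
print(relation_ambiguity(vectors, 0))   # 0.5
```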

Page 33: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Disagreement metrics

•  Diverging opinions cluster around the most plausible options.
•  Identify workers who systematically disagree:
   1.  with the opinion of the majority (worker-sentence disagreement)
       o  compare the worker's opinion with that of the majority
   2.  with the rest of their co-workers (worker-worker disagreement)
       o  workers with the same opinion as worker W
   3.  plus the average number of relations per sentence

Page 34: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Task completion time

Page 35: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Task completion time

Page 36: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Task completion time

Page 37: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session
Page 38: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Spam in a channel

Page 39: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Conclusions

•  Crowd Truth can help us understand the diversity of interpretations
   •  with adequate representation and metrics
   •  dispensing with the "one correct answer" assumption
•  Disagreement metrics can be augmented by content filters for better spam detection
   •  explanations by workers can be useful

Page 40: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

The Crew

•  Lora Aroyo (VU)
•  Chris Welty (IBM)
•  Guillermo Soberon (VU)
•  Hui Lin (IBM)
•  Anca Dumitrache (VU)
•  Oana Inel (VU)
•  Manfred Overmeen (IBM)
•  Robert-Jan Sips (IBM)

Page 41: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

http://crowd-watson.nl

Page 42: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session


Questions?

Page 43: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Accuracy predicting low quality (1)

Page 44: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Accuracy predicting low quality (2)

Page 45: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Spamming scenarios

Dev:
•  12 spammers / 110 workers
•  139 "spammed" sentences out of 1302 (11%)
•  100% accuracy spam detection

Test:
•  20 spammers / 93 workers
•  386 "spammed" sentences out of 1291 (30%)
•  89% accuracy (10 spammers missed)

Can we do better?

Page 46: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Data collected

•  Annotations
   o  12 relations + OTH / NON
   o  behaviour with respect to the crowd
→ Disagreement filters

Page 47: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Data collected

•  Annotations
   o  12 relations + OTH / NON
   o  behaviour with respect to the crowd
→ Disagreement filters

•  Explanations
   o  selected words (justify the choice)
   o  explanation (for OTHER or NONE)
   o  individual behaviour patterns
→ Explanation filters

Page 48: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Relation Extraction


Page 49: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Explanations analysis

Four patterns in worker behaviour indicating spam:
o  no valid words were used for the text
o  using the same text for all the annotations
o  using the same text for both "Selected words" and "Explanation"
o  bad understanding of (or not following) the task instructions:
   §  selecting "None" and "Other" in combination with other relations
   §  including explanations when they are not required


Page 51: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Spam patterns analysis

                            None / Other   Rep. Response   Rep. Text   No Valid Words
Spam candidates                  22              8             14            12
Overlap with disagreement       18%            37%            36%           42%

30 unique workers were identified ONLY by the explanation filters as possible low-quality workers.

Explanation filters ⊄ disagreement metrics

Page 52: CCCT University of Amsterdam Seminars 2013: Crowdsourcing Session

Results

•  Linear combination of disagreement metrics + explanation filters
   o  "No Valid Words" and the average number of relations per sentence get a bit more weight than the rest
•  Results:
   o  95% accuracy and .88 F1 score
   o  16 spammers detected out of 20
•  Previously, with disagreement metrics only:
   o  88% accuracy, .66 F1 score
   o  10 spammers detected out of 20
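The linear combination itself can be sketched as a weighted sum. All feature names and weights below are hypothetical, keeping only the stated bias toward "No Valid Words" and the average number of relations per sentence:

```python
# Hypothetical spam-score sketch: every feature name and weight here is
# illustrative, not taken from the actual Crowd Truth implementation.
WEIGHTS = {
    "worker_sentence_disagreement": 1.0,
    "worker_worker_disagreement": 1.0,
    "avg_relations_per_sentence": 1.5,   # a bit more weight, as in the slides
    "no_valid_words": 1.5,               # a bit more weight, as in the slides
    "repeated_response": 1.0,
    "repeated_text": 1.0,
}

def spam_score(features):
    """Weighted sum of filter outputs, each normalized to [0, 1];
    workers above a tuned threshold are flagged as spammers."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)

suspect = {"no_valid_words": 1.0, "avg_relations_per_sentence": 0.8}
print(spam_score(suspect) > 2.0)  # True
```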