Page 1

CLEF 2011, Amsterdam
QA4MRE, Question Answering for Machine Reading Evaluation

Question Answering Track Overview

Main Task: Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, Corina Forascu, Caroline Sporleder

Modality and Negation: Roser Morante, Walter Daelemans

Page 2

QA Tasks & Time at CLEF

[Timeline, 2003–2011. Main tasks: Multiple Language QA (2003–2008), ResPubliQA (2009–2010), QA4MRE (2011). Exercises and pilots along the way: Temporal restrictions and lists, Answer Validation Exercise (AVE), WiQA, QA over Speech Transcriptions (QAST), WSD QA, GikiCLEF, Real Time, Negation and Modality.]

Page 3

New setting

QA over a single document: Multiple Choice Reading Comprehension Tests
• Forget about the IR step (for a while)
• Focus on answering questions about a single text
• Choose the correct answer

Why this new setting?

Page 4

Systems performance

Upper bound of 60% accuracy

Overall: best result < 60%

Definitions: best result > 80%, and NOT with an IR approach

Page 5

Pipeline Upper Bound

SOMETHING to break the pipeline: answer validation instead of re-ranking

Question → Question analysis (1.0) → Passage Retrieval (0.8) → Answer Extraction (0.8) → Answer Ranking → Answer

1.0 × 0.8 × 0.8 = 0.64

Not enough evidence
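The arithmetic behind this slide generalizes: in a strict pipeline the per-stage accuracies multiply, so the product caps end-to-end performance. A minimal sketch in Python (the stage figures are the slide's; the function itself is illustrative):

    from functools import reduce

    def pipeline_upper_bound(stage_accuracies):
        """Best possible end-to-end accuracy of a strict pipeline:
        errors compound, so per-stage accuracies multiply."""
        return reduce(lambda acc, a: acc * a, stage_accuracies, 1.0)

    # Figures from the slide: question analysis (1.0),
    # passage retrieval (0.8), answer extraction (0.8).
    print(pipeline_upper_bound([1.0, 0.8, 0.8]))  # ≈ 0.64

This is why re-ranking at the end cannot recover answers the earlier stages already lost.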

Page 6

Multi-stream upper bound

Perfect combination: 81%
Best system: 52.5%

[Diagram: different streams achieve the best results on ORGANIZATION, PERSON, and TIME questions]

Page 7

Multi-stream architectures

Different systems answer different types of questions better

• Specialization
• Collaboration

Question → QA sys 1 | QA sys 2 | QA sys 3 | … | QA sys n → candidate answers → SOMETHING for combining / selecting → Answer
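One simple reading of the "SOMETHING for combining / selecting" box is majority voting over the streams' candidate answers; a toy sketch under that assumption (the track does not prescribe this method):

    from collections import Counter

    def combine_streams(candidates):
        """Pick one answer from several QA streams by majority vote.
        candidates: one candidate answer per stream (None = abstain)."""
        votes = Counter(a for a in candidates if a is not None)
        return votes.most_common(1)[0][0] if votes else None

    # Three hypothetical streams answering the same question:
    print(combine_streams(["Santos Ltd.", None, "Santos Ltd."]))  # Santos Ltd.

A specialization-based selector would instead route each question to the stream that is best for its type (ORGANIZATION, PERSON, TIME), as the previous slide suggests.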

Page 8

AVE 2006-2008

Answer Validation: decide whether to return the candidate answer or not

Answer Validation should help to improve QA:
• Introduce more content analysis
• Use Machine Learning techniques
• Able to break pipelines and combine streams

Page 9

Hypothesis generation + validation

Question → searching space of candidate answers → hypothesis generation functions + answer validation functions → Answer

Page 10

ResPubliQA 2009 - 2010

Transfer AVE results to the QA main task in 2009 and 2010
Promote QA systems with better answer validation

QA evaluation setting assuming that leaving a question unanswered has more value than giving a wrong answer

Page 11

Evaluation measure

n: number of questions
nR: number of correctly answered questions
nU: number of unanswered questions

c@1 = (nR + nU · (nR / n)) / n

Reward systems that maintain accuracy but reduce the number of incorrect answers by leaving some questions unanswered
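Transcribed into code (the function name is mine; the formula is the slide's):

    def c_at_1(n, n_correct, n_unanswered):
        """c@1: unanswered questions are credited at the system's
        observed accuracy rate, n_correct / n."""
        return (n_correct + n_unanswered * (n_correct / n)) / n

    # 60 of 120 questions correct and 20 left unanswered scores
    # above the plain accuracy of 0.5:
    print(c_at_1(120, 60, 20))  # ≈ 0.583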

Page 12

Conclusions of ResPubliQA 2009 – 2010

This was not enough:
• We expected a bigger change in systems architecture
• Validation is still in the pipeline
• Bad IR -> Bad QA
• No qualitative improvement in performance
• Need for space to develop the technology

Page 13

2011 campaign

Promote a bigger change in QA systems architecture

QA4MRE: Question Answering for Machine Reading Evaluation

Measure progress in two reading abilities:
• Answer questions about a single text
• Capture knowledge from text collections

Page 14

Reading test

Text

Coal seam gas drilling in Australia's Surat Basin has been halted by flooding.

Australia's Easternwell, being acquired by Transfield Services, has ceased drilling because of the flooding.

The company is drilling coal seam gas wells for Australia's Santos Ltd.

Santos said the impact was minimal.

Multiple choice test

According to the text…

What company owns wells in Surat Basin?
a) Australia
b) Coal seam gas wells
c) Easternwell
d) Transfield Services
e) Santos Ltd.
f) Ausam Energy Corporation
g) Queensland
h) Chinchilla
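In machine-readable form, a test item of this kind could be represented as below; the field names are illustrative assumptions, not the official QA4MRE test-set schema:

    # Hypothetical in-memory representation of one reading-test question.
    test_item = {
        "document": "Coal seam gas drilling in Australia's Surat Basin "
                    "has been halted by flooding. ...",
        "question": "What company owns wells in Surat Basin?",
        "options": ["Australia", "Coal seam gas wells", "Easternwell",
                    "Transfield Services", "Santos Ltd.",
                    "Ausam Energy Corporation", "Queensland", "Chinchilla"],
        "answer": "Santos Ltd.",  # follows from the knowledge gap on the next slide
    }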

Page 15

Knowledge gaps

Acquire this knowledge from the reference collection

[Diagram: gap I, Company B drills Well C for Company A ⇒ Company A owns Well C (own | P=0.8); gap II, Surat Basin is part of Queensland, which is part of Australia]
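The diagram amounts to weighted triples plus a transitivity rule over "is part of"; a toy sketch mirroring it (the triples and the 0.8 weight come from the slide, the code is illustrative):

    # Toy knowledge base of (subject, relation, object, confidence)
    # triples mirroring the slide; not an actual QA4MRE resource.
    TRIPLES = [
        ("Company A", "own", "Well C", 0.8),  # gap I: induced from "B drills C for A"
        ("Surat Basin", "is_part_of", "Queensland", 1.0),
        ("Queensland", "is_part_of", "Australia", 1.0),
    ]

    def is_part_of(place, region):
        """Follow 'is_part_of' edges transitively (gap II)."""
        parents = {s: o for s, r, o, _ in TRIPLES if r == "is_part_of"}
        while place in parents:
            place = parents[place]
            if place == region:
                return True
        return False

    print(is_part_of("Surat Basin", "Australia"))  # True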

Page 16

Knowledge-Understanding dependence

We “understand” because we “know”
We need a little more of both to answer questions

Reading cycle:
Capture ‘knowledge’ expressed in texts ⇄ ‘Understand’ language

Page 17

Control the variable of knowledge

The ability to make inferences about texts is correlated with the amount of knowledge considered
• This variable has to be taken into account during evaluation
• Otherwise it is very difficult to compare methods

How to control the variable of knowledge in a reading task?

Page 18

Texts as sources of knowledge: Text Collection
• Big and diverse enough to acquire knowledge
  • Impossible for all possible topics
• Define a scalable strategy: topic by topic
• Reference collection per topic (20,000–100,000 docs.)
• Several topics, narrow enough to limit the knowledge needed:
  • AIDS
  • CLIMATE CHANGE
  • MUSIC & SOCIETY

Page 19

Evaluation tests

12 reading tests (4 docs per topic)
120 questions (10 questions per test)
600 choices (5 options per question)

Translated into 5 languages: English, German, Spanish, Italian, Romanian

Page 20

Evaluation tests

44 questions required background knowledge from the reference collection

38 required combining info from different paragraphs

Textual inferences:
• Lexical: acronyms, synonyms, hypernyms…
• Syntactic: nominalizations, paraphrasing…
• Discourse: coreference, ellipsis…

Page 21

Evaluation

QA perspective evaluation: c@1 over all 120 questions

Reading perspective evaluation: aggregating results by test

Task     Registered groups   Participant groups   Submitted runs
QA4MRE   25                  12                   62

Page 22

Workshop QA4MRE

Tuesday 10:30 – 12:30

Keynote: Text Mining in Biograph (Walter Daelemans)

QA4MRE methodology and results (Álvaro Rodrigo)

Report on Modality and Negation pilot (Roser Morante)

14:00 – 16:00: Reports from participants

Wednesday 10:30 – 12:30

Breakout session

Page 23

CLEF 2011, Amsterdam
QA4MRE, Question Answering for Machine Reading Evaluation

Question Answering Track Breakout session

Main Task: Anselmo Peñas, Eduard Hovy, Pamela Forner, Álvaro Rodrigo, Richard Sutcliffe, Corina Forascu, Caroline Sporleder

Modality and Negation: Roser Morante, Walter Daelemans

Page 24

QA4MRE breakout session: Task

Questions are more difficult and realistic

100% reusable test sets

Languages and participants
• No participants for some languages, but a valuable resource for evaluation
• Good balance for developing tests in other languages (even without participants)
• The problem is to find parallel translations for tests

Page 25

QA4MRE breakout session: Background collections

Good balance of quality and noise

Methodology to build them is OK

Test documents (TED)
• Not ideal but parallel
• Open audience and no copyright issues
• Consider other possibilities:
  • CafeBabel
  • BBC news

Page 26

QA4MRE breakout session: Evaluation

Encourage participants to test previous systems on new campaigns

Ablation tests: what happens if you remove a component?

Runs with and without background knowledge, with and without external resources

Processing time measurements

Page 27

QA4MRE 2012

Topics

Previous:
1. AIDS
2. Music and Society
3. Climate Change

Add:
4. Alzheimer (popular, non-specialist sources: blogs, web, news, …)

Page 28

QA4MRE 2012 Pilots

Modality and Negation: move to a three-value setting

Given an event in the text, decide whether it is:
1. Asserted (no negation and no speculation)
2. Negated (negation and no speculation)
3. Speculated
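The three values reduce to two binary cues; a minimal sketch of the mapping (the function is illustrative, the definitions are the pilot's):

    from enum import Enum

    class EventStatus(Enum):
        ASSERTED = "asserted"      # no negation and no speculation
        NEGATED = "negated"        # negation and no speculation
        SPECULATED = "speculated"  # any speculation (assumed here)

    def classify(negated: bool, speculated: bool) -> EventStatus:
        """Map the two binary cues onto the pilot's three-value scheme."""
        if speculated:
            return EventStatus.SPECULATED
        return EventStatus.NEGATED if negated else EventStatus.ASSERTED

    print(classify(negated=True, speculated=False))  # EventStatus.NEGATED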

Roadmap:
1. 2012: run as a separate pilot
2. 2013: integrate modality and negation in the main task tests

Page 29

QA4MRE 2012 Pilots

Biomedical domain
• Focus on one disease: Alzheimer (59,000 Medline abstracts)
• Scientific language
• Give participants the background collection already processed: tokenization, lemmatization, POS, NER, dependency parsing
• Development set

Page 30

QA4MRE 2012 in summary

Main task

Multiple Choice Reading Comprehension tests

• Same format
• Additional topic: Alzheimer
• English, German (maybe Spanish, Italian, Romanian, others)

Two pilots
• Modality and negation: asserted, negated, speculated
• Biomedical domain, focused on Alzheimer disease: same format as the main task

Page 31

Thanks!