Text REtrieval Conference (TREC)
Implementing a Question-Answering Evaluation for AQUAINT
Ellen M. Voorhees and Donna Harman
TREC QA Track
• Goal: encourage research into systems that return answers, rather than document lists, in response to a question
• NIST (TREC-8) subgoal: investigate whether the evaluation methodology used for text retrieval systems is appropriate for another NLP task
Task
• For each closed-class question, return a ranked list of 5 [docid, text-snippet] pairs
  – snippets drawn from large news collection
  – score: reciprocal rank of first correct response
• Test conditions
  – TRECs 8, 9:
    • 50 or 250 byte snippets
    • answer guaranteed to exist in collection
  – TREC 2001:
    • 50 byte snippets only
    • no guarantee of answer in collection
Sample Questions
• How many calories are there in a Big Mac?
• What is the fare for a round trip between New York and London on the Concorde?
• Who was the 16th President of the United States?
• Where is the Taj Mahal?
• When did French revolutionaries storm the Bastille?
Selecting Questions
• TREC-8
  – most questions created specifically for track
  – NIST staff selected questions
• TREC-9
  – questions suggested by logs of real questions
  – much more ambiguous, and therefore difficult
    Who is Colin Powell? vs. Who invented the paper clip?
• TREC 2001
  – questions taken directly from filtered logs
  – large percentage of definition questions
What Evaluation Methodology?
• Different philosophies
  – IR: the “user” is the sole judge of a satisfactory response
    • human assessors judge responses
    • flexible interpretation of correct response
    • final scores comparative, not absolute
  – IE: there exists the answer
    • answer keys developed by application expert
    • requires enumeration of all acceptable responses at outset
    • subsequent scoring trivial; final scores absolute
QA Track Evaluation Methodology
• NIST assessors judge answer strings
  – binary judgment of correct/incorrect
  – document provides context for answer
• In TREC-8, each question was independently judged by 3 assessors
  – built high-quality final judgment set
  – provided data for measuring effect of differences between judges on final scores
Judging Guidelines
• Document context used
  – frame of reference
    • Who is President of the United States?
    • What is the world’s population?
  – credit for getting answer from mistaken doc
• Answers must be responsive
  – no credit when list of possible answers given
  – must include units, appropriate punctuation
  – for questions about famous objects, answers must pertain to that one object
Validating Evaluation Methodology
• Is user-based evaluation appropriate?
• Is it reliable?
  – can assessors perform the task?
  – do differences affect scores?
• Does the methodology produce a QA test collection?
  – evaluate runs that were not judged
Assessors can Perform Task
• Examined assessors during the task
  – written comments
  – think-aloud sessions
• Measured agreement among assessors
  – on average, 6% of judged strings had some disagreement
  – mean overlap of .641 across 3 judges for 193 questions that had some correct answer found
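The overlap statistic above can be illustrated with a short sketch. A common way to define overlap between judges is the size of the intersection of their "correct" sets divided by the size of the union, averaged over judge pairs; this definition is an assumption for illustration, not necessarily NIST's exact procedure, and the data below is invented:

```python
from itertools import combinations

def pair_overlap(a, b):
    """Overlap of two judges' correct sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_overlap(judge_sets):
    """Mean pairwise overlap across all judges for one question."""
    pairs = list(combinations(judge_sets, 2))
    return sum(pair_overlap(a, b) for a, b in pairs) / len(pairs)

# Hypothetical example: two judges accept both strings, a third only one.
judges = [{"s1", "s2"}, {"s1", "s2"}, {"s1"}]
```

Here `mean_overlap(judges)` averages 1.0, 0.5, and 0.5 to give about 0.67; a track-level figure like .641 would be the mean of this quantity over all questions with some correct answer.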
User Evaluation Necessary
• Even for these questions, context matters
  – Taj Mahal casino in Atlantic City
• Legitimate differences in opinion as to whether string contains correct answer
  – granularity of dates
  – completeness of names
  – “confusability” of answer string
• If assessors’ opinions differ, so will eventual end-users’ opinions
QA Track Scoring Metric
• Mean reciprocal rank
  – score for an individual question is the reciprocal of the rank at which the first correct response is returned (0 if no correct response is returned)
  – score of a run is the mean over the test set of questions
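The metric can be sketched directly from that definition. This is a minimal illustration in which the judgment predicate stands in for an assessor's binary decision; the data and the snippet-membership check are invented for the example:

```python
def reciprocal_rank(responses, is_correct):
    """Return 1/rank of the first correct response, 0 if none is correct.

    responses: ranked list of (docid, snippet) pairs (up to 5 per question).
    is_correct: predicate standing in for an assessor's binary judgment.
    """
    for rank, pair in enumerate(responses, start=1):
        if is_correct(pair):
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(run, judgments):
    """Mean of per-question reciprocal ranks over the test set.

    run: question id -> ranked list of (docid, snippet) pairs.
    judgments: question id -> set of strings judged correct (a
    simplification: a pair counts as correct if its snippet is in the set).
    """
    scores = [
        reciprocal_rank(pairs, lambda p: p[1] in judgments[qid])
        for qid, pairs in run.items()
    ]
    return sum(scores) / len(scores)
```

For example, a run that answers one question correctly at rank 1 and misses a second question entirely scores (1.0 + 0.0) / 2 = 0.5.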
TREC 2001 QA Results
[Bar chart: scores for the best run of the top 8 groups using strict evaluation. Left axis: mean reciprocal rank (0–0.8); right axis: % of questions with no answer (0–50). Runs shown: insight, LCC1, orcl1, isi1a50, uwmta1, mtsuna0, ibmsqa01a, IBMKS1M3.]
Comparative Scores Stable
• Quantify effect of different judgments by calculating correlation between rankings of systems
  – mean Kendall τ of .96 (both TREC-8 & TREC 2001)
  – equivalent to variation found in IR collections
• Judgment sets based on 1 judge’s opinion equivalent to adjudicated judgment set
  – adjudicated > 3 times the cost of 1-judge
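Kendall's rank correlation, used here to compare system rankings produced by different judgment sets, can be sketched as follows. The ranking data is hypothetical, and the implementation assumes no tied ranks:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same systems.

    rank_a, rank_b: dicts mapping system name -> rank position under
    two different judgment sets. Assumes no ties.
    """
    systems = list(rank_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant if both rankings order s and t the same way.
        if (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs
```

Identical rankings give 1.0, fully reversed rankings give -1.0, and a single adjacent swap among four systems gives (5 - 1) / 6 ≈ 0.67; a mean of .96 therefore indicates near-identical system orderings across judges.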
Methodology Summary
• User-based evaluation is appropriate and necessary for the QA task
  + user-based evaluation accommodates different opinions
    • different assessors have conflicting opinions as to the correctness of a response
    • reflects real-world use
  + assessors understand their task and can do it
  + comparative results are stable
  – effect of differences on training unknown
  – need more coherent user model
TREC 2001
• Introduced new tasks
  – task that requires collating information from multiple documents to form a list of responses
    What are 9 novels written by John Updike?
  – short sequences of interrelated questions
    Where was Chuck Berry born?
    What was his first song on the radio?
    When did it air?
Participation in QA Tracks
[Bar chart: number of participants by year — TREC 1999: 20; TREC 2000: 28; TREC 2001: 36.]
TREC 2001 QA Participants
Alicante University; Chinese Acad. Sciences; CL Research; Conexor Oy; EC Wise, Inc.; Fudan University; Harbin Inst. of Tech.; IBM (Franz); IBM (Prager); InsightSoft-M; ISI, USC; ITC-IRST, Trento; KAIST; KCSL, Toronto; Korea Univ.; Language Comp. Corp.; LIMSI; Microsoft Research; MITRE; NTT Comm. Sci. Labs; National Taiwan U.; Oracle; Pohang U. of Sci&Tech; Queens College, CUNY; Sun Microsystems; Syracuse U.; Tilburg U.; U. of Amsterdam; U. of Illinois, Urbana; U. of Iowa; U. of Manitoba; U. de Montreal; U. of Pennsylvania; U. di Pisa; U. of Waterloo; U. of York
AQUAINT Proposals
• Areas: End-to-End, Dialog, Context, Knowledge Base
• Contractors: SUNY/Rutgers, Berkeley, BBN, CMU, Columbia, UMass, USC-ISI, LCC, IBM/Cyc, SRI, SAIC
Proposed Evaluations
• Extended QA (end-to-end) track in TREC
• Knowledge base task
• Dialog task
Dialog Evaluation
• Goal:
  – evaluate interactive use of QA systems
  – explore issues at the analyst/system interface
• Participants:
  – contractors whose main emphasis is on dialog (e.g., SUNY/Rutgers, LCC, SRI)
  – others welcome, including TREC interactive participants
• Plan:
  – design pilot study for 2002 during breakout
  – full evaluation in 2003
Knowledge Base Evaluation
• Goal:
  – investigate systems’ ability to exploit deep knowledge
  – assume KB already exists
• Participants:
  – SAIC, SRI, Cyc
  – others welcome (AAAI ’99 workshop participants?)
• Plan:
  – design full evaluation plan for 2002 during breakout
End-to-End Evaluation
• Goals:
  – continue TREC QA track as suggested by roadmap
  – add AQUAINT-specific conditions
  – introduce task variants based on question typology
• Participants:
  – all contractors, other TREC participants
• Plan:
  – quickly devise interim typology for use in 2002
  – ad hoc working group to develop more formal typology by December 2002