Text REtrieval Conference (TREC)
Implementing a Question-Answering Evaluation for AQUAINT
Ellen M. Voorhees and Donna Harman
TREC QA Track
• Goal: encourage research into systems that return answers, rather than document lists, in response to a question
• NIST (TREC-8) subgoal: investigate whether the evaluation methodology used for text retrieval systems is appropriate for another NLP task
Task
• For each closed-class question, return a ranked list of 5 [docid, text-snippet] pairs
  – snippets drawn from large news collection
  – score: reciprocal rank of first correct response
• Test conditions
  – TRECs 8, 9:
    • 50 or 250 byte snippets
    • answer guaranteed to exist in collection
  – TREC 2001:
    • 50 byte snippets only
    • no guarantee of answer in collection
Sample Questions
• How many calories are there in a Big Mac?
• What is the fare for a round trip between New York and London on the Concorde?
• Who was the 16th President of the United States?
• Where is the Taj Mahal?
• When did French revolutionaries storm the Bastille?
Selecting Questions
• TREC-8
  – most questions created specifically for track
  – NIST staff selected questions
• TREC-9
  – questions suggested by logs of real questions
  – much more ambiguous, and therefore difficult
    Who is Colin Powell? vs. Who invented the paper clip?
• TREC 2001
  – questions taken directly from filtered logs
  – large percentage of definition questions
What Evaluation Methodology?
• Different philosophies
  – IR: the “user” is the sole judge of a satisfactory response
    • human assessors judge responses
    • flexible interpretation of correct response
    • final scores comparative, not absolute
  – IE: there exists the answer
    • answer keys developed by application expert
    • requires enumeration of all acceptable responses at outset
    • subsequent scoring trivial; final scores absolute
QA Track Evaluation Methodology
• NIST assessors judge answer strings
  – binary judgment of correct/incorrect
  – document provides context for answer
• In TREC-8, each question was independently judged by 3 assessors
  – built high-quality final judgment set
  – provided data for measuring effect of differences between judges on final scores
Judging Guidelines
• Document context used
  – frame of reference
    • Who is President of the United States?
    • What is the world’s population?
  – credit for getting answer from mistaken doc
• Answers must be responsive
  – no credit when list of possible answers given
  – must include units, appropriate punctuation
  – for questions about famous objects, answers must pertain to that one object
Validating Evaluation Methodology
• Is user-based evaluation appropriate?
• Is it reliable?
  – can assessors perform the task?
  – do differences affect scores?
• Does the methodology produce a QA test collection?
  – evaluate runs that were not judged
Assessors can Perform Task
• Examined assessors during the task
  – written comments
  – think-aloud sessions
• Measured agreement among assessors
  – on average, 6% of judged strings had some disagreement
  – mean overlap of .641 across 3 judges for 193 questions that had some correct answer found
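The overlap statistic above can be illustrated with a short sketch. A common way to define overlap between judges is the size of the intersection of their "correct" sets divided by the size of the union, averaged over judge pairs; this definition is an assumption for illustration, not necessarily NIST's exact procedure, and the data below is invented:

```python
from itertools import combinations

def pair_overlap(a, b):
    """Overlap of two judges' correct sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_overlap(judge_sets):
    """Mean pairwise overlap across all judges for one question."""
    pairs = list(combinations(judge_sets, 2))
    return sum(pair_overlap(a, b) for a, b in pairs) / len(pairs)

# Hypothetical example: two judges accept both strings, a third only one.
judges = [{"s1", "s2"}, {"s1", "s2"}, {"s1"}]
```

Here `mean_overlap(judges)` averages 1.0, 0.5, and 0.5 to give about 0.67; a track-level figure like .641 would be the mean of this quantity over all questions with some correct answer.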
User Evaluation Necessary
• Even for these questions, context matters
  – Taj Mahal casino in Atlantic City
• Legitimate differences in opinion as to whether string contains correct answer
  – granularity of dates
  – completeness of names
  – “confusability” of answer string
• If assessors’ opinions differ, so will eventual end-users’ opinions
QA Track Scoring Metric
• Mean reciprocal rank
  – score for an individual question is the reciprocal of the rank at which the first correct response is returned (0 if no correct response is returned)
  – score of a run is the mean over the test set of questions
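The metric can be sketched directly from that definition. This is a minimal illustration in which the judgment predicate stands in for an assessor's binary decision; the data and the snippet-membership check are invented for the example:

```python
def reciprocal_rank(responses, is_correct):
    """Return 1/rank of the first correct response, 0 if none is correct.

    responses: ranked list of (docid, snippet) pairs (up to 5 per question).
    is_correct: predicate standing in for an assessor's binary judgment.
    """
    for rank, pair in enumerate(responses, start=1):
        if is_correct(pair):
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(run, judgments):
    """Mean of per-question reciprocal ranks over the test set.

    run: question id -> ranked list of (docid, snippet) pairs.
    judgments: question id -> set of strings judged correct (a
    simplification: a pair counts as correct if its snippet is in the set).
    """
    scores = [
        reciprocal_rank(pairs, lambda p: p[1] in judgments[qid])
        for qid, pairs in run.items()
    ]
    return sum(scores) / len(scores)
```

For example, a run that answers one question correctly at rank 1 and misses a second question entirely scores (1.0 + 0.0) / 2 = 0.5.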
TREC 2001 QA Results
[Bar chart: scores for the best run of the top 8 groups using strict evaluation. Left axis: mean reciprocal rank (0–0.8); right axis: % of questions with no answer (0–50). Runs shown: insight, LCC1, orcl1, isi1a50, uwmta1, mtsuna0, ibmsqa01a, IBMKS1M3.]
Comparative Scores Stable
• Quantify effect of different judgments by calculating correlation between rankings of systems
  – mean Kendall τ of .96 (both TREC-8 & TREC 2001)
  – equivalent to variation found in IR collections
• Judgment sets based on 1 judge’s opinion equivalent to adjudicated judgment set
  – adjudicated > 3 times the cost of 1-judge
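Kendall's rank correlation, used here to compare system rankings produced by different judgment sets, can be sketched as follows. The ranking data is hypothetical, and the implementation assumes no tied ranks:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same systems.

    rank_a, rank_b: dicts mapping system name -> rank position under
    two different judgment sets. Assumes no ties.
    """
    systems = list(rank_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant if both rankings order s and t the same way.
        if (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs
```

Identical rankings give 1.0, fully reversed rankings give -1.0, and a single adjacent swap among four systems gives (5 - 1) / 6 ≈ 0.67; a mean of .96 therefore indicates near-identical system orderings across judges.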
Methodology Summary
• User-based evaluation is appropriate and necessary for the QA task
  + user-based evaluation accommodates different opinions
    • different assessors have conflicting opinions as to the correctness of a response
    • reflects real-world use
  + assessors understand their task and can do it
  + comparative results are stable
  – effect of differences on training unknown
  – need more coherent user model
TREC 2001
• Introduced new tasks
  – task that requires collating information from multiple documents to form a list of responses
    What are 9 novels written by John Updike?
  – short sequences of interrelated questions
    Where was Chuck Berry born?
    What was his first song on the radio?
    When did it air?
Participation in QA Tracks
[Bar chart: number of participants by year — TREC 1999: 20; TREC 2000: 28; TREC 2001: 36.]
TREC 2001 QA Participants
Alicante University; Chinese Acad. Sciences; CL Research; Conexor Oy; EC Wise, Inc.; Fudan University; Harbin Inst. of Tech.; IBM (Franz); IBM (Prager); InsightSoft-M; ISI, USC; ITC-IRST, Trento; KAIST; KCSL, Toronto; Korea Univ.; Language Comp. Corp.; LIMSI; Microsoft Research; MITRE; NTT Comm. Sci. Labs; National Taiwan U.; Oracle; Pohang U. of Sci&Tech; Queens College, CUNY; Sun Microsystems; Syracuse U.; Tilburg U.; U. of Amsterdam; U. of Illinois, Urbana; U. of Iowa; U. of Manitoba; U. de Montreal; U. of Pennsylvania; U. di Pisa; U. of Waterloo; U. of York
AQUAINT Proposals
• Areas: End-to-End, Dialog, Context, Knowledge Base
• Contractors: SUNY/Rutgers, Berkeley, BBN, CMU, Columbia, UMass, USC-ISI, LCC, IBM/Cyc, SRI, SAIC
Proposed Evaluations
• Extended QA (end-to-end) track in TREC
• Knowledge base task
• Dialog task
Dialog Evaluation
• Goal:
  – evaluate interactive use of QA systems
  – explore issues at the analyst/system interface
• Participants:
  – contractors whose main emphasis is on dialog (e.g., SUNY/Rutgers, LCC, SRI)
  – others welcome, including TREC interactive participants
• Plan:
  – design pilot study for 2002 during breakout
  – full evaluation in 2003
Knowledge Base Evaluation
• Goal:
  – investigate systems’ ability to exploit deep knowledge
  – assume KB already exists
• Participants:
  – SAIC, SRI, Cyc
  – others welcome (AAAI ’99 workshop participants?)
• Plan:
  – design full evaluation plan for 2002 during breakout
End-to-End Evaluation
• Goals:
  – continue TREC QA track as suggested by roadmap
  – add AQUAINT-specific conditions
  – introduce task variants based on question typology
• Participants:
  – all contractors, other TREC participants
• Plan:
  – quickly devise interim typology for use in 2002
  – ad hoc working group to develop more formal typology by December 2002