Is That Your Final Answer? The Role of Confidence in Question Answering Systems
Robert Gaizauskas (1) and Sam Scott (2)
(1) Natural Language Processing Group, Department of Computer Science, University of Sheffield
(2) Centre for Interdisciplinary Studies, Carleton University
December 8, 2000 Dublin Computational Linguistics Research Seminar
Outline of Talk
Question Answering: A New Challenge in Information Retrieval
The TREC Question Answering Track
The Task
Evaluation Metrics
The Potential of NLP for Question Answering
The Sheffield QA System
Okapi
QA-LaSIE
Evaluation Results
Confidence Measures and their Application
Conclusions and Discussion
Question Answering: A New Challenge in IR
Traditionally, information retrieval systems are viewed as systems that return documents in response to a query
  Such systems are better termed document retrieval systems
  Once a document is returned, the user must search it to find the required information
  Acceptable if the documents returned are short, not too many are returned, and the information need is general
  Not acceptable if many documents are returned, the documents are very long, or the information need is very specific
Recently (1999, 2000) the TREC Question Answering (QA) track has been designed to address this issue
As construed in TREC, QA systems take natural language questions and a text collection as input and return specific answers (literal text strings) from documents in the text collection
QA: An (Incomplete) Historical Perspective
Question answering not a new topic:
  Erotetic logic (Harrah, 1984; Belnap and Steel, 1976)
  Deductive question answering work in AI (Green, 1969; Schubert, 1986)
  Conceptual theories of QA (Lehnert, 1977)
  Natural language front-ends to databases (Copestake, 1990; DARPA ATIS evaluations)
The TREC QA Track: Task Definition
Inputs:
  4 GB of newswire texts (from the TREC text collection)
  File of natural language questions (200 in TREC-8 / 700 in TREC-9)
e.g.
Where is the Taj Mahal?
How tall is the Eiffel Tower?
Who was Johnny Mathis’ high school track coach?
Outputs:
  Five ranked answers per question, including a pointer to the source document
  • 50 byte category
  • 250 byte category
  Up to two runs per category per site
Limitations:
  Each question has an answer in the text collection
  Each answer is a single literal string from a text (no implicit or multiple answers)
The TREC QA Track: Metrics and Scoring
The principal metric is Mean Reciprocal Rank (MRR)
  Correct answer at rank 1 scores 1
  Correct answer at rank 2 scores 1/2
  Correct answer at rank 3 scores 1/3
  …
  Sum over all questions and divide by the number of questions
More formally:

  $\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N} r_i$

where N = # questions, r_i = the reciprocal of the best (lowest) rank assigned by a system at which a correct answer is found for question i, or 0 if no correct answer was found
Judgements are made by human judges based on the answer string alone (lenient evaluation) and by reference to documents (strict evaluation)
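As a concrete illustration of the metric, here is a minimal sketch of an MRR computation in Python; the function name and input format are illustrative, not taken from the talk or from the TREC scoring software.

```python
def mean_reciprocal_rank(best_correct_ranks):
    """Compute MRR from a list with one entry per question.

    Each entry is the best (lowest) rank at which a correct answer
    appeared (1-5 in the TREC setting), or None if no correct answer
    was returned for that question.
    """
    n = len(best_correct_ranks)
    total = sum(1.0 / rank for rank in best_correct_ranks if rank is not None)
    return total / n

# Example: correct at rank 1, no correct answer, correct at rank 3
print(mean_reciprocal_rank([1, None, 3]))  # (1 + 0 + 1/3) / 3 ≈ 0.444
```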
The Potential of NLP for Question Answering
NLP has failed to deliver significant improvements in the document retrieval task.
Will the same be true of QA? It must depend on the definition of the task
Current TREC QA task is best construed as micro passage retrieval
There are a number of linguistic phenomena relevant to QA which suggest that NLP ought to be able to help, in principle.
But, it also now seems clear from TREC-9 results that NLP techniques do improve the effectiveness of QA systems in practice.
The Potential of NLP for Question Answering
Coreference
  Part of the information required to answer a question may occur in one sentence, while the rest occurs in another, linked via an anaphor. E.g.
    Question: How much did Mercury spend on advertising in 1993?
    Text: Mercury … Last year the company spent £12m on advertising.
Deixis
  References (possibly relative) to here and now may need to be correctly interpreted. E.g. answering the preceding question requires interpreting "last year" as 1993 via the date-line of the text (1994).
Grammatical knowledge
  Difference in grammatical role can be of crucial importance. E.g. the
    Question: Which company took over Microsoft?
  cannot be answered from the
    Text: Microsoft took over Entropic.
The Potential of NLP for Question Answering (cont)
Semantic knowledge
  Entailments based on lexical semantics may need to be computed. E.g. to answer the
    Question: At what age did Rossini stop writing opera?
  using the
    Text: Rossini … did not write another opera after he was 35.
  requires knowing that stopping X at time t means not doing X after t.
World knowledge
  World knowledge may be required to interpret linguistic expressions. E.g. to answer the
    Question: In which city is the Eiffel Tower?
  using the
    Text: The Eiffel Tower is in Paris.
  but not the
    Text: The Eiffel Tower is in France.
  requires the knowledge that Paris is a city and France a country.
Sheffield QA System Architecture
Overall objective is to use:
IR system as fast filter to select small set of documents with high relevance to query from the initial, large text collection
IE system to perform slow, detailed linguistic analysis to extract answer from limited set of docs proposed by IR system
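A schematic sketch of this two-stage architecture; the function names are placeholders standing in for Okapi and QA-LaSIE, not the system's actual interfaces.

```python
def answer_question(question, collection, retrieve_passages, extract_answers, k=20):
    """Two-stage QA: fast IR filter, then slow IE analysis.

    retrieve_passages: callable standing in for the IR system (e.g. Okapi
        passage retrieval), returning the k most relevant passages.
    extract_answers: callable standing in for the IE system (e.g. QA-LaSIE),
        performing detailed linguistic analysis of just those passages.
    """
    candidates = retrieve_passages(question, collection, k)   # fast, coarse filter
    return extract_answers(question, candidates)              # slow, detailed analysis
```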
Okapi
Used “off the shelf” – available from http://www.soi.city.ac.uk/research/cisr/okapi/okapi.html
Based on the probabilistic retrieval model (Robertson and Sparck Jones, 1976)
Used passage retrieval capabilities of Okapi
Passage retrieval parameters: min. passage: 1 paragraph; max. passage: 3 paragraphs; paragraph step unit: 1
  Arrived at by experimentation on TREC-8 data
  Examined trade-offs between:
    number of documents and "answer loss":
      184/198 questions had an answer in the top 20 full documents; 160/198 in the top 5
    passage length and "answer loss":
      only 2 answers lost from the top 5 3-paragraph passages
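A hedged sketch of the kind of trade-off analysis described above: given ranked retrieval results and known answer patterns per question, count how many questions still have an answer in the top k. All names and the regex-based matching are assumptions for illustration, not the actual experimental code.

```python
import re

def coverage_at_k(ranked_texts_per_question, answer_patterns, k):
    """Fraction of questions whose answer occurs in the top-k retrieved texts.

    ranked_texts_per_question: {qid: [text ranked 1st, text ranked 2nd, ...]}
    answer_patterns: {qid: regular expression matching an acceptable answer}
    """
    hits = 0
    for qid, texts in ranked_texts_per_question.items():
        pattern = re.compile(answer_patterns[qid], re.IGNORECASE)
        if any(pattern.search(text) for text in texts[:k]):
            hits += 1
    return hits / len(ranked_texts_per_question)

# e.g. compare coverage_at_k(passages, patterns, 20) with coverage_at_k(passages, patterns, 5)
```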
QA-LaSIE
Derived from LaSIE: Large Scale Information Extraction System
LaSIE was developed to participate in the DARPA Message Understanding Conferences (MUC-6/7):
  Template filling (elements, relations, scenarios)
  Named Entity recognition
  Coreference identification
QA-LaSIE is a pipeline of 9 component modules – first 8 are borrowed (with minor modifications) from LaSIE
The question document and each candidate answer document pass through all nine components
Key difference between MUC and QA task: IE template filling tasks are domain-specific; QA is domain-independent
QA-LaSIE Components
1. Tokenizer. Identifies token boundaries and text section boundaries.
2. Gazetteer Lookup. Matches tokens against specialised lexicons (place names, person names, etc.). Labels them with appropriate name categories.
3. Sentence Splitter. Identifies sentence boundaries in the text body.
4. Brill Tagger. Assigns one of the 48 Penn TreeBank part-of-speech tags to each token in the text.
5. Tagged Morph. Identifies the root form and inflectional suffix for tokens tagged as nouns or verbs.
6. Parser. Performs two-pass bottom-up chart parsing, first with a special named entity grammar, then with a general phrasal grammar. A "best parse" (possibly partial) is selected and a quasi-logical form (QLF) representation of each sentence is constructed.
For the QA task, a special grammar module identifies the “sought entity” of a question and forms a special QLF representation for it.
QA-LaSIE Components (cont)
7. Name Matcher. Matches variants of named entities across the text.
8. Discourse Interpreter. Adds the QLF representation to a semantic net containing background world and domain knowledge. Additional info inferred from the input is added to the model, and coreference resolution is attempted between instances mentioned in the text.
For the QA task, special code was added to find and score a possible answer entity from each sentence in the answer texts.
9. TREC-9 Question Answering Module. Examines the scores for each possible answer entity, and then outputs the top 5 answers formatted for each of the four submitted runs.
New module for the QA task.
QA in Detail (1): Question Parsing
Phrase structure rules are used to parse different question types and produce a quasi-logical form (QLF) representation which contains:
  a qvar predicate identifying the sought entity
  a qattr predicate identifying the property or relation whose value is sought for the qvar (this may not always be present)

Q: Who released the internet worm?

Question QLF:
qvar(e1), qattr(e1,name), person(e1), release(e2), lsubj(e2,e1), lobj(e2,e3), worm(e3), det(e3,the), name(e4,'Internet'), qual(e3,e4)
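For concreteness, a QLF like the one above can be thought of as a flat set of predicates over entity and event variables. The tuple encoding below is purely illustrative; the real system builds these structures inside its own discourse machinery.

```python
# Each predicate is (name, args) for the question "Who released the internet worm?"
question_qlf = [
    ("qvar",  ("e1",)),            # e1 is the sought entity
    ("qattr", ("e1", "name")),     # we want its name
    ("person", ("e1",)),
    ("release", ("e2",)),
    ("lsubj", ("e2", "e1")),       # sought entity is logical subject of the releasing event
    ("lobj",  ("e2", "e3")),
    ("worm",  ("e3",)),
    ("det",   ("e3", "the")),
    ("name",  ("e4", "Internet")),
    ("qual",  ("e3", "e4")),
]

def sought_entity(qlf):
    """Return the qvar variable, i.e. the entity the question asks about."""
    return next(args[0] for name, args in qlf if name == "qvar")
```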
QA in Detail (2): Sentence/Entity Scoring
Two sentence-by-sentence passes through each candidate answer text
Sentence Scoring:
  The coreference system from the LaSIE discourse interpreter resolves coreferring entities both within answer texts and between answer and question texts
  The main verb in the question is matched to similar verbs in the answer text
  Each non-qvar entity in the question is a "constraint", and candidate answer sentences get one point for each constraint they contain
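A minimal sketch of the constraint-counting idea; it assumes question and answer-sentence entities have already been reduced to comparable root forms, which in the real system is the job of the preceding pipeline stages.

```python
def sentence_score(question_constraints, sentence_entities):
    """One point per question constraint (non-qvar entity) found in the sentence.

    Both arguments are assumed to be collections of normalised entity roots,
    e.g. {"worm", "internet"} for the worm question.
    """
    return sum(1 for c in question_constraints if c in set(sentence_entities))
```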
QA in Detail (2): Sentence/Entity Scoring (cont)
Entity Scoring: Each entity in each candidate answer sentence which was not matched to a term in the question at the sentence scoring stage receives a score based on:
a) semantic and property similarity to the qvar
b) whether it shares with the qvar the same relation to a matched verb (the lobj or lsubj relation)
c) whether it stands in a relation such as apposition, qualification or prepositional attachment to another entity in the answer sentence which was matched to a term in the question at the sentence scoring stage
Entity scores are normalised in the range [0-1] so that they never outweigh a better sentence match
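A sketch of how the three criteria might be combined into a single entity score; the particular weights and argument names are assumptions for illustration, not the system's actual values.

```python
def entity_score(semantic_sim, shares_verb_relation, attached_to_matched_entity,
                 weights=(0.5, 0.3, 0.2)):
    """Combine criteria (a)-(c) above into a score in [0, 1].

    semantic_sim: similarity of the entity to the qvar, already in [0, 1]      (a)
    shares_verb_relation: True if it has the same lsubj/lobj relation to a
        matched verb as the qvar does                                           (b)
    attached_to_matched_entity: True if it is in apposition, qualification or
        a prepositional attachment to an entity matched at sentence scoring     (c)
    """
    w_sim, w_verb, w_attach = weights
    score = (w_sim * semantic_sim
             + w_verb * float(shares_verb_relation)
             + w_attach * float(attached_to_matched_entity))
    return min(score, 1.0)  # stays within [0, 1] so it never outweighs a better sentence match
```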
QA in Detail (2): Sentence/Entity Scoring (cont)

Total Score:
  For each sentence a total score is computed by summing the sentence score and the "best entity score", then dividing by the number of entities in the question + 1 (this has no effect on the answer outcome but normalises scores into [0-1], which is useful for comparisons across questions)
Each sentence is annotated with:
  total sentence score
  "best entity"
  "exact answer" = name attribute of the best entity, if found
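The normalisation step can be restated as a toy function (a sketch of the rule above, not the system's code):

```python
def total_score(sentence_score, best_entity_score, num_question_entities):
    """Combined sentence/entity (CSE) score, normalised into [0, 1].

    sentence_score: at most one point per question constraint, so at most
        num_question_entities in total.
    best_entity_score: already normalised into [0, 1].
    """
    return (sentence_score + best_entity_score) / (num_question_entities + 1)

# Using the worked example later in the talk: sentence score 2, best entity 0.91,
# and (presumably) 2 question constraints gives (2 + 0.91) / 3 ≈ 0.97
```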
Question Answering in Detail: Answer Generation
The 5 highest scoring sentences from all 20 candidate answer texts were used as the basis for the TREC answer output
Results from 4 runs were submitted:
  shef50ea – output the name of the best entity if available; otherwise output its longest realization in the text
  shef50 – output the first occurrence of the best answer entity in the text: the entire sentence or a 50 byte window around the answer, whichever is shorter
  shef250 – same as shef50 but with a limit of 250 bytes
  shef250p – same as shef250 but with extra padding from the surrounding text allowed up to the 250 byte maximum
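A sketch of the 50-byte output rule for a shef50-style run; the function is illustrative only (the real run formatting also has to emit document identifiers and other bookkeeping).

```python
def shef50_style_answer(sentence, answer_start, answer_end, limit=50):
    """Return the whole sentence, or a window around the answer, whichever is shorter.

    answer_start / answer_end: character offsets of the answer entity in the sentence.
    """
    if len(sentence) <= limit:
        return sentence
    # Centre a window of `limit` bytes on the answer occurrence, clipped to the sentence
    centre = (answer_start + answer_end) // 2
    start = max(0, min(centre - limit // 2, len(sentence) - limit))
    return sentence[start:start + limit]
```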
Question Answering in Detail: An Example
Q: Who released the internet worm?
A: Morris testified that he released the internet worm…

Question QLF:
qvar(e1), qattr(e1,name), person(e1), release(e2), lsubj(e2,e1), lobj(e2,e3), worm(e3), det(e3,the), name(e4,'Internet'), qual(e3,e4)

Answer QLF:
person(e1), name(e1,'Morris'), testify(e2), lsubj(e2,e1), lobj(e2,e6), proposition(e6), main_event(e6,e3), release(e3), pronoun(e4,he), lsubj(e3,e4), worm(e5), lobj(e3,e5)

Sentence Score: 2
Entity Score (e1): 0.91
Total (normalized): 0.97

Answers:
  shef50ea: "Morris"
  shef50: "Morris testified that he released the internet wor"
  shef250: "Morris testified that he released the internet worm …"
  shef250p: "… Morris testified that he released the internet worm …"
Evaluation Results
Two sets of results:
  Development results on 198 TREC-8 questions
  Blind test results on 693 TREC-9 questions
Baseline experiment carried out using Okapi only:
  Take top 5 passages
  Return central 50/250 bytes
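A sketch of that baseline, assuming the top passages from the IR step are available as plain strings (the function name is a placeholder):

```python
def okapi_only_baseline(top_passages, byte_limit=50, num_answers=5):
    """Return the central byte_limit bytes of each of the top passages."""
    answers = []
    for passage in top_passages[:num_answers]:
        if len(passage) <= byte_limit:
            answers.append(passage)
        else:
            start = (len(passage) - byte_limit) // 2
            answers.append(passage[start:start + byte_limit])
    return answers
```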
Best Development Results on TREC-8 Questions

TREC-9 Results

TREC-9 50 Byte Runs

TREC-9 250 Byte Runs
The Role of Confidence in QA Systems
Little discussion to date concerning usability of QA systems, as conceptualised in the TREC QA task
Imagine asking How tall is the Eiffel Tower? and getting answers:
1. 400 meters (URL …)
2. 200 meters (URL …)
3. 300 meters (URL …)
4. 350 meters (URL …)
5. 250 meters (URL …)
There are several issues concerning the utility of such output, but two crucial ones are
a) How confident can we be in the system’s output?
b) How confident is the system in its own output?
The Role of Confidence in QA Systems (cont)
That these questions are important to users (question askers) is immediately apparent from watching any episode of the ITV quiz show Who Wants to be a Millionaire?
Participants are allowed to “phone a friend” as one of their “lifelines”, when confronted with a question they cannot answer.
Almost invariably they
a) select a friend who they feel is most likely to know the answer – i.e. they attach an a priori confidence rating to their friend's QA ability (How confident can we be in the system's output?)
b) ask their friend how confident they are in the answer they supply – i.e. they ask their friend to supply a confidence rating on their own performance (How confident is the system in its own output?)
MRR scores give an answer to a); however, to date there has been no exploration of b)
The Role of Confidence in QA Systems (cont)
QA-LaSIE associates a normalised score in the range [0-1] with each answer – the combined sentence/entity (CSE) score
Can the CSE scores be treated as confidence measures?
To determine this, we need to see whether CSE scores correlate with answer correctness
  Note this is also a test of whether the CSE measure is a good one
Have carried out an analysis of CSE scores for the shef50ea and shef250 runs on the TREC-8 question set:
  Rank all proposed answers by CSE score
  For 20, 10, and 5 equal subdivisions of the [0-1] CSE score range, determine the % of answers correct in each subdivision …
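The binning analysis can be sketched as below; the helper is hypothetical, and `answers` is assumed to be a list of (CSE score, judged-correct?) pairs collected from a run.

```python
def correctness_by_score_bin(answers, num_bins):
    """% of answers judged correct in each equal-width bin of the [0, 1] CSE range.

    answers: list of (cse_score, is_correct) pairs.
    Returns a list of (bin_lower_bound, percent_correct_or_None) per bin.
    """
    bins = [[] for _ in range(num_bins)]
    for score, correct in answers:
        index = min(int(score * num_bins), num_bins - 1)  # a score of 1.0 goes in the top bin
        bins[index].append(correct)
    results = []
    for i, judged in enumerate(bins):
        pct = 100.0 * sum(judged) / len(judged) if judged else None
        results.append((i / num_bins, pct))
    return results

# e.g. correctness_by_score_bin(run_answers, 20), (run_answers, 10), (run_answers, 5)
```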
Shef50ea: CSE vs. Correctness
Shef250: CSE vs. Correctness
Caveat: analysis based on unequal distribution of data points. For the .2 chunks:
Range     Data points
0-.19     115
.2-.39    511
.4-.59    306
.6-.79    45
.8-1.0    5
Applications of Confidence Measures
The CSE/Correctness correlation (preliminarily) established above indicates the CSE measure is a useful measure of confidence
How can we use this measure?
  Show it to the user – a good indicator of how much faith they should have in the answer, and of whether they should bother following up the URL to the source document
  In a more realistic setting, where not every question can be assumed to have an answer in the text collection, the CSE score may suggest a threshold below which "no answer" should be returned
• proposal for TREC-10
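A sketch of how such a threshold might be applied on top of the ranked answers; the threshold value is illustrative, and choosing it would itself be informed by the correlation analysis above.

```python
def answer_or_reject(ranked_answers, threshold=0.4):
    """Return the top answer only if its CSE score clears the threshold.

    ranked_answers: list of (answer_string, cse_score) sorted by score, best first.
    """
    if ranked_answers and ranked_answers[0][1] >= threshold:
        return ranked_answers[0][0]
    return "no answer"
```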
Conclusions and Discussion
TREC-9 test results represent a significant drop with respect to the best training results
  But, much better than TREC-8, vindicating the "looser" approach to matching answers
QA-LaSIE scores better than the Okapi baseline, suggesting NLP is playing a significant role
  But, a more intelligent baseline (e.g. selecting answer passages based on word overlap with the query) might prove otherwise
Computing confidence measures provides some support that our objective scoring function is sensible. They can be used for:
  User support
  Helping to establish thresholds for a "no answer" response
  Tuning parameters in the scoring function (ML techniques?)
Future Work
Failure analysis
  Okapi – for how many questions were no documents containing an answer found?
  Question parsing – how many question forms were unanalysable?
  Matching procedure – where did it break down?
Moving beyond word root matching – using WordNet?
Building an interactive demo to do QA against the web – Java applet interface to Google + QA-LaSIE running in Sheffield via CGI
  Gets the right answer to the million £ question "Who was the husband of Eleanor of Aquitaine?"!
THE END