Is That Your Final Answer? The Role of Confidence in Question Answering Systems
Robert Gaizauskas (1) and Sam Scott (2)
(1) Natural Language Processing Group, Department of Computer Science, University of Sheffield
(2) Centre for Interdisciplinary Studies, Carleton University
December 8, 2000 Dublin Computational Linguistics Research Seminar
Outline of Talk
Question Answering: A New Challenge in Information Retrieval
The TREC Question Answering Track
The Task
Evaluation Metrics
The Potential of NLP for Question Answering
The Sheffield QA System
Okapi
QA-LaSIE
Evaluation Results
Confidence Measures and their Application
Conclusions and Discussion
Question Answering: A New Challenge in IR
Traditionally, information retrieval systems are viewed as systems that return documents in response to a query
  Such systems are better termed document retrieval systems
  Once a document is returned, the user must search it to find the required information
  Acceptable if the documents returned are short, not too many are returned, and the information need is general
  Not acceptable if many documents are returned, the documents are very long, or the information need is very specific
Recently (1999, 2000) the TREC Question Answering (QA) track has been designed to address this issue
As construed in TREC, QA systems take natural language questions and a text collection as input and return specific answers (literal text strings) from documents in the text collection
QA: An (Incomplete) Historical Perspective
Question answering not a new topic:
  Erotetic logic (Harrah, 1984; Belnap and Steel, 1976)
  Deductive question answering work in AI (Green, 1969; Schubert, 1986)
  Conceptual theories of QA (Lehnert, 1977)
  Natural language front-ends to databases (Copestake, 1990; DARPA ATIS evaluations)
The TREC QA Track: Task Definition
Inputs:
  4 GB of newswire texts (from the TREC text collection)
  File of natural language questions (200 in TREC-8 / 700 in TREC-9)
e.g.
Where is the Taj Mahal?
How tall is the Eiffel Tower?
Who was Johnny Mathis’ high school track coach?
Outputs:
  Five ranked answers per question, including a pointer to the source document
  • 50 byte category
  • 250 byte category
  Up to two runs per category per site
Limitations:
  Each question has an answer in the text collection
  Each answer is a single literal string from a text (no implicit or multiple answers)
The TREC QA Track: Metrics and Scoring
The principal metric is Mean Reciprocal Rank (MRR)
  Correct answer at rank 1 scores 1
  Correct answer at rank 2 scores 1/2
  Correct answer at rank 3 scores 1/3
  …
  Sum over all questions and divide by the number of questions
More formally:

  $\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N} r_i$

where N = # questions, r_i = the reciprocal of the best (lowest) rank assigned by a system at which a correct answer is found for question i, or 0 if no correct answer was found
Judgements are made by human judges based on the answer string alone (lenient evaluation) and by reference to documents (strict evaluation)
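As a concrete illustration of the metric, here is a minimal sketch of an MRR computation in Python; the function name and input format are illustrative, not taken from the talk or from the TREC scoring software.

```python
def mean_reciprocal_rank(best_correct_ranks):
    """Compute MRR from a list with one entry per question.

    Each entry is the best (lowest) rank at which a correct answer
    appeared (1-5 in the TREC setting), or None if no correct answer
    was returned for that question.
    """
    n = len(best_correct_ranks)
    total = sum(1.0 / rank for rank in best_correct_ranks if rank is not None)
    return total / n

# Example: correct at rank 1, no correct answer, correct at rank 3
print(mean_reciprocal_rank([1, None, 3]))  # (1 + 0 + 1/3) / 3 ≈ 0.444
```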
The Potential of NLP for Question Answering
NLP has failed to deliver significant improvements in the document retrieval task.
Will the same be true of QA? It must depend on the definition of the task
Current TREC QA task is best construed as micro passage retrieval
There are a number of linguistic phenomena relevant to QA which suggest that NLP ought to be able to help, in principle.
But, it also now seems clear from TREC-9 results that NLP techniques do improve the effectiveness of QA systems in practice.
The Potential of NLP for Question Answering
Coreference
  Part of the information required to answer a question may occur in one sentence, while the rest occurs in another, linked via an anaphor. E.g.
    Question: How much did Mercury spend on advertising in 1993?
    Text: Mercury … Last year the company spent £12m on advertising.
Deixis
  References (possibly relative) to here and now may need to be correctly interpreted. E.g. answering the preceding question requires interpreting "last year" as 1993 via the date-line of the text (1994).
Grammatical knowledge
  Difference in grammatical role can be of crucial importance. E.g. the
    Question: Which company took over Microsoft?
  cannot be answered from the
    Text: Microsoft took over Entropic.
The Potential of NLP for Question Answering (cont)
Semantic knowledge
  Entailments based on lexical semantics may need to be computed. E.g. to answer the
    Question: At what age did Rossini stop writing opera?
  using the
    Text: Rossini … did not write another opera after he was 35.
  requires knowing that stopping X at time t means not doing X after t.
World knowledge
  World knowledge may be required to interpret linguistic expressions. E.g. to answer the
    Question: In which city is the Eiffel Tower?
  using the
    Text: The Eiffel Tower is in Paris.
  but not the
    Text: The Eiffel Tower is in France.
  requires the knowledge that Paris is a city and France a country.
Sheffield QA System Architecture
Overall objective is to use:
IR system as fast filter to select small set of documents with high relevance to query from the initial, large text collection
IE system to perform slow, detailed linguistic analysis to extract answer from limited set of docs proposed by IR system
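A schematic sketch of this two-stage architecture; the function names are placeholders standing in for Okapi and QA-LaSIE, not the system's actual interfaces.

```python
def answer_question(question, collection, retrieve_passages, extract_answers, k=20):
    """Two-stage QA: fast IR filter, then slow IE analysis.

    retrieve_passages: callable standing in for the IR system (e.g. Okapi
        passage retrieval), returning the k most relevant passages.
    extract_answers: callable standing in for the IE system (e.g. QA-LaSIE),
        performing detailed linguistic analysis of just those passages.
    """
    candidates = retrieve_passages(question, collection, k)   # fast, coarse filter
    return extract_answers(question, candidates)              # slow, detailed analysis
```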
Okapi
Used “off the shelf” – available from http://www.soi.city.ac.uk/research/cisr/okapi/okapi.html
Based on the probabilistic retrieval model (Robertson and Sparck Jones, 1976)
Used passage retrieval capabilities of Okapi
Passage retrieval parameters: min. passage: 1 paragraph; max. passage: 3 paragraphs; paragraph step unit: 1
  Arrived at by experimentation on TREC-8 data
  Examined trade-offs between:
    number of documents and "answer loss":
      184/198 questions had an answer in the top 20 full documents; 160/198 in the top 5
    passage length and "answer loss":
      only 2 answers lost from the top 5 3-paragraph passages
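A hedged sketch of the kind of trade-off analysis described above: given ranked retrieval results and known answer patterns per question, count how many questions still have an answer in the top k. All names and the regex-based matching are assumptions for illustration, not the actual experimental code.

```python
import re

def coverage_at_k(ranked_texts_per_question, answer_patterns, k):
    """Fraction of questions whose answer occurs in the top-k retrieved texts.

    ranked_texts_per_question: {qid: [text ranked 1st, text ranked 2nd, ...]}
    answer_patterns: {qid: regular expression matching an acceptable answer}
    """
    hits = 0
    for qid, texts in ranked_texts_per_question.items():
        pattern = re.compile(answer_patterns[qid], re.IGNORECASE)
        if any(pattern.search(text) for text in texts[:k]):
            hits += 1
    return hits / len(ranked_texts_per_question)

# e.g. compare coverage_at_k(passages, patterns, 20) with coverage_at_k(passages, patterns, 5)
```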
QA-LaSIE
Derived from LaSIE: Large Scale Information Extraction System
LaSIE was developed to participate in the DARPA Message Understanding Conferences (MUC-6/7):
  Template filling (elements, relations, scenarios)
  Named Entity recognition
  Coreference identification
QA-LaSIE is a pipeline of 9 component modules – first 8 are borrowed (with minor modifications) from LaSIE
The question document and each candidate answer document pass through all nine components
Key difference between MUC and QA task: IE template filling tasks are domain-specific; QA is domain-independent
QA-LaSIE Components
1. Tokenizer. Identifies token boundaries and text section boundaries.
2. Gazetteer Lookup. Matches tokens against specialised lexicons (place names, person names, etc.). Labels them with appropriate name categories.
3. Sentence Splitter. Identifies sentence boundaries in the text body.
4. Brill Tagger. Assigns one of the 48 Penn TreeBank part-of-speech tags to each token in the text.
5. Tagged Morph. Identifies the root form and inflectional suffix for tokens tagged as nouns or verbs.
6. Parser. Performs two-pass bottom-up chart parsing, first with a special named entity grammar, then with a general phrasal grammar. A "best parse" (possibly partial) is selected and a quasi-logical form (QLF) representation of each sentence is constructed.
For the QA task, a special grammar module identifies the “sought entity” of a question and forms a special QLF representation for it.
QA-LaSIE Components (cont)
7. Name Matcher. Matches variants of named entities across the text.
8. Discourse Interpreter. Adds the QLF representation to a semantic net containing background world and domain knowledge. Additional info inferred from the input is added to the model, and coreference resolution is attempted between instances mentioned in the text.
For the QA task, special code was added to find and score a possible answer entity from each sentence in the answer texts.
9. TREC-9 Question Answering Module. Examines the scores for each possible answer entity, and then outputs the top 5 answers formatted for each of the four submitted runs.
New module for the QA task.
QA in Detail (1): Question Parsing
Phrase structure rules are used to parse different question types and produce a quasi-logical form (QLF) representation which contains:
  a qvar predicate identifying the sought entity
  a qattr predicate identifying the property or relation whose value is sought for the qvar (this may not always be present)

Q: Who released the internet worm?

Question QLF:
qvar(e1), qattr(e1,name), person(e1), release(e2), lsubj(e2,e1), lobj(e2,e3), worm(e3), det(e3,the), name(e4,'Internet'), qual(e3,e4)
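For concreteness, a QLF like the one above can be thought of as a flat set of predicates over entity and event variables. The tuple encoding below is purely illustrative; the real system builds these structures inside its own discourse machinery.

```python
# Each predicate is (name, args) for the question "Who released the internet worm?"
question_qlf = [
    ("qvar",  ("e1",)),            # e1 is the sought entity
    ("qattr", ("e1", "name")),     # we want its name
    ("person", ("e1",)),
    ("release", ("e2",)),
    ("lsubj", ("e2", "e1")),       # sought entity is logical subject of the releasing event
    ("lobj",  ("e2", "e3")),
    ("worm",  ("e3",)),
    ("det",   ("e3", "the")),
    ("name",  ("e4", "Internet")),
    ("qual",  ("e3", "e4")),
]

def sought_entity(qlf):
    """Return the qvar variable, i.e. the entity the question asks about."""
    return next(args[0] for name, args in qlf if name == "qvar")
```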
QA in Detail (2): Sentence/Entity Scoring
Two sentence-by-sentence passes through each candidate answer text
Sentence Scoring:
  The coreference system from the LaSIE discourse interpreter resolves coreferring entities both within answer texts and between answer and question texts
  The main verb in the question is matched to similar verbs in the answer text
  Each non-qvar entity in the question is a "constraint", and candidate answer sentences get one point for each constraint they contain
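A minimal sketch of the constraint-counting idea; it assumes question and answer-sentence entities have already been reduced to comparable root forms, which in the real system is the job of the preceding pipeline stages.

```python
def sentence_score(question_constraints, sentence_entities):
    """One point per question constraint (non-qvar entity) found in the sentence.

    Both arguments are assumed to be collections of normalised entity roots,
    e.g. {"worm", "internet"} for the worm question.
    """
    return sum(1 for c in question_constraints if c in set(sentence_entities))
```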
QA in Detail (2): Sentence/Entity Scoring (cont)
Entity Scoring: Each entity in each candidate answer sentence which was not matched to a term in the question at the sentence scoring stage receives a score based on:
a) semantic and property similarity to the qvar
b) whether it shares with the qvar the same relation to a matched verb (the lobj or lsubj relation)
c) whether it stands in a relation such as apposition, qualification or prepositional attachment to another entity in the answer sentence which was matched to a term in the question at the sentence scoring stage
Entity scores are normalised in the range [0-1] so that they never outweigh a better sentence match
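A sketch of how the three criteria might be combined into a single entity score; the particular weights and argument names are assumptions for illustration, not the system's actual values.

```python
def entity_score(semantic_sim, shares_verb_relation, attached_to_matched_entity,
                 weights=(0.5, 0.3, 0.2)):
    """Combine criteria (a)-(c) above into a score in [0, 1].

    semantic_sim: similarity of the entity to the qvar, already in [0, 1]      (a)
    shares_verb_relation: True if it has the same lsubj/lobj relation to a
        matched verb as the qvar does                                           (b)
    attached_to_matched_entity: True if it is in apposition, qualification or
        a prepositional attachment to an entity matched at sentence scoring     (c)
    """
    w_sim, w_verb, w_attach = weights
    score = (w_sim * semantic_sim
             + w_verb * float(shares_verb_relation)
             + w_attach * float(attached_to_matched_entity))
    return min(score, 1.0)  # stays within [0, 1] so it never outweighs a better sentence match
```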
QA in Detail (2): Sentence/Entity Scoring (cont)

Total Score:
  For each sentence a total score is computed by summing the sentence score and the "best entity score", then dividing by the number of entities in the question + 1 (this has no effect on the answer outcome but normalises scores into [0-1], which is useful for comparisons across questions)
Each sentence is annotated with:
  total sentence score
  "best entity"
  "exact answer" = name attribute of the best entity, if found
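The normalisation step can be restated as a toy function (a sketch of the rule above, not the system's code):

```python
def total_score(sentence_score, best_entity_score, num_question_entities):
    """Combined sentence/entity (CSE) score, normalised into [0, 1].

    sentence_score: at most one point per question constraint, so at most
        num_question_entities in total.
    best_entity_score: already normalised into [0, 1].
    """
    return (sentence_score + best_entity_score) / (num_question_entities + 1)

# Using the worked example later in the talk: sentence score 2, best entity 0.91,
# and (presumably) 2 question constraints gives (2 + 0.91) / 3 ≈ 0.97
```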
Question Answering in Detail: Answer Generation
The 5 highest scoring sentences from all 20 candidate answer texts were used as the basis for the TREC answer output
Results from 4 runs were submitted:
  shef50ea – output the name of the best entity if available; otherwise output its longest realization in the text
  shef50 – output the first occurrence of the best answer entity in the text: the entire sentence or a 50 byte window around the answer, whichever is shorter
  shef250 – same as shef50 but with a limit of 250 bytes
  shef250p – same as shef250 but with extra padding from the surrounding text allowed up to the 250 byte maximum
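A sketch of the 50-byte output rule for a shef50-style run; the function is illustrative only (the real run formatting also has to emit document identifiers and other bookkeeping).

```python
def shef50_style_answer(sentence, answer_start, answer_end, limit=50):
    """Return the whole sentence, or a window around the answer, whichever is shorter.

    answer_start / answer_end: character offsets of the answer entity in the sentence.
    """
    if len(sentence) <= limit:
        return sentence
    # Centre a window of `limit` bytes on the answer occurrence, clipped to the sentence
    centre = (answer_start + answer_end) // 2
    start = max(0, min(centre - limit // 2, len(sentence) - limit))
    return sentence[start:start + limit]
```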
Question Answering in Detail: An Example
Q: Who released the internet worm?
A: Morris testified that he released the internet worm…

Question QLF:
qvar(e1), qattr(e1,name), person(e1), release(e2), lsubj(e2,e1), lobj(e2,e3), worm(e3), det(e3,the), name(e4,'Internet'), qual(e3,e4)

Answer QLF:
person(e1), name(e1,'Morris'), testify(e2), lsubj(e2,e1), lobj(e2,e6), proposition(e6), main_event(e6,e3), release(e3), pronoun(e4,he), lsubj(e3,e4), worm(e5), lobj(e3,e5)

Sentence Score: 2
Entity Score (e1): 0.91
Total (normalized): 0.97

Answers:
  shef50ea: "Morris"
  shef50: "Morris testified that he released the internet wor"
  shef250: "Morris testified that he released the internet worm …"
  shef250p: "… Morris testified that he released the internet worm …"
Evaluation Results
Two sets of results:
  Development results on 198 TREC-8 questions
  Blind test results on 693 TREC-9 questions
Baseline experiment carried out using Okapi only:
  Take top 5 passages
  Return central 50/250 bytes
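A sketch of that baseline, assuming the top passages from the IR step are available as plain strings (the function name is a placeholder):

```python
def okapi_only_baseline(top_passages, byte_limit=50, num_answers=5):
    """Return the central byte_limit bytes of each of the top passages."""
    answers = []
    for passage in top_passages[:num_answers]:
        if len(passage) <= byte_limit:
            answers.append(passage)
        else:
            start = (len(passage) - byte_limit) // 2
            answers.append(passage[start:start + byte_limit])
    return answers
```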
Best Development Results on TREC-8 Questions

TREC-9 Results

TREC-9 50 Byte Runs

TREC-9 250 Byte Runs
The Role of Confidence in QA Systems
Little discussion to date concerning usability of QA systems, as conceptualised in the TREC QA task
Imagine asking How tall is the Eiffel Tower? and getting answers:
1. 400 meters (URL …)
2. 200 meters (URL …)
3. 300 meters (URL …)
4. 350 meters (URL …)
5. 250 meters (URL …)
There are several issues concerning the utility of such output, but two crucial ones are
a) How confident can we be in the system’s output?
b) How confident is the system in its own output?
The Role of Confidence in QA Systems (cont)
That these questions are important to users (question askers) is immediately apparent from watching any episode of the ITV quiz show Who Wants to be a Millionaire?
Participants are allowed to “phone a friend” as one of their “lifelines”, when confronted with a question they cannot answer.
Almost invariably they
a) select a friend who they feel is most likely to know the answer – i.e. they attach an a priori confidence rating to their friend's QA ability (How confident can we be in the system's output?)
b) ask their friend how confident they are in the answer they supply – i.e. they ask their friend to supply a confidence rating on their own performance (How confident is the system in its own output?)
MRR scores give an answer to a); however, to date there has been no exploration of b)
The Role of Confidence in QA Systems (cont)
QA-LaSIE associates a normalised score in the range [0-1] with each answer – the combined sentence/entity (CSE) score
Can the CSE scores be treated as confidence measures?
To determine this, we need to see whether CSE scores correlate with answer correctness
  Note this is also a test of whether the CSE measure is a good one
Have carried out an analysis of CSE scores for the shef50ea and shef250 runs on the TREC-8 question set:
  Rank all proposed answers by CSE score
  For 20, 10, and 5 equal subdivisions of the [0-1] CSE score range, determine the % of answers correct in each subdivision …
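The binning analysis can be sketched as below; the helper is hypothetical, and `answers` is assumed to be a list of (CSE score, judged-correct?) pairs collected from a run.

```python
def correctness_by_score_bin(answers, num_bins):
    """% of answers judged correct in each equal-width bin of the [0, 1] CSE range.

    answers: list of (cse_score, is_correct) pairs.
    Returns a list of (bin_lower_bound, percent_correct_or_None) per bin.
    """
    bins = [[] for _ in range(num_bins)]
    for score, correct in answers:
        index = min(int(score * num_bins), num_bins - 1)  # a score of 1.0 goes in the top bin
        bins[index].append(correct)
    results = []
    for i, judged in enumerate(bins):
        pct = 100.0 * sum(judged) / len(judged) if judged else None
        results.append((i / num_bins, pct))
    return results

# e.g. correctness_by_score_bin(run_answers, 20), (run_answers, 10), (run_answers, 5)
```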
Shef50ea: CSE vs. Correctness
Shef250: CSE vs. Correctness
Caveat: analysis based on unequal distribution of data points. For the .2 chunks:
Range     Data points
0-.19     115
.2-.39    511
.4-.59    306
.6-.79    45
.8-1.0    5
Applications of Confidence Measures
The CSE/Correctness correlation (preliminarily) established above indicates the CSE measure is a useful measure of confidence
How can we use this measure?
  Show it to the user – a good indicator of how much faith they should have in the answer, and of whether they should bother following up the URL to the source document
  In a more realistic setting, where not every question can be assumed to have an answer in the text collection, the CSE score may suggest a threshold below which "no answer" should be returned
• proposal for TREC-10
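A sketch of how such a threshold might be applied on top of the ranked answers; the threshold value is illustrative, and choosing it would itself be informed by the correlation analysis above.

```python
def answer_or_reject(ranked_answers, threshold=0.4):
    """Return the top answer only if its CSE score clears the threshold.

    ranked_answers: list of (answer_string, cse_score) sorted by score, best first.
    """
    if ranked_answers and ranked_answers[0][1] >= threshold:
        return ranked_answers[0][0]
    return "no answer"
```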
Conclusions and Discussion
TREC-9 test results represent a significant drop with respect to the best training results
  But, much better than TREC-8, vindicating the "looser" approach to matching answers
QA-LaSIE scores better than the Okapi baseline, suggesting NLP is playing a significant role
  But, a more intelligent baseline (e.g. selecting answer passages based on word overlap with the query) might prove otherwise
Computing confidence measures provides some support that our objective scoring function is sensible. They can be used for:
  User support
  Helping to establish thresholds for a "no answer" response
  Tuning parameters in the scoring function (ML techniques?)
Future Work
Failure analysis
  Okapi – for how many questions were no documents containing an answer found?
  Question parsing – how many question forms were unanalysable?
  Matching procedure – where did it break down?
Moving beyond word root matching – using WordNet?
Building an interactive demo to do QA against the web – Java applet interface to Google + QA-LaSIE running in Sheffield via CGI
  Gets the right answer to the million £ question "Who was the husband of Eleanor of Aquitaine?"!
THE END