Relevance Models for QA

Project Update
University of Massachusetts, Amherst

AQUAINT meeting
December, 2002

Bruce Croft and James Allan, PIs

UMass AQUAINT Project Status

Question answering using language models
  Carried out more experiments using the basic LM approach
  Developed new model(s); starting more experiments
  Moved experiments to the LEMUR toolkit

Query triage
  Studied the Clarity measure for questions

Question answering with semi-structured data
  Developed HMM- and CRF-based table extractors
  More experiments on question answering with table structure

Answer updating
  Experiments with time-based questions

QA using LM

P(Answer|Question) can be estimated many ways
  Could be done directly, but usually will involve intermediate steps such as documents, question classes
  Initially focused on answer passages, but “extracted” answers can be modeled
  Can model “templates” as well as n-gram answer models
  Can also introduce cross-lingual QA through P(A_lang1|Q_lang2)

Every approach requires training data
  “answer mining” for answer models/templates
  incorporating user feedback
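As a hedged illustration of the "intermediate steps" mentioned above, one common way to write the estimate is to marginalize over retrieved documents D (question classes could play the same role). The choice of D as the intermediate variable and the independence assumption in the second step are illustrative assumptions, not the project's stated model.

```latex
P(A \mid Q) \;=\; \sum_{D} P(A \mid D, Q)\, P(D \mid Q)
\;\approx\; \sum_{D} P(A \mid D)\, P(D \mid Q)
```

The first equality is the law of total probability; the approximation assumes the answer depends on the question only through the document.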

Query Triage

Given a question, what can we infer from it?
  Query vs. question
  Quality (does it need to be made more precise?)
  Type (likely form of answers and granularity)
  Human intermediation (should it be directed to a human expert?)

Previous work developed “Clarity” measure for queries and tested on TREC ad-hoc data
  Demonstrated high correlation with performance
  Threshold can be set automatically

Current research focuses on TREC QA data

Basic result: We can predict question performance (with some qualifications)
  Did not work for some TREC question classes

For example:
  What is the date of Bastille Day? (TREC-9P Clarity score 2.49)
  What time of year do most people fly? (TREC-9P Clarity score 0.76)

Predicting Question Performance

[Figure: Log P (y-axis, -6 to 0) of terms (x-axis) under the Collection LM and the Question LM; labeled terms: “the”, “do”, “day”, “what”, “celebrate”, “paris”, “bastille”, “assmann”]

Clarity score computation

[Diagram: the question Q (text) is used to retrieve passages A, ranked by P(A|Q); the retrieved passages give a question-related language model; the divergence between the question-related language model and the collection language model is the Clarity score]
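The computation outlined above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes unigram models, linear smoothing of passage models with the collection model (the weight `lam` is an assumed value), a uniform prior over passages, and KL divergence in bits as the "divergence" step. All function and parameter names are hypothetical.

```python
import math
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def clarity_score(question, passages, collection_model, lam=0.6):
    """KL divergence (base 2) between a question-related LM and the collection LM.

    question: list of tokens; passages: list of token lists (retrieved text);
    collection_model: dict term -> P(term | collection);
    lam: assumed interpolation weight for smoothing passage models.
    """
    def p_term(w, model):
        # Linearly smoothed P(w | passage).
        return lam * model.get(w, 0.0) + (1 - lam) * collection_model.get(w, 0.0)

    models = [unigram_model(p) for p in passages]
    # P(Q | A): likelihood of the question under each smoothed passage model.
    likelihoods = [math.prod(p_term(w, m) for w in question) for m in models]
    norm = sum(likelihoods)
    if norm == 0:
        return 0.0
    posteriors = [l / norm for l in likelihoods]  # P(A | Q) with a uniform prior

    score = 0.0
    # The vocabulary is taken from the collection model.
    for w, p_coll in collection_model.items():
        # P(w | Q) = sum over passages of P(w | A) * P(A | Q)
        p_q = sum(p_term(w, m) * post for m, post in zip(models, posteriors))
        if p_q > 0 and p_coll > 0:
            score += p_q * math.log2(p_q / p_coll)
    return score
```

Under these assumptions, a question whose retrieved passages look like the collection as a whole scores near 0, while a question whose passages concentrate on a few terms scores higher.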

Clarity Example (for queries)

[Figure: p_q log2(p_q/p_c) (y-axis) against term rank (x-axis) for the two queries below]

56.08 "What adjustments should be made when federal action occurs?" (clar. 0.37)
  Top 6 terms in query model: 1. "adjust" 2. "federal" 3. "action" 4. "land" 5. "occur" 6. "hyundai"

56.12 "Show me predictions for changes in the prime lending rate and any changes made in the prime lending rates" (clar. 2.85)
  Top 6 terms in query model: 1. "bank" 2. "hong" 3. "kong" 4. "rate" 5. "lend" 6. "prime"

Test System

Passages: two sentences, overlapping, from top retrieved docs for all questions

Measuring performance:
  Question likelihood used to rank passages
  Average precision (rather than MRR)
  Top 8 documents used to estimate Clarity scores
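A small sketch of the test setup described above: overlapping two-sentence passages, ranking by question likelihood, and average precision. The smoothing scheme, the weight `lam`, the floor for unseen terms, and all function names are assumptions made for illustration; sentence splitting and tokenization are assumed to happen upstream.

```python
import math
from collections import Counter

def two_sentence_passages(sentences):
    """Overlapping two-sentence windows: (s1 s2), (s2 s3), ..."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    return [f"{a} {b}" for a, b in zip(sentences, sentences[1:])]

def question_log_likelihood(question, passage, collection_model, lam=0.5):
    """log P(Q | passage) under a linearly smoothed unigram passage model."""
    counts = Counter(passage)
    total = len(passage)
    score = 0.0
    for w in question:
        p_passage = counts[w] / total if total else 0.0
        # Floor for terms missing from the collection model (assumed value).
        p_coll = collection_model.get(w, 1e-9)
        score += math.log(lam * p_passage + (1 - lam) * p_coll)
    return score

def rank_passages(question, passages, collection_model):
    """Passages (token lists) sorted by decreasing question likelihood."""
    return sorted(
        passages,
        key=lambda p: question_log_likelihood(question, p, collection_model),
        reverse=True,
    )

def average_precision(ranked_relevance):
    """Mean of precision@k over the ranks k that hold a relevant passage.

    ranked_relevance: booleans in rank order. Assumes every relevant
    passage appears somewhere in the ranking.
    """
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

Unlike MRR, which looks only at the first relevant passage, average precision rewards placing all relevant passages near the top.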

Precision vs. Clarity (Time Qs)

[Figure: scatter plot of Average Precision (y-axis, 0 to 1) against Clarity Score (x-axis, 0 to 3) for Time questions. Labeled points: "What is the date of Bastille Day?", "What time of year do most people fly?", "What is Martin Luther King Jr's real birthday?"]

Correlation by Question Type

Question Type    # of Qs   Rank Correlation (R)   P-Value
Amount           33        -0.132                 0.77
Famous           74         0.197                 0.046
Location         100        0.386                 0.000062
Person           93        -0.109                 0.85
Time             47         0.458                 0.00094
Miscellaneous    130        0.355                 0.000028

Strong on average
Allows prediction of question performance
Variation with question type

Two bad (R<0) cases: Amount and Person
  Amount: only has 33 questions, only a few bad Qs
  Person: 93 questions, plenty of bad Qs to analyze

What’s going on?
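For reference, a rank correlation between Clarity scores and average precision can be computed as sketched here. The slides say only "rank correlation"; this sketch assumes Spearman's coefficient (Pearson correlation of the ranks, with tied values sharing their mean rank). Function names are hypothetical, and the p-values in the table are not reproduced by this code.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rank_correlation(xs, ys):
    """Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5
```

Here `xs` would hold the Clarity scores and `ys` the average precision values for the questions of one type.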

Correlation Analysis

Two kinds of mistakes:

High clarity, low average precision
  E.g. What is Martin Luther King Jr's real birthday?
  Answerless, coherent, very likely context in collection
  Rare (good thing for the method)

Low clarity, high average precision
  Various kinds of bad luck
  Often coupled with few relevant passages
  Many examples in Person case…

Predictive Mistakes

[Figure: Ave. Precision (y-axis, 0 to 1) against Clarity Score (x-axis, 0 to 3)]

Precision vs. Clarity (Person Qs)

15 “really bad” mistakes
  “Really bad” ≡ clarity score < 30th percentile and ave. precision > 70th percentile

8 with many relevant answer passages (> 50)
  5 (one-third) are slight variants of Who created “The Muppets”?
  2 variants of What king signed the Magna Carta?
  1 other question with plenty of relevants

7 with few relevant answer passages
  E.g. Silly Putty was invented by whom?, 2 rels
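The "really bad" criterion above can be expressed as a simple filter. This is a sketch under an assumed percentile definition (the percentage of questions with a strictly smaller value); the slides do not say how percentiles were computed, and the function names are hypothetical.

```python
def percentile_ranks(values):
    """For each value, the percentage (0-100) of values strictly below it."""
    n = len(values)
    return [100.0 * sum(v < x for v in values) / n for x in values]

def really_bad_mistakes(questions):
    """Questions with low Clarity but high average precision.

    questions: list of (text, clarity, average_precision) tuples.
    Returns the texts whose clarity is below the 30th percentile and whose
    average precision is above the 70th percentile.
    """
    clarity_pct = percentile_ranks([q[1] for q in questions])
    precision_pct = percentile_ranks([q[2] for q in questions])
    return [
        q[0]
        for q, c, p in zip(questions, clarity_pct, precision_pct)
        if c < 30 and p > 70
    ]
```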

[Figure: scatter plot of Ave. Precision (y-axis, 0 to 1) against Clarity Score (x-axis, 0.8 to 2.8) for Person questions]

QA using Tables

Developed and tested QuASM demonstration system using non-LM techniques
  extraction of tabular structure
  answer passages constructed from extracted data and metadata
  extension of question types for “statistical” data
  failure analysis

Major focus now is to develop probabilistic framework for whole process
  tabular structure extraction
  answer passage representation
  P(Answer|Question)

QuASM – Lessons Learned

Much harder to find answers in tables than in text
Table extraction is the key issue
Representation of answer passages also very important
  what is an answer passage for tables?
  e.g. too much metadata can cause poor retrieval

Table Extraction
  Heuristics do a good job of identifying tables
    97.8% of lines labeled correctly as in or out of table
  Small labeling errors, however, can lead to poor retrieval
  Current algorithm for extracting header information too permissive

Text Table Transformation

<h4><pre><font color=maroon> Number and Percent of Children under 19 Years of Age, at or below 200 Percent of Poverty, by State: Three-Year Averages for 1997, 1998, and 1999. (Numbers in Thousands)</font>
_________________________________________________________________________________
                      |        AT OR BELOW         | AT OR BELOW 200% OF POVERTY |
       Total children |      200% OF POVERTY       |  WITHOUT HEALTH INSURANCE   |
      under 19 years, |____________________________|_____________________________|
    all income levels |      Standard      Standard|       Standard     Standard |
                      |Number  error   Pct.  error |Number   error   Pct.  error |
______________________|____________________________|_____________________________|
Alabama.......  1,114 |   499   45.8   44.6    3.1 |   106    21.3    9.6    1.8 |
Alaska........    215 |    63    6.4   29.4    2.5 |    18     3.4    8.3    1.5 |
Arizona.......  1,430 |   730   54.7   51.1    2.7 |   272    33.6   19.0    2.1 |
Arkansas......    740 |   377   30.5   50.5    2.9 |   111    16.5   14.7    2.0 |

Text Table Transformation - Problems

<QA_SECTION>
<TITLE> (Numbers in Thousands)</font> </TITLE>
<CAPTIONS>
| AT OR BELOW | AT OR BELOW 200% OF POVERTY |
Total children | 200% OF POVERTY | WITHOUT HEALTH INSURANCE |
under 19 years, |____________________________|_____________________________|
all income levels | Standard Standard| Standard Standard |
|Number error Pct. error |Number error Pct. error |
</CAPTIONS>
<ROW> Alabama....... </ROW>
<COLUMN> AT OR BELOW 200% OF POVERTY ____________________________ Standard Number </COLUMN>. | 499
</QA_SECTION>

Missed part of title due to lack of indentation

Extraneous text

New Labeling

Features
  3 Cells
  2 Gaps
  Mostly Letters
  Mostly Digits
  Header Like
  Dashes
  Starts with Spaces
  Consecutive Spaces
  All White Space

Line Tags
  NONTABLE
  BLANKLINE
  TITLE
  SUPERHEADER
  TABLEHEADER
  SUBHEADER
  DATAROW
  SEPARATOR
  SECTIONHEADER
  SECTIONDATAROW
  TABLEFOOTNOTE
  TABLECAPTION
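To make the nine binary line features concrete, here is one possible way to compute them for a single text line. The slides give only the feature names, so every definition below (what counts as a cell or a gap, the 50% threshold for "mostly", all-caps as "header like", the order of the bits) is a guess made for illustration and may differ from the project's extractor.

```python
import re

FEATURE_NAMES = [
    "3 Cells", "2 Gaps", "Mostly Letters", "Mostly Digits", "Header Like",
    "Dashes", "Starts with Spaces", "Consecutive Spaces", "All White Space",
]

def line_features(line):
    """Return a 9-element 0/1 vector for one text line (definitions are guesses)."""
    stripped = line.strip()
    # Cells: chunks separated by runs of 2+ whitespace characters or by '|'.
    cells = [c for c in re.split(r"\s{2,}|\|", stripped) if c.strip()]
    # Gaps: runs of 2+ whitespace characters between non-space characters.
    gaps = re.findall(r"\S\s{2,}(?=\S)", stripped)
    non_space = [ch for ch in line if not ch.isspace()]
    n = len(non_space)
    letters = sum(ch.isalpha() for ch in non_space)
    digits = sum(ch.isdigit() for ch in non_space)
    return [
        int(len(cells) >= 3),                       # 3 Cells
        int(len(gaps) >= 2),                        # 2 Gaps
        int(n > 0 and letters / n > 0.5),           # Mostly Letters
        int(n > 0 and digits / n > 0.5),            # Mostly Digits
        int(n > 0 and letters > 0 and stripped == stripped.upper()),  # Header Like (all caps, assumed)
        int(bool(re.search(r"[-_]{3,}", line))),    # Dashes
        int(line.startswith(" ")),                  # Starts with Spaces
        int(bool(re.search(r"\S {2,}\S", line))),   # Consecutive Spaces
        int(n == 0),                                # All White Space
    ]
```

Each line of a document maps to one such vector, which is the observation the sequence model sees.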

Text Table Extraction Model

[Diagram: finite state machine (hidden Markov process) over the line states Non-Table, Title, Super Header, Table Header, Subheader and Data Row]

Feature vector   State
<100001000>      Non-Table
<111001000>      Title
<110101000>      Super Header
<111101000>      Table Header
<010100100>      Data Row
<110100100>      Data Row

The state sequence is inferred probabilistically from the visible feature vectors.
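Inferring the most likely state sequence from the observed feature vectors is typically done with the Viterbi algorithm. The slides do not say which decoder was used, so the following is a generic sketch: each feature vector is treated as one discrete observation symbol, the probabilities are supplied by the caller, and the floor value for missing entries is an assumption.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for a non-empty observation sequence.

    start_p[s], trans_p[r][s] and emit_p[s][obs] are probabilities; missing
    or zero entries are replaced by a tiny floor (assumed) to avoid log(0).
    """
    floor = 1e-12

    def lg(p):
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: lg(start_p.get(s, 0)) + lg(emit_p[s].get(observations[0], 0))
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((r, best[-1][r] + lg(trans_p[r].get(s, 0))) for r in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score + lg(emit_p[s].get(obs, 0))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

In the table-extraction setting, `states` would be the line tags and `observations` the per-line feature vectors (for example as strings such as "<100001000>").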

Features for Table Extraction

These features are not independent
  Many correlations
  Overlapping and long-distance dependencies
  Observations from the past and future

Features: 3 Cells, 2 Gaps, Mostly Letters, Mostly Digits, Header Like, Dashes, Starts with Spaces, Consecutive Spaces, All White Space

Hidden Markov Models

Feature vectors: <100001000> <111001000> <110101000> <111101000> <010100100> <110100100>

Observations are conditioned on state

HMMs are the standard sequence model
They are a generative model of the sequence
Generative models do not easily handle non-independent features.

Conditional Random Fields

Feature vectors: <100001000> <111001000> <110101000> <111101000> <010100100> <110100100>

State sequence is conditioned on entire observation sequence.

A conditional model:
  Can examine features, but is not responsible for generating them.
  Doesn’t have to explicitly model their dependencies.
  Has the ability to handle many arbitrary features with the full power of finite state automata.
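For reference, the standard linear-chain CRF makes "conditioned on the entire observation sequence" precise as follows. The symbols are introduced here for illustration: s is the state (line tag) sequence, o the observation sequence, f_k are feature functions and λ_k their learned weights. The slides do not give the exact parameterization used in the project.

```latex
P(s \mid o) \;=\; \frac{1}{Z(o)} \exp\!\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, o, t) \Bigr),
\qquad
Z(o) \;=\; \sum_{s'} \exp\!\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s'_{t-1}, s'_t, o, t) \Bigr)
```

Because each f_k may inspect any part of o, including earlier and later lines, correlated and overlapping features need no independence assumption.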

Results

Experiment                       Percentage of Lines Labeled Correctly
Random, Training Data MLE        11.4%
HMM                              83.0%
Fully Connected CRF              93.3%
Original Heuristic (4 labels)    77.0%

Label six test documents, total of 5817 lines.
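The metric in the table, the percentage of lines labeled correctly, can be computed as in this small sketch. The function name and the use of plain tag strings are assumptions.

```python
def line_accuracy(predicted, gold):
    """Percentage of lines whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)
```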

Summary of Plans

Testing a probabilistic model for QA
Refining the Clarity measure for questions
Finer-grain table extraction and QA tests
Time-dependent language models
