
A Data Driven Approach to Query Expansion in Question Answering


Description

Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer-bearing documents are retrieved at low ranks for almost 40% of questions. As part of an investigation, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data-driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.


Page 1

A Data Driven Approach to Query Expansion in Question Answering

Leon Derczynski, Robert Gaizauskas, Mark Greenwood and Jun Wang

Natural Language Processing Group

Department of Computer Science

University of Sheffield, UK

Page 2

Summary

Introduce a system for QA

Find that its IR component limits system performance

Explore alternative IR components

Identify which questions cause IR to stumble

Using answer lists, find extension words that make these questions easier

Show how knowledge of these words can rapidly accelerate the development of query expansion methods

Show why one simple relevance feedback technique cannot improve IR for QA

Page 3

How we do QA

The question answering system follows a linear procedure to get from question to answers:

Pre-processing → Text retrieval → Answer extraction

Performance at each stage affects later results
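
As a reading aid only, here is a minimal sketch of that linear flow in Python; the function names and placeholder bodies are hypothetical, not the system's actual code.

```python
# Illustrative sketch of the three-stage QA pipeline; the bodies are
# placeholders, not the real pre-processing, IR or answer extraction.

def preprocess_question(question):
    """Turn the question into a bag of query terms (placeholder)."""
    return [t.lower().strip("?.,") for t in question.split()]

def retrieve_texts(query_terms, n=20):
    """Return up to n candidate paragraphs for the query terms.
    In the real system this is the IR engine (e.g. Lucene)."""
    return []

def extract_answers(question, paragraphs):
    """Pull candidate answers out of the retrieved paragraphs (placeholder)."""
    return []

def answer(question):
    # Each stage bounds the performance of the stages after it.
    terms = preprocess_question(question)
    paragraphs = retrieve_texts(terms)
    return extract_answers(question, paragraphs)
```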

Page 4

Measuring QA Performance

Overall metrics: coverage and redundancy

TREC provides answers: regular expressions for matching answer text, and the IDs of documents deemed helpful

Ways of assessing correctness:

Lenient: the document text contains an answer

Strict: in addition, the document ID is listed by TREC as helpful
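
As a sketch of how these metrics can be computed, assuming the usual definitions (coverage: proportion of questions with at least one answer-bearing text among the top n retrieved; redundancy: mean number of answer-bearing texts per question) and an illustrative data layout:

```python
import re

def is_answer_bearing(doc_id, doc_text, answer_regexes, helpful_doc_ids, strict=True):
    """Lenient: the text matches one of the answer patterns.
    Strict: additionally, TREC lists the document as helpful."""
    lenient = any(re.search(rx, doc_text) for rx in answer_regexes)
    if not strict:
        return lenient
    return lenient and doc_id in helpful_doc_ids

def coverage_and_redundancy(retrievals, answers, strict=True):
    """retrievals: {question_id: [(doc_id, doc_text), ...]}  (top n per question)
    answers:    {question_id: (answer_regexes, helpful_doc_ids)}"""
    hits_per_question = []
    for qid, docs in retrievals.items():
        regexes, helpful_ids = answers[qid]
        hits = sum(
            is_answer_bearing(doc_id, text, regexes, helpful_ids, strict)
            for doc_id, text in docs
        )
        hits_per_question.append(hits)
    n_questions = len(hits_per_question) or 1
    coverage = sum(1 for h in hits_per_question if h > 0) / n_questions
    redundancy = sum(hits_per_question) / n_questions
    return coverage, redundancy
```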

Page 5

Assessing IR Performance

Low initial system performance

Analysed each component in the system

Question pre-processing correct

Coverage and redundancy checked in IR part

Page 6

IR component issues

Only 65% of questions generate any text to be prepared for answer extraction

IR failings cap the performance of the entire system

Need to balance the amount of information retrieved for AE

Retrieving more text boosts coverage, but also introduces excess noise

Page 7

Initial performance

Lucene statistics

Using strict matching, at paragraph level

Question year    Coverage    Redundancy
2004             63.6%       1.62
2005             56.6%       1.15
2006             56.8%       1.18

Page 8

Potential performance inhibitors

IR engine: is Lucene causing problems? Profile some alternative engines

Difficult questions: identify which questions cause problems, then examine them for:

Common factors

How they can be made approachable

Page 9

Information Retrieval Engines

AnswerFinder uses a modular framework, including an IR plugin for Lucene

Indri and Terrier are two public domain IR engines, both of which have been adapted to perform TREC tasks:

Indri – based on the Lemur toolkit and INQUERY engine

Terrier – developed in Glasgow for dealing with terabyte-scale corpora

Plugins are created for Indri and Terrier, which are then used as replacement IR components

Automated testing of overall QA performance done using multiple IR engines
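
The class and method names below are illustrative, not AnswerFinder's real plugin API; this is just a sketch of the kind of interchangeable IR component the slide describes.

```python
from abc import ABC, abstractmethod

class IREngine(ABC):
    """Illustrative plugin interface: each engine returns ranked
    (doc_id, paragraph_text) pairs for a query."""

    @abstractmethod
    def search(self, query, n=20):
        ...

class LuceneEngine(IREngine):
    def search(self, query, n=20):
        raise NotImplementedError("call out to a Lucene index here")

class IndriEngine(IREngine):
    def search(self, query, n=20):
        raise NotImplementedError("call out to an Indri index here")

class TerrierEngine(IREngine):
    def search(self, query, n=20):
        raise NotImplementedError("call out to a Terrier index here")

def evaluate_engines(engines, questions, evaluate):
    """Run the same question set through each engine and score it with
    a shared evaluation function (e.g. coverage and redundancy)."""
    return {type(engine).__name__: evaluate(engine, questions) for engine in engines}
```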

Page 10

IR Engine performance

Engine     Coverage    Redundancy
Indri      55.2%       1.15
Lucene     56.8%       1.18
Terrier    49.3%       1.00

With n=20; strict retrieval; TREC 2006 question set; paragraph-level texts.

• Performance between engines does not seem to vary significantly

• Tweaking of non-QA-specific IR engines is probably not a promising avenue for performance increases

Page 11

Identification of difficult questions

Coverage of 56.8% indicates that for over 40% of questions, no answer-bearing documents are retrieved.

Some questions are difficult for all engines

How to define a “difficult” question?

Calculate average redundancy (over multiple engines) for each question in a set

Questions with average redundancy less than a certain threshold are deemed difficult

A threshold of zero is usually enough to find a sizeable dataset
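
A small sketch of that selection step, with an assumed (illustrative) layout for the per-engine redundancy scores:

```python
def difficult_questions(redundancy_by_engine, threshold=0.0):
    """redundancy_by_engine: {engine_name: {question_id: redundancy}}.
    A question is 'difficult' when its redundancy, averaged over the
    engines, does not exceed the threshold (zero by default)."""
    per_engine = list(redundancy_by_engine.values())
    difficult = []
    for qid in per_engine[0]:
        average = sum(scores[qid] for scores in per_engine) / len(per_engine)
        if average <= threshold:
            difficult.append(qid)
    return difficult
```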

Page 12

Examining the answer data

TREC answer data provides hints as to which documents an IR engine ideal for QA should retrieve:

Lists of helpful documents

Regular expressions matching the answers

Some questions are marked by TREC as having no answer; these are excluded from the difficult question set

Page 13

Making questions accessible

Given the answer bearing documents and answer text, it’s easy to extract words from answer-bearing paragraphs

For example, where the answer is “baby monitor”:

The inventor of the baby monitor found this device almost accidentally

These surrounding words may improve coverage when used as query extensions

How can we find out which extension words are most helpful?

Page 14

Rebuilding the question set

Only use answerable difficult questions

For each question:

Add the original question to the question set as a control

Find target paragraphs in the "correct" texts

Build a list of all words in those paragraphs, except answers, stop words, and question words

For each such word, create a sub-question consisting of the original question extended by that word (sketched below)
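
A rough sketch of this procedure for a single question, using a naive whitespace tokeniser; the function names and the exact token clean-up are illustrative assumptions.

```python
def candidate_extensions(paragraph, answers, stop_words, question):
    """Words from an answer-bearing paragraph that are not answers,
    stop words, or words already in the question."""
    exclude = {w.lower() for w in stop_words}
    exclude |= {w.lower() for a in answers for w in a.split()}
    exclude |= {w.lower().strip("?.,") for w in question.split()}
    words = {w.lower().strip(".,;:()\"'?") for w in paragraph.split()}
    return sorted(w for w in words if w and w not in exclude)

def build_sub_questions(question, target_paragraphs, answers, stop_words):
    """The original question (kept as a control) plus one sub-question
    per candidate extension word."""
    words = set()
    for paragraph in target_paragraphs:
        words.update(candidate_extensions(paragraph, answers, stop_words, question))
    return [question] + [f"{question} {w}" for w in sorted(words)]
```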

Page 15

Rebuilding the question set

Example:

Single factoid question: Q + E

How tall is the Eiffel Tower? + height

Question in a series: Q + T + E

Where did he play in college? + Warren Moon + NFL

Page 16

Do data-driven extensions help?

Base performance is at or below the difficult question threshold (typically zero)

Any extension that brings performance above zero is deemed a “helpful word”

From the set of difficult questions, 75% were made approachable by using a data-driven extension

If we can add these terms accurately to questions, the cap on answer extraction performance is raised
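
Continuing the earlier sketch, helpful words could then be identified like this, assuming a redundancy(query) function that runs a query and scores it as described above (illustrative only):

```python
def find_helpful_words(question, candidate_words, redundancy, threshold=0.0):
    """redundancy: a callable that runs a query and returns its
    redundancy score. An extension word is deemed 'helpful' when it
    lifts a difficult question above the threshold (zero by default)."""
    helpful = []
    for word in candidate_words:
        if redundancy(f"{question} {word}") > threshold:
            helpful.append(word)
    return helpful
```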

Page 17

Do data-driven extensions help?

Question: Where did he play in college?

Target: Warren Moon

Base redundancy is zero

Extensions: Football (redundancy 1), NFL (redundancy 2.5)

Adding some generic related words improves performance

Page 18

Do data-driven extensions help?

Question: Who was the nominal leader after the overthrow?

Target: Pakistani government overthrown in 1999

Base redundancy is zero

Extensions: Islamabad (redundancy 2.5), Pakistan (redundancy 4), Kashmir (redundancy 4)

Location-based words can raise redundancy

Page 19

Do data-driven extensions help?

Question: Who have commanded the division?

Target: 82nd Airborne Division

Base redundancy is zero

The question expects a list of answers

Extensions: Col (redundancy 2), Gen (redundancy 3), officer (redundancy 1), decimated (redundancy 1)

The proper names for ranks help; this can be hinted at by "Who"

Events related to the target may suggest words

Possibly not a victorious unit!

Page 20

Observations on helpful words

Inclusion of pertainyms has a positive effect on performance, agreeing with more general observations in Greenwood (2004)

Army ranks stood out strongly

This suggests the use of an always-include list for such terms

Some related words help, though there’s often no deterministic relationship between them and the questions

Page 21

Measuring automated expansion

Known helpful words are also the target set of words that any expansion method should aim for

Once the target expansions are known, measuring automated expansion becomes easier

No need to perform IR for every candidate expanded query (some runs over AQUAINT took up to 14 hours on a 4-core 2.3GHz system)

Rapid evaluation permits faster development of expansion techniques
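
For example, a candidate expansion method can be scored per question purely by set overlap with the known helpful words, with no retrieval run at all; this is a sketch, and the precision/recall framing is an assumption rather than the talk's own formulation.

```python
def score_expansion_method(proposed_terms, known_helpful):
    """Score a method's proposed expansion terms for one question by
    overlap with its known helpful words, without running any IR.
    Returns (precision, recall) over the proposed terms."""
    proposed = {t.lower() for t in proposed_terms}
    helpful = {t.lower() for t in known_helpful}
    if not proposed or not helpful:
        return 0.0, 0.0
    overlap = proposed & helpful
    return len(overlap) / len(proposed), len(overlap) / len(helpful)
```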

Page 22

Relevance feedback in QA

Simple RF works by using features of an initial retrieval to alter a query

We picked the highest frequency words in the “initially retrieved texts”, and used them to expand a query

The size of the IRT set is denoted r

Previous work (Monz 2003) looked at relevance feedback using a small range of values for r

Different sizes of initial retrievals are used, between r=5 and r=50
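
A minimal sketch of this term-frequency blind feedback, reusing the assumed search(query, n) interface from the earlier sketch and an external stop word list (the token clean-up and the number of added terms are illustrative choices):

```python
from collections import Counter

def tf_relevance_feedback(query, search, stop_words, r=5, k=5):
    """Blind (pseudo) relevance feedback: take the top-r initially
    retrieved texts, count term frequencies, and append the k most
    frequent non-stop-word terms not already in the query."""
    query_terms = {w.lower() for w in query.split()}
    counts = Counter()
    for _, text in search(query, r):
        for word in text.lower().split():
            word = word.strip(".,;:()\"'?")
            if word and word not in stop_words and word not in query_terms:
                counts[word] += 1
    expansion = [word for word, _ in counts.most_common(k)]
    return f"{query} {' '.join(expansion)}"
```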

Page 23

Rapidly evaluating RF

Three metrics show how a query expansion technique performs (see the sketch after this list):

Percentage of all helpful words found in IRT

This shows the intersection between the words in the initially retrieved texts and the helpful words.

Percentage of texts containing helpful words

If this is low, then given the initial query, the IR system does not retrieve many documents containing helpful words.

Percentage of expansion terms that are helpful

This is a key statistic; the higher it is, the better performance is likely to be.
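
These three percentages can be computed from the data already gathered above; the dictionary layout here is an illustrative assumption.

```python
def expansion_metrics(per_question_data):
    """per_question_data: one dict per question with keys 'irt_texts'
    (initially retrieved texts), 'helpful' (known helpful words) and
    'rf_terms' (the terms the expansion method chose)."""
    helpful_found = helpful_total = 0
    texts_with_helpful = texts_total = 0
    rf_helpful = rf_total = 0
    for q in per_question_data:
        helpful = {w.lower() for w in q["helpful"]}
        irt_words = set()
        for text in q["irt_texts"]:
            words = {w.lower().strip(".,;:()\"'?") for w in text.split()}
            irt_words |= words
            texts_total += 1
            texts_with_helpful += bool(words & helpful)
        helpful_total += len(helpful)
        helpful_found += len(helpful & irt_words)
        rf = {w.lower() for w in q["rf_terms"]}
        rf_total += len(rf)
        rf_helpful += len(rf & helpful)

    def pct(a, b):
        return 100.0 * a / b if b else 0.0

    return (pct(helpful_found, helpful_total),     # helpful words found in IRT
            pct(texts_with_helpful, texts_total),  # IRT containing helpful words
            pct(rf_helpful, rf_total))             # RF words that are helpful
```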

Page 24

Relevance feedback predictions

Less than 35% of the documents used in relevance feedback actually contain helpful words

Picking helpful words out of the initial retrievals is not easy when there is so much noise

Due to the small probability of adding helpful words, relevance feedback is unlikely to make difficult questions accessible

Adding noise to the query will drown out otherwise helpful documents for non-difficult questions

                                 2004     2005     2006
Helpful words found in IRT       4.2%     18.6%    8.9%
IRT containing helpful words     10.0%    33.3%    34.3%
RF words that are "helpful"      1.25%    1.67%    5.71%

RF selects words to be added to a query, based on an initial search.

Page 25

Relevance feedback results

Only 1.25% - 5.71% of the words that relevance feedback chose were actually helpful; the rest only add noise

Performance using TF-based relevance feedback is consistently lower than the baseline

Hypothesis of poor performance is supported

Coverage at n docs    r=5      r=50     Baseline
10                    34.7%    28.4%    43.4%
20                    44.4%    39.8%    55.3%

Page 26

Conclusions

IR engine performance for QA does not vary wildly

Identifying helpful words provides a tool for assessing query expansion methods

TF-based relevance feedback cannot be generally effective in IR for QA

Linguistic relationships exist that can help in query expansion

Page 27

Any questions?