
Page 1

Natural Language Processing Group
Department of Computer Science
University of Sheffield, UK

IR4QA: An Unhappy Marriage
Mark A. Greenwood

Page 2

Outline of Talk

• Background

• ‘Ancient’ History

• Recent Past

• An Uncertain Future

• Possible New Directions

Page 3

Background

Although QA is not new, the language processing community has yet to develop a clearly articulated and commonly accepted guiding framework and research methodology, parallel to that of IR, MT, or text summarization.

As a result, despite ten years of system evaluations in the TREC QA track for specific kinds of questions and answers, the community does not have a clear idea how much progress was made during that period for QA in general.

OAQA09 Call for Papers

Page 4

Background

• We will focus here on the selection of promising documents which can be subjected to further processing in order to extract exact answers to questions.

• The common approach to this problem has been to employ an IR engine to retrieve a small set of relevant documents, a field known as IR4QA.

• The rest of this talk will explain:
  – How we got to this point
  – Why it is fundamentally flawed
  – Where we might go from here

Page 5

Outline of Talk

• Background

• ‘Ancient’ History

• Recent Past

• An Uncertain Future

• Possible New Directions

Page 6

‘Ancient’ History

• Traditionally IR and QA were separate research areas

• They had different users and goals

• The inputs and outputs of the two kinds of system were radically different

• Both had their own strengths and weaknesses

Page 7

‘Ancient’ History

• Early QA systems were usually just interfaces to structured data
  – LUNAR (Woods, 1973)
  – BASEBALL (Green et al., 1961)

• Those systems which worked over text were usually based around reading comprehension exercises and used scenario templates
  – SAM (Schank and Abelson, 1977)

• Questions varied in length but asked for information that was not known to the user

• Systems were not open-domain, e.g. LUNAR only knew about moon rocks.

Page 8

‘Ancient’ History

• In comparison to QA systems, early IR systems could be applied to any document collection
  – Performance varied from collection to collection, but in principle any collection could be handled

• Queries were usually quite long and described the documents the user was looking for
  – The CACM collection is a good example

• Systems returned full documents, not exact answers
  – As the user already knew what they were looking for, this was OK
  – Full documents don’t help when you don’t know what you are looking for, as you then have to read all the returned documents

Page 9

Outline of Talk

• Background

• ‘Ancient’ History

• Recent Past

• An Uncertain Future

• Possible New Directions

Page 10

Recent Past

• Recent QA research has been guided by the TREC evaluations

• The TREC QA track was originally conceived as a task that would interest both the IR and IE communities
  – Focused IR
  – Open-Domain IE

• It was hoped that over time the two communities would work together to develop new combined approaches

• Unfortunately it would seem that the IR community is not, on the whole, interested in the QA task

Page 11

Recent Past

• Most, if not all, modern QA systems have adopted a (roughly) three-stage architecture: question analysis, document retrieval, and answer extraction.
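A minimal sketch of such a pipeline is given below. The toy keyword-overlap scoring stands in for an off-the-shelf IR engine, and the function names, stop list and pre-annotated entities are all illustrative assumptions rather than a description of any particular system.

    # Illustrative sketch of the common three-stage QA pipeline.
    # The term-overlap "retrieval" below is a stand-in for a real IR engine
    # such as Lucene or Okapi; everything here is a simplified assumption.

    def analyse_question(question):
        """Stage 1: derive an expected answer type and a bag-of-words query."""
        expected_type = "LOCATION" if question.lower().startswith("where") else "PERSON"
        stop = {"who", "what", "where", "when", "is", "was", "the", "a", "of"}
        return expected_type, [w.strip("?").lower() for w in question.split()
                               if w.lower() not in stop]

    def retrieve_documents(query, collection, top_n=5):
        """Stage 2: rank documents by simple term overlap with the query."""
        scored = [(sum(t in doc["text"].lower() for t in query), doc) for doc in collection]
        return [doc for score, doc in sorted(scored, key=lambda s: -s[0])[:top_n] if score > 0]

    def extract_answers(documents, expected_type):
        """Stage 3: heavier NLP over the retrieved text; here we simply read off
        pre-annotated entities of the expected type (an assumption)."""
        return [e for doc in documents for e, etype in doc["entities"] if etype == expected_type]

    def answer(question, collection):
        expected_type, query = analyse_question(question)
        return extract_answers(retrieve_documents(query, collection), expected_type)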

Page 12

Recent Past

• IR4QA has not been aggressively researched by the community, yet we know that...
  – IR performance places an upper bound on end-to-end performance – a commonly quoted figure is 60% (Tellex et al., 2003)
  – Even if we look at the top 1000 documents, no relevant documents are returned for 8% of the questions (Hovy et al., 2000)
  – Most systems use off-the-shelf IR components with little or no tuning to the task, e.g. Lucene, Okapi...
  – Complex multi-query strategies have been tried in an effort to solve the problem, but they only serve to highlight how bad performance at this step actually is

Page 13

Recent Past

• IR4QA has focused on the development and evaluation of the document retrieval component in such systems.

• The main problems are:
  – QA researchers are not IR researchers
  – We don’t fully understand the intricate details of IR engines
  – QA and IR are fundamentally different tasks

Page 14

Recent Past

• The commonly accepted evaluation framework consists of (Roberts and Gaizauskas, 2004; sketched below):
  – Coverage – the proportion of questions for which at least one answer-bearing document is retrieved
  – Redundancy – the average number of answer-bearing documents retrieved per question
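A minimal sketch of these two measures, assuming retrieved[q] is the ranked list of document ids returned for question q and answer_bearing[q] is the set of document ids known to contain an answer to q (both names are illustrative assumptions):

    # Sketch of coverage and redundancy at rank n (after Roberts and Gaizauskas, 2004).
    # retrieved[q]      : ranked list of document ids returned for question q
    # answer_bearing[q] : set of document ids known to contain an answer to q

    def coverage(retrieved, answer_bearing, n):
        """Proportion of questions with at least one answer-bearing document in the top n."""
        hits = sum(1 for q in retrieved if set(retrieved[q][:n]) & answer_bearing[q])
        return hits / len(retrieved)

    def redundancy(retrieved, answer_bearing, n):
        """Average number of answer-bearing documents per question in the top n."""
        total = sum(len(set(retrieved[q][:n]) & answer_bearing[q]) for q in retrieved)
        return total / len(retrieved)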

Page 15

Recent Past

• There have been two workshops focused on the problem of IR4QA
  – Sheffield, SIGIR 2004
  – Manchester, Coling 2008

• The main conclusions of both were that IR4QA is very hard
  – Approaches that lead to increased IR performance do not necessarily lead to appreciable increases in end-to-end performance
  – Selection of documents shouldn’t be performed in isolation from the rest of the system

Page 16

Outline of Talk

• Background

• ‘Ancient’ History

• Recent Past

• An Uncertain Future

• Possible New Directions

Page 17

An Uncertain Future

• It seems clear that, on the whole, the IR community is not interested in QA

• Using off-the-shelf IR components has been shown to introduce unacceptable caps on performance

• The IR4QA community needs to consider radically different approaches to the problem of selecting relevant documents from large corpora

Page 18

Outline of Talk

• Background

• ‘Ancient’ History

• Recent Past

• An Uncertain Future

• Possible New Directions

Page 19

Possible New Directions

• Answer extraction requires complex text processing
  – Answer extraction techniques don’t scale well
  – Some form of text selection component is required

• There are two orthogonal directions we could take
  – Continue to use traditional IR techniques but discard the traditional view of what makes a document (and/or query)
  – Continue to work with traditional documents but use a radically different selection approach

We need approaches that scale – working on AQUAINT-sized collections is nice for self-contained experiments but shouldn’t be the end goal!

Page 20

What Is A Document?

• Topic Indexing and Retrieval (Ahn and Webber, 2008) throws away the common idea of documents while using a standard IR engine to directly retrieve answers rather than text.

• Topics are entities that answer questions
  – People, companies, locations, etc.

• Topic documents are built by simply joining together all sentences from a corpus that contain the topic (or variations of it, e.g. Bill Clinton and William Clinton)

• QA is then a matter of retrieving the most relevant topic document using an IR engine and returning the associated topic as the answer (sketched below)
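A rough sketch of this topic-document idea follows. The name-variant handling and the term-overlap scoring (a stand-in for a standard IR engine) are simplifications assumed for illustration, not the actual system of Ahn and Webber (2008).

    # Sketch of topic indexing and retrieval: each "document" is the concatenation
    # of every corpus sentence mentioning a topic, and retrieving the best topic
    # document amounts to retrieving the answer itself.

    from collections import defaultdict

    def build_topic_documents(sentences, topic_variants):
        """topic_variants maps a topic to the names that refer to it,
        e.g. {"Bill Clinton": {"Bill Clinton", "William Clinton"}}."""
        topic_docs = defaultdict(list)
        for sentence in sentences:
            for topic, variants in topic_variants.items():
                if any(v in sentence for v in variants):
                    topic_docs[topic].append(sentence)
        return {topic: " ".join(sents) for topic, sents in topic_docs.items()}

    def answer_question(question, topic_docs):
        """Score each topic document against the question by term overlap
        (standing in for an IR engine) and return the best topic as the answer."""
        q_terms = {w.strip("?").lower() for w in question.split()}
        def overlap(text):
            return len(q_terms & {w.lower() for w in text.split()})
        return max(topic_docs, key=lambda topic: overlap(topic_docs[topic]))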

Page 21

What Is A Document?

Page 22

Let The Data Guide You

• A decade of recent QA research has yielded a lot of useful data

• We have lots of example questions (at least a few thousand just from TREC) each of which...
  – Has a known correct answer
  – Is associated with at least one answer-bearing document

• We should use this data to guide new selection approaches
  – A simple approach would be to perform query expansion by looking for terms which are often associated with correct answers to certain question types (Derczynski et al., 2008) (a rough sketch follows below)
  – Look for patterns in the answer-bearing documents and index collections based on these patterns rather than words
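A minimal sketch of that query-expansion idea, assuming training data of (question type, answer, answer-bearing text) triples; this is a deliberately simplified illustration, not the actual method of Derczynski et al. (2008).

    # Sketch of data-driven query expansion: count terms that co-occur with known
    # answers in answer-bearing documents, per question type, then append the
    # strongest of them to future queries of that type.

    from collections import Counter, defaultdict

    def learn_expansion_terms(training_data, top_k=5):
        """training_data: iterable of (question_type, answer, answer_bearing_text)."""
        counts = defaultdict(Counter)
        for qtype, answer, text in training_data:
            for sentence in text.split("."):
                if answer.lower() in sentence.lower():
                    for term in sentence.lower().split():
                        if term != answer.lower():
                            counts[qtype][term] += 1
        return {qtype: [t for t, _ in c.most_common(top_k)] for qtype, c in counts.items()}

    def expand_query(question, qtype, expansion_terms):
        """Append the learned expansion terms for this question type to the query."""
        return question + " " + " ".join(expansion_terms.get(qtype, []))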

Page 23

Answer By Understanding

• I’ve always been of the opinion that QA is intelligent IR
  – Where intelligence equates to some level of understanding

• This suggests we should index meaning, not just textual content
  – Take into account co-reference when selecting text passages
  – Indexing relations should allow for more focused selection
  – ‘Hybrid’ search that uses annotations and text (Bhagdev et al., 2008) (see the sketch below)
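One way to picture such a hybrid index is sketched below: each passage record keeps the raw text plus extracted annotations (co-reference-resolved entities and relation triples), so a query can combine a keyword match with a semantic constraint. The data layout and matching logic are illustrative assumptions, not the design of Bhagdev et al. (2008).

    # Illustrative 'hybrid' passage index combining keywords and annotations.

    passages = [
        {
            "text": "He was born in Hope, Arkansas.",
            # 'He' resolved to Bill Clinton by co-reference before indexing
            "entities": {"Bill Clinton", "Hope", "Arkansas"},
            "relations": {("Bill Clinton", "born_in", "Hope")},
        },
    ]

    def hybrid_search(passages, keywords, required_relation=None):
        """Return passages matching every keyword (in the text or the entity
        annotations) and, if given, also containing the required relation triple."""
        results = []
        for p in passages:
            keyword_hit = all(k.lower() in p["text"].lower() or k in p["entities"] for k in keywords)
            relation_hit = required_relation is None or required_relation in p["relations"]
            if keyword_hit and relation_hit:
                results.append(p)
        return results

    # e.g. hybrid_search(passages, ["Bill Clinton"], ("Bill Clinton", "born_in", "Hope"))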

Page 24

DISCUSSION

Page 25

References

• Kisuh Ahn and Bonnie Webber. 2008. Topic Indexing and Retrieval for Factoid QA. In Proceedings of the 2nd Workshop on Information Retrieval for Question Answering (IR4QA).

• Ravish Bhagdev, Sam Chapman, Fabio Ciravegna, Vitaveska Lanfranchi and Daniela Petrelli. 2008. Hybrid Search: Effectively Combining Keywords and Semantic Searches. In Proceedings of the 5th European Semantic Web Conference (ESWC 08), Tenerife.

• Leon Derczynski, Jun Wang, Robert Gaizauskas and Mark A. Greenwood. 2008. A Data Driven Approach to Query Expansion in Question Answering. In Proceedings of the 2nd Workshop on Information Retrieval for Question Answering (IR4QA).

• Bert F. Green, Alice K. Wolf, Carol Chomsky, and Kenneth Laughery. 1961. BASEBALL: An Automatic Question Answerer. In Proceedings of the Western Joint Computer Conference, volume 19, pages 219--224.

• Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Michael Junk, and Chin-Yew Lin. 2000. Question Answering in Webclopedia. In Proceedings of the 9th Text REtrieval Conference.

• Ian Roberts and Robert Gaizauskas. 2004. Evaluating Passage Retrieval Approaches for Question Answering. In Proceedings of the 26th European Conference on Information Retrieval (ECIR’04), pages 72--84, University of Sunderland, UK.

• Roger C. Schank and Robert Abelson. 1977. Scripts, Plans, Goals and Understanding. Lawrence Erlbaum Associates, Hillsdale, NJ.

• Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, and Gregory Marton. 2003. Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering. In Proceedings of the Twenty-Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41--47, Toronto, Canada, July.

• William Woods. 1973. Progress in Natural Language Understanding - An Application to Lunar Geology. In AFIPS Conference Proceedings, volume 42, pages 441--450.