2004 ARDA Challenge Workshop An Investigation of Evaluation Metrics for Analytic Question Answering Overview Antonio Sanfilippo PNNL/NWRRC AQUAINT Phase 2 Fall Workshop Tampa, FL

2004 ARDA Challenge Workshop

An Investigation of Evaluation Metrics for Analytic Question Answering

Overview

Antonio Sanfilippo, PNNL/NWRRC

AQUAINT Phase 2 Fall Workshop, Tampa, FL

Northwestern Regional Research Center

• Hosted by Pacific Northwest National Laboratory

• Located in Richland, WA

Problem

• The adoption of new QA technologies in the IC is hindered by the gap between the development and usage environments

– There is no systematic way of ensuring that QA systems conform to the working practices of analysts

– Systems may perform well in terms of accuracy, but do not address the needs of analysts

Solution

• Develop evaluation metrics that reflect the interaction of users and QA systems to determine how and to what extent these systems meet user requirements

– Determine the utility of features and functionalities

– Establish and corroborate user requirements

– Perform a user-centric comparison of different systems

Experimental Focus

• The development of the evaluation metrics is based on empirical studies of analysts using

– 3 Question Answering systems

• Cycorp

• LCC

• SUNY@Albany

– the Google search engine as the baseline system

Stakeholders

• Government Champions
– John Prange (ARDA)
– Kelcy Allwein (DIA)
– Mike Blair (NAVY)

• Team Leaders
– Emile Morse & Jean Scholtz (NIST)

• Team Participants
– Tomek Strzalkowski, Sharon Small, Sean Ryan, Hilda Hardy (SUNY@Albany)
– Sanda Harabagiu, Andy Hickl, John Williams (LCC)
– Stefano Bertolo (Cycorp)
– Paul Kantor (Rutgers University)
– Diane Kelly (University of North Carolina)
– Peter LaMonica, Chuck Messenger (AFRL)
– Joe Konczal (NIST)
– Katherine Johnson, Frank Greitzer (PNNL)
– Analysts: 7 from NAVY, 1 from ARMY
– Graduate Students: Robert Rittman, Aleksandra Sarcevic, Ying Sun (Rutgers University)

• PNNL Oversight
– Rich Quadrel (NWRRC Director)
– Troy Juntunen (System Installation and Connectivity)
– Ben Barnett, Trina Pitcher, John Calhoun, Eileen Boiling (Admin)
– Antonio Sanfilippo (Project Manager)

Roadmap

Feb 23
– Project planning meeting (NIST)

March-April
– Preparation (contracts, purchases, data collection, initial scenario development)

April 15-16
– Kickoff meeting (NIST)

April-May
– Finalize scenarios, metric hypotheses, and evaluation methods & materials
– Work with NWRRC to set up facilities for data collection at PNNL

June 7-25
– Install systems at PNNL
– Carry out user studies with analysts
– Collect data

July
– First version of data analysis
– Internal progress report and agenda for the remaining work

August
– Final version of data analysis and final exam

September
– Final report

Technical Approach

• Construct evaluation metric hypotheses about the utility of QA systems and test these in experimental user studies

– Collect data relative to evaluation hypotheses for 8 analysts working on 8 task assignment scenarios with 4 systems (the 3 QA systems plus the Google baseline)

– Analyze collected data to verify utility of evaluation metric hypotheses
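The slides do not say how scenarios and systems were assigned to analysts. As a minimal sketch of one possible balanced design for 8 analysts, 8 scenarios and 4 systems, the rotation below is purely an assumption for illustration, not the workshop's actual protocol. The analyst labels and the function names `build_schedule` and `is_balanced` are hypothetical.

```python
from collections import Counter

# Placeholder labels: the slides do not name the analysts.
ANALYSTS = [f"analyst_{i}" for i in range(1, 9)]
SCENARIOS = list("ABCDEFGH")                       # scenario IDs A-H
SYSTEMS = ["Cycorp", "LCC", "SUNY@Albany", "Google"]


def build_schedule(analysts, scenarios, systems):
    """Return (analyst, session, scenario, system) rows via cyclic rotation.

    ASSUMPTION: this rotation is an illustrative design, not the one used
    in the workshop.
    """
    rows = []
    for a, analyst in enumerate(analysts):
        for t in range(len(scenarios)):
            scenario = scenarios[(a + t) % len(scenarios)]
            system = systems[(2 * a + t) % len(systems)]
            rows.append((analyst, t + 1, scenario, system))
    return rows


def is_balanced(rows):
    """Check the balance properties for the 8 x 8 x 4 case."""
    analyst_scenario = Counter((r[0], r[2]) for r in rows)
    analyst_system = Counter((r[0], r[3]) for r in rows)
    scenario_system = Counter((r[2], r[3]) for r in rows)
    return (
        len(analyst_scenario) == 64                      # every analyst/scenario pair occurs
        and set(analyst_scenario.values()) == {1}        # ... exactly once
        and len(analyst_system) == 32
        and set(analyst_system.values()) == {2}          # each analyst uses each system twice
        and len(scenario_system) == 32
        and set(scenario_system.values()) == {2}         # each scenario meets each system twice
    )


schedule = build_schedule(ANALYSTS, SCENARIOS, SYSTEMS)
print(len(schedule), is_balanced(schedule))
```

Under this assumed rotation every analyst works each scenario once and each system twice, so differences between systems are not confounded with a particular analyst or scenario.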

Evaluation Hypotheses

Question answering systems should:

ID  | Hypothesis                                                  | Questionnaires | NASA TLX | SmiFro & Status | Cross-evaluation | System Logs | Glass Box | Query Trails
H1  | Support information gathering with lower cognitive workload |                | X        |                 |                  | X           |           | X
H2  | Assist in exploring more paths/hypotheses                   | X              |          |                 |                  |             |           | X
H3  | Enable production of higher quality reports                 | X              |          |                 | X                |             |           |
H4  | Provide useful suggestions to the analyst                   | X              |          |                 |                  | X           | X         |
H5  | Provide more good surprises than bad                        | X              |          | X               |                  |             |           |
H6  | Enable more focus on analysis than data collection          | X              |          |                 |                  |             |           |
H7  | Enable analysts to collect more data in less time           | X              |          |                 |                  |             | X         |
H8  | Reduce the time spent reading                               | X              |          |                 |                  |             | X         |
H9  | Identify gaps in the knowledge base                         | X              |          |                 |                  | X           |           |
H10 | Help the analyst recognize gaps in their thinking           | X              |          |                 |                  |             |           |
H11 | Provide context for information                             | X              |          |                 |                  | X           |           |
H12 | Provide context, continuity and coherence of dialogue       | X              |          |                 |                  | X           | X         | X
H13 | Let analysts relocate previously seen materials             | X              |          |                 |                  |             |           |
H14 | Be easy to use                                              | X              | X        |                 |                  |             |           |
H15 | Increase an analyst’s confidence in exploration and report  | X              |          | X               |                  |             |           |
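The hypothesis-to-instrument matrix above can also be held as a small data structure, which makes it easy to see how many hypotheses each instrument covers and which hypotheses rest on a single data source. The sketch below is illustrative only: the X positions are read off the slide's table layout and should be treated as an assumption, and the names `COVERAGE`, `hypotheses_per_instrument` and `single_source` are hypothetical.

```python
INSTRUMENTS = [
    "Questionnaires", "NASA TLX", "SmiFro & Status", "Cross-evaluation",
    "System Logs", "Glass Box", "Query Trails",
]

# ASSUMPTION: X marks as recovered from the slide's table layout.
COVERAGE = {
    "H1": {"NASA TLX", "System Logs", "Query Trails"},
    "H2": {"Questionnaires", "Query Trails"},
    "H3": {"Questionnaires", "Cross-evaluation"},
    "H4": {"Questionnaires", "System Logs", "Glass Box"},
    "H5": {"Questionnaires", "SmiFro & Status"},
    "H6": {"Questionnaires"},
    "H7": {"Questionnaires", "Glass Box"},
    "H8": {"Questionnaires", "Glass Box"},
    "H9": {"Questionnaires", "System Logs"},
    "H10": {"Questionnaires"},
    "H11": {"Questionnaires", "System Logs"},
    "H12": {"Questionnaires", "System Logs", "Glass Box", "Query Trails"},
    "H13": {"Questionnaires"},
    "H14": {"Questionnaires", "NASA TLX"},
    "H15": {"Questionnaires", "SmiFro & Status"},
}


def hypotheses_per_instrument(coverage):
    """Count how many hypotheses each instrument contributes data to."""
    counts = {name: 0 for name in INSTRUMENTS}
    for instruments in coverage.values():
        for name in instruments:
            counts[name] += 1
    return counts


def single_source(coverage):
    """List hypotheses that are backed by exactly one instrument."""
    ids = [h for h, instruments in coverage.items() if len(instruments) == 1]
    return sorted(ids, key=lambda h: int(h[1:]))


for name, count in hypotheses_per_instrument(COVERAGE).items():
    print(f"{name}: {count}")
print("Single-source hypotheses:", single_source(COVERAGE))
```

A tally like this is one way to spot hypotheses that depend entirely on self-reported questionnaire data.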

ID Scenario Topics

A Indian Chemical Weapons Production and Delivery Systems

B Libyan Chemical Weapons Program

C Iranian Chemical Weapons Development and Impact

D North Korean Chemical and Biological Weapons Research

E Pakistani Chemical Agent Production

F Current Status of Russia’s Chemical Weapons Program

G South African Chemical Agents Program Status

H Assessment of Egypt’s Biological Weapons

Methodology

Accomplishments

• Results to date from the analysis of the data collected during the user studies at PNNL indicate that

– Most of the evaluation hypotheses initially set by the team proved to be useful for the user-centered assessment of QA systems

– The methodology developed by the team during the course of the user studies is effective for applying these evaluation metrics

– On average, the Cycorp, Albany and LCC Question Answering systems were deemed to be more useful by users than the baseline system (Google)

Results & Benefits

• The workshop delivered a set of tested user-centric evaluation criteria and a methodology for applying these criteria to learn how well QA systems meet the needs of analysts

• The availability of user-centric evaluation metrics enables a systematic methodology for tailoring the utility of QA systems to the specific needs of the Intelligence Community

– Target features and functionalities that are most impactful

– Facilitate technology insertion

Assessment

• The work has been carried out on schedule and with extreme precision, attention to detail and high technical standards

• Results indicate that the Workshop will be impactful in establishing a user-centered evaluation framework for interactive information systems

• Results will be presented in the next talk by Emile Morse

• A version of the methodology developed will be demonstrated in today’s exercise

Parting Shots

Views from the June Challenge problem in Richland


Thank You!