
A Multi-Perspective Evaluation of the NESPOLE! Speech-to-Speech

Translation System

Alon Lavie, Carnegie Mellon University

Florian Metze, University of Karlsruhe

Roldano Cattoni, ITC-irst

Erica Costantini, University of Trieste

July 8, 2002, ACL-02 Speech-to-Speech Translation Workshop

Outline

• The NESPOLE! Project

• Approach and System Architecture

• Performance and Usability Challenges:
  – Distributed real-time performance over internet
  – Integration and use of multi-modal capabilities
  – End-to-end translation performance

• Lessons learned and conclusions

The NESPOLE! Project

• Speech-to-speech translation for E-Commerce applications
• Partners: CMU, Univ of Karlsruhe, ITC-irst, UJF-CLIPS, AETHRA, APT-Trentino
• Builds on successful collaboration within C-STAR
• Improved limited-domain speech translation
• Experiment with multimodality and with MEMT
• Showcase-1: Travel and Tourism in Trentino, completed in Nov-2001, demonstrated
• Showcase-2: expanded travel + medical service

Speech-to-speech in E-commerce

• Replace current passive web E-commerce with live interaction capabilities

• Client starts via web, can easily connect to agent for specific information

• “Thin client” - very little special hardware and software on client PC: browser, MS Netmeeting, Shared Whiteboard

NESPOLE! User Interfaces

NESPOLE! Architecture

Distributed S2S Translation over the Internet

Network Traffic Impact

NESPOLE! Monitor

Aethra Whiteboard

Recent Developments: Apr-02

• Improved analysis and generation grammars (using old C-STAR data)

• Improved SR engines
• Packet-loss, video, and modem connection tests
• Data Collection for Showcase 2A
• Evaluation Scheme Experiment
• Paper and Demo at HLT-02
• Paper submissions to ACL-02, ICSLP-02, ESSLLI-02

IF Status Report

• Presented by Donna Gates

WP5: HLT Modules

• Data Collection for Showcase-2A completed in February-2002

• Status of transcriptions from all sites?
• CMU will maintain a data repository (Alon collecting all data CDs here)
• IF discussions and development have already started (Donna)
• Development Schedule?

WP7: Evaluation

• D9: Evaluation of Showcase-1 Report: draft circulated earlier this week

• Each site should verify that most up-to-date results are being reported

• Include detailed tables in the report?
• Majority vote – finalize a common procedure
• New evaluation experiments

Majority Vote Scheme

• Issue: did all sites use the same guidelines?
• What to do when there is no majority?
  – e.g. 4 graders assign P/P/K/K
• What to do when there is complete disagreement?
  – e.g. 3 graders assign P/K/B
• Need to recalculate scores from the previous evaluation?
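The tie cases above can be sketched in code. This is a minimal illustration of one possible policy (returning an explicit no-majority marker for 2/2 splits and complete disagreement), not the procedure the sites actually agreed on:

```python
from collections import Counter

def majority_grade(grades, fallback="no-majority"):
    """Return the label assigned by a strict majority of graders.

    Hypothetical tie policy: a 2/2 split (e.g. P/P/K/K) or a
    three-way disagreement (e.g. P/K/B) has no strict majority,
    so the fallback marker is returned and the decision is left
    to the evaluation procedure.
    """
    label, count = Counter(grades).most_common(1)[0]
    return label if count > len(grades) / 2 else fallback
```

For example, `majority_grade(["P", "P", "P", "K"])` yields "P", while both problem cases from the slide fall through to the fallback.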

New Evaluation Experiments

• We are investigating three main issues:
  – Binary versus 3-way grading
  – Majority vote versus averaging of scores
  – Intercoder and intracoder agreement

• Grading Experiment:
  – Four groups, three graders in each group
  – Each group grades two sets, two weeks apart
  – Sets are different but have a large common overlap
  – Groups differ in eval scheme used (binary/3-way)
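Comparing the binary and 3-way schemes requires collapsing a 3-way grade into a binary one. The mapping below is an assumption (P = perfect, K = ok, B = bad, with P and K both counting as acceptable), inferred from the grade examples in these slides rather than taken from the project's official definition:

```python
def to_binary(grade):
    """Collapse an assumed 3-way grade into a binary scheme.

    Hypothetical mapping: P (perfect) and K (ok) both count as
    acceptable; B (bad) is unacceptable. One plausible collapse
    for cross-scheme comparison, not necessarily NESPOLE!'s own.
    """
    return "acceptable" if grade in ("P", "K") else "bad"
```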

Planned Analysis of Data

• Compare results across grading schemes (binary vs. 3-way) on the same set of data
• Compare majority scores with average scores
• Evaluate intercoder agreement between graders (on the same set and same scheme)
• Evaluate intracoder agreement of the same grader (on overlap data in the two sets, same grading scheme in both sessions)
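Both intercoder and intracoder agreement can be measured with a chance-corrected statistic such as Cohen's kappa; the slides do not say which statistic was used, so this is just one standard option, sketched for two label sequences over the same items:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences:
    two graders (intercoder) or the same grader in two sessions
    (intracoder). Assumes agreement is not purely at chance level
    (expected agreement < 1), otherwise the denominator is zero.
    """
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each side's label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in ca.keys() | cb.keys()) / (n * n)
    return (observed - expected) / (1 - expected)
```

Identical sequences give kappa = 1.0; agreement no better than chance gives kappa near 0.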

Preliminary Results

Group (procedure)      W1 Acc   W1 Bad   W2 Acc   W2 Bad
Gr1 (binary/3-way)      50.2     49.8     48.7     51.3
Gr2 (3-way/binary)      52.4     47.6     48.8     51.2
Gr3 (3-way/3-way)       53.8     46.2     54.9     45.1
Gr4 (binary/binary)     49.0     51.0     50.0     50.0

Plans for Final Evaluations

• Improved end-to-end evaluations

• Additional component evaluations?

• Additional user studies?

• How do we evaluate user interfaces and communication effectiveness?