NESPOLE! Project Status Carnegie Mellon University

Grenoble Meeting

November 15, 2001

Main Accomplishments: Nov-01

• Improved DACPar Analyzer

• Improved SR engines

• Port to Linux HLT servers

• Significant coverage improvements

• Formal evaluation

• (SPECTRUM Proposal)

The DACPar Analyzer

• Parse an utterance for arguments (SOUP)

• Segment the utterance into sentences

• Extract features from the utterance and the single best parse output

• Use a learned classifier to identify the speech act (TiMBL)

• Use a learned classifier to identify the concept sequence (TiMBL)

• Combine into a full parse

Improved DACPar Analyzer

• Improved segmentation of utterances into SDUs

• Using IF well-formedness constraints to improve overall DA classification

• Coverage and training set improvements

DACPar - Improved Segmentation

• Segmenting single turns into DA units (SDUs) - two problems:

– under-segmentation: detecting SDU boundaries between parsed arguments

– over-segmentation: due to CrossDomain grammar - single SDUs that are incorrectly split

• New segment boundary detector implemented based on argument statistical model

• CrossDomain grammar tuned to prevent over-segmentation

DACPar - Using IF Constraints

• Two goals:

– Ensure that resulting DA analysis is a legal IF

– Improve classification outcome using the well-formedness constraints from IF spec

• Classifier produces ranked list of DAs

• Select highest ranking DA that licenses the greatest number of arguments (ideally all)
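
The selection rule above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the DACPar implementation: the `licenses` callback stands in for a lookup against the IF specification, and the DA labels and argument names in `TOY_SPEC` are hypothetical.

```python
from typing import Callable, Iterable, Optional, Sequence


def select_da(
    ranked_das: Sequence[str],
    arguments: Iterable[str],
    licenses: Callable[[str, str], bool],
) -> Optional[str]:
    """Pick the DA that licenses the most parsed arguments.

    ranked_das is ordered best-first by the classifier. Ties on the number
    of licensed arguments go to the higher-ranked DA.
    """
    args = list(arguments)
    best_da, best_count = None, -1
    for da in ranked_das:
        count = sum(1 for arg in args if licenses(da, arg))
        if count > best_count:  # strict ">" keeps the higher-ranked DA on ties
            best_da, best_count = da, count
        if count == len(args):  # ideal case: every argument is licensed
            break
    return best_da


# Hypothetical licensing table, for illustration only (not taken from the IF spec).
TOY_SPEC = {
    "request-information+price": {"price"},
    "give-information+price": {"price", "room-type"},
    "give-information+room": {"price", "room-type", "time"},
}


def toy_licenses(da: str, arg: str) -> bool:
    return arg in TOY_SPEC.get(da, set())


ranked = list(TOY_SPEC)  # pretend this is the classifier's best-first ranking
chosen = select_da(ranked, ["price", "room-type"], toy_licenses)
```

In this toy case the top-ranked DA licenses only one of the two arguments, so the second-ranked DA, which licenses both, is chosen and the search stops there.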

DACPar - Initial Results

• SA classification accuracy ~65%

• DA classification accuracy ~45%

• Eng-to-Eng translation (from trans) 58% (43%)

• Eng-to-Eng translation (from hypo) 45% (32%)

Improved SR Engine

Showcase-1 Formal Evaluation

• Data used for evaluation

• Evaluation scheme: end-to-end, mono- and cross-lingual, SDU-based, human-grading

• Compiling of results

• Initial available results

• Lessons learned...

Evaluation - Data Used

• Goals:

– unseen data not used for system development

– both scenario-a and scenario-c, some MM data

– original mono-lingual data and cross-lingual data collected when using the system

• Mixture intended primarily for comprehensiveness, not for comparison of different conditions (statistical significance)

Evaluation Methodology

• Evaluation scheme: end-to-end, mono- and cross-lingual, SDU-based, human-grading

• Evaluate translation from transcriptions and from SR output; also SR WERs (SR as a paraphrase)

• Multiple human graders - should NOT be system developers

• One grader segments each turn into SDUs, graders then assign grades for each identified SDU

• Cross-lingual eval:

– client SDUs from E/G/F --> Italian

– agent SDUs from Italian --> E/G/F

• Donna’s grading program
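
The methodology above mentions SR WERs, but the slides do not say how they were computed. The sketch below uses the conventional definition, (substitutions + deletions + insertions) divided by the number of reference words, found with a word-level edit distance. The function name and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate of a recognizer hypothesis against a reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
wer = word_error_rate("i would like a room", "i would like room")
```

Word accuracy is often reported as 100% minus WER; whether the "% Accuracy" figures in this deck follow that convention is not stated.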

Compiling of Results

• Each site should compile its own results!

• Calculate separate results for:

– each dialogue, each grader, client/agent SDUs

• Average/combine results for:

– all graders, client+agent, all dialogues combined
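
One way a site might compile its figures is sketched below. The `Grade` record and its single binary `acceptable` field are hypothetical simplifications; the actual output format of the grading program is not described in these slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Grade(NamedTuple):
    dialogue: str     # e.g. "a1"
    grader: str       # e.g. "G1"
    side: str         # "client" or "agent"
    acceptable: bool  # hypothetical binary judgement for one SDU


def percent_acceptable(
    grades: Iterable[Grade], *keys: str
) -> Dict[Tuple[str, ...], float]:
    """Group grades by the named fields and return % acceptable per group.

    With no keys, all grades are pooled into one group keyed by ().
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [acceptable, graded]
    for g in grades:
        group = tuple(getattr(g, k) for k in keys)
        totals[group][0] += int(g.acceptable)
        totals[group][1] += 1
    return {group: 100.0 * ok / n for group, (ok, n) in totals.items()}


# Made-up sample grades, for illustration only.
sample = [
    Grade("a1", "G1", "client", True),
    Grade("a1", "G1", "client", False),
    Grade("a1", "G2", "agent", True),
    Grade("a2", "G1", "agent", True),
]
per_cell = percent_acceptable(sample, "dialogue", "grader", "side")
per_grader = percent_acceptable(sample, "grader")
overall = percent_acceptable(sample)
```

Pooling SDU counts this way weights each dialogue by its size. Averaging the per-dialogue percentages instead gives a different number, so sites would need to agree on one convention before results are combined.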

Results: SR Performance

German SR Accuracy

Speaker   % Accuracy
------------------------
g006      42.69
g034      66.32
g047      78.67
g051      69.43
Average   63.52

English SR Accuracy

Speaker   % Accuracy
------------------------
e025ap    68.6
e039ap    39.5
e011yp    83.1
e827cy    71.0
Average   61.9

English Evaluation

English Eval Data

a1 = e025ap ( 46 SDUs) ( 27 utts)

a2 = e039ap (123 SDUs) ( 37 utts)

amm = e011yp ( 54 SDUs) ( 39 utts)

cmm = e827cy (109 SDUs) ( 48 utts)

ALL = total (332 SDUs) (151 utts)

English-to-English

HYPO    G1      G2      G3      ALL    | WA
-------------------------------------------
a1      76(65)  74(61)  65(52)  72(59) | 68
a2      55(39)  43(32)  50(35)  50(35) | 39
amm     91(89)  93(85)  91(78)  91(84) | 84
cmm     71(63)  65(59)  69(56)  68(59) | 70
ALL     69(59)  63(54)  65(51)  66(56) | 61

English-to-English

SLT-TCT G1      G2      G3      ALL
----------------------------------------
a1      74(70)  76(54)  67(41)  72(55)
a2      62(46)  45(40)  46(32)  51(39)
amm     74(57)  67(54)  61(48)  67(53)
cmm     65(49)  40(31)  51(31)  52(37)
ALL     67(52)  51(41)  53(35)  58(43)

SLT-REC G1      G2      G3      ALL
----------------------------------------
a1      58(50)  52(33)  43(24)  51(36)
a2      41(27)  29(23)  33(21)  34(23)
amm     69(57)  70(63)  70(41)  70(54)
cmm     50(39)  32(26)  41(21)  41(29)
ALL     51(39)  40(32)  43(25)  45(32)

Results: English-to-English

English-to-English

        a1      a2      amm     cmm     ALL
----------------------------------------------
TCT     72(55)  51(39)  67(53)  52(37)  58(43)
REC     51(36)  34(23)  70(54)  41(29)  45(32)
HYPO    72(59)  50(35)  91(84)  68(59)  66(56)

Results: English-to-Italian

English-to-Italian

        a1      a2      amm     cmm     ALL
-----------------------------------------------
TCT     77(52)  48(36)  67(45)  59(31)  55(38)
REC     57(39)  29(19)  69(44)  39(24)  43(27)

German Evaluation

Graders:
G1: Benjamin
G2: Tanja
G3: Stephan

Dialogs:
a1:  g047ak ( 46 SDUs / 23 utts.)
a2:  g051ak (174 SDUs / 59 utts.)
amm: g006yk (108 SDUs / 70 utts.)
c1:  g034ck (314 SDUs / 98 utts.)
All: 644 SDUs / 350 utts.

German-to-German

        HYPO      SLT-TCT   SLT-REC
G1      57 (50)   28 (23)   26 (23)
G2      59 (50)   24 (6)    21 (5)
G3      64 (48)   39 (7)    32 (5)
All     58 (48)   31 (15)   25 (12)

        HYPO      SLT-TCT   SLT-REC
a1      81 (74)   55 (21)   52 (20)
a2      71 (59)   35 (14)   34 (14)
amm     38 (25)   34 (18)   22 (11)
c1      58 (49)   23 (8)    19 (8)
All     58 (48)   31 (15)   25 (12)

        G1        G2        G3        All
HYPO    57 (50)   59 (50)   64 (48)   58 (48)
SLT-TCT 28 (23)   24 (6)    39 (7)    31 (15)
SLT-REC 26 (23)   21 (5)    32 (5)    25 (12)

German-to-Italian

        SLT-TCT   SLT-REC
G1      31 (7)    26 (4)
G2      38 (9)    32 (6)
G3      30 (24)   26 (22)
All     32 (13)   27 (11)

        SLT-TCT   SLT-REC
a1      55 (21)   56 (22)
a2      39 (13)   34 (12)
amm     36 (18)   31 (15)
c1      25 (10)   19 (8)
All     32 (13)   27 (11)

        G1        G2        G3        All
SLT-TCT 31 (7)    38 (9)    30 (24)   32 (13)
SLT-REC 26 (4)    32 (6)    26 (22)   27 (11)

Lessons Learned/Issues

• Variance between graders - what to do?

• Segmentation variations - what to do?

• Grading with two binary decisions

• New data for next evaluation + save copy of current system version

• Release current eval data for system dev?

• Component Evaluation

Showcase-2 Open Issues

• Definitions of the domains and scenarios for showcase-2a and showcase-2b

• Data Collection

• New functionalities:

– for the users (client/agent)

– for the system developers & for demonstration

• Architecture modifications

Demo at IST Issues

• Details about the demo

• Demo “wrapper” around the system:

– client initiates call from a web page

– dealing with the push-to-talk issue

– other functionalities?

• Schedule for tests before demo

Documents
