
SIG IL 2000

Evaluation of a Practical Interlingua for Task-Oriented Dialogue

Lori Levin, Donna Gates, Alon Lavie, Fabio Pianesi, Dorcas Wallace, Taro

Watanabe, Monika Woszczyna

SIG IL 2000

Interchange Format Design

The CSTAR II Interchange Format was designed and developed by all of the CSTAR II partners: CMU, IRST, ETRI, UKA, CLIPS++ and ATR.

www.c-star.org

SIG IL 2000

Expressivity vs Simplicity

• If it is not expressive enough, components of meaning will be lost.

• If it is not simple enough, it can’t be used reliably across sites.

• If it is not simple enough, it will not be quickly portable to new domains.

SIG IL 2000

Task Oriented Sentences

• Perform an action in the domain.

• Are not descriptive.

• Contain fixed expressions that cannot be translated literally.

SIG IL 2000

Domain Actions: Extended, Domain-Specific Speech Acts

Examples:

c:request-information+availability+room

a:give-information+personal-data

c:give-information+temporal+arrival

SIG IL 2000

Components of the Interchange Format

speaker | a: (agent)

speech act | give-information

concept* | +availability+room

argument* | (room-type=(single & double), time=md12)
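The textual form of a dialogue act can be unpacked mechanically into these four components. Below is a minimal sketch in Python (illustrative only; the function name and the choice to leave the argument list unparsed are assumptions, not part of the IF specification).

```python
# Minimal sketch: split an IF dialogue act into speaker, speech act, concepts, arguments.
# Illustrative only; argument values such as (single & double) are left unparsed here.
import re

def parse_if(da: str):
    match = re.match(r"(?P<speaker>[ac]):(?P<action>[^ (]+)\s*(?:\((?P<args>.*)\))?$", da.strip())
    if match is None:
        raise ValueError(f"not a well-formed IF dialogue act: {da!r}")
    speech_act, *concepts = match.group("action").split("+")
    return {
        "speaker": match.group("speaker"),       # a (agent) or c (client)
        "speech_act": speech_act,                # e.g. give-information
        "concepts": concepts,                    # e.g. ['availability', 'room']
        "arguments": match.group("args") or "",  # e.g. 'room-type=(single & double), time=md12'
    }

print(parse_if("a:give-information+availability+room (room-type=(single & double), time=md12)"))
```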

SIG IL 2000

Examples

no that's not necessary
c:negate

yes I am
c:affirm

and I was wondering what you have in the way of rooms available during that time
c:request-information+availability+room

my name is alex waibel
c:give-information+personal-data (person-name=(given-name=alex, family-name=waibel))

and how will you be paying for this
a:request-information+payment (method=question)

I have a mastercard
c:give-information+payment (method=mastercard)

SIG IL 2000

Not Covered or Not Represented in IF

• Relative clauses

• Comparatives (in general)

• Tense

• Number (but quantity is represented)

SIG IL 2000

Scope of the IF

May 1999

Speech acts: 54
Concepts: 84
Arguments: 118

SIG IL 2000

Expressivity: Coverage Experiment

• Development data was tagged with interlingua representations by human experts.

• Sentences that are not intended to be covered by the interlingua (as judged by human experts) were given the tag “no-tag.”

• Test data was tagged by human experts.
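Given such tags, coverage can be read off as the share of DA units that received a real tag rather than "no-tag". A minimal sketch (the tag list is a hypothetical input, not data from the experiment):

```python
# Illustrative sketch: coverage as the fraction of DA units whose human tag is not "no-tag".
def coverage(tags):
    """tags: one human-assigned dialogue-act label per DA unit (hypothetical input)."""
    covered = sum(1 for t in tags if t != "no-tag")
    return covered / len(tags)

# Toy example; on the real development data, the 5.9% no-tag share reported on the
# following slides corresponds to roughly 94% coverage.
print(coverage(["c:affirm", "no-tag", "c:thank", "c:greeting"]))   # 0.75
```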

SIG IL 2000

Coverage Experiment: Development and Test Data

Languages | Dialogue Type | Number of DA Units

Development Data:
English | monolingual | 2698
Italian | monolingual | 234
Korean | bilingual (only Korean utterances are included) | 1142

Test Data:
Japanese-English | bilingual | 6069

SIG IL 2000

The Interchange Format Database

61.2.3 olang I lang I Prv IRST “telefono per prenotare delle stanze per quattro colleghi”

61.2.3 olang I lang E Prv IRST “I’m calling to book some rooms for four colleagues”

61.2.3 IF Prv IRST c:request-action+reservation+features+room (for-whom= (associate, quantity=4))

61.2.3 comments: dial-oo5-spkB-roca0-02-3

Record format:
d.u.sdu olang X lang Y Prv Z "sdu-in-language-Y on one line"
d.u.sdu olang X lang E Prv Z "sdu-in-English on one line"
d.u.sdu IF Prv Z dialogue-act-on-one-line
d.u.asdu comments: your comments
d.u.asdu comments: go here
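As a reading aid, here is a rough Python sketch of splitting one database line into its fields. The field positions are inferred from the examples above; the function name is our own, and shlex is used only to keep the quoted utterance in one piece.

```python
# Rough sketch of splitting one IF-database line into fields (index, olang, lang, site, text/IF).
# Purely illustrative; the real database format may differ in details not shown on this slide.
import shlex

def parse_db_line(line: str):
    fields = shlex.split(line)          # keeps the quoted utterance together as one token
    index = fields[0]                   # d.u.sdu, e.g. 61.2.3
    if fields[1] == "olang":            # an utterance line: olang X lang Y Prv Z "text"
        return {"index": index, "olang": fields[2], "lang": fields[4],
                "site": fields[6], "text": fields[7]}
    if fields[1] == "IF":               # an IF line: IF Prv Z dialogue-act
        return {"index": index, "site": fields[3], "if": " ".join(fields[4:])}
    return {"index": index, "raw": line}

print(parse_db_line('61.2.3 olang I lang E Prv IRST "I\'m calling to book some rooms for four colleagues"'))
print(parse_db_line("61.2.3 IF Prv IRST c:request-action+reservation+features+room (for-whom=(associate, quantity=4))"))
```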

SIG IL 2000

Coverage of Top 10 Dialogue Acts in Development Data

Cumulative % | Percent | Count | DA
 | 5.9 | 244 | no-tag
15.7 | 15.7 | 652 | acknowledge
19.8 | 4.1 | 172 | affirm
23.3 | 3.4 | 143 | thank
26.0 | 2.7 | 113 | introduce-self
28.0 | 2.0 | 85 | give-info+price
30.1 | 2.0 | 85 | greeting
31.9 | 1.9 | 78 | give-info+temp
33.7 | 1.8 | 75 | give-info+num
35.5 | 1.8 | 73 | give-info+price+room
37.2 | 1.7 | 70 | req-info+payment
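The Percent and Cumulative % columns follow directly from the raw counts. A small sketch of that derivation; the counts are from this slide, but the total number of tagged DA units is a placeholder value, so the figures come out only approximately equal to the slide's.

```python
# Sketch of deriving the percent and cumulative-% columns from raw DA counts.
# The total below is a placeholder, not a figure stated on this slide.
def coverage_table(da_counts, total):
    cumulative = 0.0
    rows = []
    for da, count in da_counts:              # assumed sorted by count, most frequent first
        pct = 100.0 * count / total
        cumulative += pct
        rows.append((round(cumulative, 1), round(pct, 1), count, da))
    return rows

dev_counts = [("acknowledge", 652), ("affirm", 172), ("thank", 143), ("introduce-self", 113)]
for row in coverage_table(dev_counts, total=4150):   # placeholder total
    print(row)
```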

SIG IL 2000

Coverage of Top 10 Speech Acts in Development Data

Cumulative % | Percent | Count | Speech Act
30.1 | 30.1 | 1250 | give-information
45.8 | 15.7 | 655 | acknowledge
57.7 | 11.9 | 493 | req-information
62.7 | 5.0 | 209 | req-verif-give-inf.
67.6 | 4.9 | 203 | request-action
71.7 | 4.1 | 172 | affirm
75.1 | 3.4 | 143 | thank
77.9 | 2.7 | 113 | introduce-self
80.2 | 2.4 | 98 | offer
82.4 | 2.1 | 89 | accept

SIG IL 2000

Coverage of Top 10 Dialogue Acts in Test Data

Cumulative % | Percent | Count | DA
 | 4.6 | 263 | no-tag
15.6 | 15.6 | 885 | acknowledge
20.2 | 4.6 | 260 | thank
23.7 | 3.5 | 200 | introduce-self
27.0 | 3.4 | 191 | affirm
29.7 | 2.7 | 153 | apologize
32.3 | 2.6 | 147 | greeting
34.6 | 2.3 | 128 | closing
36.3 | 1.7 | 98 | give-info+personal
38.0 | 1.7 | 95 | give-info+temp.
39.5 | 1.6 | 89 | give-info+price

SIG IL 2000

Coverage of Top 10 Speech Acts in Test Data

Cumulative % | Percent | Count | Speech Act
25.6 | 25.6 | 1454 | give-information
41.7 | 16.1 | 916 | acknowledge
53.6 | 11.9 | 677 | req-information
58.2 | 4.6 | 260 | thank
62.0 | 3.7 | 213 | req-verif-give-info
65.5 | 3.5 | 200 | introduce-self
68.8 | 3.4 | 191 | affirm
72.0 | 3.2 | 181 | request-action
74.8 | 2.8 | 159 | accept
77.5 | 2.7 | 153 | apologize

SIG IL 2000

Simplicity: Consistency of Use Across Sites

• Successful international demo.

• After testing English-Italian and English-Korean, Italian-Korean worked without extra effort.

• Inter-coder agreement experiment

• Cross-site evaluation experiment

SIG IL 2000

Inter-coder Agreement Experiment

• 84 DA units from Japanese-English data

• Some dialogue fragments and some isolated sentences

• Coded at CMU and IRST

• Results reported in percent agreement

SIG IL 2000

Inter-Coder Agreement Results

Component | % Agreement
Speech Act | 82.14
Concept List | 88.00
Dialogue Act | 65.48
Argument List | 85.79
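Percent agreement here is simply the share of DA units on which the two sites produced the same label for a given component. A toy sketch (exact matching only; how partial credit on concept and argument lists was handled is not specified on this slide):

```python
# Simple sketch of percent agreement between two coders' labels for the same DA units.
def percent_agreement(coder_a, coder_b):
    assert len(coder_a) == len(coder_b)
    matches = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return 100.0 * matches / len(coder_a)

# Hypothetical codings for three DA units.
cmu  = ["c:give-information+payment", "c:affirm", "c:request-information+availability+room"]
irst = ["c:give-information+payment", "c:affirm", "c:offer-search+availability"]
print(percent_agreement(cmu, irst))   # ~66.7 on this toy example
```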

SIG IL 2000

Inter-Coder Agreement Error Analysis of 33 Sentences

• 6 are equivalent due to ambiguity in the IF specification.

• 16 are similar enough to produce output with equivalent meaning.
  – offer-search+availability: Let me check the availability
  – give-information+search+availability: I will check the availability

• 4 contain differences where the input sentence was ambiguous and taggers chose different meanings.
  – 6 o'clock could be 6:00 or 18:00

• 5 contain errors by one or more taggers and would produce outputs with different meanings.

SIG IL 2000

Cross-Site Evaluation

• Analysis and generation grammars were written at different sites (CMU and IRST).

• Analysis at CMU produces IF.

• IF is sent to IRST.

• Generation at IRST produces Italian sentences.

SIG IL 2000

Intra-Site Evaluation

• Analysis and generation are both performed at CMU by researchers in constant contact with each other.

• English-IF-English, English-German, and English-Japanese

SIG IL 2000

Cross Site Evaluation Data

• 130 utterances from a user study performed at CMU

• Speech input

• "Traveller" is a second-time user.

• "Agent" is a system developer.

• Traveller and agent cannot see or hear each other.

• All communication is through English-IF-English paraphrase.

SIG IL 2000

Evaluation Scoring

• OK: meaning is preserved

• Perfect: meaning is preserved and the output is fluent

• Bad: meaning is not preserved

• Acceptable: sum of Perfect and OK

• English-German was graded at CMU, IRST, and CLIPS.

• English-IF-English was graded at CMU and CLIPS.

• English-Japanese was graded at CMU.

• English-Italian was graded at IRST.

• English-French was graded at CLIPS.
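% Acceptable in the tables that follow is therefore the share of utterances graded Perfect or OK. A minimal sketch of that rule (the grade labels and list format are illustrative):

```python
# Sketch of the scoring rule above: an utterance is acceptable if graded Perfect or OK,
# and % Acceptable is the share of acceptable utterances.
def percent_acceptable(grades):
    acceptable = sum(1 for g in grades if g in ("perfect", "ok"))
    return 100.0 * acceptable / len(grades)

grades = ["perfect", "ok", "bad", "ok"]      # toy example, one grade per utterance
print(percent_acceptable(grades))            # 75.0
```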

SIG IL 2000

End-to-End Evaluation Results

Method | Output Lang. | % Acceptable | Grader | Number of Graders
1. Recognition | English | 78% | CMU | 3
2. Transcr. | English | 75% | CMU+CLIPS | 4
3. Rec. | English | 61% | CMU+CLIPS | 4
4. Transcr. | Japanese | 77% | CMU | 2
5. Rec. | Japanese | 62% | CMU | 2
6. Transcr. | German | 69% | CMU+IRST+CLIPS | 5
7. Rec. | German | 60% | CMU+IRST+CLIPS | 5
8. Transcr. | Italian | 73% | IRST | 6
9. Rec. | Italian | 61% | IRST | 6
10. Transcr. | French | 66% | IRST | 2
11. Rec. | French | 56% | IRST | 2

SIG IL 2000

End-to-End Evaluation Results

Method | Output Lang. | % Acceptable | Grader | Number of Graders
1. Recognition | English | 78% | CMU | 3
2. Transcr. | English | 74% | CMU | 3
3. Rec. | English | 59% | CMU | 3
4. Transcr. | Japanese | 77% | CMU | 2
5. Rec. | Japanese | 62% | CMU | 2
6. Transcr. | German | 70% | CMU | 2
7. Rec. | German | 58% | CMU | 2
8. Transcr. | German | 67% | IRST | 2
9. Rec. | German | 59% | IRST | 2
10. Transcr. | Italian | 73% | IRST | 6
11. Rec. | Italian | 61% | IRST | 6

SIG IL 2000

Conclusions

• Coverage is surprisingly good for a certain type of data: role playing for flight reservations, hotel reservations, greetings, and payment.

• Cross-site evaluation is about as good as intra-site evaluation.

• Inter-coder agreement could be improved, but not all errors affect translation quality.

SIG IL 2000

Current Work

• Integrating the task-oriented interlingua with a more traditional frame-based interlingua for descriptive sentences.

• The NESPOLE! Consortium: http://nespole.itc.it