Error Detection in Human-Machine Interaction Dan Bohus DoD Group, Oct 2002


Errors in Spoken-Language Interfaces

Speech recognition is problematic:
  Input signal quality
  Accents, non-native speakers
  Spoken language disfluencies: stutters, false starts, /mm/, /um/
Typical word error rates (WER) in spoken dialog systems (SDS): 10-30%
Systems today lack the ability to recover gracefully from errors
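The word error rate quoted above is conventionally the word-level edit distance between what was said and what was recognized, divided by the number of words said. A minimal sketch of that computation, assuming whitespace tokenization and case-insensitive matching (the function name `word_error_rate` is illustrative, not from the talk):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("Hamilton Ontario", "HILTON ONTARIO"))    # 0.5
print(word_error_rate("Adam Schumacher", "ATHENS IN AKRON"))    # 1.5
```

Because insertions count as errors, the rate can exceed 100% on a single utterance, as in the second call.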

An example
(S: system, U: user; the recognizer's output is shown in brackets)
S: Are you a registered user?
U: No I'm not. No [NO I'M NOT NO]
S: What is your full name?
U: Adam Schumacher [ATHENS IN AKRON]
S: There is an Athens in Georgia and in Greece. Which destination did you want?
U: Neither [NEITHER]
S: I'm sorry, I didn't catch that. There is an Athens in Georgia and in Greece. Which destination did you want?
U: Georgia [GEORGIA]
S: A flight from Athens... Where do you want to go?
U: Start over [START OVER]
S: Do you really want to start over?
U: Yes [YES]
S: What city are you leaving from?
U: Hamilton, Ontario [HILTON ONTARIO]
S: Sorry, I'm not sure I understood what you said. Where are you leaving from?
U: Hamilton [HILTON]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Toronto [TORONTO]

Pathway to a solution

Make systems aware of unreliability in their inputs
  Confidence scores
Develop a model which learns to choose optimally between several prevention/repair strategies
  Identify strategies
  Express them in a computable manner
  Develop the model
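To make "express them in a computable manner" concrete, here is a minimal sketch of a confidence-driven strategy choice. It is a hand-set threshold rule, not the learned model the slide calls for; the strategy names, the `choose_strategy` function, and the threshold values are all assumptions made for illustration.

```python
from enum import Enum


class Strategy(Enum):
    ACCEPT = "accept the input and move on"
    IMPLICIT_CONFIRM = "implicit confirmation"
    EXPLICIT_CONFIRM = "explicit confirmation"
    REJECT = "ask the user to repeat"


def choose_strategy(confidence: float,
                    accept_at: float = 0.9,      # assumed threshold
                    implicit_at: float = 0.6,    # assumed threshold
                    explicit_at: float = 0.3) -> Strategy:  # assumed threshold
    """Map a recognition confidence score in [0, 1] to a repair strategy."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= accept_at:
        return Strategy.ACCEPT
    if confidence >= implicit_at:
        return Strategy.IMPLICIT_CONFIRM
    if confidence >= explicit_at:
        return Strategy.EXPLICIT_CONFIRM
    return Strategy.REJECT


for score in (0.95, 0.7, 0.4, 0.1):
    print(score, "->", choose_strategy(score).value)
```

A learned model would replace the fixed thresholds with a policy estimated from dialog data; the interface (confidence in, strategy out) could stay the same.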

Papers

Error Detection in Spoken Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]

Problem Spotting in Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]

The Dual of Denial: Disconfirmations in Dialogue and Their Prosodic Correlates [E. Krahmer, M. Swerts, M. Theune, M. Weegels]

Goals

[Let’s look at dialog on page 2]

(1) Analysis of the positive and negative cues we use in response to implicit and explicit verification questions

(2) Explore the possibilities of spotting errors on-line

Explicit vs. Implicit
Explicit
  Presumably easier for the system to verify
  But there’s evidence that it’s not as easy …
  Leads to more turns, less efficiency, frustration
Implicit
  Efficiency
  But induces a higher cognitive burden, which can result in more confusion
  Systems don’t deal very well with it …

Clark & Schaefer Grounding model
  Presentation phase
  Acceptance phase
Various indicators
  Go ON / YES
  Go BACK / NO
Can we detect them reliably (when following implicit and explicit verification questions)?

Positive and Negative Cues

Positive              Negative
Short turns           Long turns
Unmarked word order   Marked word order
Confirm               Disconfirm
Answer                No answer
No corrections        Corrections
No repetitions        Repetitions
New info              No new info
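Some of the cues in this table can be read directly off the words of the user's turn. The sketch below extracts a few of them; the English word lists, the `surface_cues` function, and the idea of passing in the values mentioned by the system prompt are assumptions for illustration (the corpus discussed in the talk is Dutch, and the papers' own annotation was not necessarily automatic).

```python
# Assumed English stand-ins for confirmation / disconfirmation markers.
CONFIRM_WORDS = {"yes", "yeah", "right", "correct"}
DISCONFIRM_WORDS = {"no", "nope", "wrong", "not"}


def surface_cues(user_turn: str, prompt_values=()) -> dict:
    """Extract simple positive/negative cues from one user turn.

    prompt_values: single-word values the system just tried to verify
    (e.g. a city name). Their reappearance in the user turn counts as
    a repetition.
    """
    words = [w.strip(".,!?").lower() for w in user_turn.split()]
    words = [w for w in words if w]
    return {
        "n_words": len(words),                     # short vs. long turn
        "answered": bool(words),                   # answer vs. empty turn
        "confirm": any(w in CONFIRM_WORDS for w in words),
        "disconfirm": any(w in DISCONFIRM_WORDS for w in words),
        "repetition": any(v.lower() in words for v in prompt_values),
    }


print(surface_cues("No, to Athens Georgia", prompt_values=["Athens"]))
print(surface_cues("Yes"))
```

Cues such as marked word order, corrections, and new information need syntactic or semantic analysis and are not attempted here.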

Experimental Setup / Data

120 dialogs: Dutch SDS providing train timetable information
487 utterances
  44 (~10%) not used:
    Users accepting a wrong result
    Barge-in
    Users starting their own contribution
  This leaves 443 adjacent S/U utterances

Results – Nr of words

           No problems   Problems
Explicit   1.68          3.44
Implicit   3.21          7.12

Results – Empty turns (%)

           No problems   Problems
Explicit   0%            2.6%
Implicit   3.4%          10.3%

Results – Marked word order %

           No problems   Problems
Explicit   3.3%          4.4%
Implicit   1.2%          26.9%

Results – Yes/No

                   No problems   Problems
Explicit   Yes     92.8%         6.1%
           No      0%            56.6%
           Other   7.1%          37.1%
Implicit   Yes     0%            0%
           No      0%            15.4%
           Other   100% ?        84.6%

Results – Repeated/Corrected/New

                       No problems   Problems
Explicit   Repeated    8.5%          23.9%
           Corrected   0%            72.6%
           New         11.4%         12.4%
Implicit   Repeated    2.4%          61.0%
           Corrected   0%            92.3%
           New         53.6%         36.5%

First conclusion

People use more negative cues when there are problems

And even more so for implicit confirmations (vs. explicit ones)

How well can you classify?
Using individual features
  Look at precision/recall
  Explicit: absence of confirmation
  Implicit: non-zero number of corrections
Multiple features
  Used memory-based learning
  97% accuracy (majority baseline 68%)
  Confirm + Correct is winning, although each is less good individually
  This is overall, right? How about for explicit vs. implicit?
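The single-feature evaluation mentioned above amounts to treating one cue as a problem detector and scoring it with precision and recall. A minimal sketch for the explicit-verification rule "no confirmation means a problem"; the six toy turns are invented for illustration and are not data from the papers.

```python
def precision_recall(predicted, actual):
    """Precision and recall of boolean problem predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented toy turns: did the user confirm, and was there really a problem?
turns = [
    {"confirm": True,  "problem": False},
    {"confirm": True,  "problem": False},
    {"confirm": False, "problem": True},
    {"confirm": False, "problem": True},
    {"confirm": False, "problem": False},   # false alarm
    {"confirm": True,  "problem": True},    # missed problem
]

predicted = [not t["confirm"] for t in turns]   # rule: no confirmation -> problem
actual = [t["problem"] for t in turns]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same function scores the implicit-verification rule by swapping in a prediction such as "number of corrections is non-zero".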

BUT !!! How many of these features are available on-line?

Positive              Negative
Short turns           Long turns
Unmarked word order   Marked word order
Confirm               Disconfirm
Answer                No answer
No corrections ?      Corrections ?
No repetitions ?      Repetitions ?
New info ?            No new info ?
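One way to probe the on-line question is to run the same memory-based classifier twice: once on all cues and once on only those without a question mark in the table. The sketch below uses plain nearest-neighbour lookup with an unweighted overlap distance as a stand-in for memory-based learning; the feature names, the toy examples, and the on-line/off-line split are assumptions read off the slide, and the actual learner used in the papers may weight features differently.

```python
from collections import Counter

ONLINE = ("long_turn", "marked_order", "disconfirm", "no_answer")
OFFLINE = ("correction", "repetition", "no_new_info")   # marked "?" in the table
ALL = ONLINE + OFFLINE


def example(*active):
    """Build a feature dict in which only the named cues are present."""
    return {f: f in active for f in ALL}


def overlap_distance(a, b, features):
    """Number of the given features on which two examples disagree."""
    return sum(a[f] != b[f] for f in features)


def knn_predict(train, query, features, k=1):
    """Label of the majority among the k stored examples closest to query."""
    nearest = sorted(train, key=lambda ex: overlap_distance(ex[0], query, features))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Invented toy memory, not data from the papers.
train = [
    (example(), "ok"),
    (example("long_turn", "marked_order", "disconfirm",
             "correction", "repetition", "no_new_info"), "problem"),
    (example("long_turn"), "ok"),
    (example("no_answer", "no_new_info"), "problem"),
]
query = example("long_turn", "correction", "repetition", "no_new_info")

print(knn_predict(train, query, ALL))      # problem
print(knn_predict(train, query, ONLINE))   # ok
```

In this toy case the prediction flips once the "?" features are withheld, which is exactly the risk the slide raises about the reported accuracy.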

What else can we throw at it?
  Prosody (next paper)
  Lexical information
  Acoustic confidence scores
    Maybe also of previous utterances
  Repetitions/Corrections/New info on transcript?
  …

Papers

Error Detection in Spoken Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]

Problem Spotting in Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]

The Dual of Denial: Disconfirmations in Dialogue and Their Prosodic Correlates [E. Krahmer, M. Swerts, M. Theune, M. Weegels]

Goals

Investigate the prosodic correlates of disconfirmations
  Is this slightly different than before? (i.e. now looking at any corrections? Answer: No)
Looked at prosody on “NO” as a go_on vs. a go_back:
  Do you want to fly from Pittsburgh?
  Shall I summarize your trip?

Human-human

Higher pitch range, longer duration
Preceded by a longer delay
High H% boundary tone

Expected to see same behavior for disconfirmation in human-machine

Prosodic correlates

Features        POSITIVE (‘go on’)   NEGATIVE (‘go back’)
Boundary tone   Low                  High
Duration        Short                Long
Delay           Short                Long
Pause           Short                Long
Pitch range     Low                  High

Yes, the correlations are there as expected

Perceptual analysis
  Took 40 “No” from No+stuff: 20 go_on and 20 go_back (note that some features are lost this way …)
  Forced-choice randomized task, with no feedback; 25 native speakers of Dutch
Results
  17 go_on correctly identified above chance
  15 go_back correctly identified above chance, but also 1 incorrectly identified above chance
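"Above chance" for a single stimulus can be made concrete with a binomial test over the 25 listeners' two-way choices. The slides do not say which test was used, so the sketch below assumes a one-sided binomial test at the 0.05 level with a chance rate of 0.5; the function names are illustrative.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def min_agreeing_listeners(n: int, alpha: float = 0.05, p: float = 0.5):
    """Smallest k such that k or more agreeing listeners is significant.

    Returns None if even unanimous agreement does not reach alpha.
    """
    for k in range(n + 1):
        if binomial_tail(n, k, p) < alpha:
            return k
    return None


print(min_agreeing_listeners(25))        # 18
print(round(binomial_tail(25, 18), 4))   # 0.0216
```

Under these assumptions, a "No" counts as identified above chance when at least 18 of the 25 listeners give the same label.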

Discussion
Q1: Blurred relationships …
  Confidence annotation
  Go_on / Go_back signal
    Is that the same as corrections?
    Is that the most general case for responses to implicit/explicit verifications, or should we have a separate detector?
Q2: What other features could we throw at these problems? What are the “most juicy” ones?

Discussion

Q3: For implicit confirms, are these different in terms of induced response behavior?
  When do you want to leave Pittsburgh?
  Travelling from Pittsburgh … when do you want to leave?
  When do you want to leave from Pittsburgh to Boston?
