Error Detection in Human-Machine Interaction
Dan Bohus
DoD Group, Oct 2002
Errors in Spoken-Language Interfaces
Speech recognition is problematic:
  Input signal quality
  Accents, non-native speakers
  Spoken language disfluencies: stutters, false starts, /mm/, /um/
Typical word error rates in SDS: 10-30%
Systems today lack the ability to gracefully recover from error
An example
S: Are you a registered user ?
U: No I'm not. No [NO I'M NOT NO]
S: What is your full name ?
U: Adam Schumacher [ATHENS IN AKRON]
S: There is an Athens in Georgia and in Greece. Which destination did you want ?
U: Neither [NEITHER]
S: I'm sorry, I didn't catch that. There is an Athens in Georgia and in Greece. Which destination did you want ?
U: Georgia [GEORGIA]
S: A flight from Athens... Where do you want to go ?
U: Start over [START OVER]
S: Do you really want to start over ?
U: Yes [YES]
S: What city are you leaving from ?
U: Hamilton, Ontario [HILTON ONTARIO]
S: Sorry, I'm not sure I understood what you said. Where are you leaving from ?
U: Hamilton [HILTON]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from ?
U: Toronto [TORONTO]
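To make the word error rate figure concrete, here is a minimal sketch that scores three turns from the example dialog. It assumes the usual definition of WER (word-level edit distance divided by the number of reference words); the function name `word_error_rate` is made up for illustration, and punctuation was stripped by hand from the user turns.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.upper().split()
    hyp = hypothesis.upper().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# (what the user said, what the recognizer returned), from the example dialog
turns = [
    ("no i'm not no", "NO I'M NOT NO"),
    ("adam schumacher", "ATHENS IN AKRON"),
    ("hamilton ontario", "HILTON ONTARIO"),
]
for said, recognized in turns:
    wer = word_error_rate(said, recognized)
    print(f"{said!r} -> {recognized!r}: WER = {wer:.2f}")
```

Note that WER can exceed 1.0 when the recognizer inserts extra words, as in the "Adam Schumacher" turn.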
Pathway to a solution
Make systems aware of unreliability in their inputs
  Confidence scores
Develop a model which learns to optimally choose between several prevention/repair strategies
  Identify strategies
  Express them in a computable manner
  Develop the model
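As a rough illustration of what "expressing strategies in a computable manner" could look like, the sketch below maps a recognition confidence score to one of three strategies. This is a hand-thresholded stand-in, not the learned model the slide calls for; the function name, the strategy labels, and both threshold values are hypothetical.

```python
def choose_strategy(confidence: float,
                    reject_below: float = 0.3,
                    explicit_below: float = 0.7) -> str:
    """Pick a prevention/repair strategy from a confidence score in [0, 1].

    The thresholds are illustrative placeholders; a learned model would
    replace this fixed rule.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < reject_below:
        return "reject"            # ask the user to repeat
    if confidence < explicit_below:
        return "explicit_confirm"  # e.g. "Did you say Hamilton ?"
    return "implicit_confirm"      # e.g. "Leaving from Hamilton ... where to ?"


for score in (0.1, 0.5, 0.9):
    print(score, choose_strategy(score))
```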
Papers
Error Detection in Spoken Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
Problem Spotting in Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
The Dual of Denial: Disconfirmations in Dialogue and Their Prosodic Correlates [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
Goals
[Let’s look at dialog on page 2]
(1) Analysis of the positive and negative cues users give in response to implicit and explicit verification questions
(2) Explore the possibilities of spotting errors on line
Explicit vs. Implicit
Explicit
  Presumably easier for the system to verify
  But there’s evidence that it’s not as easy …
  Leads to more turns, less efficiency, frustration
Implicit
  Efficiency
  But induces a higher cognitive burden, which can result in more confusion
  ~ Systems don’t deal very well with it…
Clark & Schaefer grounding model
  Presentation phase
  Acceptance phase
Various indicators
  Go ON / YES
  Go BACK / NO
Can we detect them reliably (when following implicit and explicit verification questions) ?
Positive and Negative Cues
Positive              Negative
Short turns           Long turns
Unmarked word order   Marked word order
Confirm               Disconfirm
Answer                No answer
No corrections        Corrections
No repetitions        Repetitions
New info              No new info
Experimental Setup / Data
120 dialogs: Dutch SDS providing train timetable information
487 utterances
44 (~10%) not used
  Users accepting a wrong result
  Barge-in
  Users starting their own contribution
Left 443 resulting adjacent S/U utterances
Results – Nr of words
           No problems   Problems
Explicit   1.68          3.44
Implicit   3.21          7.12
Results – Empty turns (%)
           No problems   Problems
Explicit   0%            2.6%
Implicit   3.4%          10.3%
Results – Marked word order %
           No problems   Problems
Explicit   3.3%          4.4%
Implicit   1.2%          26.9%
Results – Yes/No
                   No problems   Problems
Explicit   Yes     92.8%         6.1%
           No      0%            56.6%
           Other   7.1%          37.1%
Implicit   Yes     0%            0%
           No      0%            15.4%
           Other   100% ?        84.6%
Results – Repeated/Corrected/New
                       No problems   Problems
Explicit   Repeated    8.5%          23.9%
           Corrected   0%            72.6%
           New         11.4%         12.4%
Implicit   Repeated    2.4%          61.0%
           Corrected   0%            92.3%
           New         53.6%         36.5%
First conclusion
People use more negative cues when there are problems
And even more so for implicit confirmations (vs. explicit ones)
How well can you classify
Using individual features
  Look at precision/recall
  Explicit: absence of confirmation
  Implicit: non-zero number of corrections
Multiple features
  Used memory-based learning
  97% accuracy (majority baseline 68%)
  Confirm + Correct is winning, although individually less good
  This is overall, right ? How about for explicit vs. implicit ?
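The classification setup above can be sketched roughly as follows. Memory-based learning is a nearest-neighbour approach: all training examples are stored and a new turn gets the majority label of its closest stored examples. The sketch assumes a simple overlap distance and k = 3; the feature encoding and every data point are invented for illustration and are not the data from the papers.

```python
from collections import Counter

# Hypothetical cue annotations per user turn:
# (confirmation present?, at least one correction?, long turn?)
MEMORY = [
    ((1, 0, 0), "ok"),
    ((1, 0, 0), "ok"),
    ((1, 0, 1), "ok"),
    ((0, 0, 0), "ok"),
    ((0, 1, 1), "problem"),
    ((0, 1, 0), "problem"),
    ((0, 0, 1), "problem"),
]


def overlap_distance(a, b):
    """Number of feature positions on which two examples disagree."""
    return sum(x != y for x, y in zip(a, b))


def classify(features, memory=MEMORY, k=3):
    """Majority label among the k stored examples closest to `features`."""
    nearest = sorted(memory,
                     key=lambda ex: overlap_distance(features, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def precision_recall(gold, predicted, positive="problem"):
    """Precision and recall for the `positive` class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up held-out turns, only to show how the pieces fit together.
held_out = [((1, 0, 0), "ok"), ((0, 1, 1), "problem"), ((0, 0, 1), "problem")]
gold = [label for _, label in held_out]
predicted = [classify(features) for features, _ in held_out]
print(predicted, precision_recall(gold, predicted))
```

A single-feature detector such as "absence of confirmation" corresponds to predicting "problem" whenever the first feature is 0, and can be scored with the same `precision_recall` helper.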
BUT !!! How many of these features are available on-line?
Positive              Negative
Short turns           Long turns
Unmarked word order   Marked word order
Confirm               Disconfirm
Answer                No answer
No corrections ?      Corrections ?
No repetitions ?      Repetitions ?
New info ?            No new info ?
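As a rough sketch of which cues can be read directly off the recognizer output, the function below computes turn length, empty turns, and confirm/disconfirm markers from a recognized string. The word lists, the long-turn threshold, and the function name are hypothetical, and the English markers stand in for the Dutch ones in the data. The rows marked "?" (corrections, repetitions, new info) need the dialog context and are not covered.

```python
YES_WORDS = {"yes", "yeah", "right", "correct"}  # hypothetical marker list
NO_WORDS = {"no", "nope", "wrong"}               # hypothetical marker list


def surface_cues(recognized: str, long_turn_threshold: int = 3) -> dict:
    """Cues computable from the recognized words of a single user turn."""
    words = [w.strip(".,!?") for w in recognized.lower().split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    return {
        "n_words": len(words),
        "empty_turn": not words,
        "long_turn": len(words) > long_turn_threshold,
        "confirm": any(w in YES_WORDS for w in words),
        "disconfirm": any(w in NO_WORDS for w in words),
    }


print(surface_cues("No I'm not. No"))
print(surface_cues(""))
```

These values depend on the recognizer output, so they inherit its errors; that is one reason to also consider confidence scores and prosody.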
What else can we throw at it ?
Prosody (next paper)
Lexical information
Acoustic confidence scores
  Maybe also of previous utterances
Repetitions/Corrections/New info on transcript ?
… …
Papers
Error Detection in Spoken Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
Problem Spotting in Human-Machine Interaction [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
The Dual of Denial: Disconfirmations in Dialogue and Their Prosodic Correlates [E. Krahmer, M. Swerts, M. Theune, M. Weegels]
Goals
Investigate the prosodic correlates of disconfirmations
  Is this slightly different than before ? (i.e. now looking at any corrections ? Answer: No)
Looked at prosody on “NO” as a go_on vs a go_back:
  Do you want to fly from Pittsburgh ?
  Shall I summarize your trip ?
Human-human
Higher pitch range, longer duration
Preceded by a longer delay
High H% boundary tone
Expected to see same behavior for disconfirmation in human-machine
Prosodic correlates
Features        POSITIVE (‘go on’)   NEGATIVE (‘go back’)
Boundary tone   Low                  High
Duration        Short                Long
Delay           Short                Long
Pause           Short                Long
Pitch range     Low                  High
Yes, the correlations are there as expected
Perceptual analysis
Took 40 “No” from No+stuff, 20 go_on and 20 go_back (note that some features are lost this way…)
Forced choice randomized task, w/ no feedback; 25 native speakers of Dutch
Results
  17 go_on correctly identified above chance
  15 go_back correctly identified above chance; but also 1 incorrectly identified above chance.
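To give a feel for what "identified above chance" can mean with 25 listeners, here is a small worked calculation. The slides do not say which test was used; this sketch assumes a one-sided binomial test per stimulus, a chance level of 0.5 (two-way forced choice), and a 0.05 significance level. The function names are made up for illustration.

```python
from math import comb


def p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def min_correct_for_significance(n: int, alpha: float = 0.05) -> int:
    """Smallest number of agreeing listeners with tail probability < alpha."""
    for k in range(n + 1):
        if p_at_least(k, n) < alpha:
            return k
    raise ValueError("no count reaches significance for this n and alpha")


n_listeners = 25
k = min_correct_for_significance(n_listeners)
print(f"{k} of {n_listeners} listeners needed, "
      f"p = {p_at_least(k, n_listeners):.4f}")
```

Under these assumptions, at least 18 of the 25 listeners must pick the same label for a stimulus to count as classified above chance (tail probability about 0.022, versus about 0.054 for 17).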
Discussion
Q1: Blurred relationships …
  Confidence annotation
  Go_on / Go_back signal
    Is that the same as corrections ?
    Is that the most general case for responses to implicit/explicit verifications, or should we have a separate detector ?
Q2: What other features could we throw at these problems ? What are the “most juicy” ones ?
Discussion
Q3: For implicit confirms, are these different in terms of induced response behavior ?
  When do you want to leave Pittsburgh ?
  Travelling from Pittsburgh … when do you want to leave ?
  When do you want to leave from Pittsburgh to Boston ?