An Investigation into Recovering from Non-understanding Errors
Dan Bohus
Dialogs on Dialogs Reading Group Talk
Carnegie Mellon University, October 2004
2
Non-understandings
S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
- System knows there was a user turn, but
  - There is no relevant semantic information in the input
  - Confidence is too low to trust any semantic information in the input
- 10 – 30% of turns in a mixed initiative system
- GOAL: Do a better job at recovering from non-understandings
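The two conditions above can be read as a simple predicate. The following is a minimal sketch, assuming a hypothetical `RecognitionResult` record and an illustrative rejection threshold of 0.3; neither comes from the talk or from the system it describes.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionResult:
    """Hypothetical container for one decoded user turn."""
    text: str                                      # recognizer output
    concepts: dict = field(default_factory=dict)   # parsed semantic concepts
    confidence: float = 0.0                        # score in [0, 1]


def is_non_understanding(result: RecognitionResult,
                         rejection_threshold: float = 0.3) -> bool:
    """True if the turn carries no usable semantic information.

    The threshold value is an assumption chosen for illustration.
    """
    if not result.concepts:
        return True          # no relevant semantic information in the input
    return result.confidence < rejection_threshold  # too unreliable to trust


garbled = RecognitionResult("OKAY IN THAT SAME PAY")
clean = RecognitionResult("URBANA CHAMPAIGN",
                          {"departure_city": "Urbana Champaign"}, 0.9)
```

Here `garbled` is flagged as a non-understanding because no concepts were parsed, while `clean` passes both checks.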
3
Recovery Ingredients
- Detection
- Set of strategies (actions)
- Policy (method for choosing between actions)
4
Recovery Ingredients – Non-understandings
- Detection
  - Generally, system knows when a non-understanding happened
- Set of strategies (actions)
  - Notify non-understanding, repeat question, ask repeat/rephrase, provide help, etc.
- Policy (method for choosing between actions)
  - Traditionally fixed heuristic
5
Issues under Investigation
- Detection
  - Analysis of error types, blame assignment, impact on task performance
  - Detection of error type
  - Adaptation of rejection threshold
- Set of strategies
  - Investigate individual strategy performance
  - Identify potential new strategies
- Policy
  - Impact of a “smarter” policy on performance
  - Building a policy from data
7
Experimental Design - Overview
- Subjects interact over the telephone with RoomLine
  - Perform a number of scenario-based tasks
- Between-subjects experiment
  - Control: system uses a random (uniform) policy for engaging the non-understanding recovery strategies
  - Wizard: policy is determined at runtime by a human (wizard)
- 46 subjects, balanced Gender x Native
8
Non-understanding Strategies
S: For when do you need the room?
U: [non-understanding]
Strategy groups: MOVE-ON, HELP, SIGNAL
- FAIL: Sorry, I didn’t catch that. Tell me for what day you need the room
- YOU CAN SAY (YCS): Sorry, I didn’t catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
- TERSE YOU CAN SAY (TYCS): Sorry, I didn’t catch that. You can say something like tomorrow at 10 am …
- FULL HELP (HELP): Sorry, I didn’t catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
- ASK REPEAT (AREP): Could you please repeat that?
- ASK REPHRASE (ARPH): Could you please try to rephrase that?
- NOTIFY (NTFY): Sorry, I don’t think I understood you correctly…
- YIELD TURN (YLD): …
- REPEAT SYSTEM PROMPT (REPP): For when do you need the conference room?
- EXPLAIN MORE (EXPL): Right now I need to know the date and time for when you need the reservation …
Verb.: V, T, A, T, T, T, T, A, T
Prompt.: Y, N, Y, N, N, N, N, Y, Y
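The control condition's uniform policy over these strategies could look like the sketch below. The strategy abbreviations come from the slide; the function name and the seeded `random.Random` instance are assumptions for illustration.

```python
import random

# Strategy abbreviations as listed on the slide.
STRATEGIES = ["FAIL", "YCS", "TYCS", "HELP", "AREP",
              "ARPH", "NTFY", "YLD", "REPP", "EXPL"]


def uniform_policy(rng: random.Random) -> str:
    """Pick a recovery strategy uniformly at random (hypothetical helper)."""
    return rng.choice(STRATEGIES)


rng = random.Random(0)          # seeded only to make the example repeatable
picks = [uniform_policy(rng) for _ in range(1000)]
```

Because every strategy is equally likely regardless of context, data collected this way allows an unbiased comparison of individual strategies.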
9
Experimental Design: Scenarios
Presented graphically (explained during briefing)
10
Corpus Statistics / Characteristics
- 46 users; 484 sessions; ~9000 turns
- Transcribed
- Annotated with:
  - Misunderstandings & deletions
  - Non-understandings
  - Concept transfer accuracy
  - Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID
  - Correct concept values in each turn – [ongoing]
12
Impact of Policy on Performance
- General picture
  - Significant improvements for non-natives, especially after non-understandings
- Global
  - Task success
    - Significant improvements (x1.77) for non-natives
  - SASSI Scores: nothing detectable
- Local
  - WER
    - Significant improvements across the board
  - Understanding error metrics (CT, CER, NONU, MIS)
    - Significant improvement for non-natives
  - Recovery
    - Nothing detectable (?)
    - Faster on the wizard side
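The talk does not define WER; presumably it is the standard word error rate. As a reminder of that usual definition, here is a minimal sketch based on word-level edit distance. The function name is an assumption, not something taken from the analysis scripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis is much longer than the reference, as with a two-word utterance decoded as five unrelated words.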
13
Impact of Policy on Performance
… Weird stuff
Conclusion?
15
Impact on task performance
- Models for predicting task success from various types of errors [show in Matlab]
- Can shed more light on:
  - Effect of the policy
  - Native / non-native differences
  - Costs of various types of errors
- Currently analyzing it. Issues:
  - Build (state-)conditioned cost models
  - Robustness
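One way to set up such a model is a logistic regression from per-session error counts to task success. The sketch below is not the Matlab analysis from the talk: the feature choice (counts of non-understandings and misunderstandings), the tiny synthetic data set, and the plain gradient-descent fit are all assumptions for illustration.

```python
import math

# Synthetic sessions: (non-understandings, misunderstandings) -> task success.
# These numbers are made up for illustration, not taken from the corpus.
SESSIONS = [((0, 0), 1), ((1, 0), 1), ((2, 1), 1), ((1, 1), 1),
            ((4, 2), 0), ((5, 1), 0), ((3, 3), 0), ((6, 2), 0)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(data, steps: int = 5000, lr: float = 0.05):
    """Fit weights [bias, w_nonu, w_mis] by batch gradient descent."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (nonu, mis), success in data:
            x = (1.0, nonu, mis)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - success
            for k in range(3):
                grad[k] += err * x[k]
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


def predict_success(w, nonu: int, mis: int) -> float:
    """Predicted probability of task success for the given error counts."""
    return sigmoid(w[0] + w[1] * nonu + w[2] * mis)


weights = fit_logistic(SESSIONS)
```

In a model of this shape, the fitted weights give a rough per-error cost, which is one way to compare the costs of different error types.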
17
Individual strategy performance
- Under “random”/uniform conditions (control)
  - All-way comparison: Matlab, summary file (rank analysis ?)
  - First conclusions:
    - Moving-on helps
    - Help helps
    - Just signaling is not so good, YLD is pretty bad
- Compare with wizard:
  - Ask Repeat boosted (significantly, x1.58)
  - Wizard reverse engineering (?)
  - HELP / FAIL behavior in non-natives (?)
  - Predicting success: when to help, when to ask repeat?
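Comparing strategies under the uniform policy amounts to estimating a recovery rate per strategy. The following is a minimal sketch, assuming a hypothetical log of `(strategy, recovered)` pairs; the entries are invented for illustration and are not results from the corpus.

```python
from collections import defaultdict

# Invented log: (strategy used after a non-understanding,
#                whether the next user turn was understood).
LOG = [("AREP", True), ("AREP", False), ("HELP", True), ("HELP", True),
       ("YLD", False), ("YLD", False), ("FAIL", True), ("FAIL", False)]


def recovery_rates(log):
    """Return {strategy: fraction of uses followed by a recovered turn}."""
    uses = defaultdict(int)
    recovered = defaultdict(int)
    for strategy, ok in log:
        uses[strategy] += 1
        recovered[strategy] += int(ok)
    return {s: recovered[s] / uses[s] for s in uses}


rates = recovery_rates(LOG)
ranking = sorted(rates, key=rates.get, reverse=True)  # best strategy first
```

The same function can be run separately on the control and wizard logs to see which strategies the wizard's choices boosted.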
20
Identify Potential New Strategies
- Better informed by the error-type / blame assignment analysis (top of my stack)
- So far
  - Ask user to speak shorter
  - Ask user to speak louder
  - Speculative execution
21
Speculative execution
- A lot of small recognition errors appear repeatedly
  - YES > THIS, NEXT
  - GUEST > YES
  - GUEST USER > TUESDAY
  - Etc…
- Learn from experience how to avoid these errors
- Example:
  S: Did you say you wanted a room for Tuesday?
  U: YES [THIS]
  S: Sorry, I didn’t catch that. Did you say you wanted a room for Tuesday?
  U: YES [YES]
- Learn that “THIS” actually means “YES”
22
Speculative execution - components
- Learn mapping
  - Learner with high precision (no false positives)
- Apply mapping
  - Learner with high recall
- Precision / Recall tradeoff
- How much can this method really buy us?
23
Speculative Execution – 0th cut
- Conservative Learner
  - Learns from non-understanding segments where
    - Dialogue state is the same throughout (mapping is state-specific)
    - Final response is in focus, contains only one concept and has high confidence
- Conservative Applier
  - Apply only when dialogue state matches and non-understood input matches perfectly at the state level
- Going through the whole dataset, learning as you go
  - Results: 10% application at the end, does not asymptote yet
  - Precision? (480 rules learned)
- How does this look to you?
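The conservative learner and applier described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: the `Turn` record, the 0.8 confidence cutoff, and keying rules on `(dialogue state, recognized text)` are all hypothetical, and the "in focus" check is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Turn:
    """Hypothetical record of one user turn."""
    state: str                 # dialogue state when the turn was spoken
    recognized: str            # recognizer output
    concepts: Dict[str, str]   # parsed concepts (empty on non-understanding)
    confidence: float


Rules = Dict[Tuple[str, str], Dict[str, str]]


def learn(segment: List[Turn], rules: Rules, min_conf: float = 0.8) -> None:
    """Learn from a segment: non-understood turns, then one resolving turn."""
    *failed, final = segment
    same_state = all(t.state == final.state for t in failed)
    trusted = len(final.concepts) == 1 and final.confidence >= min_conf
    if failed and same_state and trusted:
        for t in failed:
            # The mapping is state-specific, so the state is part of the key.
            rules[(t.state, t.recognized)] = dict(final.concepts)


def apply_rules(turn: Turn, rules: Rules) -> Optional[Dict[str, str]]:
    """Apply only on an exact (state, recognized text) match."""
    return rules.get((turn.state, turn.recognized))


rules: Rules = {}
learn([Turn("confirm_day", "THIS", {}, 0.2),
       Turn("confirm_day", "YES", {"answer": "yes"}, 0.95)], rules)
```

After this call, a later "THIS" in the `confirm_day` state maps to the "yes" answer, while the same string in any other state is left alone.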
24
Speculative execution
- Of course much more to dig in here …
  - Learners which generalize more
  - Confidence score on the rules
  - Active learning: appliers with confidence, and feedback into learning
  - Potentially use it in other cases (not only non-understandings, but potential misunderstandings)
26
Building a Policy from Data
- Experiment has shown that the wizard boosted the performance of Ask Repeat
- Can we predict likelihood of success for each strategy from features available online?
  - Identify informative features
    - Might be better informed by error-type/blame-assignment analysis
  - Try simple classifiers
- MDP (?)
- Can also formulate problem as a decision boundary or classification problem… (?)
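Once a per-strategy success predictor exists, the policy reduces to picking the strategy with the highest predicted likelihood of success. The sketch below assumes hypothetical per-strategy linear scorers over two made-up online features (`confidence`, `consecutive_nonu`); the weights are placeholders, not learned values.

```python
import math
from typing import Dict, Tuple

# Placeholder weights per strategy: (bias, w_confidence, w_consecutive_nonu).
# Invented for illustration; a real system would learn them from data.
MODELS: Dict[str, Tuple[float, float, float]] = {
    "AREP": (0.0, 2.0, -1.0),
    "HELP": (-0.5, 0.0, 0.8),
    "FAIL": (-0.2, 0.5, 0.3),
}


def success_probability(strategy: str, confidence: float,
                        consecutive_nonu: int) -> float:
    """Logistic score for one strategy given the current online features."""
    b, w_conf, w_nonu = MODELS[strategy]
    z = b + w_conf * confidence + w_nonu * consecutive_nonu
    return 1.0 / (1.0 + math.exp(-z))


def choose_strategy(confidence: float, consecutive_nonu: int) -> str:
    """Greedy policy: the strategy with the highest predicted success."""
    return max(MODELS, key=lambda s: success_probability(
        s, confidence, consecutive_nonu))
```

With these placeholder weights, `choose_strategy(0.9, 0)` selects Ask Repeat, while `choose_strategy(0.1, 3)` falls back to Help. A greedy choice like this ignores longer-term effects, which is where an MDP formulation would differ.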
27
Thank you!
28
Experimental Design: Control vs Wizard Conditions
- Control: random (uniform) policy
- Wizard: human with access to audio & system state
[Figure: Performance axis, with policy types: Random (uniform) policy; Manually designed policy; Data-driven designed policy; Human wizard with access to audio; Human wizard with access to only system state (?)]