An Investigation into Recovering from Non-understanding Errors
Dan Bohus
Dialogs on Dialogs Reading Group Talk
Carnegie Mellon University, October 2004

Non-understandings

S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]

System knows there was a user turn, but
- There is no relevant semantic information in the input
- Confidence is too low to trust any semantic information in the input
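The two conditions above can be sketched as a simple check. This is only an illustration: the `UserTurn` type, its field names, and the 0.3 rejection threshold are assumptions made for the example; the slides give no concrete representation or threshold value.

```python
from dataclasses import dataclass, field

# Hypothetical rejection threshold; the slides do not state a value.
REJECTION_THRESHOLD = 0.3


@dataclass
class UserTurn:
    hypothesis: str                                # recognizer output
    concepts: dict = field(default_factory=dict)   # parsed semantic concepts
    confidence: float = 0.0                        # confidence score in [0, 1]


def is_non_understanding(turn: UserTurn,
                         threshold: float = REJECTION_THRESHOLD) -> bool:
    """True when the turn carries nothing the system can safely act on."""
    no_semantics = not turn.concepts            # no relevant semantic information
    low_confidence = turn.confidence < threshold  # semantics present but untrusted
    return no_semantics or low_confidence


# Example turns (values invented for illustration).
garbled = UserTurn("OKAY IN THAT SAME PAY", concepts={}, confidence=0.62)
shaky = UserTurn("URBANA CHAMPAIGN", {"departure_city": "Urbana Champaign"}, 0.12)
clear = UserTurn("URBANA CHAMPAIGN", {"departure_city": "Urbana Champaign"}, 0.91)

for turn in (garbled, shaky, clear):
    print(turn.hypothesis, "->", is_non_understanding(turn))
```

Raising the threshold trades misunderstandings for more non-understandings, which is why the rejection threshold itself is a candidate for adaptation.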

10 – 30% of turns in a mixed initiative system

GOAL: Do a better job at recovering from non-understandings

Recovery Ingredients

- Detection
- Set of strategies (actions)
- Policy (method for choosing between actions)

Recovery Ingredients – Non-understandings

Detection
- Generally, system knows when a non-understanding happened
Set of strategies (actions)
- Notify non-understanding, repeat question, ask repeat/rephrase, provide help, etc.
Policy (method for choosing between actions)
- Traditionally fixed heuristic

Issues under Investigation

Detection
- Analysis of error types, blame assignment, impact on task performance
- Detection of error type
- Adaptation of rejection threshold
Set of strategies
- Investigate individual strategy performance
- Identify potential new strategies
Policy
- Impact of a “smarter” policy on performance
- Building a policy from data

Experimental Design - Overview

Subjects interact over the telephone with RoomLine
- Perform a number of scenario-based tasks

Between-subjects experiment
- Control: system uses a random (uniform) policy for engaging the non-understanding recovery strategies
- Wizard: policy is determined at runtime by a human (wizard)

46 subjects, balanced Gender x Native
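The control condition can be sketched in a few lines. The strategy abbreviations are the ones used on the strategies slide of this talk; the function name and the fixed seed are assumptions for the example, not details of the actual system.

```python
import random
from collections import Counter

# Strategy labels as they appear on the "Non-understanding Strategies" slide.
STRATEGIES = ["FAIL", "YCS", "TYCS", "HELP", "AREP",
              "ARPH", "NTFY", "YLD", "REPP", "EXPL"]


def uniform_policy(rng: random.Random) -> str:
    """Control policy: ignore all context and pick any strategy with equal probability."""
    return rng.choice(STRATEGIES)


# Simulate many non-understandings to check that usage is roughly even.
rng = random.Random(0)  # seeded only to make the sketch repeatable
N_DRAWS = 10_000
counts = Counter(uniform_policy(rng) for _ in range(N_DRAWS))
for name in STRATEGIES:
    print(f"{name:5s} {counts[name] / N_DRAWS:.3f}")
```

Because every strategy is engaged about equally often and independently of context, data collected under this policy allows an unbiased comparison of individual strategies.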

Non-understanding Strategies

S: For when do you need the room?
U: [non-understanding]

Strategy groups: MOVE-ON, HELP, SIGNAL

FAIL: Sorry, I didn’t catch that. Tell me for what day you need the room
YOU CAN SAY (YCS): Sorry, I didn’t catch that. For when do you need the conference room? You can say something like tomorrow at 10 am …
TERSE YOU CAN SAY (TYCS): Sorry, I didn’t catch that. You can say something like tomorrow at 10 am …
FULL HELP (HELP): Sorry, I didn’t catch that. I am currently trying to make a conference room reservation for you. Right now I need to know the date and time for when you need the reservation. You can say something like tomorrow at 10 am …
ASK REPEAT (AREP): Could you please repeat that?
ASK REPHRASE (ARPH): Could you please try to rephrase that?
NOTIFY (NTFY): Sorry, I don’t think I understood you correctly…
YIELD TURN (YLD): …
REPEAT SYSTEM PROMPT (REPP): For when do you need the conference room?
EXPLAIN MORE (EXPL): Right now I need to know the date and time for when you need the reservation …

Verb.: V, T, A, T, T, T, T, A, T
Prompt.: Y, N, Y, N, N, N, N, Y, Y

Experimental Design: Scenarios

Presented graphically (explained during briefing)

Corpus Statistics / Characteristics

46 users; 484 sessions; ~ 9000 turns
- Transcribed
- Annotated with:
  - Misunderstandings & deletions
  - Non-understandings
  - Concept transfer accuracy
  - Transcript grammaticality labels: OK, OOR, OOG, OOS, OOD, VOID
  - Correct concept values in each turn – [ongoing]

Impact of Policy on Performance

General picture
- Significant improvements for non-natives, especially after non-understandings

Global
- Task success: significant improvements (x1.77) for non-natives
- SASSI scores: nothing detectable

Local
- WER: significant improvements across the board
- Understanding error metrics (CT, CER, NONU, MIS): significant improvement for non-natives
- Recovery: nothing detectable (?); faster on the wizard side

Impact of Policy on Performance

… Weird stuff

Conclusion?

Impact on task performance

Models for predicting task success from various types of errors [show in Matlab]

Can shed more light on:
- Effect of the policy
- Native / non-native differences
- Costs of various types of errors

Currently analyzing it. Issues:
- Build (state-)conditioned cost models
- Robustness
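One way such a model could look is sketched below. The slides do not say which model family was used, so logistic regression is an assumption here, as are the two features (counts of non-understandings and misunderstandings per session) and the toy sessions, which are invented and are not data from the study.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted probability of task success for one session's error counts."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Plain batch gradient descent on the mean logistic loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Invented sessions: (non-understandings, misunderstandings) -> task success (1/0).
rows = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1),
        (3, 1), (2, 2), (4, 1), (3, 3), (5, 2)]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

weights, bias = fit_logistic(rows, labels)
p_clean = predict(weights, bias, (0, 0))
p_errorful = predict(weights, bias, (5, 3))
print("weights:", weights, "bias:", bias)
print("P(success | no errors):", p_clean, " P(success | many errors):", p_errorful)
```

In a model of this form, each fitted weight can be read as the cost (in log-odds of task success) of one more error of that type, which is what makes it useful for comparing error types or subject groups.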

Individual strategy performance

Under “random”/uniform conditions (control)
- All-way comparison: Matlab, summary file (rank analysis ?)
- First conclusions:
  - Moving-on helps
  - Help helps
  - Just signaling is not so good, YLD is pretty bad

Compare with wizard: Ask Repeat boosted (significantly x1.58)
- Wizard reverse engineering (?)
- HELP / FAIL behavior in non-natives (?)
- Predicting success: when to help, when to ask repeat?
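A per-strategy comparison of this kind can be sketched as a simple tally. The log format and the definition of "recovered" (for instance, the next user turn was understood) are assumptions for the example, and the toy log is invented rather than taken from the corpus.

```python
from collections import defaultdict


def recovery_rates(log):
    """log: iterable of (strategy, recovered) pairs; returns {strategy: recovery rate}."""
    used = defaultdict(int)
    recovered = defaultdict(int)
    for strategy, ok in log:
        used[strategy] += 1
        recovered[strategy] += int(ok)
    return {s: recovered[s] / used[s] for s in used}


# Invented log, for illustration only.
toy_log = [
    ("HELP", True), ("HELP", True), ("HELP", False),
    ("YLD", False), ("YLD", False), ("YLD", True), ("YLD", False),
    ("AREP", True), ("AREP", False),
]

rates = recovery_rates(toy_log)
ranking = sorted(rates, key=rates.get, reverse=True)
for name in ranking:
    print(f"{name:5s} {rates[name]:.2f}")
```

Running the same tally separately on the control and wizard logs gives the kind of ratio quoted for Ask Repeat; a significance test on the two proportions would still be needed before drawing conclusions.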

Identify Potential New Strategies

Better informed by the error-type / blame assignment analysis (top of my stack)

So far
- Ask user to speak shorter
- Ask user to speak louder
- Speculative execution

Speculative execution

A lot of small recognition errors appear repeatedly
YES > THIS, NEXT GUEST > YES GUEST USER > TUESDAY Etc…

Learn from experience how to avoid these errors

Example:

S: Did you say you wanted a room for Tuesday?

U: YES [THIS]

S: Sorry, I didn’t catch that. Did you say you wanted a room for Tuesday?

U: YES [YES]

Learn that “THIS” actually means “YES”
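The example above can be sketched as a tiny rule table. The function names and the segment representation (a list of recognizer hypotheses for the same question, ending with the one that was finally understood) are assumptions for illustration. The sketch also assumes the user repeated the same answer each time, which is not guaranteed in practice.

```python
def learn_from_segment(segment, rules):
    """Map every non-understood hypothesis in the segment to the final, understood one."""
    *failed, understood = segment
    for hypothesis in failed:
        rules[hypothesis] = understood
    return rules


def reinterpret(hypothesis, rules):
    """Return the learned reading of a hypothesis, or the hypothesis itself if no rule exists."""
    return rules.get(hypothesis, hypothesis)


# The segment from the slide: "THIS" was not understood, then "YES" was.
rules = learn_from_segment(["THIS", "YES"], {})

print(rules)
print(reinterpret("THIS", rules))
print(reinterpret("NEXT", rules))
```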

Speculative execution - components

Learn mapping
- Learner with high precision (no false positives)

Apply mapping
- Learner with high recall

Precision / Recall tradeoff

How much can this method really buy us?

Speculative Execution – 0th cut

Conservative Learner
- Learns from non-understanding segments where
  - Dialogue state is the same throughout (mapping is state-specific)
  - Final response is in focus, contains only one concept and has high confidence

Conservative Applier
- Apply only when dialogue state matches and non-understood input matches perfectly at the state level

Going through the whole dataset, learning as you go
- Results: 10% application at the end, does not asymptote yet
- Precision? (480 rules learned)
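A rough sketch of the conservative learner and applier, run in a learn-as-you-go loop, might look as follows. The `Turn` fields, the 0.8 confidence cutoff, the state names, and the toy segments are all assumptions made for the example; this is not the implementation behind the numbers above.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.8  # hypothetical cutoff for "high confidence"


@dataclass
class Turn:
    state: str          # dialogue state when the user spoke
    hypothesis: str     # recognizer output
    concepts: tuple     # concept names parsed from the hypothesis
    confidence: float
    understood: bool
    in_focus: bool = True


def learn(segment, rules):
    """Conservative learner: add rules only if the segment meets every condition."""
    *failed, final = segment
    same_state = all(t.state == final.state for t in segment)
    trusted = (final.understood and final.in_focus
               and len(final.concepts) == 1
               and final.confidence >= MIN_CONFIDENCE)
    if failed and same_state and trusted:
        for t in failed:
            rules[(t.state, t.hypothesis)] = final.hypothesis  # state-specific rule


def apply_rule(turn, rules):
    """Conservative applier: exact match on both dialogue state and hypothesis."""
    return rules.get((turn.state, turn.hypothesis))


def learn_as_you_go(segments):
    """Try to apply rules to each non-understood turn, then learn from the segment."""
    rules, applied, total = {}, 0, 0
    for segment in segments:
        for t in segment:
            if not t.understood:
                total += 1
                applied += int(apply_rule(t, rules) is not None)
        learn(segment, rules)
    rate = applied / total if total else 0.0
    return rules, rate


# Invented segments for illustration.
segments = [
    [Turn("confirm_date", "THIS", (), 0.20, False),
     Turn("confirm_date", "YES", ("yes",), 0.95, True)],
    [Turn("confirm_date", "THIS", (), 0.25, False),
     Turn("confirm_date", "YES", ("yes",), 0.90, True)],
    [Turn("ask_date", "THIS", (), 0.20, False),
     Turn("ask_date", "THIS TUESDAY", ("date",), 0.90, True)],
]
rules, rate = learn_as_you_go(segments)
print(rules)
print(f"application rate: {rate:.2f}")
```

The application rate measures only how often some rule fires; estimating precision would additionally require checking each applied rule against the transcript.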

How does this look to you?

Speculative execution

Of course much more to dig in here …
- Learners which generalize more
- Confidence score on the rules
- Active learning: appliers with confidence, and feedback into learning
- Potentially use it in other cases (not only non-understandings, but potential misunderstandings)

Building a Policy from Data

Experiment showed that wizard boosted performance of Ask Repeat

Can we predict likelihood of success for each strategy from features available online?
- Identify informative features
  - Might be better informed by error-type/blame-assignment analysis
- Try simple classifiers
- MDP (?)

Can also formulate problem as a decision boundary or classification problem… (?)
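As a very rough sketch of the idea, a smoothed success-rate table can stand in for the "simple classifiers": estimate the probability that each strategy recovers in a given context, then pick the strategy with the highest estimate. The class name, the single context feature (native vs. non-native), and the observations are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class SuccessRatePolicy:
    """Pick the strategy with the highest estimated recovery probability in a context."""

    def __init__(self, strategies):
        self.strategies = list(strategies)
        self.tries = defaultdict(int)
        self.wins = defaultdict(int)

    def observe(self, context, strategy, recovered):
        """Record one logged use of a strategy and whether recovery followed."""
        self.tries[(context, strategy)] += 1
        self.wins[(context, strategy)] += int(recovered)

    def p_success(self, context, strategy):
        """Add-one smoothed estimate, so unseen (context, strategy) pairs get 0.5."""
        key = (context, strategy)
        return (self.wins.get(key, 0) + 1) / (self.tries.get(key, 0) + 2)

    def choose(self, context):
        return max(self.strategies, key=lambda s: self.p_success(context, s))


# Invented observations, for illustration only.
policy = SuccessRatePolicy(["AREP", "HELP"])
for _ in range(3):
    policy.observe("native", "AREP", True)
policy.observe("native", "HELP", False)
for _ in range(2):
    policy.observe("non-native", "HELP", True)
    policy.observe("non-native", "AREP", False)

print(policy.choose("native"))
print(policy.choose("non-native"))
```

A real policy would need richer online features than a single context label, and a greedy choice like this ignores longer-term effects, which is where an MDP formulation could come in.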

Thank you!

Experimental Design: Control vs Wizard Conditions

Control: random (uniform) policy
Wizard: human with access to audio & system state

[Figure: Performance across policy types: random (uniform) policy; manually designed policy; data-driven designed policy; human wizard with access to audio; human wizard with access to only system state (?)]