Building & Evaluating Spoken Dialogue Systems
Discourse & Dialogue
CS 359
November 27, 2001
Agenda
• How to get started: system bootstrapping
  – "Wizard-of-Oz" design
  – Strengths & limitations
• How to tell if you succeeded: system evaluation
  – What you do & how you do it
  – Performance = Task success − Task cost
System Bootstrapping
• Question: How should we design a system?
  – What should it be able to understand?
• Key: How would people talk to it?
• Suggestion 1: The way people talk to each other?
  – Collect human-human interactions on the same task
  – But computers are NOT like people, and people act differently toward them:
    • Politeness, assumed knowledge, style, complexity
    • Speakers adapt to the needs of the hearer
    • Speakers balance the hearer's need for understanding against their own effort
“Wizard-of-Oz” Studies
• Suggestion 2: The way people talk to a computer!
  – Captures application/domain-specific language
• But the system is NOT built yet!
  – Simulate the system, mediated through a human wizard
    • The wizard must be fast and rigidly consistent, with no small errors/typos
  – Structured simulations: automate as much as possible
    • E.g., a response editor: hierarchical menus/templates, access to different apps, a query creator, time-stamped logging (see the sketch below)
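To make the response-editor idea concrete, here is a minimal sketch of a wizard console with canned, menu-selected templates and time-stamped logging. The template keys, texts, and log format are all hypothetical, not taken from any particular system.

```python
import time

# Hypothetical response templates the wizard picks from a menu;
# slots like {value} are filled in before sending.
TEMPLATES = {
    "greet":    "Welcome to the travel system. How can I help you?",
    "ask_from": "What city are you leaving from?",
    "confirm":  "You said {value}. Is that correct?",
}

def wizard_respond(key, **slots):
    """Send a canned response and write a time-stamped log entry."""
    response = TEMPLATES[key].format(**slots)
    print(f"{time.time():.3f}\t{key}\t{response}")  # time-stamped logging
    return response

wizard_respond("greet")
wizard_respond("confirm", value="Boston")
```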
Good Wizard Studies
• Requirements:
  – Background system:
    • Fully implemented or simulated
    • Allows some user initiative
  – Task:
    • A somewhat open "scenario"
    • Not too complex or private
  – Must be piloted:
    • Both the task scenario and the simulation
Comparing Styles
• Human-human versus human-computer:
  – H-H: more complex; H-C: simpler structure
  – Variability across domains is greater than across individuals
  – Differences in vocabulary choice
  – Differences in the use of anaphora
• Question: Should you lie to the user?
  – It may be the only way to elicit realistic behavior
  – Debrief afterward: explain the protocol, offer to destroy the data
System Evaluation
• Question: Which design is better?
• Approach 1: Content-based measures
  – Task completion
  – Concept accuracy (see the sketch below)
  – Reference answer: the query result is compared against a key
• Limited: evaluates only one strategy, though many alternatives exist
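As a concrete reading of concept accuracy, the sketch below scores a system's extracted attribute-value pairs against a reference key. The attributes and values are hypothetical.

```python
# Concept accuracy: fraction of reference attribute-value pairs
# the system recovered correctly. All data here is hypothetical.
key        = {"from": "Boston", "to": "Denver", "date": "Nov 27"}
hypothesis = {"from": "Austin", "to": "Denver", "date": "Nov 27"}

correct = sum(1 for attr, value in key.items()
              if hypothesis.get(attr) == value)
print(correct / len(key))  # 2/3, since "from" was misrecognized
```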
System Evaluation (cont’d)
• Not just accuracy, but efficiency
• Approach 2: Cost-based measures (sketch below)
  – Time to completion:
    • # of utterances
    • # of turns
    • Duration in seconds
  – Error measures:
    • # of corrections, # of repetitions
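A minimal sketch of how these cost measures might be read off a dialogue log. The log format and the correction-counting heuristic are assumptions for illustration only.

```python
# Hypothetical log: (speaker, text, seconds from start of dialogue).
log = [
    ("system", "Where are you traveling from?",    0.0),
    ("user",   "Boston.",                          2.1),
    ("system", "Leaving from Austin. Where to?",   3.0),
    ("user",   "No, Boston.",                      5.4),  # user correction
    ("system", "Leaving from Boston. Where to?",   6.2),
    ("user",   "Denver.",                          8.0),
]

utterances = len(log)
turns = 1 + sum(1 for a, b in zip(log, log[1:]) if a[0] != b[0])
duration = log[-1][2] - log[0][2]
# Crude correction count: user utterances that begin with "no".
corrections = sum(1 for spk, text, _ in log
                  if spk == "user" and text.lower().startswith("no"))
print(utterances, turns, duration, corrections)  # 6 6 8.0 1
```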
Combining Measures
• Issues:
  – Generalization: which factors affect performance?
  – Sub-dialogues: evaluate segments, not just the WHOLE task
• PARADISE:
  – Separates what the agent does from how the agent does it
  – Performance = task success & dialogue costs (combined as shown below)
    • Performance => usability => user satisfaction
    • Task success: operationalized as the κ coefficient
    • Costs: efficiency and qualitative measures
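For reference, the performance function PARADISE estimates is usually written as a weighted combination of normalized success and costs, with N denoting z-score normalization and the weights coming from the regression described on the "Estimating the Performance Function" slide below:

  Performance = α × N(κ) − Σᵢ wᵢ × N(cᵢ)

where κ is the task-success measure and the cᵢ are the cost measures.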
Measuring Task Success
• AVM: Attribute-Value Matrix
  – Captures the information to be exchanged between user & system
  – "Key": the AVM instantiation for a given scenario
• κ coefficient calculated from a confusion matrix
  – On-diagonal entries match the key; off-diagonal entries were misunderstood
• κ = (P(A) − P(E)) / (1 − P(E))
  – P(A): proportion of actual agreement; P(E): proportion expected by chance
  – I.e., actual agreement beyond chance agreement, normalized (see the sketch below)
• Pros: corrects for chance; comparable across tasks
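A minimal sketch of the κ computation from an AVM confusion matrix. The counts are hypothetical, and P(E) here is the usual Cohen-style chance term computed from the matrix margins.

```python
import numpy as np

def kappa(confusion):
    """Chance-corrected agreement: (P(A) - P(E)) / (1 - P(E))."""
    m = np.asarray(confusion, dtype=float)
    n = m.sum()
    p_a = np.trace(m) / n                                # observed agreement
    p_e = (m.sum(axis=0) * m.sum(axis=1)).sum() / n**2   # chance agreement
    return (p_a - p_e) / (1 - p_e)

# Toy confusion matrix over two attribute values: rows are the key,
# columns are what the system recorded (hypothetical counts).
print(kappa([[22, 3],
             [4, 21]]))  # 0.72
```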
Measuring Task Costs
• Define cost measures:
  – E.g., # of utterances, # of repairs
• Can compute across sub-dialogues:
  – Match each segment to its purpose
  – Hierarchical structure: link segments to subtasks
  – Tag utterances by AVM info goals (see the sketch below)
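One way to realize the tagging idea: label each utterance with the AVM attribute (info goal) it serves and aggregate costs per goal. The tags and repair flags below are hypothetical.

```python
from collections import Counter

# One (tag, is_repair) pair per utterance; tags are AVM attributes.
tagged = [("from", 0), ("from", 1), ("from", 0),
          ("to", 0), ("to", 0),
          ("date", 0), ("date", 1), ("date", 1), ("date", 0)]

utts_per_goal = Counter(tag for tag, _ in tagged)
repairs_per_goal = Counter(tag for tag, rep in tagged if rep)
print(utts_per_goal)     # utterances spent on each info goal
print(repairs_per_goal)  # repairs per info goal
```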
Estimating the Performance Function
• Predicted measure: performance
  – Via user satisfaction ratings:
    • 1-6 on a single question, or an average over several questions
• Predictor measures: success & costs
  – Normalize each to a z-score
    • Handles the measures' varying scales
  – Apply multiple linear regression to compute the weights (see the sketch below)
• Can also calculate for a sub-dialogue: restrict κ and the costs to that segment
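A minimal sketch of the estimation step: z-score the predictors, then solve an ordinary least-squares regression of satisfaction on them. The six dialogues' measures are made-up data.

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Per-dialogue measures (hypothetical data).
kappa_vals   = [0.9, 0.4, 0.7, 1.0, 0.5, 0.8]   # task success
utterances   = [12, 30, 18, 10, 25, 15]          # a cost measure
repairs      = [1, 6, 3, 0, 5, 2]                # another cost measure
satisfaction = [5.5, 2.0, 4.0, 6.0, 2.5, 5.0]    # ratings to predict

# Design matrix: intercept plus z-scored predictors.
X = np.column_stack([np.ones(len(satisfaction)),
                     zscore(kappa_vals), zscore(utterances), zscore(repairs)])
weights, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
print(weights)  # [intercept, alpha, w_utterances, w_repairs]
```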
Evaluation
• Applied to multiple tasks
  – Travel, reservation/purchase, Circuit-Fix-It
  – Define new AVM attributes to match the discourse structure
• Compare dialogue strategies
  – Explicit vs. implicit confirmation
  – System, user, or mixed initiative
Summary
• Building for HCI
  – Human-human and human-computer dialogue differ
  – Acquire vocabulary, structure, and style from users
  – Base the design on a "Wizard-of-Oz" simulation
• Evaluating strategies
  – Performance = task success − dialogue cost
  – Task success: agreement between the response & the key
    • The κ success measure compensates for chance agreement
  – Costs: number of repairs, utterances, etc.