Building & Evaluating Spoken Dialogue Systems
Discourse & Dialogue
CS 359
November 27, 2001
Agenda
• How to get started: system bootstrapping
  – "Wizard-of-Oz" design
  – Strengths & limitations
• How to tell if you succeeded: system evaluation
  – What you do & how you do it
  – Performance = Task success − Task cost
System Bootstrapping
• Question: How should we design a system?
  – What should it be able to understand?
• Key: How would people talk to it?
• Suggestion 1: The way people talk to each other?
  – Collect human-human interactions on the same task
  – But computers are NOT like people, and people act differently toward them:
    • Politeness, assumed knowledge, style, complexity
    • Speakers adapt to the needs of the hearer
    • Speakers balance the hearer's need for understanding against their own effort
“Wizard-of-Oz” Studies
• Suggestion 2: The way people talk to a computer!
  – Captures application/domain-specific language
• But the system is NOT built yet!
  – Simulate the system, mediated through a human wizard
    • The wizard must be fast and rigidly consistent, with no small errors/typos
  – Structured simulations: automate as much as possible
    • E.g., a response editor: hierarchical menus/templates, access to different apps, a query creator, time-stamped logging (see the sketch below)
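To make the response-editor idea concrete, here is a minimal sketch of a wizard console with canned, menu-selected templates and time-stamped logging. The template keys, texts, and log format are all hypothetical, not taken from any particular system.

```python
import time

# Hypothetical response templates the wizard picks from a menu;
# slots like {value} are filled in before sending.
TEMPLATES = {
    "greet":    "Welcome to the travel system. How can I help you?",
    "ask_from": "What city are you leaving from?",
    "confirm":  "You said {value}. Is that correct?",
}

def wizard_respond(key, **slots):
    """Send a canned response and write a time-stamped log entry."""
    response = TEMPLATES[key].format(**slots)
    print(f"{time.time():.3f}\t{key}\t{response}")  # time-stamped logging
    return response

wizard_respond("greet")
wizard_respond("confirm", value="Boston")
```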
Good Wizard Studies
• Requirements:
  – Background system:
    • Fully implemented or simulated
    • Allows some user initiative
  – Task:
    • A somewhat open "scenario"
    • Not too complex or private
  – Must be piloted:
    • Both the task scenario and the simulation
Comparing Styles
• Human-human versus human-computer:
  – H-H: more complex; H-C: simpler structure
  – Variability across domains is greater than across individuals
  – Differences in vocabulary choice
  – Differences in the use of anaphora
• Question: Should you lie to the user?
  – It may be the only way to elicit realistic behavior
  – Debrief afterward: explain the protocol, offer to destroy the data
System Evaluation
• Question: Which design is better?
• Approach 1: Content-based measures
  – Task completion
  – Concept accuracy (see the sketch below)
  – Reference answer: the query result is compared against a key
• Limited: evaluates only one strategy, though many alternatives exist
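As a concrete reading of concept accuracy, the sketch below scores a system's extracted attribute-value pairs against a reference key. The attributes and values are hypothetical.

```python
# Concept accuracy: fraction of reference attribute-value pairs
# the system recovered correctly. All data here is hypothetical.
key        = {"from": "Boston", "to": "Denver", "date": "Nov 27"}
hypothesis = {"from": "Austin", "to": "Denver", "date": "Nov 27"}

correct = sum(1 for attr, value in key.items()
              if hypothesis.get(attr) == value)
print(correct / len(key))  # 2/3, since "from" was misrecognized
```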
System Evaluation (cont’d)
• Not just accuracy, but efficiency
• Approach 2: Cost-based measures (sketch below)
  – Time to completion:
    • # of utterances
    • # of turns
    • Duration in seconds
  – Error measures:
    • # of corrections, # of repetitions
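A minimal sketch of how these cost measures might be read off a dialogue log. The log format and the correction-counting heuristic are assumptions for illustration only.

```python
# Hypothetical log: (speaker, text, seconds from start of dialogue).
log = [
    ("system", "Where are you traveling from?",    0.0),
    ("user",   "Boston.",                          2.1),
    ("system", "Leaving from Austin. Where to?",   3.0),
    ("user",   "No, Boston.",                      5.4),  # user correction
    ("system", "Leaving from Boston. Where to?",   6.2),
    ("user",   "Denver.",                          8.0),
]

utterances = len(log)
turns = 1 + sum(1 for a, b in zip(log, log[1:]) if a[0] != b[0])
duration = log[-1][2] - log[0][2]
# Crude correction count: user utterances that begin with "no".
corrections = sum(1 for spk, text, _ in log
                  if spk == "user" and text.lower().startswith("no"))
print(utterances, turns, duration, corrections)  # 6 6 8.0 1
```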
Combining Measures
• Issues:
  – Generalization: which factors affect performance?
  – Sub-dialogues: evaluate segments, not just the WHOLE task
• PARADISE:
  – Separates what the agent does from how the agent does it
  – Performance = task success & dialogue costs (combined as shown below)
    • Performance => usability => user satisfaction
    • Task success: operationalized as the κ coefficient
    • Costs: efficiency and qualitative measures
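For reference, the performance function PARADISE estimates is usually written as a weighted combination of normalized success and costs, with N denoting z-score normalization and the weights coming from the regression described on the "Estimating the Performance Function" slide below:

  Performance = α × N(κ) − Σᵢ wᵢ × N(cᵢ)

where κ is the task-success measure and the cᵢ are the cost measures.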
Measuring Task Success
• AVM: Attribute-Value Matrix
  – Captures the information to be exchanged between user & system
  – "Key": the AVM instantiation for a given scenario
• κ coefficient calculated from a confusion matrix
  – On-diagonal entries match the key; off-diagonal entries were misunderstood
• κ = (P(A) − P(E)) / (1 − P(E))
  – P(A): proportion of actual agreement; P(E): proportion expected by chance
  – I.e., actual agreement beyond chance agreement, normalized (see the sketch below)
• Pros: corrects for chance; comparable across tasks
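A minimal sketch of the κ computation from an AVM confusion matrix. The counts are hypothetical, and P(E) here is the usual Cohen-style chance term computed from the matrix margins.

```python
import numpy as np

def kappa(confusion):
    """Chance-corrected agreement: (P(A) - P(E)) / (1 - P(E))."""
    m = np.asarray(confusion, dtype=float)
    n = m.sum()
    p_a = np.trace(m) / n                                # observed agreement
    p_e = (m.sum(axis=0) * m.sum(axis=1)).sum() / n**2   # chance agreement
    return (p_a - p_e) / (1 - p_e)

# Toy confusion matrix over two attribute values: rows are the key,
# columns are what the system recorded (hypothetical counts).
print(kappa([[22, 3],
             [4, 21]]))  # 0.72
```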
Measuring Task Costs
• Define cost measures:
  – E.g., # of utterances, # of repairs
• Can compute across sub-dialogues:
  – Match each segment to its purpose
  – Hierarchical structure: link segments to subtasks
  – Tag utterances by AVM info goals (see the sketch below)
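One way to realize the tagging idea: label each utterance with the AVM attribute (info goal) it serves and aggregate costs per goal. The tags and repair flags below are hypothetical.

```python
from collections import Counter

# One (tag, is_repair) pair per utterance; tags are AVM attributes.
tagged = [("from", 0), ("from", 1), ("from", 0),
          ("to", 0), ("to", 0),
          ("date", 0), ("date", 1), ("date", 1), ("date", 0)]

utts_per_goal = Counter(tag for tag, _ in tagged)
repairs_per_goal = Counter(tag for tag, rep in tagged if rep)
print(utts_per_goal)     # utterances spent on each info goal
print(repairs_per_goal)  # repairs per info goal
```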
Estimating the Performance Function
• Predicted measure: performance
  – Via user satisfaction ratings:
    • 1-6 on a single question, or an average over several questions
• Predictor measures: success & costs
  – Normalize each to a z-score
    • Handles the measures' varying scales
  – Apply multiple linear regression to compute the weights (see the sketch below)
• Can also calculate for a sub-dialogue: restrict κ and the costs to that segment
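A minimal sketch of the estimation step: z-score the predictors, then solve an ordinary least-squares regression of satisfaction on them. The six dialogues' measures are made-up data.

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Per-dialogue measures (hypothetical data).
kappa_vals   = [0.9, 0.4, 0.7, 1.0, 0.5, 0.8]   # task success
utterances   = [12, 30, 18, 10, 25, 15]          # a cost measure
repairs      = [1, 6, 3, 0, 5, 2]                # another cost measure
satisfaction = [5.5, 2.0, 4.0, 6.0, 2.5, 5.0]    # ratings to predict

# Design matrix: intercept plus z-scored predictors.
X = np.column_stack([np.ones(len(satisfaction)),
                     zscore(kappa_vals), zscore(utterances), zscore(repairs)])
weights, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
print(weights)  # [intercept, alpha, w_utterances, w_repairs]
```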
Evaluation
• Applied to multiple tasks
  – Travel, reservation/purchase, Circuit-Fix-It
  – Define new AVM attributes to match the discourse structure
• Compare dialogue strategies
  – Explicit vs. implicit confirmation
  – System, user, or mixed initiative
Summary
• Building for HCI
  – Human-human and human-computer dialogue differ
  – Acquire vocabulary, structure, and style from users
  – Base the design on a "Wizard-of-Oz" simulation
• Evaluating strategies
  – Performance = task success − dialogue cost
  – Task success: agreement between the response & the key
    • The κ success measure compensates for chance agreement
  – Costs: number of repairs, utterances, etc.