On the Usability of Spoken Dialogue Systems
Presentation of Ph.D. thesis by Lars Bo Larsen
Aalborg University, Sep. 12, 2003
Overview
• Introduction – background
• Definition of usability
• The OVID project
• Objective measures
  • Turn-taking and user initiatives
  • Perceived and observed task success
• Subjective measures
  • Questionnaires for measuring user satisfaction
  • Factor Analysis
• Combined analysis using the Paradise scheme
• Summary and conclusions
Background
This work has been carried out in two phases:
• The first phase was the experimental work carried out in the Esprit OVID project in 1996-7
• This resulted in a number of reports and publications in 1997-99
• Another important result was a fully annotated dialogue corpus
• The second, more recent phase was in 2002-3, where the results were verified and analysed from a methodical point of view, and new analyses were carried out on the corpus
• Most notably, a new paradigm (called Paradise) had been proposed since the original OVID work
• In between the two phases I mainly worked in the area of multi modal systems and teaching.
The Goals of the OVID Project
OVID Technical Annex:
“The partners intend to approach the work via a series of controlled usability trials of the software in a realistic banking service with real bank customers.
The results will be an assessment of how bank customers are able to use the automated service without training in its use, to design an optimal user interface dialogue which can accommodate the untrained user.”
Usability
ISO’s definition of usability:
• Effectiveness: Accuracy and completeness with which users achieve specified goals.
• Efficiency: Resources expended in relation to the accuracy and completeness with which users achieve goals.
• Satisfaction: Freedom from discomfort, and positive attitudes towards the use of the product.
Usability: “Extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use.”
Usability of Spoken Dialogue Systems (SDS)
The general definition of usability and the associated attributes of course also hold for SDS.
However:
• Due to the complexity of the input processing and the non-persistence of speech, special attention must be paid to:
  • The learnability and memorability of speech based interfaces
  • The transparency and error-handling capability
For this reason, the methods that have been developed and are well proven for traditional interfaces cannot be used directly for speech.
Usability Measures
Two orthogonal categories of usability measures must be captured simultaneously:
Objective measures: To evaluate the effectiveness and efficiency of the system:
• Observed values of e.g. time to complete tasks, task success rates, error rates, number of help messages, number of user barge-ins, etc.
• Objective measures are directly observable
Subjective measures: To evaluate user preferences:
• User satisfaction, the user’s attitudes towards the overall system, or particular aspects of it
• User attitudes cannot be observed directly - you must ask the users
Requirements to the Dialogue
Based on interviews with banking personnel, the functionality of the home bank was chosen to be:
• Provide balance and information of movements for user accounts.
• User must provide Id and PIN codes for access
• The user must be (or feel) in control of the dialogue
• The service must be equally acceptable to users regardless of gender, age and accent
Furthermore, the dialogue must accommodate the untrained user
The Overall Dialogue Model
[Figure: the overall dialogue model, with the states Id-number, Access Code, Main, Balance and Mini Stat.]
A Mixed-Initiative Dialogue Model with Short-Cuts
[Figure: dialogue task structure. Main: “Balance or Mini Stat?”. Balance: “For which account?”, “Provide balance”, “More accounts?”. Mini Stat: “For which account?”, “Provide mini-stat”, “More accounts?”. Closing question: “Wish to continue?”. Legend: two kinds of transitions, system directed and user initiated.]
Objective Measures on the OVID Corpus
The Corpus:
• 700 transcribed dialogues for 310 users who were requested to carry out two pre-defined scenarios.
Speech I/O Quality:
• Speech (concept) recognition performance
Dialogue Symmetry:
• Turn-taking strategy, in particular how and when users took the initiative in the dialogue
Communication Efficiency:
• Timing parameters for overall and subtask performance
Task Effectiveness:
• Task success rates
Objective Measures – Timing
All users were required to carry out two scenarios, A and B
The table shows the average time spent in the login subtasks for the first (A1,B1) and second (A2,B2) dialogues
A paired, two-tailed t-test revealed a significant reduction in the time spent in the “Id_number” subtask when comparing the first dialogue with the second (p = 0.03)
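A paired test of this kind works on the per-user differences between the first and the second dialogue. The following is a minimal sketch of the test statistic only, assuming each user contributes one timing per dialogue; the function name and the timings are hypothetical and are not taken from the OVID corpus. The two-tailed p-value would then be read from a t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for a paired t-test on two equal-length samples."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    # Work on the per-user differences: first dialogue minus second dialogue.
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # t is the mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical timings in seconds, NOT taken from the OVID corpus.
first_call = [30.0, 28.0, 35.0, 32.0, 31.0]
second_call = [27.0, 27.0, 30.0, 31.0, 28.0]
t, df = paired_t_statistic(first_call, second_call)
print(f"t = {t:.3f}, df = {df}")
```

A positive t here means that the first dialogue took longer than the second.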
Objective Measures – Turn taking
Analysing the turn-taking strategies uncovered a similar trend – users completed the dialogues with a smaller number of turns in the second dialogue
In particular, the more experienced the users became, the more willing they were to take the initiative in the dialogue.
Average number of user initiatives per dialogue for the ”A” and ”B” scenarios, for the first and second dialogues.
An unpaired two-tailed t-test shows a significant (p = 0.02) increase in the number of user initiatives relative to the total number of turns for scenario B2 compared to B1.
Objective Measures – Task Completion
Perceived versus observed task success:
• Although 96% of the users believed that they had completed both scenarios, only 74% actually did so
The number of failed dialogues falls by almost 50% from the first to the second call (a 25% reduction is significant at the 95% confidence level)
Objective Measures – Speech Recognition
The intervals show the speech recognition performance experienced by a particular proportion of users
A recognition performance of 90% roughly corresponds to one error per dialogue
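The rule of thumb above can be checked with simple arithmetic: the expected number of errors is the error rate multiplied by the number of recognised user utterances. The sketch below assumes about ten recognised utterances per dialogue; that figure is inferred from the rule of thumb itself and is not stated on the slide.

```python
def expected_errors(accuracy, recognised_turns):
    """Expected number of recognition errors in one dialogue."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    return (1.0 - accuracy) * recognised_turns


# Assumption: about ten recognised user utterances per dialogue.
print(round(expected_errors(0.90, 10), 2))
```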
Objective Measures - Conclusions
A significant reduction of time for the ID_number sub task was observed when comparing durations of the first and second dialogues.
Analyses of the users’ turn-taking strategy for the first and second calls reveal a significant increase in the users’ tendency to take the initiative in the dialogue.
Task completion rates also showed a significant increase from the first to the second dialogue.
These findings are interpreted as signs of system learnability.
--------------------------------------
No differences were identified for users from different demographic groups (gender, region, age)
Subjective Measures
The term user satisfaction is used to denote the degree to which the users are satisfied with, or accept the system performance.
Contrary to (most) objective measures, information on user satisfaction is not directly observable, but must be obtained by asking the users
Often the user is asked to express his/her attitude towards a number of statements about the system, for example using a so-called Likert attitude questionnaire
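Scoring such a questionnaire can be sketched as follows. The sketch assumes a seven-point agreement scale and assumes that negatively worded statements are reversed before averaging, so that a higher value always means a more positive attitude; whether the OVID analysis reversed scores in this way is not stated here. The function names and the example responses are hypothetical.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7  # assumed seven-point scale


def normalise(score, negatively_worded):
    """Map a raw response so that a higher value always means a more positive attitude."""
    if not SCALE_MIN <= score <= SCALE_MAX:
        raise ValueError(f"score {score} is outside {SCALE_MIN}..{SCALE_MAX}")
    # Reversal mirrors the score around the scale midpoint: 1 <-> 7, 2 <-> 6, ...
    return SCALE_MIN + SCALE_MAX - score if negatively_worded else score


def overall_attitude(responses, negative_items):
    """Average the normalised scores of one user over all statements."""
    return mean([normalise(s, name in negative_items) for name, s in responses.items()])


# Hypothetical responses of one user, NOT OVID data.
responses = {"easy to use": 6, "confusing": 2, "use again": 7}
print(overall_attitude(responses, negative_items={"confusing"}))
```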
The OVID Questionnaire with 25 Statements
[Figure: Average user attitudes with 98% confidence intervals. Horizontal axis: statement category; vertical axis: attitude, on a scale from 1 to 7. Statements shown: easy to use, knew what to do, friendliness, confusing, use again, reliability, out of control, like voice, concentration, efficiency, flustered, too fast, under stress, voice clear, frustration, prefer human, too complicated, enjoyment, needs improvement, politeness, security, convenient, confidentiality, remember too much, good value, followed by the overall average. The domain-dependent statements are marked “Domain Dep.”.]
Factor Analysis
Factor Analysis (FA) is used to identify the underlying relationships between the statements
• Mathematically, FA resembles Principal Components Analysis (PCA), but:
  • In FA, the factors are perceived as the cause of the observed variable scores, i.e. it is the underlying factor structure that has produced (or caused) the observed variable scores.
  • In contrast, for PCA, the components are just perceived as aggregates of the observed variable scores
  • Furthermore, in PCA all variance is modelled, whereas in FA only the variance the variables have in common (the communalities) is considered
• FA has an element of subjective judgment, since the goal is to arrive at a factor set that will provide an interpretation of the observed data
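The contrast between the two methods can be written compactly. The notation below is the standard textbook form and is not taken from the thesis; it assumes standardised scores x and uncorrelated factors with unit variance. Λ holds the factor loadings, Ψ is the diagonal matrix of unique variances, and V and D hold the eigenvectors and eigenvalues of the covariance matrix Σ.

```latex
% Factor analysis: the latent factors f produce the observed scores x
\mathbf{x} = \boldsymbol{\Lambda}\,\mathbf{f} + \boldsymbol{\varepsilon},
\qquad
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Psi}

% Communality of statement i: the part of its variance shared through the factors
h_i^{2} = \sum_{j} \lambda_{ij}^{2}

% PCA: the components z are linear aggregates of the observed scores,
% and all of the variance in \Sigma is decomposed
\mathbf{z} = \mathbf{V}^{\top}\mathbf{x},
\qquad
\boldsymbol{\Sigma} = \mathbf{V}\,\mathbf{D}\,\mathbf{V}^{\top}
```

In FA only the ΛΛᵀ part of Σ is explained by the factors, while Ψ is left as variance unique to each statement.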
Verification of OVID Factors
OVID factors (explained variance in parentheses):
• F1: Quality of interface/performance (19%): Use Again, Reliability, Efficiency, Prefer Human, Enjoyment, Needs Improvement
• F2: Cognitive load (13%): Concentration, Speed, Under Stress
• F3: Control/Confusion (9%): Know what was expected, Perceived control, Confusion, Flustered, Too Complicated
• F4: Friendliness (8%): Friendly, Polite
• F5: Voice (8%): Liked Voice, Voice clear
• Total explained variance: 57%
Original CCIR factors (explained variance in parentheses, where given):
• F1: Quality of interface/performance (21%): Use Again, Efficiency, Reliability, Needs Improvement
• F2: Cognitive effort and stress (17%): Speed, Under Stress, Concentration, Perceived control
• F3: Conversational model: Voice, Tone prompts, Friendliness
• F4: Fluency: Voice clarity, Politeness, Know what was expected
• F5: Transparency: Ease of use, Prompt helpfulness, Degree of fluster
• Total explained variance: 74%
Six-Factor Structure
When the five domain dependent statements are added, a sixth factor emerges.
Correlating Statement Scores
[Figure: correlation of each statement score with the user attitude to “Use again”, on an axis from 0 to 0.75. Statements shown: knew what to do, too fast, confusing, polite, like voice, voice clear, concentrate, friendly, remember much, stress, control, confidentiality, security, flustered, needs improvement, prefer human, reliability, complicated, easy to use, efficiency, frustration, good value, enjoyment, convenient. One group of statements is labelled “F3: Convenience”.]
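Correlating one statement with the “Use again” statement can be sketched as below. The sketch assumes the Pearson product-moment coefficient, which is not confirmed as the measure used for the chart; the function name and the scores are hypothetical.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mx, my = mean(xs), mean(ys)
    # Sum of cross-products of the deviations from the means.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical scores of five users on two statements, NOT OVID data.
use_again = [7, 6, 5, 6, 3]
convenient = [6, 6, 4, 5, 2]
print(round(pearson(use_again, convenient), 2))
```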
Subjective Measures - Conclusions
The questionnaire used for the subjective measures was shown to be valid and to produce a factor structure similar to that of the original CCIR questionnaire
When the domain dependent statements were included, the factor structure changed and a new factor, “confidentiality”, emerged
Generally, the users were positive towards the OVID home bank service (the average score was 5.6, i.e. between “agree” and “strongly agree”)
Similar to the objective measures, no significant differences between the demographic groups were found
Combining Objective and Subjective Measures
• The PARADISE (Paradigm for Dialogue System Evaluation) scheme (proposed by Walker et al from AT&T in 1997) attempts to combine the subjective and objective measures.
• This is done by estimating a performance function with “usability” (user satisfaction) as the dependent variable and the objective measures as the independent variables
• The performance equation is modeled using Multiple Linear Regression (MLR)
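For reference, the performance function is usually given in the PARADISE literature in the form below. The symbols are those of that literature rather than of this presentation: κ is the Kappa task-success measure, the c_i are the cost measures (for example timing or recognition errors), α and w_i are the weights found by the regression, and N is a z-score normalisation that puts measures with different units on a common scale.

```latex
\text{Performance} = \alpha \cdot \mathcal{N}(\kappa) \;-\; \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i),
\qquad
\mathcal{N}(x) = \frac{x - \bar{x}}{\sigma_x}
```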
The PARADISE Model
Kappa attempts to compensate for differences in the complexity of the dialogues
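A minimal sketch of how Kappa can be computed from a confusion matrix follows. It assumes the definition used in the PARADISE literature, κ = (P(A) − P(E)) / (1 − P(E)), with P(A) the observed agreement and P(E) the chance agreement estimated from the column totals. The function name, the matrix orientation and the example counts are assumptions of this sketch, not OVID data.

```python
def kappa(confusion):
    """Kappa for a square confusion matrix (rows: values obtained in the dialogue,
    columns: values required by the scenario)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # P(A): share of all observations that lie on the diagonal.
    p_agree = sum(confusion[i][i] for i in range(n)) / total
    # P(E): agreement expected by chance, from the column totals.
    col_totals = [sum(row[j] for row in confusion) for j in range(n)]
    p_chance = sum((t / total) ** 2 for t in col_totals)
    return (p_agree - p_chance) / (1.0 - p_chance)


# Hypothetical counts for a two-valued attribute, NOT OVID data.
print(round(kappa([[45, 5], [5, 45]]), 2))
```

Because P(E) grows when there are few possible values to choose from, the same raw agreement yields a lower Kappa for a simple task than for a complex one.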
Applying PARADISE to the OVID Corpus
The correlation between the independent variables was checked before applying MLR. Only the speech recognition and task success parameters turned out to be significant predictors of usability (represented by the F1 factor group)
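The regression step itself can be sketched with ordinary least squares solved through the normal equations. This is a generic illustration, not the analysis code used for OVID: the function name is hypothetical, and the data are synthetic and noise-free so that the known weights are recovered.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations.

    Returns [intercept, w1, w2, ...] for the predictors given in rows."""
    X = [[1.0] + list(r) for r in rows]  # prepend a constant term
    k = len(X[0])
    # Build the augmented system [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic data (NOT OVID data): satisfaction = 1 + 2 * task_success + 3 * recognition
predictors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
satisfaction = [1.0, 3.0, 4.0, 6.0, 8.0]
print([round(w, 6) for w in fit_linear(predictors, satisfaction)])
```

The collinearity check in the sketch is the reason for inspecting correlations between predictors first: strongly correlated predictors make the weights unstable or impossible to determine.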
The Resulting Performance Function
The resulting performance function for the OVID experiment, compared with similar PARADISE analyses by Walker et al at AT&T.
Estimation of the user attitudes
[Figure: Observed and estimated user attitudes. Horizontal axis: users (5 to 30); vertical axis: user attitude (F1), on a scale from 1 to 7. Curves: observed, estimated, and the upper and lower 95% confidence bounds.]
For verification of the model, the identified parameters are used to estimate the user satisfaction (red line). It is clear that only half of the variance of the observed (blue) is captured.
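The share of variance captured by a model is commonly expressed as the coefficient of determination R², so "half of the variance" corresponds to an R² of about 0.5. A minimal sketch, assuming R² is the measure meant here; the function name and the values are hypothetical.

```python
from statistics import mean


def r_squared(observed, estimated):
    """Share of the variance in the observed values captured by the estimates."""
    if len(observed) != len(estimated) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    centre = mean(observed)
    # Total variation of the observations around their mean.
    ss_total = sum((o - centre) ** 2 for o in observed)
    # Variation left unexplained by the estimates.
    ss_resid = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    return 1.0 - ss_resid / ss_total


# Hypothetical attitudes, NOT OVID data.
print(round(r_squared([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 6.0, 7.0]), 2))
```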
Conclusion on the PARADISE Results
The important question is of course whether any new information was revealed.
• It is hardly surprising that a relationship between ASR performance, task success and user satisfaction can be observed.
• Kappa proved to be a better predictor of usability than a simpler ratio of completed sub-goals. The main function of Kappa is to normalise for task complexity, which in this case it seems to have done.
• There is an (almost surprisingly) good correspondence between the OVID results and those obtained by AT&T
• PARADISE is limited by the requirement for well-defined scenario based dialogues and a linear relationship between performance measures and usability
Summary
• Certain usability aspects must receive special attention, due to the nature of speech, most notably transparency and learnability
• The requirements set up for the OVID dialogue have to a large degree been met. (Exception: Speed)
• The learnability of the OVID dialogue has been demonstrated through measures of the timing and turn-taking strategy
• The validity of the questionnaire used for OVID has been established through factor analysis
• A PARADISE analysis confirmed that speech recognition and task success are important for user satisfaction, and a high correspondence with results obtained elsewhere is shown.
• The important topic of multi modal user interaction and the issue of memorability have not been addressed in this work
What is in the Future for Speech?
• Speech as a modality is in a highly competitive “market”, and must simply be better than any other option for people to use it
• Many envisioned “killer applications”, e.g. phone-based home banking, have been taken over by the Web (e.g. 38% of Danish internet users used internet home banking regularly by 2002, while none used speech)
• The methods for measuring user satisfaction have to a large degree been overlooked by the speech community and must receive more attention if speech based interfaces are to be successful
• Much focus has been on naturalness and user control, but really without any hard proof that this actually leads to higher user satisfaction – learnability might be just as important
• The focus on mobility might provide a breakthrough for speech, especially in combination with other modalities
and Finally…
I wish to thank all those of my colleagues at CPK and the OVID team who have helped me in this work, either directly or by taking over some of my other tasks.
I also wish to thank my family for their support
Last, I wish to thank you all for coming here and listening to what I had to say