On the Usability of Spoken Dialogue Systems Presentation of Ph.D. thesis by Lars Bo Larsen Aalborg University, Sep. 12, 2003




Overview

• Introduction – background

• Definition of usability

• The OVID project

• Objective measures

  • Turn-taking and user initiatives

  • Perceived and observed task success

• Subjective measures

  • Questionnaires for measuring user satisfaction

  • Factor Analysis

• Combined analysis using the Paradise scheme

• Summary and conclusions



Background

This work has been carried out in two phases:

• The first phase was the experimental work carried out in the Esprit OVID project in 1996-7

• This resulted in a number of reports and publications in 1997-99

• Another important result was a fully annotated dialogue corpus

• The second, more recent phase was in 2002-3, where the results were verified and analysed from a methodological point of view, and new analyses were carried out on the corpus

• Most notably, a new paradigm (called Paradise) had been proposed since the original OVID work

• In between the two phases I mainly worked in the area of multi modal systems and teaching.



The Goals of the OVID Project

OVID Technical Annex:

“The partners intend to approach the work via a series of controlled usability trials of the software in a realistic banking service with real bank customers.

The results will be an assessment of how bank customers are able to use the automated service without training in its use, to design an optimal user interface dialogue which can accommodate the untrained user.”



Usability

ISO’s definition of usability:

• Effectiveness: Accuracy and completeness with which users achieve specified goals.

• Efficiency: Resources expended in relation to the accuracy and completeness with which users achieve goals.

• Satisfaction: Freedom from discomfort, and positive attitudes towards the use of the product.

“Usability: extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use.”



Usability of Spoken Dialogue Systems (SDS)

The general definition of usability and the associated attributes of course also apply to SDS.

However:

• Due to the complexity of the input processing and the non-persistence of speech, special attention must be paid to:

  • The learnability and memorability of speech based interfaces

  • The transparency and error-handling capability

For this reason, the methods that have been developed and are well proven for traditional interfaces cannot be used directly for speech.



Usability Measures

Two orthogonal categories of usability measures must be captured simultaneously:

Objective measures: To evaluate the effectiveness and efficiency of the system:

• Observed values of e.g. time to complete tasks, task success rates, error rates, number of help messages, number of user barge-ins, etc.

• Objective measures are directly observable

Subjective measures: To evaluate user preferences:

• User satisfaction, the user’s attitudes towards the overall system, or particular aspects of it

• User attitudes cannot be observed directly - you must ask the users



Requirements to the Dialogue

Based on interviews with banking personnel, the functionality of the home bank was chosen to be:

• Provide balance and information of movements for user accounts.

• User must provide Id and PIN codes for access

• The user must be (or feel) in control of the dialogue

• The service must be equally acceptable to users regardless of gender, age and accent

Furthermore, the dialogue must accommodate the untrained user



The Overall Dialogue Model

[Figure: the overall dialogue model, with the states Id-number, Access Code, Main, Balance and Mini Stat.]



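A Mixed-Initiative Dialogue Model with Short-Cuts

[Figure: Dialogue Task Structure. Main asks “Balance or Mini Stat?”. The Balance subtask consists of “For which account?”, “Provide balance” and “More accounts?”. The Mini Stat subtask consists of “For which account?”, “Provide mini-stat” and “More accounts?”. The dialogue ends with “Wish to continue?”. The legend distinguishes two kinds of transitions: system directed and user initiated.]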
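The task structure can be read as a small state machine. The following Python sketch is an illustration only: the state names follow the figure, but the transition table `SYSTEM_DIRECTED`, the short-cut table `SHORTCUTS` and the helpers `next_state` and `run` are assumptions made for this example, not the OVID implementation.

```python
# Hypothetical sketch of a mixed-initiative dialogue with short-cuts.
# State names follow the figure; all transitions are assumed for illustration.

# System-directed path: where the dialogue goes if the user just answers.
SYSTEM_DIRECTED = {
    "id_number": "access_code",
    "access_code": "main",
    "balance": "main",
    "mini_stat": "main",
}

# User-initiated short-cuts: commands assumed to be accepted in each state.
SHORTCUTS = {
    "main": {"balance", "mini_stat"},
    "balance": {"mini_stat"},
    "mini_stat": {"balance"},
}


def next_state(state, user_command=None):
    """Return (next state, True if the user took the initiative)."""
    if user_command in SHORTCUTS.get(state, set()):
        return user_command, True
    return SYSTEM_DIRECTED.get(state, state), False


def run(commands, start="id_number"):
    """Replay user commands (None = plain answer to the system prompt)
    and count the user initiatives along the way."""
    state, initiatives, path = start, 0, [start]
    for command in commands:
        state, took_initiative = next_state(state, command)
        initiatives += took_initiative
        path.append(state)
    return path, initiatives


path, initiatives = run([None, None, "balance", "mini_stat", None])
print(path, initiatives)
```

Counting initiatives per dialogue in this way gives the kind of turn-taking figure that is analysed later for the first and second calls.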



Objective Measures on the OVID Corpus

The Corpus:

• 700 transcribed dialogues for 310 users who were requested to carry out two pre-defined scenarios.

Speech I/O Quality:

• Speech (concept) recognition performance

Dialogue Symmetry:

• Turn-taking strategy, in particular how and when users took the initiative in the dialogue

Communication Efficiency:

• Timing parameters for overall and subtask performance

Task Effectiveness:

• Task success rates



Objective Measures – Timing

All users were required to carry out two scenarios, A and B

The table shows the average time spent in the login subtasks for the first (A1,B1) and second (A2,B2) dialogues

A paired, two-tailed t-test revealed a significant reduction of the time spent in the “Id_number” subtask when comparing the first to the second dialogue (p = 0.03)
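A paired t-test of this kind works on the per-user differences between the two calls. The sketch below uses only the Python standard library; the login times are invented for illustration (they are not OVID data), and the p-value is obtained by numerically integrating the Student t density, which is one of several ways to compute it.

```python
import math
from statistics import mean, stdev


def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value for Student's t, by Simpson integration of the
    density between 0 and |t| (steps must be even)."""
    norm = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    norm /= math.sqrt(df * math.pi)

    def pdf(x):
        return norm * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = 2 * (h / 3) * total  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def paired_t_test(first, second):
    """Paired t-test on the per-user differences (first minus second).
    Returns the t statistic, the degrees of freedom and the p-value."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1, t_two_tailed_p(t, n - 1)


# Invented login times in seconds for five users (not OVID data).
first_call = [32.0, 28.5, 35.1, 30.2, 29.8]
second_call = [27.4, 27.9, 30.0, 28.1, 29.0]
t, df, p = paired_t_test(first_call, second_call)
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")
```

The pairing matters: each user serves as his or her own control, so differences between fast and slow users do not inflate the variance.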



Objective Measures – Turn taking

Analysing the turn-taking strategies uncovered a similar trend – users completed the dialogues with a smaller number of turns in the second dialogue

In particular, the more experienced the users became, the more willing they were to take the initiative in the dialogue.

Average number of user initiatives per dialogue for the “A” and “B” scenarios, for the first and second dialogues.

An unpaired two-tailed t-test shows a significant (p = 0.02) increase in the number of user initiatives relative to the total number of turns for scenario B2 compared to B1.
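An unpaired test treats the two groups as independent samples. The slide does not say which variant was used; the sketch below assumes Welch's version, which does not require equal variances, and the ratios in it are invented for illustration (not OVID data).

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's unpaired t statistic and its approximate degrees of freedom
    (Welch-Satterthwaite)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Invented ratios of user initiatives to total turns (not OVID data).
b1 = [0.05, 0.10, 0.00, 0.08, 0.12, 0.04]  # scenario B as the first dialogue
b2 = [0.15, 0.09, 0.20, 0.11, 0.18]        # scenario B as the second dialogue
t, df = welch_t(b2, b1)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The statistic is then compared with the two-tailed critical value of the t distribution for the computed degrees of freedom.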



Objective Measures – Task Completion

Perceived versus observed task success:

• Although 96% of the users believed that they had completed both scenarios, only 74% actually did so

The number of failed dialogues is reduced by almost 50% from the first to the second call (a reduction of 25% would already be significant at the 95% confidence level)



Objective Measures – Speech Recognition

The intervals show the speech recognition performance experienced by a particular proportion of users

A recognition performance of 90% roughly corresponds to one error per dialogue
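The rule of thumb is simple arithmetic: the expected number of errors is the error rate multiplied by the number of recognised user utterances. The sketch below assumes about ten utterances per dialogue, a figure that is implied by the rule rather than stated on the slide.

```python
def expected_errors(recognition_rate, utterances_per_dialogue):
    """Expected number of misrecognised utterances in one dialogue."""
    return (1.0 - recognition_rate) * utterances_per_dialogue


# Assumed: about 10 user utterances per dialogue (not stated on the slide).
print(expected_errors(0.90, 10))
```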



Objective Measures - Conclusions

A significant reduction of time for the Id_number subtask was observed when comparing durations of the first and second dialogues.

Analyses of the users’ turn-taking strategy for the first and second calls reveal a significant increase in the users’ tendency to take the initiative in the dialogue.

Task completion rates also showed a significant increase from the first to the second dialogue.

These findings are interpreted as signs of system learnability.

--------------------------------------

No differences were identified for users from different demographic groups (gender, region, age)



Subjective Measures

The term user satisfaction is used to denote the degree to which the users are satisfied with, or accept the system performance.

Contrary to (most) objective measures, information on user satisfaction is not directly observable, but must be obtained by asking the users

Often the user is asked to express his/her attitude towards a number of statements about the system, for example using a so-called Likert attitude questionnaire
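Scoring such a questionnaire usually involves reversing the negatively worded statements so that a high score always means a positive attitude, and then averaging per statement. The sketch below assumes a 7-point scale and a normal-approximation confidence interval; the responses are invented (not OVID data) and the helper names are hypothetical.

```python
import math
from statistics import mean, stdev

SCALE_MAX = 7  # assumed 7-point scale: 1 = strongly disagree, 7 = strongly agree


def attitude_score(response, negatively_worded):
    """Map a raw response to an attitude score where higher is always better."""
    return (SCALE_MAX + 1 - response) if negatively_worded else response


def mean_with_interval(scores, z=1.96):
    """Mean and half-width of a normal-approximation confidence interval
    (z = 1.96 for about 95%, z = 2.326 for about 98%)."""
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return mean(scores), half_width


# Invented responses from six users to two statements (not OVID data).
easy_to_use = [6, 7, 5, 6, 6, 7]  # positively worded
confusing = [2, 1, 3, 2, 2, 1]    # negatively worded, so reversed below
scores = [attitude_score(r, True) for r in confusing]
print(mean_with_interval(easy_to_use))
print(mean_with_interval(scores))
```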



The OVID Questionnaire with 25 Statements

[Figure: Average User Attitudes with 98% confidence intervals. x-axis: Category; y-axis: Attitude, on a scale from 1 to 7. Statements shown: easy to use, knew what to do, friendliness, confusing, use again, reliability, out of control, like voice, concentration, efficiency, flustered, too fast, under stress, voice clear, frustration, prefer human, too complicated, enjoyment, needs improvement, politeness, security, convenient, confidentiality, remember too much, good value, followed by the overall Average. A “Domain Dep.” label marks the domain dependent statements.]



Factor Analysis

Factor Analysis (FA) is used to identify the underlying relationships between the statements

• Mathematically, FA resembles Principal Components Analysis (PCA), but:

  • In FA, the factors are perceived as the cause of the observed variable scores, i.e. it is the underlying factor structure that has produced (or caused) the observed variable scores.

  • In contrast, for PCA, the components are just perceived as aggregates of the observed variable scores

  • Furthermore, in PCA all variance is modelled, whereas in FA only the variance the variables have in common (the communalities) is considered

• FA has an element of subjective judgment, since the goal is to arrive at a factor set that will provide an interpretation of the observed data
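The difference between modelling all variance and modelling only the common variance can be seen on a toy example. The sketch below is not the analysis used for OVID: it takes an invented 3x3 correlation matrix, solves the one-factor model exactly with Spearman's triad formula (valid only for three variables and one factor), and compares the result with the first principal component found by power iteration.

```python
import math

# Toy correlation matrix for three statements, built so that one common
# factor with loadings 0.8, 0.7 and 0.6 reproduces the off-diagonal entries.
R = [
    [1.00, 0.56, 0.48],
    [0.56, 1.00, 0.42],
    [0.48, 0.42, 1.00],
]


def one_factor_loadings(r):
    """Exact one-factor solution for three variables (Spearman's triads).
    Only the off-diagonal correlations, i.e. the shared variance, are used."""
    r12, r13, r23 = r[0][1], r[0][2], r[1][2]
    return [
        math.sqrt(r12 * r13 / r23),
        math.sqrt(r12 * r23 / r13),
        math.sqrt(r13 * r23 / r12),
    ]


def first_component_loadings(r, iterations=200):
    """Loadings of the first principal component, by power iteration on the
    full matrix, so the unit diagonal (all variance) is modelled as well."""
    n = len(r)
    v = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(r[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    return [math.sqrt(eigenvalue) * x for x in v]


fa = one_factor_loadings(R)
pca = first_component_loadings(R)
communalities = [x * x for x in fa]
print("FA loadings:  ", [round(x, 3) for x in fa])
print("PCA loadings: ", [round(x, 3) for x in pca])
print("communalities:", [round(x, 3) for x in communalities])
```

In this example the PCA loadings come out higher than the factor loadings, because the component also has to account for the variance that is unique to each statement.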



Verification of OVID Factors

OVID Factors                                               Variance
F1: Quality of interface/performance                       19%
    Use Again, Reliability, Efficiency, Prefer Human,
    Enjoyment, Needs Improvement
F2: Cognitive load                                         13%
    Concentration, Speed, Under Stress
F3: Control/Confusion                                      9%
    Know what was expected, Perceived control, Confusion,
    Flustered, Too Complicated
F4: Friendliness                                           8%
    Friendly, Polite
F5: Voice                                                  8%
    Liked Voice, Voice clear
Total Explained Variance                                   57%

Original CCIR Factors                                      Variance
F1: Quality of interface/performance                       21%
    Use Again, Efficiency, Reliability, Needs Improvement
F2: Cognitive effort and Stress                            17%
    Speed, Under Stress, Concentration, Perceived control
F3: Conversational model
    Voice, Tone prompts, Friendliness
F4: Fluency
    Voice clarity, Politeness, Know what was expected
F5: Transparency
    Ease of use, Prompt helpfulness, Degree of fluster
Total Explained Variance                                   74%



Six-Factor Structure

When the five domain dependent statements are added, a sixth factor emerges.



Correlating Statement Scores

[Figure: Correlation with user attitude to “Use again”, on a scale from 0 to 0.75, for the statements: knew what to do, too fast, confusing, polite, like voice, voice clear, concentrate, friendly, remember much, stress, control, confidentiality, security, flustered, needs improvement, prefer human, reliability, complicated, easy to use, efficiency, frustration, good value, enjoyment, convenient. One group of statements is labelled “F3: Convenience”.]



Subjective Measures - Conclusions

The questionnaire used for the subjective measures was shown to be valid and to produce a factor structure similar to that of the original CCIR questionnaire

When the domain dependent statements were included, the factor structure changed and a new factor, “confidentiality”, emerged

Generally, the users were positive towards the OVID home bank service (average score was 5.6, i.e. between “agree” and “strongly agree”)

Similar to the objective measures, no significant differences between the demographic groups were found



Combining Objective and Subjective Measures

• The PARADISE (Paradigm for Dialogue System Evaluation) scheme (proposed by Walker et al. from AT&T in 1997) attempts to combine the subjective and objective measures.

• This is done by estimating a performance function with “usability” as the dependent variable and the objective measures as the independent variables

• The performance equation is modeled using Multiple Linear Regression (MLR)



The PARADISE Model

Kappa attempts to compensate for differences in the complexity of the dialogues
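Kappa corrects the observed agreement between the scenario key and the information actually obtained in the dialogue for the agreement expected by chance: kappa = (P(A) - P(E)) / (1 - P(E)). The sketch below assumes a square confusion matrix whose columns hold the scenario key and whose rows hold the obtained values, with P(E) estimated from the column totals; the matrices are invented for illustration (not OVID data).

```python
def kappa(confusion):
    """Kappa for a square confusion matrix (rows: values obtained in the
    dialogue, columns: scenario key). Chance agreement is estimated from
    the column totals."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_agree = sum(confusion[i][i] for i in range(size)) / total
    column_sums = [sum(row[j] for row in confusion) for j in range(size)]
    p_chance = sum((c / total) ** 2 for c in column_sums)
    return (p_agree - p_chance) / (1 - p_chance)


# Invented matrices: both have 90% raw agreement, but the second task has
# four possible values instead of two, so chance agreement is lower.
two_values = [[45, 5], [5, 45]]
four_values = [
    [45, 2, 2, 1],
    [2, 45, 1, 2],
    [2, 1, 45, 2],
    [1, 2, 2, 45],
]
print(kappa(two_values), kappa(four_values))
```

With the same raw agreement, the task with more possible values receives the higher kappa, which is how the measure normalises for task complexity.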



Applying PARADISE to the OVID Corpus

The correlation of the independent variables was checked before applying MLR. Only the speech recognition and task success parameters turned out to be significant predictors of usability (represented as the F1-factor group)
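A regression of this kind can be sketched with ordinary least squares on two predictors. The example below is a minimal standard-library illustration with invented data (not OVID data): it fits an intercept and two weights via the normal equations and reports the share of variance explained. PARADISE additionally normalises each measure to z-scores before the fit, which is left out here for brevity.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with
    partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(predictors, target):
    """Ordinary least squares with an intercept, via the normal equations.
    Returns [intercept, weight_1, weight_2, ...]."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(predictors, target, coefficients):
    """Share of the variance in the target captured by the fitted model."""
    fitted = [coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], p))
              for p in predictors]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Invented data: (recognition rate, kappa) per user and an F1 attitude score.
x = [(0.95, 0.9), (0.80, 0.6), (0.90, 0.8), (0.70, 0.7), (0.85, 0.4), (0.99, 1.0)]
y = [6.2, 4.9, 5.8, 4.6, 4.4, 6.5]
coef = fit_linear(x, y)
print("intercept and weights:", [round(c, 2) for c in coef])
print("R^2:", round(r_squared(x, y, coef), 2))
```

An R^2 of about 0.5 would mean that the fitted function captures roughly half of the variance in the observed attitudes.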



The Regression



The Resulting Performance Function

The resulting performance function for the OVID experiment, compared with similar PARADISE analyses by Walker et al. at AT&T.



Estimation of the user attitudes

[Figure: Observed and Estimated User Attitudes. x-axis: Users (ticks from 5 to 30); y-axis: User Attitude (F1), on a scale from 1 to 7. Curves: Observed, Estimated, and the +95% and -95% confidence bounds.]

For verification of the model, the identified parameters are used to estimate the user satisfaction (red line). It is clear that only half of the variance of the observed attitudes (blue line) is captured.



Conclusion on the PARADISE Results

The important question is of course whether any new information was revealed.

• It is hardly surprising that a relationship between ASR performance, task success and user satisfaction can be observed.

• Kappa proved to be a better predictor of usability than a simpler ratio of completed sub-goals. The main function of Kappa is to normalise for task complexity, which in this case it seems to have done.

• There is an (almost surprisingly) good correspondence between the OVID results and those obtained by AT&T

• PARADISE is limited by the requirement for well-defined scenario-based dialogues and a linear relationship between performance measures and usability



Summary

• Certain usability aspects must receive special attention, due to the nature of speech, most notably transparency and learnability

• The requirements set up for the OVID dialogue have to a large degree been met. (Exception: Speed)

• The learnability of the OVID dialogue has been demonstrated through measures of the timing and turn-taking strategy

• The validity of the questionnaire used for OVID has been established through factor analysis

• A PARADISE analysis confirmed that speech recognition and task success are important for user satisfaction, and a high correspondence with results obtained elsewhere is shown.

• The important topic of multi modal user interaction and the issue of memorability have not been addressed in this work



What is in the Future for Speech?

• Speech as a modality is in a highly competitive “market”, and must simply be better than any other option for people to use it

• Many envisioned “killer applications”, e.g. phone-based home banking, have been taken over by the Web (e.g. 38% of Danish internet users used internet home banking regularly by 2002, while none used speech)

• The methods for measuring user satisfaction have to a large degree been overlooked by the speech community and must receive more attention if speech based interfaces are to be successful

• Much focus has been on naturalness and user control, but really without any hard proof that this actually leads to higher user satisfaction – learnability might be just as important

• The focus on mobility might provide a breakthrough for speech, especially in combination with other modalities



and Finally…

I wish to thank all those of my colleagues at CPK and the OVID team who have helped me in this work, either directly or by taking over some of my other tasks.

I also wish to thank my family for their support

Last, I wish to thank you all for coming here and listening to what I had to say
