Mini_UPA, 2009

Rating Scales: What the Research Says

Joe Dumas, UX Consultant, [email protected]
Tom Tullis, Fidelity Investments, [email protected]
The Scope of the Session

Discussion of literature about rating scales in usability methods, primarily usability testing
Brief review of recommendations from older literature
Focus on recent studies
Recommendations for practitioners
Table of Contents

Types of rating scales
Guidelines from past studies
How to evaluate a rating scale
Guidelines from recent studies
Additional advantages of rating scales
Types of Rating Scales
Formats

One question format
Before-after format
Multiple question format
One Question Formats
Original Likert scale format:

I think that I would like to use this system frequently:
___ Strongly Disagree
___ Disagree
___ Neither agree nor disagree
___ Agree
___ Strongly Agree

[Photo: Rensis Likert]
One Question Formats
Likert-like scales:

Characters on the screen are:
Hard to read                                Easy to read
  1    2    3    4    5    6    7    8    9
One Question Formats
One more Likert-like scale (used in SUMI):

I would recommend this software to my colleagues:
__ Agree    __ Undecided    __ Disagree
One Question Formats
Subjective Mental Effort Scale (SMEQ)

[Figure: the SMEQ vertical scale]
One Question Formats
Semantic Differential:

[Figure: semantic differential example]

Magnitude estimation: use any positive number
Before-After Ratings

Before the task:
How easy or difficult do you expect this task to be:
Very easy                                Very difficult
  1    2    3    4    5    6    7

After the task:
How easy or difficult was the task to do:
Very easy                                Very difficult
  1    2    3    4    5    6    7
Multiple Question Formats (Selected List)

System Usability Scale (SUS) – 10 ratings (a scoring sketch follows this list)
*Questionnaire for User-Interface Satisfaction (QUIS) – 71 ratings (long form), 26 (short form)
*Software Usability Measurement Inventory (SUMI) – 50 ratings
After Scenario Questionnaire (ASQ) – 3 ratings

* Requires a license
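Since SUS comes up again below, it may help to see its scoring arithmetic, which is public and simple: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the 0-40 sum is multiplied by 2.5 to give a 0-100 score. A minimal Python sketch, with invented example responses:

```python
# Standard SUS scoring (Brooke): 10 items, each answered on a 1-5 scale.
def sus_score(responses):
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd (positively worded) items: r - 1; even (negatively worded): 5 - r
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5  # rescale the 0-40 sum to 0-100

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0
```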
More Multiple Question Formats

Post-Study System Usability Questionnaire (PSSUQ) – 19 ratings; the electronic version is called the Computer System Usability Questionnaire (CSUQ)
*Website Analysis and MeasureMent Inventory (WAMMI) – 20 ratings of website usability

* Requires a license
Guidelines from Past Studies
Guidelines

Have 5-9 levels in a rating
  You gain no additional information by having more than 10 levels
Include a neutral point in the middle of the scale
  Otherwise you lose information by forcing some participants to take sides
  People from some Asian cultures are more likely to choose the midpoint
Guidelines

Use positive integers as numbers
  1-7 instead of -3 to +3 (participants are less likely to go below 0 than they are to use 1-3)
  Or don't show numbers at all
Use word labels for at least the end points
  It is hard to create labels for every point beyond 5 levels
  Having labels on the end points only also makes the data more "interval-like"
Guidelines

Most word labels produce a bipolar scale
  In a 1 to 7 scale from easy to difficult, what is increasing with the numbers? Is ease the absence of difficulty?
  This may be one reason why participants are reluctant to move to the difficult end – it is a different concept than lack of ease
  One solution – a scale from "not at all easy" to "very easy"
Evaluating a Rating Scale
Statistical Criteria

Is it valid? Does it measure what it's supposed to measure?
  For example, does it correlate with other usability measures? (see the sketch after this list)
Is it sensitive?
  Can it discriminate between tasks or products with small samples?
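As an illustration of the validity check above, one common approach is to correlate mean post-task ease ratings with mean task times. All numbers here are invented; the point is that a valid ease measure should track time negatively:

```python
# Hypothetical validity check: do mean post-task ease ratings (1-7)
# track mean task times? Data values are invented for illustration.
from scipy.stats import spearmanr

ease_ratings = [6.2, 5.8, 3.1, 4.4, 2.5, 6.5]  # one mean rating per task
task_times   = [35, 60, 190, 95, 240, 42]      # mean seconds per task

rho, p = spearmanr(ease_ratings, task_times)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# A strongly negative rho (easier tasks take less time) is one piece
# of evidence that the rating measures what it is supposed to measure.
```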
Practical Criteria

Is it easy for the participant to understand and use?
  Do they get what it means?
Is it easy for the tester to present – online or paper – and score?
  Do you need a widget to present it?
  Can scoring be done automatically?
Guidelines from Recent Studies
Post-Task Ratings

The simpler the better
Tedesco and Tullis found this format the most sensitive:

  Overall this task was:
  Very Easy                                Very Difficult

Sauro and Dumas found SMEQ just as sensitive as Likert
More on Post-Task Ratings

They provide diagnostic information about usability issues with tasks
They correlate moderately well with other measures, especially time, and their correlations are higher than those for post-test ratings
More on Post-Task Ratings

Even post-task ratings may be inflated (Teague et al., 2001): ratings made during a task were significantly lower than ratings made right after it, and ratings were higher still when given only after the task

                    Concurrent,     Concurrent,     Post-task
                    during task     after task      only
Mean ease rating    4.44            4.78            5.60
Post-Test Ratings

Home-grown questionnaires perform more poorly than standardized ones
Tullis and Stetson and others have found SUS the most sensitive; many testers are using it
Some of the standardized questionnaires have industry norms to compare against – SUMI and WAMMI
  But no one knows what the database of norms contains
More on Post-Test Ratings

Among all measures used in testing, post-test ratings show the lowest correlations with the others (Sauro and Lewis)
  Why? They are tapping into factors that don't affect other measures, such as demand characteristics, the need to please, the need to appear competent, lack of understanding of what an "overall" rating means, etc.
Examine the Distribution

[Chart: distribution of participants' ratings for one product – strongly bimodal]

See how the average would miss how bimodal the distribution is. Some participants find it very hard to use. Why?
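One quick way to examine a distribution is a plain text histogram printed next to the mean. The ratings below are invented and chosen to be bimodal:

```python
# Text histogram of individual ratings, so bimodality isn't hidden
# behind the mean. Ratings are invented to make the point.
from collections import Counter
from statistics import mean

ratings = [1, 2, 1, 2, 2, 6, 7, 6, 7, 7, 6, 1]  # 1-7 ease ratings

print(f"mean = {mean(ratings):.2f}")  # 4.00 looks deceptively 'average'
counts = Counter(ratings)
for level in range(1, 8):
    print(f"{level}: {'#' * counts[level]}")
# Two clusters (1-2 and 6-7) mean two very different user experiences,
# which the single mean of 4.0 completely hides.
```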
Low Sensitivity with Small Samples

Three recent studies have all shown that post-task and post-test ratings do not discriminate well with sample sizes below about 10-12
For sample sizes typical of formative laboratory tests, ratings are not reliable
Ratings can be used as an opportunity to get participants to talk about why they have chosen a value
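A small simulation makes the sample-size point concrete: draw ratings for two products whose true means differ by a full point, and count how often a t-test on n participants per product detects the difference. All distribution parameters here are invented:

```python
# Toy power simulation: two products whose true mean ratings differ by
# one point on a 7-point scale. How often does a t-test on n
# participants per product detect that? Parameters are invented.
import random
from scipy.stats import ttest_ind

random.seed(1)

def clamp(x):
    return min(7.0, max(1.0, x))  # keep simulated ratings on the 1-7 scale

def detection_rate(n, runs=2000):
    hits = 0
    for _ in range(runs):
        a = [clamp(random.gauss(5.5, 1.3)) for _ in range(n)]
        b = [clamp(random.gauss(4.5, 1.3)) for _ in range(n)]
        if ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / runs

for n in (5, 8, 12, 20):
    print(f"n = {n:2d} per product: difference detected {detection_rate(n):.0%} of runs")
```

With the small samples typical of formative tests, the real difference goes undetected in most runs, which is exactly the low sensitivity the studies report.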
The Value of Confidence Intervals

[Chart: Ratings of "Ease of Finding Information" (1-7, higher = better) for the NASA and Wikipedia sites at sample sizes of 5, 10, 20, 30, and 50; error bars represent a 90% confidence interval]

Actual data from an online study comparing the NASA & Wikipedia sites for finding info on the Apollo space program.
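For reference, the error bars in a chart like this are ordinary t-based confidence intervals. A minimal sketch for one sample of invented ratings:

```python
# t-based 90% confidence interval for a mean rating, the same kind of
# interval as the error bars above. The ratings are invented.
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

ratings = [6, 5, 7, 4, 6, 5, 6, 7, 3, 6]  # one site's 1-7 ratings (n = 10)

n, m = len(ratings), mean(ratings)
half = t.ppf(0.95, df=n - 1) * stdev(ratings) / sqrt(n)  # 0.95 => two-sided 90%
print(f"mean = {m:.2f}, 90% CI = ({m - half:.2f}, {m + half:.2f})")
# Small n gives a wide interval; as n grows the interval tightens until
# two sites with different true means stop overlapping.
```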
Little Known Advantages of Rating Scales
Ratings Can Help Prioritize Work

[Chart: Average Expectation Rating (x-axis) vs. Average Experience Rating (y-axis) by task, both on 1-7 scales (1 = Difficult, 7 = Easy), with the four quadrants labeled "Fix It Fast", "Promote It", "Big Opportunity", and "Don't Touch It"]
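A sketch of how the quadrant assignment could be automated. The mapping used here (e.g., expected easy but experienced hard means "Fix It Fast") is my reading of the usual version of this chart rather than something stated on the slide, and the task names and ratings are invented:

```python
# Hypothetical quadrant classifier for expectation-vs-experience ratings.
MID = 4  # midpoint of the 1-7 scale (1 = Difficult, 7 = Easy)

def quadrant(expected, experienced):
    if expected >= MID and experienced < MID:
        return "Fix It Fast"      # thought it would be easy; it wasn't
    if expected < MID and experienced >= MID:
        return "Promote It"       # pleasant surprise
    if expected < MID and experienced < MID:
        return "Big Opportunity"  # expected hard and it was hard
    return "Don't Touch It"       # expected easy and it was easy

tasks = {
    "Change address": (6.1, 2.8),
    "Transfer funds": (3.0, 5.9),
    "Rebalance portfolio": (2.4, 2.9),
    "Check balance": (6.5, 6.7),
}
for name, (before, after) in tasks.items():
    print(f"{name}: {quadrant(before, after)}")
```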
Ratings Can Help Identify "Disconnects"

[Chart: Percent Correct (0-100%) and Task Ease Rating (1-5, higher = better) for Tasks 1-4]

This "disconnect" between the accuracy and task ease ratings is worrisome – it indicates users didn't realize they were screwing up on Task 2!
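A disconnect check like this is easy to script once you have per-task accuracy and ease ratings. The 70% and 4.0 thresholds, and all the data, are invented for illustration:

```python
# Simple disconnect flag: tasks with low success but high perceived ease.
results = {  # task: (percent correct, mean ease rating on 1-5)
    "Task 1": (90, 4.5),
    "Task 2": (55, 4.6),  # users failed often yet felt successful
    "Task 3": (85, 4.2),
    "Task 4": (95, 4.8),
}

for task, (accuracy, ease) in results.items():
    if accuracy < 70 and ease >= 4.0:
        print(f"{task}: disconnect - {accuracy}% correct but rated {ease}/5")
```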
Ratings Can Help You Make Comparisons

[Chart: Frequency distribution of average SUS scores for 129 conditions from 50 studies, in bins of <=40, 41-50, 51-60, 61-70, 71-80, 81-90, and 91-100]

You can be very pleased if you get an average SUS score of 83 (which is the 94th percentile of this distribution). But you should be worried if you get an average SUS score of 48 (the 12th percentile).
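Given a set of study-level SUS averages, locating your own score is a one-liner. The norms list below is randomly generated as a stand-in for the real 129 conditions:

```python
# Locating an average SUS score within a set of study-level averages.
import random
from scipy.stats import percentileofscore

random.seed(7)
study_means = [random.gauss(66, 13) for _ in range(129)]  # fake norms

for my_score in (83, 48):
    pct = percentileofscore(study_means, my_score)
    print(f"Average SUS of {my_score} -> about the {pct:.0f}th percentile here")
```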
In Closing…

These slides, a bibliography of readings, and associated examples can be downloaded from:
http://www.measuringUX.com/
Feel free to contact us with questions!