Mini_UPA, 2009

Rating Scales: What the Research Says

Joe Dumas, UX Consultant, [email protected]
Tom Tullis, Fidelity Investments, [email protected]
The Scope of the Session

Discussion of literature about rating scales in usability methods, primarily usability testing
Brief review of recommendations from older literature
Focus on recent studies
Recommendations for practitioners
Table of Contents

Types of rating scales
Guidelines from past studies
How to evaluate a rating scale
Guidelines from recent studies
Additional advantages of rating scales
Types of Rating Scales
Formats

One question format
Before-after format
Multiple question format
One Question Formats
Original Likert scale format:

I think that I would like to use this system frequently:
___ Strongly Disagree
___ Disagree
___ Neither agree nor disagree
___ Agree
___ Strongly Agree

[Photo: Rensis Likert]
One Question Formats
Likert-like scales:

Characters on the screen are:
Hard to read                                Easy to read
  1    2    3    4    5    6    7    8    9
One Question Formats
One more Likert-like scale (used in SUMI):

I would recommend this software to my colleagues:
__ Agree    __ Undecided    __ Disagree
One Question Formats
Subjective Mental Effort Scale (SMEQ)

[Figure: the SMEQ vertical scale]
One Question Formats
Semantic Differential:

[Figure: semantic differential example]

Magnitude estimation: use any positive number
Before-After Ratings

Before the task:
How easy or difficult do you expect this task to be:
Very easy                                Very difficult
  1    2    3    4    5    6    7

After the task:
How easy or difficult was the task to do:
Very easy                                Very difficult
  1    2    3    4    5    6    7
Multiple Question Formats (Selected List)

System Usability Scale (SUS) – 10 ratings (a scoring sketch follows this list)
*Questionnaire for User-Interface Satisfaction (QUIS) – 71 ratings (long form), 26 (short form)
*Software Usability Measurement Inventory (SUMI) – 50 ratings
After Scenario Questionnaire (ASQ) – 3 ratings

* Requires a license
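Since SUS comes up again below, it may help to see its scoring arithmetic, which is public and simple: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the 0-40 sum is multiplied by 2.5 to give a 0-100 score. A minimal Python sketch, with invented example responses:

```python
# Standard SUS scoring (Brooke): 10 items, each answered on a 1-5 scale.
def sus_score(responses):
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd (positively worded) items: r - 1; even (negatively worded): 5 - r
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5  # rescale the 0-40 sum to 0-100

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0
```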
More Multiple Question Formats

Post-Study System Usability Questionnaire (PSSUQ) – 19 ratings; the electronic version is called the Computer System Usability Questionnaire (CSUQ)
*Website Analysis and MeasureMent Inventory (WAMMI) – 20 ratings of website usability

* Requires a license
Guidelines from Past Studies
Guidelines

Have 5-9 levels in a rating
  You gain no additional information by having more than 10 levels
Include a neutral point in the middle of the scale
  Otherwise you lose information by forcing some participants to take sides
  People from some Asian cultures are more likely to choose the midpoint
Guidelines

Use positive integers as numbers
  1-7 instead of -3 to +3 (participants are less likely to go below 0 than they are to use 1-3)
  Or don't show numbers at all
Use word labels for at least the end points
  It is hard to create labels for every point beyond 5 levels
  Having labels on the end points only also makes the data more "interval-like"
Guidelines

Most word labels produce a bipolar scale
  In a 1 to 7 scale from easy to difficult, what is increasing with the numbers? Is ease the absence of difficulty?
  This may be one reason why participants are reluctant to move to the difficult end – it is a different concept than lack of ease
  One solution – a scale from "not at all easy" to "very easy"
Evaluating a Rating Scale
Statistical Criteria

Is it valid? Does it measure what it's supposed to measure?
  For example, does it correlate with other usability measures? (see the sketch after this list)
Is it sensitive?
  Can it discriminate between tasks or products with small samples?
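As an illustration of the validity check above, one common approach is to correlate mean post-task ease ratings with mean task times. All numbers here are invented; the point is that a valid ease measure should track time negatively:

```python
# Hypothetical validity check: do mean post-task ease ratings (1-7)
# track mean task times? Data values are invented for illustration.
from scipy.stats import spearmanr

ease_ratings = [6.2, 5.8, 3.1, 4.4, 2.5, 6.5]  # one mean rating per task
task_times   = [35, 60, 190, 95, 240, 42]      # mean seconds per task

rho, p = spearmanr(ease_ratings, task_times)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# A strongly negative rho (easier tasks take less time) is one piece
# of evidence that the rating measures what it is supposed to measure.
```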
Practical Criteria

Is it easy for the participant to understand and use?
  Do they get what it means?
Is it easy for the tester to present – online or paper – and score?
  Do you need a widget to present it?
  Can scoring be done automatically?
Guidelines from Recent Studies
Post-Task Ratings

The simpler the better
Tedesco and Tullis found this format the most sensitive:

  Overall this task was:
  Very Easy                                Very Difficult

Sauro and Dumas found SMEQ just as sensitive as Likert
More on Post-Task Ratings

They provide diagnostic information about usability issues with tasks
They correlate moderately well with other measures, especially time, and their correlations are higher than those for post-test ratings
More on Post-Task Ratings

Even post-task ratings may be inflated (Teague et al., 2001): ratings made during a task were significantly lower than ratings made right after it, and ratings were higher still when given only after the task

                    Concurrent,     Concurrent,     Post-task
                    during task     after task      only
Mean ease rating    4.44            4.78            5.60
Post-Test Ratings

Home-grown questionnaires perform more poorly than standardized ones
Tullis and Stetson and others have found SUS the most sensitive; many testers are using it
Some of the standardized questionnaires have industry norms to compare against – SUMI and WAMMI
  But no one knows what the database of norms contains
More on Post-Test Ratings

Among all measures used in testing, post-test ratings show the lowest correlations with the others (Sauro and Lewis)
  Why? They are tapping into factors that don't affect other measures, such as demand characteristics, the need to please, the need to appear competent, lack of understanding of what an "overall" rating means, etc.
Examine the Distribution

[Chart: distribution of participants' ratings for one product – strongly bimodal]

See how the average would miss how bimodal the distribution is. Some participants find it very hard to use. Why?
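One quick way to examine a distribution is a plain text histogram printed next to the mean. The ratings below are invented and chosen to be bimodal:

```python
# Text histogram of individual ratings, so bimodality isn't hidden
# behind the mean. Ratings are invented to make the point.
from collections import Counter
from statistics import mean

ratings = [1, 2, 1, 2, 2, 6, 7, 6, 7, 7, 6, 1]  # 1-7 ease ratings

print(f"mean = {mean(ratings):.2f}")  # 4.00 looks deceptively 'average'
counts = Counter(ratings)
for level in range(1, 8):
    print(f"{level}: {'#' * counts[level]}")
# Two clusters (1-2 and 6-7) mean two very different user experiences,
# which the single mean of 4.0 completely hides.
```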
Low Sensitivity with Small Samples

Three recent studies have all shown that post-task and post-test ratings do not discriminate well with sample sizes below about 10-12
For sample sizes typical of formative laboratory tests, ratings are not reliable
Ratings can be used as an opportunity to get participants to talk about why they have chosen a value
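A small simulation makes the sample-size point concrete: draw ratings for two products whose true means differ by a full point, and count how often a t-test on n participants per product detects the difference. All distribution parameters here are invented:

```python
# Toy power simulation: two products whose true mean ratings differ by
# one point on a 7-point scale. How often does a t-test on n
# participants per product detect that? Parameters are invented.
import random
from scipy.stats import ttest_ind

random.seed(1)

def clamp(x):
    return min(7.0, max(1.0, x))  # keep simulated ratings on the 1-7 scale

def detection_rate(n, runs=2000):
    hits = 0
    for _ in range(runs):
        a = [clamp(random.gauss(5.5, 1.3)) for _ in range(n)]
        b = [clamp(random.gauss(4.5, 1.3)) for _ in range(n)]
        if ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / runs

for n in (5, 8, 12, 20):
    print(f"n = {n:2d} per product: difference detected {detection_rate(n):.0%} of runs")
```

With the small samples typical of formative tests, the real difference goes undetected in most runs, which is exactly the low sensitivity the studies report.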
The Value of Confidence Intervals

[Chart: Ratings of "Ease of Finding Information" (1-7, higher = better) for the NASA and Wikipedia sites at sample sizes of 5, 10, 20, 30, and 50; error bars represent a 90% confidence interval]

Actual data from an online study comparing the NASA & Wikipedia sites for finding info on the Apollo space program.
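For reference, the error bars in a chart like this are ordinary t-based confidence intervals. A minimal sketch for one sample of invented ratings:

```python
# t-based 90% confidence interval for a mean rating, the same kind of
# interval as the error bars above. The ratings are invented.
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

ratings = [6, 5, 7, 4, 6, 5, 6, 7, 3, 6]  # one site's 1-7 ratings (n = 10)

n, m = len(ratings), mean(ratings)
half = t.ppf(0.95, df=n - 1) * stdev(ratings) / sqrt(n)  # 0.95 => two-sided 90%
print(f"mean = {m:.2f}, 90% CI = ({m - half:.2f}, {m + half:.2f})")
# Small n gives a wide interval; as n grows the interval tightens until
# two sites with different true means stop overlapping.
```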
Little Known Advantages of Rating Scales
Ratings Can Help Prioritize Work

[Chart: Average Expectation Rating (x-axis) vs. Average Experience Rating (y-axis) by task, both on 1-7 scales (1 = Difficult, 7 = Easy), with the four quadrants labeled "Fix It Fast", "Promote It", "Big Opportunity", and "Don't Touch It"]
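A sketch of how the quadrant assignment could be automated. The mapping used here (e.g., expected easy but experienced hard means "Fix It Fast") is my reading of the usual version of this chart rather than something stated on the slide, and the task names and ratings are invented:

```python
# Hypothetical quadrant classifier for expectation-vs-experience ratings.
MID = 4  # midpoint of the 1-7 scale (1 = Difficult, 7 = Easy)

def quadrant(expected, experienced):
    if expected >= MID and experienced < MID:
        return "Fix It Fast"      # thought it would be easy; it wasn't
    if expected < MID and experienced >= MID:
        return "Promote It"       # pleasant surprise
    if expected < MID and experienced < MID:
        return "Big Opportunity"  # expected hard and it was hard
    return "Don't Touch It"       # expected easy and it was easy

tasks = {
    "Change address": (6.1, 2.8),
    "Transfer funds": (3.0, 5.9),
    "Rebalance portfolio": (2.4, 2.9),
    "Check balance": (6.5, 6.7),
}
for name, (before, after) in tasks.items():
    print(f"{name}: {quadrant(before, after)}")
```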
Ratings Can Help Identify "Disconnects"

[Chart: Percent Correct (0-100%) and Task Ease Rating (1-5, higher = better) for Tasks 1-4]

This "disconnect" between the accuracy and task ease ratings is worrisome – it indicates users didn't realize they were screwing up on Task 2!
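A disconnect check like this is easy to script once you have per-task accuracy and ease ratings. The 70% and 4.0 thresholds, and all the data, are invented for illustration:

```python
# Simple disconnect flag: tasks with low success but high perceived ease.
results = {  # task: (percent correct, mean ease rating on 1-5)
    "Task 1": (90, 4.5),
    "Task 2": (55, 4.6),  # users failed often yet felt successful
    "Task 3": (85, 4.2),
    "Task 4": (95, 4.8),
}

for task, (accuracy, ease) in results.items():
    if accuracy < 70 and ease >= 4.0:
        print(f"{task}: disconnect - {accuracy}% correct but rated {ease}/5")
```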
Ratings Can Help You Make Comparisons

[Chart: Frequency distribution of average SUS scores for 129 conditions from 50 studies, in bins of <=40, 41-50, 51-60, 61-70, 71-80, 81-90, and 91-100]

You can be very pleased if you get an average SUS score of 83 (which is the 94th percentile of this distribution). But you should be worried if you get an average SUS score of 48 (the 12th percentile).
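Given a set of study-level SUS averages, locating your own score is a one-liner. The norms list below is randomly generated as a stand-in for the real 129 conditions:

```python
# Locating an average SUS score within a set of study-level averages.
import random
from scipy.stats import percentileofscore

random.seed(7)
study_means = [random.gauss(66, 13) for _ in range(129)]  # fake norms

for my_score in (83, 48):
    pct = percentileofscore(study_means, my_score)
    print(f"Average SUS of {my_score} -> about the {pct:.0f}th percentile here")
```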
In Closing…

These slides, a bibliography of readings, and associated examples can be downloaded from:
http://www.measuringUX.com/
Feel free to contact us with questions!