
CHI '92, May 3-7, 1992

COMPARISON OF EMPIRICAL TESTING AND WALKTHROUGH METHODS IN USER INTERFACE EVALUATION

Clare-Marie Karat, Robert Campbell*, and Tama Fiegel**

IBM T. J. Watson Research Center, PO Box 704, Yorktown Heights, NY 10598

[email protected]

* Now at Department of Psychology, Clemson University, Clemson, SC 29634
** Now at Department of Psychology, New Mexico State University, Las Cruces, NM 88003

ABSTRACT
We investigated the relative effectiveness of empirical usability testing and individual and team walkthrough methods in identifying usability problems in two graphical user interface office systems. The findings were replicated across the two systems and show that the empirical testing condition identified the largest number of problems, and identified a significant number of relatively severe problems that were missed by the walkthrough conditions. Team walkthroughs achieved better results than individual walkthroughs in some areas. About a third of the significant usability problems identified were common across all methods. Cost-effectiveness data show that empirical testing required the same or less time to identify each problem when compared to walkthroughs.

KEYWORDS: Empirical testing, walkthroughs, problem severity, cost-effectiveness, scenarios

INTRODUCTION
Software development teams work within cost, schedule, personnel and technological constraints. In recent years, usability engineering methods appropriate to these constraints have evolved and become increasingly incorporated into software development cycles. Human factors practitioners currently rely on two types of techniques to evaluate representations of user interfaces: (1) empirical usability testing in laboratory or field settings; and (2) a variety of usability walkthrough methods. These latter methods have substantive differences and are referred to as pluralistic walkthroughs, heuristic evaluations, cognitive walkthroughs, think-aloud evaluations, and scenario-based and guideline-based reviews [1, 4, 10, 12, 16, 18, 19, 20, 21, 22, 23]. Empirical usability testing and walkthrough methods differ in the experimental controls employed in the former.

Human factors practitioners must make tradeoffs regarding time, cost, and human factors issues in selecting a usability engineering method to use in a particular development situation [15]. Use of walkthroughs has been encouraged by development cycle pressures and by the adoption of development goals of efficiency and user-centered design [1, 2, 5, 12, 20, 23]. Many questions remain about how walkthroughs compare to empirical methods of usability assessment, and when and how walkthroughs are most effective.

Questions About Empirical Testing Versus Walkthrough Methods
Usability problems. How do the two methods compare in the number of usability problems identified in a user interface? Is one method better than the other in identifying serious problems? How many of the problems are found by both methods and how many are found solely by one method?

Reliability of differences. If the methods differ in their effectiveness in identifying usability problems in user interfaces, do these differences persist across different systems? Or is the effectiveness of an evaluation method system dependent, based on the type of interface style and metaphor used in the interface?

Cost-effectiveness. What is the relative cost-effectiveness of the two techniques in identifying the usability problems in an interface?

Human factors involvement. What amount of human factors involvement is necessary in the use of the two techniques? What issues arise in analyzing and interpreting data?

Questions About Walkthrough Techniques
Individuals versus teams. Are walkthroughs more effective when conducted individually or in teams? Social psychology has documented that groups seldom perform up to the level of their best member [17]. One exception in this area is that groups do offer the possibility of more accurate judgments than individuals, especially when working on complex tasks [17]. The use of interaction-enhancing procedures may heighten group productivity as well [7].

Evaluator expertise. Are members of development teams and representative end users effective evaluators in walkthroughs, or should the evaluators be exclusively human factors or user interface (UI) specialists?

Prescribed tasks versus self-guided exploration. In a usability walkthrough with prescribed tasks, evaluators step through a representation of a user interface (e.g., paper specification, prototype) or the actual system while performing representative end user tasks. Other walkthrough procedures rely on self-guided exploration of the interface by evaluators who may or may not generate scenarios for that purpose. Do evaluators in walkthroughs think that one approach is more useful than the other in identifying usability problems?

Utility of guidelines. What is the role of usability heuristics or guidelines in usability walkthroughs? Are heuristics useful and necessary for experienced members of development teams?

Recent Data
Recent studies provide some data on these issues. Jeffries, Miller, Wharton, and Uyeda [10] compared the effectiveness of usability testing, guideline, heuristic, and cognitive walkthrough [16] methods in identifying user interface problems. The heuristic method diverged from the Nielsen and Molich [20] method as it was completed by UI specialists and did not include the use of written heuristics or guidelines. Prescribed task scenarios were employed in usability testing and cognitive walkthroughs. Results showed that the heuristic method identified the most usability problems and more of the serious problems, and did so at the lowest cost of the four techniques. Usability testing was generally the second-best method of the four in identifying problems. A number of questions arise: Was evaluator expertise the key component in the effectiveness of both the heuristic and usability testing methods? What was the amount of overlap in the problems identified by the four methods? What was the inter-rater reliability of problem identification?

Desurvire, Lawrence, and Atwood [4] compared the effectiveness of empirical usability testing and heuristic evaluations in identifying violations of usability guidelines. The heuristic method differed from others [10, 20] in that evaluators rated guidelines on bipolar scales for each of a set of tasks. Laboratory testing identified violations of six of ten relevant guidelines, while the combined results from the heuristic evaluations identified only one violation of a guideline. The heuristic ratings from UI experts and empirical usability test participants were predictive of laboratory user performance data. Non-UI experts' ratings were not predictive of performance. Heuristic ratings were effective in identifying tasks where problems would occur, but not the specific user interface problems themselves. UI experts' "best guess" predictions of performance were comparable to their heuristic ratings.

Bias [1] describes a systematic group evaluation procedure called the pluralistic usability walkthrough that includes end user, architect, design, developer, publication writer, and human factors representatives who complete scenario-driven walkthroughs of software prototypes. Human factors staff lead the group sessions. The design-test-redesign cycle is reduced to minutes through the use of low-technology prototypes to illustrate alternative designs, and through the presence and cooperation of individuals with the varied skills required to complete the work. This technique highlights the value of multidisciplinary activity in design [6] and group problem solving [7], and the iterative design possible within tight time constraints. A question arises about the group facilitation skills and procedures required for human factors engineers to achieve high group productivity and accurate judgments in the walkthroughs [7, 17].

Nielsen and Molich [20] tested heuristic evaluations completed individually by evaluators who were not human factors experts. In three of the four studies reported, evaluators received a lecture on nine usability heuristics and related reference material. No prescribed task scenarios were employed. Aggregates of data from five computer science students or industry computer professionals generally found about two-thirds of the problems that the authors had previously identified in the interfaces. While the study measured how hard it was to find problems, there was no measure of the severity of the identified usability problems.

There is some evidence that evaluators other than UIspecialists can carry out useful and successful think-aloud and heuristic walkthroughs [19, 20]. Also,

Jorgensen [12] and Wright and Monk [23] both provide evidence of the success of a think-aloud technique used by developers who had minimal training on the procedure and limited use of human factors resource. Developers observed users who thought aloud while working through tasks on the developers' systems. Walkthroughs were done by individuals in the former study and by teams in the latter. Questions arise about the numbers and types of problems that were not identified in these studies, and how data on problems were interpreted and analyzed.

Goals of the Study
Our study had three goals. The first goal of the study was to better understand the relationship between empirical testing and walkthrough results. This information would improve the understanding of the tradeoffs in selecting one rather than another in a particular situation. This goal included assessment of the number and severity of usability problems identified by the two methods and the resource required to identify them. The second goal was to determine whether the results regarding the relative effectiveness of empirical and walkthrough methods were reliable and would replicate across systems, or whether these results were system dependent. The third goal of the study was to understand how well walkthroughs work in user interface evaluation and how to improve their effectiveness. An effective walkthrough method would be one that identifies most usability problems in an interface, and especially the most severe usability problems. The role of individuals and teams, evaluator characteristics, scenarios, self-guided exploration, and usability heuristics in walkthroughs were explored as part of this third goal.


METHOD

Design
Three user interface evaluation methods were assessed: empirical usability test, individual usability walkthrough, and team usability walkthrough. The usability walkthrough procedure developed and used in this study included components to maximize effectiveness of usability problem identification, based on previous research. The walkthrough included separate segments for 1) self-guided exploration of a graphical user interface (GUI) office system, and 2) use of prescribed scenarios. The procedure utilized a set of 12 usability guidelines. Walkthroughs were conducted individually by six evaluators in one condition and by six pairs of evaluators in another condition. The evaluators in the team condition conversed with each other about issues and problems during the sessions. The evaluators were responsible for documenting the usability problems they identified in the walkthroughs.

The empirical usability test method also had separate segments for 1) self-guided exploration of a GUI system, and 2) use of prescribed scenarios. The six users in the usability tests were asked to describe usability problems they encountered, and problems were recorded by the human factors staff who were observing the sessions.

The usability problems identified through use of the three methods were categorized using common metrics. Thus data could be compared across methods on dimensions including number and severity of usability problems identified in the interface. The three methods were each applied to two competitive software systems in order to assess the reliability of the findings. The usability tests and walkthroughs were completed as if part of a realistic development schedule with resource constraints so that the data could provide practical information on usability engineering in product development.

Participants
Six separate groups of experienced GUI users participated in the study of the two systems. For each system, the empirical test and individual walkthrough each utilized six participants, and the team walkthrough utilized six pairs of participants. A total of 48 participants took part in the study. Participants were randomly assigned to methods; team members did not know each other. Participants had not previously used the GUI system they worked with in the study. The six groups were comparable based on background data gathered prior to usability sessions. Participants were predominantly end users and developers of GUI systems, along with a few UI specialists and software support staff. Most of the participants had advanced educational degrees; used computers in home, work, and school settings; and had used a variety of computers, operating systems, and applications. They used computers approximately 20 hours a week, including over 10 hours a week on GUI systems. Except for more formal education, the participants were typical of those who would participate in usability walkthroughs and empirical testing of GUI systems in product development.

Materials
The two systems selected for the study were commercially available GUI office environments with integrated text, spreadsheet, and graphics applications. They will be referred to as Systems 1 and 2. The two systems differed substantially in the type of interface style and office metaphor presented.

Human factors staff consulted with end users and developed a set of nine generic task scenarios to be used with both systems. These scenarios were representative of typical office tasks involving text, spreadsheet, and graphics applications, and use of the system environment. The tasks included a range of one to thirteen subtasks and covered document creation, moving and copying within and between documents, linking and updating documents, drawing, printing, interface customization, finding and backing up documents, and use of system-provided and user-generated macros.

A two-page document of guidelines was developed for the evaluators in the walkthrough conditions. The document told evaluators their assignment was to identify usability problems with the interface initially by exploring the interface on their own and then by walking through typical tasks provided to them. A usability problem was defined as anything that interfered with a user's ability to efficiently and effectively complete tasks. Evaluators were asked to keep in mind the guidelines about what makes a system usable and to refer back to them as necessary. Following the precepts of minimalism [3], the document provided brief definitions and task-oriented examples of twelve guidelines. These guidelines were compiled from heuristics used by Nielsen and Molich [20], the ISO working paper on general dialogue principles [9], and the IBM CUA user interface design principles [8]. The twelve usability guidelines included:

● Use a simple and natural dialog,
● Provide an intuitive visual layout,
● Speak the user's language,
● Minimize the user's memory load,
● Be consistent,
● Provide feedback,
● Provide clearly marked exits,
● Provide shortcuts,
● Provide good help,
● Allow user customization,
● Minimize the use and effects of modes, and
● Support input device continuity.

A usability problem description form was developed foruse by the walkthrough evaluators. The form instructedthe evaluators to briefly describe each problem andthen rate its impact on end user task completion.

Procedure
The authors completed all usability engineering work in the study and became familiar with each GUI system prior to commencement of the usability sessions. All sessions were completed in a usability studio in Hawthorne, NY. The GUI systems were set up on an


IBM PS/2 Model 80 with an 8514 display and a printer. Usability sessions for all methods each took about three hours. The first half of the usability sessions included an introduction by the usability engineer who administered the sessions, and a self-guided exploration of the system by the participants. During the self-guided exploration, participants could go through on-line tutorials, read any of the hard copy documentation shipped with the system, use and modify example documents created using the different applications and system functions, or create new application documents and experiment with system functions. In the second half of the sessions, participants worked through a set of nine typical tasks presented in random order and completed a debriefing questionnaire given by the administrator.

The empirical testing and walkthrough sessions differed in human factors involvement in the session and in how usability problems were documented. In empirical testing, two usability engineers administered each session in its entirety with an individual user. One person in the control studio interacted with users (who were in the usability studio and described usability problems they encountered during sessions), controlled the videotape equipment, and observed usability problems. The second person in the control studio logged user comments, usability problems, time on task, and task success or failure.

Usability staff involvement in the walkthrough sessions was limited to test the resource requirements of the method. One administrator was available on-call during the session in case of unexpected events. A few sample sessions were videotaped and observed by human factors staff; no session logging occurred. One administrator introduced the session and instructed walkthrough evaluators in the use of the guidelines for usability walkthrough document and the usability problem description forms. The administrator emphasized in both individual and two-person team walkthrough conditions that the problem identification sheets were the deliverable for the session. In the team conditions, evaluators were given additional instructions to help each other by providing relevant information [7]. They were told that if either one of the team members thought something was a usability problem, they should record it. Also, team walkthrough evaluators were instructed to take turns with the mouse and with recording usability problems so that each person had direct experience with the interface and with the usability problem description forms. Evaluators read the guidelines document and the administrator then left the studio. After the self-guided exploration phase, the administrator returned briefly to present the task scenarios and emphasize that it was more important to identify usability problems than to complete all the tasks.

RESULTS

Data Analysis
The usability problems recorded during empirical testing and the usability problem descriptions documented during the walkthroughs were classified by the usability engineers using a generic model of usability problems that evolved during the course of the study. The classification completed a content analysis of the problems and prepared the data for subsequent problem severity ratings. The hierarchical model consisted of a total of 47 categories of potential user interface problem areas. Because of the functional differences between the systems, all 47 categories applied to System 1 while only 43 applied to System 2. Subcategories were created when a main category had several problems that were related, yet addressed different aspects of the higher-level category. For example, the fifth main category was Move & Copy and it had two subcategories: 1) Clipboard and 2) Direct Manipulation. We distinguished between problems that were pervasive through the environment and those that were application specific. For example, we classified icon complaints (e.g., cannot understand icon meaning, icons hard to read, no icon status information or it is hard to distinguish) as pervasive office-level problems, while confusion about specific spreadsheet functions (e.g., how does the sum function work?) were regarded as application specific.
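As an illustration of this classification scheme, the sketch below shows one way the hierarchical problem model could be represented in Python. Only Move & Copy, its Clipboard and Direct Manipulation subcategories, and the icon and spreadsheet examples come from the description above; the remaining names and the scope labels are illustrative assumptions, since the full set of 47 categories is not listed here.

    # A minimal sketch of the generic problem model, not the authors' actual scheme.
    problem_model = {
        "Move & Copy": {                    # fifth main category
            "scope": "office-level",        # pervasive through the environment
            "subcategories": ["Clipboard", "Direct Manipulation"],
        },
        "Icons": {                          # e.g., icon meaning hard to understand
            "scope": "office-level",
            "subcategories": [],
        },
        "Spreadsheet Functions": {          # e.g., how does the sum function work?
            "scope": "application-specific",
            "subcategories": [],
        },
        # ... remaining categories (47 for System 1, 43 for System 2) omitted
    }

    # Example lookup: icon complaints are treated as pervasive office-level problems.
    print(problem_model["Icons"]["scope"])  # -> office-level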

Item classification was discussed until consensus was reached about its placement in the model. To assess inter-rater reliability, 50 problem statements were randomly selected from the data for the three conditions and classified by two usability engineers who had not observed the participants or been involved in data analysis. They each classified the data using the generic model of usability problems. The inter-rater reliability scores between the third-party usability staff and the staff involved in the study were 87% for the empirical testing data, 70% for the individual walkthrough data, and 71% for the team walkthrough data. For each empirical testing and walkthrough group, data were analyzed regarding the number of usability problem tokens (all instances), usability problem types (instances minus all duplicates), and problem areas (higher level categories of problem tokens and types, e.g., Move and Copy) in the generic model. These problem areas were assigned Problem Severity Classification (PSC) ratings.
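The reliability scores above are percent-agreement figures; the short Python sketch below illustrates the kind of calculation involved. It assumes simple percent agreement (and the category labels shown are made up), since no chance-corrected statistic is described.

    def percent_agreement(labels_a, labels_b):
        """Percentage of items that two raters assigned to the same category."""
        if len(labels_a) != len(labels_b):
            raise ValueError("Both raters must classify the same set of items.")
        matches = sum(a == b for a, b in zip(labels_a, labels_b))
        return 100.0 * matches / len(labels_a)

    # Hypothetical example: two raters classify four of the 50 sampled statements.
    rater_1 = ["Move & Copy / Clipboard", "Icons", "Printing", "Icons"]
    rater_2 = ["Move & Copy / Clipboard", "Icons", "Help", "Icons"]
    print(percent_agreement(rater_1, rater_2))  # -> 75.0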

A version of the PSC measure, which is used in IBM, was employed in the study. It provides a ranking of usability problems by severity that can be used to determine allocation of resources for addressing user interface problems (see Table 1). PSC ratings are computed on a two-dimensional scale, where one axis represents the impact of a usability problem on end user ability to complete a task, and the other represents frequency (the percentage of end users who experience the problem). Categories of the impact dimension (high, moderate, low) and frequency dimension (high, moderate) were combined to form an index of PSC ratings that ranged from 1-3, where 1 is most severe. High impact was defined as a problem that prevented the user from completing the task, moderate impact represented significant problems in task completion, and low impact represented minor problems and inefficiencies. Given the small sample sizes in the


conditions, moderate frequency was defined as 2 (33%) users, evaluators, or evaluator teams; and high frequency was defined as three (50%) or more of them. For example, if three or more of the six evaluators (high frequency) reported a problem that caused significant difficulty (moderate impact) in completing a task, a PSC rating of "1" would be assigned to the problem area.

                        Frequency (Percentage of Users)
Impact on Task            High      Moderate
  High                      1          1
  Moderate                  1          2
  Low                       2          3

Table 1. Problem Severity Classification rating matrix.

We generated PSC ratings for each of the categories in the generic model of user interface problems. There were 47 PSC ratings for System 1, and 43 for System 2. To generate PSC ratings in the empirical conditions, the human factors staff calculated the frequency of users experiencing a problem and assigned an impact score. Disagreements about impact scores were discussed until consensus was reached. In the walkthrough conditions, human factors staff calculated the frequency data and averaged the impact scores provided by the evaluators. Problem areas that did not have problem tokens from at least two participants or teams were assigned a PSC rating of 99 (i.e., no action required). Problem areas with PSC ratings of 1-3 were called significant problem areas (SPAs), and those with ratings of 99 were called "no action" areas.
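A compact way to see how the impact and frequency dimensions combine is sketched below in Python. The function is an illustrative reading of the procedure just described and of Table 1, not the analysis tooling actually used in the study.

    def psc_rating(impact: str, n_reporting: int) -> int:
        """Assign a Problem Severity Classification rating to a problem area.

        impact      -- "high", "moderate", or "low" impact on task completion
        n_reporting -- number of the six users, evaluators, or teams reporting it
        Returns 1 (most severe) through 3, or 99 for "no action required".
        """
        if n_reporting < 2:           # tokens from fewer than two sources
            return 99
        frequency = "high" if n_reporting >= 3 else "moderate"
        matrix = {                    # Table 1: impact x frequency -> PSC rating
            ("high", "high"): 1,      ("high", "moderate"): 1,
            ("moderate", "high"): 1,  ("moderate", "moderate"): 2,
            ("low", "high"): 2,       ("low", "moderate"): 3,
        }
        return matrix[(impact, frequency)]

    # Example from the text: moderate impact reported by three or more evaluators.
    print(psc_rating("moderate", 3))  # -> 1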

For the walkthrough conditions, evaluator questionnaire data on aspects of the walkthrough procedure were collected during the debriefing sessions and analyzed. For the empirical conditions, data on time on task, completion rates, and the debriefing questionnaire were collected but are not reported here.

Empirical Testing and Walkthrough Results
For Systems 1 and 2, empirical usability testing identified the largest number of usability problem tokens (all instances), followed by team walkthrough and then individual walkthrough (see Table 2).

                     Empirical      Team    Individual
                          Test      Walk          Walk
System 1
  Problem Tokens           421       115            78
  Problem Types            159        68            49
System 2
  Problem Tokens           401       107            64
  Problem Types            130        54            39

Table 2. Total identified usability problems.

For both systems, the total number of usability problem tokens found by empirical testing was about four times the total number of problems identified by team walkthroughs, and about five times the total number found by individual walkthroughs. The difference in the distribution across the groups of the total number of tokens found was statistically significant for each system at the p < .01 level according to chi-square tests.

Empirical testing also identified the largest number of usability problem types (instances minus duplicates), followed by team and individual walkthroughs. For both systems, the total number of usability problem types found by empirical testing was about twice the total number found by the team walkthroughs, and three times the total number found by individual walkthroughs. Again, the difference in the distribution of the total number of problem types found in the three groups was statistically significant for each system at the p < .01 level.
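The chi-square tests are reported without computational detail; the sketch below shows one plausible reading, a goodness-of-fit test of the System 1 token counts against an even split across the three methods, using scipy. Both the library and the choice of expected distribution are assumptions made for illustration.

    from scipy.stats import chisquare

    # System 1 problem tokens from Table 2: empirical test, team walk, individual walk.
    observed = [421, 115, 78]

    # Goodness-of-fit against an equal split across the three methods.
    result = chisquare(observed)
    print(result.statistic, result.pvalue)  # p is far below .01, consistent with the text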

The data on PSC ratings assigned to problem areas for Systems 1 and 2 are presented in Table 3. For both systems, empirical testing identified a larger total number of significant problem areas (Total SPAs) assigned PSC ratings of 1-3 than did either team or individual walkthroughs. However, the variation in total number of SPAs across methods was statistically significant for System 1 (p < .01) but not for System 2. For both systems, there was no bias or tendency towards more severe ratings (i.e., more PSC 1s versus 2s) in one group as compared to another.

                        Empirical      Team    Individual
                             Test      Walk          Walk
System 1
  PSC 1                         -         -             -
  PSC 2                         -        11             -
  PSC 3                         -         -             -
  Total SPAs                   40        23            18
  No Action Areas               7        24            29
  Total Problem Areas          47        47            47
System 2
  PSC 1                        10         -             6
  PSC 2                        15         -            10
  PSC 3                         2         1             1
  Total SPAs                   27        14            17
  No Action Areas              16        29            26
  Total Problem Areas          43        43            43

Table 3. PSC ratings of usability problem areas.

Table 4 provides information on the number of unique usability problem areas identified by each of the methods. A problem area that is unique to a method is a SPA that is identified by only one method. For Systems 1 and 2, empirical testing identified the largest number of unique problem areas. For both systems, two-thirds or more of these unique problem areas were assigned a PSC rating of 2, representing relatively important problems in the user interfaces.

A common usability problem area across methods occurred when a SPA was identified by all three methods. For example, a common problem area was the basic model for linking, where empirical testing and team walkthroughs generated a SPA with a PSC rating of 1 and individual walkthrough generated a SPA with a


PSC rating of 2. To analyze the proportion of problem areas that were common, the total number of SPAs for each system was computed. The total number of SPAs identified for System 1 was 41 (40 identified by empirical testing plus 1 unique problem area identified by team walkthrough). The total number of SPAs identified for System 2 was 29 (27 identified by empirical testing plus 2 unique problem areas found by individual walkthrough). Regarding common problem areas for System 1, 13 of the total number of 41 SPAs (32% of total) were common across the three techniques. For System 2, 10 of the 29 SPAs (35% of the total) were common across techniques.

              Empirical      Team    Individual
                   Test      Walk          Walk
System 1             13         1             0
System 2              8         0             2

Table 4. Unique usability problem areas.
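The unique and common counts, and the 32% and 35% overlap figures, follow from simple set operations over the SPAs each method identified. A small sketch with hypothetical problem-area names follows; only the counts quoted in the comments come from the study.

    # Hypothetical SPA identifiers; the real System 1 sets had 40, 23, and 18 members.
    empirical = {"linking", "icons", "printing", "macros"}
    team = {"linking", "icons", "clipboard"}
    individual = {"linking", "icons"}

    all_spas = empirical | team | individual        # union of SPAs across methods
    common = empirical & team & individual          # identified by all three methods
    unique_to_team = team - empirical - individual  # identified only by team walkthrough

    print(len(common), len(all_spas))               # -> 2 5 for these toy sets
    # In the study, 13 of 41 SPAs (32%) were common for System 1, and 10 of 29 (35%) for System 2.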

Walkthrough Results
Additional analysis of the effectiveness of team as compared to individual walkthroughs was conducted by studying the total number of problem tokens found by each individual walkthrough evaluator or walkthrough evaluator team. This analysis showed that teams found more problem tokens than did individual walkthrough evaluators for each system (p < .01 according to t-tests). For System 1, the average number of problems identified by the walkthrough teams was 19 while the average for individual walkthrough evaluators was 13. For System 2, these values were 18 for team and 11 for individual walkthroughs respectively. However, as shown in Tables 2 and 3 above, while more problem tokens and types were identified by team walkthrough conditions, the total number of SPAs identified was similar for both team and individual walkthroughs, and the pattern held across systems.

During the debriefing, evaluators rated the relative usefulness of scenarios as compared to self-guided exploration in identifying usability problems in the systems. The evaluators used a 5-point scale where a score of 1 was the most positive response for use of scenarios. All walkthrough groups favored the use of scenarios over self-guided exploration; the average score across systems was 1.8. Evaluators were also asked about the added value of using the guidelines during the walkthrough. A 5-point scale was again used and a score of 1 was the most positive response for guidelines. For both systems, the walkthrough evaluators thought the guidelines were of limited added value to them in identifying usability problems; the average score across systems was 3.9. The evaluators said they thought the brief document was very effective in explaining and giving examples of the guidelines, and that they would not change the format. They stated that because of their experience with GUI and other systems, they were already familiar with the guideline concepts, but that less experienced users would find it very useful. It was noted that almost all evaluators tried to take the guideline document with them at the end of the session. When asked about this, they said they were very pleased with it and wanted to keep it for reference.

Cost-Effectiveness Data
Table 5 shows the cost-effectiveness data for the three methods on the two systems. This analysis includes the time required by human factors staff and participants; no laboratory facility costs are included. Human factors time includes preparation of all materials (35-45 hours across methods), administration of sessions (10-55 hours), and data analysis (16-50 hours). Time to analyze the data using the generic model of problem areas and the PSC matrix are included for all groups. As expected, the total hours required (human factors plus participant) for a method was highest for empirical testing for System 1 and System 2.

                        Empirical      Team    Individual
                             Test      Walk          Walk
System 1
  HF staff hours              136        70            70
  Participant hours            24        48            24
  Total hours                 160       118            94
  Problem Types               159        68            49
  Hours/Type                  1.0       1.7           1.9
  SPAs                         40        23            18
  Hours/SPA                   4.0       5.1           5.2
System 2
  HF staff hours              116        77            76
  Participant hours            24        48            24
  Total hours                 140       125           100
  Problem Types               130        54            39
  Hours/Type                  1.1       2.3           2.6
  SPAs                         27        14            17
  Hours/SPA                   5.2       8.9           5.9

Table 5. Cost-effectiveness data for the three methods.
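The per-problem figures in Table 5 are ratios of total hours to problems found; a short sketch using the System 1 empirical test numbers follows.

    def hours_per_problem(hf_hours: float, participant_hours: float, n_problems: int) -> float:
        """Total evaluation hours divided by the number of problems identified."""
        return (hf_hours + participant_hours) / n_problems

    # System 1, empirical test (Table 5): 136 HF staff hours + 24 participant hours.
    print(round(hours_per_problem(136, 24, 159), 1))  # per problem type -> 1.0
    print(round(hours_per_problem(136, 24, 40), 1))   # per significant problem area -> 4.0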

However, empirical testing needed only about half as much time as the walkthroughs to find each usability problem type. For System 1, the hours required to identify each significant problem area were fairly similar across techniques. For System 2, the resource required for team walkthrough was higher than for both empirical testing and individual walkthrough.

DISCUSSION
The findings regarding the relative effectiveness of empirical testing and walkthrough methods were generally replicated across the two GUI systems. It is not clear whether these patterns would be replicated on non-GUI systems; however, the significant differences in the style and presentation of the two GUI systems in the study support the reliability of the results across these types of systems.

The empirical testing condition identified the largest number of problems, and identified a significant number of relatively severe problems that were missed by the walkthrough conditions. These data are consistent with Desurvire et al. [4] data and at odds with Jeffries et al. [10] data. The difference between our data and


those of Jeffries et al. [10] in the number of problems found might be explained by the difference in evaluator expertise, but Desurvire et al. [4] also utilized UI experts, and their data are consistent with ours. All three studies do provide strong support for the value of UI expertise though. We recognize that the basis of our empirical usability testing results was the experimental controls employed, the skills required to conduct the test, experience with the two GUI systems prior to observing test sessions, and the UI expertise required to recognize and interpret the usability problems encountered by the users. Our data suggest that this type of empirical usability testing should be employed for baseline and other key checkpoint tests in the development cycle where coverage of the interface and identification of all significant problems is essential. Walkthroughs of the type in this study are a good alternative when resources are very limited [19] and may be the preferred method early in the development cycle for deciding between alternative designs for particular features. In Jeffries et al. [10] the heuristic method found a larger number of severe problems than usability testing. This might be explained partially by the differences in procedures. Data for the methods in our study were collected across a three-hour time period. In the Jeffries et al. [10] study, the UI experts in the heuristic condition documented the problems they found over a two-week period.

About a third of the significant problem areas identified were common across all methods. The degree of overlap is encouraging, but it should caution human factors practitioners about the tradeoffs they are making in employing one method rather than another. These methods are complementary and yield different results; they act as different types of sieves in identifying usability problems. Jeffries [11] stated that there was less overlap in problems found by any two of the methods in their study [10] than in ours. The higher degree of overlap in our study might be partially due to the fact that all methods used the same scenarios. These scenarios were rich and complex examples of typical work that end users need to perform and may have greatly aided in the evaluation of the systems by all methods. The overlap between the two methods that used their scenarios in Jeffries et al. [10] was no higher than that between the others though, and may reflect differences in the two sets of scenarios.

Team walkthroughs achieved better results than individual walkthroughs in some areas. The fact that any differences emerged between the team and individual walkthroughs is encouraging. The brief period of the usability session, the lack of an established working relationship between team members, and the small size of the teams may have contributed to the small differences found. Many usability walkthroughs in product development are done by moderate-sized teams (e.g., 6-8 people) because of the wide range of skills and backgrounds necessary to identify and then resolve usability problems. Therefore, due to practical and organizational considerations, team walkthroughs may be an area warranting future research. Work by Bias [1] and Hackman and Morris [7] may help identify ways to facilitate and enhance the performance of team walkthroughs.

All walkthrough groups favored the use of scenarios over self-guided exploration in identifying usability problems. This evidence supports the use of a set of rich scenarios developed in consultation with end users. And as evaluation work attempts to predict what will occur in real world settings, the use of well-founded scenarios can provide some assurance that real world problems will be identified.

The evaluators, who were all experienced GUI users, generally thought that the guidelines for usability were of limited added value to them in conducting the walkthroughs, but would be helpful for less experienced users. These data are consistent with the Desurvire et al. [4] results showing no difference between UI experts' heuristic and "best guess" ratings. Guidelines may serve to promote consensus about usability goals for development teams, and may be more useful for less experienced evaluators during walkthroughs.

The results also demonstrate that evaluators who have relevant computer experience and represent a sample of end users and development team members can complete usability walkthroughs with relative success. Specific UI or human factors expertise may be very helpful but is not required, and there are a multitude of practical, individual, end user, organizational, and product benefits to be achieved by involving more members of development teams in walkthroughs [5, 12, 19, 20].

Cost-effectiveness data show that empirical testing required the same or less time to identify each problem as compared to walkthroughs. The differences between these data and the cost-benefit data for usability test and heuristic methods in Jeffries et al. [10] may be due to the differences in the walkthrough procedures utilized and in the type of data analysis performed in the two studies. If our walkthrough data had been analyzed in other ways or by different individuals (e.g., developers), the resource required might have varied significantly. However, the resource and skills applied to data analysis may be reflected in the quality of the analysis and the resulting changes to systems. Ultimately, the true cost-benefit of these methods will be realized through their ability to facilitate the achievement of usability objectives for systems in iterative development, and to provide measurable benefits that exceed the costs of their use [13, 14, 15]. Analysis of data from one iteration in isolation is of limited utility.

The identification of usability problems is not an end in itself. Rather, it is a means towards eliminating problems and improving the interface. The part of the development process concerned with making recommendations for change based on the usability problems identified is not covered in this study. We did find that the larger the number of problem tokens and types identified regarding a significant problem area of the interface, the richer the source of data was for forming recommendations for changes to improve that portion


of the interface. The data from this study show that the empirical and team walkthrough conditions have the advantage over the individual walkthrough conditions in this area. The empirical test data contained four times as many problem tokens describing a significant problem area and providing context about it as compared to team walkthrough data, and teams produced 33-50% more information as compared to individual walkthroughs. The quality of the data analysis completed and the recommendations that arise from them are issues for future research.

How could walkthrough methods be improved? Users in the empirical testing sessions were given opportunities to provide recommendations for changes to the usability problems they encountered, and the walkthrough sessions could be improved to capture evaluator recommendations as well. Another area of walkthrough procedures that needs attention is the difficulty of interpreting problems. Walkthrough evaluators used different language than the staff who analyzed the problem reports, and the data analysts' job of understanding these problem statements was made more difficult by a lack of context and lack of session observation. The difficulty experienced in interpreting walkthrough data was supported by the lower inter-rater reliability data reported for walkthroughs compared to usability tests. Moreover, from walkthrough sessions that human factors staff observed, it became evident that evaluators misattributed the sources of problems. We also observed that evaluators sometimes became so involved in the task scenarios that they forgot to document problems they encountered and identified. We attempted to overcome this demand characteristic of the walkthroughs by emphasizing the importance of problem identification over task completion, but it was not effective in some cases, and further refinement of intervention strategies should be explored [7]. A better debriefing of evaluators that included reviewing identified usability problems, capturing undocumented ones that evaluators mentioned in passing, and collecting evaluator recommendations for changes might improve walkthrough effectiveness.

ACKNOWLEDGEMENTS
The authors thank Amy Aaronson, Catalina Danis, Tom Dayton, John Gould, Robin Jeffries, John Karat, Wendy Kellogg, Jakob Nielsen, John Richards, Kevin Singley, Cathleen Wharton, and the anonymous reviewers for comments on earlier versions of this paper.

REFERENCES

1. Bias, R.G. Walkthroughs: Efficient collaborative testing. IEEE Software, 8, 5, 1991, pp. 94-95.

2. Bellotti, V. Implications of current design practice for the use of HCI techniques. In Jones, D.M. and Winder, R., (Eds.), People and Computers IV. Cambridge University Press, Cambridge, 1988, pp. 13-34.

3. Carroll, J., Smith-Kerker, P.L., Ford, J.R., and Mazur-Rimetz, S.A. The minimal manual. Human-Computer Interaction, 3, 1987-88, pp. 123-153.

4. Desurvire, H., Lawrence, D., and Atwood, M. Empiricism versus judgment: Comparing user interface evaluation methods on a new telephone-based interface. SIGCHI Bulletin, 23, 4, pp. 58-59.

5. Gould, J.D., and Lewis, C. Designing for usability: Key principles and what designers think. Communications of the ACM, 28, 1985, pp. 300-311.

6. Grudin, J. and Poltrock, S.E. User interface design in large corporations: Coordination and communication across disciplines. In Proceedings of CHI'89 (Austin, TX, April 30-May 4, 1989), ACM, New York, pp. 197-203.

7. Hackman, J.R., and Morris, C.G. Group tasks, group interaction process, and group performance effectiveness: A review and proposed integration. In Berkowitz, L., (Ed.), Advances in Experimental Social Psychology, Vol. 8, Academic Press, New York, 1975.

8. International Business Machines Corporation. Systems Application Architecture, Common User Access, Guide to User Interface Design (SC34-4289), 1991.

9. International Standards Organization. Working Paper of ISO 9241 Part 10, Dialogue Principles, Version 2, ISO/TC159/SC4/WG5 N155, 1990.

10. Jeffries, R.J., Miller, J.R., Wharton, C., and Uyeda, K.M. User interface evaluation in the real world: A comparison of four techniques. In Proceedings of CHI'91 (New Orleans, LA, April 28-May 3, 1991), ACM, New York, pp. 119-124.

11. Jeffries, R.J. Personal communication, Sept. 1991.

12. Jorgensen, A.K. Thinking-aloud in user interface design: A method promoting cognitive ergonomics. Ergonomics, 33, 4, 1990, pp. 501-507.

13. Karat, C. Cost-justifying human factors support on development projects. Human Factors Society Bulletin, 35, 3, 1992, pp. 1-4.

14. Karat, C. Cost-benefit and business case analysis of usability engineering. ACM SIGCHI Conference on Human Factors in Computing Systems, New Orleans, LA, April 28-May 2, Tutorial Notes.

15. Karat, C. Cost-benefit analysis of usability engineering techniques. In Proceedings of the Human Factors Society (Orlando, FL, Oct. 1990), pp. 839-843.

16. Lewis, C., Polson, P., Wharton, C., and Rieman, J. Testing a walkthrough methodology for theory-based design of walk-up-and-use interfaces. In Proceedings of CHI'90 (Seattle, WA, April 1-5, 1990), ACM, New York, pp. 235-242.

17. McGrath, J.E. Groups: Interaction and Performance. Prentice-Hall, Englewood Cliffs, NJ, 1984.

18. Nielsen, J. Finding usability problems through heuristic evaluation. In Proceedings of CHI'92 (Monterey, CA, May 3-7, 1992), ACM, New York.

19. Nielsen, J. Usability engineering at a discount. In Salvendy, G., and Smith, M.J., (Eds.), Designing and Using Human-Computer Interfaces and Knowledge-Based Systems. Elsevier Science Publishers, Amsterdam, 1989, pp. 394-401.

20. Nielsen, J. and Molich, R. Heuristic evaluation of user interfaces. In Proceedings of CHI'90 (Seattle, WA, April 1-5, 1990), ACM, New York, pp. 249-256.

21. Whitten, N. Managing Software Development Projects: Formula for Success. Wiley and Sons, New York, 1990, pp. 203-223.

22. Wharton, C., Bradford, J., Jeffries, R., and Franzke, M. Applying cognitive walkthroughs to more complex user interfaces: Experiences, issues, and recommendations. In Proceedings of CHI'92 (Monterey, CA, May 3-7, 1992), ACM, New York.

23. Wright, P.C., and Monk, A.F. The use of think-aloud evaluation methods in design. SIGCHI Bulletin, 23, 1, 1991, pp. 55-57.
