
A Survey of Human-Performance Modeling Techniques for Usability

A project document for the course

Methods of SW Development: Deciding what to Design

Uri Dekel ISRI, Carnegie Mellon University

DRAFT 11/30/04


Table of Contents

Introduction
1. Evolution of models
   1.1 Stimulus-Response-Controller models
      1.1.1 Information theory and Shannon's definition of Entropy
      1.1.2 Shannon's definition of Channel Capacity
      1.1.3 Fitts' law
      1.1.4 The Hick-Hyman law of choice
   1.2 Information processing models
      1.2.1 General Human Information Processing Models
      1.2.2 The Model Human Processor
2. GOMS
   2.1 Principles of GOMS
   2.2 GOMS Variants
      2.2.1 Keystroke-Level Model
      2.2.2 CMN-GOMS
      2.2.3 CPM-GOMS
      2.2.4 NGOMSL
   2.3 Picking a GOMS variant
   2.4 Limitations of GOMS
3. The future of GOMS
   3.1 GOMS tool support
      3.1.1 Quick and Dirty GOMS
      3.1.2 Deriving KLM models
      3.1.3 GLEAN - Tool support for NGOMSL
      3.1.4 Automatic CPM-GOMS
4. Conclusions
Annotated Bibliography


Introduction

The effectiveness, success, and overall user experience of interactive software are greatly affected by the design of the interface between man and machine. Certain designs can enhance productivity, whereas others can make use impractical or even cause dangerous situations. A design must consider physiological and sensory handicaps, principles of perception, and limitations of motor coordination [22].

General rules of thumb are not always sufficient for designing interfaces that maximize user efficiency. Consider, for example, a call center where employees need to handle a large number of short calls [13]. Small differences in the time it takes an employee to accomplish certain tasks, such as accessing a client's account at the beginning of each call, accumulate and can make a huge difference in the overall operating costs of the call center.

As computers, networks, and software become faster, the limiting factor is no longer the operating speed of the hardware but rather the human, who does not follow Moore's law. Even if the computer can respond instantaneously to any user input or command, the action still takes time because the user has to formulate a plan, move the mouse, click a button, and so on. The goal of designers is to create interfaces that minimize the time for completing human-dependent activities. To this end, a quantitative model of how humans work is necessary. Engineers need ways to predict the usability of an interface before it is implemented, or at least without time-consuming user tests. For example, Nielsen [31] argues the following:

A Holy Grail for many usability scientists is the invention of analytic methods that would allow designers to predict the usability of a user interface before it has even been tested. Not only would such a method save us from user testing, it would allow for precise estimates of the trade-offs between different design solutions without having to build them. The only thing that would be better would be a generative theory of usability that could design the user interface based on a description of the usability goals to be achieved.

In the past 50 years, research in cognitive modeling has led to a line of models of human performance, starting with simple rules for short movements or simple choices and leading up to sophisticated higher-level models such as GOMS and its variants. Such models may be a step on the road to what Nielsen described as a "Holy Grail".

The primary goal of this report is to introduce the reader to this evolution of models, so that a reader studying the GOMS model will have the necessary foundations and understanding of the theoretical principles behind it. The second part of this report then briefly discusses GOMS (which is covered at an introductory level in other sources, esp. [22]), its variants, and upcoming models. We also try to provide software practitioners with suggestions and indications for selecting models.

It is important to note that our discussion is focused on modeling the way humans work and deriving time estimates, rather than attempting to predict their intent. The fields of artificial intelligence and intelligent user interfaces provide a variety of techniques for adapting interfaces to the perceived intent of the user, but these topics are outside the scope of this report.


1. Evolution of models

1.1 Stimulus-Response-Controller models

Early models, referred to as stimulus-response-controller models, focused on the short perceptual and motor activities that characterized early interactive systems. A human was treated as an information processor that responds to simple stimuli by carrying out a certain motor behavior. To develop such models, experimental psychologists in the 1950s used information theory and analog signal processing techniques to understand the perceptual, cognitive, and motor processes of humans. An important example is Fitts' law [11]. Before we discuss this law, we begin with a short introduction to information theory and Shannon's formulation of channel capacity [32], upon which Fitts' law is based.

1.1.1 Information theory and Shannon's definition of Entropy

Let us consider a binary message of a certain length in bits. If it contains pure information, that is, if all of its bits are used to represent a value, then every possible value is equally probable: we cannot guess what the value will be, or even have a bias. We say that the message has high entropy. However, if we add redundancy, such as duplicating every bit to protect against noisy channels, only half the bits convey information. We can then predict that valid messages consist of a sequence of identical pairs of bits, so the message has lower entropy. Naturally, there is a tradeoff between maximizing the amount of information and protecting against errors.

Entropy has many other applications in communication theory, such as encryption. If we know that a certain message consists of natural-language text, then certain letters have a higher probability of appearing, and thus the entropy of the message is lower. A simple additive encryption scheme, where a constant is added to each letter, is not a good way to encrypt text since it preserves these probability differences.

Formally, the entropy of a discrete random event x with n possible states is H(x) = -\sum_{i=1}^{n} p_i \log_2 p_i. For example, in a single random coin toss (a Bernoulli trial) the entropy is 1 bit for a fair coin, where both outcomes are equally likely, but it approaches zero as the probabilities diverge for a biased coin.

To further illustrate the idea of the entropy of an event, let us consider an e-commerce site which provides services mainly to domestic customers, but has several international customers as well. Let us assume that there are N countries in addition to the US, that the probability that a particular customer is domestic is p, and that customers not from the US are equally distributed among these N countries. If we consider a customer's interaction with the site as an event, and the customer's country of residence as a state, what is the entropy of this country selection?

The figure on the left plots the entropy when there are only two other countries in addition to the US. When no customers are from the US (probability of 0), the customers are split evenly between the remaining two countries, like the toss of a fair coin, and the entropy is thus 1. If all customers are from the US, then there is no randomness and thus no entropy. The entropy is maximized when the probability of a customer being from the US is one third, because then the probability of each of the three countries is equal. The same phenomenon can be seen when the number of countries is increased, as shown on the right.
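To make the definition concrete, here is a minimal Python sketch of the entropy computation for this hypothetical customer-location example; the function names and the specific probabilities are illustrative only.

    import math

    def entropy(probs):
        """Shannon entropy H = -sum(p_i * log2(p_i)); zero-probability states are ignored."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def customer_location_entropy(p_us, n_other):
        """Entropy of the customer-location event: probability p_us for the US,
        the remainder split evenly among n_other other countries."""
        rest = (1 - p_us) / n_other
        return entropy([p_us] + [rest] * n_other)

    print(entropy([0.5, 0.5]))                  # fair coin: 1.0 bit
    print(customer_location_entropy(1/3, 2))    # maximum for N=2: log2(3), about 1.585 bits
    print(customer_location_entropy(0.9, 10))   # mostly domestic traffic: low entropy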

1.1.2 Shannon's definition of Channel Capacity

In a noiseless channel, if r messages per second can be sent with an entropy of H bits per message, then the information rate is R = rH bits per second (bps). Of course, in reality there is noise in the channel, which can cause errors. Intuitively, it seems that for a given system an increased information rate will increase the number of errors per second: if sending 1000 bps produces one error per second, then sending 2000 bps should produce more errors, no matter how we encode the messages. Shannon's theorem counters this intuition and shows that the effect of errors depends on the channel. Every channel has a maximum information rate, or capacity, marked C. As long as the rate does not exceed this capacity, there exists an encoding which allows the message to be transmitted with an arbitrarily small error probability; in the limit, the message can effectively be sent without errors. On the other hand, if the rate exceeds the capacity, then there is no way to avoid errors regardless of the coding technique. Note that adding error-correction information reduces the entropy and thus the rate.

[Figure: Entropy of Customer Location for N=2 (left) and N=10 (right), plotted against the probability that the customer is from the US.]


Shannon's 17th theorem formulates the capacity of a channel as C = B \log_2(1 + SNR), where B is the bandwidth and SNR is the signal-to-noise ratio.
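As a small numerical illustration (with made-up but typical numbers, not taken from the survey): a voice-grade telephone line with roughly 3 kHz of bandwidth and a 30 dB signal-to-noise ratio (SNR = 1000) has a capacity of about 30 kbps.

    import math

    def channel_capacity(bandwidth_hz, snr_linear):
        """Shannon capacity in bits per second: C = B * log2(1 + SNR)."""
        return bandwidth_hz * math.log2(1 + snr_linear)

    print(channel_capacity(3000, 1000))  # about 29,900 bps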

1.1.3 Fitts' law

In 1954, Paul Fitts was studying the limitations of movement tasks. He chose to measure the difficulty of a movement task in information-metric bits, and to consider the movement task as the transmission of information through a "human channel". The limitations are then due to the maximal information capacity of this human channel. He formulated a law [11] which predicts the time required to move from a starting point to a specific target area, as a function of the size of the target and its distance. This law has been applied to modeling pointing, first in real-world problems of the time such as control-panel design, and later in the design of interactive software.

Under Fitts' approach, a movement task has an index of difficulty, marked ID, which combines the effect of the length of the movement with the size of the target. The relation is intuitive: it is easier to acquire a big target at close range than a small target at long range. Using a formula similar to Shannon's, the signal-to-noise ratio is replaced by the ratio between the amplitude of the movement (the distance from the starting point to the target center), A, and the width of the target, W. Thus, ID = \log_2(A/W + 1).

What about the bandwidth in Shannon's original formula? A task consists of more than simply movement. First, there is a constant time which is independent of the movement and its length, for example, reaction time. Second, the properties of the pointing device affect the speed of movement, regardless of its length or difficulty; this is called the performance index of the device. These two factors become regression coefficients, a for intercept and b for slope, resulting in the following formula for movement time: MT = a + b \log_2(A/W + 1). The two coefficients can be measured empirically for a particular task and input device.

The validity of Fitts' law has been shown to hold in many experiments on different tasks and input mechanisms. Fitts' law can be used to evaluate the layout of a UI for specific pointing devices. It is also the basis for several well-known rule-of-thumb implications regarding design:

• Buttons and other selectable GUI widgets should not be made too small, as this would make them very difficult to click on.

  • In systems where the pointer cannot go outside the screen limits, the edges and corners of the screen are particularly easy for a user to acquire. Important widgets such as the menu bar or task bar should therefore be placed in these areas. In terms of Fitts' law, the effective width of these positions is infinite, because the mouse can be moved arbitrarily far and still point at the relevant elements.

• Pop-up menus around the current mouse position (i.e., “right-click menus”) should be used in addition to the regular pull-down menus of the menu bar, because the user does not have to move the mouse to acquire them.


• A pop-up pie menu is often preferable to a linear pop-up menu because all items are at the same distance from the mouse.

  • Mouse-based gesture recognition can increase performance in tasks that involve mouse or toolbar buttons.
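To make the movement-time formula above concrete, here is a minimal Python sketch; the regression coefficients a and b are made-up placeholders, since real values must be measured for a specific device and task.

    import math

    def fitts_movement_time(a, b, amplitude, width):
        """Fitts' law (Shannon formulation): MT = a + b * log2(A/W + 1)."""
        index_of_difficulty = math.log2(amplitude / width + 1)  # bits
        return a + b * index_of_difficulty

    # Hypothetical coefficients for a mouse: a = 0.1 s intercept, b = 0.1 s/bit.
    print(fitts_movement_time(0.1, 0.1, amplitude=256, width=16))   # small, distant target
    print(fitts_movement_time(0.1, 0.1, amplitude=256, width=128))  # large target at the same distance

As expected, the larger target at the same distance yields a noticeably shorter predicted movement time.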

Of course, Fitts' law is inherently very limited. For instance:

  • It addresses only target distance and size and ignores other effects on user performance.
  • It applies to a single dimension and considers a single width quantity, although there are multi-dimensional extensions [29].
  • It considers only the motor response of the human, without considering issues such as software acceleration of a mouse pointer.
  • It does not consider differences between hands, or parallel strategies.
  • It does not provide a means for estimating mental preparation time.
  • It does not account for training, although for such low-level operations it can be argued that experience does not make a difference.

The most significant limitation of Fitts' law is that it operates at a very low level: it only considers movements, and is limited to simple straight-line target acquisitions. The problem of tasks with complex movement paths, such as drawing, gesturing, or writing, is partially addressed by extensions of the law to complex paths by means of integration [1], [2]. Later on we will describe higher-level models, such as GOMS, that are useful for modeling more complex tasks.

1.1.4 The Hick-Hyman law of choice

Humans have non-zero reaction time. They must first perceive the situation, formulate several action plans (or identify potential actions), and then choose one. Fitts' law was concerned with movement, and addressed this reaction time with a simple coefficient. It turns out that decision time can be formulated in a similar manner, in what is called Hick's law [14] or the Hick-Hyman law [16].

Hick's law describes the time it takes the user to make a decision as a function of the available choices. In the simplest formulation, where each of the n choices has an equal probability of being chosen, the reaction time is T = a \log_2(n + 1), where the coefficient a can be measured experimentally. The addition of 1 inside the logarithm accounts for the option of not responding at all. When the choices have differing probabilities, we can think of the entropy of the decision and formulate Hick's law as T = a \sum_{i=1}^{n} p_i \log_2(1/p_i + 1).

Hick's law is based on the idea that humans typically do not make a choice by going over all the options in a linear manner. Rather, they try to subdivide the problem, essentially performing a binary search. The law has been validated experimentally (e.g., [28] for menu selections). It is important to note that Hick's law does not hold for any arbitrary interface and task, but rather for tasks where a search strategy can be devised. One important implication is therefore that the interface should support such a strategy. Consider, for example, a web form with a pull-down list box where a user must pick his state, and assume that each state has an equal probability of being selected. Hick's law implies that the states should be organized alphabetically. While this implication may be intuitive, the formulation of the law can be used to evaluate specific design options.

Another rule of thumb based on Hick's law is that a toolbar or a menu bar should be split into intuitive categories to facilitate learning and use. Suppose, for example, that a user who is not very familiar with Microsoft Word needs to delete a row from a table, but is not sure where the command is located and what exactly it is called. Rather than showing all the possible commands, the Word menu bar has only several categories, one of them for tables. The commands in the Table menu are organized into several groups. A quick glance shows that one group appears to contain commands on cells, another commands for formatting, and a third commands related to data. The user focuses on the commands on cells, spots the "Delete" option, and opens the next submenu, where the specific deletion can be selected. In a similar manner, consider the standard toolbar and the formatting toolbar of Word. The former contains a button that looks like a picture of a table (the "insert table" command), and the latter contains a button which also looks like a table (the "outside border" command). Even without tooltips, most users would find their way to the right button if they know that one toolbar is reserved for formatting. It is also easy to ignore groups of buttons (e.g., we spot the "bold" button and skip the other text-formatting commands).
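As a rough sketch of how such decision-time estimates might be computed (the coefficient value is a made-up placeholder; in practice it is fitted per user and task):

    import math

    def hick_equal(n, a):
        """Decision time among n equally likely options: T = a * log2(n + 1)."""
        return a * math.log2(n + 1)

    def hick_general(probs, a):
        """Hick-Hyman with unequal probabilities: T = a * sum(p_i * log2(1/p_i + 1))."""
        return a * sum(p * math.log2(1.0 / p + 1) for p in probs if p > 0)

    a = 0.2  # hypothetical coefficient, in seconds per bit
    print(hick_equal(8, a))                    # a menu of 8 equally likely items
    print(hick_general([0.7, 0.2, 0.1], a))    # a skewed three-way choice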

1.2 Information processing models

Earlier models and laws such as Fitts' and Hick's were developed before interactive computer systems became prevalent. Human-computer interaction is continuous and cannot easily be broken down into discrete events, necessitating the development of more appropriate models.

1.2.1 General Human Information Processing Models

In the 1960s, Newell and Simon developed information processing models based on both psychology and computer science. Unlike earlier models which were based on analog-signal information processing, these newer models were based on a processing unit performing sequences of operations on symbols, in a similar manner to computer programs. The idea is not to model all human behaviors in this form, but rather specific tasks.


In general, human information processing models consist of a processor with an attached memory which receives inputs from receptors based on external world stimuli, and orders motor actions as an output. The following figure from [22] illustrates this structure:

[Figure from [22]: a general human information processing model, in which receptors feed a processor with attached memory, which in turn orders motor actions.]

The general-psychology literature provides models of human performance on particular tasks, such as using a certain interactive system. The model is built and fitted to the results of experiments conducted on the actual system in question. Such models are not appropriate for the needs of interface designers because they are not predictive. The necessary models are called “zero-parameter” models: they can provide a prediction for a system which has not yet been built. It is important to note that even zero-parameter models are parameterized by information such as typing rate or device movement coefficients. But there are no parameters which need to be derived from experimenting with the system to be built.

1.2.2 The Model Human Processor

In 1983, based on empirical data, Card, Moran and Newell [9] proposed the "Model Human Processor" (MHP), a specific human information processing framework which supports the creation of zero-parameter models for specific tasks. Under this framework, humans process visual and auditory input, which are deemed the most relevant to HCI, and react with motor activities. The unique characteristic of the MHP is that humans are assumed to have three separate systems with processors: perceptual, motor, and cognitive.


[Figure: the Model Human Processor, with its perceptual, cognitive, and motor processors and their associated memories (sensory stores, working memory, and long-term memory).]

The MHP is essentially divided into three systems, each consisting of a processor and a memory. These systems can operate serially or in parallel. Each has its own rules of operation, and interfaces with the other systems only in specific locations. Every processor has a cycle time, which limits the amount of work it can do and also imposes a delay before the next step can begin. Every memory unit has a storage capacity, a decay time, and a type. Human memory is split into long-term memory, with a long access time, and a small rapid-access working memory, much as modern computers have cache memories. Working memory is extremely limited. It is not a physically separate memory, but rather comprises small symbolic chunks of long-term memory, of which only a few are active at any time. There are about seven such chunks, which is often referred to as the "seven plus or minus two" rule [30].

The perceptual system consists of the perceptual receptors facing the outside world (i.e., our senses), the perceptual processor, and a temporary perceptual store of visual and auditory information. This system is responsible for translating observed phenomena in the external world into information that can be processed by the cognitive system. The input information which appears in perceptual memory bears close resemblance to its physical form, such as the image of a word rather than what it symbolizes. The perceptual processor matches this information, encodes it symbolically, and places the symbolic encoding in long-term memory. The perceptual processor is limited, and the perceptual store is small and has a short decay time. As a result, if abundant information is received, some of it will be lost; attention directs the focus of work of the perceptual processor.

The cognitive system uses the contents of both working and long-term memory to make decisions and schedule motor activities. Each cycle has a recognize-act structure: it "recognizes" by using association to activate chunks of long-term memory, and it acts by modifying data in working memory. Some decisions can take several cognitive cycles, especially if there is uncertainty involved; on the other hand, practice can hasten decisions.

The motor system is responsible for translating thoughts into actions. It uses decisions stored in working memory to move the limbs or specific fingers, the head, neck, and eyes, and to produce speech. Each movement is broken down into small parts. Many of these movements are based on a certain motor memory or motor cache that, with practice, simplifies movements such as speaking or typing. This memory is not explicitly represented in the MHP.

The MHP is somewhat limited in that it does not address human attention. However, its more significant limitation, which prevents it from being useful as-is for predicting user performance, is that it works in a bottom-up manner: given the exact perceptual, cognitive, and motor activities that will take place, it can help predict how much time they will take. In designing an interface, it is preferable to start with the goal of the task and formulate the steps for accomplishing this goal, rather than the other way around. The GOMS model, which we discuss next, tries to address this issue.
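A minimal sketch of the kind of back-of-the-envelope estimate the MHP supports, using the commonly cited nominal processor cycle times from Card, Moran and Newell (each value actually comes with a documented plausible range, so the result is a rough nominal figure, not a definitive prediction):

    # Nominal MHP processor cycle times, in seconds.
    TAU_PERCEPTUAL = 0.100
    TAU_COGNITIVE = 0.070
    TAU_MOTOR = 0.070

    def simple_reaction_time():
        """Estimate for a simple stimulus-response task: one perceptual cycle to
        perceive the stimulus, one cognitive cycle to decide, one motor cycle to respond."""
        return TAU_PERCEPTUAL + TAU_COGNITIVE + TAU_MOTOR

    print(simple_reaction_time())  # roughly 0.24 s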

2. GOMS

While this report is focused on the evolution of GOMS models rather than on the models themselves, we will provide a brief survey here.

2.1 Principles of GOMS

GOMS enables the description of tasks and of a user's knowledge of how to perform them. This description consists of goals, operators, methods, and selection rules.

Goals are what a user tries to accomplish, and can be defined at different levels of abstraction. They can also be broken down into sub-goals, thus constructing a hierarchy. Operators are the elementary actions which are used to accomplish a goal. They can be perceptual, cognitive, or motor. They are atomic and cannot be further decomposed. Different GOMS variants have different operators. Typically they are context-free, that is, their execution time does not depend on the current state of the system. Methods are procedural algorithms for accomplishing a goal (or subgoal); several methods may accomplish the same goal. Finally, selection rules are used to select the appropriate method for achieving a specific goal. They represent the user's knowledge of how to accomplish the task.

The important characteristic of GOMS models, compared to other, less formal task-analysis methods, is that a GOMS model contains all the knowledge needed for executing the task. It can therefore be used qualitatively, to develop training tools, manuals, and help systems using the knowledge inherent in the model. In this report, however, we are more interested in its quantitative use for predicting performance.


2.2 GOMS Variants

2.2.1 Keystroke-Level Model

The Keystroke-Level Model, proposed by Card, Moran and Newell [8], can be thought of as a predecessor of GOMS because it is much more restrictive than the other techniques. In particular, the model does not use selection rules to pick the method for accomplishing the task, so it can only be used after the designer has selected the appropriate method. A KLM model is simply a serial sequence of operators, consisting of the keystrokes and mouse movements which the analyst has fed in, as well as mental operators which are placed using simple heuristics. The predicted execution time is the sum of the operators' execution times. For this reason, KLM models are easier to construct, understand, and read. They can be used to compare the execution times of two alternatives. Yet they take advantage neither of parallelism in human ability, nor of the human ability to make decisions based on the current context.

Under KLM, the set of operators is limited to six standard operator classes: K (press a key or button), P (point with a mouse at a target), H (home hands on the keyboard or mouse), D (drawing), M (mental preparation), and R (system response time). The operators can be parameterized and concatenated. The following example presents a KLM model for replacing the word "will" with "shall" in Microsoft Word. The time estimates are approximate.

Description                          Operation         Time (sec)
Reach for mouse                      H[mouse]          0.40
Move pointer to "Replace" button     P[menu item]      1.10
Click on "Replace" command           K[mouse button]   0.20
Home on keyboard                     H[keyboard]       0.40
Specify word to be replaced          M 4K[w i l l]     2.15
Reach for mouse                      H[mouse]          0.40
Point to correct field               P[field]          1.10
Click on field                       K[mouse button]   0.20
Home on keyboard                     H[keyboard]       0.40
Type new word                        M 5K[s h a l l]   2.35
Reach for mouse                      H[mouse]          0.40
Move pointer to "Replace all"        P[button]         1.10
Click on field                       K[mouse button]   0.20
Total                                                  10.40
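A minimal sketch of a KLM calculator for the example above. The operator times are the commonly cited nominal KLM values (K = 0.2 s, P = 1.1 s, H = 0.4 s, M = 1.35 s); the grouping comments are one reading of the table, and in practice the values would be calibrated for the users in question.

    # Nominal KLM operator times in seconds; "D" and "R" are task-dependent and omitted.
    KLM_TIMES = {"K": 0.20, "P": 1.10, "H": 0.40, "M": 1.35}

    def klm_time(operators):
        """Predicted execution time of a serial sequence of KLM operators."""
        return sum(KLM_TIMES[op] for op in operators)

    # The replace-"will"-with-"shall" example from the table above.
    sequence = (["H", "P", "K", "H", "M"] + ["K"] * 4 +   # open Replace, type "will"
                ["H", "P", "K", "H", "M"] + ["K"] * 5 +   # move to the next field, type "shall"
                ["H", "P", "K"])                          # point at and click "Replace all"
    print(round(klm_time(sequence), 2))  # 10.4

The result reproduces the 10.4-second total of the table.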


2.2.2 CMN-GOMS

The Card-Moran-Newell version of GOMS [9] is actually the first GOMS model, as KLM was a limited predecessor. It takes a form similar to computer programs, with the main goal broken into subgoals that can be realized as subroutines. It uses selection rules to pick the appropriate method, and it provides a variety of cognitive operations. According to [18], while it would seem that CMN-GOMS uses the MHP model, it does not use the parallelism of that architecture. Instead, it is based on its two "principles of operation": the "problem space principle" suggests that a user's activity consists of applying operators to transform an initial state into a goal state, and the "rationality principle" postulates that users will develop effective methods given the task, its environment, and their own processing limitations. It is also important to note that CMN-GOMS has been criticized (e.g., [17]) for not being formally defined; applications of the model rely on the interpretation of the researchers applying it.

Following is a partial example of a CMN-GOMS model for moving text in Word, adapted from [18]:

GOAL: EDIT-MANUSCRIPT
. GOAL: EDIT-UNIT-TASK                 ... repeat until no more unit tasks
. . GOAL: ACQUIRE UNIT-TASK
. . . GOAL: GET-NEXT-PAGE              ... if at end of manuscript page
. . . GOAL: GET-FROM-MANUSCRIPT
. . GOAL: EXECUTE-UNIT-TASK            ... if a unit task was found
. . . GOAL: MODIFY-TEXT
. . . . [select: GOAL: MOVE-TEXT*      ... if text is to be moved
. . . .          GOAL: DELETE-PHRASE   ... if a phrase is to be deleted
. . . .          GOAL: INSERT-WORD]    ... if a word is to be inserted
. . . . VERIFY-EDIT

*Expansion of the MOVE-TEXT goal

GOAL: MOVE-TEXT
. GOAL: CUT-TEXT
. . GOAL: HIGHLIGHT-TEXT
. . . [select: GOAL: HIGHLIGHT-WORD
. . . .          MOVE-CURSOR-TO-WORD
. . . .          DOUBLE-CLICK-MOUSE-BUTTON
. . . .          VERIFY-HIGHLIGHT
. . .  GOAL: HIGHLIGHT-ARBITRARY-TEXT
. . . .          MOVE-CURSOR-TO-BEGINNING       1.10
. . . .          CLICK-MOUSE-BUTTON             0.20
. . . .          MOVE-CURSOR-TO-END             1.10
. . . .          SHIFT-CLICK-MOUSE-BUTTON       0.48
. . . .          VERIFY-HIGHLIGHT]              1.35
. . GOAL: ISSUE-CUT-COMMAND
. . .    MOVE-CURSOR-TO-EDIT-MENU               1.10
. . .    PRESS-MOUSE-BUTTON                     0.10
. . .    MOVE-CURSOR-TO-CUT-ITEM                1.10
. . .    VERIFY-HIGHLIGHT                       1.35
. . .    RELEASE-MOUSE-BUTTON                   0.10
. GOAL: PASTE-TEXT
. . GOAL: POSITION-CURSOR-AT-INSERTION-POINT
. .      MOVE-CURSOR-TO-INSERTION-POINT         1.10
. .      CLICK-MOUSE-BUTTON                     0.20
. .      VERIFY-POSITION                        1.35
. . GOAL: ISSUE-PASTE-COMMAND
. . .    MOVE-CURSOR-TO-EDIT-MENU               1.10
. . .    PRESS-MOUSE-BUTTON                     0.10
. . .    MOVE-MOUSE-TO-PASTE-ITEM               1.10
. . .    VERIFY-HIGHLIGHT                       1.35
. . .    RELEASE-MOUSE-BUTTON                   0.10

TOTAL TIME PREDICTED (SEC)                     14.38
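A minimal sketch (not the notation of [9] or [18]) of how such a goal hierarchy can be represented and evaluated programmatically: leaf operators carry time estimates, goals are accomplished by methods, and a selection rule picks among alternative methods based on the task context. The operator times are mostly taken from the listing above; the double-click value is assumed.

    # Hypothetical per-operator time estimates, in seconds.
    OPERATOR_TIMES = {
        "MOVE-CURSOR": 1.10,
        "CLICK-MOUSE-BUTTON": 0.20,
        "DOUBLE-CLICK-MOUSE-BUTTON": 0.20,   # assumed, not listed above
        "SHIFT-CLICK-MOUSE-BUTTON": 0.48,
        "VERIFY": 1.35,
    }

    METHODS = {
        "HIGHLIGHT-WORD": ["MOVE-CURSOR", "DOUBLE-CLICK-MOUSE-BUTTON", "VERIFY"],
        "HIGHLIGHT-ARBITRARY-TEXT": ["MOVE-CURSOR", "CLICK-MOUSE-BUTTON",
                                     "MOVE-CURSOR", "SHIFT-CLICK-MOUSE-BUTTON", "VERIFY"],
    }

    def select_highlight_method(text_kind):
        """Selection rule: pick the method according to what is being highlighted."""
        return "HIGHLIGHT-WORD" if text_kind == "word" else "HIGHLIGHT-ARBITRARY-TEXT"

    def execution_time(method_name):
        """Predicted time for a goal is the sum of its method's operator times."""
        return sum(OPERATOR_TIMES[op] for op in METHODS[method_name])

    print(execution_time(select_highlight_method("arbitrary")))  # 4.23 s, as in the listing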

2.2.3 CPM-GOMS

CPM-GOMS (cognitive-perceptual-motor, or critical-path-method, GOMS) adds parallelism to GOMS by using the Model Human Processor as an execution platform. A CPM-GOMS model is created from a CMN-GOMS model by replacing the operators of that model with fragments of MHP activities, using certain templates. The result is a PERT chart with activities for each MHP component; the minimal execution time is then the critical path through this chart.

CPM-GOMS uses the MHP to add parallelism to the serial scenario, thus obtaining a tighter execution-time prediction. However, this parallelism assumes an extremely high degree of skill and expertise, which not all users will achieve. While there are ways to assist users in reaching this optimal performance, such as placing recurring items in fixed positions, the predictions might still be too optimistic [22]. The assumption of high skill also results in reduced estimates of cognitive activities. CPM-GOMS models are also less readable and much more difficult to construct and understand. For this reason, tasks where little parallelism can be expected are not a good fit for this model, as the added benefits do not justify the added complexity [22]. On the other hand, it is highly useful for tasks where parallelism is essential. Following is an extract from a CPM-GOMS model for text editing [18]:

2.2.4 NGOMSL

The Natural GOMS Language model [26] is a different extension of CMN-GOMS, which represents GOMS models in natural language and represents methods in terms of Kieras' and Polson's cognitive complexity theory [23]. This theory allows the model to support internal operators for subgoals and working-memory manipulation. For example, it includes rules of thumb for the number of steps in a method, for how goals are set and terminated, and for the information that needs to be remembered by the user executing the task [17]. Unlike other GOMS models, it can be used to estimate learning time for new tasks as well as to predict certain errors.

Following is an example of the text-moving task in NGOMSL, from [18]:

Method for goal: Cut text
   Step 1. Accomplish goal: Highlight text.
   Step 2. Retain that the command is CUT, and accomplish goal: Issue a command.
   Step 3. Return with goal accomplished.
...
Selection rule set for goal: Highlight text
   If text-is word, then accomplish goal: Highlight word.
   If text-is arbitrary, then accomplish goal: Highlight arbitrary text.
   Return with goal accomplished.
...
Method for goal: Highlight arbitrary text
   Step 1. Determine position of beginning of text. (1.20 sec)
   Step 2. Move cursor to beginning of text. (1.10 sec)
   Step 3. Click mouse button. (0.20 sec)
   Step 4. Move cursor to end of text. (1.10 sec)
   Step 5. Shift-click mouse button. (0.48 sec)
   Step 6. Verify that correct text is highlighted. (1.20 sec)
   Step 7. Return with goal accomplished.

This NGOMSL model predicts that it will take 5.28 seconds to highlight arbitrary text.

2.3 Picking a GOMS variant

In [17], John and Kieras discuss the selection of a GOMS variant based on the type of task in which the users are engaged and the types of information sought by applying GOMS. They suggest that tasks can be characterized along four dimensions: degree of goal-directedness, degree of routine and skill, degree of user control, and sequentiality. They argue that only tasks that are goal-directed, involve routine, and allow a degree of user control are suitable for GOMS modeling. The choice of GOMS variant for tasks that meet these criteria should then be based on whether parallelism is involved and on the type of information sought.

In their analysis, John and Kieras identified several kinds of design information which can be provided by GOMS models. In addition to learning, execution, and error-recovery times, these include functionality coverage (whether a particular functionality can be achieved), consistency (whether similar goals have similar methods), and determining the operator sequence which will be used to achieve the task. The following table (reproduced from [17]) summarizes the suggested variants:


2.4 Limitations of GOMS

Although GOMS is generally considered one of the more mature engineering models of human performance, it certainly has limitations, and it has not yet gained a significant foothold outside the academic world. One problem is the usual issue of the usability and quality of research tools and software, which we discuss in the next section. The other problems lie in the foundations of the technique, and we discuss them here.

The most significant limitation of the GOMS family is that its predictions are only valid for expert users who never make errors. While it can be assumed that any regular user would eventually become an expert in a repetitive task given enough practice, not all systems are expected to be used in this manner. As software and interactive systems become more ubiquitous, most systems will only be used occasionally, even though our performance when using them might be important. Part of the goal of HCI is to make systems more accessible to novices, and with the exception of the specific learning-time estimates of NGOMSL, the GOMS family does not address these users. A related criticism of GOMS is that it is goal-directed and neglects the problem-solving nature of some tasks, although expertise can replace much of this problem solving. One must also remember that GOMS models rely on statistical averages to provide predictions, and do not take into account the differences between individual users.

Even more problematic, however, is the assumption of error-free execution, as even experts make errors, and these errors may be quite common. The time to notice an error and to correct it might have a significant influence on the overall performance of a particular interface design, and a GOMS-based comparison does not take this into account.


Remember also that GOMS can only be used to evaluate procedural properties of the interface. Because GOMS modeling provides a single metric, execution time, it should not be used as the only means for evaluating designs. Other aspects such as usefulness, enjoyment, and social or organizational impact are not taken into account. It also neglects ergonomic issues and fatigue, which can be a problem under long-term use.

Early GOMS models have also been criticized for not utilizing newer cognitive theories, and thus serving as heuristics rather than as models of performance on particular tasks. Later models, such as CPM-GOMS and NGOMSL, seem to better incorporate cognitive architectures. Nevertheless, all GOMS models seem to be less consistent in their approach to cognitive operations than they are for motor operations.

3. The future of GOMS

3.1 GOMS tool support

Following the formulation of GOMS and its main variants, there was much empirical research trying to validate the methodology in real-life settings. The best-known and most successful industrial application is Project Ernestine, where CPM-GOMS was used to predict the outcome of a trial of a new operator workstation for the NYNEX phone company [12], [13]. A significant limitation of these works was that the modeling itself was done by expert GOMS researchers, and in certain cases the methodology had to be adapted to the specifics of the problem. The methodologies were not accessible to the lay user, and no automated tools were used to simplify the task. Since the GOMS approach has reached a level of maturity, at least in academia, much of the current research involves tools that make it more accessible to other users, in an attempt to gain industrial acceptance.

3.1.1 Quick and Dirty GOMS

Beard et al. argue that "GOMS models can be practical if the effort required to produce and use such models is commensurate with its limited practical accuracy" [6]. In studying the design of workstations for medical CT imaging, they found standard GOMS models inappropriate for their problem: they had difficulty obtaining accurate atomic task times and estimating parallelism, and a more significant problem with the interaction style itself, as most of the radiologist's attention is focused on interpretation goals rather than on interacting with the software.

Their solution was to develop "Quick and Dirty" GOMS, which eliminates most GOMS constructs and presents models as simple trees. In a QGOMS model, tasks (represented by nodes) are broken down into subtasks, represented by child nodes. Edges can be marked for repeating an action several times. The user specifies execution times for leaf nodes, and a parent node's predicted execution time is then the sum of the execution times of its children. The researchers claim that despite its simplicity, QGOMS offered them the ability to rapidly evaluate different designs based on rough estimates. They argue that software engineers will be more willing to adopt QGOMS, compared to other methods, because of its fast learning curve. They also suggest that a tree structure simplifies feeding the model into a computer, while exposing opportunities for design and consistency improvements.

In essence, the authors propose a "dumbing-down" of existing GOMS variants, and argue that such a simplistic model is often useful. Yet if we examine case studies where successful GOMS modeling was conducted, such as the NYNEX operator workstation, it is clear that there is often a need for the toolkit of different operators and for better support for parallelism, as human intuition by itself is not enough to provide an accurate prediction. Other research, which we present next, attempts to derive the classical GOMS models automatically.
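A minimal sketch of the QGOMS idea, with an entirely hypothetical CT-review example and made-up leaf times: leaves carry rough analyst-supplied estimates, edges carry repetition counts, and a parent's prediction is the weighted sum of its children.

    class QGomsNode:
        """A task node: either a leaf with a rough time estimate, or a parent
        whose predicted time is the sum of (child time * repetitions)."""
        def __init__(self, name, time=0.0, children=None):
            self.name = name
            self.time = time                # used only by leaf nodes
            self.children = children or []  # list of (QGomsNode, repetitions)

        def predicted_time(self):
            if not self.children:
                return self.time
            return sum(child.predicted_time() * reps for child, reps in self.children)

    # Hypothetical example: load a study once, adjust window/level once,
    # then scroll through 20 images.
    load = QGomsNode("load study", time=4.0)
    adjust = QGomsNode("adjust window/level", time=2.0)
    scroll = QGomsNode("scroll to next image", time=0.5)
    review = QGomsNode("review study", children=[(load, 1), (adjust, 1), (scroll, 20)])
    print(review.predicted_time())  # 4.0 + 2.0 + 20 * 0.5 = 16.0 seconds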

3.1.2 Deriving KLM models

KLM models are the simplest to create and understand, as they are essentially a recording of specific keystrokes and mouse movements, along with heuristically placed mental operators. A line of research at Carnegie Mellon University attempts to create tools that allow novice users to create mock-ups of their UIs (or use the actual UI) and obtain execution-time predictions. A first prototype was the CRITIQUE system [15], in which a user created a mock-up UI in the research-oriented SubArctic GUI toolkit and demonstrated how certain tasks are performed. The user's activity was recorded and transformed, using rules, into a KLM model from which the predictions are obtained.

(Diagram reproduced from [15].)

A more sophisticated system is the CogTools project [20], which allows predictions to be derived from mock-up UIs created with a mainstream tool: Macromedia Dreamweaver. Unfortunately, at present the system is very cumbersome and requires the use of a variety of external tools. The user starts by installing a Dreamweaver extension which provides replacements for various UI form elements. Using these elements, the user creates a mock-up of the UI to be tested. After starting an appropriate Java application, the user demonstrates tasks in a Netscape browser (not Mozilla). The interaction is recorded by the Java program and can be saved into trace logs, which are then translated into the LISP-based ACT-R framework [3], which can eventually predict execution time. Clearly, such a process is unlikely to be attractive for many developers, especially since Dreamweaver and Allegro Lisp are expensive commercial products. The following figures are reproduced from the CogTools homepage:

3.1.3 GLEAN - Tool support for NGOMSL

Kieras's NGOMSL technique is intended to make GOMS models readable and comprehensible with little training, so that even readers who are not familiar with the concept of GOMS can at least understand the described procedures. However, it still requires users to do tedious calculation work in order to make predictions. The GOMS Language Evaluation and ANalysis (GLEAN) system [24] strives to allow users to easily develop a GOMS model for an interface, and then calculate learning time and execution time for given scenarios.

GLEAN works by simulating the interaction between a simulated user and a simulated device. The analyst feeds in a GOMS model of the interface, descriptions of the benchmark tasks, and a description of the device behavior. GLEAN uses this data to produce a variety of usability metrics. The model is written in a variant of NGOMSL which can be easily interpreted by a computer while still reading as simple, verbose English text. The variant adds an associative working memory, allowing users to store and retrieve values from working memory by tags, and to discard them when they are no longer necessary. Subtasks can be parameterized with values from working memory. In that sense, the model behaves much like a computer program.

Page 21: A Survey of Human-Performance Modeling Techniques for ...udekel/coursework/GOMS/MethodsGOMS.pdfA Survey of Human-Performance Modeling Techniques for Usability, Uri Dekel Page 3/30

A Survey of Human-Performance Modeling Techniques for Usability, Uri Dekel Page 21/30

The task specifications are written in a similar natural-text-like manner. A transition network is used to represent the device behavior, although the system provides existing specifications for specific platforms, such as the Macintosh file manager. The authors validated GLEAN by reproducing a previously published case study where models and predictions had been developed manually. They intended to do further case studies, but faced significant problems with the device-simulator part, as they would often have to reproduce the interface implementation. They expressed an intent to connect GLEAN to an existing UI development tool. It seems, however, that development was discontinued about five years ago, as Kieras switched to a different project.

3.1.4 Automatic CPM-GOMS

Although CPM-GOMS is more sophisticated and powerful than other GOMS models because of its capability for parallelism, its models are very difficult to construct manually. A CPM-GOMS model is a PERT chart, constructed from a CMN-GOMS model by identifying certain patterns and interleaving them into overlapping regions. This is a difficult task even with a convenient drawing tool for creating the PERT charts themselves.

NASA's APEX project [5], which provides a framework for creating autonomous agents, uses a computational architecture to model how humans perform complex tasks. The APEX architecture uses "resources" to represent notions such as memory, vision, and gaze, which map onto components of the Model Human Processor. The architecture is capable of receiving procedural descriptions in a special Procedure Definition Language, and then interleaving the necessary actions among its resources. Because the APEX resources map onto MHP components, John et al. [19] leveraged APEX as a tool for creating CPM-GOMS models: the user specifies high-level methods in the Procedure Definition Language, and APEX interleaves them over the MHP resources and produces the appropriate PERT chart. The researchers intend to validate this approach on previous CPM-GOMS case studies.
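A minimal sketch of the critical-path computation that underlies any CPM-GOMS prediction; the operators, durations, and dependencies below are made up for illustration and are not taken from [19].

    # Each operator becomes a node in an acyclic dependency graph (a PERT chart).
    durations = {                   # seconds
        "perceive-target": 0.10,
        "initiate-move": 0.05,      # cognitive operator
        "initiate-click": 0.05,     # cognitive operator, can overlap with the move
        "move-cursor": 1.10,
        "click": 0.20,
    }
    predecessors = {
        "perceive-target": [],
        "initiate-move": ["perceive-target"],
        "initiate-click": ["perceive-target"],
        "move-cursor": ["initiate-move"],
        "click": ["move-cursor", "initiate-click"],
    }

    def critical_path_length(durations, predecessors):
        """Longest path through the dependency graph = predicted execution time."""
        finish = {}
        def finish_time(node):
            if node not in finish:
                start = max((finish_time(p) for p in predecessors[node]), default=0.0)
                finish[node] = start + durations[node]
            return finish[node]
        return max(finish_time(n) for n in durations)

    print(critical_path_length(durations, predecessors))  # 1.45 s on this toy graph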

4. Conclusions

In the evaluation of user interface designs, many attributes are qualitative: ease of use, enjoyment, aesthetics, intuitiveness, and so on. Many rules of thumb, such as Nielsen's usability heuristics, can be used to design towards improving these attributes. Nevertheless, interface designs also have important quantitative properties, such as the time required to learn them and the maximum efficiency a person can achieve in using them. Since many tasks are routine, it is important to accurately predict these values in advance, in order to reduce prototyping and testing cycles. There is therefore a clear need for models that predict performance.

At present, it seems that the GOMS family of models is the only mature methodology for obtaining this information. It is clearly not a solution for everything: it can only be applied to specific "drone" tasks rather than to tasks that require creativity and thought. Fortunately (or unfortunately...), such tasks are abundant. GOMS models vary in their sophistication, capability, and accuracy. Models such as KLM and CMN-GOMS are intuitive even without a background in cognitive psychology, whereas more sophisticated variants such as NGOMSL and CPM-GOMS are much more complex. A common thread is that all of them are tedious and difficult to construct manually. While these models have been tried and tested many times, most of these trials were conducted by experts; I was not able to find a documented case of an independent developer successfully applying and benefiting from such models.

As software engineers, our hubris often leads us to believe that we can intuitively determine which interface design is best, including in terms of maximal efficiency. If anything, the material presented in this report should make it clear that this determination is difficult to make even with careful analysis. At a minimum, even if we do not try to obtain quantitative predictions, we should try to reap the qualitative benefits of GOMS modeling, or even of simple task modeling. We need to think about what the users will be trying to accomplish and how they are going to go about doing it. We might discover inconsistencies or opportunities for improvement. Such a task analysis can serve to develop detailed use cases and lays the foundation for training manuals.

Even if we are not interested in an accurate estimate, it might be a good idea to "ballpark" a rough estimate of how long these interactions would take, in order to evaluate the design. The QGOMS model and the simple tree-based tool which realizes it might be useful here: we do not need to know anything about cognitive modeling, since we are simply creating a hierarchy of nodes representing actions, and we already have the knowledge to use it. We can, of course, try to improve the accuracy of our models by obtaining realistic figures for the simplest tasks, such as moving the mouse, using simple laws such as Fitts' or by timing ourselves. For critical systems where any performance difference has a profound economic impact, we should consider using the more sophisticated models.

Many projects do not have a usability engineer, and in these cases it is better to assign performance modeling to the testing engineers rather than to the software engineers. As software engineers, we are often concerned more with constructing something than with evaluating it. In particular, we prefer not to learn things that are used for evaluating existing systems rather than for creating new ones. It is therefore unrealistic to expect engineers to learn the details of these sophisticated models and generate them by hand.


For GOMS modeling to achieve widespread acceptance in the software development community, we need appropriate tool support. Ideally, the development environment would provide a "human performance profiler" in the same way that it provides a program performance profiler: designers would create a user interface, demonstrate representative scenarios, and receive timing predictions for those actions along with suggested improvements.

How close are we to this goal? It seems that we are still very far. Tools such as GLEAN and APEX, which require users to feed in complex models and learn a specific language, are unlikely to be adopted by engineers. They would become useful only if the models and task descriptions could be generated by demonstrating scenarios. The only tools that currently try to produce a model from demonstration are CRITIQUE and CogTool, and both are limited research tools running on research platforms, so they too are unlikely to be adopted. One can only hope that such a capability will be added to APEX, since that project is being developed at NASA and is therefore of higher quality than university-developed tools. To truly gain acceptance, however, this capability would have to be added to a mainstream tool such as Visual Basic or Eclipse.
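To see what such a "profiler" must do at minimum, the following is a small sketch, under heavily simplified assumptions, of transcribing a demonstrated stream of interface events into KLM-style operators and summing their times. It is not the algorithm used by CRITIQUE or CogTool: the event vocabulary, the crude rule for inserting M operators, and the rounded operator times are assumptions made here for illustration only.

```python
# A minimal, hypothetical sketch of "model from demonstration": low-level
# interface events are mapped to KLM operators and their times are summed.
from typing import List, Tuple

# Approximate, commonly cited KLM operator times, in seconds.
OPERATOR_SECONDS = {
    "K": 0.28,  # press a key or button
    "P": 1.10,  # point at a target with the mouse
    "H": 0.40,  # move ("home") hands between keyboard and mouse
    "M": 1.35,  # mentally prepare
}


def transcribe(events: List[str]) -> List[str]:
    """Map demonstrated events to KLM operators, inserting M with a crude heuristic."""
    ops: List[str] = []
    for event in events:
        if event == "point":
            ops.extend(["M", "P"])   # assume every pointing act begins a new unit of work
        elif event in ("click", "type-char"):
            ops.append("K")
        elif event == "switch-device":
            ops.append("H")
    return ops


def predict_seconds(events: List[str]) -> Tuple[List[str], float]:
    ops = transcribe(events)
    return ops, sum(OPERATOR_SECONDS[op] for op in ops)


# Hypothetical demonstration: point at a text field, click it, switch to the
# keyboard, and type three characters.
demo = ["point", "click", "switch-device"] + ["type-char"] * 3
operators, seconds = predict_seconds(demo)
print(" ".join(operators), f"-> {seconds:.2f} s")
```

A real tool would, at the very least, apply the published heuristics for placing M operators and estimate pointing times from the actual screen geometry rather than using a fixed P value.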

Annotated Bibliography

[1] Johnny Accot and Shumin Zhai, "Beyond Fitts' law: models for trajectory-based HCI tasks", in Proceedings of the 1997 SIGCHI Conference on Human Factors in Computing Systems, pp. 295-302.
This paper attempts to find a simple "steering law" that models trajectory-based interactions in two- and three-dimensional space. By conducting a series of experiments on "steering through tunnels", the authors derive several laws and infer interesting implications.

[2] Johnny Accot and Shumin Zhai, "Performance Evaluation of Input Devices in Trajectory-based Tasks: An Application of the Steering Law", in Proceedings of the 1999 SIGCHI Conference on Human Factors in Computing Systems, pp. 466-472.
Using their previously published steering laws for complex paths, the authors evaluate several pointing devices in an attempt to determine which is best suited for such paths. The mouse and stylus were found to be superior to touchpads and trackballs.

[3] ACT-R homepage, http://act-r.psy.cmu.edu/about/
The ACT-R project provides a robust cognitive architecture, aimed at providing extensive cognitive facilities and offering quantitative measures that can be compared with empirical results. The ACT-R theory is directly realized in the ACT-R software, which is implemented in Lisp.

[4] John R. Anderson, Michael Matessa, and Christian Lebiere, "ACT-R: A Theory of Higher Level Cognition and Its Relation to Visual Attention", Human-Computer Interaction, Vol. 12, pp. 439-462, 1997.

[5] APEX homepage, http://human-factors.arc.nasa.gov/apex/index.html
The APEX Project at NASA is a framework for developing autonomous intelligent agents that can accomplish challenging tasks. At its core are an artificial intelligence engine for reactive behavior and a Procedure Definition Language used to represent behavior. APEX uses a computational architecture to model how humans perform complex tasks, and was therefore used to automatically construct CPM-GOMS models, relying on its own scheduling capabilities to interleave cognitive, perceptual, and motor operations.

[6] David V. Beard, Scott Entrikin, Pat Conroy, Nathan C. Wingert, Corey D. Schou, Dana K. Smith, and Kevin M. Denelsbeck, "Quick GOMS: a visual software engineering tool for simple rapid time-motion modeling", Interactions, Vol. 4, No. 3, 1997.
This paper presents the "Quick and Dirty" GOMS approach, which uses a simple tree-based representation with a generic "task node" operator. The authors argue that even such a simplistic design is useful, especially when only rough estimates are necessary.

[7] Stuart K. Card, William K. English, and Betty J. Burr, "Evaluation of mouse, rate-controlled isometric joystick, step keys, and text keys for text selection on a CRT", Ergonomics, Vol. 21, No. 8, pp. 601-613, 1978.
An early comparative study of input devices.

[8] Stuart K. Card, Thomas P. Moran, and Allen Newell, "The keystroke-level model for user performance time with interactive systems", Communications of the ACM, Vol. 23, No. 7, pp. 396-410, 1980.
This paper proposes the KLM model. Figure 2 lists the four heuristic rules for placing the M operators.

[9] Stuart K. Card, Thomas P. Moran, and Allen Newell, "The Psychology of Human-Computer Interaction", L. Erlbaum Associates, Hillsdale, NJ, 1983.
A major text on the psychology of HCI and the original formulation of GOMS. Chapter 2 introduces the human information processor, chapters 5 and 6 describe GOMS (albeit with a focus on text editing), and chapter 8 introduces the KLM model.

[10] Stuart K. Card, Jock D. Mackinlay, and George G. Robertson, "The design space of input devices", in Proceedings of the 1990 SIGCHI Conference on Human Factors in Computing Systems, pp. 117-124.
This paper investigates a variety of input devices and proposes a taxonomy of devices and properties. As can be expected from Xerox PARC researchers, they consider both mainstream and offbeat input devices.

[11] Paul M. Fitts, "The information capacity of the human motor system in controlling the amplitude of movement", Journal of Experimental Psychology, Vol. 47, No. 6, pp. 381-391, June 1954.
The original formulation of Fitts' law predates any computer-oriented applications.

[12] Wayne D. Gray, Bonnie E. John, and Michael E. Atwood, "The précis of Project Ernestine or an overview of a validation of GOMS", in Proceedings of the 1992 SIGCHI Conference on Human Factors in Computing Systems, pp. 307-312.
A short description of Project Ernestine and the successful application of GOMS modeling.

[13] Wayne D. Gray, Bonnie E. John, and Michael E. Atwood, "Project Ernestine: Validating a GOMS analysis for predicting and explaining real-world task performance", Human-Computer Interaction, Vol. 8, pp. 237-309. Available online at: http://www.rpi.edu/~grayw/pubs/papers/GJ&A93_HCIj.pdf
A journal version of the CHI article, describing Project Ernestine and its successful application of GOMS modeling.

[14] William E. Hick, "On the rate of gain of information", Quarterly Journal of Experimental Psychology, Vol. 4, pp. 11-26, 1952.
The original formulation of Hick's law.

[15] Scott E. Hudson, Bonnie E. John, Keith Knudsen, and Michael D. Byrne, "A tool for creating predictive performance models from user interface demonstrations", in Proceedings of the 1999 Symposium on User Interface Software and Technology, pp. 93-102.
This paper describes the CRITIQUE system, an early attempt at automatically deriving KLM models and execution-time predictions from a mock-up user interface. The interfaces are built with the research-oriented SubArctic UI toolkit, so the system cannot be used directly by regular users.

[16] R. Hyman, "Stimulus information as a determinant of reaction time", Journal of Experimental Psychology, Vol. 45, pp. 188-196, 1953.

[17] Bonnie E. John and David E. Kieras, "Using GOMS for User Interface Design and Evaluation: Which Technique?", ACM Transactions on Computer-Human Interaction, Vol. 3, No. 4, pp. 287-319, December 1996.
Provides a thorough comparison of the different GOMS variants and rules of thumb for selecting the appropriate variant for particular tasks. The table outlining these rules is reproduced in this report. The discussion is more application-oriented than scientific.

[18] Bonnie E. John and David E. Kieras, "The GOMS family of user interface analysis techniques: Comparison and Contrast", ACM Transactions on Computer-Human Interaction, Vol. 3, No. 4, pp. 320-351, December 1996.
A scientific comparison of the different GOMS variants, applied to a demonstrational text-editing technique.

[19] Bonnie E. John, Alonso Vera, Michael Matessa, Michael Freed, and Roger Remington, "Automating CPM-GOMS", in Proceedings of the 2002 SIGCHI Conference on Human Factors in Computing Systems.

[20] Bonnie E. John, Konstantine Prevas, Dario D. Salvucci, and Ken Koedinger, "Predictive Human Performance Modeling Made Easy", in Proceedings of the 2004 SIGCHI Conference on Human Factors in Computing Systems, pp. 455-462.
This paper describes the CogTool system, in which designers create a mock-up interface using a third-party editor (Dreamweaver) and a KLM model is generated automatically, along with time predictions. The authors show experimentally that the results are comparable with those achieved by expert GOMS modelers. See more information in the text.

[21] Bonnie E. John, Konstantine Prevas, Peter Centgraf, and Sandra Esch, "CogTool User Guide". Available online at: http://www-2.cs.cmu.edu/~bej/cogtool/publications.html
A manual for the CogTool system, which can be used to automatically derive KLM models and time estimates from mock-up user interfaces created in Dreamweaver.

[22] Bonnie John, "Information Processing and Skilled Behavior", in "HCI Models, Theories, and Frameworks", John Carroll (Ed.), Morgan Kaufmann Publishers, 2003.
This chapter from Carroll's book surveys the GOMS methodology and its application in Project Ernestine (the NYNEX operator workstation evaluation).

[23] David E. Kieras and P. G. Polson, "An approach to the formal analysis of user complexity", International Journal of Man-Machine Studies, Vol. 22, pp. 365-394, 1985.

[24] David E. Kieras, Scott D. Wood, Kasem Abotel, and Anthony Hornof, "GLEAN: a computer-based tool for rapid GOMS model usability evaluation of user interface designs", in Proceedings of the 8th Annual ACM Symposium on User Interface and Software Technology, pp. 91-100, 1995.
The authors describe an NGOMSL-based tool for specifying and executing GOMS models. See more information in the text.

[25] David E. Kieras, "Task analysis and the design of functionality", in The Computer Science and Engineering Handbook, T. Allen (Ed.), CRC Press, Boca Raton, FL, 1996. Available online at: http://www.engin.umich.edu/class/eecs493/html/lectures/TADFChap.pdf
An introductory text on task analysis and the simple use of GOMS.

[26] David E. Kieras, "A guide to GOMS Model Usability Evaluation using NGOMSL", in Handbook of Human-Computer Interaction, 2nd Edition, M. Helander, T. Landauer, and P. Prabhu (Eds.), pp. 733-766, Amsterdam, Holland, 1997. Available online at: ftp://www.eecs.umich.edu/people/kieras/GOMS/NGOMSL_Guide.pdf
The most complete text on the NGOMSL model.

[27] David E. Kieras and Thomas P. Santoro, "Computational GOMS modeling of a complex team task: lessons learned", in Proceedings of the 2004 SIGCHI Conference on Human Factors in Computing Systems, pp. 97-104.

[28] T. K. Landauer and D. W. Nachbar, "Selection from alphabetic and numeric menu trees using a touch screen: breadth, depth, and width", in Proceedings of the 1985 SIGCHI Conference on Human Factors in Computing Systems, pp. 73-78.
The authors investigate the speed of menu selections through experiments in which users must select one item from an ordinal range of 4096. They try different menu breakdowns in which, as the number of choices per menu increases, the on-screen target size per choice decreases. The resulting selection speeds conform with both Fitts' law and Hick's law, and show that broad, shallow menus were better than narrow, deep menus for this task.

[29] I. Scott MacKenzie and William A. S. Buxton, "Extending Fitts' law to two-dimensional tasks", in Proceedings of the 1992 SIGCHI Conference on Human Factors in Computing Systems, pp. 219-226.
Because the original formulation of Fitts' law cannot be applied to rectangular targets whose width and height differ significantly, alternative definitions are necessary. The authors investigate and test two alternative formulations.

[30] George A. Miller, "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information", first published in Psychological Review, Vol. 63, pp. 81-97, 1956. Available online at: http://psychclassics.yorku.ca/Miller/

[31] Jakob Nielsen, "Usability Engineering", Morgan Kaufmann, 1993.
An entry-level book discussing all aspects of usability. Jakob Nielsen is known for his "usability heuristics".

[32] Claude E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, Vol. 27, pp. 379-423 and 623-656, July and October 1948. A reprint is available from Bell Labs at: http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
The fundamental work on communication theory, which presents the ideas of entropy and channels of communication.

[33] SOAR homepage, http://sitemaker.umich.edu/soar