

NOVEMBER/DECEMBER 2010, 1541-1672/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

Editor: Robert R. Hoffman, Jeffrey M. Bradshaw, and Kenneth M. Ford, Institute for Human and Machine Cognition, [email protected]

HUMAN-CENTERED COMPUTING

Metrics, Metrics, Metrics, Part 2: Universal Metrics?

Robert R. Hoffman, Institute for Human and Machine Cognition
Peter A. Hancock, University of Central Florida
Jeffrey M. Bradshaw, Institute for Human and Machine Cognition

A previous article in this department from 2008 introduced the topic of measures and metrics.1 The focus of that essay was on measurement of the “negative hedonics” of work—the frustrations, uncertainties, mistrust, and automation surprises caused by poorly designed technology that is not human-centered. This second article focuses on the concept of “metrics” and issues related to it. Following a discussion of relevant issues, we present an immodestly bold proposal for a set of universal metrics.

Metrics have been a salient topic of many recent government-funded research programs for developing large-scale information systems. We have counted a multitude of funding program announcements that include statements such as the following abstraction:

The program seeks metrics quantifying the value and risk added by new information, processes, and modalities ... [The program seeks] the quantitative and qualitative metrics required by the acquisition community to use human systems integration tools and processes in the design process …

This is an expression of “The Great Hope,” to be codified in mathematics in the same manner it has now been reified in the legal language of the procurement process. The doomed expectation is that if something can be measured we therefore will be able to understand it. This is a thinly veneered disguise for the reductionistic obsession to measure the success of everything by its return on investment. The sought-for measures defined in these terms have to gauge efficiency, effort, accuracy, and similar reflections of a maximizing process, hearkening back to the contest between John Henry and the steam hammer. This myopic perspective is particularly frustrating for the advocates of human-centered computing and work-centered design.

Our view is that we must measure cognitive work at the system level—addressing, for example, the important trade-offs and the wider effects of technology-induced changes in the culture of the workplace and the health of the community at large. The quantitative characterization of highly complex interactive effects, or even more problematically, subjective apperceptions of the world, presents one of the greatest challenges to advancing technology in the early 21st century.2

Cognitive systems engineers have called for new objective methods for evaluating the performance impact and learnability of software systems,3–7 including the increasing number of systems requiring human-automation teamwork of a consequential sort.8,9 In general, there has been a rising concern with the human factors of complex cognitive work, or “metrics and methodologies for evaluating technologies.”10

So what exactly does it mean to ask for a metric?

Measures versus Metrics

To understand the foundations for a demand for metrics, we illustrate some basic ideas on measurement through the use of the simple example of intelligence testing (see Table 1).

Metrics are thresholds or decision criteria that are used in an evaluation. One has to decide, “This value differentiates.” Such decisions arise from policy shaped by goals, value judgments, and other considerations. The policy that leads to the adoption of particular operational definitions (for example, “What do we want to measure?”) is external to measurement. Thus, metrics do not arise either immediately or automatically from measures or measurement scales.


Assuming that we have successfully gone from a conceptual measurable to one or more reasonable operational definitions, and assuming further that we have successfully derived specific measurement procedures and linked the measurements to one or more meaningful measurement scales, we cannot then assume some easy step to a metric without having some sort of policy or goal. Without some policy to specify what is desired (or good), how can we determine what a decision threshold should be? In one context, 85 percent correct might represent a useful metric. In another context, it might be misleading or indeed genuinely dangerous. In one context, 35 percent better than before might be a significant gain, whereas in another context, 35 percent might be negligible.
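To make that point concrete, here is a minimal sketch, in Python and entirely of our own devising, of how the same measurement fares against two different policy-supplied thresholds. The task contexts, threshold values, and function name are hypothetical illustrations, not anything drawn from the programs quoted above.

```python
# Minimal sketch: a metric is a measurement plus a policy-supplied threshold.
# The tasks, thresholds, and scores below are hypothetical illustrations.

def metric_decision(measurement: float, threshold: float) -> bool:
    """Apply a decision threshold (the metric) to a measurement."""
    return measurement >= threshold

# The same value on the same measurement scale (percent correct).
percent_correct = 85.0

# Policy A: routine data entry, where 85 percent correct may be acceptable.
print(metric_decision(percent_correct, threshold=80.0))   # True -> acceptable

# Policy B: safety-critical target identification, where 85 percent is dangerous.
print(metric_decision(percent_correct, threshold=99.0))   # False -> unacceptable
```

The measurement is identical in both calls; only the policy behind the threshold changes the verdict, which is the sense in which metrics are external to measurement.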

With this understanding as background, we argue that on certain assumptions that apply to the evaluation of computer-supported cognitive work, it might be possible to generate universal metrics that place all performance evaluations essentially on a common playing field.

The Designer’s Gamble

In the standard view of hypothesis testing, real-world variability must be restricted either by passive control or more often by active manipulation. Multiple experiments are always required to peg down the determiners of human skill acquisition and performance, especially in macrocognitive work systems. Potentially, any feature of the participants (such as experience, intelligence, motivation, aptitude, and so on), test scenarios (such as interesting, rare, easy, or boring), teams (such as colocated, asynchronous, and dysfunctional), and tools (such as displays and menus) can prove relevant, as can countless other mediating and moderating variables.7,11

This means that if an experimental paradigm were conducted properly, the development and procurement process would take even longer than it already does and, in theory, could require a boundless sequence of tests. This would be an untenable situation at a time when the priority is to drastically reduce procurement time.12–14 Furthermore, by the time the relevant factors have been controlled, key variables isolated, and effect sizes estimated, design requirements changed and reevaluated, and so forth, the cognitive work will almost certainly have evolved or been transformed, sometimes completely.15

“It is difficult to sample all the things that must be sampled to make a generalization ... the sheer number [of interacting factors] can lead to unwieldy research plans.”16 We call this the “fundamental disconnect.”17 We need to reduce the time frame required for experimentation so that its length does not preclude effective change in an evermore rapidly changing world. We need to find alternatives to both standard usability testing and standard controlled experimentation so as to expedite evaluation of the performance effects of technological interventions in macrocognitive work systems.4,7,18

In light of this conundrum, we think it might be fruitful to question some of the basic methodological assumptions in the standard experimental model. For example, there is the question of controlling for variables in the workplace. Let us ask the following: If all the interacting and uncontrollable factors are in effect when the actual work is being performed, why should we assume that they have to be controlled when work methods are being evaluated? Indeed, we actually need the daunting variability of the world to be represented in the evaluation of new technologies.7,18

Table 1. Some fundamental concepts of measurement. (Each entry gives a step on the way from a theoretical concept to a metric, its meaning, and an example.)

Conceptual measurable. Meaning: This is a concept from the subject matter of some theory. It is assumed that instances of this concept can be identified and counted. Example: A theory of cognitive development might assert that there are individual differences in intellectual capacity, referred to as intelligence.

Operational definition. Meaning: This is a specification of a replicable, dependable procedure for counting instances or making measurements. Example: This is what an intelligence test does, as in the phrase, “Intelligence is what an intelligence test measures.”

Numerical scale. Meaning: The numerical scale expresses distinctions regarding the conceptual measurable. The distinctions could be categorical or numerical. Example: “Genius” is a categorical distinction. The IQ is a ratio of mental age and chronological age, which is a numerical distinction.

Measurement scale. Meaning: The numerical scale values are entered into a calculation that creates a derived measurement scale. Example: IQ scores expressed as percentile ranks.

Measurement. Meaning: A specific observation is regarded as an instance of a specific scale value. Example: A specific measured IQ score and its derived percentile rank is an example of a measurement.

Metric. Meaning: A decision threshold is expressed as some value (minimum, maximum, or range) on the numerical scale. Example: If a person’s IQ score measured at age 16 is at the 85th percentile or greater, they get to go to college. If a person’s IQ score is 145 or greater, they are classified as genius.
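As a companion to Table 1, the following sketch traces the IQ example through the steps from numerical scale to metric. The percentile norms are a made-up stand-in for real test norms, and the function names are our own; this is an illustration of the table's logic, not of any actual intelligence test.

```python
# A sketch of Table 1's steps for the IQ example.
# The norms list below is a hypothetical stand-in for real test norms.

def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Numerical scale: IQ as the ratio of mental age to chronological age, times 100."""
    return 100.0 * mental_age / chronological_age

def percentile_rank(iq: float) -> float:
    """Measurement scale: a derived percentile rank from (hypothetical) norms."""
    norms = [(85, 16.0), (100, 50.0), (115, 84.0), (130, 98.0), (145, 99.9)]
    rank = 0.0
    for score, pct in norms:
        if iq >= score:
            rank = pct
    return rank

# Measurement: one specific observation on the derived scale.
iq = ratio_iq(mental_age=20.0, chronological_age=16.0)   # 125.0
rank = percentile_rank(iq)                               # 84.0 under these norms

# Metrics: decision thresholds set by policy, as in Table 1's last row.
admit_to_college = rank >= 85.0
classified_genius = iq >= 145.0
print(iq, rank, admit_to_college, classified_genius)
```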


The traditional approach asserts that the only path to scientific truth is to conduct an extended series of controlled factorial experiments resulting in measures of statistical significance. However, in macrocognitive work domains, we need an approach that emphasizes ecological representativeness and utility and leads to measures of practical significance.19 Thus, we express what we call the Designer’s Gamble, which can be stated as follows:

We, the designers, believe that our new technology is good, and that good work will result from its use. Thus, we can let the daunting variability of the “real world” remain in the summary statistics and measurements, and we can conduct reasonably risky tests of usefulness and usability. We’re going to gamble that the new technologies and the work methods they instill are so good that improvements in the cognitive work will be straightforwardly demonstrable despite the daunting variability of the real world.

We think that the Designer’s Gamble is no mere fantasy. Just as funding program announcements sometimes appear to ask for the world, research proposals often gladly promise it. Statements of the following general type often appear in grant proposals and preproposal white papers:

We will develop new modeling strategies leveraging previous research in dynamic networked environments. This architecture will provide near real-time interoperability and robustness and will allow the detection and modeling of information flows and actions and mitigate data overload. This will then be integrated with a suite of algorithms that will automatically reconfigure the running simulation.…

Overly confident statements such as these, relying heftily as they do on the word “will,” promise more than can ever be guaranteed. Other words, such as “might,” would be more appropriate. Phrases such as “we hope will” would be more honest. Organizations, teams, and individuals who seek to create information technology invariably justify their entire approach and design rationale on a tacit Designer’s Gamble.

The Designer’s Gamble can be an explicit assumption made during the processes of procurement (such as system development and evaluation). As such, it is a leverage point for empirical analysis and, in particular, testing hypotheses about the goodness of software tools. What follows from the Designer’s Gamble is a way around the fundamental disconnect, through the explicit use of range statistics in which we look at the extremes of performance.

Range Statistics and Universal Metrics

Comparing the best and worst performers using a new technology, as opposed to their performance when using their legacy methods, informs the evaluator about both the technology’s learnability and the quality of the work that results. The best performance of an individual (or team) demonstrates what is possible with the new technology, while results from the worst-performing individual (or team) can draw attention to training, work variability, or selection problems. Neither extreme represents an aberration to be glossed over by calculations of averages or standard deviations. This is especially important for a statistical analysis of situations where the participants are beginners with the tools because we know that measurements taken on initial task performance are not normally distributed.18,20–22 Thus, we can devise a set of universal metric levels for comparing new work methods to legacy work methods, and for evaluating the learnability of work methods. Table 2 presents one such set of metrics.

Table 2. Universal Metrics levels.

Metric level 0 (minimal): Range statistics are not distinguishable from those in the legacy work.

Metric level 1 (mixed): The best performer in the new work performs better at achieving the primary task goals than the best performer using the legacy work method, but the worst performer in the new work performs worse at achieving the primary task goals than the worst performer using the legacy work method. Metric level 1 is an expected outcome because it is somewhat likely that any intervention will tend to increase performance variability. At this metric level, the performance at the high end improves. Worsened performance at the low end indicates a need for either an improved job selection criterion or improved training.

Metric level 2 (improved): The worst performer in the new work performs better at achieving the primary task goals than the worst performer in the legacy work.

Metric level 3 (good): The best performer in the new work performs better at achieving the primary task goals than the best performer in the legacy work.

Metric level 4 (excellent): The worst performer in the new work performs better at achieving the primary task goals than the worst performer in the legacy work, and the best performer in the new work performs better at achieving the primary task goals than the best performer in the legacy work.

Metric level 5 (superior): The worst performer in the new work performs better at achieving the primary task goals than the best performer in the legacy work.
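To show how these definitions might be operationalized, here is a minimal sketch, under our own assumptions, of assigning a level from the range statistics of an evaluation. It assumes higher scores on the principal performance measure are better, and the function name, tolerance parameter, and example scores are hypothetical rather than anything prescribed by the scheme itself.

```python
# A sketch of assigning the Universal Metrics levels of Table 2 from range statistics.
# Assumes higher scores are better; the tolerance, function name, and example
# scores are our own hypothetical choices.

def universal_metric_level(best_legacy: float, worst_legacy: float,
                           best_new: float, worst_new: float,
                           tolerance: float = 0.0) -> int:
    """Return the highest Table 2 level supported by the best/worst scores."""
    best_up = best_new > best_legacy + tolerance
    worst_up = worst_new > worst_legacy + tolerance
    if worst_new > best_legacy + tolerance:
        return 5  # superior: the worst new performer beats the best legacy performer
    if best_up and worst_up:
        return 4  # excellent: both extremes improve
    if best_up and worst_new < worst_legacy - tolerance:
        return 1  # mixed: the high end improves but the low end worsens
    if best_up:
        return 3  # good: the best performer improves
    if worst_up:
        return 2  # improved: the worst performer improves
    return 0      # minimal: range statistics indistinguishable from the legacy work

# Hypothetical evaluation scores (arbitrary units of the principal performance measure).
print(universal_metric_level(best_legacy=90, worst_legacy=60,
                             best_new=95, worst_new=55))   # 1 (mixed)
print(universal_metric_level(best_legacy=90, worst_legacy=60,
                             best_new=98, worst_new=92))   # 5 (superior)
```

Because levels 2 through 5 are nested by definition, the sketch returns the highest level that the observed extremes support; the tolerance stands in for whatever the evaluator counts as "not distinguishable."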



Figure 1 illustrates these universal metrics levels.

[Figure 1. A visual explanation of the meanings of the Universal Metrics levels. For each level (Metric 0 through Metric 5), the figure plots the scores of the best (B) and worst (W) performers on the principal performance measure, from worse to better, for the legacy work and the new work.]

The universal metrics levels presume that the new work involves the same principal task goals as the legacy work, for which data are available to form a baseline used in establishing what counts as “best” and “worst” performance. (For cases in which the work involves completely new kinds of tasks, there might not be an historical baseline and the evaluation will initially have to reference some normative model or theory of the work. An example might be the emerging forms of cyberdefense. This issue of “formative design,” however, is a serious and significant topic deserving of its own separate analysis.)

Using these universal metrics levels represents a risk on the part of evaluators. The outcome of an evaluation hinges on the performance of one or two individuals (or teams). Remember, though, that the performance of all the others will fall between the extremes; in the “real world,” operational performance likewise often hinges on the performance of one or two individuals (or teams). If the evaluation is explicitly founded on the Designer’s Gamble, and if the desire is to evaluate software in ecologically valid conditions, then the logic of the approach holds firm. It is of course possible, although we hope not likely, that evaluators will try to finesse this method by eliminating poor performers post hoc (for example, on some claim of validity, such as “they were sick”). But such finagling would be transparent and counterproductive.

If, on the other hand, a software tool developer does not wish to adopt the Designer’s Gamble up front, then any a priori promises about the performance gains that will result from the to-be-delivered capabilities must be expressed in a far more cautious way than we commonly see today.

Either way, the sponsor wins.

We submit that the concept of universal metrics levels can provide a framework that includes the following:

1. an approach to evaluation that emphasizes ecological representativeness and utility and escapes the constraints imposed by traditional controlled factorial experimentation;

2. a means for measuring practical significance rather than (or in addition to) statistical significance;

3. a common playing field for evaluating performance of software-supported work of all kinds;

4. a common playing field for evaluating the learnability of software-supported work methods and, by implication, the goodness of the software tools; and

5. a means of sensitizing analyses to outliers that might signal training or selection issues.

What we offer here is not a closed-end solution. Rather, it is a first step or a prospectus in a challenging journey to rethink, document, and quantify the character and capacities of large-scale interacting human-machine systems.

References

1. R.R. Hoffman, P. Hancock, and M. Marx, “Metrics, Metrics, Metrics: Negative Hedonicity,” IEEE Intelligent Systems, vol. 23, no. 2, 2008, pp. 69–73.

2. P.A. Hancock, J.L. Weaver, and R. Parasuraman, “Sans Subjectivity, Ergonomics is Engineering,” Ergonomics, vol. 45, 2002, pp. 991–994.

3. R.R. Hoffman, K.N. Neville, and J. Fowlkes, “Using Cognitive Task Analysis to Explore Issues in the Procurement of Intelligent Decision Support Systems,” Cognition, Technology, and Work, vol. 11, 2009, pp. 57–70.

4. R. Lipshitz, “Rigor and Relevance in Naturalistic Decision Making Research: How To Study Decision Making Rigorously with Small Samples and Without Controls or Statistical Inference,” to be published in Cognitive Engineering and Decision Making, 2010.

5. K. Neville et al., “The Procurement Woes Revisited,” IEEE Intelligent Systems, vol. 23, no. 1, 2008, pp. 72–75.

6. W.M. Newman, “On Simulation, Measurement, and Piecewise Usability Evaluation,” Human-Computer Interaction, vol. 13, no. 3, 1998, pp. 317–323.

7. E.N. Roth and R.G. Eggleston, “Forging New Evaluation Paradigms: Beyond Statistical Generalization,” Macrocognition Metrics and Scenarios, E. Patterson and J. Miller, eds., Ashgate, 2010, pp. 204–219.

8. J.M. Bradshaw, P. Feltovich, and M. Johnson, “Human-Agent Interaction,” to be published in Handbook of Human-Machine Interaction, G. Boy, ed., Ashgate, 2010.

9. G. Klein et al., “Ten Challenges for Making Automation a ‘Team Player’ in Joint Human-Agent Activity,” IEEE Intelligent Systems, vol. 24, no. 6, pp. 91–95.

10. J. Scholtz, “Metrics for Evaluation of Software Technology to Support Intelligence Analysis,” Proc. Human Factors and Ergonomics Society 49th Ann. Meeting, Human Factors and Ergonomics Soc., 2005, p. 918.

11. R.M. Baron and D.A. Kenny, “The Moderator-Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic and Statistical Considerations,” J. Personality and Social Psychology, vol. 51, no. 6, 1986, pp. 1173–1182.

12. M. Hewish, “Out of CAOCs Comes Order,” Jane’s Int’l Defense Rev., May 2003, p. 22.

13. Weapon Systems Acquisition Reform Act of 2009, Public Law 111-23, US Statutes at Large 123, 2009.

14. R.R. Hoffman and S.V. Deal, “Influencing versus Informing Design, Part 1: A Gap Analysis,” IEEE Intelligent Systems, vol. 23, no. 5, 2008, pp. 72–75.

15. J.A. Ballas, “Human-Centered Computing for Tactical Weather Forecasting: An Example of the ‘Moving Target Rule,’” Expertise Out of Context: Proc. 6th Int’l Conf. Naturalistic Decision Making, R.R. Hoffman, ed., Erlbaum, 2007, pp. 317–326.

16. W.A. Firestone, “Alternative Arguments for Generalizing from Data as Applied to Qualitative Research,” Educational Researcher, vol. 22, 1993, pp. 18–19.

17. L. Cronbach, “Beyond the Two Disciplines of Scientific Psychology,” Am. Psychologist, vol. 30, no. 2, 1975, pp. 116–127.

18. R.R. Hoffman et al., “Measurement for Evaluating the Learnability and Resilience of Methods of Cognitive Work,” Theoretical Issues in Ergonomics Science, 2010; http://prod.informaworld.com/smpp/content~content=a920740998~db=all~jumptype=rss.

19. R.R. Hoffman and K.A. Deffenbacher, “An Analysis of the Relations of Basic and Applied Science,” Ecological Psychology, vol. 5, no. 4, 1993, pp. 315–352.

20. W.F. Book, “The Psychology of Skill with Special Reference to its Acquisition in Typewriting,” Studies in Psychology, vol. 1, Univ. of Montana Press, 1908.

21. K.M. Newell and P.A. Hancock, “Forgotten Moments: Skewness and Kurtosis Are Influential Factors in Inferences Extrapolated from Response Distributions,” J. Motor Behavior, vol. 16, no. 3, 1984, pp. 320–335.

22. H.S. Sichel, “The Estimation of Parameters of a Negative Binomial Distribution with Special Reference to Psychological Data,” Psychometrika, vol. 16, no. 1, 1951, pp. 107–127.

Robert R. Hoffman is a senior research scientist at the Institute for Human and Machine Cognition. Contact him at [email protected].

Peter A. Hancock is professor of psychology at the University of Central Florida. Contact him at [email protected].

Jeffrey M. Bradshaw is a senior research scientist at the Institute for Human and Machine Cognition. Contact him at [email protected].
