Types of inconsistency in health-state utility judgments

Jonathan Baron a,* and Peter A. Ubel b

a University of Pennsylvania, Pennsylvania, USA
b University of Michigan, Michigan, USA

Abstract

In making judgments of health-related quality of life, respondents often compare the

relative magnitude of two intervals between health states, such as the interval between

normal health and blindness compared to that between normal health and death. We ex-

amined two ways of comparing such intervals: person-trade-off (PTO)—in which the judg-

ment concerns matching numbers of people so that two changes are equivalent—and direct

judgment of the ratio. Both measures showed ratio inconsistency (a ratio that should be the

product of two other ratios is too high) and superadditivity (two ratios that should add to 1

are too high). Some responses in both methods implied that two intervals which should have

been different (because they shared a top or bottom point, but differed on the other point)

were nevertheless viewed by subjects as being of equal size. These equality responses were

more common when death was the bottom (worse end) of both intervals being compared

(e.g., the interval between death and blindness is perceived as being the same size as the

interval between death and normal health) than when any other condition was at the bottom

or when the condition common to the two intervals was at the top. A second experiment

indicated that subjects really do consider the intervals to be equal. Our findings argue

for giving subjects a chance to reflect on such apparent inconsistencies in practical utility

elicitation.

© 2002 Elsevier Science (USA). All rights reserved.

1. Introduction

Choice among programs, such as those concerned with allocation of medical

resources, is sometimes based on measurement of benefits of competing options on a

common scale of utility. Utility measures require numerical judgments based on

descriptions of health states. In recent years, a few methods have been used to elicit

these judgments. In standard gambles, subjects are asked for a probability p at which

they would be indifferent between obtaining some health state (e.g., blindness) and a

gamble with a p chance of obtaining a better health state (e.g., normal health) and a 1 − p chance of obtaining a worse state (e.g., death). In direct judgment, they are

asked to assign a number to blindness on a scale in which 0 is death and 100 is

*Corresponding author. Department of Psychology, University of Pennsylvania, 3815 Walnut Street,

Philadelphia, PA 19104-6196, USA.

E-mail address: [email protected] (J. Baron).

Organizational Behavior and Human

Decision Processes 89 (2002) 1100–1118

www.elsevier.com/locate/obhdp


0749-5978/02/$ - see front matter © 2002 Elsevier Science (USA). All rights reserved.

PII: S0749-5978(02)00019-5


normal health (for example). In time-trade-off, they are asked how many years of normal health, followed by death, is equivalent to, for example, 10 years of blindness
followed by death. And in person-trade-off, they are asked, for example, preventing

how many people from dying is just as good as preventing 100 people from becoming

blind. These methods have been used in practical decisions and have been the topic

of extensive study (Baron, 1997; Bleichrodt & Johanneson, 1997; Krabbe, Essink-

Bot, & Bonsel, 1997; Llewellyn-Thomas, 1997; Nord, 1995, 1999; Ubel, Loewenstein,

Scanlon, & Kamlet, 1996, 1998). Similar methods have been used in decision analysis

(Keeney & Raiffa, 1976; von Winterfeldt & Edwards, 1986).

At least three factors make such measurement difficult (Baron, 1997). First, the

judgment may over- or underweigh some factor that makes a health condition more

or less severe, such as the extent of adaptation to a chronic condition. More gen-

erally, people may change the way in which they think about a health condition as

they reflect on what it involves. This is not an issue that we address here. Second,

judgments may be affected by aspects of the judgment task that are unrelated to the

utility of the conditions. For example, responses to hypothetical gambles may be

affected by attitudes toward risk or by the overweighing of certainty (de Neufville & Delquié, 1988; Tversky & Kahneman, 1992). Third, numerical responses may be

distorted by scaling problems, such as tendencies to underweigh differences farther

away from a reference point (Tversky & Kahneman, 1992) or to space judgments

evenly along the given range (Mellers & Birnbaum, 1983; Poulton, 1979).

The studies reported here investigate the last two problems as they apply to two methods that might be used for utility estimation: difference judgments and person-trade-off (PTO). The main purpose of this article is to determine the nature of internal inconsistencies in PTO and direct measures, so that we can know how to deal with them for practical purposes. One inconsistency of particular interest results

from judgments which imply that two health states are equally distant from a third

(e.g., death), yet still different from each other.

We choose these two methods because they represent two general approaches to

measurement. In direct judgment, subjects focus on the conditions and make a sort

of psychophysical comparison of magnitudes. The PTO, by contrast, requires a

hypothetical decision, and is therefore subject to influence by factors other than the

magnitudes in question. The PTO, like standard gamble and time-trade-off, has been advocated exactly because it is potentially sensitive to factors, such as fairness, other

than utility magnitudes of health states.

1.1. The two methods: Difference judgments and person-trade-off

We think of each judgment as a comparison of two utility intervals, such as the

interval between condition A and condition B versus the interval between conditions

A and C. One of the “conditions” could be normal health, or death. Each judgment permits us to infer the ratio of the two utility differences, e.g., the size of the smaller

interval relative to the larger. In difference judgments, this is what the subject is

asked to judge.

In PTO judgments, we infer the ratio from the subject's answer. If preventing 100 people's health from declining from Health Condition A to B is judged to be equally important as preventing 200 people from declining from A to C, we infer that the difference between A and C is half of the difference between A and B. This inference assumes that the subject takes the utilitarian approach of adding differences across people. (Again, we return to this assumption later.)

In difference judgments, the subject makes a direct comparison of the sizes of the

two intervals, such as the interval between normal health and blindness and the in-

terval between normal health and death. This corresponds to the method of swing

weights used in decision analysis (Fischer, 1985; von Winterfeldt & Edwards, 1986).


Difference judgments may be distorted in several ways (Baron, 1997; Birnbaum, 1978). Of interest here is the effect of distance from the reference point. Much evidence

supports the general conclusion that people underweigh differences that are more

distant from a reference point, regardless of whether the states at issue are better or

worse than the reference point. For example, people seem to judge that the disutility

of monetary losses is marginally declining as the loss increases; for example, losing

$100 seems less bad when it is added to a $900 loss than when it occurs alone

(Kahneman & Tversky, 1979; Thaler, 1985). When we measure the utility of health

states, two different reference points are available, normal health and death. It may matter which we use. If some condition X is roughly half as bad as death, then people

may judge it as numerically closer to death when they use normal health as the ref-

erence, and as closer to normal health when they use death as the reference. When

people are asked ‘‘How large is the difference between normal health and X compared

to the difference between normal health and death?,’’ they might answer 60%, and

when they are asked ‘‘How large is the difference between X and death compared to

the difference between normal health and death?’’, they might answer 60% again.

Their answers would thus sum to 120%, not 100%. We call this effect superadditivity.

In PTO judgments, the subject provides a number of people such that some

benefit (harm) to that number is just as good (bad) as another benefit (harm) to a

given number of other people (Nord, 1995; Ubel et al., 1996, 2000). This provides a

utility measure if we assume that the judgment is utilitarian, so that the total benefit

to a group is just the benefit per person times the number of people. For example, if

preventing 250 cases of blindness is seen as equivalent to preventing 100 deaths (of

normal people), then we infer that the interval between normal and blindness is .40

of the interval between normal and death. This is because the total utility of preventing blindness is .40 × 250, which equals the total utility of preventing 100 deaths.

If we infer the utility of blindness on a scale on which normal is 1 and death is 0, then

we infer that blindness has a utility of .60.
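The inference in this paragraph can be sketched as a short calculation. The following Python fragment is only an illustration under the utilitarian assumption stated above (total benefit equals benefit per person times number of people); the function and parameter names are ours and do not come from the study materials.

```python
def pto_interval_ratio(n_smaller_harm: float, n_larger_harm: float) -> float:
    """Infer (smaller interval) / (larger interval) from a person-trade-off.

    n_smaller_harm: people spared the smaller harm (e.g., 250 cases of blindness)
    n_larger_harm:  people spared the larger harm, judged equally good (e.g., 100 deaths)

    Assumes benefits add across people, so that
    u_small * n_smaller_harm == u_large * n_larger_harm.
    """
    if n_smaller_harm <= 0 or n_larger_harm <= 0:
        raise ValueError("numbers of people must be positive")
    return n_larger_harm / n_smaller_harm


def utility_on_unit_scale(ratio_to_death: float) -> float:
    """Utility of a condition on a scale where normal health = 1 and death = 0."""
    return 1.0 - ratio_to_death


# The example from the text: 250 cases of blindness versus 100 deaths.
ratio = pto_interval_ratio(n_smaller_harm=250, n_larger_harm=100)
print(ratio, utility_on_unit_scale(ratio))
```

With the numbers from the text, the inferred interval ratio is .40 and the implied utility of blindness is .60.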

PTO judgments are affected by principles of fairness, regardless of the condition

that remains when lives are saved (Nord, 1995; Ubel et al., 1996, 2000). PTO mea-

sures have been advocated as a way of capturing attitudes about policy. Some of

these attitudes are not utilitarian. Still, if these measures are not internally consistent,

it is difficult to see how they can be used. Internal inconsistency may arise from fairness considerations. Of particular interest here is the desire to give patients equal

opportunity for treatment regardless of their present condition.

This principle may be applied in particular to saving lives. People may feel that the

decision to save lives should not depend on the quality of the life after it is saved.

There are many reasons for such a view. For example, people may feel that health

professionals should not be in the business of judging the worth of a person's life. Once we allow such factors as handicaps to enter decisions about life saving, where
do we draw the line between these conditions and such other factors as number of friends or economic productivity? People may also think in terms of the emotional or

utility effects of changes rather than differences between existence in two different

states, and it may be approximately true that the reaction to the change from a

prospect of certain death to the prospect of continued life may be just as good for a

person with a severe handicap as for a person without one. Or, alternatively, people

may just apply a heuristic of fairness.

1.2. Consistency checks

The present experiments look for these effects. In part to do so, we test consistency among judgments in two ways, explained shortly, which we call ratio consistency and additivity. Consistency checks are of interest in their own right. Few

efforts to test consistency are reported in the literature (Bleichrodt & Johanneson,


1997; Miyamoto & Eraker, 1988). For PTO, only ratio consistency has been tested (Ubel et al., 1996), and it is, as we show, a necessary but not a sufficient condition for

utility measurement. Internal consistency is necessary if a judgment is to be inter-

preted as a judged utility function and used for making decisions that depend on

such judgments. Yet, both researchers and practitioners in medical areas almost

never test internal consistency in any way.

In the experiment reported here, subjects made PTO and difference judgments on

the same set of conditions, all of which are long lasting and serious. To describe the

judgments in general, we use the notation u(A) − u(B) to mean the utility difference, and we take J{[u(A) − u(B)]/[u(A) − u(C)]} to refer to the judged or inferred ratio of u(A) − u(B) to u(A) − u(C). We can think of the denominator u(A) − u(C) as the unit of utility. For example, u(A) − u(B) could be the difference between normal health and blindness, and u(A) − u(C) could be the difference between normal health and blindness-and-deafness. Someone who thought that blindness was half as bad as blindness-and-deafness, relative to normal health, would judge the utility of u(A) − u(B) as .5 using u(A) − u(C) as the standard. In other words, this person would judge the u(A) − u(B) interval as half the size of the u(A) − u(C) interval.

As noted, we test two kinds of internal consistency: ratio consistency and additivity. Ratio consistency (Fagot, 1978) tests the condition that

J{[u(A) − u(B)]/[u(A) − u(C)]} × J{[u(A) − u(C)]/[u(A) − u(D)]} = J{[u(A) − u(B)]/[u(A) − u(D)]}

If we suppose that judgments are insensitive to differences among utility intervals,

then

J{[u(A) − u(B)]/[u(A) − u(C)]} × J{[u(A) − u(C)]/[u(A) − u(D)]} < J{[u(A) − u(B)]/[u(A) − u(D)]}

For example, suppose that A is normal health, B is blindness, C is combined blindness and deafness, and D is death. Then each interval is the “badness” of B, C, or D compared to normal health. If B is half as bad as C and if C is half as bad as D, then B should be a quarter as bad as D. If, in the extreme, a subject is so insensitive to differences among utility intervals as to say “half” to every question, the

product of the first two judgments will still be a quarter, but the third judgment will

be a half. Such insensitivity could result from undersensitivity to large differences,

oversensitivity to small differences, or both. Ubel et al. (1996, 1998) found such an

effect in PTO and called it ‘‘multiplicative intransitivity.’’ Fagot (1978) found ratio

inconsistency (but not the same as we find) in a variety of psychophysical judgments;

we use Fagot's term.

Ratio inconsistency could also be related to the finding of Birnbaum and his

colleagues (Birnbaum, 1978; Birnbaum & Sutton, 1992) that subjects asked to judge

the ratio of two stimuli respond (with a nonlinear response function) to the difference

between the stimuli rather than to the ratio of their distances from zero (no stimu-

lation, in a sensory task). However, when subjects are asked for ratios of differences—

e.g., what is the ratio between the utility (or loudness, etc.) difference between A and

B and the difference between C and D?—they tend to base their responses correctly

on the ratio of the differences, and not the difference of the differences. Birnbaum's result can be taken to imply that we must state the reference point explicitly if we

want subjects to use it, so we do this when we ask about differences. Explicitness in

stating the ends of ranges being compared is one of the prescriptions of decision

analysis (Fischer, 1985), but it is not used routinely in other value-elicitation tasks.

Still, explicit statement of end points may not remove all of the tendency to respond

in terms of differences. Such a tendency could be described as a neglect of changes in

the standard of comparison. When subjects attend more to differences (as opposed to

ratios), they might be undersensitive to changes in the standard of comparison they


are given, from question to question. Such underattention would also produce ratio inconsistency and is, in our procedures, indistinguishable from other explanations of

this effect.
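The ratio-consistency condition above can be checked numerically for one subject's three judgments. The Python sketch below is illustrative only: the function name is ours, and summarizing the discrepancy as a base-10 logarithm of the quotient is an assumption made here for convenience, not a description of the analysis code used in the study.

```python
import math


def ratio_inconsistency(j_bc: float, j_cd: float, j_bd: float) -> float:
    """log10 of the directly judged ratio over the product of its two components.

    j_bc: judged [u(A) - u(B)] / [u(A) - u(C)]
    j_cd: judged [u(A) - u(C)] / [u(A) - u(D)]
    j_bd: judged [u(A) - u(B)] / [u(A) - u(D)]

    Returns 0 when the judgments are ratio consistent and a positive
    value when the direct judgment j_bd exceeds the product j_bc * j_cd.
    """
    return math.log10(j_bd / (j_bc * j_cd))


# A consistent judge: half of a half is a quarter.
print(ratio_inconsistency(0.5, 0.5, 0.25))

# The extreme case from the text: answering "half" to every question.
print(ratio_inconsistency(0.5, 0.5, 0.5))
```

In the extreme case the direct judgment is twice the product of the other two, so the measure equals log10(2), roughly 0.30.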

Additivity tests whether the judged utility of two parts of an interval add up to the

whole interval, that is, whether

J{[u(A) − u(B)]/[u(A) − u(C)]} + J{[u(B) − u(C)]/[u(A) − u(C)]} = 1

assuming A, B, and C are ordered. For example, if the difference between normal

health and blindness is .3 of the difference between normal health and blindness-and-

deafness, then the difference between blindness and blindness-and-deafness should be

.7 of the difference between normal health and blindness-and-deafness. Superadditivity could result from numbers that are too high (.4 and .8 instead of .3 and .7). Superadditivity could also result from adoption of one of the states in the numerator (e.g., A in u(A) − u(B) or C in u(B) − u(C)) as the reference point, which would produce the effect if we assume that differences near the reference point loom larger.
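The additivity test amounts to summing two judged part-to-whole ratios and comparing the sum with 1. A minimal Python sketch, with a hypothetical function name of our own choosing, applies it to the numbers mentioned in the text.

```python
def superadditivity(j_upper: float, j_lower: float) -> float:
    """Sum of two judged part-to-whole ratios, minus 1.

    j_upper: judged [u(A) - u(B)] / [u(A) - u(C)]
    j_lower: judged [u(B) - u(C)] / [u(A) - u(C)]

    Zero means the two parts add up to the whole interval;
    a positive value means the judgments are superadditive.
    """
    return j_upper + j_lower - 1.0


examples = {
    "additive (.3 and .7)": (0.3, 0.7),
    "too high (.4 and .8)": (0.4, 0.8),
    "60% and 60%": (0.6, 0.6),
}
for label, (upper, lower) in examples.items():
    print(label, round(superadditivity(upper, lower), 2) + 0.0)
```

The first pair is additive, while the second and third pairs each exceed 1 by .2, which is the pattern we call superadditivity.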

Additivity is nearly sufficient for a utility scale, as such a scale can be defined in terms of units that are added. Additivity implies ratio consistency, but ratio consistency does not imply additivity. Appendix A explains these relationships, and it also discusses inversion consistency, which concerns the direction of comparison (smaller to larger interval or the reverse). Inversion inconsistency was found by Baron, Wu, Brennan, Weeks, and Ubel (2001). Inversion consistency is a special case of ratio consistency.

2. Experiment 1

This experiment compared two methods for eliciting judgments from which utility

can be inferred. Each method compares two intervals, such as the interval between

normal health and blindness or the interval between normal health and death. Each

of the two intervals has a top (best health state) and a bottom (worst health state).

Sometimes the two intervals share the same top, and sometimes they share the same

bottom. Sometimes the bottom is death, and sometimes the top is normal health.

In the difference method, subjects make direct numerical judgments of the size of one interval compared to another, the standard, which has a size of 100. For example, people are asked to imagine that the difference between normal health and

death is 100 and then asked what the difference between normal health and some

other health state would be. We infer judgments from PTO responses as described

earlier. In the PTO, we ask subjects to compare the benefit of preventing changes for

the worse, such as death, getting a handicap that one does not have, or getting a

worse handicap. In previous studies not reported here, we found that different

wording, such as “saving” rather than “preventing,” yielded identical results.

Subjects completed our studies on the World Wide Web. This method of recruitment is new (Birnbaum, 2000) and is therefore worthy of comment. We originally

recruited subjects by posting to Usenet news groups, but then various other people and organizations, such as the American Psychological Society, put links to our studies in their web pages, and still others found us with search engines (looking for terms like “surveys” and “pay”). Subjects who wish to be paid must provide their name, email address, address, and (for US residents) their Social Security Number. It is therefore at least as difficult as with other methods to complete the same questionnaire under different names. Several studies find little or no difference between web subjects and other subjects (Birnbaum, 2000; Soetikno, Mrad, Pao, & Lenert, 1997; Soetikno, Provenzale, & Lenert, 1997). Of course, the web, like paper, can be used in many ways. One feature of our approach is to include checks for sensible answers (using JavaScript, a programming language for web pages) so that subjects who do not take the study seriously find it difficult to complete.


2.1. Method

One hundred and fourteen subjects completed a questionnaire on the World Wide

Web for $3. Three additional subjects were eliminated because they gave the same

response to every item (100 or 50). Ages ranged from 13 to 69 (median 29); 71% were

female; and, coincidentally, 71% were non-students.

Each subject answered a series of 32 questions regarding the relative magnitudes

of differences in health states. There were two question frames, one using PTO and

one using the difference method, with 16 items each.

The introduction began:

This study is about different ways of eliciting numerical judgments of health. If we could

measure the benefit of curing or preventing different conditions, we could allocate scarce re-

sources so as to get the greatest benefit per dollar. The following questions are about meth-

ods that might really be used to measure the badness of various conditions. The worse a

condition, the more good is done by preventing it.

All the items refer to conditions that people acquire at about age 60, from strokes, and last

for the rest of their lives. DEATH refers to death at age 60. The conditions are always in

CAPITAL LETTERS. Here are the conditions:

PARALYSIS OF LEGS

PARALYSIS OF ARMS AND LEGS

BLINDNESS

BLINDNESS AND DEAFNESS

NORMAL HEALTH

DEATH

All questions require judgments on a numerical scale from 0 to 100. It never makes sense here

to give an answer of more than 100.

All questions concern comparison of two ranges. For example one range might be between

normal health and deafness, and another range might be between deafness and death.

In one type of question, we ask about the relative size of two ranges. How large is the smal-

ler range, as a percent of the larger one? You might think of each range as a distance defined

by two locations. How large is the distance from New York to Paris, as a percent of the dis-

tance from Chicago to Moscow? This is the kind of judgment you must make.

The second kind of question is like this:

How many people must be prevented from changing from normal health to death in order

to make this just as good as preventing 100 people from changing from normal health to

deafness?

In this case, you should give a number less than 100. Preventing blindness does a

lot of good, but not as much as preventing death, so it would require fewer cases of preventing death to do the same amount of good.

Going back to the distance example:

How many trips from Chicago to Moscow is just as far as 100 trips from New York to

Paris?

The subject then did a practice item using the deafness example. Instead of filling in the number, the initial number was set at 50, and the subject pressed one

of two buttons to indicate which option was better, until the subject judged them

to be equal. This was to get the subject used to the idea of filling in a number that

made two options equal. (The subject could not go on without doing the practice

item.)

Each of the 32 screens presented either a PTO item or a difference item. The PTO

item was of the form:

How many people must be prevented from changing from

A to B

in order to make this just as good as


preventing 100 people from changing from

C to D?

The difference item was of the form:

If the difference between

A to B

is 100, how big is the difference between

C to D?

In each case, one end of the two intervals was the same. For example, the ‘‘tops’’

of both intervals (A and C) could be normal health, or the ‘‘bottom’’ (B and D) could

be death. The C–D interval was always intended as smaller than the A–B interval. Thus, we intended that the answers to both PTO and difference questions would be less than or equal to 100. For both kinds of questions, a response of 100 implies that the intervals are equal (if they are interpreted as utility intervals). Table 1 shows the

basic comparisons, using paralysis items.

For example, for the first item, the difference form was, ‘‘If the difference be-

tween normal health and death is 100, how big is the difference between normal

health and paralysis of legs?’’ The PTO form was, ‘‘How many people must be

prevented from changing from normal health to death in order to make this just as

good as preventing 100 people from changing from normal health to paralysis of

legs?”

The same items were used for sensory impairments, with blindness substituted for paralysis of legs and blindness-and-deafness for paralysis of arms and legs.

Table 1

The values of A–D for comparisons 1–8, illustrated with paralysis


Notice that items 1–3 compare intervals with Normal in common at the top; 5, 6, and 8 compare intervals with Death in common at the bottom; 7 has a non-death

condition at the bottom; and 4 has a non-normal condition at the top. Comparisons

1–3 allow a test of ratio consistency: the ratio in 1 should be the product of the ratios

in 2 and 3. Likewise for 6, 8, and 5. Finally, comparisons 1 and 5 together test

additivity, as do comparisons 2 and 6, 3 and 7, and 4 and 8.

The 32 items (the eight items above, each in its paralysis or sensory form, and then

either PTO or difference) appeared in a different random order for each subject.

2.2. Results

The nature of the disability—sensory or paralysis—did not affect the judgments,

and this factor did not interact significantly with PTO vs. difference or with com-

parison (1–8). Accordingly, we combined the results for the types of disability for

analysis and presentation, and we use the paralysis conditions to stand for both.

Table 2 shows the mean utility ratios for the eight comparisons, the percentage of

responses in which each measure yielded equality responses, that is, responses indicating that the two intervals were equal (100), and the mean utility ratios with

equality responses excluded. In contrast to previous results (e.g., Baron et al., 2001),

PTO and difference judgments were very close, and their overall means did not differ

significantly across subjects (with or without the equality responses). Conceivably,

the random intermixing of the two measures encouraged subjects to see them as

more similar. (Baron et al., 2001, Experiment 2, found that close proximity of the

two measures made their results closer.)

2.2.1. Equality responses

An equality response is a response of 100, which we take to indicate that the

subject judged the two intervals to be equal. We expected fewer of these responses in

the difference method than in the PTO method because subjects might think that any

other response implies unequal treatment of two groups differing in current health or

in what can be prevented. The difference method is likely to be seen as a judgment of

seriousness without any implications for how two groups should be treated. In fact,

the proportion of equality responses was greater for PTO than for the difference method (t(113) = 4.94, p = .0000, across subjects). Table 2 shows the mean proportion

of equality responses for each of the eight comparisons.

We classified the comparisons into four types according to what condition was

common to both intervals being compared: both intervals have normal health at the

top (the better end of the interval—comparisons 1–3, called Common-top-normal);

both have the same condition at the top, other than normal (comparison 4,

Common-top-not-normal); both have death at the bottom (comparisons 5, 6, and 8, Common-bottom-death); and both have some other condition at the bottom (comparison 7, Common-bottom-not-death). We computed the mean proportion of equality responses for each of the four types.

Table 2
Mean judged ratios of utility differences (in percent), percentage of equality responses, and judged ratios excluding equality responses, Experiment 1

Comparison               Mean ratio      Equality responses    Mean, excluding equality
                         Diff.   PTO     Diff.   PTO           Diff.   PTO
1. N–D vs. N–L            47      49       2      12             46      42
2. N–D vs. N–A&L          58      56       2      15             57      48
3. N–A&L vs. N–L          56      63       3      13             55      57
4. L–D vs. L–A&L          53      53       2      11             52      47
5. N–D vs. L–D            61      63      10      20             56      53
6. N–D vs. A&L–D          54      59       8      18             50      48
7. N–A&L vs. L–A&L        52      56       3      10             50      51
8. L–D vs. A&L–D          65      63      17      16             58      57

N = normal health; D = death; L = paralysis of legs; A&L = paralysis of arms and legs.

The four types differed in an analysis of variance that included subject, PTO-vs.-difference, and type as factors (F(3, 339) = 11.50, p = .0000). The differences among the types did not interact significantly with PTO-vs.-difference. Equality responses were higher in the Common-bottom-death comparisons than in the other conditions, which were approximately equal. Collapsing across PTO and difference, the proportion of equality responses was .15 in Common-bottom-death and .08, .07, and .07, respectively, in the other three types: Common-top-normal, Common-top-not-normal, and Common-bottom-not-death. In separate t tests, the proportion of equality responses in Common-bottom-death was higher than for each of the three other types at p < .002 (with Bonferroni correction, collapsing across PTO vs. difference). No other differences between types were significant.

In sum, equality responses are relatively frequent for Common-bottom-death,

that is, when both intervals compared involve saving lives.
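The proportions by type reported above can be approximated from the equality-response columns of Table 2. The Python sketch below simply averages the rounded table cells within each type, collapsing across methods; it is an illustration of the classification, not the per-subject analysis of variance, and the variable and function names are ours.

```python
# Percentage of equality responses from Table 2: comparison -> (difference, PTO)
EQUALITY_PCT = {
    1: (2, 12), 2: (2, 15), 3: (3, 13), 4: (2, 11),
    5: (10, 20), 6: (8, 18), 7: (3, 10), 8: (17, 16),
}

# Classification of the eight comparisons by the condition common to both intervals
TYPES = {
    "Common-top-normal": (1, 2, 3),
    "Common-top-not-normal": (4,),
    "Common-bottom-death": (5, 6, 8),
    "Common-bottom-not-death": (7,),
}


def mean_equality_pct(comparisons):
    """Average the table cells for the given comparisons over both methods."""
    values = [pct for c in comparisons for pct in EQUALITY_PCT[c]]
    return sum(values) / len(values)


by_type = {name: mean_equality_pct(comps) for name, comps in TYPES.items()}
for name, pct in by_type.items():
    print(f"{name}: {pct:.1f}%")
```

The cell averages come to about 14.8% for Common-bottom-death and 7.8%, 6.5%, and 6.5% for the other three types, in line with the reported proportions of .15, .08, .07, and .07.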

2.2.2. Additivity

We assessed additivity for comparisons involving death but not normal health

(comparison 4 plus comparison 8), those involving normal health but not death (3

plus 7), and those involving both normal and death (extremes: 1 plus 5 and 2 plus 6).

Table 3 shows the measure of superadditivity, the sum of the two ratios, minus 1.

The measure is 0 if the two ratios are additive.

Utility ratios were superadditive, as shown in Table 3. Superadditivity was significantly positive overall (F(1, 113) = 82.9, p = .0000), and in each of the six types (three for PTO and three for difference; p < .002 by t test). Although the six types differed significantly, the differences are difficult to interpret.

Superadditivity was reduced when equality responses were excluded (t(113) = 6.28, p = .0000, using the average across the six types, three for difference and three for PTO). When equality responses were excluded, superadditivity remained overall (t(113) = 2.41, p = .0176).

The substantial reduction in superadditivity, from a mean of .28 to a mean of .11, suggests that superadditivity results largely from equality responses. For example, if the difference between death and leg paralysis is equal to the difference between death and normal health, then any judgment that there is a difference between leg paralysis and normal health will lead to superadditivity.

On the other hand, the results also suggest that superadditivity is in part a scaling

effect, i.e., the J function is nonlinear. This observed nonlinearity is not just a by-

product of equality responses, nor of principles applied to choices among groups of

people. Superadditivity was significant for the difference method alone, without

equality responses (t(113) = 2.83, p = .0052). It could arise either in the way conditions are perceived or in the mapping of these perceptions into numerical responses.

When subjects compare intervals with one end of one of the intervals at normal,

or one end at death, they may take that end as a reference point, exaggerate dif-

ferences near it and minimize differences far from it, thus producing ratios that are

too high.

Table 3
Superadditivity (the sum of two ratios that should add to 1, minus 1) for Experiment 1

Items in comparison             Difference    PTO
Death but not normal health     .18 (.09)     .16 (.04)
Normal health but not death     .08 (.06)     .19 (.08)
Both normal health and death    .10 (.06)     .14 (−.02)

Equality responses are excluded for numbers in parentheses.
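The unparenthesized entries of Table 3 appear to follow from the mean ratios in Table 2, using the pairings given in the text (4 plus 8, 3 plus 7, and the two extreme pairs 1 plus 5 and 2 plus 6). The Python sketch below recomputes them from the rounded table means; it is a consistency check under that assumption rather than the per-subject analysis, and the names are ours.

```python
# Mean judged ratios (percent) from Table 2: comparison -> (difference, PTO)
MEAN_RATIO = {
    1: (47, 49), 2: (58, 56), 3: (56, 63), 4: (53, 53),
    5: (61, 63), 6: (54, 59), 7: (52, 56), 8: (65, 63),
}

# Pairs of comparisons whose ratios should sum to 100 percent
GROUPS = {
    "Death but not normal health": [(4, 8)],
    "Normal health but not death": [(3, 7)],
    "Both normal health and death": [(1, 5), (2, 6)],
}

DIFFERENCE, PTO = 0, 1


def group_superadditivity(pairs, method):
    """Mean over pairs of (ratio_a + ratio_b) / 100 - 1 for one method."""
    excess = [
        (MEAN_RATIO[a][method] + MEAN_RATIO[b][method]) / 100 - 1
        for a, b in pairs
    ]
    return sum(excess) / len(excess)


for label, pairs in GROUPS.items():
    diff = group_superadditivity(pairs, DIFFERENCE)
    pto = group_superadditivity(pairs, PTO)
    print(f"{label}: difference {diff:.3f}, PTO {pto:.3f}")
```

The recomputed values are .18, .08, and .10 for the difference method and .16, .19, and .135 for PTO; the last differs from the tabled .14 only by rounding of the inputs.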


2.2.3. Ratio inconsistency

Ratio inconsistency was positive for all measures (p < :0005, with or without

equality responses, t > 5 for both). The measure of inconsistency was not affected by

method (difference vs. PTO) or by whether the comparisons involved normal health

(comparisons 1–3) or death (comparisons 5, 6, and 8). Themeasure was slightly higher

when equality responses were excluded. The mean of the logarithmic ratio was .18

with all responses and .20 with equality responses excluded (using all remaining data,

t113 ¼ 2:49, p ¼ :0142). These means correspond, respectively, to ratios of 1.52 and

1.60. In other words, a ratio that should be the product of two others is more than 1.5times larger than the product, on the average. This result (also found by Baron et al.,

2001) suggests that people do not differentiate large ratios and small ratios sufficiently.
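As a rough illustration of this measure, the sketch below computes a logarithmic inconsistency score for one triple of judgments and converts a mean log ratio back to a plain ratio. It assumes base-10 logarithms, which the text does not state; the function names and the example judgments are hypothetical.

```python
import math


def log_ratio_inconsistency(direct, first, second):
    """log10 of the directly judged ratio over the product of its two components.

    Zero means the three judgments are ratio-consistent; a positive value
    means the direct ratio is larger than the product implies.
    """
    return math.log10(direct / (first * second))


def as_ratio(mean_log):
    """Convert a mean base-10 log ratio back to a plain ratio."""
    return 10 ** mean_log


# Hypothetical judgments: components of .50 and .40 imply .20,
# but the direct judgment is .30.
print(log_ratio_inconsistency(0.30, 0.50, 0.40))
print(as_ratio(0.18), as_ratio(0.20))
```

With the rounded means of .18 and .20 this gives ratios of about 1.51 and 1.58, close to the reported 1.52 and 1.60. That is what one would expect if the logs are base 10 and the reported ratios were computed from unrounded means, but both points are assumptions here.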

3. Experiment 2

Experiment 1 suggested that subjects were using an equality principle, in which

they truly judged the intervals to be equal (in the difference method) or thought that

two groups should get equal priority (in the PTO method). Experiment 2 attempted

to find direct evidence for these principles by asking subjects about them. To avoid

biasing the subjects by our questions, we also asked about the opposite principle, the

idea of complete triviality of one interval compared to the other, so that one interval

“dominates” the other.

We also changed the response format. The use of an open-ended response in

Experiment 1 might have encouraged subjects to respond with round numbers, and a

response of 100 might have meant “closer to 100 than to 90.” To avoid this

possibility, we gave subjects a set of buttons labeled 100, 99, 95, 90, 80, and so on. This response

mode makes a response of 99 just as acceptable as a response of 100.

3.1. Method

Sixty-eight subjects completed a questionnaire on the World Wide Web for $3.

Ages ranged from 17 to 73 (median 34); 69% were female, and 72% were non-students.

The procedure was identical to that of Experiment 1, except for the addition of a

set of buttons for responses, and the addition of two questions, one about equality

and one about what we call dominance. An example of a difference-method item is:

If the difference between

BLINDNESS and DEATH

is 100, how big is the difference between

BLINDNESS AND DEAFNESS and DEATH?

Choose the closest number:

100 99 95 90 80 70 60 50 40 30 20 10 5 1 0

Consider the following statements:

The difference between

BLINDNESS AND DEAFNESS and DEATH

is as large as the difference between

BLINDNESS and DEATH. AGREE DISAGREE

The difference between

BLINDNESS AND DEAFNESS and DEATH

is trivial compared to the difference between

BLINDNESS and DEATH. AGREE DISAGREE

For the PTO item, the buttons were the same, but the equality and dominance

questions were (using a different example):

It is just as important to prevent people from changing from

PARALYSIS OF ARMS AND LEGS to DEATH as it is to prevent people from changing

from


PARALYSIS OF LEGS to DEATH. AGREE DISAGREE

When we have a choice between preventing the change from

PARALYSIS OF ARMS AND LEGS to DEATH in some people, and preventing the

change from

PARALYSIS OF LEGS to DEATH in other people, we should always prevent the change

from PARALYSIS OF LEGS to DEATH no matter how few people we can help.

AGREE DISAGREE

3.2. Results

Table 4 shows the mean utility ratios for the eight comparisons, the percentage of

responses in which each measure yielded equality responses, and the percent

agreement with the equality and dominance questions. In general, the proportion of

equality responses (numerical responses of “100”) was higher than in Experiment 1,

despite our effort to reduce them by giving the subjects a clear “99” option. The

equality question may have called attention to the possibility that intervals were

equal. Interestingly, the dominance question—although often endorsed—did not

seem to play an analogous role. (Only 2.6% of the responses were the lowest possible,

which turned out to be 1 rather than 0 because of a programming error, an error that

no subject complained about.)

Of primary interest, subjects endorsed the equality question more often than they

followed it literally in their numerical judgments, but their responses to this question

followed the same pattern. Again, we classified the comparisons into four types

according to what condition was common to both intervals being compared

(Common-top-normal, Common-top-not-normal, Common-bottom-death, and

Common-bottom-not-death), and we examined each subject’s proportion of each type of

response for each of these types. The effect of type was highly significant for all three

relevant measures: numerical utility judgments (F(3, 201) = 14.04, p = .0000), equality

responses (judgments of 100; F(3, 201) = 17.67, p = .0000), and equality agreement

(F(3, 201) = 29.34, p = .000). As in Experiment 1, Common-bottom-death differed very

clearly from all other types for utility judgments, equality responses, and equality

agreement.
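The four-way classification can be sketched in a few lines of Python. The sketch assumes each interval is written as a (top, bottom) pair using the labels from Table 4, with N read as normal health and D as death; the dictionary and function names are hypothetical.

```python
NORMAL, DEATH = "N", "D"

# The eight comparisons as (top, bottom) pairs, following the labels in Table 4.
COMPARISONS = {
    1: (("N", "D"), ("N", "L")),
    2: (("N", "D"), ("N", "A&L")),
    3: (("N", "A&L"), ("N", "L")),
    4: (("L", "D"), ("L", "A&L")),
    5: (("N", "D"), ("L", "D")),
    6: (("N", "D"), ("A&L", "D")),
    7: (("N", "A&L"), ("L", "A&L")),
    8: (("L", "D"), ("A&L", "D")),
}


def comparison_type(first, second):
    """Classify a pair of intervals by the end point they share."""
    (top1, bottom1), (top2, bottom2) = first, second
    if top1 == top2:
        return "Common-top-normal" if top1 == NORMAL else "Common-top-not-normal"
    if bottom1 == bottom2:
        return "Common-bottom-death" if bottom1 == DEATH else "Common-bottom-not-death"
    raise ValueError("intervals share neither a top nor a bottom")


types = {number: comparison_type(*pair) for number, pair in COMPARISONS.items()}
for number, kind in types.items():
    print(number, kind)
```

Under this reading, comparisons 1 to 3 share normal health at the top, comparisons 5, 6, and 8 share death at the bottom, and comparison 7 is the only Common-bottom-not-death case, which matches how the text refers to them.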

Unlike Experiment 1, some of the other type differences were significant too. For

equality responses and equality agreement, all of these involved the

Common-bottom-not-death type (Condition 7 in Tables 1 and 4), which was higher than both

Common-top-normal and Common-top-not-normal types (p < .025 in all cases).

These results suggest that subjects pay more attention to the bottom of the interval

than the top in making these equality judgments.

The pattern for the dominance question is not just the reverse of the equality

question, as seen in Table 4. Agreement with the dominance question is most frequent

when one of the intervals includes death and the other does not (cases 1, 2, and 4 vs.

all others; t(67) = 6.48, p = .0000, for the means of these cases vs. the means of the

others). Subjects seem to apply a principle that preventing death dominates

preventing anything else.

Table 4

Mean judged ratios of utility differences, percentage of equality responses, and percent agreement with equality and dominance questions, Experiment 2

                        Mean ratio          Equality responses  Equality agree      Dominance agree
Comparison              Diff.     PTO       Diff.     PTO       Diff.     PTO       Diff.     PTO
1. N–D vs. N–L          51        58        4         29        13        19        53        59
2. N–D vs. N–A&L        61        60        5         27        21        30        42        47
3. N–A&L vs. N–L        60        67        12        26        33        40        21        27
4. L–D vs. L–A&L        54        57        8         21        20        27        45        49
5. N–D vs. L–D          72        76        29        47        52        69        24        16
6. N–D vs. A&L–D        64        66        24        36        43        57        24        22
7. N–A&L vs. L–A&L      61        67        17        32        31        52        32        24
8. L–D vs. A&L–D        71        74        32        36        56        61        18        18

PTO and difference judgments also differed significantly in utility judgments,

equality responses, and equality agreement (p < .03 for all), with more equality and

higher utilities in PTO. Dominance responses showed no effect.

In sum, direct judgments of the equality of intervals support our interpretation of

the results of Experiment 1. Agreement that intervals are equal was more frequent

for Common-bottom-death, that is, when both intervals compared involve saving

lives. We also found some evidence of greater attention to the bottom of the interval

than the top, and for judgments of dominance when one bottom involved death and

the other did not.

Results for superadditivity replicated those of Experiment 1, although the effect

was somewhat larger (with a minimum of .21, for the difference method when normal

health was involved but not death). Again, the effect was significantly reduced when

equality responses were omitted (t(33) = 3.03, p = .0048). This time, however, there

was no significant superadditivity effect when the equality responses were removed.

Note that many subjects had insufficient data for these tests, because the number of

equality responses was so much higher than in Experiment 1.

Ratio inconsistency was, again, significantly positive for all measures (p < .025,

with or without equality responses). Again, this measure of inconsistency was not

affected by method (difference vs. PTO), by whether the comparisons involved

normal health (comparisons 1–3) or death (comparisons 5, 6, and 8), or by whether

or not equality responses were excluded.

4. General discussion

We examined two methods, direct judgment and person-trade-off. The former

asks the subject to think about utility in an abstract way. The latter infers the

subject’s utility from a matching response in a hypothetical decision. Both methods

are subject to distortions resulting from scaling effects or heuristics.

Our results suggest that further use of these methods, and possibly others as

well, should include checks for consistency, such as those we have used. Such checks

may help respondents bring their judgments into line with their true values, or

construct their values through a process of reflection with the help of the consistency

checks. For example, a person might come to agree that the difference between death

and normal health is greater than the difference between death and some state of

impairment, after reflecting on the implication of equality judgments that the im-

pairment is no different from normal health.

4.1. Equality responses and superadditivity

We found evidence for a heuristic or principle of equality in which subjects judged

the intervals being compared as equal. This principle led to higher mean utility ratios

when they were measured with death as the reference point, especially in PTO, and

the principle also led to superadditivity. It is possible that some of these responses

resulted from careless responding. Careless responding, however, cannot account for

the observed differences among conditions. The results of Experiment 2, in which

subjects acknowledge equality in a consistent pattern, also argue against a

carelessness account. Equality responses in PTO (but not in direct rating) have also been

found using interview methods (Pinto-Prades & Lopez-Nicholás, 1998).

We would expect superadditivity to result from equality responses. If, for example,

a subject judged $J\{[u(A) - u(B)]/[u(A) - u(C)]\}$ to be 1 and

$J\{[u(B) - u(C)]/[u(A) - u(C)]\}$ to be greater than 0, then the sum would be greater than 1. In

Experiment 1, superadditivity was, in fact, reduced when these equality responses were

removed, but it was not eliminated. Note also that ratio inconsistency was not reduced by

removal of equality responses, and there is no general reason to expect such reduction.

If utility measures are affected by a principle of equality, it is not clear which

measures are relevant to which decisions. At issue, in part, are the reasons for the

principle. One possible reason is that subjects regard handicaps as irrelevant when

they are properties of people. In this regard, they are like economic or social status,

outside the bounds of medical decisions: we do not take these things into account in

allocating health care. This explanation, however, is inconsistent with our finding

that equality responses are particularly common only when death is at the bottom

end of both intervals being compared. If the top condition was generally irrelevant

when it differed between the two intervals, then equality responses would be used as

much when the bottom condition was some non-death condition as when it was

death. It seems that the equality responses are peculiar to comparisons involving

death. Moreover, equality responses involving death were prevalent in difference

judgments as well as PTO judgments.

Another possible explanation of equality responses is that subjects regard

preventing death or saving lives as lexically ordered before any other health outcomes

(Norcross, 1997). This, too, conflicts with our results, as we did not find many “zero”

responses, in which subjects gave infinite weight to death in comparisons where the

top of the two intervals was constant (comparisons 1 and 2). In Experiment 2, the

pattern of responses to the dominance question, which assessed this principle di-

rectly, was somewhat different from that of the equality responses. On the other

hand, the same experiment provided some evidence that, in general, people pay more

attention to the bottom of an interval, so they are more likely to make equality

responses if the bottom of two intervals is the same than if the top is the same.

One possible explanation, consistent with our results, is that people employ a

heuristic principle that is specific to life-saving, which is that decisions about life-

saving are special, in that everyone should be treated equally. The use of this principle

for PTO measures is consistent with the finding that these measures showed more

equality responses than difference measures. These results are also consistent with the

findings of Ubel and Richardson (2002): In a PTO task, subjects compared saving the

lives of people with paralysis of their legs to saving the lives of people in normal

health. The people were either paralyzed before they became ill or they were in normal

health before they became ill. When the pre-existing paralysis case came first, most

subjects gave equal priority to the two life-saving procedures, but, when the people

were normal beforehand, most subjects gave priority to restoring people to normal

health over restoring them to a state of paralysis. Evidently, the pre-existing condition

elicited the equality heuristic more often than did the control condition.

We found equality responses for difference judgments as well as for PTO, and they

were more frequent when death was at one end of both intervals. This finding suggests

that people sometimes see saving lives as equally valuable regardless of the end state

after the life is saved. Difference judgments are about magnitudes, and they are not

hypothetical decisions. Equality responses for difference judgments, thus, would result

from a judgment of the value for each individual saved. Equality responses in general,

then, are not easily understood as resulting from a heuristic involving fairness, since

they occur for difference judgments as well as PTO judgments. Also, fairness to the

handicapped would seem to apply even when the bottom condition is not death, and

death is crucial for equality judgments in both PTO and difference judgments.

In principle, such a judgment that two different intervals are equal, or nearly

equal, could be correct, if people think of intervals as changes from one state to

another rather than as differences between two persistent states. The utility of a

change need not depend solely on the difference between the utility of being in the


start state and the utility of being in the end state. For example, the utility of change

itself might have some maximum. Thus, the utility of a change from A to B might be

just as great as the utility of a change from A to C, even if the change from B to C has

non-zero utility as well.

It is unlikely, however, that this perception of value is typically accurate for the

affected people. Given a choice between having one’s life saved and being

handicapped and having one’s life saved and not being handicapped, most people would

strongly prefer the latter, even in the face of great happiness in knowing that their life

would be saved. Although we could simply assume that these judgments represent

true expressions of public preferences, we should find out how people respond to

challenges, either by asking them outright whether they prefer to be handicapped or

not or by explaining the superadditivity effect to them. (We may also try to explain to

them that they are to judge states rather than changes.)

Superadditivity could also result from a tendency to see intervals as more equal

than they are. This, in turn, could result from declining influence of differences

further from a reference point, as discussed in the Introduction.

4.2. Ratio inconsistency

Ratio inconsistency, which we also found, can be described as a failure to make

sufficient distinctions among high and low ratios; subjects tend to give responses

toward the middle of the scale they are given (Poulton, 1979).

More generally, our results suggest two general problems with utility measure-

ment, one resulting from the use of heuristic principles and the other resulting from

psychophysical effects. Further research is needed to examine the malleability of

these effects, and also whether these effects are found in other measures of utility,

such as standard gambles and time-trade-offs. We are undertaking this research.

These are not the only possible problems, of course.

4.3. PTO as a utility measure

Some have argued that the PTO is not a utility measure but, instead, a measure of

societal value (Nord, Pinto, Richardson, Menzel, & Ubel, 1999). As such, it is

acceptable, they argue, for PTO measurements to show superadditivity. If, for

example, saving the life of someone who is left blind is just as valuable as saving someone

else’s life who will have full vision, according to a PTO measurement, then people

should still be allowed to say that there is value to curing blindness. They would not

be able to do so if the PTO was a measure of utility and was not allowed to show

superadditivity.

Suppose that treatment A saves people’s lives and leaves them blind, and treatment

B cures blindness. Treatment C saves people’s lives and leaves them with full vision.

People judge that giving A to 10 people is as valuable as giving C to 10 people, and

giving B to 10 people is half as valuable as either of the other treatments (i.e., giving

B to 20 people is equally valuable). Then, if value is additive, the value of giving A,

followed by B, to 10 people would be greater than the value of giving C to 14 people.

(Assume that patients are under anesthesia once when A and B are given to the same

patients.) Yet the result for each person treated would be the same: a life saved, with full vision.
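The arithmetic behind this example can be laid out in a short Python sketch. It takes one life saved with full vision (treatment C for one person) as the unit of value and assumes values simply add across treatments and people; the names are hypothetical.

```python
# Value per treated person, in units of "one life saved with full vision" (C).
VALUE_PER_PERSON = {
    "A": 1.0,  # judged equal to C (an equality response)
    "B": 0.5,  # judged half as valuable as A or C
    "C": 1.0,
}


def additive_value(treatments, people):
    """Total value if the values of the listed treatments simply add up."""
    return people * sum(VALUE_PER_PERSON[t] for t in treatments)


a_then_b = additive_value(["A", "B"], 10)
c_alone = additive_value(["C"], 14)
print(a_then_b, c_alone)  # prints: 15.0 14.0
```

Under additivity, the 10 people given A followed by B count for more (15 units) than the 14 people given C (14 units), even though each of those 10 ends up in the same state as a C recipient, so fewer people are left alive with full vision.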

The only escape from this apparent conflict is either to say that values of events

are not additive, or else to say that decisions should not be based on consequences.

Ubel et al. (2000) have discussed the possibility that values need not be additive. In

principle, such a state of affairs could prevent us from making inferences about

choices, without asking about the particular choice. In practice, such situations may

be rare, and unproblematic. We have suggested earlier that, for some situations

involving changes, values need not be additive.


If decisions are not based on consequences, then we must ask for justification. If,

for example, we choose the combination of A and B over giving C to more people,

how can we explain to the people who would have gotten C why they are not getting

it? We cannot say that we had to make a choice and that the benefits to others were

greater. Moreover, suppose people were behind a veil of ignorance and did not know

which group they were in, the A + B group (10) or the C group (14), but their chance

of being in each group depended on its size. In this case, the policy of choosing C

would be better for each individual. We thus could make a decision that is judged to

be good on the basis of a PTO judgment and that makes everyone worse off. (The

fact that people are worse off only ex ante does not seem to weaken the force of this

argument, which is based on the work of Kaplow & Shavell (2000).)
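The veil-of-ignorance comparison can be made concrete with a small sketch. It assumes, as one reading of the text, that the 24 people in the two groups are the whole population at risk, that a person's chance of being in a group is proportional to its size, and that everyone treated ends up in the same state (alive, with full vision). The names are hypothetical.

```python
from fractions import Fraction

GROUP_SIZES = {"A+B": 10, "C": 14}
TOTAL = sum(GROUP_SIZES.values())


def chance_of_benefit(policy):
    """Chance that a randomly placed person is in the group the policy treats."""
    return Fraction(GROUP_SIZES[policy], TOTAL)


for policy in GROUP_SIZES:
    print(policy, chance_of_benefit(policy))
```

Because the end state is the same under either policy, each person's prospects depend only on the chance of being treated, which is 7/12 under C and 5/12 under A followed by B.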

In addition, even as a measure of societal value and not utility, PTO measures

should not show ratio inconsistency, as they have done in this study (Baron et al.,

2001). The problem is again that, by combining judgments we could arrive at dif-

ferent conclusions about the same choice. If curing one-eye blindness in 200 people is

as good as curing blindness in 50, and if curing blindness in 50 is as good as saving

the lives of 10, then we must conclude that curing one-eye blindness in 200 is better

than saving the lives of 9, yet someone might judge that the figure is really 18 rather

than 9. Thus, preference itself would be intransitive.
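A short sketch shows how chaining the two PTO judgments and using the direct judgment can point in opposite directions. The two equivalences and the direct figure of 18 come from the example above; the alternative program that saves 15 lives is a hypothetical test case, as are the names.

```python
from fractions import Fraction

# PTO equivalences from the text.
ONE_EYE_PER_BLIND = Fraction(200, 50)  # 200 one-eye cures are as good as 50 blindness cures
BLIND_PER_LIFE = Fraction(50, 10)      # 50 blindness cures are as good as 10 lives saved


def lives_equivalent(one_eye_cures):
    """Lives saved implied by chaining the two PTO judgments."""
    return one_eye_cures / (ONE_EYE_PER_BLIND * BLIND_PER_LIFE)


chained = lives_equivalent(200)  # implied by the two judgments together
direct = 18                      # the direct judgment mentioned in the text
lives_offered = 15               # hypothetical alternative program

prefers_one_eye_by_chain = chained > lives_offered
prefers_one_eye_directly = direct > lives_offered
print(chained, prefers_one_eye_by_chain, prefers_one_eye_directly)  # prints: 10 False True
```

For any alternative between 10 and 18 lives, the chained judgments favor saving lives while the direct judgment favors curing one-eye blindness, which is the intransitivity the paragraph describes.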

These considerations argue that the PTO measure must behave like a utility

measure, although it could still reflect a different kind of utility that we might call

societal utility. Yet, consider again the veil of ignorance. Any PTO judgment has an

equivalent judgment using gambles for identically placed individuals. (The fact that

people are rarely identically placed is, again, irrelevant, since a normative theory of

the sort we are considering must apply everywhere.) If the PTO judgment disagrees

with the judgment of gambles, then it could lead to a societal decision that makes

everyone worse off, in terms of their own judgments of personal good. This sort of

argument could apply to any other utility measure (including direct ratings, if people

take their direct ratings to be interval measures of personal good).

4.4. Conclusion

Our findings suggest that practical uses of utility elicitation methods, even

methods as simple as direct judgment, should be accompanied by consistency checks

of the sort we have used (as suggested, for example, by Keeney & Raiffa, 1976, p.

271, and Baron et al., 2001). Consistency checks are common practice in

decision analysis but are rarely used in other traditions of practical judgment

elicitation. The evidence to date suggests that people can make judgments that are both

consistent and honest. That is, people are generally willing to accept the conclusion

that inconsistent judgments are erroneous (Baron et al., 2001).

When consistency checks fail, subjects should be confronted with the results and

asked to resolve them. When one of the ends of the scale is death and when subjects

give equality responses, they should also be asked to reflect on these responses,

asking themselves, for example, whether, given that someone’s life was to be saved, it

really did not matter what state they were in after that. This can be done in a

personal interview, or with the use of an interactive computer program, or both

together. The equality effect is not, itself, inherently inconsistent for a single judgment,

but it apparently leads to superadditivity when it is combined with other judgments,

so the checks should focus on the inconsistency rather than on the equality effect

itself. Moreover, as we noted, we have no reason to expect reduction of the equality

effect to influence other types of inconsistency, aside from superadditivity.

Our results also suggest that, when PTO and difference judgments differ, the

difference judgments yield fewer equality responses and are thus somewhat less

subject to inconsistencies. This does not eliminate the need for consistency checks.


But our results suggest that direct comparison of intervals remains a viable method

for assessment of utility.

Our results are limited to the two methods we have used. Similar inconsistencies

have been found using other methods. For example, Ubel et al. (1996) found ratio

inconsistency in the standard-gamble method (which is based on expected-utility

theory), and O’Leary, Fairclough, Jankowski, and Weeks (1995) found unwillingness

to trade off any time in the time-trade-off method, a result that may be analogous to equality

responses here. As yet, we do not know whether the same patterns of results would

be found for these other measures (involving the nature of the common bottom) as

we found here.

Acknowledgments

This research was supported by N.S.F. grant SBR95–20288, and by a grant from

the Penn Cancer Center. Peter Ubel’s work was supported by the Department of

Veterans Affairs through a Career Development Award in health services research

and by the Robert Wood Johnson Foundation’s Generalist Physician Faculty

Scholar Program. We thank Gretchen Chapman, Michael DeKay, Leslie Lenert and

the reviewers for comments on a draft.

Appendix

Why additivity is special

Additivity, under reasonable conditions, implies ratio consistency, but ratio

consistency does not imply additivity. Moreover, additivity is closer to being a sufficient

condition for a utility scale. (The additional assumptions required are likely to hold.)

Additivity as defined above can be seen as a special case of a more general

additivity property of utility judgments:

$$J\{[u(A) - u(B)]/[u(E) - u(F)]\} + J\{[u(B) - u(C)]/[u(E) - u(F)]\} = J\{[u(A) - u(C)]/[u(E) - u(F)]\}.$$

The denominator need not be $u(A) - u(C)$; the denominator $u(E) - u(F)$ simply defines the unit of

measurement. This general additivity property implies “monotonicity” (Krantz, Luce,

Suppes, & Tversky, 1971, p. 145; actually, as stated, “weak monotonicity” is based

on inequality rather than equality, but we use the equality version of p. 137 for

clarity of exposition). Specifically, omitting the denominator $J[u(E) - u(F)]$ (which is

the unit of measurement for all judgments), the monotonicity property is: if

$$J[u(A) - u(B)] = J[u(A') - u(B')] \quad\text{and}\quad J[u(B) - u(C)] = J[u(B') - u(C')],$$

then

$$J[u(A) - u(C)] = J[u(A') - u(C')].$$

In other words, if the interval between $A'$ and $B'$ is judged equal to that between $A$

and $B$ (where the primes indicate different health states) and $B'$ and $C'$ is equivalent

to $B$ and $C$, then $A'$ and $C'$ is equivalent to $A$ and $C$. This is the most critical property

required for an interval scale of utility. The general form of the additivity test (with

$J[u(E) - u(F)]$ in the denominator) implies monotonicity, since $J[u(A) - u(C)]$ is the

sum of $J[u(A) - u(B)]$ and $J[u(B) - u(C)]$ and must therefore have the same value for

any other intervals equal to these two, respectively. Here, as noted, we test a special

case in which A and C define the unit, but, if additivity fails for this case, it is not

generally true, so a utility scale cannot be constructed. If such a scale can be

constructed, then ratio consistency also holds.

But the converse does not hold. For example, suppose that judgments are based

on the square of utility differences rather than the differences themselves. Thus

$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} = [u(A) - u(B)]^2/[u(A) - u(C)]^2.$$

Then the following are equivalent:

1. $J\{[u(A) - u(B)]/[u(A) - u(C)]\} \times J\{[u(A) - u(C)]/[u(A) - u(D)]\} = J\{[u(A) - u(B)]/[u(A) - u(D)]\}$

2. $\{[u(A) - u(B)]^2/[u(A) - u(C)]^2\} \times \{[u(A) - u(C)]^2/[u(A) - u(D)]^2\} = \{[u(A) - u(B)]^2/[u(A) - u(D)]^2\}$

3. $\{[u(A) - u(B)]/[u(A) - u(C)]\} \times \{[u(A) - u(C)]/[u(A) - u(D)]\} = \{[u(A) - u(B)]/[u(A) - u(D)]\}$

So ratio consistency will still hold, but additivity will not hold in general, since

$$\begin{aligned}
J\{[u(A) - u(B)]/[u(A) - u(C)]\} &+ J\{[u(B) - u(C)]/[u(A) - u(C)]\} \\
&= \{[u(A) - u(B)]^2/[u(A) - u(C)]^2\} + \{[u(B) - u(C)]^2/[u(A) - u(C)]^2\} \\
&= \frac{[u(A)^2 - 2u(A)u(B) + u(B)^2] + [u(B)^2 - 2u(B)u(C) + u(C)^2]}{[u(A) - u(C)]^2},
\end{aligned}$$

which need not equal $\{[u(A) - u(B)] + [u(B) - u(C)]\}^2/[u(A) - u(C)]^2 = 1$.

In sum, additivity (together with other assumptions that are likely to hold, such as

larger intervals having greater utilities) implies a utility scale and the two consistency

measures do not.
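The squared-difference example can be checked numerically. The sketch below uses arbitrary illustrative utilities with u(A) > u(B) > u(C) > u(D); the utilities and names are hypothetical, and the judgment function is the squared ratio assumed in the example above.

```python
import math


def judged_ratio(numerator, denominator):
    """Hypothetical judgment function: the squared ratio of two utility differences."""
    return (numerator / denominator) ** 2


# Arbitrary illustrative utilities with u(A) > u(B) > u(C) > u(D).
u = {"A": 1.0, "B": 0.7, "C": 0.4, "D": 0.0}

# Ratio consistency: J{(A-B)/(A-C)} * J{(A-C)/(A-D)} should equal J{(A-B)/(A-D)}.
product = judged_ratio(u["A"] - u["B"], u["A"] - u["C"]) * judged_ratio(
    u["A"] - u["C"], u["A"] - u["D"]
)
direct = judged_ratio(u["A"] - u["B"], u["A"] - u["D"])
ratio_consistent = math.isclose(product, direct)

# Additivity: J{(A-B)/(A-C)} + J{(B-C)/(A-C)} should equal 1.
total = judged_ratio(u["A"] - u["B"], u["A"] - u["C"]) + judged_ratio(
    u["B"] - u["C"], u["A"] - u["C"]
)
additive = math.isclose(total, 1.0)

print(ratio_consistent, additive, round(total, 3))  # prints: True False 0.5
```

With these numbers the two ratios sum to 0.5 rather than 1, so this particular judgment function fails additivity by falling short of 1. It illustrates that ratio consistency can hold without additivity; it is not a model of the superadditivity observed in the experiments.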

Why inversion consistency is a special case of ratio consistency

Inversion consistency (Baron et al., 2001) refers to the effect of direction of

comparison, $J\{[u(A) - u(B)]/[u(A) - u(C)]\}$ vs. $J\{[u(A) - u(C)]/[u(A) - u(B)]\}$. For

example, “How big is the difference between normal health and blindness compared

to that between normal health and blindness-and-deafness,” vs. “How big is the

latter difference compared to the former.” If subjects are biased toward higher

numerical responses, no matter what the question, then the product of the two ratios

will be greater than 1, as found for a number of psychophysical judgments (Fagot,

1979; Fagot & Pokorny, 1989). High numbers could result from a tendency to give

numbers in the middle of the given range; for numbers over 100, subjects could see

the top of the range as very high. Inversion consistency can be seen as a special case

of ratio consistency, as shown below.

Inversion consistency amounts to

$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} \times J\{[u(A) - u(C)]/[u(A) - u(B)]\} = 1,$$

but it can also be written as

$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} \times J\{[u(A) - u(C)]/[u(A) - u(B)]\} = J\{[u(A) - u(B)]/[u(A) - u(B)]\}.$$

Ratio consistency says that

$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} \times J\{[u(A) - u(C)]/[u(A) - u(D)]\} = J\{[u(A) - u(B)]/[u(A) - u(D)]\}.$$

If we set D equal to B, it is apparent that ratio consistency implies inversion

consistency, if ratio consistency holds for reversals (such as

$J\{[u(A) - u(C)]/[u(A) - u(B)]\}$).

Because inversion consistency is a special case of ratio consistency, it is no more

definitive as a consistency check. It is necessary, but not sufficient, for a utility scale.
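A brief sketch illustrates the inversion check on two judgment functions: the squared-ratio function from the previous section of this Appendix and a hypothetical function that inflates every response by a constant factor. The factor of 1.2 and all names are assumptions made for illustration.

```python
import math


def squared(ratio):
    """Judgment function from the squared-difference example."""
    return ratio ** 2


def inflated(ratio, bias=1.2):
    """Hypothetical judgment function that pushes every response upward."""
    return bias * ratio


def inversion_product(judge, ratio):
    """Product of the judged ratio and the judged reversed ratio."""
    return judge(ratio) * judge(1.0 / ratio)


squared_ok = math.isclose(inversion_product(squared, 0.5), 1.0)
inflated_product = inversion_product(inflated, 0.5)
print(squared_ok, inflated_product > 1.0)  # prints: True True
```

The squared function passes the inversion check even though it does not yield an additive scale, which illustrates why the check is necessary but not sufficient; the inflated function produces a product above 1, the pattern described for subjects biased toward higher numerical responses.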

References

Baron, J. (1997). Biases in the quantitative measurement of values for public decisions. Psychological

Bulletin, 122, 72–88.


Baron, J., Wu, Z., Brennan, D. J., Weeks, C., & Ubel, P. A. (2001). Analog scale, ratio judgment and

person-trade-off as measures of health utility: biases and their correction. Journal of Behavioral

Decision Making, 14, 17–34.

Birnbaum, M. H. (1978). Differences and ratios in psychological measurement. In N. Castellan & F.

Restle (Eds.), Cognitive theory (Vol. 3, pp. 33–74). Hillsdale, NJ: Erlbaum.

Birnbaum, M. H. (Ed.). (2000). Psychological experiments on the internet. New York: Academic Press.

Birnbaum, M. H., & Sutton, S. E. (1992). Scale convergence and utility measurement. Organizational

Behavior and Human Decision Processes, 52, 183–215.

Bleichrodt, H., & Johanneson, M. (1997). The validity of QALYs: An experimental test of constant

proportional trade-off and utility independence. Medical Decision Making, 17, 21–32.

de Neufville, R., & Delquié, P. (1988). A model of the influence of certainty and probability “effects” on

the measurement of utility. In B. Munier (Ed.), Risk, decision, and rationality (pp. 189–205). Dordrecht:

Reidel.

Fagot, R. F. (1978). A theory of relative judgment. Perception and Psychophysics, 24, 243–252.

Fagot, R. F., & Pokorny, R. (1989). Bias effects on magnitude and ratio estimation power function

exponents. Perception and Psychophysics, 45, 221–330.

Fischer, G. W. (1985). Range sensitivity of attribute weights in multiattribute value models. Organizational

Behavior and Human Decision Processes, 62, 252–266.

Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decisions under risk. Econometrica,

47, 263–291.

Kaplow, L., & Shavell, S. (2000). Principles of fairness versus human welfare: On the evaluation of legal

policy. Discussion Paper No. 277, Center for Law, Economics, and Business, Harvard Law School.

(http://www.law.harvard.edu/programs/olin_center).

Keeney, R. L., & Raiffa, H. (1976). Decisions with multiple objectives. New York: Wiley.

Krabbe, P. F. M., Essink-Bot, M.-L., & Bonsel, G. K. (1997). The comparability and reliability of five

health-state valuation methods. Social Science and Medicine, 45, 1641–1652.

Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). Foundations of measurement (Vol. 1). New

York: Academic Press.

Llewellyn-Thomas, H. A. (1997). Investigating patients’ preferences for different treatment options.

Canadian Journal of Nursing Research, 29, 45–64.

Mellers, B. A., & Birnbaum, M. H. (1983). Loci of contextual effects in judgment. Journal of Experimental

Psychology: Human Perception and Performance, 8, 582–601.

Miyamoto, J. M., & Eraker, S. A. (1988). A multiplicative model of survival duration and health quality.

Journal of Experimental Psychology: General, 117, 3–20.

Norcross, A. (1997). Comparing harms: Headaches and human lives. Philosophy and Public Affairs, 26,

135–167.

Nord, E. (1995). The person-trade-off approach to valuing health care programs. Medical Decision

Making, 15, 201–208.

Nord, E. (1999). Cost-value analysis in health care: Making sense out of QALYs. Cambridge, UK:

Cambridge University Press.

Nord, E., Pinto, J. L., Richardson, J., Menzel, P., & Ubel, P. A. (1999). Incorporating societal concerns

for fairness in numerical valuations of health programs. Health Economics, 8, 25–39.

O’Leary, J. F., Fairclough, D. L., Jankowski, M. K., & Weeks, J. C. (1995). Comparison of time-trade-off

utilities and rating scale values of cancer patients and their relatives. Medical Decision Making, 15,

132–137.

Pinto-Prades, J.-L., & Lopez-Nicholás, A. (1998). More evidence on the plateau effect: A social

perspective. Medical Decision Making, 18, 287–294.

Poulton, E. C. (1979). Models for biases in judging sensory magnitude. Psychological Bulletin, 86, 777–

803.

Soetikno, R. M., Mrad, R., Pao, V., & Lenert, L. A. (1997). Quality-of-life research on the Internet:

feasibility and potential biases in patients with ulcerative colitis. Journal of the American Medical

Informatics Association, 4, 426–435.

Soetikno, R. M., Provenzale, D., & Lenert, L. A. (1997). Studying ulcerative colitis over the World Wide

Web. American Journal of Gastroenterology, 92, 457–460.

Thaler, R. (1985). Mental accounting and consumer choice. Marketing Science, 4, 199–214.

Tversky, A., & Kahneman, D. (1992). Advances in prospect theory: Cumulative representations of

uncertainty. Journal of Risk and Uncertainty, 5, 297–323.

Ubel, P. A., Loewenstein, G., Scanlon, D., & Kamlet, M. (1996). Individual utilities are inconsistent with

rationing choices: A partial explanation of why Oregon’s cost-effectiveness list failed. Medical Decision

Making, 16, 108–116.

Ubel, P. A., Nord, E., Gold, M., Menzel, P., Pinto Prades, J. L., & Richardson, J. (2000). Improving value

measurement in cost-effectiveness analysis. Medical Care, 38, 892–901.

Ubel, P. A., Richardson, J., & Baron, J. (2002). Exploring the role of order effects in person-trade-off

elicitations. Health Policy, 61, 189–199.



Ubel, P. A., Loewenstein, G., Scanlon, D., & Kamlet, M. (1998). Value measurement in cost-utility

analysis: Explaining the discrepancy between analog scale and person-trade-off elicitations. Health

Policy, 43, 33–44.

von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. New York:

Cambridge University Press.

Received 14 September 2000

