Types of inconsistency in health-state utility judgments
Jonathan Baron a,* and Peter A. Ubel b
a University of Pennsylvania, Pennsylvania, USA
b University of Michigan, Michigan, USA
Abstract
In making judgments of health-related quality of life, respondents often compare the
relative magnitude of two intervals between health states, such as the interval between
normal health and blindness compared to that between normal health and death. We ex-
amined two ways of comparing such intervals: person-trade-off (PTO)—in which the judg-
ment concerns matching numbers of people so that two changes are equivalent—and direct
judgment of the ratio. Both measures showed ratio inconsistency (a ratio that should be the
product of two other ratios is too high) and superadditivity (two ratios that should add to 1
are too high). Some responses in both methods implied that two intervals which should have
been different (because they shared a top or bottom point, but differed on the other point)
were nevertheless viewed by subjects as being of equal size. These equality responses were
more common when death was the bottom (worse end) of both intervals being compared
(e.g., the interval between death and blindness is perceived as being the same size as the
interval between death and normal health) than when any other condition was at the bottom
or when the condition common to the two intervals was at the top. A second experiment
indicated that subjects really do consider the intervals to be equal. Our findings argue
for giving subjects a chance to reflect on such apparent inconsistencies in practical utility
elicitation.
© 2002 Elsevier Science (USA). All rights reserved.
1. Introduction
Choice among programs, such as those concerned with allocation of medical
resources, is sometimes based on measurement of benefits of competing options on a
common scale of utility. Utility measures require numerical judgments based on
descriptions of health states. In recent years, a few methods have been used to elicit
these judgments. In standard gambles, subjects are asked for a probability p at which
they would be indifferent between obtaining some health state (e.g., blindness) and a
gamble with a p chance of obtaining a better health state (e.g., normal health) and a
1 − p chance of obtaining a worse state (e.g., death). In direct judgment, they are
asked to assign a number to blindness on a scale in which 0 is death and 100 is
normal health (for example). In time-trade-off, they are asked how many years of
normal health, followed by death, is equivalent to, for example, 10 years of blindness
followed by death. And in person-trade-off, they are asked, for example, preventing
how many people from dying is just as good as preventing 100 people from becoming
blind. These methods have been used in practical decisions and have been the topic
of extensive study (Baron, 1997; Bleichrodt & Johanneson, 1997; Krabbe, Essink-Bot,
& Bonsel, 1997; Llewellyn-Thomas, 1997; Nord, 1995, 1999; Ubel, Loewenstein,
Scanlon, & Kamlet, 1996, 1998). Similar methods have been used in decision analysis
(Keeney & Raiffa, 1976; von Winterfeldt & Edwards, 1986).

*Corresponding author. Department of Psychology, University of Pennsylvania, 3815 Walnut Street,
Philadelphia, PA 19104-6196, USA.
E-mail address: [email protected] (J. Baron).
Organizational Behavior and Human
Decision Processes 89 (2002) 1100–1118
www.elsevier.com/locate/obhdp
0749-5978/02/$ - see front matter © 2002 Elsevier Science (USA). All rights reserved.
PII: S0749-5978(02)00019-5

At least three factors make such measurement difficult (Baron, 1997). First, the
judgment may over- or underweigh some factor that makes a health condition more
or less severe, such as the extent of adaptation to a chronic condition. More gen-
erally, people may change the way in which they think about a health condition as
they reflect on what it involves. This is not an issue that we address here. Second,
judgments may be affected by aspects of the judgment task that are unrelated to the
utility of the conditions. For example, responses to hypothetical gambles may be
affected by attitudes toward risk or by the overweighing of certainty (de Neufville &
Delquié, 1988; Tversky & Kahneman, 1992). Third, numerical responses may be
distorted by scaling problems, such as tendencies to underweigh differences farther
away from a reference point (Tversky & Kahneman, 1992) or to space judgments
evenly along the given range (Mellers & Birnbaum, 1983; Poulton, 1979).
The studies reported here investigate the last two problems as they apply to two
methods that might be used for utility estimation: difference judgments and person-trade-off
(PTO). The main purpose of this article is to determine the nature of internal
inconsistencies in PTO and direct measures, so that we can know how to deal
with them for practical purposes. One inconsistency of particular interest results
from judgments which imply that two health states are equally distant from a third
(e.g., death), yet still different from each other.
We choose these two methods because they represent two general approaches to
measurement. In direct judgment, subjects focus on the conditions and make a sort
of psychophysical comparison of magnitudes. The PTO, by contrast, requires a
hypothetical decision, and is therefore subject to influence by factors other than the
magnitudes in question. The PTO, like standard gamble and time-trade-off, has been
advocated exactly because it is potentially sensitive to factors, such as fairness, other
than utility magnitudes of health states.
1.1. The two methods: Difference judgments and person-trade-off
We think of each judgment as a comparison of two utility intervals, such as the
interval between condition A and condition B versus the interval between conditions
A and C. One of the ‘‘conditions’’ could be normal health, or death. Each judgment
permits us to infer the ratio of the two utility differences, e.g., the size of the smaller
interval relative to the larger. In difference judgments, this is what the subject is
asked to judge.
In PTO judgments, we infer the judgment from the subject's answer. If preventing
100 people's health from declining from Health Condition A to B is judged to be
equally important as preventing 200 people from declining from A to C, we infer that
the difference between A and C is half of the difference between A and B. This inference
assumes that the subject takes the utilitarian approach of adding differences
across people. (Again, we return to this assumption later.)
In difference judgments, the subject makes a direct comparison of the sizes of the
two intervals, such as the interval between normal health and blindness and the in-
terval between normal health and death. This corresponds to the method of swing
weights used in decision analysis (Fischer, 1985; von Winterfeldt & Edwards, 1986).
Difference judgments may be distorted in several ways (Baron, 1997; Birnbaum,
1978). Of interest here is the effect of distance from the reference point. Much evidence
supports the general conclusion that people underweigh differences that are more
distant from a reference point, regardless of whether the states at issue are better or
worse than the reference point. For example, people seem to judge that the disutility
of monetary losses is marginally declining as the loss increases; for example, losing
$100 seems less bad when it is added to a $900 loss than when it occurs alone
(Kahneman & Tversky, 1979; Thaler, 1985). When we measure the utility of health
states, two different reference points are available, normal health and death. It may
matter which we use. If some condition X is roughly half as bad as death, then people
may judge it as numerically closer to death when they use normal health as the ref-
erence, and as closer to normal health when they use death as the reference. When
people are asked ‘‘How large is the difference between normal health and X compared
to the difference between normal health and death?,’’ they might answer 60%, and
when they are asked ‘‘How large is the difference between X and death compared to
the difference between normal health and death?’’, they might answer 60% again.
Their answers would thus sum to 120%, not 100%. We call this effect superadditivity.
In PTO judgments, the subject provides a number of people such that some
benefit (harm) to that number is just as good (bad) as another benefit (harm) to a
given number of other people (Nord, 1995; Ubel et al., 1996, 2000). This provides a
utility measure if we assume that the judgment is utilitarian, so that the total benefit
to a group is just the benefit per person times the number of people. For example, if
preventing 250 cases of blindness is seen as equivalent to preventing 100 deaths (of
normal people), then we infer that the interval between normal and blindness is .40
of the interval between normal and death. This is because the total utility of preventing
blindness is .40 × 250, which equals the total utility of preventing 100 deaths.
If we infer the utility of blindness on a scale on which normal is 1 and death is 0, then
we infer that blindness has a utility of .60.
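The inference just described can be written out as a short calculation. This is only a sketch under the utilitarian assumption stated above (total benefit equals per-person benefit times the number of people); the function name `pto_ratio` and its parameter names are ours and purely illustrative.

```python
def pto_ratio(n_standard: float, n_matched: float) -> float:
    """Inferred size of the matched interval relative to the standard interval.

    Assumes utilitarian aggregation:
        n_matched * u_matched == n_standard * u_standard
    so u_matched / u_standard == n_standard / n_matched.
    """
    if n_matched <= 0:
        raise ValueError("n_matched must be positive")
    return n_standard / n_matched


# Example from the text: preventing 250 cases of blindness is judged
# equivalent to preventing 100 deaths (of people in normal health).
ratio = pto_ratio(n_standard=100, n_matched=250)  # (normal - blindness) / (normal - death)
utility_blindness = 1 - ratio                     # scale on which normal = 1 and death = 0
print(round(ratio, 2), round(utility_blindness, 2))
```

With these inputs the sketch gives the .40 interval ratio and the .60 utility for blindness reported in the text.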
PTO judgments are affected by principles of fairness, regardless of the condition
that remains when lives are saved (Nord, 1995; Ubel et al., 1996, 2000). PTO mea-
sures have been advocated as a way of capturing attitudes about policy. Some of
these attitudes are not utilitarian. Still, if these measures are not internally consistent,
it is difficult to see how they can be used. Internal inconsistency may arise from
fairness considerations. Of particular interest here is the desire to give patients equal
opportunity for treatment regardless of their present condition.
This principle may be applied in particular to saving lives. People may feel that the
decision to save lives should not depend on the quality of the life after it is saved.
There are many reasons for such a view. For example, people may feel that health
professionals should not be in the business of judging the worth of a person's life.
Once we allow such factors as handicaps to enter decisions about life saving, where
do we draw the line between these conditions and such other factors as number of
friends or economic productivity? People may also think in terms of the emotional or
utility effects of changes rather than differences between existence in two different
states, and it may be approximately true that the reaction to the change from a
prospect of certain death to the prospect of continued life may be just as good for a
person with a severe handicap as to a person without one. Or, alternatively, people
may just apply a heuristic of fairness.
1.2. Consistency checks
The present experiments look for these effects. In part for that purpose, we
test consistency among judgments in two ways, explained shortly, which we call ratio
consistency and additivity. Consistency checks are of interest in their own right. Few
efforts to test consistency are reported in the literature (Bleichrodt & Johanneson,
1997; Miyamoto & Eraker, 1988). For PTO, only ratio consistency has been tested
(Ubel et al., 1996), and it is, as we show, a necessary but not a sufficient condition for
utility measurement. Internal consistency is necessary if a judgment is to be inter-
preted as a judged utility function and used for making decisions that depend on
such judgments. Yet, both researchers and practitioners in medical areas almost
never test internal consistency in any way.
In the experiment reported here, subjects made PTO and difference judgments on
the same set of conditions, all of which are long lasting and serious. To describe the
judgments in general, we use the notation $u(A) - u(B)$ to mean the utility difference,
and we take $J\{[u(A) - u(B)]/[u(A) - u(C)]\}$ to refer to the judged or inferred ratio of
$u(A) - u(B)$ to $u(A) - u(C)$. We can think of the denominator $u(A) - u(C)$ as the unit
of utility. For example, $u(A) - u(B)$ could be the difference between normal health
and blindness, and $u(A) - u(C)$ could be the difference between normal health and
blindness-and-deafness. Someone who thought that blindness was half as bad as
blindness-and-deafness, relative to normal health, would judge the utility of
$u(A) - u(B)$ as .5 using $u(A) - u(C)$ as the standard. In other words, this person
would judge the $u(A) - u(B)$ interval as half the size of the $u(A) - u(C)$ interval.
As noted, we test two kinds of internal consistency: ratio consistency and additivity.
Ratio consistency (Fagot, 1978) tests the condition that
$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} \times J\{[u(A) - u(C)]/[u(A) - u(D)]\} = J\{[u(A) - u(B)]/[u(A) - u(D)]\}$$
If we suppose that judgments are insensitive to differences among utility intervals,
then
$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} \times J\{[u(A) - u(C)]/[u(A) - u(D)]\} < J\{[u(A) - u(B)]/[u(A) - u(D)]\}$$
For example, suppose that A is normal health, B is blindness, C is combined
blindness and deafness, and D is death. Then each interval is the ‘‘badness’’ of B,
C, or D compared to normal health. If B is half as bad as C and if C is half as bad as
D, then B should be a quarter as bad as D. If, in the extreme, a subject is so insensitive
to differences among utility intervals as to say ‘‘half’’ to every question, the
product of the first two judgments will still be a quarter, but the third judgment will
be a half. Such insensitivity could result from undersensitivity to large differences,
oversensitivity to small differences, or both. Ubel et al. (1996, 1998) found such an
effect in PTO and called it ‘‘multiplicative intransitivity.’’ Fagot (1978) found ratio
inconsistency (but not the same as we find) in a variety of psychophysical judgments;
we use Fagot's term.
Ratio inconsistency could also be related to the finding of Birnbaum and his
colleagues (Birnbaum, 1978; Birnbaum & Sutton, 1992) that subjects asked to judge
the ratio of two stimuli respond (with a nonlinear response function) to the difference
between the stimuli rather than to the ratio of their distances from zero (no stimu-
lation, in a sensory task). However, when subjects are asked for ratios of differences—
e.g., what is the ratio between the utility (or loudness, etc.) difference between A and
B and the difference between C and D?—they tend to base their responses correctly
on the ratio of the differences, and not the difference of the differences. Birnbaum's
result can be taken to imply that we must state the reference point explicitly if we
want subjects to use it, so we do this when we ask about differences. Explicitness in
stating the ends of ranges being compared is one of the prescriptions of decision
analysis (Fischer, 1985), but it is not used routinely in other value-elicitation tasks.
Still, explicit statement of end points may not remove all of the tendency to respond
in terms of differences. Such a tendency could be described as a neglect of changes in
the standard of comparison. When subjects attend more to differences (as opposed to
ratios), they might be undersensitive to changes in the standard of comparison they
are given, from question to question. Such underattention would also produce ratio
inconsistency and is, in our procedures, indistinguishable from other explanations of
this effect.
Additivity tests whether the judged utility of two parts of an interval add up to the
whole interval, that is, whether
$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} + J\{[u(B) - u(C)]/[u(A) - u(C)]\} = 1$$
assuming A, B, and C are ordered. For example, if the difference between normal
health and blindness is .3 of the difference between normal health and blindness-and-
deafness, then the difference between blindness and blindness-and-deafness should be
.7 of the difference between normal health and blindness-and-deafness. Superadditivity
could result from numbers that are too high (.4 and .8 instead of .3 and .7).
Superadditivity could also result from adoption of one of the states in the numerator
(e.g., A in $u(A) - u(B)$ or C in $u(B) - u(C)$) as the reference point, which would
produce the effect if we assume that differences near the reference point loom larger.
Additivity is nearly sufficient for a utility scale, as such a scale can be defined in
terms of units that are added. Additivity implies ratio consistency, but ratio
consistency does not imply additivity. Appendix A explains these relationships,
and it also discusses inversion consistency, which concerns the direction of comparison
(smaller to larger interval or the reverse). Inversion inconsistency was found by
Baron, Wu, Brennan, Weeks, and Ubel (2001). It is a special case of ratio consistency.
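One way to see that ratio consistency does not imply additivity is a small numerical sketch. The power response function $J(r) = r^k$ and the utility values below are our own hypothetical choices for illustration; they are not taken from the Appendix or from the data. Any such power function preserves products of ratios, but it breaks additivity whenever $k \neq 1$.

```python
# Hypothetical utilities on a 0-1 scale with A > B > C > D (illustrative only).
u = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.0}


def judged(x: str, y: str, z: str, w: str, k: float = 0.5) -> float:
    """Judged ratio of [u(x) - u(y)] to [u(z) - u(w)] under the assumed J(r) = r**k."""
    true_ratio = (u[x] - u[y]) / (u[z] - u[w])
    return true_ratio ** k


# Ratio consistency: J{AB/AC} * J{AC/AD} should equal J{AB/AD}.
product = judged("A", "B", "A", "C") * judged("A", "C", "A", "D")
direct = judged("A", "B", "A", "D")

# Additivity: J{AB/AC} + J{BC/AC} should equal 1.
total = judged("A", "B", "A", "C") + judged("B", "C", "A", "C")

print("ratio consistent:", abs(product - direct) < 1e-12)
print("superadditivity:", round(total - 1, 3))
```

Under these assumptions the product test is satisfied while the two part-to-whole ratios sum to more than 1, so a judge of this kind would pass a ratio-consistency check and still be superadditive.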
2. Experiment 1
This experiment compared two methods for eliciting judgments from which utility
can be inferred. Each method compares two intervals, such as the interval between
normal health and blindness or the interval between normal health and death. Each
of the two intervals has a top (best health state) and a bottom (worst health state).
Sometimes the two intervals share the same top, and sometimes they share the same
bottom. Sometimes the bottom is death, and sometimes the top is normal health.
In the difference method, subjects make direct numerical judgments of the size of
one interval compared to another, the standard, which has a size of 100. For example,
people are asked to imagine that the difference between normal health and
death is 100 and then asked what the difference between normal health and some
other health state would be. We infer judgments from PTO responses as described
earlier. In the PTO, we ask subjects to compare the benefit of preventing changes for
the worse, such as death, getting a handicap that one does not have, or getting a
worse handicap. In previous studies not reported here, we found that different
wording, such as ‘‘saving’’ rather than ‘‘preventing,’’ yielded identical results.
Subjects completed our studies on the World Wide Web. This method of recruitment
is new (Birnbaum, 2000) and is therefore worthy of comment. We originally
recruited subjects by posting to Usenet news groups, but then various other people
put links to our studies in their web pages, such as the American Psychological
Society, and still others found us with search engines (looking for terms like ‘‘surveys’’
and ‘‘pay’’). Subjects who wish to be paid must provide their name, email address,
address, and (for US residents) their Social Security Number. It is therefore at least as
difficult as with other methods to complete the same questionnaire under different
names. Several studies find little or no difference between web subjects and other
subjects (Birnbaum, 2000; Soetikno, Mrad, Pao, & Lenert, 1997; Soetikno, Prov-
enzale, & Lenert, 1997). Of course, the web, like paper, can be used in many ways.
One feature of our approach is to include checks for sensible answers (using JavaScript,
a programming language for web pages) so that subjects who do not take the
study seriously find it difficult to complete.
2.1. Method
One hundred and fourteen subjects completed a questionnaire on the World Wide
Web for $3. Three additional subjects were eliminated because they gave the same
response to every item (100 or 50). Ages ranged from 13 to 69 (median 29); 71% were
female; and, coincidentally, 71% were non-students.
Each subject answered a series of 32 questions regarding the relative magnitudes
of differences in health states. There were two question frames, one using PTO and
one using the difference method, with 16 items each.
The introduction began:
This study is about different ways of eliciting numerical judgments of health. If we could
measure the benefit of curing or preventing different conditions, we could allocate scarce re-
sources so as to get the greatest benefit per dollar. The following questions are about meth-
ods that might really be used to measure the badness of various conditions. The worse a
condition, the more good is done by preventing it.
All the items refer to conditions that people acquire at about age 60, from strokes, and last
for the rest of their lives. DEATH refers to death at age 60. The conditions are always in
CAPITAL LETTERS. Here are the conditions:
PARALYSIS OF LEGS
PARALYSIS OF ARMS AND LEGS
BLINDNESS
BLINDNESS AND DEAFNESS
NORMAL HEALTH
DEATH
All questions require judgments on a numerical scale from 0 to 100. It never makes sense here
to give an answer of more than 100.
All questions concern comparison of two ranges. For example one range might be between
normal health and deafness, and another range might be between deafness and death.
In one type of question, we ask about the relative size of two ranges. How large is the smal-
ler range, as a percent of the larger one? You might think of each range as a distance defined
by two locations. How large is the distance from New York to Paris, as a percent of the dis-
tance from Chicago to Moscow? This is the kind of judgment you must make.
The second kind of question is like this:
How many people must be prevented from changing from normal health to death in order
to make this just as good as preventing 100 people from changing from normal health to
deafness?
In this case, you should give a number less than 100. Preventing blindness does a
lot of good, but not as much as preventing death, so it would require fewer cases of
preventing death to do the same amount of good.
Going back to the distance example:
How many trips from Chicago to Moscow is just as far as 100 trips from New York to
Paris?
The subject then did a practice item using the deafness example. Instead of
filling in the number, the initial number was set at 50, and the subject pressed one
of two buttons to indicate which option was better, until the subject judged them
to be equal. This was to get the subject used to the idea of filling in a number that
made two options equal. (The subject could not go on without doing the practice
item.)
Each of the 32 screens presented either a PTO item or a difference item. The PTO
item was of the form:
How many people must be prevented from changing from
A to B
in order to make this just as good as
preventing 100 people from changing from
C to D?
The difference item was of the form:
If the difference between
A to B
is 100, how big is the difference between
C to D?
In each case, one end of the two intervals was the same. For example, the ‘‘tops’’
of both intervals (A and C) could be normal health, or the ‘‘bottom’’ (B and D) could
be death. The C–D interval was always intended as smaller than the A–B interval.
Thus, we intended that the answers to both PTO and difference questions would be
less than or equal to 100. For both kinds of questions, a response of 100 implies that
the intervals are equal (if they are interpreted as utility intervals). Table 1 shows the
basic comparisons, using paralysis items.
For example, for the first item, the difference form was, ‘‘If the difference be-
tween normal health and death is 100, how big is the difference between normal
health and paralysis of legs?’’ The PTO form was, ‘‘How many people must be
prevented from changing from normal health to death in order to make this just as
good as preventing 100 people from changing from normal health to paralysis of
legs?’’
The same items were used for sensory impairments, with blindness substituted for
paralysis of legs and blindness and deafness for paralysis of arms and legs.
Table 1
The values of A–D for comparisons 1–8, illustrated with paralysis
Notice that items 1–3 compare intervals with Normal in common at the top; 5, 6,
and 8 compare intervals with Death in common at the bottom; 7 has a non-death
condition at the bottom; and 4 has a non-normal condition at the top. Comparisons
1–3 allow a test of ratio consistency: the ratio in 1 should be the product of the ratios
in 2 and 3. Likewise for 6, 8, and 5. Finally, comparisons 1 and 5 together test
additivity, as do comparisons 2 and 6, 3 and 7, and 4 and 8.
The 32 items (the eight items above, each in its paralysis or sensory form, and then
either PTO or difference) appeared in a different random order for each subject.
2.2. Results
The nature of the disability—sensory or paralysis—did not affect the judgments,
and this factor did not interact significantly with PTO vs. difference or with com-
parison (1–8). Accordingly, we combined the results for the types of disability for
analysis and presentation, and we use the paralysis conditions to stand for both.
Table 2 shows the mean utility ratios for the eight comparisons, the percentage of
responses in which each measure yielded equality responses, that is, responses indicating
that the two intervals were equal (100), and the mean utility ratios with
equality responses excluded. In contrast to previous results (e.g., Baron et al., 2001),
PTO and difference judgments were very close, and their overall means did not differ
significantly across subjects (with or without the equality responses). Conceivably,
the random intermixing of the two measures encouraged subjects to see them as
more similar. (Baron et al., 2001, Experiment 2, found that close proximity of the
two measures made their results closer.)
2.2.1. Equality responses
An equality response is a response of 100, which we take to indicate that the
subject judged the two intervals to be equal. We expected fewer of these responses in
the difference method than in the PTO method because subjects might think that any
other response implies unequal treatment of two groups differing in current health or
in what can be prevented. The difference method is likely to be seen as a judgment of
seriousness without any implications for how two groups should be treated. In fact,
the proportion of equality responses was greater for PTO than for the difference
method ($t_{113} = 4.94$, $p = .0000$, across subjects). Table 2 shows the mean percentage
of equality responses for each of the eight comparisons.
We classified the comparisons into four types according to what condition was
common to both intervals being compared: both intervals have normal health at the
top (the better end of the interval; comparisons 1–3, called Common-top-normal);
both have the same condition at the top, other than normal (comparison 4,
Common-top-not-normal); both have death at the bottom (comparisons 5, 6, and 8,
Common-bottom-death); and both have some other condition at the bottom
(comparison 7, Common-bottom-not-death). We computed the mean proportion of
equality responses for each of the four types.

Table 2
Mean judged ratios of utility differences (in percent), percentage of equality responses, and judged ratios
excluding equality responses, Experiment 1
Comparison               Mean ratio      Equality responses   Mean, excluding equality
                         Diff.   PTO     Diff.   PTO          Diff.   PTO
1. N–D vs. N–L           47      49      2       12           46      42
2. N–D vs. N–A&L         58      56      2       15           57      48
3. N–A&L vs. N–L         56      63      3       13           55      57
4. L–D vs. L–A&L         53      53      2       11           52      47
5. N–D vs. L–D           61      63      10      20           56      53
6. N–D vs. A&L–D         54      59      8       18           50      48
7. N–A&L vs. L–A&L       52      56      3       10           50      51
8. L–D vs. A&L–D         65      63      17      16           58      57

The four types differed in an analysis of variance that included subject, PTO-vs.-difference,
and type as factors ($F_{3,339} = 11.50$, $p = .0000$). The differences among the
types did not interact significantly with PTO-vs.-difference. Equality responses were
higher in the Common-bottom-death comparisons than in the other conditions,
which were approximately equal. Collapsing across PTO and difference, the proportion
of equality responses was .15 in Common-bottom-death and .08, .07, and
.07, respectively, in the other three types: Common-top-normal, Common-top-not-normal,
and Common-bottom-not-death. In separate t tests, the proportion of
equality responses in Common-bottom-death was higher than for each of the three
other types at $p < .002$ (with Bonferroni correction, collapsing across PTO vs. difference).
No other differences between types were significant.
In sum, equality responses are relatively frequent for Common-bottom-death,
that is, when both intervals compared involve saving lives.
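As a rough cross-check, the proportions quoted above can be approximately recovered from the equality-response percentages in Table 2. The sketch below simply averages the table cells within each type, collapsing across the two methods; this equal-weight averaging is our assumption, and the reported figures were presumably computed from the subject-level data, so only approximate agreement should be expected.

```python
from statistics import mean

# Percentage of equality responses per comparison, (difference, PTO), from Table 2.
equality_pct = {
    1: (2, 12), 2: (2, 15), 3: (3, 13), 4: (2, 11),
    5: (10, 20), 6: (8, 18), 7: (3, 10), 8: (17, 16),
}

# Comparison types, defined by the condition shared by both intervals.
types = {
    "Common-top-normal": [1, 2, 3],
    "Common-top-not-normal": [4],
    "Common-bottom-death": [5, 6, 8],
    "Common-bottom-not-death": [7],
}

# Mean proportion of equality responses per type, collapsing across methods.
by_type = {
    name: mean(pct for c in comparisons for pct in equality_pct[c]) / 100
    for name, comparisons in types.items()
}

for name, proportion in by_type.items():
    print(f"{name}: {proportion:.3f}")
```

The Common-bottom-death value comes out near .15 and the other three types fall between roughly .065 and .08, in line with the proportions reported in the text.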
2.2.2. Additivity
We assessed additivity for comparisons involving death but not normal health
(comparison 4 plus comparison 8), those involving normal health but not death (3
plus 7), and those involving both normal and death (extremes: 1 plus 5 and 2 plus 6).
Table 3 shows the measure of superadditivity, the sum of the two ratios, minus 1.
The measure is 0 if the two ratios are additive.
Utility ratios were superadditive, as shown in Table 3. Superadditivity was significantly
positive overall ($F_{1,113} = 82.9$, $p = .0000$), and in each of the six types (three
for PTO and three for difference; $p < .002$ by t test). Although the six types differed
significantly, the differences are difficult to interpret.
Superadditivity was reduced when equality responses were excluded ($t_{113} = 6.28$,
$p = .0000$, using the average across the six types: three for difference and three for
PTO). When equality responses were excluded, superadditivity remained overall
($t_{113} = 2.41$, $p = .0176$).
The substantial reduction in superadditivity, from a mean of .28 to a mean of
.11, suggests that superadditivity results largely from equality responses. For example,
if the difference between death and leg paralysis is equal to the difference
between death and normal health, then any judgment that there is a difference between
leg paralysis and normal health will lead to superadditivity.
On the other hand, the results also suggest that superadditivity is in part a scaling
effect, i.e., the J function is nonlinear. This observed nonlinearity is not just a by-
product of equality responses, nor of principles applied to choices among groups of
people. Superadditivity was significant for the difference method alone, without
equality responses ($t_{113} = 2.83$, $p = .0052$). It could arise either in the way conditions
are perceived or in the mapping of these perceptions into numerical responses.
When subjects compare intervals with one end of one of the intervals at normal,
or one end at death, they may take that end as a reference point, exaggerate dif-
ferences near it and minimize differences far from it, thus producing ratios that are
too high.
Table 3
Superadditivity (the sum of two ratios that should add to 1, minus 1) for Experiment 1
Items in comparison             Difference    PTO
Death but not normal health     .18 (.09)     .16 (.04)
Normal health but not death     .08 (.06)     .19 (.08)
Both normal health and death    .10 (.06)     .14 (−.02)
Equality responses are excluded for numbers in parentheses.
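Because the superadditivity measure is a simple sum, the all-responses entries of Table 3 can be recomputed from the mean ratios in Table 2, using the comparison pairings given in Section 2.2.2. The sketch below does this; the variable and function names are ours, and the values excluding equality responses are not recomputed because subjects dropped from one comparison need not be the same as those dropped from its partner.

```python
# Mean judged ratios in percent, (difference, PTO), from Table 2.
mean_ratio = {
    1: (47, 49), 2: (58, 56), 3: (56, 63), 4: (53, 53),
    5: (61, 63), 6: (54, 59), 7: (52, 56), 8: (65, 63),
}

# Pairs of comparisons whose ratios should sum to 100%, grouped as in Table 3.
pairs = {
    "Death but not normal health": [(4, 8)],
    "Normal health but not death": [(3, 7)],
    "Both normal health and death": [(1, 5), (2, 6)],
}


def excess_over_one(group, method):
    """Mean of (ratio_i + ratio_j - 100) / 100 over the pairs in a group.

    method: 0 for the difference method, 1 for PTO.
    """
    excess = [mean_ratio[i][method] + mean_ratio[j][method] - 100 for i, j in group]
    return sum(excess) / len(excess) / 100


table3 = {name: (excess_over_one(g, 0), excess_over_one(g, 1)) for name, g in pairs.items()}
for name, (diff, pto) in table3.items():
    print(f"{name}: difference {diff:.3f}, PTO {pto:.3f}")
```

The recomputed values agree with the unparenthesized entries of Table 3 to two decimals (the PTO entry for the third row comes out at .135, which is reported as .14).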
2.2.3. Ratio inconsistency
Ratio inconsistency was positive for all measures (p < .0005, with or without
equality responses, t > 5 for both). The measure of inconsistency was not affected by
method (difference vs. PTO) or by whether the comparisons involved normal health
(comparisons 1–3) or death (comparisons 5, 6, and 8). The measure was slightly higher
when equality responses were excluded. The mean of the logarithmic ratio was .18
with all responses and .20 with equality responses excluded (using all remaining data,
t(113) = 2.49, p = .0142). These means correspond, respectively, to ratios of 1.52 and
1.60. In other words, a ratio that should be the product of two others is more than 1.5
times larger than the product, on the average. This result (also found by Baron et al.,
2001) suggests that people do not differentiate large ratios and small ratios sufficiently.
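A minimal sketch of how such a logarithmic measure relates to a ratio is given here. It assumes base-10 logarithms, which is an inference from the reported values rather than something stated in the text; the function name and the three example judgments are hypothetical.

```python
import math

def ratio_inconsistency(direct, first, second):
    """log10 of a directly judged ratio over the product of the two ratios it should equal."""
    return math.log10(direct / (first * second))

# Hypothetical judgments (not data from the study): the two component ratios
# multiply to 0.30, but the direct judgment of the combined ratio is 0.45.
example = ratio_inconsistency(direct=0.45, first=0.60, second=0.50)
print(f"example inconsistency = {example:.3f}")

# Back-transforming the reported mean log ratios, assuming base-10 logarithms.
for mean_log in (0.18, 0.20):
    print(f"mean log ratio {mean_log:.2f} -> ratio {10 ** mean_log:.2f}")
```

With the rounded means, the back-transformed ratios come out at about 1.51 and 1.58, close to the reported 1.52 and 1.60; the small gap presumably reflects rounding of the means.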
3. Experiment 2
Experiment 1 suggested that subjects were using an equality principle, in which
they truly judged the intervals to be equal (in the difference method) or thought that
two groups should get equal priority (in the PTO method). Experiment 2 attempted
to find direct evidence for these principles by asking subjects about them. To avoid
biasing the subjects by our questions, we also asked about the opposite principle, the
idea of complete triviality of one interval compared to the other, so that one interval
“dominates” the other.
We also changed the response format. The use of an open-ended response in
Experiment 1 might have encouraged subjects to respond with round numbers, and a
response of 100 might have meant “closer to 100 than to 90.” To avoid this possibility,
we gave subjects a set of buttons labeled 100, 99, 95, 90, 80, ... This response
mode makes a response of 99 just as acceptable as a response of 100.
3.1. Method
Sixty-eight subjects completed a questionnaire on the World Wide Web for $3.
Ages ranged from 17 to 73 (median 34); 69% were female; and 72% were non-students.
The procedure was identical to that of Experiment 1, except for the addition of a
set of buttons for responses, and the addition of two questions, one about equality
and one about what we call dominance. An example of a difference-method item is:
If the difference between
BLINDNESS and DEATH
is 100, how big is the difference between
BLINDNESS AND DEAFNESS and DEATH?
Choose the closest number:
100 99 95 90 80 70 60 50 40 30 20 10 5 1 0
Consider the following statements:
The difference between
BLINDNESS AND DEAFNESS and DEATH
is as large as the difference between
BLINDNESS and DEATH. AGREE DISAGREE
The difference between
BLINDNESS AND DEAFNESS and DEATH
is trivial compared to the difference between
BLINDNESS and DEATH. AGREE DISAGREE
For the PTO item, the buttons were the same, but the equality and dominance
questions were (using a different example):
It is just as important to prevent people from changing from
PARALYSIS OF ARMS AND LEGS to DEATH as it is to prevent people from changing
from
PARALYSIS OF LEGS to DEATH. AGREE DISAGREE
When we have a choice between preventing the change from
PARALYSIS OF ARMS AND LEGS to DEATH in some people, and preventing the
change from
PARALYSIS OF LEGS to DEATH in other people, we should always prevent the change
from PARALYSIS OF LEGS to DEATH no matter how few people we can help.
AGREE DISAGREE
3.2. Results
Table 4 shows the mean utility ratios for the eight comparisons, the percentage of
responses in which each measure yielded equality responses, and the percent
agreement with the equality and dominance questions. In general, the proportion of
equality responses (numerical responses of “100”) was higher than in Experiment 1,
despite our effort to reduce them by giving the subjects a clear “99” option. The
equality question may have called attention to the possibility that intervals were
equal. Interestingly, the dominance question, although often endorsed, did not
seem to play an analogous role. (Only 2.6% of the responses were the lowest possible,
which turned out to be 1 rather than 0 because of a programming error, an error that
no subject complained about.)
Of primary interest, subjects endorsed the equality question more often than they
followed it literally in their numerical judgments, but their responses to this question
followed the same pattern. Again, we classified the comparisons into four types
according to what condition was common to both intervals being compared
(Common-top-normal, Common-top-not-normal, Common-bottom-death, and
Common-bottom-not-death), and we examined each subject's proportion of each type of
response for each of these types. The effect of type was highly significant for all three
relevant measures: numerical utility judgments (F(3, 201) = 14.04, p = .0000), equality
responses (judgments of 100; F(3, 201) = 17.67, p = .0000), and equality agreement
(F(3, 201) = 29.34, p = .000). As in Experiment 1, Common-bottom-death differed very
clearly from all other types for utility judgments, equality responses, and equality
agreement.
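The four-way classification can be made concrete with a short sketch. The function `classify` and the tuple encoding are hypothetical; each interval is written as a (top, bottom) pair with the better state first, and the abbreviations follow Table 4, read here as N = normal health, L = paralysis of legs, A&L = paralysis of arms and legs, and D = death (that reading is an assumption).

```python
def classify(first, second):
    """Classify a comparison of two intervals, each given as a (top, bottom) pair."""
    (top1, bottom1), (top2, bottom2) = first, second
    if top1 == top2:
        return "Common-top-normal" if top1 == "N" else "Common-top-not-normal"
    if bottom1 == bottom2:
        return "Common-bottom-death" if bottom1 == "D" else "Common-bottom-not-death"
    raise ValueError("intervals share neither a top nor a bottom state")

# The eight comparisons as listed in Table 4.
COMPARISONS = {
    1: (("N", "D"), ("N", "L")),
    2: (("N", "D"), ("N", "A&L")),
    3: (("N", "A&L"), ("N", "L")),
    4: (("L", "D"), ("L", "A&L")),
    5: (("N", "D"), ("L", "D")),
    6: (("N", "D"), ("A&L", "D")),
    7: (("N", "A&L"), ("L", "A&L")),
    8: (("L", "D"), ("A&L", "D")),
}

types = {number: classify(*pair) for number, pair in COMPARISONS.items()}
for number, label in types.items():
    print(number, label)
```

Under this encoding, comparisons 5, 6, and 8 fall in the Common-bottom-death type, and comparison 7 is the only Common-bottom-not-death case.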
Unlike Experiment 1, some of the other type differences were significant too. For
equality responses and equality agreement, all of these involved the
Common-bottom-not-death type (Condition 7 in Tables 1 and 4), which was higher than both
Common-top-normal and Common-top-not-normal types (p < .025 in all cases).
These results suggest that subjects pay more attention to the bottom of the interval
than the top in making these equality judgments.
The pattern for the dominance question is not just the reverse of the equality
question, as seen in Table 4. Agreement with the dominance question is most frequent
when one of the intervals includes death and the other does not (cases 1, 2, and 4 vs.
all others; t(67) = 6.48, p = .0000, for the means of these cases vs. the means of the
others). Subjects seem to apply a principle that preventing death dominates
preventing anything else.

Table 4
Mean judged ratios of utility differences, percentage of equality responses, and percent agreement with
equality and dominance questions, Experiment 2

                           Mean ratio     Equality responses   Equality agree   Dominance agree
Comparison                 Diff.   PTO    Diff.   PTO          Diff.   PTO      Diff.   PTO
1. N–D vs. N–L             51      58     4       29           13      19       53      59
2. N–D vs. N–A&L           61      60     5       27           21      30       42      47
3. N–A&L vs. N–L           60      67     12      26           33      40       21      27
4. L–D vs. L–A&L           54      57     8       21           20      27       45      49
5. N–D vs. L–D             72      76     29      47           52      69       24      16
6. N–D vs. A&L–D           64      66     24      36           43      57       24      22
7. N–A&L vs. L–A&L         61      67     17      32           31      52       32      24
8. L–D vs. A&L–D           71      74     32      36           56      61       18      18
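The contrast between cases 1, 2, and 4 and the remaining cases can be recomputed descriptively from the “Dominance agree” columns of Table 4. The sketch below only averages the tabled percentages across both methods; it is not the per-subject analysis behind the reported t test, and the variable names are hypothetical.

```python
from statistics import mean

# "Dominance agree" percentages from Table 4, as (Diff., PTO) per comparison.
DOMINANCE_AGREE = {
    1: (53, 59), 2: (42, 47), 3: (21, 27), 4: (45, 49),
    5: (24, 16), 6: (24, 22), 7: (32, 24), 8: (18, 18),
}
# Comparisons in which one interval includes death and the other does not.
DEATH_IN_ONE_INTERVAL = {1, 2, 4}

def group_mean(in_group):
    """Mean percentage over comparisons inside (True) or outside (False) the group."""
    values = [
        pct
        for number, pair in DOMINANCE_AGREE.items()
        if (number in DEATH_IN_ONE_INTERVAL) == in_group
        for pct in pair
    ]
    return mean(values)

death_in_one = group_mean(True)
others = group_mean(False)
print(f"cases 1, 2, 4: {death_in_one:.1f}%   all others: {others:.1f}%")
```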
PTO and difference judgments also differed significantly in utility judgments,
equality responses, and equality agreement (p < .03 for all) with more equality and
higher utilities in PTO. Dominance responses showed no effect.
In sum, direct judgments of the equality of intervals support our interpretation of
the results of Experiment 1. Agreement that intervals are equal was more frequent
for Common-bottom-death, that is, when both intervals compared involve saving
lives. We also found some evidence of greater attention to the bottom of the interval
than the top, and for judgments of dominance when one bottom involved death and
the other did not.
Results for superadditivity replicated those of Experiment 1, although the effect
was somewhat larger (with a minimum of .21, for the difference method when normal
health was involved but not death). Again, the effect was significantly reduced when
equality responses were omitted (t(33) = 3.03, p = .0048). This time, however, there
was no significant superadditivity effect when the equality responses were removed.
Note that many subjects had insufficient data for these tests, because the number of
equality responses was so much higher than in Experiment 1.
Ratio inconsistency was, again, significantly positive for all measures (p < .025,
with or without equality responses). Again, this measure of inconsistency was not
affected by method (difference vs. PTO), by whether the comparisons involved
normal health (comparisons 1–3) or death (comparisons 5, 6, and 8), or by whether
or not equality responses were excluded.
4. General discussion
We examined two methods, direct judgment and person-trade-off. The former
asks the subject to think about utility in an abstract way. The latter infers the
subject's utility from a matching response in a hypothetical decision. Both methods
are subject to distortions resulting from scaling effects or heuristics.
Our results suggest that further use of these methods, and possibly others as
well, should include checks for consistency, such as those we have used. Such checks
may help respondents bring their judgments into line with their true values, or
construct their values through a process of reflection with the help of the consistency
checks. For example, a person might come to agree that the difference between death
and normal health is greater than the difference between death and some state of
impairment, after reflecting on the implication of equality judgments that the
impairment is no different from normal health.
4.1. Equality responses and superadditivity
We found evidence for a heuristic or principle of equality in which subjects judged
the intervals being compared as equal. This principle led to higher mean utility ratios
when they were measured with death as the reference point, especially in PTO, and
the principle also led to superadditivity. It is possible that some of these responses
resulted from careless responding. Careless responding, however, cannot account for
the observed differences among conditions. The results of Experiment 2, in which
subjects acknowledge equality in a consistent pattern, also argue against a
carelessness account. Equality responses in PTO (but not in direct rating) have also been
found using interview methods (Pinto-Prades & Lopez-Nicholás, 1998).
We would expect superadditivity to result from equality responses. If, for example,
a subject judged $J\{[u(A) - u(B)]/[u(A) - u(C)]\}$ to be 1 and
$J\{[u(B) - u(C)]/[u(A) - u(C)]\}$ to be greater than 0, then the sum would be greater than 1.
In Experiment 1, superadditivity was, in fact, reduced when these equality responses were
removed, but it was not eliminated. Note also that ratio inconsistency was not reduced by
removal of equality responses, and there is no general reason to expect such reduction.
If utility measures are affected by a principle of equality, it is not clear which
measures are relevant to which decisions. At issue, in part, are the reasons for the
principle. One possible reason is that subjects regard handicaps as irrelevant when
they are properties of people. In this regard, they are like economic or social status,
outside the bounds of medical decisions; we do not take these things into account in
allocating health care. This explanation, however, is inconsistent with our finding
that equality responses are particularly common only when death is at the bottom
end of both intervals being compared. If the top condition was generally irrelevant
when it differed between the two intervals, then equality responses would be used as
much when the bottom condition was some non-death condition as when it was
death. It seems that the equality responses are peculiar to comparisons involving
death. Moreover, equality responses involving death were prevalent in difference
judgments as well as PTO judgments.
Another possible explanation of equality responses is that subjects regard
preventing death or saving lives as lexically ordered before any other health outcomes
(Norcross, 1997). This, too, conflicts with our results, as we did not find many “zero”
responses, in which subjects gave infinite weight to death in comparisons where the
top of the two intervals was constant (comparisons 1 and 2). In Experiment 2, the
pattern of responses to the dominance question, which assessed this principle
directly, was somewhat different from that of the equality responses. On the other
hand, the same experiment provided some evidence that, in general, people pay more
attention to the bottom of an interval, so they are more likely to make equality
responses if the bottom of two intervals is the same than if the top is the same.
One possible explanation, consistent with our results, is that people employ a
heuristic principle that is specific to life-saving, which is that decisions about life-
saving are special, in that everyone should be treated equally. The use of this principle
for PTO measures is consistent with the finding that these measures showed more
equality responses than difference measures. These results are also consistent with the
findings of Ubel, Richardson, and Baron (2002): In a PTO task, subjects compared saving the
lives of people with paralysis of their legs to saving the lives of people in normal
health. The people were either paralyzed before they became ill or they were in normal
health before they became ill. When the pre-existing paralysis case came first, most
subjects gave equal priority to the two life-saving procedures, but, when the people
were normal beforehand, most subjects gave priority to restoring people to normal
health over restoring them to a state of paralysis. Evidently, the pre-existing condition
elicited the equality heuristic more often than did the control condition.
We found equality responses for difference judgments as well as for PTO, and they
were more frequent when death was at one end of both intervals. This finding suggests
that people sometimes see saving lives as equally valuable regardless of the end state
after the life is saved. Difference judgments are about magnitudes, and they are not
hypothetical decisions. Equality responses for difference judgments, thus, would result
from a judgment of the value for each individual saved. Equality responses in general,
then, are not easily understood as resulting from a heuristic involving fairness, since
they occur for difference judgments as well as PTO judgments. Also, fairness to the
handicapped would seem to apply even when the bottom condition is not death, and
death is crucial for equality judgments in both PTO and difference judgments.
In principle, such a judgment that two different intervals are equal, or nearly
equal, could be correct, if people think of intervals as changes from one state to
another rather than as differences between two persistent states. The utility of a
change need not depend solely on the difference between the utility of being in the
start state and the utility of being in the end state. For example, the utility of change
itself might have some maximum. Thus, the utility of a change from A to B might be
just as great as the utility of a change from A to C, even if the change from B to C has
non-zero utility as well.
It is unlikely, however, that this perception of value is typically accurate for the
affected people. Given a choice between having one's life saved and being
handicapped and having one's life saved and not being handicapped, most people would
strongly prefer the latter, even in the face of great happiness in knowing that their life
would be saved. Although we could simply assume that these judgments represent
true expressions of public preferences, we should find out how people respond to
challenges, either by asking them outright whether they prefer to be handicapped or
not or by explaining the superadditivity effect to them. (We may also try to explain to
them that they are to judge states rather than changes.)
Superadditivity could also result from a tendency to see intervals as more equal
than they are. This, in turn, could result from declining influence of differences
further from a reference point, as discussed in the Introduction.
4.2. Ratio inconsistency
Ratio inconsistency, which we also found, can be described as a failure to make
sufficient distinctions among high and low ratios; subjects tend to give responses
toward the middle of the scale they are given (Poulton, 1979).
More generally, our results suggest two general problems with utility measure-
ment, one resulting from the use of heuristic principles and the other resulting from
psychophysical effects. Further research is needed to examine the malleability of
these effects, and also whether these effects are found in other measures of utility,
such as standard gambles and time-trade-offs. We are undertaking this research.
These are not the only possible problems, of course.
4.3. PTO as a utility measure
Some have argued that the PTO is not a utility measure but, instead, a measure of
societal value (Nord, Pinto, Richardson, Menzel, & Ubel, 1999). As such, it is
acceptable, they argue, for PTO measurements to show superadditivity. If, for
example, saving the life of someone who is left blind is just as valuable as saving someone
else's life who will have full vision, according to a PTO measurement, then people
should still be allowed to say that there is value to curing blindness. They would not
be able to do so if the PTO was a measure of utility and was not allowed to show
superadditivity.
Suppose that treatment A saves people's lives and leaves them blind, and treatment
B cures blindness. Treatment C saves people's lives and leaves them with full vision.
People judge that giving A to 10 people is as valuable as giving C to 10 people, and
giving B to 10 people is half as valuable as either of the other treatments (i.e., giving
B to 20 people is equally valuable). Then, if value is additive, the value of giving A,
followed by B, to 10 people would be greater than the value of giving C to 14 people.
(Assume that patients are under anesthesia once when A and B are given to the same
patients.) Yet the result would be the same.
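The arithmetic of this example can be laid out explicitly. The sketch below assumes, as the example does, that values add across treatments and across people; the per-person values are expressed in units of one person given treatment C, and the names `VALUE` and `additive_value` are hypothetical.

```python
from fractions import Fraction

# Judged value per person treated, in units of "one person given treatment C".
VALUE = {
    "A": Fraction(1),     # saves a life, leaves the person blind; judged equal to C
    "B": Fraction(1, 2),  # cures blindness; judged half as valuable as A or C
    "C": Fraction(1),     # saves a life and leaves the person with full vision
}

def additive_value(treatments, people):
    """Total value if per-person values simply add across treatments and people."""
    return sum(VALUE[t] for t in treatments) * people

a_then_b = additive_value(["A", "B"], 10)   # 10 people end up alive with full vision
c_only = additive_value(["C"], 14)          # 14 people end up alive with full vision
print(a_then_b, c_only)
```

On these judgments, A followed by B for 10 people is valued at 15 units against 14 units for C given to 14 people, even though the first option leaves fewer people alive with full vision.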
The only escape from this apparent conflict is either to say that values of events
are not additive, or else to say that decisions should not be based on consequences.
Ubel et al. (2000) have discussed the possibility that values need not be additive. In
principle, such a state of affairs could prevent us from making inferences about
choices, without asking about the particular choice. In practice, such situations may
be rare, and unproblematic. We have suggested earlier that, for some situations
involving changes, values need not be additive.
If decisions are not based on consequences, then we must ask for justification. If,
for example, we choose the combination of A and B over giving C to more people,
how can we explain to the people who would have gotten C why they are not getting
it? We cannot say that we had to make a choice and that the benefits to others were
greater. Moreover, suppose people were behind a veil of ignorance and did not know
which group they were in, the A + B group (10) or the C group (14), but their chance
of being in each group depended on its size. In this case, the policy of choosing C
would be better for each individual. We thus could make a decision that is judged to
be good on the basis of a PTO judgment and that makes everyone worse off. (The
fact that people are worse off only ex ante does not seem to weaken the force of this
argument, which is based on the work of Kaplow & Shavell (2000).)
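The veil-of-ignorance comparison can be illustrated with a small calculation. It assumes that a person is equally likely to be any of the 24 people involved and that either policy leaves its recipients in the same end state (life saved, full vision); the names below are hypothetical.

```python
from fractions import Fraction

GROUP_SIZE = {"A+B": 10, "C": 14}
TOTAL = sum(GROUP_SIZE.values())

def chance_of_benefit(policy):
    """Ex ante chance that a person who could be anyone is helped by the policy."""
    return Fraction(GROUP_SIZE[policy], TOTAL)

chances = {policy: chance_of_benefit(policy) for policy in GROUP_SIZE}
best = max(chances, key=chances.get)
print(chances, best)
```

Each person's chance of being helped is 7/12 under the policy of choosing C and 5/12 under the policy of choosing A and B, so choosing C is better ex ante for everyone.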
In addition, even as a measure of societal value and not utility, PTO measures
should not show ratio inconsistency, as they have done in this study (Baron et al.,
2001). The problem is again that, by combining judgments, we could arrive at
different conclusions about the same choice. If curing one-eye blindness in 200 people is
as good as curing blindness in 50, and if curing blindness in 50 is as good as saving
the lives of 10, then we must conclude that curing one-eye blindness in 200 is better
than saving the lives of 9, yet someone might judge that the figure is really 18 rather
than 9. Thus, preference itself would be intransitive.
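The chaining in this example can be sketched numerically. The two matching judgments are taken from the text; treating “18” as a direct judgment of how many lives match 200 one-eye cures is one reading of the example, and the variable names are hypothetical.

```python
from fractions import Fraction

# Matching judgments from the text: N people in one program match M in another.
one_eye_per_blind = Fraction(200, 50)   # 200 one-eye cures match 50 blindness cures
blind_per_life = Fraction(50, 10)       # 50 blindness cures match 10 lives saved

# Chaining the two judgments gives the implied number of one-eye cures per life saved.
one_eye_per_life_chained = one_eye_per_blind * blind_per_life
lives_matching_200_chained = 200 / one_eye_per_life_chained

# Assumed direct judgment: 200 one-eye cures match 18 lives saved.
lives_matching_200_direct = Fraction(18)
inconsistency = lives_matching_200_direct / lives_matching_200_chained
print(lives_matching_200_chained, inconsistency)
```

The chained judgments imply that 200 one-eye cures match 10 lives, so a direct judgment of 18 is larger by a factor of 9/5, and the two routes rank the same options differently.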
These considerations argue that the PTO measure must behave like a utility
measure, although it could still reflect a different kind of utility that we might call
societal utility. Yet, consider again the veil of ignorance. Any PTO judgment has an
equivalent judgment using gambles for identically placed individuals. (The fact that
people are rarely identically placed is, again, irrelevant, since a normative theory of
the sort we are considering must apply everywhere.) If the PTO judgment disagrees
with the judgment of gambles, then it could lead to a societal decision that makes
everyone worse off, in terms of their own judgments of personal good. This sort of
argument could apply to any other utility measure (including direct ratings, if people
take their direct ratings to be interval measures of personal good).
4.4. Conclusion
Our findings suggest that practical uses of utility elicitation methods, even
methods as simple as direct judgment, should be accompanied by consistency checks
of the sort we have used (as suggested, for example, by Keeney & Raiffa, 1976, p.
271, and Baron et al., 2001). Use of consistency checks is a common practice in
decision analysis, but is rare in other traditions of practical judgment
elicitation. The evidence to date suggests that people can make judgments that are both
consistent and honest. That is, people are generally willing to accept the conclusion
that inconsistent judgments are erroneous (Baron et al., 2001).
When consistency checks fail, subjects should be confronted with the results and
asked to resolve them. When one of the ends of the scale is death and when subjects
give equality responses, they should also be asked to reflect on these responses,
asking themselves, for example, whether, given that someone's life was to be saved, it
really did not matter what state they were in after that. This can be done in a
personal interview, or with the use of an interactive computer program, or both
together. The equality effect is not, itself, inherently inconsistent for a single judgment,
but it apparently leads to superadditivity when it is combined with other judgments,
so the checks should focus on the inconsistency rather than on the equality effect
itself. Moreover, as we noted, we have no reason to expect reduction of the equality
effect to influence other types of inconsistency, aside from superadditivity.
Our results also suggest that, when PTO and difference judgments differ, the
difference judgments yield fewer equality responses and are thus somewhat less
subject to inconsistencies. This does not eliminate the need for consistency checks.
But our results suggest that direct comparison of intervals remains a viable method
for assessment of utility.
Our results are limited to the two methods we have used. Similar inconsistencies
have been found using other methods. For example, Ubel et al. (1996) found ratio
inconsistency in the standard-gamble method (which is based on expected-utility
theory), and O'Leary, Fairclough, Jankowski, and Weeks (1995) found unwillingness
to trade-off any time in the time-tradeoff, a result that may be analogous to equality
responses here. As yet, we do not know whether the same patterns of results would
be found for these other measures (involving the nature of the common bottom) as
we found here.
Acknowledgments
This research was supported by N.S.F. grant SBR95–20288, and by a grant from
the Penn Cancer Center. Peter Ubel's work was supported by the Department of
Veterans Affairs through a Career Development Award in health services research
and by the Robert Wood Johnson Foundation's Generalist Physician Faculty
Scholar Program. We thank Gretchen Chapman, Michael DeKay, Leslie Lenert and
the reviewers for comments on a draft.
Appendix
Why additivity is special
Additivity, under reasonable conditions, implies ratio consistency, but ratio
consistency does not imply additivity. Moreover, additivity is closer to being a sufficient
condition for a utility scale. (The additional assumptions required are likely to hold.)
Additivity as defined above can be seen as a special case of a more general
additivity property of utility judgments:

$$J\{[u(A) - u(B)]/[u(E) - u(F)]\} + J\{[u(B) - u(C)]/[u(E) - u(F)]\} = J\{[u(A) - u(C)]/[u(E) - u(F)]\}.$$

The denominator need not be $u(A) - u(C)$; the denominator $u(E) - u(F)$ simply defines
the unit of measurement. This general additivity property implies “monotonicity” (Krantz, Luce,
Suppes, & Tversky, 1971, p. 145; actually as stated, “weak monotonicity” is based
on inequality rather than equality, but we use the equality version of p. 137 for
clarity of exposition). Specifically, omitting the denominator $J[u(E) - u(F)]$ (which is
the unit of measurement for all judgments), the monotonicity property is: if

$$J[u(A) - u(B)] = J[u(A') - u(B')] \quad \text{and} \quad J[u(B) - u(C)] = J[u(B') - u(C')],$$

then

$$J[u(A) - u(C)] = J[u(A') - u(C')].$$

In other words, if the interval between $A'$ and $B'$ is judged equal to that between $A$
and $B$ (where the primes indicate different health states) and $B'$ and $C'$ is equivalent
to $B$ and $C$, then $A'$ and $C'$ is equivalent to $A$ and $C$. This is the most critical property
required for an interval scale of utility. The general form of the additivity test (with
$J[u(E) - u(F)]$ in the denominator) implies monotonicity, since $J[u(A) - u(C)]$ is the
sum of $J[u(A) - u(B)]$ and $J[u(B) - u(C)]$ and must therefore have the same value for
any other intervals equal to these two, respectively. Here, as noted, we test a special
case in which $A$ and $C$ define the unit, but, if additivity fails for this case, it is not
generally true, so a utility scale cannot be constructed. If such a scale can be
constructed, then ratio consistency also holds.
But the converse does not hold. For example, suppose that judgments are based
on the square of utility differences rather than the differences themselves. Thus

$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} = [u(A) - u(B)]^2/[u(A) - u(C)]^2.$$

Then the following are equivalent:
1. $J\{[u(A) - u(B)]/[u(A) - u(C)]\} \cdot J\{[u(A) - u(C)]/[u(A) - u(D)]\} = J\{[u(A) - u(B)]/[u(A) - u(D)]\}$
2. $\{[u(A) - u(B)]^2/[u(A) - u(C)]^2\} \cdot \{[u(A) - u(C)]^2/[u(A) - u(D)]^2\} = \{[u(A) - u(B)]^2/[u(A) - u(D)]^2\}$
3. $\{[u(A) - u(B)]/[u(A) - u(C)]\} \cdot \{[u(A) - u(C)]/[u(A) - u(D)]\} = \{[u(A) - u(B)]/[u(A) - u(D)]\}$
So ratio consistency will still hold, but additivity will not hold in general, since

$$\begin{aligned}
& J\{[u(A) - u(B)]/[u(A) - u(C)]\} + J\{[u(B) - u(C)]/[u(A) - u(C)]\} \\
&\quad = \{[u(A) - u(B)]^2/[u(A) - u(C)]^2\} + \{[u(B) - u(C)]^2/[u(A) - u(C)]^2\} \\
&\quad = \frac{[u(A)^2 - 2u(A)u(B) + u(B)^2] + [u(B)^2 - 2u(B)u(C) + u(C)^2]}{[u(A) - u(C)]^2},
\end{aligned}$$

which need not equal $\{[u(A) - u(B)] + [u(B) - u(C)]\}^2/[u(A) - u(C)]^2$, that is, 1.
In sum, additivity (together with other assumptions that are likely to hold, such as
larger intervals having greater utilities) implies a utility scale and the two consistency
measures do not.
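A numerical check of this counterexample is sketched below. The utilities assigned to the four states are arbitrary illustrative values, and `judged_ratio` is a hypothetical helper implementing the squared-difference judgment function assumed in the example.

```python
import math

# Hypothetical utilities for four health states (illustrative values only).
u = {"A": 1.0, "B": 0.6, "C": 0.3, "D": 0.0}

def judged_ratio(top1, bottom1, top2, bottom2):
    """Judgment function from the example: the square of the true ratio of differences."""
    return ((u[top1] - u[bottom1]) / (u[top2] - u[bottom2])) ** 2

# Ratio consistency: J(AB/AC) * J(AC/AD) should equal J(AB/AD).
product = judged_ratio("A", "B", "A", "C") * judged_ratio("A", "C", "A", "D")
direct = judged_ratio("A", "B", "A", "D")

# Additivity: J(AB/AC) + J(BC/AC) should equal 1.
total = judged_ratio("A", "B", "A", "C") + judged_ratio("B", "C", "A", "C")

print("ratio consistency holds:", math.isclose(product, direct))
print("sum of the two ratios:", round(total, 2))
```

With these values the product matches the direct ratio, while the two ratios sum to about 0.51 rather than 1, so ratio consistency holds and additivity fails.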
Why inversion consistency is a special case of ratio consistency
Inversion consistency (Baron et al., 2001) refers to the effect of direction of
comparison, $J\{[u(A) - u(B)]/[u(A) - u(C)]\}$ vs. $J\{[u(A) - u(C)]/[u(A) - u(B)]\}$. For
example, “How big is the difference between normal health and blindness compared
to that between normal health and blindness-and-deafness,” vs. “How big is the
latter difference compared to the former.” If subjects are biased toward higher
numerical responses, no matter what the question, then the product of the two ratios
will be greater than 1, as found for a number of psychophysical judgments (Fagot,
1978; Fagot & Pokorny, 1989). High numbers could result from a tendency to give
numbers in the middle of the given range; for numbers over 100, subjects could see
the top of the range as very high. Inversion consistency can be seen as a special case
of ratio consistency (as shown below).
Inversion consistency amounts to

$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} \cdot J\{[u(A) - u(C)]/[u(A) - u(B)]\} = 1,$$

but it can also be written as

$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} \cdot J\{[u(A) - u(C)]/[u(A) - u(B)]\} = J\{[u(A) - u(B)]/[u(A) - u(B)]\}.$$

Ratio consistency says that

$$J\{[u(A) - u(B)]/[u(A) - u(C)]\} \cdot J\{[u(A) - u(C)]/[u(A) - u(D)]\} = J\{[u(A) - u(B)]/[u(A) - u(D)]\}.$$

If we set $D$ equal to $B$, it is apparent that ratio consistency implies inversion
consistency, if ratio consistency holds for reversals (such as
$J\{[u(A) - u(C)]/[u(A) - u(B)]\}$).
Because inversion consistency is a special case of ratio consistency, it is no more
definitive as a consistency check. It is necessary, but not sufficient, for a utility scale.
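How a bias toward higher numbers produces a product above 1 can be shown with a toy judgment function. The constant multiplicative bias used here is purely an assumption for illustration, not a model proposed in the text, and the names are hypothetical.

```python
def biased_judgment(true_ratio, bias=1.2):
    """Hypothetical judgment function: every response is inflated by a constant factor."""
    return bias * true_ratio

true_ratio = 0.5                            # e.g. [u(A) - u(B)] / [u(A) - u(C)]
forward = biased_judgment(true_ratio)       # judged ratio in one direction
inverse = biased_judgment(1 / true_ratio)   # judged ratio in the reverse direction
product = forward * inverse                 # equals bias squared; 1 only if bias is 1
print(product)
```

Under this assumption the product of the two judged ratios equals the square of the bias factor, so it exceeds 1 whenever responses are inflated in both directions.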
References
Baron, J. (1997). Biases in the quantitative measurement of values for public decisions. Psychological
Bulletin, 122, 72–88.
Baron, J., Wu, Z., Brennan, D. J., Weeks, C., & Ubel, P. A. (2001). Analog scale, ratio judgment and
person-trade-off as measures of health utility: biases and their correction. Journal of Behavioral
Decision Making, 14, 17–34.
Birnbaum, M. H. (1978). Differences and ratios in psychological measurement. In N. Castellan & F.
Restle (Eds.), Cognitive theory (Vol. 3, pp. 33–74). Hillsdale, NJ: Erlbaum.
Birnbaum, M. H. (Ed.). (2000). Psychological experiments on the internet. New York: Academic Press.
Birnbaum, M. H., & Sutton, S. E. (1992). Scale convergence and utility measurement. Organizational
Behavior and Human Decision Processes, 52, 183–215.
Bleichrodt, H., & Johanneson, M. (1997). The validity of QALYs: An experimental test of constant
proportional trade-off and utility independence. Medical Decision Making, 17, 21–32.
de Neufville, R., & Delquié, P. (1988). A model of the influence of certainty and probability “effects” on
the measurement of utility. In B. Munier (Ed.), Risk, decision, and rationality (pp. 189–205). Dordrecht:
Reidel.
Fagot, R. F. (1978). A theory of relative judgment. Perception and Psychophysics, 24, 243–252.
Fagot, R. F., & Pokorny, R. (1989). Bias effects on magnitude and ratio estimation power function
exponents. Perception and Psychophysics, 45, 221–330.
Fischer, G. W. (1985). Range sensitivity of attribute weights in multiattribute value models. Organizational
Behavior and Human Decision Processes, 62, 252–266.
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decisions under risk. Econometrica,
47, 263–291.
Kaplow, L., & Shavell, S. (2000). Principles of fairness versus human welfare: On the evaluation of legal
policy. Discussion Paper No. 277, Center for Law, Economics, and Business, Harvard Law School.
(http://www.law.harvard.edu/programs/olin_center).
Keeney, R. L., & Raiffa, H. (1976). Decisions with multiple objectives. New York: Wiley.
Krabbe, P. F. M., Essink-Bot, M.-L., & Bonsel, G. K. (1997). The comparability and reliability of five
health-state valuation methods. Social Science and Medicine, 45, 1641–1652.
Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). Foundations of measurement (Vol. 1). New
York: Academic Press.
Llewellyn-Thomas, H. A. (1997). Investigating patients' preferences for different treatment options.
Canadian Journal of Nursing Research, 29, 45–64.
Mellers, B. A., & Birnbaum, M. H. (1983). Loci of contextual effects in judgment. Journal of Experimental
Psychology: Human Perception and Performance, 8, 582–601.
Miyamoto, J. M., & Eraker, S. A. (1988). A multiplicative model of survival duration and health quality.
Journal of Experimental Psychology: General, 117, 3–20.
Norcross, A. (1997). Comparing harms: Headaches and human lives. Philosophy and Public Affairs, 26,
135–167.
Nord, E. (1995). The person-trade-off approach to valuing health care programs. Medical Decision
Making, 15, 201–208.
Nord, E. (1999). Cost-value analysis in health care: Making sense out of QALYs. Cambridge, UK:
Cambridge University Press.
Nord, E., Pinto, J. L., Richardson, J., Menzel, P., & Ubel, P. A. (1999). Incorporating societal concerns
for fairness in numerical valuations of health programs. Health Economics, 8, 25–39.
O'Leary, J. F., Fairclough, D. L., Jankowski, M. K., & Weeks, J. C. (1995). Comparison of time-trade-off
utilities and rating scale values of cancer patients and their relatives. Medical Decision Making, 15,
132–137.
Pinto-Prades, J.-L., & Lopez-Nicolás, A. (1998). More evidence on the plateau effect: A social
perspective. Medical Decision Making, 18, 287–294.
Poulton, E. C. (1979). Models for biases in judging sensory magnitude. Psychological Bulletin, 86, 777–
803.
Soetikno, R. M., Mrad, R., Pao, V., & Lenert, L. A. (1997). Quality-of-life research on the Internet:
feasibility and potential biases in patients with ulcerative colitis. Journal of the American Medical
Informatics Association, 4, 426–435.
Soetikno, R. M., Provenzale, D., & Lenert, L. A. (1997). Studying ulcerative colitis over the World Wide
Web. American Journal of Gastroenterology, 92, 457–460.
Thaler, R. (1985). Mental accounting and consumer choice. Marketing Science, 4, 199–214.
Tversky, A., & Kahneman, D. (1992). Advances in prospect theory: Cumulative representations of
uncertainty. Journal of Risk and Uncertainty, 5, 297–323.
Ubel, P. A., Loewenstein, G., Scanlon, D., & Kamlet, M. (1996). Individual utilities are inconsistent with
rationing choices: A partial explanation of why Oregon's cost-effectiveness list failed. Medical Decision
Making, 16, 108–116.
Ubel, P. A., Nord, E., Gold, M., Menzel, P., Pinto Prades, J. L., & Richardson, J. (2000). Improving value
measurement in cost-effectiveness analysis. Medical Care, 38, 892–901.
Ubel, P. A., Richardson, J., & Baron, J. (2002). Exploring the role of order effects in person-trade-off
elicitations. Health Policy, 61, 189–199.
Ubel, P. A., Loewenstein, G., Scanlon, D., & Kamlet, M. (1998). Value measurement in cost-utility
analysis: Explaining the discrepancy between analog scale and person-trade-off elicitations. Health
Policy, 43, 33–44.
von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. New York:
Cambridge University Press.
Received 14 September 2000
J. Baron, P.A. Ubel / Organizational Behavior and Human Decision Processes 89 (2002) 1100–1118