
Journal of Behavioral Decision Making

J. Behav. Dec. Making, 13: 17–34 (2000)

Copyright © 2000 John Wiley & Sons, Ltd. Accepted 14 September 1999

Analog Scale, Magnitude Estimation, and Person Trade-off as Measures of Health Utility: Biases and their Correction

JONATHAN BARON,* ZHIJUN WU, DALLAS J. BRENNAN, CHRISTINE WEEKS and PETER A. UBEL
University of Pennsylvania, USA

ABSTRACT

Subjects judged the disutility of health conditions (e.g. blindness) using one of them (e.g. blindness+deafness) as a standard, using three elicitation methods: analog scale (AS, how bad is blindness compared to blindness+deafness?); magnitude estimation (ME, blindness+deafness is how many times as bad as blindness?); and person trade-off (PTO, how many people cured of blindness is as good as 10 people cured of blindness+deafness?). ME disutilities of the less bad condition were smallest, and AS disutilities were highest. Interleaving PTO with ME made PTO more like ME. AS disutilities were inconsistent with direct judgments of differences between pairs of conditions. ME and PTO judgments were internally inconsistent: e.g. the disutility of one-eye blindness relative to blindness+deafness was larger than predicted from comparison of each to blindness. Consistency training reduced inconsistency, increased agreement between AS and PTO, and transferred from one method to the other. The results support the use of consistency checks in utility elicitation.

KEY WORDS utility elicitation; cost-effectiveness; person trade-off; analog scale; decision analysis; consistency checks; transfer of training

INTRODUCTION

Measures of judged utility can provide valuable input to public policy decisions, including those about medical care. For example, comparison of the utility of medical treatment to its cost can yield a benefit/cost ratio, which can be used to allocate scarce resources where they are most effective. Utility measures require human judgment, and several methods have been proposed to elicit judgments from respondents (Baron, 1997; Kaplan, 1995; Torrance, 1986; Ubel et al., 1996).

* Correspondence to: Jonathan Baron, Department of Psychology, University of Pennsylvania, 3815 Walnut Street, Philadelphia, PA 19104-6196, USA. E-mail: baron@psych.upenn.edu

Contract grant sponsor: NSF. Contract grant number: SBR95-20288. Contract grant sponsor: Department of Veterans Affairs. Contract grant sponsor: Robert Wood Johnson Foundation.

We focus here on three of these methods: analog scale (AS), magnitude estimation (ME), and person trade-off (PTO). At issue are whether these methods can yield internally consistent responses, whether they agree with each other, and how their internal consistency and agreement can be improved. As we discuss later, these methods can be seen as representative of two general types of methods in common use, one involving direct utility judgments and the other involving matching responses in hypothetical decisions.

In the analog scale (AS) method, respondents assign numbers to conditions on a scale with the ends clearly defined. Typically one end is death and the other is normal health, but here we use conditions other than death in most studies. Because we define the worse end of the scale as 100 and normal health as 0, we speak of the scale as measuring disutility rather than utility.

The AS is simple to use and explain. In principle, it is justified by the idea of difference measurement. That is, differences among numbers should be ordered according to judged differences among the conditions to which they correspond (Krantz et al., 1971, section 4.2). In practice, AS produces disutilities that seem excessively high: conditions are judged to be closer to the worse end of the scale than other methods or intuition suggests. For example, Ubel et al. (1996) asked subjects, ‘You have a ganglion cyst on one hand. This cyst is a tiny bulge on top of one of the tendons in your hand. It does not disturb the function of your hand. You are able to do everything you could normally do, including activities that require strength or agility of the hand. However, occasionally you are aware of the bump of your hand, about the size of a pea. And once every month or so the cyst causes mild pain, which can be eliminated by taking an aspirin.’ The AS ratings implied a disutility of 8 on a scale where normal health is 0 and death is 100. In other words, the cyst was judged about 1/12 as bad as death.

We show here that at least part of the problem is that subjects ignore instructions to consider differences, so their numbers conflict with their own judgment of differences. Instead, they seem to follow some sort of psychophysical function anchored on normal health, a function with a slope that flattens as distance from normal health increases. Such functions have been found even for ratings of monetary losses, where economic theory predicts, if anything, the reverse kind of curvature (Galanter and Pliner, 1974; Kahneman and Tversky, 1979).

Note that AS asks the subject to make a judgment about a less bad condition using a worse condition as a standard with a disutility of 100, so numerical responses are less than 100. In magnitude estimation (ME), we ask the subject to compare the worse condition to the less bad condition, to which we assign a disutility of 10. Responses are thus greater than the disutility of the standard. We reserve the term ME for comparisons of worse to less bad, following the usage of others (Kaplan, 1995; Richardson, 1994), although the term has been used for what we call AS as well. The ME task is discussed in the literature on decision analysis (Fischer, 1995; von Winterfeldt and Edwards, 1986, Chapter 8) and it has a long and continuing history in psychophysics (Birnbaum, 1978; Fagot and Pokorny, 1989; Stevens, 1951). Our interest in ME is primarily as a way of checking the consistency of AS. If both AS and ME represent utilities, the direction of judgment should not matter: if condition A is judged to be half as bad as B, then B should be judged twice as bad as A. Although a utility representation implies such inversion consistency, inversion consistency is not sufficient for a utility representation. (For example, inversion consistency will hold for any power transform of utility; see Baron and Ubel, 1999, unpublished manuscript.)
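Inversion consistency can be checked mechanically once both kinds of answer are converted to the same ratio. The following is a minimal sketch, assuming the standards used in this paper (100 for the worse condition in AS, 10 for the less bad condition in ME); the function names are hypothetical. The last line illustrates the parenthetical point: if judged ratios are a power transform of utility ratios, they remain reciprocal, so passing the check does not by itself establish a utility representation.

```python
import math

def as_ratio(as_rating: float, standard: float = 100.0) -> float:
    """AS: the worse condition is the standard; the answer rates the less bad one."""
    return as_rating / standard

def me_ratio(me_rating: float, standard: float = 10.0) -> float:
    """ME: the less bad condition is the standard; the answer rates the worse one."""
    return standard / me_rating

def inversion_consistent(as_rating: float, me_rating: float) -> bool:
    """Both judgments should imply the same less-bad/worse disutility ratio."""
    return math.isclose(as_ratio(as_rating), me_ratio(me_rating))

def distorted_ratio(u_a: float, u_b: float, p: float) -> float:
    """Hypothetical judged ratio that is a power transform of the utility ratio."""
    return (u_a / u_b) ** p

print(inversion_consistent(50, 20))  # True: half as bad one way, twice as bad the other
print(inversion_consistent(50, 30))  # False
# Any power p keeps judged ratios reciprocal, so the check still passes even
# though the judged ratios no longer equal the underlying utility ratios.
print(math.isclose(distorted_ratio(3, 7, 0.4) * distorted_ratio(7, 3, 0.4), 1.0))  # True
```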

The third method of interest here is the person trade-off (PTO). In one version of the PTO, the subject is asked how many people have to be cured of condition B to do just as much good as curing (say) 10 people of condition A. (A is worse than B, so the subject's response is larger than 10.) PTO has been advocated because it seems most directly relevant to policy decisions about allocation of resources (Nord, 1995). However, it yields results that are internally inconsistent. In particular, when subjects compare conditions B and A (as above) and conditions C and B, we should be able to predict their comparison of C and A by multiplying the first two utilities. For example, if curing 20 Bs is as good as 10 As and curing 30 Cs is as good as 10 Bs, then curing 60 Cs should be as good as 10 As. Ubel et al. (1996) found that the C–A comparison is less extreme than predicted on the assumption that subjects were equating numbers on the basis of utilities. We call this result ratio inconsistency (following Baron and Ubel, 1999, unpublished manuscript; Ubel et al. called it ‘multiplicative intransitivity’).
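The prediction in this example can be written out as a short calculation. This is a minimal sketch, assuming each PTO answer is converted to a disutility ratio as standard/response; the function names are illustrative, not taken from the paper.

```python
def pto_ratio(n_less_severe: float, n_standard: float = 10.0) -> float:
    """Disutility of the less severe condition as a fraction of the more severe one.

    Curing n_less_severe people of the milder condition is judged as good as
    curing n_standard people of the worse one, so the ratio is
    n_standard / n_less_severe.
    """
    return n_standard / n_less_severe

def predicted_chain_count(n_b_for_a: float, n_c_for_b: float,
                          n_standard: float = 10.0) -> float:
    """Number of Cs that should match n_standard As under ratio consistency.

    The C/A ratio is the product of the C/B and B/A ratios; inverting it
    gives the predicted PTO answer for the C-A comparison.
    """
    c_over_a = pto_ratio(n_c_for_b, n_standard) * pto_ratio(n_b_for_a, n_standard)
    return n_standard / c_over_a

# 20 Bs match 10 As, and 30 Cs match 10 Bs.
print(round(predicted_chain_count(20, 30), 6))  # 60.0
```

Ratio inconsistency in this sense shows up as an actual C–A answer below the predicted 60, that is, a less extreme comparison than the chain implies.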

Ubel et al., and others such as Nord et al. (1995), have suggested that subjects are making PTO judgments on the basis of considerations of fairness that go beyond equating total utility for the two groups of patients. Two types of additional fairness considerations are relevant. First, subjects want to give patients an equal opportunity for treatment regardless of their condition. This principle would lead to judgments of equal disutility for all conditions; hence, 100 people cured of condition A would always be equivalent to 100 people cured of condition B, regardless of the nature of A and B. Second, subjects want to treat the worse condition first (Ubel, 1998). If A were worse than B, then subjects asked how many people cured of A is equivalent to 100 people cured of B would answer 1. Both of these principles could lead subjects to pay too little attention to the relative seriousness of A and B. They would thus tend to give the same answers for any comparison, and this would result in ratio inconsistency.

An alternative explanation of internally inconsistent results for PTO is that of scale distortion, much like that described for AS. We know of no previous results that would predict the direction of this distortion, and, in principle, it could go in either direction: under- or over-responsiveness to differences among conditions.

Subjects could make PTO judgments in different ways. In one, they think about the decision problem of allocating resources. Fairness considerations could come into play. In the other, they evaluate the conditions (as they do in AS or ME) and use this evaluation to infer the PTO judgment. For example, if they judge A to be twice as bad as B, they would infer that curing 50 people of A is equivalent to curing 100 people of B. We test this possibility by comparing two ways of presenting PTO judgments. In one, we present PTO judgments on their own. In the other, we present each PTO judgment immediately after an equivalent ME judgment. We use ME because the answer is larger than the standard, just as it is for the PTO question that we used.
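To make the contrast concrete, here is a sketch of how three response strategies would answer the question ‘how many people cured of A are as good as 100 cured of B?’ when A is judged some number of times as bad as B. The strategy functions are hypothetical idealizations of the accounts above (inference from an evaluation, equal opportunity, worst-first), not models fitted to any data.

```python
def from_evaluation(times_as_bad: float, n_b: float = 100.0) -> float:
    """Infer the PTO answer from a direct evaluation: n_a * times_as_bad == n_b."""
    return n_b / times_as_bad

def equal_opportunity(times_as_bad: float, n_b: float = 100.0) -> float:
    """Every patient counts equally, so the severity ratio is ignored."""
    return n_b

def worst_first(times_as_bad: float, n_b: float = 100.0) -> float:
    """Curing even one person with the worse condition A takes priority."""
    return 1.0 if times_as_bad > 1 else n_b

for ratio in (2, 4):
    answers = [f(ratio) for f in (from_evaluation, equal_opportunity, worst_first)]
    print(ratio, answers)
# 2 [50.0, 100.0, 1.0]
# 4 [25.0, 100.0, 1.0]
```

Only the evaluation-based strategy changes its answer when the severity ratio changes; the two fairness strategies give the same answer for any comparison, which is the pattern that would produce ratio inconsistency.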

The studies reported here have two main purposes. One is to demonstrate inconsistencies in the utility measures of interest. The second is to ask how these inconsistencies can be corrected in the process of elicitation. We thus examine both inconsistencies and the effects of training. Of particular interest is whether training will transfer to new cases and whether training in one sort of consistency will improve other measures of consistency as a byproduct. The latter result is expected if improvements in consistency also improve validity. By validity, we mean that the numbers are more representative of an internal scale that both honestly expresses subjects' judgments and has the form of a utility measure. Internal consistency does not imply validity, but inconsistency implies invalidity.

The results may shed light on utility judgment more generally, both in health contexts and other contexts. AS and PTO are two of the four methods most often discussed in the literature on utility elicitation (Kaplan, 1995), the other two being time-tradeoff and standard gambles. PTO is like the latter two methods because it asks subjects for a number that makes two options equally preferred in a hypothetical decision. AS is like methods sometimes used in multi-attribute decision analysis. Decision analysts (Keeney and Raiffa, 1993) usually recommend that the analysis should not take initial judgments of utility at face value. Instead, the analysis should carry out consistency checks. Experiments of the sort we do here could thus show the need for such checks in health utility elicitation and also shed light on the effectiveness of such checks in decision analysis.


EXPERIMENT 1

The experiment asked several questions about the nature and sources of disagreement among methods of utility estimation:

(1) Do ME and PTO yield comparable results? Ubel et al. (1996) found that AS consistently yielded larger disutilities (closer to death) than did PTO. But PTO could be affected by fairness considerations, as argued earlier. Worst-first, in particular, would lead to smaller disutilities (closer to normal health) when they are estimated with PTO. ME would not have this problem, since it is a direct judgment, not a decision about distribution.

To compare ME and PTO, we made sure that the numerical scales were the same. In PTO, we asked how many people needed to be cured of a less severe condition to equal the benefit of curing 10 people of the more severe condition. Thus, PTO indifference points were 10 or greater. We also used 10 as the standard for ME judgments.

Subjects could also think about PTO in terms of disutilities. They could answer questions about ratios of people as if they were questions about ratios of disutility. We gave PTO twice, the second time (called PTO2) interleaved with ME. That is, each ME item was followed by a PTO item with the same two conditions. This was to encourage subjects to think about the PTO in this way. We could compare these judgments to PTO judgments presented alone. We also asked subjects whether they objected to thinking about the PTO in this way.

(2) Is ratio inconsistency present in ME, as well as in PTO? If ratio inconsistency is the result of fairness only, it should be absent in ME, but found in PTO.

(3) Does it matter whether the PTO asks about the benefit of curing people or the harm of not curing them? We used two versions of PTO: one involved curing people (which we abbreviate as PTO), and the other involved not curing people (leaving them uncured, PTO–NotCure). These versions would differ if subjects think of equal opportunity as more important for losses (PTO–NotCure) than for gains (PTO), or the reverse. Equality would express itself as relatively close and high disutilities inferred from PTO: in the extreme, if subjects said that curing 10 people of each condition was equivalent to curing 10 people of the worst condition, then all conditions would have an implied disutility of 1 on a scale with 1 as the maximum.

(4) Does the AS method yield disutility judgments that are too large? To test this, we asked subjects to compare the effect of becoming deaf on normal people and blind people. If subjects think that deafness is worse in people who are blind, then they should judge blindness as less than half as bad as blindness plus deafness.

(5) Can asking about differences reduce the bias in AS judgments? We gave the AS method by itself and in a condition that requested the subject to compare differences (as in no. 4) before completing the scale (AS-CompDiffs).

(6) How does ME compare to AS? These two measures may be considered as different ways of asking about the same ratio. In AS, the worse condition is the standard of comparison; in ME, the less bad.

Method

Twenty subjects, paid $6/hour (raised to the nearest dollar), completed this questionnaire and others at their own pace in a quiet room. Most were students at the University of Pennsylvania and Philadelphia College of Pharmacy and Science.

The questionnaire had two forms, in the following orders:


First form                      Second form
PTO                             PTO–NotCure
PTO–NotCure                     PTO
AS                              AS-CompDiffs
ME and PTO2 (interleaved)       ME and PTO2 (interleaved)
AS-CompDiffs

Ten subjects completed each form. The two orders served to counterbalance the order of PTO and PTO–NotCure, and they also differed in whether AS-CompDiffs was presented after AS (without the difference comparison questions). We saw no point in presenting this form of AS after AS-CompDiffs in the second order.

The questionnaire began with a list of the conditions and their standard abbreviations given to the subjects, and, in brackets, the abbreviations we use here:

One-blind [B]: blindness in one eye.
Blind [BB]: blindness in both eyes.
One-deaf [D]: deafness in one ear.
Deaf [DD]: deafness in both ears.
One-blind-and-one-deaf [BD]: blindness in one eye and deafness in the opposite ear.
Blind-and-deaf [BBDD]: blindness and deafness.

It then reminded the subject of the limitations resulting from each condition, and it said, ‘In the questions to follow, suppose that these conditions begin in adults from ages 20 to 60, and they do not get better by themselves. People with them do not differ in sex, age, or any other characteristics. You should rate these conditions for the average person.’

Here are the introductions and a sample item from each condition.

Curing people [PTO]

Suppose that, in a given state, the Medicaid budget is increased so that the state can pay for certain expensive procedures, which it did not pay for before. These procedures can cure people of these disorders. The state cannot pay for everything, but it wants to do the most good with what it has. So it wants to determine the benefits of curing different numbers of people of different conditions. It will then try to get the most benefit for its money.

IN THIS PART, YOUR ANSWERS SHOULD ALL BE GREATER THAN 10.

Curing 10 people with Blind does as much good as curing ____ people with One-blind.

...

Not curing people: leaving them uncured [PTO–NotCure]

Suppose that, in a given state, a new Medicaid program is to put into effect, but funds are limited. The state will be unable to pay for certain expensive procedures, which can cure people of these disorders. The state must decide which procedures not to cover. It must decide how much harm is done by not covering certain procedures, so it can do the least harm possible within its budget.

IN THIS PART, YOUR ANSWERS SHOULD ALL BE GREATER THAN 10.

Leaving 10 people blind does as much harm as leaving ____ people One-blind.

...

Analog scale [AS]

Put each of the following conditions on the following scale by drawing an arrow from the condition name to the point on the scale where it goes. Differences on the scale should reflect differences between the conditions. For example, a difference of two units should be twice as great as a difference of one unit. Do Blind first, then Deaf, and so on. [The questionnaire listed all the conditions using full descriptions (ordered BB, DD, B, D, BD), and it provided a vertical scale marked with ‘No impairment’ at the top end, called 0, and ‘Blind-and-deaf’ at the bottom, called 100.]

...


Ratios [PTO–ME, magnitude estimation with person tradeoff interleaved]

In this part, fill in the blanks with ratios. You do not need to use whole numbers. You can use decimals or fractions. After each ratio question, answer the corresponding question about the benefits of curing people.

Blind is ____ times as bad as One-blind.

Curing 10 people with Blind does as much good as curing ____ people with One-blind.

[Additional items compared DD with D, BD with B, BD with D, and BBDD with BB, DD, B, D, and BD.]

...

Comparisons [AS-CompDiffs]

Person A is normal and becomes Deaf. Person B is Blind and becomes Deaf. For which person is it worse to become Deaf? Or are these effects equally bad? (Circle one.)

A B equally bad

Person A is normal and becomes One-deaf. Person B is One-blind and becomes One-deaf. For which person is it worse to become One-deaf? Or are these effects equally bad? (Circle one.)

A B equally bad

The answers to these questions should be consistent with the following scale. That is, a bigger change should correspond to a larger interval on the scale, and an equal change should correspond to equal intervals. Please bear this in mind. [The Analog Scale followed.]

Results

Comparison of methods

ME and PTO methods yielded smaller disutilities than AS methods, and ME was smaller than PTO. Mean disutilities of each condition are shown in Exhibit 1. Items with disutilities of one or greater are excluded from all analyses (e.g. giving 10 or less as a response in PTO methods; 5.6% of responses overall).

Exhibit 1. Disutilities inferred from each method for each comparison

Comparison   PTO    PTO–NotCure   AS     ME     PTO–ME   AS-CompDiffs
BB/BBDD      0.38   0.36          0.78   0.17   0.27     0.75
DD/BBDD      0.35   0.37          0.66   0.15   0.25     0.73
B/BBDD       0.24   0.22          0.34   0.11   0.17     0.33
D/BBDD       0.21   0.20          0.23   0.09   0.14     0.24
BD/BBDD      0.31   0.28          0.52   0.12   0.13     0.48
B/BB         0.31   0.32          –      0.14   0.23     –
D/DD         0.33   0.32          –      0.15   0.25     –
BD/BB        0.44   0.45          –      0.29   0.34     –
BD/DD        0.43   0.43          –      0.28   0.32     –

For example, 0.38 for BB/BBDD in PTO means that subjects thought that, on average, curing about 26 people of BB was as good as curing 10 people of BBDD (since 0.38 is about 10/26). (Abbreviations: B = blind in one eye, BB = blind, BBDD = blind and deaf, etc.; PTO = person tradeoff, PTO–NotCure = person tradeoff for not curing, ME = magnitude estimation, AS = analog scale, PTO–ME = person tradeoff interleaved with ME, AS-CompDiffs = analog scale following comparison of differences.)
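The conversion used in this note, together with the exclusion rule stated above, can be sketched as follows. This is a minimal sketch, assuming a PTO standard of 10 and treating non-positive responses as unusable; the function name is hypothetical.

```python
from typing import Optional

def pto_disutility(response: float, standard: float = 10.0) -> Optional[float]:
    """Implied disutility (0-1 scale) of the less severe condition from a PTO answer.

    `response` is how many people cured of the less severe condition match
    curing `standard` people of the more severe one. Answers implying a
    disutility of 1 or more (10 or less here) return None, mirroring the
    exclusion rule; non-positive answers are also treated as unusable.
    """
    if response <= 0:
        return None
    d = standard / response
    return d if d < 1 else None

print(round(pto_disutility(26), 2))  # 0.38
print(pto_disutility(10))            # None (excluded)
```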


To compare methods, we computed, for each method, the mean disutility (on a 0–1 scale) of the conditions common to the methods being compared. Although the level of this measure has no natural interpretation, it allows us to compare measures, with each response contributing equally in proportion to its utility.
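A minimal sketch of this comparison measure, assuming each method's disutilities are stored in a dictionary keyed by comparison label (the labels and helper name are illustrative). Only comparisons present for both methods enter either mean. For illustration the sketch is fed the Exhibit 1 column means for AS and ME; the significance tests reported here were run on per-subject values, which this sketch does not reproduce.

```python
from statistics import mean

def common_means(method_a: dict[str, float], method_b: dict[str, float]) -> tuple[float, float]:
    """Mean disutility of each method over the comparisons both methods share."""
    shared = sorted(method_a.keys() & method_b.keys())
    if not shared:
        raise ValueError("the two methods share no comparisons")
    return mean(method_a[k] for k in shared), mean(method_b[k] for k in shared)

# Column means from Exhibit 1 (AS was measured only relative to BBDD).
as_means = {"BB/BBDD": 0.78, "DD/BBDD": 0.66, "B/BBDD": 0.34, "D/BBDD": 0.23, "BD/BBDD": 0.52}
me_means = {"BB/BBDD": 0.17, "DD/BBDD": 0.15, "B/BBDD": 0.11, "D/BBDD": 0.09, "BD/BBDD": 0.12,
            "B/BB": 0.14, "D/DD": 0.15, "BD/BB": 0.29, "BD/DD": 0.28}

as_mean, me_mean = common_means(as_means, me_means)
print(round(as_mean, 3), round(me_mean, 3))  # 0.506 0.128
```

Restricting both means to shared comparisons keeps a method from looking better or worse merely because it was asked about a different set of conditions.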

ME disutilities were significantly smaller than all other measures (p = 0.003 or better by t-test) except for PTO–ME (the PTO condition interleaved with ME). PTO and PTO–NotCure did not differ.

AS and AS-CompDiffs did not differ, either within the subjects who did both or between the groups that did one or the other first. The comparison manipulation thus had no overall effect. We formed a single variable for whichever was done first. Mean disutilities of this variable were larger than those of all other methods (p = 0.001 or better).

Consistency with difference judgments

Before AS-CompDiffs, the questionnaire asked for direct comparisons of differences between states. In response to the first of these questions, 19 of 20 subjects said that it was worse for a Both-blind person to become Both-blind–Both-deaf than for a normal person to become Both-blind. Yet when the subjects completed the scale after answering this question, 19 out of 20 assigned a disutility of at least 0.5 to Both-blind, and 16 of these were 0.6 or higher. (The difference in proportion is, of course, significant at p < 0.0005.) Subjects therefore did not bring their judgments of differences to bear on their disutility judgments, despite the instructions to do so.
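The check these subjects failed is simple to state in code. This is a minimal sketch, assuming the scale used here (No impairment = 0, Blind-and-deaf = 100) and a three-way answer to the difference question; the answer labels and function name are illustrative.

```python
def difference_consistent(as_bb: float, bigger_change: str, top: float = 100.0) -> bool:
    """Check an AS rating of Both-blind against a direct judgment of differences.

    With No impairment = 0 and Blind-and-deaf = `top`, the interval from
    normal to Both-blind is `as_bb` and the interval from Both-blind to
    Blind-and-deaf is `top - as_bb`. `bigger_change` names the change the
    subject judged worse: "normal_to_bb", "bb_to_bbdd", or "equal".
    """
    first, second = as_bb, top - as_bb
    if bigger_change == "bb_to_bbdd":
        return second > first
    if bigger_change == "normal_to_bb":
        return first > second
    if bigger_change == "equal":
        return abs(first - second) < 1e-9
    raise ValueError(f"unknown judgment: {bigger_change!r}")

# Adding deafness to blindness is judged the bigger change, yet Both-blind
# is placed at 60 on the scale: the two answers conflict.
print(difference_consistent(60, "bb_to_bbdd"))  # False
print(difference_consistent(40, "bb_to_bbdd"))  # True
```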

Ratio inconsistency

In PTO (all versions) and ME, judgments were inconsistent. (We could not assess consistency for AS in this study.) For example, the mean ME disutility of One-blind as a proportion of Blind is 0.138, the disutility of Blind as a proportion of Blind-and-deaf is 0.174, and the disutility of One-blind as a proportion of Blind-and-deaf is 0.106, which is higher than 0.024, the product of the first two utilities. Exhibit 2 shows the individual subject data for this comparison.

For PTO, PTO–NotCure, PTO2, and ME, we measured ratio inconsistency for each set of judgments of this sort for which it could be measured by taking the product of the first two utilities, dividing by the third, and taking the log.¹ The overall ratio inconsistency measure for each method was the mean of these four scores. All these means were less than zero (p = 0.033 or better). They did not differ significantly from each other.
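A minimal sketch of the ratio inconsistency score, assuming the three disutility ratios for one set of judgments are already available (parameter names are illustrative). For illustration it is applied to the group means from the ME example above; the analysis itself scored each set of judgments and then averaged. The second part reproduces the arithmetic of footnote 1, showing why the log is taken.

```python
import math

def ratio_inconsistency(small_over_mid: float, mid_over_large: float,
                        small_over_large: float) -> float:
    """Log of (product of the two chained ratios) / (directly judged ratio).

    0 means ratio consistency; a negative value means the direct judgment
    is less extreme (larger) than the chain of judgments predicts.
    """
    return math.log(small_over_mid * mid_over_large / small_over_large)

# B/BB, BB/BBDD and B/BBDD from the ME example in the text.
print(ratio_inconsistency(0.138, 0.174, 0.106) < 0)  # True

# Two equal-sized errors in opposite directions (25% too high, then 25% too low).
errors = [5 / 4, 4 / 5]
print(round(sum(errors) / len(errors), 3))                          # 1.025: raw ratios biased upward
print(abs(sum(math.log(r) for r in errors) / len(errors)) < 1e-12)  # True: log ratios average to 0
```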

Relation of PTO to ME

A PTO–ME condition was interleaved with the ME condition, to determine whether subjects found it acceptable to make person trade-off judgments in terms of utility ratios. Nine (out of 20) subjects evidently had no objection to making PTO judgments in terms of ratios. Three subjects produced PTO–ME utilities identical to their ME utilities, even though both of these differed from their initial PTO utilities, two subjects produced PTO–ME utilities that were farther (by a factor of at least four) from their PTO utilities than from their ME utilities, and two subjects had identical utilities for all three judgments. These 13 subjects produced PTO–ME utilities closer to their PTO utilities than to their ME utilities. These subjects evidently resisted the idea of making PTO judgments in terms of ratios. Two of the remaining four subjects did not produce utilities less than 1 for PTO–ME, and two did not produce such answers for PTO. Of the latter two, one had ME utilities that matched the PTO–ME utilities and one did not. In sum, only four subjects resisted responding to the PTO–ME task as if it were a magnitude estimation, and 14 seemed to accept the idea. Exhibit 3 shows the disutilities for PTO–NotCure, PTO, PTO–ME, and ME.

¹ Without taking the log, equivalent errors in the numerator and denominator of the ratio could lead to a positive mean error. For example, if the disutility of B relative to BBDD (B/BBDD) is 25% too high on half the trials and the product of BB/BBDD and B/BB is 25% too high on the other half, the mean ratio would be the mean of 5/4 and 4/5, which is 1.025 rather than 1. The distribution of the log ratio is also approximately normal, although the ratio itself has a skewed distribution.

Exhibit 2. Each subject's disutilities for magnitude estimation (ME) of blindness (BB) relative to one-eye blindness (B), blindness plus deafness (BBDD) relative to blindness, one-eye blindness relative to blindness plus deafness, and the prediction of the last measure from the first two, assuming ratio consistency. The utility scale is logarithmic. Each line represents the data from one subject.

Although thinking of the PTO task as magnitude estimation may have made the task easier, it only increased (nonsignificantly) the ratio inconsistency of the judgments. This result is inconsistent with the hypothesis that ratio inconsistency in PTO results from fairness principles. Rather, ratio inconsistency is more easily explained in terms of insufficient distinctions among different conditions. Subjects may anchor on one condition when they rate others, then adjust insufficiently for the differences among conditions.

AS versus ME

Disutilities based on AS were larger in all comparisons than those based on ME; all comparisons were significant. Although small effects resulting from the direction of comparison occur in psychophysical tasks (Fagot, 1981; Fagot and Pokorny, 1989), the magnitude of these differences suggests a different mechanism, which may have to do with the fact that the scale is unbounded in ME.

EXPERIMENT 2

This experiment examined further the conflict between AS and PTO. It included tests of ratio inconsistency in AS as well as PTO. It also included two PTO conditions, one in which the more severe condition was the standard and subjects had to decide how many people needed to be cured of the less severe condition to bring equal benefits, and another PTO condition (PTO–Rev) in which the less severe condition was the standard and subjects had to state a PTO indifference point for the more severe condition. This allowed us to ask whether the discrepancy between AS and ME was found in other tasks in which the standard differed.

Exhibit 3. Each subject's disutilities for PTO–NotCure, PTO–Cure, PTO–Cure2 (with ME), and ME. Each line is one subject. Subjects with missing data on intermediate points are excluded.

The experiment used a face-to-face interview procedure rather than written administration, although some subjects answered the same questions in written form for comparison. The interview attempted to train the subjects to make internally consistent judgments. Subjects were told both about ratio inconsistency and about comparison of intervals in AS. That is, they were encouraged to make sure that differences between the numbers they assigned in AS reflected their ordering of differences between conditions. For example, a subject who thought that the difference between Both-Blind and Blind-and-Deaf was greater than that between No-impairment and Blind should give Both-Blind a rating of less than 50 on the AS in which Blind-and-Deaf was assigned 100. The interview also trained subjects in ratio consistency.

We asked two questions about the effect of training. First, when subjects are induced to become more consistent, do they merely make their numbers follow the rules without worrying about whether the numbers still reflect their honest judgment? Alternatively, are judgments sufficiently flexible so that they can be made mathematically consistent while still being honest? Decision analysts claim that consistency checks usually do not violate the respondent's best judgment, for example: ‘... if the consistency checks produce discrepancies with the previous preferences indicated by the decision maker, these discrepancies must be called to his attention and parts of the assessment procedure should be repeated to acquire consistent preferences ... Of course, if the respondent has strong, crisp, unalterable views on all questions and if these are inconsistent, then we would be in a mess, wouldn't we? In practice, however, the respondent usually feels fuzzier about some of his answers than others, and it is this degree of fuzziness that usually makes a world of difference. For it then becomes usually possible to generate a final coherent set of responses that does not violently contradict any strongly held feelings’ (Keeney and Raiffa, 1993, p. 271). Such checks can even improve the perceived validity of numerical judgments (e.g. Keeney and Raiffa, 1993, p. 200). In this experiment, we wanted to test whether this was true of our respondents, who are more like representatives of the public than like the experts that Keeney and Raiffa typically used for decision analysis.

Second, if training improves consistency within a method (AS or PTO), does it also increase agreement between the disutilities inferred from the two different methods? If so, then it would be more likely that the modified disutilities were converging on some true utility judgment: not necessarily the truth about the conditions themselves but, rather, the truth about the subjects' judgments of the disutility of these conditions. We might expect such convergence if the subject has an internal scale of disutility that obeys the consistency requirement, but distorts this scale when expressing it through certain kinds of questions. When the distortions are removed, different kinds of questions will tap the same underlying scale. This is the theoretical claim made by the idea of scale convergence in psychophysics (Birnbaum, 1978).

Method

Following the same introduction as in the last experiment (which referred to a health insurer's need to measure the benefit of various interventions), half the subjects began with PTO and half with AS. Twenty-two subjects were interviewed, and 26 completed a written form of the same items (without any consistency checks, so that they answered each question only once). Order had no effect and is not discussed further. Data from an additional five subjects in the written condition were excluded because these subjects consistently answered PTO questions backward, giving numbers smaller than the standard when larger numbers were expected. (In the interview, such responses were quickly corrected, and subjects had no problem after one correction. These answers are of interest in their own right but are not examined further here. See Lochhead, 1980, for a related phenomenon.)

The PTO was essentially the same as the PTO condition of Experiment 1, except that it included only the following comparisons: BBDD versus BB; BBDD versus DD; BB versus B; DD versus D; BBDD versus B; and BBDD versus D. (Again, one letter means one eye or ear and two letters mean both.) These comparisons permitted two ratio consistency checks, one involving BBDD, BB, and B, and the other involving BBDD, DD, and D. Items in the first group had the form 'Curing 10 patients with Both-blind–Both-deaf is equivalent to curing ___ patients with Both-blind.'

In addition, in a second group of items, called PTO–Rev, the form of the comparison was reversed, e.g. 'Curing ___ patients with Both-Blind–Both-deaf is equivalent to curing 100 patients with Both-blind.' The response was thus a utility measure for the second term, on a 0–100 scale defined by No-impairment (0) and the first term (100). The responses from this judgment were therefore directly comparable to those from the AS. PTO–Rev included only four comparisons: BBDD versus BB; BBDD versus DD; BB versus B; and DD versus D, so we could not assess ratio inconsistency here.

The AS condition asked for judgments on individual scales instead of all at once, so as to make the two methods more comparable. So, for example, the subject was asked 'Where does Both-blind go on the following scale?' and a horizontal line was provided, divided into tenths by tick marks, with 0 and 'No-impairment' at the left and 100 and 'Both-blind–Both-deaf' on the right. A separate scale of this sort was used for each of the other comparisons listed for the PTO, allowing the same ratio inconsistency checks.
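To compare the three tasks, each response has to be placed on a common 0–1 disutility scale, with No-impairment at 0 and the standard (here Both-blind–Both-deaf) at 1. The sketch below shows one way to do the conversion, assuming the anchors stated above: a PTO standard of 10 patients, a PTO–Rev standard of 100 patients, and an AS line running from 0 to 100. The function names are hypothetical, and the example answers are invented for illustration.

```python
def pto_disutility(n_less_bad: float, n_standard: float = 10.0) -> float:
    """PTO: curing n_standard patients with the worse condition is judged as good
    as curing n_less_bad patients with the less bad one. Then
    n_standard * d(worse) = n_less_bad * d(less bad), so with d(worse) = 1,
    d(less bad) = n_standard / n_less_bad."""
    if n_less_bad <= 0:
        raise ValueError("PTO response must be a positive number of patients")
    return n_standard / n_less_bad


def pto_rev_disutility(n_worse: float, n_standard: float = 100.0) -> float:
    """PTO-Rev: curing n_worse patients with the worse condition is judged as good
    as curing n_standard patients with the less bad one, so
    d(less bad) = n_worse / n_standard."""
    return n_worse / n_standard


def as_disutility(mark: float, scale_max: float = 100.0) -> float:
    """AS: the mark on a line from 0 (No-impairment) to scale_max (the worse condition)."""
    return mark / scale_max


# Invented answers for Both-blind judged against Both-blind-Both-deaf
print(pto_disutility(25))      # 0.4
print(pto_rev_disutility(42))  # 0.42
print(as_disutility(59))       # 0.59
```

Under this conversion, a PTO answer smaller than 10 (the "backward" pattern mentioned above) would place the less bad condition above the standard, i.e. a disutility greater than 1.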

Half of the interviewed subjects in each order (AS first, PTO first) were given an introduction to the consistency checks to be carried out, before they did the task. The introduction was designed to counteract any tendency of the subject to resist the consistency checks on the grounds that the initial judgments should have been correct. The introduction simply explained the checks to be done but then told the subject not to think about them when making the judgments. It had no effect, so we shall not describe it further.

After each task, all interviewed subjects were given a set of consistency checks specific to that task, and they were then asked to redo the task, if necessary more than once, in order to try to make their judgments consistent. Most subjects redid the entire task only once, but so many made minor changes along the way that we did not attempt to count the number of times the task was done; we used only the final judgments in each task as our data.

The ratio inconsistency check for AS was read to the subject as follows (the letters refer to the subject's answers), for one of the two checks:

From your answers, we can conclude that being Both-blind is A% as bad as being Both-blind–Both-deaf, and being One-blind is C% as bad as being Both-blind. So the badness of being One-blind compared to Both-blind–Both-deaf should be C% of A%, or CA%. But you said E%.
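The arithmetic behind this script is the product rule for ratios: if Both-blind is A% as bad as Both-blind–Both-deaf and One-blind is C% as bad as Both-blind, then One-blind should be (C × A / 100)% as bad as Both-blind–Both-deaf. A minimal sketch of how such a check might be generated follows. The function names are hypothetical, and the 5-point tolerance for deciding when to read the check is our assumption, since the text does not say how large a discrepancy triggered it.

```python
from typing import Optional


def implied_percent(a_pct: float, c_pct: float) -> float:
    """C% of A%, expressed as a percentage (the script's 'CA%')."""
    return c_pct * a_pct / 100.0


def ratio_check(a_pct: float, c_pct: float, e_pct: float,
                tolerance: float = 5.0) -> Optional[str]:
    """Return the check text if the direct answer E departs from C% of A%.

    `tolerance` (in percentage points) is a hypothetical threshold.
    """
    implied = implied_percent(a_pct, c_pct)
    if abs(implied - e_pct) <= tolerance:
        return None  # close enough: no check is read
    return (f"So the badness of being One-blind compared to Both-blind-Both-deaf "
            f"should be {c_pct:g}% of {a_pct:g}%, or {implied:g}%. "
            f"But you said {e_pct:g}%.")


print(ratio_check(60, 40, 35))
# So the badness of being One-blind compared to Both-blind-Both-deaf should be 40% of 60%, or 24%. But you said 35%.
```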

This would create a problem for the insurer. To determine the badness of being One-blind relative to Both-blind–Both-deaf, they would not know which answer to use. They might not always have time to ask everyone all three questions about all combinations of conditions.

Try to make your numerical answers consistent. Can you do this and still have them reflect your true opinions about the conditions? (If not, why not?)

This was followed by a check for differences, as follows:

Which is greater, the difference in badness between having No Impairment and being Both-blind, or the difference in badness between being Both-blind and being Both-blind–Both-deaf? A should be more than 50 if you think that the first difference is larger, and it should be less than 50 if you think that the second difference is larger.
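On the 0–100 scale used here, a rating A for Both-blind splits the range into two steps, A (No Impairment to Both-blind) and 100 − A (Both-blind to Both-blind–Both-deaf), so the direct difference judgment and the numerical rating must agree about which step is larger. A small sketch of that comparison, with a hypothetical function name:

```python
def difference_check_passes(a_rating: float, first_difference_larger: bool) -> bool:
    """True if the AS rating agrees with the subject's direct difference judgment.

    first_difference_larger: the subject says No Impairment -> Both-blind is a
    bigger step than Both-blind -> Both-blind-Both-deaf.
    """
    first_step = a_rating            # No Impairment (0) to Both-blind (A)
    second_step = 100.0 - a_rating   # Both-blind (A) to Both-blind-Both-deaf (100)
    if first_difference_larger:
        return first_step > second_step  # equivalent to A > 50
    return first_step < second_step      # equivalent to A < 50


# A rating of 70 paired with "the second difference is larger" fails the check.
print(difference_check_passes(70, first_difference_larger=False))  # False
```

A rating of exactly 50 fails either way, matching the strict "more than 50" and "less than 50" wording of the script.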

[The test was repeated for One-blind versus Both-blind.]

If these tests don't work, it would create a problem for the insurer. If they had to decide whether to treat some number of people with one condition or twice as many people with another condition, your answers about which difference is greater would imply one thing, but your numerical answers would imply another.

Try to make your numerical answers consistent. Can you do this and still have them reflect your true opinions about the conditions? (If not, why not?)

This difference check was repeated for the deafness items, and the interviewer explained the tests as needed. The consistency checks for the PTO were essentially the same, but modified as needed. The ratio inconsistency check was done for the PTO items only, and the difference check was done for the PTO–Rev items only.

The PTO check also began with a question about whether PTO could be interpreted as a ratio: 'Does the number of people matter, or the ratio? For example, you said that curing 10 people with Both-blind–Both-deaf is equivalent to curing A people with Both-blind. So curing 1 person with Both-blind–Both-deaf is equivalent to curing A/10 people with Both-blind. Is that right? [If not, explain that the insurer wants numbers that it can use this way. It doesn't know how many people have each condition.]'

Likewise, the PTO–Rev check began with: 'Now we might think of your answers as measures of how bad each condition is. The number of Both-blind–Both-deaf people that are equivalent to a condition in 100 people is proportional to how bad that condition is. Do you see any reason why the insurer shouldn't interpret your answers this way?'

Twenty-two subjects were interviewed, 12 with and 10 without the introduction to the consistency checks (which, as we said, had no effect). Eleven of the 22 did PTO, PTO–Rev, and AS, and 11 did the reverse. In addition, 26 subjects (14 AS-first, 12 PTO-first) completed the scales in a written form only, without an interviewer, and without consistency checks. This allowed us to test for effects of the presence of an interviewer on consistency in the first set of items. Subjects were solicited as in Experiment 1.

Results

Acceptability of consistency

Essentially all of the interviewed subjects found it acceptable to try to make their judgments consistent, although not all of them managed to succeed in doing so.

Mean disutility and ratio inconsistency before consistency checks. The interview versus paper subjects did not differ significantly in the measures common to both groups, so we combined the groups for the analysis of effects before the consistency checks. (This lack of difference suggests that interviewing as such does not change performance, except for the opportunity to correct backward interpretation of the PTO.)

Mean disutilities on the comparisons common to all three tasks were 0.47 for AS, 0.35 for PTO, and 0.40 for PTO–Rev. All differences were significant at p = 0.030 or better by t-test. Of particular interest is the difference between PTO–Rev and PTO. What the tasks with smaller disutilities (ME, versus AS in Experiment 1, and PTO, versus PTO–Rev) have in common is that both involve high numerical responses. At least part of the effect, then, may be explained in terms of a tendency to assign higher numerical responses to the condition to which a number is assigned, whether this condition is the worse condition or the less bad condition.

Ratio inconsistency was measured as before, as the log of the product of the smaller steps divided by the larger step (e.g. BBDD versus BB times BB versus B, divided by BBDD versus B), with the product taken of this ratio and the corresponding ratio for deafness. Before the consistency checks, ratio inconsistency was not significantly different from zero for AS (mean −0.10), but was less than zero for PTO (mean −0.72, t(44) = 2.70, p = 0.010). The difference between the two was significant (t(42) = 2.34, p = 0.024). These results are consistent with those of Experiment 1, where we found negative ratio inconsistency for PTO and did not examine ratio inconsistency in AS. Although the mean ratio inconsistency in AS was neither positive nor negative, only 15% of the subjects made perfectly consistent judgments.
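The ratio inconsistency measure can be written out as follows. For each triple, the two smaller steps (e.g. BB relative to BBDD, and B relative to BB) are multiplied and divided by the larger step (B relative to BBDD); the blindness and deafness ratios are then multiplied and the log is taken, so 0 means perfect consistency and a negative value means the direct judgment of the largest step exceeds what the two smaller steps imply. The text does not state the base of the logarithm; the natural log is assumed here, the function names are hypothetical, and the subject's numbers are invented.

```python
import math
from typing import Tuple

# (worse -> middle, middle -> least bad, worse -> least bad), as disutilities,
# e.g. (BB vs BBDD, B vs BB, B vs BBDD)
Triple = Tuple[float, float, float]


def step_ratio(triple: Triple) -> float:
    """Product of the two smaller steps divided by the larger step (1.0 if consistent)."""
    big_to_mid, mid_to_small, big_to_small = triple
    return big_to_mid * mid_to_small / big_to_small


def ratio_inconsistency(blind: Triple, deaf: Triple) -> float:
    """Log of the product of the blindness and deafness step ratios.

    Natural log is an assumption; 0 means perfectly consistent.
    """
    return math.log(step_ratio(blind) * step_ratio(deaf))


# Invented subject: BB is 0.5 of BBDD and B is 0.4 of BB, so the product rule
# predicts B = 0.2 of BBDD, but B is judged 0.4 of BBDD directly.
# The deafness triple is consistent (0.5 * 0.5 = 0.25).
print(round(ratio_inconsistency((0.5, 0.4, 0.4), (0.5, 0.5, 0.25)), 3))  # -0.693
```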

Effect of consistency checks. Consistency checking increased agreement between AS and PTO, and it decreased the absolute value of ratio inconsistency. Exhibit 4 shows the mean disutilities for the interview subjects only, for purposes of comparison.

Exhibit 4. Mean disutilities for interviewed subjects

                                 Method
Condition   PTO    PTO–ME   PTO–Rev   PTO–Rev2   AS     AS2
BB          0.40   0.41     0.42      0.45       0.59   0.57
DD          0.30   0.34     0.34      0.34       0.45   0.44
B/BB        0.33   0.33     0.33      0.31       0.39   0.36
D/DD        0.30   0.30     0.35      0.33       0.36   0.33
B           0.16   0.14                          0.23   0.20
D           0.13   0.13                          0.19   0.16

'2' means after the consistency checks. Abbreviations are the same as in Exhibit 1.

Although effects of the checks (initial versus second) were small, they did reduce ratio inconsistency. The absolute value of ratio inconsistency for the interview subjects decreased from 0.29 to 0.07 for AS (t(19) = 3.01, p = 0.007) and from 1.06 to 0.46 for PTO (t(18) = 3.19, p = 0.005).² The mean ratio inconsistency for AS was still not significantly different from zero after the checks. The mean ratio inconsistency for PTO was not significantly different from zero after the checks, and it changed significantly from before to after (from −0.35 to 0.17 for the 18 subjects with complete data, p = 0.027).

The mean disutility in AS decreased from 0.37 before the checks to 0.34 after (for the mean of all items, p = 0.31). The checks did not affect the means for PTO or PTO–Rev.
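As an arithmetic check, averaging the six AS items in the table of mean disutilities for interviewed subjects reproduces these two means. This reading assumes that the four values in the B and D rows belong to PTO, PTO–ME, AS, and AS2, since PTO–Rev included no BBDD-versus-B or BBDD-versus-D item.

```python
# AS and AS2 columns of the interview-subject table, in row order
# BB, DD, B/BB, D/DD, B, D (B and D rows: last two of their four values, assumed)
as_before = [0.59, 0.45, 0.39, 0.36, 0.23, 0.19]
as_after = [0.57, 0.44, 0.36, 0.33, 0.20, 0.16]


def mean(values: list) -> float:
    return sum(values) / len(values)


print(round(mean(as_before), 2), round(mean(as_after), 2))  # 0.37 0.34
```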

The measures agreed with each other more after the consistency manipulation than before. To measure disagreement between two measures, we computed the mean absolute value of the differences of the utilities of all the items common to the measures. (We excluded cases in which more than two items were missing from AS or PTO or more than one from comparisons involving PTO–Rev.) Disagreement between AS and PTO was 0.187 before and 0.151 after the consistency checks (t(17) = 2.46, p = 0.012, one-tailed). Disagreement between AS and PTO–Rev was 0.169 before and 0.155 after (n.s.). Disagreement between PTO and PTO–Rev was 0.117 before and 0.111 after (n.s.). Although only one of the three before–after comparisons was significant, the change in the mean of all three measures was also significant (t(19) = 2.11, p = 0.024), and the nonsignificant results were those with the fewest data.
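The disagreement index is the mean absolute difference between two methods' disutilities over the items they share. A sketch with hypothetical names follows. Missing items are represented as None, and the exclusion rule (more than two missing items for AS versus PTO, more than one when PTO–Rev is involved) is approximated here by counting items missing from either method.

```python
from typing import Dict, Optional

Answers = Dict[str, Optional[float]]  # item label -> disutility (None if missing)


def disagreement(a: Answers, b: Answers, max_missing: int) -> Optional[float]:
    """Mean |a - b| over shared items, or None if the case should be excluded."""
    shared = [item for item in a if item in b]
    pairs = [(a[i], b[i]) for i in shared if a[i] is not None and b[i] is not None]
    if len(shared) - len(pairs) > max_missing or not pairs:
        return None
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


# Invented subject whose AS disutilities sit above the PTO ones
as_answers = {"BB": 0.6, "DD": 0.5, "B/BB": 0.4, "D/DD": 0.4, "B": 0.25, "D": 0.2}
pto_answers = {"BB": 0.4, "DD": 0.3, "B/BB": 0.35, "D/DD": 0.3, "B": 0.15, "D": 0.1}
print(round(disagreement(as_answers, pto_answers, max_missing=2), 3))  # 0.125
```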

To determine the source of the reduced disagreement between AS and PTO, we regressed the reduction in disagreement on four predictors, two for each measure: the reduction in (the absolute value of) ratio inconsistency for AS and PTO, and the decrease in mean disutility for AS and PTO. The overall regression was significant (p = 0.035). The only significant predictor was the reduction in the AS mean (b = 0.62, p = 0.11). These results suggest that the reduced disagreement between AS and PTO was the result of bringing down the excessively high disutilities expressed in AS. This may have resulted from the difference training. Although ratio consistency training was effective, its effect apparently did not contribute to the reduced disagreement.

EXPERIMENT 3

Experiment 3 asks whether training effects in consistency checking can transfer between two different methods, AS and ME. Such transfer effects would be consistent with the view that training could lead to more accurate expression of judgments on an underlying utility scale. The experiment compared trained groups to a control group with no training.

Method

Subjects were 60 undergraduate or graduate students from the University of Pennsylvania, mean age 21.2 (33 males and 27 females). Two additional subjects made obvious careless mistakes, and one yielded utilities for magnitude estimations that were more than two orders of magnitude smaller than the closest other subject (with ratios on the order of a million). These three subjects were not included. Subjects were paid $6, and the experiment took less than an hour.

Analog Scale (AS) and Magnitude Estimation (ME) were used in four questionnaires. AS-before and ME-before represented the AS and ME before the consistency check; AS-after and ME-after represented the AS and ME after the consistency check. The analog scale was printed with the normal condition as its left end and a worse health state as its right end. Subjects were asked to put a third, less serious, condition in the appropriate position on the scale. ME questions simply asked subjects how many times worse one condition was than another.

² Ratio inconsistency went in both directions for both measures, even though the mean value of ratio inconsistency was negative for PTO before the consistency checks. The checks were effective in reducing ratio inconsistency regardless of its direction.

Two sets of health conditions were included in AS-before, ME-before, AS-after and ME-after, one dealing with paralysis and the other with sensory losses. The paralysis items were one arm (A), one leg (L), both arms (AA), both legs (LL), and both arms and legs (AALL); death (X) was also included. The sensory items were blindness of one eye (B), complete blindness (BB), deafness in one ear (D), complete deafness (DD), blindness and deafness (BBDD), and death (X). Each questionnaire (AS-before, ME-before, AS-after, ME-after) had nine questions, three groups of three, with each group allowing a check for ratio consistency. The ratio inconsistency tests were based on the products of the following 'first two items' compared to the third item. Each item shows the standard on the right.

First two items        Compared to

Paralysis items:
AALL/LL   LL/L         AALL/L
X/AA      AA/A         X/A
X/AALL    AALL/LL      X/LL
X/AALL    AALL/L       X/L

Sensory items:
BBDD/BB   BB/B         BBDD/B
X/DD      DD/D         X/D
X/BBDD    BBDD/BB      X/BB
X/BBDD    BBDD/B       X/B
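Each row of this table is a product-rule check. A minimal sketch for magnitude estimation follows, under two assumptions of ours: an item 'P/Q' is read as the number of times P is as bad as the standard Q (so ratios multiply along a chain), and inconsistency is the natural log of (first × second / compared-to), averaged in absolute value as in the analysis of this experiment. The function name and the ME answers are hypothetical; the sensory rows have the same structure.

```python
import math
from typing import Dict, List, Tuple

# Paralysis rows of the table: (first item, second item, item compared to)
PARALYSIS_CHECKS: List[Tuple[str, str, str]] = [
    ("AALL/LL", "LL/L", "AALL/L"),
    ("X/AA", "AA/A", "X/A"),
    ("X/AALL", "AALL/LL", "X/LL"),
    ("X/AALL", "AALL/L", "X/L"),
]


def check_inconsistencies(me: Dict[str, float],
                          checks: List[Tuple[str, str, str]] = PARALYSIS_CHECKS) -> List[float]:
    """Natural log (assumed) of (first * second / compared_to) per check; 0 = consistent."""
    return [math.log(me[p] * me[q] / me[r]) for p, q, r in checks]


# Invented "times as bad" answers; only the X/AA, AA/A, X/A chain is inconsistent.
me_answers = {"AALL/LL": 2, "LL/L": 2, "AALL/L": 4, "X/AA": 3, "AA/A": 2, "X/A": 5,
              "X/AALL": 2, "X/LL": 4, "X/L": 8}
values = check_inconsistencies(me_answers)
print([round(v, 3) for v in values])                         # [0.0, 0.182, 0.0, 0.0]
print(round(sum(abs(v) for v in values) / len(values), 3))  # 0.046
```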

Subjects completed AS-before and ME-before, then did a consistency check on AS-before (AS-before experimental group), on ME-before (ME-before experimental group), or on neither (control group), and finally completed AS-after and ME-after. Assignment of content (sensory versus paralysis) to before/after was counterbalanced, as was the order of the nine items in each questionnaire (which was otherwise random).

The instructions for all groups began: 'The purpose of this study is to evaluate different public health care programs. Suppose the Department of Health in a state must choose a new policy to prevent various illness. The decision would depend on factors such as the seriousness of each illness, the cost of prevention, the amount of budget, etc. The basic idea for making decision is that we want to get the most benefit from the money spent. The following questions are used for measuring the badness of each health state. Imagine that all the patients are college students.' Subjects were then given specific instructions and examples about how to complete AS and ME.

The consistency check was given to the two experimental groups. Its wording depended on the particular inconsistency found. For example: 'In question B you used number X to indicate the badness of Paralysis of one arm when the badness of Paralysis of both arms was 100. That meant you thought Paralysis of one arm was X% as bad as Paralysis of both arms. Following this inference, you thought Paralysis of both arms was Y% as bad as Death. According to the method for calculating utility, your evaluation for Paralysis of one arm was (X·Y)% as bad as Death. Now look at question I, in which you thought Paralysis of one arm was Z% as bad as Death. Z% was quite different from (X·Y)%. This would create a problem for decision makers, who are responsible for deciding which health care program to use. They would not know which answer, the direct or indirect one, represented your true opinion, and they don't have time to ask every respondent why such inconsistency existed. Please take a moment to think about this problem and whether you can be more consistent?'

Results

The consistency check increased consistency within each method (AS and ME), transferred to the untrained method, and increased agreement between the two methods (Exhibit 5). Again, because subjects differed in the direction of their initial ratio inconsistency and because we wanted to use as many data as possible, we used the absolute value of the ratio inconsistency measure in the analysis of the data. (Because the measure was logarithmic, 0 indicated perfect consistency.)

Exhibit 5. Mean inconsistency (absolute value of log) and standard deviation (in parentheses) before and after the consistency check

Group                 AS-before       ME-before       AS-after        ME-after
Check on AS-before    0.257 (0.135)   0.488 (0.379)   0.108 (0.108)   0.357 (0.405)
Check on ME-before    0.188 (0.116)   0.520 (0.583)   0.153 (0.134)   0.143 (0.173)
Control               0.216 (0.177)   0.477 (0.487)   0.242 (0.199)   0.604 (0.521)

Inconsistency decreased in the experimental groups and increased in the control group, but the increase was not significant. The best measure of a training effect is comparison of the decrease in inconsistency between the experimental and control groups. For the trained task (AS or ME), this difference was significant for each kind of training (t(38) = 3.87, p = 0.0004, for AS; t(38) = 3.15, p = 0.0031, for ME). For the untrained tasks, this difference was significant for AS (t(38) = 2.23, p = 0.0320) but not for ME (t(38) = 1.40, p = 0.1692). However, an overall test of transfer to the untrained task, combining both AS and ME (and averaging the two conditions in the control group), yielded a significant difference (t(58) = 2.23, p = 0.0295). Moreover, the two experimental groups did not differ significantly in the change in the untrained task (t = 1.06). We can thus conclude that, in general, the training transferred to the other task, whichever task was trained. Transfer was not complete, however: the interaction between change in AS versus ME and group (for the experimental groups) was significant (F(1,38) = 4.62, p = 0.038).
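The logic of the training comparison (decrease in a trained group minus decrease in the control group) can be illustrated with the group means from the table of mean inconsistency. This is only a descriptive sketch on the published means, with hypothetical names; the t tests reported above were computed on individual subjects' scores, which are not available here.

```python
# Group means of |log ratio inconsistency| from the table: (before, after)
MEANS = {
    "check_AS": {"AS": (0.257, 0.108), "ME": (0.488, 0.357)},
    "check_ME": {"AS": (0.188, 0.153), "ME": (0.520, 0.143)},
    "control": {"AS": (0.216, 0.242), "ME": (0.477, 0.604)},
}


def decrease(group: str, task: str) -> float:
    before, after = MEANS[group][task]
    return before - after


def effect_vs_control(group: str, task: str) -> float:
    """Decrease in a trained group minus the decrease in the control group."""
    return decrease(group, task) - decrease("control", task)


for group, trained, untrained in (("check_AS", "AS", "ME"), ("check_ME", "ME", "AS")):
    print(group, "trained:", round(effect_vs_control(group, trained), 3),
          "untrained:", round(effect_vs_control(group, untrained), 3))
# check_AS trained: 0.175 untrained: 0.258
# check_ME trained: 0.504 untrained: 0.061
```

All four differences are positive, which is the pattern the tests above evaluate; their significance depends on the individual-level variability shown by the standard deviations in the table.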

Although we used the absolute value to look at training effects, we note that ratio inconsistency was in the same direction as previously found, with the product of the smaller differences too high relative to the larger difference (mean of 0.13 for AS-before, t(59) = 5.30, p < 0.0001; 0.43 for ME, t(59) = 6.71, p < 0.0001). The effects of training were also significant when we used the raw measure in place of the absolute value (t(58) = 2.72, p = 0.0085, for the trained task; t(58) = 2.44, p = 0.0177, for the untrained [transfer] task).

DISCUSSION

Our most important result concerns the effect of exposing subjects to consistency checks. We asked subjects to do two kinds of checks, one for difference comparison and one for ratio inconsistency. These checks reduced ratio inconsistency in all measures, reduced disutilities in AS, and reduced the disagreement among measures. Experiment 3 shows that this change is not just a function of repeated testing, although the effect there was limited to ratio inconsistency (because the other discrepancy was not trained). The increased agreement in Experiment 2 was apparently mediated, to some extent, by the reduction in mean disutilities for AS. This is consistent with the hypothesis that AS disutilities are generally too high, as in the case of the ganglion cyst cited in the introduction. The reduced disagreement suggests increased validity, and the apparent reason for it supports the same conclusion.

Increased agreement among measures as a result of inconsistency reduction in one of the measures suggests that inconsistency is a source of invalidity. Again, our interest here is in the validity of expression of an internal scale that has the properties of a utility scale (assuming that such a scale exists), not in a true judgment of the utility of a condition. We suggest this procedure as a general method for studying all types of utility elicitation.

In addition to these effects of consistency checking, our findings confirmed earlier reports of disagreement among measures of disutility. AS disutilities were highest (as typically found) and ME disutilities were lowest, a new result. PTO disutilities were smaller than PTO–Rev disutilities, suggesting that the numbers assigned to the subject of the comparison are high, relative to the standard, regardless of which condition is the standard (worse or less bad). One explanation of this result is that the scale is unbounded when the numbers are higher than those assigned to the standard, but it is bounded by zero when they are lower. Subjects may not want to get too close to zero.

The difference between PTO and ME was, for most subjects, not a matter of strong commitment: when the two judgments were interleaved, most subjects thought it reasonable to make them agree. AS judgments were inconsistent with direct judgments of differences, and merely asking subjects to make the difference judgment before doing the AS task did not reduce this inconsistency.

We also found ratio inconsistency in PTO (as reported by Ubel et al., 1996) and in ME. The direction of the ratio inconsistency is that the most extreme judgments were not extreme enough: either the disutility assigned to the least bad condition relative to the worst one is not low enough, or the disutility assigned to the intermediate conditions is not high enough, or both. Mean AS ratio inconsistency across subjects was not different from zero, but some subjects were inconsistent in each direction.

AS judgments are inconsistent with judgments of differences. Most subjects assign disutilities of more than 0.5 to blindness or deafness on a scale from no impairment (0) to blindness+deafness (1). Yet the same subjects judge that the difference between no impairment and blindness or deafness is less than that between either blindness or deafness alone and both together. The latter judgment seems reasonable, since vision and hearing are, to some extent, substitutes in the economic sense. Either can be a means to communication, for example, and when both are absent communication becomes much more difficult. We therefore suspect that the AS disutility judgments are in fact too high. This tendency may also help to reduce ratio inconsistency, by leaving more room for relatively low numbers to be assigned to one-eye blindness or one-ear deafness. AS may thus increase internal consistency at the expense of external validity.

AS is one of the easier methods to use. It is more highly correlated than other measures with questionnaire scales of health (Bosch and Hunink, 1996), perhaps because it is less prone to error based on misunderstanding. High correlations, however, can be based on a correct ordering of conditions, even if the numbers assigned are generally too high or too low. Our results suggest that they are too high. But the results also suggest that appropriate instruction can substantially increase the validity of the responses.

Our study leaves several questions unanswered, and we are pursuing these in current research:

(1) How do our results depend on the direction of the scale? Our questions all concerned disutility, using normal health as the reference point. Conceivably, the excessively high disutilities resulting from AS could result from a convex utility function in the domain of losses, and the use of normal health as a reference point could encourage such a perception. This is unlikely to be the whole story, since Ubel et al. (1996) also found that AS yielded larger disutilities, and they used death as the implied reference point.

(2) What are the sources of the inconsistency between AS and ME and between PTO and PTO–Rev? It seems that subjects tend to give numbers that are too high, regardless of whether the worse condition or the less bad condition serves as the standard of comparison.

(3) What are the causes of ratio inconsistency, and how is it reduced? Possibly, subjects anchor on earlier judgments when making later ones, so that their later judgments are not different enough from the earlier ones. Ratio inconsistency decreases after the consistency check, but we do not know why. In particular, we do not know whether subjects learn anything that would improve their judgments even when they cannot themselves carry out the consistency check.

ACKNOWLEDGEMENTS

This research was supported by NSF grant SBR95-20288 (Baron). Peter Ubel's work was supported by the Department of Veterans Affairs through a Career Development Award in health services research and by the Robert Wood Johnson Foundation's Generalist Physician Faculty Scholar Program. Peter Ubel is in the Center for Bioethics and the Veterans Administration Medical Center.

REFERENCES

Baron J. 1997. Biases in the quantitative measurement of values for public decisions. Psychological Bulletin 122: 72–88.

Birnbaum MH. 1978. Differences and ratios in psychological measurement. In Cognitive Theory, Vol. 3, Castellan N, Restle F (eds). Erlbaum: Hillsdale, NJ; 33–74.

Bosch JL, Hunink MGM. 1996. The relationship between descriptive and valuational quality-of-life measures in patients with intermittent claudication. Medical Decision Making 16: 217–225.

Fagot RF. 1981. A theory of bidirectional judgments. Perception and Psychophysics 30: 181–193.

Fagot RF, Pokorny R. 1989. Bias effects on magnitude and ratio estimation power function exponents. Perception and Psychophysics 45: 221–330.

Fischer GW. 1995. Range sensitivity of attribute weights in multiattribute value models. Organizational Behavior and Human Decision Processes 62: 252–266.

Galanter E, Pliner P. 1974. Cross-modality matching of money against other continua. In Sensation and Measurement, Moskowitz HR et al. (eds). Reidel: Dordrecht; 65–76.

Houston DA, Sherman SJ, Baker SM. 1989. The influence of unique features and direction of comparison on preferences. Journal of Experimental Social Psychology 25: 121–141.

Kahneman D, Tversky A. 1979. Prospect theory: An analysis of decisions under risk. Econometrica 47: 263–291.

Kaplan RF. 1995. Utility assessment for estimating quality-adjusted life years. In Valuing Health Care: Cost Benefits and Effectiveness of Pharmaceuticals and Other Medical Technologies, Sloan FA (ed.). Cambridge University Press: New York; 31–60.

Keeney RL, Raiffa H. 1993. Decisions with Multiple Objectives. Cambridge University Press: New York (originally published by Wiley, 1976).

Krantz DH, Luce RD, Suppes P, Tversky A. 1971. Foundations of Measurement, Vol. 1. Academic Press: New York.

Lochhead J. 1980. Faculty interpretations of simple algebraic statements: The professor's side of the equation. Journal of Mathematical Behavior 3: 30–37.

Nord E. 1995. The person trade-off approach to valuing health care programs. Medical Decision Making 15: 201–208.

Nord E, Richardson J, Kuhse H, Singer P. 1995. Who cares about cost? Does economic analysis impose or reflect social values? Health Policy 34: 79–94.

Richardson J. 1994. Cost–utility analysis: What should be measured? Social Science and Medicine 39: 7–21.

Stevens SS. 1951. Mathematics and psychophysics. In Handbook of Experimental Psychology, Stevens SS (ed.). Wiley: New York; 1–49.

Torrance GW. 1986. Measurement of health-state utilities for economic appraisal: A review. Journal of Health Economics 5: 1–30.

Ubel PA. 1998. How stable are people's preferences for giving priority to severely ill patients? Social Science and Medicine 49: 895–903.

Ubel PA, Loewenstein G, Scanlon D, Kamlet M. 1996. Individual utilities are inconsistent with rationing choices: A partial explanation of why Oregon's cost-effectiveness list failed. Medical Decision Making 16: 108–116.

von Winterfeldt D, Edwards W. 1986. Decision Analysis and Behavioral Research. Cambridge University Press: New York.

Authors' biographies:

Jonathan Baron is Professor of Psychology, University of Pennsylvania. A third edition of his book Thinking and Deciding (Cambridge University Press) is expected this year.

Dallas Brennan is a researcher and writer on contemporary cultural politics and mass media. She works regularly in Trinidad and the United States and is based in New York City.

Peter Ubel is Assistant Professor, General Internal Medicine, University of Pennsylvania. His book Pricing Life: Why it's time for health care rationing is published by MIT Press.

Christine Weeks does market research at Brintnall and Nicolini, in Philadelphia.

Zhijun Wu is a graduate student in computer science at Temple University[

Authors' addresses:

Jonathan Baron, Department of Psychology, University of Pennsylvania, 3815 Walnut Street, Philadelphia, PA 19104-6196, USA.

Peter Ubel, Department of General Internal Medicine, University of Pennsylvania, 1223 Blockley Hall, Philadelphia, PA 19104-6021, USA.

Dallas Brennan, 97 First Place, 3R, Brooklyn, NY 11231, USA.

Christine Weeks, Brintnall and Nicolini, Inc., 1880 JFK Blvd., Philadelphia, PA 19103, USA.

Zhijun Wu, Department of Computer and Information Sciences, Temple University, 1805 N. Broad St., Philadelphia, PA 19122, USA.