

Socio-Econ. Plann. Sci. Vol. 21, No. 6, pp. 371-375, 1987. 0038-0121/87 $3.00 + 0.00. Printed in Great Britain. All rights reserved. Copyright © 1987 Pergamon Journals Ltd

AN EMPIRICAL EVALUATION OF TWO ASSESSMENT METHODS FOR UTILITY MEASUREMENT FOR LIFE YEARS

ABRAHAM MEHREZ¹ and AMIRAM GAFNI²

¹Department of Industrial Engineering and Management, Ben Gurion University of the Negev, Beer Sheva, Israel and ²Health Economics and Policy Analysis Group, Department of Clinical Epidemiology & Biostatistics, Health Science Center, McMaster University, Hamilton, Ontario L8N 3Z5, Canada

(Received 20 March 1987)

Abstract-Measurements of utility functions over life years provide useful information for decision making in the health care field. However, biases in the assessment procedures of utility functions are well known and documented. In this paper we investigate possible biases in the assessment of utility functions when two different methods (direct and indirect assessment) are used. More specifically, we examine the estimation of utility functions over different lengths of life. The main findings, obtained from an empirical investigation in which the two assessment techniques were applied to a sample of students, are: (a) the use of the different methods does not lead to significant differences in the utility evaluation from a social point of view (health program evaluation); (b) the use of the different methods does lead to significant differences in the utility evaluation from an individual point of view (clinical decision making); (c) in both methods risk aversion was found to be common for shorter periods of time while risk prone behavior, when it exists, was found mainly for longer periods of time.

INTRODUCTION

Biases in the assessment procedures of utility functions are well known and documented [1-3]. Thus, it is evident that more than one assessment procedure should be employed in order to improve the overall estimation of the utility function. This led to the development of different methods for constructing von Neumann-Morgenstern (VNM) utility functions, such as the certainty equivalence method, the probability equivalence method, the gain equivalence method and the loss equivalence method. In this paper we investigate possible biases in the assessment of utility functions when two different methods (direct and indirect assessment) are used. More specifically, we examine the estimation of utility functions over different lengths of life.

The theory of preferences and their measurement is well established in the field of health care, with two applications: clinical decision analysis and health program evaluation. Clinical decision analysis is concerned with decisions particular to tests and/or therapy for an individual patient. The patient’s preference in view of the various possible outcomes is measured and used as part of the decision analysis [4-6]. Health program evaluation is used when alternative health care programs exist, and a decision has to be made regarding which program (or variation) to pursue and which to defer or reject [4, 7-10].

In recent years we have witnessed a growing body of information in the health care field describing the influence of variables involved in the process of measurement upon the results obtained. For example, Llewellyn-Thomas et al. [11] show that the type of scenario presented to raters and the sequence in which the methods of assessment are used have a major influence on the results. Other researchers have criticized the use of such methods on the grounds that they may not be representative of a patient’s long term preference, and that a decision based on such measures may not maximize the patient’s long term satisfaction [12]. In spite of this criticism no one yet, to the best of our knowledge, has proposed better tools for revealing individual desires. (See a recent comprehensive description by Torrance [13] of the measurement of health state utilities for economic appraisal.) Thus, it seems to us that further investigation of the existing tools can only improve our understanding of their limitations, and lead to their more effective use by decision makers.

The need to compare different methods of utility measurement is called for in the literature. Bursztajn and Hamm [14], in their paper on the clinical utility of utility assessment, which deals with the usefulness of utility assessment to clinicians, say: “The sort of questions that one must ask can be clarified by examining the version of the lottery method of subjective utility assessment that is commonly used, as in the case of the paper by McNeil et al. This method produces a ‘utility curve’, showing the relation between the amount of life remaining and the subjective value of that amount of life. We ask, first, whether it is wise to rely on only one such method to measure utility?” (p. 161). The study presented here helps to address this question.

The choice of treatment, test, etc., usually depends on the patient’s tradeoff between two attributes: life years and quality of life. (More about this issue can be found in Pliskin et al. [15].) In this paper, however, we consider the estimation of utility functions over life years only, assuming a constant state of health (quality of life). A program or treatment which can affect only the length of life is most common for chronic diseases (a chronic state is defined as a constant health state over time). An example of a decision problem of a person facing alternative treatment possibilities for a chronic disease can be found in McNeil et al. [5] for the case of lung cancer.

Assessing a VNM utility function can be done either by standard reference lottery questions (the direct method) or via a two stage procedure (the indirect method). In the indirect method a value function is constructed in the first stage, and in the second stage lottery questions are used to construct a cardinal utility function (see Keeney and Raiffa [16]). The purpose of the experiment described in this paper is to compare these two methods in the context of life years.

Comparing the two methods is also of interest from a different perspective. Bell and Raiffa [23] and Dyer and Sarin [17, 18] have suggested that VNM utility functions reflect not only attitude toward risk but also intensity of preference for riskless outcomes. According to this argument, at least part of the risk aversion is due to the fact that the difference between the mean lottery value and the worst possible outcome of the lottery “matters more” than the difference between the mean lottery value and the best possible outcome of the lottery (which is similar to the economic argument of decreasing marginal utility). Dyer and Sarin [17, 18] thus propose that risk attitude and intensity of preference for sure outcomes be separated by independently scaling a risky utility function, U(x), and a riskless measurable value function, V(x), that reflects intensity of preference for differences between sure outcomes. Because both the risky utility function, U(x), and the measurable value function, V(x), must rank order sure outcomes according to the decision maker’s preferences, there exists a strictly increasing function, U_v, such that U(x) = U_v[V(x)]. Since the direct method measures the left hand side of this equation and the indirect method measures the right hand side, it is interesting to find out whether this theoretical equality is supported by empirical evidence.
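The decomposition U(x) = U_v[V(x)] can be illustrated with a minimal sketch. The functional forms below (a square-root value function and a power transform) are purely hypothetical assumptions chosen for illustration; they are not taken from the paper or from Dyer and Sarin. The sketch only shows that composing a strictly increasing U_v with V preserves the rank order of sure outcomes.

```python
import math

G = 50.0  # best outcome in years, as used in the experiment

def value(x):
    # Hypothetical riskless value function V(x), scaled so V(0) = 0 and V(G) = 1.
    return math.sqrt(x / G)

def u_v(v):
    # Hypothetical strictly increasing transform U_v over value (assumed form).
    return v ** 0.8

def utility(x):
    # Risky utility as the composition U(x) = U_v[V(x)].
    return u_v(value(x))

years = [0, 5, 10, 20, 30, 40, 50]
utilities = [utility(x) for x in years]

# A strictly increasing U_v keeps the same ordering of outcomes as V.
same_order = all(a < b for a, b in zip(utilities, utilities[1:]))
```

Any other strictly increasing choice of `u_v` would serve equally well here; the empirical question in the paper is whether the two sides agree when each is measured separately.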

The paper is organized as follows: in the first section we describe the hypotheses tested; in the second section we describe an experiment that was conducted for that purpose. The results of the different statistical tests performed are described in the third section which is followed by a summary and a discussion in the concluding section.

HYPOTHESES

The following hypotheses were tested:

1. Are the two methods equivalent? That is, do they produce the same value? A similar comparison of three measurement techniques to assess social preferences for health states (the standard gamble, the time trade-off and category scaling) was done by Torrance [19] with respect to the mean differences. In our study two types of comparisons are performed: comparison of mean differences and comparison of standard errors of measurement. These comparisons are made under the assumption that the observed value of the utility function can be decomposed into a fixed true value and an error variable. The properties of the distribution of the error variable may depend on the estimation method. Like Wallsten and Budescu [20] (who discuss the encoding of subjective probabilities), we postulate that the true utility value will be obtained across a series of statistically independent judgements by a given individual. Thus the true utility value is a hypothetical concept, which is independent of the method employed for estimation. Note that two sources of error may affect the properties of the error variable distribution in the case of the indirect estimation procedure. These sources stem from the fact that the estimation is done in two stages. However, the standard error of the error variable in the indirect method is not necessarily greater than the one derived from the direct method. The relation between these two standard errors depends, among other things, on the degree of (unobservable) correlation between the random errors of the two stages.
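The role of the correlation between the two stage errors can be made concrete with a small sketch. It assumes, purely for illustration, that the two stage errors combine additively; the paper does not specify how they combine. Under that assumption the variance of the combined error is s1² + s2² + 2·rho·s1·s2, so a negative correlation can make the two stage error smaller than either stage alone.

```python
import math

def combined_sd(sd_stage1, sd_stage2, rho):
    # Standard deviation of e1 + e2 when the two stage errors have
    # correlation rho. Additive errors are an assumption of this sketch.
    variance = sd_stage1 ** 2 + sd_stage2 ** 2 + 2.0 * rho * sd_stage1 * sd_stage2
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding

# Hypothetical stage errors, not values from the study.
uncorrelated = combined_sd(3.0, 4.0, 0.0)      # errors simply accumulate
offsetting = combined_sd(1.0, 1.0, -1.0)       # errors cancel completely
reinforcing = combined_sd(1.0, 1.0, 1.0)       # errors add up fully
```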

2. Is the indirect method consistent with the direct method? Namely, are the values measured by the two methods related in some systematic way? Following Torrance [19] we use the correlation coefficient between the utilities derived by the two methods as a validation criterion. Perfect consistency is obtained if the correlation coefficient is one.

3. Is the shape of the utility function influenced by response method biases? Namely, do the two utility functions measured differ in their shapes? Studies aimed at examining similar hypotheses were reported by Wehrung et al. [21], Hershey et al. [1] and Hershey and Schoemaker [22] in different settings. For example, Hershey et al. [1] found that the certainty equivalence method leads to more risk seeking behavior than the probability equivalence method. These studies (like others mentioned by Hershey et al. [1]) used the scheme suggested by Fishburn and Kochenberger to classify the composite shape of the utility function. A similar classification is used in our study to test the hypothesis mentioned above with respect to the same individual and with respect to the group as a whole.
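A shape classification of this kind can be sketched as follows. The rule below is only one plausible reading (it looks at the sign pattern of changes in slope between consecutive assessed points); the paper does not state the exact rule the authors applied, and the function name, tolerance and sample curves are hypothetical.

```python
def classify_shape(years, utils, tol=1e-9):
    # Slopes of the piecewise-linear utility curve between assessed points.
    slopes = [(utils[i + 1] - utils[i]) / (years[i + 1] - years[i])
              for i in range(len(years) - 1)]
    # Sign of each change in slope: -1 means locally concave, +1 locally convex.
    signs = []
    for i in range(len(slopes) - 1):
        change = slopes[i + 1] - slopes[i]
        if change < -tol:
            signs.append(-1)
        elif change > tol:
            signs.append(1)
    if not signs:
        return "linear"
    # Collapse consecutive repeats so only the pattern of sign runs remains.
    runs = [signs[0]]
    for s in signs[1:]:
        if s != runs[-1]:
            runs.append(s)
    patterns = {(-1,): "concave", (1,): "convex",
                (-1, 1): "concave-convex", (1, -1): "convex-concave"}
    return patterns.get(tuple(runs), "undetermined")

years = [0, 5, 10, 20, 30, 40, 50]
# Hypothetical utility curves for illustration only.
concave_curve = [(y / 50) ** 0.5 for y in years]
s_shaped_curve = [0.0, 0.3, 0.45, 0.55, 0.6, 0.7, 1.0]
```

With the curves above, `classify_shape(years, s_shaped_curve)` falls in the concave-convex category: risk averse for short durations and risk prone for long ones.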

THE EXPERIMENT

The experiment was conducted with 40 undergraduate students (average age: 25) in an advanced course in decision sciences in the Department of Industrial Engineering and Management at Ben Gurion University. All were familiar with von Neumann-Morgenstern theory from both lectures and reading material. Each subject received two consecutive questionnaires separated by a two week period. The order of the two questionnaires was randomized. No order effects were found. In both questionnaires the participants were presented with the following choice:

                    P
                 -------- G
S    versus    <
                 -------- L
                  1 - P

where S is a sure amount in terms of remaining life years, P is the probability of winning G years (for gain), and L (for loss) is the lower outcome of the lottery. In both cases we used the probability equivalence (PE) method to elicit an indifference level for P for given values of G, L and S. G and L were constant throughout the experiment at G = 50 yr and L = 0. The value of S was varied through the experiment using the following values: S = 5, 10, 20, 30, 40 yr.

Table 1. Confidence intervals of 95% for the differences of the expected utilities†, generated by normal distributions

S      Lower limit    Upper limit
5        -2.02           2.1
10       -0.52           0.54
20       -0.73           0.72
30       -2.01           1.98
40       -2.85           2.80

†The confidence interval was generated for the expected utility obtained by using the direct method minus the expected utility obtained by using the indirect method.

In one questionnaire the direct method of assessment was used. Using the PE method, the students were asked to state indifference levels for P for the different values of S, G and L mentioned above. An example of such a choice was:

Assume that you suffer from the following chronic condition (the condition was described). You can either choose not to have treatment and thus live five years for sure (S = 5) and then die, or you can choose to receive treatment with probability P to live 50 (G = 50) more years (in the same health state as the five years) and then die, and with probability (1 - P) to die immediately (L = 0). What is the level of probability P which makes you indifferent between these two alternatives?
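A PE response of this kind translates into a utility value as follows. Under expected utility with the usual normalization U(G) = 1 and U(L) = 0 (an assumption of this sketch, though a standard one), indifference implies U(S) = P·U(G) + (1 - P)·U(L) = P. Comparing P with the risk neutral probability (S - L)/(G - L) then indicates the local risk attitude. The function names and the sample responses are hypothetical.

```python
G, L = 50.0, 0.0  # gain and loss outcomes in years, as in the experiment

def utility_from_pe(p, u_gain=1.0, u_loss=0.0):
    # Indifference between S for sure and the lottery gives
    # U(S) = P * U(G) + (1 - P) * U(L).
    return p * u_gain + (1.0 - p) * u_loss

def risk_attitude(s, p, tol=1e-9):
    # A risk neutral subject is indifferent when the lottery's expected
    # length of life equals S, i.e. when P = (S - L) / (G - L).
    neutral_p = (s - L) / (G - L)
    if p > neutral_p + tol:
        return "risk averse"
    if p < neutral_p - tol:
        return "risk prone"
    return "risk neutral"

# Hypothetical indifference probabilities for one subject (not study data).
responses = {5: 0.30, 10: 0.45, 20: 0.55, 30: 0.62, 40: 0.70}
utilities = {s: utility_from_pe(p) for s, p in responses.items()}
attitudes = {s: risk_attitude(s, p) for s, p in responses.items()}
```

In this made-up example the subject demands P = 0.30 to give up five sure years, well above the risk neutral 0.10, and accepts P = 0.70 at S = 40, below the risk neutral 0.80.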

In the second questionnaire the indirect method of assessment was used. As mentioned in our introduction, the indirect method is a two stage procedure. In the first stage the ordinal value function for years of life was constructed using the category scaling method. The category scaling instrument consists of a 100 millimeter “desirability line” marked by the subject, and scored on the assumption that it represents 100 equal interval categories. One end of the line is labelled “Immediate Death, Least Desirable” while the other end is labelled “50 yr, Most Desirable”. The subject is asked to indicate on the line the position that represents the relative desirability of each of the relevant lengths of life. From the first stage, values were obtained for G, L and the different values of S. G was given the value of 1.0, L was given the value of 0.0, and the values assigned to the different levels of S were read from the marks made by the individuals. These values were used as inputs for the second stage, which was performed a week later. As mentioned in our introduction, in this stage lottery questions are used to construct the cardinal utility function. Again, using the PE method, the students were asked to determine indifference levels for P for the different values of S, G and L obtained in the first stage. Note that an important difference between the direct and indirect methods is the outcome used in the lottery (the values of S, G, L). While in the direct method one uses actual years, in the indirect method one uses judgemental values obtained in the first stage using the category scaling method.

Table 2. The computed values of the F(39, 39) test for equality of standard deviations from normal distributions, for different values of S†

S      Calculated value of F
5        0.93
10       1.16
20       1.33
30       1.6
40       1.5

†The null hypothesis of equal standard errors of measurement is tested against the alternative that the standard error of measurement of the direct method differs from (here, is larger than) the one obtained in the indirect method.
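The two stage procedure can be sketched as a small data pipeline. This is one reading of the description above, with hypothetical marks and probabilities rather than study data: stage 1 rescales the millimeter marks to values between 0 and 1, and stage 2 treats the PE answer over those values as the utility of the corresponding value, so that U(S) = U_v[V(S)] = P.

```python
# Hypothetical responses for one subject (not data from the study).
# Stage 1: marks in millimeters on the 100 mm desirability line.
marks_mm = {0: 0, 5: 30, 10: 45, 20: 65, 30: 80, 40: 92, 50: 100}
# Stage 2: indifference probabilities from PE questions posed over stage 1 values.
stage2_p = {5: 0.35, 10: 0.50, 20: 0.70, 30: 0.82, 40: 0.93}

def stage1_values(marks):
    # Rescale so that immediate death scores 0.0 and 50 yr scores 1.0.
    low, high = marks[0], marks[50]
    return {s: (m - low) / (high - low) for s, m in marks.items()}

def indirect_utilities(values, probabilities):
    # With U_v(0) = 0 and U_v(1) = 1, indifference gives U_v(V(S)) = P,
    # so each S is paired with its stage 1 value and its stage 2 utility.
    return {s: {"value": values[s], "utility": p}
            for s, p in probabilities.items()}

values = stage1_values(marks_mm)
result = indirect_utilities(values, stage2_p)
```

Keeping the stage 1 value alongside the stage 2 utility makes it possible to inspect U_v separately from V, which is the separation of risk attitude from strength of preference discussed in the Introduction.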

RESULTS

1. With respect to the first hypothesis, two types of statistical tests were performed. The first was a comparison of mean differences of two normal distributions. A t-test was performed for the different values of S mentioned in the previous section. The null hypothesis that the mean of the difference scores is zero was not rejected (significance level of 0.05). In Table 1 the upper and lower limits of the 95% confidence interval are presented. The second test was a comparison of the standard errors of measurement. The null hypothesis of equality of standard deviations was not rejected (significance level of 0.05) when it was compared to either of the following alternative hypotheses: the standard error of measurement of the direct method is larger, or alternatively smaller, than the one obtained in the indirect method. The computed values of F(39, 39) for the different values of S for the first alternative are presented in Table 2. These values were compared to the critical value of F, which is 1.69.
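The two decisions reported here can be re-checked directly from the published tables. The sketch below only re-reads Tables 1 and 2 and applies the decision rules stated in the text (a 95% interval containing zero, and F below the quoted critical value of 1.69); it does not recompute anything from the raw responses, which are not given in the paper.

```python
# Values transcribed from Table 1: S -> (lower limit, upper limit).
table1_ci = {5: (-2.02, 2.10), 10: (-0.52, 0.54), 20: (-0.73, 0.72),
             30: (-2.01, 1.98), 40: (-2.85, 2.80)}
# Values transcribed from Table 2: S -> computed F(39, 39).
table2_f = {5: 0.93, 10: 1.16, 20: 1.33, 30: 1.60, 40: 1.50}
F_CRITICAL = 1.69  # critical value quoted in the text

# Equality of means is not rejected when the interval contains zero.
mean_equal_not_rejected = {s: lo <= 0.0 <= hi
                           for s, (lo, hi) in table1_ci.items()}
# Equality of standard deviations is not rejected when F is below the critical value.
sd_equal_not_rejected = {s: f < F_CRITICAL for s, f in table2_f.items()}
```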

2. With respect to the second hypothesis, the null hypothesis that the correlation coefficient between the utilities derived by the two methods is zero was tested, for each level of S, against the alternative hypothesis that it is not zero. The null hypothesis was not rejected at a significance level of 0.05. The sample correlations for the different values of S are presented in Table 3. However, some positive levels of correlation can be seen in Table 3.

Table 3. The sample correlation for the different levels of S

S      Sample correlation
5        0.103
10       0.122
20       0.129
30       0.242
40       0.072
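The test of a zero correlation can be reproduced from Table 3 with the usual statistic t = r·sqrt(n - 2)/sqrt(1 - r²). The sketch assumes all n = 40 subjects contributed to each correlation and uses the standard two-sided 5% critical value of about 2.024 for 38 degrees of freedom; the paper does not state which form of the test the authors used.

```python
import math

N_SUBJECTS = 40      # sample size reported for the experiment
T_CRITICAL = 2.024   # approximate two-sided 5% point of t with 38 df (assumed)

# Sample correlations transcribed from Table 3: S -> r.
table3_r = {5: 0.103, 10: 0.122, 20: 0.129, 30: 0.242, 40: 0.072}

def t_statistic(r, n):
    # Test statistic for H0: the population correlation is zero.
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

t_values = {s: t_statistic(r, N_SUBJECTS) for s, r in table3_r.items()}
zero_corr_not_rejected = {s: abs(t) < T_CRITICAL for s, t in t_values.items()}
```

Even the largest correlation (0.242 at S = 30) gives a statistic of roughly 1.54, which is consistent with the paper's report that the null hypothesis was not rejected at any level of S.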


Table 4. The sample proportions of the different categories for the different methods

Category              Direct assessment    Indirect assessment
Concave                     18                    23
Convex                       0                     0
Concave-convex              14                     6
Convex-concave               0                     2
Linear                       0                     0
Undetermined shape           8                     9

3. Regarding the third hypothesis, a Chi-square test (with three degrees of freedom) was performed to compare the proportions of several categories classifying the composite shape of the utility function in the two assessment methods. Our categories differ slightly from those suggested by Fishburn and Kochenberger (see Hershey et al. [1]) and include: concave, convex, concave-convex, convex-concave, linear and undetermined shape. The null hypothesis of equality of proportions was not rejected. The computed value of the Chi-square statistic (with three degrees of freedom) was found to be smaller than the critical value 11.07 (significance level of 0.05). In Table 4 we present the sample proportions of the different categories for the different methods. In addition, we defined an indicator random variable that takes the value one if, for a given observation, the same shape category was found in both methods of assessment, and zero otherwise. It was found that for 18 out of the 40 observations our indicator variable took the value one.
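A Chi-square statistic for Table 4 can be recomputed as a rough check. The sketch below is a standard Pearson test of homogeneity on the counts, dropping the categories that are empty under both methods; the authors do not report their computed value or exactly how they treated empty cells, so this is an assumption. The resulting statistic lies below the critical value of 11.07 quoted in the text, and also below 7.81, the 5% point for three degrees of freedom in standard tables.

```python
# Counts transcribed from Table 4: category -> (direct, indirect).
table4 = {
    "concave": (18, 23), "convex": (0, 0), "concave-convex": (14, 6),
    "convex-concave": (0, 2), "linear": (0, 0), "undetermined": (8, 9),
}

def chi_square_homogeneity(table):
    # Drop categories with no observations under either method (assumption).
    rows = [counts for counts in table.values() if sum(counts) > 0]
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    grand_total = sum(col_totals)
    statistic = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / grand_total
            statistic += (row[j] - expected) ** 2 / expected
    degrees_of_freedom = (len(rows) - 1) * (2 - 1)
    return statistic, degrees_of_freedom

statistic, degrees_of_freedom = chi_square_homogeneity(table4)

# Share of subjects whose shape category agreed across methods (18 of 40).
agreement_rate = 18 / 40
```

Note that the convex-concave row has an expected count of only 1 per method, so the Chi-square approximation is coarse for this table.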

SUMMARY AND DISCUSSION

Three important conclusions can be drawn from our empirical evidence.

1. The use of different methods (direct and indirect) does not lead to significant differences in the utility evaluation from the social point of view (health program evaluation).† The results regarding equality of means, equality of standard errors of measurement and equality of proportions of the different categories support this claim.

2. The use of different methods (direct and indirect) does lead to significant differences in the utility evaluation from an individual point of view (clinical decision making).† This can be seen from the hypothesis testing of the correlation coefficient and from the result that more than half of the observations do not have the same category of shape. We note that there is no contradiction between these two claims. For example, it is possible that the null hypothesis of equality of means is not rejected while the null hypothesis that the correlation coefficient equals zero is not rejected either. In principle, the comparison between (1) and (2) above should lead to the conclusion that, at the individual level, utility assessment using one method can result in biases.

3. The results illustrated in Table 4 show that risk aversion is common for shorter periods of time (small S). On the other hand, risk prone behavior, when it exists, is found for longer periods of time (large S). This can be interpreted to mean that individuals tend to take risks for periods of time that seem too long for them to realize the consequences. In other words, individuals may opt to participate in risky treatment only if the results are far in the future. This result calls for further research with a sample of older people. It is possible that for an older group, risk averse behavior will be observed throughout. Until such research is performed one should be careful in applying the above results to policy decision-making.

†See the Introduction.

Finally, we turn to the use of students to assess biases in the assessment procedures of utility functions. It is true that students who have completed an advanced course in decision science and are more mathematically sophisticated cannot represent the “average” patient. However, we feel that using such a “sophisticated” population has the advantage of eliminating claims that biases in the assessment procedure might result from lack of understanding of these techniques and of the concept of probability. This might be one of the reasons why so many studies that assess biases in utility measures use students. We do agree that more work is called for in simplifying the measurement procedure by using different devices that make the probabilistic concept more understandable to the “average” patient, and thus make these techniques more useful in measuring his preferences.

REFERENCES

1. J. C. Hershey, H. C. Kunreuther and P. J. Schoemaker. Sources of bias in assessment procedures for utility functions. Mgmt Sci. 28, 936-954 (1982).
2. H. J. Einhorn and R. M. Hogarth. Behavioral decision theory: processes of judgement and choice. Ann. Rev. Psychol. 32, 53-88 (1981).
3. R. Krzysztofowicz and L. Duckstein. Assessment errors in multiattribute utility functions. Organizational Behavior & Human Performance 26, 326-348 (1980).
4. M. C. Weinstein, H. C. Fineberg, A. S. Elstein, H. S. Frazier, D. Neuhauser, R. R. Neutra and B. J. McNeil. Clinical Decision Making. W.B. Saunders, Philadelphia (1980).
5. B. J. McNeil, R. Weichselbaum and S. G. Pauker. Fallacy of the five-year survival in lung cancer. New Engl. J. Med. 229, 1379-1401.
6. B. J. McNeil, R. Weichselbaum and S. G. Pauker. Speech and survival: trade offs between quality and quantity of life in laryngeal cancer. New Engl. J. Med. 305, 982-987 (1981).
7. G. W. Torrance, D. L. Sackett and W. H. Thomas. Utility maximization model for program evaluation: a demonstration application. In Health Status Indexes, (Edited by R. L. Berg), pp. 156-171. Hospital Research and Educational Trust, Chicago (1973).
8. J. W. Bush, M. M. Chen and D. L. Patrick. Health status index in cost effectiveness analysis of PKU program. In Health Status Indexes, (Edited by R. L. Berg), pp. 172-208. Hospital Research and Educational Trust (1973).
9. M. C. Weinstein, J. S. Pliskin and W. B. Stason. Coronary artery bypass surgery: decisions and policy analysis. In Costs Risks and Benefits of Surgery, (Edited by J. P. Bunker), pp. 342-371. Oxford Press, New York.
10. S. G. Pauker, S. P. Pauker and B. J. McNeil. The effect of private attitudes on public policy: prenatal screening for neural tube defects as a prototype. Med. Decis. Making 1, 103-111 (1981).
11. H. Llewellyn-Thomas, H. J. Sutherland, R. Tibshirani, A. Ciampi, J. E. Till and N. F. Boyd. Describing health states: methodologic issues in obtaining values for health states. Med. Care 22, 543-552 (1984).
12. J. J. J. Christensen-Szalanski. Discount functions and the measurement of patients’ values: women’s decisions during childbirth. Med. Decis. Making 4, 47-58 (1984).
13. G. W. Torrance. Measurement of health state utilities for economic appraisal. J. Health Econ. 5, 1-30 (1986).
14. H. Bursztajn and R. M. Hamm. The clinical utility of utility assessment. Med. Decis. Making 2, 161-165 (1982).
15. J. S. Pliskin, D. S. Shepard and M. C. Weinstein. Utility functions for life years and health status. Ops Res. 28, 206-224 (1980).
16. R. L. Keeney and H. Raiffa. Decisions with Multiple Objectives. Wiley, New York (1976).
17. J. S. Dyer and R. K. Sarin. Measurable multiattribute value functions. Ops Res. 27, 810-822 (1979).
18. J. S. Dyer and R. K. Sarin. Relative risk aversion. Mgmt Sci. 28, 875-886 (1982).
19. G. W. Torrance. Social preferences for health states: an empirical evaluation of three measurement techniques. Socio-Econ. Plann. Sci. 10, 128-136 (1976).
20. T. S. Wallsten and D. V. Budescu. Encoding subjective probabilities: a psychological and psychometric review. Mgmt Sci. 29, 151-172 (1983).
21. D. A. Wehrung, K. R. MacCrimmon and K. M. Brothers. Utility measures: comparisons of domains, stability and equivalence procedures. Working Paper 603, University of British Columbia (1980).
22. J. C. Hershey and P. J. Schoemaker. Adjustment bias in indifference judgement between gambles and sure amounts. Working Paper, Decision Sciences Department, The Wharton School, University of Pennsylvania (1982).
23. D. E. Bell and H. Raiffa. Marginal value and intrinsic risk aversion. In Risk: A Seminar Series, (Edited by H. Kunreuther), pp. 325-350. International Institute of Applied Systems Analysis, Laxenburg, Austria (1981).