REVIEW ARTICLE PharmacoEconomics 7 (5), 1995
The Short-Form 36 (SF-36) Health Survey and Its Use in Pharmacoeconomic Evaluation
John Brazier
Sheffield Centre for Health and Related Research, University of Sheffield, Sheffield, England
Contents
Summary
1. The Short-Form 36 (SF-36) Health Survey
  1.1 Description
  1.2 History and Development
2. The Psychometric Foundations of the SF-36
  2.1 Reliability
  2.2 Validity
3. Using SF-36 Scores in Economic Evaluation
  3.1 Cost-Minimisation Analysis
  3.2 Cost-Effectiveness Analysis
  3.3 Cost-Utility Analysis
4. Economic Criticisms of the SF-36
  4.1 Item Selection
  4.2 Scoring of the SF-36
  4.3 Risk
  4.4 Time
5. The SF-36 and Preference-Based Measures
6. Deriving a Single Index from SF-36
  6.1 Arbitrary Weights
  6.2 Multi-Attribute Utility Theory
  6.3 Scenario Approaches
7. Conclusion
Summary
The Short-Form 36 (SF-36) Health Survey is a brief self-administered questionnaire that generates scores across 8 dimensions of health. It has been found to be reliable, and valid in terms of criteria such as agreement with clinical diagnosis and disease severity, but its underlying values have not been tested against patient preferences.
The SF-36 was not devised for use in economic evaluation. The SF-36 may be used in cost-minimisation analyses, where the dimension scores can be shown to reflect people's values for health at an ordinal level, but it cannot be used in either cost-effectiveness or cost-utility analyses. The dimension scores of zero to 100 do not provide a common currency and, where there is conflict between the dimension scores, there is no basis for establishing an overall health benefit. Furthermore, in clinical trials, the usual comparison is between mean or median scores, which assumes risk neutrality and does not take adequate account of the relationship between the value of health and time.
Although they are under pressure to assess the cost effectiveness of healthcare interventions, researchers and policy analysts must resist short-cut methods of deriving a single index from the SF-36 that are based on arbitrary aggregation schemes, because these ignore people's preferences and the crucial quantity/quality trade-off, and therefore cannot be used in economic evaluations. However, the rich descriptive material and multidimensionality of the SF-36 may have potential for use in economic evaluation. Multi-attribute utility theory provides a way of deriving a single index based on elicited values, but it requires a major restructuring of the scales of the SF-36. Alternatively, SF-36 responses may provide material for constructing health scenarios that could then be valued on a holistic basis.
Purchasers of healthcare are being forced to make difficult choices. One of the key criteria increasingly used is cost effectiveness,[1] but this is hindered by the absence of evidence on the costs and consequences of health interventions. The last 2 decades have witnessed the growth of what is now known as the 'outcomes movement'[2] to promote the measurement of health, and one of its recent flagships has been the Short-Form 36 (SF-36) Health Survey.[3,4] This is a brief and easy to use self-administered questionnaire, which provides a general assessment of the patient's perceived health in terms of 8 dimensions of health covering functioning and well-being. Currently, it is being used in pharmacological and other clinical trials across many countries and it is likely that SF-36 data will be used to support claims for the cost effectiveness of health interventions. In contrast, measures developed by economists, such as the Health Utilities Index (HUI),[5,6] Quality of Well Being Index[7-9] and the EuroQol,[10] are not being used as widely in clinical trials.
The aim of this article is to examine whether SF-36 results can be used in the economic evaluation of healthcare interventions.
1. The Short-Form 36 (SF-36) Health Survey
1.1 Description
The SF-36 is a standardised, 36-item questionnaire designed for completion by patients (although it can be interviewer-administered) in a clinic or at home. It measures health on 8 multi-item dimensions that cover functional status, well-being and overall evaluation of health (fig. 1 and table I). For each item, there is a choice of responses on a Likert scale, ranging (for example) from 'limited a lot' to 'not limited at all' or 'all of the time' to 'none of the time' (fig. 1). The chosen item responses are recoded onto an equal interval scale (except for 3 items[11]). Scores are computed for each dimension by adding the recoded item responses together and transforming the results onto a scale from zero (worst health on scale) to 100 (best health on scale).[11]
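The scoring step described above can be sketched as follows. This is a minimal illustration, not the published scoring algorithm: it assumes the transformation onto the zero to 100 scale is a simple linear rescaling of the summed item responses, and the function name `dimension_score` and the 3-item example scale are hypothetical.

```python
def dimension_score(recoded_items, min_per_item, max_per_item):
    """Sum recoded item responses and rescale to 0 (worst) .. 100 (best).

    Assumption: a linear rescaling between the lowest and highest
    possible raw sums; the 3 specially recoded items are not handled.
    """
    if not (len(recoded_items) == len(min_per_item) == len(max_per_item)):
        raise ValueError("one minimum and one maximum are needed per item")
    raw = sum(recoded_items)
    lowest = sum(min_per_item)
    highest = sum(max_per_item)
    return 100.0 * (raw - lowest) / (highest - lowest)


# Hypothetical 3-item scale; each item is recoded from
# 1 (limited a lot) to 3 (not limited at all).
example = dimension_score([1, 2, 3], [1, 1, 1], [3, 3, 3])
```

Here the raw sum of 6 lies halfway between the lowest (3) and highest (9) possible sums, so `example` is 50.0.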
1.2 History and Development
The SF-36 has evolved from 2 major research programmes in the US. The first was the Health Insurance Experiment (HIE) undertaken by the Rand Corporation to examine the consequences (for costs and health) of alternative methods of organising the delivery and finance of healthcare.[12]
The original survey contained 108 items, covering a broad array of functional status and well-being concepts.[13] In the Medical Outcome Survey (MOS), which examined how different aspects of healthcare affect outcome, these scales were further developed and refined.[14] Both of these studies were concerned with developing standardised measures of patient-perceived health, rather than conventional measures based on clinical judgement.

The following questions are about activities you might do during a typical day. Does your health limit you in these activities? If so, how much? (Please circle one number on each line.)
Response options: Yes, limited a lot; Yes, limited a little; No, not limited at all.
  Climbing several flights of stairs
  Bending, kneeling, or stooping
  Walking half a mile
These questions are about how you feel and how things have been with you during the past month. How much time during the past month: (Please circle one number on each line.)
Response options: All of the time; Most of the time; A good bit of the time; Some of the time; A little of the time; None of the time.
  Did you feel full of life?
  Have you felt downhearted and low?
  Has your health limited your social activities (like visiting friends or close relatives)?
Fig. 1. Samples of questions from the Short-Form 36 (SF-36) Health Survey.[21]
Items for the HIE and MOS originated from a review of the literature in the 1970s, and therefore the items selected for inclusion in the SF-36 have their roots in instruments that have existed for more than 20 years.[15] The usefulness of these full length health batteries was seen to be limited by their size, particularly if they were administered alongside disease-specific measures. On the other hand, short single-item scales were not regarded as covering a sufficient range of health domains and have been found to be unreliable and insensitive, particularly for small groups in trials.[15] The aim of the SF-36 developers was therefore to create a questionnaire that was '...a standardised health status survey that is comprehensive, psychometrically sound, and brief'.[3] The application of psychometric methods in selecting and testing items was an important feature in the development of the SF-36. For economists, however, it is important to understand the basis of this approach, and its appropriateness in developing an economic measure.
© Adis International Limited. All rights reserved.
PharmacoEconomics 7 (5) 1995
Table I. Summary of the 8 Short-Form 36 (SF-36) Health Survey dimensions, and the health transition item[11]
Dimension | No. of items | No. of levels | Summary of content
Physical functioning | 10 | 21 | Extent to which health limits physical activities such as self-care, walking, climbing stairs, bending, lifting and moderate and vigorous exercise
Role functioning - physical | 4 | 5 | Extent to which physical health interferes with work or other daily activities, including accomplishing less than wanted, limitations in the kind of activities, or difficulty in performing activities
Bodily pain | 2 | 11 | Intensity of pain and effect of pain on normal work, both inside and outside the home
General health | 5 | 21 | Personal evaluation of health, including current health, health outlook and resistance to illness
Vitality | 4 | 21 | Feeling energetic and full of life versus feeling tired and worn out
Social functioning | 2 | 9 | Extent to which physical health or emotional problems interfere with normal social activities
Role functioning - emotional | 3 | 4 | Extent to which emotional problems interfere with work or other daily activities, including decreased time spent on activities, accomplishing less, and not working as carefully as usual
Mental health | 5 | 26 | General mental health, including depression, anxiety, behavioural-emotional control and general positive affect
Reported health transition | 1 | 5 | Evaluation of current health compared to one year ago

2. The Psychometric Foundations of the SF-36
There is a well established methodological tradition in psychology of measuring subjective concepts such as intelligence, attitudes and beliefs,[16] originally derived from the field of psychophysics, known as psychometrics. The methods of psychometrics have been applied widely in health measurement[17] and were integral to both the construction and testing of the SF-36 dimension scales.[3,15,18]
2.1 Reliability
Any measure must be able to consistently reproduce a series of results over repeated measurements, with the minimum amount of random error, on an unchanged population. The types of reliability usually considered are: (i) internal consistency; and (ii) consistency over time (test-retest reliability).
2.1.1 Internal Consistency
Most published work has examined the internal consistency of the SF-36 in terms of the correlation of items with their own scale. In general, the SF-36 was able to satisfy the standard tests of internal consistency and homogeneity.[19,21]
2.1.2 Test-Retest Reliability
The underlying stability of scores between test and retest has important implications for required sample size in trials. In a survey of general practice patients, test and 2-week retest scores of the SF-36 dimensions were found to have rank correlations of between 0.60 and 0.81,[21] which meets the level regarded as acceptable for group comparisons (0.5 to 0.7).[16] The mean differences between test and retest did not exceed 1 point on the 100 point scale, and the plots of these differences against the subjects' scores did not reveal any bias.[21]
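The two test-retest statistics mentioned above, a rank correlation and the mean difference between test and retest, can be sketched as follows. This is an illustrative sketch only: it assumes Spearman's rank correlation with average ranks for ties (the source does not name the coefficient used), and the scores are invented.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented dimension scores for 5 patients at test and 2-week retest.
test_scores = [55, 70, 85, 40, 95]
retest_scores = [60, 65, 90, 45, 90]
rho = spearman(test_scores, retest_scores)
mean_diff = sum(r - t for t, r in zip(test_scores, retest_scores)) / len(test_scores)
```

With these invented data the rank correlation is about 0.97 and the mean test-retest difference is 1 point.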
2.2 Validity
Validity has been defined as the extent to which an instrument measures what it is intended to measure. However, as noted by McDowell and Newell,[17] it is not the instrument per se that is valid, but the way in which it is used. Many measures are originally designed for one task, but subsequently become used for a variety of other purposes. The SF-36 has been validated by its ability to predict clinical diagnosis and health service utilisation, rather than how well it conforms to patients' values, and this is the basis of criticisms of using it in economic evaluation. This subsection considers the psychometric tests of validity that have so far been applied to SF-36.
2.2.1 Content Validity
Content validity is defined as the extent to which the choice of items is appropriate for the health domains being measured. Claims for content validity typically rest on the comprehensiveness of the domains and the method for generating items.
Health Domains
The health domains for the SF-36 have been chosen by the developers to broadly reflect the WHO definition of health as a 'state of complete physical, mental and social well-being and not merely the absence of disease or infirmity'.[22] The 5 dimensions of physical health, mental health, everyday functioning in social and role activities, and general perceptions of well-being were originally regarded as a minimum criterion for content validity.[22] Later the role functioning domain was found to miss limitations attributable to mental problems, and it was therefore divided into the 2 dimensions of role limitations attributable to physical health, and those attributable to emotional problems. A dimension was also added for vitality because its '...sensitivity to the impact of disease and treatment has been demonstrated in recent clinical trials involving patients with hypertension, prostate disease, and those differing in severity of AIDS'.[3] Pain was regarded as important enough in its own right to also require its own dimension.
Items
The original items of the HIE, and subsequently the MOS survey, were mainly taken from existing literature. Items were selected to include positive as well as negative aspects of health. Thus, the mental health dimension in the UK version of the SF-36 includes 'have you been a happy person?' as well as 'have you felt downhearted and low?'.[11] In selecting items for the SF-36 from the original MOS long form, the developers employed a number of strategies that the authors acknowledged varied across dimensions.[15] In the original MOS study, a 20-item shortened version was produced, known as the SF-20.[20] Subsequent analysis of this scale revealed a number of shortcomings in terms of comprehensiveness and sensitivity compared with the long form. The SF-36 was developed to correspond more closely with equivalent scales in the long form. This 'expert' approach could be criticised for not being based directly on patient views, but it had the advantage of being able to select the best features of existing measures. The approach used by some other methodologists, such as the developers of the Nottingham Health Profile (NHP), is to obtain the statements from interviews with patients[23] and, for the Sickness Impact Profile (SIP), patients and professionals were interviewed.[24]
2.2.2 Construct Validity
Psychometric researchers assess the validity of health measures on correlational evidence, where relationships between measures and variables are hypothesised ('constructs'), and then tested empirically. The SF-36 has been validated on a wide range of common conditions by comparing SF-36 scores with measures of disease severity and other health indicators.[25-31] Researchers have found evidence of validity across a range of constructs, including visiting a general practitioner and attending a hospital-based clinic.[11] Furthermore, using this approach, the SF-36 has been shown to be more sensitive than the NHP and the EuroQol classification because of its ability to detect levels of perceived ill-health amongst patients who had achieved 'good health' according to these instruments.[21,32] These tests of construct validity can, however, be criticised for their dependence on clinical judgement and other supply factors.[33]
2.2.3 Responsiveness
The purpose of health status assessment is to measure health change. Responsiveness is the ability of an instrument to measure significant changes in health, and as such is a form of validity. However, responsiveness is usually assessed as a statistical characteristic of an instrument, using measures such as effect sizes (i.e. the mean score change divided by the standard deviation at baseline) and the standardised response mean (mean score change divided by its standard deviation), and is linked to the sample size required to detect a change.[34] In a recent study,[31] the SF-36, another generic measure (the SIP) and several disease-specific measures (including the Arthritis Impact Measure) were administered pre- and postoperatively to patients undergoing total hip arthroplasty. The degree of responsiveness of the measures was found to be comparable. However, there are few data currently published on the responsiveness of the SF-36 in other patient groups.
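The two responsiveness statistics defined above can be computed as in the following sketch. The function names and the scores are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which estimator is intended.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean score change divided by the standard deviation at baseline."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardised_response_mean(baseline, follow_up):
    """Mean score change divided by the standard deviation of the changes."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Invented pre- and postoperative dimension scores for 4 patients.
baseline_scores = [40, 50, 60, 70]
follow_up_scores = [55, 60, 80, 75]
es = effect_size(baseline_scores, follow_up_scores)
srm = standardised_response_mean(baseline_scores, follow_up_scores)
```

For these invented scores the mean change is 12.5 points, giving an effect size of about 0.97 and a standardised response mean of about 1.94. The two statistics differ because the spread of baseline scores need not match the spread of the changes.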
3. Using SF-36 Scores in Economic Evaluation
Clinical trials currently in progress will be reporting changes in scores for the 8 SF-36 dimensions alongside other outcomes such as survival, to examine whether one treatment is more effective than another. There are 3 types of outcome scenarios when 2 treatments, A and B, are compared:
(i) The outcomes of treatments A and B are the same for all outcome measures, including the SF-36 dimensions;
(ii) Some outcomes are better and the rest are the same in treatment A compared with treatment B; and
(iii) Some outcomes are better and others are worse for treatment A compared with treatment B.
For the purposes of this discussion, we will assume that SF-36 is a valid description of the health consequences of treatment.
The economist, of course, has the additional problem of relating each of these health outcome scenarios to cost data, using one of the following techniques of economic evaluation in healthcare.[35]
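The 3 outcome scenarios can be expressed as a small classification rule. The sketch below is illustrative only: the function name and data are hypothetical, differences are assumed to be oriented so that a positive value favours treatment A, and 'the same' is taken as exact equality, which ignores the statistical uncertainty a real trial comparison would have to address.

```python
def classify_scenario(differences):
    """Classify a comparison of treatments A and B.

    differences maps each outcome name to (A minus B), oriented so
    that a positive value means A is better on that outcome.
    """
    any_better = any(d > 0 for d in differences.values())
    any_worse = any(d < 0 for d in differences.values())
    if not any_better and not any_worse:
        return "i"    # same on every outcome
    if any_better and any_worse:
        return "iii"  # better on some outcomes, worse on others
    return "ii"       # one treatment is at least as good on every outcome


same = classify_scenario({"physical functioning": 0, "mental health": 0})
dominant = classify_scenario({"physical functioning": 5, "mental health": 0})
conflict = classify_scenario({"physical functioning": 5, "survival": -1})
```

Note that scenario (ii) is also returned when B is the treatment that is better on some outcomes and the same on the rest, since this is the same scenario with the labels exchanged.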
3.1 Cost-Minimisation Analysis
In cost-minimisation analysis it is necessary to demonstrate the ordinal properties of the outcome measures. If this can be shown, scenario (i) would seem to provide the basis for establishing that the 2 treatments are equally effective, and hence the analysis can focus on costs. In scenario (ii), if treatment A was shown to be cheaper than B as well as more effective, again it may be possible to judge cost effectiveness.
3.2 Cost-Effectiveness Analysis
Where treatment A is more costly and more effective than treatment B, then the marginal cost must be weighed against its marginal benefit over B. This can be appraised in cost-effectiveness analysis by comparing costs with a single outcome such as life-years. However, the SF-36 would, in addition, generate 8 cost-per-dimension-score ratios, and it is therefore possible for treatment A to be more cost effective than B in some dimensions, but less cost effective in others. This result would present considerable difficulties in applying the conventional economic decision rules.[35] Selecting only one dimension would solve the analytical problem but is likely to lead to false conclusions about relative effectiveness where the interventions have multiple outcomes. Furthermore, SF-36 dimension scores have not been shown to have interval properties (i.e. where the scores represent equal intervals) and thus the cost-effectiveness ratios are likely to be meaningless.
Where there is any conflict between costs and health outcomes, it is not possible to apply cost-effectiveness analysis using SF-36.
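The conflict between cost-per-dimension-score ratios can be seen in a small numerical sketch. All costs and score gains below are invented, only 2 of the 8 dimensions are shown, and the calculation treats a point of score gain as an interval quantity purely for illustration, which is the very assumption the text says has not been established.

```python
def cost_per_point(cost, gains):
    """Cost per point of score gained, for each dimension with a positive gain."""
    return {dim: cost / gain for dim, gain in gains.items() if gain > 0}


# Invented per-patient costs and mean score gains for 2 treatments.
cost_a, cost_b = 1200.0, 800.0
gains_a = {"physical functioning": 12.0, "bodily pain": 4.0}
gains_b = {"physical functioning": 5.0, "bodily pain": 8.0}

ratios_a = cost_per_point(cost_a, gains_a)
ratios_b = cost_per_point(cost_b, gains_b)

# The treatment with the lower cost per point is favoured on each dimension.
favoured = {
    dim: ("A" if ratios_a[dim] < ratios_b[dim] else "B")
    for dim in ratios_a
    if dim in ratios_b
}
```

Treatment A is favoured on physical functioning (100 versus 160 per point) and treatment B on bodily pain (100 versus 300 per point), so the ratios alone give no overall ranking.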
3.3 Cost-Utility Analysis
The SF-36 was not designed to incorporate the preferences of individuals and hence the economist's notion of 'utility'. The developers of the MOS measures advised '...when multiple items are combined into a score, scores are possible over a range of numbers and the score has no inherent meaning'.[18] Scores on SF-36 dimensions range between zero and 100 but they are not comparable, and there is currently no basis for combining them into a single index. There has been work to reduce the number of dimensions using factor analytic techniques, which helps to overcome the statistical problem, but at best this reduces the dimensions down to 2 or 3[25] and is not based on preferences.
In scenario (iii), there can be a conflict between survival and SF-36 scores. An important feature of cost-utility analysis is the combination of survival with health-related quality of life to produce quality-adjusted life years (QALYs). SF-36 scores cannot be combined with survival on the same scale, and therefore the necessary trade-offs between these 2 types of outcome required in cost-utility analysis are not explicitly addressed. Thus, SF-36 results cannot currently be used in cost-utility analysis.
4. Economic Criticisms of the SF-36
The SF-36 currently has only a very limited potential for use in conventional economic evaluations. Despite this, many economists will be asked to assess the relative efficiency of different interventions from SF-36 trial data. It is, therefore, worthwhile examining these data more critically from an economic perspective.
4.1 Item Selection
There is a danger of ruling out an item simply because it does not fit neatly into one of the hypothesised domains. Some health problems, such as limitations in bathing and dressing (an SF-36 physical functioning item), are extremely important to people's quality of life but their infrequency in a population can result in a poor correlation with other items. Conventional tests of internal consistency can therefore reduce the validity of a questionnaire, unless applied with care. (The 'bathing and dressing' item was retained in the SF-36 despite this problem.)
4.2 Scoring the SF-36
The interval properties of SF-36 scales as measures of patient values have not been tested. It is assumed that being limited in 2 items equals twice the interval of a single limitation. Even the ordinality of the dimension scores may be in some doubt. For example, physical functioning includes items for 'lifting and carrying groceries' and 'bending, kneeling or stooping', which are assumed to be equal, but one could be regarded as far worse than the other by many patients. The economist would want to see the implicit weights investigated against the value of these limitations to patients, society or whoever's values are deemed relevant. (In contrast, the developers of the SF-36 have been more concerned with the statistical assumptions underlying these scoring procedures, including 'the distribution of responses to items within the same scale and item variances are roughly equal'.[15])
A test of validity used by economists is to examine whether the scores correspond to values revealed in the choices made by patients.[28] For example, one can ask whether an informed patient would actually choose the treatment that results in a higher SF-36 score, and whether the strength of this preference correlates strongly with the SF-36 scores. However, these 'revealed preferences' of patients in healthcare tend to be contaminated by their outcomes and the influence of professionals. At best, it might be possible to test the ordinality of some scores, but this is not likely to provide a usable technique for testing the strength of the values underlying the dimension scores. For this, stated preference techniques have been applied, such as the rating scale used by Cairns and colleagues.[36] They have attempted to compare the implicit values of 3 condition-specific measures of health with stated preferences. Patients were asked to rank and rate a sample of scenarios describing hypothetical patients drawn from these scales. Disagreement was found between the rankings derived from the original scales and the raters' stated preferences, and the scales did not appear to have simple interval properties. The authors concluded that a large number of scenarios would have to be valued to estimate any relationship between the scales, and that it may be necessary to revalue them completely.
4.3 Risk
Health status measures such as the SF-36 can be criticised for their failure to take account of risk. Analyses of trial results typically focus on the mean, or occasionally median, change in health scores. Variance in the data is used to establish the statistical significance of a change, but the implications of the distribution of these changes for risk-taking are ignored. A very wide range of outcomes has been found for common treatments such as cholecystectomy, with negative consequences from complications including mortality. The ranking of utility expected from a prospect such as a surgical treatment is not necessarily the same as its mean value. For a risk-averse person, uncertainty reduces the utility of a mean outcome or prospect, and the maximisation of utility may involve choosing a treatment that achieves a lower improvement in the SF-36 scores, but with less variance. The distribution of health outcomes should not be ignored when comparing the effectiveness of treatments. However, there is no agreed method for incorporating risk into the valuation of health outcome. Expected utility theory (EUT) is usually claimed to be the theoretical basis of economic valuation techniques such as standard gamble, but the assumptions of EUT have been heavily criticised on empirical grounds.[37] Even within the paradigm of EUT, there is no agreement on how risk should be incorporated.[38,39]
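The point that a risk-averse person may prefer a lower mean outcome with less variance can be illustrated with a short sketch. Everything here is an assumption made for illustration: the square-root utility function is an arbitrary concave choice, the outcome distributions are invented, and applying a utility function directly to health scores is not something the SF-36 supports.

```python
import math


def mean_score(outcomes):
    """Expected score of a prospect given as (probability, score) pairs."""
    return sum(p * score for p, score in outcomes)


def expected_utility(outcomes, utility):
    """Expected utility of a prospect under a given utility function."""
    return sum(p * utility(score) for p, score in outcomes)


risk_averse = math.sqrt  # hypothetical concave utility function

# Invented prospects: A has the higher mean but a much wider spread.
treatment_a = [(0.5, 100.0), (0.5, 16.0)]  # mean 58.0
treatment_b = [(0.5, 64.0), (0.5, 49.0)]   # mean 56.5

mean_a, mean_b = mean_score(treatment_a), mean_score(treatment_b)
eu_a = expected_utility(treatment_a, risk_averse)  # 0.5*10 + 0.5*4 = 7.0
eu_b = expected_utility(treatment_b, risk_averse)  # 0.5*8 + 0.5*7 = 7.5
```

A comparison of means ranks A above B, whereas this particular risk-averse utility function ranks B above A.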
4.4 Time
In clinical trials, the frequency and length of follow-up is often inadequate to obtain an accurate description of the duration and pattern of health gain. In the simplest trial design, the outcome of a treatment is the difference between health scores assessed before and after treatment. A more sophisticated approach is to estimate the health change for every patient as the difference between the pretreatment score and a weighted average of scores at the post-treatment assessments, with the weights proportional to the time spent between each assessment.[40]
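The duration-weighted approach can be sketched as follows. The function name and data are hypothetical, and the sketch assumes that each post-treatment score is weighted by the interval that precedes it, which is one of several possible readings of 'time spent between each assessment'.

```python
def time_weighted_change(baseline, assessments):
    """Duration-weighted mean post-treatment score minus the baseline score.

    assessments is a list of (time, score) pairs, with time measured
    from treatment (time 0). Assumption: each score is weighted by the
    length of the interval since the previous assessment.
    """
    previous_time = 0.0
    weighted_sum = 0.0
    for time, score in assessments:
        if time <= previous_time:
            raise ValueError("assessment times must be positive and strictly increasing")
        weighted_sum += score * (time - previous_time)
        previous_time = time
    if previous_time == 0.0:
        raise ValueError("at least one post-treatment assessment is needed")
    return weighted_sum / previous_time - baseline


# Invented scores at 1, 3 and 12 months after treatment; baseline score 50.
change = time_weighted_change(50.0, [(1, 60.0), (3, 70.0), (12, 80.0)])
```

The weighted average here is (60 x 1 + 70 x 2 + 80 x 9) / 12, or about 76.7, giving a change of about 26.7 points, compared with 30 points from a simple before-and-after comparison at 12 months.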
The method of weighting scores by duration assumes the value of a health state is independent of when it occurs, the length of time a patient is in the state, and where it occurs in relation to other health states. Usually in economic evaluation a constant positive rate of time preference is assumed and discounting is recommended when analysing QALYs.[41] However, it has been suggested that people's time preferences have a more complex pattern, which may be linked with 'thresholds' in a person's life cycle.[37] For example, health may be more important to a person when they have young children. The length of time endured in a health state can also be important. Sackett and Torrance[42] asked patients and members of the general population to value a variety of health states, including hospital dialysis, for durations of 3 months, 8 years and life, and found the mean daily health state utilities declined with duration. More generally, Richardson and colleagues[43] have argued that the utility of a health state may be directly related to a person's prognosis: 'A poor health state may be more tolerable if it is perceived as a temporary hardship to be endured to obtain subsequent health. Conversely, the enjoyment of an otherwise satisfactory health state may be diminished by the knowledge that it will end in suffering and death'. Similarly, it can be argued that a person adjusts to a health problem and that this alleviates its consequences.
The importance of the context of a health state is something that could be missed by a narrowly defined measure of health focusing on disability or pain. However, the SF-36 is a more broadly based measure, and may be able to describe the consequences of duration and prognosis. In the second example of Richardson and colleagues,[43] a person who has cancer and is very depressed about the prognosis would respond accordingly to items on the SF-36 Health Survey.
5. The SF-36 and Preference-Based Measures
An alternative approach used by many economists is to employ preference measures such as standard gamble or time trade-off in clinical trials to obtain a single index value for each patient's health state. These elicitation procedures have been developed from theories of choice to obtain utility scores for different states of health, and allow health-related quality of life to be combined with survival to generate a single index number from zero (death) to 1 (full health), such as the QALY.[44]
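How a utility index on the zero (death) to 1 (full health) scale is combined with survival can be shown with a minimal QALY calculation. This is a sketch under stated assumptions: the yearly utility profile is invented, time is divided into whole years, and the optional discounting convention (first year undiscounted, constant annual rate) is only one of several in use.

```python
def qalys(yearly_utilities, discount_rate=0.0):
    """Sum yearly utilities (0 = death, 1 = full health) into QALYs.

    Assumption: year 0 is undiscounted and later years are discounted
    at a constant annual rate.
    """
    return sum(
        utility / (1.0 + discount_rate) ** year
        for year, utility in enumerate(yearly_utilities)
    )


# Invented profile: 2 years at utility 0.8, then 1 year at 0.5.
undiscounted = qalys([0.8, 0.8, 0.5])
discounted = qalys([0.8, 0.8, 0.5], discount_rate=0.05)
```

The profile yields 2.1 QALYs undiscounted and about 2.02 QALYs at a 5% discount rate. No equivalent calculation is possible with SF-36 dimension scores, because they are not anchored to death and full health.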
A recent review of studies containing both health status and preference measures found them to be poorly or moderately correlated (r = 0.01 to 0.60), and health status scores were only able to predict 18 to 43% of 'utility' scores in regression models.[45] The direct assessment of preferences in clinical trials is likely to be influenced by a range of factors other than the patient's health status, such as attitudes to risk and time, degree of understanding the valuation task, and the patient's circumstances.[46] These other factors may be regarded as contaminants, but some of them, such as attitudes to risk and time, should be included in the valuation of the benefits of most healthcare. In their review of these 2 approaches, Revicki and Kaplan[45] advocate the use of both, but in practice this may not be feasible and would present a dilemma where the results do not agree.
Economic measures have not been found to be as responsive as general health status measures. A recently published comparison by Katz and colleagues[47] of measures used on patients undergoing hip arthroplasty found '...the utility measure [time trade-off] is less responsive to clinical change than the SIP, and the quality of life rating scale [the 'feeling thermometer'] is the least responsive of all three measures'. The Canadian Erythropoietin Study Group[48] found significant differences between the experimental and placebo groups in dimensions of a disease-specific (Kidney Disease Questionnaire) and a generic profile (SIP) measure, but not the time trade-off at 6 months after treatment.[48] Direct utility assessment in trials would seem to require larger sample sizes than measures of health status, particularly where the expected differences are small.
In practice, direct preference assessment is comparatively rare, a fact that may be attributable to resistance from clinical researchers concerned about the distress to patients from valuation exercises that incorporate life and death choices. An alternative approach pioneered by Kaplan and Bush[9] and by Torrance[5,6] has been to combine these approaches by estimating weights using preference-based methods for the items and dimensions of the health status instrument, to generate a single index measure of health (e.g. QALYs).
6. Deriving a Single Index from SF-36
To use SF-36 in economic evaluation it will be necessary to derive a single index measure that reflects the strength of people's preferences for different aspects of health. The following discussion considers how such an index might be derived.
6.1 Arbitrary Weights
One approach is simply to combine the dimension scores or item responses into a single index using an assumed set of weights. A research team at Brunel University in the UK aggregated the NHP into a single index to estimate the QALYs gained from a heart transplantation programme.[49] Three methods of aggregation were utilised: (i) the proportion of affirmative responses to the 38 statements in the NHP; (ii) weighting the affirmative responses by weights estimated by the NHP developers, using Thurstone's method of paired comparisons;[23] and (iii) using unitary statement weights within dimensions and then weighting the dimensions by their proportion of the 38 statements. Similar results were obtained with each method of aggregation, although the range of values examined was very limited and other weighting schemes may have led to different results.
Two of the devisors of the NHP, who originally argued against deriving a single index from their profile measure, have recently published a method for obtaining an index of distress to be used in conjunction with another measure of dependency in cost-utility studies.[50] Their index contains 23 out of the original 38 statements in the NHP (since it excludes mobility) but otherwise is the same as Brunel's first aggregation scheme.
A wide range of aggregation schemes could be applied to the SF·36, involving the summing of dimension scores or items responses, using different assumed weights. The easiest method would be to weight the dimension scores as follows:
$$\text{Index} = \sum_{j=1}^{n} K_j X_j \qquad \text{(Eq. 1)}$$

where: $K_j$ = the dimension weight applied to dimension $j$; $n$ = the number of dimensions; $X_j$ = the dimension score of dimension $j$; and:

$$\sum_{j=1}^{n} K_j = 1$$
Such arbitrary weighting schemes could easily be applied to the results of a trial, but they would not generate an index that could be legitimately used in an economic evaluation, because the dimension scores are not measures of utility and have not been based on people's preferences. Furthermore, there is no allowance for any possible interaction between the dimensions. The consequences of having pain and limitations in physical functioning may be more, or indeed less, than the sum of the 2 separate dimensions. Finally, for use in cost-utility analysis, the index would have to be combined with survival, something O'Brien and colleagues did not feel able to achieve with the NHP.[49] The Brunel team argued that 'a more formal process is required for translating health profile information, be it from the NHP or SIP with their richness and multi-dimensionality, into relative valuations of typical health states, which can then be used to indicate relative quantity/quality of life trade-offs or preferences'.
The weights could be based on the strength of association of items or dimension scores with events such as survival or use of health services. Aside from the statistical problems, again the main objection to this is the absence of values derived from patients.
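The weighted sum in equation 1 can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the two dimension names, the scores and both weighting schemes below are invented for the example (the SF-36 itself has 8 dimensions), and nothing here is taken from a trial or from any published scoring rule.

```python
from typing import Mapping


def weighted_index(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Equation 1: sum of K_j * X_j over dimensions, with the K_j summing to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[d] * scores[d] for d in scores)


# Hypothetical mean dimension scores (0 to 100) for two trial arms.
arm_a = {"physical_functioning": 80.0, "pain": 40.0}
arm_b = {"physical_functioning": 50.0, "pain": 75.0}

# Two equally arbitrary weighting schemes (assumed, not preference-based).
equal = {"physical_functioning": 0.5, "pain": 0.5}
physical_heavy = {"physical_functioning": 0.8, "pain": 0.2}

for name, weights in [("equal", equal), ("physical_heavy", physical_heavy)]:
    a = weighted_index(arm_a, weights)
    b = weighted_index(arm_b, weights)
    better = "A" if a > b else "B"
    print(f"{name}: A={a:.1f}, B={b:.1f}, higher index: arm {better}")
```

With these invented numbers the ranking of the two arms reverses when the weights change, which is the practical problem with weights that are assumed rather than elicited from people's preferences.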
6.2 Multi-Attribute Utility Theory
An alternative approach is to value all the health states defined by the SF-36 (that is, every conceivable combination of responses to the questionnaire) using preference-based techniques. The SF-36 is a complex multidimensional scale that defines more than 1 million possible states, presenting a formidable valuation task. Torrance and colleagues[6,51] have pioneered an approach to valuation using multi-attribute utility theory (MAUT) from the operational research literature. By making a number of assumptions about the form of a utility or value function it is sufficient to value a small sample of possible states. This function can then be used to estimate values for all possible states. An example of such a function is the additive model as follows:
$$U(X) = \sum_{j=1}^{n} U(X_{ji}) \qquad \text{(Eq. 2)}$$

where: $X_{ji}$ represents level $i$ ($i = 1$ to $m$ levels) on attribute $j$; and $U(X_{ji})$ is the utility associated with level $i$ on attribute $j$.
This is similar to equation 1, but the arbitrary weights $K_j$ are now replaced by the utility values associated with each attribute. This requires each attribute to be additively independent of all other attributes, and therefore does not allow for any interaction between attributes. Using this approach, the utility function for each attribute can be determined separately by asking respondents to value health states where the levels of only 1 dimension are varied at a time.
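A minimal sketch of the additive model may help show why only a small number of valuations is needed. The three attributes, their three levels each, and every utility value below are hypothetical, chosen so that the best state sums to 1 and the worst to zero; they do not come from any valuation study.

```python
from itertools import product

# Hypothetical single-attribute utilities U(X_ji): one value per level i of
# each attribute j, as if elicited by varying one dimension at a time.
# Index 0 is the best level of each attribute. All values are invented.
LEVEL_UTILITIES = {
    "physical": [0.40, 0.25, 0.00],
    "pain": [0.35, 0.15, 0.00],
    "mental": [0.25, 0.10, 0.00],
}


def additive_utility(state):
    """Equation 2: sum over attributes j of U(X_ji) for the levels in `state`."""
    return sum(LEVEL_UTILITIES[attr][level] for attr, level in state.items())


def all_state_utilities():
    """Estimate a value for every possible state from the per-level utilities."""
    attrs = list(LEVEL_UTILITIES)
    level_ranges = [range(len(LEVEL_UTILITIES[a])) for a in attrs]
    return {
        levels: additive_utility(dict(zip(attrs, levels)))
        for levels in product(*level_ranges)
    }


utilities = all_state_utilities()
elicited = sum(len(v) for v in LEVEL_UTILITIES.values())
print(f"{elicited} elicited values cover {len(utilities)} states")
```

The saving grows quickly with the size of the instrument, but it rests entirely on additive independence: the value of a pain level is assumed to be the same whatever the levels of the other attributes.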
More complex equations can be used, which make less stringent assumptions about the form of the relationship between the dimensions than additive independence.[52] For example, they can include terms for possible interaction between 2 or more attributes. In the MAUT literature, there are examples of multilinear and multiplicative functional forms. However, increasing the complexity of the models can make them far more difficult to estimate and to interpret. Greater complexity must be justified by a significant gain in the model, such as improvements in predictive power. Whichever model is used, it will be necessary to test its assumptions.
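For orientation, the multiplicative form as it is usually written in the MAUT literature is given below. The notation is an assumption made for this illustration rather than the article's own: each single-attribute utility $U_j$ is scaled from 0 to 1, $k_j$ is a scaling constant for attribute $j$, and $K$ is an interaction constant.

```latex
% Multiplicative multi-attribute utility function (standard textbook form)
1 + K\,U(X) = \prod_{j=1}^{n} \bigl[\, 1 + K\,k_j\,U_j(X_{ji}) \,\bigr],
\qquad \text{where } K \text{ solves } 1 + K = \prod_{j=1}^{n} (1 + K\,k_j).

% Special case: if the scaling constants sum to 1, then K = 0 and the
% function reduces to a weighted additive model
U(X) = \sum_{j=1}^{n} k_j\,U_j(X_{ji}) \qquad \text{when } \sum_{j=1}^{n} k_j = 1 .
```

In this notation the term $U(X_{ji})$ of equation 2 corresponds to $k_j U_j(X_{ji})$, so the additive model is the case in which the single interaction constant vanishes.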
MAUT is a powerful technique that could provide a practical method for deriving a single index from the SF-36. However, existing instruments using MAUT, such as the HUI and EuroQol, have fewer dimension items than the SF-36. The HUI (Mark I) developed by Torrance and associates defined 960 health states,[6] and the EuroQol (version 2) a modest 243.[10] Furthermore, the items within each dimension have a clear ordinal relationship, which substantially simplifies the mathematical modelling. Many items in the SF-36 do not have an obvious ordinal relationship within each dimension, and there is another layer of complexity in that most items have more than 2 responses. Consequently, MAUT cannot easily be applied to the current structure of the SF-36.
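The number of health states a classification defines is simply the product of the number of levels on each dimension, as the short sketch below shows. The level structures used are assumptions for illustration: 5 dimensions with 3 levels each for the EuroQol, and 4 attributes with 6, 5, 4 and 8 levels for the HUI (Mark I). Both reproduce the totals quoted above, but the text itself does not give these breakdowns.

```python
from math import prod


def count_states(levels_per_dimension):
    """Number of distinct health states: the product of levels per dimension."""
    if any(levels < 1 for levels in levels_per_dimension):
        raise ValueError("each dimension needs at least one level")
    return prod(levels_per_dimension)


# Assumed level structures (not stated in the text above).
euroqol_levels = [3, 3, 3, 3, 3]
hui_mark1_levels = [6, 5, 4, 8]

print("EuroQol states:", count_states(euroqol_levels))
print("HUI Mark I states:", count_states(hui_mark1_levels))
```

Because the count is multiplicative, each extra dimension or response level multiplies the valuation task, which is why an instrument with as many items as the SF-36 is so much harder to value than these two.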
One way forward that is being explored by a team of researchers in Sheffield has been both to reduce the size of the SF-36 and to simplify its structure into sets of ranked items or attribute levels.[53] The first version has 6 dimensions, and between 2 and 6 levels. A sample of states defined by this scale has been valued by groups of patients and professionals using standard gamble and rating scales. The study aims to determine whether a profile measure such as the SF-36, with its rich descriptive material, can be translated into an index for estimating QALYs.[54]
6.3 Scenario Approaches
Under the MAUT approach, a single index is derived for each state from zero (death) to 1 (perfect health), and this is multiplied by survival to obtain a total number of QALYs. The QALY methodology has been criticised in the literature for ignoring the effects of duration and overall prognosis on the patient's valuation of a health state (see section 4.4). Gafni and associates[38] at McMaster University have argued that equation 2 cannot be regarded as representing a utility function. In its place, they advocate a 2-stage algorithm for valuing a health state where the duration of a state is treated as a variable. Gafni and colleagues[38] also argue that their method of valuation is a superior technique for measuring utility to standard gamble and time trade-off, but these arguments extend beyond the scope of this article. This approach has not, however, been used widely because it substantially increases the valuation task, particularly if the scenarios are to be linked accurately to the wide range of outcomes found for many common procedures in clinical trials.[40] A simpler approach has been proposed by Hall and colleagues[55] where specific 'vignettes' or health scenarios are valued, using a single valuation procedure.
The SF-36 could be used to generate these scenarios directly from trial evidence, assuming it was shown to be sufficiently comprehensive for the condition being studied. It would be a major research task, since it would not benefit from the short-cut offered by MAUT, but would provide an important opportunity to test the additional assumptions of MAUT.
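The conventional QALY calculation described at the start of this section, an index value multiplied by survival, can be sketched as follows. The function name and the example profile are hypothetical, and discounting is deliberately left out.

```python
def qalys(profile):
    """Sum of index value x years over a sequence of (value, years) pairs.

    Index values run from 0 (death) to 1 (perfect health). Each state's
    value is treated as independent of how long the state lasts and of
    what follows it, which is the assumption criticised in the text.
    """
    total = 0.0
    for value, years in profile:
        if not 0.0 <= value <= 1.0:
            raise ValueError("index value must lie between 0 and 1")
        if years < 0:
            raise ValueError("duration cannot be negative")
        total += value * years
    return total


# Hypothetical patient: 2 years at an index of 0.7, then 3 years at 0.9.
example_profile = [(0.7, 2.0), (0.9, 3.0)]
print(f"QALYs: {qalys(example_profile):.2f}")
```

A scenario-based valuation would instead value the whole profile, duration included, as a single object, which is why it cannot reuse per-state index values in this way.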
7. Conclusion
The SF-36 has not been designed for use in economic evaluation. It may be used in cost-minimisation analysis, provided the scales can be demonstrated to reflect people's ordinal preferences. However, it cannot be used in cost-effectiveness or cost-utility analysis, and any attempt to assess cost effectiveness in less formal ways using SF-36 data must be viewed with caution. A single index derived from the SF-36 using arbitrary aggregate procedures would not be appropriate for use in economic evaluation. SF-36 data could be more useful in economic evaluation if it was feasible to derive a single index based on explicit valuation techniques. MAUT may provide one approach, but it requires a major modification of the structure and size of the existing SF-36. Furthermore, this approach requires a number of assumptions about the form of a utility function that must be tested. Were such research successful, it would add considerable value to the SF-36 as a research tool.
Acknowledgements
The author would like to thank Rosemary Harper and other members of the Sheffield SF-36 team, and the referees for their perceptive comments on earlier drafts of this paper. The author's post is supported by Trent Health.
References
1. Drummond MF. Cost-effectiveness guidelines for reimbursement of pharmaceuticals: is economic evaluation ready for its enhanced status? Health Econ 1992; 1: 85-92
2. Ellwood PM. Outcomes management: a technology of patient experience. N Engl J Med 1988; 318: 1549-56
3. Ware JE, Sherbourne CD. The MOS 36-item Short-Form Health Survey (SF-36): I. Conceptual framework and item selection. Med Care 1992; 30: 473-83
4. Aaronson NK, Acquadro C, Alonso J, et al. International quality of life assessment (IQOLA) project. Qual Life Res 1992; 1: 349-51
5. Torrance GW. Social preferences for health states: an empirical evaluation of three measurement techniques. Socioecon Plan Sci 1976; 10 (3): 128-36
6. Torrance GW. Multi-attribute utility theory as a method of measuring social preferences for health states in long-term care. In: Kane RL, Kane RA, editors. Values in long term care. Lexington: Lexington Books, DC Heath and Company, 1982
7. Kaplan RM, Anderson JP. The general health policy model: an integrated approach. Quality of life assessments in clinical trials. New York: Raven Press Ltd, 1990
8. Kaplan RM, Bush JW, Berry C. Health status: types of validity and the index of well-being. Health Serv Res 1976 Winter; 11 (4): 478-507
9. Kaplan RM, Bush JW. Health-related quality of life measurement for evaluation of research and policy analysis. Health Psychol 1982; 1 (1): 61-80
10. The EuroQol Group. EuroQol: a new facility for the measurement of health-related quality of life. Health Policy 1990; 16 (3): 199-208
11. Medical Outcomes Trust. How to score the SF-36 Health Survey. Boston: Medical Outcome Survey, 1993
12. Brook RH, Ware JE, Rogers WH, et al. Does free care improve adults' health? Results from a randomised controlled trial. N Engl J Med 1983; 309: 1426-34
13. Brook RH, Ware JE, Davies-Avery A, et al. Overview of adult health status measures fielded in Rand's Health Insurance Study. Med Care 1979; 17 Suppl.: 7
14. Tarlov AR, Ware JE, Greenfield S, et al. The medical outcomes study: an application of methods for monitoring the results of medical care. JAMA 1989; 262: 925-30
15. Ware JE, Snow KK, Kosinski M, et al. SF-36 Health Survey manual and interpretation guide. Boston: The Health Institute, New England Medical Center, 1993
16. Nunnally JC. Psychometric theory. New York: McGraw-Hill Book Co., 1967
17. McDowell I, Newell C. Measuring health: a guide to rating scales and questionnaires. Oxford: Oxford University Press, 1987
18. Stewart AL, Ware JE, editors. Measuring functioning and well-being. Durham, NC: Duke University Press, 1992
19. McHorney CA, Ware JE, Lu JFR, et al. The MOS 36-item Short-Form Health Survey (SF-36): III. Tests of data quality, assumptions and reliability across diverse patient groups. Med Care 1994; 32: 40-52
20. Stewart AL, Hays RD, Ware JE. The MOS short-form General Health Survey: reliability and validity in a patient population. Med Care 1988; 26 (7): 724-35
21. Brazier JE, Harper R, Jones NMB, et al. Validating the SF-36 health survey questionnaire: new outcome measure for primary care. BMJ 1992; 305: 160-4
22. WHO. Constitution of the World Health Organisation. Documents. Geneva: WHO, 1948
22. Ware JE. Standards for validating health measures: definition and content. J Chronic Dis 1987; 40 (6): 473-80
23. Hunt SM, McEwen J, McKenna SP. Measuring health status. London: Croom Helm, 1986
24. Carter W, Bobbitt R, Bergner M, et al. Validation of an interval scaling: the sickness impact profile. Health Serv Res 1976; 11: 516-28
25. McHorney CA, Ware JE, Raczek AE. The MOS 36-item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993; 31: 247-63
26. Garratt AM, Ruta DA, Abdalla MI, et al. The SF-36 health survey questionnaire: an outcome measure suitable for routine use within the NHS. BMJ 1993; 306: 1440-4
27. Vickrey BG, Hays RD, Graber J, et al. A health-related quality of life instrument for patients evaluated for epilepsy surgery. Med Care 1992; 30: 299-319
28. Nerenz DR, Repasky DP, Whitehouse MD, et al. Ongoing assessment of health status in patients with diabetes mellitus using the SF-36 and diabetes TYPE scale. Med Care 1992; 30 (5): MS240-52
29. Kurtin PS, Davies AR, Meyer KB, et al. Patient-based health status measurements in outpatient dialysis: early experiences in developing an outcomes assessment program. Med Care 1992; 30 (5): MS136-49
30. Berwick DM, Murphy JM, Goldman PA, et al. Performance of a five-item mental health screening test. Med Care 1991; 29 (2): 169-76
31. Katz JN, Larson MG, Phillips CB, et al. Comparative measurement sensitivity of short and longer health status instruments. Med Care 1992; 30: 917-25
32. Brazier J, Jones N, Kind P. Testing the validity of the EuroQol and comparing it with the SF-36 health survey questionnaire. Qual Life Res 1993; 2: 169-80
33. Williams A. Measuring functioning and well-being, by Stewart and Ware. Health Econ 1992; 1 (4): 255-8
34. Deyo RA, Inui TS. Toward clinical applications of health status measures: sensitivity of scales to clinically important changes [editorial]. Health Serv Res 1984; 19 (3): 275
35. Drummond MF, Stoddart GL, Torrance GW. Methods for the economic evaluation of health care programmes. Oxford: Oxford Medical Publications, 1987
36. Cairns J, Johnston K, McKenzie L. Developing QALYs from condition specific outcome measures. HERU DP 14/91. Aberdeen: University of Aberdeen, 1991
37. Loomes G, McKenzie L. The use of QALYs in health care decision making. Soc Sci Med 1989; 28 (4): 299-308
38. Gafni A, Birch S, Mehrez A. Economics, health and health economics: HYEs vs. QALYs. J Health Econ 1993; 12 (3): 325-40
39. Culyer AJ, Wagstaff A. QALYs vs. HYEs. J Health Econ 1993; 12 (3): 311-24
40. Nicholl J, Brazier JE, Milner PC, et al. Randomised controlled trial of cost-effectiveness of lithotripsy and open cholecystectomy as treatments for gall bladder stones. Lancet 1992; 340: 801-7
41. Gudex C. QALYs and their use by the Health Service [Discussion Paper 20]. York: Centre for Health Economics, University of York, 1986
42. Sackett DL, Torrance GW. The utility of different health states as perceived by the general public. J Chronic Dis 1978; 31 (11): 697-700
43. Richardson J, Hall J, Salkeld G. CUA: the compatibility of measurement techniques and the measurement of utility through time. In: Selby-Smith C, editor. Economics and Health. Clayton, VIC: Monash University, 1989: 31-52
44. Torrance GW. Measurement of health state utilities for economic appraisal: a review. J Health Econ 1986; 5: 1-19
45. Revicki DA, Kaplan RM. Relationship between psychometric and utility-based approaches to the measurement of health-related quality of life. Qual Life Res 1993; 2 (3): 477-87
46. Revicki DA. Relationship between health utility and psychometric health status measures. Med Care 1992; 30 (Suppl.): MS274-82
47. Katz JN, Phillips CB, Fossel AH, et al. Stability and responsiveness of utility measures. Med Care 1994; 32 (2): 183-8
48. Laupacis A. The Canadian Erythropoietin Study Group. BMJ 1990; 300: 573-8
49. O'Brien BJ, Buxton MJ, Ferguson BA. Measuring the effectiveness of heart transplant programmes: quality of life data and their relationship to survival analysis. J Chronic Dis 1987; 40 (1): 1315-85
50. McKenna S, Hunt SM, Tennant A. The development of a patient-completed index of distress from the Nottingham Health Profile: a new measure for use in cost-utility studies. Br J Med Econ 1993; 6: 13-24
51. Feeny D, Furlong W, Barr RD, et al. A comprehensive multi-attribute system for classifying the health status of survivors of childhood cancer. J Clin Oncol 1992; 10 (6): 923-8
52. Keeney RL, Raiffa H. Decisions with multiple objectives: preferences and value tradeoffs. New York: John Wiley and Sons, 1976
53. Brazier JE. The SF-36 Health Survey Questionnaire - a tool for economists. Health Econ 1993; 2: 213-5
54. Brazier JE, Usherwood T, Harper R, et al. Deriving a single index for health from the SF-36: interim report for the Department of Health. Sheffield: University of Sheffield, 1994
55. Hall J, Gerard K, Salkeld G, et al. A cost utility analysis of mammography screening in Australia. Soc Sci Med 1992; 34 (9): 993-1004
Correspondence and reprints: John Brazier, Senior Lecturer in Health Economics, Sheffield Centre for Health and Related Research, University of Sheffield, Regent Court, 30 Regent Street, Sheffield S1 4DA, England.