
A Method for Predicting Death and Illness in an Elderly Person

by

Michael Walter Anderson

B.A. (College of William and Mary in Virginia) 1991
M.A. (University of California at Berkeley) 1993

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Demography

in the

GRADUATE DIVISION

of the

UNIVERSITY of CALIFORNIA at BERKELEY

Committee in charge:

Professor Kenneth W. Wachter, Chair
Professor Ronald D. Lee
Professor John R. Wilmoth
Professor David A. Freedman

1997


The dissertation of Michael Walter Anderson is approved:

[Signatures and dates of the committee members]

University of California at Berkeley

1997


A Method for Predicting Death and Illness in an Elderly Person

© 1997

by

Michael Walter Anderson


Abstract

A Method for Predicting Death and Illness in an Elderly Person

by

Michael Walter Anderson

Doctor of Philosophy in Demography

University of California at Berkeley

Professor Kenneth W. Wachter, Chair

This dissertation describes three accomplishments. Statistically, it defines a

nonparametric predictor for efficiently classifying binary observations. The model uses

Boolean operators to combine binary predictor variables in a variation on classification trees.

Second, this approach is demonstrated by predicting three-year mortality and illness in a

sample of elderly persons (ages 65+) with compact subsets of questions from a baseline

survey (EPESE: Established Populations for Epidemiologic Studies of the Elderly,

N=10,294). Third, the author constructs a search algorithm to fit the model, using a test set

to choose model sizes and estimate error. These tasks are united under a common objective:

predicting whether an elderly person would die within three years by asking simple questions

about the individual's demographic status, functionality, and epidemiological history. The

results also inform substantive analyses of mortality.

Several models are presented. Set A correctly predicted 66% of deaths in an internal

test set with a specificity of 70%. Set B caught 42% of deaths with a specificity of 89%. Set

C caught 22% of deaths with a specificity of 96%. The estimated area under the receiver

operating characteristic curve was 74.4% ± 1.2%. The accuracy is comparable to that


achieved by larger models built with more complicated methods (e.g., linear discriminant

analysis, logistic regression). However, the models developed here use fewer variables and

achieve a more elegant, easily interpreted representation. For example, Set B requires only

seven variables, all in the form of simple questions (e.g., "Can you walk half a mile?"). To

validate the error estimates, the models were applied to an independent sample to which the

author was blinded (the North Carolina EPESE respondents, N=4,162). The accuracy was

slightly lower for this sample, but the size of the difference was small (comparable to

sampling error).

Although the models are designed for the purposes of prediction, they also facilitate

etiological analyses by dividing the respondents into more homogeneous subsets. Based on

evidence provided in the results, it is argued that cancer was underestimated as a cause of

death in the EPESE sample, and that digitalis toxicity may have been a substantial source of

excess mortality.

Chair


This dissertation is dedicated to the elderly.


Table of Contents

List of Tables vi
List of Figures vi
Acknowledgments viii
Chapter 1 - Questions that predict death and illness 1

Section 1.1 - Introduction 1
Section 1.2 - The value of predicting mortality 5
Section 1.3 - Data and concept 9
Section 1.4 - A general search algorithm 14
Section 1.5 - The test set method 19
Section 1.6 - Use of the test set to determine the number of questions 26
Section 1.7 - Results: mortality 32
Section 1.8 - Results: heart failure, stroke and cancer 38
Section 1.9 - Discussion 41
Section 1.10 - An overview of the dissertation 52

Chapter 2 - The problem of prediction 55
Section 2.1 - The specific problem of predicting death or illness 55
Section 2.2 - Defining prediction and prediction error for a general predictor 56
Section 2.3 - Previous efforts to predict mortality and morbidity 63
Section 2.4 - Regression as a method of classification 65
Section 2.5 - Classification trees and Bayes' rule 71
Section 2.6 - Other methods of classification 88
Section 2.7 - Existing applications of the models 96

Chapter 3 - Data: Established Populations for Epidemiologic Studies of the Elderly 105
Section 3.1 - The EPESE project 105
Section 3.2 - Composition of the populations 106
Section 3.3 - The baseline survey 107
Section 3.4 - Outcomes 109
Section 3.5 - A comparison of the EPESE sample and the U.S. population 110
Section 3.6 - Functionality, morbidity and causes of death in the EPESE sample 118
Section 3.7 - Missing data 131

Chapter 4 - The statistical methods of prediction 135
Section 4.1 - Approaches to model selection 135
Section 4.2 - A method for choosing questions 137
Section 4.3 - Performance of the search algorithm 151
Section 4.4 - Linear discriminant analysis 163
Section 4.5 - Logistic regression 166
Section 4.6 - The CART algorithm for model selection 169
Section 4.7 - Modifications and additions to the question set method 172

Chapter 5 - Tools for prediction on an individual level 177
Section 5.1 - Simple questions for predicting survival or death 177
Section 5.2 - An index for the risk of mortality: linear discriminant analysis 186


Section 5.3 - Classification trees for predicting death 197
Section 5.4 - A regression model of mortality 202
Section 5.5 - Validation of the question sets with the North Carolina sample 205
Section 5.6 - Validation of the discriminant and classification tree models 219
Section 5.7 - A comparative analysis of the models 222
Section 5.8 - An additional question set model 227

Chapter 6 - The causes of death in the elderly 230
Section 6.1 - Causes as identified by the death certificate 230
Section 6.2 - Underlying and associated causes in the EPESE population 236
Section 6.3 - The causes of death associated with the models 240
Section 6.4 - The causal processes and risk factors associated with death 243
Section 6.5 - Digitalis use and mortality: cause or consequence? 248

Chapter 7 - Conclusions 255
Section 7.1 - The power of the models for predicting mortality 255
Section 7.2 - The efficiency of the search method 256
Section 7.3 - Implications for causal analyses of mortality 257
Section 7.4 - Future applications of these methods 261

References 263
Appendix I - Questions for the prediction of death 269
Appendix II - Questions for predicting heart failure 273
Appendix III - Questions for predicting strokes 275
Appendix IV - Questions for predicting cancer 277
Appendix V - C code for the repeated random search algorithm (RSA) 279
Appendix VI - Mortality index for the elderly 294
Appendix VII - C code for the repeated random and exhaustive search algorithm (RRESA), combined with backward deletion to estimate error on a test set 299
Appendix VIII - A large question set for predicting death 333


List of Tables

Table 1.1 - Predicted outcome by true outcome for Question Set A of Appendix I as applied to a test set 23
Table 1.2 - Deaths predicted correctly and survivors predicted incorrectly as dead 33
Table 1.3 - Heart failures predicted correctly and non-failures predicted incorrectly 40
Table 1.4 - Strokes predicted correctly and non-strokes predicted incorrectly 42
Table 1.5 - Cancer predicted correctly and non-cancer predicted incorrectly 43
Table 3.1 - Variables in the baseline survey 108
Table 3.2 - Observed sample and U.S. population estimates of probability of dying within three years of age x (3qx) by sex 117
Table 4.1 - Specifications for the construction of the question set models 147
Table 5.1 - Predicted outcome by true outcome for Question Set A, full dataset 177
Table 5.2 - Predicted outcome by true outcome for Question Set A, internal test set 178
Table 5.3 - Predicted outcome by true outcome for Question Set B, full dataset 180
Table 5.4 - Predicted outcome by true outcome for Question Set B, test dataset 182
Table 5.5 - Predicted outcome by true outcome for Question Set C, full dataset 183
Table 5.6 - Predicted outcome by true outcome for Question Set C, test dataset 185
Table 5.7 - Probability of death within three years by mortality index 188
Table 5.8 - Average and median mortality index score by age and sex 192
Table 5.9 - Deaths and survivors predicted by discriminant analysis 194
Table 5.10 - Deaths and survivors predicted by classification trees 198
Table 5.11 - Coefficients in the logistic regression model of mortality 203
Table 5.12 - Predicted outcome by true outcome, logistic regression, test set 204
Table 5.13 - Results of applying Question Set A to Duke sample 210
Table 5.14 - Results of applying Question Set B to Duke sample 215
Table 5.15 - Results of applying Question Set C to Duke sample 217
Table 5.16 - Results of applying discriminant model to Duke sample 219
Table 5.17 - Results of applying discriminant model to Duke sample (highest risk) 220
Table 5.18 - Results of applying classification tree to Duke sample 221
Table 5.19 - Performance of three methods of prediction 223
Table 5.20 - Predicted outcome by true outcome, Question Set J, full dataset 227
Table 5.21 - Predicted outcome by true outcome, Question Set J, test dataset 227
Table 6.1 - Classification of ICD-9 codes 237
Table 6.2 - Underlying causes of death by question subset 242

List of Figures

Figure 1.1 - Body mass index (weight/height) by systolic blood pressure 20
Figure 1.2 - Misclassification error by number of questions asked 28


Figure 1.3 - True positive fraction by false positive fraction (ROC curve) for questions predicting deaths in the test set 37
Figure 1.4 - Diagram of death from chronic illness 46
Figure 2.1 - Simulation of Y as a quadratic function of X 68
Figure 2.2 - Example of a classification tree for risk of death 72
Figure 2.3 - Distribution of survivors by blood pressure and body mass index 77
Figure 2.4 - Distribution of decedents by blood pressure and body mass index 78
Figure 2.5 - Contour plot of ratio of survivors' distribution to decedents' distribution 80
Figure 2.6 - Example of question sets combined with OR 86
Figure 2.7 - Division of contour plot, 9-question model 89
Figure 2.8 - Division of contour plot, 3-question model 90
Figure 2.9 - Distribution of deaths and survivors by z index 92
Figure 3.1 - Number of respondents by age and sex 112
Figure 3.2 - Probability of death within 3 years by age and sex 116
Figure 3.3 - Proportion unable to walk half of a mile without help by age and sex 120
Figure 3.4 - Proportion unable to bathe without help by age and sex 121
Figure 3.5 - Proportion ever diagnosed with heart failure by age and sex 123
Figure 3.6 - Proportion ever diagnosed with cancer by age and sex 124
Figure 3.7 - Incidence of heart failure by age and sex 127
Figure 3.8 - Incidence of stroke by age and sex 128
Figure 3.9 - Incidence of cancer by age and sex 129
Figure 4.1 - Histogram of numbers of mutations required to achieve absorption 155
Figure 4.2 - Total number of successful mutations by total number of mutations (trajectories for 200 random absorption points) 156
Figure 4.3 - Total number of successful mutations by total number of mutations (trajectories for 200 random absorption points) 157
Figure 4.4 - Probability of a successful mutation by number of previous successful mutations 160
Figure 4.5 - Model fitness by model size, discriminant analysis 165
Figure 4.6 - Misclassification error by size of tree 173
Figure 5.1 - Respondents chosen by Set A, by age and sex 181
Figure 5.2 - Respondents chosen by Set B, by age and sex 184
Figure 5.3 - Bar plot of probability of death by linear discriminant index score 190
Figure 5.4 - Smoothed estimate of probability of death by linear discriminant index score 191
Figure 5.5 - True positive fraction by false positive fraction (ROC curve) for discriminant model of deaths in the test set 196
Figure 5.6 - True positive fraction by false positive fraction (ROC curve) for classification trees of deaths in the test set 199
Figure 5.7 - Classification tree for risk of death within 3 years 201
Figure 5.8 - Number of respondents in Duke sample by age, sex, and race 209
Figure 5.9 - True positive fraction by false positive fraction (ROC curve) for questions predicting death (with Duke results) 218


Acknowledgments

My research would not have been possible without the support of the following persons and

institutions: Ronald Lee, Kenneth Wachter, John Wilmoth, David Freedman, Leo Breiman,

David Aday, the Department of Demography and the Statistical Computing Facility at the

University of California at Berkeley, the National Institutes of Health, Gary Maloney,

Marilyn Friedemann, and my mother and father.


Chapter 1 - Questions that predict death and illness

Danger is a biologic necessity for men, like sleep and dreams. If you face death, for that time, for the period of direct confrontation, you are immortal. For the Western middle classes, danger is a rarity and erupts only with a sudden, random shock. And yet we are all in danger at times, since our death exists; Mektoub, it is written, waiting to present the aspect of surprised recognition.

Is there a technique for confronting death without immediate physical danger? Can one reach the Western Lands without physical death? These are the questions that Hassan i Sabbah asked.

Don Juan says that every man carries his own death with him at all times. The impeccable warrior contacts and confronts his death at all times, and is immortal.

- William S. Burroughs, The Western Lands

1.1 Introduction

This research has four distinct objectives. From a statistical standpoint, the

dissertation defines a nonparametric method of efficiently classifying observations by using

Boolean operators to combine predictor variables in the form of binary questions. This

approach is demonstrated by predicting three-year mortality in a sample of elderly persons

with small subsets of baseline survey questions. To fit this form of model, the author

constructs a search algorithm that attempts to reduce misclassification error by using internal

test set techniques to choose model sizes and estimate error. Finally, these predictive models

are used to inform more substantive analyses of the causes of death in the elderly.

These diverse goals are unified under a single theme: the prediction of death with

simple survey questions. The central idea of the dissertation is to select a combination of

questions from a large pool of possible questionnaire items such that persons with certain

answers have a high probability of death within three years. Many of these items are simple

questions about the functional status of the individual (e.g., activities of daily living, or


ADL's) which are commonly found in many large health surveys. Below is an example of

such questions that can be answered without help from any clinician, and that take less than

a minute to administer in an interview:

(1) Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from another person or from any special equipment or device to do the following things?

i) Walking across a small room?

No help needed
Needed help
Unable to do
Missing/NA¹

ii) Bathing, either a sponge bath, tub bath or shower?

No help needed
Needed help
Unable to do
Missing/NA

(2) Are you retired?

Yes
No
Missing/NA

The chance that a randomly selected elderly person (aged 65 or over) in the U.S.

would give all three answers in bold is small. However, for those that do, it is estimated

with the data used in this dissertation that there is a 50% to 65% chance of death within three

years. It is improbable that this estimate is the result of a fluke, chance variation, or

statistical artifact, as these estimates were validated on a large, independent sample of elderly

1 The answer "Missing/NA" is explained below. For these questions, this category of answer isgiven only by the small minority of respondents who do not give any of the other responses for some reason(e.g., refusal). See Chapter 3 for a thorough treatment of missing values.


persons after the questions were selected.

First, the questions were constructed with the methods developed below using a

sample of 10,294 respondents from a survey study of persons aged 65 and over from Iowa,

New Haven and Boston (the original EPESE sites). Using an internal test set (explained

below), the respondents chosen by the above questions were estimated to have a 65% chance

of dying. When the questions were then applied to an entirely independent sample of 4,162

elderly persons having a different demographic makeup (the North Carolina EPESE dataset),

a death rate of greater than 50% was observed in the chosen respondents. This dissertation

argues that many more combinations of questions can be found which isolate other groups

of elderly persons who also have a very high risk of death and chronic illness. The central

task is in developing a method for selecting these combinations of questions. The approach

invented here is developed completely in the context of predicting mortality and morbidity;

yet it is somewhat serendipitously designed to be generalizable to the prediction or

classification of nearly any binary outcome based on a large pool of predictor variables. This

is accomplished by using a search algorithm which: I) handles a large variety of data types;

2) does not utilize distribution-based statistical tests of parameters for model selection; and

3) does not depend on any substantive or context-specific information about the predictor

variables (e.g., medical knowledge about the causes of mortality). Rather, the strategy is

simply to use the power of the computer to search across combinations of questions and

answers that isolate large numbers of respondents with high death rates. To ensure that the

computer was not simply exploiting chance variation by identifying random correlations in

the data, a large test set of respondents was selected out of the dataset before the models were


built, and the computer used the remaining respondents (the learning set) to construct models

of different sizes. These models were then applied to the test set respondents, and the model

size that most accurately predicted mortality was selected. The accuracy of the models was

estimated with the test set data, and these error estimates were validated on an independent

sample from an entirely different geographic region, as mentioned (see Chapter 5).

In this way, an entirely systematic and atheoretic method of model construction is

defined. However, powerful statistical models are useless if they are not applicable to

contextually complex, real world situations. Thus, the author does not attempt to develop

this algorithm in a completely atheoretic, mechanistic context. Instead, the construction of

the statistical method is integrated with the more palpable task of predicting death in actual

elderly persons. Consequently, a considerable portion of the dissertation is concerned with

many different aspects of predicting and explaining mortality beyond those involving the

mere construction of statistical models. In short, the methods here were developed with the

ultimate intention of contributing to the existing knowledge of mortality; the statistical tools

were created to serve this purpose, and although they may be generalized to other problems,

the development of a general algorithm was not intended to be the primary contribution of

the dissertation. Since the methods matter to the results, however, a great deal of attention

is given to the details of the model construction in addition to pure mortality research.

The remainder of this chapter provides a roadmap for the dissertation in broad

strokes. First, the motivation for prediction is discussed in the context of predicting death

and illness in individuals, and the existing "state of the art" techniques of building models

for these purposes are briefly described. Next, the chapter provides background on the


survey data used for this project and presents the basic idea of combining the survey

questions in binary form with Boolean logic to classify respondents as high or low risk. A

general search algorithm for finding these questions is then described. The problem

presented by random variation in the data is defined in nontechnical language, and questions

are raised about the use of internal test set techniques to estimate the accuracy of the models.

Finally, a brief summary of the central results of the dissertation is provided, followed by a

discussion of substantive implications.

1.2 The value of predicting mortality

A vast literature exists concerning the statistical prediction of mortality and

morbidity, and this research serves a variety of purposes.2 The value of accurate prognoses

for clinicians responsible for medical decision making is one central interest. Since most

chronic diseases progress in stages, slowly deteriorating the body over time, there is a great

advantage in detecting the onset of these conditions sooner and treating the patient before

disease has fully taken its toll. Additionally, if physicians can reliably predict postoperative

mortality, the less suitable candidates for surgery could be more effectively identified before

the procedure is performed. Secondly, a large part of the research is concerned with the

assessment of risk-adjusted outcomes for hospital quality evaluation, since accurately

determining the risk-level of patients at admission is essential for a fair comparison of post-

admission mortality rates between hospitals. Similarly, the identification of an accurate

2 For some of the latest research in this field, see Anderson et al. (1994); Becker et al. (1995); Bernstein et al. (1996); Cain et al. (1994); Davis et al. (1995); Eysenck (1993); Grubb et al. (1995); Iezzoni et al. (1996); Josephson et al. (1995); Marshall et al. (1994); Normand et al. (1996); Ortiz et al. (1995); Piccirillo and Feinstein (1996); Poses et al. (1996); Pritchard and Woolsey (1995); Quintana et al. (1995); Reuben et al. (1992); Rowan et al. (1994); Schucter et al. (1996); Smith and Waitzman (1994); Talcott et al. (1992); Turner et al. (1995); and Wong et al. (1995).


measure of risk using a small group of commonly-measured variables yields a natural proxy

for the observation of death. This provides a valuable research tool, as many studies in

which similar questionnaires are used do not observe mortality and morbidity. Lastly, some

researchers are interested in the causal mechanisms behind death and illness. Since causal

connections are difficult to establish with observational data, their quantitative analyses are

necessarily geared toward the statistically feasible task of prediction, with some hope of

gaining some substantive insight from the fitted models.

However, the goal of most existing research focuses on a limited universe of

respondents. In particular, the vast majority of attempts at genuine prediction (i.e., prognosis

as opposed to causal analysis) are directed at the minority of persons who are already quite

ill. They have usually been admitted to a hospital for their illness, and may be quite close

to death at admission. As such, the models use physiological data that require laboratory

work, such as blood analyses. The dominant statistical approach used in these efforts is

logistic regression, with some other equally cumbersome methods (e.g., discriminant

analysis, kernel density estimation) used less frequently. These methods use complicated

(usually inappropriate) assumptions about the data. In the case of regression, models tend

to be overfitted when many variables are introduced, and the usual R² statistic may

overestimate the accuracy of a model.3

The approach developed in this dissertation is distinct from these efforts in two ways.

First, the model is of a nonparametric form with a particularly simple representation (as

demonstrated by the question sets above). This idea of partitioning a dataset into high and

3 See Chapters 2 and 4 for more detailed discussions of these methods and explanations for why the author finds them cumbersome.


low risk groups nonparametrically with "yes/no" questions (called binary splits on predictor

variables) is not original; indeed, the inspiration came from the idea of classification trees

as developed by Breiman et al. (1984). However, instead of arranging splits hierarchically

in a tree structure, the variation developed here uses Boolean logic to arrange splits into

subsets of questions, each defining a particular region of the sample space. When combined

with an algorithm for selecting these question subsets, this form of model appears to provide

advantages over the tree approach in terms of predictive power and simplicity of

representation.

Secondly, the dissertation aims to build powerful models designed to predict death

in the general population of elderly persons in the U.S., not patients in a hospital who are

already suffering from a known, diagnosed condition. Although there exist many fitted

models of mortality constructed from probability samples of U.S. elderly, these usually are

designed not for the purposes of prediction per se, but with the intention of conducting causal

analyses.4 The aim of this dissertation research is to build models based purely on

prediction, with the goal of attaining increased predictive power. Some readers not familiar with

modern statistical techniques may fail to see the distinction between these goals. The biggest

differences pertain to how the model is specified and how its accuracy is measured. By

defining prediction strictly, one is provided with a much more natural criterion for estimating

the predictive accuracy of models honestly (e.g., test set or cross-validation estimates of

error). This approach also aids the task of model specification by giving the researcher a

means by which the optimal model size may be gauged.

4 E.g., see the discussion on Smith and Waitzman (1994) in Section 2.7.


As a byproduct of models whose development is based on pure prediction, many

results obtained in this dissertation may be of academic interest for mortality researchers.

For example, because the predictor for mortality consists of a handful of questions which

are common to many surveys, the model yields a natural proxy for short-term mortality

which could be used in those surveys for which mortality is not directly observed as an

outcome. That is, a predictor which can classify a large number of high-risk respondents

accurately can be constructed through the use of a set of questions which number 10 or

fewer, and which require no information about most substantive variables of interest (e.g.

race, income, educational attainment). Thus, a researcher wishing to use such a proxy would

be left with most independent variables available for analysis. Additionally, the variables

required (principally age, sex, and activities of daily living, or ADL's) are widely available

on many survey instruments, most of which do not observe mortality directly.

Finally, although the method is statistically geared toward prediction, the exact

questions and answers by which high-risk persons are identified raise some interesting

substantive and theoretical issues. Beyond simply identifying possible risk factors, greater

structural features of the mortality processes in the elderly population are highlighted by the

analysis. Since the method for finding questions does not control for the vast majority of

covariates in any way, it might seem that the causal implications of such a model would be

much less interpretable than the usual regression model. However, the question set method

has a different type of advantage: it not only finds subsets of the population who are at very

high risk (or conversely, low risk), it identifies these persons in efficient and easy-to-

understand terms. Thus, the idea is not necessarily to find a direct causal connection between


mortality and the actual variables used in the question models, but to use the questions

simply to delimit the various (highly heterogeneous) high-risk subsets of the population; then

intuitive thinking about the causal processes can be usefully applied by comparing the high

and low risk populations on more substantive grounds.

The results have some very interesting implications for this type of thinking. In

particular, many researchers seem to believe that by first controlling for age and sex, which

of course does divide the population into relatively low and high risk subsets, one gains not

only a great deal in substantive understanding, but in predictive power as well. However, the

questions below depart from this assumption radically; it seems possible to attain higher

levels of predictive power by considering age and sex secondarily. The result is that the

subsets of the population identified as high risk by a particular model usually include persons

of all ages and both sexes, and their mortality rates are much higher than predicted by their

age and sex distribution. Additionally, the models point to some possible deficiencies in

cause-specific death certificate data and the detection of illnesses (particularly cancer) that

may mislead research into the causal processes leading to death.

1.3 Data and Concept

The EPESE project (Established Populations for Epidemiologic Studies of the

Elderly) gathered extensive survey data on 10,294 noninstitutionalized elderly persons (aged

65 and over) from New Haven, East Boston and Iowa County.5 In a 1983 baseline survey,

these respondents answered questions concerning demographic status, weight, height, blood

pressure, past medical diagnoses of maladies such as cancer, heart attacks, diabetes, strokes

5 See Cornoni-Huntley (1993). A more thorough description of the data is provided in Chapter 3.


and bone fractures, activities of daily living and other measures of physical and mental

functional status, depression, living arrangements, past histories of smoking, drinking and

obesity, and many other variables. The respondents were then followed over a period of

three years and the survivors annually provided follow-up data on the subsequent diagnoses

of chronic diseases (among other variables). Those respondents who died (1,450 within three

years of baseline) were matched with their death certificates. In this dissertation, the main

variable to be predicted from the baseline survey is whether the respondent is among these

decedents; occurrence of heart attacks, strokes, and the onset of cancer after baseline are the

other outcomes of interest.

For this project, the dataset was divided into two groups: a test set and a learning set.

A simple random sample was used to select out one third of the respondents (3,432) in the

survey, and these persons were designated as the test set respondents. At the earliest stages

of the research, the test set was set aside for future use, and a computer-intensive, randomly-

driven search algorithm was implemented on the remaining two thirds of the respondents

(6,862). The goal of this search was to find four survey questions such that the respondents

who supplied particular answers to these questions subsequently experienced extremely high

rates of mortality.6 The potential pool of questions was vast; 164 predictor variables were

culled from the survey.7 Variables could take on many different values, allowing for

6 At this point, the number of questions to ask was chosen solely from intuition. Later, a backward deletion was used to decide the best number of questions.

7 There were more than 164 variables available in the dataset, but some were not consistently phrased in all three geographic areas (e.g., most of the questions on emotional status), and some questions were sample-specific items which did not have any particular relevance to the general population of elderly persons; these items were ignored. Other than missing values, which were assigned a numeric value of 999, there was no recoding of the data.


thousands of distinct questions. On the first application of this algorithm, without any

examination of the test set, the following set of four questions was discovered:

(1) Are you able to walk half a mile without any help? That's about eight ordinary blocks.

Yes
No
Missing⁸

(2) Are you able to do heavy work (such as shoveling snow, washing windows, walls) without any help?

Yes
No
Missing

(3) What day of the week is it? (Respondent's answer is:)

Correct
Incorrect
Refused
Missing

(4) What is your household composition?

Alone
With spouse only
With spouse and other person(s)
Other arrangements
Missing

Once these questions were selected, the answers of the 3,432 test set respondents were

examined. In the test set, 82 respondents gave answers to the questions that were all in bold

(above). Of these respondents, about 50% died within three years. These individuals were

not all obviously ill: 60% said they had not spent one night in the hospital in the year before

8 The category "missing" includes all responses which do not fit the other categories, as well asdata which is literally missing since some questions were not asked of a handful of respondents (e.g., theIowa proxy and telephone interviews). See Section 3.7 for a discussion of missing data.


baseline. Some 85% said they had never been diagnosed with a heart attack and 81% had

never been diagnosed with cancer. They were not all old males, either. In fact, the mortality

rate of the chosen test set respondents was more than twice as high as the expected death rate

based on the age and sex composition of the chosen respondents.9

The deaths in these chosen respondents only made up 8% of all deaths in the test set,

but this was not the only set of questions that could be found. Upon repeating the search, it

was found that other sets of questions chose other sets of respondents. When these questions

were applied to the test set, death rates were consistently higher than 50% and sometimes as

high as 80%. Thus, it seemed possible to combine sets of questions using simple Boolean

logic to identify larger groups of respondents. Two sets of questions, each of the form "Are

the answers to questions A, B and C all in bold?" can be combined to have the form "Are

the answers to questions A, B, and C all in bold, OR are the answers to questions X, Y and

Z all in bold?" Of course, it is only possible to gain predictive power if both sets of

questions do not identify the same set of respondents. Typically, however, these groups

were mostly disjoint, and the question subsets could be deliberately constructed together so

that they were mostly disjoint.
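As a sketch of this OR-of-ANDs structure, the following C fragment shows how a combined question set classifies a single respondent. It is illustrative only: the struct fields and the particular answer combination are hypothetical stand-ins, not an actual EPESE question set from this dissertation.

#include <stdio.h>

/* Hypothetical binary answers for one respondent (1 = the "bold" answer). */
typedef struct {
    int needs_help_walking;
    int needs_help_bathing;
    int retired;
    int cannot_do_heavy_work;
    int lives_alone;
} Respondent;

/* "Are the answers to A, B and C all in bold, OR are the answers to
 * X, Y and Z all in bold?"  Each clause is one question subset; the OR
 * unions the (mostly disjoint) high-risk groups they select.            */
static int chosen(const Respondent *r)
{
    return (r->needs_help_walking && r->needs_help_bathing && r->retired)
        || (r->cannot_do_heavy_work && r->lives_alone && r->needs_help_walking);
}

int main(void)
{
    Respondent r = { 1, 1, 1, 0, 0 };
    printf("classified as high risk: %s\n", chosen(&r) ? "yes" : "no");
    return 0;
}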

Since the number of questions was not systematically chosen at this stage of the

research, it was thought that the results of the search algorithm could probably be improved

by determining the best number of questions to ask. Larger sets of questions could provide

greater predictive power, for example, but they could also fail due to chance error in the

9 This expected age- and sex-standardized death rate is the death rate one would expect if a random group of respondents with the same age and sex distribution was exposed to the age- and sex-specific death rates calculated from the entire sample.


predictions. Also, it seemed that greater predictive power could be achieved by combining

sets of questions using the OR operator, as described. Again, at what point do the

combinations of question sets become so large that they fail to increase (or even decrease)

true predictive accuracy?10 The answer was learned by examining the test set with a

sequence of variously sized question sets.

Before examining more powerful sets of questions however, it is instructive to

examine this search algorithm in somewhat more detail. It is also necessary to define what

is meant by "true" predictive power or accuracy with a probability model and error criterion,

and to describe how this accuracy might be estimated honestly. Thus, it is also necessary to

address another issue concerning the "honesty" of the error estimates. Multiple examinations

of the test set can introduce some downward bias in test set estimates of error, resulting in

an optimistic impression of accuracy. Also, the author had used the full dataset for other

analyses before this endeavor, and was consequently not truly blind to the test set

respondents. However, the search was constructed in an attempt to mitigate such bias.11

Ultimately, the predictive power of these questions (and the overall method) was validated

on a completely independent sample (the Duke EPESE sample), and additional efforts to do

so with other datasets (the NHANES I and NHEFS surveys) are presently underway. Thus

far, the results have shown that using the test set for both refining and assessing the model

does not impart substantial bias to the estimate of accuracy, and that the resulting models are

not overfitted.

10 The general problem of choosing a model to maximize predictive power usually resolves to a bias-variance tradeoff. This is a well-studied subject in statistics; see Breiman et al. (1984) for an example.

11 See Chapter 2 for a heuristic justification of the estimation.


1.4 A general search algorithm

Suppose we have some number of predictor variables, or survey questions (164 here),

and a binary outcome variable, such as death or survival within three years of baseline.

Again, the goal is to search for a set of questions and answers such that the death rate of the

respondents who answer a certain way is extremely high. 12 However, this task is more easily

stated than completed. To simplify the job, the questions can be constrained to take the form

of binary splits, i.e., "Is variable X < 2.3?", or "Is variable X ≥ 1.5?", such that every question

has a "yes/no" answer (or bold and nonbold, as in the examples above). Some readers may

recognize that this is essentially a variation on the CART method of building classification

trees, which also finds sets of binary questions in classifying a categorical variable (see

Breiman et al. (1984)). Several important differences exist, however. The most significant

difference was in the construction of the models, as the two model forms themselves (the

Boolean structure and the tree structure) performed quite similar functions statistically.

Detailed comparisons of these methods are presented in Chapters 2 and 4, and Chapter 5

compares the results.

From the initial 164 variables chosen out of the survey, 3,248 binary questions could

be formed. 13 Thus, if one is to find a particular combination of four questions, for example,

12 The term "death rate" is being used somewhat loosely here to mean the proportion of personswho died within three years (typically called a "raw" death rate by demographers). Formally definedhowever, a death rate for ages x to x+3, or 3MX' measures deaths divided by exposure, while the quantitymeasured here is actually a cohort estimate of 3qx> the probability of dying between ages x and x+3. Thisdistinction will be ignored in this chapter for the convenience of non-demographers.

13 Note that if a variable has J possible responses, then there are 2·(J − 1) possible "≤/>" questions that can be formed with this variable. For example, the variable sex has only two responses (1 or 2), and so only two questions from this variable can be formed: "Is sex ≤ 1?" or "Is sex > 1?". A variable with three responses, however, yields four possible questions: "Is X ≤ 1?", "Is X ≤ 2?", "Is X > 1?" and "Is X > 2?". Of course, a variable with more than two responses may also be split with a more complicated


there exist more than four trillion possible combinations from which to choose (3,248 choose

four). If one wishes to find ten such "yes/no" questions, then more than 10²⁸ combinations

are possible (3,248 choose ten). Moreover, many possible combinations of questions and

answers did not pick out any respondents at all, and few chose sets of respondents with

extremely high death rates. (Indeed, the number of question sets with low error was quite

small for models of only several questions, as is shown by exhaustive searching in Chapter

4.)

Suppose again that the number of questions is fixed at four. A measure of

"efficiency" for the prediction scheme was defined. This criterion was based on the death

rate of the chosen respondents and how many of them were chosen. Questions that picked

more respondents, and with a higher death rate were designated as more "efficient" than

questions that picked fewer respondents or respondents with a lower death rate, as defined

formally below. Then, to search across the large pool of combinations of four questions the

following algorithm was used: 14

(1) The computer started by randomly generating a set of questions from the baseline data in the form of binary splits. A lower limit on the number of respondents (dead or alive) chosen by these questions was enforced, say 25. The death rate of these respondents was calculated, and the efficiency was computed.

(2) Next, one question was picked at random and dropped. In its place, a new random question was introduced. This question was found by choosing a variable at random, and then choosing a split point at random from all

question, such as "Is X E {I ,3}". This same split can be achieved with".,,/>" style questions using the ORoperator, as done below, but this also increases the number of potential splits vastly.

14 The idea for this type of random search was inspired by evolutionary or genetic algorithms (see Goldberg (1989); Davis (1987); Holland (1975)); however, it is a much simpler form of search, not a true genetic algorithm (which did not seem to be necessary for the model sizes used here). The C code is provided in Appendix V.


possible variable values. The death rate of the respondents chosen by this new set of questions was calculated, and the efficiency was computed. Again, it was ensured that the number of chosen respondents was above the floor of 25.

(3) If the new set of questions was more efficient than the old questions, the computer kept the new set of questions. If the new set was less efficient, then the computer kept the old set of questions. The computer then returned to step (2) and repeated this process until no further improvement could be achieved by replacing any single question.
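To make the loop concrete, here is a minimal C sketch of the one-step RSA. It is not the author's program (the actual code appears in Appendix V): the data layout, the toy efficiency criterion (deaths captured, subject to the floor of 25 chosen respondents), the fixed iteration budget standing in for the absorption-point check, and all identifiers are illustrative assumptions, and the arrays would have to be filled with the EPESE data before a search produced anything meaningful.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_RESP     6862   /* learning-set respondents (assumed size)      */
#define N_VARS      164   /* predictor variables from the baseline survey */
#define N_QUEST       4   /* questions per set (fixed, as in the text)    */
#define MIN_CHOSEN   25   /* floor on the number of respondents chosen    */

typedef struct { int var; double cut; int greater; } Question;

static double X[N_RESP][N_VARS];  /* survey responses (must be loaded)    */
static int    died[N_RESP];       /* 1 = died within three years          */

/* Does respondent i give the "bold" answer to question q? */
static int answers_bold(const Question *q, int i)
{
    double x = X[i][q->var];
    return q->greater ? (x > q->cut) : (x <= q->cut);
}

/* Toy efficiency criterion: number of deaths captured by the question
 * set, or -1 if fewer than MIN_CHOSEN respondents are chosen.  The
 * dissertation's actual criterion is defined formally below.            */
static double efficiency(const Question set[N_QUEST])
{
    int chosen = 0, deaths = 0;
    for (int i = 0; i < N_RESP; i++) {
        int all_bold = 1;
        for (int j = 0; j < N_QUEST && all_bold; j++)
            all_bold = answers_bold(&set[j], i);
        if (all_bold) { chosen++; deaths += died[i]; }
    }
    return (chosen < MIN_CHOSEN) ? -1.0 : (double)deaths;
}

/* Draw a random binary split: a random variable, cut at an observed value. */
static Question random_question(void)
{
    Question q;
    q.var     = rand() % N_VARS;
    q.cut     = X[rand() % N_RESP][q.var];
    q.greater = rand() % 2;
    return q;
}

int main(void)
{
    srand((unsigned) time(NULL));   /* a new seed gives a new search path */

    /* Step (1): a random starting set meeting the floor on chosen respondents. */
    Question set[N_QUEST];
    double best;
    do {
        for (int j = 0; j < N_QUEST; j++) set[j] = random_question();
        best = efficiency(set);
    } while (best < 0.0);

    /* Steps (2) and (3): random single-question replacement.  A fixed
     * iteration budget stands in for the periodic exhaustive check that
     * declares an absorption point in the actual algorithm.             */
    for (long iter = 0; iter < 100000; iter++) {
        int j = rand() % N_QUEST;       /* drop one question at random   */
        Question old = set[j];
        set[j] = random_question();     /* ...and propose a replacement  */
        double e = efficiency(set);
        if (e > best) best = e;         /* keep the improvement          */
        else          set[j] = old;     /* otherwise restore the old one */
    }

    printf("best efficiency found: %.0f deaths captured\n", best);
    return 0;
}

Running such a program repeatedly with different seeds and keeping the best absorption point found corresponds to the repeated random search, RRSA(N), described below.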

It was also possible, as mentioned, to combine sets of questions using the OR operator.

(Again, such a set of questions works with Boolean logic: a respondent is chosen if all the

answers to any one of the four sets are all in bold, e.g., "Are the answers to A, B, C AND D

bold?" OR "Are E, F, G AND E bold?", and so on.) Here, the computer would start by

randomly generating a "full" model. For example, a full model would typically consist of

four subsets of four questions each, for 16 total questions. Then the above process of random

replacement simply cycled from subset to subset, replacing a question each time to check for

improvement in the efficiency of the set of all 16 questions. This search algorithm was

always run until no further improvement in the model could be found by replacing any single

question. To be sure this was true, the computer periodically checked the model by

exhaustively replacing each question in the set with each of all the 3,248 possible questions

one at a time. If the computer found that no single change in the model could improve its

efficiency, the question set was labeled an absorption point; otherwise, the computer

continued its search undisturbed until it reached an absorption point (which was always

observed to happen eventually). The above algorithm for finding such an absorption point

was called the random search algorithm, or RSA.

Note that because the starting point for RSA is random, if one executes the program


more than once, a different initial starting point will be used, and different questions will be

randomly generated as replacements. 15 As a result, the algorithm would usually end at a

different absorption point with a different (perhaps higher) level of efficiency. Thus, this one

step RSA method is greedy in the sense that if it is executed only once, it can latch onto a

local maximum, finding no further improvement even if better absorption points exist; in

fact, if the number of questions in the set was at all large, this was typically the case. Chapter

4 shows this explicitly by searching exhaustively over models with only two questions.

As a result it was necessary to run the RSA many times over using different seeds for

the random number generator, so that many different absorption points could be observed.

From this large pool of question sets, the model with the lowest error was selected out. This

algorithm was labeled the repeated random search algorithm (denoted RRSA(N), where N

is the number of independent runs of the RSA). The RRSA(N) was always run for N ≥ 100,

and in some cases for N > 1,000. For Set B in Appendix I, RRSA(2,000) was used. (This

was done merely to evaluate the performance of the algorithm. The result of RRSA(100) has

a greater than 99% chance of achieving the same result as RRSA(2,000) for that model

structure). A single RSA search would usually take more than 10,000 iterations for a

moderately sized model (e.g., 10 questions). Thus, at least one million total passes through

the dataset were always made before the most efficient set of

15 Of course, "random" actually implies that the computer reads an internal clock for a seed togenerate pseudo-random numbers (not to be confused with the seed model defined in Chapter 4). This is tosay, the computer must be instructed to use a new seed in each run if "independent" output is desired frommultiple runs of the program. Alternatively, "deterministic" results could be achieved by using the sameseed repeatedly.


questions was selected.16 One million may be a large number of passes, but compared with

the number of potential question sets, it is minute.

Thus no guarantee exists that this search procedure will find the model with the lowest

possible error for any given model structure. To the contrary, if the starting question set is

made large enough (20 to 30 questions), the RRSA(100) method will usually not find the

global maximum. For smaller model sizes (less than 10 questions), finding a maximum with

RRSA(N) seems possible for an N of about 100 or more. Unfortunately, it is not clear how

to prove any given set is a maximum. Only for very small models (two questions) could it

be proved that RRSA(N) does find the global maximum for a reasonably sized N, since

exhaustive searches to find the true maximum were then possible. Applications of the

algorithm to simulated datasets also proved that the algorithm could find a global maximum

in some cases (as described in Chapter 4).

However, while the global maximum is of interest, it is the primary goal of this

research simply to find the most efficient absorption points it can, since the chosen

respondents still exhibit very high death rates. The reasons for not focusing strictly on a

search for a global maximum are that it may well be nearly impossible to find a maximum

and to prove that the given absorption point is such a maximum. This is due mainly to the

extremely large number of potential sets of questions for models with more than about ten

variables. Moreover, it is certainly not clear how to go about doing so if it were possible.

Even the basic RSA method described above (requiring about 80 lines of simple C code) uses

considerable computing time when applied to a large dataset; a more complicated algorithm

16 This would typically take more than an hour of CPU time on a Cray C-90 supercomputer, depending on the particulars of the search.


would have to provide much greater searching efficiency to justify the lengthier code

(although some quite beneficial modifications were eventually discovered, as described in

Chapter 4). The search procedure also has an advantage by its simplicity, as it is intuitive,

easy to describe, and easy to carry out. Most important, the method seems to work, and the

results below suggest that it provides all the predictive power of more conventional

predictive models, if not more (see Chapter 5).

1.5 The test set method

At this point, many readers will recognize that the above algorithm may have the

tendency to capitalize on "chance" variation. Chance error is precisely why the questions

found using the learning set data are checked using the test set respondents. To see vividly

the sort of error that requires that a test set be used to judge the accuracy of the questions,

consider the scatterplot shown in Figure 1.1 (constructed with a simple random sample of

several thousand EPESE respondents). In this graph, every respondent who died within three

years of baseline is represented with an asterisk (*), and every survivor is shown by a dash

(-). Respondents are plotted against the x-axis according to systolic blood pressure, and

against the y-axis according to body mass, or weight in pounds divided by height in inches.

Notice that the graph is separated into two regions. The smaller region is delineated by two

yes/no questions: "Is systolic blood pressure;:>: 190?" and "Is body mass < 2.6?"; in this

region of the graph, only respondents who answer "yes" to these questions are plotted. The

death rate in this region is 25%, while the death rate for respondents outside the region is

13%. In this way, a series of yes/no questions can divide the sample space into high and low

risk regions.


Figure 1.1 - Body mass index (weight/height) by systolic blood pressure
Percentages give % who died in three years (* is dead, - is alive)

[Scatterplot not reproducible in text: body mass index (lbs/inch) on the vertical axis against systolic blood pressure (mm of Hg, roughly 100 to 250) on the horizontal axis. The region with systolic blood pressure ≥ 190 and body mass index < 2.6 is marked "25% died"; the remainder of the plot is marked "13% died".]

Source: EPESE New Haven, E. Boston and Iowa surveys (partial sample)


Notice, however, that if no constraints are placed on the search for questions, picking

out regions that have very high rates of death due largely to chance variation is easy. For

example, the pair of questions, "Is systolic blood pressure < 80?" and "Is body mass> 2?",

would isolate a single dead respondent on the far left side of the graph, for a death rate of

100%. Suppose the "true" death rates are defined as those experienced by the general U.S.

population of elderly. It is unlikely that the true death rate for all persons in this region of

the plot is this high, and in fact it is not clear from this scatterplot whether it is at all high.

The problem is that the number of respondents in that area of the variable space is so small

that it becomes very easy to find regions with decedents unfairly, or "after the fact". (This

can be directly confirmed by examining the rest of the EPESE respondents, as Figure 1.1

uses only a small sample.) This is the problem described above as the "capitalization on

chance error". Strictly speaking, the element of chance error is typically thought of as

resulting from sampling variation when persons are chosen through a small probability

sample (as with the probability model formally defined in Chapter 2). Unfortunately, only

half the EPESE respondents (the New Haven and North Carolina respondents) were chosen

via probability samples, so this notion of chance is not so clearly defined. However, if an

explicit stochastic structure is sought, one might also think of mortality itself as a chance

process. This is frequently assumed after conditioning on a given set of variables (as in the

construction of age and sex-specific life tables, for example).

It is also possible for one to choose such a large number of questions that even if the

isolated set of respondents is larger than one respondent, many questions are still chosen


because of chance variation. To see this with Figure 1.1, imagine many variously sized

rectangles, all shaped in a way that includes many of the oddly-positioned deaths on the fringes

of the scatterplot, but without containing many survivors. The more such questions are

allowed, the easier it becomes to find such groups of rectangles that isolate substantial

numbers of deaths, yet the actual predictive power of such partitions is clearly suspect. One

can eliminate much of this problem by placing a hard lower limit on the number of

respondents that any one rectangle can isolate (say, 50 respondents, as is done here), yet it

is still possible for certain questions to be found purely because of chance variation. This

issue is examined in more detail and with more rigor in Chapter 2. In short, if too many

questions are found which choose too few respondents, the true accuracy of the preferred

questions suffers. Thus, it is necessary to limit the number of questions asked in some way.

However, the size of the question sets should not be constrained to be so small as to sacrifice

predictive accuracy by restricting the search too severely. The central problem is that of the

bias-variance tradeoff. To discuss this issue more meaningfully, it is necessary to define

exactly what is meant by predictive accuracy, or "efficiency".

To do so, notice that a set of questions that identifies a subset of test set respondents

who have death rates of 75% or higher is still not very "efficient" in some sense if it

picks out only a few deaths.17 Thus, it is useful to discuss a criterion that somehow

measures the efficiency of a particular set of questions. Misclassification error is defined

as the number of survivors in the subset of respondents chosen by the questions plus the

number of deaths in the subset of respondents not chosen by the questions, divided by the

17 These characteristics are often referred to as "sensitivity" and "specificity", as explained below (also, see Ortiz et al. (1995); Thompson and Zucchini (1989)).


total number of respondents.18 For example, in relation to Figure 1.1 and the two

corresponding questions, the misclassification error would be the number of -'s in the lower

right-hand region plus the number of *'s in the upper left-hand region, divided by the total

number of points plotted.
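Restated in symbols (with FP the number of survivors chosen by the questions, FN the number of deaths not chosen, and N the total number of test set respondents), the definition just given is:

    \[
      \text{misclassification error} \;=\; \frac{FP + FN}{N}.
    \]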

Table 1.1 - Predicted outcome by true outcome for Question Set A of Appendix I as applied to a test set (N = 3,432)

                                TRUE OUTCOME
                            Survival       Death
PREDICTED     Survival         2,035         172
OUTCOME       Death              888         337
              TOTAL            2,923         509

The term "misclassification error" is used because the chosen respondents are

essentially being classified or predicted as dead by the questions, while the ignored

respondents are classified as survivors. The error is in the deaths ignored by the questions,

and the survivors who were chosen (i.e., those respondents who were misclassified). This

can be seen directly in a simple 2x2 cross-tabulation of the predicted outcomes against the

true outcomes, as is shown in Table 1.1 for a typical set of questions (Question Set A, listed

in Appendix I). The misclassification error here is simply the number of misclassified

18 This misclassification error is exactly the same criterion by which CART judges its classification trees in the context of the two-class problem with unit misclassification costs. See Breiman et al. (1984). It is also sometimes convenient to implement a set of misclassification costs, such that the search may be more sensitive to picking up deaths at the cost of a lower death rate in the chosen respondents; this is explored below. By convention, the criterion by which the overall accuracy of any predictive classifier can be judged, which holds constant across varying levels of sensitivity and specificity, is the area under the receiver operating characteristic curve (see Swets and Pickett (1982); Thompson and Zucchini (1989)). An estimate of this area is presented below, and the fitting of the ROC curve is explained in detail in Chapter 2.


respondents (172 + 888) divided by the test set N, or 1,060/3,432 = 31%. Note that those

sets of questions that isolate many deaths are rewarded more than questions that ignore many

deaths (holding the death rates of the chosen and unchosen persons constant). Thus, the more

"efficient" sets of questions result in a lower misclassification error. Note also that if the

number of respondents to be chosen by the questions is fixed, the goal of reducing

misclassification error is always equivalent to the goal of maximizing the death rate of the

chosen respondents.
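A minimal arithmetic check of these quantities, written as plain Python using only the four counts in Table 1.1, is given below; it reproduces the 31% error computed above.

    # Counts from Table 1.1 (Question Set A applied to the test set).
    tp, fn = 337, 172    # deaths predicted as dead / deaths predicted as survivors
    fp, tn = 888, 2035   # survivors predicted as dead / survivors predicted as survivors
    n = tp + fn + fp + tn                       # 3,432 test set respondents

    misclassification_error = (fp + fn) / n     # (888 + 172) / 3,432, about 0.31
    death_rate_among_chosen = tp / (tp + fp)    # 337 / 1,225, about 0.28 among respondents classified as dead
    print(round(misclassification_error, 2), round(death_rate_among_chosen, 2))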

Now the statistical goal of the dissertation can be stated explicitly in terms of the

misclassification criterion. The purpose of the method developed in this research is to find

a set of questions, combined with the AND and OR operators, which yield a reasonably low

test set misclassification error for the prediction of mortality or morbidity. Thus it is simply

a search for the questions that, when applied to the test set, correctly choose the most persons

who die while simultaneously choosing the fewest survivors, and likewise for the occurrence

of heart attacks, strokes, and cancer. This is the ultimate objective of all searches, and the

resulting questions are always judged by how well this error is minimized (but not according

to whether the absolute, lowest possible minimum error truly was achieved).

Exactly how many is "most", or "fewest"? To answer this question, a slight

modification to the above definition of misclassification error is also useful: a cost-adjusted

misclassification error can be computed. To see why this is useful, consider Table 1.1 again;

notice that a lower misclassification error may be achieved simply by classifying all

respondents as survivors, since then the only misclassified respondents are the deceased,

resulting in an error of 509/3,432 = 15%. This is half the size of the error for Set A. By this


criterion, we should choose to ask no questions, and ignore the deceased completely! In fact,

it is easily seen that the only sets of questions that result in a lower error than the strategy of

ignoring the deaths are those sets that isolate respondents having a death rate higher than

50%.19 These sets can be found, but the questions only account for a few deaths, and we are

interested in identifying a much larger proportion of the deaths even if the chosen

respondents do not have death rates quite so high. In other words, we may be willing to

allow more than one misclassified survivor for each death we predict successfully. Thus, we

can define the cost of misclassifying a death as survival to be greater than the cost of

misclassifying a survivor as dead. For example, suppose the relative cost of misclassifying

a death is defined to be five times that of misclassifying a survivor; then the numerator for

the cost-adjusted misclassification error associated with Table 1.1 is equal to 5 × 172 + 888

= 1,748. In comparison, the numerator for the error associated with the ignorant strategy of

classifying all respondents as survivors is now 5 × 509 = 2,545, so asking the questions is an

improvement by this criterion since doing so yields a 31% reduction in the cost-adjusted

error rate (equal to (2,545 - 1,748)/2,545).
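In symbols, with c the relative cost of misclassifying a death as survival and FP, FN and N as before, the cost-adjusted error and the comparison just worked out are:

    \[
      \text{cost-adjusted error} \;=\; \frac{c \cdot FN + FP}{N},
      \qquad
      \frac{5 \times 172 + 888}{3{,}432} \;<\; \frac{5 \times 509}{3{,}432}.
    \]

Equivalently, classifying a region of the variable space as "dead" lowers this error only when the region's death rate exceeds 1/(1 + c); with unit costs (c = 1) this reduces to the 50% threshold noted above.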

As a result of increasing this relative cost, the questions designed to minimize the

adjusted error choose subsets of respondents who experience death rates lower than 50%, but

they also catch a higher proportion of the deaths. This tradeoff, as referred to above, is a

balance between specificity and sensitivity. Specificity is simply the proportion of survivors

correctly classified as survivors. For the questions summarized in Table 1.1, this is equal to

2,035/2,923 = 70%; this quantity usually rises and falls with the death rate of the chosen

19 This is a direct result of Bayes' rule, defined in Chapter 2.


respondents. Sensitivity is the proportion of deaths correctly classified as deaths, equal to

337/509 = 66% in Table 1.1, and tends to move inversely with specificity. That is to

say, it is possible to identify a small fraction of the unhealthy elderly who are at extremely

high risk, or to catch a larger proportion of elderly who have a somewhat lower (but still

relatively high) risk of death or illness, but it is not yet possible to identify the vast majority

of deaths while simultaneously achieving very high accuracy. The purpose of using a

cost-adjusted misclassification error is to explore the range of possible combinations of sensitivity

and specificity in this tradeoff. This is done by building multiple, separate sets of questions,

each built in an attempt to reduce cost-adjusted misclassification error, but with a different

relative cost for each set. In this way, the researcher is provided with a wide range of models

with varying levels of sensitivity and specificity.

1.6 Use of the test set to determine the number of questions

Consider the case of question Set B in Appendix I. How was this question set found?

First, the search algorithm of random replacement (called the RSA above) was employed on

the learning set data to build a combination of four subsets (combined with OR) of four

questions each (combined with AND), for 16 total questions. The search was run

independently 100 times, and the lone combination of questions with the lowest observed

misclassification error was chosen as the full model (defining the RRSA(100) method). A

backward deletion process was then applied to these questions. Each of the 16 questions

(and each of the four sets as a whole) was temporarily dropped from the model, and the

misclassification error was recomputed in its absence. The question (or set of questions)

which yielded the smallest increase in error when temporarily dropped was dropped from the


model permanently, resulting in a set of 15 (or 12) questions. The deletion process was then

repeated on this submodel to obtain a model of size 14 (or 11, etc.). This model was itself

subject to deletion and so on until no questions remained. This resulted in a sequence of up

to 16 nested models, each containing a unique number of questions, and each a subset of the

next larger set of questions. 20
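The shape of this procedure can be sketched as follows (hypothetical Python, not the dissertation's actual code): a model is represented as a list of AND-subsets joined by OR, and the greedy loop repeatedly drops whichever single question, or whole subset, increases the learning set error least. The error() evaluator, the question representation and the data structures are assumptions made for illustration.

    import copy

    def chooses(model, respondent):
        """A model is a list of subsets; each subset is a list of yes/no questions
        (predicates). Subsets are joined with OR, questions within a subset with AND."""
        return any(all(q(respondent) for q in subset) for subset in model if subset)

    def error(model, data, cost=3.5):
        """Cost-adjusted misclassification error on (respondent, died) pairs."""
        total = 0.0
        for r, died in data:
            predicted_dead = chooses(model, r)
            if died and not predicted_dead:
                total += cost          # a death misclassified as survival
            elif not died and predicted_dead:
                total += 1.0           # a survivor misclassified as dead
        return total / len(data)

    def backward_deletion(model, data, cost=3.5):
        """Greedy backward deletion: return the nested sequence of submodels."""
        sequence = [copy.deepcopy(model)]
        while any(model):                          # stop when no questions remain
            candidates = []
            for i, subset in enumerate(model):
                if not subset:
                    continue
                for j in range(len(subset)):       # drop one question at a time...
                    trial = copy.deepcopy(model)
                    del trial[i][j]
                    candidates.append(trial)
                trial = copy.deepcopy(model)       # ...or drop the whole subset
                trial[i] = []
                candidates.append(trial)
            model = min(candidates, key=lambda m: error(m, data, cost))
            sequence.append(copy.deepcopy(model))
        return sequence

Each submodel in the returned sequence is then scored once on the test set, and the size with the lowest test set error is kept, as described next.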

Next, this sequence of models was applied to the test set, and the misclassification

error of each model was estimated on this new data. Figure 1.2 shows a plot of

misclassification error as estimated on both the learning set and the test set for each set of

questions in the overall sequence of models. The shape of the learning set error curve is

much like that which would result from a sequence of regression models of increasing size.

That is, obtaining a slightly lower error by using a larger set of questions was always

possible, just as increasing R2 by adding another coefficient to the regression equation is

always possible (provided the model is not completely saturated). However, when the larger

sets of questions were applied to the test set, they had slightly less predictive power than

more moderately sized sets of questions. This is because when more than seven questions

were found with the learning set, these additional questions began to reflect the random

variation in the learning set data. This occurred in much the same way that multiple

questions could be used to partition Figure 1.1 unfairly, as suggested above. Yet when fewer

than seven questions were used, it seemed that the resulting partition was not fine enough to

capture all of the truly high-risk regions adequately. Thus, if the test set estimates accurately

20 This backwards deletion process is almost exactly equivalent to CART's method of "pruning" the full-sized tree to obtain a nested sequence of subtrees, which are then applied to the test set to determine the best-sized tree (see Breiman et al. (1984)).


Figure 1.2 - Misclassification error by number of questions asked

[Plot: misclassification error (y-axis) against the number of questions in the model (x-axis), with one curve estimated on the learning set and one on the test set; the test set minimum is marked.]


measure the predictive power of the questions, a set of about seven questions provides the

optimal tradeoff between the lack of fit and the fitting of chance variation. This result was

for a relative misclassification cost of 3.5; a larger set of 10 questions seemed better when

the cost was fixed at five. In fact, it was found that when starting with the 4x4 model

structure, the best number of questions invariably ranged between six and ten. Moreover,

these were always arranged in 3-5 subsets of questions combined by AND which were 1-3

questions in size. Once this preferred model size was chosen, the test set and learning set

respondents were recombined into one learning set. Finally, the repeated search algorithm

RRSA(N) was run again using the full dataset in an attempt to find the best set of seven

questions.

This set of seven questions, then, was designated as the preferred set for the relative

cost of 3.5. Now consider again the issue of the honesty of the prediction error for these

seven questions as it was estimated by the test set. Ideally, the test set would be a truly

independent, unobserved sample of data on which to test the accuracy of

the predictions. Unfortunately, in this research the test set was not kept completely blind to

the researcher. One problem was that the test set was already examined by the author before

the above algorithm was fully developed. In addition, the author had had previous

contact with the dataset in other analyses before the dissertation was conceived. Thus, it was

possible that any knowledge of which variables or transformations of variables yielded high

predictive power could have informed the model-building process, invalidating the honesty

of the test set. Several precautions were taken to mitigate the unfair advantage this might

have granted to the estimation of predictive accuracy.


First, the input to the algorithm was essentially the raw dataset itself. Almost no

variables were preselected out of the analysis; the sole exceptions were a handful of items

that were sample-specific (e.g., the respondent ID, the sample region) and some questions

that were unusable because they were not consistently phrased in all three surveys (consisting

mostly of the items on emotional status). These were selected out well before the analysis

began, and none of the remaining variables were dropped from the analysis at any point.

Also, no respondents were dropped from the analysis.21 Nor was any variable recoded from

its original form on the supplied tape or processed in any way at any point in the analysis

(except that missing values were assigned the value "999" at the start).22 Secondly, the

algorithm always chose the full question set by a systematically random process specified

completely by the simple three-step process above (see Chapter 4 for a formal definition).

That is, the initial set of questions was generated completely at random, and the replacement

of questions was also driven entirely randomly. Specifically, all variables had the same

probability of being selected into the initial model set. Also, within each replacement (steps

(2) and (3) above), all the questions in the model had the same probability of being replaced.

All variables had the same probability of being chosen to replace the dropped question, and

all possible splits for the replacement variable had an equal probability of being chosen.

Thus, knowledge about which variables or transformations of variables yielded predictive

power could not have directly influenced which questions were selected into the [mal

21 The exception to this rule was in the analysis of cancer, where respondents known to have cancer at baseline were removed from the analysis because the determination of new cancers after baseline was ambiguous for these persons.

22 One variable was added to the dataset: a measure of body mass, equal to the ratio of weight in pounds to height in inches. However, this variable was created well before any analysis began.


models.
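To make the mechanical nature of this selection concrete, the following is a hypothetical sketch of a single replacement proposal of the kind described above (the formal RSA is defined in Chapter 4; only the uniform-randomness properties asserted in this paragraph are encoded here, and the data structures, names and the separation of proposal from acceptance are assumptions made for illustration).

    import random

    def propose_replacement(model, variables, splits):
        """Propose one random replacement for a question set.

        model:     list of (variable, split) pairs representing the current questions
        variables: list of candidate variable names, all equally likely to be drawn
        splits:    dict mapping each variable to its list of possible split values
        Whether the proposal is accepted is decided by the error comparison in the
        three-step process described in the text; that rule is not reproduced here."""
        position = random.randrange(len(model))   # every question equally likely to be dropped
        variable = random.choice(variables)       # every variable equally likely to replace it
        split = random.choice(splits[variable])   # every split of that variable equally likely
        candidate = list(model)
        candidate[position] = (variable, split)
        return candidate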

It is more likely that if an unfair advantage has been gained, it was attained by sheer

familiarity with the data or some more heuristic type of knowledge that could have

influenced the model building strategies themselves. For example, it was thought that such

a splitting process was feasible due to initial runs of the CART algorithm on the data. This

is one reason to consider the true validation of these models with a completely independent

sample (see Chapter 5).

The second problem is that it is necessary to examine the test set more than once to

determine the best number of questions to ask. These multiple examinations of the test set

can also result in an optimistic assessment of the accuracy of the model if the same test set

is also used to estimate prediction error. This is partly why the method of model selection

by backward deletion outlined above (and described more thoroughly in Chapter 4) mimics

the extensively-simulated "pruning" method developed for classification trees.

However, the area of the curve near the minimum test set error in Figure 1.2 appears

quite stable (largely because of the large number of respondents). Sets of questions of size

six and eight also have very low error rates, and this stability in the error-by-model-size curve

was typical for all the model sequences examined below. Sets of questions at or just above

the right size consistently predicted at a comparable level of accuracy on the test set; thus,

the test set error curve was usually quite well defined at this point. Moreover, extensive

computer simulations of the backward-deletion method of model-selection and the test set

estimation of prediction error on nested sequences of partitions found by the CART method

(called "pruning") have found that in practice the downward bias in the test set estimates of


prediction error is small if N is large and the test set is used to select from a small sequence

of nested models built on the learning set (and 16 is probably a small enough sequence).23

Since the partitions found by the random replacement search are nearly identical in structure

to the type of partitions formed by CART, and since the method of backward deletion is

nearly identical to CART's pruning method, it is hoped that the estimates of prediction error

derived from the test set for the models in this dissertation are also nearly unbiased.24

Chapter 2 presents a somewhat more rigorous (but still heuristic) argument along these lines,

and Chapter 5 presents the results of the validation with an independent sample.

1.7 Results: Mortality

The above RRSA(100) method was used to find "full" models on the learning set for

three different misclassification costs: 1.5, 3.5 and 5. These models were then subjected to

backward deletion, and the resulting sequences of submodels were applied to the test set.

Those models with the lowest error at each cost were then selected out, yielding three models

consisting of seven, seven and ten questions respectively (results are presented in Table 1.2).

The test set performance of each preferred set of questions for each cost is summarized on

each row of the table, catalogued by capital letters, and the corresponding questions and

23 See Breiman et al. (1984).

24 The only difference in the structure of partitions chosen by the two methods is that subsets of respondents isolated by the sets of questions here were not completely disjoint (but mostly disjoint). In partitions built by CART, all terminal nodes of a tree identify completely disjoint sets of respondents. The only difference in the method of backward deletion is in CART's use of a cost-complexity parameter to index the "size" of the tree. Here, simply the number of questions is used, which is equivalent to the number of terminal nodes in a tree. The cost-complexity parameter is a linear combination of the number of terminal nodes in a tree and the cost-adjusted learning set misclassification error of the tree. But the backward-deletion method used here is more basic than CART's more thorough method, and a more basic index is used. More complicated methods of backwards deletion (e.g., by using a cost-complexity parameter) are explored in Chapter 4.


Table 1.2 - Deaths predicted correctly and survivors predicted incorrectly as dead
(in a test set of 2,923 survivors and 509 deaths three years after baseline)

Set(1)  Cost(2)  Reduction in      Deaths         Sensitivity(5)  Survivors       Specificity(7)  Death rate(8)  Death rate /
                 prediction        predicted(4)                   predicted                       among          rate predicted
                 error(3)          correctly                      incorrectly(6)                  predicted      by age, sex
                                                                                                  deaths
A       5        31%               337            66%             888             70%             28%            1.39
B       3.5      17%               209            41%             348             88%             38%            1.83
C       1.5      5%                113            22%             132             96%             46%            2.12

1. The letter of each set refers to the letters of the sets of questions listed in Appendix I.
2. The ratio of the cost of misclassifying a death as survival to the cost of misclassifying a survival as death.
3. The percentage decrease in the cost-adjusted misclassification error rate compared to the error of classifying all respondents as survivors.
4. The number of deaths in the test set correctly classified as dead by the set of questions.
5. The number of deaths correctly predicted divided by the total number of deaths in the test set (509), also called the true positive fraction or TPF.
6. The number of survivors in the test set incorrectly classified as dead by the questions (i.e., false positives).
7. The proportion of all survivors (2,923) correctly classified as survivors by the questions.
8. The proportion of deaths among the respondents predicted as dead. In the column to the right, this rate is divided by the death rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents chosen by the questions) would have suffered.

Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.


answers are listed in Appendix I. The third column gives the percent reduction in prediction

error achieved by the question sets as compared with the ignorant strategy of classifying all

deaths as survivors (as calculated above). This quantity is mathematically equivalent to the

R2 statistic commonly used for regression models where the weighted total sum of squares

is defined about the ignorant strategy instead of the mean (i.e., Ȳ ≡ 0, since one cannot be

classified as 14% dead). However, note that if all the deaths are weighted (by the relative

cost of misclassification here), this measure of accuracy is very sensitive to changes in cost.

That is, it would be possible to achieve a 100% reduction in error simply by defining a very

high relative cost and classifying all respondents as deaths (clearly an inaccurate strategy).

Likewise, one can obtain 0% reduction in error by defining a relative cost near zero even if

the predictor catches 100% of deaths with no false positives! So this is not an objective

method for comparing the accuracy of models without regard to cost.

This is why it is more informative to consider the sensitivity and specificity for any

one predictor. Again, the sensitivity of the questions is the proportion of deaths that are

correctly classified as deaths by the questions, and the specificity is the proportion of

survivors who are correctly classified as survivors (i.e., not chosen by the questions). Thus

the sensitivity of Set A is 337/509 = 66% and the specificity is (2923 - 888)/2923, or 70%.

The death rate in the table is simply the number of true deaths in the chosen (predicted as

dead) respondents divided by the total number of respondents classified as dead. The last

column presents the ratio of this observed raw death rate to the age- and sex-adjusted rate

predicted solely from the age and sex distribution of the chosen respondents. If the

respondents chosen by Set A had been subjected to the sample-average mortality risks for


their ages and sexes, we would have expected 19.7% of them to die.25 However, it was

observed that 27.5% died, so this ratio is equal to 0.275/0.197, or 1.39. Set A, a total of only

ten questions, is the most "sensitive" set of questions presented; that is, it correctly predicts

the most deaths (two-thirds of all deaths, in fact). Of course, it would have been possible to

build a more sensitive set of questions by using a higher relative misclassification cost.

However, since Set A misclassified as many as 30% of survivors as dead, it was thought that

any higher proportion of false positives would be unacceptable.26
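The age/sex comparison in the last column of Table 1.2 can be written schematically as below (a sketch assuming that age- and sex-specific three-year death rates have already been estimated from the full sample, as described in footnote 25; the function and field names are illustrative, not EPESE variable names).

    def expected_death_rate(chosen, rates):
        """chosen: respondents selected by a question set, each with 'age' and 'sex' keys.
        rates:  dict mapping (age, sex) to the sample-wide three-year death rate.
        Returns the death rate the chosen group would suffer under sample-average risks."""
        return sum(rates[(r["age"], r["sex"])] for r in chosen) / len(chosen)

    # For Set A the expected rate was 19.7% against an observed 27.5%,
    # giving the ratio 0.275 / 0.197 = 1.39 reported in Table 1.2.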

When the cost was lowered to 3.5, the predictions decreased in sensitivity and

increased in specificity, as expected. Set B correctly predicted only 41% of all deaths, but

88% of survivors were also correctly classified, a significant rate of success for seven simple

questions. About 38% of the respondents chosen by these questions died, a death rate 1.83

times as high as that predicted by the age and sex distribution of the respondents. When the

cost of misclassification was further lowered to 1.5 (Set C), the death rate of the predicted

dead increased to 46%, nearly half the 245 chosen respondents! This rate was more than

twice that predicted by age and sex, partly representing the fact that no questions about age

are included in Set C.

The consistently high levels of specificity suggest that the high death rates

experienced by the chosen subsets of learning set respondents are not likely to be due entirely

25 Specifically, the death rate is compared to the death rate a randomly chosen group of elderly would experience with the same age and sex composition. First, age- and sex-specific death rates were estimated from all 10,294 respondents in the sample. Then each respondent chosen by the questions was hypothetically subjected to the sample-wide mortality risks they would have experienced according to their age and sex. This gives the death rate of the chosen respondents predicted only by age and sex.

26 Although if one turns the question around, so that the goal is to predict survival by isolating groups of respondents with low death rates, the present method is an equivalent way to approach the problem - simply specify a very high cost of relative misclassification.


to chance variation. While the highest level of sensitivity is only 66% with a specificity of

70% (which admittedly may be a slightly optimistic estimate), this level of accuracy is still

quite remarkable for a predictor that requires nothing more than asking a handful of simple

questions. It may also be quite useful to identify the 30% of survivors who did not die, as

they may still be at very high risk of death beyond the three years offollow-up.27 Moreover,

the questions apply to elderly of all ages and both sexes, and as the preferred questions

demonstrate, the method deals effectively with missing data.

Finally, it is informative to calculate one additional statistic that is conventionally

used for assessing the overall accuracy of a predictive method without regard to specific

combinations of specificity and sensitivity. This is the area under the receiver operating

characteristic (ROC) curve, also known as the c statistic. Based on the sensitivity and

specificity levels of the sets of questions A through C, a simple curve can be plotted as in

Figure 1.3. The proportion of deaths predicted correctly (the sensitivity, or true positive

fraction) is plotted on the y-axis, and the proportion of survivors predicted incorrectly as

deaths (the false positive fraction, equal to 1 - specificity) is plotted on the x-axis. The three

dots labeled A, B and C correspond to the question sets, and a curve is neatly fitted to these

points. Note that for a method of prediction equivalent to random guessing, this curve would

follow the 45° line, and the area under it would be 50%. If the method of prediction was

perfectly accurate, implying 100% sensitivity with a 0% false positive fraction, the curve

would "fill up" the plot (i.e. it would merge with the y-axis and the line at y = 1), and the

area would be equal to 100%. The area under the ROC curve fitted to sets A, B and C was

27 Efforts are now underway to apply these questions to later waves of the EPESE data for further validation.


Figure 1.3 - True positive fraction by false positive fraction (ROC curve) for questions predicting deaths in the test set.
[Plot: the fitted curve passes through the points for question sets A, B and C and lies above the 45° diagonal; area under the curve = 74.4%. x-axis: false positive fraction (the proportion of survivors predicted incorrectly as dead); y-axis: true positive fraction (the proportion of deaths predicted correctly).]
Note: The test set consisted of 2,923 survivors and 509 deaths. Letters (A, B, C) correspond to the question sets in Appendix I.
Source: New Haven, East Boston and Iowa County EPESE


estimated at 74.4% ± 1.2%, about halfway between these two extremes.28 This gives a good

intuitive feel for the accuracy of the overall method without regard to costs or specific levels

of sensitivity and specificity.
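As a rough numerical check, the sketch below joins the three observed operating points with straight lines and applies the trapezoidal rule; this is not the rotated-parabola fit of footnote 28, so it gives a slightly smaller area. The coordinates are taken directly from Table 1.2.

    def roc_area(points):
        """Trapezoidal area under an ROC curve through (0, 0), the given
        (false positive fraction, true positive fraction) points, and (1, 1)."""
        pts = sorted([(0.0, 0.0)] + points + [(1.0, 1.0)])
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

    # Operating points for Sets C, B and A (FPF = 1 - specificity, TPF = sensitivity).
    operating_points = [(132 / 2923, 113 / 509), (348 / 2923, 209 / 509), (888 / 2923, 337 / 509)]
    print(round(roc_area(operating_points), 3))   # about 0.71, versus 74.4% for the fitted curve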

However, sets of questions with different levels of specificity and sensitivity were not

estimated solely for the fitting of the ROC curve. The main idea is to allow the models to be

used in different ways. For example, question set C, which isolates the small fraction of

respondents at the very highest risk of death, could be used by a physician whose goal is to

identify the neediest of elderly for the efficient distribution of expensive resources. Sets A

and B are more geared toward identifying a large number of elderly who are at moderately

high risk.

1.8 Results: Heart failure, stroke and cancer

Within three years of baseline, some 868 respondents (8.4%) either reported that they

had been diagnosed with some type of heart failure, or died of heart failure without having

had a heart attack previously.29 New diagnosis of strokes was reported by or as the cause of

death for 943 (8.6%) respondents. For cancer, it seemed impossible to distinguish between

new cancers reported post-baseline and the existence of cancer as it was reported at baseline.

This was due partly to the wording of the questions and partly to the fact that cancer is not

a discrete event in time like a stroke or heart failure. Therefore, all respondents who reported

that they had ever been diagnosed with cancer were removed from the prediction of cancer

28 The points were fit with a rotated parabola which was constrained to pass through the origin and the point at (1, 1). The standard error was estimated with a bootstrap method. See Chapter 2 for an explanation of this estimate and its standard error.

29 If a respondent died from heart failure without reporting it post-baseline and they reported having experienced heart failure at baseline, it was assumed that no new heart attack occurred. The occurrence of a new stroke and diabetes was defined in this way also.


analysis to catch only genuinely new incidents of cancer. Thus, the resulting questions only

apply to persons never diagnosed with cancer. A total of 552 persons (6.2% of a total N of

8,874) never previously diagnosed with cancer either reported post-baseline that they had

been diagnosed with cancer or died of cancer.

One problem becomes evident at this point for any researcher familiar with survival

analysis: the occurrence of these traumatic events is censored in some respondents by death

due to other causes. (This is a demographer's euphemism for saying some respondents died

of cancer before they had a chance to have heart attacks!) There are two ways to think about

this intrusion. First, if one is genuinely concerned with estimating the probability of an event

in a certain group of respondents (specifically, with the idea that some deaths could be

averted, e.g., in a cause-elimination scenario), care must be taken to adjust for the changes

in exposure and examine any assumptions regarding the dependence of chronic illness events

and death due to other causes (the censoring agent). Still, for pure prediction, we can merely

define our event-of-interest to be the admittedly censored incidence of heart attack, stroke

or cancer, essentially ignoring the issue (that is to say, defining risk in the presence of other

causes of death, perhaps with the justification that we are more interested in identifying those

persons for whom a specific event will occur before they die of another illness!) The brief

results presented below are given in the context of pure prediction, but there are still some

interesting substantive hypotheses suggested by the questions themselves.

Table 1.3 shows the results of using the above method to predict heart failure, and Appendix

II contains the actual question sets (D and E). Questions were identified for two relative

costs of misclassification, nine and six. Set D, consisting of seven questions, correctly


Table 1.3 - Heart failures predicted correctly and non-failures predicted incorrectly
(in a test set of 288 heart failures and 3,144 non-failures 3 years after baseline)

Set(1)  Cost(2)  Reduction in      Failures       Sensitivity(5)  Non-failures    Specificity(7)  Failure rate(8)  Failure rate /
                 prediction        predicted(4)                   predicted                       among            rate predicted
                 error(3)          correctly                      incorrectly(6)                  predicted        by age, sex
                                                                                                  failures
D       9        25%               169            59%             871             72%             16%              1.60
E       6        12%               103            36%             416             87%             20%              1.86

1. The letter of each set refers to the letters of the sets of questions listed in Appendix II.
2. The ratio of the cost of misclassifying a heart failure as non-failure to the cost of misclassifying a non-failure as failure.
3. The percentage decrease in the cost-adjusted misclassification error rate compared to the error of classifying all respondents as non-failures.
4. The number of heart failures in the test set correctly classified as failures by the set of questions.
5. The number of failures correctly predicted divided by the total number of failures in the test set (288), also called the true positive fraction or TPF.
6. The number of non-failures in the test set incorrectly classified as failures by the questions (i.e., false positives).
7. The proportion of all non-failures (3,144) correctly classified as non-failures by the questions.
8. The proportion of failures among the respondents predicted as failures. In the column to the right, this rate is divided by the failure rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents chosen by the questions) would have suffered.

Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.


predicted some 169 heart failures out of 288 failures total in the test set (sensitivity = 59%).

The number of false positives was 871, for a specificity of 72%, and the rate of heart failure

was 1.6 times as high as the rate predicted by the age and sex composition of the

respondents. The more specific questions (Set E), a total of only six questions, caught only

103 heart failures (36% of all failures in the test set) but with quite high specificity,

identifying only 416 false positives (87%). Of the respondents classified as high risk by the

questions, one in five suffered heart failure, a rate nearly 90% higher than that predicted by

their age and sex distribution! The area under the ROC curve was estimated as 71.0%.

The prediction of strokes was a little less accurate, as shown in Table 1.4; the

questions are in Sets F and G in Appendix III. The more specific set (G) catches 37% of all

strokes with a specificity of 86%, but Set F predicts 59% of strokes with only 66%

specificity. The area under the ROC curve was estimated as 69.1%. Cancer is much more

difficult to predict (see Table 1.5, and Sets H and I in Appendix IV). Questions in Set H

catch as many as 61% of all respondents who get cancer, but with a specificity of only 54%.

Question Set I appears more useful, catching as many as 43% of cancerous respondents with

a specificity of 71% (not high, but still of interest). For cancer, the area under the ROC curve

was estimated as 61.5%.

1.9 Discussion

There are several perplexing questions evident in the preferred models, and there are

many questions that are even more puzzling by their absence. A physician is very likely to

be disconcerted by this lack of many clinically obvious predictors. One might imagine (as

did the author, before the analysis) that the most powerful predictors would be age, sex,


Table 1.4 - Strokes predicted correctly and non-strokes predicted incorrectly
(in a test set of 319 strokes and 3,113 non-strokes 3 years after baseline)

Set(1)  Cost(2)  Reduction in      Strokes        Sensitivity(5)  Non-strokes     Specificity(7)  Stroke rate(8)  Stroke rate /
                 prediction        predicted(4)                   predicted                       among           rate predicted
                 error(3)          correctly                      incorrectly(6)                  predicted       by age, sex
                                                                                                  strokes
F       9        22%               189            59%             1,065           66%             15%             1.38
G       5        10%               118            37%             439             86%             21%             1.73

1. The letter of each set refers to the letters of the sets of questions listed in Appendix III.
2. The ratio of the cost of misclassifying a stroke as a non-stroke to the cost of misclassifying a non-stroke as stroke.
3. The percentage decrease in the cost-adjusted misclassification error rate compared to the error of classifying all respondents as non-strokes.
4. The number of strokes in the test set correctly classified as strokes by the set of questions.
5. The number of strokes correctly predicted divided by the total number of strokes in the test set (319), also called the true positive fraction or TPF.
6. The number of non-strokes in the test set incorrectly classified as strokes by the questions (i.e., false positives).
7. The proportion of all non-strokes (3,113) correctly classified as non-strokes by the questions.
8. The proportion of strokes among the respondents predicted as strokes. In the column to the right, this rate is divided by the stroke rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents chosen by the questions) would have suffered.

Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.


Table 1.5 - Cancer predicted correctly and non-cancer predicted incorrectly as cancer
(in a test set of 185 respondents with cancer and 3,046 without cancer 3 years after baseline)

Set(1)  Cost(2)  Reduction in      Cancers        Sensitivity(5)  Non-cancers     Specificity(7)  Cancer rate(8)  Cancer rate /
                 prediction        predicted(4)                   predicted                       among           rate predicted
                 error(3)          correctly                      incorrectly(6)                  predicted       by age, sex
                                                                                                  cancers
H       15       11.1%             113            61%             1,387           54%             7.5%            1.15
I       13       6.3%              79             43%             875             71%             8.3%            1.18

1. The letter of each set refers to the letters of the sets of questions listed in Appendix IV.
2. The ratio of the cost of misclassifying a cancerous respondent as non-cancerous to the cost of misclassifying a non-cancerous respondent as cancerous.
3. The percentage decrease in the cost-adjusted misclassification error rate compared to the error of classifying all respondents as non-cancerous.
4. The number of respondents in the test set who got cancer and were correctly classified as cancerous by the set of questions.
5. The number of cancerous respondents predicted correctly divided by the total number of cancerous respondents in the test set (185), also called the true positive fraction or TPF.
6. The number of non-cancerous respondents in the test set incorrectly classified as cancerous by the questions (i.e., false positives).
7. The proportion of all non-cancerous respondents (3,046) correctly classified as non-cancerous by the questions.
8. The proportion of cancerous respondents among the respondents predicted as cancerous. In the column to the right, this rate is divided by the cancer rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents chosen by the questions) would have suffered.

Note: This analysis included only respondents who were diagnosed as cancer free at baseline.
Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.


smoking, heavy drinking, obesity, and questions asking about previous diagnoses of

illnesses (e.g., "Have you ever been told by a doctor that you had cancer?" and similarly for

heart failure, stroke, diabetes and high blood pressure). Yet consider Set A, which correctly

predicts the most deaths; of the above predictors, only age, sex and weight are considered.

Age and sex, which are undoubtedly the most commonly used variables in models of

mortality, are used only once in each of Set A and B. Set C, which catches the respondents

who are at the highest risk of death, considers neither! This may be a very disturbing result for

many scholars of mortality, prompting one to ask why so many other risk factors known to

be important are also not more prominently featured.

Several answers to this question exist. The first concerns the predominance of

measures of functional status. In particular, several questions that appeared repeatedly were

the items asking whether the respondent could walk a half mile, do heavy work, or bathe

without assistance. The tremendous predictive power of these ADL's (activities of daily

living) has been reported in a number of other studies on predicting mortality (specifically,

see Bianchetti et al. (1995); Cahalin et al. (1996); Davis et al. (1995); Reuben et al. (1992);

and Warren and Knight (1982)). In short, many researchers have found that ADL's have

more predictive power than nearly any other variable, including those "obvious" items listed

above. This may be because the conventional risk factors are all variables that affect

mortality over long periods, and with uncertain consistency; if the goal is to predict short

term mortality, it is more effective to focus on characteristics that appear just before death,

and these changes are clearly identified in the functional ability of the individual.

A highly simplified picture of the process of death from chronic illness over time


illustrates this, as shown in Figure 1.4. Any particular risk factor may or may not lead to

illness (hence the questionable arrows), and any particular illness may or may not lead to

death. However, that illness or combination of illnesses that leads to death is quite certain

to cause a change in the functionality of the individual at some time before death. To be sure,

death is the ultimate breakdown in functional status, and accurate prediction becomes a

matter of how soon functionality begins to deteriorate before death. Clearly, predicting the

death of the person who is perfectly capable one day and dies suddenly on the next, perhaps

experiencing loss of function only minutes before death, is much more difficult. Similarly,

the variable loses its prognostic value for persons who may become disabled for reasons

other than illness, and well before the onset of mortality. For the majority of individuals,

however, much loss of function is caused by chronic illnesses. Thus, not only are these

changes in functional ability temporally closer to the event of death than any of the other

predictors, we also expect this association to be the strongest as well. It is expected then that

this variable should be an extremely accurate short-term predictor of death, to the extent to

which it can be measured accurately.

It is for this reason, perhaps, that age and sex in particular have not emerged as a

central focus in this analysis.30 Every one of the overall sets (A through I) is capable of

identifying any person of an age greater than 65 as high risk. Only two of the 11 subsets of

questions in Appendix I include a question about age and only two contain a question about

30 The author originally thought that the best way to proceed with the analysis would be to divide the sample into age- and sex-specific subsets before applying the search algorithm. This turned out to be not only much more complicated logistically, but suboptimal with respect to predictive power as well. For example, some sets of questions find subsets of respondents which cut across age boundaries while other subsets are very age-specific. It seems that allowing the computer to decide when age or sex should or should not be included is the best way to build question sets.


Figure 1.4 - Diagram of death from chronic illness.
[Diagram: risk factors (age/exposure, diet/obesity, smoking, drinking, genetics, other factors) may or may not lead to chronic illnesses (heart disease, cancer, other illnesses); the illnesses degrade functional status, which in turn precedes death.]


sex. Yet the rates of both death and illness that the respondents experience are much higher

than predicted by age and sex, a very counterintuitive result given our knowledge of these

variables and their relation to mortality! However, age is generally such a powerful number

because it is a variable that captures the conglomerated effect of many other variables; when

we refer to the "aging process", we usually think of the decline of the individual's functional

status and the onset of chronic illness, not simply the mere act of passing time. Thus the

main effect that age measures directly is the exposure to repeated insults, and most of the

other variables that relate to the aging process are quite well measured in the available survey

questions. However, age is often included in the questions, and usually the chosen

respondents are older than the general population. Ideally, one would like a set of questions

that chooses persons who are at highest risk relative to others of the same age and sex. In

Chapter 5, a hybrid combination of the question set method and a linear model gives a way

to accomplish this, but interestingly, it appears that the results are not substantially different

from those shown here.

Another reason for the puzzling inclusion and exclusion of certain variables is the

nonintuitive nature of the search strategy. Specifically, questions are designed to have low

misclassification error, and this is quite different from merely finding some respondents who

are at high risk. The criterion of misclassification error measures not only how high the risk

level is for those respondents classified as dead, it also reflects the number of respondents

classified as dead (i.e., the "efficiency" of the questions, as discussed above). This can be

seen vividly by actually trying to build question sets on a purely intuitive basis. For example,

consider a set of questions that isolates all those respondents who are males, aged 80+, and


who have been diagnosed with heart failure, cancer, stroke, or high blood pressure. One

expects these persons to be at very high risk, and indeed they are: 35% of these respondents

died. Yet of all 3,432 respondents in the test set, only 147 were isolated by these variables,

so only 53 deaths (out of 509) were correctly predicted. In contrast, the questions in Set C

isolated a group of respondents with a death rate of 46%. This was not only a set of

respondents at somewhat higher risk; it was also a much larger group of respondents. There

were 245 such persons in the test set, and 113 of them died, more than twice as many deaths

as predicted by the intuitive set! Even more incredibly, since the questions in Set C ask

nothing about sex or age and very little about illness, the persons so isolated are not all 80+

males who have been diagnosed with terminal diseases; so many respondents identified as

high risk are not already known to be at high risk prior to prediction.

An important hypothesis is suggested by these results. Because the observed

diagnoses of illness did not have a great deal of predictive power compared with other

symptomatic variables such as functionality, and because so many respondents who

eventually died had never been diagnosed with disease at baseline, it was thought that a great

deal of undetected morbidity existed in the EPESE sample at baseline. The relatively poor

respondents in this dataset likely had a particularly low level of access to and utilization of

medical resources, and it seemed that the diagnosis of morbidity was positively related with

income. It was also suspected that cancer accounted for many missing diagnoses. (This

issue is treated in detail in Chapter 6).

It seems that our etiological knowledge of mortality is still far too incomplete to

provide us with a predictor that is both causally intuitive and efficient based on the available


data. So, if the primary objective is to build an accurate and efficient predictor, as is the goal

of this dissertation, it benefits one to abandon (at least for the moment) conventional thinking

about causal structures. Much of the advantage of the present method comes from the fact

that it is not at all bound by our seriously flawed notions of what variables are causally

responsible for death. Rather, questions are chosen from the pool of potential variables by

a method that is completely objective, and for the purposes of building the models, the only

concern is that they increase predictive power. As it is statistically defined (see Chapter 2),

predictive structure is completely equivalent to correlative structure, and so one can

maximize predictive power totally without regard to causes. Thus, a word of caution is

necessary. Many laypersons (and even some academics) are ignorant of the difference

between cause and association. Careless consideration of the questions could lead

respondents to believe falsely that the variables suggest a way to mitigate their risk of death.

Question Set C, for example, predicts that respondents who cannot correctly state the day of

the week are at higher risk of death. This may be true, but it is doubtful that we could help

such persons merely by giving them calendars.

As a result, many researchers may be annoyed by this strategy of model-building

while ignoring considerations of causal structure. However, it is the intent of the author to

ignore causes only for the purpose of building the models; once they are built, it seems that

a great deal of insight into causal structure may be gained. The trick is to be extremely

cautious in this endeavor. As mentioned, many variables highly correlated with death are

not controlled in the question sets, in comparison to a regression model for instance (see

Chapter 5 for such a model). However, a different type of advantage is offered, and this


may also be another reason that the method predicts so well. People die for many different

reasons, and in many different ways, and it follows that using a single predictive equation

to try to predict all deaths is not an efficient approach. However, by allowing various

partitions to be combined with the OR operator, a set of questions can be constructed which

effectively deals with many different types of mortality since no single variable or question

is applied to all respondents.31 For example, consider Question Set A, which accounted for

66% of all deaths. This overall set consists of five subsets of questions (A.1 through A.5)

combined by the OR operator. Two-thirds of the respondents chosen by these questions were

only selected by one of the five sets. Moreover, the recorded causes of death vary

substantially from set to set; in particular, some sets are dominated by heart disease mortality,

as deaths due to heart failure appear to be more predictable than deaths from cancer and other

causes. Etiologies seem to vary widely from set to set, and so by efficiently dividing the

population into subgroups that are more homogeneous with respect to causes, we may greatly

simplify the causal analysis. Chapter 6 demonstrates this by applying the more conventional

types of analysis to respondents within these delineated regions in an attempt to glean more

information about the different causes behind the different types of deaths.

The nonparametric tactic of partitioning the sample space (as opposed to fitting

equations) also allows the predictors to deal very effectively with a number of more practical

problems: nonlinearities, the very large pool of potential predictor variables, the inevitable

missing values, and other "messy data" problems inherent in large scale survey endeavors.

31 This is an especially important distinction between the present search method and CART, as with classification trees, the binary splits are constrained to take the form of a tree structure. As a result, all splits are beholden to the root node of the tree, and all splits lower in the tree are dependent on splits above them.


Finally, it is important to mention some deficiencies of the method. Since functional

status is being used to predict death, it is clear that the method predicts death in persons who

have already suffered some degree of illness (hence their decreased functionality). It is for

this reason that the method is also used to predict the debilitating events themselves, such

as heart failure, stroke and cancer, before they occur. The questions for predicting illness

contain some very obvious risk factors (questions about smoking predict both cancer and

strokes), but there are still some other questions (ADL's, questions about chest pain) which

indicate that the isolated respondents, again, were already somewhat ill at baseline. In future

analyses, later waves of the EPESE data will be used in an attempt to identify more long-term predictors and risk factors.

Unfortunately, another serious flaw in the prediction of illness exists. Obviously,

what is observed is not the true incidence of illness, but the diagnosis of it (either by a

physician or on a death certificate). In the case of heart failure or stroke, this difference may

not be too crucial. Cardiac events, if they are at all serious, tend to be painful and relatively

obvious, and so are probably diagnosed somewhat successfully. Cancer, however, can be

quite difficult to predict, particularly in its earliest stages. It is perhaps for this reason that

the method identifies questions such as that in Set HA, which asks whether the respondent

has been to a dentist recently. Respondents who had been to a dentist in the previous five

years exhibited a higher risk of cancer than those respondents who had not been to a dentist.

It seems unlikely that going to a dentist is a cause of (or even moderately correlated with) the

true incidence of cancer; instead, it seems probable that persons who go to dentists are much

more likely to be diagnosed with cancer, perhaps because dentists can frequently spot oral


cancer, or maybe because such persons go to physicians more often as well. However, if one

is imaginative, it is usually possible to posit some kind of causal connection between

seemingly unrelated variables. For instance, it is possible that persons who go to dentists

regularly have received many more X-rays over the course of their lives, and so may actually

be at higher risk of cancer. This may be far-fetched, but nonetheless, the decision to reject

the hypothesis is purely a judgment call. Since such judgments are frequently wrong, and

since the observed correlation is still of interest in any case, these types of anomalies have

been allowed to remain in the preferred models. Consequently, the questions concerning the

prediction of cancer, which are not particularly accurate in any case, should be considered

principally of academic interest. As suggested above, two central concerns raised by the

results are the level of undetected morbidity evident in the population, and the possible

misspecification of causes on death certificates, as discussed in Chapter 6.

1.10 An overview of the dissertation

Chapter 2 of the dissertation addresses the problem of prediction from a statistical

standpoint and reviews the literature in this context. The chapter starts by formally stating

the problem of prediction, defining prediction error with respect to the test set, and then

briefly discusses recent research which is concerned with the prediction of mortality and

morbidity. A probability model for the classification problem is defined, and the statistical theory behind the various existing methods is explored. Shortcomings of the present

methods are then examined vis a vis the literature, and the method suggested here is

compared with the existing models to show how these shortcomings might be addressed.

The motivation for constructing the overall search method centers around the bias-variance


tradeoff inherent in building large, multivariate models. The argument for the model

selection process works by analogy with CART, showing how the method of determining the best number of questions and partitions is similar to determining the best-sized classification

tree.

Chapter 3 presents the particulars of the dataset used in this dissertation. Basic

summary statistics and the demographic makeup of the data are explored and compared with

the U.S. population of elderly. The issue of sample design is examined, as are the details of

missing data. Levels of disability, morbidity, and mortality in the EPESE sample are

summarized by age and sex.

Chapter 4 treats the method in detail, focusing first on the construction of the search

algorithm for selecting the full model and on the method of model selection through

backwards deletion. The primary goal of this chapter is to show how identifying a sequence

of submodels and applying it to the test set suggests a model size with low prediction error.

The practical problems associated with constructing this algorithm are discussed, and

variations on the basic search method are explored. The methods used for building

comparative models via linear discriminant analysis, logistic regression, and classification

trees are also set out. Again, the tactics used for model selection in all three cases revolve

around the optimization of the bias-variance tradeoff.

In Chapter 5, the models presented above and in the appendices are revisited in much

greater detail. The respondents chosen by the models are broken down for each of the

question subsets by age and sex. Performance characteristics of the questions are presented,

and they are compared in terms of predictive accuracy with results from the three other


methods discussed in Chapter 4. The validation of these models using the Duke EPESE

sample is also presented.

Chapter 6 is concerned largely with the substantive issues raised by the models. The

discussion begins with an in-depth examination of the causes in relation to the multiple-cause

death certificate data linked to the sample. In particular, it is shown how the question

method developed here simplifies causal analysis by partitioning the high-risk respondents

into groups that are more homogeneous with respect to etiology. Some variables used in the

models (e.g., digitalis usage, body weight, disability) are analyzed as potential risk factors

in light of the causes of death associated with each question subset. The extent of the

problem of misspecification of causes of death on death certificates is explored. It is argued

that cancer deaths may be particularly underrepresented as a cause of death, and that the

detection of cancer as a cause of death and as a morbid condition in living persons is

dependent on the individual's access to medical care.

Chapter 7 summarizes the findings of the dissertation.


Chapter 2 - The Problem of Prediction

To be or not to be; that is the question. - Hamlet

2.1 The specific problem of predicting death or illness

Statistical theory has given researchers any number of models for predicting binary

outcomes such as death/survival. Most commonly, multiple regression is used (logistic and

otherwise), along with proportional hazards, linear discriminant analysis and classification

trees, to name only a few. However, the exact statistical definition of the general prediction problem itself is rarely discussed, partly because of this overwhelming multitude of specific approaches. Unfortunately, in the social sciences the problem of prediction has often become

synonymous with causal modeling, particularly with multiple regression. One speaks of the

"independent" or right-hand side variables of a regression equation as predictors of the

dependent variable, and indeed any properly fit regression equation can be used to predict

the dependent variable in a literal sense.

Yet often these equations are not built with the goal of maximizing true predictive

power. The ultimate purpose of the models for most social scientists is not to predict, but

to understand the nature of the associations between the dependent and independent

variables. As such, the building of the equation typically operates around the maximization

of an R² measure of fit or the exclusion of variables that are not statistically significant rather than the minimization of some well-defined prediction error. The use of modern, high-powered statistical techniques for predicting mortality in individuals (as opposed to


forecasting mortality rates in populations) has been defined and explored mostly by

researchers in medical fields. However, these efforts still do not usually use the available

methods to their full potential.

In this dissertation, the meaning of statistical prediction is stated very explicitly. All

models are treated within this framework, and the review of existing research focuses

primarily on those efforts that are truly aimed toward prediction as defined (as opposed to

causal models, for instance). This definition also yields a statistical criterion by which to

judge a predictive model of any type, and a common theme emerges which relates to the

problem of building models to maximize predictive accuracy. This theme, the bias-variance

tradeoff, centers around the dimensionality or size of the models, and in particular, the task

of finding what size model gives the most accurate predictions.

2.2 Defining prediction and prediction error for a general predictor

Suppose we have data on some potential predictor variables, X_1, ..., X_p, measured at time T_0, for N individuals, and we also have a binary outcome variable Y for each person (taking on the values zero and one, in this case representing survival and death respectively). The outcome Y is usually measured at some later time T_1 for each person, but the time

element is not essential to the problem (i.e., it can also be defined in terms of classification).

The problem of prediction as defined here is to find some predictor (or classifier) of Y,

P(X_1, ..., X_p), built from X_1, ..., X_p, such that the value of Y can be determined accurately before it is observed at T_1. Once Y is observed, a simple measure of accuracy, or prediction

error (PE), can now be defined - just divide the number of misclassified cases by N:

PE = \frac{1}{N} \sum_{i=1}^{N} \left| Y_i - \hat{Y}_i \right|    [2.1]

where Ŷ_i is the predicted value of Y for the ith respondent.1

The costs of misclassification can be adjusted asymmetrically to make the predictor

more sensitive to one type of misclassification or the other. For example, in the mortality

problem, suppose that we wish to find a set of questions that would catch a very high

proportion of the deaths in the sample. This might be possible only at the expense of more

misclassified survivors. Then we could define the cost of misclassifying a death as survival

to be higher than the cost of misclassifying a survivor as dead. If C(0,1) is the cost of misclassifying Y_i as 1 when it equals 0, and I_i(0,1) is an indicator random variable for this type of misclassification, then costs can be accounted for with an adjusted PE:2

PE = \frac{\sum_{i=1}^{N} \left[ C(0,1) \cdot I_i(0,1) + C(1,0) \cdot I_i(1,0) \right]}{C(0,1) \sum_{i=1}^{N} (1 - Y_i) + C(1,0) \sum_{i=1}^{N} Y_i}    [2.2]

where the denominator is also adjusted simply to keep the quantity within the (0,1) interval

(the denominator stays constant regardless of the predictor). In this way, the method can be

applied using a variety of misclassification costs so that different degrees of specificity and

sensitivity may be achieved.
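As a concrete illustration of Equations 2.1 and 2.2, the cost-adjusted error is straightforward to compute from arrays of observed and predicted outcomes. The short sketch below is not the dissertation's own software; the function and array names are illustrative, and a 0/1 coding (1 = death) is assumed.

    import numpy as np

    def cost_adjusted_pe(y, y_hat, c01=1.0, c10=1.0):
        # Cost-adjusted misclassification rate in the spirit of Equation 2.2.
        # y     : observed outcomes (0 = survivor, 1 = death)
        # y_hat : predicted outcomes (0 or 1)
        # c01   : cost of classifying a survivor (Y = 0) as a death
        # c10   : cost of classifying a death (Y = 1) as a survivor
        y, y_hat = np.asarray(y), np.asarray(y_hat)
        i01 = (y == 0) & (y_hat == 1)          # indicator I_i(0,1)
        i10 = (y == 1) & (y_hat == 0)          # indicator I_i(1,0)
        numer = c01 * i01.sum() + c10 * i10.sum()
        denom = c01 * (y == 0).sum() + c10 * (y == 1).sum()
        return numer / denom

With unit costs (c01 = c10 = 1) the quantity reduces to the simple misclassification rate of Equation 2.1.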

Of course, the statistical problem of building P(X_1, ..., X_p) is typically encountered

1 This is equivalent to the misclassification rate defined in Breiman et al. (1984) for the two-class classification tree with unit misclassification costs, and is the same measure of prediction error employed in Chapter 1 as a criterion for judging combinations of sets of questions.

2 Indicator variable means that I_i(0,1) is 1 if Y_i = 0 and Ŷ_i = 1, else I_i(0,1) is 0.


after T_1, so Y has been observed and used in the building process. Then, estimating the prediction error in this way does not necessarily give an honest estimate of the accuracy of the predictor as applied to additional, independent values of X_1, ..., X_p to predict a truly unobserved Y. (The reasons for this are discussed below.) Instead, a test set misclassification rate is defined.3 Before building a predictor, a simple random sample is

used to select out some respondents in the data, who are then set aside for future testing of

the predictor, Y^TS and X_1^TS, ..., X_p^TS. (This is usually a half or a third of the sample, by convention.) Using the half or two-thirds of the data remaining, a predictor of Y^LS is built, P(X_1^LS, ..., X_p^LS), and this predictor is then applied to X_1^TS, ..., X_p^TS to produce predicted values, Ŷ_i, for all i in the test set. A more honest estimate of prediction error for this

predictor can now be calculated as:

PE^{TS} = \frac{1}{N^{TS}} \sum_{i \in TS} \left| Y_i - \hat{Y}_i \right|    [2.3]

where N^TS is the number of observations in the test set. Likewise, in the cost-adjusted case,

it is:

PE^{TS} = \frac{\sum_{i \in TS} \left[ C(0,1) \cdot I_i(0,1) + C(1,0) \cdot I_i(1,0) \right]}{C(0,1) \sum_{i \in TS} (1 - Y_i) + C(1,0) \sum_{i \in TS} Y_i}    [2.4]

In both cases the summation is performed only over the test set. This measure for prediction

3 Numerous variants of this scheme exist, e.g. N-fold cross-validation, for cases where the dataset is not large enough to afford selecting out a test set. However, in this case the number of observations is large enough so that a test set method gives essentially the same result as cross-validation, but with much less logistical complexity.


error can be used (and will be used, for the remainder of this dissertation) as a criterion by

which to judge the predictive accuracy of any binary predictor or classifier.
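A minimal sketch of the test set procedure just described, assuming a simple random split and 0/1-coded outcomes; the function names are illustrative rather than taken from the dissertation's code.

    import numpy as np

    def split_learn_test(n, test_fraction=1.0 / 3.0, seed=0):
        # Select a simple random sample of respondents to set aside as the test set.
        rng = np.random.default_rng(seed)
        test = np.zeros(n, dtype=bool)
        test[rng.choice(n, size=int(round(n * test_fraction)), replace=False)] = True
        return ~test, test                     # boolean masks: learning set, test set

    def pe_test_set(y_test, y_hat_test):
        # Equation 2.3: the proportion of test-set respondents misclassified.
        y_test, y_hat_test = np.asarray(y_test), np.asarray(y_hat_test)
        return np.mean(np.abs(y_test - y_hat_test))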

To test whether a particular model is significantly different from the null model (identified as the "ignorant" model in Chapter 1), one must estimate a standard error for the PE^TS. If the test set is treated as a simple random sample from the distribution of interest, then in the unit cost scenario this can simply be estimated as the standard error of the mean of a binomial variable, where the binary outcome is misclassification (1) or correct classification (0), and the "success" parameter is estimated by PE^TS.4 For the cost-adjusted case, one simply models the PE^TS as a random sample from three possible outcomes: 0, 1

or C. These three outcomes can then be assigned a distribution defined by the observed

proportions of correct classifications, incorrectly classified decedents, and incorrectly

classified survivors respectively. (This is not unlike a bootstrap estimate, but where no

resampling is necessary since the error distribution is completely specified by these

proportions).5 A statistical test can then be calculated to assess whether the difference

between the test set estimate of error and the null error is statistically significant simply by

dividing this difference by the test set standard error estimate. For all the fitted models

presented in this dissertation, the differences were clearly significant far beyond the 0.01 level, but the

actual tests are usually not shown.
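The test against the null ("ignorant") model can be sketched as follows for the unit-cost case, in which misclassification is a binomial outcome; the names are illustrative, and the cost-adjusted version would replace the binomial variance with the variance of the observed 0/1/C error contributions.

    import numpy as np

    def se_pe_binomial(pe_ts, n_ts):
        # Standard error of PE^TS treated as a binomial proportion
        # (misclassified = 1, correctly classified = 0).
        return np.sqrt(pe_ts * (1.0 - pe_ts) / n_ts)

    def z_versus_null(pe_ts, pe_null, n_ts):
        # Divide the difference between the null error and the test-set error
        # by the test-set standard error estimate.
        return (pe_null - pe_ts) / se_pe_binomial(pe_ts, n_ts)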

To judge the overall accuracy of the method, taking into account any combination of

misclassification costs, it is conventional to estimate the receiver operating characteristic

4 This is the model used by Breiman et al. with fair success. Since the respondents were not chosen with a simple random sample, one may also attribute the element of chance to the mortality process itself, after conditioning on some set of X variables. Identical results may be achieved with either model.

5 See Chapter 5 for examples of the explicit calculations.


(ROC) curve, as shown in Figure 1.3.6 The name for this curve is more complicated than its

substance; it simply refers to a plot of sensitivity as it varies by specificity. In the context of

mortality prediction, sensitivity is the number of deaths correctly classified as dead by the

predictor, divided by the total number of deaths, also called the TPF (true positive fraction).

In the test set application:

TPF^{TS} = \frac{\sum_{i \in TS} I_i(1,1)}{\sum_{i \in TS} Y_i}    [2.5]

where I_i(1,1) is the indicator that the ith respondent is correctly classified as dead.

Specificity relates to the degree to which the respondents predicted as dead actually die, or

the FPF (false positive fraction). Specifically:

FPF^{TS} = \frac{\sum_{i \in TS} I_i(0,1)}{\sum_{i \in TS} (1 - Y_i)}    [2.6]

where, as above, I_i(0,1) is the indicator that the ith respondent is incorrectly classified as

dead. So the FPF is the number of respondents who survived but were incorrectly classified

as dead, as a proportion of the total number of survivors. (Specificity is usually defined as

1 - FPF). Notice that as C(1,0) (the cost of misclassifying a dead person as a survivor) is raised relative to C(0,1) (the cost of misclassifying a survivor as dead), a predictor built to

minimize such costs will improve sensitivity while sacrificing specificity.
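For reference, the TPF and FPF of Equations 2.5 and 2.6 can be computed directly from test-set outcomes and predictions; this is a schematic sketch with illustrative names, again assuming 0/1 coding.

    import numpy as np

    def tpf_fpf(y, y_hat):
        # TPF: correctly classified deaths as a fraction of all deaths (sensitivity).
        # FPF: survivors classified as dead as a fraction of all survivors (1 - specificity).
        y, y_hat = np.asarray(y), np.asarray(y_hat)
        tpf = ((y == 1) & (y_hat == 1)).sum() / (y == 1).sum()
        fpf = ((y == 0) & (y_hat == 1)).sum() / (y == 0).sum()
        return tpf, fpf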

The principal reason for considering the ROC curve is that the area under it serves

6 See Swets and Pickett (1982), and Thompson and Zucchini (1989).


as a natural and commonly used index of a predictive method's accuracy (as discussed in

Chapter 1), in addition to providing a visual summary of the error distribution which

expresses the ratio of sensitivity to specificity at any given level of either. There are many

methods for computing this statistic that rely on any number of questionable assumptions

(usually involving normality), and are ultimately designed for many situations. Here, there

was a rather remarkable feature of the fitted models that made the estimation quite

straightforward. It was observed, after having plotted many points in ROC space for several

different statistical methods, that the estimated points could always be extremely well

approximated by a particularly simple form. This fit was defined by a parabola constrained

to pass through the origin and the point at (1,1), resulting in a model with only one degree

of freedom. What allowed this approximation to work, however, was the following

serendipitous insight. To fit the data accurately the parabola had to be formed not by

modeling the TPF as a quadratic function of the FPF, but as a parabola rotated 45°

counterclockwise. Such a parabola lies symmetrically about the line y = 1 - x in the

unrotated coordinate system. An example is the fitted curve in Figure 1.3, which passes

nearly exactly through all three observed points. This method was observed to work just as

well when applied to the ROC curve obtained from a linear discriminant analysis: with

nearly 10,000 unique points observed on the curve (as compared with the three points in

Figure 1.3), the single-parameter parabolic fit achieved an R² of greater than 0.999!

This method of fitting was not only extremely accurate, it was also easy to carry out.

First, only one point in ROC space is required to identify the model. The observed x and y

coordinates for each such point can be rotated 45° clockwise through a simple 2x2 change


of coordinate matrix (whose elements are composed of 1/√2 and -1/√2).7 The transformed

x and y (call them x' and y') can then be fit via ordinary least squares to the basis:

with the parameter P > 0. This basis constrains the parabola to pass through the origin and

the point (1,1) in the original, unrotated ROC space. Fitting the model in the rotated space

not only simplifies the process by allowing the use of least squares, it also minimizes the

squared errors along the y = 1 - x line in ROC space. This seems to provide a more

appropriate criterion than fitting OLS in the unrotated space, since we do not wish to model the TPF as a function of the FPF (as in the usual OLS model). Also, by forcing the curve to be symmetric about the y = 1 - x line, one ensures that the area under the curve is not greatly affected by the behavior of the curve in parts where no data are available. For instance, in Figure 1.3, there were no data on models having an FPF higher than 0.35, as that area of

the curve was not of great interest for the moment. However, since the unobserved half of

the curve simply mimics the observed half, one need not worry whether the principal statistic

of interest (the area under the curve) is driven by a part of the curve that is neither observed

nor interesting. Finally, this method has the added benefit that the area under the fitted ROC

curve is easily estimated as P/3 + 0.5, which can be seen by integrating the parabola over the

range (0, √2) in the rotated space.
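A minimal sketch of this fitting procedure follows. It is not the original code, and in particular the basis function used here, b(x') = x'(√2 - x')/√2, is an assumption: it vanishes at the images of (0,0) and (1,1), is symmetric about the image of the line y = 1 - x, and integrates to 1/3 over (0, √2), which is consistent with the stated area formula P/3 + 0.5.

    import numpy as np

    def fit_rotated_parabola(fpf, tpf):
        # Fit the one-parameter ROC approximation described above.
        fpf, tpf = np.asarray(fpf, float), np.asarray(tpf, float)
        # Rotate the observed ROC points 45 degrees clockwise:
        # x' = (x + y)/sqrt(2), y' = (y - x)/sqrt(2).
        xr = (fpf + tpf) / np.sqrt(2.0)
        yr = (tpf - fpf) / np.sqrt(2.0)
        # Assumed basis: zero at x' = 0 and x' = sqrt(2), symmetric about
        # x' = sqrt(2)/2 (the image of the line y = 1 - x).
        b = xr * (np.sqrt(2.0) - xr) / np.sqrt(2.0)
        # Ordinary least squares through the origin: minimize sum (y' - P*b)^2.
        P = (b @ yr) / (b @ b)
        return P, P / 3.0 + 0.5                # parameter and area under the curve

    # e.g., with three hypothetical ROC points:
    # P, auc = fit_rotated_parabola([0.10, 0.20, 0.35], [0.45, 0.60, 0.75])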

To estimate standard errors for this statistic, a resampling (or bootstrap) approach was

7 For a definition of the change of coordinate matrix, see Section 2.5 of Linear Algebra by Friedberg et al. (1989).


used. For example, in the case of the question set method developed in this dissertation, the

misclassification error for the models was observed for each respondent in the survey. (That

is, it was noted whether the respondent was correctly or incorrectly classified as dead or alive

by each of the three question sets in Appendix I). By repeatedly drawing samples (with replacement) of the same size as the test set from the entire sample distribution of all such

respondents, it was possible to mimic the observed misclassification error distribution

associated with the question set method. Then one can estimate the standard error of the

statistic by observing its inherent variance as generated by the repeated resamples. (The

probability model for misclassification error is specified in section 2.5 below.)

It is important to note that special care had to be taken to preserve dependencies

between the question sets. This was because the same respondents had answered all three sets

of questions, and some questions were shared by all three sets. First, it was observed

whether each respondent was classified or misclassified with respect to each of the three

question sets, so that every respondent was associated with a triplet of errors. Then in the

resampling process, these triplets were the units actually being resampled (such that a

particular respondent's status with respect to all three sets was always held together as a

single element). In this way, the correlation between the errors in the three models was

duplicated in each bootstrap sample. The estimation of standard errors for the other methods

was nearly identical.
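The triplet-preserving bootstrap can be sketched as follows, with a hypothetical array in which row i holds respondent i's 0/1 misclassification indicators under the three question sets; the names and the statistic being resampled are illustrative.

    import numpy as np

    def bootstrap_se(error_triplets, statistic, n_test, n_boot=1000, seed=0):
        # error_triplets : array of shape (n_respondents, 3); each row is kept
        #                  together as a single unit so that the dependence
        #                  between the three question sets is preserved.
        # statistic      : function mapping a resampled (n_test, 3) array to a number.
        rng = np.random.default_rng(seed)
        triplets = np.asarray(error_triplets)
        stats = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(triplets), size=n_test)   # with replacement
            stats.append(statistic(triplets[idx]))
        return np.std(stats, ddof=1)

    # e.g., the standard error of the first question set's error rate:
    # se = bootstrap_se(triplets, lambda t: t[:, 0].mean(), n_test=n_ts)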

2.3 Previous efforts to predict mortality and morbidity

There is an extensive literature on efforts to predict mortality and morbidity. A

MEDLINE search through the past five years of abstracts yielded more than 600 articles

Page 75: A Method for Predicting Death and Illness in an Elderly Person

64

reporting on efforts to predict death and illness, often cause-specific mortality or a particular

morbid condition such as the occurrence of cancer or heart failure. Usually, however, the

goal is to predict these events for in-hospital subjects (frequently patients in intensive care

units) or persons who have already experienced severe medical trauma of some sort.8 Also,

the predictors almost always require extensive physiological information that can only be

obtained by a clinician (e.g., blood or urine analyses).9 Some methods do involve the use

of test set validation, but not usually for model selection.10 For many other researchers, it appeared (although this was not always clear) that the primary goal of the model was to find causal connections, often for specific interventions.11

An excellent overview of most currently used predictive models is provided by

Marshall et al. (1994). The authors consider a total of eight approaches: simple logistic

regression (a full model and a backward, stepwise deleted model), cluster analysis combined

with intuition and logistic regression, principal components followed by logistic regression,

a subjective "sickness score", a model based on Bayes' theorem, an additive model, and

finally a classification tree approach. All of these methods are compared, using a large

dataset of heart surgery patients. A test set is used to maintain an accurate measure of

8 Some recent examples are Anderson et al. (1994); Becker et al. (1995); Davis et al. (1995); Grubb et al. (1995); Josephson et al. (1995); Marshall et al. (1994); Normand et al. (1996); Ortiz et al. (1995); Piccirillo et al. (1996); Pritchard et al. (1995); Quintana et al. (1995); Rowan et al. (1994); Schuchter et al. (1996); and Wong et al. (1995).

9 See Assmann et al. (1996); Becker et al. (1995); Bosch et al. (1996); Flanagan et al. (1996); Iezzoni et al. (1996); Rowan et al. (1994); and Schucter et al. (1996).

10 Research using a test set methodology includes Anderson et al. (1995); Becker et al. (1995); Normand et al. (1996); Ortiz et al. (1995); and Schucter et al. (1996).

11 Causal analysis was more frequently the goal for social scientists. For examples of cause-oriented research, see Friedman et al. (1995); Huppert et al. (1995); Josephson et al. (1995); Silber et al. (1992); and Smith and Waitzman (1994).


predictive power, and the common criterion by which all models are judged is the area under

the ROC curve. This paper serves as a natural starting point for a comparative look at

commonly-used methods for predicting mortality. Before conducting an in-depth exploration

of the literature, however, it is necessary first to define some of the statistical issues and

models, so that the existing research can be placed in perspective more clearly.

Half the strategies employed by Marshall et al. involve logistic regression, and

similarly the research overall is dominated by logistic regression in one form or another.

Straightforward logistic regression (whether the model is fitted stepwise or with another

method) is the most commonly used model.12 Marshall et al. fit a basic regression model in

two ways. They use all available variables not having too many missing values to build a

full model, and apply a backward, stepwise deletion of variables based on statistical

significance. The goal of the stepwise deletion was to avoid the overfit inherent in building

the full model, an example of the bias-variance tradeoff mentioned in Chapter 1. The effect

of noise on the building of sets of questions was easily demonstrated via Figure 1.1 in

Chapter 1, but exactly what was meant by bias and variance was not discussed, partly

because no parameters were being estimated. For regression, however, these ideas are quite

clearly defined and shown.

2.4 Regression as a method of classification

First consider the use of the OLS regression model for prediction, but where Y is real-valued instead of binary. Again, the goal is to build a predictor of Y based on X_1, ..., X_p, or P(Y | X_1, ..., X_p), to be used on future, unobserved realizations of Y; here, however, the predictor is

12 Quite a few of the articles in the bibliography used regression, e.g. Schucter et al. (1996); and Iezzoni et al. (1996, 1994).


made to be a polynomial function of the independent variables, and so the parameters of the

model are the coefficients of this regression equation. Analogous to Equation 2.3, one can easily define a measure of accuracy for this predictor, the most obvious being

PE^{TS} = \sum_{i \in TS} (Y_i - \hat{Y}_i)^2    [2.7]

the test set squared prediction error. This is a more honest measure of predictive accuracy

than the more conventional learning set R² statistic, as it reacts to overfit in the model.

For example, suppose that unknown to the researcher the underlying process that

generates Y is of the form of a quadratic function of a single X and some random error, so

that:

Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \epsilon_i    [2.8]

and suppose further that the ε_i's are independent and identically distributed with a mean of

zero and some finite variance. There are three parameters to estimate for predicting Y (each

of the β's), and for a fixed X the expected value of Ŷ is a linear combination of the expected values of these estimates. Likewise the noise in Ŷ depends on the standard errors associated with the estimates of the β's, so the accuracy of the predictor depends on minimizing the bias

and variance in the estimates of these parameters.

Now, the researcher has some data on X and Y, and some choice of models to fit,

depending on the number of distinct observations of X. For example, if there are four

observations of Y, with the corresponding .x;'s taking on four distinct values, it is possible


to fit the usual straight line or a quadratic or cubic polynomial function to the data. These

equations require two, three and four parameters respectively and we can fit no larger model,

the last being saturated. Figure 2.1 shows a simulated set of observations, with a plot of the

true underlying quadratic function along with plots of the three regression equations fit to this data. The straight line (represented by the dotted line) clearly cannot fit the curvature adequately, and indeed there is bias in the model since the estimation of β_2 is ignored, or

equivalently, constrained to zero. When a quadratic is fit (the dashed line) the estimated

coefficients are obviously unbiased, and likewise for the cubic polynomial. However, in the

latter case, the variance of the estimates is higher than for the quadratic fit; therefore the

variance in Ŷ is higher. With respect to Figure 2.1, we can see that the cubic fit (represented

by the dotted and dashed line) fits the data perfectly. As a result, from fit to fit, with new

errors introduced each time, the cubic curve will fluctuate around the true, underlying

function much more than the quadratic fit, fitting the new errors each time.

So although both the quadratic fit and cubic fit will overlay the true function on

average, the quadratic fit will yield the lowest PE as defined by Equation 2.7; it provides the

optimal tradeoff between bias and variance. Moreover, the researcher can estimate the test

set PE for each fit to infer this optimality. Here, if each model is fit and tested on a

sufficiently large test set, the linear model will generate the highest prediction error, the

quadratic the lowest, and that for the cubic model will be between the two. Even in larger

collections of models, a plot of PE against model size typically reveals this hook-shaped pattern, and the optimal model size is easily gleaned. Notice that the more usual R² is

highest for the cubic fit of course, a well-known deficiency with the index that has inspired


several modifications and fixes (e.g., adjusted R², Mallows' C_p).

Figure 2.1 - Simulation of Y as a quadratic function of X. (Triangle = observed data; solid line = true function; dotted line = linear fit; dashed line = quadratic fit; dot-dash line = cubic fit.)
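The bias-variance pattern just described is easy to reproduce in simulation. The sketch below is purely illustrative (the coefficients and noise level are arbitrary choices, not the values behind Figure 2.1): it fits linear, quadratic, and cubic polynomials to data generated from a quadratic truth and compares their test-set squared prediction errors as in Equation 2.7.

    import numpy as np

    rng = np.random.default_rng(1)

    def simulate(n):
        # True process: a quadratic in a single X plus i.i.d. noise (cf. Equation 2.8).
        x = rng.uniform(0.0, 6.0, n)
        y = 1.0 + 2.0 * x - 0.4 * x**2 + rng.normal(0.0, 1.0, n)
        return x, y

    x_learn, y_learn = simulate(50)
    x_test, y_test = simulate(5000)

    for degree in (1, 2, 3):                   # linear, quadratic, cubic fits
        coefs = np.polyfit(x_learn, y_learn, degree)
        pe_ts = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
        print(degree, round(pe_ts, 3))
    # Typically the quadratic fit attains the lowest test-set error and the
    # linear fit the highest, with the cubic fit slightly above the quadratic.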

This type of overfitting problem is evident in the paper by Marshall et aI., as the

larger regression model resulted in a higher prediction error than the smaller model. The

stepwise deletion process that created the smaller, more powerful model is unlikely to choose

the best-sized model, however. There is no clear relationship between the statistical

significance of the estimated coefficients and the predictive power of the overall model as

measured by a test set. Consequently, a more involved approach is used for fitting the

regression models in this dissertation. As described above, a whole series of models of all

sizes are fitted in order to determine the model size that minimizes the test set PE.

There are other problems inherent to the regression approach. First, OLS, as

demonstrated above, models a real-valued Y, but here we are only interested in the

possibilities that Y is zero or one. So of course, a link function must be used to map the real-valued output of the linear regression equation into the (0, 1) interval. Then the researcher

must select a cutoff for the predicted value of Y (e.g., 0.5 with unit misclassification costs)

below which respondents are classified as zero, and above which they are classified as one.

There is no clear choice of link (and variance) function. The binomial model suggests the

use of the logit function (i.e., logistic regression) but there is often no a priori reason why

an inverse Gaussian (probit) or any other mapping would not be more appropriate.

Ultimately, the problem is that regression is simply not geared toward binary classification,

and so attempts to mold it into a true classifier are usually circuitous. More direct strategies

of classification are suggested by Bayes' rule, as discussed below.

Secondly, the complexity and mixture of data types in the present dataset (as well as


most social science datasets) does not accommodate an equation-fitting approach to the

problem. For example, missing data present a problem, as there is no clear way to build such

anomalies into an equation.13 Also, many of the variables are coded such that the values of

the predictor variables do not naturally agree with the basis for a regression equation. That

is, most of the variables are not coded in an ordinal scale. So if these categorical variables

are to be meaningfully represented in a regression equation, the model must accommodate

them with a large number of 0-1 indicator functions (i.e., dummy variables). Since the

number of candidate variables is large (some with many possible values), the size of the

potential full model is extremely large. One is faced with the same problem as encountered

by the search method above: there is no way to find, for example, the optimal regression

model built with as many as fifty or sixty variables chosen from a pool of hundreds or

thousands of candidate predictors. The sheer immensity of the numbers implied by the

combinatorics of such a search is just as daunting an obstacle as that faced by the above

search for binary splits. Moreover, the resulting regression equation would be much more

unwieldy than sets of simple questions.

Thus it is the opinion of this author (and of many statisticians) that compared with

many other existing methods regression does not offer the best prospects for elegant or

efficient prediction. Undoubtedly, with some judicious recoding of data types, and with the

use of a test set for the purposes of model selection (as shown below), regression models can

achieve a high degree of predictive accuracy. However, these are not tactics that can be

13 The usual methods are to ignore such cases altogether, or to replace missing values with the means of the respective variables, neither of which utilizes any predictive information missing data may contain.


easily packaged into generic software; they require a fair level of statistical sophistication,

and a degree of programming agility which is not available to most. (Most social science

researchers implement automatic stepwise model selection algorithms or build models

entirely by convenience or intuition.) It is perhaps unfortunate then, that a plurality of

research projects aimed at predicting mortality are centered around the use of the regression

model.

2.5 Classification trees and Bayes' rule

A more natural approach to classification also described by Marshall et al. is the

CART method of building classification trees invented by Breiman et al. (1984). This

method is also used by Gilpin et al. (1983) and Henning et al. (1976) for predicting

death in heart failure patients. The idea behind classification trees is easy to understand. As

with the method developed here, the technique centers around the construction of binary

splits (i.e., yes/no questions) as a way to partition the sample space into homogeneous

regions (that is, regions containing respondents that are mostly dead, or mostly alive). A

classification tree is simply a way of organizing series of the yes/no questions or splits into

a tree form, where groups of respondents are split successively depending on previous splits.

Ultimately all final partitions (called terminal nodes) are assigned a class. Figure 2.2 shows

a hypothetical example of such a tree. A respondent is started at the top, with the question,

"Is the respondent's body-mass index less than 1.77". If the answer is "yes", the respondent

is sent to the left for the next question, while if the answer is "no", the respondent is sent to

the right to answer a different question. If sent left, the respondent is asked, "Is blood

pressure ≤ 179?", and here the respondent reaches a terminal node: the respondent is


Figure 2.2 - Example of a classification tree for risk of death. (Root split: body-mass index ≤ 2.1; terminal nodes classify respondents as low or high risk.)


classified as low risk if the answer is "yes", or as high risk if the answer is "no". Similarly,

if the respondent had answered "no" to the first question on body-mass index, they were then asked the question "Is blood pressure ≤ 179?" and subsequently classified as high or low risk

based on the answer. In this way, the tree-structured questions assign a class (represented

by one and zero, just as death and survival are represented) to every respondent in the dataset

by partitioning the sample space into rectangles, just as in Figure 1.1. Thus the tree works

quite like the method developed in this dissertation, and indeed the idea of a classification

tree was the inspiration for this research.

How is a classification tree built? As above, the CART software starts with a set of

respondents who are known to be dead or alive. It then searches across all possible yes/no

questions to find the question that most effectively separates the dead from the living (the

Gini index is used for this). Once such a question is found, the sample is divided into two

groups according to whether they answer "yes" or "no", and this forms the very first question

(the root node) of the tree. Then for each of these two subgroups, another exhaustive search

is conducted for a question that best separates the dead from the living. However, the same

problem addressed in Chapter 1 - that of how many questions to ask - was faced in the

building of a tree. That is, how large should the tree be made? As mentioned in Chapter 1,

the answer can be obtained by examining the bias-variance tradeoff inherent in changing the

size ofthe model. So far, the bias-variance tradeoff has only been defined formally for OLS

regression, but these terms can be defined for the classification scenario as well, as shown

directly below.
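Before turning to that probability model, the exhaustive split search itself can be made concrete. The sketch below finds the single yes/no question (variable and threshold) that minimizes the weighted Gini impurity of the two resulting groups; it is a schematic stand-in for the CART search described above, not the CART software.

    import numpy as np

    def gini(y):
        # Gini impurity of a 0/1 outcome vector: 2 * p * (1 - p).
        if len(y) == 0:
            return 0.0
        p = np.mean(y)
        return 2.0 * p * (1.0 - p)

    def best_split(X, y):
        # Exhaustive search over all variables and observed thresholds for the
        # question "Is X_j <= t?" that best separates the dead from the living.
        n, p = X.shape
        best = (None, None, np.inf)
        for j in range(p):
            for t in np.unique(X[:, j]):
                left, right = y[X[:, j] <= t], y[X[:, j] > t]
                score = (len(left) * gini(left) + len(right) * gini(right)) / n
                if score < best[2]:
                    best = (j, t, score)
        return best[0], best[1]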

At this point it is useful to state the probability model behind the CART endeavor


explicitly, as the same model underlies the approach developed in this

dissertation. A general model for predicting any number of classes is set out in detail by

Breiman et al. (1984), but this model may be unfamiliar to the reader. Since it is central to

the method developed here, a more context-specific, two-class version is briefly presented

below with some simplifications in the notation, and with a concrete example illustrated by

Figure 1.1.

Let X be the space of all p-vectors x (a particular vector x being data on p variables X_1, ..., X_p for a single, randomly chosen respondent from the population), and let Y be the set {0,1} (here with Y = 0 corresponding to a survivor, and Y = 1 to a decedent, as above). Now consider the space of all couples, X × Y, and assume that for any subset (A, j), where A ⊂ X and j ∈ Y, there is some probability Pr(A, j) that a respondent chosen at random from the population has an X that is in A and a value of Y equal to j.14 For example, consider Figure 1.1 again. Suppose that the number of variables is p = 2, and define X' as the 2-dimensional Euclidean space containing the plot itself, which contains all possible combinations of values for body-mass index and blood pressure. Now denote the subset of X' delimited by the rectangle in the lower right corner of the graph as A', and let j' = 1. The assumption, in concrete terms,

is that there is some probability Pr(A', j') that a respondent drawn at random from the

14 Of course the actual sample design was much more complicated than a simple random sample, as only some of the respondents (the New Haven and Duke samples) were systematically random samples; respondents from East Boston and Iowa County were chosen through community censuses. It is hoped that these latter two samples of convenience can be considered a reasonable facsimile of a random sample from some general (but still meaningful) population of elderly. It would have been possible to use only the New Haven dataset for this analysis, but since this would have ignored a large amount of data drawn from respondents outside of New Haven, it probably would have yielded a set of results even less relevant to the national population of elderly. Chapter 3 considers the EPESE sample designs in detail. Chapter 7 discusses possibilities for applying the present method to some existing datasets based on random samples which represent the national population of elderly.


population subsequently dies and has values of blood pressure and body-mass index in this

region. Similarly, such a probability was assumed to exist for any subset of X × Y.

Then if we have some predictor or classifier P(x) that deterministically assigns zero

or one to a respondent based on x (here it is a function that simply partitions X into rectangular subsets and classifies a respondent as zero or one according to which subset of X the respondent belongs), we can define the probability that the predictor misclassifies a respondent as the probability that the predicted value of Y given X is zero when the respondent actually dies, or vice-versa. So define the true prediction error of a predictor, PE^true, as Pr(P(X) ≠ Y), the probability that the predictor misclassifies the randomly chosen respondent. For example, if a predictor P'(x) were formed which classifies all respondents in A' as dead and all respondents in the complement of A' as survivors, the true prediction error would be the probability that X ∈ A' and the respondent survives, or X ∉ A' and the respondent dies. We would like

to estimate this prediction error for any given classifier (e.g., Equation 2.3), and we would

also like to construct a classifier that minimizes this error.

Define Pr(A | j) as Pr(A, j)/Pr(Y = j), the probability that X ∈ A given that Y = j. With respect to Figure 1.1, Pr(A' | j') is the probability that a randomly chosen dead respondent is in the lower-right rectangular region A'. Then assume that Pr(A | j) has the probability density f_j(x), so that

\Pr(A \mid j) = \int_A f_j(x)\, dx    [2.9]

To see this assumption via Figure 1.1, lay it flat on a table and imagine a third, vertical axis.

For respondents who died, visualize some surface over this plane with a height that is defined


by the function f_1(x) (and similarly imagine another surface for the survivors). The

interpretation of this surface is that the probability that a deceased respondent lies in the

rectangle (and likewise for survivors) is equal to the volume under the surface that lies

directly over the rectangle. An approximation to this surface can be estimated from the data

by breaking the surface into equal squares and estimating a two-dimensional histogram, as

shown in Figure 2.3 for the survivors, and Figure 2.4 for the decedents. (The corners on

these histograms have been rounded, so that it appears as a smooth surface rather than a pile

of blocks). The volume under each histogram has been normalized to one.

Now, by the definitions and assumptions above,

\Pr(P(X) = Y) = \Pr(P(X) = 0 \mid Y = 0)\Pr(Y = 0) + \Pr(P(X) = 1 \mid Y = 1)\Pr(Y = 1)

= \int_{x: P(x) = 0} f_0(x)\Pr(Y = 0)\, dx + \int_{x: P(x) = 1} f_1(x)\Pr(Y = 1)\, dx

= \int \left[ I(P(x) = 0) f_0(x)\Pr(Y = 0) + I(P(x) = 1) f_1(x)\Pr(Y = 1) \right] dx

where I(·) is the indicator function that the expression in parentheses is true. For a given x, it can be seen upon inspection that

I(P(x) = 0) f_0(x)\Pr(Y = 0) + I(P(x) = 1) f_1(x)\Pr(Y = 1) \le \max_j \left[ f_j(x)\Pr(Y = j) \right]

where max_j[f_j(x)Pr(Y = j)] is that value of f_j(x)Pr(Y = j) obtained by choosing the j which maximizes this expression. The above inequality is an equality if indeed P(x) is equal to this maximizing value of j. Thus it is shown that for any classifier P(X),

\Pr(P(X) = Y) \le \int \max_j \left[ f_j(x)\Pr(Y = j) \right] dx


Figure 2.3 - Distribution of survivors by blood pressure and body mass index (two-dimensional histogram; vertical axis: % of survivors).


Figure 2.4 - Distribution of decedents by blood pressure and body mass index (two-dimensional histogram).


which implies that the lowest prediction error a classifier can achieve is

PE^{Bayes} = 1 - \int \max_j \left[ f_j(x)\Pr(Y = j) \right] dx    [2.10]

known as the Bayes misclassification rate.15

Define the subset A_j of X as all x such that f_j(x)Pr(Y = j) = max_l[f_l(x)Pr(Y = l)], for j = 0, 1. Now suppose we consider the following rule for a classifier: classify all x in A_j as j. That is, classify x as that j for which f_j(x)Pr(Y = j) is maximized, a maximum-likelihood strategy known as Bayes' rule, or P^Bayes(x). As demonstrated above,

\Pr(P^{Bayes}(X) = Y) = \int \max_j \left[ f_j(x)\Pr(Y = j) \right] dx

so Bayes' rule achieves the optimal Bayes' misclassification rate.

In relation to Figure 1.1, this rule suggests that for any x, we can estimate the product

of the height of the surface f_0(x) at x and the probability that the respondent survives. Then just compare this with the estimated product of the height of f_1(x) and the probability that the

respondent dies. Bayes' rule simply classifies the respondent as alive or dead according to

which product is greater.

Suppose one fixes Pr(Y = 0) = Pr(Y = 1) = 0.5. Then one simply classifies x according to which is greater, f_0(x) or f_1(x). For example, consider the approximations to these functions as pictured in Figures 2.3 and 2.4. Figure 2.5 shows a contour plot of the ratio of the approximated functions, f_1(x)/f_0(x), at each point in the 2-dimensional space X'. Under

Bayes' rule, one classifies all x in those regions where this ratio is less than one as survivors,

15 This proof is drawn directly from Breiman et al. (1984).


Figure 2.5 - Contour plot, ratio of survivors' distribution to decedents' distribution, by systolic blood pressure (mm of Hg) and body mass index.


and all x in those regions where the ratio is greater than one are classified as deceased. One

can see from the contour plot that the area where the ratio is greater than one consists mainly

of the area on the bottom right and center, along with an area on the lower left part of the plot (excluding the bottom left corner itself), and an "island" on the left side at a body mass of about 3.0. Now suppose that Pr(Y = 1) = 0.333, and therefore that Pr(Y = 0) = 0.666. Then Bayes' rule classifies all x in those regions where f_1(x)/f_0(x) > 2 as dead. This is primarily the region in the lower right corner of the plot, almost exactly that region delineated in Figure 1.1. There also seems to be a high risk area on the far left side of the graph, and toward the bottom center, but the number of respondents in these areas is quite small. If Pr(Y = 1) = 0.25, so that Pr(Y = 0) = 0.75, Bayes' rule classifies all x in those regions where f_1(x)/f_0(x) > 3 as dead. Judging from Figure 2.5, this area consists of the combined small

triangular regions on the far, right-hand edge of the plot, the far left edge, and the bottom,

right edge. These areas contain even fewer respondents.

Of course, estimating Pr(Y = j) is easy here. It is simply the proportion of class j in the sample; the problem centers around the estimation of f_j(x). For example, linear discriminant analysis tries to accomplish this by assuming f_0(x) and f_1(x) are normal

distributions with a common variance-covariance matrix; kernel density estimation fits them

nonparametrically.
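A minimal sketch of this histogram-based version of Bayes' rule for two predictors (as in Figures 2.3 through 2.5); the bin count, the assumed prior, and all names are illustrative.

    import numpy as np

    def bayes_rule_2d(X, y, x_new, bins=10, prior1=0.5):
        # Classify x_new as dead (1) wherever f1(x)*Pr(Y=1) exceeds f0(x)*Pr(Y=0),
        # with f0 and f1 estimated by two-dimensional histograms on common bins.
        edges = [np.histogram_bin_edges(X[:, k], bins=bins) for k in (0, 1)]
        f0, _, _ = np.histogram2d(X[y == 0, 0], X[y == 0, 1], bins=edges, density=True)
        f1, _, _ = np.histogram2d(X[y == 1, 0], X[y == 1, 1], bins=edges, density=True)
        # Locate each new point's bin and compare the two weighted densities.
        i = np.clip(np.digitize(x_new[:, 0], edges[0]) - 1, 0, bins - 1)
        j = np.clip(np.digitize(x_new[:, 1], edges[1]) - 1, 0, bins - 1)
        return (f1[i, j] * prior1 > f0[i, j] * (1.0 - prior1)).astype(int)

    # prior1 = 0.333 reproduces the rule "classify as dead where f1(x)/f0(x) > 2".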

Consider again the problem of deciding how large a classification tree should be.

With unit misclassification costs, the tree divides the learning sample space into disjoint

rectangles, and classifies each rectangle as zero or one according to whether most of the

respondents in the rectangle are alive or dead. With respect to Figure 1.1 then, we would


classify respondents in the lower right rectangle as dead only if more than 50% of these

respondents were observed to be dead, else we would classify them as survivors. The larger

the tree is (i.e., the more questions asked), the more of these rectangles it defines, and the

more the number of respondents in each rectangle diminishes. Denote the L disjoint

rectangles defined by a tree as S_1, ..., S_L, and let y_l, for l = 1, ..., L, represent the class (zero or one) assigned to the lth partition (equal to the observed class majority in S_l). The misclassification error for such a tree (call the predictor P^tree(x), and its error PE^tree) may be

expressed as

PE^{tree} = \Pr(P^{tree}(X) \neq Y) = 1 - \sum_l \Pr(X \in S_l, Y = y_l)

= 1 - \sum_l \left[ I(y_l = 0)\Pr(X \in S_l, Y = 0) + I(y_l = 1)\Pr(X \in S_l, Y = 1) \right]

where I(·) is again the indicator function. Let y_l^true be the true class majority in S_l, that is, the value of j that maximizes Pr(Y = j | X ∈ S_l). The above expression can then be written as

PE^{tree} = \Pr(P^{tree}(X) \neq Y) = 1 - \sum_l \max_j \Pr(X \in S_l, Y = j) + \sum_l I(y_l \neq y_l^{true}) \left| \Pr(X \in S_l, Y = 0) - \Pr(X \in S_l, Y = 1) \right|    [2.11]

Note that the first part of the right side of this equation, or

PE_1^{tree} = 1 - \sum_l \max_j \Pr(X \in S_l, Y = j)    [2.12]

is very similar to the Bayes misclassification rate as defined by Equation 2.10. In fact, as

Breiman et al. describe it, Equation 2.12 "forms an approximation to the Bayes rate

constructed by averaging the densities over the rectangles". They then define the bias of the

Page 94: A Method for Predicting Death and Illness in an Elderly Person

83

tree classifier as PE_1^tree - PE^Bayes, and go on to argue that as L increases (implying that the rectangles S_1, ..., S_L become not only more numerous but smaller as well) this bias decreases rapidly for small L, falls more gradually as L grows, and that it eventually approaches zero.16

This is not a particularly intuitive result, but one can see how, as the rectangles become

smaller, the approximation to the density on each rectangle is more likely to equal the true

density on average, but with more instability since fewer observations are used in the

estimate (not unlike the regression example above, where the act of increasing the number of coefficients in the equation allowed the fitted line to fit the underlying function more easily

on average, while simultaneously causing the fitted function to "bounce around" more

severely).

To see the increased error due to this type of noise, examine the second part of the

right side of Equation 2.11, or

PE_2^{tree} = \sum_l I(y_l \neq y_l^{true}) \left| \Pr(X \in S_l, Y = 0) - \Pr(X \in S_l, Y = 1) \right|    [2.13]

which is described by Breiman et al. as a "variance-like" term. The contribution of this term

to the error can be more intuitively understood. Note that, because of the indicator function,

this type of error is caused only by those rectangles for which y_l ≠ y_l^true; that is, only the

rectangles that are misclassified (where the observed class majority differs from the true class

majority) contribute any error. For Figure 1.1 then, the lower right rectangle would only

contribute error to this term if more than 50% of the persons in this rectangle were observed

16 This type of bias should not be confused with the downward bias in the prediction error discussed in the first chapter, which is defined by the difference between the true prediction error of a question set and the expected value of the test set estimate of prediction error.


to die, while in fact the true proportion of deaths in this region was less than 50% (or vice

versa). It can be seen intuitively that as the observed number of respondents in this rectangle

grows larger (and to the degree that the class proportions differ from 50%) this type of

misclassification is much less likely to occur. Thus, as the number of rectangles in a tree

increases (implying that the rectangles contain fewer respondents), the variance-like error

term becomes more prominent, gradually increasing, while the bias error term goes to zero.

However, the true error of the classifier is unknown, so it is not obvious what number of rectangles provides the optimal tradeoff between bias and variance. The problem is that

the learning set misclassification error for a particular predictor is not an honest estimate; it

is always possible to achieve a reduction in this estimate of the prediction error by adding

another rectangle (at least until there are so many rectangles that every rectangle contains

only one respondent). What is required, once the classifier has been constructed, is a large

number of additional, independent observations drawn from the same distribution, so that

the probability of misclassification can be estimated directly as the proportion of

misclassified respondents in the new sample.

It is for this reason that a test set sample is selected out of the original data set with

a simple random sample and set aside while the classifier is constructed with the learning set.

The test set respondents mimic a sample of respondents drawn independently from the same

distribution (although technically the two datasets are not independent), and so the error can

be estimated more honestly. Then the bias-variance tradeoff can be seen explicitly by

applying a sequence of variously-sized models to the test set, as in Figure 1.2; invariably, the

prediction error is seen to fall rapidly as the model first enlarges (indicating the decrease in


bias). It then bottoms out and increases gradually as the model size becomes too large

(suggesting that the variance-like error term is rising). Thus the optimal model size is

gleaned (or a "standard error rule" may be used to determine the minimum size), and the test

set estimate of prediction error for this model is used as the estimate of the model's accuracy.

Since the test set is being used both to select the model size and to estimate the error, this

estimate of the error is not completely unbiased (i.e., E(PE^TS) ≠ PE^true). However, as

Breiman et al. have demonstrated, extensive computer simulations of the CART method of

classification suggest that this bias is small if the learning set is large and the optimal model

is chosen from a small sequence of nested submodels.
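Schematically, the selection step common to CART and to the present method can be written as follows; learn_error, test_error, and drop_one are hypothetical callables standing in for the machinery of Chapter 4, not reproductions of it.

    def backward_deletion_path(full_model, learn_error, drop_one):
        # Build the nested sequence of submodels by deleting, at each stage, the
        # component whose removal increases learning-set error the least.
        path, model = [full_model], full_model
        while len(model) > 1:
            model = min(drop_one(model), key=learn_error)
            path.append(model)
        return path

    def select_model(path, test_error):
        # Choose the member of the nested sequence with the lowest test-set error;
        # its size is the suggested model size.
        return min(path, key=test_error)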

The method of model selection (but not the method of model building) used for the

models developed in this dissertation is nearly identical to the CART method. Consider

Figure 2.6, which shows a graphical depiction of the present method (in contrast to Figure

2.2, which depicts the tree method.) Again, one starts with a large set of binary splits that

divide the space X into "rectangular" regions. Each column of questions in Figure 2.6, for example,

defines a particular rectangular region (actually a cube). The only difference between these

regions and those defined by a tree's terminal nodes is that here they are not defined

disjointly, as they are in CART. However, there always exists such a disjoint set of

rectangles (denoted as S_1, ..., S_L above) corresponding to the set of regions defined by the

questions, since here any area that is in the union of two overlapping rectangles is always

classified as high risk. To see that this is so, note that it is always possible to convert the

question sets developed here into tree form, e.g., in Figure 2.6 by making the second and

third columns of questions into branches along the questions in the first column of questions.

[Figure 2.6 - Example of question sets combined with OR. Each column of questions ("If ... then high risk") classifies a respondent as high risk, and the columns are joined by OR.]

As the number of questions in each column increases, and as the total number of columns

increases, so does the number of disjoint regions (L) required to define the same partition.

Moreover, it is generally possible to define the regions S_1, ..., S_L as part of a nested

subsequence of such partitions, in much the same way a subsequence of nested tree models

defines them. Thus, the method of applying backward deletion to the questions to obtain a

subsequence of differently-sized partitions (described in greater detail in Chapter 4) is very

much like CART's pruning to obtain a set of smaller and smaller tree-based partitions.

Therefore, it is argued that the bias in the test-set estimates of prediction error attached to

these partitions is probably small. (Again, the principal difference between the two methods

lies not in the model selection or estimation of error, but in the building of the "full" model,

which was completed by use of the learning set only.)
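
To make the structure of Figure 2.6 concrete, the Python sketch below (with hypothetical data structures and variable names; the example columns are illustrative, not the fitted question sets) represents a question set as a list of columns and classifies a respondent as high risk when every indicator in at least one column equals one:

    def classify_question_set(question_set, answers):
        """OR-of-ANDs classifier over binary question indicators.
        question_set : list of columns; each column is a list of question keys.
        answers      : dict mapping question key -> 0/1, where 1 means the
                       respondent gave the "bold" (high-risk) answer.
        Returns 1 (high risk) if every question in at least one column was
        answered in bold; otherwise 0."""
        for column in question_set:
            if all(answers[q] == 1 for q in column):
                return 1
        return 0

    # Hypothetical three-column set in the spirit of Figure 2.6:
    example_set = [["cannot_walk_half_mile", "uses_digitalis"],
                   ["male", "hospitalized_last_year"],
                   ["cannot_bathe_unaided", "diagnosed_heart_failure"]]
    respondent = {"cannot_walk_half_mile": 1, "uses_digitalis": 1, "male": 0,
                  "hospitalized_last_year": 0, "cannot_bathe_unaided": 0,
                  "diagnosed_heart_failure": 1}
    print(classify_question_set(example_set, respondent))  # -> 1 (first column satisfied)
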

To see the model selection explicitly, consider again the contour map in Figure 2.5.

Suppose one assigns a cost of C = 6.1 to the misclassification of deceased respondents; this

is equal to the ratio of survivors to decedents, so the model is equivalent to assuming that

Pr(Y=0) = Pr(Y=1) = 0.5 in the unit cost scenario (as discussed above). Thus, Bayes' rule classifies as dead all x for which f_1(x)/f_0(x) > 1, an area that can be approximated by the area

in Figure 2.5 where the contour is above one. Using the algorithm in Appendix V, a learning

set was drawn at random, a full model with nine questions was constructed, and this model was

pruned to obtain a sequence of submodels. This sequence of submodels was then applied to

the test set, and the test set error associated with each model was estimated. The lowest error

was achieved when three questions were used.

Next, the data were recombined, and the largest full model with nine questions was


constructed with the search algorithm. The classification achieved by this model is shown

graphically in Figure 2.7; the respondents inside the regions on the bottom and right side of

the plot are classified as high risk. When this model was pruned down to three questions,

the classification shown in Figure 2.8 was achieved. As can be seen, the simpler

classification covers nearly the same area of the plot but with fewer questions. According

to the test set estimates it would probably have a lower prediction error than the larger model

when applied to a new sample from the same distribution.

2.6 Other methods of classification

Other commonly used methods of classification, as mentioned above, are linear

discriminant analysis, kernel density estimation, and cluster analysis. All these methods can

be related to Bayes' rule as a guide for classification based on estimates of the densities f_0(x) and f_1(x), defined above. In the two-class problem, Fisher's method of linear discriminant analysis assumes they are normal with mean vectors x̄_0 and x̄_1 and a common variance-covariance matrix Σ̂. Then a p-vector of discriminant coordinates can be calculated via

    \hat{\alpha} = \hat{\Sigma}^{-1} (\bar{x}_1 - \bar{x}_0),

which can be interpreted as the linear combination of the p variables x_1, ..., x_p that maximizes

the ratio of the variance between the classes to the variance within the classes. 17 To use this

linear combination to classify a given observed vector of data X, one first computes the value

of this function at X, i.e.,

17. For more extensive discussions of this technique, see Gnanadesikan (1977) or Mardia et al. (1979). More modern versions of discriminant analysis (and other associated methods of classification) are presented by Hastie and Tibshirani (1994, 1996), and Hastie et al. (1995).

[Figure 2.7 - Division of contour plot, 9-question model, by systolic blood pressure and body mass index. x-axis: systolic blood pressure (mm of Hg); y-axis: body mass index.]

[Figure 2.8 - Division of contour plot, 3-question model, by systolic blood pressure and body mass index. x-axis: systolic blood pressure (mm of Hg); y-axis: body mass index.]

    z = X'\hat{\alpha},

so that z is a scalar, representing the projection of the vector X from p-space onto R^1. Then the respondent is typically classified as zero or one according to whether z is closer to the projection of x̄_0 (equal to x̄_0'α̂) or the projection of x̄_1 (equal to x̄_1'α̂).
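
The following short Python sketch (using numpy; a stylized illustration under the equal-covariance assumption, not the code used for the results in Chapter 5) computes the discriminant coordinates α̂ = Σ̂^(-1)(x̄_1 - x̄_0), the z score for each respondent, and the midpoint classification rule just described:

    import numpy as np

    def fisher_lda(X, y):
        """Fisher's linear discriminant for a two-class problem.
        X : (n, p) array of predictors; y : length-n array of 0/1 labels.
        Returns (alpha, cutoff): the discriminant coordinates and the midpoint
        between the two projected class means."""
        X0, X1 = X[y == 0], X[y == 1]
        mean0, mean1 = X0.mean(axis=0), X1.mean(axis=0)
        # pooled ("common") variance-covariance matrix
        n0, n1 = len(X0), len(X1)
        pooled = ((n0 - 1) * np.cov(X0, rowvar=False) +
                  (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
        alpha = np.linalg.solve(pooled, mean1 - mean0)
        # mean1 projects higher than mean0 by construction, so the midpoint
        # rule classifies z above the cutoff as class 1 (death)
        cutoff = 0.5 * (mean0 @ alpha + mean1 @ alpha)
        return alpha, cutoff

    def classify(X, alpha, cutoff):
        """Classify as 1 (death) when the projection z = X'alpha exceeds the cutoff."""
        z = X @ alpha
        return (z > cutoff).astype(int)
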

For a nonstatistician, the interpretation may require some illustration. Consider

Figure 2.9, which shows two "rounded" histograms. Instead of showing rectangular bars,

the centers of the top of each bar are connected with lines, cutting or "rounding" the corners

of the bars to obtain a more natural-looking distribution. One is a histogram of survivors,

the other a histogram of deaths; both are scaled to an area of one. The x-axis is scaled from

zero to 1,000, and corresponds to the range of z scores for all 10,294 respondents as

calculated by a linear discriminant analysis of 15 variables (see Appendix V). (The method

for determining the model size, estimating α̂ and calculating these scores is discussed in

Chapter 4, and the results are presented in great detail in Chapter 5.) The height of each of

these distributions may be thought of as a crude estimate of each of the densities f_0(x) and f_1(x). Here the space X is the 15-dimensional space corresponding to the space of possible answers to the 15 survey questions used in the model (called X^15). Specifically, the height of the histogram at z estimates the height of f_0(x) and f_1(x) at all those points x in X^15 such that the Fisherian linear combination of x (or x'α̂) is equal to z. For example, let z = 500, and let A_500+ be that subset of X^15 such that x'α̂ > 500; that is, let A_500+ be that portion of the space of question answers such that the z score computed from these answers is greater than 500.

Then Pr(z > 500) for the surviving respondents is equal to

[Figure 2.9 - Distribution of deaths and survivors by z index; area under each curve scaled to 1. x-axis: index (discriminant variable), 0 to 1,000; values marked at z = 289, 369, 449 and 625.]

    \Pr(X \in A_{500+} \mid Y = 0) = \int_{A_{500+}} f_0(x) \, dx

and likewise for f_1(x) and the dead respondents. By Figure 2.9, these probabilities can be seen as estimated by the areas under the histograms to the right of 500, equal to 6.6% and 34% for survivors and decedents respectively. The projections of the means onto z-space can

be seen at z equal to 289 and 449. Thus, according to the usual classification rule, a

respondent at x is classified as dead if z is closer to 449 than 289; that is, x is classified as

zero if z < (449+289)/2 = 369, and as one if z > 369. Since the deaths and the survivors are

assumed to be normally distributed with equal variance-covariance matrices, it follows that

the two distributions of z-scores should be normal with equal variances. This implies that the midpoint between the two projected means is also that z at which f_0(x) is equal to f_1(x), and that for z < 369, f_0(x) > f_1(x), while for z > 369, f_0(x) < f_1(x). Of course, it is obvious from

Figure 2.9 that the two groups are probably not distributed with equal variances, nor are the distributions perfectly normal. Nonetheless, 369 does seem to be quite close to that value of z at which f_0(x) is equal to f_1(x). This seems to be a decent rule for distinguishing

between the groups, as 77% of survivors lie to the left of this cutoff, and 65% of deaths lie

to the right.

However, recall that the Bayes' rule suggests that the class assignment with minimal

error is the class j which maximizes f_j(x)Pr(Y = j). This classification rule seems to be classifying x as the class j that maximizes f_j(x). The problem is that this assignment rule assumes that Pr(Y=1) = Pr(Y=0); here, it assumes there are an equal number of survivors


and deaths in the population, and under this assumption the rules are equivalent. However,

the data suggest that these proportions are closer to 86% and 14%, a ratio of 86/14 = 6.14.

Thus Bayes' rule suggests that x should be classified as one when the ratio f_1(x)/f_0(x) is

greater than 6.14, which seems to correspond to the region where z > 625 or so. This is also

observed to be the lowest score at which the death rate exceeds 50%. Thus, Bayes' rule for

this scheme can be seen to be equivalent to the rule of classification used in the method

above with unit costs, where the rule is to classify a rectangular region of the space as "dead"

when the proportion of deaths in that region is higher than 50%, and as "alive" otherwise.
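
The equivalence can be written out directly. A brief sketch of the argument just made, writing π_0 = Pr(Y=0), π_1 = Pr(Y=1) and C for the relative cost of misclassifying a death (the π notation is introduced here for compactness):

    \text{classify } x \text{ as dead} \iff C\,\pi_1 f_1(x) > \pi_0 f_0(x)
                                       \iff \frac{f_1(x)}{f_0(x)} > \frac{\pi_0}{C\,\pi_1},

    \text{so with } \pi_0 = 0.86,\ \pi_1 = 0.14:\quad
    C = 1 \Rightarrow \text{threshold } 6.14\ (z > 625 \text{ or so}), \qquad
    C = 6.14 \Rightarrow \text{threshold } 1\ (z \approx 369).
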

Note that the costs of misclassification can easily be incorporated into Bayes' rule by scaling up the estimate of Pr(Y = 1) directly in proportion to the cost of misclassifying a

death as a survivor. Thus, assuming the defined relative misclassification cost of five is

mathematically equivalent to assuming that deaths in the data were undersampled by a factor

of five. So the cost can be incorporated directly into the error scheme simply by weighting

the dead respondents by a factor of five. (To see this via Equation 2.2, note that we obtain

the same error by replicating the data on each dead respondent five times over and applying

Equation 2.1.) For the observed proportions of 14% deaths and 86% survivors, then, Bayes'

rule corresponds to the simple linear discriminant rule of classifying a respondent according

to the relative distances to the projected means (i.e., according to the cutoff of that z where

f_0(x) is equal to f_1(x)) when the relative cost of misclassification is set at 6.14 (suggesting the cutoff at z = 369). It appears that the risk of death (proportionate to the ratio f_1(x)/f_0(x)) is

mostly monotonically increasing as a function of the discriminant score, as it should be under

the normality assumption. So scaling this cost between one and 6.14 is roughly equivalent


to scaling the z cutoff between 625 and 369. Then the same tradeoff of sensitivity and

specificity is observed as when scaling the cost of misclassification in the question scheme.

As such, the area under the ROC curve can be estimated directly by summing up the area

under the sequence of estimated TPF by FPF coordinates obtained by setting the cutoff at

each observed quantile of z. (The area under the ROC curve for this particular model was

estimated at 77.4% on the test set. However, since much knowledge about the test dataset

was used when recoding categorical variables and treating missing data, this is probably

an optimistic estimate.)
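
A minimal sketch of this ROC construction (hypothetical Python, not the code behind the 77.4% figure): sweep the cutoff over the observed z scores, record the true-positive and false-positive fractions, and sum the area by the trapezoid rule.

    import numpy as np

    def roc_area(z, y):
        """Area under the ROC curve for discriminant scores z and 0/1 outcomes y.
        The cutoff is set at each observed score (quantile of z), the TPF and
        FPF are computed at that cutoff, and the area is summed by trapezoids."""
        z = np.asarray(z, dtype=float)
        y = np.asarray(y, dtype=int)
        n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
        tpf, fpf = [0.0], [0.0]
        for c in np.unique(z)[::-1]:            # highest cutoff first
            pred = z >= c
            tpf.append((pred & (y == 1)).sum() / n_pos)
            fpf.append((pred & (y == 0)).sum() / n_neg)
        area = 0.0
        for i in range(1, len(fpf)):            # trapezoid rule over the (FPF, TPF) points
            area += 0.5 * (tpf[i] + tpf[i - 1]) * (fpf[i] - fpf[i - 1])
        return area
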

In this way, a connection between the linear discriminant analysis and the question

method can be established. For example, suppose Question Set B was converted to a linear

model. One way to do so is to form three indicator random variables, each corresponding

to one of the three question subsets B.1, B.2 or B.3. For example, to obtain an indicator

random variable for question set B.1, first form two indicators corresponding to the two

questions: let I_B.1.1 be zero if the respondent does not answer question B.1.1 in bold (i.e., if they can walk half a mile), and let it equal one if the answer is bold. Define I_B.1.2 similarly for the second question (one if the respondent uses digitalis, zero if not), and define the indicator I_B.1 = I_B.1.1 · I_B.1.2, the product of the two question indicators. Then define I_B.2 similarly as I_B.2.1 · I_B.2.2 · I_B.2.3, and likewise define I_B.3 as I_B.3.1 · I_B.3.2. Now for each respondent, arrange the vector x_B as (I_B.1, I_B.2, I_B.3), and for all respondents grouped together, estimate a common variance-covariance matrix Σ̂_B. Then, as above, one can calculate the linear discriminant coordinates α̂_B and estimate a discriminant score z_B for each respondent via the vector product x_B'α̂_B. The question set strategy of classifying a respondent as dead when all


the answers to any one of the three subsets are in bold is equivalent to classifying x_B as zero when z_B = 0, and as one when z_B > 0, since any respondent classified as alive by the question strategy has x_B = (0,0,0). This crude split attempts to approximate Bayes' rule with a relative misclassification cost of five. (This assumes the α̂_B are all positive; since the vector is only determined up to a multiplicative constant, the direction of the signs of the α̂_B is arbitrary; it only

matters whether they are all in the same direction, which was certainly true here.) Thus the

two seemingly disparate strategies are doing something quite similar, once a common space

is defined. Both classify as dead those regions of the space where the estimated ratio

f_1(x)/f_0(x) is higher than the ratio 86%/(C · 14%), where C is the cost of misclassifying a death

relative to misclassifying a survivor. The main differences between the methods lie in how

the space is chosen and how the partition of regions is formed.
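
In code, the conversion just described might look like the following sketch (Python with numpy; the answer coding and most variable names are hypothetical stand-ins for the Question Set B items, and fisher_lda refers to the LDA sketch given earlier in this section):

    import numpy as np

    def subset_indicator(bold_answers):
        """I_B.j: the product of the question indicators in subset B.j
        (equal to 1 only if every answer in the subset is 'bold')."""
        return int(all(bold_answers))

    def question_set_B_vector(respondent):
        """x_B = (I_B.1, I_B.2, I_B.3) for one respondent.
        `respondent` is a dict of 0/1 indicators; B.1 uses the two items named
        in the text (cannot walk half a mile, uses digitalis), while the B.2
        and B.3 keys below are placeholders."""
        i_b1 = subset_indicator([respondent["cannot_walk_half_mile"],
                                 respondent["uses_digitalis"]])
        i_b2 = subset_indicator([respondent[k] for k in ("B2_q1", "B2_q2", "B2_q3")])
        i_b3 = subset_indicator([respondent[k] for k in ("B3_q1", "B3_q2")])
        return np.array([i_b1, i_b2, i_b3])

    # Stack x_B over all respondents and reuse the earlier LDA sketch:
    #   X_B = np.vstack([question_set_B_vector(r) for r in respondents])
    #   alpha_B, _ = fisher_lda(X_B, deaths)
    #   z_B = X_B @ alpha_B   # question-set rule: classify as dead when z_B > 0
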

2.7 Existing applications of the models

Here the literature on mortality and morbidity prediction was grouped into two very broad categories: 1) true prediction research, typically with in-hospital patients or for risk-adjustment to conduct quality assessments of hospitals; and 2) causal or substantive analyses that interpret predictive models substantively. Given the theoretically-based discussion above,

this literature can be analyzed on the basis of how well the researchers dealt with the issue

of model fitting from a bias/variance perspective. Unfortunately, the use of test set methods

to deal with this issue is not common, and in practice very little is said about the

bias/variance tradeoff at all.

Consider again the Marshall et al. paper discussed in section 2.3, which is an example

of the first type of research. The purpose of their research is to develop prognostic tools to


assist physicians in medical decision making. The subjects of this study were 12,712 patients

who underwent coronary artery bypass grafting between 1987 and 1990, and the outcome

was post-operative death within 30 days after the operation. The variables in their data

included age, body surface area, the usage of various medications (including Digoxin),

previous heart failure, diabetes, cerebrovascular disease, angina, hypertension, and many

other ailments. The authors compared a handful of various statistical methods (listed above),

and a test set was used to estimate the misclassification error associated with each. As

mentioned, there was clear overfitting in the larger models, as the test set accuracy of the full

logistic regression model (containing some 33 variables) was lower than the accuracy of the

abbreviated model with six variables (chosen via backwards deletion with the AIC criterion);

the area under the ROC curve was estimated as 71.0% for the smaller model, and 69.4% for

the larger model (with missing values excluded). They also found that cluster analysis, in

combination with logistic regression, formed the most powerful model (area under the ROC

curve = 71.1 %). The classification trees, however, did not perform nearly as well as the

other methods, with an area under the ROC curve of 65.5% (which was the lowest of all

eight methods examined). In all cases, the learning set error was observed to be somewhat

lower than the test set error estimates, indicating the usual overfitting problems.

Unfortunately, apart from the classification tree method and the two different-sized

regression models, there were no other attempts to select model sizes based on test set error

estimates. Thus, it is not clear to what extent overfitting may have been a problem in

selection of the model sizes for the other methods examined.

Another recent example of excellent methodology is an article by


Normand et al. (1996). The authors developed models for predicting in-hospital mortality within 30 days of admission in more than 14,000 Medicare patients who had been admitted with acute myocardial infarction. The patients had been admitted to all acute care hospitals

in Alabama, Connecticut, or Iowa. The purpose of the research was the development of risk-

evaluation models for hospital quality evaluation. This is an increasingly important area of

research. For obvious reasons, researchers would like to be able to assess the ability of a

hospital to deliver health care effectively. One important, unambiguous outcome by which

a hospital's quality may be measured is the mortality of its patients. However, one must

adjust for the fact that different hospitals (with different geographic locations) serve very

different populations. Patients admitted to one hospital may be at much greater risk than

patients admitted to another, prior to any hospital contact at all. Thus to compare mortality

rates across hospitals fairly, it is necessary to adjust for these differences in risk-at-admission.

The process of doing so can be expensive, and may require the gathering of data on many

variables. Creating efficient predictive models that can accurately adjust for patient risk-at-admission without requiring a large number of variables would therefore be a highly cost-effective solution.

These researchers used a learning sample of 10,936 patients, and a test sample of

3,645 patients to estimate model error. They started with a large number of candidate

predictor variables, including age, heart failure, cancer, functional status, laboratory

measurements (e.g., albumin, creatinine, serum urea nitrogen), diagnostic test results, and

anatomic location of the myocardial infarct. Using stepwise logistic regression on a learning

sample in conjunction with a rather complicated method of variable selection to minimize


overfit, a model was developed with some 30 variables. However, the important step in

assessing the accuracy of the model was the validation of the equation on a test sample,

wherein the area under the ROC curve was estimated as 78% (compared with 79% in the

learning sample). Therefore, the researchers could be sure that their model was not

dominated by variance error.

Davis et al. (1995), another example of good methodology, examined in-hospital

mortality for roughly 2,000 patients admitted to Beth Israel Hospital in Boston during 1987

and 1992 for pneumonia or cerebrovascular disease. The purpose was again risk-adjustment

for hospital evaluation. Variables used in their models included laboratory test results (e.g.,

blood urea nitrogen, white blood cell counts), chronic conditions, and functional status.

These authors also used a test set to validate the model error estimates, as well as using it to

refine the models built with the learning set. They used CART on the learning set to select

cutoff levels for some dichotomous variables, and used variables identified by CART in

addition to other variables to build a full logistic regression model with forward stepwise

selection. They then applied the full model to the test set with backward deletion, and

selected out variables that were not significant on test set estimated p-values. They also

computed the area under the ROC curve using the test set, as well as computing R2 for some

models. The most interesting finding of the final models, despite the very different dataset,

was that the functionality variables (very similar to the ADL's presented in the question sets

here) were the best predictors of mortality, superior even to most of the laboratory results!

This sort of result suggests that the question sets presented in Appendix I might also serve

to classify in-hospital patients with some fair degree of accuracy.


These models are not new, and less sophisticated versions have existed for more than a decade. One frequently discussed model for assessing patient health is the

APACHE (Acute Physiology and Chronic Health Evaluation) index, of which there are

several variations. One recent example is seen in a paper by Iezzoni et al. (1992) which

compared this method with a number of other commonly used, preconstructed models

(including a model simulating that used by the Health Care Financing Administration, and

the MedisGroup scoring system). These researchers used a cross-validated R2 estimate

(which applies a modified, more complicated version of the test set procedure) to assess the

accuracy of models honestly, but the variables in these models were not necessarily selected

on any such basis; rather, they are typically built more through medical knowledge and

intuition. They found that these models typically performed quite poorly when compared

with models built empirically through standard statistical methods (i.e., stepwise logistic

regression).

Many researchers did not resort to any sort of test set estimation, but instead relied

on either intuition, traditional model selection techniques (stepwise logistic regression) or

other more complicated schemes relying on standard statistical tests. Iezzoni et al. (1994)

developed a logistic regression model to predict in-hospital death using 1988 California

hospital discharge data from the Office of Statewide Health Planning and Development,

which included nearly two million admissions. With such a large N, statistical significance

is easy to achieve even for small coefficients, so model selection was not an issue; thus no

test set estimation was required! Grubb et al. (1995) predicted in-hospital mortality after 346

cardiac arrests at a hospital in Edinburgh. In more typical fashion, no test set estimates are


used, and model building is completed via stepwise logistic regression.

The second category of literature considered here includes "substantive" analyses of

mortality, which covers many purposes. From a statistical viewpoint, there is no difference

between the actual models used (e.g., a logistic regression equation is estimated no

differently), but rather how or why the chosen variables are selected is the defining

difference. The typical mode is to use either theory or intuition to start with a full regression

model and then apply standard stepwise regression techniques without estimating model

error on a test set, in which case the models are typically overfit. The problem is that

interpreting such models is often a convoluted exercise, since frequently some regression

coefficients do not reflect the currently championed theoretical paradigms.

For a recent, representative example of theory-based mortality modeling, see Smith

and Waitzman (1994). The authors were interested in the interaction effect of poverty and

marital status on mortality; in particular, it was hypothesized that the effect of being both

unmarried and poor had a greater impact on the risk of death than the sum of the effects of

each single variable. Using the NHANES (National Health and Nutrition Examination

Survey, conducted in 1971, N= 20,279) and the epidemiological follow-up survey (NHEFS)

of respondents in 1982-1984, the researchers analyzed 25-74 year-olds with death (1,935

traced) as the outcome measure. In addition to poverty and marital status, they observed age,

sex, race, smoking status, physical activity, serum cholesterol level, body mass index, and

hypertension.

Here, rather than the more common logistic regression, the authors use the proportional hazards regression model to estimate the effects of these variables (which treats


events and exposure explicitly, thereby dealing with the censoring issues mentioned in Chapter 1). Here, one is not necessarily interested in estimating predictive power. Instead, the

authors wished to test the interaction hypothesis by estimating the coefficient on this variable

when the potential confounders above are controlled. Smith and Waitzman, for instance,

conclude that there were indeed additional risks for nonelderly men subject to both

nonmarriage and poverty, but not for women or the elderly, on the basis that these coefficients were

above a threshold. The problem, however, is that of course one can never control for the

many unobserved factors that must influence mortality, and these variables are in all

likelihood highly correlated with the independent variables considered in the model. The

authors argue that by controlling for the above list of potential confounders, they have

accurately estimated the direct effect of the independent variables involved; but one can easily

imagine that alcohol consumption, or access to health care, say (neither of which were built

into the model) is correlated with both poverty and mortality; by ignoring the potential effect

of these variables, one usually obtains a biased estimate of the coefficient on poverty. This

can lead to serious misinterpretations when coefficients are only taken at face value.

Furthermore, the authors (as do many researchers) often assess the importance of predictor

variables purely in terms of test statistics and p-values (usually on the basis of whether the

latter are smaller than 0.05). These measures are intrinsically driven by sample size and the

number of variables in the model, so they do not provide any real measure of a variable's

social or human significance. It is unfortunate that so many researchers hinge their

conclusions on such numbers.

A large number of "substantive" studies of mortality models are not necessarily


interested in direct causal connections, but rather in the general process-related or systematic

relationship between various variables and death (often from a descriptive or exploratory

perspective). One of the most well studied groups of variables in recent years has been the

various measures of functionality in elderly persons, in search of a description of how these

variables relate to mortality and morbidity. Reuben et al. (1992) studied a group of 282

elderly patients of UCLA faculty physicians by following them for 51 months and measuring

both death and functional status.

In addition to a battery of status questions measuring ADL's and IADL's, mental

health, social activity and self-assessed health perceptions, the researchers observed age,

gender, marital status, race, living arrangements, employment status, bed days, reduced

activity and other items. The model was fit with the standard stepwise backward deletion,

logistic regression methods, without estimating test set error. They too found that the best

predictors of mortality were the measures of functional status, particularly the IADL's (e.g.,

"During the past month, how much difficulty did you have doing work around the house?").

They also found gender, race and living alone to be significant at the 0.05 level. The authors

note the strength of association between variables, but stop short of drawing conclusions

about causality based on the results of the model. See Warren and Knight (1982) for another

study of the relationship between mortality and functionality, using 1,534 impaired, mostly

elderly persons from Canterbury.

Another interesting substantive but descriptive analysis of health status and mortality among the elderly is provided by Berkman et al. (1989). Using a segment of the same data

used here (the New Haven third of the EPESE sample), the authors develop a "grade of


membership" model (GOM) to compare health conditions between whites and blacks in the

sample. The purpose is to identify "clusters" of homogeneity (with respect to health status,

risk factors, and functional status) in what is an otherwise diverse and heterogeneous sample

of elderly. In contrast to the usual regression based approach, the GOM model does not

assume a single dependent variable that is a function of multiple independent variables. Rather, one considers one group of variables of interest to be internal, such as ADL's and chronic conditions (i.e., those variables in which one would like to find homogeneity, not unlike dependent variables), and all remaining variables are considered external (e.g., marital

status or income, not unlike independent variables).

This approach has much in common with the question set method considered here.

The GOM approach, however, has the advantage of recognizing that there are many

dimensions of health status (e.g., more than can be measured by mere survival status), in

addition to dealing with heterogeneity in the population. Future research into the method

developed here will explore the use of the search algorithm in Appendix VI for locating

efficient GOM representations.


Chapter 3 - Data: Established Populations for Epidemiologic Studies of the Elderly

3.1 The EPESE project

The data for this dissertation come from the EPESE project (Established Populations

for Epidemiologic Studies of the Elderly), initiated by the Epidemiology, Demography, and

Biometry Program of the National Institute on Aging in 1980. The EPESE project was an

attempt to monitor four small populations of noninstitutionalized elderly persons

prospectively (elderly are those aged 65 and over; total N = 14,456). Initially, this involved

the administration of a baseline, household interview survey (conducted in the first three

populations in 1981-82) and subsequently continued through annual follow-up surveys, both

telephone and household interviews. The surveys were designed toward four objectives: 1)

to estimate the prevalence of various chronic conditions and impairments; 2) to estimate the

incidence of chronic conditions and impairments; 3) to identify the factors associated with

these conditions; and 4) to measure changes over time in the functioning of the elderly.1 In

short, a fairly large group of elderly persons was observed in great detail at baseline with

respect to many variables. They were also observed over time with respect to several types

of outcomes, including functionality, various illnesses, and death.

Originally, the populations observed consisted of three communities: New Haven,

Connecticut (Yale Health and Aging Project, N=2,812); East Boston, Massachusetts (Senior

Health Project, N=3,809); and Iowa County and Washington County, Iowa (Iowa 65+ Rural

1. See Cornoni-Huntley et al. (1986, 1993).


Health Study, N=3,673), for a total N of 10,294. An additional site was added in 1984 near

Durham, North Carolina (N = 4,162), partly with the intent of oversampling blacks. This

latter dataset is considered here only for the purposes of testing the predictions made by the

models constructed in Chapter 4, which use data from the first three sites. For the remainder

of this chapter, all references to the data refer only to the New Haven, East Boston and Iowa

sites. The North Carolina sample is summarized briefly in Chapter 5, which contains the

results of the validation.

3.2 Composition of the populations

Unfortunately, the sample designs for the three populations were not uniform. In fact

the Iowa and East Boston respondents were chosen through total community censuses, not

random sampling; only the New Haven respondents were chosen randomly, with a stratified

cluster sample. In Iowa, the target population was all noninstitutionalized persons aged 65

and over in Iowa and Washington Counties, an agricultural area in East Central Iowa

consisting of about 16 small towns. A list of elders in the area was compiled by the area's

Agency on Aging, and this list was supplemented with additional names given by local

informants. About 80% of the persons identified responded to the survey. In East Boston,

a total community census was conducted concurrently with the baseline survey, and 84% of

the noninstitutionalized elderly persons enumerated by the census responded to the survey.

The New Haven data, the only randomly designed sample, was a cluster sample stratified by

type of housing: public housing for the elderly, private housing for the elderly, and all other

elderly in the community. The overall response rate for the New Haven elderly was 82%.

(All of this information was taken directly from the EPESE Resource Data Book, Cornoni-


Huntley et al. (1983).)

Of all 10,294 respondents, 6,256 (60.8%) respondents were female, and 4,038 were

male. About 2,874 (28%) respondents were aged 65-69, 2,659 (26%) were 70-74, 3,274

(32%) were 75-84, and only 958 (9%) were 85 or older. The Iowa and East Boston

respondents were almost entirely white, while the New Haven sample was 18.8% black.

Ethnically, the East Boston respondents were described as predominantly Italian, Irish, and

northern/central European in descent, while the New Haven respondents were also largely

Italian, but with a more sizable eastern European contingent, in addition to the black

population. Presumably, the Iowans were mostly northern European. The East Boston

community was described as blue-collar, consisting of low- to middle-income working class

persons. The New Haven area was dominated by educational institutions, manufacturing and

service industries with a level of income well below the state median. The Iowans, as

mentioned, were largely rural and agriculturally oriented with some light industry and retail

in small towns of populations less than 2,000, and one small city with a population of 6,500.

Since the respondents were not chosen with a nationally representative random

sample, it is important to examine the composition of this sample in some detail, and to

compare it with national statistics, particularly with respect to age, sex, and mortality. This

is done below in Section 3.5. First, it is useful to discuss the particulars of the survey

instruments.

3.3 The baseline survey

In all three populations, the initial baseline data were gathered with an extensive

household interview survey. The instruments for the three studies were not identical, but


were highly similar; this dissertation uses only variables which could be uniformly coded

across all three datasets. Table 3.1 lists these variables according to 10 general categories:

Table 3.1 - Variables in baseline survey

Demographic and personal characteristics
    Sex, age, race, educational attainment, marital status, employment status, occupation, work history, income, household composition, religion, number of living children, numbers of friends and relatives

Mental and emotional status
    Does respondent know: own age, date of birth, present date, day of week, who is president, mother's maiden name, telephone number, subtracting 3 from 20
    Depression

Physical functioning
    Respondent reports whether they can perform activities of daily living: walking across the room, bathing, using the toilet, grooming, walking half a mile, doing heavy work, moving large objects, eating, getting dressed, using stairs, stooping/kneeling, handling small objects, etc.
    Does respondent need help from special equipment or a person to do these things
    Hearing, hearing aid usage, vision, eyeglass usage
    Bowel and urinary control
    Self-reported level of general health
    Sleep habits

Chronic conditions and suspected illnesses
    Respondent reports whether they were ever diagnosed with heart failure, cancer, stroke, diabetes, hypertension, fractured hip, other bone fractures
    Was respondent hospitalized overnight for any of these conditions
    Chest pain (Rose questionnaire)
    Possible infarction - severe chest pain, leg pain, pain when walking
    Shortness of breath, coughing, phlegm, chest wheezing

Physiological characteristics
    Weight and height at time of interview, weight at age 25 and 50, weight changes in past year
    Pulse, two readings of systolic and diastolic blood pressure

Tobacco and alcohol consumption
    Did respondent drink beer, wine, or liquor in the past year or month, how often in the past month, past history of heavy drinking
    Does respondent smoke now or in the past, how many cigarettes per day, age at first/last cigarette

Medications
    Does respondent use insulin, hypertension medication, digitalis

Use of medical resources/institutions
    Was respondent hospitalized for at least one night in the past year
    Has respondent ever been a nursing home patient
    How recent was last visit to a dentist

Some of these listed "variables" actually consisted of two or more questions, e.g., there were


a dozen questions on chest pain (the Rose questionnaire). It should also be noted that all

chronic condition, hospitalization and treatment variables captured only the self-reported

diagnosis and/or treatment; thus it was not discovered whether the respondent truly was

afflicted with or treated for a condition.

3.4 Outcomes

Following the baseline survey, respondents were recontacted annually to monitor a

number of different outcomes over time. The first two surveys were conducted by phone,

while the third involved another household interview. (Additional follow-up surveys were

completed, but these data are not publicly available, and are not considered in this

dissertation.) The outcome of central importance for this dissertation is mortality. Each of

the three centers established a "mortality surveillance system" to match up known deaths in

the community (e.g., through obituary notices or hospital records) with participants in the

study. Each respondent's status was observed at each annual recontact, and decedents were

matched with their death certificates. It was observed that 433 respondents died in the first

year, 486 died in the second, and 531 died in the third. Besides the mere occurrence of death,

a full listing of both underlying and associated causes is provided with the data, coded by a

single nosologist at each center.

For respondents who survived to the first or second annual recontacts, other variables

observed were chronic conditions, physical functioning (ADL's, hearing, vision),

hospitalizations, marital status, working status, household composition, weight loss,

medication usage, and nursing home admissions. The third follow-up, the second household

interview, was more detailed; respondents were also assessed with respect to mental status


and blood pressure, and they were asked about religious activity, relatives and friends,

urinary control, chest pain, possible infarction, sleeping habits, smoking, drinking, and

depression.

The other outcomes of interest for this dissertation, besides mortality, were the

occurrences of new illness, particularly heart failure, strokes and cancer. Since it was

possible for any particular condition to go unreported by the respondent but discovered at

death and listed as a cause of death, it was decided to include these listings as new incidences of the illness. Thus any listing of heart failure, stroke or cancer as a cause of death when the

respondent had never reported having such was counted as a new occurrence. This was done

under the assumption that if the respondent died of the event without having reported it, the

event occurred between the time of reporting and death. For the analysis of cancer,

respondents who reported having ever had cancer at baseline were removed from the

analysis, as it was not clear whether "new" incidences of cancer were not simply new

malignancies from previously diagnosed cancers. After these recodings, it was observed that

868 respondents suffered new heart failures, 943 had strokes, and a total of 552 persons had

new cancers (6.2% of 8,874 persons never previously diagnosed with cancer). Again, it

should be noted that these "occurrences" of illness are mainly self-reported diagnoses (and

for new illnesses found on the death certificate, observed postmortem diagnoses). This is

important with respect to the incidence of cancer, since tumors in their earlier stages

frequently go undiagnosed.

3.5 A comparison of the EPESE sample and the U.S. population

The ideal sample design would have been a nationally-representative probability


sample of U.S. elderly, since one would like to build models applicable to these persons.

However, the EPESE dataset was instead a combination of two community censuses and a

stratified cluster sample, all from three small geographical regions that were themselves

chosen by convenience. It was clearly important, then, to note the differences and

similarities between the observed dataset and what would have been expected from a

nationally representative sample. It is argued below that the population is quite similar to

the U.S. population of elderly (although "statistically significant" differences do exist). With

respect to mortality in particular, the differences between the EPESE sample and the U.S.

population are relatively small (i.e., of the same size as the sampling variation one would

expect in a simple random sample of U.S. elderly).

To gauge the size of the differences between the EPESE sample and what would have

been expected from a nationally representative sample, one can calculate simple statistical

tests under the obviously false assumption that the EPESE sample was itself a simple random

sample (SRS) of U.S. elderly. That is, since the age and sex composition of the U.S.

population of elderly was known, and since population-level values for the probability of

death by age and sex were known, one can compare the EPESE sample estimates with the

population values directly. However, one would expect some small differences even if the

sample was highly representative of U.S. elderly. So the observed differences were

calibrated in terms of the standard errors one would have expected from a simple random

sample.

Figure 3.1 shows the age and sex distribution of all 10,294 respondents. The dashed

lines show the expected value for each total under the false assumption that the respondents

had been chosen with a SRS of the U.S. population of the elderly (including the institutionalized) of size 10,294.²

[Figure 3.1 - Number of respondents, by age and sex; dashed lines show expected values for a simple random sample of U.S. elderly. Respondents are from the New Haven, East Boston and Iowa EPESE baseline surveys (N = 10,294).]

If the sample had been a simple random sample of

noninstitutionalized elderly only, one would expect more persons age 65-74, and fewer

persons aged 75 or older, compared with the entire U.S. population of elderly.3 It is clear

then that persons aged 65-74 were undersampled, while persons 75 and older were

oversampled. Under the SRS assumption, a calculation of the chi-square statistic across all

ten age-sex categories totals 57 on 9 degrees of freedom, p <.001; most of the contribution

to this statistic comes from the lack of males aged 65-69, the surplus of males 85+, and the

lack of females aged 70-74 (in that order). However, the overall sample distribution is

patterned quite similarly to the population distribution. There are significantly fewer

respondents with each increase in age, and there are significantly more females at every age.

Moreover, if one ranks the ten age-sex categories by size, they rank identically in both the

sample and the population. Although statistically significant differences exist, many of the

differences are not very large in real terms; rather, the large sample N makes even small

differences statistically significant. For example, with respect to sex only, the sample proportion of females (60.8%) was actually quite close to the population proportion (59.8%). However, this gives a significant t-statistic of 2.1 (p < 0.05) due to the small

standard error attached to the estimate.
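
These calibration calculations are easy to reproduce. A brief Python sketch (the observed and expected counts would be supplied by the reader; the SRS assumption enters only through the binomial/multinomial standard errors):

    import math

    def chi_square_srs(observed, expected):
        """Pearson chi-square statistic comparing observed age-sex counts with
        the counts expected under a simple random sample of U.S. elderly."""
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    def proportion_t(sample_prop, pop_prop, n):
        """t-statistic for a sample proportion against a known population
        proportion, using the SRS binomial standard error."""
        se = math.sqrt(pop_prop * (1.0 - pop_prop) / n)
        return (sample_prop - pop_prop) / se

    # e.g., the sex comparison discussed above:
    #   proportion_t(0.608, 0.598, 10294)  ->  roughly 2.1
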

A more serious difference is observed in the racial classifications. As mentioned,

only 5.1% of the sample reported itself as black, while some 8.2% of the elderly population

2. The numbers for the U.S. population are taken from the Current Population Reports, P-25 series No. 1095 (1982).

3. Approximately 97% of the population aged 65-74 was noninstitutionalized, while 86% of persons 75+ were noninstitutionalized. Source: Statistical Abstract.


in 1982 was black. (Since blacks are oversampled in the North Carolina data, this issue is

addressed in more detail in Chapter 5.) Also, the oldest elderly may be substantially

oversampled, based on the observed proportions of respondents older than 85 and the high

degree of institutionalization that would have been expected at these oldest ages.

Thus there are some significant differences between the sample and the population,

whether or not the institutionalized are included, so the data are clearly not nationally

representative as gauged by SRS sampling error. Ultimately, however, the goal of the

dissertation is to develop models that predict mortality in the general population of elderly.

Thus, of particular importance is the effect of the sample's idiosyncrasies on estimates of

predicted probabilities of death and the approximate standard errors used to gauge the

accuracy of these estimates. Fortunately, as mentioned, one is in the position of knowing

near-exact population values for the cohort probability of death according to U.S. single-year period life tables for the three-year period 1982-1985.4 This was approximately the three-year period during which the EPESE respondents were observed, so if the EPESE cohort was

similar to the U.S. population of elderly, they should have experienced death rates

approximately equal to these national rates. A cohort estimate of the population value of 3qx (the probability that a person aged x dies between age x and x+3) may be obtained first by gleaning 1qx, 1qx+1 and 1qx+2 from the 1982, 1983 and 1984 period life tables respectively. For example, call them 1qx^82, 1qx+1^83 and 1qx+2^84, where 1qx+δ^8X is the probability that a person aged x+δ died between ages x+δ and x+δ+1 in the year 198X. For the average person aged x in

the U.S. population in 1982 (the time of the baseline survey), the probability of death within

4. Source: U.S. Vital Statistics, 1982-1984.


the next three years was estimated as:

    {}_3q_x^{cohort} = 1 - (1 - {}_1q_x^{82})(1 - {}_1q_{x+1}^{83})(1 - {}_1q_{x+2}^{84}).

Thus, under the false assumption that the sample is a SRS of elderly, the expected value of the sample estimate of 3qx^cohort (the sample estimate simply being the proportion of sample

respondents aged x who died in three years) could be predicted separately for each of the ten

age-sex groups by setting x equal to the average age of the sample respondents in each age-

sex category and using the above formula. Also, under the SRS assumption, the standard

error for the sample estimate of 3qx^cohort can be estimated as:

    SE({}_3\hat{q}_x^{cohort}) = \sqrt{ {}_3q_x^{cohort}\,(1 - {}_3q_x^{cohort}) / N_x },

where N_x is the number of sample respondents aged x, as implied by the formula for the variance of a random variable having a binomial distribution. (Note that one should use the population value of 3qx^cohort for this calculation

since it is known, although there is little difference if the sample estimate is used.) Again,

since the population value includes the institutionalized elderly, one might expect that the

sample estimates of 3qx^cohort would be slightly lower than the population estimate for ages

above 75 or so (assuming institutionalized persons have higher rates of death than the

noninstitutionalized).
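
A small sketch of these two calculations (Python; the single-year probabilities would be read from the 1982-1984 period life tables, and the function names are illustrative):

    import math

    def three_year_q(q_x_82, q_x1_83, q_x2_84):
        """Cohort estimate of 3qx built from three single-year period probabilities:
        1qx from the 1982 table, 1q(x+1) from 1983, and 1q(x+2) from 1984."""
        return 1.0 - (1.0 - q_x_82) * (1.0 - q_x1_83) * (1.0 - q_x2_84)

    def srs_standard_error(q, n_x):
        """Approximate SE of the sample estimate of 3qx under the SRS/binomial
        assumption, with n_x sample respondents aged x."""
        return math.sqrt(q * (1.0 - q) / n_x)
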

Figure 3.2 shows the sample estimates of 3qx^cohort for the ten age-sex categories, with dashed lines at 3qx^cohort +/- 1 S.E. for each estimate, and asterisks at the height corresponding to the U.S. population values of 3qx^cohort. Table 3.2 also lists these estimated probabilities and

their standard errors in addition to the U.S. population values and the p-value for the t-

statistic (under the null hypothesis that the expected value of the estimate is equal to the

population value, so that any observed difference is due purely to random chance).

[Figure 3.2 - Probability of death within 3 years, by age and sex; dashed lines show +/- 1 standard error (approximate estimates); asterisks show population values. Sources: 1982-85 EPESE surveys from New Haven, East Boston and Iowa; U.S. Vital Statistics.]

Table 3.2 - Observed sample and U.S. population estimates of the probability of dying within three years of age x (3qx), by sex

           --------------- Female ---------------    ---------------- Male ----------------
Age        Est.(1)   S.E.(2)   US pop.(3)  p-val.(4)  Est.(1)   S.E.(2)   US pop.(3)  p-val.(4)
65-69      0.061     0.0056    0.056       0.382      0.101     0.0081    0.103       0.800
70-74      0.078     0.0065    0.082       0.581      0.159     0.0110    0.146       0.227
75-79      0.106     0.0088    0.127       0.020      0.222     0.0149    0.211       0.452
80-85      0.188     0.0135    0.209       0.115      0.279     0.0216    0.295       0.295
85+        0.279     0.0181    --          --         0.398     0.0265    --          --

1. This column shows the cohort estimate of 3qx, equal to the sample proportion of respondents in each age category who died within three years of baseline. Data: New Haven, East Boston, and Iowa County EPESE surveys, 1982-1985.
2. The S.E. column gives an approximate standard error for the sample estimate of 3qx obtained by assuming a binomial distribution of deaths.
3. This column shows cohort values of 3qx for the entire U.S. population (including the institutionalized) obtained from single-year period life tables for 1982-1984 (U.S. Vital Statistics). The value of x for each age category was set equal to the average age of sample respondents in that age category. Note that about 97% of persons in the U.S. population aged 65-74 were noninstitutionalized, while only 86% of persons 75+ were noninstitutionalized.
4. Given the assumption that the sample is a simple random sample from the U.S. population, the expected value of the sample estimate of 3qx is equal to the U.S. population value. The p-value for the t statistic under this null hypothesis is shown (two-tailed test).

For example, the average female in the sample aged 65-69 had an estimated 6.1% chance of

dying, give or take 0.56%. If the respondents were chosen from the U.S. population with a

SRS, the expected value of this estimate would be 5.6%, a difference of about 0.5%, or 0.9

standard errors. There was a 38.2% chance that a difference at least this big would have been

observed due to chance error, so one fails to reject the null hypothesis. This was the case for

all the estimates except that for females aged 75-79; this estimate is significantly lower than

the population value. This is not surprising, since the population values include

institutionalized persons, as mentioned above. In fact the estimates for both males and

females aged 80-85 were also somewhat below the population level, as expected (although

not significantly so). Overall, however, the confidence intervals implied by the standard

errors under the SRS assumption capture the true population values quite accurately, despite

the quite small standard errors for the estimates for ages less than 75.

3.6 Functionality, morbidity and causes of death in the EPESE sample

Since there were several key measures of functionality closely associated with

mortality, it was informative to understand what levels of functionality existed in the sample

as measured by these important variables, particularly regarding age and sex. For example,

part of the question model's surprising ability to predict nearly as many female deaths as

male (despite using "maleness" as a predictor of high risk) was the result of some interesting

gender differences in the various functionality measures.

Consider the variable that asked whether respondents could walk for half a mile

without help, which was an excellent predictor of mortality even after controlling for many


other variables (see Chapter 5). There were 2,527 respondents in the sample who responded

they were unable to walk this distance, about 24.6% of all respondents. Figure 3.3 shows the

percentage of respondents at each age and sex who reported that they could not walk half a

mile. As expected, the percentage increases with age. However, when examined by gender,

a surprising difference is observed. Females were much more likely to report this dysfunction

at all age groups! This was not an expected result; it was thought that since both the

probability of death and the incidence of chronic disease was higher for males (as estimated

below), males would also be more severely debilitated than females. However, this was

usually not so.

A very similar result could be seen with another important predictor variable

measuring functionality, which was the question asking whether respondents could bathe

without help. There were 889 respondents (about 9% of the sample) who reported that they

either needed assistance or were unable to bathe at all. Figure 3.4 shows the percentage at

each age and sex who reported this inability. Again, a similar pattern is observed: the

percentage increases with age, which is hardly surprising, but females report a significantly higher level of dysfunction than males. Other variables, such as the variable reporting levels of self-assessed health, also showed a higher level of dysfunction or illness in females. These results are not unique to this sample; other researchers have reported higher levels of dysfunction (as well as disease) in females. However, notice that by using these variables as classifiers of high risk groups, one compensates for the gender imbalance in predicted deaths caused by simultaneously using the male sex as a classifier.

Interestingly, the measures of mental functionality did not display the same patterns.

Figure 3.3 - Proportion unable to walk half of a mile without help, by age and sex
[Bar chart: proportion unable to walk half a mile without help, plotted by age group (65-69, 70-74, 75-79, 80-84, 85+) for males and females.]
Sources: EPESE baseline surveys from New Haven, East Boston and Iowa

Figure 3.4 - Proportion unable to bathe without help, by age and sex
[Bar chart: proportion unable to bathe without help, plotted by age group (65-69, 70-74, 75-79, 80-84, 85+) for males and females.]
Sources: EPESE baseline surveys from New Haven, East Boston and Iowa


Only 406 respondents could not correctly state their mothers' maiden names, and 514 respondents could not correctly state the day of the week. However, unlike the physical functioning measures, both groups contained many men in most age categories.

Respondents were also asked directly about many illnesses, as indicated above.

There were 1,239 respondents who reported having ever been diagnosed with heart failure

and 124 who reported having been suspected of having heart failure. Of these respondents,

about 54% were male. Figure 3.5 shows the percentage of respondents who reported that

they had been diagnosed with heart failure (or suspected heart failure) at some point in their

past (not necessarily at the time of baseline) by age and sex. The proportion of males with

heart failure was much higher than the proportion of females with heart failure at all ages.

Generally, the proportion with heart failure increased slightly with age, excepting the oldest

males. It appeared that the lifetime prevalence of heart failure in the EPESE sample was

lower than one might expect compared with other national samples (e.g., the National Health Interview Survey). For example, NHIS estimates suggest that the prevalence of heart disease in 75-80 year olds was as high as 30% (compared with 12% to 21% in the sample), although the NHIS definition of heart disease is wider than the definition here (which is heart failure, and only the reported diagnosis of it).

There were 1,420 respondents who reported having ever been diagnosed with cancer

(about 13.6% of the sample), and 92 who reported suspected cancer. Of these respondents, nearly 68% were female. Figure 3.6 shows the proportion of respondents at each age and sex

who reported having been diagnosed with cancer (or suspected cancer). The proportion of

respondents with cancer did not appear to increase substantially with age, and in fact seemed

Figure 3.5 - Proportion diagnosed with heart failure by age and sex
[Bar chart: proportion ever diagnosed with heart failure, plotted by age group (65-69, 70-74, 75-79, 80-84, 85+) for males and females.]
Sources: EPESE baseline surveys from New Haven, East Boston and Iowa

Figure 3.6 - Proportion ever diagnosed with cancer by age and sex
[Bar chart: proportion ever diagnosed with cancer, plotted by age group (65-69, 70-74, 75-79, 80-84, 85+) for males and females.]
Sources: EPESE baseline surveys from New Haven, East Boston and Iowa


to decrease for females (perhaps showing the selection of those who die of cancer at earlier

ages). At all ages except 80-84, females were more likely to have been diagnosed with cancer.

The other chronic illnesses asked about in the survey included hypertension, strokes,

diabetes and bone fractures. Hypertension was by far the most common malady; some 4,429

respondents (43% of the sample) reported having ever been diagnosed with high blood

pressure, and 174 were suspected. Of these, the majority were again female (nearly 68%).

These levels may sound high, but comparisons with results from the National Health

Interview Survey suggest that these levels are quite plausible and that the gender difference

is quite real. For example, 48% of female respondents aged 75-80 in the NHIS sample had

hypertension, compared with 29% of the males. The proportion of respondents having been

diagnosed decreases with age, particularly for males.

There were 1,322 respondents who reported having ever been diagnosed with

diabetes (12.8% of respondents), and 157 suspected. Males were slightly more likely to

report having diabetes, and the proportion seemed to decline with age. Fewer respondents,

601, reported having been diagnosed with a stroke, and 60 were suspected of having had a

stroke. Males were much more likely to have had a stroke (except at the oldest ages), and

the proportion of respondents having had a stroke increased with age. About 4% of the

respondents reported having ever been diagnosed with a fractured hip, and 18.9% reported

having been diagnosed with some other bone fracture.

For the more common illnesses, one is also typically interested in the incidence of

disease besides the lifetime prevalence measure used above. The number of respondents

who reported being diagnosed with chronic illnesses for the three-year follow-up period were


reported briefly in section 1.8 for heart failure, cancer and stroke. Note that these numbers

include persons who died of the condition without having reported ever having it at baseline. The assumption is that if the respondent died of a heart attack, for instance, but did not report

having ever been diagnosed with a heart attack at baseline, then the death was the result of

a new incident. Conversely, if the condition had been reported at baseline and the patient

died of the condition without reporting a new incident in follow-up, then it was assumed that

no new incident occurred.
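To make this counting rule explicit, here is a minimal sketch of the decision just described; the field names are hypothetical, not EPESE variable names.

    /* Returns 1 if the follow-up record counts as a new incident of a condition. */
    struct followup_record {
        int reported_at_baseline;   /* condition ever diagnosed at baseline        */
        int reported_in_followup;   /* new incident reported during follow-up      */
        int died_of_condition;      /* underlying cause of death was the condition */
    };

    int is_new_incident(const struct followup_record *r)
    {
        if (r->reported_in_followup)
            return 1;                /* an incident was reported directly          */
        if (r->died_of_condition && !r->reported_at_baseline)
            return 1;                /* death without a baseline report: new event */
        return 0;                    /* baseline condition and death, but no new
                                        report in follow-up: no new incident       */
    }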

There were 868 respondents suffering a new heart attack in the three-year follow-up

period (8.4% of the sample), of whom 408 were male (47% of all victims). Figure 3.7 shows

the proportion of respondents at each age and sex who experienced new incidents of heart

attack. The risk of a new heart attack clearly rises with age, and is greater for men than

women, congruent with Figure 3.5. There were 943 respondents with new incidents of stroke

(9.2% of the sample), of whom 446 (47.3%) were male. The proportion of respondents at

each age and sex with new incidences of stroke is shown in Figure 3.8. Again, the risk rises

with age, and is much higher for males of all ages (excepting 80-84 year-olds).

There were 726 respondents reporting incidents of cancer during follow-up, not all

of which were clearly new malignancies. Of these respondents, 356 (49%) were male.

Figure 3.9 shows the proportion of respondents at each age and sex with incidents of cancer.

Interestingly, the risk is much higher for males, and does not necessarily increase with age.

This pattern is in contrast to Figure 3.6, which suggests that the lifetime prevalence of cancer

is much higher in females. This implies that the duration and survivability of cancer spells

are greater for females. Notice again that for the purpose of prediction, it was decided that

Figure 3.7 - Incidence of heart failure by age and sex
[Bar chart: proportion with new incidents during the three-year follow-up, plotted by age group (65-69, 70-74, 75-79, 80-84, 85+) for males and females.]
Sources: EPESE baseline surveys from New Haven, East Boston and Iowa
Note: Includes those respondents who died from new incidents

Figure 3.8 - Incidence of stroke by age and sex
[Bar chart: proportion with new incidents of stroke during the three-year follow-up, plotted by age group (65-69, 70-74, 75-79, 80-84, 85+) for males and females.]
Sources: EPESE baseline surveys from New Haven, East Boston and Iowa
Note: Includes those respondents who died from new incidents

Figure 3.9 - Incidence of cancer by age and sex
[Bar chart: proportion with incidents of cancer during the three-year follow-up, plotted by age group (65-69, 70-74, 75-79, 80-84, 85+) for males and females.]
Sources: EPESE baseline surveys from New Haven, East Boston and Iowa
Note: Includes those respondents who died from new incidents


all respondents who reported ever having been diagnosed with cancer should be dropped

from the analysis. This was due to the difficulty in distinguishing between incidents of

cancer during follow-up which were genuinely new incidents, and those that were essentially

malignancies that may have been diagnosed prior to baseline. Of the 8,874 remaining persons,

there were 552 who experienced genuinely new cases of cancer (6.2%).

Finally, it is informative to examine the causes of death assigned to the deceased

respondents. For simplicity, one usually looks at the underlying cause as implied by the

death certificate, although this is quite an insufficient piece of information for fully

understanding the true causes behind most deaths. The definition of underlying cause is

discussed in Chapter 6, as well as the shortcomings of these data and other details concerning the

classification of causes (both underlying and associated) on death certificates. Here,

however, it is sufficient to examine underlying causes as given, with the brief caveat that this

is an inexactly measured (and somewhat ambiguously defined) variable.

The most common underlying cause of death, as expected, was heart disease, which

accounted for 633 deaths, nearly 44% of all 1,450 deaths in the sample. The next most

common cause was cancer, assigned to 335 respondents, only 23% of all deaths.

Cerebrovascular disease was the cause identified for 6.8% of deaths, circulatory diseases

accounted for 2.8%, accidents accounted for 2.1%, diabetes for 1.9%, and hypertension for

1.7% of all deaths. The remaining 261 deaths (18% of all deaths) were attributed to other

causes.

A comparison of these proportions to the U.S. population of elderly reveals some

interesting differences. For instance, using cause-specific death rates by age and sex from


the National Center for Health Statistics, as applied to the EPESE age and sex sample

distribution, one would have expected that about 50% of all deaths would have been

attributed to heart disease. About 30% would have been attributed to cancer, and about 11%

would have been attributed to stroke. In fact, for every cause of death examined (including

accidents, diabetes, and hypertension), the proportion attributed to each cause is smaller in

the sample than in the population; that is to say, the only category of death that contains a

larger proportion of deaths in the sample than in the population is the residual, "other"

category. It is possible that some discrepancies in the grouping of ICD codes by cause may

have been responsible for some differences between the categories, but the observed

differences are quite substantial. More likely, it may have been that the EPESE population,

consisting of working-class, middle- and low-income persons, was less likely to have death

certificates examined by a physician or coroner than persons from a random sample of U.S. elderly. This would have produced a lower consistency of classification and more deaths due

to unknown causes. See Chapter 6 for a more thorough treatment of deaths by cause.
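The expected proportions quoted above amount to applying national cause-, age- and sex-specific rates to the sample's age-sex distribution; the following is a minimal sketch of that arithmetic, with placeholder array sizes and no actual NCHS or EPESE figures.

    #define NCELLS  10   /* age-sex cells, e.g., five age groups by two sexes */
    #define NCAUSES  8   /* heart disease, cancer, stroke, ..., other         */

    /* Expected share of deaths attributed to one cause, given cell sizes n[]
     * and three-year cause-specific death rates rate[cause][cell].           */
    double expected_share(const double n[NCELLS],
                          const double rate[NCAUSES][NCELLS], int cause)
    {
        double cause_deaths = 0.0, all_deaths = 0.0;
        for (int c = 0; c < NCAUSES; c++)
            for (int k = 0; k < NCELLS; k++) {
                double d = rate[c][k] * n[k];      /* expected deaths in cell */
                all_deaths += d;
                if (c == cause)
                    cause_deaths += d;
            }
        return cause_deaths / all_deaths;   /* e.g., about 0.50 for heart disease */
    }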

3.7 Missing data

As often happens with large complex datasets in social science, data were not always

available on every respondent for every variable. Fortunately, the numbers of missing values

were not large for the vast majority of variables of interest. However, as the accuracy of the

models includes the classification and misclassification of cases for which certain values

were indeed missing, it is important to document their extent and explain their sources where possible.

For example, one primary reason for missing values on certain variables related to


the small number of phone and proxy interviews, in which some survey items were not asked

of the respondent. There were about 660 interviews (6.4% of the sample) completed by

telephone or proxy (for the New Haven sample, at least 50% of the interview was completed

by proxy). These respondents were important to isolate because, interestingly, they

experienced particularly high death rates: about 29% of them died, more than twice the raw

death rate in the whole sample. Thus, among these persons there were some 191 deaths

(13.2% of all deaths in the sample), of which 90 were concentrated in the East Boston

sample. Incredibly, of the 149 East Boston respondents who were interviewed by proxy (or

partially by proxy), 74 died (nearly 50%); this turned out to be, of all the variables on the

survey, the single largest split of respondents with such a high death rate. Thus, it was

possible that by including this set of respondents in the data, one might unfairly inflate the

apparent ability of the model to detect high risk persons by using questions that classified many missing values as high risk (as do the questions in the appendix). The variables that many

of these respondents were not asked about included their mental status (e.g., the questions

asking for the mother's maiden name or the day of week), a handful of the physical

functioning questions (including questions on vision and hearing), some questions about

hospitalization and medication for chronic conditions, the items on physical measurement

(e.g., pulse, blood pressure), questions on sleeping habits, and the items on alcohol usage.

Other reasons for missing values included the more usual causes: some proportion

of respondents did not fully complete the survey for various reasons (about 162 surveys were

identified as partially complete or abbreviated), and inevitably respondents refused to answer

some questions, or were unable to respond for lack of opinion or knowledge. Some


questions were coded as "missing" but were not truly unknown. Rather, they were simply not answerable by the respondent because they did not apply (e.g., a person who never smoked was not asked at what age they started smoking); this is not what is considered "missing" data in the totals presented below.

Of particular interest here are the questions presented in the appendices. Age and sex had no missing values. The greatest number of missing values occurred in the item in Set

A.3 asking about the respondent's weight at age 25: there were 2,068 missing values for this

variable (20% of respondents), and these included all of the Iowa proxy and telephone

interviews. Clearly, however, it seemed that most of these missing values belonged to

persons who were simply unable or unwilling to answer the question. The next largest

number of missing values was contained in the question asking the respondent to state the

mother's maiden name (see Set B.3): there were 747 missing values (7.3% of respondents).

The item asking how much difficulty the respondent has pushing or pulling large objects (Set

A.1) had 745 missing values. The question asking about the difficulty of bathing (Set C.1)

had 736, and the weight of the respondent at time of interview (Set A.4) had 707. The item

asking the respondent to state the day of the week (Set C.2) had 686, and the question about

the ability to see a friend across the street (Set A.5) had 525 missing values. For all these

variables, the total included all 456 respondents in the Iowa telephone and proxy interviews,

who had a death rate of about 20%. The next largest number of missing values was present

in the digitalis question in Set A.2 (381, or 3.7% of the sample). The question asking about

heavy work (Set A.4) had 116 missing values (only 1.1% of respondents) and the two

remaining questions had only 57 missing values.


To find out how much effect the high-mortality proxy interviews may have had on

the estimates of misclassification error associated with the model in Appendix I, the error

estimates were recalculated with these respondents removed from the sample. For Set A,

there was in fact a slight increase in error when the proxy interviews were removed, as there

was a decrease in sensitivity (the proportion of deaths correctly predicted) when the higher

risk respondents were removed. This was only about a 3% increase in error, however, an

amount that was less than the approximate standard error attached to the estimate. For Set

B, the error was nearly unchanged when the proxy and telephone respondents were removed,

and for Set C there was a slight decrease in error. The change was always very small (less

than the rather small standard errors attached to the estimates). Interestingly, however, the

downward shift in sensitivity and upward shift in specificity observed when the

proxy/telephone respondents were removed almost exactly mimics the change observed

when the model was applied to the North Carolina (Duke) sample (although not quite as

large in magnitude as the Duke differences). It is possible that these differences account for

part of that observed shift, assuming the Duke proxy respondents (of which there were 162)

were not high-risk respondents like the proxy respondents from the original interview. (See

Chapter 5 for a detailed treatment of the validation with the Duke sample.)


Chapter 4 - The statistical methods of prediction

4.1 Approaches to model selection

Important to any multivariate statistical method of prediction is the decision of which

variables should be included as predictors. As suggested by Table 3.1, the number of

candidate variables in the EPESE survey was quite large. Since the sample size is also large,

the largest possible models could contain many more variables than one typically wants.

Sometimes, these models can be larger than a computer can manage. Thus the researcher is

required to form some method for selecting out the best variables, and to recode many

variables for some types of models (e.g., as indicator variables). As suggested repeatedly in

Chapter 2, the size of a model greatly affects its ability to predict accurately, so the matters

of model size and what variables belong in the model require much attention. The purpose

of this chapter is to describe the method of model selection not only for the question set

method invented above, but for logistic regression, linear discriminant analysis and the

CART method as well. (The results from these methods are compared in Chapter 5.)

The four methods considered below all have various potentials for dealing with the

problems of variable selection and recoding, depending partly on the parameterization (or

lack of it), and partly on the abilities of the computer to handle large datasets. For example,

the nonparametric question set method developed here required no recoding of any variable

or any missing value, and could handle a very large data matrix or model consisting of

several hundred variables. Thus it was possible to perform a systematic and objective

variable selection by using the full, raw dataset as input to the algorithm. To perform logistic

regression, however, one is required to recode missing values and nonordinal variables.


Also, due to memory constraints, the computer was not able to handle a full model nearly as

large, requiring the elimination of many variables before a "full" model could be built at all.

Again, the main motivation behind dimensionality reduction lies in the bias-variance

tradeoff, discussed extensively in Chapter 2. In short, if too many variables or parameters

are used in the model, the predictions are subject to greater chance error, but if the model is

too small, one may forsake predictive accuracy by ignoring powerful predictors. The

problem is well recognized by statisticians, and many methods have been proposed for the

very purpose of reducing model size without losing predictive power. The most familiar

techniques involve some kind of backward deletion from a full model, and all the methods

considered below use a variant of this technique. Unfortunately, in actual application most

researchers use an inference-based approach (i.e., p-values) or a rough guide such as Mallows's Cp or Akaike's Information Criterion (AIC), despite the frequently inappropriate assumptions required.1 Few researchers take advantage of information about the

approximate levels of bias and variance supplied by a test set method, although the CART

method does use such an approach.2

In this dissertation, a simple random sample was used to separate the dataset into a

learning set (N = 6,862) and an "internal" test set (N = 3,432). The former was used for

building various models (it is also called the "training set") and the latter for gauging their

accuracy and choosing a particular size. The same division was used for all methods except

CART, in which the program chose its own test set. Since a fair amount of back-and-forth

1. See Akaike (1973) and Mallows (1973).

2. Another example not specific to classification is the nonparametric smoothing algorithm in S-Plus called supsmu(); see Friedman (1984).


between the test set and learning set was necessary in the model selection process, it is

important to recognize that the test set estimates of model error are probably biased

downward to some degree. Hopefully, this bias is small for the question set method, since

the sample size was quite large and the method of choosing questions was systematically

random. More stringent tests of the models are conducted in Chapter 5 with the use of the

North Carolina EPESE dataset. These results suggest that any bias was probably small.
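As an illustration (not the code actually used), the simple random split described above can be drawn by shuffling an index of respondents and cutting it at the desired test set size.

    #include <stdlib.h>

    /* Shuffle respondent indices 0..n-1 with a drand48()-driven Fisher-Yates
     * pass; the caller then takes the first 3,432 entries (say) as the test
     * set and the remaining 6,862 as the learning set.                       */
    void shuffle_indices(int *idx, int n)
    {
        for (int i = 0; i < n; i++)
            idx[i] = i;
        for (int i = n - 1; i > 0; i--) {
            int j = (int)(drand48() * (i + 1));    /* uniform over 0..i */
            int tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
        }
    }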

4.2 A method for choosing questions

The method of selection for the model invented in this dissertation was initially

described in Chapter 1. As mentioned, it was possible to use nearly the entire, untransformed

dataset as input to the search algorithm. After dropping some variables for logistical reasons,

a set of 164 candidate variables, coded as on the original tape, was chosen from the survey

data and used as the penultimate "full" dataset for the remainder of the project. Using this

set of 164 variables, a search was implemented on the learning set data to find a "full model".

For Question Set B, for example, the full model consisted of a subset of 16 questions (four sets of four questions each, combined with OR and AND as described in Chapter 1), which gave a low learning set misclassification error. (Error was defined by Equation 2.2, as applied only to the learning set respondents.) Each of these questions could be any question in binary form. Such a question could be of the form "Is X < 3.2" or "Is X ≥ 5", where X

could be any of the 164 variables, and the value of the cutoff could be any possible value of

X in the dataset. Both were always chosen randomly. To do the random selection of

variables and cutoffs, the program uses the C library function drand48(), which

generates a pseudorandom number between zero and one. Thus, one could multiply the


output of this function by the number of variables in the data matrix (rounding the result down to an integer) in order to pick a variable at random. Similarly, one could pick a cutoff at

random for any given variable.
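For concreteness, a minimal sketch of this random selection step follows; nvars, vals and nvals are hypothetical stand-ins for the 164 candidate variables and their observed values.

    #include <stdlib.h>

    /* Pick a variable index uniformly at random from 0..nvars-1. */
    int pick_variable(int nvars)
    {
        return (int)(drand48() * nvars);
    }

    /* Pick a cutoff uniformly at random from the observed values of a variable. */
    double pick_cutoff(const double *vals, int nvals)
    {
        return vals[(int)(drand48() * nvals)];
    }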

Briefly, the computer picked an initial set of 16 questions at random, and tried

dropping and replacing questions randomly to check for any improvement in the learning set

misclassification error. The event of dropping a question at random and trying out a new

random question in its place was called a mutation (a reference to genetic or evolutionary

algorithms, of which this algorithm is a very crude sort). Mutations that produced

improvement were kept, while mutations that did not were dropped. At some point, the

search always resulted in a set of questions that could no longer be improved with a single

step. That is, the search never continued indefinitely; eventually, the set of 16 questions was

constructed such that no single question could be replaced in a way that improved the error. This could be checked by instructing the computer to try replacing each of the 16

questions one at a time, each time trying out all possible candidate questions in its place. By

exhaustively searching in this way, it was always possible to check whether a given set of

questions could no longer be improved by the random search method above. Here, such a

set of questions will be called an absorption point. This search algorithm is described in

Chapter 1 as the random search algorithm, or RSA.

To define this algorithm as precisely and carefully as possible (so that a reader need

not interpret the C code in the appendices directly), establishing several definitions is helpful.

Let V be an index set of all the predictor variables in the data. (Here, this is just taken as the sequence from one to 164, with the 164 "X" variables indexed arbitrarily.) Let Uv be the


set of all unique possible values (which were always real-valued) for the vth variable in V, and let Uv' be the set Uv without its smallest element. Let S be the set of two comparison operators, { <, ≥ }. Define a question as any 3-tuple of the form (v, u, s), where v is any element of V, u is any element of Uv, and s is any element of S. Define a question subset as any set of questions combined with the "AND" operator. Define a question model as any set of question subsets combined with the "OR" operator.

Now define a mutation as a question that is randomly generated in the following

manner: Choose an element s at random from S, and choose v at random from V (here "at random" always implies a simple random sample of size one). If s is "≥", choose u at random from Uv; else, if s is "<", choose u at random from Uv'. Note that since V, S and Uv for all v in V are directly observed and known, the joint distribution of (v, u, s) for a mutation is completely described by this definition. Note also that the elements of (v, u, s) were correlated within a single mutation, but the mutations were always generated independently of one another. Based on the structure of V and the Uv for all v in V, it was determined that 3,248 unique mutations were possible. Note that generating a mutation as

defined above is different from choosing from the set of all unique possible mutations at

random. Finally, define an absorption point as any question model that cannot be improved

(meaning that no lower misclassification error can be achieved) by replacing any single

question in the model with any possible mutation.
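A minimal sketch of these definitions as data structures and a classification routine is given below; this is an illustration, not the Appendix V code, and x[] simply holds one respondent's raw predictor values (with missing values left coded as ordinary numbers, as described later in this section).

    #include <stddef.h>

    enum op { OP_LT, OP_GE };                            /* the comparison set {<, >=} */

    struct question { int v; double u; enum op s; };     /* the 3-tuple (v, u, s)      */
    struct subset   { struct question *q; size_t nq; };  /* questions joined by "AND"  */
    struct model    { struct subset   *s; size_t ns; };  /* subsets joined by "OR"     */

    static int ask(const struct question *q, const double *x)
    {
        return q->s == OP_LT ? (x[q->v] < q->u) : (x[q->v] >= q->u);
    }

    /* Returns 1 if the respondent is classified as high risk by the model. */
    int classify(const struct model *m, const double *x)
    {
        for (size_t i = 0; i < m->ns; i++) {
            size_t j;
            for (j = 0; j < m->s[i].nq; j++)
                if (!ask(&m->s[i].q[j], x))
                    break;                   /* one failed question breaks the "AND" */
            if (j == m->s[i].nq)
                return 1;                    /* every question in this subset held   */
        }
        return 0;
    }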

Now the RSA (random search algorithm) may be described for a full model of four

subsets of four questions each as follows (it is easily generalized to a model of any structure; a minimal code sketch is given after the numbered steps):

1. Generate a "seed model", Start by making a model that contains four question


subsets with four questions each by independently generating 16 mutations and grouping them randomly into four subsets. Once this 4x4 model is built, check to see if any one of the four question subsets captures fewer than Nmin respondents in the learning set (where Nmin is a small integer; e.g., for Set B it was set at 25; CART typically sets it at 5). If such a subset exists, discard the model and generate a new one. Repeat this process as necessary until all four question subsets choose at least Nmin respondents. Once this is achieved, designate this model as the seed model, and calculate its misclassification error as applied to the learning set.

2. Starting with the first question subset in the seed model, choose a question from the model at random to be replaced. Generate a mutation, and replace the chosen question with the mutation. Designate this the new model. Calculate the misclassification error of the new model as applied to the learning set. If the new question subset does not identify at least Nmin respondents, assign the new model an error of infinity.

3. Compare the misclassification error of the new model to that of the seed model. If the error of the new model is lower than that of the seed model, keep the new model and discard the seed model. Otherwise, keep the seed model, and discard the new model. Designate whichever model is kept as the seed+1 model.

4. Use the seed+1 model to repeat step 2, mutating the seed+1 model just as the seed model was mutated. The only difference should be that the question to be replaced should be chosen from the second subset in the model. Then repeat step 3 to obtain a seed+2 model. Repeatedly cycle through steps 2-3, each time moving to the next question subset until the last has been tried (when the next cycle starts again with the first subset). Repeat this process indefinitely until an absorption point has been reached.
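The sketch promised above follows. It is an illustration rather than the Appendix V implementation: learning_set_error(), generate_mutation(), subset_count(), questions_in_subset(), question_at() and is_absorption_point() are hypothetical helpers standing in for the operations the four steps describe.

    #include <math.h>
    #include <stdlib.h>

    enum op { OP_LT, OP_GE };
    struct question { int v; double u; enum op s; };     /* as sketched earlier */
    struct model;                                        /* opaque here         */

    extern double learning_set_error(const struct model *m);     /* Equation 2.2     */
    extern struct question generate_mutation(void);              /* random (v, u, s) */
    extern int subset_count(const struct model *m, int subset);  /* respondents hit  */
    extern int questions_in_subset(const struct model *m, int subset);
    extern struct question *question_at(struct model *m, int subset, int slot);
    extern int is_absorption_point(const struct model *m);       /* exhaustive check */

    /* One run of the RSA on a seed model that already satisfies the Nmin rule.
     * For simplicity this sketch checks for absorption on every pass; in the
     * actual runs the check was made only after roughly 100,000 mutations per
     * subset, as noted in the text.                                            */
    void rsa(struct model *m, int nsubsets, int nmin)
    {
        double best = learning_set_error(m);
        for (long step = 0; !is_absorption_point(m); step++) {
            int subset = (int)(step % nsubsets);      /* cycle through the subsets */
            int slot = (int)(drand48() * questions_in_subset(m, subset));
            struct question *q = question_at(m, subset, slot);
            struct question old = *q;
            *q = generate_mutation();                 /* try a random replacement  */
            double e = (subset_count(m, subset) < nmin)
                         ? INFINITY                   /* too few respondents       */
                         : learning_set_error(m);
            if (e < best)
                best = e;                             /* keep the improvement      */
            else
                *q = old;                             /* otherwise revert          */
        }
    }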

In this way, the algorithm cycles regularly through the subsets in the model, each time

choosing a question at random to be replaced, and generating a random mutation to replace

it. Models that produced lower error were kept, and the mutation process continued indefinitely until no further improvement was possible through the replacement of any one

question. Again, the RSA always happened upon an absorption point if the search was

allowed to run long enough (e.g., about a million mutations for a model of 16 questions).

This was not a proven result, but out of tens of thousands of runs of the RSA on the present

dataset (usually using fewer than 30 variables in a model), the algorithm never avoided


absorption. (Although if one started with large enough models, it would take an extremely

long time; millions of mutations were required to reach absorption by the RSA for a model

of size 30, with the present data). Here, the computer was instructed to check for absorption

after 400,000 mutations for a model with as many as 16 questions. Allowing 100,000

mutations per subset was a rough rule of thumb that was usually employed across different

model sizes. The algorithm was always allowed to continue with the RSA undisturbed if

absorption was not achieved, so the check for absorption made no difference to the ultimate

outcome of the RSA. (However, later improvements to the algorithm used information from

the check for absorption to speed up the search, as described in the section below).

Define the observed absolute maximum for a question model of a particular structure

(e.g., 4x4) as that model with the lowest possible misclassification error when applied to a

particular dataset (usually the learning set), and assume this maximum is unique. (This

assumption was never contradicted in practice.) By definition, this model is an absorption

point. In any given run of the RSA, the resulting absorption point would usually not be the

absolute maximum. This fact was easily shown as it was usually possible to find an

absorption point with lower error using additional runs of the RSA with independently

generated seed models and mutations.

To show this problem explicitly, consider the very simple model form consisting of

two questions combined with "AND". From the 3,248 possible questions, the number of

possible models was about 5.2 million, so it was possible to search over this space

exhaustively to find the true maximum (and indeed, all absorption points). For instance,

using a misclassification cost of 3.5 with the full dataset, it was found through exhaustive


searches that there were only two possible absorption points. The model with the lowest

error chose the 1,702 respondents who were aged 75 or older and could not walk half a mile

without help, of whom 520 died (for an error of 0.319). The second absorption point picked

the 1,032 respondents who had used digitalis and could not do heavy work around the house,

of whom 364 died (for an error of 0.321). Neither of these models could be improved by

replacing one question, and these were the only two models for which this was the case. By

independently running the RSA hundreds of times over to search across models of this form,

it was found that the algorithm located the true maximum in only 33% of the searches. The

suboptimal model was found in the other 67% of the runs.

To find the model with the lowest obtainable error for larger model sizes, it was

necessary to repeat the RSA (perhaps hundreds of times), each time starting with an

independently generated seed model, and independently generated mutations. Then, out of

the many resulting models, that absorption point with the lowest possible misclassification

error was chosen (the case where more than one model had the lowest possible error was

never observed). This method of independently repeating the RSA and selecting the model

with lowest error was called the RRSA (repeated random search algorithm), and the notation

RRSA(N) was used to denote N repeated searches with the RSA. As N goes to infinity,

RRSA(N) will find the observed absolute maximum by pure luck since the finite pool of

models will eventually be exhausted solely through the random sampling of seed models.

A central question is whether the required N is small enough so that this absolute maximum

can be found within a reasonable time frame, given the available computing power. For

small enough models (e.g., two or three questions), the algorithm could obviously find the


absolute maximum within a reasonable period. For example, since the RSA could find the

true maximum for the two-question model considered above about a third of the time, one

need only use RRSA(20) to ensure that the algorithm would find the maximum with 99.9%

certainty. Brute force arguments are made below to show that it may have found the

maximum for models with the structure of Set B (which contains a total of seven questions)

by using RRSA(2,000).
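A minimal sketch of the RRSA(N) wrapper follows, with the same hypothetical helpers assumed in the RSA sketch above plus routines for generating and freeing seed models.

    struct model;

    extern struct model *random_seed_model(int nsubsets, int nq, int nmin);
    extern void          free_model(struct model *m);
    extern void          rsa(struct model *m, int nsubsets, int nmin);
    extern double        learning_set_error(const struct model *m);

    /* N independent runs of the RSA; the absorption point with the lowest
     * learning-set error is kept, e.g., RRSA(100) or RRSA(2,000).           */
    struct model *rrsa(int nruns, int nsubsets, int nq, int nmin)
    {
        struct model *best = 0;
        double best_err = 1.0e9;
        for (int i = 0; i < nruns; i++) {
            struct model *m = random_seed_model(nsubsets, nq, nmin);
            rsa(m, nsubsets, nmin);                 /* run to an absorption point */
            double e = learning_set_error(m);
            if (e < best_err) {
                if (best)
                    free_model(best);
                best = m;
                best_err = e;
            } else {
                free_model(m);
            }
        }
        return best;
    }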

Consider the building of Question Set B again. First, the RRSA(100) was used on the

learning set to select the full model as that model with the lowest misclassification error after

100 independent runs of the RSA. This model was then subjected to backward deletion.

Starting with the full set of 16 questions, the backward deletion was conducted in the

following manner:

1) The first question was dropped from the set, and the misclassification error of the remaining 15 questions was computed and recorded.

2) Then the first question was put back into the model, and the second question was dropped. The misclassification error was again computed and recorded for the remaining 15 questions. This process was repeated until all 16 questions had each been dropped from the model temporarily, resulting in 16 different misclassification errors for 16 models, each consisting of 15 questions.

3) The 16 errors were compared, and the question that produced the smallest increase in prediction error when dropped was permanently dropped from the model.

4) The process then returns to step one, this time dropping each of the 15 questions temporarily to find the question that, when dropped, produced the smallest increase in error for models of size 14. The process continues until no questions are left.

At the end of the process, one is left with a sequence of 16 nested submodels, one of each


size, all built with data from the learning set only. Once this sequence of models was

obtained, they were each applied to the test set to estimate prediction error, according to

Figure 1.2. (Computer code in C for the algorithms defining both the search and the

backward deletion process is given in Appendix V. This code is generic, and can be applied

to any dataset to predict a binary outcome variable merely by changing the parameters

describing the dimensions of the dataset and the costs of classification.)
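To complement the Appendix V reference, here is a minimal sketch of the backward deletion loop itself; model_error_without() is a hypothetical helper returning the learning-set error of the current model with question k temporarily removed, and the sketch assumes no more than 64 questions.

    /* Hypothetical helper: error of the model with question k (and everything
     * already in dropped[]) removed.                                          */
    extern double model_error_without(int k, const int *dropped, int nquestions);

    /* Fills order[] with the question indices in the order they are permanently
     * dropped, which defines the nested submodels of sizes n-1, n-2, ..., 0.    */
    void backward_delete(int nquestions, int *order)
    {
        int dropped[64] = {0};                    /* 1 = permanently dropped      */
        for (int round = 0; round < nquestions; round++) {
            int best_k = -1;
            double best_err = 1.0e9;
            for (int k = 0; k < nquestions; k++) {
                if (dropped[k])
                    continue;
                double e = model_error_without(k, dropped, nquestions);
                if (e < best_err) {               /* smallest increase in error   */
                    best_err = e;
                    best_k = k;
                }
            }
            dropped[best_k] = 1;
            order[round] = best_k;
        }
    }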

Once the relationship between model size and test set prediction error was observed (e.g., by creating Figure 1.2), a model size could be chosen. The structure of this curve was always observed to be of the same general form: a rapid decrease in error as model size increased from zero to the medium range (about seven questions for Set B, as shown in Figure 1.2), followed by a leveling out and then a very slow increase as model size was increased up to 16 questions. Sometimes

(as was the case with all the models presented here) the researcher has the luxury of

examining such a graph. Then judgement can be used to choose the model size that seems

to lie right in the "elbow" of this curve (about 6-7 questions for the curve in Figure 1.2).

Sometimes, as with Set B and Figure 1.2, the chosen model size that seemed to lie just in the

"elbow" was also that model with the absolute lowest error on the test set, or min(PETS). If

one wishes instead to select a model size automatically, the standard error for the test set

misclassification error estimates can be estimated as described in Section 2.2. The preferred

model size can then be chosen as the smallest number of questions that have an error that is

less than min(PE_TS) plus a standard error (see Breiman et al. (1984)). This is sometimes

useful when the test set N (N respondents, not runs of the RSA) is not as large as it is here;

then the test estimates of misclassification error may possess more variation than is shown


in Figure 1.2. To test whether any given model has a misclassification error that is

significantly better than that of the null model, the same standard error estimate defined in

Section 2.2 can be used to check that the absolute difference between the test set

misclassification error of the preferred model and that of the null model is bigger than two

standard errors. For all models in the appendices, this was by far the case.
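A minimal sketch of this automatic, "one standard error" size selection follows; pe[] and se[] hold the test set error estimates and their standard errors for the nested submodels of 1 through kmax questions (both arrays indexed from 1 for readability).

    /* Smallest model size whose test-set error is within one standard error of
     * the minimum test-set error, min(PE_TS).                                  */
    int choose_size_one_se(const double *pe, const double *se, int kmax)
    {
        int kmin = 1;
        for (int k = 2; k <= kmax; k++)
            if (pe[k] < pe[kmin])
                kmin = k;                         /* locate min(PE_TS)            */
        double threshold = pe[kmin] + se[kmin];   /* min(PE_TS) + one SE          */
        for (int k = 1; k <= kmax; k++)
            if (pe[k] <= threshold)
                return k;                         /* smallest qualifying size     */
        return kmin;                              /* not reached; kept for safety */
    }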

Then the learning set and test set data were recombined into the full data set, and the

RRSA(100) was implemented again, this time in search of the best model of the preferred

size. For example, it was observed from Figure 1.2 that the best model size for predicting

mortality with a relative misclassification cost of 3.5 was seven questions. So, the data were

recombined and the RRSA(100) was conducted for the best set of seven questions, where the

model had the same combination of three subsets (one consisting of three questions, the

others consisting of two). As it turned out, the model obtained as the result of applying the

RRSA(100) to the combined dataset to find a model of seven questions was nearly the same

model achieved by applying the RRSA(100) to the learning set to find a full set of 16

questions and pruning backwards to seven questions. In fact only one of the subsets had

different questions, and it identified most of the same respondents.

This procedure of recombining the data and refitting the full model is a very typical

last step in model building (used by CART, for example). However, it was not entirely clear

whether the strategy was appropriate for the RRSA(N) search method. The problem, as

pointed out by the author's advisor, stems from the fact that the results of the RSA are

random, and so the RRSA(N) method is not guaranteed to happen upon the absolute

maximum for finite N. Thus there was some uncertainty whether the local maximum


discovered from the search on the learning set implies the correct model size for the local

maximum obtained from the full dataset search. A partial solution to this sort of problem

is addressed by cost-complexity pruning, which is treated below. However, it also seemed

(as demonstrated in the results below) that it was possible to obtain a given absorption point

consistently through repeated applications of the RRSA(100) algorithm on a given dataset

with a model structure of about seven questions. Whether this absorption point is the

absolute maximum or a local maximum was not proven here. However, it is noteworthy that

the same model was obtained consistently for many runs of the RRSA(100) (numbering well

into the hundreds at this point).

As mentioned, there was in fact very little difference found between the models built

on the learning set and those built on the full dataset. The questions in Set B, for example, were

almost the same; the questions in Set B.1 and B.2 were identical, and most of the respondents

identified by the learning set question Set B.3 were also identified by the full dataset version

of Set B.3. However, in other applications of the method to other datasets, one may not have

such a large N, and one may not be able to verify that these similarities exist. Then, it may

be advisable to skip this last step of recombining the learning and test sets to estimate a final

model.

Table 4.1 shows the details associated with the construction of Set B, plus Sets A, C,

and an additional set discussed below. For example, to create the questions in Set C, the

same procedure described above was followed, except that a misclassification cost of 1.5 was

used. Again, the RRSA(100) algorithm was applied to the learning set to obtain a full model

of 16 questions (four subsets of four questions each) and the model was pruned backwards

Table 4.1 - Specifications for the construction of the question set models

Model(1)  Risk level of mortality(2)  Misclassification cost(3)  Structure of full model(4)  Algorithm for full structure(5)  Final structure(6)      Algorithm for final structure(7)
Set A     high                        5                          5 subsets of 4              RRSA(100)                        3+2+2+2+1               RRSA(100)
Set B     higher                      3.5                        4 subsets of 4              RRSA(100)                        3+2+2                   RRSA(2,000)
Set C     highest                     1.5                        4 subsets of 4              RRSA(100)                        3+2+2                   RRSA(100)
Set J     higher                      3.5                        10 subsets of 3             RRESA(200)                       3+3+3+2+2+2+2+2+2+2     RRESA(200) with cost-complexity pruning

Notes:
1. The letter of each model refers to one of the question sets in the appendices.
2. The "high" risk model chose persons with a three-year probability of death of about 28%. For the "higher" risk models this estimate was about 38%, and for the "highest" risk model it was about 46%.
3. This was the cost of misclassifying a true decedent relative to the cost of misclassifying a true survivor.
4. This column refers to the number of questions in the full model and how they are grouped by "AND" and "OR". For example, the full model for Set A had five subsets of questions combined with "OR"; each of these subsets consisted of four questions combined with "AND", so the model contained twenty questions in total.
5. The RRSA(N) method consisted of N independent runs of the random search algorithm (RSA) defined in Section 4.2. The RRESA(N) method consisted of N independent runs of the random and exhaustive search algorithm (RESA) defined in Section 4.7. This algorithm was first applied to the learning set of respondents.
6. This refers to the grouping of questions in the final model. For example, Set B had three subsets of questions combined with "OR"; the first subset had three questions combined with "AND", and the others each had two, for seven questions in total.
7. This refers to the algorithm that was applied to the full dataset to achieve the final model. For Set J, the RRESA method was applied to the full dataset with the full-model structure of 30 questions; cost-complexity pruning (see Breiman et al., 1984) was used to find the model structure in the column to the left.


to obtain a sequence of nested submodels. This sequence of submodels was then applied

to the test set, and a curve similar to Figure 1.2 was plotted. It was again judged, by directly

observing this plot, that the best model size contained seven questions, composed of two

subsets of two questions each and one subset of three questions. Then the data were

recombined into the full dataset and the RRSA(100) algorithm with the same model structure

of seven questions in three subsets was applied to determine the final model as presented in

Appendix I.

The questions in Set A were constructed in the same way, with the sole exception that

the full model was composed of five subsets of four questions each (i.e., a 5x4 structure for

a total of 20 questions). The main reason for this difference was that it seemed possible to

achieve a lower model error by accommodating a somewhat larger model size when a higher

cost of misclassification was used. Then the subsets needed to identify a much larger number

of decedents (and respondents overall) to satisfy the low-error criterion. This suggested that

a great deal of heterogeneity existed in the data. In later analyses, when the search algorithm

was somewhat modified to achieve much greater speed, it was ultimately determined that

larger models could achieve a reduced test set error (see Section 4.7 below). This was

possible only when the model size was increased with the addition of more subsets, each

with fewer questions. For example, a model of 30 questions (ten subsets of three questions

each) may be a more appropriate full model (with the best pruned submodel containing 23

questions). All of the results presented in Appendix I, however, were achieved solely with

the slower, unmodified RRSA(100) algorithm, which took quite a bit of computing. A model

of 20 questions could require millions of mutations to achieve an absorption point, so at the


time it was not feasible to search over larger model spaces.

Some mention should be made of the reasons for using the particular costs and model structures used in the creation of the models presented in Appendix I. To the reader, many

of these values (e.g., the relative misclassification costs of 1.5, 3.5 and 5, or the model

structure of four subsets of four questions each) may appear to be chosen from thin air.

Unfortunately, there were no systematic guidelines for choosing these parameters, so most

of them were found with some combination of experimentation and intuition.

The relative costs of misclassification were easy to specify, since the relative

proportions of decedents and survivors were known in the sample, and there was some

knowledge of the death rates among the smallest, highest risk subpopulations.

Experimentation combined with rough hand calculations of Bayes' rule then suggested a

rough misclassification cost which would identify a given proportion of decedents. For

example, when running the RRSA with a cost of unity, Bayes' rule suggests that the algorithm

should attempt to find regions of sample space that have death rates higher than 50%. It was

observed that only a few respondents in the test set (accounting for perhaps 2-3% of all

deaths) could be identified with such high death rates. It was thought that models that could

identify larger numbers of deaths were of more interest, so Bayes' rule suggested a level of

misclassification cost higher than one was needed to capture more deaths.
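For reference, the hand calculation alluded to here follows the standard cost-weighted Bayes rule (this derivation is a reconstruction for the reader, not a formula quoted from elsewhere in the dissertation): with a relative misclassification cost of c, a region of the sample space with death rate p should be classified as high risk whenever the expected cost of calling it low risk, c·p, exceeds the expected cost of calling it high risk, 1 − p; that is, whenever p > 1/(1 + c). A cost of one therefore targets regions with death rates above 50%, a cost of 3.5 targets rates above roughly 22%, and a cost of 7 targets rates above 12.5%.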

On the other hand, it was also observed that when a cost of greater than about seven

was used, many more respondents were identified as high risk by the algorithm (e.g., more

than a third of all survivors). These models were not specific enough to be useful as

predictors of true "high risk"as advertised. It seemed that three different models with three


levels of risk seemed to be a reasonable alternative to a single level model. These levels

needed to be somewhere between one and seven, so three levels of misclassification cost

within this range were picked: 1.5, 3.5 and 5. These three models captured roughly 20%,

40% and 65% of decedents, so the observed levels of sensitivity were evenly spread across

the range of interest. That is to say, a model identifying much less than 20% of deaths (on

the very far left, lower part of the ROC curve) was not particularly interesting for this

analysis; nor was a model that identified much more than 65% of decedents (as it would also

misclassify many survivors, being on the far right side of the ROC curve). However, the

three models built at the three given costs seemed to cover the remaining range of the ROC

curve quite well.

The structure of the full models (e.g., the decision to use four subsets of four

questions for Sets B and C) was also determined with some level of experimentation, and the

limitations of the computing power. It was decided, somewhat arbitrarily, to run the RSA at

least 100 times on each of the learning set and the full set for a given model structure, so the

computer needed to be able to find an absorption point within a day to get a final model

within a month or so. (Note that since the RRSA ran the multiple RSA's independently, more

than one computer could always be used, if available, to reduce the total search time). This

dictated the use of a full model of less than 20 questions. It was not known at the time

whether more AND or OR operators should be used, so roughly the same number of each

was implemented (suggesting the 4x4 or 4x5 model).

The only other choice left to specify the RRSA(N) algorithm was the value of Nmin,

the smallest number of respondents allowed to be identified by any given question subset.


CART, for example, is designed so that by default no single terminal node identifies fewer

than five cases (although the parameter is adjustable). In the present case, considering the

very large number of variables and the large size of the dataset, raising this limit seemed

reasonable, so Nmin was set at 25 for all three models in Appendix I. In practice, the subsets

in the final models each identified many more than 25 respondents (see Chapter 5 for

these details).

One big advantage of the nonparametric searching method described above is that the

program can adeptly handle a very large dataset, and it needs absolutely no recoding of the

data. Also, the computer could consider not only more variables but also many different ways of splitting each variable. A formulaic

approach requires that splits (dummy variables) and missing values be defined in advance

to accommodate the linear basis of the equation (although one might use a similar search

algorithm to select from many regression equations). Since substantive interpretations were

not the primary goal of the model, missing values were allowed to remain coded as "999"

with no further treatment, so that they were treated just as any other value. Furthermore,

since all mutations were chosen randomly from the full range of all possible variables and

variable values with equal weighting, it was possible to claim that the splits were chosen in

a completely objective, systematic, and well-defined manner.

4.3 Performance of the search algorithm

There were three central questions of interest concerning the efficacy of the search

algorithm. First, it was clearly possible to find an absolute maximum for sufficiently small

sets of questions (e.g., two or three questions as a single model). It also seemed that because


of the extremely large number of possible models inherent in a search over larger models

(e.g., 40-50 questions in a single model), one would probably not be able to locate an

absolute maximum within a reasonably short time. Thus, one would like to know the model

size at which, given the present computing power, one can find an absolute maximum within

a reasonable length of time. Ofparticular interest is whether it is possible to find an absolute

maximum for models ofthe size presented in Appendix I (e.g., less than 16 questions or so),

and whether the models given in Appendix I are in fact such maxima. Secondly, there was

an interest in exactly how long the search needed to succeed in finding such a maximum. Lastly,

one would also like to know how frequently the absolute maximum would be achieved, if

indeed a true maximum was found.

Suppose one starts with a relatively compact model (e.g., that given in Appendix I

as Set B). This set of questions consisted of a simple 2+3+2 structure (3 sets of questions

combined with "OR", with the sets consisting of2, 3 and 2 questions combined by "AND")

for a total of 7 questions. It used a relative misclassification cost of 3.5. At the time of this

writing, the RSA search across all models of this type (using the full dataset) has been

repeated independently more than 2,000 times; this was the equivalent of 20 independent

runs of RRSA(1 00). The observed maximum (defined as the result of the RRSA method,

which again is not proven to be an absolute maximum) of Question Set B was achieved in

almost exactly 5% of the RSA searches. When the RSA searches were grouped into 20

groups of 100 searches each, as if 20 RRSA(100) searches had been conducted, it was

observed that Set B was achieved in every one of the RRSA(100) searches. Thus, although

the RSA was inherently random and unlikely to settle into a single absorption point, the


RRSA(N) algorithm was nearly deterministic for an N of at least 100. Suppose, for example, that any single run of the RSA has a 5% chance of finding the observed maximum of Set B,

and a 0% chance of finding a better maximum. (This is the case as estimated here from

thousands of repeated runs of the RSA.) Then the chance that RRSA(100) finds Set B as the

observed maximum is equal to 1 - (0.95)^100 = 99.4%. For RRSA(200), there is a 99.997%

chance that the algorithm will find Set B.

For this particular model structure, it seems probable that this observed maximum

was also the absolute maximum. It was possible through repeated applications of RRSA(N)

with large N to achieve the observed maximum an indefinite number of times, without ever

finding an improvement or settling on a worse model. If the true maximum were not being

achieved, one would expect that, with enough continuous searching, one would eventually find some degree of improvement. As the number of independent RSA runs goes

to infinity, the probability that the absolute maximum is found goes to one. This is because

the seeds themselves are chosen randomly from a finite pool of all possible models, so

eventually one will reach the maximum by pure chance. However, the author has left the

RRSA(100) search running continuously for multiple months, using more than 2,000 different

seeds. Ever since the observed maximum of Set B was found (which occurred within the first

25 or so searches), no additional improvement has ever been found.

This argument by brute force may become increasingly convincing with more

searches; in fact it seems possible in this case to catalog all the possible absorption points

that can result (not just those with the lowest error). There appears to be a relatively small

number of such points (less than 100 in all) for the given model structure. Thus, at some


point the search will no longer be able to find any additional absorption points (much less any that give improvement), and all possible absorption points will have been

achieved more than several hundred times, for example. At this point, it would be possible

to estimate a probability function for all absorption points having a mass greater than 0.1%,

for instance.

The median number of mutations required to achieve any given absorption point for

the model size of seven questions and a cost of 3.5 in one given run of the RSA was about

42,000, and the average was roughly 111,000. However, the median number of mutations

required was much larger in those RSA searches where the observed maximum was obtained:

319,000, nearly eight times as large (the reason for this is discussed below). Also, the

distribution of the numbers of mutations required to achieve absorption was heavily skewed,

with an extremely long right tail. Figure 4.1 shows a histogram of the number of mutations

required to achieve absorption for about 400 random absorption points obtained from 400

independent runs of the RSA (where the total number of mutations has been transformed by

the base 10 logarithm). The range of values was remarkably wide: the smallest number of mutations required was 1,160 and the largest was more than 1,139,000.

Also of interest was the rate at which an average run of the RSA converged to an

absorption point. Figure 4.2 shows the cumulative number of successful mutations (those

that achieve a reduction in learning set error) against the cumulative number of mutations for

three random searches. Interestingly, only about 40-50 successful mutations were required

to reach an absorption point in these three trajectories (the median for all trajectories was

almost exactly 50, no matter whether the observed maximum was achieved). Figure 4.3


Figure 4.1 - Histogram of numbers of mutations required to achieve absorption

[x-axis: log (base 10) of number of mutations, from about 3.0 to 6.0; y-axis: frequency]


Figure 4.2 - Total number of successful mutations by total number of mutations (trajectories for 3 random absorption points)
[x-axis: log (base 10) of total # of mutations; y-axis: total # of successful mutations]


Figure 4.3 - Total number of successful mutations by total number of mutations (trajectories for 200 random absorption points)
[x-axis: log (base 10) of total # of mutations; y-axis: total # of successful mutations]


shows about 200 such trajectories. These figures suggest that the process has some degree

of regularity despite the extremely wide range of possible outcomes.

To see why this sort of pattern arises, consider a simple probabilistic model of the

search. After starting with a random seed, a mutation is generated and there is some

probability that this mutation will be successful (that it will achieve a reduction in error); let

this probability be denoted by 1 - π₁, where π₁ is the probability of not succeeding. In the

search the computer draws such mutations randomly from the pool of all possible mutations

until one is successful. Clearly, the probability that one achieves a successful mutation in

the first mutation is 1 - π₁. The probability that one achieves success only after two mutations

is π₁·(1 - π₁), the probability that one achieves success only after three mutations is π₁²·(1 - π₁),

and so on. That is, the total number of mutations required before the first success is achieved

(call it Y_mut(1)) has the well-known geometric distribution:

Pr(Y_mut(1) = y) = π₁^y · (1 - π₁)

which implies that Y_mut(1) has a mean of π₁/(1 - π₁) and a variance of π₁/(1 - π₁)².

Once the first success has been achieved however, it seems clear that the probability

of drawing the second successful mutation in one more draw is not necessarily equal to (and

in many cases, will be smaller than) 1 - π₁. This is because the pool of successful mutations

may be smaller once the first successful mutation has been found (but not necessarily so).

In any case, it seems that the number of additional mutations required for the second

successful mutation to be achieved (call it Y_mut(2)) also has a geometric distribution. Most

likely, this distribution has a different probability of success (call it 1 - π₂), where π₂ is the

probability of not achieving a successful mutation, once the first successful mutation has


occurred.

Now, suppose that a total of K successful mutations were required before the search

reaches absorption. The total number of mutations required before the search reaches

absorption (call it T_mut) may then be thought of as a sum of all the unsuccessful mutations

and the successes, so that:

T_mut = K + Σ_{i=1}^{K} Y_mut(i)

where the Y_mut(i) are each geometrically distributed with parameter π_i. If one takes K and the sequence of parameters π₁, ..., π_K as given, then the Y_mut(i) are independent (but not identically distributed) and the distribution of T_mut is completely determined. As mentioned,

K for this case was around 50 (between 20 and 80), and the success parameters could also

be estimated quite easily. Figure 4.4 shows the estimated probability of success on the y-

axis, against the number of successful mutations which have occurred. On average, then, the

probability of success starts high (near 0.5) but falls rapidly, asymptotically approaching

zero. It becomes vanishingly small and (in these estimates) reaches zero beyond 90

successful mutations or so, suggesting that absorption before this point is virtually certain.

The pattern of decreasing probabilities explains the nearly-exponential pattern to the

cumulative number of successful mutations by total mutations as shown in Figure 4.3; at

first, success is common, but becomes extremely rare once 40 or so successful mutations have

been achieved. If one were to estimate these sequences of probabilities for each of the

possible values of K (instead of averaging over all values, as in Figure 4.4) and observe the

distribution of the K's, this would completely specify the dynamics of the process.
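The dynamics just described can be mimicked with a short simulation. This Python sketch is only illustrative: the declining sequence of success probabilities below is a hypothetical stand-in for the sequence that would have to be estimated from the data (as in Figure 4.4), with K = 50 successes assumed before absorption.

    import random

    rng = random.Random(0)

    def simulate_total_mutations(success_probs):
        """Simulate T_mut: for each required success, draw candidate mutations
        until one improves the model, counting both the failures and the success."""
        total = 0
        for p in success_probs:          # p = 1 - pi_i, the chance that a draw succeeds
            failures = 0
            while rng.random() >= p:     # geometric waiting time before a success
                failures += 1
            total += failures + 1        # the +1 counts the successful mutation itself
        return total

    # Hypothetical sequence: success starts near 0.5 and decays toward zero.
    probs = [0.5 * 0.9 ** i for i in range(50)]
    draws = sorted(simulate_total_mutations(probs) for _ in range(1000))
    print("median simulated T_mut:", draws[len(draws) // 2])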


Figure 4.4 - Probability of a successful mutation by number of previous successful mutations, for models having the form of Question Set B in Appendix I
[x-axis: number of previous successful mutations (0 to 80); y-axis: probability of a successful mutation]


Unfortunately, these parameters depend heavily on the structure of the data (largely by the

sets 𝒱 and 𝒰_v, for all v in 𝒰, as defined in the section above), so there is no obvious model to

generalize the process any further.

For pragmatic purposes, some readers may be interested in the amount of real time

required to find the observed maximum. Since, for any given search, there was an estimated

5% chance of achieving the observed maximum, the total number of repeated RSA searches

required before reaching the maximum can itself be modeled geometrically, again with a

parameter π, which here is estimated as one minus the probability of success, or 95%. Then

the average number of RSA searches required before the observed maximum was achieved

was 95%/5% = 19 searches, plus the single successful search, for a total of 20 searches. On an unused

SparcStation Ultra, this required less than 8 megabytes of RAM, and took an average of

about 18 hours to complete. Of course, one does not know beforehand whether the

maximum has been achieved, and so additional computing is undoubtedly necessary.

However, this was at least a short enough period so that the project became feasible.

Now consider the performance of the search algorithm for the larger-sized model, in particular

the 4x4, 16-question structure used as the full model before pruning back to obtain Set B.

With twice as many questions in the model, the number of potential models was much larger,

and the computer required more searching time for any single run of the RSA since more

questions had to be handled by the computer at any given point. Then, given the smaller

number of absorption points observed and the larger pool of potential points, it is harder to

argue that an absolute maximum has been found. About 300 different seeds have been used

at this point, and the observed maximum has only been achieved several times, implying that


the mass associated with this absorption point is closer to 1% (compared with 5% for the

small model). However, no improvement has yet been found.

To assess the efficiency of the search algorithm further, a simulated dataset was

constructed by the author's advisor. The structure of the absorption points in the data was

such that there was a global maximum consisting of eight questions as well as multiple local

maxima (all known to the advisor, but unknown to the author). These were embedded in

a dataset of 32 variables and 1,000 observations. No clues about the structure of these

absorption points were given to the author, and the RSA search algorithm used on the data

was completed (and programmed) before the author's contact with the data. Using a full

model with the 4x4 structure above, it was found that the RSA located the absolute maximum

about 15% of the time. It took an average of less than five minutes on those runs that did find

the maximum. The algorithm also found the local maxima quite efficiently. Moreover, the

backwards deletion invariably chose the proper model size. Overall, the performance of the

algorithm on this ersatz data was quite encouraging. An additional simulated dataset

(consisting entirely of noise, unknown to the author) was also constructed by the advisor.

This provided an interesting test for the algorithm, since building a full model on the learning

dataset that appeared to predict the noise with some accuracy was quite possible, with only

a modest amount of searching. However, when the fitted models were applied to the test set,

the data were quite readily identified as noise. This was discovered simply by calculating

a standard error for the test set estimate of the misclassification error (as suggested in

Chapter 2). Then one could calculate whether the estimate was within two standard errors

of the null misclassification rate.
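A minimal Python sketch of that two-standard-error check, assuming only the test-set error count is in hand (the counts used here are placeholders, not values from the EPESE data):

    from math import sqrt

    def looks_like_noise(n_errors, n_test, null_rate):
        """True if the test-set misclassification rate is within two (binomial)
        standard errors of the rate achievable with no model at all."""
        rate = n_errors / n_test
        se = sqrt(rate * (1 - rate) / n_test)   # assumes a simple random sample
        return abs(rate - null_rate) <= 2 * se

    # Placeholder numbers: 160 errors on 1,000 test cases, null rate 15%.
    print(looks_like_noise(160, 1000, 0.15))    # True -> indistinguishable from noise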


4.4 Linear discriminant analysis

For linear discriminant analysis, it was possible to handle a very large number of

variables in the full model, but the model required extensive recoding of the data, and some

preselection of variables. Specifically, it was necessary to reassign missing values, to

convert categorical or nonordinal variables into indicator (dummy) variables, and to

eliminate some variables that were highly correlated. After removing 16 variables that were

one of a pair of variables with a correlation coefficient greater than 0.9, there were 150

variables left out of the 166 variables in the penultimately full dataset. For variables that

were clearly ordinal or interval in value (e.g., weight in pounds, height, blood pressure, pulse,

frequency of alcohol consumption, and others), missing values were assigned the mean of the

nonmissing values. For many of the more categorical variables (particularly those pertaining

to functionality), all values including the missing were simply recoded as zero or one; for

example, the responses to the question "Was there any time in the past 12 months when you

needed help from a person or equipment to take a bath?" were recoded as zero if the answer

to the question was "no", and one if the answer was "yes", "unable to do", or "missing". (In

most of these case the number of missing values was small.) It should be noted, however,

that much knowledge was gleaned from the above question-choosing process concerning

which splits were effective. Certain splits of responses in many categorical variables were

known to work because they had been observed as the output of the method of Section 4.2;

in this way, the linear discriminant analysis was probably afforded some advantage over the

above method, which used no such information.
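A sketch of the two recodings described above, written in Python with pandas (the column names, and the treatment of "unable to do" as a high-risk answer, are illustrative assumptions rather than the exact EPESE codings):

    import numpy as np
    import pandas as pd

    def recode_for_linear_model(df, ordinal_cols, categorical_cols, bad_answers):
        """Mean-impute ordinal/interval variables; collapse categorical variables
        to 0/1, with missing values grouped with the high-risk answers."""
        out = df.copy()
        for col in ordinal_cols:                  # e.g., weight, blood pressure, pulse
            out[col] = out[col].fillna(out[col].mean())
        for col in categorical_cols:              # e.g., needs help taking a bath
            out[col] = (out[col].isin(bad_answers) | out[col].isna()).astype(int)
        return out

    toy = pd.DataFrame({"weight": [150, np.nan, 180],
                        "bath_help": ["no", "yes", np.nan]})
    print(recode_for_linear_model(toy, ["weight"], ["bath_help"],
                                  bad_answers={"yes", "unable to do"}))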

Using the set of 150 recoded variables for the learning set data, it was possible to


estimate just such a "full" model of size 150 using linear discriminant analysis as described

in Chapter 2.3 With such a large model, a thorough backward deletion was not feasible.

Instead, a close approximation was used. By standardizing all the variables to have a mean

of zero and standard deviation of one, the size of the coefficient for each variable was taken

to be a good measure of a variable's importance in the model. Thus, the single variable

(from the full model of 150 variables) with the smallest standardized coefficient was dropped

from the model, and the parameters were reestimated. Again, the variable with the smallest

coefficient was dropped, and so on until only 30 variables remained. At this point, a more

thorough deletion process was implemented. Just as in Section 4.2, each of the 30 variables

was temporarily dropped from the model, and the error was estimated for each of the 30

submodels of size 29 obtained in this way. The variable that yielded the smallest increase

in error when temporarily dropped was then permanently dropped from the model to give the

best model of size 29. Again, the deletion was repeated in this way until no variables

remained, resulting in a sequence of 30 nested submodels, one of each size, built only with

the learning set data.
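A sketch of the coefficient-based deletion pass in Python, substituting scikit-learn's LinearDiscriminantAnalysis for the Splus discr() routine (that substitution, and the toy data, are assumptions; only the deletion logic follows the description above):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def coefficient_deletion(X, y, keep):
        """Refit the LDA on standardized predictors and repeatedly drop the
        variable with the smallest absolute coefficient until `keep` remain."""
        cols = list(range(X.shape[1]))
        while len(cols) > keep:
            Z = X[:, cols]
            Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)        # mean 0, sd 1
            coefs = LinearDiscriminantAnalysis().fit(Z, y).coef_.ravel()
            cols.pop(int(np.argmin(np.abs(coefs))))          # least "important" variable
        return cols

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)
    print(coefficient_deletion(X, y, keep=5))                # indices of surviving variables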

Next, this sequence of 30 models was applied to the test set, and the accuracy of each

model was recorded. Figure 4.5 shows the results of this application, where the y-axis shows

the accuracy of the model as measured by the correlation between the discriminant variable

(or z-score) and the true outcome of death or survival. The higher, smoother curve shows

the correlation coefficient as it was estimated by the learning set, and the lower curve shows

the result of applying the models to the test set (analogous to an upside-down version of

3 The Splus function used here for the linear discriminant analysis was discr().


[Figure 4.5 - Correlation between the discriminant score and the true outcome, by number of variables in the model, for the learning set (upper, smoother curve) and the test set (lower curve); x-axis: number of variables in model.]


Figure 1.2). As in Figure 1.2, the learning set measure of accuracy could always be

improved by adding yet another variable to the model. However, when the model is applied

to the test set, there is a point at which the addition of variables fails to improve (or even

decreases) the accuracy of the model. By observing Figure 4.5, it was decided that the best

linear discriminant model size contained 15 variables, and the accuracy of this model on the

test set was recorded. Then, the data were recombined into a full data set, and the best model

of size 15 was reestimated, yielding the "preferred" linear discriminant model.

4.5 Logistic regression

Probably the most commonly used multivariate model for predicting mortality (and

many other binary events) is the multivariate regression model, typically employed with the

logit or probit link functions. Under the binomial scenario (i.e., with the logit link function log(p/(1-p)) and the variance function p(1-p)/N), one obtains the most frequently used

form, logistic regression. For this method, missing values and categorical variables need to

be recoded to conform to the regression equation's linear basis. So the same recodings as

used for the linear discriminant analysis were employed (i.e., assigning the mean to missing

values of ordinal/interval variables, and splitting categorical variables into indicators). Also,

due to memory limitations the computer was unable to estimate the large model of all 150

variables used in the linear discriminant analysis. So the set of 30 variables obtained after

the first deletion pass in the linear discriminant analysis was used as the "full" logistic

regression model. In this way, the logistic regression method was granted an even larger

advantage than that given to the linear discriminant analysis. A great deal of knowledge

about which variables were important was gained in the linear discriminant deletion from


150 variables to 30 variables, in addition to knowledge about how to recode variables

effectively from the question set method.

To estimate the model parameters and variances, the Splus function glm() (standing

for generalized linear models) was used with the usual binomial model. The parameters

were fit with maximum-likelihood estimation using an iteratively reweighted least squares

algorithm.4 There were a number of algorithms for variable selection that could be applied

to the fitted full model at this point. The most common (and probably the most suspect)

method for stepwise backward deletion involves computing p-values for each coefficient,

dropping the variable with the lowest t-statistic at each pass until all the remaining

coefficients are statistically significant. (For the logistic regression model, this involves

computing standard errors from the asymptotic variance-covariance matrix of the coefficient

estimates.) The problem with this type of backward deletion is well known: "weeding" or

selecting out variables according to statistical inferences invalidates additional estimates of

p-values, since some coefficients are bound to be significant given a large enough number

of variables. The result is that the suggested model size is often larger than optimal.

Some statistics have been suggested which attempt to attach some "penalty" or cost

to each additional variable in a model, or to select out a model size based on some more

sophisticated criterion than t-statistics. Mallows' Cp and Akaike's Information Criterion (AIC) both attempt to pick an optimal model size (the AIC statistic actually being a likelihood version of Cp). Here the AIC measure was quite easy to implement with the Splus backward deletion algorithm step(). The AIC is essentially a linear combination of the model deviance

4 See Chambers and Hastie (1992).


(a measure of model error analogous to the residual sum of squares of the usual OLS model) and the number of degrees of freedom in the model.5 Specifically,

AIC = deviance + 2·φ·df

where φ, assumed to be nonnegative, is known as the "dispersion parameter" (the Splus algorithm sets this equal to one by default). The deviance gauges the error in the model, and the second term in the expression assigns a penalty of 2φ to each additional coefficient. As one drops a coefficient from the model, the deviance increases (just as R² would decrease for the OLS model) but there is also a decrement of 2φ; one thinks of the increase in deviance as showing an increase in bias, while the decrease of 2φ reflects the decreased variance of

the smaller model. Thus, the goal is to find the model size that minimizes this criterion.

The deletion worked by starting with the full model as estimated on the learning set,

and dropping the variable that decreased the AIC the most, repeating this deletion until the AIC could not be decreased by another deletion. It would have been possible to continue the

deletion to obtain a full sequence of nested models, applying each to the test set to determine

the optimal model size as above. However, it was felt that an interesting comparison would

be obtained by using the more standard method of model selection. The idea was to use

logistic regression much as it is commonly used in conventional practice (e.g., with a typical

"canned" algorithm for model selection), to compare it with the more sophisticated test set

techniques described above. The Splus automatic backward deletion algorithm step() is a

good example of the state of the art for such algorithms.
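For readers who want the mechanics, the following Python sketch performs the same kind of AIC-driven backward deletion, with scikit-learn standing in for the Splus glm()/step() pair (an assumption) and the dispersion parameter fixed at one as in the text; a large C is used to approximate an unpenalized fit.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    def aic(X, y, phi=1.0):
        """AIC = deviance + 2*phi*df for an (essentially unpenalized) logistic fit."""
        fit = LogisticRegression(C=1e6, max_iter=2000).fit(X, y)
        deviance = 2 * log_loss(y, fit.predict_proba(X)[:, 1], normalize=False)
        return deviance + 2 * phi * (X.shape[1] + 1)    # coefficients plus intercept

    def step_backward(X, y, cols):
        """Drop the variable whose removal lowers the AIC most, until none helps."""
        cols = list(cols)
        best = aic(X[:, cols], y)
        while len(cols) > 1:
            trials = [(aic(X[:, [c for c in cols if c != d]], y), d) for d in cols]
            trial_aic, drop = min(trials)
            if trial_aic >= best:                       # no deletion improves the AIC
                break
            best, cols = trial_aic, [c for c in cols if c != drop]
        return cols, best

    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 8))
    y = (X[:, 0] - X[:, 2] + rng.normal(size=400) > 0).astype(int)
    print(step_backward(X, y, range(8)))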

On the first pass, using only the learning set data and the 30 variables from the linear

5 See Chambers and Hastie (1992).


discriminant analysis, the step() function (with the dispersion parameter set at unity) picked

a regression model containing 26 coefficients, including the intercept. All coefficients had

asymptotic t-statistics that gave p-values significant beyond the 0.05 level. However, on

applying the model to the test set, this model size was evidently too large, and indeed, upon

recombining the data and reestimating the model of size 26, some coefficients were no longer

significant. Thus the step() function was then implemented using the entire dataset, and this

yielded a model with 24 coefficients. All p-values for this model were significant beyond

0.05, except two, which were significant beyond 0.10. It was suspected that this model was

still too large, particularly since it yielded almost the same level of accuracy (in both the test

set and the learning set) as the much smaller linear discriminant model (see Chapter 5).6 This

is a common criticism of the Mallows' Cp and AIC criteria; in retrospect, it might have been

wise to raise the dispersion parameter to obtain a smaller model.

4.6 The CART algorithm for model selection

Just as with the methods above, there are several parts to the CART method for

devising splits in the form of a classification tree: 1) the building of the full sized tree with

the learning set; 2) the pruning of the tree through backward deletion to obtain a sequence

ofnested submodels; 3) the application of these models to a test set to determine the optimal

model size; 4) the use of the full, undivided sample to estimate the best model of the optimal

size. Since the algorithm could accomplish this all automatically with a test set method, it

was possible simply to provide the program with the undivided data (N = 10,294), allowing

6 Interestingly, although the standard errors for the coefficients were smaller when the full dataset was used (due to the 50% larger N compared to the learning set N), many of the coefficient estimates were shrunk towards 0, resulting in smaller t-statistics.


it to draw its own test set. As with logistic regression, the program was not able to take as

many as 150 variables, and so the set of 30 variables culled from the first pass of the linear

discriminant analysis (Section 4.4) was used.7 An advantage of CART over logistic

regression and linear discriminant analysis, however, is that it is essentially possible to use

the raw dataset (including missing values) as the input, just as with the question set method

described above. The method for dealing with missing values involves finding substitutes

or "surrogate" splits that give a very similar division as the split that is missing. The

problem is that there is no guarantee that an accurate surrogate split exists; frequently there would not be one. For this reason, it was decided to recode the missing values as

above.

The building process for the CART algorithm started by forming the split at the top

of the tree (the root node), conducting an exhaustive search of all possible splits and

choosing that split that best separated the survivors from the decedents. (The improvement

in the Gini index of heterogeneity is used to measure the success of the division). This root

node then yielded two subgroups of respondents, and for each of these subgroups, another

exhaustive search was conducted for the best split. This process was completed until the "full" sized tree (T_max) was constructed; this tree usually contained well more than 100 splits,

depending on the parameters supplied to the program. Since the searches for splits were

exhaustive, but only within each given subgroup of respondents as defined by the succession

of splits, the process was labeled one-step optimal. That is, the optimal split was found only

7 The CART program as designed by the original creators was used for this analysis. There is also a version of classification trees available within the Splus software package (which possibly could have handled a larger number of variables), but after extensive experience with both, the CART program was felt to be superior in most other respects.


for each stage of splitting; there was no guarantee that the overall combination of splits that

formed the tree as a whole would be the optimal combination.

Next the algorithm performed a backward deletion on this tree to obtain a sequence

of subtrees. The idea behind the deletion process was labeled cost-complexity pruning. The

idea was that pruning would be a process of dropping splits, starting from the bottom of the

tree and pruning upwards toward the root node. To determine which splits should be

dropped, each possible subtree was gauged by a cost-complexity measure. This was equal

to a linear combination of the learning set misclassification error of the tree and the number

of terminal nodes in the tree (the number of terminal nodes being equal to the number of

splits + 1):

cost complexity = misclassification error + α · (# of terminal nodes in tree)

where α denotes the cost-complexity parameter. The construction of this measure is highly

analogous to the AIC measure in that it combines a measure of the goodness of fit of the

model (measured by deviance in a logistic regression model and misclassification error for

a tree) and a penalty for each additional coefficient or split in the model.

However, the CART algorithm does not use this directly to choose the overall model size. Instead, the parameter α was scaled upward from zero along a continuum of values, and at each stage the subtree that minimized the cost-complexity criterion by pruning from the full tree was designated as the optimal tree for that value of α. Although the range of possible α values (α ≥ 0) is a continuum, there are a discrete number of such subtrees; when α is fixed at zero, the largest tree possible is allowed, but as α is raised, there are threshold values at which the splits toward the lower part of the tree are pruned off. As α is scaled


upwards, more splits are pruned from the bottom of the tree. Eventually, α can be raised enough so that only a very small tree (or no tree at all) remains. This process yields a sequence of nested submodels of varying sizes (varying numbers of splits, or terminal nodes)

built with the learning set only.
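A minimal Python sketch of that selection step, assuming the learning-set error and size of each candidate subtree are already known (the numbers below are placeholders):

    def cost_complexity(error, n_terminal_nodes, alpha):
        """CART's pruning criterion: error + alpha * (number of terminal nodes)."""
        return error + alpha * n_terminal_nodes

    def best_subtree(subtrees, alpha):
        """subtrees: (label, learning-set error, number of terminal nodes) triples."""
        return min(subtrees, key=lambda t: cost_complexity(t[1], t[2], alpha))

    trees = [("T1", 0.34, 2), ("T2", 0.30, 5), ("T3", 0.28, 10), ("T4", 0.27, 40)]
    for alpha in (0.0, 0.001, 0.01):
        print(alpha, best_subtree(trees, alpha)[0])   # larger alpha -> smaller tree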

Next, this sequence of models was applied to the test set, and the misclassification

error for each submodel was recorded. Figure 4.6 shows a plot of this error on the y-axis

with the number of terminal nodes on the x-axis. As usual (compare this figure to Figure 1.2

for instance), one observes that the smallest models have a higher error. For the test set, this

error decreases rapidly as the model size increases and then flattens once the tree contains

more than nine splits or questions. (If one tests larger trees, containing 100 splits say, the test

set error is much higher.) Thus the tree with nine splits was chosen as the optimal size, and

the data were recombined into the undivided, full (N = 10,294) dataset, whereupon the

optimal tree with the same cost-complexity parameter was grown. As with the question set

method above, three different trees were grown using three relative misclassification costs,

equal to 2, 3.5, and 7 (see Chapter 5 for an explanation of why the same set of costs was not

used).

4.7 Modifications and additions to the question set method

In later stages of the dissertation research, a number of improvements were made to

both the RSA and the method of model selection. (None of these changes concerned the

results in Appendices I-IV.) First, as demonstrated in Section 4.3, it was noted that the RSA

was spending a great deal of searching time attempting to locate the last few successful

mutations. It turned out that exhaustive searching was much faster once most of the


[Figure 4.6 - Test set misclassification error by number of terminal nodes in the tree; x-axis: number of terminal nodes; y-axis: misclassification error.]


successful mutations were achieved. The trick, however, was that this exhaustive searching

had to be done in a random way to avoid systematically falling into local maxima. Thus, the

candidate questions were always reordered randomly before conducting the exhaustive search

(and of course, the chosen questions in the existing model were already randomly ordered

at this point). This provided much faster searching times, particularly because of the fact that

some variables contained many possible values (e.g., weight at time of interview took on

hundreds of possible unique values), making it very difficult to find the optimal cutoff value

through a purely random search. This was dubbed the random and exhaustive search

algorithm (RESA) and when N independent runs were made, the notation RRESA(N) was

used.
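The Python sketch below illustrates the shape of one RESA improvement step; the toy objective and candidate pool are stand-ins for the actual search over questions and cutoff values, not the dissertation's C implementation.

    import random

    def randomized_exhaustive_step(model, candidates, error, rng):
        """For each slot of the model, scan the candidates in a random order and
        accept the first replacement that lowers the error; None at absorption."""
        order = list(candidates)
        rng.shuffle(order)
        current = error(model)
        for position in range(len(model)):
            for candidate in order:
                trial = list(model)
                trial[position] = candidate
                if error(trial) < current:
                    return trial            # a successful "mutation"
        return None                         # no improvement anywhere: absorbed

    rng = random.Random(0)
    target = [17, 3, 88, 45, 9]             # hidden optimum for the toy objective
    error = lambda m: sum(a != b for a, b in zip(m, target))
    model = [rng.randrange(100) for _ in range(5)]
    while (step := randomized_exhaustive_step(model, range(100), error, rng)) is not None:
        model = step
    print(model == target)                  # True: the toy search reaches its maximum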

Secondly, a change was introduced in the method of pruning. As mentioned above,

one problem was that the searching could result in varying absorption points, perhaps

implying varying model sizes when pruned. Thus it was possible to estimate a model size

that was optimal for the absorption point constructed on the learning set, but suboptimal for

an absorption point constructed on the full, recombined dataset. One solution to this was to

allow more flexibility in the model size via a version of backward deletion called cost-complexity pruning, which is the method used by CART.

The idea is discussed above: one assigns a cost or penalty to each variable in the

model (called α above), and this is combined linearly with the model error estimated from the learning dataset. To estimate the optimal level of α, one applies the sequence of models built on a learning set to a test set, fixing α at that value corresponding to the model size with optimal test set error. For the question set method, this amounted to attaching a cost of


α to each question in the set. Then the data were recombined into the full dataset, and the

RRSA(N) was used to find a full model. This full model was then subjected to backward

deletion, the error was estimated for each model size, and the cost-complexity associated

with each model was then calculated using the value of α as estimated above. The preferred

model was then chosen as that which minimized the cost-complexity criterion. In this way,

the model chosen on the full dataset was allowed to have a different number of questions

than the model chosen with the learning dataset.

To determine the efficacy of this algorithm, the method was applied to the dataset

contrived by the author's advisor, as discussed in Section 4.3 above. This dataset contained

two local (suboptimal) maxima consisting of six questions each, while the global maximum

consisted of eight questions. Once the data were divided into a learning set and test set, the

RSA was applied to the learning set. It was deliberately allowed to locate one of the

suboptimal maxima (embedded in a full, 4x4 model), a sequence of models was estimated

by backward deletion, and α was estimated as the value associated with the six-question local maximum. Then the data were recombined into the full dataset, and the RSA was allowed to

locate the global, eight-question maximum (embedded in a full 4x4 model). Then backwards

deletion was applied to obtain a sequence of submodels, and using the value of α as estimated on the suboptimal local maximum, the model that minimized the cost-complexity

criterion was chosen. Invariably, this turned out to be the global, eight-question maximum.

The C code for conducting this improved method completely automatically (both the

exhaustive searching, and the cost-complexity pruning) is given in Appendix VII.

These changes in the method were used to compute a much larger model of mortality


with the EPESE data. The fact that the RESA was much faster than the RSA made it possible

to search across much larger model spaces. Based on the above results, it was thought that

the best-sized subsets invariably contained three or fewer questions. It was also found that

when a full model consisted of subsets with no more than three questions, such a model

structure could accommodate more of these subsets. For example, one gains greater

predictive power on the test set by starting with a full model of 30 questions (ten subsets of

three questions each).

To show this, the RRESA(200) algorithm was applied to the learning set with a full

model of 30 questions, using a misclassification cost of 3.5. This full model was then

subjected to backward deletion, and the resulting sequence of nested models was applied to

the test set. 8 This suggested a model size of 20 to 25 questions. The cost-complexity

parameter for the optimal model size was estimated, and the learning and test sets were

recombined into the full dataset. Then, RRESA(200) was applied to the full dataset to choose

a full model with 30 questions. This full model was then subjected to backward deletion to

obtain a sequence of submodels, and the model that minimized the cost-complexity criterion

was selected out. The resulting model contained 23 questions in ten subsets, and it achieved

a lower prediction error than the most powerful linear discriminant model. See Chapter 5

for a summary of this model.

8 At this stage of the research, the original division of learning set and test set was lost. Therefore, a new learning set was created by drawing a simple random sample from the full dataset as before. The only difference was that this random division was repeated until the same proportion of deaths was observed in both the learning and test datasets. Breiman et al. (1984) suggest that this may result in a more stable estimate of prediction error.


Chapter 5 - Models for the prediction of mortality

5.1 Simple questions for predicting survival or death

The first set of questions in Appendix I, Set A, consists of 10 questions in total,

broken into five subsets of one, two, or three questions each (see Table 4.1). The questions

in this set result in the following classification when applied to the full dataset, as shown in

Table 5.1.

Table 5.1 - Predicted outcome by true outcome, Question Set A, full dataset estimates

Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived                   died                     TOTAL
PREDICTED     survived     6,743 (92.5% of row;       546 (7.5% of row;         7,289
OUTCOME                    76.2% of column)           37.7% of column)
              died         2,101 (69.9% of row;       904 (30.1% of row;        3,005
                           23.8% of column)           62.3% of column)
              TOTAL        8,844                      1,450                    10,294

Thus, 3,005 persons in the sample were classified as high risk (predicted as dead) and

of these, about 30% actually died. This classification correctly classifies about 62% of all

deaths and 76% of all survivors, corresponding to a cost-adjusted misclassification rate equal

to:

(546·5 + 2,101) / (1,450·5 + 8,844) = 0.300.

However, since this is computed using the same dataset on which the classification was built,

the estimate is downwardly biased, perhaps to a substantial degree. Therefore, a more honest


estimate is given by the corresponding internal test set estimate, obtained by applying the

same-sized model built on the learning set of 6,862 respondents to the test set of 3,432

respondents. The results of this application are shown in Table 5.2. (These are the same

results summarized in Table 1.2, but they are presented here again for convenience, and with

a little more detail).

Table 5.2 - Predicted outcome by true outcome, Question Set A, internal test set estimates

Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived                   died                     TOTAL
PREDICTED     survived     2,035 (92.2% of row;       172 (7.8% of row;         2,207
OUTCOME                    69.6% of column)           33.8% of column)
              died         888 (72.5% of row;         337 (27.5% of row;        1,225
                           30.4% of column)           66.2% of column)
              TOTAL        2,923                      509                       3,432

Interestingly, one finds that when the model was applied to a test set, it captured

slightly more deaths (66%) with a lower specificity of nearly 70%. These were somewhat

different numbers than those obtained in the full dataset estimates. The error rate

corresponding to this result is equal to:

(172·5 + 888) / (509·5 + 2,923) = 0.320

and may be considered a much more honest estimate (i.e., having a much smaller, though

still downward, bias). As discussed, this estimate is not completely unbiased because

multiple observations of the test set were made to obtain the estimate; however, one hopes


that the bias in this estimate is small (and in the validation process, discussed below, this

held true). An approximate estimate for the standard error of this estimate of error (based

on the assumption that the test set is a simple random sample) is 0.012. (For the calculation

of this estimate, see section 5.5 below.)
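Both error rates above are instances of the same cost-adjusted formula; a small Python sketch of the computation (the counts are those from Tables 5.1 and 5.2, with the relative misclassification cost of 5 used in the formulas above):

    def cost_adjusted_error(missed_deaths, false_alarms, n_deaths, n_survivors, cost):
        """Misclassification rate with each missed death weighted `cost` times as
        heavily as a survivor wrongly classified as high risk."""
        return (missed_deaths * cost + false_alarms) / (n_deaths * cost + n_survivors)

    print(f"{cost_adjusted_error(546, 2101, 1450, 8844, 5):.3f}")   # 0.300, full dataset
    print(f"{cost_adjusted_error(172, 888, 509, 2923, 5):.3f}")     # 0.320, test set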

Suppose one breaks the high risk respondents down according to each of the five

question subsets. The greatest number of respondents were chosen by Set A.2, which classifies all 1,294

digitalis users (12.6% of the sample) as high risk. More than 31% of these persons (406)

died within three years. The second largest group of respondents was defined by Set A.3,

which classified all 1,234 persons who are aged 80 and over and weighed more than 139

pounds at age twenty as high risk. Again, about 31% (381) of these respondents died within

three years. Set A.4, which isolated males weighing less than 168 pounds, and who could

not do heavy work, chose some 751 persons (7.3% of all respondents), of whom 277, or 37%,

died within three years. Set A.1 chose 596 persons, and 234 (or about 39%) of them died.

Set A.5, which chooses respondents who cannot see well and need help to walk across a

room, chose the smallest number of respondents (311), but these persons also had the highest

death rate (44%, or 137). Interestingly, then, an inverse correlation existed between the

number of respondents chosen by a subset, and the death rate of the chosen respondents.

About 908 respondents were chosen by more than one question subset, and 389

(43%) of these respondents died. Some 232 were chosen by more than two subsets (of whom 118, or 51%, died), and 38 were chosen by more than three subsets (of whom 22, or 58%,

died). Thus it seemed that the effect of being classified by each of the different subsets was

cumulative when a respondent was classified by more than one subset.


Interestingly, there were more females than males chosen as high risk by Set A,

although this was partly because there were more females in the sample. Some 1,452 males were

chosen (36% of all males), and 1,553 females were chosen (25% of all females). Figure 5.1

shows the distribution of chosen respondents by age and sex. As expected, the distribution

is quite elderly, but there are still many younger respondents chosen as high risk (nearly a

third of the respondents were younger than 75).

Consider Question Set B in Appendix I, consisting of seven questions total, made of

three subsets combined with "OR", with each subset containing two or three questions

combined with "AND". This model was obtained by searching on the full dataset (see Table

4.1 again). Table 5.3 shows the classification achieved by applying this question set to the

full dataset:

Table 5.3 - Predicted outcome by true outcome, Question Set B, full dataset estimates

Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived                   died                     TOTAL
PREDICTED     survived     7,838 (90.3% of row;       844 (9.7% of row;         8,682
OUTCOME                    88.6% of column)           58.2% of column)
              died         1,006 (62.4% of row;       606 (37.6% of row;        1,612
                           11.4% of column)           41.8% of column)
              TOTAL        8,844                      1,450                    10,294

This model selected out 1,612 respondents as high risk (about 16% of all respondents), and

about 38% of them died within three years. The model error corresponding to this

classification was equal to:


[Figure 5.1 - Respondents chosen as high risk by Question Set A, by age group and sex; x-axis: age group (65-69 through 85+), female and male; y-axis: number of respondents.]


(844·3.5 + 1,006) / (1,450·3.5 + 8,844) = 0.285,

again, a downwardly biased estimate since it was computed using the same data on which

the model was built. In this case, however, when the same-sized model was built on the

learning set, it was nearly the same model (with the exception of one question). When

applied to a test set, the results were not dramatically different, as shown in Table 5.4:

Table 5.4 - Predicted outcome by true outcome, Question Set B, internal test set estimates

Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived                   died                     TOTAL
PREDICTED     survived     2,575 (89.6% of row;       300 (10.4% of row;        2,875
OUTCOME                    88.1% of column)           58.9% of column)
              died         348 (62.5% of row;         209 (37.5% of row;        557
                           11.9% of column)           41.1% of column)
              TOTAL        2,923                      509                       3,432

Nonetheless, the accuracy is slightly lower than estimated from the full dataset, as can be

seen by calculating the error rate for the test set, equal to:

(300·3.5 + 348) / (509·3.5 + 2,923) = 0.297,

which can be regarded as a much more honest measure, though still (hopefully slightly)

biased. An approximate standard error for this estimate (as calculated below) is equal to


0.0125.

When the high risk respondents were broken down by the three different subsets, it

was found that Set B.2 (which selected out the same 751 respondents chosen by Set A.4)

chose the greatest number of respondents. Set B.l picked out the next largest group,

classifying 650 persons as high risk (those digitalis users who could not walk a half mile),

of whom 267 (41 %) died within three years. Set B.3 defined the smallest group, those 413

respondents who were age 80 or older and could not state their mother's maiden name, of

whom 181 (44%) died. There were 196 respondents classified as high risk by more than one

of the question sets, and of these 113 (58%) died. There were only six persons chosen by all

three subsets, but all six of these persons died! The question subsets in Set B, then, also

seemed to have a cumulative effect on those respondents chosen by more than one subset.

Table 5.5 - Predicted outcome by true outcome, Question Set C, full dataset estimates

Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived                   died                     TOTAL
PREDICTED     survived     8,580 (88.1% of row;       1,163 (11.9% of row;      9,743
OUTCOME                    97.0% of column)           80.2% of column)
              died         264 (47.9% of row;         287 (52.1% of row;        551
                           3.0% of column)            19.8% of column)
              TOTAL        8,844                      1,450                    10,294

There were many more males chosen by Set B (939, compared with 673 females),

owing partly to the first question in Set B.2 that classifies males as high risk. Figure 5.2

shows the distribution of chosen respondents by age and sex. Interestingly, the


[Figure 5.2 - Respondents chosen as high risk by Question Set B, by age group and sex; x-axis: age group (65-69 through 85+), female and male; y-axis: number of respondents.]


predominance of males exists only at ages younger than 80. The number of younger males

chosen by this question set is quite large.

Table 5.5 shows the classification achieved by Set C in Appendix I (a model

consisting of seven questions with the same structure as Set B, but seeking to classify

respondents at the highest level of risk) when applied to the full dataset. Thus, only about

20% of deaths were correctly classified, but with a specificity of 90%. The death rate of the

high risk respondents exceeded 50% in three years, indicating quite a high level of risk. The

cost-adjusted misclassification error for this result was calculated as:

(1,163·1.5 + 264) / (1,450·1.5 + 8,844) = 0.182,

which, again, is downwardly biased. When the same-sized model as Set C was built with

the learning set (a set of questions very similar to Set C) and applied to the test set, the results

presented in Table 5.6 were observed:

Table 5.6 - Predicted outcome by true outcome, Question Set C, internal test set estimates

Cell counts, percentages by row, and percentages by column

                                       TRUE OUTCOME
                           survived                   died                     TOTAL
PREDICTED     survived     2,791 (87.6% of row;       396 (12.4% of row;        3,187
OUTCOME                    95.5% of column)           77.8% of column)
              died         132 (53.9% of row;         113 (46.1% of row;        245
                           4.5% of column)            22.2% of column)
              TOTAL        2,923                      509                       3,432

Here, the test set result revealed a higher level of sensitivity (22.2% of all deaths, compared


with less than 20% in the full sample), but with a lower level of specificity (95.5% compared

with 97%). However, the high risk persons experienced a lower death rate of 46%

(compared with 52%). The misclassification error was estimated as:

(396·1.5 + 132) / (509·1.5 + 2,923) = 0.197,

with a standard error of 0.008.

Breaking down the respondents by subset, one finds that Set C.2 chose the most high

risk persons, some 249 respondents, of whom 131 (53%) died. Set C.1 chose 240 persons as high risk, and 130 (54%) died within three years, while Set C.3 picked 225 persons, of whom

110 (49%) died. There were 126 respondents chosen by more than one subset (of whom 64,

or 51%, died), and 32 respondents were chosen by all three subsets (of whom 20, or 63%,

died). Thus the cumulative effect of the subsets seems slighter than in the first two sets.

It was quite surprising to find that, in this highest risk group as chosen by Set C,

females outnumbered males nearly two to one! There were 364 females (5.8% of all women)

and 192 males (4.8% of all men). Also, the distribution of males does not vary by age in any

regular way. High risk females were particularly predominant at older ages. It was expected

that the highest risk respondents would be primarily elderly males, but it appeared that this was

not necessarily so.

5.2 An index for the risk of mortality: linear discriminant analysis

After determining that the best model size for the linear discriminant analysis seemed

to contain about 15 coefficients, the data were recombined into the full dataset. Then the

stepwise backward deletion was repeated to obtain the preferred model of size 15. The form


of this predictor is a set of 15 questions to which the answers can be scored to obtain an

index of mortality. Appendix VI contains the actual questions and the scoring system for this

questionnaire. The order in which the questions are listed corresponds to the ranking of the

variables in terms of each variable's "impact" on the index score. This ranking was achieved

by standardizing all the variables to have a mean of zero and a standard deviation of one, and

ranking the standardized discriminant coefficients according to their absolute magnitudes. The

first variable, age, had the largest standardized coefficient (not surprisingly); the next most

important variable was sex, and so on.

For this tool to be applied to an elderly individual, the interviewer simply asks the 15

questions, writes down the point value corresponding to each answer in the blank space, and

totals the point values to obtain the index score. Then by consulting Table 5.7, one estimates

the probability of death within three years by finding the row of the table corresponding to

the score. For example, suppose the questionnaire is completed by 65 year-old male with a

present and past weight of 170 pounds, and who scores zero on all other items. The index

score for this person is totaled as 145 + 91 - 225 + (170-26) = 155, corresponding to an

estimated probability of death of about 4.0%, give or take 0.55%. The point values

corresponding to each question are equivalent to the values of the discriminant coordinates

vector &.. described in Chapter 2. These coordinates have been scaled so that the observed

sample range of index scores (equal to the product X'&..) equaled zero to 1,000.

As can be seen in Table 5.7, the lower the index score, the higher the probability of

death within three years. The probability of death for each range of scores in the table was

estimated by the proportion of respondents in each range who were observed to die. The


Table 5.7 - Probability of death within three years by mortality index

Index score    Probability of death within 3 years (3qx)    Approximate standard error
<100           2.56%                                         0.84%
100-149        2.91%                                         0.57%
150-199        4.00%                                         0.55%
200-249        4.35%                                         0.52%
250-299        8.70%                                         0.71%
300-349        11.97%                                        0.92%
350-399        17.77%                                        1.22%
400-449        21.56%                                        1.51%
450-499        28.57%                                        1.97%
500-549        33.42%                                        2.39%
550-599        36.16%                                        2.74%
600-649        51.21%                                        3.47%
650-800        57.72%                                        3.15%
800+           66.67%                                        6.09%

Note: The interpretation of this probability and the standard error is as follows: from the full sample of 10,294 persons, 11.97% of those respondents who scored 300-349 actually died within three years; the standard error roughly gauges the uncertainty in this estimate. For example, if the sample is assumed to be a simple random sample from the general population of noninstitutionalized elderly, the range of probabilities 11.97% ± 0.92% = (11.05%, 12.89%) has a 68% chance of catching the true probability of death for persons who scored 300-349. For a 95% confidence interval, 11.97% ± 1.96·0.92% = (10.17%, 13.77%). The sample was not a simple random sample from the general population of elderly, so this estimate is a rough approximation. Probabilities were estimated for persons living in 1983-1985; present probabilities (for 1997) may be somewhat lower.


estimated by the observed proportion of respondents in each score range who died within three years. The standard errors attached to these estimates were obtained by assuming a binomial

distribution of deaths (corresponding to the assumption that the respondents were chosen

with a simple random sample), and so should be taken as rough approximations. If the

respondents were chosen with a simple random sample, the interval of plus or minus one

standard error around each estimate would have a 68% chance of containing the true

population proportion of deaths for each group. Figure 5.3 shows a bar graph of Table 5.7,

where the heights ofthe bars are equal to the estimated probability ofdeath within each range

of scores. The dashed lines around the top of each bar give the ± 1 S.E. interval around each

estimate. Clearly, if the standard errors are taken to be even slightly accurate, there is a

substantial degree of differentiation between levels of risk by index score. The probability

of death ranges from 2.56% to 66.67% (the former estimate having a standard error of less

than 1%).

To further understand the relationship between this index score and the risk of death,

the outcome variable Y as a function of z was smoothed with the nonparametric Splus

smoothing algorithm supsmu(). This function uses cross-validation to determine the span

of the smoothing window. The output is plotted in Figure 5.4. It appears from this graph

that the probability of death rises slowly until the index score reaches about 220, at which point it rises approximately linearly with a slope of 0.106 per 100 index points. This gives an approximate rule of thumb for scores above 220, which included the scores of about 70% of all respondents: an increase of 100 points in the index score is roughly equivalent to an increase of 0.106 in the probability of death, giving a good intuitive feel for the importance

of various questions and their answers. For example, suppose we examine the scores of two


Figure 5.3 - Bar plot of probability of death by discriminant index

dashed lines show +/- 1 standard error (approximate estimates)

[Bar heights range from 2.6% at the lowest scores to 66.7% at the highest; x-axis: mortality index (discriminant variable), 0 to 1,000; y-axis: probability of death within 3 years.]


Figure 5.4 - Smoothed estimate of probability of death by discriminant index (observed 0/1 outcomes, smoothed with the Splus function supsmu())
[Annotation: the slope is approximately a 0.1 increase for every 100 index points; x-axis: mortality index (discriminant variable), 0 to 1,000; y-axis: probability of death within 3 years.]


For example, suppose we examine the scores of two males with scores above 220 whose answers are identical for every question on the survey, except that one of them reports being a smoker. The smoker's score would be 54 points higher, since that is the value of the points corresponding to that answer (compared with a nonsmoker's score of zero). Assuming the scores for these persons are above 220 or so, the difference between the index scores implies that the smoker's probability of death is approximately 0.106 · (54/100) = 0.057, or 5.7 percentage points, higher than the nonsmoker's. A linear regression of scores in the range above 220 gave a simple rule for converting the index score to the estimated probability of death: divide the index score by 10 and subtract 20 to obtain the percent chance of death (these values are 9.457 and 21.323 to be more exact), an approximation that works remarkably well for any score above 240 or so. For example, a person with a score of 300 had about a 10% chance of dying (equal to (300/10) - 20), as confirmed by Figure 5.3

and Table 5.7.
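To make the rule of thumb concrete, the conversion can be written out in a few lines of code. The sketch below is illustrative only (the function name and structure are not taken from the dissertation) and encodes both the rounded rule and the more exact fitted constants reported above; it is meant only for scores above roughly 240.

def approx_death_probability(index_score, exact=False):
    """Approximate three-year probability of death, in percent, implied by a
    mortality index score.  Intended only as a rough guide for scores above
    roughly 240, as described in the text."""
    if exact:
        # The more exact constants reported for the fitted linear rule.
        return index_score / 9.457 - 21.323
    # Rule of thumb: divide the score by 10 and subtract 20.
    return index_score / 10.0 - 20.0

# A score of 300 implies roughly a 10% chance of death within three years.
print(approx_death_probability(300))         # 10.0
print(approx_death_probability(300, True))   # about 10.4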

Table 5.8 - Average and median mortality index score by age and sex

                 Female scores              Male scores
Age              average      median        average      median
65-69            189          163           288          267
70-74            240          214           336          316
75-79            302          276           402          375
80-84            384          356           457          430
85+              471          449           534          517

Note: The mortality index score may be computed by using the questionnaire in Appendix VI.
Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.

Since it might be useful to be able to compare an individual's index score with that

of other persons of the same age and sex, Table 5.8 reports the average and median for the

index score by age and sex. It is evident from this table that for every age-sex group, the


distribution of scores had a long right tail, as the mean was found to lie well above the

median score for each group. Moreover, comparing the probability of death implied by the

average scores for each group with the observed sample proportions of dead reveals a

significant difference. For example, some 6.1% of females younger than 70 in the sample

died, but the average score for these women was only 183, implying a death rate of about

4%; the distribution of scores is clearly asymmetric, with a small contingent of women

having a risk of death well above average, as one would expect since there is much more

potential for above-average mortality at such a low rate of death.

A simple measure of accuracy for the discriminant model is the raw correlation between the index score and the binary outcome of survival or death, or r_disc = cor(z, Y). As measured on the test set Y, with z computed using the coefficients estimated from the learning set for the model of size 15, r_disc equaled 0.363. When respondents were recombined into the full dataset and the best model of size 15 was estimated (giving the model above), r_disc was 0.372. In comparison, a linear discriminant model based only on age and sex yielded a correlation of about 0.22 with either the test set or the full sample.

To compare the accuracy of the model with that of the question method above, one

simply chooses a cutoff value of the index score, classifying all persons scoring above the

cutoff as dead, and all lower-scoring persons as alive. Then it is possible to estimate

misclassification error just as it was computed for the questions. To simplify the

comparison, three cutoff values of the index score were chosen such that the same number

of persons were classified as dead as were classified as dead by the three sets of questions

(sets A through C). The results of these classifications can be observed in Table 5.9, which


Table 5.9 - Deaths and survivors predicted by discriminant analysis
(in a test set of 2,923 survivors and 509 deaths three years after baseline)

Cutoff(1)   Cost(2)   # of deaths        Sensitivity(4)   # of survivors       Specificity(6)   Death rate among      Ratio to death rate
                      predicted                           predicted                             predicted deaths(7)   predicted by age, sex
                      correctly(3)                        incorrectly(5)
341         5         353                69%              872                  70%              29%                   1.44
455         3.5       221                43%              336                  88%              40%                   1.79
563         1.5       120                24%              125                  96%              49%                   2.09

1. The cutoff is that index score below which persons were classified as alive, and above which they were classified as dead. The index score may be computed by using the questionnaire in Appendix VI.
2. The ratio of the cost of misclassifying a death as survival to the cost of misclassifying a survival as death. This was not actually used in the computation of the model or the cutoff value; rather, it is included only for ease of comparison with Table 1.2.
3. The number of deaths in the test set correctly classified as dead by the model.
4. The number of deaths correctly predicted divided by the total number of deaths in the test set (509), also called the true positive fraction or TPF.
5. The number of survivors in the test set incorrectly classified as dead by the model (i.e., false positives).
6. The proportion of all survivors (2,923) correctly classified as survivors by the model.
7. The proportion of deaths among the respondents predicted as dead. In the column to the right, this rate is divided by the death rate a randomly chosen set of respondents (with the same age/sex distribution as the respondents classified as dead by the model) would have suffered.

Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.


is directly comparable to Table 1.2. It appears that the accuracy of the linear discriminant

method is slightly better than the question set method in terms of misclassification error. For

the most sensitive model, using a cutoff of 341, 69% (353) of deaths in the test set were

classified correctly compared to 66% (337) for question Set A. These both were achieved

with a specificity of about 70% (872 and 888 false positives, respectively). Compared to

question Set B (209 deaths predicted correctly and 348 false positives), the linear

discriminant model caught 43% of all deaths (221) with a specificity of 88% (336 false

positives). With respect to question set C, the discriminant model predicted 24% of deaths

correctly with a specificity of 96% (120 true deaths, 125 false positives), compared with a

sensitivity of 22% (113 true deaths, 132 false positives). The area under the ROC curve for

the linear discriminant model was estimated as 76.2% ± 1.3% using the test set respondents,

compared with 74.4% for the three question sets.
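The comparison just described can be summarized in a short computational sketch. The Python fragment below is purely illustrative (the function name and array-based layout are assumptions, not the original Splus code): it applies a cutoff to a vector of index scores and returns the sensitivity, specificity, and cost-adjusted misclassification error used throughout this chapter.

import numpy as np

def classify_by_cutoff(scores, outcomes, cutoff, cost=3.5):
    # `outcomes` is coded 1 for death within three years and 0 for survival;
    # persons scoring above `cutoff` are classified as deaths.
    scores = np.asarray(scores)
    outcomes = np.asarray(outcomes)
    predicted_dead = scores > cutoff

    false_neg = np.sum(~predicted_dead & (outcomes == 1))   # deaths classified as alive
    false_pos = np.sum(predicted_dead & (outcomes == 0))    # survivors classified as dead
    n_deaths = np.sum(outcomes == 1)
    n_survivors = np.sum(outcomes == 0)

    sensitivity = 1 - false_neg / n_deaths
    specificity = 1 - false_pos / n_survivors
    # Cost-adjusted misclassification error: a missed death counts `cost` times
    # as much as a survivor incorrectly predicted to die.
    error = (cost * false_neg + false_pos) / (cost * n_deaths + n_survivors)
    return sensitivity, specificity, error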

Figure 5.5 shows these three cutoff points plotted on the ROC curve for the linear

discriminant model, along with the points corresponding to questions Sets A through C. It

does appear that the discriminant model possesses a small advantage over the question

method. However, it should be noted that the manner in which categorical variables were

recoded as indicator variables was greatly informed by the results of the question set method.

This was done after the accuracy of those questions had been confirmed with the test set.

Thus, the test set error estimate for the linear discriminant model could be somewhat

optimistic. Building such an accurate model would have been much more difficult without

using such knowledge about how to recode the variables.

The distribution of scores for both survivors and deaths was discussed in Chapter 2

Figure 5.5 - True positive fraction by false positive fraction (ROC curve) for the discriminant model of deaths in the test set. Area under the curve = 76.2%. Vertical axis: true positive fraction (the proportion of deaths predicted correctly); horizontal axis: false positive fraction (the proportion of survivors predicted incorrectly as dead).
Note: The test set consisted of 2,923 survivors and 509 deaths. Letters (A, B, C) correspond to the question sets in Appendix I. Source: New Haven, East Boston and Iowa County EPESE.


and is shown in Figure 2.4. It is obvious from this graph that the assumptions concerning

the forms of the distributions (that the two distributions are distributed normally with the

same variance-covariance matrix) are not entirely accurate. The scores for the decedents

seem to have a larger variance than the survivors' scores, and the survivors exhibit an

abnormally long right tail, as mentioned. However, as the test set results indicate, the

obvious falsity of these assumptions does not seem to diminish the model's predictive

accuracy substantially. As Breiman et al. note, the success of the linear discriminant

approach is surprising given the questionable assumptions of the model.

5.3 Classification trees for predicting death

Using the 30 variables resulting from the first stage of the linear discriminant

backward deletion process, trees were grown for the same three levels of misclassification

cost (5, 3.5 and 1.5) as were used in the question set process. However, the program fit a

tree of size zero at a cost of 1.5 (i.e., it would not build a tree). The tree constructed with a

cost of five only caught 52% of all deaths. Thus, trees were constructed for the following

three levels of misclassification cost: 2, 3.5, and 7. Table 5.10 lists the results, and Figure

5.6 plots the models along the ROC curve, along with the points corresponding to the

question set method. With a misclassification cost of 3.5 the tree method seems to do about

as well as the question set method. At a cost of two or seven, however, the accuracy of the

trees falls well below the accuracy provided by Sets A and C. The accuracy of the medium-

cost tree lies almost exactly on the same ROC curve as question Set B, but the low cost tree

had a sensitivity of only 18% at roughly the same specificity of question Set C. The most

sensitive tree had a slightly lower level of accuracy than question Set A. The area under the


Table 5.10 - Deaths and survivors predicted by classification trees
(in a test set of 2,869 survivors and 486 deaths three years after baseline)

Size of     Cost(2)   # of deaths        Sensitivity(4)   # of survivors       Specificity(6)   Death rate among      Ratio to death rate
tree(1)               predicted                           predicted                             predicted deaths(7)   predicted by age, sex
                      correctly(3)                        incorrectly(5)
15          7         298                61%              791                  72%              27%                   --
9           3.5       216                44%              403                  86%              35%                   --
2           2         87                 18%              134                  95%              40%                   --

1. The size of the tree is the number of questions (or splits) in the tree.
2. The ratio of the cost of misclassifying a death as survival to the cost of misclassifying a survival as death.
3. The number of deaths in the test set correctly classified as dead by the tree.
4. The number of deaths correctly predicted divided by the total number of deaths in this test set (486), also called the true positive fraction or TPF.
5. The number of survivors in the test set incorrectly classified as dead by the tree (i.e., false positives).
6. The proportion of all survivors (2,869) correctly classified as survivors by the tree.
7. The proportion of deaths among the respondents predicted as dead.

Source: EPESE baseline and three years of follow-up data from New Haven, East Boston and Iowa.


Figure 5.6 - True positive fraction by false positive fraction (ROC curve) for classification trees of deaths in the test set. Area under the curve = 73.4%. Vertical axis: true positive fraction (the proportion of deaths predicted correctly); horizontal axis: false positive fraction (the proportion of survivors predicted incorrectly as dead).
Note: The test set consisted of 2,869 survivors and 486 deaths. Letters (A, B, C) correspond to the question sets in Appendix I. Source: New Haven, East Boston and Iowa County EPESE.


ROC curve for the three trees was estimated as 73.5%.
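The trees in this section were grown with a CART-style program allowing an explicit relative misclassification cost. As a rough modern analogue only, and not the software actually used for the dissertation, unequal costs can be imitated by weighting the death class in an off-the-shelf tree fitter. The Python sketch below is a hedged illustration; the function name and parameter choices are assumptions.

from sklearn.tree import DecisionTreeClassifier

def fit_cost_weighted_tree(X, y, cost=3.5, n_splits=9):
    # X holds the binary predictors; y is 1 for death within three years, 0 otherwise.
    # Weighting the death class by `cost` plays roughly the same role as the relative
    # misclassification cost used for the trees in this section.
    tree = DecisionTreeClassifier(
        class_weight={0: 1.0, 1: cost},
        max_leaf_nodes=n_splits + 1,   # a binary tree with k splits has k + 1 leaves
        random_state=0,
    )
    return tree.fit(X, y)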

Figure 5.7 shows the medium-cost tree, which contained nine questions. It is quite

informative to compare the structure of these binary splits with the questions in Set B. Note,

for instance, that every question contained in the tree is essentially contained within question

Set B, and that Set B contains one additional question, asking about the ability to do heavy

work. The first high risk terminal node (containing persons who cannot walk half a mile and

are taking digitalis) catches the same respondents classified as high risk by Set B.1. It is

interesting how similarly these two models are constructed, considering the vastly different

algorithms employed in their constructions. Yet there are some important differences

between them; Set B asks about more variables than the tree (the additional question about

doing heavy work), but actually contains fewer questions (seven for Set B, and nine for the

tree). The reason for this may be that the tree's tactic of splitting persons into completely

disjoint rectangles requires that some variables are asked about multiple times. For example,

in the first split respondents are divided completely into those who can walk half of a mile

and those who cannot. Thus any particularly powerful questions which are asked of the

former persons (e.g., whether the respondent is male or female) must also be asked of the latter persons

separately if the predictive power of the question is to be fully exploited. This requires that

such a question appear multiple times in the tree.

Although the tree algorithm worked just as accurately for the medium level of cost,

the higher and lower cost models were less accurate than the question set models, as

mentioned. The most sensitive tree, built with a cost of seven, was only able to capture 61% of all deaths, and lies slightly below the ROC curve for the question set method. This model

Figure 5.7 - Classification tree for risk of death within 3 years (cost of misclassification = 3.5). Each split is a yes/no question, and each terminal node classifies respondents as HIGH RISK or LOW RISK.


also required some 16 questions in total (compared with 10 questions, as in Set A), so that

the structure of this tree was quite complex. The tree built with a cost of two was clearly less accurate than the comparable question set, and consisted of only two questions (the two questions in Set B.1). It was not entirely clear why the tree was not able to pick out the

highest risk persons as accurately as the question set method. It may have been a result of

the one-step optimality of the tree-building process. It is likely that the highest risk persons

(who are small in number, by definition) may be identified with certain combinations of

questions which are not likely to occur in the form of a branch of a tree. This is because the

optimal splits formed near the top of the tree are greatly weighted by the large numbers of

the low-to-moderate-risk respondents.

5.4 A regression model of mortality

The final logistic regression model constructed with the full dataset (as described in

section 4.5) consisted of 23 variables, in addition to the intercept. Table 5.11 shows the 21

coefficients in the model (ordered by the absolute size of the standardized coefficient) which

were significant at the 0.05 level, along with the standard errors estimated for each

coefficient. The most important variable was age, followed by sex, digitalis usage, the ability

to walk a half mile, self-assessed health status, and the ability to state the mother's maiden

name; these variables were identified by both the question set method, and the linear

discriminant analysis. The remainder of the variables consisted of some that were identified

in the question set method, and some that were found instead by discriminant analysis:

primarily smoking, insulin usage, the ability to bathe or use the toilet without help, the

previous diagnoses of cancer and heart attacks, and weight (both at time of interview, and

Table 5.11 - Coefficients in the logistic regression model of mortality

Variable                                                             Coefficient    Std. Error
Age: 65 to 69 = 0, 70 to 74 = 1, ..., 85+ = 4                        0.3357         0.02578
Sex: female = 0, male = 1                                            0.6594         0.07112
Do you take digitalis? yes = 1, no or missing = 0                    0.6846         0.08159
Can you walk half a mile? yes = 0, no or missing = 1                 0.5497         0.07494
Would you say your health is: excellent = 0, ..., poor/bad = 3       0.2497         0.04070
What is your mother's maiden name? correct = 0, other = 1            0.5464         0.09012
Do you smoke cigarettes now? yes = 1, no/missing = 0                 0.5227         0.09190
Ever taken insulin for diabetes? yes = 1, no/missing = 0             0.6123         0.12407
Ever been diagnosed with cancer? yes = 1, no/missing = 0             0.3945         0.08255
Usual weight at age fifty (in pounds)                                0.0081         0.00172
Wake up at night? most of time = 0, ..., rarely/never = 2            0.1832         0.04079
Need help using toilet? yes = 1, no/missing = 0                      0.5229         0.12400
Need help from person to bathe? yes = 1, no/missing = 0              0.4405         0.10461
Ever hospitalized for a heart attack? yes = 1, no/missing = 0        0.3836         0.09679
Did you ever smoke cigarettes? yes = 1, no/missing = 0               0.1269         0.03523
What is your date of birth? correct = 0, other = 1                   0.3332         0.09953
Weight (in pounds)                                                   -0.0336        0.01024
Bring up phlegm at least 3 months of year? yes = 1, no = 0           0.4219         0.13313
Leg pain when walking on level ground? yes = 1, no = 0               0.4589         0.16007
Weight squared (in pounds)                                           7.61·10^-5     3.52·10^-5

Note: All coefficients were significant beyond the 0.05 level.


at age fifty). The squared term for weight at baseline was included in the model as well,

indicating that the principal effect that this variable was capturing was probably that of

respondents wasting away just prior to death, resulting in high mortality at very low body

weights. (Note that the question set method indicates that this is particularly true for men,

but the model has no way of detecting this without the appropriate interaction term).

There were many ways to measure the accuracy of this model. The residual deviance was 6,992.037 on 10,270 degrees of freedom (compared with a null deviance of 8,369.42 on 10,293 degrees of freedom). The correlation between the fitted values (ranging between zero

and one) and the true outcome of death (as one) or survival (as zero) was 0.395, roughly

equal to the correlation between the linear discriminant scores and the true outcome. A

model of nearly the same size was built on the learning set and applied to the test set (see

section 4.5) to calculate a set of fitted values. A cutoff was applied to the fitted values to

classify respondents as high or low risk, and the classification in Table 5.12 was achieved.

Table 5.12 - Predicted outcome by true outcome, logistic regression (internal test set estimates)
Cell counts, percentages by row, and percentages by column

                                      TRUE OUTCOME
                          survived             died                TOTAL
PREDICTED     survived    2,594 (90.2%)        281 ( 9.8%)         2,875
OUTCOME                   (88.7%)              (55.2%)
              died        329 (59.1%)          228 (40.9%)         557
                          (11.3%)              (44.8%)
              TOTAL       2,923                509                 3,432

Note that the cutoff was chosen so that the row marginals would be exactly equal to those


displayed in Table 5.4 (the test set results from Question Set B) and the second row of Table

5.9 (the test set results from linear discriminant analysis). The regression model had a level

of sensitivity and specificity very close to that achieved by the linear discriminant model, but

required more variables than the discriminant analysis. The test set error estimate for the

regression model, using a relative misclassification cost of 3.5, was calculated as:

(281 · 3.5 + 329) / (509 · 3.5 + 2,923) = 0.279,

with an approximate standard error of 0.0122.
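As an illustration of how the classification in Table 5.12 can be produced, the sketch below fits the logistic regression on the learning set and applies a probability cutoff chosen so that the number of predicted deaths matches the desired row marginal. This is a hedged sketch rather than the original Splus code; the function and variable names are assumptions, and in the presence of tied fitted probabilities the marginal may not match exactly.

import numpy as np
import statsmodels.api as sm

def logistic_test_error(X_learn, y_learn, X_test, y_test, n_predicted_dead=557, cost=3.5):
    # Fit the logistic regression on the learning set; y is coded 1 for death
    # within three years and 0 for survival.
    fit = sm.Logit(y_learn, sm.add_constant(X_learn)).fit(disp=0)
    p_hat = np.asarray(fit.predict(sm.add_constant(X_test)))

    # Choose the cutoff so that `n_predicted_dead` test-set respondents are
    # classified as deaths (557 matches the row marginal in Table 5.12).
    cutoff = np.sort(p_hat)[-n_predicted_dead]
    predicted_dead = p_hat >= cutoff

    false_neg = np.sum(~predicted_dead & (y_test == 1))  # deaths predicted to survive
    false_pos = np.sum(predicted_dead & (y_test == 0))   # survivors predicted to die
    return (cost * false_neg + false_pos) / (cost * np.sum(y_test == 1) + np.sum(y_test == 0))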

5.5 Validation of the question sets with the North Carolina sample

As mentioned above, all the models were constructed after having made multiple

observations on the test set, and the author had used the entire dataset (excepting the North

Carolina sample) for other related analyses prior to this research. As a result, the "internal"

test set error estimates above cannot be considered truly unbiased. However, the hope is that

the bias in these estimates is small. Fortunately, this hope can be validated to some extent,

as there exists, as mentioned, the additional EPESE sample of 4,162 respondents from North

Carolina, also called the "Duke" sample. These respondents had been ignored by the author

initially, despite the public availability of the baseline dataset, since the ID's of the deceased

respondents were not publicly released. After the construction of the above models,

however, it became possible to identify the deceased respondents through private

communication with Duke researchers. All the questions required by the models were asked

of the Duke respondents, and so the Duke sample formed a completely compatible but

independent sample. Taken at a different point in time from an entirely different geographical


region within the U.S., this dataset provided a natural yet challenging validation sample for

the models.

However, the fact that the North Carolina sample was so radically different from the

original samples made the validation slightly more complicated. Ideally, for a validation

sample, one would like to have an independent probability sample from the same population

from which the original sample was drawn. Then a statistical comparison of accuracy would

be straightforward. One simply needs to calculate standard errors for the estimates of

misclassification error, so that a statistical test can be calculated to determine whether any

difference between the estimates is significant. However, since the validation sample was

so different in this case, and since the original dataset was not constructed with a probability

sample, such a comparison is not necessarily meaningful. The problem was that when the

models were applied to the validation sample, a smaller proportion of respondents were

classified as high risk as compared with the original sample. This was because the Duke

sample was younger on average, and contained more females; both of these variables were

used in the question set models. (There were also fewer respondents who used digitalis).

With respect to Table 5.2, for example, this meant that in the Duke sample there was

a greater proportion of respondents in the upper row of the table. This shift in the marginals

could have affected estimates of misclassification error, perhaps resulting in more optimistic

estimates. Thus, it is more informative simply to compare the respondents within each row;

that is, one can simply compare the death rates of the high or low risk respondents. Notice

that if the row marginals are held constant, a decrease in the death rate of the high risk

respondents (or likewise, an increase in the death rate of the low risk respondents) is directly


proportional to an increase in misclassification error. Thus, comparing death rates provides

an informative comparison of accuracy regardless of the marginals. In these terms, the

models performed extremely well, as the death rates in the validation sample were nearly

exactly as predicted. It was also found that, coincidentally, the observed death rates were

such that shifts in marginals actually had a very small effect on estimates of misclassification

error. Thus, the observed differences in error were also very small. In fact, they were no

larger than one would have expected from simple random sampling variation, as

demonstrated in the statistical tests below. Clearly, the samples were not simple random

samples, so these tests should not be taken literally. They are simply provided to gauge the

sizes of the observed differences (i.e., they are used to define "small"). Since the sample

sizes were large, such a measure provides a fairly strict gauge; nonetheless, the models

passed the test.

The target population for the Duke sample consisted of persons 65 and older living

in a contiguous five-county area in North Carolina consisting of Durham, Franklin,

Granville, Vance and Warren Counties. 1 A four-stage sample design was used to obtain a

probability sample of these persons. Area sampling was used to obtain census blocks and

enumeration districts, smaller geographic areas were randomly chosen within these units, and

the households within each area were listed and stratified by race. Finally, one person within

each household was randomly chosen as the designated interviewee. About 80% of these

selected persons completed interviews.

Apart from geography, there were other substantial differences between the Duke

1. See Cornoni-Huntley et al. (1986) for a detailed description of the North Carolina EPESE sample.


respondents and the respondents from the first three EPESE sites. Since two of the original

three samples (the Iowa and Boston samples) contained virtually no nonwhites, the Duke

sample was designed to oversample blacks, so that less than half the Duke respondents were

white. Figure 5.8 shows the distribution of respondents by age, sex, and race. By comparing this distribution with the age-sex distribution shown in Figure 3.1, one can see that besides the majority of blacks, there is a preponderance of females at all ages (except nonblacks aged

65-69). Females were significantly oversampled relative to U.S. population sex-ratios with

respect to all age-race categories (again, excepting nonblacks aged 65-69).2 Thus, the

unweighted Duke sample was radically different from what would be expected in a simple

random sample of the U.S., raising the issue of whether sampling weights should be used in

the validation estimation of model error. The models, though, suggested that race should not be an important predictor of mortality once other factors have been controlled (at least to the extent to which the small number of nonwhites in the original sample provided such data); so the unique makeup of the unweighted sample was viewed as providing an interesting and more difficult test case, and the decision was made to use the raw, unweighted counts.

Again, it is important to note that all of the above models were constructed prior to

the author's knowing which Duke respondents had died. Each of the three methods was

applied to the Duke baseline dataset (Sets A through C, the linear discriminant model with

a cutoff score of 455, and the classification tree model in Figure 5.7). Using these models, lists of ID's were compiled of those respondents predicted to die according to each method.

Finally, to match the chosen respondents to the decedents, these lists were sent via electronic

2. See the Current Population Reports, P-25 series, No. 1095.

Figure 5.8 - Number of respondents in Duke sample by age, sex, and race. Bars show black females, black males, nonblack females, and nonblack males within the age groups 65-69, 70-74, 75-79, 80-84, and 85+. Respondents are from the North Carolina EPESE baseline survey (N = 4,162).


mail to a researcher at the NIH (unassociated with the author) who had access to a list of

the decedents' ID's.3

Table 5.13 - Results of applying Question Set A to Duke sample
Cell counts, percentages by row, and percentages by column

                                      TRUE OUTCOME
                          survived             died                TOTAL
PREDICTED     survived    2,786 (90.9%)        278 ( 9.1%)         3,064
OUTCOME                   (78.5%)              (46.2%)
              died        765 (70.2%)          324 (29.8%)         1,089
                          (21.5%)              (53.8%)
              TOTAL       3,551                602                 4,153

Table 5.13 shows the result of applying Question Set A in Appendix I to the Duke

sample. The cell counts give the number of persons according to the true and predicted

outcomes, and the percentages by row (next to the count) and column (below the count) are

also shown. There were nine respondents for whom the date of death was missing; these

respondents were removed from the analysis, leaving a total of 4,153 respondents for testing.

Of these, 602 (14.5%) died within three years, a raw death rate slightly higher than the rate

of 14.1% in the original sample. A total of 1,089 persons were classified as high risk, and

of these 324 (29.8%) died within three years, accounting for 53.8% of all deaths. There

were 765 persons incorrectly predicted to die, so of all the 3,551 survivors, 78.5% were

correctly classified. So the sensitivity is given by the lower-right percentage-by-column (53.8%) and the specificity is given by the upper-left percentage-by-column (78.5%). The raw death

3. The actual ID matching was performed by Caroline Phillips at the National Institute on Aging, to whom the author is eternally grateful.


rates of the two groups classified as low- and high-risk are the percentages-by-row in the

right column (9.1% and 29.8% respectively).

These numbers should be compared with those in Table 5.2 (the internal test set

predictions, p. 178) and Table 5.1 (the full dataset predictions, p. 177). For example, one

central statistic of interest is the estimated death rate of the high risk respondents (the

percentage-by-row in the lower, right-hand cell). In the internal test set (Table 5.2), some

27.5% of respondents who were predicted to die actually died, and in the full dataset, 30.1% of such persons died, quite close to the Duke sample estimate of 29.8%. Likewise, the predicted death rate for the low-risk group was about 7.8% (the test set estimate), and in the Duke sample this rate was 9.1%. However, when one examines the percentages by

column (i.e., sensitivity and specificity), an interesting discrepancy is revealed: the

predictions (based on the test set estimates) were that 69.6% of all survivors and 66.2% of

all decedents would be classified correctly. However, when applied to the Duke sample, the

questions were much more successful at classifying survivors: 78.5% of survivors were

classified correctly, but only 53.8% of deaths were classified correctly.4 If one calculates the

marginal percentages by row, one notices that the reason for this may be that the row margins

for the two samples are quite different; in the test set, for example, 35.7% of persons were

classified as high risk, and in the full dataset some 29.2% of persons were predicted to die.

However, in the Duke sample, only 26.2% of persons were classified as high risk.

Thus, one type of error is more predominant in the Duke sample (the

misclassification of the decedents, or false negatives), but the other type of error (survivors

4. A chi-square test using the predicted, test-set column percentages as expected values and the Duke estimates as observed values suggests that these differences are statistically significant.


predicted to die, or false positives) is less frequent. The reader, then, may wonder exactly

in what way the original results have been validated. To answer this question, compare the

overall, cost-adjusted misclassification rate for the two samples (the criterion all models

were designed to reduce). The internal test-set estimate of error, as calculated above, was equal to 0.3196, and the Duke sample estimate is calculated as:

(5 · 278 + 765) / (5 · 602 + 3,551) = 0.3285.

So in terms of this criterion, the Duke classifications were slightly less accurate than

predicted. The difference is small, however, only 0.0088.

Using rough theoretical approximations for standard errors (i.e., based on the

assumption that the two samples are independent, simple random samples), one can calculate

a simple t-statistic for the difference between the two error rates. For example, suppose the

original three samples were considered a single simple random sample. Then the numerator

of the error rate for Question Set A may be modeled as the sum of 3,432 tickets randomly

drawn with replacement from a box that contains 3,432 tickets: 172 tickets are labeled with a "5" (representing the errors for the misclassified decedents), 888 tickets are labeled "1" (the errors for the misclassified survivors), and the remaining 2,372 tickets are labeled "0" (the nonerrors for the properly classified respondents).5 The sum of these tickets has a standard error equal to the product of the standard deviation of the tickets in the box and the square root of the number of tickets drawn, here equal to 1.1192 · √3,432. The denominator is

5. Some readers may recognize that Statistics by Freedman, Pisani and Purves (1991) is the inspiration for this type of layperson's description.


constant, so the standard error of the internal test set estimate may be calculated as:

(1.1192 · √3,432) / (5 · 509 + 2,923) = 0.01199,

and likewise, the standard error for the Duke estimate of misclassification error can be calculated as:

(1.2605 · √4,153) / (5 · 602 + 3,551) = 0.01238,

where 1.2605 is the standard deviation of 4,153 tickets in a box with 278 tickets marked "5", 765 tickets marked "1", and 3,110 tickets marked "0". (Of course, in the unit weight scenario, this calculation is simply equivalent to the standard error as derived from the binomial distribution.) Now, to estimate the standard error for the difference between the two misclassification errors, assume the samples are independent and therefore that the sums of squares are orthogonal; the standard error of the difference is then given by the usual formula,6

SE_diff = √(SE_test² + SE_Duke²) = √(0.01199² + 0.01238²) = 0.01724.

Then the usual test statistic may be estimated as

observed difference / SE_diff = 0.008778 / 0.01724 = 0.5093,

which is not significant.
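The box-model calculation above is easy to reproduce. The Python sketch below is illustrative only (the function name is an assumption); it uses the Question Set A counts reported earlier, 172 misclassified deaths and 888 misclassified survivors in the internal test set, and 278 and 765 in the Duke sample, to recover the two standard errors, their combination, and the test statistic.

import numpy as np

def box_model_se(n_missed_deaths, n_missed_survivors, n_correct, cost, denominator):
    # Each respondent is a ticket worth `cost` (a misclassified death), 1 (a
    # misclassified survivor), or 0 (a correct classification), drawn with
    # replacement from the box described in the text.
    tickets = np.concatenate([
        np.full(n_missed_deaths, float(cost)),
        np.full(n_missed_survivors, 1.0),
        np.full(n_correct, 0.0),
    ])
    se_of_sum = tickets.std() * np.sqrt(tickets.size)
    return se_of_sum / denominator

se_test = box_model_se(172, 888, 2372, cost=5, denominator=5 * 509 + 2923)   # about 0.0120
se_duke = box_model_se(278, 765, 3110, cost=5, denominator=5 * 602 + 3551)   # about 0.0124
se_diff = np.sqrt(se_test**2 + se_duke**2)                                   # about 0.0172
t_stat = (0.3285 - 0.3196) / se_diff                                         # about 0.51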

If one were comparing two independent, simple random samples from the same

population, this test would provide excellent evidence that any bias in the test-set estimate

6. See Elements by Euclid (ca. 300 BC) for a proof of this result.


of prediction error is negligibly small, as gauged by sampling error. However, the Duke

sample can hardly be considered a simple random sample of the U.S. population, and clearly the faulty assumption makes a difference with regard to the model, as suggested by the shift

in row marginals. Exactly how does the assumption affect the statistical test above?

Consider the most obvious effect, as suggested by the shift in row marginals; that is, suppose

the percentages by row (the death rates) within each cell are held fixed while the row

margins are allowed to vary (which is an accurate description of what actually happened).

With respect to Table 5.13, for example, suppose that the death rates of the low- and high-

risk groups are held fixed at 9.1% and 29.8% respectively. Then shift the row margins

toward the high risk group so that the proportions in each row are equal to the proportions

observed in the test-set sample (Table 5.2). In the test set, 35.7% of respondents were

classified as high risk, so 35.7% of 4,153, or about 1,482, Duke respondents would have been

classified as high risk. Then, with the death rates implied by the fixed percentages by row,

one would have correctly classified about 569 out of 836 deaths, while misclassifying 913

out of 3,317 survivors; this would have yielded a misclassification error of 0.2999, more

than two standard errors lower than the observed Duke error rate. Thus, the observed shift

of Duke respondents into the low risk class tends to push the error rate upwards in this

instance. This implies that applying the model to the Duke sample rather than a true simple

random sample from the original EPESE population inflates the above test statistic.7 Yet,

the observed error rate for the Duke sample was still within one standard error of the

7. Of course, it is possible for the true expected value of the Duke error to be lower than the expected value for the test error (as suggested by Question Set C, below), in which case this effect does not necessarily inflate the statistic, but we can ignore the statistic then since we are only concerned about an undue increase in error over the test set estimate. (That is to say, we wish to perform a one-tailed test.)


predicted rate!

Table 5.14 - Results of applying Question Set B to Duke sample
Cell counts, percentages by row, and percentages by column

                                      TRUE OUTCOME
                          survived             died                TOTAL
PREDICTED     survived    3,236 (88.7%)        413 (11.3%)         3,649
OUTCOME                   (91.1%)              (68.6%)
              died        315 (62.5%)          189 (37.5%)         504
                          ( 8.9%)              (31.4%)
              TOTAL       3,551                602                 4,153

Consider Question Set B in Appendix I, for which the results are shown in Table

5.14. The test set results (which were very close to the results from the full dataset)

suggested that the death rates in the low and high risk respondents would be about 10.4% and 37.5% respectively. The observed death rates were 11.3% and 37.5% respectively, giving

nearly perfect results. With respect to percentages by columns, however, the same shift in

row margins (and therefore a shift from sensitivity to specificity) was observed as with

Question Set A: only 31.4% of deaths were correctly classified (compared with 41.1 % in the

test set), while only 8.9% of survivors were misclassified (compared with 11.9% in the test

set). The resulting error rate for the Duke sample was estimated as 0.3112 (compared with

0.2971623 in the internal test set). So again the models were less accurate when applied to

the Duke sample, but only slightly so (a difference of 0.01399). Standard errors were

estimated for the two sample errors in the same way as above, giving 0.01249 for the test set

estimate, and 0.01202 for the Duke estimate. The standard error for the difference equaled

0.01734, so the test statistic was estimated as 0.01399/0.01734 = 0.8068, not a significant


difference. Here, however, the shift in marginals has the opposite effect on this statistic. For

example, suppose one holds the low and high risk death rates fixed at 11.3% and 37.5%

respectively. Then adjust the row margins so that the proportions of low and high risk

respondents are the same as in the test set (so that 16.2% of 4,153, or 674 respondents would

have been classified as high risk). This implies an error rate of 0.3119. Here, then, the

observed shift in row marginals toward the low risk class pushes the error rate downward.

However, the pressure is small; a hypothetical difference of about 0.0007 is obtained, less than a tenth of a standard error. Thus, had the death rates been held fixed (apparently a reasonable

assumption), one would have expected about the same test statistic had the question set been

applied to a true simple random sample from the original EPESE population.
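The margin-shift calculations carried out for Sets A and B follow the same recipe, which can be written down once. The Python sketch below is illustrative only (the function name is an assumption): it holds the observed row death rates fixed, shifts the proportion classified as high risk, and recomputes the cost-adjusted error. With the Question Set B inputs it reproduces, up to rounding, the roughly 0.312 figure discussed above.

def shifted_margin_error(n_total, frac_high, rate_high, rate_low, cost):
    # Hold the death rates within the low- and high-risk rows fixed, shift the
    # proportion classified as high risk, and recompute the cost-adjusted error.
    n_high = frac_high * n_total
    n_low = n_total - n_high
    false_pos = (1 - rate_high) * n_high   # survivors predicted to die
    false_neg = rate_low * n_low           # deaths predicted to survive
    deaths = rate_high * n_high + false_neg
    survivors = n_total - deaths
    return (cost * false_neg + false_pos) / (cost * deaths + survivors)

# Question Set B: margins shifted back to the test-set proportion of high-risk
# respondents (16.2%), with death rates held at 37.5% and 11.3% and a cost of 3.5.
print(shifted_margin_error(4153, 0.162, 0.375, 0.113, 3.5))   # roughly 0.312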

Table 5.15 shows the results of applying Question Set C to the Duke sample. Again,

the same pattern is observed: the death rates for the high and low risk groups were predicted

quite accurately (48.3% and 12.7% respectively, compared with 46.1% and 12.4% in the

internal test set), but a shift in row marginals toward the low risk category resulted in lower

sensitivity (16.8%, compared with 22.2%) and higher specificity (97.0% versus 95.5%). This

effect was quite systematic (not random). The cost-adjusted error rate was estimated at

0.1930, which was slightly lower than the internal test set estimate of 0.1969. In this case,

the observed shift in row margins toward the low risk class has the effect of pushing the error

estimate down and below the test set estimate. One can see this by holding the death rates

fixed, adjusting the marginals toward the high risk class, and recalculating the error rate as

above. This implies an error rate of 0.1987, only slightly higher than the test set error rate.

Thus, the test statistic is less informative here.


Table 5.15 - Results of applying Question Set C to Duke sample
Cell counts, percentages by row, and percentages by column

                                      TRUE OUTCOME
                          survived             died                TOTAL
PREDICTED     survived    3,443 (87.3%)        501 (12.7%)         3,944
OUTCOME                   (97.0%)              (83.2%)
              died        108 (51.7%)          101 (48.3%)         209
                          ( 3.0%)              (16.8%)
              TOTAL       3,551                602                 4,153

Figure 5.9 plots the results from all three question sets in ROC space, along with the

original test set results and the original ROC curve. In each case, one can see how the change

in specificity and sensitivity shifted the results downward and leftward in ROC space, with

only slightly less accuracy (which is to say that the results were quite close to the original

ROC curve). The area under the ROC curve for the Duke sample was estimated as 73.3%,

slightly lower than the test set estimate of 74.4%, showing the small decrease in accuracy.

As a general conclusion, however, the models performed quite admirably, particularly in

consideration of the diverse makeup of the Duke sample.
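The areas under the ROC curves quoted throughout this chapter can be accompanied by a resampling-based standard error, as noted in the footnotes to Table 5.19. The Python sketch below shows one generic way of doing this; it is not the dissertation's original code, and the function name and number of bootstrap replicates are assumptions.

import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_bootstrap_se(y, scores, n_boot=1000, seed=0):
    # y is 1 for death and 0 for survival; `scores` is any risk score or index.
    y = np.asarray(y)
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)
    auc = roc_auc_score(y, scores)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():   # a resample must contain both outcomes
            continue
        boot.append(roc_auc_score(y[idx], scores[idx]))
    return auc, float(np.std(boot))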

Because of this validation process, an interesting substantive issue arose: was the

observed shift in row marginals a product of the skewed racial stratification of the Duke

sample, and if so, what was the sociological mechanism responsible for this shift? Or is it

possible that the observed shift was in some way reflecting the potential problem of

overfitting which was inherent to the model-building process, as suggested in earlier

chapters? (This issue is not explored in detail in this dissertation, although it was thought that some differences might have been due to the age and sex composition of the Duke sample.)

Figure 5.9 - True positive fraction by false positive fraction (ROC curve) for questions predicting death (with Duke results). Duke area = 73.3%. Vertical axis: true positive fraction (the proportion of deaths predicted correctly); horizontal axis: false positive fraction (the proportion of survivors predicted incorrectly as dead).
Note: The test set consisted of 2,923 survivors and 509 deaths. Letters (A, B, C) correspond to the question sets in Appendix I. The Duke sample contained 3,551 survivors and 602 deaths. Source: Established Populations for Epidemiologic Studies of the Elderly.


First, it is informative to examine the validation of two of the other models discussed above:

the linear discriminant analysis, and the classification tree.

5.6 Validation of the discriminant and classification tree models

The linear discriminant model in Appendix VI was applied to the Duke sample with

a cutoff score of 455 to classify respondents as high or low risk. Table 5.16 shows the

results.

Table 5.16 - Result of applying discriminant model to Duke sample
Cell counts, percentages by row, and percentages by column

                                      TRUE OUTCOME
                          survived             died                TOTAL
PREDICTED     survived    3,150 (90.0%)        352 (10.0%)         3,502
OUTCOME                   (88.7%)              (58.5%)
              died        401 (61.6%)          250 (38.4%)         651
                          (11.3%)              (41.5%)
              TOTAL       3,551                602                 4,153

The internal test set predictions (from Table 5.9 above) were that the high risk respondents

would suffer a death rate of about 39.7%, and that the model would capture 43.4% of all

deaths with a specificity of 88.5%. The result was that 250 (38.4%) of the high risk

respondents died, and the model caught 41.5% of all deaths with a specificity of 88.7%. So

the results are slightly less accurate, but based on the simple random sample assumptions,

it is not a statistically significant difference. Chi-square tests on these observed counts (using

the internal test set predictions for the expected values, by column, row, or cell) show that

there is no statistically significant difference between the observed Duke counts and the


predicted counts. That is to say, the results were statistically indistinguishable from what

would have been observed if the model had zero bias and were applied to a simple random

sample from the EPESE population! The internal test set error rate was calculated as 0.2857,

and the Duke error rate was estimated to be 0.2886, a minute increase.

Table 5.17 - Result of applying discriminant model to Duke sample
Highest risk; cell counts, percentages by row, and percentages by column

                                      TRUE OUTCOME
                          survived             died                TOTAL
PREDICTED     survived    3,406 (87.7%)        477 (12.3%)         3,883
OUTCOME                   (95.9%)              (79.2%)
              died        145 (53.7%)          125 (46.3%)         270
                          ( 4.1%)              (20.8%)
              TOTAL       3,551                602                 4,153

It was also possible, knowing outcomes for all persons with scores higher than 455,

to calculate the error rate for the subset of these persons having scores higher than 563,

which was the highest-risk cutoff suggested by Table 5.9. The results of this classification

are shown in Table 5.17. The high risk respondents suffered a death rate of 46.3% (slightly

lower than the predicted death rate of 49%), and the model caught 20.8% of all deaths

(compared with 23.6%) with a specificity of 95.9% (slightly higher than the predicted

specificity of 95.7%). This loss of sensitivity with a slight gain in specificity is similar to the

shift which occurred in the application of the question sets above. There is also a slight shift

in row margins toward the low risk class (6.5% of respondents scored above 563, compared

with 7.1% of the respondents in the test set). However, this is not a statistically significant

difference; a chi-square test of the observed cell counts in Table 5.17 (using the cell


percentages from the internal test set results to calculate the expected values) totals 3.28 on

3 degrees of freedom, p-value = 0.35.
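The chi-square comparison just described can be reproduced from the published tables. In the illustrative Python sketch below, the expected counts are formed by rescaling the internal test-set cell counts implied by the third row of Table 5.9 to the Duke sample size; this may differ slightly from the exact expected values used in the dissertation, but it yields a statistic of roughly 3.2-3.3 on 3 degrees of freedom, consistent with the 3.28 reported above.

import numpy as np
from scipy.stats import chisquare

# Duke cell counts from Table 5.17, read as (low risk, survived), (low risk, died),
# (high risk, survived), (high risk, died).
observed = np.array([3406, 477, 145, 125])

# Internal test-set counts for the same cutoff of 563 (from Table 5.9: 120 deaths
# and 125 survivors classified as high risk), rescaled to the Duke sample size.
test_set = np.array([2923 - 125, 509 - 120, 125, 120])
expected = observed.sum() * test_set / test_set.sum()

stat, p_value = chisquare(observed, f_exp=expected)   # 4 cells, 3 degrees of freedom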

Thus, error rates for the discriminant model were slightly higher than predicted, but

the differences were statistically insignificant (despite the small standard errors involved).

The area under the ROC curve for the Duke sample (using data from both Tables 5.16 and

5.17) was estimated as 75.8% (compared with 76.2% using the internal test set estimates

from Table 5.9).

Table 5.18 - Result of applying classification tree to Duke sample
Cell counts, percentages by row, and percentages by column

                                      TRUE OUTCOME
                          survived             died                TOTAL
PREDICTED     survived    2,921 (88.9%)        365 (11.1%)         3,286
OUTCOME                   (82.3%)              (60.6%)
              died        630 (72.7%)          237 (27.3%)         867
                          (17.7%)              (39.4%)
              TOTAL       3,551                602                 4,153

Table 5.18 shows the result of applying the classification tree (shown in Figure 5.7)

to the Duke sample. The high risk respondents experienced a 27.3% death rate (somewhat

lower than the test set death rate of 35%). The tree correctly predicted 39.4% of all deaths

(compared with 44% in the test set) with a specificity of only 82.3% (compared with 86%).

Thus the tree model was less accurate than predicted; the error rate based on the test set was

0.3126, and the observed Duke error rate was 0.3371. A 2-sample, one-tailed t-test of these

errors (as calculated above) yields a p-value of 0.08, so the difference is probably real.

However, although the row margins shifted, they did so in the opposite direction from that


observed with the question sets; that is, there were more high risk respondents in the Duke

sample as classified by the tree (20.9%, compared with 18.5% in the test set). The area under the ROC curve was estimated as 67.7% (compared with the test set estimate of 73.4%).

5.7 A comparative analysis of the models

The models above can be easily compared on predictive accuracy. Table 5.19

displays the central statistics of interest side-by-side. The general pattern is that the linear discriminant analysis yielded the highest level of accuracy, followed by the question sets, and

lastly the classification tree. This order was maintained with respect to the internal test set

error estimates, the full EPESE dataset estimates, and the Duke dataset estimates. For all

three methods, predictive accuracy when applied to the Duke sample was lower than the internal test set estimates had predicted. Suppose one calculates standard errors and chi-

square tests based on the (admittedly questionable) assumption that the two datasets were

two independent simple random samples. Then these differences were not statistically

significant at the 0.05 level, although the increase in error for the classification tree was

bordering on significance (p-value = 0.08). For the discriminant model and question set

methods, however, the results indicated that any bias in the test set estimates of error was

relatively small, if not completely negligible. (The small increases in error that were observed

could have been due to the radical differences between the two samples). Overall, the linear

discriminant model performed most impressively; even the application of the linear

discriminant model to the Duke dataset predicted about as accurately as (if not more

accurately than) the internal test set error estimates for the question set method. However,

Table 5.19 - Performance of three methods of prediction
as applied to the North Carolina sample, and internal test sets from the original EPESE samples

                                        Internal test sets (N = 3,432)                        North Carolina sample (N = 4,153)
Method of        Model(4)               Sensitivity(5)  Specificity(6)  Area under ROC        Sensitivity(5)  Specificity(6)  Area under ROC
prediction                                                              curve(7)                                              curve(7)
Question         Set A (high risk)      66.2%           69.6%           74.4% ± 1.2%          53.8%           78.5%           73.3%
sets(1)          Set B (higher)         41.1%           88.1%                                 31.4%           91.1%
                 Set C (highest)        22.2%           95.5%                                 16.8%           97.0%
Discriminant     high risk              69.4%           70.2%           76.2% ± 1.3%          --              --              75.8%
analysis(2)      higher risk            43.4%           88.5%                                 41.5%           88.7%
                 highest risk           23.6%           95.7%                                 20.8%           95.9%
Classification   high risk              61.3%           72.4%           73.4% ± 1.3%          --              --              67.7%
tree(3)          higher risk            44.4%           86.0%                                 39.4%           82.3%
                 highest risk           17.9%           95.3%                                 --              --

1. The question sets are presented in Appendix I.
2. The linear discriminant model is given in Appendix VI.
3. The classification tree for the "higher risk" model is shown in Figure 5.7; the "highest risk" tree was a pruned version of this tree (i.e., a subtree), and the "high risk" tree was a somewhat larger version.
4. For each method, three models were estimated; the "high risk" models were constructed with a relative cost of misclassification equal to 5, the "higher risk" models with a cost of 3.5, and the "highest risk" models with a cost of 1.5.
5. Sensitivity is the proportion of deaths which were correctly predicted by the model.
6. Specificity is the proportion of survivors which were correctly predicted by the model.
7. The area under the ROC curve is a measure of each method's accuracy based on the levels of sensitivity and specificity achieved by the models. An area of 100% would indicate a perfect predictor, while an area of 50% would indicate a useless predictor (i.e., an area of 50% implies that there is no correlation between the predictions and the outcomes). The standard error was estimated with a "bootstrap" resampling method.


the advantage in accuracy of the discriminant model over the question sets is not an

overwhelming one. The discriminant model was about 7.4% more accurate than the question

set method on the internal test set, and about 11% more accurate when applied to the Duke

sample.8
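Concretely, using the areas in Table 5.19 and the adjustment described in footnote 8: on the internal test sets the ratio is (76.2 - 50)/(74.4 - 50) = 26.2/24.4, or about 1.074 (roughly 7.4% more accurate), and on the Duke sample it is (75.8 - 50)/(73.3 - 50) = 25.8/23.3, or about 1.11 (roughly 11% more accurate).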

Notice, however, that (by comparing Tables 5.9 and 1.2) the discriminant model did

not gain any predictive power relative to what would have been expected by predicting death

rates based on age and sex. To see this, the author calculated the ratio of the death rate for the high

risk persons in the discriminant model to the death rate predicted by the age and sex

distribution of those persons. This ratio was slightly lower than was estimated for the

question set method as shown in Table 1.2 (1.79 to 1.83 for the higher risk cutoff, and 2.09

to 2.12 for the highest risk cutoff). Thus, although the discriminant model may have been

able to predict more deaths than the question set method, these deaths tended to occur in older males, so there was no gain over the death rates predicted by age and sex. That is,

the additional high risk respondents found by the discriminant model were already known

to be at high risk!

Perhaps more important than small differences in accuracy, however, are the less

tangible differences between the models themselves (and the methods). One of the most

glaring differences between the discriminant model and the two other methods of binary

classification is the ease of application as actual prognostic tools. Consider using the models

to diagnose a respondent in person. To score a respondent based on the discriminant model,

a substantial amount of arithmetic is necessary; this obviously allows more room for

8. These were calculated from the ratios of the areas under the ROC curves for the two methods after subtracting off the baseline area of 50%.


error on the scorer's part (and assumes a certain level of numerical literacy). To apply the

question sets or the classification tree, however, one simply needs to obey simple conditional

rules of the form "if X, then Y", so there is no arithmetic required. For the classification tree

with the highest level of sensitivity, though, the ease of use was hampered by the large,

complex structure of the tree (which is why the tree is not pictured). Thus, the simplicity of

representation achieved by the question set method offers more intuitive, "nonmathematical"

insight into the structure of the data than the linear model. Yet it achieves a greater accuracy

than the classification tree model, which is almost as easily interpreted.

Another important difference concerns the number of variables required for each

model (which is very important if one is interested in using the predictor variables to

construct a proxy of mortality risk). To some extent, this comparison is difficult to make

fairly, as the question set method actually consisted of three separate models, each requiring

a small number of variables. The linear discriminant model required 15 variables, but

question Sets B and C consisted of only seven questions (Set A had ten), with considerable overlap in the variables actually needed for the two models.

However, the linear model could accommodate any cutoff level of risk (that is, any possible

combination of specificity and sensitivity on the ROC curve) with the same 15 variables. If

one totals the number of unique variables required for all three question sets (which still

provides only a narrow part of the complete range of potential specificity/sensitivity levels),

there are some 16 variables required, essentially the same number used in the discriminant

model. However, the question sets had the advantage of being usable as separate modules

requiring between seven and 10 questions. Thus, the question sets required fewer variables


if one was willing to accept a fixed risk cutoff, but the discriminant model was more

effective for predicting along the entire range of cutoffs. The classification tree in Figure 5.7

required only six distinct variables, but many more questions were required for the larger,

high-sensitivity model.

The binary classification methods had another advantage over the discriminant analysis that was particularly useful when considering the use of the models as proxies. For

some variables in the discriminant model, there were multiple levels of scoring when more

than two values were possible (e.g., the eighth question in Appendix VI). A problem with

this method arises when one wishes to use a set of predetermined proxy variables in a survey

where the necessary survey items are not coded or asked exactly as the original survey

questions from which the index was constructed. For example, the ADL variables used in

the question set method are quite common to many surveys in addition to the EPESE

questionnaire. Unfortunately, they are not usually phrased in the exact same manner, and the

possible answers for any survey question can vary from instrument to instrument. The binary

splitting of a variable is generally more adaptable to this type of problem, since this simpler

two-class classification is much easier to reconstruct based on different codings.

Constructing a discriminant model purely from binary splits would have been possible, as

with the tree or question sets, but the model would have been somewhat weaker.

As discussed in the theoretically-based comparisons of the methods in Chapter 2, it

is entirely possible to create a hybrid of the question set method and the discriminant analysis

model by forming interaction-based indicator random variables equivalent to the question

subsets and fitting the models parametrically. The results above suggest that this might


achieve even more powerful models, as there were many persons who were identified as high

risk by one model, but not the other.
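As a sketch of what such a hybrid might look like (illustrative only; the subset rules, field names, and records below are placeholders rather than the fitted model), each subset is turned into a 0/1 indicator and the indicators form the design matrix for an ordinary parametric fit:

    # Each question subset becomes a 0/1 indicator variable; the columns
    # can then be fit by logistic regression or discriminant analysis.

    def indicators(r):
        on_digitalis_and_immobile = int(r["uses_digitalis"]
                                        and not r["can_walk_half_mile"])
        light_male_limited = int(r["sex"] == "male" and r["weight_lbs"] < 168
                                 and not r["can_do_heavy_work"])
        return [on_digitalis_and_immobile, light_male_limited]

    sample = [
        {"sex": "male", "weight_lbs": 150, "uses_digitalis": True,
         "can_walk_half_mile": False, "can_do_heavy_work": False},
        {"sex": "female", "weight_lbs": 130, "uses_digitalis": False,
         "can_walk_half_mile": True, "can_do_heavy_work": True},
    ]
    design_matrix = [indicators(r) for r in sample]
    print(design_matrix)   # [[1, 1], [0, 0]]
    # These columns would then compete with, or replace, the raw
    # variables in the parametric model.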

5.8 An additional question set model

Table 5.20 - Predicted outcome by true outcome, Question Set J, full dataset estimates

Cell counts, percentages by row, and percentages by column

                                        TRUE OUTCOME
                             survived              died              TOTAL

PREDICTED     survived       7,539 (92.0%)         652 ( 8.0%)       8,191
OUTCOME                      (85.2%)               (45.0%)

              died           1,305 (62.1%)         798 (38.0%)       2,103
                             (14.8%)               (55.0%)

              TOTAL          8,844                 1,450             10,294

Table 5.21 - Predicted outcome by true outcome, Question Set J, test dataset estimates

Cell counts, percentages by row, and percentages by column

                                        TRUE OUTCOME
                             survived              died              TOTAL

PREDICTED     survived       2,480 (91.3%)         235 ( 8.7%)       2,715
OUTCOME                      (84.2%)               (48.6%)

              died           466 (65.2%)           249 (34.8%)       715
                             (15.8%)               (51.4%)

              TOTAL          2,946                 484               3,430

Question Set J in Appendix VII is the larger model, constructed with the RRESA(N)

method and Breiman et al.'s cost-complexity pruning method of backward deletion, as


discussed in the last section of Chapter 4. The full model, using a relative misclassification

cost of 3.5, consisted of ten subsets of three questions each, for 30 questions. The computer

required an average of about 20 hours to find a particular absorption point with the RESA

on a SparcStation Ultra. Since the RRESA(200) was applied to both the learning set and the

test set in this search, this method required more than 8,000 hours (333 days) of computing

time.9 Consequently, the final model was not constructed until the latter stages of the

research, so the model has not yet been validated with the Duke sample at the time of this

writing. Some readers may wish to treat this result with greater caution since it was achieved

with a different method, and was not validated on an independent sample. This is the reason

for considering the model separately, in this section.

The final model (Question Set J) consisted of 23 questions in ten subsets. All these

subsets consisted of two or three questions. Not one subset was dropped from the model,

suggesting that even larger models might be constructed without incurring excessive fitting variance. The result

of applying this model to the full dataset is shown in Table 5.20. The misclassification error

associated with this table is equal to:

(3.5 × 652 + 1,305) / (3.5 × 1,450 + 8,844) = 0.2577,

which is biased downward (by roughly 0.02, based on the above results). When the model of

a similar size (26 questions) was created on the learning set and applied to the test set, the

following classification was achieved (Table 5.21). Thus, a model like Set J is estimated to

9 Of course, since the repeated runs of the RESA were independent, the actual time needed to conduct such a search could be reduced by using multiple computers in parallel. The author was able to use six to eight computers at a time, thanks to the generous resources of the Statistical Computing Facility at the University of California at Berkeley.


capture more than half of all deaths with a specificity of 84.2%. The test set error associated

with Set J was estimated as:

(3.5 × 235 + 466) / (3.5 × 484 + 2,946) = 0.2777,

with a standard error of 0.0147. The area under the ROC curve was estimated as 76.6%.

This model outperforms the linear discriminant analysis on the test set, and requires only 17

unique variables to do so (compared with 15 for the linear model and 23 for the regression

model). Many questions in Set J are also found in Sets A through C, although in different

combinations. Interestingly, some questions included in Set J but not in Sets A through C

are also found in the latter part of the discriminant model (e.g., diabetes and hospitalization

for heart failure were in the final model, and smoking was also included in the learning set

model).
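For concreteness, the cost-weighted error just computed can be reproduced from the confusion tables with a few lines of code; the sketch below (Python, illustrative only and not part of the original analysis) simply restates the formula above with false negatives weighted by the relative misclassification cost of 3.5.

    # Cost-weighted misclassification error, as in the formula above.

    def weighted_error(false_neg, false_pos, total_died, total_survived, cost):
        return (cost * false_neg + false_pos) / (cost * total_died + total_survived)

    # Table 5.20 (full dataset): 652 missed deaths, 1,305 false alarms.
    print(round(weighted_error(652, 1305, 1450, 8844, 3.5), 4))   # 0.2577
    # Table 5.21 (test set): 235 missed deaths, 466 false alarms.
    print(round(weighted_error(235, 466, 484, 2946, 3.5), 4))     # 0.2777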


Chapter 6 - The causes of death in the elderly

6.1 Causes as identified by the death certificate

Of obvious interest to mortality researchers are the causal processes involved in

dying. Important clues about these processes may be gained as a byproduct of the predictive

models constructed above, if one is careful with the interpretation of the identified

correlations. The problem is that the mortality process is one of enormous complexity,

reflected by the incredibly large number of ways to die, as witnessed by any researcher

familiar with ICD (International Classification of Disease) codes. Fortunately, some

important clues have already been provided along with the data, most notably the connection

between respondents and their death certificates. Because of the link, it was possible to tell

not only who had died, but what causes were listed on the certificate.

First, it must be recognized that death certificate data on the causes of death can be

a double-edged sword for the researcher. On the one hand, one has a great deal of

information about many respondents. This could include as many as 30 associated

conditions besides the underlying and immediate causes of death for each respondent.

Unfortunately, a great deal of inaccuracy and incompleteness pervades this information, and

in many ways it seems that the data can be more misleading through what is missing from

the certificate rather than what is included. First, the vast majority of death certificates are

not informed by autopsies and do not include any data from them (mostly because autopsies

are not usually done on elderly persons). Secondly, the causes of death are often extremely

difficult to pinpoint (even with autopsy information) because the process of death itself is so

complicated, usually involving more than one medical condition for elderly persons. The


typical strategy of identifying a single underlying cause does not account for the

multiplicity of conditions which may precipitate death. So it is often helpful to examine

more than one entry on the certificate's list of causes, provided one is lucky enough to have

this additional data on such respondents. The crux of the problem is that for many

respondents, the data are probably not fully present; inevitably, some "true" causes of death

must be absent from the certificates, and it seems likely that this is a substantial problem.

Thirdly, there is frequently a lack of uniformity in how certain conditions are reported and

coded (i.e., by ICD classification codes) on the death certificate. In the EPESE data, one is

fortunate to have data coded entirely by a single nosologist at each of the three centers,

hopefully providing some degree of uniformity. However, no information was available in

the EPESE data on the person who originally reported the causes on the certificate (e.g.,

whether it was a coroner, clinician or other). Thus, examining the methods by which death

certificates are typically completed (or supposed to be completed) by the general informant

is extremely important.

The relevant section of the death certificate for the purposes of this dissertation is the

listing of causes in the bottom area of the certificate. This section is divided into two parts:

Part I, which lists "immediate" and "underlying" causes, and Part II, containing "other

significant causes - conditions contributing to death but not related to the causes given in Part

I". Guidelines for filling out and coding this and other areas of the death certificate may be

found in several NCHS publications, including the Medical Examiners' and Coroners'

Handbook on Death Registration and Fetal Death Reporting, the Instruction Manual Part

2a - Instructions for Classifying the Underlying Cause of Death, and Instruction Manual


Part 2b - Instructions for Classifying Multiple Causes of Death. These manuals are the

sources for much of the information presented in this section.

Part I, as mentioned, contains the immediate and underlying causes of death and all

causes in between, with one cause listed on each line and additional causes listed as

necessary on improvised lines. (However, more than one condition was sometimes listed on

a single line.) The underlying cause is defined as "the disease or injury that initiated the train

of morbid events leading directly to death, or the circumstances of the accident or violence

that produced the fatal injury".1 The immediate cause is "the final disease, injury, or complication leading directly to death" (emphasis added).2 Causes are listed in reverse order

according to their occurrence in time and causal ordering, so that the immediate cause is

listed first, the underlying cause is listed last, and all intermediate causes are listed between

them. If only one cause is listed, it appears on line "a.", and is considered both the

underlying and the immediate cause. An illustrative example given in the Medical

Examiners' and Coroners' Handbook is that of an unfortunate gardener who stepped on a

rake, contracted tetanus, and died of asphyxia (suffocation) during convulsions. Part I was

completed as follows:

Part I    a. Asphyxia
          b. Convulsions
          c. Tetanus
          d. Infected laceration of foot

Thus infected laceration of foot was the underlying cause, which in turn caused tetanus,

1 Medical Examiners' and Coroners' Handbook on Death Registration and Fetal Death Reporting (1987).

2 Ibid.


which led to convulsions, which finally led to the immediate cause of death, asphyxia.

Additionally, the manner of death was recorded as accidental, and a short description of the

injury ("stepped on rake while gardening") was provided, both in the latter part of the

certificate. This example shows the causal and temporal order implied in the chain of events.

The approximate time between the onset of each condition and the cause of death is

sometimes listed on the actual death certificates, but unfortunately this information is not

coded into the EPESE data.

Part II of the certificate contains "any other important disease or condition that was

present at the time of death, and that may have contributed to death but did not result in the

underlying cause of death listed in Part I".3 Again, an example from the Medical Examiners'

and Coroners' Handbook demonstrates this classification: "On May 5, 1989, a 54-year-old

male was found dead from carbon monoxide poisoning in an automobile in a closed garage.

A hose, running into the passenger compartment of the car, was attached to the exhaust pipe.

The deceased had been despondent for some time as a result of a malignancy, and letters

found in the car indicated intent to take his own life." The death certificate was completed

as follows:

Part I    a. Carbon monoxide poisoning
          b. Inhaled auto fumes
          c.
          d.

Part II   Cancer of stomach

The underlying cause was the inhalation of auto fumes, the immediate cause was carbon

3 Ibid.


monoxide poisoning, and the "other significant condition contributing to death but not

resulting in the underlying cause" was cancer of the stomach. Here, the stomach cancer was

not directly related (at least not in any physical or medical sense) to either the underlying or

immediate cause. It was only indirectly related in that it may have led to the decedent's

depression and therefore suicide. However, conditions listed in Part II do not necessarily

relate to the causes of death at all. Yet another example from the Handbook illustrates this:

"On July 4, 1989, a 56-year-old male was found dead in a hotel. Autopsy revealed

asphyxiation due to aspiration of vomitus - a result of acute alcohol intoxication. Blood

alcohol level was 0.350 gm percent." The death certificate contained these causes:

Part I    a. Asphyxiation
          b. Aspiration of vomitus
          c. Alcohol intoxication (0.350 gm percent)
          d.

Part II   Alcoholic cirrhosis

Here the victim became excessively drunk and choked on his vomit. Also listed is cirrhosis,

which is a liver disease often resulting from the long-term consumption of alcohol.4

However, the bout of drunkenness resulting in death was not itself responsible for cirrhosis.

Nor was the disease any cause of the victim's intoxication. (While it is possible that the

victim was drinking to escape the disease psychologically, no indications to this effect

existed in the example). In no way did the victim's cirrhosis contribute to his death. So

according to the guidelines, only causes thought to contribute to death should be listed; yet

this designation is evidently subject to considerable interpretation and not strictly obeyed.

4 Wyngaarden and Smith (1988).


Several lessons can be learned from the examples above. First, notice that in all three

examples, the actual physical process surrounding the death was easy to identify for persons

familiar with the details. Yet, for a researcher faced with such data in coded form, it is

extremely difficult to pinpoint the actual process of death based on a handful of ICD codes.

A list of numbers does not tell the story of the gardener and his rake, nor the story of the

suicide (in fact suicide was never mentioned on the certificate, not an unusual omission). It

is also not at all clear to what extent, or in what way, any conditions listed in Part II might

have contributed to the death, so there is no way to work these causes into any

comprehensive story either.

Secondly, when considered from a causal standpoint, any given death is assumed to

be a result of a series of chain-reaction events. However, it is often doubtful whether one can

truly identify a single "underlying" cause with any meaning. The gardener stepped on a rake,

but it was tetanus that eventually killed him. It seems both were necessary for his death to

occur (provided the laceration itself was not so life-threatening); neither event caused death

by itself. The only real distinction is one of temporal order, which may have been easy to

establish in these cases. With multiple chronic illnesses, a far more typical mode of death,

these connections are hardly so readily established. Suppose for example that a strong factor

in the gardener's demise was his chronically weakened immune system, which allowed him

to succumb to the infection. Presumably, this condition may have existed before the rake

incident. Where in the chain of events does this factor belong temporally? If the laceration

was minor, should one then list the immune condition as the underlying cause? What about

other chronic illnesses that work together simultaneously to deteriorate the body, so that no


single condition can be identified as underlying (e.g., arteriosclerosis and heart failure)? No

standards or rules exist for handling such ambiguities on the death certificate.

Thus consider that for many elderly persons the process of death is not always so easy

to recognize, even for a clinician who is intimately familiar with the patient. Such persons

frequently have any number of medical conditions and illnesses, and may also be under

multiple medications. Any or all of these events may interact or conspire to bring about the

ultimate demise of the body, and many of these factors may be entirely undetected by close

observers. Clearly, trying to pinpoint a cause for these deaths is quite inexact even for

experienced physicians; so trying to understand such a complex course of events based only

on a small list of incomplete ICD codes (which are themselves categorical representations

of a report that was often not derived from a physician's opinion) amounts to guesswork for

many decedents. It is with these caveats that one must examine any results from this data.

Part of this problem may be understood by trying to identify those certificates that are most

likely to have misspecified the causes of death, as suggested below.

6.2 Underlying and associated causes in the EPESE population

The underlying causes for the decedents in the EPESE sample, grouped into broad

categories, were presented in Section 3.6. As indicated, the most common underlying cause

of death was heart disease, followed by cancer and cerebrovascular disease, as one would

expect from a representative sample of U.S. elderly. However, although the general order

of causes was as expected, the proportions of deaths attributed to each cause were not as

expected. Instead, the underlying causes of death were much more diverse in the EPESE

sample, with more deaths categorized as "other". However, based on the discussion above,


examining the underlying cause alone is clearly insufficient. So the additional causes listed

on the death certificate were also examined.

There were 352 deaths (24% of all deaths) which had only one cause listed on Part

I of the death certificate, while 589 certificates (41%) listed two causes, 369 (25.4%) listed

three causes, and 134 (9.2%) listed more than three causes. On Part II of the certificate, only

about 42% of the documents listed a cause. There were 383 (26.4%) respondents who listed

one cause, 168 (11.6%) listed two causes, and 59 (4.1%) listed three or more causes. These

proportions were fairly close to what would be expected from a simple random sample of

death certificates in the U.S. For example, in a 1% simple random sample of U.S. death

certificates in 1988, about 43% of all certificates listed any causes in Part II, and roughly

75% of certificates contained more than one cause in Part I. To group the raw ICD codes

into meaningful categories, the ICD classifications shown in Table 6.1 were used. Any code

not included in the given ranges was classified as "other".

Table 6.1 - Classification of ICD-9 codes

ICD-9 code              Condition
390-398, 410-429        cardiovascular disease
140-239                 malignant neoplasms
430-438                 cerebrovascular diseases
440-454, 456-459        circulatory disease
400-404                 hypertension
250                     diabetes
487-496, 510-519        bronchopulmonary diseases
800+                    accidents
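A minimal sketch of this grouping (illustrative only; the open-ended "800+" range in Table 6.1 is read here as any code of 800 or above, and decimal subcodes are truncated before matching):

    # Map a numeric ICD-9 code to the broad categories of Table 6.1;
    # anything outside the listed ranges falls into "other".

    CATEGORIES = [
        ((390, 398), "cardiovascular disease"),
        ((410, 429), "cardiovascular disease"),
        ((140, 239), "malignant neoplasms"),
        ((430, 438), "cerebrovascular diseases"),
        ((440, 454), "circulatory disease"),
        ((456, 459), "circulatory disease"),
        ((400, 404), "hypertension"),
        ((250, 250), "diabetes"),
        ((487, 496), "bronchopulmonary diseases"),
        ((510, 519), "bronchopulmonary diseases"),
    ]

    def classify_icd9(code):
        c = int(code)                # drop the decimal subcode
        if c >= 800:                 # "800+" in Table 6.1
            return "accidents"
        for (low, high), label in CATEGORIES:
            if low <= c <= high:
                return label
        return "other"

    print(classify_icd9(410.9))      # cardiovascular disease
    print(classify_icd9(799.1))      # other (the symptomatic range)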

Suppose one considers immediate causes, that is, the condition listed at the top (line "a")

of Part I on the certificate (which was identical to the underlying cause for the 383


respondents with only one cause listed). Of the death certificates listing only one cause,

some 31% listed cancer, and 41% listed heart disease. After removing persons with only one

cause listed, it was observed that of the 1,092 remaining certificates, more than half (580)

of these immediate causes were categorized as heart disease, and only 77 (7.1%) were

categorized as cancer. There were 52 of these deaths with cerebrovascular disease listed as

the immediate cause, 31 certificates listed bronchopulmonary diseases, 22 listed accidents,

and 18 listed circulatory disease. It should also be noted that there were 127 certificates

which actually listed more than one condition on the first line of Part I (despite the

suggestion in the section above that only one cause should be listed per line). These conditions

were ignored in the above tallies. Most of these were also listed as heart attack or cancer.

There were 1,002 certificates that also listed a condition on line "b" of Part I. Of these, 400 listed heart disease on line "b", 160 listed cancer, 41 listed cerebrovascular disease, 64 listed circulatory disease, and 56 listed bronchopulmonary diseases. Some 408 certificates listed a cause on line "c", of which only 137 (33.6%) listed heart disease, and only 56 (13.7%) listed cancer. Circulatory diseases were listed on 37 certificates (9.1%), 24 listed bronchopulmonary disease, and between ten and twenty certificates listed diabetes and

hypertension each (the latter two having been nearly absent in the above sections of the

certificate). Thus, circulatory disease, bronchopulmonary disease, diabetes and hypertension

appeared more frequently in this section of the certificate than in line "a". There were only

24 certificates listing any causes on line "d".

As explained above, Part II of the certificate contains any other causes that may have

contributed to death that were not associated with the immediate and underlying conditions


in Part I. There were 610 certificates that listed at least one condition on this part, and 227

that listed two or more conditions. Of the conditions listed first, 107 (17.5%) certificates

listed heart disease, 55 (9%) listed cancer, 71 (11.6%) listed diabetes, 69 listed

bronchopulmonary disease, 39 (6.4%) listed hypertension, 36 listed cerebrovascular disease,

and about two dozen listed circulatory disease and accidents each. In this section of the

certificate then, one is much more likely to find diabetes, bronchopulmonary disease, and

hypertension than in other parts of the certificate.

The general pattern observed in the death certificates, then, is that cancer and heart disease tended to dominate as immediate and underlying causes (with cancer more frequently listed as both the immediate and underlying cause). Other conditions

such as circulatory disease, cerebrovascular disease, and bronchopulmonary disease were

seen as illnesses which participated in the causal chain of events leading to death.

Conditions such as diabetes and hypertension were apparently seen more as aggravating

conditions that contributed to death, but were not in the "chain of events" leading to the

immediate cause of death as described in Part I of the certificate.

To examine the associated conditions further, it was decided to group together all

those conditions listed in both Part I and Part II with the sole exclusion of the condition

identified as the underlying cause. Thus the unit of analysis became the associated conditions

themselves rather than the respondents. The reasons for conglomerating all associated

conditions on the certificate, regardless of their location on the document, are discussed in the

section above. Briefly, it was not entirely clear that the location of a condition on the

certificate was really a meaningful indicator of its position in the causal "chain of events"


leading to death. Moreover, this assumed that such a straightforward chain of connections could actually be established, which is debatable.

After removing all underlying causes, a total of 1,516 associated conditions remained

(drawn from a total of 1,092 death certificates). Interestingly, these conditions were

dominated by those in the "other" category, which accounted for 598 (39.4%) of all

conditions. Only 326 (21.5%) conditions were categorized as heart disease, 141 (9.3%) as

bronchopulmonary diseases, 7.3% as circulatory diseases, and about 5% as each of

cerebrovascular disease, diabetes, and accidents. Only 3% were attributed to cancer. Since

the "other" conditions were so predominant, it was decided to examine these codes in more

detail. It was determined that of the 498 "other" conditions, 203 (41%) of the codes were

between 780.0 and 799.9, which contains all those "symptomatic" conditions, the causes of

which are unknown to the observer (e.g., weight loss, convulsions, or coma). These codes

were almost exclusively in the 799.0 - 799.9 range, consisting mostly of respiratory failure,

with a handful in the range 785.0-785.9, which consisted of symptoms involving the

cardiovascular system (mostly shock and gangrene). Of the remaining 296 "other" codes,

126 were between 560.0 and 579.9, consisting of diseases of the urinary and digestive tracts,

and the rest were distributed quite widely among all the remaining codes.

6.3 The causes of death associated with the models

Suppose now that one considers the causes of death as they were listed for the sets

of persons chosen by the questions in Appendix I. First, consider the underlying causes

listed on the death certificates of the deceased among the high risk respondents as identified

by these models. To examine the small groups of deaths captured, it was necessary to group


the conditions more broadly to maintain enough observations. Thus, four main groups of

conditions were identified: heart disease, cancer, cerebrovascular disease, and all other

remaining causes.

The persons chosen by Question Set A were dominated by heart disease deaths, as

431 (some 48%) of the 904 correctly predicted deaths were so listed, followed by 266 deaths

(29.4%) in the "other" category, 140 cancer deaths (15.5%), and 67 deaths (7.4%) due to

cerebrovascular disease. Thus, cancer deaths were substantially underrepresented, while

more heart disease and "other" deaths were captured. It appears that death due to cancer is

more difficult to predict than other types of death (congruent with the observation that cancer

itself was difficult to predict). The deaths correctly predicted by Set B were also skewed

toward heart disease and away from cancer. More than 50% of the deceased respondents

(308 out of 606 total deaths) had heart disease listed as the underlying cause, only 13.7%

were cancer deaths, 7.8% were cerebrovascular, and 27.7% were classified as "other".

The causes of death varied quite substantially from subset to subset, however. For

example, consider subset B.1, which asks about digitalis usage and whether the respondent

could walk a half of a mile without help. This set correctly predicted 267 deaths, of which

157 (58.8%) had heart disease listed as the underlying cause, only 24 (9%) had cancer listed,

and 21 (7.9%) certificates listed cerebrovascular disease. Suppose one computes a chi-square

for this distribution, where the expected proportions are those observed in the set of all

deaths in the sample. The statistic is 37.9 on three degrees of freedom, suggesting that these

discrepancies were highly unlikely to be due to randomness. Consider, in contrast, the

deaths chosen by subset B.2, which selects out males who weigh less than 168 pounds and


cannot do heavy work. Of the 277 deaths correctly predicted by these questions, 117

certificates listed heart disease (42.2%), 50 listed cancer (18.1%), 22 listed cerebrovascular

disease (7.9%) and 88 (31.8%) listed other causes. Thus, the respondents chosen by the

questions in B.2 were more than twice as likely to have died of cancer as the persons

chosen by Set B.1, and 40% less likely to have died of heart disease. Table 6.2 shows the

breakdown by cause for each of the unique question subsets in Sets A and B. Note that Set B.2 was identical to Set A.4. The number of deaths chosen by Set C was too small to

provide reliable data when broken down by cause.

Table 6.2 - Underlying causes of death by question subset

cause     A.1         A.2         A.3         A.4         A.5         B.1         B.3
heart     105 (45%)   231 (57%)   188 (49%)   117 (42%)   72 (53%)    157 (59%)   96 (53%)
cancer    34 (15%)    47 (12%)    49 (13%)    50 (18%)    17 (12%)    24 (9%)     25 (14%)
stroke    16 (7%)     27 (7%)     24 (6%)     22 (8%)     14 (10%)    21 (8%)     12 (7%)
other     79 (34%)    101 (25%)   120 (32%)   88 (32%)    34 (25%)    65 (24%)    48 (27%)
total     234         406         381         277         137         267         181
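The chi-square quoted above for subset B.1 can be checked mechanically; in the sketch below (illustrative only, not the dissertation's code) the observed counts are B.1's column of Table 6.2, while the expected proportions are placeholders, since the dissertation uses the cause distribution among all 1,450 deaths in the sample.

    # Chi-square of an observed cause-of-death distribution against a set
    # of expected proportions (here placeholders, not the full-sample mix).

    def chi_square(observed, expected_props):
        total = sum(observed)
        expected = [p * total for p in expected_props]
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    observed_b1 = [157, 24, 21, 65]            # heart, cancer, stroke, other
    expected_props = [0.45, 0.18, 0.08, 0.29]  # placeholder full-sample proportions
    print(round(chi_square(observed_b1, expected_props), 1))
    # The statistic is referred to a chi-square distribution on 3 degrees
    # of freedom, as in the text.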

Overall, heart disease as an underlying cause is more strongly associated with Sets

B.1, A.2, A.5 and B.3. Cancer was more likely to be listed on the certificates of respondents chosen by Sets A.4 and A.1. Stroke was slightly more likely in those chosen by A.5, and the "other" category was represented more strongly by Sets A.1, A.3 and A.4.

Of the 904 deceased respondents chosen by Set A, there were 676 (75%) with more

than one cause listed on the certificate. After removing the conditions specified as

underlying causes and conglomerating all the remaining conditions on the certificates as

above, it was found that on these 676 certificates, a total of 928 conditions were listed. Of


these 928 "associated" conditions, 375 (40.4%) were categorized as "other", 209 (22.5%)

were heart disease, and 82 (8.8%) were bronchopulmonary diseases. These proportions (and

the proportions of the other categories as well) were quite close to the proportions observed

in the entire sample of deaths (as listed in Section 6.2 above). When the conditions were

broken down by subsets, the same pattern was observed. When the associated conditions

were then conglomerated for the respondents chosen by Question Set B, the same patterns

were again observed, except that heart disease (26.1% of all conditions) was slightly more

frequent relative to the "other" conditions (36.8%).

Overall, the associated conditions were dominated by the "other" category, which

accounted for about 40% of all conditions, followed by heart disease (about 24%), and then

bronchopulmonary disease (about 9%), no matter which group of respondents was examined:

those chosen by Set A, Set B and their subsets, or the entire sample of deaths. Many of these

conditions (roughly a third of the "other" conditions) were completely symptomatic (of

entirely unknown etiology). Moreover, there was an extremely large degree of heterogeneity

associated with the various conditions, as no single category of disease (excepting the

catchall "other" category) was dominant: the single most prevalent condition was the

extremely broad category of "heart disease", which only accounted for about a quarter of all

conditions!

6.4 The causal processes and risk factors associated with death

At this point, the reader may be well aware of the double-edged nature of the death

certificate data. As demonstrated above, there is a vast amount of information associated

with death certificates. However, it is of highly dubious quality, and extremely difficult to


interpret on any substantive basis (partly because of the high degree of heterogeneity in the

causes of death). Unfortunately, a sensible story is difficult to decipher from the data without

much additional information, or at least some well-reasoned conjecture. It is argued below

that some additional information is available in the form of the models themselves.

However, these models tell only half the story. To make a reasonable interpretation, it is

necessary to look at the picture presented by the above data as a whole, and to consider the

deficiencies that are likely to distort this picture.

First, although it could not be proven solely with the data here, it was strongly

suspected that a fair number of ailments that may have contributed to death were either

misdiagnosed or entirely undetected. This may also be the case in the U.S. population as a

whole (based on previous analyses of U.S. death certificates conducted by the author), but

the EPESE deaths were probably less accurately documented. That inaccuracies exist in

death certificate data is hardly an original opinion, as demonstrated by studies comparing

actual autopsy results with death certificates. However, it was also suspected that cancer was

the illness that accounted for a disproportionately large number of undetected or

misdiagnosed conditions. Secondly, one particularly large group of high risk respondents

existed for whom death due to heart disease was excessively high. The author wondered

whether these respondents were actually suffering deaths due not to heart disease itself, but

to toxicity from a purported treatment for heart disease: digitalis.

Several signs pointed to cancer as the culprit for much of the "missing morbidity".

First, cancer deaths were underrepresented in the EPESE sample relative to the U.S.

population of elderly, as indicated in Section 3.6. Secondly, the respondents in the sample


were "working class", and had lower incomes compared with the U.S. population. In the

East Boston and New Haven samples, two-thirds of the respondents had household incomes

less than $10,000 per year; in the Iowa sample, incomes were somewhat higher, as about

63% of respondents had household incomes less than $15,000. It was thought that because

these poorer persons were less likely to be seen by a clinician than wealthier elderly persons,

they were probably less likely to be diagnosed with cancer (and therefore less likely to have

cancer listed as a cause of death). This was observable in the data, although the relationship

was not extremely strong: about 13% of persons with an income less than $5,000 reported

having been diagnosed with cancer, compared with 15% of persons with incomes of $15,000

or more. In studies of the relatively wealthy women in Marin County, for another example,

breast cancer rates were also observed to be abnormally high; this was thought by some to

be the result of increased detection rather than a truly higher incidence of cancer, although

these issues have been hotly debated. It was also observed in the EPESE sample that among

persons who had seen a dentist within the previous five years, the lifetime prevalence of

cancer was higher than in those who had not seen a dentist (15% versus 12%). This was

despite the fact that the latter group were more likely to be smokers or exsmokers than those

who had seen a dentist (45% smokers versus 39%). These sorts of results suggest that many

cancers in the general population go undetected, since much of the elderly population is not

wealthy.

It was also thought that many deaths that were truly (or in part) due to cancer instead

had unknown or misdiagnosed causes listed on the certificate (e.g., the "symptomatic" ICD

codes of unknown etiology, ranging from 780 to 800, which make up many of the deaths in


the "other" category). For instance, it seemed plausible that lung cancer might be appearing

as respiratory failure (the most common symptomatic condition), or that tumors affecting the

nervous system might appear as convulsions. Table 6.2 suggested that some of these deaths

may have been identified by the question subsets A.4/B.2 and A.1 (particularly the former).

Consider the questions in subset A.4: this subset picked out males weighing less than 168

pounds who could not do heavy work. The fact that these men had lighter weights was likely

a sign of the physical "wasting away" which occurs in many cancer victims just before death.

Also, more than 68% of the persons identified by Set B.2 were either present smokers or

exsmokers (compared with 39% of the persons not identified by Set B.2). Yet amazingly,

only 14.1% of the persons identified by Set B.2 had ever been diagnosed with cancer at

baseline (compared with 13.8% for all those not chosen by B.2)! There is a clear implication

for health policy: elderly males weighing less than 168 pounds who are unable to do heavy

work should be encouraged to be particularly vigilant against cancer, regardless of their

observed history with the disease. This also suggests that researchers interested in studying

risk factors associated with cancer should be very careful to control for factors that may

affect the likelihood of detection.

At this point, there is an interesting question concerning the exact causal nature of

the respondent's weight at the baseline time of interview. It seemed to be associated with

the relatively large numbers of deaths due to cancer and "other" causes. As always, one is

concerned with the issue of whether this variable is essentially symptomatic of the condition

that is causing death, or whether it is itself a "causal" risk factor for death (meaning that

mitigation of the variable would reduce death rates). The etiology suggested above (that


cancer was responsible for many of these deaths) would suggest that the weight variable is

entirely symptomatic, as one cannot sustain a cancer patient merely through nourishment.

However, it is also easy to argue that for many elderly persons a lack of body weight may

actually contribute to death in the sense that the body's tissue mass and fat reserves may

serve as a type of "cushion" against diseases which cause loss of appetite. Thus, it may be

that an elderly person of lower weight succumbs to some illness more rapidly due to the

shortened time period needed to drive the body mass down below some "healthy weight"

threshold.

Notice also that the relationship between body weight and mortality is quite

nonlinear, as was suggested by the size of the coefficient on the squared term in the logistic

regression model. Thus, excessively high body weight was not associated with lower

mortality; rather there seemed to be a threshold weight below which there was a high risk of

mortality. (Note that the question set method deals with this sort of nonlinearity more

adeptly than the linear discriminant analysis, since the latter model is forced to treat the

variable linearly.) Also, it appeared that a better variable for capturing the causal effect of

obesity was the weight of the respondent at previous ages rather than at baseline. This

variable appeared in all the fitted models as strongly and positively correlated with death,

even when all other variables were controlled. Set A.3 suggests it is a particularly good

predictor of death for those above age 80.

The same issue of association or causality arose with the questions concerning the

functionality of the respondent. That is, if one cannot walk a half of a mile (or simply does

not attempt to do so under the belief that it would be too stressful), is this symptomatic of


disease, or does it contribute to one's demise? Both arguments are quite plausible, but it is

not clear to what extent one effect dominates over the other. The best way to answer such

a question would be to institute clinical trials designed to assess the efficacy of increasing

mobility and functionality in the elderly. If decreased functionality has even a moderate

influence on mortality, these results suggest that the impact on mortality rates would be

substantial. The argument for causality with respect to questions that measure mental

functioning is more difficult to make; it appears that these variables were largely

symptomatic.

Now consider the question subsets related to the highest levels of heart disease

mortality. The subsets most strongly related to heart disease (Sets B.1 and A.2) had one very

powerful predictor in common: the use of digitalis by the respondents chosen.

6.5 Digitalis use and mortality: cause or consequence?

Consider the 1,294 persons who were taking digitalis at baseline (12.6% of all

respondents). Of these, 792 (61%) were female, and 551 (43%) were younger than 75. Yet,

there were 406 deaths among these persons in three years, a 31.4% probability of death.

There were nearly twice as many deaths as should have occurred based on the age and sex distribution of these persons. Moreover, these 406 deaths accounted for 28% of all deaths

in the sample, although only 12.6% of respondents were on the drug!

The main question of interest is whether or not the use of the drug was symptomatic

of heart disease (clearly, physicians usually do not prescribe it to persons without some form

of heart disease), or whether it was a cause of death through digitalis toxicity. Initially, the

author assumed that the former was the case; however, it became increasingly clear that the


correlation between digitalis use and death was impossible to explain away with any other

variable. It was judged to be the third most important predictor (behind only age and sex)

in both the linear discriminant analysis, and the logistic regression model; yet both

controlled extensively for every observed predictor thought to have any predictive power at

all, including previous diagnoses of heart failure. It is true that this still does not establish

a causal relationship. The problem was that many important predictors were not observed,

and most of the persons who were taking digitalis had heart trouble first; thus, one expects

these persons to have a high death rate despite digitalis use or even heart failure. Still, the

association between digitalis and death was powerfully stubborn. To see this explicitly,

consider two very different groups of persons. There were 665 respondents who reported

that they had been diagnosed with heart failure, but who had never used digitalis. Of these

persons, 126 died, a death rate of 19%. Another group of respondents reported that they had

never been diagnosed with heart failure, but were using digitalis at baseline. Of these 826

persons, 244 died (29.5%). The latter group of digitalis users with no heart failure, then, had a probability of death more than half again as high as those who had suffered heart failure but had never used digitalis; yet, the group with the lower death rate was the older of

the two groups! The highest death rate belonged to the persons who had suffered heart

failure, and used digitalis; of these 468 persons, 162 (34.6%) died.
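These group comparisons amount to a small cross-tabulation; the sketch below (illustrative only) reproduces the three death rates from the counts quoted in the text.

    # Death rates in the digitalis / heart-failure groups described above.

    groups = {
        "heart failure, never used digitalis": (665, 126),
        "no heart failure, using digitalis":   (826, 244),
        "heart failure and digitalis":         (468, 162),
    }
    for label, (n, deaths) in groups.items():
        print(f"{label}: {deaths}/{n} = {deaths / n:.1%}")
    # about 19%, 29.5% and 34.6%, matching the text; the two digitalis
    # groups together account for 406 of the 1,294 users.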

It was this sort of finding that prodded the author to research digitalis and its use in

the elderly community. It was with some amazement, then, that the author discovered a

raging controversy in the medical community concerning the use of digitalis. Essentially,

the drug has been hotly debated since its very discovery (hundreds or probably thousands of


years ago).5 Digitalis, which is extracted from the leaves of the plant Digitalis lanata, is one

of the cardiac glycosides, a group of drugs that directly affect the muscle tissue of the heart.

It is usually prescribed for supraventricular fibrillation (essentially rapidness or irregularity

in the heart beat which is outside the ventricular, or lower, chambers of the heart, e.g., atrial

fibrillation and flutter) or heart failure itself. The influence of the drug is highly dose-related.

At normal dosages, it causes a slowing of the heart rate (called a negative chronotropic

effect), an increase in the force of systolic contractions (called a positive inotropic effect),

and decreased conduction velocity through the atrioventricular node.6 In the past decade,

digitalis has been one of the most commonly prescribed drugs in the U.S., with some 21

million prescriptions in 1990.7

A known problem with digitalis is that it becomes fatally toxic if the concentration

of digitalis in the blood becomes too high. Unfortunately, very little is known about exactly

what the optimal dosage is, although it appears that the lethal dose is not very high relative

to the effective dose. Moreover, the proper dosage may vary tremendously from person to

person depending on any number of variables, such as age, body weight and renal

functionality. This last factor is tremendously important; persons who have bodies which

are less able to excrete wastes efficiently (as is frequently the case for those who are less

mobile and elderly persons overall) are much more susceptible to toxicity. This is because

5 For a summary of the present state of the debate, see Milton Packer's editorial comments in the New England Journal of Medicine, "End of the Oldest Controversy in Medicine: Are We Ready To Conclude the Debate on Digitalis?" (1997).

6 See http://www.rxlist.com/cgi/generic/dig.htm for an excellent summary of the pharmacology and chemistry of digitalis.

7 The Digitalis Investigation Group (1996).


the concentration of the drug in the blood can build up to a toxic level much more easily.

Many substantial studies of digitalis toxicity exist.8 Based on some of these studies,

it does not appear to be a large problem. For example, Warren et al. estimated that only 0.85% of Medicare beneficiaries who used digitalis were hospitalized annually for adverse effects from digitalis. Similarly, Kernan et al., using the New Haven EPESE sample,

estimated that only 4-6% of digitalis users were hospitalized for toxicity in a period of 4-6

years. However, both studies suffered serious deficiencies. First, both studies ignored

persons who died. Secondly, since the optimal dosage (which is not really known) varies

from person to person, and since "toxicity" is defined on the basis of a somewhat arbitrarily

high blood serum threshold concentration of digitalis, it was not clear that the "true"

incidence of toxicity was estimated. ("True" toxicity is thought of as that level of digitalis

concentration at which the health of the individual is negatively affected by the drug). There

were also a substantial number of studies which asserted that digitalis toxicity was much

higher. However, these were anecdotal or setting-specific (i.e., applying to some particular

group such as persons in nursing homes or patients already known to be suffering toxicity).9

None of these could document toxicity systematically in a large, representative group of U.S.

elderly. A number of other researchers using observational data have also noted the

persistence of the association between digitalis use and mortality despite the attempt to

control for many potential confounders.10

A central problem in detecting digitalis toxicity, besides the ambiguity of the

8 See Kernan et al. (1994) and Warren et al. (1994).

9 For example, see Aronow (1996).

10 See Moss et al. (1991) and Bigger et al. (1985).


definition, is that it usually kills the recipient through arrhythmia or heart failure; both are

precisely those ailments for which it is prescribed! Thus, if toxicity is not checked for

explicitly (which is the case for the vast majority of deaths) it is very likely to go undetected.

An ignorant observer would likely assign heart failure or arrhythmia as the underlying cause

of death. For elderly persons, who are generally much more susceptible to toxicity, and who

may require lower dosages, the problem is exacerbated; even if blood serum tests are

performed, the problem would not necessarily be detected because true toxicity may be

assumed away by definition. Consider that of the 1,450 deaths in the EPESE sample, only one was attributed to digitalis toxicity (having an ICD-9 code of 792.1)! If one takes this

estimate at face value, only 0.07% of digitalis users die from it in three years, a suspiciously

low estimate. There is little doubt that toxicity is going undetected; the question is to what

degree it may be killing people.

The National Institutes of Health, increasingly interested in the efficacy and safety

of digitalis, has sponsored several large randomized, double-blind clinical trials to examine

the drug.11 In the most recent of these (with results published in the February 1997 issue of

the New England Journal of Medicine), the Digitalis Investigation Group (a consortium of

researchers created by the NIH to study the issue) assessed the efficacy and safety of digoxin

in a randomized, double-blind controlled clinical trial with 6,800 patients with heart failure

and normal sinus rhythm. There were 1,181 deaths in the treatment group (34.8%), and

1,194 deaths in the placebo group (35.1%) in an average of 37 months; thus, there was

almost no difference in mortality between the two groups. The benefits of the drug were

11 See The Digitalis Investigation Group (1997).


largely symptomatic; essentially, patients receiving the drug suffered somewhat less discomfort and slightly fewer hospitalizations (6% fewer in the treatment group).

This result is representative of many results from previous clinical trials. Although

admitting that there was no impact on mortality and that the actual savings implied by the

decrease in hospitalizations was negligible, Dr. Packer concludes, "For most patients with

heart failure, digitalis remains an effective, safe, and inexpensive choice for the relief of

symptoms".
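As a rough check on the claim of "almost no difference in mortality," one can compare the two proportions directly; the arm sizes are not given here, so the sketch below (illustrative only) assumes the 6,800 patients were split evenly, roughly 3,400 per arm.

    # Two-proportion z statistic for the trial's mortality figures,
    # assuming equal arms of 3,400 patients each.

    import math

    def two_proportion_z(d1, n1, d2, n2):
        p1, p2 = d1 / n1, d2 / n2
        pooled = (d1 + d2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    z = two_proportion_z(1194, 3400, 1181, 3400)   # placebo vs. digoxin deaths
    print(round(z, 2))   # about 0.3, i.e., no detectable mortality difference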

However, several mitigating conditions must be considered before drawing broad

conclusions that generalize from this study to most U.S. elderly who use the drug. First, the

study was not intended to examine elderly patients per se (persons who probably account

for most of the susceptibility) as only 27% of the patients were over age 70. Secondly, there

was a concerted, thorough effort to monitor dosages closely through the periodic

measurement of serum concentrations, and dosages were adjusted accordingly. Thirdly, the

patients appeared to be relatively healthy in comparison with many elderly persons who use

the drug in reality.12

If one believes that there is a neutral effect on mortality when the drug is applied to

relatively young, healthy patients in a highly monitored clinical setting, as suggested by the

results of clinical trials, then the safety of the drug should be highly suspect when applied to

elderly, sicker, less functional persons in a real-world setting. Consider the following cost-benefit analysis: if digitalis use is truly neutral with respect to mortality, then there is no real

harm in discontinuing the use of the drug, other than moderate symptomatic worsening.

12 A lengthy list of the exclusion criteria may be found in Digitalis Investigation Group (1996).


However, if digitalis does have even a moderate positive impact on the risk of death in

elderly persons, the sheer numbers imply a substantial degree of excess mortality from this

drug. The implications for clinicians and public health researchers are clear. First,

physicians should be extremely cautious when prescribing the drug to elderly, less functional

patients, and particular care should be taken to monitor and moderate dosages. Secondly,

more extensive studies are needed to estimate the true level of digitalis toxicity and death due

to digitalis in elderly persons. Finally, it is worth noting that models of the sort in

Appendix I frequently chose male respondents who used digitalis and who weighed less than

about 160 pounds as being at particularly high risk of death. It is possible that prescribed

dosages of digitalis are not being adequately adjusted for the below-average body weight of

these persons.


Chapter 7 - Conclusions

7.1 The power of the models for predicting mortality

This dissertation has achieved several goals. First, it presented several powerful,

compact models for predicting three-year mortality in elderly persons. Second, it defined a

nonparametric model structure for binary prediction analogous to classification trees. Third,

the author assembled a search algorithm for selecting such models, and described a method

for using an internal test set to select model sizes and estimate prediction error. The

estimates of error were then validated on an independent sample of elderly persons. Finally,

the fitted models raised several interesting hypotheses concerning the causes of death in the

elderly, and suggested serious pitfalls that must be navigated by mortality researchers.

The primary focus of the research was the first of these objectives. In short, the

appendices of this dissertation present several small but efficient models for predicting short

term mortality and morbidity. Some of these models required as few as seven variables. Yet

they can detect about 40% of deaths with a specificity of 88%. The question set method

resulted in a ROC curve with an area estimated at 74.4% ±1.2%. When these models were

applied to an independent sample of elderly persons with a radically different demographic

makeup, the estimates of error were shown to be quite accurate. The differences between the

test set estimates of error and the validation estimates were no bigger than the sampling

errors that could have occurred had the respondents been chosen with a simple random

sample.

The accuracy achieved by Question Sets A through C was superior to the accuracy

of the classification trees, but less than that of the linear discriminant model. This order was


maintained whether the models were applied to the learning set respondents, the test set

respondents, or the Duke sample. However, the differences were generally small in

magnitude. Logistic regression could achieve the same accuracy as the linear discriminant

model, but used more variables (particularly when fit with the AIC criterion). It also seems possible to build larger question set models (such as Set J, consisting of 23 questions) with even greater predictive power than the linear discriminant fit. Set J correctly predicted 51%

of deaths in the test set with a specificity of 84%, but it has not yet been validated with an

independent sample.
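The sensitivity and specificity just quoted follow directly from the test-set confusion table (Table 5.21); a short check (illustrative only):

    # Sensitivity and specificity of Set J on the test set (Table 5.21).

    true_pos, false_neg = 249, 235     # deaths predicted to die / to survive
    true_neg, false_pos = 2480, 466    # survivors predicted to survive / to die

    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
    # roughly 51% and 84%, the figures quoted above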

The models provided an interesting alternative to the more conventional, parametric

methods of prediction (such as linear discriminant analysis and logistic regression). Besides

requiring fewer variables, the form of the predictors (combinations of simple questions)

allowed for an ease of interpretation that was not available through equation-based methods.

It was possible to understand the structure of the models with little knowledge of

mathematics or distribution theory, and the results provided a stark contrast to more

complicated forms. They also provided valuable insights into mortality processes, as

discussed below.

7.2 The efficiency of the search method

The RSA was defined as the basic random search algorithm for finding question sets

that could not be improved by replacing any single question. The method was likely to

achieve a suboptimal model if the algorithm was only run once. Consequently, N

independent runs of the algorithm were made with independently generated starting models,

and the best model out of these N runs was selected (the RRSA(N) method). With very small


model sizes (i.e., two questions), it was proven that the RRSA(N) method selected the

optimal model with a high probability, even for a reasonably small N. For moderately sized

models (seven to ten questions), a brute force argument was made to suggest that the optimal

model could be found with RRSA(N) with an N of at least 100, but this could not be proven.

For example, consider the seven-question model structure of Set B in Appendix I

with a relative misclassification cost of 3.5. The median number of mutations required for

a single run of the RSA with this model structure was estimated as roughly 42,000. A single

run of the RSA was estimated to have a 5% chance of finding the observed maximum (Set

B) when applied to the full dataset. Consequently, one had to perform an average of 20

independent runs of the RSA to find Set B, requiring a total of about 18 hours of computing

time on a SparcStation Ultra. The RRSA(100) was estimated to have a 99.4% chance of

achieving the observed maximum.
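These figures are consistent with treating each run of the RSA as an independent trial that finds the observed maximum with probability p of about 0.05; the arithmetic below is a reconstruction of that calculation rather than a quotation from the text:

\[
P\{\mathrm{RRSA}(N)\ \text{finds the observed maximum}\} \;=\; 1-(1-p)^{N},
\qquad 1-(0.95)^{100} \approx 0.994,
\qquad E[\text{runs until first success}] \;=\; 1/p \;=\; 20 .
\]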

For larger model sizes (containing more than a dozen questions), it was more

doubtful that the RRSA(N) could find an absolute maximum within a reasonable time. To

achieve shorter search times, the RSA was modified by combining random searching with

exhaustive searching (designated the RESA method). Also, a more flexible method of

choosing model size was used (cost-complexity estimation, defined by Breiman et al.).

These approaches were used to select Set J in Appendix VIII.
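In the notation of Breiman et al. (1984), the quantity being traded off can be written as below; the adaptation from the number of terminal nodes to the number of questions is a paraphrase of how the idea carries over here, not a formula quoted from the text:

\[
R_{\alpha}(M) \;=\; R(M) \;+\; \alpha\,k(M),
\]

where \(R(M)\) is the (cost-weighted) misclassification error of model \(M\), \(k(M)\) is its size in questions, and \(\alpha \ge 0\) is the cost-complexity parameter estimated with the help of the test set.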

7.3 Implications for causal analyses of mortality

Besides age and sex, the best predictors of short term mortality were the use of

digitalis, body weight at time of interview, several measures of functionality, and the

previous diagnosis of illnesses (in that order). Interestingly, the age and sex variables


featured prominently in the linear models, as they were always the two most powerful

variables in the model. However, in the question sets, these variables played a much less

important role. In fact, it was possible to build a model of the highest risk persons (Set C)

which makes no references to age or sex! One might expect that since age and sex were

correlated with many questions in the models, a model that controls for neither (like Set C)

is simply referencing the oldest males through these spurious predictors. Yet, the persons

chosen by Set C were hardly dominated by old males, and their probability of death was

more than twice as high as would have been predicted by age and sex. Sets A and B also use

age and sex sparingly. Each variable was used in only one subset of each model, and they

were never included in the same subset.

Several reasons exist to explain these results. As discussed in Chapter 1, the

variables that measure physical functioning are probably temporally closer to death than

many other predictors. There is also a less obvious reason. Clearly, the observed incidence

of heart failure, cancer and stroke was higher for males at all ages. However, females

showed a greater loss of function than males at all ages. It was also suspected that there was

much undetected morbidity in the EPESE sample. Many persons who died had never been

diagnosed with major illnesses at baseline (which is why previous diagnoses of illness were

relatively weak predictors). Taken together, these observations suggested that much of the

undetected morbidity was in the female population. Females had much lower incomes than

males; thus, they may have had less access to medical care, and consequently were less

likely to be diagnosed with illness.

The issue of "missing morbidity" is crucial to mortality researchers. It was suspected


that because of undetected morbidity and the poor quality of death certificate data, many

causes of death were not accurately recorded or completely missing from the certificates.

It was also suspected that cancer was the illness most likely to go undetected, and that many

deaths caused by cancer were actually listed with other or unknown causes. This was

because so many deaths listed as "other" (which included unknown, or "symptomatic"

causes) were found to have a strong common bond with many cancer deaths. Such persons

were chosen largely by the question concerning body weight at time of interview;

specifically, many of these persons had below-average weights at baseline, which would

indicate the "wasting" that frequently occurs before death from cancer. Most of them were

also smokers, or had smoked in the past. Finally, many of these deaths listed respiratory

failure of "unknown etiology" as a condition on the death certificate. This may lead

researchers to draw misguided conclusions about mortality processes. As suggested, the

likelihood of diagnoses (with true morbidity held constant) is probably much greater in those

with access to medical care, which is highly correlated with income and many other

variables. This is a major source of spurious association for any researcher wishing to

analyze the causes of death.

The body weight variable was an interesting predictor. Researchers have typically

thought of the variable as positively related to the risk of death, reflecting the negative health

effects of obesity. A better measure of these effects was obtained through the variable that

asked about body weight at younger ages, which did turn out to be a positive predictor of

mortality. For predicting short term mortality, it is much more powerful to use body weight

at baseline to choose those respondents who are "wasting away"; other variables that also


served this purpose were the questions about loss of weight in the past year (positively

associated with death) and the body-mass index. There was a strong, nonlinear "threshold"

effect for this variable (about 165 pounds, for males), below which the risk of mortality was

high. Clearly, the linear approximation was inaccurate, although it did not substantially

detract from the power of the discriminant model. (The regression model included a squared

term for this variable.)
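For concreteness, the two functional forms being contrasted can be written as follows; the coefficients are placeholders rather than estimates from this study:

\[
\text{linear:}\quad \eta = \beta_0 + \beta_1 w,
\qquad\qquad
\text{with a squared term:}\quad \eta = \beta_0 + \beta_1 w + \beta_2 w^{2},
\]

where \(w\) denotes baseline body weight and \(\eta\) the linear predictor; only the second form can approximate the elevated risk concentrated below a threshold such as 165 pounds.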

It was also observed that low body weight was a particularly powerful predictor when

combined with digitalis use. It is possible that the prescribed dosages for the drug are not

adequately adjusted for underweight elderly persons. In fact, it was strongly suspected that

digitalis toxicity may have been a substantial source of excess mortality for elderly persons in

general. Discerning the true causal connection between digitalis use and mortality with

observational data is impossible, since this variable is highly confounded with heart disease

(which was obviously not measured perfectly in this data). However, clinical trials have

found that digitalis does not affect mortality when the drug is used on relatively younger,

healthier persons. The elderly persons in the EPESE sample were much more likely to have

renal failure, and were generally less functional. It is possible that toxic levels of the drug

were accumulating in these persons since their bodies would expel the drug less efficiently.

Moreover, digitalis toxicity kills by inducing heart failure or arrhythmia (exactly those

conditions for which it is prescribed), suggesting that the true cause of such deaths would

have been undetected. Physicians should be extremely careful to monitor digitalis dosages

in elderly, dysfunctional patients, or eliminate the use of the drug entirely, since the only

benefits are symptomatic. Additional studies are needed to estimate the "true" level of


digitalis toxicity in the elderly population.

7.4 Future applications of these methods

At the time of this writing, efforts are underway to apply the above model-building

techniques to nationally representative surveys. For example, the National Health and

Nutrition Examination Survey (NHANES I), conducted in 1971-75, provides an even larger

pool of predictor variables from which to select. Respondents were followed over time and

subsequently interviewed in the NHEFS epidemiologic followup studies in 1982 through

1986. Respondents were interviewed with respect to many different outcomes, and deceased

persons were matched with their death certificates. This rich dataset provides an opportunity

to construct nationally representative models for many different outcomes, including

mortality, morbidity and functionality. Other longitudinal, national probability samples

which track decedents are the AHEAD survey, and the National Health Interview Survey.

This research will have several objectives. Foremost, the goal is to develop a

compact proxy for mortality that can be used in studies that do not directly observe deaths.

This dissertation suggests that it would be possible to construct an accurate proxy using a

limited number of questionnaire items that are common to many surveys. Secondly, future

research will attempt to construct more powerful models purely for the purposes of

prediction. This endeavor will draw from the entire range of potential questionnaire items

in a survey such as NHANES. The most likely model structure will be a hybrid of the question

set method invented here and linear discriminant analysis, as discussed in Chapter 2. Thus,

it will be possible to control completely for age and sex while simultaneously utilizing the

predictive power of the question set method. The results of the above validation suggest that


this model structure may result in more stable error rates, since more variables can be

controlled.

To assess the predictive power of these models honestly, a test set of respondents will

be selected out with a simple random sample before conducting any analysis. The

researchers involved in the project will be kept blind to this dataset until the final models are

constructed. This technique will provide a truly unbiased estimate of model error.

(However, additional "test set"/"learning set" divisions may be employed on the remaining

learning set respondents as part of the model building process.)
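A minimal sketch of such a pre-analysis split, written in the style of the C programs in the appendices (the function name and arguments are illustrative and are not part of the project's code):

#include <stdlib.h>
#include <time.h>

/* Flags each of n respondents for the held-out test set with probability
   tfrac before any model fitting takes place; in_test[i] is set to 1 for
   test-set respondents and 0 for learning-set respondents. */
void split_sample(int n, double tfrac, int *in_test)
{
    int i;
    srand48((long)time(NULL));              /* seed the generator       */
    for (i = 0; i < n; i++)
        in_test[i] = (drand48() < tfrac);   /* simple random assignment */
}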

Finally, the research will endeavor to create a nationally representative "health index"

using some subset of survey items. Models such as those presented in the appendices are

very easily applied to a living person, and they provide an excellent means by which the

general health of a person may be assessed. There is some demand for such an index on

behalf of clinicians and public health practitioners, and some attempts have been made

already (e.g., the SF-36 questionnaire). However, these are rarely linked with an

unambiguous outcome such as death, and they do not usually use cutting-edge statistical

techniques.


References

Akaike H. 1973. "Information theory and an extension of the maximum likelihood principle." Second International Symposium on Information Theory (eds. Petrov and Csaki). Akademia Kiado. Budapest. 267-81.

Anderson CS; Jamrozik KD; Broadhurst RJ; Stewart-Wynne EG. 1994. "Predicting survival for 1 year among different subtypes of stroke. Results from the Perth Community Stroke Study." Stroke, 25:1935-44.

Aronow JS. 1996. "Prevalence of appropriate and inappropriate indications for use of digoxin in older patients at the time of admission to a nursing home." Journal of the American Geriatrics Society, 44:588-90.

Assmann G; Cullen P; Heinrich J; Schulte H. 1996. "Hemostatic variables in the prediction of coronary risk: results of the 8 year follow-up of healthy men in the Munster Heart Study (PROCAM)." Israel Journal of Medical Sciences, 32:364-70.

Becker RB; Zimmerman JE; Knaus WA; Wagner DP; Seneff MG; Draper EA; Higgins TL; Estafanous FG; Loop FD. 1995. "The use of APACHE III to evaluate ICU length of stay, resource use, and mortality after coronary artery by-pass surgery." Journal of Cardiovascular Surgery, 36:1-11.

Bernstein JH; Carmel S. 1996. "Medical and psychosocial predictors of morbidity and mortality: results of a 26 year follow-up." Israel Journal of Medical Sciences, 32:205-10.

Bianchetti A; Scuratti A; Zanetti O; Binetti G; Frisoni GB; Magni E; Trabucchi M. 1995. "Predictors of mortality and institutionalization in Alzheimer disease patients 1 year after discharge from an Alzheimer dementia unit." Dementia, 6:108-12.

Bigger JT; Fleiss JL; Rolnitzky LM; Merab JP; Ferrick KJ. "Effect of digitalis treatment on survival after acute myocardial infarction." American Journal of Cardiology, 55:623-30.

Blumberg D; Port JL; Weksler B; Delgado R; Rosai J; Bains MS; Ginsberg RJ; Martini N; McCormack PM; Rusch V; et al. 1995. "Thymoma: a multivariate analysis of factors predicting survival." Annals of Thoracic Surgery, 60:908-13.

Bosch X; Magrina J; March R; Sanz G; Garcia A; Betriu A; Navarro-Lopez F. 1996. "Prediction of in-hospital cardiac events using dipyridamole-thallium scintigraphy performed very early after acute myocardial infarction." Clinical Cardiology, 19:189-96.


Breiman L; Friedman JH; Olshen RA; Stone CJ. 1984. Classification and Regression Trees. Wadsworth, Inc. Pacific Grove, California.

Cahalin LP; Mathier MA; Semigran MJ; Dec GW; DiSalvo TG. 1996. "The six-minute walk test predicts peak oxygen uptake and survival in patients with advanced heart failure." Chest, 110:325-32.

Cain KC; Martin DP; Holubkov AL; Raghunathan TE; Cole WG; Thompson A. 1994. "A logistic regression model of mortality following hospital admissions among Medicare patients: comparison with HCFA's model [abstract]." Ahsr Fhsr Annu Meet Abstr Book, 11:81-2.

Chambers JM; Hastie TJ. 1991. Statistical Models in S. Wadsworth, Inc. Pacific Grove, California.

Cornoni-Huntley J; Brock DB; Ostfeld AM; Taylor JO; Wallace RB. 1986. Established Populations for Epidemiologic Studies of the Elderly: Data Resource Book. Washington D.C.: U.S. Government Printing Office. (NIH Pub. No. 86-2443.)

Cornoni-Huntley J; Ostfeld AM; Taylor JO; Wallace RB; Blazer D; Berkman LF; Evans DA; Kohout FJ; Lemke JH; Scherr PA; Korper SP. 1993. "Established populations for epidemiologic studies of the elderly: Study design and methodology." Aging Clin. Exp. Res. 5:27-37.

Davis L. 1987. Genetic Algorithms and Simulated Annealing. Morgan Kaufmann Publishers. San Mateo, California.

Davis RB; Iezzoni LI; Phillips RS; Reiley P; Coffman GA; Safran C. "Predicting in-hospital mortality: The importance of functional status information." Medical Care 33:906-921.

The Digitalis Investigation Group. 1997. "The effect of digoxin on mortality and morbidity in patients with heart failure." The New England Journal of Medicine, 336:525-533.

Eysenck HJ. 1993. "Prediction of cancer and coronary heart disease mortality by means of a personality inventory: results of a 15-year follow-up study." Psychological Reports, 72:499-516.

Flanagan JR; Pittet D; Li N; Thievent B; Suter PM; Wenzel RP. 1996. "Predicting survival of patients with sepsis by use of regression and neural network models." Clinical Performance and Quality Health Care, 4:96-103.

Freedman D; Pisani R; Purves R. 1991. Statistics. Norton. New York.


Friedman HS; Tucker JS; Schwartz JE; Tomlinson-Keasey C; Martin LR; Wingard DL; Criqui MH. 1995. "Psychosocial and behavioral predictors of longevity. The aging and death of the 'termites'." American Psychologist, 50:69-78.

Friedman JH. 1984. "A variable span smoother." Technical Report No. 5, Laboratory for Computational Statistics, Department of Statistics, Stanford University, California.

Friedberg SH; Arnold JI; Spence LE. 1989. Linear Algebra. Prentice Hall. Englewood Cliffs, New Jersey.

Gnanadesikan R. 1977. Methods for Statistical Data Analysis of Multivariate Observations. Wiley. New York.

Goldberg DE. 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley. Reading, Massachusetts.

Grubb NR; Elton RA; Fox KA. 1995. "In-hospital mortality after out-of-hospital cardiac arrest." Lancet, 346:417-21.

Hastie TJ; Tibshirani R. 1994. "Discriminant adaptive nearest neighbor classification." Technical Report (December).

Hastie TJ; Tibshirani R. 1996. "Discriminant analysis by Gaussian mixtures." Technical Report (February).

Hastie TJ; Buja A; Tibshirani R. 1995. "Penalized discriminant analysis." Annals of Statistics, 73-102.

Holland JH. 1975. Adaptation in Natural and Artificial Systems. The University of Michigan Press. Ann Arbor, Michigan.

Huppert FA; Whittington JE. 1995. "Symptoms of psychological distress predict 7-year mortality." Psychological Medicine, 25:1073-86.

Iezzoni LI; Ash AS; Coffman GA; Moskowitz MA. 1992. "Predicting in-hospital mortality. A comparison of severity measurement approaches." Medical Care, 30:347-59.

Iezzoni LI; Heeren T; Foley SM; Daley J; Hughes J; Coffman GA. 1994. "Chronic conditions and risk of in-hospital death." Health Services Research, 29:435-60.


Iezzoni LI; Shwartz M; Ash AS; Mackiernan YD. 1996. "Using severity measures to predict the likelihood of death for pneumonia inpatients." Journal of General Internal Medicine, 11:23-31.

Josephson RA; Chahine RA; Morganroth J; Anderson J; Waldo A; Hallstrom A. 1995. "Prediction of cardiac death in patients with a very low ejection fraction after myocardial infarction: a Cardiac Arrhythmia Suppression Trial (CAST) study." American Heart Journal, 130:685-91.

Mallows CL. 1973. "Some comments on Cp." Technometrics, 15:661-675.

Mardia KV; Kent JT; Bibby JM. 1979. Multivariate Analysis. Academic Press. New York.

Marshall G; Grover FL; Henderson WG; Hammermeister KE. 1994. "Assessment of predictive models for binary outcomes: an empirical approach using operative death from cardiac surgery." Statistics in Medicine, 13:1501-11.

Moss AJ; Davis HT; Conard DL; DeCamilla JJ; Odoroff CL. 1985. "Digitalis associated cardiac mortality after acute myocardial infarction." Circulation, 64:1150-56.

National Center for Health Statistics. 1987. Medical Examiners' and Coroners' Handbook on Death Registration and Fetal Death Reporting. U.S. Department of Health and Human Services. Hyattsville, Maryland.

National Center for Health Statistics. 1988. Instruction Manual Part 2a: Instructions for Classifying the Underlying Cause of Death, 1988. U.S. Department of Health and Human Services. Hyattsville, Maryland.

National Center for Health Statistics. 1986. Instruction Manual Part 2b: Instructions for Classifying Multiple Causes of Death, 1986. U.S. Department of Health and Human Services. Hyattsville, Maryland.

Normand ST; Glickman ME; Sharma RG; McNeil BJ. 1996. "Using admission characteristics to predict short-term mortality from myocardial infarction in elderly patients. Results from the Cooperative Cardiovascular Project." JAMA, 275:1322-8.

Ortiz J; Ghefter CG; Silva CE; Sabbatini RM. 1995. "One-year mortality prognosis in heart failure: a neural network approach based on echocardiographic data." Journal of the American College of Cardiology, 26:1586-93.

Piccirillo JF; Feinstein AR. 1996. "Clinical symptoms and comorbidity: Significance for the prognostic classification of cancer." Cancer 77:834-842.


Poses RM; McClish DK; Smith WR; Bekes C; Scott WE. 1996. "Prediction of survival of critically ill patients by admission comorbidity." Journal of Clinical Epidemiology, 49:743-7.

Pritchard ML; Woosley JT. 1995. "Comparison of two prognostic models predicting survival in patients with malignant melanoma." Human Pathology, 26:1028-31.

Quintana M; Lindvall K; Brolund F; Eriksson SV; Ryden L. 1995. "Prognostic value of exercise stress testing versus ambulatory electrocardiography after acute myocardial infarction: a 3 year follow-up study." Coronary Artery Disease, 6:865-73.

Reuben DB; Rubenstein LV; Hirsch SH; Hays RD. 1992. "Value of functional status as a predictor of mortality: Results of a prospective study." The American Journal of Medicine 93:663-669.

Rowan KM; Kerr JH; Major E; McPherson K; Short A; Vessey MP. 1994. "Intensive Care Society's Acute Physiology and Chronic Health Evaluation (APACHE II) study in Britain and Ireland: a prospective, multicenter, cohort study comparing two methods for predicting outcome for adult intensive care patients." Critical Care Medicine, 22:1392-401.

Schuchter L; Schultz DJ; Synnestvedt M; Trock BJ; Guerry D; Elder DE; Elenitsas R; Clark WH; Halpern AC. 1996. "A prognostic model for predicting 10-year survival in patients with primary melanoma." The Pigmented Lesion Group. Annals of Internal Medicine, 125:369-75.

Silber JH; Williams SV; Krakauer H; Schwartz JS. 1992. "Hospital and patient characteristics associated with death after surgery. A study of adverse occurrence and failure to rescue." Medical Care, 30:615-29.

Smith KR; Waitzman NJ. 1994. "Double jeopardy: interaction effects of marital and poverty status on the risk of mortality." Demography, 31:487-507.

Swets JA; Pickett RM. 1982. Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press. New York.

Thompson ML; Zucchini W. 1989. "On the statistical analysis of ROC curves." Statistics in Medicine 8:1277-1290.

Talcott JA; Siegel RD; Finberg R; Goldman L. 1992. "Risk assessment in cancer patients with fever and neutropenia: a prospective, two-center validation of a prediction rule." Journal of Clinical Oncology, 10:316-22.


Turner JS; Morgan CJ; Thakrar B; Pepper JR. 1995. "Difficulties in predicting outcome in cardiac surgery patients." Critical Care Medicine, 23:1843-50.

Warren MD; Knight R. 1982. "Mortality in relation to the functional capacities of people with disabilities living at home." Journal of Epidemiology and Community Health 36:220-223.

Warren JL; McBean AM; Hass SL; Babish JD. 1994. "Hospitalisations with adverse events caused by digitalis therapy among elderly Medicare beneficiaries." Archives of Internal Medicine, 154:1482-87.

Wong DT; Crofts SL; Gomez M; McGuire GP; Byrick RJ. 1995. "Evaluation of predictive ability of APACHE II system and hospital outcome in Canadian intensive care unit patients." Critical Care Medicine, 23:1177-83.


Appendix I - Questions for the prediction of death

This appendix presents the preferred (having minimal test set misclassification error) sets of questions, constructed as described in Chapters 1 and 3. In order for a respondent to be chosen (classified as dead) by a particular set of questions, they must answer some of the question sets below with answers that are all in bold. For example, consider SET X.1 below:

SET X.1:

1. Age at time of interview:  < 85   85+

2. Sex:
Female
Male

3. Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from some person or any equipment or device for using the toilet?
No help
Help
Unable to do
Missing

In order for a respondent to be classified as dead by this set, he would have to be a male, aged 85+, who also needed help using the toilet, or was unable to use it, or whose answer was missing.

These sets are combined into overall sets using the OR operator: a respondent is considered chosen if he/she is chosen by one set OR another. For example, the overall SET A is comprised of sets A.1 through A.5. A respondent is considered chosen by SET A if they are chosen by A.1, or A.2, or any of the other sets in SET A.

Also reported with each overall set is the result of applying the set to the test set respondents. The sensitivity is the proportion of deaths in the test set that were identified by the questions. For example, SET X.1 caught 8 deaths, and there were 509 deaths in the test set, so the sensitivity is 8/509 = 1.6%. The specificity is the proportion of survivors not chosen by the questions. SET X.1 incorrectly caught 3 survivors, and there were 2,923 survivors in the test set, so the specificity is (2,923 - 3)/2,923 = 99.9%.
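The classification rule can also be stated in code. The fragment below is a minimal illustrative sketch of the OR-of-ANDs logic described above, not the dissertation's own program (the full search code appears in Appendix V); the array indices and answer coding are hypothetical.

#include <stdio.h>

#define NSUBSETS 3   /* number of subsets in the overall set (illustrative) */
#define MAXQ     4   /* maximum questions per subset (illustrative)         */

/* item[s][q] gives the index, within answers[], of the q-th question in
   subset s; nq[s] is the number of questions in subset s.  In answers[],
   a value of 1 means the respondent gave the bold (selecting) answer.     */
static const int item[NSUBSETS][MAXQ] = { {0, 1, 2, 0}, {3, 4, 0, 0}, {5, 6, 7, 0} };
static const int nq[NSUBSETS] = { 3, 2, 3 };

int chosen(const int *answers)
{
    int s, q;
    for (s = 0; s < NSUBSETS; s++) {          /* OR across the subsets      */
        int all_bold = 1;
        for (q = 0; q < nq[s]; q++)           /* AND within a single subset */
            if (!answers[item[s][q]])
                all_bold = 0;
        if (all_bold)
            return 1;                         /* classified as a death      */
    }
    return 0;                                 /* classified as a survivor   */
}

int main(void)
{
    int answers[8] = { 1, 1, 1, 0, 0, 0, 0, 0 };  /* bold answers to questions 0-2 */
    printf("chosen = %d\n", chosen(answers));     /* prints: chosen = 1            */
    return 0;
}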

QUESTION SET A  (Sensitivity: 66%   Specificity: 70%)

SET A.3

1. Age: <80 80+

2. What was your usual weight at age 25?

2. Weight (in pounds): <168 168+ Missing

3. Are you able to do heavy work around the house, like shoveling snow, washing windows, walls or floors without help?

SET A.1

1. Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from some person or any equipment or device to do the following:

Bathing, either a sponge bath, tub bath, orshower?

No help

Help

Unable to do

Missing

Do you still require this help? Yes No Missing

<139 pounds 139+

OR

SET A.4

1. Sex: Female Male

Missing

2. How much difficulty, if any, do you have pulling or pushing large objects like a living room chair?

No difficulty at all

A little or some difficulty

Just unable to do it

Missing

OR

SET A.2

1. Do you take any digitalis, Digoxin, Lanoxin,or Digitoxin pills now?

Yes

No

Missing

OR

Yes

No

Missing

OR

SET A.5

1. (When wearing eyeglasses/contact lenses) Can you see well enough to recognize a friend across the street?

Yes

No

Missing

2. Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from some person or any equipment or device to do the following things:

Walking across a small room?

No help

Help

Unable to do

Missing

QUESTION SET B  (Sensitivity: 41%   Specificity: 88%)

SET B.1:

1. Are you able to walk half a mile without help?That's about 8 ordinary blocks.

Yes

NoMissing

2. Do you take any digitalis, Digoxin, Lanoxin,or Digitoxin pills now?

YesNo

Missing

OR

SETB.2:

1. Sex: Female Male

2. Weight in pounds: < 168 168+ Missing

3. Are you able to do heavy work around the house, like shoveling snow, washing windows, walls or floors without help?

Yes

No

Missing


OR

SETB.3:

1. Age: <80 80+

2. What was (is) your mother's maiden name?

Correct

IncorrectRefused

Missing

QUESTION SET C  (Sensitivity: 22%   Specificity: 96%)

SET C.1:

1. How much difficulty, on the average, do you have bathing, either a sponge bath, tub bath, or shower?

No difficulty at all

A little difficulty

Some difficulty

A lot of difficulty

Missing

2. Do you take any digitalis, Digoxin, Lanoxin,or Digitoxin pills now?

Yes

No

Missing

3. Weight in pounds: < 168 168+ Missing

OR

SET C.2:

1. Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from some person or any equipment or device for bathing, either a sponge bath, tub bath, or shower?

No help

Help

Unable to do

Missing

2. What day of the week is it?

Correct

Incorrect

Refused

Missing


OR

SET C.3:

Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from some person or any equipment or device for bathing, either a sponge bath, tub bath, or shower?

No help

Help

Unable to do

Missing

I. Do you still require this help?

Yes

No

Missing (includes "No Help"respondents)

2. Have you ever taken any digitalis, Digoxin,Lanoxin, or Digitoxin pills?

YesNo

Missing


Appendix II - Questions for predicting heart failure

QUESTION SET D  (Sensitivity: 59%   Specificity: 72%)

SETD.1:

Have you ever had any pain or discomfort inyour chest?

Yes

No

Missing

(Note: The above question does not select outany respondents, and is only included becausethe question below refers to it.)

1. What do you do if you get this pain while youare walking?

Stop or slow down

Take a nitroglycerin

Continue at same pace

Missing (includes those with no pain)

OR

SETD.2:

I. As compared to other people your own age,would you say that your health is excellent, good,fair, poor or very poor?

Excellent

Good

Fair

Poor or bad

Missing

2. Weight (in pounds): <170 170+ Missing

OR

SETD.3:

1. Age: <80 80+

2. During the past week, I felt depressed:

Rarely or none of the time

Some of the time

Much of the time

Most or all of the time

Don't know/Refused/Missing

For East Boston respondents, this question was phrased: Have you felt this way much of the time during the past week? -- I felt depressed:

No

Yes

Missing

Has a doctor ever told you that you had anycancer, malignancy or malignant tumor of anytype?

Yes

Suspect

No

Missing

3. Were you hospitalized overnight or longer forthis?

Yes

No

Missing (includes those with no cancer)

OR


SET D.4:

1. Do you take any digitalis, Digoxin, Lanoxinor Digitoxin pills now?

Yes

No

Missing

QUESTION SET E  (Sensitivity: 36%   Specificity: 87%)

SETE.l:

1. Age: <85 85+

2. What was (is) your mother's maiden name?

Correct

Incorrect

Refused

Missing

OR


SETE.2:

1. Do you take any digitalis, Digoxin, Lanoxinor Digitoxin pills now?

Yes

No

Missing

2. Are you able to do heavy work around the house, like shoveling snow, washing windows, walls or floors without help?

Yes

NoMissing

OR

SET E.3:

I. Are you able to walk half a mile without help?That's about 8 ordinary blocks.

Yes

NoMissing

Have you ever had any pain or discomfort inyour chest?

Yes

NoMissing

(Note: The above question does not select outany respondents, and is only included becausethe question below refers to it.)

2. What do you do if you get this pain while youare walking?

Stop or slow down

Take a nitroglycerin

Continue at same pace

Missing (includes persons with no pain)


Appendix III - Questions for predicting strokes

QUESTION SET F  (Sensitivity: 59%   Specificity: 66%)

SETF.1:

1. As compared to other people your own age,would you say that your health is excellent, good,fair, poor or very poor?

Excellent

Good

Fair

Poor or bad

Missing

2. Has a doctor ever told you that you had anycancer, malignancy or malignant tumor of anytype?

Yes

Suspect

No

Missing

OR

SETF.2:

1. Do you take any digitalis, Digoxin, Lanoxin,or Digitoxin pills now?

Yes

No

Missing

SET F.3:

1. Age: <75 80+

2. When was the last time you saw a dentist?

5 years ago or less

>5 years ago or never

Missing

OR

SETF.4:

Has a doctor ever told you that you had a heartattack or coronary, or coronary thrombosis, orcoronary occlusion or myocardial infarction?

Yes

Suspect

No

Missing

1. Were you hospitalized overnight or longer forthis?

Yes

No

Missing

2. Do you smoke cigarettes regularly now?

Yes

No

Missing

QUESTION SET G  (Sensitivity: 37%   Specificity: 86%)

SETG.l:

1. Do you take any digitalis, Digoxin, Lanoxin,or Digitoxin pills now?

YesNo

Missing

2. Do you get shortness of breath that requiresyou to stop and rest?

Yes

No

Missing

OR

SETG.2:

1. How much difficulty, on the average do youhave walking across a small room?

No difficulty at all

A little difficulty

Some difficultyA lot of difficulty

Missing

2. Have you ever taken any digitalis, Digoxin,Lanoxin, or Digitoxin pills?

YesNo

Missing


SET G.3:

1. Age: <85 85+

2. What was (is) your mother's maiden name?

Correct

IncorrectRefusedMissing

OR

SET G.4:

1. Has a doctor ever told you that you had anycancer, malignancy or malignant tumor of anytype?

Yes

Suspect

NoMissing

2. Are you able to walk half a mile without help?That's about 8 ordinary blocks.

Yes

NoMissing

Have you ever had any pain or discomfort inyour chest?

Yes

No

Missing

3. Do you get this pain (or discomfort) when youwalk uphill or hurry?

Yes
Never walks uphill or hurries or cannot walk
No

Missing


Appendix IV - Questions for predicting cancer

QUESTION SET H  (Sensitivity: 61%   Specificity: 54%)

SETH.I:

How many close friends do you have?

None

1 or 2
3 to 5

6 to 9

10 or more

Missing

1. How many of these friends do you see at leastonce a month?

None

1 or 2
3 to 5
6 to 9

10 or more

Missing

2. In the past year, have you gained or lost more than 10 pounds?

No change

Yes, gained

Yes, lost

Yes, both gained and lost

Missing

OR

SET H.2:

1. Do you smoke cigarettes regularly now?

Yes

No

Missing

OR

SETH.3:

1. On the average, how many cigarettes per daydid you usually smoke? (Former smokers only)

<20
20
>20
Missing

(Note: Missing for the above question includescurrent smokers and nonsmokers)

2. How old were you when you first smokedcigarettes regularly? (Former smokers only)

<40
40+
Missing

(Note: Missing for the above question includescurrent smokers and nonsmokers)

OR

SET H.4:

1. When was the last time you saw a dentist?

5 years ago or less

>5 years ago or never

Missing

2. Have you ever taken any digitalis, Digoxin,Lanoxin, or Digitoxin pills?

Yes

No

Missing

QUESTION SET I  (Sensitivity: 43%   Specificity: 71%)

SET I.1:

1. Height (inches):

<64

64+Missing

2. Pulse for 30 seconds:

<36

36+Missing

3. Do you smoke cigarettes regularly now?

Yes

No

Missing

OR

SET I.2:

1. Did you ever smoke cigarettes regularly?

Yes

No

Missing

2. On the average, how many cigarettes per daydid you usually smoke?

<20

20

>20

Missing

OR


SET I.3:

1. About how often do you go to religiousmeetings or services?

Never/almost never

Once or twice a yearEvery few months

Once or twice a month

Once a week

More than once a week

Missing

Have you ever had any pain or discomfort inyour chest?

Yes

No

Missing

(Note: The above question does not select outany respondents, and is only included becausethe question below refers to it.)

2. What do you do if you get this pain while youare walking?

Stop or slow down

Take a nitroglycerin

Continue at same pace

Missing (includes those with no pain)

OR

SET I.4:

1. Weight at age 50 (pounds): <170 170+Missing

2. Have you ever taken any digitalis, Digoxin,Lanoxin, or Digitoxin pills?

Yes

No

Missing


Appendix V - C code for the repeated random search algorithm (RRSA)

/*******************************************************************
 *  Program: Repeated random search algorithm (RRSA)
 *  Author:  Michael Anderson
 *
 *  This program performs repeated, independent runs of the random search
 *  algorithm defined in Section 4.2.  It should be compiled with a command
 *  of the form "cc -o progname sourcefile.c -lm" from a UNIX environment.
 *  The files "y.dat" (containing a list of 0's and 1's) and "x.dat"
 *  (containing a matrix of predictor variables in columns for records in
 *  rows) must be placed in the same directory as the executable for the
 *  program to run.  Thanks are due to Brad Wallet for help in coding
 *  earlier versions of this program.
 *******************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
#include <limits.h>
#include <time.h>
#define FALSE 0
#define TRUE !FALSE
#define MAXQK 4        /* Defines the max # of questions in a single subset */
#define DEFKTOT 4      /* Defines the total # of subsets in the model */
#define DEFQK 4        /* Defines the # of questions in each subset */
#define DEFCOST 3.5    /* Defines the cost of misclassifying a decedent */
#define DEFFREQ 20000  /* # of mutations before checking for absorption */
#define DEFGENS INT_MAX

/* GLOBAL VARIABLES: */

double  ndead, nalive;     /* # of 1s and 0s in the learning set Y        */
double  bfit, miscost;     /* Relative cost of misclass. dead as alive    */
int     maxq, p, ktot, n;  /* # of ?s in subset, # of vars, # of sets, N  */
int     min_part = 25;     /* Smallest # of obs to allow in a subset      */
int     gens;              /* Max # of mutations allowed in one search    */
int     checfreq;          /* How soon to start looking for absorption    */
int     lmax = 1;          /* ignore this                                 */
int     myseed = 0;        /* Seed; only used if greater than 0           */
float   **X;               /* Ptr to ptr to float: holds the X data       */
int     *y;                /* Ptr to int: This holds the Y values         */
int     *qk;               /* Ptr to int: holds the number of ?s in sets  */


float   **Uvals;              /* Ptr to ptr to float: holds variable values     */
int     *nvals;               /* Ptr to int: holds # of values each variable    */
int     **bfeat, **bdirect;   /* Ptr to ptr to int: variables, directions       */
int     **feat, **direct;     /* Ptr to ptr to int: variables, directions       */
float   **bthresh, **thresh;  /* Ptr to ptr to flt: cutoff levels for variables */
int     srchreps = 1000000;   /* Max number of independent runs                 */
FILE    *absfile, *mutfile;   /* Ptrs to files that hold output                 */


/* These INPUT data files are required for the program to run: */

char *X_file = "x.dat";      /* Ptr to file: n X p block of variables by col */
char *y_file = "y.dat";      /* Ptr to file: n-vector of Y variables, 0 or 1 */

/* The program OUTPUT is stored in these files: */

char *abs_file = "abs.out";  /* Ptr to file: output - the absorption points  */
char *mut_file = "mut.out";  /* Ptr to file: # of mutations at successes     */

/* FUNCTION PROTOTYPES */

FILE *myopenf(FILE *locfile, char *loc_file);
void welcome(void);
void getdat(void);
void getval(void);
int mycomp(const void *i, const void *j);
int fltcompare(float *i, float *j);
double misclass(int **feat, int **direct, float **thresh, float **locX,
                int *locy, double locndead, double locnalive, int *locqk,
                int locktot);
int checkabs(int **feat, int **direct, float **thresh, double best,
             float **locX, int *locy, double locndead, double locnalive);
double mutpoint(int **feat, int **direct, float **thresh, float **locX,
                int *locy, double locndead, double locnalive, double locfit);
double runsrch(int **feat, int **direct, float **thresh, float **locX,
               int *locy, double locndead, double locnalive);
int indofmin(double *vect, int lengv);
int maxval(int *vect, int lengv);
void printquesa(int **feat, int **direct, float **thresh,
                int *locqk, int locktot);
void printquesb(int **feat, int **direct, float **thresh,
                int *locqk, int locktot);
void copyques(int **tofeat, int **todirect, float **tothresh,
              int **frfeat, int **frdirect, float **frthresh);
void printscores(double *tscores, double *lscores, double *nvars,
                 double *cost_comp, int depth);

/********************* START MAIN PROGRAM *************************/

void main(argc, argv)int argc;char *argv[];

{int i,j, k, 1, k1, k2,j1, minind, pr, wasimprov, printstat;double cntdiff, best;

miscost = DEFCOST; gens = DEFGENS; checfreq = DEFFREQ;ktot = DEFKTOT; maxq = MAXQK;qk = (int *)maUoc(ktot*sizeof(int));qk[O] = 4; qk[1] = 4; qk[2] = 4; qk[3] = 4;

/* If you want deterministic results, use a positive value of myseed */if (myseed > 0){

for (k=O; k<myseed; k++) drand480;}else{

srand48((unsigned)time(NULL));}/* The variables below will hold info on the present point in space.

feat is the variable number, direct is < or >=, thresh is cutoff */feat = (int **)malloc(maxq*sizeof(int *));direct = (int **)malloc(maxq*sizeof(int *));thresh = (float **)malloc(maxq*sizeof(float *));bfeat = (int **)malloc(maxq*sizeof(int *));bdirect = (int **)malloc(maxq*sizeof(int *));bthresh = (float **)malloc(maxq*sizeof(float *));

for (k=O; k<maxq; k++){

feat[k] = (int *)calloc(ktot,sizeof(int));direct[k] = (int *)calloc(ktot,sizeof(int));thresh[k] = (float *)calloc(ktot,sizeof(float));bfeat[k] = (int *)calloc(ktot,sizeof(int));bdirect[k] = (int *)calloc(ktot,sizeof(int));bthresh[k] = (float *)calloc(ktot,sizeof(float));

}getdatO; /* Read in the data with this call */


n = (int)(ndead+nalive);printf("\n\n %d records and %d variables detected. \n\n", n, p);printf("\n\n Done reading in data, now processing data.\n");printf(" (This may take a few minutes if your dataset is big.)\n\n");getvalO; /* Get the variable values */printf("\n\n Done processing data, beginning search.\n\n");printf(" The most recent questions will be stored in the file \"question.out\".\n");printf(" All absorption points will be stored in the file \"abs.out\".\n");printf(" The mutation numbers at successful mutations will be in \"mut.out\" .\n");printf(" (If these files already exist, the results will be appended to them.)\n");printf("\n\n BOBCAT will now search indefinitely. Hit Ctrl-C to stop it.\n\n");absfile = myopenf(absfile,abs_file);fprintf(absfile, "\n Absorption point in the learning dataset: \n\n");fclose(absfile);for (i=O;i<srchreps;i++){

bfit = runsrch(feat, direct, thresh, X, y, ndead, nalive);}

} /******************** END OF MAIN PROGRAM **********************/

double misclass(int **feat, int **direct, float **thresh, float **locX,int *locy, double locndead, double locnalive, int *locqk,int locktot)

/* * This function computes the misclassification error for any particular* point in space as defined by feat, direct, and thresh. It returns** (l - error) as a double.

{int checkin, i, j, k, insubset, locn, lktot,subsetsize, tdead;double fit;int *ssizes, *lqk;

locn = locndead + locnalive;lqk=locqk; lktot = locktot; subsetsize = 0; tdead = 0;

ssizes = (int *)malloc(lktot*sizeof(int *));

for (k=O; k<lktot; k++){

ssizes[k]=O;}

for (i=O; i<locn; i++)

***/


{checkin = 1;k=O;while (checkin) /* checkin keeps track of whether a case is chosen */{

while ((checkin) && (k<lktot)) /* Cycle through the k sets */{

insubset= 1;j = lqk[k];while (insubset && j){

J--;if (direct[j][k]){

if (locX[i] [feat[j] [k]] < thresh[j] [k]){

insubset--;}

}else{

if (locX[i] [feat[j][k]] >= thresh[j][k]){

insubset--;}

}}if (insubset) /* If the respondent is chosen, check for death */{

subsetsize++;if(locy[i]) tdead++;checkin--;ssizes[k]++;

}k++;

}if (checkin > 0) checkin--;

}}

fit = 1 - ((miscost*(locndead-(double)tdead))+( (double)subsetsize - (double)tdead)) /(miscost*locndead + locnalive);


for (k=O; k<lktot; k++){

if (ssizes[k]<minyart) fit = 0;}free( (void *) ssizes);return(fit);

}

double mutpoint(int **feat, int **direct, float **thresh, float **locX,int *locy, double locndead, double locnalive, double locfit)

/*** This function mutates a point repeatedly, searching for lower error.It returns I-error of the combination of questions with lowest error. ***/{

int replaced, new_feat, old_feat, new_direct, old_direct;int checkout, gen, i, k, j, cntmut;double best;float new_thresh, old_thresh;

best = locfit;checkout = 1;gen = 0; cntmut = 0;mutfile = myopenf(mutfile, mut_file);fprintf(mutfile, "\n\n Total # of mutations at Misclassification \nil);fprintf(mutfile, II successful mutations error \nil);fclose(mutfile);while((gen<gens) && checkout) /* Enter loop to do mutations */{

for (k=O; k<ktot; k++) /* Loop through the ktot question sets */{

/* Introduce mutation (save old question in case mutation is bad) */

cntmut++;replaced = floor(drand480 * qk[k]);if (replaced==qk[k]) replaced--;old_feat = feat[replaced][k];old_direct = direct[replaced][k];old_thresh = thresh[replaced][k];new_feat = floor(drand480 * p);if (new_feat==p) new_feat--;j = floor(drand480*nvals[new_feat]);if (j==nvals[new_feat]) j--;if (j < I) j++;


new_thresh = Uvals[new_feat][j];new_direct = floor(drand480 * 2);if (new_direct==2) new_direct--;feat[replaced][k] = new_feat;direct[replaced][k] = new_direct;thresh[replaced] [k] = new_thresh;/* Check the error rate */locfit = misclass(feat, direct, thresh, 10cX, locy, locndead,

locnalive,qk,ktot);

if (locfit <= best) /* If it isn't an improvement, get old quest */{

feat[replaced][k] = old_feat;direct[replaced][k] = old_direct;thresh[replaced][k] = old_thresh;

}else /* If it IS better, keep it, output "gen" to out2.dat */{

best = locfit;mutfile = myopenf(mutfile, mut_file);fprintf(mutfile,"%13d %1.8If\n", cntmut, I-best);fclose(mutfile);if (gen > checfreq) /* check for absorption */{

checkout = I - checkabs(feat,direct,thresh,best,locX,locy,locndead, locnalive);

}}

}if ((gen == checfreq) && checkout) /* check for absorption */{

checkout = 1 - checkabs(feat,direct,thresh,best,locX,locy,locndead,locnalive);

}gen++;

} /* Exit from mutation loop */absfile = myopenf(absfile,abs_file);fprintf(absfile, "\n Misclassification error: %If\n\n'', I-best);fclose(absfile);printquesa(feat, direct, thresh, qk, ktot);retum(best);

}


double runsrch(int **feat, int **direct, float **thresh, float **locX,int *Iocy, double locndead, double locnalive)

/* This function not only returns the fit of the best question set fromthe search, it also changes bfit, the questions, AND the best questions */

{int I, k, j, j 1;double lfit;

for (1=0; I < lmax; 1++) /* Enter loop to repeat the search lmax times */{

lfit = 0.0;while (1fit < 0.001) /* This loop generates a random point in space */{

for (k=O; k<ktot; k++){

for U=O; j<qk[k]; j++){

feat[j][k]= floor(drand480 * p);if (feat[j] [k]==p) feat[j] [k]--;direct[j][k]= floor(drand480 * 2);if (direct[j][k]==2) direct[j][k]--;j 1 = floor(drand480*nvals[feat[j][k]]);if U1==nvals[feat[j] [k]]) j 1--;if U1 < 1) j 1++;thresh[j][k] = Uvals[feat[j][k]][j 1];

}}lfit = misclass(feat, direct, thresh, 10cX, locy, locndead,

locnalive,qk,ktot);}

lfit = mutpoint(feat, direct, thresh, 10cX, locy, locndead, locnalive,lfit);

if (1fit > bfit){

bfit = lfit;for U=O;j<maxq;j++){

for (k=O;k<ktot;k++){

bfeat[j][k] = feat[j][k]; bdirect[j][k] = direct[j] [k];bthresh[j][k] = thresh[j][k];


}}

}}return(bfit);

}

int checkabs(int **feat, int **direct, float **thresh, double best,float **locX, int *locy, double locndead, double locnalive)

/* ** This function checks a point in space to see if it is an absorption* point by exhaustively replacing each question with all possible* questions and checking for a lower error. Notice that although it*** touches the point's parameters, it leaves them unchanged in the end.{

int k, j, i, 1, k1, checkout;int old_feat;int old_direct;float old_thresh;double fit;

checkout = 1;

k= 0;while((k<ktot) && checkout){

j = 0;while((j<qk[kD && checkout){

old_feat = feat[j] [k];old_direct = direct[j][k];old_thresh = thresh[j] [k];1= 0;while(l<p){

for(i=O; i<nvals[l]; i++){

for(kl =0; kl <2; kl ++){

feat[j][k] = 1;direct[j][k] = kl;thresh[j][k] = Uvals[l][i];

fit = misclass(feat, direct, thresh, 10cX, locy,

****/


locndead, locnalive, qk, ktot);if (fit> best){

checkout = 0;}

}}1++;

}feat[j] [k] = old_feat;thresh[j][k] = old_thresh;direct[j][k] = old_direct;j++;

}k++;

}retum(checkout);

}

void getdat(void) /******* This function reads in the data ***/{

int notdone, blnkint, i,j,k,np,n;FILE *ptryfile, *ptrxfile;double pcheck;float blnkflt;double outval;

if ((ptryfile=fopen(y_file,l rb"))==NULL){

printf("\n\nError: Could not open file %s\n\n", y_file);exit(-l);

}n=O;notdone = TRUE;while (notdone){

if(fscanf(ptryfile, "%d", &blnkint) != EOF){

n++;}else{

notdone = FALSE;}


}fclose(ptryfile);if ((ptrxfile=fopen(X_file,"rb"))==NULL){

printf("\n\nError: Could not open file %s\n\n", X_file);exit(-l);

}np=O;notdone = TRUE;while (notdone){

if(fscanf(ptrxfile, "%f', &blnkflt) != EOF){

np++;}else{

notdone = FALSE;}

}fclose(ptrxfile);pcheck = (double)np/(double)n;p = floor(pcheck);if ((pcheck - (double)p) > 0.0000000001){

printf("\n\nError: Your data files are not formatted properly. \nil);printf("\nMake sure that the data in the files y.dat and x.dat \nil);printf("have the same number of rows, and that every row of x.dat\n");printf("has the same number of variables. \n\n");exit(-l);

}if ((ptryfile=fopen(y_file,"rb"))==NULL){

printf("\n\nError: Could not open file %s\n\n", y_file);exit(-l);

}if ((ptrxfile=fopen(X_file,"rb"))==NULL){

printf("\n\nError: Could not open file %s\n\n", X_file);exit(-l);

}ndead = 0; nalive = 0;X = (float **)malloc(n*sizeof(float *));y = (int *)malloc(n*sizeof(int));


for (i=O; i<n; i++){

X[i] = (float *)malloc(p*sizeof(float));for O=O;j<p;j++){

fscanf(ptrxfile, "%f', &X[i][j));}fscanf(ptryfile, "%d", &y[i));if(y[i)) { ndead++; }

}nalive = (double)n - ndead;fc1ose(ptrxfile); fc1ose(ptryfile);

}

void getval(void) /******* This function gets the uniq variable values ****/{

int i,j,k;float *Tval;float lastval;

Tval = (float *)malloc(n*sizeof(float));nvals = (int *)malloc(p*sizeof(int));Uvals = (float **)malloc(p*sizeof(float *));

for 0=0; j<p; j++){

for (i=O; i<n; i++){

Tval[i] = X[i][j];}qsort((void *)Tval, n, sizeof(float), mycomp);lastval=Tval[O];nvals[j] = 1;for (i=l; i<n; i++){

if(Tval[i]>lastval){

nvals[j]++;lastval = Tval[i];

}}Uvals[j] = (float *)malloc(nvals[j]*sizeof(float));k= 0;Uvals[j][k] = Tval[k];


for (i=l; i<n; i++){

if(Tval[i]>Tval[i-1 ]){

k++;Uvals[j][k] = Tval[i];

}}

}free( (void *) Tval);

}

void copyques(int **tofeat, int **todirect, float **tothresh,int **frfeat, int **frdirect, float **frthresh)

{intj,k;

for G=O;j<maxq;j++){

for (k=O;k<ktot;k++){

tofeat[j] [k] = frfeat[j] [k]; todirect[j] [k] = frdirect[j] [k];tothresh[j] [k] = frthresh[j] [k];

}}

}

int fltcompare(float *i, float *j) /** Compares reals for qsortO call ****/{

int mycheck;float myi, myj;

mycheck= 1;myi = *i; myj = *j;

if (myi > (myj + 0.000001»{

mycheck = 0;return (1);

}if (myi < (myj - 0.000001»{

mycheck= 0;return (-1);


}if (mycheck) retum(O);

}

int indofmin(double *vect, int lengv)/* Returns smallest index of the "minimum" value in a vector of doubles*/{

double tmpval;int i, cnti;

tmpval = vect[O]; cnti = 0;

for (i=l ;i<lengv;i++){

if (vect[i]«tmpval-2*FLT_EPSILON)){

tmpval = vect[i];cnti = i;

}}retum(cnti);

}

int maxval(int *vect, int lengv)/* Returns the maximum value in a vector of ints*/{

int tmpval;int i;

tmpval = vect[O];for (i=1;i<lengv;i++){

if (vect[i]>tmpval){

tmpval = vect[i];}

}retum(tmpval);

}

int mycomp(const void *i, const void *j) /* This calls a function above */{ /* called fltcompare* /

return( fltcompare((float *) i, (float *) j)); /* for qsortO */}


void printquesa(int **feat, int **direct, float **thresh,int *locqk, int locktot)

{int i,j,k;

absfile = myopenf(absfile,abs_file);for(k=O; k<locktot; k++){

for (i=O; i<locqk[k]; i++){

fprintf(absfile,"Is variable %3d ", feat[i][k]+l);if (direct[i] [k]) fprintf(absfile,">= ");else fprintf(absfile,"< ");fprintf(absfile,"%3.4g ?\n", thresh[i] [k]);

}if(k<(locktot-l)) fprintf(absfile," OR \nil);

}fprintf(absfile, II \nil);fclose(absfile);

}

FILE *myopenf(FILE *locfile, char *loc_file)
{
    if ((locfile = fopen(loc_file, "a")) == NULL)
    {
        printf("\n\nError: Could not open file %s\n\n", loc_file);
        exit(-1);
    }
    return(locfile);
}


Appendix VI - Mortality Index for the Elderly

INSTRUCTIONS: The following index is designed for noninstitutionalized persons aged 65 and older. Enter the point value corresponding to each answer in the blank space, and add up the points for all 16 questions.

(1) Intercept: 145 pts.                                    (1)  145 pts.

(2) Age:

    65-69   (  0 pts.)
    70-74   ( 42 pts.)
    75-79   ( 83 pts.)
    80-85   (125 pts.)
    85+     (166 pts.)                                     (2) ____ pts.

(3) Sex:

    Female  (  0 pts.)
    Male    ( 91 pts.)                                     (3) ____ pts.

(4) Do you take any digitalis, Digoxin, Lanoxin, or Digitoxin pills now?

    Yes      (128 pts.)
    No       (  0 pts.)
    Missing  (  0 pts.)                                    (4) ____ pts.


(5) Weight at time of interview:

    Females                              Males
    weight in pounds    points           weight in pounds    points
    < 100               -148             < 120               -144
    100-104             -155             120-124             -149
    105-109             -163             125-129             -157
    110-114             -170             130-134             -165
    115-119             -178             135-139             -172
    120-124             -185             140-144             -180
    125-129             -193             145-149             -187
    130-134             -200             150-154             -195
    135-139             -208             155-159             -202
    140-144             -216             160-164             -210
    145-149             -223             165-169             -217
    150-154             -231             170-174             -225
    155-159             -238             175-179             -233
    160-164             -246             180-184             -240
    165-169             -253             185-189             -248
    170-174             -261             190-194             -255
    175-179             -269             195-199             -263
    180-184             -276             200-204             -270
    185-189             -284             205-209             -278
    190-194             -291             210-214             -286
    195-199             -299             215-219             -293
    200+                -303             220-224             -301
                                         225-229             -308
                                         230-234             -316
                                         235-239             -323
                                         240-244             -331
                                         245-249             -339
                                         250+                -342

    Missing = -217 pts.                                    (5) ____ pts.

(Note: Negative points should be subtracted when adding up the total.)


(6) Are you able to walk half a mile without help? That's about 8 ordinary blocks.

    Yes      (  0 pts.)
    No       ( 78 pts.)
    Missing  ( 78 pts.)                                    (6) ____ pts.

(7) What was (is) your mother's maiden name?

    Correct    (  0 pts.)
    Incorrect  ( 90 pts.)
    Refused    ( 90 pts.)
    Missing    ( 90 pts.)                                  (7) ____ pts.

(8) As compared to other people your own age, would you say that your health is excellent, good, fair, poor or very poor?

    Excellent    (  0 pts.)
    Good         ( 31 pts.)
    Fair         ( 63 pts.)
    Poor or bad  ( 94 pts.)
    Missing      ( 40 pts.)                                (8) ____ pts.

(9) Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from some person or any equipment or device to do the following things:

    Bathing, either a sponge bath, tub bath, or shower?

    No help
    Help
    Unable to do
    Missing

    Is (was) this help from a person, from special equipment, or both?

    Person             (100 pts.)
    Special equipment  (  0 pts.)
    Both               (  0 pts.)
    Missing            (  0 pts.)                          (9) ____ pts.


(10) What was your usual weight at age 50?

    Females:  # of pts. = subtract 3 from weight in lbs.
    Males:    # of pts. = subtract 26 from weight in lbs.
    Missing:  # of pts. = 141 pts.                         (10) ____ pts.

(11) Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from some person or any equipment or device to do the following things:

    Using the toilet?

    No help       (  0 pts.)
    Help          (104 pts.)
    Unable to do  (104 pts.)
    Missing       (104 pts.)                               (11) ____ pts.

(12) Do you smoke cigarettes now?

    Yes      ( 54 pts.)
    No       (  0 pts.)
    Missing  (  0 pts.)                                    (12) ____ pts.

(13) Has a doctor ever told you that you had diabetes, high blood sugar, or sugar in your urine?

    Yes
    Suspect
    No
    Missing

    Has a doctor ever told you to take insulin for this?

    Yes      ( 89 pts.)
    No       (  0 pts.)
    Missing  (  0 pts.)                                    (13) ____ pts.


(14) Has a doctor ever told you that you had any cancer, malignancy or malignant tumor of any type?

    Yes      ( 52 pts.)
    Suspect  ( 52 pts.)
    No       (  0 pts.)
    Missing  (  0 pts.)                                    (14) ____ pts.

(15) Has a doctor ever told you that you had a heart attack or coronary, or coronary thrombosis, or coronary occlusion or myocardial infarction?

    Yes
    Suspect
    No
    Missing

    Were you hospitalized overnight or longer for this?

    Yes      ( 58 pts.)
    No       (  0 pts.)
    Missing  (  0 pts.)                                    (15) ____ pts.

(16) How often do you have trouble with waking up during the night?

    Most of the time  (  0 pts.)
    Sometimes         ( 21 pts.)
    Rarely or never   ( 41 pts.)
    Missing           ( 23 pts.)                           (16) ____ pts.
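For clarity, the index total is simply the sum of the sixteen entries, with the negative weight points entered as negative numbers. A minimal illustrative sketch (not part of the original appendix):

/* Sums the point values entered for items (1) through (16); negative
   values, such as the weight points, reduce the total. */
double mortality_index(const double pts[16])
{
    double total = 0.0;
    int i;
    for (i = 0; i < 16; i++)
        total += pts[i];
    return total;
}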


Appendix VII - C code for the repeated random and exhaustive search algorithm (RRESA), combined with backward deletion to estimate error on a test set

/*********************************************************************
 * Program: BOBCAT - Boolean Operators with Binary splitting
 *          Classification Algorithm for Test set prediction
 *
 * Author:  Michael Anderson
 *
 * This program performs a modified version of the random search
 * algorithm, which incorporates exhaustive searching for faster
 * search times. It also uses a learning set/test set division
 * automatically, in conjunction with backward deletion, to estimate
 * prediction error. It then estimates a cost-complexity parameter
 * and recombines the full dataset to find a model which achieves
 * a low cost-complexity (see Breiman et al., 1984). The program
 * should be compiled with a command of the form
 * "cc -o bobcat bobcat.c -lm" from a UNIX environment. It should be
 * run in the same directory with two files: "y.dat" (containing a
 * list of 0's and 1's) and "x.dat" (containing a matrix of predictor
 * variables in columns for records in rows). The program will prompt
 * the user for information concerning the model structure.
 *********************************************************************/
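The pruning criterion the program minimizes can be written compactly as follows; this simply restates the quantity computed in prune() below, and the symbols $Q$, $|Q|$ and $\alpha$ are introduced here only for readability:

$$ R_{\alpha}(Q) \;=\; \widehat{\mathrm{err}}(Q) + \alpha\,|Q|, $$

where $\widehat{\mathrm{err}}(Q)$ is the cost-weighted misclassification error of a question set $Q$ on the evaluation data, $|Q|$ is the total number of questions in the set, and $\alpha$ is the cost-complexity parameter estimated from the learning set/test set split.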

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
#include <limits.h>
#include <time.h>

#define FALSE 0
#define TRUE !FALSE
#define MAXQK 9        /* Defines the max # of questions in a single subset */
#define DEFKTOT 10     /* Defines the total # of subsets in the model */
#define DEFQK 9        /* Defines the # of questions in each subset */
#define DEFCOST 1      /* Defines the cost of misclassifying a decedent (1) */
#define DEFFREQ 60000  /* # of mutations before checking for absorption */
#define DEFGENS INT_MAX

/* GLOBAL VARIABLES: */

double  tfrac;                 /* Fraction of N for test set respondents    */
double  ndead, nalive;         /* Number of 1s and 0s in the full-data Y    */
double  lndead, lnalive;       /* Number of 1s and 0s in the learning set Y */
double  tndead, tnalive;       /* Number of 1s and 0s in the test set Y     */
double  miscost;               /* Relative cost of misclass. dead as alive  */
int     maxq, p, ln, tn, ktot; /* Size of test set, number of question sets */
int     min_part = 5;          /* Smallest # of obs to allow in a subset    */
int     gens;                  /* Max # of mutations allowed in one search  */
int     checfreq;              /* How soon to start looking for absorption  */
int     lmax = 1;              /* Max # of cycles through search (abs. pts.)*/
int     myseed = 0;            /* Seed, only used if greater than 0         */
float   **fX, **tX, **lX;      /* Ptr to ptr to float: holds the X data     */
int     *fy, *ty, *ly;         /* Ptr to int: this holds the Y values       */
int     *qk;                   /* Ptr to int: holds the number of ?s in sets*/
float   **Uvals;               /* Ptr to ptr to float: holds variable values*/
int     *nvals;                /* Ptr to int: holds # of values per variable*/
int     **bfeat, **bdirect, **tbfeat, **tbdirect, **beslfeat, **besldirect;
int     **besffeat, **besfdirect, **feat, **direct;  /* These hold the model */
float   **bthresh, **tbthresh, **beslthresh, **besfthresh, **thresh;
double  bfit, alpha, lalpha;   /* Cost-complexity parameter                 */
int     splitdat = 1;
int     prunereps = 1000000;   /* Number of repetitions                     */
int     lastsub, lasttd, goodalpha;
int     splitinit = 0;         /* Seed; only used if greater than zero.     */
FILE    *absfile, *mutfile, *quesfile;   /* Ptrs to output files            */


/* These INPUT data files are required for the program to run: */

char *X_file = "x.dat";     /* File: n x p block of variables by col */
char *y_file = "y.dat";     /* File: n-vector of Y variables, 0 or 1 */

/* The program OUTPUT is stored in these files: */

char *abs_file = "abs.out";           /* File: output - the absorption points */
char *mut_file = "mut.out";           /* File: # of mutations at successes    */
char *ques_file = "question.out";     /* File: pruning output                 */

/* FUNCTION PROTOTYPES Note: some of these have major side-effects */

FILE *myopenf(FILE *locfile, char *loc_file);void welcome(void);void getdat(int splitdat);void getval(void);int mycomp(const void *i, const void *j);int fltcompare(float *i, float *j);double misclass(int **feat, int **direct, float **thresh, float **locX,

int *locy, double locndead, double locnalive, int *loqk,


int locktot);int checkabs(int **feat, int **direct, float **thresh, double best,

float **locX, int *locy, double locndead, double locnalive);double mutpoint(int **feat, int **direct, float **thresh, float **locX,

int *locy, double locndead, double locnalive, double locfit);double runsrch(int **feat, int **direct, float **thresh, float **locX,

int *locy, double locndead, double locnalive);int prune(int ldep, int **feat, int **direct, float **thresh, float **X,

int *y, double lndead, double lnalive, float **tX, int *ty,double tndead, double tnalive, int printstat);

int indofmin(double *vect, int lengv);int maxval(int *vect, int lengv);void printquesa(int **feat, int **direct, float **thresh,

int *locqk, int locktot);void printquesb(int **feat, int **direct, float **thresh,

int *locqk, int locktot);void copyques(int **tofeat, int **todirect, float **tothresh,

int **frfeat, int **frdirect, float **frthresh);void print2by2(double locnalive, double locndead, int locsubsetsize,

int loctdead, double miserr, int pstat, int minnvars);void printcost(double *lscores, double *cost_comp, double *nvars, int depth,

double alpha);void printscores(double *tscores, double *lscores, double *nvars,

double *cost_comp, int depth);void scramblei(int *vect, int lengv);void scramblef(float *vect, int lengv);

/********************* START MAIN PROGRAM *************************/

void main(argc, argv)int argc;char *argv[];

{int i,j, k, 1, kl, k2,jl, minind, pr, wasimprov, printstat;double lfit, tfit, cntdiff, best, bestlfit, bestffit;

miscost = DEFCOST; gens = DEFGENS; checfreq = DEFFREQ;ktot = DEFKTOT; maxq = MAXQK;qk = (int *)malloc(ktot*sizeof(int));for (i=O;i<ktot;i++){

qk[i] = DEFQK;}weicomeO; /* (When commented out, DONT'T) Prompt user for parameter values*/


/* If you want deterministic results, use a positive value ofmyseed */if (myseed > 0){

for (k=O; k<myseed; k++) drand480;}else{

srand48((unsigned)time(NULL»;}/* The variables below will hold info on the present point in space.

feat is the variable number, direct is < or >=, thresh is cutoff */feat = (int **)malloc(maxq*sizeof(int *»;direct = (int **)malloc(maxq*sizeof(int *»;thresh = (float **)malloc(maxq*sizeof(float *»;tbfeat = (int **)malloc(maxq*sizeof(int *»;tbdirect = (int **)malloc(maxq*sizeof(int *»;tbthresh = (float **)malloc(maxq*sizeof(float *»;bfeat = (int **)malloc(maxq*sizeof(int *»;bdirect = (int **)malloc(maxq*sizeof(int *»;bthresh = (float **)malloc(maxq*sizeof(float *»;beslfeat = (int **)malloc(maxq*sizeof(int *»;besldirect = (int **)malloc(maxq*sizeof(int *»;beslthresh = (float **)malloc(maxq*sizeof(float *»;besffeat = (int **)malloc(maxq*sizeof(int *»;besfdirect = (int **)malloc(maxq*sizeof(int *»;besfthresh = (float **)malloc(maxq*sizeof(float *»;

for (k=O; k<maxq; k++){

feat[k] = (int *)calloc(ktot,sizeof(int»;direct[k] = (int *)calloc(ktot,sizeof(int»;thresh[k] = (float *)calloc(ktot,sizeof(float»;bfeat[k] = (int *)calloc(ktot,sizeof(int»;bdirect[k] = (int *)calloc(ktot,sizeof(int»;bthresh[k] = (float *)calloc(ktot,sizeof(float»;tbfeat[k] = (int *)calloc(ktot,sizeof(int»;tbdirect[k] = (int *)calloc(ktot,sizeof(int»;tbthresh[k] = (float *)calloc(ktot,sizeof(float»;beslfeat[k] = (int *)calloc(ktot,sizeof(int»;besldirect[k] = (int *)calloc(ktot,sizeof(int»;beslthresh[k] = (float *)calloc(ktot,sizeof(float»;besffeat[k] = (int *)calloc(ktot,sizeof(int»;besfdirect[k] = (int *)calloc(ktot,sizeof(int»;


besfthresh[k] = (float *)calloc(ktot,sizeof(float));}tfrac = 0.3333; lastsub=O; lasttd=O;getdat(splitdat); /* Read in the data with this call */printf("\n\n %d records and %d variables detected. \n\n",

(int)(lndead+lnalive+tndead+tnalive), p);printf("\n\n Done reading in data, now processing data.\n");printf(" (This may take a few minutes if your dataset is big.)\n\n");getvalO; /* Get the variable values */printf("\n\n Done processing data, beginning search.\n\n");printf(" The most recent questions will be stored in the file \"question.out\".\n");printf(" All absorption points will be stored in the file \"abs.out\".\n");printf(" The mutation numbers at successful mutations will be in \"mut.out\" .\n");printf(" (If these files already exist, the results will be appended to them.)\n");printf("\n\n BOBCAT will now search indefinitely. Hit Ctrl-C to stop it.\n\n");bfit = 0;absfile = myopenf(absfile,abs_file);fprintf(absfile, "\n Absorption point in the learning dataset: \n\n");fclose(absfile);bfit = runsrch(feat, direct, thresh, IX, ly, lndead, lnalive);copyques(beslfeat,besldirect,beslthresh,bfeat,bdirect,bthresh);copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);bestlfit = bfit;minind = 0; alpha = 0; printstat = I; goodalpha = 1;minind = prune(minind, tbfeat, tbdirect, tbthresh, IX, ly, lndead,

lnalive, tX, ty, tndead, tnalive, printstat);if (goodalpha == 0){

copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);minind = 0; printstat = 1; alpha = 0;minind = prune(minind, tbfeat, tbdirect, tbthresh, IX, ly, lndead,

lnalive, tX, ty, tndead, tnalive, printstat);}lalpha = alpha;free ( (void *) Uvals); free ( (void *) nvals);getdat(O); getvalO;bfit = 0;absfile = myopenf(absfile, abs_file);fprintf(absfile," \n Absorption point in the full dataset: \n\n");fclose(absfile);bfit = runsrch(feat, direct, thresh, fX, fy, ndead, nalive);copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);copyques(besffeat,besfdirect,besfthresh,bfeat,bdirect,bthresh);bestffit = bfit;


minind = 0; printstat = 2; goodalpha=l;minind = prune(minind, tbfeat, tbdirect, tbthresh, fX, [Y, ndead,

nalive, fX, [Y, ndead, nalive, printstat);alpha = lalpha; printstat = 3; goodalpha = 1;copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);minind = prune(minind-l, tbfeat, tbdirect, tbthresh, fX, [Y, ndead,

nalive, fX, [Y, ndead, nalive, printstat);for(pr=2;pr<(prunereps+1);pr++){

bfit = 0; wasimprov=O; alpha = lalpha;myopenf(absfile,abs_file);fprintf(absfile, "\n Absorption point in the learning dataset: \n\n");fclose(absfile);bfit = runsrch(feat, direct, thresh, IX, ly, lndead, lnalive);if (bfit>bestlfit){

bestlfit = bfit; minind = 0; alpha = 0; wasimprov = 1; printstat=l;goodalpha= I;copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);minind = prune(minind, tbfeat, tbdirect, tbthresh, IX, ly, lndead,

lnalive, tX, ty, tndead, tnalive,printstat);if (goodalpha == 0){

copyques(tbfeat,tbdirect,tbthresh,bfeat,bdirect,bthresh);minind = 0; printstat = 1; alpha = 0;minind = prune(minind, tbfeat, tbdirect, tbthresh, IX, ly, lndead,

lnalive, tX, ty, tndead, tnalive, printstat);}lalpha = alpha;copyques(beslfeat,besldirect,beslthresh,tbfeat,tbdirect,tbthresh);

}bfit = 0;absfile = myopenf(absfile, abs_file);fprintf(absfile," \n Absorption point in the full dataset: \n\n");fclose(absfile);bfit = runsrch(feat, direct, thresh, fX, [Y, ndead, nalive);if (bfit>bestffit){

copyques(besffeat,besfdirect,besfthresh,bfeat,bdirect,bthresh);wasimprov = 1; bestffit = bfit;

}if (wasimprov){

copyques(tbfeat,tbdirect,tbthresh,besffeat,besfdirect,besfthresh);


minind = 0; printstat = 2; goodalpha = 1;minind = prune(minind, tbfeat, tbdirect, tbthresh, fX, fy, ndead,

nalive, fX, fy, ndead, nalive,printstat);alpha = lalpha; printstat = 3;copyques(tbfeat,tbdirect,tbthresh,besffeat,besfdirect,besfthresh);goodalpha = 1;minind = prune((minind-1), tbfeat, tbdirect, tbthresh, fX, fy,

ndead, nalive, fX, fy, ndead, nalive,printstat);}

}} /******************** END OF MAIN PROGRAM **********************/

double misclass(int **feat, int **direct, float **thresh, float **locX,int *locy, double locndead, double locnalive, int *locqk,int locktot)

/* This function computes the misclassification error for any particular
   point in space as defined by feat, direct, and thresh. It returns
   (1 - error) as a double. */
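/* (Descriptive note: the quantity computed below is the cost-weighted
    misclassification rate,

        error = [ miscost*(locndead - tdead) + (subsetsize - tdead) ]
                / ( miscost*locndead + locnalive ),

    where subsetsize is the number of respondents selected by the question
    sets and tdead is the number of true 1's among them; the function
    returns 1 - error. If any single subset selects fewer than min_part
    respondents, the fit is forced to 0.) */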

{int checkin, i, j, k, insubset, locn, lktot,subsetsize, tdead;double fit;int *ssizes, *lqk;

locn = locndead + locnalive; lastsub=O; lasttd=O;lqk=locqk; lktot = locktot; subsetsize = 0; tdead = 0;

ssizes = (int *)malloc(1ktot*sizeof(int *));

for (k=O; k<lktot; k++){

ssizes[k]=O;}

for (i=O; i<locn; i++){

checkin = 1;k=O;while (checkin) /* checkin keeps track of whether a case is chosen */{

while ((checkin) && (k<lktot)) /* Cycle through the k sets */{

insubset= 1;



j = lqk[k];while (insubset && j){

J--;if (directO] [k]){

if (locX[i][feat[j][k]] < thresh[j][k]){

insubset--;}

}else{

if (locX[i] [featO][k]] >= thresh[j][k]){

insubset--;}

}}if (insubset) /* If the respondent is chosen, check for death */{

subsetsize++;if(locy[i]) tdead++;checkin--;ssizes[k]++;

}k++;

}if (checkin > 0) checkin--;

}}

fit = 1 - ((miscost*(locndead-(double)tdead»+( (double)subsetsize - (double)tdead» /(miscost*locndead + locnalive);

lastsub = subsetsize; lasttd = tdead;

for (k=O; k<lktot; k++){

if (ssizes[k]<minyart) fit = 0;}free( (void *) ssizes);retum(fit);

}


double mutpoint(int **feat, int **direct, float **thresh, float **locX,int *locy, double locndead, double locnalive, double locfit)

/* This function mutates a point repeatedly, searching for lower error.
   It returns 1 - error of the combination of questions with lowest error. */
{

int replaced, new_feat, old_feat, new_direct, old_direct;int checkout, gen, i, k, j, cntmut;double best;float new_thresh, old_thresh;

best = locfit;checkout = 1;gen = 0; cntmut = 0;mutfile = myopenf(mutfile, mut_file);fprintf(mutfile, "\n\n Total # of mutations at Misclassification \nIl);fprintf(mutfile," successful mutations error \n'l);fclose(mutfile);while((gen<gens) && checkout) /* Enter loop to do mutations */{

for (k=O; k<ktot; k++) /* Loop through the ktot question sets */{

/* Introduce mutation (save old question in case mutation is bad) */

cntmut++;replaced = floor(drand480 * qk[k]);if (replaced==qk[k]) replaced--;old_feat = feat[replaced][k];old_direct = direct[replaced][k];old_thresh = thresh[replaced][k];new_feat = floor(drand480 * p);if (new_feat==p) new_feat--;j = floor(drand48()*nvals[new_feat]);if (j==nvals[new_feat]) j--;if (j < 1) j++;new_thresh = Uvals[new_feat][j];new_direct = floor(drand48() * 2);if (new_direct==2) new_direct--;feat[replaced] [k] = new_feat;direct[replaced] [k] = new_direct;thresh[replaced][k] = new_thresh;/* Check the error rate */locfit = misclass(feat, direct, thresh, 10cX, locy, locndead,


locnalive,qk,ktot);

if (locfit <= best) /* If it isn't an improvement, get old quest */{

feat[replaced][k] = old_feat;direct[replaced][k] = old_direct;thresh[replaced] [k] = old_thresh;

}else /* If it IS better, keep it, output "gen" to out2.dat */{

best = locfit;mutfile = myopenf(mutfile, mut_file);fprintf(mutfile,"%13d %1.8If\n", cntmut, I-best);fclose(mutfile);if (gen > checfreq) /* check for absorption */{

checkout = I - checkabs(feat,direct,thresh,best,locX,locy,locndead, locnalive);

}}

}while ((gen> checfreq) && checkout) /* check for absorption */{

checkout = I - checkabs(feat,direct,thresh,best,locX,locy,locndead,locnalive);

if (checkout){best = misclass(feat, direct, thresh, 10cX, locy, locndead,

locnalive,qk,ktot);}

}gen++;

} /* Exit from mutation loop */best = misclass(feat, direct, thresh, 10cX, locy, locndead,

locnalive,qk,ktot);absfile = myopenf(absfile,abs_file);fprintf(absfile, "\n Misclassification error: %If\n\n'', I-best);fclose(absfile);printquesa(feat, direct, thresh, qk, ktot);return(best);

}

double runsrch(int **feat, int **direct, float **thresh, float **locX,


int *Iocy, double locndead, double locnalive)

/* This function not only returns the fit of the best question set from
   the search, it also changes bfit, the questions, AND the best questions */

{int I, k, j, j I ;double lfit;

for (1=0; 1< Imax; 1++) /* Enter loop to repeat the search Imax times */{

lfit = 0.0;while (1fit < 0.001) /* This loop generates a random point in space */{

for (k=O; k<ktot; k++){

for 0=0; j<qk[k]; j++){

feat[j][k]= floor(drand48() * p);if (feat[j] [k]==p) feat[j] [k]--;direct[j][k]= floor(drand48() * 2);if (direct[j][k]==2) direct[j] [k]--;j I = floor(drand48()*nvals[feat[j][k]]);if 0I==nvals[feat[j][k]]) j 1--;if 01 < l)jl++;thresh[j][k] = Uvals[feat[j] [k]] [j 1];

}}lfit = misclass(feat, direct, thresh, 10cX, locy, locndead,

locnalive,qk,ktot);}

lfit = mutpoint(feat, direct, thresh, 10cX, locy, locndead, locnalive,lfit);

if (1fit > bfit){

bfit = lfit;for O=O;j<maxq;j++){

for (k=O;k<ktot;k++){

bfeat[j][k] = feat[j][k]; bdirect[j][k] = direct[j][k];bthresh[j][k] = thresh[j][k];

}


}}

}return(bfit);

}

int prune(int ldep, int **feat, int **direct, float **thresh, float **IX,int *ly, double lndead, double lnalive, float **tX, int *ty,double tndead, double tnalive, int printstat)

/* This function performs backward deletion on a model */

{int i,j ,k,l,k1,j l,j2, checkin, checktot, old_direct, old_feat, insubset;int *locqk, *ssizes, *nchosen, *ntrued;double *lscores, *tscores, *nvars, *cost_comp;int replaced, new_feat, new_direct, subsetsize, depth, minind;float new_thresh, old_thresh;double fit, oldfit, newfit;int tsubset, tdead, ktemp, qtemp, kmin, qmin, kdrop, tktot, locn;

depth = floor( (double)ktot * (double)maxq );

lscores = (double *)calloc((depth+2),sizeof(double));tscores = (double *)calloc((depth+2),sizeof(double));nvars = (double *)calloc((depth+2),sizeof(double));cost_comp = (double *)calloc((depth+2),sizeof(double));nchosen = (int *)calloc((depth+2),sizeof(int));ntrued = (int *)calloc((depth+2),sizeof(int));

for(i=O;i« depth+2);i++){

tscores[i] = 1 - ((miscost*(double)tndead)/(miscost*(double)tndead + tnalive));

lscores[i] = 1 - ((miscost*(double)lndead)/(miscost*(double)lndead + lnalive));

cost_comp[i] = I-tscores[i];}

locqk = (int *)malloc(ktot*sizeof(int));

for (i=O;i<ktot;i++) {locqk[i] = qk[i];} tktot = ktot;

ktemp = -1; qtemp = -1; locn = lndead+lnalive;



1* NOW ENTER INTO BACKWARD DELETION *1I = depth-I;Iscores[1+2] = misclass(feat, direct, thresh, IX, ly,lndead,lnalive,locqk,

tktot);tscores[1+2] = misclass(feat, direct, thresh, tX, ty,tndead,tnalive,locqk,

tktot);nchosen[I+2]=lastsub; ntrued[1+2] = lasttd;for (k=O;k<tktot;k++) {nvars[I+2] = nvars[1+2] + locqk[k];}cost_comp[I+2] = (l-tscores[I+2]) + alpha*nvars[1+2];checktot = 1;

while ((l>=ldep) && (checktot)){

ktemp=O; qtemp=O; kdrop = -1 ;oldfit = 0.0;

1* ENTER INTO LOOP THROUGH QUESTIONS, DROPPING ONE AT A TIME*I

for (k1 =0; k1 <tktot; k1 ++){

locqk[k1]--;

forO 1=0; j 1<(locqk[k1]+1); j 1++){

old_feat = feat[j l][k1];old_thresh = thresh[j 1][k1];old_direct = direct[j 1] [k1];feat[j 1][k1] = feat[locqk[k1]] [k1];thresh[j 1] [k1] = thresh[locqk[k1]] [k1];direct[j 1][k1] = direct[locqk[k1 ]][k1];

subsetsize = 0; tdead= 0;

newfit = misclass(feat, direct, thresh, IX, ly, lndead,lnalive,locqk,tktot);

if (newfit>oldfit){

qtemp = j1; ktemp = k1;oldfit = newfit;

}

feat[j1][k1] = old_feat;thresh[jI][k1] = old_thresh;


direct[j 1] [k1] = old_direct;}locqk[k1]++;

}

for (k1=0; k1<tktot; k1++) /* NOW TRY DROPPING EACH PARTITION */{

subsetsize = 0; tdead = 0;for (i=O; i<locn; i++){

checkin = 1;k=O;while (checkin){

while ((checkin) && (k<tktot)){

insubset= 1;j = locqk[k];

if (k!=k1){

while (insubset && j){

J--;if (direct[j] [kJ){

if (lX[i][feat[j][k]] < thresh[j][k]){

insubset--;}

}else{

if (lX[i][feat[j] [k]] >= thresh[j] [k]){

insubset--;}

}}if (insubset){

subsetsize++;if(ly[iJ) tdead++;checkin--;


}}k++;

}if (checkin > 0) checkin--;

}}newfit = 1 - ((miscost*(lndead-(double)tdead»+

( (double)subsetsize - (double)tdead» /(miscost*lndead + Inalive) ;

if (newfit>oldfit){

oldfit = newfit;kdrop = kl;

}}if ((I < minind) && (goodalpha==O»{

kdrop = -1;}if (kdrop < 0){

feat[qtemp] [ktemp] = feat[locqk[ktemp]-1] [ktemp];thresh[qtemp] [ktemp] = thresh[locqk[ktemp]-1] [ktemp];direct[qtemp][ktemp] = direct[locqk[ktemp]-1 ][ktemp];locqk[ktemp]--;if (locqk[ktemp]<1) kdrop = ktemp;

}if (kdrop >=0){

locqk[kdrop]=locqk[tktot-1];for (j2=0; j2<locqk[kdrop]; j2++){

feat02][kdrop] = feat 02] [tktot-1];thresh02][kdrop] = thresh02][tktot-1];direct02][kdrop] = direct02] [tktot-1];

}tktot--;

}if (tktot<l) checktot--;1--',if (l >=0 ){


lscores[1+2]=misclass(feat, direct, thresh, IX, ly, Indead,lnalive,locqk,tktot);

tscores[l+2]=misclass(feat, direct, thresh, tX, ty, tndead,tnalive,locqk,tktot);

nchosen[l+2]=lastsub; ntrued[l+2] = lasttd;for (k=O;k<tktot;k++) {nvars[l+2] = nvars[l+2] + locqk[k];}cost_comp[l+2] = (l-tscores[1+2D + alpha*nvars[1+2];

}}minind = indofmin(cost_comp,depth+2);if (minind == 0){

alpha = DBL_MAX;minind++;

}else{

alpha = 0.97*((l-lscores[minind-lD-(l-lscores[minindD);}goodalpha = 1;if (printstat ==3){

nvars[minind-l] = nvars[minind]-l;}if ((nvars[minind]-nvars[minind-l D > 1){

goodalpha=O;}if (alpha < 0) { alpha = 0; }if ((printstat == l) && (goodalpha)){

printscores(tscores, lscores, nvars, cost_comp, depth+2);print2by2(tnalive, tndead, nchosen[minind], ntrued[minind],

I-tscores[minind],printstat,nvars[minindD;}else if (printstat == 2){

printcost(lscores, cost_comp, nvars, depth+2, lalpha);print2by2(lnalive, lndead, nchosen[minind], ntrued[minind],

l-lscores[minind] ,printstat,nvars[minindD;}else if (printstat == 3){

printquesb(feat,direct,thresh, locqk, tktot);


}free«void *) lscores); free«void *) ntrued); free«void *) nchosen);free«void *) tscores); free«void *) nvars); free«void *) cost_comp);retum(minind);

}

int checkabs(int **feat, int **direct, float **thresh, double best,float **locX, int *locy, double locndead, double locnalive)

/* This function checks a point in space to see if it is an absorption
   point by exhaustively replacing each question with all possible
   questions and checking for a lower error. Notice that although it
   touches the point's parameters, it leaves them unchanged in the end. */
{

int k, j, i, 1, k1, checkout;int old_feat;int old_direct;float old_thresh;double fit;int *tmpp;float *Tval;

tmpp = (int *)calloc(p,sizeof(int));for(i=O;i<p;i++) tmpp[i] = i;

checkout = 1;

k=O;while«k<ktot) && checkout){

j = 0;while(G<qk[kJ) && checkout){

old_feat = feat[j][k];old_direct = directU][k];old_thresh = threshU][k];1= 0;scramblei(tmpp,p);while((l<p) && checkout){

Tval = (float *)calloc(nvals[tmpp[l]],sizeof(float));for(i=O; i<nvals[tmpp[l]]; i++) Tval[i] = Uvals[tmpp[1]] [i];scramblef(Tval,nvals[tmpp[l]J);i=O;



while((i<nvals[tmpp[l]]) && checkout){

k1=0;while((k1 <2) && checkout){

feat[j][k] = tmpp[l];direct[jUk] = k1;thresh[jUk] = Tval[i];

fit = misclass(feat, direct, thresh, locX, locy,locndead, locnalive, qk, ktot);

if (fit> best){

checkout = 0;}k1++;

}i++;

}free((void *)Tval);1++;

}if (checkout){

feat[j][k] = old_feat;thresh[j][k] = old_thresh;direct[j] [k] = old_direct;

}j++;

}k++;

}return(checkout);

}

void getdat(int splitdat) /******* This function reads in the data ***/{

int notdone, blnkint, i,j ,k,np,n;FILE *ptryfile, *ptrxfile;double pcheck;float blnkflt;int lent, tcnt;double outval;


if ((ptryfile=fopen(y_file,"rb"))==NULL){

printf("\n\nError: Could not open file %s\n\n", y_file);exit(-l);

}n= 0;notdone = TRUE;while (notdone){

if(fscanf(ptryfile, "%d", &blnkint) != EOF){

n++;}else{

notdone = FALSE;}

}fclose(ptryfile);if ((ptrxfile=fopen(X_file, "rb"))==NULL){

printf("\n\nError: Could not open file %s\n\n", X_file);exit(-l);

}np=O;notdone = TRUE;while (notdone){

if(fscanf(ptrxfile, "%f', &blnkflt) != EOF){

np++;}else{

notdone = FALSE;}

}fclose(ptrxfile);pcheck = (double)np/(double)n;p = floor(pcheck);if ((pcheck - (double)p) > 0.0000000001){

printf("\n\nError: Your data files are not formatted properly. \n");printf("\nMake sure that the data in the files y.dat and x.dat \n");


printf("have the same number of rows, and that every row of x.dat\n");printf("has the same number of variables. \n\n");exit(-I);

}tn = floor( tfrac * (double)n); In = n - tn;if ((ptryfile=fopen(y_file,"rb"))==NULL){

printf("\n\nError: Could not open file %s\n\n", y_file);exit(-I);

}if ((ptrxfile=fopen(X_file,"rb"))==NULL){

printf("\n\nError: Could not open file %s\n\n", X_file);exit(-I);

}Icnt=O; tcnt=O;if (splitdat){

tndead=O; tnalive=O; lndead = 0; lnalive = 0;tX = (float **)maBoc(tn*sizeof(float *));ty = (int *)maBoc(tn*sizeof(int));for (i=O; i<tn; i++){

tX[i] = (float *)maBoc(p*sizeof(float));}

IX = (float **)maBoc(ln*sizeof(float *));ly = (int *)maBoc(ln*sizeof(int));for (i=O; i<ln; i++){

IX[i] = (float *)maBoc(p*sizeof(float));}i=O; srand48(splitinit);while (i<n){

outval = drand480;if ((outval < tfrac) && (tcnt < tn)){

fscanf(ptryfile, "%d", &ty[tcnt));tndead = tndead + (double)ty[tcnt];for (j=O;j<p;j++){

fscanf(ptrxfile, "%£1', &tX[tcnt]U));}


tent++; i++;}else if ((outval >= tfrae) && (lent < In)){

fseanf(ptryfile, "%d", &ly[1cnt));lndead = lndead + (double)ly[1cnt];for O=O;j<p;j++){

fseanf(ptrxfile, "%f', &IX[lent][j));}1cnt++; i++;

}else if ( (tent==tn) && (lent==ln) ){

i++;}

}lnalive = (double)In - lndead; tnalive = (double)tn - tndead;

}else{

ndead = 0; nalive = 0;fX = (float **)malloe(n*sizeof(float *));fy = (int *)malloe(n*sizeof(int));for (i=O; i<n; i++){

fX[i] = (float *)malloe(p*sizeof(float));for 0=0; j<p; j++){

fseanf(ptrxfile, "%f', &fX[i][j));}fseanf(ptryfile, "%d", &fy[i]);if(fy[i]) { ndead++; }

}nalive = (double)n - ndead;

}fclose(ptrxfile); fclose(ptryfile);if (myseed){

srand48(myseed+ I);}else{

srand48((unsigned)time(NULL));


}}

void getval(void) /******* This function gets the unique variable values ****/
{

int i,j,k;float *Tval;float lastval;

Tval = (float *)malloc(ln*sizeof(float));nvals = (int *)malloc(p*sizeof(int));Uvals = (float **)malloc(p*sizeof(float *));

for 0=0; j<p; j++){

for (i=O; i<ln; i++){

Tval[i] = IX[i][j];}qsort((void *)Tval, In, sizeof(float), mycomp);lastval=Tval[O] ;nvals[j] = I;for (i=l; i<ln; i++){

if(Tval[i]>lastval){

nvals[j]++;lastval = Tval[i];

}}Uvals[j] = (float *)maUoc(nvals[j]*sizeof(float));k=O;Uvals[j][k] = Tval[k];for (i=l; i<ln; i++){

if(Tval[i]>Tval[i-l]){

k++;Uvals[j][k] = Tval[i];

}}

}free( (void *) Tval);

}


void copyques(int **tofeat, int **todirect, float **tothresh,int **frfeat, int **frdirect, float **frthresh)

{intj,k;

for G=O;j<maxq;j++){

for (k=O;k<ktot;k++){

tofeat[j][k] = frfeat[j] [k]; todirect[j][k] = frdirect[j][k];tothresh[j] [k] = frthresh[j] [k];

}}

}

int fltcompare(float *i, float *j) /** Compares reals for qsortO call ****/{

int mycheck;float myi, myj;

mycheck= 1;myi = *i; myj = *j;

if (myi > (myj + 0.000001)){

mycheck = 0;return (1);

}if (myi < (myj - 0.000001)){

mycheck = 0;return (-1);

}if (mycheck) return(O);

}

int indofmin(double *vect, int lengv)/* Returns smallest index of the "minimum" value in a vector of doubles */{

double tmpval;int i, cnti;

tmpval = vect[O]; cnti = 0;


for (i=l;i<lengv;i++){

if (vect[i]«tmpval-2*FLT_EPSILON)){

tmpval = vect[i];cnti = i;

}}return(cnti);

}

int maxval(int *vect, int lengv)/* Returns the maximum value in a vector of ints*/{

int tmpval;int i;

tmpval = vect[O];for (i=l ;i<lengv;i++){

if (vect[i]>tmpval){

tmpval = vect[i];}

}return(tmpval);

}

int mycomp(const void *i, const void *j) /* This calls a function above */{ /* called fltcompare*/

return( fltcompare((float *) i, (float *) j)); /* for qsortO */}

void printquesa(int **feat, int **direct, float **thresh,int *locqk, int locktot)

{int i,j,k;

absfile = myopenf(absfile,abs_file);for(k=O; k<locktot; k++){

for (i=O; i<locqk[k]; i++){

fprintf(absfile,"Is variable %3d ", feat[i] [k]+I);


if (direet[i] [k)) fprintf(absfile,">= ");else fprintf(absfile,"< ");fprintf(absfile,"%3.4g ?\n", thresh[iHk));

}if(k<(loektot-l)) fprintf(absfile," OR \n");

}fprintf(absfile," \n");fclose(absfile);

}

void printquesb(int **feat, int **direet, float **thresh,int *loeqk, int loektot)

{int i,j,k;

quesfile = myopenf(quesfile,ques_file);if (loektot > 0){

fprintf(quesfile, "\n The latest best question set using the full dataset is: \n\n");}else{

fprintf(quesfile, "\n The latest best question set using the full dataset has °questions.\n\n");

}for(k=O; k<loektot; k++){

for (i=O; i<loeqk[k]; i++){

fprintf(quesfile,"Is variable %3d ", feat[i] [k]+1);if (direet[i] [k)) fprintf(quesfile,">= ");else fprintf(quesfile,"< ");fprintf(quesfile,"%3.4g ?\n", thresh[iHk));

}if(k<(loektot-l)) fprintf(quesfile," OR \n");

}fprintf(quesfile," \n");fclose(quesfile);

}

void print2by2(double locnalive, double locndead, int locsubsetsize,
               int loctdead, double miserr, int pstat, int minnvars)
{
    int cnt00, cnt01, cnt10, cnt11, row0, row1, col0, col1, totn;

    quesfile = myopenf(quesfile, ques_file);
    fprintf(quesfile, "\nA set of %d questions was applied to the", minnvars);
    if (pstat == 1)
    {
        fprintf(quesfile, " test dataset:\n");
    }
    else if (pstat == 2)
    {
        fprintf(quesfile, " full dataset:\n");
    }
    row0 = (int)locnalive; row1 = (int)locndead; col1 = locsubsetsize;
    cnt11 = loctdead;
    totn = row0 + row1; col0 = totn - col1; cnt01 = col1 - cnt11;
    cnt10 = row1 - cnt11; cnt00 = row0 - cnt01;

    fprintf(quesfile, "\n\n");
    fprintf(quesfile, "                        predicted outcome\n");
    fprintf(quesfile, "                           0        1\n\n");
    fprintf(quesfile, "  true      0         %6d   %6d   %6d \n", cnt00, cnt01, row0);
    fprintf(quesfile, "  outcome\n");
    fprintf(quesfile, "            1         %6d   %6d   %6d \n\n", cnt10, cnt11, row1);
    fprintf(quesfile, "                      %6d   %6d   %6d \n", col0, col1, totn);
    fprintf(quesfile, "\n Misclassification error = %lf\n\n\n", miserr);
    fclose(quesfile);

}

void printcost(double *lscores, double *cost_comp, double *nvars, int depth,double alpha)

{int i;

quesfile = myopenf(quesfile,ques_file);fprintf(quesfile,"\n\n The estimated cost-complexity parameter was %1.7Ig.\n",

alpha);fprintf(quesfile,"\n # of questions misclass. error cost complexity\n\n");i = depth-I;while (i>=O){

fprintf(quesfile," %3d %1.71f % 1.719\n",


(int)nvars[i], I-Iscores[i], cost_comp[i]);if (nvars[i]==O) i = 0;1--',

}fprintf(quesfile, "\n");fclose(quesfile);

}

void printscores(double *tscores, double *lscores, double *nvars,double *cost_comp, int depth)

{int i,i2;

quesfile = myopenf(quesfile,ques_file);fprintf(quesfile,"\n # of questions learning set error test set error\n\n");i = depth-I;while ((i>=O)&& i2){

fprintf(quesfile," %3d %If %If\n'' ,(int)nvars[i], I-lscores[i], I-tscores[i]);

if (nvars[i]==O) i = 0;1--',

}fprintf(quesfile, "\n");fclose(quesfile);

}

void scramblei(int *vect, int lengv){

int templeng,i,tempind,tempval;

templeng = lengv;for (i=O; i<lengv; i++){

tempind = floor(drand480*templeng);if (templeng == tempind) tempind--;tempval = vect[tempind];vect[tempind] = vect[templeng-I];vect[templeng-I] = tempval;templeng--;

}}


void scramblef(float *vect, int lengv){

float tempval;int templeng,i,tempind;

templeng = lengv;for (i=O; i<lengv; i++){

tempind = floor(drand480*templeng);if (templeng == tempind) tempind--;tempval = vect[tempind];vect[tempind] = vect[templeng-l];vect[templeng-l] = tempval;templeng--;

}}

void welcome(void){

int resnotok, chkcnt, resp 1, junkint, k, goback, tmpint;char cresp1, junkchar;double tmpflt;

printf("\n\n");printf("

?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?11?&?&?&?\n");printf(" & &\n");printf(" ? Welcome to BOBCAT! ?\n");printf(" & Boolean Operations on Binary splits Classification &\n");printf("? Algorithm for Test-set prediction ?\n");printf(" & &\n");printf("? Author: Michael Anderson Copyright 1997 ?\n");printf(" & &\n");printf("

?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?II?&?&?&?\n");resnotok=1;while(resnotok>O){

resnotok=1 ;printf("\n\n");printf(" MAIN MENU\n\n");printf(" 1. Search for a new set of questions\n");


printf(" 2. Analyze a previously constructed set of questions\n");printf(" 3. Exit\n\n");printf(" Enter a number from 1 to 3 (default = 1): ");cresp1 = getcharO;resp1 = (int) cresp1; junkint = resp1;while Gunkint != 10){

junkchar = getcharO; junkint = (int) junkchar;}if (resnotok ==1){

switch(respl){

case 49: resnotok=O;break;

case 50: printf("\n\n\n\n");printf(" Sorry, that option is not supported yet.\n");printf(" (BOBCAT is still under construction!)\n");printf("\n Please enter another number.\n");resnotok=1;break;

case 51: printf("\n\n\n Thank you for using BOBCAT.\n\n\n");resnotok=O; exit(O);break;

case 10: resnotok=O;break;

default: printf("\n Enter 1, 2, or 3. \n"); resnotok= I ;break;

}}

}goback = 1;while (goback>O){

resnotok=1;while(resnotok>O){

printf("\n\n Enter the number of question subsets to be");printf(" combined \n");printf(" with \"OR\" (maximum = %d, default = %d): ",MAXQK,ktot);cresp1 = getcharO;resp1 = (int) cresp1; junkint = resp1;while Gunkint != 10){


junkchar = getcharO; junkint = (int) junkchar;}if ( ((respl < 49) II (respl > (48+MAXQK))) && (respl !=10)){

printf("\n Enter a number no lower than 1, no greater than %d.",MAXQK);

resnotok = 1;}else if(respl=10){

ktot = DEFKTOT; resnotok=O;}else{

resp 1 = resp 1 - 48; ktot = resp 1;. resnotok = 0;}

}free( (void *) qk);qk = (int *)malloc(ktot*sizeof(int));for (k=O;k<ktot;k++){

qk[k] = DEFQK;resnotok=1;while(resnotok>O){

printf("\n\n Enter the number of questions in subset %d", k+1);printf(" to be combined \n");printf(" with \"AND\" (maximum = %d, default = %d): ",MAXQK, qk[k]);crespI = getcharO;resp 1 = (int) cresp1; junkint = resp1;while Uunkint != 10){

junkchar = getcharO; junkint = (int)junkchar;}if( ((respl < 49) II (respl > (MAXQK + 48))) && (respl != 10)){

printf("\n Enter a number no lower than 1, no greater than %d.\n",MAXQK);

resnotok = 1;}else if(~espl == 10){


qk[k] = DEFQK; resnotok = 0;}else{

respl = respl - 48; qk[k] = respl;resnotok = 0;

}}

}resnotok=1;while(resnotok>O){

printf("\n\n Enter the cost of misclassifying a true \" 1\" as a");printf(" \"0\" relative to \nil);printf(" misclassifying a true \"0\" as a \" 1\" (type \" 1\" for the\n");printf(" default, unit cost): ");chkcnt = scanf("%lf', &tmpflt);crespI = getcharO;resp1 = (int) cresp 1; junkint = resp1;while Gunkint != 10){

junkchar = getcharO; junkint = (int)junkchar;}if ((tmpflt < 0) II (chkcnt!= 1)){

printf("\n Enter a nonnegative, real-valued number.\n");}else{

miscost = tmpflt; resnotok = 0;}

}resnotok=1;printf("\n\n The following option only affects the speed ofthe search.\n\n");printf(" Enter the number of mutations-per-subset to allow before checking\n");printf(" whether the present set of questions is an absorption point. \nil);printf(" This number may be as large as 100,000 if you dataset is big\n");printf(" (e.g., more than 10,000 records and more than 150 variables), ar\n");printf(" as small as 1,000 if your dataset is small (e.g., less than 1,000\n");printf(" records and less than 40 variables). You may wish to experiment\n");printf(" to find the optimal value. (Start with 10,000 if you are unsure).\n");while(resnotok>O){

printf(" Enter a positive integer (omit commas): ");


chkcnt = scanf("%d", &tmpint);crespI = getcharO;resp 1 = (int) cresp1; junkint = resp 1;while Gunkint != 10){

junkchar = getcharO; junkint = (int)junkchar;}if ((tmpint < 1) II (chkcnt!= 1)){

resnotok = 1; printf("\n ");}else{

checfreq = tmpint; resnotok = 0;}

}printf(" \n\n\n");printf(" You have instructed BOBCAT to search for %d subsets of questions\n", ktot);printf(" to be combined with \IOR\I.\n\n");for (k=O;k<ktot;k++){

printf(" Subset %d will consist of%d questions combined with \IAND\I.\n", k+1,qk[k]);

}printf("\n The relative cost of misclassification will be % l.2lf.\nil,

miscost);printf("\n %d mutations-per-subset will be allowed before checking for absorption.\n",

checfreq);resnotok=1;while(resnotok>O){

printf("\n Are these instructions correct? (y or n, default = y): ");crespI = getcharO;resp 1 = (int) cresp1; junkint = resp 1;while Gunkint != 10){

junkchar = getcharO; junkint = (int) junkchar;}if((respi == 121) II (respi == 89) II (respi == 10)){

goback = 0; resnotok = 0;}else if ((respl == 110) II (respi == 78)){


goback = 1; resnotok = 0;}else{

resnotok = 1;printf("\n\n Enter \"y\" or \"n\" .\n");

}}

}maxq = maxval(qk,ktot);/* If you want deterministic results, use a positive value of myseed */if (myseed > 0){

for (k=O; k<myseed; k++) drand480;}else{

srand48((unsigned)time(NULL));}/* The variables below will hold info on the present point in space.

feat is the variable number, direct is < or >=, thresh is cutoff */feat = (int **)malloc(maxq*sizeof(int *));direct = (int **)malloc(maxq*sizeof(int *));thresh = (float **)malloc(maxq*sizeof(float *));tbfeat = (int **)malloc(maxq*sizeof(int *));tbdirect = (int **)malloc(maxq*sizeof(int *));tbthresh = (float **)malloc(maxq*sizeof(float *));bfeat = (int **)malloc(maxq*sizeof(int *));bdirect = (int **)malloc(maxq*sizeof(int *));bthresh = (float **)malloc(maxq*sizeof(float *));beslfeat = (int **)malloc(maxq*sizeof(int *));besldirect = (int **)malloc(maxq*sizeof(int *));beslthresh = (float **)malloc(maxq*sizeof(float *));besffeat = (int **)malloc(maxq*sizeof(int *));besfdirect = (int **)malloc(maxq*sizeof(int *));besfthresh = (float **)malloc(maxq*sizeof(float *));

for (k=O; k<maxq; k++){

feat[k] = (int *)calloc(ktot,sizeof(int));direct[k] = (int *)calloc(ktot,sizeof(int));thresh[k] = (float *)calloc(ktot,sizeof(float));bfeat[k] = (int *)calloc(ktot,sizeof(int));bdirect[k] = (int *)calloc(ktot,sizeof(int));


bthresh[k] = (float *)calloc(ktot,sizeof(float));tbfeat[k] = (int *)calloc(ktot,sizeof(int));tbdirect[k] = (int *)calloc(ktot,sizeof(int));tbthresh[k] = (float *)calloc(ktot,sizeof(float));beslfeat[k] = (int *)calloc(ktot,sizeof(int));besldirect[k] = (int *)calloc(ktot,sizeof(int));beslthresh[k] = (float *)calloc(ktot,sizeof(float));besffeat[k] = (int *)calloc(ktot,sizeof(int));besfdirect[k] = (int *)calloc(ktot,sizeof(int));besfthresh[k] = (float *)calloc(ktot,sizeof(float));

}

}

FILE *myopenf(FILE *locfile, char *loc_file){

if ((locfile=fopen(1oc_file,"a"))==NULL){

printf("\n\nError: Could not open file %s\n\n", loc_file);exit(-l);

}retum(1ocfile);

}


Appendix VIII - A large question set for predicting death

QUESTION SET J   (Sensitivity: 51%   Specificity: 84%)

SET J.1
1. Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from some person or any equipment or device to do the following:

Walking across a small room?

No help

Help

Unable to do

Missing

2. (When wearing eyeglasses/contact lenses) Can you see well enough to read ordinary newspaper print?

Yes

No

Missing

OR

SET J.2
1. Are you able to walk half a mile without help? That's about 8 ordinary blocks.

Yes

No

Missing

2. Do you take any digitalis, Digoxin, Lanoxin, or Digitoxin pills now?

Yes

No

Missing

SET J.3
1. Age: 80+   <80

2. Did a doctor ever tell you that you had a stroke or brain hemorrhage?

Yes

Suspect

No

Missing

OR

SET J.4
1. Sex: Male   Female

2. What is your telephone number?

Correct

Incorrect

Refused

Missing

3. Have you been to a hospital at least one night in the past 12 months?

Yes

No

Missing

OR


SET J.5
1. Age: 80+   <80

2. What was (is) your mother's maiden name?

Correct
Incorrect
Refused
Missing

OR

SET J.6
1. As compared to other people your own age, would you say that your health is excellent, good, fair, poor or very poor?

Excellent
Good
Fair
Poor or bad
Missing

2. In the past year, have you gained or lost more than 10 pounds?

No change
Yes, gained
Yes, lost
Yes, both gained and lost
Missing

OR

SET J.7
1. Weight at time of interview (in pounds):

<158   158+   Missing

2. Sex: Male   Female

3. Do you take any digitalis, Digoxin, Lanoxin, or Digitoxin pills now?

Yes
No
Missing

OR

SET J.8
1. What day of the week is it?

Correct
Incorrect
Refused
Missing

2. Other than when you might have been in the hospital, was there any time in the past 12 months when you needed help from some person or any equipment or device to do the following things:

Bathing, either a sponge bath, tub bath, or shower?

No help
Help
Unable to do
Missing

OR


SET J.9
1. Weight at time of interview (in pounds):

<170   170+   Missing

2. Sex: Male   Female

3. Are you able to walk half a mile without help? That's about 8 ordinary blocks.

Yes
No
Missing

OR

SET J.10
1. Has a doctor ever told you you had diabetes, high blood sugar, or sugar in your urine?

Yes
Suspect
No
Missing

2. Has a doctor ever told you you had a heart attack, coronary, coronary thrombosis, coronary occlusion or myocardial infarction?

Yes
No
Missing

3. Were you hospitalized overnight or longer for this?

Yes
No
Missing
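To make the structure of Question Set J concrete: a respondent is flagged by the model when every question in at least one subset is answered in the at-risk direction, that is, the subsets are combined with OR and the questions within a subset with AND. The sketch below shows how SET J.2 alone might be evaluated in code; the struct, the field names, and the choice of which answers count as at-risk (here, "No" to walking half a mile without help and "Yes" to current digitalis use) are illustrative assumptions.

#include <stdbool.h>

/* One respondent's coded answers (assumed coding: 1 = Yes, 0 = No). */
struct respondent {
    int walks_half_mile;   /* "Are you able to walk half a mile without help?" */
    int takes_digitalis;   /* "Do you take any digitalis, Digoxin, Lanoxin,
                              or Digitoxin pills now?"                         */
};

/* SET J.2 as an AND clause: assumed to fire when the respondent cannot
   walk half a mile without help AND currently takes digitalis.         */
bool set_j2_fires(const struct respondent *r)
{
    return (r->walks_half_mile == 0) && (r->takes_digitalis == 1);
}

/* The full prediction is the OR over the subsets J.1 through J.10:
       flagged = set_j1_fires(r) || set_j2_fires(r) || ... || set_j10_fires(r); */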