


Handbook of Statistics, Vol. 27: Epidemiology and Medical Statistics
ISSN: 0169-7161
DOI: 10.1016/S0169-7161(07)27006-3
© 2008 Elsevier B.V. All rights reserved

6. Logistic Regression

Edward L. Spitznagel Jr.

Abstract

This chapter provides an introduction to logistic regression, which is a powerful modeling tool paralleling ordinary least squares (OLS) regression. The difference between the two is that logistic regression models categorical rather than numeric outcomes. First, the case of a binary or dichotomous outcome will be considered. Then the cases of unordered and ordered outcomes with more than two categories will be covered. In all three cases, the method of maximum likelihood replaces the method of least squares as the criterion by which the models are fitted to the data. Additional topics include the method of exact logistic regression for the case in which the maximum likelihood method does not converge, probit regression, and the use of logistic regression for analysis of case–control data.

1. Introduction

Logistic regression is a modeling tool in which the dependent variable is categorical. In most applications, the dependent variable is binary. However, it can have three or more categories, which can be ordered or unordered. Except for the nature of the dependent variable, logistic regression closely resembles ordinary least squares (OLS) regression.

A comparison with OLS regression can lead to a better understanding of logistic regression. In both types of regression, the fitted models consist of intercept and slope coefficients. In OLS regression, the fitting criterion is the principle of least squares. In least squares, the intercept and slope coefficients are chosen to minimize the sum of squared deviations of the dependent variable's values from the values given by the regression equation. In logistic regression, the fitting criterion is the principle of maximum likelihood. In maximum likelihood, the intercept and slope coefficients are chosen to maximize the probability of obtaining the observed data. For OLS regression, the principles of maximum likelihood and least squares coincide if the assumptions of normality, equal variance, and independence are satisfied. Therefore, the use of maximum likelihood for logistic regression can be regarded as a natural extension of the fitting method used in OLS regression.

In OLS regression, the model coefficients can be estimated by solving a system of linear equations, called the normal equations. In logistic regression, there is no counterpart to the normal equations. Model coefficients are estimated by searching iteratively for the values that maximize the likelihood. Modern software usually finds the maximum in less than a half dozen iterative steps.
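As an illustration of this iterative search, the following hand-rolled Newton–Raphson sketch (plain Python; the data are those of Table 1 in Section 2, and all variable names are ours) recovers the same intercept and slope that the SAS output in Section 2 reports:

```python
import math

# Grouped insecticide data from Table 1: dose, number exposed, number killed.
doses  = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n      = [50] * 9
killed = [1, 2, 4, 17, 26, 39, 42, 48, 50]

a, b = 0.0, 0.0                      # intercept and slope, starting at zero
for _ in range(25):                  # Newton-Raphson iterations
    p = [1 / (1 + math.exp(-(a + b * x))) for x in doses]
    # Score vector (gradient of the log-likelihood).
    u0 = sum(y - m * pi for y, m, pi in zip(killed, n, p))
    u1 = sum(x * (y - m * pi) for x, y, m, pi in zip(doses, killed, n, p))
    # Observed information matrix (2 x 2), then one Newton step.
    w = [m * pi * (1 - pi) for m, pi in zip(n, p)]
    i00 = sum(w)
    i01 = sum(x * wi for x, wi in zip(doses, w))
    i11 = sum(x * x * wi for x, wi in zip(doses, w))
    det = i00 * i11 - i01 * i01
    a += (i11 * u0 - i01 * u1) / det
    b += (i00 * u1 - i01 * u0) / det

print(round(a, 4), round(b, 4))   # ≈ -5.0870 1.0343
```

In practice one would use packaged routines, but the sketch shows why only a handful of iterations are needed: Newton's method converges quadratically near the maximum.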

In OLS regression, hypotheses about coefficients are tested by using sums of squared deviations to compute test statistics known as F ratios. In logistic regression, hypotheses about coefficients are tested by using likelihoods to compute test statistics known as likelihood ratio χ².

In both OLS and logistic regression, every coefficient estimate has an associated standard error. In the case of OLS regression, these standard errors can be used to perform Student's t-tests on the coefficients and to compute t-based confidence intervals. In logistic regression, these standard errors can be used to perform approximate Z (or χ²) tests on the coefficients and to compute approximate Z-based confidence intervals.

2. Estimation of a simple logistic regression model

The example below illustrates how these ideas are used in logistic regression. Consider an experiment in which disease-vector mosquitoes are exposed to 9 different levels of insecticide, 50 mosquitoes per dose level, with the results shown in Table 1.

In Fig. 1, the vertical axis shows the number killed at each dose level, from 1 at the lowest level to 50 at the highest level. Owing to the curvature at the low and high levels of insecticide, a straight line produced by a linear function will not give an adequate fit to these data. In fact, a straight line would imply that for very low doses of insecticide, the number of kills would be less than 0, and for very high doses of insecticide, the number of kills would be larger than the number of mosquitoes exposed to the dose.

The natural bounds on the probability of kill, 0 and 1, constrain the model so that no linear function can adequately describe the probability. However, we can solve this problem by replacing probability with a function that has no lower or upper bounds. First, if we replace probability with odds (= p/(1−p)), we open up the range on the upper end to +∞. The lower end remains at 0, and probability = 1/2 corresponds to odds of 1. To open up the range on the lower end and symmetrize it at the same time, we can take the natural logarithm of odds, known as the logit function, ln(p/(1−p)). When p ranges from 0 to 1, the logit function ranges from −∞ to +∞, in symmetric fashion. The logit of 1/2 is equal to 0. The logits of complementary probabilities, such as 1/3 and 2/3, have the same size but opposite signs: ±ln(2) = ±0.693.

Table 1
Relationship between insecticide dose and number of mosquitoes killed

Dosage        1   2   3   4   5   6   7   8   9
Mosquitoes   50  50  50  50  50  50  50  50  50
Killed        1   2   4  17  26  39  42  48  50

Fig. 1. Graph of proportion killed as a function of insecticide dose. [Figure: number of mosquitoes killed (0–50) plotted against insecticide dose (0–10).]
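These properties of the logit are easy to confirm numerically (illustrative Python; the function name is ours):

```python
import math

def logit(p):
    """Log-odds of a probability p: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

# The logit of 1/2 is 0, and complementary probabilities give
# logits of equal size and opposite sign.
print(round(logit(1/2), 3))   # 0.0
print(round(logit(2/3), 3))   # 0.693
print(round(logit(1/3), 3))   # -0.693
```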

Using the SAS LOGISTIC procedure (SAS Institute Inc., 2004), we obtain the following output estimating the logit as a linear function of dose:

Parameter  df  Estimate  Standard Error  χ²        Pr > χ²
Intercept   1   −5.0870   0.4708         116.7367  <0.0001
Dose        1    1.0343   0.0909         129.3662  <0.0001

The estimated log-odds of death is

logit = −5.0870 + 1.0343 × dose.

From this formula we can estimate the probability of mosquito death for any dose level, including but not limited to the actual doses in the experiment. Suppose, for example, that we would like to know the probability (= expected fraction) of insect kill with 6.5 units of insecticide. The estimated logit is −5.0870 + 1.0343 × 6.5 = 1.636. To convert into probability, we reverse the two steps that took us from probability to the logit. First, we exponentiate base-e to convert to odds: exp(1.636) = 5.135. Second, we convert odds to probability by the formula p = odds/(odds + 1): 5.135/6.135 = 0.8370. Thus, with a 6.5 unit dose, we expect 83.70% of insects to be killed. Figure 2 demonstrates the close agreement between actual numbers killed and the computed probabilities multiplied by 50.
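The two-step conversion back to probability can be wrapped in a small helper (Python sketch; `logit_to_prob` is our own name):

```python
import math

def logit_to_prob(logit):
    """Invert the logit: exponentiate to odds, then odds / (odds + 1)."""
    odds = math.exp(logit)
    return odds / (odds + 1)

# Fitted model from the text: logit = -5.0870 + 1.0343 * dose.
dose = 6.5
p = logit_to_prob(-5.0870 + 1.0343 * dose)
print(round(p, 4))   # ≈ 0.8370
```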

Fig. 2. Logistic regression curve fitted to insecticide lethality data. [Figure: number of mosquitoes killed (0–50) versus insecticide dose (0–10), with the fitted curve through the observed points.]


The coefficient of dose, 1.0343, estimates the increase in log-odds per unit of dose. The exponential of this number, exp(1.0343) = 2.813, estimates the multiplicative increase in odds of insect death per unit of dose. This value is called the odds ratio. It is the most commonly used number in logistic regression results, usually reported along with approximate 95% confidence limits. These limits can be obtained by adding and subtracting 1.96 standard errors from the slope estimate, then exponentiating: exp(1.0343 ± 1.96 × 0.0909) = (2.354, 3.362). Most statistics packages will compute both odds ratios and confidence limits, either automatically or as options.
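These numbers are easy to verify (illustrative Python):

```python
import math

beta, se = 1.0343, 0.0909            # slope estimate and its standard error
odds_ratio = math.exp(beta)
ci = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))

print(round(odds_ratio, 3))               # 2.813
print(round(ci[0], 3), round(ci[1], 3))   # 2.354 3.362
```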

If we divide the coefficient 1.0343 by its standard error, 0.0909, we obtain the value of an asymptotically standard normal test statistic Z = 1.0343/0.0909 = 11.38 for testing the null hypothesis that the true value of the slope is equal to 0. Some statistics packages report the square of this statistic, 11.38² = 129.4 in the table above, which has an asymptotic χ² distribution with 1 df, and is called a Wald χ². The p-value will be the same, whichever statistic is used.

As we discussed earlier, estimation of a logistic regression model is achieved through the principle of maximum likelihood, which can be thought of as a generalization of the least squares principle of linear regression. In maximum likelihood estimation, we search over all values of intercept and slope until we reach the point where the likelihood of obtaining the observed data is largest. This occurs when the intercept is −5.0870 and the slope is 1.0343. This maximized likelihood is very small, 3.188 × 10^−67. Such small values are typical in maximum likelihood calculations. Accordingly, in lieu of the likelihood itself, most statistical software reports the natural logarithm of the likelihood, which in our case is ln(3.188 × 10^−67) = −153.114.

Besides being the means of estimating the logistic regression coefficients, the likelihood also furnishes us a means of testing hypotheses about the model. If we wish to test the null hypothesis of no relationship between dose and insect mortality, we re-fit the model with the slope coefficient constrained equal to 0.


Doing so produces an intercept estimate of 0.0356 and the smaller log-likelihood of −311.845. To compare the two models, we multiply the difference in log-likelihoods by 2: 2 × (−153.114 − (−311.845)) = 2 × 158.731 = 317.462. If the null hypothesis being tested is true, we would have an asymptotic χ² distribution with degrees of freedom equal to the number of coefficient constraints, which in this case equals 1. This is known as a likelihood ratio χ², as it is equal to twice the log of the ratio of the two likelihoods, unconstrained to constrained.
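In code, the likelihood ratio statistic is plain arithmetic on the two reported log-likelihoods (Python):

```python
ll_unconstrained = -153.114   # model with dose
ll_constrained   = -311.845   # slope constrained to 0 (intercept only)

lr_chi2 = 2 * (ll_unconstrained - ll_constrained)
print(round(lr_chi2, 3))   # 317.462
```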

Taking advantage of the analogy with OLS, the likelihood ratio χ² value of 317.462 is the counterpart of the F-ratio obtained by computing sums of squares with and without constraints, whereas the earlier values of Z = 11.38 and the Wald χ² Z² = 129.4 are the counterparts of Student's t and its square, computed by comparing a coefficient estimate with its standard error. In the case of OLS, the two different computations give identical and exact results. In the case of maximum likelihood, the two different computations give different, and asymptotic, rather than exact results. Ordinarily, we would expect the two different χ² to be approximately equal. The primary reason for the large difference between them is the strength of the dose–mortality relationship. Should it ever happen that the likelihood ratio and Wald χ² lead to opposite conclusions in a hypothesis test, the likelihood ratio χ² is usually preferred.

3. Two measures of model fit

In our example, both χ² test statistics provide strong evidence of the relationship between dose and mortality, but they do not provide an estimate of how strong the relationship is. The reason is that test statistics in general depend on sample size (if the null hypothesis of no relationship is false). In OLS regression, the coefficient of determination, R², is the most commonly used measure of the strength of relationship. For maximum likelihood estimation, Cox and Snell (1989) proposed an analog of R² computed from the likelihood ratio χ²:

Cox–Snell R² = 1 − exp(−χ²/n) = 1 − exp(−317.462/450) = 0.5061.

Nagelkerke (1991) pointed out that, in the case of discrete models such as logistic regression, the maximum value possible with this definition will be less than 1. He proposed that the Cox–Snell R² be divided by its maximum possible value, to produce a closer analog of the R² of least-squares regression. In our example, the Nagelkerke R² = 0.6749. Both Cox–Snell and Nagelkerke R² values are readily available in most statistics packages.
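Both values can be reproduced from the log-likelihoods reported earlier (Python sketch; the maximum attainable Cox–Snell value comes from the null model's log-likelihood):

```python
import math

ll_full, ll_null, n = -153.114, -311.845, 450
lr_chi2 = 2 * (ll_full - ll_null)

cox_snell = 1 - math.exp(-lr_chi2 / n)
max_cs = 1 - math.exp(2 * ll_null / n)   # maximum attainable Cox-Snell R^2
nagelkerke = cox_snell / max_cs

print(round(cox_snell, 4), round(nagelkerke, 4))   # 0.5061 0.6749
```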

Logistic regression also has a direct and natural measure of relationship strength called the coefficient of concordance C. It stems from the purpose of logistic regression: to estimate probabilities of an outcome from one or more independent variables. The coefficient of concordance measures the fraction of pairs with different outcomes in which the predicted probabilities are consistent with the outcomes. For a pair of mosquitoes, one dead and the other alive, if the calculated probability of death for the dead mosquito is greater than the calculated probability of death for the live one, those probabilities would be consistent with the outcomes in that pair.

Fig. 3. Interpretation of logistic regression in terms of a receiver-operating characteristic curve. [Figure: proportion of true positives (sensitivity) versus proportion of false positives (1 − specificity), both running from 0 to 1.]

In our example, of 450 mosquitoes, 221 survived and 229 died. We thus have 221 × 229 = 50,609 pairs with different outcomes. The great majority of these pairs, 45,717, were concordant, with the dead mosquito having the higher probability of death. A much smaller number, 2,517, were discordant, with the live mosquito having the higher probability of death. There were also 2,375 ties, in which the dead and live mosquitoes had exactly the same probability of death. The ties are counted as half concordant and half discordant, and so the coefficient of concordance is C = (45,717 + 2,375/2)/50,609 = 0.927.
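From the pair counts just given, C is one line of arithmetic (Python):

```python
# Pair counts reported in the text for the 221 survivors x 229 deaths.
concordant, discordant, ties = 45_717, 2_517, 2_375
pairs = concordant + discordant + ties     # 221 * 229 = 50,609

C = (concordant + ties / 2) / pairs        # ties count as half concordant
print(round(C, 3))   # 0.927
```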

Like R², the coefficient of concordance has range 0–1, but, unlike R², its expected value under the null hypothesis of no relationship is 0.5 rather than 0. When logistic regression is used in medical diagnosis, a graph called the receiver operating characteristic (ROC) curve is used to portray the effectiveness of the regression model. This graph, in Fig. 3, shows the tradeoff in the proportion of true positives (sensitivity) versus the proportion of false positives (= 1 − specificity). The area beneath this curve is equal to C. A value of C close to 1 means the diagnosis can have simultaneously high sensitivity and high specificity. Further information regarding the use of ROC curves can be found in Chapter 5 of Hosmer and Lemeshow (2000) and in McNeil et al. (1975).

4. Multiple logistic regression

As with OLS regression, logistic regression is frequently used with multiple independent variables. Suppose in our example we had treated 25 male and 25 female mosquitoes at each dose level, with the results shown in Table 2.

Table 2
Differential lethality of insecticide for males and females

Dosage          1   2   3   4   5   6   7   8   9
Males killed    1   1   3  11  16  21  22  25  25
Females killed  0   1   1   6  10  18  20  23  25

Within each group, there is a dose–response relation between insecticide and mosquito deaths, with more deaths among the males than among the females at almost every dosage level. Using the SAS LOGISTIC procedure, we obtain the following output estimating the logit as a linear function of dose and sex:

Testing Global Null Hypothesis (Beta = 0)

Test              χ²        df  Pr > χ²
Likelihood ratio  326.8997   2  <0.0001

Parameter  df  Estimate  Standard Error  χ²        Pr > χ²
Intercept   1   −5.7186   0.5465         109.4883  <0.0001
Dose        1    1.0698   0.0950         126.8817  <0.0001
Male        1    0.9132   0.3044           9.0004   0.0027

The sex of the insect was coded as 0 = female, 1 = male. Hence the choice of variable name "male." In terms of the probabilities estimated from the model, it does not matter whether the sex variable is coded 1 for being male or 1 for being female. The coefficients will change so that the logits and the probabilities come out the same. Also, the χ² values will come out the same, except for that of the intercept term. (Testing the intercept coefficient being equal to 0 is not particularly meaningful, in any event.)

Since the coefficient for male is positive, males are more susceptible to the insecticide than females are throughout the dose range. For example, at a dose of five units, the logit of a male mosquito dying is −5.7186 + 1.0698 × 5 + 0.9132 = 0.5436. This corresponds to odds of 1.7222 and a probability of 0.6326. At a dose of five units, the logit of a female dying is −5.7186 + 1.0698 × 5 = −0.3696. This corresponds to odds of 0.6910 and a probability of 0.4086. Figure 4 shows the estimated probabilities of death for males and females over the entire dose range. Across the entire range of dosing, the ratio of male odds to female odds is exp(0.9132) = 2.492. (Verifying at dose = 5 units, we have explicitly 1.7222/0.6910 = 2.492.)
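Predicted probabilities for either sex follow from the coefficients as printed in the output table above (Python sketch; `death_prob` is our own helper name):

```python
import math

def death_prob(dose, male, b0=-5.7186, b_dose=1.0698, b_male=0.9132):
    """Probability of death under the additive dose + sex model."""
    logit = b0 + b_dose * dose + b_male * male
    odds = math.exp(logit)
    return odds / (1 + odds)

print(round(death_prob(5, 1), 4))   # male at dose 5, ≈ 0.6326
print(round(death_prob(5, 0), 4))   # female at dose 5, ≈ 0.4086
```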

Fig. 4. Logistic regression curves fitted to males and females separately, with curve for males above that for females. [Figure: fractions of mosquitoes killed (0–1) versus insecticide dose (0–10), one curve per sex.]


5. Testing for interaction

The fitted model above is an additive, or non-interactive, model. It assumes that if we graph the logit versus dose separately for males and females, we get parallel lines. To test the possibility that the two lines are not parallel, we can include an interaction term obtained by multiplying dose by male, and rerun the regression with that term included:

Testing Global Null Hypothesis (Beta = 0)

Test              χ²        df  Pr > χ²
Likelihood ratio  326.9221   3  <0.0001

Parameter    df  Estimate  Standard Error  χ²       Pr > χ²
Intercept     1   −5.6444   0.7322         59.4229  <0.0001
Dose          1    1.0559   0.1321         63.9266  <0.0001
Male          1    0.7735   0.9816          0.6209   0.4307
Interaction   1    0.0284   0.1900          0.0224   0.8812

By two different but related criteria, there is no evidence for the existence of an interaction. First, we have the Wald χ² value 0.0224 (and p-value = 0.8812) for testing the hypothesis that the interaction is equal to 0. Second, we can "partition" the likelihood ratio χ² in the same fashion as calculating partial sums of squares in OLS regression: 326.9221 − 326.8997 = 0.0224. This difference has degrees of freedom 3 − 2 = 1. The conclusion again is that there is no evidence for an interaction.

In the above model, the coefficient and standard error for the variable male have changed dramatically from what they were in the earlier additive model, leading to a non-significant χ² value for the sex effect. This is typical in testing for interactions using simple product terms, which are usually highly correlated with one or both of the variables they are generated from. Since we have concluded that we should return to the additive model, there is no problem for us. Had the interaction been statistically significant, meaning we need to use the more complex model, we could redefine our variables by shifting them to have means of zero, which reduces the correlation between them and their product, the interaction. This yields a model whose coefficients are simpler to interpret.

6. Testing goodness of fit: Two measures for lack of fit

In OLS regression, there are two aspects to assessing the fit of a model. The first is the fraction of variation explained by the model, denoted by R². Section 3 above described the logistic regression counterpart of R², as well as the coefficient of concordance (which is not a counterpart to anything in OLS regression). The larger these measures, the better the fit. However, even a very large value is not a guarantee that a modification of the existing model may not produce an even better fit.

The second aspect of assessing model fit speaks to the possibility that the existing model might not be the best available. The goodness of fit of an OLS regression model can be based on examination of residuals. By plotting residuals against expected values, deviations from the assumptions of independence and identical distribution (IID) can easily be detected. In logistic regression, the goodness of fit can be assessed in two ways, the first based on contingency tables, the second based on a cumulative plot of the number of positives versus the sum of predicted probabilities.

The approach based on contingency tables is called the Hosmer–Lemeshow test (Hosmer and Lemeshow, 1980). The data are first sorted by predicted probabilities and then divided into 10 (or possibly fewer) groups of approximately equal size. Within each group, the numbers of dead and alive ("observed frequencies") are compared with the sums of the probabilities of dead and alive ("expected frequencies"). The usual χ² test for goodness of fit is then calculated. Its degrees of freedom are equal to the number of groups minus two. Using the logistic regression of insects killed on dose and sex, the observed and expected frequencies are given in Table 3.

The value of χ² is 4.3001 with 9 − 2 = 7 degrees of freedom and p-value = 0.7446. Therefore, there is no evidence that the model can be improved using the existing two independent variables, dose and sex. Note that the software generated nine rather than ten groups. This is due to the tabular nature of the data, which resulted in large numbers of ties. Normally there will be 10 groups, with 10 − 2 = 8 degrees of freedom.

Table 3
Details of the Hosmer–Lemeshow test for goodness of fit

Group           1      2      3      4      5      6      7      8      9
Observed dead    1      2      4     17     26     39     42     48     50
Observed alive  49     48     46     33     24     11      8      2      0
Expected dead    0.82   2.30   6.09  14.08  26.03  37.55  44.76  48.05  49.31
Expected alive  49.18  47.70  43.91  35.92  23.97  12.45   5.24   1.95   0.69
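The χ² statistic can be recomputed from Table 3 (Python sketch; because the expected frequencies are printed rounded, the total only approximates the reported 4.3001):

```python
# Observed and expected frequencies from Table 3.
obs_dead  = [1, 2, 4, 17, 26, 39, 42, 48, 50]
obs_alive = [49, 48, 46, 33, 24, 11, 8, 2, 0]
exp_dead  = [0.82, 2.30, 6.09, 14.08, 26.03, 37.55, 44.76, 48.05, 49.31]
exp_alive = [49.18, 47.70, 43.91, 35.92, 23.97, 12.45, 5.24, 1.95, 0.69]

# Sum (O - E)^2 / E over all 18 cells (dead and alive in each group).
chi2 = sum((o - e) ** 2 / e
           for o, e in zip(obs_dead + obs_alive, exp_dead + exp_alive))
print(round(chi2, 2))   # ≈ 4.29, close to the reported 4.3001
```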

Fig. 5. Graphic assessment of goodness of fit. [Figure: cumulative number of actual deaths versus cumulative sum of predicted probabilities of death, both 0–250; the points fall on the diagonal.]


An alternative to the Hosmer–Lemeshow test is a graphical procedure that plots the cumulative number of positives versus the sum of predicted probabilities. The diagonal straight-line plot in Fig. 5 indicates that no improvement in the model is possible using existing variables. This diagnostic plot is in the same spirit as residual plots in OLS regression and therefore may be preferred by those who are accustomed to reading residual plots. Using cumulatives smooths the plot, making it easier to read.

7. Exact logistic regression

On occasion, the method of maximum likelihood may fail to converge, or it may yield coefficient estimates that are totally unrealistic. This is usually the result of some combination of independent variables being associated with only one category of response. The simplest example occurs in a 2 × 2 table containing a cell with a frequency of 0, as in Table 4.

Table 4
Example of quasi-complete separation

             Survival
             Dead  Alive
Treatment A     4     16
Treatment B     0     20


If we code Dead as 0 and Alive as 1, and code Treatments A and B as 0 and 1 respectively, we can attempt to fit a logistic regression model to predict survival status from treatment. The odds ratio from this table is (20/0)/(16/4) = ∞. Since the slope coefficient in logistic regression is the log of the odds ratio, the corresponding logistic regression model should be non-estimable, and the software will warn us of this.

Model Convergence Status
Quasi-complete separation of data points detected.
Warning: The maximum likelihood estimate may not exist.

Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable.

Testing Global Null Hypothesis (Beta = 0)

Test              χ²      df  Pr > χ²
Likelihood ratio  5.9905   1  0.0144

Analysis of Maximum Likelihood Estimates

Parameter  df  Estimate  Standard Error  χ²      Pr > χ²
Intercept   1   1.3863    0.5590         6.1498  0.0131
X           1  11.9790  178.5            0.0045  0.9465

Warning: The validity of the model fit is questionable.

There are in fact three distinct warnings. The term "quasi-complete separation" in the first warning refers to the fact that in our sample, all subjects who received Treatment B are alive. That is, the lower left frequency in the table, representing deaths among subjects receiving Treatment B, is 0. (If the upper right cell also were 0, we would have the even more extreme case of "complete separation.")


In categorical data, there are two distinct kinds of zero frequencies. One is called a structural zero. Structural zeros are zero frequencies that by logic cannot possibly be non-zero. For example, if we were to crosstabulate gender by common surgical procedures, the cell for male hysterectomies would of necessity be 0, as would the cell for female prostatectomies. (These cells are logically zero because males do not have uteruses, and females do not have prostates.)

The other type of zero is a sampling zero, which we could argue is the kind of zero frequency that occurred in our example. While Treatment B has an impressive success rate, 20 alive out of 20 trials, we have no doubt that in further trials, some deaths must surely occur. Therefore, we are convinced that the true odds ratio cannot be infinite, and we would like to use the information from our data to estimate it.

A similar situation can happen with estimating a binomial success probability. Suppose, for example, that out of 20 independent trials there are 20 successes and no failures. The maximum likelihood estimate would be 1, and the estimate of its standard error would be 0. The conventional confidence interval (of any degree of confidence) is the totally uninformative interval 1 ± 0 = (1, 1). An alternative approach to calculating a confidence interval in this case is to base it on the value(s) pL and pU of the success probability, obtained from the binomial distribution, that form the boundary between accepting and rejecting the null hypotheses H0: p = pL and H0: p = pU. The lower 95% confidence limit for the success probability is pL = 0.8315, and the upper confidence limit pU is taken to be equal to 1 (since no value, no matter how close to 1, can lead to rejection of H0). Technically, this makes the interval have 97.5% confidence, but in all cases where the sample proportion is close to 1, both limits will exist and yield 95% confidence. For example, if out of 20 trials there are 19 successes and one failure, we would find pL = 0.7513 and pU = 0.9987. This confidence interval is more satisfying than the standard-error-based interval, as the latter has upper limit 1.0455, which is substantially larger than 1. Thus, exact methods are appropriate not just for the case of zero frequencies, but whenever sample sizes and frequencies are too small for asymptotic methods to work.
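A simple bisection on the binomial tail probability reproduces these exact lower limits (illustrative Python; function names are ours):

```python
from math import comb

def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def exact_lower_limit(k, n, alpha=0.025):
    """Lower exact confidence limit: the p solving P(X >= k | p) = alpha."""
    lo, hi = 0.0, 1.0
    for _ in range(60):                   # bisection; tail is monotone in p
        mid = (lo + hi) / 2
        if binom_tail_ge(k, n, mid) < alpha:
            lo = mid                      # tail too small: limit is above mid
        else:
            hi = mid
    return lo

print(round(exact_lower_limit(20, 20), 4))   # ≈ 0.8316 for 20 of 20
print(round(exact_lower_limit(19, 20), 4))   # ≈ 0.7513 for 19 of 20
```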

Exact logistic regression is based on similar logic: although the odds ratio from the sample may involve a division by zero and therefore be "infinite," the observed data are consistent with a range of odds ratios extending to infinity. This is a generalization of the well-known Fisher exact test, which we consider first. Under the fixed marginals of 20, 20, 36, and 4, the probability of obtaining the table we have observed is the hypergeometric (20! 20! 36! 4!)/(40! 4! 16! 0! 20!) = 0.0530. The null hypothesis for the Fisher exact test corresponds to a population odds ratio of 1. Calculating an exact lower 95% (or technically 97.5%) confidence limit for the population odds ratio entails finding a value of the odds ratio for which a calculation similar to the above hypergeometric probability (but incorporating an arbitrary odds ratio) becomes equal to 0.025. The upper confidence limit is equal to infinity, analogous to the case where 20 successes out of 20 trials has upper confidence limit for the proportion pU = 1.
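Both the hypergeometric probability and the two-sided Fisher p-value can be checked directly (illustrative Python):

```python
from math import comb

# Table 4 with margins fixed: 4 deaths total, 20 subjects per arm.
# k = number of the 4 deaths that fall on Treatment B (observed k = 0).
def p_table(k):
    """Hypergeometric probability of a table with k deaths on arm B."""
    return comb(20, k) * comb(20, 4 - k) / comb(40, 4)

p_obs = p_table(0)
# Two-sided Fisher p-value: sum over tables no more probable than observed.
two_sided = sum(p_table(k) for k in range(5) if p_table(k) <= p_obs)
print(round(p_obs, 4), round(two_sided, 4))   # 0.053 0.106
```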


The results obtained from adding an exact option to our logistic regression are as follows:

Exact Parameter Estimates

Parameter  Estimate  95% Confidence Limits  p-value
Intercept  1.3863    (0.2549, 2.7999)       0.0118
X          1.8027*   (−0.3480, Infinity)    0.1060

Exact Odds Ratios

Parameter  Estimate  95% Confidence Limits  p-value
Intercept  4.000     (1.290, 16.442)        0.0118
X          6.066*    (0.706, Infinity)      0.1060

* Indicates a median unbiased estimate.

The 95% (actually 97.5%) confidence interval for the true odds ratio is reported as (0.706, ∞). The number 0.706 was obtained by searching for the odds ratio that made the probability of the crosstable equal to 0.025. The "median unbiased estimate" of 6.066 was obtained by searching for the value of the odds ratio that made the probability of the crosstable equal to 0.5. The p-value reported is the same as the p-value from the two-sided Fisher exact test, which is the test customarily used when expected frequencies are too small to justify using a χ² test. Thus, although our point estimate of the odds ratio, 6.066, is much larger than 1, both the lower confidence limit for the odds ratio and the hypothesis test indicate that we cannot reject the null hypothesis that the odds ratio is equal to 1.
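These two searches can be imitated with the conditional distribution of the table under an arbitrary odds ratio ψ (the noncentral hypergeometric). A minimal sketch, parameterizing the table by Y = number alive on Treatment B (observed at its maximum, 20); `p_observed` and `solve` are our own illustrative helpers:

```python
from math import comb

# Margins fixed: 20 per arm, 36 alive in total, so Y ranges over 16..20.
support = range(16, 21)
coef = {y: comb(20, y) * comb(20, 36 - y) for y in support}

def p_observed(psi):
    """P(Y = 20 | margins, odds ratio psi), noncentral hypergeometric."""
    weights = {y: coef[y] * psi ** (y - 16) for y in support}
    return weights[20] / sum(weights.values())

def solve(target, lo=1e-6, hi=1e6):
    """Find psi with P(Y = 20 | psi) = target (monotone increasing in psi)."""
    for _ in range(200):
        mid = (lo * hi) ** 0.5          # bisect on the log scale
        if p_observed(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo

print(round(solve(0.025), 3))   # ≈ 0.706, the lower 97.5% limit
print(round(solve(0.5), 3))     # ≈ 6.066, the median unbiased estimate
```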

The point estimate of the log-odds ratio b is reported as 1.8027, and the confidence interval for b is reported as (−0.3480, ∞). These values are the logarithms of the estimate and confidence limits for the odds ratio.

The intercept estimates are based on the odds of success for Treatment A, 16/4 = 4, and the corresponding confidence intervals are exact intervals based on the binomial distribution. Thus, our estimated logistic regression model is:

log-odds of survival = 1.3863 + 1.8027 × (0 for Treatment A, 1 for Treatment B).

Exact logistic regression is valuable even beyond the extreme case of quasi-complete separation. For small sample sizes and/or rare events, maximum likelihood methods may not exhibit the asymptotic behavior guaranteed by theory. Suppose, for example, we have the 2 × 2 crosstabulation seen in Table 5, in which two cells have low frequencies, particularly one cell containing a single observation.

Table 5
Example of small cell frequencies

             Survival
             Dead  Alive
Treatment A     3     17
Treatment B     1     19

The marginal frequencies are the same as before, but now there is no cell containing 0. Two cells have very small expected frequencies, each equal to 2. These are too small for the method of maximum likelihood to give good estimates, and too small to justify using a χ² test. Below is the result of the exact logistic regression and, for comparison purposes, below it the result of the maximum likelihood logistic regression.

Exact Logistic Regression:

Exact Parameter Estimates

Parameter   Estimate   95% Confidence Limits   p-value
Intercept   1.7346      0.4941    3.4072       0.0026
X           1.1814     −1.4453    5.2209       0.6050

Exact Odds Ratios

Parameter   Estimate   95% Confidence Limits   p-value
Intercept   5.667       1.639    30.181        0.0026
X           3.259       0.236   185.105        0.6050

Maximum Likelihood Logistic Regression:

Testing Global Null Hypothesis (Beta = 0)

Test               χ²       df   Pr > χ²
Likelihood ratio   1.1577   1    0.2820

Parameter   df   Estimate   Standard Error   Wald χ²   Pr > χ²
Intercept   1    1.7346     0.6262           7.6725    0.0056
X           1    1.2097     1.2020           1.0130    0.3142

Odds Ratio Estimates

Effect   Point Estimate   95% Wald Confidence Limits
X        3.353            0.318   35.358

The two estimates of the odds ratio, 3.259 and 3.353, are very similar. However, the two confidence intervals are very different, with the length of the exact interval being much larger. Consistent with the differing lengths of the confidence intervals, the p-values for testing the null hypothesis that the odds ratio is equal to 1 are quite different: 0.6050 from the exact logistic regression, and 0.2820 (LR χ²) and 0.3142 (Wald χ²) from the maximum likelihood logistic regression. As with the earlier exact logistic regression, the p-value 0.6050 is the same as the p-value from the Fisher exact test.
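The agreement with the Fisher exact test can be verified from first principles. The sketch below (an illustration, not from the original chapter) computes the two-sided Fisher p-value for Table 5 by summing hypergeometric probabilities of all tables at least as extreme as the observed one:

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, c1, n = a + b, a + c, a + b + c + d
    def prob(k):
        # Hypergeometric probability of k successes in row 1 given the margins.
        return comb(r1, k) * comb(n - r1, c1 - k) / comb(n, c1)
    p_obs = prob(a)
    # Sum over all achievable tables whose probability is <= that of the observed table.
    return sum(prob(k) for k in range(max(0, c1 - (n - r1)), min(r1, c1) + 1)
               if prob(k) <= p_obs * (1 + 1e-9))

# Table 5: Treatment A (3 dead, 17 alive), Treatment B (1 dead, 19 alive).
print(round(fisher_two_sided(3, 17, 1, 19), 4))  # 0.605
```

The result matches the 0.6050 reported by the exact logistic regression.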

8. Ordinal logistic regression

Although the dependent variable is usually binary, logistic regression has the capability to handle dependent variables with more than two categories. If the categories have a natural ordering, such as levels "none," "mild," "moderate," and "severe," the natural first step in building a model is to see if an ordinal logistic regression model fits the data. In an ordinal logistic regression model, there is one set of slope coefficients but multiple intercepts. The number of intercepts is always one less than the number of categories. In conjunction with the slopes, the intercepts determine the probabilities of an observation being in the various categorical levels.

As an example, we will use a measure of drinking constructed in the 2001 College Alcohol Study (Wechsler, 2001), called DRINKCAT. This variable has four levels: 0 = abstainer or no drinking within the last year, 1 = drinking within the last year but no binge drinking in the previous two weeks, 2 = binge drinking once or twice within the last two weeks, 3 = binge drinking more than twice within the last two weeks.

We will use three binary variables as independent variables: A2 = male gender, A5 = sorority/fraternity membership, B8 = living in "alcohol-free" housing. There is an interaction between A2 and A5, which we will omit in order to make the example simpler. For further simplicity, we will also omit effects due to weighting and to clustering by institution.

Following is the result of fitting an ordinal logistic regression model to these data:

Response Profile

Ordered Value   DRINKCAT   Total Frequency
1               3.00       2365
2               2.00       2258
3               1.00       4009
4               0.00       2039

Probabilities modeled are cumulated over the lower ordered values.


Score Test for the Proportional Odds Assumption

χ²        df   Pr > χ²
80.4140   6    <0.0001

Testing Global Null Hypothesis (Beta = 0)

Test               χ²         df   Pr > χ²
Likelihood ratio   336.6806   3    <0.0001

Analysis of Maximum Likelihood Estimates

Parameter        df   Estimate   Standard Error   Wald χ²     Pr > χ²
Intercept 3.00   1    −1.3848    0.0292           2250.9331   <0.0001
Intercept 2.00   1    −0.3718    0.0257            209.2745   <0.0001
Intercept 1.00   1     1.3749    0.0294           2185.4249   <0.0001
A2               1     0.1365    0.0366             13.9106   0.0002
A5               1     0.8936    0.0540            274.3205   <0.0001
B8               1    −0.3886    0.0486             64.0554   <0.0001

Odds Ratio Estimates

Effect   Point Estimate   95% Wald Confidence Limits
A2       1.146            1.067   1.231
A5       2.444            2.199   2.716
B8       0.678            0.616   0.746

The first test, labeled "Score Test for the Proportional Odds Assumption," is a check on the validity of the ordinal logistic model. This test is statistically significant, which means that there is a better-fitting model, namely, the multinomial logistic regression model discussed below. We will, however, study the additional output from this ordinal logistic model in order to compare results with those from the more appropriate multinomial logistic model.

The likelihood ratio χ² for testing the overall model is 336.6806 with 3 degrees of freedom and p-value less than 0.0001. This tells us that the overall model fits better than we would expect by chance alone. We can then proceed to the assessment of the three variables A2 (male gender), A5 (Greek membership), and B8 (alcohol-free housing). All three are statistically significant, with A2 and A5 being risk factors and B8 being protective.


Because there is only one set of slope coefficients (rather than 3) for A2, A5, and B8, it is necessary to compute all probabilities using this single set of slope coefficients combined with the three different intercept estimates. The three intercept estimates yield the estimated log odds, odds, and probabilities that: DRINKCAT = 3 (from the first intercept, −1.3848), DRINKCAT ≥ 2 (from the second intercept, −0.3718), and DRINKCAT ≥ 1 (from the third intercept, 1.3749). Once these estimated probabilities have been determined, the probabilities that DRINKCAT is equal to 0, 1, or 2 can be obtained by subtractions.

Suppose we have a subject who is at highest risk of heavy drinking. That would be a male who is a member of a Greek organization and who does not live in alcohol-free housing. For that person, we have:

log-odds(DRINKCAT = 3) = −1.3848 + 0.1365 × 1 + 0.8936 × 1 − 0.3886 × 0 = −0.3547

log-odds(DRINKCAT ≥ 2) = −0.3718 + 0.1365 × 1 + 0.8936 × 1 − 0.3886 × 0 = 0.6583

log-odds(DRINKCAT ≥ 1) = 1.3749 + 0.1365 × 1 + 0.8936 × 1 − 0.3886 × 0 = 2.4050

odds(DRINKCAT = 3) = exp(−0.3547) = 0.7014,  P(DRINKCAT = 3) = 0.7014/1.7014 = 0.4122

odds(DRINKCAT ≥ 2) = exp(0.6583) = 1.9315,  P(DRINKCAT ≥ 2) = 1.9315/2.9315 = 0.6589

odds(DRINKCAT ≥ 1) = exp(2.4050) = 11.0784,  P(DRINKCAT ≥ 1) = 11.0784/12.0784 = 0.9172

Finally, we obtain the individual probabilities by subtraction:

P(DRINKCAT = 0) = 1 − 0.9172 = 0.0828
P(DRINKCAT = 1) = 0.9172 − 0.6589 = 0.2583
P(DRINKCAT = 2) = 0.6589 − 0.4122 = 0.2467
P(DRINKCAT = 3) = 0.4122

By contrast, the four probabilities for a person with the lowest risk of heavy drinking – a female who is not a sorority member and lives in alcohol-free housing – are:

P(DRINKCAT = 0) = 0.2716
P(DRINKCAT = 1) = 0.4098
P(DRINKCAT = 2) = 0.1734
P(DRINKCAT = 3) = 0.1451
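The same arithmetic for any covariate pattern can be wrapped in a small function. The sketch below (an illustration; the intercepts and slopes are copied from the ordinal model output above) reproduces the probabilities for both the highest-risk and lowest-risk profiles:

```python
import numpy as np

# Fitted ordinal (proportional-odds) model for DRINKCAT.
# Intercepts correspond to P(DRINKCAT = 3), P(>= 2), P(>= 1); one shared slope vector.
intercepts = np.array([-1.3848, -0.3718, 1.3749])
slopes = np.array([0.1365, 0.8936, -0.3886])     # A2, A5, B8

def drinkcat_probs(a2, a5, b8):
    x = np.array([a2, a5, b8])
    # Cumulative probabilities: P(=3), P(>=2), P(>=1).
    cum = 1.0 / (1.0 + np.exp(-(intercepts + slopes @ x)))
    p3, p2up, p1up = cum
    # Individual category probabilities by subtraction: P(0), P(1), P(2), P(3).
    return np.array([1 - p1up, p1up - p2up, p2up - p3, p3])

print(np.round(drinkcat_probs(1, 1, 0), 3))  # highest risk: [0.083 0.258 0.247 0.412]
print(np.round(drinkcat_probs(0, 0, 1), 3))  # lowest risk:  [0.272 0.41  0.173 0.145]
```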

9. Multinomial logistic regression

If the categories of the dependent variable do not have a natural ordering, the single set of slopes of ordinal logistic regression must be replaced with multiple sets of slopes, producing what is called a multinomial logistic regression model. The number of slopes for each independent variable will always be one less than the number of categories of the dependent variable, matching the number of intercepts. In effect, a multinomial logistic regression model looks like a set of binary logistic regressions. The advantages of fitting one multinomial model over fitting several binary models are that there is one likelihood ratio χ² for the fit of the entire model, and there is opportunity to test hypotheses about equality of slopes.

Even if the categories of the dependent variable are ordered, a multinomial model may be necessary to provide a better fit to the data than the ordinal logistic regression may afford. One way to investigate this possibility is to fit both ordinal logistic and multinomial logistic models and calculate the difference in the likelihood ratio χ². Another way is to look for a test of the "proportional odds assumption" in the ordinal logistic regression. If the χ² statistic from that test is significant, there is evidence that one set of slopes is insufficient, and the correct model is multinomial rather than ordinal.

In the ordinal logistic regression above, we remarked that the test for the proportional odds assumption had quite a large χ². Its value was 80.4140 with 6 degrees of freedom and p-value less than 0.0001. This is evidence that we will obtain a better-fitting model by using multinomial logistic regression, even though the categories of DRINKCAT are naturally ordered.
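The likelihood-ratio comparison of the two fits can be carried out in a few lines. This sketch (an illustration; the χ² values and degrees of freedom are taken from the two model outputs discussed in this section) forms the difference and refers it to a χ² distribution:

```python
from scipy.stats import chi2

# Model chi-square values: multinomial (9 df) vs ordinal (3 df).
diff = 415.5538 - 336.6806         # 78.8732, on 9 - 3 = 6 df
p = chi2.sf(diff, df=6)            # upper-tail probability
print(round(diff, 4), p < 0.0001)  # 78.8732 True
```

The difference is close to the score test's 80.4140 on 6 df, as the text notes.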

Following is the result of running a multinomial logistic regression on the Harvard College Drinking Study data:

Response Profile

Ordered Value   DRINKCAT   Total Frequency
1               0.00       2039
2               1.00       4009
3               2.00       2258
4               3.00       2365

Logits modeled use DRINKCAT = 0.00 as the reference category.


Testing Global Null Hypothesis (Beta = 0)

Test               χ²         df   Pr > χ²
Likelihood ratio   415.5538   9    <0.0001

Type 3 Analysis of Effects

Effect   df   Wald χ²    Pr > χ²
A2       3     68.8958   <0.0001
A5       3    266.3693   <0.0001
B8       3     87.2105   <0.0001

Analysis of Maximum Likelihood Estimates

Parameter   DRINKCAT   df   Estimate   Standard Error   Wald χ²    Pr > χ²
Intercept   1.00       1     0.8503    0.0373           519.4807   <0.0001
Intercept   2.00       1     0.0800    0.0430             3.4632   0.0627
Intercept   3.00       1     0.0731    0.0430             2.8970   0.0887
A2          1.00       1    −0.3120    0.0572            29.7276   <0.0001
A2          2.00       1     0.0514    0.0632             0.6597   0.4167
A2          3.00       1     0.0609    0.0629             0.9353   0.3335
A5          1.00       1     0.3462    0.1049            10.8943   0.0010
A5          2.00       1     0.9812    0.1064            85.0242   <0.0001
A5          3.00       1     1.3407    0.1027           170.4737   <0.0001
B8          1.00       1    −0.5220    0.0702            55.2565   <0.0001
B8          2.00       1    −0.5646    0.0811            48.5013   <0.0001
B8          3.00       1    −0.6699    0.0819            66.9267   <0.0001

Odds Ratio Estimates

Effect   DRINKCAT   Point Estimate   95% Wald Confidence Limits
A2       1.00       0.732            0.654   0.819
A2       2.00       1.053            0.930   1.192
A2       3.00       1.063            0.939   1.202
A5       1.00       1.414            1.151   1.736
A5       2.00       2.668            2.165   3.286
A5       3.00       3.822            3.125   4.674
B8       1.00       0.593            0.517   0.681
B8       2.00       0.569            0.485   0.666
B8       3.00       0.512            0.436   0.601


The likelihood ratio χ² for this model is 415.5538 with 9 degrees of freedom. We can compare this value with the χ² value of 336.6806 with 3 degrees of freedom for the ordinal logistic regression model. Their difference, 415.5538 − 336.6806 = 78.8732, has 9 − 3 = 6 degrees of freedom, in close agreement with the χ² of 80.4140 with 6 df that tested the proportional odds assumption in the ordinal logistic regression. Thus, from two different perspectives, we are led to the conclusion that multinomial logistic regression is more appropriate for our data.

Interpreting the model coefficients is best done by thinking of the multinomial model as a collection of three binomial models. Each binomial model estimates probabilities of the DRINKCAT values 1, 2, and 3 relative to value 0. For example, consider the three log-odds ratio coefficients −0.5220, −0.5646, and −0.6699 for the variable B8, living in alcohol-free housing. These correspond to odds ratios of 0.593, 0.569, and 0.512, meaning that for those living in alcohol-free housing the odds of being in any of the positive drinking categories (DRINKCAT = 1, 2, or 3) are about 40–50% less than being in the non-drinking category. The big drop in risk, of about 40%, occurs between DRINKCAT = 0 and DRINKCAT = 1, with small further drops in risk of being in DRINKCAT = 2 and DRINKCAT = 3. A formal test of the hypothesis that the three slopes are equal is not statistically significant, leading us to conclude that alcohol-free housing has a "threshold" protective effect of reducing the risk of all three levels of drinking by the same amount.

By contrast, the three log-odds coefficients for the variable A5 show a large monotone increasing pattern, from 0.3462 to 0.9812, and finally to 1.3407, meaning that fraternity or sorority membership shows a pattern of increasing risks of the more severe drinking behaviors. A formal test of the null hypothesis that the three slopes are equal to each other rejects the null hypothesis. Based on the final odds ratio of exp(1.3407) = 3.822, members of Greek organizations have almost four times the risk of extreme binge drinking that non-members do.

Finally, the coefficients for the variable A2 show that males have reduced odds of moderate drinking, and somewhat greater odds of each level of binge drinking. A formal test of the null hypothesis that the three slopes are equal rejects the null hypothesis.
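The slope-to-odds-ratio conversion used throughout this section is simply OR = exp(β). As a check (an illustration; the coefficients are copied from the multinomial output above), the A5 and B8 rows of the odds ratio table can be reproduced:

```python
import math

# Multinomial slopes for DRINKCAT = 1, 2, 3 (vs DRINKCAT = 0).
a5_slopes = [0.3462, 0.9812, 1.3407]    # fraternity/sorority membership
b8_slopes = [-0.5220, -0.5646, -0.6699] # alcohol-free housing

print([round(math.exp(b), 3) for b in a5_slopes])  # [1.414, 2.668, 3.822]
print([round(math.exp(b), 3) for b in b8_slopes])  # [0.593, 0.569, 0.512]
```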

10. Probit regression

Earlier we introduced the logit function as a means of "opening up" the probability range of [0,1] to (−∞, +∞). This can also be done by using the inverse standard normal function. Although the logit and the inverse normal (sometimes called normit or probit) have very different behavior in the extreme tails, in actual practice they yield nearly the same predicted probabilities. If we use the inverse standard normal function in place of the logit for our insecticide data, the result is:

Testing Global Null Hypothesis (Beta = 0)

Test               χ²         df   Pr > χ²
Likelihood ratio   327.4488   2    <0.0001

Parameter   df   Estimate   Standard Error   χ²         Pr > χ²
Intercept   1    −3.2286    0.2759           136.9310   <0.0001
Dose        1     0.6037    0.0471           164.6159   <0.0001
Male        1     0.5126    0.1686             9.2450   0.0024

The likelihood ratio χ² is very close to that from the additive logit model, 326.8997, but the estimated model coefficients are very different. This difference is due to the fact that the standard normal density function has standard deviation equal to 1, whereas the logistic density function has standard deviation equal to π/√3 = 1.8138. Despite the differences in model coefficients, probabilities estimated from the two models come out nearly the same.

For example, in the logit model, we found that at a dose of five units, the logit of a male mosquito dying is −5.7816 + 1.0698 × 5 + 0.9132 = 0.4806. This corresponded to a probability of 0.6179. From our current model, at a dose of five units, the Z-score, or probit, of a mosquito dying is −3.2286 + 0.6037 × 5 + 0.5126 = 0.3025. This corresponds to a probability of 0.6189, almost identical with 0.6179.
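This comparison is easy to reproduce. The sketch below (an illustration; the coefficients are copied from the two fitted models in the text) converts the linear predictors into probabilities through the logistic and normal CDFs:

```python
import math
from scipy.stats import norm

# Predicted probability of death at dose 5 for a male mosquito, logit vs probit.
logit = -5.7816 + 1.0698 * 5 + 0.9132   # = 0.4806 (logit model linear predictor)
z     = -3.2286 + 0.6037 * 5 + 0.5126   # = 0.3025 (probit model linear predictor)

p_logit  = 1 / (1 + math.exp(-logit))   # logistic CDF
p_probit = norm.cdf(z)                  # standard normal CDF
print(round(p_logit, 4), round(p_probit, 4))  # 0.6179 0.6189
```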

The Hosmer–Lemeshow goodness-of-fit test yields a χ² value of 4.1243 with 7 degrees of freedom and p-value = 0.7654. Not only is the test non-significant, the χ² value itself is virtually identical with the Hosmer–Lemeshow χ² of 4.3001 that we calculated from the logistic regression model.

Given that in most cases there is little difference between the probability estimates from logit and probit models, most users of logistic regression prefer the logit model as simpler to understand. Its coefficients can be directly interpreted in terms of odds ratios, and the odds ratios can be multiplied to estimate overall risk.

Most statistical software provides a third link function option called the complementary log–log function (or "Gompertz" function). This also opens up the probability range of [0,1] to (−∞, +∞), but it does so in a non-symmetric fashion. It is useful for special purposes but should not be considered a natural alternative to logit and probit models. In particular, it does not fit our example data as well as the logit and probit models do.
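The asymmetry is easy to see at p = 0.5, where the symmetric links return 0 but the complementary log–log link does not. A brief illustration (not from the original chapter):

```python
import math

# Two link functions mapping (0, 1) onto the real line.
def logit(p):
    return math.log(p / (1 - p))          # symmetric about p = 0.5

def cloglog(p):
    return math.log(-math.log(1 - p))     # complementary log-log, asymmetric

print(round(logit(0.5), 4), round(cloglog(0.5), 4))  # 0.0 -0.3665
```

The logit of 0.5 is exactly 0, while the complementary log–log of 0.5 is about −0.367, reflecting the link's skew toward one tail.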

11. Logistic regression in case–control studies

Risk factors for rare disorders are frequently determined through case–control studies. For example, there is considerable literature on the interrelationship of asbestos exposure, smoking, and lung cancer. Since the prevalence of lung cancer is relatively low, most studies have compared lung cancer patients with comparable cancer-free control subjects. Many of these studies have involved shipyard workers because of asbestos exposure in shipbuilding. Smoking was fairly common among these blue-collar employees, so smoking was investigated along with asbestos exposure as another risk factor.

Table 6
Case–control data showing the relationship of asbestos exposure and smoking with lung cancer

Exposure(s)   A−S−   A+S−   A−S+   A+S+

Cases         50     11     313    84
Controls      203    35     270    45


Since odds ratios are marginal-independent, they do not change if a factor is over- or under-represented. Thus, we can use "caseness" as a response variable in logistic regression without introducing bias in our estimates of the logistic regression slope coefficients. The only coefficient that is biased is the intercept term, and we ordinarily are not interested in estimating it anyway.

Table 6 contains data from Blot et al. (1978), as reproduced in Lee (2001). A full logistic regression model would attempt to predict lung cancer (caseness) from asbestos exposure, smoking, and the interaction between asbestos and smoking. In this full model, the interaction term is not statistically significant, and the final model is the additive model estimated in the following output:

Testing Global Null Hypothesis (Beta = 0)

Test               χ²         df   Pr > χ²
Likelihood ratio   118.2789   2    <0.0001

Analysis of Maximum Likelihood Estimates

Parameter   df   Estimate   Standard Error   Wald χ²   Pr > χ²
Intercept   1    −1.4335    0.1476           94.3817   <0.0001
Smoke       1     1.5902    0.1626           95.6905   <0.0001
Asbestos    1     0.4239    0.1769            5.7437   0.0165

Odds Ratio Estimates

Effect     Point Estimate   95% Wald Confidence Limits
Smoke      4.904            3.566   6.745
Asbestos   1.528            1.080   2.161

We conclude that smoking is by far the more potent risk factor, increasing the risk of lung cancer, in the sense of odds, almost five-fold. Asbestos exposure is also a statistically significant risk factor, but it increases risk by only about 53%. Furthermore, the number 1.528 is a point estimate of the odds ratio, but the 95% confidence


interval for the true asbestos-exposure odds ratio ranges from a low of 1.08 to a high of 2.16.
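Because Table 6 is grouped data, the maximum likelihood fit can be reproduced with a short Newton–Raphson (IRLS) loop rather than specialized software. The sketch below is an illustration under the additive model described in the text, not the original analysis code; the cell counts are from Table 6:

```python
import numpy as np

# Design matrix: intercept, smoke, asbestos for the four exposure cells
# A-S-, A+S-, A-S+, A+S+ (Table 6, Blot et al., 1978).
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
cases  = np.array([50.0, 11.0, 313.0, 84.0])
totals = cases + np.array([203.0, 35.0, 270.0, 45.0])  # cases + controls per cell

beta = np.zeros(3)
for _ in range(25):                        # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))    # fitted probability of caseness per cell
    w = totals * p * (1.0 - p)             # binomial weights
    grad = X.T @ (cases - totals * p)      # score vector
    hess = X.T @ (X * w[:, None])          # observed information matrix
    beta = beta + np.linalg.solve(hess, grad)

print(np.round(beta, 4))  # ≈ [-1.4335  1.5902  0.4239], matching the output above
```

Exponentiating the slopes recovers the reported odds ratios of about 4.9 for smoking and 1.53 for asbestos.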

Case–control studies are normally performed for disorders that are rare in the general population. For rare disorders, the odds ratio is a good approximation (a slight overestimate, always) to relative risk. Therefore, in these settings, it is very common to treat the odds ratio as a surrogate for relative risk. For example, suppose a disease has a base prevalence of exactly 1%, and a certain risk factor changes its prevalence to 2%. By definition, the relative risk is exactly equal to 2. The odds ratio is (0.02/0.98)/(0.01/0.99) = 2.02 and therefore serves as an excellent approximation to relative risk.
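The rare-disease approximation in this example can be verified in two lines (an illustration of the arithmetic, not from the original chapter):

```python
# Rare-disorder example: base prevalence 1%, exposed prevalence 2%.
p0, p1 = 0.01, 0.02
rr = p1 / p0                                    # relative risk, exactly 2
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))  # odds ratio, slightly above 2
print(rr, round(odds_ratio, 4))  # 2.0 2.0204
```

As the prevalences shrink, 1 − p approaches 1 in both numerator and denominator, so the odds ratio converges to the relative risk.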

More complex case–control designs involve matching individual cases with controls, either on a one-to-one basis or even a one-to-many basis. Logistic regression can also be used to analyze these more complicated designs. Chapters 6 and 7 of Hosmer and Lemeshow (2000) contain a wealth of information regarding the use of logistic regression in case–control studies.

References

Blot, W.J., Harrington, J.M., Toledo, A., Hoover, R., Heath, C.W., Fraumeni, J.F. (1978). Lung cancer after employment in shipyards during World War II. The New England Journal of Medicine 299, 620–624.

Cox, D.R., Snell, E.J. (1989). The Analysis of Binary Data, 2nd ed. Chapman & Hall, London.

Hosmer, D.W., Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics A10, 1043–1069.

Hosmer, D.W., Lemeshow, S. (2000). Applied Logistic Regression, 2nd ed. Wiley, New York.

Lee, P.N. (2001). Relation between exposure to asbestos and smoking jointly and the risk of lung cancer. Occupational and Environmental Medicine 58, 145–153.

McNeil, B.J., Keller, E., Adelstein, S.J. (1975). Primer on certain elements of medical decision making. The New England Journal of Medicine 293, 211–215.

Nagelkerke, N.J.D. (1991). A note on a general definition of the coefficient of determination. Biometrika 78, 691–692.

SAS Institute Inc. (2004). SAS/STAT 9.1 User's Guide. SAS Institute, Cary, NC.

Wechsler, H. (2001). Harvard School of Public Health College Alcohol Study. ICPSR, Ann Arbor.