Residual Plots for Log Odds Ratio Regression Models
Author(s): Masaaki Tsujitani and Gary G. Koch
Source: Biometrics, Vol. 47, No. 3 (Sep., 1991), pp. 1135-1141
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2532665


BIOMETRICS 47, 1135-1141 September 1991

SHORTER COMMUNICATIONS

EDITOR: NIELS KEIDING

Residual Plots for Log Odds Ratio Regression Models

Masaaki Tsujitani

Department of Literature, Kobe Women's University, Hyogo 654, Japan

and

Gary G. Koch

Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599-7400, U.S.A.

SUMMARY

This article describes graphical diagnostic methods for log odds ratio regression models. To study the effects of an additional covariate on log odds ratio regression analysis, three types of residual plots based on weighted least squares (WLS) are discussed: (i) added variable plot (partial regression plot), (ii) partial residual plot, and (iii) augmented partial residual plot.

These plots provide diagnostic procedures for identifying heterogeneity of error variances, outliers, or nonlinearity of the model. They are especially useful for clarifying whether including a covariate as a linear term is appropriate, or whether quadratic or other nonlinear transformations are preferable. A well-known data set for case-control studies is analyzed to illustrate the residual plots.

1. Introduction

In ordinary linear regression, the selection of possible transformations for an explanatory variable was originally proposed by Box and Tidwell (1962). More recently, diagnostic plots for this purpose have been suggested by several authors (Zelen, 1971; Breslow, 1976; Davis, 1985; Breslow and Cologne, 1986; Liang, Beaty, and Cohen, 1986; Chatterjee and Hadi, 1986; and O'Brien, 1988). Noteworthy issues in fitting models are (i) to detect outliers, (ii) to assess the fit of a model to data, and (iii) to suggest a transformation of an explanatory variable. These issues can be addressed through the added variable plot, the partial residual plot, and the augmented partial residual plot.

For logistic regression models, Landwehr, Pregibon, and Shoemaker (1984) and Fowlkes (1987) proposed graphical methods for diagnostic checking for goodness of fit. Wang (1985) extended the added variable plot to the generalized linear model (GLM). A residual plot for detection of nonlinearity in GLM is discussed by Wang (1987). This article presents diagnostic plots for an additional covariate in log odds ratio regression models.

Key words: Added variable plot; Augmented partial residual plot; Diagnostics; Log odds ratio regression models; Partial residual plot; Weighted least squares.

For a series of $2 \times 2$ contingency tables, let $n_{ijk}$ ($i = 1, 2$; $j = 1, 2$; $k = 1, 2, \ldots, K$) denote the observed cell frequency, and $p_{ijk}$ the expected cell probability for the $ij$th cell of the $k$th table. We assume that the $K$ pairs $(n_{11k}, n_{21k})$ are independent binomial samples with denominators $(n_{1\cdot k}, n_{2\cdot k})$ and success probabilities $(p_{11k}, p_{21k})$, where $n_{i\cdot k} = n_{i1k} + n_{i2k}$. In case-control studies, $n_{11k}$ and $n_{21k}$ are the numbers of exposed individuals among $n_{1\cdot k}$ cases and $n_{2\cdot k}$ controls in the $k$th stratum.

The odds ratio $\psi_k$ of the $k$th table is defined by

$$\psi_k = \frac{p_{11k}\, p_{22k}}{p_{12k}\, p_{21k}}, \qquad k = 1, 2, \ldots, K.$$

A log odds ratio regression model for the $\psi_k$ is

$$\ln \psi_k(\beta) = x_k'\beta, \tag{1.1}$$

where $x_k$ is a vector of covariates (with a constant term) corresponding to the $k$th stratum and $\beta$ is a vector of regression coefficients. When the $n_{ijk}$ are large (e.g., nearly all $n_{ijk} > 5$), the weighted least squares (WLS) methods in Grizzle, Starmer, and Koch (1969) provide best asymptotic normal estimators $\hat\beta$ for $\beta$ in the fitting of model (1.1). These estimates (which are asymptotically equivalent to maximum likelihood estimates) are obtained via

$$\hat\beta = (X'S^{-1}X)^{-1}X'S^{-1}g, \tag{1.2}$$

where $g = (g_1, \ldots, g_K)'$ is the vector of sample estimates $g_k = \ln\{(n_{11k}n_{22k})/(n_{12k}n_{21k})\}$ of the $\ln \psi_k$, $S$ is the diagonal, estimated covariance matrix of $g$ with respective diagonal elements $s_k = \sum_{i=1}^{2}\sum_{j=1}^{2}(1/n_{ijk})$, and $X = (x_1, \ldots, x_K)'$ is the matrix of covariates for all $K$ strata. Also, as noted in Fleiss (1981) and Gart and Zweifel (1967), bias in $\hat\beta$ can be reduced in situations where some of the $n_{ijk}$ are small by computing the $g_k$ and $s_k$ in terms of the $(n_{ijk} + .5)$ rather than the $n_{ijk}$.
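As an illustration of (1.2), the following is a minimal sketch in Python with NumPy. The function names `log_odds_ratios` and `wls_fit` and the `add_half` flag are hypothetical and introduced here only for the example; the counts are the first three strata of Table 1, and the intercept-only design matrix is an assumption chosen to keep the example small.

```python
import numpy as np

def log_odds_ratios(n11, n12, n21, n22, add_half=False):
    """Sample log odds ratios g_k and their estimated variances s_k."""
    cells = [np.asarray(c, dtype=float) for c in (n11, n12, n21, n22)]
    if add_half:
        # Small-sample adjustment: use n_ijk + .5 in place of n_ijk.
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    g = np.log((a * d) / (b * c))
    s = 1 / a + 1 / b + 1 / c + 1 / d
    return g, s

def wls_fit(X, g, s):
    """Equation (1.2) with S = diag(s); returns beta_hat and its covariance."""
    w = 1.0 / s
    xtwx = X.T @ (w[:, None] * X)          # X' S^-1 X
    beta = np.linalg.solve(xtwx, X.T @ (w * g))
    return beta, np.linalg.inv(xtwx)

# First three strata of Table 1 (i = case/control, j = X-rayed yes/no).
g, s = log_odds_ratios(n11=[7, 7, 4], n12=[28, 28, 49],
                       n21=[1, 2, 1], n22=[34, 33, 52])
X = np.ones((3, 1))                        # intercept-only model (assumption)
beta, cov = wls_fit(X, g, s)
```

With an intercept-only design, (1.2) reduces to the inverse-variance weighted mean of the $g_k$, which gives a quick sanity check on the matrix formula.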

Predicted values for the $\ln \psi_k$ are obtained from $\hat\beta$ by

$$\hat g = X\hat\beta = X(X'S^{-1}X)^{-1}X'S^{-1}g = Hg. \tag{1.3}$$

The $(K \times K)$ matrix $H = X(X'S^{-1}X)^{-1}X'S^{-1}$ is usually known as the hat matrix. The $(K \times K)$ matrix $(I - H)$ is called the residual matrix because the residuals can be written $(g - \hat g) = (I - H)g$. The WLS residual statistic to test the goodness of fit of the model (1.1) is

$$Q_W = (g - \hat g)'S^{-1}(g - \hat g) = g'S^{-1}g - \hat\beta'(X'S^{-1}X)\hat\beta, \tag{1.4}$$

where $(X'S^{-1}X)^{-1}$ is the estimated covariance matrix for $\hat\beta$. The aspects of WLS model fitting summarized here are also applicable to more general log-linear models with the structure $F(P) = A \ln P = X\beta$, where $P$ is a vector of expected cell probabilities for a set of independent multinomial distributions, $\ln$ transforms $P$ to logarithms, and $A$ is a specification matrix for a set of log-linear contrasts like the $\ln \psi_k$.
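The hat matrix of (1.3) and the statistic of (1.4) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the helper names are hypothetical, and the values of `g` and `s` are made up for the example rather than taken from the article.

```python
import numpy as np

def hat_matrix(X, s):
    """H = X (X'S^-1 X)^-1 X'S^-1 with S = diag(s), as in (1.3)."""
    xtw = X.T / s                          # X'S^-1 (column k scaled by 1/s_k)
    return X @ np.linalg.solve(xtw @ X, xtw)

def wls_residual_statistic(X, g, s):
    """Q_W of (1.4) and its degrees of freedom K - rank(X)."""
    resid = g - hat_matrix(X, s) @ g       # (I - H) g
    df = int(len(g) - np.linalg.matrix_rank(X))
    return float(resid @ (resid / s)), df

# Made-up values for K = 5 strata (not taken from the article).
g = np.array([0.9, 0.4, 0.7, 0.1, -0.2])
s = np.array([0.30, 0.25, 0.40, 0.20, 0.35])
X = np.column_stack([np.ones(5), np.arange(5.0)])

H = hat_matrix(X, s)
q_w, df = wls_residual_statistic(X, g, s)
```

Because $H$ is a projection (in the metric defined by $S^{-1}$), it satisfies $HH = H$ and $HX = X$; both properties, and the agreement of the two forms of $Q_W$ in (1.4), can be checked numerically on any small example.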

2. Residual Plots for Log Odds Ratio Regression Models

Suppose we are considering the inclusion of a covariate $Z$ in the model (1.1). The corresponding expanded model is

$$\ln \psi = X\beta + \alpha Z, \tag{2.1}$$

where $\ln \psi = (\ln \psi_1, \ldots, \ln \psi_K)'$. Three types of graphical plots can be used to study the effect of $Z$.


2.1. Added Variable Plot

Let $H_X$ be the hat matrix for $X$. Multiplying both sides of (2.1) by $(I - H_X)$, it follows that

$$(I - H_X)\ln \psi = (I - H_X)\alpha Z \tag{2.2}$$

because $(I - H_X)X = 0$. From (2.2), it follows that

$$E_A(e_{F\cdot X}) = e_{Z\cdot X}\,\alpha, \tag{2.3}$$

where $e_{F\cdot X} = (I - H_X)g$ denotes the residual in $g$ after fitting $X$, $e_{Z\cdot X} = (I - H_X)Z$ denotes the residual in $Z$ after fitting $X$, and $E_A(\cdot)$ denotes asymptotic expectation. Equation (2.3) suggests that a graph of $e_{F\cdot X}$ versus $(I - H_X)Z$ should be linear, passing through the origin with slope $\alpha$. We refer to this graph as the added variable plot.

Added variable plots can also indicate which of the data points are most influential on the magnitude of $\hat\alpha$ (Cook and Weisberg, 1982; Atkinson, 1985; and Chatterjee and Hadi, 1988).
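The coordinates of the added variable plot can be computed directly from (2.3). The sketch below uses hypothetical helper names and simulated values (none of them from the article). It also illustrates a property that follows from the algebra of WLS: the weighted slope through the origin of $e_{F\cdot X}$ on $e_{Z\cdot X}$, with weights $1/s_k$, coincides with the WLS estimate $\hat\alpha$ from the expanded model (2.1).

```python
import numpy as np

def residual_maker(X, s):
    """I - H_X for the WLS hat matrix H_X = X (X'S^-1 X)^-1 X'S^-1."""
    xtw = X.T / s
    return np.eye(len(s)) - X @ np.linalg.solve(xtw @ X, xtw)

def added_variable_coords(X, Z, g, s):
    """Return (e_ZX, e_FX): horizontal and vertical plot coordinates."""
    M = residual_maker(X, s)
    return M @ Z, M @ g

# Simulated values for K = 8 strata (assumptions, not the article's data).
rng = np.random.default_rng(0)
K = 8
X = np.column_stack([np.ones(K), np.arange(K, dtype=float)])
Z = rng.normal(size=K)
s = rng.uniform(0.2, 0.5, size=K)
g = 0.3 + 0.1 * X[:, 1] - 0.4 * Z + rng.normal(scale=np.sqrt(s))

e_zx, e_fx = added_variable_coords(X, Z, g, s)

# Weighted slope through the origin of e_FX on e_ZX (weights 1/s_k).
slope = np.sum(e_zx * e_fx / s) / np.sum(e_zx**2 / s)
```

Plotting `e_fx` against `e_zx` with any plotting library then gives the added variable plot; points far from the bulk of the `e_zx` values are the ones with most leverage on the slope.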

2.2. Partial Residual Plot

A partial residual plot, first proposed by Ezekiel and Fox (1959), and developed by Larsen and McCleary (1972) and Wood (1973), enables evaluation of both the extent of the deviation from linearity for the relationship with a covariate and the direction of the linearity. The partial residual plot can be defined by

$$\bar g + \hat\alpha(Z - \bar Z) + e_{F\cdot X,Z} \quad \text{versus} \quad Z,$$

where $\bar Z = \sum_k Z_k / K$, $\bar g = \sum_k g_k / K$, and the slope $\hat\alpha$ is the WLS estimate of $\alpha$ for log odds ratio regression models involving all covariates (i.e., $X$ and $Z$). The $e_{F\cdot X,Z}$ are the residuals from the fit of this model. This plot can sometimes shed light on nonlinearity and suggest the need for transformation of an explanatory variable (Chatterjee and Hadi, 1988). In practice, it is worthwhile to introduce appropriately transformed variables (e.g., inclusion of a quadratic term) into the model whenever this plot is nonlinear.
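A minimal sketch of the partial residual computation follows. The function names and the numerical values are hypothetical, chosen only to illustrate the definition above; in the example the responses are constructed to be exactly linear in $X$ and $Z$, so the residual term vanishes and the plotted points fall on a straight line with slope $\hat\alpha$.

```python
import numpy as np

def wls_coef(D, g, s):
    """WLS coefficients (D'S^-1 D)^-1 D'S^-1 g with S = diag(s)."""
    dtw = D.T / s
    return np.linalg.solve(dtw @ D, dtw @ g)

def partial_residuals(X, Z, g, s):
    """Vertical coordinates g_bar + alpha_hat (Z - Z_bar) + e_{F.X,Z}."""
    XZ = np.column_stack([X, Z])
    coef = wls_coef(XZ, g, s)
    resid = g - XZ @ coef                  # residuals from the fit on X and Z
    alpha_hat = coef[-1]
    return g.mean() + alpha_hat * (Z - Z.mean()) + resid

# Made-up example with K = 6 strata (not the article's data).
x = np.arange(6.0)
X = np.column_stack([np.ones(6), x])
Z = np.array([9.0, 8.0, 6.0, 5.0, 7.0, 2.0])
s = np.array([0.30, 0.25, 0.40, 0.20, 0.35, 0.28])
g = 0.5 + 0.1 * x - 0.2 * Z                # exactly linear in X and Z

pr = partial_residuals(X, Z, g, s)         # plot pr against Z
```

Curvature in a plot of `pr` against `Z` would be the signal, described above, that a transformed version of the covariate should be considered.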

2.3. Augmented Partial Residual Plot

To focus more attention on the issue of linearity, suppose that the true relationship between $\ln \psi$ and $Z$ is potentially nonlinear. Then one must determine the nonlinear function $\tau(Z)$ to which $Z$ should be transformed. Hence the following model is postulated:

$$\ln \psi = X\beta + \tau(Z). \tag{2.4}$$

Mallows (1986) recommended using a single added quadratic term for $\tau(Z)$ rather than, for example, both quadratic and cubic terms or a piecewise linear function. See Denby (unpublished Ph.D. dissertation, University of Michigan, 1986) for further discussion to support this choice.

We consider the expansion of the linear model (2.1) to include a quadratic term, i.e.,

$$\ln \psi = X\beta + \alpha_1 Z + \alpha_2 Z^2. \tag{2.5}$$

The augmented partial residual plot (Mallows, 1986) displays

$$\bar g + \hat\alpha_1(Z - \bar Z) + \hat\alpha_2\{(Z - \bar Z)^2 - \text{ave}\} + e_{F\cdot X,Z,Z^2} \quad \text{versus} \quad Z,$$

where $\hat\alpha_1$ and $\hat\alpha_2$ are WLS estimates of $\alpha_1$ and $\alpha_2$ in (2.5), $e_{F\cdot X,Z,Z^2} = (I - H_{X,Z,Z^2})g$ denotes the residual in $g$ after fitting $X$, $Z$, and $Z^2$, and $\text{ave} = \text{ave}(Z - \bar Z)^2$.
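The augmented partial residuals can be sketched in the same way. The code below follows the definition as stated in the text, taking $\hat\alpha_1$ and $\hat\alpha_2$ from a WLS fit of (2.5) and "ave" as the unweighted mean of $(Z - \bar Z)^2$; the latter reading, the function names, and the numerical values are assumptions made for illustration.

```python
import numpy as np

def wls_coef(D, g, s):
    """WLS coefficients (D'S^-1 D)^-1 D'S^-1 g with S = diag(s)."""
    dtw = D.T / s
    return np.linalg.solve(dtw @ D, dtw @ g)

def augmented_partial_residuals(X, Z, g, s):
    """Vertical coordinates of the augmented partial residual plot."""
    D = np.column_stack([X, Z, Z**2])      # design for model (2.5)
    coef = wls_coef(D, g, s)
    a1, a2 = coef[-2], coef[-1]
    resid = g - D @ coef                   # e_{F.X,Z,Z^2}
    zc = Z - Z.mean()
    ave = np.mean(zc**2)                   # assumed: unweighted average
    return g.mean() + a1 * zc + a2 * (zc**2 - ave) + resid

# Made-up example with K = 6 strata and an exactly quadratic response.
X = np.ones((6, 1))
Z = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 9.0])
s = np.array([0.30, 0.25, 0.40, 0.20, 0.35, 0.28])
g = 1.0 + 0.5 * Z - 0.05 * Z**2

apr = augmented_partial_residuals(X, Z, g, s)   # plot apr against Z
```

When the quadratic coefficient is close to zero, these coordinates differ little from the ordinary partial residuals, which is consistent with the remark by Mallows (1986) quoted in Section 3.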


3. Numerical Example

We consider here data from retrospective studies of the relationship between obstetric radiation and childhood cancer (Kneale, 1971; and Breslow and Day, 1980), as shown in the fifteen 2 x 2 tables in Table 1.

Cases and controls (corresponding to either dying or not dying of childhood cancer, respectively) were each classified according to whether the mother had been X-rayed during gestation. The covariates are year of birth and age at death. Note that X-rayed or not is response $j$, and case versus control is $i$. [Since several $n_{ijk}$ are small, one might add .5 to all cell counts in Table 1, as noted in Section 1, or apply the residual plots in Section 2 in terms of maximum likelihood estimates for the parameters in the models (1.1), (2.1), and (2.5).]

We investigate three models for residual plots:

Model I: $\ln \psi_k = \beta_0 + \beta_1 X_k$;

Model II: $\ln \psi_k = \beta_0 + \beta_1 X_k + \beta_2 Z_k$;

Model III: $\ln \psi_k = \beta_0 + \beta_1 X_k + \beta_2 Z_k + \beta_3 Z_k^2$;

where $X_k$ and $Z_k$ index year of birth and age at death, respectively.

Figure 1 is an added variable plot for Model II relative to Model I. There are no influential points that are well separated from the rest of the data.

Table 1. Oxford childhood cancer survey data

| Year of birth | Age at death | X-rayed | Cases | Controls |
|---|---|---|---|---|
| 1946 | 9 | Yes | 7 | 1 |
| | | No | 28 | 34 |
| 1946 | 8 | Yes | 7 | 2 |
| | | No | 28 | 33 |
| 1948 | 6 | Yes | 4 | 1 |
| | | No | 49 | 52 |
| 1948 | 5 | Yes | 4 | 7 |
| | | No | 57 | 54 |
| 1949 | 7 | Yes | 8 | 5 |
| | | No | 36 | 39 |
| 1950 | 8 | Yes | 9 | 4 |
| | | No | 39 | 44 |
| 1950 | 5 | Yes | 12 | 7 |
| | | No | 41 | 46 |
| 1953 | 6 | Yes | 6 | 8 |
| | | No | 47 | 45 |
| 1953 | 4 | Yes | 11 | 13 |
| | | No | 64 | 62 |
| 1954 | 7 | Yes | 4 | 6 |
| | | No | 23 | 21 |
| 1955 | 2 | Yes | 8 | 5 |
| | | No | 50 | 53 |
| 1956 | 3 | Yes | 21 | 9 |
| | | No | 50 | 62 |
| 1960 | 4 | Yes | 3 | 2 |
| | | No | 42 | 43 |
| 1960 | 1 | Yes | 10 | 4 |
| | | No | 43 | 49 |
| 1961 | 3 | Yes | 4 | 4 |
| | | No | 41 | 41 |

Let us now examine the modification of $Z_k$, either by including a quadratic term as in Model III or by its nonlinear transformation. Figures 2 and 3 show a partial residual plot and an augmented partial residual plot, respectively. Mallows (1986) pointed out that the augmented partial residual plots are usually almost identical to the corresponding partial residual plots if a linear model is adequate. When a nonlinear effect is present, however, the augmented partial residual plot may often better exhibit strong evidence of the quadratic effect. A limitation of the added variable plot and partial residual plot is that they only weakly reveal a need for the inclusion of a quadratic term in the model.

In this example, following the recommendation in Mallows (1986), we have made the arbitrary decision to use a single added quadratic term rather than higher-order terms or an interaction term between age at death and year of birth.

Figure 1. Added variable plot. [Horizontal axis: $e_{Z\cdot X}$.]

Figure 2. Partial residual plot. [Horizontal axis: age at death.]

Figure 3. Augmented partial residual plot. [Horizontal axis: age at death.]

The WLS residual statistics (1.4) to test the goodness of fit of Models I, II, and III are

Model I: $Q_W = 16.706$ (d.f. = 13);

Model II: $Q_W = 16.640$ (d.f. = 12);

Model III: $Q_W = 10.008$ (d.f. = 11).

Addition of both linear and quadratic terms of age at death to Model I significantly improves goodness of fit by $Q_W(\text{Model I}) - Q_W(\text{Model III}) = 6.698$ (d.f. = 2, $P < .05$).
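The comparison above can be sketched for the Table 1 data as follows. This is a hedged illustration rather than a reproduction of the published analysis: the article does not state here whether any cell adjustment was applied, so the values this code produces are not claimed to match the reported statistics exactly. The names `TABLE1`, `q_w`, and `chi2_sf_2df` are hypothetical. The last lines use the fact that the upper-tail probability of a chi-square variable with 2 d.f. is $\exp(-x/2)$, applied to the published difference of 6.698.

```python
import math
import numpy as np

# Table 1: year of birth, age at death, cases X-rayed, cases not X-rayed,
#          controls X-rayed, controls not X-rayed.
TABLE1 = np.array([
    [1946, 9, 7, 28, 1, 34], [1946, 8, 7, 28, 2, 33],
    [1948, 6, 4, 49, 1, 52], [1948, 5, 4, 57, 7, 54],
    [1949, 7, 8, 36, 5, 39], [1950, 8, 9, 39, 4, 44],
    [1950, 5, 12, 41, 7, 46], [1953, 6, 6, 47, 8, 45],
    [1953, 4, 11, 64, 13, 62], [1954, 7, 4, 23, 6, 21],
    [1955, 2, 8, 50, 5, 53], [1956, 3, 21, 50, 9, 62],
    [1960, 4, 3, 42, 2, 43], [1960, 1, 10, 43, 4, 49],
    [1961, 3, 4, 41, 4, 41],
], dtype=float)

year, age, n11, n12, n21, n22 = TABLE1.T
g = np.log(n11 * n22 / (n12 * n21))        # sample log odds ratios
s = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22  # their estimated variances

def q_w(X, g, s):
    """WLS residual statistic (1.4) and its degrees of freedom."""
    sw = 1 / np.sqrt(s)                    # whitening weights S^(-1/2)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], g * sw, rcond=None)
    r = (g - X @ beta) * sw
    return float(r @ r), len(g) - X.shape[1]

one = np.ones_like(year)
x = year - year.mean()     # centring does not change Q_W; it helps conditioning
z = age - age.mean()
models = {"I": [one, x], "II": [one, x, z], "III": [one, x, z, z**2]}
results = {name: q_w(np.column_stack(cols), g, s)
           for name, cols in models.items()}

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 d.f."""
    return math.exp(-x / 2)

p_published = chi2_sf_2df(16.706 - 10.008)  # difference reported in the text
```

Because the three models are nested and share the same weights, the computed $Q_W$ values cannot increase from Model I to Model III, and the degrees of freedom are 13, 12, and 11 for $K = 15$ strata.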

RÉSUMÉ

This article describes graphical diagnostic methods for log odds ratio regression models. To study the effects of an additional covariate on the regression of log odds ratios, three types of residual plots based on weighted least squares are discussed: (i) the added variable plot (partial regression plot), (ii) the partial residual plot, and (iii) the augmented partial residual plot.

These plots provide diagnostic procedures for identifying heterogeneity of error variances, outlying points, or departures from linearity of the model. They are particularly useful for deciding whether including a covariate as a linear term is appropriate, or whether quadratic or other nonlinear transformations are preferable. A well-known case-control study is analyzed to illustrate the residual plots.

REFERENCES

Atkinson, A. C. (1985). Plots, Transformations, and Regression. Oxford: Clarendon Press.

Box, G. E. P. and Tidwell, P. W. (1962). Transformations of the independent variables. Technometrics 4, 531-550.

Breslow, N. (1976). Regression analysis of the log odds ratio: A method for retrospective studies. Biometrics 32, 409-416.

Breslow, N. and Cologne, J. (1986). Methods of estimation in log odds ratio regression models. Biometrics 42, 949-954.

Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research, Volume 1: The Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer.

Chatterjee, S. C. and Hadi, A. S. (1986). Influential observations, high leverage points, and outliers in linear regression. Statistical Science 1, 379-416.

Chatterjee, S. C. and Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. New York: Wiley.

Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. London: Chapman and Hall.

Davis, L. J. (1985). Generalization of the Mantel-Haenszel estimator to nonconstant odds ratios. Biometrics 41, 487-495.

Ezekiel, M. and Fox, K. A. (1959). Methods of Correlation and Regression Analysis. New York: Wiley.

Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions, 2nd edition. New York: Wiley.

Fowlkes, E. B. (1987). Some diagnostics for binary logistic regression via smoothing. Biometrika 74, 503-515.

Gart, J. J. and Zweifel, J. R. (1967). On the bias of various estimators of the logit and its variance, with application to grouped bioassay. Biometrika 54, 181-187.

Grizzle, J. E., Starmer, C. F., and Koch, G. G. (1969). Analysis of categorical data by linear models. Biometrics 25, 489-504.

Kneale, G. W. (1971). Problems arising in estimating from retrospective survey data the latent period of juvenile cancers initiated by obstetric radiography. Biometrics 27, 563-590.

Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984). Graphical methods for assessing logistic regression models. Journal of the American Statistical Association 79, 61-83.

Larsen, W. A. and McCleary, S. J. (1972). The use of partial residual plots in regression analysis. Technometrics 14, 781-790.

Liang, K. Y., Beaty, T. H., and Cohen, B. H. (1986). Application of odds ratio regression models for assessing familial aggregation from case-control studies. American Journal of Epidemiology 124, 678-683.

Mallows, C. L. (1986). Augmented partial residuals. Technometrics 28, 313-319.

O'Brien, C. M. (1988). Discussion of Mallows (1986). Technometrics 30, 135-136.

Wang, P. C. (1985). Adding a variable in generalized linear models. Technometrics 27, 273-276.

Wang, P. C. (1987). Residual plots for detecting nonlinearity in generalized linear models. Technometrics 29, 435-438.

Wood, F. S. (1973). The use of individual effects and residuals in fitting equations to data. Technometrics 15, 677-695.

Zelen, M. (1971). The analysis of several 2 x 2 contingency tables. Biometrika 58, 129-137.

Received August 1989; revised February and July 1990; accepted November 1990.
