Logistic Regression and the new: Residual Logistic Regression

F. Berenice Baez-Revueltas

Wei Zhu

Outline

1. Logistic Regression

2. Confounding Variables

3. Controlling for Confounding Variables

4. Residual Linear Regression

5. Residual Logistic Regression

6. Examples

7. Discussion

8. Future Work


1. Logistic Regression Model

In 1938, Ronald Fisher and Frank Yates suggested the logit link for regression with a binary response variable.

$$\ln\frac{P(Y=1\mid x)}{P(Y=0\mid x)} = \ln\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)} = \ln\frac{\pi(x)}{1-\pi(x)} = \beta_0 + \beta_1 x$$

The left-hand side is ln(Odds of Y = 1 | x).

A popular model for categorical response variables

- The logistic regression model is the most popular model for binary data.
- It is generally used to study the relationship between a binary response variable and a group of predictors, which can be either continuous or categorical:
  Y = 1 (true, success, YES, etc.) or
  Y = 0 (false, failure, NO, etc.)
- The logistic regression model can be extended to a categorical response variable with more than two categories. The resulting model is sometimes referred to as the multinomial logistic regression model (in contrast to the ‘binomial’ logistic regression for a binary response variable).

More on the rationale of the logistic regression model

Consider a binary response variable Y = 0 or 1 and a single predictor variable x. We want to model E(Y|x) = P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function of the predictor:

$$\ln\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)} = \beta_0 + \beta_1 x$$

This model can be rewritten as

$$P(Y=1\mid x) = \frac{\exp(\beta_0+\beta_1 x)}{1+\exp(\beta_0+\beta_1 x)}$$

E(Y|x) = P(Y=1|x) * 1 + P(Y=0|x) * 0 = P(Y=1|x) is bounded between 0 and 1 for all values of x. The following linear model may violate this condition sometimes:

$$P(Y=1\mid x) = \beta_0 + \beta_1 x$$
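The boundedness argument can be checked numerically. The sketch below is only an illustration: the coefficient values and the grid of x values are hypothetical, picked so that the linear model leaves the interval [0, 1] while the logistic form does not.

```python
import math

def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse of the logit: exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -2.0, 0.5

xs = [-10, 0, 4, 10]

# Linear model P(Y=1|x) = beta0 + beta1*x: values can fall outside [0, 1].
linear_probs = [beta0 + beta1 * x for x in xs]

# Logistic model: the same linear predictor passed through the inverse logit.
logistic_probs = [inv_logit(beta0 + beta1 * x) for x in xs]

for x, lin, lgt in zip(xs, linear_probs, logistic_probs):
    print(f"x={x:>4}  linear={lin:6.2f}  logistic={lgt:.4f}")
```

Applying `logit` to any of the logistic probabilities recovers the linear predictor, which is the sense in which the two forms of the model are equivalent.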

More on the properties of the logistic regression model

In the simple logistic regression, the regression coefficient $\beta_1$ has the interpretation that it is the log of the odds ratio of a success event (Y=1) for a unit change in x:

$$\ln\frac{P(Y=1\mid x+1)}{P(Y=0\mid x+1)} - \ln\frac{P(Y=1\mid x)}{P(Y=0\mid x)} = [\beta_0+\beta_1(x+1)] - [\beta_0+\beta_1 x] = \beta_1$$

For multiple predictor variables, the logistic regression model is

$$\ln\frac{P(Y=1\mid x_1,x_2,\ldots,x_k)}{P(Y=0\mid x_1,x_2,\ldots,x_k)} = \beta_0+\beta_1x_1+\cdots+\beta_kx_k$$
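The log odds ratio interpretation can be verified directly. This is a minimal sketch with hypothetical coefficients: whatever the starting value of x, the ratio of the odds at x+1 to the odds at x equals exp(beta1).

```python
import math

def prob(x, beta0, beta1):
    """P(Y=1|x) under the simple logistic regression model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(x, beta0, beta1):
    """Odds of success P(Y=1|x) / P(Y=0|x)."""
    p = prob(x, beta0, beta1)
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -1.0, 0.7

# Odds ratio for a unit increase in x, evaluated at several starting points.
odds_ratios = [odds(x + 1, beta0, beta1) / odds(x, beta0, beta1)
               for x in (-2.0, 0.0, 3.5)]

print(odds_ratios, math.exp(beta1))
```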

Logistic Regression, SAS Procedure

http://www.ats.ucla.edu/stat/sas/output/SAS_logit_output.htm

Proc Logistic

This page shows an example of logistic regression with footnotes explaining the output. The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies. The response variable is high writing test score (honcomp), where a writing score greater than or equal to 60 is considered high and a score below 60 is considered low. We explore its relationship with gender (female), reading test score (read), and science test score (science). The dataset used in this page can be downloaded from http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm.

    /* Create the binary response: honcomp = 1 if write >= 60, else 0 */
    data logit;
      set "c:\temp\hsb2";
      honcomp = (write >= 60);
    run;

    /* DESCENDING makes PROC LOGISTIC model P(honcomp = 1) */
    proc logistic data=logit descending;
      model honcomp = female read science;
    run;


Logistic Regression, SAS Output

2. Confounding Variables

- Correlated with both the dependent and independent variables
- Represent a major threat to the validity of inferences on cause and effect
- Add to multicollinearity
- Can lead to over- or underestimation of an effect; they can even change the direction of the conclusion
- They add error in the interpretation of what may be an accurate measurement

For a variable to be a confounder, it needs to:

- Have a relationship with the exposure
- Have a relationship with the outcome even in the absence of the exposure (not an intermediary)
- Not be on the causal pathway
- Be unevenly distributed across the comparison groups

[Diagram: Exposure, Outcome, and a Third variable]

[Diagram: Birth order, Down Syndrome, and Maternal age as the third variable]

Maternal age is correlated with birth order and is a risk factor for Down Syndrome, even if birth order is low.

[Diagram: Alcohol, Lung Cancer, and Smoking as the third variable]

Smoking is correlated with alcohol consumption and is a risk factor for lung cancer, even for persons who don’t drink alcohol.

[Diagram labels: Confounding, No Confounding]



3. Controlling for Confounding Variables

In study designs

- Restriction
- Random allocation of subjects to study groups to attempt to even out unknown confounders
- Matching subjects using potential confounders

In data analysis

- Stratified analysis using the Mantel-Haenszel method to adjust for confounders
- Case-control studies
- Cohort studies
- Restriction (still possible, but it means throwing data away)
- Model fitting using regression techniques
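As an illustration of the stratified analysis mentioned above, the following sketch computes the Mantel-Haenszel pooled odds ratio, sum(a*d/n) / sum(b*c/n), over 2x2 tables. The counts are hypothetical and were chosen so that the crude odds ratio suggests an effect while each stratum, and the pooled estimate, show none.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a, b = exposed cases / exposed non-cases
    c, d = unexposed cases / unexposed non-cases
    """
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical counts: one 2x2 table per level of the confounder.
strata = [
    (10, 90, 20, 180),   # stratum 1
    (60, 40, 15, 10),    # stratum 2
]

# Crude table: add the strata cell by cell, ignoring the confounder.
crude = tuple(sum(cells) for cells in zip(*strata))

crude_or = odds_ratio(*crude)                      # about 2.92
stratum_ors = [odds_ratio(*t) for t in strata]     # 1.0 in each stratum
mh_or = mantel_haenszel_or(strata)                 # 1.0 after adjustment

print(crude_or, stratum_ors, mh_or)
```

In this toy data the exposure is much more common in the high-risk stratum, which is what produces the misleading crude odds ratio.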

Pros and Cons of Controlling Methods

- Matching methods call for subjects with exactly the same characteristics
- Risk of over- or under-matching
- Cohort studies can lead to too much loss of information when excluding subjects
- Some strata might become too thin and thus insignificant, which also causes loss of information
- Regression methods, if well handled, can control for confounding factors

4. Residual Linear Regression

Consider a dependent variable Y and a set of n independent covariates, of which the first k (k<n) are potential confounding factors.

The initial model treats only the confounding variables:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_kX_k + \varepsilon$$

Residuals are calculated from this model. Let

$$\hat{Y} = \hat\beta_0 + \hat\beta_1X_1 + \hat\beta_2X_2 + \cdots + \hat\beta_kX_k$$

The residuals are

$$e_j = Y_j - \hat{Y}_j$$

and are assumed to have the following properties:

- Zero mean
- Homoscedasticity
- Normally distributed
- $\mathrm{Corr}(e_i, e_j) = 0$ for $i \neq j$

This residual will be considered the new dependent variable. That is, the new model to be fitted is

$$Y - \hat{Y} = \beta_0 + \beta_{k+1}X_{k+1} + \beta_{k+2}X_{k+2} + \cdots + \beta_nX_n + \varepsilon$$

which is equivalent to:

$$Y = \hat{Y} + \beta_0 + \beta_{k+1}X_{k+1} + \beta_{k+2}X_{k+2} + \cdots + \beta_nX_n + \varepsilon$$
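The two-stage procedure can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes a single confounder `x1` and a single covariate of interest `x2`, uses made-up data, and relies on the closed-form simple least-squares fit.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Hypothetical toy data: x1 is the confounder, x2 the covariate of interest.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1.2, 1.9, 3.8, 3.9, 6.1, 5.8]

# Stage 1: regress Y on the confounder only and keep the residuals e = Y - Y_hat.
a0, a1 = simple_ols(x1, y)
resid = [yi - (a0 + a1 * xi) for xi, yi in zip(x1, y)]

# Stage 2: the residuals become the new dependent variable.
b0, b1 = simple_ols(x2, resid)

print("stage 1:", a0, a1)
print("stage 2:", b0, b1)
```

Because the first stage includes an intercept, the residuals sum to zero and are orthogonal to the confounder, so the second stage only sees the variation in Y that the confounder did not explain.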

The Usual Logistic Regression Approach to ‘Control for’ Confounders

Consider a binary outcome Y and n covariates, of which the first k (k<n) are potential confounding factors.

The usual way to ‘control for’ these confounding variables is to simply put all n variables in the same model:

$$\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_kX_k + \cdots + \beta_nX_n$$

where $\pi$ is the probability of success.

5. Residual Logistic Regression

- Each subject has a binary outcome Y
- Consider n covariates, where the first k (k<n) are potential confounding factors
- Initial model, with $\pi$ as the probability of success, where only the confounding effect is analyzed:

$$\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_kX_k$$

Method 1

The effect of the confounding variables is retained and plugged into the second-level regression model along with the variables of interest, following the residual linear regression approach.

That is, let

$$T = \hat\beta_1X_1 + \hat\beta_2X_2 + \cdots + \hat\beta_kX_k$$

where the coefficients are those estimated from the initial model. The new model to be fitted is

$$\log\frac{\pi}{1-\pi} = \beta_0 + T + \beta_{k+1}X_{k+1} + \beta_{k+2}X_{k+2} + \cdots + \beta_nX_n$$
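A minimal sketch of Method 1 follows. It makes several assumptions that are not stated on the slides: T is treated as a fixed offset (coefficient 1) in the second-level model, there is one confounder `x1` and one covariate of interest `x2`, the data are made up, and the fit uses a plain Newton-Raphson iteration written for two parameters.

```python
import math

def fit_logistic(x, y, offset=None, iters=25):
    """Newton-Raphson fit of logit(pi) = offset + b0 + b1 * x."""
    if offset is None:
        offset = [0.0] * len(x)
    b0 = b1 = 0.0
    for _ in range(iters):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, oi in zip(x, y, offset):
            p = 1.0 / (1.0 + math.exp(-(oi + b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * delta = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: x1 is the confounder, x2 the covariate of interest.
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
x2 = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# Initial model: confounder only.
a0, a1 = fit_logistic(x1, y)

# Retained confounding effect T (no intercept), one value per subject.
T = [a1 * v for v in x1]

# Second-level model: logit(pi) = g0 + T + g1 * x2, with T held fixed.
g0, g1 = fit_logistic(x2, y, offset=T)

odds_ratio_x2 = math.exp(g1)
print("initial model:", a0, a1)
print("second level:", g0, g1, "odds ratio for x2:", odds_ratio_x2)
```

Holding T fixed means the confounder's coefficient is not re-estimated in the second step, so only the intercept and the coefficients of the variables of interest are free.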

Method 2

Residuals are calculated from the initial model using the Pearson residual (Hosmer and Lemeshow, 1989)

$$Z = \frac{Y - \hat\pi}{\sqrt{\hat\pi(1-\hat\pi)}}$$

where $\hat\pi$ is the estimated probability of success based on the confounding variables alone:

$$\hat\pi = \frac{e^{\hat\beta_0 + \sum_{i=1}^{k}\hat\beta_iX_i}}{1 + e^{\hat\beta_0 + \sum_{i=1}^{k}\hat\beta_iX_i}} \in (0,1)$$

The second-level regression will use this residual as the new dependent variable.

Therefore the new dependent variable is Z, and because it is no longer dichotomous we can apply a multiple linear regression model to analyze the effect of the rest of the covariates.

The new model to be fitted is a linear regression model:

$$Z = \beta_0 + \beta_{k+1}X_{k+1} + \beta_{k+2}X_{k+2} + \cdots + \beta_nX_n + \varepsilon$$
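Method 2 can be sketched in the same spirit. The example below is an illustration under stated assumptions, not the authors' code: one confounder `x1`, one covariate of interest `x2`, made-up data, a two-parameter Newton-Raphson fit for the initial model, and a closed-form simple least-squares fit for the second level.

```python
import math

def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of logit(pi) = b0 + b1 * x."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def pearson_residual(y, p):
    """Pearson residual (y - p) / sqrt(p * (1 - p)) for a binary outcome."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical toy data: x1 is the confounder, x2 the covariate of interest.
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
x2 = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# Initial model: estimated success probability from the confounder alone.
a0, a1 = fit_logistic(x1, y)
p_hat = [1.0 / (1.0 + math.exp(-(a0 + a1 * v))) for v in x1]

# New dependent variable Z: one Pearson residual per subject.
Z = [pearson_residual(yi, pi) for yi, pi in zip(y, p_hat)]

# Second level: ordinary linear regression of Z on the covariate of interest.
c0, c1 = simple_ols(x2, Z)
print("initial model:", a0, a1)
print("second level:", c0, c1)
```

The second-level slope is on the scale of the Pearson residual rather than the log odds, so it cannot be exponentiated into an odds ratio.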

6. Example 1

Data: Low Birth Weight

- Low: Indicator of birth weight less than 2.5 kg
- Age: Mother’s age in years
- Lwt: Mother’s weight in pounds
- Smk: Smoking status during pregnancy
- Ht: History of hypertension

Correlation matrix with alpha=0.05

            Age      Lwt      Smk       Ht
Age      1.0000   0.1738  -0.0444  -0.0158
Lwt               1.0000  -0.0408   0.2369
Smk                        1.0000   0.0134
Ht                                  1.0000

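Potential confounding factor: Age

Model for $\pi$ (probability of low birth weight)

Logistic regression:

$$\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{lwt} + \beta_3\,\text{smk} + \beta_4\,\text{ht}$$

Residual logistic regression

Initial model:

$$\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\text{age}$$

Method 1:

$$T = \hat\beta_1\,\text{age}$$

$$\log\frac{\pi}{1-\pi} = \beta_0 + T + \beta_2\,\text{lwt} + \beta_3\,\text{smk} + \beta_4\,\text{ht}$$

Method 2:

$$Z = \beta_0 + \beta_2\,\text{lwt} + \beta_3\,\text{smk} + \beta_4\,\text{ht} + \varepsilon$$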

Results

             Logistic Regression               RLR Method 1
Variables    Odds ratio  P-value  SE           Odds ratio  P-value  SE
lwt          0.988       0.060    0.0064       0.989       0.078    0.0065
smk          3.480       0.001    0.3576       3.455       0.001    0.3687
ht           3.395       0.053    0.6322       3.317       0.059    0.6342

RLR Method 2

Variables    P-value  SE
lwt          0.077    0.0024
smk          0.000    0.1534
ht           0.042    0.3094

Confounding factors (P-values)

Variables    Logistic regression   Initial model
age          0.055                 0.027

Example 2

Data: Alzheimer patients

- Decline: Whether the subject’s cognitive capabilities deteriorate or not
- Age: Subject’s age
- Gender: Subject’s gender
- MMS: Mini Mental Score
- PDS: Psychometric deterioration scale
- HDT: Depression scale

Correlation matrix with alpha=0.05

              Age   Gender      MMS      PDS      HDT
Age        1.0000   0.0413  -0.2120   0.3327   0.9679
Gender              1.0000  -0.1074   0.2020  -0.1839
MMS                          1.0000   0.3784  -0.1839
PDS                                   1.0000   0.0110
HDT                                            1.0000

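Potential confounding factors: Age, Gender

Model for $\pi$ (probability of declining)

Logistic regression:

$$\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{gender} + \beta_3\,\text{mms} + \beta_4\,\text{pds} + \beta_5\,\text{hdt}$$

Residual logistic regression

Initial model:

$$\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{gender}$$

Method 1:

$$T = \hat\beta_1\,\text{age} + \hat\beta_2\,\text{gender}$$

$$\log\frac{\pi}{1-\pi} = \beta_0 + T + \beta_3\,\text{mms} + \beta_4\,\text{pds} + \beta_5\,\text{hdt}$$

Method 2:

$$Z = \beta_0 + \beta_3\,\text{mms} + \beta_4\,\text{pds} + \beta_5\,\text{hdt} + \varepsilon$$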

Results

             Logistic Regression               RLR Method 1
Variables    Odds ratio  P-value  SE           Odds ratio  P-value  SE
mms          0.717       0.023    0.1451       0.720       0.023    0.1443
pds          1.691       0.001    0.1629       1.674       0.001    0.1565
hdt          1.018       0.643    0.0380       1.018       0.644    0.0377

RLR Method 2

Variables    P-value  SE
mms          <0.001   0.0915
pds          <0.001   0.0935
hdt          0.061    0.0273

Confounding factors (P-values)

Variables    Logistic regression   Initial model
age          0.004                 0.000
gender       0.935                 0.551

7. Discussion

- The usual logistic regression is not designed to control for confounding factors, and there is a risk of multicollinearity.
- Method 1 is designed to control for confounding factors; however, from the given examples we can see that Method 1 yields results similar to the usual logistic regression approach.
- Method 2 appears to be more accurate, with some SEs significantly reduced, and thus the p-values for some regressors are significantly smaller. However, it does not yield odds ratios as Method 1 does.

8. Future Work

- We will further examine the assumptions behind Method 2 to understand why it sometimes yields more significant results.
- We will also study residual longitudinal data analysis, including survival analysis, where one or more time-dependent variables will be taken into account.

Selected References

Menard, S. Applied Logistic Regression Analysis. Series: Quantitative Applications in the Social Sciences. Sage University Series.

Lemeshow, S.; Teres, D.; Avrunin, J.S. and Pastides, H. Predicting the Outcome of Intensive Care Unit Patients. Journal of the American Statistical Association 83, 348-356. 1988.

Hosmer, D.W.; Jovanovic, B. and Lemeshow, S. Best Subsets Logistic Regression. Biometrics 45, 1265-1270. 1989.

Pregibon, D. Logistic Regression Diagnostics. The Annals of Statistics 9(4), 705-724. 1981.

Questions?
