Logistic Regression

Page 1:

Logistic Regression

Page 2:

Example: Survival of Titanic passengers

We want to know if the probability of survival is higher among children.
Outcome (y) = survived/not
Explanatory (x) = age at death

> titan<-read.table("D:/RShortcourse/titanic.txt", sep=",", header=T, na.strings=".")

> titan2<-na.omit(titan)

> log1<-glm(titan2$Survived~titan2$Age,family=binomial("logit"))

Since "logit" is the default link for the binomial family, you can simply use:

> log1<-glm(titan2$Survived~titan2$Age, binomial)

> summary(log1)

Page 3:

Example in R

Call: glm(formula = titan2$Survived ~ titan2$Age, family = binomial("logit"))

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.081428 0.173862 -0.468 0.6395

titan2$Age -0.008795 0.005232 -1.681 0.0928 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1025.6 on 755 degrees of freedom

Residual deviance: 1022.7 on 754 degrees of freedom

AIC: 1026.7

From this output, the fitted model is:

ln(P / (1 - P)) = -0.08143 - 0.00879 * Age
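The fitted equation can be turned into predicted probabilities with the inverse logit, which R provides as plogis(). A sketch using the coefficients reported above:

```r
# Coefficients from the summary above
b0 <- -0.081428   # intercept
b1 <- -0.008795   # slope for Age

# Inverse logit: P = exp(eta) / (1 + exp(eta)); plogis() computes this
p_survive <- function(age) plogis(b0 + b1 * age)

p_survive(5)   # higher predicted survival for a young child...
p_survive(50)  # ...than for an older passenger
```

Because the Age coefficient is negative, the predicted probability declines smoothly with age.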

Page 4:

Plot

> plot(fitted(log1)~titan2$Age, xlab="Age", ylab="P(Survival)", pch=15)

Page 5:

Example: Categorical x

Sometimes logistic regression output is easier to interpret if the x variables are categorical. Suppose we categorize age at death into 4 categories:

> titan2$agecat4 <- ifelse(titan2$Age>35, 1, 0)

> titan2$agecat3 <- ifelse(titan2$Age>=18 & titan2$Age<=35, 1, 0)

> titan2$agecat2 <- ifelse(titan2$Age>=6 & titan2$Age<=17, 1, 0)

> titan2$agecat1 <- ifelse(titan2$Age<6, 1, 0)

Age at Death

Survival   < 6 years   6-17 years   18-35 years   > 35 years
No                 8           30           263          142
Yes               28           30           149          106

Page 6:

Example in R

> log2<-glm(titan2$Survived~titan2$agecat2 + titan2$agecat3 + titan2$agecat4, binomial)

> summary(log2)

Remember: with a set of dummy variables, you always include one fewer variable than the number of categories
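Instead of building dummy variables by hand, R's factor machinery applies this one-fewer-than-categories rule automatically. A sketch with simulated data (the names are illustrative, not from the Titanic file):

```r
set.seed(1)
# A 4-level age-group factor; the first level is the reference category
agegrp <- factor(sample(c("<6", "6-17", "18-35", ">35"), 200, replace = TRUE),
                 levels = c(">35", "<6", "6-17", "18-35"))
y <- rbinom(200, 1, 0.5)  # simulated 0/1 outcome

fit <- glm(y ~ agegrp, family = binomial)
coef(fit)  # intercept plus 4 - 1 = 3 dummy coefficients
```

Changing the reference group is then just a matter of reordering the factor levels, rather than recoding each dummy.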

Page 7:

Example in R

Call (this version uses agecat4, i.e. >35 years, as the reference group):

glm(formula = titan2$Survived ~ titan2$agecat1 + titan2$agecat2 +

titan2$agecat3, family = binomial)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.2924 0.1284 -2.278 0.022734 *

titan2$agecat1 1.5452 0.4209 3.671 0.000242 ***

titan2$agecat2 0.2924 0.2883 1.014 0.310573

titan2$agecat3 -0.2758 0.1643 -1.679 0.093172 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null deviance: 1025.57 on 755 degrees of freedom

Residual deviance: 999.07 on 752 degrees of freedom

AIC: 1007.1

Page 8:

Interpretation: odds ratios

> exp(cbind(OR=coef(log2),confint(log2)))

Waiting for profiling to be done...

OR 2.5 % 97.5 %

(Intercept) 3.5000000 1.67176790 8.2301627

titan2$agecat2 0.2857143 0.10674004 0.7040320

titan2$agecat3 0.1618685 0.06737731 0.3485255

titan2$agecat4 0.2132797 0.08772147 0.4663720

This tells us that the reference group in log2 is passengers <6 years: the intercept is their odds of survival (28/8 = 3.5), and each OR compares another age group to them. All three older groups have significantly lower odds of survival than passengers <6 years, since every confidence interval excludes 1. Equivalently, in the model on the previous slide (reference group >35 years), passengers <6 years have exp(1.545) ≈ 4.7 times the odds of survival of passengers >35 years, while the other two age groups do not differ significantly from the >35 group.

In general, OR = exp(β)

Page 9:

Now let's make it more complicated

Do sex and passenger class predict survival? Let's try a saturated model:

> log3<-glm(titan$Survived~titan$Sex + titan$PClass + titan$Sex:titan$PClass, binomial)

> summary(log3)

Page 10:

Example in R

Call:
glm(formula = titan$Survived ~ titan$Sex + titan$PClass + titan$Sex:titan$PClass, family = binomial)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.7006 0.3443 7.843 4.41e-15 ***

titan$Sexmale -3.4106 0.3793 -8.992 < 2e-16 ***

titan$PClass2nd -0.7223 0.4540 -1.591 0.112

titan$PClass3rd -3.2014 0.3724 -8.598 < 2e-16 ***

titan$Sexmale:titan$PClass2nd -0.3461 0.5274 -0.656 0.512

titan$Sexmale:titan$PClass3rd 1.8827 0.4283 4.396 1.10e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null deviance: 1688.1 on 1312 degrees of freedom

Residual deviance: 1155.9 on 1307 degrees of freedom

AIC: 1167.9

Page 11:

What variables can we consider dropping?

> anova(log3, test="Chisq")

Analysis of Deviance Table

Response: titan$Survived

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev P(>|Chi|)

NULL 1312 1688.1

titan$Sex 1 332.53 1311 1355.5 < 2.2e-16 ***

titan$PClass 2 156.48 1309 1199.0 < 2.2e-16 ***

titan$Sex:titan$PClass 2 43.19 1307 1155.9 4.176e-10 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The small p-values indicate that all terms, including the interaction, are needed to explain the variation in y
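The same kind of sequential drop-in-deviance test can be reproduced end-to-end on simulated data (a sketch; the variable names mimic the Titanic example but the data are made up):

```r
set.seed(42)
n <- 300
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
pclass <- factor(sample(c("1st", "2nd", "3rd"), n, replace = TRUE))

# Build in a strong sex effect so the deviance test has something to find
p <- plogis(ifelse(sex == "female", 1.5, -1))
y <- rbinom(n, 1, p)

full <- glm(y ~ sex + pclass, family = binomial)
anova(full, test = "Chisq")  # terms added sequentially, chi-square p-values
```

Each row tests whether adding that term significantly reduces the residual deviance, exactly as in the table above.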

Page 12:

Goodness of fit statistics

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.7006 0.3443 7.843 4.41e-15 ***

titan$Sexmale -3.4106 0.3793 -8.992 < 2e-16 ***

titan$PClass2nd -0.7223 0.4540 -1.591 0.112

titan$PClass3rd -3.2014 0.3724 -8.598 < 2e-16 ***

titan$Sexmale:titan$PClass2nd -0.3461 0.5274 -0.656 0.512

titan$Sexmale:titan$PClass3rd 1.8827 0.4283 4.396 1.10e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null deviance: 1688.1 on 1312 degrees of freedom

Residual deviance: 1155.9 on 1307 degrees of freedom

AIC: 1167.9

AIC = -2 log L + 2k, where k is the number of estimated parameters; smaller values indicate better fit

Page 13:

Binned residual plot

> install.packages("arm")
> library(arm)
> x<-predict(log3)
> y<-resid(log3)
> binnedplot(x,y)

binnedplot() plots the average residual against the average fitted (predicted) value for each bin, or category; the bins are based on the fitted values. 95% of all values should fall within the dotted lines.
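The idea behind binnedplot() can be sketched in base R: cut the observations into equal-count bins by fitted value and average the residuals within each bin (a simplified sketch of the same idea, not arm's exact implementation):

```r
set.seed(7)
# Simulated logistic data, so the model below is correctly specified
x <- rnorm(500)
y <- rbinom(500, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

fitted_p <- fitted(fit)
res <- y - fitted_p                              # residuals on the response scale
bins <- cut(fitted_p,
            breaks = quantile(fitted_p, probs = seq(0, 1, 0.1)),
            include.lowest = TRUE)               # 10 equal-count bins
avg_res <- tapply(res, bins, mean)               # average residual per bin
avg_fit <- tapply(fitted_p, bins, mean)          # average fitted value per bin
# For a well-specified model, avg_res stays near 0 in every bin
```

Plotting avg_res against avg_fit gives the skeleton of the binned residual plot; arm adds the ±2 standard-error bands.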

Page 14:

Poisson Regression

Using count data

Page 15:

What is a Poisson distribution?
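Briefly: the Poisson distribution describes counts of events occurring independently at a constant rate λ; its defining property, which matters later for overdispersion, is that its mean and variance are both λ. A quick simulation:

```r
set.seed(123)
counts <- rpois(100000, lambda = 4)  # 100,000 draws from Poisson(4)

mean(counts)  # close to 4
var(counts)   # also close to 4: mean and variance are equal
```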

Page 16:

Example: children ever born

The dataset has 70 rows of group-level data on the number of children ever born to women in Fiji:

Number of children ever born
Number of women in the group
Duration of marriage: 1=0-4, 2=5-9, 3=10-14, 4=15-19, 5=20-24, 6=25-29
Residence: 1=Suva (capital city), 2=Urban, 3=Rural
Education: 1=none, 2=lower primary, 3=upper primary, 4=secondary+

> ceb<-read.table("D:/RShortcourse/ceb.dat", header=T)

Page 17:

Page 18:

Poisson regression in R

> ceb1<-glm(y ~ educ + res, offset=log(n), family="poisson", data=ceb)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.43029 0.01795 79.691 <2e-16 ***

educnone 0.21462 0.02183 9.831 <2e-16 ***

educsec+ -1.00900 0.05217 -19.342 <2e-16 ***

educupper -0.40485 0.02956 -13.696 <2e-16 ***

resSuva -0.05997 0.02819 -2.127 0.0334 *

resurban 0.06204 0.02442 2.540 0.0111 *

---

Null deviance: 3731.5 on 69 degrees of freedom

Residual deviance: 2646.5 on 64 degrees of freedom

AIC: Inf

The offset, log(n), accounts for the different population sizes in each area/group; it can be omitted only when all groups are the same size
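The role of the offset can be illustrated with simulated grouped counts: if group i has n_i people and a per-person rate λ, then E(y_i) = n_i·λ, and including log(n_i) as an offset (a term whose coefficient is fixed at 1) lets the model estimate the rate itself. A sketch with made-up group sizes:

```r
set.seed(9)
n_i <- sample(50:500, 80, replace = TRUE)  # 80 groups of varying size
lambda <- 0.3                              # true events per person
y <- rpois(80, n_i * lambda)               # observed group-level counts

fit <- glm(y ~ 1, family = poisson, offset = log(n_i))
exp(coef(fit))  # recovers the per-person rate, close to 0.3
```

Without the offset, the intercept would instead estimate an average count per group, confounding rate with group size.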

Page 19:

Assessing model fit

1. Examine the AIC score – smaller is better

2. Examine the deviance as an approximate goodness-of-fit test. Expect the residual deviance/degrees of freedom to be approximately 1:

> ceb1$deviance/ceb1$df.residual
[1] 41.35172

3. Compare the residual deviance to a χ² distribution:

> pchisq(2646.5, 64, lower=F)
[1] 0

Page 20:

Model fitting: analysis of deviance

Similar to logistic regression, we want to compare the differences in the size of residuals between models:

> ceb1<-glm(y~educ, family="poisson", offset=log(n), data=ceb)

> ceb2<-glm(y~educ+res, family="poisson", offset=log(n), data=ceb)

> 1-pchisq(deviance(ceb1)-deviance(ceb2), df.residual(ceb1)-df.residual(ceb2))

[1] 0.0007124383

Since the p-value is small, there is evidence that adding res explains a significant additional amount of the deviance

Page 21:

Overdispersion in Poisson models

A characteristic of the Poisson distribution is that its mean is equal to its variance. Sometimes the observed variance is greater than the mean; this is known as overdispersion.

Another common problem with Poisson regression is excess zeros: there are more zeros in the data than a Poisson model would predict.

Page 22:

Overdispersion

Use family="quasipoisson" instead of "poisson" to estimate the dispersion parameter. This doesn't change the coefficient estimates, but it rescales their standard errors by the square root of the estimated dispersion.

> ceb2<-glm(y~educ+res, family="quasipoisson", offset=log(n), data=ceb)
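That rescaling can be checked directly against the numbers on the next slide (intercept SE 0.01795 under Poisson, dispersion 37.55359):

```r
phi <- 37.55359          # dispersion parameter reported by quasipoisson
se_poisson <- 0.01795    # intercept SE under family = "poisson"

se_quasi <- se_poisson * sqrt(phi)
se_quasi  # about 0.10999, matching the quasipoisson output
```

With a dispersion of ~37.6, every standard error grows by a factor of ~6.1, which is why several terms lose significance under quasipoisson.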

Page 23:

Poisson vs. quasipoisson

Family = "poisson"

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.43029 0.01795 79.691 <2e-16 ***

educnone 0.21462 0.02183 9.831 <2e-16 ***

educsec+ -1.00900 0.05217 -19.342 <2e-16 ***

educupper -0.40485 0.02956 -13.696 <2e-16 ***

resSuva -0.05997 0.02819 -2.127 0.0334 *

resurban 0.06204 0.02442 2.540 0.0111 *

---

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 3731.5 on 69 degrees of freedom

Residual deviance: 2646.5 on 64 degrees of freedom

Family = "quasipoisson"

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.43029 0.10999 13.004 < 2e-16 ***

educnone 0.21462 0.13378 1.604 0.11358

educsec+ -1.00900 0.31968 -3.156 0.00244 **

educupper -0.40485 0.18115 -2.235 0.02892 *

resSuva -0.05997 0.17277 -0.347 0.72965

resurban 0.06204 0.14966 0.415 0.67988

---

(Dispersion parameter for quasipoisson taken to be 37.55359)

Null deviance: 3731.5 on 69 degrees of freedom

Residual deviance: 2646.5 on 64 degrees of freedom

Page 24:

Models for overdispersion

When overdispersion is a problem, use a negative binomial model, which will adjust the estimates and standard errors.

> install.packages("MASS")

> library(MASS)

> ceb.nb <- glm.nb(y~educ+res+offset(log(n)), data= ceb)

Or:

> ceb.nb<-glm.nb(ceb2)

> summary(ceb.nb)

Page 25:

NB model in R

glm.nb(formula = ceb2, x = T, init.theta = 3.38722121141125, link = log)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.490043 0.160589 9.279 < 2e-16 ***

educnone 0.002317 0.183754 0.013 0.98994

educsec+ -0.630343 0.200220 -3.148 0.00164 **

educupper -0.173138 0.184210 -0.940 0.34727

resSuva -0.149784 0.165622 -0.904 0.36580

resurban 0.055610 0.165391 0.336 0.73670

---

(Dispersion parameter for Negative Binomial(3.3872) family taken to be 1)

Null deviance: 85.001 on 69 degrees of freedom

Residual deviance: 71.955 on 64 degrees of freedom

AIC: 740.55

Theta: 3.387

Std. Err.: 0.583

2 x log-likelihood: -726.555

> ceb.nb$deviance/ceb.nb$df.residual
[1] 1.124297

Page 26:

What if your data looked like…

Page 27:

Zero-inflated Poisson model (ZIP)

If you have a large number of 0 counts…

> install.packages("pscl")

> library(pscl)

> ceb.zip <- zeroinfl(y~educ+res, offset=log(n), data= ceb)
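A quick way to see whether zeros are in excess is to compare the observed number of zeros with the number a plain Poisson model with the same mean would predict; here is a base-R sketch on simulated zero-inflated counts (the check itself needs no pscl):

```r
set.seed(11)
n <- 1000
# 30% structural zeros mixed with Poisson(3) counts
y <- ifelse(runif(n) < 0.3, 0, rpois(n, 3))

obs_zeros <- sum(y == 0)
exp_zeros <- n * dpois(0, mean(y))  # zeros expected under Poisson(mean(y))
c(observed = obs_zeros, expected = round(exp_zeros))
# observed zeros far exceed the Poisson prediction -> consider ZIP
```

When the observed count is far above the expected one, a ZIP (or hurdle) model is worth fitting.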