Logistic Regression
Example: Survival of Titanic passengers
We want to know whether the probability of survival was higher among children.
Outcome (y) = survived / did not survive
Explanatory variable (x) = passenger age
> titan<-read.table("D:/RShortcourse/titanic.txt",sep=",",header=T,na.strings=".")
> titan2<-na.omit(titan)
> log1<-glm(titan2$Survived~titan2$Age,family=binomial("logit"))
Since "logit" is the default link for the binomial family, you can simply write:
> log1<-glm(titan2$Survived~titan2$Age,binomial)
> summary(log1)
Example in R
Call: glm(formula = titan2$Survived ~ titan2$Age, family = binomial("logit"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.081428 0.173862 -0.468 0.6395
titan$Age -0.008795 0.005232 -1.681 0.0928 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1025.6 on 755 degrees of freedom
Residual deviance: 1022.7 on 754 degrees of freedom
AIC: 1026.7
The fitted model, on the logit scale:
ln(P / (1 - P)) = -0.08143 - 0.00879 x
where x is passenger age.
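To turn the fitted log-odds into a survival probability, apply the inverse logit. A minimal sketch using the coefficients from the summary above (plogis() is base R's inverse-logit function):

```r
# Coefficients from the summary above
b0 <- -0.081428   # intercept
b1 <- -0.008795   # Age

# plogis() computes the inverse logit: exp(eta) / (1 + exp(eta))
p_at_age <- function(age) plogis(b0 + b1 * age)

p_at_age(5)    # predicted P(survival) for a 5-year-old
p_at_age(50)   # lower, since the Age coefficient is negative
```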
Plot
> plot(fitted(log1)~titan2$Age, xlab="Age", ylab="P(Survival)", pch=15)
Example: Categorical x
Sometimes logistic regression output is easier to interpret when the x variables are categorical. Suppose we categorize passenger age into four categories:
> titan2$agecat4 <- ifelse(titan2$Age>35, 1, 0)
> titan2$agecat3 <- ifelse(titan2$Age>=18 & titan2$Age<=35, 1, 0)
> titan2$agecat2 <- ifelse(titan2$Age>=6 & titan2$Age<=17, 1, 0)
> titan2$agecat1 <- ifelse(titan2$Age<6, 1, 0)
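The four ifelse() calls can also be written as a single cut() call that builds one factor (a sketch on hypothetical ages; breakpoints chosen to match the categories above):

```r
# Hypothetical ages to illustrate; right = FALSE gives [0,6), [6,18), [18,36), [36,Inf)
age <- c(2, 10, 25, 60)
agecat <- cut(age, breaks = c(0, 6, 18, 36, Inf), right = FALSE,
              labels = c("<6", "6-17", "18-35", ">35"))
agecat  # <6  6-17  18-35  >35
```

A factor like this can go straight into glm(); R builds the dummy variables itself and drops one level as the reference.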
Survival by age category:

Survival   <6 years   6-17 years   18-35 years   >35 years
No         8          30           263           142
Yes        28         30           149           106
Example in R
> log2<-glm(titan2$Survived~titan2$agecat2 + titan2$agecat3 + titan2$agecat4, binomial)
> summary(log2)
Remember: with a set of dummy variables, you always enter one fewer variable than the number of categories.
Example in R
Call:
glm(formula = titan2$Survived ~ titan2$agecat1 + titan2$agecat2 + titan2$agecat3, family = binomial)
(Note: this summary output uses the complementary coding, agecat1-agecat3, so the >35 group is the reference category; the odds-ratio output further down uses agecat2-agecat4, with the under-6 group as the reference.)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2924 0.1284 -2.278 0.022734 *
titan2$agecat1 1.5452 0.4209 3.671 0.000242 ***
titan2$agecat2 0.2924 0.2883 1.014 0.310573
titan2$agecat3 -0.2758 0.1643 -1.679 0.093172 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null deviance: 1025.57 on 755 degrees of freedom
Residual deviance: 999.07 on 752 degrees of freedom
AIC: 1007.1
Interpretation: odds ratios
> exp(cbind(OR=coef(log2),confint(log2)))
Waiting for profiling to be done...
OR 2.5 % 97.5 %
(Intercept) 3.5000000 1.67176790 8.2301627
titan2$agecat2 0.2857143 0.10674004 0.7040320
titan2$agecat3 0.1618685 0.06737731 0.3485255
titan2$agecat4 0.2132797 0.08772147 0.4663720
This tells us that, with the under-6 group as the reference: exp(intercept) = 3.5 is the odds of survival for children under 6 (28 survivors to 8 deaths), and every older age group has significantly lower odds of survival (all three confidence intervals exclude 1). Passengers over 35, for example, have about 0.21 times the odds of survival of children under 6 - equivalently, children under 6 have roughly 4.7 times the odds of passengers over 35.
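With a single categorical predictor, the exponentiated coefficients can be checked directly against the contingency table; a quick sketch using the counts above:

```r
# Counts from the table above: under-6 group, 28 survived, 8 did not
odds_under6 <- 28 / 8            # 3.5, matching exp(intercept)

# Odds ratio for the >35 group relative to under-6
odds_over35 <- 106 / 142
or_over35 <- odds_over35 / odds_under6
or_over35                        # ~0.213, matching the agecat4 row
```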
Recall: OR = exp(b), the exponentiated regression coefficient.
Now let's make it more complicated
Do sex and passenger class predict survival? Let's try a saturated model:
> log3<-glm(titan$Survived~titan$Sex + titan$PClass + titan$Sex:titan$PClass, binomial)
> summary(log3)
Example in R
Call:
glm(formula = titan$Survived ~ titan$Sex + titan$PClass + titan$Sex:titan$PClass, family = binomial)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.7006 0.3443 7.843 4.41e-15 ***
titan$Sexmale -3.4106 0.3793 -8.992 < 2e-16 ***
titan$PClass2nd -0.7223 0.4540 -1.591 0.112
titan$PClass3rd -3.2014 0.3724 -8.598 < 2e-16 ***
titan$Sexmale:titan$PClass2nd -0.3461 0.5274 -0.656 0.512
titan$Sexmale:titan$PClass3rd 1.8827 0.4283 4.396 1.10e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null deviance: 1688.1 on 1312 degrees of freedom
Residual deviance: 1155.9 on 1307 degrees of freedom
AIC: 1167.9
What variables can we consider dropping?
> anova(log3,test="Chisq")
Analysis of Deviance Table
Response: titan$Survived
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL 1312 1688.1
titan$Sex 1 332.53 1311 1355.5 < 2.2e-16 ***
titan$PClass 2 156.48 1309 1199.0 < 2.2e-16 ***
titan$Sex:titan$PClass 2 43.19 1307 1155.9 4.176e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The small p-values indicate that all terms are needed to explain the variation in y.
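Each row of the deviance table is a chi-squared test on the drop in residual deviance when that term is added. For example, the Sex row can be reproduced by hand (values as printed above, so rounded):

```r
# Drop in residual deviance when Sex enters the null model, on 1 df
dev_drop <- 1688.1 - 1355.5
pchisq(dev_drop, df = 1, lower.tail = FALSE)  # vanishingly small, i.e. < 2e-16
```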
Goodness of fit statistics
From the same model summary:
Null deviance: 1688.1 on 1312 degrees of freedom
Residual deviance: 1155.9 on 1307 degrees of freedom
AIC: 1167.9
AIC = -2 log L + 2k, where k is the number of estimated parameters; smaller values indicate better fit.
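Because the outcome here is binary 0/1, the saturated log-likelihood is zero, so the reported AIC is just the residual deviance plus 2k. A quick check against the output above:

```r
resid_dev <- 1155.9  # residual deviance from the summary above
k <- 6               # estimated coefficients: intercept + 5 terms
aic <- resid_dev + 2 * k
aic                  # 1167.9, matching the reported AIC
```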
Binned residual plot
install.packages("arm")
library(arm)
x<-predict(log3)
y<-resid(log3)
binnedplot(x,y)
binnedplot() plots the average residual against the average fitted (predicted) value for each bin, or category. Bins are based on the fitted values, and about 95% of the binned averages should fall within the dotted lines.
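A rough sketch of what binnedplot() computes, on simulated data (the arm package adds the +/- 2 SE bounds for you; everything here is illustrative, not from the Titanic data):

```r
set.seed(1)
# Simulate a simple logistic model and fit it
x <- rnorm(500)
yb <- rbinom(500, 1, plogis(-0.5 + x))
fit <- glm(yb ~ x, family = binomial)

# Bin observations by fitted probability, then average within bins
fitted_p <- fitted(fit)
res <- yb - fitted_p                       # response residuals
bins <- cut(fitted_p, breaks = quantile(fitted_p, 0:10/10),
            include.lowest = TRUE)
avg_fit <- tapply(fitted_p, bins, mean)    # x-coordinate of each point
avg_res <- tapply(res, bins, mean)         # y-coordinate of each point
plot(avg_fit, avg_res, xlab = "Average fitted value",
     ylab = "Average residual")
abline(h = 0, lty = 2)
```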
Poisson Regression
Using count data
What is a Poisson distribution?
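As a quick illustration (not from the course data): a Poisson variable with rate lambda takes non-negative integer values, and its mean and variance both equal lambda. dpois() gives the probability of each count:

```r
lambda <- 2.5
dpois(0:4, lambda)        # P(Y = 0), P(Y = 1), ..., P(Y = 4)

# Mean and variance are both lambda (checked on a long run of draws)
set.seed(42)
ysim <- rpois(1e5, lambda)
mean(ysim)                # approximately 2.5
var(ysim)                 # approximately 2.5
```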
Example: children ever born
The dataset has 70 rows of group-level data on the number of children ever born to women in Fiji:
Number of children ever born (y)
Number of women in the group (n)
Duration of marriage: 1=0-4, 2=5-9, 3=10-14, 4=15-19, 5=20-24, 6=25-29
Residence: 1=Suva (capital city), 2=Urban, 3=Rural
Education: 1=none, 2=lower primary, 3=upper primary, 4=secondary+
> ceb<-read.table("D:/RShortcourse/ceb.dat",header=T)
Poisson regression in R
> ceb1<-glm(y ~ educ + res, offset=log(n), family = "poisson", data = ceb)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.43029 0.01795 79.691 <2e-16 ***
educnone 0.21462 0.02183 9.831 <2e-16 ***
educsec+ -1.00900 0.05217 -19.342 <2e-16 ***
educupper -0.40485 0.02956 -13.696 <2e-16 ***
resSuva -0.05997 0.02819 -2.127 0.0334 *
resurban 0.06204 0.02442 2.540 0.0111 *
---
Null deviance: 3731.5 on 69 degrees of freedom
Residual deviance: 2646.5 on 64 degrees of freedom
AIC: Inf
The offset=log(n) term accounts for the different population sizes in each area/group; it can be omitted only if the data come from same-size populations.
Assessing model fit
1. Examine the AIC score - smaller is better
2. Examine the deviance as an approximate goodness-of-fit test: expect residual deviance / degrees of freedom to be approximately 1
> ceb1$deviance/ceb1$df.residual
[1] 41.35172
3. Compare the residual deviance to a χ² distribution
> pchisq(2646.5, 64, lower=F)
[1] 0
Model fitting: analysis of deviance
As with logistic regression, we want to compare the difference in the size of residuals between models:
> ceb1<-glm(y~educ, family="poisson", offset=log(n), data=ceb)
> ceb2<-glm(y~educ+res, family="poisson", offset=log(n), data=ceb)
> 1-pchisq(deviance(ceb1)-deviance(ceb2), df.residual(ceb1)-df.residual(ceb2))
[1] 0.0007124383
Since the p-value is small, there is evidence that adding res explains a significant additional amount of the deviance.
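The same test can be written with lower.tail = FALSE, which is numerically safer for very small p-values. A sketch with hypothetical numbers standing in for the deviance drop and df difference:

```r
# Hypothetical values standing in for deviance(ceb1) - deviance(ceb2)
# and df.residual(ceb1) - df.residual(ceb2)
dev_drop <- 11.8
df_diff <- 2

1 - pchisq(dev_drop, df_diff)
pchisq(dev_drop, df_diff, lower.tail = FALSE)  # same value, keeps precision near 0
```

anova(ceb1, ceb2, test="Chisq") performs the same nested-model comparison directly.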
Overdispersion in Poisson models
A characteristic of the Poisson distribution is that its mean is equal to its variance. Sometimes the observed variance is greater than the mean; this is known as overdispersion.
Another common problem with Poisson regression is excess zeros: there are more zeros in the data than a Poisson model would predict.
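Overdispersion is easy to see in simulated data: negative-binomial draws can have the same mean as Poisson draws but a much larger variance (an illustrative sketch, not from the course data):

```r
set.seed(7)
lambda <- 4
y_pois <- rpois(1e4, lambda)
# Negative binomial with the same mean but variance mu + mu^2/size
y_nb <- rnbinom(1e4, mu = lambda, size = 1)

mean(y_pois); var(y_pois)   # both near 4
mean(y_nb);   var(y_nb)     # mean near 4, variance near 20

# Excess zeros come along with the overdispersion here
mean(y_pois == 0)           # near dpois(0, 4) = 0.018
mean(y_nb == 0)             # much larger
```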
Overdispersion
Use family="quasipoisson" instead of "poisson" to estimate the dispersion parameter. This doesn't change the coefficient estimates, but it rescales their standard errors.
> ceb2<-glm(y~educ+res, family="quasipoisson", offset=log(n), data=ceb)
Poisson vs. quasipoisson
Family = "poisson":
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.43029 0.01795 79.691 <2e-16 ***
educnone 0.21462 0.02183 9.831 <2e-16 ***
educsec+ -1.00900 0.05217 -19.342 <2e-16 ***
educupper -0.40485 0.02956 -13.696 <2e-16 ***
resSuva -0.05997 0.02819 -2.127 0.0334 *
resurban 0.06204 0.02442 2.540 0.0111 *
---
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3731.5 on 69 degrees of freedom
Residual deviance: 2646.5 on 64 degrees of freedom
Family = "quasipoisson":
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.43029 0.10999 13.004 < 2e-16 ***
educnone 0.21462 0.13378 1.604 0.11358
educsec+ -1.00900 0.31968 -3.156 0.00244 **
educupper -0.40485 0.18115 -2.235 0.02892 *
resSuva -0.05997 0.17277 -0.347 0.72965
resurban 0.06204 0.14966 0.415 0.67988
---
(Dispersion parameter for quasipoisson taken to be 37.55359)
Null deviance: 3731.5 on 69 degrees of freedom
Residual deviance: 2646.5 on 64 degrees of freedom
Models for overdispersion
When overdispersion is a problem, use a negative binomial model, which adjusts both the estimates and the standard errors.
> install.packages("MASS")
> library(MASS)
> ceb.nb <- glm.nb(y~educ+res+offset(log(n)), data= ceb)
Or:
> ceb.nb<-glm.nb(ceb2)
> summary(ceb.nb)
NB model in R
Call: glm.nb(formula = ceb2, x = T, init.theta = 3.38722121141125, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.490043 0.160589 9.279 < 2e-16 ***
educnone 0.002317 0.183754 0.013 0.98994
educsec+ -0.630343 0.200220 -3.148 0.00164 **
educupper -0.173138 0.184210 -0.940 0.34727
resSuva -0.149784 0.165622 -0.904 0.36580
resurban 0.055610 0.165391 0.336 0.73670
---
(Dispersion parameter for Negative Binomial(3.3872) family taken to be 1)
Null deviance: 85.001 on 69 degrees of freedom
Residual deviance: 71.955 on 64 degrees of freedom
AIC: 740.55
Theta: 3.387
Std. Err.: 0.583
2 x log-likelihood: -726.555
> ceb.nb$deviance/ceb.nb$df.residual
[1] 1.124297
What if your data looked like…
Zero-inflated Poisson model (ZIP)
If you have a large number of 0 counts:
> install.packages("pscl")
> library(pscl)
> ceb.zip <- zeroinfl(y~educ+res, offset=log(n), data= ceb)