Download ppt - Lecture 7 GLMs II Binomial Family Olivier MISSA, [email protected]@york.ac.uk Advanced Research Skills

Lecture 7

GLMs II

Binomial Family

Olivier MISSA, [email protected]

Advanced Research Skills

mailto:[email protected]

2

Outline

Continue our Introduction to Generalized Linear Models.

In this lecture:

Illustrate the use of GLMs for

proportion and

binary data.

3

Binary & Proportion data tend to follow the Binomial distribution

The Canonical link of this glm family is the logit function:

The variance reaches a maximum for intermediate values of p and a minimum at either 0% or 100%.

Reminder

)1(

logp

p

ppnVar 1pnMean

4

In R, binary/proportion data can be entered

into a model as a response in three different ways:

as a numeric vector

(holding the number or proportion of successes)

as a logical vector or a factor

(TRUE or the first factor level will be considered successes).

as a two-column matrix(the first column holding the number of successes and

the second column the number of failures).

Three ways to work with binary data

5

Toxicity to tobacco budworm (moth) of different doses of trans-cypermethrin.

Batches of 20 moths (of each sex) were put in contact

for three days with increasing doses of the pyrethroid.

1st Example

Dose (micrograms)

Sex 1 2 4 8 16 32

Male 1 4 9 13 18 20

Female 0 2 6 10 12 16

Number of dead moths out of 20 tested

> (dose <- rep(2^(0:5), 2))[1] 1 2 4 8 16 32 1 2 4 8 16 32> numdead <- c(1,4,9,13,18,20,0,2,6,10,12,16)> (sex <- factor( rep( c("M","F"), c(6,6) ) ))[1] M M M M M M F F F F F FLevels: F M> SF <- cbind(numdead, numalive=20-numdead)

6

1st Example

> modb <- glm(SF ~ sex*dose, family=binomial)> summary(modb)Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.71578 0.32233 -5.323 1.02e-07 ***sexM -0.21194 0.51523 -0.411 0.68082 dose 0.11568 0.02379 4.863 1.16e-06 ***sexM:dose 0.18156 0.06692 2.713 0.00666 ** ---(Dispersion parameter for binomial family taken to be 1)

Null deviance: 124.876 on 11 degrees of freedomResidual deviance: 18.164 on 8 degrees of freedomAIC: 56.275

What is modelled is the proportion of successes

n

i i

iiiii yn

ynynyyyD

1 ˆln)()ˆ/ln(2

7

1st Example> ldose <- log2(dose)> modb2 <- glm(SF ~ sex*ldose, family=binomial)> summary(modb2)Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.9935 0.5527 -5.416 6.09e-08 ***sexM 0.1750 0.7783 0.225 0.822 ldose 0.9060 0.1671 5.422 5.89e-08 ***sexM:ldose 0.3529 0.2700 1.307 0.191 ---(Dispersion parameter for binomial family taken to be 1)


8

1st Example> drop1(modb2, test="Chisq")Single term deletions

Model:SF ~ sex * ldose Df Deviance AIC LRT Pr(Chi)<none> 4.994 43.104 sex:ldose 1 6.757 42.867 1.763 0.1842

> modb3 <- update(modb2, ~. – sex:ldose)> summary(modb3)Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.4732 0.4685 -7.413 1.23e-13 ***sexM 1.1007 0.3558 3.093 0.00198 ** ldose 1.0642 0.1311 8.119 4.70e-16 ***---(Dispersion parameter for binomial family taken to be 1)


9

1st Example> drop1(modb3, test="Chisq")Single term deletions

Model:SF ~ sex + ldose Df Deviance AIC LRT Pr(Chi) <none> 6.757 42.867 sex 1 16.984 51.094 10.227 0.001384 ** ldose 1 118.799 152.909 112.042 < 2.2e-16 ***

> shapiro.test(residuals(modb3), type="deviance")

Shapiro-Wilk normality test

data: residuals(modb3, type = "deviance") W = 0.9666, p-value = 0.8725

10

1st Example

> par(mfrow=c(2,2))> plot(modb3)

11

1st Example> plot( c(0,1) ~ c(1,32), type="n", log="x",

xlab="dose", ylab="Probability")> text(dose, numdead/20, labels=as.character(sex) )> ld <- seq(0,32,0.5)> lines (ld, predict(modb3, data.frame(ldose=log2(ld),

sex=factor(rep("M", length(ld)), levels=levels(sex))),type="response") )

> lines (ld, predict(modb3, data.frame(ldose=log2(ld),sex=factor(rep("F", length(ld)), levels=levels(sex))),type="response"), lty=2, col="red" )

12

1st Example> modbp <- glm(SF ~ sex*ldose,

family=binomial(link="probit"))> AIC(modbp)[1] 41.87836

> modbc <- glm(SF ~ sex*ldose,family=binomial(link="cloglog"))

> AIC(modbc)[1] 43.8663

> AIC(modb3)[1] 42.86747

13

> summary(modb3)

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.4732 0.4685 -7.413 1.23e-13 ***sexM 1.1007 0.3558 3.093 0.00198 ** ldose 1.0642 0.1311 8.119 4.70e-16 ***---

> exp(modb3$coeff) ## careful it may be misleading(Intercept) sexM ldose 0.031019 3.006400 2.898560 ## odds ration: p / (1-p)

> exp(modb3$coeff[1]+modb3$coeff[2]) ## odds for males(Intercept) 0.09325553

1st Example

logit scale

)1(

logp

p

Every doubling of the dose will lead to an increase in the odds of dying over surviving by a factor of 2.899

14

Erythrocyte Sedimentation Rate in a group of patients.

Two groups : <20 (healthy) or >20 (ill) mm/hour

Q: Is it related to globulin & fibrinogen level in the blood ?

2nd Example

> data("plasma", package="HSAUR")> str(plasma)'data.frame': 32 obs. of 3 variables: $ fibrinogen: num 2.52 2.56 2.19 2.18 3.41 2.46 3.22 2.21 ... $ globulin : int 38 31 33 31 37 36 38 37 39 41 ... $ ESR : Factor w/ 2 levels "ESR < 20","ESR > 20": 1 1 ...> summary(plasma) fibrinogen globulin ESR Min. :2.090 Min. :28.00 ESR < 20:26 1st Qu.:2.290 1st Qu.:31.75 ESR > 20: 6 Median :2.600 Median :36.00 Mean :2.789 Mean :35.66 3rd Qu.:3.167 3rd Qu.:38.00 Max. :5.060 Max. :46.00

15

2nd Example> stripchart(globulin ~ ESR, vertical=T, data=plasma, xlab="Erythrocyte Sedimentation Rate (mm/hr)",

ylab="Globulin blood level", method="jitter" )

> stripchart(fibrinogen ~ ESR, vertical=T, data=plasma, xlab="Erythrocyte Sedimentation Rate (mm/hr)",

ylab="Fibrinogen blood level", method="jitter" )

16

2nd Example> mod1 <- glm(ESR~fibrinogen, data=plasma, family=binomial)> summary(mod1)Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -6.8451 2.7703 -2.471 0.0135 *fibrinogen 1.8271 0.9009 2.028 0.0425 *---(Dispersion parameter for binomial family taken to be 1)


> mod2 <- glm(ESR~fibrinogen+globulin, data=plasma,family=binomial)

> AIC(mod2)[1] 28.97111

factor

17

2nd Example> anova(mod1, mod2, test="Chisq")Analysis of Deviance Table

Model 1: ESR ~ fibrinogenModel 2: ESR ~ fibrinogen + globulin Resid. Df Resid. Dev Df Deviance P(>|Chi|)1 30 24.8404 2 29 22.9711 1 1.8692 0.1716

> summary(mod2)Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -12.7921 5.7963 -2.207 0.0273 *fibrinogen 1.9104 0.9710 1.967 0.0491 *globulin 0.1558 0.1195 1.303 0.1925 ---(Dispersion parameter for binomial family taken to be 1)


The difference in terms of Deviance between these models is not significant, which leads us to select the least complex model

18

2nd Example> shapiro.test(residuals(mod1, type="deviance")) Shapiro-Wilk normality test

data: residuals(mod1, type = "deviance") W = 0.6863, p-value = 5.465e-07> par(mfrow=c(2,2))> plot(mod1)