Applied Statistics : Practical 8
This practical deals with mixed effects models (linear, generalized linear and generalized additive). The datasets production.txt, epilepsy.txt and growth.txt can be found on the course webpage.
1. Agricultural production data
An experiment has been performed to study the relationship between the quantity of nitrogen-based fertilizer and agricultural production. The data from the experiment are recorded in the production.txt file and consist of the production per acre (Y), the quantity of fertilizer per acre (N) and the location of the field (Field, labelled A, B, C and D) where each observation was collected.
> data<-read.table("production.txt",header=TRUE)
> head(data)
> attach(data)
We can start by looking at the scatterplot between N and Y, separately for each field
> library(lme4)
> library(lattice)
> xyplot(Y~N|Field)
or giving the points from each field a different color:
> xyplot(Y~N,col=Field)
• Do these plots suggest a possible model for Y?
They suggest a linear relationship between N and Y, which may change with the location.
• Since the interest is in studying the general (average) relationship between N and Y, we want the effects associated with the field location to be random. Fit a model for Y where there is on average a linear relationship between N and Y and random effects associated with the field location for both the intercept and the slope.
We fit the following model:
> model1<-lmer(Y~N+(N|Field),REML=TRUE)
> summary(model1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Y ~ N + (N | Field)
##
## REML criterion at convergence: 107.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7509 -0.5774 -0.1346 0.4367 2.1518
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Field (Intercept) 1.174e+00 1.083593
## N 1.025e-05 0.003202 -1.00
## Residual 6.069e-01 0.779037
## Number of obs: 40, groups: Field, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4.94020 0.58818 8.399
## N 0.97959 0.03863 25.359
##
## Correlation of Fixed Effects:
## (Intr)
## N -0.366
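For reference, the model fitted by lmer(Y~N+(N|Field)) can be written in hierarchical form as follows. This is a sketch in the same style as the hierarchical formulation used later in this practical; the symbols $u_0$, $u_1$, $\sigma_0$, $\sigma_1$ and $\rho$ are notation introduced here for illustration and do not appear in the R output.

```latex
\begin{align*}
Y_i \mid (u_0, u_1) = (u_0^*, u_1^*) &\sim N\bigl(\beta_0 + u_0^* + (\beta_1 + u_1^*)\,N_i,\ \sigma^2\bigr) \quad \text{independently},\\
\begin{pmatrix} u_0 \\ u_1 \end{pmatrix} &\sim N_2\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \sigma_0^2 & \rho\,\sigma_0\sigma_1 \\ \rho\,\sigma_0\sigma_1 & \sigma_1^2 \end{pmatrix}
\right),
\end{align*}
```

where $(u_0, u_1)$ are the random intercept and slope of the field in which observation $i$ was collected. In the summary above, the estimates are $\hat\sigma_0^2 = 1.174$, $\hat\sigma_1^2 = 1.025 \times 10^{-5}$, $\hat\rho = -1.00$ and $\hat\sigma^2 = 0.6069$. An estimated correlation of exactly $-1$ means that the estimated covariance matrix is singular, which is a common sign that the random effects structure is richer than the data can support.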
• Looking at the summary of the fitted model, check whether both random effects (for intercept and slope) are indeed needed in the model. (Hint: you may want to use some code from the last practical.)
The variance of the random effect for the slope is much smaller than that of the random effect for the intercept (also when compared with the size of the corresponding fixed effect). Therefore, we would like to check whether the slope random effect can be removed from the model. We use a parametric bootstrap test to compare the two models:
> model0<-lmer(Y~N+(1|Field))
> likelihood_ratio<-rep(NA,100)
> for (k in 1:100){
+ y<-simulate(model0)[[1]]
+ boot_null<-lmer(y~N+(1|Field),REML=TRUE)
+ boot_model1<-lmer(y~ N+(N|Field),REML=TRUE)
+ likelihood_ratio[k]<- as.numeric(2*(logLik(boot_model1)- logLik(boot_null)))
+ }
> mean(likelihood_ratio>as.numeric(2*(logLik(model1)- logLik(model0))))
## [1] 0.95
A large proportion (0.95) of the bootstrapped test statistics are larger than the likelihood ratio test statistic in the original sample; therefore we have no evidence to reject the null model (the one without the random effect on the slope).
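The parametric bootstrap test used here can also be tried on simulated data, which makes it easy to rerun without the course files. The following is a minimal sketch, not the practical's own solution: the data frame sim, the helper lr_stat and all parameter values are hypothetical, and the data are generated without a random slope so that the null model is true by construction.

```r
library(lme4)

set.seed(1)  # reproducible simulated data

# Hypothetical stand-in for production.txt: 4 fields, 10 observations each,
# generated WITHOUT a random slope (the null model holds by construction).
sim <- data.frame(
  Field = factor(rep(c("A", "B", "C", "D"), each = 10)),
  N     = rep(seq(2, 20, by = 2), times = 4)
)
u <- rnorm(4, mean = 0, sd = 1)  # field-level shifts of the intercept
sim$Y <- 5 + u[as.integer(sim$Field)] + 1 * sim$N + rnorm(40, sd = 0.8)

# Likelihood ratio statistic for two REML fits with the same fixed effects
lr_stat <- function(fit_small, fit_big) {
  as.numeric(2 * (logLik(fit_big) - logLik(fit_small)))
}

fit0 <- lmer(Y ~ N + (1 | Field), data = sim, REML = TRUE)  # null model
fit1 <- lmer(Y ~ N + (N | Field), data = sim, REML = TRUE)  # alternative
lr_obs <- lr_stat(fit0, fit1)

B <- 50  # kept small so the sketch runs quickly; use more replicates in practice
lr_boot <- numeric(B)
for (b in seq_len(B)) {
  sim$y_star <- simulate(fit0)[[1]]  # new response drawn from the fitted null model
  boot0 <- lmer(y_star ~ N + (1 | Field), data = sim, REML = TRUE)
  boot1 <- lmer(y_star ~ N + (N | Field), data = sim, REML = TRUE)
  lr_boot[b] <- lr_stat(boot0, boot1)
}

# Bootstrap p-value: share of simulated statistics at least as large as the observed one
p_boot <- mean(lr_boot >= lr_obs)
p_boot
```

With only four fields, lme4 is likely to report singular fits for the model with a random slope; these are messages rather than errors, and they point to the same conclusion as the summary of the fitted model, namely that the slope variance is hard to distinguish from zero.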
• Using the model with only one random effect, carry out the appropriate test to verify the presence of a linear relationship between N and Y.
We use two models with a random effect for the intercept and different fixed effects. Note that we need to fit the models using ML (the anova command refits them by ML automatically if we forget).
> model0<-lmer(Y~N+(1|Field),REML=FALSE)
> model2<-lmer(Y~1+(1|Field),REML=FALSE)
> anova(model2,model0)
## Data: NULL
## Models:
## model2: Y ~ 1 + (1 | Field)
## model0: Y ~ N + (1 | Field)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## model2 3 216.17 221.24 -105.085 210.17
## model0 4 111.32 118.07 -51.659 103.32 106.85 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is strong evidence (p-value < 2.2e-16) to reject the null model (the one with only the intercept as fixed effect: in this case model2, despite the names) in favor of the model with intercept and slope (model0). Therefore we can conclude that there is evidence of a linear relationship between N and Y.
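As a quick check of the anova table, the likelihood ratio statistic and the AIC values can be recomputed by hand from the printed log-likelihoods. This is a minimal sketch in base R; the variable names are illustrative, and the numbers are copied from the output above rather than extracted from the fitted objects.

```r
# Log-likelihoods and parameter counts copied from the anova() table
loglik_null <- -105.085  # model2: Y ~ 1 + (1 | Field), 3 parameters
loglik_alt  <- -51.659   # model0: Y ~ N + (1 | Field), 4 parameters

chisq <- 2 * (loglik_alt - loglik_null)              # likelihood ratio statistic
df    <- 4 - 3                                       # difference in number of parameters
p_val <- pchisq(chisq, df = df, lower.tail = FALSE)  # chi-squared reference distribution

aic_null <- -2 * loglik_null + 2 * 3                 # AIC = -2 logLik + 2 * (number of parameters)
aic_alt  <- -2 * loglik_alt  + 2 * 4

round(c(Chisq = chisq, AIC_model2 = aic_null, AIC_model0 = aic_alt), 2)
```

The rounded values agree with the Chisq and AIC columns of the table (106.85, 216.17 and 111.32), and p_val falls far below the 2.2e-16 threshold that R prints.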
• Finally, what model would be chosen by the AIC?
> AIC(model1,model0,model2)
## df AIC
## model1 6 119.5153
## model0 4 111.3185
## model2 3 216.1699
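The AIC would select model0 as well. Note, however, that model1 was fitted by REML while model0 and model2 were fitted by ML, so the AIC of model1 is not directly comparable with the other two; for a like-for-like comparison, model1 should be refitted with REML=FALSE.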
• Check the diagnostics plot for the chosen model.
> plot(model0)
[Figure: Pearson residuals, resid(., type = "pearson"), plotted against the fitted values, fitted(.)]
> qqnorm(residuals(model0))
> qqline(residuals(model0),lwd=2)
[Figure: Normal Q-Q plot of the residuals of model0, sample quantiles against theoretical quantiles]
We cannot see any issue in the diagnostic plot.
• Write down the algebraic form of the chosen model in the hierarchical formulation and the estimates of its parameters.
The algebraic form of the selected model is
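$$Y_i \mid u = u^* \sim N(\beta_0 + u^* + \beta_1 N_i,\ \sigma^2) \quad \text{independently},$$
$$u \sim N(0, \sigma_u^2),$$
for $i = 1, \dots, 40$, where $u$ is the random effect of the field in which observation $i$ was collected. From the R output, we get $\hat\beta_0 = 4.94$, $\hat\beta_1 = 0.98$, $\hat\sigma_u^2 = 0.8413$, $\hat\sigma^2 = 0.59$.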
2. Generalized linear mixed effects model
A researcher is studying the effect of a pharmacological treatment on the frequency of epileptic crises. She ran a study involving three different hospitals, each of which randomly assigned its patients to the experimental treatment or to the existing treatment. For each patient, the age and the number of crises during the 8 weeks of treatment (crisis) are recorded. The dataset epilepsy.txt reports this information, together with the hospital that treated the patient and whether they received the experimental treatment (1) or not (0).
> data<-read.table("epilepsy.txt",header=TRUE)
> attach(data)
> head(data)
## crisis hospital age treatment
## 1 7 1 47 0
## 2 6 1 40 0
## 3 3 1 24 0
## 4 7 1 46 0
## 5 3 1 40 0
## 6 2 1 41 0
We want to model the number of crises using a Poisson random variable, including a random effect associated with the hospital. We can do this using the command glmer in the package lme4, and we fit the following model.
> library(lme4)
> ep_model<-glmer(crisis~ age+treatment+(treatment|hospital),family=poisson)
> summary(ep_model)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: crisis ~ age + treatment + (treatment | hospital)
##
## AIC BIC logLik deviance df.resid
## 266.3 278.9 -127.2 254.3 54
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.53364 -0.55036 -0.03809 0.32845 2.28105
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## hospital (Intercept) 0.13843 0.3721
## treatment 0.02212 0.1487 -1.00
## Number of obs: 60, groups: hospital, 3
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.309975 0.364912 3.590 0.000331 ***
## age 0.008552 0.006725 1.272 0.203446
## treatment -0.081164 0.146312 -0.555 0.579075
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) age
## age -0.777
## treatment -0.552 0.104
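In algebraic terms, the glmer call above corresponds to a Poisson model with log link and hospital-specific random intercept and treatment effect. The following is a sketch of that formulation; the indices $i$ (patient) and $j$ (hospital) and the symbols $\mu_{ij}$, $u_{0j}$, $u_{1j}$ and $\Sigma$ are notation introduced here for illustration.

```latex
\begin{align*}
\text{crisis}_{ij} \mid (u_{0j}, u_{1j}) &\sim \text{Poisson}(\mu_{ij}) \quad \text{independently},\\
\log \mu_{ij} &= \beta_0 + \beta_1\,\text{age}_{ij} + \beta_2\,\text{treatment}_{ij} + u_{0j} + u_{1j}\,\text{treatment}_{ij},\\
(u_{0j}, u_{1j})^\top &\sim N_2(0, \Sigma), \qquad j = 1, 2, 3.
\end{align*}
```

Because of the log link, a fixed effect $\beta$ acts multiplicatively on the expected number of crises: a one-unit increase in the covariate multiplies $\mu_{ij}$ by $e^{\beta}$.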
Is the model a good fit for the data? What can you suggest to improve the fit? (Note that you can use the anova command to compare models that differ in the fixed effects in this case as well.)
The fixed effects for age and treatment do not appear to be significantly different from zero, so we can try to remove them from the model.
> ep_model2<-glmer(crisis~ age+(treatment|hospital),family=poisson)
> anova(ep_model2,ep_model)
## Data: NULL
## Models:
## ep_model2: crisis ~ age + (treatment | hospital)
## ep_model: crisis ~ age + treatment + (treatment | hospital)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## ep_model2 5 264.59 275.06 -127.30 254.59
## ep_model 6 266.31 278.88 -127.16 254.31 0.2807 1 0.5963
> ep_model3<-glmer(crisis~ (treatment|hospital),family=poisson)
> anova(ep_model3,ep_model2)
## Data: NULL
## Models:
## ep_model3: crisis ~ (treatment | hospital)
## ep_model2: crisis ~ age + (treatment | hospital)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## ep_model3 4 264.40 272.78 -128.2 256.40
## ep_model2 5 264.59 275.06 -127.3 254.59 1.8129 1 0.1782
> summary(ep_model3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: crisis ~ (treatment | hospital)
##
## AIC BIC logLik deviance df.resid
## 264.4 272.8 -128.2 256.4 56
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.5518 -0.5919 -0.1273 0.4658 2.3665
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## hospital (Intercept) 0.15762 0.3970
## treatment 0.02679 0.1637 -1.00
## Number of obs: 60, groups: hospital, 3
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5432 0.1582 9.754 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What can you conclude about the effect of the treatment?
On average across hospitals, there is no evidence that the treatment affects the number of crises.
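To put this conclusion on the scale of the counts, the treatment coefficient from the first fit (ep_model) can be turned into a rate ratio with an approximate Wald interval. This is a minimal sketch in base R: the variable names are illustrative, the numbers are copied from summary(ep_model), and the Wald interval is only a rough approximation with three hospitals.

```r
# Estimate and standard error of the treatment coefficient, copied from summary(ep_model)
beta_trt <- -0.081164
se_trt   <- 0.146312

# With a log link, exp(coefficient) is the multiplicative effect on the expected count
rate_ratio <- exp(beta_trt)

# Approximate 95% Wald interval, computed on the log scale and then exponentiated
wald_ci <- exp(beta_trt + c(-1, 1) * qnorm(0.975) * se_trt)

round(c(rate_ratio = rate_ratio, lower = wald_ci[1], upper = wald_ci[2]), 2)
```

The estimated rate ratio is about 0.92, and the interval contains 1, which is consistent with the non-significant z test for treatment.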
3. Growth curves
In this item we consider a subset of the Berkeley growth study available in the fda package. We consider the height of 10 boys and 10 girls measured at different ages. We are interested in the relationship between height and age, and we may include random effects associated with the other variables.
> data<-read.table("growth.txt",header=TRUE)
> attach(data)
> head(data)
## height age sex id
## 1 76.2 1.00 F 1
## 2 80.4 1.25 F 1
## 3 83.3 1.50 F 1
## 4 85.7 1.75 F 1
## 5 87.7 2.00 F 1
## 6 96.0 3.00 F 1
> library(lattice)
> xyplot(height~age|sex,col=id)
[Figure: height plotted against age, in separate panels for F and M, with points colored by id]
We do not want to impose a parametric form on the unknown growth curve. Therefore, we are going to fit an additive mixed effects model using the package mgcv.
> library(mgcv)
> growth_model<-gamm(height~s(age,bs="cr"),random=list(id=~age,sex=~age))
The result of the fit is an R object that contains both additive and mixed effects model objects. The additive model object is easy to interpret: it is the fit of the nonparametric part of the model
> summary(growth_model$gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## height ~ s(age, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.397 1.326 103.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 7.786 7.786 539.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.951
## Scale est. = 9.0141 n = 620
and we can plot the regression function
> plot(growth_model$gam)
[Figure: estimated smooth s(age, 7.79) plotted against age]
The interpretation of the mixed effects object is more complicated, because the fixed effects here are the spline basis evaluations.
> summary(growth_model$lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: strip.offset(mf)
## AIC BIC logLik
## 3314.379 3358.676 -1647.189
##
## Random effects:
## Formula: ~Xr - 1 | g
## Structure: pdIdnot
## Xr1 Xr2 Xr3 Xr4 Xr5 Xr6 Xr7
## StdDev: 1.005843 1.005843 1.005843 1.005843 1.005843 1.005843 1.005843
## Xr8
## StdDev: 1.005843
##
## Formula: ~age | id %in% g
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 2.8588490 (Intr)
## age 0.3573938 -0.157
##
## Formula: ~age | sex %in% id %in% g
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 2.8588548 (Intr)
## age 0.3573939 -0.157
## Residual 3.0023441
##
## Fixed effects: y ~ X - 1
## Value Std.Error DF t-value p-value
## X(Intercept) 137.3969 1.327236 599 103.52114 0
## Xs(age)Fx1 104.5599 2.060675 599 50.74064 0
## Correlation:
## X(Int)
## Xs(age)Fx1 0.717
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.242655257 -0.629695609 0.007053112 0.641966182 2.865182976
##
## Number of Observations: 620
## Number of Groups:
## g id %in% g sex %in% id %in% g
## 1 20 20
You can think of the fitted model in the gam part as producing the fixed effects in the mixed effects model part. Should we remove the nonparametric age effect from the model?
No, because the p-value of the test of the hypothesis that all the coefficients of the spline expansion are zero is very small. However, some of the model assumptions are called into question by the diagnostics below.
We can also check the diagnostic plots:
> par(mfrow=c(1,2))
> qqnorm(residuals(growth_model$lme))
> qqline(residuals(growth_model$lme))
> plot(fitted(growth_model$lme),residuals(growth_model$lme))
[Figure: left panel, Normal Q-Q plot of the residuals (sample quantiles against theoretical quantiles); right panel, residuals(growth_model$lme) against fitted(growth_model$lme)]
Do you spot any problem with the model assumptions?
The residual plot shows evidence of heteroskedasticity in the errors.
Use the AIC to compare models with different (or without) random effects. Are the random effects needed in the model?
> growth_model2<-gamm(height~s(age,bs="cr"),random=list(id=~age,sex=~1))
> growth_model3<-gamm(height~s(age,bs="cr"),random=list(id=~age))
> growth_model4<-gamm(height~s(age,bs="cr"),random=list(id=~1))
> growth_model5<-gamm(height~s(age,bs="cr"))
> AIC(growth_model$lme,growth_model2$lme,growth_model3$lme,growth_model4$lme,growth_model5$lme)
## df AIC
## growth_model$lme 10 3314.379
## growth_model2$lme 8 3310.379
## growth_model3$lme 7 3308.379
## growth_model4$lme 5 3615.128
## growth_model5$lme 4 4225.301
Minimizing the AIC selects growth_model3, the model with a random intercept and slope associated with the individuals.
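The model with the smallest AIC, fitted with random=list(id=~age), can be sketched algebraically as follows. The indices $i$ (individual) and $j$ (measurement occasion) and the symbols $f$, $b_{0i}$, $b_{1i}$, $\Psi$ and $\varepsilon_{ij}$ are notation introduced here for illustration; $f$ stands for the smooth function of age represented by the cubic regression spline.

```latex
\begin{align*}
\text{height}_{ij} &= \beta_0 + f(\text{age}_{ij}) + b_{0i} + b_{1i}\,\text{age}_{ij} + \varepsilon_{ij},\\
(b_{0i}, b_{1i})^\top &\sim N_2(0, \Psi), \qquad \varepsilon_{ij} \sim N(0, \sigma^2) \ \text{independently}, \qquad i = 1, \dots, 20.
\end{align*}
```

In this formulation every child shares the same average growth curve $\beta_0 + f(\text{age})$, and the random intercept and slope let each individual's curve be shifted and tilted relative to it. The assumption of a constant $\sigma^2$ is the one that the residual plot calls into question.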