Regression Models in R
Harvard MIT Data Center
May 3, 2013
The Institute for Quantitative Social Science at Harvard University
(Harvard MIT Data Center) Regression Models in R May 3, 2013 1 / 49
Outline
1 Introduction
2 Linear regression
3 Interactions and factors
4 Regression with binary outcomes
5 Multiple imputation
6 Multilevel Modeling
7 Wrap-up
Introduction
Workshop description
This is an intermediate/advanced R course
Appropriate for those with basic knowledge of R
This is not a statistics course!
Learning objectives:
    Learn the R formula interface
    Specify factor contrasts to test specific hypotheses
    Perform model comparisons
    Run and interpret a variety of regression models in R
    Create and use imputed data sets in regression models
Materials and Setup
Lab computer users:
    USERNAME dataclass
    PASSWORD dataclass
    Find class materials at Scratch > DataClass > RStatistics
    Copy this folder to your desktop!
Laptop users:
    Download materials from http://projects.iq.harvard.edu/rtc/r-stats
    Scroll to the bottom of the page and download the r-programming.zip file
    Move it to your desktop and extract
Launch RStudio
Open the RStudio program from the Windows start menu
Open up today’s R script
    In RStudio, go to File => Open Script
    Locate and open the Rstatistics.R script in the Rstatistics folder on your desktop
Go to Tools => Set working directory => To source file location (more on the working directory later)
I encourage you to add your own notes to this file!
Set working directory
It is often helpful to start your R session by setting your working directory so you don’t have to type the full path names to your data and other files
> # set the working directory
> # setwd("~/Desktop/Rstatistics")
>
You might also start by listing the files in your working directory
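Listing the working directory can be sketched with base R alone. The snippet below is a minimal illustration rather than the workshop's own script; the variable names (wd, files, r.scripts) are made up for this example.

```r
# Show the current working directory (base R).
wd <- getwd()
print(wd)

# List the files and folders in the working directory.
files <- list.files()
print(files)

# Optionally restrict the listing to R scripts; pattern is a regular expression.
r.scripts <- list.files(pattern = "\\.R$")
print(r.scripts)
```

If the expected data files do not show up in the listing, the working directory is probably not the folder you think it is.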
Load data
Many of the following examples use data from the 2011 National Health Survey. From the CDC website:
    The National Health Interview Survey (NHIS) has monitored the health of the nation since 1957. NHIS data on a broad range of health topics are collected through personal household interviews. For over 50 years, the U.S. Census Bureau has been the data collection agent for the National Health Interview Survey. Survey results have been instrumental in providing data to track health status, health care access, and progress toward achieving national health objectives.
Key variables include:
See attributes(NH11)$labels for the full variable list.
Linear regression
Linear regression example
Linear regression models can be fit with the lm() function
For example, we can use lm to predict bmi based on:
    number of cigarettes smoked/day (cigsday)
    duration of moderate exercise (modmin)
    hours of sleep (sleep)
> # Fit our regression model
> weight.out <- lm(bmi~cigsday+modmin+sleep, # regression formula
+                  data=NH11) # data set
> # Print the results
> coef(summary(weight.out)) # show regression coefficients table
             Estimate Std. Error t value         Pr(>|t|)
(Intercept) 26.281379    0.38557  68.163 0.00000000000000
cigsday      0.038384    0.01909   2.011 0.04445427219108
modmin      -0.000775    0.00175  -0.443 0.65785096091484
sleep        0.244050    0.03416   7.144 0.00000000000113
The lm class and methods
OK, we fit our model. Now what?
Examine the model object:
> class(weight.out)
[1] "lm"
> names(weight.out)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "na.action"     "xlevels"       "call"          "terms"
[13] "model"
> methods(class = class(weight.out))[1:9]
[1] "add1.lm"           "alias.lm"          "anova.lm"
[4] "case.names.lm"     "confint.lm"        "cooks.distance.lm"
[7] "deviance.lm"       "dfbeta.lm"         "dfbetas.lm"
Use function methods to get more information about the fit
> confint(weight.out)
                2.5 %   97.5 %
(Intercept) 25.525383 27.03738
cigsday      0.000952  0.07582
modmin      -0.004207  0.00266
sleep        0.177065  0.31103
> # summary(weight.out)
>
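To see what confint() is doing, the interval can be rebuilt by hand from the coefficient table. The sketch below uses mtcars, a data set that ships with R, as a stand-in for NH11 (an assumption made so the example runs anywhere); the object names fit, tab, and manual.ci are illustrative only.

```r
# Fit a small linear model on mtcars, a data set that ships with R.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Pull the estimates and standard errors out of the coefficient table.
tab <- coef(summary(fit))
est <- tab[, "Estimate"]
se  <- tab[, "Std. Error"]

# Critical t value for a 95% interval on the residual degrees of freedom.
t.crit <- qt(0.975, df = df.residual(fit))

# Hand-built interval: estimate +/- t.crit * standard error.
manual.ci <- cbind(lower = est - t.crit * se,
                   upper = est + t.crit * se)
print(manual.ci)
print(confint(fit))  # should agree with manual.ci
```

The same recipe applies to weight.out: each interval is the estimate plus or minus a t quantile times its standard error.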
Comparing models
Does our model predict bmi over and above demographics? Fit two models and compare them:
> # Omit missing values (models can only be compared if the data are the same)
> NH.nomiss <- na.omit(NH11[c("bmi", "sex", "age_p",
+                             "cigsday", "modmin", "sleep")])
> # demographics only model
> demog.only <- lm(bmi~sex+age_p,
+                  data=NH.nomiss)
> # demographics plus smoking, exercise, and sleep
> demog.plus <- update(demog.only, . ~ . +cigsday+modmin+sleep)
> # compare using the anova() function
> anova(demog.only, demog.plus)
Analysis of Variance Table

Model 1: bmi ~ sex + age_p
Model 2: bmi ~ sex + age_p + cigsday + modmin + sleep
  Res.Df    RSS Df Sum of Sq    F          Pr(>F)
1   3045 347864
2   3042 341333  3      6531 19.4 0.0000000000019
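The F statistic in this table can be reproduced from the residual sums of squares and degrees of freedom. The following is a worked sketch using only the numbers printed above; the variable names are invented for the illustration.

```r
# Values copied from the anova() table above.
rss.small <- 347864   # residual sum of squares, demographics-only model
rss.large <- 341333   # residual sum of squares, full model
df.small  <- 3045     # residual df, demographics-only model
df.large  <- 3042     # residual df, full model

# Number of added predictors and the drop in residual sum of squares.
df.diff <- df.small - df.large
ss.diff <- rss.small - rss.large

# F = (drop in RSS per added parameter) / (residual mean square of the larger model)
f.stat <- (ss.diff / df.diff) / (rss.large / df.large)

# Upper-tail probability under the F distribution.
p.value <- pf(f.stat, df1 = df.diff, df2 = df.large, lower.tail = FALSE)

print(round(f.stat, 1))
print(p.value)
```

Because the inputs are rounded as printed, the p-value will match the table only approximately.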
Exercise 0: least squares regression
Use the NH11 data set.
1 Use lm to fit a regression model predicting days missed work in past year (wkdayr) from age.
2 Test the hypothesis that vigorous exercise (vigmin) and moderate exercise (modmin) together predict days missed work over and above age.
Interactions and factors
Modeling interactions
Does the effect of smoking depend on exercise?
> #Add the interaction to the model
> weight.int.out <- lm(bmi~cigsday*modmin+sleep,
+                      data=NH11)
> #Show the results
> coef(summary(weight.int.out)) # show regression coefficients table
                Estimate Std. Error t value          Pr(>|t|)
(Intercept)    25.895912   0.407954   63.48 0.000000000000000
cigsday         0.066651   0.021471    3.10 0.001925465603062
modmin          0.005822   0.002892    2.01 0.044163951769538
sleep           0.246864   0.034137    7.23 0.000000000000601
cigsday:modmin -0.000502   0.000175   -2.86 0.004213812367825
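Two points about this model can be illustrated with a short sketch. First, in a formula a*b is shorthand for a + b + a:b; this is shown on mtcars, which ships with R and stands in for NH11 here (an assumption for runnability). Second, with an interaction the slope of cigsday is no longer a single number; the helper cigsday.slope() below is hypothetical and simply plugs in the coefficients printed above.

```r
# The * operator expands to both main effects plus their product term.
# mtcars ships with R and stands in for NH11 here.
fit.star <- lm(mpg ~ wt * hp, data = mtcars)
fit.long <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
print(names(coef(fit.star)))

# With an interaction, the slope of one predictor depends on the other.
# Coefficients copied from the table above:
b.cigsday     <- 0.066651
b.interaction <- -0.000502

# Slope of bmi on cigsday at a given number of exercise minutes.
cigsday.slope <- function(modmin) b.cigsday + b.interaction * modmin
print(cigsday.slope(c(0, 60, 120)))
```

Under these estimates the association between smoking and bmi weakens as minutes of moderate exercise increase.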
Regression with categorical predictors
Let’s try to predict bmi from region, a categorical variable in which 1 = Northeast, 2 = Midwest, 3 = South and 4 = West:
> str(NH11$region)
 num [1:33014] 3 3 1 3 3 1 3 3 3 3 ...
> unique(NH11$region)
[1] 3 1 2 4
> weight.int.out <- lm(bmi~region,
+                      data=NH11)
> coef(summary(weight.int.out))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    30.55     0.2152  141.98  0.00000
region         -0.24     0.0742   -3.23  0.00122
Not what we want: R doesn’t know that region is categorical!
Telling R which variables are categorical
Let’s try again to predict bmi from region
> # make a factor version of region, with labels
> NH11 <- within(NH11, {
+     regionF <- factor(region,
+                       levels=1:4,
+                       labels=c("Northeast", "Midwest",
+                                "South", "West"))})
>
> # predict bmi from region
> weight.fac.out <- lm(bmi~region,
+                      data=NH11)
> anova(weight.fac.out) # multi-df test of region
Analysis of Variance Table

Response: bmi
             Df  Sum Sq Mean Sq F value Pr(>F)
region        1    1980    1980    10.5 0.0012
Residuals 33012 6245630     189
> coef(summary(weight.fac.out)) # individual comparisons
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    30.55     0.2152  141.98  0.00000
region         -0.24     0.0742   -3.23  0.00122
Note that the call above still uses the numeric region, not the new factor regionF, so the output repeats the single-df fit from the previous slide. The formula must be bmi~regionF to get the multi-df test and the individual region comparisons.
Take-home: make sure to tell R which variables are categorical by converting them to factors!
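The difference between a numeric code and a factor shows up directly in the degrees of freedom. Since NH11 is not bundled with R, the sketch below uses simulated data; the data frame dat, the outcome y, and the group means are all made up for illustration.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical data: a numeric region code (1-4) and a made-up outcome.
dat <- data.frame(region = rep(1:4, each = 25))
dat$y <- rnorm(nrow(dat), mean = c(30, 29, 31, 29.5)[dat$region])

# Factor version of the same variable, with labels.
dat$regionF <- factor(dat$region, levels = 1:4,
                      labels = c("Northeast", "Midwest", "South", "West"))

# Numeric code: one slope, so one degree of freedom.
fit.num <- lm(y ~ region, data = dat)
# Factor: one coefficient per non-reference level, so three degrees of freedom.
fit.fac <- lm(y ~ regionF, data = dat)

print(anova(fit.num))
print(anova(fit.fac))
```

A quick check of the Df column in anova() is an easy way to confirm that R is treating a predictor as categorical.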
Setting factor contrasts
In the previous example we used the default contrasts for region. The default in R is treatment contrasts, AKA dummy codes. Sometimes this default is not what we want, so we can get and set contrasts using the contrasts() function
> # print default contrasts
> contrasts(NH11$regionF)
          Midwest South West
Northeast       0     0    0
Midwest         1     0    0
South           0     1    0
West            0     0    1
> # change to sum-to-zero contrasts
> contrasts(NH11$regionF) <- contr.sum(n = 4)
> contrasts(NH11$regionF)
          [,1] [,2] [,3]
Northeast    1    0    0
Midwest      0    1    0
South        0    0    1
West        -1   -1   -1
Regression models with specific contrasts
Regression models reflect contrasts setting
> weight.fac2.out <- lm(bmi~regionF,
+                       data=NH11)
>
> coef(summary(weight.fac2.out))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  29.8910     0.0789 378.860   0.0000
regionF1      0.2383     0.1551   1.536   0.1245
regionF2      0.0767     0.1381   0.555   0.5787
regionF3      0.2971     0.1191   2.493   0.0127
Contrasts can also be set as arguments to lm()
> coef(summary(lm(bmi~regionF,
+                 data=NH11,
+                 contrasts=
+                     list(
+                         regionF=contr.treatment(
+                             n=4, base=4)))))
            Estimate Std. Error t value   Pr(>|t|)
(Intercept)   29.279      0.149  196.19 0.00000000
regionF1       0.850      0.241    3.53 0.00041194
regionF2       0.689      0.219    3.14 0.00166450
regionF3       0.909      0.195    4.65 0.00000332
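The two coefficient tables describe the same fitted region means under different codings, and one can be recovered from the other. The arithmetic below is a sketch that uses only the printed coefficients; the names west.mean, offsets, and region.means are invented for the example.

```r
# Coefficients copied from the treatment-coded fit above (base = West).
west.mean <- 29.279   # intercept = mean bmi in the base region
offsets   <- c(Northeast = 0.850, Midwest = 0.689, South = 0.909)

# Each region mean is the base mean plus that region's coefficient.
region.means <- c(west.mean + offsets, West = west.mean)
print(region.means)

# Under sum-to-zero coding the intercept is the unweighted mean of the
# region means, and each coefficient is a region's deviation from it.
grand.mean <- mean(region.means)
deviations <- region.means[c("Northeast", "Midwest", "South")] - grand.mean
print(round(grand.mean, 3))
print(round(deviations, 3))
```

Up to rounding, the results line up with the intercept and the regionF1 to regionF3 estimates in the sum-to-zero table.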
Exercise 1: interactions and factors
Use the NH11 data set.
1 Use lm to fit a regression model predicting days missed work in past year (wkdayr) from age and race (mracrpi2).
2 Change the contrasts for race (mracrpi2) to sum-to-zero contrasts and re-fit the model from step 1.
3 Evaluate the hypothesis that the relation between days missed work and age differs as a function of race.
Regression with binary outcomes
Logistic regression
Thus far we have used the lm function to fit our regression models. lm is great, but limited: in particular it only fits models for continuous dependent variables. For categorical dependent variables we can use the glm() function.
Let’s predict the probability of being diagnosed with hypertension based on age, sex, sleep, and bmi
> str(NH11$hypev) # check structure of hypev
 Factor w/ 2 levels "2 No","1 Yes": 1 1 2 1 1 2 1 1 2 1 ...
> levels(NH11$hypev) # check levels of hypev
[1] "2 No"  "1 Yes"
> # collapse all missing values to NA
> NH11$hypev <- factor(NH11$hypev, levels=c("2 No", "1 Yes"))
> # run our regression model
> hyp.out <- glm(hypev~age_p+sex+sleep+bmi,
+                data=NH11, family="binomial")
> coef(summary(hyp.out))
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.26947   0.056495  -75.57 0.00e+00
age_p        0.06070   0.000823   73.78 0.00e+00
sex2 Female -0.14403   0.026798   -5.37 7.68e-08
sleep       -0.00704   0.001640   -4.29 1.78e-05
bmi          0.01857   0.000951   19.53 6.49e-85
Logistic regression coefficients
Generalized linear models use link functions, so raw coefficients are difficult to interpret. For example, the age coefficient of .06 in the previous model tells us that for every one unit increase in age, the log odds of hypertension diagnosis increases by 0.06. Since most of us are not used to thinking in log odds this is not too helpful!
One solution is to transform the coefficients to make them easier to interpret
> hyp.out.tab <- coef(summary(hyp.out))
> hyp.out.tab[, "Estimate"] <- exp(coef(hyp.out))
> hyp.out.tab
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.014   0.056495  -75.57 0.00e+00
age_p          1.063   0.000823   73.78 0.00e+00
sex2 Female    0.866   0.026798   -5.37 7.68e-08
sleep          0.993   0.001640   -4.29 1.78e-05
bmi            1.019   0.000951   19.53 6.49e-85
Now we can say that for a one unit increase in age, the odds of being diagnosed with hypertension increase by a factor of 1.06. For more information on interpreting odds ratios see our.
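The transformation is just exp() applied to each log-odds coefficient. A small sketch using the printed estimates follows; log.odds and or.10.years are illustrative names, and the 10-year comparison is an added example rather than something computed in the workshop. Note that only the Estimate column is transformed in the table above; the standard errors shown there remain on the log-odds scale.

```r
# Log-odds coefficients copied from the logistic regression table above.
log.odds <- c(age_p = 0.06070, "sex2 Female" = -0.14403,
              sleep = -0.00704, bmi = 0.01857)

# Exponentiating a log-odds coefficient gives an odds ratio.
odds.ratios <- exp(log.odds)
print(round(odds.ratios, 3))

# Odds ratios multiply: a 10-year age difference scales the odds by
# exp(10 * b), the same as the one-year odds ratio raised to the 10th power.
or.10.years <- exp(10 * log.odds["age_p"])
print(or.10.years)
```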
Computing quantities of interest with predict()
In addition to transforming the log-odds produced by glm to odds, we can use the predict() function to make direct statements about the predictors in our model. For example, we can ask "How much more likely is a 63 year old female to have hypertension compared to a 33 year old female?".
> # Create a dataset with predictors set at desired levels
> predDat <- with(NH11,
+                 expand.grid(age_p = c(33, 63),
+                             sex = "2 Female",
+                             bmi = mean(bmi, na.rm = TRUE),
+                             sleep = mean(sleep, na.rm = TRUE)))
> # predict hypertension at those levels
> cbind(predDat, predict(hyp.out, type = "response",
+                        se.fit = TRUE,
+                        newdata = predDat))
  age_p      sex  bmi sleep   fit  se.fit residual.scale
1    33 2 Female 29.9  7.86 0.129 0.00285              1
2    63 2 Female 29.9  7.86 0.478 0.00482              1
This tells us that a 33 year old female has a 13% probability of having been diagnosed with hypertension, while a 63 year old female has a 48% probability of having been diagnosed.
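These probabilities can be checked by hand: add up the linear predictor on the log-odds scale and apply the inverse logit, plogis(). The sketch below plugs in the coefficients from the logistic regression table and the rounded means of bmi and sleep shown in the output; prob.hyp() is a hypothetical helper written for this illustration.

```r
# Coefficients copied from the logistic regression table.
b <- c(intercept = -4.26947, age_p = 0.06070, female = -0.14403,
       sleep = -0.00704, bmi = 0.01857)

# Predicted probability for a female of a given age, holding bmi and sleep
# at the (rounded) sample means shown in the predict() output.
prob.hyp <- function(age, bmi = 29.9, sleep = 7.86) {
  eta <- b["intercept"] + b["age_p"] * age + b["female"] +
    b["sleep"] * sleep + b["bmi"] * bmi   # linear predictor (log odds)
  unname(plogis(eta))                     # inverse logit -> probability
}

p33 <- prob.hyp(33)
p63 <- prob.hyp(63)
print(round(c(p33, p63), 3))
print(round(p63 - p33, 3))  # difference in predicted probabilities
```

Because the inputs are rounded, the results agree with the fit column only to about three decimal places.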
Packages for computing quantities of interest
Instead of doing all this ourselves, we can use the Zelig package to compute quantities of interest for us (cf. the effects package). Use zelig instead of glm and setx / sim instead of expand.grid / predict
> library(Zelig)
> hyp.out.z <- zelig(hypev~age_p+sex+sleep+bmi,
+                    data=NH11,
+                    model="logit", cite=FALSE)
> x.low <-setx(hyp.out.z, age_p=33) # set age to 33 years
> x.high <-setx(hyp.out.z, age_p=63) # set age to 63 years
> s.out <-sim(hyp.out.z, x=x.low, x1=x.high) # get predicted values
>
Computing quantities of interest with Zelig
> summary(s.out) # show the results

Model: logit
Number of simulations: 1000

Values of X
  (Intercept) age_p sex2 Female sleep  bmi
1           1    33           1  7.86 29.9
attr(,"assign")
[1] 0 1 2 3 4
attr(,"contrasts")
attr(,"contrasts")$sex
[1] "contr.treatment"

Values of X1
  (Intercept) age_p sex2 Female sleep  bmi
1           1    63           1  7.86 29.9
attr(,"assign")
[1] 0 1 2 3 4
attr(,"contrasts")
attr(,"contrasts")$sex
[1] "contr.treatment"

Expected Values: E(Y|X)
  mean    sd   50%  2.5% 97.5%
 0.129 0.003 0.129 0.124 0.135

Expected Values: E(Y|X1)
  mean    sd   50%  2.5% 97.5%
 0.478 0.005 0.478 0.468 0.488

Predicted Values: Y|X
     0     1
 0.881 0.119

Predicted Values: Y|X1
     0     1
 0.536 0.464

First Differences: E(Y|X1) - E(Y|X)
  mean    sd   50% 2.5% 97.5%
 0.349 0.004 0.349 0.34 0.358
Graphing quantities of interest
plot(s.out) # show the results graphically
Exercise 2: logistic regression
Use the NH11 data set.
1 Use glm or zelig to conduct a logistic regression to predict ever worked (everwrk) using age (age_p) and marital status (rmaritl). You may want to re-code rmaritl first.
2 Predict the probability of working for each level of (the possibly re-coded) marital status variable.
Multiple imputation
The majority of datasets contain missing data
Missing data produce a variety of problems and limitations for data analysis
Multiple imputation (MI) generates multiple, complete datasets that contain estimates of the missing data points
Earlier we wanted to compare a model predicting bmi from demographic variables to a model including demographics and substantive predictors. We omitted missing data so that we could fit both models to the same data. That is a common practice, but it has many problems (which we unfortunately don’t have time to discuss in detail). A popular solution is to use multiple imputation to fill in the missing values with reasonable placeholders.
MI is typically thought of as involving three steps:
    Selection of imputation model
    Generation of imputed datasets
    Combining results across imputed datasets
There are a number of packages for doing this in R: we will use the Amelia package because it is powerful, fast, and easy to use. You can refer to the Amelia documentation for more information about its imputation procedures: http://r.iq.harvard.edu/docs/amelia/amelia.pdf
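The third step, combining results, is commonly done with Rubin's rules: average the estimates, then add the within-imputation and between-imputation variances. The sketch below is a generic base-R illustration of that pooling, not Amelia's or Zelig's internal code, and the five estimates and standard errors are made-up numbers.

```r
# Hypothetical estimates of one regression coefficient from m = 5 imputed
# data sets, with their standard errors (made-up numbers for illustration).
est <- c(2.10, 2.35, 2.22, 2.05, 2.41)
se  <- c(0.92, 0.95, 0.90, 0.97, 0.93)
m   <- length(est)

# Pooled point estimate: the average across imputations.
pooled.est <- mean(est)

# Within-imputation variance: average squared standard error.
within.var <- mean(se^2)
# Between-imputation variance: variance of the estimates themselves.
between.var <- var(est)

# Total variance adds a finite-m correction to the between component.
total.var <- within.var + (1 + 1 / m) * between.var
pooled.se <- sqrt(total.var)

print(c(estimate = pooled.est, std.error = pooled.se))
```

The pooled standard error is always at least as large as the average within-imputation standard error, which is how the extra uncertainty from the missing data enters the final result.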
Creating imputed data sets
We’re going to create several datasets to look at a model predicting the number of days of work missed/year (wkdayr)
> # load the Amelia package
> library(Amelia)
> # help(package="Amelia")
> # load a smaller version of NH
> NH08.mi <- readRDS("dataSets/NatHealth2008MI")
> # generate five imputed data sets
> amelia.log <- capture.output( # suppress amelia’s chattiness
+     NatHealth.MI <- amelia(NH08.mi,
+                            m=5,
+                            idvars=c("id")))
>
Checking imputed values
Compare imputed values to observed values
plot(NatHealth.MI, which.vars=9:12)
Checking imputed values: overimputation
Overimputation strategy:
    Treat every observed value as if it were missing
    Impute many values for that observed value
    Examine the correspondence between imputed and observed values
overimpute(NatHealth.MI, var="sleep")
Using imputed data sets in regression models
Zelig makes it very easy to use imputed data sets: just point to the list of imputed data sets in the data argument
> library(Zelig)
> nhImp.out <- zelig(wkdayr ~ cigsday + modmin + sleep, model = "ls",
+                    data = NatHealth.MI$imputations, cite = FALSE)
>
> coef(summary(nhImp.out))
              Value Std. Error  t-stat p-value
(Intercept) -8.8163     6.7633 -1.3035  0.2132
cigsday     -0.0102     0.1438 -0.0707  0.9437
modmin      -0.0373     0.0234 -1.5940  0.1110
sleep        2.2303     0.9412  2.3697  0.0348
For separate results, use print(summary(x), subset = i:j).
Exercise 2: multiple imputation
1 Using Amelia, generate 5 imputed versions of the Exam dataset. Make sure you tell Amelia which variables are nominal, and that school is the id variable.
2 Create plots that compare imputed values to observed values
3 Overimpute the variable "schavg"
Multilevel Modeling
Multilevel modeling overview
Multi-level (AKA hierarchical) models are a type of mixed-effects model
Used to model variation due to group membership
Can model different intercepts and/or slopes for each group
Mixed-effects models include two types of predictors: fixed-effects and random effects
    Fixed-effects observed levels are of direct interest (e.g., sex, political party...)
    Random-effects observed levels are not of direct interest: the goal is to make inferences to a population represented by the observed levels
In R the lme4 package is the most popular for mixed effects models
    Use the lmer function for linear mixed models, glmer for generalized mixed models
The Exam data
The Exam data set contains exam scores of 4,059 students from 65 schools in Inner London. The variable names are as follows:
school    School ID - a factor.
normexam  Normalized exam score.
schgend   School gender - a factor. Levels are ’mixed’, ’boys’, and ’girls’.
schavg    School average of intake score.
vr        Student level Verbal Reasoning (VR) score band at intake - a factor. Levels are ’bottom 25%’, ’mid 50%’, and ’top 25%’.
intake    Band of student’s intake score - a factor. Levels are ’bottom 25%’, ’mid 50%’ and ’top 25%’.
standLRT  Standardised LR test score.
sex       Sex of the student - levels are ’F’ and ’M’.
type      School type - levels are ’Mxd’ and ’Sngl’.
student   Student id (within school) - a factor
The null model and ICC
As a preliminary step it is often useful to partition the variance in the dependent variable into the various levels. This can be accomplished by running a null model (i.e., a model with a random effects grouping structure, but no fixed-effects predictors).
Linear mixed model fit by maximum likelihood
Formula: normexam ~ 1 + (1 | school)
   Data: Exam
   AIC   BIC logLik deviance REMLdev
 10826 10844  -5410    10820   10824
Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept) 0.169    0.412
 Residual             0.848    0.921
Number of obs: 3987, groups: school, 65

Fixed effects:
            Estimate Std. Error t value
(Intercept)  -0.0141     0.0538   -0.26
Calculating ICC
Calculate ICC: the amount of total variance in exam scores that is between groups, i.e., between-school variance/total variance
> (N1re <-data.frame(summary(Norm1)@REmat,
+                    stringsAsFactors = FALSE))
    Groups        Name Variance Std.Dev.
1   school (Intercept)    0.169    0.412
2 Residual                0.848    0.921
>
> N1re[3:4] <-data.matrix(N1re[3:4])
> N1re[1, "Variance"] / sum(N1re["Variance"])
[1] 0.166
17% of the variance in exam scores is between schools; the rest is within school variance.
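The ICC itself is a one-line ratio once the two variance components are in hand. The sketch below uses the values printed in the null-model output; the variable names are illustrative. As far as I know, the @REmat slot used above belongs to older lme4 releases, and in current versions as.data.frame(VarCorr(Norm1)) is the usual way to pull out the same components, so treat that as a pointer to check rather than a tested recipe.

```r
# Variance components copied from the null-model output.
var.school   <- 0.169   # between-school (random intercept) variance
var.residual <- 0.848   # within-school (residual) variance

# ICC = between-group variance / total variance.
icc <- var.school / (var.school + var.residual)
print(round(icc, 3))
```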
Adding fixed-effects predictors
Predict exam scores from students’ standardized test scores
> Norm2 <-lmer(normexam~standLRT + (1|school),
+              data=Exam,
+              REML = FALSE)
> summary(Norm2)
Linear mixed model fit by maximum likelihood
Formula: normexam ~ standLRT + (1 | school)
   Data: Exam
  AIC  BIC logLik deviance REMLdev
 9143 9169  -4568     9135    9147
Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept) 0.0919   0.303
 Residual             0.5670   0.753
Number of obs: 3958, groups: school, 65

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.00121    0.04003     0.0
standLRT     0.56559    0.01265    44.7

Correlation of Fixed Effects:
         (Intr)
standLRT 0.007
Multiple degree of freedom comparisons
As with lm and glm models, you can compare the two lmer models using the anova function.
> anova(Norm1, Norm2)
Data: Exam
Models:
Norm1: normexam ~ 1 + tag(1 | school)
Norm2: normexam ~ standLRT + (1 | school)
      Df   AIC   BIC logLik Chisq Chi Df Pr(>Chisq)
Norm1  3 10826 10844  -5410
Norm2  4  9143  9169  -4568  1684      1     <2e-16
Random slopes
Add a random effect of students’ standardized test scores as well. Now in addition to estimating the distribution of intercepts across schools, we also estimate the distribution of the slope of exam on standardized test.
> Norm3 <- lmer(normexam~standLRT + (standLRT|school), data=Exam,
+               REML = FALSE)
> summary(Norm3)
Linear mixed model fit by maximum likelihood
Formula: normexam ~ standLRT + (standLRT | school)
   Data: Exam
  AIC  BIC logLik deviance REMLdev
 9108 9146  -4548     9096    9107
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 school   (Intercept) 0.0899   0.300
          standLRT    0.0141   0.119    0.512
 Residual             0.5552   0.745
Number of obs: 3958, groups: school, 65

Fixed effects:
            Estimate Std. Error t value
(Intercept)  -0.0122     0.0397   -0.31
standLRT      0.5586     0.0199   28.08

Correlation of Fixed Effects:
         (Intr)
standLRT 0.371
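One way to read the random-effects block is to translate it into a range of school-specific slopes. The sketch below uses the printed estimates and assumes, as the model does, that the random slopes are normally distributed around the fixed slope; the variable names are invented for the illustration.

```r
# Values copied from the Norm3 summary above.
slope.fixed <- 0.5586   # average slope of normexam on standLRT
slope.sd    <- 0.119    # standard deviation of school-specific slopes
int.sd      <- 0.300    # standard deviation of school-specific intercepts
corr        <- 0.512    # correlation between intercepts and slopes

# Approximate range covering 95% of school slopes, assuming the random
# slopes are normally distributed around the fixed slope.
slope.range <- slope.fixed + c(-1, 1) * qnorm(0.975) * slope.sd
print(round(slope.range, 3))

# Covariance implied by the reported correlation and standard deviations.
int.slope.cov <- corr * int.sd * slope.sd
print(round(int.slope.cov, 4))
```

The positive correlation suggests that schools with higher average exam scores also tend to show a steeper relation between intake test score and exam score.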
Test the significance of the random slope
To test the significance of a random slope just compare models with and without the random slope term
> anova(Norm2, Norm3)
Data: Exam
Models:
Norm2: normexam ~ standLRT + tag(1 | school)
Norm3: normexam ~ standLRT + tag(standLRT | school)
      Df  AIC  BIC logLik Chisq Chi Df   Pr(>Chisq)
Norm2  4 9143 9169  -4568
Norm3  6 9108 9146  -4548    39      2 0.0000000035
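The Chisq column is a likelihood-ratio statistic: the drop in deviance between the two models, referred to a chi-square distribution with degrees of freedom equal to the number of added parameters. A sketch using the deviances printed in the Norm2 and Norm3 summaries (variable names are illustrative):

```r
# Deviances (-2 * log-likelihood) and parameter counts copied from the
# summaries of Norm2 and Norm3.
dev.norm2 <- 9135; df.norm2 <- 4
dev.norm3 <- 9096; df.norm3 <- 6

# Likelihood-ratio statistic: the drop in deviance from adding the random slope.
chisq  <- dev.norm2 - dev.norm3
chi.df <- df.norm3 - df.norm2   # slope variance + intercept-slope covariance

# Upper-tail probability under a chi-square reference distribution.
p.value <- pchisq(chisq, df = chi.df, lower.tail = FALSE)
print(c(Chisq = chisq, Df = chi.df, p = p.value))
```

The deviances are rounded as printed, so the p-value will be close to, but not exactly, the value in the anova table.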
Exercise 3: multilevel modeling
Use the dataset, bh1996:
data(bh1996, package="multilevel")
From the data documentation:
    Variables are Cohesion (COHES), Leadership Climate (LEAD), Well-Being (WBEING) and Work Hours (HRS). Each of these variables has two variants - a group mean version that replicates each group mean for every individual, and a within-group version where the group mean is subtracted from each individual response. The group mean version is designated with a G. (e.g., G.HRS), and the within-group version is designated with a W. (e.g., W.HRS).
1 Create a null model predicting wellbeing ("WBEING")
2 Calculate the ICC for your null model
3 Run a second multi-level model that adds two individual-level predictors, average number of hours worked ("HRS") and leadership skills ("LEAD") to the model and interpret your output.
4 Now, add a random effect of average number of hours worked ("HRS") to the model and interpret your output. Test the significance of this random term.
5 Finally, add a group-level term, workplace cohesion ("G.COHES") to the model and interpret your output.
Wrap-up
Help us make this workshop better!
Please take a moment to fill out a very short feedback form
These workshops exist for you: tell us what you need!
http://tinyurl.com/RstatisticsFeedback
Additional resources
IQSS workshops: http://projects.iq.harvard.edu/rtc/filter_by/workshops
IQSS statistical consulting: http://rtc.iq.harvard.edu
Zelig
    Website: http://gking.harvard.edu/zelig
    Documentation: http://r.iq.harvard.edu/docs/zelig.pdf
Amelia
    Website: http://gking.harvard.edu/Amelia/
    Documentation: http://r.iq.harvard.edu/docs/amelia/amelia.pdf