
Illustrations - Simple and Multiple Linear Regression
Steele H. Valenzuela
February 18, 2015

Illustrations for Simple and Multiple Linear Regression

February 2015

Simple Linear Regression

1. Introduction to Example Source: Chatterjee S, Handcock MS, and Simonoff JS. A Casebook for a First Course in Statistics and Data Analysis. New York, John Wiley, 1995, pp 145-152.

Setting: Calls to the New York Auto Club are possibly related to the weather, with more calls occurring during bad weather. This example illustrates descriptive analyses and simple linear regression to explore this hypothesis in a data set containing information on calendar day, weather, and numbers of calls.

Libraries that we will use in R

# REMEMBER, if these do not load, it is because you have not used the command 'install.packages()'
library(foreign) # to read the Stata data set
library(ggplot2) # to plot
library(Hmisc)   # for the command describe

Stata Data Set that we will use in R

Let’s load the data.

url <- "http://people.umass.edu/biep640w/datasets/ers.dta"
dat <- read.dta(file = url)

Simple Linear Regression Variables:

Outcome Y = calls
Predictor X = low

2. Preliminaries: Descriptives After some searching of the web, several packages in R provide output similar to Stata's describe command. Below is that exact command, but using the Hmisc package in R. Additionally, we'll check for missing data, because if we have any, it's a minor nuisance.

describe(dat)

## dat 
## 
##  12  Variables      28  Observations
## ---------------------------------------------------------------------------
## day 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##      28       0      28       1   12258   12070   12072   12076   12258 
##     .75     .90     .95 
##   12440   12444   12446 
## 


## lowest : 12069 12070 12071 12072 12073## highest: 12443 12444 12445 12446 12447## ---------------------------------------------------------------------------## calls## n missing unique Info Mean .05 .10 .25 .50## 28 0 27 1 4319 1698 1739 1842 3062## .75 .90 .95## 6498 8132 8821#### lowest : 1674 1692 1709 1752 1776, highest: 7745 7841 8810 8827 8947## ---------------------------------------------------------------------------## fhigh## n missing unique Info Mean .05 .10 .25 .50## 28 0 21 1 34.96 15.70 21.90 29.75 35.00## .75 .90 .95## 41.75 47.30 48.00#### lowest : 10 15 17 24 26, highest: 44 46 47 48 53## ---------------------------------------------------------------------------## flow## n missing unique Info Mean .05 .10 .25 .50## 28 0 19 1 24.46 7.05 11.10 18.75 27.00## .75 .90 .95## 32.00 34.30 38.25#### 4 6 9 12 14 15 18 19 21 23 26 27 28 30 31 32 34 35 40## Frequency 1 1 1 1 1 1 1 3 1 1 1 2 2 2 1 3 2 1 2## % 4 4 4 4 4 4 4 11 4 4 4 7 7 7 4 11 7 4 7## ---------------------------------------------------------------------------## high## n missing unique Info Mean .05 .10 .25 .50## 28 0 19 1 37.46 15.00 19.20 32.00 39.50## .75 .90 .95## 43.25 50.20 54.30#### 10 15 21 31 32 35 36 38 39 40 41 42 43 44 46 47 49 53 55## Frequency 1 2 1 1 3 2 2 1 1 1 2 1 3 1 1 1 1 1 2## % 4 7 4 4 11 7 7 4 4 4 7 4 11 4 4 4 4 4 7## ---------------------------------------------------------------------------## low## n missing unique Info Mean .05 .10 .25 .50## 28 0 22 0.99 21.75 0.0 2.1 10.5 26.0## .75 .90 .95## 31.0 36.6 39.3#### lowest : -2 0 3 4 5, highest: 32 36 38 40 41## ---------------------------------------------------------------------------## rain## n missing unique Info Sum Mean## 28 0 2 0.66 9 0.3214## ---------------------------------------------------------------------------## snow## n missing unique Info Sum Mean## 28 0 2 0.51 6 0.2143


## ---------------------------------------------------------------------------## weekday## n missing unique Info Sum Mean## 28 0 2 0.69 18 0.6429## ---------------------------------------------------------------------------## year## n missing unique Info Sum Mean## 28 0 2 0.75 14 0.5## ---------------------------------------------------------------------------## sunday## n missing unique Info Sum Mean## 28 0 2 0.37 4 0.1429## ---------------------------------------------------------------------------## subzero## n missing unique Info Sum Mean## 28 0 2 0.44 5 0.1786## ---------------------------------------------------------------------------

# sum() adds up the number of TRUE (missing) values; because there is no missing data here, the output will be zero
sum(is.na(dat))

## [1] 0
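If any values had been missing, a quick per-variable count is a natural follow-up; this one-liner is a small sketch of mine, not part of the original handout.

colSums(is.na(dat)) # counts the missing values in each column; here every count is zero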

Let’s summarize the y-variable and the x-variable, calls and low. This time around, we will create a function to sweep through these summary statistics rather than typing every command line by line.

tabstatR <- function(x) {
  N    <- length(x)
  Mean <- mean(x)
  SD   <- sd(x)
  Min  <- min(x)
  Max  <- max(x)
  # note: a function exits at the first return() it reaches, so only N is actually returned here
  return(N)
  return(Mean)
  return(SD)
  return(Min)
  return(Max)
}
tabstatR(dat$low)

## [1] 28
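Only 28 (the sample size N) is printed because of the first-return behavior noted above. A hedged fix is to bundle the statistics into a single named vector; the name tabstatR2 is mine, not from the handout.

# a corrected sketch that returns every statistic at once
tabstatR2 <- function(x) {
  c(N = length(x), Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))
}
tabstatR2(dat$low) # all five summary statistics for low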

We’re going to use the package ggplot2 for the remainder of this course. I’ll link a few websites at the end of this handout to aid your understanding of the package. For now, just follow my commands and the brief guidance I give in order to understand the package.

# Let's make a scatterplot
ggplot(data = dat, aes(x = low, y = calls)) + geom_point() +
  ggtitle("Calls to NY Auto Club 1993-1994")


[Figure: scatterplot of calls (y) against low (x), titled "Calls to NY Auto Club 1993-1994"]

As you can see, the first chunk of commands, ggplot(), captures the data as well as the aesthetics that will be placed on the plot. The next chunk indicates what kind of geometric object we would like to overlay on our aesthetics, our x and y variables. Here we specify geom_point(), which gives a simple scatterplot. Lastly, the command ggtitle() sets the main title. You'll see all of these chunks/arguments evolve throughout this handout, so please, stay tuned.
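To make the layering explicit, here is the same plot assembled one layer at a time; this sketch is mine and produces the same figure.

p <- ggplot(data = dat, aes(x = low, y = calls)) # data and aesthetics only; nothing is drawn yet
p <- p + geom_point()                            # add the scatterplot layer
p + ggtitle("Calls to NY Auto Club 1993-1994")   # add the title and print the plot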

Next, we’re going to make another scatterplot, overlaid with a lowess regression.

ggplot(data = dat, aes(x = low, y = calls)) + geom_point() +
  stat_smooth(method = "loess", se = FALSE) +
  ggtitle("Calls to NY Auto Club 1993-1994")


[Figure: scatterplot of calls against low with loess smooth, titled "Calls to NY Auto Club 1993-1994"]

Let us test for normality by applying the Shapiro-Wilk test of normality to the y-variable.

shapiro.test(dat$calls)

## 
##  Shapiro-Wilk normality test
## 
## data:  dat$calls
## W = 0.829, p-value = 0.0003628

Because of the small p-value (3.63e-04), we reject the null hypothesis of normality.

For now, let’s make a histogram of the variable calls, overlaid with a normal curve.

# first, let's find the mean and standard deviation, which will assist us in creating a normal curve
mean(dat$calls); sd(dat$calls)

## [1] 4319

## [1] 2693

# Now, this is tricky but extremely helpful
ggplot(dat, aes(x = calls)) + geom_histogram(aes(y = ..density..), binwidth = 1454) +
  stat_function(fun = dnorm, colour = "red", args = list(mean = 4318.75, sd = 2692.564))


[Figure: density histogram of calls with a normal curve overlaid in red]

As Prof. Bigelow mentions in her handout: > No surprise here, given that the Shapiro-Wilk test rejects normality. This graph confirms non-normality of the distribution of Y = calls.

3. Model Fitting (Estimation) We are now going to fit the model.

fit1 <- lm(calls ~ low, data = dat)
summary(fit1)

## 
## Call:
## lm(formula = calls ~ low, data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3112  -1468   -214   1144   3588 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7475.8      704.6   10.61  6.1e-11 ***
## low           -145.2       27.8   -5.22  1.9e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1920 on 26 degrees of freedom
## Multiple R-squared:  0.512, Adjusted R-squared:  0.493 
## F-statistic: 27.3 on 1 and 26 DF,  p-value: 1.86e-05


From the summary() command, we see vital information, no doubt, but we are missing some other pieces of information that Stata executes wonderfully with only one line of code. We are missing the ANOVA table as well as the confidence intervals.

anova(fit1)

## Analysis of Variance Table
## 
## Response: calls
##           Df   Sum Sq  Mean Sq F value  Pr(>F)    
## low        1 1.00e+08 1.00e+08    27.3 1.9e-05 ***
## Residuals 26 9.55e+07 3.67e+06                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(fit1)

##              2.5 %   97.5 %
## (Intercept) 6027.5 8924.24
## low         -202.3  -88.03

If you recall from assignment 3, we created a function to call up all of this information.

stata.regress <- function(x) { # syntax: function(x1, x2, ...) followed by a bracket
  a <- summary(x) # we are going to pass x to three commands as listed
  b <- anova(x)
  d <- confint(x)
  print(a); print(b); print(d) # the last line typically specifies what one wants the function to print
}
stata.regress(fit1)

## 
## Call:
## lm(formula = calls ~ low, data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3112  -1468   -214   1144   3588 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7475.8      704.6   10.61  6.1e-11 ***
## low           -145.2       27.8   -5.22  1.9e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1920 on 26 degrees of freedom
## Multiple R-squared:  0.512, Adjusted R-squared:  0.493 
## F-statistic: 27.3 on 1 and 26 DF,  p-value: 1.86e-05
## 
## Analysis of Variance Table
## 
## Response: calls
##           Df   Sum Sq  Mean Sq F value  Pr(>F)    
## low        1 1.00e+08 1.00e+08    27.3 1.9e-05 ***
## Residuals 26 9.55e+07 3.67e+06                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              2.5 %   97.5 %
## (Intercept) 6027.5 8924.24
## low         -202.3  -88.03

Let’s get back to the problem. From the summary output, given y = calls and x = low, you should be able to draw up the fitted line:

y = 7475.85 − 145.15 ∗ x

The R-squared of 0.512 indicates that 51% of the variability in calls is explained. Also, the overall F test (F = 27.28, significance level Pr(>F) = 1.9e-05) suggests that the straight-line fit performs better in explaining variability in calls than does Y = average # calls.
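These quantities can also be pulled straight out of the fitted model; a small sketch (the object name fit1.sum is mine):

fit1.sum <- summary(fit1)
fit1.sum$r.squared  # multiple R-squared, roughly 0.512
fit1.sum$fstatistic # the F statistic and its degrees of freedom
# p-value of the overall F test, which summary() prints but does not store directly
pf(fit1.sum$fstatistic[1], fit1.sum$fstatistic[2], fit1.sum$fstatistic[3], lower.tail = FALSE)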

4. Model Examination Let’s examine the model by fitting a confidence interval band on the scatterplot.

ggplot(data = dat, aes(x = low, y = calls)) + geom_point(size = 3) + stat_smooth(method = "lm") +
  ggtitle("Calls to NY Auto Club 1993-1994")

[Figure: scatterplot of calls against low with fitted regression line and confidence band, titled "Calls to NY Auto Club 1993-1994"]

Unfortunately, the legend for the ggplot2 package is not as powerful as Stata's...


5. Checking Model Assumptions and Fit Let’s check the model with several residual plots.

# nice ggplot2 function that displays a nice chunk of information. This will not be evaluated here,
# but do run it during your own session. Also, we will run it in the last section.
fortify(fit1)

qplot(.fitted, .resid, data = fit1) + geom_hline(yintercept = 0) + geom_smooth(se = F) + ggtitle("Fitted Estimates against Residuals")

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

[Figure: fitted values (.fitted) against residuals (.resid) with loess smooth, titled "Fitted Estimates against Residuals"]

Next, let’s make a Q-Q plot to take a closer look at the normality of the residuals.

qplot(sample = .resid, data = fit1, stat = "qq") + geom_abline() + ggtitle("Normality of Residuals of Y = calls on X = low")


[Figure: normal Q-Q plot of the residuals, titled "Normality of Residuals of Y = calls on X = low"]

Next, let’s plot Cook’s Distance using two different geometric objects: points and bars.

qplot(x = seq_along(.cooksd), y = .cooksd, data = fit1) + geom_point(stat = "identity") + ggtitle("Cook's Distance")


[Figure: Cook's distance by observation, plotted as points, titled "Cook's Distance"]

That was points (obviously), and now, here are bars.

qplot(x = seq_along(.cooksd), y = .cooksd, data = fit1) + geom_bar(stat = "identity") + ggtitle("Cook's Distance")


[Figure: Cook's distance by observation, plotted as bars, titled "Cook's Distance"]

Per Professor Bigelow's remarks: > For straight line regression, the suggestion is to regard Cook's Distance values > 1 as significant. > Here, there are no unusually large Cook's Distance values. > Not shown but useful, too, are examinations of leverage and jackknife residuals.
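Leverage is not shown in the handout; here is a hedged sketch of one way to look at it with hatvalues(). The dashed cutoff uses the common 2(k+1)/n rule of thumb, my choice rather than anything from the handout.

qplot(x = seq_along(hatvalues(fit1)), y = hatvalues(fit1)) +
  geom_hline(yintercept = 2 * 2 / nrow(dat), linetype = "dashed") + # 2(k+1)/n with k = 1 predictor
  xlab("Observation") + ylab("Leverage") + ggtitle("Leverage of Y = calls on X = low")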

Lastly, here is our jackknife residual plot.

qplot(x = .fitted, y = .stdresid, data = fit1) + geom_point(stat = "identity") + ggtitle("Jackknife Residuals v Predicted") + xlab("Linear Prediction") + ylab("Studentized Residuals")


[Figure: studentized (jackknife) residuals against the linear prediction, titled "Jackknife Residuals v Predicted"]

Per Professor Bigelow's remarks: > Recall - a jackknife residual for an individual is a modification of the studentized residual in which the mean square error is replaced by the mean square error obtained after deleting that individual from the analysis. > Departures of this plot from a parallel band about the horizontal line at zero are significant. > The plot here is a bit noisy but not too bad considering the small sample size.

Multiple Linear Regression

1. A General Approach for Model Development There are no rules, nor a single best strategy. In fact, different study designs and different research questions call for different approaches to model development. Tip – before you begin model development, make a list of your study design, research aims, outcome variable, primary predictor variables, and covariates.

As a general suggestion, the following approach has the advantages of providing a reasonably thorough exploration of the data and relatively little risk of missing something important.

Preliminary - be sure you have: 1. Checked, cleaned, and described your data 2. Screened the data for multivariate associations 3. Thoroughly explored the bivariate relationships

Step 1 - Fit the “maximal” model The maximal model is the large model that contains all the explanatory variables of interest as predictors. This model also contains all the covariates that might be of interest. It also contains all the interactions that might be of interest. Note the amount of variation explained.

Step 2 - Begin simplifying the model Inspect each of the terms in the “maximal” model with the goal of removing the predictor that is the least significant. Drop from the model the predictors that are the least significant, beginning with the higher-order interactions (tip - interactions are complicated and we are aiming for a simple model). Fit the reduced model. Compare the amount of variation explained by the reduced model with the amount of variation explained by the “maximal” model.

If the deletion of a predictor has little effect on the variation explained, then leave that predictor out of the model and inspect each of the terms in the model again.


If the deletion of a predictor has a significant effect on the variation explained, then put that predictor back into the model.

Step 3 - Keep simplifying the model Repeat Step 2, over and over, until the model remaining contains nothing but significant predictor variables.

Beware of some important caveats > Sometimes, you will want to keep a predictor in the model regardless of its statistical significance (an example is randomization assignment in a clinical trial). > The order in which you delete terms from the model matters. > You still need to be flexible to considerations of biology and what makes sense.
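Before moving to the example, here is a concrete sketch of the Step 2 mechanics in R. It reuses the Auto Club data already in memory purely to illustrate drop1(), update(), and a nested-model comparison; the particular variables and interaction are illustrative, not part of the analysis below.

fit.max <- lm(calls ~ low * snow + rain, data = dat) # an illustrative "maximal" model with one interaction
drop1(fit.max, test = "F")                           # F test for dropping each removable term
fit.red <- update(fit.max, . ~ . - low:snow)         # drop the least significant term (here, the interaction)
anova(fit.red, fit.max)                              # little change in explained variation -> keep the simpler fit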

2. Introduction to Example Source: Matthews et al. Parity Induced Protection Against Breast Cancer, 2007.

Research Question: What is the relationship of Y = p53 expression to parity and age at first pregnancy, after adjustment for the potentially confounding effects of current age and menopausal status? Age at first pregnancy has been grouped and is either ≤ 24 years or > 24 years.

Now, let’s get started and load the data.

url2 <- "http://people.umass.edu/biep691f/data/p53paper_small.dta"
dat2 <- read.dta(file = url2)

3. (Further) Introduction to Example Let’s explore the structure of the data as well as the data itself.

str(dat2[1:5])

## 'data.frame':    67 obs. of  5 variables:
##  $ agecurr : int  43 53 39 45 30 35 54 40 37 57 ...
##  $ pregnum : int  3 2 1 0 1 3 2 2 0 2 ...
##  $ menop   : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 1 2 ...
##  $ p53     : num  3 2.4 2.2 3.5 5.5 ...
##  $ agefirst: Factor w/ 3 levels "never pregnant",..: 3 3 3 1 3 2 2 3 1 2 ...

I cheated by looking at the environment window pane, where I saw that the dataset dat2 contains 5 variables, hence the specified subset of one through five ([1:5]). Anyway, back to the point. We see that we have 5 variables, two of which are integers (agecurr & pregnum), while another is numeric, i.e., has decimals (p53). The last two variables, menop and agefirst, are factor variables with two and three levels, respectively.

Let’s look at the first 10 rows.

head(dat2, 10)

##    agecurr pregnum menop p53       agefirst
## 1       43       3   yes 3.0       age > 24
## 2       53       2   yes 2.4       age > 24
## 3       39       1    no 2.2       age > 24
## 4       45       0    no 3.5 never pregnant
## 5       30       1    no 5.5       age > 24
## 6       35       3    no 6.0      age le 24
## 7       54       2    no 4.0      age le 24
## 8       40       2    no 4.0       age > 24
## 9       37       0    no 2.0 never pregnant
## 10      57       2   yes 5.0      age le 24


And now, let’s use the Stata-like function we loaded previously, describe.

describe(dat2) # once again, this output is not as nice as Stata's but it gets the job done

## dat2#### 5 Variables 67 Observations## ---------------------------------------------------------------------------## agecurr## n missing unique Info Mean .05 .10 .25 .50## 67 0 38 1 39.63 16.6 20.0 28.5 40.0## .75 .90 .95## 49.5 58.0 60.0#### lowest : 15 16 18 19 20, highest: 58 60 62 63 75## ---------------------------------------------------------------------------## pregnum## n missing unique Info Mean## 67 0 4 0.92 1.657#### 0 (16, 24%), 1 (9, 13%), 2 (24, 36%), 3 (18, 27%)## ---------------------------------------------------------------------------## menop## n missing unique## 67 0 2#### no (48, 72%), yes (19, 28%)## ---------------------------------------------------------------------------## p53## n missing unique Info Mean .05 .10 .25 .50## 67 0 19 0.99 3.251 2.000 2.000 2.500 3.000## .75 .90 .95## 4.000 4.625 5.000#### 1 2 2.20000004768372 2.25 2.40000009536743 2.5 2.75 3## Frequency 3 7 1 1 1 5 7 9## % 4 10 1 1 1 7 10 13## 3.20000004768372 3.29999995231628 3.5 3.70000004768372 4## Frequency 1 1 10 1 9## % 1 1 15 1 13## 4.30000019073486 4.375 5 5.19999980926514 5.5 6## Frequency 2 2 4 1 1 1## % 3 3 6 1 1 1## ---------------------------------------------------------------------------## agefirst## n missing unique## 67 0 3#### never pregnant (16, 24%), age le 24 (32, 48%)## age > 24 (19, 28%)## ---------------------------------------------------------------------------

Let’s check for pairwise correlations.


# Please follow the code, as the factor variables tend to mess with the correlation function
menop.new <- as.numeric(as.factor(dat2$menop))
agefirst.new <- as.numeric(as.factor(dat2$agefirst))
dat2.mat <- matrix(data = c(dat2$agecurr, dat2$pregnum, menop.new, dat2$p53, agefirst.new),
                   nrow = 67, ncol = 5)
colnames(dat2.mat) <- c("agecurr", "pregnum", "menop", "p53", "agefirst")

cor(dat2.mat) # unfortunately, the p-values are not provided. +1 to Stata

##          agecurr pregnum   menop     p53 agefirst
## agecurr   1.0000  0.5416 0.72848 0.13397   0.4765
## pregnum   0.5416  1.0000 0.40205 0.44190   0.5765
## menop     0.7285  0.4021 1.00000 0.04498   0.2823
## p53       0.1340  0.4419 0.04498 1.00000   0.2021
## agefirst  0.4765  0.5765 0.28227 0.20206   1.0000

Because the p-values here are not displayed, we will quote from Professor Bigelow's handout: > Only one correlation with Y = p53 is statistically significant, r(p53, pregnum) = 0.44, with a p-value of 0.0002. Note that some of the predictors are statistically significantly correlated with each other: r(agefirst, pregnum) = 0.58 with a p-value less than 0.0001.
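As an aside, the Hmisc package we loaded for describe() can also report the p-values alongside the correlations; a small sketch using the same numeric matrix:

rcorr(dat2.mat, type = "pearson") # correlation matrix plus a matrix of pairwise p-values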

Now, let’s examine a pairwise scatterplot.

pairs(dat2.mat, gap = 0, pch = ".")

[Figure: pairs() scatterplot matrix of agecurr, pregnum, menop, p53, and agefirst]

Please note that my columns are arranged differently than Prof. Bigelow's. Additionally, this plot is difficult to take in. It's also not as useful as we were hoping.

Let’s take a closer look by fitting a linear model and a lowess/loess model.


# This is a lot of R code to take in, so just trust me and plot it.
ggplot(data = dat2, aes(x = pregnum, y = p53)) + geom_point() +
  stat_smooth(method = "lm", se = FALSE, colour = "darkblue", size = 1) +
  stat_smooth(method = "loess", se = FALSE, colour = "darkred", size = 1) +
  ggtitle(expression(atop("Assessment of Linearity", atop("Y = p53, X = pregnum"))))

## Warning: pseudoinverse used at 3.015
## Warning: neighborhood radius 2.015
## Warning: reciprocal condition number 3.1562e-16
## Warning: There are other near singularities as well. 1

[Figure: p53 against pregnum with linear (dark blue) and loess (dark red) fits, titled "Assessment of Linearity: Y = p53, X = pregnum"]

The red line represents a lowess/loess model while the blue line represents a linear model. From here, we can confirm that we have a linear relationship.

Now, let’s fit the variable agefirst against p53.

ggplot(data = dat2, aes(x = agefirst, y = p53)) + geom_point() +
  stat_smooth(method = "lm", se = FALSE, colour = "darkblue", size = 1) +
  stat_smooth(method = "loess", se = FALSE, colour = "darkred", size = 1) +
  ggtitle(expression(atop("Assessment of Linearity", atop("Y = p53, X = agefirst"))))

## geom_smooth: Only one unique x value each group. Maybe you want aes(group = 1)?
## geom_smooth: Only one unique x value each group. Maybe you want aes(group = 1)?


[Figure: p53 by agefirst category (never pregnant, age le 24, age > 24), titled "Assessment of Linearity: Y = p53, X = agefirst"]

We see that we get a warning and we cannot plot a linear model. We'll have to use dummy variables for this model.

Moving on, let’s keep the same y-variable p53 and add the x-variable agecurr, instead.

ggplot(data = dat2, aes(x = agecurr, y = p53)) + geom_point() +
  stat_smooth(method = "lm", se = FALSE, colour = "darkblue", size = 1) +
  stat_smooth(method = "loess", se = FALSE, colour = "darkred", size = 1) +
  ggtitle(expression(atop("Assessment of Linearity", atop("Y = p53, X = agecurr"))))


[Figure: p53 against agecurr with linear (dark blue) and loess (dark red) fits, titled "Assessment of Linearity: Y = p53, X = agecurr"]

We have a linear model.

4. Handling of Categorical Predictors: Indicator Variables Let’s create dummy variables for age at first pregnancy: early and late. The best way to go about this in R is to add these new dummy variables to the original data frame, dat2.

# we are adding new variables early and late to dat2 using the ifelse function
# it is a basic function in R as well as other programs. Enter ?ifelse in the console to find out more
dat2$early <- ifelse(dat2$agefirst == "age le 24", 1, 0)
dat2$late  <- ifelse(dat2$agefirst == "age > 24", 1, 0)

table(dat2$agefirst, dat2$early)

## 
##                   0  1
##   never pregnant 16  0
##   age le 24       0 32
##   age > 24       19  0

table(dat2$agefirst, dat2$late)

## 
##                   0  1
##   never pregnant 16  0
##   age le 24      32  0
##   age > 24        0 19

And from the table() function, we see that our new variables are well-defined.
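As an aside, lm() builds indicator columns for a factor automatically, so the model fit in the next section can also be parameterized without hand-made dummies. A hedged sketch (the reference level is the factor's first level, never pregnant; fit2.alt is my name):

fit2.alt <- lm(p53 ~ pregnum + agefirst, data = dat2) # agefirst expands into its own indicator columns
summary(fit2.alt)$coefficients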


5. Model Fitting and Estimation

Model Estimation Set 1: Determination of best model in the predictors of interest. Goal is to obtain the best parameterization before considering covariates. MAXIMAL MODEL: Regression of Y = p53 on all variables: pregnum + [early, late]

# First, let's fit the pre-specified model and then run our version of Stata's regression function on that model
fit2 <- lm(p53 ~ pregnum + early + late, data = dat2)

stata.regress(fit2)

#### Call:## lm(formula = p53 ~ pregnum + early + late, data = dat2)#### Residuals:## Min 1Q Median 3Q Max## -2.8603 -0.5703 0.0161 0.5161 2.6210#### Coefficients:## Estimate Std. Error t value Pr(>|t|)## (Intercept) 2.5703 0.2409 10.67 9.4e-16 ***## pregnum 0.3764 0.2009 1.87 0.066 .## early 0.1608 0.5556 0.29 0.773## late -0.0677 0.5017 -0.13 0.893## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### Residual standard error: 0.964 on 63 degrees of freedom## Multiple R-squared: 0.203, Adjusted R-squared: 0.165## F-statistic: 5.35 on 3 and 63 DF, p-value: 0.0024#### Analysis of Variance Table#### Response: p53## Df Sum Sq Mean Sq F value Pr(>F)## pregnum 1 14.3 14.33 15.44 0.00021 ***## early 1 0.5 0.55 0.59 0.44447## late 1 0.0 0.02 0.02 0.89307## Residuals 63 58.5 0.93## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## 2.5 % 97.5 %## (Intercept) 2.0890 3.0517## pregnum -0.0250 0.7778## early -0.9495 1.2710## late -1.0704 0.9349

The fitted line is: p53 = 2.57 + (0.38) ∗ pregnum + (0.16) ∗ early − (0.07) ∗ late

20% of the variability in Y = p53 is explained by this model (R-squared = 0.20). This model is statistically significantly better than the null model (p-value of F test = 0.0024).


Note!! We see a consequence of the multi-collinearity of our predictors: [early, late] have non-significant t-statistic p-values, and pregnum has a t-statistic p-value that is only marginally significant.

Now, let’s run a partial F-test on the variables early and late.

library(car)    # may not be needed
library(survey) # be sure to install this package first before loading the library

## 
## Attaching package: 'survey'
## 
## The following object is masked from 'package:Hmisc':
## 
##     deff
## 
## The following object is masked from 'package:graphics':
## 
##     dotchart

regTermTest(fit2, ~early + late) # syntax is the full model followed by the terms. A bit weird but oh well

## Wald test for early late
##  in lm(formula = p53 ~ pregnum + early + late, data = dat2)
## F =  0.3052  on  2  and  63  df: p= 0.74

I had to search the web for this function, as apparently not many in the R community use it. Anyway, we see that early and late are not statistically significant (p-value = 0.74). We may conclude that, in the adjusted model containing pregnum, [early, late] are not statistically significantly associated with Y = p53.
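The same partial F-test can also be obtained in base R by comparing nested models; a sketch (fit.reduced is my name for the pregnum-only fit):

fit.reduced <- lm(p53 ~ pregnum, data = dat2)
anova(fit.reduced, fit2) # should match the regTermTest() result above: F = 0.31 on 2 and 63 df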

Next, let’s run a partial F-test on the variable pregnum.

regTermTest(fit2, ~pregnum)

## Wald test for pregnum
##  in lm(formula = p53 ~ pregnum + early + late, data = dat2)
## F =  3.511  on  1  and  63  df: p= 0.066

We see that it is marginally statistically significant (p-value = 0.0656). We may conclude that, in the model that contains [early, late], pregnum is marginally statistically significantly associated with Y = p53.

Now that we have pregnum under our model belt, let’s fit a new model with only this variable.

fit3 <- lm(p53 ~ pregnum, data = dat2)
stata.regress(fit3)

#### Call:## lm(formula = p53 ~ pregnum, data = dat2)##


## Residuals:## Min 1Q Median 3Q Max## -2.8093 -0.5635 0.0212 0.6060 2.5212#### Coefficients:## Estimate Std. Error t value Pr(>|t|)## (Intercept) 2.564 0.209 12.28 < 2e-16 ***## pregnum 0.415 0.105 3.97 0.00018 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### Residual standard error: 0.953 on 65 degrees of freedom## Multiple R-squared: 0.195, Adjusted R-squared: 0.183## F-statistic: 15.8 on 1 and 65 DF, p-value: 0.000181#### Analysis of Variance Table#### Response: p53## Df Sum Sq Mean Sq F value Pr(>F)## pregnum 1 14.3 14.33 15.8 0.00018 ***## Residuals 65 59.1 0.91## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## 2.5 % 97.5 %## (Intercept) 2.1467 2.9804## pregnum 0.2064 0.6241

The fitted line is: p53 = 2.56 + (0.41) ∗ pregnum

19.5% of the variability in Y = p53 is explained by this model (R-squared = 0.1953). This model is statistically significantly more explanatory than the null model (p-value = 0.0002).

Now let’s run a model with the variables early and late.

fit4 <- lm(p53 ~ early + late, data = dat2)
stata.regress(fit4)

#### Call:## lm(formula = p53 ~ early + late, data = dat2)#### Residuals:## Min 1Q Median 3Q Max## -2.613 -0.592 -0.113 0.558 2.387#### Coefficients:## Estimate Std. Error t value Pr(>|t|)## (Intercept) 2.570 0.246 10.47 1.7e-15 ***## early 1.043 0.301 3.47 0.00094 ***## late 0.645 0.333 1.94 0.05720 .## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1##


## Residual standard error: 0.982 on 64 degrees of freedom## Multiple R-squared: 0.159, Adjusted R-squared: 0.132## F-statistic: 6.03 on 2 and 64 DF, p-value: 0.00399#### Analysis of Variance Table#### Response: p53## Df Sum Sq Mean Sq F value Pr(>F)## early 1 8.0 8.02 8.31 0.0054 **## late 1 3.6 3.62 3.75 0.0572 .## Residuals 64 61.7 0.96## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## 2.5 % 97.5 %## (Intercept) 2.07975 3.061## early 0.44216 1.644## late -0.02033 1.311

Now we’re picking up speed, and we are going to create a regression table. Once again, R does not have such a feature built in, so +1 to Stata, again. Luckily, someone has created a package. Let's check it out.

library(stargazer)

## 
## Please cite as: 
## 
##  Hlavac, Marek (2014). stargazer: LaTeX code and ASCII text for well-formatted regression and summary statistics tables.
##  R package version 5.1. http://CRAN.R-project.org/package=stargazer

stargazer(fit2, fit3, fit4, type = "text")

## 
## ======================================================================================
##                                        Dependent variable:                            
##                     ------------------------------------------------------------------
##                                                p53                                     
##                              (1)                   (2)                   (3)          
## --------------------------------------------------------------------------------------
## pregnum                    0.376*                0.415***                             
##                            (0.201)               (0.105)                              
##                                                                                        
## early                      0.161                                       1.043***       
##                            (0.556)                                     (0.301)        
##                                                                                        
## late                       -0.068                                       0.645*        
##                            (0.502)                                     (0.333)        
##                                                                                        
## Constant                  2.570***              2.564***              2.570***        
##                            (0.241)               (0.209)               (0.246)        
##                                                                                        
## --------------------------------------------------------------------------------------
## Observations                 67                    67                    67           
## R2                          0.203                 0.195                 0.159         
## Adjusted R2                 0.165                 0.183                 0.132         
## Residual Std. Error    0.964 (df = 63)       0.953 (df = 65)       0.982 (df = 64)    
## F Statistic         5.349*** (df = 3; 63) 15.770*** (df = 1; 65) 6.031*** (df = 2; 64)
## ======================================================================================
## Note:                                                      *p<0.1; **p<0.05; ***p<0.01

During my short stint as an undergraduate political science minor, many of the models fit in published papers were printed in this format, which must be read vertically. Some of the epidemiological papers that I have stumbled upon are in this format as well. Back to the table.

Let’s choose model 2 as a good “minimally adequate” model: Y = p53 and X = pregnum. Here is why:

1. Model 1 is the maximal model, with an R-squared value of 0.202.
2. Model 2 drops [early, late]; its R-squared value of 0.1953 is only minimally lower.
3. Model 3 drops pregnum, and the drop in R-squared is more substantial, down to 0.159.

Model Estimation Set 2: Regression of Y = p53 on parity with adjustment for covariates. We're going to pick up speed in this section. Let's create a new maximal model.

fit5 <- lm(p53 ~ pregnum + agecurr + menop, data = dat2)
stata.regress(fit5)

#### Call:## lm(formula = p53 ~ pregnum + agecurr + menop, data = dat2)#### Residuals:## Min 1Q Median 3Q Max## -2.6056 -0.6065 -0.0172 0.4972 2.4465#### Coefficients:## Estimate Std. Error t value Pr(>|t|)## (Intercept) 2.70431 0.44035 6.14 6.1e-08 ***## pregnum 0.49233 0.12457 3.95 0.0002 ***## agecurr -0.00477 0.01364 -0.35 0.7276## menopyes -0.27978 0.37769 -0.74 0.4616## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### Residual standard error: 0.955 on 63 degrees of freedom## Multiple R-squared: 0.218, Adjusted R-squared: 0.181## F-statistic: 5.85 on 3 and 63 DF, p-value: 0.00137#### Analysis of Variance Table#### Response: p53## Df Sum Sq Mean Sq F value Pr(>F)## pregnum 1 14.3 14.33 15.73 0.00019 ***## agecurr 1 1.2 1.15 1.27 0.26496## menop 1 0.5 0.50 0.55 0.46158## Residuals 63 57.4 0.91## ---


## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## 2.5 % 97.5 %## (Intercept) 1.82433 3.58428## pregnum 0.24340 0.74126## agecurr -0.03203 0.02248## menopyes -1.03453 0.47496

Let’s create a model with the variable agecurr dropped.

fit6 <- lm(p53 ~ pregnum + menop, data = dat2)
stata.regress(fit6)

#### Call:## lm(formula = p53 ~ pregnum + menop, data = dat2)#### Residuals:## Min 1Q Median 3Q Max## -2.6263 -0.5687 -0.0188 0.4812 2.4563#### Coefficients:## Estimate Std. Error t value Pr(>|t|)## (Intercept) 2.569 0.208 12.37 < 2e-16 ***## pregnum 0.475 0.114 4.18 8.9e-05 ***## menopyes -0.367 0.281 -1.31 0.2## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### Residual standard error: 0.948 on 64 degrees of freedom## Multiple R-squared: 0.216, Adjusted R-squared: 0.192## F-statistic: 8.83 on 2 and 64 DF, p-value: 0.00041#### Analysis of Variance Table#### Response: p53## Df Sum Sq Mean Sq F value Pr(>F)## pregnum 1 14.3 14.33 15.95 0.00017 ***## menop 1 1.5 1.54 1.71 0.19503## Residuals 64 57.5 0.90## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## 2.5 % 97.5 %## (Intercept) 2.1539 2.9835## pregnum 0.2482 0.7019## menopyes -0.9281 0.1931

And let’s pull everything together and create a regression table.

stargazer(fit5, fit6, type = "text")

## 
## ===============================================================
##                                Dependent variable:             
##                     -------------------------------------------
##                                        p53                     
##                              (1)                   (2)         
## ---------------------------------------------------------------
## pregnum                   0.492***              0.475***       
##                            (0.125)               (0.114)       
##                                                                 
## agecurr                    -0.005                               
##                            (0.014)                              
##                                                                 
## menopyes                   -0.280                -0.367        
##                            (0.378)               (0.281)       
##                                                                 
## Constant                  2.704***              2.569***       
##                            (0.440)               (0.208)       
##                                                                 
## ---------------------------------------------------------------
## Observations                 67                    67          
## R2                          0.218                 0.216        
## Adjusted R2                 0.181                 0.192        
## Residual Std. Error    0.955 (df = 63)       0.948 (df = 64)   
## F Statistic         5.847*** (df = 3; 63) 8.831*** (df = 2; 64)
## ===============================================================
## Note:                               *p<0.1; **p<0.05; ***p<0.01

Finally, using only the R-squared values, we see that the values for these two models are similar, around 0.20. That is also similar to the model from the last section, Model Estimation Set 1, with only the variable pregnum. This is the model we will choose.
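A small helper, not in the handout, lines up the R-squared values being compared here:

sapply(list(fit3 = fit3, fit5 = fit5, fit6 = fit6),
       function(m) round(c(r.squared = summary(m)$r.squared,
                           adj.r.squared = summary(m)$adj.r.squared), 3))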

6. Checking Model Assumptions and Fit Continuing from where we left off, let's display the model we are going to check.

stata.regress(fit3)

#### Call:## lm(formula = p53 ~ pregnum, data = dat2)#### Residuals:## Min 1Q Median 3Q Max## -2.8093 -0.5635 0.0212 0.6060 2.5212#### Coefficients:## Estimate Std. Error t value Pr(>|t|)## (Intercept) 2.564 0.209 12.28 < 2e-16 ***## pregnum 0.415 0.105 3.97 0.00018 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#### Residual standard error: 0.953 on 65 degrees of freedom## Multiple R-squared: 0.195, Adjusted R-squared: 0.183## F-statistic: 15.8 on 1 and 65 DF, p-value: 0.000181


#### Analysis of Variance Table#### Response: p53## Df Sum Sq Mean Sq F value Pr(>F)## pregnum 1 14.3 14.33 15.8 0.00018 ***## Residuals 65 59.1 0.91## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## 2.5 % 97.5 %## (Intercept) 2.1467 2.9804## pregnum 0.2064 0.6241

To access the predicted values of Y, we can pull them out of the fitted model with the dollar sign ($).

fit3$fitted.values

## 1 2 3 4 5 6 7 8 9 10 11 12## 3.809 3.394 2.979 2.564 2.979 3.809 3.394 3.394 2.564 3.394 3.809 3.809## 13 14 15 16 17 18 19 20 21 22 23 24## 2.564 3.394 3.394 3.394 2.979 2.564 3.809 3.394 3.394 3.394 3.394 2.564## 25 26 27 28 29 30 31 32 33 34 35 36## 3.809 3.394 2.564 2.979 3.809 2.564 3.809 3.394 2.979 3.809 3.809 2.564## 37 38 39 40 41 42 43 44 45 46 47 48## 3.394 2.979 3.809 3.809 3.394 3.394 3.394 3.394 2.564 3.394 3.809 3.394## 49 50 51 52 53 54 55 56 57 58 59 60## 2.979 2.564 3.394 3.394 3.809 3.394 3.394 2.979 3.809 3.809 2.564 2.564## 61 62 63 64 65 66 67## 2.979 2.564 2.564 3.809 2.564 3.809 2.564

Or we can view the fitted values using a simple function from the ggplot2 package.

fortify(fit3) # remember, the ggplot2 package must be installed and the library must be loaded

## p53 pregnum .hat .sigma .cooksd .fitted .resid .stdresid## 1 3.000 3 0.03664 0.9550 1.423e-02 3.809 -0.80929 -0.86506## 2 2.400 2 0.01634 0.9524 9.185e-03 3.394 -0.99404 -1.05152## 3 2.200 1 0.02011 0.9555 6.993e-03 2.979 -0.77879 -0.82540## 4 3.500 0 0.04795 0.9531 2.553e-02 2.564 0.93646 1.00692## 5 5.500 1 0.02011 0.9063 7.329e-02 2.979 2.52121 2.67212## 6 6.000 3 0.03664 0.9192 1.043e-01 3.809 2.19071 2.34166## 7 4.000 2 0.01634 0.9575 3.413e-03 3.394 0.60596 0.64100## 8 4.000 2 0.01634 0.9575 3.413e-03 3.394 0.60596 0.64100## 9 2.000 0 0.04795 0.9579 9.247e-03 2.564 -0.56354 -0.60594## 10 5.000 2 0.01634 0.9390 2.398e-02 3.394 1.60596 1.69882## 11 5.000 3 0.03664 0.9485 3.080e-02 3.809 1.19071 1.27275## 12 5.000 3 0.03664 0.9485 3.080e-02 3.809 1.19071 1.27275## 13 1.000 0 0.04795 0.9395 7.118e-02 2.564 -1.56354 -1.68117## 14 4.000 2 0.01634 0.9575 3.413e-03 3.394 0.60596 0.64100## 15 4.000 2 0.01634 0.9575 3.413e-03 3.394 0.60596 0.64100## 16 2.000 2 0.01634 0.9444 1.807e-02 3.394 -1.39404 -1.47465


## 17 3.000 1 0.02011 0.9606 5.187e-06 2.979 0.02121 0.02248## 18 2.000 0 0.04795 0.9579 9.247e-03 2.564 -0.56354 -0.60594## 19 4.000 3 0.03664 0.9603 7.902e-04 3.809 0.19071 0.20385## 20 4.000 2 0.01634 0.9575 3.413e-03 3.394 0.60596 0.64100## 21 4.000 2 0.01634 0.9575 3.413e-03 3.394 0.60596 0.64100## 22 3.000 2 0.01634 0.9593 1.443e-03 3.394 -0.39404 -0.41683## 23 4.000 2 0.01634 0.9575 3.413e-03 3.394 0.60596 0.64100## 24 2.000 0 0.04795 0.9579 9.247e-03 2.564 -0.56354 -0.60594## 25 5.000 3 0.03664 0.9485 3.080e-02 3.809 1.19071 1.27275## 26 4.000 2 0.01634 0.9575 3.413e-03 3.394 0.60596 0.64100## 27 3.500 0 0.04795 0.9531 2.553e-02 2.564 0.93646 1.00692## 28 2.250 1 0.02011 0.9562 6.124e-03 2.979 -0.72879 -0.77241## 29 3.000 3 0.03664 0.9550 1.423e-02 3.809 -0.80929 -0.86506## 30 2.500 0 0.04795 0.9605 1.175e-04 2.564 -0.06354 -0.06832## 31 3.000 3 0.03664 0.9550 1.423e-02 3.809 -0.80929 -0.86506## 32 3.000 2 0.01634 0.9593 1.443e-03 3.394 -0.39404 -0.41683## 33 1.000 1 0.02011 0.9275 4.514e-02 2.979 -1.97879 -2.09723## 34 3.500 3 0.03664 0.9598 2.078e-03 3.809 -0.30929 -0.33061## 35 3.500 3 0.03664 0.9598 2.078e-03 3.809 -0.30929 -0.33061## 36 3.000 0 0.04795 0.9589 5.547e-03 2.564 0.43646 0.46930## 37 3.500 2 0.01634 0.9605 1.044e-04 3.394 0.10596 0.11208## 38 3.200 1 0.02011 0.9602 5.642e-04 2.979 0.22121 0.23445## 39 3.000 3 0.03664 0.9550 1.423e-02 3.809 -0.80929 -0.86506## 40 5.200 3 0.03664 0.9441 4.202e-02 3.809 1.39071 1.48653## 41 3.700 2 0.01634 0.9598 8.702e-04 3.394 0.30596 0.32365## 42 4.300 2 0.01634 0.9538 7.630e-03 3.394 0.90596 0.95834## 43 4.300 2 0.01634 0.9538 7.630e-03 3.394 0.90596 0.95834## 44 3.500 2 0.01634 0.9605 1.044e-04 3.394 0.10596 0.11208## 45 2.000 0 0.04795 0.9579 9.247e-03 2.564 -0.56354 -0.60594## 46 2.000 2 0.01634 0.9444 1.807e-02 3.394 -1.39404 -1.47465## 47 1.000 3 0.03664 0.8915 1.715e-01 3.809 -2.80929 -3.00287## 48 3.000 2 0.01634 0.9593 1.443e-03 3.394 -0.39404 -0.41683## 49 2.500 1 0.02011 0.9587 2.643e-03 2.979 -0.47879 -0.50745## 50 2.500 0 0.04795 0.9605 1.175e-04 2.564 -0.06354 -0.06832## 51 3.500 2 0.01634 0.9605 1.044e-04 3.394 0.10596 0.11208## 52 3.500 2 0.01634 0.9605 1.044e-04 3.394 0.10596 0.11208## 53 3.500 3 0.03664 0.9598 2.078e-03 3.809 -0.30929 -0.33061## 54 2.750 2 0.01634 0.9571 3.856e-03 3.394 -0.64404 -0.68128## 55 3.500 2 0.01634 0.9605 1.044e-04 3.394 0.10596 0.11208## 56 2.500 1 0.02011 0.9587 2.643e-03 2.979 -0.47879 -0.50745## 57 2.750 3 0.03664 0.9511 2.438e-02 3.809 -1.05929 -1.13228## 58 2.750 3 0.03664 0.9511 2.438e-02 3.809 -1.05929 -1.13228## 59 2.750 0 0.04795 0.9603 1.012e-03 2.564 0.18646 0.20049## 60 4.375 0 0.04795 0.9321 9.554e-02 2.564 1.81146 1.94775## 61 2.750 1 0.02011 0.9601 6.035e-04 2.979 -0.22879 -0.24248## 62 2.500 0 0.04795 0.9605 1.175e-04 2.564 -0.06354 -0.06832## 63 2.000 0 0.04795 0.9579 9.247e-03 2.564 -0.56354 -0.60594## 64 4.375 3 0.03664 0.9579 6.953e-03 3.809 0.56571 0.60469## 65 2.750 0 0.04795 0.9603 1.012e-03 2.564 0.18646 0.20049## 66 3.300 3 0.03664 0.9584 5.635e-03 3.809 -0.50929 -0.54439## 67 2.750 0 0.04795 0.9603 1.012e-03 2.564 0.18646 0.20049

fit3.yhat <- fortify(fit3)$.fitted


Moving on, let’s plot the observed values against the fitted values. Ideally, the points will fall on the X = Y line (a diagonal line).

ggplot(data = dat2, aes(x = p53, y = fit3.yhat)) + geom_point() + stat_smooth(method = "lm") +
  ggtitle(expression(atop("Model Check", atop("Plot of Observed vs Predicted")))) +
  ylab("Fitted Values")

[Figure: observed p53 against fitted values with linear smooth, titled "Model Check: Plot of Observed vs Predicted"]
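The diagonal the text refers to is the X = Y line, which can be drawn explicitly; a hedged variant of the plot above with the identity line added:

ggplot(data = dat2, aes(x = p53, y = fit3.yhat)) + geom_point() +
  geom_abline(intercept = 0, slope = 1, colour = "red") + # the X = Y reference line
  ggtitle(expression(atop("Model Check", atop("Observed vs Predicted, with X = Y line")))) +
  ylab("Fitted Values")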

Now, let’s check for the normality of residuals by plotting a standardized normality plot. This code will be a bit complicated, but try to follow along. Unfortunately, after an hour, I was not able to figure out the code for this plot using the ggplot2 package.

fit3.resid <- fortify(fit3)$.resid # pulling the residuals out of the model
resid.stdnorm <- pnorm(fit3.resid)

plot(ppoints(length(resid.stdnorm)), sort(resid.stdnorm), main = "Model Check",
     sub = "Standardized Normality Plot of Residuals",
     xlab = "Observed Probability", ylab = "Expected Probability")
abline(0, 1)


[Figure: standardized normality (P-P) plot of residuals with reference line, titled "Model Check: Standardized Normality Plot of Residuals"]

Everything looks reasonable, so let’s move on.
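For what it's worth, a similar P-P style plot can be pieced together in ggplot2; a hedged sketch (here the residuals are standardized before applying pnorm(), a small difference from the base-R version above):

pp.df <- data.frame(observed = ppoints(length(fit3.resid)),
                    expected = sort(pnorm(scale(fit3.resid))))
ggplot(pp.df, aes(x = observed, y = expected)) + geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  xlab("Observed Probability") + ylab("Expected Probability") +
  ggtitle(expression(atop("Model Check", atop("Standardized Normality Plot of Residuals"))))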

Now, let’s do a quantile normal plot of the residuals. This, I can do in ggplot2.

qplot(sample = fit3.resid, stat = "qq") + geom_abline() + xlab("Inverse Normal") + ylab("Residuals") +
  ggtitle(expression(atop("Model Check", atop("Quantile-Normal Plot of Residuals"))))


[Figure: quantile-normal (Q-Q) plot of residuals, titled "Model Check: Quantile-Normal Plot of Residuals"]

A little off the line in the tails, but okay.

We are almost done. Let’s run the Shapiro Wilk test of normality on the residuals.

shapiro.test(fit3.resid) # this does not add up to the same result in STATA...???

## 
##  Shapiro-Wilk normality test
## 
## data:  fit3.resid
## W = 0.9849, p-value = 0.5943

Lastly, after searching the web for model misspecification and the linktest, I have had no such luck. Let's run the Cook-Weisberg test for homogeneity of variance of the residuals. In R, this test is known as the Score Test for Non-Constant Error Variance.

ncvTest(fit3)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.561    Df = 1     p = 0.2115

We see that the null hypothesis of homogeneity of variance of the residuals is not rejected, as we have a p-value of 0.21.

Next, let’s plot the fitted values against the residuals. This is a graphical assessment of constant variance of the residuals.


qplot(x = fit3.yhat, y = fit3.resid) + xlab("Fitted Values") + ylab("Residuals") +
  ggtitle(expression(atop("Model Check", atop("Plot of Y = Residuals by X = fitted"))))

[Figure: residuals against fitted values, titled "Model Check: Plot of Y = Residuals by X = fitted"]

We see that the variability of the residuals looks reasonably homogeneous, confirming the Cook-Weisberg test result.

Last of the last, let’s check for outlying, leverage, and influential points - look for values > 4.

fit3.cooksd <- fortify(fit3)$.cooksd
qplot(x = seq_along(fit3.cooksd), y = fit3.cooksd) + xlab("Subject") + ylab("Cook's D") +
  ggtitle(expression(atop("Model Check", atop("Plot of Cook Distances"))))


[Figure: Cook's distances by subject, titled "Model Check: Plot of Cook Distances"]

Well, we’re done. Pat yourself on the back for making it through. The above plot looks nice. Not only are there no Cook distances greater than 4, they are all less than 1!
