Multiple Linear Regression

Linear regression with two or more predictor variables

Introduction

Often in linear regression, you want to investigate the relationship between more than one predictor variable and some outcome. In this case, your model will contain more than one independent variable. It is also often important to investigate a possible interaction between two or more independent variables.

Consider the following situation:

The file air.txt contains a subsample of data from a study of the effect of air pollution on lung function. The variables measured were age, gender, height, weight, forced vital capacity (FVC), and forced expiratory volume in 1 second (FEV1). FVC is the total volume of air, in liters, that an individual can expel regardless of how long it takes. FEV1 is the volume of air expelled during the first second when an individual has been told to breathe in deeply and then expel as much air as possible.

(Dunn and Clark (1987), Applied Statistics: Analysis of Variance and Regression, p.354.)

Input the file air.txt into SAS with the following code (adjusting the location of the file as necessary):

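DATA air;
  /* Space-delimited file; FIRSTOBS = 2 starts reading at line 2,
     skipping the first line (presumably a header row) */
  INFILE 'C:\air.txt' DLM = ' ' FIRSTOBS = 2;
  INPUT sex age height weight fvc fev1;
  /* Product term used later to test the height-by-age interaction */
  height_age = height*age;
RUN;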

The statement "height_age = height*age;" creates a new variable, height_age, which represents the interaction between height and age.

Exploring the Data

We are interested in what factors may predict FVC. In order to explore this before analyzing the data, create two plots: one of FVC vs. height; the other of FVC vs. age:

PROC GPLOT DATA = air;
  PLOT fvc * height;
  PLOT fvc * age;
RUN;
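As an optional alternative, the same two plots can be drawn with a fitted least-squares line overlaid, which may make it easier to judge whether a straight line is reasonable. This is only a sketch: it assumes a SAS release that includes PROC SGPLOT and the data set air created in the DATA step above; it is not part of the original tutorial.

```sas
/* Sketch: scatter plots with a fitted regression line (PROC SGPLOT).
   The REG statement draws the data points and the fitted line. */
PROC SGPLOT DATA = air;
  REG X = height Y = fvc;   /* FVC vs. height */
RUN;

PROC SGPLOT DATA = air;
  REG X = age Y = fvc;      /* FVC vs. age */
RUN;
```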

Plot of FVC * Height

Plot of FVC * Age

It appears a linear relationship is justified between FVC and height, although it is unclear whether a linear relationship exists between FVC and age.

Create a multiple linear regression model using both height and age to predict FVC:

PROC REG DATA = air;
  MODEL fvc = height age;
RUN;
QUIT;

Multiple Regression Output

Interpreting Output

• The multiple regression equation is: Yhat = -6.67 + 0.18(height) - 0.03(age)

• The R-Square value is interpreted the same as with simple linear regression: 67% of the variance in FVC is explained by height and age in the model.

• Because the model includes more than one predictor variable, you may want to consider using the adjusted R² (Adj R-Sq) value instead of the R-Square when interpreting the amount of variance explained by the independent variables. R-Square never decreases when a predictor is added, whereas the adjusted R² takes the number of predictors into account.
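For reference, the standard definition of the adjusted R² is shown below. The symbols n (number of observations) and p (number of predictors, here p = 2) are notation introduced for this note; the sample size is not stated in this tutorial, so no numeric value is computed.

```latex
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Because the fraction is greater than 1 whenever p ≥ 1, the adjusted value is never larger than R-Square, and the gap widens as predictors are added without a matching gain in fit.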

Overall F-test

To test whether all of the variables taken together significantly predict the outcome variable (FVC), use the overall F-test. The test statistic (F* = 36.96) is found under F Value. The associated p-value (<0.001) is found under Pr > F.

Ho: β1 = β2 = 0 vs. Ha: At least one β ≠ 0.

Because the p-value is less than 0.05, we reject the null hypothesis and conclude that, taken together, height and age are significantly related to FVC.
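As a reminder of where the F Value comes from, the overall F statistic is the ratio of the model mean square to the error mean square. In the sketch below, n (number of observations) and p (number of predictors, here p = 2) are notation introduced for this note, and SSR and SSE denote the model and error sums of squares from the Analysis of Variance table.

```latex
F^{*} = \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n - p - 1)}
      = \frac{R^2 / p}{\left(1 - R^2\right)/(n - p - 1)}
```

Under Ho, F* follows an F distribution with p and n - p - 1 degrees of freedom, which is how the p-value under Pr > F is obtained.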

Testing Significance of One Variable

To test the significance of an individual variable in predicting FVC, use the test statistic (t Value) and associated p-value (Pr > |t|) for that particular variable.

For example, the test of whether height is significantly related to FVC [Ho: β1 = 0 vs. Ha: β1 ≠ 0] has t* = 8.15, p < 0.0001. Reject the null hypothesis and conclude that height is significantly related to FVC after adjusting for age.

Testing for an Interaction

Because we have more than one predictor variable, it is important to consider whether they interact in some way. To test whether the interaction between height and age is significant, create another model in SAS that contains both the main effects of height and age as well as the interaction term you created:

PROC REG DATA = air;
  MODEL fvc = height age height_age;
RUN;
QUIT;
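A possible alternative, shown here only as a sketch and not as part of the original tutorial, is to fit the same model with PROC GLM, which accepts the product term height*age directly in the MODEL statement, so the height_age variable does not have to be created in the DATA step. The SOLUTION option requests the parameter estimates table; the p-value for the height*age term should match the one PROC REG reports for height_age.

```sas
/* Sketch: same interaction model, letting PROC GLM build the product term.
   Assumes the data set air created earlier. */
PROC GLM DATA = air;
  MODEL fvc = height age height*age / SOLUTION;
RUN;
QUIT;
```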

Output with Interaction Term

Is the interaction significant?

Notice that the p-value for the interaction is 0.39, which is greater than 0.05. Therefore, the interaction between age and height is not significant, and we do not need to include it in the model.

Additionally, notice that the R-Square is 0.679, indicating that 68% of the variability in FVC is explained by height, age and height_age. This is not much larger than the R-Square from the model with just height and age (67%), which is another good indication that the interaction term is not necessary.

The final model only needs to include height and age predicting FVC.

Conclusions

Multiple Linear Regression in SAS is very similar to Simple Linear Regression. The major differences are that more variables are added to the MODEL statement and that interaction terms need to be considered.

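Use the same MODEL statement options as in simple linear regression: CLB, CLI, and CLM for confidence limits on the parameter estimates, prediction limits for individual values, and confidence limits for the mean response; R for residual analysis to find outliers; and INFLUENCE for influential points.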
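For instance, the options can be requested together on the final model. This is a minimal sketch assuming the data set air and the height-and-age model from earlier in the tutorial; in practice you would request only the options you need.

```sas
/* Sketch: final model with interval, residual, and influence output */
PROC REG DATA = air;
  MODEL fvc = height age / CLB        /* CIs for the coefficients          */
                           CLI        /* prediction limits for individuals */
                           CLM        /* CIs for the mean response         */
                           R          /* residual analysis (outliers)      */
                           INFLUENCE; /* influence diagnostics             */
RUN;
QUIT;
```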

Linear Regression is used with continuous outcome variables. If the outcome variable of interest is categorical, logistic regression analysis is used. The next tutorial is an introduction to logistic regression.