
Page 1

MULTIPLE REGRESSION

Using more than one variable to predict another

Page 2

Last Week

Coefficient of Determination r2: explained variance shared between 2 variables

Simple Linear Regression: y = mx + b
Predicting one dependent variable from one independent variable
Based on explained variance: if r2 is large, x should be a good predictor of y

SEE, residuals

Page 3

Tonight

Predicting one DV from one IV is simple linear regression

Predicting one DV from multiple IVs is called multiple linear regression
More IVs usually allow for a better prediction of the DV

If IV A explains 20% of the variance (r2 = 0.20) and IV B explains 30% of the variance (r2 = 0.30), can I use both to predict the dependent variable?

Page 4

Example: Activity Dataset

To demonstrate, we'll use the same data as last week, from the pedometer and armband

Goal: to predict Armband calories (real calories expended) as accurately as possible

Let's start by trying to predict Armband calories with body weight: complete a simple linear regression using body weight

Page 5

Simple Regression

Here is the simple regression output from using Body Weight (kg) to predict Armband Calories

Page 6

Simple Regression

Results using Body Weight (kg): r2 = 0.155, SEE = 400.5 calories

Can we improve on this equation by adding new variables?

First, we have to determine whether other variables in the dataset might be related to Armband Calories. Use a correlation matrix.
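For readers who want to reproduce this kind of fit by hand, here is a minimal sketch of simple linear regression that returns the slope, intercept, r2, and SEE. The numbers are made up for illustration; they are not the lecture's pedometer dataset.

```python
def simple_regression(x, y):
    """Least-squares fit of y = m*x + b; returns slope, intercept, r2, SEE."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx                       # slope
    b = mean_y - m * mean_x             # intercept
    ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot            # coefficient of determination
    see = (ss_res / (n - 2)) ** 0.5     # standard error of the estimate
    return m, b, r2, see

# Hypothetical stand-ins for body weight (kg) and armband calories:
weight = [60, 70, 80, 90, 100]
calories = [1900, 2100, 2150, 2400, 2500]
m, b, r2, see = simple_regression(weight, calories)
```

Note that the SEE uses n - 2 in the denominator because a simple regression estimates two parameters (slope and intercept).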

Page 7

Correlations

Notice several variables have some association with armband calories:

Variable      r      r2
Height        0.225  0.05
Weight        0.393  0.15
BMI           0.378  0.14
PedSteps      0.782  0.61
PedCalories   0.853  0.73

Page 8

Create new regression equation

A simple regression equation looks like:
y = mx + b

A multiple regression equation looks like:
y = m1x1 + m2x2 + b
(Subscripts are used to help organize the data)

All we are doing is adding an additional variable into our equation. That new variable will have its own slope, m2

For the sake of simplicity, let's add in pedometer steps as x2

Page 9

OUTPUT…

Page 10

Multiple Regression Output

Page 11

Simple to Multiple

Results using Body Weight (kg): r2 = 0.155, SEE = 400.5 calories

Results using Body Weight and Pedometer Steps: r2 = 0.672, SEE = 251.7 calories

r2 change = 0.672 - 0.155 = 0.517

If 2 variables are good, would 3 be even better?
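As a sanity check on this kind of nested-model comparison, the r2 change can be reproduced with NumPy least squares. The arrays below are hypothetical stand-ins, not the actual pedometer dataset; the key property (r2 change is never negative when adding a predictor to a nested least-squares model) holds regardless.

```python
import numpy as np

def r_squared(X, y):
    """Least-squares fit of y = X @ coef; returns the r2 of the fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Hypothetical stand-ins for the dataset's variables:
weight = np.array([60.0, 70.0, 80.0, 90.0, 100.0])             # x1
steps = np.array([8000.0, 12000.0, 6000.0, 10000.0, 14000.0])  # x2
cal = np.array([1900.0, 2350.0, 1950.0, 2400.0, 2700.0])       # y
ones = np.ones_like(cal)                                       # intercept column

r2_simple = r_squared(np.column_stack([weight, ones]), cal)
r2_multiple = r_squared(np.column_stack([weight, steps, ones]), cal)
r2_change = r2_multiple - r2_simple   # never negative for nested models
```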

Page 12

Adding one more in…

In addition to body weight (x1) and pedometer steps (x2), let's add in age (x3)

Page 13

Multiple Regression Output 2

Page 14

Simple to Multiple

Results using Body Weight (kg): r2 = 0.155, SEE = 400.5 calories

Results using Body Weight and Pedometer Steps: r2 = 0.672, SEE = 251.7 calories, r2 change = 0.517

Results using Body Weight, PedSteps, Age: r2 = 0.689, SEE = 247.7, r2 change = 0.689 - 0.672 = 0.017

Page 15

Multiple Regression Decisions

Should we recommend that age be used in the model? These decisions can be difficult; "Model Building" or "Model Reduction" is more of an art than a science

Consider:
p-value of age in the model = 0.104
r2 change from adding age = 0.017, or 1.7% of variance
More coefficients (predictors) make the model more complicated to use and interpret
Does it make sense to include age? Should age be related to caloric expenditure?

Page 16

Other Regression Issues

Sample Size
With too small a sample, you lack the statistical power to generalize your results to other samples or the whole population
You increase your risk of Type II Error (failing to reject the null hypothesis when it is actually false)

In multiple regression, the more variables you use in your model, the greater your risk of Type II Error
This is a complicated issue, but essentially you need large samples to use several predictors
Guidelines…

Page 17

Other Regression Issues

Sample Size
Tabachnick & Fidell (1996): N > 50 + 8m, where N = appropriate sample size and m = number of IVs
So, if you use 3 predictors (like we just did in our example): 50 + 8*3 = 74 subjects

You can find several different 'guess-timates'. I usually just try to have 30 subjects, plus another 30 for each variable in the model (i.e., 30 + 30m). I like to play it safe…
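The two rules of thumb above can be written as one-line functions; for the 3-predictor model in the example they give 74 and 120 subjects respectively.

```python
def n_tabachnick_fidell(m):
    """Tabachnick & Fidell (1996) rule of thumb: sample size N > 50 + 8m."""
    return 50 + 8 * m

def n_conservative(m):
    """The more conservative rule of thumb from the slides: 30 + 30m."""
    return 30 + 30 * m

# For the 3-predictor model (weight, steps, age):
n_tf = n_tabachnick_fidell(3)    # 74 subjects
n_safe = n_conservative(3)       # 120 subjects
```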

Page 18

Other Regression Issues

Multiple Regression has the same statistical assumptions as correlation/regression: check for normal distribution, outliers, etc.

One new concern with multiple regression is the idea of Collinearity
You have to be careful that your IVs (predictor variables) are not highly correlated with each other
Collinearity can cause a model to overestimate r2
It can also cause one new variable to eliminate another

Page 19

Example Collinearity

Results of MLR using Body Weight, PedSteps, Age r2 = 0.689 SEE = 247.7

Imagine we want to add in one other variable, Pedometer Calories

Look at the correlation matrix first…

Page 20

Notice that Armband calories is highly correlated with both Pedometer Steps and Pedometer Calories
Initially, this looks great because we might have two very good predictors to use
But notice that Pedometer Calories is very highly correlated with Pedometer Steps
These two variables are probably collinear: they are very similar and may not explain 'unique' variance

Page 21

Here is the MLR result with Weight, Steps, and Age:

Here is the MLR result by adding Pedometer calories in the model:

Pedometer calories becomes the only significant predictor in the model. In other words, the variance explained by the other 3 variables can also be explained by Pedometer Calories; not all 4 variables add 'unique' variance to the model

Page 22

Example Collinearity

Results of MLR using Body Weight, PedSteps, Age: r2 = 0.689, SEE = 247.7

Results of MLR using Body Weight, PedSteps, Age, and PedCalories: r2 = 0.745, SEE = 226.2

Results of MLR using just PedCalories (eliminates collinearity): r2 = 0.727, SEE = 227.5

Which model is the best model? Remember, we'd like to pick the strongest predictive model with the fewest predictor variables

Page 23

Model Building

Collinearity makes model building more difficult
1) When you add in new variables, you have to look at r2, r2 change, and SEE, but you also have to notice what's happening to the other IVs in the model
2) Sometimes you need to remove variables that used to be good predictors
3) This is why the model with the most variables is not always the best model; sometimes you can do just as well with 1 or 2 variables

Page 24

What to do about Collinearity? Your approach:

Use a correlation matrix to examine the variables BEFORE you try to build your model
1) Check the IVs' correlations with the DV (high correlations will probably be the best predictors), but…
2) Check the IVs' correlations with the other IVs (high correlations probably indicate collinearity)

If you do find that two IVs are highly correlated, be aware that having them both in the model is probably not the best approach (pick the better one and keep it)
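The screening step above can be sketched with NumPy's correlation matrix: compute the IV-by-IV correlations and flag any pair above a cutoff. The data and the 0.8 cutoff are both illustrative assumptions, not values from the lecture; "ped_cal" is a hypothetical stand-in constructed to track "steps" closely.

```python
import numpy as np

# Hypothetical predictor variables (ped_cal deliberately tracks steps):
ivs = {
    "weight": [60, 70, 80, 90, 100],
    "steps": [8000, 12000, 6000, 10000, 14000],
    "ped_cal": [400, 610, 290, 500, 700],
}
names = list(ivs)
data = np.array([ivs[n] for n in names], dtype=float)
r = np.corrcoef(data)                 # IV-by-IV correlation matrix

# Flag every IV pair whose |r| exceeds the (assumed) collinearity cutoff:
CUTOFF = 0.8
flagged = [(names[i], names[j])
           for i in range(len(names))
           for j in range(i + 1, len(names))
           if abs(r[i, j]) > CUTOFF]
```

With these numbers, only the steps/ped_cal pair is flagged, which mirrors the slide's advice: keep the better of the two and drop the other before fitting.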

QUESTIONS…?

Page 25

Upcoming…

In-class activity on MLR…

Homework (not turned in, due to the exam): Cronk Section 5.4; OPTIONAL: Holcomb Exercises 31 and 32

These cover multiple correlation, NOT full multiple linear regression
Similar to MLR, but looks at the model's r instead of making a prediction equation

Mid-Term Exam next week
Group differences after spring break (t-test, ANOVA, etc.)