MULTIPLE REGRESSION
Using more than one variable to predict another
Last Week
Coefficient of Determination (r2): explained variance between 2 variables
Simple Linear Regression (y = mx + b): predicting one dependent variable from one independent variable
Based on explained variance: if r2 is large, the equation should be a good predictor
SEE, residuals
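Last week's quantities can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's actual dataset: the weights and calorie values below are invented stand-ins.

```python
import numpy as np

# Fit y = mx + b by least squares, then compute r2 (explained variance)
# and the SEE (standard error of the estimate). Data are made up.
x = np.array([60.0, 70.0, 80.0, 90.0, 100.0])      # e.g., body weight (kg)
y = np.array([1800., 2100., 2150., 2400., 2600.])  # e.g., calories expended

m, b = np.polyfit(x, y, 1)       # slope and intercept
pred = m * x + b
residuals = y - pred

ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# SEE uses n - 2 degrees of freedom (two estimated coefficients: m and b)
see = np.sqrt(ss_res / (len(y) - 2))
print(round(r2, 3), round(see, 1))
```

A large r2 with a small SEE is what makes an equation a useful predictor.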
Tonight
Predicting one DV from one IV is simple linear regression
Predicting one DV from multiple IVs is called multiple linear regression
More IVs usually allow for a better prediction of the DV
If IV A explains 20% of the variance (r2 = 0.20) and IV B explains 30% of the variance (r2 = 0.30), can I use both to predict the dependent variable?
Example: Activity Dataset
To demonstrate, we'll use the same data as last week, on the pedometer and armband
Goal: to predict Armband calories (real calories expended) as accurately as possible
Let's start by trying to predict Armband calories with body weight: complete a simple linear regression with body weight
Simple Regression
Here is the simple regression output from using Body Weight (kg) to predict Armband Calories
Simple Regression
Results using Body Weight (kg): r2 = 0.155, SEE = 400.5 calories
Can we improve on this equation by adding in new variables?
First, we have to determine if other variables in the dataset might be related to Armband Calories. Use a correlation matrix.
Correlations
Notice several variables have some association with armband calories:

Variable     r      r2
Height       0.225  0.05
Weight       0.393  0.15
BMI          0.378  0.14
PedSteps     0.782  0.61
PedCalories  0.853  0.73
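A correlation matrix like the one above is straightforward to compute. The sketch below uses invented stand-ins for three of the activity-dataset variables; the variable names mirror the slide, but the values are not the real data.

```python
import numpy as np

# Pairwise correlations among candidate predictors and the DV.
# All values are invented for illustration.
weight  = np.array([60., 75., 68., 90., 82., 71.])       # kg
steps   = np.array([8000., 12000., 9500., 6000., 11000., 10000.])
armband = np.array([1900., 2600., 2200., 2000., 2500., 2300.])  # calories

data = np.vstack([weight, steps, armband])
corr = np.corrcoef(data)   # 3x3 matrix; rows/cols follow the stacking order

# r2 for any pair is just r squared, e.g. steps vs armband calories:
r_steps = corr[1, 2]
print(round(r_steps, 3), round(r_steps ** 2, 3))
```

Scanning the DV's column of such a matrix is how the table above was built: high r (and r2) marks a promising predictor.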
Create New Regression Equation
A simple regression equation looks like: y = mx + b
A multiple regression equation looks like: y = m1x1 + m2x2 + b
Subscripts are used to help organize the data
All we are doing is adding an additional variable into our equation. That new variable will have its own slope, m2.
For the sake of simplicity, let's add in pedometer steps as x2
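Adding the second IV can be sketched with ordinary least squares in numpy. This is a minimal illustration with invented stand-ins for body weight (x1), pedometer steps (x2), and armband calories (y), not the lecture's output.

```python
import numpy as np

# Multiple regression y = m1*x1 + m2*x2 + b via ordinary least squares.
# Values are invented for illustration.
x1 = np.array([60., 75., 68., 90., 82., 71.])                     # body weight
x2 = np.array([8000., 12000., 9500., 6000., 11000., 10000.])      # pedometer steps
y  = np.array([1900., 2700., 2250., 2100., 2550., 2350.])         # armband calories

# Design matrix: one column per IV, plus a column of ones for the intercept b
X = np.column_stack([x1, x2, np.ones_like(x1)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
m1, m2, b = coef

pred = X @ coef
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# SEE now uses n - 3 degrees of freedom (three estimated coefficients)
see = np.sqrt(ss_res / (len(y) - 3))
print(round(r2, 3), round(see, 1))
```

Each new IV just adds a column to the design matrix, which is all "adding a variable to the model" means computationally.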
OUTPUT…
Multiple Regression Output
Simple to Multiple
Results using Body Weight (kg): r2 = 0.155, SEE = 400.5 calories
Results using Body Weight and Pedometer Steps: r2 = 0.672, SEE = 251.7 calories
r2 change = 0.672 - 0.155 = 0.517
If 2 variables are good, would 3 be even better?
Adding one more in…
In addition to body weight (x1) and pedometer steps (x2), let's add in age (x3)
Multiple Regression Output 2
Simple to Multiple
Results using Body Weight (kg): r2 = 0.155, SEE = 400.5 calories
Results using Body Weight and Pedometer Steps: r2 = 0.672, SEE = 251.7 calories, r2 change = 0.517
Results using Body Weight, PedSteps, and Age: r2 = 0.689, SEE = 247.7, r2 change = 0.689 - 0.672 = 0.017
Multiple Regression Decisions
Should we recommend that age is used in the model? These decisions can be difficult: "model building" or "model reduction" is more of an art than a science.
Consider:
p-value of age in the model = 0.104
r2 change from adding age = 0.017, or 1.7% of variance
More coefficients (predictors) make the model more complicated to use and interpret
Does it make sense to include age? Should age be related to caloric expenditure?
Other Regression Issues
Sample Size
With too small a sample, you lack the statistical power to generalize your results to other samples/the whole population
You increase your risk of Type II Error (failing to reject the null hypothesis when it is false)
In multiple regression, the more variables you use in your model, the greater your risk of Type II Error
This is a complicated issue, but essentially you need large samples to use several predictors
Guidelines…
Other Regression Issues
Sample Size
Tabachnick & Fidell (1996): N > 50 + 8m, where N = appropriate sample size and m = # of IVs
So, if you use 3 predictors (like we just did in our example): 50 + 8*3 = 74 subjects
You can find several different 'guess-timates'; I usually just try to have 30 subjects, plus another 30 for each variable in the model (i.e., 30 + 30m). I like to play it safe…
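Both rules of thumb are easy to encode. The function names below are my own, chosen for this sketch:

```python
def n_tabachnick_fidell(m):
    """Tabachnick & Fidell (1996) guideline: N > 50 + 8m, m = number of IVs."""
    return 50 + 8 * m

def n_conservative(m):
    """The lecturer's safer rule of thumb: 30 subjects plus 30 per IV."""
    return 30 + 30 * m

print(n_tabachnick_fidell(3), n_conservative(3))  # 74 and 120
```

For the 3-predictor example above, the two guidelines suggest 74 and 120 subjects respectively; the gap shows how rough these 'guess-timates' are.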
Other Regression Issues
Multiple regression has the same statistical assumptions as correlation/regression: check for normal distribution, outliers, etc.
One new concern with multiple regression is the idea of Collinearity
You have to be careful that your IVs (predictor variables) are not highly correlated with each other
Collinearity can cause a model to overestimate r2
It can also cause one new variable to eliminate another
Example Collinearity
Results of MLR using Body Weight, PedSteps, Age: r2 = 0.689, SEE = 247.7
Imagine we want to add in one other variable, Pedometer Calories
Look at the correlation matrix first…
Notice that Armband calories is highly correlated with both Pedometer Steps and Pedometer Calories
Initially, this looks great because we might have two very good predictors to use
But notice that Pedometer Calories is very highly correlated with Pedometer Steps
These two variables are probably collinear: they are very similar and may not explain 'unique' variance
Here is the MLR result with Weight, Steps, and Age:
Here is the MLR result by adding Pedometer calories in the model:
Pedometer calories becomes the only significant predictor in the model. In other words, the variance in the other 3 variables can be explained by Pedometer Calories; not all 4 variables add 'unique' variance to the model.
Example Collinearity
Results of MLR using Body Weight, PedSteps, Age: r2 = 0.689, SEE = 247.7
Results of MLR using Body Weight, PedSteps, Age, and PedCalories: r2 = 0.745, SEE = 226.2
Results of MLR using just PedCalories (eliminates collinearity): r2 = 0.727, SEE = 227.5
Which model is the best model? Remember, we’d like to pick the strongest predictor model with the fewest number of predictor variables
Model Building
Collinearity makes model building more difficult:
1) When you add in new variables, you have to look at r2, r2 change, and SEE, but you also have to notice what's happening to the other IVs in the model
2) Sometimes, you need to remove variables that used to be good predictors
3) This is why the model with the most variables is not always the best model; sometimes you can do just as well with 1 or 2 variables
What to do about Collinearity?
Your approach: use a correlation matrix to examine the variables BEFORE you try to build your model
1) Check the IVs' correlations with the DV (high correlations will probably be the best predictors), but…
2) Check the IVs' correlations with the other IVs (high correlations probably indicate collinearity)
If you do find that two IVs are highly correlated, be aware that having them both in the model is probably not the best approach (pick the best one and keep it)
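The IV-vs-IV screening step can be automated. In this sketch the arrays are invented stand-ins (pedometer calories is constructed to track steps closely, mimicking the dataset's collinear pair), and the 0.8 cutoff is an arbitrary choice for illustration, not a rule from the lecture:

```python
import numpy as np

# Flag any pair of IVs whose correlation exceeds a cutoff.
# All values are invented for illustration.
steps  = np.array([8000., 12000., 9500., 6000., 11000., 10000.])
pedcal = np.array([320., 470., 380., 250., 440., 400.])  # tracks steps closely
weight = np.array([60., 75., 68., 90., 82., 71.])

ivs = {"steps": steps, "pedcal": pedcal, "weight": weight}
names = list(ivs)
cutoff = 0.8   # arbitrary screening threshold for this sketch

flagged = []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = np.corrcoef(ivs[names[i]], ivs[names[j]])[0, 1]
        if abs(r) > cutoff:
            flagged.append((names[i], names[j]))

print(flagged)   # steps and pedcal come out as the likely collinear pair
```

Any pair this screen flags is a candidate for the "pick the best one and keep it" rule above.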
QUESTIONS…?
Upcoming…
In-class activity on MLR…
Homework (not turned in due to exam): Cronk Section 5.4; OPTIONAL: Holcomb Exercises 31 and 32
Multiple correlation, NOT full multiple linear regression
Similar to MLR, but looks at the model’s r instead of making a prediction equation
Mid-Term Exam next week
Group differences after spring break (t-test, ANOVA, etc…)