Regression
Shibin Liu, SAS Beijing R&D
Agenda
• 0. Lesson overview
• 1. Exploratory Data Analysis
• 2. Simple Linear Regression
• 3. Multiple Regression
• 4. Model Building and Interpretation
• 5. Summary
2
Agenda
• 0. Lesson overview
• 1. Exploratory Data Analysis
• 2. Simple Linear Regression
• 3. Multiple Regression
• 4. Model Building and Interpretation
• 5. Summary
3
Lesson overview
4
ANOVA
Predictor Variable → Response Variable
+
Lesson overview
5
Continuous predictor → Continuous response
Correlation analysis
Linear regression
Lesson overview
6
Continuous predictor, continuous response
• Measure linear association
• Examine the relationship
• Screen for outliers
• Interpret the correlation
Correlation analysis
Lesson overview
7
Continuous predictor, continuous response
• Define the linear association
• Determine the equation for the line
• Explain or predict variability
Linear regression
Lesson overview
8
[Flowchart: choosing an analysis]
What do you want to examine?
• The location, spread, and shape of the data's distribution → Summary statistics or graphics?
  – Summary statistics → SUMMARY STATISTICS (descriptive statistics) — Lesson 1
  – Both → DISTRIBUTION ANALYSIS (descriptive statistics, histogram, normal probability plots) — Lesson 1
• The difference between groups on one or more variables → How many groups?
  – Two → TTEST — Lesson 2
  – Two or more → LINEAR MODELS (analysis of variance) — Lesson 2
• The relationship between variables → Which kind of variables?
  – Continuous only → CORRELATIONS, LINEAR REGRESSION — Lessons 3 & 4
  – Categorical response variable → ONE-WAY FREQUENCIES & TABLE ANALYSIS (frequency tables, chi-square test), LOGISTIC REGRESSION — Lesson 5
(Descriptive Statistics vs. Inferential Statistics)
Agenda
• 0. Lesson overview
• 1. Exploratory Data Analysis
• 2. Simple Linear Regression
• 3. Multiple Regression
• 4. Model Building and Interpretation
• 5. Summary
9
Exploratory Data Analysis: Introduction
10
Weight Height
Continuous variable Continuous variable
Scatter plot Correlation analysis
Exploratory data analysis
Linear regression
Exploratory Data Analysis: Objectives
• Examine the relationship between continuous variables using a scatter plot
• Quantify the degree of association between two continuous variables using correlation statistics
• Avoid potential misuses of the correlation coefficient
• Obtain Pearson correlation coefficients
11
Exploratory Data Analysis: Using Scatter Plots to Describe Relationships between Continuous Variables
12
Scatter plot Correlation analysis
Exploratory data analysis
Relationship
Trend
Range
Outlier
Communicate analysis result
X: predictor variable; Y: response variable; coordinates: the values of X and Y
Exploratory Data Analysis: Using Scatter Plots to Describe Relationships between Continuous Variables
13
Which model terms? A squared (quadratic) term can capture curvature.
Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables
14
Exploratory data analysis
Scatter plot Correlation analysis
Linear association
Negative Zero Positive
Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables
15
Pearson correlation coefficient:
For a population: ρ = Cov(X, Y) / (σX σY)
For a sample: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²)
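The deck computes these statistics through SAS tasks; purely as an illustration of the sample formula above, here is a self-contained Python sketch (the function name `pearson_r` and the toy data are ours, not part of the course):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # Σ(x−x̄)(y−ȳ)
    sxx = sum((xi - mx) ** 2 for xi in x)                     # Σ(x−x̄)²
    syy = sum((yi - my) ** 2 for yi in y)                     # Σ(y−ȳ)²
    return sxy / math.sqrt(sxx * syy)

# A perfectly linear increasing relationship gives r = 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```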
Exploratory Data Analysis: Using Correlation to Measure Relationships between Continuous Variables
16
Correlation analysis
Pearson correlation coefficient: r
-1 0 +1
Strong negative linear relationship
Strong positive linear relationship
No linear relationship
Exploratory Data Analysis: Hypothesis testing for a Correlation
17
Correlation Coefficient Test
H0: ρ = 0
Ha: ρ ≠ 0
Population parameter
Sample statistic
Correlation r
• A p-value does not measure the magnitude of the association.
• Sample size affects the p-value.
Rejecting the null hypothesis only means that you can be confident that the true population correlation is not 0. A small p-value can occur (as with many statistics) because of a very large sample size. Even a correlation coefficient of 0.01 can be statistically significant with a large enough sample size. Therefore, it is important to also look at the value of r itself to see whether it is meaningfully large.
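The sample-size effect can be made concrete. The test statistic for H0: ρ = 0 is t = r·√((n − 2)/(1 − r²)) with n − 2 degrees of freedom, so the same small r produces a much larger t (and hence a smaller p-value) as n grows. A minimal sketch (the helper name `corr_t_stat` is ours):

```python
import math

def corr_t_stat(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# The same weak correlation r = 0.1 looks far more "significant" with a huge sample:
print(round(corr_t_stat(0.1, 30), 2))    # 0.53  (nowhere near significant)
print(round(corr_t_stat(0.1, 3000), 2))  # 5.5   (highly significant, yet r is still tiny)
```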
Exploratory Data Analysis: Hypothesis testing for a Correlation
18
-1 0 +1
r = 0.72, r = 0.81
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect
19
Correlation does not imply causation
Besides causality, could other reasons account for strong correlation between two variables?
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect
20
Weight Height
Correlation does not imply causation
A strong correlation between two variables does not mean change in one variable causes the other variable to change, or vice versa.
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect
21
Correlation does not imply causation
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect
22
Correlation does not imply causation
Exploratory Data Analysis: Avoiding Common Errors in Interpreting Correlations: Cause and Effect
23
X: the percentage of students in each state who take the SAT exam
Y: average SAT scores
Is the SAT score tied to college entrance, or not?
Exploratory Data Analysis: Avoiding Common Errors: Types of Relationships
24
A curvilinear (quadratic, parabolic) relationship can produce a Pearson correlation coefficient r near 0 even though the two variables are strongly related.
Exploratory Data Analysis: Avoiding Common Errors: Outliers
25
Data one: r = 0.82; Data two: r = 0.02
Exploratory Data Analysis: Avoiding Common Errors: Outliers
26
What to do with an outlier? First ask why it is an outlier:
• Error → correct it or collect the data again (replicate the data)
• Valid → compute two correlation coefficients, with and without the outlier, and report both
Exploratory Data Analysis: Scenario: Exploring Data Using Correlation and Scatter Plots
27
The Fitness data set: what predicts oxygen consumption?
Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots
28
Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots
29
What’s the Pearson correlation coefficient of Oxygen_Consumption with Run_Time?
What’s the p-value for the correlation of Oxygen_Consumption with Performance?
Exploratory Data Analysis: Exploring Data with Correlations and Scatter Plots
30
Exploratory Data Analysis: Examining Correlations between Predictor Variables
31
Exploratory Data Analysis: Examining Correlations between Predictor Variables
32
What are the two highest Pearson correlation coefficients?
Exploratory Data Analysis
33
Question 1.
The correlation between tuition and rate of graduation at U.S. colleges is 0.55. What does this mean?
a) The way to increase graduation rates at your college is to raise tuition
b) Increasing graduation rates is expensive, causing tuition to rise
c) Students who are richer tend to graduate more often than poorer students
d) None of the above
Answer: d
Agenda
• 0. Lesson overview
• 1. Exploratory Data Analysis
• 2. Simple Linear Regression
• 3. Multiple Regression
• 4. Model Building and Interpretation
• 5. Summary
34
Simple Linear Regression: Introduction
35
Simple Linear Regression: Introduction
36
-1 0 +1
Linear relationships
Variable A, Variable B, Variable C, Variable D
Simple Linear Regression: Introduction
37
Different data sets can have the same r yet very different relationships.
Simple Linear Regression: Introduction
38
Simple Linear Regression
Y: variable of primary interest
X: explains variability in Y
Regression Line
Simple Linear Regression: Objective
39
• Explain the concepts of Simple Linear Regression
• Fit a Simple Linear Regression using the Linear Regression task
• Produce predicted values and confidence intervals
Simple Linear Regression: Scenario: Performing Simple Linear Regression
40
Fitness
Run_Time Oxygen_Consumption
Simple Linear Regression
Linear regression
Simple Linear Regression: The Simple Linear Regression Model
41
Simple Linear Regression: The Simple Linear Regression Model
42
Question 2.
What does epsilon represent?
a) The intercept parameter
b) The predictor variable
c) The variation of X around the line
d) The variation of Y around the line
Answer: d
Simple Linear Regression: How SAS Performs Linear Regression
43
Method of least squares: minimize the sum of squared errors, Σ(Yᵢ − Ŷᵢ)².
The least squares estimates are Best Linear Unbiased Estimators (BLUE): they are unbiased and have minimum variance among linear unbiased estimators.
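For simple linear regression the least squares estimates have a closed form: b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b0 = ȳ − b1·x̄. SAS computes these inside the Linear Regression task; as a sketch of the arithmetic only (function name and data are ours):

```python
def least_squares(x, y):
    """Closed-form least squares estimates for Y = b0 + b1*X + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return b0, b1

# Data generated from y = 1 + 2x exactly, so the fit recovers those coefficients.
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```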
Simple Linear Regression: Measuring How Well a Model Fits the Data
44
Regression model Baseline model
VS.
Simple Linear Regression: Comparing the Regression Model to a Baseline Model
45
Baseline model: Ŷ = Ȳ (predict the mean of Y for every observation)
Better model: Explain more variability
Type of variability | Equation
Explained (SSM)     | Σ(Ŷᵢ − Ȳ)²
Unexplained (SSE)   | Σ(Yᵢ − Ŷᵢ)²
Total (SST)         | Σ(Yᵢ − Ȳ)²
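For a least-squares fit the three quantities satisfy SST = SSM + SSE, and R² = SSM/SST measures how much better the regression is than the baseline. A small sketch (the fitted values come from the least-squares line 0.5 + 1.6x for these made-up data):

```python
def anova_decomposition(y, y_hat):
    """Decompose variability around the mean: SST = SSM + SSE for a least-squares fit."""
    y_bar = sum(y) / len(y)
    ssm = sum((f - y_bar) ** 2 for f in y_hat)            # explained
    sse = sum((yi - f) ** 2 for yi, f in zip(y, y_hat))   # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)              # total
    return ssm, sse, sst

y = [2.0, 4.0, 5.0, 7.0]
y_hat = [0.5 + 1.6 * x for x in [1, 2, 3, 4]]  # least-squares fit of y on x
ssm, sse, sst = anova_decomposition(y, y_hat)
print(round(ssm, 6), round(sse, 6), round(sst, 6))  # 12.8 0.2 13.0
print(round(ssm / sst, 3))                          # 0.985  (R-squared)
```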
Simple Linear Regression: Hypothesis Testing for Linear Regression
46
Linear regression
H0: β1 = 0    Ha: β1 ≠ 0
Simple Linear Regression: Assumptions of Simple Linear Regression
47
Linear regression
Assumptions:
1. The mean of Y is linearly related to X.
2. Errors are normally distributed.
3. Errors have equal variances.
4. Errors are independent.
Simple Linear Regression: Performing Simple Linear Regression
48
Task >Regression>Linear Regression
Simple Linear Regression: Performing Simple Linear Regression
49
Task >Regression>Linear Regression
Simple Linear Regression: Performing Simple Linear Regression
50
Question 3.
In the model Y = β0 + β1X, if the parameter estimate (slope) of X is 0, then which of the following is the best guess (predicted value) for Y when X equals 13?
a) 13
b) The mean of Y
c) A random number
d) The mean of X
e) 0
Answer: b
Simple Linear Regression: Confidence and Prediction Intervals
51
Simple Linear Regression: Confidence and Prediction Intervals
52
Question 4.
Suppose you have a 95% confidence interval around the mean. How do you interpret it?
a) The probability is .95 that the true population mean of Y for a particular X is within the interval.
b) You are 95% confident that a newly sampled value of Y for a particular X is within the interval.
c) You are 95% confident that your interval contains the true population mean of Y for a particular X.
Answer: c
Simple Linear Regression: Confidence and Prediction Intervals
53
Simple Linear Regression: Confidence and Prediction Intervals
54
Simple Linear Regression: Producing Predicted Values of the Response Variable
55
data Need_Predictions;
   input Runtime @@;
   datalines;
9 10 11 12 13
;
run;
Simple Linear Regression: Producing Predicted Values of the Response Variable
56
data Need_Predictions;
   input Runtime @@;
   datalines;
9 10 11 12 13
;
run;
Simple Linear Regression: Producing Predicted Values of the Response Variable
57
Agenda
• 0. Lesson overview
• 1. Exploratory Data Analysis
• 2. Simple Linear Regression
• 3. Multiple Regression
• 4. Model Building and Interpretation
• 5. Summary
58
Agenda
• 0. Lesson overview
• 1. Exploratory Data Analysis
• 2. Simple Linear Regression
• 3. Multiple Regression
• 4. Model Building and Interpretation
• 5. Summary
59
Multiple Regression: Introduction
60
Simple Linear Regression
Predictor Variable → Response Variable
Multiple Linear Regression
More than one Predictor Variable → Response Variable
Multiple Regression: Introduction
61
Simple Linear Regression: Y = β0 + β1X + ε
Multiple Linear Regression: Y = β0 + β1X1 + … + βkXk + ε
When k = 2, the model describes a plane.
Multiple Regression: Objective
62
• Explain the mathematical model for multiple regression
• Describe the main advantage of multiple regression versus simple linear regression
• Explain the standard output from the Linear Regression task
• Describe common pitfalls of multiple linear regression
Multiple Regression: Advantages and Disadvantages of Multiple Regression
63
Advantages Disadvantages
Multiple Linear Regression
127 possible models (2⁷ − 1 subsets of 7 candidate predictors)
Complex to interpret
Multiple Regression: Picturing the Model for Multiple Regression
64
Multiple Linear Regression
The model includes k + 1 parameters: the intercept β0 and k slopes β1, …, βk.
Multiple Regression: Picturing the Model for Multiple Regression
65
Multiple Linear Regression
Multiple Regression: Common Applications
66
Multiple Linear Regression is a powerful tool for the following tasks:
1. Prediction: develop a model to predict future values of a response variable (Y) based on its relationships with other predictor variables (Xs).
2. Analytical or explanatory analysis: develop an understanding of the relationships between the response variable and the predictor variables.
Multiple Regression: Analysis versus Prediction in Multiple Regression
67
Prediction 1. The terms in the model, the values of their
coefficients, and their statistical significance are of secondary importance.
2. The focus is on producing a model that is the best at predicting future values of Y as a function of the Xs.
Ŷ = β̂0 + β̂1X1 + … + β̂kXk
Multiple Regression: Analysis versus Prediction in Multiple Regression
68
Analytical or Explanatory Analysis
1. The focus is understanding the relationship between the dependent variable and independent variables.
2. Consequently, the statistical significance of the coefficient is important as well as the magnitudes and signs of the coefficients.
Ŷ = β̂0 + β̂1X1 + … + β̂kXk
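The coefficients in this equation are the least-squares solution of the normal equations (XᵀX)b = Xᵀy, which the Linear Regression task solves internally. Purely as an illustration of that linear algebra (the function name, the tiny solver, and the data are ours), a pure-Python sketch for any number of predictors:

```python
def fit_multiple_regression(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # design matrix with an intercept column
    p = len(rows[0])
    # Build the normal equations A b = c, where A = X'X and c = X'y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for i in range(p):
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        c[i], c[piv] = c[piv], c[i]
        for k in range(i + 1, p):
            f = A[k][i] / A[i][i]
            for j in range(i, p):
                A[k][j] -= f * A[i][j]
            c[k] -= f * c[i]
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, p))) / A[i][i]
    return b

# Data generated from y = 1 + 2*x1 + 3*x2 exactly, so the fit recovers those values.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
print([round(bi, 6) for bi in fit_multiple_regression(X, y)])  # [1.0, 2.0, 3.0]
```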
Multiple Regression: Hypothesis Testing for Multiple Regression
69
Multiple Linear Regression
H0: β1 = β2 = … = βk = 0    Ha: at least one βi ≠ 0
The regression model does not fit the data better than the baseline model.
The regression model does fit the data better than the baseline model.
Multiple Regression: Hypothesis Testing for Multiple Regression
70
Question 4.
Match the items on the left with those on the right.
1. At least one slope of the regression in the population is not 0, and at least one predictor variable explains a significant amount of variability in the response variable → a) Reject the null hypothesis
2. No predictor variable explains a significant amount of variability in the response variable → b) Fail to reject the null hypothesis
3. The estimated linear regression model does not fit the data better than the baseline model → b) Fail to reject the null hypothesis
Multiple Regression: Assumptions for Multiple Regression
71
Linear regression model
Assumptions:
1. The mean of Y is linearly related to X.
2. Errors are normally distributed.
3. Errors have equal variances.
4. Errors are independent.
Multiple Regression: Scenario: Using Multiple Regression to Explain Oxygen Consumption
72
Performance
Age
Multiple Regression: Adjusted R²
73
Adjusted R² = 1 − [(n − i)(1 − R²) / (n − p)]
where
i = 1 if there is an intercept and 0 otherwise
n = the number of observations used to fit the model
p = the number of parameters in the model
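The formula above penalizes extra parameters: if R² does not improve when a term is added, adjusted R² goes down. A minimal sketch (function name and numbers are ours):

```python
def adjusted_r2(r2, n, p, intercept=True):
    """Adjusted R-squared: 1 - (n - i)(1 - R^2) / (n - p).
    n = observations used to fit the model, p = parameters in the model."""
    i = 1 if intercept else 0
    return 1 - (n - i) * (1 - r2) / (n - p)

# Adding a useless parameter (p: 3 -> 4) with no gain in R-squared lowers adj. R-squared:
print(round(adjusted_r2(0.80, 31, 3), 3))  # 0.786
print(round(adjusted_r2(0.80, 31, 4), 3))  # 0.778
```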
Multiple Regression: Performing Multiple Linear Regression
74
Multiple Regression: Performing Multiple Linear Regression
75
What’s the p-value of the overall model?
Should we reject the null hypothesis or not?
Based on our evidence, do we reject the null hypothesis that the parameter estimate is 0?
Multiple Regression: Performing Multiple Linear Regression
76
[Diagram: the variability in Oxygen_Consumption explained by RunTime overlaps the variability explained by Performance — how much does each predictor explain uniquely?]
Multiple Regression: Performing Multiple Linear Regression
77
Performance and RunTime are strongly correlated (r = −0.82049): collinearity.
Agenda
• 0. Lesson overview
• 1. Exploratory Data Analysis
• 2. Simple Linear Regression
• 3. Multiple Regression
• 4. Model Building and Interpretation
• 5. Summary
78
Model Building and Interpretation : Introduction
79
Performance
Age
Model Building and Interpretation : Introduction
80
?
Model Building and Interpretation : Introduction
81
Stepwise selection methods
Forward
Backward
Stepwise
All possible regressions rank criteria:
R2
Adjusted R2
Mallows’ Cp
‘No selection’ is the default
Model Building and Interpretation: Objectives
82
• Explain the Linear Regression task options for the model selection
• Describe model selection options and interpret output to evaluate the fit of several models
Model Building and Interpretation : Approaches to Selecting Models: Manual
83
Full Model
Model Building and Interpretation : SAS and Automated Approaches to Modeling
84
Stepwise selection methods
Forward
Backward
Stepwise
All possible regressions rank criteria:
R2
Adjusted R2
Mallows’ Cp
‘No selection’ is the default
Run all methods
Look for commonalities
Narrow down models
Model Building and Interpretation : The All-Possible Regressions Approach to Model Building
85
Fitness
Predictor variables
128 possible models
Model Building and Interpretation : Evaluating Models Using Mallows' Cp Statistic
86
Cp
Mallows' Cp Statistic
Model Bias
Under-fitting Over-fitting
Model Building and Interpretation : Evaluating Models Using Mallows' Cp Statistic
87
Cp
Mallows' Cp Statistic
Model Bias
For Prediction
Mallows' criterion: Cp <= p
criteria
Parameter estimation
Hocking's criterion: Cp ≤ 2p − p_full + 1
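A commonly used form of the statistic is Cp = SSE_p / MSE_full − n + 2p, where SSE_p is the error sum of squares of the candidate model with p parameters and MSE_full is the mean squared error of the full model. As an illustration of the two criteria above (the function names and the plugged-in numbers are ours, hypothetical):

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' Cp = SSE_p / MSE_full - n + 2p for a candidate model
    with p parameters, fit on n observations."""
    return sse_p / mse_full - n + 2 * p

def meets_prediction_criterion(cp, p):
    return cp <= p                       # Mallows' criterion (prediction)

def meets_estimation_criterion(cp, p, p_full):
    return cp <= 2 * p - p_full + 1      # Hocking's criterion (parameter estimation)

# Hypothetical candidate: p = 4 parameters, n = 31 observations.
cp = mallows_cp(sse_p=12.0, mse_full=0.4, n=31, p=4)
print(cp)                                # 7.0 = 12/0.4 - 31 + 8
print(meets_prediction_criterion(cp, 4)) # False (7.0 > 4), so it fails Mallows' cut
```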
Model Building and Interpretation : Viewing Mallows' Cp Statistic
88
Cp in the Linear Regression task
Partial output
p = (number of variables in the model) + 1
Model Building and Interpretation : Viewing Mallows' Cp Statistic
89
Partial output
Mallows' criterion: Cp <= p
In this output, how many models have a value for Cp that is less than or equal to p?
Which of these models has the fewest parameters?
Model Building and Interpretation : Viewing Mallows' Cp Statistic
90
Partial output
How many models meet Hocking's criterion for Cp for parameter estimation?
First of all, what is p for the full model? Hocking's criterion: Cp ≤ 2p − p_full + 1
p_full = 8 (7 variables + 1 intercept)
Cp ≤ 12 − 8 + 1
Model Building and Interpretation : Viewing Mallows' Cp Statistic
91
Question 5.
What happens when you use the all-possible regressions method? Select all that apply.
1. You compare the R-square, adjusted R-square, and Cp statistics to evaluate the models. ✓
2. SAS computes all possible models. ✓
3. You choose a selection method (stepwise, forward, or backward).
4. SAS ranks the results. ✓
5. You cannot reduce the number of models in the output.
6. You can produce a plot to help identify models that satisfy criteria for the Cp statistic. ✓
Answer: 1, 2, 4, and 6
Model Building and Interpretation : Viewing Mallows' Cp Statistic
92
Question 6.
Match the items on the left with those on the right.
1. Preferred over R-square for evaluating multiple linear regression models (takes into account the number of terms in the model) → c) adjusted R-square
2. Useful for parameter estimation → b) Hocking's criterion for Cp
3. Useful for prediction → a) Mallows' criterion for Cp
Model Building and Interpretation : Using Automatic Model Selection
93
Model Building and Interpretation : Using Automatic Model Selection
94
Model Building and Interpretation : Estimating and Testing Coefficients for Selected Models – Prediction model
95
Model Building and Interpretation : Estimating and Testing Coefficients for Selected Models –Explanatory Model
96
Model Building and Interpretation : Estimating and Testing Coefficients for Selected Models
97
Model Building and Interpretation : The Stepwise Selection Approach to Model Building
98
Stepwise selection methods
Forward
Backward
Stepwise
Model Building and Interpretation : The Stepwise Selection Approach to Model Building
99
Forward
The forward selection method starts with no variables, then adds the most significant variable at each step, until no significant variable remains. A variable, once added, is never removed, even if it becomes insignificant later.
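SAS decides entry using significance levels; as a simplified sketch of just the greedy control flow (a generic "score gain" rule stands in for the p-value test, and all names and numbers here are ours):

```python
def forward_selection(candidates, score, min_gain=1e-3):
    """Simplified forward selection: greedily add the candidate that most
    improves the model score (e.g. R-squared); stop when no candidate helps
    by at least min_gain. Real SAS forward selection uses a significance
    level for entry instead of a raw score gain."""
    selected, current = [], score([])
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda v: score(selected + [v]))
        gain = score(selected + [best]) - current
        if gain < min_gain:
            break                    # no remaining variable helps enough
        selected.append(best)        # once added, a variable is never removed
        remaining.remove(best)
        current += gain
    return selected

# Toy score: pretend each variable contributes a fixed amount of R-squared.
gains = {"RunTime": 0.74, "Age": 0.10, "Weight": 0.0005}
score = lambda vs: sum(gains[v] for v in vs)
print(forward_selection(gains, score))  # ['RunTime', 'Age']
```

Backward selection reverses the loop (start full, drop the least helpful term), and stepwise alternates the two moves.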
Model Building and Interpretation : The Stepwise Selection Approach to Model Building
100
Backward
The backward selection method starts with all variables in the model, then removes the least significant variable at each step, until all remaining variables are significant. Once a variable is removed, it cannot re-enter.
Model Building and Interpretation : The Stepwise Selection Approach to Model Building
101
Stepwise
Stepwise selection combines the ideas of forward and backward selection. It starts with no variables and adds the most significant variable, as in forward selection; however, like backward selection, it can also drop insignificant variables, one at a time. The method stops when all terms in the model are significant and all terms out of the model are not significant.
Model Building and Interpretation : The Stepwise Selection Approach to Model Building
102
Application Stepwise selection methods
Forward
Backward
Stepwise
Identify candidate models
Use expertise to choose
Model Building and Interpretation : Performing Stepwise Regression: Forward selection
103
Model Building and Interpretation : Performing Stepwise Regression: Forward selection
104
Model Building and Interpretation : Performing Stepwise Regression: Backward selection
105
Model Building and Interpretation : Performing Stepwise Regression: Backward selection
106
Model Building and Interpretation : Performing Stepwise Regression: Stepwise selection
107
Model Building and Interpretation : Performing Stepwise Regression: Stepwise selection
108
Model Building and Interpretation : Using Alternative Significance Criteria for Stepwise Models
109
Stepwise regression models with the default significance levels
Using a 0.05 significance level
Model Building and Interpretation : Comparison of Selection Methods
110
Stepwise selection methods use fewer computer resources.
All-possible regression generates more candidate models, which might have nearly equal R² and Cp statistics.
Agenda
• 0. Lesson overview
• 1. Exploratory Data Analysis
• 2. Simple Linear Regression
• 3. Multiple Regression
• 4. Model Building and Interpretation
• 5. Summary
111
Home Work: Exercise 1
1.1 Describing the Relationship between Continuous Variables
Percentage of body fat, age, weight, height, and 10 body circumference measurements (for example, abdomen) were recorded for 252 men. The data are stored in the BodyFat2 data set. Body fat, one measure of health, was accurately estimated by an underwater weighing technique. There are two measures of percentage body fat in this data set.
Case          case number
PctBodyFat1   percent body fat using Brozek's equation, 457/Density − 414.2
PctBodyFat2   percent body fat using Siri's equation, 495/Density − 450
Density       density (gm/cm³)
Age           age (years)
Weight        weight (lbs)
Height        height (inches)
112
Home Work: Exercise 1
113
Home Work: Exercise 1
1.1 Describing the Relationship between Continuous Variables
a. Generate scatter plots and correlations for the variables Age, Weight, Height, and the circumference measures versus the variable PctBodyFat2.
Important! The Correlation task limits you to 10 variables at a time for scatter plot matrices, so for this exercise, look at the relationships with Age, Weight, and Height separately from the circumference variables (Neck, Chest, abdomen, Hip, thigh, Knee, Ankle, Biceps, Forearm, and Wrist)
Note: Correlation tables can be created using more than 10 VAR variables at a time.
b. What variable has the highest correlation with PctBodyFat2?
c. What is the value of the coefficient?
d. Is the correlation statistically significant at the 0.05 level?
e. Can straight lines adequately describe the relationships?
f. Are there any outliers that you should investigate?
g. Generate correlations among the variables Age, Weight, and Height, and among the circumference measures. Are there any notable relationships?
114
Home Work: Exercise 2
2.1 Fitting a Simple Linear Regression Model
Use the BodyFat2 data set for this exercise:
a. Perform a simple linear regression model with PctBodyFat2 as the response variable and Weight as the predictor.
b. What is the value of the F statistic and the associated p-value? How would you interpret this with regard to the null hypothesis?
c. Write the predicted regression equation.
d. What is the value of the R² statistic? How would you interpret this?
e. Produce predicted values for PctBodyFat2 when Weight is 125, 150, 175, 200, and 225. (See the SAS code in the comments below.)
f. What are the predicted values?
g. What is the value of PctBodyFat2 when Weight is 150?
115
Home Work: Exercise 3
3.1 Perform a Multiple Regression
a. Using the BodyFat2 data set, run a regression of PctBodyFat2 on the variables Age, Weight, Height, Neck, Chest, Abdomen, Hip, Thigh, Knee, Ankle, Biceps, Forearm, and Wrist. Compare the ANOVA table with that from the model with only Weight in the previous exercise. What is the difference?
b. How do the R2 and the adjusted R2 compare with these statistics for the Weight regression demonstration?
c. Did the estimate for the intercept change? Did the estimate for the coefficient of Weight change?
116
Home Work: Exercise 3
3.2 Simplifying the Model
a. Rerun the model in the previous exercise, but eliminate the variable with the highest p-value. Compare the result with the previous model.
b. Did the p-value for the model change notably?
c. Did the R² and adjusted R² change notably?
d. Did the parameter estimates and their p-values change notably?

3.3 More Simplifying of the Model
e. Rerun the model in the previous exercise, but eliminate the variable with the highest p-value.
f. How did the output change from the previous model?
g. Did the number of parameters with a p-value less than 0.05 change?
117
Home Work: Exercise 4
4.1 Using Model Building Techniques
Use the BodyFat2 data set to identify a set of “best” models.
a. Using the Mallows' Cp option, use an all-possible regression technique to identify a set of candidate models that predict PctBodyFat2 as a function of the variables Age, Weight, Height, Neck, Chest, abdomen, Hip, thigh, Knee, Ankle, Biceps, Forearm, and Wrist .
Hint: select the best 60 models based on Cp to compare
b. Use a stepwise regression method to select a candidate model. Try Forward selection, Backward selection, and Stepwise selection.
c. How many variables would result from a model using forward selection with a significance level for entry of 0.05, instead of the default of 0.50?
118
Thank you!