Diploma in Statistics: Introduction to Regression
Lecture 2.2
1. Review of Lecture 2.1
– Homework
– Multiple regression
– Job times case study
2. Job times continued
– residual analysis
– model fitting and testing
3. Model fitting and testing procedure
4. t-tests
5. Analysis of Variance
Update: Accessing data files
• Access the data in mstuart's get folder:
– in ISS Public Access labs, click Start, then Network Shortcuts, open Get
– on your own computer with TCD network access, navigate to Ntserver-usr / get
– once in get, type ms, open mstuart, Diploma Reg, Excel Data,
or
• Access the data on the Diploma web page at https://www.scss.tcd.ie:453/courses/dipstats/Local/ST7002_0809.php
• Open the relevant Excel file and copy the data
Homework 2.1.1
The shelf life of packaged foods depends on many factors. Dry cereal (such as corn flakes) is considered to be a moisture-sensitive product, with the shelf life determined primarily by moisture. In a study of the shelf life of one brand of cereal, packets of cereal were stored in controlled conditions (23°C and 50% relative humidity) for a range of times, and moisture content was measured. The results were as follows.
Draw a scatter diagram. Comment. What action is suggested? Why?
Storage Time      0    3    6    8    10   13   16   20   24   27   30   34   37   41
Moisture Content  2.8  3.0  3.1  3.2  3.4  3.4  3.5  3.1  3.8  4.0  4.1  4.3  4.4  4.9
Draw a scatter diagram. Comment. What action is suggested? Why?
2 exceptional cases; delete and investigate
[Figure: Scatterplot of Moisture Content vs Storage Time (x-axis: Storage Time, 0 to 40; y-axis: Moisture Content, 3.0 to 5.0)]
Following appropriate action, the following regression was computed.
The regression equation is
Moisture = 2.86 + 0.0417 Storage
Predictor   Coef      SE Coef   T       P
Constant    2.86122   0.02488   115.01  0.000
Storage     0.041660  0.001177  35.40   0.000
S = 0.0493475
Calculate a 95% confidence interval for the daily change in moisture content; show details.
$$\hat{\beta} \pm 2\,\mathrm{SE}(\hat{\beta}) = 0.04166 \pm 0.00235 = (0.03931,\ 0.04401)$$
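The arithmetic behind this interval can be checked directly from the Coef and SE Coef entries for Storage. The following is a minimal sketch that assumes the "estimate plus or minus 2 standard errors" rule used here; the variable names are illustrative only.

```python
# Values copied from the Storage row of the regression output.
slope = 0.041660       # Coef
se_slope = 0.001177    # SE Coef

# Approximate 95% interval: estimate plus or minus 2 standard errors.
margin = 2 * se_slope
lower, upper = slope - margin, slope + margin

print(f"margin = {margin:.5f}")
print(f"approximate 95% CI = ({lower:.5f}, {upper:.5f})")
```

The factor 2 is a convenient approximation to the exact t multiplier, which depends on the residual degrees of freedom.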
Was the action you suggested on studying the scatter diagram in part (a) justified? Explain.
Predict the moisture content of a packet of cereal stored under these conditions for 5 weeks; calculate a prediction interval.
What would be the effect on your interval of not taking the action you suggested on studying the scatter diagram? Why?
Taste tests indicate that this brand of cereal is unacceptably soggy when the moisture content exceeds 4. Based on your prediction interval, do you think that a box of cereal that has been on the shelf for 5 weeks will be acceptable? Explain.
What about 4 weeks? 3 weeks? What is acceptable?
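One way to explore these questions numerically is sketched below. It rests on two assumptions that are not stated explicitly above: storage time is measured in days (suggested by the phrase "daily change"), and a rough prediction interval of fit ± 2S is adequate, which ignores the uncertainty in the estimated coefficients. The function names and the three-way verdict are hypothetical.

```python
# Fitted values copied from the regression output (exceptional cases deleted).
INTERCEPT = 2.86122
SLOPE = 0.041660
S = 0.0493475
SOGGY_LIMIT = 4.0  # moisture content above which the cereal is judged soggy


def approx_prediction_interval(days):
    """Return (fit, lower, upper) using the rough fit +/- 2S rule (an assumption)."""
    fit = INTERCEPT + SLOPE * days
    return fit, fit - 2 * S, fit + 2 * S


def verdict(days):
    """Compare the whole interval with the soggy limit."""
    _, lower, upper = approx_prediction_interval(days)
    if upper <= SOGGY_LIMIT:
        return "acceptable"
    if lower > SOGGY_LIMIT:
        return "unacceptable"
    return "borderline"


for weeks in (5, 4, 3):
    days = 7 * weeks  # storage time assumed to be recorded in days
    fit, lower, upper = approx_prediction_interval(days)
    print(f"{weeks} weeks: fit {fit:.2f}, interval ({lower:.2f}, {upper:.2f}), {verdict(days)}")
```

Using the whole interval rather than the point prediction is a design choice: a box is called acceptable only if even the upper limit stays at or below 4.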
Example 5: A production prediction problem
Erie Metal Products: The problem
Metal products fabrication:
customers order varying quantities of products of varying complexity;
customers demand accurate and precise order delivery times.
Erie Metal Products: The data
Table 8.1 Times, in hours, to complete jobs with varying numbers of units, numbers of operations per unit and priority status (normal or rushed)
Order    Jobtime   Units   Operations   Normal (0) or
number   (hours)           per unit     Rushed (1)?
1        153       100     6            0
2        192       35      11           0
3        162       127     7            1
4        240       64      12           0
5        339       600     5            1
6        185       14      16           1
7        235       96      11           1
8        506       257     13           0
9        260       21      9            1
10       161       39      8            0
11       835       426     14           0
12       586       843     6            0
13       444       391     8            0
14       240       84      13           1
15       303       235     9            1
16       775       520     12           0
17       136       76      8            1
18       271       139     11           1
19       385       165     14           1
20       451       304     10           0
The multiple linear regression model
$$\text{Jobtime} = \alpha + \beta_{\text{Units}} \times \text{Units} + \beta_{\text{Ops}} \times \text{Ops} + \beta_{\text{T\_Ops}} \times \text{T\_Ops} + \beta_{\text{Rushed}} \times \text{Rushed} + \varepsilon$$
Model parameters
The regression coefficients:
$\beta_{\text{Units}}$, $\beta_{\text{Ops}}$, $\beta_{\text{T\_Ops}}$, $\beta_{\text{Rushed}}$
The "uncertainty" parameter:
$\sigma$, the standard deviation of $\varepsilon$
Regression of Jobtime on other variables
Predictor   Coef      SE Coef   T      P
Constant    77.24     44.76     1.73   0.105
Units       -0.1507   0.1121    -1.34  0.199
Ops         7.152     4.305     1.66   0.117
T_Ops       0.11460   0.01322   8.67   0.000
Rushed      -24.94    19.11     -1.31  0.211
S = 37.4612
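For readers working outside Minitab, a least squares fit of this kind could be attempted as sketched below. This is not the course's procedure: it uses NumPy's `numpy.linalg.lstsq`, the data are transcribed from Table 8.1, and T_Ops is assumed to be Units × Ops (consistent with the value 189 used later for job 9). If those assumptions hold, the printed values should be close to the output above, but that is not guaranteed here.

```python
import numpy as np

# Table 8.1 as (jobtime, units, operations per unit, rushed).
jobs = [
    (153, 100, 6, 0), (192, 35, 11, 0), (162, 127, 7, 1), (240, 64, 12, 0),
    (339, 600, 5, 1), (185, 14, 16, 1), (235, 96, 11, 1), (506, 257, 13, 0),
    (260, 21, 9, 1), (161, 39, 8, 0), (835, 426, 14, 0), (586, 843, 6, 0),
    (444, 391, 8, 0), (240, 84, 13, 1), (303, 235, 9, 1), (775, 520, 12, 0),
    (136, 76, 8, 1), (271, 139, 11, 1), (385, 165, 14, 1), (451, 304, 10, 0),
]
jobtime, units, ops, rushed = np.array(jobs, dtype=float).T
t_ops = units * ops  # assumed definition of T_Ops (total operations)

# Design matrix with a leading column of ones for the constant.
X = np.column_stack([np.ones(len(jobtime)), units, ops, t_ops, rushed])
coef, *_ = np.linalg.lstsq(X, jobtime, rcond=None)

# Residual standard deviation S, using n - p degrees of freedom.
residuals = jobtime - X @ coef
dof = len(jobtime) - X.shape[1]
s = np.sqrt(residuals @ residuals / dof)

for name, b in zip(["Constant", "Units", "Ops", "T_Ops", "Rushed"], coef):
    print(f"{name:9s}{b:10.4f}")
print(f"S = {s:.4f}")
```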
Homework
Predict job times for
small (U=100, O=5),
medium (U=300, O=10) and
large (U=500, O=15) jobs,
both normal and rushed.
Present the results in tabular form.
Homework Solution
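One way the predictions could be tabulated is sketched below, using the coefficients from the first fit. The helper `predict_jobtime` is hypothetical, and T_Ops is assumed to be Units × Ops.

```python
# Coefficients copied from the first regression of Jobtime on the other variables.
COEF = {"Constant": 77.24, "Units": -0.1507, "Ops": 7.152, "T_Ops": 0.11460, "Rushed": -24.94}
SIZES = {"Small": (100, 5), "Medium": (300, 10), "Large": (500, 15)}


def predict_jobtime(units, ops, rushed, coef=COEF):
    """Predicted job time in hours; T_Ops is assumed to be units * ops."""
    t_ops = units * ops
    return (coef["Constant"] + coef["Units"] * units + coef["Ops"] * ops
            + coef["T_Ops"] * t_ops + coef["Rushed"] * rushed)


# Build the table: one row per priority status, one column per job size.
table = {
    status: [round(predict_jobtime(u, o, flag)) for u, o in SIZES.values()]
    for status, flag in (("Normal", 0), ("Rushed", 1))
}

print(f"{'':8s}" + "".join(f"{size:>8s}" for size in SIZES))
for status, row in table.items():
    print(f"{status:8s}" + "".join(f"{value:8d}" for value in row))
```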
Are these predictions useful?
What is S?
What is 2S?
When will my order arrive?
NEXT
Diagnostics; analysis of residuals
Checking model fit
Assumptions:
explanatory variables are adequate
error term ($\varepsilon$):
variation is Normal
variation is stable
Check via residuals
Response = Fit + Residual
Regression diagnostics
• The diagnostic plot, 'deleted' residuals vs fitted values
– checking for homogeneity of error
• The Normal residual plot,
– checking the Normal model
Residuals
Job 9, a rushed job with 21 units and 9 operations per unit, took 260 hours to complete.
Prediction
Jobtime = 77 – 0.15 × 21 + 7.1 × 9 + 0.11 × 189 – 25
= 135,
Residual = 260 – 135 = 125
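The calculation can be reproduced with the unrounded coefficients from the first fit, as in this small sketch. Note that the rounded coefficients shown in the formula give about 133.5, while the unrounded ones give the quoted 135; the variable names are illustrative.

```python
# Unrounded coefficients from the first fit.
coef = {"Constant": 77.24, "Units": -0.1507, "Ops": 7.152, "T_Ops": 0.11460, "Rushed": -24.94}

# Job 9: rushed, 21 units, 9 operations per unit, observed time 260 hours.
units, ops, rushed, observed = 21, 9, 1, 260
t_ops = units * ops  # 21 x 9 = 189

fitted = (coef["Constant"] + coef["Units"] * units + coef["Ops"] * ops
          + coef["T_Ops"] * t_ops + coef["Rushed"] * rushed)
residual = observed - fitted  # Response = Fit + Residual

print(f"fitted = {fitted:.1f}, residual = {residual:.1f}")
```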
Deleted residuals
Job 9, a rushed job with 21 units and 9 operations per unit, took 260 hours to complete.
Deleted prediction, regression with case 9 deleted:
Jobtime = 42 – 0.08 × 21 + 10 × 9 + 0.11 × 189 - 38
= 113,
Deleted residual = 260 – 113 = 147
Standardised deleted residual ≈ DR / s = 147 / 14
= 10.5
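The same steps can be sketched with the unrounded coefficients and S from the fit with case 9 deleted (reported later in the lecture as the revised fit). Dividing by s alone follows the approximation used above; statistical software typically also adjusts for the leverage of the case, which this sketch ignores.

```python
# Coefficients and S from the regression with case 9 deleted.
coef = {"Constant": 41.72, "Units": -0.08349, "Ops": 10.022, "T_Ops": 0.110016, "Rushed": -38.217}
s_deleted = 13.7952

# Job 9 itself: rushed, 21 units, 9 operations per unit, observed 260 hours.
units, ops, rushed, observed = 21, 9, 1, 260

deleted_fit = (coef["Constant"] + coef["Units"] * units + coef["Ops"] * ops
               + coef["T_Ops"] * units * ops + coef["Rushed"] * rushed)
deleted_residual = observed - deleted_fit
approx_standardised = deleted_residual / s_deleted  # leverage adjustment omitted

print(f"deleted prediction = {deleted_fit:.1f}")
print(f"deleted residual   = {deleted_residual:.1f}")
print(f"deleted residual / s = {approx_standardised:.1f}")
```

With s rounded to 14, as above, the ratio is 10.5; the unrounded s gives a slightly larger value. Either way the case is many standard deviations from its prediction.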
Deleted residuals
• Residual
– observed – fitted
• Standardised Residual
– using an estimate of $\sigma$ based on current data
• Standardised Deleted Residual
– calculated from data with suspect case deleted
– $\sigma$ estimated from data with suspect case deleted
The Diagnostic Plot
[Figure: diagnostic plot of deleted residuals (y-axis, -2 to 10) against fitted values (x-axis, 200 to 800)]
Scatterplot of artificial data with a highly exceptional case
NB: exceptionally large Y value corresponds to small X value
Scatter plot and diagnostic plot for artificial data
[Figure: two panels. Diagnostic plot of deleted residuals (-2 to 10) against fitted values (-1 to 1); scatter plot of Y against X (both axes -2 to 2)]
Normal plot of residuals
[Figure: Normal plot of deleted residuals (y-axis, -2 to 10) against Normal scores (x-axis, -2 to 2)]
Statistical Analysis, Section 8.4
Iterating the analysis
• Revising the fit
– revised prediction formula
– revised diagnostics
• A further iteration
Revised fit, case 9 deleted
The regression equation is
Jobtime = 41.7 – 0.0835 Units + 10.0 Ops + 0.110 T_Ops – 38.2 Rushed
19 cases used, 1 cases contain missing values
Predictor   Coef      SE Coef   T      P
Constant    41.72     16.87     2.47   0.027
Units       -0.08349  0.04186   -1.99  0.066
Ops         10.022    1.612     6.22   0.000
T_Ops       0.110016  0.004891  22.49  0.000
Rushed      -38.217   7.166     -5.33  0.000
S = 13.7952
Revised fit: Exercise
Predict job times for small (U=100, O=5),
medium (U=300, O=10) and
large (U=500, O=15) jobs,
normal and rushed.
Revised predictions
Table 8.3 Original and revised predicted job times for small, medium and large jobs
                   Small   Medium   Large
Normal  Original   155     447      969
        Revised    138     447      975
Rushed  Original   130     422      944
        Revised    100     409      937
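The entries of Table 8.3 can be recomputed from the two sets of coefficients, which also shows where the fits disagree most. This is a sketch only: the helper `predict` is hypothetical and T_Ops is assumed to be Units × Ops.

```python
# Coefficient order: Constant, Units, Ops, T_Ops, Rushed.
FITS = {
    "Original": (77.24, -0.1507, 7.152, 0.11460, -24.94),
    "Revised": (41.72, -0.08349, 10.022, 0.110016, -38.217),  # case 9 deleted
}
SIZES = {"Small": (100, 5), "Medium": (300, 10), "Large": (500, 15)}


def predict(coef, units, ops, rushed):
    """Predicted job time in hours for one job under one fit."""
    const, b_units, b_ops, b_t_ops, b_rushed = coef
    return const + b_units * units + b_ops * ops + b_t_ops * units * ops + b_rushed * rushed


# predictions[(status, fit name)] is a list ordered Small, Medium, Large.
predictions = {
    (status, name): [round(predict(coef, u, o, flag)) for u, o in SIZES.values()]
    for status, flag in (("Normal", 0), ("Rushed", 1))
    for name, coef in FITS.items()
}

for (status, name), row in predictions.items():
    print(f"{status:7s}{name:9s}" + "".join(f"{value:7d}" for value in row))
```

The largest change is for small rushed jobs, where the revised prediction is 30 hours lower than the original.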
Recall scatter plot for artificial data
[Figure: scatter plot of Y against X for the artificial data (both axes -2 to 2)]
Revised diagnostics, case 9 deleted
[Figure: two panels of deleted residuals (-4 to 6). Normal plot against Normal scores (-2 to 2); diagnostic plot against fitted values (200 to 800)]
Revised fit, cases 9, 11, 16 deleted
The regression equation is
Jobtime = 44.2 – 0.0693 Units + 9.83 Ops + 0.108 T_Ops – 38.0 Rushed
17 cases used, 3 cases contain missing values
Predictor   Coef      SE Coef   T      P
Constant    44.216    9.080     4.87   0.000
Units       –0.06931  0.02853   –2.43  0.032
Ops         9.8286    0.8873    11.08  0.000
T_Ops       0.107795  0.004114  26.20  0.000
Rushed      –37.960   3.857     –9.84  0.000
S = 7.41272
Revised diagnostics, cases 9, 11, 16 deleted
[Figure: two panels of deleted residuals (-3 to 3). Normal plot against Normal scores (-2 to 2); diagnostic plot against fitted values (200 to 600)]
Coefficient estimates from three fits
Coefficient    Constant   Units   Ops   T_Ops   Rushed
Original fit   77         –0.15   7.2   0.11    –25
Revised fit    42         –0.08   10    0.11    –38
Final fit      44         –0.07   9.8   0.11    –38
Final s.e.     9          0.03    0.9   0.004   4
Homework 2.2.1
Extend the table of predictions for small, medium and large jobs to include predictions based on the final fit.
Compare and contrast.
The model fitting and testing procedure
• Step 1: Initial data analysis:
• Step 2: Least squares fit and interpretation:
• Step 3: Diagnostic analysis of residuals:
• Step 4: Iterate fit and check:
Step 1: Initial data analysis
• standard single variable summaries
– to determine extent of variation
– possible exceptional values;
• scatter plot matrix
– to view pairwise relationships between the response and the explanatory variables
and
– to view pairwise relationships between the explanatory variables themselves.
Step 2: Least squares fit and interpretation
• calculate the best fitting regression coefficients
– check meaningfulness and statistical significance;
• calculate s
– check its usefulness for prediction
– its usefulness relative to alternative estimates of standard deviation.
Step 3: Diagnostic analysis of residuals
• diagnostic plot
– check for exceptional residuals or patterns of residuals,
– possible explanations in terms of the fitted values;
• Normal plot
– check for exceptional residuals or non-linear patterns in the residuals
Step 4: Iterate fit and check
• determine cases for deletion
– repeat steps 2 and 3 until checks are passed.
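Steps 2 to 4 could be automated roughly as sketched below. Several things here are assumptions rather than part of the procedure as stated: the choice of cases for deletion is normally a matter of judgement informed by the plots, whereas this sketch simply deletes the single most extreme case while its deleted residual divided by s exceeds a hypothetical threshold; the leverage adjustment is omitted; and the data are artificial. The function names are illustrative and NumPy is used for the fitting.

```python
import numpy as np


def fit_ols(X, y):
    """Step 2: least squares coefficients and residual standard deviation s."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s = np.sqrt(resid @ resid / (len(y) - X.shape[1]))
    return coef, s


def deleted_residuals_over_s(X, y):
    """Step 3: for each case, refit without it and return (deleted residual) / s."""
    n = len(y)
    ratios = np.empty(n)
    for i in range(n):
        others = np.arange(n) != i
        coef, s = fit_ols(X[others], y[others])
        ratios[i] = (y[i] - X[i] @ coef) / s
    return ratios


def iterate_fit(X, y, threshold=4.0, max_deletions=3):
    """Step 4: delete the most extreme case and refit until the check passes."""
    keep = np.arange(len(y))  # indices of cases still in the fit
    while True:
        coef, s = fit_ols(X[keep], y[keep])
        ratios = deleted_residuals_over_s(X[keep], y[keep])
        worst = int(np.argmax(np.abs(ratios)))
        deleted_so_far = len(y) - len(keep)
        if abs(ratios[worst]) <= threshold or deleted_so_far >= max_deletions:
            return coef, s, keep
        keep = np.delete(keep, worst)


# Artificial data: a straight line with small wiggles and one planted exceptional case.
x = np.linspace(-2, 2, 15)
y = 1 + 2 * x + 0.1 * np.sin(7 * x)
y[3] += 10  # large Y value at a small X value
X = np.column_stack([np.ones_like(x), x])

coef, s, kept = iterate_fit(X, y)
print("cases kept:", kept)
print("coefficients:", np.round(coef, 2), "s:", round(float(s), 3))
```

The cap on deletions is a safeguard against deleting case after case; in practice each deleted case should also be investigated, not just removed.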
Homework 2.2.2
You have been asked to comment, as a statistical consultant, on a prediction formula for forecasting job completion times prepared by a former employee. The formula is, effectively, the one derived from the first fit discussed above. Write a report for management. Your report should refer to
(i) the practical usefulness of the employee's prediction formula, from a customer's perspective,
(ii) the significance of the exceptional cases from the customer's and management's perspectives, and
(iii) your recommended formula, with its relative advantages.
t-tests
First fit
The regression equation is
Jobtime = 77.2 – 0.151 Units + 7.15 Ops + 0.115 T_Ops – 24.9 Rushed
Predictor   Coef      SE Coef   T      P
Constant    77.24     44.76     1.73   0.105
Units       –0.1507   0.1121    –1.34  0.199
Ops         7.152     4.305     1.66   0.117
T_Ops       0.11460   0.01322   8.67   0.000
Rushed      –24.94    19.11     –1.31  0.211
S = 37.4612
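Each entry in the T column is the coefficient divided by its standard error, which can be confirmed with a few lines. The sketch below also flags which ratios exceed 2 in absolute value; that cutoff is a rough rule of thumb assumed here, not the exact critical value for 15 degrees of freedom.

```python
# (Coef, SE Coef) pairs copied from the first fit.
first_fit = {
    "Constant": (77.24, 44.76),
    "Units": (-0.1507, 0.1121),
    "Ops": (7.152, 4.305),
    "T_Ops": (0.11460, 0.01322),
    "Rushed": (-24.94, 19.11),
}

# t ratio = estimate / standard error.
t_values = {name: coef / se for name, (coef, se) in first_fit.items()}

for name, t in t_values.items():
    flag = "exceeds 2" if abs(t) > 2 else "within 2"
    print(f"{name:9s} t = {t:6.2f}  ({flag})")
```

In the first fit only T_Ops clears the cutoff, in line with the P column.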
Revised fit, case 9 deleted
The regression equation is
Jobtime = 41.7 – 0.0835 Units + 10.0 Ops + 0.110 T_Ops – 38.2 Rushed
19 cases used, 1 cases contain missing values
Predictor   Coef      SE Coef   T      P
Constant    41.72     16.87     2.47   0.027
Units       -0.08349  0.04186   -1.99  0.066
Ops         10.022    1.612     6.22   0.000
T_Ops       0.110016  0.004891  22.49  0.000
Rushed      -38.217   7.166     -5.33  0.000
S = 13.7952
Revised fit, cases 9, 11, 16 deleted
The regression equation is
Jobtime = 44.2 – 0.0693 Units + 9.83 Ops + 0.108 T_Ops – 38.0 Rushed
17 cases used, 3 cases contain missing values
Predictor   Coef      SE Coef   T      P
Constant    44.216    9.080     4.87   0.000
Units       –0.06931  0.02853   –2.43  0.032
Ops         9.8286    0.8873    11.08  0.000
T_Ops       0.107795  0.004114  26.20  0.000
Rushed      –37.960   3.857     –9.84  0.000
S = 7.41272
Homework 2.2.3
Make a table of the t values and corresponding s values for the three regressions
Compare, contrast and explain.
Analysis of Variance
S = 7.41272 R-Sq = 99.8% R-Sq(adj) = 99.7%
Analysis of Variance
Source          DF   SS      MS     F        P
Regression      4    299165  74791  1361.12  0.000
Residual Error  12   659     55
Total           16   299824
Residual Mean Square = $s^2$: check!
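The checks suggested by this table can be carried out from the printed figures, as in the sketch below. Because the sums of squares are printed as rounded whole numbers, recomputed quantities agree with the output only approximately; the adjusted R-Sq formula used is the standard one and is not derived in this lecture.

```python
# Figures copied from the Analysis of Variance table for the final fit.
ss_reg, df_reg = 299165, 4
ss_res, df_res = 659, 12
ss_tot, df_tot = 299824, 16
s = 7.41272

# Mean squares are sums of squares divided by their degrees of freedom.
ms_reg = ss_reg / df_reg
ms_res = ss_res / df_res          # compare with s squared
f_ratio = ms_reg / s**2           # uses unrounded s to limit rounding error

# R-Sq is explained over total; R-Sq(adj) compares mean squares instead.
r_sq = ss_reg / ss_tot
r_sq_adj = 1 - (ss_res / df_res) / (ss_tot / df_tot)

print(f"Explained + Unexplained = {ss_reg + ss_res} (Total = {ss_tot})")
print(f"MS(Res) = {ms_res:.2f}, s^2 = {s**2:.2f}")
print(f"F = {f_ratio:.2f}")
print(f"R-Sq = {100 * r_sq:.1f}%, R-Sq(adj) = {100 * r_sq_adj:.1f}%")
```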
Analysis of Variance
Regression Sum of Squares measures explained variation
Residual Sum of Squares measures unexplained (chance) variation
Total Variation = Explained + Unexplained
Check it!
$$R^2 = \frac{\text{Explained}}{\text{Total}} \times 100\%$$
Analysis of Variance
Regression Sum of Squares measures explained variation
Residual Sum of Squares measures unexplained (chance) variation
Total Variation = Explained + Unexplained
F = MS(Reg) / MS(Res)
with 4 and 12 degrees of freedom.
Check it! Check F tables.
Reduction in Prediction Error
No fit prediction error: $s_{\text{No fit}} = s_Y = 202$
1st fit prediction error: $s_{\text{1st fit}} = 37.5$, less by factor of 5.4
2nd fit prediction error: $s_{\text{2nd fit}} = 13.8$, less by factor of 2.7
3rd fit prediction error: $s_{\text{3rd fit}} = 7.4$, less by factor of 1.9
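Each factor is the ratio of one prediction error to the next, as this short sketch confirms. The overall factor at the end is an extra quantity computed here for illustration and is not quoted above.

```python
# Prediction errors (standard deviations) for the successive fits, in hours.
errors = {"No fit": 202, "1st fit": 37.5, "2nd fit": 13.8, "3rd fit": 7.4}

names = list(errors)
# Reduction factor of each fit relative to the one before it.
factors = {
    later: errors[earlier] / errors[later]
    for earlier, later in zip(names, names[1:])
}
overall = errors["No fit"] / errors["3rd fit"]

for name, factor in factors.items():
    print(f"{name}: s = {errors[name]}, less by factor of {factor:.1f}")
print(f"overall reduction factor: {overall:.1f}")
```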
Reading
SA §§ 8.2 - 8.6, § 1.6
Extra Notes: Degrees of Freedom
R2 and Adjusted R2
(Further Interpretation of the Correlation Coefficient)