Diploma in Statistics
Introduction to Regression
Lecture 5.1
1. Review
2. Transforming data, the log transform
i. liver fluke egg hatching rate
ii. explaining CEO remuneration
iii. brain weights and body weights
3. SLR with transformed data
4. Transforming X, quadratic fit
5. Other options
Using t values
Convention: n > 30 is big, n < 30 is small.
$z_{0.05} = 1.96 \approx 2$
$t_{30,\,0.05} = 2.04 \approx 2$
Selected critical values for the t-distribution

  ν      .25     .10     .05     .02     .01    .002    .001
  1     2.41    6.31   12.71   31.82   63.66  318.32  636.61
  2     1.60    2.92    4.30    6.96    9.92   22.33   31.60
  3     1.42    2.35    3.18    4.54    5.84   10.22   12.92
  4     1.34    2.13    2.78    3.75    4.60    7.17    8.61
  5     1.30    2.02    2.57    3.36    4.03    5.89    6.87
  6     1.27    1.94    2.45    3.14    3.71    5.21    5.96
  7     1.25    1.89    2.36    3.00    3.50    4.79    5.41
  8     1.24    1.86    2.31    2.90    3.36    4.50    5.04
  9     1.23    1.83    2.26    2.82    3.25    4.30    4.78
 10     1.22    1.81    2.23    2.76    3.17    4.14    4.59
 12     1.21    1.78    2.18    2.68    3.05    3.93    4.32
 15     1.20    1.75    2.13    2.60    2.95    3.73    4.07
 20     1.18    1.72    2.09    2.53    2.85    3.55    3.85
 24     1.18    1.71    2.06    2.49    2.80    3.47    3.75
 30     1.17    1.70    2.04    2.46    2.75    3.39    3.65
 40     1.17    1.68    2.02    2.42    2.70    3.31    3.55
 60     1.16    1.67    2.00    2.39    2.66    3.23    3.46
120     1.16    1.66    1.98    2.36    2.62    3.16    3.37
  ∞     1.15    1.64    1.96    2.33    2.58    3.09    3.29
Homework 4.2.1
Quantify the extent of the recovery in Year 6, Q3.
P̂ = 1030 Q1 + 1292 Q2 + 1210 Q3 + 1279 Q4 + 33.7 Time
Year 6 Q2: P = 1657
P̂ = 1292 + 33.7 × 22 = 2033
P − P̂ = 1657 − 2033 = −376
Year 6 Q3: P = 2185
P̂ = 1210 + 33.7 × 23 = 1985
P − P̂ = 2185 − 1985 = 200
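The arithmetic in Homework 4.2.1 can be checked with a short Python sketch. The coefficients are the rounded ones from the fitted equation on the slide; the function names `fitted` and `residual` are illustrative only.

```python
# Rounded coefficients copied from the fitted equation on the slide.
QUARTER_COEF = {"Q1": 1030, "Q2": 1292, "Q3": 1210, "Q4": 1279}
TIME_COEF = 33.7


def fitted(quarter, time):
    """Fitted value P-hat for a given quarter indicator and time index."""
    return QUARTER_COEF[quarter] + TIME_COEF * time


def residual(actual, quarter, time):
    """Actual minus fitted value (P - P-hat)."""
    return actual - fitted(quarter, time)


if __name__ == "__main__":
    cases = [("Year 6 Q2", 1657, "Q2", 22), ("Year 6 Q3", 2185, "Q3", 23)]
    for label, actual, quarter, time in cases:
        print(label, round(fitted(quarter, time)), round(residual(actual, quarter, time)))
```

The move from a residual of about −376 in Q2 to about +200 in Q3 is one way to quantify the recovery.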
Homework 4.2.2
List correspondences between the output from the original regression and the output from the alternative regression.
Confirm that the coefficients of Q1, Q2 and Q3 in the original are the corresponding coefficients in the alternative with the Q4 coefficient added.
Predictor      Coef  SE Coef      T      P
Noconstant
Q1          1029.87    23.41  43.99  0.000
Q2          1292.35    24.45  52.85  0.000
Q3          1210.42    25.55  47.37  0.000
Q4          1278.70    26.71  47.88  0.000
Time         33.725    1.619  20.83  0.000
S = 40.9654

Predictor      Coef  SE Coef      T      P
Constant    1278.70    26.71  47.88  0.000
Q1          -248.82    26.36  -9.44  0.000
Q2            13.65    26.11   0.52  0.609
Q3           -68.27    25.96  -2.63  0.019
Time         33.725    1.619  20.83  0.000
S = 40.9654
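The correspondence asked for in Homework 4.2.2 can be confirmed numerically. The sketch below copies the coefficients from the two outputs above and rebuilds the no-constant coefficients from the constant version; the dictionary and function names are illustrative, and small differences of about 0.01 are rounding in the printed output.

```python
# Coefficients copied from the two regression outputs.
ORIGINAL = {"Q1": 1029.87, "Q2": 1292.35, "Q3": 1210.42, "Q4": 1278.70, "Time": 33.725}
ALTERNATIVE = {"Constant": 1278.70, "Q1": -248.82, "Q2": 13.65, "Q3": -68.27, "Time": 33.725}


def reconstructed_original(alt):
    """Rebuild the no-constant coefficients from the constant-plus-three-indicators fit."""
    const = alt["Constant"]
    rebuilt = {q: alt[q] + const for q in ("Q1", "Q2", "Q3")}
    rebuilt["Q4"] = const          # Q4 is the baseline quarter, absorbed into the constant
    rebuilt["Time"] = alt["Time"]  # the Time coefficient is unchanged
    return rebuilt


def max_discrepancy():
    """Largest absolute gap between rebuilt and reported coefficients."""
    rebuilt = reconstructed_original(ALTERNATIVE)
    return max(abs(rebuilt[name] - ORIGINAL[name]) for name in ORIGINAL)


if __name__ == "__main__":
    print(reconstructed_original(ALTERNATIVE))
    print("max discrepancy:", max_discrepancy())
```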
Homework 4.2.3
1. Calculate the simple linear regressions of Jobtime on each of T_Ops and Units. Confirm the corresponding t-values.
2. Calculate the simple linear regression of Jobtime on Ops per Unit. Comment on the negative correlation of Jobtime with Ops per Unit in the light of the corresponding t-value.
3. Confirm the calculation of the $R^2$ values.
Solution 4.2.3
2. Calculate the simple linear regression of Jobtime on Ops per Unit. Comment on the negative correlation of Jobtime with Ops per Unit in the light of the corresponding t-value.
Comment: The t-value is not statistically significant, so the negative correlation is consistent with chance variation and has no substantive meaning.
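Variance Inflation Factors

$$\mathrm{SE}(\hat{\beta}_k) = \frac{\sigma}{\sqrt{n}\, s_k} \times \frac{1}{\sqrt{1 - R_k^2}}$$

When $R_k^2 = 0$:

$$\mathrm{SE}(\hat{\beta}_k) = \frac{\sigma}{\sqrt{n}\, s_k}$$

Here $s_k$ is the standard deviation of $X_k$ and $R_k^2$ is the $R^2$ from regressing $X_k$ on the other X variables.

Standard error inflation factor: $\dfrac{1}{\sqrt{1 - R_k^2}}$

Variance inflation factor: $\mathrm{VIF}_k = \dfrac{1}{1 - R_k^2}$

Convention: problem if $R_k^2 > 90\%$ or $\mathrm{VIF}_k > 10$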
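A minimal Python sketch of the two inflation factors, taking $R_k^2$ as an input that has already been obtained by regressing $X_k$ on the other X variables. The function names are illustrative.

```python
import math


def vif(r_squared_k):
    """Variance inflation factor 1 / (1 - R_k^2)."""
    if not 0.0 <= r_squared_k < 1.0:
        raise ValueError("R_k^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_k)


def se_inflation(r_squared_k):
    """Standard error inflation factor, the square root of the VIF."""
    return math.sqrt(vif(r_squared_k))


if __name__ == "__main__":
    for r2 in (0.0, 0.5, 0.75, 0.9):
        print(r2, round(vif(r2), 2), round(se_inflation(r2), 2))
```

At $R_k^2 = 0.9$ the VIF is 10, which is why the two thresholds in the convention coincide.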
What to do?
• Get new X values, to break correlation pattern
– impractical in observational studies
• Choose a subset of the X variables
– manually
– automatically
• stepwise regression
• other methods
Residential load survey data.
Data collected by a US electricity supplier during an investigation of the factors that influence peak demand for electricity by residential customers.
Load is demand at the system peak demand hour, in kW,
Size (X1) is house size, in SqFt/1000,
Income (X2) is annual family income, in $/1000,
AirCon (X3) is air conditioning capacity, in tons,
Index (X4) is the house appliance index, in kW,
Residents (X5) is number in house on a typical day
Matrix plot
Results

All variables in:
Predictor          Coef     SE Coef      T      P
Constant         0.1263      0.2289   0.55  0.585
Size            -2.6689      0.9059  -2.95  0.006
Income       0.00027912  0.00007892   3.54  0.001
AirCon          0.42462     0.03472  12.23  0.000
Index        0.00038137  0.00007884   4.84  0.000
Residents       0.00197     0.02218   0.09  0.930

Income deleted:
Predictor       Coef  SE Coef      T      P
Constant      -397.0    492.7  -0.81  0.426
Size         10943.3    594.2  18.42  0.000
AirCon         -1.86    75.45  -0.02  0.980
Index         0.0721   0.1709   0.42  0.676
Residents      38.65    47.75   0.81  0.424
Exercise
Calculate the VIF for Size. Comment.
Homework
Calculate variance inflation factors for all explanatory variables. Discuss.
Multicollinearity
Arises when there is perfect correlation among the X variables.
Example: Indicators
Illustration: Minitab
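The indicator example can be illustrated with a small Python sketch. It uses a hypothetical two years of quarterly data and shows that the four quarterly indicators always sum to the constant column, so a model containing a constant together with all of Q1 to Q4 has perfectly correlated X variables. This is consistent with the two regressions shown earlier, which use either no constant with four indicators or a constant with three.

```python
def quarter_indicators(quarter):
    """Return the indicator values [Q1, Q2, Q3, Q4] for a quarter numbered 1 to 4."""
    return [1 if quarter == q else 0 for q in (1, 2, 3, 4)]


# Hypothetical design: two years of quarterly observations.
rows = [quarter_indicators(q) for _year in range(2) for q in (1, 2, 3, 4)]
constant_column = [1] * len(rows)

# Each row has exactly one indicator switched on, so the row sums equal the constant.
indicator_sums = [sum(row) for row in rows]

if __name__ == "__main__":
    print(indicator_sums == constant_column)
```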
2. Transforming data, the log transform
(i) Hatching of liver fluke eggs
The life cycle of the liver fluke
1. Adults in liver lay eggs
2. Animals excrete eggs
3. Eggs hatch on ground
4. Larvae seek snail
5. Development within snail
6. Emergence from snail
7. Consumption by animal
8. Penetration to liver
Hatching of liver fluke eggs: Duration and Success rate
Duration and success rate of hatching of 600 liver fluke eggs at a series of fixed temperatures

Temperature (°C)  Number hatched  Duration (mean days)    SD  Hatch%
              10             546                115.75  2.14    91.0
              13             543                 56.50  2.33    90.5
              16             534                 32.39  1.98    89.0
              18             501                 24.49  1.41    83.5
              20             499                 18.92  1.39    83.1
              22             497                 15.58  1.23    82.8
              24             465                 13.39  1.03    77.5
              26             448                 11.98  1.28    74.0
              28             438                 10.16  0.94    73.0
              30             432                  9.45  0.96    72.0
              32             256                 10.37  0.94    42.5
              34              42                 11.52  0.85     7.0
              35               0
[Figure: Scatterplot of Duration vs Temperature]
[Figure: Scatterplot of Log(Duration) vs Temperature]
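The transformed values behind the Log(Duration) plot can be reproduced from the table of mean durations. This sketch assumes base-10 logs, which is consistent with the plot's vertical axis running from about 1.0 to 2.2; the variable names are illustrative.

```python
import math

# Temperatures (°C) and mean hatching durations (days) copied from the table;
# the 35 °C row is omitted because no eggs hatched.
TEMPERATURE_C = [10, 13, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34]
MEAN_DURATION_DAYS = [115.75, 56.50, 32.39, 24.49, 18.92, 15.58,
                      13.39, 11.98, 10.16, 9.45, 10.37, 11.52]

# Assumption: base-10 logarithm.
log_duration = [math.log10(d) for d in MEAN_DURATION_DAYS]

if __name__ == "__main__":
    for temp, log_d in zip(TEMPERATURE_C, log_duration):
        print(temp, round(log_d, 3))
```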
(ii) Explaining CEO Compensation and Company Sales
(Forbes magazine, May 1994)
[Figure: Scatterplot of Total comp vs Sales]
Explaining CEO Remuneration, bivariate log transformation
[Figure: Scatterplot of LogComp vs LogSales]
(iii) Mammals' Brainweight vs Bodyweight

Species                    Bodyweight  Brainweight
African elephant                 6654         5712
African giant pouched rat           1          6.6
Arctic fox                      3.385         44.5
Arctic ground squirrel           0.92          5.7
Asian elephant                   2547         4603
Brachiosaurus                   87000        154.5
Baboon                          10.55        179.5
Big brown bat                   0.023          0.3
Brazilian tapir                   160          169
Cat                               3.3         25.6
Chimpanzee                      52.16          440
...                               ...          ...
Scatterplot view
[Figure: Scatterplot of Brainweight vs Bodyweight]
Scatterplot view, log transform
[Figure: Scatterplot of LBrainW vs LBodyW]
Scatterplot view, Dinosaurs deleted
[Figure: Scatterplot of LBrainW vs LBodyW]
Histogram view
[Figure: Histogram of Brainweight; Histogram of Bodyweight]
Histogram view, log transform
[Figure: Histogram of LBrainW; Histogram of LBodyW]
Changing spread with log
Why the log transform works
High spread at high X is transformed to low spread at high Y.
Low spread at low X is transformed to high spread at low Y.
Why the log transform works
10 to 100 is transformed to $\log_{10}(10)$ to $\log_{10}(10^2)$, i.e., 1 to 2.
1/10 = 0.1 to 1/100 = 0.01 is transformed to $\log_{10}(10^{-1})$ to $\log_{10}(10^{-2})$, i.e., −1 to −2.
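The two intervals above can be compared directly in Python. On the raw scale their widths differ by a factor of 1000, while on the log scale both have width 1. The helper names are illustrative.

```python
import math


def log_interval(a, b):
    """Map the endpoints of an interval to the base-10 log scale."""
    return (math.log10(a), math.log10(b))


def width(interval):
    """Absolute width of an interval given as a pair of endpoints."""
    return abs(interval[1] - interval[0])


HIGH = (10, 100)    # wide on the raw scale
LOW = (0.1, 0.01)   # narrow on the raw scale

if __name__ == "__main__":
    print("raw widths:", width(HIGH), width(LOW))
    print("log widths:", width(log_interval(*HIGH)), width(log_interval(*LOW)))
```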
3. SLR with transformed data
SLR with transformed data: LBrainW versus LBodyW
The regression equation is
LBrainW = 0.932 + 0.753 LBodyW

Predictor     Coef  SE Coef      T      P
Constant   0.93237  0.04170  22.36  0.000
LBodyW     0.75309  0.02858  26.35  0.000

S = 0.302949
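Assuming LBrainW and LBodyW are base-10 logs of brain weight and body weight, the fitted line can be read on the original scale as a power law. The following is a sketch of that back-transformation, not part of the original slides:

```latex
\begin{align*}
\log_{10}(\text{BrainW}) &= 0.932 + 0.753\,\log_{10}(\text{BodyW}) \\
\text{BrainW} &= 10^{0.932} \times \text{BodyW}^{0.753}
               \approx 8.55 \times \text{BodyW}^{0.753}
\end{align*}
```

On this reading, a tenfold increase in body weight corresponds to a multiplication of fitted brain weight by $10^{0.753} \approx 5.7$.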
Application: Do humans conform?
[Figure: Scatterplot of LBrainW vs LBodyW, with the Human point labelled]
Application: Do humans conform?
• Delete the Human data,
• calculate regression,
• predict human LBrainW and
• compare to actual, relative to s
Application: Do humans conform?
Regression Analysis: LBrainW versus LBodyW
The regression equation is
LBrainW = 0.924 + 0.744 LBodyW

Predictor     Coef  SE Coef      t      p
Constant   0.92410  0.03933  23.50  0.000
LBodyW     0.74383  0.02706  27.48  0.000

S = 0.285036
Application: Do humans conform?
LBodyW(Human) = 1.79239
LBrainW(Human) = 3.12057
Predicted LBrainW = 0.924 + 0.744 × 1.79239 = 2.25754
Residual = 3.12057 − 2.25754 = 0.86303
Residual / s = 0.86303 / 0.285036 = 3.03
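The human calculation can be reproduced in a few lines of Python, using the rounded coefficients that appear in the regression equation on the slide. The variable names are illustrative.

```python
# Rounded coefficients and s from the regression fitted without the human case.
INTERCEPT = 0.924
SLOPE = 0.744
S = 0.285036

# Observed human values on the log scale, as quoted on the slide.
LBODYW_HUMAN = 1.79239
LBRAINW_HUMAN = 3.12057

predicted = INTERCEPT + SLOPE * LBODYW_HUMAN
residual = LBRAINW_HUMAN - predicted
standardised = residual / S  # residual expressed in units of s

if __name__ == "__main__":
    print(round(predicted, 5), round(residual, 5), round(standardised, 2))
```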
Deleted residuals
For each potentially exceptional case:
– delete the case
– calculate the regression from the rest
– use the fitted equation to calculate a deleted fitted value
– calculate the deleted residual = observed value – deleted fitted value
Minitab does this automatically for all cases!
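The steps above can be sketched in plain Python for simple linear regression. This is an illustrative leave-one-out loop computing raw, unscaled deleted residuals, not a description of how Minitab computes its output; the data values and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def deleted_residuals(xs, ys):
    """For each case: delete it, refit on the rest, and compare with its observed value."""
    results = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(rest_x, rest_y)
        deleted_fit = intercept + slope * xs[i]
        results.append(ys[i] - deleted_fit)
    return results


# Hypothetical data: the first four points lie exactly on y = 2x, the fifth does not.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 15.0]

if __name__ == "__main__":
    print([round(r, 3) for r in deleted_residuals(xs, ys)])
```

Because the exceptional case is excluded from its own fit, it cannot pull the line towards itself, so its deleted residual (5 in this toy example) is larger than its ordinary residual would be.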
Application: Do humans conform?
With 63 cases, we do not expect to see any cases with residuals exceeding 3 standard deviations.
On the other hand, recalling the scatter plot, the humans do not appear particularly exceptional. The dotplot view of deleted residuals emphasises this: water opossums appear more exceptional.
[Figure: Dotplot of deleted residuals, with Human and Water Opossum labelled]
Application: Do humans conform?
[Figure: Probability Plot of Deleted Residuals (Deleted Residuals vs Score); AD 0.385, P-Value 0.383]
4. Transforming X, quadratic fit
Optimising a nicotine extraction process
In determining the quantity of nicotine in different samples of tobacco, temperature is a key variable in optimising the extraction process. A study of this phenomenon involving analysis of 18 samples produced these data.
Optimising a nicotine extraction process
Regression Analysis: Nicotine versus Temperature
The regression equation is
Nicotine = 2.61 + 0.0247 Temperature

Predictor        Coef   SE Coef      T      P
Constant       2.6086    0.2121  12.30  0.000
Temperature  0.024656  0.003579   6.89  0.000

S = 0.217412   R-Sq = 74.8%
Optimising a nicotine extraction process, quadratic fit
[Figure: Scatterplot of Nicotine vs Temperature]
Optimising a nicotine extraction process, quadratic fit
The regression equation is
Nicotine = 1.20 + 0.0767 Temperature - 0.000453 Temp-sqr

Predictor         Coef    SE Coef      T      P
Constant        1.2041     0.6312   1.91  0.076
Temperature    0.07674    0.02257   3.40  0.004
Temp-sqr    -0.0004529  0.0001943  -2.33  0.034

S = 0.192398   R-Sq = 81.5%
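Since the aim is to optimise the extraction, one natural use of the quadratic fit is to locate the temperature at which the fitted curve peaks. The sketch below applies the standard turning-point formula $-b_1 / (2 b_2)$ to the coefficients above. This calculation is not shown on the slides, and it assumes the fitted curve is only trusted within the observed temperature range (roughly 30 to 90 on the plot axis).

```python
# Coefficients copied from the quadratic regression output.
B0 = 1.2041        # Constant
B1 = 0.07674       # Temperature
B2 = -0.0004529    # Temp-sqr


def fitted_nicotine(temperature):
    """Fitted nicotine yield from the quadratic regression equation."""
    return B0 + B1 * temperature + B2 * temperature ** 2


def turning_point():
    """Temperature at which the fitted quadratic reaches its maximum (B2 < 0)."""
    return -B1 / (2 * B2)


if __name__ == "__main__":
    t_opt = turning_point()
    print(round(t_opt, 1), round(fitted_nicotine(t_opt), 2))
```

With these coefficients the fitted maximum falls near the upper end of the plotted temperature range, so it should be read with caution.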
Optimising a nicotine extraction process, quadratic fit, case 5 excluded
The regression equation is
Nicotine = 1.21 + 0.0750 Temperature - 0.000419 Temp-sqr

Predictor         Coef    SE Coef      T      P
Constant        1.2096     0.5129   2.36  0.033
Temperature    0.07504    0.01835   4.09  0.001
Temp-sqr    -0.0004189  0.0001583  -2.65  0.019

S = 0.156321   R-Sq = 88.6%
5. Other options
• Other functions,
– e.g., $1/Y$, $\sqrt{Y}$, $Y^2$, etc., same for X
• Generalised linear models,
– choose a function of Y, a model for
• etc.
Reading
EM Section 6.7.1
Hamilton, Ch. 5
Extra Notes: More on log