Stat 112 Notes 17
• Time Series and Assessing the Assumption that the Disturbances Are Independent (Chapter 6.8)
• Using and Interpreting Indicator Variables (Chapter 7.1)
Time Series Data and Autocorrelation
• When Y is a variable collected for the same entity (person, state, country) over time, we call the data time series data.
• For time series data, we need to consider the independence assumption for the simple and multiple regression model.
• Independence Assumption: The residuals are independent of one another. This means that knowing this year's residual is positive tells us nothing about next year's residual — it is still equally likely to be positive or negative, i.e., there is no autocorrelation.
• Positive autocorrelation: Positive residuals are more likely to be followed by positive residuals than by negative residuals.
• Negative autocorrelation: Positive residuals are more likely to be followed by negative residuals than by positive residuals.
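As a quick numerical check of these definitions, the lag-1 correlation of a residual series can be computed directly. This is a sketch with made-up residuals (not the ski ticket data), showing a series with long runs of same-sign residuals:

```python
import numpy as np

def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one that follows it.

    Values near +1 suggest positive autocorrelation, values near -1
    negative autocorrelation, and values near 0 are consistent with
    independence.
    """
    e = np.asarray(residuals, dtype=float)
    return np.corrcoef(e[:-1], e[1:])[0, 1]

# Hypothetical residual series with an obvious positive pattern:
# smooth runs of positive residuals followed by runs of negative ones.
e = np.array([2.0, 1.5, 1.0, 0.5, -0.5, -1.0, -1.5, -2.0, -1.0, 0.5, 1.0, 2.0])
r = lag1_autocorrelation(e)
```

For this series r is well above zero, matching the description of positive autocorrelation above.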
Ski Ticket Sales
• Christmas week is a critical period for most ski resorts.
• A ski resort in Vermont wanted to determine the effect that weather had on its sale of lift tickets during Christmas week.
• Data from the past 20 years:
  Yi = lift tickets sold during Christmas week in year i
  Xi1 = snowfall during Christmas week in year i
  Xi2 = average temperature during Christmas week in year i
• Data in skitickets.JMP
Response Tickets
Parameter Estimates
Term          Estimate    Std Error   t Ratio   Prob>|t|
Intercept     8308.0114   903.7285    9.19      <.0001
Snowfall      74.593249   51.57483    1.45      0.1663
Temperature   -8.753738   19.70436    -0.44     0.6625

[Figure: Residual by Predicted Plot — Tickets residuals vs. predicted tickets]
[Figure: Bivariate Fit of Residual Tickets By Year — residuals from the model above plotted against year]
The residuals suggest positive autocorrelation.
Durbin-Watson Test of Independence
• The Durbin-Watson test is a test of whether the residuals are independent.
• The null hypothesis is that the residuals are independent, and the alternative hypothesis is that the residuals are autocorrelated (either positively or negatively).
• The test works by computing the correlation of consecutive residuals.
• To compute the Durbin-Watson test in JMP, after Fit Model, click the red triangle next to Response, click Row Diagnostics and click Durbin-Watson Test. Then click the red triangle next to Durbin-Watson to get the p-value.
• For the ski ticket data, p-value = 0.0002: strong evidence of autocorrelation.
Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
0.5931403       20               0.5914            0.0002
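The statistic itself is simple to compute by hand: d = Σ(e_t − e_{t−1})² / Σe_t², which is approximately 2(1 − r) where r is the lag-1 correlation of the residuals, so d near 2 means no autocorrelation and d well below 2 means positive autocorrelation. A minimal sketch (not JMP's implementation, and on made-up residuals):

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic d = sum((e_t - e_{t-1})^2) / sum(e_t^2).

    d is near 2 when residuals look independent; d well below 2 signals
    positive autocorrelation, d well above 2 negative autocorrelation.
    """
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Independent-looking noise should give d near 2.
rng = np.random.default_rng(0)
d_indep = durbin_watson(rng.standard_normal(500))

# A smoothly trending residual series should give d far below 2.
d_pos = durbin_watson(np.sin(np.linspace(0, 6, 60)))
```

The p-value JMP reports (Prob<DW) comes from the sampling distribution of d under independence, which this sketch does not reproduce.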
Remedies for Autocorrelation
• Add a time variable to the regression.
• Add a lagged dependent (Y) variable to the regression. We can do this in JMP by creating a new column, right clicking, then clicking Formula, clicking Row, clicking Lag and then clicking the Y variable.
• After adding these variables, refit the model and then recheck the Durbin-Watson statistic to see if autocorrelation has been removed.
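Outside JMP, the same lagged column can be built with a one-row shift. This sketch uses hypothetical data standing in for skitickets.JMP:

```python
import pandas as pd

# Hypothetical stand-in for skitickets.JMP: tickets sold each year.
df = pd.DataFrame({
    "Year": range(1, 9),
    "Tickets": [8200, 8900, 9400, 9100, 9800, 10300, 10100, 10900],
})

# Equivalent of JMP's Row > Lag formula: each row gets the previous
# row's Tickets value; the first row has no predecessor and is missing.
df["Lagged_Tickets"] = df["Tickets"].shift(1)

# Rows with a missing lag are dropped before refitting the model,
# which is why a lagged-variable fit has one fewer observation.
fit_data = df.dropna()
```

This is why the lagged-sales model later in these notes shows 35 observations rather than 36.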
Response Tickets
Parameter Estimates
Term          Estimate    Std Error  t Ratio  Prob>|t|
Intercept     5965.5876   631.2518   9.45     <.0001
Snowfall      70.183059   28.85142   2.43     0.0271
Temperature   -9.232802   11.01971   -0.84    0.4145
Year          229.96997   37.13209   6.19     <.0001

Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
1.8849875       20               0.0405            0.3512
[Figure: Bivariate Fit of Residual Tickets 3 By Year — residuals from the refit model plotted against year]
No evidence of autocorrelation once Year has been added as an explanatory variable
Example 6.10 in the book
Response SALES
Parameter Estimates
Term       Estimate   Std Error  t Ratio  Prob>|t|
Intercept  -632.6945  47.27697   -13.38   <.0001
ADV        0.1772326  0.007045   25.16    <.0001

Durbin-Watson
Durbin-Watson  Number of Obs.  AutoCorrelation  Prob<DW
0.4672937      36              0.7091           <.0001
Strong evidence of autocorrelation.

[Figure: Bivariate Fit of Residual SALES By Year — residuals plotted against year, 1965–2005]
Response SALES
Parameter Estimates
Term       Estimate   Std Error  t Ratio  Prob>|t|
Intercept  41058.859  8676.4     4.73     <.0001
ADV        0.4393009  0.054814   8.01     <.0001
Year       -21.88738  4.554925   -4.81    <.0001

Durbin-Watson
Durbin-Watson  Number of Obs.  AutoCorrelation  Prob<DW
0.747831       36              0.5584           <.0001
Adding Year does not remove the autocorrelation.

[Figure: Bivariate Fit of Residual SALES By Year — residuals from the model with Year plotted against year, 1965–2005]
Response SALES
Parameter Estimates
Term          Estimate    Std Error  t Ratio  Prob>|t|
Intercept     -234.4752   78.06875   -3.00    0.0051
ADV           0.0630703   0.020228   3.12     0.0038
Lagged Sales  0.6751139   0.112302   6.01     <.0001

Durbin-Watson
Durbin-Watson  Number of Obs.  AutoCorrelation
2.3330219      35              -0.2063
Adding Lagged Sales removes the autocorrelation.

[Figure: Bivariate Fit of Residual SALES 2 By Year — residuals from the lagged-sales model plotted against year, 1965–2005]
Categorical variables
• Categorical (nominal) variables: Variables that define group membership, e.g., sex (male/female), color (blue/green/red), county (Bucks County, Chester County, Delaware County, Philadelphia County).
• How can we use categorical variables as explanatory variables in regression analysis?
Comparing Toy Factory Managers
• An analysis has shown that the time required to complete a production run in a toy factory increases with the number of toys produced. Data were collected on the time required to process 20 randomly selected production runs as supervised by each of three managers (Alice, Bob and Carol). Data in toyfactorymanager.JMP.
• How do the managers compare?
Picture from Toy Story (1995)
Marginal Comparison
• Marginal comparison could be misleading. We know that large production runs with more toys take longer than small runs with few toys. How can we be sure that Carol has not simply been supervising very small production runs?
• Solution: Run a multiple regression in which we include size of the production run as an explanatory variable along with manager, in order to control for size of the production run.
[Figure: Oneway Analysis of Time for Run By Manager — run times for Alice, Bob and Carol]
Including Categorical Variables in Multiple Regression: Wrong Approach
• We could assign numerical codes to the managers, e.g., Alice = 0, Bob = 1, Carol = 2.
• This model says that for the same run size, Bob is 31 minutes faster than Alice and Carol is 31 minutes faster than Bob.
• This model restricts the difference between Alice and Bob to be the same as the difference between Bob and Carol – we have no reason to do this.
• If we use a different coding for Manager, we get different results, e.g., Bob=0, Alice=1, Carol=2
Parameter Estimates
Term           Estimate   Std Error  t Ratio  Prob>|t|
Intercept      211.92804  7.212609   29.38    <.0001
Run Size       0.2233844  0.029184   7.65     <.0001
Managernumber  -31.03612  3.056054   -10.16   <.0001

Parameter Estimates (recoded: Bob = 0, Alice = 1, Carol = 2)
Term            Estimate   Std Error  t Ratio  Prob>|t|
Intercept       188.63636  12.73082   14.82    <.0001
Run Size        0.2103122  0.048921   4.30     <.0001
Managernumber2  -5.008207  5.122956   -0.98    0.3324
Under this coding, Alice is estimated to be 5 minutes faster than Bob.
Including Categorical Variables in Multiple Regression: Right Approach
• Create an indicator (dummy) variable for each category.
• Manager[Alice] = 1 if Manager is Alice, 0 if Manager is not Alice
• Manager[Bob] = 1 if Manager is Bob, 0 if Manager is not Bob
• Manager[Carol] = 1 if Manager is Carol, 0 if Manager is not Carol
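Outside JMP, the same 0/1 indicator columns can be created with pandas. (JMP's Expanded Estimates use a sum-to-zero effect coding rather than plain 0/1 dummies; this sketch, on hypothetical data, only shows the indicator construction itself.)

```python
import pandas as pd

# Hypothetical manager column like the one in toyfactorymanager.JMP.
manager = pd.Series(["Alice", "Bob", "Carol", "Alice", "Bob"])

# One 0/1 indicator column per category: Manager_Alice is 1 exactly
# when the manager is Alice, and similarly for Bob and Carol.
dummies = pd.get_dummies(manager, prefix="Manager").astype(int)
```

Each row has exactly one 1 across the three columns, which is why a regression with an intercept can only include two of the three dummies (or, as JMP does, constrain the three effects to sum to zero).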
Categorical Variables in Multiple Regression in JMP
• Make sure that the categorical variable is coded as nominal. To change the coding, right click on the variable's column, click Column Info and change Modeling Type to nominal.
• Use Fit Model and include the categorical variable into the multiple regression.
• After Fit Model, click red triangle next to Response and click Estimates, then Expanded Estimates (the initial output in JMP uses a different, more confusing coding of the dummy variables).
• For a run size of 100, the estimated run times for Alice, Bob and Carol are shown below.
• For the same run size, Alice is estimated to be on average 38.41 − (−14.65) = 53.06 minutes slower than Bob and 38.41 − (−23.76) = 62.17 minutes slower than Carol.
Response Time for Run
Expanded Estimates
Nominal factors expanded to all levels
Term            Estimate   Std Error  t Ratio  Prob>|t|
Intercept       176.70882  5.658644   31.23    <.0001
Run Size        0.243369   0.025076   9.71     <.0001
Manager[Alice]  38.409663  3.005923   12.78    <.0001
Manager[Bob]    -14.65115  3.031379   -4.83    <.0001
Manager[Carol]  -23.75851  2.995898   -7.93    <.0001
Ê(Time | Run Size = 100, Manager = Alice) = 176.71 + 0.24*100 + 38.41*1 - 14.65*0 - 23.76*0 = 239.12
Ê(Time | Run Size = 100, Manager = Bob)   = 176.71 + 0.24*100 + 38.41*0 - 14.65*1 - 23.76*0 = 186.06
Ê(Time | Run Size = 100, Manager = Carol) = 176.71 + 0.24*100 + 38.41*0 - 14.65*0 - 23.76*1 = 176.95
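These fitted values can be reproduced directly from the Expanded Estimates table, using the unrounded coefficients (a small sketch; the dictionary below just copies the JMP output):

```python
# Coefficients copied from the Expanded Estimates table above.
coef = {
    "Intercept": 176.70882,
    "Run Size": 0.243369,
    "Manager[Alice]": 38.409663,
    "Manager[Bob]": -14.65115,
    "Manager[Carol]": -23.75851,
}

def predicted_time(run_size, manager):
    """Estimated run time: intercept + slope * run size + manager effect."""
    return coef["Intercept"] + coef["Run Size"] * run_size + coef[f"Manager[{manager}]"]

times = {m: predicted_time(100, m) for m in ["Alice", "Bob", "Carol"]}
```

With the unrounded coefficients, the Alice-minus-Bob gap is 53.06 minutes and the Alice-minus-Carol gap is 62.17 minutes, matching the differences quoted above.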
Election Regression
Goal: Predict the incumbent party's share of the presidential vote. Ray Fair of Yale University has developed a regression model for predicting the presidential vote. Data in Elections.JMP.
Response: Y = incumbent party's share of the vote
Explanatory Variables:
In Power = nominal variable for the party in power
Economic_Growth = growth rate of real per capita GDP in the first three quarters of the election year (annual rate)
Inflation = absolute value of the growth rate of the GDP deflator in the first 15 quarters of the administration (annual rate), except for 1920, 1944, and 1948, where the values are zero
Good_News_Quarters = number of quarters in the first 15 quarters of the administration in which the growth rate of real per capita GDP is greater than 3.2 percent at an annual rate, except for 1920, 1944, and 1948, where the values are zero
Duration_Value = 0 if the incumbent party has been in power for one term, 1 if for two consecutive terms, 1.25 if for three consecutive terms, 1.50 for four consecutive terms, and so on
President_Running = nominal variable for whether the president is running
War = nominal variable for whether the election is 1920, 1944 or 1948
Response Incumbent_Share
Summary of Fit
RSquare                      0.882301
RSquare Adj                  0.827374
Root Mean Square Error       2.834062
Mean of Response             52.33478
Observations (or Sum Wgts)   23

Expanded Estimates
Nominal factors expanded to all levels
Term                    Estimate    Std Error  t Ratio  Prob>|t|
Intercept               54.229235   4.164649   13.02    <.0001
Economic_Growth         0.5538327   0.146503   3.78     0.0018
Inflation               -0.602683   0.320438   -1.88    0.0796
Good_News_Quarters      0.7088524   0.245609   2.89     0.0113
Duration_Value          -4.380337   1.501836   -2.92    0.0106
In Power[D]             -2.060045   0.665706   -3.09    0.0074
In Power[R]             2.060045    0.665706   3.09     0.0074
President_Running[no]   -0.722883   0.812978   -0.89    0.3879
President_Running[yes]  0.7228826   0.812978   0.89     0.3879
War[no]                 -2.069246   1.701099   -1.22    0.2426
War[yes]                2.0692456   1.701099   1.22     0.2426
Prediction and Prediction Interval for 2008
Year  Predicted Value  Lower 95% Indiv Incumbent_Share  Upper 95% Indiv Incumbent_Share
2008  49.6063561       42.8822003                       56.330512
Effect Tests
• Effect test for Manager: H0: Manager[Alice] = Manager[Bob] = Manager[Carol] vs. Ha: at least two of Manager[Alice], Manager[Bob] and Manager[Carol] are not equal. The null hypothesis is that all managers are the same (in terms of mean run time) when run size is held fixed; the alternative hypothesis is that not all managers are the same (in terms of mean run time) when run size is held fixed. This is a partial F test.
• The p-value for the effect test is <.0001: strong evidence that not all managers are the same when run size is held fixed.
• Note that H0: Manager[Alice] = Manager[Bob] = Manager[Carol] is equivalent to H0: Manager[Alice] = Manager[Bob] = Manager[Carol] = 0, because JMP imposes the constraint Manager[Alice] + Manager[Bob] + Manager[Carol] = 0.
• The effect test for Run Size tests the null hypothesis that the Run Size coefficient is 0 versus the alternative hypothesis that the Run Size coefficient isn't zero. It gives the same p-value as the t-test.
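The partial F statistic behind an effect test compares the residual sum of squares with and without the tested terms. A sketch of the computation (the SSE values below are made up for illustration and are not JMP's numbers for this data set):

```python
from scipy import stats

def partial_f_test(sse_reduced, sse_full, df_extra, df_full):
    """Partial F test for dropping a group of terms from a regression.

    sse_reduced / sse_full: residual sums of squares of the models
    without / with the tested terms; df_extra: number of coefficients
    tested (2 for the three manager indicators, which carry two free
    parameters under the sum-to-zero constraint); df_full: residual
    degrees of freedom of the full model.
    """
    f = ((sse_reduced - sse_full) / df_extra) / (sse_full / df_full)
    p = stats.f.sf(f, df_extra, df_full)  # upper-tail p-value
    return f, p

# Made-up SSE values for illustration only.
f, p = partial_f_test(sse_reduced=100.0, sse_full=50.0, df_extra=2, df_full=10)
```

A large F (equivalently, a small p-value) means the tested group of terms explains a substantial amount of variation beyond the reduced model.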
Effect Tests
Source    Nparm  DF  Sum of Squares  F Ratio  Prob > F
Run Size  1      1   25260.250       94.1906  <.0001
Manager   2      2   44773.996       83.4768  <.0001
H0: Manager[Alice] = Manager[Bob] = Manager[Carol] = 0
Ha: not all of Manager[Alice], Manager[Bob], Manager[Carol] equal 0
• The effect test shows that the managers are not all equal.
• For the same run size, Carol is best (lowest mean run time), followed by Bob and then Alice.
• The above model assumes no interaction between Manager and run size – the difference between the mean run times of the managers is the same for all run sizes.
The effect test shows that the managers are not all equal. What about differences between specific managers? How does the mean time for Alice compare to Bob's for fixed run sizes? We can test whether there is a difference between Alice and Bob, and we can construct a confidence interval for the difference.
H0: Manager[Alice] = Manager[Bob]
H1: Manager[Alice] ≠ Manager[Bob]
We will use the Custom Test provided by JMP:

Expanded Estimates
Nominal factors expanded to all levels
Term            Estimate   Std Error  t Ratio  Prob>|t|
Intercept       176.70882  5.658644   31.23    <.0001
Manager[Alice]  38.409663  3.005923   12.78    <.0001
Manager[Bob]    -14.65115  3.031379   -4.83    <.0001
Manager[Carol]  -23.75851  2.995898   -7.93    <.0001
Run Size        0.243369   0.025076   9.71     <.0001
Testing for Differences Between Specific Managers
Inference for Differences of Coefficients in JMP
Consider a multiple regression

E(Y | X1, ..., Xk) = β0 + β1*X1 + ... + βk*Xk

Suppose we want to test

H0: βi = βj vs. Ha: βi ≠ βj.

This is equivalent to

H0: βi - βj = 0 vs. Ha: βi - βj ≠ 0.

After Fit Model, click the red triangle next to Response. Then click Estimates and then Custom Test. Then enter a 1 for βi and a -1 for βj.

Prob>|t| is the p-value for the two-sided test. Std Error is the standard error of bi - bj, so an approximate 95% confidence interval for βi - βj is bi - bj ± 2*Std Error.
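Under the hood, a contrast test like this can be sketched with the least squares algebra: estimate c'b, standard error sqrt(c'(X'X)⁻¹c · s²), and their ratio. This is a generic illustration on simulated data, not JMP's implementation or the toy factory data:

```python
import numpy as np

def contrast_test(X, y, c):
    """t ratio for a linear contrast H0: c'beta = 0 in y = X beta + error.

    Returns the estimate c'b, its standard error
    sqrt(c' (X'X)^{-1} c * s^2), and the t ratio.
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y                # least squares coefficients
    resid = y - X @ b
    s2 = resid @ resid / (n - p)         # residual mean square
    est = c @ b
    se = np.sqrt(c @ xtx_inv @ c * s2)
    return est, se, est / se

# Simulated data: true beta = (2, 1, -1), so the contrast
# beta_1 - beta_2 is truly 1 - (-1) = 2.
rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
y = X @ np.array([2.0, 1.0, -1.0]) + 0.1 * rng.standard_normal(n)
est, se, t = contrast_test(X, y, np.array([0.0, 1.0, -1.0]))
```

The contrast vector (0, 1, -1) plays the same role as the 1 and -1 entered in JMP's Custom Test dialog.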
Custom Test
Parameter
Intercept       0
Manager[Alice]  1
Manager[Bob]    -1
Run Size        0
= 0
Value      53.06
Std Error  5.24
t Ratio    10.12
Prob>|t|   2.929978e-14
Conclusion: We reject the null hypothesis and conclude that Alice and Bob perform differently.
2. The confidence interval for the difference: 53.06 ± 2.003*5.24 = 53.06 ± 10.50.
Here 2.003 = t*(.025, 56), the upper .025 point of the t distribution with 56 degrees of freedom.
3. How do we test the difference between Alice and Carol?
H0: Manager[Alice] = Manager[Carol] vs. Ha: Manager[Alice] ≠ Manager[Carol]
Now Manager[Carol] = -(Manager[Alice] + Manager[Bob]),
so H0 becomes: Manager[Alice] = -(Manager[Alice] + Manager[Bob]),
i.e., H0: 2*Manager[Alice] + Manager[Bob] = 0.
We can use the Custom Test again with 2 and 1 as the contrast coefficients:
Custom Test
Parameter
Intercept       0
Manager[Alice]  2
Manager[Bob]    1
Run Size        0
= 0