 MATH 6303 Multivariate Statistical Analysis Final Project By Marius M. Mihai December 3, 2014

Time Series Project - ARIMA



Final Project – Multivariate Statistical Analysis December 2014


1.  Do a discriminant analysis and write a report for the data of Table 5.8.

Y1: Length of Cycle, Y2: Percentage of Rising Prices, Y3: Cyclical Amplitude, Y4: Rate of Change

The purpose of the discriminant analysis is to find the linear combination of the four variables above that best separates consumer goods from producer goods. Before finding the discriminant function, we check whether the univariate differences between the two groups are significant by running four separate ANOVAs, one per variable.

For variable y1 (length of cycle), the p-value is below 5%, so the difference between groups is significant.
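As a rough illustration of what each univariate ANOVA computes, the sketch below builds the F statistic from the between-group and within-group sums of squares. The two samples are hypothetical placeholders, not the Table 5.8 data, and the function name is an assumption made for this example.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared deviations from each group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical values for one variable in the two groups (not from Table 5.8).
consumer = [1.0, 2.0, 3.0]
producer = [4.0, 5.0, 6.0]
f_stat, df_b, df_w = one_way_anova_f([consumer, producer])
print(f_stat, df_b, df_w)
```

The resulting F value is compared with the F distribution on (df_between, df_within) degrees of freedom to obtain the p-value reported by the statistical package.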


For variable y2 (% of rising prices), the difference between consumer and producer goods was

not significant.


For variable y3 (cyclical amplitude), the difference between groups is significant.

For variable y4 (rate of change), the difference between groups is not significant.


The univariate ANOVAs found significant group differences for y1 and y3, but not for y2 and y4. The next step is to run a MANOVA to test for an overall difference between consumer goods and producer goods. All of the MANOVA test statistics were significant, with p-values below 5% in every case, so there is a significant overall difference between the two groups.


Because the difference between consumer and producer goods is significant, the discriminant analysis will help identify which variables contribute most to the separation between groups. The analysis proceeds as follows:

a)  First, the discriminant function and its coefficients are computed, and the function is tested for significance.

b)  Second, the coefficients are standardized to remove the effect of the measurement units, so that the contribution of each variable can be assessed.

c)  Third, a stepwise selection of variables is applied to identify any redundant variables.

The analysis was carried out assuming that the two covariance matrices are equal. The discriminant function coefficients are computed as $a = S_{pl}^{-1}(\bar{y}_1 - \bar{y}_2)$, where $S_{pl}$ is the pooled covariance matrix, which gives $a' = (-0.05689, -0.00971, -0.24213, -0.0713)$. To test the significance of this discriminant function, Hotelling's $T^2$ was computed. With two groups, the discriminant function is significant if $T^2$ is significant. It can be shown that $T^2 = (n_1 + n_2 - 2)(1 - \Lambda)/\Lambda$, where $\Lambda$ is Wilks' lambda. From my MANOVA output, $\Lambda = 0.48$, so $T^2 = (9 + 10 - 2)(1 - 0.48)/0.48 = 18.42$. The critical value is $T^2_{\alpha = 0.05,\, p = 4,\, n_1 + n_2 - 2 = 17} = 15.117$. The test statistic exceeds the table value, so $T^2$ is significant and so is the discriminant function. The linear discriminant function can therefore be written as $z = -0.06 y_1 - 0.0097 y_2 - 0.24 y_3 - 0.071 y_4$.
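The conversion from Wilks' lambda to Hotelling's T2 can be checked with a few lines of Python. This is only a sketch of the arithmetic: the inputs (lambda = 0.48, group sizes 9 and 10, table value 15.117) are the ones quoted in the report, and the function name is an assumption made for this example.

```python
def hotelling_t2_from_wilks(wilks_lambda, n1, n2):
    """Two-group case: T^2 = (n1 + n2 - 2) * (1 - lambda) / lambda."""
    return (n1 + n2 - 2) * (1 - wilks_lambda) / wilks_lambda

t2 = hotelling_t2_from_wilks(0.48, 9, 10)
critical_value = 15.117  # tabulated T^2 for alpha = 0.05, p = 4, nu = 17 (from the report)
is_significant = t2 > critical_value
print(round(t2, 2), is_significant)
```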

The standardized coefficients of the discriminant function are computed as $a_{\mathrm{standardized}} = \sqrt{\operatorname{diag}(S_{pl})}\, a$, that is, each coefficient is multiplied by the pooled standard deviation of its variable. The standardized coefficients were computed in SAS, giving $a_{\mathrm{standardized}}' = (-1.390, -0.083, -1.025, -0.032)$. Taken in absolute value, these coefficients indicate how much each variable contributes to the discriminant function. The ranking from most to least important is: y1 (length of cycle), y3 (cyclical amplitude), y2 (percentage of rising prices), and y4 (rate of change). These results agree with the individual ANOVAs, where the strongest differences (according to the p-values) were in y1 and y3, while y2 and y4 showed no significant differences.
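The ranking by absolute standardized coefficient can be reproduced directly from the values reported above. The snippet is a small sketch; the variable names are chosen for this example only.

```python
names = ["y1", "y2", "y3", "y4"]
standardized = [-1.390, -0.083, -1.025, -0.032]  # values reported from SAS

# Sort variables by the absolute size of their standardized coefficient.
pairs = sorted(zip(names, standardized), key=lambda p: abs(p[1]), reverse=True)
ranking = [name for name, _ in pairs]
print(ranking)
```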


The last step of the discriminant analysis is the stepwise procedure, which is used to identify any redundant variables. The output from the stepwise procedure in SAS can be found below. As expected, variable y1 (length of cycle) entered first because it had the highest F-value, followed by y3 (cyclical amplitude), which had the second highest F-value. After step 2, no further variables were significant. This was expected, because y2 (percentage of rising prices) and y4 (rate of change) were not significant in the individual ANOVAs run at the start of the analysis. In addition, y2 and y4 have the smallest standardized discriminant function coefficients. Thus the reduced model with only y1 and y3 is as good as the full model, because y2 and y4 appear to be redundant.


2. Do a classification analysis and write a report on Table 5.6.

Y1: Intelligence, Y2: Form Relations, Y3: Dynamometer, Y4: Dotting, Y5: Sensory Motor

Coordination, Y6: Perseveration

Classification analysis takes discriminant analysis one step further. Its purpose is to tell us in which group to place future subjects, given their scores in intelligence, form relations, dynamometer, dotting, sensory motor coordination, and perseveration. In our case, we have a two-group classification analysis: engineer apprentices and pilots. Note that the preliminary analyses of these data (ANOVA, MANOVA, and tests for normality) were done previously in the midterm exam. As in the previous problem, the analysis proceeds as follows:

a)  First, the discriminant function and its coefficients are computed, and the function is tested for significance.

b)  Second, the coefficients are standardized to remove the effect of the measurement units, so that the contribution of each variable can be assessed.

c)  Third, the equality of the covariance matrices is tested.

d)  Fourth, each observation is classified using both the linear discriminant function and the quadratic discriminant function, and the error rates are estimated.

e)  Finally, the holdout method is applied to see how it compares with the previous two.


The discriminant function was computed in SAS. As in problem 1, the vector a' = (0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049) contains the coefficients of the linear discriminant function. The value T2 = 66.7 was computed in the midterm exam and was significant; hence the linear discriminant function is also significant.

The next step was to compute the standardized coefficients in order to identify the contribution of each variable to the overall model. The standardized coefficients were (0.174, 1.496, -1.391, -1.280, 1.131, -1.440). Taking absolute values, the ranking from most to least important is: y2 (form relations), y6 (perseveration), y3 (dynamometer), y4 (dotting), y5 (sensory motor coordination), and y1 (intelligence). These results are consistent with the midterm, where sensory motor coordination and intelligence appeared to be redundant in the full model. In the discriminant analysis, these two variables rank last in importance.

One assumption that was not tested in the midterm is particularly important for this analysis: the equality of the covariance matrices. This assumption was tested in a previous homework (see problem 7.22), and the covariance matrices appeared to be equal.

Because the covariance matrices are equal, the first classification is based on the linear discriminant function a' = (0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049). The next page contains the analysis, which was done in Microsoft Excel. There were two misclassifications among engineer apprentices and two among pilots, which yields an error rate of (2+2)/(20+20) = 0.1 (10%). This error rate will be compared with the error rates obtained later in the analysis in order to judge how well group membership can be predicted.
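The apparent error rate is just the number of misclassified observations divided by the total sample size. A minimal sketch, using the counts reported above (the function name is an assumption for this example):

```python
def apparent_error_rate(misclassified, group_sizes):
    """Total misclassifications divided by total number of observations."""
    return sum(misclassified) / sum(group_sizes)

# Two misclassified apprentices and two misclassified pilots, 20 subjects per group.
rate = apparent_error_rate([2, 2], [20, 20])
print(rate)
```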


Apprentices (the y1 column is cut off in the source)

  y2    y3     y4      y5     y6       Value   Decision
  22    74    223      54    254    -22.5221   pilots
  30    80    175      40    300    -23.0577   pilots
  49    87    266      41    223    -20.2287   engineer apprentices
  37    66    178      80    209    -12.9194   engineer apprentices
  35    71    175      38    261    -18.9185   engineer apprentices
  37    57    241      59    245    -17.4942   engineer apprentices
  39    52    194      72    242    -13.2495   engineer apprentices
  34    89    200      85    242    -18.2732   engineer apprentices
  55    91    198      50    277    -18.4763   engineer apprentices
  38    72    162      47    268    -17.6913   engineer apprentices
  37    87    170      60    244    -17.8653   engineer apprentices
  33    88    208      51    228    -20.3226   engineer apprentices
  40    60    232      29    279    -20.7101   engineer apprentices
  39    73    159      39    233    -16.424    engineer apprentices
  21    83    152      88    233    -17.3899   engineer apprentices
  42    80    195      36    241    -18.739    engineer apprentices
  49    73    152      42    249    -14.6486   engineer apprentices
  37    76    223      74    268    -18.9062   engineer apprentices
  46    83    164      31    243    -17.8151   engineer apprentices
  42    82    188      57    267    -18.7053   engineer apprentices
Mean:
  38.1  76.2  192.75   53.65 250.3

Pilots

  y1     y2    y3     y4     y5     y6       Value   Decision
  132    17    77    232     50    249    -24.2204   pilots
  123    32    79    192     64    315    -22.1684   pilots
  129    31    96    250     55    319    -27.8371   pilots
  131    23    67    291     48    310    -27.4374   pilots
  110    24    96    239     42    268    -27.2938   pilots
   47    22    87    231     40    217    -24.2903   pilots
  125    32    87    227     30    324    -27.5684   pilots
  129    29   102    234     58    300    -27.1664   pilots
  130    26   104    256     58    270    -27.4665   pilots
  147    47    82    240     30    322    -24.3164   pilots
  159    37    80    227     58    317    -23.0874   pilots
  135    41    83    216     39    306    -23.2369   pilots
  100    35    83    183     57    242    -18.8145   engineer apprentices
  149    37    94    227     30    240    -23.2048   pilots
  149    38    78    258     42    271    -22.9299   pilots
  153    27    89    283     66    291    -26.7734   pilots
  136    31    83    257     31    311    -27.7362   pilots
   97    36   100    252     30    225    -24.8977   pilots
  141    37   105    250     27    243    -26.0327   pilots
  164    32    76    187     30    264    -21.1992   engineer apprentices
Mean:
  129.3  31.7  87.4  236.6   44.25 280.2
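Each entry in the Value column is the discriminant score z = a'y for that subject. As a sketch, the score of the first pilot (132, 17, 77, 232, 50, 249) can be recomputed from the coefficient vector reported earlier. The result is about -24.23, while the table lists -24.2204; the small gap is most likely due to the coefficients being rounded in the report. The function name is an assumption for this example.

```python
def discriminant_score(a, y):
    """Return z = a'y for coefficient vector a and observation y."""
    return sum(coef * value for coef, value in zip(a, y))

a = [0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049]  # rounded coefficients from the report
first_pilot = [132, 17, 77, 232, 50, 249]            # y1..y6 for the first pilot
z = discriminant_score(a, first_pilot)
print(round(z, 2))
```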


The next classification is based on a quadratic classification function. Although the sample covariance matrices did not differ significantly, we also fit the quadratic rule in order to compare the error rates of the two models. The SAS results are below. The quadratic discriminant function yields a similar error rate of 10%.

The third classification analysis uses the holdout method. The SAS results are copied on the following page. There were 4 misclassifications among engineer apprentices and 2 among pilots, for a reported total error rate of 0.1750 (note that (4+2)/(20+20) = 0.15, so the counts and the reported rate do not agree). As expected, the error rate increased compared with the previous two methods, giving a more realistic expectation of how the linear discriminant function will perform on future subjects.
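The holdout (leave-one-out) idea can be sketched in a few lines: each observation is removed in turn, the rule is rebuilt from the remaining data, and the held-out observation is then classified. The example below is a simplified illustration and not the SAS computation: it uses a one-variable nearest-mean rule as a stand-in for the linear discriminant function, and the two samples are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def classify(x, group_a, group_b):
    """Nearest-mean rule: assign x to the group whose mean is closer."""
    return "A" if abs(x - mean(group_a)) < abs(x - mean(group_b)) else "B"

def resubstitution_error(a, b):
    """Classify every observation with a rule built from all the data."""
    errors = sum(classify(x, a, b) != "A" for x in a)
    errors += sum(classify(x, a, b) != "B" for x in b)
    return errors / (len(a) + len(b))

def holdout_error(a, b):
    """Leave each observation out, rebuild the rule, then classify it."""
    errors = 0
    for i, x in enumerate(a):
        errors += classify(x, a[:i] + a[i + 1:], b) != "A"
    for i, x in enumerate(b):
        errors += classify(x, a, b[:i] + b[i + 1:]) != "B"
    return errors / (len(a) + len(b))

# Hypothetical one-variable samples (not the Table 5.6 data).
group_a = [1.0, 2.0, 3.0, 5.0]
group_b = [6.0, 7.0, 8.0, 9.0]
resub = resubstitution_error(group_a, group_b)
loo = holdout_error(group_a, group_b)
print(resub, loo)
```

In this toy example the resubstitution error is optimistic because every observation helps build the rule that classifies it, which is the same effect that makes the holdout rate in the report higher than the 10% apparent rate.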


3. Do a regression analysis and write a report on Table 3.4.

The first step in this multivariate regression analysis is to estimate the parameters, that is, the matrix $\hat{B}$. The parameter matrix was computed in SAS, and the output is below. The first table is the set of parameters for y1 (relative weight) and the second one is the set of parameters for y2 (fasting plasma glucose).
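For reference, the least squares estimator that SAS reports can be written as below. The notation is an assumption for this note: X is the n x 4 matrix whose first column is all ones and whose remaining columns hold x1, x2, and x3, and Y is the n x 2 matrix with columns y1 and y2, so that the estimate is a 4 x 2 matrix with one column of coefficients per response.

```latex
\hat{B} = (X'X)^{-1} X'Y
```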

The overall regression is significant at α = 5%, as indicated by all four tests shown below. However, the $R^2$ values are relatively low. Only 25% of the variability in y1 (relative weight) can be explained by x1 (glucose intolerance), x2 (insulin response to oral glucose), and x3 (insulin resistance), and only 1.6% of the variability in y2 (fasting plasma glucose) can be explained by x1, x2, and x3.


The relatively low values of $R^2$ suggest that more explanatory variables may be needed to improve the model. Nevertheless, for the purpose of this analysis we run a stepwise procedure to identify redundancies, using backward elimination to find a subset of the x's.

To find the subset of the x's, we compute a conditional Wilks' lambda using formula 10.72 in the book. For example, $\Lambda(x_1 \mid x_2, x_3) = \Lambda(x_1, x_2, x_3)/\Lambda(x_2, x_3)$. This is the first value, listed under X1 in the table below. The values for the other x's are computed similarly. No variable could be eliminated at step 1 because the largest conditional Wilks' lambda (0.93) was significant at α = 5%, so the backward elimination process stops there.
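The backward elimination step can be sketched as follows. The ratio function mirrors the conditional Wilks' lambda formula; the full and reduced lambda values passed to it are hypothetical, since the report does not list them. The step 1 conditional values (0.93, 0.89, 0.76) are the ones reported in the analysis, and the candidate for deletion is the variable with the largest conditional lambda, which is then tested for significance.

```python
import math

def conditional_wilks(lambda_full, lambda_reduced):
    """Lambda(x_j | other x's) = Lambda(all x's) / Lambda(other x's)."""
    return lambda_full / lambda_reduced

# Hypothetical full-model and reduced-model values, for illustration only.
example = conditional_wilks(0.45, 0.50)

# Conditional Wilks' lambdas reported at step 1 of the backward elimination.
step1 = {"x1": 0.93, "x2": 0.89, "x3": 0.76}
# The variable with the largest conditional lambda contributes least,
# so it is the candidate for deletion if its test is not significant.
candidate = max(step1, key=step1.get)
print(round(example, 2), candidate)
```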

Step #   X1 (Glucose Intolerance)   X2 (Insulin Response to Oral Glucose)   X3 (Insulin Resistance)
1        Wilks' Lambda = 0.93       Wilks' Lambda = 0.89                    Wilks' Lambda = 0.76

It appears that none of the independent variables can be eliminated, so all three x's are needed in the model. However, given the small values of $R^2$, they explain only a small portion of the variability in the y's. Thus, as mentioned earlier, the model needs more explanatory variables in order to increase its accuracy.


12.8. Carry out a principal component analysis on all six variables of the glucose data of Table

3.8. Use both S and R. Which do you think is more appropriate here? Show the percent of

variance explained. Based on the average eigenvalue or a scree plot, decide how many

components to retain. Can you interpret the components of either S or R?

The purpose of principal component analysis is to reduce the dimensionality of the data by replacing the original variables with a smaller number of uncorrelated linear combinations that retain most of the variance. For the data in Table 3.8, principal components were computed from both S and R.

First I will present the results for the correlation matrix. The first four eigenvalues account for about 85% of the variance, which exceeds the 80% threshold, so we keep the first four components.


In the case of the covariance matrix, the first three eigenvalues account for 89% of the variability, which is higher than 80%. Thus, we can keep the first three principal components. This makes sense because the components of S are dominated by the variables with the largest variances: x1, x2, and x3.


In this case, because of the disparate variances in S, choosing the principal components from R

will be more appropriate.
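To illustrate why disparate variances favor R, the sketch below converts a covariance matrix to a correlation matrix by dividing each entry by the product of the two standard deviations. The 2 x 2 matrix is hypothetical: its second variable has a variance of 900 against 4 for the first, so it would account for almost all of the total variance in S, while in R both variables are on an equal footing.

```python
import math

def covariance_to_correlation(s):
    """Return R with r_ij = s_ij / (sqrt(s_ii) * sqrt(s_jj))."""
    sd = [math.sqrt(s[i][i]) for i in range(len(s))]
    return [[s[i][j] / (sd[i] * sd[j]) for j in range(len(s))]
            for i in range(len(s))]

# Hypothetical covariance matrix with very different variances.
s = [[4.0, 30.0],
     [30.0, 900.0]]
r = covariance_to_correlation(s)
# Share of the total variance in S carried by the second variable alone.
share_second = s[1][1] / (s[0][0] + s[1][1])
print(r, round(share_second, 3))
```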

To interpret the principal components in the case of R, we need the correlations between the retained components and the variables. The runs were done in SAS and the relevant values are circled in the figure below. The correlations between the principal components and the variables differ, and only those above 0.5 were treated as important. For example, after selecting the first four principal components, a correlation above 0.5 can be identified between the first principal component and variables y1, y3, x1, x2, and x3. The second principal component is correlated with y2. X2 is correlated with the third component, while y1 and y3 are strongly correlated with the fourth principal component.


12.12 Carry out a principal component analysis on the engineer data of Table 5.6 as follows:

(a) Use the pooled covariance matrix.

(b) Ignore groups and use a covariance matrix based on all 40 observations.

(c) Which of the approaches in (a) or (b) appears to be more successful?

Here we run a principal component analysis using both a pooled covariance matrix and an unpooled covariance matrix (based on all 40 observations, ignoring groups). The two matrices were computed in SAS and are shown in the figures below. The first figure is the pooled covariance matrix and the second figure is the unpooled covariance matrix.


First we run the principal component analysis on the unpooled covariance matrix. The following results were obtained in SAS. The first three components account for 87% of the variance, so it is enough to keep them. The first three components are a1' = (0.212, -0.039, 0.08, 0.775, -0.956, 0.580), a2' = (0.389, 0.064, -0.066, -0.608, 0.01, 0.686), and a3' = (0.889, 0.096, 0.08, 0.08, 0.01, -0.434).

For the unpooled matrix, I could not use a built-in SAS procedure, so I computed the components in IML. The output is copied on the next page. The table under the figures gives the proportion and cumulative proportion of the total variance accounted for by each eigenvalue. Similar to the analysis done for the pooled covariance matrix, the first three eigenvalues account for about 85% of the total variance, so the first three eigenvectors (components) can be kept.

Given that the two analyses give very similar results, neither approach appears to be more successful: the results do not depend on whether the pooled or the unpooled covariance matrix is used.


Eigenvalue    Proportion    Cumulative Proportion
1,050.5963    38.6%         38.6%
  858.3158    31.6%         70.2%
  398.9035    14.7%         84.9%
  259.1484     9.5%         94.4%
  108.0892     4.0%         98.4%
   43.3535     1.6%        100.0%
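The proportions in the table can be reproduced from the eigenvalues alone. The sketch below recomputes the cumulative proportions and applies two retention rules: the 80% cumulative variance threshold used in this report and, for comparison, the average eigenvalue rule mentioned in problem 12.8. The variable names are chosen for this example.

```python
eigenvalues = [1050.5963, 858.3158, 398.9035, 259.1484, 108.0892, 43.3535]

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Running sum of the proportions gives the cumulative proportion column.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

# Rule 1: smallest number of components reaching 80% of the total variance.
n_by_80 = next(i + 1 for i, c in enumerate(cumulative) if c >= 0.80)

# Rule 2: keep components whose eigenvalue exceeds the average eigenvalue.
average = total / len(eigenvalues)
n_by_average = sum(ev > average for ev in eigenvalues)

print([round(100 * c, 1) for c in cumulative], n_by_80, n_by_average)
```

With these eigenvalues the 80% rule keeps three components, in line with the report, while the average eigenvalue rule would keep only the first two, since the third eigenvalue (398.9) falls below the average of about 453.1.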