Upload
subir-ghosh
View
217
Download
0
Embed Size (px)
Citation preview
StaUsUcs & Probability Letters 7 (1989) 241-245 December 1988 North-Holland
O N T W O M E T H O D S O F I D E N T I F Y I N G I N F L U E N T I A L S E T S O F O B S E R V A T I O N S
Subir G H O S H
Department of Stattsttcs, Unwerstty of Cahforma, Rwerslde, CA 92521, USA
Received February 1987 Revased April 1987
Abstract: In tlus paper two new measures are proposed to Identify influential sets of observations at the design stage m view of prediction and fitting. A relaUonship is estabhshed between one of proposed measures and the Cook's measure at the inference stage.
AMS 1970 Subject Classtficatzons Primary and Secondary 62.J05, 62K15.
Keywords: Cook's measure, design, influential observatmns, hnear models, robustness, unavmlable observations.
1. Introduction
A set of observations under a design is said to be influential in this paper if the set affects not only the fitting o f the model to the data but also the prediction in terms of the fitted model. In the problem of identifying sets of t (a positive integer) influential observations, we assume the underlying design is robust against the unavailability of any t observations (Ghosh (1979)). We first explain this concept by considering the s tandard linear model
E ( y ) = Xfl, (1)
V ( y ) = o21, (2)
R a n k X = p , (3)
where y ( N × 1) is a vector of observations, X ( N × p ) is a known matrix, f l ( p × 1) is a vector of fixed unknown parameters and a 2 is a constant which may or may not be known. Let d be the underlying design corresponding to y. The design d is assumed to be robust against the unavailabil- ity of any t observations in the sense that the parameters in fl are still unbiasedly estimable
Work is sponsored by the ,Mr Force Office Of Scientific Research under Grant AFOSR-87-0048.
when any t observations in y are unavailable.
There are ( N ) possible sets of t observations. The
idea of robustness of designs against unavailabil- ity of data is fundamenta l in measuring the in- fluence of a set of observations. We measure the influence of a set of t observations by assuming the observations in the set unavailable and then assessing the model fitted with the remaining ( N - t) observations.
We propose the first measure of influence in terms of precise predict ion of t unavailable ob- servations. A set of t observations is the most influential if the model fitted with the remaining ( N - t ) observations does the worst j ob in pre- dicting t unavailable observations. The first mea- sure is in fact the sum of variances of predicted values of unavailable observations f rom the re- maining ( N - t) observations. The largest value of the sum indicates the corresponding set of t ob° servations is the most influential in terms of pre- cise predict ion of unavailable observations.
The vector of the least squares fitted values of N observations is uncorrelated with a complete set of or thonormal linear functions of y with zero expectations. This fact implies the op t imum prop- erty " M i n i m u m Variance" of the vector of the least squares fitted values as an unbiased estima-
0167-7152/88/$3.50 © 1988, Elsevter Science Pubhshers B.V. (North-Holland) 241
Volume 7, Number 3 STATISTICS & PROBABILITY LETTERS December 1988
tor of its expectation. Assuming a set of t observa- tions unavailable, the least squares fitted values of the remaining ( N - t) observations are correlated with a complete set of orthonormal linear func- tions of y ( N × 1) with zero expectation and there- fore the covariance matrix is not a null matrix. Further the covariance matrix is away from the null matrix, the more influence has the set of t observations on the least squares fitted values of the rem~aing ( N - t) observations. We propose the second measure of influence as the sum of squares of the elements of the covariance matrix between the least squares fitted values of the remaining ( N - t) observations and a complete set of orthonormal linear functions of y with zero expectation. The largest value of the sum of squares indicates the corresponding set of t observations is the most influential.
To illustrate the usefulness of two proposed measures, we consider two designs. Design I is a nonorthogonal 24 fractional factorial design and Design II is an orthogonal complete 24 factorial design. For orthogonal design, the formulas take simple forms and the calculations become easy. For nonorthogonal design, the calculations de- mand the use of m o d e m computer.
The importance of knowing the influential set of observations at the design stage is that (1) we can assess the influence of a set of unavailable observations in the planned analysis, (2) in the case of deficit of budget during a long term ex- periment using the robust design where it may be a good idea not to collect observations which are least influential. The importance for detecting the influential set of observations at the inference stage is in doing better current inference (see Cook and Weisberg (1982), Chatterjee and Hadi (1986), Draper and John (1981)) and also in future planning of similar experiment. We present a re- sult interrelating the measures for detecting in- fluential observations at the design and inference stages. The problem of finding the influential set of t observations at the design stage is indeed related to the problem of finding an opt imum experimental design (see Kiefer (1959)) and some of this we have already seen in the explanation of the proposed first measure of influence.
2. First method
We denote the i th set of t observations in y by y(2'); and the remaining observations in y by y~('); the corresponding submatrices of X by X2 (') and Xa('); the resulting design when t observations in
the t th set are unavailable by d ('), / = 1 . . . . . "--(,(N)I The least squares estimators of fl under d and d ' are
( x ' x ) - ' x ' y
and
~a,) = (X<l')'X<a'))-aX<~')'y<~').
We write the fitted values of y under d and d f') as .Va = X~a and .~a,o = X~a.,. When t observa- tions in the i th set are unavailable, the predicted values of unavailable observations y2 <') f rom avail- able observations are the elements in .~2 <')= X<2')I~a.,. The reliabifity of these estimators can be judged by
,
The first measure of influence of y20) is defined as
11( y2 °)) = Trace V(.~20)). (4)
The smallest value of Ii(yt2')), / = 1 . . . . . ( N ) , for
t = u, indicates that the uth set of t observations is the least influential in terms of precise prediction of unavailable observations. On the other hand
the largest value of Ii(Y(2')), i = 1 . . . . . ( N ) , for
t = w, indicates the wth set of t observations is the most influential.
We denote
B(,) = I, + X~')( X~')'X('))-I X~ ')'.
It can be checked that
Bg: = I, - x<;)( x ' x ) - l x < " ' , (5)
£ . . - / L = ( x ' x ) - l Y " " , , (6) ~'2 ~0)
E ( X~2')I~d - .I(2 ') )( "'2 Y(')~ vd -- .r2 "'(')! Y = o2Bo) 1 . (7)
We denote the t th observations in y by y, and the t i t h r o w i n X b y x , , t = l . . . . . N.
242
Volume 7, Number 3 STATISTICS & PROBABILITY LETTERS December 1988
Theorem 1. For any design
0- 2I,( y(2') ) >~ E ......
where the q . . . . . t t rows of X are rows of X(2 ').
(8)
Proof. It follows f rom Bo) and B(7) 1 given in (5) that for t = i l , . . . , t t,
1 - x ' ( X ' X ) -1 1 + x;( X(') 'X('))- 'x,>~
, ' ( x ' x ) - l x ,
1 +
i.e.~
x;( >
The rest is easy.
Theorem 2. I f for a destgn the mdivtdual observa- ttons are equally mfluential then
Po2 (9) 6 ( y , ) = ( N - p )
Proof. When the individual observat ions are equally influential, 11(y,) is a constant indepen- dent of t for t = 1 and thus X2( ' )=x, and x;(X}') 'X( ' ))- lx , is a constant independent of i. This in turn implies f rom (5) that for t - -
/ / --]. 1, x , ( X X) x, is a constant independent of t. We know trace X ( X ' X ) - I X ' = p and thus x;( X ' X ) - I x , = p / N . F r o m (5), we get
( ) - ' P x , = ( N - p )
and hence the result.
Theorem 3. 1f for a destgn, the individual observa- tions are equally mfluenttal, then
pa2t (10) ( N - p )
Proof. For t = 1 and equally influential individual observations, x [ ( X ' X ) - ix, = p / N and hence the result in (10) follows f rom (8). F r o m (9) and (10), we observe that for a design with equally influen- tial individual observat ions I1( Y2 (`)) > / t l l ( Y , ) .
3. Second method
Let Z ( ( N - p ) × N) be a matr ix such that Rank Z = ( N - p ) , Z X = 0 and Z Z ' = I. I t can be seen that Coy( f~d, Z y ) = 0. This implies that )3 d has the m i n i m u m variance within the class of all unbi- ased est imators of E(.Od) under (1-3) . When t observat ions in the t th set are unavadable , the least squares fitted values are .13~ ') = X~')l~d.). We denote the submatr ices of Z cor responding to X~ ') and X2 (') by Z~ ') and Z2 °). It follows that
COV( Y I " ' Z y ) - - - 0 2 [ yO)[ y(t,tyO)]-lX(lO,ZlOt ] 1"1 ~ ~'1 "*1 ]
The further Cov(.13~ '), Zy) is away f rom the null matrix, the more influential is the set of t observa- tions y~2 '). We thus have the second measure of influence as
12( y2 (')) = 0 - 2 [Sum of squares of e lements
in Cov(j3~ '), Zy)] . (11)
We now show some similarities between our two measures of influence 11( y~ , and 12(y~')).
Thefol lowmgis truefor l= l . . . . . ( N ) , Theorem 4. ~ d
--
Proof . We observe that 7( , ) ~'(,) ~ 7 0 ) v o ) = 0 and ~ 1 1 ~ 1 - - ~ 2 ¢ ~ 2
hence
V ( 7( t )~O)] a 2 7 ( t ) y ( 1 ) ( y ( t ) t y ( t ) ~1 3vl ] = ~ - - 1 "'1 \ ' ' 1 "'1 ) - l x ~ t ) " Z ~ t) '
o:7")Y"~t x('"x~'~ ' ~ 2 ~ 2 \ ) - Y ( n ) " 7 0 ) ' = " '2 *-" 2
= V ( 7 o ) ~ ( , ) ] ~2 .Iv2 ].
Theorem 5
12(Y(2") = Trace V( Z2°)132 ° ' ). (13)
Proof . It can be seen that
12( y2 0 ) ) - - 0 2 Trace X( ' ) (X} '"XI~ ' ) ) -a X('"Z~ ')'
×
= 0 2 Trace Z~')X("( X(') 'X~ ' ) ) - 1 X ( ' " Z ~ "
= Trace V( Z~').~ ' ))
= Trace V( Z2°).132 ° ' )
Corollary
I2(.~2(')) = Trace [V( f,(2'))][Z(2'"Z(2') ] . (14)
243
Volume 7, Number 3 STATISTICS & PROBABILITY LETTERS December 1988
The equation (13) displays the similarity be- tween two measures of influence 11(y2 (')) and 12(y2(')). Although the matrix Z is not unique, it can be checked that I2(y2 (')) is unique for all choices of the matrix Z.
4. Relationship
Cook (1977) proposed a distance function between ^ " ~ , ^
Yd and Yd~'~ popular as Cook's distance, at the inference stage as
D, = (.f'a"'--.Vd)" (.f'd'"--.Vd) , (15) pS~
where ( N - p ) s 2 = ( y - Yd)'( Y -- Yd)" The Cook's distance/9, measures the degree of influence of t observations in the ith set on the estimation of ft. We now show that our first measure of influence 11( y2 (')) is in fact related to/9,.
Theorem 6. From (4) and (15), we have
E ( p s 2 D , ) = I i ( Y ( 2 0 ) .
Proof. We get, from (6),
: Z , > )
It not follows from (7), (15) and (17) that
E ( ps2D,) = o 2 T r a c e ( B < , ) - 1 , )
= 0 2 Trace X(2')( X(l')'X(l'))-1 X2 (')'
This completes the proof.
(16)
5. Examples
Consider a 2 4 factorial experiment in a completely randomized set up and suppose the elements in fl
are the general mean, the main effects and the 2-factor interactions. The 3-factor and higher order interactions are assumed to be zero. Thus p --- 11. The treatments are denoted by (x I x 2 x 3 x4), x, -- 0, 1, t = 1, 2, 3, 4. For brevity, we indicate a treatment by the positions where the level 1 is occuring. For example, the treatment (1100) is denoted by 12. The treatment (00130) is denote by 0.
Destgn I
Consider a design w~th 15 treatments and we write the treatments in the order (O, 1, 2, 3, 4, 12, 13, 14, 23, 24, 34, 123, 124, 134, 234). Note that the elements in y, the rows of X and the columns of Z correspond to this ordering. The matrix Z is given in Table 1, where a = (1/V/5). The design is robust against the unavailability of any two ob- servations (Ghosh (1979)). We find that under Design I, any of y,, ~ = 2 . . . . . 11 is the least in- fluential w.r.t, both 11 and 12. Any pair of ob- servations with o-211 equals 11.333 is the most influential w.r.t. 11. On the other hand, any pair of observations with o-212 equals 1.600 is the most influential w.r.t. I 2. The variability in values in 12 is so small that it is very hard to assess the influence w.r.t. 12 under this design.
Design H
We consider a complete 2 4 factorial experiment with treatments written in the order
(1234, O, 1, 2, 3, 4, 12, 13, 14, 23, 24, 34, 123,
124, 134, 234).
The matrix Z is shown in Table 2. This design is robust against the unavailability of any three ob- servations (Ghosh (1979)). It can be checked that I1 ( y , )=2 .2o 2 and I2 (y , )=(0 .6875)o 2 for t = 1 . . . . . 16. Thus the individual observations are
Table 1
Z = (0.25)
4a -3a -3a -3a -3a 2a 2a 0 1 -1 1 -1 0 -2 0 1 -1 -1 1 0 0 0 1 1 -1 -1 -2 0
2a 2a 2a 2a - a - a - a - a 0 0 2 0 1 - 1 1 - 1
-2 2 0 0 -1 1 1 -1 0 0 0 2 1 1 -1 -1
244
Volume 7, Number 3
Table 2
STATISTICS & PROBABILITY LETTERS December 1988
Z ~ (0.25)
1 1 - 1 - 1 0 0 1 - 1 0 0 1 - 1 0 0 1 1 2 - 2 1 1
- 1 - 1 1 1 1 1 1 - 1 0 - 2 0 0
- 1 1 0 0 - 2 2 - 1 - 1 - 2 0 0 0
1 1 0 0 0 0
1 1 - 1 - 1 - 1 - 1 2 0 1 - 1 1 - 1 0 0 - 1 1 1 - 1 0 2 1 1 - 1 - 1 0 0 - 1 - 1 - 1 - 1
Table 3 Influences of indiwdual observations under design I
Observaaons o - 211 o - 212
Yl 4.000 0.800 y,, i = 2 . . . . . 11 2.333 0.700 y,, i = 12 . . . . . 15 4.000 0.800
Table 4 Influences of parrs of observations under Design I
Observations o - 211 o - 2-I 2
(Yl, Y,), I = 2 . . . . . 15; (Y2, Y15); 11.333 1.500 (Y3, Yl4); (Ys, Yl2); (Yr, Y,) I = 2,13; (Y7, Y,), I =12,14; (Ys, Y,), * =13,14; (Yg, Y,), I =12,15; (Ylo, Y,), * =13,15; (Yn, Y,), * =14,15.
(yt , y , ) , , = 6 . . . . 11; (Y2, Y,), ,=12 ,13 , 14; (Y3, Y,), * =12,13,15; (Ya, Y,), * =12,14,15; (Ys, Y,), i =13,14,15; (Y6, Y,), z =14,15; (YT, Y,), i =13,15; (Ys, Y,), i =12,15; (Yg, Y,), ,=13 ,14 ; (Ylo, Yl4); (Yn, Y,), ,=12 ,13 .
(Yl, Y,), * =12 . . . . . 15; (Y12, Yz) i = 13,14,15; (Y13, Yz), * = 14,15;
(Y14, YlS).
(Y2, Y,), I = 3,4,5,9,10,11, (Y3, Y,), = 4,5,7,8,11; (Y4, Y,), = 5,6,8,10; (Ys, Y,), = 6,7,9; (Yr, Y,), = 7,8,9,10; (Y7, Y,), =8,9,11; (Ys, Y,), t =10,11; (Yg, Y,), =10,11; (Yto, Y11)"
(Y2, Y,), z = 6,7,8; (Y3, Y,), t = 6,9,10, (.1'4, Y,), z = 7,9,11; (Ys, Y,), i = 8,10,11;
(76, YlI), (YT, Ylo); (Ya, Y9), (YI0, Y12)"
8 000 1.500
8.667 1 600
4.857 1.400
10.000 1.400
equally influential w.r.t, both /1 and 12. For any pair of observations corresponding to treatments with zero or three levels in common, the value of o-211 is 8.000. For every other pair of observa- tions, the value of o-211 is 4.667. We therefore see the validity of the equation (10) in Theorem 3 for this example since 211(y,)= 4.402. The value of o-212 for any pair of observations is a constant 1.375. The remark on 12 for Design I also holds for Design II.
Acknowledgement
The author is thankful to the referee for reading the paper critically.
References
Chatterjee, S. and A.S. Hadl (1986), Influential Observations, high leverage points and outhers tn hnear regression, Staus- ttcal Sctence 3, 379-416.
Cook, R.D. (1977), Detection of influential observations m linear regression, Technometncs 22, 495-508.
Cook, R.D. and S. Weisberg, (1982), Residuals and influence in regression, Technometncs 23, 21-26.
Draper, N.R. and J A John, (1981), Influential observations and outliers in regression, Technometrtcs 23, 21-26.
Ghosh, S (1979), On robustness of designs agamst incomplete data, Sankhy~ B, Pts 3 and 4, 204-208
Kaefer, J. (1959), Opt imum experimental designs, Roy Star*st Soc Ser. B 21 273-319
245