On two methods of identifying influential sets of observations

Statistics & Probability Letters 7 (1989) 241-245 December 1988 North-Holland

ON TWO METHODS OF IDENTIFYING INFLUENTIAL SETS OF OBSERVATIONS

Subir GHOSH

Department of Statistics, University of California, Riverside, CA 92521, USA

Received February 1987; revised April 1987

Abstract: In this paper two new measures are proposed to identify influential sets of observations at the design stage in view of prediction and fitting. A relationship is established between one of the proposed measures and Cook's measure at the inference stage.

AMS 1970 Subject Classifications: Primary and Secondary 62J05, 62K15.

Keywords: Cook's measure, design, influential observations, linear models, robustness, unavailable observations.

1. Introduction

A set of observations under a design is said to be influential in this paper if the set affects not only the fitting of the model to the data but also the prediction in terms of the fitted model. In the problem of identifying sets of $t$ (a positive integer) influential observations, we assume the underlying design is robust against the unavailability of any $t$ observations (Ghosh (1979)). We first explain this concept by considering the standard linear model

$$E(y) = X\beta, \tag{1}$$

$$V(y) = \sigma^2 I, \tag{2}$$

$$\operatorname{Rank} X = p, \tag{3}$$

where $y$ ($N \times 1$) is a vector of observations, $X$ ($N \times p$) is a known matrix, $\beta$ ($p \times 1$) is a vector of fixed unknown parameters and $\sigma^2$ is a constant which may or may not be known. Let $d$ be the underlying design corresponding to $y$. The design $d$ is assumed to be robust against the unavailability of any $t$ observations in the sense that the parameters in $\beta$ are still unbiasedly estimable when any $t$ observations in $y$ are unavailable. There are $\binom{N}{t}$ possible sets of $t$ observations. The idea of robustness of designs against unavailability of data is fundamental in measuring the influence of a set of observations. We measure the influence of a set of $t$ observations by assuming the observations in the set unavailable and then assessing the model fitted with the remaining $(N - t)$ observations.

Work is sponsored by the Air Force Office of Scientific Research under Grant AFOSR-87-0048.
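The robustness condition can be checked by brute force: a design is robust against the unavailability of any $t$ observations when every matrix obtained from $X$ by deleting $t$ rows still has rank $p$. The following is a minimal sketch of that check, not taken from the paper; the function names `rank` and `is_robust` and the straight-line toy design `X_line` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    ncols = len(m[0]) if m else 0
    rk = 0
    for c in range(ncols):
        # Look for a nonzero pivot in column c among the unused rows.
        piv = next((r for r in range(rk, len(m)) if m[r][c] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for r in range(rk + 1, len(m)):
            f = m[r][c] / m[rk][c]
            if f != 0:
                m[r] = [a - f * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


def is_robust(X, t):
    """True if X keeps full column rank after deleting any t rows."""
    N, p = len(X), len(X[0])
    for gone in combinations(range(N), t):
        X1 = [X[i] for i in range(N) if i not in gone]
        if rank(X1) < p:
            return False
    return True


# Hypothetical toy design: straight-line model observed at x = 0, 1, 2, 3.
X_line = [[1, x] for x in range(4)]
print(is_robust(X_line, 2))  # two distinct x values always remain
print(is_robust(X_line, 3))  # a single point cannot identify a line
```

Exact rational arithmetic is used so that the rank decision does not depend on a floating-point tolerance; for large $N$ and $t$ the number of subsets grows as $\binom{N}{t}$, so this enumeration is only practical for small designs.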

We propose the first measure of influence in terms of precise prediction of $t$ unavailable observations. A set of $t$ observations is the most influential if the model fitted with the remaining $(N - t)$ observations does the worst job in predicting the $t$ unavailable observations. The first measure is in fact the sum of variances of predicted values of unavailable observations from the remaining $(N - t)$ observations. The largest value of the sum indicates that the corresponding set of $t$ observations is the most influential in terms of precise prediction of unavailable observations.

The vector of the least squares fitted values of $N$ observations is uncorrelated with a complete set of orthonormal linear functions of $y$ with zero expectations. This fact implies the optimum property "Minimum Variance" of the vector of the least squares fitted values as an unbiased estimator of its expectation. Assuming a set of $t$ observations unavailable, the least squares fitted values of the remaining $(N - t)$ observations are correlated with a complete set of orthonormal linear functions of $y$ ($N \times 1$) with zero expectation and therefore the covariance matrix is not a null matrix. The further the covariance matrix is from the null matrix, the more influence the set of $t$ observations has on the least squares fitted values of the remaining $(N - t)$ observations. We propose the second measure of influence as the sum of squares of the elements of the covariance matrix between the least squares fitted values of the remaining $(N - t)$ observations and a complete set of orthonormal linear functions of $y$ with zero expectation. The largest value of the sum of squares indicates that the corresponding set of $t$ observations is the most influential.

0167-7152/88/$3.50 © 1988, Elsevier Science Publishers B.V. (North-Holland)

Volume 7, Number 3 STATISTICS & PROBABILITY LETTERS December 1988

To illustrate the usefulness of the two proposed measures, we consider two designs. Design I is a nonorthogonal $2^4$ fractional factorial design and Design II is an orthogonal complete $2^4$ factorial design. For an orthogonal design, the formulas take simple forms and the calculations become easy. For a nonorthogonal design, the calculations demand the use of a modern computer.

The importance of knowing the influential set of observations at the design stage is that (1) we can assess the influence of a set of unavailable observations in the planned analysis, and (2) in the case of a budget deficit during a long-term experiment using the robust design, it may be a good idea not to collect the observations which are least influential. The importance of detecting the influential set of observations at the inference stage is in doing better current inference (see Cook and Weisberg (1982), Chatterjee and Hadi (1986), Draper and John (1981)) and also in future planning of similar experiments. We present a result interrelating the measures for detecting influential observations at the design and inference stages. The problem of finding the influential set of $t$ observations at the design stage is indeed related to the problem of finding an optimum experimental design (see Kiefer (1959)), and some of this we have already seen in the explanation of the proposed first measure of influence.

2. First method

We denote the $i$th set of $t$ observations in $y$ by $y_2^{(i)}$ and the remaining observations in $y$ by $y_1^{(i)}$; the corresponding submatrices of $X$ by $X_2^{(i)}$ and $X_1^{(i)}$; and the resulting design when the $t$ observations in the $i$th set are unavailable by $d^{(i)}$, $i = 1, \ldots, \binom{N}{t}$. The least squares estimators of $\beta$ under $d$ and $d^{(i)}$ are

$$\hat\beta_d = (X'X)^{-1}X'y$$

and

$$\hat\beta_{d^{(i)}} = (X_1^{(i)\prime}X_1^{(i)})^{-1}X_1^{(i)\prime}y_1^{(i)}.$$

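We write the fitted values of $y$ under $d$ and $d^{(i)}$ as $\hat y_d = X\hat\beta_d$ and $\hat y_{d^{(i)}} = X\hat\beta_{d^{(i)}}$. When the $t$ observations in the $i$th set are unavailable, the predicted values of the unavailable observations $y_2^{(i)}$ from the available observations are the elements in $\hat y_2^{(i)} = X_2^{(i)}\hat\beta_{d^{(i)}}$. The reliability of these estimators can be judged by

$$V(\hat y_2^{(i)}) = \sigma^2 X_2^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_2^{(i)\prime}.$$

The first measure of influence of $y_2^{(i)}$ is defined as

$$I_1(y_2^{(i)}) = \operatorname{Trace} V(\hat y_2^{(i)}). \tag{4}$$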

The smallest value of $I_1(y_2^{(i)})$, $i = 1, \ldots, \binom{N}{t}$, attained at $i = u$, indicates that the $u$th set of $t$ observations is the least influential in terms of precise prediction of unavailable observations. On the other hand, the largest value of $I_1(y_2^{(i)})$, $i = 1, \ldots, \binom{N}{t}$, attained at $i = w$, indicates that the $w$th set of $t$ observations is the most influential.
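To make the ranking by the first measure concrete, the sketch below evaluates $\sigma^{-2}I_1 = \operatorname{Trace} X_2^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_2^{(i)\prime}$ for every set of $t$ observations and reports the most and least influential sets. It is an illustration under assumptions, not the author's computation: the helper names `solve` and `first_measure` and the straight-line design `X_line` (observations at $x = 0, \ldots, 4$) are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def solve(A, b):
    """Solve A z = b exactly by Gauss-Jordan elimination (A assumed nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(b[i])] for i in range(n)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        pv = M[c][c]
        M[c] = [v / pv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [a - f * b_ for a, b_ in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def first_measure(X, gone):
    """I_1 / sigma^2 for the set of deleted row indices `gone`."""
    N, p = len(X), len(X[0])
    X1 = [X[i] for i in range(N) if i not in gone]
    X2 = [X[i] for i in gone]
    # X1'X1: information matrix of the retained observations.
    A = [[sum(r[j] * r[k] for r in X1) for k in range(p)] for j in range(p)]
    total = Fraction(0)
    for x in X2:
        z = solve(A, x)                              # (X1'X1)^{-1} x
        total += sum(a * b for a, b in zip(x, z))    # x'(X1'X1)^{-1} x
    return total


# Hypothetical toy design: straight-line model observed at x = 0, ..., 4.
X_line = [[1, x] for x in range(5)]
for t in (1, 2):
    scores = {g: first_measure(X_line, g) for g in combinations(range(5), t)}
    most = max(scores, key=scores.get)
    least = min(scores, key=scores.get)
    print(t, "most:", most, scores[most], "least:", least, scores[least])
```

In this toy design the end points are the hardest to predict from the rest and the centre point the easiest, which is the behaviour the measure is meant to capture. The sketch assumes the design is robust for the chosen $t$; otherwise $X_1^{(i)\prime}X_1^{(i)}$ is singular and `solve` fails.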

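We denote

$$B_{(i)} = I_t + X_2^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_2^{(i)\prime}.$$

It can be checked that

$$B_{(i)}^{-1} = I_t - X_2^{(i)}(X'X)^{-1}X_2^{(i)\prime}, \tag{5}$$

$$\hat\beta_{d^{(i)}} - \hat\beta_d = (X'X)^{-1}X_2^{(i)\prime}B_{(i)}(X_2^{(i)}\hat\beta_d - y_2^{(i)}), \tag{6}$$

$$E(X_2^{(i)}\hat\beta_d - y_2^{(i)})(X_2^{(i)}\hat\beta_d - y_2^{(i)})' = \sigma^2 B_{(i)}^{-1}. \tag{7}$$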

We denote the $i$th observation in $y$ by $y_i$ and the $i$th row in $X$ by $x_i'$, $i = 1, \ldots, N$.



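Theorem 1. For any design

$$\sigma^{-2} I_1(y_2^{(i)}) \ge \sum_{u = i_1, \ldots, i_t} \frac{x_u'(X'X)^{-1}x_u}{1 - x_u'(X'X)^{-1}x_u}, \tag{8}$$

where the $i_1, \ldots, i_t$th rows of $X$ are the rows of $X_2^{(i)}$.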

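Proof. Each diagonal element of the positive definite matrix $B_{(i)}$ is at least the reciprocal of the corresponding diagonal element of $B_{(i)}^{-1}$. It therefore follows from $B_{(i)}$ and $B_{(i)}^{-1}$ given in (5) that for $u = i_1, \ldots, i_t$,

$$1 + x_u'(X_1^{(i)\prime}X_1^{(i)})^{-1}x_u \ge \frac{1}{1 - x_u'(X'X)^{-1}x_u} = 1 + \frac{x_u'(X'X)^{-1}x_u}{1 - x_u'(X'X)^{-1}x_u},$$

i.e.,

$$x_u'(X_1^{(i)\prime}X_1^{(i)})^{-1}x_u \ge \frac{x_u'(X'X)^{-1}x_u}{1 - x_u'(X'X)^{-1}x_u}.$$

The rest is easy.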

Theorem 2. If for a design the individual observations are equally influential, then

$$I_1(y_i) = \frac{p\sigma^2}{(N - p)}. \tag{9}$$

Proof. When the individual observations are equally influential, $I_1(y_i)$ is a constant independent of $i$ for $t = 1$; thus $X_2^{(i)} = x_i'$ and $x_i'(X_1^{(i)\prime}X_1^{(i)})^{-1}x_i$ is a constant independent of $i$. This in turn implies from (5) that, for $t = 1$, $x_i'(X'X)^{-1}x_i$ is a constant independent of $i$. We know $\operatorname{Trace} X(X'X)^{-1}X' = p$ and thus $x_i'(X'X)^{-1}x_i = p/N$. From (5), we get

$$x_i'(X_1^{(i)\prime}X_1^{(i)})^{-1}x_i = \frac{p}{(N - p)}$$

and hence the result.
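The last step of this proof can be spelled out as follows. For $t = 1$ the quantities $B_{(i)}$ and $B_{(i)}^{-1}$ are scalars, so (5) can be solved directly. The symbol $h_i$ is introduced here only as shorthand and does not appear in the original.

```latex
% t = 1: B_{(i)} and B_{(i)}^{-1} are scalars and therefore reciprocals.
\begin{align*}
B_{(i)} &= 1 + x_i'(X_1^{(i)\prime}X_1^{(i)})^{-1}x_i,
\qquad
B_{(i)}^{-1} = 1 - h_i,
\qquad
h_i = x_i'(X'X)^{-1}x_i, \\
x_i'(X_1^{(i)\prime}X_1^{(i)})^{-1}x_i
  &= \frac{1}{1 - h_i} - 1
   = \frac{h_i}{1 - h_i}
   = \frac{p/N}{1 - p/N}
   = \frac{p}{N - p}
  \quad \text{when } h_i = p/N .
\end{align*}
```

For instance, $N = 16$ and $p = 11$ give $p/(N - p) = 11/5 = 2.2$, which agrees with the value $I_1(y_i) = 2.2\sigma^2$ reported for the complete $2^4$ factorial in Section 5.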

Theorem 3. If for a design the individual observations are equally influential, then

$$I_1(y_2^{(i)}) \ge \frac{p\sigma^2 t}{(N - p)}. \tag{10}$$

Proof. For $t = 1$ and equally influential individual observations, $x_i'(X'X)^{-1}x_i = p/N$ and hence the result in (10) follows from (8). From (9) and (10), we observe that for a design with equally influential individual observations $I_1(y_2^{(i)}) \ge t\,I_1(y_i)$.

3. Second method

Let $Z$ ($(N - p) \times N$) be a matrix such that $\operatorname{Rank} Z = (N - p)$, $ZX = 0$ and $ZZ' = I$. It can be seen that $\operatorname{Cov}(\hat y_d, Zy) = 0$. This implies that $\hat y_d$ has the minimum variance within the class of all unbiased estimators of $E(\hat y_d)$ under (1)-(3). When the $t$ observations in the $i$th set are unavailable, the least squares fitted values are $\hat y_1^{(i)} = X_1^{(i)}\hat\beta_{d^{(i)}}$. We denote the submatrices of $Z$ corresponding to $X_1^{(i)}$ and $X_2^{(i)}$ by $Z_1^{(i)}$ and $Z_2^{(i)}$. It follows that

$$\operatorname{Cov}(\hat y_1^{(i)}, Zy) = \sigma^2\left[X_1^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_1^{(i)\prime}Z_1^{(i)\prime}\right].$$

The further $\operatorname{Cov}(\hat y_1^{(i)}, Zy)$ is away from the null matrix, the more influential is the set of $t$ observations $y_2^{(i)}$. We thus have the second measure of influence as

$$I_2(y_2^{(i)}) = \sigma^{-2}\left[\text{Sum of squares of elements in } \operatorname{Cov}(\hat y_1^{(i)}, Zy)\right]. \tag{11}$$
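The definition (11) can be evaluated directly once a matrix $Z$ is available. The sketch below does this for a deliberately small case that is not one of the paper's designs: a $2^2$ factorial with the general mean and two main effects ($N = 4$, $p = 3$), for which the single row $Z = \tfrac12(x_1x_2)$ satisfies $ZX = 0$ and $ZZ' = 1$. The design, the helper names and the function `second_measure` are assumptions made for illustration.

```python
from fractions import Fraction


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(c) for c in zip(*A)]


def inverse(A):
    """Exact inverse by Gauss-Jordan elimination (A assumed nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(int(i == j)) for j in range(n)]
         for i in range(n)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        pv = M[c][c]
        M[c] = [v / pv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]


def second_measure(X, Z, gone):
    """I_2 / sigma^2: sum of squares of X1 (X1'X1)^{-1} X1' Z1'."""
    keep = [i for i in range(len(X)) if i not in gone]
    X1 = [X[i] for i in keep]
    Z1 = [[row[i] for i in keep] for row in Z]   # columns of Z for retained obs
    H1 = matmul(matmul(X1, inverse(matmul(transpose(X1), X1))), transpose(X1))
    C = matmul(H1, transpose(Z1))                # Cov(yhat_1, Zy) / sigma^2
    return sum(v * v for row in C for v in row)


# Hypothetical 2^2 factorial: mean + two main effects in +/-1 coding.
levels = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
X = [[1, a, b] for a, b in levels]
Z = [[Fraction(a * b, 2) for a, b in levels]]    # interaction contrast / 2

for i in range(4):
    print(i, second_measure(X, Z, (i,)))
```

Because this toy design is orthogonal and balanced, every single observation receives the same value of $\sigma^{-2}I_2$, so no observation stands out.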

We now show some similarities between our two measures of influence $I_1(y_2^{(i)})$ and $I_2(y_2^{(i)})$.

Theorem 4. The following is true for $i = 1, \ldots, \binom{N}{t}$:

$$V(Z_1^{(i)}\hat y_1^{(i)}) = V(Z_2^{(i)}\hat y_2^{(i)}). \tag{12}$$

Proof. We observe that $Z_1^{(i)}X_1^{(i)} + Z_2^{(i)}X_2^{(i)} = 0$ and hence

$$V(Z_1^{(i)}\hat y_1^{(i)}) = \sigma^2 Z_1^{(i)}X_1^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_1^{(i)\prime}Z_1^{(i)\prime} = \sigma^2 Z_2^{(i)}X_2^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_2^{(i)\prime}Z_2^{(i)\prime} = V(Z_2^{(i)}\hat y_2^{(i)}).$$

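Theorem 5.

$$I_2(y_2^{(i)}) = \operatorname{Trace} V(Z_2^{(i)}\hat y_2^{(i)}). \tag{13}$$

Proof. It can be seen that

$$\begin{aligned}
I_2(y_2^{(i)}) &= \sigma^2 \operatorname{Trace}\, X_1^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_1^{(i)\prime}Z_1^{(i)\prime} \times Z_1^{(i)}X_1^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_1^{(i)\prime} \\
&= \sigma^2 \operatorname{Trace}\, Z_1^{(i)}X_1^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_1^{(i)\prime}Z_1^{(i)\prime} \\
&= \operatorname{Trace} V(Z_1^{(i)}\hat y_1^{(i)}) \\
&= \operatorname{Trace} V(Z_2^{(i)}\hat y_2^{(i)}).
\end{aligned}$$

Corollary.

$$I_2(y_2^{(i)}) = \operatorname{Trace}\,[V(\hat y_2^{(i)})][Z_2^{(i)\prime}Z_2^{(i)}]. \tag{14}$$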



The equation (13) displays the similarity between the two measures of influence $I_1(y_2^{(i)})$ and $I_2(y_2^{(i)})$. Although the matrix $Z$ is not unique, it can be checked that $I_2(y_2^{(i)})$ is unique for all choices of the matrix $Z$.

4. Relationship

Cook (1977) proposed a distance function between $\hat y_d$ and $\hat y_{d^{(i)}}$, popular as Cook's distance, at the inference stage as

$$D_i = \frac{(\hat y_{d^{(i)}} - \hat y_d)'(\hat y_{d^{(i)}} - \hat y_d)}{p s^2}, \tag{15}$$

where $(N - p)s^2 = (y - \hat y_d)'(y - \hat y_d)$. Cook's distance $D_i$ measures the degree of influence of the $t$ observations in the $i$th set on the estimation of $\beta$. We now show that our first measure of influence $I_1(y_2^{(i)})$ is in fact related to $D_i$.

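Theorem 6. From (4) and (15), we have

$$E(p s^2 D_i) = I_1(y_2^{(i)}). \tag{16}$$

Proof. We get, from (6),

$$p s^2 D_i = (X_2^{(i)}\hat\beta_d - y_2^{(i)})' B_{(i)} X_2^{(i)}(X'X)^{-1}X_2^{(i)\prime} B_{(i)} (X_2^{(i)}\hat\beta_d - y_2^{(i)}). \tag{17}$$

It now follows from (7), (15) and (17) that

$$E(p s^2 D_i) = \sigma^2 \operatorname{Trace}(B_{(i)} - I_t) = \sigma^2 \operatorname{Trace} X_2^{(i)}(X_1^{(i)\prime}X_1^{(i)})^{-1}X_2^{(i)\prime}.$$

This completes the proof.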
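The algebra behind this proof can be checked numerically. The sketch below, which is not part of the paper, takes a small straight-line design with arbitrary illustrative responses and verifies in exact arithmetic that $p s^2 D_i$, computed from the two sets of fitted values, equals the quadratic form in $X_2^{(i)}\hat\beta_d - y_2^{(i)}$ obtained from (6), and that $\operatorname{Trace}(B_{(i)} - I_t)$ equals $\sigma^{-2}I_1$. The data `y`, the deleted set `gone` and the helper names are hypothetical.

```python
from fractions import Fraction


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(c) for c in zip(*A)]


def inverse(A):
    """Exact inverse by Gauss-Jordan elimination (A assumed nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(int(i == j)) for j in range(n)]
         for i in range(n)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        pv = M[c][c]
        M[c] = [v / pv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]


X = [[1, x] for x in range(5)]            # hypothetical straight-line design
y = [[v] for v in (2, 1, 4, 3, 7)]        # arbitrary illustrative responses
gone = (0, 3)                             # the set of t = 2 deleted observations
keep = [i for i in range(5) if i not in gone]
X1, X2 = [X[i] for i in keep], [X[i] for i in gone]
y1, y2 = [y[i] for i in keep], [y[i] for i in gone]
t = len(gone)

XtX_inv = inverse(matmul(transpose(X), X))
X1tX1_inv = inverse(matmul(transpose(X1), X1))
beta_d = matmul(XtX_inv, matmul(transpose(X), y))
beta_di = matmul(X1tX1_inv, matmul(transpose(X1), y1))

# Left side: p s^2 D_i = (yhat_{d(i)} - yhat_d)'(yhat_{d(i)} - yhat_d).
diff = [[a[0] - b[0]] for a, b in zip(matmul(X, beta_di), matmul(X, beta_d))]
lhs = matmul(transpose(diff), diff)[0][0]

# Right side: r' B X2 (X'X)^{-1} X2' B r with r = X2 beta_d - y2.
Q = matmul(matmul(X2, X1tX1_inv), transpose(X2))      # X2 (X1'X1)^{-1} X2'
B = [[Q[j][k] + (j == k) for k in range(t)] for j in range(t)]
M = matmul(matmul(X2, XtX_inv), transpose(X2))        # X2 (X'X)^{-1} X2'
r = [[a[0] - b[0]] for a, b in zip(matmul(X2, beta_d), y2)]
Br = matmul(B, r)                                     # B is symmetric
rhs = matmul(matmul(transpose(Br), M), Br)[0][0]

trace_BM = sum(matmul(B, M)[j][j] for j in range(t))  # Trace(B - I_t) via (5)
trace_Q = sum(Q[j][j] for j in range(t))              # I_1 / sigma^2
print(lhs, rhs, trace_BM, trace_Q)
```

The first two printed numbers agree for any response vector, while the last two agree regardless of the data, which is why the expectation of $p s^2 D_i$ depends only on the design.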

5. Examples

Consider a $2^4$ factorial experiment in a completely randomized set up and suppose the elements in $\beta$ are the general mean, the main effects and the 2-factor interactions. The 3-factor and higher order interactions are assumed to be zero. Thus $p = 11$. The treatments are denoted by $(x_1 x_2 x_3 x_4)$, $x_i = 0, 1$, $i = 1, 2, 3, 4$. For brevity, we indicate a treatment by the positions where the level 1 is occurring. For example, the treatment (1100) is denoted by 12. The treatment (0000) is denoted by 0.

Design I

Consider a design with 15 treatments, written in the order (0, 1, 2, 3, 4, 12, 13, 14, 23, 24, 34, 123, 124, 134, 234). Note that the elements in $y$, the rows of $X$ and the columns of $Z$ correspond to this ordering. The matrix $Z$ is given in Table 1, where $a = 1/\sqrt{5}$. The design is robust against the unavailability of any two observations (Ghosh (1979)). We find that under Design I, any of $y_i$, $i = 2, \ldots, 11$, is the least influential w.r.t. both $I_1$ and $I_2$. Any pair of observations with $\sigma^{-2}I_1$ equal to 11.333 is the most influential w.r.t. $I_1$. On the other hand, any pair of observations with $\sigma^{-2}I_2$ equal to 1.600 is the most influential w.r.t. $I_2$. The variability in the values of $I_2$ is so small that it is very hard to assess the influence w.r.t. $I_2$ under this design.

Design II

We consider a complete $2^4$ factorial experiment with treatments written in the order (1234, 0, 1, 2, 3, 4, 12, 13, 14, 23, 24, 34, 123, 124, 134, 234).

The matrix $Z$ is shown in Table 2. This design is robust against the unavailability of any three observations (Ghosh (1979)). It can be checked that $I_1(y_i) = 2.2\sigma^2$ and $I_2(y_i) = (0.6875)\sigma^2$ for $i = 1, \ldots, 16$. Thus the individual observations are equally influential w.r.t. both $I_1$ and $I_2$. For any pair of observations corresponding to treatments with zero or three levels in common, the value of $\sigma^{-2}I_1$ is 8.000. For every other pair of observations, the value of $\sigma^{-2}I_1$ is 4.667. We therefore see the validity of the equation (10) in Theorem 3 for this example since $2I_1(y_i) = 4.4\sigma^2$. The value of $\sigma^{-2}I_2$ for any pair of observations is a constant 1.375. The remark on $I_2$ for Design I also holds for Design II.

Table 1
$Z = (0.25) \times$ the matrix below; columns follow the treatment order of Design I

   0    1    2    3    4   12   13   14   23   24   34  123  124  134  234
  4a  -3a  -3a  -3a  -3a   2a   2a   2a   2a   2a   2a   -a   -a   -a   -a
   0    1   -1    1   -1    0   -2    0    0    2    0    1   -1    1   -1
   0    1   -1   -1    1    0    0   -2    2    0    0   -1    1    1   -1
   0    1    1   -1   -1   -2    0    0    0    0    2    1    1   -1   -1

Table 2
$Z = (0.25) \times$ the matrix below; columns follow the treatment order of Design II

1234    0    1    2    3    4   12   13   14   23   24   34  123  124  134  234
   1    1   -1   -1   -1   -1    1    1    1    1    1    1   -1   -1   -1   -1
   0    0    1   -1    1   -1    0   -2    0    0    2    0    1   -1    1   -1
   0    0    1   -1   -1    1    0    0   -2    2    0    0   -1    1    1   -1
   0    0    1    1   -1   -1   -2    0    0    0    0    2    1    1   -1   -1
   2   -2    1    1    1    1    0    0    0    0    0    0   -1   -1   -1   -1

Table 3
Influences of individual observations under Design I

Observations                  $\sigma^{-2}I_1$   $\sigma^{-2}I_2$
$y_1$                         4.000   0.800
$y_i$, $i = 2, \ldots, 11$    2.333   0.700
$y_i$, $i = 12, \ldots, 15$   4.000   0.800

Table 4
Influences of pairs of observations under Design I

Observations   $\sigma^{-2}I_1$   $\sigma^{-2}I_2$

$(y_1, y_i)$, $i = 2, \ldots, 5$; $(y_2, y_{15})$; $(y_3, y_{14})$; $(y_4, y_{13})$; $(y_5, y_{12})$; $(y_6, y_i)$, $i = 12, 13$; $(y_7, y_i)$, $i = 12, 14$; $(y_8, y_i)$, $i = 13, 14$; $(y_9, y_i)$, $i = 12, 15$; $(y_{10}, y_i)$, $i = 13, 15$; $(y_{11}, y_i)$, $i = 14, 15$.   11.333   1.500

$(y_1, y_i)$, $i = 6, \ldots, 11$; $(y_2, y_i)$, $i = 12, 13, 14$; $(y_3, y_i)$, $i = 12, 13, 15$; $(y_4, y_i)$, $i = 12, 14, 15$; $(y_5, y_i)$, $i = 13, 14, 15$; $(y_6, y_i)$, $i = 14, 15$; $(y_7, y_i)$, $i = 13, 15$; $(y_8, y_i)$, $i = 12, 15$; $(y_9, y_i)$, $i = 13, 14$; $(y_{10}, y_i)$, $i = 12, 14$; $(y_{11}, y_i)$, $i = 12, 13$.   8.000   1.500

$(y_1, y_i)$, $i = 12, \ldots, 15$; $(y_{12}, y_i)$, $i = 13, 14, 15$; $(y_{13}, y_i)$, $i = 14, 15$; $(y_{14}, y_{15})$.   8.667   1.600

$(y_2, y_i)$, $i = 3, 4, 5, 9, 10, 11$; $(y_3, y_i)$, $i = 4, 5, 7, 8, 11$; $(y_4, y_i)$, $i = 5, 6, 8, 10$; $(y_5, y_i)$, $i = 6, 7, 9$; $(y_6, y_i)$, $i = 7, 8, 9, 10$; $(y_7, y_i)$, $i = 8, 9, 11$; $(y_8, y_i)$, $i = 10, 11$; $(y_9, y_i)$, $i = 10, 11$; $(y_{10}, y_{11})$.   4.857   1.400

$(y_2, y_i)$, $i = 6, 7, 8$; $(y_3, y_i)$, $i = 6, 9, 10$; $(y_4, y_i)$, $i = 7, 9, 11$; $(y_5, y_i)$, $i = 8, 10, 11$; $(y_6, y_{11})$; $(y_7, y_{10})$; $(y_8, y_9)$.   10.000   1.400
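The numerical values quoted in this section can be recomputed from the design matrices alone. The sketch below is an illustration rather than the author's program. It assumes $\pm 1$ coding of the four factors (which spans the same model space as the 0/1 levels), uses (5) to obtain $B_{(i)}^{-1}$ from $H = X(X'X)^{-1}X'$, takes $\sigma^{-2}I_1 = \operatorname{Trace}(B_{(i)} - I_t)$, and evaluates $\sigma^{-2}I_2$ through (14) together with the standard identity $Z'Z = I - H$, which reduces it to the sum of the diagonal elements of $H$ for the deleted observations. The function names `model_row` and `measures` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(c) for c in zip(*A)]


def inverse(A):
    """Exact inverse by Gauss-Jordan elimination (A assumed nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(int(i == j)) for j in range(n)]
         for i in range(n)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        pv = M[c][c]
        M[c] = [v / pv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]


def model_row(treatment):
    """General mean, 4 main effects and 6 two-factor interactions (+/-1 coding)."""
    x = [1 if str(k) in treatment else -1 for k in (1, 2, 3, 4)]
    return [1] + x + [x[j] * x[k] for j, k in combinations(range(4), 2)]


def measures(treatments, t):
    """Map each t-subset of observations to (I_1/sigma^2, I_2/sigma^2)."""
    X = [model_row(tr) for tr in treatments]
    H = matmul(matmul(X, inverse(matmul(transpose(X), X))), transpose(X))
    out = {}
    for S in combinations(range(len(X)), t):
        Binv = [[(j == k) - H[j][k] for k in S] for j in S]   # equation (5)
        B = inverse(Binv)
        i1 = sum(B[j][j] for j in range(t)) - t               # Trace(B - I_t)
        i2 = sum(H[j][j] for j in S)                          # assumes Z'Z = I - H
        out[S] = (i1, i2)
    return out


design_I = ["0", "1", "2", "3", "4", "12", "13", "14", "23", "24", "34",
            "123", "124", "134", "234"]
design_II = ["1234"] + design_I

for name, design in (("I", design_I), ("II", design_II)):
    for t in (1, 2):
        vals = measures(design, t).values()
        print(name, t,
              sorted({round(float(v[0]), 3) for v in vals}),
              sorted({round(float(v[1]), 4) for v in vals}))
```

Working from $H$ needs only one $p \times p$ inversion per design plus a $t \times t$ inversion per set, which keeps the enumeration over all pairs cheap; it also sidesteps the non-uniqueness of $Z$.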

Acknowledgement

The author is thankful to the referee for reading the paper critically.

References

Chatterjee, S. and A.S. Hadi (1986), Influential observations, high leverage points and outliers in linear regression, Statistical Science 3, 379-416.

Cook, R.D. (1977), Detection of influential observations in linear regression, Technometrics 22, 495-508.

Cook, R.D. and S. Weisberg (1982), Residuals and influence in regression, Technometrics 23, 21-26.

Draper, N.R. and J.A. John (1981), Influential observations and outliers in regression, Technometrics 23, 21-26.

Ghosh, S. (1979), On robustness of designs against incomplete data, Sankhyā B, Pts 3 and 4, 204-208.

Kiefer, J. (1959), Optimum experimental designs, J. Roy. Statist. Soc. Ser. B 21, 273-319.
