Click here to load reader

View

219Download

0

Embed Size (px)

8/3/2019 Shapiro - An Analysis of Variance Test for Normality (Complete Samples) 1965

1/22

An Analysis of Variance Test for Normality (Complete Samples)

S. S. Shapiro; M. B. Wilk

Biometrika, Vol. 52, No. 3/4. (Dec., 1965), pp. 591-611.

Stable URL:

http://links.jstor.org/sici?sici=0006-3444%28196512%2952%3A3%2F4%3C591%3AAAOVTF%3E2.0.CO%3B2-B

Biometrika is currently published by Biometrika Trust.

Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available athttp://www.jstor.org/about/terms.html. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtainedprior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content inthe JSTOR archive only for your personal, non-commercial use.

Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained athttp://www.jstor.org/journals/bio.html.

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printedpage of such transmission.

JSTOR is an independent not-for-profit organization dedicated to and preserving a digital archive of scholarly journals. Formore information regarding JSTOR, please contact [email protected]

http://www.jstor.orgMon May 21 11:16:44 2007

http://links.jstor.org/sici?sici=0006-3444%28196512%2952%3A3%2F4%3C591%3AAAOVTF%3E2.0.CO%3B2-Bhttp://www.jstor.org/about/terms.htmlhttp://www.jstor.org/journals/bio.htmlhttp://www.jstor.org/journals/bio.htmlhttp://www.jstor.org/about/terms.htmlhttp://links.jstor.org/sici?sici=0006-3444%28196512%2952%3A3%2F4%3C591%3AAAOVTF%3E2.0.CO%3B2-B8/3/2019 Shapiro - An Analysis of Variance Test for Normality (Complete Samples) 1965

2/22

B i omet r i ka (1965), 52, 3 a n d 2 , p. 5 91W i t h 5 t ex t - j guresPrinted in &eat Bri tain

An analysis of variance test fo r normality(com plete samp1es)t

BYS. S. SHAPIRO AND M. B. WILKGeneral Electric Go. and Bell Telephone Laboratories, Inc.

The main intent of this paper is to introduce a new statistical procedure for testing acomplete sample for normality. The test statistic is obtained by dividing the square of anappropriate linear combination of the sample order statistics by the usual symmetricestimate of variance. This ratio is both scale and origin invariant and hence the statisticis appropriate for a test of the composite hypothesis of normality.Testing for distributional assumptions in general and for normality in particular has beena major area of continuing statistical research-both theoretically and practically. Apossible cause of such sustained interest is that many statistical procedures have beenderived based on particular distributional assumptions-especially that of normality.Although in many cases the techniques are more robust than the assumptions underlyingthem, still a knowledge that the underlying assumption is incorrect may temper the useand application of the methods. Moreover, the study of a body of data with the stimulusof a distributional test may encourage consideration of, for example, normalizing trans-formations and the use of alternate methods such as distribution-free techniques, as well asdetection of gross peculiarities such as outliers or errors.

The test procedure developed in this paper is defined and some of it s analytical propertiesdescribed in $2. Operational information and tables useful in employing the tes t are detailedin $ 3 (which may be read independently of the rest of the paper). Some examples are givenin $4. Section5 consists of an extract from an empirical sampling study of the comparison ofthe effectiveness of various alternative tests. Discussion and concluding remarks are givenin $6.

2. THE W TEST FOR NORMALITY (COMPLETE SAMPLES)2.1. Motivation and early work

This study was initiated, in pa rt, in an at tempt to summarize formally certain indicationsof probability plots. In particular, could one condense departures from statistical linearityof probability plots into one or a few 'degrees of freedom' in the manner of the applicationof analysis of variance in regression analysis?

In a probability plot, one can consider the regression of the ordered observations on theexpected values of the order statistics from a standardized version of the hypothesizeddistribution-the plot tending to be linear if the hypothesis is true. Hence a possible methodof testing the distributional assumptionis by means of an analysis of variance type procedure.Using generalized least squares (the ordered variates are correlated) linear and higher-ordermodels can be fitted and an 3'-type ratio used to evaluate the adequacy of the linear fit.

t Pa rt of this research was supported by the Office of Naval Research while both authors were a tRutgers University.

8/3/2019 Shapiro - An Analysis of Variance Test for Normality (Complete Samples) 1965

3/22

This approach was investigated in preliminary work. While some promising resultswere obtained, the procedure is subject to the serious shortcoming t ha t the selection of thehigher-order model is, practically speaking, arbitrary. However, research is continuingalong these lines.

Another analysis of variance viewpoint which has been investigated by the presentauthors is to compare the squared slope of the probability plot regression line, which underthe normality hypothesis is an estimate of the population variance multiplied by a constant,with the residual mean square about the regression line, which is another estimate of thevariance. This procedure can be used with incomplete samples and has been describedelsewhere (Shapiro & Wilk, 1965b).

As an alternative to the above, for complete samples, the squared slope may be com-pared with the usual symmetric sample sum of squares about t he mean which is independentof the ordering and easily computable. It is this last statistic that is discussed in the re-mainder of this paper.

2.2. Derivation of the W statisticLet m' = (ml,m,, ...,m,) denote the vector of expected values of standard normalorder statistics, and let V = (vii) be the corresponding n x n covariance matrix. That is, ifx, 6 x, 6 . x, denotes an ordered random sample of size n from a normal distribution withmean 0 and variance 1, then

E ( x ) ~=mi (i = 1,2, ...,n),and cov (xi, xj) = vii (i,j = 1,2, ...,n) .

Let y' = (y,, ...,y,) denote a vector of ordered random observations. The objective isto derive a test for the hypothesis that this is a sample from a normal distribution withunknown mean p and unknown variance a,.

Clearly, if the {y,} are a normal sample then yi may be expressed asy i = p + r x i ( i = 1,2,...,n).

It follows from th e generalized least-squares theorem (Aitken, 1938; Lloyd, 1952) th at t hebest linear unbiased estimates of p and a are those quantities th at minimize the quadraticform (y -p l -am )' V-l (y -p l --a m), where 1' = (1,1, ...,1). These estimates are, respec-tively, m' V-I ( ml'- lm') V-ly

'LI = 1'v-llm1v-lm- (11v-lm)21' V-l(l??a'-ml') V-lyand a = 1'8-I 1m'V-lm- (1'V-1m)2'

For symmetric distributions, 1'V-lm = 0, and henceA m' 7 - l ~= - y = , and 8 = -----n 1 m' V-lm'

Letdenote the usual symmetric unbiased estimate of (n- 1)a2.

The W test statist ic for normality is defined by

8/3/2019 Shapiro - An Analysis of Variance Test for Normality (Complete Samples) 1965

4/22

An an aly sis of variance test for no rm alit ywhere

m' V-Ia' = (a,, ...,a,) = (rn'V-1 V-lm)tandThus, b is, up to the normalizing constant C, the best linear unbiased estimate of the slopeof a linear regression of the ordered observations, y,, on the expected values, mi, of the stand-ard normal order statistics. The constant C is so defined th at the linear coefficients arenormalized.

It may be noted tha t if one is indeed sampling from a normal populatioii then the numer-ator, b2,and denominator, S2,of W are both, up to a constant, estimating the same quantity,namely a2.For non-normal populations, these quantities would not in general be estimatingthe same thing. Heuristic considerations augmented by some fairly extensive empiricalsampling results (Shapiro & Wilk, 1964~)sing populations with a wide range of andp2 values, suggest that the mean values of W for non-null distributions tends to shiftto the left of th at for the null case. Purther i t appears tha t the variance of the null dis-tribution of W tends to be smaller than tha t of the non-null distribution. It is likelythat this is due to the positive correlation between the numerator and denominator for anormal population being greater than th at for non-normal populations.

Note that the coefficients (a,) are just the normalized 'best linear unbiased' coefficientstabulated in Sarhan & Greenberg (1956).

2.3. Some analytica2 properties of WLEMMA. W is scale and origin invariantProof. This follows from the fact that for normal (more generally symmetric) distribu-tions,COROLLARY. W has a distribution which depends only on the sanzple size n, for samples

from a normal distribution.COROLLARYW is statistically independent of S2 and of 5, for samples from a normal.

distribution.Proof. This follows from the fact that y and S2are sufficient for p and a2(Hogg & Craig,

1956).COROLLARY = for any r.. E Wr Eb2r/ES2r, LEMMA. The maximum value of W is 1. Proof. Assume?j= 0 since W is origin invariant by Lemma 1. Hence

Sincebecause X a: = a'a = 1, by definition, then W is bounded by 1. This maximum is in fact

.Iachieved when yi = va,, for arbitrary 7.LEMMA. The minimum value of TV is na!/(n - 1).

8/3/2019 Shapiro - An Analysis of Variance Test for Normality (Complete Samp