
Statistics and Computing (1994) 4, 221-234

The statistics of linear models: back to basics

J. A. NELDER

Department of Mathematics, Imperial College of Science, Technology, and Medicine, London SW7 2BZ, UK

Received and accepted August 1994

Inference from the fitting of linear models is basic to statistical practice, but the development of strategies for analysis has been hindered by unnecessary complexities in the descriptions of such models. Three false steps are identified and discussed: they concern constraints on parameters, neglect of marginality constraints, and confusion between non-centrality parameters and corresponding hypotheses. Useful primitive statistical steps are discussed, and the need for strategies, rather than tactics, of analysis stressed. The implications for the development of good, fully interactive, computing software are set out, and illustrated with examples.

Keywords: Constraints, data structure, fixed effect, functional marginality, linear models, marginal homogeneity, marginality, model selection, non-centrality parameter, operand, operator, prediction, random effect.

1. Introduction

Linear models are a basic component of statistics. It is thus important that the ideas that underlie their definition and determine the inferences we may make from using them are set out as clearly and simply as possible. I believe that many expositions of linear models are not only unnecessarily complex, but that the complexities arise from confusions that have persisted in the literature for at least 30 years. In this paper I hope to clarify those areas where confusion has arisen, and to show how the subject may be developed in a straightforward manner to allow scientific inferences to be made from linear models. The implications of this approach for the development of relevant computer software will be discussed.

2. The three false steps

The three steps that have generated confusion and made for unnecessary complexity are:

1. The putting of constraints on parameters because constraints are put on their estimates.

2. Neglecting marginality relations between terms in factorial models.

3. Confusing non-centrality parameters in the expectation of sums of squares with the corresponding hypothesis that they might be used to test.

Some unnecessary constructs, like Type III sums of squares, result from combinations of the above. We consider each in turn.

3. Constraints on parameters

If we fit, for example, the model

$\eta_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$

to data from an $N_A \times N_B$ table indexed by factors A and B with indices $i$ and $j$ respectively, we get estimates $m$, $a_i$, $b_j$, and $c_{ij}$ that satisfy

$\hat{\eta}_{ij} = m + a_i + b_j + c_{ij}.$

We have thus fitted $(N_A + 1) \times (N_B + 1)$ parameters to $N_A \times N_B$ data items. To produce a particular solution to the least-squares equations we must impose $N_A + N_B + 1$ constraints on the estimates. Two possible sets of constraints are

(i) The symmetric set $a_\cdot = b_\cdot = c_{i\cdot} = c_{\cdot j} = 0$ (where the dot suffix denotes averaging), and

(ii) The GLIM-type set $a_1 = b_1 = c_{i1} = c_{1j} = 0$.


Both sets are instances of the general solution to the equations given by

$m + M,\quad a_i + A_i,\quad b_j + B_j,\quad c_{ij} - M - A_i - B_j$

for arbitrary $M$, $A_i$, and $B_j$.

A linear combination of the parameters is said to be estimable if there exists a linear combination of the $y_{ij}$'s

having that expectation. Thus

$\mu + \alpha_1 + \beta_1 + \gamma_{11} = E(y_{11})$

$\alpha_1 + \gamma_{1\cdot} - \alpha_2 - \gamma_{2\cdot} = E(y_{1\cdot} - y_{2\cdot})$

$\beta_1 + \gamma_{\cdot 1} - \beta_3 - \gamma_{\cdot 3} = E(y_{\cdot 1} - y_{\cdot 3})$

and

$\gamma_{11} - \gamma_{12} - \gamma_{21} + \gamma_{22} = E(y_{11} - y_{12} - y_{21} + y_{22})$

are all estimable, while $\mu$, $\alpha_1$, $\beta_1$, $\gamma_{11}$, $\mu + \alpha_1$, $\beta_2 + \beta_3$, and $\gamma_{11} - \gamma_{12}$ are all non-estimable.

Consider the $2 \times 2$ table with values

              B1    B2    Mean
      A1      12    14     13
      A2       4    10      7
      Mean     8    12     10

With the symmetric constraints we have $m = 10$; $a_1, a_2 = 3, -3$; $b_1, b_2 = -2, 2$; and $c_{11}, c_{12}, c_{21}, c_{22} = 1, -1, -1, 1$, while with the GLIM-type constraints we have $m = 12$; $a_1, a_2 = 0, -8$; $b_1, b_2 = 0, 2$; and $c_{11}, c_{12}, c_{21}, c_{22} = 0, 0, 0, 4$. For any estimable contrast the corresponding estimates are the same for either parametrization; e.g. $\alpha_1 + \gamma_{1\cdot} - \alpha_2 - \gamma_{2\cdot}$ is estimated by $a_1 + c_{1\cdot} - a_2 - c_{2\cdot}$, which equals $3 + 0 + 3 - 0 = 6$ with the symmetric constraints and $0 + 0 + 8 - 2 = 6$ with the GLIM constraints. By contrast, a non-estimable combination like $\mu + \alpha_1$ has 'estimates' $m + a_1 = 13$ with the symmetric constraints and 12 with the GLIM-type constraints.
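As a small check of these numbers, here is a minimal numpy sketch (my own, not part of the paper) that fits the saturated model under each set of constraints and confirms that the estimable contrast $\alpha_1 + \gamma_{1\cdot} - \alpha_2 - \gamma_{2\cdot}$ is 6 either way, while the non-estimable $\mu + \alpha_1$ depends on the constraints. The fit helper and its constraint encoding are assumptions of the sketch.

import numpy as np

# 2 x 2 table from the text: rows A1, A2; columns B1, B2
y = np.array([12.0, 14.0, 4.0, 10.0])          # y11, y12, y21, y22

def fit(constraint):
    """Exact fit of mu + a_i + b_j + c_ij under one set of constraints."""
    # Over-parametrized design: columns mu, a1, a2, b1, b2, c11, c12, c21, c22
    X = np.zeros((4, 9))
    for k, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
        X[k, 0] = 1
        X[k, 1 + i] = 1
        X[k, 3 + j] = 1
        X[k, 5 + 2 * i + j] = 1
    if constraint == "symmetric":
        # a. = b. = c_i. = c_.j = 0 (five independent constraints)
        C = np.array([
            [0, 1, 1, 0, 0, 0, 0, 0, 0],   # a1 + a2 = 0
            [0, 0, 0, 1, 1, 0, 0, 0, 0],   # b1 + b2 = 0
            [0, 0, 0, 0, 0, 1, 1, 0, 0],   # c11 + c12 = 0  (c_1. = 0)
            [0, 0, 0, 0, 0, 0, 0, 1, 1],   # c21 + c22 = 0  (c_2. = 0)
            [0, 0, 0, 0, 0, 1, 0, 1, 0],   # c11 + c21 = 0  (c_.1 = 0)
        ], float)
    else:
        # GLIM-type: a1 = b1 = c11 = c12 = c21 = 0
        C = np.zeros((5, 9))
        for r, col in enumerate([1, 3, 5, 6, 7]):
            C[r, col] = 1
    # Solve the square system: reproduce y exactly subject to the constraints
    A = np.vstack([X, C])
    rhs = np.concatenate([y, np.zeros(len(C))])
    theta, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return theta

for name in ("symmetric", "glim"):
    t = fit(name)
    m, a, c = t[0], t[1:3], t[5:].reshape(2, 2)
    contrast = a[0] + c[0].mean() - a[1] - c[1].mean()   # a1 + c1. - a2 - c2.
    print(name, " m + a1 =", round(m + a[0], 3), " contrast =", round(contrast, 3))

Running it prints 13 and 12 for the non-estimable combination but 6 for the contrast under both parametrizations, as in the worked example above.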

It is tempting to match the symmetric constraints, say, on the parameter estimates with corresponding constraints $\alpha_\cdot = \beta_\cdot = \gamma_{i\cdot} = \gamma_{\cdot j} = 0$ on the parameters themselves. This temptation should be resisted, for reasons I hope to demonstrate. Such constraints are not an intrinsic part of the model; furthermore, the only quantities having inferential meaning are the estimable quantities, and these are independent of the choice of constraints.

The confusing consequences of putting constraints on parameters show most clearly when we make the distinction between fixed and random effects in model terms. The inferential distinction is between treating the parameters in a term such as $\alpha_i$ as an unstructured set of quantities (fixed effect) or as looking like a sample from a normal distribution (random effect). If we decide to treat $\alpha_i$ as a random effect with independent components then we cannot impose the constraint $\alpha_\cdot = 0$, because the mean of independent normal variates with non-zero variance also has non-zero variance. This elementary statistical fact has not, however, prevented the authors of one paper from defining the $\alpha_i$ as independent normal variates whose sample mean was zero; they did, however, prefix their definition by the words 'we shall now be rather sloppy...'. There is, however, no need to be sloppy to make a consistent picture of models with fixed and random effects (mixed models); all we have to do is not to put constraints on the fixed parameters. We can illustrate the point with the same two-way table considered above, looking at the expected values of mean squares (EMS) in the ANOVA table. If we put symmetric constraints on the parameters in the fixed model, and no constraints in the random model, we get the EMS shown in Table 1 (the unit-level component $\sigma^2$ is omitted). Note the different patterns: in the fixed-effect table there is a single component in each EMS, while in the random-effect table all terms contain the interaction variance component $\sigma^2_{AB}$.

The dissimilarities get much worse with mixed models, giving the results shown in Table 2. Note that with constraints on the effects all four columns have a different pattern. Note also that in the first row $\sigma^2_{AB}$ appears in two models, in one of which A is fixed and in the other it is random! With more complicated models, components of variance appear and disappear in the various EMS in a most confusing way. Yet all this complication vanishes as soon as we remove constraints on the fixed effects. The EMS then contain either a $\Sigma$ or a $\sigma^2$ component according to whether the corresponding terms are fixed or random. There is complete consistency between the columns. Furthermore the interpretations of $\sigma^2$ and its corresponding $\Sigma$ are completely

Table 1. Fixed and random effects

Fixed effects (with constraints)

Term   d.f.                     EMS
A      $N_A - 1$                $N_B\Sigma_A$
B      $N_B - 1$                $N_A\Sigma_B$
A.B    $(N_A - 1)(N_B - 1)$     $\Sigma_{AB}$

Random effects (without constraints)

Term   d.f.                     EMS
A      $N_A - 1$                $\sigma^2_{AB} + N_B\sigma^2_A$
B      $N_B - 1$                $\sigma^2_{AB} + N_A\sigma^2_B$
A.B    $(N_A - 1)(N_B - 1)$     $\sigma^2_{AB}$

where $\Sigma_A = \sum_i(\alpha_i - \alpha_\cdot)^2/(N_A - 1)$, etc.


Table 2. Mixed models

EMS for models with constraints

Term   A fixed, B fixed   A fixed, B random                A random, B fixed                A random, B random
A      $N_B\Sigma_A$      $\sigma^2_{AB} + N_B\Sigma_A$    $N_B\sigma^2_A$                  $\sigma^2_{AB} + N_B\sigma^2_A$
B      $N_A\Sigma_B$      $N_A\sigma^2_B$                  $\sigma^2_{AB} + N_A\Sigma_B$    $\sigma^2_{AB} + N_A\sigma^2_B$
A.B    $\Sigma_{AB}$      $\sigma^2_{AB}$                  $\sigma^2_{AB}$                  $\sigma^2_{AB}$

EMS for models without constraints

Term   A fixed, B fixed               A fixed, B random                A random, B fixed                A random, B random
A      $\Sigma_{AB} + N_B\Sigma_A$    $\sigma^2_{AB} + N_B\Sigma_A$    $\sigma^2_{AB} + N_B\sigma^2_A$  $\sigma^2_{AB} + N_B\sigma^2_A$
B      $\Sigma_{AB} + N_A\Sigma_B$    $\sigma^2_{AB} + N_A\sigma^2_B$  $\sigma^2_{AB} + N_A\Sigma_B$    $\sigma^2_{AB} + N_A\sigma^2_B$
A.B    $\Sigma_{AB}$                  $\sigma^2_{AB}$                  $\sigma^2_{AB}$                  $\sigma^2_{AB}$

analogous; thus $\sigma^2_A$ is the excess variance between rows (the A margin) over that described by the interaction term A.B, while $\Sigma_A$ is the excess variation in the A margin over that due to A.B. One uses variance, the other sums of squares of parameters. Thus $\Sigma_A$ describes the variation in $\alpha_i - \alpha_\cdot + \gamma_{i\cdot} - \gamma_{\cdot\cdot}$, while $\Sigma_{AB}$ describes the variation in $\gamma_{ij} - \gamma_{i\cdot} - \gamma_{\cdot j} + \gamma_{\cdot\cdot}$.

3.1. Uninteresting hypotheses

In the lower half of Table 2 we can see immediately that $\Sigma_A$ describes the excess variation in the rows over that described by the interaction term $\Sigma_{AB}$. $\Sigma_A = 0$ means that the rows are no more variable than we would expect from $\Sigma_{AB}$. If constraints are imposed then $\Sigma_A = 0$ implies that, although the response to A varies with the level of B (because $\Sigma_{AB} \neq 0$), nonetheless the A marginal variation is null. Now since the presence of the interaction places no restrictions at all on how A varies as B changes (and vice versa), we ought to be very surprised if either margin were null. Why should it be? This basically uninteresting hypothesis arises as a direct consequence of putting constraints on the parameters. The point at issue is immediately clear if we replace fixed by random effects. If we fill our table with N(0, 1) quantities $e_{ij}$ then the A marginal means are N(0, $1/N_B$) variables. We do not expect them to be zero; even one exactly zero value would have probability zero (if we ignore rounding). In either model the baseline for assessing the row variation is the interaction variation, not zero. Thus, while the hypothesis of a null margin given the presence of interaction can be stated mathematically, it does not correspond to an interesting scientific hypothesis. I assert that such hypotheses are of no inferential interest.

4. Marginality

Let (AB) be the parameter space of our two-way table, implying one parameter per cell. The standard ANOVA identity splits (AB) into main effects A and B, and interaction A.B. Each is a subspace of (AB) and inferentially these subspaces are structured; we say that A and B are marginal to A.B. The meaning of A.B depends on the marginal terms preceding it. If none precede it then A.B ≡ (AB); if one (A, say) precedes it then A.B = B within A; while if both precede it then A.B = interaction of A and B. Note that if we fit A.B first in a sequential model then SS(A | A.B) = SS(A | B, A.B) = SS(B | A.B) = SS(B | A, A.B) = 0. This is because fitting A.B first is equivalent to fitting a parameter for each cell, so that no variation remains to be accounted for. However, the literature is full of sums of squares like SS(A | B, A.B) which are non-zero. This has happened because $\gamma_{ij}$ has been fitted subject to the parameter constraints $\gamma_{i\cdot} = \gamma_{\cdot j} = 0$. The resulting sum of squares, if used in a test, tests whether $\alpha_i + \gamma_{i\cdot}$ is null when $\gamma_{ij}$ is non-null, i.e. one of the uninteresting hypotheses.
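The zero sums of squares are easy to verify numerically; the following numpy sketch (mine, with made-up data) fits one indicator column per cell, i.e. A.B first, and shows that adding the A main-effect columns recovers no further variation.

import numpy as np

rng = np.random.default_rng(0)
NA, NB, reps = 3, 4, 2
A = np.repeat(np.arange(NA), NB * reps)              # level of A for each observation
B = np.tile(np.repeat(np.arange(NB), reps), NA)      # level of B for each observation
y = rng.normal(size=A.size) + A + 0.5 * B + 0.3 * A * B

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(((y - X @ beta) ** 2).sum())

cells = np.eye(NA * NB)[A * NB + B]      # fitting A.B first: one indicator column per cell
a_cols = np.eye(NA)[A]                   # indicator columns for the main effect A
print(rss(cells, y) - rss(np.hstack([cells, a_cols]), y))   # SS(A | A.B) = 0, up to rounding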

The marginality relations between the terms in a factorial model define the possible sequences of terms in a least-squares fit. Thus for our $A \times B$ table, there are only two valid sequences: A + B + A.B and B + A + A.B. These are to be interpreted as

A ign. B + B elim. A + A.B elim. (A and B)

and

B ign. A + A elim. B + A.B elim. (A and B)

where ign. and elim. stand for 'ignoring' and 'eliminating'. Note that when A and B are orthogonal we have A ign. B = A elim. B and B ign. A = B elim. A, so that only one sequence needs to be computed.

With unbalanced data the number of valid sequences increases rapidly; thus with three factors A, B, C, and given that we need to fit main effects before their two-factor interactions and the three-factor interaction after the three two-factor ones, we find 48 possible sequences that respect the marginality relations. Two possible ones are

A + B + A.B + C + A.C + B.C + A.B.C

A + B + C + A.B + A.C + B.C + A.B.C.
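The count of 48 can be checked by brute force; the sketch below (my own) enumerates all orderings of the seven terms and keeps those in which every term is preceded by the terms marginal to it.

from itertools import permutations

terms = ["A", "B", "C", "A.B", "A.C", "B.C", "A.B.C"]
marginal_to = {t: [m for m in terms
                   if m != t and set(m.split(".")) < set(t.split("."))]
               for t in terms}                       # terms that must precede t

count = 0
for seq in permutations(terms):
    pos = {t: i for i, t in enumerate(seq)}
    if all(pos[m] < pos[t] for t in terms for m in marginal_to[t]):
        count += 1
print(count)                                         # prints 48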

We discuss in Section 4.4 the false inferences that can arise when marginality relations are ignored.

4.1. Functional marginality with continuous terms

When we fit sequences of quantitative terms such as $x_1$, $x_2$, $x_1x_2$, $x_1^2$, $x_2^2$, ..., we have to ask which sequences make sense. If we fit $x_1$ without an intercept, then the response must go through the origin, i.e. zero must be a special point on the x-scale where y is zero. Similarly, if $x_1^2$ is fitted without an $x_1$ term then the turning-point must occur at the origin (not impossible, but very unlikely). For if $x_1$ might just as well be $x_1 - a$, then $(x_1 - a)^2 = x_1^2 - 2ax_1 + a^2$ and the linear term reappears. Both terms must be fitted in the order $x_1$, then $x_1^2$, and we say that $x_1$ is f-marginal to $x_1^2$. With two continuous variables $x_1$ and $x_2$, new effects arise: if $x_1x_2$ is fitted without $x_1$ and $x_2$ then the response surface must be centred on a col (saddlepoint) for the process to make sense. Figure 1 shows such a surface. In general there is no reason to expect such a centring to occur, so $x_1$ and $x_2$ must be fitted before $x_1x_2$. Quadratic terms $x_1^2$ and $x_2^2$ can be fitted without $x_1x_2$, but only if the response surface is known to be aligned with the axes.

Fig. 1. A saddlepoint on a two-dimensional surface
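A small numpy sketch (mine, with arbitrary data) of why $x_1$ is f-marginal to $x_1^2$: the span of $\{1, x_1, x_1^2\}$ is unchanged by a shift of origin $x_1 \mapsto x_1 - a$, whereas the span of $\{1, x_1^2\}$ is not, so the latter model quietly treats zero as a special point.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 3, size=50)
a = 1.7                                    # an arbitrary change of origin
y = rng.normal(size=x.size)

def fitted(cols, y):
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

full = fitted([np.ones_like(x), x, x ** 2], y)
full_shift = fitted([np.ones_like(x), x - a, (x - a) ** 2], y)
quad = fitted([np.ones_like(x), x ** 2], y)
quad_shift = fitted([np.ones_like(x), (x - a) ** 2], y)
print(np.allclose(full, full_shift))       # True: same fitted values after the shift
print(np.allclose(quad, quad_shift))       # False: the quadratic-only model depends on the origin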

4.2. Marginality in mixed terms

Mixed terms involve a combination of categorical and continuous variables. Let A be a factor (categorical variable) with index $i$ and X a continuous variable. In the model 1 + A + X + A.X, where 1 denotes the intercept term, 1 is marginal to A and X is marginal to A.X; 1 is f-marginal to X, as discussed above, but what is the relation of A to A.X (that is, of separate intercepts to separate slopes)? Figure 2 shows two relevant patterns: the left-hand one denotes a slope-ratio assay, where the common intercept is the zero dose of each compound, a priori the same, so that the model is 1 + A.X; the right-hand diagram shows a covariance adjustment to a standard value $x_0$ of x, the individual slopes being again different. The difference between the two models is that in the second $x_0$ is not a special point on the scale, so that there is no prior reason to expect the yields to be the same there. The hypothesis of equal yields is an uninteresting one, so the fitted model is A + A.X. The conclusion is that A is f-marginal to A.X.

Fig. 2. Marginality relations between intercepts and slopes (left: slope-ratio assay, model 1 + A.X; right: covariance with unequal slopes adjusted to $x_0$, model A + A.X)
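For concreteness, here is a sketch (my own, with a two-level factor and made-up covariate values) of the design matrices implied by the two models in Fig. 2: a common intercept with separate slopes (1 + A.X) versus separate intercepts and separate slopes (A + A.X).

import numpy as np

x = np.array([0.5, 1.0, 1.5, 0.5, 1.0, 1.5])     # continuous covariate
A = np.array([0, 0, 0, 1, 1, 1])                 # level of the factor A

ones = np.ones_like(x)
slopes = np.column_stack([(A == k) * x for k in (0, 1)])        # the A.X columns
intercepts = np.column_stack([(A == k) * 1.0 for k in (0, 1)])  # the A columns

X_slope_ratio = np.column_stack([ones, slopes])       # model 1 + A.X: common intercept, separate slopes
X_covariance = np.column_stack([intercepts, slopes])  # model A + A.X: separate intercepts and slopes
print(X_slope_ratio)
print(X_covariance)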

4.3. Cell-mean models and marginality

Cell-mean models use (in our two-way example) the $\mu_{ij}$ with constraints to produce sub-models, instead of separate additive components like $\alpha_i + \beta_j + \gamma_{ij}$. For example, the model

$\mu_{ij} = \mu + \alpha_i + \beta_j$

is equivalent to the cell-means model with constraints

$\mu_{ij} - \mu_{i'j} - \mu_{ij'} + \mu_{i'j'} = 0.$   (4.1)

This is the model of no interaction. If we add to (4.1) the further constraints

$\mu_{1\cdot} = \mu_{2\cdot} = \cdots$   (4.2)

then the model is equivalent to

$\mu_{ij} = \mu + \beta_j.$   (4.3)

However, if the cell-means model (4.2) is asserted without (4.1) then the marginality rules are broken, for we are postulating an interaction with a null margin.

Two points should be made about cell-mean models. The first is that to every cell-mean model there is a corresponding unconstrained model, so that cell-mean models do not give the analyst anything new; they just express the same things in different ways. The second point is that it is more difficult to see if marginality constraints are being broken in a cell-mean specification, which may account for the fact that such models frequently appear in the literature.

4.4. False inferences after neglecting marginality: an example

A famous data set occurring in the context of quality improvement is the truck-spring data (Pignatiello and Ramberg, 1985). There were five factors, B, C, D, E, and O, and both mean and variance (in the form $\log(s_e)$) were being analysed. The technique employed was to do a half-normal plot of the effects and pick out the large ones as being important. These turned out to be (with their aliases) B, DO, BCD ≡ DEO, CD ≡ BE, and CDO ≡ BEO. Notice the complete neglect of marginality: we have DO without the corresponding main effects D and O, CD without C and D, etc. Since large interactions with small main effects, while not impossible, are usually a sign of accidentally large contrasts, this procedure is likely to pick such accidentally large contrasts, and hence to give a false inference. That this is so here can be confirmed by a fitting of main effects only; this shows that only B and C produce effects, and that there is no significant interaction. Thus the neglect of marginality has led to a false inference. See McCullagh and Nelder (1989) for a detailed analysis of this data set. Similar considerations apply when we are looking for contrasts to use for error; here the danger is of rejecting accidentally large contrasts instead of accepting them, so making the error estimate too small. Thus we should not reject A.B.C, say, as an error component if it is large but A.B, A.C, and B.C are all small.

5. Hypotheses and non-centrality parameters

We use again the two-way table example with full model $\mu + \alpha_i + \beta_j + \gamma_{ij}$. Consider the form of the non-centrality parameter (N-CP) for the sum of squares of $\alpha \mid \mu$, that is for $\alpha_i$ ignoring $\beta_j$ and $\gamma_{ij}$. Searle (1987) says that the hypothesis is that

$\alpha_i + \frac{1}{n_{i\cdot}} \sum_j n_{ij}(\beta_j + \gamma_{ij})$   (5.1)

is the same for all $i$, because if so this would make the N-CP zero. He objects to the fact that the N-CP involves the $n_{ij}$. I contend that the objection to the hypothesis expressed by (5.1) is rather that it involves a mixture of parameter sets, not the presence of the $n_{ij}$. Their presence simply reflects the differing amounts of information about the parameters in the unbalanced data set. Compare (5.1) with the N-CP arising from the SS for $\alpha \mid \mu, \beta$ assuming no interaction. The EMS is a function of $\alpha_i$, $n_{ij}$, and $\sigma^2$, but the corresponding hypothesis is still $\alpha_i = 0$. The $n_{ij}$ reflect the amount of information in the data about the various contrasts in the $\alpha_i$. By contrast the Type III SS used in SAS (SAS Institute Inc., 1985) uses a quadratic form whose N-CP is symmetric in the $\alpha_i$; this SS (i) loses power if used in any test, and (ii) is obtained by constraining the $\gamma_{ij}$ margins to be zero. It thus corresponds to an uninteresting hypothesis.

For examples of the complications that can arise from the neglect of marginality in interpreting N-CPs, we look at some models set out by Hocking and Speed (1975). For a set of four unstructured treatments with means $\mu_1, \mu_2, \mu_3, \mu_4$ and treatment totals $T_1, T_2, T_3, T_4$ based on $n_1, n_2, n_3, n_4$ observations, Hocking and Speed consider the following hypotheses:

$\frac{n_1\mu_1 + n_2\mu_2}{n_1 + n_2} = \frac{n_3\mu_3 + n_4\mu_4}{n_3 + n_4},$   (5.2)

$\mu_1 + \mu_2 = \mu_3 + \mu_4,$   (5.3)

$\mu_1 = \mu_3 \mid (\mu_1 = \mu_2 \text{ and } \mu_3 = \mu_4).$   (5.4)

All these hypotheses imply that the four treatments are being regarded as falling into two sets, (1, 2) and (3, 4).

Now (5.2) and (5.3) are between-group hypotheses without the assumption of within-group homogeneity; they therefore break the marginality constraints. By contrast (5.4) respects marginality and leads to the least squares estimate

$\frac{T_1 + T_2}{n_1 + n_2} - \frac{T_3 + T_4}{n_3 + n_4}.$

What could be simpler? Their next example is of a nested treatment structure A/B

with model $\mu_{ij} = \mu + \alpha_i + \beta_{ij}$. They compare

$H_A: \alpha_i + \beta_{i\cdot}$ constant

with

$H_A^*: \alpha_i + \sum_j \frac{n_{ij}}{n_{i\cdot}}\beta_{ij}$ constant.

Note that both hypotheses are uninteresting in the sense of Section 3 unless $\beta_{ij} = 0$, when they become identical. Thus again the apparent complexities vanish when marginality relations are recognized. Finally they consider a model structure A*B, with cell-mean models and hypotheses

$H_A: \mu_{i\cdot}$ constant,

$H_A': \sum_j n_{ij}\mu_{ij}/n_{i\cdot}$ constant (A ignoring B),

$H_A'': \sum_j n_{ij}\Bigl(1 - \frac{n_{ij}}{n_{\cdot j}}\Bigr)\mu_{ij} - \sum_{i' \neq i}\sum_j \frac{n_{ij}n_{i'j}}{n_{\cdot j}}\mu_{i'j} = 0$ (A elim. B and A.B).


All these hypotheses break the marginality constraints and so can be discarded as being without interest. Hocking and Speed say that 'the SS for B eliminating A is very difficult, if not impossible to interpret'. If A.B is non-null we have another of the class of uninteresting hypotheses; however, if A.B is zero, it is a test for the size of $\beta_j$ in the presence of possible $\alpha_i$ effects and has an entirely straightforward interpretation. Again the apparent complexities vanish.

6. More uninteresting hypotheses

In this section I discuss two more instances of uninteresting hypotheses, one famous one that dates from 1935, and the other the hypothesis of marginal homogeneity, which has arisen with log-linear models.

6.1. Unit-treatment additivity

Consider a completely randomized experiment with $t$ treatments applied to $k$ plots each, i.e. there are $n = kt$ plots in all. The randomization argument requires us to consider the hypothetical set of $n \times t$ yields that would arise if we could apply each of the $t$ treatments to each of the $n$ plots. Let the yield from putting treatment $j$ on plot $i$ be $y_{ij}$. Then the assumption of unit-treatment additivity means that $y_{ij} = e_i + t_j$, while non-additivity means that $y_{ij} = e_{ij} + t_j$. Neyman et al. (1935) proposed the hypothesis that average treatment effects over all plots would be null when unit-treatment non-additivity occurred. This means that the hypothetical unit-treatment table would show interaction but have a null margin, an uninteresting hypothesis. On the basis of similar reasoning Neyman asserted that the standard errors derived from the standard analysis for a Latin square would be biased in the presence of unit-treatment interaction. This claim was repeated in the 1950s. Cox (1958) showed that the error is not biased for the right null hypothesis, i.e. that the treatment margin is no more variable than expected from the interior of the table. For further development of this point of view, see Nelder (1977).

6.2. Marginal homogeneity in contingency tables

Consider a two-way table of counts indexed by two 'interchangeable' factors, for example $A_1$ and $A_2$, the party supported by a voter in elections 1 and 2. We use a log-linear model in which $\eta_{ij} = \log(\mu_{ij})$ and $\eta_{ij} = \eta_0 + \alpha_{1i} + \alpha_{2j} + \gamma_{ij}$ is the linear predictor for cell $(i, j)$. The following four models are all of possible interest; they are hierarchical and no marginality constraints are broken in their formulation. Note that $\gamma_{(ij)}$ means that $\gamma_{ij} = \gamma_{ji}$.

Model                                                       Name
$\eta_0 + \alpha_{1i} + \alpha_{2j} + \gamma_{ij}$          general asymmetric
$\eta_0 + \alpha_{1i} + \alpha_{2j} + \gamma_{(ij)}$        quasi-symmetric
$\eta_0 + \alpha_i + \alpha_j + \gamma_{(ij)}$              symmetric
$\eta_0 + \alpha_i + \alpha_j$                              additive symmetric

For the marginal homogeneity model we restrict the table to be $3 \times 3$ for simplicity and put $\pi_{ij}$ for the expected fractions in each cell. Only the off-diagonal elements are required for the definition of marginal homogeneity, which can be written

$\pi_{12} + \pi_{13} = \pi_{21} + \pi_{31}$

$\pi_{21} + \pi_{23} = \pi_{12} + \pi_{32}$

and hence

$\pi_{31} + \pi_{32} = \pi_{13} + \pi_{23}.$

We can represent the table of $\pi$s as a symmetric component plus a residual as follows:

$\begin{pmatrix} * & \pi_{12} & \pi_{13} \\ \pi_{21} & * & \pi_{23} \\ \pi_{31} & \pi_{32} & * \end{pmatrix} = \begin{pmatrix} * & \pi^*_{12} & \pi^*_{13} \\ \pi^*_{12} & * & \pi^*_{23} \\ \pi^*_{13} & \pi^*_{23} & * \end{pmatrix} + \begin{pmatrix} * & \theta & -\theta \\ -\theta & * & \theta \\ \theta & -\theta & * \end{pmatrix}$

where $\pi^*_{12} = \frac{1}{2}(\pi_{12} + \pi_{21})$ etc. and $\theta = \frac{1}{2}(\pi_{12} - \pi_{21}) = \frac{1}{2}(\pi_{31} - \pi_{13})$ etc.
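The decomposition is easy to compute; the numpy sketch below (mine, with hypothetical fractions chosen to satisfy marginal homogeneity) forms the symmetric component and the residual, which then shows the ±θ pattern.

import numpy as np

pi = np.array([[0.00, 0.10, 0.05],
               [0.06, 0.00, 0.12],
               [0.09, 0.08, 0.00]])      # hypothetical off-diagonal fractions,
                                         # chosen so the margins are homogeneous

sym = (pi + pi.T) / 2                    # pi*_{ij} = (pi_ij + pi_ji) / 2
resid = pi - sym                         # the residual table
print(resid)                             # the +/- theta pattern, here theta = 0.02
print(pi.sum(axis=1) - pi.sum(axis=0))   # row minus column margins: all zero for these values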

The residual table shows interaction with null marginal effects, and so corresponds to an uninteresting hypothesis. It is interesting to note that as the size of the table increases, the d.f. associated with the marginal-homogeneity hypothesis do not occupy a fixed place in the set of interesting hypotheses listed above. Note also that for linear models, as distinct from log-linear models, marginal homogeneity = symmetry, so that the difficulty does not arise.

7. Model selection and prediction: two distinct processes

Model selection is a well-understood process, in which we select from a class of models a subset of parsimonious models, which combine the necessarily conflicting requirements of giving a good fit while having a small number of parameters. We can formalize this process as replacing the original data y by parameter estimates $\hat\beta$ together with their sampling covariances. From the $\hat\beta$ we can construct fitted values $\hat\mu$, and the model-selection process can be regarded as a smoothing process that replaces the rough data y by the smooth fitted values $\hat\mu$.

The prediction phase involves the formation of summarizing quantities which sum up the information that the experimenter wants from the analysis. These summarizing quantities answer 'what-if' questions. As an example, consider a two-way table of income classified by age-group and region. Let the sample size in cell $(i, j)$ be $n_{ij}$, and let $y_{ijk}$ denote the income of the kth individual in age-group $i$ and region $j$. We fit a suitable model to $y_{ijk}$, taking account of the sample sizes $n_{ij}$, which yields fitted values $\hat\mu_{ij}$. Consider now the 'what-if' question: 'what would the mean income be in each region if its age-distribution were that of the country as a whole?' This kind of question requires a form of standardization in which we take the age distribution for the country as a whole (obtained from census data, for example) and then weight the $\hat\mu_{ij}$ accordingly. Let the country-wide age distribution be given by $N_i$; we form a predictive margin given by $\sum_i N_i\hat\mu_{ij}/\sum_i N_i$, which is the summarizing quantity required to answer the 'what-if' question. In addition, we can provide a covariance matrix for the elements in this predictive margin. See Lane and Nelder (1982) for a general account of this kind of prediction.
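A predictive margin of this kind is just a weighted average of fitted values; the following sketch (my own, with made-up fitted values and weights) computes $\sum_i N_i\hat\mu_{ij}/\sum_i N_i$ for each region.

import numpy as np

mu_hat = np.array([[21.0, 19.5, 23.0],
                   [28.0, 26.0, 30.5],
                   [33.0, 30.0, 35.0]])     # fitted mean income: 3 age-groups x 3 regions (hypothetical)
N = np.array([4.0, 3.0, 2.0])               # country-wide numbers in each age-group (hypothetical)

margin = (N[:, None] * mu_hat).sum(axis=0) / N.sum()   # sum_i N_i mu_hat_ij / sum_i N_i
print(margin)                                           # one standardized mean per region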

Two points should be made about structures like predictive margins. The first is that hypothesis-testing does not arise at the prediction stage; what we need are estimates of the summarizing quantities and estimates of their uncertainty. Thus we are not involved in testing the hypothesis that $\sum_i N_i\hat\mu_{ij}/\sum_i N_i$ is independent of $j$. The second point is that the sum of squares corresponding to a predictive margin, while calculable, is rarely of interest; sums of squares are mainly used in F-tests, but here hypothesis testing is not relevant.

Failure to distinguish between the model-selection and prediction stages lies behind an attempt to justify Type III sums of squares by appealing to the fact that when the weights used in forming the predictive margins are the internal ones ($n_{ij}$), the Type III sum of squares coincides with Yates's weighted sum of squares of means (Yates, 1934). Now, in the days of hand computing the least-squares fit of an additive model to an unbalanced table was a major task, whereas the weighted sum of squares of means was relatively easy to compute. The latter can be regarded, therefore, as a quick and dirty way of assessing the contribution to the variation of the appropriate factor. Now that the efficient least-squares fit can be obtained easily it ceases to have any inferential interest.

Table 3. Batch-type output for ANOVA

Dependent variable: response

Source            d.f.   Sum of squares   Mean square    F value   Pr > F
Model               22    1107.68560606    50.3493473      22.01   0.0001
Error               21      48.04166667     2.28769841
Corrected total     43    1155.72727273

R-square 0.958432   CV 15.4769   ROOT MSE 1.51251394   RESPONSE MEAN 9.77272727

Source       d.f.      Type I SS   F value   Pr > F     d.f.     Type II SS   F value   Pr > F
A               1    24.81893939     10.85   0.0035        1    30.00000000     13.11   0.0016
B               1    27.07500000     11.84   0.0025        1    27.07500000     11.84   0.0025
A*B             1    45.37500000     19.83   0.0002        1    45.37500000     19.83   0.0002
SUBJ(A*B)       7    56.45833333      3.53   0.0116        7    56.45833333      3.53   0.0116
C               3   921.54545455    134.28   0.0001        3   921.54545455    134.28   0.0001
A*C             3     6.66287879      0.97   0.4251        3     8.09629630      1.18   0.3413
B*C             3    15.92129630      2.32   0.1046        3    15.92129630      2.32   0.1046
A*B*C           3     9.82870370      1.43   0.2616        3     9.82870370      1.43   0.2616

Source       d.f.    Type III SS   F value   Pr > F     d.f.     Type IV SS   F value   Pr > F
A               1    23.81332599     10.41   0.0041        1    35.68870656     15.60   0.0007
B               1    21.22477974      9.28   0.0062        1    34.18935006     14.94   0.0009
A*B             1    45.37500000     19.83   0.0002        1    45.37500000     19.83   0.0002
SUBJ(A*B)       7    56.45833333      3.53   0.0116        7    56.45833333      3.53   0.0116
C               3   907.12500000    132.17   0.0001        3   907.12500000    132.17   0.0001
A*C             3     6.34722222      0.92   0.4460        3     6.34722222      0.92   0.4460
B*C             3    13.27314815      1.93   0.1550        3    13.27314815      1.93   0.1550
A*B*C           3     9.82870370      1.43   0.2616        3     9.82870370      1.43   0.2616

Tests of hypotheses using the Type I MS for SUBJ(A*B) as an error term

Source       d.f.      Type I SS   F value   Pr > F
A               1    24.81893939      3.08   0.1228
B               1    27.07500000      3.36   0.1096
A*B             1    45.37500000      5.63   0.0495

Table 4. Interactive analysis of a 2^6 table

units [64]                             "set length of vectors in data matrix"
read y, n                              "mean no. of children (without decimal), and sample no."
 94 15  74  5 129 14  97  3 109 35  83  6 128  9 105 15
107 14 129  8 109 11 113  7 129 29  87 15 143 10 122  6
103 35  83 10  89 15  94 14 106 24  71  7  94 14  76 25
 98 20  67 37  98 21  71 49  98 29 103 28 112 13  88 12
101 18 100  9  83 16  90 12 121 31 108 14 106 14 110 14
145  6 110  8 128  9  99  8 125 15 132 18 120  2 110  3
104 34  76 15  84 16  86 17  90 22 109 14  99  9  86 26
 98 12  86 22  96 17  86 29 113 27  99 30  90  4  84 10 :
"Data from Andrews and Herzberg (1985), p. 375"

Summary produced by 'read' statement:

 Identifier    Minimum      Mean   Maximum   Values   Missing
          y       67.0     101.3     145.0       64         0
          n       2.00     16.50     49.00       64         0

calc y=y/10                            "restore decimal point"
factor [lev=2] a,b,c,d,e,f             "set up indexing factors, each with 2 levels"
generate a,b,c,d,e,f                   "generate values in standard order"
formula [a*b*c*d*e*f] maxm             "store maximal model as a formula"
model [weight=n] y
terms #maxm                            "sets up maximal model as contents of maxm"

"fit main effects, then 2-factor interactions, then 3-factor interactions,
 suppressing printing"
fit [print=*; factorial=1; pool=yes] #maxm
add [print=*; factorial=2; pool=yes] #maxm
add [print=*; factorial=3; pool=yes] #maxm

"display the accumulated analysis of variance so far"
rdisplay [print=acc]

***** Regression Analysis *****

*** Accumulated analysis of variance ***

Change                                     d.f.       s.s.      m.s.     v.r.
+ a + b + c + d + e + f                       6    1866.35    311.06    14.63
+ a.b + a.c + b.c + a.d + b.d + c.d
  + a.e + b.e + c.e + d.e + a.f + b.f
  + c.f + d.f + e.f                          15     413.61     27.57     1.30
+ a.b.c + a.b.d + a.c.d + b.c.d
  + a.b.e + a.c.e + b.c.e + a.d.e
  + b.d.e + c.d.e + a.b.f + a.c.f
  + b.c.f + a.d.f + b.d.f + c.d.f
  + a.e.f + b.e.f + c.e.f + d.e.f            20     368.96     18.45     0.87
Residual                                     22     467.65     21.26
Total                                        63    3116.57     49.47

"no sign of any 3-factor effects, so refit up to two-factor level,
 and look at t-values for individual terms"
fit [print=estimates; fact=2] #maxm

***** Regression Analysis *****

*** Estimates of regression coefficients ***

              estimate      s.e.        t
Constant        10.294     0.671    15.35
a 2              0.082     0.692     0.12
b 2             -0.401     0.698    -0.57
c 2              1.067     0.714     1.50
d 2              0.756     0.676     1.12
e 2              0.913     0.732     1.25
f 2             -1.948     0.812    -2.40
a 2 .b 2        -0.088     0.592    -0.15
a 2 .c 2         0.543     0.573     0.95
b 2 .c 2        -1.336     0.595    -2.24
a 2 .d 2         0.134     0.572     0.23
b 2 .d 2        -0.500     0.593    -0.84
c 2 .d 2         0.563     0.592     0.95
a 2 .e 2        -1.245     0.584    -2.13
b 2 .e 2        -0.920     0.609    -1.51
c 2 .e 2         0.106     0.608     0.17
d 2 .e 2        -0.129     0.594    -0.22
a 2 .f 2         1.162     0.582     2.00
b 2 .f 2        -0.110     0.603    -0.18
c 2 .f 2        -0.259     0.594    -0.44
d 2 .f 2         0.323     0.587     0.55
e 2 .f 2         0.176     0.592     0.30

"the terms with t >= 2 are b.c, a.e and a.f; try the effect of
 including these with main effects"
fit a+b+c+d+e+f+b.c+a.e+a.f

***** Regression Analysis *****

Response variate: y
Weight variate: n
Fitted terms: Constant + a + b + c + d + e + f + b.c + a.e + a.f

*** Summary of analysis ***

              d.f.      s.s.      m.s.     v.r.
Regression       9    2173.1    241.45    13.82
Residual        54     943.5     17.47
Total           63    3116.6     49.47
Change          -9   -2173.1    241.45    13.82

Percentage variance accounted for 64.7

* MESSAGE: The following units have large standardized residuals:
    30    2.83

*** Estimates of regression coefficients ***

              estimate      s.e.        t
Constant        10.411     0.402    25.91
a 2              0.357     0.401     0.89
b 2             -1.094     0.370    -2.96
c 2              1.538     0.428     3.60
d 2              0.878     0.263     3.34
e 2              0.396     0.366     1.08
f 2             -2.009     0.370    -5.43
b 2 .c 2        -1.380     0.537    -2.57
a 2 .e 2        -1.296     0.530    -2.45
a 2 .f 2         1.272     0.525     2.42

stop

"The data refer to mean counts, so that an analysis using log-linear models
 might be better.  The only change needing to be made is to add to the
 'model' statement the extra option distribution=poisson.  The rest of the
 statements remain unchanged."


8. The implications for computing

In the days of batch processing, statistical computing worked in a read-calculate-print-stop mode. Output, while to some extent controllable by the user, was nonetheless stereotyped. Table 3 shows the sort of output produced for a regression (ANOVA) analysis. There are too many digits (perhaps 40% of them are noise), the Type III and Type IV SS are of little inferential interest for the reasons described above, and there is a single sequential fit (Type I) depending on the order in which the user specified the terms. When the data are unbalanced there can be no guarantee that the user's choice is inferentially the most informative.

Given current computing environments there is no reason why users should not be given fully interactive software to aid the making of sound inferences. Why then has the batch-mode output survived as long as it has? One reason, I believe, is precisely its stereotyped format; it appears to be delivering the analysis, thus freeing the user from the need to think. Not everyone welcomes the opportunity to guide the analysis, but statisticians should be encouraging this process. Table 4 shows an interactive analysis, using Genstat (Payne et al., 1987), of a six-way table obtained from Andrews and Herzberg (1985). Note how a good set of primitive operations can be combined and driven by the intermediate results. In particular, note how resetting a single switch changes the model class from linear to log-linear. In this example the model was found using just 14 statements (commands), and these included getting the data. The interpretation of the model and the nature of the interactions require further study, the details of which are omitted here.

8.1. Strategies, not tactics

What is largely lacking from statistical textbooks are strategies for finding parsimonious models; what we get are elements of tactics (F-tests and the like), rather than strategies. In the ESPRIT project GLIMPSE (Wolstenholme et al., 1988) we built a partial strategy for finding parsimonious main-effect models for the wider class of generalized linear models (McCullagh and Nelder, 1989), using GLIM 3.77 (Payne, 1986) as the algorithmic engine.

The strategy was incorporated into the model-selection module of GLIMPSE, a knowledge-based front end for GLIM. The explanatory variables could be categorical (factors) or continuous (variates) and the assumed model class was that of generalized linear models. Checks that this assumption was reasonable were built in. The user begins by specifying a minimal model containing terms known a priori to be necessary. The remaining possible terms are known as free terms, meaning that their status is not fixed; they may be included in or excluded from the final model or models.

The model obtained by adding all the free terms to the current model is called the maximal model. Initially the current model is set to the minimal model. The basic loop of the strategy takes the following form:

for each free term:
    form the F-statistic F1 for adding the free term to the current model
    form the F-statistic F2 for subtracting the free term from the maximal model
repeat

In the next step the pairs (F1, F2) are classified into one of four categories, with the following actions:

F1       F2       Action
small    small    remove from possible models
small    large    retain as free terms
large    small    retain as free terms
large    large    add to current model

The basic loop is then repeated and the process continues until the set of free terms does not change. At this stage there are two possibilities: either the set of free terms is null or it is not. If the set is null a unique model has been found: this happened with the nuclear-power-station data given by Cox and Snell (1981). If the set is non-null each remaining free term is added to the current model in turn, forming a branch of the model tree, and the basic loop is repeated. Further branching may then take place, and the result is a tree of possible models. Trees are likely to appear when highly-correlated explanatory variables occur. The user who is faced with a tree must then recognize that there is no unique solution to the problem, and that other external information must be used to choose between models. Note that the definition of 'large' and 'small' in the interpretation of the F-values must be quantified. We used values of about 3, but obviously the process could be repeated with different values to gain insight into the fitting process.
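The basic loop is simple to prototype. The Python sketch below (my own, not GLIMPSE code) computes, for each free term, F1 for adding it to the current model and F2 for dropping it from the maximal model, and classifies the pair with a threshold of about 3, as suggested above. The data, helper functions and threshold are assumptions of the sketch.

import numpy as np

def fit_rss(cols, y):
    """Residual sum of squares and number of columns for a least-squares fit."""
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(((y - X @ beta) ** 2).sum()), X.shape[1]

def f_change(small_cols, big_cols, y):
    """F statistic for the columns present in big_cols but not in small_cols."""
    rss0, p0 = fit_rss(small_cols, y)
    rss1, p1 = fit_rss(big_cols, y)
    return ((rss0 - rss1) / (p1 - p0)) / (rss1 / (len(y) - p1))

def one_pass(current, free, y, threshold=3.0):
    """One pass of the basic loop: classify every free term."""
    maximal = dict(current, **free)
    decisions = {}
    for name, col in free.items():
        others = [c for n, c in maximal.items() if n != name]
        F1 = f_change(list(current.values()), list(current.values()) + [col], y)
        F2 = f_change(others, list(maximal.values()), y)
        big1, big2 = F1 >= threshold, F2 >= threshold
        if big1 and big2:
            decisions[name] = "add to current model"
        elif not big1 and not big2:
            decisions[name] = "remove from possible models"
        else:
            decisions[name] = "retain as free term"
    return decisions

# Hypothetical data: x1 matters, x2 is noise; the minimal model is the intercept.
rng = np.random.default_rng(3)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)
print(one_pass({"1": np.ones(n)}, {"x1": x1, "x2": x2}, y))

With these made-up data a single pass adds x1 to the current model and removes x2 from further consideration, after which the loop would terminate with a unique model.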

A complete strategy would need to deal with interaction terms, and this introduces difficulties; for example the maximal model may become unwieldy. Also the selection of interaction terms must take account of which main effects were accepted at the first stage, and marginality considerations may require us to reinstate B, for example, if A but not B is chosen at the first stage but A. B is found to be needed at the second.


There remains much to be done in developing, testing and comparing strategies; it is by no means obvious how this is to be done, and maybe we shall advance by Darwinian selection of competing strategies based on their fitness as inferential procedures. For this, however, we must agree on how to measure fitness.

8.2. Computing primitives

To implement strategies such as that discussed above we need a well-structured set of primitive operations combined with an interactive language for joining them together. The language needs a procedure-defining facility, together with a notation for model terms, and so on.

As an example, Genstat provides the sort of environment required. It supports the specification of GLMs, it has the model formula as a basic item in the language, and it has basic fitting operations such as FIT, ADD (a term), DROP (a term), SWITCH (between terms), and TRY (the effects of adding terms). Output from all these can be saved as instances of standard structures, allowing, for example, the formation and subsequent manipulation of F-values. The importance of this facility cannot be over-stressed, and, if it is to be provided, certain consequences follow concerning the basic data structures that the system must support. For example, the tables of means produced by the analysis of factorial experiments, with their accompanying standard errors, constitute for the experimenter the principal summary of the experiment. Yet one finds widely used systems where it seems to be supposed that the ANOVA table is all that is required. Tables of means, if they are to be subsequently processed, require the multi-way table to be a basic data structure for the system. The association of tables of means with their standard errors, whose number and type depend on the block structure of the design, requires a general data structure, the vital element of which is the pointer vector, whose elements are the identifiers of other structures. Genstat contains several system structures, built with pointers, in which the output from different analyses can be stored. Such structures can be named, assigned values, and in turn drive system directives for extracting their contents (such as ADISPLAY for ANOVA), while at the same time their components are also directly accessible to users through the use of pointer references built into the language.

This section ends with two examples that arose in comparisons I have made between Genstat and other systems. In both cases the examples came from the other systems, so that any bias is in favour of those systems. The first example concerns the specification of the ANOVA for a designed experiment, and was taken from the SAS manual (SAS Institute Inc., 1985). The relevant code is shown in Table 5 for SAS and Genstat. Both specifications define the form of the ANOVA table. For the SAS specification it is necessary to know where in the table each treatment SS goes and which is its appropriate error term. In the Genstat specification the design is defined by the block and treatment structures, and these, together with the design matrix, completely define the form of the ANOVA table. Furthermore they define all the distinct standard errors for the tables of treatment means, whereas the SAS specification does not produce these. The number of tokens for the parser is 124 for the SAS specification, and only 18 for Genstat.

The second example concerns the analysis of a contingency table given in Ripley (1992), where it is required to produce a sequential analysis of deviance for the full factorial model. In order to mimic the analysis given by S-Plus it was necessary to write a Genstat procedure to mimic the corresponding S-Plus function. The results are shown in Table 6. The Genstat procedure contains 11 statements involving 86 parsing tokens (ignoring comments), whereas the corresponding S-Plus procedure has 49 functions and control structures and 551 parsing tokens.

The examples show how important it is to get the right computing primitives, both operands and operators, if code is to be succinct and readable. While it may be argued that decreasing the token count does not necessarily improve ease of use, a program that uses five times as many tokens as another is going to be harder to write and take longer to test.

Table 5. Comparison of two specifications of an analysis-of-variance table: (a) from SAS, and (b) from Genstat

(a)
CLASS REP SOIL CA FERT;
MEANS FERT CA SOIL CA*FERT;
MODEL Y = REP FERT FERT*REP CA CA*FERT CA*REP(FERT)
          SOIL SOIL*REP SOIL*FERT SOIL*REP*FERT
          SOIL*CA SOIL*FERT*CA SOIL*CA*REP(FERT);
TEST H = FERT                   E = FERT*REP;
TEST H = CA CA*FERT             E = CA*REP(FERT);
TEST H = SOIL                   E = SOIL*REP;
TEST H = SOIL*FERT              E = SOIL*REP*FERT;
TEST H = SOIL*CA SOIL*FERT*CA   E = SOIL*CA*REP(FERT);

(b)
BLOCKS REP/(ROWS/SUBROWS, COLS)
TREATMENTS FERT*CA + SOIL
ANOVA Y


Table 6. Comparison of two implementations of the analysis of deviance of a three-way contingency table

S-Plus code for running the example:

num <- scan()
2 3 3 4 3 2 1 3 8 11 6 6 7 12 11 11
117 121 47 22 85 98 43 20 119 209 68 43 67 99 46 33
fnames <- list(press=1:4, serum=1:4, chd=c("y", "n"))
kk <- data.frame(fac.design(c(4,4,2), fnames), num)
kk.glm <- glm(num ~ serum*press*chd, family=poisson, data=kk)
anova(kk.glm, test="Chi")

84 tokens, excluding data.

Genstat code:

units [32]
factor [lev=4] press, serum
factor [labels=!t(y,n)] chd
read num
2 3 3 4 3 2 1 3 8 11 6 6 7 12 11 11
117 121 47 22 85 98 43 20 119 209 68 43 67 99 46 33 :
generate chd,serum,press
model [distr=p] num
anodev serum*press*chd

47 tokens, excluding data.

Genstat code to implement S-Plus function anova.glm:

proc 'anodev'                          "prints sequential anodev for GLM"
param 'F'; f
fclass [nterm=nt] #F; out=terms[1...nt]
terms #F
for i=1...nt : add [*] #terms[i] : endf
rdis [pr=acc; ch=text]
"delete line beginning 'Residual'"
edit [ch='L/Res/ D :'] text
prin [ipr=*; sq=y] text; j=l
endp

86 tokens, ignoring comments.

Corresponding S-Plus code:

function(object, ..., test = c("none", "Chisq", "F", "Cp"))
{
    test <- match.arg(test)
    margs <- function(...)
        nargs()
    if(margs(...))
        return(anova.glmlist(list(object, ...), test = test))
    Terms <- object$terms
    term.labels <- attr(Terms, "term.labels")
    nt <- length(term.labels)
    x <- object$x
    m <- model.frame(object)
    if(is.null(x))
        x <- model.matrix(Terms, m)
    ass <- attr(x, "assign")
    control <- glm.control()
    family <- as.family(object)
    a <- attributes(m)
    y <- model.extract(m, "response")
    w <- model.extract(m, "weights")
    if(!length(w))
        w <- rep(1, nrow(m))
    offset <- attr(Terms, "offset")
    if(is.null(offset))
        offset <- 0
    else offset <- m[[offset]]
    dev.res <- double(nt)
    df.res <- dev.res
    nulld <- object$null.deviance
    if(is.null(nulld))
        nulld <- sum(w * (y - weighted.mean(y, w))^2)
    dev.res[1] <- nulld
    df.res[1] <- nrow(x) - attr(Terms, "intercept")
    if(nt > 1)
        for(iterm in seq(nt, 2)) {
            x <- x[,  - (ass[[(term.labels[iterm])]])]
            fit <- glm.fit(x, y, w, offset = offset, family = family)
            dev.res[iterm] <- deviance(fit)
            df.res[iterm] <- fit$df.resid
        }
    dev.res <- c(dev.res, deviance(object))
    df.res <- c(df.res, object$df.resid)
    dev <- c(NA,  - diff(dev.res))
    df <- c(NA,  - diff(df.res))
    heading <- c("Analysis of Deviance Table\n",
        paste(family$family[1], "model\n"),
        paste("Response: ", as.character(formula(object))[2], "\n", sep = ""),
        "Terms added sequentially (first to last)")
    aod <- data.frame(Df = df, Deviance = dev, "Resid. Df" = df.res,
        "Resid. Dev" = dev.res, row.names = c("NULL", term.labels),
        check.names = F)
    attr(aod, "heading") <- heading
    attr(aod, "class") <- c("anova", "data.frame")
    if(test == "none")
        aod
    else stat.anova(aod, test, deviance.lm(object)/object$df.resid,
        object$df.resid, nrow(x))
}

551 tokens.

9. Conclusions

Linear models are an important tool in statistical analysis, but their basic structure is often obscured by unnecessary complexities that have been introduced into their exposition. The three false steps identified in Section 2 obstruct, in my view, the development of effective strategies of analysis and inference. This is not to say that the analysis of unbalanced data can always be easy, but establishing a powerful set of statistical primitives must be the first step. There is much to be done to develop interactive model-building strategies that use these primitives; a prime tool in developing such strategies is well-organized interactive statistical software. Comparisons between packages suggest that we are some way from agreeing on good sets of primitive statistical operands and operators, derived structures for storing the results of analyses, and higher-level control structures that are both powerful and easy to use.

References

Andrews, D. F. and Herzberg, A. M. (1985) Data. Springer- Verlag, New York.

Cox, D. R. (1958) The interpretation of the effects of non- additivity in the Latin square. Biometrika, 45, 67-73.

Cox, D. R. and Snell, E. J. (1981) Applied statistics: principles and examples. Chapman and Hall, London.

Hocking, R. R. and Speed, F. M. (1975) A full rank analysis of some linear model problems. Journal of the American Statistical Association, 70, 706-12.

Lane, P. W. and Nelder, J. A. (1982) Analysis of covariance and standardization as instances of prediction. Biometrics, 38, 613-21.


McCullagh, P. M. and Nelder, J. A. (1989) Generalized linear models, 2nd edn. Chapman and Hall, London.

Nelder, J. A. (1977) A reformulation of linear models. Journal of the Royal Statistical Society, Series A, 140, 48-77.

Neyman, J., Iwaszkiewicz, K. and Kolodziesczyk, St. (1935) Statistical problems in agricultural experimentation. Journal of the Royal Statistical Society, Series B, 2, 107-54.

Payne, C. D. (ed.) (1986) The GLIM manual, Release 3.77. NAG, Oxford.

Payne, R. W. et al. (1987) Genstat 5 reference manual. Clarendon Press, Oxford.

Pignatiello, J. J. and Ramberg, J. S. (1985) Contribution to discussion of off-line quality control, parameter design, and the Taguchi method. Journal of Quality Technology, 17, 198-206.

SAS Institute Inc. (1985) SAS user's guide: statistics, version 5 edition. SAS Institute Inc., Cary, NC.

Searle, S. R. (1987) Linear models for unbalanced data. Wiley, New York.

Ripley, B. D. (1992) Introductory guide to S-Plus. Dept. of Statistics, University of Oxford.

Wolstenholme, D. E., O'Brien, C. M. and Nelder, J. A. (1988) GLIMPSE: A knowledge-based front end for statistical analysis. Knowledge-based Systems, 1, 173-8.

Yates, F. (1934) The analysis of multiple classifications with unequal numbers in the different classes. Journal of the American Statistical Association, 29, 51-66.