7/30/2019 C2166_ch29_Principle of Medical Statistics
29
Analysis of Variance
CONTENTS
29.1 Conceptual Background
29.1.1 Clinical Illustration
29.1.2 Analytic Principles
29.2 Fisher's F Ratio
29.3 Analysis-of-Variance Table
29.4 Problems in Performance
29.5 Problems of Interpretation
29.5.1 Quantitative Distinctions
29.5.2 Stochastic Nonsignificance
29.5.3 Stochastic Significance
29.5.4 Substantive Decisions
29.6 Additional Applications of ANOVA
29.6.1 Multi-Factor Arrangements
29.6.2 Nested Analyses
29.6.3 Analysis of Covariance
29.6.4 Repeated-Measures Arrangements
29.7 Non-Parametric Methods of Analysis
29.8 Problems in Analysis of Trends
29.9 Use of ANOVA in Published Literature
References
The targeted analytic method called analysis of variance, sometimes cited acronymically as ANOVA,
was devised (like so many other procedures in statistics) by Sir Ronald A. Fisher. Although often marking
the conceptual boundary between elementary and advanced statistics, or between amateur fan and
professional connoisseur, ANOVA is sometimes regarded and taught as elementary enough to be used
for deriving subsequent simple procedures, such as the t test. Nevertheless, ANOVA is used much less
often today than formerly, for reasons to be noted in the discussions that follow.
29.1 Conceptual Background
The main distinguishing feature of ANOVA is that the independent variable contains polytomous
categories, which are analyzed simultaneously in relation to a dimensional or ordinal dependent (outcome) variable.
Suppose treatments A, B, and C are tested for effects on blood pressure in a randomized trial. When
the results are examined, we want to determine whether one of the treatments differs significantly from
the others. With the statistical methods available thus far, the only way to answer this question would
be to do multiple comparisons for pairs of groups, contrasting results in group A vs. B, A vs. C, and B vs. C. If more ambitious, we could compare A vs. the combined results of B and C, or group B vs. the
combined results of A and C, and so on. We could work out various other arrangements, but in each
instance, the comparison would rely on contrasting two collected groups, because we currently know
no other strategy.
The analysis of variance allows a single simultaneous comparison for three or more groups. The result
becomes a type of screening test that indicates whether at least one group differs significantly from the
others, but further examination is needed to find the distinctive group(s). Despite this disadvantage, ANOVA
has been a widely used procedure, particularly by professional statisticians, who often like to apply it even
when simpler tactics are available. For example, when data are compared for only two groups, a t test or
Z test is simpler, and, as noted later, produces exactly the same results as ANOVA. Nevertheless, many
persons will do the two-group comparison (and report the results) with an analysis of variance.
29.1.1 Clinical Illustration
Although applicable in experimental trials, ANOVA has been most often used for observational studies.
A real-world example, shown in Figure 29.1, contains data for the survival times, in months, of a random
sample of 60 patients with lung cancer,1,2 having one of the four histologic categories of WELL (well-differentiated), SMALL (small cell), ANAP (anaplastic), and CYTOL (cytology only). The other variable
(the five categories of TNM stage) listed in Figure 29.1 will be considered later. The main analytic
question now is whether histology in any of these groups has significantly different effects on survival.
29.1.1.1 Direct Examination The best thing to do with these data, before any formal statistical analyses begin, is to examine the results directly. In this instance, we can readily determine the group
sizes, means, and standard deviations for each of the four histologic categories and for the total. The results,
shown in Table 29.1, immediately suggest that the data do not have Gaussian distributions, because the
standard deviations are almost all larger than the means. Nevertheless, to allow the illustration to proceed,
the results can be further appraised. They show that the well-differentiated and small-cell groups, as
expected clinically, have the highest and lowest mean survival times, respectively. Because of relatively
small group sizes and non-Gaussian distributions, however, the distinctions may not be stochastically significant.
Again before applying any advanced statistics, we can check these results stochastically by using simple
t tests. For the most obvious comparison of WELL vs. SMALL, we can use the components of Formula [13.7] to calculate sp = √{[21(26.56)² + 10(3.77)²]/(21 + 10)} = 21.96; √[(1/nA) + (1/nB)] = √[(1/22) + (1/11)] = .369; and X̄A − X̄B = 24.43 − 4.45 = 19.98. These data could then be entered into Formula [13.7] to produce t = 19.98/[(21.96)(.369)] = 2.47. At 31 d.f., the associated 2P value is about .02. From this distinction, we might also expect that all the other paired comparisons will not be stochastically significant. (If you check the calculations, you will find that the appropriate 2P values are all >.05.)
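The same arithmetic can be reproduced directly from the summary statistics in Table 29.1. The sketch below (Python) follows the pooled-variance calculation just described; the variable names are illustrative, not part of the original formula's notation:

```python
# Pooled two-group t test from summary statistics (WELL vs. SMALL, Table 29.1)
n_well, mean_well, sd_well = 22, 24.43, 26.56
n_small, mean_small, sd_small = 11, 4.45, 3.77

# Pooled standard deviation from the two group variances
sp = (((n_well - 1) * sd_well**2 + (n_small - 1) * sd_small**2)
      / (n_well + n_small - 2)) ** 0.5          # about 21.96
factor = (1 / n_well + 1 / n_small) ** 0.5      # about .369
t = (mean_well - mean_small) / (sp * factor)    # about 2.47 at 31 d.f.
```

Small discrepancies from the text's 2.47 reflect rounding in the intermediate hand calculations.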
29.1.1.2 Holistic and Multiple-Comparison Problems The foregoing comparison indicates a significant difference in mean survival between the WELL and SMALL groups, but does
not answer the holistically phrased analytic question, which asked whether histology has significant
effects in any of the four groups in the entire collection. Besides, an argument could be made, using
TABLE 29.1
Summary of Survival Times in Four Histologic Groups
of Patients with Lung Cancer in Figure 29.1

Histologic   Group   Mean       Standard
Category     Size    Survival   Deviation
WELL         22      24.43      26.56
SMALL        11       4.45       3.77
ANAP         18      10.87      23.39
CYTOL         9      11.54      13.47
Total        60      14.77      22.29
distinctions discussed in Section 25.2.1.1, that the contrast of WELL vs. SMALL was only one of the
six (4 × 3/2) possible paired comparisons for the four histologic categories. With the Bonferroni correction, the working level of α for each of the six comparisons would be .05/6 = .008. With the latter criterion, the 2P value of about .02 for WELL vs. SMALL would no longer be stochastically significant.
We therefore need a new method to answer the original question. Instead of examining six pairs of
contrasted means, we can use a holistic approach by finding the grand mean of the data, determining
the deviations of each group of data from that mean, and analyzing those deviations appropriately.
OBS    ID   HISTOL  TNMSTAGE  SURVIVE
  1    62   WELL    I           82.3
  2   107   WELL    II           5.3
  3   110   WELL    IIIA        29.6
  4   157   WELL    I           20.3
  5   163   WELL    I           54.9
  6   246   SMALL   I           10.3
  7   271   WELL    IIIB         1.6
  8   282   ANAP    IIIA         7.6
  9   302   WELL    I           28.0
 10   337   CYTOL   I           12.8
 11   344   WELL    II           4.0
 12   352   ANAP    IIIA         1.3
 13   371   WELL    IIIB        14.1
 14   387   SMALL   IIIA         0.2
 15   428   SMALL   II           6.8
 16   466   ANAP    IIIB         1.4
 17   513   ANAP    I            0.1
 18   548   ANAP    IV           1.8
 19   581   ANAP    IV           6.0
 20   605   CYTOL   IV           1.0
 21   609   CYTOL   IV           6.2
 22   628   SMALL   IV           4.4
 23   671   SMALL   IV           5.5
 24   764   SMALL   IV           0.3
 25   784   ANAP    IV           1.6
 26   804   WELL    I           12.2
 27   806   ANAP    IIIB         6.5
 28   815   WELL    I           39.9
 29   852   WELL    IIIB         4.5
 30   855   WELL    II           1.6
 31   891   CYTOL   IIIB         8.1
 32   892   WELL    IIIB        62.0
 33   931   CYTOL   IIIB         8.8
 34   998   WELL    IIIB         0.2
 35  1039   SMALL   IV           0.6
 36  1044   ANAP    II          19.3
 37  1054   WELL    IIIB         0.6
 38  1057   ANAP    I           10.9
 39  1155   ANAP    I            0.2
 40  1192   SMALL   IV          11.2
 41  1223   ANAP    IV           0.9
 42  1228   ANAP    II          27.9
 43  1303   ANAP    IIIB         2.9
 44  1309   ANAP    II          99.9
 45  1317   ANAP    IV           4.7
 46  1355   CYTOL   IIIB         1.8
 47  1361   WELL    IV           1.0
 48  1380   CYTOL   IV          10.6
 49  1405   SMALL   IV           3.7
 50  1444   WELL    II          55.9
 51  1509   SMALL   IV           3.4
 52  1515   WELL    I           79.7
 53  1521   ANAP    IV           1.9
 54  1556   ANAP    IIIB         0.8
 55  1567   SMALL   IV           2.5
 56  1608   CYTOL   I            8.6
 57  1612   WELL    IIIA        13.3
 58  1666   CYTOL   IV          46.0
 59  1702   WELL    II          23.9
 60  1738   WELL    II           2.6
FIGURE 29.1 Printout of data on histologic type, TNM stage, and months of survival in a random sample of 60 patients with primary cancer of the lung. [OBS = observation number in sample; ID = original identification number; HISTOL = histologic type; TNMSTAGE = one of five ordinal anatomic TNM stages for lung cancer; SURVIVE = survival time (mos.); WELL = well-differentiated; SMALL = small cell; ANAP = anaplastic; CYTOL = cytology only.]
Many different symbols have been used to indicate the entities that are involved. In the illustration here, Yij will represent the target variable (survival time) for person i in group j. For example, if WELL is the first group in Figure 29.1, the eighth person in the group has Y8,1 = 4.0. The mean of the values in group j will be Ȳj = ΣYij/nj, where nj is the number of members in the group. Thus, for the last group (cytology) in Table 29.1, n4 = 9, ΣYi,4 = 103.9, and Ȳ4 = 103.9/9 = 11.54. The grand mean, Ḡ, will be Σ(njȲj)/N, where N = Σnj = size of the total group under analysis. From the data in Table 29.1, Ḡ = [(22 × 24.43) + (11 × 4.45) + (18 × 10.87) + (9 × 11.54)]/60 = 885.93/60 = 14.77.

We can now determine the distance, Ȳj − Ḡ, between each group's mean and the grand mean. For the ANAP group, the distance is 10.87 − 14.77 = −3.90. For the other three groups, the distances are −3.23 for CYTOL, −10.32 for SMALL, and +9.66 for WELL. This inspection confirms that the means of the SMALL and WELL groups are most different from the grand mean, but the results contain no attention to stochastic variation in the data.
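These calculations are easy to reproduce. A brief Python sketch, using the group sizes and means of Table 29.1 (the dictionary layout is an illustrative choice, not the book's notation):

```python
# Grand mean and group-mean deviations from the summary data in Table 29.1
groups = {"WELL": (22, 24.43), "SMALL": (11, 4.45),
          "ANAP": (18, 10.87), "CYTOL": (9, 11.54)}   # (size, mean survival)

N = sum(n for n, _ in groups.values())                # 60 patients in total
grand = sum(n * m for n, m in groups.values()) / N    # 885.93/60 = 14.77
deviations = {name: m - grand for name, (n, m) in groups.items()}
```

The deviations match the inspection above: about −3.90 for ANAP, −3.23 for CYTOL, −10.32 for SMALL, and +9.66 for WELL.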
29.1.2 Analytic Principles
To solve the stochastic challenge, we can use ANOVA, which, like many other classical statistical strategies, expresses real-world phenomena with mathematical models. We have already used such models both implicitly and explicitly. In univariate statistics, the mean, Ȳ, was an implicit model for fitting a group of data from only the values in the single set of data. The measured deviations from that model, Yi − Ȳ, were then converted to the group's basic variance, Σ(Yi − Ȳ)².

In bivariate statistics for the associations in Chapters 18 and 19, we used an explicit model based on an additional variable, expressed algebraically as Ŷi = a + bXi. We then compared variances for three sets of deviations: Yi − Ŷi, between the items of data and the explicit model; Yi − Ȳ, between the items of data and the implicit model; and Ŷi − Ȳ, between the explicit and implicit models. The group variances or sums of squares associated with these deviations were called residual (or error) for Σ(Yi − Ŷi)², basic for Σ(Yi − Ȳ)², and model for Σ(Ŷi − Ȳ)².
29.1.2.1 Distinctions in Nomenclature The foregoing symbols and nomenclature have been simplified for the sake of clarity. In strict statistical reasoning, any set of observed data is regarded as a sample from an unobserved population whose parameters are being estimated from the data. If modeled with a straight line, the parametric population would be cited as Y = α + βX. When the results for the observed data are expressed as Ŷi = a + bXi, the coefficients a and b are estimates of the corresponding α and β parameters.

Also in strict reasoning, variance is an attribute of the parametric population. Terms such as Σ(Yi − Ȳ)² or Σ(Yi − Ŷi)², which are used to estimate the parametric variances, should be called sums of squares, not group variances. The linguistic propriety has been violated here for two reasons: (1) the distinctions are more easily understood when called variance, and (2) the violations constantly appear in both published literature and computer printouts. The usage here, although a departure from strict formalism, is probably better than in many discussions elsewhere, where the sums of squares are called variances instead of group variances.
Another issue in nomenclature is syntactical rather than mathematical. In most English prose,
between is used for a distinction of two objects, and among for more than two. Nevertheless, in the
original description of the analysis of variance, R. A. Fisher used the preposition between rather than
among when more than two groups or classes were involved. The term between groups has been
perpetuated by subsequent writers, much to the delight of English-prose pedants who may denounce
the absence of literacy in mathematical technocracy. Nevertheless, Fisher and his successors have been
quite correct in maintaining between. Its use for the cited purpose is approved by diverse high-echelon
authorities, including the Oxford English Dictionary, which states that between has been, from its
earliest appearance, extended to more than two.3 [As one of the potential pedants, I was ready to use among in this text until I checked the dictionary and became enlightened.]
29.1.2.2 Partition of Group Variance The same type of partitioning that was used for group variance in linear regression is also applied in ANOVA. Conceptually, however, the models are
expressed differently. Symbolically, each observation can be labelled Yij, with j representing the group and i, the person (or other observed entity) within the group. The grand mean, Ḡ, is used for the implicit model when the basic group or system variance, Σ(Yi − Ḡ)², is summed for the individual values of Yi in all of the groups. The individual group means, Ȳj, become the explicit models when the total system is partitioned into groups. The residual group variance is the sum of the values of (Yi − Ȳj)² within each of the groups. [In more accurate symbolism, the two cited group variances would be written with double subscripts and summations as ΣΣ(Yij − Ḡ)² and ΣΣ(Yij − Ȳj)².] The model group variance, summed for each group of nj members with group mean Ȳj, is Σnj(Ȳj − Ḡ)². These results for data in the four groups of Figure 29.1 and Table 29.1 are shown in Table 29.2.
Except for minor differences due to rounding, the components of Table 29.2 have the same structure
noted earlier for simple linear regression in Section 19.2.2. The structure is as follows:
{Basic Group Variance} = {Model Variance between Groups} + { Residual Variance within Groups}
or Syy = SM + SR.
The structure is similar to that of the deviations
Total Deviation = Model Deviation + Residual Deviation
which arises when each individual deviation is expressed in the algebraic identity

Yij − Ḡ = (Ȳj − Ḡ) + (Yij − Ȳj)

If Ḡ is moved to the first part of the right side, the equation becomes

Yij = Ḡ + (Ȳj − Ḡ) + (Yij − Ȳj)

and is consistent with a parametric algebraic model that has the form

Yij = μ + αj + εij

In this model, each person's value of Yij consists of three contributions: (1) from the grand parametric mean, μ (which is estimated by Ḡ); (2) from the parametric increment, αj (estimated by Ȳj − Ḡ), between the grand mean and group mean; and (3) from an error term, εij (estimated by Yij − Ȳj), for the increment between the observed value of Yij and the group mean.

For stochastic appraisal of results, the null hypothesis assumption is that the m groups have the same parametric mean, i.e., μ1 = μ2 = … = μj = … = μm.
TABLE 29.2
Group-Variance Partitions of Sums of Squares for the Four Histologic Groups
in Figure 29.1 and Table 29.1

Group    Basic             Model                           Residual
         (Total System)    (Between Groups)                (Within Groups)
WELL     16866.67          22(24.43 − 14.77)² = 2052.94    14813.73
SMALL     1313.52          11(4.45 − 14.77)²  = 1171.53      141.99
ANAP      9576.88          18(10.87 − 14.77)² =  273.78     9303.10
CYTOL     1546.32           9(11.54 − 14.77)² =   93.90     1452.42
Total    29304.61*                              3593.38*   25711.24

* These are the correct totals. They differ slightly from the sum of the collection of individual values, calculated with rounding, in each column.
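The partition Syy = SM + SR can also be confirmed numerically on any small one-way layout. The Python sketch below uses made-up illustrative data, not the lung-cancer values:

```python
# Verify Syy = SM + SR for a one-way layout (made-up illustrative data)
data = {"A": [3.0, 5.0, 4.0], "B": [8.0, 9.0, 10.0], "C": [1.0, 2.0, 3.0]}

all_values = [y for ys in data.values() for y in ys]
grand = sum(all_values) / len(all_values)              # the grand mean

syy = sum((y - grand) ** 2 for y in all_values)        # basic (total) SS
sm = sum(len(ys) * ((sum(ys) / len(ys)) - grand) ** 2  # model (between) SS
         for ys in data.values())
sr = sum((y - sum(ys) / len(ys)) ** 2                  # residual (within) SS
         for ys in data.values() for y in ys)
```

Because the identity is algebraic, the equality holds for any grouped data, up to floating-point rounding.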
29.1.2.3 Mean Variances and Degrees of Freedom When divided by the associated degrees of freedom, each of the foregoing group variances is converted to a mean value. For the basic group variance, the total system contains N = Σnj members, and d.f. = N − 1. For the model variance, the m groups have m − 1 degrees of freedom. For the residual variance, each group has nj − 1 degrees of freedom, and the total d.f. for m groups is Σ(nj − 1) = N − m.

The degrees of freedom are thus partitioned, like the group variances, into an expression that indicates their sum as

N − 1 = (m − 1) + (N − m)
The mean variances, however, no longer form an equal partition. Their symbols, and the associated
values in the example here, are as follows:
Mean Group Variance = Syy/(N − 1) = 29304.61/59 = 496.69

Mean Model Variance (between groups) = SM/(m − 1) = 3593.38/3 = 1197.79

Mean Residual Variance (within groups) = SR/(N − m) = 25711.24/56 = 459.13
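These three mean variances follow directly from the sums of squares in Table 29.2; a few lines of Python reproduce them:

```python
# Mean variances from the sums of squares in Table 29.2
syy, sm, sr = 29304.61, 3593.38, 25711.24   # basic, model, residual SS
N, m = 60, 4                                 # 60 patients, 4 histologic groups

mean_group = syy / (N - 1)    # basic:    about 496.69
mean_model = sm / (m - 1)     # between groups: about 1197.79
mean_resid = sr / (N - m)     # within groups:  about 459.13
```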
29.2 Fisher's F Ratio
Under the null hypothesis of no real difference between the groups (i.e., the assumption that they have the same parametric mean), each of the foregoing three mean variances can be regarded as a separate
estimate of the true parametric variance. Within the limits of stochastic variation in random sampling,
the three mean variances should equal one another.
To test stochastic significance, R. A. Fisher constructed a variance ratio, later designated as F, that is expressed as

F = (Mean variance between groups)/(Mean variance within groups)

It can be cited symbolically as

F = [SM/(m − 1)]/[SR/(N − m)]   [29.1]
If only two groups are being compared, some simple algebra will show that Formula [29.1] becomes
the square of the earlier Formula [13.7] for the calculation of t (or Z). This distinction is the reason why
the F ratio is sometimes used, instead of t (or Z), for contrasting two groups, as noted earlier in Section 13.3.6.
The Fisher ratio has a sampling distribution in which the associated 2P value is found for the values of F at the two sets of degrees of freedom, m − 1 and N − m. The three components make the distribution difficult to tabulate completely; and it is usually cited according to values of F for each pair of degrees of freedom simultaneously at fixed values of 2P such as .1, .05, and .01.
In the example under discussion here, the F ratio is 1197.79/459.13 = 2.61. In the Geigy tables,4 available for the combination of 3 and 56 degrees of freedom, the required F values are 2.184 for 2P = .1, 2.769 for 2P = .05, and 3.359 for 2P = .025. If only the Geigy values were available, the result would be written as .05 < 2P < .1. In an appropriate computer program, however, the actual 2P value is usually calculated and displayed directly. In this instance, it was .0605.
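The bracketing of F between the tabulated boundaries can be sketched in Python, using the mean variances computed earlier and the Geigy critical values quoted in the text:

```python
# F ratio for the histology example, bracketed by tabulated critical values
mean_model = 3593.38 / 3        # between-groups mean square
mean_resid = 25711.24 / 56      # within-groups mean square
F = mean_model / mean_resid     # about 2.61

# Geigy critical values for 3 and 56 degrees of freedom (from the text)
f_at_2p_10 = 2.184   # 2P = .1
f_at_2p_05 = 2.769   # 2P = .05
# F lies between them, so the tabular conclusion is .05 < 2P < .1
```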
If 2P is small enough to lead to rejection of the null hypothesis, the stochastic conclusion is that at least one of the groups has a mean significantly different from the others. Because the counter-hypothesis for the F test is always that the mean variance is larger between groups than within them, the null hypothesis can promptly be conceded if the F ratio is < 1. In this instance, because the null hypothesis cannot be rejected at α = .05, we cannot conclude that a significant difference in survival has been
stochastically confirmed for the histologic categories. The observed quantitative distinctions seem
impressive, however, and would probably attain stochastic significance if the group sizes were larger.
29.3 Analysis-of-Variance Table
The results of an analysis of variance are commonly presented, in both published literature and computer
printouts, with a tabular arrangement that warrants special attention because it is used not only for
ANOVA but also for multivariable regression procedures that involve partitioning the sums of squared
deviations (SS) that form group variances.
In each situation, the results show the partition for the sums of squares of three entities: (1) the total
SS before imposition of an explicit model, (2) the SS between the explicit model and the original implicit
grand mean, and (3) the residual SS for the explicit model. The last of these entities is often called the
unexplained or error variance. Both of these terms are unfortunate, because the mathematical explanation is a statistical phenomenon that may have nothing to do with biologic mechanisms of explanation, and the error represents deviations between observed and estimated values, not mistakes or inaccuracies in the basic data. In certain special arrangements, to be discussed shortly, the deviations receive an
additionally improved explanation when the model is enhanced with subdivisions of the main variable
or with the incorporation of additional variables.
Figure 29.2 shows the conventional headings for the ANOVA table of the histology example in
Figure 29.1. For this one-way analysis, the total results are divided into two rows of components. The
number of rows is appropriately expanded when more subgroups are formed (as discussed later) via
such mechanisms as subdivisions or inclusion of additional variables.
29.4 Problems in Performance
The mathematical reasoning used in many ANOVA arrangements was developed for an ideal experi-
mental world in which all the compared groups or subgroups had the same size. If four groups were
being compared, each group had the same number of members, so that n1 = n2 = n3 = n4. If the groups were further divided into subgroups (such as men and women, or young, middle-aged, and old), the
subgroups had the same sizes within each group.
These equi-sized arrangements were easily attained for experiments in the world of agriculture, where
R. A. Fisher worked and developed his ideas about ANOVA. Equally sized groups and subgroups are
seldom achieved, however, in the realities of clinical and epidemiologic research. The absence of equal
sizes may then create a major problem in the operation of computer programs that rely on equal sizes, and that may be unable to manage data for other circumstances. For the latter situations, the computer
programs may divert ANOVA into the format of a general linear model, which is essentially a method
of multiple regression. One main reason, therefore, why regression methods are replacing ANOVA
Dependent Variable: SURVIVE
Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 3593.3800000 1197.7933333 2.61 0.0605
Error 56 25711.2333333 459.1291667
Corrected Total 59 29304.6133333
R-Square C.V. Root MSE SURVIVE Mean
0.122622 145.1059 21.427300 14.766667
FIGURE 29.2 Printout of analysis-of-variance table for survival time in the four histologic groups of Figure 29.1.
methods today is that the automated regression methods can more easily process data for unequal-sized
groups and subgroups.
29.5 Problems of Interpretation
The results of an analysis of variance are often difficult to interpret for both quantitative and stochastic
reasons, as well as for substantive decisions.
29.5.1 Quantitative Distinctions
The results of ANOVA are almost always cited with F ratios and P values that indicate stochastic
accomplishments but not quantitative descriptive distinctions. The reader is thus left without a mechanism
to decide what has been accomplished quantitatively, while worrying that significant P values may
arise mainly from large group sizes.
Although not commonly used, a simple statistical index can provide a quantitative description of the results. The index, called eta squared, was previously discussed in Section 27.2.2 as a counterpart of r² for proportionate reduction of group variance in linear regression. Labeled R-square in the printout of Figure 29.2, the expression is

η² = SM/Syy

For the histologic data in Figure 29.2, this index is 3593.38/29304.61 = 0.12, representing a modest achievement, which barely exceeds the 10% noted earlier (see Section 19.3.3) as a minimum level for quantitative significance in variance reduction.
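A one-line Python computation, using the sums of squares from the ANOVA table in Figure 29.2, reproduces the printout's R-Square value:

```python
# Eta squared: proportionate reduction of group variance achieved by the model
sm = 3593.38             # model (between-groups) sum of squares
syy = 29304.61           # basic (total) sum of squares
eta_squared = sm / syy   # about 0.12, the "R-Square" of Figure 29.2
```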
29.5.2 Stochastic Nonsignificance
Another important issue is what to do when a result is not stochastically significant, i.e., P > α. In previous analytic methods, a confidence interval could be calculated around the nonsignificant increment, ratio, or coefficient that described the observed distinction in the results. If the upper end of this confidence interval excluded a quantitatively significant value (such as δ), the result could be called stochastically nonsignificant. If the confidence interval included δ, the investigator might be reluctant to concede the null hypothesis of no difference.
This type of reasoning would be equally pertinent for ANOVA, but is rarely used because the results
seldom receive a descriptive citation. Confidence intervals, although sometimes calculated for the mean
of each group, are almost never determined to give the value of eta the same type of upper and lower confidence boundaries that can be calculated around a correlation coefficient in simple linear regression.
In the absence of a confidence interval for eta, the main available descriptive approach is to examine
results in individual groups or in paired comparisons. If any of the results seem quantitatively significant,
the investigator, although still conceding the null hypothesis (because P > α), can remain suspicious that a significant difference exists, but has not been confirmed stochastically. For example, in Figure 29.2,
the P value of 0.06 would not allow rejection of the null hypothesis that all group means are equal.
Nevertheless, the modestly impressive value of 0.12 for eta squared and the large increment noted earlier
between the WELL and SMALL group means suggest that the group sizes were too small for stochastic
confirmation of what is probably a quantitatively significant distinction.
29.5.3 Stochastic Significance
If P < α, the null hypothesis is rejected, and the next problem is to identify which group (or groups) produced the stochastically significant result. A set of data
containing m groups will allow m(m − 1)/2 paired comparisons when each group's mean is contrasted against the mean of every other group. With m additional paired comparisons between each group and the total of the others, the total number of paired comparisons will be m(m + 1)/2. For example, the small-cell histologic group in Table 29.1 could be compared against each of the three other groups and
also against their total. A particularly ingenious (or desperate) investigator might compare a single
group or paired groups against pairs (or yet other combinations) of the others.
This plethora of activities produces the multiple comparison problem discussed in Chapter 25, as well
as the multiple eponymous and striking titles (such as Tukey's honestly significant difference5) that have
been given to the procedures proposed for examining and solving the problem.
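The counts just described, and the Bonferroni adjustment mentioned earlier, are simple arithmetic; a Python sketch for the four histologic groups:

```python
# Number of paired comparisons for m groups, and the Bonferroni-adjusted alpha
m = 4                              # histologic groups in Table 29.1
pairwise = m * (m - 1) // 2        # each group vs. every other group: 6
with_totals = m * (m + 1) // 2     # plus each group vs. total of the others: 10
alpha_per_test = 0.05 / pairwise   # Bonferroni working level, about .008
```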
29.5.4 Substantive Decisions
Because the foregoing solutions all depend on arbitrary mathematical mechanisms, investigators who
are familiar with the substantive content of the data usually prefer to avoid the polytomous structure of
the analysis of variance. For example, a knowledgeable investigator might want to compare only the
SMALL vs. WELL groups with a direct 2-group contrast (such as a t test) in the histologic data, avoiding the entire ANOVA process. An even more knowledgeable investigator, recognizing that survival can be
affected by many factors (such as TNM stage and age) other than histologic category, might not want
to do any type of histologic appraisal unless the other cogent variables have been suitably accounted for.
For all these reasons, ANOVA is a magnificent method of analyzing data if you are unfamiliar with
what the data really mean or represent. If you know the substantive content of the research, however,
and if you have specific ideas to be examined, you may want to use a simpler and more direct way of
examining them.
29.6 Additional Applications of ANOVA
From a series of mathematical models and diverse arrangements, the analysis of variance has a versatility,
analogous to that discussed earlier for chi square, that for many years made ANOVA the most commonly
used statistical procedure for analyzing complex data. In recent years, however, the ubiquitous availability
of computers has led to the frequent replacement of ANOVA by multiple regression procedures, whose
results are often easier to understand. Besides, ANOVA can mathematically be regarded as a subdivision
of the general-linear-model strategies used in multivariable regression analysis.
Accordingly, four of the many other applications of ANOVA are outlined here only briefly, mainly
so that you will have heard of them in case you meet them (particularly in older literature). Details
can be found in many statistical textbooks. The four procedures to be discussed are multi-factor arrangements, nested analyses, the analysis of covariance (ANCOVA), and repeated-measures arrangements (including the intraclass correlation coefficient).
29.6.1 Multi-Factor Arrangements
The procedures discussed so far are called one-way analyses of variance, because only a single independent variable (i.e., histologic category) was examined in relation to survival time. In many circumstances, however, two or more independent variables can be regarded as factors affecting the dependent
variable. When these additional factors are included, the analysis is called two-way (or two-factor), three-
way (or three-factor), etc.
For example, if the two factors of histologic category and TNM stage are considered simultaneously, the data for the 60 patients in Figure 29.1 would be arranged as shown in Table 29.3. The identification
of individual survival times would require triple subscripts: i for the person, j for the row, and k for
the column.
29.6.1.1 Main Effects In the mathematical model of the two-way arrangement, the categorical mean for each factor (Histology and TNM Stage) makes a separate contribution, called the main effect, beyond the grand mean. The remainder (or unexplained) deviation for each person is called the residual error. Thus, a two-factor model for the two independent variables would express the observed results as

Yijk = Ḡ + (Ȳj − Ḡ) + (Ȳk − Ḡ) + (Yijk − Ȳj − Ȳk + Ḡ)   [29.2]

The term Ḡ here represents the grand mean. The next two terms represent the respective deviations of each row mean (Ȳj) and each column mean (Ȳk) from the grand mean. The four components in the last term for the residual deviation of each person are constructed as residuals that maintain the algebraic identity. The total sum of squares in the system will be Σ(Yijk − Ḡ)², with N − 1 degrees of freedom. There will be two sums of squares for the model, cited as Σnj(Ȳj − Ḡ)² for the row factor, and as Σnk(Ȳk − Ḡ)² for the column factor. The residual sum of squares will be the sum of all the values of (Yijk − Ȳj − Ȳk + Ḡ)².
Figure 29.3 shows the printout of pertinent calculations for the data in Table 29.3. In the lower half of Figure 29.3, the 4-category histologic variable has 3 degrees of freedom, and its Type I SS (sum of squares) and mean square, respectively, are the same 3593.38 and 1197.79 shown earlier. The 5-category TNM-stage variable has 4 degrees of freedom and corresponding values of 3116.39 and 779.10. The residual error group variance in the upper part of the table is now calculated differently, as the corrected
TABLE 29.3
Two-Way Arrangement of Individual Data for Survival Time (in Months)
of Patients with Lung Cancer

Histologic                            TNM Stage                                   Mean for Total
Category    I             II            IIIA         IIIB          IV             Row Category

Well        82.3, 20.3,   5.3, 4.0,     29.6, 13.3   1.6, 14.1,    1.0            24.43
            54.9, 28.0,   1.6, 55.9,                 4.5, 62.0,
            12.2, 39.9,   23.9, 2.6                  0.2, 0.6
            79.7

Small       10.3          6.8           0.2                        4.4, 5.5,       4.45
                                                                   0.3, 0.6,
                                                                   11.2, 3.7,
                                                                   3.4, 2.5

Anap        0.1, 10.9,    19.3, 27.9,   7.6, 1.3     1.4, 6.5,     1.8, 6.0,      10.87
            0.2           99.9                       2.9, 0.8      1.6, 0.9,
                                                                   4.7, 1.9

Cytol       12.8, 8.6                                8.1, 8.8,     1.0, 6.2,      11.54
                                                     1.8           10.6, 46.0

Mean for Total
Column Category
            27.7          24.72         10.40        8.72          5.96           14.77
total sum of squares minus the sum of Type I squares, which is a total of 6709.77 for the two factors in the model. Since those two factors have 7 (= 3 + 4) degrees of freedom, the mean square for the model is 6709.77/7 = 958.54, and the d.f. in the error variance is 59 − 7 = 52. The mean square for the error variance becomes 22594.84/52 = 434.52. When calculated for this two-factor model, the F ratio of mean squares is 2.21, which now achieves a P value (marked Pr > F) just below .05. If the α level is set at .05, this result is significant, whereas it was not so in the previous analysis for histology alone.
The label Type I SS is used because ANOVA calculations can also produce three other types of
sums of squares (marked II, III, and IV when presented) that vary with the order in which factors are
entered or removed in a model, and with consideration of the interactions discussed in the next section.
As shown in the lower section of Figure 29.3, an F-ratio value can be calculated for each factor when its mean square is divided by the error mean square. For histology, this ratio is 1197.79/434.52 = 2.76. For TNM stage, the corresponding value in the printout is 1.79. The corresponding 2P values are just above .05 for histology and .14 for TNM stage.
29.6.1.2 Interactions In linear models, each factor is assumed to have its own separate additive effect. In biologic reality, however, the conjunction of two factors may have an antagonistic or
synergistic effect beyond their individual actions, so that the whole differs from the sum of the parts.
For example, increasing weight and increasing blood pressure may each lead to increasing mortality,
but their combined effect may be particularly pronounced in persons who are at the extremes of obesity
and hypertension. Statisticians use the term interactions for these conjunctive effects; and the potential
for interactions is often considered whenever an analysis contains two or more factors.
To examine these effects in a two-factor analysis, the model for Yijk is expanded to contain an interaction term. It is calculated, for the mean of each cell of the conjoined categories, as the deviation of the observed cell mean from the additive combination of the mean values of the pertinent row and column variables. In the expression of the equation for Yijk, the first three terms of Equation [29.2] are the same: G for the grand mean; (Ȳj − G) for each row; and (Ȳk − G) for each column. Because the observed mean in each cell will be Ȳjk, the interaction effect will be the deviation estimated as Ȳjk − Ȳj − Ȳk + G. The remaining residual effect, used for calculating the residual sum of squares, is Yijk − Ȳjk. For each sum of squares, the degrees of freedom are determined appropriately for the calculations of mean squares and F ratios.
The calculation of interaction effects can be illustrated with an example from the data of Table 29.3 for the 7-member cell in the first row, first column. The grand mean is 14.77; the entire WELL histologic category has a mean of 24.43; and TNM stage I has a mean of 27.71. The mean of the seven values in the cited cell is (82.3 + 20.3 + ... + 79.7)/7 = 45.33. According to the algebraic equation, G = 14.77; in the first row, (Ȳj − G) = 24.43 − 14.77 = 9.66; and in the first column, (Ȳk − G) = 27.71 − 14.77 = 12.94. The interaction effect in the cited cell will be estimated as 45.33 − 24.43 − 27.71 + 14.77 = 7.96. The estimated value of the residual for each of the seven Yijk values in the cited cell will be Yijk − Ȳj − Ȳk + G − 7.96, i.e., Yijk − 45.33, the deviation from the cell mean.
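The arithmetic of the worked example can be confirmed directly, using only the values quoted in the text:

```python
# Re-checking the interaction example: the 7-member (WELL, stage I) cell,
# with the quoted grand mean (14.77), row mean (24.43), and column mean (27.71).
cell = [82.3, 20.3, 54.9, 28.0, 12.2, 39.9, 79.7]
cell_mean = sum(cell) / len(cell)
G, row_mean, col_mean = 14.77, 24.43, 27.71

row_effect = round(row_mean - G, 2)                       # (Ybar_j - G)
col_effect = round(col_mean - G, 2)                       # (Ybar_k - G)
interaction = round(cell_mean - row_mean - col_mean + G, 2)

assert round(cell_mean, 2) == 45.33
assert row_effect == 9.66 and col_effect == 12.94
assert interaction == 7.96
```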
Dependent Variable: SURVIVE

Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              7     6709.7729638     958.5389948      2.21     0.0486
Error             52    22594.8403695     434.5161610
Corrected Total   59    29304.6133333

R-Square     C.V.        Root MSE     SURVIVE Mean
0.228966     141.1629    20.845051    14.766667

Source       DF    Type I SS       Mean Square      F Value    Pr > F
HISTOL        3    3593.3800000    1197.7933333       2.76     0.0515
TNMSTAGE      4    3116.3929638     779.0982410       1.79     0.1443

FIGURE 29.3 Printout for 2-way ANOVA of data in Figure 29.1 and Table 29.3.
Figure 29.4 shows the printout of the ANOVA table when an interaction model is used for the two-factor data in Table 29.3. In Figure 29.4, the sums of squares (marked Type I SS) and mean squares for histology and TNM stage are the same as in Figure 29.3, and they also have the same degrees of freedom.

The degrees of freedom for the interaction are tricky to calculate, however. In this instance, because some of the cells of Table 29.3 are empty or have only 1 member, we first calculate degrees of freedom for the residual sum of squares, Σ(Yijk − Ȳjk)². In each pertinent cell, located at (j, k) coordinates in the table, the degrees of freedom will be njk − 1. Working across and then downward through the cells in Table 29.3, the sum of the njk − 1 values will be 6 + 5 + 1 + 5 + 7 + 2 + 2 + 1 + 3 + 5 + 1 + 2 + 3 = 43. (The values are 0 for the four cells with one member each and also for the 3 cells with no members.) This calculation shows that the model accounts for 59 − 43 = 16 d.f.; and as the two main factors have a total of 7 d.f., the interaction factor contributes 9 d.f. to the model, as shown in the last row of Figure 29.4.
Calculated with the new mean square error term in Figure 29.4, the F values produce 2P values below .05 for histology (.0282) and for the histology-by-TNM-stage interaction (.0408), but not for TNM stage (.0890).
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model             16    13835.482381      864.717649       2.40     0.0114
Error             43    15469.130952      359.747231
Corrected Total   59    29304.613333

R-Square     C.V.        Root MSE     SURVIVE Mean
0.472126     128.4447    18.967004    14.766667

Source             DF    Type I SS       Mean Square      F Value    Pr > F
HISTOL              3    3593.3800000    1197.7933333       3.33     0.0282
TNMSTAGE            4    3116.3929638     779.0982410       2.17     0.0890
HISTOL*TNMSTAGE     9    7125.7094171     791.7454908       2.20     0.0408

FIGURE 29.4 Two-way ANOVA, with interaction component, for results in Table 29.3 and Figure 29.3. [Printout from SAS PROC GLM computer program.]
allow maintenance of the ranks, TNM stage could be declared a covariate, which would then be analyzed as though it had a dimensional scale.

The results of the covariance analysis are shown in Figure 29.5. Note that TNM stage now has only 1 degree of freedom, thus giving the model a total of 4 d.f., an F value of 3.61, and a P value of 0.0111, despite a decline of R-square from .229 in Figure 29.3 to .208 in Figure 29.5. The histology variable, which had P = .052 in Figure 29.3, now has P = .428; and TNM stage, with P = .144 in Figure 29.3, has now become highly significant at P = .0012. These dramatic changes indicate what can happen when the rank sequence is either ignored or appropriately analyzed for polytomous variables.
In past years, the effect of confounding or ranked covariates was often formally adjusted in an analysis
of covariance, using a complex set of computations and symbols. Today, however, the same adjustment
is almost always done with a multiple regression procedure. The adjustment process in ANCOVA is
actually a form of regression analysis in which the related effects of the covariate are determined by
regression and then removed from the error variance. The group means of the main factor are also
adjusted to correspond to a common value of the covariate. The subsequent analysis is presumably more powerful in detecting the effects of the main factor, because the confounding effects have been removed. The process and results are usually much easier to understand, however, when done with multiple linear regression.2
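The regression form of the adjustment can be sketched as follows. The data, group sizes, and effect sizes here are hypothetical, and the dummy-coding scheme is only one of several possibilities; the group effect is tested by the extra sum of squares removed when the group terms are added to a model already containing the covariate:

```python
import numpy as np

# Hypothetical 3-group main factor plus one dimensional covariate
rng = np.random.default_rng(1)
n = 30
group = np.repeat([0, 1, 2], n // 3)
covariate = rng.normal(50.0, 10.0, n)
y = 2.0 + 0.4 * covariate + np.where(group == 2, 5.0, 0.0) + rng.normal(0.0, 2.0, n)

# Full model: intercept, covariate, and two dummy variables for the 3 groups
X = np.column_stack([np.ones(n), covariate,
                     (group == 1).astype(float),
                     (group == 2).astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid_full = y - X @ beta

# Reduced model without the group dummies; the difference in residual
# sums of squares is the group effect after covariate adjustment
Xr = X[:, :2]
beta_r, *_ = np.linalg.lstsq(Xr, y, rcond=None)
resid_red = y - Xr @ beta_r

ss_groups = (resid_red ** 2).sum() - (resid_full ** 2).sum()
F = (ss_groups / 2) / ((resid_full ** 2).sum() / (n - 4))
```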
29.6.4 Repeated-Measures Arrangements
Repeated measures is the name given to analyses in which the same entity has been observed repeatedly. The repetitions can occur with changes over time, perhaps after interventions such as treatment, or with examinations of the same (unchanged) entity by different observers or systems of measurement.
29.6.4.1 Temporal Changes The most common repeated-measures situation is an ordinarycrossover study, where the same patients receive treatments A and B. The effects of treatment A vs.
treatment B in each person can be subtracted and thereby reduced to a single group of increments, which
can be analyzed with a paired t test, as discussed in Section 7.8.2.2. The same analysis of increments
can be used for the before-and-after measurements of the effect in patients receiving a particular
treatment, such as the results shown earlier for blood glucose in Table 7.4.
Because the situations just described can easily be managed with paired t tests, the repeated-measures form of ANOVA is usually reserved for situations in which the same entity has been measured at three or more time points. The variables that become the main factors in the analysis are the times and the groups (such as treatment). Interaction terms can be added for the effects of groups × times.
Source            DF    Sum of Squares    Mean Square     F Value    Pr > F
Model              4     6087.9081999    1521.9770500       3.61     0.0111
Error             55    23216.7051334     422.1219115
Corrected Total   59    29304.6133333

R-Square     C.V.        Root MSE     SURVIVE Mean
0.207746     139.1350    20.545606    14.766667

Source       DF    Type I SS       Mean Square      F Value    Pr > F
TNMSTAGE      1    4897.3897453    4897.3897453      11.60     0.0012
HISTOL        3    1190.5184546     396.8394849       0.94     0.4276

FIGURE 29.5 Printout of Analysis of Covariance for data in Figure 29.3, with TNM stage used as ranked variable.
Four major problems, for which consensus solutions do not yet exist, arise when the same entity is
measured repeatedly over time:
1. Independence. The first problem is violation of the assumption that the measurements are independent. The paired t test manages this problem by reducing the pair of measurements to their increment, which becomes a simple new variable. This tactic may not always be suitably employed with more than two sets of repeated measurements.
2. Incremental components. A second problem is the choice of components for calculating incremental changes for each person. Suppose t0 is an individual baseline value, and the subsequent values are t1, t2, and t3. Do we always measure increments from the baseline value, i.e., t1 − t0, t2 − t0, and t3 − t0, or should the increments be listed successively as t1 − t0, t2 − t1, t3 − t2?
3. Summary index of response. If a treatment is imposed after the baseline value at t0, what is the best single index for summarizing the post-therapeutic response? Should it be the mean of the post-treatment values, the increment between t0 and the last measurement, or a regression line for the set of values?
4. Neglect of trend. This problem is discussed further in Section 29.8. As noted earlier, an
ordinary analysis of variance does not distinguish between unranked nominal and ranked
ordinal categories in the independent polytomous variable. If the variable represents serial
points in time, their ranking may produce a trend, but it will be neglected unless special
arrangements are used in the calculations.
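Problem 2, the choice of incremental components, can be made concrete with a small sketch for one hypothetical subject measured at baseline and three later points:

```python
# Hypothetical subject: baseline t0 and three subsequent measurements
t = [100.0, 92.0, 88.0, 90.0]                    # t0, t1, t2, t3

from_baseline = [x - t[0] for x in t[1:]]        # t1-t0, t2-t0, t3-t0
successive = [b - a for a, b in zip(t, t[1:])]   # t1-t0, t2-t1, t3-t2

assert from_baseline == [-8.0, -12.0, -10.0]
assert successive == [-8.0, -4.0, 2.0]
# The successive increments always sum to the last from-baseline increment,
# but the two listings tell different stories about the pattern of change.
assert sum(successive) == from_baseline[-1]
```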
29.6.4.2 Intraclass Correlations Studies of observer or instrument variability can also be regarded as a type of repeated measures, for which the results are commonly cited with an intraclass correlation coefficient (ICC).

As noted in Section 20.7.3, the basic concept was developed as a way of assessing agreement for measurements of a dimensional variable, such as height or weight, between members of the same class, such as brothers in a family. To avoid the inadequacy of a correlation coefficient, the data were appraised with a repeated-measures analysis-of-variance. To avoid decisions about which member of a pair should be listed as the first or second measurement, all possible pairs were listed twice, with each member as the first measurement and then as the second. The total sum of squares could be partitioned into one sum for variability between the individuals being rated, i.e., the subjects (SSS), and another sum of squares due to residual error (SSE). The intraclass correlation was then calculated as

RI = (SSS − SSE)/(SSS + SSE)

The approach was later adapted for psychometric definitions of reliability. The appropriate means for the sums of squares were symbolized as sc² for variance in the subjects and se² for the corresponding residual errors. Reliability was then defined as

RI = sc²/(sc² + se²)

Using the foregoing symbols, when each of a set of n persons is measured by each of a set of r raters, the variance of a single observation, s², can be partitioned as

s² = sc² + sr² + se²

where sr² is the mean of the appropriate sums of squares for the raters.
These variances can be arranged into several formulas for calculating RI. The different arrangements
depend on the models used for the sampling and the interpretation.6 In a worked example cited by
Everitt,7 vital capacity was measured by four raters for each of 20 patients. The total sum of squares for
the 80 observations, with d.f. = 79, was divided into three sets of sums of squares: (1) for the four
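A minimal numeric sketch of the double-entry calculation described above, using the formula RI = (SSS − SSE)/(SSS + SSE) on hypothetical paired measurements:

```python
import numpy as np

# Hypothetical paired measurements (e.g., heights of pairs of brothers)
pairs = np.array([[170.0, 172.0], [165.0, 166.0], [180.0, 177.0],
                  [158.0, 160.0], [175.0, 176.0]])

# Double entry: every pair is listed twice, once in each order, so that
# neither member is arbitrarily designated "first" or "second"
doubled = np.vstack([pairs, pairs[:, ::-1]])
grand = doubled.mean()
subject_means = doubled.mean(axis=1)              # one mean per listed pair

# Between-subject sum of squares (2 observations per listed pair)
ss_s = 2 * ((subject_means - grand) ** 2).sum()
# Residual (within-pair) sum of squares
ss_e = ((doubled - subject_means[:, None]) ** 2).sum()

icc = (ss_s - ss_e) / (ss_s + ss_e)
assert -1.0 <= icc <= 1.0   # for these closely agreeing pairs, icc is near 1
```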
observers with d.f. = 3; (2) for the 20 patients with d.f. = 19; and (3) for the residual error with d.f. = 3 × 19 = 57. The formula used by Everitt for calculating the intraclass correlation coefficient was

RI = n(sc² − se²)/[n·sc² + r·sr² + (nr − n − r)·se²]

A counterpart formula, using SSR to represent the sum of squares for raters, is

RI = (SSS − SSE)/(SSS + SSE + 2·SSR)
The intraclass correlation coefficient (ICC) can be used when laboratory measurements of instrument variability are expressed in dimensional data. Nevertheless, as discussed in Chapter 20, most laboratories prefer to use simpler pair-wise and other straightforward statistical approaches that are easier to understand and interpret than the ICC.
The simpler approaches may also have mathematical advantages that have been cited by Bland and Altman,8 who contend that the ICC, although appropriate for repetitions of the same measurement, is unsatisfactory when dealing with measurements by two different methods, where there is no ordering of the repeated measures and hence no obvious choice of X or Y. Other disadvantages ascribed to the ICC are that it depends on the range of measurement and is not related to the actual scale of measurement or to the size of error that might be clinically allowable. Instead, Bland and Altman recommend their limits of agreement method, which was discussed throughout Section 20.7.1. The method relies on examining the increments in measurement for each subject. The mean difference then indicates bias, and the standard deviation is used to calculate a 95% descriptive zone for the limits of agreement. A plot of the differences against the mean value of each pair will indicate whether the discrepancies in measurement diverge as the measured values increase.
For categorical data, concordance is usually expressed (see Chapter 20) with other indexes of variability, such as kappa, which yields the same results as the intraclass coefficient in pertinent situations.
29.7 Non-Parametric Methods of Analysis
The mathematical models of ANOVA require diverse assumptions about Gaussian distributions and homoscedastic (i.e., similar) variances. These assumptions can be avoided by converting the dimensional data to ranks and analyzing the values of the ranks. The Kruskal-Wallis procedure, which is the eponym for a one-way ANOVA using ranked data, corresponds to a Wilcoxon-Mann-Whitney U test for 3 or more groups. The Friedman procedure, which refers to a two-way analysis of ranked data, was proposed almost 60 years ago by Milton Friedman, who later became more famous in economics than in statistics.
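The Kruskal-Wallis statistic can be computed directly from the pooled ranks. The sketch below, on hypothetical three-group data, uses the common formula H = [12/(N(N + 1))] Σ Ri²/ni − 3(N + 1), without a correction for ties:

```python
import numpy as np

def avg_ranks(values):
    """1-based ranks, with ties receiving the average (mid) rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_v = values[order]
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_v[j + 1] == sorted_v[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1   # average of ranks i+1 .. j+1
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """H statistic for a one-way ANOVA on ranks (no tie correction)."""
    pooled = np.concatenate(groups)
    N = len(pooled)
    ranks = avg_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        r = ranks[start:start + len(g)]           # this group's ranks
        h += r.sum() ** 2 / len(g)                # R_i^2 / n_i
        start += len(g)
    return 12.0 / (N * (N + 1)) * h - 3.0 * (N + 1)

# Hypothetical three-group comparison
groups = [[6.4, 6.8, 7.2], [8.3, 8.7, 9.1], [5.1, 5.5, 5.9]]
H = kruskal_wallis_h(groups)
```

Under the null hypothesis, H is referred to a chi-square distribution with (number of groups − 1) degrees of freedom.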
29.8 Problems in Analysis of Trends
If a variable has ordinal grades, the customary ANOVA procedure will regard the ranked categories merely as nominal, and will not make provision for the possible or anticipated trend associated with different ranks. The problem occurs with an ordinal variable, such as TNM stage in Figure 29.1, because the effect of an increasing stage is ignored. The neglect of a ranked effect can be particularly important when the independent variable (or factor) is time, for which the effects might be expected to occur in a distinct temporal sequence. This problem in repeated-measures ANOVA evoked a denunciation by Sheiner,9 who contended that the customary ANOVA methods were wholly inappropriate for many studies of the time effects of pharmacologic agents.
The appropriate form of analysis can be carried out, somewhat in the manner of the chi-square test for linear trend in an array of proportions (see Chapter 27), by assigning arbitrary coding values (such as 1, 2, 3, 4) to the ordinal categories. The process is usually done more easily and simply, however, as a linear regression analysis.
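A minimal sketch of the regression approach to trend, with hypothetical ordinal grades coded 1-4 and a hypothetical declining outcome:

```python
import numpy as np

# Ordinal grades coded 1-4 (arbitrary, equally spaced codes), three
# hypothetical outcome values per grade
codes = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype=float)
y = np.array([30.0, 28.0, 32.0, 24.0, 25.0, 23.0,
              18.0, 17.0, 19.0, 11.0, 12.0, 10.0])

# Fit a straight line of outcome on the coded grades
slope, intercept = np.polyfit(codes, y, 1)

# Sum of squares attributable to the linear trend (1 d.f.)
ss_trend = slope ** 2 * ((codes - codes.mean()) ** 2).sum()
assert slope < 0   # the ranked grades show a declining trend
```

The trend component uses only 1 degree of freedom, in contrast to the 3 d.f. that a customary ANOVA would assign to a 4-category nominal factor.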
29.9 Use of ANOVA in Published Literature
To find examples of ANOVA in published medical literature, the automated Colleague Medical Database
was searched for papers, in English, of human-subject research that appeared in medical journals during
1991-95, and in which analysis of variance was mentioned in the abstract-summary. From the list of
possibilities, 15 were selected to cover a wide array of journals and topics. The discussion that follows
is a summary of results in those 15 articles.
A one-way analysis of variance was used to check the rate of disappearance of ethanol from venous
blood in 12 subjects who drank the same dose of alcohol in orange juice on four occasions.10 The
authors concluded that the variation between subjects exceeded the variations within subjects. Another classical one-way ANOVA was done to examine values of intestinal calcium absorption and serum
parathyroid hormone levels in three groups of people: normal controls and asthmatic patients receiving
either oral or inhaled steroid therapy.11 A one-way ANOVA compared diverse aspects of functional
status in two groups of patients receiving either fluorouracil or saline infusions for head and neck
cancer.12 In a complex but essentially one-way ANOVA, several dependent variables (intervention points,
days of monitoring, final cardiovascular function) were related to subgroups defined by APACHE II
severity scores in a surgical intensive care unit.13 (The results were also examined in a regression
analysis.) In another one-way analysis of variance, preference ratings for six different modes of teaching
and learning were evaluated14 among three groups, comprising first-year, second-year, and fourth-year
medical students in the United Arab Emirates. The results were also examined for the preferences of
male vs. female students.

In a two-way ANOVA, neurologic dysfunction at age four years was related15 to two main factors:
birth weight and location of birth in newborn intensive care units of either Copenhagen or Dublin.
Multifactor ANOVAs were applied,16 in 20 patients with conjunctival malignant melanoma, to the
relationship between 5-year survival and the counts of cells positive for proliferating cell nuclear antigen,
predominant cell type, maximum tumor depth, and site of tumor. The result, showing that patients with
low counts had better prognoses, was then confirmed with a Cox proportional hazards regression
analysis. (The latter approach would probably have been best used directly.)
Repeated measures ANOVA was used in the following studies: to check the effect of oat bran consumption
on serum cholesterol levels at four time points;17 to compare various effects (including blood pressure levels
and markers of alcohol consumption) in hypertensive men randomized to either a control group or to receive
special advice about methods of reducing alcohol consumption;18 to assess the time trend of blood pressure during a 24-hour monitoring period in patients receiving placebo or an active antihypertensive agent;19 and to monitor changes at three time points over 6 months in four indexes (body weight, serum osmolality, serum sodium, and blood urea nitrogen/creatinine ratios) for residents of a nursing home.20
The intraclass correlation coefficient was used in three other studies concerned with reliability (or
reproducibility) of the measurements performed in neuropathic tests,21 a brief psychiatric rating scale,22
and a method of grading photoageing in skin casts.23
References
1. Feinstein, 1990d; 2. Feinstein, 1996; 3. Oxford English Dictionary, 1971; 4. Lentner, 1982; 5. Tukey, 1968;
6. Shrout, 1979; 7. Everitt, 1989; 8. Bland, 1990; 9. Sheiner, 1992; 10. Jones, 1994; 11. Luengo, 1991;
12. Browman, 1993; 13. Civetta, 1992; 14. Paul, 1994; 15. Ellison, 1992; 16. Seregard, 1993; 17. Saudia, 1992;
18. Maheswaran, 1992; 19. Tomei, 1992; 20. Weinberg, 1994; 21. Dyck, 1991; 22. Hafkenscheid, 1993;
23. Fritschi, 1995.