© Department of Statistics 2012 STATS 330 Lecture 18 Slide 1 Stats 330: Lecture 18



Anova Models

• These are linear (regression) models where all the explanatory variables are categorical.

• If there is just one categorical explanatory variable, then we have the “one-way anova” model discussed in STATS 201/8

• If there are two categorical explanatory variables, then we have the “two-way anova” model, also discussed in STATS 201/8

• However, we shall regard these as just another type of regression model


Example: One way model

• In an experiment to study the effect of carcinogenic substances, six different substances were applied to cell cultures.

• The response variable (ratio) is the ratio of damaged to undamaged cells, and the explanatory variable (treatment) is the substance

• On website – carcinogenic substances data


Data

   ratio treatment
1   0.08 control
2   0.08 choral hydrate
3   0.10 diazapan
4   0.10 hydroquinone
5   0.07 econidazole
6   0.17 colchicine
7   0.08 control
8   0.10 choral hydrate
9   0.08 diazapan
10  0.10 hydroquinone
11  0.08 econidazole
12  0.19 colchicine
13  0.09 control
14  0.08 choral hydrate
15  0.12 diazapan
. . . More data


Distributions skewed?

boxplot(ratio~treatment, data=carcin.df, ylab = "ratio", main = "Ratios for different substances")

[Figure: boxplots of ratio for each treatment (control, chloralhydrate, colchicine, diazapan, econidazole, hydroquinone), titled "Ratios for different substances"]


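The model

$\text{ratio} = \text{mean} + \text{error}$

where the mean depends on the substance. Thus,

For the control: $\text{ratio} = \text{mean}_{\text{Control}} + \text{error}$

For colchicine: $\text{ratio} = \text{mean}_{\text{Colchicine}} + \text{error}$

.....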

We make the usual assumptions about the errors (normal, equal variance, independent etc)


Offset form

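$\text{mean}_{\text{Control}} = \mu$

$\text{mean}_{\text{ChloralHydrate}} = \mu + \delta_{\text{ChloralHydrate}}$

$\text{mean}_{\text{Diazapan}} = \mu + \delta_{\text{Diazapan}}$

$\text{mean}_{\text{Hydroquinone}} = \mu + \delta_{\text{Hydroquinone}}$

$\text{mean}_{\text{Econidazole}} = \mu + \delta_{\text{Econidazole}}$

$\text{mean}_{\text{Colchicine}} = \mu + \delta_{\text{Colchicine}}$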


Dummy variable form

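• Define:
  CH = 1 if treatment = chloral hydrate, 0 else
  D = 1 if treatment = diazapan, 0 else
  H = 1 if treatment = hydroquinone, 0 else
  E = 1 if treatment = econidazole, 0 else
  C = 1 if treatment = colchicine, 0 else

• Then

$\text{ratio} = \mu + \delta_{\text{ChloralHydrate}}\,CH + \delta_{\text{Diazapan}}\,D + \delta_{\text{Hydroquinone}}\,H + \delta_{\text{Econidazole}}\,E + \delta_{\text{Colchicine}}\,C + \text{error}$

where $\mu$ is the baseline (control) mean and each $\delta$ is the offset for that substance.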
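The dummy variables defined above are what R builds automatically from a factor. The following is a minimal sketch, assuming a made-up treatment vector with two observations per substance (not the course data); it only shows the design matrix that lm would use when control is the baseline level.

```r
# Hypothetical treatment vector: two observations per substance (not the real data).
substances <- c("control", "chloralhydrate", "colchicine",
                "diazapan", "econidazole", "hydroquinone")
treatment <- factor(rep(substances, times = 2), levels = substances)

# model.matrix() returns the design matrix lm() would use:
# an intercept column plus one 0/1 dummy for each non-baseline level.
X <- model.matrix(~ treatment)
colnames(X)

# Rows for the control group have every dummy equal to zero,
# so their mean is the intercept alone.
X[treatment == "control", -1]
```

Because control is listed first in levels, it becomes the baseline and gets no dummy column of its own.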


Estimation

• To estimate the offsets and the baseline (control) mean, we use lm as usual. We have to rearrange the levels to make the control the baseline.

carcin.df = read.table(file.choose(), header=TRUE)
carcin.df$treatment = factor(carcin.df$treatment,
    levels = c("control", "chloralhydrate", "colchicine",
               "diazapan", "econidazole", "hydroquinone"))
summary(lm(ratio~treatment, data=carcin.df))


lm output

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              0.23660    0.02037  11.616  < 2e-16 ***
treatmentchloralhydrate  0.03240    0.02880   1.125  0.26158
treatmentcolchicine      0.21160    0.02880   7.346 2.02e-12 ***
treatmentdiazapan        0.04420    0.02880   1.534  0.12599
treatmenteconidazole     0.02820    0.02880   0.979  0.32838
treatmenthydroquinone    0.07540    0.02880   2.618  0.00931 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.144 on 294 degrees of freedom
Multiple R-squared: 0.1903, Adjusted R-squared: 0.1766
F-statistic: 13.82 on 5 and 294 DF, p-value: 3.897e-12


[Figure: diagnostic plots for the fitted model, in four panels: Residuals vs Fitted; Normal Q-Q; Scale-Location; Constant Leverage: Residuals vs Factor Levels. Observations 100, 199 and 200 are labelled.]

Slide annotations: "Non-normal?", "Variances about equal", "ignore"


boxcoxplot(ratio~ treatment, data=carcin.df)


Analyzing ¼ power

WB test: previous p=0.00

Current p=0.06

Normality better

[Figure: diagnostic plots for the ¼ power model, in four panels: Residuals vs Fitted; Normal Q-Q; Scale-Location; Constant Leverage: Residuals vs Factor Levels. Observations 50, 100 and 200 are labelled.]


Analyzing ¼ power: summary

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              0.68528    0.01244  55.105  < 2e-16 ***
treatmentchloralhydrate  0.01744    0.01759   0.992   0.3222
treatmentcolchicine      0.12120    0.01759   6.891 3.37e-11 ***
treatmentdiazapan        0.02993    0.01759   1.702   0.0898 .
treatmenteconidazole     0.01470    0.01759   0.836   0.4039
treatmenthydroquinone    0.04104    0.01759   2.333   0.0203 *

Residual standard error: 0.08793 on 294 degrees of freedom
Multiple R-squared: 0.1714, Adjusted R-squared: 0.1573
F-statistic: 12.16 on 5 and 294 DF, p-value: 1.008e-10


Testing equality of means

The standard F-test for equality of means is computed using the anova function

Here we compare the equal-means model (the null model) with the different-means model; there is only one term in the model.

> quarter.lm <- lm(ratio^(1/4)~treatment, data=carcin.df)
> anova(quarter.lm)
Analysis of Variance Table

Response: ratio^(1/4)
           Df  Sum Sq Mean Sq F value    Pr(>F)
treatment   5 0.47017 0.09403  12.161 1.008e-10 ***
Residuals 294 2.27337 0.00773

Highly significant differences
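As a check on how the table is put together, the F value can be recomputed by hand from the sums of squares and degrees of freedom printed above. This is only an arithmetic sketch using the rounded numbers from the anova table, so the result should agree with the printed F value of 12.161 up to rounding.

```r
# Numbers copied from the anova table for the quarter-power model.
ss_treatment <- 0.47017; df_treatment <- 5
ss_residual  <- 2.27337; df_residual  <- 294

# Mean squares are sums of squares divided by their degrees of freedom.
ms_treatment <- ss_treatment / df_treatment
ms_residual  <- ss_residual / df_residual

# F is the ratio of the two mean squares; the p-value is the upper tail
# of the F distribution with (5, 294) degrees of freedom.
f_stat  <- ms_treatment / ms_residual
p_value <- pf(f_stat, df_treatment, df_residual, lower.tail = FALSE)

round(f_stat, 3)
signif(p_value, 4)
```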


Oneway plot (s20x)

> onewayPlot(quarter.lm)

[Figure: plot of `ratio^(1/4)' by levels of `treatment' (control, chloralhydrate, colchicine, diazapan, econidazole, hydroquinone), with Tukey intervals (95%, pooled SDs)]

Tukey: all intervals cover the true values with 95% probability


Two factors: example

Experiment to study weight gain in rats

– Response is weight gain over a fixed time period

– This is modelled as a function of diet (Beef, Cereal, Pork) and amount of feed (High, Low)


Data

> diets.df
   gain source level
1    73   Beef  High
2    98 Cereal  High
3    94   Pork  High
4    90   Beef   Low
5   107 Cereal   Low
6    49   Pork   Low
7   102   Beef  High
8    74 Cereal  High
9    79   Pork  High
10   76   Beef   Low
. . . 60 observations in all


Two factors: the model

• If the (continuous) response depends on two categorical explanatory variables, then we assume that the response is normally distributed with a mean depending on the combination of factor levels: if the factors are A and B, the mean at the i th level of A and the j th level of B is $\mu_{ij}$

• Other standard assumptions (equal variance, normality, independence) apply


Diagrammatically…

              Source = Beef   Source = Cereal   Source = Pork
Level = High  $\mu_{11}$      $\mu_{12}$        $\mu_{13}$
Level = Low   $\mu_{21}$      $\mu_{22}$        $\mu_{23}$


Decomposition of the means

• We usually want to split each “cell mean” up into 4 terms:
  – A term reflecting the overall baseline level of the response
  – A term reflecting the effect of factor A (row effect)
  – A term reflecting the effect of factor B (column effect)
  – A term reflecting how A and B interact.


Mathematically…

Overall baseline: $\mu_{11}$ (mean when both factors are at their baseline levels)

Effect of i th level of factor A (row effect): $\mu_{i1} - \mu_{11}$ (the i th level of A, at the baseline of B, expressed as a deviation from the overall baseline)

Effect of j th level of factor B (column effect): $\mu_{1j} - \mu_{11}$ (the j th level of B, at the baseline of A, expressed as a deviation from the overall baseline)

Interaction: what’s left over (see next slide)


Interactions

• Each cell (except those in the first row and column) has an interaction:

Interaction = cell mean - baseline - row effect - column effect

so that

cell mean = baseline + row effect + column effect + interaction

$\text{interaction} = \mu_{ij} - \mu_{11} - (\mu_{i1} - \mu_{11}) - (\mu_{1j} - \mu_{11}) = \mu_{ij} - \mu_{i1} - \mu_{1j} + \mu_{11}$


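Notation

• Overall baseline: $\mu = \mu_{11}$

• Main effects of A: $\alpha_i = \mu_{i1} - \mu_{11}$

• Main effects of B: $\beta_j = \mu_{1j} - \mu_{11}$

• AB interactions: $(\alpha\beta)_{ij} = \mu_{ij} - \mu_{i1} - \mu_{1j} + \mu_{11}$

• Thus, $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$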


Importance of interactions

• If the interactions are all zero, then the effect of changing levels of A is the same for all levels of B

• In mathematical terms, $\mu_{ij} - \mu_{i'j}$ doesn’t depend on j

• Equivalently, effect of changing levels of B is the same for all levels of A

• If interactions are zero, relationship between factors and response is simple


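Why are comparisons simple when interactions are zero?

$\mu_{ij} - \mu_{ij'} = (\mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}) - (\mu + \alpha_i + \beta_{j'} + (\alpha\beta)_{ij'})$

$= (\mu + \alpha_i + \beta_j) - (\mu + \alpha_i + \beta_{j'})$

$= \beta_j - \beta_{j'}$

Doesn’t depend on i!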


Splitting up the mean: rats

Cell Means

              Beef   Cereal   Pork   Baseline col
High          100    85.9     99.5   100
Low           79.2   83.9     78.7   79.2
Baseline row  100    85.9     99.5   100

Factors are : level (amount of food) and source (diet)

Row effect for Low: 79.2 – 100 = -20.8

Col effect for Cereal: 85.9 - 100 = -14.1

Col effect for Pork: 99.5 - 100 = -0.5

Low-Cereal interaction: 83.9 - 79.2 - 85.9 + 100 = 18.8

Low-Pork interaction: 78.7 - 79.2 - 99.5 + 100 = 0
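The arithmetic on this slide can be done for every cell at once. The following is a small sketch that types in the cell means from the table and applies the decomposition (baseline, row effect, column effect, interaction); the object names are chosen for illustration only.

```r
# Cell means from the slide (rows = level, columns = source).
mu <- matrix(c(100.0, 85.9, 99.5,
                79.2, 83.9, 78.7),
             nrow = 2, byrow = TRUE,
             dimnames = list(level = c("High", "Low"),
                             source = c("Beef", "Cereal", "Pork")))

baseline   <- mu["High", "Beef"]        # both factors at their baseline levels
row_effect <- mu[, "Beef"] - baseline   # effect of each level, at source = Beef
col_effect <- mu["High", ] - baseline   # effect of each source, at level = High

# interaction = cell mean - baseline - row effect - column effect
inter <- mu - baseline - outer(row_effect, col_effect, "+")

round(row_effect, 1)
round(col_effect, 1)
round(inter, 1)
```

The first row and first column of inter are zero by construction, which is why only the Low-Cereal and Low-Pork interactions appear on the slide.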


Exploratory plots

> plot.design(diets.df)

[Figure: design plot of mean of gain for the levels of each factor: source (Beef, Cereal, Pork) and level (High, Low)]

More gain on high amount of feed and Beef diet


dotplot(source~gain|level, data=diets.df)

[Figure: dot plots of gain (horizontal axis, 60 to 120) by source (Beef, Cereal, Pork), with one panel for each level (High, Low)]


Fit model

> diets.lm<-lm(gain~source+level + source:level, data=diets.df)
> summary(diets.lm)
Coefficients:
                        Estimate Std. Error   t value Pr(>|t|)
(Intercept)            1.000e+02  4.632e+00    21.589  < 2e-16
sourceCereal          -1.410e+01  6.551e+00    -2.152  0.03585
sourcePork            -5.000e-01  6.551e+00    -0.076  0.93944
levelLow              -2.080e+01  6.551e+00    -3.175  0.00247
sourceCereal:levelLow  1.880e+01  9.264e+00     2.029  0.04736
sourcePork:levelLow   -3.052e-14  9.264e+00 -3.29e-15  1.00000

Residual standard error: 14.65 on 54 degrees of freedom
Multiple R-Squared: 0.2848, Adjusted R-squared: 0.2185
F-statistic: 4.3 on 5 and 54 DF, p-value: 0.002299


Fitting as a regression model

Note that this is equivalent to fitting a regression with dummy variables R2, C2, C3

R2 = 1 if obs is in row 2, zero otherwise

C2 = 1 if obs is in column 2, zero otherwise

C3 = 1 if obs is in column 3, zero otherwise

The regression is

Y ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3)


Regression summary

> R2 = ifelse(diets.df$level=="Low",1,0)
> C2 = ifelse(diets.df$source=="Cereal",1,0)
> C3 = ifelse(diets.df$source=="Pork",1,0)
> reg.lm = lm(gain ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3), data=diets.df)
> summary(reg.lm)
Coefficients:
              Estimate Std. Error   t value Pr(>|t|)
(Intercept)  1.000e+02  4.632e+00    21.589  < 2e-16 ***
C2          -1.410e+01  6.551e+00    -2.152  0.03585 *
C3          -5.000e-01  6.551e+00    -0.076  0.93944
R2          -2.080e+01  6.551e+00    -3.175  0.00247 **
I(R2 * C2)   1.880e+01  9.264e+00     2.029  0.04736 *
I(R2 * C3)  -2.709e-14  9.264e+00 -2.92e-15  1.00000
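The equivalence between the factor fit and the dummy-variable regression can be checked on any data set with the same layout. Here is a minimal sketch on simulated data (sim.df and its gains are made up, as is the seed; they are not the rat data), comparing the two sets of coefficients.

```r
set.seed(330)  # arbitrary seed, used only so the sketch is reproducible

# Hypothetical balanced layout: 10 observations per source-by-level cell.
sim.df <- expand.grid(rep = 1:10,
                      source = c("Beef", "Cereal", "Pork"),
                      level = c("High", "Low"))
sim.df$gain <- rnorm(nrow(sim.df), mean = 90, sd = 15)  # made-up responses

# Fit 1: let R build the dummies from the factors.
factor.lm <- lm(gain ~ source * level, data = sim.df)

# Fit 2: build the dummy variables by hand, as on the slide.
sim.df$R2 <- ifelse(sim.df$level == "Low", 1, 0)
sim.df$C2 <- ifelse(sim.df$source == "Cereal", 1, 0)
sim.df$C3 <- ifelse(sim.df$source == "Pork", 1, 0)
dummy.lm <- lm(gain ~ C2 + C3 + R2 + I(R2 * C2) + I(R2 * C3), data = sim.df)

# The coefficient names differ but the estimates match term by term.
all.equal(unname(coef(factor.lm)), unname(coef(dummy.lm)))
```

The dummies are entered in the order C2, C3, R2 so that the coefficients line up with the source, level, source:level order of the factor fit.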


Testing for zero interactions

> anova(diets.lm)
Analysis of Variance Table

Response: gain
             Df  Sum Sq Mean Sq F value    Pr(>F)
source        2   266.5   133.3  0.6211 0.5411319
level         1  3168.3  3168.3 14.7666 0.0003224 ***
source:level  2  1178.1   589.1  2.7455 0.0731879 .
Residuals    54 11586.0   214.6

Some evidence of interaction


Interaction plot

> interaction.plot(diets.df$source, diets.df$level, diets.df$gain)

[Figure: interaction plot of mean of diets.df$gain against diets.df$source (Beef, Cereal, Pork), with one line for each diets.df$level (High, Low)]

Non-parallel lines indicate interaction


Do we need Source in the model?

> model1<-lm(gain~source*level, data=diets.df) # note shorthand
> model2<-lm(gain~level, data=diets.df)
> anova(model2,model1)
Analysis of Variance Table

Model 1: gain ~ level
Model 2: gain ~ source * level
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     58 13030.7
2     54 11586.0  4    1444.7 1.6833 0.1673

Not significant!

No significant effect of Source


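Notations: review

For two factors A and B

• Baseline: $\mu = \mu_{11}$

• A main effect: $\alpha_i = \mu_{i1} - \mu_{11}$

• B main effect: $\beta_j = \mu_{1j} - \mu_{11}$

• AB interaction: $(\alpha\beta)_{ij} = \mu_{ij} - \mu_{i1} - \mu_{1j} + \mu_{11}$

• Then $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$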


Zero interaction model

• If we have only one observation per factor level combination, we can’t estimate both the interactions and the error variance

• We have to assume that the interactions are zero and fit an “additive model”

gain ~ level + source

• Can test for zero interactions (in a reduced form) using the “Tukey one-degree-of-freedom test”


Possible models for two factors

For two factors A and B possible models are:

• Y~1 (Fit single mean only)
• Y~A (Cell means depend on A alone)
• Y~B (Cell means depend on B alone)
• Y~A+B (Cell means have no interaction)
• Y~A*B (General model, cell means have no restrictions)


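In terms of “effects”

General model is Y~A+B+A:B (equivalently Y~A*B). Mathematical form is

$E(Y_{ij}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$

• Y~1 implies $\alpha_i = 0$, $\beta_j = 0$, $(\alpha\beta)_{ij} = 0$

• Y~A implies $\beta_j = 0$, $(\alpha\beta)_{ij} = 0$

• Y~B implies $\alpha_i = 0$, $(\alpha\beta)_{ij} = 0$

• Y~A+B implies $(\alpha\beta)_{ij} = 0$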


Interpreting Anova

All F-tests essentially compare a model to a sub-model, using an estimate of $\sigma^2$ in the denominator:

$F = \dfrac{(RSS_{\text{Submodel}} - RSS_{\text{Model}})/d}{\hat{\sigma}^2}$

where d is the number of extra parameters in the bigger model.

The anova function can do this explicitly, as in anova(model1, model2), with the estimate of $\sigma^2$ coming from the bigger model.

When we use just 1 argument, as in anova(model1), the models being compared are selected implicitly.
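The F formula for a two-model comparison can be written as a small helper. This is a sketch only: f_compare is a hypothetical name, not a function from R or the course packages, and it is applied here to the residual sums of squares from the earlier comparison of gain ~ level with gain ~ source * level for the rat data.

```r
# Hypothetical helper: F test of a submodel against a bigger model.
# rss_* are residual sums of squares, df_* the residual degrees of freedom.
f_compare <- function(rss_sub, df_sub, rss_full, df_full) {
  d <- df_sub - df_full              # number of extra parameters in the bigger model
  sigma2_hat <- rss_full / df_full   # estimate of sigma^2 from the bigger model
  f <- ((rss_sub - rss_full) / d) / sigma2_hat
  p <- pf(f, d, df_full, lower.tail = FALSE)
  c(F = f, p.value = p)
}

# Rat diet example: RSS and residual df copied from the anova(model2, model1) output.
res <- f_compare(rss_sub = 13030.7, df_sub = 58, rss_full = 11586.0, df_full = 54)
round(res, 4)
```

Up to rounding, the result should agree with the F of 1.6833 and p-value of 0.1673 reported for that comparison.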


Interpreting Anova (cont)

• For example, consider a model with 2 factors A and B:

> anova(lm(y~A+B+A:B))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)
A          1 12.774  12.774  9.9147 0.003978 **
B          2  4.031   2.015  1.5642 0.227629
A:B        2  6.898   3.449  2.6768 0.086985 .
Residuals 27 34.788   1.288

Full-model estimate of $\sigma^2$: the residual mean square, 1.288


First line

The first line of the table compares the model y~A with a null model (all means the same), using an estimate of $\sigma^2$ = 1.288 from the full model y~A+B+A:B

> model1<-lm(y~A)
> model0<-lm(y~1)
> anova(model0,model1)
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ A
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     32 58.491
2     31 45.716  1    12.774 8.6623 0.006105 **

Difference in Numerator of F test


Second line

The second line of the table compares the “no interaction” model y~A+B with the model y~A, using an estimate of $\sigma^2$ from the full model y~A+B+A:B

> model2<-lm(y~A+B)
> model1<-lm(y~A)
> anova(model1,model2)
Analysis of Variance Table

Model 1: y ~ A
Model 2: y ~ A + B
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     31 45.716
2     29 41.685  2     4.031 1.4021 0.2623

Difference in Numerator of F test in line 2


Third line

The third line of the table compares the full model y~A+B+A:B with the “no interaction” model y~A+B, using an estimate of $\sigma^2$ from the full model

> model2<-lm(y~A+B)
> model3<-lm(y~A+B+A:B)
> anova(model2,model3)
Analysis of Variance Table

Model 1: y ~ A + B
Model 2: y ~ A + B + A:B
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     29 41.685
2     27 34.788  2     6.898 2.6768 0.08699 .

Difference in Numerator of F test in line 3


To summarise:

• Terms are added line by line

• The F-test compares the current model with the previous model

• At each stage, the estimate of $\sigma^2$ is obtained from the full model.
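The three points above can be verified with arithmetic on the printed tables. The sketch below copies the sums of squares from the anova(lm(y~A+B+A:B)) output and divides each mean square by the full-model estimate of the error variance; it then shows why the two-model comparison of y~1 with y~A gave a different F for the same sum of squares. All inputs are rounded table values, so agreement is only up to rounding.

```r
# Sums of squares and degrees of freedom copied from the one-argument anova table.
ss <- c(A = 12.774, B = 4.031, "A:B" = 6.898)
df <- c(A = 1, B = 2, "A:B" = 2)
rss_full <- 34.788; df_full <- 27

# Every line of the table uses the same denominator: the full-model estimate.
sigma2_full <- rss_full / df_full
f_sequential <- (ss / df) / sigma2_full
round(f_sequential, 3)

# anova(model0, model1) uses the residuals of y~A (the bigger of the two models)
# in the denominator, so the same numerator gives a smaller F.
sigma2_A <- 45.716 / 31
f_two_model <- (ss[["A"]] / df[["A"]]) / sigma2_A
round(f_two_model, 3)
```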


More than two factors: example

• An experiment was conducted to compare different diets for feeding chickens. The diets depended on 3 variables:
  – Source of protein (variable protein): either “groundnut” or “soybean”
  – Level of protein (variable protlevel): either 0, 1 or 2
  – Level of fish solubles (variable fish): either high or low

• Response variable was weight gain (variable chickweight)


Data

   chickweight   protein protlevel fish
1         6559 groundnut         0  Low
2         7075 groundnut         0 High
3         6564 groundnut         1  Low
4         7528 groundnut         1 High
5         6738 groundnut         2  Low
6         7333 groundnut         2 High
7         7094   soybean         0  Low
8         8005   soybean         0 High
9         6943   soybean         1  Low
10        7359   soybean         1 High
11        6748   soybean         2  Low
12        6764   soybean         2 High
. . . 24 observations in all


Data characteristics

• There are 3 factors
  – protein with 2 levels (groundnut, soybean)
  – protlevel with 3 levels (0, 1, 2)
  – fish with 2 levels (high, low)

• There are 2 x 3 x 2 = 12 factor level combinations, so 12 means

• Each combination is observed twice, so 24 observations in all
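The count of factor level combinations can be listed explicitly with expand.grid. This is a small illustrative sketch; combos is a made-up name and the level labels are taken from the data listing.

```r
# Enumerate every combination of the three factors.
combos <- expand.grid(protein = c("groundnut", "soybean"),
                      protlevel = c(0, 1, 2),
                      fish = c("High", "Low"))

nrow(combos)       # number of factor level combinations (2 x 3 x 2)
2 * nrow(combos)   # total observations with two replicates per combination
head(combos)
```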


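Interactions

Let $\mu_{ijk}$ be the population mean of all observations taken at level i of protein, level j of protlevel and level k of fish

We can split this mean up into 8 terms:

An overall baseline $\mu = \mu_{111}$

3 “main effects” e.g. $\alpha_i = \mu_{i11} - \mu_{111}$

3 “two-way interactions” e.g. $(\alpha\beta)_{ij} = \mu_{ij1} - \mu_{i11} - \mu_{1j1} + \mu_{111}$

A “3-way interaction”

$(\alpha\beta\gamma)_{ijk} = \mu_{ijk} - \mu_{ij1} - \mu_{i1k} - \mu_{1jk} + \mu_{i11} + \mu_{1j1} + \mu_{11k} - \mu_{111}$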


Interactions (cont)

Then

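$\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\beta\gamma)_{jk} + (\alpha\gamma)_{ik} + (\alpha\beta\gamma)_{ijk}$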

As before, if any one of the subscripts i, j, k is 1 then the corresponding interaction is zero.

Interpretation:

e.g. if the protlevel x fish and the 3-way interactions are all zero, then the effect of changing levels of fish is the same for all levels of protlevel.


Why?

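With the protlevel x fish and 3-way interactions all zero,

$\mu_{ijk} - \mu_{ijk'} = (\mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik}) - (\mu + \alpha_i + \beta_j + \gamma_{k'} + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik'})$

$= (\gamma_k + (\alpha\gamma)_{ik}) - (\gamma_{k'} + (\alpha\gamma)_{ik'})$

$= \gamma_k - \gamma_{k'} + (\alpha\gamma)_{ik} - (\alpha\gamma)_{ik'}$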

Doesn’t depend on j!
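A numerical illustration of the same argument, as a sketch with made-up effect sizes (none of the numbers below are estimates from the chick data): the cell means are built from the effects with the protlevel x fish and 3-way interactions set to zero, and the fish effect is then compared across the levels of protlevel.

```r
# Hypothetical effects; level 1 of each factor is the baseline, so its effect is 0.
mu0   <- 7000
alpha <- c(0, 900)          # protein,   i = 1, 2
beta  <- c(0, 250, -100)    # protlevel, j = 1, 2, 3
gamma <- c(0, -500)         # fish,      k = 1, 2
ab <- rbind(c(0, 0, 0), c(0, -700, -1000))   # protein x protlevel interactions
ag <- rbind(c(0, 0), c(0, -250))             # protein x fish interactions
# The protlevel x fish and 3-way interactions are taken to be zero.

# Build the 2 x 3 x 2 array of cell means from the effects.
mu <- array(NA_real_, dim = c(2, 3, 2))
for (i in 1:2) for (j in 1:3) for (k in 1:2) {
  mu[i, j, k] <- mu0 + alpha[i] + beta[j] + gamma[k] + ab[i, j] + ag[i, k]
}

# Effect of changing fish from level 1 to level 2, for each (i, j) combination.
fish_effect <- mu[, , 2] - mu[, , 1]
fish_effect
```

Each row of fish_effect is constant across its columns, so the fish effect may differ between proteins but not between levels of protlevel.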


Estimating terms

> model1<-lm(chickweight~protein*protlevel*fish, data=chickwts.df)
> summary(model1)
                                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                         6927.0      223.5  30.992    8e-13
proteinsoybean                       904.0      316.1   2.860   0.0144
protlevel1                           266.5      316.1   0.843   0.4156
protlevel2                           -80.0      316.1  -0.253   0.8045
fishLow                             -501.5      316.1  -1.587   0.1386
proteinsoybean:protlevel1           -772.0      447.0  -1.727   0.1098
proteinsoybean:protlevel2          -1089.0      447.0  -2.436   0.0314
proteinsoybean:fishLow              -256.0      447.0  -0.573   0.5774
protlevel1:fishLow                   -99.0      447.0  -0.221   0.8285
protlevel2:fishLow                   245.5      447.0   0.549   0.5929
proteinsoybean:protlevel1:fishLow    127.0      632.2   0.201   0.8441
proteinsoybean:protlevel2:fishLow    435.0      632.2   0.688   0.5045

Residual standard error: 316.1 on 12 degrees of freedom
Multiple R-Squared: 0.7531, Adjusted R-squared: 0.5269
F-statistic: 3.328 on 11 and 12 DF, p-value: 0.02482


Anova for the chick weights

> anova(model1)
Analysis of Variance Table

Response: chickweight
                       Df  Sum Sq Mean Sq F value   Pr(>F)
protein                 1  373003  373003  3.7334 0.077286 .
protlevel               2  636519  318260  3.1854 0.077679 .
fish                    1 1423014 1423014 14.2429 0.002653 **
protein:protlevel       2  858702  429351  4.2974 0.039134 *
protein:fish            1    7073    7073  0.0708 0.794706
protlevel:fish          2  309421  154710  1.5485 0.252201
protein:protlevel:fish  2   50036   25018  0.2504 0.782453
Residuals              12 1198926   99911
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Suggests model protein*protlevel + fish


Check

> anova(lm(chickweight~protein*protlevel + fish, data=chickwts.df),
        lm(chickweight~protein*protlevel*fish, data=chickwts.df))
Analysis of Variance Table

Model 1: chickweight ~ protein * protlevel + fish
Model 2: chickweight ~ protein * protlevel * fish
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     17 1565456
2     12 1198926  5    366530 0.7337 0.6121

Not significant, but interpret with caution

Effect of fish the same for each protein/protlevel combination


Interpretation of interactions

• If a factor (say A) does not interact with the others, the effect of changing levels of A is the same for all levels of the other factors

• If the 3-way interactions are zero, then the interaction between A and B is the same for all levels of C


Summary

• Anova models are interpreted just like regressions, except
  – No question of planarity (linear by definition)
  – Need to interpret interactions
  – Judge effect of factors by anova
  – Factors in anova added one at a time
  – Suitable for completely randomised experiments where it is reasonable to assume observations are independent
