© Department of Statistics 2012 STATS 330 Lecture 18 Slide 1 Stats 330: Lecture 18

Stats 330: Lecture 18

Anova Models

• These are linear (regression) models where all the explanatory variables are categorical.

• If there is just one categorical explanatory variable, then we have the “one-way anova” model discussed in STATS 201/8.

• If there are two categorical explanatory variables, then we have the “two-way anova” model, also discussed in STATS 201/8.

• However, we shall regard these as just another type of regression model.

Example: One way model

• In an experiment to study the effect of carcinogenic substances, six different substances were applied to cell cultures.

• The response variable (ratio) is the ratio of damaged to undamaged cells, and the explanatory variable (treatment) is the substance

• On website – carcinogenic substances data

Data

   ratio treatment
1   0.08 control
2   0.08 chloralhydrate
3   0.10 diazapan
4   0.10 hydroquinone
5   0.07 econidazole
6   0.17 colchicine
7   0.08 control
8   0.10 chloralhydrate
9   0.08 diazapan
10  0.10 hydroquinone
11  0.08 econidazole
12  0.19 colchicine
13  0.09 control
14  0.08 chloralhydrate
15  0.12 diazapan
. . . More data

Distributions skewed?

boxplot(ratio~treatment, data=carcin.df, ylab = "ratio", main = "Ratios for different substances")

[Figure: boxplots of ratio for each treatment (control, chloralhydrate, colchicine, diazapan, econidazole, hydroquinone), titled “Ratios for different substances”; vertical axis: ratio.]

The model

$$\text{ratio} = \text{mean} + \varepsilon$$

where the mean depends on the substance. Thus,

For the control: $\text{ratio} = \text{mean}_{\text{Control}} + \varepsilon$

.....

For colchicine: $\text{ratio} = \text{mean}_{\text{Colchicine}} + \varepsilon$

We make the usual assumptions about the errors (normal, equal variance, independent, etc.)

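Offset form

Writing $\mu$ for the baseline (control) mean and $\delta$ for each treatment's offset from it:

$$\begin{aligned}
\text{mean}_{\text{Control}} &= \mu \\
\text{mean}_{\text{ChloralHydrate}} &= \mu + \delta_{\text{ChloralHydrate}} \\
\text{mean}_{\text{Diazapan}} &= \mu + \delta_{\text{Diazapan}} \\
\text{mean}_{\text{Hydroquinone}} &= \mu + \delta_{\text{Hydroquinone}} \\
\text{mean}_{\text{Econidazole}} &= \mu + \delta_{\text{Econidazole}} \\
\text{mean}_{\text{Colchicine}} &= \mu + \delta_{\text{Colchicine}}
\end{aligned}$$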

Dummy variable form

• Define:
CH = 1 if treatment = chloral hydrate, 0 else
D = 1 if treatment = diazapan, 0 else
H = 1 if treatment = hydroquinone, 0 else
E = 1 if treatment = econidazole, 0 else
C = 1 if treatment = colchicine, 0 else

• Then

$$\text{ratio} = \mu + \delta_{\text{ChloralHydrate}}\,CH + \delta_{\text{Diazapan}}\,D + \delta_{\text{Hydroquinone}}\,H + \delta_{\text{Econidazole}}\,E + \delta_{\text{Colchicine}}\,C + \varepsilon$$

Estimation

• To estimate the offsets and the baseline (control) mean, we use lm as usual. We have to rearrange the levels to make the control the baseline.

# read the data, choosing the file interactively
carcin.df = read.table(file.choose(), header=TRUE)
# list "control" first so that it becomes the baseline level
carcin.df$treatment = factor(carcin.df$treatment,
    levels = c("control", "chloralhydrate", "colchicine", "diazapan", "econidazole", "hydroquinone"))
summary(lm(ratio~treatment, data=carcin.df))
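As a rough illustration of what the baseline choice does, the sketch below uses simulated data rather than the carcinogenic substances data. The data frame `sim.df`, the seed and the group means are all made up for the example. It checks that, with "control" listed first, the intercept is the control mean and the other coefficients are offsets from it.

```r
# Simulated stand-in for the real data (hypothetical values, 3 treatments only).
set.seed(330)
sim.df <- data.frame(
  treatment = rep(c("colchicine", "control", "diazapan"), each = 20),
  ratio     = rnorm(60, mean = rep(c(0.45, 0.24, 0.28), each = 20), sd = 0.1)
)

# Put "control" first so that it becomes the baseline level.
sim.df$treatment <- factor(sim.df$treatment,
                           levels = c("control", "colchicine", "diazapan"))

sim.lm <- lm(ratio ~ treatment, data = sim.df)

# Sample mean of ratio within each treatment, in factor-level order.
group.means <- sapply(split(sim.df$ratio, sim.df$treatment), mean)

# Intercept = control mean; remaining coefficients = offsets from the control mean.
print(coef(sim.lm))
print(group.means - group.means["control"])
```

Without the `factor(..., levels = ...)` step, R would order the levels alphabetically, so in this toy example "colchicine" would be the baseline instead.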

lm output

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              0.23660    0.02037  11.616  < 2e-16 ***
treatmentchloralhydrate  0.03240    0.02880   1.125  0.26158
treatmentcolchicine      0.21160    0.02880   7.346 2.02e-12 ***
treatmentdiazapan        0.04420    0.02880   1.534  0.12599
treatmenteconidazole     0.02820    0.02880   0.979  0.32838
treatmenthydroquinone    0.07540    0.02880   2.618  0.00931 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.144 on 294 degrees of freedom
Multiple R-squared: 0.1903, Adjusted R-squared: 0.1766
F-statistic: 13.82 on 5 and 294 DF, p-value: 3.897e-12

[Figure: diagnostic plots for the fit to the untransformed ratio. Panels: Residuals vs Fitted; Normal Q-Q; Scale-Location; Constant Leverage: Residuals vs Factor Levels. Observations 100, 199 and 200 are labelled.]

Annotations on the plots: “Non-normal?”, “Variances about equal”, “ignore”.

boxcoxplot(ratio~ treatment, data=carcin.df)

Analyzing ¼ power

WB test: previous p=0.00

Current p=0.06

Normality better

[Figure: diagnostic plots for the fit to ratio^(1/4). Panels: Residuals vs Fitted; Normal Q-Q; Scale-Location; Constant Leverage: Residuals vs Factor Levels. Observations 50, 100 and 200 are labelled.]

Analyzing ¼ power: summary

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              0.68528    0.01244  55.105  < 2e-16 ***
treatmentchloralhydrate  0.01744    0.01759   0.992   0.3222
treatmentcolchicine      0.12120    0.01759   6.891 3.37e-11 ***
treatmentdiazapan        0.02993    0.01759   1.702   0.0898 .
treatmenteconidazole     0.01470    0.01759   0.836   0.4039
treatmenthydroquinone    0.04104    0.01759   2.333   0.0203 *

Residual standard error: 0.08793 on 294 degrees of freedom
Multiple R-squared: 0.1714, Adjusted R-squared: 0.1573
F-statistic: 12.16 on 5 and 294 DF, p-value: 1.008e-10

Testing equality of means

The standard F-test for equality of means is computed using the anova function.

Here we are comparing the equal-means model (null model) with the different-means model; there is only one term in the model.

> quarter.lm <- lm(ratio^(1/4)~treatment, data=carcin.df)
> anova(quarter.lm)
Analysis of Variance Table

Response: ratio^(1/4)
           Df  Sum Sq Mean Sq F value    Pr(>F)
treatment   5 0.47017 0.09403  12.161 1.008e-10 ***
Residuals 294 2.27337 0.00773

Highly significant differences
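To make the "null model versus different-means model" comparison concrete, here is a small sketch on simulated data. The data frame `toy.df`, the seed and the group means are invented for illustration and are not the lecture data. It fits both models explicitly and checks that the one-argument and two-argument forms of `anova` give the same F statistic when the model has a single term.

```r
# Simulated one-way layout (hypothetical means; NOT the lecture data).
set.seed(18)
toy.df <- data.frame(
  group = factor(rep(c("g1", "g2", "g3", "g4"), each = 15)),
  y     = rnorm(60, mean = rep(c(10, 10, 11, 13), each = 15), sd = 2)
)

null.lm <- lm(y ~ 1, data = toy.df)      # equal-means (null) model
full.lm <- lm(y ~ group, data = toy.df)  # different-means model

one.arg <- anova(full.lm)           # models compared implicitly
two.arg <- anova(null.lm, full.lm)  # models compared explicitly

# The two F statistics should coincide.
print(one.arg[["F value"]][1])
print(two.arg[["F"]][2])
```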

Oneway plot (s20x)

> onewayPlot(quarter.lm)

[Figure: plot of `ratio^(1/4)' by levels of `treatment' (control, chloralhydrate, colchicine, diazapan, econidazole, hydroquinone), with TUKEY intervals (95%, pooled SDs); horizontal axis: treatment, vertical axis: ratio^(1/4).]

Tukey intervals: all of them cover the true values simultaneously with 95% probability
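For readers without the s20x package, a related calculation is available in base R through `TukeyHSD`, which reports simultaneous intervals for the pairwise differences between group means. This is a different display from the s20x plot, and the sketch below runs on simulated data (`toy.df`, the seed and the means are made up), so it only illustrates the idea of simultaneous 95% coverage.

```r
# Simulated data on roughly the quarter-power scale (hypothetical values).
set.seed(16)
toy.df <- data.frame(
  treatment = factor(rep(c("control", "a", "b"), each = 12),
                     levels = c("control", "a", "b")),
  y = rnorm(36, mean = rep(c(0.70, 0.72, 0.80), each = 12), sd = 0.08)
)

# TukeyHSD needs an aov fit; the model is the same one-way model as with lm.
toy.aov <- aov(y ~ treatment, data = toy.df)
tukey   <- TukeyHSD(toy.aov, "treatment", conf.level = 0.95)

# One row per pair of treatments: difference, lower limit, upper limit, adjusted p.
print(tukey$treatment)
```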

Two factors: example

Experiment to study weight gain in rats

– Response is weight gain over a fixed time period

– This is modelled as a function of diet (Beef, Cereal, Pork) and amount of feed (High, Low)

Data

> diets.df
   gain source level
1    73   Beef  High
2    98 Cereal  High
3    94   Pork  High
4    90   Beef   Low
5   107 Cereal   Low
6    49   Pork   Low
7   102   Beef  High
8    74 Cereal  High
9    79   Pork  High
10   76   Beef   Low
. . . 60 observations in all

Two factors: the model

• If the (continuous) response depends on two categorical explanatory variables, then we assume that the response is normally distributed with a mean depending on the combination of factor levels: if the factors are A and B, the mean at the i th level of A and the j th level of B is $\mu_{ij}$

• Other standard assumptions (equal variance, normality, independence) apply

Diagrammatically…

             Source = Beef   Source = Cereal   Source = Pork
Level=High   $\mu_{11}$      $\mu_{12}$        $\mu_{13}$
Level=Low    $\mu_{21}$      $\mu_{22}$        $\mu_{23}$

Decomposition of the means

• We usually want to split each “cell mean” up into 4 terms:
– A term reflecting the overall baseline level of the response
– A term reflecting the effect of factor A (row effect)
– A term reflecting the effect of factor B (column effect)
– A term reflecting how A and B interact.

Mathematically…

Overall baseline: $\mu_{11}$ (mean when both factors are at their baseline levels)

Effect of i th level of factor A (row effect): $\mu_{i1} - \mu_{11}$ (the i th level of A, at the baseline of B, expressed as a deviation from the overall baseline)

Effect of j th level of factor B (column effect): $\mu_{1j} - \mu_{11}$ (the j th level of B, at the baseline of A, expressed as a deviation from the overall baseline)

Interaction: what’s left over (see next slide)

Interactions

• Each cell (except the first row and column) has an interaction:

Interaction = cell mean - baseline - row effect - column effect

$$\text{interaction} = \mu_{ij} - \mu_{11} - (\mu_{i1} - \mu_{11}) - (\mu_{1j} - \mu_{11}) = \mu_{ij} - \mu_{i1} - \mu_{1j} + \mu_{11}$$

cell mean = baseline + row effect + column effect + interaction

$$\mu_{ij} = \mu_{11} + (\mu_{i1} - \mu_{11}) + (\mu_{1j} - \mu_{11}) + (\mu_{ij} - \mu_{i1} - \mu_{1j} + \mu_{11})$$

Notation

• Overall baseline: $\mu = \mu_{11}$

• Main effects of A: $\alpha_i = \mu_{i1} - \mu_{11}$

• Main effects of B: $\beta_j = \mu_{1j} - \mu_{11}$

• AB interactions: $(\alpha\beta)_{ij} = \mu_{ij} - \mu_{i1} - \mu_{1j} + \mu_{11}$

• Thus, $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$

Importance of interactions

• If the interactions are all zero, then the effect of changing levels of A is the same for all levels of B

• In mathematical terms, $\mu_{ij} - \mu_{i'j}$ doesn’t depend on j

• Equivalently, effect of changing levels of B is the same for all levels of A

• If interactions are zero, relationship between factors and response is simple

Why are comparisons simple when interactions are zero?

$$\begin{aligned}
\mu_{ij} - \mu_{ij'} &= (\mu + \alpha_i + \beta_j) - (\mu + \alpha_i + \beta_{j'}) \\
&= \beta_j - \beta_{j'}
\end{aligned}$$

Doesn’t depend on i!

Splitting up the mean: rats

Cell means

               Beef   Cereal   Pork   Baseline col
High           100    85.9     99.5   100
Low            79.2   83.9     78.7   79.2
Baseline row   100    85.9     99.5   100

Factors are: level (amount of food) and source (diet)

Row effect for Low: 79.2 - 100 = -20.8

Col effect for Cereal: 85.9 - 100 = -14.1

Col effect for Pork: 99.5 - 100 = -0.5

Low-Cereal interaction: 83.9 - 79.2 - 85.9 + 100 = 18.8

Low-Pork interaction: 78.7 - 79.2 - 99.5 + 100 = 0
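The arithmetic above can be reproduced for the whole table at once. The sketch below is one possible way to do it, typing the cell means from the slide into a matrix; the object names (`cell`, `row.effect`, `col.effect`, `inter`) are illustrative only.

```r
# Cell means from the slide: rows = level, columns = source.
cell <- matrix(c(100.0, 85.9, 99.5,
                  79.2, 83.9, 78.7),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("High", "Low"), c("Beef", "Cereal", "Pork")))

baseline   <- cell["High", "Beef"]        # both factors at their baseline levels
row.effect <- cell[, "Beef"] - baseline   # effect of level, at the baseline source
col.effect <- cell["High", ] - baseline   # effect of source, at the baseline level

# Interaction = cell mean - baseline - row effect - column effect.
# outer() forms row effect + column effect for every cell.
inter <- cell - baseline - outer(row.effect, col.effect, "+")

print(row.effect)
print(col.effect)
print(round(inter, 1))
```

The first row and first column of `inter` are zero by construction, which matches the statement that only cells off the baseline row and column carry an interaction.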

Exploratory plots

> plot.design(diets.df)

[Figure: design plot of the mean of gain for the levels of each factor: source (Beef, Cereal, Pork) and level (High, Low).]

More gain on high amount of feed and Beef diet

library(lattice)  # dotplot comes from the lattice package
dotplot(source~gain|level, data=diets.df)

[Figure: dot plots of gain (horizontal axis, 60 to 120) for each source (Beef, Cereal, Pork), in separate panels for level = High and level = Low.]

Fit model

> diets.lm<-lm(gain~source+level + source:level, data=diets.df)
> summary(diets.lm)
Coefficients:
                        Estimate Std. Error   t value Pr(>|t|)
(Intercept)            1.000e+02  4.632e+00    21.589  < 2e-16
sourceCereal          -1.410e+01  6.551e+00    -2.152  0.03585
sourcePork            -5.000e-01  6.551e+00    -0.076  0.93944
levelLow              -2.080e+01  6.551e+00    -3.175  0.00247
sourceCereal:levelLow  1.880e+01  9.264e+00     2.029  0.04736
sourcePork:levelLow   -3.052e-14  9.264e+00 -3.29e-15  1.00000

Residual standard error: 14.65 on 54 degrees of freedom
Multiple R-Squared: 0.2848, Adjusted R-squared: 0.2185
F-statistic: 4.3 on 5 and 54 DF, p-value: 0.002299

Fitting as a regression model

Note that this is equivalent to fitting a regression with dummy variables R2, C2, C3

R2 = 1 if obs is in row 2, zero otherwise

C2 = 1 if obs is in column 2, zero otherwise

C3 = 1 if obs is in column 3, zero otherwise

The regression is

Y ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3)

Regression summary

> R2 = ifelse(diets.df$level=="Low",1,0)
> C2 = ifelse(diets.df$source=="Cereal",1,0)
> C3 = ifelse(diets.df$source=="Pork",1,0)
> reg.lm = lm(gain ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3), data=diets.df)
> summary(reg.lm)
Coefficients:
             Estimate Std. Error   t value Pr(>|t|)
(Intercept) 1.000e+02  4.632e+00    21.589  < 2e-16 ***
C2         -1.410e+01  6.551e+00    -2.152  0.03585 *
C3         -5.000e-01  6.551e+00    -0.076  0.93944
R2         -2.080e+01  6.551e+00    -3.175  0.00247 **
I(R2 * C2)  1.880e+01  9.264e+00     2.029  0.04736 *
I(R2 * C3) -2.709e-14  9.264e+00 -2.92e-15  1.00000
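The equivalence between the factor formulation and the dummy-variable regression can be checked on any data set with the same layout. The sketch below uses simulated gains (the data frame `toy.df`, the seed and the response values are made up, so the coefficients will not match the rat data) and compares fitted values and residual sums of squares from the two fits.

```r
# Simulated 3 x 2 layout with 5 replicates per cell (hypothetical response values).
set.seed(32)
toy.df <- expand.grid(source = c("Beef", "Cereal", "Pork"),
                      level  = c("High", "Low"),
                      rep    = 1:5,
                      stringsAsFactors = TRUE)
toy.df$gain <- rnorm(nrow(toy.df), mean = 90, sd = 15)

# Factor formulation.
factor.lm <- lm(gain ~ source * level, data = toy.df)

# Dummy-variable formulation, with the same baselines (Beef, High).
toy.df$R2 <- ifelse(toy.df$level == "Low", 1, 0)
toy.df$C2 <- ifelse(toy.df$source == "Cereal", 1, 0)
toy.df$C3 <- ifelse(toy.df$source == "Pork", 1, 0)
dummy.lm <- lm(gain ~ R2 + C2 + C3 + I(R2 * C2) + I(R2 * C3), data = toy.df)

# Same fitted values and the same residual sum of squares.
print(all.equal(unname(fitted(factor.lm)), unname(fitted(dummy.lm))))
print(all.equal(deviance(factor.lm), deviance(dummy.lm)))
```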

Testing for zero interactions

> anova(diets.lm)
Analysis of Variance Table

Response: gain
             Df  Sum Sq Mean Sq F value    Pr(>F)
source        2   266.5   133.3  0.6211 0.5411319
level         1  3168.3  3168.3 14.7666 0.0003224 ***
source:level  2  1178.1   589.1  2.7455 0.0731879 .
Residuals    54 11586.0   214.6

Some evidence of interaction

Interaction plot

> interaction.plot(diets.df$source, diets.df$level, diets.df$gain)

[Figure: interaction plot of the mean of diets.df$gain against diets.df$source (Beef, Cereal, Pork), with one line for each diets.df$level (High, Low).]

Non-parallel lines indicate interaction

Do we need Source in the model?

> model1<-lm(gain~source*level, data=diets.df) # note shorthand
> model2<-lm(gain~level, data=diets.df)
> anova(model2,model1)
Analysis of Variance Table

Model 1: gain ~ level
Model 2: gain ~ source * level
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     58 13030.7
2     54 11586.0  4    1444.7 1.6833 0.1673

Not significant!

No significant effect of Source
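The F statistic in this table can be rebuilt by hand from the two residual sums of squares. The following sketch only reuses the numbers printed above; the variable names are illustrative, and the results should agree with the table up to rounding.

```r
# Values copied from the anova table above.
rss.small <- 13030.7; df.small <- 58   # gain ~ level
rss.big   <- 11586.0; df.big   <- 54   # gain ~ source * level

d          <- df.small - df.big        # number of extra parameters in the bigger model
sigma2.hat <- rss.big / df.big         # error variance estimate from the bigger model

F.stat  <- ((rss.small - rss.big) / d) / sigma2.hat
p.value <- pf(F.stat, d, df.big, lower.tail = FALSE)

print(round(c(F = F.stat, p = p.value), 4))
```

The 4 degrees of freedom in the numerator are the 2 for source plus the 2 for the source:level interaction, so the test asks whether every term involving source can be dropped.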

Notations: review

For two factors A and B

• Baseline: $\mu = \mu_{11}$

• A main effect: $\alpha_i = \mu_{i1} - \mu_{11}$

• B main effect: $\beta_j = \mu_{1j} - \mu_{11}$

• AB interaction: $(\alpha\beta)_{ij} = \mu_{ij} - \mu_{i1} - \mu_{1j} + \mu_{11}$

• Then $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$

Zero interaction model

• If we have only one observation per factor level combination, we can’t estimate both the interactions and the error variance

• We have to assume that the interactions are zero and fit an “additive model”

gain ~ level + source

• Can test for zero interactions (in a reduced form) using the “Tukey one-degree-of-freedom test”
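A quick way to see the problem with one observation per cell is to count residual degrees of freedom. The sketch below uses a simulated 3 x 2 layout with a single observation in each cell (`one.df`, the seed and the response values are made up); the interaction model uses up all six observations, while the additive model leaves two degrees of freedom for the error variance.

```r
# One simulated observation per source/level combination (hypothetical values).
set.seed(37)
one.df <- expand.grid(source = c("Beef", "Cereal", "Pork"),
                      level  = c("High", "Low"),
                      stringsAsFactors = TRUE)
one.df$gain <- rnorm(nrow(one.df), mean = 90, sd = 15)

full.lm     <- lm(gain ~ level * source, data = one.df)  # 6 parameters, 6 observations
additive.lm <- lm(gain ~ level + source, data = one.df)  # 4 parameters, 6 observations

print(df.residual(full.lm))      # 0: nothing left to estimate the error variance
print(df.residual(additive.lm))  # 2
```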

Possible models for two factors

For two factors A and B possible models are:

• Y~1 (Fit single mean only)
• Y~A (Cell means depend on A alone)
• Y~B (Cell means depend on B alone)
• Y~A+B (Cell means have no interaction)
• Y~A*B (General model, cell means have no restrictions)

In terms of “effects”

General model is Y~A+B+A:B (equivalently Y~A*B). Mathematical form is

$$E(Y_{ij}) = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$$

• Y~1 implies $\alpha_i = 0$, $\beta_j = 0$, $(\alpha\beta)_{ij} = 0$

• Y~A implies $\beta_j = 0$, $(\alpha\beta)_{ij} = 0$

• Y~B implies $\alpha_i = 0$, $(\alpha\beta)_{ij} = 0$

• Y~A+B implies $(\alpha\beta)_{ij} = 0$
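One way to see which effects each formula keeps is to count the columns of the design matrix R builds for it. The sketch below assumes a hypothetical layout with 2 levels of A and 3 levels of B (the names `grid`, `formulas` and `n.par` are illustrative); each restriction to zero removes the corresponding columns.

```r
# Hypothetical layout: A has 2 levels, B has 3 levels, one row per cell.
grid <- expand.grid(A = factor(c("a1", "a2")),
                    B = factor(c("b1", "b2", "b3")))

# Right-hand sides of the five candidate models.
formulas <- list("Y~1"   = ~ 1,
                 "Y~A"   = ~ A,
                 "Y~B"   = ~ B,
                 "Y~A+B" = ~ A + B,
                 "Y~A*B" = ~ A * B)

# Number of mean parameters = number of columns in the design matrix.
n.par <- sapply(formulas, function(f) ncol(model.matrix(f, data = grid)))
print(n.par)
```

Only Y~A*B has as many parameters as there are cells (2 x 3 = 6), which is why it places no restrictions on the cell means.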

Interpreting Anova

All F-tests essentially compare a model to a sub-model, using an estimate of $\sigma^2$ in the denominator:

$$F = \frac{(\text{RSS}_{\text{Submodel}} - \text{RSS}_{\text{Model}})/d}{\hat{\sigma}^2}$$

where d is the number of extra parameters in the bigger model.

The anova function can do this explicitly, as in anova(model1, model2), with the estimate of $\sigma^2$ coming from the bigger model.

When we use just 1 argument, as in anova(model1), the models being compared are selected implicitly.

Interpreting Anova (cont)

• For example, consider a model with 2 factors A and B:

> anova(lm(y~A+B+A:B))
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)
A          1 12.774  12.774  9.9147 0.003978 **
B          2  4.031   2.015  1.5642 0.227629
A:B        2  6.898   3.449  2.6768 0.086985 .
Residuals 27 34.788   1.288

The residual mean square (1.288) is the full-model estimate of $\sigma^2$

First line

The first line of the table compares the model y~A with a null model (all means the same), using an estimate of $\sigma^2$ = 1.288 from the full model y~A+B+A:B

> model1<-lm(y~A)
> model0<-lm(y~1)
> anova(model0,model1)
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ A
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     32 58.491
2     31 45.716  1    12.774 8.6623 0.006105 **

The difference in RSS (Sum of Sq = 12.774) is the numerator of the F test in line 1. The F value here differs from the 9.9147 in the table because this two-model comparison takes its estimate of $\sigma^2$ from y~A, not from the full model.

Second line

The second line of the table compares the “no interaction” model y~A+B with the model y~A, using an estimate of $\sigma^2$ from the full model y~A+B+A:B

> model2<-lm(y~A+B)
> model1<-lm(y~A)
> anova(model1,model2)
Analysis of Variance Table

Model 1: y ~ A
Model 2: y ~ A + B
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     31 45.716
2     29 41.685  2     4.031 1.4021 0.2623

The difference in RSS (Sum of Sq = 4.031) is the numerator of the F test in line 2. Again the F value differs from the table (1.5642) because the denominator here comes from y~A+B.

Third line

The third line of the table compares the full model y~A+B+A:B with the “no interaction” model y~A+B, using an estimate of $\sigma^2$ from the full model

> model2<-lm(y~A+B)
> model3<-lm(y~A+B+A:B)
> anova(model2,model3)
Analysis of Variance Table

Model 1: y ~ A + B
Model 2: y ~ A + B + A:B
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     29 41.685
2     27 34.788  2     6.898 2.6768 0.08699 .

The difference in RSS (Sum of Sq = 6.898) is the numerator of the F test in line 3

To summarise:

• Terms are added line by line

• The F-test compares the current model with the previous model

• At each stage, the estimate of $\sigma^2$ is obtained from the full model.
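The role of the denominator can be checked with the numbers printed on the earlier slides. The sketch below recomputes the F value for B in two ways: with the full-model estimate of the error variance, as in the one-argument anova table, and with the estimate from y~A+B, as in anova(model1, model2). The variable names are illustrative, and the values should match the printed tables up to rounding.

```r
# Numbers copied from the anova tables for y ~ A + B + A:B.
ss.B <- 4.031; df.B <- 2            # sum of squares added by B after A
sigma2.full <- 34.788 / 27          # residual mean square of the full model
sigma2.AB   <- 41.685 / 29          # residual mean square of y ~ A + B

# Same numerator, different denominators.
F.table <- (ss.B / df.B) / sigma2.full   # line 2 of anova(lm(y ~ A + B + A:B))
F.pair  <- (ss.B / df.B) / sigma2.AB     # anova(model1, model2)

print(round(c(table = F.table, pair = F.pair), 3))
```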

More than two factors: example

• An experiment was conducted to compare different diets for feeding chickens. The diets depended on 3 variables:
– Source of protein (variable protein): either “groundnut” or “soybean”
– Level of protein (variable protlevel): either 0, 1 or 2
– Level of fish solubles (variable fish): either high or low

• Response variable was weight gain (variable chickweight)

Data

   chickweight   protein protlevel fish
1         6559 groundnut         0  Low
2         7075 groundnut         0 High
3         6564 groundnut         1  Low
4         7528 groundnut         1 High
5         6738 groundnut         2  Low
6         7333 groundnut         2 High
7         7094   soybean         0  Low
8         8005   soybean         0 High
9         6943   soybean         1  Low
10        7359   soybean         1 High
11        6748   soybean         2  Low
12        6764   soybean         2 High
. . . 24 observations in all

Data characteristics

• There are 3 factors
– protein with 2 levels (groundnut, soybean)
– protlevel with 3 levels (0, 1, 2)
– fish with 2 levels (high, low)

• There are 2 x 3 x 2 = 12 factor level combinations, so 12 means

• Each combination is observed twice, so 24 observations in all
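The count of factor level combinations can be enumerated rather than multiplied out. A minimal sketch, using the level names from the data listing (the object name `combos` is illustrative):

```r
# Every combination of the three factors' levels, one row each.
combos <- expand.grid(protein   = c("groundnut", "soybean"),
                      protlevel = c(0, 1, 2),
                      fish      = c("High", "Low"))

print(nrow(combos))      # number of factor level combinations (cell means)
print(2 * nrow(combos))  # two observations per combination
```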

Interactions

Let $\mu_{ijk}$ be the population mean of all observations taken at level i of protein, level j of protlevel and level k of fish

We can split this mean up into 8 terms:

An overall baseline $\mu = \mu_{111}$

3 “main effects” e.g. $\alpha_i = \mu_{i11} - \mu_{111}$

3 “two-way interactions” e.g. $(\alpha\beta)_{ij} = \mu_{ij1} - \mu_{i11} - \mu_{1j1} + \mu_{111}$

A “3-way interaction”

$$(\alpha\beta\gamma)_{ijk} = \mu_{ijk} - \mu_{ij1} - \mu_{i1k} - \mu_{1jk} + \mu_{i11} + \mu_{1j1} + \mu_{11k} - \mu_{111}$$

Interactions (cont)

Then

$$\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\beta\gamma)_{jk} + (\alpha\gamma)_{ik} + (\alpha\beta\gamma)_{ijk}$$

As before, if any one of the subscripts i, j, k is 1 then the corresponding interaction is zero.

Interpretation:

e.g. if the protlevel x fish and the 3-way interactions are all zero, then the effect of changing levels of fish is the same for all levels of protlevel.

Why?

$$\begin{aligned}
\mu_{ijk} - \mu_{ijk'} &= (\mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik}) \\
&\quad - (\mu + \alpha_i + \beta_j + \gamma_{k'} + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik'}) \\
&= (\gamma_k - \gamma_{k'}) + ((\alpha\gamma)_{ik} - (\alpha\gamma)_{ik'})
\end{aligned}$$

Doesn’t depend on j!

Estimating terms

> model1<-lm(chickweight~protein*protlevel*fish, data=chickwts.df)
> summary(model1)
                                  Estimate Std. Error t value Pr(>|t|)
(Intercept)                         6927.0      223.5  30.992    8e-13
proteinsoybean                       904.0      316.1   2.860   0.0144
protlevel1                           266.5      316.1   0.843   0.4156
protlevel2                           -80.0      316.1  -0.253   0.8045
fishLow                             -501.5      316.1  -1.587   0.1386
proteinsoybean:protlevel1           -772.0      447.0  -1.727   0.1098
proteinsoybean:protlevel2          -1089.0      447.0  -2.436   0.0314
proteinsoybean:fishLow              -256.0      447.0  -0.573   0.5774
protlevel1:fishLow                   -99.0      447.0  -0.221   0.8285
protlevel2:fishLow                   245.5      447.0   0.549   0.5929
proteinsoybean:protlevel1:fishLow    127.0      632.2   0.201   0.8441
proteinsoybean:protlevel2:fishLow    435.0      632.2   0.688   0.5045

Residual standard error: 316.1 on 12 degrees of freedom
Multiple R-Squared: 0.7531, Adjusted R-squared: 0.5269
F-statistic: 3.328 on 11 and 12 DF, p-value: 0.02482

Anova for the chick weights

> anova(model1)
Analysis of Variance Table

Response: chickweight
                       Df  Sum Sq Mean Sq F value   Pr(>F)
protein                 1  373003  373003  3.7334 0.077286 .
protlevel               2  636519  318260  3.1854 0.077679 .
fish                    1 1423014 1423014 14.2429 0.002653 **
protein:protlevel       2  858702  429351  4.2974 0.039134 *
protein:fish            1    7073    7073  0.0708 0.794706
protlevel:fish          2  309421  154710  1.5485 0.252201
protein:protlevel:fish  2   50036   25018  0.2504 0.782453
Residuals              12 1198926   99911
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Suggests model protein*protlevel + fish

Check

> anova(lm(chickweight~protein*protlevel + fish, data=chickwts.df), lm(chickweight~protein*protlevel*fish, data=chickwts.df))
Analysis of Variance Table

Model 1: chickweight ~ protein * protlevel + fish
Model 2: chickweight ~ protein * protlevel * fish
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     17 1565456
2     12 1198926  5    366530 0.7337 0.6121

Not significant, but interpret with caution

Effect of fish the same for each protein/protlevel combination

Interpretation of interactions

• If a factor (say A) does not interact with the others, the effect of changing levels of A is the same for all levels of the other factors

• If the 3 way interactions are zero, then the interaction between A and B is the same for all levels of C

Summary

• Anova models are interpreted just like regressions, except
– No question of planarity (linear by definition)
– Need to interpret interactions
– Judge effect of factors by anova
– Factors in anova added one at a time
– Suitable for completely randomised experiments where it is reasonable to assume observations are independent