
Unit 6 – Analysis of Variance Practice Problems SOLUTIONS – R Users · 2020. 3. 25.


  • BIOSTATS 640 Spring 2020 Unit 6 Introduction to Analysis of Variance Solutions - R Users

    Sol_anova 2020 R USERS.docx Page 1 of 15

    Unit 6 – Analysis of Variance Practice Problems

    SOLUTIONS – R Users

    Before you begin: Download from the course website:

    R Users anova_infants.Rdata fishgrowth.Rdata

    Preliminaries: Set current directory. Attach libraries needed.

    setwd("/Users/cbigelow/Desktop/")
    options(scipen=1000)
    library(DescTools)   # command Freq( ): One way frequencies, relative frequencies, etc
    library(descr)       # command CrossTable( ): Two way contingency table
    library(doBy)        # command summaryBy( ): Descriptives by group
    library(car)         # command leveneTest(): Test of equal variances 2 way anova

    Practice with one way analysis of variance
    Exercises #1-6
    Data set: anova_infants.Rdata

    Zelazo et al. (1972) investigated the variability in age at first walking in infants. Study infants were grouped into four groups, according to reinforcement of walking and placement: (1) active (2) passive (3) no exercise; and (4) 8 week control. Sample sizes were 6 per group, for a total of n=24. For each infant, study data included group assignment and age at first walking, in months. The following are the data and consist of recorded values of age (months) by group:

    Active Group   Passive Group   No-Exercise Group   8 Week Control
         9.00          11.00            11.50              13.25
         9.50          10.00            12.00              11.50
         9.75          10.00             9.00              12.00
        10.00          11.75            11.50              13.50
        13.00          10.50            13.25              11.50
         9.50          15.00            13.00              12.35

    Source: Zelazo et al (1972) “Walking” in the newborn. Science 176: 314-315.
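If you do not have anova_infants.Rdata to hand, the table above is small enough to type in. This is only a sketch: the data frame name `infants` and the short group labels are my own choices for illustration, not the layout of the course file.

```r
# Hypothetical stand-in for the course data file, typed from the table above.
age <- c( 9.00,  9.50,  9.75, 10.00, 13.00,  9.50,   # active
         11.00, 10.00, 10.00, 11.75, 10.50, 15.00,   # passive
         11.50, 12.00,  9.00, 11.50, 13.25, 13.00,   # no exercise
         13.25, 11.50, 12.00, 13.50, 11.50, 12.35)   # 8 week control

# One factor level per group, six infants each, in the order typed above
group <- factor(rep(1:4, each = 6),
                labels = c("active", "passive", "noex", "control"))

infants <- data.frame(group = group, age = age)

# Quick checks: sample size per group and the four group means
table(infants$group)
tapply(infants$age, infants$group, mean)
```

The long layout (one row per infant) is what `aov()`, `lm()` and `boxplot()` expect with a formula such as `age ~ group`.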


    Preliminaries for Exercises 1-6: Read in data. Clean. Produce descriptives. Note: The data set is anova_infants.Rdata. After you load it, look in the ENVIRONMENT window. There you will see that the name of the data frame that is stored in this file is actually called dat1.

    load(file="anova_infants.Rdata")
    str(dat1)

    ## 'data.frame': 24 obs. of 7 variables:
    ##  $ group    : int 1 1 1 1 1 1 2 2 2 2 ...
    ##  $ age      : num 9 9.5 9.75 10 13 ...
    ##  $ passive  : num 0 0 0 0 0 0 1 1 1 1 ...
    ##  $ noex     : num 0 0 0 0 0 0 0 0 0 0 ...
    ##  $ I_active : num 1 1 1 1 1 1 0 0 0 0 ...
    ##  $ I_passive: num 0 0 0 0 0 0 1 1 1 1 ...
    ##  $ I_noex   : num 0 0 0 0 0 0 0 0 0 0 ...
    ##  - attr(*, "datalabel")= chr ""
    ##  - attr(*, "time.stamp")= chr " 9 Apr 2017 16:58"
    ##  - attr(*, "formats")= chr "%8.0g" "%9.0g" "%9.0g" "%9.0g" ...
    ##  - attr(*, "types")= int 251 254 254 254 254 254 254
    ##  - attr(*, "val.labels")= chr "groupf" "" "" "" ...
    ##  - attr(*, "var.labels")= chr "" "" "" "" ...
    ##  - attr(*, "version")= int 12

    dat1$group
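The `str()` output shows `group` stored as the integers 1 to 4, while the descriptives for Exercise 2 print labels such as `1=active`. One way to get from the first to the second is `factor()` with explicit labels. The sketch below works on a toy vector rather than on dat1, and it is not necessarily the command used in these solutions; the label text is copied from the printed output.

```r
# Toy stand-in for dat1$group: integer codes 1..4, six infants per group
group_int <- rep(1:4, each = 6)

# Attach descriptive labels; each level is matched to the label in the same position
group_f <- factor(group_int,
                  levels = 1:4,
                  labels = c("1=active", "2=passive", "3=noex", "4=control"))

levels(group_f)
table(group_f)
```

Keeping the numeric code at the front of each label preserves the intended display order in tables and plots.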


    1. State the analysis of variance model using notation as appropriate. Define all terms and constraints on the parameters.

    Answer:

    2. By any means you like, produce a side by side box plot showing the distribution of age at first walking, separately for each of the 4 groups.

    # Descriptives (There's more than one way to do this. If you have a different approach, great!)
    summaryBy(age~group, data=dat1,
              FUN=c(length,mean,var,sd,min,max),
              fun.names=c("n","Mean","Var","SD","Min","Max"))

    ##       group age.n age.Mean  age.Var    age.SD age.Min age.Max
    ## 1  1=active     6 10.12500 2.093750 1.4469796     9.0   13.00
    ## 2 2=passive     6 11.37500 3.593750 1.8957189    10.0   15.00
    ## 3    3=noex     6 11.70833 2.310417 1.5200055     9.0   13.25
    ## 4 4=control     6 12.35000 0.740000 0.8602325    11.5   13.50

    Interpretation

    - In these data, first walking occurs earlier when infants are reinforced.
    - Distributions differ markedly with respect to variability, with the greatest variability seen among infants in the passive group and the smallest among infants in the control group.

    Answer to Exercise 1:

    The parameters are $\mu$, $\tau_i$ and $\sigma^2$.

    $$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad \text{where } \varepsilon_{ij} \sim N(0, \sigma^2) \text{ and } \sum_{i=1}^{4} \tau_i = 0$$

    $i = 1, 2, \ldots, K$ indexes method of reinforcement group; $K$ = number of groups = 4
    $j = 1, 2, \ldots, n_i = 6$ indexes infant within group
    $\mu$ = population mean age at first walking, over all groups
    $\mu_i$ = mean age at first walking for infants in group "i"
    $\tau_i = [\mu_i - \mu]$
    $Y_{ij}$ = observed age at first walking for the jth infant in group "i"

    $H_O$: $\tau_1 = 0$, $\tau_2 = 0$, $\tau_3 = 0$, and $\tau_4 = 0$
    $H_A$: At least one $\tau_i \neq 0$
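To make the sum-to-zero constraint concrete, the estimates of μ and the τ's can be computed directly from the group means. This is a minimal sketch that assumes the 24 ages are typed in from the data table; `age` and `group` are my own stand-ins, not objects in anova_infants.Rdata. Because the design is balanced, the overall mean equals the average of the four group means, so the estimated effects sum to zero.

```r
# Ages typed from the data table (illustrative stand-in for dat1$age)
age <- c( 9.00,  9.50,  9.75, 10.00, 13.00,  9.50,   # active
         11.00, 10.00, 10.00, 11.75, 10.50, 15.00,   # passive
         11.50, 12.00,  9.00, 11.50, 13.25, 13.00,   # no exercise
         13.25, 11.50, 12.00, 13.50, 11.50, 12.35)   # 8 week control
group <- factor(rep(c("active", "passive", "noex", "control"), each = 6),
                levels = c("active", "passive", "noex", "control"))

mu_hat  <- mean(age)                  # estimate of the overall mean (balanced design)
mu_i    <- tapply(age, group, mean)   # estimated group means
tau_hat <- mu_i - mu_hat              # estimated group effects, tau_i = mu_i - mu

round(tau_hat, 3)
sum(tau_hat)                          # zero, up to floating point rounding
```

A negative estimated effect means that group walked earlier than the overall average; the "active" group has the most negative value.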


    # Side-by-side box plot (Again, there is more than one way to do this. Next time, I think I will use ggplot)
    boxplot(age~group, data=dat1,
            main="Age (Month) at First Walking",
            xlab="Group", ylab="Month")

    Interpretation

    - The plot confirms impressions from the descriptive statistics.
    - But what we can now see more easily (yay visualization) is that the distributions also differ markedly in their patterns of symmetry, with long right tails in the active and passive groups, a long left tail in the no-exercise group, and symmetry in controls.


    3. By any means you like, obtain the entries of the analysis of variance table for this one way analysis of variance. Use your computer output (or excel work or hand calculations or whatever) to complete the following table:

    Source             df   Sum of Squares (SSQ)   Mean Square (MSQ)   F-Statistic   p-value
    Between Groups      3   15.74                  5.25                2.40          .10
    Within Groups      20   43.69                  2.18
    Total, corrected   23   59.43

    q3anova
    ##           Df Sum Sq Mean Sq F value  Pr(>F)
    ## group      3  15.74  5.2468  2.4018 0.09787 .
    ## Residuals 20  43.69  2.1845
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    bartlett.test(age ~ group, data=dat1)

    ##
    ##  Bartlett test of homogeneity of variances
    ##
    ## data: age by group
    ## Bartlett's K-squared = 2.6355, df = 3, p-value = 0.4513
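The entries of the analysis of variance table can also be reproduced from first principles, which is a handy check on whatever software you used. The sketch below assumes the 24 ages are typed in directly; `age` and `group` are illustrative names, not objects from anova_infants.Rdata.

```r
# Ages typed from the data table (illustrative stand-in for dat1)
age <- c( 9.00,  9.50,  9.75, 10.00, 13.00,  9.50,   # active
         11.00, 10.00, 10.00, 11.75, 10.50, 15.00,   # passive
         11.50, 12.00,  9.00, 11.50, 13.25, 13.00,   # no exercise
         13.25, 11.50, 12.00, 13.50, 11.50, 12.35)   # 8 week control
group <- factor(rep(c("active", "passive", "noex", "control"), each = 6),
                levels = c("active", "passive", "noex", "control"))

k <- nlevels(group)                   # number of groups
n <- length(age)                      # total sample size
grand <- mean(age)
means <- tapply(age, group, mean)     # group means
sizes <- tapply(age, group, length)   # group sample sizes

# Between groups: weighted squared deviations of group means from the grand mean
ss_between <- sum(sizes * (means - grand)^2)
# Within groups: squared deviations of each infant from its own group mean
ss_within  <- sum((age - ave(age, group))^2)

df_between <- k - 1
df_within  <- n - k
f_stat <- (ss_between / df_between) / (ss_within / df_within)
p_val  <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(c(SSB = ss_between, SSW = ss_within, F = f_stat, p = p_val), 4)
```

The two sums of squares add up to the corrected total, 59.43, and the F statistic and p-value agree with the software output above.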

    4. Write a 2-5 sentence report of your description and hypothesis test findings using language as appropriate for a client who is intelligent but is not knowledgeable about statistics. Include figure and table as you think is appropriate.

    In this sample, the data suggest a trend towards earlier age at first walking with increasing reinforcement and placement. The mean age at first walking is greatest among controls (12.35 months) and lowest among infants in the “active” group (10.13 months); see also the box plots. Tests of statistical significance were limited to the overall F test for group differences, and this did not achieve statistical significance (p-value = .10), possibly due to the small sample sizes (6 in each group). Interestingly, examination of the data also suggests that the variability in age at first walking differed, depending on the intervention received. The variability was greater in the three intervention groups (“active”, “passive”, “no exercise”) than in the “control” group; however, this difference was not statistically significant (p-value = .45). Further study, utilizing larger sample sizes and additional hypothesis tests to investigate trend, is needed.


    5. For the brave: Using appropriately defined indicator variables, perform a multivariable linear regression analysis of these same data! Use your computer output to complete the following table:

    Source                 df             Sum of Squares   Mean Square          Overall F
    due model              (p) = 3        SSR = 15.74      SSR/p = 5.25         2.40
    due error (residual)   (n-1-p) = 20   SSE = 43.69      SSE/(n-1-p) = 2.18
    Total, corrected       (n-1) = 23     59.43

    Some of this has already been done for you: I created the following 0/1 “flags” (indicator variables, dummy variables) for the predictor that is nominal here.

    Y = age at first walking
    I_active = 0/1 indicator of group assignment to “active”
    I_passive = 0/1 indicator of group assignment to “passive”
    I_noex = 0/1 indicator of group assignment to “no exercise”

    Thus, I used a reference cell coding approach with “8 week control” as my reference. I fit the following multivariable linear model of Y:

    Y = b0 + b1 [I_active] + b2 [I_passive] + b3 [I_noex] + error

    dat1$I_active
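For anyone who wants to try the reference cell regression without the course file, here is a minimal sketch. It builds numeric 0/1 indicators from scratch on typed-in data; the data frame name `dat` is hypothetical. The coefficient names in the output that follows (`I_active1` and so on) suggest the indicators were stored as factors in the original analysis, so the names printed by this sketch will differ slightly even though the estimates are the same.

```r
# Ages typed from the data table; grp holds the group codes 1..4
age <- c( 9.00,  9.50,  9.75, 10.00, 13.00,  9.50,   # 1 = active
         11.00, 10.00, 10.00, 11.75, 10.50, 15.00,   # 2 = passive
         11.50, 12.00,  9.00, 11.50, 13.25, 13.00,   # 3 = no exercise
         13.25, 11.50, 12.00, 13.50, 11.50, 12.35)   # 4 = 8 week control
grp <- rep(1:4, each = 6)

# Reference cell coding: group 4 (control) gets 0 on all three indicators
dat <- data.frame(age       = age,
                  I_active  = as.numeric(grp == 1),
                  I_passive = as.numeric(grp == 2),
                  I_noex    = as.numeric(grp == 3))

fit <- lm(age ~ I_active + I_passive + I_noex, data = dat)
coef(fit)                 # intercept = control mean; slopes = differences from control
summary(fit)$fstatistic   # overall F on 3 and 20 df
```

The intercept estimates the control mean, and each slope estimates how far that group's mean sits from the control mean.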


    summary(q5model)

    ##
    ## Call:
    ## lm(formula = age ~ I_active + I_passive + I_noex, data = dat1)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -2.7083 -0.8500 -0.2792  0.5062  3.6250
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)  12.3500     0.6034  20.468 6.94e-15 ***
    ## I_active1    -2.2250     0.8533  -2.607   0.0169 *
    ## I_passive1   -0.9750     0.8533  -1.143   0.2667
    ## I_noex1      -0.6417     0.8533  -0.752   0.4608
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 1.478 on 20 degrees of freedom
    ## Multiple R-squared: 0.2649, Adjusted R-squared: 0.1546
    ## F-statistic: 2.402 on 3 and 20 DF, p-value: 0.09787

    The prediction equation is thus:

    $$\hat{Y} = 12.35 - 2.225\,I_{active} - 0.97\,I_{passive} - 0.64\,I_{noex}$$

    The two analyses match (hooray), thus confirming that a multiple linear regression model utilizing appropriately defined indicator variables is equivalent to an analysis of variance.

    6. For the brave: Using your output from your two analyses (1st – analysis of variance, 2nd – regression), obtain the predicted mean of Y = age at first walking twice in two ways.

    Group         Prediction Using One Way Analysis of Variance   Prediction Using Multiple Linear Regression
    Active        10.125                                          $\hat{\mu}_1 = (\hat{\beta}_0 + \hat{\beta}_1) = 12.35 - 2.225 = 10.125$
    Passive       11.375                                          $\hat{\mu}_2 = (\hat{\beta}_0 + \hat{\beta}_2) = 12.35 - 0.97 = 11.38$
    No-Exercise   11.71                                           $\hat{\mu}_3 = (\hat{\beta}_0 + \hat{\beta}_3) = 12.35 - 0.64 = 11.71$
    Control       12.35                                           $\hat{\mu}_4 = \hat{\beta}_0 = 12.35$


    Question #6 For the brave - Obtain Predicted Values: a) anova b) regression

    # Predicted Means from ANOVA
    aactive
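One possible way to obtain the four predicted means both ways is sketched below. It is not the code used in these solutions (that code did not survive on this page); it relies on R's default treatment contrasts, which are a reference cell coding, with `relevel()` making "control" the reference. The names `dat`, `means_anova` and `means_reg` are mine.

```r
# Ages typed from the data table (illustrative stand-in for dat1)
age <- c( 9.00,  9.50,  9.75, 10.00, 13.00,  9.50,   # active
         11.00, 10.00, 10.00, 11.75, 10.50, 15.00,   # passive
         11.50, 12.00,  9.00, 11.50, 13.25, 13.00,   # no exercise
         13.25, 11.50, 12.00, 13.50, 11.50, 12.35)   # 8 week control
group <- factor(rep(c("active", "passive", "noex", "control"), each = 6),
                levels = c("active", "passive", "noex", "control"))
dat <- data.frame(age = age, group = relevel(group, ref = "control"))

# a) ANOVA: the predicted mean for each group is simply its sample mean
means_anova <- tapply(dat$age, dat$group, mean)[c("active", "passive", "noex", "control")]

# b) Regression: intercept is the control mean, each slope is an offset from it
fit <- lm(age ~ group, data = dat)
b <- unname(coef(fit))   # order: intercept, active, passive, noex
means_reg <- c(active  = b[1] + b[2],
               passive = b[1] + b[3],
               noex    = b[1] + b[4],
               control = b[1])

round(rbind(anova = means_anova, regression = means_reg), 3)
```

The two rows of the printed table are identical, which is the point of the exercise.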


    Practice with two-way factorial analysis of variance
    Exercises #7-12
    Data set used: fishgrowth.Rdata

    Consider again the fish growth data on page 41 of Notes 6. Introduction to Analysis of Variance.

    Light (light)   Water Temp (temp)   Fish Growth (growth)
    1=low           1=cold              4.55
    1=low           1=cold              4.24
    1=low           2=lukewarm          4.89
    1=low           2=lukewarm          4.88
    1=low           3=warm              5.01
    1=low           3=warm              5.11
    2=high          1=cold              5.55
    2=high          1=cold              4.08
    2=high          2=lukewarm          6.09
    2=high          2=lukewarm          5.01
    2=high          3=warm              7.01
    2=high          3=warm              6.92

    Coding Manual:

    Variable   Coding
    growth     continuous
    light      1 = low
               2 = high
    temp       1 = cold
               2 = lukewarm
               3 = warm


    Preliminaries for Exercises 7-12: Read in data. Clean. Descriptives

    load(file="fishgrowth.Rdata")
    str(fishgrowth)

    ## 'data.frame': 12 obs. of 3 variables:
    ##  $ light : Factor w/ 2 levels "1=low","2=high": 1 1 1 1 1 1 2 2 2 2 ...
    ##  $ temp  : Factor w/ 3 levels "1=cold","2=lukewarm",..: 1 1 2 2 3 3 1 1 2 2 ...
    ##  $ growth: num 4.55 4.24 4.89 4.88 5.01 ...
    ##  - attr(*, "datalabel")= chr ""
    ##  - attr(*, "time.stamp")= chr " 9 Apr 2017 16:59"
    ##  - attr(*, "formats")= chr "%8.0g" "%10.0g" "%9.0g"
    ##  - attr(*, "types")= int 251 251 254
    ##  - attr(*, "val.labels")= chr "lightf" "tempf" ""
    ##  - attr(*, "var.labels")= chr "" "" ""
    ##  - attr(*, "version")= int 12
    ##  - attr(*, "label.table")=List of 2
    ##   ..$ lightf: Named int 1 2
    ##   .. ..- attr(*, "names")= chr "1=low" "2=high"
    ##   ..$ tempf : Named int 1 2 3
    ##   .. ..- attr(*, "names")= chr "1=cold" "2=lukewarm" "3=warm"

    # Crosstab of factors I and II. I'm doing this to show you the sample sizes in each group. TINY! Oh well.
    CrossTable(fishgrowth$light, fishgrowth$temp, dnn=c("Light","Temp"),
               format=c("SAS"), prop.r=FALSE, prop.c=FALSE, prop.t=FALSE,
               prop.chisq=FALSE)

    ## Cell Contents
    ## |-------------------------|
    ## |                       N |
    ## |-------------------------|
    ##
    ## ==============================================
    ##           Temp
    ## Light     1=cold   2=lukewarm   3=warm   Total
    ## ----------------------------------------------
    ## 1=low          2            2        2       6
    ## ----------------------------------------------
    ## 2=high         2            2        2       6
    ## ----------------------------------------------
    ## Total          4            4        4      12
    ## ==============================================


    # Descriptives on y=growth in all 6 groups (2 light levels x 3 water temperatures)
    summaryBy(growth ~ light + temp, data = fishgrowth,
              FUN=c(length,mean,var,sd,min,max),
              fun.names=c("n","Mean","Var","SD","Min","Max"))

    ##    light       temp growth.n growth.Mean   growth.Var   growth.SD
    ## 1  1=low     1=cold        2       4.395 4.805013e-02 0.219203399
    ## 2  1=low 2=lukewarm        2       4.885 4.999752e-05 0.007070892
    ## 3  1=low     3=warm        2       5.060 4.999990e-03 0.070710611
    ## 4 2=high     1=cold        2       4.815 1.080450e+00 1.039447157
    ## 5 2=high 2=lukewarm        2       5.550 5.831999e-01 0.763675270
    ## 6 2=high     3=warm        2       6.965 4.050014e-03 0.063639718
    ##   growth.Min growth.Max
    ## 1       4.24       4.55
    ## 2       4.88       4.89
    ## 3       5.01       5.11
    ## 4       4.08       5.55
    ## 5       5.01       6.09
    ## 6       6.92       7.01

    7. State the analysis of variance model using notation as appropriate. Define all terms and constraints on the parameters. Answer:

    The parameters are $\mu$, $\alpha_i$, $\beta_j$, $(\alpha\beta)_{ij}$ and $\sigma^2$.

    $$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}, \quad \text{where } \varepsilon_{ijk} \sim N(0, \sigma^2)$$

    with
    $i = 1, 2$ indexing light
    $j = 1, 2, 3$ indexing temperature
    $k = 1, 2$ indexing individual fish under light "i" at water temperature "j"

    $\mu$ = population mean fish growth, over all groups
    $\alpha_i$ = effect of light level "i", with $\sum_{i=1}^{2} \alpha_i = 0$
    $\beta_j$ = effect of water temperature "j", with $\sum_{j=1}^{3} \beta_j = 0$
    $(\alpha\beta)_{ij}$ = interaction effect of the combination of "ith" light level and "jth" water temperature, with $\sum_{i=1}^{2} (\alpha\beta)_{ij} = 0$ and $\sum_{j=1}^{3} (\alpha\beta)_{ij} = 0$.
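The constraints are easier to believe once you see them hold numerically. The sketch below estimates each term of the two-way model from the six cell means of the fish growth data, typed in by hand. The object names (`cell`, `alpha_hat`, and so on) are illustrative, and the simple "row mean minus grand mean" estimates rely on the design being balanced (2 fish per cell).

```r
# Fish growth data typed from the table (light: low/high; temp: cold/lukewarm/warm)
growth <- c(4.55, 4.24, 4.89, 4.88, 5.01, 5.11,    # low light
            5.55, 4.08, 6.09, 5.01, 7.01, 6.92)    # high light
light <- factor(rep(c("low", "high"), each = 6), levels = c("low", "high"))
temp  <- factor(rep(rep(c("cold", "lukewarm", "warm"), each = 2), times = 2),
                levels = c("cold", "lukewarm", "warm"))

cell <- tapply(growth, list(light, temp), mean)   # 2 x 3 matrix of cell means

mu_hat    <- mean(cell)                 # overall mean
alpha_hat <- rowMeans(cell) - mu_hat    # light effects
beta_hat  <- colMeans(cell) - mu_hat    # temperature effects
# Interaction: what is left of each cell mean after mu, alpha_i and beta_j
ab_hat <- cell - mu_hat - outer(alpha_hat, beta_hat, "+")

round(alpha_hat, 3)
round(beta_hat, 3)
round(ab_hat, 3)
rowSums(ab_hat)   # zero for each light level, up to rounding
colSums(ab_hat)   # zero for each temperature, up to rounding
```

The largest interaction estimate belongs to the high light, warm water cell, which matches what the cell means suggest.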


    8. Create the following new variables with accompanying definitions

    Variable   Coding
    i_high     = 1 if (light is 2)
                 0 otherwise
    i_luke     = 1 if (temp is 2)
                 0 otherwise
    i_warm     = 1 if (temp is 3)
                 0 otherwise
    hi_luke    = (i_high) * (i_luke)
    hi_warm    = (i_high) * (i_warm)

    fishgrowth$i_high
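A minimal sketch of one way to build these five variables follows. It uses a toy data frame `fish` with numeric codes for light and temp, which is an assumption on my part: in fishgrowth the two factors carry labels such as "2=high", so there you would compare against the label (or against `as.integer()` of the factor) instead.

```r
# Toy design matching the 2 x 3 layout with 2 fish per cell (numeric codes assumed)
fish <- data.frame(light = rep(c(1, 2), each = 6),
                   temp  = rep(rep(1:3, each = 2), times = 2))

# Main effect indicators: a logical comparison converted to 0/1
fish$i_high <- as.numeric(fish$light == 2)
fish$i_luke <- as.numeric(fish$temp == 2)
fish$i_warm <- as.numeric(fish$temp == 3)

# Interaction indicators: products of the main effect indicators
fish$hi_luke <- fish$i_high * fish$i_luke
fish$hi_warm <- fish$i_high * fish$i_warm

# How many fish are flagged by each indicator
colSums(fish[, c("i_high", "i_luke", "i_warm", "hi_luke", "hi_warm")])
```

Low light and cold water are the reference levels, so a fish in that cell has zeros on all five indicators.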


    leveneTest(growth ~ light*temp, data=fishgrowth)

    ## Warning in anova.lm(lm(resp ~ group)): ANOVA F-tests on an essentially
    ## perfect fit are unreliable

    ## Levene's Test for Homogeneity of Variance (center = median)
    ##       Df    F value    Pr(>F)
    ## group  5 2.7375e+31 < 2.2e-16 ***
    ##        6
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
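The warning and the absurd F value are a side effect of having only 2 fish per cell, not evidence of wildly unequal variances. Levene's test with center = median runs an analysis of variance on the absolute deviations from each cell's median. With two observations the median sits exactly halfway between them, so both deviations in a cell are the same number and the within-cell variation of the deviations is zero. The sketch below shows this on typed-in data; the object names are illustrative.

```r
# Fish growth data typed from the table
growth <- c(4.55, 4.24, 4.89, 4.88, 5.01, 5.11,    # low light
            5.55, 4.08, 6.09, 5.01, 7.01, 6.92)    # high light
light <- factor(rep(c("low", "high"), each = 6), levels = c("low", "high"))
temp  <- factor(rep(rep(c("cold", "lukewarm", "warm"), each = 2), times = 2),
                levels = c("cold", "lukewarm", "warm"))
cell <- interaction(light, temp)     # the 6 light-by-temperature cells

# Absolute deviation of each fish from its cell median
z <- abs(growth - ave(growth, cell, FUN = median))

# Within each cell the two deviations coincide (range is zero up to rounding)
tapply(z, cell, function(v) diff(range(v)))

# So the residual sum of squares in the ANOVA of z is essentially zero
resid_ss <- sum((z - ave(z, cell))^2)
resid_ss
```

A denominator that is zero up to floating point error is what produces an F value on the order of 1e+31, so this test should not be interpreted for these data.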

    10. Perform a regression analysis to obtain the following table and estimates

    Source              Df   SSQ      MSQ      F      p-value
    Due Model            5   8.2320   1.6464   5.74   .0276
    Due Residual         6   1.7208   0.2868   -
    Total (Corrected)   11   9.953    -

    Predictor   beta    Se(beta)   T=beta/se   p-value
    I_high      0.42    0.5355      0.78       0.46
    I_luke      0.49    0.5355      0.91       0.395
    I_warm      0.665   0.5355      1.24       0.261
    Hi_luke     0.245   0.7574      0.32       0.757
    Hi_warm     1.485   0.7574      1.96       0.098
    Intercept   4.395   0.3787     11.61       0.000

    q10model
    ##           Df Sum Sq Mean Sq F value   Pr(>F)
    ## i_high     1 2.9800  2.9800 10.3906 0.018059 *
    ## i_luke     1 0.0222  0.0222  0.0774 0.790164
    ## i_warm     1 3.9621  3.9621 13.8149 0.009889 **
    ## hi_luke    1 0.1650  0.1650  0.5753 0.476877
    ## hi_warm    1 1.1026  1.1026  3.8445 0.097594 .
    ## Residuals  6 1.7208  0.2868
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


    summary(q10model)

    ##
    ## Call:
    ## lm(formula = growth ~ i_high + i_luke + i_warm + hi_luke + hi_warm,
    ##     data = dat2)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -0.73500 -0.07625  0.00000  0.07625  0.73500
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   4.3950     0.3787  11.606 2.46e-05 ***
    ## i_high        0.4200     0.5355   0.784   0.4627
    ## i_luke        0.4900     0.5355   0.915   0.3955
    ## i_warm        0.6650     0.5355   1.242   0.2607
    ## hi_luke       0.2450     0.7574   0.323   0.7573
    ## hi_warm       1.4850     0.7574   1.961   0.0976 .
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.5355 on 6 degrees of freedom
    ## Multiple R-squared: 0.8271, Adjusted R-squared: 0.683
    ## F-statistic: 5.741 on 5 and 6 DF, p-value: 0.02755

    11. Using the output from each of your anova and regression analyses, complete the following tables and notice that they are the same.

    # Predicted Means from ANOVA
    ameans


    Estimated Mean Growth, by Conditions of Light and Temperature – Anova Analysis

                 Cold    Lukewarm   Warm
    Low light    4.395   4.885      5.06
    High light   4.815   5.55       6.965

    # Predicted Means from Regression
    rmeans
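The regression table of predicted means can be rebuilt from the six coefficients reported in the Exercise 10 output, by adding up the terms that are "switched on" in each cell. This is a sketch of that arithmetic rather than the code used here; `rmeans_sketch` and the other object names are my own.

```r
# Coefficients as reported in the regression output for Exercise 10
b <- c(intercept = 4.395, i_high = 0.420, i_luke = 0.490,
       i_warm = 0.665, hi_luke = 0.245, hi_warm = 1.485)

# All six light-by-temperature cells (i_high varies fastest)
grid <- expand.grid(i_high = 0:1, temp = 1:3)
grid$i_luke <- as.numeric(grid$temp == 2)
grid$i_warm <- as.numeric(grid$temp == 3)

# Design matrix: intercept, main effect indicators, interaction products
X <- cbind(1, grid$i_high, grid$i_luke, grid$i_warm,
           grid$i_high * grid$i_luke, grid$i_high * grid$i_warm)

# Predicted mean for each cell, arranged as light (rows) by temperature (columns)
rmeans_sketch <- matrix(as.vector(X %*% b), nrow = 2,
                        dimnames = list(c("Low light", "High light"),
                                        c("Cold", "Lukewarm", "Warm")))
rmeans_sketch
```

For example, the high light, warm water cell is 4.395 + 0.420 + 0.665 + 1.485 = 6.965, the same value as in the anova table above.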