34
Workshop in R & GLMs: #2 Diane Srivastava University of British Columbia [email protected]

Workshop in R & GLMs: #2

  • Upload
    borka

  • View
    39

  • Download
    2

Embed Size (px)

DESCRIPTION

Workshop in R & GLMs: #2. Diane Srivastava University of British Columbia [email protected]. Start by loading your Lakedata_06 dataset: diane

Citation preview

Page 1: Workshop in  R & GLMs: #2

Workshop in R & GLMs: #2

Diane Srivastava

University of British Columbia

[email protected]

Page 2: Workshop in  R & GLMs: #2

Start by loading your Lakedata_06 dataset:

diane<-read.table(file.choose(),sep=";",header=TRUE)

Page 3: Workshop in  R & GLMs: #2

Dataframes

Two ways to identify a column (called "treatment") in

your dataframe (called "diane"):

diane$treatment

OR

attach(diane); treatment

At end of session, remember to: detach(diane)

Page 4: Workshop in  R & GLMs: #2

Summary statistics

length (x)

mean (x)

var (x)

cor (x,y)

sum (x)

summary (x) minimum, maximum, mean, median, quartiles

What is the correlation between two variables in your dataset?

Page 5: Workshop in  R & GLMs: #2

Factors

• A factor has several discrete levels (e.g. control,

herbicide)

• If a vector contains text, R automatically assumes it

is a factor.

• To manually convert numeric vector to a factor:

x <- as.factor(x)

• To check if your vector is a factor, and what the

levels are:

is.factor(x) ; levels(x)

Page 6: Workshop in  R & GLMs: #2

ExerciseMake lake area into a factor called AreaFactor:

Area 0 to 5 ha: small

Area 5.1 to 10: medium

Area > 10 ha: large

You will need to:

1. Tell R how long AreaFactor will be.

AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"

2. Assign cells in AreaFactor to each of the 3 levels

3. Make AreaFactor into a factor, then check that it is a factor

Page 7: Workshop in  R & GLMs: #2

ExerciseMake lake area into a factor called AreaFactor:

Area 0 to 5 ha: small

Area 5.1 to 10: medium

Area > 10 ha: large

You will need to:

1. Tell R how long AreaFactor will be.

AreaFactor<-Area; AreaFactor[1:length(Area)]<-"medium"

2. Assign cells in AreaFactor to each of the 3 levels

AreaFactor[Area<5.1]<-“small"; AreaFactor[Area>10]<-“large"

3. Make AreaFactor into a factor, then check that it is a factorAreaFactor<-as.factor(AreaFactor); is.factor(AreaFactor)

Page 8: Workshop in  R & GLMs: #2

Linear regression

model <- lm (y ~ x, data = diane)

invent a name for your model

linear model

insert youry vector name here

insert yourx vector name here

insert yourdataframename here

ALT+126

Page 9: Workshop in  R & GLMs: #2

Linear regression

model <- lm (Species ~ Elevation, data = diane)

summary (model)

Call:lm(formula = Species ~ Elevation, data = diane)

Residuals: Min 1Q Median 3Q Max -7.29467 -2.75041 -0.04947 1.83054 15.00270

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.421568 2.426983 3.882 0.000551 ***Elevation -0.002609 0.003663 -0.712 0.482070 ---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.811 on 29 degrees of freedomMultiple R-Squared: 0.01719, Adjusted R-squared: -0.0167 F-statistic: 0.5071 on 1 and 29 DF, p-value: 0.4821

Page 10: Workshop in  R & GLMs: #2

Linear regression

model2 <- lm (Species ~ AreaFactor, data = diane)

summary (model2)

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.833 1.441 8.210 6.17e-09 ***AreaFactormedium -2.405 1.723 -1.396 0.174 AreaFactorsmall -8.288 1.792 -4.626 7.72e-05 ***

Large has mean species richness of 11.8

Medium has mean species richness of 11.8 - 2.4 = 9.4

Small has a mean species richness of 11.8 - 8.3 = 3.5

mean(Species[AreaFactor=="medio"])

Page 11: Workshop in  R & GLMs: #2

ANOVA

model2 <- lm (Species ~ AreaFactor, data = diane)

anova (model2)

Analysis of Variance Table

Response: Species Df Sum Sq Mean Sq F value Pr(>F) AreaFactor 2 333.85 166.92 13.393 8.297e-05 ***Residuals 28 348.99 12.46

Page 12: Workshop in  R & GLMs: #2

F tests in regression

model3 <- lm (Species ~ Elevation + pH, data = diane)

anova (model, model3)

Model 1: Species ~ ElevationModel 2: Species ~ Elevation + pH

Res.Df RSS Df Sum of Sq F Pr(>F) 1 29 671.10 2 28 502.06 1 169.05 9.4279 0.004715 **

F 1, 28 = 9.43

Page 13: Workshop in  R & GLMs: #2

• Fit the model: Species~pH

• Fit the model: Species~pH+pH2

("pH2" is just pH2)

• Use the ANOVA command to decide whether species richness is a linear or quadratic function of pH

Exercise

Page 14: Workshop in  R & GLMs: #2

Distributions: not so normal!

• Review assumptions for parametric stats (ANOVA, regressions)

• Why don’t transformations always work?

• Introduce non-normal distributions

Page 15: Workshop in  R & GLMs: #2

Tests before ANOVA, t-tests

• Normality

• Constant variances

Page 16: Workshop in  R & GLMs: #2

Tests for normality: exercise

data<-c(rep(0:6,c(42,30,10,5,5,4,4)));data

How many datapoints are there?

Page 17: Workshop in  R & GLMs: #2

Tests for normality: exercise

• Shapiro-Wilks (if sig, not normal)

shapiro.test (data)

If error message, make sure the stats package is loaded, then try again:

library(stats); shapiro.test (data)

Page 18: Workshop in  R & GLMs: #2

Tests for normality: exercise

• Kolmogorov-Smirnov (if sig, not normal)

ks.test(data,”pnorm”,mean(data),sd=sqrt(var(data)))

Page 19: Workshop in  R & GLMs: #2

Tests for normality: exercise

• Quantile-quantile plot (if wavers substantially off 1:1 line, not normal)

par(pty="s")

qqnorm(data); qqline(data)opens up a single plot window

Page 20: Workshop in  R & GLMs: #2

Tests for normality: exercise

-2 -1 0 1 2

01

23

45

6

Normal Q-Q Plot

Theoretical Quantiles

Sa

mp

le Q

ua

ntil

es

Page 21: Workshop in  R & GLMs: #2

Tests for normality: exercise

If the distribution isn´t normal, what is it?

freq.data<-table(data); freq.data

barplot(freq.data)

0 1 2 3 4 5 6

01

02

03

04

0

Page 22: Workshop in  R & GLMs: #2

Non-normal distributions• Poisson (count data, left-skewed, variance =

mean)

• Negative binomial (count data, left-skewed, variance >> mean)

• Binomial (binary or proportion data, left-skewed, variance constrained by 0,1)

• Gamma (variance increases as square of mean, often used for survival data)

Page 23: Workshop in  R & GLMs: #2

Exercise

model2 <- lm (Species ~ AreaFactor, data = diane)

anova (model2)

1. Test for normality of residuals

resid2<- resid (model2)

...you do the rest!

2. Test for homogeneity of variances

summary (lm (abs (resid2) ~ AreaFactor))

Page 24: Workshop in  R & GLMs: #2

Regression diagnostics

1. Residuals are normally distributed

2. Absolute value of residuals do not change with predicted value (homoscedastcity)

3. Residuals show no pattern with predicted values (i.e. the function “fits”)

4. No datapoint has undue leverage on the model.

Page 25: Workshop in  R & GLMs: #2

Regression diagnostics

model3 <- lm (Species ~ Elevation + pH, data = diane)

par(mfrow=c(2,2)); plot(model3)

Page 26: Workshop in  R & GLMs: #2

1. Residuals are normally distributed

• Straight “Normal Q-Q plot”

Theoretical Quantiles

Std

. d

evia

nce

res

id.

Page 27: Workshop in  R & GLMs: #2

2. Absolute residuals do not change with predicted values

• No fan shape in Residuals vs fitted plot

• No upward (or downward) trend in Scale-location plot

Fitted values

Sq

rt (

abs

(SD

res

id.)

)

Fitted values

Res

idu

als MALO

BUENO

Page 28: Workshop in  R & GLMs: #2

Examples of neutral and fan-shapes

Page 29: Workshop in  R & GLMs: #2

3. Residuals show no pattern

Curved residual plots result from fitting a straight line to non-linear data (e.g. quadratic)

Page 30: Workshop in  R & GLMs: #2

4. No unusual leverage

Cook’s distance > 1 indicates a point with undue leverage (large change in model fit when removed)

Page 31: Workshop in  R & GLMs: #2

Transformations

Try transforming your y-variable to improve the regression diagnostic plot

• replace Species with log(Species)

• replace Species with sqrt(Species)

Page 32: Workshop in  R & GLMs: #2

Poisson distribution

• Frequency data

• Lots of zero (or minimum value) data

• Variance increases with the mean

Page 33: Workshop in  R & GLMs: #2

1. Correct for correlation between mean and variance by log-transforming y (but log (0) is undefined!!)

2. Use non-parametric statistics (but low power)

3. Use a “generalized linear model” specifying a Poisson distribution

What do you do with Poisson data?

Page 34: Workshop in  R & GLMs: #2

The problem: Hard to transform data to satisfy all requirements!

Tarea: Janka example

Janka dataset: Asks if Janka hardness values are a good estimate of timber density? N=36