Page 1: Workshop in R & GLMs: #3 Diane Srivastava University of British Columbia srivast@zoology.ubc.ca

Workshop in R & GLMs: #3

Diane Srivastava

University of British Columbia

srivast@zoology.ubc.ca

Page 2

Housekeeping

ls() asks what variables are in the global environment

rm(list=ls()) gets rid of EVERY variable

q() quit, get a prompt to save workspace or not

Page 3

hard~dens

[Figure: diagnostic plots for hard~dens in four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observations 31, 32 and 36 are flagged in every panel.]

Page 4

hard^0.45~dens

Page 5

log(hard)~dens

[Figure: diagnostic plots for log(hard)~dens in four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours). Observation 3 is flagged in every panel.]

Page 6

Janka exercise

Conclusion:

The best y transformation to optimize the model fit (highest log likelihood) is not the best y transformation for normal residuals.
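One way to look for the power of y with the highest log likelihood is a Box-Cox profile. The sketch below uses simulated stand-in data, not the Janka data, and assumes the MASS package (shipped with R) is available; the variable names and parameter values are made up for illustration.

```r
library(MASS)  # provides boxcox()

# Simulated stand-in data (NOT the Janka data): multiplicative error in y
set.seed(1)
x <- runif(40, 20, 70)
y <- exp(5 + 0.03 * x + rnorm(40, sd = 0.1))

fit <- lm(y ~ x)

# Profile log likelihood over a grid of candidate powers (lambda)
bc <- boxcox(fit, lambda = seq(-1, 2, by = 0.05), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]
best_lambda

# Refit with that power (lambda = 0 means log) and check the residuals separately
fit_bc <- if (abs(best_lambda) < 1e-8) lm(log(y) ~ x) else lm(y^best_lambda ~ x)
shapiro.test(residuals(fit_bc))
```

The power with the highest likelihood and the power with the most normal-looking residuals are checked in two separate steps here because, as the conclusion says, they need not coincide.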

Page 7

This workshop

• Linear, general linear, and generalized linear models.

• Understand how GLMs work [Excel simulation]

• Definitions: e.g. deviance, link functions

• Poisson GLMs [R exercise]

• Binomial distribution and logistic regression

• Fit GLMs in R! [Exercise]

Page 8

In the beginning there were…

Linear models: a normally-distributed y fit to a continuous x

But wait…couldn’t we just code a categorical variable to be continuous?

Y     x
1.2   0
1.3   0
1.1   1
0.9   1
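A small sketch of that idea in R, using the four observations from the table: the same data are fit once with x as a 0/1 number and once with x as a factor. The object names are chosen here for illustration.

```r
# The four observations from the slide
y <- c(1.2, 1.3, 1.1, 0.9)
x <- c(0, 0, 1, 1)

fit_numeric <- lm(y ~ x)          # x treated as a continuous 0/1 variable
fit_factor  <- lm(y ~ factor(x))  # x treated as a two-level category

coef(fit_numeric)
coef(fit_factor)

# The two codings give the same fitted values:
# the intercept is the mean of group 0, the slope is the difference in group means
all.equal(fitted(fit_numeric), fitted(fit_factor))
```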

Page 9

Then there were…

General Linear Models: a normally-distributed y fit to a continuous OR categorical x

But wait…why do we force our data to be normal when often it isn’t?

Page 10

Generalized linear models

No more need for tedious transformations!

Proud to be Poisson!

All variances are unequal, but some are more unequal than others…

Because most things in life aren’t normal!

Distribution solution!

Page 11

What linear models do:

[Figure: two panels, Y against X and Log Y against X]

1. Transform y

2. Fit line to transformed y

3. Back transform to linear y
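The three steps can be sketched in R as follows. The data are simulated purely for illustration, and a log transformation is assumed as in the figure.

```r
set.seed(2)
x <- seq(1, 10, length.out = 30)
y <- exp(0.5 + 0.3 * x + rnorm(30, sd = 0.2))  # simulated, always positive

# Steps 1 and 2: transform y and fit a line to the transformed y
fit_log <- lm(log(y) ~ x)

# Step 3: back-transform the fitted line to the original scale of y
y_hat <- exp(fitted(fit_log))

plot(x, y)
lines(x, y_hat)
```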

Page 12

What GLMs do:

[Figure: two panels, Y against X and Log fitted values against X]

1. Start with an arbitrary fitted line

2. Back-transform line into linear space

3. Calculate residuals

4. Improve fitted line to maximize likelihood

Many iterations

Page 13

Maximum likelihood

• Means that an iterative process is used to find the model equation that has the highest probability (likelihood) of explaining the y values given the x values.

• Equation for likelihood depends on the error distribution chosen

• Least squares, by contrast, minimizes the sum of squared deviations from the model.

• If the errors are normally distributed, maximum likelihood gives the same answer as least squares.
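The last point can be checked numerically. In this sketch the data are simulated with normal errors, the likelihood is maximized directly with optim(), and the result is compared with lm(). The function name negloglik and the starting values are choices made here, not part of the workshop.

```r
set.seed(3)
x <- runif(50, 0, 10)
y <- 2 + 1.5 * x + rnorm(50)   # simulated: straight line plus normal errors

# Negative log likelihood for a straight line with normal errors
# par = (intercept, slope, log of the residual sd)
negloglik <- function(par) {
  mu <- par[1] + par[2] * x
  sigma <- exp(par[3])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Iterative search for the parameters with the highest likelihood
ml <- optim(c(mean(y), 0, log(sd(y))), negloglik, method = "BFGS",
            control = list(maxit = 500))

ml_coef <- ml$par[1:2]
ls_coef <- unname(coef(lm(y ~ x)))   # least squares
rbind(ml_coef, ls_coef)
```

The two rows should agree up to the tolerance of the numerical search.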

Page 14

GLM simulation exercise

• Simulates fitting a model with normal errors and a log link to data.

• Your task:

(1) understand how the spreadsheet works

(2) find through an iterative process the best slope
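The spreadsheet itself is not reproduced here. As a rough R analogue, the sketch below searches a grid of candidate slopes for a model with normal errors and a log link. The data are simulated, and the intercept is held at an assumed known value so that only the slope is searched.

```r
set.seed(4)
x <- seq(0, 5, length.out = 40)
b <- 1                                      # intercept, held fixed (assumption)
y <- exp(b + 0.4 * x) + rnorm(40, sd = 1)   # normal errors around a log-link mean

# Normal log likelihood for a candidate slope m
loglik <- function(m) {
  mu <- exp(m * x + b)          # back-transform the line into linear space
  res <- y - mu                 # residuals on the original scale
  sigma <- sqrt(mean(res^2))    # ML estimate of the residual sd for this slope
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

slopes <- seq(0, 1, by = 0.001)
ll <- sapply(slopes, loglik)
best_slope <- slopes[which.max(ll)]
best_slope

# For comparison: glm() estimates slope and intercept together
coef(glm(y ~ x, family = gaussian(link = "log"), start = c(1, 0.4)))
```

Plotting ll against slopes shows the likelihood surface that the iterative search climbs.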

Page 15

Generalized linear models

In least squares, we fit:

y = mx + b + error

In GLM, the model is fit more indirectly:

expected y = g(mx + b)

where g is a function, the inverse of which is called the “link function”:

link fn(expected y) = mx + b

The error enters through the distribution chosen for y around its expected value, not as a term inside g.

Page 16

LMs vs GLMs

LMs:

• Uses least squares

• Assumes normality

• Based on Sum of Squares

• Fits model to transformed y

GLMs:

• Uses maximum likelihood

• Specify one of several distributions

• Based on deviance

• Fits model to untransformed y by means of a link function

Page 17

All that really matters…

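• By using a log link function, we do not need to calculate log(0): the log is applied to the expected value of y, not to the observed y, so zeros in the data are not a problem.

• Be careful! A log link model predicts log(expected y), not y! Back-transform with exp() to get predictions on the scale of y.

• Error distribution need not be normal: Poisson, binomial, gamma, Gaussian (=normal)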
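A short sketch of the first two points, using simulated counts that include zeros. The data and parameter values are invented for illustration; predict() with type = "link" and type = "response" is standard for glm objects.

```r
set.seed(5)
x <- runif(60, 0, 3)
counts <- rpois(60, lambda = exp(-0.5 + 0.8 * x))   # simulated counts, zeros likely

# No log(0) problem: the log link acts on the expected count, not on the data
fit <- glm(counts ~ x, family = poisson(link = "log"))

new_x <- data.frame(x = c(0, 1, 2))
eta <- predict(fit, newdata = new_x, type = "link")      # log of expected count
mu  <- predict(fit, newdata = new_x, type = "response")  # expected count

cbind(new_x, eta, mu)
all.equal(exp(eta), mu)   # response scale is exp() of the link scale
```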

Page 18

Exercise

1. Open up the file: Rlecture.csv

diane<-read.table(file.choose(),sep=",",header=TRUE)

2. Look at dataframe. Make treat a factor (“treat”)

3. Fit this model:

my.first.glm<-glm(growth~size*treat, family=poisson(link=log), data=diane); summary(my.first.glm)

4. Model diagnostics

par(mfrow=c(2,2)); plot(my.first.glm)

Page 19

Overdispersion

[Figure: three panels labelled Underdispersed, Overdispersed, and Random]

Page 20

Overdispersion

Is your residual deviance = residual df (approx.)?

If residual dev>>residual df, overdispersed.

If residual dev<<residual df, underdispersed.

Solution:

second.glm<-glm(growth~size*treat, family = quasipoisson (link=log), data=diane); summary(second.glm)
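A minimal sketch of this check on simulated data (not Rlecture.csv). The counts are drawn from a negative binomial so that they are overdispersed relative to Poisson; the variable names and parameter values are assumptions for illustration.

```r
set.seed(6)
size <- runif(80, 1, 6)
theta <- 1.5   # small theta = strong overdispersion (assumed value)
growth <- rnbinom(80, size = theta, mu = exp(0.5 + 0.4 * size))

fit_pois <- glm(growth ~ size, family = poisson(link = "log"))

# Rule of thumb: residual deviance should be roughly equal to residual df
ratio <- deviance(fit_pois) / df.residual(fit_pois)
ratio

fit_quasi <- glm(growth ~ size, family = quasipoisson(link = "log"))
summary(fit_quasi)$dispersion   # estimated dispersion parameter

# The coefficients do not change; only the standard errors are rescaled
all.equal(coef(fit_pois), coef(fit_quasi))
```

A ratio well above 1 points to overdispersion, and the quasipoisson fit then widens the standard errors accordingly.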

Page 21

Options

family     default link   other links
binomial   logit          probit, cloglog
gaussian   identity
Gamma      inverse        identity, log
poisson    log            identity, sqrt

Page 22

Rlecture.csv

[Figure: three plots against Size: Parasitism (%), Growth, and Survival]

Page 23

Binomial errors

• Variance gets constrained near limits; binomial accounts for this

• Type 1: Classic example: series of trials resulting in success (value=1) or failure (value=0).

• Type 2: Also continuous but bounded (e.g. % mortality bounded between 0% and 100%).

Page 24

Logistic regression

• Least squares: arcsine transformations

• GLMs: use logit (or probit) link with binomial errors

[Figure: y (from 0 to 1) plotted against x]

Page 25

Logit

p = proportion of successes

If $p = e^{ax+b} / (1 + e^{ax+b})$, calculate:

$\log_e\left(\frac{p}{1-p}\right)$

Page 26

Logits continued

Output from logistic regression with logit link: predicted $\log_e\left(\frac{p}{1-p}\right) = ax + b$

To obtain any expected values of p, need to input a and b in original equation:

$p = e^{ax+b} / (1 + e^{ax+b})$
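The back-transformation can be tried out in R. The values of a and b below are hypothetical, chosen only to illustrate the arithmetic; plogis() and qlogis() are the built-in inverse logit and logit.

```r
a <- 0.1   # hypothetical slope
b <- -2    # hypothetical intercept
x <- c(0, 10, 20, 30, 40)

eta <- a * x + b                   # predicted log(p / (1 - p))
p   <- exp(eta) / (1 + exp(eta))   # back-transform to a proportion
p

# Built-in equivalents
all.equal(p, plogis(eta))   # inverse logit
all.equal(eta, qlogis(p))   # logit
```

Where ax + b = 0 (here x = 20), the expected proportion is 0.5.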

Page 27

Binomial GLMs

Type 1 binomial

• Simply set family = binomial (link=logit)

Type 2 binomial

• First create a vector of % not parasitized.

• Then “cbind” into a matrix (% parasitized, % not parasitized)

• Then run your binomial glm (link = logit) with the matrix as your y.
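A sketch of the Type 2 steps on simulated data (not Rlecture.csv). glm() reads a two-column response as numbers of successes and failures, so this example works with counts out of an assumed 20 hosts per sample rather than with percentages; all names and values are invented for illustration.

```r
set.seed(7)
n_hosts <- rep(20, 30)           # hosts examined per sample (assumption)
size <- runif(30, 0, 8)
parasitized <- rbinom(30, n_hosts, plogis(-2 + 0.6 * size))

# First create the complementary vector
not_parasitized <- n_hosts - parasitized

# Then cbind into a two-column matrix: (successes, failures)
y <- cbind(parasitized, not_parasitized)

# Then run the binomial glm with the matrix as y
fit <- glm(y ~ size, family = binomial(link = "logit"))
summary(fit)
```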

Page 28

Homework

1. Fit the binomial glm survival ~ size*treat

2. Fit the binomial glm parasitism ~ size*treat

3. Predict what size has 50% parasitism in treatment “0”