Workshop in R & GLMs: #3. Diane Srivastava, University of British Columbia


Housekeeping

ls() asks what variables are in the global environment

rm(list=ls()) gets rid of EVERY variable

q() quit, get a prompt to save workspace or not

hard~dens

[Figure: diagnostic plots for hard~dens. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance contours). Observations 31, 32 and 36 are labelled in every panel.]

hard^0.45~dens

log(hard)~dens

[Figure: diagnostic plots for log(hard)~dens. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance contours).]

Janka exercise

Conclusion:

The best y transformation for optimizing the model fit (highest log likelihood) is not the best y transformation for normal residuals.
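A minimal sketch of how the transformation with the highest log likelihood can be located, assuming the MASS package is available. The data frame `d` is a simulated stand-in, not the Janka data, and its coefficients are invented for illustration; only the column names `hard` and `dens` follow the slides.

```r
library(MASS)  # provides boxcox()

# Simulated stand-in for the exercise data (assumption: not the real Janka values)
set.seed(2)
d <- data.frame(dens = runif(36, 25, 70))
d$hard <- exp(5.5 + 0.03 * d$dens + rnorm(36, sd = 0.1))

# boxcox() needs the response and QR decomposition stored in the fit
fit <- lm(hard ~ dens, data = d, y = TRUE, qr = TRUE)

# Profile log likelihood over a grid of power transformations of y
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.05), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]  # power with the highest log likelihood
best_lambda
```

The diagnostic plots for a model refit with the chosen power can then be compared with those for other transformations, which is where the two criteria may disagree.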

This workshop

• Linear, general linear, and generalized linear models.

• Understand how GLMs work [Excel simulation]

• Definitions: e.g. deviance, link functions

• Poisson GLMs [R exercise]

• Binomial distribution and logistic regression

• Fit GLMs in R! [Exercise]

In the beginning there were…

Linear models: a normally-distributed y fit to a continuous x

But wait… couldn’t we just code a categorical variable to be continuous?

Y    x
1.2  0
1.3  0
1.1  1
0.9  1
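The point of the small table can be checked directly. This sketch uses the four values from the slide and fits the same model twice, once with x as a number and once with x as a factor; the object names are illustrative.

```r
# Data from the slide: y measured in two groups coded 0 and 1
y <- c(1.2, 1.3, 1.1, 0.9)
x <- c(0, 0, 1, 1)

fit_numeric <- lm(y ~ x)          # x treated as a continuous 0/1 variable
fit_factor  <- lm(y ~ factor(x))  # x treated as a categorical variable

# Intercept = mean of group 0; slope = difference between the group means
coef(fit_numeric)
coef(fit_factor)
```

Both fits give the same two estimates, which is why a two-level categorical variable can be handled by the same machinery as a continuous one.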

Then there were…

General Linear Models: a normally-distributed y fit to a continuous OR categorical x

But wait… why do we force our data to be normal when often it isn’t?

Generalized linear models

No more need for tedious transformations!

Proud to be Poisson !

All variances are unequal, but some are more unequal than others…

Because most things in life

aren’t normal !

Distribution solution !

What linear models do:

[Figure: two panels, Y against X and log Y against X]

1. Transform y

2. Fit line to transformed y

3. Back transform to linear y

What GLMs do:

[Figure: two panels, Y against X and log fitted values against X]

1. Start with an arbitrary fitted line

2. Back-transform line into linear space

3. Calculate residuals

4. Improve fitted line to maximize likelihood

Many iterations

Maximum likelihood

• Means that an iterative process is used to find the model equation that has the highest probability (likelihood) of explaining the y values given the x values.

• Equation for likelihood depends on the error distribution chosen

• Least squares – by contrast – minimizes variation from the model.

• If the data are normally distributed, maximum likelihood gives the same answer as least squares.
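The last bullet can be illustrated with a small sketch on simulated data. The data, starting values and the function name `negloglik` are assumptions made for the example: a straight line is fitted once by numerically maximizing a normal likelihood with optim() and once by least squares with lm().

```r
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)  # simulated data with normal errors

# Negative log likelihood of a straight line with normal errors.
# par = c(intercept, slope, log(sigma)); using log(sigma) keeps sigma positive.
negloglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

# Iterative search for the parameters with the highest likelihood
ml <- optim(c(mean(y), 0, log(sd(y))), negloglik, method = "BFGS")
ls <- lm(y ~ x)  # least squares fit

rbind(max_likelihood = ml$par[1:2], least_squares = coef(ls))
```

The two rows should agree up to the numerical tolerance of the optimizer.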

GLM simulation exercise

• Simulates fitting a model with normal errors and a log link to data.

• Your task:

(1) understand how the spreadsheet works

(2) find through an iterative process the best slope
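For readers without the spreadsheet, the same idea can be sketched in R. This is not the workshop spreadsheet: the data are simulated, and the intercept is held fixed at an assumed value so that only the slope is searched, as in task (2).

```r
set.seed(42)
x <- seq(0, 5, length.out = 30)
y <- exp(0.3 + 0.4 * x) + rnorm(30, sd = 0.5)  # normal errors around a log-link curve

intercept <- 0.3  # assumption: held fixed so that only the slope is searched

# Log likelihood of one candidate slope
loglik_for_slope <- function(slope) {
  fitted <- exp(intercept + slope * x)  # back-transform the line into linear space
  res <- y - fitted                     # residuals on the original y scale
  sigma <- sqrt(mean(res^2))            # error sd implied by this candidate slope
  sum(dnorm(res, mean = 0, sd = sigma, log = TRUE))
}

# Try many slopes and keep the one with the highest likelihood
slopes <- seq(0, 1, by = 0.001)
ll <- sapply(slopes, loglik_for_slope)
best_slope <- slopes[which.max(ll)]
best_slope
```

A grid search is used here only because it mirrors trial and error in a spreadsheet; glm() reaches the answer with a more efficient iterative algorithm.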

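Generalized linear models

In least squares, we fit:

y = mx + b + error

In a GLM, the model is fit more indirectly:

expected y = g(mx + b)

where g is a function, the inverse of which is called the “link function”:

link fn(expected y) = mx + b

The error is not part of the linear predictor mx + b; it enters through the chosen error distribution of y around expected y.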
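In R, a family object carries both functions, which makes the relationship easy to inspect. The values of m, x and b below are arbitrary numbers chosen for illustration.

```r
fam <- poisson(link = "log")

eta <- 0.2 * 3 + 0.5       # linear predictor mx + b, with m = 0.2, x = 3, b = 0.5
mu  <- fam$linkinv(eta)    # g(): back-transform to expected y, here exp(eta)
back <- fam$linkfun(mu)    # link function: recovers the linear predictor, here log(mu)

c(eta = eta, mu = mu, back = back)
```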

LMs vs GLMs

LMs:

• Uses least squares

• Assumes normality

• Based on Sum of Squares

• Fits model to transformed y

GLMs:

• Uses maximum likelihood

• Specify one of several distributions

• Based on deviance

• Fits model to untransformed y by means of a link function

All that really matters…

• By using a log link function, we do not need to calculate log(0).

• Be careful! A log link model predicts log y not y!

• Error distribution need not be normal: Poisson, binomial, gamma, Gaussian (= normal)
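The warning about predictions can be seen with predict(). The sketch below uses simulated counts with low means, so zeros are likely; the data frame `d` and its coefficients are assumptions for illustration.

```r
set.seed(3)
d <- data.frame(x = 1:30)
d$y <- rpois(30, lambda = exp(-1 + 0.1 * d$x))  # simulated counts, untransformed

# y is never logged, so zeros in y cause no problem
fit <- glm(y ~ x, family = poisson(link = "log"), data = d)

new <- data.frame(x = c(5, 15, 25))
log_scale <- predict(fit, newdata = new, type = "link")      # predicted log(y)
y_scale   <- predict(fit, newdata = new, type = "response")  # predicted y

cbind(new, log_scale, y_scale, back_transformed = exp(log_scale))
```

The default for predict() on a glm is type = "link", so the back-transformation (or type = "response") is needed to get values on the scale of y.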

Exercise

1. Open up the file: Rlecture.csv

   diane<-read.table(file.choose(),sep=",",header=TRUE)

2. Look at the dataframe. Make treat a factor (“treat”)

3. Fit this model:

   my.first.glm<-glm(growth~size*treat, family=poisson(link=log), data=diane); summary(my.first.glm)

4. Model diagnostics:

   par(mfrow=c(2,2)); plot(my.first.glm)
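Step 2 is the one without code on the slide. A hedged sketch follows, using a simulated stand-in for the data frame because Rlecture.csv is not reproduced here. The column names follow the exercise, but the values and the 0/1 coding of treat are assumptions.

```r
# Simulated stand-in for Rlecture.csv (assumption: real values will differ)
set.seed(10)
diane <- data.frame(size = runif(40, 0, 7), treat = rep(c(0, 1), each = 20))
diane$growth <- rpois(40, lambda = exp(1 + 0.3 * diane$size + 0.5 * diane$treat))

str(diane)                          # look at the dataframe: treat starts out numeric
diane$treat <- factor(diane$treat)  # make treat a factor

my.first.glm <- glm(growth ~ size * treat, family = poisson(link = log), data = diane)
names(coef(my.first.glm))           # intercept, size, treat, and the interaction
```

With treat as a factor, size*treat expands to both main effects plus the size:treat interaction, so each treatment gets its own intercept and slope on the log scale.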

Overdispersion

[Figure: three panels labelled Underdispersed, Overdispersed and Random]

Is your residual deviance = residual df (approx.)?

If residual dev>>residual df, overdispersed.

If residual dev<<residual df, underdispersed.

Solution:

second.glm<-glm(growth~size*treat, family = quasipoisson (link=log), data=diane); summary(second.glm)
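The check can be written as a single ratio. The sketch below uses simulated counts that are overdispersed on purpose (negative binomial draws); the data frame `d` and its parameters are assumptions for illustration.

```r
set.seed(7)
d <- data.frame(size = runif(60, 0, 7))
# Counts with more variance than a Poisson distribution allows
d$growth <- rnbinom(60, mu = exp(1 + 0.3 * d$size), size = 1)

pois  <- glm(growth ~ size, family = poisson(link = log), data = d)
ratio <- deviance(pois) / df.residual(pois)  # near 1 if Poisson is adequate
ratio

quasi <- glm(growth ~ size, family = quasipoisson(link = log), data = d)
summary(quasi)$dispersion  # dispersion parameter estimated from the data
```

The quasipoisson fit returns the same coefficients as the Poisson fit; what changes is that the standard errors are scaled by the estimated dispersion.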

Options

family    default link  other links
binomial  logit         probit, cloglog
gaussian  identity      log, inverse
Gamma     inverse       identity, log
poisson   log           identity, sqrt

Rlecture.csv

[Figure: three scatterplots from Rlecture.csv, each plotted against Size: Parasitism (%), Growth, and Survival]

Binomial errors

• Variance gets constrained near limits; binomial accounts for this

• Type 1: Classic example: series of trials resulting in success (value=1) or failure (value=0).

• Type 2: Continuous but bounded data (e.g. % mortality, bounded between 0% and 100%).

Logistic regression

• Least squares: arcsine transformations

• GLMs: use logit (or probit) link with binomial errors

[Figure: y plotted against x; y axis from 0 to 1.2, x axis from -40 to 80]

Logit

p = proportion of successes

If $p = \frac{e^{ax+b}}{1 + e^{ax+b}}$, calculate:

$\log_e\left(\frac{p}{1-p}\right)$

Logits continued

Output from logistic regression with logit link: predicted $\log_e\left(\frac{p}{1-p}\right) = ax + b$

To obtain any expected value of p, input a and b into the original equation:

$p = \frac{e^{ax+b}}{1 + e^{ax+b}}$
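The two equations are inverses of each other, which a few lines of R can confirm. The values of a, b and x are arbitrary numbers chosen for illustration.

```r
a <- 0.1           # slope on the logit scale (illustrative value)
b <- -2            # intercept on the logit scale (illustrative value)
x <- c(0, 20, 40)

p <- exp(a * x + b) / (1 + exp(a * x + b))  # expected proportion of successes
logit_p <- log(p / (1 - p))                 # the logit: recovers a*x + b

cbind(x, p, logit_p, linear = a * x + b)

# Base R has the same pair built in: plogis() is the inverse logit, qlogis() the logit
plogis(a * x + b)
qlogis(p)
```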

Binomial GLMs

Type 1 binomial

• Simply set family = binomial (link=logit)

Type 2 binomial

• First create a vector of % not parasitized.

• Then “cbind” into a matrix (% parasitized, % not parasitized)

• Then run your binomial glm (link = logit) with the matrix as your y.
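The Type 2 steps can be sketched as follows. glm() reads the two columns of the matrix as numbers of successes and failures, so this example uses counts out of an assumed 50 hosts per row rather than percentages. The data and all column names other than size are hypothetical.

```r
set.seed(5)
d <- data.frame(size = seq(0.5, 7, by = 0.5))
n_hosts <- 50  # assumption: hosts examined per row

# Simulated number parasitized, then the complementary vector
d$parasitized     <- rbinom(nrow(d), size = n_hosts, prob = plogis(-2 + 0.6 * d$size))
d$not_parasitized <- n_hosts - d$parasitized

# Bind the two vectors into a two-column matrix and use it as y
y <- cbind(d$parasitized, d$not_parasitized)
fit <- glm(y ~ size, family = binomial(link = logit), data = d)
coef(fit)
```

If only percentages are available, binding (%, 100 - %) treats every row as if it came from 100 trials, which affects the standard errors but not the form of the model.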

Homework

1. Fit the binomial glm survival ~ size*treat

2. Fit the binomial glm parasitism ~ size*treat

3. Predict what size has 50% parasitism in treatment “0”
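For question 3, a useful fact is that p = 0.5 corresponds to a logit of 0, so the size at 50% is where the fitted line ax + b crosses zero, that is x = -b/a. The sketch below applies this to simulated 0/1 data for a single treatment. The data are invented, so it shows the method and not the homework answer.

```r
set.seed(11)
sim <- data.frame(size = runif(200, 0, 7))
sim$parasitized <- rbinom(200, size = 1, prob = plogis(-3 + 1 * sim$size))

fit <- glm(parasitized ~ size, family = binomial(link = logit), data = sim)

b <- coef(fit)[["(Intercept)"]]  # intercept on the logit scale
a <- coef(fit)[["size"]]         # slope on the logit scale
size50 <- -b / a                 # size at which the predicted proportion is 0.5
size50

# Check: the prediction at size50 is 0.5
predict(fit, newdata = data.frame(size = size50), type = "response")
```

With a size*treat model, the intercept and slope for a given treatment have to be assembled from the main effects and interaction terms before applying the same formula.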
