Housekeeping
ls() lists the variables in the global environment
rm(list=ls()) removes EVERY variable
q() quits R; you are prompted to save the workspace or not
hard~dens
[Figure: diagnostic plots for the model hard~dens. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance contours). Observations 31, 32 and 36 are labeled.]
hard^0.45~dens
log(hard)~dens
[Figure: diagnostic plots for the model log(hard)~dens. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage (with Cook's distance contours).]
Janka exercise
Conclusion:
The best y transformation to optimize the model fit (highest log likelihood) is not the best y transformation for normal residuals.
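The conclusion above can be explored with a Box-Cox profile. This is a minimal sketch, assuming simulated data in place of the Janka data set (the names dens and hard mirror the slide titles, but the numbers are made up) and assuming the log likelihood in question is the one profiled by boxcox() in the MASS package.

```r
library(MASS)  # provides boxcox(); MASS ships with standard R installations

set.seed(1)
# Simulated stand-in for the Janka data (values are made up)
dens <- runif(36, 25, 70)
hard <- (5 + 0.6 * dens + rnorm(36, sd = 2))^2

fit <- lm(hard ~ dens)

# Profile log likelihood over a grid of power transformations of y
bc <- boxcox(fit, lambda = seq(-1, 2, by = 0.05), plotit = FALSE)
best.lambda <- bc$x[which.max(bc$y)]

# Shapiro-Wilk p-value of the residuals under a few candidate transformations
shapiro.p <- sapply(c(log = 0, best = best.lambda, raw = 1), function(l) {
  y.t <- if (l == 0) log(hard) else hard^l
  shapiro.test(residuals(lm(y.t ~ dens)))$p.value
})

best.lambda   # power with the highest profile log likelihood
shapiro.p     # compare residual normality across transformations
```

Comparing best.lambda with the transformation that gives the largest Shapiro-Wilk p-value shows whether the two criteria agree for a given data set.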
This workshop
• Linear, general linear, and generalized linear models.
• Understand how GLMs work [Excel simulation]
• Definitions: e.g. deviance, link functions
• Poisson GLMs [R exercise]
• Binomial distribution and logistic regression
• Fit GLMs in R! [Exercise]
In the beginning there were…
Linear models: a normally-distributed y fit to a continuous x
But wait…couldn’t we just code a categorical variable to be continuous?
Y     x
1.2   0
1.3   0
1.1   1
0.9   1
Then there were…
General Linear Models: a normally-distributed y fit to a continuous OR categorical x
But wait…why do we force our data to be normal when often it isn’t?
Generalized linear models
No more need for tedious transformations!
Proud to be Poisson!
All variances are unequal, but some are more unequal than others…
Because most things in life aren’t normal!
Distribution solution !
What linear models do:
[Figure: two panels, Y against X and Log Y against X]
1. Transform y
2. Fit line to transformed y
3. Back transform to linear y
What GLMs do:
[Figure: two panels, Y against X and Log fitted values against X]
1. Start with an arbitrary fitted line
2. Back-transform line into linear space
3. Calculate residuals
4. Improve fitted line to maximize likelihood
Repeat over many iterations.
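The two routes above can be put side by side in R. This is a minimal sketch on simulated data (the variables x and y and all numeric values are made up for illustration): lm() is fit to log(y) and back-transformed, while glm() with a Gaussian family and log link leaves y untransformed and iterates to the fit.

```r
set.seed(42)
x <- seq(1, 10, length.out = 50)
# Exponential trend with additive normal noise; y stays well above zero
y <- exp(1 + 0.3 * x) + rnorm(50, sd = 0.5)

# Linear-model route: transform y, fit a line, back-transform the fitted values
lm.fit  <- lm(log(y) ~ x)
lm.pred <- exp(fitted(lm.fit))

# GLM route: y is left untransformed; the log link connects the line to y
glm.fit  <- glm(y ~ x, family = gaussian(link = "log"))
glm.pred <- fitted(glm.fit)

glm.fit$iter                                    # iterations used by glm()
rbind(lm = coef(lm.fit), glm = coef(glm.fit))   # similar, but not identical
```

The coefficients differ slightly because the linear model assumes normal errors on the log scale, whereas this GLM assumes normal errors on the original scale.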
Maximum likelihood
• An iterative process is used to find the model equation under which the observed y values, given the x values, have the highest probability (likelihood).
• The equation for the likelihood depends on the error distribution chosen.
• Least squares, by contrast, minimizes the sum of squared deviations from the model.
• If the errors are normally distributed, maximum likelihood gives the same answer as least squares.
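The last point can be checked numerically. This is a minimal sketch on simulated data (all names and values are made up): a normal log likelihood is maximized with optim() and the resulting intercept and slope are compared with those from lm().

```r
set.seed(7)
x <- runif(40, 0, 10)
y <- 2 + 1.5 * x + rnorm(40, sd = 1)

# Negative log likelihood of a straight line with normal errors.
# par = c(intercept, slope, log(sd)); working with log(sd) keeps sd positive.
negloglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

# Iterative maximization of the likelihood (minimizing its negative)
ml <- optim(c(mean(y), 0, log(sd(y))), negloglik, method = "BFGS")

# Least squares fit for comparison
ls.fit <- lm(y ~ x)

rbind(max.lik = ml$par[1:2], least.sq = coef(ls.fit))
```

The two rows should agree up to the numerical tolerance of the optimizer.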
GLM simulation exercise
• Simulates fitting a model with normal errors and a log link to data.
• Your task:
(1) understand how the spreadsheet works
(2) find through an iterative process the best slope
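The spreadsheet task can also be mimicked in R. This is a minimal sketch, not the workshop's spreadsheet: the data are simulated, the intercept b and the error standard deviation are assumed known and held fixed, and the search is a plain grid over candidate slopes.

```r
set.seed(3)
x <- 1:20
b <- 1                                   # intercept, held fixed for simplicity
y <- exp(b + 0.1 * x) + rnorm(20, sd = 0.5)

# Log likelihood of a candidate slope m under normal errors and a log link
loglik <- function(m, sigma = 0.5) {
  fitted.y <- exp(b + m * x)             # back-transform the line to the y scale
  sum(dnorm(y, mean = fitted.y, sd = sigma, log = TRUE))
}

slopes <- seq(0, 0.2, by = 0.001)        # candidate slopes to try
ll <- sapply(slopes, loglik)
best.slope <- slopes[which.max(ll)]

# glm() with the same intercept fixed through an offset, for comparison
check <- glm(y ~ 0 + x + offset(rep(b, 20)), family = gaussian(link = "log"))
c(grid = best.slope, glm = unname(coef(check)))
```

Plotting ll against slopes shows the likelihood surface that the iterative process climbs.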
Generalized linear models
In least squares, we fit:
y = mx + b + error
In GLM, the model is fit more indirectly:
expected y = g(mx + b)
where g is a function, the inverse of which is called the “link function”:
link fn(expected y) = mx + b
The error is not added to the line; it is described by the distribution chosen for y around the expected y.
LMs vs GLMs
LMs:
• Use least squares
• Assume normality
• Based on Sum of Squares
• Fit model to transformed y
GLMs:
• Use maximum likelihood
• Specify one of several distributions
• Based on deviance
• Fit model to untransformed y by means of a link function
All that really matters…
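• By using a log link function, we do not need to calculate log(0): the link is applied to the expected y, not to the data, so zeros in y are allowed.
• Be careful! A log link model predicts log y, not y! Exponentiate the predictions to get back to y.
• Error distribution need not be normal: Poisson, binomial, gamma, Gaussian (=normal)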
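The warning about predictions on the log scale can be illustrated with predict(). This is a minimal sketch on simulated count data (the names size and count and all values are made up): type = "link" returns log(expected y), and type = "response" returns expected y.

```r
set.seed(11)
size  <- runif(60, 1, 7)
count <- rpois(60, lambda = exp(0.5 + 0.3 * size))   # zeros are allowed

fit <- glm(count ~ size, family = poisson(link = "log"))
new <- data.frame(size = c(2, 4, 6))

on.link     <- predict(fit, newdata = new, type = "link")      # log(expected y)
on.response <- predict(fit, newdata = new, type = "response")  # expected y

# Exponentiating the link-scale prediction gives the response-scale prediction
cbind(new, on.link, back.transformed = exp(on.link), on.response)
```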
Exercise
1. Open the file Rlecture.csv:
diane<-read.table(file.choose(),sep=",",header=TRUE)
2. Look at the data frame. Make treat a factor ("treat").
3. Fit this model:
my.first.glm<-glm(growth~size*treat, family=poisson(link=log), data=diane); summary(my.first.glm)
4. Model diagnostics:
par(mfrow=c(2,2)); plot(my.first.glm)
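The exercise steps can be tried without the file. This is a minimal sketch that assumes a simulated data frame in place of Rlecture.csv: the column names size, treat and growth follow the exercise, but the values are made up and the real file may be structured differently.

```r
set.seed(2)
# Simulated stand-in for Rlecture.csv (values are made up)
diane <- data.frame(size = runif(80, 0, 7), treat = rep(c(0, 1), each = 40))
diane$growth <- rpois(80, exp(1 + 0.3 * diane$size + 0.4 * diane$treat))

str(diane)                             # step 2: treat is stored as numeric
diane$treat <- factor(diane$treat)     # convert it so it is fit as a category

# Step 3: Poisson GLM with a log link and a size-by-treatment interaction
my.first.glm <- glm(growth ~ size * treat, family = poisson(link = log),
                    data = diane)
summary(my.first.glm)

# Step 4: four diagnostic plots on one page
par(mfrow = c(2, 2))
plot(my.first.glm)
```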
Overdispersion
[Figure: three patterns labeled Underdispersed, Overdispersed, Random]
Is your residual deviance = residual df (approx.)?
If residual dev>>residual df, overdispersed.
If residual dev<<residual df, underdispersed.
Solution: refit with a quasi-Poisson family, which estimates a dispersion parameter instead of fixing it at 1:
second.glm<-glm(growth~size*treat, family=quasipoisson(link=log), data=diane); summary(second.glm)
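The deviance check and the quasi-Poisson refit can be seen together on deliberately overdispersed data. This is a minimal sketch with simulated counts (negative binomial draws are used only to create overdispersion; names and values are made up).

```r
set.seed(5)
size  <- runif(100, 0, 7)
treat <- factor(rep(c(0, 1), each = 50))
# Negative binomial counts: more variable than Poisson counts with the same mean
growth <- rnbinom(100, mu = exp(1 + 0.3 * size), size = 2)

pois.fit  <- glm(growth ~ size * treat, family = poisson(link = log))
quasi.fit <- glm(growth ~ size * treat, family = quasipoisson(link = log))

# Rule of thumb: residual deviance / residual df should be close to 1
deviance(pois.fit) / df.residual(pois.fit)

# Quasi-Poisson keeps the coefficients but estimates the dispersion,
# which inflates the standard errors
summary(quasi.fit)$dispersion
cbind(poisson = coef(summary(pois.fit))[, "Std. Error"],
      quasi   = coef(summary(quasi.fit))[, "Std. Error"])
```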
Options
family     default link   other links
binomial   logit          probit, cloglog
gaussian   identity
Gamma      inverse        identity, log
poisson    log            identity, sqrt
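The default links in the table can be read directly from R's family objects. A minimal sketch using only base R:

```r
# Each family function returns an object whose $link element names its link
families <- list(binomial = binomial(), gaussian = gaussian(),
                 Gamma = Gamma(), poisson = poisson())
default.links <- sapply(families, function(f) f$link)
default.links

# A non-default link is requested through the link argument
Gamma(link = "log")$link
```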
Rlecture.csv
[Figure: three scatterplots from Rlecture.csv: Parasitism (%) vs Size, Growth vs Size, Survival vs Size]
Binomial errors
• Variance gets constrained near the limits (0 and 1); binomial errors account for this
• Type 1: Classic example: a series of trials, each resulting in success (value=1) or failure (value=0).
• Type 2: Proportions, which look continuous but are bounded (e.g. % mortality bounded between 0% and 100%).
Logistic regression
• Least squares: arcsine transformations
• GLMs: use logit (or probit) link with binomial errors
[Figure: y (0 to 1) plotted against x (-40 to 80)]
Logit
p = proportion of successes
If p = e^(ax+b) / (1 + e^(ax+b)), calculate:
log_e(p/(1-p)) = ax + b
Logits continued
Output from logistic regression with logit link: predicted log_e(p/(1-p)) = ax + b
To obtain any expected values of p, substitute the fitted a and b into the original equation:
p = e^(ax+b) / (1 + e^(ax+b))
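The relationship between p and its logit can be verified numerically. This is a minimal sketch with a hypothetical slope a and intercept b; plogis() and qlogis() are R's built-in inverse logit and logit.

```r
a <- 0.2; b <- -1            # hypothetical slope and intercept
x <- c(-10, 0, 5, 20)

p <- exp(a * x + b) / (1 + exp(a * x + b))   # original equation for p
logit.p <- log(p / (1 - p))                  # logit of p, recovers a*x + b

all.equal(logit.p, a * x + b)
all.equal(p, plogis(a * x + b))   # built-in inverse logit
all.equal(logit.p, qlogis(p))     # built-in logit
```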
Binomial GLMs
Type 1 binomial
• Simply set family = binomial (link=logit)
Type 2 binomial
• First create a vector of % not parasitized.
• Then “cbind” into a matrix (% parasitized, % not parasitized). R reads the two columns as counts of successes and failures.
• Then run your binomial glm (link = logit) with the matrix as your y.
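Both types can be sketched on simulated data. The variable names below (size, survival, pct.para) and all values are made up for illustration. Because R treats the two cbind columns as counts, percentages behave as if every row came from 100 trials; where the real numbers of individuals are known, those counts are the safer input.

```r
set.seed(9)
size <- runif(50, 0, 7)

# Type 1: one 0/1 outcome per individual
survival <- rbinom(50, 1, plogis(-2 + 0.8 * size))
type1 <- glm(survival ~ size, family = binomial(link = logit))

# Type 2: percentage parasitized, with its complement built as a second vector
pct.para     <- rbinom(50, 100, plogis(3 - 0.9 * size))
pct.not.para <- 100 - pct.para
y <- cbind(pct.para, pct.not.para)     # two-column matrix used as the response
type2 <- glm(y ~ size, family = binomial(link = logit))

coef(type1)
coef(type2)
```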
Homework
1. Fit the binomial glm survival = size*treat
2. Fit the binomial glm parasitism = size*treat
3. Predict what size has 50% parasitism in treatment “0”
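One possible approach to the third task is to solve the fitted logit for p = 0.5. This is a minimal sketch on simulated data, not the homework data (names and values are made up), and it assumes treatment "0" is the reference level of the factor.

```r
set.seed(21)
size  <- runif(120, 0, 7)
treat <- factor(rep(c(0, 1), each = 60))
para  <- rbinom(120, 100, plogis(3 - 0.9 * size + 0.5 * (treat == "1")))

fit <- glm(cbind(para, 100 - para) ~ size * treat,
           family = binomial(link = logit))

# For the reference treatment the logit is intercept + slope * size.
# p = 0.5 means logit = 0, so size = -intercept / slope.
cf <- coef(fit)
size50 <- unname(-cf["(Intercept)"] / cf["size"])
size50

# Check: the predicted proportion at that size should be 0.5
p.at.size50 <- predict(fit,
                       data.frame(size = size50,
                                  treat = factor("0", levels = levels(treat))),
                       type = "response")
p.at.size50
```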