44
Statistics 262: Intermediate Biostatistics Random Effects, GLMs Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics – p.1/??

Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

  • Upload
    vancong

  • View
    218

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Statistics 262: IntermediateBiostatistics

Random Effects, GLMs

Jonathan Taylor & Kristin Cobb

Statistics 262: Intermediate Biostatistics – p.1/??

Page 2: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Overview of today’s class

Analysis of Variance: Random effects

Generalized Linear Models

Statistics 262: Intermediate Biostatistics – p.2/??

Page 3: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Example: sodium content in beer

How much sodium is there in North Americanbeer? How much does this vary by brand?

Observations: for 6 brands of beer, werecorded the sodium content of 8 12 ouncebottles.

Questions of interest: what is the “grandmean” sodium content? How much variabilityis there from brand to brand?

“Individuals” in this case are brands, repeatedmeasures are the 8 bottles.

Statistics 262: Intermediate Biostatistics – p.3/??

Page 4: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

One-way random effects model

Yij ∼ µ· + αi + εij, 1 ≤ i ≤ r, 1 ≤ j ≤ n

εij ∼ N(0, σ2), 1 ≤ i ≤ r, 1 ≤ j ≤ n

αi ∼ N(0, σ2µ), 1 ≤ i ≤ r.

Statistics 262: Intermediate Biostatistics – p.4/??

Page 5: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

One-way random (equal sample sizes)

Source SS df E(SS)

Treatments SSTR =Pr

i=1n

`

Y i· − Y··

´2r − 1 σ2 + nσ2

µ

Error SSE =Pr

i=1

Pnj=1

(Yij − Y i·)2 (n − 1)r σ2

ANOVA table is still useful to setup tests: thesame F statistics for fixed or random will workhere.

Statistics 262: Intermediate Biostatistics – p.5/??

Page 6: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Inference for µ·

We know that E(Y ··) = µ·, and can show that

Var(Y ··) =nσ2

µ + σ2

rn.

Therefore,Y ·· − µ·

SSTR(r−1)rn

∼ tr−1

Statistics 262: Intermediate Biostatistics – p.6/??

Page 7: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Inference for µ·

Why r − 1 degrees of freedom? Imagine wecould record an infinite number ofobservations for each individual, so thatY i· → µi.

To learn anything about µ· we still only have robservations (µ1, . . . , µr).

Sampling more within an individual cannotnarrow the CI for µ·.

Statistics 262: Intermediate Biostatistics – p.7/??

Page 8: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Inference for σ2µ

σ2µ+σ2

This quantity describes the relativecontribution of the random effects variance tothe total variance of one observation.

We use the fact that

F =σ2 + nσ2

µ

σ2×

SSTRr−1SSE

(n−1)r

∼ Fr−1,(n−1)r.

Manipulate the inequalities:

P (Fr−1,(n−1)r,α/2 ≤ F ≤ Fr−1,(n−1)r,1−α/2) = 1−α.

Statistics 262: Intermediate Biostatistics – p.8/??

Page 9: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Estimating σ2µ

From the ANOVA table

σ2µ =

E(SSTR/(r − 1)) − E(SSE/((n − 1)r))

n.

Natural estimate:

S2µ =

SSTR/(r − 1) − SSE/((n − 1)r)

n

Problem: this estimate can be negative! Oneof the difficulties in random effects model.

Statistics 262: Intermediate Biostatistics – p.9/??

Page 10: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

SAS code

PROC GLM DATA=DATADIR.beer;

CLASS BRAND BOTTLE;

MODEL SODIUM = BRAND;

MEANS BRAND / LSD CLDIFF;

RUN;

Statistics 262: Intermediate Biostatistics – p.10/??

Page 11: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Two-way random effects model

Yijk ∼ µ·· + αi + βj + (αβ)ij + εij, 1 ≤ i ≤ r, 1 ≤j ≤ m, 1 ≤ k ≤ n

εijk ∼ N(0, σ2), 1 ≤ i ≤ r, 1 ≤ j ≤ m, 1 ≤ k ≤n

αi ∼ N(0, σ2α), 1 ≤ i ≤ r.

βj ∼ N(0, σ2β), 1 ≤ j ≤ m.

(αβ)ij ∼ N(0, σ2αβ), 1 ≤ j ≤ m, 1 ≤ i ≤ r.

Statistics 262: Intermediate Biostatistics – p.11/??

Page 12: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

ANOVA tables: Two-way (random)

SS df E(MS)

SSA r − 1 σ2 + nmσ2α + nσ2

αβ

SSB m − 1 σ2 + nrσ2

β+ nσ2

αβ

SSAB (m − 1)(r − 1) σ2 + nσ2

αβ

SSE =Pr

i=1

Pmj=1

Pnk=1

(Yijk − Y ij·)2 (n − 1)mr σ2

To test H0 : σ2α = 0 use SSA and SSAB.

To test H0 : σ2αβ = 0 use SSAB and SSE.

Statistics 262: Intermediate Biostatistics – p.12/??

Page 13: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Mixed effects model

In some studies, some factors can be thoughtof as fixed, others random.

For instance, we might have a study of theeffect of a standard part of the brewingprocess on sodium levels in the beerexample.

Then, we might think of a model in which wehave a fixed effect for “brewing technique”and a random effect for beer.

Statistics 262: Intermediate Biostatistics – p.13/??

Page 14: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Two-way mixed effects model

Yijk ∼ µ·· + αi + βj + (αβ)ij + εij, 1 ≤ i ≤ r, 1 ≤j ≤ m, 1 ≤ k ≤ n

εijk ∼ N(0, σ2), 1 ≤ i ≤ r, 1 ≤ j ≤ m, 1 ≤ k ≤n

αi ∼ N(0, σ2α), 1 ≤ i ≤ r.

βj, 1 ≤ j ≤ m are constants.

(αβ)ij ∼ N(0, σ2αβ), 1 ≤ j ≤ m, 1 ≤ i ≤ r.

Statistics 262: Intermediate Biostatistics – p.14/??

Page 15: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Constraints

∑mj=1 βj = 0

∑ri=1(αβ)ij = 0, 1 ≤ i ≤ r.

Statistics 262: Intermediate Biostatistics – p.15/??

Page 16: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

ANOVA tables: Two-way (mixed)

SS df E(MS)

SSA r − 1 σ2 + nmσ2α

SSB m − 1 σ2 + nr

Pmj=1

β2

i

m−1+ nσ2

αβ

SSAB (m − 1)(r − 1) σ2 + nσ2

αβ

SSE =Pr

i=1

Pmj=1

Pnk=1

(Yijk − Y ij·)2 (n − 1)ab σ2

To test H0 : σ2α = 0 use SSA and SSE.

To test H0 : β1 = · · · = βm = 0 use SSB andSSAB.

To test H0 : σ2αβ use SSAB and SSE.

Statistics 262: Intermediate Biostatistics – p.16/??

Page 17: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Example: pearls

Imitation pearl “study”.

Two factors: how many coats of lacquer, andbatch.

Natural to use batch as random, coats asfixed, but we’ll do both.

Statistics 262: Intermediate Biostatistics – p.17/??

Page 18: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

SAS code: two-way

PROC MIXED;CLASS BATCH COATS;MODEL VALUE= ;RANDOM COATS BATCH COATS*BATCH;

RUN;

Statistics 262: Intermediate Biostatistics – p.18/??

Page 19: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

SAS code: mixed

PROC MIXED;CLASS COATS BATCH;MODEL VALUE=COATS;RANDOM BATCH COATS*BATCH;

RUN;

Statistics 262: Intermediate Biostatistics – p.19/??

Page 20: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Nested vs. crossed

In previous examples the factors are suchthat we observe every pair of possible values.

Sometimes, particularly in random effectsmodels, this is not possible.

For example, consider a model with “city” as afactor, and “country” as another.

Statistics 262: Intermediate Biostatistics – p.20/??

Page 21: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Nested random effect models

Suppose we want to study the effectivenessof a recycling program at reducing householdwaste.

Outcome: monthly per capita householdwaste.

Design: for all G8 countries, recyclingprogram is implemented in 10 cities atrandom, with 10 other cities chosen atrandom. Outcome is measured each monthfor a year.

Statistics 262: Intermediate Biostatistics – p.21/??

Page 22: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

What can be estimated?

Interactions between city and country can’t beestimated, nor between city and “program”.

We say that “city” is nested within “country”and “program” because the levels of city can’tappear in other countries, and each city eitherimplements the program or not.

It is possible to estimate interactions betweentime, country and program.

City can be either a random or fixed effect –here we treat it as random.

Statistics 262: Intermediate Biostatistics – p.22/??

Page 23: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Nested model: SAS “code”

PROC MIXED;CLASS CITY COUNTRY PROGRAM MONTH;MODEL WASTE = COUNTRY PROGRAM MONTHCOUNTRY*PROGRAM;RANDOM CITY(COUNTRY PROGRAM);

LSMEANS COUNTRY*MONTH;

Statistics 262: Intermediate Biostatistics – p.23/??

Page 24: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Generalized linear models

All models we have seen so far deal withcontinuous outcome variables with norestriction on their expectations, and all haveassumed that mean and variance areunrelated.

Many outcomes of interest do not satisfy this.

Examples: binary outcomes, Poisson countoutcomes.

Statistics 262: Intermediate Biostatistics – p.24/??

Page 25: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Binary outcomes

If our outcome Yi is a 0-1 random variable,then 0 ≤ E(Yi) ≤ 1.

Also,

E(Yi) = πi, V ar(Yi) = πi(1 − πi).

A convenient way to model the dependenceof Yi on covariates Xi1, . . . , Xik is through thelogit transform.

Statistics 262: Intermediate Biostatistics – p.25/??

Page 26: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Logit transform

Logit transform:

logit(π) = log

1 − π

)∈ (−∞, +∞)

Inverse:

logit−1(x) =ex

1 + ex∈ (0, 1).

Statistics 262: Intermediate Biostatistics – p.26/??

Page 27: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Binary regression

Logistic regression model:

logit(E(Yi)) = logit(πi) = β0 +k∑

j=1

βjXij .

Probit regression model:

Φ−1(E(Yi)) = β0+k∑

j=1

βjXij, Φ(x) =

∫ x

−∞

e−x2/2

√2π

dx.

In each case, V ar(Yi) = πi(1 − πi) but theregression model is different.

Statistics 262: Intermediate Biostatistics – p.27/??

Page 28: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Link & variance fns.

If

ηi = g(E(Yi)) = g(µi) = β0 +k∑

j=1

βjXij

then g is the link function for the model.

IfV ar(Yi) = φ · V (E(Yi)) = φ · V (µi)

for some function V , then V is the variancefunction for the data.

logit(E(Yi)) = logit(πi) = β0 +k∑

j=1

βjXij .

Statistics 262: Intermediate Biostatistics – p.28/??

Page 29: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Binary (again)

For a logistic model,

g(µ) = logit(µ), V (µ) = µ(1 − µ).

For a probit model,

g(µ) = Φ−1(µ), V (µ) = µ(1 − µ).

Statistics 262: Intermediate Biostatistics – p.29/??

Page 30: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Other common examples

Multiple linear regression:g(µ) = µ, V ar(µ) = 1

Linear regression with variance tied to meang(µ) = µ, V ar(µ) = µ2.

Poisson log-linear models:g(µ) = log(µ), V ar(µ) = µ.

Statistics 262: Intermediate Biostatistics – p.30/??

Page 31: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Fitting a GLM

When fitting GLMs, instead of SSE, thelog-likelihood is minimized,

DEV (β0, . . . , βk)n∑

i=1

2 log L(Yi, Xij) − 2 log L(β0, . . . , βk|Yi, Xij).

The first term above is the likelihood in thecompletely unrestricted model. For example,in Gaussian setting it is zero (can set eachµi = Yi). In Poisson setting it is

2

(n∑

i=1

Yi log(Yi) − Yi − log(Yi!)

).

Statistics 262: Intermediate Biostatistics – p.31/??

Page 32: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Quasi-likelihood

For some choices of link and variancefunctions, there is no deviance.

In these situations, quasi-likelihood is what isminimized instead

X2(β0, . . . , βk) =n∑

i=1

(Yi − µi(β0, . . . , βk)√

V (µi)

)2

.

Pearson’s X2 also gives an estimate of φ ifnot specified as in binary or Poisson models

φ̂ =1

n − k − 1X2(β̂0, . . . , β̂k).

Statistics 262: Intermediate Biostatistics – p.32/??

Page 33: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Inference in GLMs

Large sample theory tells us that

β̂ ∼ N(β, φ̂(XTW−1X)−1

)

with weights Wi = V (µ̂i).

For testing individual β’s everything isbasically the same as in multiple regression,just have different estimates ofSE(

∑kj=0 ajβ̂j).

Likewise, confidence intervals just havedifferent estimates of SE. Use normal insteadof t. Statistics 262: Intermediate Biostatistics – p.33/??

Page 34: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Partial deviance tests

As in multiple regression to test that somesubset H0 : βi1 = · · · = βil = 0 we fit a full anda reduced model.

Large sample theory again tells us that

DEVχ2 =DEV (R) − DEV (F )

φ̂∼ χ2

dfR−dfF.

If deviance is unavailable, Pearson’s X2 issubstituted.

Reject H0 if DEVχ2 is larger than χ2dfR−dfF ,1−α.

Statistics 262: Intermediate Biostatistics – p.34/??

Page 35: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Wald χ2 tests

Test can also be done using a Wald χ2 whichdoes not fit a full and reduced model.

Wald χ2 to test Cβ = 0:

WALDχ2 = Cβ̂(C(XTW−1X)−1CT )−1(Cβ̂T ).

Reject H0 : Cβ = 0 if WALDχ2 is larger thanχ2

#rowsC,1−α.

Statistics 262: Intermediate Biostatistics – p.35/??

Page 36: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Logistic example

A local health clinic sent fliers to its clients toencourage everyone, but especially olderpersons at high risk of complications, to get aflu shot in time for protection against anexpected flu epidemic.

In a pilot follow-up study, 50 clients wererandomly selected and asked whether theyactually received a flu shot.

In addition, data were collected on their ageand their health awareness.

Statistics 262: Intermediate Biostatistics – p.36/??

Page 37: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

SAS code: logistic

PROC GENMOD;MODEL SHOT = AGE HEALTH AGE*HEALTH /DIST=BINOMIAL LINK=LOGIT TYPE3WALD CL WALDCI;RUN;

Statistics 262: Intermediate Biostatistics – p.37/??

Page 38: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

SAS code: probit

PROC GENMOD;MODEL SHOT = AGE HEALTH AGE*HEALTH /DIST=BINOMIAL LINK=PROBIT TYPE3WALD CL WALDCI;RUN;

Statistics 262: Intermediate Biostatistics – p.38/??

Page 39: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Poisson example

A researcher in geriatrics designed aprospective study to investigate the effects oftwo interventions on the frequency of falls.

One hundred subjects were randomlyassigned to one of the two interventions:education only and education plus exercise.

Three variables: gender, a balance index, anda strength index.

Statistics 262: Intermediate Biostatistics – p.39/??

Page 40: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

SAS code

PROC GENMOD;MODEL FALLS=INTERVENTION GENDER BALANCESTRENGTH /DIST=POISSON LINK=LOG TYPE3 WALDCI;CONTRAST ’GENDER&STRENGTH’ GENDER 1,STRENGTH 1;

RUN;

Statistics 262: Intermediate Biostatistics – p.40/??

Page 41: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Overdispersion in count data

Although Poisson is popular, it is not alwaysappropriate.

Often, observations are clustered more thanPoisson which increases the variance.

This phenomenon is called “overdispersion.”

Statistics 262: Intermediate Biostatistics – p.41/??

Page 42: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Simple approximate fix

We can estimate φ from either deviance orPearson X2.

This says that V ar(Yi) = φE(Yi) instead ofPoisson relationship where φ = 1.

Other solutions: use different distribution,maybe negative binomial.

Statistics 262: Intermediate Biostatistics – p.42/??

Page 43: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

SAS code

PROC GENMOD;MODEL FALLS=INTERVENTION GENDER BALANCESTRENGTH /DIST=POISSON LINK=LOG TYPE3 WALDCI;CONTRAST ’GENDER&STRENGTH’ GENDER 1,STRENGTH 1;

RUN;

Statistics 262: Intermediate Biostatistics – p.43/??

Page 44: Statistics 262: Intermediate Biostatisticsstatweb.stanford.edu/~jtaylo/courses/stats262/spring.2004/notes/... · SAS code PROC GLM DATA=DATADIR.beer; ... Statistics 262: Intermediate

Diagnostics

Diagnostic tests are similar to multipleregression, only use different residuals.

Deviance residuals: contribution of i-thobservation to deviance.

rD,i2 log L(Yi, Xij) − 2 log L(β0, . . . , βk|Yi, Xij)

Pearson residuals:

rP,i =Yi − µ̂i

φ̂V (µ̂i).

Statistics 262: Intermediate Biostatistics – p.44/??