Gibbs Sampling in Hierarchical Models

Econ 690

Purdue University

March 19, 2012

Justin L. Tobias


In many models in economics and statistics, it is natural to introduce some kind of structure relating the parameters of the model.

For example, one might wish to express some degree of “similarity” across parameters of a model by assuming that they are drawn from a common population distribution.

The parameters of the population distribution are also of interest.

In Bayesian terms, such specifications can be accommodated by the appropriate choice of priors on the model parameters, while in frequentist parlance, these are termed “random coefficient” models.



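Consider the following most basic version of a longitudinal (panel) data model:

$$y_{it} = \alpha_i + \epsilon_{it}, \quad \epsilon_{it} \overset{iid}{\sim} N(0, \sigma^2_\epsilon),$$

$$\alpha_i \overset{iid}{\sim} N(\alpha, \sigma^2_\alpha).$$

In the above, $y_{it}$ refers to the outcome for individual (or more generally, group) $i$ at time $t$, and $\alpha_i$ is a person (or group) specific random effect.

We assume $i = 1, 2, \ldots, N$ and $t = 1, 2, \ldots, T$ (i.e., a balanced panel).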



For this model, we will do the following:

(a) Comment on how the presence of the random effects $\alpha_i$ accounts for correlation patterns within individuals over time.

(b) Derive the conditional posterior distribution $p(\alpha_i \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y)$.

(c) Obtain the mean of the conditional posterior distribution in (b). Comment on its relationship to a shrinkage estimator. (These are estimators that are typically written as some sort of weighted average of a “data” term and a prior term.) How does the mean change as $T$ and $\sigma^2_\epsilon/\sigma^2_\alpha$ change?



(a) Conditional on the random effects $\{\alpha_i\}_{i=1}^{N}$, the $y_{it}$ are independent.

However, marginalized over the random effects, outcomes are correlated within individuals over time.

To see this, note that we can write our model equivalently as:

$$y_{it} = \alpha + u_i + \epsilon_{it},$$

where we have rewritten our “random effect” specification as

$$\alpha_i = \alpha + u_i, \quad u_i \overset{iid}{\sim} N(0, \sigma^2_\alpha).$$



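Thus for $t \neq s$,

$$\begin{aligned}
\mathrm{Cov}(y_{it}, y_{is} \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha) &= \mathrm{Cov}(u_i + \epsilon_{it},\, u_i + \epsilon_{is}) \\
&= \mathrm{Cov}(u_i, u_i) \\
&= \mathrm{Var}(u_i) \\
&= \sigma^2_\alpha,
\end{aligned}$$

so that outcomes are correlated over time within individuals.

However, the model does not permit any degree of correlation between the outcomes of different individuals.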
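A quick simulation can make the within-individual covariance concrete. The sketch below assumes illustrative values ($\alpha = 1$, $\sigma^2_\epsilon = 2$, $\sigma^2_\alpha = 0.5$, $T = 3$) that are not taken from the lecture, and the function name `simulate_panel` is hypothetical.

```python
import numpy as np

def simulate_panel(N, T, alpha, sig2_eps, sig2_alpha, seed=0):
    """Simulate y_it = alpha + u_i + eps_it for a balanced N x T panel."""
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, np.sqrt(sig2_alpha), size=(N, 1))   # one effect per individual
    eps = rng.normal(0.0, np.sqrt(sig2_eps), size=(N, T))   # idiosyncratic noise
    return alpha + u + eps                                  # u broadcasts across t

# Illustrative (assumed) parameter values.
y = simulate_panel(N=200_000, T=3, alpha=1.0, sig2_eps=2.0, sig2_alpha=0.5)

# T x T sample covariance of (y_i1, ..., y_iT) across individuals.
cov = np.cov(y, rowvar=False)
print(np.round(cov, 2))
```

With these values the off-diagonal entries should be close to $\sigma^2_\alpha = 0.5$ and the diagonal entries close to $\sigma^2_\alpha + \sigma^2_\epsilon = 2.5$, up to simulation noise.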



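(b) From previous results relating to the derivation of conditional posterior distributions for regression parameters in a linear model, we can obtain:

$$\alpha_i \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y \sim N(Dd, D),$$

where

$$D = \left(\iota_T'\iota_T/\sigma^2_\epsilon + 1/\sigma^2_\alpha\right)^{-1} = \left(T/\sigma^2_\epsilon + 1/\sigma^2_\alpha\right)^{-1}$$

and

$$d = \iota_T' y_i/\sigma^2_\epsilon + \alpha/\sigma^2_\alpha,$$

with $\iota_T$ denoting a $T \times 1$ vector of ones and $y_i = [y_{i1}\ y_{i2}\ \cdots\ y_{iT}]'$.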



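(c) The mean of this conditional posterior distribution is easily obtained from our solution in (b):

$$E(\alpha_i \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y) = \frac{T\bar{y}_i/\sigma^2_\epsilon + \alpha/\sigma^2_\alpha}{T/\sigma^2_\epsilon + 1/\sigma^2_\alpha}, \qquad \bar{y}_i = \frac{1}{T}\sum_{t=1}^{T} y_{it}.$$

Let

$$w = w\left(T, [\sigma^2_\epsilon/\sigma^2_\alpha]\right) \equiv \frac{T}{T + (\sigma^2_\epsilon/\sigma^2_\alpha)}.$$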



We can then write

$$E(\alpha_i \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y) = w\bar{y}_i + (1 - w)\alpha.$$

This is in the form of a shrinkage estimator, where the conditional posterior mean of $\alpha_i$ is a weighted average of the averaged outcomes for individual $i$, $\bar{y}_i$, and the common mean for all individuals, $\alpha$.

As $T \to \infty$, the weight $w$ places all mass on $\bar{y}_i$.

On the other hand, if $\sigma^2_\epsilon$ is large relative to $\sigma^2_\alpha$ (and $T$ is small or moderate), the common mean $\alpha$ will get substantial weight.

The “fixed effect” formulation of this model is often criticized for overfitting.
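The behavior of the shrinkage weight can be checked numerically. The following is a minimal sketch; the function names and the values $\sigma^2_\epsilon = 4$, $\sigma^2_\alpha = 1$ are illustrative assumptions rather than anything from the lecture.

```python
def shrinkage_weight(T, sig2_eps, sig2_alpha):
    """w = T / (T + sig2_eps / sig2_alpha): the weight on the individual mean."""
    return T / (T + sig2_eps / sig2_alpha)

def posterior_mean_alpha_i(ybar_i, alpha, T, sig2_eps, sig2_alpha):
    """Conditional posterior mean of alpha_i as a weighted average."""
    w = shrinkage_weight(T, sig2_eps, sig2_alpha)
    return w * ybar_i + (1.0 - w) * alpha

# The weight on ybar_i rises toward 1 as T grows (noise ratio held at 4).
for T in (1, 5, 50, 5000):
    print(T, shrinkage_weight(T, sig2_eps=4.0, sig2_alpha=1.0))

# With T fixed, a noisier outcome equation pulls the mean toward alpha.
for ratio in (0.1, 1.0, 10.0, 100.0):
    print(ratio, posterior_mean_alpha_i(10.0, 0.0, T=5, sig2_eps=ratio, sig2_alpha=1.0))
```

The first loop illustrates the large-$T$ limit, and the second shows the posterior mean moving from near $\bar{y}_i$ toward $\alpha$ as $\sigma^2_\epsilon/\sigma^2_\alpha$ grows.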



Posterior Simulation in a Panel Model

We illustrate the use of the Gibbs sampler in such models with the celebrated “rat growth dataset” of Gelfand et al. (1990).

30 different rats are weighed at 5 different points in time.

We denote the weight of rat $i$ at measurement $j$ as $y_{ij}$ and let $x_{ij}$ denote the age of the $i$th rat at the $j$th measurement.

Since each of the rats was weighed at exactly the same number of days since birth, we have

$$x_{i1} = 8,\ x_{i2} = 15,\ x_{i3} = 22,\ x_{i4} = 29,\ x_{i5} = 36 \quad \forall i.$$

The rat growth data set is provided on the following page:



Rat Growth Data from Gelfand et al. (1990): Rat Weight Measurements

  i   y_i1  y_i2  y_i3  y_i4  y_i5  |   i   y_i1  y_i2  y_i3  y_i4  y_i5
  1   151   199   246   283   320   |  16   160   207   248   288   324
  2   145   199   249   293   354   |  17   142   187   234   280   316
  3   147   214   263   312   328   |  18   156   203   243   283   317
  4   155   200   237   272   297   |  19   157   212   259   307   336
  5   135   188   230   280   323   |  20   152   203   246   286   321
  6   159   210   252   298   331   |  21   154   205   253   298   334
  7   141   189   231   275   305   |  22   139   190   225   267   302
  8   159   201   248   297   338   |  23   146   191   229   272   302
  9   177   236   285   340   376   |  24   157   211   250   285   323
 10   134   182   220   260   296   |  25   132   185   237   286   331
 11   160   208   261   313   352   |  26   160   207   257   303   345
 12   143   188   220   273   314   |  27   169   216   261   295   333
 13   154   200   244   289   325   |  28   157   205   248   289   316
 14   171   221   270   326   358   |  29   137   180   219   258   291
 15   163   216   242   281   312   |  30   153   200   244   286   324



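In our model, we want to permit unit-specific variation in birth weights and growth rates.

This leads us to specify the following model:

$$y_{ij} = \alpha_i + \beta_i x_{ij} + \epsilon_{ij}, \quad \epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2), \quad i = 1, \ldots, 30, \ j = 1, \ldots, 5,$$

so that each rat possesses its own intercept $\alpha_i$ and growth rate $\beta_i$.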



We also assume that the rats share some degree of “similarity” in their weight at birth and rates of growth.

Thus, we assume that the intercept and slope parameters are drawn from a common Normal population:

$$\theta_i \equiv \begin{bmatrix} \alpha_i \\ \beta_i \end{bmatrix} \overset{iid}{\sim} N(\theta_0, \Sigma).$$

We interpret $\alpha_0 = \theta_0(1)$ as the population average weight at birth and $\beta_0 = \theta_0(2)$ as the population average growth rate.

The diagonal elements of $\Sigma$ quantify the variation around these population means. (What would we expect the sign of $\Sigma$'s off-diagonal to be?)



We complete our Bayesian analysis by specifying the following priors:

$$\sigma^2 \mid a, b \sim IG(a, b)$$

$$\theta_0 \mid \eta, C \sim N(\eta, C)$$

$$\Sigma^{-1} \mid \rho, R \sim W\left([\rho R]^{-1}, \rho\right),$$

with $W$ denoting the Wishart distribution.

We now seek to describe how the Gibbs sampler can be employed to fit this hierarchical model.



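Given the assumed conditional independence across observations, the joint posterior distribution for all the parameters of this model can be written as:

$$p(\Gamma \mid y) \propto \left[\prod_{i=1}^{30} p(y_i \mid X_i, \theta_i, \sigma^2)\, p(\theta_i \mid \theta_0, \Sigma^{-1})\right] p(\theta_0 \mid \eta, C)\, p(\sigma^2 \mid a, b)\, p(\Sigma^{-1} \mid \rho, R),$$

where $\Gamma \equiv [\{\theta_i\}, \theta_0, \Sigma^{-1}, \sigma^2]$ denotes all the parameters of the model. We have stacked the observations over time for each individual rat so that

$$y_i = \begin{bmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{i5} \end{bmatrix} \quad \text{and} \quad X_i = \begin{bmatrix} 1 & x_{i1} \\ 1 & x_{i2} \\ \vdots & \vdots \\ 1 & x_{i5} \end{bmatrix}.$$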



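Fitting this model via the Gibbs sampler requires the derivation of four posterior conditional distributions:

1. $p(\theta_i \mid \Gamma_{-\theta_i}, y)$.

2. $p(\theta_0 \mid \Gamma_{-\theta_0}, y)$.

3. $p(\sigma^2 \mid \Gamma_{-\sigma^2}, y)$.

4. $p(\Sigma^{-1} \mid \Gamma_{-\Sigma^{-1}}, y)$.

We will derive each of these densities.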



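As for the complete posterior conditional for $\theta_i$, we note:

$$p(\theta_i \mid \Gamma_{-\theta_i}, y) \propto p(y_i \mid X_i, \theta_i, \sigma^2)\, p(\theta_i \mid \theta_0, \Sigma^{-1}).$$

This fits directly into our standard linear regression result, applying Lindley and Smith (1972):

$$\theta_i \mid \Gamma_{-\theta_i}, y \sim N(D_{\theta_i} d_{\theta_i}, D_{\theta_i}),$$

where

$$D_{\theta_i} = \left(X_i'X_i/\sigma^2 + \Sigma^{-1}\right)^{-1}, \qquad d_{\theta_i} = X_i'y_i/\sigma^2 + \Sigma^{-1}\theta_0.$$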



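As for the posterior conditional for $\theta_0$, we first obtain

$$p(\theta_0 \mid \Gamma_{-\theta_0}, y) \propto \left[\prod_{i=1}^{30} p(\theta_i \mid \theta_0, \Sigma^{-1})\right] p(\theta_0 \mid \eta, C).$$

Since the second stage of our model specifies $p(\theta_i \mid \theta_0, \Sigma^{-1})$ as iid, we can write

$$\begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_{30} \end{bmatrix} = \begin{bmatrix} I_2 \\ I_2 \\ \vdots \\ I_2 \end{bmatrix} \theta_0 + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_{30} \end{bmatrix}.$$

Equivalently, we can write:

$$\tilde{\theta} = \tilde{I}\theta_0 + \tilde{u},$$

with $\tilde{\theta} = [\theta_1'\ \theta_2'\ \cdots\ \theta_{30}']'$, $\tilde{I} = [I_2\ I_2\ \cdots\ I_2]'$, $\tilde{u} = [u_1'\ u_2'\ \cdots\ u_{30}']'$ and $E(\tilde{u}\tilde{u}') = I_{30} \otimes \Sigma$.

In this form, we can again apply our well-known result to obtain:

$$\theta_0 \mid \Gamma_{-\theta_0}, y \sim N(D_{\theta_0} d_{\theta_0}, D_{\theta_0}),$$

where

$$D_{\theta_0} = \left(\tilde{I}'(I_{30} \otimes \Sigma^{-1})\tilde{I} + C^{-1}\right)^{-1} = \left(30\Sigma^{-1} + C^{-1}\right)^{-1},$$

$$d_{\theta_0} = \tilde{I}'(I_{30} \otimes \Sigma^{-1})\tilde{\theta} + C^{-1}\eta = 30\Sigma^{-1}\bar{\theta} + C^{-1}\eta,$$

where $\bar{\theta} = (1/30)\sum_{i=1}^{30} \theta_i$.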
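The two Kronecker-product simplifications used for the $\theta_0$ conditional can be verified numerically. The sketch below uses an arbitrary positive definite $\Sigma$ and random placeholder values for the $\theta_i$; both are assumptions for illustration, and `I_tilde` is simply a name chosen here for the stacked matrix of $I_2$ blocks.

```python
import numpy as np

n = 30
Sigma = np.array([[100.0, -1.5],
                  [-1.5, 0.25]])            # arbitrary positive definite example
Sigma_inv = np.linalg.inv(Sigma)

I_tilde = np.tile(np.eye(2), (n, 1))        # 60 x 2 stack of I_2 blocks
rng = np.random.default_rng(0)
theta = rng.normal(size=(n, 2))             # placeholder theta_i, one per row

big = np.kron(np.eye(n), Sigma_inv)         # I_30 (kron) Sigma^{-1}, 60 x 60
lhs_D = I_tilde.T @ big @ I_tilde
lhs_d = I_tilde.T @ big @ theta.reshape(-1) # theta stacked as [theta_1', ..., theta_30']'

# Compare with the simplified forms 30 Sigma^{-1} and 30 Sigma^{-1} theta_bar.
print(np.allclose(lhs_D, n * Sigma_inv))
print(np.allclose(lhs_d, n * Sigma_inv @ theta.mean(axis=0)))
```

Both comparisons should print `True`, which is why the sampler only needs the 2 x 2 matrix $\Sigma^{-1}$ and the average $\bar{\theta}$ rather than the full 60 x 60 system.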



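As for the posterior conditional for $\sigma^2$, we obtain

$$p(\sigma^2 \mid \Gamma_{-\sigma^2}, y) \propto \left[\prod_{i=1}^{30} p(y_i \mid X_i, \theta_i, \sigma^2)\right] p(\sigma^2 \mid a, b) \propto (\sigma^2)^{-(N/2 + a + 1)} \exp\left(-\frac{1}{\sigma^2}\left[b^{-1} + \frac{1}{2}\sum_{i=1}^{30}(y_i - X_i\theta_i)'(y_i - X_i\theta_i)\right]\right).$$

Thus,

$$\sigma^2 \mid \Gamma_{-\sigma^2}, y \sim IG\left(\frac{N}{2} + a,\ \left[b^{-1} + \frac{1}{2}\sum_{i=1}^{30}(y_i - X_i\theta_i)'(y_i - X_i\theta_i)\right]^{-1}\right),$$

where $N = 5(30) = 150$.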



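Finally, for the posterior conditional for $\Sigma^{-1}$, we obtain

$$p(\Sigma^{-1} \mid \Gamma_{-\Sigma^{-1}}, y) \propto \left[\prod_{i=1}^{30} p(\theta_i \mid \theta_0, \Sigma^{-1})\right] p(\Sigma^{-1} \mid \rho, R) \propto |\Sigma^{-1}|^{(30 + \rho - 3)/2} \exp\left(-\frac{1}{2}\mathrm{tr}\left\{\left[\sum_{i=1}^{30}(\theta_i - \theta_0)(\theta_i - \theta_0)' + \rho R\right]\Sigma^{-1}\right\}\right).$$

Therefore,

$$\Sigma^{-1} \mid \Gamma_{-\Sigma^{-1}}, y \sim W\left(\left[\sum_{i=1}^{30}(\theta_i - \theta_0)(\theta_i - \theta_0)' + \rho R\right]^{-1},\ 30 + \rho\right).$$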



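We fit this model using priors of the forms:

$$\eta = \begin{bmatrix} 100 \\ 15 \end{bmatrix}, \quad C = \begin{bmatrix} 40^2 & 0 \\ 0 & 10^2 \end{bmatrix}, \quad \rho = 5, \quad R = \begin{bmatrix} 10^2 & 0 \\ 0 & .5^2 \end{bmatrix},$$

$$a = 3, \quad b = 1/40.$$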
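The four conditionals can be cycled through as in the sketch below. It is a minimal illustration under stated assumptions, not the code used for the lecture: it assumes the standard conjugate forms (a Normal for each $\theta_i$ with precision $X_i'X_i/\sigma^2 + \Sigma^{-1}$, a Normal for $\theta_0$, an inverted gamma for $\sigma^2$ under the convention that $1/\sigma^2$ is Gamma with shape $a$ and scale $b$, and a Wishart for $\Sigma^{-1}$ with $30 + \rho$ degrees of freedom). The function names, starting values and seed are hypothetical.

```python
import numpy as np

# Rat weights transcribed from the data table: 30 rats (rows) x 5 weighings.
Y = np.array("""
151 199 246 283 320  145 199 249 293 354  147 214 263 312 328
155 200 237 272 297  135 188 230 280 323  159 210 252 298 331
141 189 231 275 305  159 201 248 297 338  177 236 285 340 376
134 182 220 260 296  160 208 261 313 352  143 188 220 273 314
154 200 244 289 325  171 221 270 326 358  163 216 242 281 312
160 207 248 288 324  142 187 234 280 316  156 203 243 283 317
157 212 259 307 336  152 203 246 286 321  154 205 253 298 334
139 190 225 267 302  146 191 229 272 302  157 211 250 285 323
132 185 237 286 331  160 207 257 303 345  169 216 261 295 333
157 205 248 289 316  137 180 219 258 291  153 200 244 286 324
""".split(), dtype=float).reshape(30, 5)
X = np.column_stack([np.ones(5), [8.0, 15.0, 22.0, 29.0, 36.0]])  # same X_i for all rats

# Hyperparameters as stated in the text.
eta, C = np.array([100.0, 15.0]), np.diag([40.0**2, 10.0**2])
rho, R = 5, np.diag([10.0**2, 0.5**2])
a, b = 3.0, 1.0 / 40.0

def draw_wishart(rng, scale, df):
    """W(scale, df) draw for integer df: sum of df outer products of N(0, scale)."""
    z = rng.standard_normal((df, scale.shape[0])) @ np.linalg.cholesky(scale).T
    return z.T @ z

def gibbs_rats(n_iter=10_000, burn=500, seed=0):
    rng = np.random.default_rng(seed)
    n, T = Y.shape
    C_inv, XtX, XtY = np.linalg.inv(C), X.T @ X, Y @ X   # row i of XtY is X_i' y_i
    theta0, Sig_inv, sig2 = eta.copy(), np.linalg.inv(R), 100.0  # arbitrary starts
    draws = []
    for it in range(n_iter):
        # 1. theta_i | rest ~ N(D d_i, D); D is shared because every X_i is identical.
        D = np.linalg.inv(XtX / sig2 + Sig_inv)
        mean = (XtY / sig2 + Sig_inv @ theta0) @ D
        theta = mean + rng.standard_normal((n, 2)) @ np.linalg.cholesky(D).T
        # 2. theta_0 | rest ~ N(D0 d0, D0).
        D0 = np.linalg.inv(n * Sig_inv + C_inv)
        d0 = n * Sig_inv @ theta.mean(axis=0) + C_inv @ eta
        theta0 = D0 @ d0 + np.linalg.cholesky(D0) @ rng.standard_normal(2)
        # 3. sigma^2 | rest: draw 1/sigma^2 from a Gamma(shape, scale).
        ssr = np.sum((Y - theta @ X.T) ** 2)
        sig2 = 1.0 / rng.gamma(n * T / 2 + a, 1.0 / (1.0 / b + 0.5 * ssr))
        # 4. Sigma^{-1} | rest ~ Wishart with n + rho degrees of freedom.
        dev = theta - theta0
        Sig_inv = draw_wishart(rng, np.linalg.inv(dev.T @ dev + rho * R), n + rho)
        if it >= burn:
            draws.append(np.concatenate([theta0, [sig2]]))
    return np.array(draws)

post = gibbs_rats(n_iter=2_000, burn=500)
print(post.mean(axis=0))   # posterior means of alpha_0, beta_0, sigma^2
```

Because all rats share the same design matrix, step 1 is vectorized across rats; with unit-specific $X_i$ the matrix `D` would have to be computed rat by rat.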

The sampler was run for 10,000 iterations, and the first 500 were discarded as the burn-in.

In the next two graphs, we provide some suggestive evidence of rapid convergence, and also that the chain tends to mix reasonably well.



[Figure: four panels plotting the simulated values of $\alpha_0$, $\beta_0$, $\sigma^2_\alpha$ and $\alpha_{20}$ against iteration (0 to 30).]



Table 12.3: Autocorrelations in Parameter Chains at Various Lag Orders

Parameter               Lag 1    Lag 5    Lag 10
$\alpha_0$              .24      .010     .007
$\beta_0$               .22      -.004    -.009
$\sigma^2_\alpha$       .36      .020     -.010
$\rho_{\alpha,\beta}$   .37      .036     .003
$\alpha_{15}$           .18      .025     .018



Table 12.2: Posterior Quantities for a Selection of Parameters

Parameter               Post Mean   Post Std.   10th Percentile   90th Percentile
$\alpha_0$              106.6       2.34        103.7             109.5
$\beta_0$               6.18        .106        6.05              6.31
$\sigma^2_\alpha$       124.5       42.41       77.03             179.52
$\sigma^2_\beta$        .275        .088        .179              .389
$\rho_{\alpha,\beta}$   -.120       .211        -.390             .161
$\alpha_{10}$           93.76       5.24        87.09             100.59
$\alpha_{25}$           86.91       5.81        79.51             94.45
$\beta_{10}$            5.70        .217        5.43              5.98
$\beta_{25}$            6.75        .243        6.44              7.05



[Figure: two panels comparing hierarchical model estimates with OLS estimates, one for the intercepts (axis from 80 to 130) and one for the growth rates (axis from 5 to 7.5).]



A second application follows Krueger (1998) and Krueger and Whitmore (2001) and applies our model to analyze data from Project STAR (Student/Teacher Achievement Ratio).

Project STAR was an experiment in Tennessee that randomly assigned students to one of three types of classes: small class, regular size class, and regular size class with a teacher's aide (regular/aide class).

The dependent variable is the average of a reading percentile score and math percentile score of a Project STAR student.

There are two explanatory variables: a dummy variable indicating whether a student is assigned to a small class and another indicating assignment to a regular/aide class. The default category, therefore, is assignment to a regular class.

The Project STAR data we use contains 79 participating schools with a total of 5,726 students who entered the project during kindergarten.

We focus on the achievement measure taken at the end of the kindergarten year and consider heterogeneity of treatment impacts across schools.

Therefore, in this application of the model in (47), $i$ denotes the school, and $j$ (or $t$) no longer represents a time index but, instead, denotes the student within a school.



Table 4: Posterior means, standard deviations and probabilities of being positive of the parameters

Parameter                                                          E(β|D)    Std(β|D)   Pr(β > 0|D)
$\beta_0$ (intercept)                                              51        1.82       1
$\beta_1$ (small class)                                            5.48      1.44       1
$\beta_2$ (regular/aide class)                                     0.311     1.26       0.596
$\sqrt{\sigma^2}$                                                  22.9      0.221      1
$\sqrt{\Sigma_\beta(1,1)}$                                         15.2      1.32       1
$\sqrt{\Sigma_\beta(2,2)}$                                         10.6      1.24       1
$\sqrt{\Sigma_\beta(3,3)}$                                         8.93      1.14       1
$\Sigma_\beta(1,2)/\sqrt{\Sigma_\beta(1,1)\times\Sigma_\beta(2,2)}$   -0.454    0.111      0.000125
$\Sigma_\beta(1,3)/\sqrt{\Sigma_\beta(1,1)\times\Sigma_\beta(3,3)}$   -0.483    0.111      0.000125
$\Sigma_\beta(2,3)/\sqrt{\Sigma_\beta(2,2)\times\Sigma_\beta(3,3)}$   0.548     0.118      1



Evidence of a small class size effect; no strong evidence of an aide effect.

Evidence of heterogeneity of impacts across schools.

Strong correlation among school-level parameters. What is an interpretation of this result?



Gibbs sampling in mixed models

Let us now turn to a restricted (and perhaps more widely used) version of our previous model.

Note that, in the specification just presented, both the intercept and slope (or, more generally, set of slopes) were permitted to vary across the units.

In a restricted version of this model, perhaps the “default” panel data specification in economics, we may permit the intercept to vary across individuals, but restrict the other regression coefficients to be constant across individuals.

Such a model is termed a mixed model.



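Formally, we consider a specification of the form:

$$y_{it} = \alpha_i + x_{it}\beta + \epsilon_{it}, \quad \epsilon_{it} \overset{iid}{\sim} N(0, \sigma^2_\epsilon),$$

$$\alpha_i \overset{iid}{\sim} N(\alpha, \sigma^2_\alpha).$$

For this model we employ independent priors of the form:

$$\beta \sim N(\beta_0, V_\beta)$$

$$\alpha \sim N(\alpha_0, V_\alpha)$$

$$\sigma^2_\epsilon \sim IG(e_1, e_2)$$

$$\sigma^2_\alpha \sim IG(a_1, a_2).$$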



We seek to do the following:

(a) Derive the complete posterior conditionals $p(\alpha_i \mid \beta, \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y)$ and $p(\beta \mid \{\alpha_i\}, \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y)$.

(b) Describe how one could use a blocking or grouping step [e.g., Chib and Carlin (1999)] to obtain draws directly from the joint posterior conditional $p(\{\alpha_i\}, \beta \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y)$.

(c) Describe how the Gibbs sampler can be used to fit the model, given your result in (b). Would you expect any improvements in this blocked algorithm relative to the standard Gibbs algorithm in (a)?

(d) How does your answer in (c) change for the case of an unbalanced panel where $T = T_i$?



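As for (a), the complete posterior conditionals can be obtained in a straightforward manner. Specifically, we obtain

$$\alpha_i \mid \beta, \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y \sim N(Dd, D),$$

where

$$D = \left(T/\sigma^2_\epsilon + 1/\sigma^2_\alpha\right)^{-1}, \qquad d = \sum_{t=1}^{T}(y_{it} - x_{it}\beta)/\sigma^2_\epsilon + \alpha/\sigma^2_\alpha.$$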



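The complete posterior conditional for $\beta$ follows similarly:

$$\beta \mid \{\alpha_i\}, \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y \sim N(Hh, H),$$

where

$$H = \left(X'X/\sigma^2_\epsilon + V_\beta^{-1}\right)^{-1}, \qquad h = X'(y - \boldsymbol{\alpha})/\sigma^2_\epsilon + V_\beta^{-1}\beta_0,$$

with $\boldsymbol{\alpha} = [(\iota_T\alpha_1)'\ (\iota_T\alpha_2)'\ \cdots\ (\iota_T\alpha_N)']'$.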



(b) Instead of the strategy described in (a), we seek to draw the random effects $\{\alpha_i\}$ and “fixed effects” $\beta$ in a single block.

This strategy of grouping together correlated parameters will generally facilitate the mixing of the chain and thereby reduce numerical standard errors associated with the Gibbs sampling estimates.

We break this joint posterior conditional into the following two pieces. The assumptions of our model imply that the random effects $\{\alpha_i\}$ are conditionally independent, so that

$$p(\{\alpha_i\}, \beta \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y) = \left[\prod_{i=1}^{N} p(\alpha_i \mid \beta, \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y)\right] p(\beta \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y).$$

This suggests that one can draw from this joint posterior conditional via the method of composition by first drawing from $p(\beta \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y)$ and then drawing each $\alpha_i$ independently from its complete posterior conditional distribution.



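We now derive the conditional posterior distribution for $\beta$, marginalized over the random effects. Note that our model can be rewritten as follows:

$$y_{it} = \alpha + x_{it}\beta + u_i + \epsilon_{it},$$

where $u_i \overset{iid}{\sim} N(0, \sigma^2_\alpha)$.

Let $v_{it} = u_i + \epsilon_{it}$. If we stack this equation over $t$ within $i$ we obtain:

$$y_i = \iota_T\alpha + x_i\beta + v_i,$$

where $y_i = [y_{i1}\ y_{i2}\ \cdots\ y_{iT}]'$, $x_i = [x_{i1}'\ x_{i2}'\ \cdots\ x_{iT}']'$, $v_i = [v_{i1}\ v_{i2}\ \cdots\ v_{iT}]'$, and $E(v_iv_i') \equiv \Sigma = \sigma^2_\epsilon I_T + \sigma^2_\alpha\iota_T\iota_T'$.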



Stacking again over $i$ we obtain:

$$y = \iota_{NT}\alpha + X\beta + v,$$

where $E(vv') = I_N \otimes \Sigma$.



In this form, we can now appeal to our standard results for the regression model to obtain

$$\beta \mid \alpha, \sigma^2_\epsilon, \sigma^2_\alpha, y \sim N(Gg, G),$$

where

$$G = \left(X'(I_N \otimes \Sigma^{-1})X + V_\beta^{-1}\right)^{-1} = \left(\sum_{i=1}^{N} x_i'\Sigma^{-1}x_i + V_\beta^{-1}\right)^{-1}$$

and

$$g = X'(I_N \otimes \Sigma^{-1})(y - \iota_{NT}\alpha) + V_\beta^{-1}\beta_0 = \sum_{i=1}^{N} x_i'\Sigma^{-1}(y_i - \iota_T\alpha) + V_\beta^{-1}\beta_0.$$

Thus, to sample from the desired joint conditional, you first sample $\beta$ from the distribution given above and then sample the random effects independently from their complete conditional posterior distributions.
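The two-step blocked draw can be sketched as follows. This is an illustration under assumptions: the covariance of the stacked errors for one unit is taken to be $\sigma^2_\epsilon I_T + \sigma^2_\alpha\iota_T\iota_T'$, the data are simulated with made-up parameter values, and `draw_block` is a hypothetical name.

```python
import numpy as np

def draw_block(y, x, alpha, sig2_eps, sig2_alpha, beta0, V_beta, rng):
    """Draw (beta, {alpha_i}) jointly given alpha and the two variances.

    y: (N, T) outcomes; x: (N, T, k) covariates for a balanced panel.
    """
    N, T, k = x.shape
    Sigma = sig2_eps * np.eye(T) + sig2_alpha * np.ones((T, T))  # Cov of v_i
    Sigma_inv = np.linalg.inv(Sigma)
    V_inv = np.linalg.inv(V_beta)

    # Step 1: beta | alpha, variances, y, with the random effects integrated out.
    G_inv, g = V_inv.copy(), V_inv @ beta0
    for i in range(N):
        G_inv += x[i].T @ Sigma_inv @ x[i]
        g += x[i].T @ Sigma_inv @ (y[i] - alpha)
    G = np.linalg.inv(G_inv)
    beta = G @ g + np.linalg.cholesky(G) @ rng.standard_normal(k)

    # Step 2: alpha_i | beta, alpha, variances, y, independently across i.
    D = 1.0 / (T / sig2_eps + 1.0 / sig2_alpha)
    d = (y - x @ beta).sum(axis=1) / sig2_eps + alpha / sig2_alpha
    alpha_i = D * d + np.sqrt(D) * rng.standard_normal(N)
    return beta, alpha_i

# Simulated example with assumed true values (not from the lecture).
rng = np.random.default_rng(1)
N, T, k = 500, 4, 2
beta_true, alpha_true = np.array([1.0, -2.0]), 3.0
x = rng.normal(size=(N, T, k))
a_i = alpha_true + np.sqrt(2.0) * rng.standard_normal(N)
y = a_i[:, None] + x @ beta_true + rng.standard_normal((N, T))

beta, alpha_i = draw_block(y, x, alpha_true, 1.0, 2.0,
                           np.zeros(k), 100.0 * np.eye(k), rng)
print(beta, alpha_i[:3])
```

The key design point is that step 1 never conditions on the current $\alpha_i$ values, so the $\beta$ draw is not tied to the previous iteration's random effects.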



Finally, it is also worth noting that $\beta$ and $\alpha$ could be drawn together in the first step of this process.

That is, one could draw from the joint posterior conditional

$$p(\{\alpha_i\}, \alpha, \beta \mid \sigma^2_\epsilon, \sigma^2_\alpha, y) = \left[\prod_{i=1}^{N} p(\alpha_i \mid \beta, \alpha, \sigma^2_\alpha, \sigma^2_\epsilon, y)\right] p(\beta, \alpha \mid \sigma^2_\epsilon, \sigma^2_\alpha, y)$$

in a similar way as described above.

We would expect the mixing of such chains to improve relative to the standard (or unblocked) Gibbs sampler. In the limiting case, where all parameters can be sampled in a single block, we are back to iid sampling!



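(c) Given the result in (b), we now need to obtain the remaining complete conditionals. These are given as follows:

$$\alpha \mid \{\alpha_i\}, \beta, \sigma^2_\epsilon, \sigma^2_\alpha, y \sim N(Rr, R),$$

where

$$R = \left(N/\sigma^2_\alpha + V_\alpha^{-1}\right)^{-1}, \qquad r = \sum_{i=1}^{N}\alpha_i/\sigma^2_\alpha + V_\alpha^{-1}\alpha_0.$$

$$\sigma^2_\alpha \mid \{\alpha_i\}, \alpha, \beta, \sigma^2_\epsilon, y \sim IG\left(\frac{N}{2} + a_1,\ \left[a_2^{-1} + .5\sum_{i=1}^{N}(\alpha_i - \alpha)^2\right]^{-1}\right).$$

$$\sigma^2_\epsilon \mid \{\alpha_i\}, \alpha, \beta, \sigma^2_\alpha, y \sim IG\left(\frac{NT}{2} + e_1,\ \left[e_2^{-1} + .5(y - \boldsymbol{\alpha} - X\beta)'(y - \boldsymbol{\alpha} - X\beta)\right]^{-1}\right),$$

where $\boldsymbol{\alpha} = [(\iota_T\alpha_1)'\ (\iota_T\alpha_2)'\ \cdots\ (\iota_T\alpha_N)']'$ stacks the individual random effects, since this conditional is taken given $\{\alpha_i\}$.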
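The remaining three draws can be sketched as below. Two assumptions are worth flagging: the inverted gamma convention is taken to be the one in which $1/\sigma^2 \sim \text{Gamma}(\text{shape}, \text{scale})$, matching the bracketed $[\cdot]^{-1}$ second arguments, and the residual for $\sigma^2_\epsilon$ is formed with each unit's own $\alpha_i$, which is what conditioning on $\{\alpha_i\}$ implies. The function name, simulated data and prior values are hypothetical.

```python
import numpy as np

def draw_remaining(y, x, alpha_i, beta, sig2_alpha,
                   alpha0, V_alpha, a1, a2, e1, e2, rng):
    """One pass through alpha, sig2_alpha and sig2_eps given ({alpha_i}, beta)."""
    N, T = y.shape

    # alpha | {alpha_i}, sig2_alpha ~ N(R r, R)
    R = 1.0 / (N / sig2_alpha + 1.0 / V_alpha)
    r = alpha_i.sum() / sig2_alpha + alpha0 / V_alpha
    alpha = R * r + np.sqrt(R) * rng.standard_normal()

    # sig2_alpha | {alpha_i}, alpha: draw the precision from a Gamma, then invert.
    scale_a = 1.0 / (1.0 / a2 + 0.5 * np.sum((alpha_i - alpha) ** 2))
    sig2_alpha = 1.0 / rng.gamma(N / 2 + a1, scale_a)

    # sig2_eps | {alpha_i}, beta: residuals use the unit-specific alpha_i.
    resid = y - alpha_i[:, None] - x @ beta
    scale_e = 1.0 / (1.0 / e2 + 0.5 * np.sum(resid ** 2))
    sig2_eps = 1.0 / rng.gamma(N * T / 2 + e1, scale_e)
    return alpha, sig2_alpha, sig2_eps

# Simulated example with assumed true values alpha = 3, sig2_alpha = 2, sig2_eps = 1.
rng = np.random.default_rng(2)
N, T = 500, 4
beta = np.array([1.0, -2.0])
x = rng.normal(size=(N, T, 2))
alpha_i = 3.0 + np.sqrt(2.0) * rng.standard_normal(N)
y = alpha_i[:, None] + x @ beta + rng.standard_normal((N, T))

alpha, sig2_alpha, sig2_eps = draw_remaining(
    y, x, alpha_i, beta, sig2_alpha=2.0,
    alpha0=0.0, V_alpha=100.0, a1=3.0, a2=1.0, e1=3.0, e2=1.0, rng=rng)
print(alpha, sig2_alpha, sig2_eps)
```

Alternating this function with a joint draw of $(\beta, \{\alpha_i\})$ gives one full iteration of the blocked sampler.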



(d) Only slight changes are required in the case of unbalanced panels.

Let $NT$ continue to denote the total number of observations, $NT \equiv \sum_{i=1}^{N} T_i$.

In addition, let $\Sigma_i \equiv \sigma^2_\epsilon I_{T_i} + \sigma^2_\alpha\iota_{T_i}\iota_{T_i}'$.

Replacing $\Sigma$ with $\Sigma_i$ and $T$ with $T_i$, as appropriate in the above formulae, is all that is required.
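For unbalanced panels the unit-specific covariance matrices differ only in their dimension, and their inverses have a closed form by the Sherman-Morrison identity. The sketch below is an illustrative aid, with hypothetical function names and arbitrary variance values, and checks the closed form against a numerical inverse.

```python
import numpy as np

def sigma_i(T_i, sig2_eps, sig2_alpha):
    """Sigma_i = sig2_eps * I + sig2_alpha * iota iota' for a unit with T_i observations."""
    return sig2_eps * np.eye(T_i) + sig2_alpha * np.ones((T_i, T_i))

def sigma_i_inv(T_i, sig2_eps, sig2_alpha):
    """Closed-form inverse of Sigma_i (Sherman-Morrison), avoiding a matrix solve."""
    c = sig2_alpha / (sig2_eps + T_i * sig2_alpha)
    return (np.eye(T_i) - c * np.ones((T_i, T_i))) / sig2_eps

def D_i(T_i, sig2_eps, sig2_alpha):
    """Conditional posterior variance of alpha_i when unit i has T_i observations."""
    return 1.0 / (T_i / sig2_eps + 1.0 / sig2_alpha)

# Units with different numbers of observations (assumed values for illustration).
for T_i in (1, 3, 7):
    ok = np.allclose(sigma_i_inv(T_i, 2.0, 0.5) @ sigma_i(T_i, 2.0, 0.5), np.eye(T_i))
    print(T_i, ok, D_i(T_i, 2.0, 0.5))
```

In a sampler, these functions would be called once per distinct $T_i$ inside the sums that define $G$ and $g$, and `D_i` would replace the common $D$ when drawing each $\alpha_i$.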



Further Reading

Gelfand, A.E., S.E. Hills, A. Racine-Poon and A.F.M. Smith (1990). Illustration of Bayesian inference in normal data models using Gibbs sampling. JASA 85, 972-985.

Chib, S. and B. Carlin (1999). On MCMC sampling in hierarchical longitudinal models. Statistics and Computing 9, 17-26.

