Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Introduction to Bayesian StatisticsLecture 9: Hierarchical Models
Rung-Ching Tsai
Department of MathematicsNational Taiwan Normal University
May 6, 2015
Example
• Data: Weekly weights of 30 young rats (Gelfand, Hills,Racine-Poon, & Smith, 1990).
Day8 15 22 29 36
Rat 1 151 199 246 283 320Rat 2 145 199 249 293 354
· · ·Rat 30 153 200 244 286 324
• Model:Yij = α + βxj + εij ,
where Yij : weight of i-th rat on day xj ; εij ∼ Normal(0, σ2)
• What is the assumption on the growth of the 30 rats in this model?
2 of 22
Example
• Data: Number of Failures and length of operation time of 10power plant pumps (George, Makov, & Smith, 1993).
Pump 1 2 3 4 5 6 7 8 9 10
time 94.5 15.7 62.9 126 5.24 31.4 1.05 1.05 2.1 10.5failure 5 1 5 14 3 19 1 1 4 22
• Model:Xij ∼ Poisson(λti )
where Xij is the number of power failures, λ is the failure rate, andti is the length of operation time of pump i (in 1000s of hours).
• What is the assumption on the failure rates of the 10 power plantpumps in this model?
3 of 22
Possible problems with above approaches
• A single (α, β) may be inadequate to fit all the rats. Likewise,a common failure rate for all the power plant pumps may notbe suitable.
• Separate unrelated (αi , βi) for each rat, or λi for each pumpare likely to overfit the data. Some information about theparameters of one rat or one pump can be obtained fromothers’ data.
4 of 22
Motivation for hierarchical models
• A thought naturally arises by assuming that (αi , βi)’s or λi ’sare samples from a common population distribution. Thedistribution of observed outcomes are conditional onparameters which themselves have a probability specification,known as a hierarchical or multilevel model.
• The new parameters introduced to govern the populationdistribution of the parameters are called hyperparameters.
• Thus, we would need to estimate the parameters governingthe population distribution of (αi , βi) rather than each (αi , βi)separately.
5 of 22
Bayesian approach to hierarchical models
• Model specification◦ specify the sampling distribution of data: p(y |θ)◦ specify the population distribution of θ: p(θ|φ) where φ is the
hyperparameter• Bayesian estimation◦ specify the prior for hyperparameter: p(φ); Many levels are possible.
The hyperprior distribution at highest level is often chosen to benon-informative
◦ consider the above model specification: p(y |θ) and p(θ|φ)◦ find the joint posterior distribution of parameter θ and
hyperparameter φ:
p(θ, φ|y) ∝ p(θ, φ)p(y |θ, φ) = p(θ, φ)p(y |θ)
∝ p(φ)p(θ|φ)p(y |θ)
◦ Point and Credible interval estimations for φ and θ◦ Predictive distribution for y
6 of 22
Analytical derivation of conditional/marginal dist.
• Write put the joint posterior distribution:
p(θ, φ|y) ∝ p(φ)p(θ|φ)p(y |θ)
• Determine analytically the conditional posterior density of θ givenφ: p(θ|φ, y)
• Obtain the marginal posterior distribution of φ:
p(φ|y) =
∫p(θ, φ|y)dθ
or
p(φ|y) =p(θ, φ|y)
p(θ|φ, y).
7 of 22
Simulations from the posterior distributions
1. Two steps to simulate a random draw from the joint posteriordistribution of θ and φ: p(θ, φ|y)◦ Draw φ from its marginal posterior distribution: p(φ|y)◦ Draw parameter θ from its conditional posterior p(θ|φ, y)
2. If desired, draw predictive values y from the posterior predictivedistribution given the drawn θ
8 of 22
Example: Rat tumors
• Goal: Estimating the risk of tumor in a group of rats
• Data (number of rats developed some kind of tumor):
1. 70 historical experiments:0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/190/19 0/18 0/18 0/17 1/20 1/20 1/20 1/20 1/19 1/191/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20 2/20 2/202/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/203/20 2/13 9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/204/20 10/48 4/19 4/19 4/19 5/22 11/46 12/49 5/20 5/206/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24
2. Current experiment: 4/14
9 of 22
Bayesian approach to hierarchical models
• Model specification◦ sampling distribution of data: yj ∼ binomial(nj , θj), j = 1, 2, · · · , 71.◦ the population distribution of θ: θj ∼ Beta(α, β) where α and β are
the hyperparameters.
• Bayesian estimation◦ non-informative prior for hyperparameters: p(α, β)◦ consider the above model specification: p(θ|α, β)◦ find the joint posterior distribution of parameter θ and
hyperparameters α and β:
p(θ, α, β|y) ∝ p(α, β)p(θ|α, β)p(y|θ, α, β)
∝ p(α, β)J∏
j=1
Γ(α + β)
Γ(α)Γ(β)θα−1j (1− θj)β−1
J∏j=1
θyij (1− θj)nj−yj
10 of 22
Analytical derivation of conditional/marginal dist.
• the joint posterior distribution:
p(θ, α, β|y) ∝ p(α, β)J∏
j=1
Γ(α + β)
Γ(α)Γ(β)θα−1j (1− θj)β−1
J∏j=1
θyij (1− θj)nj−yj
• the conditional posterior density of θ given α and β:
p(θ|α, β, y) =J∏
j=1
Γ(α + β + nj)
Γ(α + yj)Γ(β + nj − yj)θα+yj−1j (1− θj)β+nj−yj−1
• the marginal posterior distribution of α and β:
p(α, β|y) =p(θ, α, β|y)
p(θ|α, β, y)∝ p(α, β)
J∏j=1
Γ(α + β)
Γ(α)Γ(β)
Γ(α + yj)Γ(β + nj − yj)
Γ(α + β + nj)
11 of 22
Choice of hyperprior distribution
• Idea: To set up a ‘non-informative’ hyperprior distribution
◦ p(
logit( αα+β ) = log(αβ ), log(α + β)
)∝ 1
NO GOOD because it leads to improper posterior.
◦ p(
αα+β , α + β
)∝ 1 or p(α, β) ∝ 1
NO GOOD because the posterior density is not integrable in the limit.◦
p
(α
α+ β, (α+ β)−1/2
)∝ 1 ⇐⇒ p(α, β) ∝ (α+ β)−5/2
⇐⇒ p
(log(
α
β), log(α+ β)
)∝ αβ(α+ β)−5/2
OK because it leads to proper posterior.
12 of 22
Computing marginal posterior of the hyperparameters
• Computing the relative (unnormalized) posterior density on a gridof values that cover the effective range of (α, β)
◦(
log(αβ ), log(α + β))∈ [−1,−2.5]× [1.5, 3]
◦(
log(αβ ), log(α + β))∈ [−1.3,−2.3]× [1, 5]
• Drawing contour plot of the marginal density of(log(αβ ), log(α + β)
)◦ contour lines are at 0.05, 0.15, · · · , 0.95 times the density at the mode.
• Normalizing by approximating the posterior distribution as a stepfunction over a grid and setting total probability in the grid to 1.
• Computing the posterior moments based on the grid of(log(αβ ), log(α + β)). For example, E(α|y) is estimated by∑
log(αβ),log(α+β)
= αp(log(α
β), log(α + β)|y)
13 of 22
Sampling from the joint posterior
1. Simulation 1000 draws of (log(αβ ), log(α + β)) from their posteriordistribution using the discrete-grid sampling procedure.
2. For l = 1, · · · , 1000◦ Transform the l-th draw of (log(αβ ), log(α + β)) to the scale of (α, β)
to yield a draw of the hyperparameters from their marginal posteriordistribution.
◦ For each j = 1, · · · , J, sample θj from its conditional posteriordistribution θj |α, β, y ∼ Beta(α + yj , β + nj − yj).
14 of 22
Displaying the results
• Plot the posterior means and 95% intervals for the θj ’s (Figure 5.4on page 131)
• Rate θj ’s are shrunk from their sample point estimates,yjnj
, towards
the population distribution, with approximate mean.
• Experiment with few observation are shrunk more and have higherposterior variances.
• Note that posterior variability is higher in the full Bayesiananalysis, reflecting posterior uncertainty in the hyperparameters.
15 of 22
Hierarchical normal models (I)
• Model specification◦ Sampling distribution of data:
yij |θj ∼ Normal(θj , σ2), i = 1, · · · , nj , j = 1, 2, · · · , J. σ2 known
◦ the population distribution of θ: θj ∼ Normal(µ, τ 2) where µ and τare the hyperparameters. That is,
p(θ1, · · · , θJ |µ, τ) =J∏
j=1
N(θj |µ, τ 2)
◦
p(θ1, · · · , θJ) =
∫ J∏j=1
[N(θj |µ, τ 2)]p(µ, τ)d(µ, τ).
16 of 22
Hierarchical normal models (II)
• Bayesian estimation◦ non-informative prior for hyperparameters:
p(µ, τ) = p(µ|τ)p(τ) ∝ p(τ)
◦ consider the above model specification: p(θ|µ, τ)◦ find the joint posterior distribution of parameter θ and
hyperparameters µ and τ :
p(θ, µ, τ |y) ∝ p(µ, τ)p(θ|µ, τ)p(y|θ)
∝ p(µ, τ)J∏
j=1
N(θj |µ, τ 2)J∏
j=1
N(y.j |θj , σ2/nj)
17 of 22
Conditional posterior of θ given (µ, τ), p(θ|µ, τ, y)
•θj |µ, τ ∼ Normal(µ, τ2),
•θj |µ, τ, y ∼ Normal(θj ,Vj),
where◦
θj =njσ2 y.j + 1
τ 2µnjσ2 + 1
τ 2
◦Vj =
1njσ2 + 1
τ 2
18 of 22
Marginal posterior of µ and τ , p(µ, τ |y)
p(µ, τ |y) ∝ p(µ, τ)p(y|µ, τ)
y.j |µ, τ ∼ Normal(µ,σ2
nj+ τ2)
Therefore,
p(µ, τ |y) ∝ p(µ, τ)J∏
j=1
N(y.j |µ,σ2
nj+ τ2)
19 of 22
Posterior of µ given τ , p(µ|τ, y)
p(µ, τ |y) = p(µ|τ, y)p(τ |y)
⇒ p(µ|τ, y) =p(µ, τ |y)
p(τ |y)
Therefore,
µ|τ, y ∼ Normal(µ,Vµ),
where
µ =
∑Jj=1
1σ2
nj+τ2
y.j∑Jj=1
1σ2
nj+τ2
and V−1µ =
J∑j=1
1σ2
nj+ τ2
20 of 22
Posterior distribution of τ , p(τ |y)
p(τ |y) =p(µ, τ |y)
p(µ|τ, y
∝p(τ)
∏Jj=1 N(y.j |µ, σ
2
nj+ τ2)
N(µ|µ,Vµ)
∝p(τ)
∏Jj=1 N(y.j |µ, σ
2
nj+ τ2)
N(µ|µ,Vµ)
∝ p(τ)V 1/2µ
J∏j=1
(σ2
nj+ τ2)−1/2exp
− (y.j − µ)2
2(σ2
nj+ τ2)
21 of 22
Prior distribution of τ , p(τ)
p(τ |y) =p(µ, τ |y)
p(µ|τ, y
∝p(τ)
∏Jj=1 N(y.j |µ, σ
2
nj+ τ2)
N(µ|µ,Vµ)
∝p(τ)
∏Jj=1 N(y.j |µ, σ
2
nj+ τ2)
N(µ|µ,Vµ)
∝ p(τ)V 1/2µ
J∏j=1
(σ2
nj+ τ2)−1/2exp
− (y.j − µ)2
2(σ2
nj+ τ2)
22 of 22