Introduction to Bayesian Statisticsmath.ntnu.edu.tw/~rtsai/104/bayes/slides/lecture9.pdf · Department of Mathematics National Taiwan Normal University May 6, 2015. ... power plant

Introduction to Bayesian Statistics

Lecture 9: Hierarchical Models

Rung-Ching Tsai

Department of Mathematics, National Taiwan Normal University

May 6, 2015

Example

• Data: Weekly weights of 30 young rats (Gelfand, Hills, Racine-Poon, & Smith, 1990).

Day       8    15    22    29    36
Rat 1   151   199   246   283   320
Rat 2   145   199   249   293   354
· · ·
Rat 30  153   200   244   286   324

• Model: $Y_{ij} = \alpha + \beta x_j + \varepsilon_{ij}$,

where $Y_{ij}$ is the weight of the i-th rat on day $x_j$, and $\varepsilon_{ij} \sim \mathrm{Normal}(0, \sigma^2)$.

• What is the assumption on the growth of the 30 rats in this model?
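As a rough illustration of what a single (α, β) implies, the sketch below fits one least-squares line to the three rats listed above and compares it with a separate line per rat. The other 27 rats are not shown on the slide, so the numbers are only indicative. The helper name `ols_line` is made up for this example.

```python
# Pooled vs. per-rat least-squares lines, using only the three rats shown above.
days = [8, 15, 22, 29, 36]
weights = {
    "Rat 1": [151, 199, 246, 283, 320],
    "Rat 2": [145, 199, 249, 293, 354],
    "Rat 30": [153, 200, 244, 286, 324],
}

def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Pooled model: every rat shares the same alpha and beta.
all_x = days * len(weights)
all_y = [w for ws in weights.values() for w in ws]
pooled_alpha, pooled_beta = ols_line(all_x, all_y)

# Separate models: each rat gets its own line.
per_rat = {rat: ols_line(days, ws) for rat, ws in weights.items()}

print(f"pooled: alpha={pooled_alpha:.1f}, beta={pooled_beta:.2f}")
for rat, (a, b) in per_rat.items():
    print(f"{rat}: alpha={a:.1f}, beta={b:.2f}")
```

The pooled fit forces one growth rate on all rats, while the per-rat fits let the slopes differ; the gap between them is what the question above is probing.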


Example

• Data: Number of failures and length of operation time of 10 power plant pumps (George, Makov, & Smith, 1993).

Pump        1     2     3     4     5     6     7     8     9    10
Time     94.5  15.7  62.9   126  5.24  31.4  1.05  1.05   2.1  10.5
Failures    5     1     5    14     3    19     1     1     4    22

• Model: $X_i \sim \mathrm{Poisson}(\lambda t_i)$,

where $X_i$ is the number of failures of pump i, λ is the failure rate, and $t_i$ is the length of operation time of pump i (in 1000s of hours).

• What is the assumption on the failure rates of the 10 power plant pumps in this model?
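To make the two extremes concrete, here is a minimal sketch that computes the maximum-likelihood failure rate under a single common λ and under a separate λ for each pump, using the data in the table above. The variable names are chosen for this example only.

```python
# Complete pooling vs. no pooling for the pump data above.
times = [94.5, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5]  # 1000s of hours
failures = [5, 1, 5, 14, 3, 19, 1, 1, 4, 22]

# Common-rate model X_i ~ Poisson(lambda * t_i):
# the maximum-likelihood estimate is total failures / total time.
pooled_rate = sum(failures) / sum(times)

# Separate-rate model X_i ~ Poisson(lambda_i * t_i):
# the maximum-likelihood estimate for each pump is x_i / t_i.
separate_rates = [x / t for x, t in zip(failures, times)]

print(f"pooled rate: {pooled_rate:.3f} failures per 1000 hours")
for pump, rate in enumerate(separate_rates, start=1):
    print(f"pump {pump:2d}: {rate:.3f}")
```

The separate estimates spread widely around the pooled one, and the pumps with very short operation times give the least stable estimates.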


Possible problems with above approaches

• A single (α, β) may be inadequate to fit all the rats. Likewise, a common failure rate for all the power plant pumps may not be suitable.

• Separate, unrelated $(\alpha_i, \beta_i)$ for each rat, or $\lambda_i$ for each pump, are likely to overfit the data. Some information about the parameters of one rat or one pump can be obtained from the others’ data.


Motivation for hierarchical models

• A natural idea is to assume that the $(\alpha_i, \beta_i)$’s or $\lambda_i$’s are samples from a common population distribution. The distribution of the observed outcomes is then conditional on parameters that themselves have a probability specification; this is known as a hierarchical or multilevel model.

• The new parameters introduced to govern the population distribution of the parameters are called hyperparameters.

• Thus, we need to estimate the parameters governing the population distribution of $(\alpha_i, \beta_i)$, rather than each $(\alpha_i, \beta_i)$ separately.


Bayesian approach to hierarchical models

• Model specification

  ◦ specify the sampling distribution of the data: p(y|θ)

  ◦ specify the population distribution of θ: p(θ|φ), where φ is the hyperparameter

• Bayesian estimation

  ◦ specify the prior for the hyperparameter: p(φ). Many levels are possible. The hyperprior distribution at the highest level is often chosen to be non-informative.

  ◦ consider the above model specification: p(y|θ) and p(θ|φ)

  ◦ find the joint posterior distribution of the parameter θ and the hyperparameter φ:

$$
\begin{aligned}
p(\theta, \phi \mid y) &\propto p(\theta, \phi)\, p(y \mid \theta, \phi) = p(\theta, \phi)\, p(y \mid \theta) \\
&\propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)
\end{aligned}
$$

  ◦ point and credible interval estimates for φ and θ

  ◦ predictive distribution for ỹ


Analytical derivation of conditional/marginal dist.

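• Write out the joint posterior distribution:

$$p(\theta, \phi \mid y) \propto p(\phi)\, p(\theta \mid \phi)\, p(y \mid \theta)$$

• Determine analytically the conditional posterior density of θ given φ: p(θ|φ, y)

• Obtain the marginal posterior distribution of φ:

$$p(\phi \mid y) = \int p(\theta, \phi \mid y)\, d\theta$$

or

$$p(\phi \mid y) = \frac{p(\theta, \phi \mid y)}{p(\theta \mid \phi, y)}.$$

The second form holds for any value of θ, because p(θ, φ|y) = p(θ|φ, y) p(φ|y).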


Simulations from the posterior distributions

1. Two steps to simulate a random draw from the joint posterior distribution of θ and φ, p(θ, φ|y):

  ◦ Draw φ from its marginal posterior distribution, p(φ|y)

  ◦ Draw the parameter θ from its conditional posterior, p(θ|φ, y)

2. If desired, draw predictive values ỹ from the posterior predictive distribution given the drawn θ


Example: Rat tumors

• Goal: Estimating the risk of tumor in a group of rats

• Data (number of rats that developed some kind of tumor / number of rats in the experiment):

1. 70 historical experiments:

0/20   0/20   0/20   0/20   0/20   0/20   0/20   0/19   0/19   0/19
0/19   0/18   0/18   0/17   1/20   1/20   1/20   1/20   1/19   1/19
1/18   1/18   2/25   2/24   2/23   2/20   2/20   2/20   2/20   2/20
2/20   1/10   5/49   2/19   5/46   3/27   2/17   7/49   7/47   3/20
3/20   2/13   9/48   10/50  4/20   4/20   4/20   4/20   4/20   4/20
4/20   10/48  4/19   4/19   4/19   5/22   11/46  12/49  5/20   5/20
6/23   5/19   6/22   6/20   6/20   6/20   16/52  15/47  15/46  9/24

2. Current experiment: 4/14


Bayesian approach to hierarchical models

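• Model specification

  ◦ sampling distribution of data: $y_j \sim \mathrm{Binomial}(n_j, \theta_j)$, j = 1, 2, · · · , 71

  ◦ the population distribution of θ: $\theta_j \sim \mathrm{Beta}(\alpha, \beta)$, where α and β are the hyperparameters

• Bayesian estimation

  ◦ non-informative prior for hyperparameters: p(α, β)

  ◦ consider the above model specification: p(θ|α, β)

  ◦ find the joint posterior distribution of the parameter θ and the hyperparameters α and β:

$$
\begin{aligned}
p(\theta, \alpha, \beta \mid y) &\propto p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)\, p(y \mid \theta, \alpha, \beta) \\
&\propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha - 1} (1 - \theta_j)^{\beta - 1} \prod_{j=1}^{J} \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j}
\end{aligned}
$$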


Analytical derivation of conditional/marginal dist.

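• the joint posterior distribution:

$$p(\theta, \alpha, \beta \mid y) \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha - 1} (1 - \theta_j)^{\beta - 1} \prod_{j=1}^{J} \theta_j^{y_j} (1 - \theta_j)^{n_j - y_j}$$

• the conditional posterior density of θ given α and β:

$$p(\theta \mid \alpha, \beta, y) = \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta + n_j)}{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}\, \theta_j^{\alpha + y_j - 1} (1 - \theta_j)^{\beta + n_j - y_j - 1}$$

• the marginal posterior distribution of α and β:

$$p(\alpha, \beta \mid y) = \frac{p(\theta, \alpha, \beta \mid y)}{p(\theta \mid \alpha, \beta, y)} \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}{\Gamma(\alpha + \beta + n_j)}$$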
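The ratio that defines the marginal posterior should not depend on θ. The sketch below checks this numerically for a single experiment, using the current experiment (4/14) and arbitrary hyperparameter values picked only for illustration. All function names are hypothetical, and the hyperprior p(α, β) is left out because it cancels from the check.

```python
import math

def log_beta_fn(a, b):
    """log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_joint_terms(theta, alpha, beta, y, n):
    """log of Beta(theta | alpha, beta) * theta^y * (1 - theta)^(n - y) for one experiment."""
    return (-log_beta_fn(alpha, beta)
            + (alpha + y - 1) * math.log(theta)
            + (beta + n - y - 1) * math.log(1 - theta))

def log_conditional(theta, alpha, beta, y, n):
    """log of the conditional posterior density Beta(theta | alpha + y, beta + n - y)."""
    return (-log_beta_fn(alpha + y, beta + n - y)
            + (alpha + y - 1) * math.log(theta)
            + (beta + n - y - 1) * math.log(1 - theta))

def log_marginal_factor(alpha, beta, y, n):
    """Closed-form factor that one experiment contributes to p(alpha, beta | y)."""
    return log_beta_fn(alpha + y, beta + n - y) - log_beta_fn(alpha, beta)

alpha, beta = 2.0, 12.0   # arbitrary illustrative hyperparameter values
y, n = 4, 14              # the current experiment from the slides

# log(joint) - log(conditional), evaluated at several values of theta
ratios = [log_joint_terms(t, alpha, beta, y, n) - log_conditional(t, alpha, beta, y, n)
          for t in (0.1, 0.3, 0.7)]
print(ratios)
print(log_marginal_factor(alpha, beta, y, n))
```

All three entries of `ratios` agree with the closed-form factor, which is the per-experiment term inside the product in the marginal posterior.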


Choice of hyperprior distribution

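• Idea: To set up a ‘non-informative’ hyperprior distribution

  ◦ $p\left(\mathrm{logit}\left(\frac{\alpha}{\alpha+\beta}\right) = \log\left(\frac{\alpha}{\beta}\right),\ \log(\alpha+\beta)\right) \propto 1$

    NO GOOD because it leads to an improper posterior.

  ◦ $p\left(\frac{\alpha}{\alpha+\beta},\ \alpha+\beta\right) \propto 1$ or $p(\alpha, \beta) \propto 1$

    NO GOOD because the posterior density is not integrable in the limit.

  ◦ $p\left(\frac{\alpha}{\alpha+\beta},\ (\alpha+\beta)^{-1/2}\right) \propto 1 \iff p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2} \iff p\left(\log\left(\frac{\alpha}{\beta}\right),\ \log(\alpha+\beta)\right) \propto \alpha\beta(\alpha+\beta)^{-5/2}$

    OK because it leads to a proper posterior.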
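The two equivalences in the last item follow from changes of variables. A sketch of the Jacobian calculations, writing m = α/(α+β), s = (α+β)^(-1/2), u = log(α/β) and v = log(α+β) (symbols introduced here only for the derivation):

```latex
\begin{aligned}
\frac{\partial(m, s)}{\partial(\alpha, \beta)}
  &= \begin{pmatrix}
       \dfrac{\beta}{(\alpha+\beta)^2} & -\dfrac{\alpha}{(\alpha+\beta)^2} \\[2ex]
       -\tfrac{1}{2}(\alpha+\beta)^{-3/2} & -\tfrac{1}{2}(\alpha+\beta)^{-3/2}
     \end{pmatrix},
  &\left|\det \frac{\partial(m, s)}{\partial(\alpha, \beta)}\right|
  &= \tfrac{1}{2}(\alpha+\beta)^{-5/2}, \\[1ex]
p(\alpha, \beta)
  &= p(m, s)\,\left|\det \frac{\partial(m, s)}{\partial(\alpha, \beta)}\right|
   \;\propto\; (\alpha+\beta)^{-5/2}, \\[1ex]
\frac{\partial(u, v)}{\partial(\alpha, \beta)}
  &= \begin{pmatrix}
       \dfrac{1}{\alpha} & -\dfrac{1}{\beta} \\[2ex]
       \dfrac{1}{\alpha+\beta} & \dfrac{1}{\alpha+\beta}
     \end{pmatrix},
  &\left|\det \frac{\partial(u, v)}{\partial(\alpha, \beta)}\right|
  &= \frac{1}{\alpha+\beta}\left(\frac{1}{\alpha} + \frac{1}{\beta}\right)
   = \frac{1}{\alpha\beta}, \\[1ex]
p(u, v)
  &= p(\alpha, \beta)\,\left|\det \frac{\partial(\alpha, \beta)}{\partial(u, v)}\right|
   = p(\alpha, \beta)\,\alpha\beta
   \;\propto\; \alpha\beta\,(\alpha+\beta)^{-5/2}.
\end{aligned}
```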


Computing marginal posterior of the hyperparameters

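• Computing the relative (unnormalized) posterior density on a grid of values that cover the effective range of (α, β)

  ◦ $\left(\log\left(\frac{\alpha}{\beta}\right),\ \log(\alpha+\beta)\right) \in [-2.5, -1] \times [1.5, 3]$

  ◦ $\left(\log\left(\frac{\alpha}{\beta}\right),\ \log(\alpha+\beta)\right) \in [-2.3, -1.3] \times [1, 5]$

• Drawing a contour plot of the marginal density of $\left(\log\left(\frac{\alpha}{\beta}\right),\ \log(\alpha+\beta)\right)$

  ◦ contour lines are at 0.05, 0.15, · · · , 0.95 times the density at the mode.

• Normalizing by approximating the posterior distribution as a step function over a grid and setting the total probability in the grid to 1.

• Computing the posterior moments based on the grid of $\left(\log\left(\frac{\alpha}{\beta}\right),\ \log(\alpha+\beta)\right)$. For example, E(α|y) is estimated by

$$\sum_{\log(\alpha/\beta),\ \log(\alpha+\beta)} \alpha\, p\left(\log\left(\frac{\alpha}{\beta}\right),\ \log(\alpha+\beta) \,\middle|\, y\right)$$

where the sum runs over the grid points.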
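The grid steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the lecture's own code: it assumes the hyperprior p(α, β) ∝ (α+β)^(-5/2), the grid [-2.3, -1.3] × [1, 5], and a 100 × 100 resolution chosen here for convenience. Names such as `log_post_uv` and `alpha_of` are made up for the example.

```python
import math

# y_j / n_j for the 70 historical experiments, followed by the current one (4/14).
RAW = """
0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/19
0/19 0/18 0/18 0/17 1/20 1/20 1/20 1/20 1/19 1/19
1/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20 2/20 2/20
2/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/20
3/20 2/13 9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/20
4/20 10/48 4/19 4/19 4/19 5/22 11/46 12/49 5/20 5/20
6/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24
4/14
"""
data = [tuple(int(v) for v in item.split("/")) for item in RAW.split()]

def alpha_of(u, v):
    """alpha recovered from u = log(alpha/beta) and v = log(alpha+beta)."""
    return math.exp(v) / (1 + math.exp(-u))

def log_post_uv(u, v):
    """Unnormalized log posterior density of (u, v) = (log(alpha/beta), log(alpha+beta))."""
    alpha = alpha_of(u, v)
    beta = math.exp(v) - alpha
    # hyperprior on the (u, v) scale: alpha * beta * (alpha + beta)^(-5/2)
    lp = math.log(alpha) + math.log(beta) - 2.5 * math.log(alpha + beta)
    for y, n in data:
        lp += (math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
               + math.lgamma(alpha + y) + math.lgamma(beta + n - y)
               - math.lgamma(alpha + beta + n))
    return lp

def linspace(lo, hi, m):
    return [lo + (hi - lo) * k / (m - 1) for k in range(m)]

us = linspace(-2.3, -1.3, 100)   # grid for log(alpha/beta)
vs = linspace(1.0, 5.0, 100)     # grid for log(alpha+beta)

logp = [[log_post_uv(u, v) for v in vs] for u in us]
peak = max(max(row) for row in logp)               # subtract the max to avoid underflow
dens = [[math.exp(lp - peak) for lp in row] for row in logp]
norm = sum(sum(row) for row in dens)
prob = [[d / norm for d in row] for row in dens]   # step-function approximation, sums to 1

# Posterior moments as probability-weighted sums over the grid points.
E_alpha = sum(alpha_of(u, v) * prob[i][k]
              for i, u in enumerate(us) for k, v in enumerate(vs))
E_beta = sum((math.exp(v) - alpha_of(u, v)) * prob[i][k]
             for i, u in enumerate(us) for k, v in enumerate(vs))
print(f"E(alpha|y) ~ {E_alpha:.2f}, E(beta|y) ~ {E_beta:.2f}")
```

Working on the log scale and subtracting the maximum before exponentiating keeps the product of 71 gamma-function ratios from underflowing.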


Sampling from the joint posterior

1. Simulate 1000 draws of $\left(\log\left(\frac{\alpha}{\beta}\right),\ \log(\alpha+\beta)\right)$ from their posterior distribution using the discrete-grid sampling procedure.

2. For l = 1, · · · , 1000:

  ◦ Transform the l-th draw of $\left(\log\left(\frac{\alpha}{\beta}\right),\ \log(\alpha+\beta)\right)$ to the scale of (α, β) to yield a draw of the hyperparameters from their marginal posterior distribution.

  ◦ For each j = 1, · · · , J, sample $\theta_j$ from its conditional posterior distribution $\theta_j \mid \alpha, \beta, y \sim \mathrm{Beta}(\alpha + y_j,\ \beta + n_j - y_j)$.
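The two sampling steps above might look as follows in standard-library Python. This is a sketch under stated assumptions: the hyperprior p(α, β) ∝ (α+β)^(-5/2), a 60 × 60 grid over [-2.3, -1.3] × [1, 5], a fixed random seed, and a uniform jitter within each grid cell (an optional refinement added here so that the draws are not confined to grid points). The helper `log_post_uv` is a made-up name.

```python
import math
import random

# y_j / n_j for the 70 historical experiments, followed by the current one (4/14).
RAW = """
0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/19
0/19 0/18 0/18 0/17 1/20 1/20 1/20 1/20 1/19 1/19
1/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20 2/20 2/20
2/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/20
3/20 2/13 9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/20
4/20 10/48 4/19 4/19 4/19 5/22 11/46 12/49 5/20 5/20
6/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24
4/14
"""
data = [tuple(int(v) for v in item.split("/")) for item in RAW.split()]

def log_post_uv(u, v):
    """Unnormalized log posterior of (u, v) = (log(alpha/beta), log(alpha+beta))."""
    total = math.exp(v)
    alpha = total / (1 + math.exp(-u))
    beta = total - alpha
    lp = math.log(alpha) + math.log(beta) - 2.5 * math.log(total)
    for y, n in data:
        lp += (math.lgamma(total) - math.lgamma(alpha) - math.lgamma(beta)
               + math.lgamma(alpha + y) + math.lgamma(beta + n - y)
               - math.lgamma(total + n))
    return lp

rng = random.Random(0)   # fixed seed so the sketch is reproducible
m = 60
us = [-2.3 + 1.0 * k / (m - 1) for k in range(m)]
vs = [1.0 + 4.0 * k / (m - 1) for k in range(m)]
du, dv = us[1] - us[0], vs[1] - vs[0]

cells = [(u, v) for u in us for v in vs]
logp = [log_post_uv(u, v) for u, v in cells]
peak = max(logp)
weights = [math.exp(lp - peak) for lp in logp]   # relative posterior mass of each cell

draws = []
for u, v in rng.choices(cells, weights=weights, k=1000):   # step 1: grid draw
    u += rng.uniform(-du / 2, du / 2)                      # jitter within the cell
    v += rng.uniform(-dv / 2, dv / 2)
    total = math.exp(v)                                    # step 2a: back to (alpha, beta)
    alpha = total / (1 + math.exp(-u))
    beta = total - alpha
    # step 2b: theta_j | alpha, beta, y ~ Beta(alpha + y_j, beta + n_j - y_j)
    theta = [rng.betavariate(alpha + y, beta + n - y) for y, n in data]
    draws.append((alpha, beta, theta))

theta_71 = sorted(d[2][-1] for d in draws)   # draws for the current experiment
mean_71 = sum(theta_71) / len(theta_71)
print(f"posterior mean of theta_71: {mean_71:.3f}")
print(f"approximate central 95% interval: ({theta_71[24]:.3f}, {theta_71[974]:.3f})")
```

Each element of `draws` is one joint draw of (α, β, θ), so posterior summaries for any θ_j can be read off by sorting or averaging the corresponding column.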


Displaying the results

• Plot the posterior means and 95% intervals for the $\theta_j$’s (Figure 5.4 on page 131)

• The rates $\theta_j$ are shrunk from their sample point estimates, $y_j / n_j$, towards the approximate mean of the population distribution.

• Experiments with fewer observations are shrunk more and have higher posterior variances.

• Note that posterior variability is higher in the full Bayesian analysis than in an analysis that plugs in point estimates of the hyperparameters, reflecting posterior uncertainty in the hyperparameters.
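The shrinkage described above can be made explicit for fixed hyperparameters. The identity below is a short worked step, conditional on given α and β (so it ignores hyperparameter uncertainty); the weight symbol w_j is introduced here only for illustration.

```latex
E(\theta_j \mid \alpha, \beta, y)
  = \frac{\alpha + y_j}{\alpha + \beta + n_j}
  = \underbrace{\frac{n_j}{\alpha + \beta + n_j}}_{w_j}\,\frac{y_j}{n_j}
  \;+\; \underbrace{\frac{\alpha + \beta}{\alpha + \beta + n_j}}_{1 - w_j}\,\frac{\alpha}{\alpha + \beta}
```

The weight w_j on the sample estimate grows with n_j, so experiments with few rats are pulled more strongly toward the population mean α/(α+β).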


Hierarchical normal models (I)

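• Model specification

  ◦ Sampling distribution of data:

$$y_{ij} \mid \theta_j \sim \mathrm{Normal}(\theta_j, \sigma^2), \quad i = 1, \dots, n_j, \quad j = 1, 2, \dots, J, \quad \sigma^2 \text{ known}$$

  ◦ the population distribution of θ: $\theta_j \sim \mathrm{Normal}(\mu, \tau^2)$, where µ and τ are the hyperparameters. That is,

$$p(\theta_1, \dots, \theta_J \mid \mu, \tau) = \prod_{j=1}^{J} N(\theta_j \mid \mu, \tau^2)$$

  ◦ so that, averaging over the hyperparameters,

$$p(\theta_1, \dots, \theta_J) = \int \prod_{j=1}^{J} \left[ N(\theta_j \mid \mu, \tau^2) \right] p(\mu, \tau)\, d(\mu, \tau).$$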


Hierarchical normal models (II)

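• Bayesian estimation

  ◦ non-informative prior for hyperparameters:

$$p(\mu, \tau) = p(\mu \mid \tau)\, p(\tau) \propto p(\tau)$$

  ◦ consider the above model specification: p(θ|µ, τ)

  ◦ find the joint posterior distribution of the parameter θ and the hyperparameters µ and τ:

$$
\begin{aligned}
p(\theta, \mu, \tau \mid y) &\propto p(\mu, \tau)\, p(\theta \mid \mu, \tau)\, p(y \mid \theta) \\
&\propto p(\mu, \tau) \prod_{j=1}^{J} N(\theta_j \mid \mu, \tau^2) \prod_{j=1}^{J} N\left(\bar{y}_{\cdot j} \,\middle|\, \theta_j, \frac{\sigma^2}{n_j}\right)
\end{aligned}
$$

where $\bar{y}_{\cdot j} = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}$ is the sample mean of group j.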


Conditional posterior of θ given (µ, τ), p(θ|µ, τ, y)

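• $\theta_j \mid \mu, \tau \sim \mathrm{Normal}(\mu, \tau^2)$,

• $\theta_j \mid \mu, \tau, y \sim \mathrm{Normal}(\hat{\theta}_j, V_j)$,

where

  ◦ $\hat{\theta}_j = \dfrac{\frac{n_j}{\sigma^2}\, \bar{y}_{\cdot j} + \frac{1}{\tau^2}\, \mu}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}$

  ◦ $V_j = \dfrac{1}{\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}}$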
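The conditional posterior mean is a precision-weighted average of the group mean and the population mean. A minimal sketch, with hypothetical input values chosen only to make the arithmetic easy to follow (the function name `conditional_theta` is made up for this example):

```python
def conditional_theta(ybar_j, n_j, sigma2, mu, tau):
    """Mean and variance of theta_j | mu, tau, y in the normal hierarchical model."""
    data_precision = n_j / sigma2        # precision of the group mean ybar_j
    prior_precision = 1.0 / tau ** 2     # precision of the population distribution
    v_j = 1.0 / (data_precision + prior_precision)
    theta_hat = v_j * (data_precision * ybar_j + prior_precision * mu)
    return theta_hat, v_j

# Equal precisions: the estimate lands halfway between ybar_j and mu.
print(conditional_theta(ybar_j=10.0, n_j=4, sigma2=4.0, mu=0.0, tau=1.0))    # (5.0, 0.5)

# Large tau (weak pooling): the estimate stays close to ybar_j.
print(conditional_theta(ybar_j=10.0, n_j=4, sigma2=4.0, mu=0.0, tau=100.0))
```

Small τ pulls every group toward µ (nearly complete pooling), while large τ leaves each group close to its own mean (nearly no pooling).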


Marginal posterior of µ and τ , p(µ, τ |y)

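$$p(\mu, \tau \mid y) \propto p(\mu, \tau)\, p(y \mid \mu, \tau)$$

Integrating $\theta_j$ out of the model gives the marginal distribution of each group mean, whose variance is the sum of the sampling variance and the population variance:

$$\bar{y}_{\cdot j} \mid \mu, \tau \sim \mathrm{Normal}\left(\mu,\ \frac{\sigma^2}{n_j} + \tau^2\right)$$

Therefore,

$$p(\mu, \tau \mid y) \propto p(\mu, \tau) \prod_{j=1}^{J} N\left(\bar{y}_{\cdot j} \,\middle|\, \mu,\ \frac{\sigma^2}{n_j} + \tau^2\right)$$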


Posterior of µ given τ , p(µ|τ, y)

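$$p(\mu, \tau \mid y) = p(\mu \mid \tau, y)\, p(\tau \mid y) \;\Rightarrow\; p(\mu \mid \tau, y) = \frac{p(\mu, \tau \mid y)}{p(\tau \mid y)}$$

Therefore,

$$\mu \mid \tau, y \sim \mathrm{Normal}(\hat{\mu}, V_\mu),$$

where

$$\hat{\mu} = \frac{\sum_{j=1}^{J} \frac{1}{\sigma^2 / n_j + \tau^2}\, \bar{y}_{\cdot j}}{\sum_{j=1}^{J} \frac{1}{\sigma^2 / n_j + \tau^2}} \quad \text{and} \quad V_\mu^{-1} = \sum_{j=1}^{J} \frac{1}{\sigma^2 / n_j + \tau^2}.$$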


Posterior distribution of τ , p(τ |y)

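$$
\begin{aligned}
p(\tau \mid y) &= \frac{p(\mu, \tau \mid y)}{p(\mu \mid \tau, y)} \\
&\propto \frac{p(\tau) \prod_{j=1}^{J} N\left(\bar{y}_{\cdot j} \mid \mu,\ \frac{\sigma^2}{n_j} + \tau^2\right)}{N(\mu \mid \hat{\mu}, V_\mu)} \\
&\propto \frac{p(\tau) \prod_{j=1}^{J} N\left(\bar{y}_{\cdot j} \mid \hat{\mu},\ \frac{\sigma^2}{n_j} + \tau^2\right)}{N(\hat{\mu} \mid \hat{\mu}, V_\mu)} \\
&\propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\left(-\frac{(\bar{y}_{\cdot j} - \hat{\mu})^2}{2\left(\frac{\sigma^2}{n_j} + \tau^2\right)}\right)
\end{aligned}
$$

The ratio in the first line holds for any value of µ; the third line evaluates it at µ = $\hat{\mu}$.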
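Putting the last few slides together, one way to simulate from the joint posterior is to draw τ from p(τ|y) on a grid, then µ given τ, then each θ_j given (µ, τ). The sketch below does this in standard-library Python. The group summaries are invented for illustration and are not from the lecture; the flat hyperprior p(τ) ∝ 1, the grid range for τ, and the function names are likewise assumptions of this example.

```python
import math
import random

# Hypothetical group summaries (invented for illustration, not from the slides).
ybar = [2.1, -0.4, 1.3, 3.0, 0.2]   # group means ybar_.j
n = [10, 8, 12, 6, 9]               # group sizes n_j
sigma2 = 4.0                        # known sampling variance sigma^2

def mu_hat_and_V(tau):
    """Precision-weighted mean mu_hat and variance V_mu of mu | tau, y."""
    prec = [1.0 / (sigma2 / nj + tau ** 2) for nj in n]
    V_mu = 1.0 / sum(prec)
    mu_hat = V_mu * sum(p * yb for p, yb in zip(prec, ybar))
    return mu_hat, V_mu

def log_p_tau(tau):
    """Unnormalized log p(tau | y), assuming a flat hyperprior p(tau) proportional to 1."""
    mu_hat, V_mu = mu_hat_and_V(tau)
    lp = 0.5 * math.log(V_mu)
    for nj, yb in zip(n, ybar):
        var = sigma2 / nj + tau ** 2
        lp += -0.5 * math.log(var) - (yb - mu_hat) ** 2 / (2 * var)
    return lp

rng = random.Random(1)                              # fixed seed for reproducibility
taus = [0.01 + 5.0 * k / 499 for k in range(500)]   # grid for tau; the range is an assumption
logp = [log_p_tau(t) for t in taus]
peak = max(logp)
weights = [math.exp(lp - peak) for lp in logp]      # relative posterior mass on the grid

draws = []
for tau in rng.choices(taus, weights=weights, k=1000):   # tau ~ p(tau | y) on the grid
    mu_hat, V_mu = mu_hat_and_V(tau)
    mu = rng.gauss(mu_hat, math.sqrt(V_mu))              # mu ~ p(mu | tau, y)
    theta = []
    for nj, yb in zip(n, ybar):                          # theta_j ~ p(theta_j | mu, tau, y)
        V_j = 1.0 / (nj / sigma2 + 1.0 / tau ** 2)
        theta_hat = V_j * (nj / sigma2 * yb + mu / tau ** 2)
        theta.append(rng.gauss(theta_hat, math.sqrt(V_j)))
    draws.append((tau, mu, theta))

print("posterior mean of tau:", sum(d[0] for d in draws) / len(draws))
print("posterior mean of mu: ", sum(d[1] for d in draws) / len(draws))
```

Because τ is one-dimensional, the grid approach is cheap here; the remaining draws are all from normal distributions with the means and variances derived on the preceding slides.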


Prior distribution of τ , p(τ)

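The hyperprior p(τ) enters the marginal posterior of τ as a multiplicative factor:

$$p(\tau \mid y) \propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} \left(\frac{\sigma^2}{n_j} + \tau^2\right)^{-1/2} \exp\left(-\frac{(\bar{y}_{\cdot j} - \hat{\mu})^2}{2\left(\frac{\sigma^2}{n_j} + \tau^2\right)}\right)$$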
