STA 216, Generalized Linear Models, Lecture 6
September 13, 2007

Outline
- Introduction to Bayes Inference for GLMs
  - Description of Posterior
  - Asymptotic Approximations
- Introduction to MCMC Algorithms
  - Gibbs sampling & Metropolis-Hastings
  - Convergence & Mixing
  - Inference from MCMC samples
  - Illustration
Bayesian Inference via the Posterior Distribution

- Recall that Bayesian inference is based on the posterior distribution

    π(θ | y) = π(θ) L(y | θ) / ∫ π(θ) L(y | θ) dθ = π(θ) L(y | θ) / L(y),

  where
  - π(θ) = prior distribution for the parameter θ
  - L(y | θ) = likelihood of the data y given θ
  - L(y) = marginal likelihood of the data, obtained by integrating the likelihood over the prior
- Good news: the numerator of this expression is available
- Bad news: the denominator is typically not available (it may involve a high-dimensional integral)
Conjugate Priors

- For conjugate priors, the posterior distribution of θ is available analytically
- Example: L(y | θ) = ∏_{i=1}^n N(y_i; x_i′β, τ⁻¹) (normal linear regression)
- The conjugate prior is normal-gamma:

    π(β, τ) = N_p(β; β₀, τ⁻¹Σ₀) G(τ; a, b),

  where N_p(·) denotes the p-variate normal density and G(·) the gamma density
- For this prior, the posterior is also normal-gamma (a conjugate update; see the sketch below)
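This conjugate update can be coded directly. A minimal Python sketch (an illustration, not the lecture's SAS implementation), assuming the prior β | τ ~ N(β₀, τ⁻¹Σ₀), τ ~ Gamma(a, b) with the rate parameterization and errors y ~ N(Xβ, τ⁻¹I):

```python
import numpy as np

def normal_gamma_posterior(X, y, beta0, Sigma0, a, b):
    """Conjugate update for the normal linear model y ~ N(X beta, tau^{-1} I)
    under beta | tau ~ N(beta0, tau^{-1} Sigma0), tau ~ Gamma(a, b) (shape/rate).
    Returns the posterior hyperparameters of the normal-gamma posterior."""
    n = len(y)
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_n = np.linalg.inv(Sigma0_inv + X.T @ X)      # posterior scale matrix for beta
    beta_n = Sigma_n @ (Sigma0_inv @ beta0 + X.T @ y)  # posterior mean of beta
    a_n = a + n / 2.0
    b_n = b + 0.5 * (y @ y + beta0 @ Sigma0_inv @ beta0
                     - beta_n @ np.linalg.inv(Sigma_n) @ beta_n)
    return beta_n, Sigma_n, a_n, b_n
```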
Non-Conjugate Priors

- Conjugate priors are not available for generalized linear models (GLMs) other than the normal linear model
- One can potentially rely on an asymptotic normal approximation to the posterior
- As n → ∞, the posterior distribution is approximately normal, centered on the MLE
Asymptotic Approximation with Informative Priors

- Suppose we have a N(β₀, Σ₀) prior for β
- The asymptotic normal approximation to the posterior is

    π(β | y, X) ∝ exp{−(1/2)(β − β₀)′Σ₀⁻¹(β − β₀)} × exp{−(1/2)(β − β̂)′ I(β̂) (β − β̂)}
                ∝ N(β; β̃, Σ̃_β),

  where β̂ is the MLE and I(β̂) is the information matrix evaluated at β̂
- Approximate posterior mean & variance (see the sketch below):

    β̃ = Σ̃_β (Σ₀⁻¹β₀ + I(β̂)β̂),    Σ̃_β = (Σ₀⁻¹ + I(β̂))⁻¹
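A minimal numerical sketch of this precision-weighted combination (illustrative Python; it assumes β̂ and I(β̂) have already been obtained, e.g., from a standard GLM fit):

```python
import numpy as np

def normal_approx_posterior(beta_hat, info_hat, beta0, Sigma0):
    """Combine a N(beta0, Sigma0) prior with the asymptotic
    N(beta_hat, I(beta_hat)^{-1}) likelihood approximation.
    Returns the approximate posterior mean and covariance."""
    prior_prec = np.linalg.inv(Sigma0)
    Sigma_tilde = np.linalg.inv(prior_prec + info_hat)                 # (Sigma0^{-1} + I)^{-1}
    beta_tilde = Sigma_tilde @ (prior_prec @ beta0 + info_hat @ beta_hat)
    return beta_tilde, Sigma_tilde
```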
Comments on Asymptotic Approximation

- Even for moderate sample sizes, the asymptotic approximation may be inaccurate
- In logistic regression with rare outcomes or rare binary exposures, the posterior can be highly skewed
- It is appealing to avoid any reliance on large-sample assumptions and to base inferences on the exact posterior
MCMC - Basic Idea

- Markov chain Monte Carlo (MCMC) provides an approach for generating samples from the posterior distribution
- Note that this does not give us an approximation to π(θ | y) directly
- However, from these samples we can obtain summaries of the posterior distribution for θ
- Summaries of the exact posterior distribution of g(θ), for any functional g(·), can also be obtained
How does MCMC work?

- Let θ^t = (θ^t_1, . . . , θ^t_p) denote the value of the p × 1 vector of parameters at iteration t
- θ^0 = initial value used to start the chain (results shouldn't be sensitive to this choice)
- MCMC generates θ^t from a distribution that depends on the data & potentially on θ^{t−1}, but not on θ^1, . . . , θ^{t−2}
- Under some conditions on the sampling distribution, this results in a Markov chain with stationary distribution π(θ | y)
Different flavors of MCMC

- The most commonly used MCMC algorithms are:
  - Metropolis sampling (Metropolis et al., 1953)
  - Metropolis-Hastings (MH) (Hastings, 1970)
  - Gibbs sampling (Geman & Geman, 1984; Gelfand & Smith, 1990)
- Easy overview of Gibbs: Casella & George (1992, The American Statistician, 46, 167-174)
- Easy overview of MH: Chib & Greenberg (1995, The American Statistician)
Gibbs Sampling

- Start with an initial value θ^0 = (θ^0_1, . . . , θ^0_p)
- For iterations t = 1, . . . , T:
  1. Sample θ^t_1 from the conditional posterior distribution
       π(θ_1 | θ_2 = θ^{t−1}_2, . . . , θ_p = θ^{t−1}_p, y)
  2. Sample θ^t_2 from the conditional posterior distribution
       π(θ_2 | θ_1 = θ^t_1, θ_3 = θ^{t−1}_3, . . . , θ_p = θ^{t−1}_p, y)
  3. Similarly, sample θ^t_3, . . . , θ^t_p from their conditional posterior distributions given the current values of the other parameters (a toy sketch of one full scan appears below)
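As an illustration of the scan above, here is a minimal Gibbs sampler (a Python toy example, not part of the lecture) for a bivariate normal target with unit variances and correlation ρ, where both full conditionals are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_normal(rho, T=5000, theta0=(0.0, 0.0)):
    """Gibbs sampler using the exact full conditionals
    theta1 | theta2 ~ N(rho*theta2, 1 - rho^2) and
    theta2 | theta1 ~ N(rho*theta1, 1 - rho^2)."""
    theta1, theta2 = theta0
    draws = np.empty((T, 2))
    sd = np.sqrt(1.0 - rho ** 2)
    for t in range(T):
        theta1 = rng.normal(rho * theta2, sd)   # step 1: update theta1 given current theta2
        theta2 = rng.normal(rho * theta1, sd)   # step 2: update theta2 given new theta1
        draws[t] = theta1, theta2
    return draws

samples = gibbs_bivariate_normal(rho=0.9)
print(samples[1000:].mean(axis=0), np.corrcoef(samples[1000:].T)[0, 1])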
Gibbs Sampling (continued)

- Under mild regularity conditions, the samples converge to the stationary distribution π(θ | y)
- At the start of sampling, the samples are not from the posterior distribution π(θ | y)
- It is necessary to discard the initial samples as a burn-in to allow for convergence
- In simple models such as GLMs, convergence typically occurs quickly & a burn-in of 100 iterations should be sufficient (to be conservative, SAS uses 2,000 as the default)
Example - DDE & Preterm Birth

- Scientific interest: association between DDE exposure & preterm birth, adjusting for possible confounding variables
- Data from the US Collaborative Perinatal Project (CPP): n = 2380 children, of whom 361 were born preterm
- Analysis: Bayesian analysis using a probit model
Probit Model

- y_i = 1 if preterm birth and y_i = 0 if full-term birth

    Pr(y_i = 1 | x_i, β) = Φ(x_i′β),

- x_i = (1, dde_i, x_{i3}, . . . , x_{i7})′
- x_{i3}, . . . , x_{i7} = possible confounders (black race, etc.)
- β₁ = intercept
- β₂ = dde slope
Prior, Likelihood & Posterior

- Prior: π(β) = N(β₀, Σ_β)
- Likelihood:

    π(y | β, X) = ∏_{i=1}^n Φ(x_i′β)^{y_i} {1 − Φ(x_i′β)}^{1−y_i}

- Posterior: π(β | y, X) ∝ π(β) π(y | β, X)
- No closed form is available for the normalizing constant, so posterior computation relies on MCMC (see the data-augmentation sketch below)
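One standard way to run a Gibbs sampler for this probit model is the data-augmentation scheme of Albert & Chib (1993), which introduces latent normal variables so that every full conditional is available in closed form. A minimal Python sketch (illustrative only; the lecture's results were produced in SAS, and names such as probit_gibbs are mine):

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, beta0, Sigma0, T=2000, seed=1):
    """Data-augmentation Gibbs sampler for Bayesian probit regression
    (Albert & Chib, 1993) with a N(beta0, Sigma0) prior on beta.
    Latent z_i ~ N(x_i' beta, 1), with y_i = 1(z_i > 0)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    prior_prec = np.linalg.inv(Sigma0)
    B = np.linalg.inv(prior_prec + X.T @ X)          # conditional covariance of beta given z
    chol_B = np.linalg.cholesky(B)
    beta = np.zeros(p)                               # starting value beta^0 = 0
    draws = np.empty((T, p))
    for t in range(T):
        # 1. sample latent variables from truncated normals given beta
        mu = X @ beta
        lower = np.where(y == 1, -mu, -np.inf)       # z > 0 when y = 1
        upper = np.where(y == 1, np.inf, -mu)        # z <= 0 when y = 0
        z = mu + truncnorm.rvs(lower, upper, size=n, random_state=rng)
        # 2. sample beta from its multivariate normal full conditional given z
        mean = B @ (prior_prec @ beta0 + X.T @ z)
        beta = mean + chol_B @ rng.standard_normal(p)
        draws[t] = beta
    return draws
```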
Maximum Likelihood Results

Parameter    MLE        SE        Z stat    p-value
β1          -1.08068    0.04355   -24.816   < 2e-16
β2           0.17536    0.02909     6.028   1.67e-09
β3          -0.12817    0.03528    -3.633   0.000280
β4           0.11097    0.03366     3.297   0.000978
β5          -0.01705    0.03405    -0.501   0.616659
β6          -0.08216    0.03576    -2.298   0.021571
β7           0.05462    0.06473     0.844   0.398721

β2 = dde slope (highly significant increasing trend)
Bayesian Analysis - Prior Elicitation

- Ideally, read the literature on preterm birth → β₀ = best guess of β
- This should be possible, in particular, for the confounder coefficients
- Σ₀ expresses uncertainty: place high prior probability on a plausible range
- Much better than flat priors, which can yield implausible estimates!
- As a default, shrinkage-type prior we use N(0, 4 × I_{7×7})
Gibbs Sampling

- We choose β^0 = 0 as the starting value
- The MLE or the asymptotic approximation to the posterior mean may provide a better default choice
- Results should not depend on the starting values, though for poor starting values you may need a longer burn-in
- For typical GLMs, such as probit models, convergence is rapid
- For illustration, we collected 1,000 iterations
Example - probit binary regression model [figure not included in extracted text]
Posterior Summaries

Parameter   Mean     Median   SD     95% credible interval
β1         -1.08     -1.08    0.04   (-1.16, -1.01)
β2          0.17      0.17    0.03   (0.12, 0.23)
β3         -0.13     -0.13    0.04   (-0.2, -0.05)
β4          0.11      0.11    0.03   (0.05, 0.18)
β5         -0.02     -0.02    0.03   (-0.08, 0.05)
β6         -0.08     -0.08    0.04   (-0.15, -0.02)
β7          0.05      0.06    0.06   (-0.07, 0.18)
Estimated Posterior Density [figure not included in extracted text]
Inferences on Functionals

- Often, it is not the regression parameter itself which is of primary interest
- One may instead want to estimate functionals, such as the mean response at different values of a predictor
- By applying the functional to every iteration of the MCMC algorithm after burn-in, one obtains samples from the marginal posterior distribution of the unknown of interest (see the sketch after the next figure)
Estimated Dose Response Function [figure not included in extracted text]
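A minimal sketch of how such a posterior dose-response summary could be computed from the probit MCMC draws (illustrative Python; it assumes the first two columns of the draws are the intercept and dde slope, and holds the confounders at zero, i.e., at their reference values):

```python
import numpy as np
from scipy.stats import norm

def dose_response_summary(beta_draws, dde_grid):
    """Posterior mean curve and pointwise 95% credible band for the functional
    g(beta; dde) = Phi(beta1 + beta2 * dde), one curve per post-burn-in draw."""
    lin_pred = beta_draws[:, [0]] + beta_draws[:, [1]] * dde_grid[None, :]
    probs = norm.cdf(lin_pred)                                 # (num_draws, grid) array
    mean_curve = probs.mean(axis=0)
    lower, upper = np.percentile(probs, [2.5, 97.5], axis=0)   # pointwise credible band
    return mean_curve, lower, upper
```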
Metropolis-Hastings Sampling

- Gibbs sampling requires sampling from the conditional posterior distributions
- Metropolis-Hastings (MH) is an alternative that avoids this restriction
- Again, start with an initial value θ^0 and sequentially update the parameters θ_1, . . . , θ_p
Metropolis-Hastings (continued)

- To draw θ^t_j:
  1. Sample a candidate θ̃^t_j ∼ q_j(· | θ^{t−1}_j)
  2. Set θ^t_j = θ̃^t_j with probability

       min{ 1, [ π(θ̃^t_j) L(y | θ_j = θ̃^t_j, −) q_j(θ^{t−1}_j | θ̃^t_j) ] / [ π(θ^{t−1}_j) L(y | θ_j = θ^{t−1}_j, −) q_j(θ̃^t_j | θ^{t−1}_j) ] },

     where L(y | θ_j = θ̃^t_j, −) is the likelihood given θ_j = θ̃^t_j and the current values of the other parameters
  3. Otherwise set θ^t_j = θ^{t−1}_j (a random-walk sketch appears below)
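A minimal random-walk Metropolis sketch for a single parameter (illustrative Python; function and variable names are mine). With a symmetric normal proposal the q terms cancel, so only the unnormalized log-posterior is needed:

```python
import numpy as np

def random_walk_metropolis(log_post, theta0, kappa, T=10000, seed=2):
    """Random-walk Metropolis: propose cand ~ N(theta_prev, kappa^2) and accept
    with probability min(1, post(cand) / post(prev))."""
    rng = np.random.default_rng(seed)
    draws = np.empty(T)
    theta, lp = theta0, log_post(theta0)
    for t in range(T):
        cand = rng.normal(theta, kappa)
        lp_cand = log_post(cand)
        if np.log(rng.uniform()) < lp_cand - lp:   # accept/reject on the log scale
            theta, lp = cand, lp_cand
        draws[t] = theta
    return draws

# toy usage: target is the unnormalized log-density of a N(3, 1) distribution
samples = random_walk_metropolis(lambda th: -0.5 * (th - 3.0) ** 2, theta0=0.0, kappa=1.0)
print(samples[2000:].mean())
```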
Comments on Metropolis-Hastings

- Performance is sensitive to the proposal distributions q_j(· | θ^{t−1}_j)
- The most common proposal is N(θ^{t−1}_j, κ), which is centered on the previous value
- This results in a Metropolis random walk
- The algorithm is inefficient if κ is chosen too small or too large
Adaptive Rejection Sampling (ARS)

- ARS (Gilks & Wild, 1992) is an approach for implementing Gibbs sampling for log-concave conditional distributions
- It uses sequentially refined envelopes around the target density, leading to some additional computational expense
- Log-concavity holds for most GLMs and typical priors
- When log-concavity is violated, adaptive rejection Metropolis sampling (ARMS) (Gilks et al., 1995) can be used
SAS Implementation

- BGENMOD, BLIFEREG & BPHREG all rely on ARS (when possible) or ARMS
- Hence, SAS uses Gibbs sampling for posterior computation
- It is important to diagnose convergence & mixing whenever using MCMC!
Some Terminology

- Convergence: initial drift of the samples towards the stationary distribution
- Burn-in: samples at the start of the chain that are discarded to allow convergence
- Slow mixing: tendency for high autocorrelation in the samples
- Thinning: practice of keeping only every kth iteration to reduce autocorrelation
- Trace plot: plot of the sampled values of a parameter vs iteration number
Example - trace plot with poor mixing [figure not included in extracted text]
Poor mixing Gibbs sampler

- Exhibits "snaking" behavior in the trace plot, with cyclic local trends in the mean
- Poor mixing in the Gibbs sampler is caused by high posterior correlation among the parameters
- This decreases efficiency: many more samples need to be collected to keep the Monte Carlo error in posterior summaries low
- For a very poorly mixing chain, millions of iterations may even be needed
- Routinely examine trace plots!
Example - trace plot with good mixing [figure not included in extracted text]
Convergence diagnostics

- Diagnostics are available to help decide on the number of burn-in & collected samples
- Note: there are no definitive tests of convergence & you should check convergence for all parameters
- With experience, visual inspection of trace plots is perhaps the most useful approach
- There are also a number of useful automated tests
Convergence diagnostics in SAS

- Gelman-Rubin: uses parallel chains with dispersed initial values to test convergence
- Geweke: applies a test of stationarity to a single chain
- Heidelberger-Welch (stationarity): alternative to Geweke
- Heidelberger-Welch (halfwidth): is the number of samples adequate for estimating the posterior mean?
- Raftery-Lewis: number of samples needed for the desired accuracy in estimating percentiles
- Autocorrelation: high values indicate slow mixing
- Effective sample size: a low value relative to the actual number of samples indicates slow mixing (see the sketch below)
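Simplified versions of two of these diagnostics can be computed directly from the draws. The following Python sketch (illustrative, not the SAS implementation) shows a basic Gelman-Rubin potential scale reduction factor and a crude effective sample size based on the sample autocorrelations:

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat for one parameter, given an (m, n) array of m parallel chains
    of length n (post burn-in)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # average within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

def effective_sample_size(x):
    """Crude ESS = T / (1 + 2 * sum of lag-k autocorrelations), truncating the
    sum at the first non-positive autocorrelation."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode="full")[T - 1:] / T   # lags 0..T-1
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, T):
        if rho[k] <= 0:
            break
        s += rho[k]
    return T / (1.0 + 2.0 * s)
```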
Practical advice on convergence diagnosis

- The Gelman-Rubin approach is quite appealing in its use of multiple chains
- Geweke & Heidelberger-Welch sometimes reject even when the trace plots look good
- These tests can be overly sensitive to minor departures from stationarity that do not impact inferences
- Sometimes this can be solved by running more iterations
- Otherwise, you may want to try multiple chains
- For the models considered in SAS, the chains tend to be very well behaved when the MLE exists or the priors are informative
How to summarize results from the MCMC chain?

- Posterior mean: θ̂ = (1/(T − B)) Σ_{t=B+1}^T θ^t, with B = number of burn-in samples and T = total number of samples
- The posterior mean is the most commonly used point estimate and provides an alternative to the MLE (note: the posterior mode is difficult to estimate accurately from MCMC output)
- The posterior median (50th percentile of {θ^t}_{t=B+1}^T) provides an alternative point estimate
- The posterior standard deviation is calculated as the square root of

    v̂ar(θ_j | y) = (1/(T − B − 1)) Σ_{t=B+1}^T (θ^t_j − θ̂_j)²

- As n increases, we obtain π(θ_j | y) ≈ N(θ_j; θ̂_j, v̂ar(θ_j | y)) (a sketch of these summaries appears below)
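A minimal Python sketch of these summaries for a single parameter's chain (names are mine):

```python
import numpy as np

def chain_summaries(theta, burn_in):
    """Posterior mean, median and standard deviation from the post-burn-in draws."""
    kept = np.asarray(theta)[burn_in:]
    return {"mean": kept.mean(),
            "median": np.median(kept),
            "sd": kept.std(ddof=1)}   # ddof=1 matches the 1/(T - B - 1) divisor
```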
Interval estimates

- As a Bayesian alternative to the confidence interval, one can use a credible interval
- The 100(1 − α)% equal-tail credible interval ranges from the α/2 to the 1 − α/2 percentile of {θ^t}_{t=B+1}^T
- A highest posterior density (HPD) interval can also be calculated: the smallest interval containing the parameter with 100(1 − α)% posterior probability (see the sketch below)
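Both intervals are easy to approximate from the post-burn-in draws. An illustrative Python sketch (the HPD version is the common "shortest interval over sorted draws" approximation):

```python
import numpy as np

def equal_tail_interval(theta, alpha=0.05):
    """Equal-tail 100(1 - alpha)% credible interval from posterior draws."""
    return np.percentile(theta, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def hpd_interval(theta, alpha=0.05):
    """Empirical HPD interval: shortest interval containing a fraction
    (1 - alpha) of the sorted posterior draws."""
    sorted_draws = np.sort(np.asarray(theta))
    n = len(sorted_draws)
    n_keep = int(np.ceil((1 - alpha) * n))
    widths = sorted_draws[n_keep - 1:] - sorted_draws[: n - n_keep + 1]
    start = int(np.argmin(widths))
    return sorted_draws[start], sorted_draws[start + n_keep - 1]
```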
Posterior probabilities
I Often interest focuses on the weight of evidence ofH1 : θj > 0
STA 216, Generalized Linear Models, Lecture 6
OutlineIntroduction to Bayes Inference for GLMs
Introduction to MCMC Algorithms
Gibbs sampling & Metropolis-HastingsConvergence & MixingInference from MCMC samplesIllustration
Posterior probabilities
I Often interest focuses on the weight of evidence ofH1 : θj > 0
I One can use the estimated posterior probability:
P̂r(θj > 0 |data) =1
T − B
T∑
t=B+1
1(θtj > 0),
with 1(θtj > 0) = 1 if θt
j > 0 and 0 otherwise.
STA 216, Generalized Linear Models, Lecture 6
OutlineIntroduction to Bayes Inference for GLMs
Introduction to MCMC Algorithms
Gibbs sampling & Metropolis-HastingsConvergence & MixingInference from MCMC samplesIllustration
Posterior probabilities

I Often interest focuses on the weight of evidence for H1 : θ_j > 0
I One can use the estimated posterior probability:

    P̂r(θ_j > 0 | data) = (1/(T − B)) ∑_{t=B+1}^{T} 1(θ_j^t > 0),

  with 1(θ_j^t > 0) = 1 if θ_j^t > 0 and 0 otherwise
I A high value (e.g., greater than 0.95) suggests strong evidence in favor of H1
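A one-line Python sketch of this estimate, again assuming `kept` holds the post-burn-in draws of θ_j:

    import numpy as np

    def prob_positive(kept):
        """Estimated Pr(theta_j > 0 | data): fraction of post-burn-in draws above zero."""
        return np.mean(np.asarray(kept) > 0)

    # e.g., prob_positive(theta_j_draws) > 0.95 would suggest strong evidence for H1: theta_j > 0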
Marginal posterior density estimation

I Summary statistics such as the mean, median, standard deviation, etc. provide an incomplete picture
I Since we have many samples from the posterior, we can accurately estimate the exact posterior density
I This can be done using a kernel-smoothed density estimation procedure applied to the samples {θ_j^t}_{t=B+1}^{T}
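For example, a Gaussian kernel density estimate of the marginal posterior can be computed with scipy (a sketch; `kept` is again a hypothetical array of post-burn-in draws for θ_j):

    import numpy as np
    from scipy.stats import gaussian_kde

    def marginal_density(kept, n_grid=200):
        """Kernel-smoothed estimate of the marginal posterior density of one parameter."""
        kept = np.asarray(kept)
        kde = gaussian_kde(kept)                          # bandwidth chosen by Scott's rule by default
        grid = np.linspace(kept.min(), kept.max(), n_grid)
        return grid, kde(grid)                            # evaluate the density on a grid for plotting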
Illustration - linear regression

I Lewis & Taylor (1967) - study of weight (y_i) in 237 students
I The model is as follows:

    y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + β_3 x_{3i} + ε_i,   i = 1, . . . , 237,

  where
  x_{1i} = height in feet − 5 feet
  x_{2i} = age in years − 16
  x_{3i} = 1 for males, 0 for females
I Implemented in SAS Proc BGENMOD - 2,000 burn-in & 10,000 collected samples
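For comparison, a minimal Python sketch of a Gibbs sampler for this conjugate linear model (an illustrative stand-in, not the SAS Proc BGENMOD implementation; it assumes independent N(0, 100²) priors on the β's and a Gamma(0.01, 0.01) prior on the precision τ, which are hypothetical choices):

    import numpy as np

    def gibbs_linear_regression(X, y, n_iter=12000, prior_var=100.0**2, a0=0.01, b0=0.01, seed=0):
        """Gibbs sampler for y = X beta + eps, eps ~ N(0, 1/tau), with conjugate priors."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        beta, tau = np.zeros(p), 1.0
        prior_prec = np.eye(p) / prior_var                 # prior precision for beta (prior mean 0)
        beta_draws, tau_draws = np.empty((n_iter, p)), np.empty(n_iter)
        XtX, Xty = X.T @ X, X.T @ y
        for t in range(n_iter):
            # full conditional for beta: multivariate normal
            post_cov = np.linalg.inv(prior_prec + tau * XtX)
            post_mean = post_cov @ (tau * Xty)
            beta = rng.multivariate_normal(post_mean, post_cov)
            # full conditional for tau: gamma (numpy parameterizes by shape and scale = 1/rate)
            resid = y - X @ beta
            tau = rng.gamma(a0 + n / 2.0, 1.0 / (b0 + 0.5 * resid @ resid))
            beta_draws[t], tau_draws[t] = beta, tau
        return beta_draws, tau_draws

    # e.g., with the design matrix [1, height - 5 ft, age - 16, male] one would discard the
    # first 2,000 draws as burn-in and summarize the remaining 10,000 as described above.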
Output and diagnostics - intercept (β0) [figure]
Output and diagnostics - height (β1) [figure]
Output and diagnostics - age (β2) [figure]
Output and diagnostics - male (β3) [figure]
Mixing - Autocorrelation in MCMC samples

Parameter    Lag 1     Lag 5     Lag 10    Lag 50
Intercept    0.5489    0.0114   -0.0107    0.0009
height       0.5166   -0.0124    0.0112    0.0042
age          0.4634   -0.0068   -0.0038    0.0032
male         0.5613    0.0294   -0.0170    0.0017
Precision   -0.0039   -0.0088   -0.0042    0.0018

Conclusion: Very good mixing
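The lag-k autocorrelations above can be reproduced from the raw draws with a short Python sketch (assuming `kept` is one parameter's post-burn-in chain):

    import numpy as np

    def autocorr(kept, lag):
        """Sample autocorrelation of an MCMC chain at a given lag."""
        centered = np.asarray(kept) - np.mean(kept)
        num = np.sum(centered[lag:] * centered[:-lag]) if lag > 0 else np.sum(centered * centered)
        return num / np.sum(centered * centered)

    # e.g., [autocorr(kept, k) for k in (1, 5, 10, 50)] gives one row of the table above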
Tests of convergence

            Gelman-Rubin            Geweke
Parameter   Estimate   97.5%       z         Pr > |z|
Intercept   1.0000     1.0002      0.5871    0.5572
height      1.0004     1.0013      1.7153    0.0863
age         1.0003     1.0012     -1.3831    0.1666
male        1.0001     1.0005     -1.2658    0.2056
Precision   1.0003     1.0010      2.4947    0.0126

Gelman-Rubin: values ≈ 1 suggest convergence
Geweke: convergence suggested except for precision
Heidelberger-Welsh: all parameters passed
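A simplified Geweke-style check can be sketched in Python by comparing the means of the early and late portions of the chain (this naive version uses plain sample variances rather than the spectral-density variance estimates behind the SAS output, so the z-values will differ somewhat):

    import numpy as np

    def geweke_z(kept, first=0.1, last=0.5):
        """Z-score comparing the mean of the first 10% of draws to the mean of the last 50%."""
        kept = np.asarray(kept)
        a = kept[: int(first * len(kept))]
        b = kept[int((1 - last) * len(kept)):]
        return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

    # |geweke_z(kept)| much larger than about 2 would suggest the chain has not yet converged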
Output and diagnostics - precision (τ) [figure]
Number of samples sufficient?

I Raftery-Lewis: 3746 samples needed for ±0.005 accuracy in estimating the 0.025 quantile (so 10,000 is a sufficient number)
I Heidelberger-Welsh: 10,000 samples sufficient for accurate mean estimation - except for the male coefficient
I Effective sample size: ranged from 3033.5 to 3740.2 for the regression coefficients
I That is, the 10,000 Gibbs samples contain as much information as 3033.5-3740.2 independent draws (a sketch of the calculation follows)
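Effective sample size can be approximated from the autocorrelations with a short Python sketch (using a simple truncation at the first non-positive autocorrelation; software such as SAS or coda uses more refined spectral estimates, so the values will not match exactly):

    import numpy as np

    def effective_sample_size(kept, max_lag=200):
        """ESS = T / (1 + 2 * sum of positive-lag autocorrelations), truncated at the first non-positive one."""
        centered = np.asarray(kept) - np.mean(kept)
        denom = np.sum(centered * centered)
        acf_sum = 0.0
        for lag in range(1, min(max_lag, len(centered) - 1)):
            rho = np.sum(centered[lag:] * centered[:-lag]) / denom
            if rho <= 0:                                  # simple truncation rule
                break
            acf_sum += rho
        return len(centered) / (1.0 + 2.0 * acf_sum)

    # for these chains this gives roughly 3,000-3,700 effective draws out of 10,000 kept samples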
Posterior summaries

10,000 Samples
Parameter   Mean      SD        95% CI               95% HPD
Intercept   96.155    1.138     [93.906, 98.352]     [93.866, 98.294]
height      3.103     0.272     [2.576, 3.642]       [2.550, 3.611]
age         2.390     0.566     [1.272, 3.492]       [1.282, 3.498]
male        -0.280    1.601     [-3.3601, 2.948]     [-3.344, 2.961]
precision   0.0071    0.00066   [0.0058, 0.0084]     [0.0058, 0.0084]

50,000 Samples
Parameter   Mean      SD        95% CI               95% HPD
Intercept   96.207    1.145     [93.968, 98.457]     [93.997, 98.482]
height      3.107     0.267     [2.581, 3.627]       [2.574, 3.619]
age         2.375     0.562     [1.265, 3.467]       [1.268, 3.470]
male        -0.353    1.605     [-3.495, 2.825]      [-3.451, 2.863]
precision   0.0071    0.00065   [0.0059, 0.0084]     [0.0058, 0.0084]
Convergence Diagnostics (50,000 samples)

I Gelman-Rubin 97.5% bound: maximum of 1.0001
I Geweke p-values: minimum of 0.4716
I Heidelberger-Welsh: passed for all parameters
I Conclusion: for the longer chain, no evidence of lack of convergence
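The Gelman-Rubin statistic compares between-chain and within-chain variability; a minimal Python sketch for several parallel chains (a simplified version of the quantity reported above, assuming `chains` is a list of equal-length post-burn-in arrays for one parameter):

    import numpy as np

    def gelman_rubin(chains):
        """Potential scale reduction factor R-hat from m parallel chains of length n."""
        draws = np.asarray(chains)                 # shape (m, n)
        m, n = draws.shape
        chain_means = draws.mean(axis=1)
        B = n * chain_means.var(ddof=1)            # between-chain variance
        W = draws.var(axis=1, ddof=1).mean()       # within-chain variance
        var_hat = (n - 1) / n * W + B / n          # pooled posterior variance estimate
        return np.sqrt(var_hat / W)

    # values close to 1 (e.g., the 1.0001 bound above) indicate the chains have mixed well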
Discussion

I Overall picture suggests convergence, good mixing and a sufficient number of collected samples
I Don't take rejection of one convergence test too seriously if the trace plot looks good
I Rejection motivates collection of additional samples to make sure inferences do not change