Introduction to Bayesian inference and computation for social science data analysis Nicky Best Imperial College, London www.bias-project.org.uk


Page 1

Introduction to Bayesian inference and computation for social science

data analysis

Nicky Best

Imperial College, London

www.bias-project.org.uk

Page 2

Outline

• Overview of Bayesian methods
  – Illustration of conjugate Bayesian inference
  – MCMC methods

• Examples illustrating:
  – Analysis using informative priors
  – Hierarchical priors, meta-analysis and evidence synthesis
  – Adjusting for data quality
  – Model uncertainty

• Discussion

Page 3

Overview of Bayesian inference and computation

Page 4

Overview of Bayesian methods

• Bayesian methods have been widely applied in many areas
  – medicine / epidemiology / genetics
  – ecology / environmental sciences
  – finance
  – archaeology
  – political and social sciences, …

• Motivations for adopting Bayesian approach vary
  – natural and coherent way of thinking about science and learning
  – pragmatic choice that is suitable for the problem in hand

Page 5

Overview of Bayesian methods

• Medical context: FDA draft guidance www.fda.gov/cdrh/meetings/072706-bayesian.html:

“Bayesian statistics…provides a coherent method for learning from evidence as it accumulates”

• Evidence can accumulate in various ways:
  – Sequentially
  – Measurement of many ‘similar’ units (individuals, centres, sub-groups, areas, periods, …)
  – Measurement of different aspects of a problem

• Evidence can take different forms:
  – Data
  – Expert judgement

Page 6

Overview of Bayesian methods

• Bayesian approach also provides formal framework for propagating uncertainty
  – Well suited to building complex models by linking together multiple sub-models
  – Can obtain estimates and uncertainty intervals for any parameter, function of parameters or predictive quantity of interest

• Bayesian inference doesn’t rely on asymptotics or analytic approximations
  – Arbitrarily wide range of models can be handled using the same inferential framework
  – Focus on specifying realistic models, not on choosing an analytically tractable approximation

Page 7

Bayesian inference

• Distinguish between

  x : known quantities (data)

  θ : unknown quantities (e.g. regression coefficients, future outcomes, missing observations)

• Fundamental idea: use probability distributions to represent uncertainty about unknowns

• Likelihood – model for the data: p(x | θ)

• Prior distribution – representing current uncertainty about unknowns: p(θ)

• Applying Bayes theorem gives the posterior distribution

  p(θ | x) = p(θ) p(x | θ) / ∫ p(θ) p(x | θ) dθ

  ∝ p(θ) p(x | θ)

Page 8

Conjugate Bayesian inference

• Example: election poll (from Franklin, 2004*)

• Imagine an election campaign where (for simplicity) we have just a Government/Opposition vote choice.

• We enter the campaign with a prior distribution for the proportion θ supporting Government. This is p(θ)

• As the campaign begins, we get polling data. How should we change our estimate of Government’s support?

*Adapted from Charles Franklin’s Essex Summer School course slides: http://www.polisci.wisc.edu/users/franklin/Content/Essex/Lecs/BayesLec01p6up.pdf

Page 9

Conjugate Bayesian inference

Data and likelihood

• Each poll consists of n voters, x of whom say they will vote for Government and n - x will vote for the opposition.

• If we assume we have no information to distinguish voters in their probability of supporting government then we have a binomial distribution for x

• This binomial distribution is the likelihood p(x | θ)

  x ~ (n choose x) θ^x (1−θ)^(n−x) ∝ θ^x (1−θ)^(n−x)

Page 10

Conjugate Bayesian inference

Prior

• We need to specify a prior that
  – expresses our uncertainty about the election (before it begins)
  – conforms to the nature of the parameter, i.e. is continuous but bounded between 0 and 1

• A convenient choice is the Beta distribution

  p(θ) = Beta(a, b) = [Γ(a+b) / (Γ(a) Γ(b))] θ^(a−1) (1−θ)^(b−1)

  ∝ θ^(a−1) (1−θ)^(b−1)

Page 11

Conjugate Bayesian inference

• Beta(a,b) distribution can take a variety of shapes depending on its two parameters a and b

[Figure: Beta density plots over θ ∈ (0, 1) for Beta(0.5, 0.5), Beta(1, 1), Beta(5, 1), Beta(5, 5), Beta(5, 20) and Beta(50, 200)]

Mean of Beta(a, b) distribution = a/(a+b)

Variance of Beta(a, b) distribution = ab/[(a+b)²(a+b+1)]

Page 12

Conjugate Bayesian inference

Posterior

• Combining a beta prior with the binomial likelihood gives a posterior distribution

• When prior and posterior come from the same family, the prior is said to be conjugate to the likelihood
  – Occurs when prior and likelihood have the same ‘kernel’

  p(θ | x, n) ∝ p(x | θ, n) p(θ)

  ∝ θ^x (1−θ)^(n−x) θ^(a−1) (1−θ)^(b−1)

  = θ^(x+a−1) (1−θ)^(n−x+b−1)

  i.e. θ | x, n ~ Beta(x + a, n − x + b)

Page 13

Conjugate Bayesian inference

• Suppose I believe that Government only has the support of half the population, and I think that estimate has a standard deviation of about 0.05

  – This is approximately a Beta(50, 50) distribution

• We observe a poll with 200 respondents, 120 of whom (60%) say they will vote for Government

• This produces a posterior which is a

Beta(120+50, 80+50) = Beta(170, 130) distribution
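The conjugate update on this slide is simple enough to check by hand; a minimal Python sketch (not part of the original slides) of the beta-binomial update and the resulting posterior moments:

```python
import math

# Election poll example: Beta(50, 50) prior, x = 120 Government supporters
# out of n = 200 respondents
a, b = 50, 50          # prior parameters
x, n = 120, 200        # poll data

# Conjugate update: posterior is Beta(a + x, b + n - x)
a_post, b_post = a + x, b + (n - x)

post_mean = a_post / (a_post + b_post)
post_var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))

print(a_post, b_post)                 # 170 130
print(round(post_mean, 2))            # 0.57
print(round(math.sqrt(post_var), 3))  # 0.029
```

The posterior parameters and moments match the values quoted on the next slide.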

Page 14

Conjugate Bayesian inference

• Prior mean, E(θ) = 50/100 = 0.5

• Posterior mean, E(θ | x, n) = 170/300 = 0.57

• Posterior SD, √Var(θ | x, n) = 0.029

• Frequentist estimate is based only on the data:

  θ̂ = 120/200 = 0.6,  SE(θ̂) = √(0.6 × 0.4 / 200) = 0.035

[Figure: prior, likelihood and posterior densities for θ (prob of voting for government)]

Page 15

Conjugate Bayesian inference

A harder problem

• What is the probability that Government wins?
  – It is not 0.57 or 0.60. Those are expected vote shares, not the probability of winning. How to answer this?

• Frequentists have a hard time with this one. They can obtain a p-value for testing H0: θ > 0.5, but this isn’t the same as the probability that Government wins
  – (it’s actually the probability of observing data more extreme than 120 out of 200 if H0 is true)

• Easy from a Bayesian perspective – calculate Pr(θ > 0.5 | x, n), the posterior probability that θ > 0.5

[Figure: posterior density of θ, with Pr(θ > 0.5) = 0.98]
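A standard-library sketch (not from the slides) of this tail-area calculation, estimating Pr(θ > 0.5 | x, n) by simulating from the Beta(170, 130) posterior:

```python
import random

random.seed(0)

# Monte Carlo estimate of the posterior probability that Government wins,
# i.e. Pr(theta > 0.5) under the Beta(170, 130) posterior
draws = [random.betavariate(170, 130) for _ in range(100_000)]
p_win = sum(d > 0.5 for d in draws) / len(draws)
print(round(p_win, 2))  # a high probability, in line with the slide's 0.98
```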

Page 16

Bayesian computation

• All Bayesian inference is based on the posterior distribution

• Summarising posterior distributions involves integration

• Except for conjugate models, integrals are usually analytically intractable

• Use Monte Carlo (simulation) integration (MCMC)

Mean:  E(θ | x) = ∫ θ p(θ | x) dθ

Prediction:  p(x_new | x) = ∫ p(x_new | θ, x) p(θ | x) dθ

Probabilities:  Pr(θ ∈ (c1, c2) | x) = ∫_{c1}^{c2} p(θ | x) dθ

Page 17

Bayesian computation

• Suppose we didn’t know how to analytically integrate the Beta(170, 130) posterior…

• ….but we do know how to simulate from a Beta

[Figure: histograms of 100 samples (mean = 0.569) and 10,000 samples (mean = 0.566) drawn from the Beta(170, 130) posterior, compared with the true posterior density (mean = 0.567)]

Page 18

Bayesian computation

• Monte Carlo integration
  – Suppose we have samples θ^(1), θ^(2), …, θ^(n) from p(θ | x)
  – Then E(θ | x) = ∫ θ p(θ | x) dθ ≈ (1/n) Σ_{i=1}^{n} θ^(i)
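A quick standard-library sketch (not from the slides) of this sample-mean approximation, echoing the 100 vs 10,000 sample comparison on the previous slide:

```python
import random

random.seed(42)

# Monte Carlo integration: E(theta | x) approximated by the average of draws
# from the posterior, here the Beta(170, 130) of the poll example
for n_samp in (100, 10_000):
    draws = [random.betavariate(170, 130) for _ in range(n_samp)]
    # the estimate approaches the true posterior mean 170/300 ≈ 0.567 as n grows
    print(n_samp, round(sum(draws) / n_samp, 3))
```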

• Can also use samples to estimate posterior tail area probabilities, percentiles, variances etc.

• Difficult to generate independent samples when posterior is complex and high dimensional

• Instead, generate dependent samples from a Markov chain having p(θ | x) as its stationary distribution → Markov chain Monte Carlo (MCMC)
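To make the MCMC idea concrete, here is a toy random-walk Metropolis sampler (an illustrative sketch, not from the slides, and not what WinBUGS does internally) targeting the Beta(170, 130) posterior of the poll example via its unnormalised kernel:

```python
import math
import random

random.seed(1)

# Unnormalised log posterior for the poll example: the Beta(170, 130) kernel
def log_post(theta, x=120, n=200, a=50, b=50):
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return (x + a - 1) * math.log(theta) + (n - x + b - 1) * math.log(1.0 - theta)

def metropolis(n_iter=20_000, step=0.05):
    theta, samples = 0.5, []
    for _ in range(n_iter):
        prop = theta + random.gauss(0.0, step)  # random-walk proposal
        # Metropolis acceptance probability min(1, post(prop)/post(theta))
        accept_prob = math.exp(min(0.0, log_post(prop) - log_post(theta)))
        if random.random() < accept_prob:
            theta = prop
        samples.append(theta)
    return samples[n_iter // 2:]  # discard first half as burn-in

draws = metropolis()
print(sum(draws) / len(draws))  # should be close to 170/300 ≈ 0.567
```

The draws are dependent, but their long-run distribution is the posterior, so the same sample averages as before still work.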

Page 19

Illustrative Examples

Page 20

Borrowing strength

• Bayesian learning → borrowing “strength” (precision) from other sources of information

• Informative prior is one such source
  – “today’s posterior is tomorrow’s prior”
  – relevance of prior information to current study must be justified

Page 21

Informative priors

Example 1: Western and Jackman (1994)*

• Example of regression analysis in comparative research

• What explains cross-national variation in union density?
  – Union density is defined as the percentage of the work force that belongs to a labour union

• Two issues
  – Philosophical: data represent all available observations from a population → conventional (frequentist) analysis based on long-run behaviour of a repeatable data mechanism not appropriate
  – Practical: small, collinear dataset yields imprecise estimates of regression effects

* Slides adapted from Jeff Grynaviski: http://home.uchicago.edu/~grynav/bayes/abs03.htm

Page 22

Informative priors

• Competing theories
  – Wallerstein: union density depends on the size of the civilian labour force (LabF)
  – Stephens: union density depends on industrial concentration (IndC)
  – Note: these two predictors correlate at 0.92

• Control variable: presence of a left-wing government (LeftG)

• Sample: n = 20 countries with a continuous history of democracy since World War II

• Fit linear regression model to compare theories

  union density_i ~ N(μ_i, σ²)

  μ_i = β0 + β1 LeftG_i + β2 LabF_i + β3 IndC_i

Page 23

Informative priors

• Results with non-informative priors on regression coefficients (numerically equivalent to OLS analysis)

[Figure: point estimates and 95% CIs for the Left Govt, Labour Force and Ind Conc regression coefficients]

Page 24

Informative priors

Motivation for Bayesian approach with informative priors

• Because of small sample size and multicollinear variables, not able to adjudicate between theories

• Data tend to favour Wallerstein (union density depends on labour force size), but neither coefficient estimated very precisely

• Other historical data are available that could provide further relevant information

• Incorporation of prior information provides additional structure to the data, which helps to uniquely identify the two coefficients

Page 25

Informative priors

Prior distributions for regression coefficients

Wallerstein

• Believes in negative labour force effect

• Comparison of Sweden and Norway in 1950:

→ doubling of labour force corresponds to 3.5-4% drop in union density

→ on log scale, labour force effect size ≈ -3.5/log(2) ≈ -5

• Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0

  β2 ~ N(−5, 2.5²)

Page 26

Informative priors

Prior distributions for regression coefficients

Stephens

• Believes in positive industrial concentration effect

• Decline in industrial concentration in UK in 1980s:

→ drop of 0.3 in industrial concentration corresponds to about 3% drop in union density

→ industrial concentration effect size ≈ 3/0.3 = 10

• Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0

  β3 ~ N(10, 5²)

Page 27

Informative priors

Prior distributions for regression coefficients

Wallerstein and Stephens

• Both believe left-wing gov’ts assist union growth

• Assuming 1 year of left-wing gov’t increases union density by about 1% translates to effect size of 0.3

• Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0

  β1 ~ N(0.3, 0.15²)

• Vague prior β0 ~ N(0, 100²) assumed for intercept
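The three slides above all choose a prior SD so that the 95% prior interval excludes zero; a quick check (hypothetical arithmetic, not from the slides):

```python
# Informative priors from the Western & Jackman example: (mean, sd) pairs
priors = {
    "beta1 (LeftG)": (0.3, 0.15),
    "beta2 (LabF)": (-5.0, 2.5),
    "beta3 (IndC)": (10.0, 5.0),
}

# 95% prior intervals: mean ± 1.96 * sd
intervals = {k: (m - 1.96 * s, m + 1.96 * s) for k, (m, s) in priors.items()}
for name, (lo, hi) in intervals.items():
    print(f"{name}: ({lo:.2f}, {hi:.2f}), excludes 0: {lo > 0 or hi < 0}")
```

Each interval just excludes zero, expressing confidence in the direction (but not the magnitude) of each effect.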

Page 28

Informative priors

[Figure: posterior estimates of the Ind Conc, Lab Force and Left Govt coefficients under vague priors, Wallerstein priors and Stephens priors]

Page 29

Informative priors

• Effects of LabF and IndC estimated more precisely

• Both sets of prior beliefs support inference that labour-force size decreases union density

• Only Stephens’ prior supports conclusion that industrial concentration increases union density

• Choice of prior is subjective – if no consensus, can we be satisfied that data have been interpreted “fairly”?

• Sensitivity analysis
  – Sensitivity to priors (e.g. repeat analysis using priors with increasing variance)
  – Sensitivity to data (e.g. residuals, influence diagnostics)

Page 30

Hierarchical priors

• Hierarchical priors are another widely used approach for borrowing strength

• Useful when data available on many “similar” units (individuals, areas, studies, subgroups,…)

• Data x_i and parameters θ_i for each unit i = 1, …, N

• Three different assumptions:
  – Independent parameters: units are unrelated, and each θ_i is estimated separately using data x_i alone
  – Identical parameters: observations treated as coming from the same unit, with a common parameter θ
  – Exchangeable parameters: units are “similar” (labels convey no information) → mathematically equivalent to assuming the θ_i’s are drawn from a common probability distribution with unknown parameters

Page 31

Meta-analysis

Example 2: Meta-analysis (Spiegelhalter et al 2004)

• 8 small RCTs of IV magnesium sulphate following acute myocardial infarction

• Data: xig = deaths, nig = patients in trial i, treatment group g (0=control, 1=magnesium)

• Model (likelihood): x_ig ~ Binomial(p_ig, n_ig)

  logit(p_ig) = φ_i + θ_i·g

• θ_i is the log odds ratio for the treatment effect in trial i

• If not willing to believe trials are identical, but no reason to believe they are systematically different → assume the θ_i’s are exchangeable with hierarchical prior

  θ_i ~ Normal(μ, σ²), with μ, σ² also treated as unknown with (vague) priors

Page 32

Meta-analysis

[Figure: forest plot of mortality odds ratios (0.10–1.50 scale) for the Morton, Rasmussen, Smith, Abraham, Feldstedt, Shecter, Ceremuz. and LIMIT-2 trials and overall, comparing fixed-effects and hierarchical Bayes estimates]

Estimates and 95% intervals for treatment effect from independent MLE and hierarchical Bayesian analysis

Page 33

Meta-analysis

• Effective sample size

  n = sample size of trial

  V1 = variance of θ_i without borrowing (variance of MLE)

  V2 = variance of θ_i with borrowing (posterior variance of θ_i)

  ESS = n × V1/V2

  Trial   n     Var(MLE)  Var(θ_i|x)  V1/V2  ESS
  1       76    1.12      0.15        7.3    555
  2       270   0.17      0.10        1.7    459
  3       400   0.55      0.15        3.7    1480
  4       94    1.37      0.16        8.5    799
  5       298   0.23      0.10        2.2    656
  6       115   0.81      0.22        3.6    414
  7       48    1.04      0.16        6.4    307
  8       2316  0.02      0.02        1.1    2350
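The ESS calculation can be sketched directly from the (rounded) variances displayed in the table; small discrepancies against the tabulated ESS values are expected, since the table's ratios were computed before rounding:

```python
# Effective sample size after borrowing strength: ESS = n * V1/V2
# (n, V1 = Var(MLE), V2 = posterior variance) for the 8 magnesium trials
trials = [
    (76, 1.12, 0.15), (270, 0.17, 0.10), (400, 0.55, 0.15), (94, 1.37, 0.16),
    (298, 0.23, 0.10), (115, 0.81, 0.22), (48, 1.04, 0.16), (2316, 0.02, 0.02),
]

ess = {n: round(n * v1 / v2) for n, v1, v2 in trials}
for n, e in ess.items():
    print(f"n = {n:4d}  ESS ≈ {e}")
```

Borrowing never shrinks the effective sample size: since V1 ≥ V2, ESS ≥ n for every trial.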

Page 34

Meta-analysis

[Figure: standardised test score (−0.3 to 0.3) plotted against class size (15–35) for each study]

• Example 3: Meta-analysis of effect of class size on educational achievement (Goldstein et al, 2000)

8 studies:

• 1 RCT

• 3 matched

• 2 experimental

• 2 observational

Page 35

Meta-analysis

• Goldstein et al use maximum likelihood, with bootstrap CI due to small sample size

• Under-estimates uncertainty relative to Bayesian intervals

• Note that 95% CI for Bayesian estimate of effect of class size includes 0

[Figure: effect of class size (change in score per extra pupil, −0.10 to 0.05) with 95% intervals for studies 1–8 and overall, comparing Bayesian and Goldstein et al estimates]

Page 36

Accounting for data quality

• Bayesian approach also provides formal framework for propagating uncertainty about different quantities in a model

• Natural tool for explicitly modelling different aspects of data quality

– Measurement error

– Missing data

Page 37

Accounting for data quality

Example 4: Accounting for population errors in small-area estimation and disease mapping (Best and Wakefield, 1999)

• Context: mapping geographical variations in risk of breast cancer by electoral ward in SE England, 1981–1991

• Typical model: y_i ~ Poisson(λ_i N_i)

  y_i = number of breast cancer cases in area i

  λ_i is the area-specific rate of breast cancer: parameter of interest

  N_i = Σ_t N_it = population-years at risk in area i

Page 38

Accounting for data quality

• N_i usually assumed to be known

• Ignores uncertainty in small-area age/sex population counts in inter-census years

• B&W make use of additional data on Registrar General’s mid-year district age/sex population totals N_dt

• Model A: N_it = N_dt p_it, where p_it is the proportion of the annual district population in the particular age group of interest living in ward i; p_it estimated by interpolating 1981 and 1991 census counts

• Model B: allow for sampling variability in N_it

  N_it ~ Multinomial(N_dt, [p_1t, …, p_Kt])

• Model C: allow for uncertainty in the proportions p_it

  p_it ~ informative Dirichlet prior distribution
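Models B and C can be sketched as forward simulation for a single district-year (hypothetical numbers throughout; the slides fit the full model in WinBUGS rather than simulating like this):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical district-year: total N_dt people allocated across K = 3 wards
N_dt = 10_000
p_hat = np.array([0.25, 0.40, 0.35])  # interpolated census proportions (model A)

# Model C: Dirichlet prior on the proportions, centred on p_hat;
# m is an assumed concentration parameter (larger m = more prior confidence)
m = 500
p_it = rng.dirichlet(m * p_hat)

# Model B: multinomial sampling variability in ward counts given proportions
N_it = rng.multinomial(N_dt, p_it)

print(N_it, N_it.sum())  # ward counts always sum to the district total N_dt
```

Propagating these two extra layers of uncertainty through to the Poisson rates is exactly what the Bayesian machinery handles automatically.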

Page 39

Accounting for data quality

[Figure: graphical model for the random effects Poisson regression – priors on α, σ² and β; ward-level nodes λ_i, X_i, N_i and y_i]

Random effects Poisson regression: log λ_i = α_i + β X_i

X_i = deprivation score for ward i

Page 40

Accounting for data quality

[Figure: extended graphical model in which the ward populations N_it are built from district totals N_dt and proportions p_it (with prior), for each year t, and combined to give N_i]

Page 41

Accounting for data quality

[Figure: left, area-specific RR estimates – ward RR with modelled E_i plotted against ward RR assuming E_i known; right, RR of breast cancer for affluent vs deprived wards under “N_i known” and models A, B, C]

Page 42

Model uncertainty

• Model uncertainty can be large for observational data studies

• In regression models:
  – What is the ‘best’ set of predictors for the response of interest?
  – Which confounders to control for?
  – Which interactions to include?
  – What functional form to use (linear, non-linear, …)?

Page 43

Model uncertainty

• Example 5: Predictors of crime rates in US States (adapted from Raftery et al, 1997)

• Ehrlich (1973) – developed and tested theory that decision to commit crime is rational choice based on costs and benefits

• Costs of crime related to probability of imprisonment and average length of time served in prison

• Benefits of crime related to income inequalities and aggregate wealth of community

• Net benefits of other (legitimate) activities related to employment rate and education levels in community

• Ehrlich analysed data from 47 US states in 1960, focusing on relationship between crime rate and the 2 prison variables

• Up to 13 candidate control variables also considered

Page 44

Model uncertainty

• y = log crime rate in 1960 in each of 47 US states

• Z1, Z2 = log prob. of prison, log av. time in prison

• X1,…, X13 = candidate control variables

• Fit Normal linear regression model

• Results sensitive to choice of control variables

Table adapted from Table 2 in Raftery et al (1997)

  Model            Control variables      Estimated effect (SE)
                                          Prob Prison   Time Prison
  Full model       All                    −0.30 (0.10)  −0.27 (0.17)
  Stepwise         1,3,4,9,11,13          −0.19 (0.07)  —
  Adjusted R²      1,2,4,7,8,9,11,12,13   −0.30 (0.09)  −0.25 (0.16)
  Ehrlich model 1  9,12,13                −0.45 (0.12)  −0.55 (0.20)
  Ehrlich model 2  1,6,9,10,12,13         −0.43 (0.11)  −0.53 (0.20)

Page 45

Model uncertainty

• Using a Bayesian approach, can let the set of control variables be an unknown parameter of the model, θ

• Don’t know (a priori) the number of covariates in ‘best’ model → model has unknown dimension → assign it a prior distribution

• Can handle such “trans-dimensional” (TD) models using “reversible jump” MCMC algorithms

• Normal linear regression model

  y_i ~ Normal(μ_i, σ²),  i = 1, …, 47

• Variable selection model:

  μ_i = Z_i γ + W_i β,  W_i = (X_iθ1, X_iθ2, …, X_iθk)

  k = number of currently selected predictors
  θ = vector of k selected columns of X
  β = vector of k regression coefficients
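For intuition, the "unknown set of covariates" idea can be illustrated without reversible-jump machinery. This toy sketch (hypothetical data, and a BIC approximation to posterior model probabilities rather than the slides' RJ-MCMC) enumerates all subsets of three candidate predictors and weights the models:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n = 47 "states", 3 candidate predictors, only X0 matters
n = 47
X = rng.normal(size=(n, 3))
y = 1.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

# Enumerate all 2^3 subsets; weight each model by exp(-BIC/2),
# a standard rough approximation to its posterior probability
weights = {}
for k in range(4):
    for subset in combinations(range(3), k):
        A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = np.sum((y - A @ beta) ** 2)
        bic = n * np.log(rss / n) + A.shape[1] * np.log(n)
        weights[subset] = np.exp(-0.5 * bic)

total = sum(weights.values())
probs = {s: w / total for s, w in weights.items()}
for subset, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(subset, round(p, 3))
```

Averaging effect estimates over these model probabilities, rather than conditioning on one selected model, is the essence of the Bayesian model averaging results on the following slides.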

Page 46

[Figure: graphical model for the variable selection regression – nodes for k, σ² and μ_i, with state-level data y_i, X_i and Z_i]

Model uncertainty

Page 47

Model uncertainty

[Figure: WinBUGS posterior distribution of k, the number of selected control variables (2 chains, 20,000 samples)]

Page 48

Model uncertainty

[Figure: top, posterior probability that each control variable (1–13) is in the model; middle, posterior mean and 95% CI for each effect (β), conditional on being in the model; bottom, posterior distributions of the prob. prison and time in prison effects on crime rate]

Page 49

Model uncertainty

• Most likely (40%) set of control variables contains X4 (police expenditure in 1960) and X13 (income inequality)

• 2nd most likely (28%) set of control variables contains X5 (police expenditure in 1959) and X13 (income inequality)

• Control variables with >10% marginal probability of inclusion
  – X3 : average years of schooling (18%)

– X4 : police expenditure in 1960 (56%)

– X5 : police expenditure in 1959 (40%)

– X13 : income inequality (94%)

• Posterior estimates of prison variables, averaged over models

log prob. of prison: -0.29 (-0.55, -0.05)

log av. time in prison: -0.27 (-0.69, 0.14)

Page 50

Discussion

• Bayesian approach provides coherent framework for combining many sources of evidence in a statistical model

• Formal approach to “borrowing strength”
  – Improved precision/effective sample size

– Fully accounts for uncertainty

• Relevance of different pieces of evidence is a judgement – must be justifiable

• Bayesian approach forces us to be explicit about model assumptions

• Sensitivity analysis to assumptions is crucial

Page 51

Discussion

• Bayesian calculations are computationally intensive, but:
  – Provides exact inference; no asymptotics

– MCMC offers huge flexibility to model complex problems

• All examples discussed here were fitted using free WinBUGS software: www.mrc-bsu.cam.ac.uk

• Want to learn more about using Bayesian methods for social science data analysis?
  – Short course: Introduction to Bayesian inference and WinBUGS, Sept 19–20, Imperial College
  – See www.bias-project.org.uk for details

Page 52

Thank you!

Page 53

References

• Best, N. and Wakefield, J. (1999). Accounting for inaccuracies in population counts and case registration in cancer mapping studies. J Roy Statist Soc, Series A, 162: 363-382

• Goldstein, H., Yang, M., Omar, R., Turner, R. and Thompson, S. (2000). Meta-analysis using multilevel models with an application to the study of class size effects. Applied Statistics, 49: 399-412.

• Raftery, A., Madigan, D. and Hoeting, J. (1997). Bayesian model averaging for linear regression models. J Am Statist Assoc, 92: 179-191.

• Spiegelhalter, D., Abrams, K. and Myles, J. (2004). Bayesian Approaches to Clinical Trials and Health Care Evaluation, Wiley, Chichester.

• Western, B. and Jackman, S. (1994). Bayesian inference for comparative research. The American Political Science Review, 88: 412-423.