Bayesian Inference
Thomas Bayes
• Bayesian statistics is named after Thomas Bayes (1702-1761), an English statistician, philosopher and Presbyterian minister.

Bayes' solution to a problem of inverse probability was presented in "An Essay towards solving a Problem in the Doctrine of Chances", which was read to the Royal Society in 1763 after Bayes' death. It contains a special case of Bayes' theorem.
Pierre-Simon Laplace
• In statistics, the Bayesian interpretation of probability was developed mainly by Pierre-Simon Laplace (1749-1827).

Laplace is remembered as the French Newton. He was Napoleon's examiner when Napoleon attended the École Militaire in Paris in 1784. Laplace became a count of the Empire in 1806 and was named a marquis in 1817.
Bayes Theorem
• Bayes' theorem for probability events A and B:

  P(A|B) = P(B|A) P(A) / P(B)

• Or, for a set of mutually exclusive and exhaustive events A_1, ..., A_k (i.e. Σ_i P(A_i) = P(∪_i A_i) = 1):

  P(A_i|B) = P(B|A_i) P(A_i) / Σ_j P(B|A_j) P(A_j)
Example – Diagnostic Test
• A new Ebola (EB) test is claimed to have
“95% sensitivity and 98% specificity”
• In a population with an EB prevalence of
1/1000, what is the chance that a patient
testing positive actually has EB?
Let A be the event that the patient is truly positive, and A' the event that they are truly negative.
Let B be the event that they test positive.
Diagnostic Test, ctd.
• We want P(A|B)
• "95% sensitivity" means that P(B|A) = 0.95
• "98% specificity" means that P(B'|A') = 0.98, so P(B|A') = 0.02

So from Bayes' theorem:

  P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|A') P(A')]
         = (0.95 × 0.001) / (0.95 × 0.001 + 0.02 × 0.999)
         ≈ 0.045

Thus over 95% of those testing positive will, in fact, not have EB.
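The slide's calculation can be reproduced in a few lines (a minimal sketch; the variable names are mine):

```python
# Ebola-test example via Bayes' theorem.
sens = 0.95    # P(B|A): test positive given disease
spec = 0.98    # P(B'|A'): test negative given no disease
prev = 0.001   # P(A): prevalence

p_pos = sens * prev + (1 - spec) * (1 - prev)   # P(B) by total probability
p_disease_given_pos = sens * prev / p_pos        # P(A|B) by Bayes' theorem
print(round(p_disease_given_pos, 3))             # 0.045
```

Even with a fairly accurate test, the low prevalence dominates, which is exactly the role the prior plays in the Bayesian machinery introduced next.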
Bayesian Inference
In Bayesian inference there is a fundamental distinction between
• Observable quantities x, i.e. the data
• Unknown quantities θ
θ can be statistical parameters, missing data, latent variables…
• Parameters are treated as random variables
In the Bayesian framework we make probability statements about model parameters
In the frequentist framework, parameters are fixed non-random quantities and the probability statements concern the data.
Prior distributions
As with all statistical analyses, we start by positing a model which specifies f(x|θ).

This is the likelihood, which relates all variables in a 'full probability model'.

However, from a Bayesian point of view:
• θ is unknown, so it should have a probability distribution reflecting our uncertainty about it before seeing the data
• Therefore we specify a prior distribution f(θ)

Note this is like the prevalence in the example.
Posterior Distributions
Also, x is known, so it should be conditioned on. Here we use Bayes' theorem to obtain the conditional distribution for the unobserved quantities given the data, which is known as the posterior distribution:

  f(θ|x) = f(θ) f(x|θ) / f(x) = f(θ) f(x|θ) / ∫ f(θ) f(x|θ) dθ

The prior distribution expresses our uncertainty about θ before seeing the data.
The posterior distribution expresses our uncertainty about θ after seeing the data.
Examples of Bayesian Inference
using the Normal distribution
Known variance, unknown mean

It is easier to consider first a model with one unknown parameter. Suppose we have a sample of Normal data:

  X_i ~ N(θ, σ²),  i = 1, ..., n

Let us assume we know the variance σ², and we assume a prior distribution for the mean θ, based on our prior beliefs:

  θ ~ N(μ₀, σ₀²)

Now we wish to construct the posterior distribution f(θ|x).
Posterior for Normal distribution
mean
So we have

  f(θ) = (2πσ₀²)^(-1/2) exp{ -(θ - μ₀)² / (2σ₀²) }

  f(x|θ) = ∏_{i=1}^n (2πσ²)^(-1/2) exp{ -(x_i - θ)² / (2σ²) }

hence

  f(θ|x) ∝ f(θ) f(x|θ)
         ∝ (2πσ₀²)^(-1/2) exp{ -(θ - μ₀)² / (2σ₀²) } × (2πσ²)^(-n/2) exp{ -Σ_i (x_i - θ)² / (2σ²) }
         ∝ exp{ -(1/2) [ θ² (1/σ₀² + n/σ²) - 2θ (μ₀/σ₀² + Σ_i x_i/σ²) ] + const }
Posterior for Normal distribution
mean (continued)
For a Normal distribution with response y, with mean μ₁ and variance σ₁², we have

  f(y) = (2πσ₁²)^(-1/2) exp{ -(y - μ₁)² / (2σ₁²) }
       ∝ exp{ -(1/2) [ y²/σ₁² - 2y μ₁/σ₁² ] + const }

We can equate this to our posterior,

  f(θ|x) ∝ exp{ -(1/2) [ θ² (1/σ₀² + n/σ²) - 2θ (μ₀/σ₀² + Σ_i x_i/σ²) ] + const },

as follows:

  1/σ₁² = 1/σ₀² + n/σ²   and   μ₁ = (μ₀/σ₀² + Σ_i x_i/σ²) σ₁²
Large sample properties
As n → ∞:

So the posterior variance

  σ₁² = (1/σ₀² + n/σ²)^(-1) → σ²/n

Posterior mean

  μ₁ = σ₁² (μ₀/σ₀² + n x̄/σ²) → x̄

And so the posterior distribution

  θ | x ≈ N(x̄, σ²/n)

compared to

  X̄ | θ ~ N(θ, σ²/n)

in the Frequentist setting.
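The conjugate update for the Normal mean can be sketched directly from these formulas (all names here are my own; the data and prior values are illustrative):

```python
import numpy as np

# Conjugate update for a Normal mean with known variance:
# prior theta ~ N(mu0, s0sq), data X_i ~ N(theta, s2), i = 1..n.
def normal_posterior(x, s2, mu0, s0sq):
    n = len(x)
    post_var = 1.0 / (1.0 / s0sq + n / s2)                # (1/s0^2 + n/s^2)^-1
    post_mean = post_var * (mu0 / s0sq + np.sum(x) / s2)  # precision-weighted mean
    return post_mean, post_var

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=1000)          # simulated data, true mean 5
m, v = normal_posterior(x, s2=4.0, mu0=0.0, s0sq=100.0)
# For large n the posterior approaches N(xbar, s2/n), as on the slide.
print(m, v, x.mean(), 4.0 / len(x))
```

With n = 1000 the posterior mean is almost exactly the sample mean and the posterior variance is almost exactly σ²/n, illustrating the large-sample behaviour above.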
Sufficient Statistic
• Intuitively, a sufficient statistic for a parameter is a statistic that captures all the information about a given parameter contained in the sample.
• Sufficiency Principle: If T(X) is a sufficient statistic for θ, then any inference about θ should depend on the sample X only through the value of T(X).
• That is, if x and y are two sample points such that T(x) = T(y), then the inference about θ should be the same whether X = x or X = y.
• Definition: A statistic T(X) is a sufficient statistic for θ if the conditional distribution of the sample X given T(X) does not depend on θ.
• Definition: Let X₁, X₂, ..., X_n denote a random sample of size n from a distribution that has pdf f(x; θ), θ ∈ Ω. Let Y₁ = u₁(X₁, X₂, ..., X_n) be a statistic whose pdf or pmf is f_{Y₁}(y₁; θ). Then Y₁ is a sufficient statistic for θ if and only if

  f(x₁; θ) f(x₂; θ) ··· f(x_n; θ) / f_{Y₁}(u₁(x₁, x₂, ..., x_n); θ) = H(x₁, x₂, ..., x_n)

where H(x₁, x₂, ..., x_n) does not depend on θ ∈ Ω.
• Example (Normal sufficient statistic): Let X₁, X₂, ..., X_n be independently and identically distributed N(μ, σ²), where the variance is known. The sample mean

  T(X) = X̄ = (1/n) Σ_{i=1}^n X_i

is the sufficient statistic for μ.
• Starting with the joint distribution function:

  f(x; μ) = ∏_{i=1}^n (2πσ²)^(-1/2) exp{ -(x_i - μ)² / (2σ²) }
          = (2πσ²)^(-n/2) exp{ -Σ_{i=1}^n (x_i - μ)² / (2σ²) }

• Next, we add and subtract the sample average, yielding

  f(x; μ) = (2πσ²)^(-n/2) exp{ -Σ_{i=1}^n (x_i - x̄ + x̄ - μ)² / (2σ²) }
          = (2πσ²)^(-n/2) exp{ -[ Σ_{i=1}^n (x_i - x̄)² + n(x̄ - μ)² ] / (2σ²) }

• where the last equality derives from the cross term vanishing:

  Σ_{i=1}^n (x_i - x̄)(x̄ - μ) = (x̄ - μ) Σ_{i=1}^n (x_i - x̄) = 0

• Given that the distribution of the sample mean is

  q(T(x); μ) = (2πσ²/n)^(-1/2) exp{ -n(x̄ - μ)² / (2σ²) }

• the ratio of the information in the sample to the information in the statistic becomes

  f(x; μ) / q(T(x); μ)
    = (2πσ²)^(-n/2) exp{ -[ Σ_i (x_i - x̄)² + n(x̄ - μ)² ] / (2σ²) } / [ (2πσ²/n)^(-1/2) exp{ -n(x̄ - μ)² / (2σ²) } ]
    = n^(-1/2) (2πσ²)^(-(n-1)/2) exp{ -Σ_i (x_i - x̄)² / (2σ²) }

which is a function of the data X₁, X₂, ..., X_n only, and does not depend on μ. Thus we have shown that the sample mean is a sufficient statistic for μ.
• Theorem (Factorization Theorem): Let f(x|θ) denote the joint pdf or pmf of a sample X. A statistic T(X) is a sufficient statistic for θ if and only if there exist functions g(t|θ) and h(x) such that, for all sample points x and all parameter points θ,

  f(x|θ) = g(T(x)|θ) h(x)
Posterior Distribution Through
Sufficient Statistics
Theorem: The posterior distribution
depends only on sufficient statistics.
Proof: Let T(X) be a sufficient statistic for θ, so that f(x|θ) = g(T(x)|θ) H(x). Then

  f(θ|x) = f(θ) f(x|θ) / ∫ f(θ) f(x|θ) dθ
         = f(θ) g(T(x)|θ) H(x) / ∫ f(θ) g(T(x)|θ) H(x) dθ
         = f(θ) g(T(x)|θ) / ∫ f(θ) g(T(x)|θ) dθ
         = f(θ|T(x))
Posterior Distribution Through
Sufficient Statistics
Example: Posterior for Normal distribution
mean (with known variance)
Now, instead of using the entire sample, we
can derive the posterior distribution using
the sufficient statistic

  T(x) = x̄

Exercise: Please derive the posterior
distribution using this approach.
Example. Posterior distribution (& the conjugate priors)
Assume the parameter of interest is θ, the proportion with some
property of interest in the population. Thus the distribution of an
individual data point is Bernoulli(θ), and a reasonable prior
density for θ is the Beta density:

  f(θ; α, β) = θ^(α-1) (1 - θ)^(β-1) / B(α, β),  0 ≤ θ ≤ 1

where α > 0 and β > 0 are two (constant) parameters and

  B(α, β) = ∫₀¹ x^(α-1) (1 - x)^(β-1) dx,

the so-called Beta function.

[Figure: Beta densities on (0, 1) for various parameter choices: Beta(1,1), Beta(5,5), Beta(1,5), Beta(5,1), Beta(2,5), Beta(5,2), Beta(0.5,0.5), Beta(0.3,0.7), Beta(0.7,0.3)]
Now, assume a sample of size n from the population, in which Y of the
values possess the property of interest (the number of 'successes'). It is easy to see
that Y is a sufficient statistic following the Binomial distribution with
pdf:

  f(y; θ) = C(n, y) θ^y (1 - θ)^(n-y)

The posterior pdf based on this sufficient statistic is thus:

  f(θ|y) = f(θ) f(y; θ) / ∫₀¹ f(θ) f(y; θ) dθ
         = [θ^(α-1) (1-θ)^(β-1) / B(α, β)] C(n, y) θ^y (1-θ)^(n-y)
           / ∫₀¹ [θ^(α-1) (1-θ)^(β-1) / B(α, β)] C(n, y) θ^y (1-θ)^(n-y) dθ
         = θ^(y+α-1) (1-θ)^(n-y+β-1) / ∫₀¹ θ^(y+α-1) (1-θ)^(n-y+β-1) dθ
         = θ^(y+α-1) (1-θ)^(n-y+β-1) / B(y+α, n-y+β)

Thus, the posterior density is also a Beta density, with parameters y + α
and n - y + β.
Prior distributions that, combined with the
likelihood, give a posterior in the same
distributional family are named conjugate priors.
(Note that by a distributional family we mean
distributions that go under a common name:
Normal distribution, Binomial distribution, Poisson
distribution, etc.)
A conjugate prior always goes together with a particular
likelihood to produce the posterior.
We sometimes refer to a conjugate pair of
distributions, meaning
(prior distribution, sample distribution = likelihood)
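The Beta-Binomial conjugate update above reduces to simple arithmetic on the parameters; a minimal sketch (the prior and data values are my illustrative choices):

```python
# Beta-Binomial conjugacy: prior Beta(alpha, beta), y successes in n trials
# -> posterior Beta(alpha + y, beta + n - y).
def beta_binomial_update(alpha, beta, y, n):
    return alpha + y, beta + n - y

a, b = beta_binomial_update(2.0, 5.0, y=12, n=20)   # illustrative prior Beta(2, 5)
post_mean = a / (a + b)      # mean of a Beta(a, b) density
print(a, b, post_mean)       # Beta(14, 13), mean 14/27
```

No integration is needed: conjugacy means the data only shift the Beta parameters, which is precisely what makes conjugate priors convenient.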
In particular, if the sample distribution, i.e. f(x|θ), belongs to the k-
parameter exponential family of distributions:

  f(x|θ) = exp{ Σ_{j=1}^k A_j(θ) B_j(x) + C(x) + D(θ) }

we may put

  f(θ) = exp{ Σ_{j=1}^k ξ_j A_j(θ) + ξ_{k+1} D(θ) + K(ξ₁, ..., ξ_{k+1}) }

where ξ₁, ..., ξ_{k+1} are parameters of this prior distribution
and K(ξ₁, ..., ξ_{k+1}) is a function of ξ₁, ..., ξ_{k+1} only.
Then

  f(θ|x) ∝ f(θ) ∏_{i=1}^n f(x_i|θ)
        = exp{ Σ_j ξ_j A_j(θ) + ξ_{k+1} D(θ) + K(ξ₁, ..., ξ_{k+1}) } × exp{ Σ_j A_j(θ) Σ_i B_j(x_i) + Σ_i C(x_i) + n D(θ) }
        ∝ exp{ Σ_{j=1}^k ( ξ_j + Σ_{i=1}^n B_j(x_i) ) A_j(θ) + ( ξ_{k+1} + n ) D(θ) }

i.e. the posterior distribution is of the same form as the prior distribution, but
with parameters

  ( ξ₁ + Σ_i B₁(x_i), ..., ξ_k + Σ_i B_k(x_i), ξ_{k+1} + n )

instead of

  ( ξ₁, ..., ξ_{k+1} )
Some common cases:

  Conjugate prior           Sample distribution              Posterior
  θ ~ Beta(α, β)            X ~ Bin(n, θ)                    θ|x ~ Beta(α + x, β + n - x)
  θ ~ N(μ, τ²)              X_i ~ N(θ, σ²), known σ²         θ|x ~ N( (σ²μ + nτ²x̄)/(σ² + nτ²), σ²τ²/(σ² + nτ²) )
  λ ~ Gamma(α, β)           X_i ~ Po(λ)                      λ|x ~ Gamma(α + Σ_i x_i, β + n)
  θ ~ Pareto(β, α)          X_i ~ U(0, θ)                    θ|x ~ Pareto( max(β, max_i x_i), α + n )
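The Gamma-Poisson row of the table can be sketched the same way as the Beta-Binomial case; again the prior values and names are my illustrative choices:

```python
import numpy as np

# Gamma-Poisson conjugacy (rate parametrization): prior lambda ~ Gamma(alpha, beta),
# data X_i ~ Po(lambda) -> posterior Gamma(alpha + sum(x), beta + n).
def gamma_poisson_update(alpha, beta, x):
    return alpha + sum(x), beta + len(x)

rng = np.random.default_rng(1)
x = rng.poisson(3.0, size=50)                 # simulated counts, true rate 3
a, b = gamma_poisson_update(2.0, 1.0, x)      # illustrative prior Gamma(2, 1)
print(a, b, a / b)                            # posterior mean (alpha + sum x)/(beta + n)
```

The posterior mean a/b sits close to the true rate 3 because 50 observations dominate the weak Gamma(2, 1) prior.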
Non-informative priors
• We often do not have any prior information, although true Bayesians would argue we always have some prior information!
• We would hope to have good agreement between the Frequentist approach and the Bayesian approach with a non-informative prior.
• Diffuse or flat priors are often better terms to use, as no prior is strictly non-informative!
• For our example of an unknown mean, candidate priors are a Uniform distribution over a large range or a Normal distribution with a huge variance.
Improper priors
• The limiting prior of both the Uniform and the Normal is a Uniform prior on the whole real line.
• Such a prior is defined as improper, as it is not strictly a probability distribution and doesn't integrate to 1.
• Some care has to be taken with improper priors; however, in many cases they are acceptable provided they result in a proper posterior distribution.
• Uniform priors are often used as non-informative priors; however, it is worth noting that a uniform prior on one scale can be very informative on another.
• For example: if we have an unknown variance we may put a uniform prior on the variance, the standard deviation or log(variance), which will all have different effects.
Point and Interval Estimation
• In Bayesian inference the outcome of interest for a parameter is its full posterior distribution; however, we may be interested in summaries of this distribution.
• A simple point estimate would be the mean of the posterior (although the median and mode are alternatives).
• Interval estimates are also easy to obtain from the posterior distribution and are given several names, for example credible intervals, Bayesian confidence intervals and highest density regions (HDR). All of these refer to the same quantity.
Definition (Mean Square Error)

  M.S.E.(h) = E[ (h(x) - θ)² ] = ∫∫ (h(x) - θ)² f(θ|x) f(x) dθ dx
Bayes Estimator
Theorem: The Bayes estimator that will
minimize the mean square error is

  ĥ(x) = E(θ | x)

• That is, the posterior mean.
Bayes Estimator
Lemma: Suppose Z and W are real random variables. Then

  min_h E[ (Z - h(W))² ] = E[ (Z - E(Z|W))² ]

That is, the posterior mean E(Z|W) will minimize the quadratic loss (mean square error).
Bayes Estimator
Proof of the Lemma:

  E[ (Z - h(W))² ] = E[ (Z - E(Z|W) + E(Z|W) - h(W))² ]
                   = E[ (Z - E(Z|W))² ]
                     + 2 E[ (Z - E(Z|W)) (E(Z|W) - h(W)) ]
                     + E[ (E(Z|W) - h(W))² ]

Conditioning on W, the cross term is zero. Thus

  E[ (Z - h(W))² ] = E[ (Z - E(Z|W))² ] + E[ (E(Z|W) - h(W))² ] ≥ E[ (Z - E(Z|W))² ]
Bayes Estimator
Recall:
Theorem: The posterior distribution
depends only on sufficient
statistics.
Therefore, let T(X) be a sufficient
statistic; then

  θ̂ = E(θ | x) = E(θ | T(x))
In all cases
• The posterior mean is a compromise between
the prior mean and the MLE
• The posterior s.d. is less than both the prior s.d.
and the s.e.(MLE)
'A Bayesian is one who, vaguely expecting a
horse and catching a glimpse of a donkey,
strongly concludes he has seen a mule'
-- Senn, 1997 --
As n → ∞:
• The posterior mean → the MLE
• The posterior s.d. → the s.e.(MLE)
• The posterior does not depend on the prior.
Point Estimation via Decision Theory
The action is a particular point estimator θ̂
The state of nature is the true value of θ
The loss function is a measure of how good (desirable)
the estimator is of θ:

  L_S = L_S(θ, θ̂)

Prior information is quantified by the prior distribution
(p.d.f.) f(θ)
Data is the random sample x from a distribution with
p.d.f. f(x|θ)
Three loss functions

Zero-one loss:  L_S(θ, θ̂) = 0 if θ̂ = θ, and 1 if θ̂ ≠ θ

Absolute (error) loss:  L_S(θ, θ̂) = |θ̂ - θ|

Quadratic (error) loss:  L_S(θ, θ̂) = (θ̂ - θ)²
Bayes estimator with respect to a given loss function
We are looking for the estimator θ̂ that minimizes the prior expected risk

  R(θ̂) = ∫_X ∫ L_S(θ, θ̂) f(θ) f(x|θ) dθ dx
        = ∫_X [ ∫ L_S(θ, θ̂) f(θ|x) dθ ] f(x) dx

For any given value of x, what has to be minimized is therefore
the posterior expected loss:

  ∫ L_S(θ, θ̂) f(θ|x) dθ
Now, minimization with respect to different loss functions
will result in measures of location in the posterior
distribution of θ.

Zero-one loss: the Maximum A-Posteriori (MAP)
estimator (it maximizes the posterior pdf):
  θ̂ is the posterior mode for θ given x

Absolute error loss:
  θ̂ is the posterior median for θ given x

Quadratic loss: the Bayes estimator:
  θ̂ is the posterior mean for θ given x
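The three estimators can be compared numerically on a concrete posterior; here is a sketch for an illustrative Beta(3, 7) posterior (the grid-based median is an assumption-free numerical shortcut, not part of the slides):

```python
import numpy as np

# Bayes point estimates for a Beta(3, 7) posterior:
# quadratic loss -> posterior mean, absolute loss -> posterior median,
# zero-one loss -> posterior mode (MAP).
a, b = 3.0, 7.0
theta = np.linspace(1e-6, 1 - 1e-6, 200001)
dens = theta**(a - 1) * (1 - theta)**(b - 1)   # unnormalized Beta density
cdf = np.cumsum(dens)
cdf /= cdf[-1]                                  # numerical CDF on the grid

post_mean = a / (a + b)                         # 0.3
post_mode = (a - 1) / (a + b - 2)               # 0.25 (MAP)
post_median = theta[np.searchsorted(cdf, 0.5)]  # roughly 0.286
print(post_mean, post_mode, post_median)
```

For a skewed posterior like this one the three answers differ (mode < median < mean), which is why the choice of loss function matters.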
Credible Intervals (Regions)
• Considering a conjugate normal example with posterior distribution:
μ|x ~ N(165.23,2.222)
• Then a 95% credible interval for μ is
165.23±1.96×sqrt(2.222) = (162.31,168.15).
• Note that the credible intervals can be interpreted in the more natural way that there is a probability of 0.95 that the interval contains μ rather than the Frequentist conclusion that 95% of such intervals would contain μ.
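The slide's interval is a one-line computation (a sketch; names are mine):

```python
from math import sqrt

# 95% credible interval for mu | x ~ N(165.23, 2.222):
# posterior mean +/- 1.96 posterior standard deviations.
mean, var = 165.23, 2.222
half = 1.96 * sqrt(var)
lo, hi = mean - half, mean + half
print(round(lo, 2), round(hi, 2))   # 162.31 168.15
```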
Credible regions
Let Θ be the parameter space for θ and let S_x be a subset of Θ.
If

  ∫_{S_x} f(θ|x) dθ = 1 - α

then S_x is a 1 - α credible region for θ.
For a scalar θ we refer to it as a credible interval.
The important difference compared to confidence regions:
A credible region is a fixed region (conditional on the sample) to which
the random variable θ belongs with probability 1 - α.
A confidence region is a random region that covers the fixed θ with
probability 1 - α.
Highest posterior density (HPD) regions
If, for all θ₁ ∈ S_x and all θ₂ ∉ S_x,

  f(θ₁|x) ≥ f(θ₂|x)

then S_x is called a 1 - α highest posterior density (HPD) credible region.

Equal-tailed credible intervals
An equal-tailed 1 - α credible interval is a 1 - α credible interval (a, b)
such that

  Pr(θ < a | x) = α/2  and  Pr(θ > b | x) = α/2
Example:
In a consignment of 5000 pills suspected to contain the drug Ecstasy, a
sample of 10 pills is taken for chemical analysis. Let θ be the
unknown proportion of Ecstasy pills in the consignment, and assume we
have a high prior belief that this proportion is 100%.
Such a high prior belief can be modelled with a Beta density

  f(θ) = θ^(α-1) (1 - θ)^(β-1) / B(α, β)

where β is set to 1 and α is set to a fairly high value, say 20, i.e. Beta(20, 1).

[Figure: the Beta(20,1) density on (0, 1)]
Now, suppose that after chemical analysis of the 10 pills, all of them showed
to contain Ecstasy.
The posterior density for θ is Beta(α + x, β + n - x) (conjugate prior
with Binomial sample distribution, as the population is large):

  Beta(20 + 10, 1 + 10 - 10) = Beta(30, 1)

Then a lower-tail 99% credible interval (a, 1) for θ satisfies

  Pr(θ > a | x) = ∫_a^1 θ^(30-1) / B(30, 1) dθ = 0.99

Since B(30, 1) = 1/30, this gives

  ∫_a^1 30 θ^29 dθ = 1 - a^30 = 0.99  ⟹  a = 0.01^(1/30) ≈ 0.858

Thus with 99% certainty we can state that at least 85.8% of the pills in the
consignment consist of Ecstasy.
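Since the Beta(30, 1) CDF is simply θ^30, the credible bound has a closed form that is easy to check:

```python
# Lower-tail 99% credible bound for theta | data ~ Beta(30, 1):
# Pr(theta > a | x) = 1 - a**30 = 0.99, hence a = 0.01**(1/30).
a = 0.01 ** (1 / 30)
print(round(a, 3))   # 0.858
```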
Comments:
We have used the Binomial distribution for the sample. More correct
would be to use the hypergeometric distribution, but the Binomial is a
good approximation.
For a smaller consignment (i.e. population) we can benefit from using the
result that the posterior for the number of pills containing Ecstasy in the
rest of the consignment, after removing the sample, is beta-binomial.
This would however give similar results.
If a sample consists of 100% of one kind, how would a confidence
interval for θ be obtained?
Hypothesis Testing
Another big issue in statistical modelling is the ability to
test hypotheses and make model comparisons in general.
The Bayesian approach is in some ways more
straightforward. For an unknown parameter θ
we simply calculate the posterior probabilities

  p₀ = P(H₀ | x),  p₁ = P(H₁ | x)

and decide between H₀ and H₁ accordingly.
We also require the prior probabilities to achieve this:

  P(H₀),  P(H₁)
Decision theory applied to hypothesis testing
Test of H₀: θ = θ₀ vs. H₁: θ = θ₁
Decision procedure: δ_C = "Use a test with critical region C"
Action: δ_C(x) = "Reject H₀ if x ∈ C, otherwise accept H₀"
Loss function:
              H₀ true   H₁ true
  Accept H₀      0         b
  Reject H₀      a         0
Risk function
Assume a prior setting p₀ = Pr(H₀ is true) = Pr(θ = θ₀)
and p₁ = Pr(H₁ is true) = Pr(θ = θ₁).

  R(θ; δ_C) = E[ L_S(θ, δ_C(X)) ]

  R(θ₀; δ_C) = (loss when rejecting a true H₀) × Pr(X ∈ C | θ₀) = a · α,  where α = Pr(X ∈ C | θ₀)
  R(θ₁; δ_C) = (loss when accepting a false H₀) × Pr(X ∉ C | θ₁) = b · β,  where β = Pr(X ∉ C | θ₁)

The Bayes risk (prior expected risk) becomes

  E[ R(θ; δ_C) ] = a α p₀ + b β p₁
Bayes test:

  δ_B = argmin_C E[ R(θ; δ_C) ] = argmin_C ( a α p₀ + b β p₁ )

Minimax test:

  δ* = argmin_C max_θ R(θ; δ_C) = argmin_C max( a α, b β )

Lemma: Bayes tests and most powerful tests (Neyman-Pearson
lemma) are equivalent, in that every most powerful test is a Bayes
test for some values of p₀ and p₁, and every Bayes test is a most
powerful test that rejects H₀ when

  L(θ₁; x) / L(θ₀; x) ≥ p₀ a / (p₁ b)

or, equivalently,

  L(θ₀; x) / L(θ₁; x) ≤ p₁ b / (p₀ a)
Example:
Assume x = (x₁, x₂) is a random sample from Exp(θ), i.e.

  f(x|θ) = (1/θ) e^(-x/θ),  x > 0, θ > 0

We would like to test H₀: θ = 1 vs. H₁: θ = 2 with a Bayes test with
losses a = 2 and b = 1 and with prior probabilities p₀ and p₁.

  L(θ₁; x) / L(θ₀; x) = [ (1/2) e^(-x₁/2) · (1/2) e^(-x₂/2) ] / [ e^(-x₁) · e^(-x₂) ]
                      = (1/4) e^((x₁ + x₂)/2)

So the Bayes test rejects H₀ when

  (1/4) e^((x₁ + x₂)/2) ≥ p₀ a / (p₁ b) = 2 p₀ / p₁

i.e. when

  x₁ + x₂ ≥ 2 ln(8 p₀ / p₁)
Now, under H₀ (θ = 1) the size of the test with threshold t is

  Pr(X₁ + X₂ ≥ t | θ = 1)
    = ∫∫_{x₁ + x₂ ≥ t} e^(-x₁) e^(-x₂) dx₁ dx₂
    = ∫_t^∞ e^(-x₁) dx₁ + ∫_0^t e^(-x₁) [ ∫_{t - x₁}^∞ e^(-x₂) dx₂ ] dx₁
    = e^(-t) + ∫_0^t e^(-x₁) e^(-(t - x₁)) dx₁
    = e^(-t) + t e^(-t) = e^(-t) (1 + t)

With t = 2 ln(8 p₀/p₁) this becomes

  α = ( p₁ / (8 p₀) )² ( 1 + 2 ln(8 p₀/p₁) )

A fixed size α gives conditions on p₀ and p₁, and a certain
choice will give a minimized β.
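The size formula e^(-t)(1 + t) can be sanity-checked by simulation; the prior probabilities below are my illustrative choices, not values from the slides:

```python
import numpy as np

# Under H0 (theta = 1), the test rejects when X1 + X2 >= t with
# t = 2*ln(8*p0/p1); the closed form for the size is exp(-t) * (1 + t).
p0, p1 = 0.5, 0.5                        # illustrative prior probabilities
t = 2 * np.log(8 * p0 / p1)              # threshold of the Bayes test
alpha_exact = np.exp(-t) * (1 + t)

rng = np.random.default_rng(2)
s = rng.exponential(1.0, size=(200_000, 2)).sum(axis=1)  # X1 + X2 under H0
alpha_mc = np.mean(s > t)
print(alpha_exact, alpha_mc)
```

With equal priors, t = 2 ln 8 and the size comes out near 0.08; the Monte Carlo estimate agrees with the closed form to within simulation error.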
The Modern Bayesians
Bayesian analysis is increasingly popular thanks to
the advent of modern algorithms (Markov chain
Monte Carlo [MCMC]), computers and growing
databases. Some notable procedures are:
• Bayesian Hierarchical Models
• Empirical Bayes
• Bayesian Networks
• Naïve Bayes
• etc.
Acknowledgement
Part of this presentation was based on publicly
available resources posted online.