High Performance Computing for Science and Engineering II
Pantelis Vlachas, Computational Science and Engineering Lab, ETH Zürich

Probabilities, Bayes Rule, Markov Chain Monte Carlo


Page 1: Probabilities, Bayes Rule, Markov Chain Monte Carlo

High Performance Computing for Science and Engineering II Pantelis Vlachas

Computational Science and Engineering Lab ETH Zürich

Probabilities, Bayes Rule, Markov Chain Monte Carlo

Page 2: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Structure

Bayes Rule

Markov Chain Monte Carlo

Computing the posterior (coin toss)

Conjugate Priors

Page 3: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Structure

Bayes Rule

Markov Chain Monte Carlo

Computing the posterior (coin toss)

Conjugate Priors

Page 4: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Bayes Rule

P(A | B) = P(B | A) P(A) / P(B)

GENERAL FORM:

p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M)

MODEL ABSORBED in θ:

p(θ | D) = p(D | θ) p(θ) / p(D)

• We assume a model M(θ) (usually omitted/self-explained and absorbed by θ)
• We look for a parametrisation θ of M that “explains” the data
• D is some observed data
• p(D | θ) is the likelihood of observing the data given that we have a model of the reality
• p(θ) / p(θ | M) is the prior
• p(D) / p(D | M) is the data evidence

DATA EVIDENCE (does not depend on θ):

p(D) = ∫ p(D | θ′) p(θ′) dθ′   OR   p(D | M) = ∫ p(D | θ′, M) p(θ′ | M) dθ′
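As a minimal sketch of Bayes' rule for a discrete parameter, consider two hypothetical candidate coins, θ ∈ {0.5, 0.8}, after observing a single head (the values and prior here are made up for illustration):

```python
# Two candidate values of theta = P(head), uniform prior over them
priors = {0.5: 0.5, 0.8: 0.5}              # p(theta)
likelihood = {th: th for th in priors}     # p(D = head | theta) = theta

# Data evidence: p(D) = sum_theta p(D | theta) p(theta)
evidence = sum(likelihood[th] * priors[th] for th in priors)

# Bayes rule: p(theta | D) = p(D | theta) p(theta) / p(D)
posterior = {th: likelihood[th] * priors[th] / evidence for th in priors}

print(posterior)  # the biased coin (theta = 0.8) is now more probable
```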

Page 5: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Bayes Rule

p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M)

with:
• LIKELIHOOD ℒ = p(D | θ, M)
• PRIOR π(θ) = p(θ | M)
• POSTERIOR p(θ | D, M)
• DATA EVIDENCE p(D | M) (does not depend on θ)

Page 6: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Bayes Rule

Speagle, J. S. (2021). “A Conceptual Introduction to Markov Chain Monte Carlo Methods”, arXiv preprint arXiv:1909.12313.

THE POSTERIOR IS A COMPROMISE BETWEEN THE PRIOR AND THE DATA (LIKELIHOOD)

[Figure: prior p(θ) = π(θ), likelihood p(D | θ) = ℒ(D | θ), and the resulting posterior p(θ | D)]

Page 7: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Update Belief Based on Data

Computation of the posterior:

Experiment 1 — prior p0(θ), data x1:

p(θ | x1) = p(x1 | θ) p0(θ) / p(x1)

Today's posterior is the prior of tomorrow: p1(θ) ≜ p(θ | x1)

Experiment 2 — prior p1(θ), data x2:

p(θ | x2) = p(x2 | θ) p1(θ) / p(x2)

Repeating over the experiments yields an accurate estimate of p(θ | x1, …, xN).
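This update loop can be sketched on a discretised parameter grid (assumed setup, not from the slides: Bernoulli tosses xᵢ ∈ {0, 1} with unknown head probability θ; the grid size and data below are made up for illustration):

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)   # parameter grid for theta = P(head)
prior = np.ones_like(theta)
prior /= prior.sum()                  # p0(theta): uniform prior

data = [1, 0, 1, 1]                   # observed tosses (1 = head)
for x in data:
    like = theta if x == 1 else 1.0 - theta   # p(x | theta)
    post = like * prior
    post /= post.sum()                # normalise by the evidence p(x)
    prior = post                      # today's posterior is tomorrow's prior

print((theta * prior).sum())          # posterior mean, close to 2/3 here
```

With a uniform prior and 3 heads in 4 tosses this grid posterior approximates Beta(4, 2), whose mean is 2/3.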

Page 8: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Structure

Bayes Rule

Markov Chain Monte Carlo

Computing the posterior (coin toss)

Conjugate Priors

Page 9: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Conjugate Priors

p(θ | D) = p(D | θ) p(θ) / p(D)
         = p(D | θ) p(θ) / ∫ p(D | θ′) p(θ′) dθ′
         = p(D | θ) p(θ) / ∫ p(D, θ′) dθ′
         = (1/Z) p(D | θ) p(θ)

• Given some prior knowledge of the “data generating process” (model M, etc.), the form of the likelihood p(D | θ) is fixed and well-defined
• The choice (form) of the prior p(θ) affects both the numerator and the denominator and determines the form of the posterior p(θ | D)
• In applications, we need either to (1) have an analytic form of the posterior (resolve the normalisation Z), or (2) be able to sample from it
• For certain choices of the prior p(θ), the posterior has the same form (belongs to the same family, i.e. with different parameters):
  A. Then p(θ) is conjugate to the likelihood p(D | θ)
  B. The normal distribution is a conjugate prior to a normal likelihood
  C. Conjugate priors make the Bayesian update rule easy; otherwise numerical integration is needed

Page 10: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Conjugate Priors

Page 11: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Structure

Bayes Rule

Markov Chain Monte Carlo

Computing the posterior (coin toss)

Conjugate Priors

Page 12: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Coin Toss Experiment

• You are given a coin which is possibly counterfeit and you perform experiments by flipping it
• Repeated runs (sampling) from a Bernoulli distribution Bern(θ)
• Suppose the probability of a head toss is P(H) = θ (unknown)
• If you knew θ, what is the probability of NH head tosses in N trials (likelihood)?
  • NH head tosses, each with probability θ
  • N − NH tail tosses, each with probability (1 − θ)
  • C(N, NH) = N! / (NH! (N − NH)!): the number of permutations of the N total tosses that have NH head tosses

LIKELIHOOD (Binomial distribution):

p(NH) = C(N, NH) θ^NH (1 − θ)^(N−NH)
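The Binomial likelihood above can be evaluated directly; `math.comb` counts the permutations of the N tosses containing NH heads (the example numbers are illustrative):

```python
from math import comb

def binomial_likelihood(theta, N, NH):
    """p(NH | theta) = C(N, NH) * theta^NH * (1 - theta)^(N - NH)"""
    return comb(N, NH) * theta**NH * (1.0 - theta)**(N - NH)

# e.g. a fair coin: probability of exactly 5 heads in 10 tosses
print(binomial_likelihood(0.5, 10, 5))  # 0.24609375
```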

Page 13: Probabilities, Bayes Rule, Markov Chain Monte Carlo

How to select a prior?

• In this case: which prior is conjugate to the Binomial likelihood?
• Random variable on which the prior is defined: the parametrisation of the model, P(H) = θ
• Support? θ ∈ [0, 1]
• Initially we might assume that we do not know anything about the coin (uninformative prior)
• Uninformative prior: Uniform U[0, 1]
• This is a special case of the Beta distribution

Page 14: Probabilities, Bayes Rule, Markov Chain Monte Carlo

An Informative Prior

• Suppose that we do have information about the coin: we know that most probably it is a fair coin (why shouldn't it be?)
• We want to incorporate this information into the prior belief
• Selection of a prior belief p(θ) peaked around θ = P(H) = 0.5
• The Beta distribution with α = β is flexible enough to allow this!
• The magnitude of the shape parameters controls our confidence

Page 15: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Conjugate to Binomial Likelihood

• How to choose the prior? Support: P(H) = θ ∈ [0, 1]
• Prior for θ selected as the Beta distribution:

p(θ) ≜ Beta(θ; α, β) = θ^(α−1) (1 − θ)^(β−1) / B(α, β)

• The Beta function is defined as

B(x, y) = ∫₀¹ t^(x−1) (1 − t)^(y−1) dt

• The Beta distribution is a distribution over the parametrisation of another distribution: how likely the random variable q ≜ P(H) (a probability) takes a value in [0, 1]. It is parametrised by α, β (shape parameters).
• Assume that you conduct the experiment and get NH head tosses and NT = N − NH tail tosses.

LIKELIHOOD (Binomial distribution): p(N, NH | θ = x) = C(N, NH) x^NH (1 − x)^(N−NH)

PRIOR (Beta): p(θ = x) ≜ x^(α−1) (1 − x)^(β−1) / B(α, β)
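The Beta prior can be evaluated from the identity B(a, b) = Γ(a) Γ(b) / Γ(a + b) (a standard fact, not stated on the slide); the function names below are ours:

```python
from math import gamma

def beta_fn(a, b):
    """Beta function via B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_pdf(x, a, b):
    """Beta density p(theta = x) = x^(a-1) (1-x)^(b-1) / B(a, b)."""
    return x**(a - 1) * (1.0 - x)**(b - 1) / beta_fn(a, b)

# alpha = beta = 1 recovers the uniform prior U[0, 1]
print(beta_pdf(0.3, 1, 1))  # 1.0
```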

Page 16: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Conjugate to Binomial Likelihood

POSTERIOR (Bayes rule):

p(θ = x | N, NH) = p(N, NH | θ = x) p(θ = x) / p(N, NH)

With the likelihood (Binomial) p(N, NH | θ = x) = C(N, NH) x^NH (1 − x)^(N−NH), the prior (Beta) p(θ = x) = x^(α−1) (1 − x)^(β−1) / B(α, β), and the Beta function B(x, y) = ∫₀¹ t^(x−1) (1 − t)^(y−1) dt:

p(θ = x | N, NH)
= p(N, NH | θ = x) p(θ = x) / ∫₀¹ p(N, NH | θ = y) p(θ = y) dy
= [ C(N, NH) x^NH (1 − x)^(N−NH) · x^(α−1) (1 − x)^(β−1) / B(α, β) ] / [ ∫₀¹ C(N, NH) y^NH (1 − y)^(N−NH) · y^(α−1) (1 − y)^(β−1) / B(α, β) dy ]
= x^(NH+α−1) (1 − x)^(N−NH+β−1) / ∫₀¹ y^(NH+α−1) (1 − y)^(N−NH+β−1) dy
= x^(NH+α−1) (1 − x)^(N−NH+β−1) / B(α + NH, β + N − NH)
= Beta(θ = x; α + NH, β + N − NH)

Page 17: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Conjugate to Binomial Likelihood - BETA

Posterior ∝ Prior × Likelihood

p(θ | N, NH) ∝ p(θ) × p(NH, N | θ)
p(θ | N, NH) ∝ Beta(α, β) × Binomial(N, NH)
p(θ | N, NH) = Beta(α + NH, β + N − NH)

The BETA distribution is a conjugate distribution to the Binomial likelihood!
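The conjugate update collapses to simple parameter arithmetic; a minimal sketch (the example prior and data are made up):

```python
def update_beta(alpha, beta, N, NH):
    """Beta(alpha, beta) prior + NH heads in N tosses
    -> Beta(alpha + NH, beta + N - NH) posterior."""
    return alpha + NH, beta + N - NH

# uniform prior Beta(1, 1), then 7 heads observed in 10 tosses
a, b = update_beta(1, 1, 10, 7)
print(a, b)           # 8 4
print(a / (a + b))    # posterior mean of Beta(8, 4), i.e. 2/3
```

No numerical integration is needed: the evidence is absorbed into the known normalisation B(α + NH, β + N − NH).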

Page 18: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Coding …

Page 19: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Structure

Bayes Rule

Markov Chain Monte Carlo

Computing the posterior (coin toss)

Conjugate Priors

Page 20: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Markov Chain Monte Carlo

p(θ | D, M) = ℒ(D | θ, M) π(θ | M) / p(D | M)

with LIKELIHOOD ℒ = p(D | θ, M), PRIOR π(θ) = p(θ | M), POSTERIOR p(θ | D, M), and DATA EVIDENCE Z = p(D | M) (does not depend on θ).

In practice:
• Conjugate priors exist only for simple/academic examples
• In MCMC we estimate/sample from the posterior without the normalization factor Z = p(D | M)
• Very important factor: SELECTION OF THE PRIOR (prior knowledge, selection of the distribution, range, many issues, “informative priors”)
• Numerical estimation of model parameters and their uncertainty
• Calculate high-dimensional integrals over complex surfaces
• e.g. a particle moving in a potential V(x): the probability of a location is p(x) ∝ exp(−V(x)), the normalisation constant is difficult to evaluate, and the goal is to calculate physical quantities (mean position, etc.) by integrating ∫ f(x) p(x) dx. How? Use the simulated values (Markov chain) for posterior analysis.
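The point of posterior analysis from simulated values: given samples xᵢ ~ p(x), the integral ∫ f(x) p(x) dx is estimated by the sample mean of f(xᵢ). As an assumed example (not from the slides), take V(x) = x²/2, so p(x) ∝ exp(−V(x)) is the standard normal, which we can sample directly here to illustrate the estimator:

```python
import random

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# f(x) = x: mean position;  f(x) = x^2: second moment
mean_pos = sum(samples) / len(samples)
mean_sq = sum(x * x for x in samples) / len(samples)

print(mean_pos, mean_sq)  # close to 0 and 1
```

When p(x) cannot be sampled directly, the samples come from a Markov chain instead, which is exactly what the next slides build.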

Page 21: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Markov Chain Monte Carlo

GIVEN:
• MODEL M(θ)
• LIKELIHOOD ℒ(D | θ)
• PRIOR p(θ) = π(θ)
• We can evaluate, up to the evidence (Bayes rule):

p(θ | D, M) ∝ ℒ(D | θ, M) π(θ | M)

The evidence does not matter!

• Initial guess θ⁰ = (θ₁⁰, θ₂⁰) drawn from the prior π: low likelihood, low posterior P⁰ = p(θ⁰ | D)
• The true answer lies somewhere in the high-probability region of p(θ | D)

METROPOLIS SAMPLING:
• Sample a candidate θ⋆ from a proposal distribution p(θᵏ⁺¹ | θᵏ) and evaluate P⋆ = p(θ⋆ | D)
• If P⋆ > Pᵏ, accept the jump: θᵏ⁺¹ = θ⋆
• If P⋆ ≤ Pᵏ, accept with probability P⋆ / Pᵏ
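The Metropolis steps above can be sketched as follows. Assumptions not fixed by the slide: a 1-D parameter, a Gaussian random-walk proposal, and a hypothetical unnormalised target `p_tilde` (here exp(−x²/2), the standard normal up to its evidence):

```python
import math
import random

def metropolis(p_tilde, theta0, n_steps, step=0.5, seed=0):
    """Metropolis sampling from an unnormalised density p_tilde."""
    rng = random.Random(seed)
    chain = [theta0]
    p_cur = p_tilde(theta0)
    for _ in range(n_steps):
        # sample a candidate from the random-walk proposal
        theta_star = chain[-1] + rng.gauss(0.0, step)
        p_star = p_tilde(theta_star)
        # if P* > Pk accept; if P* <= Pk accept with probability P*/Pk
        if p_star > p_cur or rng.random() < p_star / p_cur:
            chain.append(theta_star)
            p_cur = p_star
        else:
            chain.append(chain[-1])
    return chain

# target known only up to the evidence Z
chain = metropolis(lambda x: math.exp(-x * x / 2), theta0=3.0, n_steps=20_000)
burned = chain[2_000:]                # remove the initial burn-in steps
print(sum(burned) / len(burned))      # sample mean, close to 0
```

Note that only ratios P⋆/Pᵏ are ever evaluated, so the normalisation (evidence) indeed does not matter.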

Page 22: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Markov Chain Monte Carlo

(Same setup and Metropolis sampling steps as the previous slide.)

1. The endless jumps form a chain

Page 23: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Markov Chain Monte Carlo

(Same setup and Metropolis sampling steps as before.)

2. Initial burn-in steps should be removed

Page 24: Probabilities, Bayes Rule, Markov Chain Monte Carlo

Markov Chain Monte Carlo

(Same setup and Metropolis sampling steps as before.)

Discussed in more detail in the LECTURE:
• Elaborate explanation
• Form of the proposal distribution?
• More sophisticated algorithms?
• Convergence?