Chapter 1: Bayesian Basics


Conchi Ausín and Mike Wiper

Department of Statistics

Universidad Carlos III de Madrid


Objective

In this chapter, we introduce the basic theory and properties of Bayesian statistics, and some problems and advantages in the big data setting.


Probability Rules I: Partitions

Events B1, ..., Bk form a partition if Bi ∩ Bj = ∅ for all i ≠ j and ∪_{i=1}^{k} Bi = Ω. Then for any event A,

P(A) = P(A ∩ B1) + P(A ∩ B2) + ... + P(A ∩ Bk).


Probability Rules II: Conditional Probability

For two events A and B, the conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B).

The multiplication law is P(A ∩ B) = P(A|B)P(B).

A and B are independent if P(A ∩ B) = P(A)P(B), or equivalently P(A|B) = P(A) or P(B|A) = P(B).


Probability Rules III: Total Probability

Given a partition B1, ..., Bk, for any event A,

P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + ...+ P(A|Bk)P(Bk).

For continuous variables, f(x) = ∫ f(x|y) f(y) dy.


Probability Rules IV: Bayes Theorem

For two events, Bayes theorem states that P(B|A) = P(A|B)P(B) / P(A). More generally, if B1, ..., Bk form a partition, we can write Bayes theorem as:

P(Bi|A) = P(A|Bi)P(Bi) / Σ_{j=1}^{k} P(A|Bj)P(Bj).


Example: The Monty Hall problem

A prize is hidden behind one of three doors, with goats behind the other two; the player picks a door, and the host then opens one of the remaining doors to reveal a goat and offers the chance to switch. Should you change doors?

Implicit assumption:

the host always opens a different door from the door chosen by the player and always reveals a goat by this action because he knows where the car is hidden.


Solution using Bayes Theorem

Suppose without loss of generality that the player chooses door 1 and that the host opens door 2 to reveal a goat.

Let A (B, C) be the event that the prize is behind door 1 (2, 3).

P(A) = P(B) = P(C) = 1/3.

P(opens 2|A) = 1/2, P(opens 2|B) = 0, P(opens 2|C) = 1.

P(opens 2) = P(opens 2|A)P(A) + P(opens 2|B)P(B) + P(opens 2|C)P(C)
           = 1/2 × 1/3 + 0 × 1/3 + 1 × 1/3 = 1/2

P(A|opens 2) = P(opens 2|A)P(A) / P(opens 2) = (1/2 × 1/3) / (1/2) = 1/3

so P(C|opens 2) = 2/3 and it is better to switch.

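The calculation above can also be checked by simulation. Below is a minimal Python sketch (not part of the original slides; plain Python, no extra libraries assumed) that plays the game repeatedly under the stated assumptions about the host and estimates the win probability for each strategy.

```python
import random

def play(switch, n_trials=100_000, seed=0):
    """Estimate the win probability for a fixed strategy (switch or stay)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        prize = rng.randrange(3)           # door hiding the car
        choice = rng.randrange(3)          # player's initial pick
        # Host opens a door that is neither the player's pick nor the prize,
        # choosing at random when two such doors are available.
        opened = next(d for d in rng.sample(range(3), 3)
                      if d != choice and d != prize)
        if switch:
            # Switch to the single remaining unopened door.
            choice = next(d for d in range(3) if d not in (choice, opened))
        wins += (choice == prize)
    return wins / n_trials

print("P(win | stay)   ≈", play(switch=False))   # ≈ 1/3
print("P(win | switch) ≈", play(switch=True))    # ≈ 2/3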

Statistical Inference

Given data, x = (x1, ..., xn), we typically wish to make inference about some model parameter, θ, or predictions of future observations. There are two common statistical approaches: classical and Bayesian inference.

Classical Inference

Frequentist interpretation of probability.

Inference is based on the likelihood function:

l(θ|x) = f(x|θ).

θ is fixed. All uncertainty about X is quantified a priori.

Inferential procedures based on asymptotic performance.

Prediction is often carried out by substituting an estimator, θ̂, for θ:

P(Y < y|x) ≈ P(Y < y|x, θ̂).


Example: a coin tossing experiment

You have a coin with P(head) = θ. Suppose you decide to toss the coin 12 times and observe 9 heads and 3 tails.

The maximum likelihood estimate for θ is θ̂ = 9/12.

An (approximate) 95% confidence interval for θ is (0.505, 0.995).

The p-value for the test H0: θ = 0.5 vs H1: θ > 0.5 is:

p = Σ_{i=9}^{12} P(i heads in 12 tosses | θ = 0.5) ≈ 0.073

and the null hypothesis is not rejected at a 5% significance level.

With the alternative experiment of tossing the coin until the third tail is observed, if this occurs on the 12th toss, then p = 0.0325 and H0 is rejected!

The plug-in predictive distribution for the number of heads in 10 more coin tosses is Binomial(10, 0.75).

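As a numerical check of the figures quoted above, here is a short Python sketch (not from the slides; it assumes numpy and scipy are available) reproducing the MLE, the exact binomial p-value and a Wald-type approximate 95% confidence interval.

```python
import numpy as np
from scipy import stats

n, heads = 12, 9
theta_hat = heads / n                          # MLE: 9/12 = 0.75

# Exact binomial p-value for H0: theta = 0.5 vs H1: theta > 0.5
p_value = stats.binom.sf(heads - 1, n, 0.5)    # P(X >= 9 | theta = 0.5) ≈ 0.073

# Approximate (Wald) 95% confidence interval
se = np.sqrt(theta_hat * (1 - theta_hat) / n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)   # ≈ (0.505, 0.995)

# Plug-in predictive distribution for 10 further tosses: Binomial(10, 0.75)
plug_in = stats.binom(10, theta_hat)
print(theta_hat, p_value, ci)
print(plug_in.pmf(np.arange(11)))
```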

Bayesian Inference

Subjective interpretation of probability.

Inference is based on the likelihood function ... and

θ is treated as a random variable, with a prior distribution.

Inference is carried out via Bayes theorem and probability formulae.

Prediction is inherent to the procedure.


Bayesian inference: probability and the prior distribution

We have a prior distribution for θ which reflects personal subjective knowledge, previous data, ...

Different people can have different priors.

The only restriction is coherence.

Example

What do we know about the coin?

Coins typically have two faces with θ = P(head) ≈ 0.5.

0 ≤ θ ≤ 1.

Consider a prior distribution for θ centred at 0.5 but allowing for coin bias.


The beta prior distribution

θ has a beta distribution with parameters a, b > 0 if

f(θ) = θ^(a−1) (1 − θ)^(b−1) / B(a, b), for 0 < θ < 1.

[Figure: beta density plotted against θ; vertical axis f.]

The mean is E[θ] = a / (a + b).

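The beta family is available in standard statistical libraries. A minimal sketch (assuming scipy and matplotlib, not part of the slides) that evaluates a few beta densities of the kind shown in the omitted figure and checks the mean formula a/(a+b):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

theta = np.linspace(0.001, 0.999, 500)
for a, b in [(1, 1), (5, 5), (2, 8), (20, 20)]:   # illustrative choices of (a, b)
    plt.plot(theta, stats.beta.pdf(theta, a, b), label=f"Beta({a},{b})")
    print(f"Beta({a},{b}): mean = {stats.beta.mean(a, b):.3f} = {a / (a + b):.3f}")

plt.xlabel("theta"); plt.ylabel("f(theta)"); plt.legend(); plt.show()
```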

Bayesian inference: updating

When data are observed, beliefs are updated via Bayes theorem:

f(θ|x) = f(x|θ) f(θ) / f(x) = l(θ|x) f(θ) / f(x) ∝ l(θ|x) f(θ)

because the denominator is independent of θ.

We can remember this as:

posterior ∝ likelihood × prior.


Example

Suppose we use a Beta(5,5) prior distribution for θ:

f(θ) = θ^(5−1) (1 − θ)^(5−1) / B(5, 5).

Then the posterior distribution is:

f(θ|x) ∝ (12 choose 9) θ^9 (1 − θ)^3 × θ^(5−1) (1 − θ)^(5−1) / B(5, 5)
       ∝ θ^(14−1) (1 − θ)^(8−1)

What distribution is this? Normalizing,

f(θ|x) = θ^(14−1) (1 − θ)^(8−1) / B(14, 8).

Another beta distribution: θ|x ∼ Beta(14, 8), and P(θ > 0.5|x) ≈ 0.905.

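A quick numerical check of this example (a scipy-based sketch, not part of the slides): the conjugate update simply adds the head and tail counts to the prior parameters, and the posterior probability that the coin favours heads is about 0.905.

```python
from scipy import stats

a0, b0 = 5, 5            # Beta(5, 5) prior
heads, tails = 9, 3      # observed data
a1, b1 = a0 + heads, b0 + tails          # conjugate update: Beta(14, 8)

posterior = stats.beta(a1, b1)
print("P(theta > 0.5 | x) =", posterior.sf(0.5))   # ≈ 0.905
```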

Bayesian inference: the posterior as an average

The posterior density combines information from both prior and likelihood.

Example

The plot shows the prior density (dotted), scaled likelihood (dashed) and posterior density (solid).


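The omitted figure can be recreated along the following lines (a sketch assuming numpy, scipy and matplotlib; the likelihood is rescaled only so that the three curves are comparable on one plot).

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

theta = np.linspace(0.001, 0.999, 500)
prior = stats.beta.pdf(theta, 5, 5)                       # Beta(5, 5) prior
lik = theta**9 * (1 - theta)**3                           # likelihood for 9 heads, 3 tails
lik_scaled = lik / (lik.sum() * (theta[1] - theta[0]))    # rough normalisation for display
posterior = stats.beta.pdf(theta, 14, 8)                  # Beta(14, 8) posterior

plt.plot(theta, prior, ":", label="prior")
plt.plot(theta, lik_scaled, "--", label="scaled likelihood")
plt.plot(theta, posterior, "-", label="posterior")
plt.xlabel("theta"); plt.ylabel("f"); plt.legend(); plt.show()
```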

Bayesian inference: point and interval estimation

For point estimates we could use the posterior mean, median or mode, for example.

The classical MLE is θ̂ = 9/12 = 0.75 and the Bayesian posterior mean is E[θ|x] = 14/22 ≈ 0.636.

We have a weighted average:

14/22 = (10/22) × (1/2) + (12/22) × (9/12)

E[θ|x] = w E[θ] + (1 − w) θ̂, where w = 10/22.

For interval estimates we can use a credible interval, i.e. an interval [θ_L, θ_U] such that P(θ_L < θ < θ_U | x) = 0.95.

The shortest such interval is a highest posterior density (hpd) interval.

A classical 95% confidence interval is (0.505, 0.995) and the 95% posterior credible interval is (0.430, 0.819). How do we interpret the two intervals?

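The weighted-average identity and the credible interval quoted above are easy to verify numerically. The sketch below (scipy assumed, not part of the slides) uses the equal-tailed 95% interval, which for this posterior is very close to, but not identical with, the hpd interval.

```python
from scipy import stats

a0, b0, heads, n = 5, 5, 9, 12
a1, b1 = a0 + heads, b0 + (n - heads)          # posterior Beta(14, 8)

post_mean = a1 / (a1 + b1)                     # 14/22 ≈ 0.636
w = (a0 + b0) / (a0 + b0 + n)                  # prior weight 10/22
weighted = w * a0 / (a0 + b0) + (1 - w) * heads / n
print(post_mean, weighted)                     # both ≈ 0.636

# Equal-tailed 95% posterior credible interval
print(stats.beta.interval(0.95, a1, b1))       # ≈ (0.43, 0.82)
```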

Bayesian inference: prediction

Suppose that we wish to predict future observations, say Y. Then

f(y|x) = ∫ f(y|x, θ) f(θ|x) dθ = ∫ f(y|θ) f(θ|x) dθ

in the case of conditionally i.i.d. (exchangeable) variables.

Example

Let’s try to predict the number of heads, Y , in 10 further throws of the coin.

We know that Y |θ ∼ Binomial(10, θ), independent of the previous tosses.

P(Y = y|x) = ∫_0^1 P(Y = y|θ) f(θ|x) dθ = ... = (10 choose y) B(14 + y, 18 − y) / B(14, 8)

for y = 0, 1, ..., 10.

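The beta-binomial predictive probabilities can be evaluated directly from the formula above. The following sketch (numpy and scipy assumed, not part of the slides) also compares them with the classical plug-in Binomial(10, 0.75) probabilities.

```python
import numpy as np
from scipy import stats
from scipy.special import comb, betaln

a1, b1, m = 14, 8, 10                 # posterior Beta(14, 8), m future tosses
y = np.arange(m + 1)

# Bayesian (beta-binomial) predictive: C(10, y) B(14 + y, 18 - y) / B(14, 8)
log_pred = np.log(comb(m, y)) + betaln(a1 + y, b1 + m - y) - betaln(a1, b1)
bayes_pred = np.exp(log_pred)
# scipy.stats.betabinom(m, a1, b1).pmf(y) should give the same values (scipy >= 1.4)

plug_in = stats.binom.pmf(y, m, 0.75)  # classical plug-in predictive

for yi, bp, pp in zip(y, bayes_pred, plug_in):
    print(f"y={yi:2d}  Bayes={bp:.3f}  plug-in={pp:.3f}")
print(bayes_pred.sum())                # should be ≈ 1
```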

Predictive distributions

The plot shows the classical "plug-in" and Bayesian predictive distributions.



Bayesian inference is sequential!

We start with a prior, f (θ).

Given data x we update by Bayes theorem to get the posterior f (θ|x).

Now this is our new prior and ...

Given more data, y, we update again to get f (θ|x, y).

In principle this is a big advantage in big data settings, allowing parallelization, etc.

Example

If we observe Y = 6, then

f(θ|x, y) ∝ (10 choose 6) θ^6 (1 − θ)^4 × θ^(14−1) (1 − θ)^(8−1) / B(14, 8)
          ∝ θ^(20−1) (1 − θ)^(12−1)

θ|x, y ∼ Beta(20, 12)

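With a conjugate beta prior, sequential updating just accumulates the head and tail counts. A minimal sketch (plain Python, not part of the slides; the 6 heads and 4 tails in the second batch follow the example above) confirming that batch-by-batch and all-at-once updating give the same Beta(20, 12) posterior:

```python
def update(a, b, heads, tails):
    """One conjugate Bayes update of a Beta(a, b) prior with binomial data."""
    return a + heads, b + tails

a, b = 5, 5                      # prior Beta(5, 5)
a, b = update(a, b, 9, 3)        # first batch: 9 heads, 3 tails  -> Beta(14, 8)
a, b = update(a, b, 6, 4)        # second batch: 6 heads, 4 tails -> Beta(20, 12)
print(a, b)                      # 20 12

# Same result processing all the data at once
print(update(5, 5, 9 + 6, 3 + 4))   # (20, 12)
```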

Summary and next chapter

We have seen an outline of the basic ideas behind Bayesian statistics and illustrated some of them with a coin tossing example.

We have seen that a Beta(5, 5) prior led to a beta posterior.

Would this be the case with another beta prior?

Are there other situations where we can use “nice” priors?

What if we used a different type of prior? Would this be a problem in a big data setting? If so, what can we do?

We’ll see the solutions to these questions in the following classes.
