
Chapter 1: Bayesian Basics

Conchi Ausín and Mike Wiper

Department of Statistics

Universidad Carlos III de Madrid

Bayesian Inference, ASDM 2018


Objective

In this chapter, we introduce the basic theory and properties of Bayesian statistics and some problems and advantages in the big data setting.


Probability Rules I: Partitions

Events B1, ..., Bk form a partition if Bi ∩ Bj = ∅ for all i ≠ j and B1 ∪ ... ∪ Bk = Ω. Then, for any event A,

P(A) = P(A ∩ B1) + P(A ∩ B2) + ... + P(A ∩ Bk).


Probability Rules II: Conditional Probability

For two events A and B, the conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B).

The multiplication law is P(A ∩ B) = P(A|B)P(B).

A and B are independent if P(A ∩ B) = P(A)P(B), or equivalently, P(A|B) = P(A) or P(B|A) = P(B).


Probability Rules III: Total Probability

Given a partition B1, ..., Bk, then for any event A,

P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + ...+ P(A|Bk)P(Bk).

For continuous variables, f(x) = ∫ f(x|y) f(y) dy.


Probability Rules IV: Bayes Theorem

For two events, Bayes theorem states that P(B|A) = P(A|B)P(B)/P(A). More generally, if B1, ..., Bk form a partition, we can write Bayes theorem as:

P(Bi|A) = P(A|Bi)P(Bi) / [P(A|B1)P(B1) + ... + P(A|Bk)P(Bk)].
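
The partition form is straightforward to compute numerically. Below is a minimal Python sketch; the machines example and its numbers are purely illustrative, not from the slides:

```python
# Bayes theorem over a partition: posterior P(B_i|A) from priors P(B_i)
# and likelihoods P(A|B_i).

def posterior(priors, likelihoods):
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)                # P(A), by the law of total probability
    return [j / total for j in joint]

# Hypothetical example: three machines produce 50%, 30% and 20% of output,
# with defect rates 1%, 2% and 3%. Given a defective item (event A),
# which machine (B_i) most likely produced it?
print(posterior([0.5, 0.3, 0.2], [0.01, 0.02, 0.03]))
# -> [0.294..., 0.352..., 0.352...]
```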


Example: The Monty Hall problem

Should you change doors?

Implicit assumption:

the host always opens a different door from the door chosen by the player, and always reveals a goat by doing so because he knows where the car is hidden.


Solution using Bayes Theorem

Suppose, without loss of generality, that the player chooses door 1 and that the host opens door 2 to reveal a goat.

Let A (respectively B, C) be the event that the prize is behind door 1 (respectively 2, 3).

P(A) = P(B) = P(C) = 1/3.

P(opens 2|A) = 1/2, P(opens 2|B) = 0, P(opens 2|C) = 1.

P(opens 2) = P(opens 2|A)P(A) + P(opens 2|B)P(B) + P(opens 2|C)P(C)
= 1/2 × 1/3 + 0 × 1/3 + 1 × 1/3 = 1/2.

P(A|opens 2) = P(opens 2|A)P(A) / P(opens 2) = (1/2 × 1/3) / (1/2) = 1/3,

so P(C|opens 2) = 2/3 and it is better to switch.
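
The answer is easy to check by simulation; here is a Monte Carlo sketch in Python, assuming exactly the host behaviour described above:

```python
import random

def play(switch, trials=100_000):
    """Simulate Monty Hall games and return the win frequency."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        choice = 0                       # wlog the player picks door 0
        # host opens a goat door different from the player's choice
        opened = random.choice([d for d in range(3)
                                if d != choice and d != car])
        if switch:                       # move to the remaining closed door
            choice = next(d for d in range(3)
                          if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print("stay:  ", play(switch=False))     # ~ 1/3
print("switch:", play(switch=True))      # ~ 2/3
```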


Statistical Inference

Given data, x = (x1, ..., xn), we typically wish to make inference about some model parameter, θ, or predictions of future observations. There are two common statistical approaches: classical and Bayesian inference.

Classical Inference

Frequentist interpretation of probability.

Inference is based on the likelihood function:

l(θ|x) = f(x|θ).

θ is fixed. All uncertainty about X is quantified a priori.

Inferential procedures are based on asymptotic performance.

Prediction is often carried out by substituting an estimator θ̂ for θ:

P(Y < y|x) ≈ P(Y < y|x, θ̂).


Example: a coin tossing experiment

You have a coin with P(head) = θ. Suppose you decide to toss the coin 12 times and observe 9 heads and 3 tails.

The maximum likelihood estimate for θ is θ̂ = 9/12.

An (approximate) 95% confidence interval for θ is (0.505, 0.995).

The p-value for the test H0 : θ = 0.5 vs H1 : θ > 0.5 is:

p = Σ (i = 9 to 12) P(i heads in 12 tosses | θ = 0.5) ≈ 0.073

and the null hypothesis is not rejected at a 5% significance level.

With the alternative experiment of tossing the coin until the third tail is observed, if this occurs on the 12th toss, then p ≈ 0.0325 and H0 is rejected!

The plug-in predictive distribution for the number of heads in 10 more coin tosses is Binomial(10, 0.75).
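
These numbers can be reproduced with scipy.stats; a sketch (the two stopping rules give different p-values for the same data, which is the point of the example):

```python
from scipy.stats import binom

n, heads = 12, 9
theta_hat = heads / n                        # MLE = 0.75

# Binomial experiment: p-value = P(X >= 9) for X ~ Binomial(12, 0.5)
p_binomial = binom.sf(heads - 1, n, 0.5)     # ~ 0.073

# Negative binomial experiment (toss until the 3rd tail, seen on toss 12):
# p-value = P(3rd tail needs >= 12 tosses) = P(at most 2 tails in 11 tosses)
p_negbinomial = binom.cdf(2, 11, 0.5)        # ~ 0.033

print(theta_hat, p_binomial, p_negbinomial)
```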


Bayesian Inference

Subjective interpretation of probability.

Inference is based on the likelihood function ... and

θ is treated as a random variable, with a prior distribution.

Inference is carried out via Bayes theorem and probability formulae.

Prediction is inherent to the procedure.


Bayesian inference: probability and the prior distribution

We have a prior distribution for θ which reflects personal subjective knowledge, previous data, ...

Different people can have different priors.

The only restriction is coherence.

Example

What do we know about the coin?

Coins typically have two faces with θ = P(head) ≈ 0.5.

0 ≤ θ ≤ 1.

Consider a prior distribution for θ centred at 0.5 but allowing for coin bias.


The beta prior distribution

θ has a beta distribution with parameters a, b > 0 if

f(θ) = θ^(a−1) (1 − θ)^(b−1) / B(a, b), for 0 < θ < 1.

[Figure: beta density functions f(θ) for various values of a and b, plotted against θ ∈ (0, 1).]

The mean is E[θ] = a/(a + b).
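
A short sketch of this prior using scipy.stats.beta, with the symmetric Beta(5, 5) that is used for the coin below:

```python
from scipy.stats import beta

a, b = 5, 5                    # symmetric prior, centred at 0.5
prior = beta(a, b)

print(prior.mean())            # a / (a + b) = 0.5
print(prior.std())             # prior spread
print(prior.interval(0.95))    # central 95% prior interval, ~ (0.21, 0.79)
```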


Bayesian inference: updating

When data are observed, beliefs are updated via Bayes theorem:

f(θ|x) = f(x|θ) f(θ) / f(x) = l(θ|x) f(θ) / f(x) ∝ l(θ|x) f(θ)

because the denominator is independent of θ.

We can remember this as:

posterior ∝ likelihood × prior.
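
Since only proportionality in θ matters, the posterior can always be normalized numerically when no closed form is at hand. A minimal grid sketch, using the coin data (9 heads, 3 tails) and the Beta(5, 5) prior from the example that follows:

```python
import numpy as np

# Grid approximation of posterior ∝ likelihood × prior for the coin example.
theta = np.linspace(0.001, 0.999, 999)       # grid over (0, 1)
prior = theta**4 * (1 - theta)**4            # Beta(5, 5) kernel, unnormalized
likelihood = theta**9 * (1 - theta)**3       # 9 heads, 3 tails
weights = prior * likelihood
weights = weights / weights.sum()            # normalize: f(x) drops out

print((theta * weights).sum())               # posterior mean, ~ 14/22 = 0.636
```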


Example

Suppose we use a Beta(5,5) prior distribution for θ:

f(θ) = θ^(5−1) (1 − θ)^(5−1) / B(5, 5).

Then the posterior distribution is:

f(θ|x) ∝ C(12, 9) θ^9 (1 − θ)^3 × θ^(5−1) (1 − θ)^(5−1) / B(5, 5)
∝ θ^(14−1) (1 − θ)^(8−1),

where C(12, 9) is the binomial coefficient. What distribution is this?

f(θ|x) = θ^(14−1) (1 − θ)^(8−1) / B(14, 8).

Another beta distribution: θ|x ∼ Beta(14, 8).

P(θ > 0.5|x) ≈ 0.905.
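
The conjugate update and the quoted posterior probability are one-liners with scipy.stats; a quick check:

```python
from scipy.stats import beta

a0, b0, heads, tails = 5, 5, 9, 3
post = beta(a0 + heads, b0 + tails)      # Beta(14, 8) posterior

print(post.mean())                       # 14/22 ~ 0.636
print(post.sf(0.5))                      # P(theta > 0.5 | x) ~ 0.905
```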


Bayesian inference: the posterior as an average

The posterior density combines information from both prior and likelihood.

Example

The plot shows the prior density (dotted), scaled likelihood (dashed) and posterior density (solid).

[Figure: prior (dotted), scaled likelihood (dashed) and posterior (solid) densities plotted against θ.]


Bayesian inference: point and interval estimation

For point estimates we could use the posterior mean, median or mode, for example.

The classical MLE is θ̂ = 9/12 = 0.75 and the Bayesian posterior mean is E[θ|x] = 14/22 ≈ 0.636.

We have a weighted average:

14/22 = 10/22 × 1/2 + 12/22 × 9/12

E[θ|x] = w E[θ] + (1 − w) θ̂, where w = 10/22.

For interval estimates we can use a credible interval, i.e. an interval [θₗ, θᵤ] such that P(θₗ < θ < θᵤ | x) = 0.95.

The shortest such interval is a highest posterior density (hpd) interval.

A classical 95% confidence interval is (0.505, 0.995) and the posterior credible interval is (0.430, 0.819).

How do we interpret the two intervals?
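
A sketch computing these summaries from the Beta(14, 8) posterior with scipy.stats; the interval below is the equal-tailed one, which here matches the quoted (0.430, 0.819):

```python
from scipy.stats import beta

post = beta(14, 8)                     # posterior from the coin example

# Point estimates
print(post.mean())                     # 14/22 ~ 0.636
print(post.median())                   # posterior median
print((14 - 1) / (14 + 8 - 2))         # posterior mode = 0.65

# Equal-tailed 95% credible interval, ~ (0.430, 0.819)
print(post.interval(0.95))

# Posterior mean as a weighted average of prior mean and MLE
w = 10 / 22                            # weight on the prior mean
print(w * 0.5 + (1 - w) * 9 / 12)      # = 14/22
```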


Bayesian inference: prediction

Suppose that we wish to predict future observations, say Y. Then

f(y|x) = ∫ f(y|x, θ) f(θ|x) dθ = ∫ f(y|θ) f(θ|x) dθ

in the case of conditionally i.i.d. (exchangeable) variables.

Example

Let’s try to predict the number of heads, Y, in 10 further throws of the coin.

We know that Y |θ ∼ Binomial(10, θ), independent of the previous tosses.

P(Y = y|x) = ∫₀¹ P(Y = y|θ) f(θ|x) dθ
= ...
= C(10, y) B(14 + y, 18 − y) / B(14, 8)

for y = 0, 1, ..., 10.
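
This beta-binomial predictive is available directly in scipy.stats; a sketch comparing it with the classical plug-in Binomial(10, 0.75):

```python
from scipy.stats import betabinom, binom

bayes = betabinom(10, 14, 8)     # Bayesian predictive: Beta-Binomial(10, 14, 8)
plugin = binom(10, 0.75)         # classical plug-in predictive

for y in range(11):
    print(y, round(bayes.pmf(y), 4), round(plugin.pmf(y), 4))

# The Bayesian predictive is centred lower (10 x 14/22 ~ 6.36 vs 7.5) and is
# more spread out, since it averages over the remaining uncertainty in theta.
print(bayes.mean(), plugin.mean())
print(bayes.std(), plugin.std())
```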


Predictive distributions

The plot shows the classical “plug-in” and Bayesian predictive distributions.

[Figure: the two predictive probability mass functions P(Y = y) for y = 0, ..., 10.]


Bayesian inference is sequential!

We start with a prior, f (θ).

Given data x we update by Bayes theorem to get the posterior f (θ|x).

Now this is our new prior and ...

Given more data, y, we update again to get f (θ|x, y).

In principle this is a big advantage in big data settings, allowing parallelization, etc.

Example

If we observe Y = 6, then

f(θ|x, y) ∝ C(10, 6) θ^6 (1 − θ)^4 × θ^(14−1) (1 − θ)^(8−1) / B(14, 8)
∝ θ^(20−1) (1 − θ)^(12−1)

θ|x, y ∼ Beta(20, 12).
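
A sketch of the sequential update in Python; note that processing all 22 tosses in one batch gives the same Beta(20, 12) posterior:

```python
from scipy.stats import beta

def update(a, b, heads, tails):
    # Conjugate update: yesterday's posterior is today's prior
    return a + heads, b + tails

a, b = 5, 5                   # prior: Beta(5, 5)
a, b = update(a, b, 9, 3)     # first sample -> Beta(14, 8)
a, b = update(a, b, 6, 4)     # Y = 6 heads in 10 new tosses -> Beta(20, 12)

print(a, b, beta(a, b).mean())   # 20 12 0.625
```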


Summary and next chapter

We have seen an outline of the basic ideas behind Bayesian statistics and illustrated some of the ideas with a coin tossing example.

We have seen that a Beta(5, 5) prior implied a beta posterior.

Would this be the case with another beta prior?

Are there other situations where we can use “nice” priors?

What if we used a different type of prior? Would this be a problem in a big data setting? If so, what can we do?

We’ll see the solutions to these questions in the following classes.
