Chapter 1: Bayesian Basics


Conchi Ausín and Mike Wiper

Department of Statistics

Universidad Carlos III de Madrid


Objective

In this chapter, we introduce the basic theory and properties of Bayesian statistics, and some problems and advantages in the big data setting.


Probability Rules I: Partitions

Events B1, ..., Bk form a partition if Bi ∩ Bj = ∅ for all i ≠ j and ∪_{i=1}^{k} Bi = Ω. Then for any event A,

P(A) = P(A ∩ B1) + P(A ∩ B2) + ... + P(A ∩ Bk).


Probability Rules II: Conditional Probability

For two events A and B, the conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B).

The multiplication law is P(A ∩ B) = P(A|B)P(B).

A and B are independent if P(A ∩ B) = P(A)P(B), or equivalently P(A|B) = P(A) or P(B|A) = P(B).


Probability Rules III: Total Probability

Given a partition B1, ..., Bk, for any event A,

P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + ...+ P(A|Bk)P(Bk).

For continuous variables, f(x) = ∫ f(x|y) f(y) dy.


Probability Rules IV: Bayes Theorem

For two events, Bayes theorem states that P(B|A) = P(A|B)P(B) / P(A). More generally, if B1, ..., Bk form a partition, we can write Bayes theorem as:

P(Bi|A) = P(A|Bi)P(Bi) / Σ_{j=1}^{k} P(A|Bj)P(Bj).


Example: The Monty Hall problem

A prize is hidden behind one of three doors, with goats behind the other two; the player picks a door, and the host then opens one of the remaining doors to reveal a goat and offers the chance to switch. Should you change doors?

Implicit assumption:

the host always opens a different door from the door chosen by the player and always reveals a goat by this action because he knows where the car is hidden.


Solution using Bayes Theorem

Suppose without loss of generality that the player chooses door 1 and that the host opens door 2 to reveal a goat.

Let A (B, C) be the event that the prize is behind door 1 (2, 3).

P(A) = P(B) = P(C) = 1/3.

P(opens 2|A) = 1/2, P(opens 2|B) = 0, P(opens 2|C) = 1.

P(opens 2) = P(opens 2|A)P(A) + P(opens 2|B)P(B) + P(opens 2|C)P(C)
           = 1/2 × 1/3 + 0 × 1/3 + 1 × 1/3 = 1/2

P(A|opens 2) = P(opens 2|A)P(A) / P(opens 2) = (1/2 × 1/3) / (1/2) = 1/3

so P(C|opens 2) = 2/3 and it is better to switch.

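The calculation above can also be checked by simulation. Below is a minimal Python sketch (not part of the original slides; plain Python, no extra libraries assumed) that plays the game repeatedly under the stated assumptions about the host and estimates the win probability for each strategy.

```python
import random

def play(switch, n_trials=100_000, seed=0):
    """Estimate the win probability for a fixed strategy (switch or stay)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        prize = rng.randrange(3)           # door hiding the car
        choice = rng.randrange(3)          # player's initial pick
        # Host opens a door that is neither the player's pick nor the prize,
        # choosing at random when two such doors are available.
        opened = next(d for d in rng.sample(range(3), 3)
                      if d != choice and d != prize)
        if switch:
            # Switch to the single remaining unopened door.
            choice = next(d for d in range(3) if d not in (choice, opened))
        wins += (choice == prize)
    return wins / n_trials

print("P(win | stay)   ≈", play(switch=False))   # ≈ 1/3
print("P(win | switch) ≈", play(switch=True))    # ≈ 2/3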

Statistical Inference

Given data, x = (x1, ..., xn), we typically wish to make inference about some model parameter, θ, or predictions of future observations. There are two common statistical approaches: classical and Bayesian inference.

Classical Inference

Frequentist interpretation of probability.

Inference is based on the likelihood function:

l(θ|x) = f(x|θ).

θ is fixed. All uncertainty about X is quantified a priori.

Inferential procedures based on asymptotic performance.

Prediction is often carried out by substituting an estimator, θ̂, for θ:

P(Y < y|x) ≈ P(Y < y|x, θ̂).


Example: a coin tossing experiment

You have a coin with P(head) = θ. Suppose you decide to toss the coin 12 times and observe 9 heads and 3 tails.

The maximum likelihood estimate for θ is θ̂ = 9/12.

An (approximate) 95% confidence interval for θ is (0.505, 0.995).

The p-value for the test H0: θ = 0.5 vs H1: θ > 0.5 is:

p = Σ_{i=9}^{12} P(i heads in 12 tosses | θ = 0.5) ≈ 0.073

and the null hypothesis is not rejected at a 5% significance level.

With the alternative experiment of tossing the coin until the third tail is observed, if this occurs on the 12th toss, then p = 0.0325 and H0 is rejected!

The plug-in predictive distribution for the number of heads in 10 more coin tosses is Binomial(10, 0.75).

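As a numerical check of the figures quoted above, here is a short Python sketch (not from the slides; it assumes numpy and scipy are available) reproducing the MLE, the exact binomial p-value and a Wald-type approximate 95% confidence interval.

```python
import numpy as np
from scipy import stats

n, heads = 12, 9
theta_hat = heads / n                          # MLE: 9/12 = 0.75

# Exact binomial p-value for H0: theta = 0.5 vs H1: theta > 0.5
p_value = stats.binom.sf(heads - 1, n, 0.5)    # P(X >= 9 | theta = 0.5) ≈ 0.073

# Approximate (Wald) 95% confidence interval
se = np.sqrt(theta_hat * (1 - theta_hat) / n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)   # ≈ (0.505, 0.995)

# Plug-in predictive distribution for 10 further tosses: Binomial(10, 0.75)
plug_in = stats.binom(10, theta_hat)
print(theta_hat, p_value, ci)
print(plug_in.pmf(np.arange(11)))
```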

Bayesian Inference

Subjective interpretation of probability.

Inference is based on the likelihood function ... and

θ is treated as a random variable, with a prior distribution.

Inference is carried out via Bayes theorem and probability formulae.

Prediction is inherent to the procedure.


Bayesian inference: probability and the prior distribution

We have a prior distribution for θ which reflects personal subjective knowledge, previous data, ...

Different people can have different priors.

The only restriction is coherence.

Example

What do we know about the coin?

Coins typically have two faces with θ = P(head) ≈ 0.5.

0 ≤ θ ≤ 1.

Consider a prior distribution for θ centred at 0.5 but allowing for coin bias.


The beta prior distribution

θ has a beta distribution with parameters a, b > 0 if

f(θ) = θ^(a−1) (1 − θ)^(b−1) / B(a, b), for 0 < θ < 1.

[Figure: beta density plotted against θ; vertical axis f.]

The mean is E[θ] = a / (a + b).

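The beta family is available in standard statistical libraries. A minimal sketch (assuming scipy and matplotlib, not part of the slides) that evaluates a few beta densities of the kind shown in the omitted figure and checks the mean formula a/(a+b):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

theta = np.linspace(0.001, 0.999, 500)
for a, b in [(1, 1), (5, 5), (2, 8), (20, 20)]:   # illustrative choices of (a, b)
    plt.plot(theta, stats.beta.pdf(theta, a, b), label=f"Beta({a},{b})")
    print(f"Beta({a},{b}): mean = {stats.beta.mean(a, b):.3f} = {a / (a + b):.3f}")

plt.xlabel("theta"); plt.ylabel("f(theta)"); plt.legend(); plt.show()
```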

Bayesian inference: updating

When data are observed, beliefs are updated via Bayes theorem:

f(θ|x) = f(x|θ) f(θ) / f(x) = l(θ|x) f(θ) / f(x) ∝ l(θ|x) f(θ)

because the denominator is independent of θ.

We can remember this as:

posterior ∝ likelihood × prior.


Example

Suppose we use a Beta(5,5) prior distribution for θ:

f(θ) = θ^(5−1) (1 − θ)^(5−1) / B(5, 5).

Then the posterior distribution is:

f(θ|x) ∝ (12 choose 9) θ^9 (1 − θ)^3 × θ^(5−1) (1 − θ)^(5−1) / B(5, 5)
       ∝ θ^(14−1) (1 − θ)^(8−1)

What distribution is this? Normalizing,

f(θ|x) = θ^(14−1) (1 − θ)^(8−1) / B(14, 8).

Another beta distribution: θ|x ∼ Beta(14, 8), and P(θ > 0.5|x) ≈ 0.905.

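A quick numerical check of this example (a scipy-based sketch, not part of the slides): the conjugate update simply adds the head and tail counts to the prior parameters, and the posterior probability that the coin favours heads is about 0.905.

```python
from scipy import stats

a0, b0 = 5, 5            # Beta(5, 5) prior
heads, tails = 9, 3      # observed data
a1, b1 = a0 + heads, b0 + tails          # conjugate update: Beta(14, 8)

posterior = stats.beta(a1, b1)
print("P(theta > 0.5 | x) =", posterior.sf(0.5))   # ≈ 0.905
```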

Bayesian inference: the posterior as an average

The posterior density combines information from both prior and likelihood.

Example

The plot shows the prior density (dotted), scaled likelihood (dashed) and posterior density (solid).


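The omitted figure can be recreated along the following lines (a sketch assuming numpy, scipy and matplotlib; the likelihood is rescaled only so that the three curves are comparable on one plot).

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

theta = np.linspace(0.001, 0.999, 500)
prior = stats.beta.pdf(theta, 5, 5)                       # Beta(5, 5) prior
lik = theta**9 * (1 - theta)**3                           # likelihood for 9 heads, 3 tails
lik_scaled = lik / (lik.sum() * (theta[1] - theta[0]))    # rough normalisation for display
posterior = stats.beta.pdf(theta, 14, 8)                  # Beta(14, 8) posterior

plt.plot(theta, prior, ":", label="prior")
plt.plot(theta, lik_scaled, "--", label="scaled likelihood")
plt.plot(theta, posterior, "-", label="posterior")
plt.xlabel("theta"); plt.ylabel("f"); plt.legend(); plt.show()
```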

Bayesian inference: point and interval estimation

For point estimates we could use the posterior mean, median or mode, for example.

The classical MLE is θ̂ = 9/12 = 0.75 and the Bayesian posterior mean is E[θ|x] = 14/22 ≈ 0.636.

We have a weighted average:

14/22 = (10/22) × (1/2) + (12/22) × (9/12)

E[θ|x] = w E[θ] + (1 − w) θ̂, where w = 10/22.

For interval estimates we can use a credible interval, i.e. an interval [θ_L, θ_U] such that P(θ_L < θ < θ_U | x) = 0.95.

The shortest such interval is a highest posterior density (hpd) interval.

A classical 95% confidence interval is (0.505, 0.995) and the 95% posterior credible interval is (0.430, 0.819). How do we interpret the two intervals?

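The weighted-average identity and the credible interval quoted above are easy to verify numerically. The sketch below (scipy assumed, not part of the slides) uses the equal-tailed 95% interval, which for this posterior is very close to, but not identical with, the hpd interval.

```python
from scipy import stats

a0, b0, heads, n = 5, 5, 9, 12
a1, b1 = a0 + heads, b0 + (n - heads)          # posterior Beta(14, 8)

post_mean = a1 / (a1 + b1)                     # 14/22 ≈ 0.636
w = (a0 + b0) / (a0 + b0 + n)                  # prior weight 10/22
weighted = w * a0 / (a0 + b0) + (1 - w) * heads / n
print(post_mean, weighted)                     # both ≈ 0.636

# Equal-tailed 95% posterior credible interval
print(stats.beta.interval(0.95, a1, b1))       # ≈ (0.43, 0.82)
```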

Bayesian inference: prediction

Suppose that we wish to predict future observations, say Y. Then

f(y|x) = ∫ f(y|x, θ) f(θ|x) dθ = ∫ f(y|θ) f(θ|x) dθ

in the case of conditionally i.i.d. (exchangeable) variables.

Example

Let’s try to predict the number of heads, Y , in 10 further throws of the coin.

We know that Y |θ ∼ Binomial(10, θ), independent of the previous tosses.

P(Y = y|x) = ∫_0^1 P(Y = y|θ) f(θ|x) dθ = ... = (10 choose y) B(14 + y, 18 − y) / B(14, 8)

for y = 0, 1, ..., 10.

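The beta-binomial predictive probabilities can be evaluated directly from the formula above. The following sketch (numpy and scipy assumed, not part of the slides) also compares them with the classical plug-in Binomial(10, 0.75) probabilities.

```python
import numpy as np
from scipy import stats
from scipy.special import comb, betaln

a1, b1, m = 14, 8, 10                 # posterior Beta(14, 8), m future tosses
y = np.arange(m + 1)

# Bayesian (beta-binomial) predictive: C(10, y) B(14 + y, 18 - y) / B(14, 8)
log_pred = np.log(comb(m, y)) + betaln(a1 + y, b1 + m - y) - betaln(a1, b1)
bayes_pred = np.exp(log_pred)
# scipy.stats.betabinom(m, a1, b1).pmf(y) should give the same values (scipy >= 1.4)

plug_in = stats.binom.pmf(y, m, 0.75)  # classical plug-in predictive

for yi, bp, pp in zip(y, bayes_pred, plug_in):
    print(f"y={yi:2d}  Bayes={bp:.3f}  plug-in={pp:.3f}")
print(bayes_pred.sum())                # should be ≈ 1
```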

Predictive distributions

The plot shows the classical "plug-in" and Bayesian predictive distributions.



Bayesian inference is sequential!

We start with a prior, f (θ).

Given data x we update by Bayes theorem to get the posterior f (θ|x).

Now this is our new prior and ...

Given more data, y, we update again to get f (θ|x, y).

In principle this is a big advantage in big data settings, allowing parallelization, etc.

Example

If we observe Y = 6, then

f(θ|x, y) ∝ (10 choose 6) θ^6 (1 − θ)^4 × θ^(14−1) (1 − θ)^(8−1) / B(14, 8)
          ∝ θ^(20−1) (1 − θ)^(12−1)

θ|x, y ∼ Beta(20, 12)

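With a conjugate beta prior, sequential updating just accumulates the head and tail counts. A minimal sketch (plain Python, not part of the slides; the 6 heads and 4 tails in the second batch follow the example above) confirming that batch-by-batch and all-at-once updating give the same Beta(20, 12) posterior:

```python
def update(a, b, heads, tails):
    """One conjugate Bayes update of a Beta(a, b) prior with binomial data."""
    return a + heads, b + tails

a, b = 5, 5                      # prior Beta(5, 5)
a, b = update(a, b, 9, 3)        # first batch: 9 heads, 3 tails  -> Beta(14, 8)
a, b = update(a, b, 6, 4)        # second batch: 6 heads, 4 tails -> Beta(20, 12)
print(a, b)                      # 20 12

# Same result processing all the data at once
print(update(5, 5, 9 + 6, 3 + 4))   # (20, 12)
```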

Summary and next chapter

We have seen an outline of the basic ideas behind Bayesian statistics and illustrated some of them with a coin tossing example.

We have seen that a Beta(5, 5) prior led to a beta posterior.

Would this be the case with another beta prior?

Are there other situations where we can use “nice” priors?

What if we used a different type of prior? Would this be a problem in a big data setting? If so, what can we do?

We’ll see the solutions to these questions in the following classes.
