
Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

Beta and Gamma Distributions

Prof. Nicholas Zabaras

School of Engineering
University of Warwick
Coventry CV4 7AL
United Kingdom

Email: nzabaras@gmail.com
URL: http://www.zabaras.com/

August 7, 2014

Contents

Beta Distribution, Gamma Function, Normalization of the Beta Distribution, Beta as a Prior to Bernoulli, Posterior and Predictive Distributions

A Frequentist View of Bayesian Learning, Variance Decomposition

Gamma Distribution

Exponential Distribution

Chi-Squared Distribution

Inverse Gamma Distribution

The Pareto Distribution

• Following closely Chris Bishop's PRML book, Chapter 2
• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2

Beta Distribution

The Beta(a,b) distribution, with x ∈ [0,1] and a, b > 0, is defined as follows:

$$\mathrm{Beta}(x \mid a,b) = \underbrace{\frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}}_{\text{normalizing factor}} x^{a-1}(1-x)^{b-1} = \frac{1}{B(a,b)}\, x^{a-1}(1-x)^{b-1}$$

The expected value, mode and variance of a Beta random variable x with (hyper-)parameters a and b are:

$$\mathbb{E}[x] = \frac{a}{a+b}, \qquad \operatorname{mode}[x] = \frac{a-1}{a+b-2}, \qquad \operatorname{var}[x] = \frac{ab}{(a+b)^2(a+b+1)}$$
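These moment formulas are easy to verify numerically. A minimal sanity check using SciPy (not from the lecture; the pair a = 8, b = 4 is one of those plotted later):

```python
# Compare SciPy's Beta moments against the closed-form expressions above.
from scipy.stats import beta

a, b = 8.0, 4.0
dist = beta(a, b)
print(dist.mean(), a / (a + b))                        # E[x] = a/(a+b)
print(dist.var(), a * b / ((a + b)**2 * (a + b + 1)))  # var[x]
print((a - 1) / (a + b - 2))                           # mode (valid for a, b > 1)
```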

Beta Distribution

If a = b = 1, we obtain the uniform distribution.

If a and b are both less than 1, we get a bimodal shape with spikes at 0 and 1.

If a and b are both greater than 1, the distribution is unimodal.

Run betaPlotDemo from PMTK.

[Figure: Beta pdfs over [0,1] for a=0.1, b=0.1; a=1.0, b=1.0; a=2.0, b=3.0; a=8.0, b=4.0, illustrating the three cases a=b=1, a,b<1 and a,b>1.]
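betaPlotDemo itself is a Matlab demo from PMTK; as a sketch, a rough Python equivalent of the figure, with the parameter pairs taken from the legend above, might look like:

```python
# Plot several Beta pdfs, mirroring the four curves in the slide's figure.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

x = np.linspace(1e-3, 1 - 1e-3, 500)   # avoid the endpoint singularities for a, b < 1
for a, b in [(0.1, 0.1), (1.0, 1.0), (2.0, 3.0), (8.0, 4.0)]:
    plt.plot(x, beta.pdf(x, a, b), label=f"a={a}, b={b}")
plt.ylim(0, 3)
plt.title("beta distributions")
plt.legend()
plt.show()
```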

Gamma Function

The gamma function extends the factorial to real numbers:

$$\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u}\, du$$

With integration by parts:

$$\Gamma(x+1) = x\,\Gamma(x)$$

For integer n:

$$\Gamma(n) = (n-1)!$$

It is the gamma function that normalizes the Beta distribution:

$$\mathrm{Beta}(x \mid a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1}(1-x)^{b-1}$$
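A quick numerical illustration of the two identities above (a simple check, not lecture material):

```python
# Γ(x+1) = xΓ(x) at a non-integer point, and Γ(n) = (n-1)! for small integers.
from math import gamma, factorial

x = 3.7
print(gamma(x + 1), x * gamma(x))         # the two values agree
for n in range(1, 8):
    print(n, gamma(n), factorial(n - 1))  # Γ(n) matches (n-1)!
```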

Beta Distribution: Normalization

Showing that the Beta(a,b) distribution is normalized correctly is a bit tricky. We need to prove that:

$$\Gamma(a)\,\Gamma(b) = \Gamma(a+b) \int_0^1 \mu^{a-1}(1-\mu)^{b-1}\, d\mu$$

Indeed we follow the steps: (a) change the variable y to t = y + x; (b) change the order of integration in the shaded triangular region; and (c) change x to μ via x = tμ:

$$
\begin{aligned}
\Gamma(a)\,\Gamma(b) &= \int_0^{\infty} e^{-x} x^{a-1}\, dx \int_0^{\infty} e^{-y} y^{b-1}\, dy
= \int_0^{\infty} x^{a-1} \left( \int_x^{\infty} e^{-t} (t-x)^{b-1}\, dt \right) dx \\
&= \int_0^{\infty} e^{-t} \left( \int_0^{t} x^{a-1} (t-x)^{b-1}\, dx \right) dt
= \int_0^{\infty} e^{-t}\, t^{a+b-1}\, dt \int_0^1 \mu^{a-1} (1-\mu)^{b-1}\, d\mu \\
&= \Gamma(a+b) \int_0^1 \mu^{a-1} (1-\mu)^{b-1}\, d\mu
\end{aligned}
$$

[Figure: the shaded triangular integration region in the (x, t) plane, bounded by the line t = x.]
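The identity can also be checked numerically; a minimal sketch using numerical quadrature (assuming SciPy is available):

```python
# Verify Γ(a)Γ(b) = Γ(a+b) ∫₀¹ μ^{a-1}(1-μ)^{b-1} dμ for one choice of (a, b).
from math import gamma
from scipy.integrate import quad

a, b = 2.0, 3.0
integral, _ = quad(lambda mu: mu**(a - 1) * (1 - mu)**(b - 1), 0, 1)
print(gamma(a) * gamma(b))       # 2.0
print(gamma(a + b) * integral)   # ≈ 2.0
```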

Beta Distribution

[Figure: four panels showing the pdfs of Beta(0.1,0.1), Beta(1,1), Beta(2,3) and Beta(8,4) over x ∈ [0,1].]

See Matlab implementation.

Posterior Distribution

Assuming a Bernoulli likelihood and a Beta prior, we derive the posterior as:

$$p(\mathcal{D} \mid \mu) \propto \mu^{m}(1-\mu)^{N-m}, \qquad \mathrm{Beta}(\mu \mid a,b) \propto \mu^{a-1}(1-\mu)^{b-1}$$

$$\Rightarrow \quad p(\mu \mid \mathcal{D}, a, b) \propto \mu^{m+a-1}(1-\mu)^{N-m+b-1}$$

This is also a Beta distribution:

$$p(\mu \mid \mathcal{D}, a, b) = \mathrm{Beta}(\mu \mid a_N, b_N), \qquad a_N = a + m, \quad b_N = b + N - m$$

Here m is the number of observations with x = 1 among the N data points. a and b are the effective numbers of observations of x = 1 and x = 0, respectively, introduced by the prior (they don't have to be integers).
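In code, the conjugate update is just two additions. A minimal sketch (the counts N and m here are made up for illustration):

```python
# Conjugate Beta–Bernoulli update: prior Beta(a, b), data with m ones out of
# N trials, posterior Beta(a + m, b + N - m).
from scipy.stats import beta

a, b = 2.0, 2.0          # prior hyper-parameters
N, m = 10, 7             # e.g. 7 heads in 10 flips (illustrative data)

a_N, b_N = a + m, b + N - m
posterior = beta(a_N, b_N)
print(posterior.mean())  # (a+m)/(a+b+N) = 9/14 ≈ 0.643
```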

Posterior Mean and Variance

From the properties of the Beta distribution, we compute:

$$\mathbb{E}[\mu \mid \mathcal{D}] = \frac{a_N}{a_N + b_N} = \frac{a+m}{a+b+N}, \qquad \operatorname{var}[\mu \mid \mathcal{D}] = \frac{a_N b_N}{(a_N + b_N)^2 (a_N + b_N + 1)}$$

The posterior mean always lies in between the prior mean a/(a+b) and the MLE estimate m/N.

This can be shown easily by noticing that:

$$\mathbb{E}[\mu \mid \mathcal{D}] = \frac{a+m}{a+b+N} = \lambda\,\frac{a}{a+b} + (1-\lambda)\,\frac{m}{N}, \qquad \lambda = \frac{a+b}{a+b+N}, \quad 0 \le \lambda \le 1,$$

i.e. the posterior mean is a convex combination of the prior mean and the MLE.

Posterior Distribution

For example, after observing a single head (N = m = 1), the posterior is computed as follows: the Beta(2,2) prior is multiplied by the likelihood μ and renormalized, giving the posterior Beta(3,2).

[Figure: three panels over x ∈ [0,1] showing the prior Beta(2,2), the likelihood function (N = m = 1), and the posterior Beta(3,2).]
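A rough Python sketch reproducing the three panels (the likelihood is plotted unnormalized, since μ^m(1−μ)^{N−m} = μ here):

```python
# Prior, (unnormalized) likelihood, and posterior for one observed head.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

x = np.linspace(0, 1, 200)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, beta.pdf(x, 2, 2)); axes[0].set_title("Prior Beta(2,2)")
axes[1].plot(x, x);                 axes[1].set_title("Likelihood (N=m=1)")
axes[2].plot(x, beta.pdf(x, 3, 2)); axes[2].set_title("Posterior: Beta(3,2)")
plt.show()
```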

Predictive Distribution

We can now compute the probability that the next coin flip is heads. Using the posterior

$$p(\mu \mid \mathcal{D}, a, b) \propto \mu^{m+a-1}(1-\mu)^{N-m+b-1},$$

we obtain:

$$
\begin{aligned}
p(x = 1 \mid \mathcal{D}, a, b) &= \int_0^1 p(x = 1 \mid \mu)\, p(\mu \mid \mathcal{D}, a, b)\, d\mu \\
&= \int_0^1 \mu\, \mathrm{Beta}(\mu \mid a_N, b_N)\, d\mu
= \mathbb{E}[\mu \mid \mathcal{D}, a, b] = \frac{a_N}{a_N + b_N} = \frac{m + a}{N + a + b},
\end{aligned}
$$

i.e. the predictive probability of heads equals the posterior mean.
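As a sketch, the predictive probability is a one-liner; the helper name predict_heads is ours, not PMTK's. With a = b = 1 it reduces to Laplace's rule of succession, (m+1)/(N+2):

```python
# Predictive probability of heads = posterior mean under a Beta(a, b) prior.
def predict_heads(m, N, a=1.0, b=1.0):
    """P(next flip = 1 | m heads in N flips, Beta(a, b) prior)."""
    return (m + a) / (N + a + b)

print(predict_heads(m=1, N=1, a=2, b=2))  # 0.6 for the Beta(2,2) example above
```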

Properties of the Posterior Distribution

Consider the case of infinite data (N → ∞):

$$a_N = a + m \to m, \qquad b_N = b + N - m \to N - m,$$

and the posterior mean and variance become:

$$\mathbb{E}[\mu \mid \mathcal{D}] = \frac{a_N}{a_N + b_N} \to \frac{m}{N}, \qquad \operatorname{var}[\mu \mid \mathcal{D}] = \frac{a_N b_N}{(a_N + b_N)^2 (a_N + b_N + 1)} \to \frac{m(N-m)}{N^2 (N+1)} \to 0$$

For N → ∞, the distribution as expected spikes around the MLE estimate with zero variance (i.e. the uncertainty decreases as N → ∞). Is this a general property?

A Frequentist View of Bayesian Learning

Consider inference of a parameter θ using data D. We expect that, because the posterior p(θ|D) incorporates the information from the data D, it will imply less variability for θ than the prior p(θ).

We have the following identities:

$$\mathbb{E}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big]$$

$$\operatorname{var}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\operatorname{var}[\theta \mid \mathcal{D}]\big] + \operatorname{var}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big]$$

A Frequentist View of Bayesian Learning

$$\mathbb{E}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big], \qquad \operatorname{var}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\operatorname{var}[\theta \mid \mathcal{D}]\big] + \operatorname{var}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big]$$

This means that, on average over realizations of the data D, the conditional expectation E[θ|D] is equal to E[θ]. Also, the averaged posterior variance E_D[var[θ|D]] is smaller than the prior variance by an amount that depends on the variation of the posterior mean over the distribution of possible data sets.

Posterior Mean

Note the not-surprising result regarding the posterior mean:

$$
\mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big]
= \int \left( \int \theta\, p(\theta \mid \mathcal{D})\, d\theta \right) p(\mathcal{D})\, d\mathcal{D}
= \iint \theta\, p(\theta, \mathcal{D})\, d\theta\, d\mathcal{D}
= \int \theta\, p(\theta)\, d\theta = \mathbb{E}[\theta],
$$

i.e. the prior mean equals the posterior mean averaged over the data.

Variance Decomposition Identity

If (θ, D) are two scalar random variables, then we have:

$$\operatorname{var}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\operatorname{var}[\theta \mid \mathcal{D}]\big] + \operatorname{var}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big]$$

Here is the proof:

$$
\begin{aligned}
\operatorname{var}[\theta] &= \mathbb{E}[\theta^2] - \mathbb{E}[\theta]^2 \\
&= \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta^2 \mid \mathcal{D}]\big] - \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big]^2 \\
&= \mathbb{E}_{\mathcal{D}}\big[\operatorname{var}[\theta \mid \mathcal{D}] + \mathbb{E}[\theta \mid \mathcal{D}]^2\big] - \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big]^2 \\
&= \mathbb{E}_{\mathcal{D}}\big[\operatorname{var}[\theta \mid \mathcal{D}]\big] + \operatorname{var}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big]
\end{aligned}
$$

Posterior Variability

We can derive a similar expression regarding the posterior variance:

$$\underbrace{\mathbb{E}_{\mathcal{D}}\big[\operatorname{var}[\theta \mid \mathcal{D}]\big]}_{\substack{\text{posterior variance} \\ \text{averaged over all data}}} = \underbrace{\operatorname{var}[\theta]}_{\text{prior variance}} - \operatorname{var}_{\mathcal{D}}\big[\mathbb{E}[\theta \mid \mathcal{D}]\big] \;\le\; \operatorname{var}[\theta]$$

Thus on average (over the data), the variability in θ decreases. For a particular observed data set D, it is however possible that

$$\operatorname{var}[\theta \mid \mathcal{D}] > \operatorname{var}[\theta]$$

These results implicitly assume that the data follow the marginal distribution:

$$m(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$$
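A Monte Carlo check of the variance decomposition for the Beta–Bernoulli model (our own illustration, not from the lecture): sample θ from the prior, sample data given θ, and compare the prior variance with E_D[var[θ|D]] + var_D[E[θ|D]]:

```python
# var[θ] = E_D[var[θ|D]] + var_D[E[θ|D]], checked by simulation.
import numpy as np

rng = np.random.default_rng(0)
a, b, N = 2.0, 2.0, 10

theta = rng.beta(a, b, size=200_000)     # θ ~ prior
m = rng.binomial(N, theta)               # D | θ: number of heads in N flips
a_N, b_N = a + m, b + N - m              # posterior hyper-parameters
post_mean = a_N / (a_N + b_N)
post_var = a_N * b_N / ((a_N + b_N)**2 * (a_N + b_N + 1))

prior_var = a * b / ((a + b)**2 * (a + b + 1))
print(prior_var)                          # 0.05
print(post_var.mean() + post_mean.var())  # ≈ 0.05
```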

Gamma Distribution

The Gamma distribution is a two-parameter family of continuous distributions. It has a scale parameter θ > 0 and a shape parameter k > 0. If k is an integer, the distribution represents the sum of k independent exponentially distributed random variables, each with mean θ (equivalently, rate parameter θ⁻¹):

$$X \sim \mathrm{Gamma}(k, \theta): \qquad f(x) = \frac{x^{k-1} \exp(-x/\theta)}{\Gamma(k)\, \theta^{k}}, \qquad x > 0$$

More often, we also use the rate parametrization:

$$\mathrm{Gamma}(X \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, x^{a-1} \exp(-b x), \qquad x > 0$$

Gamma Distribution: Rate Parametrization

The Gamma distribution is frequently a model for waiting times.

It is more often parameterized in terms of a shape parameter a = k and an inverse scale parameter b = 1/θ, called a rate parameter:

$$p(x \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, x^{a-1} e^{-b x}, \quad x > 0, \qquad \Gamma(a) = \int_0^{\infty} u^{a-1} e^{-u}\, du$$

The mean, mode and variance with this parametrization are:

$$\mathbb{E}[x] = \frac{a}{b}, \qquad \operatorname{mode}[x] = \begin{cases} \dfrac{a-1}{b}, & \text{for } a \ge 1 \\[4pt] 0, & \text{otherwise,} \end{cases} \qquad \operatorname{var}[x] = \frac{a}{b^{2}}$$
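Note that SciPy parameterizes the Gamma by shape and scale = 1/b, so the rate parametrization above maps to gamma(a, scale=1/b). A quick check of the moment formulas:

```python
# SciPy uses scale = 1/rate; compare its moments to the formulas above.
from scipy.stats import gamma

a, b = 2.0, 1.5
dist = gamma(a, scale=1.0 / b)
print(dist.mean(), a / b)      # E[x] = a/b
print(dist.var(), a / b**2)    # var[x] = a/b²
print((a - 1) / b)             # mode, valid for a >= 1
```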

Gamma Distribution

Plots of

$$\mathrm{Gamma}(X \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, x^{a-1} \exp(-b x), \qquad b = 1.$$

As we increase the rate b, the distribution squeezes leftwards and upwards.

[Figure: Gamma pdfs for a=1.0, b=1.0; a=1.5, b=1.0; a=2.0, b=1.0.]

Run gammaPlotDemo from PMTK.

Gamma Distribution

An empirical PDF of rainfall data fitted with a Gamma distribution.

[Figure: two panels showing the empirical rainfall histogram with the Gamma fit obtained by the method of moments (MoM) and by maximum likelihood (MLE).]

Run the MatLab function gammaRainfallDemo from PMTK.
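A sketch of both fits on synthetic stand-in data (the actual demo is Matlab and uses real rainfall data); the method of moments inverts mean = a/b and var = a/b²:

```python
# Fit a Gamma by method of moments and by MLE, on synthetic data.
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=0.5, size=1000)   # stand-in "rainfall" data

# Method of moments: b = mean/var, a = mean * b
mean, var = x.mean(), x.var()
b_mom = mean / var
a_mom = mean * b_mom

# MLE via SciPy (floc=0 pins the location parameter at zero)
a_mle, _, scale_mle = gamma.fit(x, floc=0)
print((a_mom, b_mom), (a_mle, 1.0 / scale_mle))  # shape and rate estimates
```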

Exponential Distribution

This is defined as:

$$\mathrm{Expon}(x \mid \lambda) = \mathrm{Gamma}(x \mid 1, \lambda) = \lambda \exp(-\lambda x), \qquad x \ge 0$$

Here λ is the rate parameter.

This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate λ.
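A quick check that Expon(x|λ) is the a = 1 special case of the rate-parametrized Gamma:

```python
# Exponential pdf vs Gamma(a=1, rate=λ) pdf at a few points (scale = 1/λ in SciPy).
import numpy as np
from scipy.stats import expon, gamma

lam = 2.0
x = np.array([0.1, 0.5, 1.0, 2.0])
print(expon.pdf(x, scale=1 / lam))          # λ·exp(-λx)
print(gamma.pdf(x, a=1.0, scale=1 / lam))   # identical values
```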

Chi-Squared Distribution

This is defined as:

$$\chi^{2}(x \mid \nu) = \mathrm{Gamma}\!\left(x \,\Big|\, \frac{\nu}{2}, \frac{1}{2}\right) = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, x^{\nu/2 - 1} \exp(-x/2), \qquad x > 0$$

This is the distribution of the sum of squared Gaussian random variables. More precisely, let

$$Z_i \sim \mathcal{N}(0, 1) \quad \text{and} \quad S = \sum_{i=1}^{\nu} Z_i^{2}. \quad \text{Then } S \sim \chi_{\nu}^{2}.$$
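A Monte Carlo illustration of this fact (comparing against the χ²_ν mean ν and variance 2ν):

```python
# Sum of ν squared standard normals vs the χ²_ν distribution.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
nu = 5
S = (rng.standard_normal((100_000, nu)) ** 2).sum(axis=1)
print(S.mean(), chi2(nu).mean())  # both ≈ ν = 5
print(S.var(), chi2(nu).var())    # both ≈ 2ν = 10
```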

Inverse Gamma Distribution

This is defined as follows:

$$\text{If } X \sim \mathrm{Gamma}(X \mid a, b), \text{ then } \frac{1}{X} \sim \mathrm{InvGamma}(X \mid a, b),$$

where:

$$\mathrm{InvGamma}(x \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, x^{-(a+1)} \exp(-b/x), \qquad x > 0$$

a is the shape and b the scale parameter.

It can be shown that:

$$\mathbb{E}[x] = \frac{b}{a-1} \;\; (\text{exists for } a > 1), \qquad \operatorname{mode}[x] = \frac{b}{a+1}, \qquad \operatorname{var}[x] = \frac{b^{2}}{(a-1)^{2}(a-2)} \;\; (\text{exists for } a > 2)$$
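A numerical check of the Gamma–InvGamma relationship and the moment formulas (note that SciPy's invgamma takes the shape a and scale = b):

```python
# Reciprocals of Gamma(a, rate=b) samples should match InvGamma(a, b) moments.
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(3)
a, b = 4.0, 2.0
x = rng.gamma(shape=a, scale=1.0 / b, size=200_000)   # Gamma(a, rate=b) samples
print((1 / x).mean(), b / (a - 1), invgamma(a, scale=b).mean())  # all ≈ 2/3
print((1 / x).var(), b**2 / ((a - 1)**2 * (a - 2)))              # ≈ 0.222
```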

The Pareto Distribution

Used to model the distribution of quantities that exhibit long (heavy) tails:

$$\mathrm{Pareto}(x \mid k, m) = k\, m^{k}\, x^{-(k+1)}\, \mathbb{I}(x \ge m)$$

This density asserts that x must be greater than some constant m, but not too much greater; k controls what is "too much".

As k → ∞, the distribution approaches δ(x − m).

On a log-log scale, the pdf forms a straight line of the form log p(x) = a log x + c for some constants a and c (power law, Zipf's law).

The Pareto Distribution

Applications: modeling the frequency of words versus their rank, the distribution of wealth (k = Pareto index), etc.

$$\mathrm{Pareto}(x \mid k, m) = k\, m^{k}\, x^{-(k+1)}\, \mathbb{I}(x \ge m)$$

$$\mathbb{E}[x] = \frac{k m}{k - 1} \;\; (\text{if } k > 1), \qquad \operatorname{mode}[x] = m, \qquad \operatorname{var}[x] = \frac{m^{2} k}{(k-1)^{2}(k-2)} \;\; (\text{if } k > 2)$$

[Figure: Pareto pdfs for m=0.01, k=0.10; m=0.00, k=0.50; m=1.00, k=1.00.]

Run ParetoPlot from PMTK.
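The straight-line behavior on log-log axes is easy to verify; a small sketch (SciPy's pareto uses b for the shape k and scale for m):

```python
# The log-pdf of a Pareto is linear in log x with slope -(k+1).
import numpy as np
from scipy.stats import pareto

k, m = 1.0, 1.0
x = np.logspace(0, 2, 5)                     # x in [1, 100]
logp = np.log(pareto.pdf(x, b=k, scale=m))
slopes = np.diff(logp) / np.diff(np.log(x))
print(slopes)                                # each ≈ -(k+1) = -2
```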
