Bayesian Scientific Computing, Spring 2013 (N. Zabaras)
Beta and Gamma Distributions
Prof. Nicholas Zabaras
School of Engineering
University of Warwick
Coventry CV4 7AL
United Kingdom
Email: [email protected]
URL: http://www.zabaras.com/
August 7, 2014
Contents

Beta Distribution: Gamma Function, Normalization of the Beta Distribution, Beta as a Prior to Bernoulli, Posterior and Predictive Distributions
A Frequentist View of Bayesian Learning, Variance Decomposition
Gamma Distribution
Exponential Distribution
Chi-Squared Distribution
Inverse Gamma Distribution
The Pareto Distribution

• Following closely Chris Bishop's PRML book, Chapter 2
• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2
Beta Distribution

The Beta(a,b) distribution with x \in [0,1] and a, b > 0 is defined as follows:

\mathrm{Beta}(x\,|\,a,b) = \underbrace{\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}}_{\text{normalizing factor}}\, x^{a-1}(1-x)^{b-1}

The expected value, mode and variance of a Beta random variable x with (hyper-)parameters a and b are:

\mathbb{E}[x] = \frac{a}{a+b}, \qquad \mathrm{mode}[x] = \frac{a-1}{a+b-2}, \qquad \mathrm{var}[x] = \frac{ab}{(a+b)^2(a+b+1)}

For more information visit this link.
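As a quick sanity check of these formulas, the sketch below (plain MATLAB, not PMTK code; the values a = 2, b = 3 are arbitrary) evaluates the Beta density directly from its definition and compares the moment formulas against numerical integration:

```matlab
% Sketch: verify the Beta mean/mode/variance formulas numerically.
a = 2; b = 3;                                   % illustrative hyper-parameters
bpdf = @(x) gamma(a+b)/(gamma(a)*gamma(b)) .* x.^(a-1) .* (1-x).^(b-1);

m = integral(@(x) x .* bpdf(x), 0, 1);          % numerical E[x]
v = integral(@(x) (x - m).^2 .* bpdf(x), 0, 1); % numerical var[x]

fprintf('mean %.4f vs a/(a+b) = %.4f\n', m, a/(a+b));
fprintf('var  %.5f vs ab/((a+b)^2(a+b+1)) = %.5f\n', v, a*b/((a+b)^2*(a+b+1)));
fprintf('mode (a-1)/(a+b-2) = %.4f (valid for a,b > 1)\n', (a-1)/(a+b-2));
```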
Beta Distribution

If a=b=1, we obtain a uniform distribution. If a and b are both less than 1, we get a bimodal distribution with spikes at 0 and 1. If a and b are both greater than 1, the distribution is unimodal.

Run betaPlotDemo from PMTK.
[Figure: beta distributions for (a,b) = (0.1, 0.1), (1.0, 1.0), (2.0, 3.0) and (8.0, 4.0), illustrating the a=b=1, a,b<1 and a,b>1 cases.]
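If PMTK is not at hand, a minimal stand-in for betaPlotDemo (assuming only base MATLAB; the grid resolution is an arbitrary choice) is:

```matlab
% Sketch: plot the four Beta densities from the figure above.
x  = linspace(1e-3, 1-1e-3, 500);    % avoid endpoints, where a,b<1 diverges
ab = [0.1 0.1; 1 1; 2 3; 8 4];
figure; hold on
for i = 1:size(ab,1)
    a = ab(i,1); b = ab(i,2);
    plot(x, gamma(a+b)/(gamma(a)*gamma(b)) .* x.^(a-1) .* (1-x).^(b-1));
end
hold off
legend('a=0.1,b=0.1', 'a=1.0,b=1.0', 'a=2.0,b=3.0', 'a=8.0,b=4.0');
title('beta distributions');
```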
Gamma Function

The gamma function extends the factorial to real numbers:

\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du

With integration by parts: \Gamma(x+1) = x\,\Gamma(x). For integer n: \Gamma(n) = (n-1)!.

The gamma function supplies the normalizing factor of the Beta distribution:

\mathrm{Beta}(x\,|\,a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, x^{a-1}(1-x)^{b-1}

For more information visit this link.
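Both identities are easy to confirm numerically (a sketch in base MATLAB; n and x are arbitrary test values):

```matlab
% Sketch: check Gamma(n) = (n-1)! and Gamma(x+1) = x*Gamma(x).
n = 5;
fprintf('Gamma(%d) = %g, (%d-1)! = %g\n', n, gamma(n), n, factorial(n-1));
x = 2.7;
fprintf('Gamma(x+1) = %.6f, x*Gamma(x) = %.6f\n', gamma(x+1), x*gamma(x));
```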
Beta Distribution: Normalization

Showing that the Beta(a,b) distribution is normalized correctly is a bit tricky. We need to prove that:

\Gamma(a)\,\Gamma(b) = \Gamma(a+b) \int_0^1 \mu^{a-1}(1-\mu)^{b-1}\, d\mu

Indeed we follow the steps: (a) change the variable y to t = y + x; (b) change the order of integration in the shaded triangular region 0 \le x \le t of the (x,t) plane, bounded by the line t = x; and (c) change x to \mu via x = t\mu:

\begin{aligned}
\Gamma(a)\,\Gamma(b) &= \int_0^\infty e^{-x} x^{a-1}\, dx \int_0^\infty e^{-y} y^{b-1}\, dy
 = \int_0^\infty x^{a-1} \left( \int_x^\infty e^{-t} (t-x)^{b-1}\, dt \right) dx \\
&= \int_0^\infty e^{-t} \left( \int_0^t x^{a-1} (t-x)^{b-1}\, dx \right) dt
 = \int_0^\infty e^{-t}\, t^{a+b-1}\, dt \int_0^1 \mu^{a-1}(1-\mu)^{b-1}\, d\mu \\
&= \Gamma(a+b) \int_0^1 \mu^{a-1}(1-\mu)^{b-1}\, d\mu
\end{aligned}
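The end result can be checked numerically (a sketch; a = 2.5, b = 4 are arbitrary):

```matlab
% Sketch: the Beta normalization identity checked by quadrature.
a = 2.5; b = 4;
I = integral(@(mu) mu.^(a-1) .* (1-mu).^(b-1), 0, 1);
fprintf('integral = %.6f vs Gamma(a)Gamma(b)/Gamma(a+b) = %.6f\n', ...
        I, gamma(a)*gamma(b)/gamma(a+b));
```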
Beta Distribution

[Figure: densities of Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3) and Beta(8, 4), each plotted on [0, 1].]
See Matlab implementation
Posterior Distribution

Assuming a Bernoulli likelihood and a Beta prior, we derive the posterior as:

p(\mathcal{D}\,|\,\mu) = \mu^{m}(1-\mu)^{N-m}, \qquad \mathrm{Beta}(\mu\,|\,a,b) \propto \mu^{a-1}(1-\mu)^{b-1}

p(\mu\,|\,\mathcal{D}, a, b) \propto \mu^{m+a-1}(1-\mu)^{N-m+b-1}

This is also a Beta distribution:

p(\mu\,|\,\mathcal{D}, a, b) = \mathrm{Beta}(\mu\,|\,a_N, b_N), \qquad a_N = a + m, \quad b_N = b + N - m

Here m is the number of observations with x=1 out of N. The prior hyper-parameters a and b are the effective numbers of observations of x=1 and x=0, respectively, introduced by the prior (they don't have to be integers).
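The update itself is just bookkeeping on the counts. A minimal sketch (plain MATLAB, not PMTK code; the Beta(2,2) prior and single observed head match the worked example below):

```matlab
% Sketch: Beta-Bernoulli conjugate update by counting.
a = 2; b = 2;          % prior Beta(2,2)
N = 1; m = 1;          % data: one flip, one head
aN = a + m;            % add the x=1 counts to a
bN = b + N - m;        % add the x=0 counts to b
fprintf('posterior: Beta(%g, %g)\n', aN, bN);   % prints Beta(3, 2)
```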
Posterior Mean and Variance

From the properties of the Beta distribution, we compute:

\mathbb{E}[\mu\,|\,\mathcal{D}] = \frac{a_N}{a_N + b_N} = \frac{a+m}{a+b+N}, \qquad \mathrm{var}[\mu\,|\,\mathcal{D}] = \frac{a_N b_N}{(a_N + b_N)^2 (a_N + b_N + 1)}

The posterior mean always lies in between the prior mean and the MLE estimate \hat{\mu}_{MLE} = m/N. This can be shown easily by noticing that:

\frac{a+m}{a+b+N} = \lambda\,\frac{a}{a+b} + (1-\lambda)\,\frac{m}{N}, \qquad \lambda = \frac{a+b}{a+b+N}, \quad 0 < \lambda < 1
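A quick numerical confirmation of the convex-combination identity (the counts N = 10, m = 7 are arbitrary):

```matlab
% Sketch: posterior mean as a convex combination of prior mean and MLE.
a = 2; b = 2; N = 10; m = 7;
lambda   = (a + b) / (a + b + N);               % weight on the prior mean
postMean = (a + m) / (a + b + N);
combo    = lambda * a/(a+b) + (1 - lambda) * m/N;
fprintf('posterior mean %.4f = combination %.4f\n', postMean, combo);
```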
Posterior Distribution

For example, starting from a Beta(2,2) prior and observing a single head (N = m = 1), the posterior is computed as Beta(3,2):

[Figure: prior Beta(2,2), likelihood function (N = m = 1), and posterior Beta(3,2), each plotted on [0, 1].]
Predictive Distribution

We can now compute the probability that the next coin flip is heads. With the posterior

p(\mu\,|\,\mathcal{D}, a, b) \propto \mu^{m+a-1}(1-\mu)^{N-m+b-1} = \mathrm{Beta}(\mu\,|\,a_N, b_N)

we have

p(x=1\,|\,\mathcal{D}, a, b) = \int_0^1 p(x=1\,|\,\mu)\, p(\mu\,|\,\mathcal{D}, a, b)\, d\mu = \int_0^1 \mu\, p(\mu\,|\,\mathcal{D}, a, b)\, d\mu = \underbrace{\frac{a_N}{a_N + b_N}}_{\text{posterior mean}}
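Numerically, integrating \mu against the posterior indeed reproduces the posterior mean (a sketch using the Beta(3,2) posterior from the coin-flip example):

```matlab
% Sketch: predictive probability of heads = posterior mean a_N/(a_N+b_N).
aN = 3; bN = 2;
post   = @(mu) gamma(aN+bN)/(gamma(aN)*gamma(bN)) .* mu.^(aN-1) .* (1-mu).^(bN-1);
pHeads = integral(@(mu) mu .* post(mu), 0, 1);  % E[mu | D]
fprintf('p(x=1|D) = %.4f vs a_N/(a_N+b_N) = %.4f\n', pHeads, aN/(aN+bN));
```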
Properties of the Posterior Distribution

Consider the case of infinite data (N→∞):

a_N = a + m \to m, \qquad b_N = b + N - m \to N - m

and the posterior mean and variance become:

\mathbb{E}[\mu\,|\,\mathcal{D}] = \frac{a+m}{a+b+N} \to \frac{m}{N}, \qquad \mathrm{var}[\mu\,|\,\mathcal{D}] = \frac{a_N b_N}{(a_N + b_N)^2 (a_N + b_N + 1)} \to \frac{m(N-m)}{N^2(N+1)} \to 0

For N→∞, the distribution as expected spikes around the MLE estimate m/N with zero variance (i.e. the uncertainty decreases as N→∞). Is this a general property?
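The shrinkage is easy to see in simulation (a sketch; the true bias 0.3 and the Beta(2,2) prior are arbitrary choices):

```matlab
% Sketch: posterior mean -> MLE and posterior variance -> 0 as N grows.
a = 2; b = 2; muTrue = 0.3;
for N = [10 100 1000 10000]
    m  = sum(rand(N,1) < muTrue);               % simulated coin flips
    aN = a + m; bN = b + N - m;
    fprintf('N=%6d  mean=%.4f  var=%.2e\n', N, aN/(aN+bN), ...
            aN*bN/((aN+bN)^2*(aN+bN+1)));
end
```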
A Frequentist View of Bayesian Learning

Consider inference of a parameter \theta using data \mathcal{D}. We expect that because the posterior p(\theta\,|\,\mathcal{D}) incorporates the information from the data \mathcal{D}, it will imply less variability for \theta than the prior p(\theta). We have the following identities:

\mathbb{E}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta\,|\,\mathcal{D}]\big]

\mathrm{var}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\mathrm{var}[\theta\,|\,\mathcal{D}]\big] + \mathrm{var}_{\mathcal{D}}\big[\mathbb{E}[\theta\,|\,\mathcal{D}]\big]
A Frequentist View of Bayesian Learning

This means that, on average over the realizations of the data \mathcal{D}, the conditional expectation \mathbb{E}[\theta\,|\,\mathcal{D}] is equal to \mathbb{E}[\theta]. Also, the posterior variance \mathrm{var}[\theta\,|\,\mathcal{D}] is on average smaller than the prior variance by an amount that depends on the variation of the posterior mean \mathbb{E}[\theta\,|\,\mathcal{D}] over the distribution of possible data.
Posterior Mean

Note the not-surprising result regarding the posterior mean:

\underbrace{\mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta\,|\,\mathcal{D}]\big]}_{\text{posterior mean averaged over the data}} = \int \left( \int \theta\, p(\theta\,|\,\mathcal{D})\, d\theta \right) p(\mathcal{D})\, d\mathcal{D} = \int\!\!\int \theta\, p(\theta, \mathcal{D})\, d\theta\, d\mathcal{D} = \int \theta\, p(\theta)\, d\theta = \underbrace{\mathbb{E}[\theta]}_{\text{prior mean}}
Variance Decomposition Identity

If (\theta, \mathcal{D}) are two scalar random variables, then we have:

\mathrm{var}[\theta] = \mathbb{E}_{\mathcal{D}}\big[\mathrm{var}[\theta\,|\,\mathcal{D}]\big] + \mathrm{var}_{\mathcal{D}}\big[\mathbb{E}[\theta\,|\,\mathcal{D}]\big]

Here is the proof:

\begin{aligned}
\mathrm{var}[\theta] &= \mathbb{E}[\theta^2] - \mathbb{E}[\theta]^2 \\
&= \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta^2\,|\,\mathcal{D}]\big] - \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta\,|\,\mathcal{D}]\big]^2 \\
&= \mathbb{E}_{\mathcal{D}}\big[\mathrm{var}[\theta\,|\,\mathcal{D}] + \mathbb{E}[\theta\,|\,\mathcal{D}]^2\big] - \mathbb{E}_{\mathcal{D}}\big[\mathbb{E}[\theta\,|\,\mathcal{D}]\big]^2 \\
&= \mathbb{E}_{\mathcal{D}}\big[\mathrm{var}[\theta\,|\,\mathcal{D}]\big] + \mathrm{var}_{\mathcal{D}}\big[\mathbb{E}[\theta\,|\,\mathcal{D}]\big]
\end{aligned}
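The identity can be illustrated by Monte Carlo in the Beta-Bernoulli model (a sketch; the Beta(2,2) prior is sampled via order statistics of uniforms to stay within base MATLAB, and S, N are arbitrary):

```matlab
% Sketch: Monte Carlo check of var[th] = E[var(th|D)] + var(E[th|D]).
a = 2; b = 2; N = 10; S = 2e5;
u  = sort(rand(S, a+b-1), 2);
th = u(:, a);                          % Beta(2,2) draws (order statistic)
m  = sum(rand(S, N) < th, 2);          % heads counts given each theta
                                       % (uses implicit expansion, R2016b+)
aN = a + m; bN = b + N - m;
postMean = aN ./ (aN + bN);
postVar  = aN .* bN ./ ((aN + bN).^2 .* (aN + bN + 1));
lhs = a*b / ((a+b)^2 * (a+b+1));       % prior variance of theta
rhs = mean(postVar) + var(postMean);
fprintf('var[theta] = %.5f vs E[var] + var[E] = %.5f\n', lhs, rhs);
```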
Posterior Variability

We can derive a similar expression regarding the posterior variance:

\underbrace{\mathrm{var}[\theta]}_{\text{prior variance}} = \mathbb{E}_{\mathcal{D}}\big[\mathrm{var}[\theta\,|\,\mathcal{D}]\big] + \mathrm{var}_{\mathcal{D}}\big[\mathbb{E}[\theta\,|\,\mathcal{D}]\big] \;\ge\; \underbrace{\mathbb{E}_{\mathcal{D}}\big[\mathrm{var}[\theta\,|\,\mathcal{D}]\big]}_{\text{posterior variance averaged over all data}}

Thus on average (over the data), the variability in \theta decreases. For a particular observed data set \mathcal{D}, it is however possible that \mathrm{var}[\theta\,|\,\mathcal{D}] > \mathrm{var}[\theta].

These results implicitly assume that the data follow the marginal distribution:

m(\mathcal{D}) = \int p(\mathcal{D}\,|\,\theta)\, p(\theta)\, d\theta
Gamma Distribution

The Gamma distribution is a two-parameter family of continuous distributions. It has a scale parameter \theta > 0 and a shape parameter k > 0. If k is an integer, the distribution represents the sum of k independent exponentially distributed random variables, each of which has mean \theta (which is equivalent to a rate parameter of \theta^{-1}):

X \sim \mathrm{Gamma}(k, \theta) = \frac{x^{k-1} \exp(-x/\theta)}{\Gamma(k)\, \theta^{k}}, \qquad x > 0

More often, we use the rate parametrization:

\mathrm{Gamma}(X\,|\,a,b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} \exp(-bx), \qquad x > 0
Gamma Distribution: Rate Parametrization

The Gamma distribution is frequently a model for waiting times. For important properties see here.

It is more often parameterized in terms of a shape parameter a = k and an inverse scale parameter b = 1/\theta, called a rate parameter:

p(x\,|\,a,b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}, \qquad x > 0, \qquad \Gamma(a) = \int_0^\infty u^{a-1} e^{-u}\, du

The mean, mode and variance with this parametrization are:

\mathbb{E}[x] = \frac{a}{b}, \qquad \mathrm{mode}[x] = \begin{cases} \dfrac{a-1}{b}, & a \ge 1 \\ 0, & \text{otherwise} \end{cases}, \qquad \mathrm{var}[x] = \frac{a}{b^2}
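A sketch checking these moment formulas by quadrature (a = 2, b = 1.5 are arbitrary):

```matlab
% Sketch: Gamma(a,b) moments under the rate parametrization.
a = 2; b = 1.5;
gpdf = @(x) b^a / gamma(a) .* x.^(a-1) .* exp(-b*x);
m = integral(@(x) x .* gpdf(x), 0, Inf);
v = integral(@(x) (x - m).^2 .* gpdf(x), 0, Inf);
fprintf('mean %.4f vs a/b = %.4f\n', m, a/b);
fprintf('var  %.4f vs a/b^2 = %.4f\n', v, a/b^2);
fprintf('mode (a-1)/b = %.4f (for a >= 1)\n', (a-1)/b);
```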
Gamma Distribution

Plots of \mathrm{Gamma}(x\,|\,a,b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} \exp(-bx) for b = 1:

[Figure: Gamma distributions with a = 1.0, 1.5, 2.0 and b = 1.0.]

As we increase the rate b, the distribution squeezes leftwards and upwards.

Run gammaPlotDemo from PMTK.
Gamma Distribution

An empirical PDF of rainfall data fitted with a Gamma distribution, using method-of-moments (MoM) and maximum likelihood (MLE) parameter estimates.

Run the Matlab function gammaRainfallDemo from PMTK.

[Figure: histogram of rainfall data with the MoM and MLE Gamma fits overlaid.]
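Without the rainfall data, the method-of-moments step inside such a demo can be sketched on synthetic draws (everything here is an assumption: the data are simulated as sums of two exponentials, i.e. Gamma(2,1) in the rate parametrization):

```matlab
% Sketch: method-of-moments fit of a Gamma(a,b) (rate parametrization).
x    = -log(rand(5000,1)) - log(rand(5000,1)); % synthetic Gamma(2,1) draws
xbar = mean(x);  s2 = var(x);
aHat = xbar^2 / s2;                            % MoM: a = mean^2 / var
bHat = xbar  / s2;                             % MoM: b = mean / var
fprintf('MoM fit: a = %.3f (true 2), b = %.3f (true 1)\n', aHat, bHat);
```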
Exponential Distribution

This is defined as

\mathrm{Expon}(x\,|\,\lambda) = \mathrm{Gamma}(x\,|\,1,\lambda) = \lambda \exp(-\lambda x), \qquad x > 0

Here \lambda is the rate parameter. This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate \lambda.
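A sketch of sampling these waiting times by inverting the exponential CDF (λ = 2 is arbitrary; base MATLAB only):

```matlab
% Sketch: exponential waiting times via the inverse CDF, t = -log(u)/lambda.
lambda = 2; S = 1e5;
t = -log(rand(S,1)) / lambda;
fprintf('sample mean %.4f vs 1/lambda = %.4f\n', mean(t), 1/lambda);
```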
Chi-Squared Distribution

This is defined as

\chi^2(x\,|\,\nu) = \mathrm{Gamma}\!\left(x \,\Big|\, \frac{\nu}{2}, \frac{1}{2}\right) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{\nu/2 - 1} \exp\!\left(-\frac{x}{2}\right), \qquad x > 0

This is the distribution of the sum of squared Gaussian random variables. More precisely, let Z_i \sim \mathcal{N}(0,1) and S = \sum_{i=1}^{\nu} Z_i^2. Then S \sim \chi^2_{\nu}.
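A Monte Carlo sketch of this fact (ν = 5 degrees of freedom, chosen arbitrarily), using the known chi-squared mean ν and variance 2ν:

```matlab
% Sketch: sum of nu squared standard normals behaves like chi^2_nu.
nu = 5; S = 1e5;
Ssq = sum(randn(S, nu).^2, 2);                 % S draws of chi^2_nu
fprintf('sample mean %.3f (theory %g), sample var %.3f (theory %g)\n', ...
        mean(Ssq), nu, var(Ssq), 2*nu);
```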
Inverse Gamma Distribution

This is defined as follows:

\mathrm{InvGamma}(x\,|\,a,b) = \frac{b^a}{\Gamma(a)}\, x^{-(a+1)} \exp(-b/x), \qquad x > 0

where a is the shape and b the scale parameter. If X \sim \mathrm{Gamma}(X\,|\,a,b), then X^{-1} \sim \mathrm{InvGamma}(X^{-1}\,|\,a,b). It can be shown that:

\mathbb{E}[x] = \frac{b}{a-1} \;\;(\text{exists for } a > 1), \qquad \mathrm{mode}[x] = \frac{b}{a+1}, \qquad \mathrm{var}[x] = \frac{b^2}{(a-1)^2(a-2)} \;\;(\text{exists for } a > 2)
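The reciprocal relationship and the moment formulas can be checked by simulation (a sketch; for integer shape a, Gamma(a,b) draws are built from sums of exponentials to stay within base MATLAB):

```matlab
% Sketch: if X ~ Gamma(a,b) (rate b), then 1/X ~ InvGamma(a,b).
a = 3; b = 2; S = 2e5;
X = -log(prod(rand(S, a), 2)) / b;   % Gamma(a, rate b): sum of a Exp(b)'s
Y = 1 ./ X;                          % should follow InvGamma(a, b)
fprintf('mean(1/X) %.4f vs b/(a-1) = %.4f\n', mean(Y), b/(a-1));
fprintf('var(1/X)  %.4f vs b^2/((a-1)^2(a-2)) = %.4f\n', ...
        var(Y), b^2/((a-1)^2*(a-2)));
```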
The Pareto Distribution

Used to model the distribution of quantities that exhibit long (heavy) tails:

\mathrm{Pareto}(x\,|\,k,m) = k\, m^k\, x^{-(k+1)}\, \mathbb{I}(x \ge m)

This density asserts that x must be greater than some constant m, but not too much greater; k controls what counts as "too much". As k → ∞, the distribution approaches \delta(x - m). On a log-log scale, the pdf forms a straight line of the form \log p(x) = a \log x + c for some constants a and c (power law, Zipf's law).
The Pareto Distribution

Applications: modeling the frequency of words vs. their rank, the distribution of wealth (k = Pareto index), etc.

\mathrm{Pareto}(x\,|\,k,m) = k\, m^k\, x^{-(k+1)}\, \mathbb{I}(x \ge m)

\mathbb{E}[x] = \frac{km}{k-1} \;\;(\text{if } k > 1), \qquad \mathrm{mode}[x] = m, \qquad \mathrm{var}[x] = \frac{m^2 k}{(k-1)^2(k-2)} \;\;(\text{if } k > 2)

[Figure: Pareto distributions for m=0.01, k=0.10; m=0.00, k=0.50; m=1.00, k=1.00.]

Run ParetoPlot from PMTK.
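The straight-line behavior on log-log axes is immediate from the density (a sketch; m = 1, k = 1 are arbitrary choices):

```matlab
% Sketch: log p(x) = log(k m^k) - (k+1) log x, a line of slope -(k+1).
k = 1; m = 1;
x = logspace(0, 2, 50);                        % grid with x >= m
p = k * m^k * x.^(-(k+1));
slope = diff(log(p)) ./ diff(log(x));          % constant slope
fprintf('empirical slope %.3f vs -(k+1) = %g\n', slope(1), -(k+1));
```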