Machine Learning Srihari
Discrete Probability Distributions Sargur N. Srihari
Binary Variables
Bernoulli, Binomial and Beta
Bernoulli Distribution

• Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
• The probability of x=1 is denoted by the parameter µ, i.e.,
    p(x=1|µ) = µ
  – Therefore p(x=0|µ) = 1−µ
• The probability distribution has the form
    Bern(x|µ) = µ^x (1−µ)^{1−x}
• The mean is shown to be E[x] = µ, and the variance is var[x] = µ(1−µ)
• Likelihood of N observations drawn independently from p(x|µ):
    p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1−µ)^{1−x_n}
  – Log-likelihood:
    ln p(D|µ) = ∑_{n=1}^{N} ln p(x_n|µ) = ∑_{n=1}^{N} { x_n ln µ + (1−x_n) ln(1−µ) }
• Maximum likelihood estimator, obtained by setting the derivative of ln p(D|µ) wrt µ equal to zero:
    µ_ML = (1/N) ∑_{n=1}^{N} x_n
• If the number of observations with x=1 is m, then µ_ML = m/N

(Jacob Bernoulli, 1654–1705)
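The closed-form estimate µ_ML = m/N can be checked numerically. The slides' sample code is in Matlab, but a standard-library Python sketch serves the same purpose; the binary data below are made up for illustration:

```python
import math

# Made-up binary observations (m = 6 ones out of N = 10)
D = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

N = len(D)
m = sum(D)        # number of observations with x = 1
mu_ml = m / N     # maximum likelihood estimate: mu_ML = m / N

# Log-likelihood ln p(D|mu) = sum_n { x_n ln mu + (1 - x_n) ln(1 - mu) }
def log_likelihood(mu, data):
    return sum(x * math.log(mu) + (1 - x) * math.log(1 - mu) for x in data)

# mu_ML maximizes the log-likelihood: nearby values score strictly lower
assert log_likelihood(mu_ml, D) > log_likelihood(mu_ml - 0.05, D)
assert log_likelihood(mu_ml, D) > log_likelihood(mu_ml + 0.05, D)
print(mu_ml)  # 0.6
```

The log-likelihood is concave in µ, so the stationary point m/N is its unique maximizer, which the two assertions spot-check.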
Binomial Distribution

• Related to the Bernoulli distribution
• Expresses the distribution of m, the number of observations for which x=1
• Each term is proportional to Bern(x|µ); the coefficient adds up all ways of obtaining m heads:
    Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m}
• Binomial coefficient:
    (N choose m) = N! / (m!(N−m)!)
• Mean and variance:
    E[m] = ∑_{m=0}^{N} m Bin(m|N,µ) = Nµ
    var[m] = Nµ(1−µ)

[Figure: histogram of the binomial distribution for N=10 and µ=0.25]
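The histogram for N=10 and µ=0.25 can be reproduced, and the mean and variance formulas verified, with a short Python sketch using only the standard library:

```python
import math

def binom_pmf(m, N, mu):
    """Bin(m|N,mu) = C(N,m) * mu^m * (1-mu)^(N-m)."""
    return math.comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25
pmf = [binom_pmf(m, N, mu) for m in range(N + 1)]

# The pmf sums to 1, and its mean/variance match N*mu and N*mu*(1-mu)
assert abs(sum(pmf) - 1.0) < 1e-12
mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))
assert abs(mean - N * mu) < 1e-9            # E[m] = N*mu = 2.5
assert abs(var - N * mu * (1 - mu)) < 1e-9  # var[m] = 1.875
```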
Beta Distribution

• Beta distribution:
    Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^{a−1} (1−µ)^{b−1}
• Where the Gamma function is defined as
    Γ(x) = ∫₀^∞ u^{x−1} e^{−u} du
• a and b are hyperparameters that control the distribution of the parameter µ
• Mean and variance:
    E[µ] = a / (a+b)
    var[µ] = ab / ((a+b)² (a+b+1))

[Figure: the beta distribution as a function of µ for hyperparameter settings (a,b) = (0.1, 0.1), (1, 1), (2, 3), (8, 4)]
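The normalization and the mean formula above can be sanity-checked numerically. A minimal Python sketch (a crude Riemann sum on a grid, chosen here purely for illustration):

```python
import math

def beta_pdf(mu, a, b):
    """Beta(mu|a,b) = Gamma(a+b)/(Gamma(a)Gamma(b)) * mu^(a-1) * (1-mu)^(b-1)."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * mu ** (a - 1) * (1 - mu) ** (b - 1)

# Check that the density integrates to ~1 and the mean matches a/(a+b)
a, b = 2.0, 3.0
step = 1e-4
grid = [i * step for i in range(1, 10000)]
total = sum(beta_pdf(mu, a, b) for mu in grid) * step
mean = sum(mu * beta_pdf(mu, a, b) for mu in grid) * step
assert abs(total - 1.0) < 1e-3
assert abs(mean - a / (a + b)) < 1e-3   # E[mu] = 2/5 = 0.4
```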
Bayesian Inference with Beta

• The MLE of µ in the Bernoulli is the fraction of observations with x=1
  – Severely over-fitted for small data sets
• The likelihood function is a product of factors of the form µ^x (1−µ)^{1−x}
• If the prior distribution of µ is chosen to be proportional to powers of µ and (1−µ), the posterior will have the same functional form as the prior
  – This property is called conjugacy
• The beta distribution has a form suitable for a prior distribution p(µ)
Bayesian Inference with Beta

• The posterior is obtained by multiplying the beta prior with the binomial likelihood:
    p(µ|m,l,a,b) ∝ µ^{m+a−1} (1−µ)^{l+b−1}
  – where m is the number of heads and l = N−m is the number of tails
• It is another beta distribution:
    p(µ|m,l,a,b) = [Γ(m+a+l+b) / (Γ(m+a)Γ(l+b))] µ^{m+a−1} (1−µ)^{l+b−1}
  – Effectively increases the value of a by m and of b by l
  – As the number of observations increases, the distribution becomes more peaked

[Figure: one step of the sequential process — prior p(µ) = Beta(µ|2,2), likelihood p(x=1|µ) = µ^1(1−µ)^0 for a single observation N=m=1 with x=1, posterior p(µ|x=1) = Beta(µ|3,2)]
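Because of conjugacy the update reduces to adding counts to the hyperparameters, which a few lines of Python make concrete (the figure's one-step example is reused as the test case):

```python
# Conjugate beta-binomial update: observing m heads and l tails turns
# a Beta(a, b) prior into a Beta(a + m, b + l) posterior.
def beta_posterior(a, b, m, l):
    return a + m, b + l

# One step of the illustration: prior Beta(2,2), single observation x=1
a_post, b_post = beta_posterior(2, 2, m=1, l=0)
assert (a_post, b_post) == (3, 2)

# More data -> a more peaked posterior: var = ab / ((a+b)^2 (a+b+1))
def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

assert beta_var(*beta_posterior(2, 2, 30, 10)) < beta_var(2, 2)
```

The final assertion illustrates the last bullet: after 40 observations the posterior variance is far smaller than the prior's.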
Predicting the Next Trial Outcome

• Need the predictive distribution of x given the observed data D
  – From the sum and product rules:
    p(x=1|D) = ∫₀¹ p(x=1, µ|D) dµ = ∫₀¹ p(x=1|µ) p(µ|D) dµ = ∫₀¹ µ p(µ|D) dµ = E[µ|D]
• The expected value of the posterior distribution can be shown to be
    p(x=1|D) = (m+a) / (m+a+l+b)
  – which is the fraction of observations (both fictitious and real) that correspond to x=1
• Maximum likelihood and Bayesian results agree in the limit of infinitely many observations
  – On average, uncertainty (variance) decreases with observed data
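The predictive formula and its large-N agreement with the MLE can both be checked directly; a small Python sketch with made-up counts:

```python
from fractions import Fraction

# Predictive probability of x=1 given D: p(x=1|D) = (m+a)/(m+a+l+b)
def predictive(m, l, a, b):
    return Fraction(m + a, m + a + l + b)

# With a Beta(2,2) prior and 6 heads / 4 tails observed:
p = predictive(m=6, l=4, a=2, b=2)
assert p == Fraction(8, 14)   # (6+2)/(6+2+4+2)

# As N grows, the Bayesian answer approaches the MLE m/N
assert abs(float(predictive(6000, 4000, 2, 2)) - 0.6) < 1e-3
```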
Summary of Binary Distributions

• A single binary variable's distribution is represented by the Bernoulli
• The binomial is related to the Bernoulli
  – Expresses the distribution of the number of occurrences of 1 (or 0) in N trials
• The beta distribution is a conjugate prior for the Bernoulli
  – Both have the same functional form
Sample Matlab Code: Probability Distributions

• Binomial distribution:
  – Probability density function: Y = binopdf(X,N,P) returns the binomial probability density function with parameters N and P at the values in X.
  – Random number generator: R = binornd(N,P,MM,NN) returns an MM-by-NN matrix of random numbers chosen from a binomial distribution with parameters N and P.
• Beta distribution:
  – Probability density function: Y = betapdf(X,A,B) returns the beta probability density function with parameters A and B at the values in X.
  – Random number generator: R = betarnd(A,B) returns a matrix of random numbers chosen from the beta distribution with parameters A and B.
Multinomial Variables
Generalized Bernoulli and Dirichlet
Generalization of Binomial

• Binomial: tossing a coin
  – Expresses the probability of the number of successes in N trials
  – e.g., probability of 3 rainy days in 10 days
• Multinomial: throwing a die
  – Probability of a given frequency for each value
  – e.g., probability of 3 specific letters in a string of N characters
• Probability calculator: http://stattrek.com/Tables/Multinomial.aspx

[Figure: histogram of the binomial distribution for N=10 and µ=0.25]
Generalized Bernoulli: Multinoulli

• Bernoulli distribution: x is 0 or 1
    Bern(x|µ) = µ^x (1−µ)^{1−x}
• Now consider a discrete variable that takes one of K values (instead of 2)
  – Represent it using the 1-of-K scheme
• Represent x as a K-dimensional vector
  – e.g., with K=6, if the variable takes its third state we represent it as x = (0,0,1,0,0,0)^T
  – Such vectors satisfy ∑_{k=1}^{K} x_k = 1
• If the probability of x_k=1 is denoted µ_k, then the distribution of x is given by
    p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}   where µ = (µ_1,..,µ_K)^T
  – This is the generalized Bernoulli
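The 1-of-K scheme and the product form of p(x|µ) are easy to express in a few lines of Python (the parameter vector below is made up for illustration):

```python
import math

def one_of_K(state, K):
    """Encode a state in {1,..,K} as a K-dimensional binary vector."""
    return [1 if k == state else 0 for k in range(1, K + 1)]

def multinoulli_p(x, mu):
    """p(x|mu) = prod_k mu_k^{x_k}; the product picks out mu_k of the active state."""
    return math.prod(m ** xk for m, xk in zip(mu, x))

mu = [0.1, 0.1, 0.3, 0.2, 0.1, 0.2]   # must sum to 1
x = one_of_K(3, K=6)                  # third state of six
assert x == [0, 0, 1, 0, 0, 0]
assert sum(x) == 1                    # 1-of-K constraint
assert multinoulli_p(x, mu) == 0.3    # probability of that state is mu_3
```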
MLE of Multinoulli Parameters

• Data set D of N independent observations x_1,..,x_N
  – where the nth observation is written as (x_n1,.., x_nK)
• The likelihood function has the form
    p(D|µ) = ∏_{n=1}^{N} ∏_{k=1}^{K} µ_k^{x_nk} = ∏_{k=1}^{K} µ_k^{∑_n x_nk} = ∏_{k=1}^{K} µ_k^{m_k}
  – where m_k = ∑_n x_nk is the number of observations with x_k=1
• Maximum likelihood solution
  – Maximize ln p(D|µ) with the Lagrangian constraint that the µ_k must sum to one, i.e., maximize
    ∑_{k=1}^{K} m_k ln µ_k + λ (∑_{k=1}^{K} µ_k − 1)
• Setting the derivative wrt µ_k to zero gives
    µ_k^ML = m_k / N
  – which is the fraction of the N observations for which x_k=1
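The resulting estimator is just per-state frequency counting, as a short Python sketch shows (the die-roll data are made up for illustration):

```python
from collections import Counter

# MLE for multinoulli parameters: mu_k = m_k / N
def multinoulli_mle(observations, K):
    N = len(observations)
    counts = Counter(observations)
    return [counts.get(k, 0) / N for k in range(1, K + 1)]

# Made-up die-roll data over K = 6 states
rolls = [1, 3, 3, 6, 2, 3, 5, 1, 3, 6]
mu_ml = multinoulli_mle(rolls, K=6)
assert mu_ml == [0.2, 0.1, 0.4, 0.0, 0.1, 0.2]
# The Lagrange constraint sum_k mu_k = 1 holds automatically
assert abs(sum(mu_ml) - 1.0) < 1e-12
```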
Generalized Binomial Distribution

• Multinomial distribution (over a K-state variable):
    Mult(m_1, m_2,.., m_K|µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
  – where ∑_k µ_k = 1 and ∑_k m_k = N
• The normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1, m_2,.., m_K
  – Given by
    (N choose m_1 m_2 .. m_K) = N! / (m_1! m_2! .. m_K!)
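The multinomial coefficient and pmf can be sketched in a few lines of Python; the fair-die example below is made up for illustration:

```python
import math

def multinomial_coef(counts):
    """N! / (m_1! m_2! .. m_K!) with N = sum of the counts."""
    N = sum(counts)
    denom = math.prod(math.factorial(m) for m in counts)
    return math.factorial(N) // denom

def multinomial_pmf(counts, mu):
    """Mult(m_1..m_K | mu, N) = coefficient * prod_k mu_k^{m_k}."""
    return multinomial_coef(counts) * math.prod(
        p ** m for p, m in zip(mu, counts))

# Fair die, N = 4 throws, outcome counts (2, 1, 1, 0, 0, 0)
mu = [1 / 6] * 6
assert multinomial_coef([2, 1, 1, 0, 0, 0]) == 12   # 4!/(2!1!1!) = 12
p = multinomial_pmf([2, 1, 1, 0, 0, 0], mu)
assert abs(p - 12 / 6**4) < 1e-12
```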
Dirichlet Distribution

• A family of prior distributions for the parameters µ_k of the multinomial distribution
• By inspection of the multinomial, the form of the conjugate prior is
    p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k−1}   where 0 ≤ µ_k ≤ 1 and ∑_k µ_k = 1
• Normalized form of the Dirichlet distribution:
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)...Γ(α_K))] ∏_{k=1}^{K} µ_k^{α_k−1}   where α_0 = ∑_{k=1}^{K} α_k

(Lejeune Dirichlet, 1805–1859)
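The normalized density above translates directly into Python; as a sanity check, with all α_k = 1 the Dirichlet is uniform over the simplex with constant density Γ(K) = (K−1)!:

```python
import math

def dirichlet_pdf(mu, alpha):
    """Dir(mu|alpha) = Gamma(a0) / prod_k Gamma(a_k) * prod_k mu_k^(a_k - 1)."""
    a0 = sum(alpha)
    coef = math.gamma(a0) / math.prod(math.gamma(a) for a in alpha)
    return coef * math.prod(m ** (a - 1) for m, a in zip(mu, alpha))

# alpha_k = 1 for all k: density is (K-1)! at every point of the simplex
K = 3
point = [0.2, 0.3, 0.5]        # any point with components summing to 1
assert abs(dirichlet_pdf(point, [1.0] * K) - math.factorial(K - 1)) < 1e-12
```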
Dirichlet over 3 Variables

• Due to the summation constraint ∑_k µ_k = 1, the distribution over the space of {µ_k} is confined to a simplex of dimensionality K−1
• For K=3:
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)Γ(α_2)Γ(α_3))] ∏_{k=1}^{3} µ_k^{α_k−1}   where α_0 = ∑_{k=1}^{3} α_k

[Figure: plots of the Dirichlet distribution over the simplex for parameter settings α_k = 0.1, α_k = 1, and α_k = 10]
Dirichlet Posterior Distribution

• Multiplying the prior by the likelihood:
    p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k+m_k−1}
• Which has the form of the Dirichlet distribution:
    p(µ|D,α) = Dir(µ|α+m) = [Γ(α_0+N) / (Γ(α_1+m_1)...Γ(α_K+m_K))] ∏_{k=1}^{K} µ_k^{α_k+m_k−1}
  – where m = (m_1,.., m_K)^T
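As in the beta-binomial case, conjugacy makes the update purely additive: the observed counts m_k add to the hyperparameters α_k. A small Python sketch with made-up die rolls:

```python
from collections import Counter

# Conjugate Dirichlet-multinomial update: counts m_k add to alpha_k
def dirichlet_posterior(alpha, observations):
    counts = Counter(observations)
    return [a + counts.get(k, 0) for k, a in enumerate(alpha, start=1)]

# Symmetric Dir(1,1,1,1,1,1) prior plus made-up die rolls
alpha = [1, 1, 1, 1, 1, 1]
rolls = [1, 3, 3, 6, 2, 3]
post = dirichlet_posterior(alpha, rolls)
assert post == [2, 2, 4, 1, 1, 2]

# Posterior mean E[mu_k] = (alpha_k + m_k) / (alpha_0 + N)
a0_N = sum(post)
assert a0_N == sum(alpha) + len(rolls)   # 6 + 6 = 12
assert post[2] / a0_N == 4 / 12          # state 3 was observed most often
```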
Summary of Discrete Distributions

• Bernoulli (2 states):
    Bern(x|µ) = µ^x (1−µ)^{1−x}
  – Binomial:
    Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m},   where (N choose m) = N! / (m!(N−m)!)
• Generalized Bernoulli (K states):
    p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}   where µ = (µ_1,..,µ_K)^T
  – Multinomial:
    Mult(m_1, m_2,.., m_K|µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
• Conjugate priors:
  – For the binomial it is the beta:
    Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^{a−1} (1−µ)^{b−1}
  – For the multinomial it is the Dirichlet:
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)...Γ(α_K))] ∏_{k=1}^{K} µ_k^{α_k−1}   where α_0 = ∑_{k=1}^{K} α_k
Distributions: Landscape

• Discrete, binary: Bernoulli, Binomial, Beta
• Discrete, multivalued: Multinomial, Dirichlet
• Continuous: Gaussian, Gamma, Wishart, Student's-t, Exponential, Uniform
• Angular: Von Mises
Distributions: Relationships

• Discrete, binary:
  – Bernoulli: single binary variable
  – Binomial: N samples of a Bernoulli (the Bernoulli is the N=1 case); the Beta is its conjugate prior
  – Beta: continuous variable on [0,1]
• Discrete, multi-valued:
  – Multinomial: one of K values = K-dimensional binary vector (the binomial is the K=2 case); the Dirichlet is its conjugate prior
  – Dirichlet: K random variables on [0,1]
• Continuous:
  – Gaussian: limit of the binomial for large N
  – Gamma: conjugate prior of the univariate Gaussian precision
  – Wishart: conjugate prior of the multivariate Gaussian precision matrix
  – Gaussian-Gamma: conjugate prior of a univariate Gaussian with unknown mean and precision
  – Gaussian-Wishart: conjugate prior of a multivariate Gaussian with unknown mean and precision matrix
  – Student's-t: generalization of the Gaussian robust to outliers; an infinite mixture of Gaussians
  – Exponential: special case of the Gamma
  – Uniform
• Angular:
  – Von Mises