Machine Learning Srihari
Discrete Probability Distributions Sargur N. Srihari
Binary Variables
Bernoulli, Binomial and Beta
Bernoulli Distribution

• Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
• The probability of x=1 is denoted by the parameter µ, i.e.,
    p(x=1|µ) = µ
  – Therefore p(x=0|µ) = 1−µ
• The probability distribution has the form
    Bern(x|µ) = µ^x (1−µ)^{1−x}
• The mean is shown to be E[x] = µ, and the variance is var[x] = µ(1−µ)
• Likelihood of N observations drawn independently from p(x|µ):
    p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1−µ)^{1−x_n}
  – Log-likelihood:
    ln p(D|µ) = ∑_{n=1}^{N} ln p(x_n|µ) = ∑_{n=1}^{N} { x_n ln µ + (1−x_n) ln(1−µ) }
• Maximum likelihood estimator, obtained by setting the derivative of ln p(D|µ) wrt µ equal to zero:
    µ_ML = (1/N) ∑_{n=1}^{N} x_n
• If the number of observations with x=1 is m, then µ_ML = m/N

(Jacob Bernoulli, 1654–1705)
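The closed-form estimate µ_ML = m/N can be checked numerically. The slides' sample code is in Matlab, but a standard-library Python sketch serves the same purpose; the binary data below are made up for illustration:

```python
import math

# Made-up binary observations (m = 6 ones out of N = 10)
D = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]

N = len(D)
m = sum(D)        # number of observations with x = 1
mu_ml = m / N     # maximum likelihood estimate: mu_ML = m / N

# Log-likelihood ln p(D|mu) = sum_n { x_n ln mu + (1 - x_n) ln(1 - mu) }
def log_likelihood(mu, data):
    return sum(x * math.log(mu) + (1 - x) * math.log(1 - mu) for x in data)

# mu_ML maximizes the log-likelihood: nearby values score strictly lower
assert log_likelihood(mu_ml, D) > log_likelihood(mu_ml - 0.05, D)
assert log_likelihood(mu_ml, D) > log_likelihood(mu_ml + 0.05, D)
print(mu_ml)  # 0.6
```

The log-likelihood is concave in µ, so the stationary point m/N is its unique maximizer, which the two assertions spot-check.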
Binomial Distribution

• Related to the Bernoulli distribution
• Expresses the distribution of m, the number of observations for which x=1
• Each term is proportional to Bern(x|µ); the coefficient adds up all ways of obtaining m heads:
    Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m}
• Binomial coefficient:
    (N choose m) = N! / (m!(N−m)!)
• Mean and variance:
    E[m] = ∑_{m=0}^{N} m Bin(m|N,µ) = Nµ
    var[m] = Nµ(1−µ)

[Figure: histogram of the binomial distribution for N=10 and µ=0.25]
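The histogram for N=10 and µ=0.25 can be reproduced, and the mean and variance formulas verified, with a short Python sketch using only the standard library:

```python
import math

def binom_pmf(m, N, mu):
    """Bin(m|N,mu) = C(N,m) * mu^m * (1-mu)^(N-m)."""
    return math.comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25
pmf = [binom_pmf(m, N, mu) for m in range(N + 1)]

# The pmf sums to 1, and its mean/variance match N*mu and N*mu*(1-mu)
assert abs(sum(pmf) - 1.0) < 1e-12
mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))
assert abs(mean - N * mu) < 1e-9            # E[m] = N*mu = 2.5
assert abs(var - N * mu * (1 - mu)) < 1e-9  # var[m] = 1.875
```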
Beta Distribution

• Beta distribution:
    Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^{a−1} (1−µ)^{b−1}
• Where the Gamma function is defined as
    Γ(x) = ∫₀^∞ u^{x−1} e^{−u} du
• a and b are hyperparameters that control the distribution of the parameter µ
• Mean and variance:
    E[µ] = a / (a+b)
    var[µ] = ab / ((a+b)² (a+b+1))

[Figure: the beta distribution as a function of µ for hyperparameter settings (a,b) = (0.1, 0.1), (1, 1), (2, 3), (8, 4)]
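The normalization and the mean formula above can be sanity-checked numerically. A minimal Python sketch (a crude Riemann sum on a grid, chosen here purely for illustration):

```python
import math

def beta_pdf(mu, a, b):
    """Beta(mu|a,b) = Gamma(a+b)/(Gamma(a)Gamma(b)) * mu^(a-1) * (1-mu)^(b-1)."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * mu ** (a - 1) * (1 - mu) ** (b - 1)

# Check that the density integrates to ~1 and the mean matches a/(a+b)
a, b = 2.0, 3.0
step = 1e-4
grid = [i * step for i in range(1, 10000)]
total = sum(beta_pdf(mu, a, b) for mu in grid) * step
mean = sum(mu * beta_pdf(mu, a, b) for mu in grid) * step
assert abs(total - 1.0) < 1e-3
assert abs(mean - a / (a + b)) < 1e-3   # E[mu] = 2/5 = 0.4
```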
Bayesian Inference with Beta

• The MLE of µ in the Bernoulli is the fraction of observations with x=1
  – Severely over-fitted for small data sets
• The likelihood function is a product of factors of the form µ^x (1−µ)^{1−x}
• If the prior distribution of µ is chosen to be proportional to powers of µ and (1−µ), the posterior will have the same functional form as the prior
  – This property is called conjugacy
• The beta distribution has a form suitable for a prior distribution p(µ)
Bayesian Inference with Beta

• The posterior is obtained by multiplying the beta prior with the binomial likelihood:
    p(µ|m,l,a,b) ∝ µ^{m+a−1} (1−µ)^{l+b−1}
  – where m is the number of heads and l = N−m is the number of tails
• It is another beta distribution:
    p(µ|m,l,a,b) = [Γ(m+a+l+b) / (Γ(m+a)Γ(l+b))] µ^{m+a−1} (1−µ)^{l+b−1}
  – Effectively increases the value of a by m and of b by l
  – As the number of observations increases, the distribution becomes more peaked

[Figure: one step of the sequential process — prior p(µ) = Beta(µ|2,2), likelihood p(x=1|µ) = µ^1(1−µ)^0 for a single observation N=m=1 with x=1, posterior p(µ|x=1) = Beta(µ|3,2)]
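Because of conjugacy the update reduces to adding counts to the hyperparameters, which a few lines of Python make concrete (the figure's one-step example is reused as the test case):

```python
# Conjugate beta-binomial update: observing m heads and l tails turns
# a Beta(a, b) prior into a Beta(a + m, b + l) posterior.
def beta_posterior(a, b, m, l):
    return a + m, b + l

# One step of the illustration: prior Beta(2,2), single observation x=1
a_post, b_post = beta_posterior(2, 2, m=1, l=0)
assert (a_post, b_post) == (3, 2)

# More data -> a more peaked posterior: var = ab / ((a+b)^2 (a+b+1))
def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

assert beta_var(*beta_posterior(2, 2, 30, 10)) < beta_var(2, 2)
```

The final assertion illustrates the last bullet: after 40 observations the posterior variance is far smaller than the prior's.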
Predicting the Next Trial Outcome

• Need the predictive distribution of x given the observed data D
  – From the sum and product rules:
    p(x=1|D) = ∫₀¹ p(x=1, µ|D) dµ = ∫₀¹ p(x=1|µ) p(µ|D) dµ = ∫₀¹ µ p(µ|D) dµ = E[µ|D]
• The expected value of the posterior distribution can be shown to be
    p(x=1|D) = (m+a) / (m+a+l+b)
  – which is the fraction of observations (both fictitious and real) that correspond to x=1
• Maximum likelihood and Bayesian results agree in the limit of infinitely many observations
  – On average, uncertainty (variance) decreases with observed data
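The predictive formula and its large-N agreement with the MLE can both be checked directly; a small Python sketch with made-up counts:

```python
from fractions import Fraction

# Predictive probability of x=1 given D: p(x=1|D) = (m+a)/(m+a+l+b)
def predictive(m, l, a, b):
    return Fraction(m + a, m + a + l + b)

# With a Beta(2,2) prior and 6 heads / 4 tails observed:
p = predictive(m=6, l=4, a=2, b=2)
assert p == Fraction(8, 14)   # (6+2)/(6+2+4+2)

# As N grows, the Bayesian answer approaches the MLE m/N
assert abs(float(predictive(6000, 4000, 2, 2)) - 0.6) < 1e-3
```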
Summary of Binary Distributions

• A single binary variable's distribution is represented by the Bernoulli
• The binomial is related to the Bernoulli
  – Expresses the distribution of the number of occurrences of 1 (or 0) in N trials
• The beta distribution is a conjugate prior for the Bernoulli
  – Both have the same functional form
Sample Matlab Code: Probability Distributions

• Binomial distribution:
  – Probability density function: Y = binopdf(X,N,P) returns the binomial probability density function with parameters N and P at the values in X.
  – Random number generator: R = binornd(N,P,MM,NN) returns an MM-by-NN matrix of random numbers chosen from a binomial distribution with parameters N and P.
• Beta distribution:
  – Probability density function: Y = betapdf(X,A,B) returns the beta probability density function with parameters A and B at the values in X.
  – Random number generator: R = betarnd(A,B) returns a matrix of random numbers chosen from the beta distribution with parameters A and B.
Multinomial Variables
Generalized Bernoulli and Dirichlet
Generalization of Binomial

• Binomial: tossing a coin
  – Expresses the probability of the number of successes in N trials
  – e.g., probability of 3 rainy days in 10 days
• Multinomial: throwing a die
  – Probability of a given frequency for each value
  – e.g., probability of 3 specific letters in a string of N characters
• Probability calculator: http://stattrek.com/Tables/Multinomial.aspx

[Figure: histogram of the binomial distribution for N=10 and µ=0.25]
Generalized Bernoulli: Multinoulli

• Bernoulli distribution: x is 0 or 1
    Bern(x|µ) = µ^x (1−µ)^{1−x}
• Now consider a discrete variable that takes one of K values (instead of 2)
  – Represent it using the 1-of-K scheme
• Represent x as a K-dimensional vector
  – e.g., with K=6, if the variable takes its third state we represent it as x = (0,0,1,0,0,0)^T
  – Such vectors satisfy ∑_{k=1}^{K} x_k = 1
• If the probability of x_k=1 is denoted µ_k, then the distribution of x is given by
    p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}   where µ = (µ_1,..,µ_K)^T
  – This is the generalized Bernoulli
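The 1-of-K scheme and the product form of p(x|µ) are easy to express in a few lines of Python (the parameter vector below is made up for illustration):

```python
import math

def one_of_K(state, K):
    """Encode a state in {1,..,K} as a K-dimensional binary vector."""
    return [1 if k == state else 0 for k in range(1, K + 1)]

def multinoulli_p(x, mu):
    """p(x|mu) = prod_k mu_k^{x_k}; the product picks out mu_k of the active state."""
    return math.prod(m ** xk for m, xk in zip(mu, x))

mu = [0.1, 0.1, 0.3, 0.2, 0.1, 0.2]   # must sum to 1
x = one_of_K(3, K=6)                  # third state of six
assert x == [0, 0, 1, 0, 0, 0]
assert sum(x) == 1                    # 1-of-K constraint
assert multinoulli_p(x, mu) == 0.3    # probability of that state is mu_3
```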
MLE of Multinoulli Parameters

• Data set D of N independent observations x_1,..,x_N
  – where the nth observation is written as (x_n1,.., x_nK)
• The likelihood function has the form
    p(D|µ) = ∏_{n=1}^{N} ∏_{k=1}^{K} µ_k^{x_nk} = ∏_{k=1}^{K} µ_k^{∑_n x_nk} = ∏_{k=1}^{K} µ_k^{m_k}
  – where m_k = ∑_n x_nk is the number of observations with x_k=1
• Maximum likelihood solution
  – Maximize ln p(D|µ) with the Lagrangian constraint that the µ_k must sum to one, i.e., maximize
    ∑_{k=1}^{K} m_k ln µ_k + λ (∑_{k=1}^{K} µ_k − 1)
• Setting the derivative wrt µ_k to zero gives
    µ_k^ML = m_k / N
  – which is the fraction of the N observations for which x_k=1
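The resulting estimator is just per-state frequency counting, as a short Python sketch shows (the die-roll data are made up for illustration):

```python
from collections import Counter

# MLE for multinoulli parameters: mu_k = m_k / N
def multinoulli_mle(observations, K):
    N = len(observations)
    counts = Counter(observations)
    return [counts.get(k, 0) / N for k in range(1, K + 1)]

# Made-up die-roll data over K = 6 states
rolls = [1, 3, 3, 6, 2, 3, 5, 1, 3, 6]
mu_ml = multinoulli_mle(rolls, K=6)
assert mu_ml == [0.2, 0.1, 0.4, 0.0, 0.1, 0.2]
# The Lagrange constraint sum_k mu_k = 1 holds automatically
assert abs(sum(mu_ml) - 1.0) < 1e-12
```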
Generalized Binomial Distribution

• Multinomial distribution (over a K-state variable):
    Mult(m_1, m_2,.., m_K|µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
  – where ∑_k µ_k = 1 and ∑_k m_k = N
• The normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1, m_2,.., m_K
  – Given by
    (N choose m_1 m_2 .. m_K) = N! / (m_1! m_2! .. m_K!)
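The multinomial coefficient and pmf can be sketched in a few lines of Python; the fair-die example below is made up for illustration:

```python
import math

def multinomial_coef(counts):
    """N! / (m_1! m_2! .. m_K!) with N = sum of the counts."""
    N = sum(counts)
    denom = math.prod(math.factorial(m) for m in counts)
    return math.factorial(N) // denom

def multinomial_pmf(counts, mu):
    """Mult(m_1..m_K | mu, N) = coefficient * prod_k mu_k^{m_k}."""
    return multinomial_coef(counts) * math.prod(
        p ** m for p, m in zip(mu, counts))

# Fair die, N = 4 throws, outcome counts (2, 1, 1, 0, 0, 0)
mu = [1 / 6] * 6
assert multinomial_coef([2, 1, 1, 0, 0, 0]) == 12   # 4!/(2!1!1!) = 12
p = multinomial_pmf([2, 1, 1, 0, 0, 0], mu)
assert abs(p - 12 / 6**4) < 1e-12
```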
Dirichlet Distribution

• A family of prior distributions for the parameters µ_k of the multinomial distribution
• By inspection of the multinomial, the form of the conjugate prior is
    p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k−1}   where 0 ≤ µ_k ≤ 1 and ∑_k µ_k = 1
• Normalized form of the Dirichlet distribution:
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)...Γ(α_K))] ∏_{k=1}^{K} µ_k^{α_k−1}   where α_0 = ∑_{k=1}^{K} α_k

(Lejeune Dirichlet, 1805–1859)
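The normalized density above translates directly into Python; as a sanity check, with all α_k = 1 the Dirichlet is uniform over the simplex with constant density Γ(K) = (K−1)!:

```python
import math

def dirichlet_pdf(mu, alpha):
    """Dir(mu|alpha) = Gamma(a0) / prod_k Gamma(a_k) * prod_k mu_k^(a_k - 1)."""
    a0 = sum(alpha)
    coef = math.gamma(a0) / math.prod(math.gamma(a) for a in alpha)
    return coef * math.prod(m ** (a - 1) for m, a in zip(mu, alpha))

# alpha_k = 1 for all k: density is (K-1)! at every point of the simplex
K = 3
point = [0.2, 0.3, 0.5]        # any point with components summing to 1
assert abs(dirichlet_pdf(point, [1.0] * K) - math.factorial(K - 1)) < 1e-12
```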
Dirichlet over 3 Variables

• Due to the summation constraint ∑_k µ_k = 1, the distribution over the space of {µ_k} is confined to a simplex of dimensionality K−1
• For K=3:
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)Γ(α_2)Γ(α_3))] ∏_{k=1}^{3} µ_k^{α_k−1}   where α_0 = ∑_{k=1}^{3} α_k

[Figure: plots of the Dirichlet distribution over the simplex for parameter settings α_k = 0.1, α_k = 1, and α_k = 10]
Dirichlet Posterior Distribution

• Multiplying the prior by the likelihood:
    p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k+m_k−1}
• Which has the form of the Dirichlet distribution:
    p(µ|D,α) = Dir(µ|α+m) = [Γ(α_0+N) / (Γ(α_1+m_1)...Γ(α_K+m_K))] ∏_{k=1}^{K} µ_k^{α_k+m_k−1}
  – where m = (m_1,.., m_K)^T
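As in the beta-binomial case, conjugacy makes the update purely additive: the observed counts m_k add to the hyperparameters α_k. A small Python sketch with made-up die rolls:

```python
from collections import Counter

# Conjugate Dirichlet-multinomial update: counts m_k add to alpha_k
def dirichlet_posterior(alpha, observations):
    counts = Counter(observations)
    return [a + counts.get(k, 0) for k, a in enumerate(alpha, start=1)]

# Symmetric Dir(1,1,1,1,1,1) prior plus made-up die rolls
alpha = [1, 1, 1, 1, 1, 1]
rolls = [1, 3, 3, 6, 2, 3]
post = dirichlet_posterior(alpha, rolls)
assert post == [2, 2, 4, 1, 1, 2]

# Posterior mean E[mu_k] = (alpha_k + m_k) / (alpha_0 + N)
a0_N = sum(post)
assert a0_N == sum(alpha) + len(rolls)   # 6 + 6 = 12
assert post[2] / a0_N == 4 / 12          # state 3 was observed most often
```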
Summary of Discrete Distributions

• Bernoulli (2 states):
    Bern(x|µ) = µ^x (1−µ)^{1−x}
  – Binomial:
    Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m},   where (N choose m) = N! / (m!(N−m)!)
• Generalized Bernoulli (K states):
    p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}   where µ = (µ_1,..,µ_K)^T
  – Multinomial:
    Mult(m_1, m_2,.., m_K|µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
• Conjugate priors:
  – For the binomial it is the beta:
    Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^{a−1} (1−µ)^{b−1}
  – For the multinomial it is the Dirichlet:
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)...Γ(α_K))] ∏_{k=1}^{K} µ_k^{α_k−1}   where α_0 = ∑_{k=1}^{K} α_k
Distributions: Landscape

• Discrete, binary: Bernoulli, Binomial, Beta
• Discrete, multivalued: Multinomial, Dirichlet
• Continuous: Gaussian, Gamma, Wishart, Student's-t, Exponential, Uniform
• Angular: Von Mises
Distributions: Relationships

• Discrete, binary:
  – Bernoulli: single binary variable
  – Binomial: N samples of a Bernoulli (the Bernoulli is the N=1 case); the Beta is its conjugate prior
  – Beta: continuous variable on [0,1]
• Discrete, multi-valued:
  – Multinomial: one of K values = K-dimensional binary vector (the binomial is the K=2 case); the Dirichlet is its conjugate prior
  – Dirichlet: K random variables on [0,1]
• Continuous:
  – Gaussian: limit of the binomial for large N
  – Gamma: conjugate prior of the univariate Gaussian precision
  – Wishart: conjugate prior of the multivariate Gaussian precision matrix
  – Gaussian-Gamma: conjugate prior of a univariate Gaussian with unknown mean and precision
  – Gaussian-Wishart: conjugate prior of a multivariate Gaussian with unknown mean and precision matrix
  – Student's-t: generalization of the Gaussian robust to outliers; an infinite mixture of Gaussians
  – Exponential: special case of the Gamma
  – Uniform
• Angular:
  – Von Mises