
The Exponential Family of Distributions


Prof. Nicholas Zabaras
School of Engineering, University of Warwick
Coventry CV4 7AL, United Kingdom
Email: [email protected]
URL: http://www.zabaras.com/

Bayesian Scientific Computing, Spring 2013
August 12, 2014

Contents

The Exponential Family: the Bernoulli distribution (introducing the logistic sigmoid), the Poisson distribution, the multinomial distribution (introducing the softmax)

The Beta distribution, the Gamma distribution, the Gaussian distribution, the von Mises distribution, the multivariate Gaussian

Computing the moments of a distribution from the exponential family, moment parametrization, sufficiency and Neyman factorization, sufficient statistics and MLE estimates, MLE and Kullback-Leibler distance

Conjugate priors, posterior predictive, maximum entropy and the exponential family

Generalized linear models, canonical response function, batch IRLS, sequential estimation (LMS)

References: Chris Bishop's PRML book, Chapter 2; M. Jordan, An Introduction to Probabilistic Graphical Models, Chapter 8 (preprint).

Exponential Family

A large family of useful distributions with common properties: Bernoulli, beta, binomial, chi-square, Dirichlet, gamma, Gaussian, geometric, multinomial, Poisson, Weibull, ...

Not in the family: uniform, Student's t, Cauchy, Laplace, mixture of Gaussians, ...

The variable can be discrete or continuous (or vectors thereof).

We will focus on the conditional setting in which we have a directed model X → Y with both X and Y observed, and with p(Y|X) being an exponential family distribution parametrized using Generalized Linear Models (GLIMs).

Exponential Family

The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form

$$p(\mathbf{x}\mid\boldsymbol\eta)=h(\mathbf{x})\,g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\},\quad\text{or}\quad p(\mathbf{x}\mid\boldsymbol\eta)=h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})-A(\boldsymbol\eta)\},\qquad A(\boldsymbol\eta):=-\log g(\boldsymbol\eta).$$

Here x is scalar or vector, discrete or continuous; η are the natural parameters, and u(x) is referred to as a sufficient statistic.

g(η) ensures that the distribution is normalized and satisfies

$$g(\boldsymbol\eta)\int h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}\,d\mathbf{x}=1.$$

The normalization factor Z and its logarithm A are defined as

$$Z(\boldsymbol\eta)=\frac{1}{g(\boldsymbol\eta)},\qquad A(\boldsymbol\eta)=\ln Z(\boldsymbol\eta)=-\ln g(\boldsymbol\eta)=\ln\int h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}\,d\mathbf{x}.$$

The set of η for which $\int h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}\,d\mathbf{x}<\infty$ is the natural parameter space.
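As a numerical illustration (not on the original slides), here is a minimal Python sketch, assuming NumPy and SciPy, that computes A(η) by quadrature for the exponential distribution p(x|λ) = λ e^{−λx}, for which h(x) = 1, u(x) = x, η = −λ, and A(η) = −ln(−η); the function name is illustrative.

```python
# Compute A(eta) = ln ∫ h(x) exp(eta*u(x)) dx numerically and compare
# with the closed form for the exponential distribution.
import numpy as np
from scipy.integrate import quad

def log_partition(eta, h, u, lo, hi):
    Z, _ = quad(lambda x: h(x) * np.exp(eta * u(x)), lo, hi)
    return np.log(Z)

eta = -2.0  # corresponds to lambda = 2
A_num = log_partition(eta, h=lambda x: 1.0, u=lambda x: x, lo=0.0, hi=np.inf)
A_exact = -np.log(-eta)
print(A_num, A_exact)  # both ≈ -0.6931
```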

Exponential Family

When the parameter θ enters the exponential family through η(θ), we write the probability density of the exponential family as follows:

$$p(\mathbf{x}\mid\boldsymbol\theta)=h(\mathbf{x})\,g(\boldsymbol\eta(\boldsymbol\theta))\exp\{\boldsymbol\eta(\boldsymbol\theta)^{T}\mathbf{u}(\mathbf{x})\},\quad\text{or}\quad p(\mathbf{x}\mid\boldsymbol\theta)=h(\mathbf{x})\exp\{\boldsymbol\eta(\boldsymbol\theta)^{T}\mathbf{u}(\mathbf{x})-A(\boldsymbol\eta(\boldsymbol\theta))\},\qquad A:=-\log g.$$

η(θ) are the canonical or natural parameters; θ is the parameter vector of some distribution that can be written in the exponential family format.

Joint Probability Distribution on Discrete RVs

We can show that any joint probability distribution on discrete random variables lies in the exponential family. Indeed, writing the joint as a product of potentials $\psi_{C}$ on cliques C:

$$p(\mathbf{x}\mid\boldsymbol\psi)=\frac{1}{Z(\boldsymbol\psi)}\exp\left\{\sum_{C}\log\psi_{C}(\mathbf{x}_{C})\right\}.$$

But for discrete rvs, with $(v_{i_1},v_{i_2},\ldots,v_{i_k})$ ranging over the configurations of the clique $C=\{i_1,i_2,\ldots,i_k\}$,

$$\log\psi_{C}(\mathbf{x}_{C})=\sum_{v_{i_1},v_{i_2},\ldots,v_{i_k}}\delta\!\left(x_{i_1}=v_{i_1},\,x_{i_2}=v_{i_2},\,\ldots,\,x_{i_k}=v_{i_k}\right)\log\psi_{C}(v_{i_1},v_{i_2},\ldots,v_{i_k}).$$

Substitution into the 1st equation gives:

$$p(\mathbf{x}\mid\boldsymbol\psi)=\exp\left\{\sum_{C}\sum_{v_{i_1},\ldots,v_{i_k}}\delta\!\left(x_{i_1}=v_{i_1},\ldots,x_{i_k}=v_{i_k}\right)\log\psi_{C}(v_{i_1},\ldots,v_{i_k})-\log Z(\boldsymbol\psi)\right\}.$$

This is in the exponential family with $\log\psi_{C}(v_{i_1},\ldots,v_{i_k})$ corresponding to each component of η, and the indicators $\delta(x_{i_1}=v_{i_1},\ldots,x_{i_k}=v_{i_k})$ corresponding to components of the sufficient statistic u(x).
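To make this concrete, here is a minimal sketch (names illustrative, NumPy assumed) that writes a small discrete joint in exponential family form, with one natural parameter per configuration and indicator sufficient statistics.

```python
# A joint over two binary rvs in exponential-family form:
# eta = log-potentials, u(x) = indicator vector over configurations.
import numpy as np

p = np.array([[0.1, 0.2], [0.3, 0.4]])  # joint table p(x1, x2)
eta = np.log(p).ravel()                  # one natural parameter per configuration

def u(x1, x2):
    ind = np.zeros(4)
    ind[2 * x1 + x2] = 1.0               # indicator delta(x = v) for config v
    return ind

# p(x) = exp(eta^T u(x)) reproduces the table (Z = 1: the table is normalized)
for x1 in (0, 1):
    for x2 in (0, 1):
        assert np.isclose(np.exp(eta @ u(x1, x2)), p[x1, x2])
```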

Exponential Family: The Bernoulli Distribution

Consider the Bernoulli distribution considered earlier:

$$\mathrm{Bern}(x\mid\mu)=\mu^{x}(1-\mu)^{1-x}=\exp\{x\ln\mu+(1-x)\ln(1-\mu)\}=\underbrace{(1-\mu)}_{g(\eta)}\exp\left\{x\ln\frac{\mu}{1-\mu}\right\}.$$

From this we see that (note that the relation μ(η) is invertible)

$$\eta=\ln\frac{\mu}{1-\mu},\qquad \mu=\sigma(\eta)=\frac{1}{1+e^{-\eta}}\quad\text{(the logistic sigmoid function)},$$

and

$$g(\eta)=1-\mu(\eta)=\sigma(-\eta).$$

Finally, comparing with $p(\mathbf{x}\mid\boldsymbol\eta)=h(\mathbf{x})g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}=h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})-A(\boldsymbol\eta)\}$:

$$p(x\mid\eta)=g(\eta)\exp\{\eta x\},\qquad u(x)=x,\quad h(x)=1,\quad g(\eta)=\sigma(-\eta),$$
$$A(\eta)=-\log g(\eta)=\log(1+e^{\eta})=-\log(1-\mu).$$
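A minimal numerical check of these identities (NumPy assumed; not part of the original slides):

```python
# Bernoulli natural-parameter mapping: eta = ln(mu/(1-mu)), mu = sigmoid(eta),
# A(eta) = log(1 + e^eta).
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

mu = 0.3
eta = np.log(mu / (1 - mu))          # natural parameter (log-odds)
assert np.isclose(sigmoid(eta), mu)  # mu(eta) inverts eta(mu)

A = np.log1p(np.exp(eta))            # log-partition function
for x in (0, 1):                     # p(x|eta) = exp(eta*x - A) matches Bern(x|mu)
    assert np.isclose(np.exp(eta * x - A), mu**x * (1 - mu)**(1 - x))
```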

Exponential Family: The Poisson Distribution

Consider the Poisson distribution with parameter λ:

$$\mathrm{Poisson}(x\mid\lambda)=\frac{\lambda^{x}e^{-\lambda}}{x!}=\underbrace{\frac{1}{x!}}_{h(x)}\exp\{\underbrace{x}_{u(x)}\ln\lambda-\underbrace{\lambda}_{A(\eta)}\},\qquad \eta=\ln\lambda.$$

Recall that λ is the mean of the distribution, and observe once more that the relation λ(η) is invertible:

$$\eta=\ln\lambda\;\Longleftrightarrow\;\lambda=e^{\eta},\qquad\text{so}\quad A(\eta)=e^{\eta}.$$
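A minimal sketch (NumPy/SciPy assumed) confirming the Poisson log-partition numerically:

```python
# Poisson in natural form: h(x) = 1/x!, eta = ln(lambda), A(eta) = e^eta.
# The log-partition summed numerically matches the closed form.
import numpy as np
from scipy.special import gammaln

lam = 3.5
eta = np.log(lam)
x = np.arange(0, 200)
# Z = sum_x exp(eta*x)/x!  =>  A = log Z should equal e^eta = lambda
A_num = np.log(np.sum(np.exp(eta * x - gammaln(x + 1))))
assert np.isclose(A_num, np.exp(eta))
```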

Exponential Family: The Multinoulli Distribution

Consider the multinomial distribution (for a single observation, i.e. the multinoulli) considered earlier:

$$p(\mathbf{x}\mid\boldsymbol\mu)=\prod_{k=1}^{M}\mu_{k}^{x_{k}}=\exp\left\{\sum_{k=1}^{M}x_{k}\ln\mu_{k}\right\}=\exp\{\boldsymbol\eta^{T}\mathbf{x}\},\qquad \boldsymbol\eta=(\ln\mu_{1},\ldots,\ln\mu_{M})^{T}.$$

Comparing with $p(\mathbf{x}\mid\boldsymbol\eta)=h(\mathbf{x})g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}$, we see that h(x) = 1, u(x) = x, g(η) = 1. It appears also that A(η) = 0!

We can resolve this problem by accounting for the dependence of the μ_k, i.e.

$$\sum_{k=1}^{M}\mu_{k}=1.$$

Exponential Family: The Multinoulli Distribution

We will express the distribution in terms of μ_k, k = 1, ..., M−1, subject to:

$$0\le\mu_{k}\le 1,\qquad \sum_{k=1}^{M-1}\mu_{k}\le 1.$$

The multinoulli distribution becomes:

$$\exp\left\{\sum_{k=1}^{M}x_{k}\ln\mu_{k}\right\}
=\exp\left\{\sum_{k=1}^{M-1}x_{k}\ln\mu_{k}+\left(1-\sum_{k=1}^{M-1}x_{k}\right)\ln\left(1-\sum_{k=1}^{M-1}\mu_{k}\right)\right\}
=\exp\left\{\sum_{k=1}^{M-1}x_{k}\ln\frac{\mu_{k}}{1-\sum_{j=1}^{M-1}\mu_{j}}+\ln\left(1-\sum_{k=1}^{M-1}\mu_{k}\right)\right\}.$$

Exponential Family: The Multinoulli Distribution

We identify

$$\eta_{k}=\ln\frac{\mu_{k}}{1-\sum_{j=1}^{M-1}\mu_{j}},\qquad k=1,\ldots,M-1.$$

We can also define

$$\eta_{M}=\ln\frac{\mu_{M}}{\mu_{M}}=0.$$

This equation can be inverted: summing $\mu_{k}=\bigl(1-\sum_{j=1}^{M-1}\mu_{j}\bigr)e^{\eta_{k}}$ over k gives

$$\sum_{k=1}^{M-1}\mu_{k}=\left(1-\sum_{j=1}^{M-1}\mu_{j}\right)\sum_{k=1}^{M-1}e^{\eta_{k}}
\;\;\Longrightarrow\;\;
1-\sum_{j=1}^{M-1}\mu_{j}=\frac{1}{1+\sum_{k=1}^{M-1}e^{\eta_{k}}},
\qquad
\mu_{k}=\frac{e^{\eta_{k}}}{1+\sum_{j=1}^{M-1}e^{\eta_{j}}}.$$

Substitution into the expression at the top of the slide:

$$\ln\frac{\mu_{k}}{1-\sum_{j=1}^{M-1}\mu_{j}}=\eta_{k},\qquad
\ln\left(1-\sum_{k=1}^{M-1}\mu_{k}\right)=-\ln\left(1+\sum_{k=1}^{M-1}e^{\eta_{k}}\right).$$

Exponential Family: The Multinoulli Distribution

This is the so-called softmax function (note again that the relation μ(η) is invertible):

$$\mu_{k}=\frac{\exp(\eta_{k})}{1+\sum_{j=1}^{M-1}\exp(\eta_{j})}.$$

In this reduced representation, the distribution takes the form

$$p(\mathbf{x}\mid\boldsymbol\eta)=\exp\left\{\sum_{k=1}^{M-1}x_{k}\eta_{k}+\ln\left(1-\sum_{k=1}^{M-1}\mu_{k}\right)\right\}=\left(1+\sum_{k=1}^{M-1}e^{\eta_{k}}\right)^{-1}\exp(\boldsymbol\eta^{T}\mathbf{x}).$$

Comparing with the generic form of the exponential family $p(\mathbf{x}\mid\boldsymbol\eta)=h(\mathbf{x})g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}$:

$$\boldsymbol\eta=(\eta_{1},\ldots,\eta_{M-1},0)^{T},\quad \mathbf{u}(\mathbf{x})=\mathbf{x},\quad h(\mathbf{x})=1,\quad g(\boldsymbol\eta)=\left(1+\sum_{k=1}^{M-1}e^{\eta_{k}}\right)^{-1},$$
$$A(\boldsymbol\eta)=-\ln g(\boldsymbol\eta)=\ln\left(1+\sum_{k=1}^{M-1}e^{\eta_{k}}\right)=\ln\sum_{k=1}^{M}e^{\eta_{k}}=-\ln\mu_{M}.$$
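A minimal sketch of the softmax map and its inverse in this reduced (M−1)-parameter representation (NumPy assumed, values illustrative):

```python
import numpy as np

def mu_from_eta(eta):                       # softmax with eta_M fixed at 0
    e = np.exp(eta)
    return e / (1.0 + e.sum())

def eta_from_mu(mu):                        # inverse: eta_k = ln(mu_k / mu_M)
    mu_M = 1.0 - mu.sum()
    return np.log(mu / mu_M)

eta = np.array([0.5, -1.0, 2.0])            # M = 4, reduced parameters
mu = mu_from_eta(eta)
assert np.allclose(eta_from_mu(mu), eta)    # the relation mu(eta) is invertible
A = np.log1p(np.exp(eta).sum())             # A(eta) = -ln(mu_M)
assert np.isclose(A, -np.log(1.0 - mu.sum()))
```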

Exponential Family: The Beta Distribution

Consider the Beta distribution

$$\mathrm{Beta}(\mu\mid a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}
=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\exp\left\{(a-1)\ln\mu+(b-1)\ln(1-\mu)\right\}.$$

Comparing this with our exponential family form $p(\mathbf{x}\mid\boldsymbol\eta)=h(\mathbf{x})g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}=h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})-A(\boldsymbol\eta)\}$, we can easily identify:

$$\mathbf{u}(\mu)=(\ln\mu,\;\ln(1-\mu))^{T},\quad \boldsymbol\eta=(a-1,\;b-1)^{T},\quad h(\mu)=1,\quad g(a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)},$$
$$A(a,b)=\ln\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$

Exponential Family: The Gamma Distribution

Consider the Gamma distribution

$$\mathrm{Gamma}(\lambda\mid a,b)=\frac{b^{a}}{\Gamma(a)}\lambda^{a-1}e^{-b\lambda}
=\frac{b^{a}}{\Gamma(a)}\exp\left\{(a-1)\ln\lambda-b\lambda\right\}.$$

Comparing this with our exponential family form, we can easily identify:

$$\mathbf{u}(\lambda)=(\lambda,\;\ln\lambda)^{T},\quad \boldsymbol\eta=(-b,\;a-1)^{T},\quad h(\lambda)=1,\quad g(a,b)=\frac{b^{a}}{\Gamma(a)},\quad A(a,b)=\ln\frac{\Gamma(a)}{b^{a}}.$$

Exponential Family: The Gaussian

Consider the univariate Gaussian:

$$p(x\mid\mu,\sigma^{2})=\frac{1}{(2\pi\sigma^{2})^{1/2}}\exp\left\{-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right\}
=\frac{1}{(2\pi\sigma^{2})^{1/2}}\exp\left\{-\frac{1}{2\sigma^{2}}x^{2}+\frac{\mu}{\sigma^{2}}x-\frac{1}{2\sigma^{2}}\mu^{2}\right\}.$$

Comparing this with our exponential family form $p(x\mid\boldsymbol\eta)=h(x)g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{T}\mathbf{u}(x)\}=h(x)\exp\{\boldsymbol\eta^{T}\mathbf{u}(x)-A(\boldsymbol\eta)\}$, we can identify (this is a two-parameter distribution):

$$\mathbf{u}(x)=(x,\;x^{2})^{T},\quad \boldsymbol\eta=\left(\frac{\mu}{\sigma^{2}},\;-\frac{1}{2\sigma^{2}}\right)^{T},\quad h(x)=(2\pi)^{-1/2},\quad g(\boldsymbol\eta)=(-2\eta_{2})^{1/2}\exp\left(\frac{\eta_{1}^{2}}{4\eta_{2}}\right),$$
$$A(\boldsymbol\eta)=-\frac{\eta_{1}^{2}}{4\eta_{2}}-\frac{1}{2}\ln(-2\eta_{2}).$$
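A minimal sketch (NumPy/SciPy assumed) mapping (μ, σ²) to the natural parameters and checking the density against the standard Gaussian formula:

```python
import numpy as np
from scipy.stats import norm

mu, var = 1.5, 0.7
eta = np.array([mu / var, -1.0 / (2 * var)])
A = -eta[0]**2 / (4 * eta[1]) - 0.5 * np.log(-2 * eta[1])

x = 0.3
u = np.array([x, x**2])
p = (2 * np.pi)**-0.5 * np.exp(eta @ u - A)   # h(x) exp(eta^T u(x) - A(eta))
assert np.isclose(p, norm.pdf(x, mu, np.sqrt(var)))
```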

Exponential Family: The von Mises Distribution

Consider the von Mises distribution

$$p(\theta\mid\theta_{0},m)=\frac{1}{2\pi I_{0}(m)}\exp\left\{m\cos(\theta-\theta_{0})\right\}
=\frac{1}{2\pi I_{0}(m)}\exp\left\{m\cos\theta_{0}\cos\theta+m\sin\theta_{0}\sin\theta\right\},$$

where $I_{0}$ denotes the modified Bessel function of the first kind of order zero.

Comparing this with our exponential family form, we can easily identify:

$$\mathbf{u}(\theta)=(\cos\theta,\;\sin\theta)^{T},\quad \boldsymbol\eta=(m\cos\theta_{0},\;m\sin\theta_{0})^{T},\quad h(\theta)=1,\quad g(\theta_{0},m)=\frac{1}{2\pi I_{0}(m)},$$
$$A(\theta_{0},m)=\ln\left(2\pi I_{0}(m)\right).$$

The Multivariate Gaussian

The exponent in the multivariate Gaussian is (with the precision matrix Λ := Σ⁻¹):

$$-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^{T}\boldsymbol\Lambda(\mathbf{x}-\boldsymbol\mu)
=-\frac{1}{2}\operatorname{tr}\left(\boldsymbol\Lambda\,\mathbf{x}\mathbf{x}^{T}\right)+\boldsymbol\mu^{T}\boldsymbol\Lambda\mathbf{x}-\frac{1}{2}\boldsymbol\mu^{T}\boldsymbol\Lambda\boldsymbol\mu.$$

We need to put this in the form $p(\mathbf{x}\mid\boldsymbol\eta)=h(\mathbf{x})g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}$.

The 3rd term contributes to g(η), whereas the 2nd term is directly an inner product between x and ξ := Λμ.

For the 1st term, define two D²-dimensional vectors vec(Λ) and vec(xxᵀ) that consist of the columns of Λ and xxᵀ, respectively. Then the 1st term has the form of an inner product between these two vectors. In summary:

$$\boldsymbol\eta=\begin{pmatrix}\boldsymbol\xi\\ -\tfrac{1}{2}\operatorname{vec}(\boldsymbol\Lambda)\end{pmatrix},\quad
\mathbf{u}(\mathbf{x})=\begin{pmatrix}\mathbf{x}\\ \operatorname{vec}(\mathbf{x}\mathbf{x}^{T})\end{pmatrix},\quad
h(\mathbf{x})=(2\pi)^{-D/2},\quad
g(\boldsymbol\eta)=|\boldsymbol\Lambda|^{1/2}\exp\left(-\frac{1}{2}\boldsymbol\xi^{T}\boldsymbol\Lambda^{-1}\boldsymbol\xi\right),$$
$$A(\boldsymbol\eta)=-\ln g(\boldsymbol\eta)=\frac{1}{2}\boldsymbol\xi^{T}\boldsymbol\Lambda^{-1}\boldsymbol\xi-\frac{1}{2}\ln|\boldsymbol\Lambda|.$$

Computing Moments of the Sufficient Statistics u(x)

Differentiate with respect to η the normalization condition $\int p(\mathbf{x}\mid\boldsymbol\eta)\,d\mathbf{x}=1$ for the exponential family $p(\mathbf{x}\mid\boldsymbol\eta)=h(\mathbf{x})g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}$:

$$\nabla g(\boldsymbol\eta)\int h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}\,d\mathbf{x}
+g(\boldsymbol\eta)\int h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}\,\mathbf{u}(\mathbf{x})\,d\mathbf{x}=0,$$

so that

$$-\frac{\nabla g(\boldsymbol\eta)}{g(\boldsymbol\eta)}=g(\boldsymbol\eta)\int h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}\,\mathbf{u}(\mathbf{x})\,d\mathbf{x}=\mathbb{E}[\mathbf{u}(\mathbf{x})].$$

The above equation is simpler if written in terms of the partition function Z = 1/g(η), or A = log Z:

$$\nabla A(\boldsymbol\eta)=\mathbb{E}[\mathbf{u}(\mathbf{x})].$$

Let us rewrite this equation explicitly as:

$$\nabla A(\boldsymbol\eta)=g(\boldsymbol\eta)\int h(\mathbf{x})\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})\}\,\mathbf{u}(\mathbf{x})\,d\mathbf{x}.$$

We can compute the variance of u(x) by differentiating this equation once more with respect to η.

Computing Moments of the Sufficient Statistics u(x)

Differentiating once more,

$$\nabla\nabla A(\boldsymbol\eta)
=\mathbb{E}\left[\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^{T}\right]-\mathbb{E}[\mathbf{u}(\mathbf{x})]\,\mathbb{E}[\mathbf{u}(\mathbf{x})]^{T},$$

so that

$$\nabla\nabla A(\boldsymbol\eta)=\operatorname{cov}[\mathbf{u}(\mathbf{x})]\succeq 0,\qquad\text{so }A(\boldsymbol\eta)\text{ is convex}.$$

Thus the covariance of u(x) can be expressed in terms of the 2nd derivatives of A(η), and similarly for higher-order moments.

Provided we can normalize a distribution from the exponential family, we can always find its moments by simple differentiation.

Computing Moments of the Sufficient Statistics u(x)

Let us check the relations $\nabla A(\boldsymbol\eta)=\mathbb{E}[\mathbf{u}(x)]$ and $\nabla\nabla A(\boldsymbol\eta)=\operatorname{cov}[\mathbf{u}(x)]$ for the univariate Gaussian:

$$A(\boldsymbol\eta)=-\frac{\eta_{1}^{2}}{4\eta_{2}}-\frac{1}{2}\ln(-2\eta_{2}),\qquad
\boldsymbol\eta=\left(\frac{\mu}{\sigma^{2}},\;-\frac{1}{2\sigma^{2}}\right)^{T},\qquad
\mathbf{u}(x)=(x,\;x^{2})^{T}.$$

Then

$$\frac{\partial A}{\partial\eta_{1}}=-\frac{\eta_{1}}{2\eta_{2}}=\mu=\mathbb{E}[X],\qquad
\frac{\partial A}{\partial\eta_{2}}=\frac{\eta_{1}^{2}}{4\eta_{2}^{2}}-\frac{1}{2\eta_{2}}=\mu^{2}+\sigma^{2}=\mathbb{E}[X^{2}],$$
$$\operatorname{var}[X]=\frac{\partial^{2}A}{\partial\eta_{1}^{2}}=-\frac{1}{2\eta_{2}}=\sigma^{2},\quad\text{etc.}$$
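The same check can be done numerically. A minimal sketch (NumPy assumed) verifying ∇A(η) = E[u(x)] by central finite differences of the closed-form A(η):

```python
import numpy as np

def A(eta1, eta2):
    return -eta1**2 / (4 * eta2) - 0.5 * np.log(-2 * eta2)

mu, var = 1.5, 0.7
eta1, eta2 = mu / var, -1 / (2 * var)
h = 1e-6
dA1 = (A(eta1 + h, eta2) - A(eta1 - h, eta2)) / (2 * h)
dA2 = (A(eta1, eta2 + h) - A(eta1, eta2 - h)) / (2 * h)
assert np.isclose(dA1, mu)             # E[X]
assert np.isclose(dA2, mu**2 + var)    # E[X^2]
```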

Moment Parametrization

We have shown that we can compute the mean of the sufficient statistics, μ := E[u(x)], in terms of the canonical parameter η:

$$\boldsymbol\mu:=\mathbb{E}[\mathbf{u}(\mathbf{x})]=\nabla A(\boldsymbol\eta).$$

We have also shown that A(η) is a convex function. Since for a convex function there is a one-to-one relation between the argument of the function and its derivative, the mapping μ(η) is invertible.

Thus the exponential family of distributions can also be parameterized in terms of μ (the moment parametrization), exactly as we started our earlier calculations (e.g., for the Gaussian).

Sufficiency

u(X) is sufficient for θ if there is no information in X regarding θ beyond that in u(X). Having observed u(X), we can throw away X for the purposes of inference with respect to θ.

In the Bayesian approach, shown as a directed graphical model over θ, u(X) and X in the figure on the original slide, we treat θ as an rv and say that u(X) is sufficient for θ if the following conditional independence (CI) statement holds:

$$\theta\perp X\mid u(X),\qquad\text{i.e.}\qquad p(\theta\mid u(x),x)=p(\theta\mid u(x)).$$

Thus, u(X) contains all the needed information in X about θ.

Sufficiency

The model in Fig. (b) of the original slide asserts the same CI relations as Fig. (a) shown earlier but has a different parametrization.

Treating θ as a label, we can read the above CI statement as a frequentist definition of sufficiency: u(X) is sufficient for θ if p(x|u(x)) is not a function of θ:

$$p(x\mid u(x),\theta)=p(x\mid u(x)).$$

The two approaches discussed imply a particular factorization of p(x|θ).

Neyman Factorization Theorem

From the undirected graph (which expresses the same CI relations as the two earlier graphs), we can factorize as:

$$p(x,u(x)\mid\theta)=\psi_{1}(u(x),\theta)\,\psi_{2}(x,u(x)).$$

On the left, u(x) is a deterministic function of x and can be dropped as an argument:

$$p(x\mid\theta)=\psi_{1}(u(x),\theta)\,\psi_{2}(x,u(x)).$$

Conversely, given ψ₁, ψ₂, one can derive $p(x\mid u(x),\theta)=p(x,u(x)\mid\theta)/p(u(x)\mid\theta)$, in which ψ₁(u(x), θ) cancels between numerator and denominator, so the CI statement holds.

We can now see why u(x) is a sufficient statistic for η in the exponential family:

$$p(\mathbf{x}\mid\boldsymbol\eta)=\underbrace{h(\mathbf{x})}_{\psi_{2}(x,\,u(x))}\,\underbrace{\exp\{\boldsymbol\eta^{T}\mathbf{u}(\mathbf{x})-A(\boldsymbol\eta)\}}_{\psi_{1}(u(x),\,\eta)}.$$

MLE for the Exponential Family

The joint density for a data set X = {x₁, ..., x_N} is itself an exponential family distribution with sufficient statistic $\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_{n})$:

$$p(\mathbf{X}\mid\boldsymbol\eta)=\left(\prod_{n=1}^{N}h(\mathbf{x}_{n})\right)g(\boldsymbol\eta)^{N}\exp\left\{\boldsymbol\eta^{T}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_{n})\right\},$$
$$\ln p(\mathbf{X}\mid\boldsymbol\eta)=\sum_{n=1}^{N}\ln h(\mathbf{x}_{n})+N\ln g(\boldsymbol\eta)+\boldsymbol\eta^{T}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_{n})
=\sum_{n=1}^{N}\ln h(\mathbf{x}_{n})-NA(\boldsymbol\eta)+\boldsymbol\eta^{T}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_{n}).$$

The exponential family is the only family of distributions with finite sufficient statistics (size independent of the data set size).

The log-likelihood is concave (A is convex) and has a unique maximum. Maximizing with respect to η gives:

$$\nabla A(\boldsymbol\eta_{ML})=\frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_{n}).$$

At the MLE, the empirical average of the sufficient statistic equals the model's theoretical expected sufficient statistic (moment matching).

Thus to find the expected value of the sufficient statistics, one can use the data directly without having to estimate η. When u(x) = x, this allows us to compute the expectation of x directly from the data.
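A minimal sketch of this moment matching for the Poisson (NumPy assumed; values illustrative, not from the slides):

```python
# For the Poisson, grad A(eta) = e^eta, so solving e^eta = mean(x) gives
# eta_ML = ln(xbar), i.e. lambda_ML is the sample mean.
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.2, size=10_000)
eta_ml = np.log(x.mean())            # solves exp(eta) = (1/N) sum_n u(x_n)
print(np.exp(eta_ml))                # lambda_ML = xbar, close to 4.2
```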

MLE for the Exponential Family

Using the sufficient statistic, one can in principle invert the equation

$$\nabla A(\boldsymbol\eta_{ML})=\frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_{n})$$

to compute η_MLE. For example, for the Bernoulli distribution,

$$p(x\mid\eta)=g(\eta)\exp\{\eta x\},\qquad u(x)=x,\quad h(x)=1,\quad g(\eta)=\frac{1}{1+e^{\eta}},\qquad
\nabla A(\eta)=-\frac{\nabla g(\eta)}{g(\eta)}=\frac{e^{\eta}}{1+e^{\eta}}=\mu,$$

and thus

$$\mu_{MLE}=p(x=1\mid\mathbf{X})=\frac{1}{N}\sum_{n=1}^{N}x_{n}$$

and

$$\eta_{MLE}=\ln\frac{\mu_{MLE}}{1-\mu_{MLE}}.$$

MLE and Kullback-Leibler Distance

A useful property of the MLE (and not just a property of the exponential family of distributions) is the following: minimizing the KL distance to the empirical distribution is equivalent to maximizing the likelihood.

Indeed, consider the model log p(x|θ) and the empirical distribution

$$\tilde{p}(x)=\frac{1}{N}\sum_{n=1}^{N}\delta(x,x_{n}).$$

We can then derive:

$$\sum_{x}\tilde{p}(x)\log p(x\mid\theta)=\frac{1}{N}\sum_{x}\sum_{n=1}^{N}\delta(x,x_{n})\log p(x\mid\theta)=\frac{1}{N}\sum_{n=1}^{N}\log p(x_{n}\mid\theta)=\frac{1}{N}\ell(\theta;\mathcal{D}),$$

and from this:

$$D\left(\tilde{p}(x)\,\|\,p(x\mid\theta)\right)=\sum_{x}\tilde{p}(x)\log\frac{\tilde{p}(x)}{p(x\mid\theta)}
=\sum_{x}\tilde{p}(x)\log\tilde{p}(x)-\frac{1}{N}\ell(\theta;\mathcal{D}).$$

Since the 1st term is independent of θ, the assertion is proved.
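A minimal numerical illustration of this equivalence for the Bernoulli (NumPy assumed; grid search used purely for illustration):

```python
# The theta minimizing KL(p_emp || p_theta) over a grid coincides with the
# MLE (the sample mean) up to grid resolution.
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.65, size=2_000)
p_emp = np.array([1 - x.mean(), x.mean()])          # empirical distribution on {0,1}

thetas = np.linspace(0.01, 0.99, 981)
kl = [np.sum(p_emp * np.log(p_emp / np.array([1 - t, t]))) for t in thetas]
print(thetas[np.argmin(kl)], x.mean())              # nearly identical
```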

Conjugate Priors

We have already encountered the concept of a conjugate prior:
For the Bernoulli, the conjugate prior is the Beta distribution.
For the Gaussian, the conjugate prior for the mean is a Gaussian, and the conjugate prior for the precision is the Wishart distribution.

In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior. For any member of the exponential family,

$$p(\mathbf{x}\mid\boldsymbol\theta)=h(\mathbf{x})\,g(\boldsymbol\eta(\boldsymbol\theta))\exp\{\boldsymbol\eta(\boldsymbol\theta)^{T}\mathbf{u}(\mathbf{x})\},$$

there exists a conjugate prior that can be written in the form

$$p(\boldsymbol\theta\mid\boldsymbol\tau_{0},n_{0})\propto g(\boldsymbol\eta)^{n_{0}}\exp\{\boldsymbol\eta^{T}\boldsymbol\tau_{0}\}=\exp\{\boldsymbol\eta^{T}\boldsymbol\tau_{0}-n_{0}A(\boldsymbol\eta)\}.$$

In normalized form, we write:

$$p(\boldsymbol\theta\mid\boldsymbol\tau_{0},n_{0})=\frac{1}{Z(\boldsymbol\tau_{0},n_{0})}g(\boldsymbol\eta)^{n_{0}}\exp\{\boldsymbol\eta^{T}\boldsymbol\tau_{0}\}
=\frac{1}{Z(\boldsymbol\tau_{0},n_{0})}\exp\{\boldsymbol\eta^{T}\boldsymbol\tau_{0}-n_{0}A(\boldsymbol\eta)\},$$
$$Z(\boldsymbol\tau_{0},n_{0}):=\int\exp\{\boldsymbol\eta^{T}\boldsymbol\tau_{0}-n_{0}A(\boldsymbol\eta)\}\,d\boldsymbol\theta.$$

Conjugate Priors

Using the likelihood $p(\mathbf{X}\mid\boldsymbol\theta)=\left(\prod_{n=1}^{N}h(\mathbf{x}_{n})\right)g(\boldsymbol\eta)^{N}\exp\{\boldsymbol\eta^{T}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_{n})\}$, the posterior becomes (this form justifies conjugacy):

$$p(\boldsymbol\theta\mid\mathbf{X},\boldsymbol\tau_{0},n_{0})\propto g(\boldsymbol\eta)^{n_{0}+N}\exp\left\{\boldsymbol\eta^{T}\left(\boldsymbol\tau_{0}+\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_{n})\right)\right\}
=g(\boldsymbol\eta)^{n_{0}+N}\exp\left\{\boldsymbol\eta^{T}\left(\boldsymbol\tau_{0}+N\bar{\mathbf{u}}\right)\right\}.$$

In normalized form:

$$p\left(\boldsymbol\theta\mid\mathbf{X},\boldsymbol\tau_{0},n_{0}\right)=\frac{1}{Z(\boldsymbol\tau_{N},n_{N})}g(\boldsymbol\eta)^{n_{N}}\exp\{\boldsymbol\eta^{T}\boldsymbol\tau_{N}\},$$

where

$$n_{N}=n_{0}+N,\qquad \boldsymbol\tau_{N}=\boldsymbol\tau_{0}+N\bar{\mathbf{u}},\qquad \bar{\mathbf{u}}=\frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_{n}).$$

The parameter n₀ can be interpreted as an effective number of fictitious observations in the prior, each of which has a value of the sufficient statistic equal to τ₀/n₀.
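A minimal sketch of this conjugate update for the Beta/Bernoulli pair (NumPy assumed; hyperparameter values illustrative, and the Beta parametrization follows the convention of the Beta/Bernoulli slide below):

```python
# Conjugate update tau_N = tau_0 + N*ubar, n_N = n_0 + N; for Bernoulli,
# u(x) = x, so tau counts successes.
import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.7, size=50)

n0, tau0 = 2.0, 1.0                # prior pseudo-counts: n0 trials, tau0 heads
nN = n0 + x.size
tauN = tau0 + x.sum()              # tau_0 + sum_n u(x_n)

# Under the slide's convention the posterior is Beta(tauN + 1, nN - tauN + 1)
a_post, b_post = tauN + 1, nN - tauN + 1
print(a_post / (a_post + b_post))  # posterior mean of theta
```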

Posterior Predictive

Let $\mathbf{u}(\mathbf{X})=\sum_{i=1}^{N}\mathbf{u}(\mathbf{x}_{i})$ and $\mathbf{u}(\mathbf{X}')=\sum_{i=1}^{N'}\mathbf{u}(\mathbf{x}'_{i})$; the posterior predictive for future data X' is then:

$$p(\mathbf{X}'\mid\mathbf{X})=\int p(\mathbf{X}'\mid\boldsymbol\theta)\,p(\boldsymbol\theta\mid\mathbf{X})\,d\boldsymbol\theta
=\left(\prod_{i=1}^{N'}h(\mathbf{x}'_{i})\right)\frac{1}{Z(\boldsymbol\tau_{N},n_{N})}\int g(\boldsymbol\eta)^{\,n_{N}+N'}\exp\left\{\boldsymbol\eta^{T}\left(\boldsymbol\tau_{N}+\mathbf{u}(\mathbf{X}')\right)\right\}d\boldsymbol\theta.$$

This is simplified as follows:

$$p(\mathbf{X}'\mid\mathbf{X})=\left(\prod_{i=1}^{N'}h(\mathbf{x}'_{i})\right)\frac{Z\!\left(\boldsymbol\tau_{N}+\mathbf{u}(\mathbf{X}'),\,n_{N}+N'\right)}{Z(\boldsymbol\tau_{N},n_{N})}.$$

If N = 0, this becomes the marginal likelihood of X', which reduces to the normalizer of the posterior divided by the normalizer of the prior, multiplied by a constant.

Beta/Bernoulli: Posterior Predictive

Consider a Bernoulli likelihood with a Beta prior. The likelihood takes the familiar exponential family form:

$$p(\mathcal{D}\mid\theta)=\theta^{\sum_{i}x_{i}}(1-\theta)^{N-\sum_{i}x_{i}}=(1-\theta)^{N}\exp\left\{\log\frac{\theta}{1-\theta}\sum_{i}x_{i}\right\}.$$

The conjugate prior is a Beta:

$$p(\theta\mid\nu_{0},\tau_{0})\propto(1-\theta)^{\nu_{0}}\exp\left\{\tau_{0}\log\frac{\theta}{1-\theta}\right\}=\theta^{\tau_{0}}(1-\theta)^{\nu_{0}-\tau_{0}},\qquad
p(\theta\mid\nu_{0},\tau_{0})=\mathrm{Beta}(\theta\mid\tau_{0}+1,\;\nu_{0}-\tau_{0}+1).$$

Thus the posterior becomes:

$$p(\theta\mid\mathcal{D},\nu_{0},\tau_{0})=\mathrm{Beta}(\theta\mid\tau_{N}+1,\;\nu_{N}-\tau_{N}+1),\qquad
\nu_{N}=\nu_{0}+N,\quad \tau_{N}=\tau_{0}+s,$$

where $s=\sum_{i}x_{i}$ is the number of heads in the past data. The probability of $s'=\sum_{i}x'_{i}$ future heads in m trials is then:

$$p(\mathcal{D}'\mid\mathcal{D})=\int\theta^{s'}(1-\theta)^{m-s'}\,p(\theta\mid\mathcal{D})\,d\theta
=\frac{\Gamma(\nu_{N}+2)}{\Gamma(\tau_{N}+1)\,\Gamma(\nu_{N}-\tau_{N}+1)}\cdot
\frac{\Gamma(\tau_{N'}+1)\,\Gamma(\nu_{N'}-\tau_{N'}+1)}{\Gamma(\nu_{N'}+2)},$$

where $\nu_{N'}=\nu_{N}+m$ and $\tau_{N'}=\tau_{N}+s'$.
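A minimal sketch evaluating this Gamma-function ratio and checking it by Monte Carlo (NumPy/SciPy assumed; the hyperparameter values are made up, and the formula gives the probability of one particular future sequence with s' heads):

```python
import numpy as np
from scipy.special import gammaln

nu_N, tau_N = 12.0, 7.0     # posterior hyperparameters (nu_N trials, tau_N heads)
m, s_prime = 4, 3
nu_Np, tau_Np = nu_N + m, tau_N + s_prime

log_pred = (gammaln(nu_N + 2) - gammaln(tau_N + 1) - gammaln(nu_N - tau_N + 1)
            + gammaln(tau_Np + 1) + gammaln(nu_Np - tau_Np + 1) - gammaln(nu_Np + 2))

rng = np.random.default_rng(4)
theta = rng.beta(tau_N + 1, nu_N - tau_N + 1, size=200_000)
mc = np.mean(theta**s_prime * (1 - theta)**(m - s_prime))
print(np.exp(log_pred), mc)   # should agree to a few decimal places
```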

Maximum Entropy and the Exponential Family

If nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the default.[a]

The entropy is defined (discrete case) as

$$H(q)=-\sum_{k}q_{k}\log q_{k}.$$

When some statistics (moments) of the distribution are known,

$$\langle g_{k}(\theta)\rangle=w_{k},\qquad k=1,\ldots,K,$$

the maximum entropy distribution is of the form (the λ's are the Lagrange multipliers enforcing the constraints):

$$q_{i}=\frac{\exp\left\{\sum_{k=1}^{K}\lambda_{k}g_{k}(\theta_{i})\right\}}{\sum_{j}\exp\left\{\sum_{k=1}^{K}\lambda_{k}g_{k}(\theta_{j})\right\}}.$$

Thus the MaxEnt distribution has the form of the exponential family.

[a] C. P. Robert, The Bayesian Choice, Springer, 2nd edition, Chapter 3 (full text available for Cornell students).
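A minimal sketch of this construction (NumPy/SciPy assumed): the MaxEnt distribution over a six-sided die subject to a single mean constraint E[x] = 4.5, solving for the Lagrange multiplier numerically.

```python
import numpy as np
from scipy.optimize import brentq

vals = np.arange(1, 7)

def mean_under_q(lam):
    q = np.exp(lam * vals)
    q /= q.sum()
    return q @ vals

lam = brentq(lambda l: mean_under_q(l) - 4.5, -5, 5)  # enforce <x> = 4.5
q = np.exp(lam * vals); q /= q.sum()
print(q)          # exponentially tilted weights, an exponential family member
```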

Generalized Linear Models

We now study the relationship between X and Y using a GLIM.

We model a particular conditional expectation of Y; we denote the modeled value of the conditional expectation as μ = f(θᵀx).

GLIMs extend the ideas of linear regression beyond the Gaussian, Bernoulli and multinomial settings to the more general exponential family.

X enters linearly as θᵀx, and f is called the response function. ψ is the one-to-one map from μ to η: η = ψ(μ).

To specify a GLIM we need (a) a choice of exponential family distribution, and (b) a choice of the response function f(·).

Choosing the exponential family distribution is strongly constrained by the nature of the data.

In choosing f(·), note that it needs to be both monotonic and differentiable. However, there is a particular response function (the canonical response function) that is uniquely associated with a given exponential family distribution.

Canonical Response Function

The canonical response function is

$$f(\cdot)=\psi^{-1}(\cdot)\qquad\Longrightarrow\qquad \eta=\psi(\mu)=\psi\!\left(f(\boldsymbol\theta^{T}\mathbf{x})\right)=\boldsymbol\theta^{T}\mathbf{x}.$$

If we decide to use the canonical response function, the choice of the exponential family density completely determines the GLIM.

MLE and the Canonical Response Function

Consider a regression problem with data D = {(x_n, y_n)}, n = 1, ..., N. The log-likelihood for a GLIM is:

$$\ell(\boldsymbol\theta;\mathcal{D})=\sum_{n=1}^{N}\log h(y_{n})+\sum_{n=1}^{N}\left(\eta_{n}y_{n}-A(\eta_{n})\right),
\qquad\text{where}\quad \eta_{n}=\psi(\mu_{n}),\;\;\mu_{n}=f(\boldsymbol\theta^{T}\mathbf{x}_{n}).$$

For a canonical response, $\eta_{n}=\boldsymbol\theta^{T}\mathbf{x}_{n}$, and this is simplified as:

$$\ell(\boldsymbol\theta;\mathcal{D})=\sum_{n=1}^{N}\log h(y_{n})+\boldsymbol\theta^{T}\underbrace{\sum_{n=1}^{N}y_{n}\mathbf{x}_{n}}_{\text{sufficient statistic for }\boldsymbol\theta}-\sum_{n=1}^{N}A(\eta_{n}).$$

Regardless of N, the size of the sufficient statistic is fixed, namely the dimension of x_n: an important reason for using the canonical response.

Taking the gradient and using $A'(\eta_{n})=\mu_{n}$:

$$\nabla_{\boldsymbol\theta}\ell=\sum_{n=1}^{N}\left(y_{n}-A'(\eta_{n})\right)\mathbf{x}_{n}=\sum_{n=1}^{N}(y_{n}-\mu_{n})\mathbf{x}_{n}=\mathbf{X}^{T}(\mathbf{y}-\boldsymbol\mu).$$

This is a general expression for GLMs with exponential family distributions and the canonical response function.

Batch Algorithm

To estimate the parameters with the canonical response function choice, one can use the iteratively reweighted least squares (IRLS) algorithm.

The Hessian can now be computed from

$$\nabla_{\boldsymbol\theta}\ell=\sum_{n=1}^{N}(y_{n}-\mu_{n})\mathbf{x}_{n}=\mathbf{X}^{T}(\mathbf{y}-\boldsymbol\mu)$$

as:

$$\mathbf{H}=\frac{d^{2}\ell}{d\boldsymbol\theta\,d\boldsymbol\theta^{T}}=-\sum_{n=1}^{N}\frac{d\mu_{n}}{d\eta_{n}}\mathbf{x}_{n}\mathbf{x}_{n}^{T}=-\mathbf{X}^{T}\mathbf{W}\mathbf{X},\qquad
\mathbf{W}:=\mathrm{diag}\left(\frac{d\mu_{1}}{d\eta_{1}},\ldots,\frac{d\mu_{N}}{d\eta_{N}}\right).$$

The batch Newton algorithm now takes the familiar IRLS form:

$$\boldsymbol\theta^{(t+1)}=\boldsymbol\theta^{(t)}+\left(\mathbf{X}^{T}\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\left(\mathbf{y}-\boldsymbol\mu^{(t)}\right)
=\left(\mathbf{X}^{T}\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}^{T}\mathbf{W}^{(t)}\mathbf{z}^{(t)},\qquad
\mathbf{z}^{(t)}:=\mathbf{X}\boldsymbol\theta^{(t)}+\left(\mathbf{W}^{(t)}\right)^{-1}\left(\mathbf{y}-\boldsymbol\mu^{(t)}\right).$$

For non-canonical response functions, the Hessian has an extra term that contains the factor (y − μ). When we take expectations, this term vanishes! So using the expected Hessian in the Newton method, the algorithm looks essentially the same (the Fisher scoring algorithm).
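A minimal sketch of batch IRLS for the Bernoulli GLIM with the canonical response (logistic regression); NumPy assumed, data and parameter values synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 500, 3
X = np.c_[np.ones(N), rng.normal(size=(N, D - 1))]   # design matrix with bias
theta_true = np.array([-0.5, 2.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta_true)))

theta = np.zeros(D)
for _ in range(25):                                   # Newton / IRLS iterations
    mu = 1 / (1 + np.exp(-X @ theta))                 # mu_n = f(theta^T x_n)
    W = mu * (1 - mu)                                 # d mu/d eta for Bernoulli
    grad = X.T @ (y - mu)
    H = X.T @ (W[:, None] * X)                        # X^T W X
    theta = theta + np.linalg.solve(H, grad)

print(theta)   # close to theta_true for this sample size
```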

Sequential Estimation: LMS

An on-line estimation algorithm can be obtained by following the stochastic gradient of the log-likelihood function:

$$\boldsymbol\theta^{(t+1)}=\boldsymbol\theta^{(t)}+\rho\left(y_{n}-\mu_{n}^{(t)}\right)\mathbf{x}_{n},\qquad \mu_{n}^{(t)}=f\!\left(\boldsymbol\theta^{(t)T}\mathbf{x}_{n}\right).$$

If we do not use the canonical response function, then the gradient also includes the derivatives of f(·) and ψ(·). These can be viewed as scaling coefficients that alter the step size, but otherwise leave the general LMS form intact.

The LMS algorithm is the generic stochastic gradient algorithm for models throughout the GLIM family.
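For comparison with the batch IRLS sketch above, here is a minimal sequential (LMS) version for the same synthetic logistic model (NumPy assumed; the step size rho is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(6)
N, D = 5_000, 3
X = np.c_[np.ones(N), rng.normal(size=(N, D - 1))]
theta_true = np.array([-0.5, 2.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta_true)))

theta, rho = np.zeros(D), 0.05
for x_n, y_n in zip(X, y):                  # one data point at a time
    mu_n = 1 / (1 + np.exp(-theta @ x_n))   # canonical response
    theta = theta + rho * (y_n - mu_n) * x_n

print(theta)   # drifts toward theta_true as data streams in
```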