Bayesian Scientific Computing, Spring 2013 (N. Zabaras)

The Exponential Family of Distributions

Prof. Nicholas Zabaras
School of Engineering
University of Warwick
Coventry CV4 7AL
United Kingdom
Email: [email protected]
URL: http://www.zabaras.com/

August 12, 2014
Contents

The Bernoulli Distribution (Introducing the Logistic Sigmoid), The Poisson Distribution, The Multinomial Distribution (Introducing the Softmax)

The Beta Distribution, The Gamma Distribution, The Gaussian Distribution, The von Mises Distribution, The Multivariate Gaussian

Computing the Moments of a Distribution from the Exponential Family, Moment Parametrization, Sufficiency and Neyman Factorization, Sufficient Statistics and MLE Estimates, MLE and Kullback-Leibler Distance

Conjugate Priors, Posterior Predictive, Maximum Entropy and the Exponential Family

Generalized Linear Models, Canonical Response Function, Batch IRLS, Sequential Estimation - LMS

References: Chris Bishop's PRML book, Chapter 2; M. Jordan, An Introduction to Probabilistic Graphical Models, Chapter 8 (preprint).
Exponential Family

The exponential family is a large family of useful distributions with common properties: Bernoulli, beta, binomial, chi-square, Dirichlet, gamma, Gaussian, geometric, multinomial, Poisson, Weibull, ...

Not in the family: uniform, Student's t, Cauchy, Laplace, mixture of Gaussians, ...

The variable can be discrete or continuous (or vectors thereof).

We will focus on the conditional setting in which we have a directed model X -> Y with both X and Y observed, and with p(Y|X) being an exponential family distribution parametrized using Generalized Linear Models (GLIMs).
Exponential Family

The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form

$$p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right\}, \quad \text{or} \quad p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\right\}, \quad \text{where } A(\boldsymbol{\eta}) := -\log g(\boldsymbol{\eta}).$$

Here x may be scalar or vector, discrete or continuous. η are the natural parameters and u(x) is referred to as a sufficient statistic.

g(η) ensures that the distribution is normalized and satisfies

$$g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right\}d\mathbf{x} = 1.$$

The normalization factor Z and its logarithm A are defined as

$$Z(\boldsymbol{\eta}) = \frac{1}{g(\boldsymbol{\eta})}, \qquad A(\boldsymbol{\eta}) = \ln Z(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta}) = \ln\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right\}d\mathbf{x},$$

so that

$$p(\mathbf{x}|\boldsymbol{\eta}) = \frac{h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right\}}{Z(\boldsymbol{\eta})}.$$

The set of η for which $\int h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\}\,d\mathbf{x} < \infty$ is the natural parameter space.
Exponential Family

When the parameter θ enters the exponential family through η(θ), we write the probability density of the exponential family as follows:

$$p(\mathbf{x}|\boldsymbol{\theta}) = h(\mathbf{x})\,g(\boldsymbol{\eta}(\boldsymbol{\theta}))\exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\mathbf{u}(\mathbf{x})\right\}, \quad \text{or} \quad p(\mathbf{x}|\boldsymbol{\theta}) = h(\mathbf{x})\exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta}(\boldsymbol{\theta}))\right\}, \quad \text{where } A := -\log g.$$

η(θ) are the canonical or natural parameters; θ is the parameter vector of some distribution that can be written in the exponential family format.
Joint Probability Distribution on Discrete RVs

We can show that any joint probability distribution on discrete random variables lies in the exponential family. Indeed, writing the joint as a product of clique potentials,

$$p(\mathbf{x}|\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\prod_C \psi_C(\mathbf{x}_C) = \exp\left\{\sum_C \log\psi_C(\mathbf{x}_C) - \log Z(\boldsymbol{\theta})\right\}.$$

But for discrete random variables, summing over all configurations $\mathbf{v}_C$ of the variables in clique C,

$$\log\psi_C(\mathbf{x}_C) = \sum_{\mathbf{v}_C} \delta(\mathbf{x}_C, \mathbf{v}_C)\,\log\psi_C(\mathbf{v}_C),$$

where $\delta(\mathbf{x}_C, \mathbf{v}_C) = 1$ if $\mathbf{x}_C = \mathbf{v}_C$ and 0 otherwise. Substitution into the first equation gives:

$$p(\mathbf{x}|\boldsymbol{\theta}) = \exp\left\{\sum_C \sum_{\mathbf{v}_C} \delta(\mathbf{x}_C, \mathbf{v}_C)\,\log\psi_C(\mathbf{v}_C) - \log Z(\boldsymbol{\theta})\right\}.$$

This is in the exponential family, with $\log\psi_C(\mathbf{v}_C)$ corresponding to the components of η, and the indicators $\delta(\mathbf{x}_C, \mathbf{v}_C)$ corresponding to the components of the sufficient statistic u(x).
Exponential Family: The Bernoulli Distribution

Consider the Bernoulli distribution considered earlier:

$$\text{Bern}(x|\mu) = \mu^x(1-\mu)^{1-x} = \exp\left\{x\ln\mu + (1-x)\ln(1-\mu)\right\} = \underbrace{(1-\mu)}_{g(\eta)}\exp\left\{x\underbrace{\ln\frac{\mu}{1-\mu}}_{\eta}\right\}.$$

From this we see that (note that the relation μ(η) is invertible)

$$\eta = \ln\frac{\mu}{1-\mu} \quad \Longleftrightarrow \quad \mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}} \quad \text{(the logistic sigmoid function)},$$

and $g(\eta) = 1-\mu(\eta) = \sigma(-\eta)$.

Finally, comparing with the generic form $p(x|\boldsymbol{\eta}) = h(x)\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T u(x)\} = h(x)\exp\{\boldsymbol{\eta}^T u(x) - A(\boldsymbol{\eta})\}$:

$$p(x|\eta) = \sigma(-\eta)\exp(\eta x), \quad u(x) = x, \quad h(x) = 1, \quad g(\eta) = \sigma(-\eta),$$
$$A(\eta) = -\log g(\eta) = -\log(1-\mu) = \log\left(1 + e^{\eta}\right).$$
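As a quick numerical sanity check (an illustration added here, not from the original slides), the following minimal Python sketch verifies that the exponential-family form σ(-η)exp(ηx) reproduces the standard Bernoulli pmf and that the logit and the logistic sigmoid are mutual inverses:

```python
import numpy as np

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

mu = 0.3
eta = np.log(mu / (1 - mu))          # natural parameter: eta = ln(mu/(1-mu))
assert np.isclose(sigmoid(eta), mu)  # the logistic sigmoid inverts the logit

for x in (0, 1):
    standard = mu**x * (1 - mu)**(1 - x)          # mu^x (1-mu)^(1-x)
    exp_family = sigmoid(-eta) * np.exp(eta * x)  # g(eta) exp(eta x)
    assert np.isclose(standard, exp_family)

# A(eta) = log(1 + e^eta) equals -log(1 - mu)
assert np.isclose(np.log1p(np.exp(eta)), -np.log(1 - mu))
```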
Exponential Family: The Poisson Distribution

Consider the Poisson distribution with parameter λ:

$$\text{Poisson}(x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \underbrace{\frac{1}{x!}}_{h(x)}\exp\Big\{\underbrace{x}_{u(x)}\underbrace{\ln\lambda}_{\eta} - \underbrace{\lambda}_{A(\eta)}\Big\}.$$

Recall that λ is the mean of the distribution, and observe once more that the relation λ(η) is invertible:

$$\eta = \ln\lambda \quad \Longleftrightarrow \quad \lambda = e^{\eta}.$$
Exponential Family: The Multinoulli Distribution

Consider the multinomial distribution considered earlier:

$$p(\mathbf{x}|\boldsymbol{\mu}) = \prod_{k=1}^{M}\mu_k^{x_k} = \exp\left\{\sum_{k=1}^{M}x_k\ln\mu_k\right\} = \exp\left\{\boldsymbol{\eta}^T\mathbf{x}\right\}, \qquad \boldsymbol{\eta} = (\eta_1, \ldots, \eta_M)^T, \quad \eta_k = \ln\mu_k.$$

Comparing with the generic form $p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\} = h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\}$, from this expression we see that h(x) = 1, u(x) = x, g(η) = 1. It appears also that A(η) = 0!

We can resolve this problem by accounting for the dependence of the μ_k, i.e.

$$\sum_{k=1}^{M}\mu_k = 1.$$
Exponential Family: The Multinoulli Distribution

We will express the distribution in terms of μ_k, k = 1, ..., M-1, subject to:

$$0 \le \mu_k \le 1, \qquad \sum_{k=1}^{M-1}\mu_k \le 1.$$

The multinomial distribution becomes:

$$\exp\left\{\sum_{k=1}^{M}x_k\ln\mu_k\right\} = \exp\left\{\sum_{k=1}^{M-1}x_k\ln\mu_k + \left(1-\sum_{k=1}^{M-1}x_k\right)\ln\left(1-\sum_{k=1}^{M-1}\mu_k\right)\right\} = \exp\left\{\sum_{k=1}^{M-1}x_k\ln\frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j} + \ln\left(1-\sum_{k=1}^{M-1}\mu_k\right)\right\}.$$
Exponential Family: The Multinoulli Distribution

We identify

$$\eta_k = \ln\frac{\mu_k}{1-\sum_{j=1}^{M-1}\mu_j}, \quad k = 1, \ldots, M-1.$$

Can also define: $\eta_M = \ln(\mu_M/\mu_M) = 0$.

This equation can be inverted: writing $\mu_k = \left(1-\sum_j\mu_j\right)e^{\eta_k}$ and summing over k gives

$$\sum_{k=1}^{M-1}\mu_k = \frac{\sum_{k=1}^{M-1}e^{\eta_k}}{1+\sum_{k=1}^{M-1}e^{\eta_k}}, \qquad 1-\sum_{k=1}^{M-1}\mu_k = \frac{1}{1+\sum_{k=1}^{M-1}e^{\eta_k}}, \qquad \mu_k = \frac{e^{\eta_k}}{1+\sum_{j=1}^{M-1}e^{\eta_j}}.$$

Substitution into the expression at the top of the slide:

$$\exp\left\{\sum_{k=1}^{M-1}x_k\eta_k + \ln\left(1-\sum_{k=1}^{M-1}\mu_k\right)\right\} = \left(1+\sum_{k=1}^{M-1}e^{\eta_k}\right)^{-1}\exp\left\{\sum_{k=1}^{M-1}x_k\eta_k\right\}.$$
Exponential Family: The Multinoulli Distribution

This is the so-called softmax function (note again that the relation μ(η) is invertible):

$$\mu_k = \frac{\exp(\eta_k)}{1+\sum_{j=1}^{M-1}\exp(\eta_j)}.$$

In this reduced representation, the distribution takes the form

$$p(\mathbf{x}|\boldsymbol{\eta}) = \left(1+\sum_{k=1}^{M-1}\exp(\eta_k)\right)^{-1}\exp\left\{\boldsymbol{\eta}^T\mathbf{x}\right\}.$$

Comparing with the generic form of the exponential family, $p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\}$:

$$\boldsymbol{\eta} = (\eta_1, \ldots, \eta_{M-1}, 0)^T, \quad \mathbf{u}(\mathbf{x}) = \mathbf{x}, \quad h(\mathbf{x}) = 1, \quad g(\boldsymbol{\eta}) = \left(1+\sum_{k=1}^{M-1}\exp(\eta_k)\right)^{-1},$$
$$A(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta}) = \ln\left(1+\sum_{k=1}^{M-1}\exp(\eta_k)\right) = \ln\sum_{k=1}^{M}\exp(\eta_k) \quad (\text{using } \eta_M = 0).$$
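A minimal Python sketch (an illustration added here, not from the slides; the function names are ours) of the reduced softmax parametrization and its inverse, verifying that μ(η) and η(μ) are mutual inverses:

```python
import numpy as np

def mu_from_eta(eta_reduced):
    """Softmax map from the reduced natural parameters (eta_1..eta_{M-1})
    to the first M-1 mean parameters mu_k."""
    e = np.exp(eta_reduced)
    return e / (1.0 + e.sum())

def eta_from_mu(mu_reduced):
    """Inverse map: eta_k = ln(mu_k / (1 - sum_j mu_j))."""
    return np.log(mu_reduced / (1.0 - mu_reduced.sum()))

mu = np.array([0.2, 0.5, 0.1])             # mu_1..mu_{M-1}; mu_M = 0.2 implicitly
eta = eta_from_mu(mu)
assert np.allclose(mu_from_eta(eta), mu)   # the two maps invert each other
```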
Exponential Family: The Beta Distribution

Consider the Beta distribution:

$$\text{Beta}(\mu|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1} = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\exp\left\{(a-1)\ln\mu + (b-1)\ln(1-\mu)\right\}.$$

Comparing this with our exponential family form $p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\} = h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\}$, we can easily identify:

$$\mathbf{u}(\mu) = (\ln\mu, \ln(1-\mu))^T, \quad \boldsymbol{\eta} = (a-1, b-1)^T, \quad h(\mu) = 1, \quad g(a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)},$$
$$A(a,b) = \ln\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$
Exponential Family: The Gamma Distribution

Consider the Gamma distribution:

$$\text{Gamma}(\lambda|a,b) = \frac{b^a}{\Gamma(a)}\lambda^{a-1}e^{-b\lambda} = \frac{b^a}{\Gamma(a)}\exp\left\{(a-1)\ln\lambda - b\lambda\right\}.$$

Comparing this with our exponential family form $p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\} = h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\}$, we can easily identify:

$$\mathbf{u}(\lambda) = (\lambda, \ln\lambda)^T, \quad \boldsymbol{\eta} = (-b, a-1)^T, \quad h(\lambda) = 1, \quad g(a,b) = \frac{b^a}{\Gamma(a)}, \quad A(a,b) = \ln\frac{\Gamma(a)}{b^a}.$$
Exponential Family: The Gaussian

Consider the univariate Gaussian:

$$p(x|\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} = \frac{1}{(2\pi)^{1/2}}\exp\left\{\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 - \frac{\mu^2}{2\sigma^2} - \ln\sigma\right\}.$$

Comparing this with our exponential family form $p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\} = h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\}$, we can identify (this is a two-parameter distribution):

$$\mathbf{u}(x) = (x, x^2)^T, \quad \boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)^T, \quad h(x) = \frac{1}{(2\pi)^{1/2}}, \quad g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2}\exp\left(\frac{\eta_1^2}{4\eta_2}\right),$$
$$A(\boldsymbol{\eta}) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\ln(-2\eta_2).$$
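The following short Python sketch (added for illustration; the function names are ours) converts between the moment parameters (μ, σ²) and the natural parameters (η₁, η₂) and checks that the two maps invert each other:

```python
import numpy as np

def gaussian_natural_params(mu, sigma2):
    """(mu, sigma^2) -> (eta1, eta2) = (mu/sigma^2, -1/(2 sigma^2))."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def gaussian_moment_params(eta1, eta2):
    """Inverse map: sigma^2 = -1/(2 eta2), mu = eta1 * sigma^2."""
    sigma2 = -1.0 / (2.0 * eta2)
    return eta1 * sigma2, sigma2

eta1, eta2 = gaussian_natural_params(1.5, 0.7)
assert np.allclose(gaussian_moment_params(eta1, eta2), (1.5, 0.7))
```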
Exponential Family: The von Mises Distribution

Consider the von Mises distribution:

$$p(\theta|\theta_0, m) = \frac{1}{2\pi I_0(m)}\exp\left\{m\cos(\theta-\theta_0)\right\} = \frac{1}{2\pi I_0(m)}\exp\left\{m\cos\theta_0\cos\theta + m\sin\theta_0\sin\theta\right\}.$$

Comparing this with our exponential family form $p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\} = h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\}$, we can easily identify that:

$$\mathbf{u}(\theta) = (\cos\theta, \sin\theta)^T, \quad \boldsymbol{\eta} = (m\cos\theta_0, m\sin\theta_0)^T, \quad h(\theta) = 1, \quad g(\theta_0, m) = \frac{1}{2\pi I_0(m)},$$
$$A(\theta_0, m) = \ln\left(2\pi I_0(m)\right).$$
The Multivariate Gaussian

The exponent in the multivariate Gaussian is:

$$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu}) = -\frac{1}{2}\mathrm{tr}\left(\boldsymbol{\Lambda}\mathbf{x}\mathbf{x}^T\right) + \boldsymbol{\mu}^T\boldsymbol{\Lambda}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}^T\boldsymbol{\Lambda}\boldsymbol{\mu}, \qquad \boldsymbol{\Lambda} := \boldsymbol{\Sigma}^{-1}.$$

We need to put this in the form $p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\}$.

The 3rd term contributes to g(η), whereas the 2nd term is directly an inner product between x and ξ = Λμ.

For the 1st term, define two D²-dimensional vectors vec(Λ) and vec(xxᵀ) that consist of the stacked columns of Λ and xxᵀ, respectively. Then the 1st term has the form of an inner product between these two vectors. In summary:

$$\boldsymbol{\eta} = \begin{pmatrix}\boldsymbol{\Lambda}\boldsymbol{\mu}\\ -\frac{1}{2}\mathrm{vec}(\boldsymbol{\Lambda})\end{pmatrix}, \quad \mathbf{u}(\mathbf{x}) = \begin{pmatrix}\mathbf{x}\\ \mathrm{vec}(\mathbf{x}\mathbf{x}^T)\end{pmatrix}, \quad h(\mathbf{x}) = (2\pi)^{-D/2}, \quad g(\boldsymbol{\eta}) = |\boldsymbol{\Lambda}|^{1/2}\exp\left(-\frac{1}{2}\boldsymbol{\mu}^T\boldsymbol{\Lambda}\boldsymbol{\mu}\right),$$
$$A(\boldsymbol{\eta}) = -\ln g(\boldsymbol{\eta}) = \frac{1}{2}\boldsymbol{\mu}^T\boldsymbol{\Lambda}\boldsymbol{\mu} - \frac{1}{2}\ln|\boldsymbol{\Lambda}|.$$
Computing Moments of Sufficient Statistics u(x)

Differentiate with respect to η the normalization condition $\int p(\mathbf{x}|\boldsymbol{\eta})\,d\mathbf{x} = 1$ for the exponential family $p(\mathbf{x}|\boldsymbol{\eta}) = h(\mathbf{x})\,g(\boldsymbol{\eta})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\}$:

$$\nabla g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right\}d\mathbf{x} + g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right\}\mathbf{u}(\mathbf{x})\,d\mathbf{x} = \mathbf{0},$$

which rearranges to

$$-\nabla\ln g(\boldsymbol{\eta}) = -\frac{\nabla g(\boldsymbol{\eta})}{g(\boldsymbol{\eta})} = g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right\}\mathbf{u}(\mathbf{x})\,d\mathbf{x} = \mathbb{E}[\mathbf{u}(\mathbf{x})].$$

The above equation is simpler when written in terms of the partition function Z = 1/g(η), or A = log Z:

$$\nabla A(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})].$$

Let us rewrite this equation explicitly as:

$$\nabla A(\boldsymbol{\eta}) = g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\right\}\mathbf{u}(\mathbf{x})\,d\mathbf{x}.$$

We can compute the variance of u(x) by differentiating the equation above once more with respect to η.
Computing Moments of Sufficient Statistics u(x)

Differentiating $\nabla A(\boldsymbol{\eta}) = g(\boldsymbol{\eta})\int h(\mathbf{x})\exp\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x})\}\mathbf{u}(\mathbf{x})\,d\mathbf{x}$ once more gives

$$\nabla^2 A(\boldsymbol{\eta}) = \mathbb{E}\left[\mathbf{u}(\mathbf{x})\mathbf{u}(\mathbf{x})^T\right] - \mathbb{E}[\mathbf{u}(\mathbf{x})]\,\mathbb{E}[\mathbf{u}(\mathbf{x})]^T = \mathrm{cov}[\mathbf{u}(\mathbf{x})], \quad \text{so } A(\boldsymbol{\eta}) \text{ is convex}.$$

Thus the covariance of u(x) can be expressed in terms of the 2nd derivatives of A(η), and similarly for higher-order moments.

Provided we can normalize a distribution from the exponential family, we can always find its moments by simple differentiation.
Computing Moments of Sufficient Statistics u(x)

Let us check these relations for the univariate Gaussian:

$$A(\boldsymbol{\eta}) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\ln(-2\eta_2), \qquad \boldsymbol{\eta} = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)^T, \qquad \mathbf{u}(x) = (x, x^2)^T.$$

Then

$$\mathbb{E}[X] = \frac{\partial A}{\partial\eta_1} = -\frac{\eta_1}{2\eta_2} = \mu, \qquad \mathbb{E}[X^2] = \frac{\partial A}{\partial\eta_2} = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2} = \mu^2 + \sigma^2,$$
$$\mathrm{var}[X] = \frac{\partial^2 A}{\partial\eta_1^2} = -\frac{1}{2\eta_2} = \sigma^2, \quad \text{etc.}$$
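These identities are easy to confirm numerically. The sketch below (an added illustration, assuming nothing beyond numpy) differentiates A(η) by central finite differences and recovers E[X], E[X²] and var[X]:

```python
import numpy as np

def A(eta1, eta2):
    # Log-partition function of the univariate Gaussian in natural parameters
    return -eta1**2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)

mu, sigma2 = 1.5, 0.7
eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)

h = 1e-6   # step for first central differences
dA1 = (A(eta1 + h, eta2) - A(eta1 - h, eta2)) / (2 * h)
dA2 = (A(eta1, eta2 + h) - A(eta1, eta2 - h)) / (2 * h)
h2 = 1e-4  # larger step for the second difference (limits rounding noise)
d2A1 = (A(eta1 + h2, eta2) - 2 * A(eta1, eta2) + A(eta1 - h2, eta2)) / h2**2

assert np.isclose(dA1, mu)                  # dA/deta1   = E[X]   = mu
assert np.isclose(dA2, mu**2 + sigma2)      # dA/deta2   = E[X^2] = mu^2 + sigma^2
assert np.isclose(d2A1, sigma2, atol=1e-5)  # d2A/deta1^2 = var[X] = sigma^2
```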
Moment Parametrization

We have shown that we can compute the mean of the distribution, μ = E[u(x)], in terms of the canonical parameter η:

$$\boldsymbol{\mu} \equiv \mathbb{E}[\mathbf{u}(\mathbf{x})] = \nabla A(\boldsymbol{\eta}).$$

We have also shown that A(η) is a convex function. Since for a strictly convex function there is a one-to-one relation between the argument of the function and its derivative, the mapping μ(η) is invertible.

Thus the exponential family of distributions can also be parameterized in terms of μ (moment parametrization), exactly as we started our earlier calculations (e.g. for the Gaussian).
Sufficiency

u(X) is sufficient for θ if there is no information in X regarding θ beyond that in u(X). Having observed u(X), we can throw away X for the purposes of inference with respect to θ.

[Figure a: graphical model relating θ, u(X), and X.]

In the Bayesian approach of the figure shown, we treat θ as a random variable and say that u(X) is sufficient for θ if the following conditional independence (CI) statement holds:

$$\theta \perp X \mid u(X), \qquad \text{i.e.} \qquad p(\theta \mid u(x), x) = p(\theta \mid u(x)).$$

Thus, u(X) contains all the needed information in X about θ.
Sufficiency

[Figure b: alternative graphical model relating θ, u(X), and X.]

The model in Fig. b asserts the same CI relations as Fig. a, but has a different parametrization.

Treating θ as a label, we can see the above CI statement as a frequentist definition of sufficiency: u(X) is sufficient for θ if p(x|u(x)) is not a function of θ,

$$p(x \mid u(x), \theta) = p(x \mid u(x)).$$

The two approaches discussed imply a particular factorization of p(x|θ).
Neyman Factorization Theorem

From the undirected graph (which expresses the same CI relations as the two earlier graphs), we can factorize as:

$$p(x, u(x), \theta) = \psi_1(u(x), \theta)\,\psi_2(x, u(x)).$$

On the left, u(x) is a deterministic function of x and can be dropped as an argument:

$$p(x, \theta) = \psi_1(u(x), \theta)\,\psi_2(x, u(x)).$$

One can derive, for given ψ₁, ψ₂:

$$p(x \mid \theta) = \frac{p(x, \theta)}{p(\theta)} = \underbrace{\frac{\psi_1(u(x), \theta)}{p(\theta)}}_{g_1(u(x),\,\theta)}\,\psi_2(x, u(x)) = g_1(u(x), \theta)\,g_2(x, u(x)).$$

We can now see why u(x) is a sufficient statistic for η in the exponential family:

$$p(\mathbf{x}|\boldsymbol{\eta}) = \underbrace{h(\mathbf{x})}_{\psi_2(x,\,u(x))}\underbrace{\exp\left\{\boldsymbol{\eta}^T\mathbf{u}(\mathbf{x}) - A(\boldsymbol{\eta})\right\}}_{\psi_1(u(x),\,\boldsymbol{\eta})}.$$
MLE for the Exponential Family

The joint density for a data set X = {x₁, ..., x_N} is itself an exponential family distribution, with sufficient statistic $\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)$:

$$p(\mathbf{X}|\boldsymbol{\eta}) = \left(\prod_{n=1}^{N}h(\mathbf{x}_n)\right)g(\boldsymbol{\eta})^{N}\exp\left\{\boldsymbol{\eta}^T\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)\right\},$$
$$\ln p(\mathbf{X}|\boldsymbol{\eta}) = \sum_{n=1}^{N}\ln h(\mathbf{x}_n) + N\ln g(\boldsymbol{\eta}) + \boldsymbol{\eta}^T\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n) = \sum_{n=1}^{N}\ln h(\mathbf{x}_n) - NA(\boldsymbol{\eta}) + \boldsymbol{\eta}^T\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n).$$

The exponential family is the only family of distributions with finite sufficient statistics (of size independent of the data set size).

The log likelihood is concave (A is convex) and has a unique maximum. Maximizing with respect to η gives:

$$\nabla A(\boldsymbol{\eta}_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n).$$

At the MLE, the empirical average of the sufficient statistic is equal to the model's theoretical expected sufficient statistic (moment matching).

Thus, to find the expected value of the sufficient statistics, one can use the data directly without having to estimate η. When u(x) = x, the above allows us to compute the expectation of x directly from the data.
MLE for the Exponential Family

Using the sufficient statistic, one can in principle invert the moment-matching equation $\nabla A(\boldsymbol{\eta}_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)$ to compute η_MLE. For example, for the Bernoulli distribution,

$$p(x|\eta) = \sigma(-\eta)\exp(\eta x), \quad u(x) = x, \quad h(x) = 1, \quad g(\eta) = \sigma(-\eta) = \frac{1}{1+e^{\eta}}, \quad A(\eta) = \ln\left(1+e^{\eta}\right),$$

and thus:

$$\mu_{MLE} = p(X=1) = \frac{1}{N}\sum_{n=1}^{N}x_n \qquad \text{and} \qquad \eta_{MLE} = \ln\frac{\mu_{MLE}}{1-\mu_{MLE}}.$$
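A minimal numerical illustration (added here, not from the slides): draw Bernoulli data, form the MLE by moment matching, and check that A′(η_MLE) equals the empirical mean of the sufficient statistic:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10_000) < 0.3             # Bernoulli(mu = 0.3) samples

mu_mle = x.mean()                        # empirical mean of the sufficient statistic
eta_mle = np.log(mu_mle / (1 - mu_mle))  # invert mu(eta): the logit

# Moment matching: A'(eta) = e^eta / (1 + e^eta) = sigma(eta) equals mu_mle
dA = np.exp(eta_mle) / (1 + np.exp(eta_mle))
assert np.isclose(dA, mu_mle)
```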
MLE and Kullback-Leibler Distance

A useful property of the MLE (not restricted to the exponential family of distributions) is the following: minimizing the KL distance to the empirical distribution is equivalent to maximizing the likelihood.

Indeed, let us consider the model log p(x|θ) and the empirical distribution

$$\tilde{p}(x) = \frac{1}{N}\sum_{n=1}^{N}\delta(x, x_n).$$

We can then derive the following:

$$\mathbb{E}_{\tilde{p}}\left[\log p(x|\theta)\right] = \sum_{x}\tilde{p}(x)\log p(x|\theta) = \frac{1}{N}\sum_{n=1}^{N}\log p(x_n|\theta) = \frac{1}{N}\ell(\theta;\mathcal{D}),$$

and from this:

$$KL\left(\tilde{p}(x)\,\|\,p(x|\theta)\right) = \sum_{x}\tilde{p}(x)\log\frac{\tilde{p}(x)}{p(x|\theta)} = \sum_{x}\tilde{p}(x)\log\tilde{p}(x) - \frac{1}{N}\ell(\theta;\mathcal{D}).$$

Since the 1st term is independent of θ, the assertion is proved.
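A small numerical illustration of this equivalence (added here; it uses a Bernoulli model on {0, 1}): over a grid of θ values, the KL minimizer and the likelihood maximizer coincide at the empirical mean:

```python
import numpy as np

rng = np.random.default_rng(1)
x = (rng.random(500) < 0.3).astype(float)
p_tilde = np.array([1 - x.mean(), x.mean()])   # empirical distribution on {0, 1}

thetas = np.linspace(0.01, 0.99, 99)           # candidate Bernoulli parameters

def kl(theta):
    p_model = np.array([1 - theta, theta])
    return np.sum(p_tilde * np.log(p_tilde / p_model))

def avg_loglik(theta):
    return np.mean(x * np.log(theta) + (1 - x) * np.log(1 - theta))

print(thetas[np.argmin([kl(t) for t in thetas])],
      thetas[np.argmax([avg_loglik(t) for t in thetas])])  # both ~ x.mean()
```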
Conjugate Priors

We have already encountered the concept of a conjugate prior:

For the Bernoulli, the conjugate prior is the Beta distribution.

For the Gaussian, the conjugate prior for the mean is a Gaussian, and the conjugate prior for the precision is the Wishart distribution.

In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior. For any member of the exponential family,

$$p(\mathbf{x}|\boldsymbol{\theta}) = h(\mathbf{x})\,g(\boldsymbol{\eta}(\boldsymbol{\theta}))\exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\mathbf{u}(\mathbf{x})\right\},$$

there exists a conjugate prior that can be written in the form

$$p(\boldsymbol{\theta}|\boldsymbol{\tau}_0, \nu_0) \propto g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0}\exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\boldsymbol{\tau}_0\right\} = \exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\boldsymbol{\tau}_0 - \nu_0 A(\boldsymbol{\eta}(\boldsymbol{\theta}))\right\}, \quad \text{where } A := -\log g.$$

In normalized form, we write:

$$p(\boldsymbol{\theta}|\boldsymbol{\tau}_0, \nu_0) = \frac{1}{Z(\boldsymbol{\tau}_0, \nu_0)}\exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\boldsymbol{\tau}_0 - \nu_0 A(\boldsymbol{\eta}(\boldsymbol{\theta}))\right\}, \quad \text{where } Z(\boldsymbol{\tau}_0, \nu_0) := \int\exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\boldsymbol{\tau}_0 - \nu_0 A(\boldsymbol{\eta}(\boldsymbol{\theta}))\right\}d\boldsymbol{\theta}.$$
Conjugate Priors

Using the likelihood $p(\mathbf{X}|\boldsymbol{\eta}) = \left(\prod_{n=1}^{N}h(\mathbf{x}_n)\right)g(\boldsymbol{\eta})^{N}\exp\left\{\boldsymbol{\eta}^T\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)\right\}$, the posterior becomes (this form justifies the conjugate prior chosen above):

$$p(\boldsymbol{\theta}|\mathbf{X}, \boldsymbol{\tau}_0, \nu_0) \propto g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0+N}\exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\left(\boldsymbol{\tau}_0 + \sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n)\right)\right\},$$

i.e., in normalized form,

$$p(\boldsymbol{\theta}|\mathbf{X}, \boldsymbol{\tau}_0, \nu_0) = \frac{1}{Z(\boldsymbol{\tau}_N, \nu_N)}\exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\boldsymbol{\tau}_N - \nu_N A(\boldsymbol{\eta}(\boldsymbol{\theta}))\right\},$$
$$\text{where } \boldsymbol{\tau}_N = \boldsymbol{\tau}_0 + N\bar{\mathbf{u}}, \quad \nu_N = \nu_0 + N, \quad \bar{\mathbf{u}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{u}(\mathbf{x}_n).$$

The parameter ν₀ can be interpreted as an effective number of fictitious observations in the prior, each of which has a value of the sufficient statistic equal to τ₀/ν₀.
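A minimal sketch of this conjugate update in natural-parameter form (added for illustration; the hyperparameter values are arbitrary), using the Bernoulli model with u(x) = x, in which case the update reduces to the Beta pseudo-count update of the next slides:

```python
import numpy as np

rng = np.random.default_rng(2)
x = (rng.random(50) < 0.3).astype(float)
N = len(x)

tau0, nu0 = 1.0, 2.0    # prior hyperparameters: nu0 pseudo-trials, tau0 pseudo-heads
tauN = tau0 + x.sum()   # tau_N = tau0 + N * u_bar, with u(x) = x
nuN = nu0 + N           # nu_N = nu0 + N

# For the Bernoulli this is the Beta posterior of the next slide:
a_post, b_post = tauN + 1, nuN - tauN + 1
print(f"Beta({a_post:.0f}, {b_post:.0f}), posterior mean {a_post / (a_post + b_post):.3f}")
```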
Posterior Predictive

Let $\mathbf{X}' = \{\mathbf{x}'_1, \ldots, \mathbf{x}'_{N'}\}$ be future data; the posterior predictive is then:

$$p(\mathbf{X}'|\mathbf{X}) = \int p(\mathbf{X}'|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|\mathbf{X})\,d\boldsymbol{\theta} = \left(\prod_{i=1}^{N'}h(\mathbf{x}'_i)\right)\frac{1}{Z(\boldsymbol{\tau}_0 + \mathbf{u}(\mathbf{X}), \nu_0 + N)}\int g(\boldsymbol{\eta}(\boldsymbol{\theta}))^{\nu_0+N+N'}\exp\left\{\boldsymbol{\eta}(\boldsymbol{\theta})^T\left(\boldsymbol{\tau}_0 + \mathbf{u}(\mathbf{X}) + \mathbf{u}(\mathbf{X}')\right)\right\}d\boldsymbol{\theta},$$

where $\mathbf{u}(\mathbf{X}) = \sum_{i=1}^{N}\mathbf{u}(\mathbf{x}_i)$ and $\mathbf{u}(\mathbf{X}') = \sum_{i=1}^{N'}\mathbf{u}(\mathbf{x}'_i)$.

This simplifies as follows:

$$p(\mathbf{X}'|\mathbf{X}) = \left(\prod_{i=1}^{N'}h(\mathbf{x}'_i)\right)\frac{Z\left(\boldsymbol{\tau}_0 + \mathbf{u}(\mathbf{X}) + \mathbf{u}(\mathbf{X}'),\ \nu_0 + N + N'\right)}{Z\left(\boldsymbol{\tau}_0 + \mathbf{u}(\mathbf{X}),\ \nu_0 + N\right)}.$$

If N = 0, this becomes the marginal likelihood of X', which reduces to the normalizer of the posterior divided by the normalizer of the prior, multiplied by a constant.
Beta/Bernoulli: Posterior Predictive

Consider a Bernoulli likelihood with a Beta prior. The likelihood takes the familiar exponential family form:

$$p(\mathcal{D}|\theta) = \prod_i \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}(1-\theta)^{N-\sum_i x_i} = (1-\theta)^{N}\exp\left\{\sum_i x_i\log\frac{\theta}{1-\theta}\right\}.$$

The conjugate prior is a Beta:

$$p(\theta|\nu_0, \tau_0) \propto \theta^{\tau_0}(1-\theta)^{\nu_0-\tau_0} = (1-\theta)^{\nu_0}\exp\left\{\tau_0\log\frac{\theta}{1-\theta}\right\}, \quad \text{i.e.} \quad p(\theta|\nu_0, \tau_0) = \text{Beta}(\theta|\tau_0+1, \nu_0-\tau_0+1).$$

Thus the posterior becomes:

$$p(\theta|\mathcal{D}) = \text{Beta}(\theta|\tau_N+1, \nu_N-\tau_N+1), \qquad \tau_N = \tau_0 + s, \quad \nu_N = \nu_0 + N, \quad s = \sum_i x_i.$$

Let s be the number of heads in the past data. The probability of a future sequence with s' heads in m trials is then:

$$p(\mathcal{D}'|\mathcal{D}) = \int \theta^{s'}(1-\theta)^{m-s'}\,p(\theta|\mathcal{D})\,d\theta = \frac{B\left(\tau_{N'}+1,\ \nu_{N'}-\tau_{N'}+1\right)}{B\left(\tau_N+1,\ \nu_N-\tau_N+1\right)}, \qquad \tau_{N'} = \tau_N + s', \quad \nu_{N'} = \nu_N + m,$$

where $s' = \sum_{i=1}^{m}x'_i$ and B(·,·) is the Beta function.
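The Beta-function ratio above is easy to evaluate in log space. A short sketch (added here, with illustrative counts; it uses only the standard-library lgamma):

```python
import math

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

nu0, tau0 = 2.0, 1.0    # prior pseudo-counts
N, s = 20, 7            # past data: s = 7 heads in N = 20 trials
tauN, nuN = tau0 + s, nu0 + N

m, s_new = 5, 3         # a particular future sequence with 3 heads in 5 trials
log_pred = (log_beta(tauN + s_new + 1, nuN - tauN + m - s_new + 1)
            - log_beta(tauN + 1, nuN - tauN + 1))
print(math.exp(log_pred))
```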
Maximum Entropy and the Exponential Family

If nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the default.¹

The entropy (discrete case) is defined as

$$H(q) = -\sum_k q_k \log q_k.$$

When some statistics (moments) of the distribution are known,

$$\mathbb{E}_q\left[g_k(\theta)\right] = w_k, \quad k = 1, \ldots, K,$$

the maximum entropy distribution is of the form (the λ's are the Lagrange multipliers enforcing the constraints):

$$q_i = \frac{\exp\left\{\sum_{k=1}^{K}\lambda_k g_k(\theta_i)\right\}}{\sum_j \exp\left\{\sum_{k=1}^{K}\lambda_k g_k(\theta_j)\right\}}.$$

Thus the MaxEnt distribution has the form of the exponential family.

¹ C. P. Robert, The Bayesian Choice, Springer, 2nd edition, Chapter 3 (full text available for Cornell students).
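As an added illustration of this result, consider the classic loaded-die example: find the MaxEnt pmf on {1, ..., 6} whose mean is 4.5 by solving for the single Lagrange multiplier with Newton's method (using d(mean)/dλ = var):

```python
import numpy as np

values = np.arange(1, 7)    # faces of a die
target = 4.5                # moment constraint E[x] = 4.5

lam = 0.0                   # Lagrange multiplier for the mean constraint
for _ in range(100):
    q = np.exp(lam * values)
    q /= q.sum()                     # q_i = exp(lam * x_i) / Z(lam)
    mean = q @ values
    var = q @ values**2 - mean**2
    lam += (target - mean) / var     # Newton step: d(mean)/d(lam) = var(x)

print(np.round(q, 4), q @ values)    # MaxEnt pmf and its mean (= 4.5)
```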
Generalized Linear Models

We now study the relationship between X and Y using a GLIM.

We choose to model a particular conditional expectation of Y, and denote the modeled value of the conditional expectation as μ = f(θᵀx).

As with linear regression, GLIMs extend these ideas beyond the Gaussian, Bernoulli and multinomial settings to the more general exponential family.

X enters linearly as θᵀx, and f is called a response function; ψ, defined by η = ψ(μ), is a one-to-one map from μ to η.

To specify a GLIM we need (a) a choice of exponential family distribution, and (b) a choice of the response function f(·).

Choosing the exponential family distribution is strongly constrained by the nature of the data.

In choosing f(·), note that it needs to be both monotonic and differentiable. However, there is a particular response function (the canonical response function) that is uniquely associated with a given exponential family distribution.
Canonical Response Function

The canonical response function is

$$f(\cdot) = \psi^{-1}(\cdot),$$

in which case $\eta = \psi(\mu) = \psi\!\left(f(\boldsymbol{\theta}^T\mathbf{x})\right) = \boldsymbol{\theta}^T\mathbf{x}$.

If we decide to use the canonical response function, the choice of the exponential family density completely determines the GLIM.
MLE & Canonical Response Function

Consider a regression problem with data D = {(x_n, y_n)}, n = 1, ..., N. The log likelihood for a GLIM is:

$$\ell(\boldsymbol{\theta}; \mathcal{D}) = \sum_{n=1}^{N}\log h(y_n) + \sum_{n=1}^{N}\left\{\eta_n y_n - A(\eta_n)\right\}, \quad \text{where } \mu_n = f(\boldsymbol{\theta}^T\mathbf{x}_n) \text{ and } \eta_n = \psi(\mu_n).$$

For a canonical response, $\eta_n = \boldsymbol{\theta}^T\mathbf{x}_n$ and this simplifies to:

$$\ell(\boldsymbol{\theta}; \mathcal{D}) = \sum_{n=1}^{N}\log h(y_n) + \boldsymbol{\theta}^T\underbrace{\sum_{n=1}^{N}\mathbf{x}_n y_n}_{\text{sufficient statistic for }\boldsymbol{\theta}} - \sum_{n=1}^{N}A(\eta_n).$$

Regardless of N, the size of the sufficient statistic is fixed (the dimension of x_n); this is an important reason for using the canonical response.

Taking the gradient, using $A'(\eta_n) = \mu_n$:

$$\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\theta}; \mathcal{D}) = \sum_{n=1}^{N}\left(y_n - A'(\eta_n)\right)\mathbf{x}_n = \sum_{n=1}^{N}(y_n - \mu_n)\mathbf{x}_n, \quad \text{or} \quad \nabla_{\boldsymbol{\theta}}\ell = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu}).$$

This is a general expression for GLIMs with exponential family distributions and the canonical response function.
Batch Algorithm

The Hessian can now be computed from $\nabla_{\boldsymbol{\theta}}\ell = \sum_{n=1}^{N}(y_n - \mu_n)\mathbf{x}_n = \mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu})$ as:

$$\mathbf{H} = -\sum_{n=1}^{N}\frac{d\mu_n}{d\eta_n}\mathbf{x}_n\mathbf{x}_n^T = -\mathbf{X}^T\mathbf{W}\mathbf{X}, \quad \text{where } \mathbf{W} := \mathrm{diag}\left(\frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N}\right).$$

To estimate the parameters with the canonical response function choice, one can use the iteratively reweighted least squares (IRLS) algorithm. The batch Newton algorithm now takes the familiar IRLS form:

$$\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \left(\mathbf{X}^T\mathbf{W}^t\mathbf{X}\right)^{-1}\mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu}^t) = \left(\mathbf{X}^T\mathbf{W}^t\mathbf{X}\right)^{-1}\left[\mathbf{X}^T\mathbf{W}^t\mathbf{X}\boldsymbol{\theta}^t + \mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu}^t)\right] = \left(\mathbf{X}^T\mathbf{W}^t\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W}^t\mathbf{z}^t,$$
$$\text{where } \mathbf{z}^t := \mathbf{X}\boldsymbol{\theta}^t + (\mathbf{W}^t)^{-1}(\mathbf{y} - \boldsymbol{\mu}^t).$$

For non-canonical response functions, the Hessian has an extra term containing the factor (y - μ). When we take expectations this term vanishes, so using the expected Hessian in the Newton method, the algorithm looks essentially the same (the Fisher scoring algorithm).
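A minimal IRLS sketch for the canonical Bernoulli GLIM (logistic regression), added here for illustration; the data are synthetic and the function names are ours:

```python
import numpy as np

def irls_logistic(X, y, iters=20):
    """Batch IRLS (Newton) for logistic regression, a minimal sketch.
    Canonical Bernoulli GLIM: mu = sigmoid(X theta), W = diag(mu (1 - mu))."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))
        W = mu * (1.0 - mu)                        # d mu_n / d eta_n
        H = X.T @ (W[:, None] * X)                 # X^T W X
        theta += np.linalg.solve(H, X.T @ (y - mu))
    return theta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_theta = np.array([-0.5, 2.0])
y = (rng.random(500) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)
print(irls_logistic(X, y))   # should land near [-0.5, 2.0]
```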
Sequential Estimation - LMS

An on-line estimation algorithm can be obtained by following the stochastic gradient of the log likelihood function:

$$\boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} + \rho\left(y_n - \mu_n^t\right)\mathbf{x}_n, \qquad \mu_n^t = f\!\left((\boldsymbol{\theta}^t)^T\mathbf{x}_n\right),$$

where ρ is a step size.

If we do not use the canonical response function, then the gradient also includes the derivatives of f(·) and ψ(·). These can be viewed as scaling coefficients that alter the step size but otherwise leave the general LMS form intact.

The LMS algorithm is the generic stochastic gradient algorithm for models throughout the GLIM family.
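A corresponding minimal LMS sketch (an added illustration, using the same synthetic logistic setting as the IRLS example above): one stochastic-gradient step per data point with a constant step size ρ:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(20_000), rng.normal(size=20_000)])
true_theta = np.array([-0.5, 2.0])
y = (rng.random(20_000) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)

theta = np.zeros(2)
rho = 0.05                                   # step size
for x_n, y_n in zip(X, y):                   # one update per data point
    mu_n = 1.0 / (1.0 + np.exp(-theta @ x_n))
    theta += rho * (y_n - mu_n) * x_n
print(theta)                                 # hovers around the true [-0.5, 2.0]
```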