
Hyperparameter estimation in Dirichlet process mixture models

    By MIKE WEST

    Institute of Statistics and Decision Sciences

    Duke University, Durham NC 27706, USA.

    SUMMARY

In Bayesian density estimation and prediction using Dirichlet process mixtures of standard, exponential family distributions, the precision or total mass parameter of the mixing Dirichlet process is a critical hyperparameter that strongly influences resulting inferences about numbers of mixture components. This note shows how, with respect to a flexible class of prior distributions for this parameter, the posterior may be represented in a simple conditional form that is easily simulated. As a result, inference about this key quantity may be developed in tandem with the existing, routine Gibbs sampling algorithms for fitting such mixture models. The concept of data augmentation is important, as ever, in developing this extension of the existing algorithm. A final section notes a simple asymptotic approximation to the posterior.

Some key words: Data augmentation; Dirichlet process; Gibbs sampling; Mixture models

This research was partially financed by the National Science Foundation under grant DMS-9024793.


    1. INTRODUCTION

Escobar and West (1991) develop Gibbs sampling methods for the computations involved in Bayesian density estimation and problems of deconvolution using mixtures of Dirichlet processes. We adopt the normal mixture model of this reference for example here; note that the assumed normal data distributions may be replaced by any exponential family form, if required. In analysis of random samples from these nonparametric Bayesian models, data y_1, . . . , y_n are assumed conditionally independently normally distributed, (y_j | μ_j, v_j) ∼ N(μ_j, v_j), (j = 1, . . . , n), with the bivariate quantities (μ_j, v_j) drawn independently from an uncertain prior distribution G(μ, v). The Dirichlet mixture model supposes that G(·, ·) is a bivariate Dirichlet process, G ∼ D(αG_0), with mean function G_0(μ, v) = E{G(μ, v)}, for any point (μ, v), and precision, or total mass, parameter α > 0 (Ferguson, 1973; Antoniak, 1974). Following Escobar and West, we take G_0 to be the conjugate normal/inverse-gamma distribution under which v^{-1} ∼ G(s/2, S/2) and (μ | v) ∼ N(m, τv). A key feature of the Dirichlet is its discreteness, which in our context implies that the pairs (μ_j, v_j), (j = 1, . . . , n), concentrate on a set of some k ≤ n distinct pairs (μ_i, w_i), (i = 1, . . . , k). As a result, supposing these pairs together with G_0 and α to be known, the conditional predictive distributions in such models are mixtures of normals,

    p(y | π) = α a_n N(μ_0, v_0) + a_n Σ_{i=1}^{k} N(μ_i, w_i),   (1)

where π = {k, (μ_i, w_i, i = 0, . . . , k), G_0} and a_n = 1/(α + n). Observing a sample D = (y_1, . . . , y_n) leads to a posterior distribution P(π | D), and predictive inference is based on the Bayesian density estimate p(y | D) = ∫ p(y | π) dP(π | D). Even if n is extremely small, posteriors P(π | D) are extraordinarily complicated, and simulation is the only method available. If P(π | D) can be sampled, the integral defining p(y | D) can be approximated by a summation over samples of π values for efficient Monte Carlo inference (Gelfand and Smith, 1990). Escobar and West (1991) show how this may be easily achieved, and how the basic simulation algorithm extends to incorporate learning about the hyperparameters m, τ and, less critically, S of G_0. Illustrations appear in that paper, and some further applications appear in West and Cao (1992) (the latter in the case of constant variances v_i = v for each i).

Central to this class of models is the precision parameter α of the underlying Dirichlet process. This directly determines the prior distribution for k, the number of additional normal components in the mixture (1), and is thus a critical smoothing parameter for the model; it has previously been assumed specified by the investigator. At each stage of the simulation analysis, a specific value of k is simulated from the posterior for k (together with sampled values of the means and variances of the normal components), which also depends critically on this hyperparameter α. We now show how, based on a specific but flexible family of prior distributions for α, the parameter vector π may be augmented to allow for simulation of the full joint posterior, now including α.

2. COMPUTING p(α|k)

    2.1 Introduction

We begin by recalling results from Antoniak (1974). There it is shown that the prior distribution of k in (1) may be written as

    P(k | α, n) = c_n(k) n! α^k Γ(α) / Γ(α + n),   (k = 1, 2, . . . , n),   (2)

where c_n(k) = P(k | α = 1, n), not involving α. If required, the factors c_n(k) are easily computed using recurrence formulae for Stirling numbers (further details available on request from the author). This is important, for example, in considering the implications for priors over k of specific choices of priors for α (and vice-versa) in the initial prior elicitation process.
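As an aside, the recurrence in question is the standard one for unsigned Stirling numbers of the first kind, |S_m^{(k)}| = (m − 1)|S_{m−1}^{(k)}| + |S_{m−1}^{(k−1)}|. The note leaves the details to the reader, so the following Python sketch is purely illustrative (the function names are invented here); it evaluates c_n(k) up to proportionality in logs and hence the prior (2), normalising over k = 1, . . . , n.

```python
import numpy as np
from scipy.special import gammaln

def log_stirling1_unsigned(n):
    """Logs of unsigned Stirling numbers of the first kind |S_n^(k)|, k = 0..n,
    via |S_m^(k)| = (m-1)|S_{m-1}^(k)| + |S_{m-1}^(k-1)| (computed in logs)."""
    logS = np.full(n + 1, -np.inf)
    logS[0] = 0.0                              # |S_0^(0)| = 1
    for m in range(1, n + 1):
        new = np.full(n + 1, -np.inf)
        for k in range(1, m + 1):
            terms = [logS[k - 1]]
            if m > 1:
                terms.append(np.log(m - 1) + logS[k])
            new[k] = np.logaddexp.reduce(terms)
        logS = new
    return logS

def log_prior_k(alpha, n):
    """log P(k | alpha, n) of equation (2), k = 1..n, with c_n(k) proportional
    to |S_n^(k)|; normalising over k absorbs the constants n! and 1/n!."""
    logS = log_stirling1_unsigned(n)[1:]
    logp = logS + np.arange(1, n + 1) * np.log(alpha) \
           + gammaln(alpha) - gammaln(alpha + n)
    return logp - np.logaddexp.reduce(logp)
```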

Assume a continuous prior density p(α) (which may depend on the sample size n), hence an implied prior P(k | n) = ∫ P(k | α, n) p(α) dα. Now suppose we have sampled values of the parameter k and all the other model parameters. From our model, the data D are initially conditionally independent of α given all the other parameters, and we deduce

    p(α | k, π, D) = p(α | k) ∝ p(α) P(k | α),   (3)

with likelihood function given in (2) (the sample size n should appear in the conditioning, of course, but is omitted for clarity of notation). Thus the Gibbs sampling analysis can be extended; for given α, we sample parameters π, and hence k, as usual from the conditional posterior p(π | α, D). Then, at each iteration, we can include α in the analysis by sampling from the conditional posterior (3) based on the previously sampled value of k; no other information is needed. Sampling from (3) may involve using a rejection, or other, method depending on the form of the prior p(α). Alternatively, we may discretise the range of α so that (3) provides a discrete approximation to the posterior, the so-called griddy Gibbs approach. More attractively, sampling from the exact, continuous posterior (3) is possible in the Gibbs iterations when the prior p(α) comes from the class of mixtures of gamma distributions. We develop the results initially for a single gamma prior.


2.2 Gamma prior for α

Suppose α ∼ G(a, b), a gamma prior with shape a > 0 and scale b > 0 (which we may extend to include a reference prior, uniform for log(α), by letting a → 0 and b → 0). In this case, (3) may be expressed as a mixture of two gamma posteriors, and the conditional distribution of the mixing parameter given α and k (and, of course, n) is a simple beta. See this as follows. For α > 0, the gamma functions in (2) can be written as

    Γ(α) / Γ(α + n) = (α + n) β(α + 1, n) / (α Γ(n)),

where β(·, ·) is the usual beta function. Then, in (3), and for any k = 1, 2, . . . , n,

    p(α | k) ∝ p(α) α^{k−1} (α + n) β(α + 1, n)
             ∝ p(α) α^{k−1} (α + n) ∫_0^1 x^α (1 − x)^{n−1} dx,

using the definition of the beta function. This implies that p(α | k) is the marginal distribution from a joint for α and a continuous quantity x (0 < x < 1) such that

    p(α, x | k) ∝ p(α) α^{k−1} (α + n) x^α (1 − x)^{n−1},   (α > 0, 0 < x < 1).

Hence we have conditional posteriors p(α | x, k) and p(x | α, k) determined as follows. Firstly, under the G(a, b) prior for α,

    p(α | x, k) ∝ α^{a+k−2} (α + n) e^{−α(b − log(x))}
                ∝ α^{a+k−1} e^{−α(b − log(x))} + n α^{a+k−2} e^{−α(b − log(x))}

for α > 0, which reduces easily to a mixture of two gamma densities, viz

    (α | x, k) ∼ π_x G(a + k, b − log(x)) + (1 − π_x) G(a + k − 1, b − log(x))   (4)

with weights π_x defined by

    π_x / (1 − π_x) = (a + k − 1) / {n (b − log(x))}

(note that these distributions are well defined for all gamma priors, all x in the unit interval and all k ≥ 1). Secondly,

    p(x | α, k) ∝ x^α (1 − x)^{n−1},   (0 < x < 1),   (5)

so that (x | α, k) ∼ B(α + 1, n), a beta distribution with mean (α + 1)/(α + n + 1).


It should now be clear how α can be sampled at each stage of the simulation: at each Gibbs iteration, the currently sampled values of k and α allow us to draw a new value of α by (i) first sampling an x value from the simple beta distribution (5), conditional on α and k fixed at their most recent values; then (ii) sampling the new α value from the mixture of gammas in (4) based on the same k and the x value just generated in (i).
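For concreteness, steps (i) and (ii) take only a few lines. The sketch below is illustrative rather than taken from the paper: the function name update_alpha and the numpy generator rng are invented, and b is treated as a rate parameter, matching the density p(α) ∝ α^{a−1} e^{−bα} used above (so numpy's scale argument is 1/(b − log x)).

```python
import numpy as np

def update_alpha(alpha, k, n, a, b, rng):
    """One Gibbs update of the DP precision alpha under a G(a, b) prior,
    given the current number of distinct components k and sample size n."""
    # (i) latent x | alpha, k ~ Beta(alpha + 1, n), equation (5)
    x = rng.beta(alpha + 1.0, n)
    # (ii) two-component gamma mixture, equation (4):
    #      pi_x / (1 - pi_x) = (a + k - 1) / (n (b - log x))
    rate = b - np.log(x)
    odds = (a + k - 1.0) / (n * rate)
    shape = a + k if rng.uniform() < odds / (1.0 + odds) else a + k - 1.0
    return rng.gamma(shape, 1.0 / rate), x   # numpy's gamma takes scale = 1/rate

# example: alpha, x = update_alpha(2.0, k=5, n=100, a=1.0, b=1.0,
#                                  rng=np.random.default_rng(0))
```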

On completion of the simulation, we will have a series of sampled values of k, α, x, and all the other parameters. Suppose that the Monte Carlo sample size is N, and denote the sampled values k_s, x_s, etc., for s = 1, . . . , N. Only the sampled values k_s and x_s are needed in estimating the posterior p(α | D) via the usual Monte Carlo average of conditional posteriors, viz

    p(α | D) ≈ N^{−1} Σ_{s=1}^{N} p(α | x_s, k_s),

where the summands are simply the conditional gamma mixtures in equation (4).
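A minimal sketch of this averaging step (again illustrative, with an invented function name; scipy's gamma density is parameterised by shape and scale = 1/rate):

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

def posterior_density_alpha(alpha_grid, k_samples, x_samples, n, a, b):
    """Estimate p(alpha | D) on a grid of alpha values as the average over Gibbs
    draws (k_s, x_s) of the conditional gamma mixtures in equation (4)."""
    dens = np.zeros_like(alpha_grid, dtype=float)
    for k, x in zip(k_samples, x_samples):
        rate = b - np.log(x)
        odds = (a + k - 1.0) / (n * rate)
        pi_x = odds / (1.0 + odds)
        dens += (pi_x * gamma_dist.pdf(alpha_grid, a + k, scale=1.0 / rate)
                 + (1.0 - pi_x) * gamma_dist.pdf(alpha_grid, a + k - 1.0, scale=1.0 / rate))
    return dens / len(k_samples)
```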

2.3 Mixture of gamma priors for α

Now consider cases in which the prior for α is a mixture of gammas, α ∼ Σ_{i=1}^{h} c_i G(a_i, b_i), based on specified shapes a_i, scales b_i and mixture weights c_i that sum to unity. The above analysis generalises easily: the conditional posterior for (x | α, k) remains as in (5), but the two-component mixture (4) for (α | x, k) extends to the 2h-component mixture

    (α | x, k) ∼ Σ_{i=1}^{h} c_{i,x} {π_{i,x} G(a_i + k, b_i − log(x)) + (1 − π_{i,x}) G(a_i + k − 1, b_i − log(x))},

where, for i = 1, . . . , h,

    π_{i,x} / (1 − π_{i,x}) = (a_i + k − 1) / {n (b_i − log(x))}

and

    c_{i,x} ∝ c_i Γ(a_i + k − 1) (b_i − log(x))^{−(a_i + k − 1)} {n + (a_i + k − 1) / (b_i − log(x))},

subject, of course, to unit sum. This extended mixture is trivially sampled.
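As a rough illustration only (not the paper's code, and relying on the proportionality for c_{i,x} stated above), the extended sampling step might look as follows, with arrays a, b, c holding the prior shapes, rates and weights and rng a numpy generator:

```python
import numpy as np
from scipy.special import gammaln

def update_alpha_mixture_prior(alpha, k, n, a, b, c, rng):
    """Gibbs update of alpha when the prior is sum_i c_i G(a_i, b_i):
    draw x from (5), then alpha from the 2h-component gamma mixture above."""
    a, b, c = map(np.asarray, (a, b, c))
    x = rng.beta(alpha + 1.0, n)                  # step (i), unchanged
    rate = b - np.log(x)                          # b_i - log(x), one entry per component
    # unnormalised log weights c_{i,x}, as stated above
    logw = (np.log(c) + gammaln(a + k - 1.0)
            - (a + k - 1.0) * np.log(rate)
            + np.log(n + (a + k - 1.0) / rate))
    w = np.exp(logw - logw.max())
    w /= w.sum()
    i = rng.choice(len(c), p=w)                   # pick one prior component
    odds = (a[i] + k - 1.0) / (n * rate[i])       # within-component two-gamma split
    shape = a[i] + k if rng.uniform() < odds / (1.0 + odds) else a[i] + k - 1.0
    return rng.gamma(shape, 1.0 / rate[i]), x
```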

    3. AN ASYMPTOTIC RESULT

A simple asymptotic approximation to the posterior for α can be derived by developing an asymptotic approximation to the sampling density (2) directly. Take equation (2),

    P(k | α, n) = c_n(k) n! α^k Γ(α) / Γ(α + n),   (k = 1, 2, . . . , n),

with c_n(k) ∝ |S_n^{(k)}| (using results in Antoniak, 1974), where S_n^{(k)} (k = 1, . . . , n) are Stirling numbers of the first kind (Abramowitz & Stegun, Section 24.1.3, p. 824). As n → ∞ and with k = o(log(n)), Abramowitz & Stegun give the asymptotic approximation

    |S_n^{(k)}| ≈ (n − 1)! (γ + log(n))^{k−1} / (k − 1)!,

where γ is Euler's constant. It trivially follows that, under these conditions, (2) reduces to

    P(k | α, n) ∝ α^k (γ + log(n))^k / (k − 1)!,   k = o(log(n)),

and so we have the asymptotic Poisson form k = 1 + X, with X Poisson with mean α(γ + log(n)). In our framework, k is typically much smaller than n when n is large, so the approximation should be useful; the Poisson distribution for numbers of groups is very nice indeed.

As a corollary, we have the following approximation to the posterior for α given k (and n) discussed above. Taking the gamma prior α ∼ G(a, b), the Poisson approximation here implies directly an approximate gamma posterior

    p(α | k, n) ≈ G(a + k − 1, b + γ + log(n))

for large n and with k = o(log(n)). Just how good these approximations are remains to be explored. The approximate posterior may be used to simulate values of α in the analysis, though the earlier exact mixture representation is only slightly more involved computationally.
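By way of illustration only (the function name is invented; np.euler_gamma is numpy's value of Euler's constant γ), the approximation amounts to:

```python
import numpy as np

def approx_alpha_posterior(k, n, a, b):
    """Asymptotic approximation p(alpha | k, n) ~ G(a + k - 1, b + gamma + log n),
    returned as (shape, rate); intended for large n with k = o(log n)."""
    return a + k - 1.0, b + np.euler_gamma + np.log(n)

# approximate posterior draws of alpha, e.g. k = 8 groups among n = 500 observations:
# shape, rate = approx_alpha_posterior(8, 500, a=1.0, b=1.0)
# alphas = np.random.default_rng(1).gamma(shape, 1.0 / rate, size=1000)
```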

    REFERENCES

Abramowitz, M. and Stegun, I.A. (1965) Handbook of Mathematical Functions, with Formulas, Graphs, and Mathematical Tables, New York: Dover.

Antoniak, C.E. (1974) Mixtures of Dirichlet processes with applications to nonparametric problems, The Annals of Statistics, 2, 1152-1174.

Escobar, M.D. and West, M. (to appear) Bayesian density estimation and inference using mixtures, Journal of the American Statistical Association.

Ferguson, T.S. (1983) Bayesian density estimation by mixtures of normal distributions, in Recent Advances in Statistics, eds. H. Rizvi and J. Rustagi, New York: Academic Press, pp. 287-302.

Gelfand, A.E. and Smith, A.F.M. (1990) Sampling based approaches to calculating marginal densities, Journal of the American Statistical Association, 85, 398-409.

West, M. and Cao, G. (to appear) Assessing mechanisms of neural synaptic activity, in Bayesian Statistics in Science and Technology: Case Studies.
