
Hyperparameter estimation in Dirichlet process mixture models

    By MIKE WEST

    Institute of Statistics and Decision Sciences

    Duke University, Durham NC 27706, USA.

    SUMMARY

In Bayesian density estimation and prediction using Dirichlet process mixtures of standard, exponential family distributions, the precision or total mass parameter of the mixing Dirichlet process is a critical hyperparameter that strongly influences resulting inferences about numbers of mixture components. This note shows how, with respect to a flexible class of prior distributions for this parameter, the posterior may be represented in a simple conditional form that is easily simulated. As a result, inference about this key quantity may be developed in tandem with the existing, routine Gibbs sampling algorithms for fitting such mixture models. The concept of data augmentation is important, as ever, in developing this extension of the existing algorithm. A final section notes a simple asymptotic approximation to the posterior.

Some key words: Data augmentation; Dirichlet process; Gibbs sampling; Mixture models

This research was partially financed by the National Science Foundation under grant DMS-9024793.


    1. INTRODUCTION

Escobar and West (1991) develop Gibbs sampling methods for the computations involved in Bayesian density estimation and problems of deconvolution using mixtures of Dirichlet processes. We adopt the normal mixture model of this reference for example here; note that the assumed normal data distributions may be replaced by any exponential family form, if required. In analysis of random samples from these nonparametric Bayesian models, data y_1, . . . , y_n are assumed conditionally independently normally distributed, (y_j | μ_j, v_j) ∼ N(μ_j, v_j), (j = 1, . . . , n), with the bivariate quantities (μ_j, v_j) drawn independently from an uncertain prior distribution G(μ, v). The Dirichlet mixture model supposes that G(·, ·) is a bivariate Dirichlet process, G ∼ D(αG_0), with mean function G_0(μ, v) = E{G(μ, v)}, for any point (μ, v), and precision, or total mass, parameter α > 0 (Ferguson, 1973; Antoniak, 1974). Following Escobar and West, we take G_0 to be the conjugate normal/inverse-gamma distribution under which v^{-1} ∼ G(s/2, S/2) and (μ | v) ∼ N(m, τv). A key feature of the Dirichlet is its discreteness, which in our context implies that the pairs (μ_j, v_j), (j = 1, . . . , n), concentrate on a set of some k ≤ n distinct pairs (μ_i, w_i), (i = 1, . . . , k). As a result, supposing these pairs together with G_0 and α to be known, the conditional predictive distributions in such models are mixtures of normals,

    p(y | π) = α a_n N(μ_0, v_0) + a_n Σ_{i=1}^{k} N(μ_i, w_i),   (1)

where π = {k, (μ_i, w_i, i = 0, . . . , k), G_0} and a_n = 1/(α + n). Observing a sample D = (y_1, . . . , y_n) leads to a posterior distribution P(π | D), and predictive inference is based on the Bayesian density estimate p(y | D) = ∫ p(y | π) dP(π | D). Even if n is extremely small, posteriors P(π | D) are extraordinarily complicated, and simulation is the only method available. If P(π | D) can be sampled, the integral defining p(y | D) can be approximated by a summation over samples of π values for efficient Monte Carlo inference (Gelfand and Smith, 1990). Escobar and West (1991) show how this may be easily achieved, and how the basic simulation algorithm extends to incorporate learning about the hyperparameters m, τ and, less critically, S of G_0. Illustrations appear in that paper, and some further applications appear in West and Cao (1992) (the latter in the case of constant variances v_i = v for each i).

Central to this class of models is the precision parameter α of the underlying Dirichlet process. This directly determines the prior distribution for k, the number of additional normal components in the mixture (1), and is thus a critical smoothing parameter for the model; it has previously been assumed specified by the investigator. At each stage of the simulation analysis, a specific value of k is simulated from the posterior for k (together with sampled values of the means and variances of the normal components), which also depends critically on this hyperparameter α. We now show how, based on a specific but flexible family of prior distributions for α, the parameter vector π may be augmented to allow for simulation of the full joint posterior, now including α.

2. COMPUTING p(α|k)

    2.1 Introduction

We begin by recalling results from Antoniak (1974). There it is shown that the prior distribution of k in (1) may be written as

    P(k | α, n) = c_n(k) n! α^k Γ(α) / Γ(α + n),   (k = 1, 2, . . . , n),   (2)

where c_n(k) = P(k | α = 1, n), not involving α. If required, the factors c_n(k) are easily computed using recurrence formulae for Stirling numbers (further details available on request from the author). This is important, for example, in considering the implications for priors over k of specific choices of priors for α (and vice-versa) in the initial prior elicitation process.
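As an aside, the recurrence in question is the standard one for unsigned Stirling numbers of the first kind, |S_m^{(k)}| = (m − 1)|S_{m−1}^{(k)}| + |S_{m−1}^{(k−1)}|. The note leaves the details to the reader, so the following Python sketch is purely illustrative (the function names are invented here); it evaluates c_n(k) up to proportionality in logs and hence the prior (2), normalising over k = 1, . . . , n.

```python
import numpy as np
from scipy.special import gammaln

def log_stirling1_unsigned(n):
    """Logs of unsigned Stirling numbers of the first kind |S_n^(k)|, k = 0..n,
    via |S_m^(k)| = (m-1)|S_{m-1}^(k)| + |S_{m-1}^(k-1)| (computed in logs)."""
    logS = np.full(n + 1, -np.inf)
    logS[0] = 0.0                              # |S_0^(0)| = 1
    for m in range(1, n + 1):
        new = np.full(n + 1, -np.inf)
        for k in range(1, m + 1):
            terms = [logS[k - 1]]
            if m > 1:
                terms.append(np.log(m - 1) + logS[k])
            new[k] = np.logaddexp.reduce(terms)
        logS = new
    return logS

def log_prior_k(alpha, n):
    """log P(k | alpha, n) of equation (2), k = 1..n, with c_n(k) proportional
    to |S_n^(k)|; normalising over k absorbs the constants n! and 1/n!."""
    logS = log_stirling1_unsigned(n)[1:]
    logp = logS + np.arange(1, n + 1) * np.log(alpha) \
           + gammaln(alpha) - gammaln(alpha + n)
    return logp - np.logaddexp.reduce(logp)
```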

Assume a continuous prior density p(α) (which may depend on the sample size n), hence an implied prior P(k | n) = ∫ P(k | α, n) p(α) dα. Now suppose we have sampled values of the parameter k and all the other model parameters. From our model, the data D are initially conditionally independent of α given all the other parameters, and we deduce

    p(α | k, π, D) = p(α | k) ∝ p(α) P(k | α),   (3)

with likelihood function given in (2) (the sample size n should appear in the conditioning, of course, but is omitted for clarity of notation). Thus the Gibbs sampling analysis can be extended; for given α, we sample parameters π, and hence k, as usual from the conditional posterior p(π | α, D). Then, at each iteration, we can include α in the analysis by sampling from the conditional posterior (3) based on the previously sampled value of k; no other information is needed. Sampling from (3) may involve using a rejection, or other, method depending on the form of the prior p(α). Alternatively, we may discretise the range of α so that (3) provides a discrete approximation to the posterior, the so-called griddy Gibbs approach. More attractively, sampling from the exact, continuous posterior (3) is possible in the Gibbs iterations when the prior p(α) comes from the class of mixtures of gamma distributions. We develop the results initially for a single gamma prior.


2.2 Gamma prior for α

Suppose α ∼ G(a, b), a gamma prior with shape a > 0 and scale b > 0 (which we may extend to include a reference prior, uniform for log(α), by letting a → 0 and b → 0). In this case, (3) may be expressed as a mixture of two gamma posteriors, and the conditional distribution of the mixing parameter given α and k (and, of course, n) is a simple beta. See this as follows. For α > 0, the gamma functions in (2) can be written as

    Γ(α) / Γ(α + n) = (α + n) β(α + 1, n) / (α Γ(n)),

where β(·, ·) is the usual beta function. Then, in (3), and for any k = 1, 2, . . . , n,

    p(α | k) ∝ p(α) α^{k−1} (α + n) β(α + 1, n)
             ∝ p(α) α^{k−1} (α + n) ∫_0^1 x^α (1 − x)^{n−1} dx,

using the definition of the beta function. This implies that p(α | k) is the marginal distribution from a joint for α and a continuous quantity x (0 < x < 1) such that

    p(α, x | k) ∝ p(α) α^{k−1} (α + n) x^α (1 − x)^{n−1},   (α > 0, 0 < x < 1).

Hence we have conditional posteriors p(α | x, k) and p(x | α, k) determined as follows. Firstly, under the G(a, b) prior for α,

    p(α | x, k) ∝ α^{a+k−2} (α + n) e^{−α(b − log(x))}
                ∝ α^{a+k−1} e^{−α(b − log(x))} + n α^{a+k−2} e^{−α(b − log(x))}

for α > 0, which reduces easily to a mixture of two gamma densities, viz

    (α | x, k) ∼ π_x G(a + k, b − log(x)) + (1 − π_x) G(a + k − 1, b − log(x))   (4)

with weights π_x defined by

    π_x / (1 − π_x) = (a + k − 1) / {n (b − log(x))}

(note that these distributions are well defined for all gamma priors, all x in the unit interval and all k ≥ 1). Secondly,

    p(x | α, k) ∝ x^α (1 − x)^{n−1},   (0 < x < 1),   (5)

so that (x | α, k) ∼ B(α + 1, n), a beta distribution with mean (α + 1)/(α + n + 1).


It should now be clear how α can be sampled at each stage of the simulation: at each Gibbs iteration, the currently sampled values of k and α allow us to draw a new value of α by (i) first sampling an x value from the simple beta distribution (5), conditional on α and k fixed at their most recent values; then (ii) sampling the new α value from the mixture of gammas in (4) based on the same k and the x value just generated in (i).
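For concreteness, steps (i) and (ii) take only a few lines. The sketch below is illustrative rather than taken from the paper: the function name update_alpha and the numpy generator rng are invented, and b is treated as a rate parameter, matching the density p(α) ∝ α^{a−1} e^{−bα} used above (so numpy's scale argument is 1/(b − log x)).

```python
import numpy as np

def update_alpha(alpha, k, n, a, b, rng):
    """One Gibbs update of the DP precision alpha under a G(a, b) prior,
    given the current number of distinct components k and sample size n."""
    # (i) latent x | alpha, k ~ Beta(alpha + 1, n), equation (5)
    x = rng.beta(alpha + 1.0, n)
    # (ii) two-component gamma mixture, equation (4):
    #      pi_x / (1 - pi_x) = (a + k - 1) / (n (b - log x))
    rate = b - np.log(x)
    odds = (a + k - 1.0) / (n * rate)
    shape = a + k if rng.uniform() < odds / (1.0 + odds) else a + k - 1.0
    return rng.gamma(shape, 1.0 / rate), x   # numpy's gamma takes scale = 1/rate

# example: alpha, x = update_alpha(2.0, k=5, n=100, a=1.0, b=1.0,
#                                  rng=np.random.default_rng(0))
```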

On completion of the simulation, we will have a series of sampled values of k, α, x, and all the other parameters. Suppose that the Monte Carlo sample size is N, and denote the sampled values k_s, x_s, etc., for s = 1, . . . , N. Only the sampled values k_s and x_s are needed in estimating the posterior p(α | D) via the usual Monte Carlo average of conditional posteriors, viz

    p(α | D) ≈ N^{−1} Σ_{s=1}^{N} p(α | x_s, k_s),

where the summands are simply the conditional gamma mixtures in equation (4).
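A minimal sketch of this averaging step (again illustrative, with an invented function name; scipy's gamma density is parameterised by shape and scale = 1/rate):

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

def posterior_density_alpha(alpha_grid, k_samples, x_samples, n, a, b):
    """Estimate p(alpha | D) on a grid of alpha values as the average over Gibbs
    draws (k_s, x_s) of the conditional gamma mixtures in equation (4)."""
    dens = np.zeros_like(alpha_grid, dtype=float)
    for k, x in zip(k_samples, x_samples):
        rate = b - np.log(x)
        odds = (a + k - 1.0) / (n * rate)
        pi_x = odds / (1.0 + odds)
        dens += (pi_x * gamma_dist.pdf(alpha_grid, a + k, scale=1.0 / rate)
                 + (1.0 - pi_x) * gamma_dist.pdf(alpha_grid, a + k - 1.0, scale=1.0 / rate))
    return dens / len(k_samples)
```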

2.3 Mixture of gamma priors for α

Now consider cases in which the prior for α is a mixture of gammas, α ∼ Σ_{i=1}^{h} c_i G(a_i, b_i), based on specified shapes a_i, scales b_i and mixture weights c_i that sum to unity. The above analysis generalises easily: the conditional posterior for (x | α, k) remains as in (5), but the two-component mixture (4) for (α | x, k) extends to the 2h-component mixture

    (α | x, k) ∼ Σ_{i=1}^{h} c_{i,x} {π_{i,x} G(a_i + k, b_i − log(x)) + (1 − π_{i,x}) G(a_i + k − 1, b_i − log(x))},

where, for i = 1, . . . , h,

    π_{i,x} / (1 − π_{i,x}) = (a_i + k − 1) / {n (b_i − log(x))}

and

    c_{i,x} ∝ c_i Γ(a_i + k − 1) (b_i − log(x))^{−(a_i + k − 1)} {n + (a_i + k − 1) / (b_i − log(x))},

subject, of course, to unit sum. This extended mixture is trivially sampled.
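As a rough illustration only (not the paper's code, and relying on the proportionality for c_{i,x} stated above), the extended sampling step might look as follows, with arrays a, b, c holding the prior shapes, rates and weights and rng a numpy generator:

```python
import numpy as np
from scipy.special import gammaln

def update_alpha_mixture_prior(alpha, k, n, a, b, c, rng):
    """Gibbs update of alpha when the prior is sum_i c_i G(a_i, b_i):
    draw x from (5), then alpha from the 2h-component gamma mixture above."""
    a, b, c = map(np.asarray, (a, b, c))
    x = rng.beta(alpha + 1.0, n)                  # step (i), unchanged
    rate = b - np.log(x)                          # b_i - log(x), one entry per component
    # unnormalised log weights c_{i,x}, as stated above
    logw = (np.log(c) + gammaln(a + k - 1.0)
            - (a + k - 1.0) * np.log(rate)
            + np.log(n + (a + k - 1.0) / rate))
    w = np.exp(logw - logw.max())
    w /= w.sum()
    i = rng.choice(len(c), p=w)                   # pick one prior component
    odds = (a[i] + k - 1.0) / (n * rate[i])       # within-component two-gamma split
    shape = a[i] + k if rng.uniform() < odds / (1.0 + odds) else a[i] + k - 1.0
    return rng.gamma(shape, 1.0 / rate[i]), x
```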

    3. AN ASYMPTOTIC RESULT

A simple asymptotic approximation to the posterior for α can be derived by developing an asymptotic approximation to the sampling density (2) directly. Take equation (2),

    P(k | α, n) = c_n(k) n! α^k Γ(α) / Γ(α + n),   (k = 1, 2, . . . , n),

with c_n(k) ∝ |S_n^{(k)}| (using results in Antoniak, 1974), where S_n^{(k)} (k = 1, . . . , n) are Stirling numbers of the first kind (Abramowitz & Stegun, Section 24.1.3, p. 824). As n → ∞ and with k = o(log(n)), Abramowitz & Stegun give the asymptotic approximation

    |S_n^{(k)}| ≈ (n − 1)! (γ + log(n))^{k−1} / (k − 1)!,

where γ is Euler's constant. It trivially follows that, under these conditions, (2) reduces to

    P(k | α, n) ∝ α^k (γ + log(n))^k / (k − 1)!,   k = o(log(n)),

and so we have the asymptotic Poisson form k = 1 + X, with X Poisson with mean α(γ + log(n)). In our framework, k is typically much smaller than n when n is large, so the approximation should be useful; the Poisson distribution for numbers of groups is very nice indeed.

As a corollary, we have the following approximation to the posterior for α given k (and n) discussed above. Taking the gamma prior α ∼ G(a, b), the Poisson approximation here implies directly an approximate gamma posterior

    p(α | k, n) ≈ G(a + k − 1, b + γ + log(n))

for large n and with k = o(log(n)). Just how good these approximations are remains to be explored. The approximate posterior may be used to simulate values of α in the analysis, though the earlier exact mixture representation is only slightly more involved computationally.
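By way of illustration only (the function name is invented; np.euler_gamma is numpy's value of Euler's constant γ), the approximation amounts to:

```python
import numpy as np

def approx_alpha_posterior(k, n, a, b):
    """Asymptotic approximation p(alpha | k, n) ~ G(a + k - 1, b + gamma + log n),
    returned as (shape, rate); intended for large n with k = o(log n)."""
    return a + k - 1.0, b + np.euler_gamma + np.log(n)

# approximate posterior draws of alpha, e.g. k = 8 groups among n = 500 observations:
# shape, rate = approx_alpha_posterior(8, 500, a=1.0, b=1.0)
# alphas = np.random.default_rng(1).gamma(shape, 1.0 / rate, size=1000)
```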

    REFERENCES

Abramowitz, M. and Stegun, I.A. (1965) Handbook of Mathematical Functions, with Formulas, Graphs, and Mathematical Tables, New York: Dover.

Antoniak, C.E. (1974) Mixtures of Dirichlet processes with applications to nonparametric problems, The Annals of Statistics, 2, 1152-1174.

Escobar, M.D. and West, M. (to appear) Bayesian density estimation and inference using mixtures, Journal of the American Statistical Association.

Ferguson, T.S. (1983) Bayesian density estimation by mixtures of normal distributions, in Recent Advances in Statistics, eds. H. Rizvi and J. Rustagi, New York: Academic Press, pp. 287-302.

Gelfand, A.E. and Smith, A.F.M. (1990) Sampling based approaches to calculating marginal densities, Journal of the American Statistical Association, 85, 398-409.

West, M. and Cao, G. (to appear) Assessing mechanisms of neural synaptic activity, in Bayesian Statistics in Science and Technology: Case Studies.
