On Tight Approximate Inference of the Logistic-Normal Topic Admixture Model

Amr Ahmed
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Eric P. Xing
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Abstract

The Logistic-Normal Topic Admixture Model (LoNTAM), also known as the correlated topic model (Blei and Lafferty, 2005), is a promising and expressive admixture-based text model. It can capture topic correlations via the use of a logistic-normal distribution to model non-trivial variabilities in the topic mixing vectors underlying documents. However, the non-conjugacy caused by the logistic-normal makes posterior inference and model learning significantly more challenging. In this paper, we present a new, tight approximate inference algorithm for LoNTAM based on a multivariate quadratic Taylor approximation scheme that facilitates elegant closed-form message passing. We present experimental results on simulated data as well as on the NIPS17 and PNAS document collections, and show that our approach is not only simple and easy to implement, but also converges faster and leads to more accurate recovery of the semantic truth underlying documents and to better estimates of the parameters, compared to previous methods.

1 Introduction

Statistical admixture models have recently gained much popularity in managing large collections of discrete objects. Via an admixture model, one can project such objects into a low-dimensional space where their latent semantics (such as topical aspects) can be captured. This low-dimensional representation can then be used for tasks like classification and clustering, or merely as a tool to structurally browse the otherwise unstructured collection.

An admixture model posits that each object is sampled from a mixture model according to the object's specific mixing vector over the mixture components. Special instances of this formalism have been used for population genetics (Pritchard et al., 2000), vision (Sivic et al., 2005), text modeling (Blei et al., 2003), and machine translation (Zhao and Xing, 2006). When applied to text, the objects correspond to documents, and the mixture components are known as topics, which are often represented as multinomial distributions over a given vocabulary. Under this formalism, each document is succinctly represented as a mixing vector over the set of topics, and the topic mixing vector reflects the semantics of the document.

Much of the expressiveness of admixture-based text models lies in how they model the variabilities in the topic mixing vectors underlying documents. For instance, when the variability is modeled via a Dirichlet distribution, the model is known as latent Dirichlet allocation (Blei et al., 2003). The Dirichlet is indeed an appealing choice because its conjugacy to the multinomial allows for computational advantages in inference and learning. However, the Dirichlet can only capture variations in each topic's intensity (almost) independently, and fails to model the fact that some topics are highly correlated and can arise synergistically. Failure to model such correlation limits the model's ability to discover subtle topical structures underlying the data.

Apart from the Dirichlet, another popular distribution over the simplex (the space of normalized weight vectors) is the logistic-normal (LN) distribution, which has the sought-after property of being able to model correlations between the components of the vectors drawn from it (Aitchison and Shen, 1980). When the logistic-normal is used instead of the Dirichlet, we call the resulting model a Logistic-Normal Topic Admixture Model (LoNTAM), which is also known as the correlated topic model (Blei and Lafferty, 2005). Unfortunately, this added expressivity comes with a price, because the non-conjugacy of LN to the multinomial distribution makes posterior inference and parameter estimation significantly more difficult.

In the sequel, we present a new approximate inference algorithm for LoNTAM which offers a simple and efficient way of capturing correlated topic posteriors. (A longer version, which contains derivational details and additional extensions/generalizations, can be found in an earlier technical report (Xing, 2005).) We begin by outlining LoNTAM; then we proceed to describe our approximate inference method, which overcomes the non-conjugacy of LN via the use of a multivariate quadratic Taylor approximation to LN that enables an elegant closed-form variational message-passing algorithm. For completeness, we also describe an earlier inference algorithm for LoNTAM given by Blei and Lafferty (2005) and highlight key differences between the two approaches. Finally, we present experimental results on simulated text datasets as well as on real text collections from NIPS and PNAS. Our results show that the proposed algorithm is not only simpler and easier to implement, but also leads to a tighter approximation to the topic posterior, more accurate estimates of the parameters, and faster convergence, compared to the previous method based on a linear approximation and numerical procedures.

2 Logistic-Normal Topic Admixture

We begin with a brief recap of the general admixture formalism, and of the LoNTAM model for text documents represented as bags of words.

2.1 Admixture Model

Statistically, an object x is said to be derived from an admixture if it consists of a bag of elements, say {x1, . . . , xN}, each sampled, independently or coupled in some way, from a mixture model according to an admixing coefficient vector ~θ, which represents the (normalized) fraction of contribution from each of the mixture components to the object being modeled. In a typical text modeling setting, each document corresponds to an object, the words thereof correspond to the elements constituting the object, and the document-specific admixing coefficient vector is often known as a topic mixing vector (or simply, topic vector). Generatively, to sample a document according to an admixture model, we first sample a topic vector from some admixing prior; then a latent topic is sampled for each word based on the topic vector to induce topic-specific word instantiations. Since the admixture formalism enables an object to be synthesized from elements drawn from a mixture of multiple sources, it is also known as a mixed membership model in the statistical community (Erosheva et al., 2004).

2.2 Logistic-Normal Topic Admixture Model

Much of the expressiveness of admixture-based text models lies in the choice of the prior for the documents' topic vectors. To capture non-trivial correlations among the weights of all possible topics underlying a document (i.e., the elements of the topic vector ~θ), instead of using a Dirichlet as in LDA, a LoNTAM model employs a logistic-normal distribution for the topic vectors of the study corpus. As discussed in Blei and Lafferty (2005), this prior captures a much richer abundance of correlation patterns on the topic simplex; but as a cost, the non-conjugacy between the logistic-normal prior (for topic vectors) and the multinomial likelihood (for topic instantiations) makes posterior inference and parameter estimation extremely hard. For example, variational inference algorithms commonly used for generalized linear models (GLIMs) do not have closed-form fixed-point iterative formulas in this case. Indeed, the approximate inference schemes adopted so far fail to capture the much-desired correlation structure in the posterior distribution of the topic vectors.

Before presenting our tight approximate inference algorithm, which offers a simple and efficient way for obtaining truly correlated topic posteriors under LoNTAM, in the following we outline the details of this model for later reference. As illustrated in Figure 1, in a LoNTAM, each topic, say topic k, is represented by an M-dimensional word frequency vector ~βk, which parameterizes a topic-specific multinomial distribution. A document is generated by sampling its topic vector from a logistic-normal distribution with K-dimensional mean µ and covariance Σ, that is, LN(µ, Σ), and then words are sampled based on this topic vector. More formally, to generate a document wd = {wd,1, wd,2, . . . , wd,N}, we proceed as below:

1. Draw ~θd ∼ LN(µ, Σ).
2. For each word wd,n in wd:
   • Draw a latent topic zd,n ∼ Multinomial(~θd).
   • Draw wd,n | zd,n = k ∼ Multinomial(~βk).

The first step can be broken down into two sub-steps:first draw ~γd ∼ Normal(µ, Σ); then map it to thesimplex via the following logistic transformation:

θd,k = exp{γd,k − C(~γd)}, ∀k = 1, . . . , K (1)

where C(~γd) = log( ∑_{k=1}^{K} exp{γd,k} ).   (2)

Here C(~γd) is a normalization constant (i.e., the log partition function). Furthermore, due to the normalizability constraint on the multinomial parameters, ~θd has only K − 1 degrees of freedom. Thus we only need to draw the first K − 1 components of ~γd from a (K − 1)-dimensional multivariate Gaussian, and leave γd,K = 0. For simplicity, we omit this technicality in the forthcoming general treatment of our model. Putting everything together, the marginal probability of a document w can be written as follows (for simplicity, in the sequel we omit the document index "d" when our statements and/or expressions apply to all documents):

p(w) = ∫ ( ∏_{n=1}^{N} ∑_{zn=1}^{K} p(wn | zn; ~β1:K) p(zn | logistic(~γ)) ) N(~γ | µ, Σ) d~γ,   (3)


Figure 1: The Graphical Model.

where p(wn | zn; ~β1:K) and p(zn | logistic(~γ)) are both multinomial distributions, parameterized by ~βzn and ~θ = logistic(~γ), respectively.
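To make the generative process of Section 2.2 concrete, the following minimal NumPy sketch (our own illustration, not the authors' code) samples one document; `mu`, `Sigma`, and `beta` stand for µ, Σ, and ~β1:K, and the γd,K = 0 convention is handled explicitly.

```python
import numpy as np

def sample_document(mu, Sigma, beta, N, rng=None):
    """Sample one document of N words from LoNTAM (sketch).

    mu:    (K-1,) mean of the Gaussian over gamma (gamma_K is fixed to 0)
    Sigma: (K-1, K-1) covariance of that Gaussian
    beta:  (K, M) matrix of topic-specific word distributions (rows sum to 1)
    """
    rng = np.random.default_rng() if rng is None else rng
    K, M = beta.shape
    # Step 1a: draw gamma ~ Normal(mu, Sigma), leaving the last component at 0
    gamma = np.zeros(K)
    gamma[:K - 1] = rng.multivariate_normal(mu, Sigma)
    # Step 1b: logistic transformation of Eqs. (1)-(2)
    C = np.log(np.sum(np.exp(gamma)))       # log partition function C(gamma)
    theta = np.exp(gamma - C)               # topic mixing vector on the simplex
    # Step 2: for each word, draw a latent topic, then a word from that topic
    z = rng.choice(K, size=N, p=theta)      # z_{d,n} ~ Multinomial(theta)
    words = np.array([rng.choice(M, p=beta[k]) for k in z])
    return words, theta
```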

3 Tight Approximate Inference and Learning on LoNTAM

Figure 2: Approximating the log partition function using a truncated quadratic Taylor expansion (exact vs. quadratic; axes: gamma vs. log partition function). Top: K=2, expanded around −2 and around 2. Bottom: K=3, expanded around (−2.0, −1.6) and around (0.5, 1.2).

3.1 Variational Inference Under Taylor-Approximated Conjugacy

Given a document w, the inference task is to find the posterior distribution over the latent variables. Unfortunately, conditioned on w1:N, the posterior over {~γ, z1:N} under LoNTAM is intractable. Therefore, following a variational principle, such as in Xing et al. (2003), we approximate p(~γ, z1:N | w) with a product of simpler marginals, each over a cluster of latent variables, i.e., {~γ} and {z1:N}.

Based on the generalized mean field (GMF) theorem given in Xing et al. (2003), the optimal solution for each marginal, q(XC | Θ), over a cluster of variables XC, is isomorphic to the true conditional distribution of XC given its Markov blanket (MB), that is, p(XC | XMB). This optimal variational marginal can be symbolically written down from the original joint model (such as our LoNTAM over {~γ, z1:N, w1:N}), except that in q(·|·) we replace XMB by a set of "GMF messages" related to XMB. That is, q∗(XC | Θ) = p(XC | GMF(XMB)).

These GMF messages can be thought of as surrogates of the dependencies of XC on XMB. Xing et al. (2003) showed that in case the joint model is a GLIM, the GMF messages correspond to an expectation of the sufficient statistics of the relevant Markov blanket variables under their own associated GMF cluster marginals. In the sequel, we use 〈Sx〉qx to denote the GMF message due to latent variable x; thus the optimal GMF approximation to p(XC) is:

q∗(XC) = p(XC | 〈Sy〉qy : ∀y ∈ XMB).   (4)

Inspecting our LoNTAM as depicted in Figure 1, we would approximate the posterior over {~γ, z1:N} using q(~γ, z1:N) = q~γ(~γ) qz(z1:N). (The same approximating scheme was also adopted in Blei and Lafferty (2005), but as we explain soon, different techniques for seeking the optimal q~γ(·) and qz(·) have led to remarkably different results.) Using Eq. (4), we can write down the GMF approximation to the marginal posterior of the (inverse-logistic-transformed) topic vector ~γ as:

q~γ(~γ) = p(~γ | µ, Σ, z → 〈Sz〉qz)
        ∝ p(~γ, z → 〈Sz〉qz | µ, Σ)
        ∝ N(~γ; µ, Σ) × p(z → 〈Sz〉qz | ~γ),   (5)

where "→" denotes a symbolic replacement of the argument on the left with the one on the right. Note that Figure 1 makes explicit that the MB of ~γ is {z1:N}; thus we replace {z1:N} with its GMF message, which corresponds to its expected sufficient statistics. It can easily be shown that the second term in Eq. (5) is given by:

p(z → 〈Sz〉qz | ~γ) = exp{ 〈m〉qz ~γ − N × C(~γ) },   (6)

where N is the number of words in w, and 〈m〉qz is a K-dimensional (row) vector that represents the expected histogram of topic occurrences in w. More formally:

mk = ∑_{i=1}^{N} I(zi = k)   and   〈mk〉qz = ∑_{i=1}^{N} qz(zi = k).   (7)

Now we can have a glimpse of why LoNTAM is difficult to handle. First, the two factors in Eq. (5) are not conjugate, thus their product does not emerge as an easy closed-form distribution such as a Gaussian; second, the second factor contains a nasty C(~γ), which is a complex function of the argument of our approximating distribution q~γ(·). Thus q~γ(~γ) as defined above is not integrable during inference (e.g., to calculate expectations of ~γ as the output of our low-dimensional representation of the document, and as the GMF message sent to the {z1:N} cluster as needed below).

To circumvent the non-conjugacy and non-integrability of our variational cluster marginal caused by C(~γ), we introduce a truncated Taylor approximation to C(~γ) to make it algebraically manageable. The rationale is that, in a multivariate Gaussian, we have only linear and quadratic terms of the argument. Inspecting the forms of the argument (i.e., ~γ) in the distribution defined by Eq. (5), all terms except C(~γ) are either linear or quadratic. If we can approximate C(~γ) up to a quadratic form, then the two factors in Eq. (5) will become "conjugate" and we can rearrange the resulting approximation to q~γ(~γ) into a reparameterized multivariate Gaussian! Fortunately, this turns out to be feasible, and indeed it leads to a very general second-order approximation scheme superior to the tangent approximations underlying many extant variational inference algorithms. Specifically, using a quadratic Taylor expansion of C(~γ) with respect to some γ̂¹, we have:

C(~γ) ≈ C(γ̂) + g(γ̂)′(~γ − γ̂) + (1/2)(~γ − γ̂)′ H(γ̂) (~γ − γ̂),   (8)

where g(γ̂) = (g1, . . . , gK) is the gradient and H(γ̂) = {hij} is the Hessian matrix of C with respect to ~γ, both evaluated at γ̂. Figure 2 demonstrates this expansion for K=2 and K=3. As is clear from the figure, this expansion provides a tight local approximation to C(~γ) for "practical" values of ~γ.²
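Concretely, the gradient and Hessian of C(~γ) = log ∑k exp{γk} have the standard log-sum-exp forms: the gradient at γ̂ is the softmax vector p, and the Hessian is diag(p) − pp′. A minimal sketch of their computation (our own code, not the authors'):

```python
import numpy as np

def log_partition_grad_hess(gamma_hat):
    """Gradient g and Hessian H of C(gamma) = log(sum_k exp(gamma_k)) at gamma_hat."""
    p = np.exp(gamma_hat - np.max(gamma_hat))   # shift for numerical stability
    p /= p.sum()                                # p = softmax(gamma_hat)
    g = p                                       # gradient of C at gamma_hat
    H = np.diag(p) - np.outer(p, p)             # Hessian of C at gamma_hat
    return g, H
```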

Combining Eqs. (5), (6), and (8), it is easy to show (see Xing, 2005) that q~γ(~γ) can now be expressed as a Gaussian N(µγ̂, Σγ̂), where:

Σγ̂ = (Σ−1 + N H(γ̂))−1,   (9)
µγ̂ = Σγ̂ ( Σ−1 µ + N H(γ̂) γ̂ + 〈m〉qz − N g(γ̂) ).   (10)

Now we turn to qz(z1:N). Again from Figure 1, we see that the MB of {z1:N} is ~γ ∪ {w1:N}, of which ~γ needs to be replaced by its GMF message. Thus we have:

qz(z1:N) = ∏_{n=1}^{N} qz(zn) = ∏_{n=1}^{N} p(zn | ~γ → 〈S~γ〉q~γ, wn, ~β1:K).   (11)

For notational simplicity we drop the word index "n" and give a generic formula for the variational approximation to a singleton marginal:

p(zk | 〈S~γ〉q~γ, wj, ~βk) ∝ p(zk | 〈S~γ〉q~γ) × p(wj | zk, ~βk)
                         ∝ exp{〈γk〉q~γ} βkj = exp{µγ̂,k} βkj,   (12)

where zk and wj are notational shorthands for z = k (i.e., z picks the kth topic) and w = j (i.e., w represents the jth word), respectively, βkj is the probability of word j under topic k, and µγ̂,k is the kth component of the expectation of ~γ given in Eq. (10).

The above two GMF marginals, given in Eqs. (9)-(10) and Eq. (12), are coupled and thus constitute a set of fixed-point equations. Thus, we can iteratively update each marginal until convergence. This approximation can be shown to minimize the KL divergence between the variational posterior and the true posterior over the latent variables at convergence (Xing et al., 2003). To diagnose convergence, one can either monitor the relative change in µγ̂ or the relative change of the log-likelihood of w (the log of the integral in Eq. 3). Under our factorized approximation to the true distribution, the integral in Eq. (3) is computable (Xing, 2005).

¹The γ̂ is replaced with the mean of the variational distribution over ~γ from the previous iteration.

²Empirically, values of ~γ outside of the "practical" range would result in a skewed ~θ on the topic simplex. Our experimental results on real data sets confirm that the shown range is the operational one.
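To make the resulting message-passing loop concrete, the following is a hedged sketch of one way to implement the coupled updates of Eqs. (9), (10), and (12) for a single document; for simplicity it treats ~γ as fully K-dimensional and uses our own variable names (`words`, `beta`), so details may differ from the authors' implementation.

```python
import numpy as np

def variational_inference(words, mu, Sigma, beta, max_iter=100, tol=1e-6):
    """GMF fixed-point updates for one document under LoNTAM (sketch).

    words     : length-N array of word indices
    mu, Sigma : prior mean and covariance of the Gaussian over gamma
                (treated as K-dimensional here for simplicity)
    beta      : (K, M) topic-word probabilities
    Returns the Gaussian posterior (mu_hat, Sigma_hat) over gamma and q(z).
    """
    K = beta.shape[0]
    N = len(words)
    Sigma_inv = np.linalg.inv(Sigma)
    mu_hat = mu.copy()                      # expansion point gamma_hat (footnote 1)
    Sigma_hat = Sigma.copy()
    qz = np.full((N, K), 1.0 / K)           # q(z_n = k), initialized uniformly
    for _ in range(max_iter):
        # --- update q(gamma), Eqs. (8)-(10) ---
        g = np.exp(mu_hat - mu_hat.max()); g /= g.sum()   # gradient of C at gamma_hat
        H = np.diag(g) - np.outer(g, g)                   # Hessian of C at gamma_hat
        m = qz.sum(axis=0)                                # <m>_{qz}, Eq. (7)
        Sigma_hat = np.linalg.inv(Sigma_inv + N * H)                        # Eq. (9)
        mu_new = Sigma_hat @ (Sigma_inv @ mu + N * H @ mu_hat + m - N * g)  # Eq. (10)
        # --- update q(z), Eq. (12): q(z_n = k) propto exp(mu_k) * beta[k, w_n] ---
        log_qz = mu_new[None, :] + np.log(beta[:, words].T + 1e-300)
        qz = np.exp(log_qz - log_qz.max(axis=1, keepdims=True))
        qz /= qz.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(mu_new - mu_hat)) < tol
        mu_hat = mu_new
        if converged:
            break
    return mu_hat, Sigma_hat, qz
```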

3.2 Parameter Estimation via Variational EM

Given a corpus of documents {w1:D}, the learning task is to find model parameters {µ, Σ, ~β1:K} that maximize the log-likelihood of the data. We use a variational expectation-maximization (VEM) algorithm to fit the model parameters. VEM alternates between two steps: in the E-step, the variational approximation in §3.1 is used to compute expectations over the hidden variables {~γd, zd,1:N}1:D; then, in the M-step, the model parameters are updated using the expected sufficient statistics from the E-step.

As is always the case, details are important with VEM. As pointed out by Welling et al. (2004), strong conditional dependencies between hidden variables in directed models can seriously affect model performance in VEM-based learning, because such dependencies lead to a poor approximation to the posterior over the hidden variables, and can make it difficult to escape local optima. To remedy this we used deterministic annealing (DA) (Ueda and Nakano, 1998). DA-VEM seeks to maximize an exponentiated version of the likelihood, E_{q(Y|X)}[p(X, Y)^T], where X and Y are the observed and hidden variables respectively, and T is the temperature parameter. DA-VEM starts by maximizing a concave function (small values of T) and tracks local maxima while gradually morphing the function into the desired non-concave likelihood function at T=1. It is a continuation method in which the estimate at the end of each annealing step is used to initialize the search in the next step. In our experiments we used four annealing steps with temperatures (0.1, 0.25, 0.5, 1). We also found that fast annealing on ~β1:K results in faster convergence with no performance loss; thus we used the following temperature schedule for ~β1:K: (0.25, 0.75, 1, 1).
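A skeletal view of the DA-VEM driver under these schedules might look as follows; this is our own outline (with `e_step`, `m_step`, and `iters_per_temp` as placeholders), not the authors' code.

```python
def da_vem(corpus, params, e_step, m_step,
           temps=(0.1, 0.25, 0.5, 1.0),        # annealing schedule for the likelihood
           beta_temps=(0.25, 0.75, 1.0, 1.0),  # faster schedule used for beta_{1:K}
           iters_per_temp=20):
    """Deterministic-annealing variational EM driver (outline only).

    At each temperature T the likelihood is exponentiated by T, so early rounds
    optimize a smoothed (more concave) objective; the estimate at the end of one
    annealing step initializes the next.
    """
    for T, T_beta in zip(temps, beta_temps):
        for _ in range(iters_per_temp):
            stats = e_step(corpus, params, T)          # expectations of {gamma_d, z_d,1:N}
            params = m_step(stats, params, T, T_beta)  # update mu, Sigma, beta_{1:K}
    return params
```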

3.3 Relation to Previous Work

Blei and Lafferty (2005) gave an alternative mean-field variational approximation to the same model. The factorized distribution in their work is:

qBL(~γ, z1:N | λ1:K, ν1:K, φ1:N) = ∏_{k=1}^{K} qBL(γk | λk, νk) ∏_{n=1}^{N} qBL(zn | φn),   (13)


Figure 3: Upper bounds for the log partition function. The function is viewed as a logarithm function of a single parameter equal to ∑_{k=1}^{K} exp{γk}, and each dotted line represents a linear upper bound (exact function vs. bounds).

where λ1:K, ν1:K and φ1:N are the variational parameters. It is noteworthy that, compared to our approximate posterior of ~γ defined by Eq. (5), in this approximation the posterior of ~γ is a fully factored distribution over each element of ~γ. Each qBL(γk | λk, νk) is a univariate Gaussian. So in fact this approximate posterior does not capture correlations between the elements of ~γ, which might not result in a tight approximation during inference and learning.

With qBL(zn | φn) set to be a multinomial, the variational parameters {λ1:K, ν1:K, φ1:N} are fit by maximizing the following lower bound on the log-likelihood:

log p(w) ≥ H(qBL) + E_{qBL}[log p(~γ | µ, Σ)] + ∑_{n=1}^{N} ( E_{qBL}[log p(zn | ~γ)] + E_{qBL}[log p(wn | zn, ~β1:K)] ),   (14)

where H is the entropy function. As discussed before, the expectation of the term log p(zn | ~γ) = z′n ~γ − C(~γ) cannot be computed due to the non-conjugacy of the logistic-normal distribution. To deal with that, a first-order conjugate dual approximation (Jordan et al., 1999) was used to upper bound C(~γ) as follows:

log( ∑_{k=1}^{K} exp{γk} ) ≤ ζ−1 ( ∑_{k=1}^{K} exp{γk} ) − 1 + log(ζ),   (15)

where ζ is a new variational parameter. Note that the bound in Eq. (15) views C(~γ) as a logarithm function of a unary parameter equal to ∑_{k=1}^{K} exp{γk}. As shown in Figure 3, each value of ζ corresponds to an upper bound. The variational parameters are fit by maximizing the bound in Eq. (14); the maximizers φ∗1:N and ζ∗ have closed-form solutions, whereas the maximizers λ∗1:K and ν∗1:K are found numerically using conjugate gradient and Newton's methods, respectively.

There are two important differences between the aforementioned approach and our tight approximate inference presented here. First, qBL posits that the posterior over ~γ has a diagonal covariance³, which contradicts the original motivation of a "correlated topic model" such as LoNTAM, which is to capture correlations between topic weights. In particular, since model learning (e.g., for µ, Σ and ~β1:K) iteratively uses the posterior estimate of ~γ, a fully de-correlated estimator of ~γ, as results from qBL, might mislead the parameter estimation and eventually fail to accurately estimate the correlation over topics as introduced by the LoNTAM model. In our newly proposed method, we place no restriction on the covariance in our approximate posterior of ~γ (see Eq. 9). Second, the two approaches differ in the way they deal with C(~γ). In their work, C(~γ) was viewed as a unary function and was upper bounded using a tangent approximation. In contrast, we view C(~γ) as a multivariate function of ~γ and approximate it using a multivariate quadratic Taylor expansion. To make this difference clear, note that C(~γ) is used in two places: first to approximate the log-likelihood of w, and second to fit the posterior distribution over ~γ. For log-likelihood computation, viewing C(~γ) as a unary function is sufficient to obtain a close bound; however, for updating the posterior over ~γ during the variational fixed-point iterations, it is important to keep the coupling between the components of ~γ as represented in C(~γ). In the earlier method, there is no clear way to deal with C(~γ) differently based on these (different) roles, whereas in our work we can treat C(~γ) differently according to its role. As will be shown in the next section, modeling the interaction between the components of ~γ via the multivariate quadratic Taylor expansion leads to a much tighter approximation and faster convergence.

³We believe that this assumption was made to make it possible to fit ν1:K numerically.
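To see the contrast numerically, one can evaluate the exact C(~γ), the quadratic Taylor expansion of Eq. (8), and the tangent bound of Eq. (15) with ζ fixed at the expansion point, at a few perturbed arguments. The small script below is our own illustration with arbitrary example values, not code from either paper.

```python
import numpy as np

def C(gamma):
    """Exact log partition function C(gamma) = log sum_k exp(gamma_k)."""
    return np.log(np.sum(np.exp(gamma)))

def quadratic_taylor(gamma, gamma_hat):
    """Quadratic Taylor expansion of C around gamma_hat, as in Eq. (8)."""
    p = np.exp(gamma_hat); p /= p.sum()            # gradient = softmax(gamma_hat)
    H = np.diag(p) - np.outer(p, p)                # Hessian at gamma_hat
    d = gamma - gamma_hat
    return C(gamma_hat) + p @ d + 0.5 * d @ H @ d

def tangent_bound(gamma, zeta):
    """Linear upper bound of Eq. (15); tight when zeta = sum_k exp(gamma_k)."""
    return np.sum(np.exp(gamma)) / zeta - 1.0 + np.log(zeta)

gamma_hat = np.array([0.5, 1.2, -0.3])             # arbitrary expansion point
zeta = np.sum(np.exp(gamma_hat))                   # bound made tight at gamma_hat
for delta in (0.0, 0.5, 1.0):
    gamma = gamma_hat.copy()
    gamma[0] += delta                              # perturb a single component
    print(delta, C(gamma), quadratic_taylor(gamma, gamma_hat), tangent_bound(gamma, zeta))
```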

4 Experimental Results

We validate our inference algorithm on both a simulated text corpus (sampled according to hand-specified topics and admixing priors) and real text collections from NIPS and PNAS.

4.1 Experiments on Simulated Data

We first tested our approach (referred to as AX) in controlled settings (where the ground truth is known) and compared it to that of Blei and Lafferty (2005) (referred to as BL). To test the accuracy of both inference algorithms, different settings were simulated by varying one of three model aspects, namely 1) K: the number of topics; 2) M: the size of the vocabulary; and 3) N: the number of words per document, while fixing the other two. The model parameters {µ, Σ, ~β1:K} were drawn randomly in each case from some prespecified distributions. For each setting, 200 documents were sampled and we ran both algorithms under each model setting until the relative change in the bound on the log-likelihood was less than 10⁻⁶.



Accuracy of posterior inference: As shown in the first row of Figure 4, our approach achieves higher accuracy in recovering the true ~θ (the logistic transformation of ~γ) simulated for each document across all settings, using the posterior mean of ~γ inferred from w. As we noted in Section 3.2, this is due to the tightness of the multivariate quadratic approximation we use. As N increases (i.e., the longer the document), the task becomes easier and the difference in performance between the two approaches decreases. This is because in longer documents 〈m〉qz (the expected topic histogram) becomes the dominant factor in recovering ~γ. The second row in Figure 4 shows the difference in the error in recovering ~θ between the two approaches, that is, Error(BL) − Error(AX), on a per-document level. Our approach always results in an improvement, on the order of a 10% absolute difference.

Figure 4: Inference on simulated data. Dotted and solid lines correspond to the BL and AX approaches, respectively. Each column represents an experiment in which one dimension is varied while the other two are fixed (left: varying the number of topics with M=1000, N=200; middle: varying the vocabulary size with K=50, N=200; right: varying the number of words per document with K=50, M=1000). Top row: average L2 error in recovering ~θ. Middle row: error difference (L2(BL) − L2(AX)) in recovering ~θ on a per-document level. Bottom row: number of iterations needed by each approach to converge.

Convergence rate: The third row in Figure 4 shows the number of iterations consumed by each algorithm until the bound converges. Again, our approach converges significantly faster in almost all model settings. It is interesting to note how convergence is affected when K is fixed and the other model aspects are varied. One might expect that, since ~θ scales only with K, the other aspects should have little effect on the convergence of the algorithm. Indeed, this was roughly the case with our approach, yet for the BL approach the number of iterations until convergence is highly affected by the other aspects, especially N. To understand why this happens, we need to examine the messages communicated during the fixed-point updates. Changes in the posterior mean over ~γ are propagated exponentially to the posterior over z (Eq. (12)), then summed over all z1:N and propagated back to ~γ via 〈m〉qz. Thus as N increases, this effect becomes more pronounced and prolongs the time needed until convergence. The reason why our approach does not suffer from this effect is that the use of the quadratic approximation damps the update in the posterior over ~γ quickly, and future iterations only fine-tune it. Hence even when N increases, the compound effect described above does not result in a large difference between the 〈m〉qz messages sent from z1:N to ~γ across iterations.

One important point to be noted here is the cost of each iteration in the two approaches. Our approach scales as O(K³) per iteration due to the matrix inversion in Eq. (9), while the BL approach scales as O(LK²) per iteration, where L is the number of derivative evaluations in the numerical optimization routines⁴.

Parameter estimation: Does more accurate inference result in better parameter estimation? To answer this question, we started with a toy problem. The model dimensions were fixed as K = 3, N = 200 and M = 32; the other model parameters were generated randomly. The ground truth (topic distributions and the shape of the LN density over the 3-topic simplex) is depicted in the top of Figure 5. We sampled 400 documents from this model and ran both approaches on it until the relative change of the bound on the log-likelihood was less than 10⁻³ (unless otherwise stated, this is the convergence criterion used for parameter estimation in the rest of this section). The estimated topics and LN-density shapes over the simplex are given in Figure 5.

Figure 5: Parameter estimation. Left panels show topic distributions, where each row is a topic, each column is a word, and colors correspond to probabilities. Right panels show the shapes of the LN distribution over the simplex. The top row gives the ground-truth model parameters, while the middle and bottom rows give those estimated using the AX and BL approaches, respectively.

As shown in Figure 5, our approach results in a much more accurate estimate of the density over the simplex, as well as of the topic-specific word frequencies ~β1:3. In fact, the BL approach puts no mass in areas where the ground truth puts high probability mass. The estimated topics from both models are acceptable, yet our approach was able to get more accurate results (note the difference in the bottom topic). To better understand this result, in Figure 6 we depict the ground truth ~θ (represented by a vertical line which is partitioned into colored segments in proportion to the topic weights recorded by ~θ) and that recovered by each approach when the VEM algorithm converged over the 400 documents. Recall that the expected sufficient statistics for ~β1:K depend on the variational distribution over z1:N. In turn, the distribution over z1:N depends on ~θ. Thus the quality of the estimated topics depends on how well ~θ is approximated. In fact, the error in recovering ~θ was 13% for our approach and 19% for the BL one, which explains why both approaches got comparable topic estimates (the KL divergence between the estimated and true topic distributions is 0.02 for our approach and 0.09 for BL).

⁴We avoided comparing the two approaches using wall time because our code is written in Matlab while the BL code we compare against is written in C++.

Figure 6: The recovered document topic mixing vectors, for the ground truth and at the end of EM for the AX and BL approaches. Each vertical line represents a document's mixing vector and each color corresponds to a topic.

In contrast, estimation of the LN parameters, and hence of the density over the simplex, depends more directly on how well the individual components of the ~θ vector are recovered, not just on the overall error in recovering ~θ. As is clear from Figure 6, our approach recovers finer details of the components of the ~θ vector than BL does; the difference can be easily seen by inspecting the red component (the top one).

4.2 Experiments on the NIPS Dataset

Figure 7: Test-set perplexities on the NIPS dataset (held-out perplexity vs. number of topics, for the AX and BL approaches).

In addition to the simulation study, we conducted experiments on the NIPS17 dataset, which contains the proceedings of the NIPS conference from 1988 to 2003.

Figure 8: Number of iterations to converge when performing inference on the held-out documents in the NIPS17 collection (vs. number of topics, for the AX and BL approaches).

This corpus has 2484 documents, a vocabulary size (M) of 14036 words (after removing stop words), and an average of 1320 words per document. The dataset was divided into 2220 documents for training and 256 for testing. We fitted four models to this corpus with K = 10, 20, 30, 40 topics. We compared the two approaches based on perplexity on the held-out test set, where the perplexity of a held-out set Dtest is defined as:

Perplexity(Dtest) = exp( − [ ∑_{n=1}^{|Dtest|} log p(wn) ] / [ ∑_{n=1}^{|Dtest|} |wn| ] )   (16)

To avoid comparing bounds, the true marginal log-likelihood was estimated using importance sampling, where each approach's posterior distribution over ~γ was used as the proposal. Figure 7 summarizes the results, which show that we achieve better test-set perplexities across all numbers of topics. The improvement on this dataset is slight because of the large number of words in each document, which reduces the effect of accurate prior estimation; similar results were explained in Section 4.1 with reference to Figure 4. In Figure 8 we depict the average number of iterations needed by each approach to converge when performing inference over the test set, which shows that our approach converges faster in terms of the number of iterations.
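A hedged sketch of this evaluation procedure, with our own function and variable names (e.g., `mu_post`, `Sigma_post` for a document's fitted Gaussian posterior over ~γ, treated as K-dimensional for simplicity):

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Row-wise log density of a multivariate Gaussian N(mean, cov)."""
    d = x - mean
    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', d, cov_inv, d)
    return -0.5 * (quad + logdet + mean.size * np.log(2.0 * np.pi))

def log_marginal_is(words, mu_post, Sigma_post, mu, Sigma, beta, S=500, rng=None):
    """Importance-sampling estimate of log p(w), using the fitted Gaussian
    posterior over gamma as the proposal distribution."""
    rng = np.random.default_rng(0) if rng is None else rng
    gammas = rng.multivariate_normal(mu_post, Sigma_post, size=S)      # proposal samples
    theta = np.exp(gammas - np.log(np.sum(np.exp(gammas), axis=1, keepdims=True)))
    log_lik = np.sum(np.log(theta @ beta[:, words]), axis=1)           # log p(w | gamma_s)
    log_w = log_gaussian(gammas, mu, Sigma) - log_gaussian(gammas, mu_post, Sigma_post)
    log_terms = log_lik + log_w
    m = log_terms.max()
    return m + np.log(np.mean(np.exp(log_terms - m)))                  # log of the IS average

def perplexity(log_likelihoods, doc_lengths):
    """Held-out perplexity, Eq. (16)."""
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))
```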

4.3 Experiments on the PNAS Dataset

We also compared the two approaches on a 4-way classification task over abstracts from the Proceedings of the National Academy of Sciences (PNAS). These abstracts are labeled according to their scientific category. We selected 2500 abstracts from the period of 1997 to 2002 and fitted 40 topics to the resulting corpus. We then used the resulting low-dimensional topic mixing vectors induced by each approach as features for classification. Out of those 2500 abstracts, we only selected those having the required categories, which resulted in 962 abstracts. We then trained an SVM classifier on 85% of those selected abstracts and tested the accuracy on the remaining 15%. We present the classification accuracy of both approaches in Table 1. As is clear from this table, our approach results in a more accurate classifier. It should be noted that the improvement over the BL approach on this dataset is significant, as opposed to the slight improvement in perplexities on the NIPS dataset. This is due to the relatively small number of words per abstract in the PNAS dataset (an average of 170 words per abstract). These results conform with those from the simulation study we conducted, and fully analyzed, in Section 4.1. Furthermore, inspecting the confusion matrices of both classifiers, we found that most of the errors in the BL approach were due to confusing the Biochemistry and Biophysics abstracts. In Figure 9 we depict the low-dimensional representation of the abstracts in both of these classes, as recovered by the two approaches. As is clear from the figure, our approach results in a better separation of these two seemingly similar classes.

Table 1: Document classification accuracies

Category       Docs   BL     AX
Genetics         21    61.9   61.9
Biochemistry     86    65.1   77.9
Immunology       24    70.8   66.6
Biophysics       15    53.3   66.6
Total           146    64.3   72.6
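As a hedged sketch of this classification setup (the scikit-learn calls are standard, but the linear kernel and all variable names are our own assumptions, not details specified in the paper):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def classify_abstracts(theta, labels, seed=0):
    """Train an SVM on the topic mixing vectors (one K-dim row per abstract)
    and report accuracy on a held-out 15% split, as in the text."""
    X_train, X_test, y_train, y_test = train_test_split(
        theta, labels, test_size=0.15, random_state=seed, stratify=labels)
    clf = SVC(kernel='linear').fit(X_train, y_train)
    return clf.score(X_test, y_test)
```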

Figure 9: The reduced representation of abstracts in Biochemistry and Biophysics. Each line represents a document, and each color segment represents a topic contribution as recorded by ~θ. Top row: the AX approach; bottom row: the BL approach.

5 Conclusions

In this paper we presented a novel approximate inference algorithm for the Logistic-Normal Topic Admixture Model. Our approach overcomes the technical difficulties for inference and learning due to the non-conjugacy within the model via the use of a multivariate quadratic Taylor approximation to the LN. Our method not only makes the variational fixed-point equations for inference amenable to an analytic closed-form solution, but also keeps the coupling between the components of the per-document topic mixing vector. This results in a simple yet efficient tight approximate inference algorithm that enjoys nice representational and convergence properties.

We presented experimental results on simulated datasets as well as on the NIPS17 and PNAS collections, and contrasted our approach with that given by Blei and Lafferty (2005). The results demonstrate that our approach yields a tighter approximation in inference and learning, especially when the number of words per document is relatively small.

Acknowledgements

We would like to thank David Blei for providing his code that we compared against in this paper.

References

J. Aitchison and S. Shen. Logistic-normal distributions: Some properties and uses. Biometrika, 67, 1980.

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

D. Blei and J. Lafferty. Correlated topic models. In NIPS 18, 2006.

E. Erosheva, S. Fienberg, and J. Lafferty. Mixed membership models of scientific publications. PNAS, Vol. 101, Suppl. 1, April 6, 2004.

M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. Jordan (Ed.), Learning in Graphical Models, Cambridge: MIT Press, 1999.

J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:945–959, June 2000.

J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object categories in image collections. Technical report, CSAIL, MIT, 2005.

N. Ueda and R. Nakano. Deterministic annealing EM algorithm. Neural Networks, 11:271–282, 1998.

M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS 17, 2004.

E. Xing, M. Jordan, and S. Russell. A generalized mean field algorithm for variational inference in exponential families. In Proc. of the 19th UAI, pages 583–591, 2003.

E. Xing. On topic evolution. CMU-ML TR 05-115, 2005.

B. Zhao and E. Xing. BiTAM: Bilingual topic admixture models for word alignment. In The Joint Conference of the Association for Computational Linguistics (ACL 2006), 2006.