
Hierarchically Clustered Representation Learning

Su-Jin Shin 1 Kyungwoo Song 1 Il-Chul Moon 1

Abstract

The joint optimization of representation learning and clustering in the embedding space has experienced a breakthrough in recent years. In spite of this advance, clustering with representation learning has been limited to flat-level categories, which often involves cohesive clustering with a focus on instance relations. To overcome the limitations of flat clustering, we introduce hierarchically clustered representation learning (HCRL), which simultaneously optimizes representation learning and hierarchical clustering in the embedding space. Compared with a few prior works, HCRL is the first to consider the generation of deep embeddings from every component of the hierarchy, not just the leaf components. In addition to obtaining hierarchically clustered embeddings, we can reconstruct data at various abstraction levels, infer the intrinsic hierarchical structure, and learn the level-proportion features. We conducted evaluations on image and text domains, and our quantitative analyses showed competitive likelihoods and the best accuracies compared with the baselines.

1. Introduction

Clustering is one of the most traditional and frequently used machine learning tasks. Clustering models are designed to represent intrinsic data structures, such as latent Dirichlet allocation (Blei et al., 2003). The recent development of representation learning has contributed to generalizing model feature engineering, which also enhances data representation (Bengio et al., 2013). Therefore, representation learning has been merged into clustering models, e.g., variational deep embedding (VaDE) (Jiang et al., 2017).

Autoencoder (Rumelhart et al., 1985) is a typical neural network for unsupervised representation learning, and it achieves a non-linear mapping from an input space to an embedding space by minimizing reconstruction errors. To turn the embeddings into random variables, a variational autoencoder (VAE) (Kingma & Welling, 2014) places a Gaussian prior on the embeddings. The autoencoder, whether probabilistic or not, has a limitation in reflecting the intrinsic hierarchical structure of data. For instance, a VAE assuming a single Gaussian prior needs to be expanded to suggest an elaborate clustering structure.

1 Department of Industrial and Systems Engineering, KAIST, Daejeon, Republic of Korea. Correspondence to: Il-Chul Moon <[email protected]>.

Preliminary work. Under review.

Figure 1. Example of hierarchically clustered embeddings on MNIST with three levels of hierarchy (left), the generated digits from the hierarchical Gaussian mixture components (top right), and the extracted level-proportion features (bottom right). We marked the mean of a Gaussian mixture component with a colored square, and the digit written inside the square refers to the unique index of the mixture component.

Due to the limitations of modeling the cluster structure with autoencoders, prior works combine the autoencoder and the clustering algorithm. While some early cases simply pipeline the two models, e.g., Huang et al. (2014), a typical merging approach is to model an additional loss, such as a clustering loss, in the autoencoders (Xie et al., 2016; Guo et al., 2017; Yang et al., 2017; Nalisnick et al., 2016; Chu & Cai, 2017; Jiang et al., 2017). These suggestions exhibit gains from unifying the encoding and the clustering, yet they remain at parametric and flat-structured clustering. A more recent development releases the previous constraints by using the nonparametric Bayesian approach. For example, the infinite mixture of VAEs (IMVAE) (Abbasnejad et al., 2017) explores the infinite space of VAE mixtures by looking for an adequate embedding space through sampling, such as the Chinese restaurant process (CRP). Whereas IMVAE remains at flat-structured clustering, VAE-nested CRP (VAE-nCRP) (Goyal et al., 2017) captures a more complex structure, i.e., a hierarchical structure of the data, by adopting the nested Chinese restaurant process (nCRP) prior (Griffiths et al., 2004) into the cluster assignment of the Gaussian mixture model.



Figure 2. Graphical representation of VaDE (Jiang et al., 2017) (left), VAE-nCRP (Goyal et al., 2017) (center), and the neural architecture of both models (right). In the graphical representation, the white/shaded circles represent latent/observed variables. The black dots indicate hyper- or variational parameters. The solid lines represent a generative model, and the dashed lines represent a variational approximation. A rectangle box means a repetition for the number of times denoted at the bottom right of the box.

Hierarchical mixture density estimation (Vasconcelos & Lippman, 1999), where all internal and leaf components are directly modeled to generate data, is a flexible framework for hierarchical mixture modeling, such as hierarchical topic modeling (Mimno et al., 2007; Griffiths et al., 2004), with regard to learning the internal components. This paper proposes hierarchically clustered representation learning (HCRL), which is a joint model of 1) nonparametric Bayesian hierarchical clustering and 2) representation learning with neural networks. HCRL extends a previous work on merging flat clustering and representation learning, i.e., VaDE, by incorporating inter-cluster relation modeling.

Specifically, HCRL jointly optimizes soft-divisive hierarchical clustering in an embedding space from the VAE via two mechanisms. First, HCRL includes a hierarchical version of the Gaussian mixture model (HGMM), i.e., a mixture of hierarchically organized Gaussian distributions. HCRL then sets the prior of the embeddings by adopting the generative process of HGMM. Second, to handle a dynamic hierarchy structure dealing with clusters of unequal sizes, we explore the infinite hierarchy space by exploiting an nCRP prior. These mechanisms are fused into a unified objective function, rather than concatenating the two distinct models of clustering and autoencoding.

We developed two variations of HCRL, called HCRL1 and HCRL2, where HCRL2 extends HCRL1 by flexible modeling of the level proportion. The quantitative evaluations focus on density estimation quality and hierarchical clustering accuracy, and they show that HCRL2 has competitive likelihoods and the best accuracies compared with the baselines. For the qualitative analysis, we visualize 1) the hierarchical clusterings, 2) the embeddings under the hierarchy modeling, and 3) the generated images from each Gaussian mixture component, as shown in Figure 1. These experiments were conducted across the data domains of texts and images, so our benchmark datasets include MNIST, CIFAR-100, RCV1 v2, and 20Newsgroups.

2. Preliminaries

2.1. Variational Deep Embedding

Figure 2 presents a graphical representation and a neural architecture of VaDE (Jiang et al., 2017). The model parameters $\kappa$, $\mu_{1:K}$, and $\sigma^2_{1:K}$, which are the proportion, means, and covariances of the mixture components, respectively, are declared outside of the neural network. VaDE trains the model parameters to maximize the lower bound of the marginal log-likelihood via mean-field variational inference (Jordan et al., 1999). VaDE uses a Gaussian mixture model (GMM) as the prior, whereas VAE assumes a single standard Gaussian distribution on the embeddings. Following the generative process of a GMM, VaDE assumes that 1) the embedding draws a cluster assignment, and 2) the embedding is generated from the selected Gaussian mixture component.

VaDE uses amortized inference as in VAE, with a generative network and an inference network; $\mathcal{L}(x)$ in Equation 1 denotes the evidence lower bound (ELBO), which is a lower bound on the log-likelihood. It should be noted that VaDE merges the ELBO of VAE with the likelihood of the GMM.

$$\log p(x) \ge \mathcal{L}(x) = \mathbb{E}_q\Bigl[\log\frac{p(c,z,x)}{q(c,z|x)}\Bigr] = \mathbb{E}_q\Bigl[\log\prod_{c=1}^{K}\frac{\kappa_c\,\mathcal{N}(z|\mu_c,\sigma^2_c I_J)}{p(c|z)\,\mathcal{N}(z|\tilde{\mu},\tilde{\sigma}^2 I_J)} + \log p(x|z)\Bigr] \tag{1}$$
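To make the role of the GMM prior concrete, the following minimal NumPy sketch (not the authors' code; all names are illustrative) computes the cluster responsibility $p(c|z) \propto \kappa_c\,\mathcal{N}(z|\mu_c,\sigma^2_c I_J)$ that appears inside Equation 1 for a single encoded sample $z$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_responsibilities(z, kappa, mu, sigma2):
    """p(c|z) proportional to kappa_c * N(z | mu_c, diag(sigma2_c)) under a GMM prior.

    z:      (J,)   a single embedding sampled from q(z|x)
    kappa:  (K,)   mixture proportions
    mu:     (K, J) component means
    sigma2: (K, J) component (diagonal) variances
    """
    K = kappa.shape[0]
    log_post = np.empty(K)
    for c in range(K):
        log_post[c] = np.log(kappa[c]) + multivariate_normal.logpdf(
            z, mean=mu[c], cov=np.diag(sigma2[c]))
    log_post -= log_post.max()          # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# toy usage
rng = np.random.default_rng(0)
K, J = 5, 10
gamma = cluster_responsibilities(rng.normal(size=J),
                                 np.full(K, 1.0 / K),
                                 rng.normal(size=(K, J)),
                                 np.ones((K, J)))
print(gamma.sum())  # ~1.0
```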

2.2. Variational Autoencoder nested Chinese Restaurant Process

VAE-nCRP uses the nonparametric Bayesian prior for learning tree-based hierarchies, the nCRP (Griffiths et al., 2004), so that the representation can be hierarchically organized.


The nCRP prior defines the distributions over children components for each parent component, recursively in a top-down way. The variational inference of the nCRP can be formalized by the nested stick-breaking construction (Wang & Blei, 2009), which is also kept in the VAE setting. The distribution over paths on the hierarchy is defined as being proportional to the product of weights corresponding to the nodes lying on each path. The weight $\pi_i$ for the $i$-th node follows the Griffiths-Engen-McCloskey (GEM) distribution (Pitman et al., 2002), where $\pi_i$ is constructed by a stick-breaking process as $\pi_i = v_i \prod_{j=1}^{i-1}(1 - v_j)$ with $v_i \sim \mathrm{Beta}(1, \gamma)$. Since the nCRP provides the ELBO with the nested stick-breaking process, VAE-nCRP has a unified ELBO of VAE and the nCRP in Equation 2.

$$\mathcal{L}(x) = \mathbb{E}_q\Biggl[\log\frac{p(v)}{q(v|x)} + \log p(x|z) + \log\Biggl\{\frac{p(\zeta|v)}{q(\zeta|x)}\underbrace{\frac{p(\alpha_{par(p)}|\alpha^*)\,p(\alpha_p|\alpha_{par(p)},\sigma^2_N)}{q(\alpha_p,\alpha_{par(p)}|x)}}_{(3.1)}\underbrace{\frac{p(z|\alpha_p,\zeta,\sigma^2_D)}{q(z|x)}}_{(3.2)}\Biggr\}\Biggr] \tag{2}$$
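As a reference for the nested stick-breaking construction above, the following is a small NumPy sketch of drawing truncated GEM weights $\pi_i = v_i\prod_{j<i}(1-v_j)$ with $v_i \sim \mathrm{Beta}(1,\gamma)$ at one node; the truncation level is an illustrative choice rather than part of VAE-nCRP.

```python
import numpy as np

def truncated_gem_weights(gamma, truncation, rng):
    """Stick-breaking weights pi_i = v_i * prod_{j<i}(1 - v_j), with v_i ~ Beta(1, gamma)."""
    v = rng.beta(1.0, gamma, size=truncation)
    remaining_stick = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = v * remaining_stick
    # pi sums to < 1; the leftover mass belongs to the untruncated tail.
    return pi

rng = np.random.default_rng(0)
pi = truncated_gem_weights(gamma=1.0, truncation=10, rng=rng)
print(pi, pi.sum())
```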

Given the ELBO of VAE-nCRP, we recognized potential improvements. First, term (3.1) models the hierarchical relationship among clusters, i.e., each child is generated from its parent. VAE-nCRP trades off direct dependency modeling among clusters against the mean-field approximation. This modeling may reveal that the higher clusters in the hierarchy are more difficult to train. Second, in term (3.2), only the leaf mixture components generate embeddings, which implies that only leaf clusters have a direct summarization ability for sub-populations. Additionally, in term (3.2), the variance parameter $\sigma^2_D$ is modeled as a hyperparameter shared by all clusters. In other words, with only the $J$-dimensional parameters $\alpha$ for the leaf mixture components, the local density modeling without variance parameters has a critical disadvantage.

We were able to compensate for all of these weaknesses with the level-proportion modeling and the HGMM prior. The level assignment generated from the level proportion allows a data instance to select among all mixture components. We do not need direct dependency modeling between parents and their children because all internal mixture components also generate embeddings.

3. Proposed Models

3.1. Generative Process

We developed two models for hierarchically clustered representation learning: HCRL1 and HCRL2. The generative processes of the presented models resemble the generative process of hierarchical clusterings, such as hierarchical latent Dirichlet allocation (Griffiths et al., 2004). In detail, the generative process departs from selecting a path $\zeta$ from the nCRP prior (Phase 1). Then, we sample a level proportion (Phase 2) and a level $l$ (Phase 3) from the sampled level proportion to find the mixture component in the path, and this component $\zeta_l$ provides the Gaussian distribution for the latent representation (Phase 4). Finally, the latent representation is exploited to generate an observed datapoint (Phase 5). The first subfigure of Figure 3 depicts the generative process with the specific notations.

The level proportion of Phase 2 is commonly modeled as a group-specific variable in topic modeling. To adapt the level proportion to our non-grouped setting, we considered two modeling assumptions on the level proportion: 1) a globally defined level proportion, shared by all data instances, which characterizes HCRL1, and 2) a locally defined, i.e., data-specific, level proportion, which distinguishes HCRL2 from HCRL1. Similar to the latter assumption, several recently proposed models also define a data-specific mixture membership over the mixture components (Zhang et al., 2018; Ji et al., 2016).

The formulas below give the generative process of HCRL2 with its density functions, where the level proportion is generated per data instance. In addition, Figures 4 and 3 illustrate the graphical representations of HCRL1 and HCRL2, respectively, and the graphical representations correspond to the described generative process. The generative process also presents our formalization of the prior distributions, denoted as $p(\cdot)$, and the variational distributions, denoted as $q(\cdot)$, by generation phase. The variational distributions are used for the mean-field variational inference (Jordan et al., 1999) as detailed in Section 3.3.

1. Choose a path $\zeta \sim \mathrm{nCRP}(\zeta|\gamma)$
   • $p(\zeta) = \prod_{l=1}^{L} \pi_{1,\zeta_2,\ldots,\zeta_l}$ where $\pi_{1,\zeta_2,\ldots,\zeta_l} = \prod_{l'=1}^{l}\bigl\{ v_{1,\zeta_2,\ldots,\zeta_{l'}} \bigl(\prod_{j=1}^{\zeta_{l'}-1}(1 - v_{1,\zeta_2,\ldots,j})\bigr) \bigr\}$
   • $q(\zeta|x) \propto S_\zeta,\ \sum_{\zeta' \in \mathrm{child}(\zeta)} S_{\zeta'}$

2. Choose a level proportion $\eta \sim \mathrm{Dirichlet}(\eta|\alpha)$
   • $p(\eta) = \mathrm{Dirichlet}(\eta|\alpha)$
   • $q_{\phi_\eta}(\eta|x) = \mathrm{Dirichlet}(\eta|\tilde{\alpha}) \approx \mathrm{LogisticNormal}(\eta|\tilde{\mu}_\eta, \tilde{\sigma}^2_\eta I_L)$
     where $[\tilde{\mu}_\eta; \log\tilde{\sigma}^2_\eta] = g_{\phi_\eta}(x)$ and $\tilde{\alpha}_l = \frac{1}{\tilde{\sigma}^2_{\eta_l}}\Bigl(1 - \frac{2}{L} + \frac{e^{-\tilde{\mu}_{\eta_l}}}{L^2}\sum_{l'} e^{-\tilde{\mu}_{\eta_{l'}}}\Bigr)$

3. Choose a level $l \sim \mathrm{Multinomial}(l|\eta)$
   • $p(l) = \mathrm{Multinomial}(l|\eta)$
   • $q(l|x) = \mathrm{Multinomial}(l|\omega)$
     where $\omega_l \propto \exp\Bigl\{\sum_{\zeta} S_\zeta \Bigl(\sum_{j=1}^{J} -\tfrac{1}{2}\log(2\pi\sigma^2_{\zeta_l,j}) - \frac{\tilde{\sigma}^2_{z_j}}{2\sigma^2_{\zeta_l,j}} - \frac{(\tilde{\mu}_{z_j} - \mu_{\zeta_l,j})^2}{2\sigma^2_{\zeta_l,j}}\Bigr) + \psi(\tilde{\alpha}_l) - \psi(\tilde{\alpha}_0)\Bigr\}$

4. Choose a latent representation $z \sim \mathcal{N}(z|\mu_{\zeta_l}, \sigma^2_{\zeta_l} I_J)$
   • $p(z) = \sum_{\zeta,l} p(\zeta|\gamma)\, \eta_l\, \mathcal{N}(z|\mu_{\zeta_l}, \sigma^2_{\zeta_l} I_J)$
   • $q_{\phi_z}(z|x) = \mathcal{N}(z|\tilde{\mu}_z, \tilde{\sigma}^2_z I_J)$ where $[\tilde{\mu}_z; \log\tilde{\sigma}^2_z] = g_{\phi_z}(x)$


5. Choose an observed datapoint $x \sim \mathcal{N}(x|\mu_x, \sigma^2_x I_D)$ where $[\mu_x; \log\sigma^2_x] = f_\theta(z)$.¹

Figure 3. A simple depiction (left) of the key notations, where each numbered circle refers to the corresponding Gaussian mixture component. The graphical representation (center) and the neural architecture (right) of our proposed model, HCRL2. The neural architecture of HCRL2 consists of two probabilistic encoder networks, $g_{\phi_\eta}$ and $g_{\phi_z}$, and one probabilistic decoder network, $f_\theta$.
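To summarize Phases 1-5, the NumPy sketch below performs ancestral sampling from a small, fixed hierarchy; the hard-coded tree, path prior, and decoder stand-in are simplifying assumptions for illustration and do not correspond to a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-level hierarchy: node -> (mean, var) of its Gaussian in the embedding space.
nodes = {
    (0,):   (np.zeros(2),            np.ones(2) * 4.0),   # root
    (0, 0): (np.array([-2.0, 0.0]),  np.ones(2)),
    (0, 1): (np.array([ 2.0, 0.0]),  np.ones(2)),
}
paths = [(0, 0), (0, 1)]            # full paths of the truncated tree
path_probs = np.array([0.5, 0.5])   # stands in for the nCRP path prior p(zeta)
alpha = np.array([1.0, 1.0])        # Dirichlet prior on the 2-level proportion

def sample_datapoint(decoder):
    zeta = paths[rng.choice(len(paths), p=path_probs)]   # Phase 1: path
    eta = rng.dirichlet(alpha)                           # Phase 2: level proportion
    level = rng.choice(len(eta), p=eta)                  # Phase 3: level
    mu, var = nodes[zeta[:level + 1]]                    # component zeta_l (root or leaf)
    z = rng.normal(mu, np.sqrt(var))                     # Phase 4: embedding
    mu_x = decoder(z)                                    # Phase 5: decode and observe
    return rng.normal(mu_x, 0.1)

toy_decoder = lambda z: np.tanh(z)   # stand-in for f_theta
print(sample_datapoint(toy_decoder))
```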

3.2. Neural Architecture

The discrepancy in prior assumptions on the level assignment leads to different neural architectures. The neural architecture of HCRL1 is a standard variational autoencoder, while the neural architecture of HCRL2 consists of two probabilistic encoders, on $z$ and $\eta$, and one probabilistic decoder, on $z$, as shown in the right part of Figure 3. We designed the probabilistic encoder on $\eta$ to infer the variational posterior of the data-specific level proportion. The unbalanced architecture originates from our modeling assumption of $p(x|z)$, not $p(x|z,\eta)$.

One may be puzzled by the lack of a generative network for $\eta$, but $\eta$ is used for the hierarchy construction in the nCRP, which is part of the previous section. In detail, $\eta$ is a random variable for the level proportion in Phase 2 of the generative process. The sampling of $\eta$ and $\zeta$ is reflected in selecting a Gaussian mixture component in Phase 4, and the latent vector $z$ becomes an indicator of a data instance $x$. Therefore, the sampling of $\eta$ from the neural network is linked to the probabilistic modeling of $x$, so the probabilistic model substitutes for creating a generative network from $\eta$ to $x$.

Considering $\eta$ in HCRL, the inference network is given, but the generative network is replaced by the generative process of the graphical model. If we imagined a balanced structure, then the generative process would need to be fully described by the neural network, but the complex interaction within the hierarchy would make for a complex neural network structure. Therefore, the neural network structure in Figure 3 may appear to miss the reconstruction learning on $\eta$, but the reconstruction has been reflected on the probabilistic graphical model (PGM) side of learning. This is also a difference between (VaDE, VAE-nCRP) and HCRL, because VaDE and VAE-nCRP adhere to the balanced autoencoder structure. We call this reconstruction process, which is inherently a generative process of the traditional PGM, PGM reconstruction (see the decoding neural network part of Figure 3).

¹ We introduce the sample distribution for real-valued data instances; supplementary material Section 6 provides the binary case as well, which we use for MNIST.

Figure 4. Graphical representation of HCRL1.
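As a concrete reading of the unbalanced architecture, the PyTorch-style sketch below wires the two encoders, $g_{\phi_z}$ and $g_{\phi_\eta}$, to a single decoder $f_\theta$ that consumes only $z$; the layer sizes are arbitrary placeholders, and this is an illustration rather than the implementation used in our experiments.

```python
import torch
import torch.nn as nn

class HCRL2Nets(nn.Module):
    """Two probabilistic encoders (for z and eta) and one decoder (from z only)."""

    def __init__(self, x_dim=784, z_dim=10, n_levels=3, hidden=256):
        super().__init__()
        self.enc_z = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 2 * z_dim))       # [mu_z; log var_z]
        self.enc_eta = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * n_levels))  # [mu_eta; log var_eta]
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))             # f_theta(z) -> mu_x

    def forward(self, x):
        mu_z, logvar_z = self.enc_z(x).chunk(2, dim=-1)
        mu_eta, logvar_eta = self.enc_eta(x).chunk(2, dim=-1)
        z = mu_z + torch.randn_like(mu_z) * (0.5 * logvar_z).exp()   # reparameterization
        # Note: eta has no decoder; its "reconstruction" happens on the PGM side.
        return self.dec(z), (mu_z, logvar_z), (mu_eta, logvar_eta)

nets = HCRL2Nets()
x = torch.rand(8, 784)
x_mu, q_z, q_eta = nets(x)
print(x_mu.shape)  # torch.Size([8, 784])
```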

3.3. Mean-Field Variational Inference

The formal specification is the factorized probabilistic model in Equation 3, which is based on HCRL2. In the case of HCRL1, $\eta_n$ should be changed to $\eta$ and placed outside the product over $n$.

$$p(\Phi,x) = \prod_{j\notin M_T} p(v_j|\gamma) \times \prod_{i\in M_T} p(v_i|\gamma) \times \prod_{n=1}^{N} p(\zeta_n|v)\,p(\eta_n|\alpha)\,p(l_n|\eta_n)\,p(z_n|\zeta_n,l_n)\,p_\theta(x_n|z_n) \tag{3}$$

where $\Phi = \{v, \zeta, \eta, l, z\}$ denotes the set of latent variables, and $M_T$ denotes the set of all nodes in tree $T$. The proportion and assignment on the mixture components for the $n$-th data instance are modeled by $\zeta_n$ as a path assignment, $\eta_n$ as a level proportion, and $l_n$ as a level assignment. $v$ is a Beta draw used in the stick-breaking construction. We assume that the variational distributions of HCRL2 factorize as in Equation 4 by the mean-field approximation. In HCRL1, we also assume mean-field variational distributions, and therefore $\eta_n$ should be replaced by $\eta$ and placed outside the product over $n$.


$$q(\Phi|x) = \prod_{j\notin M_T} p(v_j|\gamma) \times \prod_{i\in M_T} q(v_i|a_i,b_i) \times \prod_{n=1}^{N} q(\zeta_n|x_n)\,q_{\phi_\eta}(\eta_n|x_n)\,q(l_n|\omega_n,x_n)\,q_{\phi_z}(z_n|x_n) \tag{4}$$

where $q_{\phi_\eta}(\eta_n|x_n)$ and $q_{\phi_z}(z_n|x_n)$ should be noted because these two variational distributions follow the amortized inference of VAE. $q(\zeta|x) \propto S_\zeta,\ \sum_{\zeta'\in\mathrm{child}(\zeta)} S_{\zeta'}$ is the variational distribution over the path $\zeta$, where $\mathrm{child}(\zeta)$ means the set of all full paths that are not in $T$ but include $\zeta$ as a sub-path. Because we specified both the generative and the variational distributions, we define the ELBO of HCRL2, $\mathcal{L} = \mathbb{E}_q\bigl[\log\frac{p(\Phi,x)}{q(\Phi|x)}\bigr]$, in Equation 5. Supplementary material Section 6 enumerates the full derivation in detail. We report that the Laplace approximation with the logistic normal distribution is applied to model the prior, $\alpha$, of the level proportion, $\eta$. We choose a conjugate prior of a multinomial, so $p(\eta_n|\alpha)$ follows the Dirichlet distribution. To configure the inference network on the Dirichlet prior, the Laplace approximation is used (MacKay, 1998; Srivastava & Sutton, 2017; Hennig et al., 2012).

$$
\begin{aligned}
\mathcal{L}(x) &= \mathbb{E}_q\Bigl[\log\frac{p(v)}{q(v|x)} + \log\frac{p(\eta)}{q(\eta|x)} + \log\prod_{\zeta,l}\frac{p(\zeta|v)}{q(\zeta|x)}\frac{p(l|\eta)}{q(l|x)}\frac{p(z|\mu_{\zeta_l},\sigma^2_{\zeta_l})}{q(z|x)} + \log p(x|z)\Bigr] \\
&= \sum_{i\in M_T}\Bigl[\log\gamma + (\gamma-1)\bigl(\psi(b_i)-\psi(a_i+b_i)\bigr) - \bigl\{\log\Gamma(a_i+b_i) - \log\Gamma(a_i) - \log\Gamma(b_i) \\
&\quad + (a_i-1)\psi(a_i) + (b_i-1)\psi(b_i)\bigr\}\Bigr] + \sum_{n=1}^{N}\Bigl[\mathbb{E}_q[\log p(\zeta_n|v)] \\
&\quad + \sum_{i=1}^{L}(\alpha_i-1)\bigl(\psi(\tilde{\alpha}_{ni})-\psi(\tilde{\alpha}_{n0})\bigr) + \log\Gamma(\alpha_0) - \sum_{i=1}^{L}\log\Gamma(\alpha_i) + \sum_{l'}\omega_{nl'}\bigl(\psi(\tilde{\alpha}_{nl'})-\psi(\tilde{\alpha}_{n0})\bigr) \\
&\quad + \sum_{\zeta'}S_{n\zeta'}\Bigl\{\sum_{l'}\omega_{nl'}\Bigl(\sum_{j=1}^{J} -\tfrac{1}{2}\log\bigl(2\pi\sigma^2_{\zeta'_{l'},j}\bigr) - \frac{\bigl(\tilde{\mu}_{z_{nj}}-\mu_{\zeta'_{l'},j}\bigr)^2}{2\sigma^2_{\zeta'_{l'},j}} - \frac{\tilde{\sigma}^2_{z_{nj}}}{2\sigma^2_{\zeta'_{l'},j}}\Bigr)\Bigr\} \\
&\quad + \frac{1}{R}\sum_{r=1}^{R}\sum_{d=1}^{D} -\tfrac{1}{2}\log\bigl(2\pi\sigma^{(r)2}_{x_{nd}}\bigr) - \frac{\bigl(x_{nd}-\mu^{(r)}_{x_{nd}}\bigr)^2}{2\sigma^{(r)2}_{x_{nd}}} \\
&\quad - \Bigl\{\sum_{\zeta'}\frac{S_{n\zeta'}}{\sum_{\zeta''}S_{n\zeta''}}\log\frac{S_{n\zeta'}}{\sum_{\zeta''}S_{n\zeta''}} + \log\Gamma(\tilde{\alpha}_{n0}) - \sum_{i=1}^{L}\log\Gamma(\tilde{\alpha}_{ni}) + \sum_{i=1}^{L}(\tilde{\alpha}_{ni}-1)\bigl(\psi(\tilde{\alpha}_{ni})-\psi(\tilde{\alpha}_{n0})\bigr) \\
&\quad + \sum_{l'}\omega_{nl'}\log\omega_{nl'} - \tfrac{J}{2}\log(2\pi) - \tfrac{1}{2}\sum_{j=1}^{J}\bigl(1+\log\tilde{\sigma}^2_{z_{nj}}\bigr)\Bigr\}\Bigr]
\end{aligned}
\tag{5}
$$

where $\tilde{\alpha}_{n0} = \sum_{i=1}^{L}\tilde{\alpha}_{ni}$, $\alpha_0 = \sum_{i=1}^{L}\alpha_i$, $\psi$ denotes the digamma function, and $R$ is the mini-batch size.
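For reference, the sketch below shows the standard diagonal Laplace approximation of a Dirichlet prior in the softmax basis (in the spirit of Srivastava & Sutton, 2017), mapping $\alpha$ to logistic-normal parameters; it is an illustrative construction and may differ in detail from the mapping used in our derivation (cf. the expression for $\tilde{\alpha}_l$ in Phase 2).

```python
import numpy as np

def dirichlet_to_logistic_normal(alpha):
    """Diagonal Laplace approximation of Dirichlet(alpha) in the softmax basis
    (following Srivastava & Sutton, 2017): returns (mu, sigma2) of the
    approximating logistic-normal distribution over the level proportion."""
    alpha = np.asarray(alpha, dtype=float)
    L = alpha.shape[0]
    mu = np.log(alpha) - np.log(alpha).mean()
    sigma2 = (1.0 / alpha) * (1.0 - 2.0 / L) + (1.0 / L**2) * np.sum(1.0 / alpha)
    return mu, sigma2

mu_eta, sigma2_eta = dirichlet_to_logistic_normal(np.array([1.0, 1.0, 1.0]))
print(mu_eta, sigma2_eta)
```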

3.4. Training Algorithm of Clustering Hierarchy

HCRL is formalized according to the stick-breaking process scheme. Unlike the CRP, the stick-breaking process does not represent the direct sampling of the mixture component at the data instance level. Therefore, it is necessary to devise a heuristic algorithm with operations, such as GROW, PRUNE, and MERGE, to refine the hierarchy structure. Supplementary material Section 3 provides details about each operation. In the description below, an inner path and a full path refer to a path ending with an internal node and a leaf node, respectively.

Algorithm 1 Training for Hierarchically Clustered Representation Learning

Require: Training data x; number of epochs, E; tree-based hierarchy depth, L; period of performing GROW, t_grow; minimum number of epochs locking the hierarchy, t_lock
Ensure: $T^{(E)}$, $\omega$, $\{a_i, b_i, \mu_i, \sigma^2_i\}_{i \in M_{T^{(E)}}}$

1: $\mu_{\zeta_{1:L}}, \sigma^2_{\zeta_{1:L}} \leftarrow$ Initialize $L$ Gaussians of a single path $\zeta$
2: $T^{(0)} \leftarrow$ Initialize the tree-based hierarchy having $\zeta$
3: $t \leftarrow 0$
4: for each epoch $e = 1, \dots, E$ do
5:   Update the weight parameters using $\nabla\mathcal{L}(x)$
6:   $\{a_i, b_i, \mu_i, \sigma^2_i\}_{i \in M_{T^{(e-1)}}} \leftarrow$ Update node-specific parameters using $\nabla_{a,b,\mu,\sigma^2}\mathcal{L}(x)$
7:   Update other variational parameters using $\nabla\mathcal{L}(x)$
8:   if mod(e, t_grow) = 0 then
9:     $T^{(e)}, Q \leftarrow$ GROW
10:  end if
11:  if $T^{(e)} = T^{(e-1)}$ and $t \ge t_{lock}$ then
12:    $T^{(e)}, Q \leftarrow$ PRUNE
13:    if $T^{(e)} = T^{(e-1)}$ then $T^{(e)}, Q \leftarrow$ MERGE
14:  end if
15:  if $T^{(e)} \ne T^{(e-1)}$ then $t \leftarrow 0$ else $t \leftarrow t + 1$
16: end for

• GROW expands the hierarchy by creating a new branch under a heavily weighted internal node. Compared with the work of Wang & Blei (2009), we modified GROW to first sample a path $\zeta^*$ proportional to $\sum_n q(\zeta_n = \zeta^*)$, and then to grow the path if the sampled path is an inner path.

• PRUNE cuts a randomly sampled minor full path $\zeta^*$ satisfying $\frac{\sum_n q(\zeta_n = \zeta^*)}{\sum_{n,\zeta} q(\zeta_n = \zeta)} < \delta$, where $\delta$ is a pre-defined threshold. If the removed leaf node of the full path is the last child of its parent node, we also recursively remove the parent node.

• MERGE combines two full paths, $\zeta^{(i)}$ and $\zeta^{(j)}$, with similar posterior probabilities, measured by $J(\zeta^{(i)}, \zeta^{(j)}) = \frac{\mathbf{q}_i \mathbf{q}_j^{T}}{|\mathbf{q}_i||\mathbf{q}_j|}$, where $\mathbf{q}_i = [q(\zeta_1 = \zeta^{(i)}), \cdots, q(\zeta_N = \zeta^{(i)})]$.


Algorithm 1 summarizes the overall training algorithm of HCRL. The tree-based hierarchy $T$ is defined as $(\mathbf{N}, \mathbf{P})$, where $\mathbf{N}$ and $\mathbf{P}$ denote a set of nodes and a set of paths, respectively. We refer to the node at level $l$ lying on path $\zeta$ as $N(\zeta_{1:l}) \in \mathbf{N}$. The defined paths, $\mathbf{P}$, consist of the union of full paths, $\mathbf{P}_{full}$, and inner paths, $\mathbf{P}_{inner}$. The GROW operation is executed every t_grow iterations. After t_lock iterations have elapsed since performing the GROW operation, we begin to check whether the PRUNE or MERGE operation should be performed. We prioritize the PRUNE operation, and if the condition for performing PRUNE is not satisfied, we check for the MERGE operation next. After performing any operation, we reset t to 0, which locks the changed hierarchy for at least t_lock iterations so that it can be fitted to the training data.
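The PRUNE and MERGE criteria reduce to simple operations on the matrix of path posteriors $q(\zeta_n = \zeta)$, as the NumPy sketch below illustrates; the matrix and the threshold value are assumed inputs for illustration and are produced by the variational distribution during actual training.

```python
import numpy as np

def prune_candidates(Q, delta=0.01):
    """Q: (N, P) matrix with Q[n, p] = q(zeta_n = path_p).
    Returns indices of minor full paths whose share of total posterior mass is < delta."""
    mass_per_path = Q.sum(axis=0)
    share = mass_per_path / mass_per_path.sum()
    return np.where(share < delta)[0]

def merge_score(Q, i, j):
    """Cosine similarity J(zeta_i, zeta_j) between the posterior-assignment vectors q_i, q_j."""
    qi, qj = Q[:, i], Q[:, j]
    return float(qi @ qj / (np.linalg.norm(qi) * np.linalg.norm(qj)))

rng = np.random.default_rng(0)
Q = rng.dirichlet(np.ones(6), size=100)     # toy posteriors over 6 full paths
print(prune_candidates(Q), merge_score(Q, 0, 1))
```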

4. Experiments

4.1. Datasets and Baselines

Datasets: We used various hierarchically organized benchmark datasets as well as MNIST.

• MNIST (LeCun et al., 1998): 28x28x1 handwritten digit images, with 60,000 training images and 10,000 test images. We reshaped each image into a 784-dimensional vector.

• CIFAR-100 (Krizhevsky & Hinton, 2009): 32x32x3 colored images with 20 coarse and 100 fine classes. We used the 3,072-dimensional flattened data, with 50,000 training and 10,000 testing images.

• RCV1 v2 (Lewis et al., 2004): the preprocessed text of the Reuters Corpus Volume 1. We preprocessed the text by selecting the top 2,000 tf-idf words (a minimal sketch of this selection follows this list). We used the hierarchical labels up to level four, and multi-labeled documents were removed. The final preprocessed corpus consists of 11,370 training documents and 10,000 testing documents randomly sampled from the original test corpus.

• 20Newsgroups (Lang, 1995): the benchmark text data extracted from 20 newsgroups, consisting of 11,314 training and 7,532 testing documents. We also labeled the documents with four levels following the annotated hierarchical structure. We preprocessed the data through the same process as that of RCV1 v2.
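A minimal scikit-learn sketch of the tf-idf vocabulary selection referenced above is given below, illustrated on 20Newsgroups because scikit-learn ships its raw text; the exact tokenization and filtering choices of our pipeline may differ.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Limit the vocabulary to 2,000 terms. Note that scikit-learn's max_features keeps
# the most frequent terms; selecting strictly by tf-idf weight would need an extra step.
train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
vectorizer = TfidfVectorizer(max_features=2000, stop_words='english')
X_train = vectorizer.fit_transform(train.data)   # (n_documents, 2000) sparse matrix
print(X_train.shape)
```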

Baselines: We completed our evaluation in two aspects: 1) optimizing the density estimation, and 2) clustering the hierarchical categories. First, we evaluated HCRL1 and HCRL2 from the density estimation perspective by comparing them with diverse flat clustered representation learning models and VAE-nCRP. Second, we tested HCRL1 and HCRL2 from the accuracy perspective by comparing them with multiple divisive hierarchical clusterings. The baselines are listed below. We also added two-stage pipeline approaches, where we trained features from VaDE first and then applied the hierarchical clusterings. We reused the open-source codes² provided by the authors for several baselines, such as IDEC, DCN, VAE-nCRP, and SSC-OMP.

1. Variational Autoencoder (VAE) (Kingma & Welling, 2014): places a single Gaussian prior on embeddings.

2. Variational Deep Embedding (VaDE) (Jiang et al., 2017): jointly optimizes a Gaussian mixture model and representation learning.

3. Improved Deep Embedded Clustering (IDEC) (Guo et al., 2017): improves DEC (Xie et al., 2016) by attaching a decoder structure. We used the code by the authors.

4. Deep Clustering Network (DCN) (Yang et al., 2017): optimizes a K-means-related cost defined on the embedding space. We used the open-source code provided by the authors.

5. Infinite Mixture of Variational Autoencoders (IMVAE) (Abbasnejad et al., 2017): searches the infinite embedding space by using a Bayesian nonparametric prior.

6. Variational Autoencoder - nested Chinese Restaurant Process (VAE-nCRP) (Goyal et al., 2017): We used the open-source code provided by the authors.

7. Hierarchical K-means (HKM) (Nister & Stewenius, 2006): performs K-means (Lloyd, 1982) recursively in a top-down way.

8. Mixture of Hierarchical Gaussians (MOHG) (Vasconcelos & Lippman, 1999): infers the level-specific mixture of Gaussians.

9. Recursive Gaussian Mixture Model (RGMM): runs GMM recursively in a top-down manner.

10. Recursive Scalable Sparse Subspace Clustering by Orthogonal Matching Pursuit (RSSCOMP): performs SSCOMP (You et al., 2016) recursively for hierarchical clustering. SSCOMP is a well-known method for image clustering, and we used the open-source code.

4.2. Quantitative Analysis

We used two measures to evaluate the learned representations in terms of density estimation: 1) negative log-likelihood (NLL), and 2) reconstruction errors (REs). Autoencoder models, such as IDEC and DCN, were tested only on the REs. The NLL is estimated with 100 samples. Table 1 indicates that HCRL is best in the NLL and is competitive in the REs, which means that the hierarchically clustered embeddings preserve the intrinsic raw data structure.
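As background on the sample-based NLL estimate, the sketch below shows a standard importance-sampling estimate, $\log p(x) \approx \log\frac{1}{S}\sum_s \frac{p(x|z_s)\,p(z_s)}{q(z_s|x)}$, with $S$ samples from the encoder; the prior log-density is left abstract because in HCRL it is the hierarchical mixture, and this is an illustration rather than our evaluation code.

```python
import numpy as np
from scipy.special import logsumexp

def estimate_nll(x, encode, decode_logpdf, prior_logpdf, n_samples=100, rng=None):
    """Importance-sampling estimate of -log p(x) for a single data point.

    encode(x)            -> (mu, sigma) of a diagonal Gaussian q(z|x)
    decode_logpdf(x, z)  -> log p(x|z)
    prior_logpdf(z)      -> log p(z)  (e.g., a hierarchical Gaussian mixture)
    """
    rng = rng or np.random.default_rng()
    mu, sigma = encode(x)
    log_w = np.empty(n_samples)
    for s in range(n_samples):
        z = mu + sigma * rng.normal(size=mu.shape)
        log_q = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (z - mu)**2 / (2 * sigma**2))
        log_w[s] = decode_logpdf(x, z) + prior_logpdf(z) - log_q
    return -(logsumexp(log_w) - np.log(n_samples))

# toy usage with a standard-normal prior and a fixed Gaussian decoder
enc = lambda x: (np.zeros(2), np.ones(2))
dec = lambda x, z: float(np.sum(-0.5 * np.log(2 * np.pi) - (x - z)**2 / 2))
pri = lambda z: float(np.sum(-0.5 * np.log(2 * np.pi) - z**2 / 2))
print(estimate_nll(np.array([0.1, -0.2]), enc, dec, pri))
```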

² https://github.com/XifengGuo/IDEC (IDEC); https://github.com/boyangumn/DCN (DCN); https://github.com/prasoongoyal/bnp-vae (VAE-nCRP); http://vision.jhu.edu/code/ (SSC-OMP)


Table 1. Test set performance of the negative log-likelihood (NLL) and the reconstruction errors (REs). Replicated ten times, and the best in bold. † p < 0.05 (Student's t-test). Model-L# means that the model was trained with a #-depth hierarchy.

| Model | MNIST NLL | MNIST REs | CIFAR-100 NLL | CIFAR-100 REs | RCV1 v2 NLL | RCV1 v2 REs | 20Newsgroups NLL | 20Newsgroups REs |
|---|---|---|---|---|---|---|---|---|
| VAE | 230.71 | 10.46 | 1960.06 | 57.54 | 2559.46 | 1434.59 | 2735.80 | 1788.22 |
| VaDE | 217.20 | 10.35 | 1921.85 | 53.60 | 2558.32 | 1426.38 | 2733.46 | 1782.86 |
| IDEC | N/A | 12.75 | N/A | 64.09 | N/A | 1376.26 | N/A | 1660.61† |
| DCN | N/A | 11.30 | N/A | 44.26 | N/A | 1361.98 | N/A | 1691.17 |
| IMVAE | 296.57 | 10.69 | 1992.83 | 40.45† | 2566.01 | 1387.02 | 2722.81 | 1718.08 |
| VAE-nCRP-L3 | 718.78 | 32.67 | 2969.62 | 198.66 | 2642.88 | 1538.42 | 2712.28 | 1680.56 |
| VAE-nCRP-L4 | 721.00 | 32.53 | 2950.73 | 198.97 | 2646.48 | 1542.81 | 2713.58 | 1680.71 |
| HCRL1-L3 | 209.59† | 9.28† | 1864.69† | 55.12 | 2562.79 | 1418.30 | 2732.10 | 1792.13 |
| HCRL1-L4 | 212.31† | 8.31† | 1860.22† | 55.56 | 2555.84 | 1404.23 | 2727.49 | 1754.94 |
| HCRL2-L3 | 203.24† | 8.70† | 1843.40† | 50.44 | 2554.50† | 1395.05 | 2726.75 | 1828.71 |
| HCRL2-L4 | 203.91† | 8.16† | 1849.13† | 50.47 | 2535.43† | 1353.34 | 2702.88 | 1711.30 |

Table 2. Hierarchical clustering accuracies with F-scores, on CIFAR-100 with a depth of three, RCV1 v2 with a depth of four, and 20Newsgroups with a depth of four. Replicated ten times, with a 95% confidence interval. Best in bold.

| Model | CIFAR-100 | RCV1 v2 | 20Newsgroups |
|---|---|---|---|
| HKM | 0.162±0.008 | 0.256±0.068 | 0.410±0.043 |
| MOHG | 0.085±0.038 | 0.103±0.014 | 0.040±0.012 |
| RGMM | 0.169±0.012 | 0.274±0.052 | 0.435±0.037 |
| RSSCOMP | 0.146±0.023 | 0.266±0.055 | 0.295±0.047 |
| VAE-nCRP | 0.201±0.008 | 0.413±0.024 | 0.558±0.027 |
| VaDE+HKM | 0.164±0.012 | 0.331±0.066 | 0.485±0.056 |
| VaDE+MOHG | 0.166±0.016 | 0.423±0.093 | 0.492±0.071 |
| VaDE+RGMM | 0.181±0.013 | 0.386±0.062 | 0.410±0.065 |
| VaDE+RSSCOMP | 0.192±0.021 | 0.272±0.044 | 0.291±0.043 |
| HCRL1 | 0.199±0.016 | 0.437±0.029 | 0.566±0.048 |
| HCRL2 | 0.225±0.014 | 0.455±0.030 | 0.601±0.097 |

VaDE generally performed better than VAE, whereas the other flat clustered representation learning models tended to differ slightly across datasets. HCRL1 and HCRL2 showed better results with a deeper hierarchy of level four than of level three, which implies that capturing a deeper hierarchical structure is likely to be useful for density estimation, and HCRL2 in particular showed overall competitive performance.

Additionally, we evaluated hierarchical clustering accuracies by following Xie et al. (2016), except for MNIST, which is flat-structured. Table 2 points out that HCRL2 has better micro-averaged F-scores compared with every baseline. HCRL2 is able to reproduce the ground-truth hierarchical structure of the data, and this trend is consistent when HCRL2 is compared with the pipelined models, such as VaDE followed by a clustering model. The result of the comparisons with the clustering models, such as HKM, MOHG, RGMM, and RSSCOMP, is interesting because it experimentally shows that the joint optimization of hierarchical clustering in the embedding space improves hierarchical clustering accuracies. HCRL2 also presented better hierarchical accuracies than VAE-nCRP. We conjecture that the reasons lie in the modeling aspects of VAE-nCRP: 1) the simplified prior modeling of the mixture-component variances as mere constants, and 2) the non-flexible learning of the internal components.

The performance gain of HCRL2 compared to HCRL1 arises from the detailed modeling of the level proportion. The prior assumption that the level proportion is shared by all data may give rise to an optimization biased towards the learning of leaf components. Specifically, much of the data would be generated from the leaf components with high probability, since the leaf components have small variance, which causes the global level proportion to concentrate probability mass on the leaf level. This mechanism accelerates the biased optimization towards the leaf components; HCRL2, on the other hand, allows flexible learning of the level proportions.

4.3. Qualitative Analysis

MNIST: In Figure 1, the digits {4, 7, 9} and the digits {3, 8} are grouped together with a clear hierarchy, which was consistent between HCRL2 and VaDE. Also, some digits with a round form, {0, 4, 2}, are grouped together in HCRL2. In addition, among the reconstructed digits from the hierarchical mixture components, the digits generated from the root have blended shapes from 0 to 9, which is natural considering the root position.

CIFAR-100: Figure 5 shows the hierarchical clustering results on CIFAR-100 inferred by HCRL2. Given that there were no semantic inputs from the data, color was dominantly reflected in the clustering criteria. However, if one observes the second hierarchy level, the scene images of the same sub-hierarchy are semantically consistent, although the background colors are slightly different.


Figure 5. Example extracted sub-hierarchies on CIFAR-100

Figure 6. Comparison of embeddings on RCV1 v2 for (a) VAE (Kingma & Welling, 2014), (b) VaDE (Jiang et al., 2017), (c) VAE-nCRP (Goyal et al., 2017), (d) HCRL1, and (e) HCRL2, plotted using t-SNE (Maaten & Hinton, 2008). We mark the mean of a mixture component with a numbered square, colored in red for VaDE, and in red (root), green (internal), and blue (leaf) for VAE-nCRP, HCRL1, and HCRL2. The first-level sub-hierarchies are indicated with four colors.

Figure 7. Example extracted sub-hierarchies on 20Newsgroups


RCV1 v2: Figure 6 shows the embeddings of RCV1 v2. VAE and VaDE show no hierarchy, and close sub-hierarchies are embedded far apart. Since flat clustered representation learning focuses on isolating clusters from each other, the distances between different clusters tend to be uniformly distributed. VAE-nCRP guides the internal mixture components to be agglomerated at the center, and the cause of the agglomeration is the generative process of VAE-nCRP, where the parameters of the internal components are inferred without direct information from the data.

HCRL1 and HCRL2 show a relatively clear separation between the sub-hierarchies without the agglomeration. However, HCRL2 is significantly superior to HCRL1 in terms of learning the hierarchically clustered embeddings. In HCRL1, distant embeddings are learned even though they belong to the same sub-hierarchy.

20Newsgroups: Figure 7 shows example sub-hierarchies on 20Newsgroups. We enumerated topic words from the documents with the top-five likelihoods for each cluster, and we filtered the words by tf-idf values. We observe relatively more general contents in the internal clusters than in the leaf clusters of each internal cluster.

5. Conclusion

In this paper, we have presented a hierarchically clustered representation learning framework for hierarchical mixture density estimation on deep embeddings. HCRL aims at encoding the relations among clusters as well as among instances to preserve the internal hierarchical structure of the data. We have introduced two models, HCRL1 and HCRL2, whose main differentiating features are 1) the crucial assumption that the internal mixture components also have the ability to generate data directly, and 2) the level selection modeling. HCRL2 improves on HCRL1 by inferring the data-specific level proportion through the unbalanced autoencoding neural architecture. From the modeling and the evaluation, we found that our proposed models achieve improvements due to their highly flexible modeling compared with the baselines.


Acknowledgements

References

Abbasnejad, M. E., Dick, A., and van den Hengel, A. Infinite variational autoencoder for semi-supervised learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 781-790. IEEE, 2017.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022, 2003.

Chu, W. and Cai, D. Stacked similarity-aware autoencoders. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1561-1567. AAAI Press, 2017.

Goyal, P., Hu, Z., Liang, X., Wang, C., and Xing, E. Nonparametric variational auto-encoders for hierarchical representation learning. arXiv preprint arXiv:1703.07027, 2017.

Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., and Blei, D. M. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, pp. 17-24, 2004.

Guo, X., Gao, L., Liu, X., and Yin, J. Improved deep embedded clustering with local structure preservation. In International Joint Conference on Artificial Intelligence (IJCAI-17), pp. 1753-1759, 2017.

Hennig, P., Stern, D., Herbrich, R., and Graepel, T. Kernel topic models. In Artificial Intelligence and Statistics, pp. 511-519, 2012.

Huang, P., Huang, Y., Wang, W., and Wang, L. Deep embedding network for clustering. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pp. 1532-1537. IEEE, 2014.

Ji, Z., Huang, Y., Sun, Q., and Cao, G. A spatially constrained generative asymmetric Gaussian mixture model for image segmentation. Journal of Visual Communication and Image Representation, 40:611-626, 2016.

Jiang, Z., Zheng, Y., Tan, H., Tang, B., and Zhou, H. Variational deep embedding: An unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1965-1972. AAAI Press, 2017.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014), April 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Lang, K. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pp. 331-339. Elsevier, 1995.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361-397, 2004.

Lloyd, S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.

MacKay, D. J. Choice of basis for Laplace approximation. Machine Learning, 33(1):77-86, 1998.

Mimno, D., Li, W., and McCallum, A. Mixtures of hierarchical topics with Pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning, pp. 633-640. ACM, 2007.

Nalisnick, E., Hertel, L., and Smyth, P. Approximate inference for deep latent Gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, volume 2, 2016.

Nister, D. and Stewenius, H. Scalable recognition with a vocabulary tree. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pp. 2161-2168. IEEE, 2006.

Pitman, J. et al. Combinatorial stochastic processes. Technical Report 621, Dept. Statistics, UC Berkeley, 2002. Lecture notes for St. Flour course.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.


Srivastava, A. and Sutton, C. Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488, 2017.

Vasconcelos, N. and Lippman, A. Learning mixture hierarchies. In Advances in Neural Information Processing Systems, pp. 606-612, 1999.

Wang, C. and Blei, D. M. Variational inference for the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, pp. 1990-1998, 2009.

Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pp. 478-487, 2016.

Yang, B., Fu, X., Sidiropoulos, N. D., and Hong, M. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In International Conference on Machine Learning, pp. 3861-3870, 2017.

You, C., Robinson, D., and Vidal, R. Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3918-3927, 2016.

Zhang, H., Jin, X., Wu, Q. J., Wang, Y., He, Z., and Yang, Y. Automatic visual detection system of railway surface defects with curvature filter and improved Gaussian mixture model. IEEE Transactions on Instrumentation and Measurement, 67(7):1593-1608, 2018.