Non-negative Matrix Factorizations for Clustering: A Survey

Tao Li, Florida International University
Chris Ding, University of Texas at Arlington


Chapter 1
Non-negative Matrix Factorizations for Clustering: A Survey

Abstract

Recently there has been significant development in the use of non-negative matrix factorization (NMF) methods for various clustering tasks. NMF factorizes an input nonnegative matrix into two nonnegative matrices of lower rank. Although NMF can be used for conventional data analysis, the recent overwhelming interest in NMF is due to the newly discovered ability of NMF to solve challenging data mining and machine learning problems. In particular, NMF with the sum of squared error cost function is equivalent to a relaxed K-means clustering, the most widely used unsupervised learning algorithm. In addition, NMF with the I-divergence cost function is equivalent to probabilistic latent semantic indexing, another unsupervised learning method popularly used in text analysis. Many other data mining and machine learning problems can be reformulated as an NMF problem. This chapter aims to provide a comprehensive review of non-negative matrix factorization methods for clustering. In particular, we outline the theoretical foundations on NMF for clustering, provide an overview of different variants on NMF formulations, and examine several practical issues in NMF algorithms. We also summarize recent advances on using NMF-based methods for solving many other clustering problems including co-clustering, semi-supervised clustering, and consensus clustering, and discuss some future research directions.


Contents

1 Non-negative Matrix Factorizations for Clustering: A Survey
  1.1 NMF Introduction
    1.1.1 Background
    1.1.2 NMF Formulations
  1.2 NMF for Clustering: Theoretical Foundations
    1.2.1 NMF and K-means Clustering
    1.2.2 NMF and Probabilistic Latent Semantic Indexing
    1.2.3 NMF and Kernel K-means and Spectral Clustering
    1.2.4 NMF Boundedness Theorem
  1.3 NMF Clustering Capabilities
    1.3.1 Examples
    1.3.2 Analysis
  1.4 NMF Algorithms
    1.4.1 Introduction
    1.4.2 Algorithm Development
    1.4.3 Practical Issues in NMF Algorithms
      1.4.3.1 Initialization
      1.4.3.2 Stopping Criteria
      1.4.3.3 Objective Function vs. Clustering Performance
      1.4.3.4 Scalability
  1.5 NMF Related Factorizations
  1.6 NMF for Clustering: Extensions
    1.6.1 Co-Clustering
    1.6.2 Semi-Supervised Clustering
    1.6.3 Semi-Supervised Co-Clustering
    1.6.4 Consensus Clustering
    1.6.5 Graph Clustering
    1.6.6 Other Clustering Extensions
  1.7 Conclusions

Bibliography


1.1 NMF Introduction

1.1.1 Background

Originally proposed for parts-of-whole interpretation of matrix factors, NMF has attracted a lot of attention recently and has been shown to be useful in a variety of applied settings, including environmetrics [89], chemometrics [129], pattern recognition [71], multimedia data analysis [26], text mining [91, 131], document summarization [90, 119], DNA gene expression analysis [12], financial data analysis [42], and social network analysis [20, 124, 139]. NMF can be traced back to the 1970s (notes from G. Golub) and was studied extensively by Paatero [89]. The work of Lee and Seung [68, 69] brought much attention to NMF in the machine learning and data mining fields. Algorithmic extensions of NMF have been developed to accommodate a variety of objective functions [28] and a variety of data analysis problems, including classification [98] and collaborative filtering [106]. A number of studies have focused on further developing computational methodologies for NMF. In addition, various extensions and variations of NMF, as well as a proof that NMF is NP-hard, have been proposed recently [8, 33, 34, 37, 55, 98, 111, 113, 136].

The true power of NMF, however, is NMF's ability to solve challenging data mining and pattern recognition problems. In fact, it has been shown [32, 33] that NMF with the sum of squared error cost function is equivalent to a relaxed K-means clustering, the most widely used unsupervised learning algorithm. In addition, NMF with the I-divergence cost function is equivalent to probabilistic latent semantic indexing [34, 38, 47], another unsupervised learning method popularly used in text analysis. Thanks to this clustering capability, NMF has attracted a lot of recent attention in data mining.

In this chapter, we provide a comprehensive review of non-negative matrix factorization methods for clustering. To appeal to a broader audience in the data mining community, our review focuses more on conceptual formulation and interpretation than on detailed mathematical derivations. The reader should be cautioned, however, that NMF is such a large research area that truly comprehensive surveys are almost impossible, and thus our overview may be a little eclectic. An interested reader is encouraged to consult other papers for further reading. The reader should also be cautioned that in our presentation many mathematical descriptions are modified so that they adapt to data mining problems.

The rest of the chapter is organized as follows: Section 1.1.2 introduces the basic formulations of NMF; Section 1.2 outlines the theoretical foundations on NMF for clustering and presents the equivalence results between NMF and various clustering methodologies; Section 1.3 demonstrates the NMF clustering capabilities and analyzes the advantages of NMF in clustering analysis; Section 1.4 provides an outline of NMF algorithm development and discusses several practical issues in NMF algorithms; Section 1.5 provides an overview of many different NMF variants; and Section 1.6 summarizes recent advances on using NMF-based methods for solving many other clustering problems including co-clustering, semi-supervised clustering, and consensus clustering. Finally, Section 1.7 concludes the chapter and discusses future research directions.

1.1.2 NMF Formulations

Let the input data matrix X = (x_1, · · · , x_n) contain the collection of n data column vectors. Generally, we factorize X into two matrices,

X ≈ FG^T, (1.1)

where X ∈ R^{p×n}, F ∈ R^{p×k} and G ∈ R^{n×k}. Generally, p < n and the rank of the matrices F, G is much lower than the rank of X, i.e., k ≪ min(p, n). F, G are obtained by minimizing a cost function. The most common cost function is the sum of squared errors,

min_{F,G≥0} J_{SSE} = ‖X − FG^T‖^2. (1.2)

In this chapter, the matrix norm is implicitly assumed to be the Frobenius norm. A rank non-deficiency condition is assumed for F, G.

Another cost function is the so-called I-divergence:

min_{F,G≥0} J_{ID} = ∑_{i=1}^{m} ∑_{j=1}^{n} [ X_{ij} log( X_{ij} / (FG^T)_{ij} ) − X_{ij} + (FG^T)_{ij} ]. (1.3)

It’s easy to show that the inequality I(x) = x logx− x+1≥ 0 holds when x ≥ 0; theequality holds when x = 1. The quantity I(u,v) = (u/v) log(u/v)−u/v+1 is calledI-divergence,

1.2 NMF for Clustering: Theoretical Foundations

Although NMF can be used for conventional data analysis [89], the overwhelming interest in NMF is due to the newly discovered ability of NMF to solve challenging clustering problems [32, 33]. Here we outline the recent results on (1) the relationship between NMF with the least squares objective and K-means clustering; (2) the relationship between NMF using I-divergence and PLSI; and (3) the relationship between NMF and spectral clustering. These results establish the theoretical foundations for NMF to solve unsupervised learning problems. We also present the boundedness theorem, which offers the theoretical foundation for the normalization of factor matrices.

1.2.1 NMF and K-means Clustering

The K-means clustering algorithm is one of the most widely used data clustering methods. Let X = (x_1, · · · , x_n) be n data points. We partition them into K mutually disjoint clusters. The K-means clustering objective can be written as

J_{Kmeans} = ∑_{i=1}^{n} min_{1≤k≤K} ‖x_i − f_k‖^2 = ∑_{k=1}^{K} ∑_{i∈C_k} ‖x_i − f_k‖^2.

The following theorem shows that NMF is inherently related to the K-means clustering algorithm [32].

Theorem 1 G-orthogonal NMF,

min_{F≥0, G≥0} ‖X − FG^T‖^2, s.t. G^T G = I, (1.4)

is equivalent to K-means clustering. This holds even if X and F have mixed-sign entries.

We can understand this relationship in the following way [32]. Let C = (c_1, · · · , c_k) be the cluster centroids obtained via K-means clustering. Let H be the cluster indicator, i.e., h_{ik} = 1 if x_i belongs to cluster c_k and h_{ik} = 0 otherwise. We can write the K-means clustering objective as J = ∑_{i=1}^{n} ∑_{k=1}^{K} h_{ik} ‖x_i − c_k‖^2 = ‖X − CH^T‖^2. From this analogy, in NMF F has the meaning of cluster centroids and G is the cluster indicator [32]. Thus K-means and NMF have the same objective function but with different constraints.

Indeed, the K-means objective function can be expressed in such a way: if we ignore the nonnegativity constraint while keeping the orthogonality constraint, the principal component is the solution [36, 138]. On the other hand, if we ignore the orthogonality while keeping the nonnegativity, NMF is the solution.
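This identity is easy to verify numerically. The sketch below (illustrative only; NumPy assumed) builds a hard cluster indicator H and a centroid matrix C from an arbitrary label vector and checks that the clustering objective equals ‖X − CH^T‖^2.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((4, 10))                                   # p = 4 features, n = 10 points (columns)
labels = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])         # a hard assignment into K = 3 clusters
K = 3

H = np.zeros((10, K))
H[np.arange(10), labels] = 1.0                            # n x K cluster indicator
C = np.stack([X[:, labels == k].mean(axis=1) for k in range(K)], axis=1)  # p x K centroids

j_kmeans = sum(np.sum((X[:, labels == k] - C[:, [k]]) ** 2) for k in range(K))
j_matrix = np.sum((X - C @ H.T) ** 2)
print(np.isclose(j_kmeans, j_matrix))                     # True: the two forms agree
```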

1.2.2 NMF and Probabilistic Latent Semantic Indexing

Probabilistic Latent Semantic Indexing (PLSI) is an unsupervised learning method based on statistical latent class models and has been successfully applied to document clustering [54]. (PLSI was further developed into the more comprehensive Latent Dirichlet Allocation model [10].) PLSI maximizes the likelihood

J_{PLSI} = ∑_{i=1}^{m} ∑_{j=1}^{n} X(w_i, d_j) log P(w_i, d_j), (1.5)

where the joint occurrence probability is factorized (i.e., parameterized or approximated) as

P(w_i, d_j) = ∑_k P(w_i, d_j | z_k) P(z_k) = ∑_k P(w_i | z_k) P(d_j | z_k) P(z_k), (1.6)

assuming that w_i and d_j are conditionally independent given z_k. The following theorem shows the equivalence between NMF and PLSI [34, 38]:

Theorem 2 PLSI is equivalent to NMF with the I-divergence objective. (A) The objective function of PLSI is identical to the objective function of NMF using the I-divergence. (B) We can express FG^T = F̃DG̃^T, where F̃, D, G̃ satisfy the probability normalization ∑_{i=1}^{m} F̃_{ik} = 1, ∑_{j=1}^{n} G̃_{jk} = 1, ∑_{k=1}^{K} D_{kk} = 1.


Therefore, the NMF update algorithm and the EM algorithm in training PLSI are alternative methods to optimize the same objective function [34]. The relationships between NMF and PLSI have also been studied in [47].

1.2.3 NMF and Kernel K-means and Spectral Clustering

For a square symmetric matrix W, we would expect a W ≈ HH^T type decomposition. The following theorem points out its usefulness for data clustering [32].

Theorem 3 Orthogonal symmetric NMF

min_{H≥0} ‖W − HH^T‖^2, s.t. H^T H = I, (1.7)

is equivalent to Kernel K-means clustering.

It has also been shown that symmetric matrix factorization, a special case of non-negative matrix factorization, is equivalent to normalized cut spectral clustering [101]. Given the adjacency matrix W of a graph, it can be easily seen that the following matrix factorization

min_{H^T H = I, H≥0} ‖W̃ − HH^T‖^2, (1.8)

where W̃ = D^{−1/2} W D^{−1/2}, D = diag(d_1, · · · , d_n), d_i = ∑_j w_{ij},

is equivalent to Normalized Cut spectral clustering [32, 118].
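Forming the normalized matrix W̃ of Eq. (1.8) from an adjacency matrix is straightforward; a minimal sketch (NumPy, toy graph):

```python
import numpy as np

def normalized_affinity(W):
    """W_tilde = D^{-1/2} W D^{-1/2}, with D = diag(row sums of W)."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))      # guard against isolated nodes
    return W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

W = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
print(np.round(normalized_affinity(W), 3))
```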

1.2.4 NMF Boundedness Theorem

NMF differs from SVD (Singular Value Decomposition) due to the absence of cancellation of plus and minus signs. But what is the fundamental significance of this absence of cancellation? It is the Boundedness Property [141, 142]. The boundedness theorem offers the theoretical foundation for the normalization of F and G in X = FG^T. A matrix A is bounded if 0 ≤ A_{ij} ≤ 1. Note that any nonnegative input matrix can be rescaled into the bounded form.

The boundedness property of NMF states: if X is bounded, then the factor matrices F, G must also be bounded. More precisely,

Theorem 4 (Boundedness Theorem) Let 0 ≤ X ≤ 1 be the input data matrix, and let F, G be nonnegative matrices satisfying X = FG^T. There exists a diagonal matrix D such that when we rescale X = FG^T = (FD)(GD^{−1})^T = F^*(G^*)^T, the rescaled matrices satisfy 0 ≤ F^*_{ij}, G^*_{ij} ≤ 1. D is constructed as D = D_F^{−1/2} D_G^{1/2}, where D_F = diag(f_1, · · · , f_k), f_k = max_p F_{pk}, and D_G = diag(g_1, · · · , g_k), g_k = max_p G_{pk}. This also holds when X is symmetric, i.e., X = HH^T with 0 ≤ H_{ij} ≤ 1.


This theorem assures the existence of an appropriate scale such that both F and G are bounded, i.e., their elements cannot exceed the magnitude of the input data matrix. We note that the SVD decomposition does not have the boundedness property: even if the input data are in the range 0 ≤ X_{ij} ≤ 1, we cannot guarantee that |U^*_{ij}| ≤ 1 and |V^*_{ij}| ≤ 1 for all i, j.
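The rescaling in Theorem 4 can be carried out directly from the column maxima of F and G. The sketch below (an illustration of the construction described above, not the authors' code) checks that the product is unchanged and that the rescaled factors stay within [0, 1] when X = FG^T is bounded by 1.

```python
import numpy as np

rng = np.random.default_rng(2)
F = rng.random((8, 3))
G = rng.random((6, 3))
F = F / np.max(F @ G.T)          # force X = F G^T to satisfy 0 <= X <= 1
X = F @ G.T

f = F.max(axis=0)                # column maxima of F
g = G.max(axis=0)                # column maxima of G
d = np.sqrt(g / f)               # diagonal of D = D_F^{-1/2} D_G^{1/2}

F_star = F * d                   # F* = F D
G_star = G / d                   # G* = G D^{-1}

assert np.allclose(F_star @ G_star.T, X)                      # the product is unchanged
print(F_star.max() <= 1 + 1e-12, G_star.max() <= 1 + 1e-12)   # both factors are bounded
```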

1.3 NMF Clustering Capabilities

1.3.1 Examples

We demonstrate the clustering capabilities of NMF using several examples. Figures 1.1 and 1.2 present the computed F factors (images) on two image datasets: the ORL face image dataset and the Digits image dataset. Here each image is represented as a vector of pixel gray values. Note that by Theorem 1, these F factors are cluster centroids (representatives of clusters). Intuitively, these images are representatives of clusters of the original images. The examples show that NMF provides a holistic view of the datasets.

FIGURE 1.1: Left: ORL face image dataset, containing 400 images of 40 persons. Middle/Right: Computed NMF factors F = (f_1, · · · , f_16) for 2 runs with random initialization.

1.3.2 Analysis

In general, the advantages of NMF over existing unsupervised learning methods can be summarized as follows. NMF can model widely varying data distributions due to the flexibility of matrix factorization, as compared to the rigid spherical clusters that the K-means clustering objective function attempts to capture. When the data distribution is far from spherical, NMF may have advantages. Another advantage of NMF is that it can do both hard and soft clustering simultaneously.


FIGURE 1.2: Left: Digits image dataset. Middle/Right: Computed NMF factors F = (f_1, · · · , f_16) for 2 runs with random initialization.

FIGURE 1.3: Left subfigure: A 2D dataset of 38 data points. Right subfigure: Their G = (g_1, g_2) values are shown as blue and red curves. Points in regions {A, B} are numbered 1-30, where the g_1 and g_2 values differ significantly, indicating that these points are uniquely clustered into either cluster A or cluster B (hard clustering). Points in regions {C, D, E} are numbered 31-38, where the g_1 and g_2 values are close, indicating that these points are fractionally clustered into clusters A and B (soft clustering).

Figure 1.3 gives an example. The 2D dataset in the left subfigure consists of 38 data points, and the G values are shown in the right subfigure. As seen in the figure, the G values for points in regions {C, D, E} indicate that they are fractionally assigned to different clusters. As a result, NMF is able to perform soft clustering.

The third advantage is that NMF is able to simultaneously cluster the rows (features) and the columns (data points) of an input data matrix. Consider the following NMF objective with orthonormal constraints on the factor matrices:

J_{orth} = ‖X − FG^T‖^2, s.t. F ≥ 0, G ≥ 0, F^T F = I, G^T G = I. (1.9)

The orthonormal constraints and the nonnegativity constraints in Eq. (1.9) make the resulting F and G approximate the K-means clustering results on both features and data points [32]. The fourth advantage is that many other data mining and machine learning problems can be reformulated as an NMF problem.


1.4 NMF Algorithms

1.4.1 Introduction

The algorithms for matrix factorizations are generally iterative updating procedures: updating one factor while fixing the other factors. Existing algorithms for NMF include the multiplicative updates of Lee and Seung [68, 69], alternating least squares (ALS) algorithms [8, 89], alternating non-negative least squares (ANLS) [58, 59], gradient descent methods [55, 99], projected gradient methods [80], and Newton-type algorithms [57, 135]. More details on NMF algorithms can be found in [8, 25, 105]. In this section, we provide an outline of the algorithm development for orthogonal NMF and also discuss several practical issues when applying NMF for clustering.

1.4.2 Algorithm Development

Here we provide an outline of the algorithm development for orthogonal NMF. We wish to solve the optimization problem

min_{F≥0} J(F) = ‖X − FG^T‖^2, s.t. F^T F = I, (1.10)

where X, G ≥ 0 are fixed. We show that, given an initial F, the updating algorithm

F_{ik} ← F_{ik} (XG)_{ik} / (FG^T X^T F)_{ik} (1.11)

correctly finds a local minimum of this problem. We can use the gradient-descent method, the current standard approach, to update F as F ← F − δ ∇̃_F J, where ∇̃_F J = ∇_F J − F(∇_F J)^T F is the natural gradient [45] that enforces the orthogonality F^T F = I (F on the Stiefel manifold). We obtain the updating rule

F_{ik} ← F_{ik} − δ(−XG + FG^T X^T F)_{ik}. (1.12)

Note that in Eq. (1.12), setting the stepsize δ = F_{ik}/(FG^T X^T F)_{ik}, we recover the updating rule Eq. (1.11).
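A minimal NumPy sketch of the update rule Eq. (1.11), with G held fixed, is given below (illustrative only; a practical implementation would add a convergence test, and a small constant is used here to avoid division by zero).

```python
import numpy as np

def update_F_orthogonal(X, F, G, n_iter=200, eps=1e-12):
    """Multiplicative update of Eq. (1.11) for min ||X - F G^T||^2, s.t. F^T F = I,
    with X, G >= 0 held fixed."""
    for _ in range(n_iter):
        F = F * (X @ G) / (F @ G.T @ X.T @ F + eps)
    return F

rng = np.random.default_rng(3)
X = rng.random((20, 12))
G = rng.random((12, 3))
F = update_F_orthogonal(X, rng.random((20, 3)), G)
print(np.round(F.T @ F, 2))   # inspect how close the columns of F are to orthonormal
```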

We can also show that the fixed point of the iteration in Eq. (1.11) satisfies the KKT (Karush-Kuhn-Tucker) condition from the theory of constrained optimization. We introduce the Lagrangian multipliers Λ ∈ ℜ^{K×K} and minimize the Lagrangian function L(F) = ‖X − FG^T‖^2 + Tr[Λ(F^T F − I)]. The KKT complementarity slackness condition for the nonnegativity of F_{ik} gives

(−2XG + 2FG^T G + 2FΛ)_{ik} F_{ik} = 0. (1.13)

This is the fixed point condition that any local minimum F^* must satisfy. From Eq. (1.13) we obtain the Lagrangian multiplier Λ = G^T X^T F^* − G^T G. Now, one can easily check that the converged solution of the update rule Eq. (1.11) also satisfies the fixed point relation Eq. (1.13) with Λ substituted in. This shows that the converged solution of the update rule Eq. (1.11) is a local minimum of the optimization problem.

The convergence of Eq. (1.11) is guaranteed by the following monotonicity theorem, which can be proved using the auxiliary function approach [39, 69].

Theorem 5 The Lagrangian function L is monotonically non-increasing under the update rule Eq. (1.11).

1.4.3 Practical Issues in NMF Algorithms

In the following we discuss several practical issues when applying NMF for clustering.

1.4.3.1 Initialization

Similar to K-means clustering, initialization plays an important role when applying NMF for clustering, since the objective function may have many local minima [52, 109]. The alternating minimization intrinsic to NMF algorithms is nonconvex, even though the objective function is convex with respect to one set of variables at a time.

A simple random initialization, where the factor matrices are initialized as random matrices, is generally not effective, as it often leads to slow convergence to a local minimum. Many different techniques have been developed for improving the random initialization:

• Multiple initializations: The core idea is to perform multiple runs using different random initializations and select the best estimate among the runs. This method needs to perform NMF multiple times and is computationally expensive.

• Factorization-based initialization: Note that NMF is a constrained low-rank matrix factorization, hence we can use the results from alternative low-rank factorization methods as the initialization [11]. Typical examples include SVD-based initialization [11] and CUR-decomposition-based initialization [65].

• Clustering-based initialization: If we think of NMF as a clustering process, we can seek an initialization strategy based on the results obtained from other clustering algorithms (e.g., spherical k-means [39, 128], divergence-based K-means [132], and fuzzy clustering [143]).

A comparison of different initialization methods can be found in [65]. Note that both factorization-based and clustering-based initialization methods can lead to rapid error reduction and faster convergence; a simple SVD-based seeding is sketched below.
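As a simple example of a factorization-based strategy, the sketch below seeds F and G from a truncated SVD with negative entries replaced by their absolute values (a simplified heuristic in the spirit of the SVD-based initialization of [11], not the exact published procedure).

```python
import numpy as np

def svd_init(X, k, floor=1e-8):
    """Seed nonnegative factors F0 (p x k) and G0 (n x k) from a rank-k truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    F0 = np.abs(U[:, :k]) * np.sqrt(s[:k])
    G0 = np.abs(Vt[:k, :].T) * np.sqrt(s[:k])
    return np.maximum(F0, floor), np.maximum(G0, floor)

rng = np.random.default_rng(4)
X = rng.random((30, 20))
F0, G0 = svd_init(X, 4)
print(np.linalg.norm(X - F0 @ G0.T) / np.linalg.norm(X))  # initial relative error
```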


1.4.3.2 Stopping Criteria

Beyond a predefined number of iterations or a fixed running time, there are simple heuristics that can be used as stopping criteria for the iterative NMF algorithms: (1) the objective function is reduced to below a given threshold; and (2) there is little change in the factor matrices or the objective function between successive iterations. Recently, Lin employed stopping conditions from bound-constrained optimization for the iterative procedures of NMF [80], and Kim and Park tested a combined convergence criterion using the Karush-Kuhn-Tucker (KKT) optimality conditions and the convergence of the positions of the largest elements in the factor matrices [58].

Based on Theorem 1, upon the completion of an NMF algorithm, G corresponds to the cluster indicator. Thus the clustering assignment can be determined by the largest element of each row in G. Clustering assignments can also be determined by finding a discrete clustering membership function that is close to the resulting G, using a strategy similar to that of [134].
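In code, this hard assignment is a one-line argmax over the rows of G (a sketch; G is the n × k factor produced by any of the algorithms above).

```python
import numpy as np

def cluster_assignments(G):
    """Assign each data point (row of the n x k factor G) to the cluster with the largest entry."""
    return np.argmax(G, axis=1)

G = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
print(cluster_assignments(G))   # -> [0 1 2]
```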

1.4.3.3 Objective Function vs. Clustering Performance

In clustering problems, when we have external structure information (i.e., the derived class labels of the data points), various measures such as purity, normalized mutual information (NMI), and accuracy [34, 52, 109] have been used to evaluate clustering performance. In practice, the clusters of a given dataset could have many different shapes and sizes. In addition, these clusters could overlap with each other and may not be easily identified and separated. As a result, it is difficult to effectively capture the cluster structures using a single (even the "best", if it exists) clustering objective function.

Hence there is generally a gap between the objective function and the clustering performance measure [35]. In many real applications, when the clustering objective function is optimized, the quality of clustering in terms of accuracy or NMI is usually improved. But the improvement holds only up to a certain point, and beyond that, further optimization of the objective function will not improve the clustering performance, as the objective function does not necessarily capture the cluster structure. It is also possible that as the objective function is optimized, the clustering performance (e.g., accuracy or NMI) degrades.

1.4.3.4 Scalability

Many prior efforts have been reported on scaling up matrix factorization methods through delicate algorithm designs. While efficient computation algorithms for matrix factorization have been developed, they have primarily been in the context of scientific/engineering applications where the sizes of matrices (tens of thousands by tens of thousands with millions of non-zero values) are typically much smaller than those for Web analysis (millions-by-millions matrices with billions of observations). Here we present several practical mechanisms for dealing with large-scale datasets for NMF:


• Shrinking scheme: The scheme first "coarsens" the original large-scale problem into a small one (level by level), then solves the coarsened optimization problem, and finally refines the solution to the coarsened problem to approximate the solution to the original problem. A typical example of the shrinking scheme is multi-level graph partitioning [30, 53, 56].

• Partitioning scheme: The scheme first divides the original problem into a series of small-scale problems, then solves those small problems, and finally combines their solutions to approximate the solution to the original problem. Typical examples of this scheme include blockwise principal component analysis [88], canopy-based clustering [86], and K-means pre-clustering [123].

• Online/Incremental scheme: The scheme does not require the input dataset to reside in memory and performs factorization in an incremental fashion (i.e., one sample or a chunk of samples per step). Examples of the scheme include incremental NMF [17], online NMF by dynamic updating of the factor matrices [14], online NMF using stochastic approximation [51, 121, 122], and evolutionary NMF [95, 126].

• Parallel and distributed implementation: An orthogonal direction for improving the scalability of matrix factorization methods is to use parallel and distributed computing platforms [7, 41, 81]. MapReduce is a programming model proposed for processing and generating large data sets [27]. Although the initial purpose of MapReduce was large-scale data processing, this model turns out to be much more expressive and has been widely used in many data mining and machine learning tasks [16, 22, 112]. Liu et al. [81] successfully scaled up basic NMF for web-scale dyadic data analysis using MapReduce, and Sun et al. [108] presented different matrix multiplication implementations and scaled up three types of nonnegative matrix factorizations on MapReduce. Recently, Graphics Processing Unit (GPU) implementations of NMF algorithms have also been reported [64, 85, 94].

1.5 NMF Related Factorizations

Here we provide an overview of related matrix factorization methods.

1. SVD: The classic matrix factorization is Principal Component Analysis (PCA), which uses the singular value decomposition [44, 49], X ≈ UΣV^T, where we allow U, V to have mixed signs; the input data could also have mixed signs. Absorbing Σ into U, we can write

SVD: X_± ≈ U_± V_±. (1.14)


2. NMF: When the input data is nonnegative, we restrict F and G to be nonnegative. The standard NMF can be written as

NMF: X_+ ≈ F_+ G_+^T, (1.15)

using an intuitive notation for X, F, G ≥ 0.

3. Regularized NMF: Additional constraints can be added to the standard NMF formulation. In general, regularized NMF can be written as the following optimization problem:

min_{F,G≥0} { ‖X − FG^T‖^2 + α J_1(F) + β J_2(G) }, (1.16)

where the functions J_1(F) and J_2(G) are penalty terms to enforce certain constraints, and α and β are the corresponding regularization parameters [92]. Many different penalty terms have been used in the literature to achieve different effects on the computed solutions, such as smoothness constraints (e.g., the matrix norm) [92], sparsity constraints (e.g., L1-norm regularization) [55, 58], geometric constraints (e.g., manifold structures) [13], and local learning regularizations based on neighborhoods [50].

4. Projective NMF: A projective NMF model aims to find an approximate projection of the input matrix [25, 133]. It can be formulated as follows:

X ≈ FF^T X. (1.17)

The idea of the projective NMF is similar to subspace clustering.

5. Semi-NMF: When the input data has mixed signs, we can restrict G to be nonnegative while placing no restriction on the signs of F. This is called Semi-NMF [33, 40]:

Semi-NMF: X_± ≈ F_± G_+^T. (1.18)

6. Convex-NMF: In general, the basis vectors F = (f_1, · · · , f_k) lie in a much larger space than the space spanned by the columns of X = (x_1, · · · , x_n), while, according to Theorem 1, F has the meaning of cluster centroids. To enforce this geometric meaning, we restrict F to be a convex combination of the input data points, i.e., F lies in the input space: f_l = w_{1l} x_1 + · · · + w_{nl} x_n = X w_l, or F = XW with w_{il} ≥ 0. We call this restricted form of factorization Convex-NMF [33, 40]. Convex-NMF applies to both nonnegative and mixed-sign input data:

X_± ≈ X_± W_+ G_+^T. (1.19)

Recently, Convex-Hull NMF has been proposed to extend Convex-NMF by imposing convexity on the columns of both F and G, leading to a factorization where each input data point is expressed as a convex combination of convex hull data points [110]. Convex-Hull NMF can be expressed as

X ≈ CG^T, (1.20)


where C consists of a set of appropriate points c_i ∈ conv(X), and conv(X) is the convex hull of X.

7. Cluster NMF: In Convex-NMF, we require the columns of F to be convex combinations of the input data. Suppose now that we interpret the entries of G as posterior cluster probabilities. In this case the cluster centroids can be computed as f_k = X g_k / n_k, or F = X G D_n^{−1}, where D_n = diag(n_1, · · · , n_k). The extra degree of freedom for F is not necessary. Therefore, the pair of desiderata, (1) F encodes centroids and (2) G encodes posterior probabilities, motivates a factorization X ≈ X G D_n^{−1} G^T. We can absorb D_n^{−1} into G and solve for

Cluster-NMF: X ≈ X G_+ G_+^T. (1.21)

We call this factorization Cluster-NMF because the degree of freedom in this factorization is the cluster indicator G, as in a standard clustering problem [33]. The objective function is J = ‖X − X G G^T‖^2.

8. Tri-Factorization: To simultaneously cluster the rows and the columns of the input data matrix X, we consider the following nonnegative 3-factor decomposition [39]:

X ≈ FSG^T. (1.22)

Note that S provides additional degrees of freedom such that the low-rank matrix representation remains accurate while F gives row clusters and G gives column clusters. More precisely, we solve

min_{F≥0, G≥0, S≥0} ‖X − FSG^T‖^2, s.t. F^T F = I, G^T G = I. (1.23)

This form gives a good framework for simultaneously clustering the rows and columns of X [29, 137].

An important special case is when the input X contains a matrix of pairwise similarities: X = X^T = W. In this case, F = G = H, and we call the factorization symmetric NMF, which optimizes

min_{H≥0, S≥0} ‖W − HSH^T‖^2, or min_{H≥0, S≥0} ‖W − HSH^T‖^2, s.t. H^T H = I.

9. Kernel NMF: Consider a mapping such as those used in support vector machines,

x_i → ϕ(x_i), or X → ϕ(X) = (ϕ(x_1), · · · , ϕ(x_n)).

Similar to the concept proposed for Convex-NMF, we restrict F to be a convex combination of the transformed input data points:

ϕ(X) ≈ ϕ(X) W G^T, (1.24)

rather than a standard NMF like ϕ(X) ≈ FG^T, which would be difficult since F, G would depend explicitly on the mapping function ϕ(·). It is easy to see that the minimization objective

‖ϕ(X) − ϕ(X)WG^T‖^2 = Tr[ ϕ(X)^T ϕ(X) − 2 G^T ϕ(X)^T ϕ(X) W + W^T ϕ(X)^T ϕ(X) W G^T G ], (1.25)

depends only on the kernel K = ϕ(X)^T ϕ(X). This kernel extension of NMF is similar to kernel PCA and kernel K-means.

10. Multi-layer NMF: In multi-layer NMF, the basic factor matrix is replaced by a set of cascaded matrices obtained from a sequential decomposition [23, 24, 31]. Given an input matrix X, it is first decomposed as F_(1) G_(1)^T. Then, in the second step, G_(1)^T is further decomposed as F_(2) G_(2)^T, and the process is repeated a number of times. The multi-layer NMF model can be written as follows:

X ≈ F_(1) F_(2) · · · F_(t) G_(t)^T = FG^T. (1.26)

It has been shown that multi-layer NMF can improve the performance of most NMF algorithms and alleviate the problem of local minima, thanks to its distributed, multi-stage nature and the sequential decomposition with different initial conditions [25].

11. Binary NMF: When the input data X is binary, binary NMF factorizes X into two binary matrices, thus conserving the most important integer property of X [72, 141, 142]. The binary NMF model can be written as

Binary NMF: X_{0-1} ≈ F_{0-1} G_{0-1}^T. (1.27)

A special case of Binary NMF is the boolean factorization [87]

X_{0-1} ≈ W_{0-1} ⊕ H_{0-1}.

12. Weighted Feature Subset NMF: In weighted NMF, weights are incorporated to indicate the importance of the corresponding rows and columns. If we only consider the feature importance, this leads to feature subset NMF (FS-NMF):

min_{W≥0, F≥0, G≥0} ‖X − FG^T‖^2_W, s.t. ∑_j W_j^α = 1, (1.28)

where W ∈ R_+^{m×m} is a diagonal matrix indicating the weights of the rows (keywords or features) in X, and α is a parameter [116, 117]. In general, we can also assign different weights to different samples. This leads to weighted FS-NMF:

min_{W≥0, F≥0, G≥0} ‖X − FG^T‖^2_W,

where we set W_{ij} = a_i b_j. This becomes

min_{W≥0, F≥0, G≥0} ∑_{i,j} (X − FG^T)^2_{ij} a_i b_j, s.t. ∑_i a_i^α = 1, ∑_j b_j^β = 1, (1.29)

where α, β are two parameters with 0 < α < 1 and 0 < β < 1.


13. Robust NMF: Recently, a robust NMF using the L_{2,1}-norm loss function was proposed in [62]. The error function is

‖X − FG^T‖_{2,1} = ∑_i √( ∑_j (X − FG^T)^2_{ij} ). (1.30)

The robust NMF formulation can handle outliers and noise better than standard NMF.

In summary, the various newly proposed NMF formulations are collectively summarized as follows:

SVD: X_± ≈ U_± V_±
NMF: X_+ ≈ F_+ G_+^T
Regularized NMF: min_{F,G≥0} { ‖X − FG^T‖^2 + α J_1(F) + β J_2(G) }
Projective NMF: X ≈ FF^T X
Semi-NMF: X_± ≈ F_± G_+^T
Convex-NMF: X_± ≈ X_± W_+ G_+^T
Convex-hull NMF: X ≈ CG^T, c_i ∈ conv(X)
Cluster-NMF: X ≈ X G_+ G_+^T
Kernel-NMF: ϕ(X_±) ≈ ϕ(X_±) W_+ G_+^T
Tri-Factorization: X_+ ≈ F_+ S_+ G_+^T
Symmetric-NMF: W_+ ≈ H_+ S_+ H_+^T
Multi-layer NMF: X ≈ F_(1) F_(2) · · · F_(t) G_(t)^T
Binary NMF: X_{0-1} ≈ F_{0-1} G_{0-1}^T
Weighted Feature Subset NMF: min ‖X − FG^T‖^2_W
Robust NMF: min ‖X − FG^T‖_{2,1}

Note that there are other NMF formulations that are not included in the above discussion, such as convolutive NMF for a set of non-negative matrices [25, 104] and Bayesian NMF, which incorporates Bayesian techniques into NMF [15, 96, 97].

1.6 NMF for Clustering: Extensions

We have shown that NMF provides a general framework for unsupervised learning. NMF can model widely varying data distributions and can do both hard and soft clustering simultaneously. In fact, many other clustering problems such as co-clustering, consensus clustering, semi-supervised clustering, and graph clustering can be reformulated as an NMF problem.


1.6.1 Co-Clustering

In many real world applications, a typical task often involves more than one type of data points, and the input data are association data relating the different types of data points. For example, in document analysis, there are terms and documents. In DNA microarray data, rows represent genes and columns represent samples. Co-clustering algorithms aim at clustering the different types of data simultaneously by making use of the dual relationship information such as the term-document matrix and the gene-sample matrix [21, 29, 83, 84, 137].

To simultaneously cluster the rows and the columns of the input data matrix X, Tri-factorization has been proposed as a 3-factor non-negative matrix decomposition [39], which aims to solve

min_{F≥0, G≥0, S≥0} ‖X − FSG^T‖^2, s.t. F^T F = I, G^T G = I. (1.31)

Note that S provides additional degrees of freedom such that the low-rank matrix representation remains accurate while F gives row clusters and G gives column clusters. Tri-factorization provides a nice co-clustering framework. Recently, a fast Tri-factorization extension has been proposed in [127] for large-scale data co-clustering by restricting the factor matrices to be cluster indicator matrices (a special type of nonnegative matrices).

1.6.2 Semi-Supervised Clustering

In many situations when we discover new patterns using clustering, there exists some prior, partial, or incomplete knowledge about the problem. We wish to incorporate this knowledge into the clustering algorithm. Semi-supervised clustering refers to the situation where the clustering is done with many pre-specified must-link constraints (two data points must be clustered into the same cluster) and cannot-link constraints (two data points cannot be clustered into the same cluster) [5, 6, 9, 60, 115, 130].

Specifically, the above constraints are formulated as follows [115]: (1) Must-link constraints. A = {(i_1, j_1), · · · , (i_a, j_a)}, a = |A|, contains pairs of data points, where x_{i_1}, x_{j_1} are considered similar and must be clustered into the same cluster. (2) Cannot-link constraints. B = {(i_1, j_1), · · · , (i_b, j_b)}, b = |B|, where each pair of points are considered dissimilar and cannot be clustered into the same cluster. A, B can also be viewed as symmetric matrices with {0,1} entries. Using the cluster indicator H, the must-link pair (i_1, j_1) implies that x_{i_1}, x_{j_1} should have significant nonzero posterior probability in the same cluster k, i.e., the overlap ∑_{k=1}^{K} h_{i_1 k} h_{j_1 k} = (HH^T)_{i_1 j_1} should be maximized. Thus the must-link condition is max_H ∑_{(ij)∈A} (HH^T)_{ij} = ∑_{ij} A_{ij}(HH^T)_{ij} = Tr H^T AH. Similarly, the cannot-link constraints can be formulated as min_H Tr H^T BH. Putting these constraint conditions together, the semi-supervised clustering problem can be cast as the following optimization problem

max_{H^T H = I, H ≥ 0} Tr[ H^T WH + α H^T AH − β H^T BH ], (1.32)


where parameter α controls the weight of the must-link constraints in A and β controls the weight of the cannot-link constraints in B. The weights α, β allow a certain level of uncertainty, so that the must-link and cannot-link constraints are not necessarily always rigorously enforced.

Let W^+ = W + αA ≥ 0 and W^− = βB ≥ 0. The semi-supervised clustering problem can be reformulated as an NMF problem [74]:

min_{H^T H = I, H≥0} ‖(W^+ − W^−) − HH^T‖^2. (1.33)

Thus the semi-supervised clustering problem is equivalent to a semi-NMF problem [33, 73].
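Assembling the constrained similarity W^+ − W^− that enters Eq. (1.33) from pair lists is simple; a sketch (the variable names are ours; W is any n × n similarity matrix):

```python
import numpy as np

def constrained_similarity(W, must_link, cannot_link, alpha=1.0, beta=1.0):
    """Build W^+ - W^- = (W + alpha*A) - beta*B from must-link / cannot-link pair lists."""
    n = W.shape[0]
    A = np.zeros((n, n))
    B = np.zeros((n, n))
    for i, j in must_link:
        A[i, j] = A[j, i] = 1.0
    for i, j in cannot_link:
        B[i, j] = B[j, i] = 1.0
    return (W + alpha * A) - beta * B

rng = np.random.default_rng(8)
W = rng.random((6, 6)); W = (W + W.T) / 2                 # a symmetric similarity matrix
print(constrained_similarity(W, must_link=[(0, 1)], cannot_link=[(2, 3)]))
```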

Chen et al. formulated semi-supervised clustering with instance-level constraints using symmetric non-negative tri-factorization [18]. Zhu et al. [144] noted that must-link and cannot-link constraints play different roles in clustering and proposed a constrained NMF method where must-link constraints are used to control the distance of the data in the compressed form, and cannot-link constraints are used to control the encoding factor.

There are also some other research efforts on incorporating the (partial) class label information into the matrix factorization framework. For example, Lee et al. [70] presented semi-supervised NMF (SSNMF), which jointly incorporates the data matrix and the (partial) class label matrix into NMF. Liu and Wu [82] proposed a form of constrained NMF by requiring that data points sharing the same label have the same coordinate in the new representation space.

1.6.3 Semi-Supervised Co-Clustering

In co-clustering of two types of objects, sometimes we have partial knowledge on the x-type objects and also partial knowledge on the y-type objects. Semi-supervised co-clustering aims to incorporate this knowledge into co-clustering. As in Section 1.6.2, we can formulate the partial knowledge as must-link and cannot-link constraints on both x-type and y-type objects. Let A_x contain the must-link pairs for x-type objects (A_y for y-type objects), and B_x contain the cannot-link pairs for x-type objects (B_y for y-type objects). Then the semi-supervised co-clustering problem can be formulated as

min_{F≥0, G≥0} J = ‖X − FSG^T‖^2 + Tr[ a F^T (A_x − B_x) F + b G^T (A_y − B_y) G ], (1.34)

where a, b are parameters that control the effects of the different types of constraints [125]. Another semi-supervised co-clustering method has been proposed in [19] using symmetric non-negative tri-factorization.

Recently, Li et al. [75, 79] proposed several constrained non-negative tri-factorization knowledge transformation methods that use partial knowledge (such as instance-level constraints and partial class label information) from one type of objects (e.g., words) to improve the clustering of another type of objects (e.g., documents). Their models bring together semi-supervised clustering/co-clustering and learning from labeled features [43, 102, 103].


1.6.4 Consensus Clustering

Consensus clustering, also called aggregation of clusterings, refers to the situation where a number of clustering results on the same dataset have already been obtained, and the task is to find a clustering which is closest to those already obtained clusterings [46, 48, 76, 107].

Formally, let X = {x_1, x_2, · · · , x_n} be a set of n data points. Suppose we are given a set of T clusterings (or partitionings) P = {P_1, P_2, · · · , P_T} of X. Note that the number of clusters could be different for different clusterings. Let us define the connectivity matrix M_{ij}(P_t) for a single partition P_t as

M_{ij}(P_t) = 1 if (i, j) belong to the same cluster, and 0 otherwise. (1.35)

Consensus clustering looks for a consensus partition (consensus clustering) P^* which is the closest to all the given partitions:

min_{P^*} J = (1/T) ∑_{t=1}^{T} ∑_{i,j=1}^{n} [ M_{ij}(P_t) − M_{ij}(P^*) ]^2 = (1/T) ∑_{t=1}^{T} ‖M(P_t) − M(P^*)‖_F^2.

Let the average association between i and j be M̄_{ij} = (1/T) ∑_{t=1}^{T} M_{ij}(P_t). We have

J = (1/T) ∑_{t=1}^{T} ‖M(P_t) − M̄‖_F^2 + ‖M̄ − M(P^*)‖_F^2.

The first term is a constant which measures the average difference of the given partitions from the average association M̄. The smaller this term is, the closer to each other the partitions are.

We therefore minimize the second term. The optimal clustering solution P^* can be specified by a cluster indicator H ∈ {0,1}^{n×k}, with the constraint that each row of H can have only one "1" and the rest must be zeros. The key connection here is that the connectivity matrix M(P^*) = HH^T. With this, the consensus clustering problem becomes

min_H ‖M̄ − HH^T‖^2, s.t. H is a cluster indicator. (1.36)

We can relax the constraint. Clearly H^T H = D = diag(n_1, · · · , n_k), where n_k = |C_k|. However, before we solve the problem, we have no way to know D and thus no way to impose the constraints. A slight reformulation resolves the problem. We define H̃ = H(H^T H)^{−1/2}. Thus HH^T = H̃DH̃^T and H̃^T H̃ = (H^T H)^{−1/2} H^T H (H^T H)^{−1/2} = I. Therefore, consensus clustering becomes the following optimization problem:

min_{H̃^T H̃ = I, H̃, D ≥ 0} ‖M̄ − H̃DH̃^T‖^2, s.t. D is diagonal. (1.37)

Now, H̃ and D are the new variables. We do not need to pre-specify the cluster sizes. Thus the consensus clustering problem is equivalent to a symmetric NMF problem [74].


1.6.5 Graph Clustering

Three NMF models (Symmetric NMF, Asymmetric NMF, and Joint NMF) have been proposed in [124] to identify communities in three different types of networks (undirected, directed, and compound). Among the proposed models, Symmetric NMF and Asymmetric NMF are special cases of Tri-factorization. Joint NMF involves multiple and possibly heterogeneous networks. For example, in music recommendation, we are given three networks: (1) the user network U showing the relationships among users; (2) the music network D showing the relationships among music songs; and (3) the user-music network M showing user preferences. Joint NMF is the problem of finding a latent matrix G, which reflects some "intrinsic" characteristics of the user-music network M, such that the following three objectives are minimized simultaneously: ‖M − G‖, ‖U − GG^T‖, ‖D − G^T G‖. Formally, Joint NMF aims to solve

min_G ‖M − G‖^2 + α‖U − GG^T‖^2 + β‖D − G^T G‖^2, s.t. G ∈ R_+^{n×m}, (1.38)

where α > 0 and β > 0 are constants that trade off the importance of the different terms. Recently, an efficient Newton-like algorithm was proposed for graph clustering using symmetric NMF [63].
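For concreteness, the Joint NMF objective of Eq. (1.38) can be evaluated as in the sketch below (illustrative only; M is the n × m user-music matrix, U the n × n user network, D the m × m music network).

```python
import numpy as np

def joint_nmf_cost(M, U, D, G, alpha=1.0, beta=1.0):
    """Eq. (1.38): ||M - G||^2 + alpha*||U - G G^T||^2 + beta*||D - G^T G||^2."""
    return (np.sum((M - G) ** 2)
            + alpha * np.sum((U - G @ G.T) ** 2)
            + beta * np.sum((D - G.T @ G) ** 2))

rng = np.random.default_rng(9)
n, m = 7, 5
M = rng.random((n, m)); U = rng.random((n, n)); D = rng.random((m, m))
G = rng.random((n, m))
print(joint_nmf_cost(M, U, D, G))
```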

1.6.6 Other Clustering Extensions

NMF methods have also been developed for many other clustering extensions. Badea proposed an NMF model which simultaneously factorizes two linked non-negative matrices with a shared factor matrix, aiming to uncover the common characteristics between the two input datasets [3]. Saha and Sindhwani proposed a framework for online topic detection to handle streaming non-negative data matrices with a possibly growing number of components [95]. Li et al. [77, 78] proposed constrained non-negative matrix tri-factorizations for cross-domain transfer clustering with input matrices in both the source and target domains. The proposed constrained matrix factorization framework naturally incorporates document labels via a least squares penalty incurred by a certain linear model and enables direct and explicit knowledge transfer across different domains. Wang et al. [120] proposed a new NMF-based language model to simultaneously cluster and summarize documents by making use of both the document-term and sentence-term matrices. The proposed framework leads to a better document clustering method with more meaningful interpretation using representative sentences. Wang et al. [116, 117] proposed weighted NMF-based approaches which combine keyword selection and document clustering (topic discovery) by incorporating weights describing the importance of the keywords.

NMF has also been extended for analyzing multi-way data (or multi-way tensors). Multi-way data are generalizations of matrices and appear in many applications [1, 2, 4, 61, 93, 114]. One typical type of three-way data is multiple two-way data/matrices over different time periods. For example, a series of 2-D images, 2-D text data (documents vs. terms), or 2-D microarray data (genes vs. conditions) are naturally represented as three-way data. Non-negative Tensor Factorization (NTD) [100] is an extension of Non-negative Matrix Factorization, where the input data is a non-negative tensor. For a three-way tensor, the standard NTD can be thought of as HOSVD (high-order SVD) [66, 67] with nonnegativity constraints. Tri-factorization has also been extended as Tri-NTD (Tri-factor Non-negative Tensor Factorization) to analyze three-way tensors [140].

1.7 Conclusions

Matrix-based methodologies are rapidly becoming a significant part of data mining, as they are amenable to rigorous analysis and can benefit from the well-established knowledge in linear algebra accumulated through centuries. In particular, NMF factorizes an input nonnegative matrix into two nonnegative matrices of lower rank and has the capability of solving challenging data mining problems. Thanks to these data mining capabilities, NMF has attracted a lot of recent attention and has been used in a variety of fields. This chapter provides a comprehensive review of non-negative matrix factorization methods for clustering by outlining the theoretical foundations on NMF for clustering and providing an overview of different variants of NMF formulations. We also examine the practical issues in NMF algorithms and summarize recent advances on using NMF-based methods for solving many other clustering problems.

There are many future research directions on NMF for clustering including

1. extending NMF for better cluster representation and for dealing with more challenging clustering problems;

2. providing deeper understanding of NMF's clustering capability besides the established theoretical results;

3. developing novel and rigorous proof strategies to prove the correctness and convergence properties of the numerical algorithms;

4. studying NMF with other distance measures (such as other matrix norms and Bregman divergences);

5. improving the scalability of NMF algorithms for large-scale datasets;

6. applying NMF to many different real-world applications and solving real problems.


Acknowledgement

The work of T. Li is supported by the National Science Foundation under grants DMS-0915110 and CCF-0830659, and by the Army Research Office under grants W911NF-10-1-0366 and W911NF-12-1-0431. The work of C. Ding is supported by the National Science Foundation under grants DMS-0915228 and CCF-0830780.


Bibliography

[1] E. Acar and B. Yener. Unsupervised multiway data analysis: A literature survey. IEEE Trans. on Knowl. and Data Eng., 21(1):6–20, January 2009.

[2] A. Smilde, R. Bro, and P. Geladi. Multi-way Analysis: Applications in the Chemical Sciences. Wiley, 2004.

[3] L. Badea. Extracting gene expression profiles common to colon and pancreatic adenocarcinoma using simultaneous nonnegative matrix factorization. In Pacific Symposium on Biocomputing'08, pages 267–278, 2008.

[4] B. W. Bader, R. A. Harshman, and T. G. Kolda. Temporal analysis of semantic graphs using ASALSAN. In Proceedings of ICDM07, pages 33–42, October 2007.

[5] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised clustering by seeding. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 27–34, 2002.

[6] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD '04: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59–68, New York, NY, USA, 2004. ACM Press.

[7] E. Battenberg and D. Wessel. Accelerating non-negative matrix factorization for audio source separation on multi-core and many-core architectures. In ISMIR'09, pages 501–506, 2009.

[8] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, pages 155–173, 2006.

[9] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the International Conference on Machine Learning, 2004.

[10] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[11] C. Boutsidis and E. Gallopoulos. SVD based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4):1350–1362, April 2008.


[12] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences USA, 102(12):4164–4169, 2004.

[13] D. Cai, X. He, J. Han, and T. S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1548–1560, August 2011.

[14] B. Cao, D. Shen, J. Sun, X. Wang, Q. Yang, and Z. Chen. Detect and track latent factors with online nonnegative matrix factorization. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 2689–2694, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.

[15] A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Intell. Neuroscience, 2009:4:1–4:17, January 2009.

[16] E. Y. Chang, K. Zhu, and H. Bai. Parallel algorithms for mining large-scale datasets. CIKM tutorial, 2009.

[17] W. Chen, B. Pan, B. Fang, M. Li, and J. Tang. Incremental nonnegative matrix factorization for face recognition. Mathematical Problems in Engineering, 2008.

[18] Y. Chen, M. Rege, M. Dong, and J. Hua. Non-negative matrix factorization for semi-supervised data clustering. Knowl. Inf. Syst., 17(3):355–379, November 2008.

[19] Y. Chen, L. Wang, and M. Dong. Non-negative matrix factorization for semisupervised heterogeneous data coclustering. IEEE Trans. on Knowl. and Data Eng., 22(10):1459–1474, October 2010.

[20] Y. Chi, S. Zhu, Y. Gong, and Y. Zhang. Probabilistic polyadic factorization and its application to personalized recommendation. In CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 941–950. ACM, 2008.

[21] H. Cho, I. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proceedings of the 4th SIAM Data Mining Conference, pages 22–24, April 2004.

[22] C.-T. Chu et al. Map-reduce for machine learning on multicore. In NIPS, 2006.

[23] A. Cichocki and R. Zdunek. Multilayer nonnegative matrix factorization. Electronics Letters, 42(16):947–948, 2006.

[24] A. Cichocki, R. Zdunek, and S. Amari. Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization. In Lecture Notes in Computer Science, LNCS-4666, Proceedings of Independent Component Analysis (ICA07), pages 169–176. Springer, 2007.


[25] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, 2009.

[26] M. Cooper and J. Foote. Summarizing video using non-negative similarity matrix factorization. In Proc. IEEE Workshop on Multimedia Signal Processing, pages 25–28, 2002.

[27] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.

[28] I. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In Advances in Neural Information Processing Systems 17, Cambridge, MA, 2005. MIT Press.

[29] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. Proc. ACM Int'l Conf. Knowledge Disc. Data Mining (KDD 2001), 2001.

[30] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell., 29, 2007.

[31] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In Proceedings of the Neural Information Processing Systems (NIPS) Conference, pages 283–290, 2005.

[32] C. Ding, X. He, and H.D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. Proc. SIAM Data Mining Conf., 2005.

[33] C. Ding, T. Li, and M. Jordan. Convex and semi-nonnegative matrix factorizations for clustering and low-dimension representation. Technical Report LBNL-60428, Lawrence Berkeley National Laboratory, University of California, Berkeley, 2006.

[34] C. Ding, T. Li, and W. Peng. Nonnegative matrix factorization and probabilistic latent semantic indexing: Equivalence, chi-square statistic, and a hybrid method. In Proc. of National Conf. on Artificial Intelligence (AAAI-06), 2006.

[35] C. Ding and X. He. Cluster merging and splitting in hierarchical clustering algorithms. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM '02, pages 139–146, Washington, DC, USA, 2002. IEEE Computer Society.

[36] C. Ding and X. He. K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning, ICML '04, pages 225–232, 2004.


[37] C. Ding, T. Li, and M. I. Jordan. Nonnegative matrix factorization for combinatorial optimization: Spectral clustering, graph matching, and clique finding. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pages 183–192, Washington, DC, USA, 2008. IEEE Computer Society.

[38] C. Ding, T. Li, and W. Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput. Stat. Data Anal., 52(8):3913–3927, April 2008.

[39] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.

[40] C. Ding, T. Li, and M. I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Trans. Pattern Anal. Mach. Intell., 32(1):45–55, January 2010.

[41] C. Dong, H. Zhao, and W. Wang. Parallel nonnegative matrix factorization algorithm on the distributed memory platform. International Journal of Parallel Programming, 38(2):117–137, 2010.

[42] K. Drakakis, S. Rickard, R. D. Frein, and A. Cichocki. Analysis of financial data using non-negative matrix factorization. International Mathematical Forum, 3(38):1853–1870, 2008.

[43] G. Druck, G. Mann, and A. McCallum. Learning from labeled features using generalized expectation criteria. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, pages 595–602, New York, NY, USA, 2008. ACM.

[44] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:183–187, 1936.

[45] A. Edelman, T. Arias, and S.T. Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM J. Matrix Anal. Appl., 20(2):303–353, 1999.

[46] X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In ICML '04: Proceedings of the twenty-first international conference on Machine learning, page 36, 2004.

[47] E. Gaussier and C. Goutte. Relation between PLSA and NMF and implications. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 601–602, New York, NY, USA, 2005. ACM Press.

[48] A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. In ICDE, pages 341–352, 2005.


[49] G. Golub and C. Van Loan. Matrix Computations, 3rd edition. Johns Hopkins, Baltimore, 1996.

[50] Q. Gu and J. Zhou. Local learning regularized nonnegative matrix factorization. In Proceedings of the 21st international joint conference on Artificial intelligence, IJCAI'09, pages 1046–1051, San Francisco, CA, USA, 2009. Morgan Kaufmann Publishers Inc.

[51] N. Guan, D. Tao, Z. Luo, and B. Yuan. Online nonnegative matrix factorization with robust stochastic approximation. IEEE Trans. Neural Netw. Learning Syst., 23:1087–1099, 2012.

[52] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2011.

[53] B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. In Proceedings of the 1995 ACM/IEEE conference on Supercomputing, Supercomputing '95, New York, NY, USA, 1995. ACM.

[54] T. Hofmann. Probabilistic latent semantic analysis. In Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 289–296, 1999.

[55] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5:1457–1469, December 2004.

[56] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359–392, December 1998.

[57] D. Kim, S. Sra, and I. S. Dhillon. Fast projection-based methods for the least squares nonnegative matrix approximation problem. Stat. Anal. Data Min., 1(1):38–51, February 2008.

[58] H. Kim and H. Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.

[59] J. Kim and H. Park. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 2012. To appear.

[60] D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 307–314, 2002.

[61] T. G. Kolda and B. W. Bader. The TOPHITS model for higher-order web link analysis. In Workshop on Link Analysis, Counterterrorism and Security, 2006.


[62] D. Kong, C. Ding, and H. Huang. Robust nonnegative matrix factorization using L21-norm. In Proceedings of the 20th ACM international conference on Information and knowledge management (CIKM'11), pages 673–682, 2011.

[63] D. Kuang, C. Ding, and H. Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the 2012 SIAM International Conference on Data Mining (SDM 2012), 2012.

[64] V. Kysenko, K. Rupp, O. Marchenko, S. Selberherr, and A. Anisimov. GPU-accelerated non-negative matrix factorization for text mining. Natural Language Processing and Information Systems, pages 158–163, 2012.

[65] A. N. Langville, C. D. Meyer, and R. Albright. Initializations for the non-negative matrix factorization. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), 2006.

[66] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the Best Rank-1 and Rank-(R1, R2, ..., RN) Approximation of Higher-Order Tensors. SIAM J. Matrix Anal. Appl., 21(4):1324–1342, 2000.

[67] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000.

[68] D.D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.

[69] D.D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In T. G. Dietterich and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. The MIT Press, 2001.

[70] H. Lee, J. Yoo, and S. Choi. Semi-supervised nonnegative matrix factorization. IEEE Signal Processing Letters, 17(1), 2010.

[71] S.Z. Li, X. Hou, H. Zhang, and Q. Cheng. Learning spatially localized, parts-based representation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 207–212, 2001.

[72] T. Li. A general model for clustering binary data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 188–197, New York, NY, USA, 2005. ACM.

[73] T. Li and C. Ding. The relationships among various nonnegative matrix factorization methods for clustering. In Proceedings of the 2006 IEEE International Conference on Data Mining (ICDM 2006), pages 362–371, 2006.

[74] T. Li, C. Ding, and M. I. Jordan. Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07, pages 577–582, Washington, DC, USA, 2007. IEEE Computer Society.


[75] T. Li, C. Ding, Y. Zhang, and B. Shao. Knowledge transformation from word space to document space. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, pages 187–194, New York, NY, USA, 2008. ACM.

[76] T. Li, M. Ogihara, and S. Ma. On combining multiple clusterings. In CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 294–303, 2004.

[77] T. Li, V. Sindhwani, C. Ding, and Y. Zhang. Knowledge transformation for cross-domain sentiment classification. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pages 716–717, 2009.

[78] T. Li, V. Sindhwani, C. Ding, and Y. Zhang. Bridging domains with words: Opinion analysis with matrix tri-factorizations. In Proceedings of the Tenth SIAM Conference on Data Mining (SDM 2010), pages 293–302, 2010.

[79] T. Li, Y. Zhang, and V. Sindhwani. A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, ACL '09, pages 244–252, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

[80] C. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Comput., 19(10):2756–2779, October 2007.

[81] C. Liu, H. Yang, J. Fan, L. He, and Y. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In WWW '10: Proceedings of the 19th international conference on World Wide Web, pages 681–690, 2010.

[82] H. Liu and Z. Wu. Non-negative matrix factorization with constraints. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), 2010.

[83] B. Long, Z. Zhang, and P.S. Yu. Co-clustering by block value decomposition. In KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 635–640, New York, NY, USA, 2005. ACM Press.

[84] B. Long, X. Wu, Z. Zhang, and P. S. Yu. Unsupervised learning on k-partite graphs. In Proceedings of ACM SIGKDD, pages 317–326, 2006.

[85] N. Lopes and B. Ribeiro. Non-negative matrix factorization implementation using graphic processing units. In Proceedings of the 11th international conference on Intelligent data engineering and automated learning, IDEAL'10, pages 275–283, Berlin, Heidelberg, 2010. Springer-Verlag.


[86] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 169–178, New York, NY, USA, 2000. ACM.

[87] P. Miettinen, T. Mielikainen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. In Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases, PKDD'06, pages 335–346, Berlin, Heidelberg, 2006. Springer-Verlag.

[88] K. Nishino, S. K. Nayar, and T. Jebara. Clustered blockwise PCA for representing visual data. IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1675–1679, October 2005.

[89] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111–126, 1994.

[90] S. Park, J. Lee, D. Kim, and C. Ahn. Multi-document summarization based on cluster using non-negative matrix factorization. In Proceedings of the 33rd conference on Current Trends in Theory and Practice of Computer Science, SOFSEM '07, pages 761–770, Berlin, Heidelberg, 2007. Springer-Verlag.

[91] V. P. Pauca, F. Shahnaz, M.W. Berry, and R.J. Plemmons. Text mining using non-negative matrix factorization. In Proc. SIAM Int'l Conf. on Data Mining, pages 452–456, 2004.

[92] V. P. Pauca, J. Piper, and R. J. Plemmons. Nonnegative matrix factorization for spectral data analysis. Linear Algebra and its Applications, 416(1):29–47, 2006.

[93] W. Peng and T. Li. Temporal relation co-clustering on directional social network and author-topic evolution. Knowl. Inf. Syst., 26(3):467–486, March 2011.

[94] J. Platos, P. Gajdos, P. Kromer, and V. Snasel. Non-negative matrix factorization on GPU. In Filip Zavoral, Jakub Yaghob, Pit Pichappan, and Eyas El-Qawasmeh, editors, Networked Digital Technologies, volume 87 of Communications in Computer and Information Science, pages 21–30. Springer Berlin Heidelberg, 2010.

[95] A. Saha and V. Sindhwani. Learning evolving and emerging topics in social media: a dynamic NMF approach with temporal regularization. In Proceedings of the fifth ACM international conference on Web search and data mining, WSDM '12, pages 693–702, New York, NY, USA, 2012. ACM.

[96] M. N. Schmidt and H. Laurberg. Nonnegative matrix factorization with Gaussian process priors. Intell. Neuroscience, 2008:3:1–3:10, January 2008.


[97] M. N. Schmidt, O. Winther, and L. Hansen. Bayesian non-negative matrix factorization. In International Conference on Independent Component Analysis and Signal Separation, volume 5441 of Lecture Notes in Computer Science (LNCS), pages 540–547. Springer, 2009.

[98] F. Sha, L.K. Saul, and D.D. Lee. Multiplicative updates for nonnegative quadratic programming in support vector machines. In Advances in Neural Information Processing Systems 15, pages 1041–1048. 2003.

[99] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons. Document clustering using nonnegative matrix factorization. Inf. Process. Manage., 42(2):373–386, March 2006.

[100] A. Shashua and T. Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In ICML, 2005.

[101] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22:888–905, 2000.

[102] V. Sindhwani, J. Hu, and A. Mojsilovic. Regularized co-clustering with dual supervision. In Proceedings of NIPS'08, pages 1505–1512, 2008.

[103] V. Sindhwani and P. Melville. Document-word co-regularization for semi-supervised sentiment analysis. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pages 1025–1030, Washington, DC, USA, 2008. IEEE Computer Society.

[104] P. Smaragdis. Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs. In Proceedings of ICA'04, pages 494–499, 2004.

[105] S. Sra and I. S. Dhillon. Nonnegative matrix approximation: algorithms and applications. Technical Report TR-06-27, UTCS, 2006.

[106] N. Srebro, J. Rennie, and T. Jaakkola. Maximum margin matrix factorization. In Advances in Neural Information Processing Systems, Cambridge, MA, 2005. MIT Press.

[107] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617, March 2003.

[108] Z. Sun, T. Li, and N. Rishe. Large-scale matrix factorization using MapReduce. In Proceedings of the 2010 IEEE International Conference on Data Mining Workshops, ICDMW '10, pages 1242–1248, Washington, DC, USA, 2010. IEEE Computer Society.

[109] P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2006.


[110] C. Thurau, K. Kersting, M. Wahabzada, and C. Bauckhage. Convex non-negative matrix factorization for massive datasets. Knowl. Inf. Syst., 29(2):457–478, November 2011.

[111] F. D. Torre and T. Kanade. Discriminative cluster analysis. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), 2006.

[112] A. Tveit. MapReduce and Hadoop algorithms in academic papers. Blog, Internet.

[113] S. A. Vavasis. On the complexity of nonnegative matrix factorization. http://arxiv.org/abs/0708.4149, 2007.

[114] M. Vichi, R. Rocci, and H. A. L. Kiers. Simultaneous component and clustering models for three-way data: Within and between approaches. Journal of Classification, 24(1):71–98, 2007.

[115] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In Proceedings of International Conference on Machine Learning, pages 577–584, 2001.

[116] D. Wang, C. Ding, and T. Li. Feature subset non-negative matrix factorization and its applications to document understanding. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '10, pages 805–806, New York, NY, USA, 2010. ACM.

[117] D. Wang, T. Li, and C. Ding. Weighted feature subset non-negative matrix factorization and its applications to document understanding. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM '10, pages 541–550, Washington, DC, USA, 2010. IEEE Computer Society.

[118] D. Wang, T. Li, C. Ding, and S. Zhu. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of The 31st Annual International ACM SIGIR Conference (SIGIR 2008), pages 307–314, 2008.

[119] D. Wang, T. Li, S. Zhu, and C. Ding. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, pages 307–314, New York, NY, USA, 2008. ACM.

[120] D. Wang, S. Zhu, T. Li, Y. Chi, and Y. Gong. Integrating document clustering and multidocument summarization. ACM Trans. Knowl. Discov. Data, 5(3):14:1–14:26, August 2011.

[121] F. Wang and P. Li. Efficient nonnegative matrix factorization with random projections. In Proceedings of 2010 SIAM Data Mining Conference (SDM'10), pages 281–292, 2010.


[122] F. Wang, P. Li, and A. C. Konig. Efficient document clustering via online non-negative matrix factorizations. In Proceedings of the 2011 SIAM International Conference on Data Mining (SDM'11), pages 908–919, 2011.

[123] F. Wang and T. Li. Gene selection via matrix factorization. In Proceedings of 7th IEEE Conference on Bioinformatics and Bioengineering, pages 1046–1050.

[124] F. Wang, T. Li, X. Wang, S. Zhu, and C. Ding. Community discovery using nonnegative matrix factorization. Data Min. Knowl. Discov., 22(3):493–521, May 2011.

[125] F. Wang, T. Li, and C. Zhang. Semi-supervised clustering via matrix factorization. In Proceedings of 2008 SIAM International Conference on Data Mining (SDM 2008), pages 1–12, 2008.

[126] F. Wang, H. Tong, and C. Lin. Towards evolutionary nonnegative matrix factorization. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2011), 2011.

[127] H. Wang, F. Nie, H. Huang, and F. Makedon. Fast nonnegative matrix tri-factorization for large-scale data co-clustering. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1553–1558. AAAI Press, 2011.

[128] S. Wild, J. Curry, and A. Dougherty. Improving non-negative matrix factorizations through structured initialization. Pattern Recognition, 37(11):2217–2232, 2004.

[129] Y.-L. Xie, P.K. Hopke, and P. Paatero. Positive matrix factorization applied to a curve resolution problem. Journal of Chemometrics, 12(6):357–364, 1999.

[130] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pages 505–512, 2003.

[131] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proc. ACM conf. Research and development in IR (SIGIR), pages 267–273, Toronto, Canada, 2003.

[132] Y. Xue, C. S. Tong, Y. Chen, and W. Chen. Clustering-based initialization for non-negative matrix factorization. Applied Mathematics and Computation, 205(2):525–536, 2008.

[133] Z. Yang and E. Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734–749, 2010.

[134] S. X. Yu and J. Shi. Multiclass spectral clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, ICCV '03, pages 313–, Washington, DC, USA, 2003. IEEE Computer Society.


[135] R. Zdunek and A. Cichocki. Nonnegative matrix factorization with quadratic programming. Neurocomput., 71(10-12):2309–2320, June 2008.

[136] D. Zeimpekis and E. Gallopoulos. CLSI: A flexible approximation scheme from clustered term-document matrices. In Proc. SIAM Data Mining Conf., pages 631–635, 2005.

[137] H. Zha, X. He, C. Ding, M. Gu, and H.D. Simon. Bipartite graph partitioning and data clustering. In Proc. Int'l Conf. Information and Knowledge Management (CIKM 2001), 2001.

[138] H. Zha, X. He, C. Ding, M. Gu, and H.D. Simon. Spectral Relaxation for K-means Clustering. In Advances in Neural Information Processing Systems 14 (NIPS'01), pages 1057–1064, 2001. MIT Press.

[139] S. Zhang, W. Wang, J. Ford, and F. Makedon. Learning from incomplete ratings using non-negative matrix factorization. In Proceedings of the Sixth SIAM Conference on Data Mining (SDM), pages 549–553, 2006.

[140] Z. Zhang, T. Li, and C. Ding. Non-negative tri-factor tensor decomposition with applications. Knowledge and Information Systems, 2012.

[141] Z. Zhang, T. Li, C. Ding, X. Ren, and X. Zhang. Binary matrix factorization for analyzing gene expression data. Data Min. Knowl. Discov., 20(1):28–52, January 2010.

[142] Z. Zhang, T. Li, C. Ding, and X. Zhang. Binary matrix factorization with applications. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07, pages 391–400, Washington, DC, USA, 2007. IEEE Computer Society.

[143] Z. Zheng, J. Yang, and Y. Zhu. Initialization enhancer for non-negative matrix factorization. Engineering Applications of Artificial Intelligence, 20(1):101–110, 2007.

[144] Y. Zhu, L. Jing, and J. Yu. Text clustering via constrained nonnegative matrix factorization. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, ICDM '11, pages 1278–1283, Washington, DC, USA, 2011. IEEE Computer Society.