
Learning low-rank kernel matrices for constrained clustering


Mahdieh Soleymani Baghshah a,*, Saeed Bagheri Shouraki b

a Computer Engineering Department, Sharif University of Technology (SUT), Azadi St., PO Box 1458889694, Tehran, Iran
b Electrical Engineering Department, Sharif University of Technology, Tehran, Iran

Neurocomputing 74 (2011) 2201–2211. doi:10.1016/j.neucom.2011.02.009

* Corresponding author. Tel.: +98 9124026131; fax: +98 21 6601 9246.
E-mail addresses: [email protected] (M.S. Baghshah), [email protected] (S.B. Shouraki).

Article info

Article history: Received 11 July 2010; received in revised form 1 February 2011; accepted 21 February 2011; available online 21 March 2011. Communicated by S. Choi.

Keywords: Low-rank kernel; Kernel learning; Distance metric; Constrained clustering; Semi-supervised; Spectral

Abstract

Constrained clustering methods (that usually use must-link and/or cannot-link constraints) have received much attention in the last decade. Recently, kernel adaptation or kernel learning has been considered as a powerful approach for constrained clustering. However, these methods usually either allow only special forms of kernels or learn non-parametric kernel matrices and scale very poorly. Therefore, they either learn a metric that has low flexibility or are applicable only to small data sets due to their high computational complexity. In this paper, we propose a more efficient non-linear metric learning method that learns a low-rank kernel matrix from must-link and cannot-link constraints and the topological structure of data. We formulate the proposed method as a trace ratio optimization problem and learn appropriate distance metrics through finding optimal low-rank kernel matrices. We solve the proposed optimization problem much more efficiently than SDP solvers. Additionally, we show that the spectral clustering methods can be considered as a special form of low-rank kernel learning methods. Extensive experiments have demonstrated the superiority of the proposed method compared to recently introduced kernel learning methods.

Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved.

1. Introduction

In many applications, in addition to the usually large amount of unlabeled data, limited side information can also be obtained. Recently, there has been growing interest in algorithms that lie between supervised and unsupervised ones [3]. Semi-supervised clustering algorithms, as one category of these algorithms, are applied when no precise definition of classes is given. Since class label information is not generally available for clustering tasks, pairwise constraints are used as more natural supervisory information for these tasks [30]. This type of supervisory information is weaker than labels. Must-link and cannot-link constraints are the most popular kind of side information that has been used for semi-supervised clustering [30]. In the last decade, several algorithms were introduced for constrained clustering. Most of the early studies modify the clustering algorithms to use side information and bias the search for an appropriate data clustering. According to the discussion in Ref. [27], many of these semi-supervised algorithms [2,23,25,27] can be formulated as the usual clustering algorithms employing an adapted distance matrix. Nonetheless, they cannot learn a distance function and just adapt the distance matrix using pairwise constraints. A more powerful approach that has attracted attention especially in the last few years is distance function learning. In this approach, the algorithm learns a proper distance function prior to clustering. It yields a more flexible framework for constrained clustering.

Distance functions are important to many machine learning and data mining algorithms [44]. Until now, there have been many studies on distance function learning [46]. The early studies and some of the more recent ones were devoted to metric learning for supervised learning algorithms (nearest neighbor classifiers) [12,13,15–17,36,38]. Unfortunately, distance metric learning cannot be addressed properly for unsupervised learning algorithms because it is not possible to define an appropriate dissimilarity measure (or a well-defined optimization problem) in the absence of supervisory information [48]. In real-world clustering tasks, data sets can contain clusters of different shapes, sizes, data sparseness, and degrees of separation [14]. In these cases, clustering techniques require a proper definition of the (dis)similarity measure between patterns, but this is not a well-defined problem in the absence of supervisory information (side information) [14]. However, when some limited supervisory information is available along with unlabeled data, we can use this information to learn a distance function.

In the last decade, there has been a growing research interest in distance metric learning based on pairwise constraints. Many of the existing studies (especially the earlier ones) [1,18,20,40,44,45,47] learn a global Mahalanobis metric or, equivalently, a linear transformation. In the last few years, some methods [8,19,26,28,33–35,39,41,48–52] have been introduced to learn more flexible distance metrics that correspond to non-linear transformations for constrained clustering. Since learning a distance metric can be linked to the well-defined problem of learning a kernel [32], almost all of the existing non-linear metric learning methods have used a kernel-based approach. A few studies [50,51] are based on learning the parameters of a parametric kernel such as the Gaussian kernel. Some other studies (on kernel learning for constrained clustering) [34,35,43,49] assume the kernel matrices to be a linear combination of base matrices that are constructed beforehand. The base matrices are usually constructed from principal eigenvectors of a fixed kernel such as a Gaussian kernel or a graph Laplacian. The choice of the target kernel matrices in this approach is limited [19], and thus this approach yields limited flexibility for the resulting distance metrics.

The most recent studies [19,26,28,33,41,48,52] have introduced non-parametric kernel learning methods for constrained clustering. These methods yield more flexible distance metrics. The kernel-A method [48] can use only must-links to optimize the kernel matrix $K_A = A K A^T$ by finding the optimal $n \times n$ matrix $A$. Kulis et al. [28] have proposed a low-rank kernel learning method whose performance depends on the choice of the initial kernel matrix [52]. They used a non-convex gradient descent procedure to solve their proposed problem [52]. Hoi et al. [19] and Li et al. [33] have presented methods that learn a non-parametric kernel matrix according to the pairwise constraints and unlabeled data. These methods used the notion of kernel alignment introduced by Cristianini et al. [10], which has the problems mentioned by Crammer et al. [9] and is also not appropriate for constrained clustering. Recently, we have introduced a non-parametric kernel learning method [41] that outperforms the one introduced by Hoi et al. [19] because of its more appropriate optimization problem. Nonetheless, this method, similar to the other methods introduced for non-parametric kernel learning from must-link and cannot-link constraints (like [19,33,41]), is a Semi-Definite Programming (SDP) problem. Standard SDP solvers introduced for solving these problems can require as much as $O(n^{6.5})$ computational complexity [52]. Therefore, all of these methods need SDP solvers and thus are not applicable to large-scale data sets. Recently, Zhuang et al. [52] have introduced the SimpleNPKL method to address the efficiency and scalability issues of the NPK method presented by Hoi et al. [19]. However, as mentioned in that study, when SimpleNPKL uses a squared hinge loss optimization problem (to yield results comparable to those of NPK), it is just 10 times faster than NPK. Additionally, the problem introduced in the SimpleNPKL method also shares some shortcomings of the problem proposed for the NPK method.

Although learning non-parametric kernel matrices yields completely flexible distance metrics (on seen data points), the number of free parameters of these models is very high (i.e., $n^2$, where $n$ denotes the number of data points) and the supervisory information is very limited compared to their flexibility. In this paper, we propose a novel method that learns a low-rank kernel matrix for constrained clustering. The proposed method yields a kernel matrix that can be sufficiently flexible while avoiding unnecessary free variables. We consider must-link and cannot-link constraints along with the structure of the data and formulate the proposed method as an optimization problem to learn a low-rank kernel matrix. We solve the proposed problem and find its global optimum using an approach that is much more efficient than SDP problem solvers. Compared to the methods introduced in Refs. [34,48–51], our method finds a more general form of distance metric through learning a non-parametric low-rank kernel matrix. It does not need an initial kernel, as opposed to the methods introduced in Refs. [26,28,43,48]. Additionally, compared to the existing non-parametric kernel learning methods introduced in Refs. [19,33,41,48,52], the proposed low-rank kernel learning method is much more efficient and thus applicable to larger scale problems. Moreover, it is less prone to over-fitting.

The rest of this paper is organized as follows: in Section 2, the proposed algorithm is introduced. In this section, we formulate an optimization problem using constraints and the topological structure of the data and solve this problem to find an appropriate low-rank kernel matrix. Moreover, we show the relation between the spectral clustering methods and low-rank kernel learning. Experimental results are presented in Section 3. Finally, concluding remarks are given in Section 4.

2. Low-rank kernel learning

In this section, we propose a novel low-rank kernel learning method using the geometrical structure of data and pairwise constraints to find an appropriate distance function. This method is formulated as an orthogonally constrained trace ratio optimization problem. We solve this problem efficiently.

2.1. Problem setup

We intend to learn a low-rank kernel matrix according to the pairwise constraints and the topological structure of the data. Indeed, we specify the geometry of the embedding space via a kernel matrix and find a notion of (dis)similarity between data pairs in the input space [29].

We are given a set of data points $X = \{x_i\}_{i=1}^{n} \subset \mathbb{R}^d$ and two sets $S$ and $D$ including pairwise similarity (must-link) and dissimilarity (cannot-link) constraints. We consider a feature space $F$ induced by a non-linear mapping $\phi : \mathbb{R}^d \to F$. For a properly chosen $\phi$, we can define an inner product on $F$ using a Mercer kernel $k(x, y) = \langle \phi(x), \phi(y) \rangle$, where $k(\cdot,\cdot)$ is a positive semi-definite kernel function. Let $\mathbf{X} = [x_1, x_2, \ldots, x_n]$ be the matrix containing all data points ($n$ denotes the number of data points), $\Phi = [\phi_1, \phi_2, \ldots, \phi_n] = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]$ be the data transformed to the kernel space, and $K = [k(x_i, x_j)]_{n \times n} = \Phi^T \Phi$ be the corresponding kernel matrix. Using the following proposition, we intend to find a proper (pseudo-)metric by learning an appropriate kernel.

Proposition (Valid kernels induce pseudo-metrics) [42]. Let $k : X \times X \to \mathbb{R}$ be a kernel function on $X$ and $d_k : X \times X \to \mathbb{R}$ be the distance induced by $k$. If $k$ is positive semi-definite, then $d_k$ is a pseudo-metric.

In the following sections, we will exclusively use pseudo-metrics and, for the sake of better readability, abuse terminology and refer to them as metrics (as in Refs. [42,48,49]). Moreover, we use the following notation: $\mathbf{A}$ represents a matrix, $\mathbf{a}$ denotes a vector, and $\mathcal{A}$ shows a set. $\mathrm{tr}(\mathbf{A})$ shows the trace of the matrix $\mathbf{A}$, and $\mathbf{A}^T$ denotes the transpose of $\mathbf{A}$. $(\mathbf{A})_{ij}$ shows the element located at the $i$th row and the $j$th column of $\mathbf{A}$. $b_i$ ($i = 1, \ldots, n$) represents the $i$th column of the $n \times n$ identity matrix.

2.2. Geometrical structure

In this section, we use the cluster assumption and define $U(\phi, X)$, which measures how closely the mapping $\phi$ can preserve the topological structure of the data points (included in $X$) in the transformed space. We use the idea of the graph Laplacian [4] for this purpose. The incorporation of the geometrical structure by using the graph Laplacian has been discussed in some previous studies like Ref. [6].

The k-nearest neighborhood, as the most common definition for the concrete model of the neighborhood [30], is used to consider the topological structure of the data. The graph of k-nearest neighbors can model the relation between close data points. The weight matrix of this graph can be defined as

$$w_{ij} = \begin{cases} 1 & \text{if } x_i \in N_k(x_j) \ \vee \ x_j \in N_k(x_i) \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

where $N_k(x_i)$ shows the set of k-nearest neighbors of $x_i$. Based on (1), the weight matrix $W$ is symmetric. According to spectral graph theory, the smoothness of a function $f(x)$ applied on the data points (vertices of the graph) can be defined as

$$U(f, X) \triangleq \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\| \frac{f(x_i)}{\sqrt{d_i}} - \frac{f(x_j)}{\sqrt{d_j}} \right\|_2^2 w_{ij}, \qquad (2)$$

where $d_i$ denotes the degree of the $i$th vertex of the neighborhood graph (i.e., $d_i = \sum_{k=1}^{n} w_{ik}$). Thus, the smoothness of the mapping $\phi$ on the data points can be obtained as

$$U(\Phi, X) = \frac{1}{2} \sum_{i} \sum_{j} \left\| \frac{\phi_i}{\sqrt{d_i}} - \frac{\phi_j}{\sqrt{d_j}} \right\|_2^2 w_{ij} = \mathrm{tr}\left( \sum_{i} \phi_i \phi_i^T - \sum_{i} \sum_{j} \frac{\phi_i}{\sqrt{d_i}} \, w_{ij} \, \frac{\phi_j^T}{\sqrt{d_j}} \right) = \mathrm{tr}\left( \Phi (I - D^{-1/2} W D^{-1/2}) \Phi^T \right) = \mathrm{tr}(\Phi L \Phi^T) = \mathrm{tr}(LK), \qquad (3)$$

where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ and $L = I - D^{-1/2} W D^{-1/2}$ is the normalized Laplacian matrix. $U(\Phi, X)$ shows the weighted sum of the squared distances in the kernel space between the neighboring data points (in the input space). Since we can write $U(\Phi, X)$ in terms of the kernel matrix as $U(\Phi, X) = \mathrm{tr}(LK)$, it can be replaced with $U(K) = \mathrm{tr}(LK)$ in the proposed problem.
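To make the geometrical-structure term concrete, the following sketch builds the symmetric k-nearest-neighbor weight matrix of Eq. (1), the normalized Laplacian of Eq. (3), and evaluates $U(K) = \mathrm{tr}(LK)$. It is an illustrative NumPy/SciPy implementation under our own naming, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_graph_laplacian(X, k=5):
    """Build the symmetric k-NN weight matrix W of Eq. (1) and the
    normalized Laplacian L = I - D^{-1/2} W D^{-1/2} used in Eq. (3)."""
    n = X.shape[0]
    dist = cdist(X, X)                       # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)           # exclude self-neighbors
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist[i])[:k]  # indices of the k nearest neighbors of x_i
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)                   # symmetrize: w_ij = 1 if i in N_k(j) or j in N_k(i)
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(n) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    return W, L

def structure_term(L, K):
    """U(K) = tr(LK), the smoothness of the embedding encoded by K."""
    return np.trace(L @ K)
```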

2.3. Proposed problem and its solution

We propose an optimization problem that is formulated based on the kernel matrix $K = \Phi^T \Phi$. First, we define $J_S(K)$ as the sum of the squared distances between pairs of similar data in the kernel space:

$$J_S(K) = \sum_{(i,j) \in S} \|\phi(x_i) - \phi(x_j)\|_2^2 = \sum_{(i,j) \in S} \left[ (K)_{ii} + (K)_{jj} - 2(K)_{ij} \right] = \sum_{(i,j) \in S} (b_i - b_j)^T K (b_i - b_j) = \sum_{(i,j) \in S} \mathrm{tr}\left( (b_i - b_j)(b_i - b_j)^T K \right) = \mathrm{tr}(K E_S), \qquad (4)$$

where

$$E_S = \sum_{(i,j) \in S} (b_i - b_j)(b_i - b_j)^T. \qquad (5)$$

Similarly, $J_D(K)$ shows the sum of the squared distances between pairs of dissimilar data in the kernel space:

$$J_D(K) = \sum_{(i,j) \in D} \|\phi(x_i) - \phi(x_j)\|_2^2 = \mathrm{tr}(K E_D), \qquad (6)$$

where

$$E_D = \sum_{(i,j) \in D} (b_i - b_j)(b_i - b_j)^T. \qquad (7)$$

Let $W_S$ and $W_D$ be the matrices denoting must-link and cannot-link constraints, respectively. That is, the elements of $W_S$ are set to one for those pairs appearing in $S$ and set to zero for the other pairs (i.e., if $(i,j) \in S$ or $(j,i) \in S$ then $(W_S)_{ij} = 1$, otherwise $(W_S)_{ij} = 0$). It is worth noting that $E_S$ in (5) and $E_D$ in (7) can be considered as graph Laplacian matrices corresponding to the weight matrices $W_S$ and $W_D$, respectively.
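For illustration, the constraint matrices of Eqs. (5) and (7) can be assembled directly from the lists of must-link and cannot-link index pairs, exploiting the fact that $E_S$ and $E_D$ are the graph Laplacians of $W_S$ and $W_D$. The sketch below uses our own naming and is only one straightforward way to do this.

```python
import numpy as np

def constraint_laplacian(pairs, n):
    """E = sum over (i,j) of (b_i - b_j)(b_i - b_j)^T, i.e. the graph
    Laplacian of the 0/1 constraint weight matrix (Eqs. (5) and (7))."""
    E = np.zeros((n, n))
    for i, j in pairs:
        E[i, i] += 1.0
        E[j, j] += 1.0
        E[i, j] -= 1.0
        E[j, i] -= 1.0
    return E

# Hypothetical usage with index pairs drawn from the sets S and D:
# E_S = constraint_laplacian(must_link_pairs, n)
# E_D = constraint_laplacian(cannot_link_pairs, n)
```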

We intend to find the optimal matrix $K$ that yields small distances between similar pairs and large distances between dissimilar pairs in the kernel space. Additionally, we try to preserve the topological structure of the data in the kernel space. Since kernel learning methods usually scale poorly and the amount of supervisory information is typically very limited compared to the flexibility of these models (i.e., non-parametric kernel matrices), we use low-rank kernel matrices to avoid this problem. Thus, we propose the following optimization problem:

$$\max_{K \succeq 0} \ \frac{J_D(K)}{J_S(K) + \alpha U(K)}, \qquad \text{s.t.} \quad \mathrm{rank}(K) \le r, \qquad (8)$$

where $\alpha$ is a positive constant adjusting the relative importance of $U(K)$. The topological structure of the data has been included via the term $U(K)$, as explained in Section 2.2. Given an $n \times n$ kernel matrix $K$, if it is of low rank, say $r < n$, we can represent the kernel matrix in terms of a decomposition $K = GG^T$, with $G$ an $n \times r$ matrix [26]. By substituting (3), (4) and (6) into problem (8) and setting $K = GG^T$, we can rewrite problem (8) as

$$\max_{G \in \mathbb{R}^{n \times r}} \ \frac{\mathrm{tr}(G^T E_D G)}{\mathrm{tr}(G^T (E_S + \alpha L) G)}. \qquad (9)$$

To avoid the trivial solution, we add the orthogonality constraint $G^T G = I_r$ to the above problem and obtain the following problem:

$$\max_{G^T G = I_r} \ \frac{\mathrm{tr}(G^T E_D G)}{\mathrm{tr}(G^T (E_S + \alpha L) G)}. \qquad (10)$$

The problem in (10) is a trace ratio problem. Conventionally, the solution of this problem is approximated via generalized eigenvalue decomposition [44]. However, prior work has indicated that it is more reasonable to solve it directly than via the conventional way [22]. Since $E_D$ and $E_S + \alpha L$ are positive semi-definite matrices and the problem in (10) has the form of the orthogonally constrained trace ratio optimization problem, we can use the algorithm introduced in Ref. [44] to solve this problem. Table 1 shows the steps of this algorithm.

This algorithm considers two cases [44]:

Case $m \le n - r$: if $W$ is in the null space of $S_2$, then $\mathrm{tr}(W^T S_2 W) = 0$. Therefore, $\mathrm{tr}(W^T S_1 W)$ can be maximized after performing a null-space transformation $y = Z^T x$, where $Z \in \mathbb{R}^{n \times (n-r)}$ is a matrix whose columns are the eigenvectors corresponding to the $n - r$ zero eigenvalues of $S_2$. Thus, we find $W^* = Z V^*$, where $V^* = \arg\max_{V^T V = I} \mathrm{tr}(V^T (Z^T S_1 Z) V)$.

Case $m > n - r$: a binary search (maintaining a lower bound and an upper bound) is used to find $\lambda^* = \max_{W^T W = I_m} \mathrm{tr}(W^T S_1 W)/\mathrm{tr}(W^T S_2 W)$. Indeed, a function $g(\lambda) = \max_{W^T W = I_m} \mathrm{tr}(W^T (S_1 - \lambda S_2) W)$ is introduced and we seek a $\lambda$ such that $g(\lambda) = 0$. The value of $g(\lambda)$ can be easily calculated as the sum of the first $m$ largest eigenvalues of $S_1 - \lambda S_2$. The optimal $W^*$ is finally obtained by performing the eigenvalue decomposition of $S_1 - \lambda^* S_2$.

In the spectral clustering methods, the number of eigenvectors selected to map the data points to the discriminative space is set to $c$ (i.e., the number of clusters). Similar to these methods, we can set the rank of the kernel matrix to $c$ (or, equivalently, set the size of the matrix $G$ to $n \times c$). We discuss this further in Section 2.5. The proposed metric learning algorithm has been summarized in Table 2. We call the proposed method Low-Rank Kernel Learning (LRKL) in the following sections.

2.4. Computational complexity

Since the algorithm in Table 1 uses a binary search, it needs a small number of iterations to converge. The most costly step of this algorithm is finding the eigenvectors corresponding to the largest $m$ eigenvalues of an $n \times n$ matrix (Step 2.3.1), which must be computed in a few iterations until convergence. The eigen-decomposition of a dense matrix of size $n \times n$ is $O(n^3)$. Therefore, the time complexity of our method is very low compared to that of the SDP problem solvers, which can be as high as $O(n^{6.5})$.


Table 2. The proposed low-rank kernel learning (LRKL) algorithm.

Input: the matrix of data points X, the number of clusters c, the sets of must-link (S) and cannot-link (D) constraints, an error threshold $\varepsilon$.
Output: $K^* \in \mathbb{R}^{n \times n}$.
1. Construct the neighborhood graph and calculate the weight matrix W.
2. Find the Laplacian matrix $L = I - D^{-1/2} W D^{-1/2}$.
3. Calculate $E_S$ according to (4) and $E_D$ according to (6).
4. Learn $G^*$ according to the algorithm presented in Table 1 (inputs: $E_D$, $E_S + \alpha L$, c, and $\varepsilon$).
5. Output the kernel matrix $K^* = G^* G^{*T}$.
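A compact sketch of the Table 2 pipeline is given below. It reuses the helper functions sketched earlier (knn_graph_laplacian, constraint_laplacian) and a trace-ratio solver solve_trace_ratio (sketched after Table 1); all of these names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lrkl(X, c, must_link, cannot_link, alpha=0.2, k=5, eps=1e-4):
    """Low-Rank Kernel Learning (Table 2): returns K* = G* G*^T of rank c."""
    n = X.shape[0]
    _, L = knn_graph_laplacian(X, k=k)          # steps 1-2: k-NN graph and normalized Laplacian
    E_S = constraint_laplacian(must_link, n)    # step 3: must-link term (Eq. (5))
    E_D = constraint_laplacian(cannot_link, n)  # step 3: cannot-link term (Eq. (7))
    G = solve_trace_ratio(E_D, E_S + alpha * L, m=c, eps=eps)  # step 4: Table 1 solver
    return G @ G.T                              # step 5: low-rank kernel K* = G* G*^T
```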

Table 1. Algorithm for solving the orthogonally constrained trace ratio optimization problem [44].

Input: $S_1, S_2 \in \mathbb{R}^{n \times n}$, the target dimensionality m, an error constant $\varepsilon$.
Output: $W^* \in \mathbb{R}^{n \times m}$, $W^* = \arg\max_{W^T W = I} \mathrm{tr}(W^T S_1 W)/\mathrm{tr}(W^T S_2 W)$.
1. Calculate the rank r of the matrix $S_2$.
2. if $m > n - r$
   2.1 Find $a_1, \ldots, a_m$ as the m largest eigenvalues of $S_1$ and $b_1, \ldots, b_m$ as the m smallest eigenvalues of $S_2$.
   2.2 $\lambda_1 \leftarrow \mathrm{tr}(S_1)/\mathrm{tr}(S_2)$, $\lambda_2 \leftarrow \sum_{i=1}^{m} a_i / \sum_{i=1}^{m} b_i$, $\lambda \leftarrow (\lambda_1 + \lambda_2)/2$.
   2.3 while $|\lambda_1 - \lambda_2| > \varepsilon$ do
       2.3.1 Calculate $g(\lambda)$ as the sum of the m largest eigenvalues of $S_1 - \lambda S_2$.
       2.3.2 if $g(\lambda) > 0$ then $\lambda_1 \leftarrow \lambda$ else $\lambda_2 \leftarrow \lambda$.
       2.3.3 $\lambda \leftarrow (\lambda_1 + \lambda_2)/2$.
   2.4 $W^* = [l_1, \ldots, l_m]$, where $l_1, \ldots, l_m$ are the m eigenvectors corresponding to the m largest eigenvalues of $S_1 - \lambda S_2$.
3. else
   3.1 $W^* = Z[v_1, \ldots, v_m]$, where $v_1, \ldots, v_m$ are the eigenvectors corresponding to the m largest eigenvalues of $Z^T S_1 Z$ and $Z = [z_1, \ldots, z_{n-r}]$ contains the eigenvectors corresponding to the $n - r$ zero eigenvalues of $S_2$.
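The following is a minimal NumPy sketch of the Table 1 procedure (the solve_trace_ratio function assumed in the earlier LRKL sketch). It follows the binary-search branch for the case m > n − r and the null-space branch otherwise; it is an illustration of the published algorithm, not the authors' code.

```python
import numpy as np

def solve_trace_ratio(S1, S2, m, eps=1e-4):
    """Maximize tr(W^T S1 W) / tr(W^T S2 W) subject to W^T W = I (Table 1)."""
    n = S1.shape[0]
    eig2, vec2 = np.linalg.eigh(S2)
    tol = 1e-10 * max(eig2.max(), 1.0)
    r = int(np.sum(eig2 > tol))                        # numerical rank of S2
    if m <= n - r:
        # Null-space branch: restrict S1 to the zero-eigenvalue subspace of S2.
        Z = vec2[:, eig2 <= tol]
        eig1, vec1 = np.linalg.eigh(Z.T @ S1 @ Z)
        return Z @ vec1[:, np.argsort(eig1)[::-1][:m]]
    # Binary-search branch: find lambda with g(lambda) = 0.
    a = np.sort(np.linalg.eigvalsh(S1))[::-1][:m]      # m largest eigenvalues of S1
    b = np.sort(np.linalg.eigvalsh(S2))[:m]            # m smallest eigenvalues of S2
    lam1 = np.trace(S1) / np.trace(S2)                 # initial lower bound
    lam2 = a.sum() / b.sum()                           # initial upper bound
    lam = 0.5 * (lam1 + lam2)
    while abs(lam1 - lam2) > eps:
        g = np.sort(np.linalg.eigvalsh(S1 - lam * S2))[::-1][:m].sum()
        if g > 0:
            lam1 = lam                                 # lambda below optimum: raise lower bound
        else:
            lam2 = lam                                 # lambda above optimum: lower upper bound
        lam = 0.5 * (lam1 + lam2)
    eigf, vecf = np.linalg.eigh(S1 - lam * S2)
    return vecf[:, np.argsort(eigf)[::-1][:m]]         # m top eigenvectors of S1 - lambda* S2
```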

Table 3. Spectral clustering methods in a general form (the matrix S used in problem (11)).

Ratio associative: S = W
Ratio cut [7]: S = -L
NCut [37]: S = D^{-1/2} W D^{-1/2}


Although the eigen-decomposition of a dense matrix of size $n \times n$ is $O(n^3)$, sparse spectral decomposition can be performed more quickly [5,31]. Since the number of constraints is usually very low compared to the whole number of pairwise relations (i.e., $n^2$), the matrices $E_D$ and $E_S$ will be sparse. Moreover, since the matrix $W$ in (1) contains at most $2kn$ nonzero elements ($k$ is a constant showing the number of nearest neighbors in the neighborhood graph) and $D$ is diagonal, $L = I - D^{-1/2} W D^{-1/2}$ is also a sparse matrix. Therefore, $E_D - \lambda (E_S + \alpha L)$ is usually a sparse matrix. Additionally, we only need to find the $c$ eigenvectors corresponding to the $c$ largest eigenvalues.
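As an illustration of this point, the few leading eigenpairs of the sparse matrix $E_D - \lambda(E_S + \alpha L)$ can be obtained with an iterative sparse eigensolver instead of a full dense decomposition; the sketch below uses SciPy's eigsh and our own variable names.

```python
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def top_eigenvectors_sparse(E_D, E_S, L, lam, alpha, c):
    """Return the c eigenvectors of E_D - lam*(E_S + alpha*L) with the
    largest eigenvalues, exploiting the sparsity of all three matrices."""
    M = sp.csr_matrix(E_D) - lam * (sp.csr_matrix(E_S) + alpha * sp.csr_matrix(L))
    vals, vecs = eigsh(M, k=c, which='LA')   # 'LA': largest algebraic eigenvalues
    return vecs
```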

As mentioned in Section 1, the SimpleNPKL method has recently been introduced as an efficient non-parametric kernel learning method that shows results competitive with Hoi et al.'s NPK method [19] on most data sets when it uses a squared hinge loss optimization problem. However, SimpleNPKL (in this case) is just 10 times faster than Hoi et al.'s method (using a standard SDP solver) according to Ref. [52].

2.5. Relation between spectral clustering and low-rank kernel learning

In this section, we show that the spectral clustering methods can be considered as a special form of low-rank kernel learning. As shown in Ref. [27], the optimization problem of the spectral clustering methods can be formulated as the following general problem:

$$\max_{F^T F = I} \ \mathrm{tr}(F^T S F), \qquad (11)$$

where $F \in \mathbb{R}^{n \times c}$ shows the normalized assignment of the data points to the clusters and $S$ is a similarity matrix. In Table 3, the matrix $S$ in (11) for some spectral clustering methods has been shown. Here, $W$ shows the weight matrix containing the similarities of the data points, $L = D - W$ is the Laplacian matrix, and $D = \mathrm{diag}(d_1, \ldots, d_n)$ denotes a diagonal matrix whose entries are the sums of the rows (or columns) of the symmetric weight matrix $W$ (i.e., $d_i = \sum_{j=1}^{n} w_{ij}$).

We interpret the problem (11) for spectral clustering as a special form of the low-rank kernel learning problem. If we set $K = FF^T$, the problem (11) can be reformulated as

$$\max_{K = FF^T, \ F^T F = I_c} \ \mathrm{tr}(KS). \qquad (12)$$

Therefore, we seek a low-rank kernel matrix $K$ that has the largest alignment $\left( \sum_{i,j} (K)_{ij} (S)_{ij} \right)$ with the similarity matrix $S$. Indeed, a low-rank kernel matrix that closely matches the similarity matrix (and also satisfies the constraint $F^T F = I_c$) is found. The obtained kernel can be used in the kernel k-means clustering algorithm to find the final clusters. Indeed, the discretization step in the spectral clustering methods can be interpreted as using the learned kernel in the clustering process.

3. Experimental results

In this section, we conduct some experiments to evaluate the performance of our method. We show results of the proposed method on some synthetic and real-world data sets and compare them with the results of some recently introduced metric learning methods.

3.1. Experimental setup

In our evaluations, the first metric learning method is the recently introduced method by Xiang et al. [44], which is the most effective one among the existing linear metric learning methods. In addition to this linear method, we compare our method with some recent non-linear metric learning methods. Kernel-b [48] and Li and Liu's method [34] are included as kernel learning methods that consider the target kernel matrix as a linear combination of base matrices and learn the combination coefficients. Kernel-A [48] and Hoi et al.'s NPK method [19] are included as non-parametric kernel learning methods. According to the results obtained in Ref. [52], the performance of the SimpleNPKL method is usually at most comparable to that of NPK [19]. Therefore, we have not included SimpleNPKL [52] in our evaluations and used NPK instead.

Fig. 1. Synthetic data sets: (a) Data set 1 (DS1), (b) Data set 2 (DS2), (c) Data set 3 (DS3), (d) Data set 4 (DS4), (e) Data set 5 (DS5) [19] and (f) Data set 6 (DS6) [19].

Fig. 2. The average Rand index values vs. the number of constraints for different methods on the synthetic data sets shown in Fig. 1: (a) DS1, (b) DS2, (c) DS3, (d) DS4, (e) DS5 and (f) DS6.

We use the Euclidean distance (without metric learning) for the baseline comparison, as in Refs. [1,8,44,45,47–49]. To compare the efficiency of different metric learning methods, we apply the k-means clustering algorithm on the metrics learned by these methods. For the kernel-based methods, we apply the kernel k-means algorithm on the obtained kernels. Thus, the performance of our metric learning algorithm is evaluated by comparing the following algorithms (the short terms inside parentheses will be used subsequently):

(1) k-means without metric learning (Euclidean);
(2) normalized cut as a spectral clustering algorithm (NCut)¹;
(3) k-means with Xiang et al.'s metric learning method [44] (Xiang's);
(4) kernel k-means with the kernel obtained by the kernel-b method [48] (kernel-b);
(5) kernel k-means with the kernel learned by the kernel-A method [48] (kernel-A);
(6) kernel k-means with the kernel learned by Hoi et al.'s method² [19] (NPK);
(7) kernel k-means with the kernel learned by Li and Liu's method³ [34] (CCSKL);
(8) kernel k-means with the low-rank kernel learned by the proposed method (LRKL).

¹ The MATLAB code of NCut is available at: http://www.cis.upenn.edu/~jshi/software.
² The MATLAB code of Hoi et al.'s method [19] is available at: http://www.cais.ntu.edu.sg/~chhoi/paper_pdf/NPK_code.zip.
³ The MATLAB code of Li and Liu's method [34] was obtained from the authors of this study.

It must be mentioned that NCut is related to kernel k-means with a specified kernel such as the Gaussian [11].

For the proposed method, we set the number of nearest neighbors to k = 5 and the regularization parameter to α = 0.2. Parameters of the other methods are set to the values specified in the corresponding studies. Unfortunately, there is no discussion about the initial kernel used for the kernel-A and kernel-b methods in Ref. [48]. However, these methods are sensitive to the parameter of the RBF kernel (used as the initial kernel). Thus, we have tried to specify this parameter as well as possible by setting it according to the average of the squared distances between data points. Additionally, Yeung and Chang [48] have specified a range from which the best value for the parameter r of the kernel-A method must be chosen. However, they have not used an automatic technique to find the best value for each individual data set, and it seems that they can fit these parameters for each data set manually. We set r = 1 as an appropriate value in the range specified in Ref. [48] for all data sets. CCSKL [34] is also sensitive to the initial kernel, and the authors of Ref. [34] have formed 9 candidate initial kernels (with different sigma parameters) for each data set and select the one that gives the best result according to the labels of the data. However, parameter selection in this manner is not reasonable, and we use the middle value of the sigma parameter.

Fig. 3. The average NMI values vs. the number of constraints for different methods on the synthetic data sets shown in Fig. 1: (a) DS1, (b) DS2, (c) DS3, (d) DS4, (e) DS5 and (f) DS6.

Table 4. Properties of the UCI data sets used in our experiments.

Data set n c d
Soybean 47 4 35
Protein 116 6 20
Iris 150 3 4
Wine 178 3 13
Sonar 208 2 60
Glasses 214 6 10
Heart 270 2 13
Breast cancer 569 2 31
Balance 625 3 4
We evaluate the results using 5-fold cross validation. Thepairwise constraints are selected on training set and the resultsof clustering are reported only on the validation set. The numberof pairwise similarity and dissimilarity constraints is set to beequal nc¼ 9S9¼ 9D9. Since results depend on the selected S and D

sets, 10 different S and D sets are generated for each number ofconstraints and results are evaluated on them. In addition, we runthe clustering algorithms 20 times with different random initi-alizations for each set of constraints. All clustering algorithms runwith the same random set of initial cluster centers in eachexperiment trial. All of the data sets are normalized before usein the clustering algorithms. Each feature is normalized to zeromean and unit standard deviation.

3.2. Performance measure

To measure the performance of clustering algorithms in ourexperiments, we use the Rand index as the most widely usedmeasure for evaluating the performance of metric learning

Page 7: Learning low-rank kernel matrices for constrained clustering

M.S. Baghshah, S.B. Shouraki / Neurocomputing 74 (2011) 2201–2211 2207

methods for clustering purposes [8,44,45,47–49]. Rand indexreflects how well the clustering results agree with the groundtruth clusters. Let ns be the number of data pairs that are assignedto the same cluster, both in the ground truth and the resultantclustering and nd be the number of data pairs that are assigned todifferent clusters both in the ground truth and the resultantclustering. The Rand index is defined as RI¼ 2ðnsþndÞ=ðnðn�1ÞÞ.This index will favor assigning data points to different clusterswhen there are more than two clusters [8,45]. Thus, as in[8,48,49] we use the modified Rand index introduced in Ref.[45] such that the matched pairs and mismatched pairs areassigned weights to give them equal chances of occurrence (0.5)

RIðC,CÞ ¼0:5�

Pi4 jdðci ¼ cj4ci ¼ cjÞP

i4 jdðci ¼ cjÞþ

0:5�P

i4 jdðciacj4cia cjÞPi4 jdðcia cjÞ

,

ð13Þ

where d(.) is an indicator (i.e., dðtrueÞ ¼ 1 and dðfalseÞ ¼ 0), ci is thecluster to which xi is assigned by the clustering algorithm, and ci

is the correct cluster assignment.In addition to the Rand index, we also use the Normalized

Mutual Information (NMI) to measure the amount of statisticalinformation shared by random variables corresponding to theobtained clustering and the target one [26]. Let C ¼ fC1,. . .,Ccg

represent the target clustering,_C ¼ f

_C 1,. . .,

_C kg show the result of

the clustering algorithm, 9Ci9¼ ni, 9_C i9¼ n0i, and nij ¼ 9Ci \

_C i9.

Fig. 4. The curves of the average Rand index values vs. the number of constraints for d

(e) Sonar, (f) Heart, (g) Glasses, (h) Breast cancer and (i) Balance.

The NMI measure that is calculated as

NMIðC,CÞ ¼

Pci ¼ 1

Pkj ¼ 1 nij logðn� nij=ni � n0jÞffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiPc

i ¼ 1 ni logðni=nÞ� � Pk

j ¼ 1 n0j logðn0j=nÞ� �r , ð14Þ

measures how closely the clustering algorithm could constructthe underlying label distribution in the data [3].

3.3. Experiments on synthetic data sets

First, we perform experiments on some synthetic data sets.Distance metric learning is useful especially when clusters ofdifferent shapes, sizes, data sparseness, and degrees of separationappear in the data set. We generate some data sets containingclusters of somewhat complex structures. Fig. 1 displays thesynthetic data sets used in our evaluations. In this figure, thedata points of the same class have been displayed with the samecolor and style. Figs. 2 and 3 show the performance curves ofdifferent methods on the synthetic data sets according to theRand index and NMI measures respectively. According tothese figure, the standard k-means clustering algorithmcannot find appropriate clusters on these data sets. Xiang’s as alinear method cannot also yield appropriate results. CCSKL,kernel-A, and kernel-b as non-linear methods give betterresults than Xiang’s on some data sets. Nonetheless, they arenot able to find proper results on these data sets. NPK and LRKL

ifferent methods on the UCI data sets: (a) Soybean, (b) Protein, (c) Iris, (d) Wine,

Page 8: Learning low-rank kernel matrices for constrained clustering

M.S. Baghshah, S.B. Shouraki / Neurocomputing 74 (2011) 2201–22112208

perform well on all of the synthetic data sets. Indeed, the resultsof our LRKL method on these data sets are as well as thoseof NPK while our method needs much lower computational

Fig. 5. The curves of the average NMI values vs. the number of constraints for different

(f) Heart, (g) Glasses, (h) Breast cancer and (i) Balance.

soybean

proteinW

ine IrisG

lasses Sonar

Heart

Ionosphere

Breast

Balance

Diabetes

Fig. 6. Time cost of NPK and LRKL methods on UCI data sets with different numb

complexity and requires to learn a smaller size matrix (LRKLlearns an n� c matrix as opposed to the NPK that learns an n�n

matrix).

methods on the UCI data sets: (a) Soybean, (b) Protein, (c) Iris, (d) Wine, (e) Sonar,

soybean

proteinW

ine IrisG

lasses Sonar

Heart

Ionosphere

Breast

Balance

Diabetes

er of data examples (n): (a) Linear time scale and (b) Logarithmic time scale.

Page 9: Learning low-rank kernel matrices for constrained clustering

Table 5Results (mean and variance of Rand index) on some subsets of the MNIST dataset.

Subset Euclidean Xiang’s Kernel-b Kernel-A NPK CCSKL LRKL

{1,2,3} 0.8895 (70.0234) 0.8328 (70.0682) 0.6475 (70.0779) 0.8942 (70.0277) 0.9422 (70.0169) 0.8163 (70.0979) 0.9026 (70.0736)

{4,5,6} 0.7175 (70.0835) 0.6643 (70.0646) 0.5378 (70.0588) 0.8346 (70.0771) 0.9629 (70.0362) 0.7318 (70.1082) 0.9638 (70.0266)

{7,8,9} 0.7216 (70.0224) 0.6955 (70.0504) 0.5785 (70.0743) 0.7152 (70.0248) 0.8050 (70.0499) 0.6891 (70.0424) 0.8168 (70.0412)

{1,2,3,4} 0.8667 (70.0604) 0.7900 (70.0536) 0.5957 (70.0632) 0.8825 (70.0347) 0.8621 (70.0718) 0.7446 (70.0832) 0.8637 (70.0757)

{3,4,5,6} 0.7412 (70.0530) 0.6690 (70.0615) 0.5752 (70.0522) 0.7964 (70.0611) 0.8203 (70.0558) 0.6347 (70.0572) 0.8730 (70.0536)

{6,7,8,9} 0.7757 (70.0381) 0.6950 (70.0537) 0.5627 (70.0470) 0.7779 (70.0257) 0.8067 (70.0269) 0.7378 (70.0392) 0.7984 (70.0322)

Table 6The text data sets used in our experiments.

Data set Source # Documents (n) # Terms (d) # Categories (c)

tr11 TREC 414 6429 9

tr12 TREC 313 5804 8

tr23 TREC 204 5832 6

tr31 TREC 927 10,128 7

tr41 TREC 878 7454 10

tr45 TREC 690 8261 10

K1b WebAce 2340 21,839 6

M.S. Baghshah, S.B. Shouraki / Neurocomputing 74 (2011) 2201–2211 2209

3.4. Experiments on UCI data sets

After evaluating the performance of our method on thesynthetic data sets, we conduct experiments on nine real-worlddata sets obtained from the Machine Learning Repository4 of theUniversity of California, Irvine (UCI). Table 4 shows the propertiesof these data sets. The three columns of the table show thenumber of data points n, the number of classes c, and the numberof attributes d for each data set.

In Fig. 4, the average Rand index of each method (overdifferent sets of constraints and different runs of the clusteringalgorithm as explained in Section 3.1) vs. the number of con-straints has been displayed. As we can see in Fig. 4, our methodgenerally yields better results than the other methods. By com-paring LRKL with NPK, one of the most effective learning methodsas mentioned in Ref. [52], we find that results of our method isbetter than those of NPK on four out of the nine data sets andcomparable to it on three data sets. Moreover, the proposedmethod is much more computationally efficient according to thediscussion in Section 2.4. In addition to the Rand index curves, theperformance curves obtained using the NMI measure have alsobeen shown in Fig. 5. By comparing Figs. 4 and 5, we can see thatthe NMI measure shows similar results to the Rand index on mostdata sets.

According to the above figures that show the performance ofdifferent methods on UCI and synthetic data sets, generally the onlycomparable method to the LRKL method is NPK. In Fig. 6, we havedisplayed the computational time required for these two methods(both in linear and logarithmic time scale) on different UCI data sets.NPK is computationally prohibitive for data sets with more numberof data samples than the examined ones. In Fig. 6, the numbers ofexamples in the UCI data sets have been specified by the name ofthe corresponding data set below them. For each method, thetime on a data set has been averaged on different runs correspond-ing to different random sets of constraints and different numbersof constraints (nc¼ 10,20,. . .,90,100). Fig. 6 shows that LRKL ismuch faster than NPK. Moreover, according to the large differencebetween the time complexity of LRKL and NPK, we can concludethat LRKL method is much faster than SimpleNPKL [52] (using asquared hinge loss optimization problem) that is just 10 times fasterthan NPK as mentioned in Ref. [52].

3.5. Experiments on MNIST digits

The MNIST5 data set containing handwritten digits. There are60 000 digit images in the training set of this data set. Digits havebeen centered and normalized to 28�28 gray-scale images in theMNIST dataset. In our experiments, we choose 200 imagesrandomly for each digit. We set the number of constraints to9S9¼ 9D9¼ 30 for all subsets experimented in this subsection.

4 http://archive.ics.uci.edu/ml/.5 Available at: http://yann.lecun.com/exdb/mnist.

Table 5 shows the results of different methods on some subsets ofthe MNIST data set including samples of three or four of digits.

For each algorithm, the mean and the standard deviation of theRand index values over different runs (corresponding to differentsets of constraints and different initializations of the clusteringalgorithms) have been shown in Table 5. From these results, wecan see that the performance of the proposed LRKL method iscomparable to that of NPK method.

3.6. Experiments on document clustering

As the last set of experiments, we evaluate results on somewidely used benchmark data sets in data mining field that havebeen used in studies like [21]. Among the methods mentioned inSection 3.1, we compare our LRKL method with CCSKL methodthat can be implemented on text data sets in a reasonable timeand space complexity. We used the cosine distance as the basedistance in LRKL and CCSKL methods and compare the followingmethods:

(1)

6

kernel k-means with the kernel obtained by cosine similarity(Cosine);

(2)

kernel k-means with the kernel learned by Li and Liu’smethod [34] (CCSKL);

(3)

kernel k-means with the low-rank kernel learned by theproposed method (LRKL).

Table 6 shows the properties of the text data sets used in ourevaluations. tr11, tr12, tr23, tr31, tr41, tr45 have been derivedfrom TREC6 sources. These preprocessed data sets are included inCLUTO toolkit Karypis [24]. The k1b dataset comes from WebAceproject and contains 2340 documents consisting of news articlesfrom Reuters news service via the web in October 1997.

We consider two cases of ‘‘little’’ and ‘‘much’’ side informationsimilar to many of the existing studies [19,44,45,52]. For the‘‘little’’ case, we randomly select data pairs such that the numberof the resulted connected components from must-link constraintsis roughly equal to 90% of the dataset size. Similarly, the numberof the resulted connected components for the ‘‘much’’ case mustbe roughly equal to 70%. Table 7 displays the mean and variance

Available at: http://tre.nist.grov.

Page 10: Learning low-rank kernel matrices for constrained clustering

Table 7Results (mean and variance of Rand index) on the text data sets.

Subset Cosine CCSKL (little) LRKL (little) CCSKL (much) LRKL (much)

tr11 0.7568 (70.0433) 0.6849 (70.0643) 0.7644 (70.0581) 0.7344 (70.0651) 0.7942 (70.0560)

tr12 0.6914 (70.0473) 0.5873 (70.0339) 0.6381 (70.0613) 0.6178 (70.0437) 0.6923 (70.0622)

tr23 0.5957 (70.0428) 0.5626 (70.0396) 0.6352 (70.0596) 0.5704 (70.0429) 0.6874 (70.0582)

tr31 0.6955 (70.0480) 0.6905 (70.0596) 0.8285 (70.0441) 0.6909 (70.0567) 0.8774 (70.0362)

tr41 0.7039 (70.0426) 0.5799 (70.0328) 0.7977 (70.0455) 0.6467 (70.0481) 0.8510 (70.0437)

tr45 0.6865 (70.0391) 0.5866 (70.0310) 0.7316 (70.0441) 0.6568 (70.0476) 0.7990 (70.0430)

k1b 0.7240 (70.0377) 0.7080 (70.1441) 0.8576 (70.0431) 0.7145 (70.1407) 0.8623 (70.0372)

M.S. Baghshah, S.B. Shouraki / Neurocomputing 74 (2011) 2201–22112210

of the Rand index values for the cases of ‘‘little’’ and ‘‘much’’ sideinformation for the methods listed above. According to theseresults, LRKL method shows better results than CCSKL and alsooutperforms Cosine except to on tr12 for the ‘‘little’’ side informa-tion. It must be mentioned that the performance of the CCSKL andLRKL methods has been improved by increasing the number ofconstraints.

4. Conclusion and future work

In this paper, we have proposed a kernel learning method thatallows flexibility for matching the kernel matrix with the sideinformation and the structure of the data while it controls thenumber of free parameters and also the computational complexityof the non-parametric kernel learning. Indeed, a new low-rankkernel learning method for constrained clustering has been pro-posed. In this method, pairwise (must-link and cannot-link) con-straints are incorporated along with the structure of the data to findan optimal low-rank kernel matrix. We formulated the proposedmethod as an orthogonally constrained trace ratio optimizationproblem that can be solved more efficiently than the existing non-parametric kernel learning methods. The performance of ourmethod is at-least as well as that of some existing non-parametrickernel learning methods while it is more efficient and applicable tolarger scale data sets. Additionally, the proposed kernel learningmethod has a smaller number of free parameters and it is less proneto over-fitting. Besides, we showed a relation between spectralclustering and low-rank kernel learning methods.

References

[1] A. Bar-Hillel, T. Hertz, N. Shental, D. Weinshall, Learning a Mahalanobismetric from equivalence constraints, Journal of Machine Learning Research 6(2005) 937–965.

[2] S. Basu, M. Bilenko, R.J. Mooney, A probabilistic framework for semi-supervised clustering, in: Proceedings of the Tenth ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining, 2004,pp. 59–68.

[3] S. Basu, Semi-supervised clustering: probabilistic models, algorithms andexperiments, Ph.D. Dissertation, University of Texas at Austin, 2005.

[4] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction anddata representation, Neural Computation 15 (6) (2003) 1373–1396.

[5] S. Boyd, L. Vandenberghe, Convex optimization, Cambridge University Press,2004.

[6] D. Cai, H. Xiaofei, H. Jiawei, Semi-supervised discriminant analysis, in:Proceedings of the 11th IEEE International Conference on Computer Vision(ICCV), 2007, pp. 1–7.

[7] P. Chan, M. Schlag, J. Zien, Spectral k-way ratio cut partitioning, IEEETransactions CAD Integrated Circuits and Systems 13 (1994) 1088–1096.

[8] H. Chang, D.Y. Yeung, Locally linear metric adaptation with application tosemi-supervised clustering and image retrieval, Pattern Recognition 39(2006) 1253–1264.

[9] K. Crammer, J. Keshet, Y. Singer, Kernel design using boosting, in: Proceedingsof Advances in Neural Information Processing Systems, vol. 15, MIT Press,2003, pp. 537–544.

[10] N. Cristianini, J. Kandola, A. Elisseeff, J. Shawe-Taylor, On kerneltarget alignment, in: Proceedings of Advances in Neural Information Proces-sing Systems, vol. 14, MIT Press, 2002, pp. 367–373.

[11] I.S. Dhillon, Y. Guan, B. Kulis, Kernel k-means, spectral clustering andnormalized cuts, in: Proceedings of the Tenth ACM SIGKDD Conference onKnowledge Discovery and Data Mining (KDD), 2004, pp. 551–556.

[12] C. Domeniconi, J. Peng, D. Gunopulos, Locally adaptive metric nearestneighbor classification, IEEE Transactions on Pattern Analysis and MachineIntelligence 24 (9) (2002) 1281–1285.

[13] C. Domeniconi, D. Gunopulos, J. Peng, Large margin nearest neighborclassifiers, IEEE Transactions on Neural Networks 16 (4) (2005) 899–909.

[14] A.L.N. Fred, A.K. Jain, Combining multiple clusterings using evidence accu-mulation, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6) (2005) 835–850.

[15] J.H. Friedman, Flexible metric nearest neighbor classification, TechnicalReport, Statistics Department, Stanford University, 1994.

[16] K. Fukunaga, T.E. Flick, An optimal global nearest neighbor metric, IEEETransactions on Pattern Analysis and Machine Intelligence 6 (3) (1984)314–318.

[17] T. Hastie, R. Tibshirani, Discriminant adaptive nearest neighbor classification, IEEETransactions on Pattern Analysis and Machine Intelligence 18 (6) (1996) 607–616.

[18] S.C.H. Hoi, W. Liu, M.R. Lyu, W.-Y. Ma, Learning distance metrics withcontextual constraints for image retrieval, in: Proceedings of the IEEEComputer Society Conference on Computer Vision and Pattern Recognition(CVPR), Oregon State University, Corvallis, USA, 2006, pp. 2072–2078.

[19] S.C.H. Hoi, R. Jin, M.R. Lyu, Learning nonparametric kernel matrices frompairwise constraints, in: Proceedings of the 24th International Conference onMachine Learning (ICML), New York, USA, 2007, pp. 361–368.

[20] S.C.H. Hoi, W. Liu, S.-F. Chang, Semi-supervised distance metric learning forcollaborative image retrieval, in: Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern Recognition (CVPR),2008, pp. 1–7.

[21] G. Hu, S. Zhou, J. Guan, X. Hu, Toward effective document clustering: aconstrained k-means based approach, Information processing and manage-ment 44 (2008) 1397–1409.

[22] Y. Jia, F. Nie, C. Zhang, Trace ratio problem revisited, IEEE Transactions onNeural Networks 20 (4) (2009) 729–735.

[23] S. Kamvar, D. Klein, C.D. Manning, Spectral learning, in: Proceedings of the18th International Joint Conference on Artificial Intelligence (IJCAI), 2003,pp. 561–566.

[24] G. Karypis, CLUTO—a clustering toolkit, Technical Report 02-017, Depart-ment of Computer Science, University of Minnesota, 2002.

[25] D. Klein, S.D. Kamvar, C. Manning, From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering, in:Proceedings of the 19th International Conference on Machine Learning(ICML), Sydney, Australia, 2002, pp. 307–314.

[26] B. Kulis, M. Sustik, I. Dhillon, Learning low-rank kernel matrices, In:Proceedings of the 23th International Conference on Machine Learning(ICML), Pittsburg, PA, 2006, pp. 505–512.

[27] B. Kulis, S. Basu, I. Dhillon, Semi-supervised graph clustering: a kernelapproach, Machine Learning 74 (1) (2009) 1–22.

[28] B. Kulis, S. Basu, I. Dhillon, Low-rank kernel learning with Bregman matrixdivergences, Journal of Machine Learning Research 10 (2009) 341–376.

[29] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, M.I. Jordan, Learningthe kernel matrix with semi-definite programming, Journal of MachineLeaning Research 5 (1) (2004) 27–72.

[30] M.H.C. Law, Clustering, dimensionality reduction, and side information, Ph.D.Dissertation, Michigan University, 2006.

[31] H. Voss, Numerical methods for sparse nonlinear eigenvalue problem, TechnicalReport, Department of Mathematics, Hamburg University of Technology, 2003.

[32] F. Li, J. Yang, J. Wang, A transductive framework of distance metric learningby spectral dimensionality reduction, in: Proceedings of the 24th Interna-tional Conference on Machine Learning (ICML), Corvallis, OR, USA, 2007,pp. 513–520.

[33] Z. Li, J. Liu, X. Tang, Pairwise constraint propagation by semidefiniteprogramming for semi-supervised classification, in: Proceedings of the 25thInternational Conference on Machine Learning (ICML), 2008, pp. 576–583.

[34] Z. Li, J. Liu, Constrained clustering by spectral kernel learning, in: Proceedingsof the IEEE International Conference on Computer Vision (ICCV), 2009.

[35] Z. Li, J. Liu, X. Tang, Constrained clustering via spectral regularization, in:Proceedings of the IEEE Computer Society Conference on Computer Visionand Pattern Recognition (CVPR), 2009, pp. 421–428.

Page 11: Learning low-rank kernel matrices for constrained clustering

M.S. Baghshah, S.B. Shouraki / Neurocomputing 74 (2011) 2201–2211 2211

[36] D.G. Lowe, Similarity metric learning for a variable-kernel classifier, NeuralComputation 7 (1) (1995) 72–85.

[37] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactionson Pattern Analysis and Machine Intelligence 22 (8) (2000) 888–905.

[38] R.D. Short, K. Fukunaga, The optimal distance measure for nearest neighborclassification, IEEE Transactions on Information Theory 27 (5) (1981) 622–627.

[39] M. Soleymani Baghshah, S. Bagheri Shouraki, Semi-supervised metric learn-ing using pairwise constraints, in: Proceedings of the 21st International JointConference on Artificial Intelligence (IJCAI), 2009, pp. 1217–1225.

[40] M. Soleymani Baghshah, S. Bagheri Shouraki, Metric learning for semi-supervised clustering using pairwise constraints and the geometrical struc-ture of data, Intelligent Data Analysis 13 (6) (2009) 887–899.

[41] M. Soleymani Baghshah, S. Bagheri Shouraki, Kernel-based metric learningfor semi-supervised clustering, Neurocomputing 73 (2010) 1352–1361.

[42] K.Q. Weinberger, Metric learning with convex optimization, Ph.D. Disserta-tion, University of Pennsylvania, 2007.

[43] L. Wu, R. Jin, S.C.H. Hoi, J. Zhu, N. Yu, Learning Bregman distance functionsand its application for semi-supervised clustering, Advances in NeuralInformation Processing Systems, MIT Press, Cambridge, MA, USA, 2009.

[44] S. Xiang, F. Nie, C. Zhang, Learning a Mahalanobis distance metric for dataclustering and classification, Pattern Recognition 41 (12) (2008) 3600–3612.

[45] E.P. Xing, A.Y. Ng, M.I. Jordan, S. Russell, Distance metric learning withapplication to clustering with side information, in: Proceedings of Advancesin Neural Information Processing Systems, vol. 15, MIT Press, Cambridge, MA,USA, 2003, pp. 505–512.

[46] L. Yang, R. Jin, Distance metric learning: a comprehensive survey, TechnicalReport, Michigan State University, 2006.

[47] D.Y. Yeung, H. Chang, Extending the relevant component analysis algorithmfor metric learning using both positive and negative equivalence constraints,Pattern Recognition 39 (2006) 1007–1010.

[48] D.Y. Yeung, H. Chang, A Kernel approach for semi-supervised metric learning,IEEE Transactions on Neural Networks 18 (1) (2007) 141–149.

[49] D.Y. Yeung, H. Chang, G. Dai, A scalable kernel-based semi-supervised metriclearning algorithm with out-of-sample generation ability, Neural Computa-tion 20 (11) (2008) 2839–2861.

[50] B. Yan, C. Domeniconi, Kernel optimization using pairwise constraints forsemi-supervised clustering, Advances in Neural Information ProcessingSystems, MIT Press, Cambridge, MA, USA, 2009.

[51] X. Yin, S. Chen, E. Hu, D. Zhang, Semi-supervised clustering with metriclearning: an adaptive kernel method, Pattern Recognition 43 (2010)1320–1333.

[52] J. Zhuang, I.W. Tsang, S.C.H. Hoi, Simple NPKL: simple non-parametric kernellearning, in: Proceedings of the 26th International Conference on MachineLearning (ICML), Montreal, Canada, 2009.

Mahdieh Soleymani Baghshah received her B.S.,M.Sc., and Ph.D. degrees from Department of ComputerEngineering, Sharif University of Technology, Iran, in2003, 2005, and 2010. Her research interests includemachine learning and pattern recognition with pri-mary emphasis on semi-supervised learning andclustering.

Saeed Bagheri Shouraki received his B.Sc. in ElectricalEngineering and M.Sc. in Digital Electronics fromSharif University of Technology, Tehran, Iran, in 1985and 1987. He joined soon to Computer EngineeringDepartment of Sharif University of Technology as afaculty member. He received his Ph.D. on fuzzy controlsystems from Tsushin Daigaku (University of Electro-Communications), Tokyo, Japan, in 2000. He continuedhis activities in Computer Engineering Department upto 2008. He is currently an Associate Professor inElectrical Engineering Department of Sharif Universityof Technology. His research interests include control,

robotics, artificial life, and soft computing.