

arXiv:1501.00037v1 [cs.LG] 30 Dec 2014

Discriminative Clustering with Relative Constraints

Yuanli Pei∗, Xiaoli Z. Fern∗, Rómer Rosales†, and Teresa Vania Tjahja∗

∗School of EECS, Oregon State University. Email: {peiy, xfern, tjahjat}@eecs.oregonstate.edu

†LinkedIn. Email: [email protected]

Abstract—We study the problem of clustering with relative constraints, where each constraint specifies relative similarities among instances. In particular, each constraint $(x_i, x_j, x_k)$ is acquired by posing a query: is instance $x_i$ more similar to $x_j$ than to $x_k$? We consider the scenario where answers to such queries are based on an underlying (but unknown) class concept, which we aim to discover via clustering. Different from most existing methods that only consider constraints derived from yes and no answers, we also incorporate don't know responses. We introduce a Discriminative Clustering method with Relative Constraints (DCRC) which assumes a natural probabilistic relationship between instances, their underlying cluster memberships, and the observed constraints. The objective is to maximize the model likelihood given the constraints, and at the same time enforce cluster separation and cluster balance by also making use of the unlabeled instances. We evaluated the proposed method using constraints generated from ground-truth class labels, and from (noisy) human judgments collected in a user study. Experimental results demonstrate: 1) the usefulness of relative constraints, in particular when don't know answers are considered; 2) the improved performance of the proposed method over state-of-the-art methods that utilize either relative or pairwise constraints; and 3) the robustness of our method in the presence of noisy constraints, such as those provided by human judgment.

I. INTRODUCTION

Unsupervised clustering can be improved with the aid of side information about the task at hand. In general, side information refers to knowledge beyond the instances themselves that can help infer the underlying instance-to-cluster assignments. One common and useful type of side information is instance-level constraints that expose instance-level relationships.

Previous work has primarily focused on the use of pairwise constraints (e.g., [1]–[11]), where a pair of instances is indicated to belong to the same cluster by a Must-Link (ML) constraint or to different clusters by a Cannot-Link (CL) constraint. More recently, various studies [12]–[17] have suggested that domain knowledge can also be incorporated in the form of relative comparisons or relative constraints, where each constraint specifies whether instance $x_i$ is more similar to $x_j$ than to $x_k$.

We were motivated to focus on relative constraints for two reasons. First, the labeling (proper identification) of relative constraints by humans appears to be more reliable than that of pairwise constraints. Research in psychology has revealed that people are often inaccurate in making absolute judgments (required for pairwise constraints), but they are more trustworthy when judging comparatively [18]. Consider one of our applications, where we would like to form clusters of bird song syllables based on spectrogram segments from recorded sounds. Figures 1(a) and 1(b) show examples of the two types of constraints/questions considered. In the examples, syllable 1 in both figures and syllable 3 in 1(b) are from the same singing pattern, while syllable 2 in both figures belongs to a different one. From the figures, it is apparent that making an absolute judgment for the pairwise constraint in 1(a) is more difficult. In contrast, the comparative question for labeling the relative constraint in 1(b) is much easier to answer. Second, since each relative constraint includes information about three instances, relative constraints tend to be more informative than pairwise constraints (even when several pairwise constraints are considered). This is formally characterized in Section II-A.

In the area of learning from relative constraints, most work uses metric learning approaches [12]–[16]. Such approaches assume that there is an underlying metric that determines the outcome of the similarity comparisons, and the goal is to learn that metric. The learned metric is often later used for clustering (e.g., via Kmeans or related approaches). In practice, however, we may not have access to an oracle metric. Often the constraints are provided such that instances from the same class are considered more similar than those from different classes. This paper explicitly considers such scenarios, where constraints are provided based on the underlying class concept. Unlike the metric-based approaches, we aim to directly infer an optimal clustering of the data from the provided relative comparisons, without requiring explicit metric learning.

Formally, we regard each constraint $(x_i, x_j, x_k)$ as being obtained by asking: is $x_i$ more similar to $x_j$ than to $x_k$? The answer is provided by a user/oracle based on the underlying instance clusters. In particular, a yes answer is given if $x_i$ and $x_j$ are believed to belong to the same cluster while $x_k$ is believed to be from a different one. Similarly, the answer will be no if it is believed that $x_i$ and $x_k$ are in the same cluster, which is different from the one containing $x_j$. Note that for some triplets it may not be possible to provide a yes or no answer, for example, if the three instances belong to the same cluster, as shown in Figure 1(c), or if each of them belongs to a different cluster, as shown in Figure 1(d). Such cases have been largely ignored by prior studies. Here, we allow the user to provide a don't know answer (dnk) when yes/no cannot be determined. Such dnk's not only allow for improved labeling flexibility, but also provide useful information about instance clusters that can help improve clustering, as will be demonstrated in Section II-A and in the experiments.

In this work, we introduce a discriminative clustering method, DCRC, that learns from relative constraints with yes, no, or dnk labels (Section III).


(a) Pairwise Const. (b) Relative Const. (c) Relative Const. (d) Relative Const.

Fig. 1. Examples for labeling pairwise vs. relative constraints from the Birdsong data. Labeling question for (a): Do syllable 1 and syllable 2 belong to the same cluster? Labeling question for (b), (c), and (d): Is syllable 1 more similar to syllable 2 than to syllable 3? (a) and (b) illustrate cases where relative constraints are easier to label. The cases in (c) and (d) motivate the introduction of a don't know answer for relative constraints.

DCRC uses a probabilistic model that naturally connects the instances, their underlying cluster memberships, and the observed constraints. Based on this model, we present a maximum-likelihood objective with additional terms enforcing cluster separation and cluster balance. Variational EM is used to find approximate solutions (Section IV). In the experiments (Section V), we first evaluate our method on both UCI and additional real-world datasets with simulated noise-free constraints generated from ground-truth class labels. The results demonstrate the usefulness of relative constraints, including don't know answers, and the performance advantage of our method over current state-of-the-art methods for both relative and pairwise constraints. We also evaluate our method with human-labeled noisy constraints collected from a user study, and the results show the superiority of our method over existing methods in terms of robustness to noisy constraints.

II. PROBLEM ANALYSIS

In this section, we first compare the cluster label information obtained by querying different types of constraints, analyzing the usefulness of relative constraints. Then we formally state the problem.

A. Information from Constraints

Here we provide a qualitative analysis with a simplified but illustrative example. Suppose we have $N$ i.i.d. instances $\{x_i\}_{i=1}^N$ sampled from $K$ clusters with even prior $1/K$. Consider a triplet $(x_{t_1}, x_{t_2}, x_{t_3})$ and a pair $(x_{b_1}, x_{b_2})$. Let $Y_t = [y_{t_1}, y_{t_2}, y_{t_3}]^T$ and $Y_b = [y_{b_1}, y_{b_2}]^T$ be their corresponding cluster labels. Let $l_t \in \{yes, no, dnk\}$ and $l'_b \in \{ML, CL\}$ be the labels of the relative and pairwise constraint, respectively. In this example they are determined by

$$l_t = \begin{cases} yes, & \text{if } y_{t_1} = y_{t_2},\ y_{t_1} \neq y_{t_3} \\ no, & \text{if } y_{t_1} = y_{t_3},\ y_{t_1} \neq y_{t_2} \\ dnk, & \text{otherwise} \end{cases} \qquad (1)$$

$$l'_b = \begin{cases} ML, & \text{if } y_{b_1} = y_{b_2} \\ CL, & \text{if } y_{b_1} \neq y_{b_2}. \end{cases} \qquad (2)$$

We can derive the mutual information between a relative constraint and the associated instance cluster labels as (see the Appendix for the derivation)

$$I(Y_t; l_t) = 2\log K - (1 - P_{dnk})\log(K-1) - P_{dnk}\log[K^2 - 2(K-1)], \qquad (3)$$

and that for a pairwise constraint as

$$I(Y_b; l'_b) = \log K - P_{CL}\log(K-1), \qquad (4)$$

where $P_{dnk} = 1 - 2(K-1)/K^2$ and $P_{CL} = 1 - 1/K$.

Fig. 2. Mutual information between instance cluster labels and constraint labels as a function of the number of clusters. Curves: one relative const., one relative YN const., one pairwise const., two relative const., three pairwise const.

Figure 2 plots the values of (3) and (4) as a function of the number of clusters $K$. Comparing the values of one relative const. and one pairwise const., we see that, in the absence of other information, a relative constraint provides more information. One might argue that labeling a triplet requires inspecting more instances than labeling a pair, making this comparison unfair. To address this bias, we compare the information gain from the two types of constraints with the same number of instances, namely, comparing the value of two relative constraints with that of three pairwise constraints, both involving six instances. In Figure 2 we see again that relative constraints are more informative.

Another aspect worth evaluating is the motivation behind explicitly using dnk constraints. In prior work on learning from relative constraints, the constraints are typically generated by randomly selecting triplets and producing constraints based on their class labels. If a triplet cannot be definitively labeled with yes or no, the resulting constraint is not employed by the learning algorithm (it is ignored). Such methods by construction do not use the information provided by dnk answers. However, it is possible to show that in general dnk's can provide information about instance labels. If dnk's are ignored, the mutual information can be computed by replacing $H(Y_t|l_t = dnk)$ with $H(Y_t)$, meaning that the dnk's are treated as uninformative about the instance labels. In this case, we have

$$I'(Y_t; l_t) = 2(1 - P_{dnk})\log K - (1 - P_{dnk})\log(K-1). \qquad (5)$$

Comparing the value of one relative YN const. (which ignores dnk) with that of one relative const. in Figure 2, we see a clear gap between using and not using dnk constraints, implying the informativeness of dnk constraints. Additionally, the number of dnk constraints is usually large, especially when the number of clusters is large. Consider randomly selecting triplets from clusters of equal size. There is a 50% chance of acquiring dnk constraints in two-cluster problems, and the chance increases to 78% in eight-cluster problems. The information provided by such a large number of dnk constraints is substantial. Hence, we believe it is beneficial to explicitly employ and model dnk constraints.
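To make the comparison above concrete, the short Python sketch below (ours, not part of the original paper) evaluates Eqs. (3)–(5) for a range of cluster counts $K$; it assumes only the closed-form expressions given above.

```python
import numpy as np

def info_relative(K):
    """I(Yt; lt) of Eq. (3): one relative constraint with yes/no/dnk answers."""
    P_dnk = 1.0 - 2.0 * (K - 1) / K**2
    return (2 * np.log(K) - (1 - P_dnk) * np.log(K - 1)
            - P_dnk * np.log(K**2 - 2 * (K - 1)))

def info_relative_yn(K):
    """I'(Yt; lt) of Eq. (5): dnk answers treated as uninformative."""
    P_dnk = 1.0 - 2.0 * (K - 1) / K**2
    return 2 * (1 - P_dnk) * np.log(K) - (1 - P_dnk) * np.log(K - 1)

def info_pairwise(K):
    """I(Yb; l'b) of Eq. (4): one pairwise ML/CL constraint."""
    P_CL = 1.0 - 1.0 / K
    return np.log(K) - P_CL * np.log(K - 1)

for K in [2, 4, 8, 16, 32, 64]:
    print(K, info_relative(K), info_relative_yn(K), info_pairwise(K))
```

Comparing `2 * info_relative(K)` with `3 * info_pairwise(K)` reproduces the two-relative vs. three-pairwise comparison of Figure 2.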

B. Problem Statement

Let $X = [x_1, \ldots, x_N]^T$ be the given data, where each $x_i \in \mathbb{R}^d$ and $d$ is the feature dimension. Let $Y = [y_1, \ldots, y_N]^T$ be the hidden cluster label vector, where $y_i$ is the label of $x_i$. With slight abuse of notation, we use $\{(t_1, t_2, t_3)\}_{t=1}^M$ to denote the index set of $M$ triplets, representing $M$ relative constraints. Each $(t_1, t_2, t_3)$ contains the indices of the three instances in the $t$-th constraint. Let $L = [l_1, \ldots, l_M]^T$ be the constraint label vector, where $l_t \in \{yes, no, dnk\}$ is the label of $(x_{t_1}, x_{t_2}, x_{t_3})$. Each $l_t$ specifies the answer to the question: is $x_{t_1}$ more similar to $x_{t_2}$ than to $x_{t_3}$? Our goal is to partition the data into $K$ clusters such that similar instances are assigned to the same cluster, while respecting the given constraints. In this paper, we assume that $K$ is pre-specified.

In the following, we will use $I_t = \{t_1, t_2, t_3\}$ to denote the set of indices in the $t$-th triplet, use $I$ to index all the distinct instances involved in the constraints, i.e., $I = \{1 \le i \le N : i \in \cup_{t=1}^M I_t\}$, and use $U$ to index the instances that are not involved in any constraint.

III. METHODOLOGY

In this section, we introduce our probabilistic model and present the proposed objective functions based on this model.

A. The Probabilistic Model

We propose a Discriminative Clustering model for Relative Constraints (DCRC). Figure 3 shows the proposed probabilistic model for a single relative constraint, defining the dependencies between the input instances $(x_{t_1}, x_{t_2}, x_{t_3})$, their cluster labels $(y_{t_1}, y_{t_2}, y_{t_3})$, and the constraint label $l_t$. For a collection of constraints, a $y$ variable may be connected to more than one constraint label $l$ (if the instance appears in multiple constraints) or to none (if it does not appear in any constraint).

Fig. 3. The dependencies between three instances $(x_{t_1}, x_{t_2}, x_{t_3})$, their cluster labels $(y_{t_1}, y_{t_2}, y_{t_3})$, and the constraint label $l_t$.

TABLE I. DISTRIBUTION OF $P(l_t|Y_t)$, $Y_t = [y_{t_1}, y_{t_2}, y_{t_3}]$.

Cases                                        | $l_t$ = yes     | $l_t$ = no      | $l_t$ = dnk
$y_{t_1} = y_{t_2},\ y_{t_1} \neq y_{t_3}$   | $1 - \epsilon$  | $\epsilon/2$    | $\epsilon/2$
$y_{t_1} = y_{t_3},\ y_{t_1} \neq y_{t_2}$   | $\epsilon/2$    | $1 - \epsilon$  | $\epsilon/2$
otherwise                                    | $\epsilon/2$    | $\epsilon/2$    | $1 - \epsilon$

We use a multi-class logistic classifier to model the conditional probability of the $y$'s given the observed $x$'s. For simplicity, in the following we use the same notation $x$ to represent the $(d+1)$-dimensional augmented vector $[x^T, 1]^T$. Let $W = [w_1, \ldots, w_K]^T$ be a weight matrix in $\mathbb{R}^{K \times (d+1)}$, where each $w_k$ contains weights on the $d$-dimensional feature space and an additional bias term. The conditional probability is then

$$P(y = k \mid x; W) = \frac{\exp(w_k^T x)}{\sum_{k'} \exp(w_{k'}^T x)}. \qquad (6)$$

In our model, the observed constraint labels depend only on the cluster labels of the associated instances. In an ideal scenario, the conditional distribution of $l_t$ given the cluster labels would be deterministic, as described by Eq. (1). In practice, however, users can make mistakes and be inconsistent during the annotation process. We address this by relaxing the deterministic relationship to the distribution $P(l_t|Y_t)$ described in Table I. The relaxation is parameterized by $\epsilon \in [0, 1)$, the probability of an error when answering the query, and we let the two erroneous answers have equal probability $\epsilon/2$. That is, the ideal label of $l_t$ (e.g., $l_t$ = yes if $y_{t_1} = y_{t_2}, y_{t_1} \neq y_{t_3}$) is given with probability $1 - \epsilon$, and each of the other labels (no and dnk in this case) is given with probability $\epsilon/2$. In practice, lower values of $\epsilon$ are expected when the constraints are less noisy. Alternatively, we can view this relaxation as allowing the constraints to be soft as needed, balancing the trade-off between finding large separation margins among clusters and satisfying all the constraints.
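As an illustration of Eqs. (1) and (6) and Table I, the following Python sketch (our own; the helper names are hypothetical) computes the multi-class logistic probabilities and the noise-relaxed constraint-label distribution.

```python
import numpy as np

def softmax_probs(X, W):
    """Eq. (6): P(y = k | x; W) for each row of X (already augmented with a trailing 1)."""
    scores = X @ W.T                                   # N x K
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)

def p_label_given_labels(y1, y2, y3, eps):
    """Table I: distribution over (yes, no, dnk) given the triplet's cluster labels."""
    if y1 == y2 and y1 != y3:
        return {"yes": 1 - eps, "no": eps / 2, "dnk": eps / 2}
    if y1 == y3 and y1 != y2:
        return {"yes": eps / 2, "no": 1 - eps, "dnk": eps / 2}
    return {"yes": eps / 2, "no": eps / 2, "dnk": 1 - eps}
```

Setting `eps = 0` recovers the deterministic labeling rule of Eq. (1).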

B. Objective

The first part of our objective is to maximize the likelihood of the observed constraints given the instances, i.e.,

$$\max_W \ \Phi(L|X_I; W) = \frac{1}{M}\log P(L|X_I; W) = \frac{1}{M}\log \sum_{Y_I} P(L, Y_I|X_I; W), \qquad (7)$$

where $I$ indexes the constrained instances as defined in Section II-B, and $\frac{1}{M}$ is a normalization constant.

To reduce overfitting, we add the standard L2 regularization for the logistic model, namely,

$$R(W) = \sum_k \bar{w}_k^T \bar{w}_k,$$

where each $\bar{w}_k$ is the vector obtained by replacing the bias term in $w_k$ with 0.

In addition to satisfying the constraints, we also expect the clustering solution to separate the clusters with large margins. This objective can be captured by minimizing the conditional entropy of the instance cluster labels given the observed features [19]. Since the cluster information about constrained instances is captured by Eq. (7), we only impose such entropy minimization on the unconstrained instances, i.e.,

$$H(Y_U|X_U; W) = \frac{1}{|U|}\sum_{i \in U} H[P(y_i|x_i; W)].$$

Adding the above terms together, our objective is

$$\max_W \ \Phi(L|X_I; W) - \tau H(Y_U|X_U; W) - \lambda R(W). \qquad (8)$$

In some cases, we may also wish to maintain a balanced distribution across the different clusters. This can be achieved by maximizing the entropy of the estimated marginal distribution of cluster labels [20], i.e.,

$$H(y|X; W) = -\sum_{k=1}^{K} p_k \log p_k,$$

where we denote the estimated marginal probability as $p_k = P(y = k|X; W) = \frac{1}{N}\sum_{i=1}^{N} p_{ik}$ and $p_{ik} = P(y_i = k|x_i; W)$.

In cases where balanced clusters are desired, our objective is formulated as

$$\max_W \ \Phi(L|X_I; W) - \lambda R(W) + \tau\,[H(y|X; W) - H(Y_U|X_U; W)], \qquad (9)$$

where we use the same coefficient $\tau$ to control the enforcement of the cluster separation and cluster balance terms, since they are roughly at the same scale.
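Both entropy terms can be evaluated directly from the softmax probabilities of Eq. (6); the sketch below (ours, assuming probability matrices with one row per instance) shows one way to compute them.

```python
import numpy as np

def separation_entropy(P_U):
    """H(Y_U | X_U; W): average per-instance label entropy over unconstrained instances."""
    P = np.clip(P_U, 1e-12, 1.0)
    return -np.mean(np.sum(P * np.log(P), axis=1))

def balance_entropy(P_all):
    """H(y | X; W): entropy of the estimated marginal cluster distribution."""
    p_bar = np.clip(P_all.mean(axis=0), 1e-12, 1.0)
    return -np.sum(p_bar * np.log(p_bar))
```

In objective (9) these enter as `tau * (balance_entropy(P_all) - separation_entropy(P_U))`.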

The two objectives (8) and (9) are non-concave, and optimization generally can only be guaranteed to reach a local optimum. In the next section, we present a variational EM solution and discuss an effective initialization strategy.

IV. OPTIMIZATION

Here we consider optimizing the objective in Eq. (9), which enforces cluster balance. The objective (8) is simpler and can be optimized following the same procedure by simply removing the terms employed for cluster balance.

Computing the log-likelihood in Eq. (7) requires marginalizing over the hidden variables $Y_I$. Exact inference may be feasible when the constraints are well separated or the number of constraints is small, as this may produce a graphical model with low tree-width. As more $y$'s become related to each other via constraints, the marginalization becomes more expensive to compute, and it is in general intractable. For this reason, we use the variational EM algorithm for optimization.

Applying Jensen's inequality, we obtain the following lower bound of the objective:

$$LB = \frac{1}{M} E_{Q(Y_I)}\left[\log\frac{P(Y_I, L|X_I; W)}{Q(Y_I)}\right] - \lambda R(W) + \tau\,[H(y|X; W) - H(Y_U|X_U; W)], \qquad (10)$$

where $Q(Y_I)$ is a variational distribution. In variational EM, this lower bound is maximized alternately in the E-step and the M-step [21]. In each E-step, we aim to find a tractable distribution $Q(Y_I)$ such that the Kullback-Leibler divergence between $Q(Y_I)$ and the posterior distribution $P(Y_I|L, X_I; W)$ is minimized. Given the current $Q(Y_I)$, each M-step finds the new $W$ that maximizes the $LB$. Note that in the objective (and in the $LB$), only the likelihood term is relevant to the E-step; the other terms are only used when solving for $W$ in the M-step.

TABLE II. THE VALUES OF $Q(l_t|y_i = k)$, $i \in I_t$. FOR SIMPLICITY, WE DENOTE $q_{jk} \equiv q(y_{t_j} = k)$ AND $\bar{q}_{jk} \equiv q(y_{t_j} \neq k)$.

Cases      | $l_t$ = yes                       | $l_t$ = no                        | $l_t$ = dnk
$i = t_1$  | $q_{2k}\bar{q}_{3k}$              | $\bar{q}_{2k} q_{3k}$             | $1 - q_{2k}\bar{q}_{3k} - \bar{q}_{2k} q_{3k}$
$i = t_2$  | $q_{1k}\bar{q}_{3k}$              | $\sum_{u \neq k} q_{1u} q_{3u}$   | $1 - q_{1k}\bar{q}_{3k} - \sum_{u \neq k} q_{1u} q_{3u}$
$i = t_3$  | $\sum_{u \neq k} q_{1u} q_{2u}$   | $q_{1k}\bar{q}_{2k}$              | $1 - q_{1k}\bar{q}_{2k} - \sum_{u \neq k} q_{1u} q_{2u}$

A. Variational E-Step

We use mean field inference [22], [23] to approximate the posterior distribution, in part due to its ease of implementation and its convergence properties [24]. Mean field restricts the variational distribution $Q(Y_I)$ to the tractable fully-factorized family $Q(Y_I) = \prod_{i \in I} q(y_i)$, and finds the $Q(Y_I)$ that minimizes the KL-divergence $KL[Q(Y_I)\,\|\,P(Y_I|L, X_I; W)]$. The optimal $Q(Y_I)$ is obtained by iteratively updating each $q(y_i)$ until $Q(Y_I)$ converges. The update equation is

$$q(y_i) = \frac{1}{Z}\exp\{E_{Q(Y_{I\setminus i})}[\log P(X_I, Y_I, L)]\}, \qquad (11)$$

where $Q(Y_{I\setminus i}) = \prod_{j \in I, j \neq i} q(y_j)$, and $Z$ is a normalization factor ensuring $\sum_{y_i} q(y_i) = 1$. In the following, we derive a closed-form update for this optimization problem.

Applying the model independence assumptions, the expectation term in Eq. (11) simplifies to

$$E_{Q(Y_{I\setminus i})}\Big[\sum_{t=1}^{M}\log P(l_t|Y_t) + \sum_{j \in I}\log P(y_j|x_j; W) + \log P(X_I)\Big] = \sum_{t: i \in I_t} E_{Q(Y_{I_t\setminus i})}[\log P(l_t|Y_t)] + \log P(y_i|x_i; W) + \mathrm{const}, \qquad (12)$$

where $I_t\setminus i$ is the set of indices in $I_t$ except for $i$, and $\mathrm{const}$ absorbs all terms that are constant with respect to $y_i$. The first term in (12) sums over the expected log-likelihood of observing each $l_t$ given the fixed $y_i$. To compute the expectation, we first let $Q(l_t|y_i)$ be the probability that the observed $l_t$ is consistent with $Y_t$ given a fixed $y_i$, i.e., the total probability of the assignments of $Y_t$, given a fixed $y_i$, for which $P(l_t|Y_t) = 1-\epsilon$ according to Table I. The $Q(l_t|y_i)$ can be computed straightforwardly as in Table II. Then each of the expectations in (12) is computed as

$$E[\log P(l_t|y_i)] = [1 - Q(l_t|y_i)]\log\frac{\epsilon}{2} + Q(l_t|y_i)\log(1-\epsilon).$$


From the above, the update Eq. (11) is derived as

$$q(y_i) = \frac{\alpha^{F(y_i)} P(y_i|x_i; W)}{\sum_{y_i} \alpha^{F(y_i)} P(y_i|x_i; W)}, \quad \text{with } \alpha = \frac{2(1-\epsilon)}{\epsilon}, \qquad (13)$$

where $F(y_i) = \sum_{t: i \in I_t} Q(l_t|y_i)$.

The term $F(y_i)$ can be interpreted as measuring the compatibility of each assignment of $y_i$ with respect to the constraints and the other $y$'s. In Eq. (13), $\alpha$ is controlled by the parameter $\epsilon$. When $\epsilon \in (0, \frac{2}{3})$, $\alpha > 1$ and the update allows more compatible assignments of $y_i$, i.e., those with higher $F(y_i)$, to have larger $q(y_i)$. When $\epsilon = \frac{2}{3}$, the constraint labels are regarded as uniformly distributed regardless of the instance cluster labels, as can be seen from Table I. In this case, $\alpha = 1$ and each $q(y_i)$ is directly set to the conditional probability $P(y_i|x_i; W)$, which naturally reduces our method to learning without constraints. Clearly, when $\epsilon$ is smaller, the constraints are harder and the updates push $q(y_i)$ toward more extreme distributions. Note that values of $\epsilon \in (\frac{2}{3}, 1)$ cause $\alpha < 1$, which leads to results that contradict the constraints and is generally not desired.

Special Case: Hard Constraints. In the special case where $\epsilon = 0$ and $\alpha = \infty$, $P(l_t|Y_t)$ essentially reduces to the deterministic model described in Eq. (1), allowing our model to incorporate hard constraints. The update equation for this case can be derived similarly to Eq. (13). Here, $q(y_i)$ is non-zero only when the value of $F(y_i)$ is the maximum among all possible assignments of $y_i$; thus the update reduces to a max model. More formally, we define the max-compatible label set for each instance $x_i$ as

$$\mathcal{Y}_i = \{1 \le k \le K : F(y_i = k) \ge F(y_i = k'),\ \forall k' \neq k\}.$$

Namely, each $\mathcal{Y}_i$ contains the most compatible assignments of $y_i$ with respect to the constraints. The update equation then becomes

$$q(y_i) = \begin{cases} \dfrac{P(y_i|x_i; W)}{\sum_{y'_i \in \mathcal{Y}_i} P(y'_i|x_i; W)}, & \text{if } y_i \in \mathcal{Y}_i, \\ 0, & \text{otherwise.} \end{cases} \qquad (14)$$
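A minimal sketch of the E-step update (ours; the array layout and helper names are assumptions, not the paper's). For one instance, `p` holds its softmax probabilities $P(y_i = k|x_i;W)$, `q1`, `q2`, `q3` hold the current variational marginals of the triplet members, and `F[k]` accumulates $\sum_{t: i \in I_t} Q(l_t|y_i = k)$ using the Table II expressions.

```python
import numpy as np

def consistency_score(q1, q2, q3, position, k):
    """Q(l_t | y_i = k) from Table II for the instance at `position` (1, 2, or 3)."""
    if position == 1:
        yes, no = q2[k] * (1 - q3[k]), (1 - q2[k]) * q3[k]
    elif position == 2:
        yes = q1[k] * (1 - q3[k])
        no = sum(q1[u] * q3[u] for u in range(len(q1)) if u != k)
    else:
        yes = sum(q1[u] * q2[u] for u in range(len(q1)) if u != k)
        no = q1[k] * (1 - q2[k])
    return {"yes": yes, "no": no, "dnk": 1.0 - yes - no}

def update_q_soft(F, p, eps):
    """Eq. (13): q(y_i) proportional to alpha**F(y_i) * P(y_i|x_i; W), alpha = 2(1-eps)/eps."""
    alpha = 2.0 * (1.0 - eps) / eps
    log_unnorm = F * np.log(alpha) + np.log(np.clip(p, 1e-300, 1.0))  # log-space for stability
    q = np.exp(log_unnorm - log_unnorm.max())
    return q / q.sum()

def update_q_hard(F, p):
    """Eq. (14): eps = 0; probability mass only on the max-compatible label set."""
    mask = F >= F.max() - 1e-12
    q = np.where(mask, p, 0.0)
    return q / q.sum()
```

For each constraint containing the instance, `F[k]` is incremented by `consistency_score(...)[observed_label]`, after which one of the two update functions produces the new $q(y_i)$.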

B. M-Step

The M-step searches for the parameter $W$ that maximizes the $LB$. Applying the independence assumptions again and ignoring all terms that are constant with respect to $W$, we obtain the following objective:

$$\max_W \ J = \frac{1}{M}\sum_{Y_I} Q(Y_I)\log P(Y_I|X_I; W) - \lambda R(W) + \tau\,[H(y|X; W) - H(Y_U|X_U; W)].$$

This objective is non-concave and a local optimum can be found via gradient ascent; we used L-BFGS [25] in our experiments. The derivative of $J$ with respect to $W$ is

$$\frac{\partial J}{\partial W} = \frac{1}{M}\sum_{i \in I}(Q_i - P_i)x_i^T - 2\lambda\bar{W} + \frac{\tau}{|U|}\sum_{j \in U}\sum_k (1_k - P_j)\,p_{jk}\log p_{jk}\,x_j^T - \frac{\tau}{N}\sum_{n=1}^{N}\sum_k (1_k - P_n)\,p_{nk}\log p_k\,x_n^T,$$

where $P_i = [p_{i1}, \ldots, p_{iK}]^T$, $Q_i = [q_{i1}, \ldots, q_{iK}]^T$ with $q_{ik} = q(y_i = k)$, $\bar{W} = [\bar{w}_1, \ldots, \bar{w}_K]^T$, and $1_k$ is a $K$-dimensional vector containing the value 1 in the $k$-th dimension and 0 elsewhere.
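For reference, a vectorized NumPy sketch of this gradient (our own; `Q` stacks the rows $Q_i$ for the constrained instances in the order of `idx_I`, `P` holds the softmax probabilities for all instances, and `X` is the augmented data with the bias feature in the last column).

```python
import numpy as np

def dcrc_gradient(W, X, P, Q, idx_I, idx_U, M, lam, tau):
    """Gradient of the M-step objective J w.r.t. W (shape K x (d+1))."""
    tiny = 1e-12
    # likelihood term: (1/M) * sum_{i in I} (Q_i - P_i) x_i^T
    grad = (Q - P[idx_I]).T @ X[idx_I] / M
    # L2 regularizer with the bias excluded: -2 * lam * W_bar
    W_bar = W.copy()
    W_bar[:, -1] = 0.0
    grad -= 2.0 * lam * W_bar
    # cluster-separation term over unconstrained instances
    P_U = P[idx_U]
    A = P_U * np.log(P_U + tiny)
    grad += (tau / len(idx_U)) * (A - P_U * A.sum(axis=1, keepdims=True)).T @ X[idx_U]
    # cluster-balance term over all instances
    p_bar = P.mean(axis=0)
    B = P * np.log(p_bar + tiny)[None, :]
    grad -= (tau / X.shape[0]) * (B - P * B.sum(axis=1, keepdims=True)).T @ X
    return grad
```

The negated objective and gradient can be handed to any L-BFGS routine (e.g., `scipy.optimize.minimize` with `method="L-BFGS-B"`) after flattening `W`.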

The above derivations use a linear model for $P(y|x; W)$, and thus the learned DCRC model is also linear. However, all of the results can be easily generalized to kernel functions, allowing DCRC to find non-linear separation boundaries.

C. Complexity and Initialization

In each E-step, the complexity is $O(\gamma K|I|)$, where $\gamma$ is the number of mean-field iterations needed for $Q(Y_I)$ to converge. In the M-step, the complexity of computing the gradient of $W$ in each L-BFGS iteration is $O(NKD)$.

Although the mean-field approximation is guaranteed to converge, in the first few E-steps it is not critical to achieve a very close approximation. In practice, we can run the mean-field update for up to a fixed number of iterations (e.g., 100). We empirically observe that the approximation still converges very quickly in later EM iterations. Similarly, in the M-step we observe that the L-BFGS optimization usually converges within very few iterations in later EM runs, and running a fixed number of L-BFGS iterations is also sufficient in the first few M-steps.

The EM algorithm is generally sensitive to the initial parameter values. Here we first apply Kmeans and train a supervised logistic classifier on the clustering results; the learned weights are then used as the starting point for DCRC. Empirically, we observe that such initialization typically allows DCRC to converge within 100 iterations.
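A sketch of this initialization using scikit-learn (our own choice of library; the paper does not prescribe an implementation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def init_weights(X, K, seed=0):
    """Kmeans labels -> supervised logistic fit -> initial W = [weights | bias].

    Returns a K x (d+1) matrix when K > 2; scikit-learn collapses the binary
    case to a single weight vector, which would need to be expanded."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return np.hstack([clf.coef_, clf.intercept_[:, None]])
```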

V. EXPERIMENTS

In this section, we experimentally examine the effectiveness of our model in utilizing relative constraints to improve clustering. We first evaluate all methods on both UCI and other real-world datasets with noise-free constraints generated from true class labels. We then present a preliminary user study in which we ask users to label constraints, and we evaluate all methods on these human-labeled (noisy) constraints.

A. Baseline Methods and Evaluation Metric

We compare our algorithm with existing methods that consider relative constraints or pairwise constraints. The methods employing pairwise constraints are Xing's method [2] (distance metric learning with a diagonal matrix) and ITML [26]. These are the state-of-the-art methods that are usually compared in the literature and have publicly available source code.

For methods considering relative constraints, we compare with: 1) LSML [15], a recent metric learning method for relative constraints (we use the Euclidean distance as the prior); 2) SSSVaD [16], a method that directly finds clustering solutions with relative constraints; and 3) sparseLP [13], an earlier method that has not been extensively compared. We also experimented with the SVM-style method proposed in [12] and observed that its performance is generally worse; thus, we do not report results for this method.


(a) Ionosphere  (b) Pima  (c) Balance-scale  (d) Digits-389  (e) Letters-IJLT  (f) MSRCv2  (g) Stonefly9  (h) Birdsong

Fig. 4. (Best viewed in color.) The F-measure as a function of the number of relative constraints on the eight datasets, for DCRC, DCRC-YN, LSML, sparseLP, SSSVaD, ITML, and Xing. Results are averaged over 20 runs with independently sampled constraints. Error bars show 95% confidence intervals.


TABLE III. SUMMARY OF DATASET INFORMATION

Dataset        #Inst.  #Dim.  #Cluster
Ionosphere     351     34     2
Pima           768     8      2
Balance-scale  625     4      3
Digits-389     3165    16     3
Letters-IJLT   3059    16     4
MSRCv2         1046    48     6
Stonefly9      3824    285    9
Birdsong       4998    38     13

Xing's method, ITML, LSML, and sparseLP are metric learning techniques. For these, we apply Kmeans with the learned metric (50 times) to form cluster assignments, and the clustering solution with the minimum mean-squared error is chosen.

We evaluated the clustering results against the ground-truth class labels using the pairwise F-measure [3], the Adjusted Rand Index, and Normalized Mutual Information. The results are highly similar across measures, so we only present the F-measure results.
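For completeness, a small sketch of the pairwise F-measure as we understand it from [3] (precision and recall computed over pairs of instances placed in the same cluster); this is our reading, not code from the paper.

```python
from itertools import combinations

def pairwise_f_measure(pred, true):
    """F-measure over instance pairs: a pair is 'positive' if it shares a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred, same_true = pred[i] == pred[j], true[i] == true[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```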

B. Controlled Experiments

In this set of experiments, we use simulated noise-free constraints to evaluate all the methods.

1) Datasets: We evaluate all methods on five UCI datasets: Ionosphere, Pima, Balance-scale, Digits-389, and Letters-IJLT. We also use three additional real-world datasets: 1) a subset of image segments from the MSRCv2 data¹, containing the six largest classes of image segments; 2) the HJA Birdsong data [27], which contains automatically extracted segments from spectrograms of birdsong recordings, where the goal is to identify the species of each segment; and 3) the Stonefly9 data [28], which contains insect images, where the task is to identify the insect species in each image. Table III summarizes the dataset information. In our experiments, all features are standardized to have zero mean and unit standard deviation.

2) Experimental Setup: For each dataset, we vary the number of constraints from 0.05N to 0.3N in increments of 0.05N, where N is the total number of instances. For each size, triplets are randomly generated and constraint labels are assigned according to Eq. (1). We evaluated our method in two settings, one with all constraints as input (shown as DCRC), and the other with only yes/no constraints (shown as DCRC-YN). The baseline methods for relative constraints are designed for yes/no constraints only and cannot be easily extended to incorporate dnk constraints, so we drop the dnk constraints for these methods. To form the corresponding pairwise constraints, we infer one ML and one CL constraint from each relative constraint with a yes/no label (note that no pairwise constraints can be directly inferred from dnk relative constraints). Thus, all baselines use the same information as DCRC-YN, since none of them employs dnk constraints.
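The constraint simulation from Eq. (1) is straightforward to reproduce; a short sketch (ours) that draws random triplets and labels them from ground-truth class labels:

```python
import numpy as np

def sample_relative_constraints(y, num, seed=0):
    """Randomly sample triplets and label them yes/no/dnk from true labels per Eq. (1)."""
    rng = np.random.default_rng(seed)
    constraints = []
    for _ in range(num):
        i, j, k = rng.choice(len(y), size=3, replace=False)
        if y[i] == y[j] and y[i] != y[k]:
            label = "yes"
        elif y[i] == y[k] and y[i] != y[j]:
            label = "no"
        else:
            label = "dnk"
        constraints.append(((i, j, k), label))
    return constraints
```

Dropping the `dnk` entries yields the input used for DCRC-YN and, after converting each yes/no triplet to one ML and one CL pair, for the pairwise-constraint baselines.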

We use five-fold cross-validation to tune parameters for all methods. The same training and validation folds are used across all methods (removing dnk constraints, or converting to pairwise constraints, when necessary). For each method, we select the parameters that maximize the averaged constraint prediction accuracy on the validation sets.

¹ http://research.microsoft.com/en-us/projects/ObjectClassRecognition/

(a) Data1: Soft Const. (88.33%)  (b) Data1: Hard Const. (83.33%)  (c) Data2: Soft Const. (100%)  (d) Data2: Hard Const. (100%)

Fig. 5. Estimated entropy using soft/hard constraints on two synthetic datasets. Cluster assignments are represented by blue, pink, and green points. Entropy regions are shaded, with darker color representing higher entropy. Prediction accuracy on instance cluster labels is shown in parentheses.

For our method, we search for the optimal $\tau \in \{0.5, 1, 1.5\}$ and $\lambda \in \{2^{-10}, 2^{-8}, 2^{-6}, 2^{-4}, 2^{-2}\}$. We empirically observed that our method is very robust to the choice of $\epsilon$ when it is within the range $[0.05, 0.15]$; here we set $\epsilon = 0.05$ for this set of experiments with the simulated noise-free constraints. Experiments are repeated over 20 randomized runs, each with independently sampled constraints.

3) Overall Performance: Figure 4 shows the performance of all methods with different numbers of constraints. The sparseLP method does not scale to the high-dimensional Stonefly9 dataset and hence is not reported on that particular dataset.

From the results we see that DCRC consistently outperforms all baselines on all datasets as the number of constraints increases, demonstrating the effectiveness of our method.

Comparing DCRC with DCRC-YN, we observe that the additional dnk constraints provide substantial benefits, especially for datasets with a large number of clusters (e.g., MSRCv2, Birdsong). This is consistent with our expectation, because the proportion of dnk constraints increases significantly when $K$ is large, providing more information for DCRC to utilize.

Comparing DCRC-YN with the baselines, we observe that DCRC-YN achieves comparable or better performance even against the best baseline, ITML. This suggests that, with noise-free constraints, our model is competitive with the state-of-the-art methods even without the additional information provided by dnk constraints.

4) Soft Constraints vs. Hard Constraints: In this set of experiments, we explore the impact on our model of using soft constraints ($\epsilon = 0.05$) versus hard constraints ($\epsilon = 0$). We first use two synthetic datasets to examine and illustrate their different behaviors. Each dataset contains three clusters with 50 instances per cluster. The clusters are close to each other in one dataset, and far apart (and thus easily separable) in the other.


(a) Balance-scale  (b) Letters-IJLT  (c) MSRCv2  (d) Birdsong

Fig. 6. Performance of DCRC using soft constraints (DCRC) vs. hard constraints (DCRC-Hard).

For each dataset, we randomly generated 500 relative constraints using points near the decision boundaries. Figure 5 shows the prediction entropy and the prediction accuracy on instance cluster labels achieved by our model on both datasets, using soft and hard constraints respectively. We can see that when the clusters are easily separable, both soft and hard constraints produce reasonable decision boundaries and perfect prediction accuracy. However, when the cluster boundaries are fuzzy, the results obtained with soft constraints appear preferable. This indicates that by softening the constraints, our method can search for more reasonable decision boundaries and avoid overfitting to the constrained instances.

We then compare the performance of soft ($\epsilon = 0.05$) versus hard ($\epsilon = 0$) constraints on the real datasets with the same setting as in Section V-B3. Due to space limits, we only show results on four representative datasets in Figure 6; the behavior on the other datasets is similar. We can see that using soft constraints generally leads to better performance than using hard constraints. In particular, on the MSRCv2 dataset, using hard constraints produces a large "dip" at the beginning of the curve, while this issue is not severe for soft constraints. This suggests that using soft constraints makes our model less susceptible to overfitting to small sets of constraints.

5) Effect of Cluster Balance Enforcement: This set of experiments tests the effect of the cluster balance enforcement on the performance of DCRC, using the unbalanced Birdsong and the balanced Letters-IJLT datasets. Figure 7 reports the performance of DCRC (soft constraints, $\epsilon = 0.05$) with and without such enforcement for varying numbers of constraints. We see that when there are no constraints, it is generally beneficial to enforce cluster balance. The reason is that, when cluster balance is not enforced, the entropy term that enforces cluster separation can be trivially reduced by removing cluster boundaries, causing degenerate solutions. However, as the number of constraints increases, enforcing cluster balance on the unbalanced Birdsong dataset hurts the performance.

(a) Birdsong: Unbalanced  (b) Letters-IJLT: Balanced

Fig. 7. Performance of DCRC with and without cluster balance enforcement (DCRC vs. DCRC-w.o.Balance).

Conceivably, such enforcement causes DCRC to prefer solutions with balanced cluster distributions, which is undesirable for datasets with uneven classes. On the other hand, appropriate enforcement on the balanced Letters-IJLT dataset provides further improvement. In practice, one could decide whether to enforce cluster balance based on prior knowledge of the application domain.

6) Computational Time: We recorded the runtime of learning with 1500 constraints on the Birdsong dataset, on a standard desktop computer with a 3.4 GHz CPU and 11.6 GB of memory. On average it takes less than 2 minutes to train the model using an unoptimized Matlab implementation, which is reasonable for most applications of similar scale.

C. Case Study: Human-labeled Constraints

We now present a case study in which we investigate the impact of human-labeled constraints on the proposed method and its competitors.

1) Dataset and Setup: This case study is situated in one of our applications, where the goal is to find bird singing patterns by clustering. The Birdsong dataset used in Section V-B contains spectrogram segments labeled with bird species. In reality, birds of the same species may vocalize in different patterns, which we hope to identify as different clusters. Toward this goal, we created another birdsong dataset consisting of clusters that contain relatively pure singing patterns. We briefly describe the data generation process as follows.

We first manually selected a collection of representative examples of the singing patterns, and then used them as templates to extract segments from birdsong spectrograms by applying template matching. Each extracted segment is assigned to the cluster represented by the corresponding template. We then manually inspected and edited the clusters to ensure their quality. As a result, each cluster contains relatively pure segments that come from the same bird species and represent the same vocalization pattern. See Figure 1 for examples of several different vocalization patterns, which we refer to as syllables. We extract features for each segment using the same method as described in [27]. This process results in a new Birdsong dataset containing 2601 instances and 14 ground-truth clusters.

After obtaining informed consent according to the protocol approved by the Institutional Review Board of our institution, we tested the behavior of six human subjects in labeling constraints. None of the users had any prior experience with or knowledge of the data.


TABLE IV. THE AVERAGE CONFUSION MATRIX OF THE HUMAN-LABELED CONSTRAINTS VS. THE CONSTRAINT LABELS INFERRED FROM THE TRUE INSTANCE CLUSTERS.

(a) Relative Constraints
True \ Human  yes    no     dnk
yes           18.50  0.33   4.83
no            0.33   16.50  4.50
dnk           3.50   3.00   98.50

(b) Pairwise Constraints
True \ Human  ML     CL
ML            42.50  9.67
CL            10.83  162.00

TABLE V. F-MEASURE PERFORMANCE (MEAN ± STD) WITH THE HUMAN-LABELED CONSTRAINTS.

(a) Without using constraints
Method         F-Measure
DCRC-NoConst   0.5175 ± 0.0232
Kmeans         0.6523 ± 0.0189

(b) Using 150 relative constraints
Method     F-Measure
DCRC       0.7620 ± 0.1335
DCRC-YN    0.7635 ± 0.1067
LSML       0.6409 ± 0.0654
sparseLP   0.5200 ± 0.0706
SSSVaD     0.6046 ± 0.0605

(c) Using 150 pairwise constraints
Method   F-Measure
ITML     0.6409 ± 0.0424
Xing     0.6438 ± 0.0423

(d) Using 225 pairwise constraints
Method   F-Measure
ITML     0.6347 ± 0.0372
Xing     0.6438 ± 0.0282

They were first given a short tutorial on the data and the concepts of clustering and constraints. Each user was then asked to label 150 randomly selected triplets and 225 pairs, using a graphical interface that displays the spectrogram segments. To neutralize the potential bias introduced by the task ordering (triplets vs. pairs), we randomly split the users into two groups, with each group using a different ordering.

2) Results and Discussion: Table IV lists the average confusion matrix of the human-labeled constraints versus the labels produced from the ground-truth cluster labels. From Table IV(a), we see that dnk constraints make up more than half of the relative constraints, which is consistent with our analysis in Section II-A that the number of dnk constraints can be dominantly large. The users rarely confuse the yes and no labels, but they do tend to provide more erroneous dnk labels. This is not surprising: when in doubt, people are often more comfortable abstaining from a definite yes/no answer and resorting to the dnk option.

For pairwise constraints, the CL constraints are the majority, and the confusions for CL and ML are similar. We note that the confusion between the yes/no constraints is much smaller than that between the ML/CL constraints. This shows that the increased flexibility introduced by the dnk label allows users to more accurately differentiate the yes/no labels. The overall labeling accuracy of pairwise constraints is slightly higher than that of relative constraints; we suspect this is due to the presence of the large number of dnk constraints.

We evaluated all the methods using these human-labeled constraints. To account for the labeling noise in the constraints, we set $\epsilon = 0.15$ for DCRC and DCRC-YN². The averaged results for all methods are listed in Table V. We observe that, while the performance of most competing methods degrades with the added constraints compared with unsupervised Kmeans, our method still shows significant performance improvement even with the noisy constraints. We want to point out that the performance difference we observe is not due to the use of the multi-class logistic classifier: as shown in Table V(a), without considering any constraints, the logistic model alone achieves significantly lower performance than Kmeans. This further demonstrates the effectiveness of our method in utilizing the side information provided by noisy constraints to improve clustering.

² For these noisy constraints, our method remains robust to the choice of $\epsilon$: using values of $\epsilon$ ranging from 0.05 to 0.2 introduces only minor fluctuations (within 0.01) in the F-measure.

Recall that ITML was competitive with DCRC-YN in the earlier experiments with noise-free constraints. Here, with noisy constraints, DCRC-YN achieves far better accuracy than ITML, suggesting that our method is much more robust to labeling noise. It is also worth noting that although the dnk constraints tend to be quite noisy, they do not seem to degrade the performance of DCRC compared with DCRC-YN.

Our case study also points to possible ways to further improve our model. As revealed by Table IV, the noise on the labels of relative constraints is not uniform, as assumed by our model. An interesting future direction is to introduce a non-uniform noise process that more realistically models the users' labeling behavior.

VI. RELATED WORK

Clustering with Constraints: Various techniques have been proposed for clustering with pairwise constraints [4]–[8], [10]. Our work is aligned with most of these methods in the sense that we assume the guidance for labeling constraints is the underlying instance clusters.

Less work has been done on clustering with relative constraints. The work in [12]–[16] proposes metric learning approaches that use $d(x_i, x_j) < d(x_i, x_k)$ to encode that $x_i$ is more similar to $x_j$ than to $x_k$, where $d(\cdot)$ is the distance function. The work in [15] studies learning from relative comparisons between two pairs of instances, which can be viewed as the same type of constraint when only three distinct examples are involved. By construction, these methods only consider constraints with yes/no labels. In practice, such answers may not always be available, which limits their applicability. In contrast, our method is more flexible in that it allows users to provide dnk constraints.

There also exist studies that encode relative similarities between instances in the form of a hierarchical ordering and develop hierarchical algorithms that directly find clustering solutions satisfying the constraints [17], [29]. Different from those studies, our work builds on a natural probabilistic model that has not previously been considered for learning with relative constraints.

Semi-supervised Learning: Related work also exists in the much broader area of semi-supervised learning, involving studies of both clustering and classification problems. The work in [19] proposes that, to encourage clusters with large separation margins, one can minimize the entropy on the unlabeled data in addition to learning from the labeled data. The study in [20] suggests also maximizing the entropy of the cluster label distribution in order to find balanced clustering solutions. Our final formulation draws inspiration from this work.


VII. CONCLUSIONS

In this paper, we studied clustering with relative constraints, where each constraint is generated by posing a query: is $x_i$ more similar to $x_j$ than to $x_k$? Unlike existing methods that only consider yes/no responses to such queries, we studied the case where the answer could also be dnk (don't know). We developed a probabilistic method, DCRC, that learns to cluster the instances based on the responses acquired from such queries. We empirically evaluated the proposed method using both simulated (noise-free) constraints and human-labeled (noisy) constraints. The results demonstrated the usefulness of dnk constraints, the significantly improved performance of DCRC over existing methods, and the superiority of our method in terms of robustness to noisy constraints.

REFERENCES

[1] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrodl, "Constrained K-means Clustering with Background Knowledge," in ICML, 2001, pp. 577–584.

[2] E. Xing, A. Ng, M. Jordan, and S. Russell, "Distance Metric Learning with Application to Clustering with Side-information," in NIPS, 2003, pp. 521–528.

[3] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating Constraints and Metric Learning in Semi-supervised Clustering," in ICML, 2004.

[4] N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall, "Computing Gaussian Mixture Models with EM using Equivalence Constraints," in NIPS, 2003.

[5] Z. Lu and T. K. Leen, "Semi-supervised Learning with Penalized Probabilistic Clustering," in NIPS, 2004.

[6] S. Basu, M. Bilenko, and R. J. Mooney, "A Probabilistic Framework for Semi-supervised Clustering," in KDD, 2004, pp. 59–68.

[7] T. Lange, M. H. C. Law, A. K. Jain, and J. M. Buhmann, "Learning with Constrained and Unlabelled Data," in CVPR, 2005, pp. 731–738.

[8] B. Nelson and I. Cohen, "Revisiting Probabilistic Models for Clustering with Pair-wise Constraints," in ICML, 2007, pp. 673–680.

[9] R. Ge, M. Ester, W. Jin, and I. Davidson, "Constraint-Driven Clustering," in KDD, 2007, pp. 320–329.

[10] Z. Lu, "Semi-supervised Clustering with Pairwise Constraints: A Discriminative Approach," Journal of Machine Learning Research, vol. 2, pp. 299–306, 2007.

[11] S. Basu, I. Davidson, and K. Wagstaff, Constrained Clustering: Advances in Algorithms, Theory, and Applications, 1st ed. Chapman & Hall/CRC, 2008.

[12] M. Schultz and T. Joachims, "Learning a Distance Metric from Relative Comparisons," in NIPS, 2003, p. 41.

[13] R. Rosales and G. Fung, "Learning Sparse Metrics via Linear Programming," in KDD, 2006, pp. 367–373.

[14] K. Huang, Y. Ying, and C. Campbell, "Generalized Sparse Metric Learning with Relative Comparisons," Knowl. Inf. Syst., vol. 28, no. 1, pp. 25–45, 2011.

[15] E. Y. Liu, Z. Guo, X. Zhang, V. Jojic, and W. Wang, "Metric Learning from Relative Comparisons by Minimizing Squared Residual," in ICDM, 2012, pp. 978–983.

[16] N. Kumar and K. Kummamuru, "Semisupervised Clustering with Metric Learning using Relative Comparisons," IEEE Trans. Knowl. Data Eng., vol. 20, no. 4, pp. 496–503, 2008.

[17] E. Liu, Z. Zhang, and W. Wang, "Clustering with Relative Constraints," in KDD, 2011, pp. 947–955.

[18] J. Nunnally and I. Bernstein, Psychometric Theory. McGraw-Hill, Inc., 1994.

[19] Y. Grandvalet and Y. Bengio, "Semi-supervised Learning by Entropy Minimization," in NIPS, 2005, pp. 33–40.

[20] R. Gomes, A. Krause, and P. Perona, "Discriminative Clustering by Regularized Information Maximization," in NIPS, 2010, pp. 775–783.

[21] C. M. Bishop, Pattern Recognition and Machine Learning, 1st ed. Springer, 2007.

[22] L. Saul, T. Jaakkola, and M. Jordan, "Mean Field Theory for Sigmoid Belief Networks," Journal of Artificial Intelligence Research, vol. 4, pp. 61–76, 1996.

[23] C. W. Fox and S. J. Roberts, "A Tutorial on Variational Bayesian Inference," Artificial Intelligence Review, vol. 38, no. 2, pp. 85–95, 2012.

[24] M. Benaim and J.-Y. L. Boudec, "On Mean Field Convergence and Stationary Regime," arXiv preprint arXiv:1111.5710, 2011.

[25] M. Schmidt, "L-BFGS software," 2012, http://www.di.ens.fr/~mschmidt/Software/minFunc.html.

[26] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-Theoretic Metric Learning," in ICML, 2007, pp. 209–216.

[27] F. Briggs, X. Z. Fern, and R. Raich, "Rank-loss Support Instance Machines for MIML Instance Annotation," in KDD, 2012, pp. 534–542.

[28] G. Martinez-Munoz, N. Larios, E. Mortensen, W. Zhang, A. Yamamuro, R. Paasch, N. Payet, D. Lytle, L. Shapiro, S. Todorovic, A. Moldenke, and T. Dietterich, "Dictionary-Free Categorization of Very Similar Objects via Stacked Evidence Trees," in CVPR, 2009, pp. 549–556.

[29] K. Bade and A. Nurnberger, "Hierarchical Constraints," Machine Learning, pp. 1–29, 2013.

APPENDIX

This appendix provides the derivation of the mutual information in Eq. (3). The derivations of Eqs. (4) and (5) are similar and are omitted here.

By definition, the mutual information between the instance labels $Y_t = [y_{t_1}, y_{t_2}, y_{t_3}]$ and the constraint label $l_t$ is

$$I(Y_t; l_t) = H(Y_t) - H(Y_t|l_t). \qquad (15)$$

The first entropy term is $H(Y_t) = -\sum_{Y_t} P(Y_t)\log P(Y_t) = 3\log K$, where we used the independence assumption $P(Y_t) = \prod_{i=1}^{3} P(y_{t_i})$ and substituted the prior $P(y_{t_i} = k) = 1/K$. By definition, the second entropy term is

$$H(Y_t|l_t) = -\sum_{a \in \{yes, no, dnk\}} P(l_t = a)\sum_{Y_t} P(Y_t|l_t = a)\log P(Y_t|l_t = a).$$

Now we need to compute the marginal distribution $P(l_t)$ and the conditional distribution $P(Y_t|l_t)$. Based on Eq. (1),

$$P(l_t = yes) = \sum_{Y_t} P(Y_t)P(l_t = yes|Y_t) = \sum_{k=1}^{K} P(y_{t_1} = k)P(y_{t_2} = k)\,[1 - P(y_{t_3} = k)] = \frac{K-1}{K^2}.$$

By the symmetry of the distribution, $P(l_t = no) = P(l_t = yes)$. Then $P(l_t = dnk) = 1 - P(l_t = yes) - P(l_t = no) = 1 - 2(K-1)/K^2$. To compute $P(Y_t|l_t)$, we note that for the cluster label assignments that do not satisfy the conditions for the corresponding $l_t$ described in Eq. (1), the probability $P(Y_t|l_t) = 0$. For those satisfying the conditions,

$$P(Y_t|l_t = yes) = \frac{P(Y_t)P(l_t = yes|Y_t)}{P(l_t = yes)} = \frac{P(Y_t)\times 1}{P(l_t = yes)} = \frac{1}{K(K-1)}.$$


By symmetry again, $P(Y_t|l_t = no) = P(Y_t|l_t = yes)$. Also,

$$P(Y_t|l_t = dnk) = \frac{P(Y_t)P(l_t = dnk|Y_t)}{P(l_t = dnk)} = \frac{P(Y_t)\times 1}{P(l_t = dnk)} = \frac{1}{K[K^2 - 2(K-1)]}.$$

Substituting the values of $P(Y_t|l_t)$ and $P(l_t)$, we obtain

$$H(Y_t|l_t) = \log K + (1 - P_{dnk})\log(K-1) + P_{dnk}\log[K^2 - 2(K-1)],$$

where we denote $P_{dnk} = P(l_t = dnk)$. Substituting $H(Y_t)$ and $H(Y_t|l_t)$ into Eq. (15), we derive

$$I(Y_t; l_t) = 2\log K - (1 - P_{dnk})\log(K-1) - P_{dnk}\log[K^2 - 2(K-1)]. \qquad \blacksquare$$
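As a sanity check (ours, not part of the paper), the closed form of Eq. (3) can be verified numerically by brute-force enumeration of all $K^3$ label assignments under the uniform prior:

```python
import itertools, math

def mutual_info_bruteforce(K):
    """Brute-force I(Yt; lt) with uniform prior 1/K per label; matches Eq. (3)."""
    def label(y):
        if y[0] == y[1] != y[2]:
            return "yes"
        if y[0] == y[2] != y[1]:
            return "no"
        return "dnk"
    p_y = 1.0 / K**3
    triples = list(itertools.product(range(K), repeat=3))
    p_l = {"yes": 0.0, "no": 0.0, "dnk": 0.0}
    for y in triples:
        p_l[label(y)] += p_y
    h_y = 3 * math.log(K)
    h_y_given_l = -sum(p_y * math.log(p_y / p_l[label(y)]) for y in triples)
    return h_y - h_y_given_l
```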