K-means Recovers ICA Filters when Independent Components are Sparse

Alon Vinnikov ALON.VINNIKOV@MAIL.HUJI.AC.IL
Shai Shalev-Shwartz SHAIS@CS.HUJI.AC.IL

School of Computer Science and Engineering, The Hebrew University of Jerusalem, ISRAEL

Abstract

Unsupervised feature learning is the task of using unlabeled examples for building a representation of objects as vectors. This task has been extensively studied in recent years, mainly in the context of unsupervised pre-training of neural networks. Recently, Coates et al. (2011) conducted extensive experiments, comparing the accuracy of a linear classifier that has been trained using features learnt by several unsupervised feature learning methods. Surprisingly, the best performing method was the simplest feature learning approach, based on applying the K-means clustering algorithm after a whitening of the data. The goal of this work is to shed light on the success of K-means with whitening for the task of unsupervised feature learning. Our main result is a close connection between K-means and ICA (Independent Component Analysis). Specifically, we show that K-means and similar clustering algorithms can be used to recover the ICA mixing matrix or its inverse, the ICA filters. It is well known that the independent components found by ICA form useful features for classification (Le et al., 2012; 2011; 2010), hence the connection between K-means and ICA explains the empirical success of K-means as a feature learner. Moreover, our analysis underscores the significance of the whitening operation, as was also observed in the experiments reported in Coates et al. (2011). Finally, our analysis leads to a better initialization of K-means for the task of feature learning.

1. Introduction

Many deep learning algorithms attempt to learn multiple layers of representations in an unsupervised manner. These representations are commonly used as features for classification tasks. Particularly, it was demonstrated that sparse features, i.e. features which are rarely activated, perform well in object recognition tasks.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Several algorithms have been proposed for learning sparse features. Some examples are: sparse auto-encoders, sparse coding, Restricted Boltzmann Machines, and Independent Component Analysis (ICA). In particular, several variants of ICA have been shown to achieve highly competitive or state-of-the-art results for object classification (Le et al., 2011; 2010).

In computer vision applications, learning features is often interpreted as learning dictionaries of “visual words”, which are later used to construct higher-level image features. While some works learn visual words by one of the aforementioned feature learning methods, the most widely used approach in the computer vision literature is to employ the vanilla K-means clustering algorithm for obtaining such dictionaries (Wang et al., 2010; Csurka et al., 2004; Lazebnik et al., 2006; Winn et al., 2005; Fei-Fei & Perona, 2005).

Recently, Coates et al. (2011) considered various feature learning algorithms as part of a single-layer unsupervised feature learning framework. They applied K-means, sparse auto-encoders, and restricted Boltzmann machines. Surprisingly, the simple K-means prevailed over the more complicated algorithms, achieving state-of-the-art results. Two particular observations are of interest. One is that whitening plays a crucial role in classification performance when using K-means. Another is that when whitening is applied to the input, K-means learns centroids that resemble the oriented edge patterns which are typically recovered by ICA (Bell & Sejnowski, 1997).

Coates & Ng (2012) have observed empirically that K-means tends to discover sparse projections and have raised the question of whether this is accidental or whether there is a deeper relation to sparse decomposition methods such as ICA. In this work, we draw a connection between ICA and K-means, showing that when K-means is applied after whitening then, under certain conditions, it is able to recover both the filters and the mixing matrix of the more expressive ICA model. This is despite the fact that the original goal of K-means is to attach a single centroid to each example. In addition, our analysis suggests a family of clustering algorithms with the same ability, and a simple way to empirically test whether an algorithm belongs to this family. Finally, our analysis reveals the importance of whitening, and leads to a new way to apply K-means for feature learning.

Based on these insights, we give an interpretation of the features learned by the framework in Coates et al. (2011). In general, the discovered properties of K-means suggest that, when applying K-means to computer vision tasks such as classification or denoising, it may be beneficial to incorporate whitening, as was done in the experiments presented in Coates et al. (2011).

2. Background and Basic Definitions

In this paper we draw a connection between ICA and K-means. We first define these two learning methods.

2.1. K-means

The K-means objective is defined as follows. We are given a sample S = {x^{(i)}}_{i=1}^m ⊆ R^d and a number of clusters k ∈ N, and our goal is to find centroids c = {c^{(1)}, ..., c^{(k)}} ⊆ R^d which are a global minimum of the objective function:

    J_S(c) = \frac{1}{m} \sum_{i=1}^m \min_{j \in [k]} \| x^{(i)} - c^{(j)} \|_2^2 .    (1)

Let A_k denote the algorithm which, given S and k, outputs a (global) minimizer of Equation (1). While it is intractable to implement A_k in the general case, so one usually employs some heuristic search (such as Lloyd's algorithm), here we will focus on the ideal K-means algorithm which finds a global minimum of Equation (1).

In addition, we will refer to the density-based version of K-means, defined as a minimizer of

    J_x(c) = E_x \min_{j \in [k]} \| x - c^{(j)} \|_2^2 ,    (2)

where x is some random vector with a distribution over R^d. We denote by µ = {µ^{(1)}, ..., µ^{(k)}} ⊆ R^d a minimizer of J_x(c).
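As a concrete illustration (ours, not part of the original paper), the following minimal numpy sketch evaluates the empirical objective J_S(c) of Equation (1) for a given sample and set of centroids; the array shapes and names are our own choices.

import numpy as np

def kmeans_objective(X, C):
    """Evaluate J_S(c) from Equation (1).

    X : (m, d) array of samples x^(i).
    C : (k, d) array of centroids c^(j).
    Returns the mean over samples of the squared distance to the closest centroid.
    """
    # Squared distances between every sample and every centroid: shape (m, k).
    sq_dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # For each sample keep the closest centroid, then average over the sample.
    return sq_dists.min(axis=1).mean()

# Tiny usage example with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
C = rng.normal(size=(4, 2))
print(kmeans_objective(X, C))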

2.2. Independent Component Analysis (ICA)

The linear noiseless ICA model is a generative model (Hyvarinen & Oja, 2000), defining a distribution over a random vector x = (x_1, ..., x_d)^⊤. To generate an instance of x we should first generate a hidden random vector s = (s_1, ..., s_d)^⊤, where each s_k is a statistically independent component, distributed according to some prior distribution over R. Then, we set

    x = A s    (3)

where A ∈ R^{d×d} is some deterministic matrix, often called the mixing matrix. We assume that A is invertible. A specific ICA model is parameterized by the mixing matrix A and by the prior distribution over s_k. Throughout this paper we mostly focus on the prior distribution being a Laplace distribution, that is, the density function is p(s_k) ∝ exp(−√2 |s_k|). We denote by s_lap the random vector over R^d whose components are i.i.d. zero-mean unit-variance Laplace random variables. That is, p(s) ∝ exp(−√2 ‖s‖_1), with ‖s‖_1 being the ℓ_1 norm.

Given a sample x^{(1)}, ..., x^{(N)} of N i.i.d. instantiations of the random vector x, the task of ICA is to estimate both the mixing matrix A and the sources (i.e., the hidden vectors) s^{(1)}, ..., s^{(N)}. Since the mixing matrix is invertible, once we know A we can easily compute the sources by s = A^{−1}x. From now on, we will refer to this model and task simply as ICA. We denote W = A^{−1}. The rows of W are commonly referred to as filters.

Since we can always scale the columns of A, we can assume w.l.o.g. that the s_k have unit variance, E[s_k^2] = 1. In addition, we can assume, w.l.o.g., that s_k has zero mean, E[s_k] = 0, since otherwise we can subtract the mean of x by a simple preprocessing operation. Such preprocessing is often called centering.

Another preprocessing step, which is often performed before ICA, is called whitening. This preprocessing linearly transforms the random variable x into y = Tx such that y has identity covariance, namely, E{yy^⊤} = I. Concretely, if UDU^⊤ = E{xx^⊤} is the spectral decomposition of the covariance matrix of x, then one way to obtain whitened data is by y = D^{−1/2}U^⊤x. Another way, called ZCA whitening (Bell & Sejnowski, 1997), is y = UD^{−1/2}U^⊤x.

The utility of whitening resides in the fact that the new mixing matrix is orthogonal, which reduces the number of parameters to be estimated. Instead of having to estimate the d^2 parameters of the original matrix A, we only need to estimate the new orthogonal matrix, which has d(d − 1)/2 degrees of freedom. A review of approaches for estimating the ICA model can be found in (Hyvarinen & Oja, 2000).
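To make the model and the two whitening transforms concrete, here is a short numpy sketch (our illustration, not the authors' code): it draws samples from x = As with unit-variance Laplace sources and whitens them. The dimensions and the random Gaussian mixing matrix are arbitrary choices; in practice the covariance is estimated from data, as discussed in Section 6.

import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 100_000

# ICA model x = A s with i.i.d. zero-mean, unit-variance Laplace sources.
# A Laplace distribution with scale b has variance 2*b^2, so b = 1/sqrt(2).
A = rng.normal(size=(d, d))                       # some invertible mixing matrix
S = rng.laplace(scale=1 / np.sqrt(2), size=(d, m))
X = A @ S                                         # columns are samples x^(i)

# Sample covariance and its spectral decomposition E{xx^T} = U D U^T.
cov = (X @ X.T) / m
eigvals, U = np.linalg.eigh(cov)
D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))

T_pca = D_inv_sqrt @ U.T        # y = D^{-1/2} U^T x
T_zca = U @ D_inv_sqrt @ U.T    # ZCA whitening: y = U D^{-1/2} U^T x

# Both give (approximately) identity covariance for y = T x.
Y = T_zca @ X
print(np.round((Y @ Y.T) / m, 2))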

2.3. Additional Notation

Let ‖·‖_p denote the p-norm, and let the set of numbers {1, ..., k} be denoted by [k]. e_i will represent the i-th unit vector in R^d. Given c = {c_1, ..., c_k} ⊆ R^d and H ∈ R^{d×d}, for purposes of brevity we define H ∗ c ≜ {Hc_1, ..., Hc_k}.


If x^{(1)}, x^{(2)}, ... is an infinite sequence of i.i.d. copies of a random vector x, then S_m = {x^{(i)}}_{i=1}^m is a sequence of random sets sharing a common sample space. We will simply say that S_m = {x^{(i)}}_{i=1}^m is an i.i.d. sample of the random variable x to refer to this sequence. We will extend the notion of almost sure convergence to sets of random vectors. A sequence of sets of random vectors C_n = {c_1^n, ..., c_k^n} is said to converge almost surely to a set of fixed vectors B = {b_1, ..., b_k} if there exists a labeling c_{n_1}^n, ..., c_{n_k}^n of the points in C_n such that c_{n_i}^n → b_i almost surely. We will denote this relation by C_n → B a.s.

3. K-means and ICA relationship

Before stating the main results, we first rewrite both the K-means and ICA objectives in terms of a matrix A and sources s^{(1)}, ..., s^{(m)}, and discuss the differences between the objectives.

In ICA we wish to estimate the unknown mixing matrix A, or equivalently its inverse W. Given m i.i.d. samples {x^{(i)}}_{i=1}^m ⊆ R^d of the ICA random vector x (see Equation (3)), a popular approach for estimating W is maximum likelihood estimation (Hyvarinen et al., 2001). The log-likelihood maximization takes the form:

    \argmax_W \sum_{i=1}^m \sum_{j=1}^d \log(p(w_j^⊤ x^{(i)})) + m \log|\det W|

where p is the prior density of the independent components s_j. In the context of natural images, sparsity is dominant (Hyvarinen et al., 2009), therefore p is often chosen to be the Laplace prior, which yields the ℓ_1 penalty −log(p(s_j)) = |s_j|. Another popular prior distribution is the Cauchy prior, which yields the penalty −log(p(s_j)) = log(1 + s_j^2).
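For illustration only, the following sketch evaluates this log-likelihood for a candidate filter matrix W under the Laplace prior, using −log p(s) = √2|s| up to an additive constant; it is a plain numpy translation of the formula above, not code from the paper.

import numpy as np

def ica_log_likelihood(W, X):
    """Log-likelihood of filters W for data X under the ICA model
    with a unit-variance Laplace prior on the sources.

    W : (d, d) candidate filter matrix (rows are the w_j^T).
    X : (d, m) data matrix whose columns are the samples x^(i).
    """
    d, m = X.shape
    S = W @ X                                # candidate sources w_j^T x^(i)
    # -log p(s) = sqrt(2) * |s| + const for a unit-variance Laplace prior.
    log_prior = -np.sqrt(2) * np.abs(S).sum()
    _, logdet = np.linalg.slogdet(W)         # log|det W|, numerically stable
    return log_prior + m * logdet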

Equivalently, we can write the ICA problem as

    \argmax_{A, s^{(1)}, ..., s^{(m)}} \sum_{i=1}^m \sum_{j=1}^d \log(p(s_j^{(i)})) − m \log|\det A|
    s.t. ∀i, x^{(i)} = A s^{(i)} .

For comparison, consider a simple reformulation of the K-means objective. Given S = {x^{(i)}}_{i=1}^m ⊆ R^d and k ∈ N, we can rewrite the objective given in (1) as

    \argmin_{A, s^{(1)}, ..., s^{(m)}} \frac{1}{m} \sum_{i=1}^m \| x^{(i)} − A s^{(i)} \|_2^2
    s.t. ∀i, s^{(i)} ∈ {e_1, ..., e_k} .

Thus, both K-means and ICA can be viewed as dictionary learning problems, seeking a matrix A and sources s^{(1)}, ..., s^{(m)} that best explain the input set. K-means attempts to choose A so as to explain every sample with a single column of A. ICA attempts to find a perfect explanation of the input set, but allows each sample to depend on a combination of the columns of A. Therefore, in general, these objectives seem to be quite different, and there is no guarantee that the two optimization problems will recover the same A.

Nevertheless, in the next sections we will see that if the independent components come from a sparse distribution (e.g. Laplace or Cauchy), K-means and ICA recover the very same mixing matrix A.

4. Main results

In this section we state our main results, showing conditions under which both K-means and ICA recover the same mixing matrix A. Throughout this section, we consider the ICA task restricted to the case in which the prior distribution over the independent components is the Laplace distribution, possibly the most common prior in the context of sparsity. Extending the results to other sparse distributions is left to future work. In the experiments section we mention a variety of distributions that behave similarly to the Laplace distribution in practice.

We begin with a result regarding general clustering algorithms, beyond K-means. We define a family of clustering algorithms that satisfy two particular properties: Rotation Invariance and Sparse Sensitivity (RISS). We call this family the family of “RISS” clustering algorithms. We prove that any “RISS” clustering algorithm can be used to solve the ICA task. We then claim that the ideal K-means algorithm is “RISS”. This work makes the first steps towards a complete proof. For ICA in two dimensions we prove that a close variant of K-means is indeed “RISS”, and we provide experiments that support the claim for larger dimensions.

Furthermore, our analysis relies on the following two ideal assumptions: we can obtain the exact whitening matrix T for the ICA random variable x (Equation (3)), and we have access to the ideal algorithm for K-means or its variants. In practice, T can be estimated using a procedure similar to PCA, and the K-means objective can be minimized to a local minimum using the standard Lloyd's algorithm. In the experiments section we show that even in the non-ideal setting our results tend to hold. We also discuss implementation details of our algorithm and derive from our theory an initialization technique for K-means that gives better results in practice.

4.1. “RISS” clustering algorithms

Figure 1. Illustration of property 2 of a “RISS” algorithm. (Blue) i.i.d. samples of s_lap in two dimensions. (White) The output of standard K-means with k = 4 converges to points that lie on the axes, and hence to the unit vectors when normalized.

A clustering algorithm, for the purpose of our discussion, is any algorithm that receives a set S = {x^{(i)}}_{i=1}^m ⊆ R^d as input and outputs a set whose size is twice the dimension, C = {c^{(i)}}_{i=1}^{2d} ⊆ R^d. We will use the term “RISS” clustering algorithm for a clustering algorithm that satisfies the following two properties. First, a rotation of the input set should cause the same rotation of the output centroids. This is a reasonable assumption for any distance-based clustering algorithm (such as K-means), since a rotation of all the examples does not change the distances between examples. The second property we require deals with the centroids the clustering algorithm finds when its input is a large enough sample of s_lap. We require that the output be a set of centroids that lie near the axes, which is where the bulk of the Laplace distribution is. See an illustration in Figure 1. For the purpose of our paper, we are only interested in the direction of the centroids and not in their magnitude. Therefore, we will always assume that the output of the clustering algorithm is normalized to have unit Euclidean norm. Formally:

Definition 1. A clustering algorithm B is “RISS” if it satisfies two properties.

1. For every input S = {x^{(i)}}_{i=1}^m ⊆ R^d and every orthonormal matrix U ∈ R^{d×d}, B(U ∗ S) = U ∗ B(S).

2. Let S_m = {s^{(i)}}_{i=1}^m ⊆ R^d be an i.i.d. sample of s_lap; then b_m = B(S_m) is a sequence of random sets such that b_m → {e_1, ..., e_d, −e_1, ..., −e_d} a.s.

We now present the Cluster-ICA algorithm, which employs a “RISS” clustering algorithm, and prove that it solves the ICA task, that is, it recovers the mixing matrix and filters of ICA.

Algorithm 1 Cluster-ICA
1: Input: an i.i.d. sample S = {x^{(i)}}_{i=1}^m ⊆ R^d of the ICA random variable x (Equation (3))
2: Obtain a whitening matrix T for x
3: Apply whitening to the input: y^{(i)} = T x^{(i)}
4: Set {c^{(1)}, ..., c^{(2d)}} = ClusterAlgorithm({y^{(i)}}_{i=1}^m)
5: Output: {T^⊤ c^{(1)}, ..., T^⊤ c^{(2d)}}
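The following Python sketch shows one non-ideal instantiation of Algorithm 1, using an eigendecomposition of the sample covariance for step 2 and scikit-learn's Lloyd-style KMeans as a stand-in for the ideal “RISS” clustering step. The regularization constant eps and the use of scikit-learn are our own choices and are not prescribed by the paper.

import numpy as np
from sklearn.cluster import KMeans

def cluster_ica(X, n_init=10, eps=1e-5):
    """A rough sketch of Algorithm 1 (Cluster-ICA).

    X : (d, m) data matrix whose columns are i.i.d. samples of x = A s.
    Returns (filters, columns): 2d row vectors approximating
    {±w_i} (rows of A^{-1}) and {±a_i} (columns of A) respectively.
    """
    d, m = X.shape
    # Step 2: whitening matrix from the (slightly regularized) sample covariance.
    cov = (X @ X.T) / m
    eigvals, U = np.linalg.eigh(cov)
    T = np.diag(1.0 / np.sqrt(eigvals + eps)) @ U.T
    # Step 3: whiten the input.
    Y = T @ X
    # Step 4: cluster the whitened data into k = 2d clusters
    # (Lloyd's algorithm as a stand-in for the ideal K-means).
    km = KMeans(n_clusters=2 * d, n_init=n_init).fit(Y.T)
    C = km.cluster_centers_
    C = C / np.linalg.norm(C, axis=1, keepdims=True)   # normalize the centroids
    # Step 5: T^T c approximates the ICA filters; T^{-1} c the mixing columns.
    filters = C @ T
    columns = C @ np.linalg.inv(T).T
    return filters, columns

Matching the returned rows to the true filters is subject to the usual sign and permutation ambiguity, as in the experiment of Section 6.1.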

Before proving the correctness of the proposed algorithm, we need the following lemma, adapted from (Hyvarinen & Oja, 2000).

Lemma 1. Suppose x is an ICA random variable with independent components s, T is a whitening matrix for x, and let y = Tx. Then y = Us where U is orthonormal.

Proof. Note that y = Tx = TAs. Denoting TA = U, we have I = E{yy^⊤} = U E{ss^⊤} U^⊤ = U U^⊤, where the first equality follows from the definition of T and the last equality holds since we assume w.l.o.g. that E{s_k^2} = 1.

Theorem 1. Let S_m = {x^{(i)}}_{i=1}^m be an i.i.d. sample of the ICA random variable x with s = s_lap. Then, given a “RISS” clustering algorithm and S_m, the output of Cluster-ICA converges a.s. to {w_1, ..., w_d, −w_1, ..., −w_d}, where the w_i^⊤ are the rows of W = A^{−1}.

Proof. From Lemma 1, after whitening we have y^{(i)} = Us^{(i)} for some orthonormal matrix U. Therefore y^{(i)} is an i.i.d. sample of the ICA random variable y = Us for some orthonormal matrix U. Let B be the “RISS” clustering algorithm used by Cluster-ICA. We show below that the properties of B can be used to recover U.

First, from property 2,

    B({s^{(i)}}_{i=1}^m) → {e_1, ..., e_d, −e_1, ..., −e_d} a.s.

In addition, from property 1,

    B({y^{(i)}}_{i=1}^m) = U ∗ B({s^{(i)}}_{i=1}^m) .

Therefore

    B({y^{(i)}}_{i=1}^m) → {U_1, ..., U_d, −U_1, ..., −U_d} a.s.

That is, after step 4 we have a set {c^{(1)}, ..., c^{(2d)}} that converges to the columns of U and their negatives. Finally, since we have U, it is easy to recover A^{−1}: since U = TA, we have A^{−1} = U^⊤T. It follows that W^⊤ = (A^{−1})^⊤ = T^⊤U, and so

    {T^⊤c^{(1)}, ..., T^⊤c^{(2d)}} → {w_1, ..., w_d, −w_1, ..., −w_d} a.s.

Similarly, it can be shown that if we change the output of Cluster-ICA to {T^{−1}c^{(1)}, ..., T^{−1}c^{(2d)}}, then it converges to the columns of the mixing matrix A and their negatives.

Two conclusions of practical interest arise from the above analysis. Firstly, there are many ways to perform the whitening in step 2 of the algorithm. In computer vision tasks, a common method is ZCA whitening, which tends to preserve the appearance of image patches as much as possible. According to Theorem 1, all such methods are equally valid in our context. Secondly, suppose we wish to empirically test whether some clustering algorithm is “RISS”: if we can prove property 1 of Definition 1, then property 2 is easy to validate by performing the same experiment as in Section 6.1.

4.2. K-means is “RISS”

We conjecture that K-means with k = 2d is “RISS”¹. In Section 6 we present experiments supporting this conjecture. In this section we make small modifications to the standard K-means objective (Equation (1)) that make the analysis easier. Specifically, we introduce a set of constraints to the objective that are, as evidenced in the experiments, already satisfied by K-means when applied to ICA data. We then prove that this variant is “RISS” in the two-dimensional case, and hope the proof sheds some light as to why the same might also be true for standard K-means and for any dimension.

First, we change the distance metric in Equation (1) from the ℓ_2 norm to the cosine distance, which is equivalent to constraining the centroids to the unit ℓ_2 sphere. Next, we leave d centroids free while constraining the other d to be their negatives. We then get

    \argmin_{∀i: ‖c^{(i)}‖_2 = 1, ∀i>d: c^{(i)} = −c^{(i−d)}} \frac{1}{m} \sum_{i=1}^m \min_{j \in [2d]} \| x^{(i)} − c^{(j)} \|_2^2

or equivalently, denoting

    J_S^{cos}(c) = \frac{1}{m} \sum_{i=1}^m \max_{j \in [2d]} ⟨x^{(i)}, c^{(j)}⟩ ,

we have

    \argmax_{∀i: ‖c^{(i)}‖_2 = 1, ∀i>d: c^{(i)} = −c^{(i−d)}} J_S^{cos}(c) .    (4)

In the equivalence we used the facts that ‖c^{(i)}‖_2 = 1 and that min f(x) = −max(−f(x)). Let A_{2d}^{cos} denote an ideal algorithm which, given a sample S = {x^{(i)}}_{i=1}^m ⊆ R^d, returns a set of centroids c = {c^{(1)}, ..., c^{(2d)}} ⊆ R^d which solve Equation (4).

Theorem 2. A_{2d}^{cos} satisfies property 1 of a “RISS” clustering algorithm.

Proof. The proof follows directly from the invariance of the inner product under orthonormal transformations, that is, ⟨x^{(i)}, c^{(j)}⟩ = ⟨Ux^{(i)}, Uc^{(j)}⟩ for any orthonormal U.

The following theorem establishes the second property.

¹Given that we normalize the resulting centroids to unit Euclidean norm.

Theorem 3. For d = 2, A_{2d}^{cos} satisfies property 2 of a “RISS” clustering algorithm.

5. Proof sketch of Theorem 3

We now describe the main lemmas we rely upon. Their proofs are deferred to the appendix. Recall that in Section 2 we defined the density-based version of K-means. We adapt Equation (2) to our variant as well and define

    J_x^{cos}(c) = E_x \max_{j \in [2d]} ⟨x, c^{(j)}⟩ ,

where x is some random variable with distribution p(x) over R^d. In this section we denote by µ = {µ^{(1)}, ..., µ^{(2d)}} ⊆ R^d an optimal solution to the density-based counterpart of Equation (4):

    \argmax_{∀i: ‖c^{(i)}‖_2 = 1, ∀i>d: c^{(i)} = −c^{(i−d)}} J_x^{cos}(c) .    (5)

We first characterize µ:

Lemma 2. For d = 2 and x = s_lap, µ = {e_1, e_2, −e_1, −e_2} is the unique maximizer of Equation (5).

An important step in the proof of the above lemma is simplifying the objective J_{s_lap}^{cos}(c) by the following.

Lemma 3. Let u_lap be a random variable with uniform density over the ℓ_1 sphere, that is, all points satisfying ‖x‖_1 = 1 have equal measure and the rest have measure zero. Then, for any feasible set C we have

    \argmax_{c ∈ C} J_{s_lap}^{cos}(c) = \argmax_{c ∈ C} J_{u_lap}^{cos}(c) .

Thus, Lemma 3 allows us to rewrite Equation (5) w.r.t. u_lap instead of s_lap. The statement in Lemma 2 now becomes more obvious and easy to illustrate:

Figure 2. µ = {e_1, e_2, −e_1, −e_2}. Blue arrows mark the µ^{(i)}, solid blue lines are the support of u_lap, and dotted red lines are the boundaries between the clusters V_i^µ = {x ∈ R^2 : argmax_{j∈[4]} ⟨x, µ^{(j)}⟩ = i}. Intuitively, on the ℓ_1 sphere, points near the axes have maximal length. Therefore, to maximize the expected inner product it is most beneficial to choose the µ^{(i)} on the axes as well.


Before continuing, we mention that Lemma 2 is the only point in our analysis that is restricted to d = 2. The extension to higher-dimensional spaces, d > 2, relies on proving the following conjecture, which is derived by setting x = u_lap in Equation (5) and using the fact that max{x, −x} = |x|.

Conjecture 1.

    \argmax_{c^{(1)}, ..., c^{(d)}: ‖c^{(i)}‖_2 = 1} E_{u_lap} { \max_{j ∈ [d]} |⟨u_lap, c^{(j)}⟩| } = {e_1, ..., e_d}

Now that we have characterized the optimizer of the density-based objective, let us prove convergence for a finite sample.

Lemma 4. Let S_m = {s^{(i)}}_{i=1}^m ⊆ R^d be an i.i.d. sample of s_lap. If µ is the unique maximizer of Equation (5) with x = s_lap, then A_{2d}^{cos}(S_m) → µ a.s.

The proof of Theorem 3 now follows directly from Lemma 2 and Lemma 4.

6. Experiments

In this section we present experiments to support our results in the non-ideal setting. In Appendix B we describe the non-ideal versions of A_k and A_k^{cos}, that is, the standard K-means algorithm and the algorithm for the variant presented in Equation (4) (referred to as cosine-K-means). The K-means algorithm, also known as Lloyd's algorithm, can be viewed as attempting to solve Equation (1) by alternating between optimizing the assignments of data points while keeping the centroids fixed, and vice versa. The cosine-K-means algorithm is derived in the same manner for Equation (4).

6.1. K-means is “RISS”

We now perform an experiment showing that both K-means and cosine-K-means tend to satisfy the definition of a “RISS” clustering algorithm when the number of requested clusters k is twice the dimension.

Property 1 of Definition 1 can easily be verified for both algorithms, similarly to the proof of Theorem 2. We therefore focus on property 2.

The experiment is as follows. We sample a number of d-dimensional vectors with each entry randomized i.i.d. according to some distribution D. Then we run K-means and cosine-K-means with our sample and 2d randomly initialized centroids as input, resulting in centroids c = {c^{(1)}, ..., c^{(2d)}}. For the output of standard K-means we normalize each c^{(i)} to have unit norm. Finally, we measure the distance of c to the set e = {e_1, ..., e_d, −e_1, ..., −e_d} by matching pairs of vectors from c and e, taking their differences ε_j and reporting the largest ‖ε_j‖_∞. More precisely, dist(c, e) = max_j ‖e_j − c_{n_j}‖_∞, where n_1, ..., n_{2d} is the matching permutation. Note that if dist(c, e) = 0, we have c = e. Table 1 shows the resulting distances for various sample sizes and dimensions with D set to the Laplace distribution, meaning the input is a sample of s_lap. Indeed, the distances tend to zero as the sample size increases, and so c converges to e per coordinate.

Table 1. dist(c, e) for different dimensions and sample sizes. Left values: K-means; right values: cosine-K-means.

           |S| = 10^4          |S| = 10^5          |S| = 5 × 10^6
  d = 2    0.0306, 0.0131      0.0063, 0.0033      0.0023, 0.00058
  d = 10   0.0908, 0.0495      0.0190, 0.0148      0.0032, 0.0024
  d = 20   0.3849, 0.0749      0.0367, 0.0238      0.0044, 0.0033
  d = 50   0.6124, 0.3748      0.2466, 0.1722      0.0079, 0.0046
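A small sketch of this distance measure (our illustration; the paper does not specify how the matching permutation is computed, so an assignment that minimizes the total cost is used here as one reasonable choice):

import numpy as np
from scipy.optimize import linear_sum_assignment

def dist_to_signed_basis(C):
    """dist(c, e): match centroids to {±e_1, ..., ±e_d} and return the
    largest infinity-norm difference over the matched pairs.

    C : (2d, d) array of unit-norm centroids.
    """
    two_d, d = C.shape
    E = np.vstack([np.eye(d), -np.eye(d)])           # the target set e
    # Cost of assigning centroid i to target j.
    cost = np.array([[np.abs(c - e).max() for e in E] for c in C])
    rows, cols = linear_sum_assignment(cost)         # matching permutation
    return cost[rows, cols].max()

# Example: slightly perturbed signed unit vectors give a small distance.
rng = np.random.default_rng(0)
d = 5
C = np.vstack([np.eye(d), -np.eye(d)]) + 0.01 * rng.normal(size=(2 * d, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)
print(dist_to_signed_basis(C))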

Regarding the question of which distributions this work applies to, the same experiment has been repeated for various distributions D. The distributions that exhibited similar results are: Hyperbolic Secant, Logistic, Cauchy, and Student's t. A common property of most of these is that they are unimodal, symmetric, and have positive excess kurtosis. A simple adaptation of Theorem 1 therefore tells us that the corresponding Cluster-ICA algorithm can be used to solve the ICA task w.r.t. all of the aforementioned distributions.

6.2. Recovery of a predetermined mixing matrix

In this experiment we show that Cluster-ICA combined with standard K-means solves the ICA task. To estimate the whitening matrix, we use the regularized approach described in (Ng, 2013). The experiment is as follows (a sketch of steps 1-2 is given after the list):

1. Construct 100 10-by-10 images of rectangles at random locations and sizes, and set the columns of the mixing matrix A to be their vectorizations (Figure 3.a).

2. Take a sample S of size 5 × 10^5 from the ICA random variable x = A s_lap (Figure 3.b).

3. Run Cluster-ICA with standard K-means and S as input to obtain estimates of the columns of A and the rows of its inverse.
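A minimal sketch of steps 1-2 (our construction: the exact rectangle sampling scheme is not specified in the paper, and a smaller sample than the 5 × 10^5 used above is drawn here to keep the example light):

import numpy as np

rng = np.random.default_rng(0)
side, n_cols, m = 10, 100, 50_000     # the paper uses a sample of size 5e5

# Step 1: columns of A are vectorized 10-by-10 images of random rectangles.
cols = []
for _ in range(n_cols):
    img = np.zeros((side, side))
    top, left = rng.integers(0, side - 1, size=2)
    h = rng.integers(1, side - top + 1)
    w = rng.integers(1, side - left + 1)
    img[top:top + h, left:left + w] = 1.0
    cols.append(img.ravel())
A = np.array(cols).T                  # shape (100, 100)

# Step 2: sample from the ICA random variable x = A s_lap.
S = rng.laplace(scale=1 / np.sqrt(2), size=(n_cols, m))
X = A @ S                             # input to Cluster-ICA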

Figure 3.c-d shows the recovered columns and filters. As can be seen in Figure 3.c, we indeed recover the correct matrix A. We also observed similar results when replacing s_lap with any of the distributions discussed in the previous section and repeating the same experiment.


Figure 3. Input and output of Cluster-ICA. (a) Constructed columns of A. (b) A subset of the sample S. (c) Estimate of the columns of A; the average absolute pixel difference obtained by matching every estimated column to its origin in A is 0.031 (the entries of A are 0 or 1). (d) Estimate of the rows of A^{−1}.

6.3. The filters obtained for natural images

For natural scenes, ICA has been shown to recover oriented edge-like filters with sparsely distributed outputs (Bell & Sejnowski, 1997). If natural image patches can be captured by an ICA model with sparse independent components, the Cluster-ICA algorithm should be able to recover similar filters.

We repeat the same experiment as in the previous section, only this time, instead of the sample S, we take random 10-by-10 patches from the natural scenes provided by (Olshausen, 1996).

Figure 4 shows that the filters learned by Cluster-ICA are indeed oriented edges.

Figure 4. Cluster-ICA on natural image patches. (a) A subset of the input: natural image patches. (b) Estimated rows of A^{−1} given by Cluster-ICA.

6.4. K-means initialization technique and practical remarks

When applying K-means, an important question is how to initialize the centroids. A “RISS” clustering algorithm should return centroids that lie on the axes, and Cluster-ICA is expected to return a rotated version of that, meaning each centroid's neighborhood is some quadrant of R^d. We therefore propose the following K-means initialization technique for our context: randomize an orthonormal matrix with columns u_i, and set the initial centroids to {u_1, ..., u_d, −u_1, ..., −u_d}, with the hope that as the K-means iterations progress, the centroids will rotate themselves symmetrically into the optimal solution.
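A sketch of this initialization (drawing the random orthonormal matrix via the QR decomposition of a Gaussian matrix is our choice; the paper does not prescribe a particular sampling scheme):

import numpy as np

def riss_init(d, rng=None):
    """Initial centroids {u_1,...,u_d, -u_1,...,-u_d} for K-means with k = 2d,
    where the u_i are the columns of a random orthonormal matrix."""
    rng = np.random.default_rng() if rng is None else rng
    # QR of a Gaussian matrix yields a random orthonormal Q.
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return np.vstack([Q.T, -Q.T])   # shape (2d, d), rows are unit-norm centroids

init = riss_init(100)
# e.g. KMeans(n_clusters=200, init=init, n_init=1) in scikit-learn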

Figure 5 shows the same experiment as above, with and without the proposed initialization technique. It can be seen that with this technique applied, many of the noisy filters are replaced with clear ones.

A few notes on applying the Cluster-ICA algorithm:

1. For numerical stability, it is recommended to apply a regularized estimation of the whitening matrix, as described in (Ng, 2013).

2. Throughout this work we have been setting k = 2d. It may be beneficial to learn a larger number of centroids. In practice, K-means may get stuck in a local minimum, or the ICA model with a square mixing matrix A may not represent reality (e.g. the real A could be overcomplete, with more columns than rows). In these cases, learning more centroids could recover more of the meaningful filters, which translates into better performance, as reported in (Coates et al., 2011). A natural extension of the proposed initialization technique for k > 2d is to scatter centroids on the ℓ_2 sphere evenly, or uniformly at random.


Figure 5. Good initialization eliminates noisy filters. (a) Random initialization. (b) Proposed initialization.

7. Interpreting the results of (Coates et al., 2011)

It is interesting to attempt to understand why K-means is the winning approach in (Coates et al., 2011). The training phase of the classification framework in (Coates et al., 2011) essentially implements the Cluster-ICA algorithm and returns centroids {c^{(i)}}. For simplicity of the argument, let us assume that the statistics of the data this framework is applied to can be captured by the ICA model with sparse independent components, x = As. The feature encoding is:

    f(x)_i = max{0, µ(z) − z_i}

where z_i = ‖x − c^{(i)}‖_2 and µ(z) is the mean of the elements of z. From Theorem 1, if k = 2d then we have c^{(i)} ≈ w_i, and so

    z_i ≈ \sqrt{ ‖c^{(i)}‖_2^2 + ‖x‖_2^2 − 2 w_i^⊤ x } ≈ \sqrt{ constant − 2 s_i } .

The term under the square root is approximately constant (apart from −2s_i) because the input x is normalized to unit norm, and because the c^{(i)} have, at least visually, roughly the same norm. Thus, the features are the sources of a patch with a non-linearity on top. When the ICA model does not represent reality (e.g. when A is over-complete) and k > 2d, the features could be similar in spirit to sparse coding, as the next section may suggest.
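A small sketch of this encoding (the “triangle” activation of Coates et al. (2011)), written with illustrative array shapes of our choosing:

import numpy as np

def triangle_features(X, C):
    """Encode inputs with f(x)_i = max{0, mean(z) - z_i},
    where z_i = ||x - c^(i)||_2.

    X : (m, d) inputs (e.g. unit-normalized patches).
    C : (k, d) centroids returned by the learning stage.
    Returns an (m, k) feature matrix.
    """
    # Pairwise Euclidean distances z, shape (m, k).
    z = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
    # Subtract each row's mean and keep only the positive part.
    return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)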

8. Discussion and Open Problems

We have presented the family of “RISS” clustering algorithms, whose properties enable us to solve the ICA task with a simple algorithm incorporating the whitening operation. K-means and a variant of it, cosine-K-means, appear to belong to this family. It would be interesting to better understand Conjecture 1 and to extend Theorem 3 to higher-dimensional spaces, both for K-means and for its variant. It would also be interesting to analyze convergence rates and compare them to those of standard methods for solving ICA.

In our analysis, “RISS” clustering algorithms have been defined w.r.t. the Laplace distribution but, as the experiments suggest, K-means behaves similarly for a larger class of distributions. It would be interesting to characterize this class. This class of distributions can be regarded as a weaker prior, compared to maximum-likelihood approaches for solving ICA that assume a specific distribution of the independent components.

Perhaps most intriguing is to understand the behavior in over-complete cases. Learning k > 2d centroids over natural patches appears to recover more filters as well as improve classification results, suggesting that K-means may be able to recover an over-complete mixing matrix. Consider, for example, an over-complete mixture of independent components that have more extreme sparsity than the Laplace distribution; K-means appears to recover the columns of the mixing matrix when k = 6.

To summarize the practical implications: whenever K-means is being used for dictionary learning, under suitable settings, it may be beneficial to unlock its ICA-like properties by incorporating the whitening operation, and to treat the resulting centroids according to the interpretation presented here. Together with its ability to learn an overcomplete representation, K-means could become a powerful tool.

Acknowledgments

This research is supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).


References

Bell, A. and Sejnowski, T. The independent components of natural scenes are edge filters. Vision Research, 1997.

Coates, A. and Ng, A. Y. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade (2nd ed.), pp. 561-580, 2012.

Coates, A., Ng, A. Y., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pp. 215-223, 2011.

Csurka, G., Dance, C., Fan, L., Willamowski, J., and Bray, C. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision, 2004.

Fei-Fei, L. and Perona, P. A bayesian hierarchical model for learning natural scene categories. In Conference on Computer Vision and Pattern Recognition, 2005.

Gupta, A. K. and Song, D. Lp-norm spherical distribution. Journal of Statistical Planning and Inference, 60(2):241-260, May 1997.

Hyvarinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Networks, 2000.

Hyvarinen, A., Karhunen, J., and Oja, E. Independent Component Analysis. Wiley Interscience, 2001.

Hyvarinen, A., Hurri, J., and Hoyer, P. O. Natural Image Statistics. Springer, 2009.

Lazebnik, S., Schmid, C., and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006.

Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.

Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, 2011.

Le, Q. V., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng, A. Y. Building high-level features using large scale unsupervised learning. In ICML, 2012.

Ng, A. Y. Unsupervised feature learning and deep learning tutorial. http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial, 2013.

Olshausen, B. Sparse coding simulation software. http://redwood.berkeley.edu/bruno/sparsenet/, 1996.

Pollard, D. Strong consistency of k-means clustering. Annals of Statistics, 9:135-140, 1981.

Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. Locality-constrained linear coding for image classification. In Computer Vision and Pattern Recognition, 2010.

Winn, J., Criminisi, A., and Minka, T. Object categorization by learned universal visual dictionary. In International Conference on Computer Vision, volume 2, 2005.


A. Theorem 3 proofs

Let us begin with a useful definition and lemma. For any centroid c^{(i)} in c ⊆ R^d we denote the interior of its corresponding cluster by V_i^c, defined as:

    V_i^c = {x ∈ R^d : \argmax_{j ∈ [2d]} ⟨x, c^{(j)}⟩ = i} .

We can neglect the issue of how points along cluster boundaries are assigned, since the set of such points has zero measure w.r.t. the density.

The next lemma presents a necessary condition for the optimality of µ.

Lemma 5. If p(x) is symmetric, that is, p(x) = p(−x) for all x ∈ R^d, then µ must satisfy µ^{(i)} = E{x | x ∈ V_i^µ} / ‖E{x | x ∈ V_i^µ}‖_2.

Proof. The proof is similar to the characterization of fixed points for the vanilla K-means objective. The law of total expectation and linearity of expectation allow us to write:

    J_x^{cos}(µ) = \sum_{i=1}^{2d} p(x ∈ V_i^µ) E{ \max_{j ∈ [2d]} ⟨x, µ^{(j)}⟩ | x ∈ V_i^µ }
                 = \sum_{i=1}^{2d} p(x ∈ V_i^µ) E{ ⟨x, µ^{(i)}⟩ | x ∈ V_i^µ }
                 = \sum_{i=1}^{2d} p(x ∈ V_i^µ) ⟨E{x | x ∈ V_i^µ}, µ^{(i)}⟩ .

Suppose by contradiction, and w.l.o.g., that µ^{(1)} ≠ E{x | x ∈ V_1^µ} / ‖E{x | x ∈ V_1^µ}‖_2. Let µ* be the solution identical to µ in all elements except for the following:

    µ^{(1)*} = E{x | x ∈ V_1^µ} / ‖E{x | x ∈ V_1^µ}‖_2
    µ^{(1+d)*} = E{x | x ∈ V_{1+d}^µ} / ‖E{x | x ∈ V_{1+d}^µ}‖_2

Observe that the unique maximizers of the terms ⟨E{x | x ∈ V_1^µ}, c^{(1)}⟩ and ⟨E{x | x ∈ V_{1+d}^µ}, c^{(1+d)}⟩ are µ^{(1)*} and µ^{(1+d)*} respectively, when the c^{(i)} are constrained to the unit ℓ_2 sphere. Therefore, we have:

    J_x^{cos}(µ) < \sum_{i=1}^{2d} p(x ∈ V_i^µ) ⟨E{x | x ∈ V_i^µ}, µ^{(i)*}⟩
                 = \sum_{i=1}^{2d} \int_{x ∈ V_i^µ} p(x) ⟨x, µ^{(i)*}⟩ dx
                 ≤ \int_{x ∈ R^d} p(x) \max_{j ∈ [2d]} ⟨x, µ^{(j)*}⟩ dx = J_x^{cos}(µ*) .

Thus, µ* has a strictly larger objective. To reach a contradiction, let us now show that it is also a feasible solution. The symmetry constraint in Equation (5) implies µ^{(1)} = −µ^{(d+1)}, and it is easily verified that the neighborhoods V_i^µ are symmetric as well. In particular, V_1^µ = {−x : x ∈ V_{1+d}^µ}. Then, since p(x) is symmetric, it follows that E{x | x ∈ V_1^µ} = −E{x | x ∈ V_{1+d}^µ} and hence µ^{(1)*} = −µ^{(d+1)*}.

We now present the proofs of the lemmas from Section 5.

Proof. [of Lemma 3] As implied by (Gupta & Song, 1997), the random vector s_lap can be expressed as a product of two independent random variables, s_lap = z u_lap, where z is a scalar-valued random variable with the distribution of the sum of d independent centered exponential variables, also known as the Erlang distribution. Now

    J_{s_lap}^{cos}(c) = E_{s_lap} { \max_{j ∈ [2d]} ⟨s_lap, c^{(j)}⟩ }
                       = E_{z, u_lap} { \max_{j ∈ [2d]} ⟨z u_lap, c^{(j)}⟩ }
                       = E_z{z} E_{u_lap} { \max_{j ∈ [2d]} ⟨u_lap, c^{(j)}⟩ }
                       = E_z{z} J_{u_lap}^{cos}(c) .

Since E_z{z} does not depend on c, the result follows.

Proof. [of Lemma 2] Throughout this proof we will provide geometrical illustrations: blue arrows mark the c^{(i)}, dotted red lines are the boundaries between the clusters V_i^c, solid blue lines are the ℓ_1 unit sphere, and solid red lines are points belonging to a particular V_i.

First, we invoke Lemma 3, which allows us to replace the term J_{s_lap}^{cos}(c) in Equation (5) with the more friendly objective J_{u_lap}^{cos}(c).

When dealing with two dimensions we have 4 centroids. It is easy to see that for any c^{(1)}, ..., c^{(4)} such that c^{(3)} = −c^{(1)} and c^{(4)} = −c^{(2)}, the sets V_i^c are rotated quadrants of R^2.

Now consider any two adjacent sets V_i^c, V_j^c. One is the 90-degree rotated version of the other. Suppose x = (x_1, x_2) ∈ V_i^c and x′ ∈ V_j^c is a 90-degree rotation of x, i.e. x′ = (−x_2, x_1). Note that ‖x‖_1 = ‖x′‖_1, so p(x) = p(x′), and it follows that the measure is rotated accordingly.

Now Lemma 5 tells us that the optimal c^{(i)} are determined by the measure within V_i^c, and so c^{(1)}, ..., c^{(4)} are 90-degree rotations of each other. It follows that every V_j^c is a 45-degree angular span around c^{(j)}. We are therefore left with the task of searching amongst c^{(1)}, ..., c^{(4)} that are 90 degrees apart such that c^{(i)} = E{x | x ∈ V_i^c} / ‖E{x | x ∈ V_i^c}‖_2; centroids for which the direction of E{x | x ∈ V_i^c} does not coincide with c^{(i)} do not satisfy this criterion.

It is easily verified that the only centroids fitting our search criterion are c = {e_1, e_2, −e_1, −e_2} and c′ = {(1/√2, 1/√2)^⊤, (−1/√2, 1/√2)^⊤, (1/√2, −1/√2)^⊤, (−1/√2, −1/√2)^⊤}.

Let us compute the objective value for c′. Since ⟨x, c′^{(i)}⟩ = 1/√2 for every x with non-zero measure in V_i^{c′}, we have

    J_{u_lap}^{cos}(c′) = \sum_{i=1}^4 p(x ∈ V_i^{c′}) E{ ⟨x, c′^{(i)}⟩ | x ∈ V_i^{c′} } = 1/√2 .

To compute the objective value for c, we note that within every V_i^c, ⟨x, c^{(i)}⟩ is uniformly distributed on the line from 0.5 to 1. Therefore E{⟨x, c^{(i)}⟩ | x ∈ V_i^c} = 0.75 and J_{u_lap}^{cos}(c) = 0.75.

We have therefore shown that J_{u_lap}^{cos}(c) > J_{u_lap}^{cos}(c′), which concludes our proof.

Proof. [of Lemma 4] We use the consistency theorem of Pollard (Pollard, 1981). Recall the notation introduced in Section 2. Pollard's proof consists of showing that the optimal centroids for J_S(c) lie in a compact region almost surely, establishing a uniform strong law of large numbers for J_S(c), and proving continuity of J_x(c). Almost sure convergence of the minimum of J_S(c) to the minimum of J_x(c) follows directly.

By adding the constraints on the centroids of standard K-means, ∀i: ‖c^{(i)}‖_2 = 1 and ∀i > d: c^{(i)} = −c^{(i−d)}, the same proof is applicable to our variant of K-means. A condition for applying the theorem is ∫ ‖x‖_2 p(x) dx < ∞, which is indeed the case when x = s_lap, since the Laplace distribution has finite moments. There is another condition regarding the uniqueness of the minimizer of J_x(c) for any number of centroids up to 2d, which is used for the compact-region proof. Since our added constraints already imply that the c^{(i)} belong to a compact set, we can skip this condition and require a unique minimizer only for 2d centroids.


B. Lloyd's K-means algorithm and cosine-K-means algorithm

Algorithm 2 Lloyd's K-means algorithm
Input: S = {x^{(1)}, ..., x^{(m)}}; an initial set of k centroids c_1^{(1)}, ..., c_k^{(1)}
repeat
  Uniquely assign data points to the closest centroids:
    S_i^{(t)} = {x ∈ S : ∀j, ‖x − c_i^{(t)}‖ ≤ ‖x − c_j^{(t)}‖}
  Re-adjust centroids to the cluster means:
    c_i^{(t+1)} = (1 / |S_i^{(t)}|) \sum_{x ∈ S_i^{(t)}} x
until converged

Algorithm 3 cosine-K-means algorithm
Input: S = {x^{(1)}, ..., x^{(m)}}; an initial set of k centroids c_1^{(1)}, ..., c_k^{(1)}, with even k, satisfying c_{k/2+1}^{(1)} = −c_1^{(1)}, ..., c_k^{(1)} = −c_{k/2}^{(1)}
repeat
  Uniquely assign data points to the closest centroids:
    S_i^{(t)} = {x ∈ S : ∀j, ⟨x, c_i^{(t)}⟩ ≥ ⟨x, c_j^{(t)}⟩}
  Re-adjust centroids to normalized cluster means:
    1. ∀i ∈ {1, ..., k/2}: c_i^{(t+1)} = \sum_{x ∈ S_i^{(t)}} x − \sum_{x ∈ S_{i+k/2}^{(t)}} x
    2. ∀i ∈ {k/2 + 1, ..., k}: c_i^{(t+1)} = −c_{i−k/2}^{(t+1)}
    3. Normalize c_i^{(t+1)} to unit ℓ_2 norm
until converged
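For completeness, here is a short Python transcription of Algorithm 3 (one possible implementation; the convergence test and the handling of empty clusters are our own choices, not specified by the pseudocode):

import numpy as np

def cosine_kmeans(X, k, n_iter=100, seed=0):
    """A direct transcription of Algorithm 3 (cosine-K-means).

    X : (m, d) data matrix; k must be even.
    Returns (k, d) unit-norm centroids satisfying c_{k/2+i} = -c_i.
    """
    assert k % 2 == 0
    m, d = X.shape
    rng = np.random.default_rng(seed)
    # Symmetric random initialization on the unit sphere.
    C = rng.normal(size=(k // 2, d))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    C = np.vstack([C, -C])
    for _ in range(n_iter):
        # Assign each point to the centroid with the largest inner product.
        labels = (X @ C.T).argmax(axis=1)
        new_half = np.zeros((k // 2, d))
        for i in range(k // 2):
            # Difference of the cluster sums of S_i and S_{i+k/2}, then normalize.
            diff = X[labels == i].sum(axis=0) - X[labels == i + k // 2].sum(axis=0)
            norm = np.linalg.norm(diff)
            new_half[i] = diff / norm if norm > 0 else C[i]   # keep old centroid if empty
        new_C = np.vstack([new_half, -new_half])
        if np.allclose(new_C, C):
            break
        C = new_C
    return C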