
Randomized Subspace Learning Algorithms with Subspace Structure Preservation Guarantees

Devansh Arpit, Gaurav Srivastava, Venu Govindaraju

January 21, 2014

Abstract

Modelling data as being sampled from a union of independent or disjoint subspaces has been widely applied to a number of real world applications. Recently, high dimensional data has come into focus because of advancements in computational power and storage capacity. However, a number of algorithms that assume the aforementioned data model have high time complexity which makes them slow. Dimensionality reduction is a commonly used technique to tackle this problem. In this paper, we first formalize the concept of Independent Subspace Structure and then we propose two different randomized algorithms (supervised and unsupervised) for subspace learning that theoretically preserve this structure for any given dataset. This has important implications for subspace segmentation applications as well as algorithms for subspace based clustering and low-rank recovery of data, which can then be carried out in low dimensions by first applying our dimensionality reduction technique to the high dimensional data. We support our theoretical analysis with empirical results on both synthetic and real world data.

1 Introduction

Subspace clustering or subspace segmentation has recently emerged as a powerful tool for data modeling and analysis in low dimensions. Conventionally, a given dataset is approximated by a single low dimensional subspace via PCA (Principal Component Analysis). However, in many problems, it is more logical to approximate the data by a union of multiple independent linear subspaces rather than a single subspace. Given just the set of data points, the problem of subspace clustering is to segment them into clusters, each of which can be fitted by a low-dimensional subspace. Such low dimensional data modeling with multiple independent subspaces has found numerous applications in image segmentation [20], motion segmentation [18], face clustering [12] [9], image representation and compression [10] and systems theory [17].

With increasing technological advancements in computation speed and memory, high dimensional feature vectors are being used in many machine learning applications. However, as computation time increases with dimensionality, a number of algorithms for the above mentioned tasks become increasingly slow. Dimensionality reduction techniques (PCA, LDA, LLE [13], LPP [7]) are usually employed to overcome this problem. However, for the application of algorithms that perform tasks like subspace segmentation or low rank recovery, the subspace structure of the data should be preserved after dimensionality reduction.

In this paper, we formalize the idea of Independent Subspace Structure for datasets. Based on this definition, we propose two different algorithms for dimensionality reduction (supervised and unsupervised) that preserve this structure of a dataset. For the unsupervised setting, we show that random projection preserves the subspace structure of data vectors generated from a union of independent linear subspaces. In the past decade, random projection has been extensively used for data dimensionality reduction for a number of tasks like k-means clustering [5] and classification [3] [15]. In these papers the authors prove that certain desired properties of data vectors for the respective tasks are preserved under random projection. However, to the best of our knowledge, a formal analysis of linear subspace structure preservation under random projection has not been reported thus far. On the other hand, for the supervised setting, we first show that for any two disjoint subspaces with arbitrary dimensionality, there exists a two dimensional subspace such that both subspaces collapse to form two lines. We then extend this idea to the K class case and show that 2K projection vectors are sufficient for preserving the subspace structure of a dataset.

2 Definitions and Motivation

A linear subspace of dimension d in R^n can be represented using a matrix B ∈ R^{n×d} whose columns form the support of the subspace. Then any vector in this subspace can be represented as x = Bw for some w ∈ R^d. Without any loss of generality, we will assume that the columns of B are unit vectors and pairwise orthogonal to each other. Let there be K independent subspaces denoted by S_1, S_2, ..., S_K. A subspace S_i is said to be independent of all the other subspaces if there does not exist any non-zero vector in S_i which is a linear combination of vectors in the other subspaces. Formally,

∑_{i=1}^{K} S_i = ⊕_{i=1}^{K} S_i

where ⊕ denotes the direct sum of subspaces. We will use the notation x = B_i w to denote that the data vector x ∈ S_i. Here, the columns of B_i ∈ R^{n×d_i} form the support of S_i and w ∈ R^{d_i} can be any arbitrary vector.

While the above definition simply states the condition under which two or more subspaces are independent, it does not specifically tell us quantitatively how well they are separated. This leads us to the definition of the margin between a pair of subspaces.

Definition 1 (Subspace Margin) Subspaces S_i and S_j are separated by margin γ_ij if

γ_ij = max_{u∈S_i, v∈S_j} ⟨u, v⟩ / (‖u‖_2 ‖v‖_2)    (1)

Geometrically, the above definition simply says that the margin between any two subspaces is defined as the maximum dot product between two unit vectors, one from either subspace. The vector pair u and v that maximizes this dot product is known as the principal vector pair between the two subspaces, while the angle between these vectors is called the principal angle. There are min(d_i, d_j) such principal angles (vector pairs), which can be found by computing the SVD of B_i^T B_j. Notice that γ_ij ∈ [0, 1], where γ_ij = 0 implies that the subspaces are maximally separated while γ_ij = 1 implies that the two subspaces are not independent.
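For concreteness, the margin of Definition 1 can be computed directly from orthonormal bases of the two subspaces. The following is a minimal sketch (assuming NumPy; not part of the paper's own material) that obtains it from the SVD of B_i^T B_j.

```python
# Minimal sketch (assuming NumPy): the singular values of Bi^T Bj are the
# cosines of the principal angles between the two subspaces, so the largest
# one is the margin gamma_ij of Definition 1.
import numpy as np

def subspace_margin(Bi, Bj):
    """Bi: n x d_i, Bj: n x d_j, each with orthonormal columns."""
    return np.linalg.svd(Bi.T @ Bj, compute_uv=False).max()

# Example: two random subspaces in R^50 (independent with probability 1).
rng = np.random.default_rng(0)
Bi, _ = np.linalg.qr(rng.standard_normal((50, 3)))
Bj, _ = np.linalg.qr(rng.standard_normal((50, 4)))
print(subspace_margin(Bi, Bj))  # some value in [0, 1)
```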

Having defined these concepts, our goal is to learn a subspace from any given dataset that is sampled from a union of independent linear subspaces such that this independent subspace structure property is approximately preserved in the dataset. We will make this idea more concrete shortly. Before that, notice that the above definitions of independent subspaces and separation margin (definition 1) apply explicitly to well defined subspaces. So a natural question is: how do we define these concepts for datasets? We define the Independent Subspace Structure for a dataset as follows.

Definition 2 (Independent Subspace Structure) Let X = {x}_{i=1}^{N} be a K class dataset of N data vectors in R^n and X_i ⊂ X (i ∈ {1 . . . K}) such that the data vectors in X_i belong to class i. Then we say that the dataset X has Independent Subspace Structure if each i-th class data vector x ∈ X_i is sampled from a linear subspace S_i (i ∈ {1 . . . K}) in R^n such that these subspaces are independent.

Again, the above definition only specifies that data samples from different classes belong to independent subspaces. To quantify the margin between these subspaces, we define Subspace Margin for datasets as follows.

Definition 3 (Subspace Margin for datasets) For a dataset X with Independent Subspace Structure, class i (i ∈ {1 . . . K}) data is separated from all the other classes with margin γ_i if, ∀x ∈ X_i and ∀y ∈ X \ X_i, ⟨x, y⟩ / (‖x‖_2 ‖y‖_2) ≤ γ_i, where γ_i ∈ [0, 1).

With these definitions, we will now make the idea of independent subspace structure preservation concrete. Specifically, by subspace structure preservation, we mean that if originally a given set of data vectors is sampled from a union of independent linear subspaces, then after projection the projected data vectors also belong to a union of independent linear subspaces. Formally, let X be a K class dataset in R^n with independent subspace structure such that class i samples (x ∈ X_i) are drawn from subspace S_i. Then the projected data vectors (using projection matrix P ∈ R^{n×m}) in the sets X̄_i := {P^T x : x ∈ X_i} for i ∈ {1 . . . K} are such that the data vectors in each set X̄_i belong to a linear subspace S̄_i in R^m, and the subspaces S̄_i (i ∈ {1 . . . K}) are independent, i.e., ∑_{i=1}^{K} S̄_i = ⊕_{i=1}^{K} S̄_i.

In this paper, we consider both the supervised and the unsupervised problem setting for subspace learning. For the unsupervised setting, we present a non-adaptive subspace learning technique, while for the supervised setting, we show that for a K class problem, 2K projection vectors (with certain properties) are sufficient for independent subspace structure preservation. Further, under the latter setting, we present a heuristic algorithm that approximately computes such 2K projection vectors from a given dataset.


3 Dimensionality Reduction for Unsupervised setting

Random projection has gained great popularity in recent years due to its low computational cost and the guarantees it comes with. Specifically, it has been shown for linearly separable data [15] [3] and for data that lies on a low dimensional compact manifold [4] [8] that random projection preserves the linear separability and the manifold structure respectively, given that certain conditions are satisfied. Notice that a union of independent linear subspaces is a specific case of manifold structure and hence the results of random projection for manifold structure apply in general to our case. However, as those results are derived for a more general setting, they are weak when applied to our problem. Further, to the best of our knowledge, there has not been any prior analysis of the effect of random projection on the margin between independent subspaces. In this section, we derive bounds on the minimum number of random vectors required specifically for the preservation of the independent linear subspace structure of a given dataset.

The various applications of random projection for dimensionality reduction are rooted in the following version of the Johnson-Lindenstrauss lemma [16]:

Lemma 4 For any vector x ∈ R^n, a matrix R ∈ R^{m×n} whose elements are drawn i.i.d. as R_ij ∼ (1/√m) N(0, 1), and any ε ∈ (0, 1/2),

Pr( (1 − ε)‖x‖^2 ≤ ‖Rx‖^2 ≤ (1 + ε)‖x‖^2 ) ≥ 1 − 2 exp(−(m/4)(ε^2 − ε^3))    (2)

This lemma states that the ℓ_2 norm of the randomly projected vector is approximately equal to the ℓ_2 norm of the original vector. Using this lemma, it can be readily shown that for any number of data vectors, the pairwise distances among them are preserved with high probability under the mild condition that the number of random vectors used should be proportional to the logarithm of the number of data vectors. While conventionally the elements of the random matrix are generated from a Gaussian distribution, it has been proved [1] [11] that one can indeed use sparse random matrices (with most of the elements being zero with high probability) to achieve the same goal.
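As a quick numerical illustration of lemma 4 (a sketch assuming NumPy, not from the paper), one can check that the squared norm of a randomly projected vector concentrates around that of the original vector:

```python
# Sketch: norm preservation under a scaled Gaussian random projection (eq. 2).
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 200
x = rng.standard_normal(n)
R = rng.standard_normal((m, n)) / np.sqrt(m)   # R_ij ~ (1/sqrt(m)) N(0, 1)

ratio = np.linalg.norm(R @ x) ** 2 / np.linalg.norm(x) ** 2
print(ratio)   # close to 1 with high probability
```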

Before studying the conditions required for independent subspace structure preservation in the binary and multiclass cases, we first state our cosine preservation lemma, which simply states that the cosine of the angle between any two fixed vectors is approximately preserved under random projection. A similar angle preservation theorem is stated in [15]; we discuss the difference between the two after the lemma.

Lemma 5 (Cosine preservation) For all x, y ∈ R^n, any ε ∈ (0, 1/2), and a matrix R ∈ R^{m×n} whose elements are drawn i.i.d. as R_ij ∼ (1/√m) N(0, 1), one of the following inequalities holds true:

(1/(1−ε)) ⟨x, y⟩/(‖x‖_2‖y‖_2) − ε/(1−ε) ≤ ⟨Rx, Ry⟩/(‖Rx‖_2‖Ry‖_2) ≤ (1/(1+ε)) ⟨x, y⟩/(‖x‖_2‖y‖_2) + ε/(1+ε)    (3)

if ⟨x, y⟩/(‖x‖_2‖y‖_2) < −ε,

(1/(1−ε)) ⟨x, y⟩/(‖x‖_2‖y‖_2) − ε/(1−ε) ≤ ⟨Rx, Ry⟩/(‖Rx‖_2‖Ry‖_2) ≤ (1/(1−ε)) ⟨x, y⟩/(‖x‖_2‖y‖_2) + ε/(1−ε)    (4)

if −ε ≤ ⟨x, y⟩/(‖x‖_2‖y‖_2) < ε, and

(1/(1+ε)) ⟨x, y⟩/(‖x‖_2‖y‖_2) − ε/(1+ε) ≤ ⟨Rx, Ry⟩/(‖Rx‖_2‖Ry‖_2) ≤ (1/(1−ε)) ⟨x, y⟩/(‖x‖_2‖y‖_2) + ε/(1−ε)    (5)

if ⟨x, y⟩/(‖x‖_2‖y‖_2) ≥ ε. Further, the applicable inequality holds true with probability at least 1 − 8 exp(−(m/4)(ε^2 − ε^3)).

Proof: Let x̂ := x/‖x‖_2 and ŷ := y/‖y‖_2, and consider the case when ⟨x, y⟩/(‖x‖_2‖y‖_2) ≥ ε. Then from lemma 4,

Pr( (1 − ε)‖x̂ + ŷ‖^2 ≤ ‖Rx̂ + Rŷ‖^2 ≤ (1 + ε)‖x̂ + ŷ‖^2 ) ≥ 1 − 2 exp(−(m/4)(ε^2 − ε^3))
Pr( (1 − ε)‖x̂ − ŷ‖^2 ≤ ‖Rx̂ − Rŷ‖^2 ≤ (1 + ε)‖x̂ − ŷ‖^2 ) ≥ 1 − 2 exp(−(m/4)(ε^2 − ε^3))    (6)

Using the union bound on the above two, both hold true simultaneously with probability at least 1 − 4 exp(−(m/4)(ε^2 − ε^3)). Notice that ‖Rx̂ + Rŷ‖^2 − ‖Rx̂ − Rŷ‖^2 = 4⟨Rx̂, Rŷ⟩. Using 6, we get

4⟨Rx̂, Rŷ⟩ = ‖Rx̂ + Rŷ‖^2 − ‖Rx̂ − Rŷ‖^2
          ≤ (1 + ε)‖x̂ + ŷ‖^2 − (1 − ε)‖x̂ − ŷ‖^2
          = (1 + ε)(2 + 2⟨x̂, ŷ⟩) − (1 − ε)(2 − 2⟨x̂, ŷ⟩)
          = 4ε + 4⟨x̂, ŷ⟩    (7)

We can similarly prove a bound in the other direction to yield ⟨Rx̂, Rŷ⟩ ≥ ⟨x̂, ŷ⟩ − ε. Together we have that

⟨x̂, ŷ⟩ − ε ≤ ⟨Rx̂, Rŷ⟩ ≤ ⟨x̂, ŷ⟩ + ε    (8)

holds true with probability at least 1 − 4 exp(−(m/4)(ε^2 − ε^3)).

Finally, applying lemma 4 to the vectors x̂ and ŷ, we get

(1 − ε)‖x̂‖_2‖ŷ‖_2 ≤ ‖Rx̂‖_2‖Rŷ‖_2    (9)

Thus, ⟨Rx̂, Rŷ⟩ = ⟨Rx, Ry⟩/(‖x‖_2‖y‖_2) ≥ (1 − ε) ⟨Rx, Ry⟩/(‖Rx‖_2‖Ry‖_2). Combining this with eq 8, we get ⟨Rx, Ry⟩/(‖Rx‖_2‖Ry‖_2) ≤ (1/(1−ε)) ⟨x̂, ŷ⟩ + ε/(1−ε). We can similarly obtain the other side of the inequality, which yields 5 (the case considered here, ⟨x, y⟩/(‖x‖_2‖y‖_2) ≥ ε, corresponds to inequality 5). Notice that we made use of lemma 4 four times, and hence inequality 5 holds with probability at least 1 − 8 exp(−(m/4)(ε^2 − ε^3)) by the union bound.

Inequalities 3 and 4 can be obtained similarly. □

We would like to point out that the cosines of both acute and obtuse angles are preserved under random projection, as is evident from the above lemma. However, if the cosine value is close to zero, the additive error in inequalities 3, 4 and 5 distorts the cosine significantly after projection. On the other hand, [15] state in their paper that obtuse angles are not preserved. As evidence, the authors empirically show cosines with negative value close to zero. However, as already stated, cosine values close to zero are not well preserved in general. Hence this does not serve as evidence that obtuse angles are not preserved under random projection; indeed, we show empirically that they are preserved. Notice that this is not the case for the JL lemma (4), where the error is multiplicative and hence the lengths of vectors are preserved to a good degree invariantly for all vectors.

On the other hand, in general, the inner product between vectors is not well preserved under random projection, irrespective of the angle between the two vectors. This can be analysed using equation 8. Rewriting that equation for vectors of arbitrary length, we have that

⟨x, y⟩ − ε‖x‖_2‖y‖_2 ≤ ⟨Rx, Ry⟩ ≤ ⟨x, y⟩ + ε‖x‖_2‖y‖_2    (10)

holds with high probability. Clearly, because the error term itself depends on the lengths of the vectors, the inner product between arbitrary vectors after random projection is not well preserved. However, as a special case, the inner product of vectors with length less than 1 is preserved (corollary 2 in [2]) because the error term is diminished in this case.
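The following small experiment (a sketch assuming NumPy, not from the paper) illustrates this point: the cosine error after projection does not depend on the vector lengths, while the inner product error grows with them, consistent with equation 10.

```python
# Sketch: cosine vs. inner product behaviour under random projection (eq. 10).
import numpy as np

rng = np.random.default_rng(2)
n, m = 1000, 300
R = rng.standard_normal((m, n)) / np.sqrt(m)

u = rng.standard_normal(n)
v = 0.8 * u + 0.6 * rng.standard_normal(n)      # cosine(u, v) is roughly 0.8
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for scale in (1.0, 100.0):                      # same directions, longer vectors
    x, y = scale * u, scale * v
    print(scale,
          abs(cos(R @ x, R @ y) - cos(x, y)),    # the same for both scales
          abs((R @ x) @ (R @ y) - x @ y))        # grows roughly like scale**2
```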

For ease of representation, in all further analysis we will use equation 5 when making use of the cosine preservation lemma. Notice that the difference between the three inequalities lies only in the cosine preservation error, while each holds with the same probability bound. We will later see that this does not affect our bounds on the number of random vectors required for subspace margin preservation and hence does not affect our analysis.

In order for the independent subspace structure to be preserved for any dataset, we need two things to hold simultaneously. First, data sampled from each subspace should continue to belong to a linear subspace after projection. Second, the subspace margin for the dataset should be preserved.

Remark 6 (Individual Subspace preservation) Let X_i denote the set of data vectors x drawn from the subspace S_i, and let R ∈ R^{m×n} denote the random projection matrix as defined before. Then after projection, all the vectors in X_i continue to lie in the linear subspace spanned by the columns of RB_i, where the columns of B_i span S_i.

This straightforward remark states that the first requirement always holds true. We now derive the condition needed for the second requirement to hold true.


Theorem 7 (Binary Subspace Preservation) Let X = {x}_{i=1}^{N} be a 2 class (K = 2) dataset with Independent Subspace Structure and margin γ. Then for any ε ∈ (0, 1/2), the subspace structure of the dataset is preserved after random projection, with the projected margin γ̄ bounded as follows:

Pr( γ̄ ≤ (1/(1−ε)) γ + ε/(1−ε) ) ≥ 1 − 6N^2 exp(−(m/4)(ε^2 − ε^3))    (11)

Proof: Let x ∈ X_1 and y ∈ X_2 be samples from class 1 and class 2 respectively in the dataset X. Then, applying the union bound on lemma 5 for a single vector x and all vectors y ∈ X_2,

⟨Rx, Ry⟩/(‖Rx‖_2‖Ry‖_2) ≤ (1/(1−ε)) ⟨x, y⟩/(‖x‖_2‖y‖_2) + ε/(1−ε)
                        ≤ (1/(1−ε)) max_{x∈X_1, y∈X_2} ⟨x, y⟩/(‖x‖_2‖y‖_2) + ε/(1−ε)
                        = (1/(1−ε)) γ + ε/(1−ε)    (12)

holds with probability at least 1 − 6N_2 exp(−(m/4)(ε^2 − ε^3)), where N_2 is the number of samples in X_2. As we are only interested in upper bounding the projected margin, we have only used one side of the inequality in lemma 5 and hence obtain a tighter probability bound. Again, applying the above bound for all the samples x ∈ X_1,

⟨Rx, Ry⟩/(‖Rx‖_2‖Ry‖_2) ≤ (1/(1−ε)) γ + ε/(1−ε)    (13)

holds with probability at least 1 − 6N_1N_2 exp(−(m/4)(ε^2 − ε^3)), where N_1 is the number of samples in X_1. Using N^2 to upper bound N_1N_2, we get 11. □

The above theorem analyses the condition required for margin preservation in the binary class case. We now analyse the requirement for the multiclass case.

Theorem 8 (Multiclass Subspace Preservation) Let X = {x}_{i=1}^{N} be a K class dataset with Independent Subspace Structure, where the i-th class has margin γ_i. Then for any ε ∈ (0, 1/2), the subspace structure of the entire dataset is preserved after random projection, with the projected margin γ̄_i for class i bounded as follows:

Pr( γ̄_i ≤ (1/(1−ε)) γ_i + ε/(1−ε), ∀i ∈ {1 . . . K} ) ≥ 1 − 6N^2 exp(−(m/4)(ε^2 − ε^3))    (14)

Proof: Applying the union bound on lemma 5 for a single vector x ∈ X_i and all vectors y ∈ X \ X_i,

⟨Rx, Ry⟩/(‖Rx‖_2‖Ry‖_2) ≤ (1/(1−ε)) γ_i + ε/(1−ε)    (15)

holds with probability at least 1 − 6N̄_i exp(−(m/4)(ε^2 − ε^3)), where N̄_i := N − N_i. Again, applying the above bound for all the samples x ∈ X_i,

⟨Rx, Ry⟩/(‖Rx‖_2‖Ry‖_2) ≤ (1/(1−ε)) γ_i + ε/(1−ε)    (16)

holds with probability at least 1 − 6N_iN̄_i exp(−(m/4)(ε^2 − ε^3)). Computing bounds similar to 16 for all the classes, we have that

γ̄_i ≤ (1/(1−ε)) γ_i + ε/(1−ε),  ∀i ∈ {1 . . . K}    (17)

holds with probability at least 1 − ∑_{i=1}^{K} 6N_iN̄_i exp(−(m/4)(ε^2 − ε^3)). Notice that ∑_{i=1}^{K} N_iN̄_i ≤ ∑_{i=1}^{K} N_iN = N^2, which leads to 14. □

Notice that the probability bounds are the same for both the binary class and the multiclass case. This implies that the number of classes does not affect subspace structure preservation; it depends only on the number of data vectors.

Also recall from our discussion of the cosine preservation lemma (5) that cosine values close to zero are not well preserved under random projection. However, from the above error bound on the margin (eq 14), it turns out that this is not a problem. Notice that two subspaces separated by a margin close to zero implies that their principal vectors are nearly orthogonal, i.e., they are maximally separated. Under these circumstances, the projected subspaces are also well separated. Formally, let γ = 0; then after projection, γ̄ ≤ ε/(1−ε), which is upper bounded by 1 as ε tends to 0.5. In practice we set ε to a much smaller quantity and hence γ̄ is well below 1.

Corollary 9 For any δ ∈ [0, 1) and ε ∈ (0, 1/2), if a K class (K > 1) dataset X with N data samples has Independent Subspace Structure, then the number of random vectors (m) required for this structure to be preserved with probability at least δ must satisfy

m ≥ (8/(ε^2 − ε^3)) ln( √6 N / (1 − δ) )    (18)

The above corollary states the minimum number of random vectors required for independent subspace structure preservation with error parameter ε and success probability δ for any given dataset.
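As a small helper (a sketch, assuming the bound is applied exactly as stated in eq. 18), the minimum m can be computed as follows; with N = 320, ε = 0.4 and δ = 0.99 this reproduces the ∼940 random vectors quoted in section 5.1.5.

```python
# Sketch: minimum number of random vectors per Corollary 9 (eq. 18).
import numpy as np

def required_random_vectors(N, eps, delta):
    return int(np.ceil(8.0 / (eps**2 - eps**3)
                       * np.log(np.sqrt(6) * N / (1.0 - delta))))

print(required_random_vectors(N=320, eps=0.4, delta=0.99))   # ~940
```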

While the analysis so far only discusses structure preservation for datasets with Independent Subspace Structure, it is not hard to see that the same bounds also apply to a dataset with Disjoint Subspace Structure, i.e., where the subspaces (classes) are pairwise disjoint but not necessarily independent overall.

4 Dimensionality Reduction for Supervised setting

In this section, we propose a new subspace learning approach applicable to labelled datasets that also theoretically guarantees independent subspace structure preservation.


In the previous section, we saw that the minimum number of random vectors required for dimensionality reduction depends on the size of the dataset. Under the supervised setting, the number of projection vectors required by our new approach is not only independent of the size of the dataset but is also fixed, depending only on the number of classes. Specifically, we show that for any K class labelled dataset with independent subspace structure, only 2K projection vectors are required for structure preservation.

The entire idea of being able to find a fixed number of projection vectors for the structure preservation of a K class dataset is motivated by theorem 10. This theorem states a useful property of any pair of disjoint subspaces.

Theorem 10 Let unit vectors v_1 and v_2 be the i-th principal vector pair for any two disjoint subspaces S_1 and S_2 in R^n. Let the columns of the matrix P ∈ R^{n×2} be any two orthonormal vectors in the span of v_1 and v_2. Then for all vectors x ∈ S_j, P^T x = α t_j (j ∈ {1, 2}), where α ∈ R depends on x and t_j ∈ R^2 is a fixed vector independent of x. Further, t_1^T t_2 / (‖t_1‖_2‖t_2‖_2) = v_1^T v_2.

Proof: We use the notation (M)_j to denote the j-th column vector of an arbitrary matrix M. We claim that t_j = P^T v_j (j ∈ {1, 2}). Also, without any loss of generality, assume that (P)_1 = v_1. Then in order to prove theorem 10, it suffices to show that ∀x ∈ S_1, (P)_2^T x = 0. By symmetry, ∀x ∈ S_2, P^T x will also lie along a line in the subspace spanned by the columns of P.

Let the columns of B_1 ∈ R^{n×d_1} and B_2 ∈ R^{n×d_2} be the support of S_1 and S_2 respectively, where d_1 and d_2 are the dimensionalities of the two subspaces. Then we can represent v_1 and v_2 as v_1 = B_1w_1 and v_2 = B_2w_2 for some w_1 ∈ R^{d_1} and w_2 ∈ R^{d_2}. Let B_1w be any arbitrary vector in S_1, where w ∈ R^{d_1}. Then we need to show that T := (B_1w)^T (P)_2 = 0 for all w. Since, up to a normalization constant, (P)_2 = v_2 − (v_1^T v_2)v_1, notice that

T ∝ (B_1w)^T (B_2w_2 − (w_1^T B_1^T B_2 w_2) B_1w_1)
  = w^T (B_1^T B_2 w_2 − (w_1^T B_1^T B_2 w_2) w_1)  ∀w    (19)

Let USV^T be the SVD of B_1^T B_2. Then w_1 and w_2 are the i-th columns of U and V respectively, and v_1^T v_2 is the i-th diagonal element of S, since v_1 and v_2 are the i-th principal vectors of S_1 and S_2. Thus,

T ∝ w^T (USV^T w_2 − S_ii (U)_i)
  = w^T (S_ii (U)_i − S_ii (U)_i) = 0 □    (20)

Geometrically, this theorem says that after projection onto P, the entire subspaces S_1 and S_2 collapse to two lines, such that points from S_1 lie along one line while points from S_2 lie along the second line. Further, the angle that separates these lines is equal to the angle between the i-th principal vector pair of S_1 and S_2 if the span of the i-th principal vector pair is used as P.

We apply theorem 10 to a three dimensional example, as shown in figure 1. In figure 1(a), the first subspace (the y-z plane) is denoted by red color, while the second subspace is the black line in the x-y plane. Notice that for this setting, the x-y plane (denoted by blue color) is in the span of the 1st (and only) principal vector pair between the two subspaces. After projecting both subspaces onto the x-y plane, we get two lines (figure 1(b)), as stated in the theorem.

[Figure 1: A three dimensional example of the application of theorem 10, showing (a) independent subspaces in 3 dimensions and (b) the subspaces after projection. See text in section 4 for details.]

We would also like to mention that for this geometrical property to hold, the two subspaces need not be disjoint in general. On careful examination of the statement of theorem 10, we find that there must merely exist at least one principal vector pair between the two subspaces that is separated by a non-zero angle. However, because such a vector pair is hard to estimate from a given dataset, we consider the case of disjoint subspaces.
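The collapse described by theorem 10 is easy to verify numerically; below is a small sketch (assuming NumPy, not from the paper) that builds two disjoint subspaces, projects their samples onto the span of the first principal vector pair, and checks that each projected class is (numerically) one dimensional.

```python
# Sketch: verifying Theorem 10 numerically.
import numpy as np

rng = np.random.default_rng(3)
n, d1, d2 = 100, 5, 8
B1, _ = np.linalg.qr(rng.standard_normal((n, d1)))   # basis of S1
B2, _ = np.linalg.qr(rng.standard_normal((n, d2)))   # basis of S2

U, s, Vt = np.linalg.svd(B1.T @ B2)                  # principal vector pairs
v1, v2 = B1 @ U[:, 0], B2 @ Vt[0, :]                 # first principal vector pair
P, _ = np.linalg.qr(np.column_stack([v1, v2]))       # orthonormal basis of span{v1, v2}

Y1 = P.T @ (B1 @ rng.standard_normal((d1, 50)))      # projected samples from S1
Y2 = P.T @ (B2 @ rng.standard_normal((d2, 50)))      # projected samples from S2

# second singular value of each projected set is ~0: each class lies on a line
print(np.linalg.svd(Y1, compute_uv=False))
print(np.linalg.svd(Y2, compute_uv=False))
```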

Before stating our main theorem (12), we first state lemma 11, which we will use later in our proof. This lemma states that if two vectors are separated by a non-zero angle, then after augmenting these vectors with arbitrary vectors, the new vectors remain separated by some non-zero angle as well. This straightforward idea will help us extend the two subspace case of theorem 10 to multiple subspaces.

Lemma 11 Let x_1, y_1 be any two fixed vectors of the same dimensionality such that x_1^T y_1 / (‖x_1‖_2‖y_1‖_2) = γ, where γ ∈ [0, 1). Let x_2, y_2 be any two arbitrary vectors of the same dimensionality as each other. Then there exists a constant γ̄ ∈ [0, 1) such that the augmented vectors x' = [x_1; x_2] and y' = [y_1; y_2] are also separated, i.e., x'^T y' / (‖x'‖_2‖y'‖_2) ≤ γ̄.

Proof: Notice that

x'^T y' / (‖x'‖_2‖y'‖_2) = (x_1^T y_1 + x_2^T y_2) / (‖x'‖_2‖y'‖_2)
                         = (γ‖x_1‖_2‖y_1‖_2 + x_2^T y_2) / (‖x'‖_2‖y'‖_2)
                         < (‖x_1‖_2‖y_1‖_2 + x_2^T y_2) / (‖x'‖_2‖y'‖_2)
                         ≤ max_{x_2, y_2} (‖x_1‖_2‖y_1‖_2 + x_2^T y_2) / (‖x'‖_2‖y'‖_2) = 1 □    (21)

Finally, we now show that for any K class dataset with independent subspace structure, 2K projection vectors are sufficient for structure preservation. The theorem below also states which 2K vectors have this property.

Algorithm 1 Computation of projection matrix P for the supervised setting

INPUT: X, K, λ
for k = 1 to K do
    w_2^* ← random vector in R^{N̄_k}
    while γ not converged do
        w_1^* ← argmin_{w_1} ‖X_k w_1 − X̄_k w_2^* / ‖X̄_k w_2^*‖_2‖_2 + λ‖w_1‖_2
        w_2^* ← argmin_{w_2} ‖X_k w_1^* / ‖X_k w_1^*‖_2 − X̄_k w_2‖_2 + λ‖w_2‖_2
        γ ← (X_k w_1^*)^T (X̄_k w_2^*)
    end while
    P_k ← orthonormalized form of the vectors X_k w_1^*, X̄_k w_2^*
end for
P ← orthonormalized form of the vectors in [P_1 . . . P_K]
OUTPUT: P

Here X_k denotes the matrix of class k samples, X̄_k the matrix of the remaining samples, and N̄_k the number of samples in X̄_k (see section 4.1).


Theorem 12 Let X = {x}_{i=1}^{N} be a K class dataset in R^n with Independent Subspace Structure. Let P = [P_1 . . . P_K] ∈ R^{n×2K} be a projection matrix for X such that the columns of each matrix P_k ∈ R^{n×2} consist of orthonormal vectors in the span of any principal vector pair between the subspaces S_k and ∑_{j≠k} S_j. Then the Independent Subspace Structure of the dataset X is preserved after projection onto the 2K vectors in P.

Proof: It suffices to show that data vectors from the subspaces S_k and ∑_{j≠k} S_j (for any k ∈ {1 . . . K}) are separated by a margin less than 1 after projection using P. Let x and y be any vectors in S_k and ∑_{j≠k} S_j respectively, and let the columns of the matrix P_k be in the span of the i-th (say) principal vector pair between these subspaces. Using theorem 10, the projected vectors P_k^T x and P_k^T y are separated by an angle equal to the angle between the i-th principal vector pair of S_k and ∑_{j≠k} S_j. Let the cosine of this angle be γ. Then, using lemma 11, the vectors P^T x and P^T y, obtained by augmenting P_k^T x and P_k^T y with the remaining dimensions, are also separated by some margin γ̄ < 1. As the same argument holds for vectors from all classes, the Independent Subspace Structure of the dataset remains preserved after projection. □

4.1 Implementation

Even though theorem 12 guarantees the structure preservation of the dataset X after projection using P as specified, this does not by itself solve the problem of dimensionality reduction. The reason is that, given a labeled dataset sampled from a union of independent subspaces, we do not have any information about the bases or even the dimensionalities of the underlying subspaces. Under these circumstances, constructing the projection matrix P as specified in theorem 12 itself becomes a problem. To solve this problem, we propose a heuristic algorithm that tries to approximate an underlying principal vector pair between the subspaces S_k and ∑_{j≠k} S_j (for k = 1 to K) given the labeled dataset X. The assumption behind this attempt is that samples from each subspace (class) are not heavily corrupted and that the underlying subspaces are independent.

Notice that we are not specifically interested in a particular principal vector pair between any two subspaces for the computation of the projection matrix. This is because we have assumed independent subspace structure, so each principal vector pair is separated by some margin γ < 1. We thus need an algorithm that computes any arbitrary principal vector pair, given data from two independent subspaces. These vectors can then be used to form one of the K submatrices in P as specified in theorem 12. For computing the submatrix P_k, we need to find a principal vector pair between the subspaces S_k and ∑_{j≠k} S_j. In terms of the dataset X, we heuristically estimate this vector pair using the data in X_k and X̄_k, where X̄_k := X \ X_k. We repeat this process for each class to finally form the entire matrix P. Our heuristic approach is stated in algorithm 1; a sketch implementation follows below. For each class k, the idea is to start with a random vector in the span of X̄_k and find the vector in X_k closest to it. Then fix this vector and search for the closest vector in X̄_k. Repeating this process until the cosine between these two vectors converges leads to a principal vector pair. Also notice that, in order to estimate the closest vector from the opposite subspace, we use a quadratic program in algorithm 1 that minimizes the reconstruction error of the fixed vector (of one subspace) using vectors from the opposite subspace. The regularization term in the optimization handles noise in the data.
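The sketch below (assuming NumPy, with a data matrix X holding samples as columns and an integer label vector; not the authors' code) shows one way to realize algorithm 1, solving each inner step in closed form as a ridge regression (a squared-error variant of the quadratic program in the algorithm).

```python
# Rough sketch of Algorithm 1: alternate ridge regressions between a class and
# the rest of the data until the cosine between the two fitted vectors converges.
import numpy as np

def ridge_coeffs(A, z, lam):
    """argmin_w ||A w - z||^2 + lam * ||w||^2 (closed form)."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ z)

def supervised_projection(X, labels, lam=0.1, tol=1e-6, max_iter=100, seed=0):
    """X: n x N data matrix, labels: length-N integer labels. Returns P (n x 2K)."""
    rng = np.random.default_rng(seed)
    blocks = []
    for k in np.unique(labels):
        Xk, Xrest = X[:, labels == k], X[:, labels != k]
        v2 = Xrest @ rng.standard_normal(Xrest.shape[1])   # random start in span(Xk_bar)
        gamma_prev = np.inf
        for _ in range(max_iter):
            v1 = Xk @ ridge_coeffs(Xk, v2 / np.linalg.norm(v2), lam)
            v2 = Xrest @ ridge_coeffs(Xrest, v1 / np.linalg.norm(v1), lam)
            gamma = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            if abs(gamma - gamma_prev) < tol:               # cosine has converged
                break
            gamma_prev = gamma
        blocks.append(np.column_stack([v1, v2]))            # P_k before orthonormalization
    return np.linalg.qr(np.column_stack(blocks))[0]         # orthonormalized [P_1 ... P_K]
```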

5 Empirical Analysis

In this section, we present empirical evidence to support the theoretical analysis presented so far. We present experiments on both our supervised and our unsupervised subspace learning approach.

5.1 Analysis of our Unsupervised approach

5.1.1 Cosine preservation

In lemma 5, we concluded that the cosine of the angle between any two vectors remains preserved under random projection, irrespective of the angle being acute or obtuse. However, we also stated that cosine values close to zero are not well preserved. Here, we perform an empirical analysis on vectors with varying angles (both acute and obtuse) and arbitrary lengths to verify this. In order to achieve this, we use settings similar to [15]. We generate 2000 random projection matrices R_i ∈ R^{m×n} (i = 1 to 2000), where we vary m ∈ {30, 60, . . . , 300} and n = 300 is the dimension of the original space. We define the empirical rejection probability for cosine preservation, similar to [15], as

P = 1 − (1/2000) ∑_{i=1}^{2000} 1( (1 − ε) ≤ ⟨R_i x, R_i y⟩ ‖x‖_2 ‖y‖_2 / (‖R_i x‖_2 ‖R_i y‖_2 ⟨x, y⟩) ≤ (1 + ε) )

where we vary ε ∈ {0.1, 0.3} and 1(·) is the indicator operator.
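A compact sketch of this measurement (assuming NumPy; the settings mirror the description above, but the code itself is not from the paper):

```python
# Sketch: empirical rejection probability for cosine preservation.
import numpy as np

def cosine_rejection_probability(x, y, m, eps, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = x.size
    cos_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    accepted = 0
    for _ in range(trials):
        R = rng.standard_normal((m, n)) / np.sqrt(m)
        Rx, Ry = R @ x, R @ y
        cos_R = Rx @ Ry / (np.linalg.norm(Rx) * np.linalg.norm(Ry))
        accepted += (1 - eps) <= cos_R / cos_xy <= (1 + eps)
    return 1.0 - accepted / trials
```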

[Figure 2: Empirical rejection probability P for cosine preservation (acute angle) versus projected dimensionality m, for (a) ε = 0.1 and (b) ε = 0.3, with γ = 0.019021, 0.37161, 0.67809, 0.92349. See section 5.1.1 for details.]

[Figure 3: Empirical rejection probability P for cosine preservation (obtuse angle) versus projected dimensionality m, for (a) ε = 0.1 and (b) ε = 0.3, with γ = −0.036831, −0.45916, −0.65797, −0.92704. See section 5.1.1 for details.]

For the acute angle case, we randomly generate vectors x and y of arbitrary length but with fixed cosine values γ = {0.019021, 0.37161, 0.67809, 0.92349}. For the obtuse angle case, we similarly generate vectors x and y with fixed cosine values γ = {−0.036831, −0.45916, −0.65797, −0.92704}. We then compute the empirical rejection probability described above for different values of ε. Figures 2 and 3 show the results for these vectors. In both figures, notice that the rejection probability decreases as the absolute value of the cosine of the angle (γ) increases (from 0 to 1), as well as for the higher value of ε. These results corroborate our theoretical analysis in lemma 5.

5.1.2 Inner Product under Random projection

We use the same experimental setting as in section 5.1.1. We define the empirical rejection probability for inner product preservation, similar to [15], as

P = 1 − (1/2000) ∑_{i=1}^{2000} 1( (1 − ε) ≤ ⟨R_i x, R_i y⟩ / ⟨x, y⟩ ≤ (1 + ε) )

[Figure 4: Empirical rejection probability P for inner product preservation (acute angle) versus projected dimensionality m, for (a) ε = 0.1 and (b) ε = 0.3, with γ = 0.019021, 0.37161, 0.67809, 0.92349, showing that the inner product is not preserved under random projection. See section 5.1.2 for details.]

[Figure 5: Empirical rejection probability P for inner product preservation (obtuse angle) versus projected dimensionality m, for (a) ε = 0.1 and (b) ε = 0.3, with γ = −0.036831, −0.45916, −0.65797, −0.92704, showing that the inner product is not preserved under random projection. See section 5.1.2 for details.]

We use the same vectors as in 5.1.1 for the experiments in this section. We compute the empirical rejection probability defined above for different values of ε. Figures 4 and 5 show the results for these vectors. As is evident from these figures, the inner product between vectors is not well preserved (even when the cosine values are close to 1). This result is in line with our theoretical bound in equation 10, as the vector lengths in our experiment are arbitrarily greater than 1.

5.1.3 Required number of random vectors

Based on corollary 9, we study the number of random vectors required for subspace preservation while varying different parameters. For δ = 0.99, figure 6 shows the minimum number of random vectors (m) required (y-axis) while varying the number (N) of data samples (x-axis) between 100 and 10,000, for ε = {0.15, 0.3, 0.4}. It can be seen that for ε = 0.15, random projection to lower dimensions is effective only if m > 6000, while for ε = 0.4, m > 900 suffices. The choice of ε depends on the robustness of the algorithm (for the respective task) towards noise and is a trade-off between the noise allowed and the number of random vectors (m) required.

[Figure 6: Required number of random vectors (m) versus number of data samples (N), for ε = 0.15, 0.3, 0.4. The blue broken line is for reference. See section 5.1.3 for details.]

[Figure 7: Segmentation accuracy versus projection dimension m. Performance at m = 100, 300, 500 is approximately preserved relative to the original dimension n = 1000, and this holds for different degrees of data corruption (0% to 50%). See section 5.1.4 for details.]

5.1.4 Subspace preservation for synthetic data

To test our claim of subspace structure preservation, we run a subspace segmenta-tion algorithm on randomly projected synthetic data. Similarly to [12], we constructK = 5 independent subspaces {Si}5i=1 ⊂ Rn, (n = 1000), whose bases {Ui}5i=1

are computed by Ui+1 = TUi, i ∈ {1, . . . , 4}, where T is a random rotation ma-trix and U1 is a random orthogonal matrix of dimension 1000 × 5. So each subspacehas a dimension of 5. We randomly sample 200 data vectors from each subspace byXi = UiQi, i ∈ {1, . . . , 5} with Qi being a 5× 200 iid N (0, 1) matrix, resulting in atotal of 1000 data vectors. We also add a data-dependent Gaussian noise to a fractionof the generated data. For a data vector x chosen to be corrupted, the observed vec-tor is computed as x + ng where ng is a Gaussian noise with zero mean and variance0.3 ‖x‖2 (‖x‖2 mostly ranges from 0.1 to 1.7 in this experiment). We vary the fractionof samples corrupted in this way from 0 to 0.5 with a step size of 0.1.

Table 1: Segmentation accuracy on the Extended Yale dataset B under the unsupervised setting. Accuracy in the original feature space (R^{10,000}, no projection) is 83.75%. See section 5.1.5 for details.

m               400           900           1600          2500
accuracy (%)    81.47 ± 2.71  83.14 ± 1.20  83.18 ± 1.17  83.39 ± 1.15

In our experiments, we use m = 100, 300, 500 random vectors to project the 1000-dimensional original data to lower dimensions and then apply the LRR algorithm [12] for subspace segmentation. For each value of m and each fraction of data corrupted, we perform 50 runs of random projection matrix generation, data projection and subspace segmentation. For comparison purposes, subspace segmentation is also carried out in the original dimension n = 1000. We report the mean and standard deviation of the segmentation accuracy averaged over the 50 runs. This is illustrated in figure 7, where we show the % segmentation accuracy as m is varied. We observe that for every magnitude of the data corruption fraction, the segmentation accuracy stays nearly flat when m is varied in the set {100, 300, 500, n (= 1000)}. This corroborates our claim that the subspace structure is preserved when the data is randomly projected to a lower dimensional subspace. We would also like to mention that the reason for the high accuracy at low dimensions (high error) is that the segmentation algorithm is robust to noise.

5.1.5 Subspace preservation for real data

We choose the Extended Yale face dataset B [6] for evaluation on real data. It consists of ∼2414 frontal face images of 38 individuals, with 64 images each. For our evaluation purposes we use the first 5 classes as our 5 clusters and perform the task of segmentation; thus the total dataset size is N = 320. We choose a face dataset because it is generally assumed that face images with illumination variation lie along linear subspaces [14]. We use 100 × 100 images and stack all the pixels to form our feature vector. For ε = 0.4, the minimum required number of random vectors (m) is ∼940. For evaluation, we project the image vectors onto dimensions ranging between 400 and 2500 using random vectors. Similar to our synthetic data analysis, we do 50 runs of random projection matrix generation and segmentation for each value of m we choose and report the results. The accuracy of subspace segmentation after random projection can be seen in table 1. The accuracy of subspace segmentation in the original feature space (100 × 100 dimensions) is 83.75%. As evident from the results, the accuracy at 400 dimensions is slightly lower but saturates beyond 900 dimensions.

5.2 Analysis of our Supervised approach

5.2.1 Two Subspaces Two Lines

We test both the claim of theorem 10 and the quality of the approximation achieved by algorithm 1 in this section. We generate two random subspaces in R^1000 of dimensionality 20 and 30 (notice that these subspaces will be independent with probability 1). We randomly generate 100 data vectors from each subspace and normalize them to have unit length. We then compute the 1st principal vector pair between the two subspaces from their basis vectors by performing the SVD of B_1^T B_2, where B_1 and B_2 are the bases of the two subspaces. We orthonormalize the vector pair to form the projection matrix P_a. Next, we use the labeled dataset of 200 points generated above to form the projection matrix P_b by applying algorithm 1. The entire dataset of 200 points is then projected onto P_a and P_b separately and plotted in figure 8. The green and red points denote data from either subspace. The results not only substantiate our claim in theorem 10 but also suggest that our proposed heuristic algorithm, which estimates a principal vector pair between the subspaces from the sampled dataset to finally form the projection matrix, is a good approximation.

[Figure 8: Synthetic two class high dimensional data projected onto a two dimensional subspace using (a) P_a and (b) P_b. See section 5.2.1 for details.]

5.2.2 Projection of K class data

In order to evaluate theorem 12, we perform a classification experiment on a real dataset after projecting the data vectors using different dimensionality reduction techniques, including our proposed method. We use all 38 classes of the Extended Yale dataset B for evaluation (see section 5.1.5 for details) with a 50%-50% train-test split. Further, since the classes have independent subspace structure, we make use of sparse coding [19] for classification (see that paper for technical details). We compare our approach against PCA, LDA and random projection (RP). The results are shown in table 2. Evidently, our proposed supervised approach yields the best accuracy using just 2K = 86 projection vectors.

Table 2: Classification accuracy on the Extended Yale dataset B under the supervised setting. See section 5.2.2 for details.

                Ours    PCA     LDA     RP
dim             86      90      37      90
accuracy (%)    95.44   93.62   86.81   93.97

6 Conclusion

In this paper, we presented two randomized algorithms for dimensionality reduction that preserve the independent subspace structure of datasets. First, we show that random projection matrices preserve this structure, and we additionally derive a bound on the minimum number of random vectors required for this to hold (corollary 9). We conclude that this number depends only logarithmically on the number of data samples. All the above arguments also hold under the disjoint subspace setting. As a side analysis, we also show that while cosine values (lemma 5) are preserved under random projection, inner products between vectors (equation 10) are not preserved in general. These results were confirmed by our empirical analysis.

Second, we propose a supervised subspace learning algorithm for dimensionality reduction. As a theoretical analysis, we show that for K independent subspaces, 2K projection vectors are sufficient for subspace structure preservation (theorem 12). This result is motivated by our observation that for any two disjoint subspaces of arbitrary dimensionality, there exists a two dimensional subspace such that, after projection, the entire subspaces collapse to just two lines (theorem 10). Further, we propose a heuristic algorithm (algorithm 1) that tries to exploit these properties of independent subspaces for learning a projection matrix for dimensionality reduction. However, there may be better ways of taking advantage of these properties, which is part of future work.

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, June 2003.

[2] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63:161–182, 2006.

[3] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. In 15th International Conference on Algorithmic Learning Theory (ALT 04), pages 79–94, 2004.

[4] Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9:51–77, 2009.

[5] Christos Boutsidis, Anastasios Zouzias, and Petros Drineas. Random projections for k-means clustering. CoRR, abs/1011.4632, 2010.

[6] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.

[7] X. He and P. Niyogi. Locality preserving projections. In Proc. of the NIPS, Advances in Neural Information Processing Systems. Vancouver: MIT Press, 103, 2004.

[8] Chinmay Hegde, Michael B. Wakin, and Richard G. Baraniuk. Random projections for manifold learning. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, NIPS. Curran Associates, Inc., 2007.

[9] Jeffrey Ho, Ming-Hsuan Yang, Jongwoo Lim, Kuang-Chih Lee, and David Kriegman. Clustering appearances of objects under varying illumination conditions. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I-11–I-18. IEEE, 2003.

[10] Wei Hong, John Wright, Kun Huang, and Yi Ma. Multiscale hybrid linear models for lossy image representation. Image Processing, IEEE Transactions on, 15(12):3655–3671, 2006.

[11] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 287–296, New York, NY, USA, 2006. ACM.

[12] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In ICML, 2010.

[13] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 2000.

[14] Gregory Shakhnarovich and Baback Moghaddam. Face recognition in subspaces. In S.Z. Li and A.K. Jain, editors, Handbook of Face Recognition, pages 141–168. Springer, 2004.

[15] Qinfeng Shi, Chunhua Shen, Rhys Hill, and Anton van den Hengel. Is margin preserved after random projection? CoRR, abs/1206.4651, 2012.

[16] S. Vempala. The Random Projection Method. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 2004.

[17] Rene Vidal, Stefano Soatto, Yi Ma, and Shankar Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, volume 1, pages 167–172. IEEE, 2003.

[18] Rene Vidal, Roberto Tron, and Richard Hartley. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. International Journal of Computer Vision, 79(1):85–105, 2008.

[19] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE TPAMI, 31(2):210–227, Feb. 2009.

[20] Allen Y. Yang, John Wright, Yi Ma, and S. Shankar Sastry. Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding, 110(2):212–225, 2008.
