Subspace Clustering with Missing Data
Lianli Liu Dejiao Zhang Jiabei Zheng
Abstract
Subspace clustering with missing data can be seen as the combination of subspace clustering and low-rank matrix completion, which is essentially equivalent to high-rank matrix completion under the assumption that the columns of the matrix $X \in \mathbb{R}^{d \times N}$ belong to a union of subspaces. It is a challenging problem, both in terms of computation and inference. In this report, we study two efficient algorithms proposed for solving this problem, the EM-type algorithm [1] and a k-means-form algorithm, k-GROUSE [2], and implement both on simulated and real datasets. We also give a brief description of a recently developed theorem [1] on the sampling complexity of subspace clustering, which is a great improvement over the previously existing theoretical analysis [3] that requires an impractically large sample size, e.g. the number of observed vectors per subspace should be super-polynomial in the dimension d.
1 Problem Statement
Consider a matrix $X \in \mathbb{R}^{d \times N}$, with entries missing at random, whose columns lie in the union of at most $K$ unknown subspaces of $\mathbb{R}^d$, each of dimension at most $r_k < d$. The goal of subspace clustering is to infer the underlying $K$ subspaces from the partially observed matrix $X_\Omega$, where $\Omega$ denotes the indices of the observed entries, and to cluster the columns of $X_\Omega$ into groups that lie in the same subspace.
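To make the setting concrete, the following is a minimal sketch of how such a partially observed matrix could be simulated. The function name `make_union_of_subspaces`, the Gaussian coefficients, and the parameter values are illustrative assumptions, not the code used in this report.

```python
import numpy as np

def make_union_of_subspaces(d, K, r, n_per, p_obs, seed=0):
    """Draw n_per columns from each of K random r-dimensional subspaces of R^d,
    then keep each entry independently with probability p_obs (assumed setup)."""
    rng = np.random.default_rng(seed)
    bases, cols, labels = [], [], []
    for k in range(K):
        # Orthonormal basis of a random r-dimensional subspace.
        U, _ = np.linalg.qr(rng.standard_normal((d, r)))
        bases.append(U)
        cols.append(U @ rng.standard_normal((r, n_per)))
        labels.extend([k] * n_per)
    X = np.hstack(cols)
    mask = rng.random(X.shape) < p_obs  # True where the entry is observed (Omega)
    return X, mask, np.array(labels), bases

X, mask, labels, bases = make_union_of_subspaces(d=20, K=3, r=2, n_per=15, p_obs=0.6)
X_obs = np.where(mask, X, np.nan)  # missing entries marked as NaN
```

Here `mask` plays the role of $\Omega$ and `X_obs` the role of $X_\Omega$; the true labels are kept only to evaluate a clustering afterwards.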
2 Sampling Complexity for Generic Subspace Clustering and EM Algorithm
2.1 Sampling Complexity
[1] shows that subspace clustering is possible without the impractically large sample size required by previous work [3]. The main assumptions used in the theoretical analysis of [1] differ from those arising in traditional low-rank matrix completion, e.g. [3]. It assumes that the columns of X are drawn from a non-atomic distribution supported on a union of low-dimensional generic subspaces, and that entries in the columns are missing uniformly at random. Before stating the main result of [1], we first provide the following definition.
Definition 1. We denote the set of $d \times n$ matrices of rank $r$ by $\mathcal{M}(r, d \times n)$. A generic $(d \times n)$ matrix of rank $r$ is a continuous $\mathcal{M}(r, d \times n)$-valued random variable. We say a subspace $S$ is generic if a matrix whose columns are drawn i.i.d. according to a non-atomic distribution with support on $S$ is generic a.s.
The key assumptions of this method are the following:
• A1. The columns of the $d \times N$ data matrix $X$ are drawn according to a non-atomic distribution with support on the union of at most $K$ generic subspaces. The subspaces, denoted by $\mathcal{S} = \{S_k \mid k = 0, \dots, K-1\}$, each have rank exactly $r < d$.
• A2. The probability that a column is drawn from subspace $k$ is $\rho_k$. Let $\rho_*$ be a lower bound on $\min_k \rho_k$.
• A3. We observe $X$ only on a set of entries $\Omega$ and denote the observation by $X_\Omega$. Each entry of $X_\Omega$ is sampled independently with probability $p$.

¹ Authors are listed in alphabetical order.
We now state the main result of [1], which implies that if we observe at least order $d^{r+1}(\log(d/r) + \log K)$ columns and at least $O(r \log^2 d)$ entries in each column, then identification of the subspaces is possible with high probability. Although by assumption A1 the rank of each subspace is exactly $r$, this result can be generalized to a relaxed version in which the dimension of each subspace is upper bounded by $r$.
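Theorem 1. [1] Suppose A1-A3 hold. Let $\varepsilon > 0$ be given. Assume the number of subspaces $K \le \frac{\varepsilon}{6} e^{d/4}$, the total number of columns $N \ge (2d + 4M)/\rho_*$, and

$$p \ge \frac{1}{d}\, 128\, \mu_1^2\, r\, \beta_0 \log^2(2d), \qquad \beta_0 = \sqrt{1 + \frac{\log\left(\frac{6K}{\varepsilon}\, 12 \log(d)\right)}{2\log(2d)}},$$

$$M = \left(\frac{de}{r+1}\right)^{r+1} \left[(r+1)\log\left(\frac{de}{r+1}\right) + \log\left(\frac{8K}{\varepsilon}\right)\right] \tag{1}$$

where $\mu_1^2 := \max_k \frac{d^2}{r} \|U_k V_k^*\|_\infty^2$ and $U_k \Sigma_k V_k^*$ is the singular value decomposition of $X^{[k]}$ (footnote 2). Then with probability at least $1 - \varepsilon$, $\mathcal{S}$ can be uniquely determined from $X_\Omega$.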
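To get a feeling for the scale of the column requirement in Theorem 1, the quantity $M$ in (1) and the resulting bound $(2d + 4M)/\rho_*$ can be evaluated directly. This is only a sketch; the function name and the example values of $\varepsilon$ and $\rho_*$ are assumptions chosen for illustration.

```python
import math

def theorem1_column_bound(d, r, K, eps, rho_star):
    """Evaluate M from Eq. (1) and the bound N >= (2d + 4M) / rho_star."""
    base = d * math.e / (r + 1)
    M = base ** (r + 1) * ((r + 1) * math.log(base) + math.log(8 * K / eps))
    N_min = (2 * d + 4 * M) / rho_star
    return M, N_min

# Illustrative values only (d, r, K match the simulation in Section 4.1).
M, N_min = theorem1_column_bound(d=100, r=5, K=4, eps=0.1, rho_star=0.25)
print(f"M = {M:.3e}, required N >= {N_min:.3e}")
```

The bound grows like $d^{r+1}$, so it is a sufficient condition rather than a description of how many columns are needed in practice.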
2.2 EM Algorithm
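In [4], computing the principal subspaces of a set of data vectors is formulated in a probabilistic framework and solved by maximum likelihood. Here we use the same model but extend it to the case of missing data, where the superscripts $o$ and $m$ mark the observed and missing coordinates of column $i$:

$$\begin{bmatrix} x_i^{o} \\ x_i^{m} \end{bmatrix} = \sum_{k=1}^{K} \mathbb{1}_{\{z_i = k\}} \left( \begin{bmatrix} W_k^{o_i} \\ W_k^{m_i} \end{bmatrix} y_i + \begin{bmatrix} \mu_k^{o_i} \\ \mu_k^{m_i} \end{bmatrix} + \eta_i \right) \tag{2}$$

where $\{1, \dots, K\} \ni z_i \sim \rho \perp y_i \sim \mathcal{N}(0, I)$, $W_k$ is a $d \times r$ matrix whose span is $S_k$, and $\eta_i \mid z_i \sim \mathcal{N}(0, \sigma_{z_i}^2 I)$ is the noise in the $z_i$-th subspace. To find the maximum likelihood estimate of $\theta = \{W, \mu, \rho, \sigma^2\}$, i.e.

$$\hat{\theta} = \arg\max_\theta\, l(\theta, x^o) = \arg\max_\theta \sum_{i=1}^{N} \log\left( \sum_{k=1}^{K} \rho_k P(x_i^o \mid z_i = k; \theta) \right) \tag{3}$$

an EM algorithm is proposed, where $X^o := \{x_i^o\}_{i=1}^N$ is the observed data and $X^m := \{x_i^m\}_{i=1}^N$, $Y := \{y_i\}_{i=1}^N$, $Z := \{z_i\}_{i=1}^N$ are hidden variables. The iterates of the algorithm are as follows, with a detailed derivation provided in Appendix A.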
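$$\widehat{W}_k = \left[ \sum_{i=1}^N p_{i,k} E_k\langle x_i y_i^T \rangle - \frac{\left(\sum_{i=1}^N p_{i,k} E_k\langle x_i \rangle\right)\left(\sum_{i=1}^N p_{i,k} E_k\langle y_i \rangle^T\right)}{\sum_{i=1}^N p_{i,k}} \right] \left[ \sum_{i=1}^N p_{i,k} E_k\langle y_i y_i^T \rangle - \frac{\left(\sum_{i=1}^N p_{i,k} E_k\langle y_i \rangle\right)\left(\sum_{i=1}^N p_{i,k} E_k\langle y_i \rangle^T\right)}{\sum_{i=1}^N p_{i,k}} \right]^{-1} \tag{4}$$

$$\widehat{\mu}_k = \frac{\sum_{i=1}^N p_{i,k} \left( E_k\langle x_i \rangle - \widehat{W}_k E_k\langle y_i \rangle \right)}{\sum_{i=1}^N p_{i,k}} \tag{5}$$

$$\widehat{\sigma}_k^2 = \frac{1}{d \sum_{i=1}^N p_{i,k}} \left[ \sum_{i=1}^N p_{i,k} \Big( \mathrm{tr}(E_k\langle x_i x_i^T \rangle) - 2\widehat{\mu}_k^T E_k\langle x_i \rangle + \widehat{\mu}_k^T \widehat{\mu}_k - 2\,\mathrm{tr}(E_k\langle y_i x_i^T \rangle \widehat{W}_k) + 2\widehat{\mu}_k^T \widehat{W}_k E_k\langle y_i \rangle + \mathrm{tr}(E_k\langle y_i y_i^T \rangle \widehat{W}_k^T \widehat{W}_k) \Big) \right] \tag{6}$$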
² $X^{[k]}$ denotes the columns of $X$ corresponding to the $k$th subspace.
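$$\widehat{\rho}_k = \frac{1}{N} \sum_{i=1}^N p_{i,k}, \qquad p_{i,k} := P_{z_i \mid x_i^o, \theta}(k) = \frac{\rho_k P_{x_i^o \mid z_i = k, \theta}(x_i^o)}{\sum_{j=1}^K \rho_j P_{x_i^o \mid z_i = j, \theta}(x_i^o)} \tag{7}$$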
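The posterior weights $p_{i,k}$ in (7) can be sketched for a single column as follows, using the fact (see Eq. (38) in Appendix A) that the observed part of a column from subspace $k$ is Gaussian with mean $\mu_k^{o_i}$ and covariance $W_k^{o_i} W_k^{o_i T} + \sigma_k^2 I$. The function names and the toy parameters are assumptions for illustration; the log-domain normalization is an implementation choice, not part of the derivation.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log-density of N(mean, cov) evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

def responsibilities(x_obs, obs_idx, W, mu, sigma2, rho):
    """p_{i,k} of Eq. (7) for one column observed on rows obs_idx."""
    K = len(rho)
    logp = np.empty(K)
    for k in range(K):
        Wo = W[k][obs_idx, :]
        cov = Wo @ Wo.T + sigma2[k] * np.eye(len(obs_idx))
        logp[k] = np.log(rho[k]) + log_gauss(x_obs, mu[k][obs_idx], cov)
    logp -= logp.max()  # subtract the max before exponentiating, for stability
    p = np.exp(logp)
    return p / p.sum()

# Toy example: two one-dimensional subspaces of R^6.
W = [np.array([[1.0], [1.0], [1.0], [1.0], [0.0], [0.0]]),
     np.array([[0.0], [0.0], [0.0], [0.0], [1.0], [1.0]])]
mu = [np.zeros(6), np.zeros(6)]
obs_idx = np.array([0, 1, 4])
x_obs = np.array([2.0, 2.0, 0.0])  # lies along the first subspace
p = responsibilities(x_obs, obs_idx, W, mu, sigma2=[0.01, 0.01], rho=[0.5, 0.5])
```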
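where $E_k\langle \cdot \rangle$ denotes the conditional expectation $E_{\cdot \mid x_i^o, z_i = k, \theta}[\cdot]$. This conditional expectation can be easily calculated from its conditional distribution. Denote $M_k := \sigma^2 I + W_k^{o_i T} W_k^{o_i}$; by the Gauss-Markov theorem,

$$\begin{bmatrix} x_i^m \\ y_i \end{bmatrix} \Bigg|_{x_i^o, z_i = k, \theta} \sim \mathcal{N}\left( \begin{bmatrix} \mu_k^{m_i} + W_k^{m_i} M_k^{-1} W_k^{o_i T} (x_i^o - \mu_k^{o_i}) \\ M_k^{-1} W_k^{o_i T} (x_i^o - \mu_k^{o_i}) \end{bmatrix},\ \sigma^2 \begin{bmatrix} I + W_k^{m_i} M_k^{-1} W_k^{m_i T} & W_k^{m_i} M_k^{-1} \\ M_k^{-1} W_k^{m_i T} & M_k^{-1} \end{bmatrix} \right) \tag{8}$$

A detailed derivation is provided in Appendix B.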
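As a sanity check on the mean in (8), the sketch below computes $E_k\langle y_i \rangle$ and $E_k\langle x_i^m \rangle$ for one column and one subspace. The function name, the random parameters and the index sets are assumptions made for the example; only the formulas come from (8).

```python
import numpy as np

def conditional_mean(x_obs, obs_idx, mis_idx, W, mu, sigma2):
    """Mean of [x^m; y] given x^o for a single subspace, following Eq. (8)."""
    Wo, Wm = W[obs_idx], W[mis_idx]
    M = sigma2 * np.eye(W.shape[1]) + Wo.T @ Wo      # M_k, an r x r matrix
    y_mean = np.linalg.solve(M, Wo.T @ (x_obs - mu[obs_idx]))
    x_mis_mean = mu[mis_idx] + Wm @ y_mean
    return x_mis_mean, y_mean

rng = np.random.default_rng(0)
d, r, sigma2 = 8, 2, 0.2
W = rng.standard_normal((d, r))
mu = rng.standard_normal(d)
obs_idx = np.array([0, 1, 3, 4, 6])
mis_idx = np.array([2, 5, 7])
x_obs = rng.standard_normal(len(obs_idx))
x_mis_mean, y_mean = conditional_mean(x_obs, obs_idx, mis_idx, W, mu, sigma2)
```

Only an $r \times r$ system is solved, which is the practical benefit of writing the result in terms of $M_k$ rather than the $|\Omega| \times |\Omega|$ covariance of $x_i^o$.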
3 Projection Residual Based K-MEANS Form Algorithm
[2] proposes a form of the k-means algorithm adapted to clustering into k subspaces. It alternates between assigning vectors to subspaces based on projection residuals and updating each subspace by an incremental gradient descent step along a geodesic of the Grassmannian manifold³ (GROUSE [6]).
3.1 Subspace Assignment
For subspace assignment, [2] proves that an incomplete data vector can be assigned to the correct subspace with high probability if the angles between the vector and the various subspaces satisfy a mild condition.

For each data vector, when no entry is missing, it is natural to use the rule

$$\|v - P_{S_i} v\|_2^2 < \|v - P_{S_j} v\|_2^2 \quad \forall j \ne i,\ i, j \in \{0, \dots, K-1\}, \tag{9}$$

where $P_{S_i}$ is the projection operator onto subspace $S_i$, to decide whether $v$ can be assigned to $S_i$. Now suppose we only observe a subset of the entries of $v$, denoted $v_\Omega$, where $\Omega \subset \{1, \dots, d\}$ and $|\Omega| = m < d$. Let $U_\Omega$ denote the rows indexed by $\Omega$ of an orthonormal basis $U$ of $S$, and define the projection operator restricted to $\Omega$ as

$$P_{S_\Omega} = U_\Omega (U_\Omega^T U_\Omega)^{-1} U_\Omega^T \tag{10}$$

In [7], the authors prove that $U_\Omega^T U_\Omega$ is invertible w.h.p. as long as the assumptions of Theorem 2 are satisfied.
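The restricted residual $\|v_\Omega - P_{S_\Omega} v_\Omega\|_2^2$ and the resulting assignment rule can be sketched as below. The helper names `projection_residual` and `assign` are hypothetical, and a least-squares solve stands in for forming $(U_\Omega^T U_\Omega)^{-1}$ explicitly.

```python
import numpy as np

def projection_residual(v_obs, U, obs_idx):
    """Squared residual of v_obs after projecting onto the rows of U in obs_idx."""
    U_obs = U[obs_idx, :]
    # Least squares gives the weights (U_obs^T U_obs)^{-1} U_obs^T v_obs.
    w, *_ = np.linalg.lstsq(U_obs, v_obs, rcond=None)
    resid = v_obs - U_obs @ w
    return float(resid @ resid)

def assign(v_obs, bases, obs_idx):
    """Index of the subspace with the smallest restricted residual, as in rule (9)."""
    return int(np.argmin([projection_residual(v_obs, U, obs_idx) for U in bases]))

rng = np.random.default_rng(0)
d, r = 30, 3
bases = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(3)]
v = bases[1] @ rng.standard_normal(r)               # a vector lying in subspace 1
obs_idx = np.sort(rng.choice(d, size=12, replace=False))
k_hat = assign(v[obs_idx], bases, obs_idx)
```

In this noiseless toy case the residual with respect to the true subspace is zero up to rounding, so the rule recovers the correct index.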
Let $U_k \in \mathbb{R}^{d \times r_k}$ have orthonormal columns spanning the $r_k$-dimensional subspace $S_k$. A natural question is under what conditions a decision rule similar to (9) can be used for subspace assignment with incomplete data. To answer this question, we first restate a theorem from [2] showing that the projection residual concentrates around that of the full-data case, up to a scaling factor, if the sample size $m$ satisfies a certain condition. Let $v = x + y$, where $x \in S$, $y \in S^\perp$.

Theorem 2. [2]⁴ Let $\delta > 0$ and $m \ge \frac{8}{3} \mu(S)\, r \log\left(\frac{2r}{\delta}\right)$. Then with probability at least $1 - 3\delta$,

$$\frac{m(1 - \alpha) - r\mu(S)\frac{(1+\beta)^2}{1-\gamma}}{d}\, \|v - P_S v\|_2^2 \le \|v_\Omega - P_{S_\Omega} v_\Omega\|_2^2 \tag{11}$$

and with probability at least $1 - \delta$,

$$\|v_\Omega - P_{S_\Omega} v_\Omega\|_2^2 \le (1 + \alpha) \frac{m}{d} \|v - P_S v\|_2^2 \tag{12}$$

where $\alpha = \sqrt{\frac{2\mu(y)^2}{m} \log\left(\frac{1}{\delta}\right)}$, $\beta = \sqrt{2\mu(y) \log\left(\frac{1}{\delta}\right)}$ and $\gamma = \sqrt{\frac{8 r \mu(S)}{3m} \log\left(\frac{2r}{\delta}\right)}$.
³ The Grassmannian is a compact Riemannian manifold, and its geodesics can be explicitly computed [5].
⁴ In this theorem, $\mu(S)$ and $\mu(y)$ are the coherence parameters of the subspace $S$ and the vector $y$. We follow the definitions proposed in [8]: $\mu(S) := \frac{d}{r} \max_j \|P_S e_j\|_2^2$ and $\mu(y) := \frac{d \|y\|_\infty^2}{\|y\|_2^2}$.
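The two coherence parameters used in Theorem 2 are cheap to compute. A minimal sketch, assuming $U$ has orthonormal columns so that $\|P_S e_j\|_2^2$ equals the squared norm of the $j$th row of $U$; the function names are hypothetical.

```python
import numpy as np

def subspace_coherence(U):
    """mu(S) = (d / r) * max_j ||P_S e_j||^2 for an orthonormal basis U of S."""
    d, r = U.shape
    return d / r * float(np.max(np.sum(U ** 2, axis=1)))

def vector_coherence(y):
    """mu(y) = d * ||y||_inf^2 / ||y||_2^2."""
    y = np.asarray(y, dtype=float)
    return len(y) * float(np.max(np.abs(y)) ** 2 / np.sum(y ** 2))

# A subspace aligned with coordinate axes is maximally coherent: mu(S) = d / r.
mu_axis = subspace_coherence(np.eye(6)[:, :2])
# A flat vector is minimally coherent, a spike is maximally coherent.
mu_flat = vector_coherence(np.ones(5))
mu_spike = vector_coherence([1.0, 0.0, 0.0, 0.0])
```

Low coherence means the energy is spread over many coordinates, which is what makes a random subset of entries informative.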
Next we state the requirement on the angles between the vector and the subspaces under which the incomplete data vector can be assigned to the correct subspace w.h.p. Define the angle between the vector $v$ and its projection onto the $r_k$-dimensional subspace $S_k$, $k = 0, \dots, K-1$, as $\theta_k = \sin^{-1}\left(\frac{\|v - P_{S_k} v\|_2}{\|v\|_2}\right)$, and let

$$C_k(m) = \frac{m(1 - \alpha_k) - r_k \mu(S_k) \frac{(1+\beta_k)^2}{1-\gamma_k}}{m(1 + \alpha_0)},$$

where $\alpha_k$, $\beta_k$ and $\gamma_k$ are defined as in Theorem 2 for the respective $r_k$. Notice that $C_k(m) < 1$ and $C_k(m) \nearrow 1$ as $m \to \infty$. Without loss of generality, we assume $\theta_0 < \theta_k$, $\forall k \ne 0$.

Corollary 1. [2] Let $m \ge \frac{8}{3} \max_{i \ne 0} \left( r_i \mu(S_i) \log\left(\frac{2 r_i}{\delta}\right) \right)$ for fixed $\delta > 0$. Assume that

$$\sin^2(\theta_0) \le C_i(m) \sin^2(\theta_i), \quad \forall i \ne 0.$$

Then with probability at least $1 - 4(K-1)\delta$,

$$\|v_\Omega - P_{S_{0\Omega}} v_\Omega\|_2^2 < \|v_\Omega - P_{S_{i\Omega}} v_\Omega\|_2^2, \quad \forall i \ne 0.$$

Proof. A detailed proof is available in Appendix C.
3.2 Subspace Update
GROUSE [6] is a very efficient algorithm for single-subspace estimation, although theoretical support for this method is still unavailable⁵. k-GROUSE [2] can be seen as a combination of k-subspaces and GROUSE for multiple-subspace estimation; it is detailed in Algorithm 1.
Algorithm 1 k-subspaces with GROUSE
Require: A collection of vectors $v_{\Omega(t)}$, $t = 1, \dots, T$, and the observed indices $\Omega(t)$. An integer number of subspaces $K$ and dimensions $r_k$, $k = 1, \dots, K$. A maximum number of iterations, maxIter. A fixed step size $\eta$.
Initialize subspaces: Zero-fill the vectors and collect them in a matrix $V$. Initialize $K$ subspace estimates using probabilistic farthest insertion.
Calculate orthonormal bases $U_k$, $k = 1, \dots, K$. Let $Q_{k\Omega} = (U_{k\Omega}^T U_{k\Omega})^{-1} U_{k\Omega}^T$.
for i = 1, ..., maxIter do
    Select a vector at random, $v_\Omega$.
    for k = 1, ..., K do
        Calculate the weights of the projection onto the $k$th subspace: $w(k) = Q_{k\Omega} v_\Omega$.
        Calculate the projection residual to the $k$th subspace: $r_\Omega(k) = v_\Omega - U_{k\Omega} w(k)$.
    end for
    Select the minimum residual: $k = \arg\min_k \|r_\Omega(k)\|_2^2$. Set $w = w(k)$. Define $p = U_k w$ and $r_{\Omega^c} = 0$, i.e. $r$ is the zero-filled $r_\Omega(k)$.
    Update subspace:

    $$U_k := U_k + \left[ (\cos(\sigma\eta) - 1) \frac{p}{\|p\|} + \sin(\sigma\eta) \frac{r}{\|r\|} \right] \frac{w^T}{\|w\|} \tag{13}$$

    where $\sigma = \|r\| \|p\|$.
end for
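The subspace update (13) for the selected subspace can be sketched as a single function. This is an illustration of the update as stated in Algorithm 1, not the implementation used for the experiments; the function name `grouse_step` and the early return for a zero residual are assumptions.

```python
import numpy as np

def grouse_step(U, v_obs, obs_idx, eta):
    """One update of Eq. (13) for an orthonormal basis U (d x r)."""
    U_obs = U[obs_idx, :]
    w, *_ = np.linalg.lstsq(U_obs, v_obs, rcond=None)   # projection weights
    p = U @ w                                           # prediction in R^d
    r = np.zeros(U.shape[0])
    r[obs_idx] = v_obs - U_obs @ w                      # zero-filled residual
    r_norm, p_norm, w_norm = np.linalg.norm(r), np.linalg.norm(p), np.linalg.norm(w)
    if r_norm < 1e-12 or w_norm < 1e-12:
        return U.copy()                                 # nothing to correct (assumed guard)
    sigma = r_norm * p_norm
    step = (np.cos(sigma * eta) - 1) * p / p_norm + np.sin(sigma * eta) * r / r_norm
    return U + np.outer(step, w / w_norm)

rng = np.random.default_rng(0)
U0 = np.linalg.qr(rng.standard_normal((10, 2)))[0]
v = rng.standard_normal(10)
obs_idx = np.array([0, 2, 3, 5, 8, 9])
U1 = grouse_step(U0, v[obs_idx], obs_idx, eta=0.1)
```

Because the zero-filled residual is orthogonal to the columns of $U$, the update is a rotation and the columns of the new basis stay orthonormal, so no re-orthogonalization is needed.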
4 Experiment on simulated data
In this section, we explore the performance of k-GROUSE and the EM algorithm on simulated data. Before presenting the results, we briefly describe two key points of the implementation. First, we use a k-means++-like method to initialize the subspaces for both k-GROUSE and the EM algorithm. Specifically, we randomly pick a point $x_0 \in X$, select its $q$ nearest neighbors⁶, and calculate the best-fit subspace $S_0$ w.r.t. these $q$ points. Next, we recursively initialize $S_j$ by first picking a point $x$ at random with probability proportional to $\min[\mathrm{dist}(x, S_0), \dots, \mathrm{dist}(x, S_{j-1})]$, and then finding the best fit $S_j$ to its $q$ nearest neighbors. Second, the error is calculated by examining the cluster assignments of each set of vectors that come from the same underlying subspace (which is known). For each set, we take the cluster ID that owns the largest number of vectors as the true one, and count the others as incorrectly clustered.

⁵ Local convergence of GROUSE is available in [9].
⁶ $q$ is a nonnegative parameter; usually we set $q = 2 \max_k r_k$.
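The error measure described above can be written in a few lines. A minimal sketch, with a hypothetical function name; for every true subspace the most frequent cluster ID is treated as correct and all other vectors of that subspace count as misclassified.

```python
from collections import Counter

def misclassification_rate(true_labels, cluster_ids):
    """Fraction of vectors whose cluster ID differs from the majority ID
    among the vectors of the same true subspace."""
    groups = {}
    for t, c in zip(true_labels, cluster_ids):
        groups.setdefault(t, []).append(c)
    wrong = 0
    for ids in groups.values():
        majority_count = Counter(ids).most_common(1)[0][1]
        wrong += len(ids) - majority_count
    return wrong / len(true_labels)

# Subspace 0 has IDs [2, 2, 1] (one error); subspace 1 has IDs [1, 1, 1] (none).
rate = misclassification_rate([0, 0, 0, 1, 1, 1], [2, 2, 1, 1, 1, 1])
```

Note that this measure does not penalize two true subspaces that share the same majority cluster ID, which follows from the description above.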
4.1 EM Algorithm
The first experiment with the EM algorithm uses simulated data generated by the Gaussian mixture model of Section 2. We used d = 100, K = 4 and r = 5 for this part. For the simulation, each of the K subspaces is the span of an orthonormal basis generated from r i.i.d. standard Gaussian d-dimensional vectors. The performance of the EM algorithm depends strongly on initialization. With a poor initial estimate of $W$ and $\mu$, the estimated $\theta$ can get stuck at a local maximum of $l(\theta, x^o)$. Fig. 1(a) gives an example of processing the same $X_\Omega$ with different initial estimates, where we observed an obvious difference in performance.
We ran 50 trials for each setting of $|\omega|$, $N_k$ and $\sigma^2$, where $|\omega|$ is the number of observed entries in each column, $N_k$ is the number of columns per subspace and $\sigma^2$ is the noise level. The stopping criterion is $l(\theta^{(t+1)}, x^o) - l(\theta^{(t)}, x^o) < 10^{-4}\, l(\theta^{(t)}, x^o)$. The results are summarized in Fig. 1(b)-(d). The difference between our results and those reported in [1] mainly comes from different initialization strategies. Generally, larger $|\omega|$ and $N_k$ and smaller $\sigma^2$ give fewer misclassified points, which is reasonable. In some cases, all misclassified points come from getting stuck at a local maximum.
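The stopping rule can be expressed as a small helper. This sketch takes the absolute value of the previous log-likelihood on the right-hand side, which is an assumption on our part so that the test also behaves sensibly when the log-likelihood is negative; the function name is hypothetical.

```python
def has_converged(ll_new, ll_old, tol=1e-4):
    """Relative-improvement stopping rule for EM.

    Assumption: the threshold uses |ll_old| so that a negative
    log-likelihood does not make the criterion unreachable."""
    return (ll_new - ll_old) < tol * abs(ll_old)

small_gain = has_converged(-999.95, -1000.0)   # gain 0.05 is below 1e-4 * 1000
large_gain = has_converged(-900.0, -1000.0)    # gain 100 is well above the threshold
```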
[Figure 1 plots. Panel (a): log-likelihood versus iterations for a good and a bad initialization. Panels (b)-(d): proportion of misclassified points versus $N_k$, $|\omega|$ and $\log(\sigma^2)$, respectively.]

Figure 1: (a) Good initial estimate: converges after 83 iterations, label correct rate = 100%. Bad initial estimate: converges after 395 iterations, label correct rate = 45.6% ($N_k = 200$, $|\omega| = 24$, $\sigma^2 = 10^{-4}$). (b)-(d) show the proportion of misclassified points: (b) as a function of $N_k$ with $|\omega| = 24$, $\sigma^2 = 10^{-4}$; (c) as a function of $|\omega|$ with $\sigma^2 = 10^{-4}$, $N_k = 300$; (d) as a function of $\sigma^2$ with $|\omega| = 24$, $N_k = 210$.
4.2 k-GROUSE
In this section, we explore the performance of k-GROUSE in two simulation scenarios. We set the ambient dimension d = 40, and the data matrix $X$ consists of 80 vectors per subspace. In the first case, we let the number of subspaces be $K_1 = 4$ with $r = 5$ for each subspace, i.e. the sum of the dimensions of the subspaces is $R = \sum_{k=1}^{K} r_k < d$. In the second case, we set the number of subspaces to $K_2 = 5$ and make one of them dependent on the other four. From Fig. 2(a) we can see that k-GROUSE performs well as long as the number of observations is a little more than $r \log^2 d$, which is within the order predicted by the theory. However, the performance of k-GROUSE also depends on the initialization. Although k-means++ initialization can significantly improve the convergence of k-GROUSE, it can still converge to a local optimum in the worst case.
[Figure 2 plots. Panel (a): probability of error versus sampling density for d = 40, dim = 5, with curves for $K_1 = 4$ and $K_2 = 5$. Panel (b): probability of error versus sampling density for d = 100, dim = 5, K = 4, with curves for k-GROUSE and EM.]

Figure 2: (a) Performance of k-GROUSE with independent and dependent subspaces, obtained over 50 trials; (b) k-GROUSE vs. EM algorithm, obtained over 50 trials.
4.3 k-GROUSE vs EM Algorithm
Now we compare the performance of k-GROUSE and the EM algorithm on simulated data. In this setting, we set K = 4, r = 5, d = 100. From Fig. 2(b) we can see that both k-GROUSE and the EM algorithm perform well in this situation, while k-GROUSE is more efficient. However, as we will see in the next section, the performance gap between the two becomes obvious when we apply them to worse-conditioned real data: when the condition number of the data matrix $X$ is large, k-GROUSE can perform poorly while the EM algorithm still works under the same conditions. In traditional matrix completion, the required sampling density can depend heavily on the condition number of the matrix; k-GROUSE seems to inherit this dependence.
5 Experiment on real data
5.1 EM algorithm
We test the algorithm on real data by applying it to image compression. An image of 264×536 pixels is used, the same one used for mixtures of probabilistic PCA [4]. As in [4], the image was segmented into 8×8 non-overlapping blocks, giving a dataset of 2211 vectors of dimension 64. In the experiment, we set the number of subspaces to 3 and test the performance of the algorithm under different missing rates (number of missing pixels per image block) and compression ratios. To make sure the resulting coded image has the same bit rate as the conventional SVD-coded image, after the EM algorithm converges the image coding is performed in a hard fashion, i.e. the coding coefficients of each image block are estimated using only the subspace to which it most probably belongs.
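The block segmentation can be sketched with a reshape. The function name and the synthetic stand-in image are assumptions; only the image size and block size come from the text.

```python
import numpy as np

def image_to_blocks(img, b=8):
    """Split an (h, w) image into non-overlapping b x b blocks, one row per block."""
    h, w = img.shape
    assert h % b == 0 and w % b == 0, "image size must be a multiple of the block size"
    blocks = img.reshape(h // b, b, w // b, b).swapaxes(1, 2)
    return blocks.reshape(-1, b * b)

# Synthetic stand-in for the 264 x 536 test image.
img = np.arange(264 * 536, dtype=float).reshape(264, 536)
blocks = image_to_blocks(img)
print(blocks.shape)
```

With 264/8 = 33 block rows and 536/8 = 67 block columns, this gives 33 × 67 = 2211 blocks of 64 pixels each, matching the dataset size quoted above.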
An initialization scheme slightly different from that of the simulated experiment is used: after setting all missing data to zero, a simple k-means clustering is performed on the dataset. $\mu_k$, $W_k$ and $\sigma_k$ of each subspace are initialized from the $k$th data cluster $\{x_i \in \mathbb{R}^d\}$ using probabilistic PCA [4], where $\mu_k$ is initialized by the sample mean. Denoting the singular value decomposition of the sample covariance matrix by $S_k = U \Lambda V$, $W_k \in \mathbb{R}^{d \times r}$ is initialized as

$$W_k = U_r (\Lambda_r - \sigma^2 I)^{1/2} \tag{14}$$

where $\sigma^2 = \frac{1}{d-r} \sum_{j=r+1}^{d} \lambda_j$ is the initial value of $\sigma_k^2$, $U_r \in \mathbb{R}^{d \times r}$ is composed of the leading $r$ left singular vectors from $U$, and $\Lambda_r \in \mathbb{R}^{r \times r}$ is the corresponding singular value matrix.
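The initialization (14) for one cluster can be sketched as follows. The function name is hypothetical, the data are random stand-ins, and a symmetric eigendecomposition is used in place of the SVD, which gives the same factors for a positive semidefinite sample covariance.

```python
import numpy as np

def ppca_init(Xk, r):
    """Initialize (mu_k, W_k, sigma_k^2) from the rows of Xk, following Eq. (14)."""
    mu = Xk.mean(axis=0)
    S = np.cov(Xk, rowvar=False, bias=True)          # sample covariance, d x d
    lam, U = np.linalg.eigh(S)                       # ascending eigenvalues
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    sigma2 = lam[r:].mean()                          # average of the discarded eigenvalues
    W = U[:, :r] * np.sqrt(np.maximum(lam[:r] - sigma2, 0.0))
    return mu, W, sigma2

rng = np.random.default_rng(0)
Xk = rng.standard_normal((200, 10)) * np.linspace(3.0, 0.5, 10)  # stand-in cluster
mu0, W0, sigma2_0 = ppca_init(Xk, r=3)
```

By construction, $W_k W_k^T + \sigma^2 I$ reproduces the leading $r$ eigenvalues of the sample covariance and replaces the remaining ones by their average.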
Sample images reconstructed under different missing rates (equivalently, the number of observed pixels per block $\omega$) and compression ratios (number of components in each subspace $r$) are shown in Fig. 3. The mean squared error (MSE) between the compressed image and the original image is plotted in Fig. 5. As the compression ratio decreases, the quality of the image improves significantly, as we can see from Fig. 3, and the mean squared error also decreases. For different missing rates, by contrast, the images appear visually the same: no significant difference is observed between the missing-data case and the complete-data case (though the mean squared error decreases slightly). This validates the theory in [1] that as long as certain conditions are met and a reasonable number of data points are observed, we can recover the true subspaces from an incomplete dataset with high probability.
Figure 3: Image compression using the EM algorithm. Panels: (a) ω = 14, r = 4; (b) ω = 64, r = 4; (c) ω = 14, r = 12; (d) ω = 64, r = 12; (e) ω = 14, r = 32; (f) ω = 64, r = 32.
5.2 k-GROUSE algorithm
We also test the k-GROUSE algorithm on image compression. We use the same block transformation scheme and number of subspaces as for the EM algorithm. Sample compressed images are shown in Fig. 4.
6 Conclusion
From this project, we learned different subspace learning techniques, especially for the missing-data case. We also studied the theoretical requirements for recovering subspaces when data are incomplete, and practiced implementing the EM algorithm and k-GROUSE.
7 Contribution
Dejiao Zhang was in charge of the abstract, Sections 1, 2.1, 3, 4.2, 4.3 and Appendix C. Jiabei Zheng rederived the EM algorithm for the Gaussian mixture model (Appendix A), wrote the code and made the plots in Section 4.1. Lianli Liu rederived the conditional distribution (Appendix B), ran the experiments on the real dataset (Section 5) and debugged the EM algorithm.
Figure 4: Image compression using k-GROUSE. Panels: (a) missing rate 30%, number of modes 32; (b) missing rate 30%, number of modes 50.

Figure 5: Plot of mean squared error. Panels: (a) MSE under different missing rates, mode = 12; (b) MSE under different compression ratios, missing rate = 78%.
References
[1] D. Pimentel and R. Nowak, "On the sample complexity of subspace clustering with missing data."
[2] L. Balzano, A. Szlam, B. Recht, and R. Nowak, "k-subspaces with missing data," in Statistical Signal Processing Workshop (SSP), 2012 IEEE. IEEE, 2012, pp. 612–615.
[3] B. Eriksson, L. Balzano, and R. Nowak, "High-rank matrix completion and subspace clustering with missing data," arXiv preprint arXiv:1112.5629, 2011.
[4] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analyzers," Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
[5] A. Edelman, T. A. Arias, and S. T. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
[6] L. Balzano, R. Nowak, and B. Recht, "Online identification and tracking of subspaces from highly incomplete information," in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on. IEEE, 2010, pp. 704–711.
[7] L. Balzano, B. Recht, and R. Nowak, "High-dimensional matched subspace detection when data are missing," in Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1638–1642.
[8] E. J. Candes and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[9] L. Balzano and S. J. Wright, "Local convergence of an algorithm for subspace identification from partial data," arXiv preprint arXiv:1306.3391, 2013.
Appendix A
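Here is the complete derivation of the EM algorithm. The complete-data probability is

$$\begin{aligned} \Pr(x, y, z; \theta) &= \prod_{i=1}^N \Pr(x_i, y_i, z_i) \\ &= \prod_{i=1}^N \Pr(x_i \mid y_i, z_i) \Pr(y_i \mid z_i) \Pr(z_i) \\ &= \prod_{i=1}^N \Pr(x_i \mid y_i, z_i) \Pr(y_i) \Pr(z_i) \\ &= \prod_{i=1}^N \rho_{z_i}\, \phi(x_i; \mu_{z_i} + W_{z_i} y_i, \sigma_{z_i}^2 I)\, \phi(y_i; 0, I) \end{aligned} \tag{15}$$

The complete-data log-likelihood is

$$\begin{aligned} l(\theta; x, y, z) &= \sum_{i=1}^N \log\left( \rho_{z_i}\, \phi(x_i; \mu_{z_i} + W_{z_i} y_i, \sigma_{z_i}^2 I)\, \phi(y_i; 0, I) \right) \\ &= \sum_{i=1}^N \sum_{k=1}^K \Delta_{ik} \log\left( \rho_k\, \phi(x_i; \mu_k + W_k y_i, \sigma_k^2 I)\, \phi(y_i; 0, I) \right) \end{aligned} \tag{16}$$

where

$$\Delta_{ik} = \begin{cases} 1 & \text{if } z_i = k \\ 0 & \text{else.} \end{cases} \tag{17}$$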
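The E-step is, with $\theta^{(t)}$ denoting the current parameter estimate,

$$\begin{aligned} Q(\theta; \theta^{(t)}) &= E[l(\theta; x, y, z) \mid x^o, \theta^{(t)}] \\ &= \sum_{k=1}^K \sum_{i=1}^N E[\Delta_{ik} \log(\rho_k \phi(x_i; \mu_k + W_k y_i, \sigma_k^2 I) \phi(y_i; 0, I)) \mid x^o, \theta^{(t)}] \\ &= \sum_{k=1}^K \sum_{i=1}^N \sum_{l=1}^K E[\Delta_{ik} \log(\rho_k \phi(x_i; \mu_k + W_k y_i, \sigma_k^2 I) \phi(y_i; 0, I)) \mid x^o, z_i = l, \theta^{(t)}] \Pr(z_i = l \mid x^o, \theta^{(t)}) \\ &= \sum_{k=1}^K \sum_{i=1}^N E[\log(\rho_k \phi(x_i; \mu_k + W_k y_i, \sigma_k^2 I) \phi(y_i; 0, I)) \mid x^o, z_i = k, \theta^{(t)}] \Pr(z_i = k \mid x^o, \theta^{(t)}) \\ &= \sum_{k=1}^K \sum_{i=1}^N p_{ik} \Big\{ \log(\rho_k) - \frac{d+r}{2} \log(2\pi) - \frac{d}{2} \log(\sigma_k^2) - \frac{1}{2\sigma_k^2} E_k\langle (x_i - \mu_k - W_k y_i)^T (x_i - \mu_k - W_k y_i) \rangle - \frac{1}{2} E_k\langle y_i^T y_i \rangle \Big\} \\ &= \sum_{k=1}^K \Big( \sum_{i=1}^N p_{ik} \Big) \Big\{ \log(\rho_k) - \frac{d+r}{2} \log(2\pi) - \frac{d}{2} \log(\sigma_k^2) - \frac{1}{2\sigma_k^2} U_k - \frac{1}{2} \bar{E}_k\langle y^T y \rangle \Big\} \end{aligned} \tag{18}$$

where

$$p_{ik} = \Pr(z_i = k \mid x_i^o, \theta^{(t)}) \tag{19}$$

$$E_k\langle \cdot \rangle = E[\cdot \mid x_i^o, z_i = k, \theta^{(t)}] \tag{20}$$

$$\bar{E}_k\langle u \rangle = \frac{\sum_{i=1}^N p_{ik} E_k\langle u_i \rangle}{\sum_{i=1}^N p_{ik}} \tag{21}$$

$$\bar{E}_k\langle u^T v \rangle = \frac{\sum_{i=1}^N p_{ik} E_k\langle u_i^T v_i \rangle}{\sum_{i=1}^N p_{ik}} \tag{22}$$

(and analogously for outer products $\bar{E}_k\langle u v^T \rangle$), and

$$U_k = \bar{E}_k\langle x^T x \rangle + \mu_k^T \mu_k + \mathrm{tr}(\bar{E}_k\langle y y^T \rangle W_k^T W_k) - 2\mu_k^T \bar{E}_k\langle x \rangle + 2\mu_k^T W_k \bar{E}_k\langle y \rangle - 2\,\mathrm{tr}(\bar{E}_k\langle y x^T \rangle W_k) \tag{23}$$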
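The M-step is

$$\theta^{(t+1)} = \arg\max_\theta Q(\theta; \theta^{(t)}) \tag{24}$$

First derive $\rho$. $\rho$ satisfies $\sum_{k=1}^K \rho_k = 1$. Using a Lagrange multiplier $\lambda$:

$$\frac{\partial \left( Q + \lambda \sum_{k=1}^K \rho_k \right)}{\partial \rho_k} = 0 \tag{25}$$

$$\sum_{i=1}^N p_{ik} + \lambda \rho_k = 0 \tag{26}$$

Summing (26) over $k$ and using $\sum_{k=1}^K \rho_k = 1$,

$$\sum_{k=1}^K \sum_{i=1}^N p_{ik} + \lambda = 0 \tag{27}$$

Since $\sum_{k=1}^K p_{ik} = 1$ for every $i$, the double sum equals $N$, so

$$\lambda = -N \tag{28}$$

Substituting this into Eq. (26), we have

$$\widehat{\rho}_k = \frac{1}{N} \sum_{i=1}^N p_{ik} \tag{29}$$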
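Then we eliminate $\sigma^2$ with $W$ fixed at $\widehat{W}$:

$$\frac{\partial Q}{\partial \sigma_k^2} = 0 \Rightarrow \widehat{\sigma}_k^2 = \frac{U_k}{d} \tag{30}$$

$$\widehat{\sigma}_k^2 = \frac{1}{d} \Big( \bar{E}_k\langle x^T x \rangle + \widehat{\mu}_k^T \widehat{\mu}_k + \mathrm{tr}(\bar{E}_k\langle y y^T \rangle \widehat{W}_k^T \widehat{W}_k) - 2\widehat{\mu}_k^T \bar{E}_k\langle x \rangle + 2\widehat{\mu}_k^T \widehat{W}_k \bar{E}_k\langle y \rangle - 2\,\mathrm{tr}(\bar{E}_k\langle y x^T \rangle \widehat{W}_k) \Big) \tag{31}$$

The elimination of $\mu$ is similar:

$$\frac{\partial Q}{\partial \mu_k} = 0 \Rightarrow \frac{\partial U_k}{\partial \mu_k} = 0 \tag{32}$$

$$\widehat{\mu}_k = \bar{E}_k\langle x \rangle - \widehat{W}_k \bar{E}_k\langle y \rangle \tag{33}$$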
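Then we derive the expression for $W$. From Eq. (18), omitting the common positive factor $\sum_{i=1}^N p_{ik}$, we have:

$$\frac{\partial Q}{\partial W_k} = -\frac{d}{2\sigma_k^2} \frac{\partial \sigma_k^2}{\partial W_k} - \frac{1}{2\sigma_k^2} \frac{\partial U_k}{\partial W_k} + \frac{U_k}{2\sigma_k^4} \frac{\partial \sigma_k^2}{\partial W_k} = -\frac{1}{2\sigma_k^2} \frac{\partial U_k}{\partial W_k} = 0 \tag{34}$$

where the first and third terms cancel because $\sigma_k^2 = U_k / d$. As a result,

$$\frac{\partial Q}{\partial W_k} = 0 \Rightarrow \frac{\partial U_k}{\partial W_k} = 0 \tag{35}$$

Plugging the expression for $\mu_k$ into $U_k$, we get

$$\frac{\partial}{\partial W_k} \Big\{ \bar{E}_k\langle x^T x \rangle + (\bar{E}_k\langle x \rangle - W_k \bar{E}_k\langle y \rangle)^T (\bar{E}_k\langle x \rangle - W_k \bar{E}_k\langle y \rangle) + \mathrm{tr}(\bar{E}_k\langle y y^T \rangle W_k^T W_k) - 2(\bar{E}_k\langle x \rangle - W_k \bar{E}_k\langle y \rangle)^T \bar{E}_k\langle x \rangle + 2(\bar{E}_k\langle x \rangle - W_k \bar{E}_k\langle y \rangle)^T W_k \bar{E}_k\langle y \rangle - 2\,\mathrm{tr}(\bar{E}_k\langle y x^T \rangle W_k) \Big\} = 0 \tag{36}$$

$$\widehat{W}_k = \left( \bar{E}_k\langle x y^T \rangle - \bar{E}_k\langle x \rangle \bar{E}_k\langle y^T \rangle \right) \left( \bar{E}_k\langle y y^T \rangle - \bar{E}_k\langle y \rangle \bar{E}_k\langle y^T \rangle \right)^{-1} \tag{37}$$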
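$p_{ik}$ is calculated with Bayes' rule:

$$\begin{aligned} p_{ik} &= \Pr(z_i = k \mid x_i^o, \theta) \\ &= \frac{\rho_k P(x_i^o \mid z_i = k; \theta)}{\sum_{j=1}^K \rho_j P(x_i^o \mid z_i = j; \theta)} \\ &= \frac{\rho_k\, \phi(x_i^o; \mu_k^{o_i}, W_k^{o_i} W_k^{o_i T} + \sigma_k^2 I)}{\sum_{j=1}^K \rho_j\, \phi(x_i^o; \mu_j^{o_i}, W_j^{o_i} W_j^{o_i T} + \sigma_j^2 I)} \end{aligned} \tag{38}$$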
Appendix B
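By the model definition,

$$\begin{bmatrix} x_i^o \\ x_i^m \\ y_i \end{bmatrix} \Bigg|_{z_i = k, \theta} = \begin{bmatrix} W_k^{o_i} \\ W_k^{m_i} \\ I \end{bmatrix} y_i + \begin{bmatrix} \mu_k^{o_i} \\ \mu_k^{m_i} \\ 0 \end{bmatrix} + \begin{bmatrix} \eta_i^o \\ \eta_i^m \\ 0 \end{bmatrix} \tag{39}$$

Eq. (39) is a linear transform of $y_i \sim \mathcal{N}(0, I)$. By the properties of the multivariate Gaussian distribution,

$$\begin{bmatrix} x_i^o \\ x_i^m \\ y_i \end{bmatrix} \Bigg|_{z_i = k, \theta} \sim \mathcal{N}\left( \begin{bmatrix} \mu_k^{o_i} \\ \mu_k^{m_i} \\ 0 \end{bmatrix}, \begin{bmatrix} W_k^{o_i} W_k^{o_i T} + \sigma^2 I & W_k^{o_i} W_k^{m_i T} & W_k^{o_i} \\ W_k^{m_i} W_k^{o_i T} & W_k^{m_i} W_k^{m_i T} + \sigma^2 I & W_k^{m_i} \\ W_k^{o_i T} & W_k^{m_i T} & I \end{bmatrix} \right) \tag{40}$$

By the Bayesian Gauss-Markov theorem,

$$\begin{bmatrix} x_i^m \\ y_i \end{bmatrix} \Bigg|_{x_i^o, z_i = k, \theta} \sim \mathcal{N}\left( \mu_{my_i} + R_{myo_i} R_{o_i}^{-1} (x_i^o - \mu_k^{o_i}),\ R_{my_i} - R_{myo_i} R_{o_i}^{-1} R_{omy_i} \right) \tag{41}$$

where

$$\mu_{my_i} = \begin{bmatrix} \mu_k^{m_i} \\ 0 \end{bmatrix}, \quad R_{myo_i} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix}, \quad R_{my_i} = \begin{bmatrix} W_k^{m_i} W_k^{m_i T} + \sigma^2 I & W_k^{m_i} \\ W_k^{m_i T} & I \end{bmatrix} \tag{42}$$

$$R_{o_i} = W_k^{o_i} W_k^{o_i T} + \sigma^2 I, \quad R_{omy_i} = \begin{bmatrix} W_k^{o_i} W_k^{m_i T} & W_k^{o_i} \end{bmatrix} \tag{43}$$

Plugging Eq. (42) and Eq. (43) into Eq. (41),

$$R_{myo_i} R_{o_i}^{-1} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix} (W_k^{o_i} W_k^{o_i T} + \sigma^2 I)^{-1} \tag{44}$$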
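By matrix identities,

$$(I + PQ)^{-1} P = P (I + QP)^{-1} \tag{45}$$

Therefore

$$W_k^{o_i T} (W_k^{o_i} W_k^{o_i T} + \sigma^2 I)^{-1} = (W_k^{o_i T} W_k^{o_i} + \sigma^2 I)^{-1} W_k^{o_i T} = M_k^{-1} W_k^{o_i T} \tag{46}$$

Putting Eq. (46) and Eq. (44) together,

$$R_{myo_i} R_{o_i}^{-1} = \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix} \left[ W_k^{o_i} W_k^{o_i T} + \sigma^2 I \right]^{-1} = \begin{bmatrix} W_k^{m_i} M_k^{-1} W_k^{o_i T} \\ M_k^{-1} W_k^{o_i T} \end{bmatrix} \tag{47}$$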
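The identity behind Eq. (46) is easy to check numerically. The sketch below uses a random stand-in for $W_k^{o_i}$; the sizes and the value of $\sigma^2$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m_obs, r, sigma2 = 7, 3, 0.5
Wo = rng.standard_normal((m_obs, r))          # stand-in for W_k^{o_i}

# Left side of Eq. (46): inverts an m_obs x m_obs matrix.
lhs = Wo.T @ np.linalg.inv(Wo @ Wo.T + sigma2 * np.eye(m_obs))

# Right side of Eq. (46): only the r x r matrix M_k is involved.
M = Wo.T @ Wo + sigma2 * np.eye(r)
rhs = np.linalg.solve(M, Wo.T)

max_gap = float(np.abs(lhs - rhs).max())
```

The practical point of the identity is that the right-hand side needs only an $r \times r$ solve, which is cheaper whenever the number of observed entries exceeds $r$.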
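Therefore

$$\mu_{my_i} + R_{myo_i} R_{o_i}^{-1} (x_i^o - \mu_k^{o_i}) = \begin{bmatrix} \mu_k^{m_i} + W_k^{m_i} M_k^{-1} W_k^{o_i T} (x_i^o - \mu_k^{o_i}) \\ M_k^{-1} W_k^{o_i T} (x_i^o - \mu_k^{o_i}) \end{bmatrix} \tag{48}$$

which is exactly the formula for the mean given in the paper.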
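To prove the formula for the variance, by Eq. (47),

$$\begin{aligned} R_{myo_i} R_{o_i}^{-1} R_{omy_i} &= \begin{bmatrix} W_k^{m_i} W_k^{o_i T} \\ W_k^{o_i T} \end{bmatrix} (W_k^{o_i} W_k^{o_i T} + \sigma^2 I)^{-1} \begin{bmatrix} W_k^{o_i} W_k^{m_i T} & W_k^{o_i} \end{bmatrix} \\ &= \begin{bmatrix} W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T} & W_k^{m_i} M_k^{-1} W_k^{o_i T} W_k^{o_i} \\ M_k^{-1} W_k^{o_i T} W_k^{o_i} W_k^{m_i T} & M_k^{-1} W_k^{o_i T} W_k^{o_i} \end{bmatrix} \end{aligned} \tag{49}$$

Thus

$$R_{my_i} - R_{myo_i} R_{o_i}^{-1} R_{omy_i} = \begin{bmatrix} \sigma^2 I + W_k^{m_i} (I - M_k^{-1} W_k^{o_i T} W_k^{o_i}) W_k^{m_i T} & W_k^{m_i} (I - M_k^{-1} W_k^{o_i T} W_k^{o_i}) \\ (I - M_k^{-1} W_k^{o_i T} W_k^{o_i}) W_k^{m_i T} & I - M_k^{-1} W_k^{o_i T} W_k^{o_i} \end{bmatrix} \tag{50}$$

Notice that

$$M_k - W_k^{o_i T} W_k^{o_i} = \sigma^2 I \Rightarrow I - M_k^{-1} W_k^{o_i T} W_k^{o_i} = \sigma^2 M_k^{-1} \tag{51}$$

Thus Eq. (50) simplifies to

$$R_{my_i} - R_{myo_i} R_{o_i}^{-1} R_{omy_i} = \sigma^2 \begin{bmatrix} I + W_k^{m_i} M_k^{-1} W_k^{m_i T} & W_k^{m_i} M_k^{-1} \\ M_k^{-1} W_k^{m_i T} & M_k^{-1} \end{bmatrix} \tag{52}$$

which is exactly the formula for the variance given in the paper.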
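The simplification from the generic conditional covariance in (41) to the closed form (52) can also be verified numerically. The sketch below builds the blocks of (42) and (43) from a random $W$; the dimensions, index sets and $\sigma^2$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, sigma2 = 8, 2, 0.3
W = rng.standard_normal((d, r))
obs, mis = np.arange(5), np.arange(5, d)
Wo, Wm = W[obs], W[mis]

# Blocks from Eqs. (42)-(43); R_omy is the transpose of R_myo.
R_my = np.block([[Wm @ Wm.T + sigma2 * np.eye(len(mis)), Wm],
                 [Wm.T, np.eye(r)]])
R_myo = np.vstack([Wm @ Wo.T, Wo.T])
R_o = Wo @ Wo.T + sigma2 * np.eye(len(obs))
generic = R_my - R_myo @ np.linalg.solve(R_o, R_myo.T)

# Closed form of Eq. (52).
Minv = np.linalg.inv(sigma2 * np.eye(r) + Wo.T @ Wo)
closed = sigma2 * np.block([[np.eye(len(mis)) + Wm @ Minv @ Wm.T, Wm @ Minv],
                            [Minv @ Wm.T, Minv]])

cov_gap = float(np.abs(generic - closed).max())
```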
Appendix C
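Before we give the proof of Corollary 1, we examine the case of binary subspace assignment, i.e. K = 2. Let $v \in \mathbb{R}^d$ and $S_0, S_1 \subset \mathbb{R}^d$ with dimensions $r_0$, $r_1$ respectively. Define the angles between $v$ and its projections onto the subspaces $S_0$, $S_1$ as $\theta_0$, $\theta_1$:

$$\theta_0 = \sin^{-1}\left( \frac{\|v - P_{S_0} v\|_2}{\|v\|_2} \right) \quad \text{and} \quad \theta_1 = \sin^{-1}\left( \frac{\|v - P_{S_1} v\|_2}{\|v\|_2} \right) \tag{53}$$

Theorem 3. [2] Let $\delta > 0$ and $m \ge \frac{8}{3} r_1 \mu(S_1) \log\left(\frac{2 r_1}{\delta}\right)$. Assume that

$$\sin^2(\theta_0) < C(m) \sin^2(\theta_1),$$

where $C(m) = C_1(m)$ as defined in Section 3.1. Then with probability at least $1 - 4\delta$,

$$\|v_\Omega - P_{S_{0\Omega}} v_\Omega\|_2^2 < \|v_\Omega - P_{S_{1\Omega}} v_\Omega\|_2^2$$

Proof. By Theorem 2 and the union bound, the following two statements hold simultaneously with probability at least $1 - 4\delta$:

$$\|v_\Omega - P_{S_{0\Omega}} v_\Omega\|_2^2 \le (1 + \alpha_0) \frac{m}{d} \|v - P_{S_0} v\|_2^2 \quad \text{and} \quad \frac{m(1 - \alpha_1) - r_1 \mu(S_1) \frac{(1+\beta_1)^2}{1-\gamma_1}}{d} \|v - P_{S_1} v\|_2^2 \le \|v_\Omega - P_{S_{1\Omega}} v_\Omega\|_2^2 \tag{54}$$

Combining with (53),

$$\|v - P_{S_0} v\|_2^2 < C(m) \|v - P_{S_1} v\|_2^2$$

holds if and only if

$$\sin^2(\theta_0) < C(m) \sin^2(\theta_1).$$

Under this condition the upper bound on the $S_0$ residual in (54) is strictly smaller than the lower bound on the $S_1$ residual, which completes the proof.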
Proof of Corollary 1. The result of Theorem 3 can be generalized to the situation where there are multiple subspaces $S_i$, $i = 0, \dots, K-1$. Again without loss of generality we assume that $\theta_0 < \theta_i$, $\forall i \ne 0$, and define

$$C_i(m) = \frac{m(1 - \alpha_i) - r_i \mu(S_i) \frac{(1+\beta_i)^2}{1-\gamma_i}}{m(1 + \alpha_0)}.$$

Then the conclusion of Corollary 1 follows by applying the union bound and an argument similar to that in the proof of Theorem 3.