
Expectation Maximization for Sparse and Non-Negative PCA

Christian D. Sigg Joachim M. Buhmann

Institute of Computational Science, ETH Zurich, Switzerland

ICML 2008, Helsinki


Overview

Introduction

Constrained PCA

Solving Constrained PCA

Method

Probabilistic PCA

Expectation Maximization

Sparse and Non-Negative PCA

Multiple Principal Components

Experiments and Results

Setup

Sparse PCA

Nonnegative Sparse PCA

Feature Selection

Conclusions


Constrained PCA

Loadings $w_{(1)}$ of the first principal component (PC) are found by solving a convex maximization problem:

$\arg\max_w \; w^\top C w, \quad \text{s.t. } \|w\|_2 = 1, \qquad (1)$

where $C \in \mathbb{R}^{D \times D}$ is the symmetric p.s.d. covariance matrix.

Loadings $w_{(l)}$ of further PCs again maximize (1), subject to $w_{(l)}^\top w_{(k)} = 0$ for $l > k$.

We consider (1) under one or two additional constraints:

1. Sparsity: $\|w\|_0 \le K$
2. Non-negativity: $w_d \ge 0,\ \forall d = 1, \dots, D$
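For reference, the unconstrained problem (1) is solved by the dominant eigenvector of $C$. A minimal power-iteration sketch in Python/NumPy (the function name and iteration limits are our own, not the paper's):

```python
import numpy as np

def first_pc_loadings(C, n_iter=500, tol=1e-10):
    """Dominant eigenvector of the symmetric p.s.d. covariance C,
    i.e. the maximizer of w^T C w subject to ||w||_2 = 1."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal(C.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        w_new = C @ w                       # power step
        w_new /= np.linalg.norm(w_new)      # stay on the unit sphere
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```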



Motivation and Applications

Constraints facilitate a trade-off between:

1. statistical fidelity – maximization of variance
2. interpretability – feature selection omits irrelevant variables
3. applicability – constraints imposed by the domain (e.g. a physical process, transaction costs in finance)

Sparse PCA:

- Interpretation of PC loadings (Jolliffe et al., 2003 JCGS)
- Gene selection (Zou et al., 2004 JCGS) and ranking (d'Aspremont et al., 2007 ICML)

Non-negative sparse PCA: image parts extraction (Zass and Shashua, 2006 NIPS)

[Figure: left, full loadings of the first four PCs; right, the corresponding non-negative sparse loadings.]



Combinatorial Solvers

Write (1) as

$w^\top C w = \sum_{i,j} C_{ij}\, w_i w_j \qquad (2)$

For a given sparsity pattern $S = \{i \mid w_i \neq 0\}$, the optimal solution is the dominant eigenvector of the corresponding submatrix $C_S$.

Algorithms:

- Exact: branch and bound (Moghaddam et al., 2006 NIPS), for small D
- Greedy forward search (d'Aspremont et al., 2007), improved to $O(D^2)$ per step

Optimality is verifiable in $O(D^3)$ (d'Aspremont et al., 2007).
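To make the observation concrete, here is a small sketch (ours, not from the paper) that evaluates a candidate support S by solving the restricted eigenproblem on $C_S$:

```python
import numpy as np

def variance_for_support(C, S):
    """Variance captured by the best unit loadings restricted to support S:
    the dominant eigenpair of the submatrix C_S."""
    S = np.asarray(S)
    C_S = C[np.ix_(S, S)]
    vals, vecs = np.linalg.eigh(C_S)   # symmetric eigendecomposition, ascending
    w = np.zeros(C.shape[0])
    w[S] = vecs[:, -1]                 # embed the eigenvector back into R^D
    return vals[-1], w
```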



Continuous Approximation

$L_1$ relaxation: $\|w\|_1 \le B$

Adding the constraint creates local minima (Jolliffe et al., 2003) and makes the problem NP-hard.

[Figure: contours of $w^\top C w$ subject to $\|w\|_2 \le 1$ (left), and subject to $\|w\|_2 \le 1 \wedge \|w\|_1 \le 1.2$ (right).]

Algorithms:

- Iterative $L_1$ regression (Zou et al., 2004)
- Convex $O(D^4 \sqrt{\log D})$ SDP approximation (d'Aspremont et al., 2005 NIPS)
- d.c. minimization (Sriperumbudur et al., 2007 ICML), $O(D^3)$ per iteration




Generative Model

(Bishop and Tipping, 1997 TECR; Roweis, 1998 NIPS)

Latent variable $y \in \mathbb{R}^L$ in the PC subspace:

$p(y) = \mathcal{N}(0, I)$.

Observation $x \in \mathbb{R}^D$ conditioned on $y$:

$p(x|y) = \mathcal{N}(Wy + \mu, \sigma^2 I)$.

[Figure: the latent prior $p(y)$, the conditional $p(x|y)$ centered at $\mu + wy$, and the marginal $p(x)$ in data space.]

Illustrations from (Bishop, 2006 PRML).
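A minimal sketch of sampling from this generative model (our own illustration, with our own names):

```python
import numpy as np

def sample_ppca(W, mu, sigma2, N, seed=0):
    """Draw N observations: y ~ N(0, I_L), then x | y ~ N(W y + mu, sigma2 I_D).
    W is D x L, mu has length D."""
    rng = np.random.default_rng(seed)
    D, L = W.shape
    Y = rng.standard_normal((N, L))                   # latent coordinates
    X = Y @ W.T + mu + np.sqrt(sigma2) * rng.standard_normal((N, D))
    return X, Y                                       # observations (N x D), latents
```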



Limit-Case EM Algorithm

Three simplifications: take the limit $\sigma^2 \to 0$, consider an $L = 1$ subspace $w$, and normalize $\|w\| = 1$.

E-Step: orthogonal projection onto the current estimate ($X \in \mathbb{R}^{N \times D}$)

$y = X w_{(t)}. \qquad (3)$

M-Step: minimization of the $L_2$ reconstruction error

$w_{(t+1)} = \arg\min_w \sum_{n=1}^{N} \left\| x^{(n)} - y_n w \right\|^2. \qquad (4)$

Renormalization: $w_{(t+1)} \leftarrow w_{(t+1)} / \|w_{(t+1)}\|$

[Figure: panels (a)–(f) illustrating the EM iterations converging to the first PC on 2-D data.]

Illustrations from (Bishop, 2006 PRML).
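Each limit-case EM iteration reduces to two matrix–vector products. A sketch assuming centered data (the function name and iteration count are ours):

```python
import numpy as np

def em_first_pc(X, n_iter=100, seed=0):
    """Limit-case EM (sigma^2 -> 0, L = 1) for the first PC loadings.
    X is the centered N x D data matrix; implements steps (3) and (4)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = X @ w                  # E-step (3): project onto current estimate
        w = X.T @ y / (y @ y)      # M-step (4): least-squares reconstruction
        w /= np.linalg.norm(w)     # renormalization
    return w
```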



Adding Constraints

Rewrite the M-step (4) as an isotropic QP, favor sparsity using an $L_1$ constraint, and add lower bounds:

$w^\circ = \arg\min_w \left( h\, w^\top w - 2 f^\top w \right)$

s.t. $\|w\|_1 \le B$ and $w_d \ge 0,\ \forall d \in \{1, \dots, D\}$,

with $h = \sum_{n=1}^{N} y_n^2$ and $f = \sum_{n=1}^{N} y_n x^{(n)}$.

This is equivalent to minimizing the $L_2$ distance to the unconstrained optimum $w^* = f/h$.
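In code, the ingredients of this QP are cheap to assemble from the E-step output; a sketch with our own names:

```python
import numpy as np

def m_step_qp_terms(X, y):
    """Coefficients of the isotropic M-step QP h w^T w - 2 f^T w.
    X is the N x D data matrix, y the length-N vector of E-step projections."""
    h = float(y @ y)        # h = sum_n y_n^2
    f = X.T @ y             # f = sum_n y_n x^(n)
    w_star = f / h          # unconstrained optimum of the QP
    return h, f, w_star
```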



Axis-Aligned Gradient Descent

After transformation into non-negative orthant:

[Figure: the non-negative orthant in the $(w_1, w_2)$ plane, with the $L_1$ ball of radius $B$ and the unconstrained optimum $w^*$.]

Express sparsity by K directly:

- $L_1$ bound $B$ is set implicitly, due to monotonicity
- Regularization path obtained by sorting the elements of $w^*$, in $O(D \log D)$ (see the sketch below)
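A sketch of this path computation under a direct cardinality constraint K (our own, simplified version; the paper's axis-aligned descent handles the general $L_1$/non-negative case):

```python
import numpy as np

def top_k_loadings(w_star, K, nonneg=False):
    """Closest point to the unconstrained optimum w* with at most K
    non-zero entries; optionally restricted to the non-negative orthant.
    The whole regularization path follows from one O(D log D) sort."""
    w = np.maximum(w_star, 0.0) if nonneg else w_star.copy()
    order = np.argsort(np.abs(w))[::-1]   # sort magnitudes once
    w_K = np.zeros_like(w)
    keep = order[:K]
    w_K[keep] = w[keep]                   # zero out all but the K largest
    norm = np.linalg.norm(w_K)
    return w_K / norm if norm > 0 else w_K
```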


Multiple Principal Components

Iterative Deflation:

1. Compute the sparse PC loadings $w_{(l)}$.
2. Project the data onto the orthogonal subspace, using the projector $P = I - w_{(l)} w_{(l)}^\top$.

Including nonnegativity constraints:

$w_{(l),i} > 0 \;\Rightarrow\; w_{(m),i} = 0$ for $m \neq l$, i.e. the supports $S_l$ and $S_m$ are disjoint.

This may be too strong a requirement. To enforce quasi-orthogonality instead, add the constraint

$V^\top w \le \alpha, \qquad (5)$

where $V = \left[\, w_{(1)}\; w_{(2)} \cdots w_{(l-1)} \,\right]$.




Algorithms and Datasets

Algorithms considered:

- SPCA (Zou et al., 2004): iterative $L_1$ regression, continuous, uses X instead of C
- PathSPCA (d'Aspremont et al., 2007): combinatorial, greedy step is $O(D^2)$
- NSPCA (Zass and Shashua, 2006): non-negativity, quasi-orthogonality

Data sets:

1. N > D: MIT CBCL faces data set (Sung, 1996), N = 2429, D = 361; referenced in the NSPCA and NMF literature.
2. D ≫ N: gene expression data for three types of leukemia (Armstrong et al., 2002), N = 72, D = 12582.

Standardized to zero mean, unit variance per dimension
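A one-line version of this preprocessing step (a sketch, assuming samples are rows):

```python
import numpy as np

def standardize(X):
    """Zero mean and unit variance per dimension (columns of the N x D matrix X)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```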



Variance vs. Cardinality

[Figure: variance vs. cardinality of the first PC. Left, faces data: emPCA, emPCAopt, SPCA, SPCAopt, PathSPCA. Right, gene expression data: emPCA, SPCA, PathSPCA, thresholding.]

- “opt” denotes the result after recomputing the weights for the given support S.



Computational Complexity

Parameter sensitivity of EM convergence and effective computational cost:

[Figure: left, number of EM iterations as a function of cardinality and dimensionality (faces data); right, Matlab CPU time in seconds vs. cardinality on log–log axes for emPCA, SPCA, and PathSPCA (gene data).]

- Sublinear dependence of EM iterations on D
- Sparse emPCA solutions ($10 \le K \le 40$) require more effort



Variance vs. Cardinality

[Figure: variance vs. cardinality for emPCA (nn) and for NSPCA with α = 1e5, 1e6, 1e7.]

- Best result after 10 random restarts
- NSPCA sparsity parameter β determined by bisection search
- α is an orthonormality penalty



Multiple Principal Components

The first PC loadings may already lie in the non-negative orthant. To recover more than one component:

1. Keep the orthogonality requirement, but constrain cardinality.
2. Require a minimum angle between components.

[Figure: cumulative variance vs. number of components. Left, orthogonal sparse loadings: emSPCA, emNSPCA, NSPCA (α = 1e12). Right, full quasi-orthogonal loadings: emNPCA, NSPCA.]



Unsupervised Gene Selection

Compare loadings of emPCA to the CE+SR criterion (Varshavsky et al., 2006 BIOI), a leave-one-out comparison of the singular value spectrum:

1. Choose a gene subset
2. k-means clustering of samples (k = 3, 100 restarts)
3. Compare the clustering assignment to the labels (AML, ALL, MLL)

[Figure: Jaccard score vs. number of features for emPCA, emPCA (nn), and CE + SR.]

Jaccard score: $\frac{n_{11}}{n_{11} + n_{10} + n_{01}}$, where $n_{11}$ is the number of pairs that have the same assignment in the true labeling and in the clustering solution.
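A straightforward pair-counting implementation of this score (our own reference sketch):

```python
from itertools import combinations

def jaccard_score(labels_true, labels_pred):
    """n11 / (n11 + n10 + n01): count sample pairs grouped together in both
    partitions (n11), only in the true labeling (n10), or only in the
    clustering (n01)."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            n11 += 1
        elif same_true:
            n10 += 1
        elif same_pred:
            n01 += 1
    return n11 / (n11 + n10 + n01)
```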




Summary

We have presented an $O(D^2)$ constrained PCA algorithm with:

- Applicability to a wide range of problems:
  - sparse and non-negative constraints
  - strict and quasi-orthogonality between components
  - N > D and D ≫ N
- Competitive variance per cardinality for sparse PCA, superior for non-negative sparse PCA
- Unmatched computational efficiency
- Direct specification of the desired cardinality K, instead of a bound B on the $L_1$ norm



Thank you for your attention.

We have machine learning PhD and postdoc positions available at ETH Zurich:

- Computational (systems) biology
- Robust (noisy) combinatorial optimization
- Learning dynamical systems

Contact: Prof. J.M. Buhmann, jbuhmann@inf.ethz.ch, http://www.ml.ethz.ch/open_positions


Joint Optimization (Work in Progress)

EM algorithm for simultaneous computation of L PCs:

E-Step:

$Y = \left( W_{(t)}^\top W_{(t)} \right)^{-1} W_{(t)}^\top X$

M-Step:

$W_{(t+1)} = \arg\min_W \|X - WY\|_F^2$

s.t. $\|w_{(l)}\|_1 \le B,\ l \in \{1, \dots, L\}$, and $w_{i,j} \ge 0$.
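A sketch of the unconstrained version of this joint iteration (ours; the constrained M-step would add the $L_1$ and non-negativity handling):

```python
import numpy as np

def joint_em_pca(X, L, n_iter=100, seed=0):
    """Joint EM for L components at once. X is D x N, matching the slide's
    convention; W is D x L, Y is L x N. Unconstrained M-step shown."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[0], L))
    for _ in range(n_iter):
        Y = np.linalg.solve(W.T @ W, W.T @ X)            # E-step
        W = np.linalg.lstsq(Y.T, X.T, rcond=None)[0].T   # M-step: min ||X - W Y||_F^2
    return W, Y
```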

Tasks:

- Make this as fast as the iterative deflation approach
- Avoid bad minima: enforce orthogonality?
- Specify K instead of B
