
On Consistency and Sparsity for Principal Components Analysis in High Dimensions

Iain M. JOHNSTONE and Arthur Yu LU

Principal components analysis (PCA) is a classic method for the reduction of dimensionality of data in the form of n observations (or cases) of a vector with p variables. Contemporary datasets often have p comparable with or even much larger than n. Our main assertions, in such settings, are (a) that some initial reduction in dimensionality is desirable before applying any PCA-type search for principal modes, and (b) the initial reduction in dimensionality is best achieved by working in a basis in which the signals have a sparse representation. We describe a simple asymptotic model in which the estimate of the leading principal component vector via standard PCA is consistent if and only if $p(n)/n \to 0$. We provide a simple algorithm for selecting a subset of coordinates with largest sample variances, and show that if PCA is done on the selected subset, then consistency is recovered, even if $p(n) \gg n$.

KEY WORDS: Eigenvector estimation; Reduction of dimension; Regularization; Thresholding; Variable selection.

1. INTRODUCTION

Suppose $\{x_i,\ i = 1, \ldots, n\}$ is a dataset of n observations on p variables. Standard principal components analysis (PCA) looks for vectors $\xi$ that maximize
$$\mathrm{var}(\xi^T x_i) / \|\xi\|^2. \qquad (1)$$
If $\xi_1, \ldots, \xi_k$ have already been found by this optimization, then the maximum defining $\xi_{k+1}$ is taken over vectors $\xi$ orthogonal to $\xi_1, \ldots, \xi_k$.

Our interest lies in situations in which each $x_i$ is a realization of a possibly high-dimensional signal, so that p is comparable in magnitude with n, or may even be larger. In addition, we have in mind settings in which the signals $x_i$ contain localized features, so that the principal modes of variation sought by PCA may be localized as well.

These issues are familiar in signal and image processing application areas in which each sample has many variables: pixels, frequencies, genes, stocks, and so forth. In applications, it is common to combine the use of transform domains and feature selection to achieve an effective reduction of dimensionality. For example, one might transform the data into a suitable orthogonal basis (e.g., wavelets), select coordinates with highest variance, and then do PCA on the reduced set of variables.

A notable example occurs in the work of Wickerhauser (1994a, b), in which the orthobasis itself was chosen from a library of (wavelet packet) bases. Applications to face and fingerprint classification were given. A selection of later examples (by no means exhaustive) would include Feng, Yuen, and Dai (2000) in face recognition; and Kaewpijit, Le Moigne, and El-Ghazawi (2002) and Du and Fowler (2008) for hyperspectral images. For some further discussion, see Cherkassky and Mulier (1998). A recent approach to variable selection followed by dimensionality reduction that emphasizes sparsity is described by Wolf and Shashua (2005) and Wolf and Bileschi (2005).

The purpose of this article is to contribute some theoretical analysis of PCA in these burgeoning high-dimensional settings. In a simple class of models of factor analysis type, we (a) describe inconsistency results to emphasize that when p is comparable with n, some reduction in dimensionality is desirable before applying any PCA-type search for principal modes; and (b) establish consistency results to illustrate that the reduction in dimensionality can be effected by working in a basis in which the signals have a sparse representation. We will support these assertions with arguments based on statistical performance and computational cost.

We begin, however, with an illustration of our results on a simple constructed example. Consider a single component (or single factor) model in which, when viewed as p-dimensional column vectors,
$$x_i = y_i \rho + \sigma z_i, \qquad i = 1, \ldots, n, \qquad (2)$$
in which $\rho \in \mathbb{R}^p$ is the single component to be estimated, $y_i \sim N(0, 1)$ are iid Gaussian random effects, and $z_i \sim N_p(0, I)$ are independent noise vectors.
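For concreteness, here is a small Python sketch that draws data from model (2). It is illustrative only: the function and variable names are ours, and the two Gaussian bumps below merely stand in for the beta-mixture "three-peak" curve of Figure 1a.

```python
import numpy as np

def single_component_sample(rho, sigma=1.0, n=1024, seed=0):
    """Draw n observations x_i = y_i * rho + sigma * z_i from model (2).

    Illustrative sketch; not the authors' code."""
    rng = np.random.default_rng(seed)
    p = rho.shape[0]
    y = rng.standard_normal(n)              # iid N(0, 1) random effects y_i
    Z = rng.standard_normal((n, p))         # rows are the noise vectors z_i
    return y[:, None] * rho[None, :] + sigma * Z   # n x p data matrix

# A crude stand-in for the "three-peak" component of Figure 1 (not the beta mixture)
p = 2048
t = np.arange(1, p + 1) / p
rho = np.exp(-((t - 0.3) / 0.02) ** 2) + np.exp(-((t - 0.6) / 0.03) ** 2)
rho *= 10.0 / np.linalg.norm(rho)           # scale so that ||rho|| = 10
X = single_component_sample(rho)
```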

Figure 1a shows an example of $\rho$ with p = 2,048 and the vector $\rho_l = f(l/p)$, $l \in \{1, \ldots, p\}$, where f(t) is a mixture of beta densities on [0, 1], scaled so that $\|\rho\| = (\sum_{l=1}^{p} \rho_l^2)^{1/2} = 10$. Figure 1b shows a sample case from model (2): the random effect $y_i \rho$ is hard to discern in individual cases. Figure 1c shows the result of standard PCA applied to n = 1,024 observations from (2) with $\sigma = 1$, normalized to the same length as $\|\rho\|$. The effect of the noise remains clearly visible in the estimated principal eigenvector.

For functional data of this type, a regularized approach to PCA has been proposed by Rice and Silverman (1991) and Silverman (1996) (see also Ramsay and Silverman (1997) and references therein). Although smoothing can be incorporated in various ways, we illustrate the method discussed also in Ramsay and Silverman (1997, chap. 7), which replaces (1) with

$$\mathrm{var}(\xi^T x_i) / [\|\xi\|^2 + \lambda \|D^2 \xi\|^2], \qquad (3)$$
where $D^2 \xi$ is the $(p-2) \times 1$ vector of second differences of $\xi$, and $\lambda \in (0, \infty)$ is the regularization parameter.
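Maximizing (3) is a generalized symmetric eigenproblem: the maximizer is the leading eigenvector of the sample covariance S relative to the penalty matrix $I + \lambda (D^2)^T D^2$. A minimal Python sketch of that standard reduction (our illustration, not the software used for Figure 1) is:

```python
import numpy as np
from scipy.linalg import eigh

def smoothed_pca_first_pc(X, lam):
    """Leading principal component under criterion (3).

    Maximizing var(xi' x) / (||xi||^2 + lam * ||D2 xi||^2) is the leading
    generalized eigenvector of (S, I + lam * D2' D2). Illustrative sketch."""
    n, p = X.shape
    S = X.T @ X / n                          # sample covariance (mean assumed zero, as in model (2))
    D2 = np.diff(np.eye(p), n=2, axis=0)     # (p-2) x p second-difference matrix
    P = np.eye(p) + lam * D2.T @ D2          # penalty matrix
    vals, vecs = eigh(S, P)                  # generalized symmetric eigenproblem, ascending eigenvalues
    xi = vecs[:, -1]                         # eigenvector of the largest generalized eigenvalue
    return xi / np.linalg.norm(xi)
```

Note that this direct solve costs on the order of $p^3$ operations, which is one motivation for the sparse alternative developed later (compare Section 4.3).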

Figure 1d shows the estimated first principal component vector found by maximizing (3) with $\lambda = 10^{-12}$ and $\lambda = 10^{-6}$, respectively. Neither is really satisfactory as an estimate.

Iain M. Johnstone is Professor of Statistics and Biostatistics, Stanford University, Department of Statistics, 390 Serra Mall, Stanford, CA 94305 (E-mail: [email protected]). Arthur Yu Lu is Principal, Renaissance Technologies LLC, 600 Route 25A, East Setauket, NY 11733. This work was supported in part by grants NSF DMS 0072661, 0505303, and NIH EB R01 EB001988. The authors are grateful for helpful comments from Debashis Paul and the participants at the Functional Data Analysis Meeting at Gainesville, FL, January 9-11, 2003. Referees, an associate editor, and the editor contributed greatly to successive revisions. Zongming Ma updated the software for the latest revision and contributed numerous helpful remarks, as did Boaz Nadler by correspondence.


© 2009 American Statistical Association, Journal of the American Statistical Association, June 2009, Vol. 104, No. 486, Theory and Methods. DOI 10.1198/jasa.2009.0121


The first recovers the original peak heights, but fails fully to suppress the remaining baseline noise, whereas the second grossly oversmooths the peaks in an effort to remove all trace of noise. Further investigation with other choices of $\lambda$ confirms the impression already conveyed here: no single choice of $\lambda$ succeeds both in preserving peak heights and in removing baseline noise.

Figures 1e and f show the result of the adaptive sparse PCA algorithm to be introduced later, respectively without and with a final thresholding step. Both goals are accomplished quite satisfactorily after thresholding in this example.

This article is organized as follows. Section 2 reviews the inconsistency result Theorem 1. Section 3 sets out the sparsity assumptions and the consistency result (Theorem 2). Section 4 gives an illustrative algorithm, demonstrated on simulated and real data in Section 5. Proofs and their preliminaries are deferred to Section 6 and the Appendix.

2. INCONSISTENCY OF CLASSIC PCA

A basic element of our sparse PCA proposal is initial selection of a relatively small subset of the initial p variables before any PCA is attempted. In this section, we formulate some (in)consistency results that motivate this initial step.

Consider first the single component model (2). The presence of noise means that the sample covariance matrix $S = n^{-1}\sum_{i=1}^{n} x_i x_i^T$ will typically have min(n, p) nonzero eigenvalues. Let $\hat\rho$ be the eigenvector associated with the largest sample eigenvalue; with probability one it is uniquely determined up to sign.

One natural measure of the closeness of $\hat\rho$ to $\rho$ uses the "overlap" $R(\hat\rho, \rho)$, defined as the inner product between the vectors after normalization to unit length:
$$R(\hat\rho, \rho) = \langle \hat\rho / \|\hat\rho\|,\ \rho / \|\rho\| \rangle = \hat\rho^T \rho / (\|\hat\rho\|\,\|\rho\|).$$
Equivalently, $R(\hat\rho, \rho)$ is the cosine of the angle between $\hat\rho$ and $\rho$,
$$R(\hat\rho, \rho) = \cos a(\hat\rho, \rho), \qquad (4)$$
and we may also write this in terms of a distance metric
$$d(\hat\rho, \rho) = \sin a(\hat\rho, \rho). \qquad (5)$$
For asymptotic results, we will assume that there is a sequence of models (2) indexed by n. Thus, we allow $p_n$ and $\rho_n$ to depend on n, although the dependence will usually not be shown explicitly. (Of course, $\sigma$ might also be allowed to vary with n, but for simplicity it is assumed fixed.)
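In code, (4) and (5) amount to the cosine and sine of the angle between the two vectors; a minimal NumPy rendering (names ours) is:

```python
import numpy as np

def overlap(rho_hat, rho):
    """Cosine of the angle between rho_hat and rho, as in (4).
    Since rho_hat is determined only up to sign, |overlap| (or its square) is what matters."""
    return float(rho_hat @ rho / (np.linalg.norm(rho_hat) * np.linalg.norm(rho)))

def dist(rho_hat, rho):
    """Sine of the angle, the distance metric in (5)."""
    R = overlap(rho_hat, rho)
    return float(np.sqrt(max(0.0, 1.0 - R * R)))
```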

Our first interest is whether the estimate $\hat\rho$ is consistent as $n \to \infty$. This turns out to depend crucially on the limiting value
$$\lim_{n\to\infty} p_n/n = c. \qquad (6)$$
One setting in which this last assumption may be reasonable is when $p_n$ grows by adding finer scale wavelet coefficients of a fixed function as n increases.

We will also assume that the limiting "signal-to-noise ratio"
$$\lim_{n\to\infty} \|\rho_n\|^2/\sigma^2 = \omega > 0. \qquad (7)$$

Figure 1. True principal component, the "three-peak" curve. (a) The single component $\rho_l = f(l/p)$ where $f(t) = C\{0.7\,B(1{,}500, 3{,}000) + 0.5\,B(1{,}200, 900) + 0.5\,B(600, 160)\}$ and $B(a, b)(t) = [\Gamma(a+b)/(\Gamma(a)\Gamma(b))]\,t^{a-1}(1-t)^{b-1}$ denotes the beta density on [0, 1]. (b) A sample case drawn from model (2) with $\sigma = 1$, n = 1,024 replications in total, p = 2,048. (c) Sample principal component by standard PCA. (d) Sample principal component by smoothed PCA using $\lambda = 10^{-12}$ and $\lambda = 10^{-6}$. (e, f) Sample principal component by sparse PCA with weighting function w = 99.5%, and $\hat k = 142$ and 35, respectively, without and with a thresholding step.


Theorem 1. Assume that there are n observations drawn from the p-dimensional model (2). Assume that $p_n/n \to c$ and that $\|\rho_n\|^2/\sigma^2 \to \omega > 0$. Then almost surely
$$\lim_n R^2(\hat\rho, \rho) = R^2_\infty(\omega, c) = \frac{(\omega^2 - c)_+}{\omega^2 + c\,\omega}.$$
In particular, $R_\infty(\omega, c) < 1$ if and only if c > 0, and so $\hat\rho$ is a consistent estimator of $\rho$ if and only if $p/n \to 0$.

The situation is even worse if $\omega^2 \le c$, that is, if
$$\lim \frac{p}{n}\,\frac{\sigma^4}{\|\rho\|^4} \ge 1,$$
because $\hat\rho$ and $\rho$ are asymptotically orthogonal, and $\hat\rho$ ultimately contains no information at all regarding $\rho$.

In short, $\hat\rho$ is a consistent estimate of $\rho$ if and only if $p/n \to 0$. The noise does not average out if there are too many dimensions p relative to sample size n.

Theorem 1 is stated without proof, and we include some bibliographic remarks about its history. The inconsistency phenomenon was first observed in a number of papers in the learning theory literature in physics (see, for example, Biehl and Mietzner 1994; Watkin and Nadal 1994; Reimann, Van den Broeck, and Bex 1996; Hoyle and Rattray 2004). The limiting overlap formula of Theorem 1, derived in a related classification setting, appears in Biehl and Mietzner (1994). These works all use the nonrigorous "replica method" of statistical mechanics.

The first rigorous proof of inconsistency was given, in model (2), by the second author in his Ph.D. dissertation (Lu 2002) and in the initial version of this article (Johnstone and Lu 2004), available at arxiv.org. While this article was in publication review, subsequent rigorous proofs were published, along with other results, by Paul (2007) and Nadler (2008).

Remark. Paul (2007) also includes extensions to a considerably more general "multicomponent" or "spiked" covariance model that has attracted interest in the literature. Assume that we have n data vectors $x_i$, observed at p time points. Viewed as p-dimensional column vectors, this model assumes that
$$x_i = \mu + \sum_{j=1}^{m} y^j_i \rho^j + \sigma z_i, \qquad i = 1, \ldots, n. \qquad (8)$$
The mean function $\mu$ is assumed known. The vectors $\rho^j$, $j = 1, \ldots, m \le p$, are unknown and mutually orthogonal, with norms $\varrho_j(n) = \|\rho^j\|$ assumed decreasing: $\|\rho^1\| > \|\rho^2\| \ge \cdots \ge \|\rho^m\|$. The multipliers $y^j_i \sim N(0, 1)$ are all independent over $j = 1, \ldots, m$ and $i = 1, \ldots, n$, and the noise vectors $z_i \sim N_p(0, I)$ are independent among themselves and also of the random effects $\{y^j_i\}$. The population covariance matrix of the iid vectors $(x_i)$ is given by $\Sigma = \sum_{j=1}^{m} \rho^j \rho^{jT} + \sigma^2 I$. The vectors $\rho^j$ are the ordered principal component eigenvectors of the population covariance $\Sigma$. The asymptotics assume $p_n$, $m_n$, and $\varrho^j_n$ to be functions of n, and as $n \to \infty$,
$$\varrho_n = (\varrho^1_n, \ldots, \varrho^{m_n}_n, 0, \ldots, 0) \to \varrho = (\varrho_1, \ldots, \varrho_j, \ldots).$$
Paul (2007) shows that it continues to be true in the multicomponent model that $\hat\rho^1$ is consistent if and only if $p_n/n \to 0$. Of course, the inconsistency extends to cases with $p_n/n \to \infty$, because these models are even larger.

3. SPARSITY, SELECTION, AND CONSISTENCY

3.1 Sparsity

The inconsistency Theorem 1 asserts that ordinary PCA becomes confused in the presence of too many variables each with equal independent noise. Ideally, we might wish to reduce the dimensionality from p to a smaller number of variables k before beginning PCA. For this to succeed, the population principal components ($\rho$, in our model) should be essentially concentrated in the smaller number of dimensions, in a manner that can be discovered from the data.

To quantify this, assume that the data and the population principal components are represented, perhaps after transformation, in a fixed orthonormal basis $\{e_\nu\}$:
$$x_i = \sum_{\nu=1}^{p} x_{i,\nu} e_\nu, \quad i = 1, \ldots, n; \qquad \rho = \sum_{\nu=1}^{p} \rho_\nu e_\nu.$$
The index $\nu$ will always indicate the transform domain. In many situations, including all examples in this article, the data $x_i$ are initially collected in the time domain, for example in [0, 1], with $x_i = \{x_i(t_l)\}$, where $t_l = l/p$, $l = 1, \ldots, p$. In such cases, the basis vectors $e_\nu$ are also time domain functions $e_\nu(t_l)$.

The idea of concentration in a small number of variables can be captured by considering the ordered coefficient magnitudes $|\rho|_{(1)} \ge |\rho|_{(2)} \ge \cdots$. The intuitive idea of sparse representation is that, for relatively small k, the "energy" in the largest k coordinates, $\sum_{i=1}^{k} \rho^2_{(i)}$, is close to the total energy $\|\rho\|^2 = \sum_{\nu=1}^{p} \rho^2_\nu$. This can only be true if the magnitudes $|\rho|_\nu$ decay rather quickly. Thus, we assume for some 0 < q < 2 and C > 0 that
$$|\rho|_{(\nu)} \le C\nu^{-1/q}, \qquad \nu = 1, 2, \ldots. \qquad (9)$$

The condition q < 2 forces rapid decay, clearly the more rapid if q is smaller. This notion of "weak $\ell_q$ decay" is actually equivalent to the concentration of energy in the sums $\sum_{i=1}^{k} \rho^2_{(i)}$ just mentioned (see Donoho (1993) or Johnstone (2003, chap. 15)), but is more convenient for the results given here.
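For concreteness, a standard calculation (included here for illustration; it mirrors Step (ii) of the proof in Section 6.3) shows how (9) controls the energy omitted when only the k largest coordinates are kept:
$$\sum_{i > k} \rho^2_{(i)} \;\le\; C^2 \sum_{i > k} i^{-2/q} \;\le\; C^2 \int_k^\infty s^{-2/q}\,ds \;=\; \frac{q}{2-q}\,C^2\,k^{1-2/q},$$
so for q < 2 the omitted energy vanishes as k grows, the faster the smaller q is.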

The choice of orthonormal basis $\{e_\nu\}$ for sparse representation will depend on the dataset and problem domain, and thus is beyond the scope of this article. We remark, however, that for certain signal processing settings, wavelet bases can be natural for uncovering sparsity. When one-dimensional signals are smooth or have isolated discontinuities (either in the signal or its derivatives), then it can be shown (e.g., Mallat 1999) that the wavelet coefficients decay rapidly with frequency octave away from the discontinuities. In such cases, assumptions (9) are natural, as is shown in detail, for example, in the references cited earlier. We have therefore used wavelet bases for the examples in this article, but hasten to emphasize that our results apply equally to representations in other bases that might be better suited to, say, economic or genomic data.

3.2 Consistency

If the principal components have a sparse representation in basis $\{e_\nu\}$, then selection of an appropriate subset of variables should overcome the inconsistency problem described by Theorem 1.


In this direction, we establish a consistency result for sparse PCA. For simplicity, we use the single component model (2), and assume $\sigma^2$ is known, although this latter assumption could be removed by estimating $\sigma^2$ using (14), presented later.

We again assume a sequence of models (2) indexed by n. The unknown principal components $\rho = \rho_n$ should satisfy a "uniform sparsity condition": for some $q \in (0, 2)$ and $C < \infty$ independent of n, each $\rho_n$ satisfies decay condition (9). In addition, as in Theorem 1, the signal strength should stabilize: $\|\rho_n\| \to \varrho > 0$.

On the assumption of model (2), the sample variances
$$\hat\sigma^2_\nu = n^{-1}\sum_{i=1}^{n} x^2_{i\nu} \sim (\sigma^2 + \rho^2_\nu)\,\chi^2_{(n)}/n. \qquad (10)$$

Consequently, components $\nu$ with large values of $\rho_\nu$ will typically have large sample variances. We use here a simple selection rule
$$\hat I = \{\nu : \hat\sigma^2_\nu \ge \sigma^2(1 + \alpha_n)\}, \qquad (11)$$
with $\alpha_n = \alpha(n^{-1}\log(n \vee p))^{1/2}$ and $\alpha$ a sufficiently large positive constant; for example, $\alpha > \sqrt{12}$ would work for the proof. (By definition, $n \vee p = \max(p, n)$.)

Let $S_{\hat I} = (S_{\nu\nu'} : \nu \text{ and } \nu' \in \hat I)$ denote the sample covariance matrix of the selected variables. Applying PCA to $S_{\hat I}$ yields a principal eigenvector $(\hat\rho_\nu,\ \nu \in \hat I)$. Let $\hat\rho_{\hat I}$ denote the corresponding vector in the full p-dimensional space:
$$\hat\rho_{\hat I,\nu} = \begin{cases} \hat\rho_\nu & \nu \in \hat I \\ 0 & \nu \notin \hat I. \end{cases}$$
The sparsity assumption implies that $\hat\rho_{\hat I}$ is a consistent estimator of $\rho$.

Theorem 2. Assume that the single component model (2) holds with $\log(p \vee n)/n \to 0$ and $\|\rho_n\| \to \varrho > 0$ as $n \to \infty$. Assume for some $q \in (0, 2)$ and $C < \infty$, that for each n, $\rho_n$ satisfies the sparsity condition (9). Then the estimated principal eigenvector $\hat\rho_{\hat I}$ obtained via subset selection rule (11) is consistent:
$$a(\hat\rho_{\hat I}, \rho) \xrightarrow{a.s.} 0.$$
Here, a is the angle between $\hat\rho_{\hat I}$ and $\rho$ as in (4). Converting to an estimate $\hat\rho(t)$ in the time domain (as in Step 5 of Section 4), an equivalent statement of the result is that $\|\hat\rho/\|\hat\rho\| - s\,\rho/\|\rho\|\| \to 0$ in Euclidean norm, where $s = \mathrm{sign}(\langle \hat\rho, \rho\rangle)$.

The proof shows that consistency holds even under the weaker assumption $p = o(e^n)$. The result and proof could also be extended to multicomponent systems (8).

Armed with Theorems 1 and 2, let us return briefly to Figure 1. Based on Figure 1c, one might ask if simple thresholding of the standard PCA estimate, either in the original or wavelet domain, might suffice. Although this may work for high signal-to-noise settings, Theorem 1 suggests that such an approach is doomed if $\|\rho\|^4 < \sigma^4\,p/n$, because $\hat\rho$ is then asymptotically orthogonal to $\rho$. No such constraint applies in Theorem 2 (as long as $\lim \|\rho_n\| > 0$), as a result of the preliminary reduction of variables.

Bibliographic Remarks. Significant extensions of these results have been obtained since this article was first written. For example, working with the multicomponent model, Paul (2007) and Paul and Johnstone (2004) have derived lower bounds to the possible quality of estimation, and optimal estimators that attain the bounds (at the level of rates of convergence) in certain cases.

In a large p, n setting of partial least squares regression, Nadler and Coifman (2005) noted the importance of prior dimension reduction and suggested the use of wavelets to exploit sparsity.

Alternative methods for "sparsifying" PCA have also been proposed, based on connections with LASSO and $\ell_1$ penalized regression (Jolliffe, Trendafilov, and Uddin 2003; Zou, Hastie, and Tibshirani 2006; d'Aspremont et al. 2007), and compressive sensing (Fowler 2008). The study of consistency properties for these methods and comparison with those of this article is a natural topic for further research, with significant progress recently reported in Amini and Wainwright (2009).

3.3 Correct Selection Properties

A basic issue raised by the sparse PCA algorithm is whether the selected subset $\hat I$ in fact correctly contains the largest population variances, and only those. We formulate a result, based on large deviations of chi-squared variables, to address this issue. The considerations of this section hold for coefficients in any orthogonal basis.

For this section, assume that the diagonal elements of the sample covariance matrix $S = n^{-1}\sum_{1}^{n} x_i x_i^T$ have marginal chi-squared distributions; in other words,
$$\hat\sigma^2_\nu = S_{\nu\nu} \sim \sigma^2_\nu\,\chi^2_{(n)}/n, \qquad \nu = 1, \ldots, p. \qquad (12)$$

We will not require any assumptions on the joint distribution of $\{\hat\sigma^2_\nu\}$. The use of the index $\nu$ emphasizes the fact that we work in the transform domain.

Denote the ordered population coordinate variances by $\sigma^2_{(1)} \ge \sigma^2_{(2)} \ge \cdots$, and the ordered sample coordinate variances by $\hat\sigma^2_{(1)} \ge \hat\sigma^2_{(2)} \ge \cdots$. A desirable property is that $\hat I$ should, for fixed k and for suitable positive $\alpha_n$ small, (i) include all indices l in $I_{\mathrm{in}} = \{l : \sigma^2_l \ge \sigma^2_{(k)}(1 + \alpha_n)\}$ and (ii) exclude all indices l in $I_{\mathrm{out}} = \{l : \sigma^2_l \le \sigma^2_{(k)}(1 - \alpha_n)\}$. We will show that this, in fact, occurs with high probability if $\alpha_n = \alpha\sqrt{n^{-1}\log n}$, for appropriate $\alpha > 0$.

We say that a "false exclusion" (FE) occurs if any variable in $I_{\mathrm{in}}$ is missed:
$$FE = \bigcup_{l \in I_{\mathrm{in}}} \{\hat\sigma^2_l < \hat\sigma^2_{(k)}\},$$
whereas a "false inclusion" (FI) happens if any variable in $I_{\mathrm{out}}$ is spuriously selected:
$$FI = \bigcup_{l \in I_{\mathrm{out}}} \{\hat\sigma^2_l \ge \hat\sigma^2_{(k)}\}.$$

Theorem 3. Assume that the sample variances satisfy (12) and that a subset of size k of variables is sought. With $\alpha_n = \alpha n^{-1/2}(\log n)^{1/2}$, the probability of an inclusion error of either type is polynomially small:
$$P\{FE \cup FI\} \le 2pk\,(p \vee n)^{-\beta(\alpha)} + k\,(p \vee n)^{-(1 - 2\alpha_n)\beta(\alpha)},$$
with $\beta(\alpha) = [\alpha\sqrt{3}/(4 + 2\sqrt{3})]^2$.


The proof is in the Appendix. As an example, if $\alpha = 9$, then $\beta(\alpha) \doteq 4.36$. As a numerical illustration based on (A.4) (seen in the Appendix), if the subset size k = 50, while p = n = 1,000, then the chance of an inclusion error corresponding to a 25% difference in standard deviations (i.e., $\sqrt{1 + \alpha_n} = 1.25$ when $\alpha = 9$) is less than 5%.

4. AN ILLUSTRATIVE ALGORITHM

4.1 An Algorithm

The inconsistency results summarized in Section 2 emphasize the importance of reducing the number of variables before embarking on PCA. The results of Section 3 show that the existence of a sparse representation allows consistency to be recovered. The proof of Theorem 2 relies on two key steps: (i) sparsity allows $\rho$ to be approximated using a relatively small number of coefficients, and (ii) this smaller number of coefficients can be estimated by a reduced PCA.

We use these remarks as the basis for the sparse PCA algorithm to be described in general terms here. Note that the algorithm per se does not require the specification of a particular model, such as the single component system (2) or the multicomponent version (8). Given the widespread use of transform domain and feature selection techniques in the signal processing literature, as described in Section 1, we make no claims for originality; this section is included primarily to illustrate results of previous sections.

1. Compute Basis Coefficients. Given a basis $\{e_\nu\}$ for $\mathbb{R}^p$, compute coordinates $x_{i\nu} = (x_i, e_\nu)$ in this basis for each $x_i$:
$$x_i(t_l) = \sum_{\nu=1}^{p} x_{i\nu} e_\nu(t_l), \qquad i = 1, \ldots, n, \quad l = 1, \ldots, p.$$

2. Subset. Calculate the sample variances $\hat\sigma^2_\nu = \widehat{\mathrm{var}}(x_{i\nu})$. Let $\hat I \subset \{1, \ldots, p\}$ denote the set of indices $\nu$ corresponding to the largest k variances.

3. Reduced PCA. Apply standard PCA to the reduced dataset $\{x_{i\nu},\ \nu \in \hat I,\ i = 1, \ldots, n\}$ on the selected k-dimensional subset, obtaining eigenvectors $\hat\rho^j = (\hat\rho^j_\nu)$, $j = 1, \ldots, k$, $\nu \in \hat I$.

4. Thresholding. Filter out noise in the estimated eigenvectors by hard thresholding,
$$\tilde\rho^j_\nu = \eta_H(\hat\rho^j_\nu, \delta_j).$$

5. Reconstruction. Return to the original signal domain, using the given basis $\{e_\nu\}$, and set
$$\hat\rho^j(t_l) = \sum_{\nu \in \hat I} \tilde\rho^j_\nu\, e_\nu(t_l).$$

Discussion: Steps 2 and 3. An important computational point that is implicit in Steps 2 and 3 is that we only compute the variances $S_{\nu,\nu}$ for the p transform domain variables. Off-diagonal covariance elements $S_{\nu,\nu'}$ are only computed for $\nu, \nu'$ in the reduced set $\hat I$ of size k. The reduced PCA size k may be specified in advance or chosen based on the data (see Section 4.2).

Discussion: Step 4. Although not formally studied in the theory in the preceding section, the thresholding step is found in our examples to yield a useful further filtering of noise. For a scalar value y, hard thresholding is given, as usual, by $\eta_H(y, \delta) = y\,I\{|y| \ge \delta\}$. An alternative is soft thresholding $\eta_S(y, \delta) = \mathrm{sgn}(y) \times \max(|y| - \delta, 0)$, but hard thresholding has been used here because it preserves the magnitude of retained signals.

There is considerable freedom in the choice of thresholds $\delta_j$. Trial and error is always possible, of course. Further, more formal choices are suggested by analogy with the signal in Gaussian noise setting, $\delta_j = \hat\tau_j\sqrt{2\log k}$ (compare with, for example, Donoho et al. (1995)), and, for this article, we use this choice of $\delta_j$. Here, $\hat\tau_j$ is an estimate of the noise level in $\{\hat\rho^j_\nu,\ \nu \in \hat I\}$; in this article, estimate (16) is used. Another possibility is to set $\hat\tau_j = \mathrm{MAD}\{\hat\rho^j_\nu,\ \nu \in \hat I\}/0.6745$, where MAD denotes "median absolute deviation."

The consistency result Theorem 2 applies to this algorithm, with subset selection rule (11), and without the thresholding Step 4. Although the thresholding step helps in the examples to follow, theoretical analysis to elucidate its specific advantages is beyond the scope of this article.

Terminology. We will refer to the general procedure specified by Steps 1 through 5 as "sparse PCA." With the specific data-based choice of k proposed as method (b) in Section 4.2, with w = 0.995, we use the term "adaptive sparse PCA" (ASPCA).

In the rest of this section, we amplify and illustrate various aspects of this algorithm. Given eigenvalue and eigenvector routines, it is not difficult to code. For example, a MATLAB package ASPCALab that includes the algorithms and files that produce the figures in this article is available at www-stat.stanford.edu/~imj/. To exploit wavelet bases, it makes use of the open-source library WaveLab available at www-stat.stanford.edu/~wavelab/.
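For readers who want a self-contained illustration, the following Python sketch mirrors Steps 2 through 5 under simplifying assumptions (basis coefficients already computed, k fixed in advance, a single component retained, zero mean as in model (2)). It is our illustrative reimplementation, not the ASPCALab code referenced above.

```python
import numpy as np

def sparse_pca_first_pc(coeffs, k, threshold=None):
    """Illustrative sketch of Steps 2-5 of the sparse PCA algorithm.

    coeffs    : (n, p) array of basis coefficients x_{i, nu} (Step 1 already done)
    k         : number of coordinates to retain (Step 2)
    threshold : optional hard threshold delta applied to the eigenvector (Step 4)

    Returns the estimated leading principal component in the transform domain,
    as a length-p vector that is zero outside the selected subset.
    """
    n, p = coeffs.shape
    variances = np.mean(coeffs ** 2, axis=0)         # Step 2: sample variances as in (10), zero mean assumed
    I = np.argsort(variances)[::-1][:k]              # indices of the k largest variances

    reduced = coeffs[:, I]                           # Step 3: reduced PCA on the k x k covariance
    S_I = reduced.T @ reduced / n
    eigvals, eigvecs = np.linalg.eigh(S_I)           # ascending eigenvalues
    rho_I = eigvecs[:, -1]                           # leading eigenvector on the subset

    if threshold is not None:                        # Step 4: hard thresholding
        rho_I = np.where(np.abs(rho_I) >= threshold, rho_I, 0.0)

    rho_hat = np.zeros(p)                            # Step 5: embed back into R^p; applying the
    rho_hat[I] = rho_I                               # inverse basis transform then recovers rho(t)
    return rho_hat
```

In practice, Step 1 and the inverse transform in Step 5 would be supplied by a wavelet library (for example, WaveLab in MATLAB or PyWavelets in Python), and k could instead be chosen by the data-based rules of Section 4.2.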

4.2 Data-Based Choice of k

When the size k of the set of selected variables $\hat I$ is itself determined from the data, we write $\hat k$ for $|\hat I|$. Here are two possibilities, both based on the sample variances $\hat\sigma^2_\nu$ of Step 2 presented earlier:

(a) Choose coordinates with variance exceeding the estimated noise level by a specified fraction $\alpha_n$:
$$\hat I = \{\nu : \hat\sigma^2_\nu \ge \hat\sigma^2(1 + \alpha_n)\}.$$
This choice was considered in Section 3.

(b) As motivation for the second method, recall that we hope that the selected set of variables $\hat I$ is both small in cardinality and also captures most of the variance of the population principal components, in the sense that the ratio
$$\sum_{\nu \in \hat I} \rho^2_\nu \Big/ \sum_\nu \rho^2_\nu \qquad (13)$$
is close to one for each of the leading population principal components in $\{\rho^1, \ldots, \rho^m\}$. Now let $\chi^2_{(n),\alpha}$ denote the upper $\alpha$ percentile of the $\chi^2_{(n)}$ distribution. If all coordinates were pure noise, one might expect the ordered sample variances


$\hat\sigma^2_{(\nu)}$ to be close to $(n-1)^{-1}\hat\sigma^2\,\chi^2_{(n-1),\,\nu/(p+1)}$. Define the excess over these percentiles by
$$h^2_{(\nu)} = \max\{\hat\sigma^2_{(\nu)} - (n-1)^{-1}\hat\sigma^2\,\chi^2_{(n-1),\,\nu/(p+1)},\ 0\},$$
and for a specified fraction $w(n) \in (0, 1)$, set
$$\hat I = \Big\{\nu : \sum_{\nu=1}^{\hat k} h^2_{(\nu)} \ge w(n)\sum_\nu h^2_{(\nu)}\Big\},$$
where $\hat k$ is the smallest index k for which the inequality holds, and the somewhat sloppy notation refers to the indices $\nu$ that contribute to the left-hand sum of excesses $h^2_{(\nu)}$. This second method has been used for the figures in this article, typically with w(n) = 0.995.

Estimation of $\sigma$. If the population principal components $\rho^j$ have a sparse representation in basis $\{e_\nu\}$, then we may expect that in most coordinates $\nu$, $\{x_{i\nu}\}$ will consist largely of noise. This suggests a simple estimate of the noise level, on the assumption that the noise level is the same in all coordinates; namely,
$$\hat\sigma^2 = \mathrm{median}(\hat\sigma^2_\nu). \qquad (14)$$
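A short Python sketch of method (b) together with the noise estimate (14); the chi-squared percentiles are taken from scipy.stats, and all names are ours, so treat this as an illustration rather than the paper's implementation:

```python
import numpy as np
from scipy.stats import chi2

def choose_k(variances, n, w=0.995):
    """Data-based choice of k via the excess-over-percentile rule of Section 4.2(b)."""
    p = variances.size
    sigma2_hat = np.median(variances)                      # noise estimate (14)
    order = np.argsort(variances)[::-1]                    # coordinates by decreasing sample variance
    sorted_var = variances[order]
    # Null reference: upper nu/(p+1) percentiles of sigma^2 * chi^2_{n-1} / (n-1)
    quantile_levels = 1.0 - np.arange(1, p + 1) / (p + 1.0)
    null_var = sigma2_hat * chi2.ppf(quantile_levels, df=n - 1) / (n - 1)
    excess = np.maximum(sorted_var - null_var, 0.0)        # h^2_(nu)
    cum = np.cumsum(excess)
    k_hat = int(np.searchsorted(cum, w * excess.sum()) + 1)  # smallest k reaching fraction w
    return k_hat, order[:k_hat], sigma2_hat
```

With w = 0.995 this reproduces the flavor of the adaptive choice used for the figures, although exact numerical agreement with the authors' MATLAB implementation is not claimed.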

4.3 Computational Complexity

One estimates the cost of sparse PCA by examining its main steps:

1. This depends on the choice of basis. In the wavelet case, no more than $O(np\log p)$ operations are needed (see, for example, Mallat (1999)).
2. Evaluate and sort the sample variances and select $\hat I$: $O(np + p\log p)$.
3. Compute a $k \times k$ matrix and its eigendecomposition: $O(k^3 + k^2 n)$.
4. Apply thresholding to each vector in $\hat I$: $O(k^2)$; and estimate $\hat\sigma^2$ and $\widehat{\|\rho\|^2}$: $O(p)$.
5. Reconstruct eigenvectors in the original sample space: $O(k^2 p)$.

Hence, the total cost of sparse PCA is $O(np\log p + k^2\max(p, n))$. Both standard and smoothed PCA need at least $O((\min(p, n))^3)$ operations. Therefore, if we can find a sparse basis such that $k/p \to 0$, then under the assumption that $p/n \to c$ as $n \to \infty$, the total cost of sparse PCA is $o(p^3)$. We will see in the examples that follow that the savings can be substantial.

5. EXAMPLES

5.1 Simulated Examples

The two examples in this section are both motivated by functional data with localized features. The first is a three-peak principal component depicted in Figure 1, and already discussed in Section 1. The second example, Figure 2, has an underlying first principal component composed of step functions. For both examples, the dimension of data vectors is p = 2,048, the number of observations n = 1,024, and the noise level $\sigma = 1$. However, the amplitudes of $\rho$ differ, with $\|\rho\| = 10$ for the "three-peak" function and $\|\rho\| = 25$ for the "step" function. The corresponding square root signal-to-noise ratios $\sqrt{\omega} = \varrho/\sigma$ (Theorem 1) are 10 and 25, respectively.

Figure 1c and Figure 2c, respectively, show the sample principal components obtained by using standard PCA. Although standard PCA does capture the peaks and steps, it retains significant noise in the flat regions of the function. The corresponding Figures 1d and 2d show results from smoothed PCA with the indicated values of the smoothing parameter. Just as for the three-peak curve discussed earlier, in the case of the step function, none of the three estimates simultaneously captures both jumps and flat regions well.

Figures 1e, f and Figures 2e, f present the principal components obtained by sparse PCA without and with the thresholding step, respectively. The WaveLab wavelet bases Symmlet and Haar are used for the "three-peak" and "step" functions, respectively. Using method (b) of Section 4.2 with w = 99.5%, the subset step selects $\hat k = 142$ and 361 for the "three-peak" curve and "step" function, respectively. The sample principal component in Figure 1f is clearly superior to the other sample principal components in Figure 1. Although the principal component function in the step case appears to be only slightly better than the solid, red, smoothed PCA estimate, we will see shortly that its squared error is reduced by about 75%.

Table 1 compares the accuracy of the three PCA algorithms, using average squared error (ASE) defined as $\mathrm{ASE} = p^{-1}\|\hat\rho - \rho\|^2$. ($\hat\rho$ was first normalized to have length $\|\rho\|$ before computing ASE.) The running time is average CPU time over 50 iterations, used by MATLAB on an Intel Core Duo CPU T2400 at 1.83 GHz.

Figure 3 presents boxplots of ASE for the 50 iterations. Overall, sparse PCA with thresholding always gives the best result for the "step" curve. For the "three-peak" function, in only a few iterations (~15%) does sparse PCA generate larger error than smoothed PCA with $\lambda = 10^{-12}$. On average, ASE using sparse PCA is superior to the other methods by a large margin. Overall, Table 1 and Figure 3 show that sparse PCA leads to the most accurate principal component (within the techniques considered) while using much less CPU time than other PCA algorithms.

5.2 Noise Level in the Single Component Model

Anderson (1963) obtained the asymptotic distribution of $\sqrt{n}(\hat\rho - \rho)$ for fixed p; in particular,
$$\mathrm{var}\{\sqrt{n}(\hat\rho_\nu - \rho_\nu)\} \to \frac{(\|\rho\|^2 + \sigma^2)\,\sigma^2}{\|\rho\|^4}(1 - \rho^2_\nu), \qquad (15)$$
as $n \to \infty$. Here, p increases with n, but one can nevertheless use (15) as a heuristic for estimating the variance $\tau$ needed for thresholding. Because the effect of thresholding is to remove noise in small coefficients, setting $\rho_\nu$ to 0 in (15) suggests
$$\tau \approx \frac{1}{\sqrt{n}}\,\frac{\sigma\sqrt{\|\rho\|^2 + \sigma^2}}{\|\rho\|^2}. \qquad (16)$$

Neither $\|\rho\|^2$ nor $\sigma^2$ in (16) is known, but they can be estimated by using the information contained in the sample covariance matrix S of (10). Hence, in the single component model,


$$\|\rho\|^2 = \sum_{1}^{p} \rho^2_\nu = \sum_{1}^{p} \{E(\hat\sigma^2_\nu) - \sigma^2\}.$$
If $\rho_\nu$ is a sparse representation of $\rho$, then most coefficients will be small, suggesting the estimate (14) for $\sigma^2$. In turn, this suggests as an estimate
$$\widehat{\|\rho\|^2} = \max\Big\{\sum_{1}^{p}\big[\hat\sigma^2_\nu - \mathrm{median}(\hat\sigma^2_\nu)\big],\ \mathrm{median}(\hat\sigma^2_\nu)\sqrt{p/n}\Big\}. \qquad (17)$$
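Putting (14), (16), and (17) together gives a plug-in threshold $\delta_j = \hat\tau\sqrt{2\log k}$. An illustrative Python sketch (names ours, under the single component model) is:

```python
import numpy as np

def threshold_estimate(variances, n, k):
    """Plug-in threshold delta = tau * sqrt(2 log k) combining (14), (16), and (17).

    variances : sample variances of the transform-domain coordinates
    n         : number of observations
    k         : number of selected coordinates (size of the reduced PCA)
    """
    p = variances.size
    sigma2_hat = np.median(variances)                          # (14)
    rho_norm2_hat = max(np.sum(variances - sigma2_hat),        # (17)
                        sigma2_hat * np.sqrt(p / n))
    tau = (np.sqrt(sigma2_hat) * np.sqrt(rho_norm2_hat + sigma2_hat)
           / (rho_norm2_hat * np.sqrt(n)))                     # (16)
    return tau * np.sqrt(2.0 * np.log(k))                      # delta_j
```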

5.3 ECG Example

We offer a brief illustration of sparse PCA as applied to some electrocardiogram (ECG) data kindly provided by Jeffrey Froning and Victor Froelicher in the cardiology group at Palo Alto Veterans Affairs Hospital. Beat sequences, typically about 60 cycles in length, were obtained from some 15 healthy patients; we selected two for the preliminary illustrations here. Individual beats are notable for features such as the sharp spike ("QRS complex") and the subsequent lower peak ("T wave"), seen, for example, in Figures 5a and d. The presence of these local features, of differing spatial scales, suggests the use of wavelet bases for efficient representation. Traditional ECG analysis focuses on averages of a series of beats. If one were to look instead at beat-to-beat "variation," one might expect these local features to play a significant role in the principal component eigenvectors.

Considerable preprocessing is routinely done on ECG signals before the beat averages are produced for physician use. Here we summarize certain steps taken with our data, in collaboration with Jeff Froning, preparatory to the PCA analysis. The most important feature of an ECG signal is the Q-R-S complex: the maximum occurs at the R wave, as seen, for example, in Figure 5a. Therefore, we define the length of one cycle as the gap between two adjacent maxima of R waves.

(i) "Baseline wander" is observed in many ECG datasets (compare with Figure 4, with a caption that summarizes the adjustment used).

Figure 2. Comparison of the sample principal components for a step function. (a) True principal component $\rho_l$. (b) A sample case drawn from model (2) with $\sigma = 1$, p = 2,048. (c) Sample principal component by standard PCA. (d) Sample principal component by smoothed PCA using $\lambda = 10^{-12}$ and $10^{-6}$. (e, f) Sample principal component by sparse PCA with weighting function w = 99.5%, and $\hat k = 361$ and 233 without and with a thresholding step, respectively.

Table 1. Accuracy and efficiency comparison

                          Standard PCA   Smoothed λ = 10^-12   Smoothed λ = 10^-6   Sparse PCA   Sparse + Threshold PCA
ASE (three-peak)          6.9e-04        2.5e-04               4.1e-03              4.1e-04      2.3e-04
Time (three-peak) (sec)   81.9           42.7                  40.8                 3.2          3.0
ASE (step)                6.9e-04        1.3e-03               8.4e-03              3.7e-04      3.0e-04
Time (step) (sec)         80.8           42.5                  40.8                 1.7          1.5


(ii) Because pulse rates vary even on short timescales, the duration of each heartbeat cycle may vary as well. We use linear interpolation to equalize the duration of each cycle, and for convenience in using wavelet software, discretize to $512 = 2^9$ sample points in each cycle (see the sketch following this list).
(iii) Because of the importance of the R wave, the horizontal positions of the maxima are registered at the 150th position in each cycle.
(iv) The ECG data vector is converted into an $n \times p$ data matrix, where n is the number of observed cycles and p = 512.
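A minimal sketch of one plausible reading of steps (ii) and (iii), resampling each beat to p = 512 samples with the R maximum registered near index 150. The authors' actual preprocessing (including the onset-based baseline removal of Figure 4) is not reproduced here, and all names are ours:

```python
import numpy as np

def cycles_matrix(signal, r_peaks, p=512, r_pos=150):
    """Build an n x p data matrix: each cycle resampled to p samples with the
    R-wave maximum near index r_pos. One plausible reading of steps (ii)-(iv);
    baseline wander is assumed already removed."""
    rows = []
    for prev, peak, nxt in zip(r_peaks[:-2], r_peaks[1:-1], r_peaks[2:]):
        pre = signal[prev:peak + 1]      # segment ending at the R maximum
        post = signal[peak:nxt]          # segment starting at the R maximum
        pre_rs = np.interp(np.linspace(0, len(pre) - 1, r_pos),
                           np.arange(len(pre)), pre)
        post_rs = np.interp(np.linspace(0, len(post) - 1, p - r_pos),
                            np.arange(len(post)), post)
        rows.append(np.concatenate([pre_rs, post_rs]))
    return np.vstack(rows)
```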

5.4 PCA Analysis

Figures 5a and d show the mean curves for two ECG samples in blue. The numbers of observations n (i.e., numbers of heartbeats recorded) are 66 and 61, respectively. The first sample principal components for these two sample sets ("patients") are plotted in Figures 5c and f, with the upper/red curves from standard PCA and the lower/blue curves from sparse PCA, with thresholds chosen subjectively. In both cases, there are two sharp peaks in the vicinity of the QRS complex. The first peak occurs shortly before the 150th position, where all the maxima of R waves are aligned, and the second peak, which has an opposite sign, occurs shortly thereafter. (The lower/blue curves have been offset vertically by -0.4 and -0.2 in Figures 5c and f, respectively, for legibility in monochrome.)

The standard PCA curve in Figure 5c (upper/red) is less noisy than that in Figure 5f (upper/red), even allowing for the difference in vertical scales. Using (14), $\hat\sigma^2_1 = 24.59$ and $\hat\sigma^2_2 = 80.77$, whereas the magnitudes of the two mean sample curves are very similar.

The sparse PCA curves (lower/blue) are smoother than the standard PCA ones (upper/red), especially in Figure 5f, where the signal-to-noise ratio is lower. On the other hand, the upper/red and lower/blue curves match quite well at the two main peaks. Sparse PCA has reduced noise in the sample principal component in the baseline while keeping the main features.

There is a notable difference between the estimated principal components for the two patients. In the first case, the principal component is concentrated around the R-wave maximum, and the effect is to accelerate or decelerate the rise (and fall) of this peak from baseline in a given cycle. This is more easily seen by comparing plots of $\bar x + 2\hat\rho$ (green) with $\bar x - 2\hat\rho$ (red), shown over a magnified part of the cycle in Figure 5b. In the second patient, the bulk of the energy of the principal component is concentrated in a level shift in the part of the cycle starting with the ST segment. This can be interpreted as beat-to-beat fluctuation in baseline; because each beat is anchored at 0 at the onset point, there is less fluctuation on the left side of the peak. This is particularly evident in Figure 5e. There is, again, a slight acceleration/deceleration in the rise to

Figure 3. Side-by-side boxplots of ASE from 50 iterations using different algorithms for the "three-peak" function (a) and for the "step" function (b).

Figure 4. Baseline wander is observed in many ECG datasets. One common remedy for this problem is to deduct a piecewise linear baseline from the signal, the linear segment (dashed line) between two beats being determined from two adjacent onset points. The onset positions of R waves are shown by asterisks. Their exact locations vary for different patients and, as seen here, even for adjacent R waves. The locations are determined manually in this example. To reduce the effect of noise, the values of onset points are calculated by an average of 5 points close to the onset position.


the R-wave peak, less pronounced in the first case, and also less evident in the fall.

Obvious questions raised by this illustrative example include the nature of effects that may have been introduced by the preprocessing steps, notably the baseline removal anchored at onset points and the alignment of R-wave maxima. Clearly, some conventions must be adopted to create rectangular data matrices for principal component analysis, but detailed analysis of these issues must await future work.

To summarize, sparse PCA has reduced noise in the sample principal component in the baseline while keeping the main features, and in addition, sparse PCA uses less than 10% of the computing time used by standard PCA in these examples.

6. PROOF OF THEOREM 2

We first establish some notation and recall some pertinent matrix results (e.g., Golub and Van Loan 1996). Norms on vectors are always Euclidean 2-norms: $\|x\| = (\sum_\nu x^2_\nu)^{1/2}$. Define the 2-norm of a rectangular matrix by $\|A\| = \sup\{\|Ax\| : \|x\| = 1\}$. If A is real and symmetric, then $\|A\| = \lambda_{\max}(A)$. If $A_{p\times p}$ is partitioned,
$$A = \begin{pmatrix} a & b^T \\ b & C \end{pmatrix},$$
where b is $(p-1)\times 1$, then by setting $x = (1, 0^T)^T$, one finds that $\|b\| \le \|A\|$. The matrix $B = \rho u^T + u\rho^T$ has at most two nonzero eigenvalues, given by
$$\lambda = (t \pm 1)\,\|\rho\|\,\|u\|, \qquad t = \rho^T u/(\|\rho\|\,\|u\|). \qquad (18)$$

6.1 Perturbation Bounds

Suppose that a symmetric matrix $A_{p\times p}$ has unit eigenvector $q_1$. We wish to bound the effect of a "symmetric" perturbation $E_{p\times p}$ on $q_1$. The following result (Golub and Van Loan 1996, theorem 8.1.10) constructs a unit eigenvector $\hat q_1$ of $A + E$ and bounds its distance from $q_1$ in terms of $\|E\|$. Let $Q_{p\times p} = [q_1\ Q_2]$ be an orthogonal matrix containing $q_1$ in the first column, and partition conformally
$$Q^T A Q = \begin{pmatrix} \lambda & 0 \\ 0 & D_{22} \end{pmatrix}, \qquad Q^T E Q = \begin{pmatrix} \epsilon & e^T \\ e & E_{22} \end{pmatrix},$$
where $D_{22}$ and $E_{22}$ are both $(p-1)\times(p-1)$. Suppose that $\lambda$ is separated from the set of eigenvalues of $D_{22}$, denoted $\lambda(D_{22})$; set
$$d = \min_{\mu \in \lambda(D_{22})} |\lambda - \mu|.$$
If $\|E\| \le d/5$, then there exists $r \in \mathbb{R}^{p-1}$ satisfying
$$\|r\| \le (4/d)\|e\| \le (4/d)\|E\| \qquad (19)$$
such that $\hat q_1 = (1 + r^T r)^{-1/2}(q_1 + Q_2 r)$ is a unit eigenvector of $A + E$. Moreover, with d as in (5),
$$d(\hat q_1, q_1) \le (4/d)\|e\|.$$

Figure 5. ECG examples. (Note: Colors refer to online version; "vertical offset" to monochrome print version for clarity.) (a) Mean curve for ECG sample 1, n = 66, in blue, along with $\bar x + 2\hat\rho$ (green) and $\bar x - 2\hat\rho$ (red), with $\hat\rho$ being the estimated first principal component from sparse PCA (see also (c)). (b) Magnified section of (a) over the range 120 to 220. (c) First principal components for sample 1 from standard (upper/left y-axis) and sparse PCA (lower/right y-axis; vertical offset, -0.4); threshold, 0.0044. (d-f) Corresponding plots for sample 2, n = 61. Vertical offset for sparse PCA in (f) is -0.2 and threshold is 0.0075.


Let us remark that because $\|e\| \le \|E\|$, we have $\|r\| \le 1$ and
$$\hat q_1^T q_1 = (1 + \|r\|^2)^{-1/2} \ge 1/\sqrt{2}. \qquad (20)$$
Suppose now that $q_1$ is the eigenvector of A associated with the "principal" eigenvalue $\lambda_1(A)$. Here and later, $\lambda_i(A)$ denotes the ith largest eigenvalue of A. We verify that, using the preceding conditions, $\hat q_1$ is also the principal eigenvector of $A + E$. In other words, if $(A + E)\hat q_1 = \lambda^*\hat q_1$, then, in fact, $\lambda^* = \lambda_1(A + E)$.

To show this, we verify that $\lambda^* > \lambda_2(A + E)$. Take inner products with $q_1$ in the eigenequation for $\hat q_1$:
$$\lambda^* q_1^T \hat q_1 = q_1^T A \hat q_1 + q_1^T E \hat q_1. \qquad (21)$$
Because A is symmetric, $q_1^T A = \lambda_1(A) q_1^T$. Trivially, we have $q_1^T E \hat q_1 \ge -\|E\|$. Combine these remarks with (20) to get $\lambda^* \ge \lambda_1(A) - \sqrt{2}\,\|E\|$.

Now $d = \lambda_1(A) - \lambda_2(A)$ and, because from the minimax characterization of eigenvalues (e.g., Golub and Van Loan 1996, p. 396), $\lambda_2(A + E) \le \lambda_2(A) + \|E\|$, we have
$$\lambda^* - \lambda_2(A + E) \ge d - (1 + \sqrt{2})\|E\| \ge d[1 - (1 + \sqrt{2})/5] > 0.$$

6.2 Some Limit Theorems

Collect the noise vectors into a matrix $Z_{p\times n} = [z_1 \ldots z_n]$. We turn to properties of the noise matrix Z. The cross-products matrix $ZZ^T$ has a standard p-dimensional Wishart $W_p(n, I)$ distribution with n degrees of freedom and identity covariance matrix (see, for example, Muirhead 1982, p. 82). Thus, the matrix $C = \sigma^2(n^{-1}ZZ^T - I_p)$ is simply a scaled and recentered Wishart matrix.

Geman (1980) and Silverstein (1985), respectively, established almost sure limits for the largest and smallest eigenvalues of a $W_p(n, I)$ matrix as $p/n \to c \in [0, \infty)$, from which follows:
$$\lambda_1(C), \lambda_p(C) \to \sigma^2(c \pm 2\sqrt{c}). \qquad (22)$$
(Although the results in the articles cited are for $c \in (0, \infty)$, the results are easily extended to c = 0 by simple coupling arguments.)

Suppose, in addition, that y is an $n \times 1$ vector with independent N(0, 1) entries, which are also independent of Z. Conditioned on y, the vector Zy is distributed as $N_p(0, \|y\|^2 I)$. Because Z is independent of y, we conclude that
$$Zy \stackrel{D}{=} \chi_{(n)}\chi_{(p)} U_p, \qquad (23)$$
where $\chi^2_{(n)}$ and $\chi^2_{(p)}$ denote chi-squared variables and $U_p$ a vector uniform on the surface of the $(p-1)$-dimensional unit sphere $S^{p-1}$ in $\mathbb{R}^p$, and all three variables $\chi_{(n)}$, $\chi_{(p)}$, and $U_p$ are independent.

Now let $u = \sigma n^{-1} Zy$. From (23) we have, as $p/n \to c \in [0, \infty)$,
$$\|u\|^2 \stackrel{D}{=} \sigma^2 n^{-2}\chi^2_{(n)}\chi^2_{(p)} \xrightarrow{a.s.} \sigma^2 c. \qquad (24)$$

6.3 Proof of Theorem 2

Outline. Recall from (11) that, given $\alpha_n = \alpha(n^{-1}\log(n \vee p))^{1/2}$, the selected subset of variables $\hat I$ is defined by $\hat I = \{\nu : \hat\sigma^2_\nu \ge \sigma^2(1 + \alpha_n)\}$ and that the estimated principal eigenvector based on $\hat I$ is written $\hat\rho_{\hat I}$. We define a vector $\rho_{\hat I}$ with coordinates $(\rho_{\hat I,\nu})$ by selecting coordinates from $\rho = (\rho_\nu)$ according to membership in $\hat I$:
$$\rho_{\hat I,\nu} = \begin{cases} \rho_\nu & \nu \in \hat I \\ 0 & \nu \notin \hat I. \end{cases}$$
We will use the triangle inequality $d(\hat\rho_{\hat I}, \rho) \le d(\hat\rho_{\hat I}, \rho_{\hat I}) + d(\rho_{\hat I}, \rho)$ to show that $\hat\rho_{\hat I} \to \rho$. There are three main steps.

(i) Construct deterministic sets of indices $I^\pm_n = \{\nu : \rho^2_\nu \ge \sigma^2\alpha^\mp\alpha_n\}$, with constants $\alpha^\mp$ to be determined later, which bracket $\hat I$ almost surely as $n \to \infty$:
$$I^-_n \subseteq \hat I \subseteq I^+_n \quad \text{w.p. } 1. \qquad (25)$$
(ii) The uniform sparsity, combined with $\hat I^c \subseteq (I^-_n)^c$, is used to show that $d(\rho_{\hat I}, \rho) \xrightarrow{a.s.} 0$.
(iii) The containment $\hat I \subseteq I^+_n$, along with $|I^+_n| = o(n)$, shows that $d(\hat\rho_{\hat I}, \rho_{\hat I}) \xrightarrow{a.s.} 0$.

Details. Step (i). We first obtain a bound on the cardinality of $I^\pm_n$ using the sparsity condition (9). Using (9), $|\rho|_{(\nu)} \le C\nu^{-1/q}$, and so
$$|I^\pm_n| \le |\{\nu : C^2\nu^{-2/q} \ge \sigma^2\alpha^\mp\alpha_n\}| \le C^q/(\sigma^2\alpha^\mp\alpha_n)^{q/2} = o(n^{1/2}).$$
Let $\sigma^2_\nu = \sigma^2 + \rho^2_\nu$. Turning to the bracketing relations (25), we first remark that $\hat\sigma^2_\nu \stackrel{D}{=} \sigma^2_\nu\chi^2_{(n)}/n$, and when $\nu \in I^-_n$,
$$\sigma^2_\nu = \sigma^2(1 + \rho^2_\nu/\sigma^2) \ge \sigma^2(1 + \alpha^+\alpha_n).$$
Using the definitions of $\hat I$ and writing $\bar M_n$ for a random variable with the distribution of $\chi^2_{(n)}/n$, we have
$$P^-_n = P(I^-_n \not\subseteq \hat I) \le \sum_{\nu \in I^-_n} P\{\hat\sigma^2_\nu < \sigma^2(1 + \alpha_n)\} \le |I^-_n|\,P\{\bar M_n < (1 + \alpha_n)/(1 + \alpha^+\alpha_n)\}.$$
We apply (A.2) from the Appendix with $\epsilon_n = (\alpha^+ - 1)\alpha_n/(1 + \alpha^+\alpha_n)$, and for n large and $\alpha'$ slightly smaller than $\alpha^2$, $n\epsilon^2_n > (\alpha^+ - 1)^2\alpha'\log(n \vee p)$, so that
$$P^-_n \le c\,n^{1/2}\exp\{-n\epsilon^2_n/4\} \le c\,n^{1/2 - \alpha''_+},$$
with $\alpha''_+ = (\alpha^+ - 1)^2\alpha'/4$. If $\alpha \ge \sqrt{12}$, then $\alpha''_+ \ge 3$ for suitable $\alpha^+ > 2$.

The argument for the other inclusion is analogous, using (A.3) in place of (A.2):
$$P^+_n = P(\hat I \not\subseteq I^+_n) \le \sum_{\nu \notin I^+_n} P\{\hat\sigma^2_\nu \ge \sigma^2(1 + \alpha_n)\} \le p\,P\{\bar M_n \ge (1 + \alpha_n)/(1 + \alpha^-\alpha_n)\} \le p\,(n \vee p)^{-\alpha''_-},$$
with $\alpha''_- = 3(1 - \alpha^-)^2\alpha'/16$, so long as $\alpha'$ is now slightly less than $\alpha^2$ and n is large enough. If $\alpha \ge \sqrt{12}$, then $\alpha''_- > 2$ for suitable $\alpha^- < 1 - \sqrt{8/9}$.

By a Borel-Cantelli argument, (25) follows from the bounds on $P^-_n$ and $P^+_n$.

Step (ii). We first remark that one may easily show that $d(\rho + u, \rho) \le \|u\|/(\|\rho\| - \|u\|)$, so that norm convergence implies d-convergence. So, for $n > n(\omega)$ we have $I^-_n \subseteq \hat I$, and so


$$\|\rho_{\hat I} - \rho\|^2 = \sum_{\nu \notin \hat I} \rho^2_\nu \le \sum_{\nu \notin I^-_n} \rho^2_\nu.$$
When $\nu \in (I^-_n)^c$, we have by definition
$$\rho^2_\nu < \sigma^2\alpha^+\alpha\sqrt{n^{-1}\log(n \vee p)} := \epsilon^2_n,$$
say, while the uniform sparsity condition entails $|\rho|^2_{(\nu)} \le C^2\nu^{-2/q}$. Putting these together, and defining $s_1 = s_1(n)$ as the solution of the equation $Cs^{-1/q} = \epsilon_n$, and writing $a \wedge b$ for $\min(a, b)$, we obtain, as $n \to \infty$,
$$\sum_{\nu \notin I^-_n} \rho^2_\nu \le \sum_\nu \epsilon^2_n \wedge \rho^2_\nu \le \sum_\nu \epsilon^2_n \wedge |\rho|^2_{(\nu)} \le \sum_\nu \epsilon^2_n \wedge C^2\nu^{-2/q} \le \int_0^\infty \epsilon^2_n \wedge C^2 s^{-2/q}\,ds = s_1\epsilon^2_n + q(2-q)^{-1}C^2 s_1^{1-2/q} = [2/(2-q)]\,C^q\,\epsilon^{2-q}_n \to 0.$$

Step (iii). We adopt the abbreviations
$$u_{\hat I} = (u_\nu : \nu \in \hat I), \quad Z_{\hat I} = (z_{\nu i} : \nu \in \hat I,\ i = 1, \ldots, n), \quad S_{\hat I} = (S_{\nu\nu'} : \nu, \nu' \in \hat I),$$
and similarly for $E_{\hat I}$. We consider $S_{\hat I} - \sigma^2 I = \rho_{\hat I}\rho_{\hat I}^T + E_{\hat I}$ and note that the perturbation term has the decomposition
$$E_{\hat I} = y_s\,\rho_{\hat I}\rho_{\hat I}^T + \rho_{\hat I}u_{\hat I}^T + u_{\hat I}\rho_{\hat I}^T + \sigma^2(n^{-1}Z_{\hat I}Z_{\hat I}^T - I),$$
so that
$$\|E_{\hat I}\| \le y_s\,\|\rho_{\hat I}\|^2 + 2\|\rho_{\hat I}\|\,\|u_{\hat I}\| + \sigma^2[\lambda_{\max}(n^{-1}Z_{\hat I}Z_{\hat I}^T) - 1].$$
Consider the first term on the right side. Because $\|\rho_{\hat I} - \rho\| \xrightarrow{a.s.} 0$ from Step (ii), it follows that $\|\rho_{\hat I}\| \xrightarrow{a.s.} \|\rho\|$. Because $y_s \xrightarrow{a.s.} 0$, the first term is asymptotically negligible.

Let $Z_{I^+} = (z_{\nu i} : \nu \in I^+_n,\ i = 1, \ldots, n)$ and $u_{I^+} = (u_\nu : \nu \in I^+_n)$. On the event $V_n = \{\hat I \subseteq I^+_n\}$, we have $\|u_{\hat I}\| \le \|u_{I^+}\|$ and, setting $k^+ = |I^+_n|$, by the same arguments as led to (24), we have
$$\|u_{I^+}\|^2 \stackrel{D}{=} \sigma^2(k^+/n)(\chi^2_{(n)}/n)(\chi^2_{(k^+)}/k^+) \xrightarrow{a.s.} 0,$$
because $k^+ = o(n)$ from Step (i). Finally, because on the event $V_n$, the matrix $Z_{I^+}$ contains $Z_{\hat I}$, along with some additional rows, it follows that $\lambda_{\max}(n^{-1}Z_{\hat I}Z_{\hat I}^T - I) \le \lambda_{\max}(n^{-1}Z_{I^+}Z_{I^+}^T - I) \xrightarrow{a.s.} 0$ by (22), again because $k^+ = o(n)$. Combining previous bounds, we conclude that $\|E_{\hat I}\| \to 0$.

The separation $d_n = \|\rho_{\hat I}\|^2 \to \|\rho\|^2 > 0$ and so, by the perturbation bound (19),
$$d(\hat\rho_{\hat I}, \rho_{\hat I}) \le (4/d_n)\|E_{\hat I}\| \xrightarrow{a.s.} 0.$$

7. CONCLUSIONS

In models with observational noise such as (2), in which the number of variables p grows with the number of cases n, we have reviewed results that show that standard PCA yields consistent estimates of the principal eigenvectors if and only if $p/n \to 0$.

If the leading population principal eigenvector has a sparse representation in a given basis, Theorem 2 shows that it can be consistently estimated by selecting a subset of variables with variances above a threshold and then by restricting the PCA to this selected set. Incorporating a threshold is found empirically to be helpful. Future theoretical work might explore the tradeoff between variable selection and thresholding.

In summary, sparse PCA as described here may be of practical benefit in high-dimensional settings with substantial observational noise in which variation between individuals resides mainly in a subset of the coordinates in which the data are represented (perhaps after transformation).

APPENDIX

A.1 Large Deviation Inequalities

If $\bar X = n^{-1}\sum_1^n X_i$ is the average of iid variates with moment-generating function $\exp\{\Lambda(\lambda)\} = E\exp\{\lambda X_1\}$, then Cramér's theorem (see, for example, Dembo and Zeitouni 1993, sections 2.2.2 and 2.2.12) says that for $x > EX_1$,

\[
P\{\bar X > x\} \le \exp\{-n\Lambda^*(x)\}, \qquad (A.1)
\]

where the conjugate function $\Lambda^*(x) = \sup_\lambda\{\lambda x - \Lambda(\lambda)\}$. The same bound holds for $P\{\bar X < x\}$ when $x < EX_1$. When applied to the $\chi^2_{(n)}$ distribution, with $X_1 = z^2$ and $z \sim N(0, 1)$, the moment-generating function

\[
\Lambda(\lambda) = -\tfrac12 \log(1 - 2\lambda)
\]

and the conjugate function

\[
\Lambda^*(x) = \tfrac12 [x - 1 - \log x].
\]

The bounds

\[
\log(1 + \epsilon) \le \begin{cases} \epsilon - \epsilon^2/2, & -1 < \epsilon < 0, \\ \epsilon - 3\epsilon^2/8, & 0 \le \epsilon < \tfrac12, \end{cases}
\]

(the latter following, for example, from (47) in Johnstone (2001)) yield

\[
P\{\chi^2_{(n)} \le n(1 - \epsilon)\} \le \exp\{-n\epsilon^2/4\}, \qquad 0 \le \epsilon < 1, \qquad (A.2)
\]
\[
P\{\chi^2_{(n)} \ge n(1 + \epsilon)\} \le \exp\{-3n\epsilon^2/16\}, \qquad 0 \le \epsilon < \tfrac12. \qquad (A.3)
\]
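As a quick illustration (not part of the argument), the bounds (A.2) and (A.3) can be compared with exact chi-square tail probabilities; the sketch below uses scipy's chi-square distribution and arbitrary values of $n$ and $\epsilon$ within the stated ranges.

    import numpy as np
    from scipy.stats import chi2

    n = 200
    for eps in (0.05, 0.1, 0.2, 0.4):
        lower_exact = chi2.cdf(n * (1 - eps), df=n)        # P{chi^2_(n) <= n(1 - eps)}
        lower_bound = np.exp(-n * eps**2 / 4)              # (A.2), valid for 0 <= eps < 1
        upper_exact = chi2.sf(n * (1 + eps), df=n)         # P{chi^2_(n) >= n(1 + eps)}
        upper_bound = np.exp(-3 * n * eps**2 / 16)         # (A.3), valid for 0 <= eps < 1/2
        print(f"eps={eps:4.2f}  lower {lower_exact:.2e} <= {lower_bound:.2e}"
              f"  upper {upper_exact:.2e} <= {upper_bound:.2e}")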

A.2 Proof of Theorem 3

Assume, without loss of generality, that $\sigma_1^2 \ge \sigma_2^2 \ge \cdots \ge \sigma_p^2$.

False Inclusion. For any fixed constant $t$, and $l \in I_{\mathrm{out}}$,

\[
\hat\sigma_i^2 \ge t \ \text{ for } i = 1, \ldots, k \ \text{ and } \ \hat\sigma_l^2 < t \quad \Longrightarrow \quad \hat\sigma_l^2 < \hat\sigma^2_{(k)}.
\]

This threshold device leads to bounds on error probabilities using only marginal distributions. For example, consider false inclusion of variable $l$:

\[
P\{\hat\sigma_l^2 \ge \hat\sigma^2_{(k)}\} \le \sum_{i=1}^{k} P\{\hat\sigma_i^2 < t\} + P\{\hat\sigma_l^2 \ge t\}.
\]

Write $\bar M_n$ for a $\chi^2_{(n)}/n$ variate, and note from (12) that $\hat\sigma_\nu^2 \sim \sigma_\nu^2 \bar M_n$. Set $t = \sigma_k^2 (1 - \epsilon_n)$ for a value of $\epsilon_n$ to be determined. Because $\sigma_i^2 \ge \sigma_k^2$ and $\sigma_l^2 \le \sigma_k^2 (1 - a_n)$, we arrive at


\[
P\{\hat\sigma_l^2 \ge \hat\sigma^2_{(k)}\} \le k\, P\{\bar M_n < 1 - \epsilon_n\} + P\!\left\{\bar M_n \ge \frac{1 - \epsilon_n}{1 - a_n}\right\}
\le k \exp\!\left(-\frac{n\epsilon_n^2}{4}\right) + \exp\!\left\{-\frac{3n}{16}\left(\frac{a_n - \epsilon_n}{1 - a_n}\right)^{2}\right\}
\]

using the large deviation bounds (A.2) and (A.3). With the choice $\epsilon_n = \sqrt 3\, a_n/(2 + \sqrt 3)$, both exponents are bounded above by $-\beta(\alpha)\log(n \vee p)$, and so $P\{\mathrm{FI}\} \le p(k + 1)(n \vee p)^{-\beta(\alpha)}$.

False Exclusion. The argument is similar, starting with the remark that for any fixed $t$ and $l \in I_{\mathrm{in}}$,

\[
\hat\sigma_i^2 \le t \ \text{ for } i \ge k,\ i \ne l \ \text{ and } \ \hat\sigma_l^2 \ge t \quad \Longrightarrow \quad \hat\sigma_l^2 \ge \hat\sigma^2_{(k)}.
\]

Consequently, if we set $t = \sigma_k^2 (1 + \epsilon_n)$ and use $\sigma_l^2 \ge \sigma_k^2 (1 + a_n)$, we get

\[
P\{\hat\sigma_l^2 < \hat\sigma^2_{(k)}\} \le \sum_{i \ge k} P\{\hat\sigma_i^2 > t\} + P\{\hat\sigma_l^2 < t\}
\le (p - 1)\, P\{\bar M_n > 1 + \epsilon_n\} + P\!\left\{\bar M_n < \frac{1 + \epsilon_n}{1 + a_n}\right\}
\le (p - 1)\exp\!\left(-\frac{3n\epsilon_n^2}{16}\right) + \exp\!\left\{-\frac{n}{4}\left(\frac{a_n - \epsilon_n}{1 + a_n}\right)^{2}\right\},
\]

this time using (A.3) for the first term and (A.2) for the second. The bound $P\{\mathrm{FE}\} \le pk(n \vee p)^{-\beta(\alpha)} + k\, e^{-\beta(\alpha)(1 - 2a_n)\log(n \vee p)}$ follows on setting $\epsilon_n = 2a_n/(2 + \sqrt 3)$ and noting that $(1 + a_n)^{-2} \ge 1 - 2a_n$.

For numerical bounds, setting $L_n = \log(n \vee p)$, we may collect the preceding bounds in the form

\[
P(\mathrm{FE} \cup \mathrm{FI}) \le [pk + (p - 1)(k - 1)]\, e^{-\beta(\alpha) L_n} + p\, e^{-\beta(\alpha) L_n/(1 - a_n)^2} + (k - 1)\, e^{-\beta(\alpha) L_n/(1 + a_n)^2}. \qquad (A.4)
\]
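For a sense of scale, (A.4) is straightforward to evaluate numerically; in the sketch below the exponent $\beta(\alpha)$ and the gap $a_n$ are simply treated as given inputs (their definitions accompany Theorem 3 earlier in the paper, so the values used here are purely hypothetical).

    import numpy as np

    def selection_error_bound(n, p, k, beta, a_n):
        # Right-hand side of (A.4): a bound on P(FE union FI).
        L_n = np.log(max(n, p))
        return ((p * k + (p - 1) * (k - 1)) * np.exp(-beta * L_n)
                + p * np.exp(-beta * L_n / (1 - a_n)**2)
                + (k - 1) * np.exp(-beta * L_n / (1 + a_n)**2))

    print(selection_error_bound(n=1024, p=2048, k=20, beta=4.0, a_n=0.3))   # hypothetical values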

[Received January 2009. Revised January 2009.]

REFERENCES

Amini, A. A., and Wainwright, M. J. (2009), "High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components," The Annals of Statistics.
Anderson, T. W. (1963), "Asymptotic Theory for Principal Component Analysis," Annals of Mathematical Statistics, 34, 122–148.
Biehl, M., and Mietzner, A. (1994), "Statistical Mechanics of Unsupervised Structure Recognition," Journal of Physics A: Mathematical and General, 27, 1885–1897.
Cherkassky, V., and Mulier, F. (1998), Learning from Data, New York: Wiley.
d'Aspremont, A., El Ghaoui, L., Jordan, M., and Lanckriet, G. (2007), "A Direct Formulation for Sparse PCA Using Semidefinite Programming," SIAM Review, 49, 434–448.
Dembo, A., and Zeitouni, O. (1993), Large Deviations Techniques and Applications, Boston: Jones and Bartlett.
Donoho, D. (1993), "Unconditional Bases Are Optimal Bases for Data Compression and Statistical Estimation," Applied and Computational Harmonic Analysis, 1, 100–115.
Donoho, D. L., Johnstone, I. M., Kerkyacharian, G., and Picard, D. (1995), "Wavelet Shrinkage: Asymptopia" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 301–369.
Du, Q., and Fowler, J. E. (2008), "Low-Complexity Principal Component Analysis for Hyperspectral Image Compression," International Journal of High Performance Computing Applications, 22, 438–448.
Feng, G. C., Yuen, P. C., and Dai, D. Q. (2000), "Human Face Recognition Using PCA on Wavelet Subband," Journal of Electronic Imaging, 9, 226–233.
Fowler, J. E. (2008), "Compressive-Projection Principal Component Analysis for the Compression of Hyperspectral Signatures," in Data Compression Conference, 2008. DCC 2008, eds. J. A. Storer and M. W. Marcellin, Snowbird, UT: IEEE, pp. 83–92.
Geman, S. (1980), "A Limit Theorem for the Norm of Random Matrices," Annals of Probability, 8, 252–261.
Golub, G. H., and Van Loan, C. F. (1996), Matrix Computations (3rd ed.), Baltimore: Johns Hopkins University Press.
Hoyle, D. C., and Rattray, M. (2004), "Principal-Component-Analysis Eigenvalue Spectra from Data with Symmetry Breaking Structure," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 69, 026124.
Johnstone, I. M. (2001), "Chi Square Oracle Inequalities," in Festschrift for Willem R. van Zwet (vol. 36 of IMS Lecture Notes-Monographs), eds. M. de Gunst, C. Klaassen, and A. van der Waart, Beachwood, OH: Institute of Mathematical Statistics, pp. 399–418.
——— (2003), Function Estimation and Gaussian Sequence Models, draft of a monograph, available at www-stat.stanford.edu/~imj.
Johnstone, I. M., and Lu, A. Y. (2004), "Sparse Principal Components Analysis," Technical Report, Stanford University, Dept. of Statistics. Available at arxiv.org as e-print 0901.4392.
Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003), "A Modified Principal Component Technique Based on the LASSO," Journal of Computational and Graphical Statistics, 12, 531–547.
Kaewpijit, S., Le Moigne, J., and El-Ghazawi, T. (2002), "A Wavelet-Based PCA Reduction for Hyperspectral Imagery," in Geoscience and Remote Sensing Symposium, 2002. IGARSS'02. 2002 IEEE International (5th ed.), Washington, DC: IEEE, pp. 2581–2583.
Lu, A. Y. (2002), "Sparse Principal Components Analysis for Functional Data," Ph.D. dissertation, Stanford University, Dept. of Statistics.
Mallat, S. (1999), A Wavelet Tour of Signal Processing (2nd ed.), New York: Academic Press.
Muirhead, R. J. (1982), Aspects of Multivariate Statistical Theory, New York: Wiley.
Nadler, B. (2008), "Finite Sample Approximation Results for Principal Component Analysis: A Matrix Perturbation Approach," The Annals of Statistics, 36, 2791–2817.
Nadler, B., and Coifman, R. (2005), "The Prediction Error in CLS and PLS: The Importance of Feature Selection Prior to Multivariate Calibration," Journal of Chemometrics, 19, 107–118.
Paul, D. (2007), "Asymptotics of Sample Eigenstructure for a Large Dimensional Spiked Covariance Model," Statistica Sinica, 17, 1617–1642.
Paul, D., and Johnstone, I. (2004), "Estimation of Principal Components through Coordinate Selection," Technical Report, Stanford University, Dept. of Statistics.
Ramsay, J. O., and Silverman, B. W. (1997), Functional Data Analysis, Berlin: Springer.
Reimann, P., Van den Broeck, C., and Bex, G. J. (1996), "A Gaussian Scenario for Unsupervised Learning," Journal of Physics A: Mathematical and General, 29, 3521–3535.
Rice, J. A., and Silverman, B. W. (1991), "Estimating the Mean and Covariance Structure Nonparametrically When the Data Are Curves," Journal of the Royal Statistical Society, Ser. B, 53, 233–243.
Silverman, B. W. (1996), "Smoothed Functional Principal Components Analysis by Choice of Norm," The Annals of Statistics, 24, 1–24.
Silverstein, J. W. (1985), "The Smallest Eigenvalue of a Large Dimensional Wishart Matrix," Annals of Probability, 13, 1364–1368.
Watkin, T. L. H., and Nadal, J.-P. (1994), "Optimal Unsupervised Learning," Journal of Physics A: Mathematical and General, 27, 1899–1915.
Wickerhauser, M. V. (1994a), "Large-Rank Approximate Principal Component Analysis with Wavelets for Signal Feature Discrimination and the Inversion of Complicated Maps," Journal of Chemical Information and Computer Sciences, 34, 1036–1046.
Wickerhauser, M. V. (1994b), "Two Fast Approximate Wavelet Algorithms for Image Processing, Classification, and Recognition," Optical Engineering, 33, 2225–2235. (Special issue on Adapted Wavelet Analysis.)
Wolf, L., and Bileschi, S. (2005), "Combining Variable Selection with Dimensionality Reduction," in Computer Vision and Pattern Recognition, 2005 (vol. 2), Washington, DC: IEEE Computer Society, pp. 801–806.
Wolf, L., and Shashua, A. (2005), "Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weighted-Based Approach," Journal of Machine Learning Research, 6, 1855–1887.
Zou, H., Hastie, T., and Tibshirani, R. (2006), "Sparse Principal Component Analysis," Journal of Computational and Graphical Statistics, 15, 265–286.
