
LETTER Communicated by Hagai Attias

Dictionary Learning Algorithms for Sparse Representation

Kenneth Kreutz-Delgado [email protected]
Joseph F. Murray [email protected]
Bhaskar D. Rao [email protected]
Electrical and Computer Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A.

Kjersti Engan [email protected]
Stavanger University College, School of Science and Technology Ullandhaug, N-4091 Stavanger, Norway

Te-Won Lee [email protected]
Terrence J. Sejnowski [email protected]
Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, California 92037, U.S.A.

Algorithms for data-driven learning of domain-specific overcomplete dictionaries are developed to obtain maximum likelihood and maximum a posteriori dictionary estimates based on the use of Bayesian models with concave/Schur-concave (CSC) negative log priors. Such priors are appropriate for obtaining sparse representations of environmental signals within an appropriately chosen (environmentally matched) dictionary. The elements of the dictionary can be interpreted as concepts, features, or words capable of succinct expression of events encountered in the environment (the source of the measured signals). This is a generalization of vector quantization in that one is interested in a description involving a few dictionary entries (the proverbial "25 words or less"), but not necessarily as succinct as one entry. To learn an environmentally adapted dictionary capable of concise expression of signals generated by the environment, we develop algorithms that iterate between a representative set of sparse representations found by variants of FOCUSS and an update of the dictionary using these sparse representations.

Experiments were performed using synthetic data and natural images. For complete dictionaries, we demonstrate that our algorithms have improved performance over other independent component analysis (ICA) methods, measured in terms of signal-to-noise ratios of separated sources. In the overcomplete case, we show that the true underlying dictionary and sparse sources can be accurately recovered. In tests with natural images, learned overcomplete dictionaries are shown to have higher coding efficiency than complete dictionaries; that is, images encoded with an overcomplete dictionary have both higher compression (fewer bits per pixel) and higher accuracy (lower mean square error).

Neural Computation 15, 349–396 (2003)  © 2002 Massachusetts Institute of Technology

1 Introduction

FOCUSS, which stands for FOCal Underdetermined System Solver, is an algorithm designed to obtain suboptimally (and, at times, maximally) sparse solutions to the following m × n underdetermined linear inverse problem¹ (Gorodnitsky, George, & Rao, 1995; Rao & Gorodnitsky, 1997; Gorodnitsky & Rao, 1997; Adler, Rao, & Kreutz-Delgado, 1996; Rao & Kreutz-Delgado, 1997; Rao, 1997, 1998),

y = Ax,   (1.1)

for known A. The sparsity of a vector is the number of zero-valued elements (Donoho, 1994), and is related to the diversity, the number of nonzero elements,

sparsity = #{x[i] = 0},
diversity = #{x[i] ≠ 0},
diversity = n − sparsity.
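
As a small illustration (not part of the original text), the two counts can be computed directly; the snippet below is a minimal NumPy sketch in which the tolerance argument is an added convenience for treating near-zero entries as zero.

```python
import numpy as np

def sparsity_and_diversity(x, tol=0.0):
    """Count the zero-valued and nonzero entries of a vector x.

    tol is an illustrative tolerance for treating tiny entries as zero;
    the definitions above use exact zeros (tol = 0).
    """
    x = np.asarray(x, dtype=float)
    diversity = int(np.sum(np.abs(x) > tol))  # number of nonzero elements
    sparsity = x.size - diversity             # number of zero elements
    return sparsity, diversity

# Example: n = 5 entries, three of them zero.
print(sparsity_and_diversity([0.0, 2.5, 0.0, -1.0, 0.0]))  # -> (3, 2)
```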

Since our initial investigations into the properties of FOCUSS as an algorithm for providing sparse solutions to linear inverse problems in relatively noise-free environments (Gorodnitsky et al., 1995; Rao, 1997; Rao & Gorodnitsky, 1997; Gorodnitsky & Rao, 1997; Adler et al., 1996; Rao & Kreutz-Delgado, 1997), we now better understand the behavior of FOCUSS in noisy environments (Rao & Kreutz-Delgado, 1998a, 1998b) and as an interior point-like optimization algorithm for optimizing concave functionals subject to linear constraints (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997, 1998a, 1998b, 1998c, 1999; Kreutz-Delgado, Rao, Engan, Lee, & Sejnowski, 1999a; Engan, Rao, & Kreutz-Delgado, 2000; Rao, Engan, Cotter, & Kreutz-Delgado, 2002). In this article, we consider the use of the FOCUSS algorithm in the case where the matrix A is unknown and must be learned. Toward this end, we first briefly discuss how the use of concave (and Schur-concave) functionals enforces sparse solutions to equation 1.1.

1 For notational simplicity, we consider the real case only.


We also discuss the choice of the matrix, A, in equation 1.1 and its relationship to the set of signal vectors y for which we hope to obtain sparse representations. Finally, we present algorithms capable of learning an environmentally adapted dictionary, A, given a sufficiently large and statistically representative sample of signal vectors, y, building on ideas originally presented in Kreutz-Delgado, Rao, Engan, Lee, and Sejnowski (1999b), Kreutz-Delgado, Rao, and Engan (1999), and Engan, Rao, and Kreutz-Delgado (1999).

We refer to the columns of the full row-rank m × n matrix A,

A = [a_1, ..., a_n] ∈ R^{m×n},   n ≫ m,   (1.2)

as a dictionary, and they are assumed to be a set of vectors capable of providing a highly succinct representation for most (and, ideally, all) statistically representative signal vectors y ∈ R^m. Note that with the assumption that rank(A) = m, every vector y has a representation; the question at hand is whether this representation is likely to be sparse. We call the statistical generating mechanism for signals, y, the environment and a dictionary, A, within which such signals can be sparsely represented an environmentally adapted dictionary.

Environmentally generated signals typically have significant statistical structure and can be represented by a set of basis vectors spanning a lower-dimensional submanifold of meaningful signals (Field, 1994; Ruderman, 1994). These environmentally meaningful representation vectors can be obtained by maximizing the mutual information between the set of these vectors (the dictionary) and the signals generated by the environment (Comon, 1994; Bell & Sejnowski, 1995; Deco & Obradovic, 1996; Olshausen & Field, 1996; Zhu, Wu, & Mumford, 1997; Wang, Lee, & Juang, 1997). This procedure can be viewed as a natural generalization of independent component analysis (ICA) (Comon, 1994; Deco & Obradovic, 1996). As initially developed, this procedure usually results in obtaining a minimal spanning set of linearly independent vectors (i.e., a true basis). More recently, the desirability of obtaining "overcomplete" sets of vectors (or "dictionaries") has been noted (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000; Coifman & Wickerhauser, 1992; Mallat & Zhang, 1993; Donoho, 1994; Rao & Kreutz-Delgado, 1997). For example, projecting measured noisy signals onto the signal submanifold spanned by a set of dictionary vectors results in noise reduction and data compression (Donoho, 1994, 1995). These dictionaries can be structured as a set of bases from which a single basis is to be selected to represent the measured signal(s) of interest (Coifman & Wickerhauser, 1992) or as a single, overcomplete set of individual vectors from within which a vector, y, is to be sparsely represented (Mallat & Zhang, 1993; Olshausen & Field, 1996; Lewicki & Sejnowski, 2000; Rao & Kreutz-Delgado, 1997).

The problem of determining a representation from a full row-rank overcomplete dictionary, A = [a_1, ..., a_n], n ≫ m, for a specific signal


measurement, y, is equivalent to solving an underdetermined inverse problem, Ax = y, which is nonuniquely solvable for any y. The standard least-squares solution to this problem has the (at times) undesirable feature of involving all the dictionary vectors in the solution² (the "spurious artifact" problem) and does not generally allow for the extraction of a categorically or physically meaningful solution. That is, it is not generally the case that a least-squares solution yields a concise representation allowing for a precise semantic meaning.³ If the dictionary is large and rich enough in representational power, a measured signal can be matched to a very few (perhaps even just one) dictionary words. In this manner, we can obtain concise semantic content about objects or situations encountered in natural environments (Field, 1994). Thus, there has been significant interest in finding sparse solutions, x (solutions having a minimum number of nonzero elements), to the signal representation problem. Interestingly, matching a specific signal to a sparse set of dictionary words or vectors can be related to entropy minimization as a means of elucidating statistical structure (Watanabe, 1981). Finding a sparse representation (based on the use of a "few" code or dictionary words) can also be viewed as a generalization of vector quantization where a match to a single "code vector" (word) is always sought (taking "code book" = "dictionary").⁴ Indeed, we can refer to a sparse solution, x, as a sparse coding of the signal instantiation, y.

1.1 Stochastic Models. It is well known (Basilevsky, 1994) that the stochastic generative model

y = Ax + ν,   (1.3)

can be used to develop algorithms enabling coding of y ∈ R^m via solving the inverse problem for a sparse solution x ∈ R^n for the undercomplete (n < m) and complete (n = m) cases. In recent years, there has been a great deal of interest in obtaining sparse codings of y with this procedure for the overcomplete (n > m) case (Mallat & Zhang, 1993; Field, 1994). In our earlier work, we have shown that given an overcomplete dictionary, A (with the columns of A comprising the dictionary vectors), a maximum a posteriori (MAP) estimate of the source vector, x, will yield a sparse coding of y in the low-noise limit if the negative log prior, −log(P(x)), is concave/Schur-concave (CSC) (Rao, 1998; Kreutz-Delgado & Rao, 1999), as discussed below.

2 This fact comes as no surprise when the solution is interpreted within a Bayesian framework using a gaussian (maximum entropy) prior.

3 Taking "semantic" here to mean categorically or physically interpretable.
4 For example, n = 100 corresponds to 100 features encoded via vector quantization ("one column = one concept"). If we are allowed to represent features using up to four columns, we can encode C(100,1) + C(100,2) + C(100,3) + C(100,4) = 4,087,975 concepts, showing a combinatorial boost in expressive power.


For P(x) factorizable into a product of marginal probabilities, the resulting code is also known to provide an independent component analysis (ICA) representation of y. More generally, a CSC prior results in a sparse representation even in the nonfactorizable case (with x then forming a dependent component analysis, or DCA, representation).

Given independently and identically distributed (i.i.d.) data, Y = Y_N = (y_1, ..., y_N), assumed to be generated by the model 1.3, a maximum likelihood estimate, Â_ML, of the unknown (but nonrandom) dictionary A can be determined as (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000)

Â_ML = arg max_A P(Y; A).

This requires integrating out the unobservable i.i.d. source vectors, X = X_N = (x_1, ..., x_N), in order to compute P(Y; A) from the (assumed) known probabilities P(x) and P(ν). In essence, X is formally treated as a set of nuisance parameters that in principle can be removed by integration. However, because the prior P(x) is generally taken to be supergaussian, this integration is intractable or computationally unreasonable. Thus, approximations to this integration are performed that result in an approximation to P(Y; A), which is then maximized with respect to A. A new, better approximation to the integration can then be made, and this process is iterated until the estimate of the dictionary A has (hopefully) converged (Olshausen & Field, 1996). We refer to the resulting estimate as an approximate maximum likelihood (AML) estimate of the dictionary A (denoted here by Â_AML). No formal proof of the convergence of this algorithm to the true maximum likelihood estimate, Â_ML, has been given in the prior literature, but it appears to perform well in various test cases (Olshausen & Field, 1996). Below, we discuss the problem of dictionary learning within the framework of our recently developed log-prior model-based sparse source vector learning approach, which for a known overcomplete dictionary can be used to obtain sparse codes (Rao, 1998; Kreutz-Delgado & Rao, 1997, 1998b, 1998c, 1999; Rao & Kreutz-Delgado, 1999). Such sparse codes can be found using FOCUSS, an affine scaling transformation (AST)-like iterative algorithm that finds a sparse locally optimal MAP estimate of the source vector x for an observation y. Using these results, we can develop dictionary learning algorithms within the AML framework and for obtaining a MAP-like estimate, Â_MAP, of the (now assumed random) dictionary, A, assuming in the latter case that the dictionary belongs to a compact submanifold corresponding to unit Frobenius norm. Under certain conditions, convergence to a local minimum of a MAP-loss function that combines functions of the discrepancy e = (y − Ax) and the degree of sparsity in x can be rigorously proved.

1.2 Related Work. Previous work includes efforts to solve equation 1.3 in the overcomplete case within the maximum likelihood (ML) framework.


An algorithm for finding sparse codes was developed in Olshausen and Field (1997) and tested on small patches of natural images, resulting in Gabor-like receptive fields. In Lewicki and Sejnowski (2000) another ML algorithm is presented, which uses the Laplacian prior to enforce sparsity. The values of the elements of x are found with a modified conjugate gradient optimization (which has a rather complicated implementation) as opposed to the standard ICA (square mixing matrix) case where the coefficients are found by inverting the A matrix. The difficulty that arises when using ML is that finding the estimate of the dictionary A requires integrating over all possible values of the joint density P(y, x; A) as a function of x. In Olshausen and Field (1997), this is handled by assuming the prior density of x is a delta function, while in Lewicki and Sejnowski (2000), it is approximated by a gaussian. The fixed-point FastICA (Hyvarinen, Cristescu, & Oja, 1999) has also been extended to generate overcomplete representations. The FastICA algorithm can find the basis functions (columns of the dictionary A) one at a time by imposing a quasi-orthogonality condition and can be thought of as a "greedy" algorithm. It also can be run in parallel, meaning that all columns of A are updated together.

Other methods to solve equation 1.3 in the overcomplete case have been developed using a combination of the expectation-maximization (EM) algorithm and variational approximation techniques. Independent factor analysis (Attias, 1999) uses a mixture of gaussians to approximate the prior density of the sources, which avoids the difficulty of integrating out the parameters X and allows different sources to have different densities. In another method (Girolami, 2001), the source priors are assumed to be supergaussian (heavy-tailed), and a variational lower bound is developed that is used in the EM estimation of the parameters A and X. It is noted in Girolami (2001) that the mixtures used in independent factor analysis are more general than may be needed for the sparse overcomplete case, and they can be computationally expensive as the dimension of the data vector and number of mixtures increases.

In Zibulevsky and Pearlmutter (2001), the blind source separation problem is formulated in terms of a sparse source underlying each unmixed signal. These sparse sources are expanded into the unmixed signal with a predefined wavelet dictionary, which may be overcomplete. The unmixed signals are linearly combined by a different mixing matrix to create the observed sensor signals. The method is shown to give better separation performance than ICA techniques. The use of learned dictionaries (instead of being chosen a priori) is suggested.

2 FOCUSS: Sparse Solutions for Known Dictionaries

2.1 Known Dictionary Model. A Bayesian interpretation is obtained from the generative signal model, equation 1.3, by assuming that x has the parameterized (generally nongaussian) probability density function (pdf),


P_p(x) = Z_p⁻¹ e^{−γ_p d_p(x)},   Z_p = ∫ e^{−γ_p d_p(x)} dx,   (2.1)

with parameter vector p. Similarly, the noise ν is assumed to have a parameterized (possibly nongaussian) density P_q(ν) of the same form as equation 2.1 with parameter vector q. It is assumed that x and ν have zero means and that their densities obey the property d(x) = d(|x|), for |·| defined component-wise. This is equivalent to assuming that the densities are symmetric with respect to sign changes in the components of x, x[i] ← −x[i], and therefore that the skews of these densities are zero. We also assume that d(0) = 0. With a slight abuse of notation, we allow the differing subscripts q and p to indicate that d_q and d_p may be functionally different as well as parametrically different. We refer to densities like equation 2.1, for suitable additional constraints on d_p(x), as hypergeneralized gaussian distributions (Kreutz-Delgado & Rao, 1999; Kreutz-Delgado et al., 1999).

If we treat A, p, and q as known parameters, then x and y are jointly distributed as

P(x, y) = P(x, y; p, q, A).

Bayes' rule yields

P(x | y; p, q, A) = (1/β) P(y | x; q, A) · P(x; p, A) = (1/β) P_q(y − Ax) · P_p(x),   (2.2)

β = P(y) = P(y; p, q, A) = ∫ P(y | x) · P_p(x) dx.   (2.3)

Usually the dependence on p and q is notationally suppressed, for example, β = P(y; A). Given an observation, y, maximizing equation 2.2 with respect to x yields a MAP estimate x̂. This ideally results in a sparse coding of the observation, a requirement that places functional constraints on the pdfs, and particularly on d_p. Note that β is independent of x and can be ignored when optimizing equation 2.2 with respect to the unknown source vector x.

The MAP estimate is equivalently obtained by minimizing the negative logarithm of P(x | y), which is

x̂ = arg min_x d_q(y − Ax) + λ d_p(x),   (2.4)

where λ = γ_p/γ_q and d_q(y − Ax) = d_q(Ax − y) by our assumption of symmetry. The quantity 1/λ is interpretable as a signal-to-noise ratio (SNR). Furthermore, assuming that both d_q and d_p are concave/Schur-concave (CSC) as defined in section 2.4, then the term d_q(y − Ax) in equation 2.4 will encourage sparse residuals, e = y − Âx̂, while the term d_p(x) encourages sparse source vector estimates, x̂.


A given value of λ then determines a trade-off between residual and source vector sparseness.

This most general formulation will not be used here. Although we are interested in obtaining sparse source vector estimates, we will not enforce sparsity on the residuals but instead, to simplify the development, will assume the q = 2 i.i.d. gaussian measurement noise case (ν gaussian with known covariance σ²·I), which corresponds to taking,

γ_q d_q(y − Âx̂) = 1/(2σ²) ||y − Âx̂||².   (2.5)

In this case, problem 2.4 becomes

x̂ = arg min_x (1/2)||y − Ax||² + λ d_p(x).   (2.6)

In either case, we note that λ → 0 as γ_p → 0, which (consistent with the generative model, 1.3) we refer to as the low-noise limit. Because the mapping A is assumed to be onto, in the low-noise limit, the optimization, equation 2.4, is equivalent to the linearly constrained problem,

x̂ = arg min d_p(x) subject to Ax = y.   (2.7)

In the low-noise limit, no sparseness constraint need be placed on the residuals, e = y − Âx̂, which are assumed to be zero. It is evident that the structure of d_p(·) is critical for obtaining a sparse coding, x̂, of the observation y (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999). Throughout this article, the quantity d_p(x) is always assumed to be CSC (enforcing sparse solutions to the inverse problem 1.3). As noted, and as will be evident during the development of dictionary learning algorithms below, we do not impose a sparsity constraint on the residuals; instead, the measurement noise ν will be assumed to be gaussian (q = 2).
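
For concreteness, the regularized MAP objective of equation 2.6 is straightforward to evaluate numerically. The sketch below is added for illustration and is not from the original text; it uses the separable diversity measure d_p(x) = Σ_i |x[i]|^p introduced in the next section, and the function name and defaults are illustrative.

```python
import numpy as np

def map_objective(x, A, y, lam, p=1.0):
    """Regularized objective of equation 2.6:
    0.5 * ||y - A x||^2 + lam * d_p(x), with d_p(x) = sum_i |x[i]|^p.

    The paper leaves d_p general (subject to the CSC conditions); the
    p-norm-like choice here is the generalized gaussian case of section 2.2.
    """
    r = y - A @ x
    return 0.5 * float(r @ r) + lam * float(np.sum(np.abs(x) ** p))
```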

2.2 Independent Component Analysis and Sparsity Inducing Priors. An important class of densities is given by the generalized gaussians for which

d_p(x) = ||x||_p^p = ∑_{k=1}^{n} |x[k]|^p,   (2.8)

for p > 0 (Kassam, 1982). This is a special case of the larger p-class of functions, which allows p to be negative in value (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997). Note that this function has the special property of separability,

d_p(x) = ∑_{k=1}^{n} d_p(x[k]),


which corresponds to factorizability of the density P_p(x),

P_p(x) = ∏_{k=1}^{n} P_p(x[k]),

and hence to independence of the components of x. The assumption of independent components allows the problem of solving the generative model, equation 1.3, for x to be interpreted as an ICA problem (Comon, 1994; Pham, 1996; Olshausen & Field, 1996; Roberts, 1998). It is of interest, then, to consider the development of a large class of parameterizable separable functions d_p(x) consistent with the ICA assumption (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997). Given such a class, it is natural to examine the issue of finding a best fit within this class to the "true" underlying prior density of x. This is a problem of parametric density estimation of the true prior, where one attempts to find an optimal choice of the model density P_p(x) by an optimization over the parameters p that define the choice of a prior from within the class. This is, in general, a difficult problem, which may require the use of Monte Carlo, evolutionary programming, or stochastic search techniques.

Can the belief that supergaussian priors, P_p(x), are appropriate for finding sparse solutions to equation 1.3 (Field, 1994; Olshausen & Field, 1996) be clarified or made rigorous? It is well known that the generalized gaussian distribution arising from the use of equation 2.8 yields supergaussian distributions (positive kurtosis) for p < 2 and subgaussian (negative kurtosis) for p > 2. However, one can argue (see section 2.5 below) that the condition for obtaining sparse solutions in the low-noise limit is the stronger requirement that p ≤ 1, in which case the separable function d_p(x) is CSC. This indicates that supergaussianness (positive kurtosis) alone is necessary but not sufficient for inducing sparse solutions. Rather, sufficiency is given by the requirement that −log P_p(x) ≈ d_p(x) be CSC.
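
A quick numerical illustration of this point (added here; not from the original text): on the simplest underdetermined constraint x[0] + x[1] = 1, the constrained minimizer of d_p lands on a sparse boundary point only when p ≤ 1. The grid search and variable names below are illustrative choices.

```python
import numpy as np

def d_p(X, p):
    """Separable diversity measure d_p(x) = sum_i |x[i]|^p, applied columnwise."""
    return np.sum(np.abs(X) ** p, axis=0)

# Low-noise problem 2.7 in miniature: minimize d_p(x) subject to x[0] + x[1] = 1.
# Parameterize the feasible line by t and locate the minimizer on a fine grid.
t = np.linspace(0.0, 1.0, 1001)
X = np.vstack([t, 1.0 - t])              # every column satisfies the constraint

for p in (0.5, 1.0, 1.5):
    t_star = t[np.argmin(d_p(X, p))]
    print(f"p = {p}: minimizer near ({t_star:.2f}, {1.0 - t_star:.2f})")
# p = 0.5 -> an endpoint such as (0, 1): a sparse, boundary solution
# p = 1.0 -> the objective is constant on the segment (up to rounding), so
#            sparse endpoints are among the minimizers
# p = 1.5 -> (0.5, 0.5): the mass is spread out, not sparse
```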

We have seen that the function d_p(x) has an interpretation as (the negative logarithm of) a Bayesian prior or as a penalty function enforcing sparsity in equation 2.4, where d_p(x) should serve as a "relaxed counting function" on the nonzero elements of x. Our perspective emphasizes that d_p(x) serves both of these goals simultaneously. Thus, good regularizing functions, d_p(x), should be flexibly parameterizable so that P_p(x) can be optimized over the parameter vector p to provide a good parametric fit to the underlying environmental pdf, and the functions should also have analytical properties consistent with the goal of enforcing sparse solutions. Such properties are discussed in the next section.

2.3 Majorization and Schur-Concavity. In this section, we discuss functions that are both concave and Schur-concave (CSC functions; Marshall & Olkin, 1979). We will call functions, d_p(·), which are CSC, diversity functions, anticoncentration functions, or antisparsity functions. The larger the value of the CSC function d_p(x), the more diverse (i.e., the less concentrated or sparse) the elements of the vector x are.


Thus, minimizing d_p(x) with respect to x results in less diverse (more concentrated or sparse) vectors x.

2.3.1 Schur-Concave Functions. A measure of the sparsity of the elements of a solution vector x (or the lack thereof, which we refer to as the diversity of x) is given by a partial ordering on vectors known as the Lorentz order. For any vector in the positive orthant, x ∈ R^n_+, define the decreasing rearrangement

x⃗ ≜ (x_[1], ..., x_[n]),   x_[1] ≥ ··· ≥ x_[n] ≥ 0,

and the partial sums (Marshall & Olkin, 1979; Wickerhauser, 1994),

S_x[k] = ∑_{i=1}^{k} x_[i],   k = 1, ..., n.

We say that y majorizes x, y ≻ x, iff for k = 1, ..., n,

S_y[k] ≥ S_x[k];   S_y[n] = S_x[n],

and the vector y is said to be more concentrated, or less diverse, than x. This partial order defined by majorization then defines the Lorentz order.

We are interested in scalar-valued functions of x that are consistent with majorization. Such functions are known as Schur-concave functions, d(·): R^n_+ → R. They are defined to be precisely the class of functions consistent with the Lorentz order,

y ≻ x  ⇒  d(y) < d(x).

In words, if y is less diverse than x (according to the Lorentz order), then d(y) is less than d(x) for d(·) Schur-concave. We assume that Schur-concavity is a necessary condition for d(·) to be a good measure of diversity (antisparsity).
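
As a small aside (not in the original text), the partial-sum test above is easy to check numerically; the helper below is a minimal sketch for vectors in the positive orthant, with the small numerical tolerance as an added safeguard.

```python
import numpy as np

def majorizes(y, x):
    """Return True if y majorizes x (y is more concentrated, less diverse).

    Sorts both vectors in decreasing order, compares the running partial
    sums S_y[k] >= S_x[k], and requires the totals to agree.
    """
    ys = np.sort(np.asarray(y, dtype=float))[::-1]
    xs = np.sort(np.asarray(x, dtype=float))[::-1]
    Sy, Sx = np.cumsum(ys), np.cumsum(xs)
    return bool(np.all(Sy >= Sx - 1e-12) and np.isclose(Sy[-1], Sx[-1]))

print(majorizes([3, 0, 0], [1, 1, 1]))   # True:  (3, 0, 0) is more concentrated
print(majorizes([1, 1, 1], [3, 0, 0]))   # False
```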

2.3.2 Concavity Yields Sparse Solutions. Recall that a function d(·) is concave on the positive orthant R^n_+ iff (Rockafellar, 1970)

d((1 − γ)x + γy) ≥ (1 − γ)d(x) + γd(y),

∀x, y ∈ R^n_+, ∀γ, 0 ≤ γ ≤ 1. In addition, a scalar function is said to be permutation invariant if its value is independent of rearrangements of its components. An important fact is that for permutation invariant functions, concavity is a sufficient condition for Schur-concavity:

Concavity + permutation invariance ⇒ Schur-concavity.

Now consider the low-noise sparse inverse problem, 2.7. It is well known that, subject to linear constraints, a concave function on R^n_+ takes its minima on the boundary of R^n_+ (Rockafellar, 1970), and as a consequence these minima are therefore sparse. We take concavity to be a sufficient condition for a permutation invariant d(·) to be a measure of diversity and obtain sparsity as constrained minima of d(·). More generally, a diversity measure should be somewhere between Schur-concave and concave. In this spirit, one can define almost concave functions (Kreutz-Delgado & Rao, 1997), which are Schur-concave and (locally) concave in all n directions but one, and which are also good measures of diversity.

2.3.3 Separability, Schur-Concavity, and ICA. The simplest way to ensure that d(x) be permutation invariant (a necessary condition for Schur-concavity) is to use functions that are separable. Recall that separability of d_p(x) corresponds to factorizability of P_p(x). Thus, separability of d(x) corresponds to the assumption of independent components of x under the model 1.3. We see that from a Bayesian perspective, separability of d(x) corresponds to a generative model for y that assumes a source, x, with independent components. With this assumption, we are working within the framework of ICA (Nadal & Parga, 1994; Pham, 1996; Roberts, 1998). We have developed effective algorithms for solving the optimization problem 2.7 for sparse solutions when d_p(x) is separable and concave (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999).

It is now evident that relaxing the restriction of separability generalizes the generative model to the case where the source vector, x, has dependent components. We can reasonably call an approach based on a nonseparable diversity measure d(x) a dependent component analysis (DCA). Unless care is taken, this relaxation can significantly complicate the analysis and development of optimization algorithms. However, one can solve the low-noise DCA problem, at least in principle, provided appropriate choices of nonseparable diversity functions are made.

2.4 Supergaussian Priors and Sparse Coding. The p-class of diversity measures for 0 < p ≤ 1 results in sparse solutions to the low-noise coding problem, 2.7. These separable and concave (and thus Schur-concave) diversity measures correspond to supergaussian priors, consistent with the folk theorem that supergaussian priors are sparsity-enforcing priors. However, taking 1 ≤ p < 2 results in supergaussian priors that are not sparsity enforcing. Taking p to be between 1 and 2 yields a d_p(x) that is convex and therefore not concave. This is consistent with the well-known fact that for this range of p, the pth root of d_p(x) is a norm. Minimizing d_p(x) in this case drives x toward the origin, favoring concentrated rather than sparse solutions. We see that if a sparse coding is to be found based on obtaining a MAP estimate to the low-noise generative model, 1.3, then, in a sense, supergaussianness is a necessary but not sufficient condition for a prior to be sparsity enforcing. A sufficient condition for obtaining a sparse MAP coding is that the negative log prior be CSC.


2.5 The FOCUSS Algorithm. Locally optimal solutions to the known-dictionary sparse inverse problems in gaussian noise, equations 2.6 and 2.7, are given by the FOCUSS algorithm. This is an affine-scaling transformation (AST)-like (interior point) algorithm originally proposed for the low-noise case 2.7 in Rao and Kreutz-Delgado (1997, 1999) and Kreutz-Delgado and Rao (1997), and extended by regularization to the nontrivial noise case, equation 2.6, in Rao and Kreutz-Delgado (1998a), Engan et al. (2000), and Rao et al. (2002). In these references, it is shown that the FOCUSS algorithm has excellent behavior for concave functions (which include the CSC concentration functions) d_p(·). For such functions, FOCUSS quickly converges to a local minimum, yielding a sparse solution to problems 2.7 and 2.6.

One can quickly motivate the development of the FOCUSS algorithm appropriate for solving the optimization problem 2.6 by considering the problem of obtaining the stationary points of the objective function. These are given as solutions, x*, to

A^T(Ax* − y) + λ ∇_x d_p(x*) = 0.   (2.9)

In general, equation 2.9 is nonlinear and cannot be explicitly solved for a solution x*. However, we proceed by assuming the existence of a gradient factorization,

∇_x d_p(x) = α(x) Π(x) x,   (2.10)

where α(x) is a positive scalar function and Π(x) is symmetric, positive-definite, and diagonal. As discussed in Kreutz-Delgado and Rao (1997, 1998c) and Rao and Kreutz-Delgado (1999), this assumption is generally true for CSC sparsity functions d_p(·) and is key to understanding FOCUSS as a sparsity-inducing interior-point (AST-like) optimization algorithm.⁵

With the gradient factorization 2.10, the stationary points of equation 2.9 are readily shown to be solutions to the (equally nonlinear and implicit) system,

x* = (A^T A + β(x*) Π(x*))⁻¹ A^T y   (2.11)
   = Π⁻¹(x*) A^T (β(x*) I + A Π⁻¹(x*) A^T)⁻¹ y,   (2.12)

where β(x) = λα(x) and the second equation follows from identity A.18. Although equation 2.12 is also not generally solvable in closed form, it does suggest the following relaxation algorithm,

5 This interpretation, which is not elaborated on here, follows from defining a diagonal positive-definite affine scaling transformation matrix W(x) by the relation Π(x) = W⁻²(x).

x̂ ← Π⁻¹(x̂) A^T (β(x̂) I + A Π⁻¹(x̂) A^T)⁻¹ y,   (2.13)

which is to be repeatedly iterated until convergence.

Taking β ≡ 0 in equation 2.13 yields the FOCUSS algorithm proved in Kreutz-Delgado and Rao (1997, 1998c) and Rao and Kreutz-Delgado (1999) to converge to a sparse solution of equation 2.7 for CSC sparsity functions d_p(·). The case β ≠ 0 yields the regularized FOCUSS algorithm that will converge to a sparse solution of equation 2.6 (Rao, 1998; Engan et al., 2000; Rao et al., 2002). More computationally robust variants of equation 2.13 are discussed elsewhere (Gorodnitsky & Rao, 1997; Rao & Kreutz-Delgado, 1998a).
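
To make the iteration concrete, the following NumPy sketch (an illustrative addition, not the authors' code) implements the relaxation 2.13 with the diagonal factor Π⁻¹(x) = diag(|x[i]|^{2−p}) given later in equation 3.19 and a constant regularizer β; the paper instead uses β(x) = λα(x) and the schedules discussed below, and the small eps term is an added numerical safeguard.

```python
import numpy as np

def focuss(A, y, p=1.0, beta=1e-3, num_iters=50, eps=1e-12):
    """Regularized FOCUSS relaxation of equation 2.13 (illustrative sketch).

    Repeats  x <- Pi_inv(x) A^T (beta I + A Pi_inv(x) A^T)^{-1} y,
    with Pi_inv(x) = diag(|x[i]|^(2-p)).  beta = 0 recovers the low-noise
    algorithm for problem 2.7.
    """
    m, n = A.shape
    x = A.T @ np.linalg.solve(A @ A.T, y)       # start from the minimum-norm solution
    for _ in range(num_iters):
        pi_inv = np.abs(x) ** (2.0 - p) + eps   # diagonal of Pi^{-1}(x)
        G = (A * pi_inv) @ A.T                  # A Pi^{-1}(x) A^T
        x = pi_inv * (A.T @ np.linalg.solve(G + beta * np.eye(m), y))
    return x

# Toy example: a 5 x 10 dictionary and a signal built from two of its columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -2.0]
y = A @ x_true
print(np.round(focuss(A, y, p=1.0), 3))  # typically concentrates on entries 2 and 7
```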

Note that for the general regularized FOCUSS algorithm, 2.13, we have β(x̂) = λα(x̂), where λ is the regularization parameter in equation 2.4. The function β(x) is usually generalized to be a function of x̂, y, and the iteration number. Methods for choosing λ include the quality-of-fit criterion, the sparsity criterion, and the L-curve (Engan, 2000; Engan et al., 2000; Rao et al., 2002). The quality-of-fit criterion attempts to minimize the residual error y − Ax (Rao, 1997), which can be shown to converge to a sparse solution (Rao & Kreutz-Delgado, 1999). The sparsity criterion requires that a certain number of elements of each x_k be nonzero.

The L-curve method adjusts λ to optimize the trade-off between the residual and sparsity of x. The plot of d_p(x) versus d_q(y − Ax) has an L shape, the corner of which provides the best trade-off. The corner of the L-curve is the point of maximum curvature and can be found by a one-dimensional maximization of the curvature function (Hansen & O'Leary, 1993).

A hybrid approach known as the modified L-curve method combines the L-curve method on a linear scale and the quality-of-fit criterion, which is used to place limits on the range of λ that can be chosen by the L-curve (Engan, 2000). The modified L-curve method was shown to have good performance, but it requires a one-dimensional numerical optimization step for each x_k at each iteration, which can be computationally expensive for large vectors.

3 Dictionary Learning

3.1 Unknown, Nonrandom Dictionaries. The MLE framework treats parameters to be estimated as unknown but deterministic (nonrandom). In this spirit, we take the dictionary, A, to be the set of unknown but deterministic parameters to be estimated from the observation set Y = Y_N. In particular, given Y_N, the maximum likelihood estimate Â_ML is found from maximizing the likelihood function L(A | Y_N) = P(Y_N; A). Under the assumption that the observations are i.i.d., this corresponds to the optimization


Â_ML = arg max_A ∏_{k=1}^{N} P(y_k; A),   (3.1)

P(y_k; A) = ∫ P(y_k, x; A) dx = ∫ P(y_k | x; A) · P_p(x) dx = ∫ P_q(y_k − Ax) · P_p(x) dx.   (3.2)

Defining the sample average of a function f(y) over the sample set Y_N = (y_1, ..., y_N) by

⟨f(y)⟩_N = (1/N) ∑_{k=1}^{N} f(y_k),

the optimization 3.1 can be equivalently written as

Â_ML = arg min_A −⟨log P(y; A)⟩_N.   (3.3)

Note that P(y_k; A) is equal to the normalization factor β already encountered, but now with the dependence of β on A and the particular sample, y_k, made explicit. The integration in equation 3.2 in general is intractable, and various approximations have been proposed to obtain an approximate maximum likelihood estimate, Â_AML (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000).

In particular, the following approximation has been proposed (Olshausen & Field, 1996),

P_p(x) ≈ δ(x − x̂_k(Â)),   (3.4)

where

x̂_k(Â) = arg max_x P(y_k, x; Â),   (3.5)

for k = 1, ..., N, assuming a current estimate, Â, for A. This approximation corresponds to assuming that the source vector x_k for which y_k = Ax_k is known and equal to x̂_k(Â). With this approximation, the optimization 3.3 becomes

Â_AML = arg min_A ⟨d_q(y − Ax̂)⟩_N,   (3.6)


which is an optimization over the sample average ⟨·⟩_N of the functional 2.4 encountered earlier. Updating our estimate for the dictionary,

Â ← Â_AML,   (3.7)

we can iterate the procedure (3.5)–(3.6) until Â_AML has converged, hopefully (at least in the limit of large N) to Â_ML = Â_ML(Y_N), as the maximum likelihood estimate Â_ML(Y_N) has well-known desirable asymptotic properties in the limit N → ∞.

Performing the optimization in equation 3.6 for the q = 2 i.i.d. gaussian measurement noise case (ν gaussian with known covariance σ²·I) corresponds to taking

d_q(y − Âx̂) = 1/(2σ²) ||y − Âx̂||²,   (3.8)

in equation 3.6. In appendix A, it is shown that we can readily obtain the unique batch solution,

Â_AML = Σ_yx̂ Σ_x̂x̂⁻¹,   (3.9)

Σ_yx̂ = (1/N) ∑_{k=1}^{N} y_k x̂_k^T,   Σ_x̂x̂ = (1/N) ∑_{k=1}^{N} x̂_k x̂_k^T.   (3.10)

Appendix A derives the maximum likelihood estimate of A for the ideal case of known source vectors X = (x_1, ..., x_N),

Known source vector case:  A_ML = Σ_yx Σ_xx⁻¹,

which is, of course, actually not computable since the actual source vectors are assumed to be hidden.

As an alternative to using the explicit solution 3.9, which requires an often prohibitive n × n inversion, we can obtain Â_AML iteratively by gradient descent on equations 3.6 and 3.8,

Â_AML ← Â_AML − μ (1/N) ∑_{k=1}^{N} e_k x̂_k^T,   (3.11)

e_k = Â_AML x̂_k − y_k,   k = 1, ..., N,

for an appropriate choice of the (possibly adaptive) positive step-size parameter μ. The iteration 3.11 can be initialized as Â_AML = Â.
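
As an illustration (not the authors' code), one AML dictionary update given the current sparse codes can be written compactly; in the sketch below, passing mu=None selects the batch solution of equations 3.9-3.10 and a numeric mu selects a single gradient step of equation 3.11. The function name and the mu switch are illustrative choices.

```python
import numpy as np

def aml_dictionary_update(A_hat, X_hat, Y, mu=None):
    """One AML dictionary update given current codes (illustrative sketch).

    Columns of Y are the observations y_k and columns of X_hat the current
    estimates x_hat_k.  mu=None returns the batch solution
    Sigma_yx * Sigma_xx^{-1} (equations 3.9-3.10); otherwise a single
    gradient step of size mu (equation 3.11) is taken.
    """
    N = Y.shape[1]
    Sigma_yx = (Y @ X_hat.T) / N
    Sigma_xx = (X_hat @ X_hat.T) / N
    if mu is None:
        # n x n inversion; can be costly or ill conditioned for large n.
        return Sigma_yx @ np.linalg.inv(Sigma_xx)
    E = A_hat @ X_hat - Y                    # columns are e_k = A x_hat_k - y_k
    return A_hat - mu * (E @ X_hat.T) / N
```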

A general iterative dictionary learning procedure is obtained by nesting the iteration 3.11 entirely within the iteration defined by repeatedly solving equation 3.5 every time a new estimate, Â_AML, of the dictionary becomes available.


However, performing the optimization required in equation 3.5 is generally nontrivial (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000). Recently, we have shown how the use of the FOCUSS algorithm results in an effective algorithm for performing the optimization required in equation 3.5 for the case when ν is gaussian (Rao & Kreutz-Delgado, 1998a; Engan et al., 1999). This approach solves equation 3.5 using the affine-scaling transformation (AST)-like algorithms recently proposed for the low-noise case (Rao & Kreutz-Delgado, 1997, 1999; Kreutz-Delgado & Rao, 1997) and extended by regularization to the nontrivial noise case (Rao & Kreutz-Delgado, 1998a; Engan et al., 1999). As discussed in section 2.5, for the current dictionary estimate, Â, a solution to the optimization problem, 3.5, is provided by the repeated iteration,

x̂_k ← Π⁻¹(x̂_k) Â^T (β(x̂_k) I + Â Π⁻¹(x̂_k) Â^T)⁻¹ y_k,   (3.12)

k = 1, ..., N, with Π(x) defined as in equation 3.18, given below. This is the regularized FOCUSS algorithm (Rao, 1998; Engan et al., 1999), which has an interpretation as an AST-like concave function minimization algorithm. The proposed dictionary learning algorithm alternates between iteration 3.12 and iteration 3.11 (or the direct batch solution given by equation 3.9 if the inversion is tractable). Extensive simulations show the ability of the AST-based algorithm to recover an unknown 20 × 30 dictionary matrix A completely (Engan et al., 1999).

3.2 Unknown, Random Dictionaries. We now generalize to the case where the dictionary A and the source vector set X = X_N = (x_1, ..., x_N) are jointly random and unknown. We add the requirement that the dictionary is known to obey the constraint

A ∈ 𝒜 = compact submanifold of R^{m×n}.

A compact submanifold of R^{m×n} is necessarily closed and bounded. On the constraint submanifold, the dictionary A has the prior probability density function P(A), which in the sequel we assume has the simple (uniform on 𝒜) form,

P(A) = γ χ(A ∈ 𝒜),   (3.13)

where χ(·) is the indicator function and γ is a positive constant chosen to ensure that

P(𝒜) = ∫_𝒜 P(A) dA = 1.

The dictionary A and the elements of the set X are also all assumed to be mutually independent,

P(A, X) = P(A) P(X) = P(A) P_p(x_1) ··· P_p(x_N).


With the set of i.i.d. noise vectors, (ν_1, ..., ν_N), also taken to be jointly random with, and independent of, A and X, the observation set Y = Y_N = (y_1, ..., y_N) is assumed to be generated by the model 1.3. With these assumptions, we have

P(A, X | Y) = P(Y | A, X) P(A, X) / P(Y)   (3.14)
            = γ χ(A ∈ 𝒜) P(Y | A, X) P(X) / P(Y)
            = (γ χ(A ∈ 𝒜) / P(Y)) ∏_{k=1}^{N} P(y_k | A, x_k) P_p(x_k)
            = (γ χ(A ∈ 𝒜) / P(Y)) ∏_{k=1}^{N} P_q(y_k − A x_k) P_p(x_k),

using the facts that the observations are conditionally independent and P(y_k | A, X) = P(y_k | A, x_k).

The joint MAP estimates

(Â_MAP, X̂_MAP) = (Â_MAP, x̂_{1,MAP}, ..., x̂_{N,MAP})

are found by maximizing the a posteriori probability density P(A, X | Y) simultaneously with respect to A ∈ 𝒜 and X. This is equivalent to minimizing the negative logarithm of P(A, X | Y), yielding the optimization problem,

(Â_MAP, X̂_MAP) = arg min_{A ∈ 𝒜, X} ⟨d_q(y − Ax) + λ d_p(x)⟩_N.   (3.15)

Note that this is a joint minimization of the sample average of the functional 2.4, and as such is a natural generalization of the single (with respect to the set of source vectors) optimization previously encountered in equation 3.6. By finding joint MAP estimates of A and X, we obtain a problem that is much more tractable than the one of finding the single MAP estimate of A (which involves maximizing the marginal posterior density P(A | Y)).

The requirement that A ∈ 𝒜, where 𝒜 is a compact and hence bounded subset of R^{m×n}, is sufficient for the optimization problem 3.15 to avoid the degenerate solution,⁶

for k = 1, ..., N,   y_k = A x_k,   with ||A|| → ∞ and ||x_k|| → 0.   (3.16)

This solution is possible for unbounded A because y = Ax is almost always solvable for x since learned overcomplete A's are (generically) onto, and for any solution pair (A, x) the pair ((1/α)A, αx) is also a solution. This fact shows that the inverse problem of finding a solution pair (A, x) is generally ill posed

6 ||A|| is any suitable matrix norm on A.


unless A is constrained to be bounded (as we have explicitly done here) or the cost functional is chosen to ensure that bounded A's are learned (e.g., by adding a term monotonic in the matrix norm ||A|| to the cost function in equation 3.15).

A variety of choices for the compact set 𝒜 are available. Obviously, since different choices of 𝒜 correspond to different a priori assumptions on the set of admissible matrices, A, the choice of this set can be expected to affect the performance of the resulting dictionary learning algorithm. We will consider two relatively simple forms for 𝒜.

3.3 Unit Frobenius-Norm Dictionary Prior. For the i.i.d. q = 2 gaussian measurement noise case of equation 3.8, algorithms that provably converge (in the low step-size limit) to a local minimum of equation 3.15 can be readily developed for the very simple choice,

𝒜_F = {A | ||A||_F = 1} ⊂ R^{m×n},   (3.17)

where ||A||_F denotes the Frobenius norm of the matrix A,

||A||²_F = tr(A^T A) ≜ trace(A^T A),

and it is assumed that the prior P(A) is uniformly distributed on 𝒜_F as per condition 3.13. As discussed in appendix A, 𝒜_F is simply connected, and there exists a path in 𝒜_F between any two matrices in 𝒜_F.

Following the gradient factorization procedure (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999), we factor the gradient of d(x) as

∇d(x) = α(x) Π(x) x,   α(x) > 0,   (3.18)

where it is assumed that Π(x) is diagonal and positive definite for all nonzero x. For example, in the case where d(x) = ||x||_p^p,

Π⁻¹(x) = diag(|x[i]|^{2−p}).   (3.19)

Factorizations for other diversity measures d(x) are given in Kreutz-Delgado and Rao (1997). We also define β(x) = λα(x). As derived and proved in appendix A, a learning law that provably converges to a minimum of equation 3.15 on the manifold 3.17 is then given by

(d/dt) x̂_k = −V_k {(Â^T Â + β(x̂_k) Π(x̂_k)) x̂_k − Â^T y_k},

(d/dt) Â = −μ (δÂ − tr(Â^T δÂ) Â),   μ > 0,   (3.20)


for k = 1, ..., N, where Â is initialized to ||Â||_F = 1, the V_k are n × n positive definite matrices, and the "error" δÂ is

δÂ = ⟨e(x̂) x̂^T⟩_N = (1/N) ∑_{k=1}^{N} e(x̂_k) x̂_k^T,   e(x̂_k) = Â x̂_k − y_k,   (3.21)

which can be rewritten in the perhaps more illuminating form (cf. equations 3.9 and 3.10),

δÂ = Â Σ_x̂x̂ − Σ_yx̂.   (3.22)

A formal convergence proof of equation 3.20 is given in appendix A, where it is also shown that the right-hand side of the second equation in 3.20 corresponds to projecting the error term δÂ onto the tangent space of 𝒜_F, thereby ensuring that the derivative of Â lies in the tangent space. Convergence of the algorithm to a local optimum of equation 3.15 is formally proved by interpreting the loss functional as a Lyapunov function whose time derivative along the trajectories of the adapted parameters (Â, X̂) is guaranteed to be negative definite by the choice of parameter time derivatives shown in equation 3.20. As a consequence of the La Salle invariance principle, the loss functional will decrease in value, and the parameters will converge to the largest invariant set for which the time derivative of the loss functional is identically zero (Khalil, 1996).
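
A quick numerical check of the tangency property (added here for illustration; not from the original text): when ||Â||_F = 1, the projected direction δÂ − tr(Â^T δÂ)Â has zero trace inner product with Â, so the flow of equation 3.20 preserves the unit Frobenius norm to first order. The matrix sizes and random seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 12))
A /= np.linalg.norm(A)                      # Frobenius norm: ||A||_F = 1
dA = rng.standard_normal((8, 12))           # stands in for the error delta-A of equation 3.21
proj = dA - np.trace(A.T @ dA) * A          # projected update direction of equation 3.20
print(np.trace(A.T @ proj))                 # ~0 up to floating-point error: tangent to A_F
```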

Equation 3.20 is a set of coupled (between Â and the vectors x̂_k) nonlinear differential equations that correspond to simultaneous, parallel updating of the estimates Â and x̂_k. This should be compared to the alternated separate (nonparallel) update rules 3.11 and 3.12 used in the AML algorithm described in section 3.1. Note also that (except for the trace term) the right-hand side of the dictionary learning update in equation 3.20 is of the same form as for the AML update law given in equation 3.11 (see also the discretized version of equation 3.20 given in equation 3.28 below). The key difference is the additional trace term in equation 3.20. This difference corresponds to a projection of the update onto the tangent space of the manifold 3.17, thereby ensuring a unit Frobenius norm (and hence boundedness) of the dictionary estimate at all times and avoiding the ill-posedness problem indicated in equation 3.16. It is also of interest to note that choosing V_k to be the positive-definite matrix,

V_k = η_k (Â^T Â + β(x̂_k) Π(x̂_k))⁻¹,   η_k > 0,   (3.23)

in equation 3.20, followed by some matrix manipulations (see equation A.18 in appendix A), yields the alternative algorithm,

(d/dt) x̂_k = −η_k {x̂_k − Π⁻¹(x̂_k) Â^T (β(x̂_k) I + Â Π⁻¹(x̂_k) Â^T)⁻¹ y_k}   (3.24)


with η_k > 0. In any event (regardless of the specific choice of the positive definite matrices V_k, as shown in appendix A), the proposed algorithm outlined here converges to a solution (x̂_k, Â), which satisfies the implicit and nonlinear relationships,

x̂_k = Π⁻¹(x̂_k) Â^T (β(x̂_k) I + Â Π⁻¹(x̂_k) Â^T)⁻¹ y_k,

Â = Σ_yx̂ (Σ_x̂x̂ − γI)⁻¹ ∈ 𝒜_F,   (3.25)

for scalar γ = tr(Â^T δÂ).

To implement the algorithm 3.20 (or the variant using equation 3.24) in discrete time, a first-order forward difference approximation at time t = t_l can be used,

(d/dt) x̂_k(t_l) ≈ (x̂_k(t_{l+1}) − x̂_k(t_l)) / (t_{l+1} − t_l) ≜ (x̂_k[l+1] − x̂_k[l]) / Δ_l.   (3.26)

Applied to equation 3.24, this yields

x̂_k[l+1] = (1 − μ_l) x̂_k[l] + μ_l x̂_k^FOCUSS[l],

x̂_k^FOCUSS[l] = Π⁻¹(x̂_k) Â^T (β(x̂_k) I + Â Π⁻¹(x̂_k) Â^T)⁻¹ y_k,

μ_l = η_k Δ_l ≥ 0.   (3.27)

Similarly, discretizing the Â-update equation yields the A-learning rule, equation 3.28, given below. More generally, taking μ_l to have a value between zero and one, 0 ≤ μ_l ≤ 1, yields an updated value x̂_k[l+1] that is a linear interpolation between the previous value x̂_k[l] and x̂_k^FOCUSS[l].

When implemented in discrete time, and setting μ_l = 1, the resulting Bayesian learning algorithm has the form of a combined iteration where we loop over the operations,

x̂_k ← Π⁻¹(x̂_k) Â^T (β(x̂_k) I + Â Π⁻¹(x̂_k) Â^T)⁻¹ y_k,   k = 1, ..., N, and

Â ← Â − γ (δÂ − tr(Â^T δÂ) Â),   γ > 0.   (3.28)

We call this FOCUSS-based, Frobenius-normalized dictionary-learning algorithm the FOCUSS-FDL algorithm. Again, this merged procedure should be compared to the separate iterations involved in the maximum likelihood approach given in equations 3.11 and 3.12. Equation 3.28, with δÂ given by equation 3.21, corresponds to performing a finite step-size gradient descent on the manifold 𝒜_F.


This projection in equation 3.28 of the dictionary update onto the tangent plane of 𝒜_F (see the discussion in appendix A) ensures the well-behavedness of the MAP algorithm.⁷ The specific step-size choice μ_l = 1, which results in the first equation in equation 3.28, is discussed at length for the low-noise case in Rao and Kreutz-Delgado (1999).

3.4 Column-Normalized Dictionary Prior. Although mathematically very tractable, the unit-Frobenius norm prior, equation 3.17, appears to be somewhat too loose, judging from simulation results given below. In simulations with the Frobenius norm constraint 𝒜_F, some columns of A can tend toward zero, a phenomenon that occurs more often in highly overcomplete A. This problem can be understood by remembering that we are using the d_p(x), p ≠ 0, diversity measure, which penalizes columns associated with terms in x with large magnitudes. If a column a_i has a small relative magnitude, the weight of its coefficient x[i] can be large, and it will be penalized more than a column with a larger norm. This leads to certain columns being underused, which is especially problematic in the overcomplete case.

An alternative, and more restrictive, form of the constraint set 𝒜 is obtained by enforcing the requirement that the columns a_i of A each be normalized (with respect to the Euclidean 2-norm) to the same constant value (Murray & Kreutz-Delgado, 2001). This constraint can be justified by noting that Ax can be written as the nonunique weighted sum of the columns a_i,

Ax = ∑_{i=1}^{n} a_i x[i] = ∑_{i=1}^{n} (a_i / α_i)(α_i x[i]) = A′x′,   for any α_i > 0, i = 1, ..., n,

showing that there is a column-wise ambiguity that remains even after the overall unit-Frobenius norm normalization has occurred, as one can now Frobenius-normalize the new matrix A′.

Therefore, consider the set of matrices on which has been imposed the column-wise constraint that

𝒜_C = {A | ||a_i||² = a_i^T a_i = 1/n, i = 1, ..., n}.   (3.29)

The set 𝒜_C is an mn − n = n(m − 1)-dimensional submanifold of R^{m×n}. Note that every column of a matrix in 𝒜_C has been normalized to the value 1/√n.

7 Because of the discrete-time approximation in equation 3.28, and even more generally because of numerical round-off effects in equation 3.20, a renormalization, Â ← Â/||Â||_F, is usually performed at regular intervals.


In fact, any constant value for the column normalization can be used (including the unit normalization), but, as shown in appendix B, the particular normalization of 1/√n results in 𝒜_C being a proper submanifold of the (mn − 1)-dimensional unit Frobenius manifold 𝒜_F,

𝒜_C ⊂ 𝒜_F,

indicating that a tighter constraint on the matrix A is being imposed. Again, it is assumed that the prior P(A) is uniformly distributed on 𝒜_C in the manner of equation 3.13. As shown in appendix B, 𝒜_C is simply connected.

A learning algorithm is derived for the constraint AC in appendix B,following much the same approach as in appendix A. Because the derivationof the bxk update to �nd sparse solutions does not depend on the form ofthe constraint A, only the bA update in algorithm 3.28 needs to be modi�ed.Each column ai is now updated independently (see equation B.17),

ai à ai ¡c (I ¡baibaTi )dai

i D 1, . . . , n, (3.30)

where dai is the ith column of d bA in equation 3.21. We call the resultingcolumn-normalized dictionary-learning algorithm the FOCUSS-CNDL al-gorithm. The implementation details of the FOCUSS-CNDL algorithm arepresented in section 4.2.
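As an illustration only (ours, not the paper's implementation; names and the default step size are assumptions), the column-wise update of equation 3.30 moves each column along the component of δa_i orthogonal to the unit direction â_i = a_i/‖a_i‖, so the column norms are preserved to first order; the explicit renormalization appears in section 4.2.

```python
import numpy as np

def cndl_column_step(A_hat, delta_A, gamma=1.0):
    """Column-wise dictionary update (a sketch of equation 3.30)."""
    A_new = A_hat.copy()
    for i in range(A_hat.shape[1]):
        a = A_hat[:, i]
        a_unit = a / np.linalg.norm(a)          # a_hat_i
        da = delta_A[:, i]
        # (I - a_hat a_hat^T) da = da - a_hat (a_hat . da)
        A_new[:, i] = a - gamma * (da - a_unit * (a_unit @ da))
    return A_new
```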

4 Algorithm Implementation

The dictionary learning algorithms derived above are an extension of the FOCUSS algorithm used for obtaining sparse solutions to the linear inverse problem y = Ax to the case where dictionary learning is now required. We refer to these algorithms generally as FOCUSS-DL algorithms, with the unit Frobenius-norm prior-based algorithm denoted by FOCUSS-FDL and the column-normalized prior-based algorithm by FOCUSS-CNDL. In this section, the algorithms are stated in the forms implemented in the experimental tests, where it is shown that the column normalization-based algorithm achieves higher performance in the overcomplete dictionary case.

4.1 Unit Frobenius-Norm Dictionary Learning Algorithm. We now summarize the FOCUSS-FDL algorithm derived in section 3.3. For each of the data vectors y_k, we update the sparse source vectors x_k using the FOCUSS algorithm:

$$\Pi^{-1}(\hat{x}_k) = \mathrm{diag}(|\hat{x}_k[i]|^{2-p}),$$

$$\hat{x}_k \leftarrow \Pi^{-1}(\hat{x}_k)\,\hat{A}^T\bigl(\lambda_k I + \hat{A}\,\Pi^{-1}(\hat{x}_k)\,\hat{A}^T\bigr)^{-1}y_k \qquad \text{(FOCUSS)}, \qquad (4.1)$$


where λ_k = β(x̂_k) is the regularization parameter. After updating the N source vectors x_k, k = 1, . . . , N, the dictionary Â is reestimated,

$$\Sigma_{y\hat{x}} = \frac{1}{N}\sum_{k=1}^{N}y_k\hat{x}_k^T, \qquad \Sigma_{\hat{x}\hat{x}} = \frac{1}{N}\sum_{k=1}^{N}\hat{x}_k\hat{x}_k^T,$$

$$\delta\hat{A} = \hat{A}\,\Sigma_{\hat{x}\hat{x}} - \Sigma_{y\hat{x}},$$

$$\hat{A} \leftarrow \hat{A} - \gamma\bigl(\delta\hat{A} - \mathrm{tr}(\hat{A}^T\delta\hat{A})\,\hat{A}\bigr), \qquad \gamma > 0, \qquad (4.2)$$

where γ controls the learning rate. For the experiments in section 5, the data block size is N = 100. During each iteration, all training vectors are updated using equation 4.1, with a corresponding number of dictionary updates using equation 4.2. After each update of the dictionary Â, it is renormalized to have unit Frobenius norm, ‖Â‖_F = 1.

The learning algorithm is a combined iteration, meaning that the FOCUSS algorithm is allowed to run for only one iteration (not until full convergence) before the A update step. This means that during early iterations, the x̂_k are in general not sparse. To facilitate learning A, the covariances Σ_yx̂ and Σ_x̂x̂ are calculated with sparsified x̂_k that have all but the r̃ largest elements set to zero. The value of r̃ is usually set to the largest desired number of nonzero elements, but this choice does not appear to be critical.
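The following sketch (ours, not the authors' code; function names and the data layout are assumptions) shows the two ingredients of one pass of this combined iteration: a single regularized FOCUSS step per training vector (equation 4.1), and covariance estimates built from sparsified copies of the current x̂_k in which all but the r̃ largest-magnitude entries are zeroed.

```python
import numpy as np

def focuss_step(A, x, y, lam, p=1.0):
    """One FOCUSS iteration for a single vector (a sketch of equation 4.1)."""
    Pi_inv = np.diag(np.abs(x) ** (2.0 - p))           # diag(|x[i]|^(2-p))
    G = lam * np.eye(A.shape[0]) + A @ Pi_inv @ A.T
    return Pi_inv @ A.T @ np.linalg.solve(G, y)

def sparsified_covariances(Y, X_hat, r_tilde):
    """Sigma_yx and Sigma_xx from sparsified x_hat (all but r_tilde largest zeroed)."""
    N = Y.shape[1]
    X_sp = np.zeros_like(X_hat)
    for k in range(N):
        keep = np.argsort(np.abs(X_hat[:, k]))[-r_tilde:]
        X_sp[keep, k] = X_hat[keep, k]
    return (Y @ X_sp.T) / N, (X_sp @ X_sp.T) / N
```

In this sketch, the correction δÂ = Â Σ_x̂x̂ − Σ_yx̂ built from these covariances then feeds the dictionary update of equation 4.2.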

The regularization parameter λ_k is taken to be a monotonically increasing function of the iteration number,

$$\lambda_k = \lambda_{\max}\bigl(\tanh\bigl(10^{-3}\cdot(\mathrm{iter} - 1500)\bigr) + 1\bigr). \qquad (4.3)$$

While this choice of λ_k does not have the optimality properties of the modified L-curve method (see section 2.5), it does not require a one-dimensional optimization for each x̂_k and so is much less computationally expensive. This is further discussed below.
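A direct transcription of the schedule in equation 4.3 might look as follows (an illustration, with the offset 1500 and the scale 10^{-3} taken from the text; the only assumption is that `iter_num` counts passes through the data).

```python
import numpy as np

def lambda_schedule(iter_num, lam_max):
    """Monotonically increasing regularization parameter (equation 4.3)."""
    return lam_max * (np.tanh(1e-3 * (iter_num - 1500)) + 1.0)
```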

4.2 Column Normalized Dictionary Learning Algorithm. The improved version of the algorithm called FOCUSS-CNDL, which provides increased accuracy especially in the overcomplete case, was proposed in Murray and Kreutz-Delgado (2001). The three key improvements are column normalization that restricts the learned Â, an efficient way of adjusting the regularization parameter λ_k, and reinitialization to escape from local optima.

The column-normalized learning algorithm discussed in section 3.4 and derived in appendix B is used. Because the x̂_k update does not depend on the constraint set A, the FOCUSS algorithm in equation 4.1 is used to update N vectors, as discussed in section 4.1. After every N source vectors are updated, each column of the dictionary is then updated as

ai à ai ¡c (I ¡baibaTi )dai

i D 1, . . . , n, (4.4)

where dai are the columns of d bA, which is found using equation 4.2. Afterupdating each ai, it is renormalized to kaik2 D 1/n by

ai Ãaip

nkaik, (4.5)

which also ensures that ‖Â‖_F = 1, as shown in section B.1.

The regularization parameter λ_k may be set independently for each vector in the training set, and a number of methods have been suggested, including quality of fit (which requires a certain level of reconstruction accuracy), sparsity (requiring a certain number of nonzero elements), and the L-curve, which attempts to find an optimal trade-off (Engan, 2000). The L-curve method works well, but it requires solving a one-dimensional optimization for each λ_k, which becomes computationally expensive for large problems. Alternatively, we use a heuristic method that allows the trade-off between error and sparsity to be tuned for each application, while letting each training vector y_k have its own regularization parameter λ_k to improve the quality of the solution,

$$\lambda_k = \lambda_{\max}\left(1 - \frac{\|y_k - \hat{A}\hat{x}_k\|}{\|y_k\|}\right), \qquad \lambda_k, \lambda_{\max} > 0. \qquad (4.6)$$

For data vectors that are represented accurately, λ_k will be large, driving the algorithm to find sparser solutions. If the SNR can be estimated, we can set λ_max = (SNR)^{-1}.
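A minimal sketch of this heuristic (ours, not the paper's code; the function name is an assumption): λ_k shrinks toward zero when the current reconstruction of y_k is poor and approaches λ_max when it is accurate.

```python
import numpy as np

def lambda_heuristic(y, A_hat, x_hat, lam_max):
    """Per-vector regularization parameter (a sketch of equation 4.6).

    Equation 4.6 assumes lambda_k > 0, i.e., a relative error below 1.
    """
    rel_err = np.linalg.norm(y - A_hat @ x_hat) / np.linalg.norm(y)
    return lam_max * (1.0 - rel_err)
```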

The optimization problem, 3.15, is concave when p ≤ 1, so there will be multiple local minima. The FOCUSS algorithm is guaranteed to converge only to one of these local minima, but in some cases, it is possible to determine when that has happened by noticing if the sparsity is too low. Periodically (after a large number of iterations), the sparsity of the solutions x̂_k is checked, and if found too low, x̂_k is reinitialized randomly. The algorithm is also sensitive to initial conditions, and prior information may be incorporated into the initialization to help convergence to the global solution.

5 Experimental Results

Experiments were performed using complete dictionaries (n = m) and overcomplete dictionaries (n > m) on both synthetically generated data and natural images. Performance was measured in a number of ways. With synthetic data, performance measures include the SNR of the recovered sources x_k compared to the true generating source and comparing the learned dictionary with the true dictionary. For images of natural scenes, the true underlying sources are not known, so the accuracy and efficiency of the image coding are found.

5.1 Complete Dictionaries: Comparison with ICA. To test the FOCUSS-FDL and FOCUSS-CNDL algorithms, simulated data were created following the method of Engan et al. (1999) and Engan (2000). The dictionary A of size 20 × 20 was created by drawing each element a_ij from a normal distribution with μ = 0, σ² = 1 (written as N(0, 1)), followed by a normalization to ensure that ‖A‖_F = 1. Sparse source vectors x_k, k = 1, . . . , 1000, were created with r = 4 nonzero elements, where the r nonzero locations are selected at random (uniformly) from the 20 possible locations. The magnitudes of each nonzero element were also drawn from N(0, 1) and limited so that |x_kl| > 0.1. The input data y_k were generated using y = Ax (no noise was added).

For the first iteration of the algorithm, the columns of the initialization estimate, Â_init, were taken to be the first n = 20 training vectors y_k. The initial x_k estimates were then set to the pseudoinverse solution $\hat{x}_k = \hat{A}_{\mathrm{init}}^T(\hat{A}_{\mathrm{init}}\hat{A}_{\mathrm{init}}^T)^{-1}y_k$. The constant parameters of the algorithm were set as follows: p = 1.0, γ = 1.0, and λ_max = 2 × 10^{-3} (low noise, assumed SNR ≈ 27 dB). The algorithms were run for 200 iterations through the entire data set, and during each iteration, Â was updated after updating 100 data vectors x̂_k.

To measure performance, the SNR between the recovered sources x̂_k and the true sources x_k was calculated. Each element x_k[i] for fixed i was considered as a time-series vector with 1000 elements, and SNR_i for each was found using

$$\mathrm{SNR}_i = 10\log_{10}\left(\frac{\|x_k[i]\|^2}{\|x_k[i] - \hat{x}_k[i]\|^2}\right). \qquad (5.1)$$

The final SNR is found by averaging SNR_i over the i = 1, . . . , 20 vectors and 20 trials of the algorithms. Because the dictionary A is learned only to within a scaling factor and column permutations, the learned sources must be matched with corresponding true sources and scaled to unit norm before the SNR calculation is done.
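The SNR metric of equation 5.1 can be computed as in the sketch below (an illustration, assuming the sources have already been matched, permuted, and scaled as described above; array names are ours).

```python
import numpy as np

def source_snr_db(X_true, X_rec):
    """Per-source SNR (equation 5.1), averaged over the n sources.

    X_true, X_rec : n x N arrays; row i is the length-N time series x_k[i],
    assumed already matched/permuted/scaled to the true sources.
    """
    num = np.sum(X_true ** 2, axis=1)
    den = np.sum((X_true - X_rec) ** 2, axis=1)
    snr_i = 10.0 * np.log10(num / den)
    return snr_i.mean()
```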

The FOCUSS-FDL and FOCUSS-CNDL algorithms were compared with extended ICA (Lee, Girolami, & Sejnowski, 1999) and FastICA^8 (Hyvarinen et al., 1999). Figure 1 shows the SNR for the tested algorithms. The SNR for FOCUSS-FDL is 27.7 dB, which is a 4.7 dB improvement over extended ICA, and for FOCUSS-CNDL, the SNR is 28.3 dB. The average run time for FOCUSS-FDL/CNDL was 4.3 minutes, for FastICA 0.10 minute, and for extended ICA 0.19 minute on a 1.0 GHz Pentium III Xeon computer.

Figure 1: Comparison between FOCUSS-FDL, FOCUSS-CNDL, extended ICA, and FastICA on synthetically generated data with a complete dictionary A, size 20 × 20. The SNR was computed between the recovered sources x̂_k and the true sources x_k. The mean SNR and standard deviation were computed over 20 trials.

^8 Matlab and C versions of extended ICA can be found on-line at http://www.cnl.salk.edu/~tewon/ICA/code.html. Matlab code for FastICA can be found at http://www.cis.hut.fi/projects/ica/fastica/.

5.2 Overcomplete Dictionaries. To test the ability to recover the true A and x_k solutions in the overcomplete case, dictionaries of size 20 × 30 and 64 × 128 were generated. Diversity r was set to fixed values (4 and 7) and uniformly randomly (5–10 and 10–15). The elements of A and the sources x_k were created as in section 5.1.

The parameters were set as follows: p = 1.0, γ = 1.0, λ_max = 2 × 10^{-3}. The algorithms were run for 500 iterations through the entire data set, and during each iteration, Â was updated after updating 100 data vectors x̂_k.

As a measure of performance, we find the number of columns of A that were matched during learning. Because A can be learned only to within column permutations and sign and scale changes, the columns are normalized so that ‖â_i‖ = ‖a_j‖ = 1, and Â is rearranged columnwise so that â_j is given the index of the closest match in A (in the minimum two-norm sense). A match is counted if

$$1 - |a_i^T\hat{a}_i| < 0.01. \qquad (5.2)$$


Table 1: Synthetic Data Results.

                                          Learned A Columns         Learned x
Algorithm      Size of A   Diversity, r   Average   SD     %        Average   SD      %
FOCUSS-FDL     20 × 30     7              25.3      3.4    84.2     675.9     141.0   67.6
FOCUSS-CNDL    20 × 30     7              28.9      1.6    96.2     846.8     97.6    84.7
FOCUSS-CNDL    64 × 128    7              125.3     2.1    97.9     9414.0    406.5   94.1
FOCUSS-CNDL    64 × 128    5–10           126.3     1.3    98.6     9505.5    263.8   95.1
FOCUSS-FDL     64 × 128    10–15          102.8     4.5    80.3     4009.6    499.6   40.1
FOCUSS-CNDL    64 × 128    10–15          127.4     1.3    99.5     9463.4    330.3   94.6

Similarly, the number of matching x̂_k are counted (after rearranging the elements in accordance with the indices of the rearranged Â),

$$1 - |x_i^T\hat{x}_i| < 0.05. \qquad (5.3)$$

If the data are generated by an A that is not column normalized, other measures of performance need to be used to compare x_k and x̂_k.
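A simplified sketch of the column-matching count of equation 5.2 is given below (ours; it assigns each learned column greedily to its closest true column rather than enforcing a one-to-one rearrangement, and assumes both dictionaries are already column-normalized to unit two-norm).

```python
import numpy as np

def count_matched_columns(A_true, A_learned, tol=0.01):
    """Count learned columns matching true columns (a sketch of equation 5.2).

    Both matrices are assumed column-normalized (||a_i|| = 1).  For unit-norm
    columns, the closest true column in the minimum two-norm sense (up to sign)
    maximizes |a_i^T a_hat_j|; a match is counted when 1 - |a_i^T a_hat_j| < tol.
    """
    matches = 0
    for j in range(A_learned.shape[1]):
        corr = np.abs(A_true.T @ A_learned[:, j])
        if 1.0 - corr.max() < tol:
            matches += 1
    return matches
```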

The performance is summarized in Table 1, which compares the FOCUSS-FDL with the column-normalized algorithm (FOCUSS-CNDL). For the 20 × 30 dictionary, 1000 training vectors were used, and for the 64 × 128 dictionary, 10,000 were used. Results are averaged over four or more trials. For the 64 × 128 matrix and r = 10–15, FOCUSS-CNDL is able to recover 99.5% (127.4/128) of the columns of A and 94.6% (9463/10,000) of the solutions x_k to within the tolerance given above. This shows a clear improvement over FOCUSS-FDL, which learns only 80.3% of the A columns and 40.1% of the solutions x_k.

Learning curves for one of the trials of this experiment (see Figure 2) show that most of the columns of A are learned quickly within the first 100 iterations, and that the diversity of the solutions drops to the desired level. Figure 2b shows that it takes somewhat longer to learn the x_k correctly and that reinitialization of the low-sparsity solutions (at iterations 175 and 350) helps to learn additional solutions. Figure 2c shows the diversity at each iteration, measured as the average number of elements of each x̂_k that are larger than 1 × 10^{-4}.

Figure 2: Performance of the FOCUSS-CNDL algorithm with overcomplete dictionary A, size 64 × 128. (a) Number of correctly learned columns of A at each iteration. (b) Number of sources x_k learned. (c) Average diversity (n-sparsity) of the x_k. The spikes in b and c indicate where some solutions x̂_k were reinitialized because they were not sparse enough.

5.3 Image Data Experiments. Previous work has shown that learned basis functions can be used to code data more efficiently than traditional Fourier or wavelet bases (Lewicki & Olshausen, 1999). The algorithm for finding overcomplete bases in Lewicki and Olshausen (1999) is also designed to solve problem 1.1 but differs from our method in a number of ways, including using only the Laplacian prior (p = 1) and using conjugate gradient optimization for finding sparse solutions (whereas we use the FOCUSS algorithm). It is widely believed that overcomplete representations are more efficient than complete bases, but in Lewicki and Olshausen (1999), the overcomplete code was less efficient for image data (measured in bits per pixel entropy), and it was suggested that different priors could be used to improve the efficiency. Here, we show that our algorithm is able to learn more efficient overcomplete codes for priors with p < 1.

The training data consisted of 10,000 8 × 8 image patches drawn at random from black and white images of natural scenes. The parameter p was varied from 0.5–1.0, and the FOCUSS-CNDL algorithm was trained for 150 iterations. The complete dictionary (64 × 64) was compared with the 2× overcomplete dictionary (64 × 128). Other parameters were set: γ = 0.01, λ_max = 2 × 10^{-3}. The coding efficiency was measured using the entropy (bits per pixel) method described in Lewicki and Olshausen (1999). Figure 3 plots the entropy versus reconstruction error (root mean square error, RMSE) and shows that when p < 0.9, the entropy is less for the overcomplete representation at the same RMSE.

Figure 3: Comparing the coding efficiency of complete and 2× overcomplete representations on 8 × 8 pixel patches drawn from natural images. The points on the curve are the results from different values of p; at the bottom right, p = 1.0, and at the top left, p = 0.5. For smaller p, the overcomplete case is more efficient at the same level of reconstruction error (RMSE).

An example of coding an entire image is shown in Figure 4. The original test image (see Figure 4a) of size 256 × 256 was encoded using the learned dictionaries. Patches from the test image were not used during training. Table 2 gives results for low- and high-compression cases. In both cases, coding with the overcomplete dictionary (64 × 128) gives higher compression (lower bits per pixel) and lower error (RMSE). For the high-compression case (see Figures 4b and 4c), the 64 × 128 overcomplete dictionary gives compression of 0.777 bits per pixel at error 0.328, compared to the 64 × 64 complete dictionary at 0.826 bits per pixel at error 0.329. The amount of compression was selected by adjusting λ_max (the upper limit of the regularization parameter). For high compression, λ_max = 0.02, and for low compression, λ_max = 0.002.

Figure 4: Image compression using complete and overcomplete dictionaries. Coding with an overcomplete dictionary is more efficient (fewer bits per pixel) and more accurate (lower RMSE). (a) Original image of size 256 × 256 pixels. (b) Compressed with 64 × 64 complete dictionary to 0.826 bits per pixel at RMSE = 0.329. (c) Compressed with 64 × 128 overcomplete dictionary to 0.777 bits per pixel at RMSE = 0.328.

Table 2: Image Compression Results.

Dictionary Size   p     Compression (bits/pixel)   RMSE    Average Diversity
64 × 64           0.5   2.450                      0.148   17.3
64 × 128          0.6   2.410                      0.141   15.4
64 × 64           0.5   0.826                      0.329   4.5
64 × 128          0.6   0.777                      0.328   4.0

6 Discussion and Conclusions

We have applied a variety of tools and perspectives (including ideas drawn from Bayesian estimation theory, nonlinear regularized optimization, and the theory of majorization and convex analysis) to the problem of developing algorithms capable of simultaneously learning overcomplete dictionaries and solving sparse source-vector inverse problems.

The test experiment described in section 5.2 is a difficult problem designed to determine if the proposed learning algorithm can solve for the known true solutions for A and the sparse source vectors x_k to the underdetermined inverse problem y_k = Ax_k. Such testing, which does not appear to be regularly done in the literature, shows how well an algorithm can extract stable and categorically meaningful solutions from synthetic data. The ability to perform well on such test inverse problems would appear to be at least a necessary condition for an algorithm to be trustworthy in domains where a physically or biologically meaningful sparse solution is sought, such as occurs in biomedical imaging, geophysical seismic sounding, and multitarget tracking.

The experimental results presented in section 5.2 show that the FOCUSS-DL algorithms can recover the dictionary and the sparse source vectors. This is particularly gratifying when one considers that little or no optimization of the algorithm parameters has been done. Furthermore, the convergence proofs given in the appendixes show convergence only to local optima, whereas one expects that there will be many local optima in the cost function because of the concave prior and the generally multimodal form of the cost function.

One should note that algorithm 3.28 was constructed precisely with the goal of solving inverse problems of the type considered here, and therefore one must be careful when comparing the results given here with other algorithms reported in the literature. For instance, the mixture-of-gaussians prior used in Attias (1999) does not necessarily enforce sparsity. While other algorithms in the literature might perform well on this test experiment, to the best of our knowledge, possible comparably performing algorithms such as Attias (1999), Girolami (2001), Hyvarinen et al. (1999), and Lewicki and Olshausen (1999) have not been tested on large, overcomplete matrices to determine their accuracy in recovering A, and so any comparison along these lines would be premature. In section 5.1, the FOCUSS-DL algorithms were compared to the well-known extended ICA and FastICA algorithms in a more conventional test with complete dictionaries. Performance was measured in terms of the accuracy (SNR) of the recovered sources x_k, and both FOCUSS-DL algorithms were found to have significantly better performance (albeit with longer run times).

We have also shown that the FOCUSS-CNDL algorithm can learn an overcomplete representation, which can encode natural images more efficiently than complete bases learned from data (which in turn are more efficient than standard nonadaptive bases, such as Fourier or wavelet bases; Lewicki & Olshausen, 1999). Studies of the human visual cortex have shown a higher degree of overrepresentation of the fovea compared to other mammals, which suggests an interesting connection between overcomplete representations and visual acuity and recognition abilities (Popovic & Sjostrand, 2001).

Because the coupled dictionary learning and sparse-inverse solving algorithms are merged and run in parallel, it should be possible to run the algorithms in real time to track dictionary evolution in quasistationary environments once the algorithm has essentially converged. One way to do this would be to constantly present randomly encountered new signals, y_k, to the algorithm at each iteration instead of the original training set. One also has to ensure that the dictionary learning algorithm is sensitive to the new data so that dictionary tracking can occur. This would be done by an appropriate adaptive filtering of the current dictionary estimate driven by the new-data derived corrections, similar to techniques used in the adaptive filtering literature (Kalouptsidis & Theodoridis, 1993).

Appendix A: The Frobenius-Normalized Prior Learning Algorithm

Here we provide a derivation of the algorithm 3.20–3.21 and prove that it converges to a local minimum of equation 3.15 on the manifold A_F = {A | ‖A‖_F = 1} ⊂ R^{m×n} defined in equation 3.17. Although we focus on the development of the learning algorithm on A_F, the derivations in sections A.2 and A.3, and the beginning of section A.4, are done for a general constraint manifold A.


A.1 Admissible Matrix Derivatives.

A.1.1 The Constraint Manifold A_F. In order to determine the structural form of admissible derivatives, Ȧ = (d/dt)A, for matrices belonging to A_F,^9 it is useful to view A_F as embedded in the finite-dimensional Hilbert space of matrices, R^{m×n}, with inner product

$$\langle A, B\rangle = \mathrm{tr}(A^TB) = \mathrm{tr}(B^TA) = \mathrm{tr}(AB^T) = \mathrm{tr}(BA^T).$$

The corresponding matrix norm is the Frobenius norm,

$$\|A\| = \|A\|_F = \sqrt{\mathrm{tr}\,A^TA} = \sqrt{\mathrm{tr}\,AA^T}.$$

We will call this space the Frobenius space and the associated inner product the Frobenius inner product. It is useful to note the isometry,

$$A \in \mathbb{R}^{m\times n} \iff \mathbf{A} = \mathrm{vec}(A) \in \mathbb{R}^{mn},$$

where $\mathbf{A}$ is the mn-vector formed by stacking the columns of A. Henceforth, bold type represents the stacked version of a matrix (e.g., $\mathbf{B} = \mathrm{vec}(B)$). The stacked vector $\mathbf{A}$ belongs to the standard Hilbert space R^{mn}, which we shall henceforth refer to as the stacked space. This space has the standard Euclidean inner product and norm,

$$\langle\mathbf{A}, \mathbf{B}\rangle = \mathbf{A}^T\mathbf{B}, \qquad \|\mathbf{A}\| = \sqrt{\mathbf{A}^T\mathbf{A}}.$$

It is straightforward to show that

$$\langle A, B\rangle = \langle\mathbf{A}, \mathbf{B}\rangle \quad\text{and}\quad \|A\| = \|\mathbf{A}\|.$$

In particular, we have

$$A \in A_F \iff \|A\| = \|\mathbf{A}\| = 1.$$

Thus, the manifold in equation 3.17 corresponds to the (mn − 1)–dimensional unit sphere in the stacked space, R^{mn} (which, with a slight abuse of notation, we will continue to denote by A_F). It is evident that A_F is simply connected, so that a path exists between any two elements of A_F and, in particular, a path exists between any initial value for a dictionary, A_init ∈ A_F, used to initialize a learning algorithm, and a desired target value, A_final ∈ A_F.^10

^9 Equivalently, we want to determine the structure of elements Ȧ of the tangent space, TA_F, to the smooth manifold A_F at the point A.

^10 For example, for 0 ≤ t ≤ 1, take the t-parameterized path,

$$A(t) = \frac{(1-t)A_{\mathrm{init}} + tA_{\mathrm{final}}}{\|(1-t)A_{\mathrm{init}} + tA_{\mathrm{final}}\|}.$$


A.1.2 Derivatives on A_F: The Tangent Space TA_F. Determining the form of admissible derivatives on equation 3.17 is equivalent to determining the form of admissible derivatives on the unit R^{mn}-sphere. On the unit sphere, we have the well-known fact that

$$\mathbf{A} \in A_F \implies \frac{d}{dt}\|\mathbf{A}\|^2 = 2\mathbf{A}^T\dot{\mathbf{A}} = 0 \implies \dot{\mathbf{A}} \perp \mathbf{A}.$$

This shows that the general form of $\dot{\mathbf{A}}$ is $\dot{\mathbf{A}} = \Lambda\mathbf{Q}$, where $\mathbf{Q}$ is arbitrary and

$$\Lambda = \left(I - \frac{\mathbf{A}\mathbf{A}^T}{\|\mathbf{A}\|^2}\right) = (I - \mathbf{A}\mathbf{A}^T) \qquad (A.1)$$

is the stacked-space projection operator onto the tangent space of the unit R^{mn}-sphere at the point $\mathbf{A}$ (note that we used the fact that $\|\mathbf{A}\| = 1$). The projection operator Λ is necessarily idempotent, Λ = Λ². Λ is also self-adjoint, Λ = Λ*, where the adjoint operator Λ* is defined by the requirement that

$$\langle\Lambda^*\mathbf{Q}_1, \mathbf{Q}_2\rangle = \langle\mathbf{Q}_1, \Lambda\mathbf{Q}_2\rangle, \qquad \text{for all } \mathbf{Q}_1, \mathbf{Q}_2 \in \mathbb{R}^{mn},$$

showing that Λ is an orthogonal projection operator. In this case, Λ* = Λ^T, so that self-adjointness corresponds to Λ being symmetric. One can readily show that an idempotent, self-adjoint operator is nonnegative, which in this case corresponds to the symmetric, idempotent operator Λ being a positive semidefinite matrix.

This projection can be easily rewritten in the Frobenius space,

$$\dot{\mathbf{A}} = \Lambda\mathbf{Q} = \mathbf{Q} - \langle\mathbf{A},\mathbf{Q}\rangle\mathbf{A} \iff \dot{A} = \mathcal{L}Q = Q - \langle A, Q\rangle A = Q - \mathrm{tr}(A^TQ)\,A. \qquad (A.2)$$

Of course, this result can be derived directly in the Frobenius space using the fact that

$$A \in A_F \implies \frac{d}{dt}\|A\|^2 = 2\langle A, \dot{A}\rangle = 2\,\mathrm{tr}(A^T\dot{A}) = 0,$$

from which it is directly evident that

$$\dot{A} \in TA_F \text{ at } A \in A_F \iff \langle A, \dot{A}\rangle = \mathrm{tr}\,A^T\dot{A} = 0, \qquad (A.3)$$

and therefore Ȧ must be of the form^11

$$\dot{A} = \mathcal{L}Q = Q - \frac{\mathrm{tr}(A^TQ)}{\mathrm{tr}(A^TA)}A = Q - \mathrm{tr}(A^TQ)\,A. \qquad (A.4)$$

^11 It must be the case that $\mathcal{L} = I - |A\rangle\langle A|/\|A\|^2 = I - |A\rangle\langle A|$, using the physicist's "bra-ket" notation (Dirac, 1958).


One can verify that $\mathcal{L}$ is idempotent and self-adjoint and is therefore a nonnegative, orthogonal projection operator. It is the orthogonal projection operator from R^{m×n} onto the tangent space TA_F.
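As a quick numerical sanity check (ours, not part of the original derivation), the Frobenius-space operator $\mathcal{L}Q = Q - \mathrm{tr}(A^TQ)A$ can be verified, for a random unit-Frobenius-norm A, to be idempotent and self-adjoint, to annihilate directions of the form cA (equation A.5), and to produce tangent directions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, 'fro')                    # A lies on the unit Frobenius sphere

def L(Q):
    """Tangent-space projection at A (equation A.4): L Q = Q - tr(A^T Q) A."""
    return Q - np.trace(A.T @ Q) * A

frob = lambda X, Y: np.trace(X.T @ Y)            # Frobenius inner product <X, Y>
Q1, Q2 = rng.standard_normal((m, n)), rng.standard_normal((m, n))

assert np.allclose(L(L(Q1)), L(Q1))                    # idempotent: L^2 = L
assert np.isclose(frob(L(Q1), Q2), frob(Q1, L(Q2)))    # self-adjoint
assert np.allclose(L(A), 0.0)                          # L(cA) = 0, cf. equation A.5
assert np.isclose(frob(A, L(Q1)), 0.0)                 # projected directions are tangent to A_F
```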

In the stacked space (with some additional abuse of notation), we represent the quadratic form for positive semidefinite symmetric matrices W as

$$\|\mathbf{A}\|^2_W = \mathbf{A}^TW\mathbf{A}.$$

Note that this defines a weighted norm if, and only if, W is positive definite, which might not be the case by definition. In particular, when W = Λ, the quadratic form $\|\mathbf{A}\|^2_\Lambda$ is only positive semidefinite. Finally, note from equation A.4 that for all A ∈ A_F,

$$\mathcal{L}Q = 0 \iff Q = cA, \quad \text{with } c = \mathrm{tr}(A^TQ). \qquad (A.5)$$

A.2 Minimizing the Loss Function over a General Manifold A. Consider the Lyapunov function,

$$V_N(X, A) = \langle d_q(y - Ax) + \lambda d_p(x)\rangle_N, \qquad A \in \mathcal{A}, \qquad (A.6)$$

where A is some arbitrary but otherwise appropriately defined constraint manifold associated with the prior, equation 3.13. Note that this is precisely the loss function to be minimized in equation 3.15. If we can determine smooth parameter trajectories (i.e., a parameter-vector adaptation rule) (Ẋ, Ȧ) such that along these trajectories V̇(X, A) ≤ 0, then as a consequence of the La Salle invariance principle (Khalil, 1996), the parameter values will converge to the largest invariant set (of the adaptation rule viewed as a nonlinear dynamical system) contained in the set

$$C = \{(X, A) \mid \dot{V}_N(X, A) \equiv 0 \text{ and } A \in \mathcal{A}\}. \qquad (A.7)$$

The set C contains the local minima of V_N. With some additional technical assumptions (generally dependent on the choice of adaptation rule), the elements of C will contain only local minima of V_N.

Assuming the i.i.d. q = 2 gaussian measurement noise case of equation 3.8,^12 the loss (Lyapunov) function to be minimized is then

$$V_N(X, A) = \left\langle \frac{1}{2}\|Ax - y\|^2 + \lambda d_p(x)\right\rangle_N, \qquad A \in \mathcal{A}, \qquad (A.8)$$

which is essentially the loss function to be minimized in equation 3.15.^13

^12 Note that in the appendix, unlike the notation used in equations 3.8 et seq., the "hat" notation has been dropped. Nonetheless, it should be understood that the quantities A and X are unknown parameters to be optimized over in equation 3.15, while the measured signal vectors Y are known.

^13 The factor 1/2 is added for notational convenience and does not materially affect the derivation.


Suppose for the moment, as in equations 3.4 through 3.9, that X is assumed to be known, and note that then (ignoring constant terms depending on X and Y) V_N can be rewritten as

$$V_N(A) = \langle\mathrm{tr}\,(Ax - y)(Ax - y)^T\rangle_N = \langle\mathrm{tr}(Axx^TA^T) - 2\,\mathrm{tr}(Axy^T) + \mathrm{tr}(yy^T)\rangle_N = \mathrm{tr}\,A\Sigma_{xx}A^T - 2\,\mathrm{tr}\,A\Sigma_{xy} = \mathrm{tr}\{A\Sigma_{xx}A^T - 2A\Sigma_{xy}\},$$

for Σ_{xx} and Σ_{xy} = Σ_{yx}^T defined as in equation 3.10. Using standard results from matrix calculus (Dhrymes, 1984), we can show that V_N(A) is minimized by the solution 3.9. This is done by setting

$$\frac{\partial}{\partial A}V_N(\hat{A}) = 0,$$

and using the identities (valid for W symmetric)

$$\frac{\partial}{\partial A}\mathrm{tr}\,AWA^T = 2AW \quad\text{and}\quad \frac{\partial}{\partial A}\mathrm{tr}\,AB = B^T.$$

This yields (assuming that Σ_{xx} is invertible),

$$\frac{\partial}{\partial A}V_N(\hat{A}) = 2\hat{A}\Sigma_{xx} - 2\Sigma_{xy}^T = 2(\hat{A}\Sigma_{xx} - \Sigma_{yx}) = 0 \implies \hat{A} = \Sigma_{y\hat{x}}\Sigma_{\hat{x}\hat{x}}^{-1},$$

which is equation 3.9, as claimed. For Σ_{xx} nonsingular, the solution is unique and globally optimal. This is, of course, a well-known result.

Now return to the general case, equation A.8, where both X and A are unknown. For the data indexed by k = 1, . . . , N, define the quantities

$$d_k = d_p(x_k), \qquad e(x) = Ax - y, \qquad e_k = Ax_k - y_k.$$

The loss function and its time derivative can be written,

$$V_N(X, A) = \left\langle \frac{1}{2}e^T(x)e(x) + \lambda d_p(x)\right\rangle_N,$$

$$\dot{V}_N(X, A) = \langle e^T(x)\dot{e}(x) + \lambda\nabla^Td_p(x)\dot{x}\rangle_N = \langle e^T(x)(A\dot{x} + \dot{A}x) + \lambda\nabla^Td_p(x)\dot{x}\rangle_N.$$

Then, to determine an appropriate adaptation rule, note that

$$\dot{V}_N = T_1 + T_2, \qquad (A.9)$$


where

$$T_1 = \langle(e^T(x)A + \lambda\nabla^Td_p(x))\dot{x}\rangle_N = \frac{1}{N}\sum_{k=1}^{N}(e_k^TA + \lambda\nabla^Td_k)\dot{x}_k \qquad (A.10)$$

and

$$T_2 = \langle e^T(x)\dot{A}x\rangle_N = \frac{1}{N}\sum_{k=1}^{N}e_k^T\dot{A}x_k. \qquad (A.11)$$

Enforcing the separate conditions

$$T_1 \le 0 \quad\text{and}\quad T_2 \le 0 \qquad (A.12)$$

(as well as the additional condition that A ∈ A) will be sufficient to ensure that V̇_N ≤ 0 on A. In this case, the solution-containing set C of equation A.7 is given by

$$C = \{(X, A) \mid T_1(X, A) \equiv 0,\ T_2(X, A) \equiv 0 \text{ and } A \in \mathcal{A}\}. \qquad (A.13)$$

Note that if A is known and fixed, then T_2 ≡ 0, and only the first condition of equation A.12 (which enforces learning of the source vectors, x_k) is of concern. Contrariwise, if source vectors x_k, which ensure that e(x_k) = 0, are fixed and known, then T_1 ≡ 0, and the second condition of equation A.12 (which enforces learning of the dictionary matrix, A) is at issue.

A.3 Obtaining the x_k Solutions with the FOCUSS Algorithm. We now develop the gradient factorization-based derivation of the FOCUSS algorithm, which provides estimates of x_k while satisfying the first convergence condition of equation A.12. The constraint manifold A is still assumed to be arbitrary. To enforce the condition T_1 ≤ 0 and derive the first adaptation rule given in equation 3.20, we note that we can factor ∇d_k = ∇d(x_k) as (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999)

$$\nabla d_k = \alpha_k\Pi_kx_k,$$

with α_k = α_{x_k} > 0 and Π_{x_k} = Π_k positive definite and diagonal for all nonzero x_k. Then, defining β_k = λα_k > 0 and selecting an arbitrary set of (adaptable) symmetric positive-definite matrices V_k, we choose the learning rule

$$\dot{x}_k = -V_k\{A^Te_k + \lambda\nabla d_k\} = -V_k\{(A^TA + \beta_k\Pi_k)x_k - A^Ty_k\}, \qquad k = 1, \ldots, N, \qquad (A.14)$$


which is the adaptation rule for the state estimates x_k = x̂_k given in the first line of equation 3.20. With this choice, we obtain

$$T_1 = -\langle\|A^Te(x) + \lambda\nabla d(x)\|^2_V\rangle_N = -\frac{1}{N}\sum_{k=1}^{N}\|A^Te_k + \lambda\nabla d_k\|^2_{V_k} = -\frac{1}{N}\sum_{k=1}^{N}\|(A^TA + \beta_k\Pi_k)x_k - A^Ty_k\|^2_{V_k} \le 0, \qquad (A.15)$$

as desired. Assuming convergence to the set A.13 (which will be seen to be the case after we show below how to ensure that we also have T_2 ≤ 0), we will asymptotically obtain (reintroducing the "hat" notation to now denote converged parameter estimates)

$$\|(\hat{A}^T\hat{A} + \beta_k\Pi_k)\hat{x}_k - \hat{A}^Ty_k\|^2_{V_k} \equiv 0, \qquad k = 1, \ldots, N,$$

which is equivalent to

$$\hat{x}_k = (\hat{A}^T\hat{A} + \beta_k\Pi_k)^{-1}\hat{A}^Ty_k, \qquad k = 1, \ldots, N, \qquad (A.16)$$

at convergence. This is also equivalent to the condition given in the first line of equation 3.25, as shown below.

Exploiting the fact that V_k in equation A.14 are arbitrary (subject to the symmetry and positive-definiteness constraint), let us make the specific choice shown in equation 3.23,

$$V_k = \gamma_k(A^TA + \beta_k\Pi_k)^{-1}, \qquad \gamma_k > 0, \quad k = 1, \ldots, N. \qquad (A.17)$$

Also note the (trivial) identity,

$$(A^TA + \beta\Pi)\Pi^{-1}A^T = A^T(A\Pi^{-1}A^T + \beta I),$$

which can be recast nontrivially as

$$(A^TA + \beta\Pi)^{-1}A^T = \Pi^{-1}A^T(\beta I + A\Pi^{-1}A^T)^{-1}. \qquad (A.18)$$

With equations A.17 and A.18, the learning rule, A.14, can be recast as

$$\dot{x}_k = -\gamma_k\{x_k - \Pi_k^{-1}A^T(\beta_kI + A\Pi_k^{-1}A^T)^{-1}y_k\}, \qquad k = 1, \ldots, N, \qquad (A.19)$$

which is the alternative learning algorithm, 3.24. At convergence (when T_1 ≡ 0) we have the condition shown in the first line of equation 3.25,

$$\hat{x}_k = \Pi^{-1}\hat{A}^T(\beta_kI + \hat{A}\Pi^{-1}\hat{A}^T)^{-1}y_k. \qquad (A.20)$$


This also follows from the convergence condition A.16 and the identity A.18, showing that the result, A.20, is independent of the specific choice of V_k > 0. Note from equations A.14 and A.15 that T_1 ≡ 0 also results in ẋ_k ≡ 0 for k = 1, . . . , N, so that we will have converged to constant values, x̂_k, which satisfy equation A.20.

For the case of known, fixed A, the learning rule derived here will converge to sparse solutions, x_k, and when discretized as in section 3.3, yields equation 3.27, which is the known-dictionary FOCUSS algorithm (Rao & Kreutz-Delgado, 1998a, 1999).
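The equivalence of the two fixed-point expressions A.16 and A.20 rests on the identity A.18; the short check below (an illustration, not from the paper) verifies the identity numerically for a random A and a random positive diagonal Π.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, beta = 5, 8, 0.3
A = rng.standard_normal((m, n))
Pi = np.diag(rng.uniform(0.5, 2.0, size=n))     # positive definite diagonal Pi
Pi_inv = np.linalg.inv(Pi)

# Left side of equation A.18: (A^T A + beta Pi)^{-1} A^T  (n x n inverse)
lhs = np.linalg.solve(A.T @ A + beta * Pi, A.T)
# Right side: Pi^{-1} A^T (beta I + A Pi^{-1} A^T)^{-1}    (m x m inverse)
rhs = Pi_inv @ A.T @ np.linalg.inv(beta * np.eye(m) + A @ Pi_inv @ A.T)

assert np.allclose(lhs, rhs)   # hence x_hat in A.16 equals x_hat in A.20
```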

Note that the derivation of the sparse source-vector learning algorithm here, which enforces the condition T_1 ≤ 0, is entirely independent of any constraints placed on A (such as, for example, the unit Frobenius-norm and column-norm constraints considered in this article) or of the form of the A-learning rule. Thus, alternative choices of constraints placed on A, as considered in Murray and Kreutz-Delgado (2001) and described in appendix B, will not change the form of the x_k-learning rule derived here. Of course, because the x_k-learning rule is strongly coupled to the A-learning rule, algorithmic performance and speed of convergence may well be highly sensitive to conditions placed on A and the specific A-learning algorithm used.

A.4 Learning the Dictionary A.

A.4.1 General Results. We now turn to the enforcement of the second convergence condition, T_2 ≤ 0, and the development of the dictionary adaptation rule shown in equation 3.20. First, as in equation 3.21, we define the error term δA as

$$\delta A = \langle e(x)x^T\rangle_N = \frac{1}{N}\sum_{k=1}^{N}e(x_k)x_k^T = A\Sigma_{xx} - \Sigma_{yx}, \qquad (A.21)$$

using the fact that e(x) = Ax − y. Then, from equation A.11, we have

$$T_2 = \langle e^T(x)\dot{A}x\rangle_N = \langle\mathrm{tr}(xe^T(x)\dot{A})\rangle_N = \mathrm{tr}(\langle xe^T(x)\rangle_N\dot{A}) = \mathrm{tr}(\delta A^T\dot{A}) = \delta\mathbf{A}^T\dot{\mathbf{A}}. \qquad (A.22)$$

So far, these steps are independent of any specific constraints that may be placed on A, other than that the manifold A be smooth and compact. With A constrained to lie on a specified smooth, compact manifold, to ensure correct learning behavior, it is sufficient to impose the constraint that Ȧ lies in the tangent space to the manifold and the condition that T_2 ≤ 0.

A.4.2 Learning on the Unit Frobenius Sphere, A_F. To ensure that T_2 ≤ 0 and that Ȧ is in the tangent space of the unit sphere in the Frobenius space R^{mn}, we take

$$\dot{\mathbf{A}} = -\mu\Lambda\,\delta\mathbf{A} \iff \dot{A} = -\mu\mathcal{L}\,\delta A = -\mu(\delta A - \mathrm{tr}(A^T\delta A)A), \qquad \mu > 0, \qquad (A.23)$$


which is the adaptation rule given in equation 3.20. With this choice, and using the positive semidefiniteness of Λ, we have

$$T_2 = -\mu\|\delta\mathbf{A}\|^2_\Lambda \le 0,$$

as required. Note that at convergence, the condition T_2 ≡ 0 yields Ȧ ≡ 0, so that we will have converged to constant values for the dictionary elements, and

$$0 = \mathcal{L}\,\delta\hat{A} = \mathcal{L}(\hat{A}\Sigma_{\hat{x}\hat{x}} - \Sigma_{y\hat{x}}) \implies \delta\hat{A} = (\hat{A}\Sigma_{\hat{x}\hat{x}} - \Sigma_{y\hat{x}}) = c\hat{A}, \qquad (A.24)$$

from equation A.5, where c = tr(Â^TδÂ). Thus, the steady-state solution is

$$\hat{A} = \Sigma_{y\hat{x}}(\Sigma_{\hat{x}\hat{x}} - cI)^{-1} \in A_F. \qquad (A.25)$$

Note that equations A.20 and A.25 are the steady-state values given earlier in equation 3.25.

Appendix B: Convergence of the Column-Normalized Learning Algorithm

The derivation and proof of convergence of the column-normalized learning algorithm, applicable to learning members of A_C, is accomplished by appropriately modifying key steps of the development given in appendix A. As in appendix A, the standard Euclidean norm and inner product apply to column vectors, while the Frobenius norm and inner product apply to m × n matrix elements of R^{m×n}.

B.1 Admissible Matrix Derivatives.

B.1.1 The Constraint Manifold A_C. Let e_i = (0 · · · 0 1 0 · · · 0)^T ∈ R^n be the canonical unit vector whose components are all zero except for the value 1 in the ith location. Then

$$I = \sum_{i=1}^{n}e_ie_i^T \quad\text{and}\quad a_i = Ae_i.$$

Note that

$$\|a_i\|^2 = a_i^Ta_i = (Ae_i)^TAe_i = e_i^TA^TAe_i = \mathrm{tr}\,e_ie_i^TA^TA = \mathrm{tr}\,M_i^TA = \langle M_i, A\rangle,$$

where

$$M_i \triangleq Ae_ie_i^T = a_ie_i^T = [0\;0\;\cdots\;0\;a_i\;0\;\cdots\;0] \in \mathbb{R}^{m\times n}. \qquad (B.1)$$

Note that only the ith column of M_i is nonzero and is equal to a_i, the ith column of A. We therefore have that

$$A = \sum_{i=1}^{n}M_i. \qquad (B.2)$$


Also, for i, j = 1, . . . , n,

$$\langle M_i, M_j\rangle = \mathrm{tr}\,M_i^TM_j = \mathrm{tr}\,e_ie_i^TA^TAe_je_j^T = \mathrm{tr}\,e_i^TA^TAe_je_j^Te_i = \|a_i\|^2\delta_{i,j}, \qquad (B.3)$$

where δ_{i,j} is the Kronecker delta. Note, in particular, that ‖a_i‖ = ‖M_i‖.

Let A ∈ A_C, where A_C is the set of column-normalized matrices, as defined by equation 3.29. A_C is an mn − n = n(m − 1)–dimensional submanifold of the Frobenius space R^{m×n}, as each of the n columns of A is normalized as

$$\|a_i\|^2 = \|M_i\|^2 \equiv \frac{1}{n},$$

and

$$\|A\|^2 = \mathrm{tr}\,A^TA = \mathrm{tr}\sum_{i=1}^{n}M_i^T\sum_{j=1}^{n}M_j = \sum_{i=1}^{n}\|a_i\|^2 = \frac{n}{n} = 1,$$

using linearity of the trace function and property B.3. It is evident, therefore, that

$$A_C \subset A_F,$$

for A_F defined by equation 3.17.

We can readily show that A_C is simply connected, so that a continuous path exists from any matrix A = A_init ∈ A_C to any other column-normalized matrix A′ = A_final ∈ A_C. Indeed, let A and A′ be such that

$$A = [a_1, \ldots, a_n], \qquad A' = [a'_1, \ldots, a'_n], \qquad \|a_i\| = \|a'_j\| = 1/\sqrt{n}, \quad \forall i, j = 1, \ldots, n.$$

There is obviously a continuous path entirely in A_C from A ∈ A_C to the intermediate matrix [a′_1, a_2, . . . , a_n] ∈ A_C. Similarly, there is a continuous path entirely in A_C from [a′_1, a_2, . . . , a_n] to [a′_1, a′_2, . . . , a_n], and so on to [a′_1, . . . , a′_n] = A′. To summarize, A_C is a simply connected (nm − n)–dimensional submanifold of the simply connected (nm − 1)–dimensional manifold A_F, and they are both submanifolds of the (nm)–dimensional Frobenius space R^{m×n}.

B.1.2 Derivatives on A_C: The Tangent Space TA_C. For convenience, for i = 1, . . . , n, define

$$\hat{a}_i = \frac{a_i}{\|a_i\|} = \sqrt{n}\,a_i, \qquad \|\hat{a}_i\| = 1,$$

and

$$\hat{M}_i = \frac{M_i}{\|M_i\|} = \sqrt{n}\,M_i = \hat{a}_ie_i^T, \qquad \|\hat{M}_i\| = 1.$$


Note that equation B.3 yields

$$\langle\hat{M}_i, \hat{M}_j\rangle = \mathrm{tr}\,\hat{M}_i^T\hat{M}_j = \delta_{i,j}. \qquad (B.4)$$

For any A ∈ A_C we have, for each i = 1, . . . , n,

$$0 = \frac{d}{dt}\|a_i\|^2 = \frac{d}{dt}a_i^Ta_i = \frac{d}{dt}e_i^TA^TAe_i = 2e_i^TA^T\dot{A}e_i = 2\,\mathrm{tr}\,e_ie_i^TA^T\dot{A} = 2\,\mathrm{tr}\,M_i^T\dot{A},$$

or

$$A \in A_C \implies \langle M_i, \dot{A}\rangle = \mathrm{tr}\,M_i^T\dot{A} = 0 \quad\text{for } i = 1, \ldots, n. \qquad (B.5)$$

In fact,

$$\dot{A} \in TA_C \text{ at } A \in A_C \iff \langle\hat{M}_i, \dot{A}\rangle = \mathrm{tr}\,\hat{M}_i^T\dot{A} = 0 \quad\text{for } i = 1, \ldots, n. \qquad (B.6)$$

Note from equations B.2 and B.5 that

$$\langle A, \dot{A}\rangle = \left\langle\sum_{i=1}^{n}M_i,\ \dot{A}\right\rangle = 0,$$

showing that (see equation A.3) TA_C ⊂ TA_F, as expected from the fact that A_C ⊂ A_F.

An important and readily proven fact is that for each i = 1, . . . , n,

$$\mathrm{tr}\,\hat{M}_i^T\dot{A} = 0 \iff \dot{A} = P_iQ_i \qquad (B.7)$$

for some matrix Q_i and projection operator, P_i, defined by

$$P_iQ \triangleq Q - \hat{M}_i\langle\hat{M}_i, Q\rangle = Q - \hat{M}_i\,\mathrm{tr}\,\hat{M}_i^TQ. \qquad (B.8)$$

From equation B.4, it can be shown that the projection operators commute,

$$P_iP_j = P_jP_i, \qquad i \ne j, \quad i, j = 1, \ldots, n, \qquad (B.9)$$

and are idempotent,

$$P_i^2 = P_i, \qquad i = 1, \ldots, n. \qquad (B.10)$$

Indeed, it is straightforward to show from equation B.4 that for all Q,

$$P_iP_jQ = P_jP_iQ = Q - \hat{M}_i\langle\hat{M}_i, Q\rangle - \hat{M}_j\langle\hat{M}_j, Q\rangle, \qquad \text{for all } i \ne j, \qquad (B.11)$$


and

$$P_i^2Q = Q - \hat{M}_i\langle\hat{M}_i, Q\rangle = P_iQ, \qquad i = 1, \ldots, n. \qquad (B.12)$$

It can also be shown that

$$\langle P_iQ_1, Q_2\rangle = \langle Q_1, P_iQ_2\rangle,$$

showing that P_i is self-adjoint, P_i = P_i^*.

Define the operator,

$$P \triangleq P_1P_2\cdots P_n. \qquad (B.13)$$

Note that because of the commutativity of the P_i, the order of multiplication on the right-hand side of equation B.13 is immaterial, and it is easily determined that P is idempotent,

$$P^2 = P.$$

By induction on equation B.11, it is readily shown that

$$PQ = Q - \hat{M}_1\langle\hat{M}_1, Q\rangle - \cdots - \hat{M}_n\langle\hat{M}_n, Q\rangle = Q - \sum_{i=1}^{n}\hat{M}_i\langle\hat{M}_i, Q\rangle. \qquad (B.14)$$

Either from the self-adjointness and idempotency of each P_i, or directly from equation B.14, it can be shown that P itself is self-adjoint, P = P^*,

$$\langle PQ_1, Q_2\rangle = \langle Q_1, PQ_2\rangle.$$

Thus, P is the orthogonal projection operator from R^{m×n} onto TA_C.

As a consequence, we have the key result that

$$\dot{A} \in TA_C \text{ at } A \iff \dot{A} = PQ, \qquad (B.15)$$

for some matrix Q ∈ R^{m×n}. This follows from the fact that given Q, then the right-hand side of equation B.6 is true for Ȧ = PQ. On the other hand, if the right-hand side of equation B.6 is true for Ȧ, we can take Q = Ȧ in equation B.15. Note that for the TA_F–projection operator $\mathcal{L}$ given by equation A.2, we have that

$$P = \mathcal{L}P = P\mathcal{L},$$

consistent with the fact that TA_C ⊂ TA_F.

Let q_i, i = 1, . . . , n, be the columns of Q = [q_1, . . . , q_n]. The operation Q′ = P_jQ corresponds to

$$q'_i = q_i, \quad i \ne j, \qquad\text{and}\qquad q'_j = (I - \hat{a}_j\hat{a}_j^T)q_j,$$

while the operation Q′ = PQ corresponds to

$$q'_i = (I - \hat{a}_i\hat{a}_i^T)q_i, \qquad i = 1, \ldots, n.$$
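A brief numerical check (ours, not part of the derivation) that this column-wise formula reproduces the operator P of equation B.14, and that the result is orthogonal to every M̂_i as required by equation B.6; the matrix sizes and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 6
A = rng.standard_normal((m, n))
A /= np.sqrt(n) * np.linalg.norm(A, axis=0)    # columns normalized to ||a_i|| = 1/sqrt(n)
A_unit = np.sqrt(n) * A                        # a_hat_i = sqrt(n) a_i, unit columns

Q = rng.standard_normal((m, n))

# Column-wise form: q'_i = (I - a_hat_i a_hat_i^T) q_i
PQ_cols = Q - A_unit * np.sum(A_unit * Q, axis=0)

# Operator form (equation B.14): P Q = Q - sum_i M_hat_i <M_hat_i, Q>
PQ_op = Q.copy()
for i in range(n):
    M_hat = np.zeros((m, n))
    M_hat[:, i] = A_unit[:, i]                 # M_hat_i = a_hat_i e_i^T
    PQ_op -= M_hat * np.trace(M_hat.T @ Q)

assert np.allclose(PQ_cols, PQ_op)
# Tangency (equation B.6): a_hat_i^T (P Q)_i = 0 for every i
assert np.allclose(np.sum(A_unit * PQ_cols, axis=0), 0.0)
```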


B.2 Learning on the Manifold A_C. The development of equations A.6 through A.22 is independent of the precise nature of the constraint manifold A and can be applied here to the specific case of A = A_C. To ensure that T_2 ≤ 0 in equation A.22 and that Ȧ ∈ TA_C, we can use the learning rule,

$$\dot{A} = -\mu P\,\delta A = -\mu\left(\delta A - \sum_{i=1}^{n}\hat{M}_i\,\mathrm{tr}\,\hat{M}_i^T\delta A\right), \qquad (B.16)$$

for μ > 0 and δA given by equation A.21. With the ith column of δA denoted by δa_i, this corresponds to the learning rule,

$$\dot{a}_i = -\mu(I - \hat{a}_i\hat{a}_i^T)\,\delta a_i, \qquad i = 1, \ldots, n. \qquad (B.17)$$

With rule B.16, we have

$$T_2 = \langle\delta A, \dot{A}\rangle = -\mu\langle\delta A, P\delta A\rangle = -\mu\langle\delta A, P^2\delta A\rangle = -\mu\langle P\delta A, P\delta A\rangle = -\mu\|P\delta A\|^2 \le 0,$$

where we have explicitly shown that the idempotency and self-adjointness of P correspond to its being a nonnegative operator. Thus, we will have convergence to the largest invariant set for which Ȧ ≡ 0, which from equation B.16 is equivalent to the set for which

$$P\delta A = \delta A - \sum_{i=1}^{n}\hat{M}_i\langle\hat{M}_i, \delta A\rangle \equiv 0. \qquad (B.18)$$

This, in turn, is equivalent to

$$\delta A = \sum_{i=1}^{n}\hat{M}_i\langle\hat{M}_i, \delta A\rangle = \sum_{i=1}^{n}nM_i\langle M_i, \delta A\rangle = \sum_{i=1}^{n}c_iM_i, \qquad (B.19)$$

with

$$c_i \triangleq n\langle M_i, \delta A\rangle = na_i^T\delta a_i, \qquad i = 1, \ldots, n.$$

An equivalent statement to equation B.19 is

$$\delta a_i = c_ia_i, \qquad i = 1, \ldots, n.$$

Defining the diagonal matrix

$$C = \mathrm{diag}[c_1, \ldots, c_n]$$

and recalling the definitions A.21 and B.1, we obtain from equation B.19 that

$$A\Sigma_{xx} - \Sigma_{yx} = \delta A = AC,$$

which can be solved as

$$A = \Sigma_{yx}(\Sigma_{xx} - C)^{-1}. \qquad (B.20)$$

This is the general form of the solution found by the A_C–learning algorithm.


Acknowledgments

This research was partially supported by NSF grant CCR-9902961. K. K.-D. also acknowledges the support provided by the Computational Neurobiology Laboratory, Salk Institute, during a 1998–1999 sabbatical visit.

References

Adler, J., Rao, B., & Kreutz-Delgado, K. (1996). Comparison of basis selection methods. In Proceedings of the 30th Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, CA: IEEE.

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Basilevsky, A. (1994). Statistical factor analysis and related methods: Theory and applications. New York: Wiley.

Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Coifman, R., & Wickerhauser, M. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, IT-38(2), 713–718.

Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36, 287–314.

Deco, G., & Obradovic, D. (1996). An information-theoretic approach to neural computing. Berlin: Springer-Verlag.

Dhrymes, P. (1984). Mathematics for econometrics (2nd ed.). New York: Springer-Verlag.

Dirac, P. A. M. (1958). The principles of quantum mechanics. Oxford: Clarendon Press.

Donoho, D. (1994). On minimum entropy segmentation. In C. Chui, L. Montefusco, & L. Puccio (Eds.), Wavelets: Theory, algorithms, and applications (pp. 233–269). San Diego, CA: Academic Press.

Donoho, D. (1995). De-noising by soft-thresholding. IEEE Trans. on Information Theory, 41, 613–627.

Engan, K. (2000). Frame based signal representation and compression. Unpublished doctoral dissertation, Stavanger University College, Norway.

Engan, K., Rao, B., & Kreutz-Delgado, K. (1999). Frame design using FOCUSS with method of optimal directions (MOD). In Proc. NORSIG-99. Tronheim, Norway: IEEE.

Engan, K., Rao, B. D., & Kreutz-Delgado, K. (2000). Regularized FOCUSS for subset selection in noise. In Proc. NORSIG 2000, Sweden, June, 2000 (pp. 247–250). Kolmarden, Sweden: IEEE.

Field, D. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–599.

Girolami, M. (2001). A variational method for learning sparse and overcomplete representations. Neural Computation, 13, 2517–2532.

Gorodnitsky, I., George, J., & Rao, B. (1995). Neuromagnetic source imaging with FOCUSS: A recursive weighted minimum norm algorithm. Journal of Electroencephalography and Clinical Neurophysiology, 95(4), 231–251.


Gorodnitsky, I., & Rao, B. (1997). Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Trans. Signal Processing, 45(3), 600–616.

Hansen, P. C., & O'Leary, D. P. (1993). The use of the L-curve in the regularization of discrete ill-posed problems. SIAM J. Sci. Comput., 14, 1487–1503.

Hyvarinen, A., Cristescu, R., & Oja, E. (1999). A fast algorithm for estimating overcomplete ICA bases for image windows. In Proc. Int. Joint Conf. on Neural Networks (IJCNN'99) (pp. 894–899). Washington, DC.

Kalouptsidis, T., & Theodoridis, S. (1993). Adaptive system identification and signal processing algorithms. Upper Saddle River, NJ: Prentice Hall.

Kassam, S. (1982). Signal detection in non-gaussian noise. New York: Springer-Verlag.

Khalil, H. (1996). Nonlinear systems (2nd ed.). Upper Saddle River, NJ: Prentice Hall.

Kreutz-Delgado, K., & Rao, B. (1997). A general approach to sparse basis selection: Majorization, concavity, and affine scaling (Full Report with Proofs) (Center for Information Engineering Report No. UCSD-CIE-97-7-1). San Diego: University of California. Available on-line at: http://dsp.ucsd.edu/~kreutz.

Kreutz-Delgado, K., & Rao, B. (1998a). A new algorithm and entropy-like measures for sparse coding. In Proc. 5th Joint Symp. Neural Computation. La Jolla, CA.

Kreutz-Delgado, K., & Rao, B. (1998b). Gradient factorization-based algorithm for best-basis selection. In Proceedings of the 8th IEEE Digital Signal Processing Workshop. New York: IEEE.

Kreutz-Delgado, K., & Rao, B. (1998c). Measures and algorithms for best basis selection. In Proceedings of the 1998 ICASSP. New York: IEEE.

Kreutz-Delgado, K., & Rao, B. (1999). Sparse basis selection, ICA, and majorization: Towards a unified perspective. In Proc. 1999 ICASSP. Phoenix, AZ.

Kreutz-Delgado, K., Rao, B., & Engan, K. (1999). Novel algorithms for learning overcomplete dictionaries. In Proc. 1999 Asilomar Conference. Pacific Grove, CA: IEEE.

Kreutz-Delgado, K., Rao, B., Engan, K., Lee, T.-W., & Sejnowski, T. (1999a). Convex/Schur-convex (CSC) log-priors and sparse coding. In Proc. 6th Joint Symposium on Neural Computation. Pasadena, CA.

Kreutz-Delgado, K., Rao, B., Engan, K., Lee, T.-W., & Sejnowski, T. (1999b). Learning overcomplete dictionaries: Convex/Schur-convex (CSC) log-priors, factorial codes, and independent/dependent component analysis (I/DCA). In Proc. 6th Joint Symposium on Neural Computation. Pasadena, CA.

Lee, T.-W., Girolami, M., & Sejnowski, T. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11(2), 417–441.

Lewicki, M. S., & Olshausen, B. A. (1999). A probabilistic framework for the adaptation and comparison of image codes. J. Opt. Soc. Am. A, 16(7), 1587–1601.

Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.

Mallat, S., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. ASSP, 41(12), 3397–3415.


Marshall, A., & Olkin, I. (1979). Inequalities: Theory of majorization and its applications. San Diego, CA: Academic Press.

Murray, J. F., & Kreutz-Delgado, K. (2001). An improved FOCUSS-based learning algorithm for solving sparse linear inverse problems. In Conference Record of the 35th Asilomar Conference on Signals, Systems and Computers. Pacific Grove, CA: IEEE.

Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low noise limit: A factorial code maximizes information transfer. Network, 5, 565–581.

Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Olshausen, B., & Field, D. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.

Pham, D. (1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Trans. Signal Processing, 44(11), 2768–2779.

Popovic, Z., & Sjostrand, J. (2001). Resolution, separation of retinal ganglion cells, and cortical magnification in humans. Vision Research, 41, 1313–1319.

Rao, B. (1997). Analysis and extensions of the FOCUSS algorithm. In Proceedings of the 1996 Asilomar Conference on Circuits, Systems, and Computers (Vol. 2, pp. 1218–1223). New York: IEEE.

Rao, B. (1998). Signal processing with the sparseness constraint. In Proc. 1998 ICASSP. New York: IEEE.

Rao, B. D., Engan, K., Cotter, S. F., & Kreutz-Delgado, K. (2002). Subset selection in noise based on diversity measure minimization. Manuscript submitted for publication.

Rao, B., & Gorodnitsky, I. (1997). Affine scaling transformation based methods for computing low complexity sparse solutions. In Proc. 1996 ICASSP (Vol. 2, pp. 1783–1786). New York: IEEE.

Rao, B., & Kreutz-Delgado, K. (1997). Deriving algorithms for computing sparse solutions to linear inverse problems. In J. Neyman (Ed.), Proceedings of the 1997 Asilomar Conference on Circuits, Systems, and Computers. New York: IEEE.

Rao, B., & Kreutz-Delgado, K. (1998a). Basis selection in the presence of noise. In Proceedings of the 1998 Asilomar Conference. New York: IEEE.

Rao, B., & Kreutz-Delgado, K. (1998b). Sparse solutions to linear inverse problems with multiple measurement vectors. In Proceedings of the 8th IEEE Digital Signal Processing Workshop. New York: IEEE.

Rao, B., & Kreutz-Delgado, K. (1999). An affine scaling methodology for best basis selection. IEEE Trans. Signal Processing, 1, 187–202.

Roberts, S. (1998). Independent component analysis: Source assessment and separation, a Bayesian approach. IEE (Brit.) Proc. Vis. Image Signal Processing, 145(3), 149–154.

Rockafellar, R. (1970). Convex analysis. Princeton, NJ: Princeton University Press.

Ruderman, D. (1994). The statistics of natural images. Network: Computation in Neural Systems, 5, 517–548.

Wang, K., Lee, C., & Juang, B. (1997). Selective feature extraction via signal decomposition. IEEE Signal Processing Letters, 4(1), 8–11.

Watanabe, S. (1981). Pattern recognition as a quest for minimum entropy. Pattern Recognition, 13(5), 381–387.


Wickerhauser, M. (1994). Adapted wavelet analysis from theory to software. Wellesley, MA: A. K. Peters.

Zhu, S., Wu, Y., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.

Zibulevsky, M., & Pearlmutter, B. A. (2001). Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13(4), 863–882.

Received August 6, 2001; accepted June 4, 2002.
