
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 11, NOVEMBER 2013

General Subspace Learning With Corrupted Training Data Via Graph Embedding

Bing-Kun Bao, Guangcan Liu, Member, IEEE, Richang Hong, Shuicheng Yan, Senior Member, IEEE, and Changsheng Xu, Senior Member, IEEE

Abstract—We address the following subspace learning problem: supposing we are given a set of labeled, corrupted training data points, how to learn the underlying subspace, which contains three components: an intrinsic subspace that captures certain desired properties of a data set, a penalty subspace that fits the undesired properties of the data, and an error container that models the gross corruptions possibly existing in the data. Given a set of data points, these three components can be learned by solving a nuclear norm regularized optimization problem, which is convex and can be efficiently solved in polynomial time. Using the method as a tool, we propose a new discriminant analysis (i.e., supervised subspace learning) algorithm called Corruptions Tolerant Discriminant Analysis (CTDA), in which the intrinsic subspace is used to capture the features with high within-class similarity, the penalty subspace takes the role of modeling the undesired features with high between-class similarity, and the error container takes charge of fitting the possible corruptions in the data. We show that CTDA can well handle the gross corruptions possibly existing in the training data, whereas previous linear discriminant analysis algorithms arguably fail in such a setting. Extensive experiments conducted on two benchmark human face data sets and one object recognition data set show that CTDA outperforms the related algorithms.

Index Terms—Subspace learning, corrupted training data, discriminant analysis, graph embedding.

    I. INTRODUCTION

LINEAR subspaces are of significance in the recognition of visual patterns, due to the fact that visual data (e.g., face [1], [2], texture [3] and motion [4]) are usually well characterized by subspaces. Even for complicated recognition tasks (e.g., image classification [5]), the (linear) subspace

Manuscript received May 24, 2012; revised October 25, 2012, February 1, 2013, and May 6, 2013; accepted July 7, 2013. Date of publication July 22, 2013; date of current version September 12, 2013. This work was supported in part by the National Program on Key Basic Research Project (973 Program) under Project 2012CB316304, the National Natural Science Foundation of China under Grants 61225009 and 61201374, the China Postdoctoral Science Foundation under Grant 2013T60196, the Singapore National Research Foundation under its International Research Centre @Singapore Funding Initiative, and the IDM Programme Office. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Andrea Cavallaro.

B.-K. Bao and C. Xu are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the China-Singapore Institute of Digital Media, Singapore 119613 (e-mail: [email protected]; [email protected]).

G. Liu is with the University of Illinois at Urbana-Champaign, Champaign, IL 61820 USA (e-mail: [email protected]).

R. Hong is with the School of Computer and Information, Hefei University of Technology, Hefei 230009, China (e-mail: [email protected]).

S. Yan is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 10000 (e-mail: [email protected]).

Digital Object Identifier 10.1109/TIP.2013.2273665

is also a common interface: raw images with complicated structures are first converted to data points with subspace structures, and then the final classification results are produced by subspace based algorithms. The existing subspace learning methods can be roughly divided into three categories. The first one is unsupervised learning, such as Principal Component Analysis (PCA) [6] and manifold learning algorithms, like ISOMAP [7], Locally Linear Embedding (LLE) [8] with its linear extension Neighborhood Preserving Embedding (NPE) [9], [10], and Laplacian Eigenmaps (LE) [11] with its linear extension Locality Preserving Projections (LPP) [12]. The second category is supervised learning (also known as discriminant analysis), which utilizes the class label information to pursue discriminant feature representations. In this category, the most popular algorithm is the well-known Linear Discriminant Analysis (LDA) [13] and its variations, including Nonparametric Discriminant Analysis (NDA) [14], 2-D LDA [15], Marginal Fisher Analysis (MFA) [16] and Local Discriminant Embedding (LDE) [17], etc. The third category is semi-supervised learning [18], [19], [20], [21], which utilizes unlabeled data as well as relatively limited labeled data for better classification. For these various subspace learning algorithms, Yan et al. [16] showed that they can be mathematically unified into a general framework, called Graph Embedding. This framework derives a low-dimensional feature space which not only preserves the desired relationships but also constrains the undesired ones.

In this work, we consider supervised subspace learning, i.e., label information is available for the training data. Although much work has been done, this topic is still far from settled. An unsolved problem is how to achieve robust subspace learning when the training data contain gross corruptions. Since the training data themselves are grossly corrupted, accurate prior knowledge about the desired subspace is no longer available. This setting, to our knowledge, has not been thoroughly studied before. The well-known Principal Component Analysis (PCA) [6] method can efficiently seize the low-rank subspace structure of high-dimensional data contaminated by small Gaussian noise, but is extremely fragile to gross corruptions [22], which have serious impacts on the recognition and processing of visual patterns [1], [23]. The Sparse Representation Classifier (SRC) [24] is a robust supervised method and can correct corruptions possibly existing in testing data, but cannot well handle the cases where the training data themselves are corrupted [25]. This is because SRC needs a set of noiseless training data to construct a well-defined dictionary. The recently established Robust Principal Component Analysis (RPCA)¹ [22], [24] method can well handle data contaminated by sparse corruptions and has achieved many successes (e.g., [26]). However, RPCA is a purely unsupervised method and cannot utilize the possibly available class label information that plays an important role in subspace learning. In [27], a convex formulation is established to make LDA tolerant to the uncertainties of the training procedure. This method is robust to small Gaussian noise, but cannot handle data containing gross corruptions. In summary, to the best of our knowledge, there is no previous method that can not only deal with the gross corruptions possibly existing in the training (and testing) data, but also make use of the possibly available class label information. In this work, we take one step in this direction.

Due to the common limitations of existing methods, in this paper we propose a general method for robust subspace learning. Our method inherits some ideas from the Graph Embedding [16], [28], [29] framework and the recently established low-rank modeling techniques [22], [25], [23], [30]. Our method consists of three components: an "intrinsic" subspace that captures certain desired properties of a data set, a "penalty" subspace that fits the undesired properties of the data, and an "error container" that models the gross corruptions possibly existing in the data. Given a set of training data, these three components can be learned by solving a nuclear norm regularized optimization problem, which is convex and can be efficiently solved in polynomial time. The proposed method is suitable for both supervised and unsupervised environments, depending on how the desired and undesired properties are defined. In this paper, we are particularly interested in supervised subspace learning, i.e., discriminant analysis. By utilizing the method as a tool, we propose a novel discriminant analysis algorithm called Corruptions Tolerant Discriminant Analysis (CTDA), in which the intrinsic subspace is used to capture the features with high within-class similarity, the penalty subspace takes the role of modeling the undesired features with high between-class similarity, and the error container takes charge of fitting the possible corruptions in the data. Extensive experiments on two benchmark human face datasets and one object recognition dataset demonstrate that CTDA outperforms the state-of-the-art algorithms. In summary, the contributions of this work include:

1) We provide a general method to deal with the gross corruptions possibly existing in training data. The method is suitable for both supervised and unsupervised settings.

2) Unlike previous subspace learning methods and frameworks (e.g., [12], [16], [31], [32]), the proposed method contains two subspaces with specific duties: an intrinsic subspace and a penalty subspace. In this way, the features with high within-class similarity and those with high between-class similarity can be captured separately. This is quite different from previous methods, which allocate only one subspace and thus need an orthogonality constraint to avoid trivial solutions, usually leading to nonconvex optimization problems. In contrast, our formulation is convex.

3) Based on the proposed method, we establish a novel discriminant analysis algorithm called CTDA, which outperforms the state-of-the-art algorithms in handling corrupted data.

¹ Notice that there are many existing RPCA methods. For ease of presentation, in this paper the word "RPCA" solely refers to the method introduced in [22], [24].

TABLE I. LIST OF KEY NOTATIONS

The rest of the paper is organized as follows. For ease of reading, we first discuss the background in Section II. Then we introduce the proposed method and algorithm in Section III. Section IV shows the experimental results. Finally, we conclude this paper in Section V.

    II. BACKGROUND

Our proposed method and algorithm borrow some ideas from the Graph Embedding [16] framework and the recently established Robust Principal Component Analysis (RPCA) method [22]. In this section, we briefly summarize them to aid the reading of this paper. Before delving in, we list our key notations in Table I.

    A. Graph Embedding

Graph Embedding is a general framework for dimension reduction [16], [28], [29]. For a collection of $n$ data samples represented as a matrix $X = [x_1, x_2, \ldots, x_n]$, $x_i \in \mathbb{R}^m$, the goal of dimension reduction is to find their low-dimensional embeddings $Y = [y_1, y_2, \ldots, y_n]$, $y_i \in \mathbb{R}^{m'}$, $m' \ll m$, where $y_i$ is the low-dimensional embedding of the sample $x_i$.

Let $G = \{X, S\}$ be an undirected weighted graph with vertex set $X$ (which corresponds to the samples) and a real symmetric weight matrix $S \in \mathbb{R}^{n \times n}$. The elements of $S$ are assumed to be nonnegative and measure the within-class similarities among data samples. The graph $G$ defined in this way is called the intrinsic graph, which characterizes the favorable relationships among the data samples. Let $S^B \in \mathbb{R}^{n \times n}$ be a symmetric weight matrix that measures the between-class similarities among data samples. Then, a penalty graph is defined as $G^u = \{X, S^B\}$, which characterizes the unfavorable relationships among the data samples. With these definitions, the graph embedding of $G$ is accordingly defined as the low-dimensional representation that best characterizes the similarity relationships encoded in $S$ and $S^B$:

$$\min_Y \sum_{i \neq j} \|y_i - y_j\|_2^2\, S_{ij}, \qquad \max_Y \sum_{i \neq j} \|y_i - y_j\|_2^2\, S^B_{ij}, \tag{1}$$

where $\|\cdot\|_2$ is the $\ell_2$-norm of a vector. Intuitively, if $x_i$ and $x_j$ are more similar, that is, $S_{ij}$ is larger, then $\|y_i - y_j\|_2^2$ should be smaller. Thus the minimization of the first objective in (1) tends to preserve the within-class similarity encoded by the intrinsic graph $G$. Likewise, the maximization of the second objective suppresses the between-class similarity represented by the penalty graph $G^u$.

Notice that the min/max problem (1) is generally nonconvex. In this paper, we shall introduce a convex approach to achieve the goal of Graph Embedding.

    B. Robust Principal Component Analysis

PCA is one of the most widely used tools for dimension reduction and error correction. However, as mentioned, PCA cannot well handle data containing large errors (e.g., gross corruptions and outliers). To overcome this limitation, recent years have witnessed a surge of robust PCA methods (e.g., [22], [24], [33], [34], [35]). In particular, the Robust PCA method (referred to as RPCA) proposed in [22], [24] is emerging as a powerful tool to recover subspace structures from data containing gross errors [26], [30], [36], [37].

For a given data matrix $X \in \mathbb{R}^{m \times n}$, a fraction of whose entries are grossly corrupted, RPCA aims to identify the original data (denoted by $\hat{Y}$) and the corruptions (denoted by $E$) by minimizing

$$\min_{\hat{Y}, E} \; \mathrm{rank}(\hat{Y}) + \lambda \|E\|_0, \quad \text{s.t. } X = \hat{Y} + E,$$

where $\lambda > 0$ is a parameter, and $\|\cdot\|_0$ is the matrix $\ell_0$-norm, i.e., the number of nonzero entries of a matrix. Note here that although the ambient dimension of $\hat{Y}$ is the same as that of $X$, the rank (i.e., intrinsic dimension) of $\hat{Y}$ is enforced to be small, and thus the underlying nature of RPCA is also dimension reduction. By relaxing the rank function and the $\ell_0$-norm into the nuclear norm and the $\ell_1$-norm, respectively, RPCA turns to the following convex optimization problem:

$$\min_{\hat{Y}, E} \; \|\hat{Y}\|_* + \lambda \|E\|_1, \quad \text{s.t. } X = \hat{Y} + E, \tag{2}$$

where $\|\cdot\|_*$ denotes the nuclear norm, also known as the trace norm or Ky Fan norm (the sum of the singular values), $\|\cdot\|_1$ is the $\ell_1$-norm, and $\lambda > 0$ is a parameter. Provided that the corruptions are sparse enough (i.e., only a fraction of the entries of $X$ are corrupted) and the original data are sufficiently incoherent, it is provable that the algorithm based on minimizing (2) succeeds exactly [22].
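To make the RPCA machinery concrete, the following is a minimal NumPy sketch of an inexact ALM solver for problem (2), in the spirit of [42]. The function names (`rpca_alm`, `svt`, `shrink`) and the default parameter choices (e.g., $\lambda = 1/\sqrt{\max(m, n)}$, the $\mu$ schedule) are ours, not prescribed by the text.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def shrink(A, tau):
    """Entry-wise soft thresholding: proximal operator of tau * l1-norm."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def rpca_alm(X, lam=None, rho=1.5, tol=1e-7, max_iter=500):
    """Solve min ||Y||_* + lam * ||E||_1  s.t.  X = Y + E  by inexact ALM."""
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = 1.25 / np.linalg.norm(X, 2)      # initial penalty, a common heuristic
    Y_hat = np.zeros_like(X)              # low-rank component
    E = np.zeros_like(X)                  # sparse corruptions
    M = np.zeros_like(X)                  # Lagrange multiplier
    norm_X = np.linalg.norm(X, 'fro')
    for _ in range(max_iter):
        Y_hat = svt(X - E + M / mu, 1.0 / mu)      # low-rank update
        E = shrink(X - Y_hat + M / mu, lam / mu)   # sparse update
        R = X - Y_hat - E                          # constraint residual
        M += mu * R                                # dual ascent step
        mu = min(rho * mu, 1e10)                   # increase penalty
        if np.linalg.norm(R, 'fro') < tol * norm_X:
            break
    return Y_hat, E
```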

III. ROBUST SUBSPACE LEARNING BY CONVEX OPTIMIZATION

In this section, we first introduce a general method for subspace learning with corrupted training data. Second, we establish a discriminant analysis algorithm called Corruptions Tolerant Discriminant Analysis (CTDA) based on this method. Finally, we give the algorithm for solving the optimization problem.

    A. A General Method for Robust Subspace Learning

Similar to the Graph Embedding framework presented in Section II-A, we assume that we are given a data matrix $X$, an intrinsic graph $G = \{X, S\}$ which characterizes the favorable (desired) relationships among the data samples, and a penalty graph $G^u = \{X, S^B\}$ which characterizes the unfavorable (undesired) relationships among the data. Unlike previous dimension reduction methods (e.g., [12], [16], [31], [32]), which aim at finding one type of embedding that simultaneously preserves the desired property (described by the intrinsic graph) and suppresses the undesired property (described by the penalty graph), we propose to use two types of embeddings, denoted as $Y = [y_1, \ldots, y_n]$ and $Y^u = [y^u_1, \ldots, y^u_n]$, to model the desired and undesired properties, respectively. In this way, we obtain a convex objective function to model the properties encoded in the intrinsic and penalty graphs:

$$\min_{Y, Y^u} \; \sum_{i \neq j} \left\{ \|y_i - y_j\|_2^2\, S_{ij} + \|y^u_i - y^u_j\|_2^2\, S^B_{ij} \right\}. \tag{3}$$

The above objective function can be reduced to

$$\min_{Y, Y^u} \; \mathrm{Tr}(Y L Y^T) + \mathrm{Tr}(Y^u L^u (Y^u)^T), \tag{4}$$

where $L$ is the Laplacian matrix of the intrinsic graph $G$: $L = D - S$, $D = \mathrm{diag}\{\sum_j S_{1j}, \sum_j S_{2j}, \ldots, \sum_j S_{nj}\}$, and $L^u$ is the Laplacian matrix of the penalty graph $G^u$: $L^u = D^u - S^B$, $D^u = \mathrm{diag}\{\sum_j S^B_{1j}, \sum_j S^B_{2j}, \ldots, \sum_j S^B_{nj}\}$.
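The step from (3) to (4) rests on the standard identity $\sum_{i \neq j} \|y_i - y_j\|_2^2\, S_{ij} = 2\,\mathrm{Tr}(Y L Y^T)$ with $L = D - S$ (the constant factor is immaterial to the minimizer). A quick numerical check of this identity, with all names ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
Y = rng.standard_normal((d, n))             # columns are the embeddings y_i
S = rng.random((n, n)); S = (S + S.T) / 2   # symmetric nonnegative weights
np.fill_diagonal(S, 0.0)

L = np.diag(S.sum(axis=1)) - S              # Laplacian L = D - S

lhs = sum(S[i, j] * np.sum((Y[:, i] - Y[:, j]) ** 2)
          for i in range(n) for j in range(n))
rhs = 2.0 * np.trace(Y @ L @ Y.T)
assert np.isclose(lhs, rhs)                 # the identity behind (3) -> (4)
```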

Without any constraints, the solution to (4) is trivial, i.e., $Y = 0$ and $Y^u = 0$. Fortunately, we have the following priors to constrain and regularize the optimization problem. First, since it is assumed that the data consist of three components, we have the following equality constraint:

$$X = Y + Y^u + E, \tag{5}$$

where $E$ is used to fit the corruptions possibly existing in the data. Second, following the core idea of subspace learning, we enforce the embeddings $Y$, $Y^u$ to be low-rank, i.e., "implicitly" low-dimensional:

$$\min_Y \mathrm{rank}(Y), \qquad \min_{Y^u} \mathrm{rank}(Y^u). \tag{6}$$

Third, like RPCA, we assume that the corruptions are entry-wise sparse, so the $\ell_0$-norm of $E$ should be minimized:

$$\min_E \|E\|_0. \tag{7}$$

Note that the sparsity assumption is mild, since such corruptions are ubiquitous in reality due to occlusion or sensor failure [22].


Finally, by combining the criteria (4), (5), (6) and (7), we arrive at the following optimization problem:

$$\min_{Y, Y^u, E} \; \mathrm{rank}(Y) + \alpha\, \mathrm{rank}(Y^u) + \lambda_1 \|E\|_0 + \lambda_2 \left[ \mathrm{Tr}(Y L Y^T) + \mathrm{Tr}(Y^u L^u (Y^u)^T) \right], \quad \text{s.t. } X = Y + Y^u + E, \tag{8}$$

where $\alpha > 0$, $\lambda_1 > 0$, $\lambda_2 > 0$ are three tunable parameters. The above problem is difficult to solve due to the discrete nature of the rank function and the $\ell_0$-norm. As is common practice in rank minimization problems [22], [23], we replace the rank function with the nuclear norm. Moreover, as suggested by [38], it is adequate to replace the $\ell_0$-norm with the $\ell_1$-norm. Hence, we can learn $Y$, $Y^u$ and $E$ by solving the following convex optimization problem:

$$\min_{Y, Y^u, E} \; \|Y\|_* + \alpha \|Y^u\|_* + \lambda_1 \|E\|_1 + \lambda_2 \left[ \mathrm{Tr}(Y L Y^T) + \mathrm{Tr}(Y^u L^u (Y^u)^T) \right], \quad \text{s.t. } X = Y + Y^u + E. \tag{9}$$

Intuitively, the minimization of $\mathrm{Tr}(Y L Y^T)$ in (9) will capture the similarity property encoded in the intrinsic graph $G$. Likewise, the minimization of $\mathrm{Tr}(Y^u L^u (Y^u)^T)$ will model the similarity property described by the penalty graph $G^u$. Hence, the goal of Graph Embedding can be achieved by solving (9).

Compared to previous frameworks and methods (e.g., [12], [16], [31], [32]), which usually aim at learning one subspace that captures the desired properties and suppresses the undesired properties, the methodology of our formulation (9) is quite different. Namely, there are three components with specific duties: 1) a low-rank subspace (identified by $Y$) for capturing the desired property represented by the intrinsic graph $G$, called the intrinsic subspace; 2) a low-rank subspace (identified by $Y^u$) for modeling the undesired property described by the penalty graph $G^u$, called the penalty subspace; 3) a matrix $E$ for fitting the possible corruptions in the data, called the error container.

Please note that although our method can handle images with small misalignments, we recommend pre-aligning misaligned images beforehand.

    B. The Discriminant Analysis Algorithm

By constructing appropriate intrinsic and penalty graphs (i.e., defining $S$ and $S^B$), the proposed method can be fitted into both supervised and unsupervised settings. In this work, we only consider supervised subspace learning, i.e., discriminant analysis. Based on the proposed framework, we easily establish a new discriminant analysis algorithm called Corruptions Tolerant Discriminant Analysis (CTDA), which is presented in the rest of this subsection.

The detailed definitions of $S$ and $S^B$ are similar to those in Graph Embedding [16]. For a data sample $x_i$, we find its $k$ nearest neighbors (in the sense of Euclidean distance) from the same class as $x_i$. Such neighbors are called within-class neighbors and denoted as $N_W(x_i)$. Likewise, the between-class neighbors of $x_i$, denoted as $N_B(x_i)$, are found by selecting the $k'$ nearest neighbors from each different class of $x_i$. Then the similarity matrices $S$ and $S^B$ are computed as follows:

$$S_{ij} = \begin{cases} 1, & \text{if } x_j \in N_W(x_i), \\ 0, & \text{otherwise}, \end{cases} \tag{10}$$

$$S^B_{ij} = \begin{cases} 1, & \text{if } x_j \in N_B(x_i), \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$

Note that the similarity matrices defined as above may be asymmetric, and thus we use $(S + S^T)/2$ and $(S^B + (S^B)^T)/2$ as the final similarity matrices to define the intrinsic and penalty graphs.
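For reference, here is one way to realize (10), (11) and the symmetrization step in NumPy; `build_graphs` and its signature are our own naming, and distance ties are broken arbitrarily, a detail the text leaves open.

```python
import numpy as np

def build_graphs(X, labels, k=3, k_prime=3):
    """X: m x n matrix whose columns are samples; labels: length-n class labels.
    Returns the symmetrized similarity matrices S, SB and Laplacians L, Lu."""
    labels = np.asarray(labels)
    n = X.shape[1]
    # pairwise Euclidean distances between columns
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    S = np.zeros((n, n)); SB = np.zeros((n, n))
    for i in range(n):
        same = np.where((labels == labels[i]) & (np.arange(n) != i))[0]
        # k nearest within-class neighbors -> intrinsic graph, Eq. (10)
        S[i, same[np.argsort(dist[i, same])[:k]]] = 1.0
        # k' nearest neighbors from each different class -> penalty graph, Eq. (11)
        for c in np.unique(labels):
            if c == labels[i]:
                continue
            cls = np.where(labels == c)[0]
            SB[i, cls[np.argsort(dist[i, cls])[:k_prime]]] = 1.0
    S, SB = (S + S.T) / 2.0, (SB + SB.T) / 2.0   # symmetrize, as in the text
    L = np.diag(S.sum(axis=1)) - S               # intrinsic Laplacian
    Lu = np.diag(SB.sum(axis=1)) - SB            # penalty Laplacian
    return S, SB, L, Lu
```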

Suppose the optimal solution to (9) is $(Y_o, Y^u_o, E_o)$. Then, $Y_o$ can be used as feature vectors for classification. This is because, according to the definition of $S$, $Y_o$ should contain the features that tend to group together the samples from the same class. Also, since there is another component $Y^u_o$ that fits the features of high between-class similarity, the features in $Y_o$ are discriminative. Hence, we can use the vectors in $Y_o$ as features and utilize the nearest neighbor classifier to produce the final classification results.
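The classification step itself is a plain 1-nearest-neighbor rule over the learned feature columns; a minimal sketch (`nn_classify` is our name):

```python
import numpy as np

def nn_classify(Y_train, labels_train, Y_test):
    """1-NN classification; columns of Y_train and Y_test are feature vectors."""
    # squared Euclidean distances between every test and training column
    d2 = (np.sum(Y_test ** 2, axis=0)[:, None]
          + np.sum(Y_train ** 2, axis=0)[None, :]
          - 2.0 * Y_test.T @ Y_train)
    return np.asarray(labels_train)[np.argmin(d2, axis=1)]
```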


One may have noticed that it seems hard for CTDA to efficiently process a new testing sample $x$ that is not involved in the training set $X$. Namely, to get the desired feature vector $y$ for a fresh sample $x$, one essentially needs to re-run the optimization procedure of (9) over all data samples, leading to high computational cost. To facilitate the generalization of the learnt model to unseen data, we follow the notion of [30], [25], [23]: using a low-rank projector to pursue a subspace. Thus we set $Y = P X$ in (9), where $P$ is a linear projection onto the intrinsic subspace. In this way, we have the following convex optimization problem:

$$\min_{P, Y^u, E} \; \|P\|_* + \alpha \|Y^u\|_* + \lambda_1 \|E\|_1 + \lambda_2 \left[ \mathrm{Tr}(P X L X^T P^T) + \mathrm{Tr}(Y^u L^u (Y^u)^T) \right], \quad \text{s.t. } X = P X + Y^u + E. \tag{12}$$

After obtaining an optimal projection $P_o$ by optimizing (12) over a training set $X$, it is very efficient to generalize the learnt model to unseen data. Namely, for any given data sample $x$, its desired features $y$ can be efficiently computed as $y = P_o x$.

C. The Optimization Algorithm

This part details the optimization of the objective functions (12) and (9). For simplicity, we only show how to solve objective function (12); (9) can be solved in a similar way.

Objective function (12) is convex and can be solved by various algorithms, such as semi-definite programming (SDP) [39], accelerated proximal gradient (APG) [40] and augmented Lagrange multipliers (ALM) [41], [42]. For efficiency, we adopt the ALM method in this paper. We first convert (12) to the following equivalent problem:

$$\min_{J, K, P, Y^u, E} \; \|J\|_* + \alpha \|K\|_* + \lambda_1 \|E\|_1 + \lambda_2 \left[ \mathrm{Tr}(P X L X^T P^T) + \mathrm{Tr}(Y^u L^u (Y^u)^T) \right], \quad \text{s.t. } X = P X + Y^u + E,\; P = J,\; Y^u = K. \tag{13}$$

This problem can be handled by the ALM method, which minimizes the following augmented Lagrange function:

$$\begin{aligned} &\|J\|_* + \alpha \|K\|_* + \lambda_1 \|E\|_1 + \lambda_2 \left[ \mathrm{Tr}(P X L X^T P^T) + \mathrm{Tr}(Y^u L^u (Y^u)^T) \right] \\ &\quad + \langle M_1, X - P X - Y^u - E \rangle + \langle M_2, P - J \rangle + \langle M_3, Y^u - K \rangle \\ &\quad + \frac{\mu}{2} \left[ \|X - P X - Y^u - E\|_F^2 + \|P - J\|_F^2 + \|Y^u - K\|_F^2 \right], \end{aligned} \tag{14}$$

where $M_1$, $M_2$ and $M_3$ are Lagrange multipliers, $\langle \cdot, \cdot \rangle$ denotes the standard matrix inner product, $\mu > 0$ is a penalty parameter, and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix [43]. The above problem is unconstrained, and thus it can be minimized with respect to $J$, $K$, $P$, $Y^u$ and $E$, each in turn with the other variables fixed, followed by updating the Lagrange multipliers $M_1$, $M_2$ and $M_3$. The inexact ALM method, which is also known as the Alternating Direction Method (ADM) [44], is outlined in Algorithm 1.² Notice that although Steps 1, 2 and 3 contain optimization problems, they all have analytic solutions: Steps 1 and 2 are solved via the Singular Value Thresholding (SVT) operator [45], while Step 3 is solved via the shrinkage operator [42]. The convergence properties of the ALM method have been discussed in the literature; for more details, one may refer to [41], [42].

Algorithm 1 Solving Problem (12) by ALM

Initialization: $P = J = 0$, $Y^u = K = 0$, $E = 0$, $M_1 = M_2 = M_3 = 0$, $\mu = 10^{-6}$, $\mu_{\max} = 10^6$, $\rho = 1.1$ and $\epsilon = 10^{-6}$.
while not converged do
1. Update $J$ by $J = \arg\min_J \frac{1}{\mu}\|J\|_* + \frac{1}{2}\|J - (P + \frac{M_2}{\mu})\|_F^2$.
2. Update $K$ by $K = \arg\min_K \frac{\alpha}{\mu}\|K\|_* + \frac{1}{2}\|K - (Y^u + \frac{M_3}{\mu})\|_F^2$.
3. Update $E$ by $E = \arg\min_E \frac{\lambda_1}{\mu}\|E\|_1 + \frac{1}{2}\|E - (X - P X - Y^u + \frac{M_1}{\mu})\|_F^2$.
4. Update $P$ by $P = \left[M_1 X^T - M_2 + \mu(X - Y^u - E)X^T + \mu J\right]\left[\lambda_2 X L X^T + \lambda_2 X L^T X^T + \mu X X^T + \mu I\right]^{-1}$.
5. Update $Y^u$ by $Y^u = \left[M_1 - M_3 + \mu(X - P X - E + K)\right]\left(\lambda_2 L^u + \lambda_2 (L^u)^T + 2\mu I\right)^{-1}$.
6. Update the Lagrange multipliers: $M_1 = M_1 + \mu(X - P X - Y^u - E)$, $M_2 = M_2 + \mu(P - J)$, $M_3 = M_3 + \mu(Y^u - K)$.
7. Update the parameter $\mu$ by $\mu = \min(\rho\mu, \mu_{\max})$.
8. Check the convergence conditions: $\|X - P X - Y^u - E\|_\infty < \epsilon$, $\|P - J\|_\infty < \epsilon$, and $\|Y^u - K\|_\infty < \epsilon$.
end while
Output: The projection matrix $P$.
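Below is a direct NumPy transcription of Algorithm 1, reusing the `svt` and `shrink` helpers from the RPCA sketch in Section II-B; `ctda_alm` is our name, and this is a sketch for dense, moderately sized problems rather than an optimized implementation.

```python
import numpy as np
# svt() and shrink() are the SVT and soft-thresholding helpers
# from the RPCA sketch in Section II-B.

def ctda_alm(X, L, Lu, alpha, lam1, lam2,
             mu=1e-6, mu_max=1e6, rho=1.1, eps=1e-6, max_iter=1000):
    """Solve problem (12) for the projection P, following Algorithm 1."""
    m, n = X.shape
    P = np.zeros((m, m)); J = np.zeros((m, m))
    Yu = np.zeros((m, n)); K = np.zeros((m, n)); E = np.zeros((m, n))
    M1 = np.zeros((m, n)); M2 = np.zeros((m, m)); M3 = np.zeros((m, n))
    for _ in range(max_iter):
        J = svt(P + M2 / mu, 1.0 / mu)                    # Step 1 (SVT)
        K = svt(Yu + M3 / mu, alpha / mu)                 # Step 2 (SVT)
        E = shrink(X - P @ X - Yu + M1 / mu, lam1 / mu)   # Step 3 (shrinkage)
        # Step 4: closed-form update of P
        rhs = M1 @ X.T - M2 + mu * (X - Yu - E) @ X.T + mu * J
        lhs = lam2 * X @ (L + L.T) @ X.T + mu * (X @ X.T + np.eye(m))
        P = rhs @ np.linalg.inv(lhs)
        # Step 5: closed-form update of Y^u
        Yu = (M1 - M3 + mu * (X - P @ X - E + K)) @ np.linalg.inv(
            lam2 * (Lu + Lu.T) + 2.0 * mu * np.eye(n))
        # Step 6: update the Lagrange multipliers
        R = X - P @ X - Yu - E
        M1 += mu * R; M2 += mu * (P - J); M3 += mu * (Yu - K)
        mu = min(rho * mu, mu_max)                        # Step 7
        # Step 8: convergence check on the three residuals
        if (np.abs(R).max() < eps and np.abs(P - J).max() < eps
                and np.abs(Yu - K).max() < eps):
            break
    return P
```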

If $m = n$, the complexity of Algorithm 1 is $O(n^3)$. This complexity is not high, as there are $O(n^2)$ unknowns to solve for. Indeed, standard SDP solvers (e.g., CVX³) need $O(n^6)$ complexity to solve problem (12). It is also worth noting that the computational cost of our CTDA algorithm is mainly spent in the training procedure. After learning an optimal projection $P_o$ from a set of training data, it is very efficient for CTDA to process testing samples: for any given data sample $x$, we only need to compute $y = P_o x$.

    IV. EXPERIMENTS

In this section we systematically evaluate the proposed CTDA algorithm in the context of face recognition and object classification. We use four publicly available databases, including three face databases, CMU PIE [46], FERET [47] and AR Face [48], as well as one object dataset, the Pittsburgh Food Image Dataset (PFID) [49]. Using CMU PIE and FERET, we investigate the accuracy of our algorithm under various levels of (synthetic) contiguous occlusions and random pixel corruptions. With the AR Face dataset, we show the robustness of our algorithm to realistic occlusions such as sunglasses. Finally, we test our algorithm on PFID with synthetic corruptions to demonstrate its performance in object classification.

Several existing algorithms are based on the Graph Embedding framework: PCA, LDA, NMF, Marginal Fisher Analysis (MFA) [16], Multiplicative Nonnegative Graph Embedding (MNGE) [28], and Projective Nonnegative Graph Embedding (PNGE) [29]. As mentioned, these algorithms are not robust to gross corruptions in the training data. For fairness of comparison, we utilize RPCA to preprocess the data, resulting in 9 baselines against which we demonstrate the superiority of our proposed CTDA: 1) RPCA+PCA, 2) RPCA+LDA, 3) RPCA+NMF, 4) MFA, 5) RPCA+MFA, 6) MNGE, 7) RPCA+MNGE, 8) PNGE, and 9) RPCA+PNGE.

For both CTDA and the baselines, the intrinsic and penalty graphs are constructed in the same way. The neighbor sizes are set as $k = k' = 3$. For the other parameters, such as the feature dimension and the weight parameters balancing the terms, we report the best results obtained by exploring all possible values for all the algorithms, as conventionally done in [16], [29]. For CTDA, the non-sensitive parameter $\alpha$ is set as $1/\|X\|_F^2$, the parameter $\lambda_1$ is tuned over the empirical range $\{0.8, 0.6, 0.4, 0.2, 0.08, 0.06, 0.04, 0.02, 0.008\}$, and the parameter $\lambda_2$ is tuned over the empirical range $\{100, 50, 10, 1, 0.1, 0.01, 0.001\}$.
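A sketch of the tuning protocol just described, assuming the `ctda_alm`, `build_graphs` and `nn_classify` helpers from the earlier sketches and a held-out validation split (`X_train`, `X_val` and the label arrays are hypothetical names):

```python
import numpy as np

lam1_grid = [0.8, 0.6, 0.4, 0.2, 0.08, 0.06, 0.04, 0.02, 0.008]
lam2_grid = [100, 50, 10, 1, 0.1, 0.01, 0.001]
alpha = 1.0 / np.linalg.norm(X_train, 'fro') ** 2   # the non-sensitive setting above

_, _, L, Lu = build_graphs(X_train, labels_train, k=3, k_prime=3)
best_acc, best_params = -1.0, None
for lam1 in lam1_grid:
    for lam2 in lam2_grid:
        P = ctda_alm(X_train, L, Lu, alpha, lam1, lam2)
        pred = nn_classify(P @ X_train, labels_train, P @ X_val)
        acc = float(np.mean(pred == labels_val))
        if acc > best_acc:
            best_acc, best_params = acc, (lam1, lam2)
```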

² The MATLAB code of our CTDA algorithm can be downloaded from http://sites.google.com/site/bingkunbao/download

³ The code for CVX is available at http://cvxr.com/cvx/


Fig. 1. Some examples of the original and corrupted images under varying levels of contiguous occlusions from the PIE and FERET datasets.

Fig. 2. Some examples of the original and corrupted images under different percentages of random pixel corruptions from the PIE and FERET datasets.

The evaluation metric is the recognition accuracy, computed as

$$\text{recognition accuracy} = \frac{\#\ \text{correctly classified images}}{\#\ \text{images}}. \tag{15}$$
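Equation (15) in code, for completeness (the function name is ours):

```python
import numpy as np

def recognition_accuracy(predicted, actual):
    """Fraction of correctly classified images, as in (15)."""
    return float(np.mean(np.asarray(predicted) == np.asarray(actual)))
```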

    A. Face Recognition With Synthetic Corruptions

In this experiment, we use CMU PIE and FERET to test the robustness of our algorithm. The PIE (Pose, Illumination, and Expression) dataset contains more than 40,000 facial images of 68 people under various poses and illumination conditions. In our experiments, we use a subset of 4 near-frontal poses (C05, C07, C09 and C29) and two illumination conditions (08 and 11), obtaining 10 images per person and 680 images in total. For the FERET dataset, we construct a subset by randomly choosing 70 people and 6 images per person, thus obtaining 420 images in total. All the face images in these two datasets have been cropped and manually aligned by fixing the locations of the two eyes. The size of each cropped image is 32 × 32.

To compare our method with the baselines, we simulate various levels of contiguous occlusions and random pixel corruptions, respectively:

• Contiguous occlusions: block occlusions are randomly added at different locations in the images, with increasing sizes: 8 × 8, 10 × 10, 12 × 12 and 14 × 14, as exemplified in Fig. 1.


Fig. 3. Illustration of applying CTDA to the training data. Given a data matrix $X$, CTDA decomposes it into a low-rank part $P X$ that represents the features with high within-class similarity, a low-rank part $Y^u$ that encodes the undesired features with high between-class similarity, and a sparse part $E$ that fits the possible corruptions. (a) Under contiguous occlusions of size 10 × 10 pixels; (b) under random pixel corruptions with rate 10%.

• Random pixel corruptions: we corrupt a percentage (5%, 10%, 15%) of randomly chosen pixels in both the training and test images, replacing their values with independent and identically distributed (i.i.d.) samples from a uniform distribution,⁴ as shown in Fig. 2.

We randomly select half of the images from each class for training, and the remaining images for testing. The random trial is repeated 5 times. The data vectors are formed by normalizing the raw pixel values into [0, 1] and reshaping each image as a vector.

⁴ Uniform over $[0, x_{\max}]$, where $x_{\max}$ is the largest possible pixel value.
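A sketch of the two corruption protocols, for images already normalized into [0, 1]; the function names are ours, and the occluding blocks are filled with zeros here, whereas any block content (e.g., an unrelated image patch) fits the protocol equally well:

```python
import numpy as np

def add_block_occlusion(img, size, rng):
    """Place a size x size occluding block at a random location."""
    out = img.copy()
    h, w = out.shape[:2]
    r = rng.integers(0, h - size + 1)
    c = rng.integers(0, w - size + 1)
    out[r:r + size, c:c + size] = 0.0
    return out

def add_pixel_corruption(img, rate, rng):
    """Replace a fraction `rate` of pixels with i.i.d. Uniform[0, 1] values."""
    out = img.copy()
    mask = rng.random(out.shape) < rate
    out[mask] = rng.random(np.count_nonzero(mask))  # x_max = 1 after normalization
    return out

rng = np.random.default_rng(0)
img = rng.random((32, 32))                  # a stand-in 32 x 32 face image
occluded = add_block_occlusion(img, 10, rng)
corrupted = add_pixel_corruption(img, 0.10, rng)
```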

Fig. 3(a) shows several example training images from PIE and FERET under contiguous occlusions of size 10 × 10 pixels, and Fig. 3(b) shows those from PIE and FERET under random pixel corruptions with rate 10%. From the illustration, we can see that 1) the synthetic corruptions are mainly separated into the error container $E_o$, and 2) by extracting the desired and undesired features with the intrinsic and penalty subspaces, $P_o X$ is discriminant for face recognition. Fig. 4 shows the obtained desired features of test images for CTDA and the baselines.

Fig. 4. Illustration of the desired features for CTDA and baselines. Every three columns are from the same person in PIE.

For CTDA, after learning the projection $P_o$ from a training set, the desired features of a testing sample $x$ can be simply obtained as $P_o x$. From the figure, we can see that the desired features obtained by CTDA are of higher within-class similarity and between-class diversity. Thus, the learnt model (identified by $P_o$) generalizes well to the testing data.

Tables II and III show the comparison results of different algorithms on the PIE and FERET datasets with block occlusions. Tables IV and V show the results on the randomly corrupted PIE and FERET datasets. Each row contains the accuracies of different algorithms under different occlusion block sizes or corruption rates. When the data are clean, the gap between CTDA and the closest competitor is not very large. This result is reasonable: when the data are clean, it is the intrinsic and penalty graphs that mainly contribute to the performance of CTDA.

TABLE II. PIE DATASET WITH CONTIGUOUS OCCLUSIONS: RECOGNITION ACCURACY (%) AND STANDARD DEVIATION (%) ON THE PIE DATASET UNDER VARYING LEVELS OF CONTIGUOUS OCCLUSIONS (PIXELS)

TABLE III. FERET DATASET WITH CONTIGUOUS OCCLUSIONS: RECOGNITION ACCURACY (%) AND STANDARD DEVIATION (%) ON THE FERET DATASET UNDER VARYING LEVELS OF CONTIGUOUS OCCLUSIONS (PIXELS)

TABLE IV. PIE DATASET WITH RANDOM PIXEL CORRUPTIONS: RECOGNITION ACCURACY (%) AND STANDARD DEVIATION (%) ON THE PIE DATASET UNDER DIFFERENT PERCENTAGES OF RANDOM PIXEL CORRUPTIONS

TABLE V. FERET DATASET WITH RANDOM PIXEL CORRUPTIONS: RECOGNITION ACCURACY (%) AND STANDARD DEVIATION (%) ON THE FERET DATASET UNDER DIFFERENT PERCENTAGES OF RANDOM PIXEL CORRUPTIONS

When the data are corrupted, the proposed CTDA largely outperforms the other algorithms under both contiguous occlusions and random pixel corruptions. Taking the PIE dataset as an example, when the occlusion size increases to 14 × 14, the result of CTDA is 89.90% on average, which is 6.01% (= 89.90% − 83.89%) higher than that of the closest competitor: the accuracy of CTDA only decreases by 8.68% (= 98.58% − 89.90%) after adding occlusions to the clean data, while that of the closest competitor decreases by 11.23% (= 95.12% − 83.89%). When the rate of corrupted pixels increases to 15%, CTDA can still achieve a classification accuracy of 71.56% on average, which is 27.54% higher than the accuracy of 44.02% produced by the closest competitor; the accuracy of CTDA decreases by 27.02% (= 98.58% − 71.56%) after adding corruptions, while that of the closest baseline decreases by 21.22% (= 80.25% − 59.03%).

Fig. 5. Some examples of the original and occluded images with sunglasses from the AR Face dataset. The images in each column are from the same person.

TABLE VI. RECOGNITION ACCURACY (%) AND STANDARD DEVIATION (%) ON THE AR FACE DATASET UNDER DIFFERENT PERCENTAGES OF OCCLUDED IMAGES

    B. Face Recognition With Disguises

We then test our algorithm on real disguises using a subset of the AR Face dataset [48]. The AR Face dataset contains over 4,000 images of 126 people's faces (70 men and 56 women). The images feature frontal-view faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarves). The images of most persons were taken in two sessions. Each session contains 13 color images (7 clean images, 3 images with sunglasses and 3 images with a scarf), and 120 individuals (65 men and 55 women) participated in both sessions. The images of these 120 individuals are selected and used in our experiment. We test on three subsets with different percentages of occluded images:

• 33.3%: 4 clean images and 2 occluded images are randomly selected to constitute the training set; 2 other clean images and 2 other occluded images are randomly selected to form the test set;

• 50%: 3 clean images and 3 occluded images are randomly selected to constitute the training set; 2 other clean images and 2 other occluded images are randomly selected to form the test set;

• 66.7%: 2 clean images and 4 occluded images are randomly selected to constitute the training set; 2 other clean images and 2 other occluded images are randomly selected to form the test set.

The random trial is repeated 5 times. We also manually crop the face portion of each image and then normalize it to 50 × 40 pixels. The normalized images are shown in Fig. 5.

Table VI shows the comparison results of different algorithms for images occluded by sunglasses. When the percentage of occluded images is 33.3%, the recognition accuracy of CTDA is 98.76%, which is 4.04% higher than that of the closest competitor. When the percentage of occluded images increases to 50% and 66.7%, the recognition accuracies of CTDA are 98.52% and 97.38%, which are 4.91% and 3.77% higher than those of the closest competitors. Also, when the percentage of occluded images increases, the recognition accuracy of our proposed method barely decreases, since the occlusion size remains nearly unchanged.


Fig. 6. Some examples of the original and corrupted images under varying levels of contiguous occlusions from the PFID dataset.

Fig. 7. Some examples of the original and corrupted images under different percentages of random pixel corruptions from the PFID dataset.

    C. Object Classification

In our experiments, we use the Pittsburgh Food Image Dataset (PFID) [49] to test our algorithm on object classification. The PFID dataset is a collection of fast food images and videos from 13 chain restaurants, acquired under lab and realistic settings. Our experiments focus on the set of 61 categories of specific food items (e.g., McDonald's Big Mac) with masked backgrounds. Each food category contains three different instances of the food (bought on different days from different branches of the restaurant chain), and six images from six viewpoints (60 degrees apart) of each food instance. All 1,098 images in this dataset are resized with equal proportion to no more than 300 × 300 pixels. For each image, SIFT [50] features are extracted and clustered into 2,048 visual words.
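The feature pipeline just described, sketched with OpenCV and scikit-learn; this is our approximation of the setup (the authors' exact SIFT parameters and clustering procedure are not specified in the text):

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def sift_descriptors(gray_images):
    """Extract SIFT descriptors from a list of grayscale uint8 images."""
    sift = cv2.SIFT_create()
    per_image = []
    for img in gray_images:
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None
                         else np.zeros((0, 128), np.float32))
    return per_image

def bag_of_words(per_image, n_words=2048, seed=0):
    """Quantize descriptors into visual words; one histogram per image."""
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    km.fit(np.vstack([d for d in per_image if len(d)]))
    hists = np.zeros((len(per_image), n_words))
    for i, desc in enumerate(per_image):
        if len(desc):
            words, counts = np.unique(km.predict(desc), return_counts=True)
            hists[i, words] = counts / counts.sum()   # l1-normalized histogram
    return hists.T   # columns are samples, matching the X convention above
```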

To investigate the robustness of the various algorithms, we add contiguous occlusions and random pixel corruptions to the images, similarly to Section IV-A. For contiguous occlusion, we add 80 × 80, 100 × 100, 120 × 120, and 140 × 140-pixel block occlusions at different locations in the 300 × 300-pixel training and test images. For random pixel corruption, we corrupt a percentage (5%, 10%, 15%) of randomly chosen pixels in both the training and test images. Figs. 6 and 7 show some examples of original and corrupted images under different contiguous occlusions and random pixel corruptions.

Tables VII and VIII show the comparison results of different algorithms under different contiguous occlusions and random pixel corruptions. When the data are clean, CTDA achieves 42.56%, while the highest previously reported result is, as far as we know, 28.2% [51]. When the occlusion size increases to 140 × 140 pixels, CTDA achieves 36.02%, while the second highest result is 32.49%. When the corruption rate increases to 15%, our algorithm reaches 30.04%, while the best of the baselines is 26.37%. These results not only verify the effectiveness of CTDA, but also show its robustness under different types of sparse noise.

    D. Discussion

To deal with corruptions possibly existing in the training data, there exists, as mentioned, a straightforward method: use RPCA to preprocess the data. However, as shown in Tables II–VIII, this method is much inferior to CTDA. The main reason is that RPCA usually does not exactly correct the corruptions in reality, as the success conditions (e.g., the incoherence condition) required by RPCA may not be fully satisfied. Hence, it is possible to achieve better performance by integrating various criteria into a unified framework. CTDA achieves this concisely: as discussed in Section III-A, our method integrates various priors, such as the low-rankness of the data, the class label information, and the sparsity of the corruptions, into one convex optimization problem.


TABLE VII. PFID DATASET WITH CONTIGUOUS OCCLUSIONS: RECOGNITION ACCURACY (%) AND STANDARD DEVIATION (%) ON THE PFID DATASET UNDER VARYING LEVELS OF CONTIGUOUS OCCLUSIONS

TABLE VIII. PFID DATASET WITH RANDOM PIXEL CORRUPTIONS: RECOGNITION ACCURACY (%) AND STANDARD DEVIATION (%) ON THE PFID DATASET UNDER DIFFERENT PERCENTAGES OF RANDOM PIXEL CORRUPTIONS

TABLE IX. THE COMPARISONS OF RUNNING TIME (SECONDS) AND RECOGNITION ACCURACY (%) UNDER 33.3% OCCLUDED IMAGES ON THE AR FACE DATASET

We also compare the computational costs of CTDA and all the baselines. Table IX shows the running times of all the methods on the AR Face dataset. As the data suggest, CTDA is the fastest among the supervised graph embedding methods, including MNGE and PNGE. Moreover, for CTDA, the training process, i.e., learning the projection $P_o$, accounts for the majority of the running time. In the test process, the desired features of a testing sample $x$ can be quickly obtained as $P_o x$.

The experimental results also illustrate that our method decomposes a set of data samples into three parts: desired features (identified by the intrinsic subspace), undesired features (identified by the penalty subspace) and corruptions (identified by the error container), as shown in Fig. 3. By using the intrinsic subspace and the penalty subspace to capture the desired features (e.g., with high within-class similarity) and the unfavorable features (e.g., with high between-class similarity), respectively, our method can outperform the state-of-the-art discriminant algorithms, especially when the data are corrupted.

    V. CONCLUSION

In this work, we address the problem of subspace learning with corrupted training data. By borrowing ideas from Graph Embedding and RPCA, we propose a general method for robust subspace learning. Unlike previous methods and frameworks, our method contains three components: an intrinsic subspace, a penalty subspace and an error container. Based on this method, a novel discriminant algorithm called Corruptions Tolerant Discriminant Analysis (CTDA) is established to extract favorable features from corrupted data. Experimentally, we show that CTDA outperforms the state-of-the-art discriminant algorithms, especially when the data are corrupted.

In future work, we plan to extend our method to robust alignment by following the spirit of [36]. The extended method will be robust not only to corrupted training data, but also to poorly aligned or even unaligned data.

    REFERENCES

    [1] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust facerecognition via sparse representation,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 31, no. 2, pp. 210–227, Jun. 2008.

    [2] B. K. Bao, G. Zhu, J. Shen, and S. Yan, “Robust image analysis withsparse representation on quantized visual features,” IEEE Trans. ImageProcess., vol. 22, no. 3, pp. 860–871, Mar. 2013.

    [3] Y. Ma, H. Derksen, W. Hong, and J. Wright, “Segmentation of multi-variate mixed data via lossy data coding and compression,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 29, no. 9, pp. 1546–1562, Sep. 2007.

    [4] W. Gear, “Multibody grouping from motion images,” Int. J. Comput.Vis., vol. 29, no. 2, pp. 133–150, 1998.

    [5] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual modelsfrom few training examples: An incremental Bayesian approach testedon 101 object categories,” Comput. Vis. Image Understand., vol. 106,no. 1, pp. 59–70, 2007.

  • 4392 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 11, NOVEMBER 2013

    [6] K. Pearson, “On lines and planes of closest fit to systems of points inspace,” Phil. Mag., vol. 6, no. 6, pp. 559–572, 1901.

    [7] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometricframework for nonlinear dimensionality reduction,” Science, vol. 290,no. 5500, pp. 2319–2323, 2000.

    [8] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction bylocally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326,2000.

    [9] X. He, D. Cai, S. Yan, and H. J. Zhang, “Neighborhood preserv-ing embedding,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2005,pp. 1208–1213.

    [10] Y. Fu and T. S. Huang, “Graph embedded analysis for head poseestimation,” in Proc. Int. Conf. Autom. Face Gesture Recognit., 2006,pp. 3–8.

    [11] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniquesfor embedding and clustering,” Adv. Neural Inf. Process. Syst., vol. 14,pp. 585–591, Jan. 2001.

    [12] X. He, D. Cai, H. Liu, and W. Y. Ma, “Locality preserving indexing fordocument representation,” in Proc. ACM Int. Conf. Res. DevelopmentInf. Retr., 2004, pp. 96–103.

    [13] R. Fisher, “The use of multiple measurements in taxonomic problems,”Anal. Eugenics, vol. 7, no. 7, pp. 179–188, 1936.

    [14] K. Fukunaga and J. M. Mantock, “Nonparametric discriminant analysis,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 5, no. 6, pp. 671–678,Nov. 1983.

    [15] K. Liu, Y. Q. Cheng, and J. Y. Yang, “Algebraic feature extraction forimage recognition based on an optimal discriminant criterion,” PatternRecognit., vol. 26, no. 6, pp. 903–911, 1993.

    [16] S. Yan, D. Xu, B. Zhang, H. J. Zhang, Q. Yang, and S. Lin, “Graphembedding and extensions: A general framework for dimensionalityreduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1,pp. 40–51, Jan. 2007.

    [17] H. T. Chen, H. W. Chang, and T. L. Liu, “Local discriminant embeddingand its variants,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,Jun. 2005, pp. 846–853.

    [18] M. Belkin and P. Niyogi, “Semi-supervised learning on Riemannianmanifolds,” Mach. Learn., vol. 56, no. 1, pp. 209–239, 2004.

    [19] V. Sindhwani, P. Niyogi, M. Belkin, and S. Keerthi, “Linear manifoldregularization for large scale semi-supervised learning,” in Proc. Conf.Mach. Learn. Workshop Learning Partially Classified Training Data,2005, pp. 1–4.

    [20] D. Cai, X. He, and J. Han, “Semi-supervised discriminant analysis,” inProc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1–7.

    [21] T. Zhang, D. Tao, and J. Yang, “Discriminative locality alignment,” inProc. Eur. Conf. Comput. Vis., 2008, pp. 725–738.

    [22] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal componentanalysis,” J. ACM, vol. 58, no. 3, pp. 1–17, 2009.

    [23] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recoveryof subspace structures by low-rank representation,” IEEE Trans. PatternAnn. Mach. Intell., vol. 35, no. 1, pp. 171–184, Jan. 2013.

    [24] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, “Robust principalcomponent analysis: Exact recovery of corrupted low-rank matrices viaconvex optimization,” in Proc. Neural Inf. Process. Syst., 2009, pp. 1–9.

    [25] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rankrepresentation,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 1–8.

    [26] Z. Zhang, X. Liang, A. Ganesh, and Y. Ma, “TILT: Transform invariantlow-rank textures,” in Proc. Assian Conf. Comput. Vis., 2010, pp. 1–4.

    [27] S. J. Kim, A. Magnani, and S. P. Boyd, “Robust Fisher discriminantanalysis,” in Proc. Neural Inf. Process. Syst., 2005, pp. 659–666.

    [28] C. Wang, Z. Song, S. Yan, L. Zhang, and H. J. Zhang, “Multiplicativenonnegative graph embedding,” in Proc. IEEE Conf. Comput. Vis.Pattern Recognit., Jun. 2009, pp. 389–396.

    [29] X. Liu, S. Yan, and H. Jin, “Projective nonnegative graph embedding,”IEEE Trans. Image Process., vol. 19, no. 5, pp. 1126–1137, May 2010.

    [30] B. K. Bao, G. Liu, C. Xu, and S. Yan, “Inductive robust principalcomponent analysis,” IEEE Trans. Image Process., vol. 21, no. 8,pp. 3794–3800, Aug. 2012.

    [31] X. He, S. Yan, Y. Hu, P. Niyogi, and H. J. Zhang, “Face recognitionusing Laplacianfaces,” IEEE Trans. Pattern Ann. Mach. Intell., vol. 27,no. 3, pp. 328–340, Mar. 2005.

    [32] F. Torre and M. Black, “A framework for robust subspace learning,” Int.J. Comput. Vis., vol. 54, nos. 1–3, pp. 117–142, 2003.

    [33] G. Li and Z. Chen, “Projection-pursuit approach to robust dispersionmatrices and principal components: Primary theory and Monte carlo,”J. Amer. Stat. Association, vol. 80, no. 391, pp. 759–766, 1985.

    [34] R. Maronna, “Robust M-estimators of multivariate location and scatter,”Anna. Stat., vol. 4, no. 1, pp. 51–67, 1976.

    [35] H. Xu, C. Caramanis, and S. Sanghavi, “Robust PCA via outlier pursuit,”IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 3047–3064, May 2012.

    [36] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, “RASL: Robustalignment by sparse and low-rank decomposition for linearly correlatedimages,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010,pp. 763–770.

    [37] H. Ji, C. Liu, Z. Shen, and Y. Xu, “Robust video denoising using lowrank matrix completion,” in Proc. IEEE Conf. Comput. Vis. PatternRecognit., Jun. 2010, pp. 1791–1798.

    [38] D. L. Donoho, “For most large underdetermined systems of linearequations the minimal �1-norm solution is also the sparsest solution,”Commun. Pure Appl. Math., vol. 59, no. 6, pp. 797–826, 2006.

    [39] M. Jaggi and M. Sulovský, “A simple algorithm for nuclear normregularized problems,” in Proc. Int. Conf. Mach. Learn., 2010,pp. 471–478.

    [40] P. Tseng, “On accelerated proximal gradient methods for convex-concave optimization,” SIAM J. Optim., vol. 3, pp. 1–10, May 2008.

    [41] D. Bertsekas, Constrained Optimization and Lagrange Multiplier Meth-ods. San Diego, CA, USA: Academic Press, 1982.

    [42] Z. Lin, M. Chen, L. Wu, and Y. Ma, “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,” Dept. Comput. Eng., Univ. Illinois, Champaign, IL, USA, Tech. Rep. UILU-ENG-09-2214, 2009.

    [43] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD, USA: Johns Hopkins Univ. Press, 1996.

    [44] Y. Zhang, “Recent advances in alternating direction methods: Practice and theory,” Tutorial, Jan. 2010.

    [45] J. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM J. Optim., vol. 20, no. 4, pp. 1956–1982, 2010.

    [46] T. Sim, S. Baker, and M. Bsat, “The CMU pose, illumination, and expression database,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 12, pp. 1615–1618, Dec. 2003.

    [47] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, “The FERET evaluation methodology for face-recognition algorithms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 10, pp. 1090–1104, Oct. 2000.

    [48] A. M. Martinez, “The AR face database,” Centre de Visió per Computador, Univ. Autònoma de Barcelona, Barcelona, Spain, Tech. Rep. 24, 1998.

    [49] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, “PFID: Pittsburgh fast-food image dataset,” in Proc. 16th IEEE Int. Conf. Image Process., Nov. 2009, pp. 289–292.

    [50] D. Lowe, “Object recognition from local scale-invariant features,” in Proc. IEEE Int. Conf. Comput. Vis., Sep. 1999, pp. 1150–1157.

    [51] L. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food recognition using statistics of pairwise local features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2249–2256.

    Bing-Kun Bao received the Ph.D. degree in control theory and control applications from the Department of Automation, University of Science and Technology of China, Hefei, China, in 2009. She is currently an Assistant Researcher with the Institute of Automation, Chinese Academy of Sciences, Beijing, China, and a Researcher with the China-Singapore Institute of Digital Media, Singapore. Her current research interests include cross-media cross-modal image search, social event detection, image classification and annotation, and sparse/low-rank representation. She received the Best Paper Award from ICIMCS in 2009, and has served as a Technical Program Committee Member of several international conferences, including MMM, ICIMCS, and PCM, all in 2013. She served as a Special Session Organizer for MMM and PCM, both in 2013, and as a Guest Editor for the Multimedia Systems journal.


    Guangcan Liu (M’10) received the bachelor’s degree in mathematics from Shanghai Jiao Tong University (SJTU), Shanghai, China, in 2004, and the Ph.D. degree in computer science and engineering from SJTU in 2010. Between 2006 and 2009, he was a Visiting Student with the Visual Computing Group, Microsoft Research Asia. From September 2010 to February 2012, he was a Research Fellow with the National University of Singapore, Singapore. Currently, he is a Post-Doctoral Research Associate with the University of Illinois at Urbana-Champaign, Champaign, IL, USA. His current research interests include machine learning and computer vision.

    Richang Hong received the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2008. He was a Research Fellow with the School of Computing, National University of Singapore, Singapore, from September 2008 to December 2010. He is currently a Professor with the Hefei University of Technology, Hefei. He has co-authored more than 60 publications. His current research interests include multimedia question answering, video content analysis, and pattern recognition. He is a member of the Association for Computing Machinery. He was a recipient of the Best Paper Award at ACM Multimedia 2010.

    Shuicheng Yan (M’06–SM’09) is currently an Associate Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, and the Founding Lead of the Learning and Vision Research Group. His current research interests include computer vision, multimedia, and machine learning. He has authored or co-authored over 300 technical papers across a wide range of research topics, with over 9200 Google Scholar citations and an H-index of 42. He is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and ACM Transactions on Intelligent Systems and Technology, and has served as a Guest Editor of special issues for TMM and CVIU. He received the Best Paper Awards from ACM MM’12 (demo), PCM in 2011, ACM MM in 2010, ICME in 2010, and ICIMCS in 2009; the winner prizes of the classification task in PASCAL VOC from 2010 to 2012; the winner prize of the segmentation task in PASCAL VOC in 2012; the Honourable Mention Prize of the detection task in PASCAL VOC in 2010; the 2010 TCSVT Best Associate Editor Award; the 2010 Young Faculty Research Award; the 2011 Singapore Young Scientist Award; and the 2012 NUS Young Researcher Award.

    Changsheng Xu (M’97–SM’99) is a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, and an Executive Director of the China-Singapore Institute of Digital Media, Singapore. His current research interests include multimedia content analysis/indexing/retrieval, pattern recognition, and computer vision. He holds 30 granted/pending patents and has published over 200 refereed research papers. He is an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA and ACM Transactions on Multimedia Computing, Communications, and Applications. He served as a Program Chair of ACM Multimedia in 2009. He has served as an Associate Editor, Guest Editor, General Chair, Program Chair, Area/Track Chair, Special Session Organizer, Session Chair, and TPC Member for over 20 prestigious IEEE and ACM multimedia journals, conferences, and workshops. He is an ACM Distinguished Scientist.
