
An evidential reasoning based classification algorithm and its application for face recognition with class noise

Xiaodong Wang a,b, F. Liu a,b,*, L.C. Jiao b, Zhiguo Zhou a,b, Jingjing Yu a,b, Bing Li b, Jianrui Chen b, Jiao Wu a,b, Fanhua Shang b

a School of Computer Science and Technology, Xidian University, Xi'an 710071, PR China
b Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi'an 710071, PR China

Article info

Article history:

Received 2 February 2012

Received in revised form 26 May 2012
Accepted 7 June 2012
Available online 18 June 2012

Keywords:

Face recognition

Class noise

Evidential reasoning

Linear regression classification (LRC)

Sparse representation-based classification (SRC)


Abstract

For classification problems, in practice, real-world data may suffer from two types of noise: attribute noise and class noise. Removing as much of their adverse effects as possible is the key to improving recognition performance. In this paper, a formalism algorithm is proposed for classification problems with class noise, which are more challenging than those with attribute noise. The proposed formalism algorithm is based on evidential reasoning theory, a powerful tool for dealing with uncertain information in multiple attribute decision analysis and many other areas, and it may therefore be a more effective alternative for handling noisy label information. A specific algorithm, the Evidential Reasoning based Classification algorithm (ERC), is then derived to recognize human faces under class noise conditions. The proposed ERC algorithm is extensively evaluated on five publicly available face databases with class noise and yields good performance.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Pattern recognition/classification is an important topic in machine learning (and artificial intelligence). It assigns the input data to one of a given number of categories by means of an algorithm, which is obtained by learning from a training set of instances. Classification is applied in many fields, such as speech recognition, handwriting recognition, document classification, internet search engines, medical image analysis, and optical character recognition. For classification problems, the training set may suffer from two types of noise, attribute noise and class noise [1], which usually decrease classification accuracy. If some training samples are not correctly labeled, the training data have class noise. Classification under class noise conditions is usually more challenging than classification with attribute noise. The sources of class noise are very diverse, such as subjectivity, data-entry errors, or inadequacy of the information [2]. This paper focuses on face classification problems in the presence of class noise.

Many algorithms have been proposed to solve the class noise problem. Some of them are summarized as follows.

• Nearest neighbor algorithm. Nearest neighbor based algorithms [3–7] choose effective subsets of the training sets instead of the original training sets according to some rules. These algorithms achieve better accuracy and efficiency since many noisy training data and outliers are removed from the original training sets.

• Decision tree algorithm. Many decision tree algorithms [8–12] are applied to eliminate class noise. Some of them avoid overtraining by adjusting model parameters and building smaller trees (e.g. [9,10]), and others eliminate the training data with class noise by pruning methods or ensemble learning (e.g. [8,11] and [12]).

• Probabilistic algorithm. Probabilistic methods [13–17] also help many algorithms tolerate class noise. High breakdown estimation is used to eliminate the influence of outliers in [13,14]. Modeling methods for class noise are proposed in [15–17], which detect inconsistencies in the labels of the training samples.

• Ensemble learning algorithm. Ensemble learning algorithms [2,18,19] can also improve classification accuracy under class noise conditions. Bagging [18] tolerates outliers by bootstrapping training subsets and combining the classification results from the different subsets. In [2], ensemble filters are applied to eliminate mislabeled training samples. [19] uses reduced reward-punishment editing to identify and remove the outliers, forming different training subsets with changed parameters; the rotation forest algorithm then combines these subsets and yields better classification results.

• Other algorithm. There are also many other methods to deal with class noise problems, such as the neural-network-based method in [20], which removes noisy samples by using a neural network as a filtering mechanism, and the mutual-information-based method proposed in [21], which calculates the mutual information for each training sample and removes the samples with larger mutual information.

All of the above algorithms can be roughly divided into two groups: the first group aims to clear the outliers, while the second group tries to construct class-noise-proof models. According to these groups, the algorithms are summarized in Table 1.

Table 1. The related algorithms.

Algorithm family               First group    Second group
Nearest neighbor algorithm     [3–7]          –
Decision tree algorithm        [8,11,12]      [9,10]
Probabilistic algorithm        –              [13–17]
Ensemble learning algorithm    [2]            [18,19]
Other algorithm                [20,21]        –

As a powerful framework for uncertain reasoning, Dempster–Shafer theory (D-S theory [22,23]) has been widely applied in pattern recognition, for example to classification [24–33] and clustering [34–37]. Based on D-S theory, Yang et al. [38,39] proposed an evidential reasoning algorithm (ER), and Wang et al. [40] then developed the analytical evidential reasoning algorithm (ER analytical algorithm). Compared with the traditional evidence combination method of D-S theory, ER (and the ER analytical algorithm) can greatly reduce the computational cost and prevent irrational conclusions when combining conflicting evidence. The main contribution of this paper is to propose a novel evidential reasoning based classification algorithm, which can overcome the effects of class noise in classification problems. Specifically, several novel aspects of this paper are summarized as follows:

First, a formalism algorithm based on ER theory is proposed for classification problems. Because it distinguishes the training samples according to their different importance, the formalism algorithm may address class noise better. Owing to the properties of ER, it has many other potential benefits. For example, it may be applied to data with various structures (e.g. manifold or sphere structures), diverse attributes (e.g. quantitative, qualitative or vague attributes), and different labels (e.g. soft or crisp labels). In addition, it may take advantage of additional prior information. Furthermore, under the framework of the formalism classification algorithm, a specific algorithm (named the Evidential Reasoning based Classification algorithm, ERC) is proposed to recognize human faces with class noise. Numerical experiments show that ERC has good performance for solving such problems.

The remainder of this paper is organized as follows. Section 2 introduces a formalism classification algorithm based on the ER analytical algorithm and analyzes its potential benefits. In Section 3, the specific algorithm ERC is described in detail. Several numerical experiments are presented in Section 4; they are used to evaluate the performance of ERC in handling class noise. Finally, the paper is concluded in Section 5.

2. A formalism classification algorithm

In this section, we first provide some background on the ER analytical algorithm, and then introduce a formalism classification algorithm.

2.1. ER analytical algorithm

Now, we briefly describe some basic concepts and conclusions of the ER analytical algorithm [38–40]. They are stated so as to cater for classification problems. Let $\Omega = \{C_1, C_2, \ldots, C_K\}$ be a collectively exhaustive and mutually exclusive set of hypotheses; then $\Omega$ is called the frame of discernment. The nonnegative vector $\beta = (\beta(1), \ldots, \beta(K))^T$ is called a belief degree vector (BDV) if $\sum_{i=1}^{K} \beta(i) \le 1$, where $\beta(i) \triangleq \beta(C_i)$ is the belief degree of the hypothesis $C_i$. If $\sum_{i=1}^{K} \beta(i) = 1$, the BDV $\beta$ is complete; otherwise, if $\sum_{i=1}^{K} \beta(i) < 1$, the BDV $\beta$ is incomplete. For classification problems, let $x$ be a sample and let $\Omega = \{C_1, C_2, \ldots, C_K\}$ correspond to the $K$ classes, respectively; then $\beta(i)$ represents the belief degree of the hypothesis $C_i$, "$x$ belongs to the $i$-th class". A BDV can therefore be interpreted as a soft label. If the BDV is incomplete, $\beta(\Omega) = 1 - \sum_{i=1}^{K} \beta(i)$ is the belief degree of $\Omega$, which corresponds to the hypothesis "$x$ does not belong to any class".

Several BDVs corresponding to the same sample $x$ form the belief rule base (BRB). Let $\{\beta^1, \ldots, \beta^N\}$ be a BRB corresponding to the sample $x$; the final conclusion can be combined by the ER analytical algorithm [40,41]:

$$
\mu = \left[ \sum_{s=1}^{K} \prod_{i=1}^{N} \left( w_i \beta^i(s) + 1 - w_i \sum_{j=1}^{K} \beta^i(j) \right) - (K-1) \prod_{i=1}^{N} \left( 1 - w_i \sum_{j=1}^{K} \beta^i(j) \right) \right]^{-1}, \qquad (1)
$$

$$
\beta(k) = \frac{ \mu \left[ \prod_{i=1}^{N} \left( w_i \beta^i(k) + 1 - w_i \sum_{j=1}^{K} \beta^i(j) \right) - \prod_{i=1}^{N} \left( 1 - w_i \sum_{j=1}^{K} \beta^i(j) \right) \right] }{ 1 - \mu \left[ \prod_{i=1}^{N} (1 - w_i) \right] }. \qquad (2)
$$

In Eqs. (1) and (2), the activation weights are calculated [41,42] by

$$
w_i = \frac{\theta_i}{\sum_{j=1}^{N} \theta_j}, \quad i = 1, 2, \ldots, N, \qquad (3)
$$

where the rule weights $\theta_i$ ($i = 1, 2, \ldots, N$) reflect the importance of $\beta^i$ ($i = 1, 2, \ldots, N$) in the combination.

Now, we illustrate the behavior of the ER analytical algorithm by several examples. Let $\Omega = \{C_1, C_2\}$ and let $\beta^1, \beta^2$ be two BDVs such that

$$\beta^1(1) = 0.6, \quad \beta^1(2) = 0.4 \quad \text{and} \quad \beta^2(1) = 0.6, \quad \beta^2(2) = 0.4.$$

Their combination result according to Eq. (2) with activation weights $w_1 = w_2 = 0.5$ is

$$\beta(1) = 0.6190, \quad \beta(2) = 0.3810.$$

In the above example, if $w_1, w_2$ are left unchanged and $\beta^1, \beta^2$ satisfy

$$\beta^1(1) = 0.6, \quad \beta^1(2) = 0.4 \quad \text{and} \quad \beta^2(1) = 0.4, \quad \beta^2(2) = 0.6,$$

the combination result turns into

$$\beta(1) = 0.5, \quad \beta(2) = 0.5.$$

So, if both $\beta^1$ and $\beta^2$ tend to support the same hypothesis, the combination result supports this hypothesis even more strongly. Otherwise, if $\beta^1$ and $\beta^2$ tend to support different hypotheses, the combination result forms an unclear conclusion. However, asymmetric activation weights can improve the unclear result. If $w_1 = 0.8$, $w_2 = 0.2$ and $\beta^1, \beta^2$ satisfy

$$\beta^1(1) = 0.6, \quad \beta^1(2) = 0.4 \quad \text{and} \quad \beta^2(1) = 0.4, \quad \beta^2(2) = 0.6,$$

the combination result turns into

$$\beta(1) = 0.5793, \quad \beta(2) = 0.4207.$$

This means that the BDV $\beta^1$ with the bigger weight plays a more important role in ER.
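The combination rule in Eqs. (1)–(3) is easy to implement directly. The following short NumPy sketch (ours, not the authors' code; the function name `er_combine` is hypothetical) reproduces the three worked examples above:

```python
import numpy as np

def er_combine(bdvs, rule_weights):
    """Combine N belief degree vectors with the ER analytical algorithm,
    Eqs. (1)-(3). `bdvs` is an (N, K) array whose rows are BDVs;
    `rule_weights` are the theta_i of Eq. (3)."""
    bdvs = np.asarray(bdvs, dtype=float)
    theta = np.asarray(rule_weights, dtype=float)
    w = theta / theta.sum()                      # Eq. (3): activation weights
    K = bdvs.shape[1]
    incomplete = 1.0 - w * bdvs.sum(axis=1)      # 1 - w_i * sum_j beta^i(j)
    # prod_i (w_i beta^i(s) + 1 - w_i sum_j beta^i(j)) for every hypothesis s
    support = np.prod(w[:, None] * bdvs + incomplete[:, None], axis=0)
    unassigned = np.prod(incomplete)             # prod_i (1 - w_i sum_j beta^i(j))
    mu = 1.0 / (support.sum() - (K - 1) * unassigned)                  # Eq. (1)
    return mu * (support - unassigned) / (1.0 - mu * np.prod(1.0 - w)) # Eq. (2)

# Worked examples from Section 2.1
print(er_combine([[0.6, 0.4], [0.6, 0.4]], [0.5, 0.5]))  # -> [0.6190, 0.3810]
print(er_combine([[0.6, 0.4], [0.4, 0.6]], [0.5, 0.5]))  # -> [0.5, 0.5]
print(er_combine([[0.6, 0.4], [0.4, 0.6]], [0.8, 0.2]))  # -> [0.5793, 0.4207]
```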


2.2. A formalism classification algorithm

Based on ER, a formalism classification algorithm is proposed as follows.

The formalism classification algorithm

Step 1: Generate a BDV for each training sample according to the information from the training sample itself and the other training samples.
Step 2: For a test sample, compute an activation weight for each BDV by exploiting the data structure.
Step 3: Combine all BDVs with the activation weights by Eqs. (1)–(3) and make a decision based on the combination result.
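Read as code, the three steps amount to the following skeleton (a hypothetical sketch under the assumption that the three ingredients are supplied as callables; it is not the paper's implementation):

```python
import numpy as np

def formalism_classify(test_sample, train_samples, train_labels,
                       generate_bdv, activation_weights, er_combine):
    """Skeleton of the formalism classification algorithm (Steps 1-3)."""
    # Step 1: one BDV per training sample (generated once and then fixed).
    bdvs = np.array([generate_bdv(x, train_samples, train_labels)
                     for x in train_samples])
    # Step 2: activation weights computed for this particular test sample.
    weights = activation_weights(test_sample, train_samples, bdvs)
    # Step 3: combine the evidence with Eqs. (1)-(3) and pick the best class.
    combined = er_combine(bdvs, weights)
    return int(np.argmax(combined))
```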

When the formalism classification algorithm is applied to classification, two crucial factors affect the classification results: the BDVs and the activation weights. The methods used to form them can be chosen flexibly according to the characteristics of the data, and different choices lead to different specific algorithms. For example, in the next section ERC applies the linear regression classification algorithm (LRC [43]) and the sparse representation based classification algorithm (SRC [44]) to generate the BDVs and activation weights, because these two algorithms are well suited to face databases. There is a one-to-one correspondence between the BDVs and the training samples. In the training phase, the BDVs are generated and fixed. The function of the BDVs is to distinguish the importance of the training samples; this will be explained with ERC in the next section. Since the BDVs are fixed, the classification result for a test sample depends entirely on the activation weights. The activation weights reflect certain "similarities" that are closely correlated with the structure of the data. If larger activation weights are assigned to the training data in the same class as the test sample, the correct classification result will be obtained. As an example, we propose a specific algorithm, ERC, for face recognition that is robust against label noise. More details about ERC are discussed in the next section.

The formalism classification algorithm also has many other potential benefits.

Firstly, this algorithm may be applied to classify a test sample with prior information, if such information is available. For example, it may be known in advance that a test sample belongs to some classes with larger probability (or must not belong to some classes). In our method, this prior information can be used in the decision by providing an additional BDV with a proper activation weight, or by adjusting the activation weights corresponding to each training sample. If the prior information comes from other classifiers, our algorithm becomes an ensemble learning algorithm.

Secondly, this algorithm may be applied to handle different kinds of attributes, e.g. quantitative, qualitative and vague attributes. Many methods for dealing with different kinds of attributes have been developed for ER in multiple attribute decision analysis [38–42,45–52], and they may be introduced into our algorithm.

Thirdly, this algorithm may learn from training samples with soft labels and crisp labels, or even from unreliable training samples. Both soft labels and crisp labels can easily be recast as BDVs, with the uncertainty of the $i$-th training sample included in $\beta^i(\Omega)$. If a test sample obtains a combined BDV such that $\beta(\Omega)$ is larger than all other belief degrees, the algorithm can treat it as an invalid test image and refuse to classify it.
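As a small illustration of this point (ours, with hypothetical helper names), crisp and soft labels can both be encoded as BDVs, and the belief mass left for Ω can drive the rejection rule described above:

```python
import numpy as np

def crisp_label_to_bdv(label, num_classes):
    """A crisp label becomes a complete BDV: all belief on one class."""
    bdv = np.zeros(num_classes)
    bdv[label] = 1.0
    return bdv

def soft_label_to_bdv(soft_label):
    """A soft label is already a BDV; the mass 1 - sum(beta) belongs to Omega."""
    bdv = np.asarray(soft_label, dtype=float)
    assert bdv.min() >= 0.0 and bdv.sum() <= 1.0
    return bdv

def decide_or_reject(combined_bdv):
    """Refuse to classify when beta(Omega) exceeds every class belief."""
    belief_omega = 1.0 - combined_bdv.sum()
    if belief_omega > combined_bdv.max():
        return None  # treat as an invalid test image
    return int(np.argmax(combined_bdv))

print(crisp_label_to_bdv(2, 4))                       # [0. 0. 1. 0.]
print(decide_or_reject(np.array([0.2, 0.1, 0.3])))    # Omega mass 0.4 > 0.3 -> None
```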

3. ERC Algorithm

With increasingly diverse face data sources, e.g. the internet or surveillance video, class noise is unavoidable. Based on the formalism classification algorithm, in this section we develop a specific algorithm, ERC, for face recognition problems with class noise. It demonstrates the benefits of the proposed formalism classification algorithm well. Let $y$ be a test sample and let the matrix $X$ consist of the training samples. Let $X = [X_1, X_2, \ldots, X_K]$, where $X_k$ ($k = 1, 2, \ldots, K$) is the submatrix of $X$ whose column vectors are the training samples from the $k$-th class. Suppose that $X_k$ contains $n_k$ training samples and $N = \sum_{k=1}^{K} n_k$. ERC is introduced in detail in Sections 3.1–3.3.

3.1. Generate BDV

Let $x_i$ be a training sample which belongs to the $j$-th class. According to its class label, a BDV is defined as

$$
\beta(j) = 1 \quad \text{and} \quad \beta(k) = 0 \quad \forall k \in \{1, 2, \ldots, K\} \setminus \{j\}. \qquad (4)
$$

Let $n_{\min} = \min_k \{n_k - 1\}$ and let $\{x^k_1, x^k_2, \ldots, x^k_{n_k}\}$ be the column vectors of the matrix $X_k$ ($k = 1, 2, \ldots, K$), i.e. the training samples in the $k$-th class. For $k \ne j$, let $\bar{X}_k$ denote the matrix whose columns are the $n_{\min}$ vectors in $\{x^k_1, x^k_2, \ldots, x^k_{n_k}\}$ nearest to $x_i$, and let $\bar{X}_j$ denote the matrix whose columns are the $n_{\min}$ vectors in $\{x^j_1, x^j_2, \ldots, x^j_{n_j}\} \setminus \{x_i\}$ nearest to $x_i$. The BDV $\beta^i$ corresponding to $x_i$ is obtained by the following Algorithm 1.

Algorithm 1.

Step 1.1: Generate the BDV $\beta$ according to Eq. (4).
Step 1.2: Calculate $\alpha_k = (\bar{X}_k^T \bar{X}_k)^{-1} \bar{X}_k^T x_i$, $k = 1, 2, \ldots, K$.
Step 1.3: Compute the distances $d_k(x_i) = \| x_i - \bar{X}_k \alpha_k \|_{\ell_2}$, $k = 1, 2, \ldots, K$.
Step 1.4: Calculate $\tilde{\beta}$ according to the formula

$$
\tilde{\beta}(k) = \exp\left( - \frac{\gamma\, d_k(x_i)}{\sum_j d_j(x_i)} \right), \quad k = 1, 2, \ldots, K, \qquad (5)
$$

where $\gamma > 0$ is a constant.
Step 1.5: Give activation weights $w_1 = 1 - \rho$ for $\beta$ and $w_2 = \rho$ for $\tilde{\beta}$, where $\rho \in [0, 1]$ is a constant.
Step 1.6: Combine $\beta$ and $\tilde{\beta}$ according to formulae (1)–(3) with activation weights $w_1, w_2$ and obtain $\beta^i$.

The BDV $\beta^i$ fuses information that comes from both the class label of $x_i$ and the other training samples, so it can reduce the adverse effects of class noise well; this will be validated by the experiments in Section 4.1. Specifically, if $\beta$ and $\tilde{\beta}$ indicate the same class for $x_i$, the combination result $\beta^i$ will clearly show that class; otherwise, no component of $\beta^i$ will be significantly greater than the other components. In the latter case, the contribution of $\beta^i$ to classifying the test sample will be small. In other words, the BDVs distinguish the training samples according to their importance.

The method used to generate $\beta^i$ is similar to many data-cleaning approaches [3–7]. The difference is that the data-cleaning approaches remove the unreliable training samples, whereas our method retains all training samples and generates BDVs. The BDVs represent the belief degrees of the training samples and can therefore be considered as soft labels. Compared with the data-cleaning approaches, the BDVs retain more information about the training samples, which will later be combined well by the ER analytical algorithm.

The idea of forming $\tilde{\beta}$ is inspired by the method in [30]. One of the differences between them is that the proposed method uses the distances between a sample and the class subspaces rather than the distances between samples; the distances between a sample and the class subspaces (LRC [43]) are more suitable for face recognition. The function of the BDV in this paper is similar to that of the basic probability assignment (BPA) in [30]. Another difference is that the BDVs are fixed, whereas the BPAs change with the test samples during the test phase.

Algorithm 1 involves two parameters, $\gamma$ and $\rho$. In Eq. (5), the parameter $\gamma$ is set to 10. The other parameter, $\rho$ in Step 1.5, will be described in Section 4.
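A minimal NumPy sketch of Steps 1.1–1.5 of Algorithm 1 (ours, not the authors' code; `bdv_evidence` and `class_matrices` are hypothetical names, and an ordinary least-squares solve stands in for the explicit normal equations of Step 1.2). Step 1.6 then fuses the two pieces of evidence with the ER combination of Eqs. (1)–(3), e.g. the `er_combine` sketch shown in Section 2.1:

```python
import numpy as np

def bdv_evidence(x_i, label_i, class_matrices, gamma=10.0, rho=0.6):
    """Steps 1.1-1.5 of Algorithm 1. `class_matrices[k]` is a (D, n_min) matrix
    whose columns are the n_min training samples of class k nearest to x_i
    (x_i itself excluded from its own class). Step 1.6 would then compute
    beta_i = er_combine([beta, beta_tilde], weights)."""
    K = len(class_matrices)
    beta = np.zeros(K)
    beta[label_i] = 1.0                                  # Step 1.1, Eq. (4)
    d = np.empty(K)
    for k, Xk in enumerate(class_matrices):
        # Steps 1.2-1.3: LRC-style fit; least squares equals (Xk^T Xk)^{-1} Xk^T x_i
        alpha, *_ = np.linalg.lstsq(Xk, x_i, rcond=None)
        d[k] = np.linalg.norm(x_i - Xk @ alpha)
    beta_tilde = np.exp(-gamma * d / d.sum())            # Step 1.4, Eq. (5)
    weights = np.array([1.0 - rho, rho])                 # Step 1.5
    return beta, beta_tilde, weights
```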


3.2. Generate activation weights

Let $\tilde{X}_j$ ($j = 1, 2, \ldots, K$) be the matrix whose columns are the training samples $x_i$ such that $j = \arg\max_k \{\beta^i(k)\}$, where $\beta^i$ is the BDV corresponding to $x_i$ generated by Algorithm 1. Suppose $B = [\bar{X}, I]$, where $\bar{X}$ is obtained by normalizing the columns of $X$ to unit length and $I$ is an identity matrix. For the test sample $y$, the algorithm to generate the activation weight $w_i$ corresponding to the training sample $x_i$ ($i = 1, 2, \ldots, N$) is given as follows.

Algorithm 2.

Step 2.1: Calculate $\tilde{\alpha}_k = (\tilde{X}_k^T \tilde{X}_k)^{-1} \tilde{X}_k^T y$, $k = 1, 2, \ldots, K$, and let $\tilde{\alpha}^i_k$ be the component of $\tilde{\alpha}_k$ corresponding to the training sample $x_i$ if $x_i$ is a column vector of $\tilde{X}_k$.
Step 2.2: Compute the distances $d_k(y) = \| y - \tilde{X}_k \tilde{\alpha}_k \|_{\ell_2}$ ($k = 1, 2, \ldots, K$) and let $k^* = \arg\min_k \{d_k(y)\}$.
Step 2.3: Set $\tilde{\theta}_i = \tilde{\alpha}^i_{k^*}$ if $x_i$ is a column vector of $\tilde{X}_{k^*}$ and $\tilde{\alpha}^i_{k^*} > 0$, and $\tilde{\theta}_i = 0$ otherwise, $i = 1, 2, \ldots, N$.
Step 2.4: Calculate $\tilde{w}_i = \tilde{\theta}_i / \sum_{j=1}^{N} \tilde{\theta}_j$, $i = 1, 2, \ldots, N$.
Step 2.5: Solve the $\ell_1$-minimization problem

$$
\bar{\alpha} = \arg\min_{\alpha} \| \alpha \|_{\ell_1} \quad \text{s.t.} \quad y = B\alpha, \qquad (6)
$$

and let $\bar{\alpha}(i)$ be the $i$-th component of $\bar{\alpha}$ (i.e. $\bar{\alpha}(i)$ corresponds to the training sample $x_i$), $i = 1, 2, \ldots, N$.
Step 2.6: Set $\bar{\theta}_i = \bar{\alpha}(i)$ if $\bar{\alpha}(i) > 0$ and $\bar{\theta}_i = 0$ otherwise, $i = 1, 2, \ldots, N$.
Step 2.7: Compute $\bar{w}_i = \bar{\theta}_i / \sum_{j=1}^{N} \bar{\theta}_j$, $i = 1, 2, \ldots, N$.
Step 2.8: Give the activation weight $w_i$ ($i = 1, 2, \ldots, N$) according to

$$
w_i = \lambda \tilde{w}_i + (1 - \lambda) \bar{w}_i, \qquad (7)
$$

where $\lambda \in [0, 1]$ is a constant.

For a test sample $y$, the activation weight $w_i$ ($i = 1, 2, \ldots, N$) includes two pieces of information, $\tilde{w}_i$ and $\bar{w}_i$; the former is obtained by LRC and the latter by SRC. For face recognition problems, both LRC and SRC tend to assign larger coefficients to the training samples that come from the same class as $y$ [43,44], and thus the activation weights corresponding to these training samples are usually larger. The performance of ERC when only $\tilde{w}_i$ (or only $\bar{w}_i$) is used as the activation weight is analyzed in Section 4.3. Since the activation weights must not be negative, $\tilde{\theta}_i$ in Step 2.3 is set to 0 if $\tilde{\alpha}^i_{k^*} < 0$; similarly, $\bar{\theta}_i$ in Step 2.6 is set to 0 if $\bar{\alpha}(i) < 0$, $i = 1, 2, \ldots, N$. In addition, only the first $N$ components of $\bar{\alpha}$ are used, although its length is larger than $N$; in other words, we discard the components of $\bar{\alpha}$ corresponding to $I$.
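A sketch of Algorithm 2 (ours, with hypothetical names; the paper solves the ℓ1 program (6) with LARS from SparseLab, whereas this sketch recasts it as a linear program for scipy.optimize.linprog, and the default λ = 0.2 is only illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def erc_activation_weights(y, X, bdvs, lam=0.2):
    """Sketch of Algorithm 2. X is (D, N) with training samples as columns,
    bdvs is (N, K) with the BDVs from Algorithm 1, and lam is lambda in Eq. (7)."""
    D, N = X.shape
    K = bdvs.shape[1]
    classes = bdvs.argmax(axis=1)                 # which X~_k each column joins

    # Steps 2.1-2.4: LRC part of the weights.
    theta_lrc = np.zeros(N)
    d = np.full(K, np.inf)
    fits = {}
    for k in range(K):
        idx = np.flatnonzero(classes == k)
        if idx.size == 0:
            continue
        alpha, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        d[k] = np.linalg.norm(y - X[:, idx] @ alpha)
        fits[k] = (idx, alpha)
    idx, alpha = fits[int(np.argmin(d))]          # nearest class subspace k*
    theta_lrc[idx] = np.maximum(alpha, 0.0)       # Step 2.3: keep positive coefficients
    w_lrc = theta_lrc / max(theta_lrc.sum(), 1e-12)

    # Steps 2.5-2.7: SRC part. Basis pursuit  min ||a||_1  s.t.  y = B a,
    # written as a linear program over (a, t) with -t <= a <= t (dense, for clarity).
    B = np.hstack([X / np.linalg.norm(X, axis=0), np.eye(D)])
    M = B.shape[1]
    c = np.concatenate([np.zeros(M), np.ones(M)])
    A_ub = np.block([[np.eye(M), -np.eye(M)], [-np.eye(M), -np.eye(M)]])
    A_eq = np.hstack([B, np.zeros((D, M))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * M), A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * M + [(0, None)] * M, method="highs")
    theta_src = np.maximum(res.x[:N], 0.0)        # Step 2.6: drop the identity part
    w_src = theta_src / max(theta_src.sum(), 1e-12)

    return lam * w_lrc + (1.0 - lam) * w_src      # Step 2.8, Eq. (7)
```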

3.3. Evidence combination and classification

For a test sample $y$, the complete ERC algorithm is presented as follows.

Algorithm 3 (ERC algorithm).

Step 3.1: Generate the BDV for each training sample by using Algorithm 1.
Step 3.2: Compute the activation weights for the test sample $y$ by using Algorithm 2.
Step 3.3: Combine these BDVs with the activation weights according to Eqs. (1)–(3) and obtain the new BDV $\beta$ corresponding to $y$.
Step 3.4: Decide that $y$ is in the $k$-th class if $k = \arg\max_k \{\beta(k)\}$.

Generally speaking, it is hard to make SRC learn from training samples with soft labels. However, this becomes possible by using the strategy of Algorithm 3. Many other face recognition methods can be extended similarly, e.g. SSM (the sparse subspace method [53]) and NFL (nearest feature line [54]). This is another benefit of ERC.

3.4. Analysis of computational complexity

Because ERC is based on the sparse representation method, it has a high computational cost, like SRC. In this subsection, we analyze its computational complexity.

Firstly, we analyze the computational complexity of the evidential combination (see formulae (1) and (2)). Let $N$ be the number of BDVs in the BRB and $K$ be the number of components of the BDVs in the BRB. When $\mu$ is calculated, two multiplication operations are needed to compute $w_i \beta^i(s)$ and $w_i \sum_{j=1}^{K} \beta^i(j)$, $N$ multiplication operations are needed to compute $\prod_{i=1}^{N} (w_i \beta^i(s) + 1 - w_i \sum_{j=1}^{K} \beta^i(j))$, and $N$ multiplication operations are needed to compute $\prod_{i=1}^{N} (1 - w_i \sum_{j=1}^{K} \beta^i(j))$. The other operations have only a very small computation cost, so the computation cost for $\mu$ is $4 \times O(N) = O(N)$. When $\beta(k)$ ($k = 1, 2, \ldots, K$) is calculated, $N$ multiplication operations are needed to compute $\prod_{i=1}^{N} (1 - w_i)$. Note that only one $\prod_{i=1}^{N} (1 - w_i)$ needs to be computed (for all $k = 1, 2, \ldots, K$), and that $\prod_{i=1}^{N} (w_i \beta^i(s) + 1 - w_i \sum_{j=1}^{K} \beta^i(j))$ and $\prod_{i=1}^{N} (1 - w_i \sum_{j=1}^{K} \beta^i(j))$ have already been computed when $\mu$ is calculated. In addition, two multiplication operations are needed to compute $\mu [ \prod_{i=1}^{N} (w_i \beta^i(s) + 1 - w_i \sum_{j=1}^{K} \beta^i(j)) - \prod_{i=1}^{N} (1 - w_i \sum_{j=1}^{K} \beta^i(j)) ]$ and $\mu [ \prod_{i=1}^{N} (1 - w_i) ]$ for each $k \in \{1, 2, \ldots, K\}$. So, the computation cost for $\beta$ is $2K + N$. Therefore, if $N \gg K$, the total computation cost of the evidential combination is $2K + N + O(N) = O(N)$; otherwise, if $K \gg N$, the total computation cost of the evidential combination is $2K + N + O(N) = O(K)$.

Secondly, we analyze the computation cost of calculating $\tilde{\beta}(k)$ ($k = 1, 2, \ldots, K$) according to formula (5). Suppose that each class of data includes $p$ training samples and each sample has $D$ features. When $\alpha_k$ is calculated, $p^2 D$ multiplication operations are needed to compute $\bar{X}_k^T \bar{X}_k$, $p^3$ multiplication operations are needed to compute $(\bar{X}_k^T \bar{X}_k)^{-1}$, $pD$ multiplication operations are needed to compute $\bar{X}_k^T x_i$, and $p^2$ multiplication operations are needed to compute $[(\bar{X}_k^T \bar{X}_k)^{-1}][\bar{X}_k^T x_i]$. For face recognition problems, $D$ is usually much larger than $p$, so the computation cost for all $\alpha_k$ ($k = 1, 2, \ldots, K$) is $O(Kp^2D)$. It needs $K(pD + D) = O(KpD)$ multiplication operations to compute $d_k(x_i) = \| x_i - \bar{X}_k \alpha_k \|_{\ell_2}$ ($k = 1, 2, \ldots, K$), and $K$ multiplication operations and $K$ division operations to compute $\tilde{\beta}(k)$ ($k = 1, 2, \ldots, K$) according to formula (5). Therefore, the total computation cost for all $\tilde{\beta}(k)$ ($k = 1, 2, \ldots, K$) is $O(Kp^2D) + O(KpD) + 2K = O(Kp^2D)$.

According to the analysis in the second and third paragraphs of this subsection, and the fact that no multiplication operations are needed to generate $\beta(j)$ ($j = 1, 2, \ldots, K$) according to formula (4), the total computation cost of computing $\beta^i$ is $O(Kp^2D)$ for $\tilde{\beta}^i$ plus $O(K)$ for the evidential combination; note that the combination costs only $O(K)$ because only two BDVs are combined to generate $\beta^i$. Therefore, the total computation cost of training ERC (Algorithm 1), i.e. of generating all $\beta^i$ ($i = 1, 2, \ldots, N$), is $O(NKp^2D)$.

Thirdly, we analyze the computation cost of calculating $\bar{\alpha}$ by solving the programming (6). Let $B$ be a $D \times (N + D)$ matrix, where $D$ is the number of components of the training samples and $N$ is the number of training samples. There are many optimization algorithms for this problem. We choose an efficient greedy algorithm, LARS [55], which can obtain a high-quality solution with few iterations. The entire computation cost of LARS is $O(D^3)$, since the number of columns of $B$, $N + D$, is larger than the number of components of the samples, $D$ (see [55] for more details).

Fourthly, we analyze the computation cost of calculating the activation weights $w_i$ (Algorithm 2). Similar to the analysis in the third paragraph of this subsection, $O(Kp^2D)$ multiplication operations are needed to compute all $d_k(y)$ ($k = 1, 2, \ldots, K$). According to the analysis of the last paragraph, $O(D^3)$ multiplication operations are needed to compute $\bar{\alpha}$. For the other variables, $N$ division operations each are needed to compute all $\tilde{w}_i$ and $\bar{w}_i$ ($i = 1, 2, \ldots, N$), and $2N$ multiplication operations are needed to execute formula (7) and generate $w_i$ ($i = 1, 2, \ldots, N$). Note that $Kp^2D$ and $D^3$ are much larger than $N$ for face recognition, so the entire computation cost of Algorithm 2 is $O(Kp^2D) + O(D^3)$.

According to the conclusions of the second and sixth paragraphs and the fact that usually $N \gg K$ for classification problems, the computation cost of testing ERC is $O(N) + O(Kp^2D) + O(D^3) = O(Kp^2D) + O(D^3)$ for each test sample $y$. The computation costs of testing LRC and SRC are $O(Kp^2D)$ and $O(D^3)$, respectively. So, the computation cost of ERC is roughly equivalent to the sum of the computation costs of LRC and SRC.

4. Experimental results and discussions

To illustrate the efficiency of the proposed algorithm, ERC is employed to classify five face databases under class noise conditions. The five face databases are AR [56,57], Georgia Tech (GT) [58], JAFFE [59], ORL [60] and Extended Yale B [61,62], respectively.

The AR face database contains over 4,000 color images corresponding to 126 people's faces (70 men and 56 women). Each class contains 26 samples (face pictures) corresponding to different facial variations, illumination, expressions, and facial disguises. Similar to [44], a part of the data set is selected to test ERC in the experiments. This part consists of 100 subjects (50 male and 50 female) and, for each person, 14 images with only illumination change and expressions. In Fig. 1, the first row shows the images of the first subject in this database.

Fig. 1. The first, second and fifth rows show cropped facial images (32×32) of the first subject in the AR database, Georgia Tech database and ORL database, respectively. The third and fourth rows show cropped facial images (32×32) of the first subject in the JAFFE database. The 64 faces (32×32 cropped images) of the first subject of the Extended Yale B database are shown in the sixth to ninth rows.

The GT face database contains images of 50 people, with 15 images per person. These images show frontal (or tilted) faces with different facial expressions and lighting conditions. In Fig. 1, the second row shows the images of the first subject in this database.

The JAFFE face database contains 213 images of ten female models, with about 21 images per person. In Fig. 1, the third and fourth rows show the images of the first subject in this database.

The ORL face database contains 40 subjects with ten images per subject. The images were taken against a homogeneous background. For each person, the images show an upright, frontal position with different lighting, facial details (glasses/no glasses) and facial expressions. In Fig. 1, the fifth row shows the images of the first subject in this database.

The Extended Yale B face database contains 2414 frontal face images of 38 subjects under various laboratory-controlled lighting conditions. In Fig. 1, the sixth to ninth rows show the images of the first subject in this database.

In order to reduce the computational cost, we resize the cropped facial images to 32×32, convert them to column vectors, and then project them into the PCA subspace by discarding the smallest principal components. For the PCA projection, 99% of the energy is kept in the sense of reconstruction error. Specifically, suppose that $f_1, f_2, \ldots, f_N$ are length-1024 vectors. Let $W_{\mathrm{pca}} = (w^1_{\mathrm{pca}}, w^2_{\mathrm{pca}}, \ldots, w^{1024}_{\mathrm{pca}})^T$, where $w^i_{\mathrm{pca}}$ is the eigenvector corresponding to the $i$-th largest eigenvalue of the matrix $(1/N)\sum_{i=1}^{N}(f_i - \bar{f})(f_i - \bar{f})^T$ and $\bar{f} = (1/N)\sum_{i=1}^{N} f_i$. Let $W^D_{\mathrm{pca}} = (w^1_{\mathrm{pca}}, w^2_{\mathrm{pca}}, \ldots, w^D_{\mathrm{pca}})^T$ be a $D \times 1024$ projection matrix and $x^D_i = W^D_{\mathrm{pca}} f_i$ ($i = 1, 2, \ldots, N$). If $d = \min\{ D \in \{1, 2, \ldots, 1024\} \mid \sum_{i=1}^{N} \| x^D_i \|_2^2 > 0.99 \sum_{i=1}^{N} \| f_i \|_2^2 \}$, then $x^d_i$ ($i = 1, 2, \ldots, N$) are the dimensionality-reduced data obtained by PCA, which are used to test ERC in the numerical experiments. The other experimental designs and the corresponding results are given in the following sections. They illustrate that ERC is more robust against class noise than the competing methods.
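A small NumPy sketch of this preprocessing, written to be consistent with the description above rather than taken from the authors' code (`pca_keep_energy` is a hypothetical name):

```python
import numpy as np

def pca_keep_energy(F, energy=0.99):
    """F is (n_samples, 1024): vectorized 32x32 face images, one per row.
    Eigenvectors come from the covariance of the centered data; the raw vectors
    f_i are then projected, and d is the smallest dimension whose projections
    retain more than `energy` of the total squared norm of the raw data."""
    n, dim = F.shape
    mean = F.mean(axis=0)
    Fc = F - mean
    C = Fc.T @ Fc / n                                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)                # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest first
    # Energy of the projections of the (uncentered) f_i onto the top-d axes.
    captured = n * np.cumsum(eigvals + (eigvecs.T @ mean) ** 2)
    d = int(np.searchsorted(captured, energy * np.sum(F ** 2)) + 1)
    d = min(d, dim)
    W = eigvecs[:, :d].T                                # d x 1024 projection matrix
    return F @ W.T, W

# Toy usage with random data standing in for the cropped 32x32 faces
rng = np.random.default_rng(0)
faces = rng.random((20, 1024))
X_d, W = pca_keep_energy(faces)
print(X_d.shape)
```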


4.1. The stability of BDV under class noise conditions

In this subsection, we illustrate that the BDV formed by Algorithm 1 can reduce the adverse effects of class noise well.

All five face databases are used to test the stability of the BDV under class noise conditions. For each class, about half of the images are randomly chosen as training samples and 20% class noise is added. Specifically, n1 denotes the number of training images of each class and n2 denotes the number of training images of each class with random noisy labels; they are shown in Table 2.

Table 2. The number of training images and noise images of each class.

Face database    AR    GT    JAFFE    ORL    Extended Yale B
n1               7     8     11       5      32
n2               1     2     2        1      6
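Class noise of this kind can be injected by re-labeling n2 randomly chosen training images per class with a random wrong label; the helper below is a hypothetical illustration, not the authors' script:

```python
import numpy as np

def add_class_noise(labels, n_noisy_per_class, num_classes, seed=0):
    """Return a copy of `labels` in which, for every class, `n_noisy_per_class`
    randomly chosen training samples receive a random incorrect label."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    for c in range(num_classes):
        idx = rng.choice(np.flatnonzero(labels == c),
                         size=n_noisy_per_class, replace=False)
        for i in idx:
            wrong = [k for k in range(num_classes) if k != c]
            noisy[i] = rng.choice(wrong)
    return noisy

# Example: 5 training images per class (as for ORL), 1 mislabeled per class
labels = np.repeat(np.arange(40), 5)
noisy_labels = add_class_noise(labels, n_noisy_per_class=1, num_classes=40)
print(np.mean(noisy_labels != labels))   # 0.2, i.e. 20% class noise
```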

Algorithm 1 is employed to generate a BDV for each training image. If βi(k) is the maximum component of βi, we determine that the i-th training image belongs to the k-th class. In Algorithm 1, the parameter γ is set to 10 and the parameter ρ is searched over the grid {0.1, 0.2, …, 0.9}. The best average results of 50 random experiments are shown in Table 3. The original class noise ratio is shown in the first row; the new class noise ratio (under the best ρ) obtained by the BDVs and the corresponding ρ are presented in the second and third rows, respectively.

Table 3. The average error rate and standard deviation obtained by Algorithm 1.

Face database     AR             GT             JAFFE          ORL            Extended Yale B
Noise ratio       0.1429         0.25           0.1818         0.2            0.1875
New noise ratio   0.0994±0.0061  0.1778±0.0141  0.0591±0.0194  0.1235±0.0160  0.0376±0.0051
ρ                 0.6            0.7            0.6            0.6            0.6

The results listed in Table 3 show that the BDV generated by Algorithm 1 reduces the class noise ratio in general, especially for the JAFFE and Extended Yale B databases.

4.2. Performance of ERC

The classification performance of ERC on all five face databases is provided, as listed in Table 4. We choose seven algorithms for comparison:

• The method proposed in [30] (denoted by KNNDS), which is a k-nearest neighbor method based on Dempster–Shafer theory. KNNDS has one parameter (the neighborhood size), which is tuned over the grid {1, 2, …, 20} in the experiments. The best results are reported.
• RT1, RT2 and RT3 [7], which are class-noise-tolerant algorithms based on the k-nearest neighbor method. The three methods have two neighborhood sizes as parameters. Both are tuned over the grid {1, 2, …, 20} and the best results are reported in the experiments.
• LRC [43] and SRC [44], which are two face recognition algorithms with good performance. LRC has no parameter. For SRC, the sparse representation is obtained by the programming (6) in the experiments, so there is also no parameter for SRC.
• Linear support vector machine (linear SVM) [64], which is suitable for face classification because the face data roughly satisfy the linear subspace hypothesis (i.e. the face images of the same person lie roughly on a linear subspace). In the experiments, the C-SVC algorithm with the linear kernel from the libsvm2.91 tools (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/) is used. The regularization parameter C is tuned over the grid {2^-25, 2^-24, …, 2^25}. The best results are reported.

In the experiments, the number of training images and the class noise ratio are the same as the settings in Section 4.1. The least-angle regression algorithm (LARS [55], i.e. SolveLasso.m in SparseLab [63]) is applied to solve the programming (6) for both ERC and SRC. The parameter λ of ERC is searched over the grid {0, 0.1, 0.2, …, 1}, and the other parameters, i.e. γ and ρ, are the same as the settings in Section 4.1. For the different parameters, the best results are shown. Table 4 shows the average error rate and standard deviation of 50 random experiments and the corresponding best parameters. For the AR, ORL and Extended Yale B databases, ERC obtains a lower error rate than the other seven algorithms. For the GT database, ERC is worse than linear SVM and KNNDS and much better than the other five algorithms. For the JAFFE database, ERC is slightly worse than linear SVM and much better than the other six algorithms.

Table 4. The average error rate and standard deviation obtained by different algorithms.

Face database   AR             GT             JAFFE          ORL            Yale B
ERC             0.2517±0.0206  0.3400±0.0188  0.0307±0.0210  0.1653±0.0297  0.0859±0.0087
ρ               0.6            0.6            0.6            0.6            0.6
λ               0.2            0.1            0.1            0.1            0.3
KNNDS           0.4794±0.0191  0.3396±0.0195  0.0334±0.0205  0.2402±0.0297  0.5070±0.0145
LRC             0.3020±0.0215  0.4058±0.0244  0.1309±0.0398  0.2472±0.0325  0.1096±0.0088
RT1             0.6677±0.0219  0.4647±0.0256  0.0683±0.0322  0.4159±0.0394  0.5931±0.0181
RT2             0.5462±0.0187  0.4882±0.0273  0.0788±0.0402  0.3987±0.0402  0.5768±0.0141
RT3             0.7930±0.0185  0.5397±0.0340  0.0670±0.0321  0.5550±0.0501  0.6602±0.0172
SRC             0.3670±0.0176  0.4223±0.0264  0.1501±0.0419  0.2392±0.0271  0.2413±0.0130
Linear SVM      0.2738±0.0235  0.3075±0.0218  0.0262±0.0232  0.1797±0.0297  0.1709±0.0111

4.3. Sensitivity to the selection of parameters

Although the classification performance of ERC is influenced by two parameters (i.e. ρ and λ), the following experimental results show that they are easy to tune. (The settings of the experiments are the same as the settings in Section 4.2.) Figs. 2–11 illustrate the evolutions of the average error rate and standard deviation of 50 random experiments on the five face databases versus the two parameters.

Page 7: Face Classif

Fig. 2. Evolutions of the error rate and standard deviation of ERC on the AR database versus ρ (λ = 0.2); the best error rates and standard deviations of the other algorithms are shown for comparison.
Fig. 3. Evolutions of the error rate and standard deviation of ERC on the AR database versus λ (ρ = 0.6); the best results of the other algorithms are shown for comparison.
Fig. 4. Evolutions of the error rate and standard deviation of ERC on the GT database versus ρ (λ = 0.1); the best results of the other algorithms are shown for comparison.
Fig. 5. Evolutions of the error rate and standard deviation of ERC on the GT database versus λ (ρ = 0.6); the best results of the other algorithms are shown for comparison.
Fig. 6. Evolutions of the error rate and standard deviation of ERC on the JAFFE database versus ρ (λ = 0.1); the best results of the other algorithms are shown for comparison.

When one parameter changes, the other parameter is set to its best value in Table 4. For ease of comparison, the best error rates and standard deviations of the other algorithms are shown as straight lines in the figures. As can be seen from the experimental results, although ERC is sensitive to ρ, the best ρ is near 0.6 for all databases. When ρ = 0.6, ERC is stable over a wide range of values of λ.

Next, we provide an analysis of the performance of ERC when only one of SRC and LRC is applied to generate the activation weights. When the parameters are set to ρ = 0.6 and λ = 0, only SRC is applied to generate the activation weights. For the ORL database, ERC is better than all of the other algorithms. For the JAFFE database, ERC is slightly worse than linear SVM and better than the other six algorithms. For the AR database, ERC is slightly worse than LRC and linear SVM, and better than the other five algorithms. For the Extended Yale B database, ERC is slightly worse than LRC and better than the other six algorithms. For the GT database, ERC is slightly worse than KNNDS and linear SVM, and better than the other five algorithms. It is clear that ERC is much better than SRC on all databases.

Fig. 7. Evolutions of the error rate and standard deviation of ERC on the JAFFE database versus λ (ρ = 0.6); the best results of the other algorithms are shown for comparison.
Fig. 8. Evolutions of the error rate and standard deviation of ERC on the ORL database versus ρ (λ = 0.1); the best results of the other algorithms are shown for comparison.
Fig. 9. Evolutions of the error rate and standard deviation of ERC on the ORL database versus λ (ρ = 0.6); the best results of the other algorithms are shown for comparison.
Fig. 10. Evolutions of the error rate and standard deviation of ERC on the Extended Yale B database versus ρ (λ = 0.3); the best results of the other algorithms are shown for comparison.
Fig. 11. Evolutions of the error rate and standard deviation of ERC on the Extended Yale B database versus λ (ρ = 0.6); the best results of the other algorithms are shown for comparison.

When the parameters are set to ρ = 0.6 and λ = 1, only LRC is applied to generate the activation weights. For the Extended Yale B database, ERC performs better than the other algorithms. For the AR and ORL databases, ERC is slightly worse than linear SVM and better than the other five algorithms. For the GT and JAFFE databases, ERC is slightly worse than linear SVM and KNNDS, and better than the other five algorithms. Under these conditions, ERC is better than LRC on all databases.

Since the classification performance of ERC depends heavily on SRC and LRC, ERC can obtain higher precision if a more effective technique for forming the activation weights is found.

4.4. The performance for the diverse contamination rates

In this subsection, we test the performance of ERC as the contamination rate changes. The experiments are split into two groups. In the first group, the number of training images remains unchanged and the number of training images with noisy labels increases gradually. The different numbers of training images with noisy labels are shown in Fig. 12. The numbers of training images are 8, 8, 12, 6 and 32 for the AR, GT, JAFFE, ORL and Extended Yale B databases, respectively.

Fig. 12. Evolutions of the error rate and standard deviation of ERC versus the number of training images with noisy labels. In the experiments, the number of training images is kept unchanged: there are 8, 8, 12, 6 and 32 training images for the AR, GT, JAFFE, ORL and Extended Yale B databases, respectively.

The other settings of the experiments are the same as the settings in Section 4.2. The average error rates and standard deviations of 20 random experiments on the five face databases are given in Fig. 12. For the AR, ORL and Extended Yale B databases, ERC obtained the best results for all contamination rates. For the GT database, ERC is better than RT1, RT2, RT3, LRC and SRC for all noise ratios and worse than linear SVM and KNNDS. For the JAFFE database, ERC is best for the smallest noise ratio, but it degenerates seriously as the contamination rate increases. We consider that large class noise seriously affects LRC and SRC, which leads to poor BDVs and activation weights and hence to the serious degeneration of ERC.

In the second group, the number of training images with noisy labels remains unchanged and the number of training images increases gradually. The numbers of training images are shown in Fig. 13. The numbers of training images with noisy labels are 1, 2, 2, 1 and 6 for the AR, GT, JAFFE, ORL and Extended Yale B databases, respectively. The other settings of the experiments are the same as the settings in Section 4.2. The average error rates and standard deviations of 20 random experiments on the five face databases are reported in Fig. 13. For the AR and Extended Yale B databases, ERC has the best performance. For the JAFFE database, ERC has a recognition rate similar to linear SVM and KNNDS, and better performance than the other five algorithms. For the ORL database, ERC has a recognition rate similar to linear SVM, and better performance than the other six algorithms. For the GT database, ERC is worse than linear SVM, similar in recognition accuracy to KNNDS, and better than the other five algorithms.

4.5. Fixed parameters and computational complexity

It is very difficult to select proper parameters for ERC because of the noisy labels. In this subsection, we test ERC with fixed parameters in terms of classification accuracy and CPU time.

All five face databases are applied to test ERC. In all five experiments, the parameters ρ = 0.6 and λ = 0.2 are fixed for ERC. KNNDS, RT1, RT2, RT3 and linear SVM select their best parameters for the five different databases according to the numerical results in Section 4.2. The other settings of the experiments are the same as the settings in Section 4.2. Table 5 reports the average error rate and standard deviation of ERC and the competing algorithms. The numerical results illustrate that the recognition accuracies of ERC are the best on the AR, ORL and Extended Yale B databases. Although ERC is not optimal on the GT and JAFFE databases, it is still better than LRC, RT1, RT2, RT3 and SRC there.

The theoretical computational complexity of ERC is analyzed in Section 3.4, and the numerical computational complexity is provided below. Tables 6 and 7 list the average training and test CPU times obtained by all algorithms under the best parameters, respectively. The numerical experiments are executed with MATLAB 7.10 on an HP xw9400 workstation with a Six-Core AMD Opteron 2439 SE 2.80 GHz processor and 32 GB memory.

Fig. 13. Evolutions of the error rate and standard deviation of ERC versus the number of training images. In the experiments, the number of training images with noisy labels is kept unchanged: there are 1, 2, 2, 1 and 6 training images with noisy labels for the AR, GT, JAFFE, ORL and Extended Yale B databases, respectively.

Table 5. The average error rate and standard deviation obtained by different algorithms.

Face database   AR             GT             JAFFE          ORL            Yale B
ERC             0.2527±0.0195  0.3513±0.0245  0.0291±0.0201  0.1779±0.0298  0.0848±0.0090
KNNDS           0.4807±0.0171  0.3427±0.0288  0.0365±0.0223  0.2480±0.0253  0.5054±0.0123
LRC             0.2958±0.0237  0.4048±0.0303  0.1153±0.0320  0.2562±0.0384  0.1078±0.0080
RT1             0.6768±0.0219  0.4766±0.0305  0.0751±0.0427  0.4171±0.0430  0.5889±0.0179
RT2             0.5471±0.0189  0.4869±0.0326  0.0738±0.0304  0.4078±0.0462  0.5677±0.0169
RT3             0.7914±0.0199  0.5389±0.0247  0.0672±0.0350  0.5668±0.0517  0.6639±0.0142
SRC             0.3738±0.0159  0.4255±0.0294  0.1390±0.0349  0.2535±0.0355  0.2365±0.0107
Linear SVM      0.2700±0.0238  0.3123±0.0236  0.0249±0.0181  0.1822±0.0316  0.1702±0.0142

Table 6. The average training CPU time (s) and standard deviation obtained by different algorithms.

Face database   AR             GT             JAFFE          ORL            Yale B
ERC             4.1028±0.0148  2.6719±0.5809  1.0794±0.4571  1.7000±0.6871  74.143±6.2063
RT1             37.746±1.4622  5.2578±0.5314  0.5538±0.2490  1.0944±0.2700  152.31±3.1661
RT2             2.8619±0.0773  0.6897±0.0690  0.0513±0.0071  0.2822±0.0259  5.4703±0.1159
RT3             0.8438±0.0465  0.7900±0.2883  0.0766±0.0253  0.1197±0.0125  4.8945±1.4927
Linear SVM      0.2859±0.0119  0.1059±0.0106  0.0088±0.0101  0.0325±0.0062  0.6977±0.0116


The CPU times of KNNDS, LRC and SRC are not reported in Table 6 because they have no training phase. Although training ERC requires a long CPU time, the computational cost can be tolerated. In the test phase, the CPU time of ERC is a little more than the sum of the CPU times consumed by LRC and SRC, which is consistent with the theoretical analysis of computational complexity in Section 3.4.


Table 7. The average test CPU time (s) and standard deviation obtained by different algorithms.

Face database   AR             GT             JAFFE          ORL            Yale B
ERC             13.040±0.3545  5.9425±0.0342  2.0409±0.2936  1.4797±0.0298  189.52±2.1644
KNNDS           0.3475±0.0098  0.5438±0.2047  0.0306±0.0137  0.0541±0.0253  4.9531±0.6369
LRC             1.4369±0.0289  0.4866±0.0055  0.0519±0.0074  0.1391±0.0384  89.866±1.7034
RT1             0.2900±0.0096  0.0966±0.0068  0.0172±0.0047  0.0378±0.0430  2.1516±0.5379
RT2             0.3541±0.0075  0.1563±0.0055  0.0200±0.0071  0.0450±0.0462  3.5078±0.4134
RT3             0.2456±0.0078  0.0856±0.0079  0.0150±0.0044  0.0331±0.0517  0.3906±0.0134
SRC             8.3744±0.0356  4.8069±0.0291  1.9681±0.2936  1.1469±0.0355  51.987±0.9396
Linear SVM      0.5344±0.0063  0.1263±0.0043  0.0084±0.0079  0.0291±0.0316  0.7602±0.0105


5. Conclusion

This paper focuses on applying ER to face classification. This application needs two preparations: (i) a BRB, and (ii) the activation weights corresponding to all BDVs in the BRB. Thus, a formalism classification algorithm is proposed based on the following two aspects. For the BRB, a BDV is generated as a soft label for each training sample and indicates the contribution of the sample to the final classification decision. Furthermore, the activation weights are obtained by exploiting the structure information between the testing and training sets. Classification is based on the result of the combination of evidence.

ER has well-developed theory and widespread applications in many areas, so we think that the formalism classification algorithm may serve as a tie between ER and pattern recognition. Many benefits of ER can be exploited to develop pattern recognition methods; for example, it can use various kinds of prior information, integrate different kinds of attributes (even partially missing attributes), and handle unreliable training samples.

The formalism classification algorithm proposed in this paper might be regarded as a general framework. By changing the methods used to generate the BDVs and activation weights, different specific algorithms can be derived. Therefore, the formalism classification algorithm is very flexible and applicable to a wide range of classification problems. However, to obtain better performance, the BDVs and activation weights must be designed carefully according to the characteristics of different classification problems, so it is better to choose one kind of data set with the same characteristics to test the performance of the formalism classification algorithm. Accordingly, ERC is proposed to recognize human faces under class noise conditions. To the best of our knowledge, this paper is the first to apply ER to face recognition. Because the BDV fuses information from the training sample itself and the other training samples, ERC can tolerate a certain degree of class noise well. Two excellent face recognition algorithms (LRC and SRC) are used to form the activation weights, so ERC is well suited to classifying face images. With the help of the strategy of ERC, many algorithms (e.g. SRC, SSM and NFL) can be applied to recognize human faces when the training samples have soft labels, which could not be done before. By taking full advantage of LRC, SRC and ER, the numerical experiments in Section 4 show that the proposed ERC algorithm obtains high accuracy in face recognition with class noise.

There are several limitations to the proposed algorithm. ERC has two parameters that need to be adjusted for better performance; although these two parameters are easily adjusted, as shown in Sections 4.3 and 4.5, we would prefer an algorithm without any parameters. In addition, ERC is time-consuming in both the training and testing phases; a more efficient method to find the BDVs and activation weights should be designed to reduce the computational cost in our future work.

Acknowledgment

The authors would like to thank the three reviewers for their comments, which helped us greatly improve this submission. The authors would also like to thank Dr. Zhijie Zhou for his suggestions, which corrected many errors and greatly improved the quality of this paper.

This work was supported in part by the National Natural Science Foundation of China (Nos. 60803097, 60970067, 61003198, 61072106, 60971112, 60971128, 61072108), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), the National Science and Technology Ministry of China (Nos. 9140A07011810DZ0107, 9140A07021010DZ0131), and the Fundamental Research Funds for the Central Universities (Nos. JY10000902001, K50510020001, JY10000902045).

References

[1] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study, Artificial Intelligence Review 22 (2004) 177–210.

[2] C. Brodley, M. Friedl, Identifying mislabeled training data, Journal of Artificial Intelligence Research 11 (1999) 131–167.

[3] B. Dasarathy, Noising around the neighbourhood: a new system structure and classification rule for recognition in partially exposed environments, IEEE Transactions on Pattern Analysis and Machine Intelligence 2 (1980) 67–71.

[4] G. Gates, The reduced nearest neighbor rule, IEEE Transactions on Information Theory 18 (1972) 431–433.

[5] P. Hart, The condensed nearest neighbor rule, IEEE Transactions on Information Theory 14 (1968) 515–516.

[6] F. Angiulli, Fast condensed nearest neighbor rule, in: International Conference on Machine Learning, 2005, pp. 7–11.

[7] D. Wilson, T. Martinez, Instance pruning techniques, in: International Conference on Machine Learning, 1997, pp. 404–411.

[8] G. John, Robust decision trees: removing outliers from databases, in: Proceedings of the First ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1995, pp. 174–179.

[9] T. Denoeux, M. Bjanger, Induction of decision trees from partially classified data using belief functions, in: Proceedings of SMC, 2000, pp. 2923–2928.

[10] P. Vannoorenberghe, T. Denoeux, Handling uncertain labels in multiclass problems using belief decision trees, in: Proceedings of IPMU, 2002.

[11] J. Mingers, An empirical comparison of pruning methods for decision tree induction, Machine Learning 4 (1989) 227–243.

[12] X. Zhu, X. Wu, Q. Chen, Eliminating class noise in large datasets, in: International Conference on Machine Learning, 2003, pp. 920–927.

[13] D. Hawkins, G. McLachlan, High-breakdown linear discriminant analysis, Journal of the American Statistical Association 92 (1997) 136–143.

[14] S. Bashir, E. Carter, High breakdown mixture discriminant analysis, Journal of Multivariate Analysis 93 (2005) 102–111.

[15] N. Lawrence, B. Scholkopf, Estimating a kernel Fisher discriminant in the presence of label noise, in: International Conference on Machine Learning, 2001, pp. 306–313.

[16] Y. Li, L. Wessels, D. Ridder, M. Reinders, Classification in the presence of class noise using a probabilistic kernel Fisher method, Pattern Recognition 40 (2007) 3349–3357.

[17] C. Bouveyron, S. Girard, Robust supervised classification with mixture models: learning from data with uncertain labels, Pattern Recognition 42 (2009) 2649–2658.

[18] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.

[19] L. Nanni, A. Franco, Reduced reward-punishment editing for building ensembles of classifiers, Expert Systems with Applications 38 (2011) 2395–2400.



[20] X. Zeng, T. Martinez, A noise filtering method using neural networks, in: IEEE International Workshop on Soft Computing Techniques in Instrumentation, Measurement and Related Applications, 2003, pp. 26–31.

[21] I. Guyon, N. Matic, V. Vapnik, Discovering informative patterns and data cleaning, Advances in Knowledge Discovery and Data Mining (1996) 181–203.

[22] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, Princeton, 1976.

[23] A. Dempster, Upper and lower probabilities induced by a multi-valued mapping, Annals of Mathematical Statistics 38 (1967) 325–339.

[24] T. Denoeux, Z. Younes, F. Abdallah, Representing uncertainty on set-valued variables using belief functions, Artificial Intelligence 174 (2010) 479–499.

[25] E. Come, L. Oukhellou, T. Denoeux, P. Aknin, Learning from partially supervised data using mixture models and belief functions, Pattern Recognition 42 (2009) 334–348.

[26] T. Denoeux, P. Smets, Classification using belief functions: the relationship between the case-based and model-based approaches, IEEE Transactions on Systems, Man and Cybernetics—Part B 36 (2006) 1395–1406.

[27] T. Denoeux, A neural network classifier based on Dempster–Shafer theory, IEEE Transactions on Systems, Man and Cybernetics—Part A 30 (2000) 131–150.

[28] L. Zouhal, T. Denoeux, An evidence-theoretic k-NN rule with parameter optimization, IEEE Transactions on Systems, Man and Cybernetics—Part C 28 (1998) 263–271.

[29] T. Denoeux, Analysis of evidence-theoretic decision rules for pattern classification, Pattern Recognition 30 (1997) 1095–1107.

[30] T. Denoeux, A k-nearest neighbor classification rule based on Dempster–Shafer theory, IEEE Transactions on Systems, Man and Cybernetics 25 (1995) 804–813.

[31] Y. Bi, J. Guan, D. Bell, The combination of multiple classifiers using an evidential reasoning approach, Artificial Intelligence 172 (2008) 1731–1751.

[32] Y. Bi, S. McClean, T. Anderson, Combining rough decisions for intelligent text mining using Dempster's rule, Artificial Intelligence Review 26 (2006) 191–209.

[33] L. Xu, A. Krzyzak, C. Suen, Methods of combining multiple classifiers and their applications to handwriting recognition, IEEE Transactions on Systems, Man and Cybernetics 22 (1992) 418–435.

[34] M. Masson, T. Denoeux, RECM: relational evidential c-means algorithm, Pattern Recognition Letters 30 (2009) 1015–1026.

[35] M. Masson, T. Denoeux, ECM: an evidential version of the fuzzy c-means algorithm, Pattern Recognition 41 (2008) 1384–1397.

[36] M. Masson, T. Denoeux, Clustering interval-valued data using belief functions, Pattern Recognition Letters 25 (2004) 163–171.

[37] T. Denoeux, M. Masson, EVCLUS: evidential clustering of proximity data, IEEE Transactions on Systems, Man and Cybernetics—Part B 34 (2004) 95–109.

[38] J. Yang, M. Singh, An evidential reasoning approach for multiple-attribute decision making with uncertainty, IEEE Transactions on Systems, Man and Cybernetics 24 (1994) 1–18.

[39] J. Yang, D. Xu, On the evidential reasoning algorithm for multiple attribute decision analysis under uncertainty, IEEE Transactions on Systems, Man and Cybernetics 32 (2002) 289–304.

[40] Y. Wang, J. Yang, D. Xu, Environmental impact assessment using the evidential reasoning approach, European Journal of Operational Research 174 (2006) 1885–1913.

[41] Z. Zhou, C. Hu, J. Yang, D. Xu, D. Zhou, Online updating belief rule based system for pipeline leak detection under expert intervention, Expert Systems with Applications 36 (2009) 7700–7709.

[42] J. Yang, J. Liu, J. Wang, H. Sii, H. Wang, Belief rule-base inference methodology using the evidential reasoning approach-RIMER, IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans 36 (2006) 266–285.

[43] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010) 2106–2112.

[44] J. Wright, A. Yang, A. Ganesh, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 210–227.

[45] J. Yang, Y. Wang, D. Xu, K. Chin, The evidential reasoning approach for MADA under both probabilistic and fuzzy uncertainties, European Journal of Operational Research 171 (2006) 309–343.

[46] J. Yang, P. Sen, A general multi-level evaluation process for hybrid MADM with uncertainty, IEEE Transactions on Systems, Man, and Cybernetics 24 (1994) 1458–1473.

[47] J. Yang, Rule and utility based evidential reasoning approach for multi-attribute decision analysis under uncertainties, European Journal of Operational Research 131 (2001) 31–61.

[48] Z. Zhou, C. Hu, J. Yang, D. Xu, M. Chen, D. Zhou, A sequential learning algorithm for online constructing belief-rule-based systems, Expert Systems with Applications 37 (2010) 1790–1799.

[49] J. Zhou, C. Hu, D. Xu, M. Chen, D. Zhou, A model for real-time failure prognosis based on hidden Markov model and belief rule base, European Journal of Operational Research 207 (2010) 269–283.

[50] J. Zhou, C. Hu, J. Yang, D. Xu, D. Zhou, New model for system behavior prediction based on belief rule based systems, Information Sciences 180 (2010) 4843–4846.

[51] J. Zhou, C. Hu, J. Yang, D. Xu, D. Zhou, Bayesian reasoning approach based recursive algorithm for online updating belief rule based expert system of pipeline leak detection, Expert Systems with Applications 38 (2011) 3937–3943.

[52] J. Zhou, C. Hu, J. Yang, D. Xu, D. Zhou, Online updating belief-rule-base using the RIMER approach, IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans, http://dx.doi.org/10.1109/TSMCA.2011.2147312.

[53] T. Sakai, Multiple pattern classification by sparse subspace decomposition, arXiv:0907.5321v2.

[54] S. Li, J. Lu, Face recognition using the nearest feature line method, IEEE Transactions on Neural Networks 10 (1999) 439–443.

[55] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics 32 (2004) 407–499.

[56] A. Martinez, R. Benavente, The AR Face Database, CVC Technical Report 24, 1998.

[57] A. Martinez, A. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 228–233.

[58] Georgia Tech Face Database, http://www.anefian.com/face_reco.htm, 2007.

[59] M. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (1999) 1357–1362.

[60] F. Samaria, A. Harter, Parameterization of a stochastic model for human face identification, in: Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142.

[61] A. Georghiades, P. Belhumeur, D. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 643–660.

[62] K. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 684–698.

[63] http://sparselab.stanford.edu/.

[64] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.

Xiaodong Wang received the B.S. degree from Harbin Institute of Technology, Harbin, China, in 1998, and the M.S. degree from Inner Mongolia University of Technology, Hohhot, China, in 2007. He is currently working toward the Ph.D. degree in Computer Application Technology at the School of Computer Science and Technology, Xidian University and the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xi'an, China. His current research interests include convex optimization, compressive sensing and pattern recognition.

Fang Liu (M'07-SM'07) received the B.S. degree in Computer Science and Technology from Xi'an Jiaotong University, Xi'an, China, in 1984, and the M.S. degree in Computer Science and Technology from Xidian University, Xi'an, in 1995. Currently, she is a Professor with the School of Computer Science, Xidian University, Xi'an, China. She is the author or coauthor of five books and more than 80 papers in journals and conferences. Her research interests include signal and image processing, synthetic aperture radar image processing, multiscale geometry analysis, learning theory and algorithms, optimization problems, and data mining.

L.C. Jiao (SM'89) received the B.S. degree from Shanghai Jiaotong University, Shanghai, China, in 1982, and the M.S. and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 1984 and 1990, respectively. He is currently a Distinguished Professor with the School of Electronic Engineering, Xidian University, Xi'an, China. His research interests include signal and image processing, natural computation, and intelligent information processing. He has led approximately 40 important scientific research projects and published more than ten monographs and 100 papers in international journals and conferences. He is the author of three books: Theory of Neural Network Systems (Xi'an, China: Xidian University Press, 1990), Theory and Application on Nonlinear Transformation Functions (Xi'an, China: Xidian University Press, 1992), and Applications and Implementations of Neural Networks (Xi'an, China: Xidian University Press, 1996). He is the author or coauthor of more than 150 scientific papers.

Prof. Jiao is a member of the IEEE Xi'an Section Executive Committee, the Chairman of the Awards and Recognition Committee, and an executive committee member of the Chinese Association of Artificial Intelligence.

Jiao Wu (S'09) received the B.S. degree and the M.S. degree in Applied Mathematics from Shaanxi Normal University, Xi'an, China, in 1999 and 2002, respectively. She is currently working towards the Ph.D. degree in Computer Application Technology at the School of Computer Science and Technology, Xidian University and the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xi'an, China. Her research interests include image processing, machine learning, statistics learning theory, and algorithms.