doi:10.1016/j.neucom.2007.11.008
*Corresponding author. E-mail address: [email protected] (X. Liu).
Neurocomputing 71 (2008) 1735–1740
www.elsevier.com/locate/neucom
Letters
A new fuzzy approach for handling class labels in canonical correlation analysis
Yanyan Liu, Xiuping Liu*, Zhixun Su
Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, China
Received 25 April 2007; received in revised form 25 November 2007; accepted 30 November 2007
Communicated by S. Mitra
Available online 8 February 2008
Abstract
Canonical correlation analysis (CCA) can extract more discriminative features by utilizing class labels, especially the ones that can
reflect the sample distribution appropriately. In this paper, a new fuzzy approach for handling class labels in the form of fuzzy
membership degrees is proposed. We elaborately design a novel fuzzy membership function to represent the distribution of image
samples. These fuzzy class labels improve the classification performance of CCA and kernel CCA (KCCA) by incorporating
distribution information into the process of feature extraction. Comprehensive experimental results on face recognition demonstrate the
effectiveness and feasibility of the proposed method.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Feature extraction; Fuzzy membership degree; Sample distribution; Kernel methods; Face recognition
1. Introduction
Canonical correlation analysis (CCA) was initially proposed as a multivariate analysis method by Hotelling [6] for correlating linear relationships between two sets of variables. Recently, CCA has gained much attention in the fields of image analysis [4] and pattern recognition [5,11,12]. When one set of variables is taken as samples and the other as the corresponding class labels, CCA can be used for supervised feature extraction; in particular, if these labels are binary vectors, CCA is equivalent to Fisher linear discriminant analysis (FLDA) [1,5,12]. These binary vectors, in which a single component is set to one to denote the correct class and all the other components are set to zero, encode a {0, 1} class assignment, i.e. each sample fully belongs to a unique class. However, image samples, such as face images, are significantly affected by numerous environmental conditions. These influences may blur the boundaries between classes and make some samples locate in or near overlapping regions
among classes. This characteristic of the sample distribution is not considered in binary label vectors, which results in the loss of useful information for classification.

To overcome this shortcoming of CCA methods based on
binary labels, the information of the sample distribution should be incorporated into the process of feature extraction. Since the distribution information is not provided explicitly by the training images, how to represent it appropriately in the form of numerical values is a key problem. It can be seen as the problem of transforming categorical variables into interval measures by measurement transformations in multivariate statistics [14]. Such transformations are subject to some measurement restraints and can be obtained by an alternating least squares algorithm [14]. We do not focus on discussing these restraints or proposing new iterative algorithms, but rather on constructing an appropriate function of class labels that reflects the sample distribution directly. Fuzzy set theory is a natural choice here, and the fuzzy k-nearest neighbor (FKNN) method has been used to yield class labels by utilizing neighborhood information [7,12]. In this paper, we design a novel fuzzy membership function for handling fuzzy class labels to represent the distribution of image samples. We
anticipate that CCA incorporating fuzzy class labels, referred to as fuzzy label CCA below, can improve classification performance. Furthermore, modified fuzzy class labels via the kernel trick [10], corresponding to KCCA [4], are also proposed to obtain nonlinear discriminative features. Comprehensive experimental results on face databases demonstrate the effectiveness of fuzzy label based CCA and KCCA.
The rest of this paper is organized as follows. A brief review of CCA and KCCA is given in Section 2. In Section 3 we present the fuzzy approach for handling class labels, and then incorporate the resulting labels into CCA and KCCA. Experimental results are presented in Section 4. In Section 5 the conclusions and future work are discussed.
2. Overview of CCA and KCCA
CCA can be defined as the problem of finding basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is maximized [4]. More formally, consider two multidimensional variables $x$ and $y$ with zero mean, and suppose that $\{x_i\}_{i=1,\dots,N}$ and $\{y_i\}_{i=1,\dots,N}$ are $N$ observations of them, respectively. The goal of CCA is then to find pairs of basis vectors $a$ and $b$ that maximize
$$\rho = \frac{a^{\mathrm T} X Y^{\mathrm T} b}{\sqrt{a^{\mathrm T} X X^{\mathrm T} a \cdot b^{\mathrm T} Y Y^{\mathrm T} b}}, \qquad (1)$$

where $X = [x_1, \dots, x_N]$, $Y = [y_1, \dots, y_N]$, and the superscript $\mathrm T$ denotes transpose. Solving this optimization problem, we obtain the following generalized eigenvalue equations [4,12]:

$$X Y^{\mathrm T} (Y Y^{\mathrm T})^{-1} Y X^{\mathrm T} a = \lambda X X^{\mathrm T} a, \qquad Y X^{\mathrm T} (X X^{\mathrm T})^{-1} X Y^{\mathrm T} b = \lambda Y Y^{\mathrm T} b, \qquad (2)$$
where the eigenvalue $\lambda$ equals $\rho^2$. Generally, only the first equation of (2) needs to be solved for subsequent feature extraction. If $X X^{\mathrm T}$ is invertible, the generalized eigenproblem can be converted into a standard eigenproblem. However, in pattern recognition the small sample size problem often occurs and makes $X X^{\mathrm T}$ singular. To address this, we perform the PCA algorithm on the original data before CCA for dimension reduction [11]. This process does not lose any discriminative information, which is an advantage over FLDA [5].
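As an illustration of this PCA-plus-CCA recipe, the following NumPy sketch solves the first equation of (2) after PCA, with a small ridge added for numerical safety (our addition, anticipating the regularization used in Section 3.2); the function name `cca_directions` is ours, not the paper's.

```python
import numpy as np

def cca_directions(X, Y, d, mu=1e-6):
    """Solve the first generalized eigenproblem of Eq. (2) for the top-d
    directions a, after PCA on X (Section 2).

    X: p x N zero-mean data matrix, Y: c x N zero-mean label matrix.
    mu is a small ridge term (cf. the regularization in Eq. (6)).
    """
    # PCA: keep eigenvectors of X X^T with nonzero eigenvalues, so that
    # P^T X X^T P is invertible (the small-sample-size fix).
    e, V = np.linalg.eigh(X @ X.T)
    P = V[:, e > 1e-10 * e.max()]        # p x r, r = rank(X X^T)
    Xr = P.T @ X                         # r x N, PCA-reduced data

    Sxx = Xr @ Xr.T + mu * np.eye(Xr.shape[0])
    Syy = Y @ Y.T + mu * np.eye(Y.shape[0])
    Sxy = Xr @ Y.T

    # X Y^T (Y Y^T)^{-1} Y X^T a = lambda X X^T a, in the reduced space.
    M = np.linalg.solve(Sxx, Sxy @ np.linalg.solve(Syy, Sxy.T))
    lam, A = np.linalg.eig(M)
    order = np.argsort(-lam.real)[:d]
    return P, A[:, order].real           # features: A^T P^T x, cf. Eq. (7)
```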
Because of its linearity, CCA may fail to extract useful descriptors of the data. As a nonlinear extension of CCA via the kernel trick, KCCA offers a solution by first implicitly mapping the data into a higher-dimensional feature space $F$,

$$\phi: x \to \phi(x), \qquad X \to X_\phi = [\phi(x_1), \dots, \phi(x_N)], \qquad (3)$$

and then performing CCA in $F$ [4]. Different from the derivation in [4], where KCCA was formulated as a generalized eigenproblem, in this paper we take an alternative, natural approach to obtain a standard eigenproblem corresponding to KCCA: we perform the CCA algorithm on $X_\phi$ and $Y_\phi$ directly, as described in the previous paragraph, and utilize PCA to reduce the dimension of $\phi(x)$; the details are presented in Section 3.2.
3. Fuzzy class label based CCA and KCCA
By correlating samples with appropriate class labels that represent the sample distribution, CCA can extract combined features that fuse gray-level and distribution information. But what kind of class labels can represent the distribution information, at least approximately? As environmental effects blur the boundaries between classes, an image sample may be related to every class. These relationships can be represented by fuzzy membership degrees. Therefore, we define a novel fuzzy membership function for handling class labels to reflect the distribution of image samples.
3.1. Fuzzy label approach
Let $\Omega = \{x_{ij} \in \mathbb{R}^p;\ i = 1, \dots, c;\ j = 1, \dots, N_i\}$ be a sample set with $N$ elements, where $c$ is the number of image classes, $N_i$ is the number of samples in the $i$th class and $x_{ij}$ denotes the $j$th sample in the $i$th class. Evidently, $\Omega$ contains the gray-level information of the images. Suppose $\Omega_k = \{x_{kj} \in \Omega;\ j = 1, \dots, N_k\}$ is the sample set of the $k$th class. Let $d^k_{ij} = \|x_{ij} - m_k\|$ be the Euclidean distance between $x_{ij}$ and the mean of $\Omega_k$ ($k = 1, \dots, c$). For every $x_{ij} \in \Omega$, we define the membership degree of $x_{ij}$ belonging to $\Omega_k$ as

$$\omega^k_{ij} = \frac{d^k_{ij} - \max_{s=1,\dots,c} d^s_{ij}}{\min_{s=1,\dots,c} d^s_{ij} - \max_{s=1,\dots,c} d^s_{ij}}, \qquad (4)$$
where $\omega^k_{ij} \in [0, 1]$ essentially uses distances to represent the membership of samples and to yield class labels. However, due to the effect of numerous environmental conditions, many samples may lie away from their own regions, and may even be surrounded by samples of other classes. If we use Eq. (4) alone to reflect the sample distribution, this wrong category information may result in inaccurate classification. Therefore, a penalty term is introduced to define the improved membership function as

$$\tilde{\omega}^k_{ij} = \begin{cases} 1, & k = i, \\[4pt] \dfrac{\omega^k_{ij}}{\omega^i_{ij} + \gamma_{ij}}, & k \ne i, \end{cases} \qquad \gamma_{ij} = \begin{cases} \dfrac{\max_{s \ne i} \omega^s_{ij} - \theta\, \omega^i_{ij}}{\theta}, & \dfrac{\max_{s \ne i} \omega^s_{ij}}{\omega^i_{ij}} > \theta, \\[8pt] 0, & \text{otherwise.} \end{cases} \qquad (5)$$
It is easy to see that the penalty term $\gamma_{ij}$ should be moderate: too much penalty may lead to the loss of distribution information, while too small a value may not work properly. Here, the threshold $\theta \in (0, 1)$ is used to determine the value of $\gamma_{ij}$. Obviously, the improved membership degrees satisfy $\tilde{\omega}^i_{ij} = 1$ and $\tilde{\omega}^k_{ij} \le \theta$ ($k \ne i$), which means that sample $x_{ij}$ exhibits the highest degree of membership to its own class in contrast to the other classes. Moreover, the identity $\tilde{\omega}^k_{ij} / \tilde{\omega}^l_{ij} = \omega^k_{ij} / \omega^l_{ij}$ ($k, l \ne i$) guarantees a limited loss of fuzzy relationship information by preserving the proportions between the memberships of $x_{ij}$ to the other classes. Consequently, for every $x_{ij} \in \Omega$, the vector of class labels $y_{ij} = (\tilde{\omega}^1_{ij}, \tilde{\omega}^2_{ij}, \dots, \tilde{\omega}^c_{ij})^{\mathrm T}$ reflects, in some sense, the approximate location of sample $x_{ij}$ in $\Omega$. In other words, $C = \{y_{ij} \in \mathbb{R}^c;\ i = 1, \dots, c;\ j = 1, \dots, N_i\}$ contains the information of the sample distribution.
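To make Eqs. (4) and (5) concrete, here is a minimal NumPy sketch, assuming samples as rows and integer class indices $0, \dots, c-1$; the function name `fuzzy_labels` is our choice, not the paper's.

```python
import numpy as np

def fuzzy_labels(X, labels, theta=0.2):
    """Fuzzy class-label vectors y_ij per Eqs. (4) and (5).

    X: N x p sample matrix (one row per image); labels: length-N array
    of integer class indices 0..c-1. Returns an N x c label matrix.
    """
    classes = np.unique(labels)
    means = np.stack([X[labels == k].mean(axis=0) for k in classes])
    # d[n, k]: Euclidean distance from sample n to the mean of class k.
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)

    # Eq. (4): linearly map distances into [0, 1]; the nearest class
    # mean receives membership 1, the farthest receives 0.
    dmin = d.min(axis=1, keepdims=True)
    dmax = d.max(axis=1, keepdims=True)
    omega = (d - dmax) / (dmin - dmax)

    # Eq. (5): a penalty gamma caps memberships to other classes at theta.
    out = np.empty_like(omega)
    for n in range(X.shape[0]):
        i = labels[n]
        m = np.delete(omega[n], i).max()     # max over s != i
        # condition m / omega[n, i] > theta, written division-free
        gamma = (m - theta * omega[n, i]) / theta if m > theta * omega[n, i] else 0.0
        out[n] = omega[n] / (omega[n, i] + gamma)
        out[n, i] = 1.0                      # own class always gets 1
    return out
```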
3.2. Fuzzy label CCA/KCCA
To implement the CCA algorithm, $\Omega$ and $C$ are regarded as two sets of variables. Suppose both have zero mean, for derivational convenience. Then we only need to solve the first equation of (2) with $X = [x_{11}, x_{12}, \dots, x_{cN_c}]$ and $Y = [y_{11}, y_{12}, \dots, y_{cN_c}]$. Since $X X^{\mathrm T}$ is singular, as mentioned in Section 2, the PCA algorithm is used for dimension reduction [11]. Specifically, we compute all eigenvectors $\xi$ corresponding to the nonzero eigenvalues of $X X^{\mathrm T}$ and let $P = (\xi_1, \dots, \xi_r)$, where $r = \mathrm{rank}(X X^{\mathrm T})$. Obviously, the matrix $P^{\mathrm T} X X^{\mathrm T} P$ is invertible. In addition, the small nonzero eigenvalues of $X X^{\mathrm T}$ generally carry more interference information, which is blown up by the inversion. To counteract this effect, the regularization technique [3] is employed. Thus the eigenvalue equation is converted to

$$(P^{\mathrm T} X X^{\mathrm T} P + \mu I_r)^{-1} P^{\mathrm T} X Y^{\mathrm T} (Y Y^{\mathrm T})^{-1} Y X^{\mathrm T} P\, a = \lambda a, \qquad (6)$$

where $I_r$ denotes the $r \times r$ identity matrix and $\mu$ is a small nonnegative number. Solving Eq. (6), we obtain the eigenvectors $a$ corresponding to the first $d$ largest eigenvalues ($d \le \min(r, c)$). The extracted feature vector of a test sample $x$ is calculated as

$$z = (a_1, a_2, \dots, a_d)^{\mathrm T} P^{\mathrm T} x. \qquad (7)$$
We anticipate that the features extracted by fuzzy label CCA have more discriminative power, as they incorporate both the gray-level and the distribution information of the samples.
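Assuming the two sketches above (`fuzzy_labels` from Section 3.1 and `cca_directions` from Section 2), the whole fuzzy label CCA pipeline of Eqs. (6) and (7) might be wired together as follows; `Xtr`, `ytr` and `x_test` are illustrative names only.

```python
import numpy as np

# Xtr: N x p training images (one row each), ytr: integer class labels,
# reusing fuzzy_labels() and cca_directions() from the sketches above.
mean_face = Xtr.mean(axis=0)
X = (Xtr - mean_face).T                    # p x N, zero-mean columns
Y = fuzzy_labels(Xtr, ytr, theta=0.2).T    # c x N fuzzy label matrix
Y = Y - Y.mean(axis=1, keepdims=True)      # center the labels as well

c = Y.shape[0]
P, A = cca_directions(X, Y, d=c)           # d <= min(r, c), cf. Eq. (6)
z = A.T @ (P.T @ (x_test - mean_face))     # Eq. (7): test feature vector
```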
For the corresponding kernel version, the class labels should be recomputed, because the distribution of the samples in $F$ may differ from that in the original space. Since the implicit nonlinear mapping $\phi$ is unknown, we cannot use Eqs. (4) and (5) to obtain $Y_\phi$ directly. Fortunately, the inner product in $F$ can be computed via a kernel function $k(x_i, x_j) = \phi(x_i)^{\mathrm T} \phi(x_j)$ using the kernel trick [10]. Thereby, $(d^k_{ij})_\phi$ can be obtained as follows:

$$(d^k_{ij})_\phi = \left\| \phi(x_{ij}) - \frac{1}{N_k} \sum_{l=1}^{N_k} \phi(x_{kl}) \right\| = \sqrt{\phi(x_{ij})^{\mathrm T} \phi(x_{ij}) - \frac{2}{N_k} \sum_{l=1}^{N_k} \phi(x_{ij})^{\mathrm T} \phi(x_{kl}) + \frac{1}{N_k^2} \sum_{l=1}^{N_k} \sum_{s=1}^{N_k} \phi(x_{kl})^{\mathrm T} \phi(x_{ks})}$$
$$= \sqrt{k(x_{ij}, x_{ij}) - \frac{2}{N_k} \sum_{l=1}^{N_k} k(x_{ij}, x_{kl}) + \frac{1}{N_k^2} \sum_{l=1}^{N_k} \sum_{s=1}^{N_k} k(x_{kl}, x_{ks})}.$$

Replacing $d^k_{ij}$ with $(d^k_{ij})_\phi$ in Eqs. (4) and (5), $Y_\phi$ is obtained easily.
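A small sketch of this computation from the kernel matrix alone; `kernel_class_distances` is our name, and clipping tiny negative values under the square root is a numerical safeguard we add.

```python
import numpy as np

def kernel_class_distances(K, labels):
    """(d^k_ij)_phi for all training samples and classes, using only the
    kernel matrix K with K[i, j] = k(x_i, x_j) (displayed equation above).

    Returns an N x c matrix of feature-space distances to class means,
    which can then be fed through Eqs. (4) and (5) to build Y_phi.
    """
    classes = np.unique(labels)
    N = K.shape[0]
    D = np.empty((N, len(classes)))
    for col, k in enumerate(classes):
        idx = np.flatnonzero(labels == k)
        Nk = len(idx)
        cross = K[:, idx].sum(axis=1)        # sum_l k(x, x_kl)
        within = K[np.ix_(idx, idx)].sum()   # sum_{l,s} k(x_kl, x_ks)
        D[:, col] = np.sqrt(np.maximum(
            np.diag(K) - 2.0 * cross / Nk + within / Nk**2, 0.0))
    return D
```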
Assume that $X_\phi$ and $Y_\phi$ are both centered in $F$ (see [3] for a method to center samples in feature space). For simplicity, we denote all the training samples in $F$ by $\phi(x_1), \phi(x_2), \dots, \phi(x_N)$. Similarly to the derivation of CCA, KCCA is equivalent to solving the following generalized eigenvalue equation:

$$X_\phi Y_\phi^{\mathrm T} (Y_\phi Y_\phi^{\mathrm T})^{-1} Y_\phi X_\phi^{\mathrm T} a_\phi = \lambda_\phi X_\phi X_\phi^{\mathrm T} a_\phi. \qquad (8)$$

Since $X_\phi X_\phi^{\mathrm T}$ is unknown, to perform PCA we first compute the nonzero eigenvalues $\varepsilon_s$ and the corresponding eigenvectors $u_s$ of $K$, where $K = X_\phi^{\mathrm T} X_\phi$ is the kernel matrix with elements $K_{ij} = \phi(x_i)^{\mathrm T} \phi(x_j) = k(x_i, x_j)$. According to the SVD theorem [3], the eigenvectors of $X_\phi X_\phi^{\mathrm T}$ satisfy $v_s = X_\phi u_s / \sqrt{\varepsilon_s}$, where $s = 1, \dots, n$ and $n = \mathrm{rank}(K)$, i.e. $V = X_\phi U \Lambda^{-1/2}$, where $V = (v_1, \dots, v_n)$, $U = (u_1, \dots, u_n)$ and $\Lambda = \mathrm{diag}(\varepsilon_1, \dots, \varepsilon_n)$. Consequently, after PCA and regularization, Eq. (8) is converted to

$$(U^{\mathrm T} K^2 U + \mu I_n)^{-1} U^{\mathrm T} K Y_\phi^{\mathrm T} (Y_\phi Y_\phi^{\mathrm T})^{-1} Y_\phi K U a_\phi = \lambda_\phi a_\phi. \qquad (9)$$

Accordingly, the fuzzy label KCCA features of a test sample $x$ are calculated as

$$z_\phi = (a_{\phi 1}, a_{\phi 2}, \dots, a_{\phi d_\phi})^{\mathrm T} \Lambda^{-1/2} U^{\mathrm T} K_x, \qquad (10)$$

where $K_x \in \mathbb{R}^N$ with elements $(K_x)_i = k(x_i, x)$, $i = 1, 2, \dots, N$, and $d_\phi \le \min(n, c)$.
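Putting Eqs. (9) and (10) together, here is a sketch of fuzzy label KCCA that works purely with the kernel matrix; it assumes $K$ and $Y_\phi$ have already been centered as in [3], and the small ridge on $Y_\phi Y_\phi^{\mathrm T}$ is our numerical safeguard rather than part of Eq. (9). The name `fuzzy_kcca` is ours.

```python
import numpy as np

def fuzzy_kcca(K, Yphi, d, mu=1e-6):
    """Fuzzy label KCCA: solve Eq. (9) and return a feature map, Eq. (10).

    K: N x N centered kernel matrix; Yphi: c x N centered fuzzy labels
    (built from the kernelized distances above); d <= min(n, c).
    """
    e, U = np.linalg.eigh(K)
    keep = e > 1e-10 * e.max()            # keep the nonzero eigenvalues
    e, U = e[keep], U[:, keep]            # n = rank(K) columns
    n = U.shape[1]

    Syy = Yphi @ Yphi.T + mu * np.eye(Yphi.shape[0])  # ridge: our safeguard
    left = U.T @ K @ K @ U + mu * np.eye(n)           # U^T K^2 U + mu I_n
    right = (U.T @ K @ Yphi.T) @ np.linalg.solve(Syy, Yphi @ K @ U)
    lam, Aphi = np.linalg.eig(np.linalg.solve(left, right))
    Aphi = Aphi[:, np.argsort(-lam.real)[:d]].real

    ULinv = U / np.sqrt(e)                # U Lambda^{-1/2}
    def features(Kx):
        """Kx: length-N vector with (Kx)_i = k(x_i, x_test)."""
        return Aphi.T @ (ULinv.T @ Kx)    # Eq. (10)
    return features
```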
4. Experimental results
To evaluate the performance of the proposed fuzzy label method, classification experiments are performed on face images. We construct a large face database by combining the ORL [9], Yale [13] and Lab databases in order to sufficiently represent the variations occurring in the real world.
Fig. 1. Some face images from the combined database.
Fig. 2. The change of recognition rates w.r.t. $\theta$ on 10 different data sets.
Fig. 3. The change of average recognition rates w.r.t. $\theta$ (curves for Eq. (5), Eq. (4) and binary labels).
The Lab database contains 450 images of 30 members of our laboratory, and these images are taken with few restrictions. The combined database is composed of 85 distinct subjects, each with 10 selected images. All images are cropped to $112 \times 92$ pixels. Pose, illumination and expression variations are contained in this database, as shown in Fig. 1. The experiments are conducted on a P4 2.4 GHz PC with 512 MB RAM in the Matlab 6.5 environment. All of the obtained eigenvectors $a$/$a_\phi$ are used to extract features for classification by the nearest neighbor classifier.
To show how the threshold $\theta$ affects recognition performance, we vary $\theta \in (0, 1)$ in steps of 0.05 and repeat the classification experiment 10 times, randomly choosing different training and testing sets from the combined database. The number of training samples per subject is 4. In the PCA step, very small nonzero eigenvalues and their corresponding eigenvectors are discarded, to make the illustration of $\theta$'s effect convincing. The recognition results in each round are shown in Fig. 2. The results using Eq. (4) to handle fuzzy labels are also shown, as dotted lines, since they are independent of $\theta$. The change of average recognition rates w.r.t. $\theta$ is illustrated in Fig. 3. From the experimental results in Figs. 2 and 3, we observe that in most cases the proposed method works well by simply setting $\theta \in (0.1, 0.4)$, which corresponds to a moderate $\gamma$, and $\theta = 0.2$ achieves the highest accuracy.
We then fix $\theta = 0.2$ for fuzzy label CCA/KCCA and compare them with Fisherfaces [2] and PCA+CCA [5]. In this experiment, the regularization parameter $\mu$ is set to the empirical value $10^{-6}$ by trial and error, since $\mu$ cannot be determined automatically and the optimal value is not easily specified [8]. The Gaussian RBF kernel function $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ with $\sigma = 10\,000$ is used in fuzzy label KCCA. The number of training samples per subject increases from 2 to 9. Training samples are selected randomly and the remaining samples are used for testing.
Fig. 4. Recognition rates comparison on the combined database (fuzzy label KCCA, fuzzy label CCA, PCA+CCA and Fisherfaces).
Table 1
Recognition rates comparison on ORL database

Method                  Recognition rate (%)   Average time (s)
Binary labels           95.25 ± 1.3591         10.4581
FKNN                    96.20 ± 1.3166         24.0373
New method (θ = 0.1)    96.45 ± 1.3632         13.5352
New method (θ = 0.2)    96.60 ± 1.2867         14.0040
New method (θ = 0.3)    96.70 ± 1.0055         13.8280
New method (θ = 0.4)    96.40 ± 1.3499         13.9389
Random selection is repeated ten times on the combined database. Fig. 4 shows the average recognition rates. It can be seen that fuzzy label KCCA achieves better results than fuzzy label CCA, and both of them outperform the Fisherfaces and PCA+CCA methods. These results show that the sample distribution information and nonlinear features are beneficial to classification.
Finally, experiments are performed on the ORL database to compare the performance of different class assignment methods. The PCA eigenvectors corresponding to the first 60 largest eigenvalues are retained for dimension reduction, and the parameter of the FKNN method is fixed at $k = 17$ [12]. For the new method, $\theta$ is set to 0.1, 0.2, 0.3 and 0.4, respectively. The mean, standard deviation and average runtime over 10 repeated experiments are tabulated in Table 1, from which we observe that the proposed fuzzy label approach achieves satisfactory performance in terms of both accuracy and efficiency. It is worth noting that the optimal performance of the new method on the ORL database is obtained at $\theta = 0.3$, while on the combined database it is obtained at $\theta = 0.2$, which implies that the value of $\theta$ should be smaller for a more complicated database severely affected by environmental conditions.
5. Conclusions and future work
In this paper, we design a new fuzzy membership function for handling class labels to reflect the distribution of image samples. Incorporating these class labels into CCA and KCCA, we can extract more discriminative features by making use of the distribution information. Classification experiments on face images show that the new approach is effective and feasible.

Intuitively, different class labels result in different classification performances, so constructing more appropriate functions of the labels deserves further study. In addition, it is desirable to develop alternative approaches to yield class labels based on the techniques for handling categorical variables in multivariate data analysis [14].
Acknowledgments
This work was supported by the Program for New Century Excellent Talents in University of China under Grant no. NCET-05-0275 and the National Natural Science Foundation of China under Grant no. 60673006. The authors would like to thank the anonymous reviewers for their precious suggestions.
References
[1] M. Barker, W. Rayens, Partial least squares for discrimination,
J. Chemometrics 17 (2003) 166–173.
[2] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs.
Fisherfaces: recognition using class specific linear projection, IEEE
Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.
[3] T. De Bie, N. Cristianini, R. Rosipal, Eigenproblems in pattern
recognition, in: E. Bayro-Corrochano (Ed.), Handbook of Computa-
tional Geometry for Pattern Recognition, Computer Vision, Neuro-
computing and Robotics, Springer, Heidelberg, 2004.
[4] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation
analysis: an overview with application to learning methods, Technical
Report CSD-TR-03-02, Computer Science Department, Royal
Holloway, University of London, 2003.
[5] Y.H. He, L. Zhao, C.R. Zou, Face recognition based on PCA/KPCA
plus CCA, in: Proceedings of the ICNC 2005, Lecture Notes in
Computer Science, vol. 3611, Springer, Berlin, 2005, pp. 71–74.
[6] H. Hotelling, Relations between two sets of variates, Biometrika 28
(1936) 321–377.
[7] J.M. Keller, M.R. Gray, J.A. Givens, A fuzzy k-nearest neighbor
algorithm, IEEE Trans. Syst. Man Cybern. 15 (4) (1985) 580–585.
[8] T. Melzer, Generalized canonical correlation analysis for object
recognition, Ph.D. Thesis, Institute of Automation, Vienna Uni-
versity of Technology, 2002.
[9] ORL face database <http://www.uk.research.att.com/facedatabase.html>.
[10] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis,
Cambridge University Press, Cambridge, 2004.
[11] Q.S. Sun, S.G. Zeng, Y. Liu, P.A. Heng, D.S. Xia, A new method of
feature fusion and its application in image recognition, Pattern
Recognition 38 (2005) 2437–2448.
[12] T.K. Sun, S.C. Chen, Class label versus sample label-based CCA,
Appl. Math. Comput. 185 (2007) 272–283.
[13] Yale face database <http://cvc.yale.edu/projects/yalefaces/yalefaces.html>.
[14] F.W. Young, J. de Leeuw, Y. Takane, Regression with qualitative
and quantitative variables: an alternating least squares method with
optimal scaling features, Psychometrika 41 (1976) 505–529.
Yanyan Liu was born in Hebei, China in 1980.
Currently, she is a Ph.D. candidate in the
Department of Applied Mathematics of Dalian
University of Technology. Her current research
interests include face recognition, image processing and computer vision.
Xiuping Liu was born in 1964. She received the
Ph.D. degree in Computational Mathematics in
1999 from Dalian University of Technology,
China. She was a postdoctoral research fellow
in School of Mathematics and Computational
Science of Sun Yat-sen University from October
1999 to October 2001. She is now an associate
professor in the Department of Applied Mathematics, Dalian University of Technology. Her
research activities are in the areas of computer
vision, multivariate spline function, CAGD and wavelet analysis.
Zhixun Su received his B.Sc. degree in Mathematics from Jilin University in 1987 and M.S.
degree in Computer Science from Nankai
University in 1990. He received his Ph.D. degree
in 1993 from Dalian University of Technology,
where he has been a professor in the Department
of Applied Mathematics since 1999. His research
interests include computer graphics and image
processing, computational geometry, computer
vision, etc.