
Scaling cut criterion-based discriminant analysis for supervised dimension reduction



Knowl Inf Syst, DOI 10.1007/s10115-014-0744-0

REGULAR PAPER

Scaling cut criterion-based discriminant analysis for supervised dimension reduction

Xiangrong Zhang · Yudi He · Licheng Jiao · Ruochen Liu · Jie Feng · Sisi Zhou

Received: 3 December 2012 / Revised: 7 January 2014 / Accepted: 22 March 2014
© Springer-Verlag London 2014

Abstract  Dimension reduction has always been a major problem in many applications of machine learning and pattern recognition. In this paper, scaling cut criterion-based supervised dimension reduction methods for data analysis are proposed. The scaling cut criterion removes the hypothesis that the data distribution of each class is homoscedastic Gaussian. To obtain a more reasonable mapping matrix and reduce the computational complexity, local scaling cut criterion-based dimension reduction is proposed, which exploits a localization strategy on the input data: a localized k-nearest neighbor graph is introduced, which relaxes the within-class variance and enlarges the between-class margin. Moreover, by kernelizing the scaling cut criterion and the local scaling cut criterion, both methods are extended to efficiently model the nonlinear variability of the data. Furthermore, the optimal dimension scaling cut criterion is proposed, which can automatically select the optimal dimension for the dimension reduction methods. The approaches have been tested on several datasets, and the results show better and more efficient performance compared with other linear and nonlinear dimension reduction techniques.

Keywords  Dimension reduction · Scaling cut criterion · Local scaling cut criterion · Kernel scaling cut criterion · Local kernel scaling cut criterion · Optimal dimension scaling cut criterion

1 Introduction

With the rapid accumulation of high-dimensional data in domains such as face images, videos, and remote sensing images, dimension reduction plays an increasingly important role in those domains. Dimension reduction is a basic task in pattern analysis

X. Zhang (B) · Y. He · L. Jiao · R. Liu · J. Feng · S. ZhouThe Key Laboratory of Intelligent Perception and Image Understanding of Ministryof Education, International Research Center for Intelligent Perception and Computation,Xidian University, Xi’an 710071, Chinae-mail: [email protected]


and machine learning fields. Many dimension reduction methods now exist. Principal component analysis (PCA) [25] and linear discriminant analysis (LDA) [6] are the most widely used linear methods. PCA, an unsupervised dimension reduction method, finds the most representative projection directions of the original data in the minimum mean-square sense, and the projection directions are given by a set of optimal orthogonal vectors. However, because it is unsupervised, it cannot reflect the discriminative information of the samples. LDA is a supervised algorithm that finds a set of linear transformations to maximize the between-class scatter and minimize the within-class scatter. But LDA suffers from the singularity problem: when the dimension of the original high-dimensional samples is much larger than the number of samples, the scatter matrix in LDA can be singular. Although classical methods such as PCA, LDA, and the maximum margin criterion (MMC) [12] have been proved to be computationally efficient and suitable for practical applications, these linear approaches ignore the geodesic structure of the data distribution; thus, the geometric structure cannot be revealed. Some strategies have been proposed to remedy the above methods with localized models [7]. Localized PCA, localized LDA, locality preserving projection (LPP) [17,32], locally linear embedding (LLE) [18] and neighborhood min-max projections [14] take the neighborhood structure into consideration. Methods such as Laplacian eigenmaps (LE) [2] and isometric mapping (Isomap) [23] emphasize preserving the global structure of the original dataset. Manifold-based methods such as LLE, neighborhood preserving embedding [10] and [26] can handle high-dimensional and nonlinear data. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds. Isomap is an extension of classical multidimensional scaling (MDS) [3], which computes pairwise distances in the geodesic space to preserve the geodesic structure in the embedded low-dimensional space. LE tries to find the Laplacian projection by spectral decomposition of the graph. LPP and neighborhood preserving embedding can be seen as linearizations of LE and LLE, respectively, performing the mapping by a projection matrix; LPP preserves the local structure and finds the essential manifold structure.

Kernel principal component analysis (KPCA) [19,30] and kernel Fisher discriminant analysis (KFDA) [29] are kernelizations of PCA and LDA that allow nonlinear data to be handled. Other efficient nonlinear methods have also been proposed [15,21,27,33].

In recent years, many graph-based dimension reduction methods have been developed, since graphs are a powerful tool for capturing the structural information hidden in objects or data [28,31]. Graph cut criteria have become popular for their ability to exploit graph theory and have been shown to be robust and adaptable separability criteria. The existing graph cut criteria include min-cut [5], ratio-cut [8], normalized-cut (Ncut) [20] and min-max-cut [4]. The min-cut only considers the between-class relationship while ignoring the within-class relationship, which is prone to skewed partitions. The last three criteria introduce different equilibrium conditions: ratio-cut introduces the class size, normalized-cut introduces the sum of weights between the data points of one class and the full vertex set, and min-max-cut introduces the sum of weights between within-class data points. Various experiments show that Ncut, which measures the total between-class dissimilarity and the total within-class dissimilarity, has better performance. However, the Ncut algorithm may be trapped in local minima. Therefore, based on Ncut, we introduce a new graph cut, called the scaling cut. Derived from Ncut, the scaling cut criterion-based dimension reduction method is similar to LDA. LDA assumes that all classes obey a Gaussian distribution with the same covariance. But in the real world, most data distributions are more complex than Gaussian; thus, LDA fails when the dataset is heteroscedastic or multimodal. The scaling cut criterion focuses on the dissimilarity between data points, thereby eliminating the impact of the class centers.

In order to utilize the local structure of the dataset, a local scaling cut criterion is proposed, which seeks an optimal projection by preserving the local geometry of the given data. We use the localization strategy of LPP, which only calculates the dissimilarities of adjacent points to construct a localized k-nearest neighbor graph, to relax the within-class variance and increase the margin between samples of different classes. Because the scaling cut criterion has quite high computational complexity in computing the between-class and within-class dissimilarity matrices, the local scaling cut criterion also reduces the computational cost by considering only the k-nearest neighbor samples instead of all samples. To extend the scaling cut criterion to the nonlinear case, we propose the kernel scaling cut criterion, which can deal with datasets that lie in a nonlinear space and obtain nonlinear discriminative features in a new space obtained by mapping the original data with a kernel function. To utilize the local structure in the high-dimensional nonlinear space, the local kernel scaling cut criterion is also proposed by adopting the same localization strategy.

The optimal dimension scaling cut criterion proposed for dimension reduction is another contribution of this paper. The choice of the feature dimension after dimension reduction is an important issue, and we also show how to automatically determine the optimal dimension for the scaling cut criterion-based dimension reduction method. In [16], a criterion was presented whose optimal value is not monotonic with respect to the projection dimension, so that the optimal dimension for discriminant analysis can be effectively determined. With the same goal, we propose the optimal dimension scaling cut criterion, which automatically selects a suitable dimension to obtain the best projection matrix.

The remainder of the paper is organized as follows. Section 2 reviews the normalized-cut and introduces the scaling cut criterion. Section 3 presents our scaling cut criterion-based supervised dimension reduction and its local variant that exploits the local structure. Section 4 extends both to nonlinear data with the kernel scaling cut criterion and the local kernel scaling cut criterion. Section 5 describes the optimal dimension scaling cut criterion-based supervised dimension reduction, which determines the suitable dimension automatically. Section 6 discusses the relation to other methods. Section 7 presents extensive experimental results obtained on UCI datasets, a face dataset, a digit recognition dataset, and a hyperspectral remote sensing image dataset. Finally, Sect. 8 draws our conclusions.

2 Scaling cut criterion

The scaling cut criterion is derived from Ncut, so we first give a brief review of Ncut; the introduction of the scaling cut criterion follows.

2.1 Normalized-cut criterion

Shi and Malik [20] established the two-way partition Ncut objective function according to spectral graph theory; minimization of the Ncut function is called the Ncut criterion.

Firstly, we consider the two-way partition problem. Suppose the graph is partitioned into two disjoint subsets $V_1$ and $V_2$, with $V_1 \cup V_2 = V$ and $V_1 \cap V_2 = \emptyset$. $W(i, j)$ is defined as the dissimilarity between points $i$ and $j$, and $cut(V_1, V_2) = \sum_{i \in V_1, j \in V_2} W(i, j)$ is the sum of the connections between $V_1$ and $V_2$, which is the degree of dissimilarity between these two subsets in graph theory.

The total connection from the nodes of a subset to all nodes in the graph is introduced as the normalizing factor, which results in the Ncut criterion [20]:

$$\mathrm{Ncut}(V_1, V_2) = \frac{cut(V_1, V_2)}{assoc(V_1, V)} + \frac{cut(V_1, V_2)}{assoc(V_2, V)} \qquad (1)$$

where $assoc(V_i, V) = \sum_{i \in V_i, j \in V} W(i, j)$ is the normalizing factor. Minimizing the Ncut value yields the optimal bipartition of the graph. Suppose a graph $G = (V, E)$ is partitioned into $C$ subsets $V = \bigcup_{i=1}^{C} V_i$. In order to generalize the two-way normalized-cut to a $C$-way cut, the $C$-way cut is redefined as follows:

$$\mathrm{Ncut}(G) = \sum_{i=1}^{C} \frac{cut(V_i, \bar{V}_i)}{assoc(V_i, V)} \qquad (2)$$

where $cut(V_i, \bar{V}_i)$ measures the total links from subset $V_i$ to the complementary set of $V_i$, and $assoc(V_i, V) = \sum_{i \in V_i, j \in V} W(i, j)$ measures the total connections from nodes in $V_i$ to all nodes in the graph.
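As a concrete illustration of Eq. (2), the following sketch (ours, not from the paper; the toy weight matrix and partition are invented purely for illustration) evaluates the C-way Ncut of a small graph in NumPy:

```python
import numpy as np

# Toy symmetric weight matrix W for 4 nodes and a 2-way partition (hypothetical values).
W = np.array([[0.0, 0.8, 0.1, 0.0],
              [0.8, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.9],
              [0.0, 0.1, 0.9, 0.0]])
partition = [np.array([0, 1]), np.array([2, 3])]   # V_1 and V_2

def ncut(W, partition):
    """Evaluate the C-way Ncut of Eq. (2): sum_i cut(V_i, complement) / assoc(V_i, V)."""
    n = W.shape[0]
    total = 0.0
    for Vi in partition:
        comp = np.setdiff1d(np.arange(n), Vi)        # complement of V_i
        cut = W[np.ix_(Vi, comp)].sum()              # links leaving V_i
        assoc = W[Vi, :].sum()                       # links from V_i to all nodes
        total += cut / assoc
    return total

print(ncut(W, partition))   # small value: this partition cuts only the weak links
```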

2.2 Scaling cut criterion

LDA is a classic dimension reduction method that aims at learning an optimal transformation $W: \mathbb{R}^D \rightarrow \mathbb{R}^d$, $d \ll D$, to project the original high-dimensional data into a low-dimensional space. It is learned by solving an optimization problem on the between-class and within-class scatter matrices based on the class centers, which implicitly assumes that the samples of each class obey a Gaussian distribution with equal covariance. However, the distribution of real-world data is mostly more complex than Gaussian, and LDA fails on heteroscedastic or multimodal data.

The definition of the scaling cut criterion is similar to that of LDA, but the scaling cut criterion eliminates the impact of the class centers used in LDA by constructing a dissimilarity matrix among samples.

The scaling cut criterion overcomes the limitation of LDA by using the normalizing-factor strategy of Ncut [20], which is defined over all samples without relying on class centers. The normalizing factor $assoc(V_i, V) = \sum_{i \in V_i, j \in V} W(i, j)$ measures the total connections from nodes in one class to all nodes in the graph. By calculating these connections, the global structure of the graph is captured without involving the centers of each class.

Given a training dataset $X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^D$ belonging to $c$ classes, where $N$ is the number of training samples, suppose the graph $G = (V, E)$ is partitioned into $c$ subsets $V = \bigcup_{i=1}^{c} V_i$. Dimension reduction projects the $D$-dimensional data to a $d$-dimensional space, and we want to find a feasible and reasonable projection matrix based on the scaling cut criterion. We define the dissimilarity between two samples as:

$$dis_{ij} = \frac{1}{n_{c(i)} n_{c(j)}} (x_i - x_j)(x_i - x_j)^T \qquad (3)$$


in which $n_{c(i)}$ is the size of the class to which $x_i$ belongs. Then, the between-class dissimilarity and the within-class dissimilarity of the $p$-th class are defined as follows:

$$C_p = \sum_{i \in V_p} \sum_{j \in \bar{V}_p} \frac{1}{n_p \, n_{c(j)}} (x_i - x_j)(x_i - x_j)^T \qquad (4)$$

$$A_p = \sum_{i \in V_p} \sum_{j \in V_p} \frac{1}{n_p \, n_p} (x_i - x_j)(x_i - x_j)^T \qquad (5)$$

where $n_p$ denotes the number of samples in the $p$-th class. $C_p$ measures the total dissimilarity between the samples in the $p$-th class and the samples in its complement, while $A_p$ measures the total dissimilarity among the samples within the $p$-th class. These definitions ensure that the relations between all samples are taken into account, and the class information is implied in the scale of each class.
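To make Eqs. (3)–(5) concrete, the sketch below (our own illustration; the helper name `scaling_cut_matrices` is hypothetical) accumulates the summed matrices $C = \sum_p C_p$ and $A = \sum_p A_p$ directly from a labeled sample matrix. It is a literal, unoptimized transcription of the definitions.

```python
import numpy as np

def scaling_cut_matrices(X, y):
    """Accumulate C = sum_p C_p and A = sum_p A_p from Eqs. (4)-(5).

    X: (N, D) data matrix with one sample per row; y: (N,) integer class labels.
    Returns the D x D between-class (C) and within-class (A) dissimilarity matrices.
    This is a direct O(N^2 D^2) transcription of the definitions, not an optimized one.
    """
    N, D = X.shape
    labels, counts = np.unique(y, return_counts=True)
    n_of = dict(zip(labels, counts))                 # class sizes n_c(.)
    C = np.zeros((D, D))
    A = np.zeros((D, D))
    for i in range(N):
        for j in range(N):
            if i == j:
                continue                             # the i = j term is zero anyway
            d = (X[i] - X[j])[:, None]               # column vector x_i - x_j
            w = 1.0 / (n_of[y[i]] * n_of[y[j]])      # 1 / (n_c(i) n_c(j)), cf. Eq. (3)
            if y[i] == y[j]:
                A += w * (d @ d.T)                   # within-class term, Eq. (5)
            else:
                C += w * (d @ d.T)                   # between-class term, Eq. (4)
    return C, A
```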

Firstly, to elaborate our method, we consider a simple situation. If the original high-dimensional data belong to two separate clusters, a one-dimensional projection is enough to represent and partition the data; that is, in the two-class problem, the projection of the input set is a line. By combining the definitions of $C_p$ and $A_p$, the two-way normalized-cut can be rewritten as:

$$Scuts(W) = \sum_{p=1}^{2} \frac{W^T C_p W}{W^T A_p W + W^T C_p W} \qquad (6)$$

Secondly, inspired by Ncut, we consider the $C$-way cut for the multi-class problem, in which the projection of the input set is a subspace. Using determinantal properties, Eq. (6) can be generalized to the case of a $C$-way cut:

$$Scuts(W) = \sum_{p=1}^{c} \frac{W^T C_p W}{W^T A_p W + W^T C_p W} \qquad (7)$$

But the normalized-cut is a nonlinear function of the projection matrix $W$, which may trap the algorithm in a local minimum. By extending $Scuts$ to a linear function of the projection matrix, a new objective function is obtained:

$$Scuts(W) = \frac{\left| \sum_{p=1}^{c} W^T C_p W \right|}{\left| \sum_{p=1}^{c} \left(W^T A_p W + W^T C_p W\right) \right|} = \frac{\left| W^T \sum_{p=1}^{c} C_p W \right|}{\left| W^T \sum_{p=1}^{c} (A_p + C_p) W \right|} = \frac{|W^T C W|}{|W^T (A + C) W|} = \frac{|W^T C W|}{|W^T T W|} \qquad (8)$$

where $C = \sum_{p=1}^{c} C_p$ and $A = \sum_{p=1}^{c} A_p$ are the corresponding dissimilarity matrices of the whole input set, and $T = A + C$ is the total dissimilarity matrix. The dissimilarity is scaled by the size of each class; therefore, the new graph cut discriminant criterion is named the scaling cut criterion.


3 Scaling cut criterion-based dimension reduction

3.1 Scaling cut criterion-based dimension reduction

Given the scaling cut criterion defined above, optimizing it yields the optimal projection matrix $W$. The purpose of the scaling cut criterion is to simultaneously maximize the between-class dissimilarity and minimize the within-class dissimilarity. To this end, the scaling cut criterion solves the following problem:

$$W^* = \arg\max_{W \in \mathbb{R}^{D \times d}} \frac{|W^T C W|}{|W^T T W|} \qquad (9)$$

This trace ratio problem is conventionally solved approximately by generalized eigenvalue decomposition because of the difficulty of the original problem. Recently, Wang et al. [24] showed that the trace ratio problem can be solved directly via several efficient iterative procedures; the global optimum of the trace ratio problem can be found by solving an equivalent trace difference problem. The objective function of $W$ can be formulated as follows:

$$W^* = \arg\max_{W \in \mathbb{R}^{D \times d}} \frac{\mathrm{Tr}(W^T C W)}{\mathrm{Tr}(W^T T W)} \qquad (10)$$

According to [11,24], Eq. (10) can be solved in an approximate way: the trace ratio is equivalent to a trace difference problem:

$$W^* = \arg\max_{W \in \mathbb{R}^{D \times d}} \mathrm{Tr}\left(W^T (C - \lambda^* T) W\right) \qquad (11)$$

Jia et al. [11] used the Decomposed Newton Method (DNM) to find an accurate solution to the trace difference equation; eigenvalue perturbation theory is also introduced to derive an efficient algorithm based on the Newton–Raphson method. DNM is an efficient way to find the global optimum of the trace ratio problem. The process of DNM is as follows:

Algorithm 1 The process of DNM
1: Set the threshold value ε, set n = 1, and initialize W_0 as an arbitrary column-orthogonal matrix such that W_0^T W_0 = I;
2: Compute λ_n = Tr(W_{n-1}^T C W_{n-1}) / Tr(W_{n-1}^T T W_{n-1});
3: Compute the d eigenvectors [v_{n1}, v_{n2}, ..., v_{nd}] of C − λ_n T, then repeat the following two operations until there is no change to W_n:
   (1) sort Q_i = v_{ni}^T (C − λ_n T) v_{ni}, i = 1, ..., d, in descending order and select the first d' eigenvectors to construct W_n;
   (2) compute λ_{n+1} = Tr(W_n^T C W_n) / Tr(W_n^T T W_n);
4: If |λ_{n+1} − λ_n| < ε, go to step 5; else set n = n + 1 and go to step 3;
5: Output W* = W_n.
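A minimal NumPy transcription of the DNM loop above might look as follows; this is our own sketch, it fixes d' = d, and it simplifies the convergence details of [11]:

```python
import numpy as np

def dnm(C, T, d, eps=1e-6, max_iter=100):
    """Sketch of the Decomposed Newton Method for max_W Tr(W^T C W) / Tr(W^T T W), W: D x d."""
    D = C.shape[0]
    W = np.linalg.qr(np.random.randn(D, d))[0]          # arbitrary column-orthogonal W_0
    lam = np.trace(W.T @ C @ W) / np.trace(W.T @ T @ W)  # lambda_1
    for _ in range(max_iter):
        # eigenvectors of the trace-difference matrix C - lambda_n * T (symmetric)
        vals, vecs = np.linalg.eigh(C - lam * T)
        order = np.argsort(vals)[::-1]                  # Q_i sorted in descending order
        W = vecs[:, order[:d]]                          # keep the leading d eigenvectors
        lam_new = np.trace(W.T @ C @ W) / np.trace(W.T @ T @ W)
        if abs(lam_new - lam) < eps:                    # |lambda_{n+1} - lambda_n| < eps
            break
        lam = lam_new
    return W
```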

The supervised dimension reduction approach based on the scaling cut criterion can then be summarized in Algorithm 2.


Algorithm 2 Scaling cut criterion-based supervised dimension reduction
Input: the training dataset X = {x_i, l_i}_{i=1}^{N} ∈ R^D; the test dataset X_test = {x_j^test}_{j=1}^{M} ∈ R^D; l_i is the label of x_i; d is the desired dimension.
Output: projections of the training dataset Y = {y_i}_{i=1}^{N} ∈ R^d and of the test dataset Y_test = {y_j^test}_{j=1}^{M} ∈ R^d.
1: Construct the between-class dissimilarity matrix C = Σ_{p=1}^{c} C_p and the within-class dissimilarity matrix A = Σ_{p=1}^{c} A_p by Eqs. (4) and (5);
2: Form the total dissimilarity matrix T = A + C;
3: Use DNM to get W;
4: Project the original data into the low-dimensional space spanned by the vectors of W: Y = W^T X and Y_test = W^T X_test.
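Putting the pieces together, Algorithm 2 could be exercised roughly as below. This is a usage sketch that reuses the hypothetical helpers `scaling_cut_matrices` and `dnm` from the earlier sketches, with random data standing in for a real training set; since samples are stored as rows here, the projection is written X @ W rather than W^T X.

```python
import numpy as np

# Toy data: N = 60 samples in D = 10 dimensions, c = 3 classes, projected to d = 2.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 10))
y_train = np.repeat([0, 1, 2], 20)
X_test = rng.normal(size=(15, 10))

C, A = scaling_cut_matrices(X_train, y_train)   # steps 1-2: build C, A and T = A + C
T = A + C
W = dnm(C, T, d=2)                              # step 3: solve the trace ratio via DNM
Y_train = X_train @ W                           # step 4: project the training data
Y_test = X_test @ W                             # and the test data
```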

3.2 Local scaling cut criterion-based dimension reduction

In the scaling cut criterion, the main computational cost comes from constructing the within-class and between-class dissimilarity matrices, which makes the computational complexity of the scaling cut criterion very high.

In LFDA [22], a localization strategy was proposed to reduce the computational complexity of LDA: LFDA does not force far-apart data pairs of the same class to be close, so the local structure of the data tends to be preserved. In this paper, we explore the localization strategy for scaling cut criterion-based dimension reduction. Firstly, a localized k-nearest neighbor graph is constructed to characterize the neighborhoods; secondly, the dissimilarities between neighbors are calculated instead of the dissimilarities between all data points.

For each point $x_i$, $N^b(x_i)$ and $N^w(x_i)$ are defined as the between-class $k_b$-nearest neighborhood and the within-class $k_w$-nearest neighborhood, respectively. A simple geometrical interpretation of the local within-class and between-class neighborhoods is illustrated in Fig. 1, where circles and triangles denote two different classes: the connected circle points constitute $N^b(x_i)$ of $x_i$, and the connected triangle points constitute $N^w(x_i)$ of $x_i$. Thus, the local between-class dissimilarity matrix and the local within-class dissimilarity matrix can be formulated as:

$$\tilde{C} = \sum_{p=1}^{c} \sum_{i=1}^{n_p} \sum_{j=1}^{N} N^b_{ij} (x_i - x_j)(x_i - x_j)^T \qquad (12)$$

$$\tilde{A} = \sum_{p=1}^{c} \sum_{i=1}^{n_p} \sum_{j=1}^{N} N^w_{ij} (x_i - x_j)(x_i - x_j)^T \qquad (13)$$

where $N^b_{ij} = 1/(n_p k_b)$ if $x_j \in N^b(x_i)$ and $0$ otherwise, and $N^w_{ij} = 1/(n_p k_w)$ if $x_j \in N^w(x_i)$ and $0$ otherwise. Then the local scaling cut criterion is formulated as follows:

$$\widetilde{Scuts}(W) = \frac{|W^T \tilde{C} W|}{\left|W^T (\tilde{A} + \tilde{C}) W\right|} = \frac{|W^T \tilde{C} W|}{|W^T \tilde{T} W|} \qquad (14)$$


Fig. 1 The local area used to compute the local within-class and local between-class dissimilarity matrices: (a) $N^w(x_i)$ is the local within-class 3-nearest neighborhood of data point $x_i$; (b) $N^b(x_i)$ is the local between-class 3-nearest neighborhood of data point $x_i$

Using the localized k-nearest neighbor graph, the within-class variances are relaxed and the between-class margins are enlarged, which can yield a more reasonable mapping matrix. Besides, the computational complexity is reduced, since only local samples are considered when calculating the within-class and between-class dissimilarities. The optimization of $\widetilde{Scuts}(W)$ can also be solved by DNM when it is transformed into the equivalent trace difference form:

$$W^* = \arg\max_{W \in \mathbb{R}^{D \times d}} \mathrm{Tr}\left(W^T (\tilde{C} - \lambda^* \tilde{T}) W\right) \qquad (15)$$
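The localized weights $N^b_{ij}$ and $N^w_{ij}$ of Eqs. (12)–(13) can be assembled along the following lines (our own sketch; Euclidean distances and argsort-based tie handling are assumptions, and the helper name is hypothetical):

```python
import numpy as np

def local_scaling_cut_matrices(X, y, kw=3, kb=3):
    """Local dissimilarity matrices of Eqs. (12)-(13), built from the kw within-class
    and kb between-class nearest neighbors of every sample (Euclidean distance assumed)."""
    N, D = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)    # pairwise distances
    n_of = dict(zip(*np.unique(y, return_counts=True)))             # class sizes n_p
    C_loc = np.zeros((D, D))
    A_loc = np.zeros((D, D))
    for i in range(N):
        same = np.where((y == y[i]) & (np.arange(N) != i))[0]
        diff = np.where(y != y[i])[0]
        nw = same[np.argsort(dist[i, same])[:kw]]    # within-class kw-NN, N^w(x_i)
        nb = diff[np.argsort(dist[i, diff])[:kb]]    # between-class kb-NN, N^b(x_i)
        n_p = n_of[y[i]]
        for j in nw:                                  # weight 1/(n_p * kw), Eq. (13)
            d = (X[i] - X[j])[:, None]
            A_loc += (d @ d.T) / (n_p * kw)
        for j in nb:                                  # weight 1/(n_p * kb), Eq. (12)
            d = (X[i] - X[j])[:, None]
            C_loc += (d @ d.T) / (n_p * kb)
    return C_loc, A_loc
```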

4 Kernel scaling cut criterion-based dimension reduction

4.1 Kernel scaling cut criterion-based dimension reduction

The scaling cut criterion is linear in nature, so it is inadequate for handling complex datasets such as real face images with varying facial expression and illumination. Over the last few years, kernel-based learning approaches, such as KPCA [19,30], KFDA [1], and other kernel methods [34], have attracted attention in pattern recognition and machine learning for their capability to efficiently represent complicated nonlinear relations in real-world input data.

We extend the scaling cut criterion to a nonlinear space with the kernel method, and name the result the kernel scaling cut criterion.

Firstly, we use the Gaussian radial basis function $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$ to map the data from the original input feature space into a higher-dimensional Hilbert space, $\Phi: X = \{x_i\}_{i=1}^{N} \in \mathbb{R}^D \rightarrow \Phi(X) = \{\phi(x_i)\}_{i=1}^{N} \in F$, and then solve for the new features with a linear dimension reduction method.

$$\phi C_p = \sum_{i \in V_p} \sum_{j \in \bar{V}_p} \frac{1}{n_p \, n_{c(j)}} \left(\phi(x_i) - \phi(x_j)\right)\left(\phi(x_i) - \phi(x_j)\right)^T \qquad (16)$$

$$\phi A_p = \sum_{i \in V_p} \sum_{j \in V_p} \frac{1}{n_p \, n_p} \left(\phi(x_i) - \phi(x_j)\right)\left(\phi(x_i) - \phi(x_j)\right)^T \qquad (17)$$


$\phi C_p$ denotes the total dissimilarity between the samples in the $p$-th class and the remaining samples, and $\phi A_p$ denotes the total dissimilarity among the samples within the $p$-th class. The definitions of the two dissimilarities take the class size into account, which implies the class information.

Then we can get the kernel scaling cut criterion as follows:

$$\phi Scuts(W) = \frac{\left|W^T \sum_{p=1}^{c} \phi C_p W\right|}{\left|W^T \sum_{p=1}^{c} (\phi A_p + \phi C_p) W\right|} = \frac{|W^T \phi C\, W|}{|W^T (\phi A + \phi C) W|} = \frac{|W^T \phi C\, W|}{|W^T \phi T\, W|} \qquad (18)$$

where $\phi C = \sum_{p=1}^{c} \phi C_p$ and $\phi A = \sum_{p=1}^{c} \phi A_p$ denote the total kernel-based between-class and within-class dissimilarity matrices, respectively, and $\phi T = \phi A + \phi C$ is the total kernel-based dissimilarity matrix.

Solving Eq. (18) is also a trace ratio problem, and it is transformed into its equivalent trace difference form as follows:

$$W^* = \arg\max_{W \in \mathbb{R}^{D \times d}} \mathrm{Tr}\left(W^T (\phi C - \lambda^* \phi T) W\right) \qquad (19)$$
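The paper does not spell out the representer-theorem algebra behind Eqs. (16)–(19). One simple way to experiment with the kernelized criterion is the empirical kernel map: represent every sample by its vector of Gaussian kernel values against the training set and apply the linear scaling cut to those features. The sketch below takes that route and is only an approximation of the authors' formulation; it reuses the hypothetical `scaling_cut_matrices` and `dnm` helpers from earlier and is practical only for small training sets.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / sigma^2) between the rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / sigma**2)

def kernel_scaling_cut(X_train, y_train, X_test, d, sigma=1.0):
    """Empirical-kernel-map approximation of the kernel scaling cut criterion."""
    K_train = rbf_kernel(X_train, X_train, sigma)   # each row = kernel features of a sample
    K_test = rbf_kernel(X_test, X_train, sigma)
    C, A = scaling_cut_matrices(K_train, y_train)   # linear scaling cut in kernel-feature space
    W = dnm(C, A + C, d)                            # trace-ratio solution via DNM
    return K_train @ W, K_test @ W                  # nonlinear low-dimensional embeddings
```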

4.2 Kernel local scaling cut criterion-based dimension reduction

In order to utilize the local structure information, in this section we extend the kernel scaling cut criterion to the local kernel scaling cut criterion. The local kernel scaling cut criterion keeps data pairs that are close in the original space close in the embedding space, so multimodal data can be embedded without losing their local structure. It also decreases the computational complexity by constructing a localized k-nearest neighbor graph, since it only needs to calculate the dissimilarities between neighbors.

Firstly, the original dataset $\{x_i\}_{i=1}^{N} \in \mathbb{R}^D$ is mapped into a high-dimensional feature space by a Gaussian kernel function. Secondly, the local between-class dissimilarity matrix and the local within-class dissimilarity matrix are defined as follows.

$$\phi \tilde{C} = \sum_{p=1}^{c} \sum_{i=1}^{n_p} \sum_{j=1}^{N} N^b_{ij} \left(\phi(x_i) - \phi(x_j)\right)\left(\phi(x_i) - \phi(x_j)\right)^T \qquad (20)$$

$$\phi \tilde{A} = \sum_{p=1}^{c} \sum_{i=1}^{n_p} \sum_{j=1}^{N} N^w_{ij} \left(\phi(x_i) - \phi(x_j)\right)\left(\phi(x_i) - \phi(x_j)\right)^T \qquad (21)$$

where $N^b_{ij} = 1/(n_p k_b)$ when $x_j \in N^b(x_i)$ and $N^b_{ij} = 0$ otherwise, and $N^w_{ij} = 1/(n_p k_w)$ when $x_j \in N^w(x_i)$ and $N^w_{ij} = 0$ otherwise. Then the local kernel scaling cut criterion is formulated as follows:

$$\phi \widetilde{Scuts}(W) = \frac{|W^T \phi \tilde{C}\, W|}{\left|W^T (\phi \tilde{A} + \phi \tilde{C}) W\right|} = \frac{|W^T \phi \tilde{C}\, W|}{|W^T \phi \tilde{T}\, W|} \qquad (22)$$

where $\phi \tilde{T} = \phi \tilde{A} + \phi \tilde{C}$ denotes the total local kernel dissimilarity matrix. This is also a trace ratio problem; in order to obtain the global optimum, it is transformed into the trace difference form as follows:


$$W^* = \arg\max_{W \in \mathbb{R}^{D \times d}} \mathrm{Tr}\left(W^T (\phi \tilde{C} - \lambda^* \phi \tilde{T}) W\right) \qquad (23)$$

Then the original high-dimensional data are mapped to the low-dimensional space spanned by W.

5 Optimal dimension scaling cut criterion-based supervised dimension reduction

In dimension reduction, how to select a suitable dimension is a challenging issue. The technique commonly used in PCA and KPCA is to apply the cumulative contribution rate to determine the dimension, but this does not directly identify the dimension corresponding to the best performance. In both LDA and KFDA, the dimension is generally reduced to $c - 1$ ($c$ is the number of classes). Since the intrinsic dimension differs across datasets, $c - 1$ dimensions do not always lead to an optimal result.

Nie et al. [16] proposed the optimal dimensionality discriminant analysis (ODDA), which solves the singularity problem of LDA and determines the optimal dimension automatically. The method assumes that classification performance is worst without dimension reduction, since datasets in practice are redundant and noisy. Under this assumption, a weighted difference is used to measure the performance: without dimension reduction its value is zero, and after dimension reduction it is positive. Different dimensions lead to different values, and the maximal value corresponds to the optimal dimension. Following the idea presented in [16], we propose the optimal dimension scaling cut criterion. The local scaling cut criterion shown in Eq. (14) can be reformulated as Eq. (24) using the weighted difference:

$$\widetilde{Scuts} = \mathrm{Tr}\left(\tilde{C} - \gamma \tilde{T}\right) \qquad (24)$$

where $\gamma$ is the weighting coefficient, defined as $\gamma = \mathrm{Tr}\,\tilde{C} / \mathrm{Tr}\,\tilde{T}$. Our purpose is to obtain an optimal projection matrix $W \in \mathbb{R}^{D \times d}$ for embedding the original data samples into a low-dimensional feature space, so Eq. (24) becomes:

$$\widetilde{Scuts}(W) = \mathrm{Tr}\left(W^T (\tilde{C} - \gamma \tilde{T}) W\right) \qquad (25)$$

In order to avoid singular solutions, the constraint $W^T W = I$ is added, where $I$ is a $d \times d$ identity matrix. We define a matrix $S$:

$$S = \tilde{C} - \frac{\mathrm{Tr}\,\tilde{C}}{\mathrm{Tr}\,\tilde{T}} \tilde{T} \qquad (26)$$

Then the optimization problem can be formulated as:

$$W^* = \arg\max_{\substack{W \in \mathbb{R}^{D \times d} \\ W^T W = I}} \mathrm{Tr}\left(W^T S W\right) \qquad (27)$$

Calculating the eigenvalues and the corresponding eigenvectors of $S$ defined in Eq. (26), we can obtain $W^*$. Let $W = [w_1, w_2, \ldots, w_d]$ be the $d$ eigenvectors corresponding to the $d$ largest eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$. When $d$ equals the number of positive eigenvalues of $S$, the objective of the above optimization problem is $\sum_{i=1}^{d} \lambda_i$ and it reaches its maximum. Therefore, the optimal solution of the optimization problem can be explicitly calculated by eigenvalue decomposition.


The supervised dimension reduction algorithm based on the optimal dimension scaling cut criterion is shown in Algorithm 3.

Algorithm 3 Optimal dimension scaling cut criterion-based dimension reduction
Input: the training dataset X = {x_i, l_i}_{i=1}^{N} ∈ R^D; the test dataset X_test = {x_j^test}_{j=1}^{M} ∈ R^D; l_i is the label of x_i; d is the desired dimension.
Output: projections of the training dataset Y = {y_i}_{i=1}^{N} ∈ R^d and of the test dataset Y_test = {y_j^test}_{j=1}^{M} ∈ R^d.
1: Construct the between-class dissimilarity matrix C̃ and the within-class dissimilarity matrix Ã by Eqs. (12) and (13);
2: Construct the matrix S by Eq. (26);
3: Calculate the eigenvalues and corresponding eigenvectors of S, then take the first d eigenvectors [w_1, w_2, ..., w_d] corresponding to the d largest eigenvalues [λ_1, λ_2, ..., λ_d], which form the projection matrix W = [w_1, w_2, ..., w_d];
4: Project the original data into the low-dimensional space spanned by the vectors of W: Y = W^T X and Y_test = W^T X_test.
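Because Eqs. (26)–(27) admit a closed-form solution, ODSC reduces to a single eigendecomposition in which the dimension d is taken as the number of positive eigenvalues of S. A sketch (ours; it reuses the hypothetical local-matrix helper from Sect. 3.2):

```python
import numpy as np

def odsc(C_loc, A_loc):
    """Optimal dimension scaling cut sketch: eigendecompose S = C - (TrC/TrT) T and keep
    the eigenvectors with positive eigenvalues, so d is chosen automatically."""
    T_loc = A_loc + C_loc
    S = C_loc - (np.trace(C_loc) / np.trace(T_loc)) * T_loc   # Eq. (26)
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]                            # eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]
    d = int((vals > 0).sum())                                 # optimal dimension
    return vecs[:, :d], d

# Usage sketch:
#   C_loc, A_loc = local_scaling_cut_matrices(X_train, y_train)
#   W, d = odsc(C_loc, A_loc)
#   Y_train = X_train @ W
```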

6 Relation and discussion

Relation to LDA: Classical LDA computes the within-class and between-class scatter using the class centers, which assumes that the data distributions are homoscedastic Gaussian. The scaling cut criterion computes dissimilarity matrices using all data points: the within-class dissimilarity is the sum of distances among samples in each class, and the between-class dissimilarity is the sum of distances between each sample and the samples in the other classes, without involving class centers. Hence the scaling cut criterion can deal with multimodal datasets on which LDA fails.

Relation to Ncut: Ncut and our methods have similar formulations, but there are two main differences between the proposed methods and the Ncut criterion. (1) The graph construction is different: Ncut computes the Laplacian matrix of a connected graph, whereas our methods only compute the within-class and between-class dissimilarity matrices to obtain the appropriate projection for dimension reduction, and the local k-nearest neighbor graph built in the local scaling cut criterion uses only the local structure of the samples. (2) Our methods are fully supervised and address classification tasks with discriminative information, while the Ncut formulations are unsupervised or semi-supervised and are usually used for clustering tasks.

Relation to LFDA: LFDA and LSC share the same localization idea as LPP, considering only the k-nearest neighbor points. The difference is that LFDA computes the local within-class and between-class scatter, while LSC computes the local within-class and between-class dissimilarity matrices.

7 Experimental results

In this section, we evaluate the proposed methods on classification tasks. For convenience, in the following experiments SC denotes the scaling cut criterion, LSC the local scaling cut criterion, KSC the kernel scaling cut criterion, KLSC the local kernel scaling cut criterion, and ODSC the optimal dimension scaling cut criterion. The algorithms used for comparison include PCA, LDA, KPCA, KFDA, ODDA [16] and LPP [17,32]. Various datasets are used to validate the effectiveness of these methods, including the UCI machine learning repository, the ORL face dataset, the USPS digit recognition dataset and a hyperspectral remote sensing image. All the results recorded in the following subsections are average accuracies over 30 random splits of the training and test data.

7.1 Visualization of embedding space

We first visualize the embedding space obtained for the original dataset before presenting the quantitative classification assessment. The Iris dataset from the UCI repository is used for visualization after dimension reduction. It consists of 150 samples, 50 from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured for each sample: the length and the width of the sepals and petals. Among the three classes, Iris virginica and Iris versicolor overlap in the original feature space.

7.2 UCI datasets

To validate the proposed algorithms, we use several datasets with comparatively high dimension from the UCI database, and the k-nearest neighbor (KNN) classifier is employed. The properties of the UCI datasets used are presented in Table 1, and Table 2 shows the accuracy and the corresponding dimension at which each method performs best. For the Air dataset, which has already been divided into training and test sets, the training set is used for training and the test set for validation. For the other datasets, we select 30 % of the samples as the training set and the rest as the test set. Although in most of the literature the dimension of LDA is reduced to c − 1, here the results of LDA with feature dimensions from 1 to D are all considered for comparison. Table 2 shows the performance comparison of the different algorithms; the baseline is the KNN classification using the original features. Figure 2 shows 2D projections of the Iris dataset by different methods.
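For reference, the evaluation protocol used here (random training/test splits, nearest-neighbor classification on the projected features, accuracy averaged over repeated splits) could be reproduced along the following lines. This is our paraphrase of the setup, not the authors' script; the 1-NN choice and the encoding of labels as integers 0..c−1 are assumptions.

```python
import numpy as np

def knn_accuracy(Y_train, y_train, Y_test, y_test, k=1):
    """k-nearest-neighbor accuracy in the reduced space (k = 1 assumed here)."""
    dist = np.linalg.norm(Y_test[:, None, :] - Y_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    pred = np.array([np.bincount(y_train[idx]).argmax() for idx in nearest])
    return (pred == y_test).mean()

def evaluate(X, y, reduce_fn, d, runs=30, train_frac=0.3, seed=0):
    """Average accuracy over random splits; reduce_fn(X_tr, y_tr, d) must return a D x d matrix W."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(runs):
        idx = rng.permutation(len(X))
        n_tr = int(train_frac * len(X))
        tr, te = idx[:n_tr], idx[n_tr:]
        W = reduce_fn(X[tr], y[tr], d)
        accs.append(knn_accuracy(X[tr] @ W, y[tr], X[te] @ W, y[te]))
    return float(np.mean(accs))
```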

The best accuracy and the corresponding dimension are recorded in Table 2. We find that our approaches outperform PCA, KPCA and KFDA on most datasets, with higher accuracy and lower dimension, while LDA performs better than our proposed methods on the Splice and Wdbc datasets.

Table 1 Properties of UCI datasets

| Dataset | Classes | Dimension | Size |
|---|---|---|---|
| Splice | 3 | 240 | 3,175 |
| Segment | 7 | 18 | 2,310 |
| Ionosphere | 2 | 34 | 351 |
| Wdbc | 2 | 30 | 569 |
| German | 2 | 24 | 1,000 |
| Vote | 2 | 16 | 435 |
| Wbcd | 2 | 9 | 699 |
| Air | 3 | 64 | Training: 359, Test: 719 |


Table 2 The accuracy (%) and the corresponding dimension (in parentheses) on the UCI datasets

| Dataset | KNN | PCA | LDA | KPCA | KFDA | LPP | ODDA | SC | LSC | KSC | KLSC | ODSC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Splice | 80.95 (240) | 92.92 (8) | 95.72 (72) | 89.91 (12) | 89.83 (72) | 79.27 (29) | 91.21 (69.6) | 68.82 (1) | 73.13 (1) | 92.59 (4) | 94.24 (5) | 89.20 (75.6) |
| Segment | 95.28 (18) | 93.49 (12) | 95.28 (14) | 89.30 (18) | 86.89 (14) | 92.41 (12) | 95.71 (5) | 93.87 (14) | 94.37 (13) | 85.14 (14) | 85.21 (16) | 96.19 (5.2) |
| Ionosphere | 88.61 (34) | 82.58 (4) | 83.61 (5) | 83.68 (7) | 92.17 (5) | 86.77 (8) | 87.80 (11.7) | 85.63 (15) | 86.79 (7) | 93.72 (28) | 91.40 (32) | 87.80 (12.2) |
| Wdbc | 93.64 (30) | 89.97 (3) | 93.38 (1) | 91.71 (6) | 87.28 (1) | 92.45 (12) | 91.02 (1.1) | 92.60 (30) | 92.60 (30) | 91.30 (1) | 89.40 (1) | 90.52 (1) |
| German | 68.59 (24) | 69.81 (19) | 71.74 (5) | 69.18 (10) | 68.46 (5) | 68.69 (16) | 72.00 (9.6) | 71.04 (18) | 71.09 (7) | 72.01 (8) | 69.75 (21) | 72.00 (10.6) |
| Vote | 91.45 (16) | 89.64 (12) | 93.38 (4) | 91.06 (8) | 89.45 (4) | 91.22 (12) | 95.10 (4.5) | 90.46 (12) | 92.30 (4) | 93.27 (1) | 92.76 (15) | 96.08 (4.5) |
| Wbcd | 96.48 (9) | 92.67 (5) | 96.22 (6) | 96.56 (6) | 96.50 (6) | 96.28 (1) | 97.35 (1.9) | 96.59 (1) | 96.59 (1) | 96.30 (1) | 96.58 (4) | 97.56 (2) |
| Air | 91.66 (64) | 95.97 (10) | 94.72 (19) | 94.90 (12) | 36.58 (19) | 97.61 (21) | 95.55 (12) | 95.69 (60) | 97.08 (24) | 94.58 (37) | 94.44 (18) | 96.80 (14) |

Fig. 2 Visualization of 2D projections of the Iris dataset by different methods (panels: PCA, LDA, KPCA, KFDA, LPP, SC, LSC, KSC, KLSC)


Fig. 3 Examples of (a) USPS and (b) ORL

It can be observed that ODSC obtains better results than ODDA on most datasets, but the corresponding dimension is relatively higher. From the results on the Splice dataset, we find that if the dimension of the original dataset is high, the corresponding optimal dimension is high as well, which does not meet our expectation of higher accuracy with lower dimension. The higher the optimal dimension, the more positive eigenvalues the matrix S in Eq. (26) has. If we delete some positive eigenvalues close to zero, performance is hardly affected while the feature dimension becomes lower. Take the Ionosphere dataset as an example: its original dimension is 34, and ODSC achieves 87.80 % accuracy with 12.2 dimensions. If we discard the eigenvalues lower than 10 % of the maximal eigenvalue, we obtain 88.29 % accuracy with 5.2 dimensions; if we discard those lower than 15 % of the maximal eigenvalue, the dimension becomes 4.2 and the accuracy is 86.83 %. Thus, abandoning positive eigenvalues that are close to zero does not lower the accuracy, and may even bring a slight improvement, while greatly reducing the feature dimension. In this experiment, considering all the datasets, we use nearly all the positive eigenvalues and only discard those lower than 0.0001 % of the maximal eigenvalue in ODSC.
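The pruning rule discussed above (dropping positive eigenvalues of S that fall below a fraction of the largest one) amounts to a one-line change in how the dimension is selected; a sketch, with the 10 % threshold used purely as an example:

```python
import numpy as np

def select_dimension(eigvals, rel_threshold=0.10):
    """Keep eigenvalues of S that are positive and at least rel_threshold * max(eigvals)."""
    eigvals = np.sort(np.asarray(eigvals))[::-1]          # descending order
    keep = eigvals > max(rel_threshold * eigvals[0], 0.0)  # positive and above the threshold
    return int(keep.sum())

# With rel_threshold = 0 this reduces to "all positive eigenvalues" as in plain ODSC;
# a small positive threshold trades a little accuracy for a much lower dimension.
```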

7.3 ORL database

The ORL face database is used for face recognition. It includes 40 distinct individuals, each with 10 different images taken at different times and with varying lighting conditions, facial expressions (open/closed eyes, smiling/not smiling) and facial details (glasses/no glasses). The images were taken with a tolerance for tilting and rotation of the face of up to 20°. Each image in the database is of size 112 × 92; in the experiment, each image is down-sampled to 28 × 23 to save computation time. Some images are shown in Fig. 3b. We carry out the experiments with different numbers of training samples: 2, 3, 4 or 6 samples per class are randomly selected for training and the remaining samples for testing. The numbers of within-class and between-class neighbors are both set to 3. Our proposed methods SC, LSC, KSC and KLSC are solved by DNM based on the trace difference formulation.

Table 3 lists the average accuracy and standard deviation over 30 random splits. From this table, we can see that the accuracy increases with the number of labeled samples, and our proposed methods are much better than KNN. Regarding the kernel-based methods, our proposed KSC and KLSC are superior to KPCA and KFDA in classifying the ORL face dataset. Furthermore, the deviations in Table 3 show that our proposed methods are more stable than classical methods such as KNN, PCA and LDA.

Figure 4 shows the average accuracy as a function of the number of dimensions on the ORL database. As we can see, the accuracies of our proposed approaches KSC and KLSC increase rapidly, and both methods reach their best results at relatively low dimensions, around 10.


Table 3 Best recognition results by different methods on the ORL face database (%)

| Method | 2 samples Acc. | 2 samples Dev. | 3 samples Acc. | 3 samples Dev. | 4 samples Acc. | 4 samples Dev. | 6 samples Acc. | 6 samples Dev. |
|---|---|---|---|---|---|---|---|---|
| KNN | 66.25 | 3.11 | 77.14 | 2.54 | 85.00 | 2.31 | 94.86 | 2.93 |
| PCA | 60.41 | 0.80 | 72.50 | 1.16 | 79.00 | 1.27 | 76.38 | 1.38 |
| LDA | 73.34 | 3.75 | 79.61 | 2.26 | 90.68 | 3.14 | 90.68 | 2.53 |
| KPCA | 60.75 | 2.72 | 74.65 | 2.18 | 75.50 | 2.27 | 78.18 | 2.38 |
| KFDA | 55.94 | 3.60 | 78.57 | 1.97 | 78.57 | 3.07 | 86.25 | 2.07 |
| LPP | 73.20 | 1.78 | 75.46 | 2.51 | 78.67 | 1.64 | 85.38 | 2.56 |
| ODDA (dim) | 84.17 (39) | 2.02 | 89.65 (51.60) | 1.60 | 92.09 (72.20) | 2.11 | 95.08 (97.45) | 1.74 |
| SC | 84.35 | 2.34 | 89.46 | 2.54 | 92.40 | 2.06 | 95.28 | 1.47 |
| LSC | 85.69 | 1.93 | 90.63 | 2.70 | 92.94 | 1.97 | 95.91 | 1.41 |
| KSC | 84.61 | 3.12 | 89.61 | 2.31 | 93.02 | 1.88 | 96.59 | 1.22 |
| KLSC | 83.48 | 2.80 | 87.80 | 2.53 | 90.18 | 2.30 | 94.25 | 2.32 |
| ODSC (dim) | 84.39 (39) | 2.31 | 89.48 (53.30) | 1.77 | 92.25 (70.10) | 1.78 | 95.20 (98.20) | 1.86 |

Fig. 4 Accuracy as a function of the number of dimensions on the ORL database: (a) 2 training samples per class, (b) 3 training samples per class, (c) 4 training samples per class, (d) 6 training samples per class (curves: SC, LSC, KSC, KLSC, PCA, KPCA, LPP, LDA, KFDA, KNN, ODDA, ODSC)


Fig. 5 The average recognition rate on the whole USPS dataset with 80 training samples per class (curves: PCA, KPCA, SC, LSC, LPP, LDA, KSC, KLSC, KFDA, KNN, ODDA, ODSC)

This is probably because the ORL dataset is affected by illumination and facial expression, so kernel-based dimension reduction methods are more suitable for it. From the results in Table 3 and Fig. 4, we can see that our proposed methods achieve higher results than the others.

7.4 USPS database

In this experiment, we evaluate the performance of our methods on the digit recognition task. The USPS handwritten digit dataset consists of 9,298 samples, each a normalized gray-scale image of size 16 × 16. The dataset contains the ten digits from 0 to 9, with 7,291 training samples and 2,007 test samples. Some images are shown in Fig. 3a. Firstly, we perform the experiment for the recognition of the ten classes {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, randomly choosing 80 samples per class for training and the rest for testing; the results are depicted in Fig. 5. Then we carry out four recognition tasks, where three groups {2, 7}, {7, 9} and {3, 5, 8} are hard to distinguish and one group {1, 2, 3, 4} is easier. We randomly select 20, 40, 60 or 80 samples per class for training and the remainder for testing; the results are recorded in Table 4. Figure 5 illustrates the average recognition rate over 30 runs of random splits, with the number of dimensions on the x-axis. As can be seen from Fig. 5, our proposed methods perform better than KNN, and KSC and KLSC have outstanding performance, reaching the highest accuracy rapidly with the lowest dimensions, i.e., 5.

Table 4 summarizes the average recognition accuracy over 30 runs for recognizing {2, 7}, {7, 9}, {3, 5, 8} and {1, 2, 3, 4}, respectively. The dimensions corresponding to the best accuracies of ODDA and ODSC are also given. From Table 4, we can see that ODDA and ODSC obtain similar accuracies and dimensions, while KSC and KLSC also show good and stable performance in the four recognition tasks.


Table 4 Average recognition accuracy (%) on the USPS database (the corresponding dimension in parentheses for ODDA and ODSC)

| Digits | # training samples | KNN | PCA | LDA | KPCA | KFDA | LPP | ODDA | SC | LSC | ODSC | KSC | KLSC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2,7 | 20 | 92.31 | 96.13 | 96.68 | 88.38 | 97.74 | 92.70 | 98.39 (9.70) | 96.48 | 96.48 | 98.45 (10.4) | 96.74 | 96.73 |
| 2,7 | 40 | 94.70 | 91.14 | 96.92 | 94.23 | 98.12 | 95.24 | 98.54 (15.2) | 96.05 | 96.05 | 98.84 (16.7) | 97.32 | 97.33 |
| 2,7 | 60 | 95.59 | 94.43 | 96.30 | 95.64 | 98.27 | 95.70 | 98.81 (21.2) | 95.22 | 95.21 | 99.06 (23.3) | 97.82 | 97.81 |
| 2,7 | 80 | 96.14 | 94.68 | 96.61 | 96.24 | 98.52 | 96.25 | 98.84 (25.8) | 93.76 | 93.73 | 98.78 (27.8) | 97.95 | 97.96 |
| 7,9 | 20 | 89.88 | 87.91 | 92.62 | 86.09 | 91.99 | 87.49 | 95.49 (14.1) | 92.66 | 92.66 | 95.93 (13.3) | 87.19 | 89.45 |
| 7,9 | 40 | 93.33 | 89.03 | 93.50 | 88.62 | 92.37 | 84.70 | 96.22 (22.2) | 90.76 | 90.77 | 96.15 (22.4) | 91.03 | 93.43 |
| 7,9 | 60 | 94.65 | 89.50 | 94.15 | 89.87 | 92.90 | 85.50 | 97.32 (30.4) | 89.25 | 89.22 | 97.12 (31.4) | 92.54 | 94.20 |
| 7,9 | 80 | 94.87 | 90.31 | 93.75 | 89.48 | 94.29 | 87.49 | 97.80 (38.6) | 80.36 | 79.58 | 97.80 (39.8) | 92.74 | 94.75 |
| 3,5,8 | 20 | 85.72 | 77.78 | 86.80 | 73.95 | 85.88 | 84.46 | 91.17 (25.3) | 83.90 | 83.90 | 91.45 (23.8) | 86.58 | 88.13 |
| 3,5,8 | 40 | 89.65 | 83.97 | 87.78 | 83.85 | 89.85 | 89.08 | 94.03 (44.6) | 80.21 | 80.21 | 94.17 (42.5) | 92.11 | 92.00 |
| 3,5,8 | 60 | 91.24 | 85.22 | 87.72 | 83.86 | 89.94 | 90.83 | 95.74 (62.1) | 81.91 | 81.91 | 95.70 (60.3) | 91.84 | 92.22 |
| 3,5,8 | 80 | 91.90 | 85.28 | 80.86 | 85.73 | 92.18 | 92.41 | 95.92 (78.7) | 79.79 | 81.32 | 95.78 (76.5) | 94.66 | 94.58 |
| 1,2,3,4 | 20 | 91.09 | 85.87 | 91.07 | 89.80 | 94.36 | 87.80 | 95.12 (20.9) | 89.07 | 89.07 | 94.91 (21.0) | 92.62 | 92.62 |
| 1,2,3,4 | 40 | 93.55 | 89.11 | 92.93 | 92.08 | 95.85 | 92.99 | 96.28 (34.9) | 83.53 | 83.51 | 96.34 (36.5) | 94.43 | 94.41 |
| 1,2,3,4 | 60 | 94.92 | 90.18 | 89.81 | 92.41 | 96.59 | 94.46 | 97.68 (46.8) | 78.15 | 78.63 | 97.50 (49.6) | 95.44 | 95.43 |
| 1,2,3,4 | 80 | 95.90 | 91.59 | 85.68 | 93.16 | 96.96 | 95.75 | 97.97 (53.1) | 89.81 | 90.72 | 97.95 (55.5) | 95.95 | 95.94 |


Fig. 6 Color composition image (512 × 455) and the labeled samples (classes C1–C13)

Table 5 Labeled samples and percentage per class

| Class | Land cover | No. of samples (%) |
|---|---|---|
| 1 | Scrub | 761 (14.6) |
| 2 | Willow swamp | 243 (4.66) |
| 3 | Cabbage palm hammock | 256 (4.92) |
| 4 | Cabbage palm/oak hammock | 252 (4.84) |
| 5 | Slash pine | 161 (3.07) |
| 6 | Oak/broadleaf hammock | 229 (4.38) |
| 7 | Hardwood swamp | 105 (2.0) |
| 8 | Graminoid marsh | 431 (8.27) |
| 9 | Spartina marsh | 520 (9.99) |
| 10 | Cattail marsh | 404 (7.76) |
| 11 | Salt marsh | 419 (8.04) |
| 12 | Mud flats | 503 (9.66) |
| 13 | Water | 927 (17.8) |

7.5 Hyperspectral remote sensing image

We further apply our proposed dimension reduction methods to the classification of a hyperspectral remote sensing image. Dimensionality reduction is an important task in the analysis of hyperspectral image data [9]. The KSC data are used to evaluate the performance of our methods. They were acquired by the NASA AVIRIS instrument over the Kennedy Space Center on March 23, 1996, from an altitude of approximately 20 km with a spatial resolution of 18 m. AVIRIS acquires data in 224 bands of 10-nm width from 400 to 2,500 nm; removing noisy and water-absorption bands left 176 candidate bands. A color composite of the image (512 × 455) and the labeled areas of 13 classes are shown in Fig. 6, and details of the 13 land covers in the KSC area are listed in Table 5. Figure 7 shows the mean reflectance curves of these 13 land covers, where each color denotes one class. From Fig. 7, we can see that the classes are hard to distinguish in the original space and the dimension of the original data is high, so it is important to perform dimension reduction on the hyperspectral image to obtain a low-dimensional and more discriminative projection space. Here we use support vector machines (SVM) as the classifier because of their successful use in hyperspectral image classification [13]. The compared dimension reduction methods include LDA, PCA, KPCA, KFDA, and ODDA.


Fig. 7 The reflectance curves of the thirteen land-cover classes (C1–C13; x-axis: band, y-axis: reflectance)

Fig. 8 Average accuracy of the KSC image with different numbers of dimensions (curves: SC, LSC, KSC, KLSC, PCA, KPCA, LPP, NMMP, LDA, KFDA, ODDA, ODSC, SVM)

The baseline approach is the SVM classification using all the features in the original space.

We randomly choose 8 samples per class for training and the rest for testing. Figure 8 gives the average accuracies of the different methods over 30 runs of random splits. The final dimension of LDA is c − 1, i.e., 12, where c is the number of classes. The numbers of within-class and between-class neighbors in LSC, ODDA, and ODSC are both set to 10.

Figure 8 shows the average accuracies of the different methods with up to 70 dimensions on the KSC dataset. From the results, we can see that unsupervised and linear methods such as PCA and LPP perform unsatisfactorily below 25 dimensions. Figure 8 also indicates that SC and LSC work well when only a few features are used (5 features for SC and 8 for LSC, according to the curves at their highest accuracy) in comparison with the other approaches. Furthermore, the curves show that our proposed method SC achieves effective performance using a small training set (8 labeled samples per class), which is very important for hyperspectral image classification because of the difficulty of obtaining labels for hyperspectral images.

8 Conclusions

This paper has presented scaling cut criterion-based dimension reduction methods for data analysis. To obtain a low-dimensional embedding of a high-dimensional dataset, the scaling cut criterion calculates the dissimilarity matrix over all samples to explore the global structure, which eliminates the influence of class centers on the data distribution; it can therefore handle the heteroscedastic and multimodal data on which LDA fails. To obtain a more reasonable mapping matrix and reduce the computational complexity, a localized k-nearest neighbor graph was introduced into the scaling cut criterion to form the local scaling cut criterion. Furthermore, the two algorithms were kernelized, making them suitable for nonlinear tasks. In addition, to automatically determine the optimal dimension in our dimension reduction methods, the optimal dimension scaling cut criterion-based discriminant analysis was proposed.

Our experiments on a variety of datasets, including UCI datasets, the ORL dataset, the USPS dataset and a hyperspectral remote sensing image, illustrate the effectiveness of the proposed methods and show that the dimension reduction methods based on the scaling cut criterion outperform several popular dimension reduction techniques in terms of classification accuracy.

Acknowledgments This work was supported by the National Basic Research Program of China (973 Program) (Grant 2013CB329402), the National Natural Science Foundation of China (Nos. 61272282, 61203303, and 61272279), the Program for New Century Excellent Talents in University (NCET-13-0948), and the Fundamental Research Funds for the Central Universities (Grant K50511020011).

References

1. Abbasnejad ME, Ramachandram D, Mandava R (2012) A survey of the state of the art in learning the kernels. Knowl Inf Syst 31(2):193–221
2. Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15(6):1373–1396
3. Borg I, Groenen P (2003) Modern multidimensional scaling: theory and applications. J Educ Meas 40(3):277–280
4. Ding C, He X, Zha H, Gu M, Simon HD (2001) A min-max cut algorithm for graph partitioning and data clustering. In: Proceedings of the IEEE international conference on data mining, pp 107–114
5. Fiedler M (1973) Algebraic connectivity of graphs. Czechoslov Math J 23(2):298–305
6. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7(2):179–188
7. Fu Y, Li Z, Yuan J et al (2008) Locality versus globality: query-driven localized linear models for facial image computing. IEEE Trans Circuits Syst Video Technol 18(12):1741–1752
8. Hagen L, Kahng AB (1992) New spectral methods for ratio cut partitioning and clustering. IEEE Trans Comput Aided Des Integr Circuits Syst 11(9):1074–1085
9. Ham J, Chen Y, Crawford MM et al (2005) Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans Geosci Remote Sens 43(3):492–501
10. He X, Cai D, Yan S et al (2005) Neighborhood preserving embedding. In: Proceedings of the 10th international conference on computer vision, pp 1208–1213
11. Jia Y, Nie F, Zhang C (2009) Trace ratio problem revisited. IEEE Trans Neural Netw 20(4):729–735
12. Li XR, Jiang T, Zhang K (2006) Efficient and robust feature extraction by maximum margin criterion. IEEE Trans Neural Netw 17(2):157–165
13. Melgani F, Bruzzone L (2004) Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans Geosci Remote Sens 42(8):1778–1790
14. Nie F, Xiang S, Zhang C (2007) Neighborhood minmax projections. In: Proceedings of the 20th international joint conference on artificial intelligence, pp 993–998
15. Nie F, Xiang S, Song Y et al (2009) Orthogonal locality minimizing globality maximizing projections for feature extraction. Opt Eng 48(1):017202–017205
16. Nie F, Xiang S, Song Y et al (2007) Optimal dimensionality discriminant analysis and its application to image recognition. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 1–8
17. Niyogi P, He X (2003) Locality preserving projections. Neural Inf Process Syst 16:153
18. Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326
19. Schölkopf B, Smola A, Müller KR (1997) Kernel principal component analysis. In: Proceedings of the 7th international conference on artificial neural networks, vol 97, pp 583–588
20. Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905
21. Song G, Cui B, Zheng B et al (2009) Accelerating sequence searching: dimensionality reduction method. Knowl Inf Syst 20(3):301–322
22. Sugiyama M (2006) Local Fisher discriminant analysis for supervised dimensionality reduction. In: Proceedings of the 23rd international conference on machine learning, pp 905–912
23. Tenenbaum JB, De Silva V (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323
24. Wang H, Yan S, Xu D et al (2007) Trace ratio vs. ratio trace for dimensionality reduction. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–8
25. Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2(1):37–52
26. Xiang S, Nie F, Song Y et al (2009) Embedding new data points for manifold learning via coordinate propagation. Knowl Inf Syst 19(2):159–184
27. Xiang S, Nie F, Zhang C et al (2009) Nonlinear dimensionality reduction with local spline embedding. IEEE Trans Knowl Data Eng 21(9):1285–1298
28. Yan S, Xu D, Zhang B et al (2007) Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Trans Pattern Anal Mach Intell 29(1):40–51
29. Yang J, Frangi AF, Zhang D et al (2005) KPCA plus LDA: a complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Trans Pattern Anal Mach Intell 27(2):230–244
30. Zhang C, Nie F, Xiang S (2010) A general kernelization framework for learning algorithms based on kernel PCA. Neurocomputing 73(4):959–967
31. Zhang L, Chen S, Qiao L (2012) Graph optimization for dimensionality reduction with sparsity constraints. Pattern Recognit 45(3):1205–1210
32. Zhang L, Qiao L, Chen S (2010) Graph-optimized locality preserving projections. Pattern Recognit 43(6):1993–2002
33. Zhang XR, Jiao LC, Zhou S et al (2012) Adaptive multi-parameter spectral feature analysis for SAR target recognition. Opt Eng 51(8):087203. doi:10.1117/1.OE.51.8.087203
34. Zhang XR, Wang WW, Li YY et al (2012) A PSO-based automatic relevance determination and feature selection system for hyperspectral image classification. IET J Electron Lett 48(20):1263–1264


Xiangrong Zhang received the B.Sc. and M.Sc. in Computer Science and Technology from Xidian University, Xi'an, China, in 1999 and 2003, respectively, and the Ph.D. degree in Pattern Recognition and Intelligent Systems from Xidian University, Xi'an, China, in 2006. She is currently a professor in the School of Electronic Engineering, Xidian University. Her research interests include visual information analysis and understanding, pattern recognition, and machine learning.

Yudi He received the B.Sc. degree in Intelligence Science and Technology from Xidian University, Xi'an, China, in 2011. She is currently a Master candidate in the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education, International Research Center for Intelligent Perception and Computation, Xidian University, Xi'an 710071, China. Her current interests include pattern recognition, machine learning and hyperspectral image processing.

Licheng Jiao received the Ph.D. degree from Xi'an Jiaotong University, Xi'an, China, in 1990. From 1990 to 1991, he was a Postdoctoral Fellow in the National Key Laboratory for Radar Signal Processing, Xidian University, Xi'an, China. Currently, he is the Dean of the Electronic Engineering School and the Director of the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, International Research Center for Intelligent Perception and Computation, Xidian University, Xi'an 710071, China. He is a Fellow of IET and an IEEE senior member, and the author or coauthor of more than 200 scientific papers. His current research interests include signal and image processing, machine learning, natural computation, and intelligent information processing.


Ruochen Liu is currently an associate professor with the Intelligent Information Processing Innovative Research Team of the Ministry of Education of China at Xidian University, Xi'an, China. She received her Ph.D. from Xidian University, Xi'an, China, in 2005. Her research interests are broadly in the area of computational intelligence; her areas of special interest include artificial immune systems, evolutionary computation, data mining, and optimization.

Jie Feng received the B.S. degree from Chang'an University, Xi'an, China, in 2008. She is currently working toward the Ph.D. degree in the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, International Research Center for Intelligent Perception and Computation, Xidian University, Xi'an 710071, China. Her current interests include synthetic aperture radar image interpretation, hyperspectral image processing, pattern recognition, and evolutionary computation.

Sisi Zhou received the B.Sc. from Beijing University of Posts and Telecommunications, Beijing, China, in 2007, and the M.Sc. in circuits and systems from the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, International Research Center for Intelligent Perception and Computation, Xidian University, Xi'an 710071, China, in 2010. Her research interests include machine learning and pattern recognition.
