Information Sciences 294 (2015) 390–407

Support vector machine with manifold regularization and partially labeling privacy protection

http://dx.doi.org/10.1016/j.ins.2014.09.050
0020-0255/© 2014 Elsevier Inc. All rights reserved.

* Corresponding author at: School of Digital Media, Jiangnan University, Wuxi, Jiangsu, PR China. Tel.: +86 510 85912136. E-mail address: [email protected] (S. Wang).

Tongguang Ni a,b, Fu-Lai Chung c, Shitong Wang a,c,*
a School of Digital Media, Jiangnan University, Wuxi, Jiangsu, PR China
b School of Information Science and Engineering, Changzhou University, Changzhou, Jiangsu, PR China
c Department of Computing, Hong Kong Polytechnic University, Hong Kong

Article info

Article history:
Received 16 January 2014
Received in revised form 22 September 2014
Accepted 28 September 2014
Available online 7 October 2014

Keywords: Large datasets; Classification; Support vector machine; Privacy protection; Manifold regularization

Abstract

A novel support vector machine with manifold regularization and partially labeling privacy protection, termed SVM-MR&PLPP, is proposed for semi-supervised learning (SSL) scenarios where, due to privacy protection concerns, only a few labeled data and the class proportion of the unlabeled data are available. It integrates manifold regularization and privacy protection regularization into the Laplacian support vector machine (LapSVM) to improve the classification accuracy. Privacy protection here refers to using only the class proportion of the data. In order to circumvent the high computational burden of the matrix inversion operation involved in SVM-MR&PLPP, its scalable version, called SSVM-MR&PLPP, is further developed by introducing intermediate decision variables into the original regularization framework, so that the computational burden of the corresponding transformed kernel in SSVM-MR&PLPP is greatly reduced, making it highly scalable to large datasets. The experimental results on numerous datasets show the effectiveness of the proposed classifiers.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

A learning problem that has only recently gained attention in the machine learning community is that of learning a classifier from class labeling proportion information [21–23,28]. This type of learning problem appears in areas such as politics, medicine and spam filtering. For example, in a political election the individual votes are not open to everyone, but for each region the vote distribution is publicly available and gives the class labeling proportion of each candidate in that region. The final vote results are closely related to attributes of every voter, such as income and family background, and such factors directly influence the distribution of votes in a region. Likewise, a similar situation exists in medical diagnosis. For example, following the outbreak patterns of a new type of influenza virus is an important task, but revealing which patient actually got infected should be treated in a highly confidential manner; outbreak frequencies in certain risk groups, however, are not sensitive information. Similar problems exist in the area of spam identification, where the cost of data collection is also very important for real applications. As we know, datasets of spam emails are likely to contain almost pure spam data (obtained, e.g., by listing e-mail addresses as spam bait), while users' inboxes typically contain a mixture of spam and non-spam emails. The usual task is to use the inbox data to improve the estimation of spam.


In many cases, it is possible to estimate the proportions of spam and non-spam in a user's inbox, which is much cheaper than estimating the actual labels. More importantly, collecting e-mail labels from a user's inbox may involve an invasion of personal privacy. A lot of individual information can be released in the form of label proportions in real life. As for the above examples, after an election, the proportions of votes of each demographic area may be released by the government. In healthcare, the proportions of diagnosed diseases of each zip code area may be available to the public. Motivated by the above real-world applications, several classifiers using the class labeling proportion information have been proposed in [21,23,28]. However, since all these classifiers do not consider the intrinsic structure hidden between labeled and unlabeled samples, they become inappropriate for the semi-supervised application scenarios where a few labeled samples can be quite expensively acquired while huge amounts of unlabeled samples can be easily and/or cheaply collected and the class labeling proportion information is also available.

As is well known, semi-supervised learning (SSL), which exploits huge amounts of unlabeled data jointly with the limited labeled data for learning, has attracted considerable attention in recent years [1,4,12,14,15,18,36,39]. A popular SSL approach [2] is learning with the manifold assumption, which states that each class lies on a separate low-dimensional manifold embedded in a higher-dimensional space. Many forms of real-world data, such as handwritten digits [10], webpages [40] and images [17], have been demonstrated to exhibit such an intrinsic geometric structure, which can commonly be approximated by a weighted graph. In this study, we integrate such an SSL mechanism with class labeling proportion information to develop a novel support vector machine with manifold regularization (MR) and partially labeling privacy protection (PLPP). The difference between the proposed classifiers and the often-used semi-supervised learning methods is illustrated in Fig. 1.

In order to achieve our goal, we first use the Laplacian support vector machine (LapSVM) [2] as the basic framework to construct a novel learning framework with manifold regularization and partially labeling privacy protection, which makes good use of the labeling proportion of the unlabeled data, e.g., the proportion of patients having a disease. Such proportion information can be helpful for classification and at the same time protects privacy. Based on the principle of the support vector machine (SVM) [16,25,29,32–34,38] and using an ε-insensitive loss function, a support vector machine with manifold regularization and partially labeling privacy protection, termed SVM-MR&PLPP, is then proposed. The learning task can be solved using a classical quadratic programming (QP) solver. Moreover, the traditional SSL approaches based on the manifold regularization framework are limited to small-scale datasets and hence inappropriate for large data, as a result of the matrix inversion operation involved in the dual problem [2]. As the computation of the matrix inversion, as in LapSVM, is very expensive for a large dataset [6,13,20,35], the objective function of the proposed classifier SVM-MR&PLPP is reconstructed by introducing intermediate decision variables into the manifold regularization framework, and hence a scalable version of SVM-MR&PLPP, called SSVM-MR&PLPP, is developed accordingly. Since the proposed classifiers consider both the labeled and unlabeled data as well as the labeling proportion of the unlabeled data in a learning task, they not only inherit the advantages of manifold learning, but also can effectively correct the decision boundary with the given labeling proportion.

The main contributions of this work can be highlighted below.

(1) A novel classifier SVM-MR&PLPP, which considers the labeled data, the unlabeled data and the labeling proportion of the unlabeled data, is proposed. We also prove that the training of SVM-MR&PLPP can be equivalently transformed into a classical QP problem.

(2) By introducing intermediate variables into the learning framework, the proposed classifier SVM-MR&PLPP is extended into its scalable version, called SSVM-MR&PLPP, for large datasets. We show that the training of SSVM-MR&PLPP can also be transformed into a classical QP problem and consequently SSVM-MR&PLPP can be efficiently solved by a QP solver. SSVM-MR&PLPP inherits the same sparsity as in SVM, though it has a kernel matrix different from that of SVM-MR&PLPP.

(3) Extensive experiments on synthetic and real-world datasets demonstrate that the proposed classifiers outperform or are at least comparable to several state-of-the-art methods.

Fig. 1. Difference between the proposed classifiers and the often-used SSL (colors encoding class labels): (a) semi-supervised learning with labeled and unlabeled data explicitly given; (b) semi-supervised learning with labeling proportion, where labeled data, unlabeled data and the class proportion of the unlabeled data are given.


The rest of this paper is organized as follows. In Section 2, we briefly review manifold regularization and highlight its limitations by taking LapSVM as an example. In Section 3, the proposed classifier SVM-MR&PLPP is presented, and its scalable version SSVM-MR&PLPP is derived in Section 4. The experimental results are reported in Section 5, and we conclude the paper in Section 6. For easy reading and understanding, the mathematical notations used in this paper are summarized in Table 1.

2. Manifold regularization

The idea of regularization has a long mathematical history, originating from the solution of ill-posed inverse problems [19]. The goal of regularization is to stabilize the solution by using some auxiliary nonnegative functional that embeds prior information about the solution. Recently, the manifold regularization (MR) framework has been proposed to exploit the geometry of the probability distribution that generates the data and to incorporate it as a new regularization term [2]. Three distinct concepts are brought together in this framework, namely, regularization in reproducing kernel Hilbert spaces, spectral graph theory, and the geometric viewpoint of manifold learning algorithms. Manifold regularization provides an effective framework for training on a dataset which contains both labeled and unlabeled data.

In the standard framework of learning from samples, there exists a probability distribution P on X × R according to which examples are generated for function learning. Labeled samples (x_i, y_i) for learning are generated according to P, where x_i is the sample and y_i is the corresponding label. Unlabeled samples x_j are drawn according to the marginal distribution P_X of P. Intuitively, if two points x_1 and x_2 are close in the intrinsic geometry of P_X, they are likely to have similar labels. In other words, P(y|x) varies smoothly along the geodesics in the geometry of P_X. With these geometric smoothness assumptions, regularization can be used for function learning. As P_X is unknown in most applications, locality information of the data is used to obtain an empirical estimate of P_X.

In general, manifold regularization seeks an optimal classification function by minimizing the following functional [2]:

\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f(x_i)) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2,    (1)

where \mathcal{H}_K is the Reproducing Kernel Hilbert Space (RKHS) induced by a Mercer kernel K, V(\cdot) denotes a certain loss function, the regularization term \|f\|_K^2 controls the complexity of the classifier to avoid overfitting, and the other regularization term \|f\|_I^2 is used to smooth the manifold geometry of the sample distribution [2]. Generally, in semi-supervised learning, \|f\|_I^2 can be approximated by

\|f\|_I^2 = \sum_{i,j} W_{ij} (f_i - f_j)^2 = f^T L f,    (2)

where W_{ij} is the weight measuring the similarity between f_i and f_j, f_i represents f(x_i), f = [f_1, f_2, \ldots, f_{l+u}]^T, L = D - W is the graph Laplacian matrix, and D is a diagonal matrix with entries D_{ii} = \sum_{j=1}^{l+u} W_{ij}. Note that besides this graph Laplacian, any Laplacian matrix can be used here. For example, the Laplacian matrix proposed in [37] for data ranking can be used in Eq. (2).
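For concreteness, the adjacency matrix W, the degree matrix D and the graph Laplacian L used above can be assembled as in the following sketch (our illustration, not the authors' code); the k-NN rule with Gaussian edge weights is one common choice, and the neighbourhood size k and bandwidth sigma are placeholder values.

```python
# Minimal sketch of building the graph Laplacian L = D - W of Eq. (2) from a k-NN
# graph with Gaussian edge weights (illustrative choices, not the paper's exact settings).
import numpy as np

def graph_laplacian(X, k=5, sigma=1.0):
    """X: (n, d) data matrix with n = l + u. Returns the (n, n) graph Laplacian L = D - W."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(dist2, np.inf)                      # exclude self-loops
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist2[i])[:k]                  # k nearest neighbours of x_i
        W[i, nbrs] = np.exp(-dist2[i, nbrs] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)                               # symmetrize the adjacency matrix
    D = np.diag(W.sum(axis=1))                           # degree matrix D_ii = sum_j W_ij
    return D - W
```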

The regularization term \|f\|_I^2 with W_{ij} incurs a heavy penalty if neighboring points x_i and x_j are mapped far apart. Therefore, the minimization of Eq. (2) is an attempt to ensure that if x_i and x_j are close to each other then y_i and y_j are close as well. According to the representer theorem [24], the solution to Eq. (1) can be expressed in the following form:

Table 1. Notations used in this paper.
Notation                         Description
l                                Number of labeled samples
u                                Number of unlabeled samples
x                                Input column data vector
y                                Binary label for classification
f(x)                             Output of a learning system
K                                Mercer kernel
\mathcal{H}_K                    Reproducing Kernel Hilbert Space of K
k(x_i, x_j)                      Kernel function
L                                Graph Laplacian matrix
\lambda                          Proportion of the positive class among all unlabeled samples
I                                l × l identity matrix
V(\cdot)                         Loss function
d = [d_1, ..., d_{l+u}]^T        Intermediate variable vector in the framework of SSVM-MR&PLPP
\epsilon, \epsilon^*             \epsilon-insensitive parameters
\xi_i, \xi_i^*, \eta, \eta^*     Lagrangian slack variables
\alpha_i, \beta_i, \beta_i^*, r_i, r_i^*, \gamma, \gamma^*, \tilde r, \tilde r^*   Lagrangian multipliers
\gamma_A, \gamma_I, \gamma_C, C_1, C_2, \mu, \delta                                Regularization parameters


f(x) = \sum_{i=1}^{N} \alpha_i k(x_i, x).    (3)

Let the coefficient vector \alpha = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T and the kernel matrix K \in R^{n \times n} with its (i, j)-th entry K(i, j) = k(x_i, x_j); then Eq. (3) can be written as f^* = K\alpha.

If V(\cdot) in Eq. (1) is the hinge loss function, i.e., \max[0, 1 - y_i f(x_i)] for SVM [32], manifold regularization can be extended to the Laplacian support vector machine (LapSVM) [2], and the primal problem in Eq. (1) can be viewed as the following formulation:

\min_{\alpha \in R^{l+u}, \xi \in R^{l}} \frac{1}{l} \sum_{i=1}^{l} \xi_i + \gamma_A \alpha^T K \alpha + \frac{\gamma_I}{(u+l)^2} \alpha^T K L K \alpha,    (4)

s.t. y_i f(x_i) \ge 1 - \xi_i, \quad i = 1, \ldots, l,
     \xi_i \ge 0, \quad i = 1, \ldots, l.

By the representer theorem [24], the solution to the problem above is given by:

f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* k(x, x_i),    (5)

where \alpha^* = \big(2\gamma_A I + 2\frac{\gamma_I}{(u+l)^2} L K\big)^{-1} J^T Y \beta^*, J = [I, 0]_{l \times (l+u)} with I being the l \times l identity matrix (assuming the first l points are labeled), Y = \mathrm{diag}(y_1, y_2, \ldots, y_l), and \beta^* can be obtained by solving:

\beta^* = \arg\max_{\beta \in R^{l}} -\frac{1}{2} \beta^T Q \beta + \sum_{i=1}^{l} \beta_i,
s.t. \sum_{i=1}^{l} \beta_i y_i = 0, \quad 0 \le \beta_i \le \frac{1}{l}, \quad i = 1, \ldots, l,

where Q = Y J K \big(2\gamma_A I + 2\frac{\gamma_I}{(u+l)^2} L K\big)^{-1} J^T Y.
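To make the quantities in this LapSVM dual concrete, the following sketch (ours, not from the paper) assembles Q and recovers \alpha^*; a QP solver producing \beta^* is assumed to exist elsewhere. The (l+u) × (l+u) matrix inversion it contains is exactly the scalability bottleneck revisited in Section 4.

```python
# Illustrative assembly of Q = Y J K M^{-1} J^T Y and alpha* = M^{-1} J^T Y beta*,
# with M = 2*gamma_A*I + 2*gamma_I/(u+l)^2 * L K; beta* must come from an external QP solver.
import numpy as np

def lapsvm_dual_terms(K, L, y_labeled, gamma_A, gamma_I):
    n = K.shape[0]                                   # n = l + u
    l = y_labeled.shape[0]
    u = n - l
    J = np.hstack([np.eye(l), np.zeros((l, u))])     # J = [I, 0], first l points labeled
    Y = np.diag(y_labeled)
    M = 2.0 * gamma_A * np.eye(n) + (2.0 * gamma_I / (u + l) ** 2) * (L @ K)
    M_inv = np.linalg.inv(M)                         # O((l+u)^3) cost: the bottleneck
    Q = Y @ J @ K @ M_inv @ J.T @ Y
    recover_alpha = lambda beta_star: M_inv @ J.T @ Y @ beta_star
    return Q, recover_alpha
```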

3. Proposed classifier: SVM-MR&PLPP

3.1. Framework

In this paper, we mainly consider a binary classification problem. Given a set of labeled samples \{(x_i, y_i)\}_{i=1}^{l} and a set of unlabeled samples \{x_j\}_{j=l+1}^{l+u}, we construct the learning framework with manifold regularization and partially labeling privacy protection as follows:

\min_{f \in \mathcal{H}_K} \frac{1}{l} \sum_{i=1}^{l} V_{labeled}(x_i, y_i, f(x_i)) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2 + \frac{1}{u} \sum_{i=l+1}^{l+u} V_{unlabeled}(x_i, f(x_i)),    (6)

where \sum_{i=l+1}^{l+u} V_{unlabeled}(x_i, f(x_i)) is a loss function on the unlabeled data, which encodes the labeling privacy protection information for the training dataset.

We can see from Eq. (6) that a privacy-protection term on the unlabeled data is added to the traditional manifold regularization framework. Although the traditional framework is a classical method for a training dataset composed of both labeled and unlabeled data, it neglects some useful information about the unlabeled data, e.g., the labeling proportion. As pointed out above, we can propose a new privacy-protection learning framework that incorporates the privacy-protection term into the traditional manifold regularization framework. Although many methods may be used to produce a privacy-protection term for unlabeled data, this paper proposes to construct the privacy-protection term as follows:

\sum_{i=1}^{u} f(x_i) \approx \lambda u - (1 - \lambda) u = (2\lambda - 1) u,    (7)

where \{x_i\}_{i=1}^{u} are the unlabeled data, f(x_i) is the estimated label of x_i, \lambda is the proportion of the positive class, and u is the number of unlabeled samples. Eq. (7) means that the sum of the labels obtained by the proposed classifier must be very close to the label count implied by the given labeling proportion of the positive class. Note here that we assume the label of each sample to be +1 or -1.
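As a quick numerical check of Eq. (7) (our toy example, not from the paper), if the u unlabeled predictions take values in {+1, −1} and a fraction \lambda of them is positive, their sum equals (2\lambda − 1)u:

```python
# Toy check of Eq. (7): with u = 200 unlabeled points and lambda = 0.3 positive,
# the sum of +/-1 labels, lambda*u - (1 - lambda)*u, equals (2*lambda - 1)*u = -80.
u, lam = 200, 0.3
labels = [+1] * int(lam * u) + [-1] * (u - int(lam * u))
assert sum(labels) == (2 * lam - 1) * u
```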

Based on the SVM learning framework, we can obtain the primal problem of the support vector machine with manifold regularization and partially labeling privacy protection as follows:

\min_{f \in \mathcal{H}_k, \xi, \xi^*, \eta, \eta^*} \frac{1}{2} \|f\|_{\mathcal{H}_k}^2 + C_1 \sum_{i=1}^{l} (\xi_i + \xi_i^*) + \frac{\mu}{2} f^T L f + C_2 (\eta + \eta^*),    (8)

s.t. -\epsilon - \xi_i^* \le f(x_i) - y_i \le \epsilon + \xi_i, \quad i = 1, \ldots, l,    (8a)
     -\epsilon^* - \eta^* \le \sum_{i=l+1}^{l+u} f(x_i) - (2\lambda - 1) u \le \epsilon^* + \eta,    (8b)
     \xi_i, \xi_i^*, \eta, \eta^* \ge 0,

where \{x_i, y_i\}, i = 1, \ldots, l, are the labeled data, \{x_i\}, i = l+1, \ldots, l+u, are the unlabeled data, \xi = [\xi_1, \ldots, \xi_l]^T and \xi^* = [\xi_1^*, \ldots, \xi_l^*]^T are the slack vectors of the labeled data, \epsilon and \epsilon^* come from the \epsilon-insensitive loss function [32], and \eta and \eta^* are the slack variables of the privacy-protection term. C_1, C_2 and \mu are the regularization parameters.

In order to further justify the mechanism of SVM-MR&PLPP, we give the following analysis and explanation:

(1) \|f\|_{\mathcal{H}_k}^2 is the structural complexity term controlling the complexity of the classifier; \mathcal{H}_k is the set of functions in the feature space.

(2) C_1 \sum_{i=1}^{l} (\xi_i + \xi_i^*) and C_2 (\eta + \eta^*) are the empirical risk terms for the labeled and unlabeled data, respectively.

(3) \frac{\mu}{2} f^T L f is the manifold regularization term, where L is the Laplacian matrix [9]. This term is controlled by the parameter \mu, whose value should be larger when the data exhibit obvious manifold characteristics. With a small \mu, the proposed classifier works mainly according to the empirical error minimization principle.

(4) Eq. (8a) ensures that the classification accuracy for the labeled samples is as high as possible, and Eq. (8b) ensures that the sum of labels (label count) obtained by the proposed classifier for the unlabeled data is close to the count implied by the given labeling proportion of the positive class.

(5) The geometric meaning of the optimization problem in Eq. (8) is that, in the feature space, the labels of the positive samples should be close to +1 while those of the negative samples should be close to -1, and the manifold regularization term and the privacy-protection term are both used to regulate the decision boundary of the proposed classifier so as to achieve the optimal classification.

3.2. Derivation of SVM-MR&PLPP

Theorem 1. The dual of Eq. (8) is the QP problem shown in Eq. (9):

\min_{\tilde\beta, \tilde\beta^*, \tilde\gamma, \tilde\gamma^*} \frac{1}{2} (\tilde\beta - \tilde\beta^* + \tilde\gamma - \tilde\gamma^*)^T H (\tilde\beta - \tilde\beta^* + \tilde\gamma - \tilde\gamma^*) + \sum_{i=1}^{l+u} \tilde\epsilon_i (\tilde\beta_i + \tilde\beta_i^* + \tilde\gamma_i + \tilde\gamma_i^*) + \sum_{i=1}^{l+u} \tilde y_i (\tilde\beta_i - \tilde\beta_i^* + \tilde\gamma_i - \tilde\gamma_i^*),    (9)

s.t. \sum_{i=1}^{l+u} (\tilde\beta_i - \tilde\beta_i^* + \tilde\gamma_i - \tilde\gamma_i^*) = 0, \quad \tilde\beta_i, \tilde\beta_i^* \in [0, C_1], \quad \tilde\gamma_i, \tilde\gamma_i^* \in [0, C_2],

where H = K (I + \mu L K)^{-1} with K \in R^{(l+u)\times(l+u)} being the kernel matrix over both labeled and unlabeled data, L \in R^{(l+u)\times(l+u)} being the graph Laplacian matrix, J = [I, 0]_{l\times(l+u)} with I being the l \times l identity matrix (assuming the first l points are labeled), \tilde\beta = [\beta_1, \ldots, \beta_l, 0, \ldots, 0]^T and \tilde\beta^* = [\beta_1^*, \ldots, \beta_l^*, 0, \ldots, 0]^T (each padded with u zeros), \tilde\gamma = [0, \ldots, 0, \gamma, \ldots, \gamma]^T and \tilde\gamma^* = [0, \ldots, 0, \gamma^*, \ldots, \gamma^*]^T (l zeros followed by u copies of \gamma or \gamma^*), \tilde\epsilon = [\epsilon, \ldots, \epsilon, \epsilon^*/u, \ldots, \epsilon^*/u]^T (l copies of \epsilon followed by u copies of \epsilon^*/u), and \tilde y = [y_1, \ldots, y_l, (2\lambda-1), \ldots, (2\lambda-1)]^T with \lambda being the labeling proportion of the positive samples of the unlabeled data.

Proof. According to the representer theorem [24], the optimal f^* in Eq. (8) can be written as f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* k(x, x_i). Let us take f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* k(x, x_i) + b with the bias term added. By using the Lagrangian optimization theorem, we can obtain the following Lagrangian function for Eq. (8):

L(\alpha, b, \xi, \xi^*, \eta, \eta^*) = \frac{1}{2} \alpha^T (K + \mu K L K) \alpha + C_1 \sum_{i=1}^{l} (\xi_i + \xi_i^*) + C_2 (\eta + \eta^*)
  - \sum_{i=1}^{l} \beta_i \Big( \epsilon + \xi_i - \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) - b + y_i \Big) - \sum_{i=1}^{l} r_i \xi_i
  - \sum_{i=1}^{l} \beta_i^* \Big( \epsilon + \xi_i^* + \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b - y_i \Big) - \sum_{i=1}^{l} r_i^* \xi_i^*
  - \gamma \Big( (2\lambda-1) u + \epsilon^* + \eta - \sum_{i=l+1}^{l+u} \big( \sum_{j=l+1}^{l+u} \alpha_j K(x_i, x_j) + b \big) \Big) - \tilde r \eta
  - \gamma^* \Big( \epsilon^* + \eta^* + \sum_{i=l+1}^{l+u} \big( \sum_{j=l+1}^{l+u} \alpha_j K(x_i, x_j) + b \big) - (2\lambda-1) u \Big) - \tilde r^* \eta^*.    (10)

According to duality theory, the minimization of the Lagrangian function in Eq. (10) with respect to \alpha, b, \xi, \xi^*, \eta and \eta^* corresponds to the maximization of the dual function. With respect to these variables, the following equations can be considered as necessary conditions for the optimal solution:


\partial L / \partial \xi_i = 0 \Rightarrow C_1 = \beta_i + r_i,    (11a)
\partial L / \partial \xi_i^* = 0 \Rightarrow C_1 = \beta_i^* + r_i^*,    (11b)
\partial L / \partial \eta = 0 \Rightarrow C_2 = \gamma + \tilde r,    (11c)
\partial L / \partial \eta^* = 0 \Rightarrow C_2 = \gamma^* + \tilde r^*,    (11d)
\partial L / \partial b = 0 \Rightarrow \sum_{i=1}^{l} (\beta_i - \beta_i^*) + u (\gamma - \gamma^*) = 0.    (11e)

Substituting Eqs. (11a)–(11e) into Eq. (10), we obtain the reduced Lagrangian in Eq. (12):

L = \frac{1}{2} \alpha^T (K + \mu K L K) \alpha + \alpha^T K J^T (\beta - \beta^*) + \alpha^T K P^T (\gamma - \gamma^*) - \sum_{i=1}^{l} \epsilon (\beta_i + \beta_i^*) - \sum_{i=1}^{l} y_i (\beta_i - \beta_i^*) - \epsilon^* (\gamma + \gamma^*) - (2\lambda - 1) u (\gamma - \gamma^*),    (12)

where J = [I, 0]_{l\times(l+u)} with I being the l \times l identity matrix (assuming the first l points are labeled) and P = [0, \tilde I]_{u\times(l+u)} with \tilde I being the u \times u identity matrix.

With \tilde\beta = [\beta_1, \ldots, \beta_l, 0, \ldots, 0]^T, \tilde\beta^* = [\beta_1^*, \ldots, \beta_l^*, 0, \ldots, 0]^T, \tilde\gamma = [0, \ldots, 0, \gamma, \ldots, \gamma]^T, \tilde\gamma^* = [0, \ldots, 0, \gamma^*, \ldots, \gamma^*]^T, \tilde\epsilon = [\epsilon, \ldots, \epsilon, \epsilon^*/u, \ldots, \epsilon^*/u]^T and \tilde y = [y_1, \ldots, y_l, (2\lambda-1), \ldots, (2\lambda-1)]^T, Eq. (12) is equivalent to the following equation:

L = \frac{1}{2} \alpha^T (K + \mu K L K) \alpha + \alpha^T K (\tilde\beta - \tilde\beta^* + \tilde\gamma - \tilde\gamma^*) - \sum_{i=1}^{l+u} \tilde\epsilon_i (\tilde\beta_i + \tilde\beta_i^* + \tilde\gamma_i + \tilde\gamma_i^*) - \sum_{i=1}^{l+u} \tilde y_i (\tilde\beta_i - \tilde\beta_i^* + \tilde\gamma_i - \tilde\gamma_i^*).    (13)

Taking the derivative of the reduced Lagrangian with respect to \alpha, we have

\frac{\partial L}{\partial \alpha} = 0 \Rightarrow (K + \mu K L K) \alpha = K (\tilde\beta^* - \tilde\beta + \tilde\gamma^* - \tilde\gamma).    (14)

This implies

\alpha = (I + \mu L K)^{-1} (\tilde\beta^* - \tilde\beta + \tilde\gamma^* - \tilde\gamma).    (15)

After substituting Eq. (15) into Eq. (13), Theorem 1 holds. □

Moreover, we can get b by

b = \frac{1}{l} \sum_{i=1}^{l} \Big( y_i - \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) \Big),    (16)

so the final decision function is written as

f^*(x) = \mathrm{sgn}\Big( \sum_{i=1}^{l+u} \alpha_i^* k(x, x_i) + b \Big).    (17)

Table 2. Details of the datasets adopted in the experiments.
Datasets            Classes   Sizes     Attributes
two-moon            2         400       2
two-moon            2         1600      2
two-moon            2         6400      2
COIL2               2         1440      1024
USPST               2         2007      256
PCMAC               2         1946      7511
MNIST3VS8           2         13,966    784
FACEMIT             2         31,022    361
Shuttle             2         43,500    9
Forest cover type   2         581,012   54


Based on the derivation above, we can state the proposed classifier SVM-MR&PLPP as follows.

Learning algorithm for SVM-MR&PLPP
Input: l labeled examples \{(x_i, y_i)\}_{i=1}^{l}; u unlabeled examples \{x_j\}_{j=l+1}^{l+u} with \lambda as the labeling proportion of the positive samples.
Output: Decision function f(x).
Step 1: Construct a KNN adjacency graph with (l + u) nodes and compute the edge weight matrix W [9].
Step 2: Choose a kernel function and compute the Gram matrix k_{ij} = k(x_i, x_j).
Step 3: Compute the graph Laplacian matrix L = D - W, where D is a diagonal matrix given by D_{ii} = \sum_{j=1}^{l+u} W_{ij}.
Step 4: Choose C_1, C_2 and \mu; compute \tilde\beta^*, \tilde\beta, \tilde\gamma^*, \tilde\gamma by solving Eq. (9) with a QP solver.
Step 5: Compute \alpha using Eq. (15).
Step 6: Compute b using Eq. (16).
Step 7: Output f(x) = \sum_{\alpha_i > 0} \alpha_i k(x_i, x) + b.
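Steps 4–7 above can be organized as in the following sketch (our illustration, not the authors' implementation). Here solve_qp is a hypothetical placeholder for any standard QP solver applied to Eq. (9), returning the vectors \tilde\beta, \tilde\beta^*, \tilde\gamma, \tilde\gamma^*; the kernel matrix K, the Laplacian L and the proportion \lambda are assumed to be given.

```python
# Sketch of SVM-MR&PLPP training; solve_qp is a hypothetical QP-solver hook for Eq. (9)
# and is not implemented here.
import numpy as np

def train_svm_mr_plpp(K, L, y_labeled, lam, mu, C1, C2, solve_qp):
    n = K.shape[0]                                       # n = l + u
    l = y_labeled.shape[0]
    u = n - l
    A_inv = np.linalg.inv(np.eye(n) + mu * (L @ K))      # (I + mu*L*K)^{-1}
    H = K @ A_inv                                        # transformed kernel of Eq. (9)
    y_tilde = np.concatenate([y_labeled, (2 * lam - 1) * np.ones(u)])
    beta_t, beta_star_t, gamma_t, gamma_star_t = solve_qp(H, y_tilde, C1, C2)
    alpha = A_inv @ (beta_star_t - beta_t + gamma_star_t - gamma_t)   # Eq. (15)
    b = np.mean(y_labeled - K[:l] @ alpha)               # Eq. (16)
    return alpha, b

def predict(K_test, alpha, b):
    # K_test[i, j] = k(test point i, training point j), j = 1, ..., l + u
    return np.sign(K_test @ alpha + b)                   # Eq. (17)
```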

Let us keep in mind that the calculation of the transformed kernel H involves a matrix inversion of size (l + u) × (l + u), and the recovery of the expansion coefficients \alpha also requires solving a linear system of size (l + u). Both operations require O((l + u)^3) time. Moreover, computing LK in H also requires O((l + u)^3) time because L and K are both matrices of size (l + u) × (l + u). So the proposed classifier may be impractical for large datasets. In the next section, we present a novel formulation to address this issue, which has a much simpler form of the transformed kernel, making it scalable to large datasets.

4. Scalable SVM-MR&PLPP

4.1. Framework

In order to make SVM-MR&PLPP scalable to large scale datasets, we propose the following optimization problem by introducing an intermediate variable vector d = [d_1, \ldots, d_{l+u}]^T into the framework of SVM-MR&PLPP in Eq. (6):

\min_{f \in \mathcal{H}_k, d \in R^{l+u}} \frac{1}{l+u} \sum_{i=1}^{l+u} V_{labeled}(x_i, d_i, f(x_i)) + \gamma_C \sum_{i=1}^{l} (d_i - y_i)^2 + \gamma_A \|f\|_K^2 + \gamma_I \|d\|_I^2 + \frac{1}{u} \sum_{i=l+1}^{l+u} V_{unlabeled}(f(x_i)),    (18)

where d = [d_1, \ldots, d_{l+u}]^T plays a similar role to f(x_i), i = 1, 2, \ldots, l+u; the difference is that d is not used for the final classification.

In order to make Eq. (18) work well, we should enforce this intermediate variable vector d to be close to the labels of the labeled data and also to be smooth with respect to the graph manifold. On the other hand, the ambient regularizer in Eq. (18) is another term which plays a regulating role for the linear/nonlinear prediction function f(x) in the RKHS. As pointed out in [18], the prediction function f(x) may not fit the graph manifold structure well. To relax such a restriction used in traditional manifold regularization, the first term in Eq. (18) is designed to penalize the mismatch between the intermediate variable vector d for the graph manifold and the prediction function f(x) in the RKHS. In terms of [18], the introduction of this regularizer can indeed help enhance the generalization performance of traditional manifold regularization. More interestingly, the multiplication between K and L (i.e., LK) in the formulation of the above manifold regularization can be decoupled in our formulation by regularizing d rather than f.

Table 3. Re-organization of the eight adopted datasets (L and U: labeled and unlabeled training sets; V: validation set; T: test set).
Datasets            L     U        V     T
two-moon            2     148      50    200
two-moon            8     742      50    800
two-moon            64    3086     50    3200
COIL2               20    700      20    700
USPST               30    974      30    973
PCMAC               20    953      20    953
MNIST3VS8           50    6933     50    6933
FACEMIT             50    12,000   50    12,000
Shuttle             50    12,000   50    12,000
Forest cover type   50    12,000   50    12,000


For any fixed d, the optimization problem of SSVM-MR&PLPP in Eq. (18) reduces to:

\min_{f \in \mathcal{H}_k} \frac{1}{l+u} \sum_{i=1}^{l+u} V_{labeled}(x_i, d_i, f(x_i)) + \frac{1}{u} \sum_{i=l+1}^{l+u} V_{unlabeled}(f(x_i)) + \gamma_A \|f\|_K^2.    (19)

According to the representer theorem [24], the solution of the above problem can be represented as:

f(x; d) = \sum_{i=1}^{l+u} \alpha_i(d) k(x_i, x),    (20)

where \alpha_i(d) denotes the coefficient \alpha_i with respect to the current value of d. So, the optimal solution f^* of the optimization problem for SSVM-MR&PLPP is of the form

f^*(x) = \sum_{i=1}^{l+u} \alpha_i k(x_i, x),    (21)

with \alpha_i, i = 1, 2, \ldots, l+u, denoting the expansion coefficients (\alpha \in R^{l+u}).

By instantiating the loss functions in Eq. (19) as the \epsilon-insensitive loss function, and reorganizing each term and its corresponding parameter according to the formulation of SVM-MR&PLPP in Eq. (8), we obtain the following optimization problem:

\min_{f \in \mathcal{H}_k, d, \xi, \xi^*, \eta, \eta^*} \frac{1}{2} \|f\|_{\mathcal{H}_k}^2 + C_1 \sum_{i=1}^{l+u} (\xi_i + \xi_i^*) + C_2 (\eta + \eta^*) + \frac{\delta}{2} \|d - y\|^2 + \frac{\mu}{2} d^T L d,    (22)

s.t. -\epsilon - \xi_i^* \le f(x_i) - d_i \le \epsilon + \xi_i,
     -\epsilon^* - \eta^* \le \sum_{i=l+1}^{l+u} f(x_i) - (2\lambda - 1) u \le \epsilon^* + \eta,

where y = [y_1, \ldots, y_l, 0, \ldots, 0]^T \in R^{l+u} (the last u entries are zero). The manifold regularizer is replaced by \frac{\mu}{2} d^T L d. The parameters C_1, C_2 and \mu control the trade-off between the different terms.

4.2. Derivation of SSVM-MR&PLPP

Theorem 2. The dual of Eq. (22) is a QP problem as shown in Eq. (23).

\min_{\tilde\beta, \tilde\beta^*, \tilde\gamma, \tilde\gamma^*} \frac{1}{2} (\tilde\beta - \tilde\beta^* + \tilde\gamma - \tilde\gamma^*)^T \tilde H (\tilde\beta - \tilde\beta^* + \tilde\gamma - \tilde\gamma^*) + \sum_{i=1}^{l+u} \tilde\epsilon_i (\tilde\beta_i + \tilde\beta_i^* + \tilde\gamma_i + \tilde\gamma_i^*) + \sum_{i=1}^{l+u} z_i (\tilde\beta_i - \tilde\beta_i^* + \tilde\gamma_i - \tilde\gamma_i^*),    (23)

s.t. \sum_{i=1}^{l+u} (\tilde\beta_i - \tilde\beta_i^* + \tilde\gamma_i - \tilde\gamma_i^*) = 0, \quad \tilde\beta_i, \tilde\beta_i^* \in [0, C_1], \quad \tilde\gamma_i, \tilde\gamma_i^* \in [0, C_2],

where \tilde H = K + J (P + \mu L)^{-1} J^T with K \in R^{(l+u)\times(l+u)} being the kernel matrix over both labeled and unlabeled data, L \in R^{(l+u)\times(l+u)} being the graph Laplacian matrix, and J = [I, 0]_{l\times(l+u)} with I being the l \times l identity matrix (assuming the first l points are labeled); z = [\tilde y_1, \ldots, \tilde y_l, \tilde y_{l+1} + (2\lambda-1), \ldots, \tilde y_{l+u} + (2\lambda-1)]^T with \lambda being the labeling proportion of the positive samples of the unlabeled data, \tilde y = (P + \mu L)^{-1} P y, P is a diagonal matrix with P_{ii} = \delta if x_i is labeled and P_{ii} = 0 if x_i is unlabeled, i = 1, 2, \ldots, l+u, and y = [y_1, \ldots, y_l, 0, \ldots, 0]^T.

Proof. According to the representer theorem [24], the optimal f^* in Eq. (22) can be written as f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* k(x, x_i). Let us take f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* k(x, x_i) + b with the bias term added. By using the Lagrangian optimization theorem, we can obtain the following Lagrangian function for Eq. (22):

L(\alpha, b, d, \xi, \xi^*, \eta, \eta^*) = \frac{1}{2} \alpha^T K \alpha + \frac{\delta}{2} \|d - y\|^2 + \frac{\mu}{2} d^T L d + C_1 \sum_{i=1}^{l+u} (\xi_i + \xi_i^*) + C_2 (\eta + \eta^*)
  - \sum_{i=1}^{l+u} \beta_i \Big( \epsilon + \xi_i - \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) - b + d_i \Big) - \sum_{i=1}^{l+u} r_i \xi_i
  - \sum_{i=1}^{l+u} \beta_i^* \Big( \epsilon + \xi_i^* + \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b - d_i \Big) - \sum_{i=1}^{l+u} r_i^* \xi_i^*
  - \gamma \Big( (2\lambda-1) u + \epsilon^* + \eta - \sum_{i=l+1}^{l+u} \big( \sum_{j=l+1}^{l+u} \alpha_j K(x_i, x_j) + b \big) \Big) - \tilde r \eta
  - \gamma^* \Big( \epsilon^* + \eta^* + \sum_{i=l+1}^{l+u} \big( \sum_{j=l+1}^{l+u} \alpha_j K(x_i, x_j) + b \big) - (2\lambda-1) u \Big) - \tilde r^* \eta^*.    (24)


Then, the following equations can be considered as necessary conditions for the optimal solution:

\partial L / \partial \xi_i = 0 \Rightarrow C_1 = \beta_i + r_i,    (25a)
\partial L / \partial \xi_i^* = 0 \Rightarrow C_1 = \beta_i^* + r_i^*,    (25b)
\partial L / \partial \eta = 0 \Rightarrow C_2 = \gamma + \tilde r,    (25c)
\partial L / \partial \eta^* = 0 \Rightarrow C_2 = \gamma^* + \tilde r^*,    (25d)
\partial L / \partial b = 0 \Rightarrow \sum_{i=1}^{l+u} (\beta_i - \beta_i^*) + u (\gamma - \gamma^*) = 0,    (25e)
\partial L / \partial d = 0 \Rightarrow \delta (d - y) + \mu L d - \beta + \beta^* = 0,    (25f)
\partial L / \partial \alpha = 0 \Rightarrow K \alpha + K (\beta - \beta^*) + K (\gamma - \gamma^*) = 0.    (25g)

Substituting Eqs. (25a)–(25g) into Eq. (24) and referring to the derivations in the proof of Theorem 1, we can easily obtain Eq. (23), which is the dual of Eq. (22), where \tilde\beta = [\beta_1, \ldots, \beta_l, 0, \ldots, 0]^T, \tilde\beta^* = [\beta_1^*, \ldots, \beta_l^*, 0, \ldots, 0]^T, \tilde\gamma = [0, \ldots, 0, \gamma, \ldots, \gamma]^T, \tilde\gamma^* = [0, \ldots, 0, \gamma^*, \ldots, \gamma^*]^T, and \tilde\epsilon = [\epsilon, \ldots, \epsilon, \epsilon^*/u, \ldots, \epsilon^*/u]^T. Thus, Theorem 2 holds. □

In particular, in terms of Eq. (25g), we have

\alpha = \tilde\beta^* - \tilde\beta + \tilde\gamma^* - \tilde\gamma.    (26)

Moreover, we can get b by

b = \frac{1}{l} \sum_{i=1}^{l} \Big( y_i - \sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) \Big),    (27)

so the final decision function is written as

f^*(x) = \mathrm{sgn}\Big( \sum_{i=1}^{l+u} \alpha_i^* k(x, x_i) + b \Big).    (28)

Based on the derivation above, we can state the proposed scalable classifier SSVM-MR&PLPP as follows:

Learning algorithm for SSVM-MR&PLPP
Input: l labeled examples \{(x_i, y_i)\}_{i=1}^{l}; u unlabeled examples \{x_j\}_{j=l+1}^{l+u} with \lambda as the labeling proportion of the positive samples.
Output: Decision function f(x).
Step 1: Construct a KNN adjacency graph with (l + u) nodes and compute the edge weight matrix W [9].
Step 2: Choose a kernel function and compute the Gram matrix k_{ij} = k(x_i, x_j).
Step 3: Compute the graph Laplacian matrix L = D - W, where D is a diagonal matrix given by D_{ii} = \sum_{j=1}^{l+u} W_{ij}.
Step 4: Choose C_1, C_2 and \mu; compute \tilde\beta^*, \tilde\beta, \tilde\gamma^*, \tilde\gamma by solving Eq. (23) with a QP solver.
Step 5: Compute \alpha using Eq. (26).
Step 6: Compute b using Eq. (27).
Step 7: Output f(x) = \sum_{\alpha_i > 0} \alpha_i k(x_i, x) + b.

In contrast to SVM-MR&PLPP, we can see that SSVM-MR&PLPP has the following advantages:

(1) It shares the same formulation as the dual of the standard SVM [32]. The only differences between the proposed classifier and the standard SVM are the kernel matrix \tilde H and the label vector z. Thus, SSVM-MR&PLPP can be solved by efficient QP solvers such as Libsvm [3].

(2) The solutions \tilde\beta^*, \tilde\beta, \tilde\gamma^*, \tilde\gamma inherit the sparsity characteristic of SVM. Moreover, since \alpha = \tilde\beta^* - \tilde\beta + \tilde\gamma^* - \tilde\gamma, \alpha is sparse as well.

(3) Let us examine the scalability of SSVM-MR&PLPP by comparing the transformed kernel matrices and the recoveries of the expansion coefficients \alpha of SVM-MR&PLPP and SSVM-MR&PLPP. In terms of the analysis above, both classifiers can use the same solver for their corresponding QP problems, meaning that they have the same time complexity in this respect. Therefore, the discrepancy between the time complexities of SVM-MR&PLPP and SSVM-MR&PLPP depends mainly on the computation of their transformed kernel matrices and the recovery of the expansion coefficients \alpha. As SSVM-MR&PLPP only requires computing \alpha = \tilde\beta^* - \tilde\beta + \tilde\gamma^* - \tilde\gamma while SVM-MR&PLPP requires computing \alpha = (I + \mu L K)^{-1}(\tilde\beta^* - \tilde\beta + \tilde\gamma^* - \tilde\gamma), the recovery of \alpha in SSVM-MR&PLPP is much easier than that in SVM-MR&PLPP. Furthermore, compare the transformed kernel matrix \tilde H = K + J(P + \mu L)^{-1}J^T used in SSVM-MR&PLPP with the transformed kernel matrix H = K(I + \mu L K)^{-1} used in SVM-MR&PLPP: in contrast to H, which requires the multiplication of K and L, \tilde H can be computed more easily because K and L are handled independently and the sparsity of L is preserved in (P + \mu L)^{-1}, which benefits the subsequent inverse operation. In summary, in contrast to SVM-MR&PLPP, SSVM-MR&PLPP has strong scalability in the sense of easy computation of the transformed kernel matrix and easy recovery of the expansion coefficients \alpha; a small illustration is given in the sketch below.
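The following sketch (ours, for illustration only) makes the contrast concrete: SVM-MR&PLPP must form the dense product LK and invert (I + \mu LK) to recover \alpha, whereas SSVM-MR&PLPP only inverts (P + \mu L) and recovers \alpha by a plain vector combination.

```python
# Illustrative contrast between the two classifiers' matrix factors and alpha recoveries.
import numpy as np

def factor_svm_mr_plpp(K, L, mu):
    # Needs the dense product L @ K before inversion (used in Eq. (9) and Eq. (15)).
    n = K.shape[0]
    return np.linalg.inv(np.eye(n) + mu * (L @ K))

def factor_ssvm_mr_plpp(L, P, mu):
    # Only (P + mu*L) is inverted; K is untouched and the sparsity of L is preserved (Eq. (23)).
    return np.linalg.inv(P + mu * L)

def alpha_svm_mr_plpp(K, L, mu, beta_t, beta_star_t, gamma_t, gamma_star_t):
    return factor_svm_mr_plpp(K, L, mu) @ (beta_star_t - beta_t + gamma_star_t - gamma_t)  # Eq. (15)

def alpha_ssvm_mr_plpp(beta_t, beta_star_t, gamma_t, gamma_star_t):
    return beta_star_t - beta_t + gamma_star_t - gamma_t                                   # Eq. (26)
```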

5. Experimental results

In this section, we first describe the datasets used and the experimental setups, and then report the obtained experimental results to demonstrate the performance of the proposed classifiers and the scalability of SSVM-MR&PLPP. We also compare the proposed classifiers with two semi-supervised classifiers, namely, the Laplacian regularized least squares (LapRLS) [2] and the Laplacian support vector machine (LapSVM), and with two supervised classifiers: the support vector machine (SVM) [32] and kernel ridge regression (KRR) [8]. We implement KRR and LapRLS using MATLAB and the other four benchmarking classifiers using Libsvm [3].

5.1. Datasets and experimental setup

We select two types of popular datasets for our experiments. Firstly, we start with a toy problem on the famous two-moon datasets with different sizes. Then, we compare the results of the proposed classifiers with the other benchmarking classifiers on seven two-class real-world datasets [5,11,26,27,30] to demonstrate their performance. In order to test the scalability of the proposed classifier SSVM-MR&PLPP, four large scale real-world datasets, namely, MNIST3VS8, FACEMIT, Shuttle and Forest cover type, were adopted. The remaining real-world datasets COIL2, USPST and PCMAC are of small and middle sizes. The two-moon datasets we use in our experiments contain 200, 800 and 3200 points respectively, with 2, 8 and 64 labeled points for training, and unlabeled points of the same sizes for testing. The COIL2 dataset is a collection of pictures of 2 different objects from Columbia University; each object was placed on a turntable and a 32 × 32 gray scale image was acquired at every 5 degrees of rotation. The USPST dataset is a collection of handwritten digits from the USPS postal system, with the two classes formed by the first 5 digits and the remaining ones; images are acquired at the resolution of 16 × 16 pixels, and USPST refers to the test split of the original dataset. PCMAC is a dataset generated from the famous 20Newsgroups collection, which collects posts on Windows and Macintosh systems. MNIST3VS8 is a binary version of the MNIST dataset, a collection of 28 × 28 gray scale handwritten digit images; the goal is to separate digit 3 from digit 8. The FACEMIT dataset of the Center for Biological and Computational Learning at MIT contains 19 × 19 gray scale, PGM format, images of faces and non-faces. The Shuttle dataset contains 9 attributes which are all numerical. Finally, the Forest cover type dataset is transformed from a multi-class dataset into a binary one by setting the data with label 3 as one class and the remaining data as the other class. The details of the described datasets are summarized in Table 2.

In order to make our experimental results fair, we repeat the 10-fold cross validation strategy three times by randomly generating ten folds on the available data, in which each fold keeps the same proportion of positive and negative samples as in the original dataset. In order to simulate the application scenarios of concern in this study, we re-organize the adopted datasets as in Table 3. Each dataset is divided into its labeled set (denoted as L), its unlabeled set (denoted as U), its validation set (denoted as V), which is used to determine appropriate parameters for the benchmarking classifiers, and its test set (denoted as T), which is used to examine the performance of the benchmarking classifiers. Since the two-moon dataset is generated by ourselves and the other seven adopted datasets are public ones, we can easily know in advance the positive samples' proportion \lambda of the unlabeled training samples in each unlabeled dataset, which is in accordance with its known value given directly by users in practice. In other words, our experiments aim at examining whether the proposed classifiers outperform the other benchmarking classifiers when such a labeling proportion is available for the unlabeled training samples. When generating every labeled set by randomly sampling from the training data, we ensure that at least one sample of each class is included, although the labeled set may be very small. The detailed re-organization of the adopted datasets can be seen in Table 3, in which only a portion of the data points are taken for the FACEMIT, Shuttle and Forest cover type datasets, due to the fact that LapSVM, LapRLS and the proposed classifier SVM-MR&PLPP are computationally prohibitive to run in our experimental environment. Considering the imbalanced nature of the training datasets [31], the geometric mean accuracy g = \sqrt{a^+ \cdot a^-} is adopted to evaluate the performance of the benchmarking classifiers, where

a^+ = (number of positive samples correctly classified / total number of positive samples classified) × 100%

and

a^- = (number of negative samples correctly classified / total number of negative samples classified) × 100%.
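The evaluation measure can be computed as in the following sketch (ours; labels are assumed to be in {+1, −1} and a± are read as per-class accuracies):

```python
# Geometric mean accuracy g = sqrt(a_plus * a_minus) over the two classes.
import numpy as np

def geometric_mean_accuracy(y_true, y_pred):
    pos, neg = (y_true == 1), (y_true == -1)
    a_plus = np.mean(y_pred[pos] == 1)     # accuracy on the positive class
    a_minus = np.mean(y_pred[neg] == -1)   # accuracy on the negative class
    return np.sqrt(a_plus * a_minus)
```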

Without special specification, we use the terminology "the average classification accuracy" to represent the geometric mean accuracy for brevity below.

We select a Gaussian kernel function of the form k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)) for each experiment, and the width parameter 2\sigma^2 is searched over the grid \{s/64, s/32, s/16, s/8, s/4, s/2, s, 2s, 4s, 8s, 16s, 32s, 64s\}, where s is the mean squared norm of the training data. However, for MNIST3VS8, a polynomial kernel of degree 9 is used, as suggested by Decoste [7]. The other parameters are selected by cross-validating them on the corresponding validation sets. The optimal weights of the ambient norms, intrinsic norms and labeling privacy protection term, i.e., C_1, C_2 and \mu of the proposed classifiers, are determined by searching the grid \{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{5}, 10^{6}\} with respect to the corresponding validation errors. As for the trade-off parameter \delta of SSVM-MR&PLPP, we search it from the set \{10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{5}, 10^{6}\}. The optimal weights \gamma_A and \gamma_I of the ambient and intrinsic norms of LapSVM [2] and LapRLS [2] are respectively determined by searching the grid \{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{5}, 10^{6}\}. The parameter C in SVM and the parameter \lambda in the regularization term of KRR are respectively searched from the grid \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}, 10^{5}\}.
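The kernel and its width grid can be set up as in this sketch (ours; the standard squared-distance Gaussian form is assumed):

```python
# Gaussian kernel k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2*sigma^2)) and the grid of
# candidate widths 2*sigma^2 built from s, the mean squared norm of the training data.
import numpy as np

def gaussian_kernel(X1, X2, two_sigma2):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / two_sigma2)

def width_grid(X_train):
    s = np.mean(np.sum(X_train**2, axis=1))
    return [s * f for f in (1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16, 32, 64)]
```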

Fig. 2. The two-moon datasets with different sizes (datasets on the left are for training while datasets on the right are for testing): (a) 400 points; (b) 1600 points; (c) 6400 points.

Fig. 3. Best decision surfaces obtained by different classifiers on two-moon datasets with sizes 150, 750 and 3150 (left, middle, right respectively) and with the same number of positive and negative points: (a) SVM; (b) KRR; (c) LapRLS; (d) LapSVM; (e) SVM-MR&PLPP; (f) SSVM-MR&PLPP.

5.2. Two-moon datasets

The two-moon datasets are often used to demonstrate the performance of manifold learning methods. In our experiments with the two-moon datasets, the training and testing sets were generated as shown in Fig. 2. Fig. 3 shows the best decision surfaces of different classifiers on the two-moon training datasets with different sizes, which respectively contain 150, 750 and 3150 points with 2, 8 and 64 labeled points (see the "red" and "blue" points in Fig. 3). The proposed classifiers are compared with SVM, KRR, LapRLS and LapSVM, and the obtained results are tabulated in Tables 4 and 5.

From Fig. 3 and Tables 4 and 5, we can observe the following:

(1) From Fig. 3 and Table 4, we can see that the proposed classifiers, LapSVM and LapRLS can well maintain the intrinsic nonlinear manifold structures in the two-moon datasets, in contrast to both SVM and KRR, because the intrinsic regularizer of the manifold framework can appropriately adjust the decision surface by considering the unlabeled data according to the geometry of the two classes. Furthermore, when the datasets obviously include overlapping samples, LapSVM and LapRLS cannot find the optimal decision surfaces because the manifold framework focuses mainly on the smoothness of the decision function. Since the proposed classifiers make full use of the privacy-protection information (i.e., the labeling proportion), the obtained classification results are slightly better than those of LapSVM and LapRLS.

(2) From Table 5, we can see that SVM has the fastest training and testing time since its training only involves the labeled points. We can also see that, due to the matrix inversion, the computational burdens of LapSVM, LapRLS and SVM-MR&PLPP are heavier than that of SSVM-MR&PLPP. In particular, since SVM-MR&PLPP considers not only the manifold regularization term but also the privacy-protection term in its learning framework, its training and testing speeds are slightly slower than those of LapSVM. Since the kernel matrix of SSVM-MR&PLPP has a simpler form, which results in the same prediction behavior as the traditional SVM (compare Eqs. (26) and (15)), its testing speed has obvious advantages over the other classifiers and hence it is very effective for large datasets.

Table 4. Comparisons of the average classification accuracy with standard deviation on the two-moon datasets with different sizes.
Datasets                          Classifiers     U (%)          V (%)          T (%)
400 points (2 labeled points)     SVM             78.00(±0.05)   80.00(±0.00)   78.00(±0.05)
                                  KRR             78.00(±0.05)   80.00(±0.00)   78.00(±0.05)
                                  LapRLS          100(±0.00)     100(±0.00)     100(±0.00)
                                  LapSVM          100(±0.00)     100(±0.00)     100(±0.00)
                                  SVM-MR&PLPP     100(±0.00)     100(±0.00)     100(±0.00)
                                  SSVM-MR&PLPP    100(±0.00)     100(±0.00)     100(±0.00)
1600 points (8 labeled points)    SVM             93.85(±0.05)   92.00(±0.05)   92.12(±0.06)
                                  KRR             92.80(±0.08)   92.00(±0.05)   92.45(±0.08)
                                  LapRLS          100(±0.00)     100(±0.00)     100(±0.00)
                                  LapSVM          100(±0.00)     100(±0.00)     100(±0.00)
                                  SVM-MR&PLPP     100(±0.00)     100(±0.00)     100(±0.00)
                                  SSVM-MR&PLPP    100(±0.00)     100(±0.00)     100(±0.00)
6400 points (64 labeled points)   SVM             87.33(±0.05)   85.00(±0.00)   87.12(±0.06)
                                  KRR             87.79(±0.10)   85.00(±0.00)   87.25(±0.06)
                                  LapRLS          92.50(±0.19)   92.00(±0.12)   92.35(±0.18)
                                  LapSVM          92.50(±0.11)   92.40(±0.09)   92.30(±0.12)
                                  SVM-MR&PLPP     95.30(±0.09)   96.70(±0.05)   96.20(±0.07)
                                  SSVM-MR&PLPP    95.40(±0.08)   96.90(±0.05)   96.20(±0.05)
Numbers in boldface indicate the best results.

Table 5. Comparisons of average (and std dev) running time on two-moon datasets with different sizes.
Datasets                          Classifiers     Training time (s)        Testing time (s)
400 points (2 labeled points)     SVM             2.21(±0.21) × 10^-4      4.65(±0.19) × 10^-4
                                  KRR             6.30(±0.35) × 10^-5      4.50(±0.35) × 10^-4
                                  LapRLS          7.49(±0.39) × 10^-2      2.09(±0.15) × 10^-2
                                  LapSVM          2.64(±0.34) × 10^-2      1.09(±0.15) × 10^-2
                                  SVM-MR&PLPP     3.40(±0.44) × 10^-2      1.10(±0.21) × 10^-2
                                  SSVM-MR&PLPP    1.89(±0.31) × 10^-2      1.70(±0.22) × 10^-3
1600 points (8 labeled points)    SVM             2.51(±0.31) × 10^-4      4.27(±0.51) × 10^-4
                                  KRR             1.47(±0.12) × 10^-4      6.16(±0.22) × 10^-3
                                  LapRLS          3.93(±0.21) × 10^-1      5.11(±0.49) × 10^-2
                                  LapSVM          1.88(±0.13) × 10^-1      1.98(±0.33) × 10^-2
                                  SVM-MR&PLPP     1.96(±0.12) × 10^-1      1.99(±0.23) × 10^-2
                                  SSVM-MR&PLPP    1.53(±0.13) × 10^-1      3.20(±0.19) × 10^-3
6400 points (64 labeled points)   SVM             8.18(±0.02) × 10^-3      3.27(±0.2) × 10^-3
                                  KRR             2.65(±0.3) × 10^-3       2.27(±0.3) × 10^-1
                                  LapRLS          3.12(±0.11)              5.81(±0.35) × 10^-1
                                  LapSVM          1.33(±0.05)              1.81(±0.12) × 10^-1
                                  SVM-MR&PLPP     1.43(±0.04)              1.85(±0.24) × 10^-1
                                  SSVM-MR&PLPP    1.22(±0.02)              5.18(±1.51) × 10^-2


Table 6. Comparisons of average (and std dev) classification accuracy on different real-world datasets.
Datasets (λ)              Classifiers     U (%)          V (%)          T (%)
COIL2 (0.46)              SVM             82.21(±2.20)   81.87(±3.09)   81.11(±2.55)
                          KRR             83.98(±2.00)   82.09(±1.77)   82.33(±2.13)
                          LapRLS          90.21(±1.12)   90.98(±1.98)   90.07(±2.01)
                          LapSVM          90.81(±1.20)   90.92(±0.90)   90.18(±2.25)
                          SVM-MR&PLPP     92.10(±1.60)   92.30(±0.22)   91.50(±1.10)
                          SSVM-MR&PLPP    92.50(±2.05)   92.15(±1.19)   91.88(±1.20)
USPST (0.35)              SVM             81.33(±1.23)   82.03(±1.15)   81.99(±3.00)
                          KRR             81.25(±1.91)   81.99(±2.11)   82.33(±2.77)
                          LapRLS          90.11(±2.33)   91.08(±2.18)   89.08(±3.21)
                          LapSVM          90.21(±2.35)   90.92(±3.25)   89.07(±2.52)
                          SVM-MR&PLPP     92.03(±2.27)   92.34(±2.65)   91.98(±3.00)
                          SSVM-MR&PLPP    92.55(±1.50)   92.80(±3.20)   92.19(±2.01)
PCMAC (0.49)              SVM             80.22(±1.23)   80.09(±2.03)   80.56(±3.01)
                          KRR             80.77(±1.90)   82.09(±1.77)   81.23(±1.91)
                          LapRLS          89.09(±2.22)   89.34(±2.87)   89.13(±1.91)
                          LapSVM          89.12(±3.02)   89.51(±1.92)   89.00(±3.21)
                          SVM-MR&PLPP     90.19(±1.20)   90.01(±1.20)   90.34(±1.20)
                          SSVM-MR&PLPP    90.61(±0.77)   91.09(±1.51)   90.78(±2.14)
MNIST3VS8 (0.43)          SVM             90.78(±1.30)   91.23(±1.89)   91.34(±2.01)
                          KRR             91.07(±1.77)   90.87(±2.21)   91.11(±1.99)
                          LapRLS          96.35(±2.22)   96.18(±2.01)   96.21(±3.00)
                          LapSVM          96.28(±2.66)   96.82(±1.37)   96.95(±2.21)
                          SVM-MR&PLPP     97.31(±2.20)   97.11(±1.76)   97.03(±1.98)
                          SSVM-MR&PLPP    97.22(±2.20)   97.63(±2.00)   97.27(±1.45)
FACEMIT (0.12)            SVM             60.67(±1.96)   61.77(±2.33)   59.97(±2.43)
                          KRR             61.23(±1.70)   60.39(±2.21)   60.75(±3.22)
                          LapRLS          68.78(±2.33)   68.56(±2.01)   68.01(±1.91)
                          LapSVM          68.87(±1.20)   68.87(±1.20)   68.87(±1.20)
                          SVM-MR&PLPP     69.98(±1.44)   69.55(±1.46)   70.00(±1.22)
                          SSVM-MR&PLPP    70.01(±1.25)   70.29(±1.11)   70.18(±1.34)
Shuttle (0.65)            SVM             85.75(±2.09)   85.91(±2.11)   85.44(±1.98)
                          KRR             85.90(±1.87)   86.01(±1.99)   87.22(±2.90)
                          LapRLS          91.12(±3.11)   90.98(±2.21)   90.29(±1.98)
                          LapSVM          91.23(±2.12)   90.77(±3.20)   91.07(±3.11)
                          SVM-MR&PLPP     92.35(±2.44)   92.25(±2.76)   92.30(±2.12)
                          SSVM-MR&PLPP    92.78(±2.12)   92.28(±3.02)   92.33(±1.90)
Forest cover type (0.78)  SVM             73.56(±2.11)   76.38(±2.11)   74.53(±2.37)
                          KRR             73.81(±1.89)   74.09(±1.98)   75.01(±3.67)
                          LapRLS          80.33(±1.77)   80.99(±2.85)   79.91(±2.99)
                          LapSVM          80.18(±1.35)   80.65(±1.51)   80.09(±0.98)
                          SVM-MR&PLPP     81.53(±1.12)   81.78(±2.02)   81.33(±1.01)
                          SSVM-MR&PLPP    81.87(±1.01)   82.01(±0.90)   81.55(±0.99)
Numbers in boldface indicate the best results.


5.3. Experiments with a few labeled samples of different real-world datasets

In this section, we examine the efficiency and effectiveness of the proposed classifiers SVM-MR&PLPP and SSVM-MR&PLPP by comparing their performance with SVM, KRR, LapSVM and LapRLS on seven real-world datasets, namely, COIL2, USPST, PCMAC, MNIST3VS8, FACEMIT, Shuttle and Forest cover type.


Table 7. Comparisons of average (and std dev) running time on different real-world datasets.
Datasets (λ)              Classifiers     Training time (s)        Testing time (s)
COIL2 (0.46)              SVM             3.37(±0.31) × 10^-3      1.93(±0.19) × 10^-2
                          KRR             1.00(±0.23) × 10^-3      1.09(±0.56) × 10^-2
                          LapRLS          1.98(±0.53)              0.36(±0.32)
                          LapSVM          0.35(±0.03)              0.09(±0.01)
                          SVM-MR&PLPP     0.36(±0.03)              0.11(±0.01)
                          SSVM-MR&PLPP    0.12(±0.01)              0.09(±0.01)
USPST (0.35)              SVM             2.31(±0.34) × 10^-3      6.26(±0.60) × 10^-3
                          KRR             4.99(±0.27) × 10^-3      2.15(±0.51) × 10^-2
                          LapRLS          37.73(±3.22)             1.15(±0.58)
                          LapSVM          8.90(±0.19)              0.57(±0.05)
                          SVM-MR&PLPP     8.91(±0.23)              0.58(±0.08)
                          SSVM-MR&PLPP    2.32(±0.21)              0.09(±0.01)
PCMAC (0.49)              SVM             3.51(±0.35) × 10^-3      2.55(±0.53) × 10^-2
                          KRR             1.05(±0.21) × 10^-2      7.13(±0.32) × 10^-2
                          LapRLS          65.76(±3.12)             1.11(±0.78)
                          LapSVM          15.03(±0.55)             0.76(±0.02)
                          SVM-MR&PLPP     15.32(±0.61)             0.77(±0.03)
                          SSVM-MR&PLPP    5.20(±0.17)              0.10(±0.02)
MNIST3VS8 (0.43)          SVM             3.64(±0.23) × 10^-2      6.23(±0.94) × 10^-2
                          KRR             4.44(±0.37) × 10^-2      0.23(±0.73)
                          LapRLS          3115.00(±191.90)         431.31(±15.66)
                          LapSVM          809.67(±11.70)           59.71(±1.34)
                          SVM-MR&PLPP     888.56(±17.32)           60.22(±1.11)
                          SSVM-MR&PLPP    409.89(±18.20)           9.32(±0.92)
FACEMIT (0.12)            SVM             5.01(±0.33) × 10^-3      1.02(±0.19)
                          KRR             3.49(±0.34) × 10^-3      1.28(±0.38)
                          LapRLS          4011.22(±213.12)         398.85(±15.33)
                          LapSVM          1656.43(±72.28)          64.71(±2.31)
                          SVM-MR&PLPP     1667.23(±61.53)          65.89(±1.67)
                          SSVM-MR&PLPP    498.00(±12.76)           8.98(±1.87)
Shuttle (0.65)            SVM             4.11(±0.51) × 10^-3      0.22(±0.09)
                          KRR             2.44(±0.02) × 10^-3      1.09(±0.34)
                          LapRLS          3912.31(±209.99)         278.85(±21.22)
                          LapSVM          1598.33(±100.40)         63.23(±1.98)
                          SVM-MR&PLPP     1609.56(±102.00)         64.02(±2.72)
                          SSVM-MR&PLPP    518.09(±19.98)           8.09(±2.34)
Forest cover type (0.78)  SVM             4.55(±0.23) × 10^-3      0.37(±0.09)
                          KRR             3.12(±0.55) × 10^-3      1.11(±0.29)
                          LapRLS          3976.34(±216.72)         271.77(±18.89)
                          LapSVM          1545.15(±112.90)         60.00(±2.69)
                          SVM-MR&PLPP     1561.05(±110.20)         60.98(±2.46)
                          SSVM-MR&PLPP    546.95(±18.20)           8.70(±0.78)


Tables 6 and 7 show the average classification accuracies with their standard deviations and the running times of all the benchmarking classifiers on the different real-world datasets. From these results, we can draw the following conclusions:

(1) From a semi-supervised perspective, it can be seen from Table 6 that, compared with the supervised classifiers SVM and KRR, the manifold regularization term used in the proposed classifiers, LapRLS and LapSVM indeed results in very promising performance for the semi-supervised datasets, even with only a few labeled data as shown in Table 3.

(2) From the perspective of semi-supervised methods with manifold regularization, when the labeling proportion of the unlabeled data is available, we can see from Table 6 that the proposed classifiers indeed outperform LapRLS and LapSVM by considering the labeling proportion information.

(3) From Table 6, it can be seen that SSVM-MR&PLPP outperforms SVM-MR&PLPP even on almost all the small scale datasets.

(4) Although we list the training time of the two supervised classifiers SVM and KRR in Table 7, we should pay attention to the training time of the four semi-supervised classifiers in order to make our comparisons fairer. From Table 7, it can also be seen that, due to the strong scalability of SSVM-MR&PLPP, it has an obvious advantage over LapRLS, LapSVM and SVM-MR&PLPP in training time, especially for large datasets. Also, the testing time of SSVM-MR&PLPP is much shorter than those of SVM-MR&PLPP, LapSVM and LapRLS, which implies that SSVM-MR&PLPP is very suitable for applications where fast prediction is desired.



5.4. Experiments with varying numbers of labeled and unlabeled samples

In the previous experiments, we reported the results of the benchmarking classifiers on the real-world datasets with fixed numbers of labeled and unlabeled samples. In this section, we report how the performance of all the benchmarking classifiers changes when they are run on the adopted datasets with varying numbers of labeled and unlabeled samples; a sketch of this kind of protocol is given below. To save space, we only report the results obtained on the USPST and MNIST3VS8 datasets.
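As a rough illustration (not the authors' experimental code), the following sketch varies the number of labeled samples available to a classifier and averages the test accuracy over several random draws of the labeled subset. Scikit-learn's digits data (3 vs. 8) and a plain SVC are assumed stand-ins for the MNIST3VS8 task and for the benchmarked classifiers, respectively.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Digits 3 vs. 8 as a stand-in for the MNIST3VS8 task.
X, y = load_digits(return_X_y=True)
mask = (y == 3) | (y == 8)
X, y = X[mask], (y[mask] == 8).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

for n_labeled in (5, 10, 15, 20, 25, 30, 35):
    accs = []
    for seed in range(10):  # average over random draws of the labeled subset
        labeled_idx, _ = train_test_split(
            np.arange(len(X_train)), train_size=n_labeled,
            stratify=y_train, random_state=seed)
        clf = SVC(kernel="rbf").fit(X_train[labeled_idx], y_train[labeled_idx])
        accs.append(clf.score(X_test, y_test))
    print(f"{n_labeled:2d} labeled samples: "
          f"accuracy {np.mean(accs):.3f} (+/- {np.std(accs):.3f})")
```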

As can be seen in Fig. 4, as the number of labeled samples in the training set increases, the average classification accuracy of each benchmarking classifier goes up steadily, and the two proposed classifiers SVM-MR&PLPP and SSVM-MR&PLPP always outperform the other benchmarking classifiers. Meanwhile, as the number of unlabeled samples in the training set increases, the average classification accuracy of each of the two supervised methods SVM and KRR remains unchanged, while that of every semi-supervised method is enhanced. These results confirm the conclusion of the previous sections: because the proposed classifiers consider both the intrinsic manifold structure of the unlabeled samples and their class labeling proportion information, they can achieve better performance.

As can be seen in Fig. 5, the training time of the two supervised classifiers SVM and KRR remains unchanged as the number of unlabeled samples in the training datasets increases, whereas the training time of the other four semi-supervised classifiers grows. Meanwhile, as the number of samples in the testing datasets increases, the testing time of all the benchmarking classifiers becomes longer.

Fig. 4. Comparisons of average (and std dev) classification accuracy on USPST and MNIST3VS8 with different numbers of labeled samples and unlabeled samples. (Panel (a): average classification accuracy (%) versus the number of labeled samples; panel (b): average classification accuracy (%) versus the percentage of unlabeled samples to the total unlabeled samples in the training dataset. Curves: SVM, KRR, LapRLS, LapSVM, SVM-MR&PLPP, SSVM-MR&PLPP.)


Fig. 5. Comparisons of average (and std dev) training and testing time on USPST and MNIST3VS8 with different numbers of unlabeled samples. (Panel (a): average training time (s) versus the percentage of unlabeled samples to the total unlabeled samples in the training dataset; panel (b): average testing time (s) versus the percentage of unlabeled samples to the total unlabeled samples in the testing dataset. Curves: SVM, KRR, LapRLS, LapSVM, SVM-MR&PLPP, SSVM-MR&PLPP.)


Moreover, as can be seen from Fig. 5, due to its strong scalability, SSVM-MR&PLPP indeed runs faster than the other semi-supervised classifiers in terms of both average training time and average testing time.

6. Conclusions

Designing classifiers for datasets with class labeling proportion information is an important learning task for practical applications. In view of this, we extend the manifold regularization learning methodology by incorporating an additional label privacy-protection term about the labeling proportion of unlabeled samples, and propose a support vector machine with manifold regularization and partially labeling privacy protection, called SVM-MR&PLPP. Furthermore, since SVM-MR&PLPP is inefficient in training and testing on large-scale datasets, owing to the involved inversion of a dense Gram matrix with its O((l + u)³) time complexity, we reconstruct the objective function of SVM-MR&PLPP and propose its scalable version SSVM-MR&PLPP for large datasets. The proposed classifiers consider not only the underlying geometric structure of the unlabeled samples but also their labeling proportion as the privacy-protection information. Future work includes carrying out a theoretical analysis of SVM-MR&PLPP and SSVM-MR&PLPP and developing a learning framework that considers other types of privacy-protection information in training datasets.
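As a rough, hypothetical illustration of the O((l + u)³) bottleneck noted above (and not of the SSVM-MR&PLPP reformulation itself), the following snippet times a dense n × n matrix inversion, the kind of operation involved in computing the transformed kernel, for growing n; the roughly eightfold increase in runtime per doubling of n reflects the cubic cost.

```python
import time
import numpy as np

# n plays the role of l + u (total number of labeled and unlabeled samples).
for n in (500, 1000, 2000, 4000):
    A = np.random.rand(n, n) + n * np.eye(n)   # dense, well-conditioned matrix
    t0 = time.perf_counter()
    np.linalg.inv(A)                           # O(n^3) dense inversion
    elapsed = time.perf_counter() - t0
    print(f"n = {n:5d}: inversion took {elapsed:.3f} s")
```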



Acknowledgements

This work was supported in part by the Hong Kong Polytechnic University under Grant G-UA68, by the National Natural Science Foundation of China under Grants 61170122 and 61272210, by the Natural Science Foundation of Jiangsu Province under Grants BK2011003 and BK2011417, by the JiangSu 333 expert engineering grant (BRA2011142), and by the 2011, 2012 and 2013 Postgraduate Student's Creative Research Fund of Jiangsu Province.
