STUDYING THE POSSIBILITY OF PEAKING PHENOMENON IN LINEAR SUPPORT VECTOR MACHINES WITH NON-SEPARABLE DATA

Sardar Afra, and Ulisses Braga-Neto∗

Department of Electrical Engineering, Texas A & M University, College Station, Texas
[email protected], and [email protected] (∗ Corresponding Author)
ABSTRACT
It is common to observe a peaking phenomenon in the classification error as the number of features increases. In this paper, we study linear support vector machine classifiers with non-separable data. A simulation based on synthetic data is implemented to study the possibility of observing the peaking phenomenon. However, no peaking in the expected true error is observed. We also present the performance of three different error estimators as a function of feature size and sample size. Based on our study, one might conclude that when using linear support vector machines, the size of the feature set can be increased safely.
1. INTRODUCTION
Typically (but not always), a peaking phenomenon is observed as the feature size increases [1]: the classification error begins to increase past a certain point as the number of features grows. The peaking phenomenon, first studied by Hughes in 1968 [2], depends on the joint feature-label distribution and on the classification rule [3]. In the present paper, we use linear support vector machines as the classification rule and discuss the peaking phenomenon observed in the classification error and in several common error estimators, namely resubstitution, k-fold cross-validation (CV), and bootstrap methods (bootstrap zero and bootstrap 0.632). The paper is organized as follows. In Section 2, linear support vector machines (LSVMs) are explained. Section 3 describes three different error estimation methods. In Section 4, the simulation based on synthetic data (with model parameters estimated from real patient data [4]) is presented. Section 5 discusses the simulation results. Finally, Section 6 provides some concluding remarks.
2. LSVM CLASSIFIERS
In two-group statistical pattern recognition, there is a feature vector $X \in \mathbb{R}^p$ and a label $Y \in \{-1, 1\}$. The pair $(X, Y)$ has a joint probability distribution $F$, which is unknown in practice. Hence, one has to resort to designing classifiers from training data, which consist of a set of $n$ independent points $S_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ drawn from $F$. A classification rule is a mapping $g : \{\mathbb{R}^p \times \{0,1\}\}^n \times \mathbb{R}^p \to \{0,1\}$; it maps the training data $S_n$ into the designed classifier $g(S_n, \cdot) : \mathbb{R}^p \to \{0,1\}$. The linear discriminant classifier is defined by:

$$g(S_n, x) = \begin{cases} 1, & a^T x + a_0 > 0 \\ 0, & \text{otherwise.} \end{cases}$$
Therefore, all training points are correctly classified if $Y_i(a^T X_i + a_0) > 0$, $i = 1, \ldots, n$, where $Y_i \in \{-1, 1\}$ (instead of the usual 0 and 1). The main idea in support vector machines is to adjust the linear discriminant with margins such that the margin is maximal; this is called the Maximal Margin Hyperplane (MMH) algorithm. The points closest to the hyperplane are called the support vectors and determine many of the properties of the classifier. In this paper we consider LSVMs where the data is non-separable.
2.1. Linear Discrimination with a Margin
If we want to have a margin $b > 0$, the constraint becomes $Y_i(a^T X_i + a_0) \ge b$, for $i = 1, \ldots, n$. The optimal classification solution to this problem puts all points at a distance of at least $b/\|a\|$ from the hyperplane. Since $a$, $a_0$ and $b$ can be freely scaled, without loss of generality we can set $b = 1$, so that $Y_i(a^T X_i + a_0) \ge 1$, for $i = 1, \ldots, n$. The margin is $1/\|a\|$, and the points at this exact distance from the hyperplane are called support vectors. The idea here is to maximize the margin $1/\|a\|$. For this, it suffices to minimize $\frac{1}{2}\|a\|^2 = \frac{1}{2}a^T a$ subject to the constraints $Y_i(a^T X_i + a_0) \ge 1$, for $i = 1, \ldots, n$. The solution vector $a^*$ determines the MMH. The corresponding optimal value $a_0^*$ will be determined later from $a^*$ and the constraints.
2.2. Non-Separable Data
If the data is not linearly separable, it is still possible to formulate the problem and find a solution by introducing slack variables $\xi_i$, $i = 1, \ldots, n$, one for each of the constraints, resulting in a new set of $2n$ constraints:

$$Y_i(a^T X_i + a_0) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$

Therefore, if $\xi_i > 0$, the corresponding training point is an "outlier," i.e., it can lie closer to the hyperplane than the margin, or even be misclassified. We introduce a penalty term $C\sum_{i=1}^{n}\xi_i$ in the functional, which then becomes:

$$\frac{1}{2}a^T a + C\sum_{i=1}^{n}\xi_i. \tag{1}$$
The constant $C$ modulates how large the penalty for the presence of outliers is. If $C$ is small, the penalty is small and a solution is more likely to incorporate outliers. If $C$ is large, the penalty is large and therefore a solution is unlikely to incorporate many outliers. The method of Lagrange multipliers allows us to turn this into an unconstrained problem (unconstrained in $a$ and $a_0$). Consider the primal Lagrangian functional:
$$L_P(a, a_0, \xi, \lambda, \rho) = \frac{1}{2}a^T a + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\lambda_i\left(Y_i(a^T X_i + a_0) - 1 + \xi_i\right) - \sum_{i=1}^{n}\rho_i\xi_i,$$
where $\lambda_i$ and $\rho_i$, for $i = 1, \ldots, n$, are the Lagrange multipliers. It can be shown that the solution to the previous constrained problem can be found by simultaneously minimizing $L_P$ with respect to $a$ and $a_0$ and maximizing it with respect to $\lambda$ (hence, we search for a saddle point of $L_P$). Since $L_P$ is unconstrained in $a$ and $a_0$, a necessary condition for optimality is that the derivatives of $L_P$ with respect to $a$ and $a_0$ be zero. This is part of the Karush-Kuhn-Tucker (KKT) conditions for the original problem, which yields the equations:
$$a = \sum_{i=1}^{n}\lambda_i Y_i X_i \quad \text{and} \quad \sum_{i=1}^{n}\lambda_i Y_i = 0. \tag{2}$$
Setting the derivatives of $L_P$ with respect to $\xi_i$ to zero yields

$$C - \lambda_i - \rho_i = 0, \quad i = 1, \ldots, n. \tag{3}$$
Substituting these equations back into $L_P$ leads to the same expression for the dual Lagrangian functional as in the separable case:

$$L_D(\lambda) = \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j Y_i Y_j X_i^T X_j,$$

which must be maximized with respect to $\lambda_i$. The first constraint, $\sum_{i=1}^{n}\lambda_i Y_i = 0$, comes from Equation (2). One also has $0 \le \lambda_i \le C$, for $i = 1, \ldots, n$, which is derived from Equation (3) and the non-negativity of the Lagrange multipliers, $\lambda_i, \rho_i \ge 0$. The outliers are points for which $\xi_i > 0 \Rightarrow \rho_i = 0 \Rightarrow \lambda_i = C$. The points for which $0 < \lambda_i < C$ are called margin vectors. Once the optimal $\lambda^*$ is found, the solution vector $a^*$ is determined by Equation (2):

$$a^* = \sum_{i=1}^{n}\lambda_i^* Y_i X_i = \sum_{i \in S}\lambda_i^* Y_i X_i,$$
where $S = \{i \,|\, \lambda_i > 0\}$ is the support vector index set. The value of $a_0^*$ can be determined from any of the active constraints $a^T X_i + a_0 = Y_i$ with $\xi_i = 0$, that is, the constraints for which $0 < \lambda_i < C$ (the margin vectors), or from their sum:

$$n_m a_0 + (a^*)^T\sum_{i \in S_m} X_i = \sum_{i \in S_m} Y_i.$$

In the previous equation, $S_m = \{i \,|\, 0 < \lambda_i < C\}$ is the margin vector index set, and $n_m = \mathrm{Card}(S_m)$ is the number of margin vectors. From this it follows that:

$$a_0^* = -\frac{1}{n_m}\sum_{i \in S}\sum_{j \in S_m}\lambda_i^* Y_i X_i^T X_j + \frac{1}{n_m}\sum_{i \in S_m} Y_i.$$

The optimal discriminant is thus given by

$$(a^*)^T x + a_0^* = \sum_{i \in S}\lambda_i^* Y_i X_i^T x + a_0^*,$$

with the classifier assigning label 1 when this quantity is positive, in accordance with the decision rule above.
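As an illustration of the soft-margin LSVM described above, the following sketch trains a linear SVM on overlapping (non-separable) Gaussian data and reads off the margin vectors (0 < λ_i < C) and outliers (λ_i = C). This is a minimal sketch assuming scikit-learn (whose SVC class wraps LIBSVM); the toy data and the numerical tolerance are our own illustrative choices, not the paper's setup.

```python
# Sketch: soft-margin linear SVM on toy 2-D non-separable data.
# Assumes scikit-learn (SVC wraps LIBSVM); not the authors' code.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian classes -> non-separable data.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

C = 0.1                                  # slack penalty of Eq. (1)
clf = SVC(kernel="linear", C=C).fit(X, y)

a, a0 = clf.coef_[0], clf.intercept_[0]  # a* and a0* of the discriminant
alpha = np.abs(clf.dual_coef_[0])        # lambda_i* for the support vectors
margin_vectors = np.sum(alpha < C - 1e-8)   # 0 < lambda_i < C
outliers = np.sum(alpha >= C - 1e-8)        # lambda_i = C (xi_i > 0)
print(f"margin 1/||a|| = {1.0 / np.linalg.norm(a):.3f}, "
      f"{margin_vectors} margin vectors, {outliers} outliers")
```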
3. ERROR ESTIMATION
The true error of a designed classifier is its error rate given the training data set:

$$\varepsilon_n[g \,|\, S_n] = P\left(g(S_n, x) \ne y\right) = E_F\left(|y - g(S_n, x)|\right), \tag{4}$$

where the notation $E_F$ indicates that the expectation is taken with respect to $F$; in fact, one can think of $(x, y)$ in the above equation as a random test point (this interpretation being useful in understanding error estimation). The expected error rate over the data is given by

$$\varepsilon_n[g] = E_{F_n}\left(\varepsilon_n[g \,|\, S_n]\right) = E_{F_n} E_F\left(|y - g(S_n, x)|\right),$$

where $F_n$ is the joint distribution of the training data $S_n$. This is sometimes called the unconditional error of the classification rule, for sample size $n$. If the underlying feature-label distribution $F$ were known, the true error could be computed exactly, via (4). In practice, one is limited to using an error estimator.
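When a large independent test sample from $F$ is available, as with the synthetic test set of Section 4, the true error in (4) can be approximated by the misclassification rate on that sample. A minimal helper, assuming scikit-learn and a fitted classifier; the function name is illustrative:

```python
# Hold-out approximation of the true error (4): proportion of
# misclassified points on a large test sample drawn from F.
# Assumes a fitted scikit-learn classifier; names are illustrative.
import numpy as np

def holdout_true_error(clf, X_test, y_test):
    return np.mean(clf.predict(X_test) != y_test)
```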
3.1. Resubstitution
The simplest and fastest way to estimate the error of a
designed classifier in the absence of test data is to compute
its error directly on the sample data itself:

$$\hat\varepsilon_{\mathrm{resub}} = \frac{1}{n}\sum_{i=1}^{n}|Y_i - g(S_n, X_i)|.$$

This resubstitution estimator, attributed to [5], is very fast, but is usually optimistic (i.e., low-biased) as an estimator of $\varepsilon_n[g]$ [6].
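A minimal sketch of the resubstitution estimator, assuming scikit-learn; the helper name and the use of a linear-kernel SVC are our own choices:

```python
# Resubstitution: design the classifier on Sn and count its errors on Sn.
import numpy as np
from sklearn.svm import SVC

def resub_error(X_train, y_train, C=0.1):
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    # Misclassification proportion on the training data itself;
    # usually optimistically (low-) biased for eps_n[g].
    return np.mean(clf.predict(X_train) != y_train)
```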
3.2. k-fold CV
CV removes the optimism of resubstitution by employing test points not used in classifier design [7]. In k-fold CV, the data set $S_n$ is partitioned into $k$ folds $S^{(i)}$, for $i = 1, \ldots, k$ (for simplicity, we assume that $k$ divides $n$). Each fold is left out of the design process and used as a test set, and the estimate is the overall proportion of errors committed on all folds:

$$\hat\varepsilon_{\mathrm{cv}k} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n/k}\left|Y_j^{(i)} - g\!\left(S_n \setminus S^{(i)}, X_j^{(i)}\right)\right|,$$

where $(X_j^{(i)}, Y_j^{(i)})$ is a sample in the $i$-th fold. The process may be repeated: several cross-validation estimates are computed using different partitions of the data into folds, and the results are averaged. CV estimators are often pessimistic, since they use smaller training sets to design the classifier. Their main drawback is their variance [6, 8].
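The repeated k-fold CV estimator used later in Section 4 (5- and 10-fold, 10 repetitions) can be sketched as below. This assumes scikit-learn; the stratified folds and helper signature are our own choices, and the per-fold t-test feature selection of the actual simulation is omitted here for brevity.

```python
# Repeated k-fold cross-validation error estimation (sketch).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_error(X, y, k=5, repeats=10, C=0.1, seed=0):
    errs = []
    for r in range(repeats):
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed + r)
        fold_err = 0
        for train_idx, test_idx in skf.split(X, y):
            clf = SVC(kernel="linear", C=C).fit(X[train_idx], y[train_idx])
            fold_err += np.sum(clf.predict(X[test_idx]) != y[test_idx])
        errs.append(fold_err / len(y))   # proportion of errors over all folds
    return np.mean(errs)                 # average over repeated partitions
```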
3.3. Bootstrap
The bootstrap error estimation technique [9, 10] is based on the notion of an "empirical distribution" $F^*$, which serves as a replacement for the original unknown distribution $F$. The empirical distribution puts mass $\frac{1}{n}$ on each of the $n$ available data points. A "bootstrap sample" $S_n^*$ from $F^*$ consists of $n$ equally-likely draws with replacement from the original data $S_n$. Hence, some of the sample points will appear multiple times, whereas others will not appear at all. The actual proportion of times a data point $(X_i, Y_i)$ appears in $S_n^*$ is $P_i^* = \frac{1}{n}\sum_{j=1}^{n} I_{(X_j^*, Y_j^*) = (X_i, Y_i)}$, where $I_S = 1$ if the statement $S$ is true, and zero otherwise. The basic bootstrap zero estimator [11] is written in terms of the empirical distribution as $\hat\varepsilon_0 = E_{F^*}\left(|Y - g(S_n^*, X)| : (X, Y) \in S_n \setminus S_n^*\right)$. In practice, the expectation $E_{F^*}$ has to be approximated by a Monte-Carlo estimate based on independent replicates $S_n^{*b}$, for $b = 1, \ldots, B$ (with $B$ between 25 and 200 being recommended [11]):

$$\hat\varepsilon_0 = \frac{\sum_{b=1}^{B}\sum_{i=1}^{n}\left|Y_i - g(S_n^{*b}, X_i)\right|\, I_{P_i^{*b} = 0}}{\sum_{b=1}^{B}\sum_{i=1}^{n} I_{P_i^{*b} = 0}}.$$
The bootstrap zero works like cross-validation: the classifier is designed on the bootstrap sample and tested on the original data points that are left out. It tends to be high-biased as an estimator of $\varepsilon_n[g]$, since the number of sample points available for designing the classifier is on average only $(1 - e^{-1})n \approx 0.632n$. The estimator

$$\hat\varepsilon_{b632} = (1 - 0.632)\,\hat\varepsilon_{\mathrm{resub}} + 0.632\,\hat\varepsilon_0$$

corrects this bias by taking a weighted average of the bootstrap zero and resubstitution estimators. It is known as the .632 bootstrap estimator [11], and has been perhaps the most popular bootstrap estimator in data mining [12]. It has low variance, but can be extremely slow to compute. In addition, it can fail when resubstitution is too low-biased [8].
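A sketch of the bootstrap zero and .632 bootstrap estimators as defined above, assuming scikit-learn and numpy; B = 25 matches the setting used in Section 4, while the helper name and classifier choice are our own:

```python
# Bootstrap zero and .632 bootstrap error estimators (sketch).
import numpy as np
from sklearn.svm import SVC

def bootstrap_errors(X, y, B=25, C=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    err_sum, count = 0.0, 0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)         # bootstrap sample S*_n
        out = np.setdiff1d(np.arange(n), idx)    # left-out points (P*_i = 0)
        if out.size == 0:                        # skip degenerate replicates
            continue
        clf = SVC(kernel="linear", C=C).fit(X[idx], y[idx])
        err_sum += np.sum(clf.predict(X[out]) != y[out])
        count += out.size
    eps0 = err_sum / count                       # bootstrap zero estimate
    resub = np.mean(SVC(kernel="linear", C=C).fit(X, y).predict(X) != y)
    return eps0, (1 - 0.632) * resub + 0.632 * eps0   # zero and .632 estimates
```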
4. SIMULATION
To generate synthetic data, we assume a multivariate Gaussian distribution with parameters directly estimated from a real data set [4]. First, we choose the 500 features that have the largest absolute mean across all samples. Then, we calculate the mean vector and covariance matrix of the chosen features in each class separately and use them to generate the synthetic data from a multivariate Gaussian distribution. We generate a large set of 2000 sample points per class, only once, which is used as a test set for calculating the expected true error. In addition to the test set, a training set of variable size (one parameter of the simulation) of 20, 40, ..., 200 points per class is generated in each Monte Carlo iteration.
Figure 1. True error as a function of sample size and feature size.
In each Monte Carlo iteration, for a sample of fixed size, we employ the t-test to select the best feature set of variable size (another parameter of the simulation) of 20, 40, ..., 200 features. We design an LSVM classifier with C = 0.1 on the training set and employ the following error estimators: resubstitution, 5-fold CV with 10 repetitions, 10-fold CV with 10 repetitions, bootstrap zero and bootstrap 0.632 with 25 repetitions. Note that in the CV and bootstrap error estimators, t-test feature selection is redone for every fold and bootstrap sample before the surrogate classifiers are designed. We also calculate the true error of the classifier, based on the large test data set and the features selected by the t-test. The Monte Carlo simulation is run for 1000 iterations. For the linear SVM classification rule, we use the LIBSVM package [13] with the default settings and adjust the slack parameter to 0.1.
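For concreteness, one Monte Carlo iteration of this simulation can be sketched as follows. This assumes scikit-learn and scipy; mu0, cov0, mu1, cov1 stand for the class means and covariances estimated from the real data [4], and the test set is regenerated inside the function for simplicity, whereas the paper draws it only once.

```python
# Sketch of one Monte-Carlo iteration of the Section 4 simulation:
# Gaussian synthetic data, t-test feature selection, linear SVM (C = 0.1).
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import SVC

def run_iteration(mu0, cov0, mu1, cov1, n_per_class, d, n_test=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Training and large test sets from the two class-conditional Gaussians.
    Xtr = np.vstack([rng.multivariate_normal(mu0, cov0, n_per_class),
                     rng.multivariate_normal(mu1, cov1, n_per_class)])
    ytr = np.hstack([-np.ones(n_per_class), np.ones(n_per_class)])
    Xte = np.vstack([rng.multivariate_normal(mu0, cov0, n_test),
                     rng.multivariate_normal(mu1, cov1, n_test)])
    yte = np.hstack([-np.ones(n_test), np.ones(n_test)])

    # t-test feature selection: keep the d features with smallest p-values.
    _, pvals = ttest_ind(Xtr[ytr == -1], Xtr[ytr == 1], axis=0)
    feats = np.argsort(pvals)[:d]

    clf = SVC(kernel="linear", C=0.1).fit(Xtr[:, feats], ytr)
    return np.mean(clf.predict(Xte[:, feats]) != yte)   # true-error estimate
```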
5. RESULTS
The expected true error is shown in Fig. 1. Strikingly, the LSVM shows no peaking even up to a feature size of 200. The expected true error decreases as both the sample size and the feature size increase. Therefore, one can safely use a large feature size even when the sample size is small. Figure 2 shows the biases of the error estimators in the same graph for two different fixed sample sizes, as a function of feature size. The CV error estimators have biases close to zero. As is well known, k-fold CV is a pessimistic estimator of the classification error, and its bias is expected to go to zero as the number of samples increases. For bootstrap zero, we can see that the bias is positive, which means that it is a pessimistic error estimator, as expected. Bootstrap 0.632 is better than bootstrap zero in terms of bias. In addition, as the sample size increases, both biases go to zero. In Fig. 3, the variances of the error estimator deviations are shown in the same plot for two fixed sample sizes as the feature size changes.

As observed in Fig. 3, CV has a high-variance deviation compared to the other error estimators. Bootstrap 0.632 has a small variance, as expected, and resubstitution tends to be low-variance, having the smallest variance among all the error estimators.
Figure 2. Bias of the error estimators for the sample size of 20 (solid) and 200 (dashed) per class, for variable feature size of 20, 40, ..., 200.
Figure 3. Variance of the error estimator deviations for the sample size of 20 (solid) and 200 (dashed) per class, for variable feature size of 20, 40, ..., 200.
The RMS values of all the error estimators are depicted in Fig. 4 for the two sample sizes. As we can see, the RMS of bootstrap 0.632 surpasses all the others, as hoped. Resubstitution has a large RMS, although for the larger sample size it tends to be very close to zero.
6. CONCLUSION
In this paper, we discussed the peaking phenomenon in the expected true error as a function of sample and feature size for LSVMs as the classification rule, where the classes are not linearly separable. In addition, the performances of three error estimators were studied. Surprisingly, we observed no peaking in the expected true error of LSVM classifiers as the number of features increases. This might have two explanations. The first might be the complexity of LSVMs, in which overfitting is controlled by the parameter C. The second might be the structure of the feature-label distribution. We believe that using more features when an LSVM classifier is implemented can be safe with respect to the peaking phenomenon.
7. REFERENCES
[1] J. Hua, Z. Xiong, J. Lowey, E. Suh, and E.R. Dougherty, "Optimal number of features as a function of sample size for various classification rules," Bioinf., vol. 21, no. 8, pp. 1509–1515, 2005.

[2] G. Hughes, "On the mean accuracy of statistical pattern recognizers," Information Theory, IEEE Transactions on, vol. 14, no. 1, pp. 55–63, 1968.
Figure 4. RMS of the error estimators for the sample size of 20 (solid) and 200 (dashed) per class, for variable feature size of 20, 40, ..., 200.
[3] C. Sima and E.R. Dougherty, "The peaking phenomenon in the presence of feature-selection," Pattern Recogn. Lett., vol. 29, pp. 1667–1674, 2008.

[4] X. Lin, B. Afsari, L. Marchionni, L. Cope, G. Parmigiani, D. Naiman, and D. Geman, "The ordering of expression among a few genes can provide simple cancer biomarkers and signal BRCA1 mutations," BMC Bioinf., vol. 10, pp. 256, 2009.
[5] C.A.B. Smith, “Some examples of discrimination,”
Ann. of Human Gen., vol. 13, no. 1, pp. 272–282,
1946.
[6] L. Devroye, L. Gyorfi, and G. Lugosi, A probabilistic theory of pattern recognition, Applications of mathematics. Springer, 1996.

[7] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern classification, Wiley, 2001.

[8] U.M. Braga-Neto and E.R. Dougherty, "Is cross-validation valid for small-sample microarray classification?," Bioinf., vol. 20, pp. 374–380, 2004.
[9] B. Efron, "Bootstrap methods: Another look at the jackknife," The Ann. of Statistics, vol. 7, no. 1, pp. 1–26, 1979.

[10] B. Efron, The jackknife, the bootstrap and other resampling plans, SIAM Monograph 38, NSF-CBMS, 1982.

[11] B. Efron, "Estimating the error rate of a prediction rule: Improvement on cross-validation," J. of the American Stat. Assoc., vol. 78, no. 382, pp. 316–331, 1983.

[12] I. Witten and E. Frank, Data Mining, Academic Press, San Diego, CA, 2000.

[13] C.-C. Chang and C.-J. Lin, LIBSVM: Introduction and benchmarks, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2000.