
Feature Ranking Utilizing Support Vector Machines' SVs

Nahla Barakat
Faculty of Informatics and Computer Science
British University in Egypt (BUE), Cairo, Egypt

Third International Conference on Innovative Computing Technology (INTECH), London, United Kingdom, 29-31 August 2013

Abstract—The classification performance of different algorithms can often be improved by excluding irrelevant input features. This has been the main motivation behind the significant number of studies proposing different families of feature selection techniques. The objective is to find a subset of features that describes the input space at least as well as the original set of features. In this paper, we propose a hybrid method for feature ranking for support vector machines (SVMs), utilizing the SVMs' support vectors (SVs). The method first finds the subset of features that contribute least to interclass separation. These features are then re-ranked using a correlation-based feature selection algorithm as a final step. Results on four benchmark medical data sets show that the proposed method, though simple, can be a promising feature reduction method for SVMs, as well as for other families of classifiers.

Index Terms—feature ranking, support vector machines.

I. INTRODUCTION

The performance of classification algorithms can be degraded by the presence of irrelevant input features [1]. This has been the main motivation behind the increasing interest in feature selection/ranking studies over the last two decades. The main objective of these studies is to obtain a subset of features which can describe the problem domain at least as well as the original, full set of features [2].

Feature selection algorithms can be classified into three broad categories: filter, wrapper and embedded methods [2-4]. Hybrid approaches, however, have recently been proposed, which combine concepts from filter and wrapper techniques [5].

Filter methods exclude irrelevant features in a preprocessing step, before applying the induction algorithm. The individual input features are ordered according to a predefined measure (e.g., principal/independent component analysis, correlation criteria, Fisher's discriminant scores) [1]. However, filter methods may not be the best choice in the case of nonlinear relationships among input features [6].

In wrapper-based approaches, the prediction performance of a given classifier is used to assess the importance of a subset of features [3]. In particular, different candidate subsets are evaluated according to the classifier's performance, and the subset of features which produces the lowest classification error is considered the most relevant. This is achieved using a search procedure over the space of all possible subsets of features, e.g., greedy, forward or backward methods [6]. Forward search methods start with an empty set of features and progressively add features until the best classification performance is reached, as sketched below. Backward selection starts with the full set of features and then eliminates the least relevant features. However, wrapper methods are computationally expensive [6].
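As an illustration of the wrapper idea, the following is a minimal sketch of greedy forward selection, not taken from the paper; the estimator, data and scoring (scikit-learn's cross_val_score with a linear SVM) are placeholder assumptions:

```python
# Minimal sketch of greedy forward feature selection (illustrative only).
# Assumes scikit-learn; the estimator, data and CV setup are placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_selection(X, y, estimator=None, cv=10):
    """Greedily add the feature that most improves cross-validated accuracy."""
    estimator = estimator or SVC(kernel="linear")
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score each candidate feature when added to the current subset.
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y,
                                     cv=cv).mean() for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:   # stop when no feature helps
            break
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts the score least.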

Embedded methods are usually performed as part of the induction algorithm. An example of such methods is the SVM-RFE algorithm, an SVM weight-based method [7].

Several application domains have benefited from feature selection, e.g., medical data mining, biomedical informatics, gene selection in microarray data, and text mining, to mention a few.

In this paper, we propose a simple method for feature ranking, utilizing support vector machines' (SVMs) support vectors (SVs) and building on the concept of interclass separability. SVMs were chosen due to their superior classification performance, demonstrated over a wide range of problem domains. The main idea of the method is to find and rank the features that least discriminate the positive from the negative class (in the case of binary classification tasks), in two steps: the first step finds each individual feature for which the difference between its mean in the positive SVs and its mean in the negative SVs is not statistically significant. The second step applies correlation-based feature selection to only the least relevant features obtained in the first step.

The proposed method has been evaluated in terms of accuracy, recall, precision and area under the precision-recall curve (PRC). The obtained results show improved performance for SVMs as well as other families of classifiers when trained with the most relevant features, compared to their performance with the full set of features.

The paper is organized as follows: Section II briefly highlights feature selection methods for SVMs, while Section III provides a brief background on SVMs and PRC. Section IV introduces the proposed method, followed by the experimental methodology, then results and discussion in Sections V and VI. The paper is concluded in Section VII.

II. RELATED WORK

In the context of SVMs, there have been five main themes of research in the area of feature selection: formulation of an optimization problem [8], sparse SVMs (VS-SSVM) [9], principal component analysis (PCA) [10], sensitivity analysis [11], embedded methods like the 1-norm SVM, and maximizing kernel class separability [12]. This is in addition to the filter- and wrapper-based methods.

An example of a filter approach applied to SVMs is the recursive feature elimination (SVM-RFE) method [7]. This method uses the weights given by the decision function computed by a linear SVM as a feature ranking criterion, together with the recursive feature elimination (RFE) algorithm, which recursively eliminates one feature at a time. Examples of wrapper-based approaches include the method described in [13], which finds the set of features minimizing generalization bounds on the leave-one-out error (or the smallest expected generalization error) in a classification task. Chen et al. [14] also used the accuracy of the SVM classifier as an indication of having a good set of features, and for defining a threshold value for each of these features. Wang [12] suggested a different method, which defines a class separability criterion in a high-dimensional kernel space and performs feature selection by maximizing this criterion.

In the context of wrapper-based methods, Fröhlich and Zell [15] suggested a feature selection approach which aims at decreasing the regularized risk of the SVM. The approach builds on the recursive feature elimination (RFE) algorithm as a feature ranking method.

In a subsequent study, sensitivity analysis of SVMs was used, measuring the deviation of the separation margin with respect to the elimination of some input features [11]. In a later study, a hybrid method [5] was proposed which combines both filter and wrapper methods. It evaluates features using an independent measure (via computationally efficient filters) to find the best candidate subset. The candidate feature set is then further refined by more accurate wrappers to find the final best subset.

More general comparisons between different methods for feature selection can be found in [16] and [17].

III. BACKGROUND

A. Support Vector Machines

SVMs are based on the principle of structural risk minimization, which aims to minimize the true error rate. SVMs operate by finding a linear hyperplane that separates positive from negative examples with maximum interclass distance, or margin. In the case of non-separable data, a soft-margin hyperplane is defined, allowing classification errors via slack variables ξ_i. Hence, the optimization problem is formulated as follows [18]:

!"

l

iiCw

1

2

2

1 minimize #

$$$$%&'$

0 ,1)( Subject to i ()(! ##iii bwxy$

where C is a regularization parameter which defines the trade-off between the training error and the margin [18]. In the case of non-linearly separable data, SVMs map the input data so that it becomes linearly separable in the feature space, using kernel functions. Including kernel functions and the Lagrange multipliers α_i, the dual optimization problem becomes [18]:

\[
\max_{\alpha}\;\; W(\alpha)=\sum_{i=1}^{l}\alpha_{i}-\tfrac{1}{2}\sum_{i,j=1}^{l}\alpha_{i}\alpha_{j}\,y_{i}y_{j}\,K(x_{i},x_{j}) \tag{2}
\]
\[
\text{subject to}\quad \sum_{i=1}^{l}\alpha_{i}y_{i}=0,\qquad 0\le\alpha_{i}\le C
\]

In the case of unequal misclassification costs, a cost factor J = C+/C- is introduced, by which training errors on positive examples outweigh errors on negative examples [19]. The optimization problem therefore becomes:

\[
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\|w\|^{2} + C_{+}\sum_{i:\,y_{i}=+1}\xi_{i} + C_{-}\sum_{j:\,y_{j}=-1}\xi_{j} \tag{3}
\]
\[
\text{subject to}\quad y_{k}(w\cdot x_{k}+b) \ge 1-\xi_{k},\qquad \xi_{k}\ge 0
\]
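For concreteness, here is a minimal sketch, not from the paper, of training a cost-sensitive soft-margin kernel SVM and inspecting its support vectors; scikit-learn's SVC is used as a stand-in for SVMlight (the paper's implementation), with class_weight playing the role of the cost factor J, and the data is synthetic:

```python
# Minimal sketch: cost-sensitive soft-margin SVM and its support vectors.
# scikit-learn stands in for SVMlight; the data here is synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

J = 2.0  # cost factor C+/C-: errors on positives weigh J times more
svm = SVC(kernel="rbf", C=1.0, class_weight={1: J}).fit(X, y)

# The separating hyperplane is entirely defined by the support vectors.
svs = svm.support_vectors_        # the SV patterns themselves
sv_labels = y[svm.support_]       # class label of each SV
print(f"{len(svs)} SVs: {np.sum(sv_labels == 1)} positive, "
      f"{np.sum(sv_labels == 0)} negative")
```

It is these SV patterns, split by class, that the proposed method examines in Section IV.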

B. The Precision/recall Curve

Precision-recall curves (PRC) have become a common method for assessing classification performance, as an alternative to receiver operating characteristic (ROC) curves. Precision-recall curves can better compare the performance of two algorithms, as they can show differences in PR space which may not be apparent in ROC space [20]. In a PR curve, recall (TPR; please refer to Table I and the formulas below) is plotted against precision (the fraction of true positives among all positive predictions), which is also known as the positive predictive value (PPV).

TABLE I. CONFUSION MATRIX

                        Actual Positive (P)    Actual Negative (N)
Predicted Positive      True positive (TP)     False positive (FP)
Predicted Negative      False negative (FN)    True negative (TN)

True positive rate (recall):            TPR = TP / P
False positive rate:                    FPR = FP / N
Positive predictive value (precision):  PPV = TP / (TP + FP)
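To make these quantities concrete, here is a small sketch, not from the paper, computing TPR, FPR and PPV from confusion-matrix counts, and integrating a PR curve with the trapezoidal rule after connecting its endpoints to (recall, precision) = (0, 1) and (1, 0), as the experiments in Section V do:

```python
import numpy as np

def rates(tp, fp, fn, tn):
    """TPR (recall), FPR and PPV (precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)     # recall = TP / P
    fpr = fp / (fp + tn)     # FPR = FP / N
    ppv = tp / (tp + fp)     # precision = TP / (TP + FP)
    return tpr, fpr, ppv

def prc_area(recalls, precisions):
    """Trapezoidal area under a PR curve, with endpoints connected to
    (recall=0, precision=1) and (recall=1, precision=0)."""
    r = np.concatenate(([0.0], np.asarray(recalls, float), [1.0]))
    p = np.concatenate(([1.0], np.asarray(precisions, float), [0.0]))
    order = np.argsort(r)    # integrate along increasing recall
    r, p = r[order], p[order]
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2.0))
```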

IV. PROPOSED METHOD

In this paper, we propose a hybrid, two-step method for feature ranking for SVMs. In the first step, the method measures each feature's contribution to the interclass separability between the SVM's positive-class and negative-class SVs (in the case of binary classification tasks). This is done by computing a Z-score for each feature. As shown in the detailed description below, the value of Z is computed from the standard normal distribution and tests whether a feature's two means, in this case its means over the positive and negative SVs, can be considered equal (presumably via the standard two-sample statistic given below). All features with a Z smaller than a specific threshold therefore constitute the set of least discriminative features. The second step applies correlation-based feature selection [21] to only the least relevant features obtained in the first step, yielding a final ranking of the least relevant features for the classification problem.
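The paper does not spell the statistic out; presumably it is the standard two-sample Z statistic for the difference of two means, which under the setup above would read

\[
Z \;=\; \frac{\bigl|\,\bar{x}^{+}-\bar{x}^{-}\bigr|}
{\sqrt{\dfrac{s_{+}^{2}}{n_{+}}+\dfrac{s_{-}^{2}}{n_{-}}}}
\]

where \(\bar{x}^{+}\), \(s_{+}^{2}\) and \(n_{+}\) are a feature's mean, variance and count over the positive SVs, and likewise for the negative SVs.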

The detailed description of the steps is given below:

1. Leave-one-out cross validation was used to select the training parameters (kernel type and the regularization parameter C) that minimized the error rate over the training set;

2. From the trained SVM, select the patterns that became SVs (from the SVM model);

3. Use the SVM model to re-classify the SVs, and keep only the correctly classified positive and negative SVs;

4. For each feature in the SVs obtained in step 3, compute the variance (VAR) and exclude all features with a variance of zero (a zero-variance feature has the same value in both negative and positive SVs, and thus makes no, or only a minor, contribution to class separability);

5. Split the SVs selected in step 3 into two disjoint groups: one for the positive and the other for the negative SVs;

6. For each feature, compute its mean in each group of SVs (obtained in step 5);

7. Compute the Z-score (Z) for the difference between the two means computed in step 6, and define the set of least discriminative features as those with Z < 1.64 (1.64 corresponds to a one-tailed significance level of 0.05, i.e., 95% confidence, in statistical hypothesis testing);

8. Rank the least relevant features based on the values of Z obtained in step 7;

9. Re-rank the features selected in step 8 using correlation-based feature selection [21] (only the set of least relevant features is correlated to the target class of each training example), so that a final ranking of the least relevant features is obtained. (This last step is performed to compensate for the effect of possibly un-optimized training parameters in step 1, and to verify the validity of the ranking based on the Z-score.) A sketch of steps 2-9 is given below.
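The following is a minimal Python sketch of steps 2-9 under the assumptions stated above (scikit-learn as the SVM implementation, the two-sample Z statistic); the correlation-based re-ranking of step 9 is approximated here by the absolute Pearson correlation with the class label rather than Hall's full CFS algorithm [21]:

```python
import numpy as np
from sklearn.svm import SVC

Z_THRESHOLD = 1.64  # one-tailed 0.05 significance level

def rank_least_relevant(X, y, svm):
    """Steps 2-8: rank features by how little they separate the classes,
    judged on the correctly classified support vectors."""
    sv_X, sv_y = X[svm.support_], y[svm.support_]      # step 2
    correct = svm.predict(sv_X) == sv_y                # step 3
    sv_X, sv_y = sv_X[correct], sv_y[correct]
    pos, neg = sv_X[sv_y == 1], sv_X[sv_y != 1]        # step 5

    z_scores = {}
    for f in range(sv_X.shape[1]):
        if np.var(sv_X[:, f]) == 0:                    # step 4
            continue
        # Steps 6-7: two-sample Z statistic for the difference of means.
        se = np.sqrt(pos[:, f].var(ddof=1) / len(pos) +
                     neg[:, f].var(ddof=1) / len(neg))
        if se == 0:                                    # perfectly separated
            z_scores[f] = float("inf")
            continue
        z_scores[f] = abs(pos[:, f].mean() - neg[:, f].mean()) / se
    # Step 8: least discriminative features, most irrelevant first.
    return sorted((f for f, z in z_scores.items() if z < Z_THRESHOLD),
                  key=z_scores.get)

def rerank_by_correlation(X, y, features):
    """Step 9 (approximation): re-rank by absolute correlation with the class."""
    corr = {f: abs(np.corrcoef(X[:, f], y)[0, 1]) for f in features}
    return sorted(features, key=corr.get)
```

Given an SVM already tuned as in step 1 (e.g., svm = SVC(kernel="rbf", C=1.0).fit(X, y)), rank_least_relevant returns the candidate features and rerank_by_correlation produces the final ordering.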

V. EXPERIMENTAL METHODOLOGY

The proposed feature ranking method was validated by comparing the performance of SVMs and other families of classifiers trained with the full set of original features against their performance when the least relevant features are excluded. Based on the obtained ranking of least relevant features, we tried removing 10% and then 20% of the original features.

Three different sets of experiments were conducted for each of the three sets of input features: SVM leave-one-out cross validation on the training sets, performance on a validation set, and performance of other classifiers, from different families of algorithms, using 10-fold cross validation on the training set. Furthermore, PR curves for the SVM training performance with the full and reduced sets of features were plotted, to gain better insight into the effect of excluding the least relevant features on SVM performance. It should be noted that the curves had to be manually connected to the points (0, 1) and (1, 0) in (recall, precision) space, and the area under the PRC was computed using trapezoidal integration, as in the sketch in Section III. Four benchmark medical data sets from [22] were used; details are given below and in Table II:

Pima Indians diabetes: A sample of 438 patterns was used from the original data set, after removing all patterns with a zero value for the attributes 2-hour OGTT plasma glucose, diastolic blood pressure and triceps skin fold thickness, since a zero value for these attributes is not clinically meaningful;

Heart diseases: The reduced Cleveland heart diseases data set was used. All patterns with missing values were discarded;

Breast cancer: The Wisconsin breast cancer data set was used. All repeated patterns were discarded to avoid the bias resulting from the boosting effect of those patterns;

Thyroid: The experiments were conducted as a two-class problem, normal against the rest. All patterns with missing values were discarded.

TABLE II. DATA SETS

Data set        Features   Training set   Validation set
Pima Indians        8           247            191
Breast cancer       9           208            147
Heart Disease      13           223             74
Thyroid            21          3772           3428

SVMlight [23] was used in all experiments as follows: a number of SVM models were generated by varying the misclassification cost factor J, starting with a small value and increasing J until no change in TPR or FPR was observed. Precision/recall curves were then plotted from the obtained SVM training results, as sketched below.
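A minimal sketch of this sweep is shown below; scikit-learn's class_weight again stands in for SVMlight's cost factor, and the grid of J values is an assumption, not the one used in the experiments:

```python
# Sketch of the misclassification-cost sweep (scikit-learn stand-in).
import numpy as np
from sklearn.svm import SVC

def sweep_cost_factor(X, y, js=np.arange(0.5, 10.5, 0.5)):
    """Train one SVM per cost factor J; collect (recall, precision) points."""
    points = []
    for j in js:
        svm = SVC(kernel="rbf", C=1.0, class_weight={1: j}).fit(X, y)
        pred = svm.predict(X)                # training-set performance
        tp = np.sum((pred == 1) & (y == 1))
        fp = np.sum((pred == 1) & (y != 1))
        fn = np.sum((pred != 1) & (y == 1))
        recall = tp / (tp + fn)
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        points.append((recall, precision))
    return points  # feed into prc_area() from Section III
```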

To show the effect of excluding the least relevant features (as obtained from the SVM SVs) on other families of classifiers, we compared the performance of direct rule learners and decision trees trained with the full set of features against their performance when trained without the least relevant features. In particular, Ripper and C4.5 were used, via their Weka implementations [24].

It should be noted that the same training parameters for the SVM and the other learning algorithms were used before and after excluding the least relevant features.

VI. RESULTS AND DISCUSSION

The results of the different experiments are summarized in the following sections:

A. Performance of the SVM

In this section, the performance of SVMs trained with the full and reduced sets of features is compared, as follows:

1) Training results at different misclassification costs (J)

Table III shows the average training performance of the SVMs (accuracy, precision and recall) computed at different values of J. From the table, it can be seen that improved performance in terms of accuracy, precision and recall was obtained by excluding the least relevant (ranked) features (10% of the features) for the heart diseases and Pima Indians data sets, while better results were obtained by excluding 20% of the features in the case of the breast cancer data set (improved results are shown in bold).

For the Thyroid data set, the original full set of features gave better results. However, only the differences in accuracy for the Pima Indians and breast cancer data sets are statistically significant (p < 0.05). This can be attributed to the relatively small number of features in the data sets used. Figures 1 to 3 show precision and recall values plotted against J (misclassification cost) for SVMs trained with the full set of features and without the least relevant features. These curves show that better values of precision and recall are obtained, especially at smaller values of J. Figures 4 to 6 show the precision/recall curves for the Pima Indians, heart diseases and breast cancer data sets respectively, while the areas under the precision/recall curves are shown in Table IV. The curves in these figures confirm the improved performance of the SVM when the least relevant features are excluded; however, the breast cancer data set did not show improvement in the area under the PR curve. This can be attributed to the fact that the curve has high precision and recall values (over 0.80) even at smaller misclassification costs, so it occupies a small part of the precision/recall space; when connected manually to the (0, 1) and (1, 0) points, the difference between the two curves turned in favor of the SVM trained with the full set of features (the breast cancer curves before and after connection to the (0, 1) and (1, 0) points are shown in Figures 6 and 7 respectively).

Tables III and IV also show improvement in the average performance computed over the four data sets.

2) Results on the validation set

Table V shows the SVMs' performance on the validation sets, at equal misclassification costs. From this table it can be seen that some improvement in generalization performance was obtained on three out of the four validation sets. However, the differences are not statistically significant at the 0.05 level. Again, an improvement in the average performance can be noted.

B. Performance of the Same and Other Families of Classification Algorithms

Tables VI and VII show the 10-fold cross validation training results for JRip and C4.5 respectively. From these tables, it can be seen that excluding the least relevant features was beneficial for both learning algorithms on 3 out of the 4 data sets.

VII. CONCLUSION

In this paper we proposed a hybrid method for feature ranking, utilizing SVM SVs. The idea builds on the ability of SVMs to find a separating hyperplane between positive and negative examples which is entirely defined by the SVs. The feature ranking is done in two steps: the first finds the least relevant features using the positive and negative SVs; the second re-ranks these features using a correlation-based method. The proposed method was validated using three sets of experiments, two using training data and one using a validation set. Results show improved classification performance on most of the data sets. However, further experiments on more domains are required to show whether the improvement is consistent, given that not all feature selection techniques are appropriate for all problem domains. The use of PRC was beneficial in giving a better picture of SVM performance.

One limitation of the proposed method is that optimizing the SVM training parameters is a critical factor in finding the optimal separation between the classes, and therefore the least relevant features. As future research, the method will be tested on high-dimensional data sets and more skewed class distributions, possibly with additional iterations of feature reduction.

Fig. 1. Precision and recall plotted over different misclassification costs for the heart diseases data set.

Fig. 2. Precision and recall plotted over different misclassification costs for the Pima Indians data set.

Fig. 3. Precision and recall plotted over different misclassification costs for the breast cancer data set.

Fig. 4. Precision/recall curve for the heart diseases data set.


Fig. 5. Precision/recall curve for the Pima Indians data set.

Fig. 6. Precision/recall curve for the breast cancer data set.

Fig. 7. Precision/recall curve for the breast cancer data set (connected with the (0, 1) and (1, 0) points).

TABLE III. SVM AVERAGE PERFORMANCE AT DIFFERENT MISCLASSIFICATION COSTS

                 All features           Without 10% of features   Without 20% of features
                 error%  Preci. recall  error%  Preci. recall     error%  Preci. recall
Heart Diseases   27.20   70.74  80.07   26.82   71.24  80.49      28.25   69.25  79.79
Pima Indians     24.10   67.94  80.87   22.97   68.59  81.54      23.95   68.17  81.60
Breast Cancer     5.40   93.49  95.35    4.64   94.79  95.46       3.89   95.16  96.74
Thyroid           2.63   89.47  75.68    2.64   89.42  75.66       2.85   88.32  74.01
Average          14.83   80.41  82.99   14.26   81.01  83.28      14.73   80.22  83.03

TABLE IV. AREA UNDER THE PRECISION/RECALL CURVE

                 All features   Without 10% of features
Heart Diseases      0.836            0.859
Pima Indians        0.807            0.850
Breast Cancer       0.981            0.979
Thyroid             0.966            0.963
Average             0.898            0.913

TABLE V. SVM PERFORMANCE ON THE VALIDATION SET AT EQUAL MISCLASSIFICATION COSTS

                 All features   Without 10% of features   Without 20% of features
                 Acc.%          Acc.%                     Acc.%
Heart Diseases   74.00          74.00                     74.00
Pima Indians     78.00          78.91                     77.85
Breast Cancer    94.00          93.42                     94.77
Thyroid          97.00          96.42                     96.81
Average          85.75          85.69                     85.86

TABLE VI. JRIP 10-FOLD CROSS VALIDATION PERFORMANCE

                 All features          Without 10% of features   Without 20% of features
                 Acc%   Preci. recall  Acc%   Preci. recall      Acc%   Preci. recall
Heart Diseases   79.82  0.798  0.798   77.57  0.780  0.780       81.16  0.810  0.810
Pima Indians     79.76  0.796  0.798   83.00  0.830  0.830       80.97  0.810  0.810
Breast Cancer    95.67  0.957  0.957   95.19  0.950  0.950       96.13  0.960  0.960
Thyroid          99.44  0.994  0.964   99.47  0.995  0.995       99.33  0.994  0.993
Average          88.67  0.880  0.870   88.80  0.888  0.888       89.39  0.893  0.893


TABLE VII. C4.5 10-FOLD CROSS VALIDATION PERFORMANCE

                 All features          Without 10% of features   Without 20% of features
                 Acc%   Preci. recall  Acc%   Preci. recall      Acc%   Preci. recall
Heart Diseases   74.44  0.740  0.740   76.23  0.760  0.760       76.23  0.760  0.760
Pima Indians     81.78  0.820  0.820   83.40  0.830  0.830       84.61  0.850  0.850
Breast Cancer    93.75  0.940  0.940   95.19  0.950  0.950       94.71  0.950  0.950
Thyroid          99.68  0.997  0.997   99.62  0.997  0.997       99.55  0.996  0.995
Average          87.41  0.874  0.874   88.61  0.884  0.884       88.77  0.889  0.889

REFERENCES

[1] S. Salcedo-Sanz, G. Camps-Valls, F. Pérez-Cruz, J. Sepulveda-Sanchís, and C. Bousoño-Calzón, "Enhancing genetic feature selection through restricted search and Walsh analysis," IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 24, pp. 398-406, 2004.
[2] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[3] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[4] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, 2007.
[5] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, "Hybrid feature selection by combining filters and wrappers," Expert Systems with Applications, vol. 38, pp. 8144-8150, 2011.
[6] F. Alonso-Atienza, J. L. Rojo-Álvarez, A. Rosado-Muñoz, J. J. Vinagre, A. García-Alberola, and G. Camps-Valls, "Feature selection using support vector machines and bootstrap methods for ventricular fibrillation detection," Expert Systems with Applications, vol. 39, pp. 1956-1967, 2012.
[7] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389-422, 2002.
[8] P. Bradley and O. Mangasarian, "Feature selection via concave minimization and support vector machines," presented at the 13th International Conference on Machine Learning, 1998.
[9] J. Bi, K. Bennett, M. Embrechts, C. Breneman, and M. Song, "Dimensionality reduction via sparse support vector machines," Journal of Machine Learning Research, vol. 3, pp. 1229-1244, 2003.
[10] N. Ancona, G. Cicirelli, and A. Distante, "Complexity reduction and parameter selection in support vector machines," presented at ANNIE 2002, 2002.
[11] D. Wang, P. Chan, D. Yeung, and E. Tsang, "Feature subset selection for support vector machines through sensitivity analysis," presented at the International Conference on Machine Learning and Cybernetics, 2004.
[12] L. Wang, "Feature selection with kernel class separability," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1534-1546, 2008.
[13] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," Advances in Neural Information Processing Systems (NIPS), vol. 13, pp. 668-674, 2000.
[14] Z. Chen, J. Li, and L. Wei, "A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue," Artificial Intelligence in Medicine, vol. 41, pp. 161-175, 2007.
[15] H. Fröhlich and A. Zell, "Feature subset selection for support vector machines by incremental regularized risk minimization," presented at the IEEE International Joint Conference on Neural Networks (IJCNN), 2004.
[16] M. A. Hall and G. Holmes, "Benchmarking attribute selection techniques for discrete class data mining," IEEE Transactions on Knowledge and Data Engineering, vol. 15, pp. 1437-1447, 2003.
[17] L. C. Molina, L. Belanche, and À. Nebot, "Feature selection algorithms: a survey and experimental evaluation," presented at the IEEE International Conference on Data Mining, 2002.
[18] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge: Cambridge University Press, 2000.
[19] K. Morik, P. Brockhausen, and T. Joachims, "Combining statistical learning with a knowledge-based approach - a case study in intensive care monitoring," presented at the European Conference on Machine Learning, 1998.
[20] J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, 2006.
[21] M. A. Hall, "Correlation-based feature selection for discrete and numeric class machine learning," presented at the 17th International Conference on Machine Learning, 2000.
[22] C. Merz and P. Murphy, "UCI machine learning repository," Irvine, 1998.
[23] T. Joachims, Learning to Classify Text Using Support Vector Machines. Kluwer, 2002.
[24] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
