
A Variance Reduction Framework for Stable Feature Selection

Yue Han and Lei Yu
[email protected]

Binghamton University

Outline

- Introduction, Motivation and Related Work
- Theoretical Framework
- Empirical Framework: Margin Based Instance Weighting
- Empirical Study
  ◦ Synthetic Data
  ◦ Real-world Data
- Conclusion and Future Work

Introduction and Motivation: Feature Selection Applications

[Figure: three application domains. (1) Text categorization: a documents-by-terms matrix (documents D1..DM, terms T1..TN, with term counts such as 12, 0, ..., 6) and class labels such as Sports, Travel, Jobs. (2) Bioinformatics: a samples-by-features matrix whose features are genes or proteins. (3) Image analysis: pixels as features.]

Motivation: Stability of Feature Selection

Stability of feature selection: the insensitivity of the result of a feature selection algorithm to variations in the training set.

Stability of feature selection was long neglected and has only recently attracted interest from data mining researchers.

[Figure: analogous to the stability of a learning algorithm (training data → learning algorithm → learning models), feature selection has its own stability issue: training sets D1, D2, ..., Dn drawn from the same data space are fed to a feature selection method, producing feature subsets R1, R2, ..., Rn. Are the subsets consistent?]

Motivation: Why is Stable Feature Selection Needed?

Under data variations, a stable feature selection method yields a stable feature subset that is closer to the characteristic features (biomarkers) and gives better learning performance. An unstable method yields largely different feature subsets with similarly good learning performance.

Domain experts (in biomedicine and biology) are also interested in biomarkers that are stable and insensitive to data variations: an unstable feature selection method dampens confidence in validation and increases the cost of experiments.

Theoretical Framework: Variance, Bias and Error of Feature Selection

Challenge: increasing the training sample size can be very costly or impractical. How can we represent the underlying data distribution without increasing the sample size?

[Figure: training sets D1, D2, ..., Dn drawn from the data space each yield a feature weight vector r(D_j) = (r_1^{Dj}, r_2^{Dj}, ..., r_d^{Dj}); the true feature weight vector is r* = (r*_1, r*_2, ..., r*_d).]

For each feature weight r_i, i = 1, ..., d:
- Variance: fluctuation of the n weight values around their central tendency.
- Bias: deviation of the central tendency (average) from the true weight value.
- Error: average deviation of the n weight values from the true weight value.

Theoretical Framework: Bias-Variance Decomposition of Feature Selection Error

Data space: P(D); training data: D; feature selection result: r(D); true feature selection result: r*. For each individual feature, the weight value is used instead of a 0/1 selection indicator, and results are averaged over all d features.

Error: Err_i = E_D[(r_i(D) - r*_i)^2]
Bias: Bias_i = E_D[r_i(D)] - r*_i
Variance: Var_i = E_D[(r_i(D) - E_D[r_i(D)])^2]

Bias-variance decomposition of feature selection error:
Err_i = Bias_i^2 + Var_i, so Err = (1/d) Σ_i (Bias_i^2 + Var_i)

o Reveals the relationship between accuracy (the opposite of error) and stability (the opposite of variance);
o Suggests a better trade-off between the bias and variance of feature selection.
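The decomposition can be checked numerically. Below is a toy sketch (not the paper's experiment): a single feature whose weight estimate from each training set is both noisy and biased; `TRUE_WEIGHT`, the 0.5 offset, and the noise scale are all made-up values.

```python
import random

random.seed(0)

TRUE_WEIGHT = 2.0          # r*_i: the true weight of feature i
N_SETS = 10000             # number of training sets D_1..D_n

# Each training set yields a noisy, slightly biased weight estimate r_i(D).
weights = [TRUE_WEIGHT + 0.5 + random.gauss(0.0, 1.0) for _ in range(N_SETS)]

mean_w = sum(weights) / N_SETS
bias = mean_w - TRUE_WEIGHT                                    # E[r(D)] - r*
variance = sum((w - mean_w) ** 2 for w in weights) / N_SETS    # E[(r(D) - E[r(D)])^2]
error = sum((w - TRUE_WEIGHT) ** 2 for w in weights) / N_SETS  # E[(r(D) - r*)^2]

# The decomposition: error = bias^2 + variance (up to floating-point noise).
print(abs(error - (bias ** 2 + variance)) < 1e-9)
```

The identity holds exactly for any sample because the cross term Σ(w - mean_w) vanishes; only the estimates of bias and variance themselves depend on the number of training sets.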

Theoretical Framework: Variance Reduction via Importance Sampling

A feature selection (weighting) algorithm can be viewed as a Monte Carlo estimator of the true feature weights. Increasing the sample size would reduce the estimator's variance but is impractical and costly; importance sampling reduces variance without additional samples, and instance weighting plays the analogous role here.

Intuition behind importance sampling: draw more instances from important regions and fewer instances from other regions.

Intuition behind instance weighting: increase the weights of instances from important regions and decrease the weights of instances from other regions.

How should the instances be weighted? How important is each instance?
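The importance sampling idea can be illustrated with a minimal, self-contained estimator (not the paper's method): we estimate E[x^2] under a target density p by drawing from a different proposal q and reweighting each draw by p(x)/q(x), the same reweighting intuition that instance weighting applies to training instances. The distributions here are illustrative choices.

```python
import math
import random

random.seed(1)

def pdf(x, mu, sigma):
    """Normal density; used for both the target p and the proposal q."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Goal: E_p[x^2] under p = N(0, 1), which is exactly 1.
# Instead of sampling from p, draw from a proposal q = N(0, 2) and
# reweight each draw by p(x)/q(x): regions where p places more mass than q
# get weight > 1, other regions get weight < 1.
N = 100_000
draws = [random.gauss(0.0, 2.0) for _ in range(N)]
weights = [pdf(x, 0.0, 1.0) / pdf(x, 0.0, 2.0) for x in draws]

# Self-normalized importance sampling estimate.
estimate = sum(w * x * x for w, x in zip(weights, draws)) / sum(weights)
print(round(estimate, 2))
```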

Empirical Framework: Overall Framework

Challenges:
- How to produce instance weights from the point of view of feature selection stability;
- How to present weighted instances to conventional feature selection algorithms.

The proposed framework: margin based instance weighting for stable feature selection.

Empirical Framework: Margin Vector Feature Space

For each instance x in the original space, the hypothesis margin along each dimension i is
m_i(x) = |x_i - M_i(x)| - |x_i - H_i(x)|,
where H(x) is the nearest hit (the nearest neighbor of x with the same class label) and M(x) is the nearest miss (the nearest neighbor with a different class label). The margin vector m(x) captures the local profile of feature relevance for all features at x.

Instances exhibit different profiles of feature relevance, and therefore influence feature selection results differently.
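A sketch of the margin vector computation on a made-up four-instance data set (the `data` values are hypothetical; the formula follows the nearest-hit/nearest-miss definition above):

```python
import math

# Toy labeled data: (features, class). Two instances per class, made up.
data = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((1.0, 1.1), 1), ((0.9, 1.0), 1)]

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def margin_vector(idx):
    """Hypothesis margin of instance idx along each dimension:
    |x_i - nearmiss_i| - |x_i - nearhit_i|."""
    x, y = data[idx]
    hits = [f for j, (f, c) in enumerate(data) if c == y and j != idx]
    misses = [f for f, c in data if c != y]
    near_hit = min(hits, key=lambda f: dist(x, f))
    near_miss = min(misses, key=lambda f: dist(x, f))
    return tuple(abs(xi - mi) - abs(xi - hi)
                 for xi, hi, mi in zip(x, near_hit, near_miss))

margins = [margin_vector(i) for i in range(len(data))]
# With well-separated classes, margins are positive on relevant dimensions.
print(margins[0])
```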

Empirical Framework: An Illustrative Example

[Figure: hypothesis-margin based feature space transformation: (a) original feature space, (b) margin vector feature space.]

Empirical Framework: Margin Based Instance Weighting Algorithm

Instances exhibit different profiles of feature relevance and influence feature selection results differently.

Review: variance reduction via importance sampling draws more instances from important regions and fewer from other regions. Instance weighting applies the same intuition:
- higher outlying degree → lower weight;
- lower outlying degree → higher weight.

Outlying degree: how far an instance's margin vector lies from those of the other instances.
Weighting: each instance's weight decreases with its outlying degree.
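A minimal sketch of the weighting step, assuming the outlying degree of an instance is the average distance between its margin vector and those of the other instances, and that weights are the normalized inverses; the exact formulas on the original slides may differ, and the `margin_vectors` values are made up:

```python
import math

# Toy margin vectors (hypothetical), one per training instance; the last
# one is an outlier in the margin vector feature space.
margin_vectors = [(0.7, 0.8), (0.6, 0.9), (0.8, 0.7), (3.0, 3.2)]

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def outlying_degree(i):
    """Average distance from instance i's margin vector to all others
    (one plausible reading of the slide's outlying degree)."""
    others = [v for j, v in enumerate(margin_vectors) if j != i]
    return sum(dist(margin_vectors[i], v) for v in others) / len(others)

degrees = [outlying_degree(i) for i in range(len(margin_vectors))]

# Higher outlying degree -> lower weight (normalized inverse weighting).
raw = [1.0 / d for d in degrees]
total = sum(raw)
weights = [w / total for w in raw]
print(weights)
```

The outlying instance receives the smallest weight, so it contributes least to the subsequent (weighted) feature selection.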

Empirical Framework: Algorithm Illustration

Time complexity analysis:
o Dominated by the instance weighting step;
o Efficient for high-dimensional data with a small sample size (n << d).

Empirical Study: Subset Stability Measures

Given feature subsets R1, R2, ..., Rn selected from training sets D1, D2, ..., Dn drawn from the same data space, stability is measured by average pairwise similarity:
S = 2 / (n(n-1)) Σ_{i<j} Sim(R_i, R_j)

Kuncheva index (for two subsets of equal size k selected from d features, with r = |R_i ∩ R_j|):
Sim(R_i, R_j) = (r d - k^2) / (k (d - k))

The k^2/d term is a correction factor accounting for the overlap expected by chance.
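Both measures can be sketched directly from their definitions (the example subsets are made up):

```python
def kuncheva_index(a, b, d):
    """Kuncheva's consistency index between two feature subsets of equal
    size k drawn from d features: (r*d - k^2) / (k*(d - k)), r = |a ∩ b|."""
    k = len(a)
    r = len(set(a) & set(b))
    return (r * d - k * k) / (k * (d - k))

def avg_pairwise_stability(subsets, d):
    """Average Kuncheva index over all pairs of selected subsets."""
    pairs = [(i, j) for i in range(len(subsets)) for j in range(i + 1, len(subsets))]
    return sum(kuncheva_index(subsets[i], subsets[j], d) for i, j in pairs) / len(pairs)

# Identical subsets score 1; chance-level overlap scores about 0.
subsets = [{0, 1, 2, 3}, {0, 1, 2, 4}, {0, 1, 2, 3}]
print(avg_pairwise_stability(subsets, d=100))
```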

Empirical Study: Experiments on Synthetic Data

Synthetic data generation:
- Feature values: drawn from two multivariate normal distributions (one per class), with 100 groups of 10 correlated features each; the covariance matrix of each group is a 10x10 matrix with 1 along the diagonal and 0.8 off the diagonal.
- Class label: a weighted sum of all feature values under an optimal feature weight vector.
- Training data: 500 training sets of 100 instances each, 50 drawn from each distribution. Leave-one-out test data: 5000 instances.

Method in comparison:
- SVM-RFE: recursively eliminate 10% of the features remaining from the previous iteration until 10 features remain.

Measures:
- Variance, bias, error
- Subset stability (Kuncheva index)
- Accuracy (SVM)
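The SVM-RFE elimination schedule can be sketched as follows; `score_features` is a made-up stand-in for refitting an SVM and scoring features by their squared weights, so only the 10%-per-iteration schedule itself is faithful to the setup above:

```python
def score_features(features):
    # Hypothetical stand-in for SVM weights: pretend higher feature ids
    # are more relevant. A real implementation would refit each round.
    return {f: float(f) for f in features}

def rfe(n_features, target=10, drop_frac=0.10):
    """Recursive feature elimination: drop the lowest-scoring 10% of the
    remaining features each iteration until `target` features remain."""
    remaining = list(range(n_features))
    while len(remaining) > target:
        scores = score_features(remaining)
        n_drop = max(1, int(len(remaining) * drop_frac))
        n_drop = min(n_drop, len(remaining) - target)  # don't overshoot
        remaining = sorted(remaining, key=scores.get)[n_drop:]
    return remaining

selected = rfe(1000)
print(len(selected))
```

With this stub scorer, the survivors are simply the ten highest-id features; the point of the sketch is the geometric elimination schedule, which takes roughly 40 iterations to go from about 1000 features down to 10.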

Empirical Study: Experiments on Synthetic Data

Observations:
- Error is equal to the sum of bias and variance for both versions of SVM-RFE;
- Error is dominated by bias during early iterations and by variance during later iterations;
- IW SVM-RFE exhibits significantly lower bias, variance and error than SVM-RFE as the number of remaining features approaches 50.

Empirical Study: Experiments on Synthetic Data

Conclusion: variance reduction via margin based instance weighting gives
- a better bias-variance tradeoff;
- increased subset stability;
- improved classification accuracy.

Empirical Study: Experiments on Real-world Data

Microarray data:

Methods in comparison:
- SVM-RFE
- Ensemble SVM-RFE
- Instance Weighting SVM-RFE

Measures:
- Variance
- Subset stability
- Accuracies (KNN, SVM)

[Figure: 20-Ensemble SVM-RFE: 20 bootstrapped training sets each yield a feature subset, and the 20 subsets are combined into one aggregated feature subset.]
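The 20-ensemble scheme can be sketched as follows; `base_select` is a made-up stand-in for running SVM-RFE on one bootstrap sample, and frequency-based aggregation is one plausible aggregation rule:

```python
import random
from collections import Counter

random.seed(2)

def base_select(sample, k=5):
    # Made-up stand-in for SVM-RFE on one bootstrap sample: score each of
    # 20 hypothetical features by its id plus fresh noise, keep the top k.
    # (A real base selector would actually train on `sample`.)
    scores = {f: f + random.gauss(0.0, 3.0) for f in range(20)}
    return sorted(scores, key=scores.get, reverse=True)[:k]

n_train = 100
train_ids = list(range(n_train))

# 20 bootstrapped training sets, each yielding one feature subset.
subsets = []
for _ in range(20):
    boot = [random.choice(train_ids) for _ in range(n_train)]
    subsets.append(base_select(boot))

# Aggregate the 20 subsets: keep the features selected most often.
counts = Counter(f for s in subsets for f in s)
aggregated = [f for f, _ in counts.most_common(5)]
print(sorted(aggregated))
```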

Empirical Study: Experiments on Real-world Data

Note: 40 iterations, starting from about 1000 features until 10 features remain.

Observations:
- Variance is non-discriminative during early iterations;
- The variance of SVM-RFE increases sharply as the number of features approaches 10;
- IW SVM-RFE shows a significantly slower rate of increase.

Empirical Study: Experiments on Real-world Data

Observations:
- Both the ensemble and instance weighting approaches improve stability consistently;
- The ensemble improvement is not as significant as that of instance weighting;
- As the number of selected features increases, the stability score decreases because of the larger correction factor.

Empirical Study: Experiments on Real-world Data

Conclusions:
- Instance weighting improves the stability of feature selection without sacrificing prediction accuracy;
- It performs much better than the ensemble approach and is more efficient;
- It leads to significantly increased stability at only a slight extra cost in time;
- Prediction accuracy (via both KNN and SVM) is non-discriminative among the three approaches for all data sets.

Conclusion and Future Work

Accomplishments:
- Established a bias-variance decomposition framework for feature selection;
- Proposed an empirical framework for stable feature selection;
- Developed an efficient margin-based instance weighting algorithm;
- Conducted a comprehensive study on synthetic and real-world data.

Future work:
- Extend the current framework to other state-of-the-art feature selection algorithms;
- Explore the relationship between stable feature selection and classification performance.

Related Work: Stable Feature Selection

Comparison of feature selection algorithms w.r.t. stability (Davis et al., Bioinformatics, vol. 22, 2006; Kalousis et al., KAIS, vol. 12, 2007):
- Quantify stability in terms of consistency of subsets or weights;
- Algorithms vary in stability while performing equally well for classification;
- Choose the algorithm that is best in both stability and accuracy.

Bagging-based ensemble feature selection (Saeys et al., ECML 2007):
- Draw different bootstrapped samples of the same training set;
- Apply a conventional feature selection algorithm to each;
- Aggregate the feature selection results.

Group-based stable feature selection (Yu et al., KDD 2008; Loscalzo et al., KDD 2009):
- Exploit the intrinsic feature correlations;
- Identify groups of correlated features;
- Select relevant feature groups.

Thank you

and

Questions?