A Genetic Algorithms Approach to Feature Subset Selection Problem
by Hasan Doğu TAŞKIRAN
CS 550 – Machine Learning Workshop
Department of Computer Engineering, Bilkent University
May 16, 2005
Outline
Motivation
Neural Networks
Feature Subset Selection
Genetic Algorithms
Methodology
Experiments and Results
Conclusions and Future Work
Motivation
It is not unusual to find problems involving hundreds of features
Beyond a point the inclusion of additional features leads to worse rather than better performance
Differentiate between features that contribute new information and those that do not
Many of current techniques such as PCA and LDA involve linear transformations to lower dimensions
A multi-objective genetic algorithm is needed to:
Reduce the cost
Increase the accuracy (if applicable)
Neural Networks
An information processing paradigm that is inspired by the way biological nervous systems process information
A large number of highly interconnected processing elements (neurons) working in unison to solve specific problems
They are configured for a specific application through a learning process
Adjustments to synaptic connections that exist between the neurons
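The learning process above can be sketched as a single sigmoid neuron whose connection weights are adjusted by gradient descent on its prediction error. This is an illustrative stand-alone example (the data, learning rate, and epoch count are made up here), not the network used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, lr=0.5, epochs=2000):
    """Adjust the synaptic weights of one sigmoid neuron to fit (inputs, target) pairs."""
    n = len(samples[0][0])
    w = [0.0] * n          # one weight per input connection
    b = 0.0                # bias term
    for _ in range(epochs):
        for x, t in samples:
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y - t                            # prediction error
            for i in range(n):                     # gradient step on each connection
                w[i] -= lr * err * y * (1 - y) * x[i]
            b -= lr * err * y * (1 - y)
    return w, b

# Learn the AND function from four labelled examples
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_neuron(data)
```

After training, the neuron outputs above 0.5 only for the input (1, 1); the "configuration through a learning process" is exactly these repeated small weight adjustments.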
Neural Networks
The network may become extremely complex if the number of features used for classification grows very large
If the network becomes too complex, then:
Size increases
Training time increases
Training set size increases
Classification time increases
Some optimization methods such as node pruning techniques exist for classification using ANNs
Feature Subset Selection
Reduce the number of features used in classification while maintaining acceptable classification accuracy
Considerable impact on the effectiveness of the resulting classification
Computational complexity is reduced, as there is a smaller number of inputs
Accuracy may increase, as the removed features can hinder the classification process
Can be seen as a case of binary feature weighting
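Binary feature weighting can be illustrated directly: a 0/1 weight vector selects which columns of the data matrix reach the classifier. A minimal sketch (the array contents here are made up):

```python
import numpy as np

# Toy data: 4 samples, 6 candidate features
X = np.arange(24).reshape(4, 6)

# Binary weight vector: 1 keeps a feature, 0 discards it
mask = np.array([1, 0, 1, 0, 0, 1], dtype=bool)

X_selected = X[:, mask]        # only the selected columns are used
print(X_selected.shape)        # (4, 3)
```

Feature subset selection then amounts to searching over such 0/1 vectors for one that keeps accuracy high while using few columns.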
Genetic Algorithms
A family of computational models inspired by evolution
GAs are parallel iterative optimizers, and have been successfully applied to a broad spectrum of optimization problems
Focusing on the application of selection, mutation, and recombination to a population of competing problem solutions
A directed search rather than an exhaustive search
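As a concrete sketch of that directed search, here is a minimal generational GA over bit-strings applying selection, recombination, and mutation to a population of competing solutions. The operator choices (tournament selection, one-point crossover, bit-flip mutation) and all parameters are illustrative, not those used in the experiments later in the talk.

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=60,
                      p_cross=0.9, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            # Tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # One-point crossover (recombination)
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_bits)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            # Bit-flip mutation
            for child in (c1, c2):
                for i in range(n_bits):
                    if rng.random() < p_mut:
                        child[i] ^= 1
                nxt.append(child)
        pop = nxt[:pop_size]
        best = max(pop + [best], key=fitness)
    return best

# "OneMax" toy problem: fitness is the number of 1s; optimum is the all-ones string
best = genetic_algorithm(fitness=sum)
```

Swapping `fitness=sum` for a function that trains a classifier on the selected bits turns this generic loop into the feature-selection method described next.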
Genetic Algorithms
Given enough time and a well bounded problem, the genetic algorithm can find a global optimum
Performance of a genetic algorithm depends on a number of factors, including:
The choice of genetic representation and operators
The fitness function
The details of the fitness-dependent selection procedure
Various user-determined parameters such as population size
All about representation and fitness…
Methodology
Represent the feature subsets as binary strings where:
A value of 1 represents the inclusion of a particular feature in the training process
A value of 0 represents its absence
The genetic algorithm will operate on a pool of binary strings
For each binary string we train a new neural network with the selected features as input nodes to evaluate the fitness of the resulting binary set
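The per-string evaluation can be sketched as follows. To keep the example self-contained and fast, a nearest-centroid classifier stands in for the feed-forward neural network trained in the actual method; the toy data and all names are assumptions made for this sketch.

```python
import numpy as np

def evaluate_subset(bits, X_train, y_train, X_test, y_test):
    """Estimate e(x) (error rate) and s(x) (relative cost) for one binary string.

    A nearest-centroid classifier stands in for the per-string neural network.
    """
    mask = np.asarray(bits, dtype=bool)
    if not mask.any():                      # empty subset: useless classifier
        return 1.0, 0.0
    Xtr, Xte = X_train[:, mask], X_test[:, mask]
    classes = np.unique(y_train)
    centroids = np.array([Xtr[y_train == c].mean(axis=0) for c in classes])
    dists = ((Xte[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    e = float((pred != y_test).mean())      # classification error in [0, 1]
    s = mask.sum() / mask.size              # fraction of features kept, in [0, 1]
    return e, s

# Toy data (hypothetical): only feature 0 carries class information
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.repeat([0, 1], 100)
X[y == 1, 0] += 4.0
idx = rng.permutation(200)
Xtr, ytr, Xte, yte = X[idx[:150]], y[idx[:150]], X[idx[150:]], y[idx[150:]]
e, s = evaluate_subset([1, 0, 0, 0, 0], Xtr, ytr, Xte, yte)
```

Here the single informative feature already yields a low error at one fifth of the cost, which is exactly the trade-off the fitness function rewards.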
Methodology
As a result of the training we obtain an error value e(x),
where 0 ≤ e(x) ≤ 1
A cost function for the network s(x) is obtained,
where again 0 ≤ s(x) ≤ 1
After training the fitness of the feature subset is obtained through
f(x) = 2 − e(x) − s(x)
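Under the assumption that the fitness combines the error and the cost as f(x) = 2 − e(x) − s(x), so that a perfectly accurate, zero-cost subset scores 2, the computation is a one-liner:

```python
def fitness(e, s):
    """Fitness of a feature subset from its error e(x) and cost s(x), both in [0, 1].

    Assumed form: f(x) = 2 - e(x) - s(x); a perfect, zero-cost subset scores 2.
    """
    assert 0.0 <= e <= 1.0 and 0.0 <= s <= 1.0
    return 2.0 - e - s

# Illustration with the optimal-subset values from the results table:
# test error 1 - 0.904, cost s(x) = 0.221
print(round(fitness(1 - 0.904, 0.221), 3))   # 1.683
```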
Experiments and Results
We conducted an experiment that shows our results on a handwritten digit recognition problem
We implemented our methodology using the Matlab Neural Network and Genetic Algorithm toolboxes
The database we used in our experiments was the UCI database for handwritten digits
This database includes 200 samples for each digit (2,000 samples in total)
The digits are represented as 15 x 16 images each
Experiments and Results
We randomly chose 100 samples of each digit for the training set and used the remaining 100 samples per digit for testing our networks, to obtain the necessary e(x) and s(x) values
We decided to use the pixels as our features and so we have 240 features to evaluate
We create a pool of feature subsets represented as 240-bit bit-strings where 1s represent the inclusion of the associated pixel value and 0s represent the absence of it while training the network
For each binary string in the pool we create a new Feed-Forward back-propagation ANN with one hidden layer composed of 10 neurons.
We used logarithmic sigmoid transfer functions and the gradient descent with momentum and adaptive learning rate back-propagation training function (the slowest in Matlab, namely 'traingdx')
Experiments and Results
The parameters for our GA are:
Population Size: 50
Number of Generations: 100
Probability of Crossover: 0.6
Probability of Mutation: 0.001
Elite Count: 2
Type of Mutation: Uniform
Type of Selection: Rank-based
Stall Generations Limit: 10
Stall Time Limit: Infinite
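Rank-based selection, as listed above, assigns selection probability by fitness rank rather than raw fitness value, which keeps selection pressure stable even when fitness values are close together. A minimal linear-ranking sketch (the population and fitness values here are hypothetical):

```python
import random

def rank_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness rank."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    # Worst individual gets rank weight 1, best gets weight N
    ranks = {idx: r + 1 for r, idx in enumerate(order)}
    weights = [ranks[i] for i in range(len(population))]
    return rng.choices(population, weights=weights, k=1)[0]

pop = ["a", "b", "c", "d"]
fit = [0.1, 0.9, 0.4, 0.7]   # "b" is best, so it should be picked most often
picks = [rank_select(pop, fit, random.Random(i)) for i in range(1000)]
```

With four individuals the rank weights are 1, 2, 3, 4, so the best individual is selected about four times as often as the worst, regardless of how far apart the raw fitness values are.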
Experiments and Results
[Figure: Fitness of the best-fit solution vs. generation; y-axis ticks 1,600 to 1,850, x-axis generations 1 to 101, with fitness improving over the run]
Experiments and Results
Accuracy                              Training Dataset   Test Dataset
Full Feature Set (240, s(x) = 1.00)   99.7%              89.9%
Optimal Subset (53, s(x) = 0.221)     99.4%              90.4%
Conclusions
Proposed methodology succeeds in reducing the complexity of the feature set used by the ANN-classifier
Genetic algorithms offer an attractive approach to solving the feature subset selection problem
This methodology finds application areas in cost sensitive design of classifiers for tasks such as medical diagnosis and computer vision
Other application areas include automated data mining and knowledge discovery from datasets with an abundance of irrelevant or redundant features
The GA-based approach to feature subset selection does not rely on monotonicity assumptions that are used in traditional approaches to feature subset selection
Future Work
Further analysis is still needed to improve the results obtained using GAs
Performance improvements and trials on other datasets may be included
Performance improvements should also be made to the genetic algorithms themselves
Another analysis could focus on the fitness evaluation function, experimenting with alternative fitness functions
The approach may also be tried in the semi-supervised learning case
Thanks for Listening…
Questions?