A Genetic Algorithms Approach to Feature Subset Selection Problem
by Hasan Doğu TAŞKIRAN
CS 550 – Machine Learning Workshop
Department of Computer Engineering, Bilkent University
May 16, 2005
Outline
Motivation
Neural Networks
Feature Subset Selection
Genetic Algorithms
Methodology
Experiments and Results
Conclusions and Future Work
Motivation
It is not unusual to find problems involving hundreds of features
Beyond a point the inclusion of additional features leads to worse rather than better performance
Differentiate between features that contribute new information and those that do not
Many of current techniques such as PCA and LDA involve linear transformations to lower dimensions
A multi-objective genetic algorithm is needed to:
Reduce the cost
Increase the accuracy (if applicable)
Neural Networks
An information processing paradigm that is inspired by the way biological nervous systems process information
A large number of highly interconnected processing elements (neurons) working in unison to solve specific problems
They are configured for a specific application through a learning process
Adjustments to synaptic connections that exist between the neurons
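The learning process above can be sketched as a single sigmoid neuron whose connection weights are adjusted by gradient descent on its prediction error. This is an illustrative stand-alone example (the data, learning rate, and epoch count are made up here), not the network used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, lr=0.5, epochs=2000):
    """Adjust the synaptic weights of one sigmoid neuron to fit (inputs, target) pairs."""
    n = len(samples[0][0])
    w = [0.0] * n          # one weight per input connection
    b = 0.0                # bias term
    for _ in range(epochs):
        for x, t in samples:
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y - t                            # prediction error
            for i in range(n):                     # gradient step on each connection
                w[i] -= lr * err * y * (1 - y) * x[i]
            b -= lr * err * y * (1 - y)
    return w, b

# Learn the AND function from four labelled examples
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_neuron(data)
```

After training, the neuron outputs above 0.5 only for the input (1, 1); the "configuration through a learning process" is exactly these repeated small weight adjustments.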
Neural Networks
The network may become extremely complex if the number of features used for classification grows very large
If the network becomes too complex, then:
Size increases
Training time increases
Training set size increases
Classification time increases
Some optimization methods such as node pruning techniques exist for classification using ANNs
Feature Subset Selection
Reduce the number of features used in classification while maintaining acceptable classification accuracy
Considerable impact on the effectiveness of the resulting classification
Computational complexity is reduced, as there is a smaller number of inputs
Accuracy may increase, as the removed features can hinder the classification process
Can be seen as a case of binary feature weighting
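Binary feature weighting can be illustrated directly: a 0/1 weight vector selects which columns of the data matrix reach the classifier. A minimal sketch (the array contents here are made up):

```python
import numpy as np

# Toy data: 4 samples, 6 candidate features
X = np.arange(24).reshape(4, 6)

# Binary weight vector: 1 keeps a feature, 0 discards it
mask = np.array([1, 0, 1, 0, 0, 1], dtype=bool)

X_selected = X[:, mask]        # only the selected columns are used
print(X_selected.shape)        # (4, 3)
```

Feature subset selection then amounts to searching over such 0/1 vectors for one that keeps accuracy high while using few columns.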
Genetic Algorithms
A family of computational models inspired by evolution
GAs are parallel iterative optimizers, and have been successfully applied to a broad spectrum of optimization problems
Focusing on the application of selection, mutation, and recombination to a population of competing problem solutions
A directed search rather than an exhaustive search
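As a concrete sketch of that directed search, here is a minimal generational GA over bit-strings applying selection, recombination, and mutation to a population of competing solutions. The operator choices (tournament selection, one-point crossover, bit-flip mutation) and all parameters are illustrative, not those used in the experiments later in the talk.

```python
import random

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=60,
                      p_cross=0.9, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            # Tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # One-point crossover (recombination)
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_bits)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            # Bit-flip mutation
            for child in (c1, c2):
                for i in range(n_bits):
                    if rng.random() < p_mut:
                        child[i] ^= 1
                nxt.append(child)
        pop = nxt[:pop_size]
        best = max(pop + [best], key=fitness)
    return best

# "OneMax" toy problem: fitness is the number of 1s; optimum is the all-ones string
best = genetic_algorithm(fitness=sum)
```

Swapping `fitness=sum` for a function that trains a classifier on the selected bits turns this generic loop into the feature-selection method described next.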
Genetic Algorithms
Given enough time and a well bounded problem, the genetic algorithm can find a global optimum
Performance of a genetic algorithm depends on a number of factors, including:
The choice of genetic representation and operators
The fitness function
The details of the fitness-dependent selection procedure
Various user-determined parameters such as population size
All about representation and fitness…
Methodology
Represent the feature subsets as binary strings where:
A value of 1 represents the inclusion of a particular feature in the training process
A value of 0 represents its absence
The genetic algorithm will operate on a pool of binary strings
For each binary string we train a new neural network with the selected features as input nodes to evaluate the fitness of the resulting binary set
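The per-string evaluation can be sketched as follows. To keep the example self-contained and fast, a nearest-centroid classifier stands in for the feed-forward neural network trained in the actual method; the toy data and all names are assumptions made for this sketch.

```python
import numpy as np

def evaluate_subset(bits, X_train, y_train, X_test, y_test):
    """Estimate e(x) (error rate) and s(x) (relative cost) for one binary string.

    A nearest-centroid classifier stands in for the per-string neural network.
    """
    mask = np.asarray(bits, dtype=bool)
    if not mask.any():                      # empty subset: useless classifier
        return 1.0, 0.0
    Xtr, Xte = X_train[:, mask], X_test[:, mask]
    classes = np.unique(y_train)
    centroids = np.array([Xtr[y_train == c].mean(axis=0) for c in classes])
    dists = ((Xte[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    e = float((pred != y_test).mean())      # classification error in [0, 1]
    s = mask.sum() / mask.size              # fraction of features kept, in [0, 1]
    return e, s

# Toy data (hypothetical): only feature 0 carries class information
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.repeat([0, 1], 100)
X[y == 1, 0] += 4.0
idx = rng.permutation(200)
Xtr, ytr, Xte, yte = X[idx[:150]], y[idx[:150]], X[idx[150:]], y[idx[150:]]
e, s = evaluate_subset([1, 0, 0, 0, 0], Xtr, ytr, Xte, yte)
```

Here the single informative feature already yields a low error at one fifth of the cost, which is exactly the trade-off the fitness function rewards.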
Methodology
As a result of the training we obtain an error value e(x),
where 0 ≤ e(x) ≤ 1
A cost function for the network s(x) is obtained,
where again 0 ≤ s(x) ≤ 1
After training the fitness of the feature subset is obtained through
f(x) = 2 − e(x) − s(x)
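Under the assumption that the fitness combines the error and the cost as f(x) = 2 − e(x) − s(x), so that a perfectly accurate, zero-cost subset scores 2, the computation is a one-liner:

```python
def fitness(e, s):
    """Fitness of a feature subset from its error e(x) and cost s(x), both in [0, 1].

    Assumed form: f(x) = 2 - e(x) - s(x); a perfect, zero-cost subset scores 2.
    """
    assert 0.0 <= e <= 1.0 and 0.0 <= s <= 1.0
    return 2.0 - e - s

# Illustration with the optimal-subset values from the results table:
# test error 1 - 0.904, cost s(x) = 0.221
print(round(fitness(1 - 0.904, 0.221), 3))   # 1.683
```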
Experiments and Results
We conducted an experiment that shows our results on a handwritten digit recognition problem
We implemented our methodology using the Matlab Neural Network and Genetic Algorithm toolboxes
The database we used in our experiments was the UCI database for handwritten digits
This database includes 200 samples for each digit (2,000 samples in total)
The digits are represented as 15 x 16 images each
Experiments and Results
We randomly chose 100 samples of each digit for the training set and used the remaining 100 samples per digit for testing our networks, to obtain the necessary e(x) and s(x) values
We decided to use the pixels as our features and so we have 240 features to evaluate
We create a pool of feature subsets represented as 240-bit bit-strings where 1s represent the inclusion of the associated pixel value and 0s represent the absence of it while training the network
For each binary string in the pool we create a new Feed-Forward back-propagation ANN with one hidden layer composed of 10 neurons.
We used logarithmic sigmoid transfer functions and the gradient descent with momentum and adaptive learning rate back-propagation training function (the slowest in Matlab, namely 'traingdx')
Experiments and Results
The parameters for our GA are:
Population Size: 50
Number of Generations: 100
Probability of Crossover: 0.6
Probability of Mutation: 0.001
Elite Count: 2
Type of Mutation: Uniform
Type of Selection: Rank-based
Stall Generations Limit: 10
Stall Time Limit: Infinite
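Rank-based selection, as listed above, assigns selection probability by fitness rank rather than raw fitness value, which keeps selection pressure stable even when fitness values are close together. A minimal linear-ranking sketch (the population and fitness values here are hypothetical):

```python
import random

def rank_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness rank."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    # Worst individual gets rank weight 1, best gets weight N
    ranks = {idx: r + 1 for r, idx in enumerate(order)}
    weights = [ranks[i] for i in range(len(population))]
    return rng.choices(population, weights=weights, k=1)[0]

pop = ["a", "b", "c", "d"]
fit = [0.1, 0.9, 0.4, 0.7]   # "b" is best, so it should be picked most often
picks = [rank_select(pop, fit, random.Random(i)) for i in range(1000)]
```

With four individuals the rank weights are 1, 2, 3, 4, so the best individual is selected about four times as often as the worst, regardless of how far apart the raw fitness values are.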
Experiments and Results
[Figure: Fitness of the best-fit solution vs. generation; y-axis ticks 1,600 to 1,850, x-axis generations 1 to 101, with fitness improving over the run]
Experiments and Results
Accuracy                              Training Dataset   Test Dataset
Full Feature Set (240, s(x) = 1.00)   99.7%              89.9%
Optimal Subset (53, s(x) = 0.221)     99.4%              90.4%
Conclusions
Proposed methodology succeeds in reducing the complexity of the feature set used by the ANN-classifier
Genetic algorithms offer an attractive approach to solving the feature subset selection problem
This methodology finds application areas in cost sensitive design of classifiers for tasks such as medical diagnosis and computer vision
Other application areas include automated data mining and knowledge discovery from datasets with an abundance of irrelevant or redundant features
The GA-based approach to feature subset selection does not rely on monotonicity assumptions that are used in traditional approaches to feature subset selection
Future Work
Further analysis is still needed to improve the results obtained using GAs
Performance improvements and trials on other datasets may be included
Performance improvements should also be made to the genetic algorithms themselves
Another analysis could focus on the fitness evaluation function, experimenting with alternative fitness functions
The approach may also be tried in the semi-supervised learning case
Thanks for Listening…
Questions?