Data Mining Neural Networks With Genetic Algorithms


Ajit Narayanan, Edward Keedwell and Dragan Savic
School of Engineering and Computer Science
University of Exeter
Exeter EX4 4PT
United Kingdom
[email protected]
tel: (+)1392 264064

Abstract

It is an open question as to what is the best way to extract symbolic rules from trained neural networks in domains involving classification. Previous approaches based on an exhaustive analysis of network connection and output values have already been demonstrated to be intractable in that the scale-up factor increases exponentially with the number of nodes and connections in the network. A novel approach using genetic algorithms to search for symbolic rules in a trained neural network is demonstrated in this paper. Preliminary experiments involving classification are reported here, with the results indicating that our proposed approach is successful in extracting rules. While it is accepted that further work is required to convincingly demonstrate the superiority of our approach over others, there is nevertheless sufficient novelty in these results to justify early dissemination. (If the paper is accepted, the latest results will be reported, together with sufficient information to aid replicability and verification.)

Introduction

Artificial neural networks (ANNs) are increasingly used in problem domains involving classification. They are adept at finding commonalities in a set of seemingly unrelated data and for this reason are used in a growing number of classification tasks. Unfortunately, a commonly perceived problem with ANNs when used for classification is that, while a trained ANN can indeed classify the data, sometimes with more accuracy than a traditional, symbolic machine learning approach, the reasons for their classification cannot be found easily. Trained ANNs are commonly perceived to be ‘black boxes’ which map input data onto a class through a number of mathematically weighted connections between layers of neurons. While the idea of ANNs as black boxes may not be a problem in applications where there is little interest in the reasons behind classification, this can be a major obstacle in applications where it is important to have symbolic rules or other forms of knowledge structure, such as identification or decision trees, which are easily interpretable by human experts. In particular, it may be important to identify knowledge not previously known to domain experts and which may therefore lie at the periphery of domain expertise. Also, safety-critical systems (such as air traffic control or missile firing) which use neural networks successfully to classify data face difficulty in being accepted because of the reluctance by managers and administrators to accept a system which is not open to symbolic verification. Often, there is a legal requirement that such safety-critical systems be demonstrated to be correct to a certain degree of confidence. It is often claimed that neural networks, because of their plasticity and use of soft constraints, can handle noisy data better than their symbolic counterparts and should therefore be used precisely in those areas which are likely to benefit most from their application, such as safety-critical systems and data mining.

In general, an ANN can be said to make its decisions by using the activation of the units (input and hidden) combined with the weights of the connections between these units. The topology of the network can also be used. Andrews et al. (1996) identify three types of rule extraction techniques: ‘decompositional’, ‘pedagogical’ and ‘eclectic’, each of which refers to a different method of extracting information from the network. A decompositional approach is distinguished by its focus on extracting rules at the level of individual (hidden and output) units. The computed output from each hidden and output unit is mapped onto a binary ‘yes/no’ outcome corresponding to the notion of a rule consequent. The major problem with this approach is the apparent exponential behaviour of associated algorithms (Towell and Shavlik, 1993). Extracting rules from complex ANNs may therefore be intractable. A pedagogical approach is distinguished by its treatment of a trained ANN as a ‘black box’ where the knowledge to be extracted deals directly with the way that input is mapped onto output by the internal weights (i.e. no ‘yes/no rules’ are extracted, just rules dealing with the changes in the levels of the input and output units). The major problem with this approach is the sheer number of rules generated for even the simplest domains. Finally, the eclectic approach is characterised by any use of knowledge concerning the internal architecture and/or weight vectors in a trained ANN to complement a symbolic learning algorithm. There is currently very little understanding of available methods for constructing an eclectic approach, of the domains where eclectic approaches may outperform their traditional symbolic and ANN counterparts, and how to evaluate the results of an eclectic approach.

In this paper we propose a novel, evolutionary eclectic approach which integrates traditional ANNs with genetic algorithms for extracting simple, intelligible and useful rules from trained ANNs. It is claimed that this approach adopts the advantages of ANNs (gradual, incremental training which overcomes inconsistencies and ambiguities in the data) as well as symbolic learning (intelligible output, rules for verification). In brief, the paper proposes the use of a genetic algorithm to search the weight space of a trained neural network to identify the best rules for classification. The genetic algorithm uses chromosomes which can be mapped directly onto intelligible rules (phenotypes).

Two major constraints are the following. First, the goal of many rule-extraction techniques is to find a comprehensive rule base for the network so that it can be encoded as a set of ‘expert system’ rules in which the attributes causing a particular classification can be precisely and fully determined. In this paper we propose that this is not necessary in the majority of applications. Algorithms attempting to produce comprehensive rule sets have a tendency to become exponential in complexity as network size increases. This has been recognised by researchers, and in a recent paper (Arbatli and Akin, 1997) the search space available to the symbolic algorithm has been decreased by optimising the topology of the network using genetic algorithms. The approach described here differs in that it uses GAs to search a trained neural network for the extraction of symbolic rules directly and not to optimise the network for another set of rule extraction techniques to be applied. Secondly, the experiments below have been performed on categorical rather than continuous data. Many datasets of significance in the real world do indeed have continuous attributes, but datasets with large numbers of unpartitioned continuous attributes are unlikely to be successfully classified by a neural network in any case.

The Genetic Algorithm/Neural Network System

The starting point of any rule-extraction system is firstly to train the network on the data required, i.e. the ANN is trained so that a satisfactory error level is reached. For classification problems, each input unit typically corresponds to a single feature in the real world, and each output unit to a class value or class. The first objective of our approach is to encode the network in such a way that a genetic algorithm can be run over the top of it. This is achieved by creating an n-dimensional weight space where n is the number of layers of weights. The network can be represented by simply enumerating each of the nodes and/or connections. For example, Figure 1 depicts a simple neural network with five input units (input features, data attributes), three hidden units, and one output unit (class or class value), with each node enumerated in this case except the output. Typically, there will be more than one output class or class value and therefore more than one output node.

From this encoding, genes can be created which, in turn, are used to construct chromosomes where there is at least one gene representing a node at the input layer and at least one gene representing a node at the hidden layer. A typical chromosome for the network depicted in Figure 1 could look something like this (Figure 2):

Figure 1 - A typical encoding of a simple neural network with only one class value (one output node)

Figure 2 - A typical chromosome generated from the encoded network for only one class value

This chromosome corresponds to the fifth unit in the input layer and the third unit in the hidden layer. That is, the first gene contains the weight connecting input node 5 to hidden unit 3, and the second gene contains the weight connecting hidden unit 3 to the output class. Fitness is computed as a direct function of the weights which the chromosome represents. For chromosomes containing just two genes (one for the input unit, the other for the hidden unit), the fitness function is:

Fitness = Weight(Input→Hidden)*Weight(Hidden→Output)

where ‘→’ signifies the weight between the two enumerated nodes. So the fitness of the chromosome in Figure 2 is:

Fitness = Weight(5→3)*Weight(3→Output)
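As a minimal sketch of this computation in Python (the GA programs reported here were written in C++, so this is an illustration rather than the original code), the fitness of a two-gene chromosome is simply the product of the two weights it indexes. The function name, the data layout and the weight values below are all assumptions made for the example; only the network shape follows Figure 1.

```python
# Hypothetical weights for a 5-input, 3-hidden, 1-output network (illustrative values only).
# w_ih[h][i] is the weight from input unit i+1 to hidden unit h+1.
w_ih = [
    [0.2, -0.5, 0.1, 0.7, -0.3],
    [-0.4, 0.6, 0.9, -0.1, 0.3],
    [0.5, -0.2, 0.4, 0.8, -1.5],
]
# w_ho[h] is the weight from hidden unit h+1 to the single output unit.
w_ho = [1.0, -0.5, -2.0]


def fitness(chromosome, w_ih, w_ho):
    """Return Weight(Input->Hidden) * Weight(Hidden->Output).

    The chromosome is an (input unit, hidden unit) pair, numbered from 1 as in the text.
    """
    input_unit, hidden_unit = chromosome
    return w_ih[hidden_unit - 1][input_unit - 1] * w_ho[hidden_unit - 1]


# The chromosome of Figure 2: input unit 5, hidden unit 3.
print(fitness((5, 3), w_ih, w_ho))
```

With these made-up weights, two negative weights multiply to a positive fitness, which is why a strongly negative input weight can still support a classification.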

This fitness is computed for an initial set of random chromosomes, and the population is sorted according to fitness. An elitist strategy is then used whereby a subset of the top chromosomes is selected for inclusion in the next generation. Crossover and mutation are then performed on these chromosomes to create the rest of the next population.

The chromosome is then easily converted into IF…THEN rules with an attached weighting. This is achieved by using the template: ‘IF <gene1> THEN output is <class> (weighting)’, with the weighting being the fitness of the chromosome and the class signifying which output unit is being switched on. The weighting is a major part of the rule generation procedure because its value is a direct measure of how the network interprets the data. Since ‘Gene 1’ above corresponds to the weight between an input unit and a hidden unit, the template is essentially stating that the consequent of the rule is caused by the activation on that particular input node and its connection to a hidden unit (not specified explicitly in the rule). The rule template above therefore allows the extraction of single-condition rules. The number of extracted rules in each population can be set by the user, according to the complexity of the network and/or the data. A larger number of rules will yield less fit chromosomes and thus less important rules. This property is essential in extracting rules which represent knowledge at the periphery of expertise.
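The template can be sketched in a few lines of Python. The helper names (`chromosome_to_rule`, `top_rules`) and the scored triples are hypothetical, chosen only to show how a user-set rule count trades off against rule fitness; they are not part of the system described here.

```python
def chromosome_to_rule(input_unit, output_class, weighting):
    """Fill in the template 'IF <gene1> THEN output is <class> (weighting)'."""
    return f"IF unit {input_unit} is 1 THEN output is {output_class} (fitness {weighting:.3f})"


def top_rules(scored, n_rules):
    """Keep the n_rules fittest (input unit, output class, weighting) triples as rule strings."""
    ranked = sorted(scored, key=lambda item: item[2], reverse=True)
    return [chromosome_to_rule(*item) for item in ranked[:n_rules]]


# Illustrative scores only; these are not values from any experiment in the text.
scored = [(5, 1, 3.0), (2, 1, 0.25), (4, 1, 1.6)]
for rule in top_rules(scored, n_rules=2):
    print(rule)
```

Raising `n_rules` admits the lower-weighted triples as well, which is the sense in which a larger rule count yields less fit, more peripheral rules.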

Experimentation

Three experiments are described here. The first two experiments use a toy example to show that our approach can find rules comparable to those found with purely symbolic methods of data-mining. The third experiment was performed on a larger data set to show that this method is generalisable to real-world domains. All GA programs are written in C++. Neural network packages used were Neurodimensions’ Neurosolutions v3.0 and Thinkspro v1.05 by Logical Designs Consulting.

Experiment 1

The dataset refers to named individuals for whom there are four attributes and two possible class values (Figure 3 - adapted from Winston, 1992):

Name    Hair    Height   Weight   Lotion  Result
Sarah   Blonde  Average  Light    No      Sunburned
Dana    Blonde  Tall     Average  Yes     Not sunburned
Alex    Brown   Short    Average  Yes     Not sunburned
Annie   Blonde  Short    Average  No      Sunburned
Emily   Red     Average  Heavy    No      Sunburned
Pete    Brown   Tall     Heavy    No      Not sunburned
John    Brown   Average  Average  No      Not sunburned
Katie   Blonde  Short    Light    Yes     Not sunburned

Figure 3 - The Sunburn Dataset

This dataset is converted as follows into a form suitable for input to the ANN (Figure 4):

Attribute  Value          Encoding
Hair       Blonde         100
           Brown          010
           Red            001
Height     Short          100
           Average        010
           Tall           001
Weight     Light          100
           Average        010
           Heavy          001
Lotion     No             10
           Yes            01
Class      Sunburned      10
           Not sunburned  01

Figure 4 - Neural Network Conversion of Data in Figure 3.

One example of input is therefore: 10001010010, which represents a blonde haired (100), average height (010), light (100), no-lotion used (10) individual (i.e. Sarah). Note that we are dealing with a supervised learning network, where the class in which the sample falls is explicitly represented for training purposes. So, in the case of Sarah, the output 10 (sunburned) is used for supervised training. ‘10’ here signifies that the first output node is switched on and the second is not. A neural network with 11 input, 5 hidden and 2 output units was created. The input to the network was a string of 0’s and 1’s which corresponded to the records in the data set above. The network was then trained (using back-propagation) until a mean square error of 0.001 was achieved. The network weights were then recorded and the genetic algorithm process started. The weights between the 11 input and 5 hidden units are as follows:
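The conversion from a record to the 11-bit input string can be sketched as follows. The codes are those of Figure 4; the dictionary layout and the function name `encode_record` are illustrative assumptions.

```python
# One-hot codes taken from Figure 4; the dictionary layout is an illustrative choice.
ENCODING = {
    "Hair": {"Blonde": "100", "Brown": "010", "Red": "001"},
    "Height": {"Short": "100", "Average": "010", "Tall": "001"},
    "Weight": {"Light": "100", "Average": "010", "Heavy": "001"},
    "Lotion": {"No": "10", "Yes": "01"},
}
CLASS_ENCODING = {"Sunburned": "10", "Not sunburned": "01"}


def encode_record(hair, height, weight, lotion):
    """Concatenate the attribute codes in the order Hair, Height, Weight, Lotion."""
    values = {"Hair": hair, "Height": height, "Weight": weight, "Lotion": lotion}
    return "".join(ENCODING[attribute][values[attribute]] for attribute in ENCODING)


# Sarah: blonde, average height, light, no lotion; target output is 'Sunburned'.
print(encode_record("Blonde", "Average", "Light", "No"), CLASS_ENCODING["Sunburned"])
```

Each bit position of the resulting string corresponds to one of the 11 input units, so input unit 1 is ‘hair = blonde’ and input unit 11 is ‘lotion = yes’.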

Hidden Unit 1 (all eleven input units): -2.029721 1.632389 -1.702274 -1.369853 0.133539 0.296253 -0.465295 0.680639 -0.610233 -1.432447 -1.462687

Hidden Unit 2: 0.960469 1.304169 -0.558034 -0.870080 0.394558 0.537783 0.047991 0.575487 -1.571345 0.476647 -0.003466

Hidden Unit 3: 0.952550 -2.791922 1.133562 0.518217 1.647397 -1.801673 -1.518900 -0.245973 0.450328 -0.169588 -1.979129

Hidden Unit 4: -1.720175 1.247111 1.095436 0.365523 0.350067 0.584151 0.773993 1.216627 -1.174810 -1.624518 2.342727

Hidden Unit 5: -1.217552 2.288170 -1.088214 -0.389681 -0.919714 1.168223 0.579115 1.039906 1.499586 -2.902985 2.754642

The weights between the five hidden units and the two output units are as follows:

Output Unit 1 (all 5 hidden units): -2.299536 -0.933331 2.137592 -2.556154 -4.569341

Output Unit 2: 2.235369 -0.597022 -3.967368 1.887921 3.682286

A random number generator was used to create the initial population of five chromosomes for the detection of rules, where an extra gene is added to the end of the chromosome to represent one of the two output class values. The alleles for this gene are either 1 or 2 (to represent the output node values of 10 (sunburned) and 01 (not sunburned)).

The following decisions were taken:

1. The fittest chromosome of each generation goes through to the next generation.

2. The next chromosome is chosen at random, but a greater fitness gives a greater chance of being chosen. Negative fitnesses were not included. (A ‘roulette wheel’ selection.)

3. The remaining four chromosomes are created as a mutation of the two chosen above and crossover on these same two. Duplicate chromosomes are removed.

4. Fitness was computed simply as Weight(input_to_hidden)*Weight(hidden_to_output). The more positive the number, the greater the fitness.
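One generation under the four decisions above might be sketched as follows. This is an interpretation, not the original C++ program: it assumes that the four remaining chromosomes are two mutants plus two crossover children, that mutation re-draws one gene at random, and that selection falls back to a uniform choice when no chromosome has positive fitness. The weights are made-up values for a small 4-input, 3-hidden, 2-output network.

```python
import random

# Hypothetical weights for a 4-input, 3-hidden, 2-output network (illustrative values only).
W_IH = [
    [0.8, -1.2, 0.3, 0.5],
    [-0.6, 0.9, -1.4, 0.2],
    [1.1, -0.3, 0.7, -0.9],
]
W_HO = [
    [1.5, -2.0, 0.4],
    [-1.0, 1.8, -0.6],
]


def fitness(chrom):
    """Decision 4. Chromosome = (input unit, hidden unit, output class), 0-based here."""
    i, h, c = chrom
    return W_IH[h][i] * W_HO[c][h]


def roulette_pick(population, rng):
    """Decision 2: fitness-proportional choice; negative fitnesses are left out."""
    positive = [c for c in population if fitness(c) > 0]
    if not positive:  # fallback when nothing is positive (an assumption, not from the text)
        return rng.choice(population)
    return rng.choices(positive, weights=[fitness(c) for c in positive], k=1)[0]


def mutate(chrom, rng):
    """Re-draw either the input gene or the hidden gene at random (assumed operator)."""
    i, h, c = chrom
    if rng.random() < 0.5:
        i = rng.randrange(len(W_IH[0]))
    else:
        h = rng.randrange(len(W_IH))
    return (i, h, c)


def next_generation(population, rng):
    elite = max(population, key=fitness)        # decision 1: fittest goes through
    partner = roulette_pick(population, rng)    # decision 2: roulette wheel
    children = [                                # decision 3: two mutants, two crossovers
        mutate(elite, rng),
        mutate(partner, rng),
        (elite[0], partner[1], elite[2]),
        (partner[0], elite[1], partner[2]),
    ]
    unique = []
    for chrom in [elite, partner] + children:   # duplicate chromosomes are removed
        if chrom not in unique:
            unique.append(chrom)
    return unique


rng = random.Random(0)
population = [(rng.randrange(4), rng.randrange(3), 0) for _ in range(5)]
for generation in range(3):
    population = next_generation(population, rng)
    best = max(population, key=fitness)
    print(generation, best, round(fitness(best), 3))
```

Because the fittest chromosome always survives, the best fitness in the population can never decrease from one generation to the next, and the population never exceeds six chromosomes.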

An example run (first three generations only) for extracting rules dealing with the first output node only (i.e. for sunburn cases only) is given in Figure 5.

Results

A traditional symbolic learning algorithm running on this dataset will find the following four rules: (a) If person has red hair then person is sunburned; (b) If person is brown haired then person is not sunburned; (c) If person has blonde hair and no lotion used then person is sunburned; and (d) If person has blonde hair and lotion used then person is not sunburned. Our approach identified the following five single condition rules in ten generations, with a maximum population of 6 in each generation:

(i) ‘IF unit 1 is 1 THEN output is 1 (fitness 4.667)’, which corresponds to: ‘IF hair colour=blonde THEN result is sunburned’. The fitness here is calculated as follows: input unit 1 to hidden unit 1 weight of -2.029721 * hidden unit 1 to output unit 1 weight of -2.299536.

(ii) ‘IF unit 3 is 1 THEN output is 1 (fitness 3.908)’, which corresponds to ‘IF hair colour=red THEN result is sunburned’ (input unit 3 to hidden unit 1 weight of -1.702274 * hidden unit 1 to output unit 1 weight of -2.299536).

(iii) ‘IF unit 10 is 1 THEN output is 1 (fitness 4.154)’, which corresponds to ‘IF no lotion used THEN result is sunburned’ (input unit 10 to hidden unit 4 weight of -1.624518 * hidden unit 4 to output weight of -2.556154).

(iv) ‘IF unit 2 is 1 THEN output is 2 (fitness 8.43)’, which corresponds to: ‘IF hair colour=brown THEN result is not sunburned’ (input unit 2 to hidden unit 5 weighting of 2.288170 * hidden unit 5 to output unit 2 weighting of 3.682286, with rounding).

(v) ‘IF unit 11 is 1 THEN output is 2 (fitness 10.12)’, which corresponds to ‘IF lotion used THEN result is not sunburned’ (input unit 11 to hidden unit 5 weighting of 2.754642 * hidden unit 5 to output unit 2 weighting of 3.682286, with rounding).

Figure 5 shows that, for the sunburnt cases (rules (i) to (iii) above), there is early convergence (within three generations) to these rules. The fitness values cited in the rule set above may not be the maximum attainable but are nevertheless significantly above 0.
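Because this network is small, the remark about maximum attainable fitness can be checked by enumerating every input-hidden-output path rather than searching with a GA. The sketch below does that using the Experiment 1 weights listed earlier; the function names are illustrative, and the exhaustive loop is a checking device for this toy network, not the method proposed in this paper.

```python
# Input-to-hidden weights from Experiment 1: W_IH[h][i] connects input i+1 to hidden h+1.
W_IH = [
    [-2.029721, 1.632389, -1.702274, -1.369853, 0.133539, 0.296253,
     -0.465295, 0.680639, -0.610233, -1.432447, -1.462687],
    [0.960469, 1.304169, -0.558034, -0.870080, 0.394558, 0.537783,
     0.047991, 0.575487, -1.571345, 0.476647, -0.003466],
    [0.952550, -2.791922, 1.133562, 0.518217, 1.647397, -1.801673,
     -1.518900, -0.245973, 0.450328, -0.169588, -1.979129],
    [-1.720175, 1.247111, 1.095436, 0.365523, 0.350067, 0.584151,
     0.773993, 1.216627, -1.174810, -1.624518, 2.342727],
    [-1.217552, 2.288170, -1.088214, -0.389681, -0.919714, 1.168223,
     0.579115, 1.039906, 1.499586, -2.902985, 2.754642],
]
# Hidden-to-output weights: W_HO[o][h] connects hidden h+1 to output o+1.
W_HO = [
    [-2.299536, -0.933331, 2.137592, -2.556154, -4.569341],
    [2.235369, -0.597022, -3.967368, 1.887921, 3.682286],
]


def path_fitness(input_unit, hidden_unit, output_unit):
    """Weight(input->hidden) * Weight(hidden->output), units numbered from 1."""
    return W_IH[hidden_unit - 1][input_unit - 1] * W_HO[output_unit - 1][hidden_unit - 1]


def best_path(output_unit):
    """Return (fitness, input unit, hidden unit) of the fittest path to an output unit."""
    return max(
        (path_fitness(i, h, output_unit), i, h)
        for i in range(1, 12)
        for h in range(1, 6)
    )


print(round(path_fitness(1, 1, 1), 3))   # rule (i): input 1 via hidden 1 to output 1
for output_unit in (1, 2):
    value, i, h = best_path(output_unit)
    print(output_unit, i, h, round(value, 3))
```

On these weights the fittest path to output 1 runs from input unit 10 (no lotion) through hidden unit 5, with a product of roughly 13.26, and the fittest path to output 2 runs from input unit 2 (brown hair) through hidden unit 3, with a product of roughly 11.08. Both conditions already appear in the rule set above, but with lower fitness values obtained through other hidden units.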

Experiment 2

Another toy example was chosen from the machine learning literature, again with only 8 records and four attributes (Figure 6).

Figure 5 - First three generations of chromosome evolution in the extraction of rules dealing with sunburn cases (output node 1) only

Run  Supervisor  Overtime  Operator  Output
1    Sally       Yes       Joe       High
2    John        No        Samantha  High
3    Sally       Yes       Joe       High
4    John        No        Joe       Low
5    Sally       Yes       Samantha  High
6    Patrick     No        Samantha  Low
7    Sally       Yes       Joe       High
8    Patrick     No        Samantha  Low

Figure 6: Second experimental dataset

The conversion between data and neural network representation was performed as before (Figure 7).

Attribute   Value     Encoding
Supervisor  Sally     100
            John      010
            Patrick   001
Overtime    Yes       10
            No        01
Operator    Joe       10
            Samantha  01
Output      High      10
            Low       01

Figure 7: Conversion of second dataset into a neural network format

The rules involved in this classification are complex and there is some repetition so that only very few records actually make a contribution to a rule. Symbolic algorithms do not produce good results over this data set. See5 creates the ruleset:

IF overtime = Yes THEN output = High [0.833]
IF overtime = No THEN output = Low [0.667]

CN2 creates these single-condition rules, along with some dual condition rules:

IF supervisor = Sally THEN output = High [0 4]
IF supervisor = Patrick THEN output = Low [2 0]

where the numbers in brackets signify how many cases of each class are captured by that rule. For instance, ‘[0 4]’ after the first rule above signifies that this rule captures none of the low output cases and 4 of the high output cases. The ANN with 7 input, 4 hidden and 2 output units was trained over a series of 1522 epochs to achieve a mean squared error of 0.040. Below is the weight space for the network.

Hidden Unit 1 (all seven input to hidden connections): -0.836101 -0.437469 -0.972496 -0.977659 0.265379 -0.459824 0.313158

Hidden Unit 2: -2.508566 -2.855611 1.858439 -1.711295 2.86410 2.675891 -1.834709

Hidden Unit 3: 1.726850 0.421753 -0.725803 1.372710 -1.471043 0.338697 0.652326

Hidden Unit 4: -1.738682 -1.385388 2.255858 -0.626335 2.316902 0.007883 -3.285211

Output Unit 1 (all four hidden to output connections): 0.491153 -4.961958 2.423375 -2.589325

Output Unit 2: -0.687410 4.479441 -2.092269 3.477822

The genetic algorithm was started with a population of 10 and run for just 20 generations. The top rules for each classification were as follows:

IF Supervisor = John THEN output = High (12.948)
IF Supervisor = Sally THEN output = High (10.966)
IF Operator = Samantha THEN output = High (7.847)

IF Overtime = No THEN output = Low (11.498)
IF Operator = Joe THEN output = Low (10.706)
IF Supervisor = Patrick THEN output = Low (7.120)

As before, the fitness measures for each rule are quoted to allow decisions to be made as to the validity of each of the rules. As can be seen from the ruleset, the results from the symbolic algorithms have largely been reproduced and the algorithm has also found some extra rules.
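One simple way to judge a single-condition rule is to count how many Low and High records in Figure 6 it covers, which is the quantity reported in brackets for the CN2 rules in this experiment. The sketch below recomputes those counts. The `laplace_accuracy` helper rests on an assumption: the See5 values 0.833 and 0.667 are numerically consistent with a Laplace-corrected estimate (correct + 1) / (covered + 2), but the text does not say how they were derived.

```python
# Records from Figure 6: (Supervisor, Overtime, Operator, Output).
RUNS = [
    ("Sally", "Yes", "Joe", "High"),
    ("John", "No", "Samantha", "High"),
    ("Sally", "Yes", "Joe", "High"),
    ("John", "No", "Joe", "Low"),
    ("Sally", "Yes", "Samantha", "High"),
    ("Patrick", "No", "Samantha", "Low"),
    ("Sally", "Yes", "Joe", "High"),
    ("Patrick", "No", "Samantha", "Low"),
]
COLUMNS = {"Supervisor": 0, "Overtime": 1, "Operator": 2}


def coverage(attribute, value):
    """Return [low, high]: how many records of each class satisfy attribute = value."""
    column = COLUMNS[attribute]
    outputs = [run[3] for run in RUNS if run[column] == value]
    return [outputs.count("Low"), outputs.count("High")]


def laplace_accuracy(attribute, value, predicted):
    """Laplace-corrected accuracy of 'IF attribute = value THEN output = predicted' (assumed formula)."""
    low, high = coverage(attribute, value)
    correct = high if predicted == "High" else low
    return (correct + 1) / (low + high + 2)


print(coverage("Supervisor", "Sally"), coverage("Supervisor", "Patrick"))
print(round(laplace_accuracy("Overtime", "Yes", "High"), 3))
print(round(laplace_accuracy("Overtime", "No", "Low"), 3))
```

Running the same count over the conditions in the extracted rules gives a record-level check to set alongside the weight-based fitness values.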

Experiment 3

The dataset used was the mushroom dataset, a well-known collection of data used for classifying mushrooms into an edible or poisonous class. The data contains 125 categories spanning 23 attributes. As before, the data was converted into a neural network input format. The network was first trained on this full dataset for 41 epochs, reaching an error of 0.0161. However, the test results from these runs were very poor and it prompted an investigation of the network weights, revealing that the network was not learning successfully. Several solutions to this problem were hypothesised and implemented with little success. The problem turned out to be that the data set has a large number of unused categories and these were translated along with the rest of the data, resulting in a network with a very sparse distribution of information since over half of the categories were not present. These categories were eliminated from the data and a smaller network with 30 hidden units was trained on the smaller 62 category data set for 69 epochs. The error was higher than before at 0.03 but testing was, on average, better. The genetic algorithm was run for 100 iterations with a population of 20. There were 7 operations per population, 4 crossover and 3 mutation. The mutation rate was randomly set between -40 and +40. The rules found by the GAs were encouragingly similar to those found by traditional algorithms, but the system also supplemented the most obvious rules with some previously undiscovered ones, exclusive to our approach:

IF odour=p THEN poisonous (max 2.23) (found by CN2 and See5)
IF gill-size=n THEN poisonous (max 1.13) (exclusive)
IF stalk-root=e THEN poisonous (max 1.13) (exclusive)
IF gill-size=b THEN edible (max 2.3) (found by CN2)
IF odour=n THEN edible (max 1.58) (exclusive)
IF cap-surface=f THEN edible (max 1.58) (found by CN2)

The weightings specify maximum values since the same rules surface frequently in the rule list with different fitness values, depending on which hidden unit the input was connected to. The rules correlate well with the ones found by traditional packages. In fact, they are almost identical to the rules found by CN2. The exciting aspect here is that there are some totally new rules extracted regarding each classification. The algorithms used in traditional classification programs found only the odour=p rule for poisonous classification, whereas our approach found two other rules.

The need to adapt the neural network to deal with a subset of the original data highlights an inherent problem in any approach which attempts to integrate neural network learning with symbolic rule extraction: the genetic algorithm can only generate rules from the neural network if they already exist. If a network has not been trained properly on the data set then the algorithm will not find the required associations. This means that users must be very sure that the trained network is an accurate model of the domain they are trying to mine. If this is not the case then the system will find spurious rules.

Discussion

Work is currently underway to amend the chromosome representation to extract two-condition and multi-condition rules from the neural network trained on the mushroom dataset, as well as to improve the behaviour of the trained neural network even further when tested with examples not previously seen. It is an open question as to how well the trained neural network has to perform on unseen examples before the process of rule extraction can begin.

Together, the preliminary results reported here provide evidence of the feasibility of integrating GAs with trained neural networks, both technically and in terms of efficiency. The approach can be scaled up easily, with the major constraint on scale being the accuracy of the trained neural network when dealing with large datasets. What was particularly interesting was the extraction of rules not captured by traditional symbolic learning techniques. While such rules may not be totally accurate in that they don’t capture all or even most of the samples in a dataset, there is no doubt that the approach outlined here can perform the useful function of extracting rules which lie at the periphery of domain expertise or which capture exceptions (which can then be further analysed to identify reasons for being exceptions). One of the major advantages of this approach is that this is precisely what may be required in commercial applications of data mining, where the task is not to mine the data to extract rules which are already known to domain experts but to capture significant exceptions to general rules which then need explaining in their own right for commercial advantage. The extraction of rules from the neural network trained on the mushroom data set, where these rules were not captured by symbolic data mining techniques, is therefore particularly significant, since it suggests that the ability of neural networks to classify samples which cannot be classified by symbolic means can now be tapped to produce intelligible rules which lie at the periphery of domain expertise. In short, we claim that our approach utilises the best aspects of neural network learning in noisy domains with the best aspects of symbolic rules through the application of GAs.

There are a number of outstanding issues, all currently being worked on. (If this paper is accepted for the Conference, the latest results using our approach will be described.) Our system essentially finds a collection of paths (rules) through the trained network to determine the optimal ones for a particular classification. It is certainly possible that one input unit can exert both a negative and a positive influence over the same classification. When fired, this unit could contribute in a large way towards the classification through one hidden unit, but it might also have another set of heavily negative connections to other hidden units which would negate that classification. In that case, the genetic algorithm will find the large positive and negative connections and interpret their effect separately, thereby creating erroneous and perhaps contradictory rules. In fact, for the experiments listed above, there was a symmetry about the weights which reflected how an input was classified. If the network determines that a certain attribute is not contributing to a classification, it is far more likely to reduce the effect that that unit has on the network rather than increase two sets of weights. This is largely how back-propagation works, but it shows up a possible weakness in our approach if used on networks which have been trained using a different learning algorithm from back-propagation. Further experiments are required on ANNs of different types (e.g. competitive, non-supervised learning networks) and different architectures (e.g. of more than one hidden layer of neurons). The indications are that the system should be even better suited to ANNs with larger numbers of hidden layers because, whilst the complexity involved in extracting rules increases enormously, the complexity of the genetic algorithm does not.
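The contradictory-path situation described above can be detected before rules are reported. The sketch below is a hypothetical diagnostic, not part of the system reported here: for each input unit it lists the path products to one output unit and flags inputs that have both a strongly positive and a strongly negative path. The weights and the `threshold` parameter are assumptions chosen for illustration.

```python
def path_products(input_unit, output_unit, w_ih, w_ho):
    """Products Weight(input->hidden) * Weight(hidden->output) over every hidden unit (0-based)."""
    return [w_ih[h][input_unit] * w_ho[output_unit][h] for h in range(len(w_ih))]


def conflicting_inputs(w_ih, w_ho, output_unit, threshold):
    """Return (input, strongest positive, strongest negative) for inputs with opposing strong paths."""
    flagged = []
    for input_unit in range(len(w_ih[0])):
        products = path_products(input_unit, output_unit, w_ih, w_ho)
        if max(products) >= threshold and min(products) <= -threshold:
            flagged.append((input_unit, max(products), min(products)))
    return flagged


# Hypothetical 2-input, 2-hidden, 1-output network (illustrative values only).
w_ih = [
    [2.0, 0.5],    # weights into hidden unit 0
    [-1.5, 0.4],   # weights into hidden unit 1
]
w_ho = [[1.5, 2.0]]

print(conflicting_inputs(w_ih, w_ho, output_unit=0, threshold=1.0))
```

Here input 0 supports the class through one hidden unit (product 3.0) and opposes it through the other (product -3.0), so a search over single paths would report it as a strong rule for the class even though its net effect is ambiguous.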

Bibliography

1. Andrews, R., Cable, R., Diederich, J., Geva, S., Golea, M., Hayward, R., Ho-Stuart, C. and Tickle, A.B. (1996). An Evaluation And Comparison of Techniques For Extracting and Refining Rules From Artificial Neural Networks. World Wide Web URL: http://www.fit.qut.edu.au/NRC/ftpsite/QUTNRC-96-01-04.html

2. Arbatli, A.D. and Akin, H.L. (1997). Rule Extraction from Trained Neural Networks Using Genetic Algorithms. Nonlinear Analysis, Theory, Methods and Applications. Vol 30, No. 3, pp 1639-1648.

3. Towell, G. and Shavlik, J. (1993). The extraction of refined rules from knowledge based neural networks. Machine Learning, 13(1), pp 71-101.

4. Winston, P. H. (1992). Artificial Intelligence (3rd Edition). Addison Wesley.

Acknowledgement

The research contained in this paper was funded in part by a grant from the Royal Mail.
