[IEEE 2009 IEEE Congress on Evolutionary Computation (CEC) - Trondheim, Norway (2009.05.18-2009.05.21)] 2009 IEEE Congress on Evolutionary Computation - Adaptive Genetic Programming



Adaptive Genetic Programming for Dynamic Classification

Problems

M. Riekert, K. M. Malan, Member, IEEE and A. P. Engelbrecht, Senior Member, IEEE

Abstract— This paper investigates the feasibility of using

Genetic Programming in dynamically changing environments to evolve decision trees for classification problems and proposes a new version of Genetic Programming called Adaptive Genetic Programming. It does so by comparing the performance (classification error) of Genetic Programming and Adaptive Genetic Programming to that of Gradient Descent in abruptly and progressively changing environments. To cope with dynamic environments, Adaptive Genetic Programming incorporates adaptive control parameters, variable elitism and culling. Results show that both Genetic Programming and Adaptive Genetic Programming are viable algorithms for dynamic environments, yielding a performance gain over Gradient Descent for lower-dimensional problems even with severe environment changes. In addition, Adaptive Genetic Programming performs slightly better than Genetic Programming, due to faster recovery from changes in the environment.

I. INTRODUCTION

Adapting to changing environments is a key element to

the survival of any species. It is then prudent to investigate

and model this adaptive process, since there are very few

real-world problems that remain static over time. One model

for this adaptive behavior may be found in the evolutionary

computing paradigm of artificial intelligence.

Genetic Programming (GP) was initially developed to

evolve computer programs by Koza [1]. GP is an evo-

lutionary search process which models the same survival

of the fittest paradigm as first proposed by Darwin in his

groundbreaking work “On the Origin of Species” [2]. GP differs

from other evolutionary algorithms in that it evolves trees

as opposed to solution vectors. One of the advantages of

evolving trees is that variable sized entities may be utilized.

This is especially convenient for dynamic environments as

the size of proposed solutions may vary over time.

GP is used in this paper as an evolutionary model for

evolving decision trees. This paper presents results obtained

by modeling the adaptive behavior of species in changing

environments and argues that doing so is a feasible way

of obtaining solutions in these environments. The results

indicate that even without modification to the standard GP

algorithm, it still yields acceptable results. However, a slight

performance gain may be obtained by having adaptive con-

trol parameters which are modified after a change in the

environment is detected. This version of the GP algorithm is

named Adaptive Genetic Programming (AGP). Both versions

of GP are compared to the classification capability of neural

networks trained using Gradient Descent (GD).

The authors are with the Department of Computer Science, University of Pretoria, Lynnwood Road, Hillcrest, Pretoria, South Africa, 0002 (phone: 012 420 5242; email: mriekert/engel/[email protected]).

The remainder of the paper is structured as follows: Sec-

tion II gives a brief overview of the algorithms and structures

used for this study, Section III discusses the implementation

details of GP and AGP, Section IV discusses control param-

eter settings for the experiments, Section V describes how

the test data for the experiments was generated, Section VI

presents and discusses the results obtained during the study

and the paper then ends with the conclusion in Section VII.

II. BACKGROUND

Classification is a common real-world activity. It is used

to sort entities or patterns into predefined

classes. Common examples of classification include deter-

mining fraudulent behavior in banking data or classifying

patient symptoms to a particular disease. Numerous artificial

intelligence techniques have been successfully applied to

classification problems, some examples of which are Quin-

lan’s C4.5 [3], Clark and Niblett’s CN2 [4] and neural

networks [5].

Neural networks have been shown to perform competi-

tively for classification problems [6]. Gradient Descent (GD)

is a technique used to train neural networks. It consists of two

phases, a feed forward phase and an error back propagation

phase [5]. During the feed forward phase input patterns

are presented to the neural network and the outputs of the

network are recorded for all the patterns. The mean squared

error is then calculated and an error signal is fed back from

the output layer to the hidden layer, adjusting the weights in

the network relative to how much a neuron has contributed

to the error. GD is used as a benchmark for comparing the

classification results of the GP and AGP approach proposed

in this paper as GD has been shown to function in dynamic

environments.

Decision trees are logical structures for drawing classifica-

tion boundaries between data patterns [5]. Each non-terminal

tree node represents a question to be answered, the outcome

of which will determine to which of the child nodes the

traversal procedure will advance. Terminal nodes represent

the classes of a data set. A tree is traversed for a particular

pattern, starting at the root node, by using the attribute values

of the pattern to answer the question posed by the non-

terminal nodes. An example question might be something

along the lines of X < 0.5 whereupon the pattern’s attribute

value corresponding to the X attribute is compared to 0.5.

If the inequality is satisfied, traversal proceeds down the

“true”-branch of the current node or down the “false”-branch

otherwise. This continues until a terminal node is reached.

The class represented by the terminal node will then be

978-1-4244-2959-2/09/$25.00 © 2009 IEEE


assumed as the classification of the pattern that led to that

particular terminal node.
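The traversal described above can be sketched as follows. The node layout (an attribute index, a threshold, and true/false branches) is an assumption made for illustration; the paper does not specify its data structures.

```python
class Leaf:
    """Terminal node: represents one of the data set's classes."""
    def __init__(self, label):
        self.label = label

class Node:
    """Non-terminal node: a question of the form 'attribute < threshold'."""
    def __init__(self, attribute, threshold, true_branch, false_branch):
        self.attribute = attribute    # index into the pattern's attribute vector
        self.threshold = threshold    # e.g. the 0.5 in "X < 0.5"
        self.true_branch = true_branch
        self.false_branch = false_branch

def classify(tree, pattern):
    """Walk from the root to a terminal node, answering each question
    with the pattern's attribute values."""
    while isinstance(tree, Node):
        if pattern[tree.attribute] < tree.threshold:
            tree = tree.true_branch
        else:
            tree = tree.false_branch
    return tree.label

# Example: "X < 0.5" at the root, classes A and B at the leaves.
root = Node(0, 0.5, Leaf("A"), Leaf("B"))
```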

Decision trees have been successfully evolved using GP

to solve classification problems [7][8]. Brameier and Banzhaf

[7] use the approach of flattening the decision tree into a

vector representation, which is then evolved. Their approach

yielded very good results when compared against a neural

network trained for classification of a medical data set. In

work done by Au et al. [8] GP is also used to evolve decision

trees successfully for churn prediction in

static environments. The approach presented in this paper, as

opposed to the work done by Au et al., focuses on keeping

the decision tree logical structure intact and on dynamically

adapting the algorithm to suit dynamic environments.

GP is becoming an ever more popular technique to use in

dynamic environments [9][10][11][12][13]. Wagner et al. [9]

use GP to evolve regression trees to forecast time series data.

Augmentations to GP in terms of dynamic parameters have

been investigated by Yin et al. [10]. The focus of their work

was to investigate adaptive mutational and crossover proba-

bilities to break stagnation of the evolutionary process where

stagnation is a very undesirable phenomenon in evolutionary

computing. The AGP approach presented in this paper makes

use of similar adaptive control parameters but extends this to

include other adaptive features such as variable elitism and

culling.

III. APPROACH TO ADAPTIVE GENETIC PROGRAMMING

This section describes the implementation details for both

GP and AGP. The main difference between GP and AGP is

that AGP makes use of adaptive control parameters, adaptive

elitism and culling whilst GP has static parameters, no culling

and full elitism where full elitism corresponds to “keep best

reproduction” as investigated by Wiese and Goodwin [14].

These concepts will be discussed in subsequent sections.

A. Composite fitness function

The fitness function in GP is what drives the search

process to desired solutions. It is a means of quantifying

the quality of solutions within a population of individuals

and specifies the attributes desired for an optimal solution.

Selection strategies more often than not utilize the fitness

function when selecting individuals for activities such as

crossover, survival to successive generations and possibly

mutation.

For the implementation in this study the fitness function

was decomposed into a number of components, translating

the overall problem into a multi-objective problem. These

components were weighted and then linearly combined to

calculate the overall composite fitness of a proposed solution.

The fitness function can be seen as a minimization function

where entities with a lower fitness value are deemed more

fit and preferred over those with higher fitness values. The

components were as follows:

1) Accuracy: Undeniably the most important component

of the fitness function, accuracy determines the classification

error that a proposed solution makes on a supplied training

set of patterns. Accuracy was expressed as a percentage of

misclassified patterns over the training set. It is important

that trees make the smallest error possible because trees

with high errors serve no purpose. The accuracy of a tree

was calculated by comparing the classification proposed by

a tree for a particular pattern to the actual classification of

the pattern. For each misclassified pattern the total number of

misclassified patterns was incremented. After all the patterns

in the training set were processed the accuracy for a tree T

was calculated as:

Accuracy(T) = (NumberMisclassifiedPatterns / NumberPatternsInTrainingSet) × 100

which yields the percentage of misclassified patterns. The

weight for accuracy was set to 1, that is, the accuracy

contributes the full 100% of its value to the final fitness

calculation.

2) Identical Subtrees: Identical subtree fitness was calcu-

lated as the number of identical child nodes that each node

within a decision tree possessed. Identical subtrees lead to

redundant attribute tests within decision trees. The identical

subtrees component was weighted with 0.5 to prevent this

component from dominating the overall fitness calculation,

rather allowing for the accuracy component to be the most

important feature in the fitness calculation.

3) Repeated attributes: More often than not, multiple tests

on the same attribute within a decision tree contribute to

redundancy within a tree. Preferably, attribute tests should

appear only once within a tree and the condition described

by the node should be such that it will not be necessary to

test on that attribute again. If another test on the attribute is

required then the subsequent test should be a specialization

on a previous test, for example: x < 5 initially and then later

on x < 4 as opposed to x < 4 and then x > 5. In the latter

scenario the test on x > 5 can never succeed, rendering

this part of the decision tree redundant or dead. The weight

for the repeated attributes component in the fitness function

was set to 0.5 for the same reason as the identical subtree

component.

4) Repeated conditions: Repeated conditions calculated

the number of identical conditions within a subtree of nodes.

Repeated conditions is another form of redundancy within

decision trees. Similarly to the aforementioned structural

components (identical subtrees and repeated attributes), the

repeated conditions component was weighted in the final

calculation with a weight of 0.5.

5) Tree size: Larger trees will more often than not overfit

the training data. Large trees are also cumbersome to work

with and computationally expensive to traverse. The fitness

function was designed such that if two trees render the same

accuracy on the training data, the smaller of the two will

be deemed more fit. Restricting tree size in this way also

prevented the unbound growth of trees and ensured that only



the smallest most accurate trees would survive to successive

generations. To allow the growth of trees and disallow the

search process to be dominated by the size of trees, tree size

as a structural component was weighted with a weight of

0.05. This small weight was meant as a “tie breaker” between

two trees with the same accuracy (as is largely the case with

all the aforementioned structural fitness components).
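The composite fitness described in this section can be sketched as a weighted sum. Only the weights (1, 0.5, 0.5, 0.5 and 0.05) come from the text; the raw component values would be computed by the measures described above, and lower fitness is better since the function is minimized.

```python
# Weights reported in Section III-A; lower composite fitness is better.
WEIGHTS = {
    "accuracy": 1.0,             # % of misclassified patterns on the training set
    "identical_subtrees": 0.5,
    "repeated_attributes": 0.5,
    "repeated_conditions": 0.5,
    "tree_size": 0.05,           # acts as a tie breaker between equally accurate trees
}

def composite_fitness(components):
    """components: dict mapping component name -> raw (unweighted) value."""
    return sum(WEIGHTS[name] * value for name, value in components.items())

# Two trees with equal accuracy: the smaller one gets the lower (better) fitness.
small = composite_fitness({"accuracy": 10.0, "identical_subtrees": 0,
                           "repeated_attributes": 0, "repeated_conditions": 0,
                           "tree_size": 7})
large = composite_fitness({"accuracy": 10.0, "identical_subtrees": 0,
                           "repeated_attributes": 0, "repeated_conditions": 0,
                           "tree_size": 31})
```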

B. Stopping criteria

As opposed to static data sets or environments, dynamic

environments do not afford the luxury of hard stopping

criteria. Static environments allow for stopping criteria along

the lines of “required fitness reached” or “required number

of generations exceeded”. In truly dynamic environments, on

the other hand, such stopping criteria cannot be set in place.

The evolutionary process must continue to accommodate

possible changes within the environment in the near future.

The closest “stopping” criterion might be expressed in terms

of a “hibernation” phase where the evolutionary process is

paused as soon as some required fitness is reached. During

the hibernation phase, continual environment change testing

is conducted and as soon as the environment changes, evolu-

tion is restarted. This hibernation phase may be implemented

by halting crossover and mutation and only conducting

fitness evaluations on the most fit entity found thus far.
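A minimal sketch of the hibernation criterion, assuming a generic `step` function that performs one generation of crossover, mutation and selection; the function names are illustrative assumptions, not the paper's API.

```python
def evolve_with_hibernation(population, evaluate, step, required_fitness,
                            max_iterations):
    """Evolve until the required fitness is reached, then hibernate:
    halt crossover/mutation and only re-evaluate the best entity.
    A fitness deterioration while hibernating restarts evolution."""
    best = min(population, key=evaluate)  # minimization: lower is better
    hibernating = False
    for _ in range(max_iterations):
        if hibernating:
            if evaluate(best) > required_fitness:  # environment changed
                hibernating = False
            continue
        population = step(population)          # one generation of evolution
        best = min(population, key=evaluate)
        if evaluate(best) <= required_fitness:
            hibernating = True
    return best
```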

C. Crossover strategy

One-point crossover was used. After selecting two parent

trees offspring were constructed by swapping randomly se-

lected subtrees within the parents to create two recombined

versions of the parent trees.

D. Mutation

Mutation is a mechanism to inject new genetic material

into a population of solutions. It is applied probabilistically to

the offspring generated during the crossover operation. There

are three mutational probabilities to consider when evolving

decision trees using GP: Firstly, the probability that a tree

will be selected for mutation, secondly, the probability that

a specific mutational operator will be applied to the selected

tree and lastly, the probability for each node in the tree to

mutate. The mutational operators implemented in this study

were as follows:

1) Growing: For each terminal node, decide if it should

be altered to a non-terminal or internal node and if so,

add the appropriate number of children (terminal nodes

which will recursively be considered for growing). To prevent

unbounded growth the probability of mutation was linearly

decreased for every successive level of nodes added to the

tree.

2) Truncation: For every internal node probabilistically

decide if the node should be replaced by a leaf node. If

truncation occurs all of the subtree children of the node are

discarded. This is a very severe mutation, especially if the

decision to truncate occurs at the root of the tree, and as such

it was applied at a very low probability. One way to decrease

the severity of the truncation mutational operator would be

to consider the root node of a tree for truncation at a very

low (almost 0) probability and then linearly increasing this

probability as traversal proceeds to every successive level of

the tree.

3) Shrinking: For every non-terminal, decide whether or

not it should be replaced by one of its children. If the decision

is to replace, then randomly decide which child subtree will

replace the node. All sibling subtrees are discarded. This is

also a form of truncation but is much less severe than “true”

truncation.

4) Node altering: Decide if any aspect of a node should

change. Changes include 1.) selecting a different attribute

for the current node to split upon, 2.) changing relational

operators, 3.) changing the type of the node from terminal to

non-terminal (growth by one level) or vice-versa (truncation).

5) Physical swap: Change the structure of the tree by

swapping the subtrees or children of nodes decided upon

probabilistically.

6) Logical swap: Change the conditional operators of

probabilistically selected nodes for example change a “<”

relation to a “>=” relation.
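As one example, the shrinking operator above can be sketched as follows; the tuple-based tree encoding ("node" with a condition and two subtrees, or "leaf" with a label) is an assumption made for brevity.

```python
import random

def shrink(tree, probability, rng=random):
    """Probabilistically replace a non-terminal with one of its own
    children, discarding the sibling subtree (a mild form of truncation)."""
    if tree[0] == "leaf":
        return tree
    _, condition, true_sub, false_sub = tree
    if rng.random() < probability:
        # Replace this node with a randomly chosen child; the sibling is lost.
        survivor = true_sub if rng.random() < 0.5 else false_sub
        return shrink(survivor, probability, rng)
    return ("node", condition,
            shrink(true_sub, probability, rng),
            shrink(false_sub, probability, rng))
```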

E. Selection

Selection is an important operator in evolutionary com-

puting. Its application directly influences the direction of the

search process. Selection is applied to determine the individ-

uals within a population which will participate in crossover

as well as the individuals from the current population and

the offspring to survive to the successive generation. For this

study elitism and tournament selection were applied.

Elitism dictates that the N best entities from the current

population and the offspring will survive to the successive

generation [5]. Elitism enforces specialization when N is set

to a high number but prevents random searching by holding

on to the best solutions found thus far. For AGP, N varied

over time as performance increased or decreased. Adaptive

elitism is discussed in Subsection III-G.

Tournament selection randomly chooses M individuals

from a supplied set and then selects the best individual

from this subset. The parameter M is referred to as the

tournament size and is expressed as a percentage of the

population. The larger M becomes the more tournament

selection tends to elitism while smaller M values will tend

more to random selection (and ultimately random searching).

For AGP, tournament selection was used to select parents

for crossover and to select the remainder of the successive

population after elitism had been applied from the remaining

parents and offspring. The tournament size was set to 30%

of the population size.
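Tournament selection as described above can be sketched as follows, with the tournament size expressed as a fraction of the population (30% in this study) and fitness minimized.

```python
import random

def tournament_select(population, fitness, tournament_fraction=0.3, rng=random):
    """Draw M individuals at random and return the fittest of the sample.
    Larger fractions tend toward elitism, smaller ones toward random selection."""
    size = max(1, int(len(population) * tournament_fraction))
    contestants = rng.sample(population, size)
    return min(contestants, key=fitness)  # minimization: lower is better
```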

F. Environment change detection

Change detection was implemented in AGP by keeping

track of the best fitness found so far. Because of elitism

the best fitness should never deteriorate as long as the

environment remained static and no new data was available.

Because an environment can change abruptly (in which

case exploration should be favored over exploitation) or



progressively (which requires only slight modifications to

current best solutions) it is not always clear which control

parameter settings would yield satisfactory results without

knowing the type of change the environment underwent. A

change in the environment may affect the fitness in one of

two ways: 1.) The fitness may improve, in which case it

could be assumed that the newly available data describe the

same trends as data seen before and no reaction strategy is

necessary; 2.) The fitness may deteriorate, in which case it

may be assumed that the newly available data describe new

trends. The latter scenario requires that exploration should

start afresh while lessening exploitation.
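The detection rule can be sketched as follows: under full elitism the best fitness can only deteriorate (here, increase, since fitness is minimized) when the environment has changed.

```python
class ChangeDetector:
    """Track the best fitness found so far; a deterioration signals
    an environment change that warrants a reaction strategy."""
    def __init__(self):
        self.best_so_far = None

    def update(self, current_best_fitness):
        """Return True if a deteriorating environment change is detected."""
        if self.best_so_far is None or current_best_fitness <= self.best_so_far:
            self.best_so_far = current_best_fitness
            return False
        # Fitness got worse: the newly available data describe new trends.
        self.best_so_far = current_best_fitness
        return True
```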

G. Reaction to change and adaptive control parameters

Control parameter settings tend to be problem specific.

Some problems require very high mutation rates initially to

enforce exploration and then small mutational probabilities as

the population converges on a solution. A major disadvantage

to having static control parameters in dynamic environments

is that a drastic change within the environment will require

high crossover and mutational probabilities to restart explo-

ration while a negligible change will require small probabil-

ities to refine solutions. One approach to counteract the need

for deciding upon control parameter settings is to have the

parameters adjust dynamically to find a suitable combination

of settings [10]. These adaptive control parameters are what

distinguishes AGP from GP. The adaptive control parameters

implemented for AGP are as follows:

1) Elitist proportion: This control parameter dictates the

number of best solutions found thus far to survive to the

successive generation. In the case of fitness improving this

parameter was increased by 0.01, with an upper bound of 1 to

enforce exploitation and propagate the genetic material of the

best fit entities into future generations. If fitness deteriorated

exploration was indirectly jump-started by decreasing the

probability of fit solutions to survive by 0.01, with a lower

bound of 0.1, rather opting for more newly generated genetic

material to enter a population in the hopes of finding a new

fit individual within the search space.

2) Crossover probability: Crossover was the only mech-

anism for generating offspring to be mutated later on. To

spare computational effort the crossover rate was lessened

by 0.01, where the lower bound for this probability was

set to 0.1, while the evolutionary process was performing

to a satisfactory degree (i.e. the fitness of the best solution

improves). As soon as fitness deteriorated the probability

of crossover was linearly increased by 0.01, with an upper

bound of 0.9, to generate more solutions.
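The two bounded updates above can be sketched as a single clamped step; the step size (0.01) and the bound pairs (0.1 to 1.0 for the elitist proportion, 0.1 to 0.9 for the crossover probability) are those reported in the text.

```python
def clamp(value, lo, hi):
    return max(lo, min(hi, value))

def adapt_parameters(elitist, crossover, fitness_improved):
    """On improvement: exploit (more elitism, less crossover).
    On deterioration: explore (less elitism, more crossover)."""
    if fitness_improved:
        elitist = clamp(elitist + 0.01, 0.1, 1.0)      # keep more elites
        crossover = clamp(crossover - 0.01, 0.1, 0.9)  # spare computational effort
    else:
        elitist = clamp(elitist - 0.01, 0.1, 1.0)      # admit new genetic material
        crossover = clamp(crossover + 0.01, 0.1, 0.9)  # generate more solutions
    return elitist, crossover
```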

3) Mutational probabilities: Mutation injects new genetic

material into a population of solutions. Because high muta-

tion rates are required for abrupt changes in the environment

while small rates would suffice for gradual changes, the

mutation rates were set to be cyclic. As soon as a change

in the environment was detected, a random number in the

range [0, 1) was added to the current probability of mutation

and then scaled to the range [0, 1) if the new probability

exceeds 1 (in other words a mutation rate of 100%). This

was done for all 3 mutational probabilities as discussed

earlier. This implies that the algorithm continued to modify

the probabilities of mutation until it succeeded in finding a

good combination of such probabilities which would generate

more fit entities in successive generations.
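A sketch of the cyclic update; "scaled to the range [0, 1)" is interpreted here as a wrap-around, which the paper does not spell out, so treat the exact scaling as an assumption.

```python
import random

def cycle_mutation_rate(rate, rng=random):
    """Add a uniform random number in [0, 1) to the mutation probability,
    wrapping back into [0, 1) if the result reaches or exceeds 1."""
    rate += rng.random()
    if rate >= 1.0:
        rate -= 1.0
    return rate
```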

4) Culling: Culling is a mechanism to jump start the

exploration capability of an evolutionary algorithm. It was

applied by removing a portion of the worst solutions from

a population and replacing these with randomly generated

solutions. Replacement solutions were constructed by gener-

ating a pool of trees of size 1, only consisting of terminal

nodes as the root nodes, and then applying the mutational

operators to these solutions, afterwards inserting them into

the population.

IV. EXPERIMENTAL APPROACH

This section briefly describes the control parameter set-

tings used for the experiments. The control parameters

for both GP and AGP as well as GD are discussed.

For the implementation, the Computational Intelligence

Library (CIlib), available at http://sourceforge.net/projects/cilib/,

was utilized and only adapted to incorporate GP and AGP

as it already provided support for GD.

A. Control parameter settings for GP and AGP

Table I shows the parameter settings for both the GP and

AGP algorithms. These parameter settings were determined

through experimentation and were found to deliver the best

results among the combinations tested.

TABLE I
GP AND AGP CONTROL PARAMETERS AND SETTINGS

Control parameter        GP           AGP
Population size          30           30
Mutation probabilities   50%          Cyclic
Crossover probability    50%          Variable
Elitist proportion       100%         Variable
Culling proportion       Not applied  50%

B. Gradient Descent

For GD the learning rate and momentum were both set

to 0.5 for all problems with only the number of hidden-

and output units differing among problems. For both the

hyperplanes and spheres data sets (discussed below) the

number of output units was set to one functioning as a “yes-

no” unit, while in the case of the thresholds data set three

output units were used (one for every class). The number of

hidden units utilized was 6, 4 and 3 for the hyperplanes-,

spheres- and threshold data sets respectively. The parameter

settings for GD were determined through experimentation.

V. THE DATA SETS

To test the proposed approach, six artificial data sets were

generated with increasing levels of complexity. This section

gives the details on how this was achieved.



For each data set a set of static points was generated

and these points were then classified as described in the

subsequent sections. The classification of a point in space

may change over time and it is up to the algorithm to

compensate for these changes in classification. To generate

each data set a total of 1 000 points were randomly generated.

These points were then classified using the decision bound-

aries as described below, inserting the patterns (points with

their classifications) in the data set, changing the decision

boundaries, reclassifying the points to obtain the next 1 000

patterns, and so on, until the data set contained 20 000

patterns in total.

A. Double thresholds

The double threshold data set consists of three classes sep-

arated by two vertical lines in Cartesian space. The thresholds

can be considered as the two lines x = threshold1 and

x = threshold2. This implies that the y-values of each

point do not have an influence on the classification of the

point, only the x-values. The classification of a point P is

determined as:

Classification(P) =
    class1, if Px ≤ threshold1
    class2, if Px ≥ threshold2
    class3, otherwise

The classification of a point can change over time by al-

lowing the threshold values to change, subject to the constraint

that the thresholds may not swap around so that the initially

larger threshold is always greater than or equal to the smaller

threshold, that is, threshold1 ≤ threshold2.
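The boundary above can be sketched directly: only the x-value of a point decides its class, with threshold1 ≤ threshold2.

```python
def classify_threshold(px, threshold1, threshold2):
    """Classify a point by its x-value against two vertical thresholds."""
    assert threshold1 <= threshold2, "thresholds may not swap around"
    if px <= threshold1:
        return "class1"
    if px >= threshold2:
        return "class2"
    return "class3"
```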

B. Spheres

The objective of this data set is to determine whether

points distributed in space fall within a predefined sphere or

outside of the sphere, wherever the sphere might be in the

search space. The equation for a sphere in D-dimensional

space is given by

(Sr)² = Σ_{i=1..D} (xi − ci)²

where Sr is the radius, ci for i ∈ {1, .., D} is the center

of the sphere and xi for i ∈ {1, .., D} is given as a point

on the surface of the sphere. For this data set, D was set to

3. Point coordinates are substituted into the sphere equation

and the point's distance Pr from the center is calculated and compared to Sr. The

classification of a point P is then determined by:

Classification(P) =
    1, if Pr ≤ Sr
    0, otherwise

The sphere is allowed to move around in space by allowing

the center point to shift. Additionally, the sphere is allowed

to expand and contract by changing the radius Sr by adding

a uniformly distributed random number in the range [0, 1) and then scaling the new Sr to the range [0, 1) should it

exceed 1.
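The classification rule can be sketched as follows, computing the point's distance from the center and comparing it to the radius (D = 3 in this study, though the function works for any D):

```python
import math

def classify_sphere(point, center, radius):
    """Class 1 if the point lies inside or on the sphere, class 0 otherwise."""
    distance = math.sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))
    return 1 if distance <= radius else 0
```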

C. D-dimensional Hyperplanes

The hyperplanes data set utilized in this study was adapted

from the data set used by Tsymbal et al. [15]. A hyperplane

is given by the equation:

H(x) = Σ_{i=1..D−1} (ai·xi) + c

where ai for i ∈ {1, .., D − 1} represent the coefficients,

xi for i ∈ {1, .., D − 1}, are components of x and c

is a constant. For the purposes of this study D was set

to 10 yielding a hyperplane in 10-dimensional space. The

hyperplane separates two classes of D-dimensional points.

The classification of a point is either 0 or 1 depending on

which side of the hyperplane the point lies. The classification

for a point P is determined by:

Classification(P) =
    0, if Pd ≤ Hd
    1, otherwise

where Pd is the Dth component of the point P and Hd is

obtained by substituting the first D − 1 components of point

P into the hyperplane equation.

The hyperplane is allowed to move around in space

by changing ai for i ∈ {1, .., D − 1} (i.e. changing the

orientation of the plane) and c and then recalculating the

classifications of the points using the new hyperplane equa-

tion. The coefficients ai for i ∈ {1, .., D−1} and the constant

c were changed by replacing their previous values with

uniformly distributed random numbers in the range [0, 1).
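The rule can be sketched as follows (shown for arbitrary D; the study used D = 10, giving 9 coefficients):

```python
def classify_hyperplane(point, coefficients, c):
    """point has D components; coefficients has D-1 entries. The first D-1
    components are substituted into the hyperplane equation to obtain Hd,
    which is compared to the point's D-th component Pd."""
    hd = sum(a * x for a, x in zip(coefficients, point[:-1])) + c
    pd = point[-1]
    return 0 if pd <= hd else 1
```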

D. Simulating dynamic environments

Windowing was applied to simulate a dynamic environ-

ment. In this scenario the patterns in the data set were

assumed to be ordered in the same way in which they

were generated. Two types of dynamic environments were

simulated namely abrupt change and progressive change.

1) Abrupt change: An abrupt change in the environment

was simulated by constructing a window of size 1 000 and

moving the window by 1 000 patterns every 100 iterations of

the algorithm. Using a window size of 1 000 ensured that the

changes in the environment corresponded with the decision

boundary changes in the data set. A total of 2 000 iterations

were required to move through the entire data set of 20 000

patterns.

2) Progressive change: To simulate progressive change

a window size of 500 was used, sliding the window 200

patterns every 100 iterations. The implication of this is that

points may become separated by more than one or two

decision boundaries at certain times during the evolutionary-

or training process (depending on the data set). New decision

boundaries appear and old boundaries disappear gradually

over time. Windowing over each data set required 9800

iterations to utilize all the patterns.
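Both schemes can be expressed as one window function; the parameter names are illustrative. With 20 000 patterns, abrupt change uses a window of 1 000 shifted by 1 000 patterns every 100 iterations, while progressive change uses a window of 500 shifted by 200 patterns every 100 iterations, so consecutive progressive windows overlap by 300 patterns.

```python
def window_for_iteration(iteration, window_size, shift, change_every=100):
    """Return the (start, end) slice of the data set visible at this iteration.
    The window jumps by `shift` patterns every `change_every` iterations."""
    start = (iteration // change_every) * shift
    return start, start + window_size
```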



VI. RESULTS

Figures 1, 2 and 3 show the results obtained by averaging

the recorded classification errors for every 10 iterations

over 30 independent simulations for each algorithm/problem

combination. The purpose of Figures 1, 2 and 3 is to contrast

how well the three algorithms (GP, AGP and GD) could track

and recover from changes within the environment. A sharp

upward spike within the graphs indicates an environmental

change (window shift) where the algorithms were presented

with new unseen patterns to classify. A drop within the

graphs indicates an improvement in the classification capa-

bility of the algorithms. Table II summarizes the results by

averaging the recorded classification errors over the entire

runs (from iteration 0 to the last iteration) for each algorithm.

TABLE II
CLASSIFICATION ERROR WITH STANDARD DEVIATIONS FOR ALL ALGORITHM AND PROBLEM COMBINATIONS

Problem                  GP           AGP          GD
Thresholds progressive   9.6 ± 9.4    8.8 ± 9.2    68.1 ± 22.5
Thresholds abrupt        9.2 ± 7.1    7.4 ± 6.1    69.5 ± 21.7
Spheres progressive      14.7 ± 9.4   14.2 ± 9.2   16.5 ± 9.1
Spheres abrupt           12.5 ± 8.7   11.7 ± 8.0   26.2 ± 8.5
Hyperplanes progressive  22.7 ± 10.2  22.3 ± 9.9   10.5 ± 8.7
Hyperplanes abrupt       20.6 ± 11.7  20.1 ± 11.2  17.2 ± 8.7

A. Thresholds results

In Figures 1(a) and 1(b) it is observed that both GP

and AGP outperformed GD significantly, except for a very

short initial setup time where GD performed better than

GP and AGP. Note that in the case of the progressively

changing environment, there were many more changes and

iterations required to move through the data set of 20 000

patterns, so the scales on the horizontal axes of Figures 1(a)

and 1(b) differ. The figures indicate that GD experienced a

“confusion” state from which it did not recover, whereas GP

and AGP were able to track and recover from changes in the

environment. Comparing the behavior of GP and AGP, it can be seen that the classification error of AGP in most cases dropped faster after a change in the environment than that of GP. This improved performance is also reflected in the

lower average of AGP as shown in Table II.

B. Spheres results

From Figures 2(a) and 2(b) as well as Table II it is evident

that both GP and AGP performed far better than GD on

the spheres data set, often reaching a classification error of

zero. Apart from having a lower average classification error,

both GP and AGP recovered at a more rapid rate after a

change in the environment than GD, as is observed in Figures

2(a) and 2(b). Once again it is also clear that AGP slightly

outperformed GP in terms of recovery rate and classification

error, yielding a smaller average classification error and a

quicker drop in the graph after a change in the environment.

Fig. 1. Thresholds data set changing the environment every 100 iterations: (a) progressive change, (b) abrupt change

C. Hyperplanes results

From Figures 3(a) and 3(b) it is clear that GD performed

better in both the progressive and abrupt environments for

the hyperplanes problem. For the abrupt environment, shown

in Figure 3(b), both variants of GP maintained a better

classification error between iterations 200 and 800, after which

GD once again started to outperform GP and AGP. This

superior performance of GD is also reflected by a lower

average classification error over all iterations shown in Table

II.

D. Discussion

With regard to GP and AGP, the behavior of recovery

after changes in the environment is due to the inherent

evolutionary nature of the algorithms, subject to the “survival

of the fittest” rule. Furthermore, the figures also show that

AGP had a tendency to recover quicker from change than GP,

an indication that adaptive parameters did indeed function

well in dynamic environments.

From the results it is clear that both GP and AGP struggled

on average with the hyperplanes data set in both the progressively and abruptly changing environments.

Fig. 2. Spheres data set changing the environment every 100 iterations: (a) progressive change, (b) abrupt change

The situation was, however, completely reversed for the thresholds and spheres data sets, where both GP and AGP showed better performance on the graphs and a lower average classification error than GD.

With regard to the thresholds problem, GP and AGP were

consistent in terms of recovery from changes within the

environment. The results for the thresholds data set also

indicated that both GP and AGP handled multiple decision

boundaries appearing and disappearing over time much better

than GD. This was mainly due to the fact that GP and

AGP were able to evolve variable sized decision trees to

incorporate new decision boundaries as well as retaining

older and still relevant boundaries. Through the evolutionary

process old boundaries were phased out and replaced by

information about the new boundaries.

One possible explanation for GP and AGP’s poor perfor-

mance on the hyperplanes problem might be that both GP and

AGP evolved trees which fitted the current data (the patterns

in the window). Being a 10-dimensional problem, larger trees

were required. Larger trees have the tendency to “remember”

the data from a previous state of the environment more so

than smaller trees. As the tree pruning mutational operators

(a) Progressive change

(b) Abrupt change

Fig. 3. Hyperplanes data set changing the environment every 100 iterations

(truncation and shrinking) were responsible for making the

trees “forget” as it were, a possible solution to remedy

the performance of GP and AGP might be to apply these

operators more liberally when a change in the environment

is detected in the case of high dimensional problems.
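A minimal sketch of this remedy is given below, using a toy tuple-based tree representation and assumed rates; none of this is the paper's actual implementation:

```python
import random

# Assumed rates (illustrative only): prune much more aggressively for a
# few generations after an environment change is detected, then fall
# back to the normal rate.
BASE_PRUNE_RATE = 0.05
BOOSTED_PRUNE_RATE = 0.40
BOOST_GENERATIONS = 5

def prune_rate(generations_since_change):
    """Shrink/truncation mutation rate for the current generation."""
    if generations_since_change < BOOST_GENERATIONS:
        return BOOSTED_PRUNE_RATE
    return BASE_PRUNE_RATE

def maybe_shrink(tree, generations_since_change, rng=random):
    """With the current prune rate, replace an internal node
    (condition, left, right) by one of its subtrees; leaves pass through."""
    if isinstance(tree, tuple) and rng.random() < prune_rate(generations_since_change):
        return rng.choice(tree[1:])  # shrink: keep only one child subtree
    return tree
```

Applied repeatedly right after a change, a boosted prune rate like this would cut large trees back toward their still-relevant substructure, forcing stale decision boundaries out faster.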

As opposed to the hyperplanes problem, which required

larger trees for classification in 10 dimensions, the spheres

problem consisted of only three dimensions. This allowed older decision boundaries to be phased out more quickly by the tree shrinking operators, yielding an improvement in performance over the hyperplanes problem.

The behavior of GP and AGP was consistent in that an immediate degradation occurred after changes in the environment, after which a quick recovery followed.

GD struggled to recover from change and was subject to a

“confusion state”: the figures show that its recovery was more gradual and took longer than that of GP and AGP.

The most severe case of this confusion state is depicted in

Figure 1(b) where GD abruptly degraded at iteration 200

and never recovered to its previously good classification error, unlike GP and AGP.

The ideal classification error over all problems was 0%.


GD only managed to obtain a classification error of 0% in the

early stages of the progressively changing hyperplanes data

set but not for any of the other experiments. GP and AGP

both reached classification errors of 0% for the progressive spheres, progressive thresholds, abrupt spheres, and abrupt thresholds experiments, a feat that was unmatched by GD.

Even though GP and AGP struggled with the progressively

changing hyperplanes problem (Figure 3(a)), both algorithms yielded classification errors comparable to those obtained by GD.

The results shown in Table II indicate that on average

both GP and AGP outperformed GD, where AGP yielded

a slightly better classification error over GP, making AGP

the best algorithm for the thresholds and spheres problems.

GD outperformed both GP and AGP only for the hyperplanes

problem.

AGP more often than not performed better than GP in

terms of recovery from change. From the figures it can be

seen that AGP reached a lower classification error more

rapidly than GP in most of the environmental changes. This

is an indication that the adaptive parameters of AGP did indeed contribute to a lower classification error in dynamic environments.

VII. CONCLUSIONS

The aim of this paper was to compare the behavior

of GP and AGP to that of GD in dynamic environments

and investigate the feasibility of using GP and AGP for

abruptly and progressively changing environments. In ad-

dition, modifications to the GP algorithm were suggested

yielding AGP, an adaptive version of GP. Three data sets,

namely thresholds, spheres and hyperplanes, were utilized

for the experiments reported on in this paper. Two types of

dynamic environments, namely progressively and abruptly

changing environments, were modeled by windowing over

the data sets with different window sizes and window shift

sizes.

GP and AGP were both able to track changes in a

dynamic environment. Compared to GD, both GP and AGP

performed better when considering the averaged results for

the thresholds and spheres data sets. GD only significantly

outperformed GP and AGP on the hyperplanes data set for

a progressively changing environment where the problem

dimension is significantly larger than for the other two data

sets. AGP, using adaptive control parameters and culling, also

provided a slight performance gain over GP, making it the better contender for dynamic environments.

Future research will be aimed at investigating more in-

telligent adaptive control parameters for AGP as opposed to

simple cyclic parameters. An investigation will be conducted

by having the adaptive parameters adjust as a function of the

fitness of individuals within a population of solutions and

comparing against the changes of the cycling parameters

over time. On completion of the investigation on more

intelligent adaptive parameters for the AGP approach, the

next step would be to apply AGP to real world problems

and investigate its performance in real world environments.
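To make the contrast concrete, the two parameter schedules might be sketched as follows. Both functions and their constants are illustrative assumptions, not published code:

```python
import math

def cyclic_rate(iteration, low=0.1, high=0.9, period=100):
    """Simple cyclic schedule (as used by AGP here): oscillate a control
    parameter, e.g. a mutation rate, between low and high over a fixed period."""
    phase = (1.0 + math.sin(2.0 * math.pi * iteration / period)) / 2.0
    return low + (high - low) * phase

def fitness_driven_rate(best_error_percent, low=0.1, high=0.9):
    """Fitness-driven alternative for future work: raise the parameter when
    the best individual classifies poorly, lower it as error approaches 0%."""
    return low + (high - low) * (best_error_percent / 100.0)

print(round(cyclic_rate(25), 2), round(fitness_driven_rate(50.0), 2))
```

The appeal of the fitness-driven variant is that exploration would increase exactly when a change degrades the population, rather than on a fixed clock that may be out of phase with the environment.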

An evolutionary hibernation phase to spare computational resources was not put in place during this study; further studies will investigate the effects of having a hibernation phase for AGP. In the reported results, neither the rules themselves nor their quality (crispness, number of rules, etc.) as generated by GP and AGP were considered, only their classification accuracy. Future research will

also focus more on a comparative study on the differences

in the rule sets generated by both GP and AGP.

ACKNOWLEDGMENT

The authors would like to thank the CIlib framework

developers. This work was supported in part by the National

Research Foundation.
