Adaptive Genetic Programming for Dynamic Classification
Problems
M. Riekert, K. M. Malan, Member, IEEE and A. P. Engelbrecht, Senior Member, IEEE
Abstract— This paper investigates the feasibility of using
Genetic Programming in dynamically changing environments to evolve decision trees for classification problems and proposes a new version of Genetic Programming called Adaptive Genetic Programming. It does so by comparing the performance, or classification error, of Genetic Programming and Adaptive Genetic Programming to that of Gradient Descent in abruptly and progressively changing environments. To cope with dynamic environments, Adaptive Genetic Programming incorporates adaptive control parameters, variable elitism and culling. Results show that both Genetic Programming and Adaptive Genetic Programming are viable algorithms for dynamic environments, yielding a performance gain over Gradient Descent for lower-dimensional problems even with severe environment changes. In addition, Adaptive Genetic Programming performs slightly better than Genetic Programming, due to faster recovery from changes in the environment.
I. INTRODUCTION
Adapting to changing environments is a key element to
the survival of any species. It is therefore prudent to investigate
and model this adaptive process, since very few
real-world problems remain static over time. One model
for this adaptive behavior may be found in the evolutionary
computing paradigm of artificial intelligence.
Genetic Programming (GP) was initially developed to
evolve computer programs by Koza [1]. GP is an evo-
lutionary search process which models the same survival
of the fittest paradigm as first proposed by Darwin in his
groundbreaking work “On the Origin of Species” [2]. GP differs
from other evolutionary algorithms in that it evolves trees
as opposed to solution vectors. One of the advantages of
evolving trees is that variable sized entities may be utilized.
This is especially convenient for dynamic environments as
the size of proposed solutions may vary over time.
GP is used in this paper as an evolutionary model for
evolving decision trees. This paper presents results obtained
by modeling the adaptive behavior of species in changing
environments and argues that doing so is a feasible way
of obtaining solutions in these environments. The results
indicate that even without modification to the standard GP
algorithm, it still yields acceptable results. However, a slight
performance gain may be obtained by having adaptive con-
trol parameters which are modified after a change in the
environment is detected. This version of the GP algorithm is
named Adaptive Genetic Programming (AGP). Both versions
of GP are compared to the classification capability of neural
networks trained using Gradient Descent (GD).
The authors are with the Department of Computer Science, University of Pretoria, Lynnwood Road, Hillcrest, Pretoria, South Africa, 0002 (phone: 012 420 5242; email: mriekert/engel/[email protected]).
The remainder of the paper is structured as follows: Sec-
tion II gives a brief overview of the algorithms and structures
used for this study, Section III discusses the implementation
details of GP and AGP, Section IV discusses control param-
eter settings for the experiments, Section V describes how
the test data for the experiments was generated, Section VI
presents and discusses the results obtained during the study,
and the paper ends with the conclusion in Section VII.
II. BACKGROUND
Classification is a common real-world activity: the assignment
of entities or patterns to predefined classes. Common
examples of classification include
determining fraudulent behavior in banking data or classifying
patient symptoms to a particular disease. Numerous artificial
intelligence techniques have been successfully applied to
classification problems, some examples of which are Quin-
lan’s C4.5 [3], Clark and Niblett’s CN2 [4] and neural
networks [5].
Neural networks have been shown to perform competi-
tively for classification problems [6]. Gradient Descent (GD)
is a technique used to train neural networks. It consists of two
phases, a feed forward phase and an error back propagation
phase [5]. During the feed forward phase input patterns
are presented to the neural network and the outputs of the
network are recorded for all the patterns. The mean squared
error is then calculated and an error signal is fed back from
the output layer to the hidden layer, adjusting the weights in
the network relative to how much a neuron has contributed
to the error. GD is used as a benchmark for comparing the
classification results of the GP and AGP approach proposed
in this paper as GD has been shown to function in dynamic
environments.
Decision trees are logical structures for drawing classifica-
tion boundaries between data patterns [5]. Each non-terminal
tree node represents a question to be answered, the outcome
of which will determine to which of the child nodes the
traversal procedure will advance. Terminal nodes represent
the classes of a data set. A tree is traversed for a particular
pattern, starting at the root node, by using the attribute values
of the pattern to answer the question posed by the non-
terminal nodes. An example question might be something
along the lines of X < 0.5 whereupon the pattern’s attribute
value corresponding to the X attribute is compared to 0.5.
If the inequality is satisfied, traversal proceeds down the
“true”-branch of the current node or down the “false”-branch
otherwise. This continues until a terminal node is reached.
The class represented by the terminal node will then be
978-1-4244-2959-2/09/$25.00 © 2009 IEEE
assumed as the classification of the pattern that led to that
particular terminal node.
Decision trees have been successfully evolved using GP
to solve classification problems [7][8]. Brameier and Banzhaf
[7] use the approach of flattening the decision tree into a
vector representation, which is then evolved. Their approach
yielded very good results when compared against a neural
network trained for classification of a medical data set. In
work done by Au et al. [8], GP is also used to successfully
evolve decision trees for churn prediction in
static environments. The approach presented in this paper, as
opposed to the work done by Au et al., focuses on keeping
the decision tree logical structure intact and on dynamically
adapting the algorithm to suit dynamic environments.
GP is becoming an ever more popular technique to use in
dynamic environments [9][10][11][12][13]. Wagner et al. [9]
use GP to evolve regression trees to forecast time series data.
Augmentations to GP in terms of dynamic parameters have
been investigated by Yin et al. [10]. The focus of their work
was to investigate adaptive mutational and crossover proba-
bilities to break stagnation of the evolutionary process, where
stagnation is a highly undesirable phenomenon in evolutionary
computing. The AGP approach presented in this paper makes
use of similar adaptive control parameters but extends this to
include other adaptive features such as variable elitism and
culling.
III. APPROACH TO ADAPTIVE GENETIC PROGRAMMING
This section describes the implementation details for both
GP and AGP. The main difference between GP and AGP is
that AGP makes use of adaptive control parameters, adaptive
elitism and culling whilst GP has static parameters, no culling
and full elitism where full elitism corresponds to “keep best
reproduction” as investigated by Wiese and Goodwin [14].
These concepts will be discussed in subsequent sections.
A. Composite fitness function
The fitness function in GP is what drives the search
process to desired solutions. It is a means of quantifying
the quality of solutions within a population of individuals
and specifies the attributes desired for an optimal solution.
Selection strategies more often than not utilize the fitness
function when selecting individuals for activities such as
crossover, survival to successive generations and possibly
mutation.
For the implementation in this study the fitness function
was decomposed into a number of components, translating
the overall problem into a multi-objective problem. These
components were weighted and then linearly combined to
calculate the overall composite fitness of a proposed solution.
The fitness function can be seen as a minimization function
where entities with a lower fitness value are deemed more
fit and preferred over those with higher fitness values. The
components were as follows:
1) Accuracy: Undeniably the most important component
of the fitness function, accuracy determines the classification
error that a proposed solution makes on a supplied training
set of patterns. Accuracy was expressed as a percentage of
misclassified patterns over the training set. It is important
that trees make the smallest error possible because trees
with high errors serve no purpose. The accuracy of a tree
was calculated by comparing the classification proposed by
a tree for a particular pattern to the actual classification of
the pattern. For each misclassified pattern the total number of
misclassified patterns was incremented. After all the patterns
in the training set were processed the accuracy for a tree T
was calculated as:
Accuracy(T) = (NumberMisclassifiedPatterns / NumberPatternsInTrainingSet) × 100

which yields the percentage of misclassified patterns. The
weight for accuracy was set to 1, that is, the accuracy
contributes the full 100% of its value to the final fitness
calculation.
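As a concrete illustration, the accuracy component can be sketched as follows (a minimal Python sketch; `classify` and the `(attributes, label)` pattern shape are hypothetical names, not the authors' implementation):

```python
def accuracy(classify, patterns):
    """Percentage of misclassified patterns over the training set.

    classify: hypothetical callable mapping an attribute vector to a
    class label (e.g. the traversal of an evolved decision tree).
    patterns: list of (attributes, label) pairs.
    Lower values are fitter, matching the paper's minimisation view.
    """
    misclassified = sum(1 for attrs, label in patterns
                        if classify(attrs) != label)
    return misclassified / len(patterns) * 100
```

With the accuracy weight set to 1, this value enters the composite fitness unscaled.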
2) Identical Subtrees: Identical subtree fitness was calcu-
lated as the number of identical child nodes that each node
within a decision tree possessed. Identical subtrees lead to
redundant attribute tests within decision trees. The identical
subtrees component was weighted with 0.5 to prevent this
component from dominating the overall fitness calculation,
rather allowing for the accuracy component to be the most
important feature in the fitness calculation.
3) Repeated attributes: More often than not, multiple tests
on the same attribute within a decision tree contribute to
redundancy within a tree. Preferably, attribute tests should
appear only once within a tree and the condition described
by the node should be such that it will not be necessary to
test on that attribute again. If another test on the attribute is
required then the subsequent test should be a specialization
on a previous test, for example: x < 5 initially and then later
on x < 4 as opposed to x < 4 and then x > 5. In the latter
scenario the test on x > 5 will never be reached, rendering
this part of the decision tree redundant or dead. The weight
for the repeated attributes component in the fitness function
was set to 0.5 for the same reason as the identical subtree
component.
4) Repeated conditions: Repeated conditions calculated
the number of identical conditions within a subtree of nodes.
Repeated conditions is another form of redundancy within
decision trees. Similarly to the aforementioned structural
components (identical subtrees and repeated attributes), the
repeated conditions component was weighted in the final
calculation with a weight of 0.5.
5) Tree size: Larger trees will more often than not overfit
the training data. Large trees are also cumbersome to work
with and computationally expensive to traverse. The fitness
function was designed such that if two trees render the same
accuracy on the training data, the smaller of the two will
be deemed more fit. Restricting tree size in this way also
prevented the unbound growth of trees and ensured that only
2009 IEEE Congress on Evolutionary Computation (CEC 2009) 675
the smallest most accurate trees would survive to successive
generations. To allow the growth of trees and disallow the
search process to be dominated by the size of trees, tree size
as a structural component was weighted with a weight of
0.05. This small weight was meant as a “tie breaker” between
two trees with the same accuracy (as is largely the case with
all the aforementioned structural fitness components).
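Putting the five components together, the composite fitness is a weighted linear sum. The sketch below uses the weights stated above; the component values are assumed to be pre-computed numbers for a candidate tree:

```python
# Weights from the text: accuracy 1.0, the three structural
# components 0.5 each, and tree size 0.05 as a tie-breaker.
WEIGHTS = {
    "accuracy": 1.0,
    "identical_subtrees": 0.5,
    "repeated_attributes": 0.5,
    "repeated_conditions": 0.5,
    "tree_size": 0.05,
}

def composite_fitness(components):
    """Linearly combine weighted component values; lower is fitter."""
    return sum(WEIGHTS[name] * value for name, value in components.items())
```

When two trees have equal accuracy, only the smaller-weighted components separate them, so the smaller, less redundant tree wins.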
B. Stopping criteria
As opposed to static data sets or environments, dynamic
environments do not afford the luxury of hard stopping
criteria. Static environments allow for stopping criteria along
the lines of “required fitness reached” or “required number
of generations exceeded”. In truly dynamic environments, on
the other hand, such stopping criteria cannot be set in place.
The evolutionary process must continue to accommodate
possible changes within the environment in the near future.
The closest “stopping” criterion might be expressed in terms
of a “hibernation” phase where the evolutionary process is
paused as soon as some required fitness is reached. During
the hibernation phase, continual environment change testing
is conducted and as soon as the environment changes, evolu-
tion is restarted. This hibernation phase may be implemented
by halting crossover and mutation and only conducting
fitness evaluations on the most fit entity found thus far.
C. Crossover strategy
One-point crossover was used. After selecting two parent
trees offspring were constructed by swapping randomly se-
lected subtrees within the parents to create two recombined
versions of the parent trees.
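This crossover can be sketched over a minimal tree representation, where a node is either a class label (leaf) or a dict with a condition and two child branches; the `nodes` and `replace` helpers are illustrative, not the authors' implementation:

```python
import random

def nodes(tree, path=()):
    """Enumerate (path, subtree) pairs for every node in a tree."""
    yield path, tree
    if isinstance(tree, dict):  # non-terminal: recurse into branches
        yield from nodes(tree["true"], path + ("true",))
        yield from nodes(tree["false"], path + ("false",))

def replace(tree, path, subtree):
    """Return a copy of tree with the node at path replaced."""
    if not path:
        return subtree
    out = dict(tree)
    out[path[0]] = replace(tree[path[0]], path[1:], subtree)
    return out

def crossover(parent_a, parent_b, rng=random):
    """One-point crossover: swap two randomly selected subtrees."""
    path_a, sub_a = rng.choice(list(nodes(parent_a)))
    path_b, sub_b = rng.choice(list(nodes(parent_b)))
    return (replace(parent_a, path_a, sub_b),
            replace(parent_b, path_b, sub_a))
```

Because `replace` copies nodes along the path, the parent trees are left intact, as required for elitism.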
D. Mutation
Mutation is a mechanism to inject new genetic material
into a population of solutions. It is applied probabilistically to
the offspring generated during the crossover operation. There
are three mutational probabilities to consider when evolving
decision trees using GP: Firstly, the probability that a tree
will be selected for mutation, secondly, the probability that
a specific mutational operator will be applied to the selected
tree and lastly, the probability for each node in the tree to
mutate. The mutational operators implemented in this study
were as follows:
1) Growing: For each terminal node, decide if it should
be altered to a non-terminal or internal node and if so,
add the appropriate number of children (terminal nodes
which will recursively be considered for growing). To prevent
unbounded growth the probability of mutation was linearly
decreased for every successive level of nodes added to the
tree.
2) Truncation: For every internal node probabilistically
decide if the node should be replaced by a leaf node. If
truncation occurs all of the subtree children of the node are
discarded. This is a very severe mutation, especially if the
decision to truncate occurs at the root of the tree, and as such
it was applied at a very low probability. One way to decrease
the severity of the truncation mutational operator would be
to consider the root node of a tree for truncation at a very
low (almost 0) probability and then linearly increasing this
probability as traversal proceeds to every successive level of
the tree.
3) Shrinking: For every non-terminal, decide whether or
not it should be replaced by one of its children. If the decision
is to replace, then randomly decide which child subtree will
replace the node. All sibling subtrees are discarded. This is
also a form of truncation but is much less severe than “true”
truncation.
4) Node altering: Decide if any aspect of a node should
change. Changes include 1.) selecting a different attribute
for the current node to split upon, 2.) changing relational
operators, and 3.) changing the type of the node from terminal to
non-terminal (growth by one level) or vice versa (truncation).
5) Physical swap: Change the structure of the tree by
swapping the subtrees or children of nodes decided upon
probabilistically.
6) Logical swap: Change the conditional operators of
probabilistically selected nodes, for example changing a “<”
relation to a “>=” relation.
E. Selection
Selection is an important operator in evolutionary com-
puting. Its application directly influences the direction of the
search process. Selection is applied to determine the individ-
uals within a population which will participate in crossover
as well as the individuals from the current population and
the offspring to survive to the successive generation. For this
study elitism and tournament selection were applied.
Elitism dictates that the N best entities from the current
population and the offspring will survive to the successive
generation [5]. Elitism enforces specialization when N is set
to a high number but prevents random searching by holding
on to the best solutions found thus far. For AGP, N varied
over time as performance increased or decreased. Adaptive
elitism is discussed in Subsection III-G.
Tournament selection randomly chooses M individuals
from a supplied set and then selects the best individual
from this subset. The parameter M is referred to as the
tournament size and is expressed as a percentage of the
population. The larger M becomes the more tournament
selection tends to elitism while smaller M values will tend
more to random selection (and ultimately random searching).
For AGP, tournament selection was used to select parents
for crossover and to select the remainder of the successive
population after elitism had been applied from the remaining
parents and offspring. The tournament size was set to 30%
of the population size.
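A minimal sketch of tournament selection under these conventions (lower fitness is better; `proportion` is the tournament size as a fraction of the population, 30% in this study):

```python
import random

def tournament_select(population, fitness, proportion=0.3, rng=random):
    """Return the fittest of a randomly sampled subset.

    With proportion near 1 this approaches elitist selection; with a
    tiny proportion it approaches random selection.
    """
    size = max(1, int(len(population) * proportion))
    contenders = rng.sample(population, size)
    return min(contenders, key=fitness)  # minimisation: lower is fitter
```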
F. Environment change detection
Change detection was implemented in AGP by keeping
track of the best fitness found so far. Because of elitism
the best fitness should never deteriorate as long as the
environment remained static and no new data was available.
Because an environment can change abruptly (in which
case exploration should be favored over exploitation) or
progressively (which requires only slight modifications to
current best solutions) it is not always clear which control
parameter settings would yield satisfactory results without
knowing the type of change the environment underwent. A
change in the environment may affect the fitness in one of
two ways: 1.) The fitness may improve, in which case it
could be assumed that the newly available data describe the
same trends as data seen before and no reaction strategy is
necessary 2.) The fitness may deteriorate, in which case it
may be assumed that the newly available data describe new
trends. The latter scenario requires that exploration should
start afresh while lessening exploitation.
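The detection rule can be sketched as a small tracker: under elitism the best (minimised) fitness can only deteriorate if the data in the window has changed, so deterioration signals an environment change:

```python
class ChangeDetector:
    """Flag an environment change when the elitist best fitness
    deteriorates; in a static environment this cannot happen."""

    def __init__(self):
        self.best = float("inf")  # fitness is minimised

    def update(self, current_best):
        changed = current_best > self.best  # deterioration: new trends
        # After a change, restart tracking from the new fitness level.
        self.best = current_best if changed else min(self.best, current_best)
        return changed
```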
G. Reaction to change and adaptive control parameters
Control parameter settings tend to be problem specific.
Some problems require very high mutation rates initially to
enforce exploration and then small mutational probabilities as
the population converges on a solution. A major disadvantage
to having static control parameters in dynamic environments
is that a drastic change within the environment will require
high crossover and mutational probabilities to restart explo-
ration while a negligible change will require small probabil-
ities to refine solutions. One approach to counteract the need
for deciding upon control parameter settings is to have the
parameters adjust dynamically to find a suitable combination
of settings [10]. These adaptive control parameters are what
distinguishes AGP from GP. The adaptive control parameters
implemented for AGP are as follows:
1) Elitist proportion: This control parameter dictates the
number of best solutions found thus far to survive to the
successive generation. In the case of fitness improving this
parameter was increased by 0.01, with an upper bound of 1 to
enforce exploitation and propagate the genetic material of the
best fit entities into future generations. If fitness deteriorated
exploration was indirectly jump-started by decreasing the
probability of fit solutions to survive by 0.01, with a lower
bound of 0.1, rather opting for more newly generated genetic
material to enter a population in the hopes of finding a new
fit individual within the search space.
2) Crossover probability: Crossover was the only mech-
anism for generating offspring to be mutated later on. To
spare computational effort the crossover rate was lessened
by 0.01, where the lower bound for this probability was
set to 0.1, while the evolutionary process was performing
to a satisfactory degree (i.e. the fitness of the best solution
improves). As soon as fitness deteriorated the probability
of crossover was linearly increased by 0.01, with an upper
bound of 0.9, to generate more solutions.
3) Mutational probabilities: Mutation injects new genetic
material into a population of solutions. Because high muta-
tion rates are required for abrupt changes in the environment
while small rates would suffice for gradual changes, the
mutation rates were set to be cyclic. As soon as a change
in the environment was detected, a random number in the
range [0, 1) was added to the current probability of mutation
and then scaled to the range [0, 1) if the new probability
exceeds 1 (in other words a mutation rate of 100%). This
was done for all 3 mutational probabilities as discussed
earlier. This implies that the algorithm continued to modify
the probabilities of mutation until it succeeded in finding a
good combination of such probabilities which would generate
more fit entities in successive generations.
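One reading of this cyclic scheme, treating "scaled to the range [0, 1)" as a wrap-around, can be sketched as follows (the wrap-around interpretation is an assumption):

```python
import random

def cycle_mutation_rate(rate, rng=random):
    """On a detected environment change, add a uniform random number
    in [0, 1) to a mutation probability and wrap the result back
    into [0, 1) if it reaches 1. The text applies this to all three
    mutational probabilities."""
    return (rate + rng.random()) % 1.0
```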
4) Culling: Culling is a mechanism to jump start the
exploration capability of an evolutionary algorithm. It was
applied by removing a portion of the worst solutions from
a population and replacing these with randomly generated
solutions. Replacement solutions were constructed by gener-
ating a pool of trees of size 1, only consisting of terminal
nodes as the root nodes, and then applying the mutational
operators to these solutions, afterwards inserting them into
the population.
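Culling can be sketched as below; `make_random_solution` is a hypothetical factory standing in for the paper's procedure of growing replacements from single-node trees via the mutational operators:

```python
import random

def cull(population, fitness, proportion, make_random_solution, rng=random):
    """Replace the worst `proportion` of the population with freshly
    generated solutions to jump-start exploration (lower fitness is
    better)."""
    n_cull = int(len(population) * proportion)
    survivors = sorted(population, key=fitness)[:len(population) - n_cull]
    return survivors + [make_random_solution(rng) for _ in range(n_cull)]
```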
IV. EXPERIMENTAL APPROACH
This section briefly describes the control parameter set-
tings used for the experiments. The control parameters
for both GP and AGP as well as GD are discussed.
For the implementation the Computational Intelligence
Library (CIlib) (available at http://sourceforge.net/projects/cilib/)
was utilized and only adapted to incorporate GP and AGP,
as it already provided support for GD.
A. Control parameter settings for GP and AGP
Table I shows the parameter settings for both the GP and
AGP algorithms. These parameter settings were determined
through experimentation and delivered the best results
among the combinations tested.
TABLE I
GP AND AGP CONTROL PARAMETERS AND SETTINGS
Control parameter GP AGP
Population size 30 30
Mutation probabilities 50% Cyclic
Crossover probability 50% Variable
Elitist proportion 100% Variable
Culling proportion Not applied 50%
B. Gradient Descent
For GD the learning rate and momentum were both set
to 0.5 for all problems with only the number of hidden-
and output units differing among problems. For both the
hyperplanes and spheres data sets (discussed below) the
number of output units was set to one functioning as a “yes-
no” unit, while in the case of the thresholds data set three
output units were used (one for every class). The number of
hidden units utilized was 6, 4 and 3 for the hyperplanes-,
spheres- and threshold data sets respectively. The parameter
settings for GD were determined through experimentation.
V. THE DATA SETS
To test the proposed approach, six artificial data sets were
generated with increasing levels of complexity. This section
gives the details on how this was achieved.
For each data set a set of static points was generated
and these points were then classified as described in the
subsequent sections. The classification of a point in space
may change over time and it is up to the algorithm to
compensate for these changes in classification. To generate
each data set a total of 1 000 points were randomly generated.
These points were then classified using the decision bound-
aries as described below, inserting the patterns (points with
their classifications) in the data set, changing the decision
boundaries, reclassifying the points to obtain the next 1 000
patterns, and so on, until the data set contained 20 000
patterns in total.
A. Double thresholds
The double threshold data set consists of three classes sep-
arated by two vertical lines in Cartesian space. The thresholds
can be considered as the two lines x = threshold1 and
x = threshold2. This implies that the y-values of each
point do not have an influence on the classification of the
point, only the x-values. The classification of a point P is
determined as:
Classification(P) =
    class1, if Px ≤ threshold1
    class2, if Px ≥ threshold2
    class3, otherwise
The classification of a point can change over time by
allowing the threshold values to change, subject to the
constraint that the thresholds may not swap around: the
initially larger threshold must remain greater than or equal
to the smaller one, that is, threshold1 ≤ threshold2.
B. Spheres
The objective of this data set is to determine whether
points distributed in space fall within a predefined sphere or
outside of the sphere, wherever the sphere might be in the
search space. The equation for a sphere in D-dimensional
space is given by
(Sr)^2 = Σ_{i=1}^{D} (xi − ci)^2
where Sr is the radius, ci for i ∈ {1, .., D} is the center
of the sphere and xi for i ∈ {1, .., D} is given as a point
on the surface of the sphere. For this data set, D was set to
3. Point coordinates are substituted into the sphere equation
and Pr is calculated for the point and compared to Sr. The
classification of a point P is then determined by:
Classification(P) =
    1, if Pr ≤ Sr
    0, otherwise
The sphere is allowed to move around in space by allowing
the center point to shift. Additionally, the sphere is allowed
to expand and contract by changing the radius Sr: a uniformly
distributed random number in the range [0, 1) is added, and the
new Sr is then scaled back to the range [0, 1) should it exceed 1.
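In code, a point's classification follows from comparing its squared distance from the centre to the squared radius (a sketch for the D = 3 case, though it works for any dimension):

```python
def classify_sphere(point, center, radius):
    """Return 1 if the point lies inside or on the sphere, else 0."""
    pr_squared = sum((x - c) ** 2 for x, c in zip(point, center))
    return 1 if pr_squared <= radius ** 2 else 0
```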
C. D-dimensional Hyperplanes
The hyperplanes data set utilized in this study was adapted
from the data set used by Tsymbal et al. [15]. A hyperplane
is given by the equation:
H(x) = Σ_{i=1}^{D−1} (ai xi) + c
where ai for i ∈ {1, .., D − 1} represent the coefficients,
xi for i ∈ {1, .., D − 1}, are components of x and c
is a constant. For the purposes of this study D was set
to 10 yielding a hyperplane in 10-dimensional space. The
hyperplane separates two classes of D-dimensional points.
The classification of a point is either 0 or 1 depending on
which side of the hyperplane the point lies. The classification
for a point P is determined by:
Classification(P) =
    0, if Pd ≤ Hd
    1, otherwise
where Pd is the Dth component of the point P and Hd is
obtained by substituting the first D− 1 components of point
P into the hyperplane equation.
The hyperplane is allowed to move around in space
by changing ai for i ∈ {1, .., D − 1} (i.e. changing the
orientation of the plane) and c and then recalculating the
classifications of the points using the new hyperplane equa-
tion. The coefficients ai for i ∈ {1, .., D−1} and the constant
c were changed by replacing their previous values with
uniformly distributed random numbers in the range [0, 1).
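The hyperplane classification rule can be sketched directly from the definitions above (coefficients apply to the first D−1 coordinates, and the last coordinate is compared to the hyperplane value):

```python
def classify_hyperplane(point, coeffs, c):
    """Class 0 if the last coordinate lies at or below the hyperplane
    value H computed from the first D-1 coordinates, else class 1."""
    h = sum(a * x for a, x in zip(coeffs, point[:-1])) + c
    return 0 if point[-1] <= h else 1
```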
D. Simulating dynamic environments
Windowing was applied to simulate a dynamic environ-
ment. In this scenario the patterns in the data set were
assumed to be ordered in the same way in which they
were generated. Two types of dynamic environments were
simulated namely abrupt change and progressive change.
1) Abrupt change: An abrupt change in the environment
was simulated by constructing a window of size 1 000 and
moving the window by 1 000 patterns every 100 iterations of
the algorithm. Using a window size of 1 000 ensured that the
changes in the environment corresponded with the decision
boundary changes in the data set. A total of 2 000 iterations
were required to move through the entire data set of 20 000
patterns.
2) Progressive change: To simulate progressive change
a window size of 500 was used, sliding the window 200
patterns every 100 iterations. The implication of this is that
points may become separated by more than one or two
decision boundaries at certain times during the evolutionary-
or training process (depending on the data set). New decision
boundaries appear and old boundaries disappear gradually
over time. Windowing over each data set required 9800
iterations to utilize all the patterns.
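Both windowing schemes can be sketched as one generator over the pattern-ordered data set; the iteration bookkeeping below reproduces the counts stated above (2 000 iterations for the abrupt setting, 9800 for the progressive one):

```python
def windows(n_patterns, window_size, step, iterations_per_shift=100):
    """Yield (start_iteration, lo, hi) index ranges of a sliding
    window. window_size=step=1000 gives the abrupt setting;
    window_size=500, step=200 the progressive one."""
    iteration, lo = 0, 0
    while lo + window_size <= n_patterns:
        yield iteration, lo, lo + window_size
        lo += step
        iteration += iterations_per_shift
```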
VI. RESULTS
Figures 1, 2 and 3 show the results obtained by averaging
the recorded classification errors for every 10 iterations
over 30 independent simulations for each algorithm/problem
combination. The purpose of Figures 1, 2 and 3 is to contrast
how well the three algorithms (GP, AGP and GD) could track
and recover from changes within the environment. A sharp
upward spike within the graphs indicates an environmental
change (window shift) where the algorithms were presented
with new unseen patterns to classify. A drop within the
graphs indicates an improvement in the classification capa-
bility of the algorithms. Table II summarizes the results by
averaging the recorded classification errors over the entire
runs (from iteration 0 to the last iteration) for each algorithm.
TABLE II
CLASSIFICATION ERROR WITH STANDARD DEVIATIONS FOR ALL
ALGORITHM AND PROBLEM COMBINATIONS
Problem GP AGP GD
Thresholds progressive 9.6 ± 9.4 8.8 ± 9.2 68.1 ± 22.5
Thresholds abrupt 9.2 ± 7.1 7.4 ± 6.1 69.5 ± 21.7
Spheres progressive 14.7 ± 9.4 14.2 ± 9.2 16.5 ± 9.1
Spheres abrupt 12.5 ± 8.7 11.7 ± 8 26.2 ± 8.5
Hyperplanes progressive 22.7 ± 10.2 22.3 ± 9.9 10.5 ± 8.7
Hyperplanes abrupt 20.6 ± 11.7 20.1 ± 11.2 17.2 ± 8.7
A. Thresholds results
In Figures 1(a) and 1(b) it is observed that both GP
and AGP outperformed GD significantly, except for a very
short initial setup time where GD performed better than
GP and AGP. Note that in the case of the progressively
changing environment, there were many more changes and
iterations required to move through the data set of 20 000
patterns, so the scales on the horizontal axes of Figures 1(a)
and 1(b) differ. The figures indicate that GD experienced a
“confusion” state from which it did not recover, whereas GP
and AGP were able to track and recover from changes in the
environment. Comparing the behavior of GP and AGP it can
be seen that in the case of AGP the classification error in
most cases dropped faster after a change in the environment
than for GP. This improved performance is also reflected in the
lower average of AGP as shown in Table II.
B. Spheres results
From Figures 2(a) and 2(b) as well as Table II it is evident
that both GP and AGP performed far better than GD on
the spheres data set, often reaching a classification error of
zero. Apart from having a lower average classification error,
both GP and AGP recovered at a more rapid rate after a
change in the environment than GD, as is observed in Figures
2(a) and 2(b). Once again it is also clear that AGP slightly
outperformed GP in terms of recovery rate and classification
error, yielding a smaller average classification error and a
quicker drop in the graph after a change in the environment.
(a) Progressive change
(b) Abrupt change
Fig. 1. Thresholds data set changing the environment every 100 iterations
C. Hyperplanes results
From Figures 3(a) and 3(b) it is clear that GD performed
better in both the progressive and abrupt environments for
the hyperplanes problem. For the abrupt environment, shown
in Figure 3(b), both variants of GP maintained a better
classification error between iterations 200 to 800, after which
GD once again started to outperform GP and AGP. This
superior performance of GD is also reflected by a lower
average classification error over all iterations shown in Table
II.
D. Discussion
With regard to GP and AGP, the recovery behavior after changes in the environment is due to the inherent evolutionary nature of the algorithms, subject to the “survival of the fittest” rule. Furthermore, the figures show that AGP had a tendency to recover from change more quickly than GP, an indication that the adaptive parameters did indeed function well in dynamic environments.
From the results it is clear that both GP and AGP struggled on average with the hyperplanes data set in both the progressively and abruptly changing environments. The situation was, however, completely reversed for the thresholds and spheres data sets, where both GP and AGP showed better performance on the graphs and a lower average classification error than GD.

2009 IEEE Congress on Evolutionary Computation (CEC 2009) 679

Fig. 2. Spheres data set, changing the environment every 100 iterations: (a) progressive change, (b) abrupt change
With regard to the thresholds problem, GP and AGP were
consistent in terms of recovery from changes within the
environment. The results for the thresholds data set also
indicated that both GP and AGP handled multiple decision
boundaries appearing and disappearing over time much better
than GD. This was mainly because GP and AGP were able to evolve variable-sized decision trees that incorporated new decision boundaries while retaining older, still relevant ones. Through the evolutionary process, old boundaries were phased out and replaced by information about the new boundaries.
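The paper does not give the exact node grammar used, but a GP-evolved decision tree of the kind described can be sketched as follows. The node layout, attribute tests, and class labels below are illustrative assumptions, not the authors' actual representation:

```python
class Node:
    """A decision tree of the kind GP evolves: internal nodes test one
    attribute against a threshold; leaves carry a class label. Because
    the trees are variable-sized, decision boundaries can be added or
    removed without disturbing the rest of the tree."""
    def __init__(self, attr=None, threshold=None, left=None, right=None, label=None):
        self.attr, self.threshold = attr, threshold
        self.left, self.right, self.label = left, right, label

    def is_leaf(self):
        return self.label is not None

    def classify(self, pattern):
        # Walk down the tree until a leaf label is reached.
        if self.is_leaf():
            return self.label
        branch = self.left if pattern[self.attr] < self.threshold else self.right
        return branch.classify(pattern)

    def size(self):
        return 1 if self.is_leaf() else 1 + self.left.size() + self.right.size()

# A tree encoding two decision boundaries: attribute 0 at 0.5, attribute 1 at 0.3.
tree = Node(attr=0, threshold=0.5,
            left=Node(label=0),
            right=Node(attr=1, threshold=0.3,
                       left=Node(label=1),
                       right=Node(label=0)))
print(tree.classify([0.6, 0.1]), tree.size())  # -> 1 5
```

Under this representation, crossover can graft in a subtree encoding a new boundary while the remainder of the tree, and hence the still-relevant older boundaries, is preserved.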
One possible explanation for GP and AGP’s poor performance on the hyperplanes problem might be that both GP and AGP evolved trees that fitted the current data (the patterns in the window). Being a 10-dimensional problem, larger trees were required, and larger trees have a tendency to “remember” data from a previous state of the environment more than smaller trees do. As the tree pruning mutation operators (truncation and shrinking) were responsible for making the trees “forget”, a possible remedy for the performance of GP and AGP on high-dimensional problems might be to apply these operators more liberally when a change in the environment is detected.

Fig. 3. Hyperplanes data set, changing the environment every 100 iterations: (a) progressive change, (b) abrupt change
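A change-triggered boost to the pruning rate, as suggested above, could be sketched as follows. The tuple-based tree encoding, probability values, and helper names are illustrative assumptions, not the paper's actual operators or settings:

```python
import random

# A tree is a nested tuple (attr, threshold, left, right); a bare int is a leaf label.
def size(t):
    return 1 if isinstance(t, int) else 1 + size(t[2]) + size(t[3])

def shrink(t, rate):
    """Shrink mutation: with probability `rate`, replace an internal node
    by one of its subtrees, discarding the other branch; the discarded
    decision boundary is thereby 'forgotten'."""
    if isinstance(t, int):
        return t
    if random.random() < rate:
        return shrink(random.choice([t[2], t[3]]), rate)
    return (t[0], t[1], shrink(t[2], rate), shrink(t[3], rate))

BASE_RATE, BOOSTED_RATE = 0.05, 0.4   # illustrative values only

def mutate(tree, change_detected):
    """Prune more aggressively right after a detected environment change,
    so outdated boundaries are phased out faster."""
    return shrink(tree, BOOSTED_RATE if change_detected else BASE_RATE)

tree = (0, 0.5, 1, (1, 0.3, 0, 1))
print(size(tree), size(mutate(tree, change_detected=True)))
```

Since shrinking only ever removes subtrees, the boosted rate trades retained knowledge for faster forgetting, which is exactly the trade-off the hyperplanes results point to.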
As opposed to the hyperplanes problem, which required
larger trees for classification in 10 dimensions, the spheres
problem consisted of only three dimensions. This allowed older decision boundaries to be phased out more quickly by the tree-shrinking operators, yielding an improvement in performance over the hyperplanes problem.
The behavior of GP and AGP was consistent in that an immediate degradation occurred after each change in the environment, followed quickly by a recovery period. GD struggled to recover from change and was subject to a “confusion state”: the figures show that its recovery was more gradual and took longer than that of GP and AGP. The most severe case of this confusion state is depicted in Figure 1(b), where GD abruptly degraded at iteration 200 and never recovered to a previously good classification error, in contrast to GP and AGP.
The ideal classification error over all problems was 0%.
GD only managed to obtain a classification error of 0% in the early stages of the progressively changing hyperplanes data set, but not in any of the other experiments. GP and AGP both reached classification errors of 0% for the progressive spheres, progressive thresholds, abrupt spheres, and abrupt thresholds experiments, a feat that was unmatched by GD.
Even though GP and AGP struggled with the progressively changing hyperplanes problem (Figure 3(a)), both algorithms yielded classification errors comparable to those obtained by GD.
The results shown in Table II indicate that on average both GP and AGP outperformed GD, with AGP yielding a slightly better classification error than GP, making AGP the best algorithm for the thresholds and spheres problems. GD outperformed both GP and AGP only for the hyperplanes problem.
AGP more often than not performed better than GP in terms of recovery from change. From the figures it can be seen that AGP reached a lower classification error more rapidly than GP after most of the environmental changes, an indication that the adaptive parameters of AGP did indeed contribute to a better classification error in dynamic environments.
VII. CONCLUSIONS
The aim of this paper was to compare the behavior
of GP and AGP to that of GD in dynamic environments
and investigate the feasibility of using GP and AGP for
abruptly and progressively changing environments. In ad-
dition, modifications to the GP algorithm were suggested
yielding AGP, an adaptive version of GP. Three data sets,
namely thresholds, spheres and hyperplanes, were utilized
for the experiments reported on in this paper. Two types of
dynamic environments, namely progressively and abruptly
changing environments, were modeled by windowing over
the data sets with different window sizes and window shift
sizes.
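The windowing scheme just described can be sketched as follows. The window and shift values here are arbitrary placeholders, not the paper's experimental settings:

```python
def windows(patterns, window_size, shift):
    """Yield successive training windows over the data set. A small shift
    relative to the window size gives a progressively changing environment;
    a shift equal to the window size gives an abruptly changing one."""
    for start in range(0, len(patterns) - window_size + 1, shift):
        yield patterns[start:start + window_size]

data = list(range(20))  # stand-in for 20 labelled patterns
progressive = list(windows(data, window_size=8, shift=2))
abrupt = list(windows(data, window_size=8, shift=8))
print(len(progressive), len(abrupt))  # -> 7 2
```

With a small shift, consecutive windows overlap heavily, so the learner sees a gradual drift; with shift equal to the window size, each window is entirely new data, modeling an abrupt change.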
GP and AGP were both able to track changes in a
dynamic environment. Compared to GD, both GP and AGP
performed better when considering the averaged results for
the thresholds and spheres data sets. GD only significantly
outperformed GP and AGP on the hyperplanes data set for
a progressively changing environment where the problem
dimension is significantly larger than for the other two data
sets. AGP, using adaptive control parameters and culling, also provided a slight performance gain over GP, making it the better contender for dynamic environments.
Future research will be aimed at investigating more intelligent adaptive control parameters for AGP, as opposed to simple cyclic parameters. An investigation will be conducted in which the adaptive parameters adjust as a function of the fitness of individuals within the population, compared against the cyclic parameters changing over time. On completion of this investigation, the next step would be to apply AGP to real world problems and investigate its performance in real world environments.
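As a concrete reading of the “simple cyclic parameters” mentioned above, a cyclic schedule could look like the sketch below. The linear decay shape and all constants are assumptions for illustration, not values taken from the paper:

```python
def cyclic_rate(iteration, period=100, high=0.5, low=0.05):
    """Illustrative cyclic schedule for a control parameter (e.g. a
    mutation rate): the rate peaks at the start of each period, i.e.
    right after a change when changes occur every `period` iterations,
    and decays linearly toward `low`."""
    phase = (iteration % period) / period
    return high - (high - low) * phase

print(cyclic_rate(0), cyclic_rate(50), cyclic_rate(99))
```

A fitness-driven alternative, as proposed for future work, would replace the fixed period with a trigger based on stagnating or collapsing population fitness.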
An evolutionary hibernation phase to spare computational resources was not put in place during this study; further studies will investigate the effects of having a hibernation phase for AGP. In the reported results, neither the rules themselves nor their quality (crispness, number of rules, etc.) were considered, only the classification accuracy of the rules generated by GP and AGP. Future research will also include a comparative study of the differences in the rule sets generated by GP and AGP.
ACKNOWLEDGMENT
The authors would like to thank the CIlib framework
developers. This work was supported in part by the National
Research Foundation.
REFERENCES
[1] J. R. Koza, “Genetic programming: a paradigm for genetically breeding populations of computer programs to solve problems,” Stanford, CA, USA, Tech. Rep., 1990.
[2] C. R. Darwin, On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life (The Origin of Species). England: Down, Bromley, Kent, 1859.
[3] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
[4] P. Clark and T. Niblett, “The CN2 induction algorithm,” in Machine Learning, 1989, pp. 261–283.
[5] A. P. Engelbrecht, Computational Intelligence: An Introduction. Wiley & Sons, 2007.
[6] G. P. Zhang, “Neural networks for classification: a survey,” vol. 30, no. 4, pp. 451–462, November 2000.
[7] M. Brameier and W. Banzhaf, “A comparison of linear genetic programming and neural networks in medical data mining,” IEEE Transactions on Evolutionary Computation, vol. 5, pp. 17–26, 2001.
[8] W.-H. Au, K. C. C. Chan, and X. Yao, “A novel evolutionary data mining algorithm with applications to churn prediction,” IEEE Transactions on Evolutionary Computation, vol. 7, no. 6, pp. 532–545, 2003.
[9] N. Wagner, Z. Michalewicz, M. Khouja, and R. R. McGregor, “Time series forecasting for dynamic environments: the DyFor genetic program model,” Evolutionary Computation, vol. 11, p. 433, 2006.
[10] Z. Yin, A. Brabazon, C. O’Sullivan, and M. O’Neill, “Genetic programming for dynamic environments,” in Proceedings of the International Multiconference on Computer Science and Information Technology, 2007, pp. 437–446.
[11] R. Nopsuwanchai and P. Chongstitvatana, “Improving robustness of robot programs generated by genetic programming for dynamic environments,” in Proc. of the Asia-Pacific Conference on Circuits and Systems (APCCAS ’98), 1998, pp. 523–526.
[12] W. Yan and C. D. Clack, “Behavioural GP diversity for dynamic environments: an application in hedge fund investment,” in GECCO ’06: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. New York, NY, USA: ACM, 2006, pp. 1817–1824.
[13] K. Rodriguez-Vazquez and P. J. Fleming, “Evolution of mathematical models of chaotic systems based on multiobjective genetic programming,” Knowl. Inf. Syst., vol. 8, no. 2, pp. 235–256, 2005.
[14] K. Wiese and S. D. Goodwin, “Convergence characteristics of keep-best reproduction,” in SAC ’99: Proceedings of the 1999 ACM Symposium on Applied Computing. New York, NY, USA: ACM, 1999, pp. 312–318.
[15] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen, “Dynamic integration of classifiers for handling concept drift,” Inf. Fusion, vol. 9, no. 1, pp. 56–68, 2008.