RULE REFINEMENT IN INDUCTIVE KNOWLEDGE-BASED SYSTEMS
by Mohamed Arteimi
Department of Computer Science
University of the Witwatersrand, Johannesburg, 1999
Abstract
This paper presents empirical methods for enhancing the accuracy of inductive learning systems. It
addresses the problems of learning propositional production rules in multi-class classification tasks in
noisy domains, maintaining continuous learning when confronted with new situations after the learning
phase is completed, and classifying an object when no rule is satisfied for it.
It is shown that interleaving the learning and performance-evaluation processes allows accurate
classifications to be made on real-world data sets. The paper presents the system ARIS, which
implements this approach, and shows that the resulting classifications are often more accurate than
those made by the unrefined knowledge bases.
The core design decision that lies behind ARIS is that it employs an ordering of the rules according to
their weight. A rule’s weight is learned by using Bayes’ theorem to calculate weights for the rule’s
conditions and to combine them. This model focuses the analysis of the knowledge base and assists the
refinement process significantly.
The system is non-interactive; it relies on heuristics to focus the refinement on those experiments that
appear to be most consistent with the refinement data set. The design framework of ARIS consists of a
tabular model for expressing rule weights, together with the relationship between refinement cases and
the rules satisfied for each case, which focuses the refinement process. The system has been used to
refine knowledge bases created by ARIS itself, as well as knowledge bases created by the RIPPER and
C4.5 systems [6,25], in ten selected domains. Two main advantages have been observed: first, the
ability to gradually improve the knowledge base as the refinement proceeds; second, the ability to learn
strong rules utilising condition weights in a minimum covering algorithm. By taking account of this
information we improve the applicability and quality of the refinement.
Keywords: knowledge-base refinement; inductive learning.
1. Introduction
The growth in database technology and computer communication has resulted in the creation of huge,
efficient data stores in almost every domain. For example, credit-card transactions, medical images, and
large-scale sky surveys are all stored in massive and ever-growing databases. These streams of data need
to be analysed to extract higher-level information that could be useful for decision making and for
understanding the process generating the data. For large systems of this type, we need automated
methods that acquire decision-making knowledge.
Humans attempt to understand their environment by using a simplification of this environment (called a
model). This model represents events in the environment, and similarities among objects [12]. Similar
objects are grouped in classes and rules are constructed to predict the behaviour of new objects of such a
class. In machine learning we try to automate this process and construct class descriptions (models) such
as a decision tree or a set of production rules using an iterative search strategy on a library of examples.
This is known as inductive learning.
There have been various methods for learning rules inductively by analysing a set of training examples
to extract interesting and relevant information from the databases. Among these are RIPPER [6], and
C4.5 [25]. This information is higher-level knowledge in the form of class descriptions that are
generalisations about what it means to be a member of a class or a category. This gives us the ability to
make inferences that are true for most or all members of the class.
In this paper we present a method for improving or refining existing knowledge bases by using a
refinement library of examples.
Section 2 gives an overview of two existing refinement systems and of inductive learning systems. Section
3 describes the ARIS knowledge-base refinement system. The application of ARIS to knowledge bases
generated by C4.5 and RIPPER systems is described in Section 4. Section 5 compares inductive
refinement and incremental learning. Section 6 concludes with discussion and future enhancements.
2. Related Work
2.1 Inductive systems
There has been a relatively long history of using induction to solve complex problems by creating a model of
the problem domain and inferring from that model instead of from the domain itself. The constructed
model is a particularly useful tool if it is possible to perform classification by simple, easy-to-understand
tests whose semantics are intuitively clear, such as decision trees [3,5,9] and production rules [1,6,25].
Learning algorithms vary in:
- the way in which they search the training data set,
- the way in which they generalise,
- the way they represent class descriptions, and
- the way in which they cope with errors and noise in the training data.
Generally, we can classify concept learners into two categories depending on the learning strategy
involved.
• Simultaneous covering algorithms
At each step in the search, the algorithm selects among alternative attributes by comparing the
partitions of the data that each attribute generates. Decision-tree algorithms belong to this category,
examples include ID3 [24], C4.5 [25], and CART [3]. These algorithms use an overfit-and-simplify
learning strategy involving post-pruning of the generated trees.
• Sequential covering algorithms
Here, the algorithm chooses among alternative attribute-value pairs to include in rules by computing
the subsets of data they cover. Direct rule learning algorithms such as RIPPER [6], and HYDRA [1]
belong to this category.
Two variations on this approach have been explored:
1. General-to-specific search
An approach to learning one rule at a time is to organise the search in the hypothesis space in
the same way as in simultaneous covering algorithms, but to follow only the most promising
branch in the tree at each step. The search starts by considering the most general condition
possible (i.e. the empty test that matches every instance), then adds the attribute test that most
improves rule performance measured over the training data. Once this test has been added, the
process is repeated by adding a second attribute test, and so on. This differs from the
simultaneous covering approach in that at each step only a single descendant is followed, rather
than growing a subtree that covers all possible values of the selected attribute.
Some learning systems such as BEXA [27] extend this approach to maintain a list of the k best
attribute tests at each step, rather than a single best attribute test.
2. Specific-to-general search
Another way to organise the search is to begin with the most specific rule and gradually
generalise it to cover more positive cases [18].
One way to see the significance of the difference between the above two approaches is to compare the
number of distinct choices made by the two algorithms in order to learn the same set of rules.
To learn a set of n rules, each containing k attribute-value tests in their conditions, sequential covering
algorithms will perform nk primitive search steps, making an independent decision to select each
condition of each rule. In contrast, simultaneous covering algorithms will make many fewer independent
choices, because each choice of a decision node in the decision tree corresponds to choosing the
condition for the multiple rules associated with that node. In other words, if the decision node tests an
attribute that has m possible values, the choice of the decision node corresponds to choosing a condition
for each of the m corresponding rules. Thus, sequential covering algorithms make a larger number of
independent choices than simultaneous covering algorithms.
The other difference is that in simultaneous covering algorithms different rules must test the same
attributes, while sequential covering algorithms impose no such constraint.
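As a concrete illustration of sequential covering with general-to-specific search, the following sketch grows one rule at a time by greedily adding the single best attribute-value test until the rule is pure, then removes the covered examples. The toy weather data, attribute names, and precision-based scoring are illustrative assumptions, not details from the systems cited above.

```python
# A minimal sketch of sequential covering with general-to-specific search.
# Dataset rows are dicts of attribute values plus a "class" label; all
# names and values here are invented for illustration.

def grow_rule(data, target, attributes):
    """Greedily add the single best attribute test until the rule is pure."""
    conditions = {}          # start from the empty (most general) rule
    covered = list(data)
    while any(r["class"] != target for r in covered):
        best = None
        for attr in attributes:
            if attr in conditions:
                continue
            for value in {r[attr] for r in covered}:
                subset = [r for r in covered if r[attr] == value]
                if not subset:
                    continue
                precision = sum(r["class"] == target for r in subset) / len(subset)
                if best is None or precision > best[0]:
                    best = (precision, attr, value, subset)
        if best is None:
            break
        _, attr, value, covered = best
        conditions[attr] = value
    return conditions, covered

def sequential_covering(data, target, attributes):
    """Learn rules one at a time, removing the examples each rule covers."""
    rules, remaining = [], list(data)
    while any(r["class"] == target for r in remaining):
        rule, covered = grow_rule(remaining, target, attributes)
        if not rule:
            break
        rules.append(rule)
        remaining = [r for r in remaining if r not in covered]
    return rules

weather = [
    {"outlook": "sunny", "windy": "no",  "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "rain",  "windy": "yes", "class": "stay"},
    {"outlook": "rain",  "windy": "no",  "class": "play"},
]
rules = sequential_covering(weather, "play", ["outlook", "windy"])
print(rules)
```

Note how each condition is chosen independently, which is the source of the nk primitive search steps discussed below.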
2.2 Refinement systems
Previous knowledge-base refinement systems use knowledge bases elicited from experts in fairly well-understood
domains. The knowledge base is assumed to be nearly complete, requiring only minor tweaking.
This section describes two such refinement systems, SEEK [22] and KRUST [7], which are the
forerunners of the ARIS refinement algorithm. The statistical approach of SEEK is the basis for ARIS's
rating of rules and embedded hypotheses. SEEK uses statistical evidence to apply suitable refinements.
The refinements are:
• Generalise a rule so that it will fire when it is enabled.
• Specialise the error-causing rule to allow an enabled correct rule to fire instead.
The rules in SEEK are represented in the EXPERT language but expressed in a tabular format that
divides the findings associated with a conclusion into two categories, major and minor, and the level of
confidence associated with a conclusion into three categories: definite, probable, and possible.
A list of major and minor symptoms is defined for each conclusion that the system can reach. For
example, major symptoms for connective tissue disease (one of the rheumatic diseases) include swollen
hands and severe myositis, while examples of minor symptoms include anemia and mild myositis.
The rule changes include the following operators:
1) Change the certainty of a diagnosis,
2) Increase/decrease the number of majors or minors (i.e. suggestive symptoms),
3) Add/delete an exclusion (i.e. a symptom that should not be present),
4) Add/delete a requirement (i.e. a necessary symptom) and,
5) Change the conclusion (RHS) of a rule.
The rule changes do not include numerical range changes. SEEK requires the involvement of an expert
to decide on whether or not a suggested change by SEEK can be implemented. The main criticism of
SEEK is that it uses a criterion table for representing the hypotheses. This works well in the domain of
rheumatic diseases, but may be difficult to implement in other domains. SEEK is also unable to add
new rules.
Politakis [22] gives a method to integrate condition weights and rule weights for classification. The
method used relies on a very specific set of weights regarding the relationship between a set of
conditions and a conclusion. The condition weight can be either minor or major, while the weight of a
rule can have a value in the set {possible, probable, definite} depending on the combination of
conditions used in the rule. This technique is strongly connected to our implementation of condition and
rule weights, as will be discussed in Section 3. The scheme proposed in this paper is dynamically
adjusted with respect to a probabilistic estimate (depending on the data distribution) and is defined on
a continuous scale, while in the SEEK system an expert is required to assign weights for both conditions
and rules.
The use of an exclusion component in SEEK is another important aspect that is similar to the proposed
approach. We use an UNLESS clause to rule out a conclusion in a particular situation, taking care of
special cases in the domain of interest.
The KRUST system [7] refines a knowledge base when there is a conflict between the expert's
conclusion and the system's conclusion for a particular test case. This test case is one of a bank of
task-solution pairs used by KRUST. KRUST normally outputs a single knowledge base, chosen according to the
performance of a set of proposed refined knowledge bases on all the task-solution pairs. In the event of a
tie for best knowledge base, or if the original knowledge base is judged best, the expert is consulted.
However, unlike SEEK, KRUST does not require the involvement of an expert during refinement.
KRUST follows the "first enabled is fired" conflict resolution strategy, and its refinement operations are:
• Generalise a rule,
• Specialise a rule,
• Change the priority of a rule,
• Add a new rule for the misdiagnosed case.
KRUST's generation of a new rule is naïve [7] and needs to be improved. Again, KRUST assumes that
the original knowledge base is complete and contains no redundant rules; hence, there is no redundancy
checking to remove undesirable rules. Moreover, there is no way to handle or exclude special cases, as
there is in SEEK.
KRUST's approach of generating many refinements is based on the realisation that applying a
variety of refinement operations can lead to knowledge bases with different performances.
research in this paper emphasises the need for trying all possible refinement operations that are
applicable for the observed misdiagnoses and to choose one that best improves the knowledge-base
performance.
3. The ARIS inductive refinement system
This section is devoted to a description of the mechanisms involved in representing/creating the
knowledge base, improving the knowledge base, and the associated weighting schemes. This is
implemented in a system called ARIS (standing for Automated Refinement of Inductive Systems) which
refines propositional production rules. The architecture of ARIS is shown in Figure 1, and the main
features are:
• a rule weighting scheme for providing rule priorities and ordering,
• ranking refinement cases by combining the weights of the rules satisfied for each case,
• the ability to classify an object through partially satisfied rules when no rule is fully satisfied, and
• the ability to perform all refinement operations on continuous and discrete attributes.
Figure 1. The ARIS architecture
ARIS initially generates a knowledge base using induction on a set of training examples; it then tests
the knowledge base on a separate set of data reserved for refinement, called the refinement data set.
Only after this testing, and if some cases in the refinement set are misclassified, is the refinement
subsystem invoked. Finally, the system is tested on a separate test data set to evaluate the refinement
process. The refinement subsystem identifies possible errors in the
knowledge base, and calls on a library of operators to develop possible refinements, guided by a global
heuristic. The best refinement is implemented, and the process repeats until no further refinements are
possible. A knowledge-base refinement system is successful to the extent that it can improve the
accuracy of the initial knowledge base on new, unseen cases and produce a more accurate and
comprehensible result. ARIS performs a search through a
space of both specialisation and generalisation operators in an attempt to find a minimal refinement to a
knowledge base.
Conceptually, the refinement subsystem has three main phases, two of them are executed for every
hypothesis present in the knowledge base, while keeping the rules ordered with respect to their weights:
Phase 1: (Localisation process)
During the first phase, all misdiagnosed cases from the refinement set belonging to a particular
hypothesis (class) are identified. Each case receives a weight from the rules satisfied for the
case. This indicates the rule overlapping at this point (case) in hypothesis space. The case that
has the highest absolute weight among the misdiagnosed cases is selected, since it points to the
strongest rule from the set of erroneous rules (i.e. with highest weight) and frequently many
rules overlap at this point.
Phase 2: (Refinement generation, verification and selection)
During this phase, the rule responsible for the misdiagnoses is determined, and all possible
applicable refinement operations are tried: the erroneous rule is specialised, a similar
rule reaching the intended class is generalised, and a new rule is created. The resultant
performance is stored for each refinement operator and finally the refinement operator or group
of operators that produce the best performance is chosen. The process is repeated until no
further improvement is possible.
Phase 3: (Check completeness and remove superfluous rules)
Finally, the knowledge base is checked for incompleteness. Each case must be covered by at
least one rule. If there are cases that are not covered by the existing rules, new rules are created.
Moreover, any superfluous rules are removed. Figure 2 demonstrates the algorithm.
The main components of ARIS are: a tree generator, a rule generator, a refinement generator, a
judgement module, and an inference engine. The refinement generator is responsible for applying all
possible refinements to remedy the misdiagnoses. The set of refinements reflects how the rules interact.
Rules can be changed by allowing them to fire (called enabling), or preventing them from firing (called
disabling). The judgement module selects the best refinement operator or group of operators which
result in the best improvement of knowledge-base performance while correcting the misdiagnosed
cases.
Algorithm : ARIS
Inputs: A set of training data
A set of refinement data,
A set of test data,
Optionally, a set of rules of the form:
IF conditions THEN hypothesis
Output: A refined rule set, with rules of either the form
IF conditions THEN hypothesis
or
IF conditions THEN hypothesis UNLESS conditions
First:
If no rules are given, generate a set of production rules by processing the training data.
Second:
1. For each hypothesis in the hypothesis space:
{Identify rules that are not consistent with the refinement data set. These may at later stages be
specialised but will not be generalised.}
• Rank rules with respect to their weights.
• For each misdiagnosed case in the refinement set, calculate combined weight of rules satisfied
for the case (see Section 3.1).
• Select a case that has the highest absolute weight among the misdiagnosed cases.
• Refine the knowledge base for the selected case (i.e. missed case) by generating all possible
refinements.
• Select the best refinement, and implement it.
End for
2.
• Maintain knowledge-base completeness, by guaranteeing that each case is covered by at least
one rule, use a minimum covering algorithm to generate missing rules (see Section 3.4).
• Prune all superfluous rules.
Figure 2: The ARIS algorithm
3.1 Learning condition weights
Feature weights have been used in case-based learning [14,19] to denote the relevance of features in
similarity computations and to guide the selection of relevant features. The basic approach involves
storing cases and their associated classes in memory and then, when given a test instance, the task
becomes one of finding the training cases nearest to that test case and using them to predict the class of
the test instance. Several methods for constructing feature weights have been proposed, based on either a
naïve distance metric that weighs attributes equally [28] or a weighted Euclidean metric in which the
weight for each feature comes from a small set of options, for example [0,0.1,0.2,…,1], where 0 means
that the feature is irrelevant, and a higher weight denotes greater relevance, as in [19]. Some methods
cannot handle numeric features directly and have to discretise numeric values by classification; others
have to convert multi-valued symbolic features into a set of binary features according to the number of
values (i.e. a symbolic feature with N values is mapped into N-1 binary features), as in
quantification method II [19].
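The weighted nearest-neighbour scheme described above can be sketched as follows; the feature values, weights, and class labels are invented for illustration, and a real case-based learner would tune the weights rather than fix them by hand.

```python
# Hedged sketch of 1-nearest-neighbour classification with per-feature
# weights, in the spirit of the case-based learners cited above.
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance; a weight of 0 makes a feature irrelevant."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def classify(query, cases, weights):
    """Predict the class of the stored case nearest to the query."""
    features, label = min(cases, key=lambda c: weighted_distance(query, c[0], weights))
    return label

# (feature vector, class) pairs; both values are illustrative.
cases = [((20.0, 1.0), "healthy"), ((60.0, 0.0), "sick")]
weights = [0.1, 1.0]   # age matters little, the second feature dominates
print(classify((25.0, 0.0), cases, weights))   # → healthy
```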
Politakis [22] used a domain specific scale for weighing conditions and rules. His approach can be
summarised as follows:
The conditions in a rule are grouped under the following categories:
majors: considered to be very important conditions,
minors: less specific but suggestive for reaching a particular conclusion,
exclusions: conditions that must not be satisfied in order to reach a given conclusion.
The certainty in the conclusion of a rule is measured by the confidence level it reaches. There are three
levels of confidence:
(1) Possible,
(2) Probable, and
(3) Definite.
Different combinations of conditions could result in reaching the same conclusion but with different
certainty.
In contrast, the only source of information we have is the training data. In this research, conditional
probability theory is used to take advantage of the data distribution in the training set. The intention is to
measure the importance of a condition to a particular class through the change in belief in
the class with respect to that condition. This poses the following question: for a particular hypothesis h,
given a piece of evidence e, how will the relation between h and e change?
To formalise the concept, consider a sample space that is partitioned by events E1, E2, …, En.
Let H+ be an event in the space denoting a certain class or concept, with probability P(H+) > 0. Then,
by Bayes' theorem,

\[
P(E_j \mid H^+) = \frac{P(H^+ \mid E_j)\,P(E_j)}{\sum_{i=1}^{n} P(H^+ \mid E_i)\,P(E_i)}
\]

For a simple problem with two conditions and hypotheses H+ and H-,

\[
P(E_1 \mid H^+) = \frac{P(H^+ \mid E_1)\,P(E_1)}{P(H^+ \mid E_1)\,P(E_1) + P(H^+ \mid E_2)\,P(E_2)}
\]
We can rewrite this as

\[
P(E_1 \mid H^+) = \frac{P(H^+ \mid E_1)\,P(E_1)}{P(H^+ \mid E_1)\,P(E_1) + P(H^+ \mid \sim E_1)\,P(\sim E_1)}
= \frac{P(H^+ \mid E_1)\,P(E_1)}{P(H^+ \mid E_1)\,P(E_1) + P(H^+ \mid \sim E_1)\,(1 - P(E_1))}
\]

that is,

\[
P(E_1 \mid H^+) = QA \times P(E_1) \qquad (1)
\]

where

\[
QA = \frac{P(H^+ \mid E_1)}{P(H^+ \mid E_1)\,P(E_1) + P(H^+ \mid \sim E_1)\,(1 - P(E_1))}
\]

or, using the terminology of Table 2,

\[
QA = \frac{X_1}{X_1 X_3 + X_2\,(1 - X_3)}
\]

If we take the logarithm of QA, the incremental nature of the updating process of the evidence on the
condition of the hypothesis becomes apparent. In other words, the posterior probability P(E_1 | H^+) in
Equation (1) above depends on the factor QA. If QA is greater than one, the posterior probability will
be greater than the prior probability P(E_1); if QA is less than one, P(E_1 | H^+) will be less than
P(E_1). We can thus view this quantity as a weight, carried by the evidence E_1, which sways the
belief one way or the other. The weight QA calculated as above ranges from zero to infinity; however,
we would like to map the evidence weight into the interval [-1, +1], with a value of 1 for
QA mapping to 0 (central), so that it is easier to distinguish whether the given evidence is in favour of
the particular hypothesis or not. In other words, a positive weight denotes supporting evidence, and a
negative weight indicates opposing evidence. Therefore the following function is used to map the value
to the desired range: W = F(QA), where

\[
F(QA) = \frac{2\,QA}{QA + 1} - 1
\]

Another candidate for this transformation [23] is

\[
F(QA) = \frac{2}{1 + e^{-QA}} - 1
\]

Some other possible forms have been suggested in the literature for this mapping [13].
A positive weight indicates supporting evidence for the hypothesis, and a negative weight indicates
opposing evidence against it.
Symbol          Description
Ei              An event or piece of evidence (e.g. Age <= 20)
~Ei             The complement of Ei (e.g. Age > 20)
H+              The space of positive instances of the hypothesis (e.g. Healthy)
H-              The space of negative instances of the hypothesis (e.g. Sick)
P(Ei)           The prior distribution of objects (cases) within the scope of the condition,
                relative to the total number of examples in the entire population (e.g. P(Age > 20))
X1 = P(H|Ei)    Fraction of positive instances of the hypothesis covered by the condition Ei
                (i.e. true positives, TP)
X2 = P(H|~Ei)   Fraction of positive instances not covered by the condition Ei, relative to the
                complement of Ei (i.e. false negatives, FN)
X3 = P(Ei)      Fraction of all instances covered by the evidence Ei, relative to the entire population
Table 2: Terminology for the weight model
We emphasise that this weighting scheme defines class-specific weights, depending on class frequency
and on the degree to which a condition's distribution of values for a class differs from its distribution
for other classes.
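A minimal sketch of the condition-weight calculation in Section 3.1, assuming the fractional mapping F(QA) = 2QA/(QA+1) - 1 (which is algebraically the same as (QA-1)/(QA+1)); the probability values below are invented for illustration.

```python
# Sketch of the condition-weight calculation: QA is computed from the
# Table 2 quantities and then mapped into [-1, 1], with QA = 1 mapping to 0.

def condition_weight(x1, x2, x3):
    """x1 = P(H|E), x2 = P(H|~E), x3 = P(E); returns a weight in [-1, 1]."""
    qa = x1 / (x1 * x3 + x2 * (1.0 - x3))   # posterior/prior ratio factor
    return (qa - 1.0) / (qa + 1.0)          # QA = 1 (no information) maps to 0

# Evidence exactly as common inside the class as outside carries no information:
print(condition_weight(0.5, 0.5, 0.4))   # → 0.0

# Evidence concentrated in the class gets a positive weight:
print(round(condition_weight(0.9, 0.1, 0.5), 3))   # → 0.286
```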
3.2 Rule weights and ordering
In what follows, we discuss the rationale for weighting and ordering the rules in an inductive
knowledge-based system. The natural properties of production rules allow many ways for rules to
interact and overlap. In other words, a rule may collide with or subsume another rule, because the
available conditions are not usually independent. This requires some process to use the available
information to resolve any existing conflict and generalise well over the training data, while keeping the
rules as simple as possible.
Knowledge-based systems are characterised by the separation of rules, data, and control scheme [8], and
the task execution is accomplished through the following phases:
Phase 1: Matching, in which the knowledge-based system matches its rules against a given instance
to determine the conflict set (the set of satisfied rules).
Phase 2: Conflict resolution, in which the system selects a rule to fire from the conflict set.
Phase 3: Rule application, which entails executing the selected rule.
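The three phases can be sketched as a single recognise-act step; the rules, weights, and the choice of highest-weight conflict resolution here are illustrative assumptions, not the paper's final scheme.

```python
# Minimal sketch of the three-phase production-system cycle described above:
# match, conflict resolution, rule application. All rule contents are invented.

def run_cycle(rules, instance):
    # Phase 1: match - collect every rule whose conditions all hold.
    conflict_set = [r for r in rules if all(test(instance) for test in r["if"])]
    if not conflict_set:
        return None
    # Phase 2: conflict resolution - here, pick the highest-weight rule.
    chosen = max(conflict_set, key=lambda r: r["weight"])
    # Phase 3: rule application - fire the chosen rule's conclusion.
    return chosen["then"]

rules = [
    {"if": [lambda x: x["fever"], lambda x: x["cough"]], "then": "flu",  "weight": 0.8},
    {"if": [lambda x: x["fever"]],                       "then": "cold", "weight": 0.5},
]
print(run_cycle(rules, {"fever": True, "cough": True}))   # → flu
```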
The matching phase is known to be a time-consuming operation, but conflict resolution is
particularly important as it determines which rule will fire and hence how the case will be classified.
Several general schemes of conflict resolution have appeared in the literature utilising different kinds of
heuristics, for example:
1. Rule specificity was used to favour specific rules [11].
2. Connection strength [8], a measure of rule strength that encodes a rule's "relatedness" to other rules
in the knowledge base, used to select the next rule to fire on the basis of experience. This scheme
imposes an explicit priority ordering given the last rule satisfied during task execution, adopting the
following heuristic: whenever rule A fires, then based on experience the most likely candidate rule
to fire next and eventually lead to success is rule B. This idea is very different from our approach.
3. Logical sufficiency [2], a probabilistic estimate of rule reliability that measures the proportion of
the training examples of class c for which the rule was true, divided by the proportion of the training
examples of other classes for which the rule was true.
4. Belief certainty, such as the certainty-factor model used in MYCIN [4], which represents the
uncertainty in a rule's conclusion. This is based on expert estimates that integrate many factors,
including probabilistic knowledge and attitude.
Traditionally, conflict resolution requires computing the entire conflict set (except in scheme 2 above)
and then selecting a rule according to a certain property (such as frequency of rule success). We propose
the addition of a different mechanism to the inference engine: ordering the rules with respect to the
strength of the association between a rule's conditions and the intended hypothesis (i.e. the right-hand
side of the rule). This is similar to Politakis' approach [22] of preferring rules with the strongest
confidence over rules with less confidence.
The goal of refinement is to improve knowledge-base efficiency with as few modifications as possible.
This entails reducing the number of rules the system has to consider in order to improve prediction
quality. For the proposed method we maintain a simple matcher that matches rules individually.
Selecting the next rule to match depends on its location (position in a list) and hence on its weight, thus
avoiding the expensive computation of the entire conflict set. This means that the first rule to match is
fired, so there is never a conflict set. The rule weight can also be used for purging low-utility rules
from the system.
Changing the priority of a rule affects its ability to fire, and it is the easiest operational change to
perform [7]. Several knowledge-based systems use a rule priority change as a plausible action in their
conflict resolution schemes [7,16]. Craw [8] advocates changing the priority of a rule while rejecting the
change if it proves ineffective in later stages. Laird [16] supports the notion of changing control
rather than the conditions of the rules in his system Soar (i.e. modifying the decisions in which the
knowledge is used instead of modifying the rules that encode the knowledge) to recover from incorrect
decisions. In his system, the decisions made on the basis of preferences are categorised into three classes
(acceptability, necessity, and desirability) to select a rule for firing. Other authors [29] took the opposite
extreme viewpoint, namely, never to change the rule order under any circumstances; but there is no clear
evidence for keeping a specific ordering of the rules if resorting to a small change like reordering could
improve knowledge-base efficiency. ARIS starts with an unordered set of rules and imposes an order
that keeps the rules in line with their utility to the system, in the sense that it:
(1) provides as few matching operations as possible to produce the desired action without explicit
conflict-resolution computation, and
(2) improves refinement efficiency by supplying a process for locating the rule responsible for
misdiagnoses directly.
For example, suppose we have a knowledge base containing rules that help make decisions about outdoor
activities for children, such as:
R1: IF Today's temperature < 22 THEN Play-Soccer
R2: IF Today's temperature < -10 THEN Build-Snowman
For a temperature of -12 degrees, the knowledge-based system will suggest playing
soccer, which is not correct for this situation. While there are several ways to fix this problem, the
simplest is to reorder the rules; here we must use some criterion that allows the learning system to
perform the action (evaluation, then ordering) automatically. We could provide the learning system with
the semantic ability to relate each class to particular valid conditions, so that the learning system can
distinguish irrelevant conditions and order the rules in the proper context.
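The reordering fix can be sketched directly: with a first-match-fires interpreter, sorting the rule list by weight corrects the misclassification without touching any rule's conditions. The weight values here are invented for illustration.

```python
# Sketch of the first-match-fires scheme with weight ordering, using the
# soccer/snowman rules above. Weights are illustrative assumptions.

rules = [
    ("Play-Soccer",   lambda t: t < 22,  0.4),   # R1
    ("Build-Snowman", lambda t: t < -10, 0.9),   # R2
]

def first_match(rules, temp):
    """Fire the first satisfied rule; no conflict set is ever built."""
    for conclusion, condition, _weight in rules:
        if condition(temp):
            return conclusion
    return None

print(first_match(rules, -12))                     # unordered: Play-Soccer
ordered = sorted(rules, key=lambda r: r[2], reverse=True)
print(first_match(ordered, -12))                   # ordered: Build-Snowman
```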
The other way is to assign a weight to each rule that relates the strength of its conditions to its
conclusion. In the example above, the condition "Today's temperature < 22" covers some
false positive cases from the second rule's class. Now suppose that rule R2 is complete and
consistent. The intended rule-weighting scheme would result in the weight of rule R2 being higher
than the weight of rule R1; as a result, R2 will be placed in the first location, which yields better
prediction performance. In Section 3.1 we described the weighting technique for rule conditions. We
must admit that there is no consensus on how best to measure the combined effect of the
conditions on a rule's conclusion, because in real-world applications the conditions are not
independent, and the relation between the conditions themselves is not clear. However, we use a heuristic
for combining condition weights to produce a weight for the rule.
The following combination function [4,13] will be used to calculate the weight of a rule:

\[
\text{Rule's weight} = w_1 + w_2(1 - w_1) + w_3[1 - (w_1 + w_2(1 - w_1))] + \ldots
\]

This computes the impact of the tests in a rule's antecedent, where w_i is the weight of condition i. Since
the weights of conditions can be either positive or negative evidence (for or against the hypothesis), we
need to split the equation above as follows:

\[
MSE = w_1 + w_2(1 - w_1) + w_3[1 - (w_1 + w_2(1 - w_1))] + \ldots \quad \text{over the positive conditions (0 if there is no supporting evidence)}
\]
\[
MOE = w_1 + w_2(1 - w_1) + w_3[1 - (w_1 + w_2(1 - w_1))] + \ldots \quad \text{over the negative conditions (0 if there is no opposing evidence)}
\]

The weight of the rule is then MSE - MOE, where
MSE = measure of supporting evidence,
MOE = measure of opposing evidence,
w_i = weight of evidence i.
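A sketch of this combination, assuming the iterative accumulation w_total = w_total + w(1 - w_total) is applied separately to the supporting and opposing condition weights; the example weights are illustrative.

```python
# Sketch of the rule-weight combination: condition weights are combined
# with w1 + w2(1 - w1) + ..., separately for supporting (positive) and
# opposing (negative) evidence, and the rule weight is MSE - MOE.

def combine(weights):
    """Iterative w_total = w_total + w * (1 - w_total); 0 if no evidence."""
    total = 0.0
    for w in weights:
        total = total + w * (1.0 - total)
    return total

def rule_weight(condition_weights):
    mse = combine([w for w in condition_weights if w > 0])   # supporting
    moe = combine([-w for w in condition_weights if w < 0])  # opposing
    return mse - moe

print(round(rule_weight([0.5, 0.4, -0.2]), 2))   # → 0.5
```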
3.3 Improving an existing knowledge base
It is our hypothesis that tackling the most difficult part of the knowledge base is a good approach, and
we feel that other misdiagnoses could be solved as a by-product of the process. To test this hypothesis,
we maintain a list of the available rules in the knowledge base, against a set of refinement cases selected
independently from the training data set, as shown for example in Table 3 below:

REFINEMENT CASE   R1 (+)   R2 (-)   R3 (+)   CASE WEIGHT
1 (+)             *
2 (-)                      *
3 (+)             *                 *
...

Table 3: Generating refinement locations
...)]]1([1[)1( weight sRule' 1213121 +−+−+−+= wwwwwww
conditionspositiveforwwwwwww
MSE ...)]]1([1[)1(
evidence supporting no if 0
1213121
+−+−+−+=
conditionsnegativeforwwwwwww
MOE ...)]]1([1[)1(
videnceopposing no if 0
1213121
+−+−+−+=
For a particular class in the given domain, each rule maintains its weight and a label (+ or −) with respect
to the class under investigation. Each case is labelled + if it belongs to the class of interest, or −
otherwise. A ‘*’ means that the case is covered by the rule. All the rules satisfied for a case in the
refinement data set collectively produce a weight for that case. To determine this weight, we follow the
formulation given in [4], which defines evidence combination as a special case of the Dempster-Shafer
theory of evidence, that is:

    F(w1, ..., wn) = 1 − ∏(j=1..n) (1 − wj),   n ≥ 1        (2)

where wj is the weight of rule j. Since the rules in our approach are ordered with respect to their weight,
we reformulate equation (2) as follows:

    F(w) = F(w1, ..., wn) = 1 − ∏(j=1..n) (1 − wj) + Max[w],   n ≥ 1

where Max[w] is the highest weight of a rule among all the satisfied rules for a particular case. The
combined weight F(w) is an indication of rule overlapping. A case is misdiagnosed by the system if the
satisfied rule with the highest weight belongs to a different class (i.e. has a different sign). We would
like to find the points with the strongest overlapping (i.e. interaction) of rules where misdiagnoses exist.
The algorithm in Figure 5 describes how to assign case weights based on the rules satisfied. While the
weight of the rules is used as a criterion to order the rules in the knowledge base for aiding
classification, it also forms a basis for enhancing the refinement process.

Refinement locations are meant to be places in the knowledge base where misclassification may occur.
We identify refinement locations by the negative weight of misclassified instances. The number of
different rules which satisfy a particular refinement case represents its potential, i.e. the degree of
overlap in the knowledge base, which must be resolved. Our method of analysis calculates the
performance of rules in the knowledge base with respect to the refinement data set. For each stored case,
all the satisfied rules are identified. It is common to find multiple rules satisfied for a given case. The
problem of misclassification appears when rules with incorrect conclusions are satisfied for the case.
Because of this, we separate these rules into two groups: one group corresponds to the active hypothesis
(the class of the misdiagnosed case) for the refinement location, and the second group contains rules
concluding other classes. We would like to determine the point at which refinement would produce
the optimal result with respect to increased performance of the knowledge base, with the least refinement
effort in terms of the number of refinement operators applied. Finding such an optimal refinement
is not possible, but the case weight as calculated in Figure 5 is a useful heuristic indicating areas of high
interest for possible refinements. Figure 3 summarises the refinement process.
Figure 3: Refinement process. The flowchart proceeds as follows: evaluate KB performance by comparing
the KB system's conclusions with the stored refinement cases' conclusions; if there are incorrect
knowledge-base diagnoses, summarise rule performance on each refinement case, instantiate heuristics
to locate the most critical part of the rule overlapping, perform candidate refinement operations
(apply a refinement operator, evaluate the KB, update the best refinement) and select the best,
implement the best refinement, then verify KB completeness and verify KB redundancy; otherwise return.
Refine (erroneous, missed case)
{ The target rule is a rule satisfying the missed case and concluding the correct hypothesis for it }
1. Specialise the erroneous rule
• Select the best specialisation
• If there is a target rule, accept the best specialisation
2. If there is no target rule:
{
• Find closest rule to generalise, then
{ Two successive scans of the rule are performed in order to generalise it }
(1) Generalise by deleting a condition from the rule, such that the deletion does not degrade
the performance of the rule, so as to cover the missed case
(2) Generalise by expanding numerical scope of a condition if it is continuous, so that the
missed case will be correctly covered.
• If the weight of closest rule after generalisation > weight of the erroneous rule, then adopt the
generalisation operation only,
Else
{
• Augment erroneous rule with UNLESS clause
• Apply both generalisation (to closest rule) and specialisation (to erroneous
rule) operators
• Evaluate the knowledge-base performance using each of the above
refinements, and accept the one that better enhances the knowledge-base
performance.
}
}
3. If there is no rule to generalise, then
• Create a new rule using a minimum covering algorithm
• If the weight of the created rule > weight of erroneous rule, then add new rule to the
knowledge base;
Else
{
• Specialise the erroneous rule, so it no longer covers the missed case, and
• Add the newly created rule.
}
Figure 4: The Refine algorithm
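The control flow of the Refine algorithm can be summarised as a decision table. The sketch below is illustrative, not the ARIS source; the action strings are our own labels for the operators Figure 4 would choose, given the relevant weight comparisons.

```python
def choose_refinement(has_target_rule, gen_weight=None,
                      err_weight=0.0, new_weight=None):
    """Decide which refinement operators Figure 4 applies.

    gen_weight: weight of the closest rule after generalisation (None if no
    rule can be generalised); new_weight: weight of a rule built by the
    minimum covering algorithm (None if no such rule was built)."""
    if has_target_rule:
        # A rule concluding the correct hypothesis already covers the case:
        # specialising the erroneous rule is enough (step 1).
        return ["specialise erroneous rule"]
    if gen_weight is not None:
        if gen_weight > err_weight:
            # Step 2: the generalised closest rule outweighs the erroneous one.
            return ["generalise closest rule"]
        # Otherwise evaluate the competing refinements and keep the best.
        return ["try UNLESS clause",
                "try generalise + specialise",
                "keep the better refinement"]
    if new_weight is not None and new_weight > err_weight:
        # Step 3: a newly created rule dominates the erroneous one.
        return ["add new rule"]
    return ["specialise erroneous rule", "add new rule"]
```

For instance, `choose_refinement(False, gen_weight=0.6, err_weight=0.4)` selects generalisation only, mirroring step 2 of Figure 4.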
Combined weight (missed case)
Fp ← Weight of rules supporting the hypothesis h and satisfying the missed case.
Fn ← Weight of rules that do not support the hypothesis h and satisfy the missed case.
Maxwp ← Maximum weight carried by a rule, among those supporting the hypothesis h.
Maxwn ← Maximum weight carried by a rule, among those satisfying the missed case but not concluding
hypothesis h.
Wi ← Weight of rule i { calculated using the approach discussed in Section 3.2 }
• Calculate the combined weight of the rules satisfying the missed case and concluding h.
• Calculate the combined weight of the rules satisfying the missed case and not concluding h.
• Compute the combined weight for the case, to indicate whether the case is misdiagnosed or not. If
the combined weight of the rules in favour of the hypothesis under consideration is greater than the
combined weight of the other rules, the case will be diagnosed correctly; otherwise it will be
misdiagnosed.
• Return the case weight.
Figure 5: The combined weight algorithm
Example:
Table 4 describes a small classification system used to demonstrate our approach. It contains two final
hypotheses, (+) and (−), and 16 training cases C1, C2, ..., C16. The cases are marked (+) if they
belong to hypothesis (+), or (−) otherwise. The rules are shown with their weights; the marks
indicate the cases to which the rules are applicable (i.e. the conditions of the rule are satisfied
for the case). A rule is assigned a (+) sign if it concludes hypothesis (+), or (−) otherwise.
    Case weight = Fp − Fn

    Fp ← 1 − ∏(i=1..k) (1 − wi) + Maxwp

    Fn ← 1 − ∏(j=1..R) (1 − wj) + Maxwn

where k is the number of satisfied rules concluding the hypothesis h, and R is the number of satisfied
rules concluding other classes.
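The equations above and the steps of Figure 5 can be sketched as follows; this is an illustrative reimplementation with our own helper names, assuming an empty group of rules contributes zero weight.

```python
def group_weight(weights):
    """F(w1..wn) = 1 - prod(1 - wj) + Max[w]; taken as 0 for an empty group."""
    if not weights:
        return 0.0
    prod = 1.0
    for w in weights:
        prod *= (1.0 - w)
    return (1.0 - prod) + max(weights)

def case_weight(satisfied_rules, h):
    """satisfied_rules: (concluded_class, weight) pairs for the rules that
    cover the case; h: the case's stored hypothesis. A positive result means
    the case would be diagnosed correctly, a negative one that it would be
    misdiagnosed."""
    fp = group_weight([w for c, w in satisfied_rules if c == h])  # Fp
    fn = group_weight([w for c, w in satisfied_rules if c != h])  # Fn
    return fp - fn
```

For a case covered by one supporting and one opposing rule, e.g. `case_weight([('+', 0.28568), ('-', 0.225048)], '+')`, the sign of the result indicates whether the heavier evidence lies on the correct side.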
Rules, ordered by weight: R3 − (0.40386), R2 + (0.28568), R5 − (0.225048), R6 + (0.000778),
R1 + (−0.0003322), R4 + (−0.256594)

    Case    Combined case weight
    1 +       0.6864
    2 +       0.6864
    3 +       0.6864
    4 +       0.1709
    5 +       0.1709
    6 +      −0.5148
    7 −      −0.5148
    8 +      −0.5148
    9 +       0.5155
    10 −     −1.3064
    11 −     −1.3064
    12 −     −1.3064
    13 −     −1.3048
    14 +      0.2582
    15 +      0.2582
    16 −      0.2582

Table 4: Refinement locations
(In the full table, * marks the rules covering each case, ✗ marks a case incorrectly covered, and ✓ a
case correctly covered.)
The assignment of the weights to the rules is done using the approach discussed in Section 3.2. Applying
rule ordering with respect to rule weight (R3, R2, R5, R6, R1, R4) lets us directly conclude that cases 6
and 8 will be misdiagnosed by the inference engine, as they have a negative combined weight while
belonging to hypothesis (+). This in turn will lead to specialising rule 5, because it is overly general.
Notice that if the rules were in a different order, for example if rule 5 preceded rule 2, other cases
(i.e. cases 4 and 5) would be misclassified; the ordering scheme adopted avoids this problem.
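The effect of the ordering scheme can be illustrated with a toy rule set (the rules, attributes, and thresholds here are hypothetical, not those of Table 4): the inference engine tries the satisfied rules in descending order of weight and fires the first one.

```python
# Each rule: (name, concluded class, weight, predicate over a case dict).
rules = [
    ("R2", "+", 0.28568, lambda c: c["temperature"] < 22),
    ("R5", "-", 0.225048, lambda c: c["humidity"] > 80),
]

def classify(case, rules):
    """Fire the first satisfied rule in descending-weight order."""
    for name, cls, weight, pred in sorted(rules, key=lambda r: r[2], reverse=True):
        if pred(case):
            return cls
    return None  # no rule satisfied

# A case covered by both rules takes the class of the heavier rule:
print(classify({"temperature": 20, "humidity": 90}, rules))  # +
```

Had R5 carried the greater weight, the same case would have been assigned class −, which is exactly the ordering sensitivity discussed above.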
3.4 Minimum covering algorithm
Many algorithms for concept learning search hypothesis space by considering a general-to-specific
ordering of hypotheses. However, a minimum covering algorithm organises the search for a hypothesis
beginning with the most specific possible hypothesis, then generalises this hypothesis at each step to
cover an observed positive example. To be more specific, the minimum covering algorithm specifies a
set of conditions that describe some cases centred around a particular instance which is normally a
misdiagnosed case; thus providing an approach that is constrained to focus on a subset of a particular
concept by only a subset of the relevant instance features observed. It represents an inductive
generalisation of consistent production rules, adopting a least-generalisation strategy. First, we select
the most central case to be covered by the initial rule (normally the misdiagnosed case) and form
the most specific rule that covers it. Depending upon the condition values used in the case, an
over-specific (OSpec) rule is developed, such that the OSpec rule covers only the central case. All
example cases that belong to the class of the central case are considered positive examples, and all
cases that do not belong to that class are considered negative examples. Next, we generalise against
the positive cases for each condition; the newly generated rule is a generalisation of OSpec, and a new
rule (Min_Cover_Rule) is then created from it. The technique used to do this generalisation is the
following:
1. Min_Cover_Rule ←Ø { Make an empty rule}
2. Create OSpec rule. { This is an overly specific rule which covers only the misdiagnosed case }
3. Generalise OSpec rule
• Ordinal range expansion
All the positive cases are ordered for the particular condition to facilitate expanding ordinal
ranges, which will result in a rule covering more cases. Each time, the next value for the
condition in the range is examined: for the less-than (<) operator the next larger value in the
range is tried, while the next smaller value is used for the greater-than (>) operator.
This results in the range of values for the condition being expanded in both directions.
• Condition deletion
For each discrete condition in the OSpec rule, if deleting the condition improves the quality of
the rule, then execute the deletion operation.
4. Create Min_Cover_Rule
This involves a sequential scan through the conditions of the OSpec rule, after ordering them with
respect to their weights. As each condition is examined it is added to Min_Cover_Rule. If the
resulting rule is inferior to the original then the condition is removed from Min_Cover_Rule.
The following rule evaluation function is employed to test rule efficiency:

    Consistency = (# correctly covered cases) / (# covered cases)

    Completeness = (# correctly covered cases) / (# cases of the same class)

    Efficiency(Min_Cover_Rule) = α · Consistency + (1 − α) · Completeness

where α is a weighting parameter. In our experiments, we set the value of α to 0.8. The use of this
function results in the development of rules that are complete and consistent with regard to the data set,
whenever this is possible. A formal description of the above algorithm is provided in Figure 7. This
process is illustrated
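A sketch of the evaluation function, assuming the weighted reading α · consistency + (1 − α) · completeness given above (the helper names are ours, not from ARIS):

```python
def consistency(correct_covered, covered):
    """Fraction of covered cases that are correctly covered."""
    return correct_covered / covered if covered else 0.0

def completeness(correct_covered, class_size):
    """Fraction of the target class's cases that are correctly covered."""
    return correct_covered / class_size if class_size else 0.0

def efficiency(correct_covered, covered, class_size, alpha=0.8):
    """Weighted rule-quality measure; alpha = 0.8 in the experiments."""
    return (alpha * consistency(correct_covered, covered)
            + (1 - alpha) * completeness(correct_covered, class_size))

print(efficiency(8, 10, 20))  # 0.8 * 0.8 + 0.2 * 0.4, i.e. roughly 0.72
```

With α = 0.8 the measure favours consistency over completeness, which pushes the search toward rules that rarely misfire even when they do not yet cover the whole class.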
in Figure 6. In this figure, example cases are presented as points in a two dimensional space
representing a case’s age and weight. Positive cases are represented by (+) sign. Negative cases are
represented by (–) sign. Each rule can be viewed as defining a rectangle that includes those points that
the rule assigns to the positive class. The thick line indicates the rule to be refined.
IF Age<= 8 &
Age>1 &
Weight<= 8.2 &
Weight>3
THEN class= Negative
This rule covers eight negative and three positive cases. Assuming that the misdiagnosed case
determined by the refinement subsystem is case A, then the OSpec rule that covers case A and only A
is:
IF Age<= 6 &
Age >5.59 &
Weight<= 6.2 &
Weight> 6.19
THEN class= Positive
We then generalise against cases B and C. This results in the final rule:
IF Age <= 6.5 &
Age>3 &
Weight<= 7.5 &
Weight>5
THEN class= Positive
Figure 6: Example of minimum covering algorithm
The minimum covering algorithm is incorporated to help determine a missing rule for a particular
misdiagnosed case, to help determine which condition to add to a rule when specialising it, and to
decide the UNLESS clause for embedded hypotheses. The algorithm performs a specific-to-general
[Figure 6 plot: positive (+) and negative (−) example cases plotted in the age-weight plane (age 1-13
on the horizontal axis, weight 1-9 on the vertical axis), with the positive cases A, B, and C marked
and the rule to be refined drawn as a thick rectangle.]
search through a space of hypotheses, beginning with the most specific rule, which initially covers only
one case (namely the missed case), and terminating when the rule is sufficiently general. The algorithm
employs a performance measure of rule quality to guide the selection of the most promising conditions. At
each stage, the rule is evaluated by estimating the utility of removing a condition or changing a
numerical range, considering the number of positive and negative cases covered.
One might ask why the learning system did not find the newly learned rule when the knowledge base
was initially constructed. In the first place, the specified case may be an exceptional one and the rules
that are consistent with it can cover only a small number of other positive instances; thus those rules
dealing with this exception may not be found during the initial knowledge-base construction. Secondly,
the initial training set may not be a representative sample of cases in the category of interest. Thirdly,
since ARIS can be used to refine any knowledge base, these rules might not have been discovered by the
original learning technique.
Minimum covering algorithm (missed case)
{ Returns a rule that covers some examples belonging to the case's hypothesis, including the missed case
itself. Conducts a specific-to-general search for the best rule, guided by a utility metric. }
Min_Cover_Rule ← Ø
Rule ←Ø
1. Initialise Rule to the most specific hypothesis covering the given missed case
2. While Rule is overly specific Do
• Generate the next more general Rule:
• For each continuous condition
{ -Expand the range of the condition to cover the closest uncovered case
- If PERFORMANCE increases, then accept the modifications
Else
Refrain from implementing the last changes }
• For each condition
If removing the condition improves PERFORMANCE, execute the deletion operation
3. Sort conditions of Rule with respect to their weight
4. { From Rule, add to Min_Cover_Rule, one condition at a time}
I ←1
Do {
Min_Cover_Rule ← Min_Cover_Rule + condition I
If (Efficiency (Min_Cover_Rule) does not improve)
Remove condition I
I++
} Until (no more conditions are available)
5. Return Min_Cover_Rule of the form
IF conditions THEN hypothesis
Figure 7: The minimum covering algorithm
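The search of Figure 7 can be sketched for purely numeric conditions as follows. This is an illustrative reimplementation: the data points, helper names, and the use of the Section 3.4 efficiency measure with α = 0.8 are our assumptions, and discrete-condition deletion is omitted for brevity.

```python
def covers(rule, case):
    """rule: attribute -> (low, high) range; case: attribute -> value."""
    return all(lo <= case[a] <= hi for a, (lo, hi) in rule.items())

def quality(rule, cases, target, alpha=0.8):
    """alpha * consistency + (1 - alpha) * completeness (Section 3.4)."""
    covered = [(c, y) for c, y in cases if covers(rule, c)]
    correct = sum(1 for c, y in covered if y == target)
    class_size = sum(1 for c, y in cases if y == target)
    cons = correct / len(covered) if covered else 0.0
    comp = correct / class_size if class_size else 0.0
    return alpha * cons + (1 - alpha) * comp

def min_cover_rule(missed, target, cases):
    # Step 1: most specific rule -- point ranges around the missed case.
    rule = {a: (v, v) for a, v in missed.items()}
    improved = True
    while improved:
        improved = False
        for c, y in cases:
            if y != target or covers(rule, c):
                continue
            # Step 2: tentatively widen each range to take in this uncovered
            # positive case; keep the expansion only if quality rises.
            wider = {a: (min(lo, c[a]), max(hi, c[a]))
                     for a, (lo, hi) in rule.items()}
            if quality(wider, cases, target) > quality(rule, cases, target):
                rule, improved = wider, True
    return rule

# Toy data in the spirit of Figure 6 (age/weight plane, case A missed):
cases = [({"age": 6.0, "weight": 6.2}, "+"),   # case A
         ({"age": 4.0, "weight": 7.0}, "+"),   # case B
         ({"age": 5.0, "weight": 5.5}, "+"),   # case C
         ({"age": 2.0, "weight": 4.0}, "-"),
         ({"age": 7.5, "weight": 8.0}, "-")]
rule = min_cover_rule(cases[0][0], "+", cases)
print(rule)  # {'age': (4.0, 6.0), 'weight': (5.5, 7.0)}
```

Starting from the point rule around case A, the search accepts the expansions toward B and C because each raises the quality measure, and it stops once no uncovered positive case yields a further improvement.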
4. Experimental evaluation
4.1 Effects of new information
The research presented in this paper attempts to find a better way of utilising information in domains
where the available data is large and ever-growing, particularly in automated data-collection
environments. The ARIS algorithms were applied to problems from several different domains [20]. The aim
was to deliver an improvement on existing inductive learning systems. All the executed experiments employ
data sets which are randomly divided into a training data set, a refinement data set, and a test data set.
The training and testing data sets each comprise 40% of the available data, while the remaining 20% is
assigned to the refinement data set. Table 5 presents the performance of ARIS before and after
refinement, where the rules are generated using consistency and completeness criteria. Table 6
demonstrates the performance of ARIS on knowledge bases created by the C4.5 system. Table 7 shows the
effect of refinement on knowledge bases created by the RIPPER system.
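The split described above can be sketched as follows (an illustrative harness, not the original experimental code):

```python
import random

def split_data(cases, seed=0):
    """Randomly partition cases into 40% training, 20% refinement, 40% test."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.4 * n)
    n_test = int(0.4 * n)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    refinement = shuffled[n_train + n_test:]   # the remaining ~20%
    return training, refinement, testing

training, refinement, testing = split_data(list(range(100)))
print(len(training), len(refinement), len(testing))  # 40 20 40
```

Drawing the refinement set independently of the training set is what lets the refinement step act as a genuine second view of the domain rather than a re-examination of the training data.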
The experiments show that the main operators used in knowledge-base refinement by the ARIS system
are rule creation and rule specialisation. The biggest influence is the number of error-causing rules,
because each of them must be prevented from firing for the misdiagnosed cases, either to allow the
correct rule to fire or so that a new rule can be created. They can be specialised by strengthening the
left-hand side of the error-causing rule, which may be achieved by adding another condition or shrinking
the numerical range of a condition, or by appending a negated conjunction in the form of an UNLESS
clause. The experiments also showed that a reduction in the number of rules produced is vital.

In practice, a knowledge base is created so that the rules have the maximum generality and least
specificity possible; in other words, the rules must be general enough to accommodate future
situations while committing the fewest possible misclassifications. The experiments indicate that the
generalisation operation is rare relative to the other refinement operations (approximately 3%
of the total number of refinement operations), while specialisation accounted for 13%. The most common
operations were rule creation (43%), UNLESS-clause addition (21%), and rule deletion (20%). We would
expect the refinement system to follow this trend, because the initial knowledge base is susceptible to
incompleteness and inconsistency.
Domain            Before refinement      After refinement
                  #Rules   Accuracy      #Rules   Accuracy   #Specs   #Gens   #UNLESS   #New rules   #Rules deleted
Audiology          22.9     43.04         16.1     58.20      3.9      0       0         4.3          11.1
Flag               31.2     56.39         15.9     58.20      0        0       0         1.5          16.8
Mushroom           22.7     99.69          9.8     99.85      0        0       0         0.9          13.8
Heart              26.8     48.93         26.2     49.58      0        0       0         2.8           3.4
Wine                3.1     87.36          4.1     88.12      0        0       0         1.3           0.3
Hepatitis           3       78.55          4.5     79.52      0        0       0         1.7           0.2
Iris                2.6     93.17          3       95.67      0        0       0         0.7           0.3
Hypothyroid         3.4     97.89          4.5     98.20      0        0       0         1.6           0.5
Adult             160.8     71.43         57.1     71.44      0        0       0         7.5          11.01
Artificial data    19.4     98.09         20.3     98.11      0        0       0         1             0.1

Table 5: ARIS performance on knowledge bases created by ARIS using completeness and consistency
criteria on ten selected domains, averaged over ten trials per domain
Domain            Before refinement      After refinement
                  #Rules   Accuracy      #Rules   Accuracy   #Specs   #Gens   #UNLESS   #New rules   #Rules deleted
Audiology          23.2     44.02         12.6     49.6       0.1      0.2     –         4            14.6
Flag               33       55.3          14.2     60.24      0.1      0.1     0.1       3            21.8
Mushroom           34.3     98.47         13.6     98.50      0.2      –       –         0.2          20.5
Heart              38.9     49.26         12.3     50.66      0.4      0.1     –         0.6          27.2
Wine                5.7     89.72          4.3     89.72      0.3      0.1     –         0.2           1.5
Hepatitis           5.9     80.32          3.7     81.61      0.2      0.8     –         0.2           2.7
Iris                4.6     93.5           2.5     93.5       0.1      –       –         0.3           2.4
Hypothyroid         7.6     96.99          5.1     97.67      0.1      –       –         1.9           4.4
Adult             228       77.62         36.4     78.61      –        –       –         0.4         192
Artificial data    47.9     92.0          33.2     93.6       0.4      0.5     –         2.3          17

Table 6: ARIS performance on knowledge bases created by the C4.5 system,
averaged over ten trials per domain
Domain            Before refinement      After refinement
                  #Rules   Accuracy      #Rules   Accuracy   #Specs   #Gens   #UNLESS   #New rules   #Rules deleted
Audiology          12.8     66.56         12.8     68.04      0.1      0.3     –         2.9           2.8
Flag                8.6     52.89         13.7     55.66      0.5      0.5     –         6.1           1
Mushroom            7.5     99.84          7.5     99.86      –        –       –         0.4           0.4
Heart               2.9     52.2           9.5     53.03      0.1      0.7     –         7.1           0.3
Wine                2.8     86.25          3.8     87.64      0.5      0.3     –         1.6           0.5
Hepatitis           1.3     77.26          3.4     77.58      0.2      –       –         2.2           0.1
Iris                2.6     90.67          3.1     90.99      –        –       –         0.8           0.3
Hypothyroid         2.5     98.4           2.6     98.4       –        –       –         0.3           0.2
Adult               4       81.27          7.5     81.27      –        –       –         3.8           –
Artificial data    15.6     96.34         17.9     96.35      –        –       –         2.7           0.4

Table 7: ARIS performance on knowledge bases created by the RIPPER system,
averaged over ten trials per domain
4.2 Fair comparison between ARIS and other systems
We have shown that the refinement approach improved the concept-description accuracy of both the C4.5
and RIPPER systems, and reduced knowledge-base complexity (in terms of number of rules) whenever
possible for several databases, using a separate set of data for refinement. In this section we compare
the approach of learning plus refinement to simply learning rules from all the data.
Our comparison involves the following strategy:
• Induce a knowledge base by training the learning system on 40% of the available data and then
refine it using 20% of the available data;
• Induce a knowledge base by training the learning system on 60% of the available data;
• Compare the performance of the generated knowledge bases by testing both approaches on the
remaining 40% of the data.
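The protocol can be sketched with a stand-in learner (majority class) so the harness runs end to end; ARIS, C4.5, or RIPPER and the actual refinement step would plug in behind the same two functions. All names here are illustrative.

```python
from collections import Counter

def learn(train):
    """Stand-in learner: always predict the training data's majority class."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

def accuracy(model, test_set):
    return sum(model(x) == y for x, y in test_set) / len(test_set)

def compare(train40, refine20, test40, refine_fn):
    model_a = refine_fn(learn(train40), refine20)  # learn on 40%, refine on 20%
    model_b = learn(train40 + refine20)            # learn on the combined 60%
    return accuracy(model_a, test40), accuracy(model_b, test40)

# With an identity "refinement" the two strategies coincide on toy data:
train40 = [("a", "+"), ("b", "+"), ("c", "-")]
refine20 = [("d", "+")]
test40 = [("e", "+"), ("f", "-")]
acc_refined, acc_combined = compare(train40, refine20, test40,
                                    lambda model, cases: model)
print(acc_refined, acc_combined)  # 0.5 0.5
```

Holding the 40% test partition fixed across both strategies is what makes the accuracies directly comparable, as in Table 8.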
Table 8 is another comparison of three systems, namely ARIS, C4.5, and RIPPER on the selected test
domains. The refinement results were compared with knowledge bases induced by incorporating both
the training data and refinement data sets as a unified training set. This gives a fairer comparison
between ARIS, C4.5, and RIPPER. The first column names the domain used. The second column shows
the performance of the ARIS system on the test data used when trained on 40% of the available data.
The third column gives the performance of the ARIS system on the same test data when trained on the
combined training data (i.e. training and refinement data sets together). The fourth column gives the
performance of the knowledge base on the test data after refinement. Column five gives the performance
of the C4.5 system on the test data when trained on 40% of the available data. Column six shows the
performance of C4.5 on the test data when trained on the combined data. Column seven indicates the
performance of the refined knowledge base induced by C4.5 system on the same test data. Column eight
shows the performance of RIPPER system on the same test data when trained on 40% of the available
data. Column nine gives the performance of RIPPER on the test data when trained on the combined
data. Column ten gives the performance of the RIPPER’s refined knowledge-base on the same test data
set.
The tick marks indicate cases where training followed by refinement yielded better results than simply
training on all the data.
The experiments reveal a striking difference in rule induction between C4.5 and RIPPER.
Specifically, the C4.5 system generates many rules, some of which have the potential to cause rule
contradiction. The ARIS refinement process demonstrated that deleting such superfluous rules often
increases knowledge-base accuracy. RIPPER's approach to inducing rules, on the other hand, creates
fewer rules; therefore, during refinement, ARIS performs more rule creations and fewer rule deletions on
the RIPPER knowledge bases.
Finally, let us analyse the causes of an increase in accuracy in inductive learning systems. In general,
it can be expected that concept-description accuracy will increase as the size of the database increases.
This expectation is illustrated by the accuracy on the artificial data set (last row in Table 8), where
training a learning system on the large non-noisy data set produced better results than training on a
small portion and then refining the induced knowledge base. This advantage begins to deteriorate as noise
increases, as in the hepatitis and hypothyroid data.
Comparing the use of ARIS on the combined data set (60% of the data) just to generate the rules without
refinement, with training ARIS on 40% of the data and refining on 20%, we observed that the refinement
approach gave better results on 7 out of 10 domains and was inferior on 3 (wine, artificial data, and
flag).
Refining C4.5 rules produced with 40% of the available data was better than training the C4.5 system on
the combined data set in 7 out of 10 domains. This indicates that a refinement approach is useful when
knowledge bases are created without testing their predictive abilities, as is the case with the C4.5
system.
In RIPPER, the rules are generated from one training data set and post-pruned using a separate pruning
data set. This is a limited refinement scheme, and it is the advantage RIPPER holds over other learning
systems such as C4.5, HYDRA, and AQ15. This is the reason behind RIPPER's good-quality results when
trained on the combined data set, as can be seen from Table 8. However, it is still error-prone under
noise; for example, on the Hypothyroid data set, RIPPER's results when trained on the combined data are
inferior to those obtained by training the system on the smaller data set (40%) and then using ARIS to
refine the rules.
The hepatitis data is the smallest noisy data set (135 instances) we have. The attributes available (19
attributes) enabled the refinement system to draw the class boundaries better than when the learning
system was trained on the combined data. The Heart data set has similar characteristics to the hepatitis
data, and similar results are obtained. On the mushroom data set, the same accuracy was reached with
both approaches. The remaining six domains showed better results for RIPPER when trained on the
combined data set.
In summary, the refinement mechanism improved concept-description quality for all the algorithms in the
three medical domains (Hepatitis, Hypothyroid, and Heart), which are characterised by noise and
small-disjunct problems. Moreover, improvement on several other domains was obtained with C4.5 and
ARIS. It is therefore recommended that learning systems use a refinement mechanism on a data set
separate from the training set used to induce the knowledge base, as RIPPER and ARIS do, to obtain
good-quality concept descriptions.
Domain           ARIS using completeness and consistency      C4.5 system                                 RIPPER system
                 Trained   Trained on   Refined               Trained   Trained on   Refined              Trained   Trained on   Refined
                 on 40%    combined     KB                    on 40%    combined     KB                   on 40%    combined     KB
Iris 93.17 93.67 95.67 ✓ 93.5 94.0 93.5 90.67 93.33 90.99
Wine 87.36 89.72 88.12 89.72 90.0 89.72 86.25 90.14 87.64
Hepatitis 78.55 78.39 79.52 ✓ 80.32 77.96 81.61 ✓ 77.26 76.94 77.58 ✓
Hypothyroid 97.89 98.14 98.2 ✓ 96.99 97.74 97.67 ✓ 98.4 98.20 98.4 ✓
Heart 48.93 47.05 49.59 ✓ 49.26 50.19 50.66 ✓ 52.2 52.21 53.03 ✓
Flag 56.39 59.64 58.2 55.3 55.54 60.24 ✓ 52.89 57.23 55.66
Audiology 43.04 40.20 49.02 ✓ 44.02 43.12 49.12 ✓ 66.56 70.69 68.04
Mushroom 99.69 99.77 99.85 ✓ 98.47 98.37 98.5 ✓ 99.84 99.87 99.86
Adult 71.43 70.97 71.44 77.62 78.07 78.61 ✓ 81.27 82.22 81.27
Artificial data 98.09 98.86 98.11 92.0 98.16 96.6 96.34 97.53 96.35

Table 8: Accuracy of ARIS, C4.5, and RIPPER when trained on 40% of the data, when trained on the
combined 60%, and after refinement; ✓ marks cases where training followed by refinement beat training
on all the data
4.3 Statistical comparisons
To test the results statistically, we used a two-way analysis of variance to compare the means with a
significance level of 5%. The raw data from the ten cross-validation experiments was used. The data set
was one feature and the type of analysis the other. The six analysis types used corresponded to the
results shown in Table 8, comparing using 60% of the data for training to using 40% for training and
20% for refinement. The following results show the ANOVA table and the Duncan groupings for the
type of analysis.
Source           DF    Type I SS      Mean Square    F Value    Pr > F
Analysis type     5      1567.54        312.3083       6.77     0.0001
Data set          9    192014.52      21334.9474     462.53     0.0001

Table 9: ANOVA table for experimental results
Duncan Grouping Mean N EFF
A 82.42 100 RIPPER trained on 60%
A 81.47 100 RIPPER trained on 40% and
using 20% for refinement
B 79.57 100 ARIS trained on 40% and
Using 20% for refinement
B 79.42 100 C4.5 trained on 40% and
using 20% for refinement
B 78.37 100 C4.5 trained on 60%
B 77.86 100 ARIS trained on 60%
Table 10: Duncan grouping of overall mean values for the three inductive learning
systems with refinement
Refinement of the results from the C4.5 system gave better performance than training on 60% of the
available data, by more than 1%. Training ARIS (just to create a knowledge base) on 40% and then
refining resulted in better performance than training ARIS on 60%, by more than 1.7%. However,
training RIPPER on the combined data (i.e. 60%) is better than training on 40% and then refining using
20%, by less than 1%. It is important to note again that RIPPER includes a refinement scheme during
the creation of a knowledge base; exposing the system to more data for refinement would therefore often
produce better results than refining on only 20%. Moreover, the lack of some refinement operators
within RIPPER leads to results inferior to those of our refinement approach in noisy domains with
small-disjunct problems, as discussed previously.
In a second analysis we confine our comparison to the C4.5 and ARIS systems, excluding RIPPER as it
includes an internal refinement approach. The following tables show the output at the 5% significance
level.
Source           DF    Type I SS       Mean Square    F Value    Pr > F
Analysis type     3       203.9567       67.9856        1.6      0.188
Data set          9    150542.5780    16726.9531      394.24     0.000

Table 11: ANOVA table for experimental results for two inductive systems
Duncan Grouping Mean N EFF
A 79.57 100 ARIS trained on 40% and
used 20% for refinement
B A 79.42 100 C4.5 trained on 40% and
used 20% for refinement
B A 78.37 100 C4.5 trained on 60%
B 77.86 100 ARIS trained on 60%
Table 12: Duncan grouping comparison of results with and without refinement
Table 12 shows that training the ARIS system on 40% and then refining using 20% produces better results
than training on the combined data (i.e. 60% of the available data), and the difference is statistically
significant. The table also shows that training the C4.5 system on 40% and refining using 20% is better
than training on 60%.
5. Differences between inductive refinement and incremental learning

The top-level goal of both subareas, incremental learning and inductive refinement, is to produce
learning systems that are able to modify a knowledge base as a result of experience. Both subareas
interleave learning and performance (making learning systems themselves adaptable). However, each
subarea has its own characteristic strengths and weaknesses. We note some of them below:

• Source of knowledge
Incremental learning systems modify their own inferred knowledge base. Inductive refinement,
however, has the power to revise knowledge bases inferred by other batch learning systems and
hence can be valuable for improving inductive learning in general.

• Nature of refinement operators
Inductive refinement systems generally have the ability to refine a knowledge base by
generalising or specialising its rules, creating new rules, or removing superfluous rules. In
contrast, incremental learning only specialises or generalises its current hypothesis.

• Input schema used
Whereas incremental learning systems accept their input one object at a time, inductive
refinement systems receive the output of a batch learning system and a set of refinement data
before commencing the refinement process.

• Evaluation strategy
Refinement systems require access to an additional view of the environment (i.e. a refinement
data set) to refine the knowledge base, leading to better performance with regard to the whole
refinement set while correcting the misdiagnosed cases. In contrast, incremental learning
processes its input one object at a time to guarantee covering the given object correctly.
6. Conclusions
6.1 Relationship between ARIS and other systems
The main limitation of inductive learning systems is that the results from the testing phase play no role
in improving the learning process, but are merely used for validation purposes. After the learning
process is completed, the generated concept descriptions are used over and over for prediction. If new
information is observed, the learning system cannot utilise it; the previously learned knowledge base
must be discarded and all the data (old and new) supplied to the learning system to create the knowledge
base anew.
In ARIS the results obtained are analysed and fed back automatically to the learning mechanism. This
causes the learning system to refine the knowledge base, add missing rules, and delete superfluous rules,
producing a more comprehensive and accurate knowledge base. While retaining the fundamental ideas
behind inductive learning, ARIS is more powerful and capable of performing many additional actions to
improve its knowledge base.
The SEEK system [22,23] is able to help an expert focus on a problem within a knowledge base, but it is
then up to the expert to decide whether to accept the suggested change. One feature of SEEK is the use
of exclusions to rule out a decision in some situations (a common phenomenon in the medical domain).
But because the rules were provided in advance by experts in the field (rheumatic diseases), and are
expected to be complete, which is a strong restriction, SEEK was unable to produce new exclusions to its
rules, nor was it able to add new rules. Specialisation was limited to adding conditions, and numerical
ranges could not be altered. ARIS takes this further, suggesting new possible exclusions as well as adding
or altering conditions (both discrete and ordinal), allowing slight modifications to existing rules in
contrast to the radical changes of deleting and adding conditions. ARIS uses a minimum covering
approach to reach the exclusion, expressed as an additional clause (i.e. UNLESS) appended to the
original rule, which states that a combination of facts rules out a particular decision in certain situations.
The minimum covering approach facilitates the process of determining which condition to add to a rule
for specialisation.
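The effect of an UNLESS clause can be sketched as follows. This is an illustrative reading of the rule form described above, not ARIS's actual data structures; the rule and attribute names are hypothetical.

```python
def rule_fires(conditions, exclusions, case):
    """Return True when every condition holds and no exclusion clause does."""
    if not all(cond(case) for cond in conditions):
        return False
    # the UNLESS part: any satisfied exclusion rules the conclusion out
    return not any(excl(case) for excl in exclusions)

# Hypothetical rule: IF fever AND joint-pain THEN rheumatic UNLESS recent-vaccination
conditions = [lambda c: c["fever"], lambda c: c["joint_pain"]]
exclusions = [lambda c: c["recent_vaccination"]]
fires = rule_fires(conditions, exclusions,
                   {"fever": True, "joint_pain": True, "recent_vaccination": False})
```

The exclusion thus specialises the rule without deleting or rewriting its original conditions, which is exactly the kind of slight modification described above.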
The other feature of SEEK is that it distinguishes between conditions by marking each as minor or
major. This acknowledges that conditions contribute differently to a rule's conclusion, and is a coarse
form of condition weighting. ARIS instead weighs the conditions in rules using a probabilistic estimate
derived from the data distribution in the training set; the weight is continuous in the range -1 to +1.
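A minimal sketch of such a continuous condition weight is shown below. The paper states only that Bayes' theorem is used and that weights lie in [-1, +1]; the particular mapping here (2·P(class | condition) − 1) is an illustrative assumption, not the formula ARIS implements.

```python
def condition_weight(cases, condition, target_class):
    """Estimate how strongly `condition` supports `target_class`.

    cases: list of (attributes-dict, class-label) pairs.
    condition: predicate over an attributes dict.
    Returns a weight in [-1, +1]: +1 if the condition always implies
    the class, -1 if it never does, 0 if it is uninformative.
    """
    satisfied = [label for attrs, label in cases if condition(attrs)]
    if not satisfied:
        return 0.0  # condition never fires: no evidence either way
    p_class_given_cond = sum(1 for l in satisfied if l == target_class) / len(satisfied)
    return 2.0 * p_class_given_cond - 1.0

# Hypothetical toy data: fever is weak evidence for flu here
cases = [({"fever": True}, "flu"), ({"fever": True}, "flu"),
         ({"fever": False}, "cold"), ({"fever": True}, "cold")]
w = condition_weight(cases, lambda a: a["fever"], "flu")  # 2*(2/3) - 1 = 1/3
```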
SEEK2 [10], an extension to SEEK, learns new rules by processing examples. Using a set of examples
for which the hypothesis failed, SEEK2 finds a set of common features which are then used to form a
new rule. A complete refined knowledge base is presented to the expert (SEEK2 also works on a
knowledge base predefined by an expert). ARIS finds a set of features by using a minimum covering
algorithm: it starts from a misdiagnosed example and gradually allows more facts to be incorporated.
This gives more flexibility and allows ARIS to succeed in many situations requiring rule modifications.
EITHER [21] uses multiple approaches to refine an existing knowledge base, guided by pre-classified
examples. Missing rules can be generated; however, conditions can only be added or removed. Ordinal
tests cannot be modified, and numerical ranges cannot be expanded or reduced. For example,
IF age <= 14 THEN teen-ager
cannot be modified to
IF age <= 18 THEN teen-ager
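The numeric-range operator that EITHER lacks can be sketched in a few lines. The function name and representation below are hypothetical; the point is only that widening a threshold is a distinct operator from adding or deleting conditions.

```python
def generalise_upper_bound(threshold, case_value):
    """Widen a condition of the form `attr <= threshold` just enough
    to cover a misdiagnosed case with attribute value `case_value`."""
    return max(threshold, case_value)

# IF age <= 14 THEN teen-ager, plus a misdiagnosed 18-year-old teenager:
new_threshold = generalise_upper_bound(14, 18)  # the rule becomes age <= 18
```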
KRUST [7] shows that generating many refinements to a knowledge base, with the help of a set of
precisely selected pre-classified examples (called chestnuts) and without considering statistical
measures, is another alternative for refinement. Different refinements are generated, each leading to a
whole new knowledge base, and then one or more refined knowledge bases believed to be good are
suggested to the collaborating expert. ARIS, however, is designed with a different idea in mind: it tries
to learn general rules using a heuristic search on a library of cases, and gradually extends or modifies
its knowledge as required.
The lesson learned from KRUST is that applying all possible refinement operators and selecting the best
performing operator, or group of operators, to remedy the misdiagnoses is better than accepting the first
successful refinement.
Table 13 presents and compares the following important characteristics of refinement systems:
• Knowledge-base source
Previous knowledge-base refinement systems rely on domain experts to construct rule sets that are
complete with respect to the domain of interest. Our approach instead depends on heuristics to
generate and refine knowledge.
• Knowledge-base representation
Several different representations have been used to facilitate knowledge transfer from domain
experts to computers. Some representations are domain-specific, for example the criteria tables
used by SEEK in the medical domain, and cannot easily be used in other domains. A propositional
production-rule language is used in ARIS because it is easy to comprehend and expresses
knowledge efficiently.
• Conflict resolution strategy
The conflict resolution scheme provides the framework for pin-pointing one conclusion among the
various possible conclusions that can be reached. Some systems, such as KRUST and EITHER,
accept the conclusion of the first rule satisfied. Others, such as SEEK and SEEK2, accept the
conclusion reached with the highest certainty. Our approach orders rules according to their weight,
which is a probabilistic estimate of rule quality.
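Because the rules are kept sorted by weight, firing the first satisfied rule is equivalent to firing the highest-weight satisfied rule. A minimal sketch, assuming rules are (weight, conditions, conclusion) triples (a representation chosen here for illustration):

```python
def classify(rules, case):
    """Fire the highest-weight rule whose conditions all hold for `case`."""
    for weight, conditions, conclusion in sorted(rules, key=lambda r: -r[0]):
        if all(cond(case) for cond in conditions):
            return conclusion
    return None  # no rule satisfied; some fallback/default class is needed

# Hypothetical two-rule knowledge base
rules = [
    (0.9, [lambda c: c["fever"], lambda c: c["cough"]], "flu"),
    (0.4, [lambda c: c["fever"]], "cold"),
]
label = classify(rules, {"fever": True, "cough": False})  # falls through to "cold"
```

The `None` return also marks the third problem raised in the abstract: classifying an object when no rule is satisfied for it requires a separate mechanism.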
• Pre-assumptions
All other knowledge-base refinement systems assume that the knowledge base is complete and only
minor tweaking is required. This view assumes the availability of experts and that the domain is
fairly well understood. Inductive concept-learning deals with more general environments that are
characterised by:
- lack of understanding of the particular domain,
- availability of data describing instances of the environment's behaviour,
- regular change of the environment, as is the case with medical images and credit-card
transactions.
These characteristics often make it impossible to guarantee completeness of the initial knowledge
base.
• Missing refinement operators
Early refinement systems such as SEEK and SEEK2 were developed for specific problem domains,
and some refinement operators were missing. For example, as a result of the completeness
assumption, SEEK could not create new rules. We implemented all the refinement operators
applicable to propositional production rules.
• User interaction
Interaction with a domain expert or knowledge engineer is a common characteristic of previous
knowledge-base refinement systems. Our approach is non-interactive.
• Knowledge-base evaluation approach
Reaching a good-quality refined knowledge base can be viewed as a search through a space of
possible knowledge bases. The EITHER system accepts the first successful refinement, while SEEK
relies on local statistics that favour the refinement deemed to correct the most misdiagnoses.
KRUST, on the other hand, generates many refinements and then selects the best one. ARIS
evaluates the quality of the knowledge base after applying every possible refinement operation to
the misclassifications, and chooses the refinement resulting in the highest knowledge-base quality,
measured by its completeness and consistency.
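The paper does not give formulas for the two quality factors, but plausible stand-ins are coverage rate (completeness) and accuracy on the covered cases (consistency). The sketch below uses those definitions and the same hypothetical (weight, conditions, conclusion) rule triples as before; both choices are assumptions for illustration.

```python
def kb_quality(rules, cases):
    """Score a knowledge base on a refinement set.

    rules: list of (weight, conditions, conclusion) triples.
    cases: list of (attributes-dict, true-label) pairs.
    Returns (completeness, consistency): the fraction of cases some rule
    covers, and the fraction of covered cases the fired rule gets right.
    """
    def fire(case):
        for weight, conditions, conclusion in sorted(rules, key=lambda r: -r[0]):
            if all(cond(case) for cond in conditions):
                return conclusion
        return None

    predictions = [(fire(case), label) for case, label in cases]
    covered = [(p, l) for p, l in predictions if p is not None]
    completeness = len(covered) / len(cases)
    consistency = (sum(1 for p, l in covered if p == l) / len(covered)) if covered else 0.0
    return completeness, consistency

# Toy knowledge base and refinement set: 2 of 3 cases covered, 1 of those correct
kb = [(0.8, [lambda c: c["x"] > 0], "pos")]
data = [({"x": 1}, "pos"), ({"x": 2}, "neg"), ({"x": -1}, "neg")]
comp, cons = kb_quality(kb, data)
```

Comparing candidate refinements then reduces to comparing these two numbers before and after each operation.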
In summary, ARIS improves on inductive learning systems by using performance analyses as feedback
to the learning process. The results for several knowledge bases are encouraging. However, there are
extensions that would make it a more powerful learning tool, as discussed in Section 6.3.
Table 13. Comparison of knowledge-base refinement systems.

Knowledge-base source
  ARIS: inductively generated by analysis of a pre-classified data set.
  SEEK: expert originated.
  KRUST: expert originated.
  EITHER: expert originated.

Knowledge-base representation
  ARIS: propositional production rules (IF conditions THEN conclusion UNLESS conditions).
  SEEK: criteria tables.
  KRUST: backward-chaining production rules, using semantic nets.
  EITHER: Horn-clause rules.

Conflict resolution
  ARIS: first rule satisfied is fired.
  SEEK: the rule with the highest certainty is fired.
  KRUST: first rule satisfied is fired.
  EITHER: first rule satisfied is fired.

Pre-assumptions
  ARIS: training data is an adequate representation of the domain.
  SEEK: the knowledge base is assumed complete; only minor tweaking is required.
  KRUST: the knowledge base is assumed complete; only minor tweaking is required.
  EITHER: the knowledge base is assumed complete; only minor tweaking is required.

Missing refinement operators
  ARIS: all refinement operators are implemented.
  SEEK: no rule creation; no KB redundancy check; refinement of continuous conditions is missing.
  KRUST: no KB redundancy check; new rule creation is not efficient; does not handle exceptional cases.
  EITHER: no KB redundancy check; refinement of continuous conditions is missing; does not handle exceptional cases.

Expert interaction
  ARIS: no interaction with the system.
  SEEK: yes.
  KRUST: yes.
  EITHER: yes.

Knowledge-base evaluation technique
  ARIS: completeness and consistency factors are used to evaluate the system.
  SEEK: local statistical measure; the first successful refinement is accepted.
  KRUST: generates many refinements, then selects one; the knowledge bases are tested on a special
  data set (chestnuts), selected precisely for the given domain so that a good knowledge base does
  not misclassify it.
  EITHER: the first successful refinement is accepted.
6.2 Conclusion
An inductive refinement model has been developed that is capable of creating a knowledge base from a
library of pre-classified cases and continuously updating it to accommodate new facts. This model is of
particular importance in regularly changing and noisy domains such as credit-card transactions and
medical images.
We have developed a method of learning rule weights based on an estimate of the relationship of rule
conditions to a conclusion. The rules are ranked according to their weight to determine the misdiagnosed
cases more easily. We have further developed a method for learning rules centred around a
misdiagnosed case, which we call the minimum covering algorithm; it helps determine the missing
conditions in erroneous rules for specialisation, as well as add new rules to the knowledge base. We
have also developed an algorithm for inducing rules from trees, utilising completeness and consistency
criteria for rules.
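The covering idea can be sketched as a greedy search. The paper names the minimum covering algorithm but gives no pseudocode, so the version below, which starts from a misdiagnosed case and repeatedly picks the attribute test that rules out the most conflicting cases, is a hedged reconstruction with hypothetical names.

```python
def minimum_cover(seed, negatives):
    """Greedily pick attribute tests around a misdiagnosed case.

    seed: attributes dict of the misdiagnosed case.
    negatives: attribute dicts of the cases (of other classes) that the
    new condition set must rule out.
    Returns a small list of (attribute, value) equality tests, all drawn
    from the seed case, that together exclude every negative.
    """
    remaining = list(negatives)
    chosen = []
    while remaining:
        # the test from the seed case that excludes the most negatives
        attr = max(seed, key=lambda a: sum(1 for n in remaining if n.get(a) != seed[a]))
        if all(n.get(attr) == seed[attr] for n in remaining):
            break  # no remaining test discriminates; give up on the rest
        chosen.append((attr, seed[attr]))
        remaining = [n for n in remaining if n.get(attr) == seed[attr]]
    return chosen

seed = {"fever": True, "rash": True}
negatives = [{"fever": True, "rash": False}, {"fever": False, "rash": False}]
tests = minimum_cover(seed, negatives)  # one test on "rash" suffices here
```

The chosen tests can then serve either as extra conditions specialising an erroneous rule or as the body of a new rule covering the misdiagnosed case.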
Specific areas of contribution include:
• Scaling learning algorithms to large and ever-growing databases
Databases are large and growing larger, so learning systems must be able to handle enormous
amounts of data. Inductive refinement introduces a method for inductive systems to cope with
future changes in the data without forgetting past experience. This is useful because it allows only
the recently observed data to be investigated: for example, it is easier to process the last
thousand credit-card transactions than a full database of millions.
• Focus of the refinement process
Learning weights for the rules and conditions makes it possible to focus the refinement process
and resolve conflicts more easily.
• Implementation that demonstrates the goals
The ideas put forth in this paper have been implemented, and the implementation has
demonstrated the feasibility and effectiveness of the approach, including:
(a) More accurate than C4.5
ARIS's refinement approach proved more accurate than training C4.5 on all the data at
once (combined training) on many noisy data sets (see Table 8). C4.5 is more accurate on
one large noise-free data set (the artificial data) and equal on two other small noise-free
data sets (the iris and wine data sets).
(b) Occasionally equal to or more accurate than RIPPER
Table 8 shows that ARIS's refinement approach is more accurate than RIPPER on three
test domains (heart, hypothyroid, and hepatitis) that are characterised by noise and
small-disjunct problems (i.e. rules covering few cases). When RIPPER creates new rules, it
ignores any rule with accuracy below 50% on the pruning data set (part of the training
data provided). In the presence of small disjuncts combined with noise, RIPPER therefore
misses some rules that later prove useful; the refinement mechanism corrects such
problems when they are observed.
On one data set (the mushroom data set), where the majority class forms 51.81% of the
domain, the two approaches are equal. However, RIPPER is better when trained on all the
data combined on the remaining test domains.
It is important to note that RIPPER includes an internal, limited form of refinement scheme.
A learning system exposed to a larger view of the environment can therefore be expected
to perform better, especially in noise-free domains such as the artificial data set used in
the experiments, so improving on RIPPER in some domains is a good result. This is a
benefit of the increased search effort and the additional refinement operators available,
which can modify the class boundaries gradually, resulting in improved accuracy of the
knowledge-based system.
The goal of this research has been to provide an effective approach to inductive learning, based on a
batch-learning strategy, that makes learning systems responsive to domain changes.
The experiments with the ARIS system, and the lessons learned while refining knowledge bases
induced by other learning systems such as C4.5 and RIPPER, serve as a good demonstration of the
significance of refinement in improving the performance of inductively learned systems.
Moreover, RIPPER could be improved if it:
(1) performed more refinement operators, such as expanding and shrinking numeric ranges rather
than only adding and deleting conditions; and
(2) allowed the use of negated conditions in the developed rules, as ARIS does.
6.3 Limitations and further work
This research has pointed out several potential areas for further research and investigation:
• Learning intermediate concepts
To obtain greater benefit from concept induction, we need to relax the mutual exclusivity of
the classes. This would permit a class to be named as a condition in a rule, resulting in more
compact concept descriptions.
• Parallel implementation
More work needs to be done to speed up the refinement process. Since the refinement
operators (specialisation/generalisation) are performed independently, the refinement process
is well suited to parallel implementation. The problem would be to design an optimal
framework to distribute and control the workload among a fixed set of processors (equal in
number to the refinement operations). This would increase the speed of the refinement
system.
• Constructing new features
In many learning problems there are hundreds or thousands of features describing each input
case. Current concept-learning algorithms such as C4.5 [25] do not scale well when there are
many features. Moreover, it is easy for learning algorithms to be confused by noisy features and
construct poor classifiers.
Constructing new features from the given data enables learning systems to change their
representation and redescribe the given domain. New features can be constructed by assigning
names to groups of best-performing features for each class and/or combining feature values into
larger units. Constructing new features during the refinement cycle would produce more compact
concept descriptions, and might lead to performance increases similar to those reported by
some constructive-induction research in the DNA domain [17].
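One simple instance of combining feature values into larger units is forming a named conjunction of two existing boolean features. This is an illustrative sketch only; the naming scheme and helper are hypothetical, not part of ARIS.

```python
def conjoin(name_a, name_b):
    """Build a new named feature that tests both constituent features."""
    return (f"{name_a}_and_{name_b}",
            lambda case: bool(case[name_a]) and bool(case[name_b]))

# A constructed feature that a rule learner could then test as one unit
name, feature = conjoin("fever", "rash")
```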
Acknowledgements
I gratefully acknowledge the intellectual support of my supervisor Professor David Lubinsky, during my
research work, and for his helpful comments on this paper.
REFERENCES
1. Ali K. (1996). Learning probabilistic relational concept descriptions. Ph.D. thesis, University of
California.
2. Ali K., Pazzani M. (1993). HYDRA: A noise-tolerant relational concept learning algorithm. In:
Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambéry,
France.
3. Breiman L., Friedman J., Olshen R., Stone C. (1984). Classification and Regression Trees.
Wadsworth, Pacific Grove, CA.
4. Buchanan B., Shortliffe E. (1984). Rule-Based Expert Systems: The MYCIN Experiments of the
Stanford Heuristic Programming Project. Addison-Wesley Publishing Company.
5. Cloete I., Theron H. (1991). CID3: An extension of ID3 for attributes with ordered domains. South
African Computer Journal, 10-16.
6. Cohen W. (1995). Fast effective rule induction. In: Proceedings of the Twelfth International
Conference on Machine Learning. San Francisco, CA.
7. Craw S. (1991). Automating the refinement of knowledge-based systems. Ph.D. thesis, University of
Aberdeen.
8. Fayad U., et al. (1988). Learning control information in rule-based systems: a weak method. In:
Proceedings of the Fourth Conference on AI applications, 14-19.
9. Friedman J., Kohavi R., Yun Y. (1996). Lazy decision trees. In: Proceedings of the National
Conference on Artificial Intelligence, Portland.
10. Ginsberg A. (1986). Refinement of expert systems: A metalinguistic framework for heuristic
analysis. Ph.D. thesis, Rutgers University.
11. Holland J. (1986). Escaping brittleness: The possibilities of general purpose learning algorithms
applied to parallel rule-based systems. In: Michalski et al. (editors), Machine Learning II. Morgan
Kaufmann, Los Altos, California.
12. Holsheimer M., Siebes A. (1991). Data Mining: The search for knowledge in databases. Technical
Report CS-R9406, Amsterdam, The Netherlands.
13. Kanal L., Lemmer J. (1986). Uncertainty in Artificial Intelligence. North-Holland, Amsterdam.
14. Kohavi R., et al. (1997). The utility of feature weighting in nearest neighbour algorithms. In: Van
Someren M. and Widmer G. (editors), poster paper, Ninth European Conference on Machine
Learning. Czech Republic.
15. Kohavi R., Kunz C. (1997). Option decision trees with majority votes. In: Proceedings of the
International Conference on Machine Learning. Tahoe City, CA.
16. Laird J. (1988). Recovery from incorrect knowledge in Soar. In: Proceedings of the National
Conference on Artificial Intelligence, Saint Paul, Minnesota, 618-623.
17. Craven M., Shavlik J. (1993). Learning to represent codons: A challenge problem for constructive
induction. In: Proceedings of the Thirteenth International Joint Conference on Artificial
Intelligence, Chambéry, France, 1319-1324.
18. Mitchell T. (1997). Machine Learning. McGraw-Hill.
19. Mohri T., Tanaka H. (1994). An optimal weighting criterion of case indexing for both numeric and
symbolic attributes. In: D.W. Aha (editor), Case-Based Reasoning: Papers from the 1994 Workshop
(TR WS-94-01). Menlo Park, CA.
20. Murphy P. UCI repository of machine learning databases,
[http://www.ics.uci.edu/~mlearn/MLRepository.html].
21. Ourston D., Mooney R. (1994). Theory refinement combining analytical and empirical methods.
Artificial Intelligence, 273-309.
22. Politakis P. (1983). Using empirical analysis to refine expert system knowledge bases. Ph.D. thesis,
Rutgers University.
23. Politakis P., Weiss S. (1984). Using empirical analysis to refine expert system knowledge bases.
Artificial Intelligence, 23-48.
24. Quinlan R. (1988). Induction of decision trees. Machine Learning, 81-108.
25. Quinlan R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
26. Mooney R., Ourston D. (1994). A multi-strategy approach to theory refinement. In: Michalski et
al. (editors), Machine Learning IV. Morgan Kaufmann, Los Altos, California.
27. Theron H. (1994). Specialisation by exclusion: An approach to concept learning. Ph.D. thesis,
University of Stellenbosch, South Africa.
28. Wettschereck D., Aha D. (1995). Weighting features. In: Proceedings of the First International
Conference on Case-Based Reasoning. Sesimbra, Portugal.
29. Wilkins D., Ma Y. (1994). The refinement of probabilistic rule sets: sociopathic interactions.
Artificial Intelligence, 1-32.