RULE REFINEMENT IN INDUCTIVE KNOWLEDGE-BASED SYSTEMS
by Mohamed Arteimi
Department of Computer Science
University of the Witwatersrand, Johannesburg, 1999
Abstract
This paper presents empirical methods for enhancing the accuracy of inductive learning systems. It
addresses the problems of learning propositional production rules in multi-class classification tasks in
noisy domains, maintaining continuous learning when confronted with new situations after the learning
phase is completed, and classifying an object when no rule is satisfied for it.
It is shown that interleaving the learning and performance-evaluation processes allows accurate
classifications to be made on real-world data sets. The paper presents the system ARIS, which
implements this approach, and shows that the resulting classifications are often more accurate than
those made by the unrefined knowledge bases.
The core design decision that lies behind ARIS is that it employs an ordering of the rules according to
their weight. A rule’s weight is learned by using Bayes’ theorem to calculate weights for the rule’s
conditions and to combine them. This model focuses the analysis of the knowledge base and assists the
refinement process significantly.
The system is non-interactive; it relies on heuristics to focus the refinement on those experiments that
appear to be most consistent with the refinement data set. The design framework of ARIS consists of a
tabular model for expressing rule weights, together with the relationship between refinement cases and
the rules satisfied for each case, which focuses the refinement process. The system has been used to
refine knowledge bases created by ARIS itself, as well as knowledge bases created by the RIPPER and
C4.5 systems [6,25], in ten selected domains. Two main advantages have been observed: first, the
ability to gradually improve the knowledge base as the refinement proceeds; second, the ability to learn
strong rules utilising condition weights in a minimum covering algorithm. By taking account of this
information we improve the applicability and quality of the refinement.
Keywords: knowledge-base refinement; inductive learning.
1. Introduction
The growth in database technology and computer communication has resulted in the creation of huge,
efficient data stores in almost every domain. For example, credit-card transactions, medical images, and
large-scale sky surveys are all stored in massive and ever-growing databases. These streams of data need
to be analysed to extract higher-level information that could be useful for decision making and for
understanding the process generating the data. For large systems of this type, we need automated
methods that acquire decision-making knowledge.
Humans attempt to understand their environment by using a simplification of this environment (called a
model). This model represents events in the environment, and similarities among objects [12]. Similar
objects are grouped in classes and rules are constructed to predict the behaviour of new objects of such a
class. In machine learning we try to automate this process and construct class descriptions (models) such
as a decision tree or a set of production rules using an iterative search strategy on a library of examples.
This is known as inductive learning.
There have been various methods for learning rules inductively by analysing a set of training examples
to extract interesting and relevant information from the databases. Among these are RIPPER [6], and
C4.5 [25]. This information is higher-level knowledge in the form of class descriptions that are
generalisations about what it means to be a member of a class or a category. This gives us the ability to
make inferences that are true for most or all members of the class.
In this paper we present a method for improving or refining existing knowledge bases by using a
refinement library of examples.
Section 2 gives an overview of two existing refinement systems and of inductive learning systems. Section
3 describes the ARIS knowledge-base refinement system. The application of ARIS to knowledge bases
generated by C4.5 and RIPPER systems is described in Section 4. Section 5 compares inductive
refinement and incremental learning. Section 6 concludes with discussion and future enhancements.
2. Related Work
2.1 Inductive systems
There has been a relatively long history of using induction to solve complex problems by creating a model of
the problem domain and inferring from that model instead of from the domain itself. The constructed
model is a particularly useful tool if it is possible to perform classification by simple, easy-to-understand
tests whose semantics are intuitively clear, such as decision trees [3,5,9] and production rules [1,6,25].
Learning algorithms vary in:
- the way in which they search the training data set,
- the way in which they generalise,
- the way they represent class descriptions, and
- the way in which they cope with errors and noise in the training data.
Generally, we can classify concept learners into two categories depending on the learning strategy
involved.
• Simultaneous covering algorithms
At each step in the search, the algorithm selects among alternative attributes by comparing the
partitions of the data that each attribute generates. Decision-tree algorithms belong to this category,
examples include ID3 [24], C4.5 [25], and CART [3]. These algorithms use an overfit-and-simplify
learning strategy involving post-pruning of the generated trees.
• Sequential covering algorithms
Here, the algorithm chooses among alternative attribute-value pairs to include in rules by computing
the subsets of data they cover. Direct rule learning algorithms such as RIPPER [6], and HYDRA [1]
belong to this category.
Two variations on this approach have been explored:
1. General-to-specific search
An approach to learning one rule at a time is to organise the search in the hypothesis space in
the same way as in simultaneous covering algorithms, but to follow only the most promising
branch in the tree at each step. The search starts by considering the most general condition
possible (i.e. the empty test that matches every instance), then adds the attribute test that most
improves rule performance measured over the training data. Once this test has been added, the
process is repeated by adding a second attribute test, and so on. This differs from the
simultaneous covering approach in that at each step only a single descendant is followed, rather
than growing a subtree that covers all possible values of the selected attribute.
Some learning systems such as BEXA [27] extend this approach to maintain a list of the k best
attribute tests at each step, rather than a single best attribute test.
2. Specific-to-general search
Another way to organise the search is to begin with the most specific rule and gradually
generalise it to cover more positive cases [18].
One way to see the significance of the difference between the above two approaches is to compare the
number of distinct choices made by the two algorithms in order to learn the same set of rules.
To learn a set of n rules, each containing k attribute-value tests in their conditions, sequential covering
algorithms will perform nk primitive search steps, making an independent decision to select each
condition of each rule. In contrast, simultaneous covering algorithms will make many fewer independent
choices, because each choice of a decision node in the decision tree corresponds to choosing the
condition for the multiple rules associated with that node. In other words, if the decision node tests an
attribute that has m possible values, the choice of the decision node corresponds to choosing a condition
for each of the m corresponding rules. Thus, sequential covering algorithms make a larger number of
independent choices than simultaneous covering algorithms.
The other difference is that in simultaneous covering algorithms different rules must test the same
attributes, while sequential covering algorithms impose no such constraint.
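As a concrete illustration of sequential covering with general-to-specific search, the following sketch grows one rule at a time by greedily adding the single best attribute-value test until the rule is pure, then removes the covered examples. The toy weather data, attribute names, and precision-based scoring are illustrative assumptions, not details from the systems cited above.

```python
# A minimal sketch of sequential covering with general-to-specific search.
# Dataset rows are dicts of attribute values plus a "class" label; all
# names and values here are invented for illustration.

def grow_rule(data, target, attributes):
    """Greedily add the single best attribute test until the rule is pure."""
    conditions = {}          # start from the empty (most general) rule
    covered = list(data)
    while any(r["class"] != target for r in covered):
        best = None
        for attr in attributes:
            if attr in conditions:
                continue
            for value in {r[attr] for r in covered}:
                subset = [r for r in covered if r[attr] == value]
                if not subset:
                    continue
                precision = sum(r["class"] == target for r in subset) / len(subset)
                if best is None or precision > best[0]:
                    best = (precision, attr, value, subset)
        if best is None:
            break
        _, attr, value, covered = best
        conditions[attr] = value
    return conditions, covered

def sequential_covering(data, target, attributes):
    """Learn rules one at a time, removing the examples each rule covers."""
    rules, remaining = [], list(data)
    while any(r["class"] == target for r in remaining):
        rule, covered = grow_rule(remaining, target, attributes)
        if not rule:
            break
        rules.append(rule)
        remaining = [r for r in remaining if r not in covered]
    return rules

weather = [
    {"outlook": "sunny", "windy": "no",  "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "rain",  "windy": "yes", "class": "stay"},
    {"outlook": "rain",  "windy": "no",  "class": "play"},
]
rules = sequential_covering(weather, "play", ["outlook", "windy"])
print(rules)
```

Note how each condition is chosen independently, which is the source of the nk primitive search steps discussed below.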
2.2 Refinement systems
Previous knowledge-base refinement systems use knowledge bases elicited from experts in fairly well-understood
domains. The knowledge base is assumed to be nearly complete, requiring only minor tweaking.
This section describes two such refinement systems, SEEK [22] and KRUST [7], which are the
forerunners of the ARIS refinement algorithm. The statistical approach of SEEK is the basis for ARIS's
rating of rules and embedded hypotheses. SEEK uses statistical evidence to apply suitable refinements.
The refinements are:
• Generalise a rule so that it will fire when it is enabled.
• Specialise the error-causing rule to allow an enabled correct rule to fire instead.
The rules in SEEK are represented in the EXPERT language but expressed in a tabular format that
divides the findings associated with a conclusion into two categories, major and minor, and the level of
confidence associated with a conclusion into three categories: definite, probable, and possible.
A list of major and minor symptoms is defined for each conclusion that the system can reach. For
example, major symptoms for connective tissue disease (one of the rheumatic diseases) include swollen
hands and severe myositis, while examples of minor symptoms include anemia and mild myositis.
The rule changes include the following operators:
1) Change the certainty of a diagnosis,
2) Increase/decrease the number of majors or minors (i.e. suggestive symptoms),
3) Add/delete an exclusion (i.e. a symptom that should not be present),
4) Add/delete a requirement (i.e. a necessary symptom) and,
5) Change the conclusion (RHS) of a rule.
The rule changes do not include numerical range changes. SEEK requires the involvement of an expert
to decide on whether or not a suggested change by SEEK can be implemented. The main criticism of
SEEK is that it uses a criterion table for representing the hypotheses. This works well in the domain of
rheumatic diseases, but may be difficult to implement in other domains. SEEK is also unable to add
new rules.
Politakis [22] gives a method to integrate condition weights and rule weights for classification. The
method used relies on a very specific set of weights regarding the relationship between a set of
conditions and a conclusion. The condition weight can be either minor or major, while the weight of a
rule can have a value in the set {possible, probable, definite} depending on the combination of
conditions used in the rule. This technique is strongly connected to our implementation of condition and
rule weights, as will be discussed in Section 3. The scheme proposed in this paper is dynamically
adjusted with respect to a probabilistic estimate (depending on the data distribution) and is defined on
a continuous scale, while in the SEEK system an expert is required to assign weights for both conditions
and rules.
The use of an exclusion component in SEEK is another important aspect that is similar to the proposed
approach. We use an UNLESS clause to rule out a conclusion in a particular situation, taking care of
special cases in the domain of interest.
The KRUST system [7] refines a knowledge base when there is a conflict between the expert's
conclusion and the system's conclusion for a particular test case. This test case is one of a bank of
task-solution pairs used by KRUST. KRUST normally outputs a single knowledge base, chosen according to the
performance of a set of proposed refined knowledge bases on all the task-solution pairs. In the event of a
tie for best knowledge base, or if the original knowledge base is judged best, the expert is consulted.
However, unlike SEEK, KRUST does not require the involvement of an expert during refinement.
KRUST follows the "first enabled is fired" conflict resolution strategy, and its refinement operations are:
• Generalise a rule,
• Specialise a rule,
• Change the priority of a rule,
• Add a new rule for the misdiagnosed case.
KRUST's generation of a new rule is naïve [7] and needs to be improved. Again, KRUST assumes that
the original knowledge base is complete and contains no redundant rules; hence, there is no redundancy
checking to remove undesirable rules. Moreover, there is no way to handle or exclude special cases, as
there is in SEEK.
KRUST's approach of generating many refinements is based on the realisation that applying a
variety of refinement operations can lead to knowledge bases with different performances.
research in this paper emphasises the need for trying all possible refinement operations that are
applicable for the observed misdiagnoses and to choose one that best improves the knowledge-base
performance.
3. The ARIS inductive refinement system
This section is devoted to a description of the mechanisms involved in representing/creating the
knowledge base, improving the knowledge base, and the associated weighting schemes. This is
implemented in a system called ARIS (standing for Automated Refinement of Inductive Systems) which
refines propositional production rules. The architecture of ARIS is shown in Figure 1, and the main
features are:
• a rule weighting scheme for providing rule priorities and ordering,
• ranking refinement cases by combining the weights of the rules satisfied for each case,
• the ability to classify an object through partially satisfied rules when no rule is fully satisfied, and
• the ability to perform all refinement operations on continuous and discrete attributes.
Figure 1. The ARIS architecture
ARIS initially generates a knowledge base using induction on a set of training examples; it then tests
the knowledge base on a separate set of data reserved for refinement, called the refinement data set.
Only after this testing, and if some cases in the refinement set are misclassified, is the refinement
subsystem invoked. Finally, the system is tested on a separate test data set to evaluate the refinement
process. The refinement subsystem identifies possible errors in the
knowledge base, and calls on a library of operators to develop possible refinements, guided by a global
heuristic. The best refinement is implemented, and the process repeats until no further refinements are
possible. A knowledge-base refinement system is successful to the extent that it can improve the
accuracy of the initial knowledge base on new, unseen cases and produce a more accurate and
comprehensible result. ARIS performs a search through a
space of both specialisation and generalisation operators in an attempt to find a minimal refinement to a
knowledge base.
Conceptually, the refinement subsystem has three main phases, two of them are executed for every
hypothesis present in the knowledge base, while keeping the rules ordered with respect to their weights:
Phase 1: (Localisation process)
During the first phase, all misdiagnosed cases from the refinement set belonging to a particular
hypothesis (class) are identified. Each case receives a weight from the rules satisfied for the
case. This indicates the rule overlapping at this point (case) in hypothesis space. The case that
has the highest absolute weight among the misdiagnosed cases is selected, since it points to the
strongest rule from the set of erroneous rules (i.e. with highest weight) and frequently many
rules overlap at this point.
Phase 2: (Refinement generation, verification and selection)
During this phase, the rule responsible for the misdiagnoses is determined, and all possible
applicable refinement operations are tried: the erroneous rule is specialised, a similar
rule reaching the intended class is generalised, and a new rule is created. The resultant
performance is stored for each refinement operator and finally the refinement operator or group
of operators that produce the best performance is chosen. The process is repeated until no
further improvement is possible.
Phase 3: (Check completeness and remove superfluous rules)
Finally, the knowledge base is checked for incompleteness. Each case must be covered by at
least one rule. If there are cases that are not covered by the existing rules, new rules are created.
Moreover, any superfluous rules are removed. Figure 2 demonstrates the algorithm.
The main components of ARIS are: a tree generator, a rule generator, a refinement generator, a
judgement module, and an inference engine. The refinement generator is responsible for applying all
possible refinements to remedy the misdiagnoses. The set of refinements reflects how the rules interact.
Rules can be changed by allowing them to fire (called enabling), or preventing them from firing (called
disabling). The judgement module selects the best refinement operator or group of operators which
result in the best improvement of knowledge-base performance while correcting the misdiagnosed
cases.
Algorithm : ARIS
Inputs: A set of training data
A set of refinement data,
A set of test data,
Optionally, a set of rules of the form:
IF conditions THEN hypothesis
Output: A refined rule set, with rules of either the form
IF conditions THEN hypothesis
or
IF conditions THEN hypothesis UNLESS conditions
First:
If no rules are given, generate a set of production rules by processing the training data.
Second:
1. For each hypothesis in the hypothesis space:
{Identify rules that are not consistent with the refinement data set. These may at later stages be
specialised but will not be generalised.}
• Rank rules with respect to their weights.
• For each misdiagnosed case in the refinement set, calculate combined weight of rules satisfied
for the case (see Section 3.1).
• Select a case that has the highest absolute weight among the misdiagnosed cases.
• Refine the knowledge base for the selected case (i.e. missed case) by generating all possible
refinements.
• Select the best refinement, and implement it.
End for
2.
• Maintain knowledge-base completeness, by guaranteeing that each case is covered by at least
one rule, use a minimum covering algorithm to generate missing rules (see Section 3.4).
• Prune all superfluous rules.
Figure 2: The ARIS algorithm
3.1 Learning condition weights
Feature weights have been used in case-based learning [14,19] to denote the relevance of features in
similarity computations and to guide the selection of relevant features. The basic approach involves
storing cases and their associated classes in memory and then, when given a test instance, the task
becomes one of finding the training cases nearest to that test case and using them to predict the class of
the test instance. Several methods for constructing feature weights have been proposed, based on either a
naïve distance metric that weighs attributes equally [28] or a weighted Euclidean metric in which the
weight for each feature comes from a small set of options, for example [0,0.1,0.2,…,1], where 0 means
that the feature is irrelevant, and a higher weight denotes greater relevance, as in [19]. Some methods
cannot handle numeric features directly and have to discretise numeric values by classification; others
have to convert multi-valued symbolic features into a set of binary features according to the number of
values (i.e. a symbolic feature with N values is mapped into N-1 binary features), as in
quantification method II [19].
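The weighted nearest-neighbour scheme described above can be sketched as follows; the feature values, weights, and class labels are invented for illustration, and a real case-based learner would tune the weights rather than fix them by hand.

```python
# Hedged sketch of 1-nearest-neighbour classification with per-feature
# weights, in the spirit of the case-based learners cited above.
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance; a weight of 0 makes a feature irrelevant."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def classify(query, cases, weights):
    """Predict the class of the stored case nearest to the query."""
    features, label = min(cases, key=lambda c: weighted_distance(query, c[0], weights))
    return label

# (feature vector, class) pairs; both values are illustrative.
cases = [((20.0, 1.0), "healthy"), ((60.0, 0.0), "sick")]
weights = [0.1, 1.0]   # age matters little, the second feature dominates
print(classify((25.0, 0.0), cases, weights))   # → healthy
```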
Politakis [22] used a domain specific scale for weighing conditions and rules. His approach can be
summarised as follows:
The conditions in a rule are grouped under the following categories:
majors: considered to be very important conditions,
minors: less specific but suggestive for reaching a particular conclusion,
exclusions: conditions that must not be satisfied in order to reach a given conclusion.
The certainty in the conclusion of a rule is measured by the confidence level it reaches. There are three
levels of confidence:
(1) Possible,
(2) Probable, and
(3) Definite.
Different combinations of conditions could result in reaching the same conclusion but with different
certainty.
In contrast, the only source of information we have is the training data. In this research, conditional
probability theory is used to take advantage of the data distribution in the training set. The intention is to
measure the importance of a condition to a particular class through the change in belief in
the class with respect to that condition. This poses the following question: for a particular hypothesis h,
given a piece of evidence e, how will the relation between h and e change?
To formalise the concept, consider a sample space that is partitioned by events E1, E2, …, En.
Let H+ be an event in the space denoting a certain class or concept, with probability P(H+) > 0. Then,
by Bayes' theorem,

\[
P(E_j \mid H^+) = \frac{P(H^+ \mid E_j)\,P(E_j)}{\sum_{i=1}^{n} P(H^+ \mid E_i)\,P(E_i)}
\]

For a simple problem with two conditions and hypotheses H+ and H-,

\[
P(E_1 \mid H^+) = \frac{P(H^+ \mid E_1)\,P(E_1)}{P(H^+ \mid E_1)\,P(E_1) + P(H^+ \mid E_2)\,P(E_2)}
\]
We can rewrite this as

\[
P(E_1 \mid H^+) = \frac{P(H^+ \mid E_1)\,P(E_1)}{P(H^+ \mid E_1)\,P(E_1) + P(H^+ \mid \sim E_1)\,P(\sim E_1)}
= \frac{P(H^+ \mid E_1)\,P(E_1)}{P(H^+ \mid E_1)\,P(E_1) + P(H^+ \mid \sim E_1)\,(1 - P(E_1))}
\]

that is,

\[
P(E_1 \mid H^+) = QA \times P(E_1) \qquad (1)
\]

where

\[
QA = \frac{P(H^+ \mid E_1)}{P(H^+ \mid E_1)\,P(E_1) + P(H^+ \mid \sim E_1)\,(1 - P(E_1))}
\]

or, using the terminology of Table 2,

\[
QA = \frac{X_1}{X_1 X_3 + X_2\,(1 - X_3)}
\]

If we take the logarithm of QA, the incremental nature of the updating process of the evidence on the
condition of the hypothesis becomes apparent. In other words, the posterior probability P(E_1 | H^+) in
Equation (1) above depends on the factor QA. If QA is greater than one, the posterior probability will
be greater than the prior probability P(E_1); if QA is less than one, P(E_1 | H^+) will be less than
P(E_1). We can thus view this quantity as a weight, carried by the evidence E_1, which sways the
belief one way or the other. The weight QA calculated as above ranges from zero to infinity; however,
we would like to map the evidence weight into the interval [-1, +1], with a value of 1 for
QA mapping to 0 (central), so that it is easier to distinguish whether the given evidence is in favour of
the particular hypothesis or not. In other words, a positive weight denotes supporting evidence, and a
negative weight indicates opposing evidence. Therefore the following function is used to map the value
to the desired range: W = F(QA), where

\[
F(QA) = \frac{2\,QA}{QA + 1} - 1
\]

Another candidate for this transformation [23] is

\[
F(QA) = \frac{2}{1 + e^{-QA}} - 1
\]

Some other possible forms have been suggested in the literature for this mapping [13].
A positive weight indicates supporting evidence for the hypothesis, and a negative weight indicates
opposing evidence against it.
Symbol          Description
Ei              An event or piece of evidence (e.g. Age <= 20)
~Ei             The complement of Ei (e.g. Age > 20)
H+              The space of positive instances of the hypothesis (e.g. Healthy)
H-              The space of negative instances of the hypothesis (e.g. Sick)
P(Ei)           The prior distribution of objects (cases) within the scope of the condition,
                relative to the total number of examples in the entire population (e.g. P(Age > 20))
X1 = P(H|Ei)    Fraction of positive instances of the hypothesis covered by the condition Ei
                (i.e. true positives, TP)
X2 = P(H|~Ei)   Fraction of positive instances not covered by the condition Ei, relative to the
                complement of Ei (i.e. false negatives, FN)
X3 = P(Ei)      Fraction of all instances covered by the evidence Ei, relative to the entire population
Table 2: Terminology for the weight model
We emphasise that this weighting scheme defines class-specific weights, depending on class frequency
and on the degree to which a condition's distribution of values for a class differs from its distribution
for other classes.
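A minimal sketch of the condition-weight calculation in Section 3.1, assuming the fractional mapping F(QA) = 2QA/(QA+1) - 1 (which is algebraically the same as (QA-1)/(QA+1)); the probability values below are invented for illustration.

```python
# Sketch of the condition-weight calculation: QA is computed from the
# Table 2 quantities and then mapped into [-1, 1], with QA = 1 mapping to 0.

def condition_weight(x1, x2, x3):
    """x1 = P(H|E), x2 = P(H|~E), x3 = P(E); returns a weight in [-1, 1]."""
    qa = x1 / (x1 * x3 + x2 * (1.0 - x3))   # posterior/prior ratio factor
    return (qa - 1.0) / (qa + 1.0)          # QA = 1 (no information) maps to 0

# Evidence exactly as common inside the class as outside carries no information:
print(condition_weight(0.5, 0.5, 0.4))   # → 0.0

# Evidence concentrated in the class gets a positive weight:
print(round(condition_weight(0.9, 0.1, 0.5), 3))   # → 0.286
```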
3.2 Rule weights and ordering
In what follows, we discuss the rationale for weighting and ordering the rules in an inductive
knowledge-based system. The natural properties of production rules allow many ways for rules to
interact and overlap. In other words, a rule may collide with or subsume another rule, because the
available conditions are not usually independent. This requires some process to use the available
information to resolve any existing conflict and generalise well over the training data, while keeping the
rules as simple as possible.
Knowledge-based systems are characterised by the separation of rules, data, and control scheme [8], and
the task execution is accomplished through the following phases:
Phase 1: Matching, in which the knowledge-based system matches its rules against a given instance
to determine the conflict set (the set of satisfied rules).
Phase 2: Conflict resolution, in which the system selects a rule to fire from the conflict set.
Phase 3: Rule application, which entails executing the selected rule.
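The three phases can be sketched as a single recognise-act step; the rules, weights, and the choice of highest-weight conflict resolution here are illustrative assumptions, not the paper's final scheme.

```python
# Minimal sketch of the three-phase production-system cycle described above:
# match, conflict resolution, rule application. All rule contents are invented.

def run_cycle(rules, instance):
    # Phase 1: match - collect every rule whose conditions all hold.
    conflict_set = [r for r in rules if all(test(instance) for test in r["if"])]
    if not conflict_set:
        return None
    # Phase 2: conflict resolution - here, pick the highest-weight rule.
    chosen = max(conflict_set, key=lambda r: r["weight"])
    # Phase 3: rule application - fire the chosen rule's conclusion.
    return chosen["then"]

rules = [
    {"if": [lambda x: x["fever"], lambda x: x["cough"]], "then": "flu",  "weight": 0.8},
    {"if": [lambda x: x["fever"]],                       "then": "cold", "weight": 0.5},
]
print(run_cycle(rules, {"fever": True, "cough": True}))   # → flu
```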
The matching phase is known to be a time-consuming operation, but conflict resolution is
particularly important as it determines which rule will fire and hence how the case will be classified.
Several general schemes of conflict resolution have appeared in the literature utilising different kinds of
heuristics, for example:
1. Rule specificity was used to favour specific rules [11].
2. Connection strength [8], a measure of rule strength that encodes a rule's "relatedness" to other rules
in the knowledge base, used to select the next rule to fire on the basis of experience. This scheme
imposes an explicit priority ordering given the last rule satisfied during task execution, adopting the
following heuristic: whenever rule A fires, then based on experience the most likely candidate rule
to fire next and eventually lead to success is rule B. This idea is very different from our approach.
3. Logical sufficiency [2], a probabilistic estimate of rule reliability that measures the proportion of
the training examples of class c for which the rule was true, divided by the proportion of the training
examples of other classes for which the rule was true.
4. Belief certainty, such as the certainty-factor model used in MYCIN [4], which represents the
uncertainty in a rule's conclusion. This is based on expert estimates that integrate many factors,
including probabilistic knowledge and attitude.
Traditionally, conflict resolution requires computing the entire conflict set (except in scheme 2 above)
and then selecting a rule according to a certain property (such as frequency of rule success). We propose
the addition of a different mechanism to the inference engine: ordering the rules with respect to the
strength of the association between a rule's conditions and the intended hypothesis (i.e. the right-hand
side of the rule). This is similar to Politakis' approach [22] of preferring rules with the strongest
confidence over rules with less confidence.
The goal of refinement is to improve knowledge-base efficiency with as few modifications as possible.
This entails reducing the number of rules the system has to consider in order to improve prediction
quality. For the proposed method we maintain a simple matcher that matches rules individually.
Selecting the next rule to match depends on its location (position in a list) and hence on its weight, thus
avoiding the expensive computation of the entire conflict set. This means that the first rule to match is
fired, so there is never a conflict set. The rule weight can also be used for purging low-utility rules
from the system.
Changing the priority of a rule affects its ability to fire, and it is the easiest operational change to
perform [7]. Several knowledge-based systems use a rule priority change as a plausible action in their
conflict resolution schemes [7,16]. Craw [8] advocates changing the priority of a rule while rejecting the
change if it proves ineffective in later stages. Laird [16] supports the notion of changing control
rather than the conditions of the rules in his system Soar (i.e. modifying the decisions in which the
knowledge is used instead of modifying the rules that encode the knowledge) to recover from incorrect
decisions. In his system, the decisions made on the basis of preferences are categorised into three classes
(acceptability, necessity, and desirability) to select a rule for firing. Other authors [29] took the opposite
extreme viewpoint, namely, never to change the rule order under any circumstances; but there is no clear
evidence for keeping a specific ordering of the rules if resorting to a small change like reordering could
improve knowledge-base efficiency. ARIS starts with an unordered set of rules and imposes an order
that keeps the rules in line with their utility to the system, in the sense that it:
(1) provides as few matching operations as possible to produce the desired action without explicit
conflict-resolution computation, and
(2) improves refinement efficiency by supplying a process for locating the rule responsible for
misdiagnoses directly.
For example, suppose we have a knowledge base containing rules that help make decisions about outdoor
activities for children, such as:
R1: IF Today's temperature < 22 THEN Play-Soccer
R2: IF Today's temperature < -10 THEN Build-Snowman
For a temperature of -12 degrees, the knowledge-based system will suggest playing
soccer, which is not correct for this situation. While there are several ways to fix this problem, the
simplest is to reorder the rules; here we must use some criterion that allows the learning system to
perform the action (evaluation, then ordering) automatically. We could provide the learning system with
the semantic ability to relate each class to particular valid conditions, so that the learning system can
distinguish irrelevant conditions and order the rules in the proper context.
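The reordering fix can be sketched directly: with a first-match-fires interpreter, sorting the rule list by weight corrects the misclassification without touching any rule's conditions. The weight values here are invented for illustration.

```python
# Sketch of the first-match-fires scheme with weight ordering, using the
# soccer/snowman rules above. Weights are illustrative assumptions.

rules = [
    ("Play-Soccer",   lambda t: t < 22,  0.4),   # R1
    ("Build-Snowman", lambda t: t < -10, 0.9),   # R2
]

def first_match(rules, temp):
    """Fire the first satisfied rule; no conflict set is ever built."""
    for conclusion, condition, _weight in rules:
        if condition(temp):
            return conclusion
    return None

print(first_match(rules, -12))                     # unordered: Play-Soccer
ordered = sorted(rules, key=lambda r: r[2], reverse=True)
print(first_match(ordered, -12))                   # ordered: Build-Snowman
```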
The other way is to assign a weight to each rule that relates the strength of its conditions to its
conclusion. In the example above, the condition "Today's temperature < 22" covers some
false positive cases from the second rule's class. Now suppose that rule R2 is complete and
consistent. The intended rule-weighting scheme would result in the weight of rule R2 being higher
than the weight of rule R1; as a result, R2 will be placed in the first location, which yields better
prediction performance. In Section 3.1 we described the weighting technique for rule conditions. We
must admit that there is no consensus on how best to measure the combined effect of the
conditions on a rule's conclusion, because in real-world applications the conditions are not
independent, and the relation between the conditions themselves is not clear. However, we use a heuristic
for combining condition weights to produce a weight for the rule.
The following combination function [4,13] will be used to calculate the weight of a rule:

\[
\text{Rule's weight} = w_1 + w_2(1 - w_1) + w_3[1 - (w_1 + w_2(1 - w_1))] + \ldots
\]

This computes the impact of the tests in a rule's antecedent, where w_i is the weight of condition i. Since
the weights of conditions can be either positive or negative evidence (for or against the hypothesis), we
need to split the equation above as follows:

\[
MSE = w_1 + w_2(1 - w_1) + w_3[1 - (w_1 + w_2(1 - w_1))] + \ldots \quad \text{over the positive conditions (0 if there is no supporting evidence)}
\]
\[
MOE = w_1 + w_2(1 - w_1) + w_3[1 - (w_1 + w_2(1 - w_1))] + \ldots \quad \text{over the negative conditions (0 if there is no opposing evidence)}
\]

The weight of the rule is then MSE - MOE, where
MSE = measure of supporting evidence,
MOE = measure of opposing evidence,
w_i = weight of evidence i.
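A sketch of this combination, assuming the iterative accumulation w_total = w_total + w(1 - w_total) is applied separately to the supporting and opposing condition weights; the example weights are illustrative.

```python
# Sketch of the rule-weight combination: condition weights are combined
# with w1 + w2(1 - w1) + ..., separately for supporting (positive) and
# opposing (negative) evidence, and the rule weight is MSE - MOE.

def combine(weights):
    """Iterative w_total = w_total + w * (1 - w_total); 0 if no evidence."""
    total = 0.0
    for w in weights:
        total = total + w * (1.0 - total)
    return total

def rule_weight(condition_weights):
    mse = combine([w for w in condition_weights if w > 0])   # supporting
    moe = combine([-w for w in condition_weights if w < 0])  # opposing
    return mse - moe

print(round(rule_weight([0.5, 0.4, -0.2]), 2))   # → 0.5
```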
3.3 Improving an existing knowledge base
It is our hypothesis that tackling the most difficult part of the knowledge base is a good approach, and
we feel that other misdiagnoses could be solved as a by-product of the process. To test this hypothesis,
we maintain a list of the available rules in the knowledge base, against a set of refinement cases selected
independently from the training data set, as shown for example in Table 3 below:

REFINEMENT CASE   R1 (+)   R2 (-)   R3 (+)   CASE WEIGHT
1 (+)             *
2 (-)                      *
3 (+)             *                 *
...

Table 3: Generating refinement locations
...)]]1([1[)1( weight sRule' 1213121 +−+−+−+= wwwwwww
conditionspositiveforwwwwwww
MSE ...)]]1([1[)1(
evidence supporting no if 0
1213121
+−+−+−+=
conditionsnegativeforwwwwwww
MOE ...)]]1([1[)1(
videnceopposing no if 0
1213121
+−+−+−+=
For a particular class in the given domain, each rule maintains its weight and a label (+ or −) with respect
to the class under investigation. Each case is labelled + if it belongs to the class of interest, or −
otherwise. A ‘*’ means that the case is covered by the rule. All the rules satisfied for a case in the
refinement data set collectively produce a weight for that case. To determine this weight, we follow the
formulation given in [4], which defines evidence combination as a special case of the Dempster-Shafer
theory of evidence, that is:

    F(w1, ..., wn) = 1 − ∏(j=1..n) (1 − wj),   n ≥ 1        (2)

where wj is the weight of rule j. Since the rules in our approach are ordered with respect to their weight,
we reformulate equation (2) as follows:

    F(w) = F(w1, ..., wn) = 1 − ∏(j=1..n) (1 − wj) + Max[w],   n ≥ 1

where Max[w] is the highest weight of a rule among all the satisfied rules for a particular case. The
combined weight F(w) is an indication of rule overlapping. A case is misdiagnosed by the system if the
satisfied rule with the highest weight belongs to a different class (i.e. has a different sign). We would
like to find the points with the strongest overlapping (i.e. interaction) of rules where misdiagnoses exist.
The algorithm in Figure 5 describes how to assign case weights based on the rules satisfied. While the
weight of the rules is used as a criterion to order the rules in the knowledge base for aiding
classification, it also forms a basis for enhancing the refinement process.

Refinement locations are meant to be places in the knowledge base where misclassification may occur.
We identify refinement locations by the negative weight of misclassified instances. The number of
different rules which satisfy a particular refinement case represents its potential, i.e. the degree of
overlap in the knowledge base, which must be resolved. Our method of analysis calculates the
performance of rules in the knowledge base with respect to the refinement data set. For each stored case,
all the satisfied rules are identified. It is common to find multiple rules satisfied for a given case. The
problem of misclassification appears when rules with incorrect conclusions are satisfied for the case.
Because of this, we separate these rules into two groups: one group corresponds to the active hypothesis
(the class of the misdiagnosed case) for the refinement location, and the second group contains rules
concluding other classes. We would like to determine the point at which refinement would produce
the optimal result with respect to increased performance of the knowledge base, with the least refinement
effort in terms of the number of refinement operators applied. Finding such an optimal refinement
is not possible, but the case weight as calculated in Figure 5 is a useful heuristic indicating areas of high
interest for possible refinements. Figure 3 summarises the refinement process.
Figure 3: Refinement process. The flowchart proceeds as follows: evaluate KB performance by comparing
the KB system's conclusions with the stored refinement cases' conclusions; if there are incorrect
knowledge-base diagnoses, summarise rule performance on each refinement case, instantiate heuristics
to locate the most critical part of the rule overlapping, perform candidate refinement operations
(apply a refinement operator, evaluate the KB, update the best refinement) and select the best,
implement the best refinement, then verify KB completeness and verify KB redundancy; otherwise return.
Refine (erroneous, missed case)
{ The target rule is a rule satisfying the missed case and concluding the correct hypothesis for it }
1. Specialise the erroneous rule
• Select the best specialisation
• If there is a target rule, accept the best specialisation
2. If there is no target rule:
{
• Find closest rule to generalise, then
{ Two successive scans of the rule are performed in order to generalise it }
(1) Generalise by deleting a condition from the rule, such that the deletion does not degrade
the performance of the rule, so as to cover the missed case
(2) Generalise by expanding numerical scope of a condition if it is continuous, so that the
missed case will be correctly covered.
• If the weight of closest rule after generalisation > weight of the erroneous rule, then adopt the
generalisation operation only,
Else
{
• Augment erroneous rule with UNLESS clause
• Apply both generalisation (to closest rule) and specialisation (to erroneous
rule) operators
• Evaluate the knowledge-base performance using each of the above
refinements, and accept the one that better enhances the knowledge-base
performance.
}
}
3. If there is no rule to generalise, then
• Create a new rule using a minimum covering algorithm
• If the weight of the created rule > weight of erroneous rule, then add new rule to the
knowledge base;
Else
{
• Specialise the erroneous rule, so it no longer covers the missed case, and
• Add the newly created rule.
}
Figure 4: The Refine algorithm
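The control flow of the Refine algorithm can be summarised as a decision table. The sketch below is illustrative, not the ARIS source; the action strings are our own labels for the operators Figure 4 would choose, given the relevant weight comparisons.

```python
def choose_refinement(has_target_rule, gen_weight=None,
                      err_weight=0.0, new_weight=None):
    """Decide which refinement operators Figure 4 applies.

    gen_weight: weight of the closest rule after generalisation (None if no
    rule can be generalised); new_weight: weight of a rule built by the
    minimum covering algorithm (None if no such rule was built)."""
    if has_target_rule:
        # A rule concluding the correct hypothesis already covers the case:
        # specialising the erroneous rule is enough (step 1).
        return ["specialise erroneous rule"]
    if gen_weight is not None:
        if gen_weight > err_weight:
            # Step 2: the generalised closest rule outweighs the erroneous one.
            return ["generalise closest rule"]
        # Otherwise evaluate the competing refinements and keep the best.
        return ["try UNLESS clause",
                "try generalise + specialise",
                "keep the better refinement"]
    if new_weight is not None and new_weight > err_weight:
        # Step 3: a newly created rule dominates the erroneous one.
        return ["add new rule"]
    return ["specialise erroneous rule", "add new rule"]
```

For instance, `choose_refinement(False, gen_weight=0.6, err_weight=0.4)` selects generalisation only, mirroring step 2 of Figure 4.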
Combined weight (missed case)
Fp ← Weight of rules supporting the hypothesis h and satisfying the missed case.
Fn ← Weight of rules that do not support the hypothesis h and satisfy the missed case.
Maxwp ← Maximum weight carried by a rule, among those supporting the hypothesis h.
Maxwn ← Maximum weight carried by a rule, among those satisfying the missed case but not concluding
hypothesis h.
Wi ← Weight of rule i { calculated using the approach discussed in Section 3.2 }
• Calculate the combined weight of the rules satisfying the missed case and concluding h.
• Calculate the combined weight of the rules satisfying the missed case and not concluding h.
• Compute the combined weight for the case, to indicate whether the case is misdiagnosed or not. If
the combined weight of the rules in favour of the hypothesis under consideration is greater than the
combined weight of the other rules, the case will be diagnosed correctly; otherwise it will be
misdiagnosed.
• Return the case weight.
Figure 5: The combined weight algorithm
Example:
Table 4 describes a small classification system used to demonstrate our approach. It contains two final
hypotheses, (+) and (−), and 16 training cases C1, C2, ..., C16. The cases are marked (+) if they
belong to hypothesis (+), or (−) otherwise. The rules are shown with their weights; the marks
indicate the cases to which the rules are applicable (i.e. the conditions of the rule are satisfied
for the case). A rule is assigned a (+) sign if it concludes hypothesis (+), or (−) otherwise.
    Case weight = Fp − Fn

    Fp ← 1 − ∏(i=1..k) (1 − wi) + Maxwp

    Fn ← 1 − ∏(j=1..R) (1 − wj) + Maxwn

where k is the number of satisfied rules concluding the hypothesis h, and R is the number of satisfied
rules concluding other classes.
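The equations above and the steps of Figure 5 can be sketched as follows; this is an illustrative reimplementation with our own helper names, assuming an empty group of rules contributes zero weight.

```python
def group_weight(weights):
    """F(w1..wn) = 1 - prod(1 - wj) + Max[w]; taken as 0 for an empty group."""
    if not weights:
        return 0.0
    prod = 1.0
    for w in weights:
        prod *= (1.0 - w)
    return (1.0 - prod) + max(weights)

def case_weight(satisfied_rules, h):
    """satisfied_rules: (concluded_class, weight) pairs for the rules that
    cover the case; h: the case's stored hypothesis. A positive result means
    the case would be diagnosed correctly, a negative one that it would be
    misdiagnosed."""
    fp = group_weight([w for c, w in satisfied_rules if c == h])  # Fp
    fn = group_weight([w for c, w in satisfied_rules if c != h])  # Fn
    return fp - fn
```

For a case covered by one supporting and one opposing rule, e.g. `case_weight([('+', 0.28568), ('-', 0.225048)], '+')`, the sign of the result indicates whether the heavier evidence lies on the correct side.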
Rules, ordered by weight: R3 − (0.40386), R2 + (0.28568), R5 − (0.225048), R6 + (0.000778),
R1 + (−0.0003322), R4 + (−0.256594)

    Case    Combined case weight
    1 +       0.6864
    2 +       0.6864
    3 +       0.6864
    4 +       0.1709
    5 +       0.1709
    6 +      −0.5148
    7 −      −0.5148
    8 +      −0.5148
    9 +       0.5155
    10 −     −1.3064
    11 −     −1.3064
    12 −     −1.3064
    13 −     −1.3048
    14 +      0.2582
    15 +      0.2582
    16 −      0.2582

Table 4: Refinement locations
(In the full table, * marks the rules covering each case, ✗ marks a case incorrectly covered, and ✓ a
case correctly covered.)
The assignment of the weights to the rules is done using the approach discussed in Section 3.2. Applying
rule ordering with respect to rule weight (R3, R2, R5, R6, R1, R4) lets us directly conclude that cases 6
and 8 will be misdiagnosed by the inference engine, as they have a negative combined weight while
belonging to hypothesis (+). This in turn will lead to specialising rule 5, because it is overly general.
Notice that if the rules were in a different order, for example if rule 5 preceded rule 2, other cases
(i.e. cases 4 and 5) would be misclassified; the ordering scheme adopted avoids this problem.
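The effect of the ordering scheme can be illustrated with a toy rule set (the rules, attributes, and thresholds here are hypothetical, not those of Table 4): the inference engine tries the satisfied rules in descending order of weight and fires the first one.

```python
# Each rule: (name, concluded class, weight, predicate over a case dict).
rules = [
    ("R2", "+", 0.28568, lambda c: c["temperature"] < 22),
    ("R5", "-", 0.225048, lambda c: c["humidity"] > 80),
]

def classify(case, rules):
    """Fire the first satisfied rule in descending-weight order."""
    for name, cls, weight, pred in sorted(rules, key=lambda r: r[2], reverse=True):
        if pred(case):
            return cls
    return None  # no rule satisfied

# A case covered by both rules takes the class of the heavier rule:
print(classify({"temperature": 20, "humidity": 90}, rules))  # +
```

Had R5 carried the greater weight, the same case would have been assigned class −, which is exactly the ordering sensitivity discussed above.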
3.4 Minimum covering algorithm
Many algorithms for concept learning search hypothesis space by considering a general-to-specific
ordering of hypotheses. However, a minimum covering algorithm organises the search for a hypothesis
beginning with the most specific possible hypothesis, then generalises this hypothesis at each step to
cover an observed positive example. To be more specific, the minimum covering algorithm specifies a
set of conditions that describe some cases centred around a particular instance which is normally a
misdiagnosed case; thus providing an approach that is constrained to focus on a subset of a particular
concept by only a subset of the relevant instance features observed. It represents an inductive
generalisation of consistent production rules, adopting a least-generalisation strategy. First, we select
the most central case to be covered by the initial rule (normally the misdiagnosed case) and form
the most specific rule that covers it. Depending upon the condition values used in the case, an
over-specific (OSpec) rule is developed, such that the OSpec rule covers only the central case. All
example cases that belong to the class of the central case are considered positive examples, and all
cases that do not belong to that class are considered negative examples. Next, we generalise against
the positive cases for each condition; the newly generated rule is a generalisation of OSpec, and a new
rule (Min_Cover_Rule) is then created from it. The technique used to do this generalisation is the
following:
1. Min_Cover_Rule ←Ø { Make an empty rule}
2. Create OSpec rule. { This is an overly specific rule which covers only the misdiagnosed case }
3. Generalise OSpec rule
• Ordinal range expansion
All the positive cases are ordered for the particular condition to facilitate expanding ordinal
ranges, which will result in a rule covering more cases. Each time, the next value for the
condition in the range is examined: for the less-than (<) operator the next larger value in the
range is tried, while the next smaller value is used for the greater-than (>) operator.
This results in the range of values for the condition being expanded in both directions.
• Condition deletion
For each discrete condition in the OSpec rule, if deleting the condition improves the quality of
the rule, then execute the deletion operation.
4. Create Min_Cover_Rule
This involves a sequential scan through the conditions of the OSpec rule, after ordering them with
respect to their weights. As each condition is examined it is added to Min_Cover_Rule. If the
resulting rule is inferior to the original then the condition is removed from Min_Cover_Rule.
The following rule evaluation function is employed to test rule efficiency:

    Consistency = (# correctly covered cases) / (# covered cases)

    Completeness = (# correctly covered cases) / (# cases of the same class)

    Efficiency(Min_Cover_Rule) = α · Consistency + (1 − α) · Completeness

where α is a weighting parameter. In our experiments, we set the value of α to 0.8. The use of this
function results in the development of rules that are complete and consistent with regard to the data set,
whenever this is possible. A formal description of the above algorithm is provided in Figure 7. This
process is illustrated
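A sketch of the evaluation function, assuming the weighted reading α · consistency + (1 − α) · completeness given above (the helper names are ours, not from ARIS):

```python
def consistency(correct_covered, covered):
    """Fraction of covered cases that are correctly covered."""
    return correct_covered / covered if covered else 0.0

def completeness(correct_covered, class_size):
    """Fraction of the target class's cases that are correctly covered."""
    return correct_covered / class_size if class_size else 0.0

def efficiency(correct_covered, covered, class_size, alpha=0.8):
    """Weighted rule-quality measure; alpha = 0.8 in the experiments."""
    return (alpha * consistency(correct_covered, covered)
            + (1 - alpha) * completeness(correct_covered, class_size))

print(efficiency(8, 10, 20))  # 0.8 * 0.8 + 0.2 * 0.4, i.e. roughly 0.72
```

With α = 0.8 the measure favours consistency over completeness, which pushes the search toward rules that rarely misfire even when they do not yet cover the whole class.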
in Figure 6. In this figure, example cases are presented as points in a two dimensional space
representing a case’s age and weight. Positive cases are represented by (+) sign. Negative cases are
represented by (–) sign. Each rule can be viewed as defining a rectangle that includes those points that
the rule assigns to the positive class. The thick line indicates the rule to be refined.
IF Age<= 8 &
Age>1 &
Weight<= 8.2 &
Weight>3
THEN class= Negative
This rule covers eight negative and three positive cases. Assuming that the misdiagnosed case
determined by the refinement subsystem is case A, then the OSpec rule that covers case A and only A
is:
IF Age<= 6 &
Age >5.59 &
Weight<= 6.2 &
Weight> 6.19
THEN class= Positive
We then generalise against cases B and C. This results in the final rule:
IF Age <= 6.5 &
Age>3 &
Weight<= 7.5 &
Weight>5
THEN class= Positive
Figure 6: Example of minimum covering algorithm
The minimum covering algorithm is incorporated to help determine a missing rule for a particular
misdiagnosed case, to help determine which condition to add to a rule when specialising it, and to
decide the UNLESS clause for embedded hypotheses. The algorithm performs a specific-to-general
[Figure 6 plot: positive (+) and negative (−) example cases plotted in the age-weight plane (age 1-13
on the horizontal axis, weight 1-9 on the vertical axis), with the positive cases A, B, and C marked
and the rule to be refined drawn as a thick rectangle.]
search through a space of hypotheses, beginning with the most specific rule, which initially covers only
one case (namely the missed case), and terminating when the rule is sufficiently general. The algorithm
employs a performance measure of rule quality to guide the selection of the most promising conditions. At
each stage, the rule is evaluated by estimating the utility of removing a condition or changing a
numerical range, considering the number of positive and negative cases covered.
One might ask why the learning system did not find the newly learned rule when the knowledge base
was initially constructed. In the first place, the specified case may be an exceptional one and the rules
that are consistent with it can cover only a small number of other positive instances; thus those rules
dealing with this exception may not be found during the initial knowledge-base construction. Secondly,
the initial training set may not be a representative sample of cases in the category of interest. Thirdly,
since ARIS can be used to refine any knowledge base, these rules might not have been discovered by the
original learning technique.
Minimum covering algorithm (missed case)
{ Returns a rule that covers some examples belonging to the case's hypothesis, including the missed case
itself. Conducts a specific-to-general search for the best rule, guided by a utility metric. }
Min_Cover_Rule ← Ø
Rule ←Ø
1. Initialise Rule to the most specific hypothesis covering the given missed case
2. While Rule is overly specific Do
• Generate the next more general Rule:
• For each continuous condition
{ -Expand the range of the condition to cover the closest uncovered case
- If PERFORMANCE increases, then accept the modifications
Else
Refrain from implementing the last changes }
• For each condition
If removing the condition improves PERFORMANCE, execute the deletion operation
3. Sort conditions of Rule with respect to their weight
4. { From Rule, add to Min_Cover_Rule, one condition at a time}
I ←1
Do {
Min_Cover_Rule ← Min_Cover_Rule + condition I
If (Efficiency (Min_Cover_Rule) does not improve)
Remove condition I
I++
} Until (no more conditions are available)
5. Return Min_Cover_Rule of the form
IF conditions THEN hypothesis
Figure 7: The minimum covering algorithm
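The search of Figure 7 can be sketched for purely numeric conditions as follows. This is an illustrative reimplementation: the data points, helper names, and the use of the Section 3.4 efficiency measure with α = 0.8 are our assumptions, and discrete-condition deletion is omitted for brevity.

```python
def covers(rule, case):
    """rule: attribute -> (low, high) range; case: attribute -> value."""
    return all(lo <= case[a] <= hi for a, (lo, hi) in rule.items())

def quality(rule, cases, target, alpha=0.8):
    """alpha * consistency + (1 - alpha) * completeness (Section 3.4)."""
    covered = [(c, y) for c, y in cases if covers(rule, c)]
    correct = sum(1 for c, y in covered if y == target)
    class_size = sum(1 for c, y in cases if y == target)
    cons = correct / len(covered) if covered else 0.0
    comp = correct / class_size if class_size else 0.0
    return alpha * cons + (1 - alpha) * comp

def min_cover_rule(missed, target, cases):
    # Step 1: most specific rule -- point ranges around the missed case.
    rule = {a: (v, v) for a, v in missed.items()}
    improved = True
    while improved:
        improved = False
        for c, y in cases:
            if y != target or covers(rule, c):
                continue
            # Step 2: tentatively widen each range to take in this uncovered
            # positive case; keep the expansion only if quality rises.
            wider = {a: (min(lo, c[a]), max(hi, c[a]))
                     for a, (lo, hi) in rule.items()}
            if quality(wider, cases, target) > quality(rule, cases, target):
                rule, improved = wider, True
    return rule

# Toy data in the spirit of Figure 6 (age/weight plane, case A missed):
cases = [({"age": 6.0, "weight": 6.2}, "+"),   # case A
         ({"age": 4.0, "weight": 7.0}, "+"),   # case B
         ({"age": 5.0, "weight": 5.5}, "+"),   # case C
         ({"age": 2.0, "weight": 4.0}, "-"),
         ({"age": 7.5, "weight": 8.0}, "-")]
rule = min_cover_rule(cases[0][0], "+", cases)
print(rule)  # {'age': (4.0, 6.0), 'weight': (5.5, 7.0)}
```

Starting from the point rule around case A, the search accepts the expansions toward B and C because each raises the quality measure, and it stops once no uncovered positive case yields a further improvement.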
4. Experimental evaluation
4.1 Effects of new information
The research presented in this paper attempts to find a better way of utilising information in domains
where the available data is large and ever-growing, particularly in automated data-collection
environments. The ARIS algorithms were applied to problems from several different domains [20]. The aim
was to deliver an improvement on existing inductive learning systems. All the executed experiments employ
data sets which are randomly divided into a training data set, a refinement data set, and a test data set.
The training and testing data sets each comprise 40% of the available data, while the remaining 20% is
assigned to the refinement data set. Table 5 presents the performance of ARIS before and after
refinement, where the rules are generated using consistency and completeness criteria. Table 6
demonstrates the performance of ARIS on knowledge bases created by the C4.5 system. Table 7 shows the
effect of refinement on knowledge bases created by the RIPPER system.
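The split described above can be sketched as follows (an illustrative harness, not the original experimental code):

```python
import random

def split_data(cases, seed=0):
    """Randomly partition cases into 40% training, 20% refinement, 40% test."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.4 * n)
    n_test = int(0.4 * n)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    refinement = shuffled[n_train + n_test:]   # the remaining ~20%
    return training, refinement, testing

training, refinement, testing = split_data(list(range(100)))
print(len(training), len(refinement), len(testing))  # 40 20 40
```

Drawing the refinement set independently of the training set is what lets the refinement step act as a genuine second view of the domain rather than a re-examination of the training data.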
The experiments show that the main operators used in knowledge-base refinement by the ARIS system
are rule creation and rule specialisation. The biggest influence is the number of error-causing rules,
because each of them must be prevented from firing for the misdiagnosed cases, either to allow the
correct rule to fire or so that a new rule can be created. They can be specialised by strengthening the
left-hand side of the error-causing rule, which may be achieved by adding another condition or shrinking
the numerical range of a condition, or by appending a negated conjunction in the form of an UNLESS
clause. The experiments also showed that a reduction in the number of rules produced is vital.

In practice, a knowledge base is created so that the rules have the maximum generality and least
specificity possible; in other words, the rules must be general enough to accommodate future
situations while committing the fewest possible misclassifications. The experiments indicate that the
generalisation operation is rare relative to the other refinement operations (approximately 3%
of the total number of refinement operations), while specialisation accounted for 13%. The most common
operations were rule creation (43%), UNLESS-clause addition (21%), and rule deletion (20%). We would
expect the refinement system to follow this trend, because the initial knowledge base is susceptible to
incompleteness and inconsistency.
Domain            Before refinement      After refinement
                  #Rules   Accuracy      #Rules   Accuracy   #Specs   #Gens   #UNLESS   #New rules   #Rules deleted
Audiology          22.9     43.04         16.1     58.20      3.9      0       0         4.3          11.1
Flag               31.2     56.39         15.9     58.20      0        0       0         1.5          16.8
Mushroom           22.7     99.69          9.8     99.85      0        0       0         0.9          13.8
Heart              26.8     48.93         26.2     49.58      0        0       0         2.8           3.4
Wine                3.1     87.36          4.1     88.12      0        0       0         1.3           0.3
Hepatitis           3       78.55          4.5     79.52      0        0       0         1.7           0.2
Iris                2.6     93.17          3       95.67      0        0       0         0.7           0.3
Hypothyroid         3.4     97.89          4.5     98.20      0        0       0         1.6           0.5
Adult             160.8     71.43         57.1     71.44      0        0       0         7.5          11.01
Artificial data    19.4     98.09         20.3     98.11      0        0       0         1             0.1

Table 5: ARIS performance on knowledge bases created by ARIS using completeness and consistency
criteria on ten selected domains, averaged over ten trials per domain
Domain            Before refinement      After refinement
                  #Rules   Accuracy      #Rules   Accuracy   #Specs   #Gens   #UNLESS   #New rules   #Rules deleted
Audiology          23.2     44.02         12.6     49.6       0.1      0.2     –         4            14.6
Flag               33       55.3          14.2     60.24      0.1      0.1     0.1       3            21.8
Mushroom           34.3     98.47         13.6     98.50      0.2      –       –         0.2          20.5
Heart              38.9     49.26         12.3     50.66      0.4      0.1     –         0.6          27.2
Wine                5.7     89.72          4.3     89.72      0.3      0.1     –         0.2           1.5
Hepatitis           5.9     80.32          3.7     81.61      0.2      0.8     –         0.2           2.7
Iris                4.6     93.5           2.5     93.5       0.1      –       –         0.3           2.4
Hypothyroid         7.6     96.99          5.1     97.67      0.1      –       –         1.9           4.4
Adult             228       77.62         36.4     78.61      –        –       –         0.4         192
Artificial data    47.9     92.0          33.2     93.6       0.4      0.5     –         2.3          17

Table 6: ARIS performance on knowledge bases created by the C4.5 system,
averaged over ten trials per domain
Domain            Before refinement      After refinement
                  #Rules   Accuracy      #Rules   Accuracy   #Specs   #Gens   #UNLESS   #New rules   #Rules deleted
Audiology          12.8     66.56         12.8     68.04      0.1      0.3     –         2.9           2.8
Flag                8.6     52.89         13.7     55.66      0.5      0.5     –         6.1           1
Mushroom            7.5     99.84          7.5     99.86      –        –       –         0.4           0.4
Heart               2.9     52.2           9.5     53.03      0.1      0.7     –         7.1           0.3
Wine                2.8     86.25          3.8     87.64      0.5      0.3     –         1.6           0.5
Hepatitis           1.3     77.26          3.4     77.58      0.2      –       –         2.2           0.1
Iris                2.6     90.67          3.1     90.99      –        –       –         0.8           0.3
Hypothyroid         2.5     98.4           2.6     98.4       –        –       –         0.3           0.2
Adult               4       81.27          7.5     81.27      –        –       –         3.8           –
Artificial data    15.6     96.34         17.9     96.35      –        –       –         2.7           0.4

Table 7: ARIS performance on knowledge bases created by the RIPPER system,
averaged over ten trials per domain
4.2 Fair comparison between ARIS and other systems
We have shown that the refinement approach improved the concept-description accuracy of both the C4.5
and RIPPER systems, and reduced knowledge-base complexity (in terms of number of rules) whenever
possible for several databases, using a separate set of data for refinement. In this section we compare
the approach of learning plus refinement to simply learning rules from all the data.
Our comparison involves the following strategy:
• Induce a knowledge base by training the learning system on 40% of the available data and then
refine it using 20% of the available data;
• Induce a knowledge base by training the learning system on 60% of the available data;
• Compare the performance of the generated knowledge bases by testing both approaches on the
remaining 40% of the data.
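The protocol can be sketched with a stand-in learner (majority class) so the harness runs end to end; ARIS, C4.5, or RIPPER and the actual refinement step would plug in behind the same two functions. All names here are illustrative.

```python
from collections import Counter

def learn(train):
    """Stand-in learner: always predict the training data's majority class."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

def accuracy(model, test_set):
    return sum(model(x) == y for x, y in test_set) / len(test_set)

def compare(train40, refine20, test40, refine_fn):
    model_a = refine_fn(learn(train40), refine20)  # learn on 40%, refine on 20%
    model_b = learn(train40 + refine20)            # learn on the combined 60%
    return accuracy(model_a, test40), accuracy(model_b, test40)

# With an identity "refinement" the two strategies coincide on toy data:
train40 = [("a", "+"), ("b", "+"), ("c", "-")]
refine20 = [("d", "+")]
test40 = [("e", "+"), ("f", "-")]
acc_refined, acc_combined = compare(train40, refine20, test40,
                                    lambda model, cases: model)
print(acc_refined, acc_combined)  # 0.5 0.5
```

Holding the 40% test partition fixed across both strategies is what makes the accuracies directly comparable, as in Table 8.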
Table 8 is another comparison of three systems, namely ARIS, C4.5, and RIPPER on the selected test
domains. The refinement results were compared with knowledge bases induced by incorporating both
the training data and refinement data sets as a unified training set. This gives a fairer comparison
between ARIS, C4.5, and RIPPER. The first column names the domain used. The second column shows
the performance of the ARIS system on the test data used when trained on 40% of the available data.
The third column gives the performance of the ARIS system on the same test data when trained on the
combined training data (i.e. training and refinement data sets together). The fourth column gives the
performance of the knowledge base on the test data after refinement. Column five gives the performance
of the C4.5 system on the test data when trained on 40% of the available data. Column six shows the
performance of C4.5 on the test data when trained on the combined data. Column seven indicates the
performance of the refined knowledge base induced by C4.5 system on the same test data. Column eight
shows the performance of RIPPER system on the same test data when trained on 40% of the available
data. Column nine gives the performance of RIPPER on the test data when trained on the combined
data. Column ten gives the performance of the RIPPER’s refined knowledge-base on the same test data
set.
The tick marks indicate cases where training followed by refinement yielded better results than simply
training on all the data.
The experiments reveal a striking difference in rule induction between C4.5 and RIPPER.
Specifically, the C4.5 system generates many rules, some of which have the potential to cause rule
contradiction. The ARIS refinement process demonstrated that deleting such superfluous rules often
increases knowledge-base accuracy. RIPPER's approach to inducing rules, on the other hand, creates
fewer rules; therefore, during refinement, ARIS performs more rule creations and fewer rule deletions on
the RIPPER knowledge bases.
Finally, let us analyse the causes of an increase in accuracy in inductive learning systems. In general,
it can be expected that concept-description accuracy will increase as the size of the database increases.
This expectation is illustrated by the accuracy on the artificial data set (last row in Table 8), where
training a learning system on the large non-noisy data set produced better results than training on a
small portion and then refining the induced knowledge base. This advantage begins to deteriorate as noise
increases, as in the hepatitis and hypothyroid data.
Comparing the use of ARIS on the combined data set (60% of the data) just to generate the rules without
refinement, with training ARIS on 40% of the data and refining on 20%, we observed that the refinement
approach gave better results on 7 out of 10 domains and was inferior on 3 (wine, artificial data, and
flag).
Refining C4.5 rules produced with 40% of the available data was better than training the C4.5 system on
the combined data set in 7 out of 10 domains. This indicates that a refinement approach is useful when
knowledge bases are created without testing their predictive abilities, as is the case with the C4.5
system.
In RIPPER, the rules are generated from one training data set and post-pruned using a separate pruning
data set. This is a limited refinement scheme, and it is the advantage RIPPER holds over other learning
systems such as C4.5, HYDRA, and AQ15. This is the reason behind RIPPER's good-quality results when
trained on the combined data set, as can be seen from Table 8. However, it is still error-prone under
noise; for example, on the Hypothyroid data set, RIPPER's results when trained on the combined data are
inferior to those obtained by training the system on the smaller data set (40%) and then using ARIS to
refine the rules.
The hepatitis data is the smallest noisy data set (135 instances) we have. The attributes available (19
attributes) enabled the refinement system to draw the class boundaries better than when the learning
system was trained on the combined data. The Heart data set has similar characteristics to the hepatitis
data, and similar results are obtained. On the mushroom data set, the same accuracy was reached with
both approaches. The remaining six domains showed better results for RIPPER when trained on the
combined data set.
In summary, the refinement mechanism improved concept-description quality for all the algorithms in the
three medical domains (Hepatitis, Hypothyroid, and Heart), which are characterised by noise and
small-disjunct problems. Moreover, improvement on several other domains was obtained with C4.5 and
ARIS. It is therefore recommended that learning systems use a refinement mechanism on a data set
separate from the training set used to induce the knowledge base, as RIPPER and ARIS do, to obtain
good-quality concept descriptions.
Domain           ARIS using completeness and consistency      C4.5 system                                 RIPPER system
                 Trained   Trained on   Refined               Trained   Trained on   Refined              Trained   Trained on   Refined
                 on 40%    combined     KB                    on 40%    combined     KB                   on 40%    combined     KB
Iris 93.17 93.67 95.67 ✓ 93.5 94.0 93.5 90.67 93.33 90.99
Wine 87.36 89.72 88.12 89.72 90.0 89.72 86.25 90.14 87.64
Hepatitis 78.55 78.39 79.52 ✓ 80.32 77.96 81.61 ✓ 77.26 76.94 77.58 ✓
Hypothyroid 97.89 98.14 98.2 ✓ 96.99 97.74 97.67 ✓ 98.4 98.20 98.4 ✓
Heart 48.93 47.05 49.59 ✓ 49.26 50.19 50.66 ✓ 52.2 52.21 53.03 ✓
Flag 56.39 59.64 58.2 55.3 55.54 60.24 ✓ 52.89 57.23 55.66
Audiology 43.04 40.20 49.02 ✓ 44.02 43.12 49.12 ✓ 66.56 70.69 68.04
Mushroom 99.69 99.77 99.85 ✓ 98.47 98.37 98.5 ✓ 99.84 99.87 99.86
Adult 71.43 70.97 71.44 77.62 78.07 78.61 ✓ 81.27 82.22 81.27
Artificial data 98.09 98.86 98.11 92.0 98.16 96.6 96.34 97.53 96.35

Table 8: Accuracy of ARIS, C4.5, and RIPPER when trained on 40% of the data, when trained on the
combined 60%, and after refinement; ✓ marks cases where training followed by refinement beat training
on all the data
4.3 Statistical comparisons
To test the results statistically, we used a two-way analysis of variance to compare the means with a
significance level of 5%. The raw data from the ten cross-validation experiments was used. The data set
was one feature and the type of analysis the other. The six analysis types used corresponded to the
results shown in Table 8, comparing using 60% of the data for training to using 40% for training and
20% for refinement. The following results show the ANOVA table and the Duncan groupings for the
type of analysis.
Source           DF    Type I SS      Mean Square    F Value    Pr > F
Analysis type     5      1567.54        312.3083       6.77     0.0001
Data set          9    192014.52      21334.9474     462.53     0.0001

Table 9: ANOVA table for experimental results
Duncan Grouping Mean N EFF
A 82.42 100 RIPPER trained on 60%
A 81.47 100 RIPPER trained on 40% and
using 20% for refinement
B 79.57 100 ARIS trained on 40% and
Using 20% for refinement
B 79.42 100 C4.5 trained on 40% and
using 20% for refinement
B 78.37 100 C4.5 trained on 60%
B 77.86 100 ARIS trained on 60%
Table 10: Duncan grouping of overall mean values for the three inductive learning
systems with refinement
Refinement of the results from the C4.5 system gave better performance than training on 60% of the
available data, by more than 1%. Training ARIS (just to create a knowledge base) on 40% and then
refining resulted in better performance than training ARIS on 60%, by more than 1.7%. However,
training RIPPER on the combined data (i.e. 60%) is better than training on 40% and then refining using
20%, by less than 1%. It is important to note again that RIPPER includes a refinement scheme during
the creation of a knowledge base; exposing the system to more data for refinement would therefore often
produce better results than refining on only 20%. Moreover, the lack of some refinement operators
within RIPPER leads to results inferior to those of our refinement approach in noisy domains with
small-disjunct problems, as discussed previously.
In a second analysis we confine our comparison to the C4.5 and ARIS systems, excluding RIPPER as it
includes an internal refinement approach. The following tables show the output at the 5% significance
level.
Source           DF    Type I SS       Mean Square    F Value    Pr > F
Analysis type     3       203.9567       67.9856        1.6      0.188
Data set          9    150542.5780    16726.9531      394.24     0.000

Table 11: ANOVA table for experimental results for two inductive systems
Duncan Grouping Mean N EFF
A 79.57 100 ARIS trained on 40% and
used 20% for refinement
B A 79.42 100 C4.5 trained on 40% and
used 20% for refinement
B A 78.37 100 C4.5 trained on 60%
B 77.86 100 ARIS trained on 60%
Table 12: Duncan grouping comparison of results with and without refinement
Table 12 shows that training the ARIS system on 40% and then refining using 20% produces better results
than training on the combined data (i.e. 60% of the available data), and the difference is statistically
significant. The table also shows that training the C4.5 system on 40% and refining using 20% is better
than training on 60%.
5. Differences between inductive refinement and incremental learning

The top-level goal of both subareas, incremental learning and inductive refinement, is to produce
learning systems that are able to modify a knowledge base as a result of experience. Both subareas
interleave learning and performance (making learning systems themselves adaptable). However, each
subarea has its own characteristic strengths and weaknesses. We note some of them below:

• Source of knowledge
Incremental learning systems modify their own inferred knowledge base. Inductive refinement,
however, has the power to revise knowledge bases inferred by other batch learning systems and
hence can be valuable for improving inductive learning in general.

• Nature of refinement operators
Inductive refinement systems generally have the ability to refine a knowledge base by
generalising or specialising its rules, creating new rules, or removing superfluous rules. In
contrast, incremental learning only specialises or generalises its current hypothesis.

• Input schema used
Whereas incremental learning systems accept their input one object at a time, inductive
refinement systems receive the output of a batch learning system and a set of refinement data
before commencing the refinement process.

• Evaluation strategy
Refinement systems require access to an additional view of the environment (i.e. a refinement
data set) to refine the knowledge base, leading to better performance with regard to the whole
refinement set while correcting the misdiagnosed cases. In contrast, incremental learning
processes its input one object at a time to guarantee covering the given object correctly.
6. Conclusions
6.1 Relationship between ARIS and other systems
The main limitation of inductive learning systems is that the results from the testing phase play no role
in improving the learning process, but are merely used for validation purposes. After the learning
process is completed, the generated concept descriptions are used over and over for prediction. If new
information is observed, the learning system cannot utilise it; the previously learned knowledge base
must be discarded and all the data (old and new) supplied to the learning system to create the knowledge
base anew.
In ARIS the results obtained are analysed and fed back automatically to the learning mechanism. This
causes the learning system to refine the knowledge base, add missing rules, and delete superfluous rules,
producing a more comprehensive and accurate knowledge base. While retaining the fundamental ideas
behind inductive learning, ARIS is more powerful and capable of performing many additional actions to
improve its knowledge base.
The SEEK system [22,23] is able to help an expert focus on a problem within a knowledge base, but it is
then up to the expert to decide whether to accept the suggested change. One feature of SEEK is the use
of exclusions to rule out a decision in some situations (a common phenomenon in the medical domain).
But because the rules were provided in advance by experts in the field (rheumatic diseases), and are
expected to be complete, which is a strong restriction, SEEK was unable to produce new exclusions to its
rules, nor was it able to add new rules. Specialisation was limited to adding conditions, and numerical
ranges could not be altered. ARIS takes this further, suggesting new possible exclusions as well as adding
or altering conditions (both discrete and ordinal), allowing slight modifications to existing rules in
contrast to the radical changes of deleting and adding conditions. ARIS uses a minimum covering
approach to reach the exclusion, expressed as an additional clause (i.e. UNLESS) appended to the
original rule, which states that a combination of facts rules out a particular decision in certain situations.
The minimum covering approach facilitates the process of determining which condition to add to a rule
for specialisation.
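The effect of an UNLESS clause can be sketched as follows. This is an illustrative reading of the rule form described above, not ARIS's actual data structures; the rule and attribute names are hypothetical.

```python
def rule_fires(conditions, exclusions, case):
    """Return True when every condition holds and no exclusion clause does."""
    if not all(cond(case) for cond in conditions):
        return False
    # the UNLESS part: any satisfied exclusion rules the conclusion out
    return not any(excl(case) for excl in exclusions)

# Hypothetical rule: IF fever AND joint-pain THEN rheumatic UNLESS recent-vaccination
conditions = [lambda c: c["fever"], lambda c: c["joint_pain"]]
exclusions = [lambda c: c["recent_vaccination"]]
fires = rule_fires(conditions, exclusions,
                   {"fever": True, "joint_pain": True, "recent_vaccination": False})
```

The exclusion thus specialises the rule without deleting or rewriting its original conditions, which is exactly the kind of slight modification described above.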
The other feature of SEEK is that it distinguishes between conditions by marking each as minor or
major. This acknowledges that conditions contribute differently to a rule's conclusion, and is a coarse
form of condition weighting. ARIS instead weighs the conditions in rules using a probabilistic estimate
derived from the data distribution in the training set; the weight is continuous in the range -1 to +1.
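A minimal sketch of such a continuous condition weight is shown below. The paper states only that Bayes' theorem is used and that weights lie in [-1, +1]; the particular mapping here (2·P(class | condition) − 1) is an illustrative assumption, not the formula ARIS implements.

```python
def condition_weight(cases, condition, target_class):
    """Estimate how strongly `condition` supports `target_class`.

    cases: list of (attributes-dict, class-label) pairs.
    condition: predicate over an attributes dict.
    Returns a weight in [-1, +1]: +1 if the condition always implies
    the class, -1 if it never does, 0 if it is uninformative.
    """
    satisfied = [label for attrs, label in cases if condition(attrs)]
    if not satisfied:
        return 0.0  # condition never fires: no evidence either way
    p_class_given_cond = sum(1 for l in satisfied if l == target_class) / len(satisfied)
    return 2.0 * p_class_given_cond - 1.0

# Hypothetical toy data: fever is weak evidence for flu here
cases = [({"fever": True}, "flu"), ({"fever": True}, "flu"),
         ({"fever": False}, "cold"), ({"fever": True}, "cold")]
w = condition_weight(cases, lambda a: a["fever"], "flu")  # 2*(2/3) - 1 = 1/3
```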
SEEK2 [10], an extension to SEEK, learns new rules by processing examples. Using a set of examples
for which the hypothesis failed, SEEK2 finds a set of common features which are then used to form a
new rule. A complete refined knowledge base is presented to the expert (SEEK2 also works on a
knowledge base predefined by an expert). ARIS finds a set of features by using a minimum covering
algorithm: it starts from a misdiagnosed example and gradually allows more facts to be incorporated.
This gives more flexibility and allows ARIS to succeed in many situations requiring rule modifications.
EITHER [21] uses multiple approaches to refine an existing knowledge base, guided by pre-classified
examples. Missing rules can be generated; however, conditions can only be added or removed. Ordinal
tests cannot be modified, and numerical ranges cannot be expanded or reduced. For example,
IF age <= 14 THEN teen-ager
cannot be modified to
IF age <= 18 THEN teen-ager
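The numeric-range operator that EITHER lacks can be sketched in a few lines. The function name and representation below are hypothetical; the point is only that widening a threshold is a distinct operator from adding or deleting conditions.

```python
def generalise_upper_bound(threshold, case_value):
    """Widen a condition of the form `attr <= threshold` just enough
    to cover a misdiagnosed case with attribute value `case_value`."""
    return max(threshold, case_value)

# IF age <= 14 THEN teen-ager, plus a misdiagnosed 18-year-old teenager:
new_threshold = generalise_upper_bound(14, 18)  # the rule becomes age <= 18
```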
KRUST [7] shows that generating many refinements to a knowledge base, with the help of a set of
precisely selected pre-classified examples (called chestnuts) and without considering statistical
measures, is another alternative for refinement. Different refinements are generated, each leading to a
whole new knowledge base, and then one or more refined knowledge bases believed to be good are
suggested to the collaborating expert. ARIS, however, is designed with a different idea in mind: it tries
to learn general rules using a heuristic search on a library of cases, and gradually extends or modifies
its knowledge as required.
The lesson learned from KRUST is that applying all possible refinement operators and selecting the best
performing operator, or group of operators, to remedy the misdiagnoses is better than accepting the first
successful refinement.
Table 13 presents and compares the following important characteristics of refinement systems:
• Knowledge-base source
Previous knowledge-base refinement systems rely on domain experts to construct rule sets that are
complete with respect to the domain of interest. Our approach instead depends on heuristics to
generate and refine knowledge.
• Knowledge-base representation
Several different representations have been used to facilitate knowledge transfer from domain
experts to computers. Some representations are domain-specific, for example the criteria tables
used by SEEK in the medical domain, and cannot easily be used in other domains. A propositional
production-rule language is used in ARIS because it is easy to comprehend and expresses
knowledge efficiently.
• Conflict resolution strategy
The conflict resolution scheme provides the framework for pin-pointing one conclusion among the
various possible conclusions that can be reached. Some systems, such as KRUST and EITHER,
accept the conclusion of the first rule satisfied. Others, such as SEEK and SEEK2, accept the
conclusion reached with the highest certainty. Our approach orders rules according to their weight,
which is a probabilistic estimate of rule quality.
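Because the rules are kept sorted by weight, firing the first satisfied rule is equivalent to firing the highest-weight satisfied rule. A minimal sketch, assuming rules are (weight, conditions, conclusion) triples (a representation chosen here for illustration):

```python
def classify(rules, case):
    """Fire the highest-weight rule whose conditions all hold for `case`."""
    for weight, conditions, conclusion in sorted(rules, key=lambda r: -r[0]):
        if all(cond(case) for cond in conditions):
            return conclusion
    return None  # no rule satisfied; some fallback/default class is needed

# Hypothetical two-rule knowledge base
rules = [
    (0.9, [lambda c: c["fever"], lambda c: c["cough"]], "flu"),
    (0.4, [lambda c: c["fever"]], "cold"),
]
label = classify(rules, {"fever": True, "cough": False})  # falls through to "cold"
```

The `None` return also marks the third problem raised in the abstract: classifying an object when no rule is satisfied for it requires a separate mechanism.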
• Pre-assumptions
All other knowledge-base refinement systems assume that the knowledge base is complete and only
minor tweaking is required. This view assumes the availability of experts and that the domain is
fairly well understood. Inductive concept-learning deals with more general environments that are
characterised by:
- lack of understanding of the particular domain,
- availability of data describing instances of the environment's behaviour,
- regular change of the environment, as is the case with medical images and credit-card
transactions.
These characteristics often make it impossible to guarantee completeness of the initial knowledge
base.
• Missing refinement operators
Early refinement systems such as SEEK and SEEK2 were developed for specific problem domains,
and some refinement operators were missing. For example, as a result of the completeness
assumption, SEEK could not create new rules. We implemented all the refinement operators
applicable to propositional production rules.
• User interaction
Interaction with a domain expert or knowledge engineer is a common characteristic of previous
knowledge-base refinement systems. Our approach is non-interactive.
• Knowledge-base evaluation approach
Reaching a good-quality refined knowledge base can be viewed as a search through a space of
possible knowledge bases. The EITHER system accepts the first successful refinement, while SEEK
relies on local statistics that favour the refinement deemed to correct the most misdiagnoses.
KRUST, on the other hand, generates many refinements and then selects the best one. ARIS
evaluates the quality of the knowledge base after applying every possible refinement operation to
the misclassifications, and chooses the refinement resulting in the highest knowledge-base quality,
measured by its completeness and consistency.
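The paper does not give formulas for the two quality factors, but plausible stand-ins are coverage rate (completeness) and accuracy on the covered cases (consistency). The sketch below uses those definitions and the same hypothetical (weight, conditions, conclusion) rule triples as before; both choices are assumptions for illustration.

```python
def kb_quality(rules, cases):
    """Score a knowledge base on a refinement set.

    rules: list of (weight, conditions, conclusion) triples.
    cases: list of (attributes-dict, true-label) pairs.
    Returns (completeness, consistency): the fraction of cases some rule
    covers, and the fraction of covered cases the fired rule gets right.
    """
    def fire(case):
        for weight, conditions, conclusion in sorted(rules, key=lambda r: -r[0]):
            if all(cond(case) for cond in conditions):
                return conclusion
        return None

    predictions = [(fire(case), label) for case, label in cases]
    covered = [(p, l) for p, l in predictions if p is not None]
    completeness = len(covered) / len(cases)
    consistency = (sum(1 for p, l in covered if p == l) / len(covered)) if covered else 0.0
    return completeness, consistency

# Toy knowledge base and refinement set: 2 of 3 cases covered, 1 of those correct
kb = [(0.8, [lambda c: c["x"] > 0], "pos")]
data = [({"x": 1}, "pos"), ({"x": 2}, "neg"), ({"x": -1}, "neg")]
comp, cons = kb_quality(kb, data)
```

Comparing candidate refinements then reduces to comparing these two numbers before and after each operation.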
In summary, ARIS improves on inductive learning systems by using performance analyses as feedback
to the learning process. The results for several knowledge bases are encouraging. However, there are
extensions that would make it a more powerful learning tool, as discussed in Section 6.3.
Table 13. Comparison of knowledge-base refinement systems.

Knowledge-base source
  ARIS: inductively generated by analysis of a pre-classified data set.
  SEEK: expert originated.
  KRUST: expert originated.
  EITHER: expert originated.

Knowledge-base representation
  ARIS: propositional production rules (IF conditions THEN conclusion UNLESS conditions).
  SEEK: criteria tables.
  KRUST: backward-chaining production rules, using semantic nets.
  EITHER: Horn-clause rules.

Conflict resolution
  ARIS: first rule satisfied is fired.
  SEEK: the rule with the highest certainty is fired.
  KRUST: first rule satisfied is fired.
  EITHER: first rule satisfied is fired.

Pre-assumptions
  ARIS: training data is an adequate representation of the domain.
  SEEK: the knowledge base is assumed complete; only minor tweaking is required.
  KRUST: the knowledge base is assumed complete; only minor tweaking is required.
  EITHER: the knowledge base is assumed complete; only minor tweaking is required.

Missing refinement operators
  ARIS: all refinement operators are implemented.
  SEEK: no rule creation; no KB redundancy check; refinement of continuous conditions is missing.
  KRUST: no KB redundancy check; new rule creation is not efficient; does not handle exceptional cases.
  EITHER: no KB redundancy check; refinement of continuous conditions is missing; does not handle exceptional cases.

Expert interaction
  ARIS: no interaction with the system.
  SEEK: yes.
  KRUST: yes.
  EITHER: yes.

Knowledge-base evaluation technique
  ARIS: completeness and consistency factors are used to evaluate the system.
  SEEK: local statistical measure; the first successful refinement is accepted.
  KRUST: generates many refinements, then selects one; the knowledge bases are tested on a special
  data set (chestnuts), selected precisely for the given domain so that a good knowledge base does
  not misclassify it.
  EITHER: the first successful refinement is accepted.
6.2 Conclusion
An inductive refinement model has been developed that is capable of creating a knowledge base from a
library of pre-classified cases and continuously updating it to accommodate new facts. This model is of
particular importance in regularly changing and noisy domains such as credit-card transactions and
medical images.
We have developed a method of learning rule weights based on an estimate of the relationship of rule
conditions to a conclusion. The rules are ranked according to their weight to determine the misdiagnosed
cases more easily. We have further developed a method for learning rules centred around a
misdiagnosed case, which we call the minimum covering algorithm; it helps determine the missing
conditions in erroneous rules for specialisation, as well as add new rules to the knowledge base. We
have also developed an algorithm for inducing rules from trees, utilising completeness and consistency
criteria for rules.
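The covering idea can be sketched as a greedy search. The paper names the minimum covering algorithm but gives no pseudocode, so the version below, which starts from a misdiagnosed case and repeatedly picks the attribute test that rules out the most conflicting cases, is a hedged reconstruction with hypothetical names.

```python
def minimum_cover(seed, negatives):
    """Greedily pick attribute tests around a misdiagnosed case.

    seed: attributes dict of the misdiagnosed case.
    negatives: attribute dicts of the cases (of other classes) that the
    new condition set must rule out.
    Returns a small list of (attribute, value) equality tests, all drawn
    from the seed case, that together exclude every negative.
    """
    remaining = list(negatives)
    chosen = []
    while remaining:
        # the test from the seed case that excludes the most negatives
        attr = max(seed, key=lambda a: sum(1 for n in remaining if n.get(a) != seed[a]))
        if all(n.get(attr) == seed[attr] for n in remaining):
            break  # no remaining test discriminates; give up on the rest
        chosen.append((attr, seed[attr]))
        remaining = [n for n in remaining if n.get(attr) == seed[attr]]
    return chosen

seed = {"fever": True, "rash": True}
negatives = [{"fever": True, "rash": False}, {"fever": False, "rash": False}]
tests = minimum_cover(seed, negatives)  # one test on "rash" suffices here
```

The chosen tests can then serve either as extra conditions specialising an erroneous rule or as the body of a new rule covering the misdiagnosed case.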
Specific areas of contribution include:
• Scaling learning algorithms to large and ever-growing databases
Databases are large and growing larger, so learning systems must be able to handle enormous
amounts of data. Inductive refinement introduces a method for inductive systems to cope with
future changes in the data without forgetting past experience. This is useful because it allows only
the recently observed data to be investigated: for example, it is easier to process the last
thousand credit-card transactions than a full database of millions.
• Focus of the refinement process
Learning weights for the rules and conditions makes it possible to focus the refinement process
and resolve conflicts more easily.
• Implementation that demonstrates the goals
The ideas put forth in this paper have been implemented, and the implementation has
demonstrated the feasibility and effectiveness of the approach, including:
(a) More accurate than C4.5
ARIS's refinement approach proved more accurate than training C4.5 on all the data at
once (combined training) on many noisy data sets (see Table 8). C4.5 is more accurate on
one large noise-free data set (the artificial data) and equal on two other small noise-free
data sets (the iris and wine data sets).
(b) Occasionally equal to or more accurate than RIPPER
Table 8 shows that ARIS's refinement approach is more accurate than RIPPER on three
test domains (heart, hypothyroid, and hepatitis) that are characterised by noise and
small-disjunct problems (i.e. rules covering few cases). When RIPPER creates new rules, it
ignores any rule with accuracy below 50% on the pruning data set (part of the training
data provided). In the presence of small disjuncts combined with noise, RIPPER therefore
misses some rules that later prove useful; the refinement mechanism corrects such
problems when they are observed.
On one data set (the mushroom data set), where the majority class forms 51.81% of the
domain, the two approaches are equal. However, RIPPER is better when trained on all the
data combined on the remaining test domains.
It is important to note that RIPPER includes an internal, limited form of refinement scheme.
A learning system exposed to a larger view of the environment can therefore be expected
to perform better, especially in noise-free domains such as the artificial data set used in
the experiments, so improving on RIPPER in some domains is a good result. This is a
benefit of the increased search effort and the additional refinement operators available,
which can modify the class boundaries gradually, resulting in improved accuracy of the
knowledge-based system.
The goal of this research has been to provide an effective approach to inductive learning, based on a
batch-learning strategy, that makes learning systems responsive to domain changes.
The experiments with the ARIS system, and the lessons learned while refining knowledge bases
induced by other learning systems such as C4.5 and RIPPER, serve as a good demonstration of the
significance of refinement in improving the performance of inductively learned systems.
Moreover, RIPPER could be improved if it:
(1) performed more refinement operators, such as expanding and shrinking numeric ranges rather
than only adding and deleting conditions; and
(2) allowed the use of negated conditions in the developed rules, as ARIS does.
6.3 Limitations and further work
This research has pointed out several potential areas for further research and investigation:
• Learning intermediate concepts
To obtain greater benefit from concept induction, we need to relax the mutual exclusivity of
the classes. This would permit a class to be named as a condition in a rule, resulting in more
compact concept descriptions.
• Parallel implementation
More work needs to be done to speed up the refinement process. Since the refinement
operators (specialisation/generalisation) are performed independently, the refinement process
is well suited to parallel implementation. The problem would be to design an optimal
framework to distribute and control the workload among a fixed set of processors (equal in
number to the refinement operations). This would increase the speed of the refinement
system.
• Constructing new features
In many learning problems there are hundreds or thousands of features describing each input
case. Current concept-learning algorithms such as C4.5 [25] do not scale well when there are
many features. Moreover, it is easy for learning algorithms to be confused by noisy features and
construct poor classifiers.
Constructing new features from the given data enables learning systems to change their
representation and redescribe the given domain. New features can be constructed by assigning
names to groups of best-performing features for each class and/or combining feature values into
larger units. Constructing new features during the refinement cycle would produce more compact
concept descriptions, and might lead to performance increases similar to those reported by
some constructive-induction research in the DNA domain [17].
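One simple instance of combining feature values into larger units is forming a named conjunction of two existing boolean features. This is an illustrative sketch only; the naming scheme and helper are hypothetical, not part of ARIS.

```python
def conjoin(name_a, name_b):
    """Build a new named feature that tests both constituent features."""
    return (f"{name_a}_and_{name_b}",
            lambda case: bool(case[name_a]) and bool(case[name_b]))

# A constructed feature that a rule learner could then test as one unit
name, feature = conjoin("fever", "rash")
```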
Acknowledgements
I gratefully acknowledge the intellectual support of my supervisor Professor David Lubinsky, during my
research work, and for his helpful comments on this paper.
REFERENCES
1. Ali K. (1996). Learning probabilistic relational concept descriptions. Ph.D. thesis, University of
California.
2. Ali K., Pazzani M. (1993). HYDRA: A noise-tolerant relational concept learning algorithm. In:
Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambéry,
France.
3. Breiman L., Friedman J., Olshen R., Stone C. (1984). Classification and Regression Trees.
Wadsworth, Pacific Grove, CA.
4. Buchanan B., Shortliffe E. (1984). Rule-Based Expert Systems: The MYCIN Experiments of the
Stanford Heuristic Programming Project. Addison-Wesley Publishing Company.
5. Cloete I., Theron H. (1991). CID3: An extension of ID3 for attributes with ordered domains. South
African Computer Journal, 10-16.
6. Cohen W. (1995). Fast effective rule induction. In: Proceedings of the Twelfth International
Conference on Machine Learning. San Francisco, CA.
7. Craw S. (1991). Automating the refinement of knowledge-based systems. Ph.D. thesis, University of
Aberdeen.
8. Fayad U., et al. (1988). Learning control information in rule-based systems: a weak method. In:
Proceedings of the Fourth Conference on AI applications, 14-19.
9. Friedman J., Kohavi R., Yun Y. (1996). Lazy decision trees. In: Proceedings of the National
Conference on Artificial Intelligence, Portland.
10. Ginsberg A. (1986). Refinement of expert systems: A metalinguistic framework for heuristic
analysis. Ph.D. thesis, Rutgers University.
11. Holland J. (1986). Escaping brittleness: The possibilities of general purpose learning algorithms
applied to parallel rule-based systems. In: Michalski et al. (editors), Machine Learning II. Morgan
Kaufmann, Los Altos, California.
12. Holsheimer M., Siebes A. (1991). Data Mining: The search for knowledge in databases. Technical
Report CS-R9406, Amsterdam, The Netherlands.
13. Kanal L., Lemmer J. (1986). Uncertainty in Artificial Intelligence. North-Holland, Amsterdam.
14. Kohavi R., et al. (1997). The utility of feature weighting in nearest neighbour algorithms. In: Van
Someren M. and Widmer G. (editors), poster paper, Ninth European Conference on Machine
Learning. Czech Republic.
15. Kohavi R., Kunz C. (1997). Option decision trees with majority votes. In: Proceedings of the
International Conference on Machine Learning. Tahoe City, CA.
16. Laird J. (1988). Recovery from incorrect knowledge in Soar. In: Proceedings of the National
Conference on Artificial Intelligence, Saint Paul, Minnesota, 618-623.
17. Craven M., Shavlik J. (1993). Learning to represent codons: A challenge problem for constructive
induction. In: Proceedings of the Thirteenth International Joint Conference on Artificial
Intelligence, Chambéry, France, 1319-1324.
18. Mitchell T. (1997). Machine Learning. McGraw-Hill.
19. Mohri T., Tanaka H. (1994). An optimal weighting criterion of case indexing for both numeric and
symbolic attributes. In: D.W. Aha (editor), Case-Based Reasoning: Papers from the 1994 Workshop
(TR WS-94-01). Menlo Park, CA.
20. Murphy P. UCI repository of machine learning databases,
[http://www.ics.uci.edu/~mlearn/MLRepository.html].
21. Ourston D., Mooney R. (1994). Theory refinement combining analytical and empirical methods.
Artificial Intelligence, 273-309.
22. Politakis P. (1983). Using empirical analysis to refine expert system knowledge bases. Ph.D. thesis,
Rutgers University.
23. Politakis P., Weiss S. (1984). Using empirical analysis to refine expert system knowledge bases.
Artificial Intelligence, 23-48.
24. Quinlan R. (1988). Induction of decision trees. Machine Learning, 81-108.
25. Quinlan R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
26. Mooney R., Ourston D. (1994). A multi-strategy approach to theory refinement. In: Michalski et
al. (editors), Machine Learning IV. Morgan Kaufmann, Los Altos, California.
27. Theron H. (1994). Specialisation by exclusion: An approach to concept learning. Ph.D. thesis,
University of Stellenbosch, South Africa.
28. Wettschereck D., Aha D. (1995). Weighting features. In: Proceedings of the First International
Conference on Case-Based Reasoning. Sesimbra, Portugal.
29. Wilkins D., Ma Y. (1994). The refinement of probabilistic rule sets: sociopathic interactions.
Artificial Intelligence, 1-32.