Saravanan Anandathiyagar, Project Background Paper, March 2002. Supervisor: Simon Colton

A Substructure Server

Abstract

Much of the reason for the high cost of medicines is rooted in the length and complexity of the development and approval process. At every stage of development, it is possible that a potential drug (leader) will fail to gain approval on the basis that it produces erratic results or harmful side effects. Predictive toxicology aims to reduce the money and time spent by identifying, as early in the drug development process as possible, leaders that are likely to fail. Numerous machine learning techniques exist to identify such leaders. Here we present a possible solution based on the Find Maximally Specific Hypothesis (Find-S) algorithm. Given a set of positive and negative examples, this algorithm finds substructures that are statistically true of the majority of positive compounds, and statistically not true of the negative compounds.

A discussion of the algorithm and its motivation is presented here.


Contents

Abstract
1. Introduction
   1.1. Motivation
   1.2. Summary of Report
2. Previous Research
   2.1. Structure-Activity Relationships
   2.2. Attribute-based representations
   2.3. Relational-based representations
   2.4. Inductive logic programming
3. The Find-S Technique
   3.1. Motivation
   3.2. General-to-specific ordering of hypotheses
   3.3. The Find-S algorithm
   3.4. Algorithm evaluation methods
   3.5. Issues with the Find-S technique
   3.6. Existing Prolog implementation
4. Implementation Considerations
   4.1. Representing structures
   4.2. Improvement of current implementation
   4.3. Extensions
5. References


1. Introduction

1.1. Motivation

Each year, drug companies release new and improved drugs, claiming that they produce better results with fewer side effects. However, the cost of such advances in the drug industry is not small. Developing a drug from the theoretical stage to its appearance on pharmacy shelves normally takes in the region of 10 to 15 years, at an average cost of over £500 million [1]. This outlay by the drug company must be covered by the consumer for the company to remain in profit, and evidence of this can be seen, for example, in the regular rise of NHS prescription charges.

Much of the reason for the high cost of medicines is rooted in the length and complexity of the development and approval process. At every stage of development, it is possible that a potential drug (leader) will fail to gain approval on the basis that it produces erratic results or harmful side effects. Even after promising lab tests, further experiments on animal specimens often return ideas to the drawing board. It is estimated that for every one drug that reaches the clinical (human) trial stage, another 1000 have failed earlier testing.

Despite this, it is important to note that medicines still reduce overall medical care costs by reducing even more expensive hospitalisation, surgery or other treatments. Drugs are the primary way of controlling the outcomes of chronic illness. Therefore, the development of new drugs is important both for patient care and for the positive long-term financial implications.

It is clear that ruling out unsuitable drug leaders at an early stage will have a significant effect in limiting development costs. Determining early that a leader is unsuitable for further testing saves the investment that would otherwise have been spent on the drug, only for the same conclusion to be reached later. For this reason, the field of predictive toxicology was born. It is an effort on the part of biotechnology companies to predict in advance whether or not a drug will be toxic, using various techniques learnt from the fields of statistics, artificial intelligence (AI), and machine learning.

Negative effects of a drug can range from relatively minor problems such as headaches and stomach upsets, to potentially life-threatening organ damage. While many accepted drugs do produce some side effects for some patients, the value of the treatment is always said to outweigh the side effects. However, there are certain characteristics of chemical compounds that will limit their effectiveness as a drug. Predictive toxicology aims to find this drug toxicity while the drug is still in the planning stages. Ruling out a leader at this early stage saves it being synthesised and tested, and allows resources to be focused on more promising areas of research.

Machine learning programs in a variety of different guises have been used to try to discover the reasons why certain chemicals are toxic and others are not. Essentially, they learn a concept that is true of the toxic drugs and false for other, non-toxic drugs. These derived concepts are usually small (around five or six atoms) sub-structures of the larger drug molecule, where some of the atoms are fixed elements and others may vary.

The task at hand is to effectively and efficiently identify such sub-structures using the Find Maximally Specific Hypothesis (Find-S) machine learning algorithm. An implementation of the algorithm has been written in PROLOG by S. Colton; our work here is based on extending this implementation and producing a web-based server application.

A molecule is said to be positive if it contains the sub-structure in question; conversely, it is said to be negative if it does not. The application will return interesting substructures given positive and negative molecules, where each substructure is true of statistically significantly more positives than negatives.

1.2. Summary of Report

This report is an overview of the research undertaken, with an outline of how implementation of a Substructure Server may proceed. Section 2 summarises the machine learning techniques used in the field of predictive toxicology, and introduces the concepts of attribute-based and relational-based structure-activity relationships.

Section 3 is a comprehensive overview of the Find-S algorithm, with an emphasis on how it may perform in a predictive toxicology setting. A fictional example is presented and analysed which demonstrates the key methodologies of the technique. Evaluation techniques applicable both to the algorithm itself and to the results it produces are outlined, as well as various considerations that should be addressed on implementation. S. Colton's existing Prolog implementation of the algorithm is also discussed.


Section 4 highlights some implementation considerations, suggesting a possible course of action towards building a substructure server available for public use.


2. Previous Research

As mentioned above, machine learning algorithms to find relevant sub-structures have been applied in the field of predictive toxicology. It is important to understand the approaches that have been taken in previous work, using them as a basis for further study. The key features of the background study undertaken are summarised in this section.

2.1. Structure-Activity Relationships

A structure-activity relationship (SAR) models the relationship between activities and physicochemical properties of a set of compounds [2]. The goal of our work is essentially to form SARs from the given input molecules. These resultant SARs represent the molecules most likely to contribute to toxicity, as calculated by our algorithm.

A SAR is derived from two components:

- the learning algorithm employed during derivation, and
- the choice of representation used to describe the chemical structure of the compounds being considered.

The learning algorithm used will rule out possible choices of representation, as the latter has to be rich enough to support the algorithm's procedure.

SARs can store different information about compounds; typically such information (attributes) could consist of any of the following chemical properties [5]:

- Partial atomic charges
- Surface area
- Volume
- H-bond donors/acceptors
- ClogP
- CMR
- pKa, pKb
- Hansch parameters, F
- Molecular grids
- Polarisability

The exact nature or meaning of each attribute type need not be discussed here. It is, however, important to note that there are any number of ways of representing a compound, using any combination of the attributes given above (and more).


2.2. Attribute-based representations

A large variety of learning techniques are in use that derive SARs of different forms. The majority of these are based on examining the types of attributes listed above. A short summary of a few of these techniques is presented here.

2.2.1. Linear and Partial least-squares regression

Linear regression was the first learning algorithm employed in predictive toxicology, as detailed by Hansch et al. [3]. "Training" the system involves providing suitable training examples, which are simply saved to memory without being interpreted or compared in any way. It is on this stored information (as explicitly provided by the user) that regression aims to approximate its target function.

In the context of predictive toxicology, this would involve supplying examples of positive compounds as training data. When the procedure is then run on a new compound, a set of similar compounds is retrieved from the stored values and used to classify the new compound. The analysis of the compounds is based on chemical attributes as specified by the algorithm; Hansch used global chemical properties of the molecule (LogP and the Hammett substituent constant σ).

Least-squares regression is another learning technique involving the relationship between chemical attributes. Visually, it essentially entails forming a 'line of best fit' for a set of training data plotted against two variables x and y, where x and y are two chemical attributes. For any new compound encountered, a plot is made of the same two attributes; if the point produced lies within a fixed bound of the line of best fit, then the new compound can be deemed positive. The system can be extended to include multiple independent variables, and to give each variable a different weight, a measure of how important each attribute is relative to the others.

It is important to note that both these techniques make no attempt to interpret the training data as it is fed to them; all the processing of determining suitability criteria for new compounds happens only once the new compound has been encountered.
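To make the regression idea above concrete, the following is a minimal Python sketch (not part of the original report or its PROLOG work): it fits a least-squares line of best fit between two attribute columns and deems a new compound positive if its point lies within a fixed bound of that line. The attribute values and the tolerance are invented purely for illustration.

    import numpy as np

    # Invented training data: two chemical attributes (x and y) measured
    # for a handful of known positive compounds.
    x_train = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
    y_train = np.array([0.9, 1.6, 2.1, 2.4, 3.1])

    # Least-squares line of best fit y = m*x + c over the training compounds.
    m, c = np.polyfit(x_train, y_train, deg=1)

    def classify(x_new, y_new, tolerance=0.3):
        # Positive if the new compound's point lies within a fixed bound
        # of the fitted line, as described in section 2.2.1.
        predicted_y = m * x_new + c
        return abs(y_new - predicted_y) <= tolerance

    print(classify(1.8, 1.9))   # True: close to the fitted line
    print(classify(1.8, 3.5))   # False: far from the fitted line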

2.2.2. Decision trees


Decision trees classify the training data by considering each <attribute, value> pair (tuple) for a given compound [4]. Each node in the tree specifies a test of a particular attribute, and each branch descending from that node corresponds to a possible value for that attribute. A compound is classified as positive or negative at the leaf nodes of the tree.

New compounds are classified by comparing their attribute values to the ones stored from the training data. An implementation of this algorithm needs to address the critical issue of which attribute(s) to perform the test on. This decision could crucially alter the classification schema, and it is a problem inherent in trying to separate objects into discrete sets when their behaviour or identity is given by a number of attributes. It is possible that any two attribute values could contradict each other on a particular classification scheme, and it then becomes necessary to impose some ordering or priority system over the attributes.
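As a small illustration of the node-and-branch structure just described (not taken from the report; the attribute names and thresholds are invented, loosely based on the attribute list in section 2.1), a hard-coded two-level decision tree might classify a compound as follows. A real decision-tree learner would of course choose the test attributes automatically from the training data.

    # Hypothetical two-level decision tree over <attribute, value> pairs.
    def classify(compound):
        # Root node tests the compound's ClogP attribute.
        if compound["clogp"] > 2.0:
            # This branch tests a second attribute before reaching a leaf.
            return compound["h_bond_donors"] <= 3   # leaf: positive or negative
        return False                                # leaf: negative

    print(classify({"clogp": 2.5, "h_bond_donors": 2}))   # True
    print(classify({"clogp": 1.0, "h_bond_donors": 2}))   # False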

2.2.3. Neural networks

Artificial Neural Networks (ANNs) provide a general and practical method for learning functions from examples [4], and have widespread use in AI applications. Predictive toxicology lends itself to the use of ANNs because compound attributes can be treated as <attribute, value> tuples, in a manner similar to that discussed in section 2.2.2 above. A compound can be represented by a list of such tuples covering the full range of attributes.

The simplest form of ANN system is based on perceptrons, which take the list of tuples and calculate a 'score' for the compound. This score is calculated from a combination of the input tuples and a weight associated with each attribute. The algorithm can learn from the training data by considering the attributes of positive compounds, and can then classify unknown compounds as positive or negative, depending on whether the calculated score is higher than a defined threshold.
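A minimal sketch of this scoring scheme is given below; the attribute names, weights and threshold are invented for illustration and are not taken from the report. The compound's score is a weighted combination of its attribute values, and the compound is classified as positive when the score exceeds the threshold.

    # Illustrative perceptron-style classifier over <attribute, value> tuples.
    weights = {"clogp": 0.8, "surface_area": 0.01, "polarisability": -0.5}
    threshold = 1.0

    def score(compound):
        # Weighted combination of the compound's attribute values.
        return sum(weights[name] * value for name, value in compound.items())

    def is_positive(compound):
        return score(compound) > threshold

    example = {"clogp": 2.0, "surface_area": 150.0, "polarisability": 1.2}
    print(score(example), is_positive(example))   # 2.5 True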

Practical ANN systems usually implement the more advanced backpropagation algorithm, which learns the weights for a network of neural nodes arranged in multiple layers. However, the principle is the same as that used in the perceptron algorithm, with the compound score being calculated in a non-linear manner taking into account more variables.

2.3. Relational-based representations

The techniques mentioned above for deriving SARs all share one key concept: they are all based on attributes of the object (in our case, the chemical compound being examined). These attributes can be considered to be global properties of the molecules; for example, the molecular grid attribute maps points in space, which are global properties of the coordinate system used. The tuple of attributes used to represent the properties of the molecule is not an ideal format, as it is difficult to efficiently map the atoms and bonds onto a linear list.

A more general way to describe objects is to use relations. In a relational description the basic elements are substructures and their associations [2]. This allows the spatial arrangement of the atoms within the molecule to be represented more accurately, directly and efficiently.

2.4. Inductive logic programming

Fully relational descriptions were first used in SARs with the inductive logic programming (ILP) learning technique, as shown in [6]. ILP algorithms are designed to learn from training examples encoded as logical relations. ILP has been shown to significantly outperform the feature (attribute) based induction methods described above [7].

ILP for SARs can be based on knowledge of atoms and their bond connectives within a molecule. Using this scheme has a number of benefits:

- it is simple, powerful, and can be applied generally to any SAR;
- it is particularly well suited to forming SARs dependent on the relationship between the atoms in space (shape);
- chemists can easily understand and interpret the resultant SARs, as they are familiar with relating chemical properties to groups of atoms.

The formal difference between the descriptive properties of attribute and relational SARs corresponds to the difference between propositional and first-order logic [2]. ILP involves learning a set of "if-then" rules for a training set, which can then be applied to unseen examples. Sets of first-order Horn clauses can be constructed to represent the given data rules, and these can be interpreted in the logic programming language PROLOG.


ILP differs from the attribute-based techniques in two key areas. ILP can learn first-order rules that contain variables, whereas the earlier algorithms can only accept finite ground terms for attribute values. Further, ILP sequentially examines the data set, learning one rule at a time to incrementally grow the final set of rules.

We stated above that relational SARs can be described by first-order predicate logic. The PROGOL algorithm was developed [8] to allow the bottom-up induction of Horn clauses, and is implemented in PROLOG. PROGOL uses inverse entailment to generalise a set of positive examples (active compounds) with respect to some background knowledge, namely atom and bond structure data given in the form of PROLOG facts. PROGOL constructs a set of "if-then" rules which explain the positive (and negative) examples given.

In the case of predictive toxicology, these rules generally specify a sub-molecular structure of around five or six atoms. These structures are those that have been calculated to contribute to toxicity, based on their presence in the set of positive training examples and their absence from the set of negative training examples. These sub-structures can then be matched against components of unseen compounds in an attempt to predict toxicity.
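To make the atom-and-bond representation concrete, the sketch below shows one way a molecule's relational description might be held in Python. It is not the PROLOG fact database used by PROGOL or by the implementation discussed in section 3.6, and the atom labels and bond types are invented.

    # A purely illustrative relational description of one molecule.
    # Each atom is (atom_id, element); each bond is (atom_id, atom_id, bond_type).
    molecule = {
        "atoms": [(1, "c"), (2, "c"), (3, "n"), (4, "o")],
        "bonds": [(1, 2, "single"), (2, 3, "double"), (3, 4, "single")],
    }

    def bonded(mol, element_a, element_b):
        # True if the molecule contains a bond between atoms of the two
        # given element types, in either direction.
        elements = dict(mol["atoms"])
        for a, b, _bond_type in mol["bonds"]:
            if {elements[a], elements[b]} == {element_a, element_b}:
                return True
        return False

    print(bonded(molecule, "c", "n"))   # True: atoms 2 and 3 are bonded
    print(bonded(molecule, "c", "o"))   # False: no direct c-o bond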


3. The Find-S Technique

3.1. Motivation

As mentioned previously, the focus of this project is to use the Find-S algorithm, as described below, to identify the sub-structures discussed at the end of section 2.4. Within the scope of predictive toxicology it may appear that Find-S and ILP do the same thing; however, this is not the case. The Find-S technique differs from ILP in the motivation behind the process. ILP looks for concepts that are true for positive examples and false for negative examples, and produces a sub-molecular structure as a result. The Find-S procedure, on the other hand, is given a template (by the user) to guide its search, and the program looks for all possibilities of that general shape in the positive inputs.

3.2. General-to-specific ordering of hypotheses

Any given problem has a predefined space of potential hypotheses [4], which we shall denote H. Consider a target concept T, whose truth value (1 or 0) depends upon the values of three attributes a1, a2 and a3. Each attribute can take a range of discrete values, some combinations of which will make T true and others false. We denote the value x of an attribute an by v(an) = x.

We let each hypothesis consist of a conjunction of constraints on the attributes, i.e. the list of attribute values for that particular instance of the problem. This list of attributes (of length three in this case) can be held in a vector. For each attribute an, the value v(an) will take one of the following forms:

- ?, indicating that any value is acceptable for this attribute;
- Ø, indicating that no value is acceptable for this attribute;
- a single required value for the attribute, e.g. for an attribute 'day of week', acceptable values would be 'Monday', 'Tuesday', etc.

With this notation, the most general hypothesis for T is

<?, ?, ?>

which states that any assignment to the three attributes will result in the hypothesis being satisfied. Conversely, the most specific hypothesis for T is

<Ø, Ø, Ø>

which states that no assignment to the variables will ever satisfy the hypothesis.

All hypotheses within H can be represented in this way, with the majority falling somewhere between the two above extremes of generality. Indeed, hypotheses can be ordered by their generality, from the most general to the most specific instances. For example, consider the following two possible hypotheses for T:

h1 = <x, ?, y>
h2 = <?, ?, y>

Considering the two sets of instances that are classified positive by these hypotheses, we can say that any instance classified positive by h1 will also be classified positive by h2, as h2 imposes fewer constraints. We say that h2 is more general than h1.

Formally, for two hypotheses hj and hk, we define hj to be more general than or equal to hk (written hj ≥g hk) if and only if

(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]

Further, we define hj to be (strictly) more general than hk (written hj >g hk) if and only if

(hj ≥g hk) ∧ ¬(hk ≥g hj)
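For the attribute-vector hypotheses used here, the more-general-than-or-equal-to relation can be checked position by position. The following Python sketch is an illustration only, not part of the report, and it handles the usual case in which the empty constraint Ø does not appear (Find-S discards Ø after the first positive example).

    def more_general_or_equal(hj, hk):
        # hj >=g hk if, at every position, hj's constraint is '?' (anything
        # allowed) or matches hk's constraint exactly.
        return all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))

    h1 = ("x", "?", "y")
    h2 = ("?", "?", "y")
    print(more_general_or_equal(h2, h1))   # True:  h2 is more general than h1
    print(more_general_or_equal(h1, h2))   # False: h1 is strictly less general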

3.3. The Find-S algorithm

The Find-S technique orders hypotheses according to their generality, as explained in the previous section. The algorithm starts with the most specific hypothesis h possible within H. For each positive example it encounters in the training set, it generalises h (if needed) so that h correctly classifies the encountered example as positive. After considering all positive training examples, the resultant h is output. This is the most specific hypothesis in H consistent with the examined positive examples.

The algorithm can be more formally defined as follows [4]:

1. Initialise h to the most specific hypothesis in H.
2. For each positive training instance x:
      For each attribute constraint v(ai) in h:
         If v(ai) is satisfied by x, then do nothing;
         else replace v(ai) in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.

The procedure is run with a different starting positive each time, until all positives have been analysed. There is a question over how to measure how specific a particular hypothesis is. This is dependent on the representation scheme; in first-order logic, for example, a more specific hypothesis will have more ground terms (fewer variables) in the logic sentence describing it than a less specific hypothesis.
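The following is a minimal Python sketch of the procedure above for attribute-vector hypotheses; it is an illustration only, not the PROLOG substructure implementation described in section 3.6, and the example attribute values are invented.

    EMPTY = None   # stands for the 'no value acceptable' constraint

    def generalise(h, example):
        # Least general generalisation of h that also covers the example:
        # an empty constraint takes the observed value, and a conflicting
        # specific value is relaxed to the '?' wildcard.
        new_h = []
        for constraint, value in zip(h, example):
            if constraint == EMPTY:
                new_h.append(value)
            elif constraint in (value, "?"):
                new_h.append(constraint)
            else:
                new_h.append("?")
        return tuple(new_h)

    def find_s(positives, n_attributes):
        # Start from the most specific hypothesis and generalise over the
        # positive examples only, as in steps 1-3 above.
        h = tuple([EMPTY] * n_attributes)
        for example in positives:
            h = generalise(h, example)
        return h

    positives = [("a", "b", "c"), ("a", "d", "c")]
    print(find_s(positives, 3))   # ('a', '?', 'c')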

3.3.1. A simple example

An example to illustrate how the algorithm could be used in predictive toxicology is presented below. It has been adapted from [9], and it is fabricated in that the derived structure is not a real indicator of toxicity; the example simply illustrates the algorithm's process.

Training Data

Consider a training set of seven drugs, four of which are known positives and the remaining three known negatives. Diagrams of these molecules are given in Figure 1, with molecules P1, P2, P3 and P4 representing positive examples, and N1, N2 and N3 representing negative ones. The atom labels are generic placeholder symbols used in place of possible real elements (e.g. N, C, H) to enforce the notion that the example is purely fabricated.

[Figure 1: Training set for the Find-S example, showing molecule diagrams for positives P1-P4 and negatives N1-N3.]


At this stage, the chemist (user) must suggest a possible template on which to base the search for toxicity-inducing substructures. It is thought that a substructure of the form

ATOM-ATOM-ATOM

(with '-' representing a bond) contributes to toxicity. It is now the task of the algorithm to find sub-molecules matching this structure which exist in as many positives as possible and in as few negatives as possible.

The Algorithm Procedure

To solve the problem, we use the Find-S method with the aim of producing solutions of the form

<A, B, C>


where A, B and C are taken from the set of placeholder chemical symbols present in the molecules. However, we also need to look for more general solutions in which the atom at a particular position is not fixed; we therefore add the wildcard symbol '?' to this set of allowed values.

We start with the most specific hypotheses possible. Any final concept learned will have to be true of at least one positive example, and we use this to produce our first set of triples: the two fully specific substructures that exist in P1 and match the template specified.

We now check whether each of these substructures is also true in the next molecule (P2). If it is not, then we generalise the substructure so that it becomes true in P2. This generalisation is done by introducing as few '?' wildcards as possible; in doing this we find the least general generalisations, which guarantees that our final answers are as specific as possible. The expanded set of substructures is then tested on P3 and, following the same procedure, on P4.


A trace of the intermediate results shows the set of hypotheses growing as each molecule is analysed: P1 contributes the two fully specific substructures, and generalising against P2 and then P3 introduces hypotheses containing one or two '?' wildcards, giving nine distinct hypotheses in total. Note that no new substructures are produced on analysis of P4: all the substructures produced after analysis of P3 already match components of P4 exactly, without the need for further generalisation.

Evaluation of Results

The algorithm has now returned nine possible hypotheses for substructures that determine toxicity. These can now be scored, based on:

- how many positive molecules contain the substructure derived, and
- how many negative molecules do not contain the substructure derived.

The scores are given below:

Accuracy is the fraction of the seven training molecules (positives P1-P4 and negatives N1-N3) correctly classified by each hypothesis:

Hypothesis 1: 43%
Hypothesis 2: 57%
Hypothesis 3: 57%
Hypothesis 4: 86%
Hypothesis 5: 57%
Hypothesis 6: 57%
Hypothesis 7: 43%
Hypothesis 8: 57%
Hypothesis 9: 57%

It can be seen that the most accurate hypothesis derived is number four, which fixes the first and third atom types and leaves the middle position as a '?' wildcard. This is statistically the most frequent substructure of the template form ATOM-ATOM-ATOM that occurs in the positives but not in the negatives. This structure can then be used to predict the toxicity of unseen compounds; other compounds containing a match for hypothesis four are statistically likely to be toxic.

For a complete implementation of the algorithm, the procedure should be repeated, but this time with P2 as the initial positive, generalising on the others. The same should then be applied with P3 and P4 as initial positives.
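A sketch of this scoring step is given below (illustrative only; the matches helper and the representation of molecules as sets of pre-extracted candidate triples are assumptions made for the example rather than part of the report's implementation).

    def matches(molecule_triples, hypothesis):
        # True if any candidate triple in the molecule fits the hypothesis,
        # where '?' in the hypothesis matches any atom.
        return any(
            all(c == "?" or c == atom for c, atom in zip(hypothesis, triple))
            for triple in molecule_triples
        )

    def accuracy(hypothesis, positives, negatives):
        # Fraction of training molecules correctly classified: positives
        # containing the substructure plus negatives not containing it.
        correct = sum(matches(p, hypothesis) for p in positives)
        correct += sum(not matches(n, hypothesis) for n in negatives)
        return correct / (len(positives) + len(negatives))

    # e.g. six of seven molecules correct gives 6/7, roughly the 86% scored
    # by hypothesis four above.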

3.4. Algorithm evaluation methods

On obtaining a 'result' from the Find-S algorithm, i.e. a hypothesis (or set of hypotheses) representing a sub-molecule thought most likely to contribute to toxicity, it is desirable to have some certainty that the result obtained is indeed accurate. We want the promising results obtained with the training set to extend to unseen examples. There is no way to guarantee the accuracy of a hypothesis; however, there are accepted methods and measures through which a user can become more confident in the results obtained.

In our example above, the 'best' hypothesis had a (predicted) accuracy of 86%, calculated by considering the number of correctly classified positives and negatives over the total number of compounds analysed (six of the seven training molecules, i.e. 6/7 ≈ 86%). However, this figure is based purely on the examples that the hypothesis has already seen; it is not a strong indicator of accuracy on unseen examples.

3.4.1. Cross validation

One possible way of addressing this situation is to reserve some examples from the training set, and then subsequently use these reserved examples as tests of the derived hypothesis. The results of the hypothesis applied to the reserved examples can then be compared to their actual categorisation, which is known since they were provided as part of the training set. This cross validation is a standard machine learning technique, and the splitting of the initial example data into a training set and a test set can give the user more confidence that the derived hypothesis will be accurate and of use. Clearly, it can also have the opposite effect, with the user finding out that the derived hypothesis in fact performs poorly on genuinely unseen examples.

3.4.2. K-fold cross validation

It is often of importance and interest that the performance of the learning algorithm itself is measured, and not just that of a specific hypothesis. A technique to achieve this is k-fold cross validation [4]. This involves partitioning the data into k disjoint subsets of equal size. There are then k training and testing rounds, with each subset successively acting as the test set and the other k-1 subsets as the training set. The average accuracy can then be calculated from the k independent test runs. This technique is typically used when the number of data objects is in the region of a few hundred and the size of each subset is at least thirty; this ensures that the tests provide reasonable results, as having too few test examples would result in skewed accuracy figures.

As each round is performed independently, there is no guarantee that the hypothesis generated on one training round will be the same as the hypothesis generated on another. It is for this reason that the overall accuracy figures generated are representative of the algorithm as a whole, not just one particular result.
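A minimal sketch of k-fold cross validation as described above is shown here; learn and accuracy_on are placeholders standing in for the learning algorithm (for example Find-S) and an evaluation function, and are assumptions made for the illustration.

    def k_fold_accuracy(examples, k, learn, accuracy_on):
        fold_size = len(examples) // k
        scores = []
        for i in range(k):
            # Each fold acts once as the test set ...
            test = examples[i * fold_size:(i + 1) * fold_size]
            # ... with the remaining k-1 folds used for training.
            train = examples[:i * fold_size] + examples[(i + 1) * fold_size:]
            hypothesis = learn(train)
            scores.append(accuracy_on(hypothesis, test))
        # Average accuracy over the k independent rounds.
        return sum(scores) / k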

3.5. Issues with the Find-S technique

As with all machine learning techniques, Find-S has some factors that encourage its use, and others that make it less favourable. Some of these considerations are discussed here.

3.5.1. Guarantee of finding most specific hypothesis

As the name of the algorithm suggests, the process is guaranteed to find the most specific hypothesis within the hypothesis space that is consistent with the positive training examples. This is because of the decision to select the least general generalisation when analysing compounds.

This property can be viewed as both advantageous and disadvantageous. It is sometimes useful for users to know as much information about the substructure as possible, as this may enable them to better understand the chemical reason for the molecule's toxicity. However, where an example derives multiple hypotheses consistent with the training data, the algorithm would still return the most specific, even though the others have the same statistical accuracy.

Further, it is possible that the process derives several maximally specific consistent hypotheses [4]. To account for this possible case, we need to extend the algorithm to allow backtracking at the choice points for generalisation. This would find target concepts along a different branch to that first explored.

3.5.2. Overfitting

Overfitting is often thought of as the problem of an algorithm memorising answers rather than deducing concepts and rules from them, and it is inherent in many machine learning techniques. A particular hypothesis is said to overfit the training examples when some other hypothesis that fits the training examples less well actually performs better over the whole set of instances (i.e. including non-training instances).

Overfitting can occur when the number of training examples used is too small and does not provide an illustrative sample of the true target function. It can also occur when there are errors in the example data, known as noise. Noise has a particularly detrimental effect on the Find-S algorithm, as explained below.

3.5.3. Noisy data

Any non-trivial set of data taken from the real world is subject to a degree of error in its representation. Mistakes can be made in analysing the data and categorising examples, in translating information from one form to another, and in keeping repeated data consistent with itself. In machine learning terms, such errors in the data are termed noise.


While certain algorithms are fairly robust to noise in data, the Find-S technique is inherently not so. This is because the algorithm effectively ignores all negative examples in the training set. Generalisations are made to include as many positive examples as possible, but no attempt is made to exclude negatives. This in itself is not a problem; if the data contains no errors, then the current hypothesis can never require a revision in response to a negative example [4]. However, the introduction of noise into the data changes this situation: it may no longer be the case that the negative examples can simply be ignored. Find-S makes no effort to accommodate these possible inconsistencies in the data.

3.5.4. Parallelisability

The Find-S algorithm lends itself well to a parallel, distributed implementation, which would speed up computation time. A parallel implementation could involve individual processors being allocated different initial positives; recall from above that the algorithm is only complete when hypotheses have been derived using each possible starting positive. The derivation of any particular hypothesis from an initial positive can be run independently, and hence in parallel with other derivations.
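One way the idea might look in practice is sketched below (illustrative only; run_find_s_from is a hypothetical routine that runs the Find-S procedure with the given molecule as the initial positive). Each starting positive is handed to a separate worker process and the derived hypotheses are collected at the end.

    from functools import partial
    from multiprocessing import Pool

    def derive_all(positives, negatives, run_find_s_from):
        # One independent task per possible starting positive.
        worker = partial(run_find_s_from, positives=positives, negatives=negatives)
        with Pool() as pool:
            return pool.map(worker, positives)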

3.6. Existing Prolog implementation

S. Colton has implemented an initial version of the Find-S algorithm in PROLOG. This relatively compact program (approximately 300 lines of code) identifies substructures from a sample data set as used by King et al. [2]. The program is guided by substructure templates, of which a few have been hard-coded. It has recreated some of the results produced by the ILP method and PROGOL on the sample data set considered. The program can take parameters to specify the minimum number of ground terms that must appear in a resultant hypothesis (i.e. to limit variables), the minimum number of positive molecules for which a hypothesis should return TRUE, and the maximum number of negative molecules for which it may return TRUE.
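The fragment below is an illustrative Python restatement of these filtering parameters as interpreted here; it is not the PROLOG program itself, and the parameter names and the '?' wildcard convention are assumptions carried over from the earlier sketches.

    def ground_terms(hypothesis):
        # Count the fixed (non-wildcard) positions in a hypothesis.
        return sum(1 for constraint in hypothesis if constraint != "?")

    def acceptable(hypothesis, positives_covered, negatives_covered,
                   min_ground_terms, min_positives, max_negatives):
        # Keep a hypothesis only if it is specific enough and its coverage
        # of the positive and negative molecules meets the thresholds.
        return (ground_terms(hypothesis) >= min_ground_terms
                and positives_covered >= min_positives
                and negatives_covered <= max_negatives)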

An important point for discussion here is the representation of the background and structural data. Information representing the molecules is stored as a series of facts in a PROLOG database. The representation is identical to that suggested in the section on inductive logic programming, and involves storing information about atoms and their inter-bonding. The data stored for even a single molecule is extensive; however, these PROLOG facts can be generated automatically, as mentioned in section 4.1.


4. Implementation Considerations

The Find-S algorithm has been discussed at length as it represents the core component of a system to identify substructures. However, the initial remit was to create a substructure server, whereby users would be able to identify potentially interesting substructures from their positive and negative examples. As such, other considerations need to be examined, and these are summarised here.

4.1. Representing structures

There exists a conflict between the natural user representation of chemical structures and the representations that are useful to the implemented algorithm. In a sense, the user's view of structures must be parsed into the computer view (first-order logic) at some stage, either manually by the user, or by the implemented software as pre-processing for the Find-S algorithm. It is clearly more desirable from the user's position that this conversion is done in an automated fashion. The feasibility of this is briefly discussed here.

Chemists are often concerned with modelling compounds, and the industry-standard modelling software is QUANTA [10]. King et al. in [2] used QUANTA editing tools to automatically map a visual representation of a molecule into first-order logic. After some suitable pre-processing, this mapped representation could be read by their PROGOL program as a series of facts.

Another molecular simulation program, CHARMM [11], stores information about the molecule being simulated as data files. These data files use standard naming and referencing techniques, as described by the Protein Data Bank [12]. The structure of these flat text files is conducive to translation to other formats, on development of a suitable schema.
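As a rough sketch of such an automated conversion (using an invented flat text format with one 'ATOM id element' or 'BOND id id type' record per line, rather than the real CHARMM or Protein Data Bank layouts), the atoms and bonds could be read into the relational form used earlier:

    def parse_structure(text):
        molecule = {"atoms": [], "bonds": []}
        for line in text.splitlines():
            fields = line.split()
            if not fields:
                continue
            if fields[0] == "ATOM":
                # e.g. "ATOM 1 c" -> atom 1 carries a placeholder element label
                molecule["atoms"].append((int(fields[1]), fields[2]))
            elif fields[0] == "BOND":
                # e.g. "BOND 1 2 single"
                molecule["bonds"].append((int(fields[1]), int(fields[2]), fields[3]))
        return molecule

    example = "ATOM 1 c\nATOM 2 n\nBOND 1 2 single"
    print(parse_structure(example))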

4.2. Improvement of current implementation

S. Colton's current implementation of the Find-S algorithm can serve as a basis for further work. The algorithm could be recoded in a modern object-oriented language, which would facilitate parallelising the algorithm and packaging it as a web-based application.

One key improvement that could be made is the introduction of new search templates. These templates guide the algorithm, restricting its search to sub-molecules matching the specified template. Currently only a small number of templates are implemented; it is desirable that more be made available to the user.

4.3. Extensions

As advanced work in this area, further extensions to those suggested above are possible. Implementing the algorithm in parallel is one such possible extension; this would speed up the potentially highly complex and time-consuming derivation of hypotheses.

There is also scope for the generated hypotheses to be represented in different formats. While an answer returned in first-order logic may be strictly accurate, it is unlikely to be of much use to a user with little or no knowledge of computational logic techniques. Molecular visualisation software such as RASMOL and the later PROTEIN EXPLORER [13] exists that can take as input data in a format similar to that produced by QUANTA or CHARMM. It would be desirable for a user to view the resultant hypotheses, with the sub-molecule derived by the algorithm presented visually.


5. References

[1] Ellis, L., Aetna InteliHealth Drug Resource Centre, From Laboratory To Pharmacy: How Drugs Are Developed, 2002. http://www.intelihealth.com/IH/ihtIH/WSIHW000/8124/31116/346361.html?d=dmtContent

[2] King, R. D., Muggleton, S. H., Srinivasan, A. & Sternberg, M. J. E., Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming (1995), Proceedings of the National Academy of Sciences (USA) 93, 438-442

[3] Hansch, C., Maloney, P. P., Fujita, T. & Muir, R. M., Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients (1962), Nature (London) 194, 178-180

[4] Mitchell, T. M., Machine Learning, International Edition (1997), McGraw-Hill

[5] Glen, B., Molecular Modelling and Molecular Informatics, University of Cambridge, Centre for Molecular Informatics. www-ucc.ch.cam.ac.uk/colloquia/rcg-lectures/A4

[6] Muggleton, S., Inductive Logic Programming (1991), New Generation Computing 8, 295-318

[7] Srinivasan, A., Muggleton, S. H., Sternberg, M. J. E. & King, R. D., Theories for mutagenicity: a study in first-order and feature-based induction (1996), Artificial Intelligence 85(1-2), 277-299

[8] Muggleton, S., Inverse Entailment and Progol (1995), New Generation Computing 13, 245-286

[9] Colton, S. G., Lecture 11 - Overview of Machine Learning, Imperial College London, 2003. http://www2.doc.ic.ac.uk/~sgc/teaching/341.html

[10] QUANTA software, Accelrys Inc. http://www.accelrys.com/quanta/

[11] Chemistry HARvard Molecular Mechanics (CHARMM). http://www.ch.embnet.org/MD_tutorial/pages/CHARMM.Part1.html

[12] The Protein Data Bank. http://www.rcsb.org/pdb


[13] RasMol Home Page. http://www.umass.edu/microbio/rasmol/
