Mahmut Uludağ
Supervisor: Prof. Dr. Mehmet R. Tolun
Ph.D. Jury Presentation
Eastern Mediterranean University, Computer Engineering Department
February 25, 2005
Supervised Rule Induction for Relational Data
Outline
- Introduction
- ILA and ILA-2 algorithms
- Overview of the RILA system
- Query generation
- Optimistic estimate pruning
- Rule selection strategies
- Experiments and results
- Conclusion
Motivation for relational data mining
- Traditional work in statistics and knowledge discovery assumes data instances form a single table
- It is not always practical to represent complex objects in one single table
- RDBMSs are widely used:
  - Efficient management of data
  - Indexing and query services, transaction and security support
  - Can store complex data
  - Data mining without transferring data to a new location
[Schema diagram: Author, Paper, Cites tables]
Previous work – ILP based algorithms
- Prolog is the main language used to represent objects and the relations between them
- Incremental learning, incorporation of background knowledge
- Initial research: deterministic rules; recent research: statistical learning
- Main obstacle to widespread acceptance: dependency on a Prolog server
- DMax, a modern ILP-based data mining system: client-server architecture with a Java client and an ILProlog server (source: www.pharmadm.com)
Example output rule:
Previous work – relational data mining framework
(Knobbe et al, 1999)
Client-server architecture
Selection graphs
Algorithm to translate selection graphs into SQL
MRDTL and MRDTL-2 algorithms, Iowa State University
M.Sc. study at METU, Serkan Toprak, 2004
[Selection graph example: parent, child, toy; age > 30]
Previous work – graph mining
- Typical inputs are labelled graphs
- Graphs are efficient tools for describing objects and the way they are connected
- Subgraph isomorphism
- Scalability problems
- Avoid loading the complete graph data into main memory; partitioning
Nearly equivalent formalisms:
Graphs ≈ Database tables ≈ Prolog statements
ILA
- Levelwise search: constructs hypotheses in order of increasing number of conditions (first hypotheses with one condition, then hypotheses with two conditions, and so on)
- Finds the smallest completely accurate rule set that represents the training data
ILA-2
- Noise-tolerant evaluation function:
  score(hypothesis) = tp - pf * fn
  where tp is the number of true positive examples, fn is the number of false negative examples, and pf (penalty factor) is a user-defined minimum for the proportion of tp to fn
- Not sensitive to the distribution of false values
- Selecting multiple rules after one learning loop can produce redundant rules
- Implemented by modifying the source code of the C4.5 algorithm; some features are inherited from C4.5
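The ILA-2 evaluation function is simple enough to state directly; a minimal sketch in Python (the function name is ours, not from the thesis):

```python
def ila2_score(tp, fn, pf):
    """ILA-2 noise-tolerant evaluation: reward true positives,
    penalize false negatives by the user-defined penalty factor pf."""
    return tp - pf * fn

# A hypothesis covering 10 true positives and 2 false negatives,
# with pf = 2, scores 10 - 2 * 2 = 6.
```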
RILA
What is new when compared to ILA and ILA-2:
- Architecture
- Performance
- Internal representation of rules
- Construction of hypotheses
RILA – what is new
- 'Select late' rule selection strategy, as an alternative to the 'select early' strategy
- An efficient implementation: hypotheses can be refined by adding new conditions, so they do not need to be generated from scratch in each learning loop
- Optimistic estimate pruning (beam search)
- Normalized hypothesis evaluation function
Architecture of the system
[Architecture diagram: the discovery system exchanges SQL and metadata queries with the DBMS through a JDBC driver, and receives result sets and metadata back; its outputs are hypotheses and rules]
Discovery system tasks:
- Traversing the relational schema
- Hypothesis construction
- Conversions to SQL
- Rule selection
- Pruning
How are tables visited?
[Schema diagram:
Interaction (junction table): Geneid1, Geneid2, Type, Expression corr
Gene (target table): Geneid, Essential, Chromosome, Localization
Composition: Geneid, Phenotype, Class, Motif, Function, Complex]
First level: stops in the junction table? Extension levels: extend complex hypotheses only by using attributes from the other side of the junction relation.
Example hypotheses that can be generated:
- If a gene has a relation r then its class is c
- If a gene has a property p and relation r then its class is c
- If a gene has a relation r to a gene having property p then its class is c
Internal representation of an example rule
IF gene1.Composition.Class = ‘Nucleases’ AND
   Interaction.Type = ‘Genetic’ AND
   gene2.Composition.Complex = ‘Intracellular transport’
THEN gene1.Localization = extracellular…
[Graph representation with Gene, Interaction and Composition nodes]
Conditions:
- Composition.Class = ‘Nucleases’
- Interaction.Type = ‘Genetic’
- Composition.Complex = ‘Intracellular transport’
- Target: Gene.Localization = ‘extracellular…’
Joins:
- composition1.id = gene1.id
- interaction.id1 = gene1.id
- composition2.id = gene2.id
- interaction.id2 = gene2.id
Query generation
- SQL template for building size-one hypotheses
- Numeric attributes
- Refinement of hypotheses
- How is a hypothesis represented in SQL?
- How is a hypothesis extended by a condition from the other side of a many-to-many relation?
SQL template for building size one hypotheses
Select attr, count(distinct targetTable.pk)
from covered, path.getTable_list()
where path.getJoins() and
targetTable.classAttr = currentClass and
covered.id = targetTable.pk and
covered.mark=0
group by attr
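As a sketch of how the template above might be instantiated: the parameters table_list, joins and pk stand in for path.getTable_list(), path.getJoins() and the target table's primary key column, and the function name is illustrative, not part of RILA's API.

```python
def size_one_sql(attr, target_table, pk, class_attr, current_class,
                 table_list, joins):
    """Fill the size-one-hypothesis SQL template with concrete
    table, attribute and join names."""
    where = joins + [
        f"{target_table}.{class_attr} = '{current_class}'",
        f"covered.id = {target_table}.{pk}",
        "covered.mark = 0",
    ]
    return (f"select {attr}, count(distinct {target_table}.{pk})"
            f" from covered, {', '.join(table_list)}"
            f" where {' and '.join(where)}"
            f" group by {attr}")
```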
Numeric attributes
Discretization results are stored in a temporary table with columns: table_name, attribute_name, interval_name, min_value, max_value
SQL fragment generated for a discretized condition:
disc.table_name = ‘table’ and
disc.attribute_name = ‘attr’ and
attr > disc.min_val and
attr < disc.max_val
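The slides do not say which discretization method fills the temporary table; as one plausible sketch, equal-width binning that emits rows in the layout described above (column order: table_name, attribute_name, interval_name, min_value, max_value):

```python
def equal_width_intervals(table, attr, values, k):
    """Split [min, max] of the observed values into k equal-width
    intervals and return one row per interval, shaped like the
    temporary discretization table."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    rows = []
    for i in range(k):
        a = lo + i * width
        b = hi if i == k - 1 else a + width  # last bin closes at max
        rows.append((table, attr, f"({a:.3f}-{b:.3f}]", a, b))
    return rows
```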
Refinement of hypotheses
Select attr, count(distinct targetTable.pk)
from covered, table_list,
hypothesis.table_list()
where targetAttr = currentClass and
join_list and
hypothesis.join_list() and
covered.id = targetTable.pk and
covered.mark=0
group by attr;
How a hypothesis is extended by a condition from the other side of a many-to-many relation?
Select GENE_B.CHROMOSOME, count(distinct GENE.GENEID)
from COMPOSITION, GENE, GENE GENE_B, INTERACTION
where INTERACTION.GENEID2 = GENE_B.GENEID and
INTERACTION.GENEID1 = GENE.GENEID and
INTERACTION.EXPR > 0.026 and
INTERACTION.EXPR < 0.513 and
COMPOSITION.PHENOTYPE = 'Auxotrophies' and
COMPOSITION.GENEID = GENE.GENEID and
GENE.LOCALIZATION = 'ER'
group by GENE_B.CHROMOSOME
Optimistic estimate pruning
- Avoid working on hypotheses that are unlikely to result in satisfactory rules
- F-measure criterion to assess hypotheses: 2 * recall * precision / (recall + precision)
- Two types of pruning:
  - Extend only the n best hypotheses (beam search)
  - A minimum required F value for a hypothesis to take a place in the hypothesis pool (similar to minimum support pruning)
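Both pruning steps can be sketched in a few lines; the hypothesis tuples and function names below are our own illustration, not RILA's internal API:

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of recall and precision, as on the slide."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * recall * precision / (recall + precision)

def prune(hypotheses, n, f_min):
    """Drop hypotheses below the minimum F value, then keep only
    the n best for extension (beam search).
    Each hypothesis is (name, tp, fp, fn)."""
    scored = [(f_measure(tp, fp, fn), name) for name, tp, fp, fn in hypotheses]
    kept = [(s, name) for s, name in scored if s >= f_min]
    kept.sort(key=lambda x: -x[0])
    return [name for _, name in kept[:n]]
```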
Rule selection strategies
- Select early strategy
- Why do we need another strategy?
- Select late strategy
Learning algorithm when using the select early strategy:
1. Start with size = 1
2. If size is 1, build initial hypotheses; otherwise extend the current hypotheses
3. Select p rule(s)
4. If any rules were selected: mark the covered objects; if all examples are covered, end; otherwise reset size to 1 and go to step 2
5. If no rules were selected: if size is smaller than m, increment size and go to step 2; otherwise end
Example training data to demonstrate the case where the select late strategy performs better than the select early strategy
Attribute A Attribute B Attribute C Class
a1 b1 c1 A
a1 b1 c2 A
a2 b2 c3 A
a3 b2 c3 A
a4 b1 c3 B
a5 b1 c3 B
a1 b2 c4 B
a1 b2 c5 B
Learning algorithm when using the select late strategy:
1. Start: build the initial hypothesis set
2. While size < max size, increment size and extend the hypothesis set
3. Select rules, then end
Rule selection algorithm when using the select late strategy:
1. Start: select the hypothesis with the highest score
2. If the score is not positive, end
3. Mark the examples covered by this hypothesis; if no positive examples are covered, return
4. Recalculate the score using the effective cover
5. If the new score is higher than the score of the next hypothesis, or the score of this hypothesis was previously reduced more than l times, assert the hypothesis as a new rule; otherwise undo the markings and set the score to the newly calculated value
6. If all examples are covered, end; otherwise repeat from step 1
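A condensed sketch of the select-late rule selection loop. As a simplification, the recalculated score here is just the size of the effective cover, where the thesis recalculates with its evaluation function; the dict layout and names are illustrative.

```python
def select_late_rules(hypotheses, l_limit):
    """Each hypothesis is a dict with 'score', the set of positive
    example ids in 'pos', and a 'reduced' counter."""
    covered, rules = set(), []
    pool = sorted(hypotheses, key=lambda h: -h["score"])
    while pool:
        h = pool[0]
        if h["score"] <= 0:
            break
        effective = h["pos"] - covered          # effective cover
        if not effective:                       # covers no new positives
            pool.pop(0)
            continue
        new_score = len(effective)              # simplified recalculation
        next_score = pool[1]["score"] if len(pool) > 1 else 0
        if new_score > next_score or h["reduced"] > l_limit:
            rules.append(h)                     # assert as a new rule
            covered |= effective                # keep the markings
            pool.pop(0)
        else:
            h["reduced"] += 1                   # undo markings, demote
            h["score"] = new_score
            pool.sort(key=lambda x: -x["score"])
    return rules
```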
Experiments
- Summary of the parameters
- The genes data set
- The mutagenesis data set
Summary of the parameters
- Applicable to both the select late and the select early strategies:
  - pf: a user-defined minimum for the proportion of true positives to false negatives
  - m: the maximum size for hypotheses
- Applicable only to the select early strategy:
  - p: the maximum number of hypotheses that can be selected as new rules after a search iteration
- Applicable only to the select late strategy:
  - l: the limit on rule selection recursion
- Optimistic estimate pruning parameters:
  - f: the minimum acceptable F-measure value
  - n: the maximum number of hypotheses that can be extended in each level during the candidate rule generation phase of the mining process
The ‘genes’ dataset of KDD Cup 2001
[Schema:
GENE (862 rows): GENEID, Essential, Chromosome, Localization
COMPOSITION (4346 rows): GENEID, Class, Complex, Phenotype, Motif, Function
INTERACTION (910 rows): GENEID1, GENEID2, Type, Expression
INTERACTION is a junction table; many-to-many relation between genes]
Test results for the localization attribute using the select early strategy, pf=2, m=3

                           f=0.0    f=0.001  f=0.001  f=0.01  f=0.01
                           n=10000  n=500    n=500    n=1     n=1
                           p=1      p=1      p=5      p=1     p=5
training time (seconds)    117      61       67       67      37
number of rules            90       90       161      78      122
number of conditions       133      133      233      103     144
training set coverage (%)  60.56    60.56    70.19    59.16   65.43
training set accuracy (%)  95       95       96       95      95
test set accuracy (%)      83.19    83.19    83.61    84.37   85.78
test set coverage (%)      59.32    59.32    62.47    58.79   60.89
Test results for the localization attribute using the select late strategy, pf=2, m=3, l=0, f=0.01

n                          1      2      3      4      5
training time (seconds)    37     53     69     72     73
number of rules            126    140    147    150    157
number of conditions       152    188    206    213    229
training set coverage (%)  67.17  70.30  71.46  72.16  72.97
training set accuracy (%)  94     94     93     93     93
test set accuracy (%)      82.85  81.48  81.63  81.53  81.12
test set coverage (%)      62.73  63.78  64.30  65.35  65.35
Test results for the localization attribute using the select late strategy, pf=2, m=3, l=100, f=0.01

n                          1      2      3      4      5
training time (seconds)    65     69     69     89     102
number of rules            126    139    144    147    153
number of conditions       154    188    200    205    216
training set coverage (%)  65.55  68.68  69.37  69.84  70.65
training set accuracy (%)  96     96     96     96     96
test set accuracy (%)      84.96  83.91  83.91  83.62  83.27
test set coverage (%)      59.32  60.37  60.37  60.89  61.15
Why did we not have better results on the genes data set?
- Cup winner's accuracy: 72.1%
- MRDTL: 76.1% accuracy
- Serkan: 59.5% accuracy
- RILA best accuracy: 85.8% with 60.9% coverage
- RILA best coverage: 65.3% with 81.5% accuracy
- Missing values? No
- Default class selection? No
- Performance deteriorates when the number of class values is high
- The distribution of false values among classes is not taken into account
- A problem arises when the numbers of examples in different classes are not evenly distributed
Attribute1 Attribute2 Class
1 5 pink
1 5 pink
1 5 pink
1 5 yellow
2 5 yellow
2 6 yellow
1 6 blue
2 7 blue
3 7 blue
Schema of the mutagenesis database
[Schema:
ATOM: ATOM_ID, Molecule_id, Element, Type, Charge
BOND: ATOM_ID1, ATOM_ID2, Type
MOLECULE: Molecule_id, Log_mut, Logp, Lumo, Ind1, Inda, Label]
Cross validation test results using the select early strategy on the mutagenesis data for different p* values

p                  1      2      3      4      5      6      7
time (in seconds)  305    260    225    215    215    215    210
# rules            103    103    103    104    104    104    104
# conditions       120    120    120    122    121    121    121
accuracy (%)       98.82  97.06  97.06  97.06  97.06  97.06  97.06
coverage (%)       89.89  90.43  90.43  90.43  90.43  90.43  90.43

*maximum number of rules selected each time the rule selection step is executed
Cross validation test results using the select early strategy and OEP on the mutagenesis data for different n values

n               1      2      3      10     15     20     30     40
time (seconds)  134    144    159    221    325    397    311    311
# rules         118    120    120    142    123    119    104    103
# conditions    165    171    171    210    166    161    123    120
accuracy (%)    98.26  98.26  98.26  98.83  98.25  98.25  98.82  98.82
coverage (%)    91.49  91.49  91.49  90.95  90.96  90.96  89.89  89.89
Cross validation test results using the select late strategy on the mutagenesis data, p=1, f=0.01, l=0

n               1      2      3      10
time (seconds)  76     123    169    486
# rules         122    137    145    151
# conditions    165    204    224    248
accuracy (%)    96.49  94.77  93.68  93.68
coverage (%)    90.96  91.49  92.55  92.55
Cross validation test results using the select late strategy on the mutagenesis data, p=1, f=0.01, l=10

n               1      2      3      10
time (seconds)  84     135    183    490
# rules         105    107    109    112
# conditions    131    137    145    158
accuracy (%)    98.26  98.26  98.24  98.25
coverage (%)    91.49  91.49  90.43  90.96
Cross validation test results using the select late strategy on the mutagenesis data, p=1, f=0.01, l=100

n               1      2      3      10
time (seconds)  80     130    178    483
# rules         103    104    106    109
# conditions    127    131    137    153
accuracy (%)    98.26  98.26  98.26  98.26
coverage (%)    91.49  91.49  91.49  91.49
Comparison to other results on the mutagenesis data
- The best results by RILA (Table 2 and Table 5): accuracy 98.26%, coverage 91.49%
- The best results reported in (Atramentov et al. 2003): accuracy 87.5%
- The best results reported by the originators of the data set (King et al. 1996): accuracy 89.4% (number of correct predictions divided by the number of predictions)
Conclusion
- A new relational rule learning algorithm has been developed, with two different rule selection strategies
- Several techniques were used to achieve reasonable performance: refinement of hypotheses, pruning
- The results on the mutagenesis data are better than the other results cited in the literature
- Compared to traditional algorithms, there is no need to move relational data to another location: scalability, practicality
- The techniques employed can be used to develop relational versions of other traditional learning algorithms
Thanks!
FOIL, a set-covering approach [Cameron-Jones and Quinlan 1994]
- Begins with the most general theory
- Repeatedly adds a clause to the theory that covers some of the positive examples and few negative examples
- Covered examples are removed
- Continues until the theory covers all positive examples
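The set-covering loop can be sketched as follows; the clause representation and the simple coverage-minus-negatives heuristic are illustrative simplifications, not FOIL's actual information-gain measure:

```python
def sequential_covering(positives, candidate_clauses):
    """FOIL-style covering: repeatedly add the clause that covers
    the most remaining positives while penalizing covered negatives,
    remove the covered positives, and stop when all are covered or
    no clause makes progress. Each clause is (name, pos_set, neg_set)."""
    theory, remaining = [], set(positives)
    while remaining:
        best = max(candidate_clauses,
                   key=lambda c: len(c[1] & remaining) - len(c[2]))
        if not best[1] & remaining:
            break                     # no clause covers anything new
        theory.append(best[0])
        remaining -= best[1]
    return theory
```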
Previous work – unsupervised algorithms
- WARMR [Dehaspe et al., 1998] finds relational association rules (query extensions)
  - Input: a Prolog database
  - A specification in the WARMODE language limits the format of possible query extensions
- SUBDUE [Cook and Holder, 1994] discovers substructures in a graph
  - Output: the substructure selected at each iteration as the best to compress the graph
- PRM [Getoor et al., 2002] reinterprets Bayesian networks in a relational setting
  - Captures the probabilistic dependence between the attributes of interrelated objects
  - Link analysis
- Models generated by some unsupervised learning algorithms can be used for prediction tasks: WARMR and PRM, but not SUBDUE
Relational rule induction
- A schema graph represents the structure of the data: tables = nodes, foreign keys = edges
- Multiple tables can represent several objects and the relations between the objects
- Users should select the tables that represent the objects they are interested in
An example relational rule
[Schema diagram:
Gene: Geneid, Essential, Chromosome, Localization
Composition: Geneid, Phenotype, Class, Motif, Function, Complex]
IF Composition.Class = ‘ATPases’ AND
   Composition.Complex = ‘Intracellular transport’
THEN Gene.Localization = extracellular..
Many-to-many relations
- Junction tables: between different classes, or between objects of the same class
- Recursive queries are needed to extract the data
[Diagrams: a junction table between different classes; a junction table between objects of the same class]
Example rule having a many-to-many relation
[Schema diagram:
Interaction: Geneid1, Geneid2, Type, Expression corr
Gene: Geneid, Essential, Chromosome, Localization
Composition: Geneid, Phenotype, Class, Motif, Function, Complex]
IF gene1.Composition.Class = ‘Nucleases’ AND
   Interaction.Type = ‘Genetic’ AND
   gene2.Composition.Complex = ‘Intracellular transport’
THEN gene1.Localization = extracellular…
Performance
- Dynamic programming: refinement of hypotheses
- Pruning: minimum support pruning, optimistic estimate pruning
- Avoiding redundant hypotheses
- Smart data structures
Tabular representation of the links in the example rule
Conditions:
- composition.class = nucleases
- interaction.type = Genetic
- composition.complex = intracellular transport

Paths:
- composition1.id = gene1.id
- interaction.id1 = gene1.id
- composition2.id = gene2.id
- interaction.id2 = gene2.id
- gene1.id = interaction.geneid1

IF gene1.Composition.Class = ‘Nucleases’ AND
   Interaction.Type = ‘Genetic’ AND
   gene2.Composition.Complex = ‘Intracellular transport’
THEN gene1.Localization = extracellular…
Building size one hypotheses

Vector buildSizeOneHypotheses(String class, String tableName, Path path) {
    For each column in the selected table {
        If (column is not the target attribute and not a primary key and not a foreign key) {
            Check whether the table is the target table
            Check whether the column is numeric
            Select the proper SQL template and generate SQL(path)
            result set = execute generated SQL
            hypotheses += generated hypotheses using the result set
        }
    }
    For each table linked by a foreign key relation {
        If the linked table was not visited before (check the path)
            hypotheses += buildSizeOneHypotheses(class, linked table name, updated path)
    }
    return the hypotheses
}
Refinement of hypotheses

Vector extendHypotheses(String class, Vector currentHypotheses, String tableName, Path path) {
    For each column in the current table {
        If (column is not the target attribute and not a primary key and not a foreign key) {
            For each hypothesis in the current hypothesis set {
                If the hypothesis does not include the current feature {
                    Check whether the table is the target table
                    Check whether the column is numeric
                    Select the proper SQL template and generate SQL(path, hypothesis)
                    result set = execute generated SQL
                    extended hypotheses += generated hypotheses using the result set
                }
            }
        }
    }
    For each table linked by a foreign key relation {
        If the linked table was not visited before (check the path)
            extended hypotheses += extendHypotheses(class, currentHypotheses, linkedTableName, updated path)
    }
    return extended hypotheses;
}
Relational rule induction
- Basic components are the same as in propositional rule induction algorithms:
  - Hypothesis construction
  - Rule selection
- Interpretation of the relational schema:
  - Traversal of the schema
  - Detection and handling of cycles
- Communication with the RDBMS:
  - Expressing internal hypotheses in SQL
  - Understanding results returned from the RDBMS
Relational rule induction – target table
- The target table holds the primary key for the main objects being analyzed
[Schema examples: GENE (GENEID, Essential, Chromosome, Localization), COMPOSITION (GENEID, Class, Complex, Phenotype, Motif, Function), INTERACTION (GENEID1, GENEID2, Type, Expression); Author, Paper, Cites]
How is a hypothesis represented in SQL?

Rule:
INTERACTION.EXPR = (0.026658-0.513329] AND COMPOSITION.PHENOTYPE = Auxotrophies

SQL:
INTERACTION.GENEID1=GENE.GENEID and
INTERACTION.EXPR > 0.026 and
INTERACTION.EXPR <= 0.513 and
COMPOSITION.PHENOTYPE = 'Auxotrophies' and
COMPOSITION.GENEID=GENE.GENEID and
GENE.LOCALIZATION = 'ER'
Many-to-many relations
- Search algorithm
- Recognition of junction tables
- Extension of hypotheses having a feature from a junction table
- Automatic conversion of hypotheses to SQL using table aliases
ILA evaluation function
[Figure: example data with columns Attribute A, Attribute B and classes A, B, C, showing how the ILA evaluation function's performance decreases when the numbers of examples per class are unevenly distributed]