Mahmut Uludağ
Supervisor: Prof. Dr. Mehmet R. Tolun
Ph.D. Jury Presentation
Eastern Mediterranean University, Computer Engineering Department
February 25, 2005
Supervised Rule Induction for Relational Data
Outline
- Introduction
- ILA and ILA-2 algorithms
- Overview of the RILA system
- Query generation
- Optimistic estimate pruning
- Rule selection strategies
- Experiments and results
- Conclusion
Motivation for relational data mining
- Traditional work in statistics and knowledge discovery assumes data instances form a single table
- It is not always practical to represent complex objects in one single table
- RDBMSs are widely used:
  - Efficient management of data
  - Indexing and query services, transaction and security support
  - Can store complex data
  - Data mining without transferring data to a new location
[Schema diagram: Author, Paper, Cites tables]
Previous work – ILP based algorithms
- Prolog is the main language used to represent objects and the relations between them
- Incremental learning, incorporation of background knowledge
- Initial research: deterministic rules; recent research: statistical learning
- Main obstacle to widespread acceptance: dependency on a Prolog server
- DMax, a modern ILP-based data mining system: client-server architecture with a Java client and an ILProlog server (source: www.pharmadm.com)
Example output rule:
Previous work – relational data mining framework
(Knobbe et al, 1999)
Client-server architecture
Selection graphs
Algorithm to translate selection graphs into SQL
MRDTL and MRDTL-2 algorithms, Iowa State University
M.Sc. study at METU, Serkan Toprak, 2004
[Selection graph example: parent, child, toy; age > 30]
Previous work – graph mining
- Typical inputs are labelled graphs
- Graphs are efficient tools for describing objects and the way they are connected
- Subgraph isomorphism
- Scalability problems
- Avoid loading the complete graph data into main memory; partitioning
Nearly equivalent formalisms:
Graphs ≈ Database tables ≈ Prolog statements
ILA
- Levelwise search: constructs hypotheses in order of increasing number of conditions (first hypotheses with one condition, then hypotheses with two conditions, and so on)
- Finds the smallest completely accurate rule set that represents the training data
ILA-2
- Noise-tolerant evaluation function:
  score(hypothesis) = tp - pf * fn
  where tp is the number of true positive examples, fn is the number of false negative examples, and pf (penalty factor) is a user-defined minimum for the proportion of tp to fn
- Not sensitive to the distribution of false values
- Selecting multiple rules after one learning loop can produce redundant rules
- Implemented by modifying the source code of the C4.5 algorithm; some features are inherited from C4.5
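The ILA-2 evaluation function is simple enough to state directly; a minimal sketch in Python (the function name is ours, not from the thesis):

```python
def ila2_score(tp, fn, pf):
    """ILA-2 noise-tolerant evaluation: reward true positives,
    penalize false negatives by the user-defined penalty factor pf."""
    return tp - pf * fn

# A hypothesis covering 10 true positives and 2 false negatives,
# with pf = 2, scores 10 - 2 * 2 = 6.
```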
RILA
What is new when compared to ILA and ILA-2:
- Architecture
- Performance
- Internal representation of rules
- Construction of hypotheses
RILA – what is new
- 'Select late' rule selection strategy, as an alternative to the 'select early' strategy
- An efficient implementation: hypotheses can be refined by adding new conditions, so they do not need to be generated from scratch in each learning loop
- Optimistic estimate pruning (beam search)
- Normalized hypothesis evaluation function
Architecture of the system
[Architecture diagram: the discovery system exchanges SQL and metadata queries with the DBMS through a JDBC driver, and receives result sets and metadata back; its outputs are hypotheses and rules]
Discovery system tasks:
- Traversing the relational schema
- Hypothesis construction
- Conversions to SQL
- Rule selection
- Pruning
How are tables visited?
[Schema diagram:
Interaction (junction table): Geneid1, Geneid2, Type, Expression corr
Gene (target table): Geneid, Essential, Chromosome, Localization
Composition: Geneid, Phenotype, Class, Motif, Function, Complex]
First level: stops in the junction table? Extension levels: extend complex hypotheses only by using attributes from the other side of the junction relation.
Example hypotheses that can be generated:
- If a gene has a relation r then its class is c
- If a gene has a property p and relation r then its class is c
- If a gene has a relation r to a gene having property p then its class is c
Internal representation of an example rule
IF gene1.Composition.Class = ‘Nucleases’ AND
   Interaction.Type = ‘Genetic’ AND
   gene2.Composition.Complex = ‘Intracellular transport’
THEN gene1.Localization = extracellular…
[Graph representation with Gene, Interaction and Composition nodes]
Conditions:
- Composition.Class = ‘Nucleases’
- Interaction.Type = ‘Genetic’
- Composition.Complex = ‘Intracellular transport’
- Target: Gene.Localization = ‘extracellular…’
Joins:
- composition1.id = gene1.id
- interaction.id1 = gene1.id
- composition2.id = gene2.id
- interaction.id2 = gene2.id
Query generation
- SQL template for building size-one hypotheses
- Numeric attributes
- Refinement of hypotheses
- How is a hypothesis represented in SQL?
- How is a hypothesis extended by a condition from the other side of a many-to-many relation?
SQL template for building size one hypotheses
Select attr, count(distinct targetTable.pk)
from covered, path.getTable_list()
where path.getJoins() and
targetTable.classAttr = currentClass and
covered.id = targetTable.pk and
covered.mark=0
group by attr
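As a sketch of how the template above might be instantiated: the parameters table_list, joins and pk stand in for path.getTable_list(), path.getJoins() and the target table's primary key column, and the function name is illustrative, not part of RILA's API.

```python
def size_one_sql(attr, target_table, pk, class_attr, current_class,
                 table_list, joins):
    """Fill the size-one-hypothesis SQL template with concrete
    table, attribute and join names."""
    where = joins + [
        f"{target_table}.{class_attr} = '{current_class}'",
        f"covered.id = {target_table}.{pk}",
        "covered.mark = 0",
    ]
    return (f"select {attr}, count(distinct {target_table}.{pk})"
            f" from covered, {', '.join(table_list)}"
            f" where {' and '.join(where)}"
            f" group by {attr}")
```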
Numeric attributes
Discretization results are stored in a temporary table with columns: table_name, attribute_name, interval_name, min_value, max_value
SQL fragment generated for a discretized condition:
disc.table_name = ‘table’ and
disc.attribute_name = ‘attr’ and
attr > disc.min_val and
attr < disc.max_val
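The slides do not say which discretization method fills the temporary table; as one plausible sketch, equal-width binning that emits rows in the layout described above (column order: table_name, attribute_name, interval_name, min_value, max_value):

```python
def equal_width_intervals(table, attr, values, k):
    """Split [min, max] of the observed values into k equal-width
    intervals and return one row per interval, shaped like the
    temporary discretization table."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    rows = []
    for i in range(k):
        a = lo + i * width
        b = hi if i == k - 1 else a + width  # last bin closes at max
        rows.append((table, attr, f"({a:.3f}-{b:.3f}]", a, b))
    return rows
```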
Refinement of hypotheses
Select attr, count(distinct targetTable.pk)
from covered, table_list,
hypothesis.table_list()
where targetAttr = currentClass and
join_list and
hypothesis.join_list() and
covered.id = targetTable.pk and
covered.mark=0
group by attr;
How a hypothesis is extended by a condition from the other side of a many-to-many relation?
Select GENE_B.CHROMOSOME, count(distinct GENE.GENEID)
from COMPOSITION, GENE, GENE GENE_B, INTERACTION
where INTERACTION.GENEID2 = GENE_B.GENEID and
INTERACTION.GENEID1 = GENE.GENEID and
INTERACTION.EXPR > 0.026 and
INTERACTION.EXPR < 0.513 and
COMPOSITION.PHENOTYPE = 'Auxotrophies' and
COMPOSITION.GENEID = GENE.GENEID and
GENE.LOCALIZATION = 'ER'
group by GENE_B.CHROMOSOME
Optimistic estimate pruning
- Avoid working on hypotheses that are unlikely to result in satisfactory rules
- F-measure criterion to assess hypotheses: 2 * recall * precision / (recall + precision)
- Two types of pruning:
  - Extend only the n best hypotheses (beam search)
  - A minimum required F value for a hypothesis to take a place in the hypothesis pool (similar to minimum support pruning)
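Both pruning steps can be sketched in a few lines; the hypothesis tuples and function names below are our own illustration, not RILA's internal API:

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of recall and precision, as on the slide."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * recall * precision / (recall + precision)

def prune(hypotheses, n, f_min):
    """Drop hypotheses below the minimum F value, then keep only
    the n best for extension (beam search).
    Each hypothesis is (name, tp, fp, fn)."""
    scored = [(f_measure(tp, fp, fn), name) for name, tp, fp, fn in hypotheses]
    kept = [(s, name) for s, name in scored if s >= f_min]
    kept.sort(key=lambda x: -x[0])
    return [name for _, name in kept[:n]]
```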
Rule selection strategies
- Select early strategy
- Why do we need another strategy?
- Select late strategy
Learning algorithm when using the select early strategy:
1. Start with size = 1
2. If size is 1, build initial hypotheses; otherwise extend the current hypotheses
3. Select p rule(s)
4. If any rules were selected: mark the covered objects; if all examples are covered, end; otherwise reset size to 1 and go to step 2
5. If no rules were selected: if size is smaller than m, increment size and go to step 2; otherwise end
Example training data to demonstrate the case where the select late strategy performs better than the select early strategy
Attribute A Attribute B Attribute C Class
a1 b1 c1 A
a1 b1 c2 A
a2 b2 c3 A
a3 b2 c3 A
a4 b1 c3 B
a5 b1 c3 B
a1 b2 c4 B
a1 b2 c5 B
Learning algorithm when using the select late strategy:
1. Start: build the initial hypothesis set
2. While size < max size, increment size and extend the hypothesis set
3. Select rules, then end
Rule selection algorithm when using the select late strategy:
1. Start: select the hypothesis with the highest score
2. If the score is not positive, end
3. Mark the examples covered by this hypothesis; if no positive examples are covered, return
4. Recalculate the score using the effective cover
5. If the new score is higher than the score of the next hypothesis, or the score of this hypothesis was previously reduced more than l times, assert the hypothesis as a new rule; otherwise undo the markings and set the score to the newly calculated value
6. If all examples are covered, end; otherwise repeat from step 1
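A condensed sketch of the select-late rule selection loop. As a simplification, the recalculated score here is just the size of the effective cover, where the thesis recalculates with its evaluation function; the dict layout and names are illustrative.

```python
def select_late_rules(hypotheses, l_limit):
    """Each hypothesis is a dict with 'score', the set of positive
    example ids in 'pos', and a 'reduced' counter."""
    covered, rules = set(), []
    pool = sorted(hypotheses, key=lambda h: -h["score"])
    while pool:
        h = pool[0]
        if h["score"] <= 0:
            break
        effective = h["pos"] - covered          # effective cover
        if not effective:                       # covers no new positives
            pool.pop(0)
            continue
        new_score = len(effective)              # simplified recalculation
        next_score = pool[1]["score"] if len(pool) > 1 else 0
        if new_score > next_score or h["reduced"] > l_limit:
            rules.append(h)                     # assert as a new rule
            covered |= effective                # keep the markings
            pool.pop(0)
        else:
            h["reduced"] += 1                   # undo markings, demote
            h["score"] = new_score
            pool.sort(key=lambda x: -x["score"])
    return rules
```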
Experiments
- Summary of the parameters
- The genes data set
- The mutagenesis data set
Summary of the parameters
- Applicable to both the select late and the select early strategies:
  - pf: a user-defined minimum for the proportion of true positives to false negatives
  - m: the maximum size for hypotheses
- Applicable only to the select early strategy:
  - p: the maximum number of hypotheses that can be selected as new rules after a search iteration
- Applicable only to the select late strategy:
  - l: the limit on rule selection recursion
- Optimistic estimate pruning parameters:
  - f: the minimum acceptable F-measure value
  - n: the maximum number of hypotheses that can be extended in each level during the candidate rule generation phase of the mining process
The ‘genes’ dataset of KDD Cup 2001
[Schema:
GENE (862 rows): GENEID, Essential, Chromosome, Localization
COMPOSITION (4346 rows): GENEID, Class, Complex, Phenotype, Motif, Function
INTERACTION (910 rows): GENEID1, GENEID2, Type, Expression
INTERACTION is a junction table; many-to-many relation between genes]
Test results for the localization attribute using the select early strategy, pf=2, m=3

                           f=0.0    f=0.001  f=0.001  f=0.01  f=0.01
                           n=10000  n=500    n=500    n=1     n=1
                           p=1      p=1      p=5      p=1     p=5
training time (seconds)    117      61       67       67      37
number of rules            90       90       161      78      122
number of conditions       133      133      233      103     144
training set coverage (%)  60.56    60.56    70.19    59.16   65.43
training set accuracy (%)  95       95       96       95      95
test set accuracy (%)      83.19    83.19    83.61    84.37   85.78
test set coverage (%)      59.32    59.32    62.47    58.79   60.89
Test results for the localization attribute using the select late strategy, pf=2, m=3, l=0, f=0.01

n                          1      2      3      4      5
training time (seconds)    37     53     69     72     73
number of rules            126    140    147    150    157
number of conditions       152    188    206    213    229
training set coverage (%)  67.17  70.30  71.46  72.16  72.97
training set accuracy (%)  94     94     93     93     93
test set accuracy (%)      82.85  81.48  81.63  81.53  81.12
test set coverage (%)      62.73  63.78  64.30  65.35  65.35
Test results for the localization attribute using the select late strategy, pf=2, m=3, l=100, f=0.01

n                          1      2      3      4      5
training time (seconds)    65     69     69     89     102
number of rules            126    139    144    147    153
number of conditions       154    188    200    205    216
training set coverage (%)  65.55  68.68  69.37  69.84  70.65
training set accuracy (%)  96     96     96     96     96
test set accuracy (%)      84.96  83.91  83.91  83.62  83.27
test set coverage (%)      59.32  60.37  60.37  60.89  61.15
Why did we not have better results on the genes data set?
- Cup winner's accuracy: 72.1%
- MRDTL: 76.1% accuracy
- Serkan: 59.5% accuracy
- RILA best accuracy: 85.8% with 60.9% coverage
- RILA best coverage: 65.3% with 81.5% accuracy
- Missing values? No
- Default class selection? No
- Performance deteriorates when the number of class values is high
- The distribution of false values among classes is not taken into account
- A problem arises when the numbers of examples in different classes are not evenly distributed
Attribute1 Attribute2 Class
1 5 pink
1 5 pink
1 5 pink
1 5 yellow
2 5 yellow
2 6 yellow
1 6 blue
2 7 blue
3 7 blue
Schema of the mutagenesis database
[Schema:
ATOM: ATOM_ID, Molecule_id, Element, Type, Charge
BOND: ATOM_ID1, ATOM_ID2, Type
MOLECULE: Molecule_id, Log_mut, Logp, Lumo, Ind1, Inda, Label]
Cross validation test results using the select early strategy on the mutagenesis data for different p* values

p                  1      2      3      4      5      6      7
time (in seconds)  305    260    225    215    215    215    210
# rules            103    103    103    104    104    104    104
# conditions       120    120    120    122    121    121    121
accuracy (%)       98.82  97.06  97.06  97.06  97.06  97.06  97.06
coverage (%)       89.89  90.43  90.43  90.43  90.43  90.43  90.43

*maximum number of rules selected each time the rule selection step is executed
Cross validation test results using the select early strategy and OEP on the mutagenesis data for different n values

n               1      2      3      10     15     20     30     40
time (seconds)  134    144    159    221    325    397    311    311
# rules         118    120    120    142    123    119    104    103
# conditions    165    171    171    210    166    161    123    120
accuracy (%)    98.26  98.26  98.26  98.83  98.25  98.25  98.82  98.82
coverage (%)    91.49  91.49  91.49  90.95  90.96  90.96  89.89  89.89
Cross validation test results using the select late strategy on the mutagenesis data, p=1, f=0.01, l=0

n               1      2      3      10
time (seconds)  76     123    169    486
# rules         122    137    145    151
# conditions    165    204    224    248
accuracy (%)    96.49  94.77  93.68  93.68
coverage (%)    90.96  91.49  92.55  92.55
Cross validation test results using the select late strategy on the mutagenesis data, p=1, f=0.01, l=10

n               1      2      3      10
time (seconds)  84     135    183    490
# rules         105    107    109    112
# conditions    131    137    145    158
accuracy (%)    98.26  98.26  98.24  98.25
coverage (%)    91.49  91.49  90.43  90.96
Cross validation test results using the select late strategy on the mutagenesis data, p=1, f=0.01, l=100

n               1      2      3      10
time (seconds)  80     130    178    483
# rules         103    104    106    109
# conditions    127    131    137    153
accuracy (%)    98.26  98.26  98.26  98.26
coverage (%)    91.49  91.49  91.49  91.49
Comparison to other results on the mutagenesis data
- The best results by RILA (Table 2 and Table 5): accuracy 98.26%, coverage 91.49%
- The best results reported in (Atramentov et al. 2003): accuracy 87.5%
- The best results reported by the originators of the data set (King et al. 1996): accuracy 89.4% (number of correct predictions divided by the number of predictions)
Conclusion
- A new relational rule learning algorithm has been developed, with two different rule selection strategies
- Several techniques were used to achieve reasonable performance: refinement of hypotheses, pruning
- The results on the mutagenesis data are better than the other results cited in the literature
- Compared to traditional algorithms, there is no need to move relational data to another location: scalability, practicality
- The techniques employed can be used to develop relational versions of other traditional learning algorithms
Thanks!
FOIL, a set-covering approach [Cameron-Jones and Quinlan 1994]
- Begins with the most general theory
- Repeatedly adds a clause to the theory that covers some of the positive examples and few negative examples
- Covered examples are removed
- Continues until the theory covers all positive examples
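The set-covering loop can be sketched as follows; the clause representation and the simple coverage-minus-negatives heuristic are illustrative simplifications, not FOIL's actual information-gain measure:

```python
def sequential_covering(positives, candidate_clauses):
    """FOIL-style covering: repeatedly add the clause that covers
    the most remaining positives while penalizing covered negatives,
    remove the covered positives, and stop when all are covered or
    no clause makes progress. Each clause is (name, pos_set, neg_set)."""
    theory, remaining = [], set(positives)
    while remaining:
        best = max(candidate_clauses,
                   key=lambda c: len(c[1] & remaining) - len(c[2]))
        if not best[1] & remaining:
            break                     # no clause covers anything new
        theory.append(best[0])
        remaining -= best[1]
    return theory
```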
Previous work – unsupervised algorithms
- WARMR [Dehaspe et al., 1998] finds relational association rules (query extensions)
  - Input: a Prolog database
  - A specification in the WARMODE language limits the format of possible query extensions
- SUBDUE [Cook and Holder, 1994] discovers substructures in a graph
  - Output: the substructure selected at each iteration as the best to compress the graph
- PRM [Getoor et al., 2002] reinterprets Bayesian networks in a relational setting
  - Captures the probabilistic dependence between the attributes of interrelated objects
  - Link analysis
- Models generated by some unsupervised learning algorithms can be used for prediction tasks: WARMR and PRM, but not SUBDUE
Relational rule induction
- A schema graph represents the structure of the data: tables = nodes, foreign keys = edges
- Multiple tables can represent several objects and the relations between the objects
- Users should select the tables that represent the objects they are interested in
An example relational rule
[Schema diagram:
Gene: Geneid, Essential, Chromosome, Localization
Composition: Geneid, Phenotype, Class, Motif, Function, Complex]
IF Composition.Class = ‘ATPases’ AND
   Composition.Complex = ‘Intracellular transport’
THEN Gene.Localization = extracellular..
Many-to-many relations
- Junction tables: between different classes, or between objects of the same class
- Recursive queries are needed to extract the data
[Diagrams: a junction table between different classes; a junction table between objects of the same class]
Example rule having a many-to-many relation
[Schema diagram:
Interaction: Geneid1, Geneid2, Type, Expression corr
Gene: Geneid, Essential, Chromosome, Localization
Composition: Geneid, Phenotype, Class, Motif, Function, Complex]
IF gene1.Composition.Class = ‘Nucleases’ AND
   Interaction.Type = ‘Genetic’ AND
   gene2.Composition.Complex = ‘Intracellular transport’
THEN gene1.Localization = extracellular…
Performance
- Dynamic programming: refinement of hypotheses
- Pruning: minimum support pruning, optimistic estimate pruning
- Avoiding redundant hypotheses
- Smart data structures
Tabular representation of the links in the example rule
Conditions:
- composition.class = nucleases
- interaction.type = Genetic
- composition.complex = intracellular transport

Paths:
- composition1.id = gene1.id
- interaction.id1 = gene1.id
- composition2.id = gene2.id
- interaction.id2 = gene2.id
- gene1.id = interaction.geneid1

IF gene1.Composition.Class = ‘Nucleases’ AND
   Interaction.Type = ‘Genetic’ AND
   gene2.Composition.Complex = ‘Intracellular transport’
THEN gene1.Localization = extracellular…
Building size one hypotheses

Vector buildSizeOneHypotheses(String class, String tableName, Path path) {
    For each column in the selected table {
        If (column is not the target attribute and not a primary key and not a foreign key) {
            Check whether the table is the target table
            Check whether the column is numeric
            Select the proper SQL template and generate SQL(path)
            result set = execute generated SQL
            hypotheses += generated hypotheses using the result set
        }
    }
    For each table linked by a foreign key relation {
        If the linked table was not visited before (check the path)
            hypotheses += buildSizeOneHypotheses(class, linked table name, updated path)
    }
    return the hypotheses
}
Refinement of hypotheses

Vector extendHypotheses(String class, Vector currentHypotheses, String tableName, Path path) {
    For each column in the current table {
        If (column is not the target attribute and not a primary key and not a foreign key) {
            For each hypothesis in the current hypothesis set {
                If the hypothesis does not include the current feature {
                    Check whether the table is the target table
                    Check whether the column is numeric
                    Select the proper SQL template and generate SQL(path, hypothesis)
                    result set = execute generated SQL
                    extended hypotheses += generated hypotheses using the result set
                }
            }
        }
    }
    For each table linked by a foreign key relation {
        If the linked table was not visited before (check the path)
            extended hypotheses += extendHypotheses(class, currentHypotheses, linkedTableName, updated path)
    }
    return extended hypotheses;
}
Relational rule induction
- Basic components are the same as in propositional rule induction algorithms:
  - Hypothesis construction
  - Rule selection
- Interpretation of the relational schema:
  - Traversal of the schema
  - Detection and handling of cycles
- Communication with the RDBMS:
  - Expressing internal hypotheses in SQL
  - Understanding results returned from the RDBMS
Relational rule induction – target table
- The target table holds the primary key for the main objects being analyzed
[Schema examples: GENE (GENEID, Essential, Chromosome, Localization), COMPOSITION (GENEID, Class, Complex, Phenotype, Motif, Function), INTERACTION (GENEID1, GENEID2, Type, Expression); Author, Paper, Cites]
How is a hypothesis represented in SQL?

Rule:
INTERACTION.EXPR = (0.026658-0.513329] AND COMPOSITION.PHENOTYPE = Auxotrophies

SQL:
INTERACTION.GENEID1=GENE.GENEID and
INTERACTION.EXPR > 0.026 and
INTERACTION.EXPR <= 0.513 and
COMPOSITION.PHENOTYPE = 'Auxotrophies' and
COMPOSITION.GENEID=GENE.GENEID and
GENE.LOCALIZATION = 'ER'
Many-to-many relations
- Search algorithm
- Recognition of junction tables
- Extension of hypotheses having a feature from a junction table
- Automatic conversion of hypotheses to SQL using table aliases
ILA evaluation function
[Figure: example data with columns Attribute A, Attribute B and classes A, B, C, showing how the ILA evaluation function's performance decreases when the numbers of examples per class are unevenly distributed]