Semantic Data Mining: an Ontology Based Approach
Agnieszka Lawrynowicz
Institute of Computing Science, Poznan University of Technology
April 12, 2016
Seminar of the Institute of Computing Science, Poznan University of Technology
Agnieszka Lawrynowicz Semantic Data Mining: an Ontology Based Approach 1
Outline
Introduction to semantic data mining
Ontology in computer science
Semantic meta-mining
▸ Use Case: e-LICO Intelligent Discovery Assistant
▸ Background knowledge: Data Mining OPtimization Ontology
▸ DM method: Pattern discovery with Fr-ONT-Qu
▸ Sharing: Standardization of data mining and machine learning schemas
Summary
Introduction: data mining
Input: a data table, text documents, ...
Output: a model, a pattern set
[Figure: data → DATA MINING → model, patterns]
Introduction: using background knowledge in data mining
Using background knowledge in data mining has been extensively researched:
hierarchy/taxonomy of attributes (Michalski et al., 1986; Srikant and Agrawal, 1995)
Inductive Logic Programming (Muggleton, 1991; Lavrac and Dzeroski, 1994)
relational learning (Quinlan, 1993; de Raedt, 2008)
semantic data mining tutorial @ ECML/PKDD 2011 (Lavrac, Vavpetic, Lawrynowicz, Potoniec, Hilario, Kalousis)
Introduction: relational data mining
Input: a relational database, a graph, a set of logical facts, ...
Output: a model, a pattern set
[Figure: RELATIONAL DATA MINING → model, patterns]
Semantic data mining
Input:
a data table, text documents, Web pages, a relational database, a graph, a set of logical facts, ...
one or more ontologies
Output: a model, a pattern set
[Figure: data and ontologies (linked by annotations, mappings, vocabulary re-use) → SEMANTIC DATA MINING → model, patterns]
Ontology in computer science
"engineering artefact [...]" (Guarino 98)
"An ontology is a
formal specification ⇒ machine interpretation
of a shared ⇒ group of people, consensus
conceptualization ⇒ abstract model of phenomena, concepts
of a domain of interest" ⇒ domain knowledge
(Gruber 93, Studer 98)
Ontology = formal specification of terminological knowledge (most often from a particular domain)
Semantic Web layer cake (the stack of Semantic Web languages)
[Figure labels: ontology modelling languages; data]
Ontologies + data = knowledge graph
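[Figure: a knowledge graph drawn in three layers. RDF layer: the triple reviewer1 metaReviews paper10. RDFS layer: the classes MetaReviewer and PeerReviewedPaper and the properties metaReviews and reviews, connected by rdf:type, rdfs:domain, rdfs:range, rdfs:subPropertyOf and rdfs:subClassOf edges. OWL layer: an owl:Restriction with owl:onProperty reviewedBy and owl:someValuesFrom Reviewer, plus an owl:inverseOf edge.]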
Logical meaning of OWL
Description Logics (DLs) = a family of first-order-logic-based formalisms suitable for representing knowledge, especially terminologies and ontologies; they underpin the Web Ontology Language (OWL).
Basic building blocks: concepts, roles, constructors, individuals
Example
TBox
Atomic concepts: Reviewer, Paper
Roles: reviews, metaReviews, reviewedBy
Constructors: ⊓, ∃
Axiom (concept definition): PeerReviewedPaper ≡ Paper ⊓ ∃reviewedBy.Reviewer
Axiom (concept description "each meta reviewer is a reviewer"): MetaReviewer ⊑ Reviewer
ABox
Fact assertion: metaReviews(reviewer1, paper10)
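The TBox/ABox split above can be illustrated with a small sketch. This is only a toy, closed-world evaluation in plain Python, not a DL reasoner (OWL uses the open-world assumption). The reviewedBy fact and the Paper membership of paper10 are assumptions added for illustration; the slide itself asserts only metaReviews(reviewer1, paper10).

```python
# Toy, closed-world reading of the slide's example. NOT a DL reasoner.

# ABox: asserted facts about named individuals.
concept_assertions = {
    "Paper": {"paper10"},            # assumed for illustration
    "Reviewer": set(),
    "MetaReviewer": {"reviewer1"},   # assumed for illustration
}
role_assertions = {
    "metaReviews": {("reviewer1", "paper10")},   # from the slide
    "reviewedBy": {("paper10", "reviewer1")},    # assumed for illustration
}

# TBox inclusion axiom: MetaReviewer ⊑ Reviewer.
subclass_of = {"MetaReviewer": "Reviewer"}


def instances(concept):
    """Asserted members of a concept plus members of its (transitive) subclasses."""
    members = set(concept_assertions.get(concept, set()))
    for sub, sup in subclass_of.items():
        if sup == concept:
            members |= instances(sub)
    return members


def exists(role, filler):
    """Individuals with at least one `role` successor that is an instance of `filler`."""
    fillers = instances(filler)
    return {s for (s, o) in role_assertions.get(role, set()) if o in fillers}


def peer_reviewed_papers():
    """PeerReviewedPaper ≡ Paper ⊓ ∃reviewedBy.Reviewer, evaluated over the toy ABox."""
    return instances("Paper") & exists("reviewedBy", "Reviewer")


print(peer_reviewed_papers())
```

Here paper10 is classified as a PeerReviewedPaper only because reviewer1, a MetaReviewer, counts as a Reviewer through the inclusion axiom.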
Overview of meta-learning
Meta-learning: learning to learn
application of machine learning techniques to meta-data about past machine learning experiments;
the goal: to modify some aspect of the learning process to improve the performance of the resulting model;
meta-mining: meta-learning applied to the full data mining process
Overview of the e-LICO system (EU FP7 2009-2012)
Overview of the e-LICO DM Lab
The scenario outlined in Section 2 yields a set of requirements on the underlying data mining platform. This section presents the different components of the e-LICO architecture (Figure 1) and shows how they interact to achieve the user's knowledge discovery goal.
The e-LICO infrastructure (depicted in the figure under the dashed line) is the means by which the data-mining platform is delivered to scientists. The innovative core of the e-LICO platform is the Intelligent Discovery Assistant (IDA, above the dashed line) with its planner and meta-learner. However, to deliver the data-mining platform to its scientist users, there are several other services and components. Figure 1 shows an overview of e-LICO's components and how they interact with each other.
Figure 1. Overview of the e-LICO system.
There are two user-facing components for the e-LICO platform; these allow scientists to access data-mining operators and/or other data processing services, to compose them into workflows and execute them, collecting the results for interpretation or further analysis. These two central infrastructure components are:
1. RapidMiner: An application that gives access to a wide variety of data-mining operators, together with the means to compose them into workflows.
2. Taverna: A workflow creation and enactment workbench that gives access to arbitrary Web services and many other kinds of services. It is widely used in bioinformatics, but also in many other disciplines.
Background knowledge: DM OPtimization Ontology
Data Mining OPtimization Ontology (DMOP)
the primary goal of DMOP is to support all decision-making steps that determine the outcome of the data mining process;
development started in EU FP7 project e-LICO (2009-2012);
DMOP v5.5: 723 classes, 111 properties, 4291 axioms;
highly axiomatized;
represented in Web Ontology Language (OWL 2);
Competency questions
"Given a data mining task/data set, which of the valid or applicable workflows/algorithms will yield optimal results (or at least better results than the others)?"
"Given a set of candidate workflows/algorithms for a given task/data set, which data set/workflow/algorithm characteristics should be taken into account in order to select the most appropriate one?"
and other, more fine-grained ones, e.g.:
"Which induction algorithms should I use (or avoid) when my dataset has many more variables than instances?"
Architecture of DMOP knowledge base and its satellite triple stores
[Figure: architecture of the DMOP knowledge base. Component labels: TBox (DMOP); ABox (Operator DB); DMEX-DB1, DMEX-DB2, ..., DMEX-DBk. Representation labels: OWL2; RDF; Triple Store. Content labels: formal conceptual framework of the data mining domain; accepted knowledge of DM tasks, algorithms, operators; specific DM applications (datasets, workflows, results). Role labels: meta-miner's training data; meta-miner's prior DM knowledge.]
The core concepts of DMOP (simplified)
Fig. 1. The core concepts of DMOP.
[...] more than specify their input/output types; only processes called DM-Operations have actual inputs and outputs. A process that executes a DM-Operator also realizes the DM-Algorithm implemented by the operator and achieves the DM-Task addressed by the algorithm. Finally, a DM-Workflow is a complex structure composed of DM operators, a DM-Experiment is a complex process composed of operations (or operator executions). An experiment is described by all the objects that participate in the process: a workflow, data sets used and produced by the different data processing phases, the resulting models, and meta-data quantifying their performance. In the following, the basic elements of DMOP are detailed.
DM Tasks: The top-level DM tasks are defined by their inputs and outputs. A DataProcessingTask receives and outputs data. Its three subclasses produce new data by cleansing (DataCleaningTask), reducing (DataReductionTask), or otherwise transforming the input data (DataTransformationTask). These classes are further articulated in subclasses representing more fine-grained tasks for each category. An InductionTask consumes data and produces hypotheses. It can be either a ModelingTask or a PatternDiscoveryTask, based on whether it generates hypotheses in the form of global models or local pattern sets. Modeling tasks can be predictive (e.g., classification) or descriptive (e.g., clustering), while pattern discovery tasks are further subdivided into classes based on the nature of the extracted patterns: associations, dissociations, deviations, or subgroups. A HypothesisProcessingTask consumes hypotheses and transforms (e.g., rewrites or prunes) them to produce enhanced (less complex or more readable) versions of the input hypotheses.
Data: As the primary resource that feeds the knowledge discovery process, data have been a natural research focus for data miners. Over the past decades meta-learning researchers have actively investigated data characteristics that might explain generalization success or failure. Fig. 2 shows the characteristics associated with the different Data subclasses (shaded boxes). Most of these are statistical measures, such as the number of [...]
DMOP: algorithm representation
Alignment of DMOP with DOLCE 1/3
Two main reasons to align DMOP with a foundational ontology:
considerations about attributes and data properties; extant non-foundational ontology solutions were partial re-inventions of how they are treated in a foundational ontology;
reuse of the ontology’s object properties;
Alignment of DMOP with DOLCE 2/3
Alignment of DMOP with DOLCE 3/3
Perdurant: DM-Experiment and DM-Operation are subclasses of dolce:process;
Endurant: most DM classes, such as algorithm, software, strategy, task, and optimization problem, are subclasses of dolce:non-physical-endurant;
Quality: characteristics and parameters of DM entities are made subclasses of dolce:abstract-quality;
Abstract: for identifying discrete values, classes are added as subclasses of dolce:abstract-region;
object properties: DMOP reuses mainly DOLCE's parthood, quality, and quale relations;
each of the four main DOLCE branches has been used.
Qualities and attributes 1/3
How to handle 'attributes' in OWL ontologies, and, in a broader context, measurements?
easy way: an attribute is a binary functional relation between a class and a datatype
Elephant ⊑ =1 hasWeight.integer
Elephant ⊑ =1 hasWeightPrecise.real
Elephant ⊑ =1 hasWeightImperial.integer (in lbs)
drawback: this builds into one's ontology application decisions about how to store the data (and in which unit it is)
Qualities and attributes 2/3
How to handle 'attributes' in OWL ontologies, and, in a broader context, measurements?
more elaborate way: unfold the notion of an object's property (e.g. weight) from one attribute/OWL data property into at least two properties:
▸ one OWL object property from the object to the 'reified attribute' ("quality property" represented as an OWL class)
▸ and another property to the value(s)
favoured in foundational ontologies;
solves the problem of non-reusability of the 'attribute' and prevents duplication of data properties;
in DMOP, measurements are more like values of parameters;
Qualities and attributes 3/3
ModelingAlgorithm ⊑ =1 dolce:has-quality.LearningPolicy
LearningPolicy ⊑ =1 dolce:has-quale.Eager-Lazy
Eager-Lazy ⊑ ≤ 1 hasDataValue.anyType
LearningPolicy is a subclass of dolce:quality
Eager-Lazy is a subclass of dolce:abstract-region
In this way, the ontology can be linked to many different applications, which may even use different data types, yet still agree on the meaning of the characteristics and parameters ('attributes') of the algorithms, tasks, and other DM endurants.
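As a rough illustration of the quality/quale pattern, the sketch below mirrors the three axioms with plain Python dataclasses. The class and field names are hypothetical stand-ins for the OWL entities, and the two "applications" with their differing encodings are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quale:
    """A value in an abstract region, e.g. Eager-Lazy (subclass of dolce:abstract-region)."""
    region: str
    data_value: object  # hasDataValue: any datatype, at most one value


@dataclass(frozen=True)
class Quality:
    """A reified attribute, e.g. LearningPolicy (subclass of dolce:quality)."""
    kind: str
    quale: Quale  # exactly one has-quale


@dataclass(frozen=True)
class ModelingAlgorithm:
    name: str
    learning_policy: Quality  # exactly one has-quality.LearningPolicy


def same_characteristic(a, b):
    """True if both descriptions refer to the same quality and region, whatever the datatype."""
    qa, qb = a.learning_policy, b.learning_policy
    return qa.kind == qb.kind and qa.quale.region == qb.quale.region


# Two hypothetical applications describe k-NN (a lazy learner) with different datatypes.
knn_app1 = ModelingAlgorithm("kNN", Quality("LearningPolicy", Quale("Eager-Lazy", "lazy")))
knn_app2 = ModelingAlgorithm("kNN", Quality("LearningPolicy", Quale("Eager-Lazy", False)))

print(same_characteristic(knn_app1, knn_app2))
```

The stored values differ (a string versus a boolean), but both descriptions still point at the same reified quality, which is the reuse the slide argues for.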
Meta-modeling in DMOP 1/4
only processes (executions of workflows) and operations (executions of operators) consume inputs and produce outputs
DM algorithms (as well as operators and workflows) can only specify the type of input or output
inputs and outputs (DM-Dataset and DM-Hypothesis class hierarchy, respectively) are modeled as subclasses of IO-Object class
Meta-modeling in DMOP 2/4
DM algorithms: classes or individuals? Individuals.
Problem: expressing types of inputs/outputs associated with an algorithm
"C4.5 specifiesInputClass CategoricalLabeledDataSet" ✗
C4.5: Individual (instance of DM-Algorithm)
CategoricalLabeledDataSet: Class (subclass of DM-Hypothesis)
Meta-modeling in DMOP 3/4
Initial solution: one artificial class per algorithm, with a single instance corresponding to this particular algorithm
Problem: hasInput, hasOutput, specifiesInputClass and specifiesOutputClass were assigned a common range, IO-Object
"C4.5 specifiesInputClass Iris" ?
C4.5: Individual (instance of DM-Algorithm)
Iris: Individual (instance of DM-Hypothesis)
Iris is a concrete dataset. Clearly, no DM algorithm is designed to handle only one particular dataset.
Meta-modeling in DMOP 4/4
Final solution: weak form of punning available in OWL 2
IO-Class: meta-class, i.e. the class of all classes of input and output objects
"C4.5 specifiesInputClass CategoricalLabeledDataSet" ✓
C4.5: Individual (instance of DM-Algorithm)
CategoricalLabeledDataSet: Individual (instance of IO-Class)
"DM-Process hasInput some CategoricalLabeledDataSet" ✓
DM-Process: Class (subclass of dolce:process)
CategoricalLabeledDataSet: Class (subclass of IO-Object)
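A loose analogy for this final solution is a Python metaclass: a class object is itself an instance of its metaclass, so the same name can be used both as a class and as an individual. The sketch below is only an analogy under that assumption. In OWL 2 punning the class view and the individual view of a name are semantically independent, while in Python they are literally the same object. All Python names are hypothetical.

```python
class IOClass(type):
    """Meta-class: its instances are the classes of input and output objects."""


class IOObject(metaclass=IOClass):
    """Root of the input/output object hierarchy."""


class CategoricalLabeledDataSet(IOObject):
    """Used as a class (of datasets) and as an individual (instance of IOClass)."""


class DMAlgorithm:
    def __init__(self, name, specifies_input_class):
        # specifiesInputClass must point at a class of IO objects, not at a concrete dataset.
        if not isinstance(specifies_input_class, IOClass):
            raise TypeError("specifiesInputClass expects an instance of IO-Class")
        self.name = name
        self.specifies_input_class = specifies_input_class


class DMProcess:
    def __init__(self, has_input):
        # hasInput must point at a concrete IO object.
        if not isinstance(has_input, IOObject):
            raise TypeError("hasInput expects an instance of IO-Object")
        self.has_input = has_input


c45 = DMAlgorithm("C4.5", CategoricalLabeledDataSet)  # algorithm -> class used as individual
iris = CategoricalLabeledDataSet()                    # a concrete dataset
run = DMProcess(iris)                                 # process -> concrete input

print(isinstance(CategoricalLabeledDataSet, IOClass), isinstance(iris, IOClass))
```

Passing the concrete dataset `iris` to `DMAlgorithm` raises a `TypeError`, which corresponds to the rejected "C4.5 specifiesInputClass Iris" statement.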
DM method: Fr-ONT-Qu semantic pattern miner
Data mining as search
learning in description logics (DLs) and other relational data can be seen as search in a space of concepts / RDF triples / clauses / (conjunctive / SPARQL) queries, ...
it is possible to impose an ordering on this search space, e.g., using subsumption as a natural quasi-order and generality relation between DL concepts
▸ if D ⊑ C then C covers all instances that are covered by D
refinement operators may be applied to traverse the space by computing a set of specializations (resp. generalizations) of a concept / RDF triples / clauses / (conjunctive / SPARQL) queries, ...
Properties of refinement operators
Consider a downward refinement operator ρ, and let C ⇝ρ D denote a refinement chain from a DL concept C to D
complete: each point in the lattice is reachable (for D ⊑ C there exists E such that E ≡ D and a refinement chain C ⇝ρ ... ⇝ρ E)
weakly complete: for any concept C with C ⊑ ⊤, a concept E with E ≡ C can be reached from ⊤
finite: ρ(C) is finite for any concept C
redundant: there exist two different refinement chains from C to D
proper: C ⇝ρ D implies C ≢ D
ideal = complete + proper + finite
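These properties can be checked by hand on a deliberately tiny language. The sketch below assumes concepts are plain conjunctions of atomic concepts with no background knowledge, so that equivalence is set equality and D ⊑ C holds when D contains all atoms of C. This toy language is far weaker than the DLs covered by the negative results, so it says nothing about them; the atom names are only examples.

```python
ATOMS = ("Paper", "JournalPaper", "PeerReviewedPaper")
TOP = frozenset()  # the empty conjunction, playing the role of ⊤


def rho(concept):
    """Downward refinement: add exactly one atom that is not yet in the conjunction."""
    return {concept | {atom} for atom in ATOMS if atom not in concept}


def is_more_specific(d, c):
    """D ⊑ C for conjunctions of atoms: D contains every atom of C."""
    return c <= d


def count_chains(start, goal):
    """Number of distinct refinement chains leading from start to goal."""
    if start == goal:
        return 1
    return sum(
        count_chains(nxt, goal)
        for nxt in rho(start)
        if is_more_specific(goal, nxt)
    )


goal = frozenset({"Paper", "JournalPaper"})

# finite: every concept has finitely many refinements (at most len(ATOMS)).
print(len(rho(TOP)))
# proper: each refinement is strictly more specific than its parent.
print(all(r != TOP and is_more_specific(r, TOP) for r in rho(TOP)))
# redundant: the goal is reachable along more than one chain.
print(count_chains(TOP, goal))
```

In this toy setting the operator is finite, proper and complete, yet also redundant, since the two atoms of the goal can be added in either order.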
Learning in DLs and in clausal languages is hard
Lehmann & Hitzler (ILP 2007, MLJ 2010) proved for many DLs, and Nienhuys-Cheng & Wolf (1997) for clausal languages, that no ideal refinement operator exists.
Fr-ONT-Qu
algorithm for mining patterns in RDF(S) data
patterns expressed as SPARQL queries
generality relation: taxonomical subsumption
consists of: a refinement operator ρ and a strategy to select the best patterns for further refinement
Example SPARQL query
head: SELECT ?x WHERE {
body:   ?x rdf:type :Paper .
        ?x rdf:type :PeerReviewedPaper .
        ?x :reviewedBy ?y
      }
New generality relation: taxonomical subsumption
Taxonomically closed pattern
A pattern Q is taxonomically closed, or t-closed, w.r.t. the background knowledge G if for each triple of the form (?x rdf:type c) in Q, Q also contains the transitive closure of (?x rdf:type c) w.r.t. G, and for each triple of the form (?x p ?y) that appears in the pattern Q, Q also contains the transitive closure of (?x p ?y) w.r.t. G.
Taxonomical subsumption
Given two patterns Q1 and Q2 over a ρdf dataset G, and their t-closures Q1^t and Q2^t respectively, Q1 taxonomically subsumes (t-subsumes) Q2 iff there exists a mapping σ such that the set of triple patterns and FILTER expressions from σ(body(Q1^t)) is a subset of the set of triple patterns and FILTER expressions from body(Q2^t).
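The two definitions can be sketched in a few lines of Python. This is an illustrative simplification, not the Fr-ONT-Qu implementation: FILTER expressions are ignored, σ is restricted to variable-to-variable mappings and found by brute force, and the small class and property hierarchies are assumptions standing in for the background knowledge G.

```python
from itertools import product

# Assumed background knowledge G: direct superclass / superproperty links.
SUBCLASS = {"PeerReviewedPaper": "Paper", "JournalPaper": "Paper"}
SUBPROP = {"metaReviews": "reviews"}


def ancestors(term, parent):
    """All terms reachable by following parent links (the transitive closure)."""
    found = []
    while term in parent:
        term = parent[term]
        found.append(term)
    return found


def t_closure(pattern):
    """Add the implied superclass and superproperty triples to a pattern body."""
    closed = set(pattern)
    for s, p, o in pattern:
        if p == "rdf:type":
            closed |= {(s, p, c) for c in ancestors(o, SUBCLASS)}
        else:
            closed |= {(s, sp, o) for sp in ancestors(p, SUBPROP)}
    return frozenset(closed)


def variables(pattern):
    return sorted({t for triple in pattern for t in triple if t.startswith("?")})


def t_subsumes(q1, q2):
    """True if some variable mapping sends the t-closure of q1 into the t-closure of q2."""
    c1, c2 = t_closure(q1), t_closure(q2)
    v1, v2 = variables(c1), variables(c2)
    for image in product(v2, repeat=len(v1)):
        sigma = dict(zip(v1, image))
        mapped = {tuple(sigma.get(t, t) for t in triple) for triple in c1}
        if mapped <= c2:
            return True
    return False


general = {("?x", "rdf:type", "Paper")}
specific = {("?a", "rdf:type", "PeerReviewedPaper"), ("?a", "reviewedBy", "?b")}

print(t_subsumes(general, specific), t_subsumes(specific, general))
```

The general pattern t-subsumes the specific one only because the closure step adds (?a rdf:type Paper) to the specific pattern; a purely syntactic subset test on the original bodies would miss it.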
Input of the algorithm
a declarative bias (B) to limit the search space (i.e. classes and properties to use) and a maximal number of iterations
2 thresholds: for keeping good enough patterns and for refining the best patterns
a choice of several quality measures to use for the thresholds (e.g. support on the knowledge base)
beam search size
Example
B: classes: PeerReviewedPaper, JournalPaper, property: reviewedBy
1 Refine every pattern from the previous iteration by adding a single restriction for a variable already existing in the pattern. E.g. for the pattern {?x rdf:type :Paper .}, its refinements are:
▸ {?x rdf:type :Paper . ?x rdf:type :PeerReviewedPaper .}
▸ {?x rdf:type :Paper . ?x rdf:type :JournalPaper .}
▸ {?x rdf:type :Paper . ?x :reviewedBy ?y}
2 Evaluate patterns (with some quality measure, such as support on a data set) and select only the best ones
3 Repeat steps 1-2 as long as there are patterns for refinement and the maximal number of iterations is not exceeded
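The three steps can be sketched as a small beam search. This is a simplified illustration rather than the Fr-ONT-Qu algorithm itself: patterns are restricted to a single variable ?x, a pattern is a set of restrictions instead of a SPARQL query, one support threshold replaces the two thresholds, and the toy data are invented.

```python
# Invented toy data: the classes of each individual and its outgoing properties.
TYPES = {
    "p1": {"Paper", "PeerReviewedPaper", "JournalPaper"},
    "p2": {"Paper", "PeerReviewedPaper"},
    "p3": {"Paper"},
    "p4": {"Paper", "JournalPaper"},
}
HAS_PROPERTY = {"p1": {"reviewedBy"}, "p2": {"reviewedBy"}, "p3": set(), "p4": set()}

# Declarative bias B: the restrictions that may be added to a pattern.
BIAS = [("type", "PeerReviewedPaper"), ("type", "JournalPaper"), ("prop", "reviewedBy")]


def matches(individual, restriction):
    kind, name = restriction
    return name in (TYPES[individual] if kind == "type" else HAS_PROPERTY[individual])


def support(pattern):
    """Number of individuals that satisfy every restriction of the pattern."""
    return sum(all(matches(ind, r) for r in pattern) for ind in TYPES)


def refine(pattern):
    """Step 1: add a single restriction on the existing variable."""
    return [pattern | {r} for r in BIAS if r not in pattern]


def mine(start, min_support, beam_size, max_iterations):
    kept, beam = [], [start]
    for _ in range(max_iterations):
        candidates = {q for p in beam for q in refine(p)}
        # Step 2: evaluate and keep only patterns that are good enough.
        good = [q for q in candidates if support(q) >= min_support]
        if not good:
            break
        kept.extend(q for q in good if q not in kept)
        # Only the best patterns are refined further (beam search).
        beam = sorted(good, key=lambda q: (-support(q), sorted(q)))[:beam_size]
    return kept  # Step 3 is the loop itself.


start = frozenset({("type", "Paper")})
for found in mine(start, min_support=2, beam_size=2, max_iterations=3):
    print(sorted(found), support(found))
```

With these toy settings every kept pattern covers at least two of the four individuals, and refinements that drop below the threshold are neither kept nor expanded.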
Refinement operator ρ: uses trie data structure
ρ: (locally) finite and complete
Pattern based classification 1/2
Pattern based classification 2/2
We learn features that are optimized with regard to the (classification) task
Propositionalisation 1/2
Propositionalisation 2/2
In this way, learned features may be consumed by any off-the-shelf 'attribute-value' classification algorithm
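A minimal sketch of propositionalisation, assuming each mined pattern can be evaluated as a yes/no test on an example: every pattern becomes one binary column of an attribute-value table. The examples and patterns below are invented, and the patterns are plain Python predicates rather than SPARQL queries.

```python
def propositionalise(examples, patterns):
    """One row per example; column j is 1 if pattern j covers the example, else 0."""
    return [[int(covers(example)) for covers in patterns] for example in examples]


# Invented examples, described by their classes and their number of reviewers.
examples = [
    {"types": {"Paper", "PeerReviewedPaper"}, "reviewers": 2},
    {"types": {"Paper"}, "reviewers": 0},
]

# Invented patterns, written as predicates over an example.
patterns = [
    lambda e: "Paper" in e["types"],
    lambda e: "PeerReviewedPaper" in e["types"],
    lambda e: e["reviewers"] > 0,
]

for row in propositionalise(examples, patterns):
    print(row)
```

The resulting rows can be handed to any classifier that expects a fixed-length feature vector per example.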
Comparative experiments on classification of semantic data 1/2
we considered published work with available results and datasets (including ESWC 2008 best paper, ESWC 2012 best paper)
various types of methods: kernel methods, statistical relational classifier, concept learning algorithms
we strictly followed the tasks, protocols and experimental setups of the methods
Comparative experiments on classification of semantic data 2/2
For the classification task, Fr-ONT-Qu outperformed state-of-the-art approaches to classification of Semantic Web data (see: "Pattern based feature construction in semantic data mining" by A. Lawrynowicz, J. Potoniec, IJSWIS 10(1), 2014):
kernel methods: Bloehdorn et al. (2007), Loesch et al. (ESWC 2012 best paper),
statistical relational classifier: SPARQL-ML by Kiefer et al. (ESWC 2008 best paper),
concept learning algorithms: DL-FOIL by Fanizzi et al. (2008), DL-Learner cutting-edge CELOE variant by Lehmann (2009)
What is RapidMiner? 1/2
What is RapidMiner? 2/2
RapidMiner XML based workflow representation
Creating (meta-)dataset for meta-mining
[Figure labels:]
DMOP-based repository of DM processes (DMEX-DB)
Dataset for training the meta-miner
>85 mln RDF triples
Baseline DM experiment set
1581 executed RapidMiner workflows
Baseline datasets
11 UCI datasets
Data Characteristics Tool (DCT)
DMOP ontology
Transformation to RDF
Propositionalisation
[Figure labels: workflow patterns; dataset; DMOP-based RDF repository of DM processes; dataset characteristics ...; features]
Results of experiments. Below we present the results of experimental evaluation of Fr-ONT-Qu in the meta-mining scenario. In the experiments, we used OWLIM SE (v5.3.5849) as an underlying reasoning engine and a semantic store with the owl2-rl-reduced-optimized ruleset. The choice of such a ruleset was motivated by the expressivity of our background knowledge base, e.g. existence of object property chains. During each cycle of cross-validation, Fr-ONT-Qu discovered around 2000 patterns, and redundant patterns were subsequently pruned. We discuss some of the discovered patterns below (for compactness denoting by Bd the body of the base pattern used in the experiments). The first example pattern:
Q1 = select distinct ?x where { Bd ∪
    ?opex2 dmop:executes ?front0 .
    ?opex2 dmop:executes rm:RM-Decision_Tree .
    ?opex2 dmop:hasParameterSetting ?front1 .
    ?front0 dmop:executes rm:DM-Operator .
    ?front0 dmop:implements ?front2 .
    ?front2 a dmop:DM-Algorithm .
    ?front2 a dmop:InductionAlgorithm .
    ?front2 a dmop:ModelingAlgorithm .
    ?front2 a dmop:ClassificationModelingAlgorithm .
    ?front2 a dmop:ClassificationTreeInductionAlgorithm .
}
was mined when Fr-ONT-Qu traversed down the algorithm class hierarchy, specializing the variable ?front2. In this way, it is possible to abstract from the level of operators (algorithm implementations) to the level of algorithms and their taxonomy. For instance, both the rm:RM-Decision_Tree and weka:Weka-J48 operators implement a classification tree induction algorithm, and one may generalize over them. The patterns containing class hierarchies provide expressivity similar to that of patterns mined in so-called generalized association rule mining.
The following pattern covers only those workflows that contain the 'Decision Tree' operator, for which the parameter minimal size for split has a value between 2 and 5.5:
Q2 = select distinct ?x where { Bd ∪
    ?opex2 dmop:executes ?front0 .
    ?opex2 dmop:executes rm:RM-Decision_Tree .
    ?opex2 dmop:hasParameterSetting ?front1 .
    ?front0 dmop:executes rm:DM-Operator .
    ?front1 dmop:setsValueOf ?front2 .
    ?front1 dmop:hasValue ?front3 .
    filter(2.000000 <= xsd:double(?front3) && xsd:double(?front3) <= 16.000000) .
    ?front2 dmop:hasParameterKey 'minimal_size_for_split' .
    ?front1 dmop:hasValue ?front3 .
    filter(2.000000 <= xsd:double(?front3) && xsd:double(?front3) <= 9.000000) .
    ?front1 dmop:hasValue ?front3 .
    filter(2.000000 <= xsd:double(?front3) && xsd:double(?front3) <= 5.500000) .
}
Semantic meta-mining results
McNemar's test for pairs of classifiers was performed with the null hypothesis that a classifier built using dataset characteristics and a mined pattern set has the same error rate as the baseline that used dataset characteristics and only the names of the machine learning DM operators
The test confirmed that classifiers trained using workflow patterns performed significantly better (in terms of accuracy) than the baseline
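For readers unfamiliar with McNemar's test, the sketch below computes its exact (binomial) two-sided form from paired per-example outcomes. The slides do not say which variant of the test was used, and the outcome lists here are invented purely to show the mechanics; they are not the experimental results.

```python
from math import comb


def mcnemar_exact(baseline_correct, candidate_correct):
    """Exact two-sided McNemar test on paired correctness flags.

    b counts examples only the baseline got right, c those only the candidate got right.
    Under the null hypothesis of equal error rates, b follows Binomial(b + c, 0.5).
    """
    pairs = list(zip(baseline_correct, candidate_correct))
    b = sum(x and not y for x, y in pairs)
    c = sum(y and not x for x, y in pairs)
    n = b + c
    if n == 0:
        return b, c, 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Invented outcomes for 12 test examples (True = classified correctly).
baseline = [True] + [False] * 10 + [True]
candidate = [False] + [True] * 10 + [True]

b, c, p_value = mcnemar_exact(baseline, candidate)
print(b, c, p_value)
```

Only the discordant pairs (b and c) enter the statistic; examples on which both classifiers agree carry no information about which one is better.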
Sharing: Standardization of DM/ML schemas
Evolution of the field of DM/ML ontologies
[Figure: timeline from 2008 to 2016. Ontologies/vocabularies: OntoDM, DMOP, Exposé, DMWF, MEX, ML Schema Core. Events: Data Mining Ontology Jamboree (Slovenia), OpenML 2016 (Netherlands). Also shown: W3C Machine Learning Schema Community Group, Experiment Databases platform, OpenML platform. Others: KDDONTO, KD, ...]
OntoDM
Pance Panov, Larisa N. Soldatova, Saso Dzeroski: Ontology of core data mining entities. Data Min. Knowl. Discov. 28(5-6): 1222-1265 (2014)
built in compliance with the upper-level ontologies BFO, OBI, IAO; modularized
incorporates structured data mining
Use case: generic, middle-level ontology for ML; representing QSAR entities for drug design, used by Eve Robot Scientist
DMOP: Data Mining Optimization Ontology
C. Maria Keet, Agnieszka Lawrynowicz, Claudia d'Amato, Alexandros Kalousis, Phong Nguyen, Raul Palma, Robert Stevens, Melanie Hilario: The Data Mining OPtimization Ontology. J. Web Sem. 32: 43-53 (2015)
development started in e-LICO EU FP7 project (2009-2012)
detailed algorithm internal characteristics ('qualities')
Use case: meta-learning ('whitebox'), meta-mining, used to produce the Intelligent Discovery Assistant for RapidMiner
Exposé
Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, Geoffrey Holmes: Experiment databases - A new way to share, organize and learn from experiments. Machine Learning 87(2): 127-158 (2012)
re-uses OntoDM (at top-level) and DMOP (at bottom level)
superseded by OpenML DB schema
Use case: experiment databases, ExpML markup
Early work towards aligning DM/ML ontologies (2010)
DMO Ontology Jamboree, Josef Stefan Institute, Slovenia
MEX vocabulary
Diego Esteves, Diego Moussallem, Ciro Baron Neto, Tommaso Soru, Ricardo Usbeck, Markus Ackermann, Jens Lehmann: MEX vocabulary: a lightweight interchange format for machine learning experiments. SEMANTICS 2015: 169-176
lightweight interchange format
maps to PROV
Use case: annotating ML experiments and interchanging ML metadata
How to make existing DM/ML ontologies compatible?
W3C Machine Learning Schema Community Group (2015)
https://www.w3.org/community/ml-schema/
OpenML, Lorentz Center, Netherlands (2016)
First draft of ML Schema Core https://github.com/ML-Schema/core
Sharing beyond DM/ML domain
Mapping DMOP to workflow ontologies (Research Objects, OPMW)
(ROHub hosted by Poznan Supercomputing and Networking Center)
Semantic data mining: more information
Semantic data mining tutorial @ ECML/PKDD 2011
http://videolectures.net/ecmlpkdd2011_lavrac_vavpetic_mining/
peculiarities of the learning setting: Open World Assumption, what is a "truly semantic" similarity measure?, ...
methods, applications, tools
Summary
semantic data mining: data mining with ontologies as background/prior knowledge, most often from structured data
ontologies are best if engineered with use cases in mind
learning in description logics and clausal languages is hard; heuristics, dealing with peculiarities
Fr-ONT-Qu semantic pattern mining algorithm: theoretical properties, practical evaluation
use case: semantic meta-mining for constructing an Intelligent Data Mining Assistant
importance of interoperability (for scientific reproducibility, for inter-domain applications)
Acknowledgements
Polish National Science Center under the SONATA program "ARISTOTELES: Methodology and algorithms for automatic revision of ontologies in task based scenarios" (2014/13/D/ST6/02076) (2015-2018)
Foundation for Polish Science under the POMOST programme, cofinanced from European Union, Regional Development Fund (POMOST/2013-7/8) (2013-2015)
EU FP7 ICT-2007.4.4 (231519) "e-LICO: An e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science" (2009-2012)
Fr-ONT-Qu, meta-mining experiments done jointly with Jedrzej Potoniec
Contributors to the development of DMOP and/or other e-LICO infrastructure used in the research described in this presentation: Melanie Hilario, C. Maria Keet, Claudia d'Amato, Huyen Do, Simon Fischer, Dragan Gamberger, Lina Al-Jadir, Simon Jupp, Alexandros Kalousis, Joerg-Uwe Kietz, Petra Kralj Novak, Babak Mougouie, Phong Nguyen, Raul Palma, Floarea Serban, Robert Stevens, Anze Vavpetic, Jun Wang, Derry Wijaya, Adam Woznica