Berendt: Advanced databases, 2012 — Advanced databases: Core ideas of federated databases; Schema and ontology matching

1 Berendt: Advanced databases, 2012,1 Advanced databases – Core ideas of federated databases; Schema and ontology matching. Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science. Last update: 24 October 2012 2 Berendt: Advanced databases, 2012,2 Recall: Task from the previous week n Find 3 ontological statements on the Semantic Web, for example using Sindice. n Paraphrase what they mean. n Find 1 statement involving owl:equivalentClass. What problems are likely to arise when statements made about this class (with its two names) come from different knowledge sources? 3 Berendt: Advanced databases, 2012,3 Until now... n... we have looked into modelling n... we have seen how the languages RDF(S) and OWL allow us to combine different schemas and data n... we have seen how Linked Data on the Web uses HTTP as a connecting protocol/architecture n... we have assumed that such combinations can be done effortlessly (unique names etc.) n... we have looked at some interpretation problems associated with these procedures n Now we need to ask: l What are (further) challenges of such combinations? l What are approaches proposed to solve them? from the databases & the Semantic Web / ontologies fields from architectural and logical points of view 4 Berendt: Advanced databases, 2012,4 Motivation 1: Price comparison engines search & combine heterogeneous travel-agency DBs, which search & combine heterogeneous airline DBs 5 Berendt: Advanced databases, 2012,5 Motivation 2: Schemas coming from different languages n A river is a natural stream of water, usually freshwater, flowing toward an ocean, a lake, or another stream. In some cases a river flows into the ground or dries up completely before reaching another body of water. Usually larger streams are called rivers while smaller streams are called creeks, brooks, rivulets, rills, and many other terms, but there is no general rule that defines what can be called a river.
Sometimes a river is said to be larger than a creek,[1] but this is not always the case.[2] n Une rivière est un cours d'eau qui s'écoule sous l'effet de la gravité et qui se jette dans une autre rivière ou dans un fleuve, contrairement au fleuve qui se jette, lui, dans la mer ou dans l'océan. [A "rivière" is a watercourse that flows under gravity and empties into another rivière or into a fleuve, whereas a fleuve empties into the sea or the ocean.] n Een rivier is een min of meer natuurlijke waterstroom. We onderscheiden oceanische rivieren (in België ook wel stroom genoemd) die in een zee of oceaan uitmonden, en continentale rivieren die in een meer, een moeras of woestijn uitmonden. Een beek is de aanduiding voor een kleine rivier. Tussen beek en rivier ligt meestal een bijrivier. [A "rivier" is a more or less natural water stream; oceanic rivers (in Belgium also called a "stroom") empty into a sea or ocean, continental rivers into a lake, marsh, or desert; a "beek" is a small river, and a "bijrivier" (tributary) usually lies between beek and rivier.] 6 Berendt: Advanced databases, 2012,6 Motivation 3 (a): Are these the same entity? 7 Berendt: Advanced databases, 2012,7 Motivation 3 (b): Who is that? Merging identities Mickey Mouse 8 Berendt: Advanced databases, 2012,8 Motivation 3 (c): Who was that? Re-identification 9 Berendt: Advanced databases, 2012,9 High-level overview: Goals and approaches in data integration n Basic goal: Combine data/knowledge from different sources n Goal / emphasis can lie on finding correspondences between l the models (schema matching, ontology matching) l the instances (record linkage) n Techniques can leverage similarities between l schema/ontology-level information l instance information most of today An established problem in DB; a focus and challenge for LOD (owl:sameAs) 10 Berendt: Advanced databases, 2012,10 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Ontology matching, with Example BLOOMS Evaluating matching Core ideas of federated databases Involving the user: Explanations; mass collaboration 11 Berendt: Advanced databases, 2012,11 Overview n goal: interoperability through data integration: l combining heterogeneous data sources under a single query interface n A federated
database system is a type of meta-database management system (DBMS) which transparently integrates multiple autonomous database systems into a single federated database. n The constituent databases are interconnected via a computer network, and may be geographically decentralized. n Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the (sometimes daunting) task of merging together several disparate databases. n A federated database (or virtual database) is the fully-integrated, logical composite of all constituent databases in a federated database system. 12 Berendt: Advanced databases, 2012,12 Issues in federating data sources Interconnection and cooperation of autonomous and heterogeneous databases must address n Distribution n Autonomy n Heterogeneity 13 Berendt: Advanced databases, 2012,13 Architectures: Dealing differently with autonomy n Tightly coupled: global schema integration, e.g. data warehousing n More loosely coupled: federated databases with schema matching/mapping: l Global as View (GaV): the global schema is defined in terms of the underlying schemas l Local as View (LaV): the local schemas are defined in terms of the global schema 14 Berendt: Advanced databases, 2012,14 Issues in query processing n In both GaV and LaV systems, a user poses conjunctive queries over a virtual schema represented by a set of views, or "materialized" conjunctive queries. n Integration seeks to rewrite the queries represented by the views to make their results equivalent or maximally contained by our user's query. n This corresponds to the problem of answering queries using views. 
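The federated-database and GaV/LaV ideas above can be illustrated with a minimal, runnable sketch. All source names, schemas, and data rows below are invented for illustration: each source keeps its own schema, a GaV-style wrapper maps it into the global schema, and a query over the global (virtual) schema is answered by unfolding the wrappers.

```python
# Sketch of a federated (virtual) database: one query interface over two
# autonomous sources with different schemas. Names and data are invented.

# Source A: one airline DB's schema
SOURCE_A = [{"flight_no": "BA123", "fare_eur": 120},
            {"flight_no": "LH456", "fare_eur": 95}]

# Source B: another airline DB, same kind of data, different schema
SOURCE_B = [{"code": "SN789", "price": 110}]

# GaV-style wrappers: the global schema (flight, price) is defined
# in terms of the underlying source schemas.
def wrap_a():
    return [{"flight": r["flight_no"], "price": r["fare_eur"]} for r in SOURCE_A]

def wrap_b():
    return [{"flight": r["code"], "price": r["price"]} for r in SOURCE_B]

def federated_query(max_price):
    """Answer a query over the global schema by unfolding the wrappers;
    the sources stay autonomous, the integration is purely virtual."""
    return [r for w in (wrap_a, wrap_b) for r in w() if r["price"] <= max_price]

print(federated_query(115))  # flights from both sources costing at most 115
```

A LaV system would invert this: each source would be described as a view over the global schema, and answering a query would require rewriting it in terms of those views.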
15 Berendt: Advanced databases, 2012,15 An example developed in-house: SQI - PLQL n Purpose: For federated search in learning-object repositories n An approach with conceptual-level abstraction from data sources n Integratable data source types: l Relational, XML, IR systems, (search engine) Web services, search APIs n Full abstraction of user from data sources: l Yes n User-specific data source selection for integration: l Depends on application n User-specific data modeling for integration: l No n Explicit, queryable semantics: l (delegated to the sources: LOM etc.) 16 Berendt: Advanced databases, 2012,16 Heterogeneity n Heterogeneity is independent of location of data n When is an information system homogeneous? l Software that creates and manipulates data is the same l All data follows the same structure and data model and is part of a single universe of discourse n Different levels of heterogeneity l Different languages to write applications l Different query languages l Different models l Different DBMSs l Different file systems l Semantic heterogeneity etc.
17 Berendt: Advanced databases, 2012,17 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Ontology matching, with Example BLOOMS Evaluating matching Core ideas of federated databases Involving the user: Explanations; mass collaboration 18 Berendt: Advanced databases, 2012,18 The match problem (Running example 1) Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other 19 Berendt: Advanced databases, 2012,19 Running example 2 20 Berendt: Advanced databases, 2012,20 Motivation: application areas n Schema integration in multi-database systems n Data integration systems on the Web n Translating data (e.g., for data warehousing) n E-commerce message translation n P2P data management n Model management (tools for easily manipulating models of data) 21 Berendt: Advanced databases, 2012,21 Based on what information can the matchings/mappings be found? (work on the two running examples) 22 Berendt: Advanced databases, 2012,22 The match operator n Match operator: f(S1,S2) = mapping between S1 and S2 l for schemas S1, S2 n Mapping l a set of mapping elements n Mapping elements l elements of S1, elements of S2, mapping expression n Mapping expression l different functions and relationships 23 Berendt: Advanced databases, 2012,23 Matching expressions: examples n Scalar relations (=, <, ...) l S.HOUSES.location = T.LISTINGS.area n Functions l T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate) l T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state) n ER-style relationships (is-a, part-of,...) n Set-oriented relationships (overlaps, contains,...) n Any other terms that are defined in the expression language used 24 Berendt: Advanced databases, 2012,24 Matching and mapping 1. Find the schema match (declarative) 2.
Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, procedural) Example of result of step 2: n To create T.LISTINGS from S (simplified notation): area = SELECT location FROM HOUSES agent-name = SELECT name FROM AGENTS agent-address = SELECT concat(city,state) FROM AGENTS list-price = SELECT price * (1+fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id 25 Berendt: Advanced databases, 2012,25 Based on what information can the matchings/mappings be found? Rahm & Bernstein's classification of schema matching approaches 26 Berendt: Advanced databases, 2012,26 Challenges n Semantics of the involved elements often need to be inferred; solutions must often rely on (heuristic) cues in schema and data, which are unreliable l e.g., homonyms (area), synonyms (area, location) n Schema and data clues are often incomplete l e.g., date: date of what? n Global nature of matching: to choose one matching possibility, must typically exclude all others as worse n Matching is often subjective and/or context-dependent l e.g., does house-style match house-description or not?
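The two-step procedure of slide 24 — a declarative match, then a translation procedure that materializes T.LISTINGS from S — can be sketched in Python. The table and attribute names follow the running example; the data rows are invented for illustration.

```python
# Step 2 of matching-and-mapping: apply the declared mapping expressions
# to translate data from schema S into T.LISTINGS. Data rows are invented.
HOUSES = [{"location": "Leuven", "price": 200000, "agent_id": 7}]
AGENTS = [{"id": 7, "name": "An Peeters", "city": "Leuven",
           "state": "VB", "fee_rate": 0.03}]

def make_listings(houses, agents):
    """Mirror the SQL of slide 24: area = location,
    agent-address = concat(city, state),
    list-price = price * (1 + fee-rate), joining on agent-id = id."""
    listings = []
    for h in houses:
        a = next(ag for ag in agents if ag["id"] == h["agent_id"])
        listings.append({
            "area": h["location"],
            "agent_name": a["name"],
            "agent_address": f'{a["city"]}, {a["state"]}',
            "list_price": h["price"] * (1 + a["fee_rate"]),
        })
    return listings

print(make_listings(HOUSES, AGENTS))
```

Note how the declarative match (which attributes correspond) and the procedural mapping (how to compute the target values) are kept separate: the same match could be compiled into SQL, XQuery, or this Python loop.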
n Extremely laborious and error-prone process l e.g., Li & Clifton 2000: project at GTE telecommunications: 40 databases, 27K elements, no access to the original developers of the DB; estimated time for just finding and documenting the matches: 12 person-years n Ontologies often even bigger l For example Cyc: now (as of 2012) has > 500,000 concepts, ~ 5,000,000 assertions, > 26,000 relations 27 Berendt: Advanced databases, 2012,27 Semi-automated schema matching (1) Rule-based solutions n Hand-crafted rules n Exploit schema information + relatively inexpensive + do not require training + fast (operate only on schema, not data) + can work very well in certain types of applications & domains + rules can provide a quick & concise method of capturing user knowledge about the domain – cannot exploit data instances effectively – cannot exploit previous matching efforts (other than by re-use) 28 Berendt: Advanced databases, 2012,28 Semi-automated schema matching (2) Learning-based solutions n Rules/mappings learned from attribute specifications and statistics of data content (Rahm & Bernstein: instance-level matching) Exploit schema information and data n Some approaches: external evidence l Past matches l Corpus of schemas and matches (matchings in real-estate applications will tend to be alike) l Corpus of users (more details later in this slide set) + can exploit data instances effectively + can exploit previous matching efforts – relatively expensive – require training – slower (operate on data) – results may be opaque (e.g., neural network output) → need explanation components!
(more details later) 29 Berendt: Advanced databases, 2012,29 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Ontology matching, with Example BLOOMS Evaluating matching Core ideas of federated databases Involving the user: Explanations; mass collaboration 30 Berendt: Advanced databases, 2012,30 Overview (1) n Rule-based approach n Schema types: l Relational, XML n Metadata representation: l Extended ER n Match granularity: l Element, structure n Match cardinality: l 1:1, n:1 31 Berendt: Advanced databases, 2012,31 Overview (2) n Schema-level match: l Name-based: name equality, synonyms, hypernyms, homonyms, abbreviations l Constraint-based: data type and domain compatibility, referential constraints l Structure matching: matching subtrees, weighted by leaves n Re-use, auxiliary information used: l Thesauri, glossaries n Combination of matchers: l Hybrid n Manual work / user input: l User can adjust threshold weights 32 Berendt: Advanced databases, 2012,32 Basic representation: Schema trees Computation overview: 1. Compute similarity coefficients between elements of these graphs 2. Deduce a mapping from these coefficients 33 Berendt: Advanced databases, 2012,33 Computing similarity coefficients (1): Linguistic matching n Operates on schema element names (= nodes in schema tree) 1. Normalization n Tokenization (parse names into tokens based on punctuation, case, etc.) n e.g., Product_ID → {Product, ID} n Expansion (of abbreviations and acronyms) n Elimination (of prepositions, articles, etc.) 2. Categorization / clustering n Based on data types, schema hierarchy, linguistic content of names n e.g., real-valued elements, money-related elements 3.
Comparison (within the categories) n Compute linguistic similarity coefficients (lsim) based on a thesaurus (synonymy, hypernymy) n Output: Table of lsim coefficients (in [0,1]) between schema elements 34 Berendt: Advanced databases, 2012,34 How to identify synonyms and homonyms: Example WordNet 35 Berendt: Advanced databases, 2012,35 How to identify hypernyms: Example WordNet 36 Berendt: Advanced databases, 2012,36 Computing similarity coefficients (2): Structure matching n Intuitions: l Leaves are similar if they are linguistically & data-type similar, and if they have similar neighbourhoods l Non-leaf elements are similar if linguistically similar & have similar subtrees (where leaf sets are most important) n Procedure: 1. Initialize structural similarity of leaves based on data types n Identical data types: compat. = 0.5; otherwise in [0,0.5] 2. Process the tree in post-order 3. StrongLink(leaf1, leaf2) iff their weighted similarity ≥ threshold 4. ... 37 Berendt: Advanced databases, 2012,37 The structure matching algorithm n Output: a 1:n mapping for leaves n To generate non-leaf mappings: 2nd post-order traversal 38 Berendt: Advanced databases, 2012,38 Matching shared types n Solution: expand the schema into a schema tree, then proceed as before n Can help to generate context-dependent mappings n Fails if a cycle of containment and IsDerivedFrom relationships is present (e.g., recursive type definitions) 39 Berendt: Advanced databases, 2012,39 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Ontology matching, with Example BLOOMS Evaluating matching Core ideas of federated databases Involving the user: Explanations; mass collaboration 40 Berendt: Advanced databases, 2012,40 Main ideas n A learning-based approach n Main goal: discover complex matches l In particular: functions such as T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate) T.LISTINGS.agent-address =
concat(S.AGENTS.city,S.AGENTS.state) n Works on relational schemas n Basic idea: reformulate schema matching as search 41 Berendt: Advanced databases, 2012,41 Architecture Specialized searchers discover certain types of complex matches → make search more efficient 42 Berendt: Advanced databases, 2012,42 Overview of implemented searchers 43 Berendt: Advanced databases, 2012,43 Example: The textual searcher For target attribute T.LISTINGS.agent-address: n Examine attributes and concatenations of attributes from S n Restrict examined set by analyzing textual properties l Data type information in schema, heuristics (proportion of non-numeric characters etc.) l Evaluate match candidates based on data correspondences, prune inferior candidates 44 Berendt: Advanced databases, 2012,44 Example: The numerical searcher For target attribute T.LISTINGS.list-price: n Examine attributes and arithmetic expressions over them from S n Restrict examined set by analyzing numeric properties l Data type information in schema, heuristics l Evaluate match candidates based on data correspondences, prune inferior candidates 45 Berendt: Advanced databases, 2012,45 Search strategy (1): Example textual searcher 1. Learn a (Naive Bayes) classifier text → class (agent-address or other) from the data instances in T.LISTINGS.agent-address 2. Apply this classifier to each match candidate (e.g., location, concat(city,state)) 3. Score of the candidate = average over instance probabilities 4. For expansion: beam search, expanding only the k top-scoring candidates 46 Berendt: Advanced databases, 2012,46 Search strategy (2): Example numeric searcher 1. Get value distributions of target attribute and each candidate 2. Compare the value distributions (Kullback-Leibler divergence measure) 3.
Score of the candidate = Kullback-Leibler measure 47 Berendt: Advanced databases, 2012,47 Evaluation strategies of implemented searchers 48 Berendt: Advanced databases, 2012,48 Pruning by domain constraints n Multiple attributes of S: attributes name and beds are unrelated → do not generate match candidates with these 2 attributes n Properties of a single attribute of T: the average value of num-rooms does not exceed 10 → use in evaluation of candidates n Properties of multiple attributes of T: lot-area and num-baths are unrelated → at match-selector level, clean up: l Example: T.num-baths = S.baths? T.lot-area = (S.lot-sq-feet / 43560) + 1.3e-15 * S.baths → based on the domain constraint, drop the term involving S.baths 49 Berendt: Advanced databases, 2012,49 Pruning by using knowledge from overlap data n When S and T share the same data n Consider fraction of data for which mapping is correct l e.g., house locations: l S.HOUSES.location overlaps more with T.LISTINGS.area than with T.LISTINGS.agent-address l Discard the candidate T.LISTINGS.agent-address = S.HOUSES.location, keep only T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state) 50 Berendt: Advanced databases, 2012,50 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Ontology matching, with Example BLOOMS Evaluating matching Core ideas of federated databases Involving the user: Explanations; mass collaboration 51 Berendt: Advanced databases, 2012,51 What is ontology matching (relative to schema matching)?
n same basic idea n but works on ontologies that are conceptual models (not on logical schemas such as relational tables or XML trees); emphasizes that concepts and relations need to be matched and mapped, and may treat these differently (Note: in the schema matching literature, it is not always clearly laid out whether the matched items come from a conceptual or a logical model; the toy examples above in particular are also conceptual) n In practice, some ontology matching tasks in fact work on such simple models (or simple subparts of models) that they do not differ at all from what we have seen so far l example: Anatomy task, see below in evaluation n Terminology: Also known as ontology alignment l See (Shvaiko & Euzenat, 2005) for more details 52 Berendt: Advanced databases, 2012,52 Recap: Rahm & Bernstein's classification of schema matching approaches 53 Berendt: Advanced databases, 2012,53 The methods that are important when the schema is in the foreground (which it is in ontologies!) 54 Berendt: Advanced databases, 2012,54 The extension by Shvaiko & Euzenat (2005) [Partial view] 55 Berendt: Advanced databases, 2012,55 (slide from last week) Special challenges on LOD ?!
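BLOOMS, introduced on the following slides, decides on alignments by comparing category trees built from Wikipedia for each concept sense. A deliberately simplified, Jaccard-style sketch conveys the idea — this is not the published BLOOMS formula, and the "sense trees" below are hard-coded node sets invented for illustration rather than trees harvested from Wikipedia:

```python
# Simplified tree-overlap matching in the spirit of BLOOMS: each concept
# gets a set of category nodes, and asymmetric overlap between the sets
# suggests a subClassOf (vs. equivalence) decision. Data is invented.

def overlap(tree_src, tree_tgt):
    """Fraction of the source tree's nodes that also occur in the target tree."""
    return len(tree_src & tree_tgt) / len(tree_src)

# Invented sense trees for two concepts from different ontologies:
t_river = {"River", "Stream", "BodyOfWater", "Geography"}
t_stream = {"Stream", "BodyOfWater", "Hydrology"}

o_st = overlap(t_stream, t_river)   # how much of 'stream' lies under 'river'
o_rs = overlap(t_river, t_stream)   # the other direction
# Asymmetric overlap hints at direction: if o_st > o_rs, propose
# stream subClassOf river (thresholding and tie-breaking omitted).
proposal = "subClassOf" if o_st > o_rs else "equivalent-or-none"
```

The real system additionally disambiguates senses, builds the trees by following Wikipedia's category hierarchy to a bounded depth, and aggregates over all sense-tree pairs; see Jain et al. (2010) in the references.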
56 Berendt: Advanced databases, 2012,56 BLOOMS Problem: on LOD relatively many mapping relations between instances (owl:sameAs), few between classes (e.g., owl:equivalentClass) Approach: search for relations between classes in different ontologies Goal: derive subClassOf relations between classes from different ontologies on LOD Approach: look up the concepts from the two ontologies in Wikipedia Classification in Shvaiko & Euzenat: a mixture between external / linguistic resource and external / upper-level formal ontologies (background knowledge) 57 Berendt: Advanced databases, 2012,57 BLOOMS: Matching steps 58 Berendt: Advanced databases, 2012,58 BLOOMS trees (examples) 59 Berendt: Advanced databases, 2012,59 BLOOMS tree for concept C with sense s (definition) 60 Berendt: Advanced databases, 2012,60 BLOOMS: Computing the overlap & deciding on an alignment For all sense trees T_s of C1 and all sense trees T_t of C2: 61 Berendt: Advanced databases, 2012,61 A classification of approaches See above 62 Berendt: Advanced databases, 2012,62 One area in which ontology alignment becomes particularly interesting: Natural language and cross-lingual integration (because this shows very nicely how concepts are not always aligned nicely) 63 Berendt: Advanced databases, 2012,63 Ocean Lake BodyOfWater River Stream Sea NaturallyOccurringWaterSource The water ontology [from the Costello & Jacobs OWL tutorial: 2003/html/presentations/JamesHendler/owl/OWL.ppt] Tributary Brook Rivulet Properties: feedsFrom: River Properties: emptiesInto: BodyOfWater (Functional) (Inverse Functional) (Inverse) Properties: containedIn: BodyOfWater (Transitive) Properties: connectsTo: NaturallyOccurringWaterSource (Symmetric) 64 Berendt: Advanced databases, 2012,64 How could this give rise to a mapping/matching problem? n A river is a natural stream of water, usually freshwater, flowing toward an ocean, a lake, or another stream.
In some cases a river flows into the ground or dries up completely before reaching another body of water. Usually larger streams are called rivers while smaller streams are called creeks, brooks, rivulets, rills, and many other terms, but there is no general rule that defines what can be called a river. Sometimes a river is said to be larger than a creek,[1] but this is not always the case.[2] n Une rivière est un cours d'eau qui s'écoule sous l'effet de la gravité et qui se jette dans une autre rivière ou dans un fleuve, contrairement au fleuve qui se jette, lui, dans la mer ou dans l'océan. [A "rivière" is a watercourse that flows under gravity and empties into another rivière or into a fleuve, whereas a fleuve empties into the sea or the ocean.] n Een rivier is een min of meer natuurlijke waterstroom. We onderscheiden oceanische rivieren (in België ook wel stroom genoemd) die in een zee of oceaan uitmonden, en continentale rivieren die in een meer, een moeras of woestijn uitmonden. Een beek is de aanduiding voor een kleine rivier. Tussen beek en rivier ligt meestal een bijrivier. [A "rivier" is a more or less natural water stream; oceanic rivers (in Belgium also called a "stroom") empty into a sea or ocean, continental rivers into a lake, marsh, or desert; a "beek" is a small river, and a "bijrivier" (tributary) usually lies between beek and rivier.] 65 Berendt: Advanced databases, 2012,65 Sometimes a class needs to restrict the range of a property Ocean Lake BodyOfWater River Stream Sea NaturallyOccurringWaterSource Tributary Brook Rivulet Fleuve Properties: emptiesInto: BodyOfWater Since Fleuve is a subclass of River, it inherits emptiesInto. The range for emptiesInto is any BodyOfWater. However, the definition of a Fleuve (French) is: "a River which emptiesInto a Sea". Thus, in the context of the Fleuve class we want the range of emptiesInto restricted to Sea. 66 Berendt: Advanced databases, 2012,66 Pretty standard: A class and an object property in OWL Note for nerds: Why does this use rdf:ID and not rdf:about (as FOAF does)? As for choosing between rdf:ID and rdf:about, you will most likely want to use the former if you are describing a resource that doesn't really have a meaningful location outside the RDF file that describes it.
Perhaps it is a local or convenience record, or even a proxy for an abstraction or real-world object (although I recommend you take great care describing such things in RDF as it leads to all sorts of metaphysical confusion; I have a practice of only using RDF to describe records that are meaningful to a computer). rdf:about is usually the way to go when you are referring to a resource with a globally well-known identifier or location. (http://www.ibm.com/developerworks/xml/library/x-tiprdfai.html) 67 Berendt: Advanced databases, 2012,67 Global vs Local Properties rdfs:range imposes a global restriction on the emptiesInto property, i.e., the rdfs:range value applies to River and all subclasses of River. As we have seen, in the context of the Fleuve class, we would like the emptiesInto property to have its range restricted to just the Sea class. Thus, for the Fleuve class we want a local definition of emptiesInto. 68 Berendt: Advanced databases, 2012,68 Defining emptiesInto (when used in Fleuve) to have allValuesFrom the Sea class ... naturally-occurring.owl (snippet) One way of specifying matching expressions in OWL... here: by model extension 69 Berendt: Advanced databases, 2012,69 Older brother Younger brother Older sister Younger sister What about this? Different languages have different (lexicalized) concept boundaries 70 Berendt: Advanced databases, 2012,70 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Ontology matching, with Example BLOOMS Evaluating matching Core ideas of federated databases Involving the user: Explanations; mass collaboration 71 Berendt: Advanced databases, 2012,71 How to compare? n Input: What kind of input data? (What languages? Only toy examples? What external information?) n Output: mapping between attributes or tables, nodes or paths? How much information does the system report?
n Quality measures: metrics for accuracy and completeness? n Effort: how much savings of manual effort, how quantified? l Pre-match effort (training of learners, dictionary preparation,...) l Post-match effort (correction and improvement of the match output) l How are these measured? 72 Berendt: Advanced databases, 2012,72 Match quality measures n Need a gold standard (the true match) n Measures from information retrieval: precision, recall, and the F-measure (standard choice: F1, i.e., α = 0.5) Quantifies post-match effort 73 Berendt: Advanced databases, 2012,73 Benchmarking n Do, Melnik, and Rahm (2003) found that evaluation studies were not comparable → need more standardized conditions (benchmarks) n Since 2004: competitions in ontology matching (more in the next session): l Test cases and contests at 74 Berendt: Advanced databases, 2012,74 Example: Tasks 2009 (various are re-used; 2012 is currently running) (excerpt; from the latest completed run) Expressive ontologies n anatomy l The anatomy real world case is about matching the Adult Mouse Anatomy (2744 classes) and the NCI Thesaurus (3304 classes) describing the human anatomy. n conference l Participants will be asked to find all correct correspondences (equivalence and/or subsumption correspondences) and/or 'interesting correspondences' within a collection of ontologies describing the domain of organising conferences (the domain being well understandable for every researcher). Results will be evaluated a posteriori in part manually and in part by data-mining techniques and logical reasoning techniques. There will also be evaluation against a reference mapping based on a subset of the whole collection. Directories and thesauri n fishery gears l features four different classification schemes, expressed in OWL, adopted by different fishery information systems in the FIM division of FAO.
An alignment performed on these 4 schemes should be able to spot equivalence, or a degree of similarity, between the fishing gear types and the groups of gears, so as to enable a future exercise of data aggregation across systems. Oriented matching n This track focuses on the evaluation of alignments that contain other mapping relations than equivalences. Instance matching n very large crosslingual resources l The purpose of this task (vlcr) is to match the Thesaurus of the Netherlands Institute for Sound and Vision (called GTAA, see below for more information) to two other resources: the English WordNet from Princeton University and DBpedia. 75 Berendt: Advanced databases, 2012,75 Mice and humans The anatomy real world case is about matching the Adult Mouse Anatomy (2744 classes) and the NCI Thesaurus (3304 classes) describing the human anatomy. (http://oaei.ontologymatching.org/2008/anatomy/) 76 Berendt: Advanced databases, 2012,76 Matching task and evaluation approach (http://oaei.ontologymatching.org/2007/anatomy/) We would like to gratefully thank Martin Ringwald and Terry Hayamizu (Mouse Genome Informatics), who provided us with a reference mapping for these ontologies. The reference mapping contains only equivalence correspondences between concepts of the ontologies. No correspondences between properties (roles) are specified. If your system also creates correspondences between properties or correspondences that describe subsumption relations, these results will not influence the evaluation (but can nevertheless be part of your submitted results). The results of your matching system will be compared to this reference alignment.
Therefore, all of the results have to be delivered in the format specified here. 77 Berendt: Advanced databases, 2012,77 Matching task and evaluation approach (http://oaei.ontologymatching.org/2011/oriented/index.html) n An increasing number of matchers are now capable of deriving mapping relations other than equivalence relations, such as subsumption, disjointness or named relations. n This is a necessity given that we need to compute alignments between ontologies at different granularity levels or between ontologies that elaborate on non-equivalent elements. The evaluation of such mappings was already addressed in the OAEI 2009 Oriented Matching track. [...] n The track aims also to report on evaluation methods and measures for subsumption mappings, in conjunction with the computation of equivalence mappings. n Targeting these goals, we have built new benchmark datasets that are described below. 78 Berendt: Advanced databases, 2012,78 (Some) results (http://oaei.ontologymatching.org/2009/results/anatomy/) 79 Berendt: Advanced databases, 2012,79 (Some) results (http://oaei.ontologymatching.org/2011/results/anatomy/) 80 Berendt: Advanced databases, 2012,80 BLOOMS: Evaluation on benchmarks (1) 81 Berendt: Advanced databases, 2012,81 BLOOMS: Evaluation on benchmarks (2) 82 Berendt: Advanced databases, 2012,82 BLOOMS: Evaluation on LOD datasets (1) 83 Berendt: Advanced databases, 2012,83 BLOOMS: Evaluation on LOD datasets (2) 84 Berendt: Advanced databases, 2012,84 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Ontology matching, with Example BLOOMS Evaluating matching Core ideas of federated databases Involving the user: Explanations; mass collaboration 85 Berendt: Advanced databases, 2012,85
Example in iMAP User sees ranked candidates: 1. List-price = price 2. List-price = price * (1 + fee-rate) Explanation: a) Both generated from numeric searcher, 2 ranked higher than 1 b) But: c) Match month-posted = fee-rate d) domain constraint: matches for month-posted and price do not share attributes e) → cannot match list-price to anything to do with fee-rate f) Why c)? g) Data instances of fee-rate were classified as of type date User corrects this wrong step (the misclassification in g); the rest is repaired accordingly 86 Berendt: Advanced databases, 2012,86 Background knowledge structure for explanation: dependency graph 87 Berendt: Advanced databases, 2012,87 MOBS: Using mass collaboration to automate data integration 1. Initialization: a correct but partial match (e.g. title = a1, title = b2, etc.) 2. Soliciting user feedback: User query → user must answer a simple question → user gets answer to initial query 3. Computing user weights (e.g., trustworthiness = fraction of correct answers to known mappings) 4. Combining user feedback (e.g., majority count) n Important: instant gratification (e.g., include the new field in the results page after a user has given helpful input) 88 Berendt: Advanced databases, 2012,88 Some issues of matching (not only) when it comes to individuals Is this the same entity?: n What does "the same" mean anyway? n (When) do we want these inferences? n What if the different sources contain contradictory information?
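The MOBS steps of slide 87 — computing user weights from answers on known mappings, then combining feedback by weighted majority — can be sketched as follows. The users, counts, and votes are all invented for illustration; the actual MOBS system is described in McCann et al. (2003) in the references.

```python
# Sketch of MOBS-style mass collaboration: weight each user by accuracy on
# known mappings, then combine yes/no votes on an unknown mapping by
# weighted majority. All users, counts, and votes are invented.

# Step 3: trustworthiness = fraction of correct answers to known mappings
answers_on_known = {        # user -> (correct answers, total answers)
    "u1": (9, 10),
    "u2": (4, 10),
    "u3": (7, 10),
}
weights = {u: c / t for u, (c, t) in answers_on_known.items()}

# Step 4: combine feedback on a candidate match, e.g. "does title = a1?"
votes = {"u1": True, "u2": False, "u3": True}   # users' yes/no answers

yes = sum(w for u, w in weights.items() if votes[u])
no = sum(w for u, w in weights.items() if not votes[u])
accepted = yes > no   # accept the mapping if the weighted 'yes' side wins
```

Here the unreliable user u2 is outvoted even though a plain (unweighted) majority would also accept; with more noisy users, the weighting is what keeps the combined answer stable.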
n Task for next week: l Find 2 statements about two classes that are the same that (in combination) don't make sense l Describe why this combination doesn't make sense l Find 2 statements about two instances that are the same that (in combination) don't make sense l Describe why this combination doesn't make sense l Find 2 statements about two instances that are the same that (in combination) are or may be undesired l Describe why this may be undesired 89 Berendt: Advanced databases, 2012,89 Outlook The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Ontology matching, with Example BLOOMS Evaluating matching Core ideas of federated databases Involving the user: Explanations; mass collaboration Dealing with contradictions 90 Berendt: Advanced databases, 2012,90 References / background reading; acknowledgements Rahm, E. & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10. Doan, A. & Halevy, A.Y. (2004). Semantic Integration Research in the Database Community: A brief survey. AI Magazine. Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic Schema Matching with Cupid. In Proc. of the 27th VLDB Conference. Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. In Proc. of SIGMOD. Shvaiko, P. & Euzenat, J.: A Survey of Schema-based Matching Approaches. Journal on Data Semantics. Also interesting: Noy, N.: Semantic Integration: A Survey of Ontology-based Approaches. SIGMOD Record, 33(3). Jain, P., Hitzler, P., Sheth, A.P., Verma, K., & Yeh, P.Z. (2010). Ontology alignment for linked open data. In Proceedings of the 9th International Semantic Web Conference (ISWC'10), Peter F. Patel-Schneider, Yue Pan, Pascal Hitzler, Peter Mika, and Lei Zhang (Eds.), Vol. Part I.
Springer-Verlag, Berlin, Heidelberg. Do, H.-H., Melnik, S., & Rahm, E. (2003). Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems: NODe 2002, Web- and Database-Related Workshops, Erfurt, Germany, October 7-10, Revised Papers (pp ). Springer. McCann, R., Doan, A., Varadarajan, V., & Kramnik, A. (2003). Building data integration systems via mass collaboration. In Proc. International Workshop on the Web and Databases (WebDB). Please see the PowerPoint slide-specific notes for URLs of used pictures and formulae.