©2004, Philippe Cudré-Mauroux Semantic Interoperability for Global Information Systems Microsoft Research Asia 08.20.04 Philippe Cudré-Mauroux Distributed Information Systems Laboratory (LSIR) Swiss Federal Institute of Technology, Lausanne (EPFL)


©2004, Philippe Cudré-Mauroux

Semantic Interoperability for Global Information Systems

Microsoft Research Asia 08.20.04

Philippe Cudré-Mauroux

Distributed Information Systems Laboratory (LSIR)
Swiss Federal Institute of Technology, Lausanne (EPFL)


Outline

I. Classical Information Integration (overview)
  – Global Schema
  – Multidatabase Language Approach
  – Federated Databases

II. Information Integration in the Large
  – Context: The Semantic Web
  – Shared ontologies
  – The Chatty Web

III. State of the Art in Ontology Alignment (overview)

IV. Semantic Integration in a Large-Scale Image Sharing Scenario


I. Classical Information Integration

• Goal: providing uniform access to multiple heterogeneous information sources

• More than data exchange (e.g., ASCII, EDI, XML)

• Old and difficult problem, with well-known (partial) solutions


Global Schema Integration

• Merge multiple databases into one global database

• Performed by human expert
• Time consuming and error prone
• Local autonomy lost
• Static solution

Global schema: Book(ISBN, Title, Price, Author), Author(Name, ISBN)
S1: Book(ISBN, Title), Author(Name, ISBN)
S2: Livre(ISBN, Prix, Titre), Auteur(Prenom, Nom, ISBN)


Multidatabase Language Approach

• No attempt at integrating schemas
• Language (e.g., MSQL) used to integrate information sources at run-time
• Simple example:

    Use S1, S2
    Select Titre
    From S1.Book, S2.Livre
    Where S1.Book.ISBN = S2.Livre.ISBN

• Not transparent
• Heavy burden on (expert) users
• Global queries subject to local changes
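The multidatabase idea can be tried out with SQLite, whose ATTACH DATABASE statement plays a role loosely comparable to `Use S1, S2`. This is only a sketch: SQLite is not MSQL, and the sample rows, the aliases `b` and `l`, and the helper name `french_titles` are assumptions made for illustration.

```python
import sqlite3

# One connection with two independent in-memory databases attached as s1 and s2.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS s1")
conn.execute("ATTACH DATABASE ':memory:' AS s2")

# Local schemas as on the slides; the rows are hypothetical sample data.
conn.execute("CREATE TABLE s1.Book (ISBN TEXT, Title TEXT)")
conn.execute("CREATE TABLE s2.Livre (ISBN TEXT, Prix REAL, Titre TEXT)")
conn.execute("INSERT INTO s1.Book VALUES ('111', 'Database Basics')")
conn.execute("INSERT INTO s1.Book VALUES ('222', 'Networks')")
conn.execute("INSERT INTO s2.Livre VALUES ('111', 30.0, 'Introduction aux bases')")


def french_titles(connection):
    """Join the two sources at query time; no integrated schema is built."""
    rows = connection.execute(
        "SELECT l.Titre FROM s1.Book AS b JOIN s2.Livre AS l ON b.ISBN = l.ISBN"
    ).fetchall()
    return [row[0] for row in rows]


titles = french_titles(conn)
```

The query names both sources and both local table names explicitly, which is why the approach is not transparent and why a local schema change breaks the global query.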


Federated Databases

• Idea: Each information source exports a schema specifying shared relations

• Tight-coupling:
  – Global schema integration on all export schemas (cf. global schema integration)

• Loose-coupling:
  – Dynamic add / drop, e.g., by creating views (logical relations)


GAV (Global as View)

• Global (mediated) schema is expressed as a view on local schemas

Mediated schema: Book(ISBN, Title, Author)
S1: Book(ISBN, Title), Author(Name, ISBN)
S2: […]
S3: […]

    Create VIEW Book As
    Select Book.ISBN, Title, Name As Author
    From S1.Book, S1.Author
    Where Book.ISBN = Author.ISBN
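A minimal GAV sketch, assuming SQLite as a stand-in mediator: the mediated relation Book is defined as a view over the S1 relations, so a query on the mediated schema is answered by unfolding the view. The table names `S1_Book` and `S1_Author`, the sample rows, and the helper `authors_of` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source S1, modeled here as prefixed tables (hypothetical sample data).
conn.execute("CREATE TABLE S1_Book (ISBN TEXT, Title TEXT)")
conn.execute("CREATE TABLE S1_Author (Name TEXT, ISBN TEXT)")
conn.execute("INSERT INTO S1_Book VALUES ('111', 'Database Basics')")
conn.execute("INSERT INTO S1_Author VALUES ('Alice', '111')")
conn.execute("INSERT INTO S1_Author VALUES ('Bob', '111')")

# GAV: the mediated relation is a view over the local relations.
conn.execute(
    """
    CREATE VIEW Book AS
    SELECT b.ISBN AS ISBN, b.Title AS Title, a.Name AS Author
    FROM S1_Book AS b JOIN S1_Author AS a ON b.ISBN = a.ISBN
    """
)


def authors_of(connection, title):
    """Query the mediated schema only; the view definition does the unfolding."""
    rows = connection.execute(
        "SELECT Author FROM Book WHERE Title = ? ORDER BY Author", (title,)
    ).fetchall()
    return [row[0] for row in rows]
```

For example, `authors_of(conn, "Database Basics")` reads from the mediated Book relation without mentioning S1 at all. Adding a source under GAV means editing the view definition.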


LAV (Local as View)

• Local schemas are expressed as views on the global schema

Mediated schema: Book(ISBN, Title, Author)
S1: Book(ISBN, Title), Author(Name, ISBN)
S2: […]
S3: […]

    Create VIEW S1.Book As
    Select ISBN, Title
    From Book
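A deliberately small LAV sketch: each source is described as a projection of the mediated Book relation, and the mediator picks the sources whose description covers the attributes a query asks for. The source `S2.Book` and the function `sources_answering` are assumptions for illustration; full LAV query answering (rewriting a query over several views at once) is considerably more involved than this single-source check.

```python
# Mediated relation and its attributes, as on the slide.
GLOBAL_BOOK = ("ISBN", "Title", "Author")

# LAV: every source is described as a view (here just a projection)
# over the mediated relation. S2.Book is a hypothetical second source.
SOURCE_VIEWS = {
    "S1.Book": ("ISBN", "Title"),
    "S2.Book": ("ISBN", "Author"),
}


def sources_answering(wanted):
    """Return the sources whose view exposes every requested attribute."""
    unknown = set(wanted) - set(GLOBAL_BOOK)
    if unknown:
        raise ValueError(f"not in the mediated schema: {sorted(unknown)}")
    return sorted(
        name for name, attrs in SOURCE_VIEWS.items() if set(wanted) <= set(attrs)
    )
```

Under LAV, adding a source only adds one entry to the source descriptions; the mediated schema and the other descriptions stay untouched.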


LAV / GAV (cont.)

• Transparent access to heterogeneous databases in the federation

• Local autonomy is (usually) preserved
• Query processing through query reformulation
• Requires global agreement on the mediated schema (tight semantic coupling)
• Does not scale well


II. Information Integration in the Large

• Goal: providing uniform access to many heterogeneous information sources

• Traditional approaches are inadequate
  – Lack of adaptability
  – Lack of transparency
  – Lack of scalability

• Hot research area


Some Applications

• Agent communication
• Web services integration
• Information retrieval from heterogeneous databases
• Catalog matching
• P2P information sharing
• Personal information delivery
• Vertical information publishing


General Context: The Semantic Web

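[Figure: Semantic Web layer stack, bottom to top: Unicode and URI; XML + NS + xmlschema; RDF + rdfschema; Ontology vocabulary; Logic; Proof; Trust. Digital Signature runs alongside the stack. Side labels: Self-desc. doc., Data, Data, Rules.]

• Providing machine-processable data to the Web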


RDF/RDFS 2’ Overview

• RDF triple: (Subject, Property, Object)

• RDF Schemas
  – Classes of resources
  – Classes of properties
  – Constraints on the subject (domain) or object (range)
  – Subclassing

• Extensible!
  – Full-fledged ontological language: OWL
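To make the triple model and RDFS subclassing concrete, here is a small sketch that stores triples as tuples and follows rdfs:subClassOf transitively. The `ex:` resources and the helper names are hypothetical; this is not an RDF library, only an illustration of the data model.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    prop: str
    obj: str


RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def superclasses(cls, triples):
    """All classes reachable from cls through rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for t in triples:
            if t.subject == current and t.prop == SUBCLASS_OF and t.obj not in found:
                found.add(t.obj)
                frontier.append(t.obj)
    return found


def types_of(resource, triples):
    """Declared types of a resource plus the types implied by subclassing."""
    direct = {t.obj for t in triples if t.subject == resource and t.prop == RDF_TYPE}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred


# Hypothetical example graph.
graph = [
    Triple("ex:photo1", RDF_TYPE, "ex:Photo"),
    Triple("ex:Photo", SUBCLASS_OF, "ex:Image"),
    Triple("ex:Image", SUBCLASS_OF, "ex:Document"),
]
```

With this graph, `types_of("ex:photo1", graph)` contains ex:Photo, ex:Image and ex:Document, although only the first was stated explicitly.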


Example: CreativeCommons

<rdf:RDF xmlns="http://web.resource.org/cc/"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="http://example.org/gnomophone.mp3">
    <dc:title>Compilers in the Key of C</dc:title>
    <dc:description>A lovely classical work on compiling code.</dc:description>
    <dc:creator><Agent><dc:title>Yo-Yo Dyne</dc:title></Agent></dc:creator>
    <dc:rights><Agent><dc:title>Gnomophone</dc:title></Agent></dc:rights>
    <dc:date>1842</dc:date>
    <dc:format>audio/mpeg</dc:format>
    <dc:type rdf:resource="http://purl.org/dc/dcmitype/Sound" />
    <dc:source rdf:resource="http://example.net/gnomovision.mov" />
    <license rdf:resource="http://creativecommons.org/licenses/by-nc-nd/2.0/" />
    <license rdf:resource="http://www.eff.org/IP/Open_licenses/eff_oal.html" />
  </Work>

  <License rdf:about="http://creativecommons.org/licenses/by-nc-nd/2.0/">
    <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
    <permits rdf:resource="http://web.resource.org/cc/Distribution" />
    <requires rdf:resource="http://web.resource.org/cc/Notice" />
    <requires rdf:resource="http://web.resource.org/cc/Attribution" />
    <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
  </License>
</rdf:RDF>
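As a rough illustration of what "machine-processable" means here, the license part of this example can be read with Python's standard XML parser. This sketch treats the RDF/XML as plain XML, which works for this particular serialization but is not a general RDF parser; the function name `license_terms` and the abridged document are assumptions.

```python
import xml.etree.ElementTree as ET

CC = "http://web.resource.org/cc/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Abridged copy of the License description from the example.
DOC = """<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <License rdf:about="http://creativecommons.org/licenses/by-nc-nd/2.0/">
    <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
    <permits rdf:resource="http://web.resource.org/cc/Distribution" />
    <requires rdf:resource="http://web.resource.org/cc/Notice" />
    <requires rdf:resource="http://web.resource.org/cc/Attribution" />
    <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
  </License>
</rdf:RDF>"""


def license_terms(xml_text):
    """Group the permits / requires / prohibits statements of each License."""
    root = ET.fromstring(xml_text)
    terms = {}
    for lic in root.findall(f"{{{CC}}}License"):
        for child in lic:
            kind = child.tag.split("}")[1]  # strip the "{namespace}" prefix
            target = child.get(f"{{{RDF}}}resource")
            terms.setdefault(kind, []).append(target.rsplit("/", 1)[1])
    return terms


terms = license_terms(DOC)
```

A program can now decide, without human interpretation, that commercial use is prohibited while reproduction and distribution are permitted.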


Semantic Interoperability in The Semantic Web

• Common ontologies provide for shared context
  – Requires global agreement!
    • Intractable standardization effort!
    • Back to stage 1…

• Two plausible solutions:
  – Agreed-upon corpora of basic concepts
    • IEEE SUMO
    • Stanford TAP
    • …
  – Local federation of ontologies fostering global interoperability
    • EPFL Chatty Web
    • U. Washington Piazza
    • …

• Complementary approaches


The Chatty Web

[Figure: a network of peers (a lab in Trondheim, the EMBLChange site at Cambridge, the SwissProt site at Geneva, a lab at MIT) through which a query posted at EPFL is propagated. EMBLChange peers use "species, …"; SwissProt peers use "authors, titles, organism, …"; other peers use "authors, …". Links between peers are labeled with attribute translations: organism / species, species / organism, organism / authors.]

• Local translations enabling global agreements

• Analyzing transitive closures of local mappings
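The idea of analyzing chains of local mappings can be sketched as follows: translate an attribute along a cycle of peers and check whether it comes back unchanged. All mappings, peer names and function names below are hypothetical, and this is only an illustration of the principle, not the Chatty Web algorithm itself.

```python
def translate(attribute, path):
    """Push an attribute through a chain of local mappings; None if it is lost."""
    current = attribute
    for mapping in path:
        current = mapping.get(current)
        if current is None:
            return None
    return current


def cycle_feedback(attribute, cycle):
    """Classify what happens to an attribute after a full cycle of translations."""
    result = translate(attribute, cycle)
    if result is None:
        return "lost"
    return "consistent" if result == attribute else "inconsistent"


# Hypothetical pairwise mappings between three peers.
epfl_to_embl = {"organism": "species"}
embl_to_swissprot = {"species": "organism"}
swissprot_to_epfl_good = {"organism": "organism"}
swissprot_to_epfl_bad = {"organism": "authors"}

good_cycle = [epfl_to_embl, embl_to_swissprot, swissprot_to_epfl_good]
bad_cycle = [epfl_to_embl, embl_to_swissprot, swissprot_to_epfl_bad]
```

Here `cycle_feedback("organism", good_cycle)` reports a consistent cycle, while the same call on `bad_cycle` reports an inconsistency, which signals that at least one mapping along the cycle is suspicious. No peer needs a global schema to perform this check.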


On Translation Links (ontology mappings)


III. State of the Art on Ontology Alignment

• Problem: given two ontologies, each describing a set of discrete entities, find the relationships holding between those entities

• Alignments can then be used to foster interoperability locally

• Difficult problem (fully automatic solutions?)

• Active area of research


Local Ontology Alignment Techniques

1. Terminological methods
  – String-based
  – Language-based
    • Intrinsic
    • Extrinsic
    • Multilingual

2. Structural methods
  – Internal
  – External

3. Others
  – Extensional
  – Semantic
  – User feedback


1. Terminological Methods

• String-based: compare labels of entities
  – (Sub-)string equality
  – Edit distances
  – Token-based distances (e.g., TF/IDF on substrings)

• Language-based
  – Intrinsic
    • Terminological matching with morphological / syntactic analysis (allomorphies)
  – Extrinsic
    • Use of external resources (e.g., WordNet synsets)
  – Multilingual methods
    • Match terms in different languages
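As an example of a string-based method, the sketch below computes the Levenshtein edit distance between two labels and turns it into a similarity in [0, 1]. The normalization by the longer label and the case folding are assumptions made for this illustration; alignment systems differ in how they normalize.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def label_similarity(a, b):
    """Case-insensitive similarity in [0, 1], normalized by the longer label."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

On the schemas used earlier, "Title" and "Titre" differ by a single substitution, so they score high, whereas "Price" and "Prix" are further apart even though they mean the same thing. This is where language-based and multilingual methods come in.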


2. Structural Methods

• Internal (constraint-based):
  – Data-based domain comparison
  – Multiplicities / properties comparison
  – Similarity between collections

• External
  – Mereologic structures
  – Taxonomic structures
  – Relations between similar entities


3. Other

• Extensional methods
  – Extension: set of instances of a class
  – Jaccard similarity: $\mathrm{Jaccard}(A, B) = \frac{P(A \cap B)}{P(A \cup B)}$
  – Similarity-based extension comparison

• Semantic methods
  – Based on model-theoretic semantics
  – SAT problem (e.g., subsumption)

• User feedback
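For finite extensions, the Jaccard similarity can be estimated by counting shared instances. A minimal sketch, assuming each class extension is given as a collection of instance identifiers (the identifiers in the usage note are hypothetical):

```python
def jaccard(instances_a, instances_b):
    """Share of instances common to both classes among all their instances."""
    a, b = set(instances_a), set(instances_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty extensions
    return len(a & b) / len(union)
```

For example, two classes with extensions {img1, img2, img3} and {img2, img3, img4} share two of four instances, so `jaccard` returns 0.5. The measure only works when the two ontologies actually describe some common instances.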


A Handful of Systems

• APrompt (Stanford) [T,I,S,U]
• Cupid (Microsoft Research) [T,I,S]
• Bibster (U. Karlsruhe) [T,I,S]
• Glue (U. Washington) [E]
• S-Match (U. Trento) [T,S,M]
• …

Typically: a mix of techniques

Legend: T = Terminological, I = Internal structure, S = external Structure, E = Extensional, M = seMantic, U = User


IV. Semantic Integration in a Large-Scale Image Sharing Scenario

• Problem: retrieve a specific image from a large collection of shared images

• So far: most applications mix content-based (CB) and text analysis
  – CB image analysis provides a low-level, objective representation of an image
    • Good for comparing image features
    • Not so good w.r.t. end-user needs expressed in natural language (N.L.)
  – Surrounding text / filenames sometimes provide a high-level, subjective view of the image
    • Incomplete, out-of-context description
    • Good w.r.t. N.L. (cf. Google Images)


Potential Opportunity

• Emerging applications make use of high-level, local and semi-contextualized image metadata
  – Structured metadata (Photoshop Album, XML)
  – Ontological metadata (RDF, Adobe XMP)
  – Type-based metadata (Microsoft WinFS)

• Paradigm shift from the old metadata standards (e.g., keywords, EXIF)
  – Extensible formats

• Personal conjecture:
  – Metadata will be prominent in a few years
  – Huge opportunity for image retrieval


Structured Metadata

• Ex.: Photoshop Album
• Hierarchy of tags
• Stored in a relational, proprietary, local database
• Non-exportable


Ontological Metadata (1)

• Ex.: Extensible Metadata Platform (XMP)
• Subset of RDF/S
• Metadata might be embedded into the file
• Supported by a wide range of Adobe applications
  – Adobe® Acrobat®
  – Adobe FrameMaker®
  – Adobe GoLive®
  – Adobe Illustrator®
  – Adobe InCopy®
  – Adobe InDesign®
  – Adobe LiveMotion™
  – Adobe Photoshop®
  – Adobe Document Server
  – Adobe Graphics Server
  – Version Cue™


Ontological Metadata (2)

• Ex.: Photoshop XMP schema


Type-Based Metadata (1)

• New file-system for Longhorn (NTFS+++)
• No more hierarchies (i.e., folders) but metadata
• Items – Attributes – Relationships – Schemas – Sub-Schemas (extensions)
  – Déjà vu?


Type-Based Metadata (2)

• Ex.: image schema in WinFS


Observations

• So far, all applications using these metadata are local
  – It is a typical semantic interoperability problem!
  – Efficient, distributed WinFS is not for tomorrow…

• Image metadata will always be incomplete and subjective

• #images >> #peers >> #schemas

• All these formats can be formally described by a subset of Description Logics
  – Use them all and in the darkness bind them!


Outline of my Project

• Objective: large-scale image retrieval framework taking advantage of metadata

• Outline
  – Import images
  – Import metadata
  – Extract low-level features (thanks to Lei :-)
  – Store everything in a common, scalable representation
  – Export data in a shared repository
    • SQL Server
    • P2P network (SP2 ?)
  – Infer metadata / schema mappings locally
  – Cross-validate mappings
  – Cluster peers / images vis-à-vis their subjective views


Specificities

• Different metadata models
• Incompleteness of metadata (e.g., WinFS dangling links)
• Metadata sparseness
• Few (but widely-used) core-classes
• Many custom extensions
• Many resources
• Low-level representation of the resources
• Embedded user feedback

Unique application


Finding Mapping Candidates (sketch)

• [T,I,U] U-Inference based on mutual information (scalable!)

[Figure: peers exchanging schemas, metadata, low-level features and user feedback.]
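The slide gives no details on how mutual information is used, so the following is only a generic illustration of the quantity itself: given the values two peers assign to the same images under two different attributes, a high mutual information suggests the attributes are candidates for a mapping. The function name and the sample value pairs are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information I(X;Y) in bits from co-occurring (x, y) values."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    total = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        total += (count / n) * math.log2(count * n / (count_x[x] * count_y[y]))
    return total


# Values given by two peers to the same four images (hypothetical data).
organism_vs_species = [("cat", "felis"), ("cat", "felis"),
                       ("dog", "canis"), ("dog", "canis")]
organism_vs_author = [("cat", "ann"), ("cat", "bob"),
                      ("dog", "ann"), ("dog", "bob")]
```

In this toy data, the first pair of attributes determines each other completely (1 bit of mutual information), while the second pair is independent (0 bits), so only the first would be proposed as a mapping candidate. The computation needs only value counts over shared images, which keeps it local to the peers involved.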


Cross-Validating Mappings (sketch)

• [S,M] Cross-validation based on graph partitioning, semantic gossiping or SAT techniques

Ref.: Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. Jiying Wang, Ji-Rong Wen, Frederick H. Lochovsky, Wei-Ying Ma. VLDB 2004


Conclusions

• Leveraging local metadata produced by end-users
  – Complex problem! Good heuristics could take years to be developed…

• Local communications / computations
  – Scalability

• Hopefully, better results than keywords / low-level analyses even for simple heuristics
  – Take advantage of context

• Images given local semantics
• Analyze the dynamics of the overall system
• Objectivity vs. subjectivity of interpretation becomes a measurable quality


References

• RDF/S: http://www.w3.org/

• XMP: http://www.adobe.com/products/xmp/

• WinFS: http://msdn.microsoft.com/Longhorn

• Chatty Web: http://lsirpeople.epfl.ch/cudre/

• Ontology Alignment: knowledgeweb D.2.2.3.– (send me an email to f-pcudre)