36
Bridging Different Data Representations Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 11, 2004

11/11 Slides

  • Upload
    tess98

  • View
    356

  • Download
    2

Embed Size (px)

Citation preview

Page 1: 11/11 Slides

Bridging Different Data Representations

Zachary G. IvesUniversity of Pennsylvania

CIS 550 – Database & Information Systems

November 11, 2004

Page 2: 11/11 Slides

2

A Special Type of Query: Conjunctive Queries

A single Datalog rule with no “Ç,” “:,” “8” can express select, project, and join – a conjunctive query

Conjunctive queries are possible to reason about statically (Note that we can write CQ’s in other languages, e.g., SQL!)

We know how to “minimize” conjunctive queriesAn important simplification that can’t be done for general SQL

We can test whether one conjunctive query’s answers always contain another conjunctive query’s answers (for ANY instance)

Why might this be useful?

Page 3: 11/11 Slides

3

Example of Containment

Suppose we have two queries:

q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCIS(C), Course(C, “DB & Info Systems”)

q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)

Intuitively, q1 must contain the same or fewer answers vs. q2: It has all of the same conditions, except one extra conjunction

(i.e., it’s more restricted) There’s no union or any other way it can add more data

We can say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course}

Page 4: 11/11 Slides

4

Wrapping up Datalog…

We’ve seen a new language, Datalog It’s basically a glorified DRC with a special feature,

recursion It’s much cleaner than SQL for reasoning about … But negation (as in the DRC) poses some

challenges

We’ve seen that a particular kind of query, the conjunctive query, is written naturally in Datalog Conjunctive queries are possible to reason about We can minimize them, or check containment Conjunctive queries are very commonly used in our

next problem, data integration

Page 5: 11/11 Slides

5

A Problem

We’ve seen that even with normalization and the same needs, different people will arrive at different schemas

In fact, most people also have different needs! Often people build databases in isolation, then want

to share their data Different systems within an enterprise Different information brokers on the Web Scientific collaborators Researchers who want to publish their data for others to

use This is the goal of data integration: tie together

different sources, controlled by many people, under a common schema

Page 6: 11/11 Slides

6

Building a Data Integration System

Create a middleware “mediator” or “data integration system” over the sources Can be warehoused (a data warehouse) or virtual Presents a uniform query interface and schema Abstracts away multitude of sources; consults them for

relevant data Unifies different source data formats (and possibly schemas) Sources are generally autonomous, not designed to be

integrated Sources may be local DBs or remote web sources/services Sources may require certain input to return output (e.g.,

web forms): “binding patterns” describe these

Page 7: 11/11 Slides

7

Data Integration System / Mediator

Typical Data Integration Components

Mediated Schema

Wrapper Wrapper Wrapper

SourceRelations

Mappingsin Catalog

SourceCatalog

Query Results

Page 8: 11/11 Slides

8

Typical Data Integration Architecture

Reformulator

QueryProcessor

SourceCatalog

Wrapper Wrapper Wrapper

Query

Query over sources

SourceDescrs.

Queries +bindings Data in mediated format

Results

Page 9: 11/11 Slides

9

Challenges of Mapping Schemas

In a perfect world, it would be easy to match up items from one schema with another Every table would have a similar table in the other schema Every attribute would have an identical attribute in the other

schema Every value would clearly map to a value in the other schema

Real world: as with human languages, things don’t map clearly! May have different numbers of tables – different

decompositions Metadata in one relation may be data in another Values may not exactly correspond It may be unclear whether a value is the same

Page 10: 11/11 Slides

10

A Few Simple Examples

Movie(Title, Year, Director, Editor, Star1, Star2)

Movie(Title, Year, Director, Editor, Star1, Star2)

PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)

MotionPicture(ID, Title, Year)Participant(ID, Name, Role)CustI

DCustName

1234 Ives, Z.

PennID

EmpName

46732 Zachary Ives

Page 11: 11/11 Slides

11

How Do We Relate Schemas?

General approach is to use a view to define relations in one schema (typically either the mediated schema or the source schema), given data in the other schema This allows us to “restructure” or “recompose +

decompose” our data in a new way

We can also define mappings between values in a view We use an intermediate table defining

correspondences – a “concordance table” It can be filled in using some type of code, and

corrected by hand

Page 12: 11/11 Slides

12

Mapping Our Examples

Movie(Title, Year, Director, Editor, Star1, Star2)

Movie(Title, Year, Director, Editor, Star1, Star2)

PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)

MotionPicture(ID, Title, Year)Participant(ID, Name, Role)

CustID

CustName

1234 Ives, Z.

PennID

EmpName

46732 Zachary Ives

PieceOfArt(I, A, S, T, “Movie”) :- Movie(T, Y, A, _, S1, S2),ID = T || Y, S = S1 || S2

Movie(T, Y, D, E, S1, S2) :- MotionPicture(I, T, Y), Participant(I, D, “Dir”), Participant(I, E, “Editor”), Participant(I, S1, “Star1”), Participant(I, S2, “Star2”)

T1 T2

Need a concordance table from CustIDs to PennIDs

Page 13: 11/11 Slides

13

Two Important Approaches

TSIMMIS [Garcia-Molina+97] – Stanford Focus: semistructured data (OEM), OQL-based language

(Lorel) Creates a mediated schema as a view over the sources Spawned a UCSD project called MIX, which led to a company

now owned by BEA Systems Other important systems of this vein: Kleisli/K2 @ Penn

Information Manifold [Levy+96] – AT&T Research Focus: local-as-view mappings, relational model Sources defined as views over mediated schema

Requires a special Spawned Tukwila at Washington, and eventually a company as

well Led to peer-to-peer integration approaches (Piazza, etc.)

Page 14: 11/11 Slides

14

TSIMMIS and Information Manifold

Focus: Web-based queryable sources CGI forms, online databases, maybe a few RDBMSs Each needs to be mapped into the system – not as

easy as web search – but the benefits are significant vs. query engines

A few parenthetical notes: Part of a slew of works on wrappers, source profiling,

etc. The creation of mappings can be partly automated –

systems such as LSD, Cupid, Clio, … do this Today most people look at integrating large

enterprises (that’s where the $$$ is!) – Nimble, BEA, IBM

Page 15: 11/11 Slides

15

TSIMMIS

“The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew

An instance of a “global-as-view” mediation system

One of the first systems to support semi-structured data, which predated XML by several years

Page 16: 11/11 Slides

16

Semi-structured Data: OEM

Observation: given a particular schema, its attributes may be unavailable from certain sources – inherent irregularity

Proposal: Object Exchange Model, OEMOID: <label, type, value>

… How does it relate to XML? … What problems does OEM solve, and

not solve, in a heterogeneous system?

Page 17: 11/11 Slides

17

OEM Example

Show this XML fragment in OEM:

<book> <author>Bernstein</author> <author>Newcomer</author> <title>Principles of TP</title></book>

<book> <author>Chamberlin</author> <title>DB2 UDB</title></book>

Page 18: 11/11 Slides

18

Queries in TSIMMIS

Specified in OQL-style language called Lorel OQL was an object-oriented query language Lorel is, in many ways, a predecessor to XQuery

Based on path expressions over OEM structures:select bookwhere book.title = “DB2 UDB” and book.author = “Chamberlin”

This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Previous query restated =

for $b in AllData()/bookwhere $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”return $b

Page 19: 11/11 Slides

19

Query Answering in TSIMMIS

Basically, it’s view unfolding, i.e., composing a query with a view The query is the one being asked The views are the MSL templates for the

wrappers Some of the views may actually require

parameters, e.g., an author name, before they’ll return answers Common for web forms (see Amazon, Google, …) XQuery functions (XQuery’s version of views) support

parameters as well, so we’ll see these in action

Page 20: 11/11 Slides

20

A Wrapper Definition in MSL

Wrappers have templates and binding patterns ($X) in MSL:B :- B: <book {<author $X>}> // $$ = “select * from book where author=“ $X //

This reformats a SQL query over Book(author, year, title)

In XQuery, this might look like:define function GetBook($x AS xsd:string) as book {

for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x +”’”)

return <book>{$b/title}<author>$x</author></book>}

book

title author

… …

The union of GetBook’s results is unioned with others to form the view AllData()

Page 21: 11/11 Slides

21

How to Answer the Query

Given our query:for $b in AllData()/bookwhere $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”return $b

Find all wrapper definitions that: Contain output enough “structure” to match

the conditions of the query Or have already tested the conditions for us!

Page 22: 11/11 Slides

22

Query Composition with Views

We find all views that define book with author and title, and we compose the query with each:

define function GetBook($x AS xsd:string) as book {for $b in

sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”)

return <book> {$b/title} <author>{$x}</author></book>}for $b in AllData()/book

where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”return $b

book

title author

… …

Page 23: 11/11 Slides

23

Matching View Output to Our Query’s Conditions

Determine that $b/book/author/text() $x by matching the pattern on the function’s output:define function GetBook($x AS xsd:string) as book {

for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x +

“’”)return <book>{ $b/title } <author>{$x}</author></book>

}

let $x := “Chamberlin”for $b in GetBook($x)/bookwhere $b/title/text() = “DB2 UDB” return $b

book

title author

… …

Page 24: 11/11 Slides

24

The Final Step: Unfolding

let $x := “Chamberlin”for $b in (

for $b’ in sql(“Amazon.com”,

“select * from book where author=‘” + $x + “’”) return <book>{ $b/title }<author>{$x}</author></book> )/bookwhere $b/title/text() = “DB2 UDB” return $b

How do we simplify further to get to here?for $b in sql(“Amazon.com”,

“select * from book where author=‘Chamberlin’”)where $b/title/text() = “DB2 UDB” return $b

Page 25: 11/11 Slides

25

Virtues of TSIMMIS

Early adopter of semistructured data, greatly predating XML Can support data from many different kinds of

sources Obviously, doesn’t fully solve heterogeneity

problem

Presents a mediated schema that is the union of multiple views Query answering based on view unfolding

Easily composed in a hierarchy of mediators

Page 26: 11/11 Slides

26

Limitations of TSIMMIS’ Approach

Some data sources may contain data with certain ranges or properties

“Books by Aho”, “Students at UPenn”, … If we ask a query for students at Columbia, don’t

want to bother querying students at Penn… How do we express these?

Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema

Page 27: 11/11 Slides

27

An Alternate Approach:The Information Manifold (Levy et al.)

When you integrate something, you have some conceptual model of the integrated domain

Define that as a basic frame of reference, everything else as a view over it

“Local as View”

May have overlapping/incomplete sources Define each source as the subset of a query over

the mediated schema We can use selection or join predicates to specify

that a source contains a range of values:ComputerBooks(…) Books(Title, …, Subj), Subj =

“Computers”

Page 28: 11/11 Slides

28

The Local-as-View Model

The basic model is the following: “Local” sources are views over the mediated

schema Sources have the data – mediated schema is

virtual Sources may not have all the data from the

domain – “open-world assumption”

The system must use the sources (views) to answer queries over the mediated schema

Page 29: 11/11 Slides

29

Answering Queries Using Views

Assumption: conjunctive queries, set semantics Suppose we have a mediated schema:

author(aID, isbn, year), book(isbn, title, publisher) A conjunctive query might be:

q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”

Recall intuitions about this class of queries: Adding a conjunct to a query removes answers

from the result but never adds anyAny conjunctive query with at least the same

constraints & conjuncts will give valid answers

Page 30: 11/11 Slides

30

Query Answering

Suppose we have the query: q(a, t, p) :- author(a, i, _), book(i, t, p)

and sources:s1(a,t) author(a, i, _), book(i, t, p), t = “123”…s5(a,i) author(a, i, _)s6(i,p) book(i, t, p)

We want to compose the query with the source mappings – but they’re in the wrong direction!

Page 31: 11/11 Slides

31

Inverse Rules

We can take every mapping and “invert” it, though sometimes we may have insufficient information:

Ifs5(a,i) author(a, i, _)

then we can also infer that:author(a, i, ???) s5(a,i)

But how to handle the absence of the 3rd (publisher) attribute? We know that there must be AT LEAST one instance of ???

in author for each (a,i) pair So we might simply insert a NULL and define that NULL

means “unknown” (as opposed to “missing”)…

Page 32: 11/11 Slides

32

But NULLs Lose Information

Suppose we take these rules and ask for: q(a,t) :- author(a, i, _), book(i, t, p)

If we look at the rule:s1(a,t) author(a, i, _), book(i, t, p), t = “123”

Clearly q(a,t) s1(a,t)

But if apply our inversion procedure, we get:author(a, NULL, NULL) s1(a,t)book(NULL, t, p) s1(a,t), t = “123”

and there’s no way to kow to join author and book on NULL!We need “a special NULL for each a-t combo” so we can

figure out which a’s and t’s go together

Page 33: 11/11 Slides

33

The Solution: “Skolem Functions”

Skolem functions: Conceptual “perfect” hash functions Each function returns a unique, deterministic value

for each combination of input values Every function returns a non-overlapping set of

values (Skolem function F will never return a value that matches any of Skolem function G’s values)

Skolem functions won’t ever be part of the answer set or the computation – it doesn’t produce real values They’re just a way of logically generating “special

NULLs”

Page 34: 11/11 Slides

34

Revisiting Our Example

Query: q(a,t) :- author(a, i, _), book(i, t, p)

Mapping rule:s1(a,t) author(a, i, _), book(i, t, p), t = “123”

Inverse rules:author(a, f(a,t), NULL) s1(a,t)

book(f(a,t), t, p) s1(a,t), t = “123”

Expand the query as follows: q(a,t) :- author(a, i, NULL), book(i, t, p), i = f(a,t) q(a,t) :- s1(a,t), s1(a,t), t = “123”, i = f(a,t)

Page 35: 11/11 Slides

35

Query Answering Using Inverse Rules

Invert all rules using the procedures describedTake the query and the possible rule expansions and

execute them in a Datalog interpreter In the previous query, we expand with all combinations of

expansions of book and of author – every possible way of combining and cross-correlating info from different sources

Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent)

More efficient, but equivalent, algorithms now exist: Bucket algorithm [Levy et al.] MiniCon [Pottinger & Halevy] Also related: “chase and backchase” [Popa, Tannen,

Deutsch]

Page 36: 11/11 Slides

36

Summary of Data Integration

Local-as-view integration has replaced global-as-view as the standard More robust way of defining mediated schemas and sources Mediated schema is clearly defined, less likely to change Sources can be more accurately described

Methods exist for query reformulation, including inverse rulesIntegration requires standardization on a single schema

Can be hard to get consensus Today we have peer-to-peer data integration, e.g., Piazza [Halevy et

al.], Orchestra [Ives et al.], Hyperion [Miller et al.]

Some other aspects of integration were addressed in related papers Overlap between sources; coverage of data at sources Semi-automated creation of mappings and wrappers

Data integration capabilities in commercial products: BEA’s Liquid Data, IBM’s DB2 Information Integrator, numerous packages from middleware companies