The Grand Challenge
The Integration System should:
- Provide flexible, homogeneous, and transparent access to several (possibly) distributed, autonomous, and heterogeneous sources.
- Answer queries expressed in terms of the global schema by means of the sources' data.
- Preserve the sources' data semantics.
- Automate the integration process as much as possible.
Why Is It So Difficult? Those Sources!
- Number of sources / size of the problem: the Web is large! New optimization techniques are needed (pipelining, distributed processing).
- Location of the sources / source discovery: does a source that supposedly fulfills my information needs even exist? Where is it located? (The Semantic Web helps: a "Google" for SW agents.)
- Heterogeneity of the sources:
  - syntactic heterogeneity (HTML, XML, RDF, RDBS, ORDBS, etc.)
  - semantic heterogeneity (class Painter =? Creator)
  - designation heterogeneity ("John Smith" =? http://foo-ns/Smith)
Those Sources! (2)
- Autonomy of the sources: sources may change their data and schemas (especially on the Internet), and they will not notify the integration system about these changes.
- Volatility of the sources: sources may come and go; the system should be flexible enough to recover from these changes with minimal effort (use flexible ways to map sources to the system, use adaptive query-answering techniques).
- Different source capabilities: some sources are more intelligent than others, e.g. they allow join-like queries to be performed on them; some sources are more restrictive than others, e.g. they require certain input (filling out some forms) before they provide results.
Related Fields
- Distributed databases: sources are homogeneous and not (so) autonomous; similarities at the optimization and execution level.
- Information retrieval: keywords only, no semantics; IR techniques help address the designation problem.
- Data mining: discovering "hidden" properties of data.
Approaches to Data Integration
- Materialized views: a global (materialized) schema; all data is pre-fetched into a central repository (this should be automated); querying works as with virtual views; "easy" querying once the data is "in house"; the system is independent from the sources; but the repository is difficult to keep up to date. Typical for intranet applications (e.g. enterprise data integration).
- Virtual views: a global (mediated) schema; data is retrieved on demand (no central repository); a query against the global schema is translated into sub-queries and shipped to the sources; always up to date. Typical for most WWW applications, e.g. shopbots and various internet portals.
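The contrast between the two approaches can be sketched in a few lines of toy code. The `fetch_*` functions below are hypothetical stand-ins for wrapped sources; this is a sketch of the trade-off, not a real integration engine.

```python
def fetch_source_a():
    # hypothetical wrapper around one source
    return [("mona lisa", "da vinci")]

def fetch_source_b():
    # hypothetical wrapper around another source
    return [("guernica", "picasso")]

class MaterializedIntegrator:
    """Pre-fetches all data into a central repository; queries run
    locally, but the copy goes stale if a source changes."""
    def __init__(self, sources):
        self.repository = [t for fetch in sources for t in fetch()]
    def query(self, pred):
        return [t for t in self.repository if pred(t)]

class VirtualIntegrator:
    """Keeps no repository; every query is shipped to the sources on
    demand, so answers are always up to date."""
    def __init__(self, sources):
        self.sources = sources
    def query(self, pred):
        return [t for fetch in self.sources for t in fetch() if pred(t)]

sources = [fetch_source_a, fetch_source_b]
m = MaterializedIntegrator(sources)
v = VirtualIntegrator(sources)
print(m.query(lambda t: t[1] == "picasso"))   # [('guernica', 'picasso')]
print(v.query(lambda t: t[1] == "picasso"))   # same tuples, fetched on demand
```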
Approaches to Data Integration
[Diagram, virtual views: a user query against the mediated schema reaches the integration system's integration engine (mediator), which ships sub-queries (Query 1, Query 2) over the Internet to sources with local schemas, an O/RDB and a Web source, each behind a wrapper.]
[Diagram, materialized views: a data-extraction component pulls data over the Internet from the same wrapped sources (O/RDB, Web source, each with a local schema) into a repository with a global schema inside the integration system; the user query runs against the repository.]
Some DB Theory: Schema, Conjunctive Queries
For example: give me students together with their advisors who took courses given by their advisors after the 1st trimester of 2003.
[Diagram, ER schema: Faculty(name, area) advises (*..*) Student(name); Faculty teaches (*..*) Course(c-number, title) in a trimester; Student is registered (*..*) for a Course in a trimester.]
DB Theory: SQL
SELECT Advises.prof, Advises.student
FROM Registered, Teaches, Advises
WHERE Registered.c-number = Teaches.c-number
  AND Registered.trimester = Teaches.trimester
  AND Advises.prof = Teaches.prof
  AND Advises.student = Registered.student
  AND Registered.trimester > "2003/1"

DATALOG
Q(prof, student) :- Registered(student, course, trimester), Teaches(prof, course, trimester), Advises(prof, student), trimester > "2003/1"
(Q(prof, student) is the head; the atoms after ":-" form the body.)
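To make the head/body reading concrete, here is a toy evaluation of the Datalog query over hypothetical sample relations: the body is a conjunction evaluated by joining the relations, and the head projects out (prof, student). The data values are invented for illustration.

```python
# Hypothetical sample data for the schema used above.
Registered = [("ann", "c101", "2003/2"), ("bob", "c102", "2002/3")]
Teaches    = [("smith", "c101", "2003/2"), ("jones", "c102", "2002/3")]
Advises    = [("smith", "ann"), ("jones", "bob")]

# Q(prof, student) :- Registered(student, course, trimester),
#                     Teaches(prof, course, trimester),
#                     Advises(prof, student), trimester > "2003/1"
Q = {(p, s)
     for (s, c, t) in Registered
     for (p, c2, t2) in Teaches
     for (p2, s2) in Advises
     if c == c2 and t == t2 and (p, s) == (p2, s2) and t > "2003/1"}

print(Q)   # {('smith', 'ann')}: Bob's course was before the cutoff
```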
DB Theory: Query Containment
A query Q' is contained in Q if for all databases D, the set of tuples computed for Q' is a subset of those computed for Q, i.e., Q' ⊆ Q iff ∀D: Q'(D) ⊆ Q(D).
A query Q' is equivalent to Q iff Q' ⊆ Q and Q ⊆ Q'.
Example:
Q'(prof, student) :- Registered(student, course, trimester), Teaches(prof, course, trimester), Advises(prof, student), trimester > "2003/2"
is contained in
Q(prof, student) :- Registered(student, course, trimester), Teaches(prof, course, trimester), Advises(prof, student), trimester > "2003/1"
Query Containment Check
1. Create a canonical database D that is the "frozen" body of Q'.
2. Compute Q(D).
3. If Q(D) contains the frozen head of Q', then Q' ⊆ Q; otherwise not.
Example:
Q': p3(x,y) :- arc(x,z), arc(z,w), arc(w,y)
Q:  path(x,y) :- arc(x,y)
    path(x,y) :- path(x,z), path(z,y)
1. Freeze Q' with, say, 0, 1, 2, 3 for x, z, w, y respectively: Q'(0,3) :- arc(0,1), arc(1,2), arc(2,3), so D = {arc(0,1), arc(1,2), arc(2,3)}.
2. Compute Q(D): Q(D) = {(0,1), (1,2), (2,3), (0,2), (1,3), (0,3)}.
3. The frozen head of Q' = (0,3) is in Q(D), therefore Q' ⊆ Q.
[Diagram: the chain x -arc-> z -arc-> w -arc-> y, with p3 spanning from x to y.]
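The three-step check can be run mechanically on this example. Since Q is recursive, Q(D) is computed by a naive bottom-up fixpoint (transitive closure); this is a minimal sketch, not a general containment checker.

```python
# 1. Freeze Q' with constants 0, 1, 2, 3 for x, z, w, y:
D = {(0, 1), (1, 2), (2, 3)}      # the arc facts of the canonical database
frozen_head = (0, 3)

# 2. Compute Q(D): path is the transitive closure of arc,
#    obtained by naive bottom-up fixpoint iteration.
path = set(D)
while True:
    new = {(a, c) for (a, b) in path for (b2, c) in path if b == b2} - path
    if not new:
        break
    path |= new

# 3. Containment holds iff the frozen head is in Q(D).
print(sorted(path))          # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(frozen_head in path)   # True, hence p3 is contained in path
```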
Query Reformulation Using Views
Problem: given a user query Q and view definitions V = {V1..Vn}:
- Q' is an equivalent rewriting of Q using V iff Q' refers only to views in V and Q' is equivalent to Q.
- Q' is a maximally-contained rewriting of Q using V iff Q' refers only to views in V, Q' ⊆ Q, and there is no rewriting Q1 such that Q' ⊂ Q1 ⊆ Q and Q1 ≢ Q.
Bored? How about some motivation? Why do we bother? Is it really worth it? Is anybody out there interested in integration?
Oh yeah, there is big money to be spent on integration.
How Big is Big?
Estimated annual integration costs + data quality costs worldwide: $1 trillion/year!(1,2)
...and this is just the beginning: "The Grand Challenge is now becoming mission critical."
1 Dr. M. Brodie, The Grand Challenge of Information Technology, invited talk, CAiSE 2002
2 A trillion has 12 or 18 zeros, depending on which side of the Atlantic you are on :)
Approaches to Map/Describe the Sources
- Global As View (GAV): the mediated schema is defined as a view over the source schemas.
- Local As View (LAV): sources are described as views over the mediated schema.
- Ontological approach: an explicit integration model instance is created that adheres to an integration model ontology.
Global As View (GAV)
[Diagram: Global Schema X above Source1 (Y) and Source2 (Z); X is defined as a view over Y and Z.]
Mapping example:
Faculty(name, area) -> Person(id, name), Expert(id, area)
Teaches(name, c-number, trimester) -> Person(id, name), Gives(id, c-number, trimester)
Query answering means unfolding the mappings/rules: Query Answering = Query Unfolding.
Query example: who from the area of "IS" was teaching in the trimester "2003/2"?
Q(name) :- Faculty(name, "IS"), Teaches(name, c, "2003/2")
unfolds to
Q(name) :- Person(id, name), Expert(id, "IS"), Gives(id, c, "2003/2")
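GAV unfolding can be sketched as simple rule rewriting. This is a minimal sketch assuming the two mappings above; for brevity the existential variable `id` is not renamed freshly per rule application, as a full unfolder would do.

```python
# GAV mappings as rewrite rules: each global relation expands into a
# conjunction of source atoms. Names follow the slide.
mappings = {
    "Faculty": (("name", "area"),
                [("Person", ("id", "name")), ("Expert", ("id", "area"))]),
    "Teaches": (("name", "c-number", "trimester"),
                [("Person", ("id", "name")),
                 ("Gives", ("id", "c-number", "trimester"))]),
}

def unfold(atom):
    """Replace one global-schema atom by its source-level conjunction."""
    rel, args = atom
    head_vars, body = mappings[rel]
    subst = dict(zip(head_vars, args))   # map rule head vars to query args
    return [(r, tuple(subst.get(v, v) for v in vs)) for (r, vs) in body]

# Q(name) :- Faculty(name, "IS"), Teaches(name, c, "2003/2")
query = [("Faculty", ("name", "IS")), ("Teaches", ("name", "c", "2003/2"))]
unfolded = [a for atom in query for a in unfold(atom)]
print(unfolded)
# [('Person', ('id', 'name')), ('Expert', ('id', 'IS')),
#  ('Person', ('id', 'name')), ('Gives', ('id', 'c', '2003/2'))]
```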
Local As View (LAV)
- Each data source is described by one or more views wrt the mediated schema.
- The views might not be complete, i.e., they may contain only a subset of the tuples satisfying their definition.
[Diagram: Global Schema Y; each source (X) is defined as a view over Y.]
Local As View (LAV)
Mapping example:
V1(student, c-number, trimester, title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number > "500", trimester > "2002/1"
V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, trimester)
V3(student, c-number) :- Registered(student, c-number, trimester), trimester < "1998/1"
V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < "2000/1"
User query example:
Q(s,c,p) :- Teaches(p,c,q), Registered(s,c,q), Course(c,t), c > "300", q > "1999/2"
Query Answering = Query Rewriting
Goal: reformulate a user query (against the mediated schema) into a query that refers directly to the available sources/views, i.e., find a maximally-contained rewriting of the user query in the views describing the sources.
How?
The Bucket Algorithm (1)
The main idea: the number of rewritings that need to be considered can be reduced by considering each sub-goal of the query in isolation.
- Input: a set of views. Output: a maximally-contained rewriting composed from (some of) the views.
- Informally, a view can be useful for a query if the set of relations it mentions overlaps with that of the query and it selects some of the attributes selected by the query; in addition, the constraints of the view must be equivalent to or weaker than those of the query.
The Bucket Algorithm (2)
Step 1: create a bucket for each sub-goal of Q and fill it with the views/sources that are relevant for that particular sub-goal. A view V is put in the bucket of a relation/sub-goal g if it contains this relation AND the constraints in V are compatible with (weaker than) those in Q.
Step 2: combine view relations from the buckets into conjunctive queries, each consisting of one conjunct from every bucket. Check containment wrt Q; if contained, add the rewriting to the answer. The result is a union of conjunctive rewritings.
The Bucket Algorithm, Example: Step 1
V1(student, c-number, trimester, title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number > "500", trimester > "2002/1"
V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, trimester)
V3(student, c-number) :- Registered(student, c-number, trimester), trimester < "1998/1"
V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < "2000/1"
Sub-goals: Teaches(p,c,q), Registered(s,c,q), Course(c,t); query constraints: c > "300", q > "1999/2"
Buckets:
Teaches(p,c,q):    V2(s',p,c,q), V4(p,c,t,q)
Registered(s,c,q): V1(s,c,q,t'), V2(s,p',c,q)
Course(c,t):       V1(s',c,q',t), V4(p',c,t,q')
Variable mappings: {p -> prof, c -> c-number, q -> trimester}, {s -> student, c -> c-number, q -> trimester}, {c -> c-number, t -> title}
The Bucket Algorithm, Example: Step 2
(View definitions, sub-goals, constraints, and buckets as on the previous slide.)
Q'(s,c,p) :- V2(s',p,c,q), V1(s,c,q,t'), V1(s',c,q',t) = V2(s,p,c,q), V1(s,c,q,t')
Q''(s,c,p) :- V2(s,p,c,q), V4(p,c,t',q)
Q' ⊆ Q and Q'' ⊆ Q, hence Q' ∪ Q'' ⊆ Q.
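Step 1 of the bucket algorithm can be sketched in a few lines. This is a much-simplified sketch: it checks only relation overlap and range-constraint compatibility, and omits the head-variable check (a view is unusable for a sub-goal whose needed variables are not exported in the view's head), which is why the slide's Registered bucket contains only V1 and V2 while this sketch also admits V4. The data structures paraphrase the example's views.

```python
# Each view: (relations mentioned in its body,
#             {attribute: ('<' or '>', bound)} range constraints).
views = {
    "V1": (["Registered", "Course"],
           {"c-number": (">", "500"), "trimester": (">", "2002/1")}),
    "V2": (["Registered", "Teaches"], {}),
    "V3": (["Registered"], {"trimester": ("<", "1998/1")}),
    "V4": (["Registered", "Course", "Teaches"],
           {"trimester": ("<", "2000/1")}),
}
query_subgoals = ["Teaches", "Registered", "Course"]
query_constraints = {"trimester": (">", "1999/2"), "c-number": (">", "300")}

def compatible(view_cons, query_cons):
    """Ranges are compatible unless an upper and a lower bound on the
    same attribute leave an empty intersection (string comparison,
    as on the slides)."""
    for attr, (op, bound) in view_cons.items():
        if attr in query_cons:
            q_op, q_bound = query_cons[attr]
            if op == "<" and q_op == ">" and bound <= q_bound:
                return False
            if op == ">" and q_op == "<" and q_bound <= bound:
                return False
    return True

buckets = {g: [v for v, (rels, cons) in views.items()
               if g in rels and compatible(cons, query_constraints)]
           for g in query_subgoals}
print(buckets)
# V3 is ruled out everywhere: trimester < "1998/1" conflicts with q > "1999/2"
```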
GAV versus LAV
- GAV: adding/deleting a source means changing the global schema (which is defined by all its rewriting rules); "easy" query reformulation.
- LAV: adding/deleting a source means just adding/deleting its views; the global schema stays intact; "difficult" query reformulation.
The road doesn't end here; there are still problems to tackle:
- Performance issues: the size of the problem (remember?), network behavior (delays, bursty data).
- Source discovery.
- Schema discovery.
- Approximate answers: a less precise answer is often better than no (precise) answer at all.
- Choosing the right sources wrt user demands: not everybody has the same evaluation criterion for an answer, especially when multimedia are involved (e.g. most "vivid", highest resolution, fastest, most reliable, etc.).
A Light at the End of the Tunnel?
- Pipelining, parallelism, adaptivity
- Ontologies, the Semantic Web, agents
- Emerging standards, e.g. RDF(S)
- Combination of IR techniques and data integration, use of BBNs
- Programmable mediators
Programmable Mediators
The main idea:
- Sources are clustered (based on the information they offer), and the order in which they are consulted within a cluster is not "hard-wired" but flexible, based on their annotations.
- An RDFS-based, application-independent integration meta-ontology (IMO) and its application-dependent specialization.
- Sources are mapped and annotated as RDF instantiations of the IMO.
[Diagram: the Integration Model Ontology and an application-dependent Integration Model Specialization. The IMO (RDFS) includes classes such as Articulation (with From/To path expressions), Decoration (e.g. Reliability, ResponseTime), Comparator, PathExpression (which starts, begins, and ends at Nodes and Edges, i.e. rdfs:Resource and rdf:Property), PrimaryNode, ProcInstruction, and Transformers (e.g. Literal2String, Sn2UriTrans), linked by properties such as follow, backtrack, target, applies, obtainedFrom, producedBy, idByURI, idByValue, source, and srcAddress.]
Programmable Mediators (2)
- Sources are mapped by articulations to the mediated schema.
- Each such mapping can be annotated with decoration(s) rating that mapping for a concrete ordering criterion (or criteria).
- When the mediator encounters "equivalent" sources, it chooses those with the better "score" in the query-dependent criterion.
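Decoration-based selection can be sketched as follows. This is a toy sketch: the dictionary keys are invented, the second articulation for cm:Painting is hypothetical (added so there is a choice to make), and only the 0.9/0.5 decoration values and the source addresses follow the example diagrams.

```python
# Articulations annotated with decoration scores for one criterion.
articulations = [
    {"maps": "cm:Painting", "source": "http://www...ac", "reliability": 0.9},
    {"maps": "cm:Painting", "source": "http://www...ae", "reliability": 0.5},  # hypothetical
]

def best_source(concept, criterion, arts):
    """Among articulations mapping the same mediated concept, pick the
    source whose decoration scores highest for the query's criterion."""
    candidates = [a for a in arts if a["maps"] == concept]
    return max(candidates, key=lambda a: a[criterion])["source"]

print(best_source("cm:Painting", "reliability", articulations))
# http://www...ac
```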
[Diagram, example: articulations a1_1 and a1_2 use path expressions (pe_from1/pe_to1, pe_from2/pe_to2) to relate cm:Technique and cm:tname to ae:ArtTechnique and ae:title, and cm:exemplified_by / cm:Artifact / cm:ArtPiece to ae:usedTechnique, in the source with srcAddress http://www...ae; they carry decorations d1_1 and d1_2 with value 0.5. Articulation a2_1 uses pe_from3/pe_to3 to relate cm:Painting, cm:aname, and cm:picture to ac:Painting, ac:title, and ac:visualized (an Image) in the source with srcAddress http://www...ac; it carries decoration d2_1 with value 0.9.]
The End
Thank you for your attention!
Those who fell asleep: WAKE UP! It’s time to go home :)