The Grand Challenge
The Integration System should:
- Provide flexible, homogeneous, and transparent access to several (possibly) distributed, autonomous, and heterogeneous sources.
- Answer queries expressed in terms of the global schema by means of the sources' data.
- Preserve the sources' data semantics.
- Automate the integration process as much as possible.
Why Is It So Difficult? Those Sources!
- Number of sources / size of the problem: the Web is large! New optimization techniques are needed (pipelining, distributed processing).
- Location of the sources / source discovery: does a source that supposedly fulfills my information needs even exist? Where is it located? (The Semantic Web helps: a "Google" for SW agents.)
- Heterogeneity of the sources:
  - syntactic heterogeneity (HTML, XML, RDF, RDBS, ORDBS, etc.)
  - semantic heterogeneity (class Painter =? Creator)
  - designation heterogeneity ("John Smith" =? http://foo-ns/Smith)
Those Sources! (2)
- Autonomy of the sources: sources may change their data and schemas (especially on the Internet), and they will not notify the integration system about these changes.
- Volatility of the sources: sources may come and go; the system should be flexible enough to recover from these changes with minimal effort (use flexible ways to map sources to the system, use adaptive query-answering techniques).
- Different source capabilities: some sources are more intelligent than others, e.g. they allow join-like queries to be performed on them; some sources are more restrictive than others, e.g. they require certain input (filling out some forms) before they provide results.
Related Fields
- Distributed databases: sources are homogeneous and not (so) autonomous; similarities at the optimization and execution level.
- Information retrieval: keywords only, no semantics; IR techniques help address the designation problem.
- Data mining: discovering "hidden" properties of data.
Approaches to Data Integration
- Materialized views: a global (materialized) schema; all data is pre-fetched into a central repository (this should be automated); querying works as with virtual views; "easy" querying once the data is "in house"; the system is independent from the sources; but the repository is difficult to keep up to date. Typical for intranet applications (e.g. enterprise data integration).
- Virtual views: a global (mediated) schema; data is retrieved on demand (no central repository); a query against the global schema is translated into sub-queries and shipped to the sources; always up to date. Typical for most WWW applications, e.g. shopbots and various internet portals.
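The contrast between the two approaches can be sketched in a few lines of toy code. The `fetch_*` functions below are hypothetical stand-ins for wrapped sources; this is a sketch of the trade-off, not a real integration engine.

```python
def fetch_source_a():
    # hypothetical wrapper around one source
    return [("mona lisa", "da vinci")]

def fetch_source_b():
    # hypothetical wrapper around another source
    return [("guernica", "picasso")]

class MaterializedIntegrator:
    """Pre-fetches all data into a central repository; queries run
    locally, but the copy goes stale if a source changes."""
    def __init__(self, sources):
        self.repository = [t for fetch in sources for t in fetch()]
    def query(self, pred):
        return [t for t in self.repository if pred(t)]

class VirtualIntegrator:
    """Keeps no repository; every query is shipped to the sources on
    demand, so answers are always up to date."""
    def __init__(self, sources):
        self.sources = sources
    def query(self, pred):
        return [t for fetch in self.sources for t in fetch() if pred(t)]

sources = [fetch_source_a, fetch_source_b]
m = MaterializedIntegrator(sources)
v = VirtualIntegrator(sources)
print(m.query(lambda t: t[1] == "picasso"))   # [('guernica', 'picasso')]
print(v.query(lambda t: t[1] == "picasso"))   # same tuples, fetched on demand
```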
Approaches to Data Integration
[Diagram, virtual views: a user query against the mediated schema reaches the integration system's integration engine (mediator), which ships sub-queries (Query 1, Query 2) over the Internet to sources with local schemas, an O/RDB and a Web source, each behind a wrapper.]
[Diagram, materialized views: a data-extraction component pulls data over the Internet from the same wrapped sources (O/RDB, Web source, each with a local schema) into a repository with a global schema inside the integration system; the user query runs against the repository.]
Some DB Theory: Schema, Conjunctive Queries
For example: give me students together with their advisors who took courses given by their advisors after the 1st trimester of 2003.
[Diagram, ER schema: Faculty(name, area) advises (*..*) Student(name); Faculty teaches (*..*) Course(c-number, title) in a trimester; Student is registered (*..*) for a Course in a trimester.]
DB Theory: SQL
SELECT Advises.prof, Advises.student
FROM Registered, Teaches, Advises
WHERE Registered.c-number = Teaches.c-number
  AND Registered.trimester = Teaches.trimester
  AND Advises.prof = Teaches.prof
  AND Advises.student = Registered.student
  AND Registered.trimester > "2003/1"

DATALOG
Q(prof, student) :- Registered(student, course, trimester), Teaches(prof, course, trimester), Advises(prof, student), trimester > "2003/1"
(Q(prof, student) is the head; the atoms after ":-" form the body.)
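To make the head/body reading concrete, here is a toy evaluation of the Datalog query over hypothetical sample relations: the body is a conjunction evaluated by joining the relations, and the head projects out (prof, student). The data values are invented for illustration.

```python
# Hypothetical sample data for the schema used above.
Registered = [("ann", "c101", "2003/2"), ("bob", "c102", "2002/3")]
Teaches    = [("smith", "c101", "2003/2"), ("jones", "c102", "2002/3")]
Advises    = [("smith", "ann"), ("jones", "bob")]

# Q(prof, student) :- Registered(student, course, trimester),
#                     Teaches(prof, course, trimester),
#                     Advises(prof, student), trimester > "2003/1"
Q = {(p, s)
     for (s, c, t) in Registered
     for (p, c2, t2) in Teaches
     for (p2, s2) in Advises
     if c == c2 and t == t2 and (p, s) == (p2, s2) and t > "2003/1"}

print(Q)   # {('smith', 'ann')}: Bob's course was before the cutoff
```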
DB Theory: Query Containment
A query Q' is contained in Q if for all databases D, the set of tuples computed for Q' is a subset of those computed for Q, i.e., Q' ⊆ Q iff ∀D: Q'(D) ⊆ Q(D).
A query Q' is equivalent to Q iff Q' ⊆ Q and Q ⊆ Q'.
Example:
Q'(prof, student) :- Registered(student, course, trimester), Teaches(prof, course, trimester), Advises(prof, student), trimester > "2003/2"
is contained in
Q(prof, student) :- Registered(student, course, trimester), Teaches(prof, course, trimester), Advises(prof, student), trimester > "2003/1"
Query Containment Check
1. Create a canonical database D that is the "frozen" body of Q'.
2. Compute Q(D).
3. If Q(D) contains the frozen head of Q', then Q' ⊆ Q; otherwise not.
Example:
Q': p3(x,y) :- arc(x,z), arc(z,w), arc(w,y)
Q:  path(x,y) :- arc(x,y)
    path(x,y) :- path(x,z), path(z,y)
1. Freeze Q' with, say, 0, 1, 2, 3 for x, z, w, y respectively: Q'(0,3) :- arc(0,1), arc(1,2), arc(2,3), so D = {arc(0,1), arc(1,2), arc(2,3)}.
2. Compute Q(D): Q(D) = {(0,1), (1,2), (2,3), (0,2), (1,3), (0,3)}.
3. The frozen head of Q' = (0,3) is in Q(D), therefore Q' ⊆ Q.
[Diagram: the chain x -arc-> z -arc-> w -arc-> y, with p3 spanning from x to y.]
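The three-step check can be run mechanically on this example. Since Q is recursive, Q(D) is computed by a naive bottom-up fixpoint (transitive closure); this is a minimal sketch, not a general containment checker.

```python
# 1. Freeze Q' with constants 0, 1, 2, 3 for x, z, w, y:
D = {(0, 1), (1, 2), (2, 3)}      # the arc facts of the canonical database
frozen_head = (0, 3)

# 2. Compute Q(D): path is the transitive closure of arc,
#    obtained by naive bottom-up fixpoint iteration.
path = set(D)
while True:
    new = {(a, c) for (a, b) in path for (b2, c) in path if b == b2} - path
    if not new:
        break
    path |= new

# 3. Containment holds iff the frozen head is in Q(D).
print(sorted(path))          # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(frozen_head in path)   # True, hence p3 is contained in path
```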
Query Reformulation Using Views
Problem: given a user query Q and view definitions V = {V1..Vn}:
- Q' is an equivalent rewriting of Q using V iff Q' refers only to views in V and Q' is equivalent to Q.
- Q' is a maximally-contained rewriting of Q using V iff Q' refers only to views in V, Q' ⊆ Q, and there is no rewriting Q1 such that Q' ⊂ Q1 ⊆ Q and Q1 ≢ Q.
Bored? How about some motivation? Why do we bother? Is it really worth it? Is anybody out there interested in integration?
Oh yeah, there is big money to be spent on integration.
How Big is Big?
Estimated annual integration costs + data quality costs worldwide: $1 trillion/year!(1,2)
...and this is just the beginning: "The Grand Challenge is now becoming mission critical."
1 Dr. M. Brodie, The Grand Challenge of Information Technology, invited talk, CAiSE 2002
2 A trillion has 12 or 18 zeros, depending on which side of the Atlantic you are on :)
Approaches to Map/Describe the Sources
- Global As View (GAV): the mediated schema is defined as a view over the source schemas.
- Local As View (LAV): sources are described as views over the mediated schema.
- Ontological approach: an explicit integration model instance is created that adheres to an integration model ontology.
Global As View (GAV)
[Diagram: Global Schema X above Source1 (Y) and Source2 (Z); X is defined as a view over Y and Z.]
Mapping example:
Faculty(name, area) -> Person(id, name), Expert(id, area)
Teaches(name, c-number, trimester) -> Person(id, name), Gives(id, c-number, trimester)
Query answering means unfolding the mappings/rules: Query Answering = Query Unfolding.
Query example: who from the area of "IS" was teaching in the trimester "2003/2"?
Q(name) :- Faculty(name, "IS"), Teaches(name, c, "2003/2")
unfolds to
Q(name) :- Person(id, name), Expert(id, "IS"), Gives(id, c, "2003/2")
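GAV unfolding can be sketched as simple rule rewriting. This is a minimal sketch assuming the two mappings above; for brevity the existential variable `id` is not renamed freshly per rule application, as a full unfolder would do.

```python
# GAV mappings as rewrite rules: each global relation expands into a
# conjunction of source atoms. Names follow the slide.
mappings = {
    "Faculty": (("name", "area"),
                [("Person", ("id", "name")), ("Expert", ("id", "area"))]),
    "Teaches": (("name", "c-number", "trimester"),
                [("Person", ("id", "name")),
                 ("Gives", ("id", "c-number", "trimester"))]),
}

def unfold(atom):
    """Replace one global-schema atom by its source-level conjunction."""
    rel, args = atom
    head_vars, body = mappings[rel]
    subst = dict(zip(head_vars, args))   # map rule head vars to query args
    return [(r, tuple(subst.get(v, v) for v in vs)) for (r, vs) in body]

# Q(name) :- Faculty(name, "IS"), Teaches(name, c, "2003/2")
query = [("Faculty", ("name", "IS")), ("Teaches", ("name", "c", "2003/2"))]
unfolded = [a for atom in query for a in unfold(atom)]
print(unfolded)
# [('Person', ('id', 'name')), ('Expert', ('id', 'IS')),
#  ('Person', ('id', 'name')), ('Gives', ('id', 'c', '2003/2'))]
```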
Local As View (LAV)
- Each data source is described by one or more views wrt the mediated schema.
- The views might not be complete, i.e., they may contain only a subset of the tuples satisfying their definition.
[Diagram: Global Schema Y; each source (X) is defined as a view over Y.]
Local As View (LAV)
Mapping example:
V1(student, c-number, trimester, title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number > "500", trimester > "2002/1"
V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, trimester)
V3(student, c-number) :- Registered(student, c-number, trimester), trimester < "1998/1"
V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < "2000/1"
User query example:
Q(s,c,p) :- Teaches(p,c,q), Registered(s,c,q), Course(c,t), c > "300", q > "1999/2"
Query Answering = Query Rewriting
Goal: reformulate a user query (against the mediated schema) into a query that refers directly to the available sources/views, i.e., find a maximally-contained rewriting of the user query in the views describing the sources.
How?
The Bucket Algorithm (1)
The main idea: the number of rewritings that need to be considered can be reduced by considering each sub-goal of the query in isolation.
- Input: a set of views. Output: a maximally-contained rewriting composed from (some of) the views.
- Informally, a view can be useful for a query if the set of relations it mentions overlaps with that of the query and it selects some of the attributes selected by the query; in addition, the constraints of the view must be equivalent to or weaker than those of the query.
The Bucket Algorithm (2)
Step 1: create a bucket for each sub-goal of Q and fill it with the views/sources that are relevant for that particular sub-goal. A view V is put in the bucket of a relation/sub-goal g if it contains this relation AND the constraints in V are compatible with (weaker than) those in Q.
Step 2: combine view relations from the buckets into conjunctive queries, each consisting of one conjunct from every bucket. Check containment wrt Q; if contained, add the rewriting to the answer. The result is a union of conjunctive rewritings.
The Bucket Algorithm, Example: Step 1
V1(student, c-number, trimester, title) :- Registered(student, c-number, trimester), Course(c-number, title), c-number > "500", trimester > "2002/1"
V2(student, prof, c-number, trimester) :- Registered(student, c-number, trimester), Teaches(prof, c-number, trimester)
V3(student, c-number) :- Registered(student, c-number, trimester), trimester < "1998/1"
V4(prof, c-number, title, trimester) :- Registered(student, c-number, trimester), Course(c-number, title), Teaches(prof, c-number, trimester), trimester < "2000/1"
Sub-goals: Teaches(p,c,q), Registered(s,c,q), Course(c,t); query constraints: c > "300", q > "1999/2"
Buckets:
Teaches(p,c,q):    V2(s',p,c,q), V4(p,c,t,q)
Registered(s,c,q): V1(s,c,q,t'), V2(s,p',c,q)
Course(c,t):       V1(s',c,q',t), V4(p',c,t,q')
Variable mappings: {p -> prof, c -> c-number, q -> trimester}, {s -> student, c -> c-number, q -> trimester}, {c -> c-number, t -> title}
The Bucket Algorithm, Example: Step 2
(View definitions, sub-goals, constraints, and buckets as on the previous slide.)
Q'(s,c,p) :- V2(s',p,c,q), V1(s,c,q,t'), V1(s',c,q',t) = V2(s,p,c,q), V1(s,c,q,t')
Q''(s,c,p) :- V2(s,p,c,q), V4(p,c,t',q)
Q' ⊆ Q and Q'' ⊆ Q, hence Q' ∪ Q'' ⊆ Q.
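Step 1 of the bucket algorithm can be sketched in a few lines. This is a much-simplified sketch: it checks only relation overlap and range-constraint compatibility, and omits the head-variable check (a view is unusable for a sub-goal whose needed variables are not exported in the view's head), which is why the slide's Registered bucket contains only V1 and V2 while this sketch also admits V4. The data structures paraphrase the example's views.

```python
# Each view: (relations mentioned in its body,
#             {attribute: ('<' or '>', bound)} range constraints).
views = {
    "V1": (["Registered", "Course"],
           {"c-number": (">", "500"), "trimester": (">", "2002/1")}),
    "V2": (["Registered", "Teaches"], {}),
    "V3": (["Registered"], {"trimester": ("<", "1998/1")}),
    "V4": (["Registered", "Course", "Teaches"],
           {"trimester": ("<", "2000/1")}),
}
query_subgoals = ["Teaches", "Registered", "Course"]
query_constraints = {"trimester": (">", "1999/2"), "c-number": (">", "300")}

def compatible(view_cons, query_cons):
    """Ranges are compatible unless an upper and a lower bound on the
    same attribute leave an empty intersection (string comparison,
    as on the slides)."""
    for attr, (op, bound) in view_cons.items():
        if attr in query_cons:
            q_op, q_bound = query_cons[attr]
            if op == "<" and q_op == ">" and bound <= q_bound:
                return False
            if op == ">" and q_op == "<" and q_bound <= bound:
                return False
    return True

buckets = {g: [v for v, (rels, cons) in views.items()
               if g in rels and compatible(cons, query_constraints)]
           for g in query_subgoals}
print(buckets)
# V3 is ruled out everywhere: trimester < "1998/1" conflicts with q > "1999/2"
```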
GAV versus LAV
- GAV: adding/deleting a source means changing the global schema (which is defined by all its rewriting rules); "easy" query reformulation.
- LAV: adding/deleting a source means just adding/deleting its views; the global schema stays intact; "difficult" query reformulation.
The road doesn't end here; there are still problems to tackle:
- Performance issues: the size of the problem (remember?), network behavior (delays, bursty data).
- Source discovery.
- Schema discovery.
- Approximate answers: a less precise answer is often better than no (precise) answer at all.
- Choosing the right sources wrt user demands: not everybody has the same evaluation criterion for an answer, especially when multimedia are involved (e.g. most "vivid", highest resolution, fastest, most reliable, etc.).
A Light at the End of the Tunnel?
- Pipelining, parallelism, adaptivity
- Ontologies, the Semantic Web, agents
- Emerging standards, e.g. RDF(S)
- Combination of IR techniques and data integration, use of BBNs
- Programmable mediators
Programmable Mediators
The main idea:
- Sources are clustered (based on the information they offer), and the order in which they are consulted within a cluster is not "hard-wired" but flexible, based on their annotations.
- An RDFS-based, application-independent integration meta-ontology (IMO) and its application-dependent specialization.
- Sources are mapped and annotated as RDF instantiations of the IMO.
[Diagram: the Integration Model Ontology and an application-dependent Integration Model Specialization. The IMO (RDFS) includes classes such as Articulation (with From/To path expressions), Decoration (e.g. Reliability, ResponseTime), Comparator, PathExpression (which starts, begins, and ends at Nodes and Edges, i.e. rdfs:Resource and rdf:Property), PrimaryNode, ProcInstruction, and Transformers (e.g. Literal2String, Sn2UriTrans), linked by properties such as follow, backtrack, target, applies, obtainedFrom, producedBy, idByURI, idByValue, source, and srcAddress.]
Programmable Mediators (2)
- Sources are mapped by articulations to the mediated schema.
- Each such mapping can be annotated with decoration(s) rating that mapping for a concrete ordering criterion (or criteria).
- When the mediator encounters "equivalent" sources, it chooses those with the better "score" in the query-dependent criterion.
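Decoration-based selection can be sketched as follows. This is a toy sketch: the dictionary keys are invented, the second articulation for cm:Painting is hypothetical (added so there is a choice to make), and only the 0.9/0.5 decoration values and the source addresses follow the example diagrams.

```python
# Articulations annotated with decoration scores for one criterion.
articulations = [
    {"maps": "cm:Painting", "source": "http://www...ac", "reliability": 0.9},
    {"maps": "cm:Painting", "source": "http://www...ae", "reliability": 0.5},  # hypothetical
]

def best_source(concept, criterion, arts):
    """Among articulations mapping the same mediated concept, pick the
    source whose decoration scores highest for the query's criterion."""
    candidates = [a for a in arts if a["maps"] == concept]
    return max(candidates, key=lambda a: a[criterion])["source"]

print(best_source("cm:Painting", "reliability", articulations))
# http://www...ac
```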
[Diagram, example: articulations a1_1 and a1_2 use path expressions (pe_from1/pe_to1, pe_from2/pe_to2) to relate cm:Technique and cm:tname to ae:ArtTechnique and ae:title, and cm:exemplified_by / cm:Artifact / cm:ArtPiece to ae:usedTechnique, in the source with srcAddress http://www...ae; they carry decorations d1_1 and d1_2 with value 0.5. Articulation a2_1 uses pe_from3/pe_to3 to relate cm:Painting, cm:aname, and cm:picture to ac:Painting, ac:title, and ac:visualized (an Image) in the source with srcAddress http://www...ac; it carries decoration d2_1 with value 0.9.]
The End
Thank you for your attention!
Those who fell asleep: WAKE UP! It’s time to go home :)