Upload
madeline-oliver
View
215
Download
1
Tags:
Embed Size (px)
Citation preview
1
iTrails: Pay-as-you-go Information Integration in Datasapces
Authors: Salles, Dittrich et al. (ETH Zurich)Published in VLDB2007
Presenter: Jim7 Dec 2007
3
Background cont.
• Information integration– Purpose: query multiple data sources
Data source 2Data source 1 Data source 3
Query interface
Result Query
Global schema
Source schema
?
4
Background cont.
• Schema matching/mapping– Enable the reformulation of query on the
global schema to the query on source schema
Global schema
Source schema
?
m1:for d in deptexists d' in Tdeptwhere ( d'.dname=d.dname ^ d'.location=d.location)
m2:for d in dept, e in d.empsexists d' in Tdept, e' in d'.empswhere...A complete and complicate mapping
is necessary for the query.
5
text,links
Background cont.
• Search engine– Easier way for query multiple data sources
text,links
text,links
text,links
Crawling/indexing
Graph IR Search Engine
Query semantics are NOT precise.
global warming zurich
6
Motivation of iTrail
• Find a integration solution in-between these two extremes?
?Dataspace System
text,links
text,links
text,links
text,links
Graph IR Search Engine
Data IntegrationSystem
Temps Cities
CO2 Sunspots
... ...
...
...
Pay-as-you-goInformation Integration The more effort you pay,
the more query power you have.
7
iTrails Core Idea: Add Integration Hints Incrementally
• Step 1: Provide a search service over all the data– Use a general graph data model (iDM)
– Works for unstructured documents, XML, and relations
• Step 2: Add integration semantics via hints (trails) on top of the graph– Works across data sources, not only between sources
• Step 3: If more semantics needed, go back to step 2
• Impact:
– Smooth transition between search and data integration
– Semantics added incrementally improve precision / recall
8
iTrails: Defining Trails
• Basic Form of a Trail
QL [.CL] → QR [.CR]
• Intuition:– When I query for QL [.CL], you should also
query for QR [.CR]
Queries: keyword and path expressions
Attribute projections
Query example: global warming zurich
or //Temperatures/*[celsius > 10]
9
Trail examples: keyword query
20
15
14
BE
ZH
ZH
• Trail for Implicit Meaning: “When I query for global warming, you should also query for Temperature data above 10 degrees”
• Trail for an Entity: “When I query for zurich, you should also query for references of zurich as a region”
global warming → //Temperatures/*[celsius > 10]
DBServer
Temperaturescity celsius
Bern
Zurich
zurich → //*[region = “ZH”]
Uster
region
global warming zurich
9ZHZurich26-Sep
25-Sep
24-Sep
23-Sep
date
10
Trail Example: Deep Web Bookmarks
• Trail for a Bookmark: “When I query for train home, you should also query for the TrainCompany’s website with origin at ETH Uni and destination at Seilbahn Rigiblick”
train home
train home → //trainCompany.com//*[origin=“ETH Uni”
and dest =“Seilbahn Rigiblick”]
WebServer
11
Trail Examples: Thesauri, Dictionaries
car auto
car carro
car → auto
car → carrocarro → car
Laptop Email Server
• Trail for Thesauri: “When I query for car, you should also query for auto”
• Trails for Dictionary: “When I query for car, you should also query for carro and vice-versa”
12
Trail Examples: Schema Equivalences
• Trail for schema match on names: “When I query for Employee.empName, you should also query for Person.name”
• Trail for schema match on salaries: “When I query for Employee.salary, you should also query for Person.income”
EmployeeempName salary
Personname age income
//Employee//*.tuple.empName → //Person//*.tuple.name
//Employee//*.tuple.salary → //Person//*.tuple.incomeDB
Server
empId
SSN
13
How are Trails Created?
• Given by the user– Explicitly– Via Relevance Feedback
• (Semi-)Automatically– Information extraction techniques– Automatic schema matching– Ontologies and thesauri (e.g., wordnet)– User communities (e.g., trails on gene data,
bookmarks)
14
Uncertainty/certainty of trails
• Probabilistic Trails: – model uncertain trails– probabilities used to rank trails
• QL [.CL] → QR [.CR], 0 ≤ p ≤ 1
– Example: car → auto
• Scored Trails: – give higher value to certain trails– scoring factors used to boost scores of query results obtained by
the trail• QL [.CL] → QR [.CR], sf > 1
– Examples: • T1: weather → //Temperatures/*• T2: yesterday → //*[date = today()– 1]
p
p = 0.8
p = 1, sf = 3
p = 0.9, sf = 2
15
Rewriting Queries with Trails
• Canonical form of the query– Decompse the query into location step seperators (‘/’
or ‘//’) and predicates– Construct a tree based on the decompsed query
Query: “retrieve all information about the projects containing keyword Mike” //home/projects//*[“Mike”]
U
Mike
U
home projects(simplified form for presentation)
16
Rewriting Queries with Trails cont.
U
weather yesterday
(1) Matching
T2: yesterday → //*[date = today() – 1]
(2) Transformation
U
weather
yesterday
U//*[date = today() – 1]
(3) Merging
T2
matches
QueryWeather yesterday
Trail
Rewriting method: Matching => Transformation => Merging
“yesterday” is containedIn the query
17
Rewriting Queries on the canonical form of the query
Trails:
NOT the final result!The rewriting based on ψ7 and ψ9 are omitted.
18
Problem: Recursive Matches
... UU
weather
yesterday
U//*[date = today() – 1]
T2: yesterday →
//*[date = today() – 1]
New query still matches T2,
so T2 could be applied
againU
weather U
yesterday
U//*[date = today() – 1]
//*[date = today() – 1]U
//*[date = today() – 1]
//*[date = today() – 1]...
Infinite recursion
T2
matches
T2
matches
19
Problem: Recursive Matches cont.
U
weather
yesterday
U//*[date = today() – 1]
Trails may be mutually recursive
T3: //*.tuple.date → //*.tuple.modified
U
weather U
yesterday
//*[date = today() – 1]
T10: //*.tuple.modified → //*.tuple.date
U//*[modified = today() – 1]
U
weather Uyesterday
//*[date = today() – 1]
U
//*[modified = today() – 1]
U //*[date = today() – 1]
We again match T3
and enter an infinite loop
T3 matches
T10 matches
20
U
weather yesterday
First Level
U
weatheryesterday
//Temperatures/*
UU
//*[date = today() – 1]
U
weather yesterday
//Temperatures/*
UU
//*[modified = today() – 1]
UU
//*[received = today() – 1]
//*[date = today() – 1]
SecondLevel
T1
matches
T2
matches
T3, T4 match
Solution: Multiple Match Coloring Algorithm
T1: weather → //Temperatures/*T2: yesterday → //*[date = today() – 1]T3: //*.tuple.date → //*.tuple.modifiedT4: //*.tuple.date → //*.tuple.received
21
Multiple Match Coloring Algorithm Analysis
• Problem:– MMCA is exponential in number of levels
• Solution: Trail Pruning– Prune by number of levels– Prune by top-K trails matched in each level
• Ranking by probability/certainity/weight
– Both
• Indexing techniques– Similar with inverted list indexing
• Given a query, materialize the query results (node ids)
– Materialization options• Left/right side of the trails
22
Experiment setup
• Test base: iMeMex Dataspace System
• Criterias in evaluation– Quality: Top-K Precision and Recall– Performance: Use of Materialization– Scalability: Query-rewrite Time vs. Number of Trails
• iTrail scenario– Scenario 1: Few High-quality Trails– Scenario 2: Many Low-quality Trails
• Configured iMeMex to act in three modes– Baseline: Graph / IR search engine– iTrails: Rewrite search queries with trails– Perfect Query: Semantics-aware query
23
Experiment setup cont.
• Trails and queries
max original tree size: 14max final tree size: 35max # of trails applied: 5
24
Quality: Top-K Precision and Recall
SearchEngine misses relevantresults
Q3: pdf yesterday
SearchQuery is partially
semantics-aware
Q13: to = raimund.grube@
enron.com
Scenario 1:few high-qualitytrails(18 trails)
Queries
perfect query Perfect Query always has precision and recall
equal to 1
K = 20
Much improvement comparing with “baseline”.
25
Performance: Use of Materialization
Trail merging adds overhead to
query execution
Trail Materialization provides
interactive timesfor all queries
Scenario 1:few high-quality trails(18 trails)
26
Scalability: Query-rewrite Time vs. Number of Trails
Query-rewrite time can be controlled
with pruning
Scenario 2:many low-quality
trails
27
Conclusion: Pay-as-you-go Information Integration
DataspaceSystem
global warming zurich
text,links
DataSources
• Approach – Step 1: Provide a search service over all the
data
– Step 2: Add integration semantics via trails
– Step 3: If more semantics needed, go back to step 2
• Their Contributions– iTrails: a generic method to model semantic relationships (e.g.
implicit meaning, bookmarks, dictionaries, attribute matches, ...)
– Framework and algorithms for Pay-as-you-go Information
Integration
– Smooth transition between search and data integration