27
1 iTrails: Pay-as-you-go Information Integration in Datasapces Authors: Salles, Dittrich et al. (ETH Zurich) Published in VLDB2007 Presenter: Jim 7 Dec 2007

1 iTrails: Pay-as-you-go Information Integration in Datasapces Authors: Salles, Dittrich et al. (ETH Zurich) Published in VLDB2007 Presenter: Jim 7 Dec

Embed Size (px)

Citation preview

1

iTrails: Pay-as-you-go Information Integration in Datasapces

Authors: Salles, Dittrich et al. (ETH Zurich)Published in VLDB2007

Presenter: Jim7 Dec 2007

2

Background

• Dataspace– A step further than database– Model all kinds of files

• iDM: a DSMS

3

Background cont.

• Information integration– Purpose: query multiple data sources

Data source 2Data source 1 Data source 3

Query interface

Result Query

Global schema

Source schema

?

4

Background cont.

• Schema matching/mapping– Enable the reformulation of query on the

global schema to the query on source schema

Global schema

Source schema

?

m1:for d in deptexists d' in Tdeptwhere ( d'.dname=d.dname ^ d'.location=d.location)

m2:for d in dept, e in d.empsexists d' in Tdept, e' in d'.empswhere...A complete and complicate mapping

is necessary for the query.

5

text,links

Background cont.

• Search engine– Easier way for query multiple data sources

text,links

text,links

text,links

Crawling/indexing

Graph IR Search Engine

Query semantics are NOT precise.

global warming zurich

6

Motivation of iTrail

• Find a integration solution in-between these two extremes?

?Dataspace System

text,links

text,links

text,links

text,links

Graph IR Search Engine

Data IntegrationSystem

Temps Cities

CO2 Sunspots

... ...

...

...

Pay-as-you-goInformation Integration The more effort you pay,

the more query power you have.

7

iTrails Core Idea: Add Integration Hints Incrementally

• Step 1: Provide a search service over all the data– Use a general graph data model (iDM)

– Works for unstructured documents, XML, and relations

• Step 2: Add integration semantics via hints (trails) on top of the graph– Works across data sources, not only between sources

• Step 3: If more semantics needed, go back to step 2

• Impact:

– Smooth transition between search and data integration

– Semantics added incrementally improve precision / recall

8

iTrails: Defining Trails

• Basic Form of a Trail

QL [.CL] → QR [.CR]

• Intuition:– When I query for QL [.CL], you should also

query for QR [.CR]

Queries: keyword and path expressions

Attribute projections

Query example: global warming zurich

or //Temperatures/*[celsius > 10]

9

Trail examples: keyword query

20

15

14

BE

ZH

ZH

• Trail for Implicit Meaning: “When I query for global warming, you should also query for Temperature data above 10 degrees”

• Trail for an Entity: “When I query for zurich, you should also query for references of zurich as a region”

global warming → //Temperatures/*[celsius > 10]

DBServer

Temperaturescity celsius

Bern

Zurich

zurich → //*[region = “ZH”]

Uster

region

global warming zurich

9ZHZurich26-Sep

25-Sep

24-Sep

23-Sep

date

10

Trail Example: Deep Web Bookmarks

• Trail for a Bookmark: “When I query for train home, you should also query for the TrainCompany’s website with origin at ETH Uni and destination at Seilbahn Rigiblick”

train home

train home → //trainCompany.com//*[origin=“ETH Uni”

and dest =“Seilbahn Rigiblick”]

WebServer

11

Trail Examples: Thesauri, Dictionaries

car auto

car carro

car → auto

car → carrocarro → car

Laptop Email Server

• Trail for Thesauri: “When I query for car, you should also query for auto”

• Trails for Dictionary: “When I query for car, you should also query for carro and vice-versa”

12

Trail Examples: Schema Equivalences

• Trail for schema match on names: “When I query for Employee.empName, you should also query for Person.name”

• Trail for schema match on salaries: “When I query for Employee.salary, you should also query for Person.income”

EmployeeempName salary

Personname age income

//Employee//*.tuple.empName → //Person//*.tuple.name

//Employee//*.tuple.salary → //Person//*.tuple.incomeDB

Server

empId

SSN

13

How are Trails Created?

• Given by the user– Explicitly– Via Relevance Feedback

• (Semi-)Automatically– Information extraction techniques– Automatic schema matching– Ontologies and thesauri (e.g., wordnet)– User communities (e.g., trails on gene data,

bookmarks)

14

Uncertainty/certainty of trails

• Probabilistic Trails: – model uncertain trails– probabilities used to rank trails

• QL [.CL] → QR [.CR], 0 ≤ p ≤ 1

– Example: car → auto

• Scored Trails: – give higher value to certain trails– scoring factors used to boost scores of query results obtained by

the trail• QL [.CL] → QR [.CR], sf > 1

– Examples: • T1: weather → //Temperatures/*• T2: yesterday → //*[date = today()– 1]

p

p = 0.8

p = 1, sf = 3

p = 0.9, sf = 2

15

Rewriting Queries with Trails

• Canonical form of the query– Decompse the query into location step seperators (‘/’

or ‘//’) and predicates– Construct a tree based on the decompsed query

Query: “retrieve all information about the projects containing keyword Mike” //home/projects//*[“Mike”]

U

Mike

U

home projects(simplified form for presentation)

16

Rewriting Queries with Trails cont.

U

weather yesterday

(1) Matching

T2: yesterday → //*[date = today() – 1]

(2) Transformation

U

weather

yesterday

U//*[date = today() – 1]

(3) Merging

T2

matches

QueryWeather yesterday

Trail

Rewriting method: Matching => Transformation => Merging

“yesterday” is containedIn the query

17

Rewriting Queries on the canonical form of the query

Trails:

NOT the final result!The rewriting based on ψ7 and ψ9 are omitted.

18

Problem: Recursive Matches

... UU

weather

yesterday

U//*[date = today() – 1]

T2: yesterday →

//*[date = today() – 1]

New query still matches T2,

so T2 could be applied

againU

weather U

yesterday

U//*[date = today() – 1]

//*[date = today() – 1]U

//*[date = today() – 1]

//*[date = today() – 1]...

Infinite recursion

T2

matches

T2

matches

19

Problem: Recursive Matches cont.

U

weather

yesterday

U//*[date = today() – 1]

Trails may be mutually recursive

T3: //*.tuple.date → //*.tuple.modified

U

weather U

yesterday

//*[date = today() – 1]

T10: //*.tuple.modified → //*.tuple.date

U//*[modified = today() – 1]

U

weather Uyesterday

//*[date = today() – 1]

U

//*[modified = today() – 1]

U //*[date = today() – 1]

We again match T3

and enter an infinite loop

T3 matches

T10 matches

20

U

weather yesterday

First Level

U

weatheryesterday

//Temperatures/*

UU

//*[date = today() – 1]

U

weather yesterday

//Temperatures/*

UU

//*[modified = today() – 1]

UU

//*[received = today() – 1]

//*[date = today() – 1]

SecondLevel

T1

matches

T2

matches

T3, T4 match

Solution: Multiple Match Coloring Algorithm

T1: weather → //Temperatures/*T2: yesterday → //*[date = today() – 1]T3: //*.tuple.date → //*.tuple.modifiedT4: //*.tuple.date → //*.tuple.received

21

Multiple Match Coloring Algorithm Analysis

• Problem:– MMCA is exponential in number of levels

• Solution: Trail Pruning– Prune by number of levels– Prune by top-K trails matched in each level

• Ranking by probability/certainity/weight

– Both

• Indexing techniques– Similar with inverted list indexing

• Given a query, materialize the query results (node ids)

– Materialization options• Left/right side of the trails

22

Experiment setup

• Test base: iMeMex Dataspace System

• Criterias in evaluation– Quality: Top-K Precision and Recall– Performance: Use of Materialization– Scalability: Query-rewrite Time vs. Number of Trails

• iTrail scenario– Scenario 1: Few High-quality Trails– Scenario 2: Many Low-quality Trails

• Configured iMeMex to act in three modes– Baseline: Graph / IR search engine– iTrails: Rewrite search queries with trails– Perfect Query: Semantics-aware query

23

Experiment setup cont.

• Trails and queries

max original tree size: 14max final tree size: 35max # of trails applied: 5

24

Quality: Top-K Precision and Recall

SearchEngine misses relevantresults

Q3: pdf yesterday

SearchQuery is partially

semantics-aware

Q13: to = raimund.grube@

enron.com

Scenario 1:few high-qualitytrails(18 trails)

Queries

perfect query Perfect Query always has precision and recall

equal to 1

K = 20

Much improvement comparing with “baseline”.

25

Performance: Use of Materialization

Trail merging adds overhead to

query execution

Trail Materialization provides

interactive timesfor all queries

Scenario 1:few high-quality trails(18 trails)

26

Scalability: Query-rewrite Time vs. Number of Trails

Query-rewrite time can be controlled

with pruning

Scenario 2:many low-quality

trails

27

Conclusion: Pay-as-you-go Information Integration

DataspaceSystem

global warming zurich

text,links

DataSources

• Approach – Step 1: Provide a search service over all the

data

– Step 2: Add integration semantics via trails

– Step 3: If more semantics needed, go back to step 2

• Their Contributions– iTrails: a generic method to model semantic relationships (e.g.

implicit meaning, bookmarks, dictionaries, attribute matches, ...)

– Framework and algorithms for Pay-as-you-go Information

Integration

– Smooth transition between search and data integration