45
1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linköpings universitet

1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

Embed Size (px)

DESCRIPTION

3 Access to multiple data sources-Problems Users need good knowledge on where the required information is stored and how it can be accessed Representation of an entity in different data sources can be different. Same name in different data sources can refer to different entities.

Citation preview

Page 1: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

1

Integration of data sources

Patrick LambrixDepartment of Computer and Information ScienceLinköpings universitet

Page 2: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

2

Accessing multiple data sourcesWhere?

Which?How?

Disease information

Targetstructure

Chemicalstructure Disease

models

Clinicaltrials

Metabolism,toxicology

Genomics

DISCOVERY

Page 3: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

3

Access to multiple data sources-Problems Users need good knowledge on where the

required information is stored and how it can be accessed

Representation of an entity in different data sources can be different.

Same name in different data sources can refer to different entities.

Page 4: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

4

Queries over multiple data sources

Find PubMed publications on diseases of certain insulin sequences

SWISS-PROT OMIM PubMed

Find Divide & order Execute Combine

get relevant diseases

get IDs of publications

get publications

Page 5: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

5

Access to multiple data sources - steps

Decide which data sources should be used Divide query into sub-queries to the data sources Decide in which order to send sub-queries to the data

sources Send sub-queries to the data sources - use the

terminology of the data sources Merge results from the data sources to an answer for

the original query mistake in any step can lead to inefficient processing

of the query or failure to get a result

Page 6: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

6

Sub-query1

query

Sub-query1

Sub-query1

Answer1Answer2Answer3

Answer1Answer2Answer3

Answer1Answer2Answer3

Page 7: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

7

query

Answer1Answer2Answer3

Sub-query2(answer1)

Sub-query2(answer1)

Sub-query2(answer1)Answer1.1Answer1.2

Answer1.1Answer1.2

Answer1.1Answer1.2

Page 8: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

8

query

Answer1Answer2Answer3

Sub-query2(answer2)

Sub-query2(answer2)

Sub-query2(answer2)Answer2.1Answer2.2

Answer2.1Answer2.2

Answer1.1Answer1.2

Answer2.1Answer2.2

Page 9: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

9

query

Answer1Answer2Answer3

Sub-query2(answer3)

Sub-query2(answer3)

Sub-query2(answer3)Answer3.1

Answer3.1

Answer1.1Answer1.2

Answer2.1Answer2.2

Answer3.1

Page 10: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

10

query

Answer1Answer2Answer3

Answer1.1Answer1.2

Answer2.1Answer2.2

Answer3.1

Subquery3(Answer1.1,Answer1.2,Answer2.1,Answer2.2,Answer3.1)

Subquery3(Answer1.1,Answer1.2,Answer2.1,Answer2.2,Answer3.1)

Answer.aAnswer.bAnswer.cAnswer.dAnswer.eAnswer.f

Answer.aAnswer.bAnswer.cAnswer.dAnswer.eAnswer.f

result

Page 11: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

11

Problem formulation• Data source properties

– Autonomous data sources– Different data models– Differences in terminology– Overlapping, redundant data

• Integration aims to provide transparent access to multiple heterogenous data sources– uniform query language– uniform representation of

resultsData source 2Structure(name, structure, organism)date>2000

Data source 1Protein(name, authors, date, organism)Article(authors, title, year)date>1995

Page 12: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

12

Problem formulation• Data source properties

– Autonomous data sources– Different data models– Differences in terminology– Overlapping, redundant data

• Integration aims to provide transparent access to multiple heterogenous data sources– uniform query language– uniform representation of

resultsData source 2Structure(name, structure, organism)date>2000

Protein(name, date, organism)ProteinStructure(name, structure)

Data source 1Protein(name, authors, date, organism)Article(authors, title, year)date>1995

Page 13: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

13

Methods for integration Link driven federations

Explicit links between data sources. Warehousing

Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.

Mediation or View integration A global schema is defined over all data sources.

Page 14: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

14

Link driven federations

Creates explicit links between data sources

query: get interesting results and use web links to reach related data in other data sources

Page 15: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

15

Link driven federations

User Interface

Retrieval Engine

Answer Assembler

Index A->B

Data source B

Data

source AIndex B->A

Query interpreter

wrapper wrapper

Page 16: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

16

SRS

Integrates more than 300 resources Possible to add own resources interface: SRSWWW, getz http://srs.ebi.ac.uk/

Page 17: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

17

SRS – query language

text search[swissprot-des:kinase] documents in swissprot that contain ’kinase’ in the

’description’-field

[swissprot-des:kin*] documents in swissprot that contain a word that

starts with ’kin’ in the ’description’-field

Page 18: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

18

SRS – query language

boolean operators: and (&), or (|), andnot (!)[swissprot-des:(adrenergic & receptor) ! (alpha1A)] documents in swissprot that contain ’adrenergic’

and ’receptor’ in the ’description’-field, but not ’alpha1A’

Page 19: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

19

SRS – query language

boolean operators: and (&), or (|), andnot (!)[swissprot-des:kinase] & [swissprot-org:human] documents in swissprot that contain ’kinase’ in the

’description’-field and ’human’ in the ’organism’-field

Page 20: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

20

SRS – query language

links[swissprot-des:kinase] > PDB documents in PDB that are referred to from

documents in swissprot that contain ’kinase’ in the ’description’-field

Page 21: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

21

SRS – query language

links[swissprot-id: acha_human] > prosite >

swissprot documents in swissprot that are referred to

from documents in prosite that are referred to from documents in swissprot that contain ’acha_human’ in the ’id’- field

Page 22: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

22

SRS – query language

links[swissprot-org:human] > [swissprot-features:transmem] documents in swissprot that contain ’transmem’

in the ’features’-field and that are referred to from documents in swissprot that contain ’human’ in the ’organism’-field

Page 23: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

23

SRS – query language

multiple sources

[{swissprot sptremb}-des:kinase]

[dbs={swissprot sptremb}-des:kinase] & [dbs-org:human]

Page 24: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

24

Link driven federations

Advantages - complex queries - fast Disadvantages - require good knowledge - syntax based - terminology problem not solved

Page 25: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

25

Page 26: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

26

Mediation

Define a global schema over the data sources

high level query language

Page 27: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

27

Mediation

User Interface

Retrieval Engine

Query Interpreter and Expander Answer Assembler

OntologyBase

Data sourceKnowledge

Base

Data source

Data source

wrapper wrapper

Page 28: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

28

Mediation

Advantages - complex queries - requires less knowledge - solution for terminology problem - semantics based

Page 29: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

29

Mediation

Disadvantages - more computation - view maintenance

Page 30: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

30

Mediation

• Query problemHow to answer queries expressed using the global schema.

• Modeling problem How to model the global schema, data sources and mappings.

Application

Global schema

Local schemaLocal schema

Query

WrapperWrapper

Source Source

Mediator

Page 31: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

31

Queries use the global schema Conjunctive queries

select-project-join queries

Mediator reformulates queries in terms of a set of queries that use the local schemas. Equivalence and containment of queries needs to be preserved.

- Q1 is contained in Q2 if the result of Q1 is a subset of the result of Q2.

Queries

p(X,Z) :- a(X,Y), b(Y,Z)

head body/subgoalsif

q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure)

Page 32: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

32

MediatorQuery

WrapperWrapper

Mediator

Source Source

• Mediator is responsible for query processing – reformulation of queries, decide query plan– query optimization– execution of query plan, assemble results into final answer

Issues:– Semantically correct reformulation– Access only relevant data sources

Query Reformulation

Query Optimization

Query Execution Engine

Page 33: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

33

Knowledge

Application

Global schema

Local schemaLocal schema

Query

WrapperWrapper

Source Source

Mediator

Mappings

• Description of data source content– global schema (domain model/ontology)– local schema (data source model)

• Information for integration– mapping

• Capabilities– attributes and constraints– processing capabilities– completeness– cost of query answering– reliability

• Used for– selection of relevant data sources– query plan formulation– query plan optimization

Page 34: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

34

Mapping Relation between domain and data

source content

Global as viewThe global schema is defined in terms of source terminology

Global schema:Protein(name, date, organism)ProteinStructure(name, structure)

Data source local schema:DS1(name, authors, date, organism)DS2(name, structure, organism)

Protein(name, date, organism) :- DS1(name, authors, date, organism)ProteinStructure(name, structure) :- DS2(name, structure, organism)

Page 35: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

35

Mapping Relation between domain and data

source content

Local as viewThe sources are defined in terms of the global schema.

Global schema:Protein(name, date, organism)ProteinStructure(name, structure)

DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995DS2(name, structure, organism) -: Protein(name, date, organism),ProteinStructure(name, structure), date >2000

Data source local schema:DS1(name, authors, date, organism)DS2(name, structure, organism)

Page 36: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

36

Query processing in GAV

No explicit representation of data source content Mapping gives direct information about which data satisfies the

global schema. Query is processed by expanding the query atoms according to their

definitions.

Query: give name and structure for human proteins with date ‘2001’. q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure)

GAV: Protein(name, date, organism) :- DS1(name, authors, date, organism) ProteinStructure(name, structure) :- DS2(name, structure, organism)

New query: q(name, structure) :- DS1(name, authors, 2001, ’human’), DS2(name, structure, organism)

Page 37: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

37

Query processing in LAV

Mapping does not give direct information about which data satisfies the global schema.

To answer the query it needs to be inferred how the mappings should be used.

Query: give name and structure for human proteins with date ‘2001’. q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure)

LAV: DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995 DS2(name, structure, organism) -: Protein(name, date, organism), ProteinStructure(name, structure), date >2000

Page 38: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

38

Query processing in LAV

Bucket algoritm (Information Manifold) For each sub-goal in query create bucket of relevant views. Define rewritings of query. Each rewriting consists of one conjunct from

every bucket. Check whether the resulting conjunction is contained in the query.

The result is the union of the rewritings.

New query: q(name, structure) :- DS1(name, authors, 2001, ’human’), DS2(name, structure, organism)

Query: give name and structure for human proteins with date ‘2001’. q(name, structure) :- Protein(name, 2001, ‘human’), ProteinStructure(name, structure)

LAV: DS1(name, authors, date, organism) -: Protein(name, date, organism), date >1995 DS2(name, structure, organism) -: Protein(name, date, organism), ProteinStructure(name, structure), date >2000

Page 39: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

39

Comparison GAV - LAV Global as view

Clear how data sources interact When a data source is added, the global schema can

change Query processing is easy

Local as view Each data source is specified in isolation Easy to add data sources Easier to specify constraints on the contents of sources Query processing requires reasoning

Page 40: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

40

Capabilities Most common capabilities describe attributes

f - free, attribute can be specified or not b - bound, a value must be specified for the

attribute, all values are permitted u - unspecified, not permitted to specify a value for

the attribute c[S] - value should be one of the values in finite

set S o[S] - value is not specified or one of the values in

finite set S

DS1: (name, authors, date, organism)f f b c[human mouse]

Page 41: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

41

Mediation

User Interface

Retrieval Engine

Query Interpreter and Expander Answer Assembler

OntologyBase

Data sourceKnowledge

Base

Data source

Data source

wrapper wrapper

Research issues: - knowledge representation - creation and maintenance of global schema

Page 42: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

42

Mediation

User Interface

Retrieval Engine

Query Interpreter and Expander Answer Assembler

OntologyBase

Data sourceKnowledge

Base

Data source

Data source

wrapper wrapper

Research issues: - knowledge representation - creation of ontologies - aligning/merging of ontologies

Page 43: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

43

Mediation

User Interface

Retrieval Engine

Query Interpreter and Expander Answer Assembler

OntologyBase

Data sourceKnowledge

Base

Data source

Biological databank

wrapper wrapperResearch issues: - unified query language - knowledge representation - strategies for query expansion - query optimization

Page 44: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

44

Mediation

User Interface

Retrieval Engine

Query Interpreter and Expander Answer Assembler

OntologyBase

Data sourceKnowledge

Base

Data source

Data source

wrapper wrapper

Research issue: - semi-automatic generation of wrappers

Page 45: 1 Integration of data sources Patrick Lambrix Department of Computer and Information Science Linkpings universitet

45