October 2007
Data integration architectures and methodologies for the Life Sciences
Alexandra Poulovassilis, Birkbeck, U. of London
Outline of the talk
- The problem and challenges faced
- Historical background
- Main Data Integration approaches in the Life Sciences
- Our work: Materialised and Virtual DI
- Future directions
  - ISPIDER Project
  - Bioinformatics service reconciliation
1. The Problem
Given a set of biological data sources, data integration (DI) is the process of creating an integrated resource which
- combines data from each of the data sources
- in order to support new queries and analyses

Biological data sources are characterised by their high degree of heterogeneity, in terms of: data model, query interfaces, query processing capabilities, database schema or data exchange format, data types used, and nomenclature adopted
Coupled with the variety, complexity and large volumes of biological data, this poses several challenges, leading to several methodologies, architectures and systems being developed
Challenges faced
Increasingly large volumes of complex, highly varying, biological data are being made available
Data sources are developed by different people in differing research environments for differing purposes
Integrating them to meet the needs of new users and applications requires reconciliation of their heterogeneity w.r.t. content, data representation/exchange and querying
Data sources may freely change their format and content without considering the impact on any integrated derived resources
Integrated resources may themselves become data sources for high-level integrations, resulting in a network of dependencies
Biological data: Genes → Proteins → Biological Function

[Figure: a gene. Genome: DNA sequences of 4 bases (A,C,G,T), the permanent copy. RNA: temporary copy of the DNA sequence. Protein: sequence of 20 amino acids, the product (each triple of RNA bases encodes an amino acid). Function: the protein's "job" in biological processes]
This slide is adapted from Nigel Martin’s Lecture Notes on Bioinformatics
Varieties of Biological Data
- genomic data
- gene expression data (DNA → proteins) and gene function data
- protein structure and function data
- regulatory pathway data: how gene expression is regulated by proteins
- cluster data: similarity-based clustering of genes or proteins
- proteomics data: from experiments on separating proteins produced by organisms into peptides, and protein identification
- phylogenetic data: evolution of genomic, protein, function data
- data on genomic variations in species
- semi-structured/unstructured data: medical abstracts
Some Key Application Areas for DI
- Integrating, analysing and annotating genomic data
- Predicting the functional role of genes and integrating function-specific information
- Integrating organism-specific information
- Integrating protein structure and pathway data with gene expression data, to support functional genomics analysis
- Integrating, analysing and annotating proteomics data sources
- Integrating phylogenetic data sources for genealogy research
- Integrating data on genomic variations to analyse health impact
- Integrating genomic, proteomic and clinical data for personalised medicine
2. Historical Background
One possible approach would be to encode transformation/integration functionality in the application programs
However, this may be a complex and lengthy process, and may affect robustness, maintainability, extensibility
This has motivated the development of generic architectures and methodologies for DI, which abstract out this functionality from application programs into generic DI software
Much work has been done since the 1990s specifically in biological DI
Many systems have been developed e.g. DiscoveryLink, Kleisli, Tambis, BioMart, SRS, Entrez, that aim to address some of the challenges faced
3. Main DI Approaches in the Life Sciences
Materialised
- import data into a DW
- transform & aggregate imported data
- query the DW via the DBMS

Virtual
- specify the integrated schema
- "wrap" the data sources, using wrapper software
- construct mappings between data sources and IS, using mediator software
- query the integrated schema
- mediator software coordinates query evaluation, using the mappings and wrappers
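A miniature sketch of the virtual approach (toy data and invented names; not any particular system's API): wrappers give each heterogeneous source a uniform query interface, and a mediator uses mappings to translate source records into the integrated schema before answering the global query.

```python
# Minimal sketch of virtual data integration (all names hypothetical):
# wrappers expose heterogeneous sources uniformly, and a mediator uses
# source-to-integrated-schema mappings to answer global queries.

class Wrapper:
    """Wraps one data source behind a uniform query interface."""
    def __init__(self, rows):
        self.rows = rows
    def query(self, predicate):
        return [r for r in self.rows if predicate(r)]

class Mediator:
    """Evaluates queries on the integrated schema via the wrappers."""
    def __init__(self, mappings):
        # mappings: integrated-schema relation -> list of (wrapper, translate)
        self.mappings = mappings
    def query(self, relation, predicate):
        results = []
        for wrapper, translate in self.mappings[relation]:
            # translate each source record into the integrated schema first
            results.extend(translate(r) for r in wrapper.query(lambda r: True))
        return [r for r in results if predicate(r)]

# Two sources with different field names for the same concept
swissprot = Wrapper([{"ac": "P12345", "seq": "MKT"}])
pdb = Wrapper([{"pdb_id": "1ABC", "sequence": "MKV"}])

mediator = Mediator({
    "protein": [
        (swissprot, lambda r: {"id": r["ac"], "sequence": r["seq"]}),
        (pdb, lambda r: {"id": r["pdb_id"], "sequence": r["sequence"]}),
    ]
})
print(mediator.query("protein", lambda r: r["sequence"].startswith("MK")))
```

A real mediator would push the predicate down to the sources rather than fetching everything; coping with the sources' differing query capabilities is exactly what makes that hard.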
Main DI Approaches in the Life Sciences
Link-based
- no integrated schema
- users submit simple queries to the integration software e.g. via a web-based user interface
- queries are formulated w.r.t. the data sources, as selected by the user
- the integration software provides additional capabilities for
  - facilitating query formulation e.g. cross-references are maintained between different data sources and used to augment query results with links to other related data
  - speeding up query evaluation e.g. indexes are maintained supporting efficient keyword-based search
4. Comparing the Main Approaches
Link-based integration is fine if its functionality meets users' needs

Otherwise materialised or virtual DI is indicated:
- both allow the integrated resource to be queried as though it were a single data source. Users/applications do not need to be aware of source schemas/formats/content

Materialised DI is generally adopted for:
- better query performance
- greater ease of data cleaning and annotation

Virtual DI is generally adopted for:
- lower cost of storing and maintaining the integrated resource
- greater currency of the integrated resource
5. Our work: AutoMed
The AutoMed Project at Birkbeck and Imperial:
- is developing tools for the semi-automatic integration of heterogeneous information sources
- can handle both structured and semi-structured data
- provides a unifying graph-based metamodel (HDM) for specifying higher-level modelling languages
- provides a single framework for expressing data cleansing, transformation and integration logic
- the AutoMed toolkit is currently being used for biological data integration and p2p data integration
AutoMed Architecture
[Figure: AutoMed components]
- Global Query Processor/Optimiser
- Schema Matching Tools
- Other Tools e.g. GUI, schema evolution, DLT
- Schema Transformation and Integration Tools
- Model Definition Tool
- Schema and Transformation Repository
- Model Definitions Repository
- Wrappers
AutoMed Features
Schema transformations are automatically reversible:
- addT/deleteT(c,q) by deleteT/addT(c,q)
- extendT(c,Range q1 q2) by contractT(c,Range q1 q2)
- renameT(c,n,n') by renameT(c,n',n)
Hence bi-directional transformation pathways (more generally transformation networks) are defined between schemas
The queries within transformations allow automatic data and query translation
Schemas may be expressed in a variety of modelling languages
Schemas may or may not have a data source associated with them
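The reversibility rules above can be illustrated with a toy encoding of transformation steps as tuples (illustrative only; this is not the AutoMed toolkit's actual representation or API):

```python
# Sketch of AutoMed-style reversible primitive transformations (names
# and representation are invented for illustration). Each step carries
# enough information to be undone, so a pathway between two schemas
# can be traversed in either direction.

def reverse_step(step):
    kind = step[0]
    inverses = {"add": "delete", "delete": "add",
                "extend": "contract", "contract": "extend"}
    if kind == "rename":
        _, old, new = step
        return ("rename", new, old)          # swap the two names
    return (inverses[kind],) + step[1:]       # keep construct and query

def reverse_pathway(pathway):
    # invert each step and apply them in the opposite order
    return [reverse_step(s) for s in reversed(pathway)]

pathway = [("add", "gene", "query1"), ("rename", "prot", "protein"),
           ("contract", "tmp", "range")]
print(reverse_pathway(pathway))
# → [('extend', 'tmp', 'range'), ('rename', 'protein', 'prot'),
#    ('delete', 'gene', 'query1')]
```

Because each inverse is itself a primitive step, reversing twice yields the original pathway, which is what makes the pathways bi-directional.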
AutoMed vs Common Data Model approach
[Figure: (a) CDM framework: local RDB, XML and OODB schemas, each wrapped (Wrapper1-3) into a common data model (CDM) and combined into an integrated schema. (b) AutoMed framework: the same RDB, XML and OODB sources, each wrapped into an AutoMed Relational, XML or OO schema and connected to the integrated schema by transformation pathways]
6. Materialised DI
[Figure: data warehouse pipeline. Each data source schema (DSS1-DSS4) is transformed (TS1-TS4), single-source cleaned (SS1-SS4), multi-source cleaned (MS1, MS2), integrated into a detailed schema (DS), summarised into summary views (SV1-SV5) under the data warehouse schema (DWS), and used to create data marts (DMS1, ...). Stages: Transforming, Single-Source Cleaning, Multi-Source Cleaning, Integrating, Summarizing, Creating Data Marts]

Legend: DWS: Data Warehouse Schema; DS: Detailed Schema; SV: Summary View; DMS: Data Mart Schema; DSS: Data Source Schema; TS: Transformed Schema; SS: Single-Cleaned Schema; MS: Multi-Cleaned Schema
Some characteristics of Biological DI
- prevalence of automated and manual annotation of data
  - prior, during and after its integration
  - e.g. DAS distributed annotation service; GUS data warehouse
- annotation of data origin and data derivation
- importance of being able to trace the provenance of data
- wide variety of nomenclatures adopted
  - greatly increases the difficulty of data aggregation
  - has led to many standardised ontologies and taxonomies
- inconsistencies in identification of biological entities
  - has led to standardisation efforts e.g. LSID
  - but still a legacy of non-standard identifiers present
The BioMap Data Warehouse
A data warehouse integrating
- gene expression data
- protein structure data including
  - data from the Macromolecular Structure Database (MSD) from the European Bioinformatics Institute (EBI)
  - CATH structural classification data
- functional data including
  - Gene Ontology; KEGG
- hierarchical clustering data, derived from the above

Aiming to support mining, analysis and visualisation of gene expression data
BioMap integration approach
[Figure: data sources and cluster data, each with an AutoMed schema, are integrated via the AutoMed toolkit under a global schema and materialised in a global database; the schemas and transformations are held in the AutoMed metadata repository]
BioMap architecture
[Figure: BioMap architecture. Source data (Structure Data: MSD, CATH, ...; Function Data: GO, KEGG, ...; Cluster Data; Microarray Data: ArrayExpress) is loaded, via MEditor, into Structure, Function, Cluster, Microarray and Metadata tables, plus Search tables (materialised views) and data marts, which are accessed by search, analysis, mining and visualisation tools]
Using AutoMed in the BioMap Project
- Wrapping of data sources and the DW
- Automatic translation of source and global schemas into AutoMed's XML schema language (XMLDSS)
- Domain experts provide matchings between constructs in source and global schemas: rename transfs.
- Automatic schema restructuring and generation of transformation pathways
- Pathways could subsequently be used for maintenance and evolution of the DW; also for data lineage tracing
- See DILS'05 paper for details of the architecture and clustering approach
[Figure: two relational databases and an XML file, each with a wrapper (RDB Wrapper, XML Wrapper) and an AutoMed Relational or XMLDSS schema; transformation pathways connect these schemas to the AutoMed Integrated Schema, which is materialised in the Integrated Database via the Integrated Database Wrapper]
7. Virtual DI
The integrated schema may be defined in a standard data modelling language

Or, more broadly, it may be a source-independent ontology
- defined in an ontology language
- serving as a "global" schema for multiple potential data sources, beyond the ones being integrated, e.g. as in TAMBIS

The integrated schema may/may not encompass all of the data in the data sources:
- it may be sufficient to capture just the data needed for answering key user queries/analyses
- this avoids the possibly complex and lengthy process of creating a complete integrated schema and set of mappings
Virtual DI Architecture
[Figure: virtual DI components]
- Global Query Processor
- Global Query Optimiser
- Schema Integration Tools
- Metadata Repository: data source schemas, integrated schemas, mappings
- Wrappers
Degree of Data Source Overlap
- different systems make different assumptions about this
- some systems assume that each DS contributes a different part of the integrated virtual resource e.g. K2/Kleisli
- some systems relax this but do not attempt any aggregation of duplicate or overlapping data from the DSs e.g. TAMBIS
- some systems support aggregation at both schema and data levels e.g. DiscoveryLink, AutoMed
- the degree of data source overlap impacts on the complexity of the mappings and the design effort involved in specifying them
- the complexity of the mappings in turn impacts on the sophistication of the global query optimisation and evaluation mechanisms that will be needed
Virtual DI methodologies
Top-down
- integrated schema IS is first constructed
- or it may already exist from previous integration or standardisation efforts
- the set of mappings M is defined between IS and the DS schemas
Virtual DI methodologies
Bottom-up
- initial version of IS and M constructed e.g. from one DS
- these are incrementally extended/refined by considering in turn each of the other DSs
- for each object O in each DS, M is modified to encompass the mapping between O and IS, if possible
- if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly
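The bottom-up loop above can be sketched as follows (all names and the `can_map` matching hook are invented for illustration; real matching needs schema- and instance-level comparison):

```python
# Hedged sketch of the bottom-up methodology: start from one source,
# then for each object in each remaining source either map it to the
# integrated schema IS or extend IS to cover it, updating mappings M.

def integrate_bottom_up(sources, can_map):
    """sources: {name: [objects]}; can_map(obj, IS) -> IS object or None."""
    names = list(sources)
    IS = list(sources[names[0]])            # initial IS from the first DS
    M = {(names[0], o): o for o in IS}      # identity mappings for it
    for name in names[1:]:
        for obj in sources[name]:
            target = can_map(obj, IS)
            if target is None:
                IS.append(obj)              # extend IS with the new object
                target = obj
            M[(name, obj)] = target         # record the mapping into IS
    return IS, M

sources = {"ds1": ["gene", "protein"], "ds2": ["protein", "pathway"]}
IS, M = integrate_bottom_up(sources, lambda o, IS: o if o in IS else None)
print(IS)   # → ['gene', 'protein', 'pathway']
```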
Virtual DI methodologies
Mixed Top-down and Bottom-up
- initial IS may exist
- initial set of mappings M is specified
- IS and M may need to be extended/refined by considering additional data from the DSs that IS needs to capture
- for each object O in each DS that IS needs to capture, M is modified to encompass the mapping between O and IS, if possible
- if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly
Defining Mappings
Global-as-view (GAV)
- each schema object in IS defined by a view over the DSs
- simple global query reformulation by query unfolding
- view evolution problems if DSs change

Local-as-view (LAV)
- each schema object in a DS defined by a view over IS
- harder global query reformulation, using views
- evolution problems if IS changes

Global-local-as-view (GLAV)
- views relate multiple schema objects in a DS with IS
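GAV query unfolding is simple enough to show in a few lines (toy relations and data, invented names): the integrated relation is a view over the sources, so a global query is answered by evaluating that view and then applying the query's own condition.

```python
# Illustrative GAV query unfolding. The integrated-schema relation
# 'function' is defined as a view over two sources; answering a global
# query just "unfolds" the view definition.

source_a = [("P1", "kinase")]
source_b = [("P2", "transferase")]

# GAV mapping: each IS relation is a view (here, a union) over the DSs
gav_views = {"function": lambda: source_a + source_b}

def answer(relation, predicate):
    # unfold the view definition, then apply the query's predicate
    return [t for t in gav_views[relation]() if predicate(t)]

print(answer("function", lambda t: t[1] == "kinase"))   # → [('P1', 'kinase')]
```

The LAV direction is the hard one: there the sources are views over IS, and answering a global query means rewriting it using those views, which is why LAV reformulation needs more sophisticated algorithms.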
Both-As-View approach supported by AutoMed
- not based on views between integrated and source schemas
- instead, provides a set of primitive schema transformations, each adding, deleting or renaming just one schema object
- relationships between source and integrated schema objects are thus represented by a pathway of primitive transformations
- add, extend, delete, contract transformations are accompanied by a query defining the new/deleted object in terms of the other schema objects
- from the pathways and queries, it is possible to derive GAV, LAV and GLAV mappings
- currently AutoMed supports GAV, LAV and combined GAV+LAV query processing
Typical BAV Integration Network
[Figure: data source schemas DS1, ..., DSn, each linked by a transformation pathway to a union-compatible schema US1, ..., USn; the USi are linked to each other by id transformations and to the global schema GS]
Typical BAV Integration Network (cont’d)
On the previous slide:
- GS is a global schema
- DS1, ..., DSn are data source schemas
- US1, ..., USn are union-compatible schemas
- the transformation pathways between each pair DSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
- the transformation pathway between USi and GS is similar
- the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
8. Schema Evolution
In biological DI, data sources may evolve their schemas to meet the needs of new experimental techniques or applications
Global schemas may similarly need to evolve to encompass new requirements
Supporting schema evolution in materialised DI is costly: requires modifying the ETL and view materialisation processes, plus the processes maintaining any derived data marts
With view-based virtual DI approaches, the sets of views that may be affected need to be examined and redefined
Schema Evolution in BAV
BAV supports the evolution of both data source and global schemas
The evolution of any schema is specified by a transformation pathway from the old to the new schema
For example, the figure on the right shows transformation pathways, T, from an old to a new global or data source schema
[Figure: a transformation pathway T from a global schema S to a new global schema S', and likewise from a data source schema S to a new data source schema S']
Global Schema Evolution
Each transformation step t in T: S→S' is considered in turn
- if t is an add, delete or rename then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway, using an AutoMed tool that does this); the extended pathway can be used to regenerate the necessary GAV or LAV views
- if t is a contract then there will be information present in S that is no longer available in S'; again there is nothing further to do
- if t is an extend then domain knowledge is required to determine if, and how, the new construct in S' could be derived from existing constructs; if not, nothing further to do; if yes, the extend step is replaced by an add step
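The per-step rule above amounts to a small dispatch loop. In this sketch `derive_query` is a hypothetical hook standing in for the domain knowledge: given an extended construct, it returns a defining query or None.

```python
# Sketch of processing a global-schema evolution pathway T: S -> S'.
# Steps are toy tuples; derive_query is an invented domain-knowledge
# hook returning a defining query for an extend step, or None.

def process_evolution(pathway, derive_query):
    result = []
    for step in pathway:
        kind = step[0]
        if kind in ("add", "delete", "rename", "contract"):
            result.append(step)                   # nothing further to do
        elif kind == "extend":
            q = derive_query(step[1])             # consult domain knowledge
            if q is None:
                result.append(step)               # construct underivable
            else:
                result.append(("add", step[1], q))  # replace extend by add
    return result

pathway = [("rename", "prot", "protein"), ("extend", "locus")]
print(process_evolution(pathway,
                        lambda c: "q_locus" if c == "locus" else None))
# → [('rename', 'prot', 'protein'), ('add', 'locus', 'q_locus')]
```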
Local Schema Evolution
This is a bit more complicated, as it may require changes to be propagated also to the global schema(s)

Again, each transformation step t in T: S→S' is considered in turn

In the case that t is an add, delete, rename or contract step, the evolution can be carried out automatically

If it is an extend, then domain knowledge is required

See our CAiSE'02, ICDE'03 and ER'04 papers for more details. The last of these discusses a materialised DI scenario where the old/new global/source schemas have an extent. We are currently implementing this functionality within the AutoMed toolkit
9. Some Future Directions in Biological DI
Automatic or semi-automatic identification of correspondences between sources, or between sources and global schemas e.g.
- name-based and structural comparisons of schema elements
- instance-based matching at the data level
- annotation of data sources with terms from ontologies to facilitate automated reasoning

Capturing incomplete and uncertain information about the data sources within the integrated resource e.g. using probabilistic or logic-based representations and reasoning

Automating information extraction from textual sources using grammar and rule-based approaches; integrating this with other related structured or semi-structured data
9.1 Harnessing Grid Technologies – ISPIDER
ISPIDER Project Partners: Birkbeck, EBI, Manchester, UCL
Aims:
- Large volumes of heterogeneous proteomics data
- Need for interoperability
- Need for efficient processing
- Development of a Proteomics Grid Infrastructure: use existing proteomics resources and develop new ones; develop new proteomics clients for querying, visualisation, workflow etc.
Project Aims

[Five figure-only slides]
myGrid / DQP / AutoMed
myGrid: collection of services/components allowing high-level integration, via workflows, of data and applications

DQP:
- uses OGSA-DAI (Open Grid Services Architecture Data Access and Integration) to access data sources
- provides distributed query processing over OGSA-DAI enabled resources

Current research: AutoMed-DQP and AutoMed-myGrid workflows interoperation

See DILS'06 and DILS'07 papers, respectively
AutoMed – DQP Interoperability
- Data sources wrapped with OGSA-DAI
- AutoMed-DAI wrappers extract data sources' metadata
- Semantic integration of data sources using AutoMed transformation pathways into an integrated AutoMed schema
- IQL queries submitted to this integrated schema are:
  - reformulated to IQL queries on the data sources, using the AutoMed transformation pathways
  - submitted to DQP for evaluation via the AutoMed-DQP wrapper
[Figure: databases exposed through OGSA-DAI services/activities; AutoMed wrappers extract an AutoMed schema for each source into the AutoMed repository, and the schemas are integrated into the Integrated AutoMed Schema. An IQL query is submitted to the AutoMed Query Processor, reformulated, and passed as an OQL query to the Distributed Query Processor via the AutoMed-DQP wrapper; the OQL result is translated back into an IQL result]
9.2 Bioinformatics Service Reconciliation
A plethora of bioinformatics services is being made available

Semantically compatible services are often not able to interoperate automatically in workflows due to
- different service technologies
- differences in data model, data modelling, data types

Hence the need for service reconciliation
Previous Approaches
Shims. myGrid uses shims, i.e. services that act as intermediaries between specific pairs of services and reconcile their inputs and outputs
Bowers & Ludäscher (DILS’04) use 1-1 path correspondences to one or more ontologies for reconciling services. Sample implementation uses mappings to a single ontology and generates an XQuery query as the transformation program
Thakkar et al. use a mediator system, like us, but for service integration i.e. for providing services that integrate other services – not for reconciling semantically compatible services that need to form a pipeline within a workflow
Our approach
XML as the common representation format
Assume availability of format converters to convert to/from XML, if output/input of a service is not XML
[Figure: example workflow getIPIAccession → getIPIEntry → getPfamEntry, running across ISPIDER, IPI @ EBI and Pfam (local), passing an IPI accession (string), UniProt entry (flat file), sequence (string), Pfam accession (string) and InterPro entry (flat file). A mediator or shim sits between services S1 and S2: S1's output (non-XML or XML format, with XMLDSS schema X1) must be converted into S2's input (non-XML or XML format, with XMLDSS schema X2)]
Our approach
XMLDSS as the schema type
- We use our XMLDSS schema type as the common schema type for XML
- Can be automatically derived from a DTD/XML Schema, if available
- Or can be automatically extracted from an XML document
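Extracting a schema summary directly from a document can be sketched as below (the real XMLDSS format differs; this only illustrates that element/attribute structure is recoverable from an XML instance without a DTD or XML Schema):

```python
# Sketch: derive an XMLDSS-like structural summary from an XML document.
# For each element tag we record its observed child tags and attributes.

import xml.etree.ElementTree as ET

def extract_schema(xml_text):
    """Return {element tag: (sorted child tags, sorted attribute names)}."""
    schema = {}
    for elem in ET.fromstring(xml_text).iter():
        children, attrs = schema.setdefault(elem.tag, (set(), set()))
        children.update(c.tag for c in elem)   # observed child elements
        attrs.update(elem.attrib)              # observed attributes
    return {t: (sorted(c), sorted(a)) for t, (c, a) in schema.items()}

doc = '<entry ac="P12345"><name/><seq len="3">MKT</seq></entry>'
print(extract_schema(doc))
```

A production extractor would also need to record element ordering and repetition, which XMLDSS captures and this sketch does not.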
Our approach
Correspondences to an ontology

A set of GLAV correspondences between each XMLDSS schema and a typed ontology:
- An element maps to a concept/path in the ontology
- An attribute maps to a literal-valued property/path
- There may be multiple correspondences for elements/attributes in the ontology
Our approach
Schema and data transformation

A pathway is generated to transform X1 to X2:
- the correspondences are used to create pathways X1→X1' and X2→X2'
- the XMLDSS restructuring algorithm creates X1'→X2'
- hence the overall pathway is X1→X1'→X2'→X2
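Composing the three segments into the overall X1→X2 pathway relies on pathway reversibility: the X2→X2' segment is reversed to obtain X2'→X2. A toy sketch (step representation invented for illustration):

```python
# Sketch of composing transformation pathway segments into one
# X1 -> X1' -> X2' -> X2 pathway, reversing the X2 -> X2' segment.

def reverse(pathway):
    """Invert each primitive step and apply them in the opposite order."""
    inv = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend"}
    out = []
    for step in reversed(pathway):
        if step[0] == "rename":
            out.append(("rename", step[2], step[1]))
        else:
            out.append((inv[step[0]],) + step[1:])
    return out

x1_to_x1p = [("rename", "seq", "sequence")]      # from correspondences
x1p_to_x2p = [("extend", "organism")]            # from restructuring
x2_to_x2p = [("add", "sequence", "q")]           # from correspondences

full = x1_to_x1p + x1p_to_x2p + reverse(x2_to_x2p)
print(full)
# → [('rename', 'seq', 'sequence'), ('extend', 'organism'),
#    ('delete', 'sequence', 'q')]
```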
Architecture
A workflow tool could use our approach either dynamically or statically:

Mediation service
- Workflow tool invokes service S1 and receives its output
- Workflow tool submits the output of S1, the schema of S2 and the two sets of correspondences to an AutoMed service
- The AutoMed service transforms the output of S1 to a suitable input for consumption by S2

Shim generation
- AutoMed is used to generate a shim for services S1 and S2
- The XMLDSS schema transformation algorithm, currently tightly coupled with AutoMed functionality, can be exported as a single XQuery query able to materialise the input of S2 from the data output by S1
10. Conclusions
Integrating biological data sources is hard!
The overarching motivation is the potential to make scientific discoveries that can improve quality of life
The technical challenges faced can lead to new, more generally applicable, DI techniques
Thus, biological data integration continues to be a rich field for multi- and interdisciplinary research between clinicians, biologists, bioinformaticians and computer scientists