Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).
Iowa State University, Department of Computer Science, Artificial Intelligence Research Laboratory
INDUS: A System for Information Integration and Knowledge Acquisition from Autonomous, Distributed and Semantically Heterogeneous Data Sources
Jie Bao, Doina Caragea, Jyotishman Pathak, Adrian Silvescu, Carson Andorf, Changhui Yan, Drena Dobbs and Vasant Honavar
June 28, 2005
Background and Motivation
• Transformation of biology from a data-poor science into a data-rich science
• Proliferation of autonomous, semantically heterogeneous, distributed data sources (more than 500 data repositories of interest to molecular biologists alone)
• Needed: software tools for knowledge acquisition from semantically heterogeneous, distributed data sources
[Figure: example data sources InterPro, MIPS and Swissprot]
Outline
• INDUS Information Integration System
• INDUS Tools: Technical Details and Demo
• Summary and Work in Progress
Ontology-based information integration in INDUS
Semantically Heterogeneous Data Sources
D1:
Protein ID | Protein Name | Protein Sequence | Prosite Motifs | EC Number
P35626 | Beta-adrenergic receptor kinase 2 | MADLEAVLAD VSYLMAMEKS … | RGS, PROT_KIN_DOM, PH_DOMAIN | 2.7.1.126 Beta-adrenergic receptor kinase
Q12797 | Aspartyl/asparaginyl beta-hydroxylase | MAQRKNAKSS GNSSSSGSGS … | TPR, TPR_REGION, TPR | 1.14.11.16 Peptide-aspartate beta-dioxygenase

D2:
Accession Number AN | Gene | AA Sequence | Length | Pfam Domains | MIPS Funcat
P32589 | SSE1 | STPFGLDLGN NNSVLAVARN … | 692 | HSP70 | 16.01 protein binding
P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN … | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
Capabilities of INDUS
INDUS provides support for:
• Specification and update of schemas and ontologies
• Specification of mappings between ontologies
• Registration of new data sources
• Specification of user views
• Specification and execution of queries across distributed, semantically heterogeneous data sources
INDUS Tools
• Ontology Editor for specifying or modifying ontologies
• Schema Editor for specifying or modifying data source schemas
• Mapping Editor for specifying mappings between ontologies and between schemas
• Data Editor for registering data sources with INDUS
• View Editor for defining user views
• Query Interface for formulating queries and displaying results
INDUS Users: Domain Ontologists
A domain ontologist can:
• Specify or update ontologies
• Specify or update schemas
• Specify or update mappings between ontologies
• Specify or update mappings between schemas
INDUS Users: Data Providers
A data provider can:
• Associate a predefined schema and ontology with a data source
• Specify data source location, type and access procedures
• Register a data source
• Act as a domain ontologist
INDUS Users: Domain Experts
A domain expert can specify an application view, i.e.,
• Select data sources of interest in an application domain
• Select an application-specific schema
• Select an application-specific ontology
• Select relevant mappings
A domain expert can serve as:
• Domain ontologist
• Data provider
INDUS Users: Domain Scientists
A domain scientist can:
• Select an application view
• Formulate and execute queries
A domain scientist can act as:
• Domain ontologist
• Data provider
• Domain expert
Outline
• INDUS Information Integration System
• INDUS Tools: Technical Details and Demo
• Summary and Work in Progress
Semantically Heterogeneous Data
Data sources need to be made self-describing by specifying the relevant meta data
Meta Data
• Schema – structure of data
  Specification of the attributes of the data and their types
• Ontology – conceptualization of semantics of data
  Domains of attributes and relationships between values

Schema for protein data in D1:
Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: Motifs
EC Number: EC Hierarchy
Attribute value hierarchy
An attribute value hierarchy (AVH) is a partial order ontology over the values of attributes of data
Example: MIPS Funcat Hierarchy
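
A minimal Python sketch of an attribute value hierarchy treated as a partial order. Only the terms "16.01" and "16.19.01" come from the D2 example; the nodes "16" and "16.19" and all parent links are assumptions made for illustration, not a copy of the MIPS catalogue.

```python
# Toy attribute value hierarchy (AVH): each term points to its parent.
# Only "16.01" and "16.19.01" appear in the D2 example; the remaining
# nodes and every parent link are illustrative assumptions.
PARENT = {
    "16.01": "16",
    "16.19": "16",
    "16.19.01": "16.19",
    "16": None,  # hypothetical root of this fragment
}


def ancestors(term):
    """Return the strict ancestors of `term`, nearest first."""
    chain = []
    node = PARENT[term]
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def is_a(term, other):
    """Partial order test: True if `term` equals `other` or lies below it."""
    return term == other or other in ancestors(term)


print(is_a("16.19.01", "16.19"))  # a term lies below its parent
print(is_a("16.01", "16.19"))     # these two terms are incomparable
```

Because the order is partial, two terms may be unrelated in either direction, which is what distinguishes an AVH from a flat list of values.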
Making data sources self-describing: ontology-extended data source
Ontology-extended data source = Data + Schema + Ontology

Schema and ontology:
Accession Number: MIPS ID
Gene: Gene ID
Length: Positive Integer
Prosite Motifs: Motifs
MIPS Funcat: MIPS Hierarchy

Data:
P32589 | SSE1 | STPFGLDLGNNNSVLAVARN | 692 | HSP70 | 16.01 protein binding
P07278 | BCY1 | VSSLPKESQAELQLFQNEIN | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
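
One way to picture "data + schema + ontology" is as a single record type that carries its own metadata. The sketch below is an assumption about how such a structure could look in Python; the class name `OntologyExtendedSource`, its fields and the `validate` helper are hypothetical and are not part of INDUS.

```python
from dataclasses import dataclass


@dataclass
class OntologyExtendedSource:
    """Data plus the metadata (schema and ontology) that describes it."""
    name: str
    schema: dict    # attribute name -> attribute type
    ontology: dict  # attribute name -> {value: parent value or None}
    rows: list      # list of dicts keyed by attribute name

    def validate(self):
        """Return the rows that use attributes the schema does not declare."""
        return [r for r in self.rows if set(r) - set(self.schema)]


d2 = OntologyExtendedSource(
    name="D2",
    schema={
        "Accession Number": "MIPS ID",
        "Gene": "Gene ID",
        "Length": "Positive Integer",
        "MIPS Funcat": "MIPS Hierarchy",
    },
    # The parent link below is an assumption made for this sketch.
    ontology={"MIPS Funcat": {"16": None, "16.01": "16"}},
    rows=[{"Accession Number": "P32589", "Gene": "SSE1",
           "Length": 692, "MIPS Funcat": "16.01"}],
)
print(d2.validate())  # rows with undeclared attributes, if any
```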
INDUS: Ontology Editor
INDUS: Schema Editor
INDUS: Data Editor
User view
A user view is given by:
• a set of ontology-extended data sources that are of interest to the user
• a user schema and ontology (defining a virtual data source)
• a set of mappings from data source schemas and ontologies to the user schema and ontology

Data sources of interest: MIPS, Swissprot

User schema and ontology:
PID: Swissprot ID
Source: Species String
Protein: AA String
Structural Class: SCOP
GO Function: GO Hierarchy
Mappings
The interoperation between the schema and ontology associated with a data source and a user schema and ontology is facilitated by specifying mappings at:
• Schema Level: between attributes in different schemas
• Ontology Level: between values of the attributes described in different ontologies
The consistency of the set of mappings between data source schemas and ontologies and user schema and ontology can be checked using a reasoner
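
The slides do not say which reasoner or which checks INDUS uses. As one illustration of what a consistency check could look like, the sketch below tests a single property: equivalence mappings should preserve the hierarchy order. All term names and the functions `is_below` and `inconsistent_pairs` are hypothetical.

```python
def is_below(term, other, parent):
    """True if `term` equals `other` or is a descendant of it."""
    while term is not None:
        if term == other:
            return True
        term = parent.get(term)
    return False


def inconsistent_pairs(equiv, src_parent, dst_parent):
    """Find pairs of equivalence mappings that disagree with the hierarchies.

    `equiv` maps a source term to the target term declared equivalent to it.
    If x is below y in the source hierarchy, the image of x must be below
    the image of y in the target hierarchy; violating pairs are returned.
    """
    bad = []
    for x, fx in equiv.items():
        for y, fy in equiv.items():
            if (x != y and is_below(x, y, src_parent)
                    and not is_below(fx, fy, dst_parent)):
                bad.append((x, y))
    return bad


# Hypothetical hierarchies: term -> parent.
src = {"a1": "a", "a": None}
dst = {"b1": "b", "b": None, "c": None}

print(inconsistent_pairs({"a1": "b1", "a": "b"}, src, dst))  # order preserved
print(inconsistent_pairs({"a1": "c", "a": "b"}, src, dst))   # order violated
```

A real reasoner would cover more than this one rule; the point is only that a set of mappings can be checked mechanically against the two hierarchies it connects.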
Mappings at schema level
D1:
Protein ID: Swissprot ID
Protein Name: String
Protein Sequence: AA String
Prosite Motifs: AA String
EC Number: EC Hierarchy

D2:
Accession No AN: MIPS ID
Gene: Gene ID
AA Sequence: AA String
Length: Pos Integer
MIPS Funcat: MIPS Hierarchy
Pfam Motifs: Motifs

DU:
PID: Swissprot ID
Protein: AA String
GO Function: GO Hierarchy
Source: Species String
Mappings at schema level
Protein ID : D1 ≡ PID : DU
Accession Number AN : D2 ≡ PID : DU
Protein Sequence : D1 ≡ AA Composition : DU
AA Sequence : D2 ≡ AA Composition : DU
EC Number : D1 ≡ GO Function : DU
MIPS Funcat : D2 ≡ GO Function : DU
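
The schema-level mappings listed on this slide can be read as attribute renamings from each source into the user schema DU. The Python sketch below applies them to a record; the function `to_user_schema` and the choice to drop unmapped attributes are assumptions made for illustration.

```python
# Schema-level mappings from the slide: source attribute -> user attribute.
SCHEMA_MAPPINGS = {
    "D1": {"Protein ID": "PID", "Protein Sequence": "AA Composition",
           "EC Number": "GO Function"},
    "D2": {"Accession Number AN": "PID", "AA Sequence": "AA Composition",
           "MIPS Funcat": "GO Function"},
}


def to_user_schema(source, record):
    """Rename mapped attributes; drop attributes that have no mapping."""
    mapping = SCHEMA_MAPPINGS[source]
    return {mapping[a]: v for a, v in record.items() if a in mapping}


row = {"Accession Number AN": "P32589", "Gene": "SSE1",
       "MIPS Funcat": "16.01"}
print(to_user_schema("D2", row))
```

Note that only attribute names change here. The value "16.01" is still a MIPS Funcat term, so it would additionally need an ontology-level mapping before it could be compared with GO terms.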
Mappings at ontology level
[Figure: hierarchies of D1 and DU with mappings drawn between their terms]
EC 2.7.1.126 : D1 ≡ GO 0047696 : DU
EC 2.7.1 : D1 [relation symbol missing] GO 0047696 : DU
EC 2.7.1.126 : D1 [relation symbol missing] GO 0004672 : DU
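
An ontology-level mapping relates values rather than attributes. The sketch below handles only the equivalence EC 2.7.1.126 : D1 ≡ GO 0047696 : DU stated on these slides; the function names are hypothetical, and the other relations between hierarchy terms are not modelled.

```python
# Equivalence taken from the slide; identifiers are kept as plain strings.
VALUE_EQUIV = {"EC 2.7.1.126": "GO 0047696"}


def translate_value(value, equiv=VALUE_EQUIV):
    """Return the user-ontology term for `value`, or None if unmapped."""
    return equiv.get(value)


def translate_back(value, equiv=VALUE_EQUIV):
    """Invert an equivalence mapping (valid because equivalence is symmetric)."""
    inverse = {dst: src for src, dst in equiv.items()}
    return inverse.get(value)


print(translate_value("EC 2.7.1.126"))
print(translate_back("GO 0047696"))
```

Returning None for unmapped values keeps "no known correspondence" distinct from any real term, so a caller can decide how to treat such records.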
INDUS: View Editor
INDUS: Mapping Editor
Sample Query
Return ALL proteins whose GO function isa nucleotide binding
Return ALL proteins whose GO function isa kinase activity OR those that are involved in the GO process phosphate metabolism
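
To make the "isa" condition concrete, the Python sketch below evaluates the first sample query over a small hierarchy. The hierarchy fragment, the two records and their GO function assignments are hypothetical and stand in for data already expressed in the user view.

```python
# Hypothetical fragment of a GO-like hierarchy: term -> parent.
PARENT = {
    "protein kinase activity": "kinase activity",
    "kinase activity": None,
    "cyclic nucleotide binding": "nucleotide binding",
    "nucleotide binding": None,
}

# Hypothetical records already expressed in the user view.
PROTEINS = [
    {"PID": "P35626", "GO Function": "protein kinase activity"},
    {"PID": "P07278", "GO Function": "cyclic nucleotide binding"},
]


def isa(term, target):
    """True if `term` is `target` or one of its descendants."""
    while term is not None:
        if term == target:
            return True
        term = PARENT.get(term)
    return False


def select(records, attribute, target):
    """Return the PIDs of records whose `attribute` value isa `target`."""
    return [r["PID"] for r in records if isa(r[attribute], target)]


print(select(PROTEINS, "GO Function", "nucleotide binding"))
```

The second sample query, which combines two conditions with OR, could be answered in the same style by taking the union of two such selections.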
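Query processing in INDUS
[Figure: query processing pipeline with the stages Query Formulation, Local Rewriting, Query Decomposition, Query Translation, Remote Rewriting, Query Execution, Inverse Translation and Result Composition. The user query QL is posed over the user schema and ontology (SV, OV) and decomposed into Q1 … Qn; mappings M1 … Mn translate these into queries over the remote schemas and ontologies (S1, O1) … (Sn, On), which are executed against D1 … Dn; the results r1 … rn are translated back into the local schema and ontology (r1L … rnL) and composed into RL.]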
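
The stages of query processing can be sketched as a toy pipeline in Python. Everything below is an assumption made for illustration: the `SOURCES` structure, the in-memory rows standing in for a remote source, and the restriction to equality conditions. It is not the INDUS implementation.

```python
# Hypothetical per-source metadata: attribute and value mappings from the
# user view into the source, plus in-memory rows standing in for remote data.
SOURCES = {
    "D1": {
        "attr": {"PID": "Protein ID", "GO Function": "EC Number"},
        "value": {"GO 0047696": "EC 2.7.1.126"},
        "rows": [{"Protein ID": "P35626", "EC Number": "EC 2.7.1.126"},
                 {"Protein ID": "Q12797", "EC Number": "EC 1.14.11.16"}],
    },
}


def translate(query, meta):
    """Rewrite a user-level (attribute, value) query into source terms."""
    attribute, value = query
    return meta["attr"][attribute], meta["value"].get(value)


def execute(remote_query, meta):
    """Run the translated equality query against the source's rows."""
    attribute, value = remote_query
    return [r for r in meta["rows"] if r[attribute] == value]


def inverse_translate(rows, meta):
    """Express source rows in the user schema and ontology."""
    attr_back = {src: usr for usr, src in meta["attr"].items()}
    value_back = {src: usr for usr, src in meta["value"].items()}
    return [{attr_back[a]: value_back.get(v, v) for a, v in r.items()}
            for r in rows]


def answer(query):
    """Send the query to every source, then compose the translated results."""
    results = []
    for meta in SOURCES.values():
        remote_query = translate(query, meta)
        results.extend(inverse_translate(execute(remote_query, meta), meta))
    return results


print(answer(("GO Function", "GO 0047696")))
```

The user only ever sees attributes and values from the user view; the source's own names appear solely between `translate` and `inverse_translate`.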
INDUS: Query Editor
INDUS
Some features of INDUS:
• Clear distinction between structure and semantics of data
• Data integration from a user perspective: user-specifiable ontologies and mappings (no single global ontology)
• Data integration on the fly
• Semantic integrity of queries ensured by means of semantics-preserving mappings
Related work
Information integration: [Sheth and Larson, 1990; Davidson et al., 2001; Eckman, 2003; Levy, 1998]
Biological data integration: SRS [Etzold et al., 2003], K2 [Tannen et al., 2003], Kleisli [Chen et al., 2003], IBM’s DiscoveryLink [Haas et al., 2001], TAMBIS [Stevens et al., 2003], Bio-Mediator [Shaker et al., 2004], etc.
Ontology and mapping editors: Protégé [Noy et al., 2000], Clio [Eckman et al., 2002], DAG-Edit, etc.
Ontology-extended relational algebra: [Bonatti et al., 2003]
Outline
• INDUS Information Integration System
• INDUS Tools: Technical Details and Demo
• Summary and Work in Progress
Summary
Work in progress
Ontologies and mappings
• Support for more expressive ontologies (beyond hierarchies) [Bao et al., 2005]
• Support for interactive specification of mappings between ontologies, including automated generation of candidate mappings
• Support for modular ontologies and mappings [Bao and Honavar, 2004]
• Scalability: efficient mechanisms for storage, manipulation, retrieval and use of large ontologies and mappings
• More powerful reasoning to ensure the semantic integrity of mappings
• Support for import, export, and sharing of ontologies and mappings (e.g. OBO and OWL)
Work in progress
Query Processing
• Query optimization under access, bandwidth and computational constraints
• Implementation of data retrieval procedures (iterators) for widely used bioinformatics data sources
• Support for data caching and data sharing
Work in progress
Knowledge Acquisition
• Support for learning classifiers and other predictive models from semantically heterogeneous data [Caragea et al., 2005]
• Support for statistical queries, including queries over partially specified data [Caragea et al., 2004]
• Support for annotating and sharing results of knowledge acquisition
Work in progress
Applications in bioinformatics: data-driven discovery of macromolecular sequence-structure-function relationships
• Prediction of protein function [Andorf et al., 2004]
• Prediction of protein-protein, protein-DNA and protein-RNA interfaces [Yan et al., 2004]
• Analysis, visualization, and interpretation of gene expression data [Caragea et al., 2005]
• Modeling and discovery of gene regulatory networks
Usability studies
• Design of better user interfaces
• Performance evaluation
Relevant Publications
Caragea, D., Pathak, J., Bao, J., Silvescu, A., Andorf, C., Dobbs, D. and Honavar, V. (2005). Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources. In: Proceedings of the 2nd International Workshop on Data Integration in Life Sciences (DILS'05), San Diego, CA.
Caragea, D., Pathak, J., and Honavar, V. (2004). Learning Classifiers from Semantically Heterogeneous Data. In: Proceedings of the Third International Conference on Ontologies, DataBases and Applications of Semantics for Large Scale Information Systems (ODBASE’04), October 25-29, 2004, Agia Napa, Cyprus.
Caragea, D., Silvescu, A., and Honavar, V. (2004). A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems. Vol. 1, No. 2. Invited Paper.
http://www.cs.iastate.edu/~dcaragea/indus.html