Research supported in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM066387).
Iowa State University, Department of Computer Science, Artificial Intelligence Research Laboratory
Information Integration and Knowledge Acquisition from Semantically Heterogeneous Biological Data Sources
Doina Caragea, Jyotishman Pathak, Jie Bao, Adrian Silvescu, Carson Andorf, Drena Dobbs and Vasant Honavar
July 26, 2005
Semantic Web Vision
Background and Motivation
Transformation of biology from a data poor science into a data rich science
Proliferation of autonomous, semantically heterogeneous, distributed data sources (more than 500 data repositories of interest to molecular biologists alone)
Needed: Software tools for knowledge acquisition from semantically heterogeneous distributed data sources
[Figure: example data sources: InterPro, MIPS, Swissprot]
INDUS (INtelligent Data Understanding System)
[Figure: learning algorithms; ontology-extended data sources (SwissProt, O_SwissProt), (MIPS, O_MIPS), (InterPro, O_InterPro); user ontologies O and O']
Goal: knowledge discovery from large, distributed, semantically heterogeneous data
Outline
- Ontology-Based Information Integration
- Learning Classifiers from Semantically Heterogeneous Data
- INDUS: Information Integration and Knowledge Acquisition System
- Summary and Work in Progress
Semantically Heterogeneous Data
Data source D1:
Protein ID | Protein Name | Protein Sequence | Prosite Motifs | EC Number
-----------|--------------|------------------|----------------|----------
P35626 | Beta-adrenergic receptor kinase 2 | MADLEAVLAD VSYLMAMEKS … | RGS, PROT_KIN_DOM, PH_DOMAIN | 2.7.1.126 Beta-adrenergic receptor kinase
Q12797 | Aspartyl/asparaginyl beta-hydroxylase | MAQRKNAKSS GNSSSSGSGS … | TPR, TPR_REGION, TPR | 1.14.11.16 Peptide-aspartate beta-dioxygenase

Data source D2:
Accession Number AN | Gene | AA Sequence | Length | Pfam Domains | MIPS Funcat
--------------------|------|-------------|--------|--------------|------------
P32589 | SSE1 | STPFGLDLGN NNSVLAVARN … | 692 | HSP70 | 16.01 protein binding
P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN … | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)

Data sources need to be made self-describing by specifying the relevant meta data
Meta Data
- Schema: structure of data
  Specification of the attributes of the data and their types
- Ontology: conceptualization of semantics of data
  Domains of attributes and relationships between values
Schema for protein data in D1:
  Protein ID: Swissprot ID
  Protein Name: String
  Protein Sequence: AA String
  Prosite Motifs: Motifs
  EC Number: EC Hierarchy
Attribute value hierarchy
An attribute value hierarchy (AVH) is an ontology that imposes a partial order on the values of an attribute of the data
Example: MIPS Funcat Hierarchy
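As a rough illustration of an AVH, the sketch below stores a small hierarchy as a child-to-parent map and tests the partial order by walking upward. The codes 16.01 and 16.19.01 appear in the example data; the intermediate codes 16 and 16.19 are assumed from the dotted prefix structure, and the names `PARENT`, `ancestors` and `is_a` are hypothetical.

```python
# Toy attribute value hierarchy: each value maps to its parent (None marks a root).
# 16.01 and 16.19.01 come from the slides; 16 and 16.19 are assumed from the prefixes.
PARENT = {
    "16": None,
    "16.01": "16",
    "16.19": "16",
    "16.19.01": "16.19",
}

def ancestors(value, parent=PARENT):
    """Yield the value itself and then each ancestor, walking up to the root."""
    while value is not None:
        yield value
        value = parent[value]

def is_a(value, concept, parent=PARENT):
    """True if value equals concept or lies below it in the hierarchy."""
    return concept in ancestors(value, parent)
```

With this representation, `is_a("16.19.01", "16")` holds while `is_a("16.01", "16.19")` does not, which is the partial order the definition refers to.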
Making data sources self-describing - Ontology-extended data source
Ontology-extended data source = Data + Schema + Ontology
Schema:
  Accession Number: MIPS ID
  Gene: Gene ID
  Length: Positive Integer
  Prosite Motifs: Motifs
  MIPS Funcat: MIPS Hierarchy
Data:
  P32589 | SSE1 | STPFGLDLGN NNSVLAVARN | 692 | HSP70 | 16.01 protein binding
  P07278 | BCY1 | VSSLPKESQA ELQLFQNEIN | 415 | RIIa | 16.19.01 cyclic nucleotide binding (cAMP, cGMP, etc.)
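One way to picture an ontology-extended data source is as a container that keeps the records together with the schema and ontology that describe them. The sketch below is only an illustration under that assumption: the class `OntologyExtendedSource`, its fields, and the abbreviated four-attribute schema are hypothetical, and the intermediate Funcat codes 16 and 16.19 are assumed from the dotted prefixes.

```python
from dataclasses import dataclass, field

@dataclass
class OntologyExtendedSource:
    """Data plus the schema and ontology that describe it (hypothetical container)."""
    name: str
    schema: dict    # attribute name -> type name
    ontology: dict  # hierarchy-valued attribute -> {value: parent}
    records: list = field(default_factory=list)

    def add(self, record):
        """Store a record after checking it against the schema and the ontology."""
        if set(record) != set(self.schema):
            raise ValueError("record attributes do not match the schema")
        for attr, hierarchy in self.ontology.items():
            if record[attr] not in hierarchy:
                raise ValueError(f"{record[attr]!r} is not a value of {attr}")
        self.records.append(record)

# Abbreviated version of D2 (only some attributes are kept for brevity).
d2 = OntologyExtendedSource(
    name="D2",
    schema={"Accession Number": "MIPS ID", "Gene": "Gene ID",
            "Length": "Positive Integer", "MIPS Funcat": "MIPS Hierarchy"},
    ontology={"MIPS Funcat": {"16": None, "16.01": "16",
                              "16.19": "16", "16.19.01": "16.19"}},
)
d2.add({"Accession Number": "P32589", "Gene": "SSE1",
        "Length": 692, "MIPS Funcat": "16.01"})
```

Keeping the schema and ontology next to the data is what lets a later step interpret the attribute values without outside knowledge of the source.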
User view
A user view is given by:
- a set of ontology-extended data sources that are of interest to the user
- a user schema and ontology (defining a virtual data source)
- a set of mappings from data source schemas and ontologies to the user schema and ontology
User schema:
  PID: Swissprot ID
  Source: Species String
  Protein: AA String
  Structural Class: SCOP
  GO Function: GO Hierarchy
Data sources of interest: MIPS, Swissprot
[Figure: user view combining the user schema, the user ontology and the data sources of interest]
Mappings
The interoperation between the schema and ontology associated with a data source and a user schema and ontology is facilitated by specifying mappings at:
- Schema level: between attributes in different schemas
- Ontology level: between values of the attributes described in different ontologies
Mappings at schema level
Mappings:
  Protein ID : D1 ≡ PID : DU
  Accession Number AN : D2 ≡ PID : DU
  Protein Sequence : D1 ≡ AA Composition : DU
  AA Sequence : D2 ≡ AA Composition : DU
  EC Number : D1 ≡ GO Function : DU
  MIPS Funcat : D2 ≡ GO Function : DU
Schema of D1:
  Protein ID: SwissProt ID
  Protein Name: String
  Protein Sequence: AA String
  Prosite Motifs: AA String
  EC Number: EC Hierarchy
Schema of D2:
  Accession No AN: MIPS ID
  Gene: Gene ID
  AA Sequence: AA String
  Length: Pos Integer
  MIPS Funcat: MIPS Hierarchy
  Pfam Motifs: Motifs
User schema DU:
  PID: SwissProt ID
  Protein: AA String
  GO Function: GO Hierarchy
  Source: Species String
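The schema-level mappings listed on this slide can be read as a renaming table from source attributes to user attributes. The following is a minimal sketch under that reading; the names `SCHEMA_MAP` and `to_user_view` are hypothetical, and dropping unmapped attributes is an assumption rather than documented INDUS behaviour.

```python
# Schema-level mappings from the slide: (source, source attribute) -> user attribute.
SCHEMA_MAP = {
    ("D1", "Protein ID"): "PID",
    ("D2", "Accession Number AN"): "PID",
    ("D1", "Protein Sequence"): "AA Composition",
    ("D2", "AA Sequence"): "AA Composition",
    ("D1", "EC Number"): "GO Function",
    ("D2", "MIPS Funcat"): "GO Function",
}

def to_user_view(source, record, schema_map=SCHEMA_MAP):
    """Rename mapped attributes into the user schema; unmapped ones are dropped (assumption)."""
    return {schema_map[(source, attr)]: value
            for attr, value in record.items()
            if (source, attr) in schema_map}

row = to_user_view("D2", {"Accession Number AN": "P32589",
                          "Gene": "SSE1",
                          "MIPS Funcat": "16.01"})
```

Note that only the attribute names change here: the value 16.01 is still a MIPS Funcat code, so translating values is left to the ontology-level mappings.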
Mappings at ontology level
[Figure: EC hierarchy of D1 shown next to the GO hierarchy of DU]
EC 2.7.1.126 : D1 ≡ GO 0047696 : DU
EC 2.7.1 : D1 ≥ GO 0047696 : DU
EC 2.7.1.126 : D1 ≤ GO 0004672 : DU
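For illustration, the equivalence between EC 2.7.1.126 in D1 and GO 0047696 in DU can be applied as a value translation. The sketch below treats the equivalences as a plain lookup table, which is a simplifying assumption; `EQUIVALENT` and `translate` are hypothetical names, and values without a stated correspondence are returned as `None` rather than guessed.

```python
# Ontology-level equivalences from D1 (EC) values to DU (GO) values.
# Only the pair shown on the slides is listed; nothing else is assumed.
EQUIVALENT = {"EC 2.7.1.126": "GO 0047696"}

def translate(value, equivalent=EQUIVALENT):
    """Return the user-ontology value equivalent to a source value, or None if unmapped."""
    return equivalent.get(value)

def translate_record(record, attr, equivalent=EQUIVALENT):
    """Return a copy of the record with one attribute expressed in the user ontology."""
    translated = dict(record)
    translated[attr] = translate(record[attr], equivalent)
    return translated

row = translate_record({"PID": "P35626", "GO Function": "EC 2.7.1.126"}, "GO Function")
```

A fuller treatment would also use the non-equivalence correspondences to place a source value under the nearest user-ontology concept.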
Integration ontology
An ontology $(O_U, \leq_U)$ is called an integration ontology of a set of data source ontologies $(O_1, \leq_1), \dots, (O_K, \leq_K)$ if there exist $K$ partial injective mappings $\Phi_1, \dots, \Phi_K$ from $O_1, \dots, O_K$, respectively, to $O_U$ such that:
- Order preservation: $x \leq_i y$ implies $\Phi_i(x) \leq_U \Phi_i(y)$, for all $x, y \in O_i$
- Semantic correspondence preservation: if $(x : O_i \;\mathrm{op}\; y : O_U) \in IC$, then $(\Phi_i(x) \;\mathrm{op}\; y)$, for all $x \in O_i$ and $y \in O_U$
We provide user-friendly tools for specifying semantic correspondences that are used to infer mappings semi-automatically
The consistency of the set of mappings between data source schemas and ontologies and user schema and ontology can be checked using a reasoner
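The order-preservation and injectivity conditions can be checked mechanically on small hierarchies. The sketch below is a brute-force check, not the reasoner mentioned above; hierarchies are assumed to be child-to-parent maps, and the toy values `a`, `b`, `u`, `v` and all function names are hypothetical.

```python
def leq(x, y, parent):
    """x <= y in a hierarchy given as a child -> parent map (None marks a root)."""
    while x is not None:
        if x == y:
            return True
        x = parent[x]
    return False

def is_injective(phi):
    """A partial mapping is injective if no two source values share an image."""
    return len(set(phi.values())) == len(phi)

def preserves_order(phi, src_parent, dst_parent):
    """Check that x <= y in the source implies phi(x) <= phi(y) for all mapped x, y."""
    return all(leq(phi[x], phi[y], dst_parent)
               for x in phi for y in phi
               if leq(x, y, src_parent))

source = {"a": None, "b": "a"}   # b is below a
target = {"u": None, "v": "u"}   # v is below u
good = {"a": "u", "b": "v"}
bad = {"a": "v", "b": "u"}       # reverses the order
```

Here `good` passes both checks, while `bad` is injective but fails order preservation because b lies below a in the source and its image u does not lie below v.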
Sample Query
Return ALL proteins whose GO function isa nucleotide binding
Return ALL proteins whose GO function isa kinase activity OR those that are involved in the GO process phosphate metabolism
Outline
- Ontology-Based Information Integration
- Learning Classifiers from Semantically Heterogeneous Data
- INDUS: Information Integration and Knowledge Acquisition System
- Summary and Work in Progress
Learning classifiers from data
Learning: Labeled examples (data) -> Learner -> Classifier (hypothesis)
Classification: Unlabeled examples -> Classifier -> Class
Standard learning algorithms assume centralized access to data
Human and yeast protein training data
PID | Source | Structural Classes | Sequence | GO Function
----|--------|--------------------|----------|------------
P39708 | Yeast | Mainly alpha; Alpha beta | VSSLPKESQA ELQLFQNEIN | GO 0016208: AMP binding
Q01574 | Yeast | Mainly alpha | STPFGLDLGN NNSVLAVARN | GO 0005515: protein binding
Q12797 | Human | Not Known | MAQRKNAKSS GNSSSSGSGS | GO 0004597: peptide-aspartate
P35626 | Human | Mainly beta; Few Secondary Structures | MADLEAVLAD VSYLMAMEKS | GO 0047696: beta-adrenergic-receptor kinase activity

Rows are the examples/instances/cases. The columns PID through Sequence are the attributes/features/variables; GO Function is the class/label.
Probabilistic models for protein function classification
PID | Sequence | GO Function
----|----------|------------
P39708 | VSSLPKESQA ELQLFQNEIN | GO 0016208: AMP binding
Q01574 | STPFGLDLGN NNSVLAVARN | GO 0005515: protein binding
Q12797 | MAQRKNAKSS GNSSSSGSGS | GO 0004597: peptide-aspartate
P35626 | MADLEAVLAD VSYLMAMEKS | GO 0047696: beta-adrenergic-receptor kinase activity

Naïve Bayes Algorithm
- Very simple algorithm, works surprisingly well in practice
- Treats every sequence $S$ as a "bag" of amino acids $A_1, \dots, A_n$
- "Gold standard" for evaluating other methods

The most probable class $c(S)$ of a sequence $S$ is:
\[
\begin{aligned}
c(S) &= \arg\max_{c_j \in C} P(c_j \mid A_1, \dots, A_n) \\
     &= \arg\max_{c_j \in C} \frac{P(A_1, \dots, A_n \mid c_j)\, P(c_j)}{P(A_1, \dots, A_n)} \\
     &= \arg\max_{c_j \in C} P(A_1, \dots, A_n \mid c_j)\, P(c_j) \\
     &= \arg\max_{c_j \in C} P(c_j) \prod_i P(A_i \mid c_j)
\end{aligned}
\]
Learning classifiers from data revisited
Learning = Information extraction + Hypothesis generation
Information extraction = Sufficient statistics gathering
[Figure: the learner holds a partial hypothesis hi and formulates a statistical query s(D, hi->hi+1) against the data D; the answer s(D, hi->hi+1) feeds hypothesis generation, hi+1 <- R(hi, s(D, hi->hi+1))]
Sufficient statistics for learning classifiers
A statistic s(D) is called a sufficient statistic for a parameter θ if s(D) captures all the information about θ that is contained in the data D. We are interested in minimal sufficient statistics [Casella and Berger, 2001].
A statistic sL(D) is called a sufficient statistic for learning a hypothesis h using a learning algorithm L applied to a data set D if there exists an algorithm that takes sL(D) as input and outputs h [Caragea et al., 2004a].
Naïve Bayes learning as information gathering and hypothesis generation
Sufficient statistics: count(AminoAcid, Class) and count(Class)
Naïve Bayes class:
\[
c(S = A_1, \dots, A_n) = \arg\max_{c_j \in C} P(c_j) \prod_i P(A_i \mid c_j)
\]
[Figure: Naïve Bayes sends count queries, for each ai and each cj, to the query answering engine over the data; the returned counts, Counts(ai|cj) and Counts(cj), are used to compute P(cj) and P(ai|cj)]
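The split between gathering counts and generating the hypothesis can be sketched in a few lines. This is an illustrative sketch, not the INDUS implementation: the function names are hypothetical, the training strings are toy data, and Laplace smoothing over a 20-letter alphabet is an added assumption to avoid zero probabilities.

```python
import math
from collections import Counter

def gather_counts(examples):
    """Sufficient statistics: count(amino acid, class) and count(class)."""
    aa_class, cls = Counter(), Counter()
    for sequence, label in examples:
        cls[label] += 1
        for aa in sequence:
            aa_class[(aa, label)] += 1
    return aa_class, cls

def classify(sequence, aa_class, cls, alphabet_size=20):
    """Return argmax_c P(c) * prod_i P(A_i | c), computed in log space.

    Laplace smoothing (add one) is an assumption of this sketch.
    """
    total = sum(cls.values())
    best, best_score = None, -math.inf
    for label, n in cls.items():
        denom = sum(v for (_, c), v in aa_class.items() if c == label) + alphabet_size
        score = math.log(n / total)
        for aa in sequence:
            score += math.log((aa_class[(aa, label)] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Toy training data (hypothetical sequences and class names).
examples = [("AAAK", "c1"), ("GGGS", "c2")]
aa_class, cls = gather_counts(examples)
predicted = classify("AAK", aa_class, cls)
```

The classifier touches the data only through `gather_counts`, which is the point of the slide: once the counts are available, the raw examples are no longer needed.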
Learning classifiers from distributed data
Information extraction from distributed data + Hypothesis generation
[Figure: the learner holds a partial hypothesis hi and formulates a statistical query s(D, hi->hi+1); the query answering engine decomposes it into sub-queries q1, q2, …, qK for the data sources D1, D2, …, DK and composes their answers into the answer s(D, hi->hi+1), which feeds hypothesis generation, hi+1 <- R(hi, s(D, hi->hi+1))]
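For count-based statistics such as those used by Naïve Bayes, query decomposition and answer composition reduce to counting at each source and adding the results. The sketch below assumes the data are split by rows across sources (each example lives at exactly one source); the function names and toy data are hypothetical.

```python
from collections import Counter

def count_query(examples):
    """Answer q_k at one source: joint (amino acid, class) counts and class counts."""
    joint, cls = Counter(), Counter()
    for sequence, label in examples:
        cls[label] += 1
        joint.update((aa, label) for aa in sequence)
    return joint, cls

def compose(answers):
    """Answer composition: add up the counts returned by the K sources."""
    joint, cls = Counter(), Counter()
    for j, c in answers:
        joint += j
        cls += c
    return joint, cls

# Toy data split across two sources (hypothetical sequences and class names).
d1 = [("AAAK", "c1")]
d2 = [("GGGS", "c2"), ("AAKK", "c1")]
distributed = compose([count_query(d1), count_query(d2)])
centralized = count_query(d1 + d2)
```

Because `distributed` equals `centralized`, a learner that only needs these counts obtains the same hypothesis without the data ever being pooled in one place.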
Learning classifiers from semantically heterogeneous data sources
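[Figure: the learner holds a partial hypothesis hi and formulates a statistical query s(D, hi) expressed in the user ontology O; using the mappings M(O1...OK, O) from O1 … OK to O, the query is decomposed into sub-queries q1, q2, …, qK for the ontology-extended data sources (D1, O1), (D2, O2), …, (DK, OK), and their answers are composed into the answer s(D, hi), which feeds hypothesis generation, hi+1 <- R(hi, s(D, hi))]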
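With semantically heterogeneous sources, each local answer has to be expressed in the user ontology before the counts are combined. The sketch below shows this for class counts only and is an illustration under strong simplifications: mappings are plain lookup tables, unmapped labels are skipped, and the label `funcat-x` is a made-up placeholder (only the EC-to-GO pair comes from the slides).

```python
from collections import Counter

# Mappings M(O_k, O): local class label -> user-ontology label.
# The D1 entry is from the slides; the D2 entry is a made-up placeholder.
MAPPINGS = {
    "D1": {"EC 2.7.1.126": "GO 0047696"},
    "D2": {"funcat-x": "GO 0047696"},
}

def class_counts(source, labels, mappings=MAPPINGS):
    """Count class labels at one source after translating them into the user ontology."""
    to_user = mappings[source]
    return Counter(to_user[label] for label in labels if label in to_user)

# Each source answers in the user's vocabulary, so the answers can simply be added.
total = (class_counts("D1", ["EC 2.7.1.126"])
         + class_counts("D2", ["funcat-x", "funcat-x"]))
```

The learner sees a single count for GO 0047696 even though neither source stores that label itself.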
Outline
- Ontology-Based Information Integration
- Learning Classifiers from Semantically Heterogeneous Data
- INDUS: Information Integration and Knowledge Acquisition System
- Summary and Work in Progress
Ontology-based information integration in INDUS
Capabilities of INDUS
INDUS provides support for:
- Specification and update of schemas and ontologies
- Specification of mappings between ontologies
- Registration of new data sources
- Specification of user views
- Specification and execution of queries across distributed, semantically heterogeneous data sources
- Learning classifiers from semantically heterogeneous data
INDUS Tools
- Ontology Editor for specifying or modifying ontologies
- Schema Editor for specifying or modifying data source schemas
- Mapping Editor for specifying mappings between ontologies and between schemas
- Data Editor for registering data sources with INDUS
- View Editor for defining user views
- Query Interface for formulating queries and displaying results
INDUS Users: Domain Ontologists
A domain ontologist can:
- Specify or update ontologies
- Specify or update schemas
- Specify or update mappings between ontologies
- Specify or update mappings between schemas
INDUS Users: Data Providers
A data provider can:
- Associate a predefined schema and ontology with a data source
- Specify data source location, type and access procedures
- Register a data source
- Act as a domain ontologist
INDUS Users: Domain Experts
A domain expert can specify an application view, i.e.:
- Select data sources of interest in an application domain
- Select an application-specific schema
- Select an application-specific ontology
- Select relevant mappings
A domain expert can serve as:
- Domain ontologist
- Data provider
INDUS Users: Domain Scientists
A domain scientist can:
- Select an application view
- Formulate and execute queries
A domain scientist can act as:
- Domain ontologist
- Data provider
- Domain expert
INDUS
Some features of INDUS:
- Clear distinction between structure and semantics of data
- Data integration from a user perspective: user-specifiable ontologies and mappings (no single global ontology)
- Data integration on the fly
- Semantic integrity of queries ensured by means of semantics-preserving mappings
Related work
Information integration: [Sheth and Larson, 1990; Davidson et al., 2001; Eckman, 2003; Levy, 1998]
Biological data integration: SRS [Etzold et al., 2003], K2 [Tannen et al., 2003], Kleisli [Chen et al., 2003], IBM's DiscoveryLink [Haas et al., 2001], TAMBIS [Stevens et al., 2003], BioMediator [Shaker et al., 2004], etc.
Ontology and mappings editors: Protégé [Noy et al., 2000], Clio [Eckman et al., 2002], DAG-Edit, etc.
Ontology-extended relational algebra: [Bonatti et al., 2003]
Outline
- Ontology-Based Information Integration
- Learning Classifiers from Semantically Heterogeneous Data
- INDUS: Information Integration and Knowledge Acquisition System
- Summary and Work in Progress
Summary
Work in progress
Ontologies and mappings
- Support for more expressive ontologies (beyond hierarchies) [Bao et al., 2005]
- Support for interactive specification of mappings between ontologies, including automated generation of candidate mappings
- Support for modular ontologies and mappings [Bao and Honavar, 2004]
- Scalability: efficient mechanisms for storage, manipulation, retrieval and use of large ontologies and mappings
- More powerful reasoning to ensure the semantic integrity of mappings
- Support for import, export, and sharing of ontologies and mappings (e.g. OBO and OWL)
Work in progress
Query Processing
- Query optimization under access, bandwidth and computational constraints
- Implementation of data retrieval procedures (iterators) for widely used bioinformatics data sources
- Support for data caching and data sharing
Work in progress
Knowledge Acquisition
- Support for learning classifiers and other predictive models from semantically heterogeneous data [Caragea et al., 2005]
- Support for statistical queries, including queries over partially specified data [Caragea et al., 2004]
- Support for annotating and sharing results of knowledge acquisition
Work in progress
Applications in bioinformatics: data-driven discovery of macromolecular sequence-structure-function relationships
- Prediction of protein function [Andorf et al., 2004]
- Prediction of protein-protein, protein-DNA and protein-RNA interfaces [Yan et al., 2004]
- Analysis, visualization, and interpretation of gene expression data [Caragea et al., 2005]
- Modeling and discovery of gene regulatory networks
Usability studies
Design of better user interfaces
Performance evaluation
http://www.cild.iastate.edu/software/indus.html