Invited talk about TAMBIS at a conference in Boston, 2003
Beyond Transparency: Success & Lessons From TAMBIS
Robert Stevens, bioHealth Informatics Group, University of Manchester, UK
funded by: EPSRC/BBSRC/AstraZeneca Pharmaceuticals
Introduction
What is the problem?
What does TAMBIS do?
How does it work?
o Middleware
o Metadata
o Ontologies
Outcomes and Lessons
Next Steps, GRIDology and the Semantic Web
Take Homes
TAMBIS aims to provide the illusion of:
• A single query language.
• A single data model.
• A single location for distributed bio information sources.
The illusion is called transparency.
Interoperating resources (by people or systems) requires descriptions of their information (metadata) and a consistent shared understanding of what the metadata means (an ontology).
Biologists pose a conceptual question against an ontology; it gets rewritten to a coordinated plan of multi-information source requests and tool invocations ~ middleware.
The illusion is high pain, high gain in a highly autonomous and changeable environment where the sources hinder rather than help.
Transparency -> semi-transparency
The ontologies and advanced knowledge representation techniques turn out to be our greatest outcome!
An example of classic GRIDology ~ metadata, middleware, ontology
What is the problem?
Bioinformatics is the use of computational techniques for the consolidation and analysis of experimental data in biology
The bio community is distributed and shares data and tools
The global bioinformatics infrastructure is piecemeal
The sources and tools are poorly integrated and difficult to use together
• SQL
• appropriate questions to each tool
• file searches
What is the problem?
• databases
• on-line services
• files
There are over 500 biological information sources world-wide.
But this data is only useful with sensible access mechanisms.
The biologist must:
• phrase the question for each information system
• and interpret the answers received from the different sources
The biologist's query task
• Identify sources and their locations
• Identify the content/function of sources
• Recognise components of a query and target them to appropriate sources in the optimal order
• Communicate with sources
• Transform data between source formats
• Express syntactically complex queries appropriate to each source
• Merge and link results from different sources
A solution...
A common interface to all these information sources
SRS
BioNavigator - graphical workflow
BioKleisli multidatabase queries
o A syntactically consistent view
o The multidatabase query language, CPL

usescript "cpl_libs/tambislib.cpl";
htmlout("tambis.html", "html",
  {m | \p <- get-sp-entry-by-os("Guppy"),
       \m <- do-prosite-scan-by-entry(p),
       set-member(m.#docid, na_associated)});
Semantic Reconciliation – Delegation
Delegation to the user
• The “plumbing” is eased but the users figure out the coordination of sources
• SRS, CPL/BioKleisli, BioNavigator, Discovery Link
[Diagram: three wrappers over the sources; common shared syntax, format, query language, call interface; middleware]
Semantic Reconciliation – Terminologies
Conformance to common terminologies
• Gene Ontology for describing gene function in FlyBase, MouseBase, SGD…
• ArrayExpress for microarray experiments
shared semantic model
middleware
Semantic Reconciliation - Mediation
mediator service
• Mediator forms a homogenising layer using a single common federated schema – an ontology
• Mediator maps the member sources to the schema - metadata
• Sources are wrapped to give the illusion of a common request language for each information resource.
Illusion of a single query language, single data model, single location.
Queries formulated against the common model are dynamically translated to the wrapped schemas of the member sources.
middleware
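The mediation pattern above can be sketched in a few lines of Python. This is only an illustration: the two toy sources, their field names and the `proteins_of` function are hypothetical, and TAMBIS's own wrappers were built on CPL/BioKleisli rather than on anything like this code.

```python
from typing import Callable, Dict, List, Tuple

# Two hypothetical wrapped sources, each with its own native field names.
def source_a(organism: str) -> List[dict]:
    rows = [{"AC": "P1", "OS": "Guppy"}, {"AC": "P2", "OS": "Human"}]
    return [r for r in rows if r["OS"] == organism]

def source_b(organism: str) -> List[dict]:
    rows = [{"id": "X9", "species": "Guppy"}]
    return [r for r in rows if r["species"] == organism]

# Metadata: how each source's fields map onto the common schema.
SOURCES: Dict[str, Tuple[Callable[[str], List[dict]], Dict[str, str]]] = {
    "source_a": (source_a, {"AC": "accession", "OS": "organism"}),
    "source_b": (source_b, {"id": "accession", "species": "organism"}),
}

def proteins_of(organism: str) -> List[dict]:
    """Answer one query against the common model by asking every source."""
    results = []
    for name, (wrapper, field_map) in SOURCES.items():
        for row in wrapper(organism):
            # Translate the source's native fields into the common schema.
            common = {field_map[k]: v for k, v in row.items()}
            common["source"] = name
            results.append(common)
    return results

print(proteins_of("Guppy"))
```

The caller sees one function and one record shape; which sources exist and how their fields are named stays inside the mapping table.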
Semantic Reconciliation - Mediation
TAMBIS
Example questions
DNA binding motifs from eukaryotic proteins involved in apoptosis
Chimpanzee homologues of Human proteins containing seven propeller domains
Homologues to apoptosis receptor proteins
Motifs in enzymes using thiamine as a substrate and iron as a cofactor
Complex and Multisource
TAMBIS Chief Components
An ontology of biological and bioinformatics terms managed by a terminology server.
Ontology ~ a rigorous formal specification of the conceptualisation of the domain
• Describes the biologist's knowledge independently of the sources
• Links concepts to their real equivalents in the sources
• Mediates between (near) equivalent concepts in the sources
• Guides the user to form biologically appropriate queries
Query processor provides a concrete query plan
A wrapper service deals with external sources and doesn't use the terminology server
What is an Ontology?
Describes a formal, shared conceptualisation of a domain of interest
• Concepts that are relevant for the domain, e.g. Gene
• Relations between concepts, e.g. Product of Gene
• Axioms about these concepts and relations, e.g. Nucleic acids < 20 residues are oligonucleotides
• Values: The Product of the trpA Gene is tryptophan-synthetase
Enforces a well-defined semantics on such a conceptualisation
A vocabulary of terms and some specification of the meaning of the terms
Instances
• Nominals = instances that are part of the ontology, e.g. Italy, sulphur
• Knowledge base = ontology + instances, e.g. trpA Gene, Reaction 1.1.2.4
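As a rough illustration of the pieces listed above (concepts, relations, an axiom, and instance-level values), here is a small Python sketch. The class names, the relation name `product` and the method names are assumptions made for the example; they are not how TAMBIS represents its ontology.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Ontology:
    concepts: Set[str] = field(default_factory=set)
    relations: Set[str] = field(default_factory=set)

    def classify_nucleic_acid(self, residues: int) -> str:
        # Axiom from the slide: nucleic acids < 20 residues are oligonucleotides.
        return "Oligonucleotide" if residues < 20 else "NucleicAcid"

@dataclass
class KnowledgeBase:
    """Knowledge base = ontology + instances."""
    ontology: Ontology
    facts: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_fact(self, subject: str, relation: str, value: str) -> None:
        # Only relations declared in the ontology may be used.
        if relation not in self.ontology.relations:
            raise ValueError(f"unknown relation: {relation}")
        self.facts.append((subject, relation, value))

    def values(self, subject: str, relation: str) -> List[str]:
        return [v for s, r, v in self.facts if s == subject and r == relation]

onto = Ontology(
    concepts={"Gene", "Protein", "NucleicAcid", "Oligonucleotide"},
    relations={"product"},
)
kb = KnowledgeBase(onto)
kb.add_fact("trpA", "product", "tryptophan-synthetase")

print(onto.classify_nucleic_acid(12))
print(kb.values("trpA", "product"))
```

The ontology fixes the vocabulary and its constraints; the knowledge base adds the individual facts that must respect them.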
[Figure: Macromolecule Reference Ontology. MacroMolecule classes shown: Protein, Nucleic Acid, Lipid, Peptide, Enzyme, RNA, DNA, cDNA, gDNA, mDNA, mRNA. SequenceComponent classes shown: Gene, Motif, Restriction site, Phosphorylation site. Relation shown: componentOf.]
TAMBIS Ontology
Large Model
• Around 1800 asserted biological concepts and their relationships; capable of inferring many more
• Coverage includes: protein and protein sequence, protein component motifs, structure, enzyme function, enzymes and metabolic pathways, expressed sequence tags, nucleic acids, their component motifs, gene function and expression, sequence homology, taxonomy
Prototype System
• Around 250 biological concepts and their relationships, concentrating on proteins
• Used for the online TAMBIS pilot and has complete coverage in the Sources and Services Model
• CATH, Enzyme, Swiss-Prot, Prosite, BLAST
Query Formulation: Entry Points
• User Bookmarks
• Query by example
• Ontology browsing
Concept Explorer
Rewriting to the different sources
Query Concept Builder
Guiding Query Formulation
Showing or preventing nonsense queries
grouping base and criteria on the screen
redundancy and tautology - how much help to give?
Modelling issues
Should we show that concepts are made up of compositions?
Questions of the concepts vs questions of the data.
TAMBIS videos
Running a query in TAMBIS
How does it work?
• The Terminology Server provides services for reasoning about concept models, answering questions like:
  What can I say about Proteins?
  What are more specialised/general kinds of Protein?
[Architecture diagram: Terminology Server, Query Formulation Dialogues, Sources and Services, Query Transformation, Wrapper Service. Terminology Server labels: Molecular Biology and Bioinformatics Ontology (TaO), Linguistic Model.]
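The two kinds of question a terminology server answers ("what can I say about X?" and "what are the more specialised/general kinds of X?") can be sketched as lookups over a concept hierarchy. The hierarchy, the role names and the `TerminologyServer` class below are all made up for illustration; the real server reasons over a description logic model rather than a hand-written parent table.

```python
from typing import Dict, List, Set

class TerminologyServer:
    def __init__(self, parents: Dict[str, List[str]], roles: Dict[str, List[str]]):
        self.parents = parents  # concept -> direct, more general concepts
        self.roles = roles      # concept -> roles introduced at that concept

    def ancestors(self, concept: str) -> Set[str]:
        """All more general kinds of the concept."""
        seen: Set[str] = set()
        stack = list(self.parents.get(concept, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, []))
        return seen

    def descendants(self, concept: str) -> Set[str]:
        """All more specialised kinds of the concept."""
        return {c for c in self.parents if concept in self.ancestors(c)}

    def what_can_i_say_about(self, concept: str) -> Set[str]:
        """Roles applicable to the concept, including inherited ones."""
        applicable = set(self.roles.get(concept, []))
        for general in self.ancestors(concept):
            applicable.update(self.roles.get(general, []))
        return applicable

# Assumed toy hierarchy, not the actual TAMBIS ontology.
ts = TerminologyServer(
    parents={
        "MacroMolecule": [],
        "Protein": ["MacroMolecule"],
        "Enzyme": ["Protein"],
        "NucleicAcid": ["MacroMolecule"],
    },
    roles={
        "MacroMolecule": ["hasName"],
        "Protein": ["hasMotif", "isHomologousTo"],
        "Enzyme": ["catalyses"],
    },
)

print(sorted(ts.what_can_i_say_about("Protein")))
print(sorted(ts.descendants("Protein")), sorted(ts.ancestors("Protein")))
```

A query dialogue driven by these answers can only offer the user roles that make sense for the concept in hand.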
Conceptual Query Formulation
• The user interacts with Query Formulation Dialogues, expressing queries in terms of the biological model.
• The dialogues are driven by the content of the model, guiding the user towards sensible queries and the instantiation of concepts.
[Architecture diagram: Query Formulation Dialogues (Ontology Browser, Query formulation) produce a source independent conceptual query]
Concrete Query Translation
•Query Transformation takes the conceptual source-independent queries, selects the sources and rewrites to produce executable query plans.
•To do this it requires knowledge about the biological sources and the services they offer.
•Information about particular user preferences - favourite databases or analysis methods – could also be incorporated by the query planner.
•The query plans are then passed to the wrappers.
[Architecture diagram: Query Transformation produces source dependent concrete query execution plans]
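A minimal sketch of the transformation step, assuming a conceptual query is just an ordered list of concept-level operations. The two function names are borrowed from the CPL example earlier in the talk; the operation names, the costs, "HypotheticalDB" with its "get-proteins" function, and the `plan` function itself are invented for illustration.

```python
from typing import Dict, List, NamedTuple, Optional

class Service(NamedTuple):
    source: str
    function: str
    cost: int  # made-up relative cost, lower is cheaper

# Stand-in for knowledge about the sources and the services they offer.
SERVICES: Dict[str, List[Service]] = {
    "proteins_of_organism": [
        Service("Swiss-Prot", "get-sp-entry-by-os", 5),
        Service("HypotheticalDB", "get-proteins", 9),
    ],
    "motifs_of_protein": [
        Service("Prosite", "do-prosite-scan-by-entry", 3),
    ],
}

def plan(conceptual_query: List[str], preferred: Optional[str] = None) -> List[Service]:
    """Rewrite source-independent operations into a concrete, ordered plan."""
    steps = []
    for operation in conceptual_query:
        candidates = SERVICES[operation]
        # A user preference narrows the choice when the preferred source can answer.
        favoured = [s for s in candidates if s.source == preferred]
        steps.append(min(favoured or candidates, key=lambda s: s.cost))
    return steps

for step in plan(["proteins_of_organism", "motifs_of_protein"]):
    print(step.source, step.function)
```

Passing `preferred="HypotheticalDB"` switches the first step to that source, which is one way the user preferences mentioned above could feed into planning.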
Mapping from Concepts to Sources
• The Sources and Services Model links the biological ontology with the sources and their schemas.
• Used in query transformation to determine:
  • source selection
  • cost evaluation
  • type coercions
  • source <-> CPL function mappings
• Values of terms
[Architecture diagram: Sources and Services, labelled with costs and coercions, CPL functions, ontology <-> functions mapping, rewrite rules]
Sources Query Execution
• The sources are thickly wrapped by access functions initially implemented using BioKleisli (UPenn).
• The Wrapper Service coordinates the execution of the query and sends each component to the appropriate source.
• Results are collected and returned to the user in HTML format.
[Architecture diagram: Wrapper Service with a Query Execution Coordinator and three Wrapper Clients, returning Results]
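The coordination described above can be pictured as a small pipeline: each plan step goes to the wrapper client for its source, and the final rows are rendered as HTML. The client functions, their canned data and the plan format are assumptions for this sketch only; the real Wrapper Service executed CPL through BioKleisli.

```python
from html import escape
from typing import Callable, Dict, List

# Hypothetical wrapper clients with canned data standing in for remote sources.
def protein_client(organisms: List[str]) -> List[str]:
    table = {"Guppy": ["P1", "P2"]}
    return [acc for org in organisms for acc in table.get(org, [])]

def motif_client(accessions: List[str]) -> List[str]:
    table = {"P1": ["zinc_finger"], "P2": []}
    return [motif for acc in accessions for motif in table.get(acc, [])]

CLIENTS: Dict[str, Callable[[List[str]], List[str]]] = {
    "protein_source": protein_client,
    "motif_source": motif_client,
}

def execute(plan: List[str], seed: List[str]) -> List[str]:
    """Send each plan component to its source, feeding results forward."""
    data = seed
    for source in plan:
        data = CLIENTS[source](data)
    return data

def to_html(rows: List[str]) -> str:
    """Render collected results as a simple HTML list."""
    items = "".join(f"<li>{escape(row)}</li>" for row in rows)
    return f"<ul>{items}</ul>"

print(to_html(execute(["protein_source", "motif_source"], ["Guppy"])))
```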
Ontology Powered
Represented in a knowledge form called a Description Logic
• TAMBIS original ~ GRAIL
• TAMBIS NG ~ FaCT/SHIQ [aka DAML+OIL]
Classes (concepts) of instances (individuals) that share the same properties represented by relations (roles) between individuals and controlled by axioms
Concepts and roles are organised into classifications
Concepts and roles are combined in controlled ways using term constructors
Why use a Description Logic?
• Compositional ~ new terms are formed on demand
• Self organising ~ the classification is inferred
• Implicit superclass detection
• Multi-axial ~ the classification is a lattice not a tree
• Checks for consistency ~ no nonsense combinations
• Evolvable and extensible
• Modelling based on properties
• Multi-definitions and equivalences are expressible and detectable ~ N different definitions of gene
• Primitive and defined classes ~ necessary and sufficient conditions
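To give a feel for "the classification is inferred", here is a deliberately naive Python sketch. Each defined class is a set of necessary and sufficient conditions, and one class subsumes another description when all of its conditions are present. This set-inclusion test is a stand-in chosen for brevity; it is not how GRAIL or FaCT reason, and the class and condition names are invented.

```python
from typing import Dict, FrozenSet, Set

# Defined classes: each set of conditions is necessary AND sufficient.
DEFINED: Dict[str, FrozenSet[str]] = {
    "Enzyme": frozenset({"isa:Protein", "catalyses:Reaction"}),
    "MetalloEnzyme": frozenset(
        {"isa:Protein", "catalyses:Reaction", "hasCofactor:MetalIon"}
    ),
}

def subsumers(description: FrozenSet[str]) -> Set[str]:
    """Defined classes whose conditions are all satisfied by the description."""
    return {name for name, conds in DEFINED.items() if conds <= description}

def classify() -> Dict[str, Set[str]]:
    """Infer superclasses among the defined classes; none are asserted by hand."""
    return {name: subsumers(conds) - {name} for name, conds in DEFINED.items()}

# A new term composed on demand is placed in the classification automatically.
guppy_term = frozenset(
    {"isa:Protein", "catalyses:Reaction", "hasCofactor:MetalIon", "fromOrganism:Guppy"}
)

print(classify())
print(sorted(subsumers(guppy_term)))
```

Nobody stated that MetalloEnzyme is a kind of Enzyme; the relationship falls out of the definitions, which is the "implicit superclass detection" in the list above.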
Outcomes and Lessons (1)
Shift the responsibility from the user to the system
• Complex queries that require interoperation between sources
• TAMBIS manages the heterogeneity between the sources through mappings between the ontology and the real resources;
• User is shielded from changes in the sources and the inadequacies of the resource accessibility functions;
• User is relieved of the task of choosing sources and tools and ordering source requests, and is protected from expensive queries and from nonsensical queries;
• Complex ad-hoc queries possible – query by navigation is limited
The ontology is a repository of knowledge itself without recourse to the sources;
• Allows users to describe what they want by linking concepts
• Shows what it is possible to ask
• Represents the knowledge the biologist uses when choosing and interoperating between sources
Outcomes and Lessons (2)
The big outcome was the ontology!
New work on ontology presentation and exchange languages
DAML+OIL/OWL
New applications using the ontology
PRINTS auto-annotation
New approaches to ontology migration and development
Semantic similarity
GO – Next Generation
Foundational work on Knowledge Representation, reasoning and associated tools
Published in Bioinformatics 1999
Outcomes and Lessons (3)
A Classification of Tasks in Bioinformatics
A set of tasks commonly performed by biologists, common components & patterns, evaluation principles for query systems
• Biologists do simple queries… but want to do more complex queries… if they knew how…
• but are unaware of what is available now and don't exploit the power of the tools that they already have.
• Interoperation is a major obstacle… but biologists like to get "hands on"
  • Interacting and intervening with the queries
  • Routing, editing and storing intermediate results
• Although it's very common, no system supports the forking of parallel concurrent processes
Published in Bioinformatics 2001
Outcomes and Lessons (4)
Shift the responsibility from the user to the system
Building, validating and maintaining the ontology
• Methodologies are immature
• Bootstrap from source schemas and taxonomies
• Reuse vs Task driven
• Sufficiency conditions in the natural world
• How do you know they are correct?
Building, validating and maintaining the mappings
• There are lots of resources
• They change a lot
• Their metadata is often implicit
• Their accessibility functions are poor – point and click
• Source management
A new paradigm ~ describing what you want not how to get it
Outcomes and Lessons (5)
Full transparency not always good
All users want provenance
Some users wish to choose, rather than being told:
What happened, where it happened, who did it, …
Re-use of results as TAMBIS sources
Acknowledgements
TAMBIS is a collaboration between Computer Science & Biological Sciences at the University of Manchester, funded by the BBSRC & EPSRC Bioinformatics Programme, (Astra)Zeneca Pharmaceuticals and the EPSRC DIM Programme
[email protected]://img.cs.man.ac.uk/tambis/
Carole Goble
Norman Paton
Sean Bechhofer
Robert Stevens
Gary Ng
Martin Peim
Alex Jacoby
Pierpaolo Larocchia
Bioinformaticians:
Andy Brass
Patricia Baker
Robert Stevens
Knowledge support services:
Ian Horrocks
Enrico Franconi
Sergio Tessaris
Graham Gough
Alan Rector