Beyond Transparency: Success & Lessons From TAMBIS

Boston 2003

Robert Stevens
BioHealth Informatics Group
University of Manchester, UK

funded by: EPSRC/BBSRC/AstraZeneca Pharmaceuticals

Introduction

• What is the problem?
• What does TAMBIS do?
• How does it work?
  o Middleware
  o Metadata
  o Ontologies
• Outcomes and Lessons
• Next Steps, GRIDology and the Semantic Web

Take Homes

• TAMBIS aims to provide the illusion of a single query language, a single data model and a single location for distributed bio information sources. The illusion is called transparency.
• Interoperating resources (by people or systems) requires descriptions of their information (metadata) and a consistent shared understanding of what the metadata means (an ontology).
• Biologists pose a conceptual question against an ontology, which gets rewritten to a coordinated plan of requests to multiple information sources and tool invocations ~ middleware.
• The illusion is high pain, high gain in a highly autonomous and changeable environment where the sources hinder rather than help.
• Transparency -> semi-transparency.
• The ontologies and advanced knowledge representation techniques turn out to be our greatest outcome!
• An example of classic GRIDology ~ metadata, middleware, ontology.

What is the problem?

Bioinformatics is the use of computational techniques for the consolidation and analysis of experimental data in biology

The bio community is distributed and shares data and tools

The global bioinformatics infrastructure is piecemeal

The sources and tools are poorly integrated and difficult to use together

What is the problem?

There are over 500 biological information sources world-wide:

• databases
• on-line services
• files

But this data is only useful with sensible access mechanisms:

• SQL
• appropriate questions to each tool
• file searches

The biologist must:

• phrase the question for each information system
• interpret the answers received from the different sources

The biologist's query task

• Identify sources and their locations
• Identify the content/function of sources
• Recognise components of a query and target them to appropriate sources in the optimal order
• Communicate with sources
• Transform data between source formats
• Express syntactically complex queries appropriate to each source
• Merge and link results from different sources

A solution...

A common interface to all these information sources

SRS

BioNavigator - graphical workflow

BioKleisli multidatabase queries

o A syntactically consistent view
o The multidatabase query language, CPL

    usescript "cpl_libs/tambislib.cpl";
    htmlout("tambis.html", "html",
      {m | \p <- get-sp-entry-by-os("Guppy"),
           \m <- do-prosite-scan-by-entry(p),
           set-member(m.#docid, na_associated)});
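The CPL query above reads as a comprehension: take each Swiss-Prot entry for the organism, scan it against PROSITE, and keep the motifs whose documentation id is in na_associated. A rough Python analogue is sketched below. The two access functions are stubbed with invented in-memory data (the entry names, docids and contents of NA_ASSOCIATED are assumptions for illustration only), and a list stands in for the CPL set.

```python
# Hypothetical stand-ins for the CPL library functions; all data is invented.
SWISSPROT_BY_SPECIES = {"Guppy": ["ENTRY_A", "ENTRY_B"]}
PROSITE_HITS = {
    "ENTRY_A": [{"docid": "PDOC_X"}, {"docid": "PDOC_Y"}],
    "ENTRY_B": [{"docid": "PDOC_Z"}],
}
# Assumed meaning: documentation ids of motifs associated with nucleic acids.
NA_ASSOCIATED = {"PDOC_X", "PDOC_Z"}


def get_sp_entry_by_os(species):
    """Stub for get-sp-entry-by-os: Swiss-Prot entries for one organism."""
    return SWISSPROT_BY_SPECIES.get(species, [])


def do_prosite_scan_by_entry(entry):
    """Stub for do-prosite-scan-by-entry: motif hits for one entry."""
    return PROSITE_HITS.get(entry, [])


def guppy_na_motifs():
    """Mirror of the CPL comprehension: two generators and one filter."""
    return [
        m
        for p in get_sp_entry_by_os("Guppy")
        for m in do_prosite_scan_by_entry(p)
        if m["docid"] in NA_ASSOCIATED
    ]
```

Even in this form the user still has to know which functions exist, what they return and in which order to call them, which is the coordination burden the following slides discuss.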

Semantic Reconciliation – Delegation

Delegation to the user

• The “plumbing” is eased but the users figure out the coordination of sources

• SRS, CPL/BioKleisli, BioNavigator, Discovery Link

[Figure: each source sits behind a wrapper; the middleware gives a common shared syntax, format, query language and call interface]

Semantic Reconciliation – Terminologies

Conformance to common terminologies

• Gene Ontology for describing gene function in FlyBase, MouseBase, SGD…

• ArrayExpress for microarray experiments

[Figure: middleware built on a shared semantic model]

Semantic Reconciliation - Mediation

• Mediator forms a homogenising layer using a single common federated schema – an ontology
• Mediator maps the member sources to the schema - metadata
• Sources are wrapped to give the illusion of a common request language for each information resource.

Illusion of a single query language, single data model, single location.

Queries formulated against the common model are dynamically translated to the wrapped schemas of the member sources.

[Figure: a mediator service as the middleware layer]

TAMBIS is a mediator service of this kind.

Example questions

• DNA binding motifs from eukaryotic proteins involved in apoptosis
• Chimpanzee homologues of human proteins containing seven propeller domains
• Homologues to apoptosis receptor proteins
• Motifs in enzymes using thiamine as a substrate and iron as a cofactor

Complex and multisource

TAMBIS Chief Components

• An ontology of biological and bioinformatics terms managed by a terminology server. Ontology ~ a rigorous formal specification of the conceptualisation of the domain
  o Describes the biologist's knowledge in a manner independent of source
  o Links concepts to their real equivalents in the sources
  o Mediates between (near) equivalent concepts in the sources
  o Guides the user to form biologically appropriate queries
• Query processor provides a concrete query plan
• A wrapper service deals with external sources and doesn't use the terminology server

What is an Ontology?

• Describes a formal, shared conceptualization of a domain of interest
  o Concepts that are relevant for the domain, e.g. Gene
  o Relations between concepts, e.g. Product of Gene
  o Axioms about these concepts and relations, e.g. Nucleic acids < 20 residues are oligonucleotides
  o Values, e.g. the Product of the trpA Gene is tryptophan synthetase
• Enforces a well-defined semantics on such a conceptualisation
• A vocabulary of terms and some specification of the meaning of the terms
• Instances
  o Nominals = instances that are part of the ontology, e.g. Italy, sulphur
  o Knowledge base = ontology + instances, e.g. trpA Gene, Reaction 1.1.2.4
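The ingredients listed above (concepts, relations, axioms, instances) can be made concrete with a small data structure. The following is a minimal sketch only, not the TAMBIS representation: the class, the relation name hasProduct and the helper functions are all hypothetical, and a plain parent table stands in for real reasoning.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    parents: dict = field(default_factory=dict)   # concept -> parent concept
    relations: set = field(default_factory=set)   # (domain, relation, range)

    def add_concept(self, name, parent=None):
        self.parents[name] = parent

    def is_a(self, concept, ancestor):
        """True if `concept` equals `ancestor` or lies below it."""
        while concept is not None:
            if concept == ancestor:
                return True
            concept = self.parents.get(concept)
        return False


def classify_nucleic_acid(residues):
    # Axiom from the slide: nucleic acids < 20 residues are oligonucleotides.
    return "Oligonucleotide" if residues < 20 else "NucleicAcid"


onto = Ontology()
onto.add_concept("NucleicAcid")
onto.add_concept("Oligonucleotide", parent="NucleicAcid")
onto.add_concept("Gene")
onto.add_concept("Protein")
onto.relations.add(("Gene", "hasProduct", "Protein"))  # hypothetical name

# Knowledge base = ontology + instances.
instances = {"trpA": "Gene", "tryptophan synthetase": "Protein"}
facts = {("trpA", "hasProduct", "tryptophan synthetase")}


def fact_is_well_typed(fact):
    """Check a fact against the domain and range of a declared relation."""
    subj, rel, obj = fact
    return any(
        rel == r
        and onto.is_a(instances[subj], dom)
        and onto.is_a(instances[obj], rng)
        for dom, r, rng in onto.relations
    )
```

The type check is what "enforces a well-defined semantics" amounts to in this toy: a fact that says a protein has a gene as its product is rejected because it does not fit the declared relation.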

Macromolecule Reference Ontology

[Figure: class hierarchy diagram. Node labels: MacroMolecule, Protein, Nucleic Acid, Lipid, Peptide, Enzyme, RNA, DNA, cDNA, gDNA, mDNA, mRNA; SequenceComponent, Gene, Motif, Restriction site, Phosphorylation site. Edge label: componentOf.]

TAMBIS Ontology

Large Model

• Around 1800 asserted biological concepts and their relationships; capable of inferring many more
• Coverage includes: protein and protein sequence, protein component motifs, structure, enzyme function, enzymes and metabolic pathways, expressed sequence tags, nucleic acids, their component motifs, gene function and expression, sequence homology, taxonomy

Prototype System

• Around 250 biological concepts and their relationships, concentrating on proteins
• Used for the online TAMBIS pilot and has complete coverage in the Sources and Services Model
• CATH, Enzyme, Swiss-Prot, Prosite, BLAST

Query Formulation: Entry Points

• User bookmarks
• Query by example
• Ontology browsing

Concept Explorer

Rewriting to the different sources

Query Concept Builder

Guiding Query Formulation

• Showing or preventing nonsense queries
• Grouping base and criteria on the screen
• Redundancy and tautology - how much help to give?
• Modelling issues: should we show that concepts are made up of compositions?
• Questions of the concepts vs questions of the data.

TAMBIS videos

Running a query in TAMBIS

How does it work?

• The Terminology Server provides services for reasoning about concept models, answering questions like:
  o What can I say about Proteins?
  o What are more specialised/general kinds of Protein?

[Figure: TAMBIS architecture. Components: Terminology Server, Query Formulation Dialogues, Sources and Services, Query Transformation, Wrapper Service. The Terminology Server holds the Molecular Biology and Bioinformatics Ontology (TaO) and the Linguistic Model.]
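The questions put to the Terminology Server can be illustrated with a toy service over an asserted hierarchy. This is a sketch under strong assumptions: the real server reasons over a description logic model, whereas here the class name, method names, role names and the small hierarchy are all invented for illustration.

```python
class TerminologyServer:
    """Toy stand-in: an asserted tree of concepts with inherited roles."""

    def __init__(self):
        self.parent = {}  # concept -> direct parent (None for a root)
        self.roles = {}   # concept -> {role name: filler concept}

    def define(self, concept, parent=None, roles=None):
        self.parent[concept] = parent
        self.roles[concept] = dict(roles or {})

    def more_general(self, concept):
        """Ancestors of a concept, nearest first."""
        out = []
        c = self.parent[concept]
        while c is not None:
            out.append(c)
            c = self.parent[c]
        return out

    def more_specialised(self, concept):
        """All concepts that have `concept` among their ancestors."""
        return sorted(c for c in self.parent if concept in self.more_general(c))

    def what_can_i_say_about(self, concept):
        """Roles that may be applied to a concept, including inherited ones."""
        sanctioned = {}
        for c in reversed([concept] + self.more_general(concept)):
            sanctioned.update(self.roles[c])
        return sanctioned


ts = TerminologyServer()
ts.define("MacroMolecule")
ts.define("Protein", "MacroMolecule",
          {"hasComponent": "Motif", "hasFunction": "Function"})
ts.define("Enzyme", "Protein", {"catalyses": "Reaction"})
```

With these definitions, asking what can be said about Enzyme returns its own role plus the two inherited from Protein, which is the kind of answer a query dialogue could use to offer only sensible refinements.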

Conceptual Query Formulation

• The user interacts with Query Formulation Dialogues, expressing queries in terms of the biological model.
• The dialogues are driven by the content of the model, guiding the user towards sensible queries and the instantiation of concepts.

[Figure: architecture diagram highlighting the Query Formulation Dialogues (ontology browser, query formulation), which produce a source-independent conceptual query]

Concrete Query Translation

• Query Transformation takes the conceptual source-independent queries, selects the sources and rewrites to produce executable query plans.
• To do this it requires knowledge about the biological sources and the services they offer.
• Information about particular user preferences - favourite databases or analysis methods – could also be incorporated by the query planner.
• The query plans are then passed to the wrappers.

[Figure: architecture diagram highlighting Query Transformation, which produces source-dependent concrete query execution plans]

Mapping from Concepts to Sources

• The Sources and Services Model links the biological ontology with the sources and their schemas.
• Used in query transformation to determine:
  o source selection
  o cost evaluation
  o type coercions
  o source <-> CPL function mappings
• Values of terms

[Figure: architecture diagram highlighting the Sources and Services Model: costs and coercions, CPL functions, ontology <-> functions mapping, rewrite rules]
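Source selection and cost evaluation against such a mapping might look like the sketch below. It is an illustration only: the table layout, the criterion names, the cost figures and the two homology services are invented, and only get-sp-entry-by-os and do-prosite-scan-by-entry are function names that appear earlier in the talk. Ordering steps purely by ascending cost is a simplification; a real planner also has to respect data dependencies between steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    source: str    # name of the wrapped source
    function: str  # name of the access function
    cost: int      # relative cost estimate (invented figures)


# Hypothetical fragment of a sources-and-services table:
# (concept, criterion) -> services able to answer it.
SOURCES_AND_SERVICES = {
    ("Protein", "hasOrganism"): [
        Service("Swiss-Prot", "get-sp-entry-by-os", cost=5),
    ],
    ("Protein", "hasComponent:Motif"): [
        Service("Prosite", "do-prosite-scan-by-entry", cost=20),
    ],
    ("Protein", "isHomologousTo"): [
        Service("BLAST", "do-blast-by-entry", cost=100),            # hypothetical
        Service("CachedHomologues", "lookup-homologues", cost=10),  # hypothetical
    ],
}


def select_service(concept, criterion):
    """Source selection: pick the cheapest service that covers the criterion."""
    candidates = SOURCES_AND_SERVICES.get((concept, criterion), [])
    if not candidates:
        raise LookupError(f"no source covers {concept} / {criterion}")
    return min(candidates, key=lambda s: s.cost)


def plan(concept, criteria):
    """Build a plan: one service per criterion, cheapest steps first."""
    steps = [select_service(concept, c) for c in criteria]
    return sorted(steps, key=lambda s: s.cost)
```

A criterion with no entry in the table raises an error instead of producing a plan, which is one place where an unanswerable query could be caught before any source is contacted.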

Query Execution

• The sources are thickly wrapped by access functions initially implemented using BioKleisli (UPenn).
• The Wrapper Service coordinates the execution of the query and sends each component to the appropriate source.
• Results are collected and returned to the user in HTML format.

[Figure: architecture diagram highlighting the Wrapper Service: a Query Execution Coordinator dispatches to Wrapper Clients, which contact the Sources; Results flow back to the user]
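The coordinator's role can be sketched as a loop that hands each plan step to the wrapper client for the named source and feeds the results forward. This is a minimal sketch, assuming a strictly linear plan; the client functions, their data and the HTML layout are invented, and the real wrappers were CPL functions rather than Python.

```python
from html import escape


# Hypothetical wrapper clients; each maps a list of inputs to a list of outputs.
def swissprot_by_species(inputs):
    data = {"Guppy": ["ENTRY_A", "ENTRY_B"]}  # invented data
    return [entry for species in inputs for entry in data.get(species, [])]


def prosite_scan(inputs):
    data = {"ENTRY_A": ["MOTIF_1"], "ENTRY_B": ["MOTIF_2", "MOTIF_<3>"]}
    return [motif for entry in inputs for motif in data.get(entry, [])]


WRAPPER_CLIENTS = {"Swiss-Prot": swissprot_by_species, "Prosite": prosite_scan}


def execute(plan, seed):
    """Run each step of a linear plan on the output of the previous step."""
    results = list(seed)
    for source in plan:
        results = WRAPPER_CLIENTS[source](results)
    return results


def to_html(results):
    """Collect results into an HTML list, escaping markup characters."""
    items = "".join(f"<li>{escape(r)}</li>" for r in results)
    return f"<ul>{items}</ul>"
```

For example, `to_html(execute(["Swiss-Prot", "Prosite"], ["Guppy"]))` runs both steps and renders the motifs as a list.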

Ontology Powered

• Represented in a knowledge form called a Description Logic
  o TAMBIS original ~ GRAIL
  o TAMBIS NG ~ FaCT/SHIQ [aka DAML+OIL]
• Classes (concepts) of instances (individuals) that share the same properties, represented by relations (roles) between individuals and controlled by axioms
• Concepts and roles are organised into classifications
• Concepts and roles are combined in controlled ways using term constructors

Why use a Description Logic?

• Compositional ~ new terms are formed on demand
• Self organising ~ the classification is inferred
• Implicit superclass detection
• Multi-axial ~ the classification is a lattice not a tree
• Checks for consistency ~ no nonsense combinations
• Evolvable and extensible
• Modelling based on properties
• Multi-definitions and equivalences are expressible and detectable ~ N different definitions of gene
• Primitive and defined classes ~ necessary and sufficient conditions
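Several of these properties can be seen in a toy structural classifier. Each defined class is a set of necessary and sufficient conditions, and one class subsumes another when its conditions are a subset of the other's. This is a deliberately tiny fragment with no role hierarchy, negation or number restrictions; it is not how GRAIL or FaCT work internally, and the class and role names are invented.

```python
# Defined classes as sets of (role, filler) conditions; names are hypothetical.
DEFINED = {
    "Protein": frozenset({("madeOf", "AminoAcid")}),
    "Enzyme": frozenset({("madeOf", "AminoAcid"), ("catalyses", "Reaction")}),
    "Receptor": frozenset({("madeOf", "AminoAcid"), ("binds", "Ligand")}),
    # A second definition with the same conditions as Enzyme.
    "Biocatalyst": frozenset({("catalyses", "Reaction"), ("madeOf", "AminoAcid")}),
}


def subsumes(general, specific):
    """`general` subsumes `specific` if all its conditions hold of `specific`."""
    return general <= specific


def classify(conditions):
    """Names of every defined class that subsumes the given conditions."""
    return sorted(name for name, d in DEFINED.items() if subsumes(d, conditions))


def equivalents(name):
    """Other defined classes with exactly the same conditions."""
    return sorted(n for n, d in DEFINED.items() if n != name and d == DEFINED[name])


# A new term composed on demand: an enzyme that also binds a ligand.
ligand_binding_enzyme = DEFINED["Enzyme"] | {("binds", "Ligand")}
```

Classifying the composed term places it under Enzyme and Receptor at once without anyone asserting either link, so the result is a lattice rather than a tree, and Biocatalyst is detected as equivalent to Enzyme.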

Outcomes and Lessons (1)
Shift the responsibility from the user to the system

• Complex queries that require interoperation between sources
  o TAMBIS manages the heterogeneity between the sources through mappings between the ontology and the real resources;
  o User is shielded from changes in the sources and the inadequacies of the resource accessibility functions;
  o User is relieved of the task of choosing sources and tools and of ordering source requests, and is protected from expensive queries and nonsensical queries;
  o Complex ad-hoc queries possible – query by navigation is limited
• The ontology is a repository of knowledge itself without recourse to the sources;
  o Allows users to describe what they want by linking concepts
  o Shows what it is possible to ask
  o Represents the knowledge the biologist uses when choosing and interoperating between sources

Outcomes and Lessons (2)

The big outcome was the ontology!

• New work on ontology presentation and exchange languages
  o DAML+OIL/OWL
• New applications using the ontology
  o PRINTS auto-annotation
• New approaches to ontology migration and development
  o Semantic similarity
  o GO – Next Generation
• Foundational work on Knowledge Representation, reasoning and associated tools
• Published in Bioinformatics 1999

Outcomes and Lessons (3)
A Classification of Tasks in Bioinformatics

• A set of tasks commonly performed by biologists, common components & patterns, evaluation principles for query systems
• Biologists do simple queries… but want to do more complex queries… if they knew how… but are unaware of what is available now and don't exploit the power of the tools that they already have.
• Interoperation is a major obstacle… but biologists like to get "hands on"
  o Interacting and intervening with the queries
  o Routing, editing and storing intermediate results
• Although it's very common, no system supports the forking of parallel concurrent processes
• Published in Bioinformatics 2001

Outcomes and Lessons (4)
Shift the responsibility from the user to the system

• Building, validating and maintaining the ontology
  o Methodologies are immature
  o Bootstrap from source schemas and taxonomies
  o Reuse vs task driven
  o Sufficiency conditions in the natural world
  o How do you know they are correct?
• Building, validating and maintaining the mappings
  o There are lots of resources
  o They change a lot
  o Their metadata is often implicit
  o Their accessibility functions are poor – point and click
  o Source management
• A new paradigm ~ describing what you want not how to get it

Outcomes and Lessons (5)

• Full transparency is not always good
• All users want provenance
• Some users wish to choose, rather than being told:
  o What happened, where it happened, who did it, …
• Re-use of results as TAMBIS sources


Acknowledgements

TAMBIS is a collaboration between Computer Science & Biological Sciences at the University of Manchester, funded by the BBSRC & EPSRC Bioinformatics Programme, (Astra)Zeneca Pharmaceuticals and the EPSRC DIM Programme.

http://img.cs.man.ac.uk/tambis/

Carole Goble, Norman Paton, Sean Bechhofer, Robert Stevens, Gary Ng, Martin Peim, Alex Jacoby, Pierpaolo Larocchia

Bioinformaticians: Andy Brass, Patricia Baker, Robert Stevens

Knowledge support services: Ian Horrocks, Enrico Franconi, Sergio Tessaris, Graham Gough, Alan Rector