myGrid: open knowledge based high level services for bioinformatics the information Grid Professor Carole Goble University of Manchester, UK http://www.mygrid.org.uk

Page 1

myGrid: open knowledge based high level services for bioinformatics

the information Grid

Professor Carole Goble

University of Manchester, UK

http://www.mygrid.org.uk

Page 2

The PRISM Forum PharmaGrid

• The GRID represents fast moving technology that will rapidly expand beyond initial applications of scavenging productive CPU cycles to the transparent provision of a wide range of services. With the GRID there is the potential to build powerful, complex problem-solving and collaborative environments, providing access to and sharing of diverse information sources and rigorous analytical tools. These benefits could deliver well within the five-year planning cycles of the pharmaceutical industry, if certain IT development challenges are met, including:

• Intelligent middleware that facilitates for the user transparent access to many services and execution of tasks

• High quality security features, enabling large databases to be accessed via GRID solutions

• Sophisticated semantic and contextual systems to enable diverse sources of data to be related for knowledge discovery

• The GRID’s potential for integration of information across the pharmaceutical value chain, well beyond discovery and development, offers a tremendous opportunity. Staff could be provided with personal working environments, and access to the best possible resources, services, information and knowledge available to solve problems and inform their decision-making.

Page 3

Take home

• e-Science is bigger than Grid
• The e-Science experimental method needs first class support and is just as important as outcomes.
• Personalised
  – my private data holdings
  yet collaborative
  – publish my workflow templates in registries for you to share and adapt
• Automated
  – run a workflow and discover alternative services if a service goes down
  yet interactive, with the scientist at the centre
  – user proxy notified to hand filter or view results

Page 4

Challenges for Pharma

• Access to and understanding of distributed, heterogeneous information resources is critical

• Complex, time consuming process, because ...
  – 1000’s of relevant information sources, an explosion in availability of:
    • experimental data
    • scientists’ annotations
    • text documents: abstracts, eJournal articles, monthly reports, patents, ...
  – Rapidly changing domain concepts and terminology and analysis approaches
  – Constantly evolving data structures
  – Continuous creation of new data sources
  – Highly heterogeneous sources and applications
  – Data and results of uneven quality, depth, scope
  – But still growing

Page 5

Integration of Pharma information

ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
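A minimal sketch of what integrating such a record involves: the lines of a Swiss-Prot style flat-file entry are grouped by their two-letter line code. The helper name `parse_flat_record` and the trimmed sample text are assumptions made for illustration; they are not part of myGrid.

```python
from collections import defaultdict

# Trimmed sample in the Swiss-Prot flat-file layout shown on the slide.
SAMPLE = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
"""


def parse_flat_record(text):
    """Group record lines by their two-letter line code (hypothetical helper)."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        code, body = line[:2], line[5:].strip()
        if code.strip():          # ID, DE, GN, OS, KW, SQ, ...
            fields[code].append(body)
        elif body:                # indented lines carry sequence residues
            sequence.append(body.replace(" ", ""))
    record = {code: " ".join(parts) for code, parts in fields.items()}
    record["sequence"] = "".join(sequence)
    return record


record = parse_flat_record(SAMPLE)
entry_name = record["ID"].split()[0]
keywords = [k.strip() for k in record["KW"].rstrip(".").split(";")]
```

Even this small case shows the integration problem: each field has its own micro-syntax (semicolon lists, continuation lines, fixed columns) that a consumer has to know in advance.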

Page 6

myGrid

• EPSRC UK e-Science pilot project
• Open Source Upper Middleware for Bioinformatics
• (Web) Service-based architecture -> Grid services
• 42 months, 20 months in.
• Prototype v0 technical and user requirements
• Prototype v1 Release Sept 2004, some services available now.

Page 7

Graves' disease

• Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland, resulting in hyperthyroidism

• Weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos

Page 8

The Biology

• Graves' disease is caused by the stimulation of the thyrotrophin receptor by thyroid-stimulating autoantibodies secreted by lymphocytes of the immune system.

• What is the molecular basis for this autoimmune response?

[Figure labels: Pituitary Gland; TSH; TSH Receptor; Thyroid Cell; Thyroid Hormones Released; -ve feedback effect; Autoimmune Antibodies attach to TSH receptors, competing with TSH]

Page 9

Bioinformatics

[Figure: three linked components fed from a candidate gene pool via Gene ID. Labels as extracted:]

Annotation Pipeline: What is known about my candidate gene?
• Medline, OMIM, GO, BLAST, EMBL, DQP, Query

Genotype Assay Design System: Select a SNP from candidate gene. Is this SNP associated with disease?
• Primer Design: Emboss Eprimer application in SoapLab
• Selection of restriction enzyme: Emboss Restrict in SoapLab
• Restriction Fragment Length Polymorphism experiment: use primers designed by myGrid to amplify region flanking SNP on the gene
• Talisman

3D Protein Structure: What is the structure of the protein product encoded by my candidate gene?
• Query PDB & display protein structure using Rasmol
• Obtain information about protein & extract information about active site (Swiss-Prot, AMBIT, Interpro)
• Determine whether coding SNPs affect the active site of the protein (AMBIT)

Peter Li (1), Claire Jennings (2), Simon Pearce (2) and Anil Wipat (1) (2003). (1) School of Computing Science and (2) Institute of Human Genetics, University of Newcastle-upon-Tyne.

Page 10

Workflows are in silico experiments

[Figure: Annotation Pipeline. What is known about my candidate gene? Labels: Medline, OMIM, GO, BLAST, EMBL, DQP, Query]

http://cvs.mygrid.org.uk/scufl/NucleotideSeqAnnotationPipelineWithGoTerms/

Page 11

in silico Exploratory Experiments

• Ad hoc virtual organisations
• No a priori agreements
• Discovery/exploratory workflows by biologists
• Personal
• Different resources
• Grids

Predictive / stable integration
• Production workflows over known resources
• Organisation wide
• Emphasis on performance and resilience
• Data capture, cleaning and replication protocols

[Figure labels: Clear Understanding; Standard; Well defined; Predictive; Experimental orchestration; Exploratory; Hypothesis driven; Not prescriptive; Methodology free; Ad hoc]

Page 12

Experiment = Workflows + Services + (meta)Data

Discovering services to invoke
Discovering workflows to enact

• Service & workflow registration and discovery
  – Multi-user, multi-view, federated registries
  – First, second and third party services & workflows
  – Publishing new ones, adapting old ones.
  – My working set of services
  – Services may be owned by another user, and come and go
  – Views over registries of services
  – Third party annotation
• Ontologies for describing and finding workflows/services and guiding service composition
  – Service A outputs compatible with Service B inputs
  – Blastn compares a nucleotide query sequence against a nucleotide sequence database (usually – intelligent misuse of services…)
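The "Service A outputs compatible with Service B inputs" check can be sketched with a toy is-a hierarchy. Everything named here (the concepts, the `IS_A` table, the service descriptions) is invented for illustration, and a plain dict stands in for an ontology; it is not the myGrid registry or matchmaker.

```python
# Toy is-a hierarchy: child concept -> parent concept (all names hypothetical).
IS_A = {
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "sequence": "data",
    "blast_report": "data",
}

# Hypothetical service descriptions: the concept each service consumes and produces.
SERVICES = {
    "fetch_embl": {"input": "accession", "output": "nucleotide_sequence"},
    "blastn": {"input": "nucleotide_sequence", "output": "blast_report"},
    "blastp": {"input": "protein_sequence", "output": "blast_report"},
    "seq_length": {"input": "sequence", "output": "integer"},
}


def subsumes(general, specific):
    """True if `specific` is the same concept as `general` or a descendant of it."""
    concept = specific
    while concept is not None:
        if concept == general:
            return True
        concept = IS_A.get(concept)
    return False


def can_follow(producer, consumer):
    """Can the output of `producer` be fed to the input of `consumer`?"""
    return subsumes(SERVICES[consumer]["input"], SERVICES[producer]["output"])


def successors(producer):
    """All registered services that could consume the producer's output."""
    return sorted(name for name in SERVICES
                  if name != producer and can_follow(producer, name))
```

With these descriptions, `successors("fetch_embl")` offers `blastn` and `seq_length` but not `blastp`, because a nucleotide sequence is a sequence but not a protein sequence.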

Page 13

An in silico experiment = a web of interconnected investigation holdings

Provenance record of workflow runs

Provenance of the workflow template. Related workflows.

People who wrote the workflow

Ontologies describing workflows

Services used

Notes

Data holdings

Literature

People to notify of the workflow status

Page 14

Experiment life cycle

Forming experiments
• Resource & service discovery
• Repository creation
• Workflow creation
• Database query formation

Executing experiments
• Workflow enactment
• Distributed Query processing
• Job execution
• Provenance generation
• Single sign-on authentication
• Event notification

Discovering and reusing experiments and resources
• Workflow discovery & refinement
• Resource & service discovery
• Repository creation
• Provenance

Managing experiments
• Information repository
• Metadata management
• Provenance management
• Workflow evolution
• Event notification

Providing services & experiments
• Service registration
• Workflow deposition
• Metadata Annotation
• Third party registration

Personalisation
• Personalised registries
• Personalised workflows
• Info repository views
• Personalised annotations
• Personalised metadata
• Security

Page 15

Investigation = set of experiments + metadata

• Hypothesis, materials and methods, results, conclusions, acknowledgements, bibliography

• Who, what, where, why, when, (w)how? recorded by provenance records

• Experiment is repeatable, if not reproducible.

• The traceability of knowledge as it evolves and as it is derived.

• A web of myGrid holdings
  – input data, data results, intermediate data, parameter sets, workflow logs, workflow templates, people, organisations, personal notes, services etc.

• Discovering links between experiment objects

• Selectively share (parts of) experiments and investigations

• Discover experiments and investigations

Page 16

Data at the centre

Provenance record of workflow run that produced this data

Provenance of the data holdings

Workflows that could use or generate this data

People who have registered an interest in this data

Ontologies describing data

Services that can use or produce this data

Notes

Data holdings

Literature relevant

Related Data holding

Page 17

Put the scientist at the centre

Provenance record of workflow runs they have made

People

Ontologies

Preferences for Services

Notes

Data holdings

Literature

Workflows they wrote or used

People they collaborate with

Page 18

myGrid in a nutshell

A “second generation” open service-based Grid project, a test bed for the OGSI, OGSA and OGSA-DAI base services

Semantic grid capabilities: knowledge-based technologies, semantic-based service, workflow & data discovery, match making, linking investigation components.

High level services for e-Science experimental management: provenance, change notification, personalisation, investigation and experiment holdings management

High level services for data intensive integration: workflow & distributed query processing

External Applications: workbench, portal, Talisman, Taverna

External Services: AMBIT, SoapLab, EMBOSS…

[Figure side labels: Bioinformaticians; Tool Providers; Service Providers]

Page 19

myGrid Services

[Figure: myGrid services architecture. Labels as extracted:]
• Bioinformaticians, Tool Providers, Service Providers
• Work bench, Taverna workflow environment, Talisman application, Portal, Gateway
• Service and Workflow Discovery, Registries, Ontologies, Ontology Mgt, Metadata Mgt
• myGrid Information Repository, Provenance mgt, Personalisation, Event Notification
• Workflow enactment engine, Distributed Query Processor
• Bio Services, Soaplab, SRS, EMBOSS, AMBIT Text Extraction Service
• Web Service & Grid communication fabric

Page 20

Putting the services together

[Figure labels: mIR browser; Provenance browser; Service & workflow browser; User Proxy; Registry; Registry View; Knowledge Services; Service Knowledge Service; Matchmaker; Find Service; Semantic registration; Service Publication; syntactic registration; Workflow enactment engine; Distributed Query Processor; Information Extraction Service; Job Execution; Notification Service; Service; mIR]

Page 21

[Figure labels: mIR; Notification; Notification Client; Workflow Enactment Engine; Registry View; Service Browser; Finding Service; Workbench; Taverna Workflow Environment; UDDI; Domain Services; Bio-databases; SoapLab; EMBOSS; User Proxy; User Gateway]

[Figure legend: myGrid Client; myGrid Services; External Services]

Page 22

Application: Work bench demonstrator

The myGrid service components have been used in a demonstration application called the “myGrid WorkBench”, which provides a common point of use for the services.

We can select data from the myGrid Information repository (mIR), select a workflow based on its semantic description, and examine the results.

Page 23

A work bench for demonstrating services

[Screenshot labels: myView on the mIR; Workflow; Metadata about workflow; Note about workflow]

Page 24

Semantic services

Services and workflows within myGrid are described using semantic web technologies and ontologies, enabling selection by the types of inputs they use, outputs they produce, or the bioinformatics tasks they perform.

DAML+OIL, OWL, RDF

Page 25

Workflows

• Workflow enactment engine
  – IBM’s Web Service Flow Language (WSFL)
  – Scufl
• Dynamic workflow service invocation and service discovery
  – Choose services when running workflow
• User interactivity during workflow enactment
  – Not a batch script!
  – Requires user proxies
• Separate data flow from control flow
  – Large amounts of data
• Iteration, decision points
• Monitoring
• Provenance logs
• The enactment engine is a web service
• Migrated to an OGSA service

[Figure labels: Scufl; Workflow Engine (for each task: run(operation, inputs)); Soaplab plugin]

http://www.it-innovation.soton.ac.uk/mygrid/workflow
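The enactment loop on this slide ("for each task: run(operation, inputs)") combined with choosing services at run time can be sketched as below. The workflow and registry shapes, and every function name, are assumptions for illustration; this is not Scufl or the actual myGrid enactment engine.

```python
def enact(workflow, registry, inputs):
    """Run tasks in order, choosing a live service for each operation at run time.

    `workflow` is a list of (task_name, operation, input_names) tuples and
    `registry` maps an operation to candidate callables. Both shapes are
    assumptions for this sketch, not the Scufl model.
    """
    data = dict(inputs)
    log = []  # minimal provenance: which service was tried for which task
    for task, operation, input_names in workflow:
        args = [data[name] for name in input_names]
        for service in registry.get(operation, []):
            try:
                data[task] = service(*args)
            except ConnectionError:
                log.append((task, service.__name__, "failed"))
                continue  # service is down: try the next candidate
            log.append((task, service.__name__, "ok"))
            break
        else:
            raise RuntimeError(f"no available service for {operation!r}")
    return data, log


# Hypothetical services: the primary one is down, the mirror works.
def primary_revcomp(seq):
    raise ConnectionError("service unavailable")


def mirror_revcomp(seq):
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def gc_count(seq):
    return sum(base in "GC" for base in seq)


registry = {"revcomp": [primary_revcomp, mirror_revcomp], "gc": [gc_count]}
workflow = [("rc", "revcomp", ["query"]), ("gc_in_rc", "gc", ["rc"])]
results, log = enact(workflow, registry, {"query": "AAGCT"})
```

Binding the operation to a concrete service only inside the loop is what lets the run survive a failed service, and the log doubles as a crude provenance trail.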

Page 26

Bio Services

• Wrap CORBA, Perl etc to look like web services, to become Grid services (eventually)

• Multiple services
  – Many hundreds of different services in the public domain and privately owned
• Multiple registries
  – 3rd party public registries, private registries, personal registries
• 3rd parties
  – JEMBOSS, PathPort, bioMoby
• SoapLab
  – A SOAP-based programmatic interface to command-line applications
  – ~300 different classes of services
  – Swiss-Prot, EMBOSS, Medline, blah, blah …
  – http://industry.ebi.ac.uk/soap/soaplab

Page 27

Application: Taverna workflow workbench

Bioinformatics analyses typically involve visiting many data resources and analytical tools.

These in silico experiments can be created as pipelines or “workflows” in our Taverna editor.

http://sourceforge.net/projects/taverna

Page 28

e-Science: notification

A notification service can inform the mIR and the user (proxy) that data, workflows, services, etc. have changed and thus prompt actions over data in the mIR.

Notifications are presented to the user with a client in the workbench environment.

User registers interest in notification topics
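Registering interest in topics and delivering changes to a user proxy can be sketched as a small in-memory publish/subscribe service. The class names, method names and topic strings are hypothetical and chosen only for this illustration; they are not the myGrid notification service interface.

```python
from collections import defaultdict


class NotificationService:
    """In-memory topic-based publish/subscribe (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register_interest(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to every subscriber of the topic; return the count."""
        for callback in self._subscribers[topic]:
            callback(topic, message)
        return len(self._subscribers[topic])


class UserProxy:
    """Collects notifications on behalf of a user who may be offline."""

    def __init__(self, user):
        self.user = user
        self.inbox = []

    def notify(self, topic, message):
        self.inbox.append((topic, message))


service = NotificationService()
proxy = UserProxy("scientist")
service.register_interest("data/affy", proxy.notify)
delivered = service.publish("data/affy", "new Affy data set deposited")
ignored = service.publish("service/blast", "BLAST service updated")
```

Only the topic the proxy registered for reaches its inbox; a workbench client could then present the inbox contents to the user.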

Page 29

e-Science: Provenance

Like a bench experiment, myGrid records the materials and methods it has used for an in silico experiment in a provenance log.

This is the where, what, when and how the experiment was run.

Derivation paths ~ workflows, queries
Annotations ~ notes
Evolution paths ~ workflow -> workflow
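A provenance log and a derivation path can be sketched as follows. The record fields and the `derivation_path` helper are assumptions for illustration, not the myGrid provenance schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """One step of an in silico experiment (field names are assumptions)."""
    who: str
    what: str                 # operation or service invoked
    where: str                # endpoint or host that ran it
    inputs: list
    outputs: list
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def derivation_path(records, item):
    """Walk backwards from a data item to the operations that produced it."""
    path = []
    frontier = [item]
    while frontier:
        current = frontier.pop()
        for record in records:
            if current in record.outputs and record.what not in path:
                path.append(record.what)
                frontier.extend(record.inputs)
    return list(reversed(path))


log = [
    ProvenanceRecord("scientist", "fetch_sequence", "host-a", ["gene_id"], ["seq"]),
    ProvenanceRecord("scientist", "blast", "host-b", ["seq"], ["hits"]),
    ProvenanceRecord("scientist", "annotate", "host-a", ["hits"], ["report"]),
]
path = derivation_path(log, "report")
```

Here `path` lists the operations behind the report in the order they ran, which is the "how" of the experiment; the other fields carry the who, where and when.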

Page 30

Talisman application

http://www.ebi.ac.uk/collab/mygrid/service1/talisman/index.html

Page 31

The annotation pipeline to identify Genes of Interest

Look at contents of work bench

User notified of new Affy data

Run a workflow over new Affy data
– Launch workflow wizard
– Discover appropriate workflow
– Enact workflow
– Monitor workflow

Look at provenance

Select and view results

[Figure: Annotation Pipeline. What is known about my candidate gene? Labels: Medline, OMIM, GO, BLAST, EMBL, DQP, Query]

Page 32

The “my” in myGrid

• my services
• my favourite services
• my opinion of those services
• my workflow templates
• my workflow runs
• my data
• my notes
• my queries
• my logs of what I did
• the events I care about

Page 33

The Grid in myGrid

• Service based architecture
• mIR and the DQP are OGSA-DAI compliant
• Migrating event notification and workflow enactment engine to OGSA
• Volatility of services and virtual organisations
  – Graceful management of failure
• Scale of data
  – e.g. dataflow through workflow engine and distributed query processor
• Services that are large computational services

Page 34

Life Science Identifier

• mIR uses LSIDs

• Integrating LSID resolvers from IBM for bio databases

• LSIDs form a connective glue along with the ontologies
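For readers unfamiliar with LSIDs, the identifier is a URN of the form urn:lsid:authority:namespace:object, with an optional revision. A minimal parsing sketch follows; the `parse_lsid` helper and the example identifier are hypothetical, and resolving the identifier to data (the job of an LSID resolver) is not shown.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:]):
        raise ValueError(f"empty LSID component in {text!r}")
    return LSID(*parts[2:])


# Made-up identifier for illustration only.
lsid = parse_lsid("urn:lsid:example.org:proteins:P12345:2")
```

Because the authority and namespace are explicit, the same identifier can be stored in the mIR, attached to ontology terms, and handed to whichever resolver serves that authority.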

Page 35

Summary

• myGrid offers service based middleware components
• Open source and free
• Open Grid Service Architecture-compliant
• Allows the scientist to be at the centre of the Grid -- Personalisation
• Generic middleware that suits the creation of bioinformatics applications
• Inclusion of rich semantics to facilitate the scientific process
• Available from http://www.mygrid.org.uk

Page 36

I3C http://www.i3c.org/

Page 37

Our Biology colleagues

Institute of Human Genetics, School of Clinical Medical Sciences

University of Newcastle, UK

Simon Pearce, Claire Jennings

Page 38

The rest of the team

Matthew Addis, Nedim Alpdemir, Rich Cawley, Vijay Dialani, Alvaro Fernandes, Justin Ferris, Rob Gaizauskas, Kevin Glover, Carole Goble (director), Chris Greenhalgh, Mark Greenwood, Ananth Krishna, Xiaojian Liu, Darren Marvin, Karon Mee, Simon Miles, Luc Moreau, Juri Papay, Norman Paton, Steve Pettifer, Milena Radenkovic, Peter Rice, Angus Roberts, Alan Robinson, Martin Senger, Nick Sharman, Paul Watson, Anil Wipat & Chris Wroe.

Page 39

Wrap up spare

• The myGrid project aims to provide middleware layers that make the Information Grid appropriate for the needs of bioinformatics.

• myGrid is building high level services for data & application integration such as resource discovery and workflow enactment.

• Additional services are provided to support the scientific method & best practice found at the bench but often neglected at the workstation, notably provenance management, change notification & personalisation.

• Semantically rich metadata expressed using ontologies is used to discover services and workflows.

• myGrid provides these services as middleware components that can be used to build bioinformatics applications.

• An in silico laboratory workbench demonstrator is currently being developed with these components.