myGrid: open knowledge based high level services for bioinformatics
the information Grid
Professor Carole Goble
University of Manchester, UK
http://www.mygrid.org.uk
The PRISM Forum PharmaGrid
• The GRID represents fast moving technology that will rapidly expand beyond initial applications of scavenging productive CPU cycles to the transparent provision of a wide range of services. With the GRID there is the potential to build powerful, complex problem-solving and collaborative environments, providing access to and sharing of diverse information sources and rigorous analytical tools. These benefits could deliver well within the five-year planning cycles of the pharmaceutical industry, if certain IT development challenges are met, including:
• Intelligent middleware that facilitates for the user transparent access to many services and execution of tasks
• High quality security features, enabling large databases to be accessed via GRID solutions
• Sophisticated semantic and contextual systems to enable diverse sources of data to be related for knowledge discovery
• The GRID’s potential for integration of information across the pharmaceutical value chain, well beyond discovery and development, offers a tremendous opportunity. Staff could be provided with personal working environments, and access to the best possible resources, services, information and knowledge available to solve problems and inform their decision-making.
Take home
• e-Science is bigger than Grid
• the e-Science experimental method needs first class support and is just as important as outcomes.
• Personalised
– my private data holdings
yet collaborative
– publish my workflow templates in registries for you to share and adapt
• Automated
– run a workflow and discover alternative services if a service goes down
yet interactive with the scientist at the centre
– user proxy notified to hand filter or view results
Challenges for Pharma
• Access to and understanding of distributed, heterogeneous information resources is critical
• Complex, time consuming process, because ...
– 1000’s of relevant information sources, an explosion in availability of:
  • experimental data
  • scientists’ annotations
  • text documents: abstracts, eJournal articles, monthly reports, patents, ...
– Rapidly changing domain concepts and terminology and analysis approaches
– Constantly evolving data structures
– Continuous creation of new data sources
– Highly heterogeneous sources and applications
– Data and results of uneven quality, depth, scope
– But still growing
Integration of Pharma information
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
OC   BACTERIA; FIRMICUTES; BACILLUS/CLOSTRIDIUM GROUP; BACILLACEAE;
OC   BACILLUS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
FT   ACT_SITE    116    116       BINDS PEP (BY SIMILARITY).
FT   CONFLICT    374    374       S -> A (IN REF. 3).
SQ   SEQUENCE   429 AA;  46016 MW;  02018C5C CRC32;
     MEKLNIAGGD SLNGTVHISG AKNSAVALIP ATILANSEVT IEGLPEISDI ETLRDLLKEI
     GGNVHFENGE MVVDPTSMIS MPLPNGKVKK LRASYYLMGA MLGRFKQAVI GLPGGCHLGP
     RPIDQHIKGF EALGAEVTNE QGAIYLRAER LRGARIYLDV VSVGATINIM LAAVLAEGKT
     IIENAAKEPE IIDVATLLTS MGAKIKGAGT NVIRIDGVKE LHGCKHTIIP DRIEAGTFMI
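Integrating records like the one above starts with reading their structure. The following is a minimal sketch, assuming only that each line begins with a two-letter line code (ID, DE, OS, KW, ...) followed by its content; the function name `parse_flat_record` is hypothetical and the sample text is an abbreviated copy of the record on this slide, with the sequence omitted.

```python
from collections import defaultdict

# Abbreviated copy of the record shown on the slide (sequence omitted).
RECORD = """\
ID   MURA_BACSU     STANDARD;      PRT;   429 AA.
DE   PROBABLE UDP-N-ACETYLGLUCOSAMINE 1-CARBOXYVINYLTRANSFERASE
DE   (EC 2.5.1.7) (ENOYLPYRUVATE TRANSFERASE) (UDP-N-ACETYLGLUCOSAMINE
DE   ENOLPYRUVYL TRANSFERASE) (EPT).
GN   MURA OR MURZ.
OS   BACILLUS SUBTILIS.
KW   PEPTIDOGLYCAN SYNTHESIS; CELL WALL; TRANSFERASE.
"""


def parse_flat_record(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, payload = line[:2], line[2:].strip()
        if code.strip():
            fields[code].append(payload)
    # Join continuation lines so that each code maps to a single string.
    return {code: " ".join(parts) for code, parts in fields.items()}


record = parse_flat_record(RECORD)
entry_name = record["ID"].split()[0]
keywords = [k.strip() for k in record["KW"].rstrip(".").split(";")]
print(entry_name, record["OS"], keywords)
```

Every source has its own layout, so a real integration layer would need one such reader per format plus a shared vocabulary to relate the fields.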
myGrid
• EPSRC UK e-Science pilot project
• Open Source Upper Middleware for Bioinformatics
• (Web) Service-based architecture -> Grid services
• 42 months, 20 months in.
• Prototype v0 technical and user requirements
• Prototype v1 Release Sept 2004, some services available now.
Graves' disease
• Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland, resulting in hyperthyroidism
• Weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos
The Biology
• Graves' disease is caused by the stimulation of the thyrotrophin receptor by thyroid-stimulating autoantibodies secreted by lymphocytes of the immune system.
• What is the molecular basis for this autoimmune response?
[Figure labels: Pituitary Gland; TSH; TSH Receptor; Thyroid Cell; Thyroid Hormones Released; -ve feedback effect]
Autoimmune Antibodies attach to TSH receptors, competing with TSH
Bioinformatics
[Figure labels]
Candidate gene pool
Annotation Pipeline: What is known about my candidate gene?
– Query, Medline, OMIM, GO, BLAST, EMBL, DQP
Genotype Assay Design System: Select a SNP from candidate gene. Is this SNP associated with disease?
– Gene ID, SNP
– Primer Design: Emboss Eprimer application in SoapLab
– Use primers designed by myGrid to amplify region flanking SNP on the gene
– Selection of restriction enzyme: Emboss Restrict in SoapLab
– Restriction Fragment Length Polymorphism experiment
– Talisman
3D Protein Structure: What is the structure of the protein product encoded by my candidate gene?
– PDB: Query PDB & display protein structure using Rasmol
– Obtain information about protein & extract information about active site: Swiss-Prot, AMBIT, Interpro
– AMBIT: Determine whether coding SNPs affect the active site of the protein
Peter Li (1), Claire Jennings (2), Simon Pearce (2) and Anil Wipat (1), (2003). (1) School of Computing Science and (2) Institute of Human Genetics, University of Newcastle-upon-Tyne.
Workflows are in silico experiments
[Figure labels: Annotation Pipeline: What is known about my candidate gene? Query, Medline, OMIM, GO, BLAST, EMBL, DQP]
http://cvs.mygrid.org.uk/scufl/NucleotideSeqAnnotationPipelineWithGoTerms/
in silico Exploratory Experiments
• Ad hoc virtual organisations
• No a priori agreements
• Discovery/exploratory workflows by biologists
• Personal
• Different resources
• Grids

• Predictive / stable integration
• Production workflows over known resources
• Organisation wide
• Emphasis on performance and resilience
• Data capture, cleaning and replication protocols

Clear understanding / Standard / Well defined / Predictive
Experimental orchestration / Exploratory / Hypothesis driven / Not prescriptive / Methodology free / Ad hoc
Experiment = Workflows + Services + (meta)Data
Discovering services to invoke
Discovering workflows to enact
• Service & workflow registration and discovery
– Multi-user, multi-view, federated registries
– First, second and third party services & workflows
– Publishing new ones, adapting old ones.
– My working set of services
– Services may be owned by another user, and come and go
– Views over registries of services
– Third party annotation
• Ontologies for describing and finding workflows/services and guiding service composition
– Service A outputs compatible with Service B inputs
– Blastn compares a nucleotide query sequence against a nucleotide sequence database (usually – intelligent misuse of services…)
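The compatibility check between one service's outputs and another's inputs can be sketched with a toy "is-a" hierarchy. This is a minimal illustration under stated assumptions: the terms in `IS_A`, the service descriptions and the function names are all hypothetical and are not taken from the myGrid ontology or registry.

```python
# Toy "is-a" hierarchy; the terms are illustrative only.
IS_A = {
    "nucleotide_sequence": "sequence",
    "protein_sequence": "sequence",
    "sequence": "data",
    "blast_report": "data",
}


def subsumes(general, specific):
    """True if `specific` equals `general` or is a descendant of it."""
    term = specific
    while term is not None:
        if term == general:
            return True
        term = IS_A.get(term)
    return False


def compatible(producer, consumer):
    """Service A can feed service B if A's output type fits B's input type."""
    return subsumes(consumer["input"], producer["output"])


# Hypothetical service descriptions.
fetch_embl = {"name": "fetch_embl", "input": "accession", "output": "nucleotide_sequence"}
fetch_protein = {"name": "fetch_protein", "input": "accession", "output": "protein_sequence"}
blastn = {"name": "blastn", "input": "nucleotide_sequence", "output": "blast_report"}

print(compatible(fetch_embl, blastn))
print(compatible(fetch_protein, blastn))
```

A strict check like this would reject the "intelligent misuse" mentioned above, which is one reason to treat such matches as guidance for composition and not as a hard constraint.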
An in silico experiment = a web of interconnected investigation holdings
Provenance record of workflow runs
Provenance of the workflow template. Related workflows.
People who wrote the workflow
Ontologies describing workflows
Services used
Notes
Data holdings
Literature
People to notify of the workflow status
Experiment life cycle
Forming experiments
– Resource & service discovery
– Repository creation
– Workflow creation
– Database query formation
Executing experiments
– Workflow enactment
– Distributed Query processing
– Job execution
– Provenance generation
– Single sign-on authentication
– Event notification
Discovering and reusing experiments and resources
– Workflow discovery & refinement
– Resource & service discovery
– Repository creation
– Provenance
Managing experiments
– Information repository
– Metadata management
– Provenance management
– Workflow evolution
– Event notification
Providing services & experiments
– Service registration
– Workflow deposition
– Metadata Annotation
– Third party registration
Personalisation
– Personalised registries
– Personalised workflows
– Info repository views
– Personalised annotations
– Personalised metadata
– Security
Investigation = set of experiments + metadata
• Hypothesis, materials and methods, results, conclusions, acknowledgements, bibliography
• Who, what, where, why, when, (w)how? recorded by provenance records
• Experiment is repeatable, if not reproducible.
• The traceability of knowledge as it evolves and as it is derived.
• A web of myGrid holdings
– input data, data results, intermediate data, parameter sets, workflow logs, workflow templates, people, organisations, personal notes, services etc.
• Discovering links between experiment objects
• Selectively share (parts of) experiments and investigations
• Discover experiments and investigations
Data at the centre
Provenance record of workflow run that produced this data
Provenance of the data holdings
Workflows that could use or generate this data
People who have registered an interest in this data
Ontologies describing data
Services that can use or produce this data
Notes
Data holdings
Literature relevant
Related data holdings
Put the scientist at the centre
Provenance record of workflow runs they have made
People
Ontologies
Preferences for Services
Notes
Data holdings
Literature
Workflows they wrote or used
People they collaborate with
myGrid in a nutshell
A “second generation” open service-based Grid project, a test bed for the OGSI, OGSA and OGSA-DAI base services
Semantic grid capabilities
– knowledge-based technologies
– semantic-based service, workflow & data discovery
– match making
– linking investigation components
High level services for e-Science experimental management
– provenance, change notification, personalisation, investigation and experiment holdings management
External Applications: workbench, portal, Talisman, Taverna
External Services: AMBIT, SoapLab, EMBOSS…
Bioinformaticians, Tool Providers, Service Providers
High level services for data intensive integration: workflow & distributed query processing
myGrid Services
[Architecture figure. Labels: Web Service & Grid communication fabric; Gateway; Bioinformaticians, Tool Providers, Service Providers; Work bench; Taverna workflow environment; Talisman application; Portal; mIR browser; Service and Workflow Discovery; myGrid Information Repository; Ontology Mgt; Metadata Mgt; Provenance mgt; Personalisation; Event Notification; Knowledge Services; Registries; Registry; Ontologies; Workflow enactment engine; Distributed Query Processor; AMBIT Text Extraction Service; Bio Services; Soaplab; SRS; EMBOSS]
Putting the services together
[Architecture figure. Legend: myGrid Client, myGrid Services, External Services. Labels: Semantic registration; Service Publication (syntactic registration); Service; Knowledge Service; Registry; Registry View; UDDI; Matchmaker; Find Service; Finding Service; Service & workflow browser; Service Browser; Workflow Enactment Engine; Notification Service; Notification Client; Notification; Distributed Query Processor; Information Extraction Service; Job Execution; mIR; Provenance browser; User Proxy; User Gateway; Workbench; Taverna Workflow Environment; Domain Services; Bio-databases; SoapLab; EMBOSS]
Application: Work bench demonstrator
The myGrid service components have been used in a demonstration application called the “myGrid WorkBench”, which provides a common point of use for the services.
We can select data from the myGrid Information repository (mIR), select a workflow based on its semantic description, and examine the results.
A work bench for demonstrating services
[Screenshot labels: myView on the mIR; Workflow; Metadata about workflow; Note about workflow]
Semantic services
Services and workflows within myGrid are described using semantic web technologies and ontologies, enabling selection by the types of inputs they use, outputs they produce, or the bioinformatics tasks they perform.
DAML+OIL, OWL, RDF
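Selection by input type, output type or task can be pictured as a query over statements about each service. The sketch below holds the statements as plain (subject, predicate, object) tuples in the spirit of RDF; the predicate names (`performs`, `hasInput`, `hasOutput`), the service names and the function `find_services` are assumptions made for illustration, not the vocabulary or API that myGrid uses.

```python
# Service descriptions as (subject, predicate, object) statements.
# All names here are hypothetical.
TRIPLES = {
    ("blastn", "performs", "similarity_search"),
    ("blastn", "hasInput", "nucleotide_sequence"),
    ("blastn", "hasOutput", "blast_report"),
    ("eprimer", "performs", "primer_design"),
    ("eprimer", "hasInput", "nucleotide_sequence"),
    ("eprimer", "hasOutput", "primer_pairs"),
    ("restrict", "performs", "restriction_site_search"),
    ("restrict", "hasInput", "nucleotide_sequence"),
    ("restrict", "hasOutput", "restriction_sites"),
}


def find_services(**criteria):
    """Return sorted names of services matching every predicate=object pair."""
    subjects = {s for s, _, _ in TRIPLES}
    return sorted(
        s for s in subjects
        if all((s, pred, obj) in TRIPLES for pred, obj in criteria.items())
    )


by_input = find_services(hasInput="nucleotide_sequence")
by_task = find_services(performs="primer_design")
by_both = find_services(hasInput="nucleotide_sequence", hasOutput="blast_report")
print(by_input, by_task, by_both)
```

An ontology language such as those listed above would add reasoning on top of this exact-match lookup, for example finding services whose input is a more general type than the data in hand.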
Workflows
• Workflow enactment engine
– IBM’s Web Service Flow Language (WSFL)
– Scufl
• Dynamic workflow service invocation and service discovery
– Choose services when running workflow
• User interactivity during workflow enactment
– Not a batch script!
– Requires user proxies
• Separate data flow from control flow
– Large amounts of data
• Iteration, decision points
• Monitoring
• Provenance logs
• The enactment engine is a web service
• Migrated to an OGSA service
Scufl
for each task:
• run(operation, inputs)
WorkflowEngine
Soaplab plugin
http://www.it-innovation.soton.ac.uk/mygrid/workflow
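The "for each task: run(operation, inputs)" loop, combined with choosing services at run time, might look like the sketch below. It is an illustration under assumptions, not the Scufl engine: the function `enact`, the `ServiceDown` exception and the stand-in services are hypothetical, and the workflow is reduced to a linear chain in which each task's output feeds the next.

```python
class ServiceDown(Exception):
    """Raised by a stand-in service to simulate an outage."""


def enact(workflow, candidates, inputs):
    """Run tasks in order, passing each task's output to the next.

    workflow:   list of operation names
    candidates: operation name -> list of (service name, callable) alternatives
    Returns the final data and a log of which service handled each task.
    """
    data, log = inputs, []
    for operation in workflow:
        for name, service in candidates[operation]:
            try:
                data = service(data)      # run(operation, inputs)
                log.append((operation, name))
                break
            except ServiceDown:
                continue                  # fall back to the next alternative
        else:
            raise RuntimeError(f"no service available for {operation}")
    return data, log


def primary_fetch(accession):
    raise ServiceDown("primary service offline")


def mirror_fetch(accession):
    return f"sequence-for-{accession}"


def annotate(sequence):
    return {"sequence": sequence, "annotations": ["placeholder"]}


candidates = {
    "fetch": [("primary", primary_fetch), ("mirror", mirror_fetch)],
    "annotate": [("annotator", annotate)],
}
result, log = enact(["fetch", "annotate"], candidates, "X00001")
print(log)
```

The returned log doubles as a crude provenance trail: it records that the mirror, not the primary service, produced the data.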
Bio Services
• Wrap CORBA, Perl etc to look like web services, to become Grid services (eventually)
• Multiple services
– Many hundreds of different services in the public domain and privately owned
• Multiple registries
– 3rd party public registries, private registries, personal registries
• 3rd parties
– JEMBOSS, PathPort, bioMoby
• SoapLab
– A SOAP-based programmatic interface to command-line applications
– ~300 different classes of services
– Swiss-Prot, EMBOSS, Medline, blah, blah …
– http://industry.ebi.ac.uk/soap/soaplab
Application: Taverna workflow workbench
Bioinformatics analyses typically involve visiting many data resources and analytical tools.
These in silico experiments can be created as pipelines or “workflows” in our Taverna editor.
http://sourceforge.net/projects/taverna
e-Science: notification
A notification service can inform the mIR and the user (proxy) that data, workflows, services, etc. have changed and thus prompt actions over data in the mIR.
Notifications are presented to the user with a client in the workbench environment.
User registers interest in notification topics
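Registering interest in topics and being told when something changes is a publish/subscribe pattern. The following in-memory sketch shows the idea; the class `NotificationService`, its method names and the topic strings are hypothetical, and a list stands in for the notification client in the workbench.

```python
from collections import defaultdict


class NotificationService:
    """In-memory, topic-based publish/subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register_interest(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to every subscriber of the topic; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, message)
        return len(callbacks)


inbox = []  # stands in for the notification client in the workbench
service = NotificationService()
service.register_interest("data/affymetrix", lambda t, m: inbox.append((t, m)))

delivered = service.publish("data/affymetrix", "new dataset available")
ignored = service.publish("services/blast", "new version")
print(inbox)
```

In a distributed setting the callback would be a message to a user proxy, so that notifications can be held while the scientist is offline.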
e-Science: Provenance
As in a bench experiment, myGrid records the materials and methods used for an in silico experiment in a provenance log.
The log captures where, what, when and how the experiment was run.
Derivation paths ~ workflows, queries
Annotations ~ notes
Evolution paths ~ workflow -> workflow
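A provenance log of this kind can be sketched as a list of records, one per workflow step, from which a derivation path is recovered by walking back from a result to the steps that produced it. The record fields, class names and identifiers below are assumptions chosen for illustration; they are not the myGrid provenance schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    who: str       # user who ran the step
    what: str      # operation invoked
    where: str     # service that performed it
    inputs: list   # identifiers of the input data
    output: str    # identifier of the data produced
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ProvenanceLog:
    def __init__(self):
        self.records = []

    def record(self, **kwargs):
        entry = ProvenanceRecord(**kwargs)
        self.records.append(entry)
        return entry

    def derivation_path(self, data_id):
        """Walk backwards from a data item through the steps that produced it."""
        by_output = {r.output: r for r in self.records}
        path, frontier = [], [data_id]
        while frontier:
            step = by_output.get(frontier.pop())
            if step is not None:
                path.append(step.what)
                frontier.extend(step.inputs)
        return path


log = ProvenanceLog()
log.record(who="alice", what="fetch_sequence", where="embl",
           inputs=["acc:X00001"], output="seq:1")
log.record(who="alice", what="blastn", where="blast",
           inputs=["seq:1"], output="report:1")
path = log.derivation_path("report:1")
print(path)
```

The walk assumes each data item has a single producing step and that the records contain no cycles, which holds for a simple workflow run.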
Talisman application
http://www.ebi.ac.uk/collab/mygrid/service1/talisman/index.html
The annotation pipeline to identify Genes of Interest
Look at contents of work bench
User notified of new Affy data
Run a workflow over new Affy data
– Launch workflow wizard
– Discover appropriate workflow
– Enact workflow
– Monitor workflow
Look at provenance
Select and view results
[Figure labels: Annotation Pipeline: What is known about my candidate gene? Query, Medline, OMIM, GO, BLAST, EMBL, DQP]
The “my” in myGrid
• my services
• my favourite services
• my opinion of those services
• my workflow templates
• my workflow runs
• my data
• my notes
• my queries
• my logs of what I did
• the events I care about
The Grid in myGrid
• Service based architecture
• mIR and the DQP are OGSA-DAI compliant
• Migrating event notification and workflow enactment engine to OGSA
• Volatility of services and virtual organisations
– Graceful management of failure
• Scale of data – e.g. dataflow through workflow engine and distributed query processor
• Services that are large computational services
Life Science Identifier
• mIR uses LSIDs
• Integrating LSID resolvers from IBM for bio databases
• LSIDs form a connective glue along with the ontologies
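An LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision. The sketch below splits one into its parts; the function `parse_lsid` is hypothetical, the example identifiers are made up, and resolution against an authority (the job of the resolvers mentioned above) is deliberately left out.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# The identifiers below are made up for illustration.
with_revision = parse_lsid("urn:lsid:example.org:protein:MURA_BACSU:1")
without_revision = parse_lsid("urn:lsid:example.org:protein:MURA_BACSU")
print(with_revision, without_revision)
```

Because the authority and namespace are explicit, two holdings that cite the same LSID can be linked without either side knowing how the other stores its data.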
Summary
• myGrid offers service based middleware components
• Open source and free
• Open Grid Service Architecture-compliant
• Allows the scientist to be at the centre of the Grid -- Personalisation
• Generic middleware that suits the creation of bioinformatics applications
• Inclusion of rich semantics to facilitate the scientific process
• Available from http://www.mygrid.org.uk
I3C http://www.i3c.org/
Our Biology colleagues
Institute of Human Genetics, School of Clinical Medical Sciences, University of Newcastle, UK
Simon Pearce, Claire Jennings
The rest of the team
Matthew Addis, Nedim Alpdemir, Rich Cawley, Vijay Dialani, Alvaro Fernandes, Justin Ferris, Rob Gaizauskas, Kevin Glover, Carole Goble (director), Chris Greenhalgh, Mark Greenwood, Ananth Krishna, Xiaojian Liu, Darren Marvin, Karon Mee, Simon Miles, Luc Moreau, Juri Papay, Norman Paton, Steve Pettifer, Milena Radenkovic, Peter Rice, Angus Roberts, Alan Robinson, Martin Senger, Nick Sharman, Paul Watson, Anil Wipat & Chris Wroe.
Wrap up spare
• The myGrid project aims to provide middleware layers that make the Information Grid appropriate for the needs of bioinformatics.
• myGrid is building high level services for data & application integration such as resource discovery and workflow enactment.
• Additional services are provided to support the scientific method & best practice found at the bench but often neglected at the workstation, notably provenance management, change notification & personalisation.
• Semantically rich metadata expressed using ontologies is used to discover services and workflows.
• myGrid provides these services as middleware components, that can be used to build bioinformatics applications.
• An in silico laboratory workbench demonstrator is currently being developed with these components.