myGrid
Nobody said it was easy:
Semantically Discovering BioGrid Services is tricky
Professor Carole Goble, University of Manchester, UK
myGrid project http://www.mygrid.org.uk
myGrid
Environmental requirements of bioinformatics in silico experimentation
The services
Workflow execution
And the impact on describing services: how you describe things, what to describe, and how and when to use the descriptions
– different levels of description
– different views on services, depending on whether you are middleware or a user
– implications for registration
myGrid
Open Grid Services Architecture
• Present Grid Architecture is a services architecture
• Implemented using Web Services Technology
• OGSA will provide
– Naming / Authorization / Security / Privacy
– Higher level services: Workflow, Transactions, Data Mining, Knowledge Discovery, …
• Exploiting Synergy: Commercial Internet with Grid Services
• OGSI extends Web Services
– Transient Service Instances
– Service State
– Lifetime management
• Defines fundamental (WSDL) interfaces and behaviors that define a Grid Service
– Required + optional interfaces = WS “profile”
• Defines WSDL extensibility elements
– E.g., serviceType (a group of portTypes)
myGrid
myGrid
• EPSRC UK e-Science pilot project
• Open Source Upper Middleware for Bioinformatics
• Data intensive not compute intensive
• Sharing knowledge and sharing components
myGrid
myGrid in a nutshell
• An example of a “second generation” open service-based Grid project, specifically a test bed for the OGSI, OGSA and OGSA-DAI base services;
– myGrid Information Repository that is OGSA-DAI compliant
• Developing high level services for data intensive integration, rather than computationally intensive problems;
– Workflow & distributed query processing
• Developing high level services for e-Science experimental management;
– Provenance, change notification and personalisation
• Developing Semantic Grid capabilities and knowledge-based technologies, such as semantic-based resource discovery and matching.
– Metadata descriptions and ontologies for service discovery, component discovery and linking components.
myGrid
• Service discovery
• Workflow discovery & refinement
• Provenance logs
• Personalised service registries
• Personalised workflows
• Workflow enactment
• Service invocation
• Provenance logs
• Service registration
• Workflow deposition
• Metadata annotation
• Third party registration
• Provenance records
• Workflow evolution
• Service monitoring
• Service discovery
• Workflow discovery & refinement
• Workflow creation
Experiment life cycle
Executing experiments
Discovering and reusing experiments and resources
Managing experiments
Providing services & experiments
Personalisation
Forming experiments
myGrid
Provenance
• Experiment is repeatable, if not reproducible, and explained by provenance records
• Who, what, where, why, when, (w)how?
• The traceability of knowledge as it evolves and as it is derived.
• Implications for recording which services were invoked on what data, when, and with what parameters.
• Immutable and persistent
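One way to picture the "immutable and persistent" requirement is an append-only log of frozen records. The sketch below is only an illustration and is not the myGrid provenance service: the class names `ProvenanceRecord` and `ProvenanceLog`, the field names, and the sample values are all assumptions made for this example, and persistence to disk is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class ProvenanceRecord:
    """One entry: who invoked which service on what data, when, with what parameters."""
    who: str
    service: str
    input_ref: str
    parameters: Mapping[str, str]
    when: datetime


class ProvenanceLog:
    """Append-only log (hypothetical): entries can be added and read, never edited."""

    def __init__(self):
        self._records = []

    def record(self, who, service, input_ref, parameters, when=None):
        entry = ProvenanceRecord(
            who=who,
            service=service,
            input_ref=input_ref,
            # Copy the parameters and wrap them read-only so later changes
            # to the caller's dict cannot alter the stored record.
            parameters=MappingProxyType(dict(parameters)),
            when=when or datetime.now(timezone.utc),
        )
        self._records.append(entry)
        return entry

    def for_input(self, input_ref):
        """All invocations that touched a given piece of data."""
        return tuple(r for r in self._records if r.input_ref == input_ref)
```

A frozen dataclass rejects attribute assignment after creation, which is a lightweight stand-in for the immutability the slide asks for.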
myGrid
Architectural Overview
[Architecture diagram. Labels shown: Notification Service; JMS; Knowledge Services; Knowledge Service; Ontology Server; Reasoner; Matcher; Metadata Concepts; KB Store; Registry; RegistryView; UDDI; RDF-based UDDI; Semantic registration; Structural registration; Service Discovery; Discover Workflow or Service; mG Object Discovery; Service (several instances); Workflow templates; Workflow instances; Workflow enactment engine; Scufl & WSFL; Job Execution; SoapLab; Distributed Query Processor; Information Extraction; PESTO; m Info Repository (mIR); DB2; Data; Provenance; Provenance service; Test Data.]
myGrid
Workflows
• Workflow discovery
– Finding workflows that others have done, and that I have done myself
• Workflow specification
– Finding classes of services
– Guiding service composition
– We don’t do automated composition
• Dynamic workflow enactment service discovery and invocation
– Choose service instances when running workflow
• User involvement
myGrid
Views
[Discovery architecture diagram. Labels shown: myGrid Find Service; Discovery Client; Find Service; Semantic discovery; Syntactic discovery; Word-based discovery; Ontology Server; Matcher; Reasoner; FaCT; Views; UDDI-M; RDF; Org. registry; Public registry; UDDI; WSDL; Ontologies; KAON; Service (publishes); Third Party (publishes Third party description); Gather service descriptions; Description Store.]
myGrid
myGrid Components ~ Demo
• portal operation.
• semantics to define type system.
• mIR, to store, and retrieve data.
• registry to describe and record services
Uncharacterised DNA sequence
Select an open reading frame
Translate to protein
BLAST search
Characterised DNA sequence
myGrid
myGrid Components ~ Demo
• Pre-existing third party application
• Service invocation
• Workflow enactment
DNA sequence getOrf transeq prophet
Proteins from a family emma prophecy
plotorf
Classical bioinformatics: detecting whether an uncharacterised protein domain is conserved across a group of proteins
myGrid
Bio Services Landscape
• Wrap CORBA, Perl etc to look like web services, to become Grid services (eventually)
• Multiple services
– Many hundreds of different services in the public domain and privately owned
• Multiple registries
– 3rd party public registries, private registries, personal registries
• 3rd parties
– JEMBOSS, PathPort, bioMoby
• Wrap our own
– Soaplab
– A SOAP-based programmatic interface to command-line applications
– ~300 different classes of services
– Swiss-Prot, EMBOSS, Medline, blah, blah …
– http://industry.ebi.ac.uk/soap/soaplab
myGrid
Bio Services Problem Space
• Multiple service providers of same service (not just similar service)
– Many implementations of Swiss-Prot version 40
• “What and which” Discovery based on
– What the service does from a domain perspective.
– Which service instance has the appropriate capabilities from an operational perspective.
• Users don’t care if the service is a service or a workflow.
– Same “what” description from their perspective
– Different “how” description from middleware perspective.
SWISS-PROT
SWISS-PROT@local
SWISS-PROT@ebi
SWISS-PROT@ncbi
myGrid
Consequences
• We support (at least) two types of semantic service discovery:
– Domain
  • requiring access to common application domain ontologies
  • Biology and bioinformatics
– Service
  • using cross-domain knowledge independent of application
  • Quality of service, ownership, location, organisations …
• We describe the profile of workflows as if they were services (of course a workflow could be deployed as a service…)
• Should workflow descriptions be in the same registry as service descriptions, or elsewhere?
– A find service must transcend the location.
myGrid
Tiers of service description
Select an open reading frame
Translate to protein
Characterised DNA sequence
Sequence alignment
Uncharacterised DNA sequence
EMBOSSGetORF
EMBOSSTransSeq
Characterised DNA sequence
BLAST-p
CATTACCC
EMBOSSGetORF@http:img.cs.man.ac.uk
EMBOSSTransSeq@http:ed.ac.uk
Characterised DNA sequence
CATTACCC
myGrid
Summary: Tiered levels of descriptions
Abstract Service: Sequence alignment
Invoked Service: Blastn@EBI invoked proxy
Service Instance: Blastn@EBI
Specific Service: Blastn
Ontology
Ontology
Ontology
Data model
Service Data Element
Classes of services: domain “semantic”, unexecutable, “potentials”
Instances of services: business “operational”, executable, “actuals”
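The four tiers can be read as a chain in which each description refines the one above it. The following is a minimal sketch under that reading, not the myGrid data model: the class `ServiceDescription` and its `tier` labels are assumptions made for illustration, while the example names come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    tier: str  # one of "abstract", "specific", "instance", "invoked"
    parent: Optional["ServiceDescription"] = None

    @property
    def executable(self) -> bool:
        # Classes of services are "potentials" and cannot be run;
        # instances of services are "actuals" and can.
        return self.tier in ("instance", "invoked")

    def lineage(self) -> list:
        """Names from this description up to the most abstract one."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


alignment = ServiceDescription("Sequence alignment", "abstract")
blastn = ServiceDescription("Blastn", "specific", alignment)
blastn_ebi = ServiceDescription("Blastn@EBI", "instance", blastn)
invoked = ServiceDescription("Blastn@EBI invoked proxy", "invoked", blastn_ebi)
```

Walking `invoked.lineage()` climbs from the invoked proxy back to the abstract notion of sequence alignment, which is the path a discovery client would follow in reverse.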
myGrid
What are you discovering? Classes & Users
Classes of Service
Workflow specifications
Discovery
• Finding a service that will fulfil some task, e.g. aligning biological sequences.
– What services perform a specific kind of task, for example, what services can I use to perform a biological sequence similarity search?
• Finding a service that will accept or produce some kind of data.
– What services produce this kind of data, for example, where can I find sequence data for a protein?
– What services consume this kind of data, for example, if I have protein sequence data, what can I do with it?
• Class of service:
– a protein sequence alignment, a protein sequence database.
• Specific example of an abstract service:
– BLAST, BLASTn, SWISS-PROT
• Applies to class of services and workflow specifications
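The three questions on this slide (what performs a task, what produces this data, what consumes this data) can be sketched as three lookups over class-level descriptions. This is a toy illustration: the `ServiceClass` type, the registry contents, and the task and data labels are assumptions chosen for the example, not entries from a real myGrid registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceClass:
    name: str
    task: str
    consumes: frozenset
    produces: frozenset


# Illustrative entries only; the task and data labels are invented.
REGISTRY = [
    ServiceClass("BLAST", "sequence similarity search",
                 frozenset({"protein sequence"}),
                 frozenset({"alignment report"})),
    ServiceClass("SWISS-PROT", "database retrieval",
                 frozenset({"accession"}),
                 frozenset({"protein sequence"})),
]


def by_task(task):
    """What services perform this kind of task?"""
    return [s.name for s in REGISTRY if s.task == task]


def producing(data):
    """Where can I get this kind of data?"""
    return [s.name for s in REGISTRY if data in s.produces]


def consuming(data):
    """I have this kind of data; what can I do with it?"""
    return [s.name for s in REGISTRY if data in s.consumes]
```

The matching here is exact string equality; the point of the ontology-based approach in the talk is to replace that with reasoning over concepts.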
myGrid
Originally Based on DAML-S
• US DARPA Agent Markup Language – Services http://www.daml.org
• An Upper Ontology for Services
Resource provides Service
Service presents Service profile: what it does
Service describedBy Service model: how it works
Service supports Service grounding: how to access it
Service profile: description, functionalities, functional attributes
myGrid
Bioinformatics ontology
Web service ontology
Task ontology
Publishing ontology
Informatics ontology
Molecular biology ontology
Organisation ontology
Upper level ontology
Specialises. All concepts are subclassed from those in the more general ontology.
Contributes concepts to form definitions.
Suite
parameters: input, output, precondition, effect
performs_task
uses-resource
is_function_of
myGrid
Pedro interface to Service Discovery
myGrid
Classification and matchmaking of services
• Classification of services/workflows
• Imprecise (best effort) substitutions of services/workflows
• Service/workflow organisation & indexing
• Service/workflow matchmaking & substitution
– “BLAST” finds tblastx, tblastn, psi-blast, marks_super_blast.
– “Alignment” finds ClustalW, Blast, Smith-Waterman, Needleman-Wunsch
• Expanded selection of services based on expansion of in-hand object
• A vocabulary for expressing service descriptions without pre-determining every description
• A reasoning process to manage:
– coherency of the classifications and the descriptions when they are created,
– the service discovery, matching and composition when they are deployed.
• Ontologies in DAML+OIL/OWL based on the DAML-S ontology
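The “BLAST finds tblastx …” behaviour amounts to returning everything classified beneath the query concept. In myGrid that classification comes from a description logic reasoner over the ontology; the sketch below swaps in a hand-written parent table purely to show the lookup, so the `PARENT` table and the function names are assumptions made for illustration.

```python
# Hand-written stand-in for a reasoner-computed classification.
PARENT = {
    "tblastx": "BLAST",
    "tblastn": "BLAST",
    "psi-blast": "BLAST",
    "BLAST": "Alignment",
    "ClustalW": "Alignment",
    "Smith-Waterman": "Alignment",
    "Needleman-Wunsch": "Alignment",
}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or sits anywhere beneath it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def find(query):
    """All services classified strictly under the query concept."""
    return sorted(c for c in PARENT if c != query and is_a(c, query))
```

Because the walk is transitive, asking for “Alignment” also returns the BLAST variants, which is one way the selection of services gets expanded beyond an exact name match.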
myGrid
What are you discovering? Instances & Machines
Classes of Service
Workflow specifications
Discovery
Select instances
Instantiate
registry
myGrid
Discovering services based on their operational properties
• What resources does a specific organisation provide?
• Who authored this resource?
• What services offering x currently give the best quality of service?
• Which service would the local bioinformatics expert suggest we use?
• Data quality, quality of service, cost, geographical location, authorisation, provenance of data and so on.
• Third party metadata
• Instance service description of a specific service
– BLAST, SWISS-PROT as offered by the EBI is 80% reliable.
• Invoked instance service description
– BLAST as offered by the EBI on a particular date, with particular parameters when a service is invoked.
Applies to instances of services and workflows
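Operational discovery works on metadata attached to instances rather than on what the service does. A small sketch follows, assuming a flat list of instance records: the `InstanceMetadata` type and the `recommended_by` field are hypothetical, the 80% figure for the EBI comes from the slide, and the NCBI entry is invented so that there is something to compare against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceMetadata:
    service_class: str
    provider: str
    reliability: float  # fraction of successful invocations
    recommended_by: frozenset = frozenset()  # third party annotations


INSTANCES = [
    InstanceMetadata("BLAST", "EBI", 0.80),
    # Invented entry for illustration only.
    InstanceMetadata("BLAST", "NCBI", 0.90, frozenset({"local expert"})),
    InstanceMetadata("SWISS-PROT", "EBI", 0.80),
]


def offered_by(provider):
    """What resources does a specific organisation provide?"""
    return sorted(i.service_class for i in INSTANCES if i.provider == provider)


def best_instance(service_class, recommended_by=None):
    """Which instance of this class currently has the best quality of service?"""
    candidates = [
        i for i in INSTANCES
        if i.service_class == service_class
        and (recommended_by is None or recommended_by in i.recommended_by)
    ]
    return max(candidates, key=lambda i: i.reliability, default=None)
```

Passing `recommended_by="local expert"` restricts the candidates to instances carrying that third party annotation before ranking them.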
myGrid
RDF based UDDI metadata for service instances
myGrid
User engagement
Classes of Service
Workflow specifications
Discovery
Select instances
Instantiate
• Support for the user to find a service that fulfils their task.
• The ontology should be fairly simple, couched in concepts the user is familiar with, e.g. protein sequence.
• Analogous to the DAML-S profile.
registry
myGrid
EMBOSS seqret
• Function that reads and writes (returns) sequences
• But it's so much more than that!
• EMBOSS programs can take a wide range of qualifiers that slightly change the behaviour of the program when reading or writing a sequence
• seqret can read a sequence or many sequences from databases, files, files of sequence names, the command-line or the output of other programs and then can write them to files, the screen or pass them to other programs.
• Because it can read in a sequence from a database and write it to a file, it's a program for extracting sequences from databases
• Because it can write the sequence to the screen, seqret is a program for displaying sequences.
myGrid
And more….
seqret can read sequences in any of a wide range of standard sequence formats. You can specify the input and output formats being used. If you don't specify the input format, it will try a set of possible formats until it reads it in successfully. Because you can specify the output sequence format, it's a program to reformat a sequence.
seqret can read in the reverse complement of a nucleic acid sequence. So it's a program for producing the reverse complement of a sequence.
seqret can read in a sequence whose begin and end positions you have specified and write out that fragment. So it's a utility for doing simple extraction of a region of a sequence.
seqret can change the case of the sequence being read in to upper or to lower case. So it's a simple sequence beautification utility.
seqret can do any combination of the above functions. ......
myGrid
EMBOSS
• EMBOSS sequence alignment service matcher: a simple way to describe the task it fulfils is
  matcher has_input sequence performs_task aligning
• Some verb acting on some object to produce a result; this pattern fits most descriptions.
• Descriptions quickly get more complicated.
• EMBOSS degap removes gap characters from a sequence.
• Where should the gap character concept be included? It is neither an input nor an output.
myGrid
• Several properties added over the DAML-S profile for bioinformatics
– e.g. uses_resource and uses_application.
• These could be simplified away, either as one additional property or as a precondition as used in DAML-S.
– More obtuse to the user.
– Makes the model more complex or redundant for the benefit of the user.
– Reduces interoperability with service descriptions in other domains.
– Perhaps this redundancy should be encoded within the applications delivering the ontology and a more complex precondition description used under the hood?
myGrid
EMBOSS matcher
• protein sequence is an ambiguous term and relies on implicit information held in the head of the bioinformatician.
• to reason over or organise concepts we need a more precise definition
• More precisely: a data structure conforming to some schema that encodes the sequence of amino acids in a protein molecule.
• We can now start to infer the relationship between protein sequences and nucleotide sequences.
• But a user cannot be expected to interact with such a complex model.
myGrid
Outcome: Views
• Multiple descriptions over same services & workflows held in registries
• Third party descriptions & subsets of services
– publication of descriptions must be supported both for the author of the service and third parties;
– third party annotations are a view of a service and discovery should offer a variety of views based upon third party annotations;
– there is a need for control over who may add and alter third party annotations;
• Generic services supporting a wide variety of multiple tasks
– Middleware must have some way of going beyond a generic description and stating, given these inputs, what the outputs are going to be.
– Rather than author very complex descriptions that cater for all possibilities, it is better to author many simpler descriptions, one for each case.
– It may in fact be necessary to ask the service itself for specific answers, such as ‘given these inputs what would you perform?’
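A view in this sense can be sketched as the subset of descriptions written by annotators a given user trusts, with one simple description per use of a generic service such as seqret. The `Description` type, the author labels, and the task strings below are assumptions made for illustration; access control over who may add annotations is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    service: str
    author: str  # the service author or a third party
    task: str


# Several simple descriptions of the same generic service, one per case.
DESCRIPTIONS = [
    Description("seqret", "service author", "reading and writing sequences"),
    Description("seqret", "third party A", "reformatting a sequence"),
    Description("seqret", "third party B", "reverse complementing a sequence"),
    Description("matcher", "service author", "aligning"),
]


def view(trusted_authors):
    """The descriptions visible to a user who trusts these authors."""
    trusted = set(trusted_authors)
    return [d for d in DESCRIPTIONS if d.author in trusted]


def discover(task, trusted_authors):
    """Services described as performing `task` within the user's view."""
    return sorted({d.service for d in view(trusted_authors) if d.task == task})
```

The same service is found or missed depending on the view, which is why publication by third parties and control over annotations both matter.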
myGrid
Views
[Discovery architecture diagram, repeated from the earlier Views slide.]
myGrid
Bio Services Problem Space
• Wrap CORBA, Perl etc to look like web services, to become Grid services (eventually)
• Dialogue oriented (e.g. Soaplab) and function oriented (e.g. bioMOBY)
– Often highly parameterised
• Mixture of synchronous and asynchronous
– Simulations and feedback loops
• Streaming large scale data
– Mixture of binary and text
myGrid
EMBOSS
• Suite of 200+ command line programs, which use the AJAX Command Definition (ACD) language
How do we present these services?
• As 200 different services, one for each EMBOSS program, with a single method, with as many parameters as the EMBOSS program requires.
• As 200 different services, one for each EMBOSS program, with a number of overloaded methods where the program takes optional parameters.
• As a single service with 200 different methods, one for each EMBOSS program.
• As a single, highly parametric service, with a single method, called invoke, the first parameter of which names the EMBOSS program to run.
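The first and last options can be derived from the same table of programs, which makes the trade-off easier to see. This is a sketch under stated assumptions: `PROGRAMS` lists only two programs with two qualifiers each, the functions build a command line as a list and never execute EMBOSS, and none of the names reflect how Soaplab actually exposes the suite.

```python
# Tiny illustrative subset: program name -> qualifiers it accepts here.
PROGRAMS = {
    "seqret": ("sequence", "outseq"),
    "transeq": ("sequence", "outseq"),
}


def invoke(program, **params):
    """Single parametric method: the first parameter names the program."""
    if program not in PROGRAMS:
        raise ValueError(f"unknown program: {program}")
    unknown = set(params) - set(PROGRAMS[program])
    if unknown:
        raise TypeError(f"{program} does not accept: {sorted(unknown)}")
    argv = [program]
    for name in PROGRAMS[program]:
        if name in params:
            argv += [f"-{name}", str(params[name])]
    return argv  # the command line that would be run


def make_service(program):
    """One service per program, each with a single method."""
    def run(**params):
        return invoke(program, **params)
    run.__name__ = program
    return run


SERVICES = {name: make_service(name) for name in PROGRAMS}
```

With `invoke` there is one thing to register and nothing for discovery to distinguish; with `SERVICES` each program gets its own entry that can carry its own description.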
myGrid
Classes of Service
Workflow specifications
Discovery
Select instances
Instantiate
Workflow enactment
Invoked instance
Execution
registry
Registry?
myGrid
Invocation
Classes of Service
Workflow specifications
Discovery
Select instances
Discovery & Instantiate
Workflow enactment
Invoked instance
Execution
Monitor
Terminate
Registry
Registry?
myGrid
Phases
Classes of Service
Workflow specifications
Discovery
Select instances
Discovery & Instantiate
Workflow enactment
Invoked instance
Execution
Monitor
Terminate
• Support for middleware to perform tasks such as substitution, data transformation between services, and automatic invocation of services where the invocation model is not simple.
• A complex model to explicitly describe every implementation detail of the service or a binding to it.
• Analogous to the DAML-S process model and grounding.
myGrid
Invocation models
• bioMoby forces services to have a single operation that completely encompasses the single task the service supports.
• Each task may be in turn supported by a single operation
• In Soaplab there is no one-to-one mapping between a single task and a single operation.
• Can repurpose a service to be presented multiple times
– a different wrapper for every view
– Proliferation of views
– Makes discovery easier
– Reasoning that it’s the same service as one running
myGrid
Soaplab version of matcher: alignment_local::matcher::derived (wsdl)
createEmptyJob, get_detailed_status, get_report, get_outfile, set_gappenalty, set_sbegin1, set_sbegin2, set_send1, set_send2, set_sformat1, set_sformat2, set_slower1, set_slower2, set_snucleotide1, set_snucleotide2, set_sprotein1, set_sprotein2, set_sreverse1, set_sreverse2, set_supper1, set_supper2, set_datafile_direct_data, set_datafile_url, set_sequencea_direct_data, set_sequencea_usa, set_sequenceb_direct_data, set_sequenceb_usa, set_gaplength, set_alternatives, run, destroy, getStatus, describe, getInputSpec, getResultSpec, getAnalysisType, createJob, runNotifiable, createAndRun, createAndRunNotifiable, waitFor, runAndWaitFor, getResults, terminate, getLastEvent, getNotificationDescriptor, getCreated, getStarted, getEnded, getElapsed, getCharacteristics, getSomeResults, ......
myGrid
Coordinating EMBOSS through Soaplab - WSFL
WorkflowEngine
WSFL
for each task:
• createJob(inputs:Map)
• run(...)
• waitFor(...)
• getResults(...)
• destroy(...)
myGrid
Coordinating EMBOSS through Soaplab - Scufl
WorkflowEngine
Scufl
for each task:
• run(operation, inputs)
Soaplab plugin
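The difference between the two coordination styles is where the five-call job lifecycle lives: in the workflow itself (WSFL) or behind a plugin that exposes a single `run(operation, inputs)` (Scufl). The sketch below uses an in-memory stand-in rather than a real Soaplab endpoint; `FakeSoaplabService`, the `ENDPOINTS` table, and the result format are assumptions, and only the method names are taken from the slides.

```python
import itertools


class FakeSoaplabService:
    """In-memory stand-in exposing the five lifecycle calls from the WSFL slide."""

    def __init__(self, analysis):
        self._analysis = analysis
        self._ids = itertools.count(1)
        self._jobs = {}
        self.calls = []  # records the order of calls, for inspection

    def createJob(self, inputs):
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"inputs": dict(inputs), "state": "CREATED"}
        self.calls.append("createJob")
        return job_id

    def run(self, job_id):
        self._jobs[job_id]["state"] = "COMPLETED"  # the stub finishes at once
        self.calls.append("run")

    def waitFor(self, job_id):
        self.calls.append("waitFor")
        return self._jobs[job_id]["state"]

    def getResults(self, job_id):
        self.calls.append("getResults")
        inputs = self._jobs[job_id]["inputs"]
        return {"report": f"{self._analysis} ran with inputs {sorted(inputs)}"}

    def destroy(self, job_id):
        self.calls.append("destroy")
        del self._jobs[job_id]


ENDPOINTS = {"matcher": FakeSoaplabService("matcher")}


def run(operation, inputs):
    """Plugin-style call: one operation hides the whole job lifecycle."""
    service = ENDPOINTS[operation]
    job_id = service.createJob(inputs)
    try:
        service.run(job_id)
        service.waitFor(job_id)
        return service.getResults(job_id)
    finally:
        service.destroy(job_id)  # always clean up the server-side job
```

A workflow author using the plugin writes one step per task, while the WSFL version would need five steps per task plus the wiring between them.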
myGrid
Does the user ever see this?
• If the user never has to deal with the invocation model
– The DAML-S approach of splitting the information between two descriptions seems plausible.
– Once the user has used the simpler profile, the middleware gets to work on the more complex process model and binding, or a myGrid workflow to actually translate the task into concrete service operation calls.
• If the user does want to know what is going to happen
– A more unified model with views for user and middleware seems more appropriate.
– The downside is the cost of implementing the infrastructure to deliver the views.
myGrid
Summary: Views
• Two parallel but slightly redundant descriptions of the service
– one for human discovery and one for middleware.
– what DAML-S does.
OR
• One common model which is complex and supports multiple tasks, but with an extra layer that provides a view to support each specific task
– intermediate representations, reasonables, perspectives, language generation.
• The user sees the term protein sequence even though the underlying concept is far more explicit.
• Transformed into the more complex pattern; the user may be prompted for attributes associated with the parent concept “data” even though the user never explicitly stated this was a kind of data.
• The view approach used in GALEN and GONG.
• The DAML-S profile is probably too complex to present to bioinformatics users.
myGrid
Summary 2: human vs machine views
Human Machine
Human
Machine
Service User
Service provider
UDDI style advertisements Weak semantic descriptions
Rewriting views
Elaborate Semantic descriptions
Simplification views
Syntactic descriptions
Semantic mining
myGrid
Discovery space
Classes and instances
People and machines
Multiple tasks
Third party multiple viewpoints
Abstractions over a single description of a service
Multiple descriptions over a single service
myGrid
Acknowledgements: Luc Moreau, Simon Miles, Keith Decker, Terry Payne, Phil Lord, Chris Wroe, Robert Stevens, Kevin Garwood
http://www.mygrid.org.uk/