(Bio)Web Services at the INB
BioMOBY
Instituto Nacional de Bioinformática
INB Mission
“To generate and apply bioinformatics solutions to needs detected in development and implementation of genomics
and proteomics focused projects”
• To support Bioinformatics and Computational Biology development in Spain
• To collaborate and provide scientific and technical support to national genomics and proteomics projects
• To contribute to the creation and establishment of local Bioinformatics groups with research and services components through the training of bioinformaticians
• To train bioinformaticians for genomics and proteomics research groups
• To develop pure Bioinformatics projects related to the Institute's activities
• To support companies with activity in this sector in Spain
• To internationalize all its activities
INB Structure. A “virtual” institute
Web Services
Making some sense of this
Source: myGrid
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt 12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt 12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct 12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt 12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt 12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt 12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg 12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga 12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc 12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa 12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Or this…
Current practices
[Diagram: Service Consumers reach Service Providers (Service: Application, DB) through Description, Discovery, and Remote Programmatic Access.]
Bioinformatics Integration: State of the Art
A Web Page is the de facto standard
Discovery:
• Word of mouth
• Web directories
• Google
• Paper publications
Description:
• Word of mouth
• Web documentation / examples / tutorials / courses
• Paper publications
Data transfer & Message Format:
• Cut & paste! + Data reformatting
Automation:
• CGI & Bespoke code (ad hoc)
• APIs (normally big Bioinformatics Projects/institutes)
Do you have data?
Do you have tools?
Publish a web page
What is wrong with Web apps.
User side
• How do I find out where services are provided?
• Once I discover a service, how do I use it?
• Input/output data types. How do I take the output of one service and send it to another service?
• How do I use the service from within a program instead of through a form on a web page?
Developer side
• Most developer time is spent on presentation and input/output management. Error handling is mandatory
• Different projects use different formats, rules of use, etc.
Web services can solve the problem
• Central repository
• Well known input/output formats
• No need for user interfaces
A Web service is an interface that describes a collection of operations that are network accessible through standardized XML messaging*
*Web Services Conceptual Architecture, Heather Kreger, IBM Software Group, 2001
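Web services model
[Diagram: Publish / Find / Bind triangle. The Service Provider holds the Service and its Service Description and publishes the description to the Service Registry (WSDL, UDDI). The Service Requestor finds Service Descriptions in the registry (WSDL, UDDI) and binds to the Service Provider.]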
However…
• Web services alone don’t help the situation much, since…
  • A bioinformatics service that consumes a “string” might be expecting a FASTA sequence, or a keyword…?
• Bioinformatics has many different ‘strings’!
• In bioinformatics, Web Service registries merely catalogue the chaos!
• Semantics rather than structure is necessary
Our audience
• Information is distributed
• MOST data never makes it off the scientist’s hard drive
  • This data should be added to the global scientific archive
• Biologists, by and large, are willing and able, but…
• The Web was embraced enthusiastically by biologists
  • In fact, most wet labs run a website!
  • Unfortunately, this only adds to the chaos…
The interoperability solution must be simple enough for a Biologist, with a little bit of computer
knowledge, to implement on their own
BioMOBY
From MOBY-DIC (Model Organisms, Bring Your own Database Interface
Conference)
http://www.biomoby.org
BioMOBY – Scope and Definition
• OBJECTIVES
  • Study how to address the interoperability problems that bioinformatics users of web-accessible resources actually face today, and which factors promote the adoption of new approaches
  • How to balance increasing the potential for interoperability against the likelihood of widespread adoption? I.e. focus on minimizing the barriers to entry into the system, or insist on a set of constraints that will guarantee the usefulness of the system's components
MOBY is a project to develop a web services architecture for bioinformatics
1. Common Syntax
2. Common Semantics
3. Dynamic Discovery
BioMOBY is an international research project involving biological data hosts, biological data service providers, and coders whose aim is to explore various methodologies for biological data representation, distribution, and discovery.
The MOBY plan
• Define data-types commonly used in bioinformatics
• Organize these into an Ontology
• Ontologically define web service inputs and outputs
• Register the inputs and outputs in a “yellow pages”
• Machines can find an appropriate service
• Machines can execute that service unattended
• But users still can understand data types
Define: Semantics
• For a piece of data, its “semantics” are
  • its intention
  • its meaning
  • its raison d’être
  • its context
  • its relationship to other data
MOBY Semantic Typing: Namespaces
• Any identifiable piece of data is an “entity”
• Identifiers fall into particular “Namespaces”
  • NCBI has gi numbers (gi Namespace)
  • GO Terms have accession numbers (GO Namespace)
• Namespaces indicate data’s semantic type.
  • GO:0003476 a Gene Ontology Term
  • gi|163483 a GenBank record
• However, from the Namespace alone we cannot tell whether a record is a protein, RNA, or DNA sequence
• Namespace + ID precisely specifies a data “entity”
• The Namespace is assumed to be sufficiently descriptive of the data’s semantic type that a service provider can define their interface in terms of Namespaces
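A minimal Python sketch of the Namespace + ID idea, using the two identifiers quoted above. The `Entity` class and `parse_identifier` helper are hypothetical names introduced here for illustration; they are not part of any BioMOBY API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A data entity identified by Namespace + ID (illustrative only)."""
    namespace: str
    id: str


def parse_identifier(text: str) -> Entity:
    """Split identifiers such as 'GO:0003476' or 'gi|163483'."""
    for separator in ("|", ":"):
        if separator in text:
            namespace, _, ident = text.partition(separator)
            return Entity(namespace, ident)
    raise ValueError(f"no namespace separator in {text!r}")


go_term = parse_identifier("GO:0003476")    # a Gene Ontology term
gb_record = parse_identifier("gi|163483")   # a GenBank record
print(go_term)
print(gb_record)
```

The namespace tells a service what kind of thing the ID refers to, but, as noted above, not how the underlying data is represented.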
Define: Syntax
• For a piece of data, its “syntax” is
  • its representation
  • its form
  • its structure
  • its language (of representation)
MOBY Syntactic Typing: The Object Ontology
• Syntactic types are defined by a GO-like ontology
  • Type (“Class”) name at each node
  • Edges define the relationships between Classes
  • GO used as a model because of its comprehensibility & familiarity
• Edges define one of three relationships
  • ISA
    • Inheritance relationship
    • All properties of the parent are present in the child
  • HASA
    • Container relationship of ‘exactly 1’
  • HAS
    • Container relationship with ‘1 or more’
Define: Ontology
• A systematic representation of the entities that exist in a domain of discourse, and the relationships between them.
[Diagram: example ontology. Classes: Child, Father, Mother, Male, Female. Relationships: hasParent, hasGender, partnerOf.]
A portion of the MOBY-S Object Ontology
…community-built!
What’s an “Object”?
• The smallest unit of information that can be passed by MOBY
• Consists simply of
  • Namespace
  • ID
• Thus an Object is nothing more than a “reference” to a data entity
• Ex. <Object id='2KI5' namespace='PDB'/> refers to the 3D structure of a Herpes Virus I thymidine kinase, whereas
• <Object id='KITH_HHV1' namespace='Uniprot'/> refers to its sequence
The Object Ontology: A small slice
• ISA relationships do not necessarily add complexity to objects; sometimes they only add semantics
• Inheritance makes service discovery easier
MOBY objects
A MOBY triple includes a namespace, an ID, and a class:
<Class namespace='...' id='...'/>

A simple MOBY object is just a pointer to data to be retrieved from somewhere:
<Object namespace='NCBI_gi' id='163483'/>

An object may contain data in addition to the namespace and ID:
<Class namespace='...' id='...'>object's data</Class>

The object's data may include XML markup.

A complete MOBY object:
<GenericSequence namespace='NCBI_gi' id='163483' articleName='mySequence'>
  <Integer namespace='' id='' articleName='Length'>975</Integer>
  <String namespace='' id='' articleName='SequenceString'>
    ATGATGCGGCTAGTGATGCTGTCGGCGGCATGATTAGG...
  </String>
</GenericSequence>

articleNames add human-readable semantics to subclasses
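As a rough illustration, an object like the complete MOBY object above can be assembled with Python's standard `xml.etree.ElementTree`. The `moby_object` helper is a hypothetical convenience function, not part of the MOBY API, and the sequence string is only the visible prefix from the slide, so it is shorter than the stated length of 975.

```python
import xml.etree.ElementTree as ET


def moby_object(class_name, namespace="", ident="", article_name=None, text=None):
    """Build one element carrying the namespace/id pair (hypothetical helper)."""
    attributes = {"namespace": namespace, "id": ident}
    if article_name is not None:
        attributes["articleName"] = article_name
    element = ET.Element(class_name, attributes)
    element.text = text
    return element


# Outer object: class GenericSequence, namespace NCBI_gi, id 163483.
sequence = moby_object("GenericSequence", "NCBI_gi", "163483", "mySequence")
# Contained primitives have empty namespace/id and are told apart by articleName.
sequence.append(moby_object("Integer", article_name="Length", text="975"))
sequence.append(
    moby_object(
        "String",
        article_name="SequenceString",
        # Truncated on the slide; only the visible prefix is used here.
        text="ATGATGCGGCTAGTGATGCTGTCGGCGGCATGATTAGG",
    )
)

xml_text = ET.tostring(sequence, encoding="unicode")
print(xml_text)
```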
ISA relationship - inheritance
• Classes become more specialized as you move along the ISA relationship hierarchy
• DNA_Sequence
  • ISA Nucleotide_Sequence
    • ISA Generic_Sequence
      • ISA Virtual_Sequence
        • ISA Object
• Classes do not become more complex as a result of ISA relationships alone
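The ISA chain listed above can be sketched as a simple parent lookup. This is an illustrative toy, assuming a single parent per class; the `lineage` and `is_a` helpers are hypothetical names. It shows why inheritance helps discovery: a service registered for Generic_Sequence can also be offered for a DNA_Sequence.

```python
# Child -> parent edges, copied from the ISA hierarchy on the slide.
ISA = {
    "DNA_Sequence": "Nucleotide_Sequence",
    "Nucleotide_Sequence": "Generic_Sequence",
    "Generic_Sequence": "Virtual_Sequence",
    "Virtual_Sequence": "Object",
}


def lineage(class_name):
    """Return the class followed by all of its ancestors, most specific first."""
    chain = [class_name]
    while chain[-1] in ISA:
        chain.append(ISA[chain[-1]])
    return chain


def is_a(child, ancestor):
    """True if `child` is `ancestor` or inherits from it."""
    return ancestor in lineage(child)


print(lineage("DNA_Sequence"))
print(is_a("DNA_Sequence", "Generic_Sequence"))
```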
HASA & HAS relationships
• HASA and HAS relationships make Classes more complex by embedding Classes within Classes
• Virtual_Sequence ISA Object
• Virtual_Sequence HASA Length (Integer)
• Generic_Sequence ISA Virtual_Sequence
• Generic_Sequence HASA Sequence (String)
• Annotated_GIF ISA Image (base_64_GIF)
• Annotated_GIF HAS Description (String)
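A small sketch of how the HASA and HAS declarations above combine with ISA inheritance: a class's full content is its parent's members plus its own. The dictionaries mirror the slide; the `members` and `cardinality_ok` helpers are hypothetical, and the parent of Image is left unspecified because the slide does not give it.

```python
# ISA edges and container relationships, copied from the bullets above.
ISA = {
    "Virtual_Sequence": "Object",
    "Generic_Sequence": "Virtual_Sequence",
    "Annotated_GIF": "Image",
}
CONTAINS = {
    "Virtual_Sequence": [("HASA", "Length", "Integer")],
    "Generic_Sequence": [("HASA", "Sequence", "String")],
    "Annotated_GIF": [("HAS", "Description", "String")],
}


def members(class_name):
    """Inherited members first, then the members the class declares itself."""
    parent = ISA.get(class_name)
    inherited = members(parent) if parent else []
    return inherited + CONTAINS.get(class_name, [])


def cardinality_ok(relationship, count):
    """HASA means exactly one contained object; HAS means one or more."""
    return count == 1 if relationship == "HASA" else count >= 1


print(members("Generic_Sequence"))
```

Here Generic_Sequence ends up with both Length (inherited from Virtual_Sequence) and Sequence, which matches the complete GenericSequence object shown earlier in the deck.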
Legacy file formats
• Classic bioinformatics “strings” are just embedded into XML
• Binaries are base64 encoded.
<NCBI_Blast_Report namespace='NCBI_gi' id='115325'>
<String articleName='content'>
TBLASTN 2.0.4 [Feb-24-1998]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= gi|1401126 (504 letters)

Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters

Searching...done

                                                                     Score    E
Sequences producing significant alignments:                          (bits)  Value

gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...   1009   0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...     58   4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein                    53   1e-05
gb|U12856|ATU12856 Arabidopsis thaliana Col-0 abscisic acid inse...     53   1e-05
</String>
</NCBI_Blast_Report>
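The two rules for legacy formats (text embedded as-is, binaries base64 encoded) might be sketched as follows in Python. The `wrap_legacy` helper is a hypothetical name, and the four bytes used for the image are only a stand-in for real JPEG data.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_legacy(class_name, namespace, ident, payload):
    """Embed legacy content in a String child; bytes are base64 encoded first."""
    element = ET.Element(class_name, {"namespace": namespace, "id": ident})
    content = ET.SubElement(element, "String", {"articleName": "content"})
    if isinstance(payload, bytes):
        content.text = base64.b64encode(payload).decode("ascii")
    else:
        content.text = payload
    return element


# Text report: embedded unchanged (the XML serializer escapes it when needed).
report = wrap_legacy(
    "NCBI_Blast_Report", "NCBI_gi", "115325",
    "TBLASTN 2.0.4 [Feb-24-1998]\nQuery= gi|1401126 (504 letters)",
)
# Binary payload: placeholder bytes standing in for an image file.
image = wrap_legacy("base64_encoded_jpeg", "TAIR_Image", "3343532", b"\xff\xd8\xff\xe0")

print(ET.tostring(report, encoding="unicode"))
print(ET.tostring(image, encoding="unicode"))
```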
Extending legacy data types
• With legacy data-types defined, we can extend them as we see fit
• annotated_jpeg ISA base64_encoded_jpeg
• annotated_jpeg HASA 2D_Coordinate_set
• annotated_jpeg HASA Description
<annotated_jpeg namespace='TAIR_Image' id='3343532'>
  <String namespace='' id='' articleName='content'>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
  </String>
  <2D_Coordinate_set namespace='' id='' articleName='pixelCoordinates'>
    <Integer namespace='' id='' articleName='x_coordinate'>3554</Integer>
    <Integer namespace='' id='' articleName='y_coordinate'>663</Integer>
  </2D_Coordinate_set>
  <String namespace='' id='' articleName='Description'>This is the phenotype of a ufo-1 mutant under long daylength, 16°C</String>
</annotated_jpeg>
The Object Ontology: Defines an XML Schema!
• The position of an ontology node precisely defines the syntax by which that node will be represented
• End-users can define new data-types without having to write XML Schema!
  • This was an important aim of the project
• A machine can “understand” the structure of any incoming message by querying its ontological type!
The Service Ontology
• A simple ISA hierarchy
• Primitive types include:
  • Analysis
  • Parsing
  • Registration
  • Retrieval
  • Resolution
  • Conversion
Goals achieved
• A common data-type ontology ensures full interoperability of services
• The present ontology, built freely, has low redundancy and covers most bioinformatics entities.
• Currently offered services are small, very specific modules that are easy to interconnect into complex workflows.
  • This has emerged naturally rather than being imposed by the standard!
Web services and workflows
• Common XML-based input/output formats allow several services to be chained to build a logical workflow
• Workflows are stored (in XML of course) and can be run several times
• Workflows can include web services from several providers
[Workflow diagram. Legend: Output, Service, Input/output. AAS = AminoAcidSeq.
Data types: Uniprot ID, PDB ID, String, AAS, PDBText, BLASTText, PMUTText, FSOLVText, FeatureAAS, PropertySeq, Typed Image, PDB Enriched.
Services: getAASfromUniprot, getAASfromPDBId, getPDBFilefromPDBId, parseAASfromPDBText, StringtoAAS, runPSIBlastfromAAS, runPMUTHSfromBlastText, parseFeatureSeqfromPMUTText, parsePropfromPMUTText, plotFeatureAAS, showPMUTonStruc, runFSOLVFromPDBText, showFSOLVonStruc, parseFeatureSeqfromFSOLVText, parsePropfromFSOLVText.]
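Because every service declares the data type it consumes and the type it produces, a chain of services can be found mechanically. The Python sketch below does a breadth-first search over a handful of the service names from the workflow diagram. The input and output types assigned to each service are assumptions inferred from the service names, not taken from the registry, and `find_workflow` is a hypothetical helper.

```python
from collections import deque

# service name -> (consumed type, produced type); types are inferred from the names.
SERVICES = {
    "getAASfromUniprot": ("Uniprot ID", "AAS"),
    "getPDBFilefromPDBId": ("PDB ID", "PDBText"),
    "parseAASfromPDBText": ("PDBText", "AAS"),
    "runPSIBlastfromAAS": ("AAS", "BLASTText"),
    "runPMUTHSfromBlastText": ("BLASTText", "PMUTText"),
}


def find_workflow(source_type, target_type):
    """Return the shortest list of services turning source_type into target_type."""
    queue = deque([(source_type, [])])
    seen = {source_type}
    while queue:
        data_type, path = queue.popleft()
        if data_type == target_type:
            return path
        for name, (consumes, produces) in SERVICES.items():
            if consumes == data_type and produces not in seen:
                seen.add(produces)
                queue.append((produces, path + [name]))
    return None  # no chain of registered services connects the two types


print(find_workflow("Uniprot ID", "PMUTText"))
```

This only matches exact type names; a fuller version would also follow ISA relationships so that a service accepting a parent type is offered for its children.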
Gene detection by homology
Input: Protein Id and DNA genomic sequence
Building of Blast Database from DNA seq.
BLAST Search
Run GeneWise to detect gene structure
Web Services at INB-BSC
BioMOBY Web services offered (146)
• Database retrieval (34)
• Sequence comparison and alignment (46)
• Phylogeny (10)
• Sequence analysis (26)
• Structure analysis (9)
• Data handling and conversion (21)
• Applications covered
  • Blast (24), Fasta (6), Clustal (2), Tcoffee (3), Hmmer (4), Phylip (10), Dali (1), Procheck (1), EMBOSS (25)
http://inb.bsc.es/webservices.php
BioMOBY Central: 254 Object Types, 430 Services
Implementation
XML?
• XML stands for EXtensible Markup Language
• XML is a markup language much like HTML
• XML was designed to describe data
• XML tags are not predefined. You must define your own tags
• XML uses a Document Type Definition (DTD) or an XML Schema to describe the data
• XML can be parsed easily in most programming languages (e.g. the Perl XML::LibXML module)
XML example (from Amazon)
XML Bio example. FASTA sequence
<FASTA id="SRC_HUMAN" namespace="Swiss-Prot">
<Header>SRC_HUMAN (P12931) Proto-oncogene tyrosine-protein kinase Src (EC 2.7.1.112) (p60-Src) (c-Src) (pp60c-src)</Header>
<Sequence>GSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAA
FAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
</Sequence></FASTA>
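To illustrate the claim that XML is easy to parse, here is a short Python sketch that reads a FASTA-style document like the one above with the standard `xml.etree.ElementTree` module. The header and sequence are shortened copies of the slide's example, so the printed length is not the real protein length.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example (first two sequence lines only).
document = """<FASTA id="SRC_HUMAN" namespace="Swiss-Prot">
  <Header>SRC_HUMAN (P12931) Proto-oncogene tyrosine-protein kinase Src</Header>
  <Sequence>
    GSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAA
    FAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSF
  </Sequence>
</FASTA>"""

root = ET.fromstring(document)
header = root.findtext("Header")
# Remove line breaks and indentation so only residues remain.
sequence = "".join(root.findtext("Sequence").split())

print(root.get("namespace"), root.get("id"))
print(header)
print(len(sequence), "residues in this shortened example")
```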
XML Prosite entry
<Prosite_entry id="TYR_PHOSPHO_SITE" namespace="Prosite">
  <Id>TYR_PHOSPHO_SITE</Id>
  <Type>PATTERN</Type>
  <AC>PS00007</AC>
  <DT_Created>APR-1990</DT_Created>
  <DT_Updated>APR-1990</DT_Updated>
  <Info_Update>APR-1990</Info_Update>
  <Description>Tyrosine kinase phosphorylation site</Description>
  <Pattern>[RK]-x(2,3)-[DE]-x(2,3)-Y</Pattern>
  <CC>/TAXO-RANGE=??E?V; /SITE=5,phosphorylation; /SKIP-FLAG=TRUE; /VERSION=1;</CC>
  <DocumentRef>PDOC00007</DocumentRef>
</Prosite_entry>
XML / SOAP / WSDL
• XML is the basic language to transmit data between (bio)web services.
• Additional layers between the communication and data layers are necessary
[Diagram: protocol stacks, top to bottom.
Classical Web transaction: Data layer, HTTP layer, TCP/IP layer.
(Bio)Web service transaction: Data layer, “Bio” layer, SOAP layer, HTTP layer, TCP/IP layer, with a label marking the “XML layers”.]
XML / SOAP / WSDL
• SOAP (Simple Object Access Protocol): a simple XML-based protocol that lets applications exchange information over HTTP.
  • Can use HTTP (mainly) or SMTP as the underlying communication protocol, so it is platform and language independent
• WSDL (Web Services Description Language): an XML-based language for describing Web services and how to access them.
  • Provides all the information needed to call a Web service through automatic SOAP requests
Structure of BioMOBY transaction
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" ... >
  <SOAP-ENV:Body>
    <m:ServicioSwissProt xmlns:m="http://biomoby.org/">
      <m:body xsi:type="xsd:string">
        <?xml version='1.0' encoding='UTF-8'?>
        <moby:MOBY xmlns:moby='http://www.biomoby.org/moby-s'>
          <moby:Query>
            <moby:queryInput moby:articleName='' queryID='1'>
              <moby:Simple>
                <Object namespace='' id='ASA'/>
              </moby:Simple>
            </moby:queryInput>
          </moby:Query>
        </moby:MOBY>
      </m:body>
    </m:ServicioSwissProt>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
BioMOBY answer
<SOAP-ENV:Envelope xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" ...>
  <SOAP-ENV:Body>
    <namesp1:ServicioSwissProtResponse xmlns:namesp1="http://biomoby.org/">
      <s-gensym3 xsi:type="xsd:string">
        <?xml version='1.0' encoding='UTF-8'?>
        <moby:MOBY xmlns:moby='http://www.biomoby.org/moby' xmlns='http://www.biomoby.org/moby'>
          <moby:Response moby:authority='not_provided'>
            <moby:queryResponse moby:queryID=''>
              <moby:Simple articleName=''>
                <String namespace='' id=''><![CDATA[Id: "ASA"SWRIISSIEQ KEESRGNEDH VKCIQEYRSK IESELSNICD GILKLLDSCL IPSASAGDSK ....]]></String>
              </moby:Simple>
            </moby:queryResponse>
          </moby:Response>
        </moby:MOBY>
      </s-gensym3>
    </namesp1:ServicioSwissProtResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
** SOAP library used: libsoap 1.0.1
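A client has to dig the result out of the MOBY payload carried inside the SOAP body. The Python sketch below parses a trimmed copy of the answer payload shown above (the XML declaration is omitted and the sequence text is shortened) and collects whatever sits inside each `moby:Simple` element. The `extract_simple_results` helper is a hypothetical name, and the SOAP envelope itself is assumed to have been unwrapped already.

```python
import xml.etree.ElementTree as ET

MOBY_NS = "http://www.biomoby.org/moby"

# Trimmed copy of the answer payload; the SOAP envelope is assumed to be removed.
payload = """<moby:MOBY xmlns:moby='http://www.biomoby.org/moby' xmlns='http://www.biomoby.org/moby'>
  <moby:Response moby:authority='not_provided'>
    <moby:queryResponse moby:queryID=''>
      <moby:Simple articleName=''>
        <String namespace='' id=''><![CDATA[Id: "ASA" SWRIISSIEQ KEESRGNEDH]]></String>
      </moby:Simple>
    </moby:queryResponse>
  </moby:Response>
</moby:MOBY>"""


def extract_simple_results(xml_text):
    """Return (class name, text) for every object found inside a Simple element."""
    root = ET.fromstring(xml_text)
    results = []
    for simple in root.iter(f"{{{MOBY_NS}}}Simple"):
        for child in simple:
            class_name = child.tag.split("}")[-1]  # drop the '{namespace}' prefix
            results.append((class_name, (child.text or "").strip()))
    return results


results = extract_simple_results(payload)
print(results)
```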
The three components of MOBY
MOBY-Central
• Knows about all existing MOBY services
• Ask it for services by type of input, output or keyword
• Returns info on how to connect to a service
Service
• Accepts MOBY requests
• Runs a program on service provider's computer
Client
• Locates a service through MOBY-Central
• Connects to service, sends input data
• Waits for result
• Finds result buried within XML markup
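The division of labour between the three components can be sketched with an in-memory stand-in written in Python. Everything here is hypothetical: `ToyCentral` is not the real MOBY-Central interface, the data type names are illustrative, the service is a plain Python function rather than a SOAP endpoint, and the returned value is a placeholder string rather than a real sequence.

```python
class ToyCentral:
    """In-memory stand-in for a service registry (not the real MOBY-Central API)."""

    def __init__(self):
        self._registry = {}

    def register(self, name, input_type, output_type, handler):
        """Registration phase: a provider announces what its service consumes and produces."""
        self._registry[name] = (input_type, output_type, handler)
        return "OK"

    def find(self, input_type=None, output_type=None):
        """Query phase: list the services matching the requested types."""
        return sorted(
            name
            for name, (consumes, produces, _) in self._registry.items()
            if input_type in (None, consumes) and output_type in (None, produces)
        )

    def connect(self, name):
        """Return what the client needs to call the service (here, the function itself)."""
        return self._registry[name][2]


def fake_get_sequence(uniprot_id):
    """Service side: placeholder result, not real biological data."""
    return f"sequence-for-{uniprot_id}"


central = ToyCentral()
status = central.register(
    "getAminoAcidSequence", "Uniprot ID", "AminoAcidSequence", fake_get_sequence
)
# Client side: locate a service by input type, connect, send input, read the result.
found = central.find(input_type="Uniprot ID")
result = central.connect(found[0])("KITH_HHV1")
print(status, found, result)
```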
MOBY transactions
[Sequence diagram between MOBY Service, MOBY Central and MOBY Client.
Registration Phase: the service sends “Register Service” to MOBY Central, which answers “OK”.
Query Phase: the client sends its Data Object Type and Service Types to MOBY Central and receives the Available Services; it picks a Selected Service, sends a Service Def Request and receives the WSDL.
Transaction Phase: the client sends the Input Data Object to the service and receives the Output Data Object.]
[Diagram: BioMOBY API at the service provider (Moby Service). A SOAP server handles the connection to external users. Incoming path: SOAP packet, MOBY object extraction, MOBY Object, extraction of biological data, Input data, Application. Outgoing path: Application, Output data, building MOBY object, MOBY Object, building SOAP packet, SOAP packet.]
How to use MOBY services: clients
• Programmatic Access – MOBY API (Perl, Java, Python…)
• Web Access – GBrowse browser, INB
• Clients: Bluejay, Eclipse/Haystack, Talisman/Taverna (myGrid)
Target users shown on the slide:
• Expert Bioinformaticians, Developers
• Biologists, Genomic Projects
• Bioinformaticians, Expert Biologists, Genomic Projects
Programmatic access
• Native MOBY APIs in Perl, Java or Python
Developed at INB-BSC
• MOBYLite API
  • Runs on top of Perl MOBY API
  • MOBY datatypes are translated into Perl classes and services into Perl functions
  • API is built automatically from MOBY catalogue
• CommLineMOBY
  • Perl API to run in-house services without need of the SOAP layer
use INB_Ontology;   # Package containing data types
use inb_bsc_es;     # Package containing services from inb.bsc.es

my $id = $ARGV[0];
my $uniprotId = Object->new($id, 'Uniprot');

# Obtaining sequence from Uniprot, an AminoAcidSequence object is created.
my $AASeq = inb_bsc_es::getAminoAcidSequence(
    input => $uniprotId
)->{sequence};

# Running Blastp with error handling. A Blast_text object is created.
my $BlastRep;
eval {
    $BlastRep = inb_bsc_es::runNCBIBlastp(
        sequence => $AASeq
    )->{blast_report};
};
unless ($@) {
    print $BlastRep->content;   # a standard Blast report if no error
} else {
    print "Error: Blast execution failed: $@";
}
MobyLite API
[Diagram: Uniprot ID → getAASequence → runBLAST → BLAST Report]
MOBY Clients
• Gbrowse_moby (M Wilkinson)
  • Browser-style client
• Ahab & Ishmael (B Good, M Wilkinson)
  • “BLAST” & Semantic Web style clients
• PlaNet Locus_View (H Schoof, R Ernst)
  • Aggregator-style client
• Blue-Jay (P Gordon) and Rat Genome Database prototype (S Twigger)
  • Menu-style clients
• MOBY Graphs (M Senger)
  • Auto-workflow discovery tool
• Taverna (T Oinn, M Senger, E Kawas), and MOWserv (INB, Spain)
  • Workflow builder/publisher/execution client
  • Enhanced support for MOBY currently being built
• Eclipse plugins… etc…
MOWServ. INB Client