(Bio)Web Services at the INB


BioMOBY

Instituto Nacional de Bioinformática

INB Mission

“To generate and apply bioinformatics solutions to needs detected in development and implementation of genomics and proteomics focused projects”

• To support Bioinformatics and Computational Biology development in Spain

• To collaborate and provide scientific and technical support to national genomics and proteomics projects

• To contribute to the creation and establishment of local Bioinformatics groups, with research and service components, by training bioinformaticians

• To train bioinformaticians for genomics and proteomics research groups

• To develop pure Bioinformatics projects related to the Institute's activities

• To support companies with activity in this sector in Spain

• To internationalize all its activities

INB Structure. A “virtual” institute

Web Services

Making some sense of this

Source: myGrid

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

Or this…

Current practices

[Diagram labels: Description; Discovery; Remote Programmatic Access; Service Consumers; Service Providers; Service (Application, DB)]

Bioinformatics Integration: State of the Art

A Web Page is the de facto standard

Discovery:
• Word of mouth
• Web directories
• Google
• Paper publications

Description:
• Word of mouth
• Web documentation / examples / tutorials / courses
• Paper publications

Data transfer & Message Format:
• Cut & paste! + Data reformatting

Automation:
• CGI & bespoke code (ad hoc)
• APIs (normally big Bioinformatics projects/institutes)

Do you have data?

Do you have tools?

Publish a web page

What is wrong with Web apps?

User side

• How do I find out where services are provided?

• Once I discover a service, how do I use it?

• Input/output data types. How do I take the output of one service and send it to another service?

• How do I use the service from within a program instead of through a form on a web page?

Developer side

• Most developer time is spent on presentation and input/output management. Error handling is mandatory

• Different projects use different formats, rules of use, etc.

Web services can solve the problem

• Central repository

• Well known input/output formats

• No need for user interfaces

A Web service is an interface that describes a collection of operations that are network accessible through standardized XML messaging*

*Web Services Conceptual Architecture, Heather Kreger, IBM Software Group, 2001

[Diagram: Web services model. The Service Provider publishes its Service Description to the Service Registry (WSDL, UDDI); the Service Requestor finds Service Descriptions in the Registry (WSDL, UDDI) and binds to the Service at the Provider.]

However…

• Web services do not help the situation much, since…
• A bioinformatics service that consumes a “string” might be expecting a FASTA sequence, or a keyword…?

• Bioinformatics has many different ‘strings’!

• In Bioinformatics Web Service registries merely catalogue the chaos!

• Semantics rather than structure is necessary

Our audience

• Information is distributed

• MOST data never makes it off the scientist's hard drive
• This data should be added to the global scientific archive

• Biologists, by and large, are willing and able, but…

• The Web was embraced enthusiastically by biologists
• In fact, most wet labs run a website!
• Unfortunately, this only adds to the chaos…

The interoperability solution must be simple enough for a Biologist, with a little bit of computer knowledge, to implement on their own

BioMOBY

From MOBY-DIC (Model Organisms, Bring Your own Database Interface Conference)

http://www.biomoby.org

BioMOBY – Scope and Definition


• OBJECTIVES
  • Study how to address the interoperability problems that bioinformatics users of web-accessible resources actually face today, and which factors promote the adoption of new approaches
  • How to balance increasing the potential for interoperability against the likelihood of widespread adoption? I.e. focus on minimizing the barriers to entry into the system, or insist on a set of constraints that will guarantee the usefulness of the system's components

MOBY is a project to develop a web services architecture for bioinformatics

1. Common Syntax
2. Common Semantics
3. Dynamic Discovery

BioMOBY is an international research project involving biological data hosts, biological data service providers, and coders whose aim is to explore various methodologies for biological data representation, distribution, and discovery.

The MOBY plan

• Define data-types commonly used in bioinformatics

• Organize these into an Ontology
• Ontologically define web service inputs and outputs
• Register the inputs and outputs in a “yellow pages”

• Machines can find an appropriate service
• Machines can execute that service unattended
• But users can still understand the data types

Define: Semantics

• For a piece of data, its “semantics” are
  • its intention
  • its meaning
  • its raison d'être
  • its context
  • its relationship to other data

MOBY Semantic Typing: Namespaces

• Any identifiable piece of data is an “entity”

• Identifiers fall into particular “Namespaces”
  • NCBI has gi numbers (gi Namespace)
  • GO Terms have accession numbers (GO Namespace)

• Namespaces indicate data’s semantic type.
  • GO:0003476 is a Gene Ontology Term
  • gi|163483 is a GenBank record

• However, the Namespace alone does not tell us whether the record is a protein, RNA, or DNA sequence

• Namespace + ID precisely specifies a data “entity”

• The Namespace is assumed to be sufficiently descriptive of the data’s semantic type that a service provider can define their interface in terms of Namespaces
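The Namespace + ID idea can be illustrated with a minimal Python sketch. It assumes identifiers are written as in the two examples above (`GO:0003476`, `gi|163483`); the `Entity` class, the `parse_identifier` helper and the two service names in `ACCEPTED` are hypothetical and are not part of the MOBY API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A data entity pinned down by Namespace + ID (illustrative class)."""
    namespace: str
    id: str


def parse_identifier(text: str) -> Entity:
    """Split an identifier such as 'GO:0003476' or 'gi|163483'."""
    for separator in (":", "|"):
        if separator in text:
            namespace, _, ident = text.partition(separator)
            return Entity(namespace, ident)
    raise ValueError(f"no namespace found in {text!r}")


# Hypothetical interfaces: each made-up service lists the Namespaces it accepts.
ACCEPTED = {
    "describeGOTerm": {"GO"},
    "fetchGenBankRecord": {"gi"},
}


def services_for(entity: Entity) -> list:
    """Return the services whose interface covers the entity's Namespace."""
    return sorted(name for name, spaces in ACCEPTED.items()
                  if entity.namespace in spaces)


print(services_for(parse_identifier("gi|163483")))
```

Because the interface is declared purely in terms of Namespaces, a provider does not need to inspect the ID itself to decide whether a request is one it can serve.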

Define: Syntax

• For a piece of data, its “syntax” is
  • its representation
  • its form
  • its structure
  • its language (of representation)

MOBY Syntactic Typing: The Object Ontology

• Syntactic types are defined by a GO-like ontology
  • Type (“Class”) name at each node
  • Edges define the relationships between Classes
  • GO used as a model because of its comprehensibility & familiarity

• Edges define one of three relationships
  • ISA
    • Inheritance relationship
    • All properties of the parent are present in the child
  • HASA
    • Container relationship of ‘exactly 1’
  • HAS
    • Container relationship with ‘1 or more’

Define: Ontology

• A systematic representation of the entities that exist in a domain of discourse, and the relationships between them.

[Example ontology diagram. Classes: Child, Father, Mother, Female, Male. Relationships: hasParent, hasGender, partnerOf.]

A portion of the MOBY-S Object Ontology

…community-built!

What’s an “Object”?

• The smallest unit of information that can be passed by MOBY

• Consists simply of
  • Namespace
  • ID

• Thus an Object is nothing more than a “reference” to a data entity

• Ex. <Object id='2KI5' namespace='PDB'/> refers to the 3D structure of a Herpes Virus I Thymidine kinase, whereas

• <Object id='KITH_HHV1' namespace='Uniprot'/> refers to its sequence
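A reference Object of this kind can be produced with Python's standard `xml.etree.ElementTree` module. This is only a sketch: `make_object` and `to_xml` are illustrative helper names, not functions of any MOBY library.

```python
import xml.etree.ElementTree as ET


def make_object(namespace: str, ident: str) -> ET.Element:
    """Build a bare MOBY-style Object: a reference made of Namespace + ID."""
    return ET.Element("Object", {"namespace": namespace, "id": ident})


def to_xml(element: ET.Element) -> str:
    """Serialise the element to a string."""
    return ET.tostring(element, encoding="unicode")


# The two references used on this slide.
structure_ref = make_object("PDB", "2KI5")
sequence_ref = make_object("Uniprot", "KITH_HHV1")
print(to_xml(structure_ref))
print(to_xml(sequence_ref))
```

Neither element carries any data: each one only tells a service which entity is meant.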

The Object Ontology: A small slice

• ISA relationships do not necessarily add complexity to objects; sometimes they only add semantics

• Inheritance makes service discovery easier

MOBY objects

A MOBY triple includes a namespace, an ID, and a class:
<Class namespace='...' id='...'/>

A simple MOBY object is just a pointer to data to be retrieved from somewhere:
<Object namespace='NCBI_gi' id='163483'/>

An object may contain data in addition to the namespace and ID:
<Class namespace='...' id='...'>object's data</Class>

The object's data may include XML markup

A complete MOBY object:
<GenericSequence namespace='NCBI_gi' id='163483' articleName='mySequence'>
  <Integer namespace='' id='' articleName='Length'>975</Integer>
  <String namespace='' id='' articleName='SequenceString'>ATGATGCGGCTAGTGATGCTGTCGGCGGCATGATTAGG...</String>
</GenericSequence>

articleName attributes add human-readable semantics to subclasses
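As a rough illustration of how a client might read such an object, the sketch below parses the GenericSequence example with Python's standard library and indexes its members by articleName. The function name `members_by_article_name` is made up for this example, and the sequence string is kept truncated exactly as on the slide.

```python
import xml.etree.ElementTree as ET

# The complete MOBY object from the slide (sequence left truncated as shown).
MOBY_OBJECT = """
<GenericSequence namespace='NCBI_gi' id='163483' articleName='mySequence'>
  <Integer namespace='' id='' articleName='Length'>975</Integer>
  <String namespace='' id='' articleName='SequenceString'>ATGATGCGGCTAGTGATGCTGTCGGCGGCATGATTAGG...</String>
</GenericSequence>
"""


def members_by_article_name(xml_text: str):
    """Return (class, namespace, id, {articleName: text}) for one object."""
    root = ET.fromstring(xml_text.strip())
    members = {child.get("articleName"): (child.text or "").strip()
               for child in root}
    return root.tag, root.get("namespace"), root.get("id"), members


cls, namespace, ident, members = members_by_article_name(MOBY_OBJECT)
print(cls, namespace, ident, members["Length"])
```

The class comes from the element name, the entity from namespace and id, and the articleName is what lets the client tell the Length member from the SequenceString member.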

ISA relationship - inheritance

• Classes become more specialized as you move along the ISA relationship hierarchy

• DNA_Sequence
  • ISA Nucleotide_Sequence
    • ISA Generic_Sequence
      • ISA Virtual_Sequence
        • ISA Object

• Classes do not become more complex as a result of ISA relationships alone

HASA & HAS relationships

• HASA and HAS relationships make Classes more complex by embedding Classes within Classes

• Virtual_Sequence ISA Object
• Virtual_Sequence HASA Length (Integer)
• Generic_Sequence ISA Virtual_Sequence
• Generic_Sequence HASA Sequence (String)

• Annotated_GIF ISA Image (base_64_GIF)
• Annotated_GIF HAS Description (String)
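The sequence relationships listed in this deck can be modelled with two small tables. The sketch below is an in-memory illustration only, not how MOBY Central stores its ontology; the dictionary layout and the helper names `lineage`, `is_a` and `members` are assumptions.

```python
# ISA edges: child class -> parent class (taken from the slides).
ISA = {
    "Virtual_Sequence": "Object",
    "Generic_Sequence": "Virtual_Sequence",
    "Nucleotide_Sequence": "Generic_Sequence",
    "DNA_Sequence": "Nucleotide_Sequence",
}

# HASA edges: class -> list of (articleName, member class), exactly one each.
HASA = {
    "Virtual_Sequence": [("Length", "Integer")],
    "Generic_Sequence": [("Sequence", "String")],
}


def lineage(cls: str) -> list:
    """Follow ISA edges from a class up to the root of the ontology."""
    chain = [cls]
    while cls in ISA:
        cls = ISA[cls]
        chain.append(cls)
    return chain


def is_a(cls: str, ancestor: str) -> bool:
    """True if cls equals ancestor or inherits from it."""
    return ancestor in lineage(cls)


def members(cls: str) -> list:
    """All HASA members of a class, inherited ones first."""
    result = []
    for ancestor in reversed(lineage(cls)):
        result.extend(HASA.get(ancestor, []))
    return result


print(members("DNA_Sequence"))
print(is_a("DNA_Sequence", "Generic_Sequence"))
```

In this model DNA_Sequence gains no members of its own through ISA, yet a service registered for Generic_Sequence input can be discovered for it, which is how inheritance helps discovery.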

Legacy file formats

• Classic bioinformatics “strings” are just embedded into XML
• Binaries are base64 encoded.

<NCBI_Blast_Report namespace='NCBI_gi' id='115325'>
<String articleName='content'>
TBLASTN 2.0.4 [Feb-24-1998]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query= gi|1401126 (504 letters)

Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters

Searching done

                                                                     Score    E
Sequences producing significant alignments:                         (bits)  Value

gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...   1009  0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...     58  4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein                    53  1e-05
gb|U12856|ATU12856 Arabidopsis thaliana Col-0 abscisic acid inse...     53  1e-05
</String>
</NCBI_Blast_Report>

Extending legacy data types

• With legacy data-types defined, we can extend them as we see fit

• annotated_jpeg ISA base64_encoded_jpeg
• annotated_jpeg HASA 2D_Coordinate_set
• annotated_jpeg HASA Description

<annotated_jpeg namespace='TAIR_Image' id='3343532'>
  <String namespace='' id='' articleName='content'>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
Av4wggJnoAMCAQICAwhH9jANBgkqhkiG9w0BAQQFADCBkjELMAkGA1UEBhMCWkExFTATBgNV
  </String>
  <2D_Coordinate_set namespace='' id='' articleName='pixelCoordinates'>
    <Integer namespace='' id='' articleName='x_coordinate'>3554</Integer>
    <Integer namespace='' id='' articleName='y_coordinate'>663</Integer>
  </2D_Coordinate_set>
  <String namespace='' id='' articleName='Description'>This is the phenotype of a ufo-1 mutant under long daylength, 16°C</String>
</annotated_jpeg>
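The base64 step for binaries can be sketched in a few lines of Python. The element and attribute names follow the example above; the helper names `wrap_binary` and `unwrap_binary` and the placeholder bytes are assumptions made for illustration, not real image data or MOBY API calls.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(class_name: str, namespace: str, ident: str,
                payload: bytes) -> ET.Element:
    """Embed binary data as base64 text in a 'content' String member."""
    root = ET.Element(class_name, {"namespace": namespace, "id": ident})
    content = ET.SubElement(
        root, "String", {"namespace": "", "id": "", "articleName": "content"})
    content.text = base64.b64encode(payload).decode("ascii")
    return root


def unwrap_binary(root: ET.Element) -> bytes:
    """Recover the original bytes from the 'content' member."""
    for child in root:
        if child.get("articleName") == "content":
            return base64.b64decode(child.text)
    raise ValueError("object has no 'content' member")


# Placeholder bytes standing in for a real JPEG file.
fake_image = b"\xff\xd8\xff\xe0 placeholder, not a real image"
wrapped = wrap_binary("annotated_jpeg", "TAIR_Image", "3343532", fake_image)
assert unwrap_binary(wrapped) == fake_image
```

Base64 keeps the payload within plain ASCII text, so it can travel inside an XML element without escaping problems, at the cost of roughly a third more bytes.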

• The position of an ontology node precisely defines the syntax by which that node will be represented

• End-users can define new data-types without having to write XML Schema!
  • This was an important aim of the project

• A machine can “understand” the structure of any incoming message by querying its ontological type!

The Object Ontology: Defines an XML Schema!

The Service Ontology

• A simple ISA hierarchy

• Primitive types include:
  • Analysis
  • Parsing
  • Registration
  • Retrieval
  • Resolution
  • Conversion

Goals achieved

• A common data-type ontology ensures full interoperability of services

• The present ontology, built freely, has low redundancy and covers most bioinformatics entities.

• Currently, the services offered are small, very specific modules that are easy to interconnect to build complex workflows.
  • This has happened naturally rather than being imposed by the standard!

Web services and workflows

• Common XML-based input/output formats allow several services to be chained to build a logical workflow

• Workflows are stored (in XML of course) and can be run several times

• Workflows can include web services from several providers

[Workflow diagram. Legend: Output; Service; Input/output. AAS: AminoAcidSeq.
Data types: Uniprot ID, PDB ID, String, AAS, PDBText, BLASTText, PMUTText, FSOLVText, FeatureAAS, PropertySeq, Typed Image, PDB Enriched.
Services: getAASfromUniprot, getAASfromPDBId, getPDBFilefromPDBId, parseAASfromPDBText, StringtoAAS, runPSIBlastfromAAS, runPMUTHSfromBlastText, parseFeatureSeqfromPMUTText, parsePropfromPMUTText, plotFeatureAAS, showPMUTonStruc, runFSOLVFromPDBText, parseFeatureSeqfromFSOLVText, parsePropfromFSOLVText, showFSOLVonStruc.]
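Because every service declares typed inputs and outputs, a chain of services can in principle be found by searching over data types. The Python sketch below uses a breadth-first search; the service names appear in the workflow figure, but the input and output types assigned to them here are guesses from the names, and `find_workflow` is a hypothetical helper rather than part of any MOBY client.

```python
from collections import deque

# service name -> (input type, output type); types are inferred, not confirmed.
SERVICES = {
    "getAASfromUniprot": ("Uniprot ID", "AAS"),
    "getPDBFilefromPDBId": ("PDB ID", "PDBText"),
    "parseAASfromPDBText": ("PDBText", "AAS"),
    "runPSIBlastfromAAS": ("AAS", "BLASTText"),
    "runPMUTHSfromBlastText": ("BLASTText", "PMUTText"),
}


def find_workflow(start: str, goal: str):
    """Breadth-first search for a shortest chain of services from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        data_type, path = queue.popleft()
        if data_type == goal:
            return path
        for name, (source, target) in SERVICES.items():
            if source == data_type and target not in seen:
                seen.add(target)
                queue.append((target, path + [name]))
    return None  # no chain of registered services connects the two types


print(find_workflow("PDB ID", "PMUTText"))
```

A real workflow tool would also have to follow ISA relationships between data types and handle services with several inputs, which this sketch ignores.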


Gene detection by homology

Input: Protein Id and DNA genomic sequence

Building of Blast Database from DNA seq.

BLAST Search

Run GeneWise to detect gene structure

Web Services at INB-BSC

BioMOBY Web services offered (146)

• Database retrieval (34)
• Sequence comparison and alignment (46)
• Phylogeny (10)
• Sequence analysis (26)
• Structure analysis (9)
• Data handling and conversion (21)

• Applications covered
  • Blast (24), Fasta (6), Clustal (2), Tcoffee (3), Hmmer (4), Phylip (10), Dali (1), Procheck (1), EMBOSS (25)

http://inb.bsc.es/webservices.php

BioMOBY Central: 254 Object Types, 430 Services



Implementation

XML?

• XML stands for EXtensible Markup Language
• XML is a markup language much like HTML
• XML was designed to describe data
• XML tags are not predefined. You must define your own tags
• XML uses a Document Type Definition (DTD) or an XML Schema to describe the data
• XML can be parsed easily in most programming languages (e.g. the Perl XML::LibXML module)

XML example (from Amazon)

XML Bio example. FASTA sequence

<FASTA id="SRC_HUMAN" namespace="Swiss-Prot">
<Header>SRC_HUMAN (P12931) Proto-oncogene tyrosine-protein kinase Src (EC 2.7.1.112) (p60-Src) (c-Src) (pp60c-src)</Header>
<Sequence>GSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAA

FAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL

</Sequence></FASTA>
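To illustrate how easily such a record can be parsed, here is a minimal Python sketch that turns the XML above back into classic FASTA text using only the standard library. The function name `xml_to_fasta` is made up, and the example input is deliberately shortened to the first residues of the sequence.

```python
import textwrap
import xml.etree.ElementTree as ET


def xml_to_fasta(xml_text: str, width: int = 60) -> str:
    """Convert a <FASTA><Header/><Sequence/></FASTA> record to FASTA text."""
    root = ET.fromstring(xml_text)
    # Collapse any line breaks that the XML layout introduced.
    header = " ".join(root.findtext("Header", default="").split())
    sequence = "".join(root.findtext("Sequence", default="").split())
    lines = textwrap.wrap(sequence, width)
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Shortened example: only the start of the SRC_HUMAN sequence is included.
EXAMPLE = (
    '<FASTA id="SRC_HUMAN" namespace="Swiss-Prot">'
    "<Header>SRC_HUMAN (P12931)</Header>"
    "<Sequence>GSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAA</Sequence>"
    "</FASTA>"
)
print(xml_to_fasta(EXAMPLE, width=30))
```

The tags make the header and the sequence unambiguous, so the parser never has to guess where one ends and the other begins.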

XML Prosite entry

<Prosite_entry id="TYR_PHOSPHO_SITE" namespace="Prosite">
  <Id>TYR_PHOSPHO_SITE</Id>
  <Type>PATTERN</Type>
  <AC>PS00007</AC>
  <DT_Created>APR-1990</DT_Created>
  <DT_Updated>APR-1990</DT_Updated>
  <Info_Update>APR-1990</Info_Update>
  <Description>Tyrosine kinase phosphorylation site</Description>
  <Pattern>[RK]-x(2,3)-[DE]-x(2,3)-Y</Pattern>
  <CC>/TAXO-RANGE=??E?V; /SITE=5,phosphorylation; /SKIP-FLAG=TRUE; /VERSION=1;</CC>
  <DocumentRef>PDOC00007</DocumentRef>
</Prosite_entry>

XML / SOAP / WSDL

• XML is the basic language to transmit data between (bio)web services.

• Additional layers are needed between the communication and data layers

Classical Web transaction (top to bottom):
• Data layer
• HTTP layer
• TCP/IP layer

(Bio)Web service transaction (top to bottom):
• Data layer (XML)
• “Bio” layer (XML)
• SOAP layer (XML)
• HTTP layer
• TCP/IP layer

XML / SOAP / WSDL

• SOAP (Simple Object Access Protocol): a simple XML-based protocol that lets applications exchange information over HTTP.
  • It can use HTTP (mainly) or SMTP as the underlying communication protocol, so it is platform and language independent

• WSDL (Web Services Description Language): an XML-based language for describing Web services and how to access them.
  • It provides all the information needed to call a Web service through automatic SOAP requests

Structure of BioMOBY transaction

<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" ... >
  <SOAP-ENV:Body>
    <m:ServicioSwissProt xmlns:m="http://biomoby.org/">
      <m:body xsi:type="xsd:string">
        <?xml version='1.0' encoding='UTF-8'?>
        <moby:MOBY xmlns:moby='http://www.biomoby.org/moby-s'>
          <moby:Query>
            <moby:queryInput moby:articleName='' queryID='1'>
              <moby:Simple>
                <Object namespace='' id='ASA'/>
              </moby:Simple>
            </moby:queryInput>
          </moby:Query>
        </moby:MOBY>
      </m:body>
    </m:ServicioSwissProt>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>


BioMOBY answer

<SOAP-ENV:Envelope xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" ...>
  <SOAP-ENV:Body>
    <namesp1:ServicioSwissProtResponse xmlns:namesp1="http://biomoby.org/">
      <s-gensym3 xsi:type="xsd:string">
        <?xml version='1.0' encoding='UTF-8'?>
        <moby:MOBY xmlns:moby='http://www.biomoby.org/moby' xmlns='http://www.biomoby.org/moby'>
          <moby:Response moby:authority='not_provided'>
            <moby:queryResponse moby:queryID=''>
              <moby:Simple articleName=''>
                <String namespace='' id=''><![CDATA[Id: "ASA"
SWRIISSIEQ KEESRGNEDH VKCIQEYRSK IESELSNICD GILKLLDSCL IPSASAGDSK ....]]></String>
              </moby:Simple>
            </moby:queryResponse>
          </moby:Response>
        </moby:MOBY>
      </s-gensym3>
    </namesp1:ServicioSwissProtResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
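On the client side, the result has to be dug out of the MOBY markup once the SOAP layer has handed over the string payload. The Python sketch below assumes that unwrapping has already happened and works on a shortened copy of the MOBY response shown above; `extract_simple_payloads` is an illustrative name, not a MOBY API call.

```python
import xml.etree.ElementTree as ET

MOBY_NS = "http://www.biomoby.org/moby"

# Shortened copy of the MOBY payload carried inside the SOAP response.
RESPONSE = b"""<?xml version='1.0' encoding='UTF-8'?>
<moby:MOBY xmlns:moby='http://www.biomoby.org/moby' xmlns='http://www.biomoby.org/moby'>
  <moby:Response moby:authority='not_provided'>
    <moby:queryResponse moby:queryID=''>
      <moby:Simple articleName=''>
        <String namespace='' id=''><![CDATA[Id: "ASA" SWRIISSIEQ KEESRGNEDH]]></String>
      </moby:Simple>
    </moby:queryResponse>
  </moby:Response>
</moby:MOBY>"""


def extract_simple_payloads(xml_bytes: bytes) -> list:
    """Return (class name, text) for every object inside a moby:Simple."""
    root = ET.fromstring(xml_bytes)
    payloads = []
    for simple in root.iter("{%s}Simple" % MOBY_NS):
        for obj in simple:
            class_name = obj.tag.split("}")[-1]  # drop the namespace prefix
            payloads.append((class_name, (obj.text or "").strip()))
    return payloads


print(extract_simple_payloads(RESPONSE))
```

The CDATA section is delivered as ordinary element text, so the legacy flat-file content comes back unchanged once the XML wrapper is removed.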

** SOAP library used: libsoap 1.0.1


The three components of MOBY

MOBY-Central
• Knows about all existing MOBY services
• Ask it for services by type of input, output or keyword
• Returns info on how to connect to a service

Service
• Accepts MOBY requests
• Runs a program on the service provider's computer

Client
• Locates a service through MOBY-Central
• Connects to the service, sends input data
• Waits for the result
• Finds the result buried within XML markup
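The division of labour between the three components can be mimicked in memory with a few lines of Python. Everything here is a toy: `MobyCentralSketch`, `toy_string_to_aas` and `run_client` are invented names, there is no network, SOAP or XML involved, and the real MOBY-Central interface is considerably richer.

```python
class MobyCentralSketch:
    """In-memory stand-in for the registry role of MOBY-Central."""

    def __init__(self):
        self._services = {}

    def register(self, name, input_type, output_type, endpoint):
        """Registration: a provider announces what its service consumes and produces."""
        self._services[name] = (input_type, output_type, endpoint)
        return "OK"

    def find(self, input_type=None, output_type=None):
        """Query: list services matching the requested input/output types."""
        return sorted(
            name
            for name, (in_t, out_t, _) in self._services.items()
            if input_type in (None, in_t) and output_type in (None, out_t)
        )

    def endpoint(self, name):
        """Return how to reach a service (here simply a Python callable)."""
        return self._services[name][2]


def toy_string_to_aas(text):
    """Toy service: strip whitespace and upper-case a raw sequence string."""
    return "".join(text.split()).upper()


def run_client(central, input_type, data):
    """Client: locate a service through the registry, call it, return the result."""
    names = central.find(input_type=input_type)
    if not names:
        raise LookupError(f"no service accepts {input_type}")
    return central.endpoint(names[0])(data)


central = MobyCentralSketch()
central.register("toyStringToAAS", "String", "AminoAcidSequence", toy_string_to_aas)
print(run_client(central, "String", "swri issi"))
```

The client never hard-codes the service name: it asks the registry by data type, which is the behaviour the registration, query and transaction phases are meant to enable.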

MOBY transactions

[Diagram: MOBY transactions between MOBY Service, MOBY Central and MOBY Client.
Registration Phase: Register Service; OK.
Query Phase: Data Object Type, Service Types; Available Services; Selected Service; Service Def Request; WSDL.
Transaction Phase: Input Data Object; Output Data Object.
Both the Service and the Client sides are labelled DATA.]

Moby Service

[Diagram: structure of a MOBY service. Connection to external users goes through a SOAP server.
Incoming: SOAP packet; MOBY object extraction; MOBY Object; extraction of biological data; input data; Application.
Outgoing: Application; output data; building MOBY object; MOBY Object; building SOAP packet; SOAP packet.
Other labels: BioMOBY API; Service provider.]

How to use MOBY services: clients

• Programmatic Access – MOBY API (Perl, Java, Python…)

• Web Access – GBrowse browser, INB

• Clients:
  • Bluejay, Eclipse/Haystack, Talisman/Taverna (myGrid)

• Expert Bioinformaticians
• Developers

• Biologists
• Genomic Projects

• Bioinformaticians
• Expert Biologists
• Genomic Projects

Programmatic access

• Native MOBY APIs in Perl, Java or Python

Developed at INB-BSC

• MOBYLite API
  • Runs on top of the Perl MOBY API
  • MOBY datatypes are translated into Perl classes and services into Perl functions
  • The API is built automatically from the MOBY catalogue

• CommLineMOBY
  • Perl API to run in-house services without the need for the SOAP layer

use INB_Ontology;   # Package containing data types
use inb_bsc_es;     # Package containing services from inb.bsc.es

my $id = $ARGV[0];
my $uniprotId = Object->new($id, 'Uniprot');

# Obtaining sequence from Uniprot; an AminoAcidSequence object is created.
my $AASeq = inb_bsc_es::getAminoAcidSequence(
    input => $uniprotId
)->{sequence};

# Running Blastp with error handling. A Blast_text object is created.
my $BlastRep;
eval {
    $BlastRep = inb_bsc_es::runNCBIBlastp(
        sequence => $AASeq
    )->{blast_report};
};

unless ($@) {
    print $BlastRep->content;   # a standard Blast report if no error
} else {
    print "Error: Blast execution failed: $@";
}

MobyLite API

[Diagram: Uniprot ID → getAASequence → runBLAST → BLAST Report]

MOBY Clients

• Gbrowse_moby (M Wilkinson)
  • Browser-style client
• Ahab & Ishmael (B Good, M Wilkinson)
  • “BLAST” & Semantic Web style clients
• PlaNet Locus_View (H Schoof, R Ernst)
  • Aggregator-style client
• Blue-Jay (P Gordon) and Rat Genome Database prototype (S Twigger)
  • Menu-style clients
• MOBY Graphs (M Senger)
  • Auto-workflow discovery tool
• Taverna (T Oinn, M Senger, E Kawas), and MOWserv (INB, Spain)
  • Workflow builder/publisher/execution client
  • Enhanced support for MOBY currently being built
• Eclipse plugins… etc…

MOWServ. INB Client
