
WP6 Part 1: Bioinformatics

• First results passed peer review

• Working on more extensive proteomics knowledge sharing

• Library of existing services collated

• Library of LCC experiment protocols underway

Presenters: Xueping Quan, Marco Schorlemmer, Dave Robertson

OK From an Experimenter’s Viewpoint

• Interaction model = Experiment design

– Experimental roles allocated to peers

– Constraints prescribe methods on peers

– Message passing synchronises tasks

• Formal model gives:

– Automation, extending experiment repertoire

– Repeatability, because we preserve state

– Scrutiny, for reviewers

P2P Proteomics

The proteome is the protein equivalent of the genome

Proteomics studies the quantitative changes occurring in a proteome and their application to

– disease diagnostics

– therapy

– drug development

Peer-to-Peer Experimentation in Protein Structure Prediction: an Architecture, Experiment and Initial Results

Experiment - Consistency Checking

• Taking a non-expert user’s perspective…

Applied Bioinformatics - Whom to believe??

• Note: This Scenario needs to allow for “passive” peers to incorporate knowledge from the large number of traditional bioinformatics resources (databases etc.)

Comparison of server results for consistency typically increases confidence in the result.

Experiment – “Consistency Checking”

Step 1: One proxy per service allows data to be retrieved from “passive” peers.

Each query is relayed to the appropriate service.

[Figure: a query (input, keyword, ID, sequence, etc.) passes through proxies (wrappers) and interfaces (WSDL, etc.) to an application, database or web server, which returns the data relating to the input.]
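The proxy idea in Step 1 can be sketched as follows. This is a minimal illustration, not the project's wrapper code: the class `ServiceProxy`, the function `route`, the service names and the stand-in backends are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ServiceProxy:
    """Wraps one passive resource behind a uniform query interface."""

    name: str
    fetch: Callable[[str], str]  # hypothetical backend call (web form, database, ...)

    def query(self, key: str) -> dict:
        # Tag the data with its source so the caller knows where it came from.
        return {"source": self.name, "input": key, "data": self.fetch(key)}


def route(proxies: Dict[str, ServiceProxy], service: str, key: str) -> dict:
    """Relay a query to the proxy registered for the named service."""
    return proxies[service].query(key)


# Stand-in backends for illustration only; real wrappers would contact the resource.
proxies = {
    "service_a": ServiceProxy("service_a", lambda key: f"record-a:{key}"),
    "service_b": ServiceProxy("service_b", lambda key: f"record-b:{key}"),
}
result = route(proxies, "service_b", "YPL132W")
print(result)
```

Keeping the source name next to the returned data is one simple way to preserve provenance for the later comparison step.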

Experiment – “Consistency Checking”

Local database of trusted results with provenance

Polling multiple sites

Step 2: Automated harvesting of results for the targets, and collation to allow easy comparison of answers. The scientist logs a local opinion on the relative quality of the other (passive) peers for each target and caches the most important positive and/or negative results.
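A rough sketch of Step 2, assuming each source can be modelled as a lookup function that returns `None` when it has no answer. The names `harvest` and `LocalStore`, the quality score and the site names are illustrative assumptions, not part of the described system.

```python
from collections import defaultdict


def harvest(targets, sources):
    """Poll every source for every target and collate the answers per target."""
    collated = defaultdict(dict)
    for target in targets:
        for name, lookup in sources.items():
            answer = lookup(target)
            if answer is not None:  # a source may have nothing for this target
                collated[target][name] = answer
    return dict(collated)


class LocalStore:
    """Local opinions on sources plus a cache of trusted results with provenance."""

    def __init__(self):
        self.opinions = {}  # (target, source) -> quality score chosen by the scientist
        self.cache = []     # cached results, each tagged with its source

    def log_opinion(self, target, source, quality):
        self.opinions[(target, source)] = quality

    def cache_result(self, target, source, answer, positive):
        self.cache.append(
            {"target": target, "source": source, "answer": answer, "positive": positive}
        )


# Stand-in sources: plain dictionaries instead of remote services.
sources = {
    "site_1": {"YPL132W": "model-1"}.get,
    "site_2": {"YPL132W": "model-2", "YBR024W": "model-3"}.get,
}
collated = harvest(["YPL132W", "YBR024W"], sources)

store = LocalStore()
store.log_opinion("YPL132W", "site_2", quality=0.9)
store.cache_result("YPL132W", "site_2", collated["YPL132W"]["site_2"], positive=True)
```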

Extend structural knowledge through modelling:

Find fragments of 3D-models of S. cerevisiae (yeast) proteins that can be trusted

• 6604 yeast protein sequences (some predicted)

• currently 330 known 3D-structures (in PDB)

Experiment: Specific Task

(Popular strategy, typically accomplished with the help of a meta-WWW-server today.)

Databases of pre-computed 3D-models

• SWISS (SWISSMODEL): restrictive + non-redundant; high-quality models only

• SAM yeast models (SAM-T06 / UNDERTAKER): “complete” (at least one model per ID) + redundant; raw models

• ModBase (PSI-BLAST / MODELLER): permissive + highly redundant; pre-filtered before the task

Complications – True and False Redundancy

Example 1: highly redundant set

Example 2: multi-domain proteins; “non-redundant” sets (< 90% overlap)

Databases of pre-computed 3D-models

• SWISS: 769 models

• SAM yeast models: 2211 models (selected top model if E-value < 10^-3)

• ModBase: 2546 models (pre-filtered: sequence-id > 20%, score > 0.7, E-value < 10^-6)
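The selection thresholds above can be written down as simple predicates. This is a sketch under stated assumptions: the `Model` record and its field names are hypothetical, and "top model" is assumed here to mean the candidate with the lowest E-value, which the slides do not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Model:
    protein_id: str
    sequence_identity: float  # percent
    score: float              # model score as reported by the source
    e_value: float


def passes_modbase_prefilter(m: Model) -> bool:
    """Thresholds from the slide: sequence-id > 20%, score > 0.7, E-value < 1e-6."""
    return m.sequence_identity > 20.0 and m.score > 0.7 and m.e_value < 1e-6


def select_sam_top_model(models: List[Model]) -> Optional[Model]:
    """Keep the top model only if its E-value is below 1e-3.

    Assumption: "top" means the lowest E-value among the candidates.
    """
    if not models:
        return None
    top = min(models, key=lambda m: m.e_value)
    return top if top.e_value < 1e-3 else None


candidates = [
    Model("YPL132W", 35.0, 0.9, 1e-10),
    Model("YPL132W", 18.0, 0.9, 1e-10),  # fails: sequence identity too low
    Model("YBR024W", 40.0, 0.6, 1e-8),   # fails: score too low
    Model("YLR131C", 50.0, 0.8, 1e-4),   # fails: E-value too high
]
kept = [m for m in candidates if passes_modbase_prefilter(m)]
```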

• multi-agent interaction coordination through service composition

• LCC interpreter

• loosely based on electronic societies (of peers)

• uses WSDL as standard

• For more information please refer to: Xueping Quan, Chris Walton, Dietlind L Gerloff, Joanna L Sharman and Dave Robertson, GCCB2006.

• to be superseded by (more flexible) OK-kernel

Implementation using LCC interpreter


[Figure: architecture diagram. The LCC Interpreter is linked to the SWISS Service, SAM Service, ModBase Service, MaxSub Service and CYSP Service; the data sources shown are SWISS, SAM, ModBase (filtered), MaxSub and CYSP; interface labels are HTML (three) and WSDL (five). Annotations: pair-wise comparison of 3D-protein models; storing “good answers” in local database.]

a(data_collator, X)::
    data_request(Is) <= a(experimenter, E) then
    a(data_collector(Is,Sp,Sd),X) <- yeast_id(Is) and source(Sp) then
    filter(Is,Sp,Sd) => a(data_filter(Is,Sp,Sd),F) then
    filtered(Is,Sp,S) <= a(data_filter(Is,Sp,Sd),F) then
    filtered(Is,Sp,S) => a(data_comparer,C) then
    data_compared(Is,SF) <= a(data_comparer,C) then
    data_compared(Is,SF) => a(experimenter,E) then
    data_compared(Is,SF) => a(data_publisher,PU)

a(experimenter, E)::
    data_request(Is) => a(data_collator, X) then
    data_compared(Is,SF) <= a(data_collator, X)

a(data_collector(Is,Sp,Sd),X)::
    ( null <- Sp=[] and Sd=[] )
    or
    ( a(data_retriever(I,P,D),X) <- (Sp=[P|Rp] and Sd=[D|Rd] and Is=[I|Ri]) then
      a(data_collector(Ri,Rp,Rd),X) )

a(data_retriever(I,P,D),X)::
    data_request(I) => a(data_source,P) then
    data_report(I,D) <= a(data_source,P)

a(data_filter(I,Sp,Sd),F)::
    filter(I,Sp,Sd) <= a(data_collator,X) then
    filtered(I,Sp,S) => a(data_collator,X) <- apply_filter(Sd,S)

a(data_source,P)::
    data_request(I) <= a(data_retriever(I,P,D),X) then
    data_report(I,D) => a(data_retriever(I,P,D),X) <- lookup(I,D)

a(data_comparer,C)::
    filtered(Is,Sp,S) <= a(data_collator,X) then
    data_compared(Is,SF) => a(data_collator,X) <- consistency_check(S,SF)

LCC Protocol
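To make the control flow of the protocol easier to follow, here is a plain Python mirror of the data_collector recursion and the data_collator sequence. It is only a sketch: it ignores message passing between peers, assumes the ID list and source list have equal length, and the stand-in versions of `lookup`, `apply_filter` and `consistency_check` (and the values in `TABLE`) are invented for illustration.

```python
def collect(ids, sources, lookup):
    """Mirror of data_collector: walk Is and Sp in step and gather Sd."""
    if not sources:                            # null <- Sp = [] and Sd = []
        return []
    i, rest_ids = ids[0], ids[1:]              # Is = [I|Ri]
    p, rest_sources = sources[0], sources[1:]  # Sp = [P|Rp]
    d = lookup(i, p)                           # data_retriever asks data_source P about I
    return [d] + collect(rest_ids, rest_sources, lookup)


def collate(ids, sources, lookup, apply_filter, consistency_check):
    """Mirror of data_collator: collect, then filter, then compare."""
    collected = collect(ids, sources, lookup)
    filtered = apply_filter(collected)
    return consistency_check(filtered)


# Stand-in constraint implementations (hypothetical values, for illustration only).
TABLE = {
    ("YPL132W", "SWISS"): 0.9,
    ("YPL132W", "SAM"): 0.4,
    ("YPL132W", "ModBase"): 0.8,
}


def lookup(protein_id, source):
    return (source, TABLE[(protein_id, source)])


def apply_filter(items):
    return [(source, quality) for source, quality in items if quality >= 0.5]


def consistency_check(items):
    return [source for source, _ in items]


report = collate(
    ["YPL132W", "YPL132W", "YPL132W"],
    ["SWISS", "SAM", "ModBase"],
    lookup, apply_filter, consistency_check,
)
```

In the LCC version each of these function calls is a constraint resolved by whichever peer plays the corresponding role, so the same interaction model can be reused with different implementations behind it.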

MaxSub - Examples

[Figure: example model comparisons. Panels: SWISS-SAM, ModBase-SAM, SWISS-ModBase. Proteins shown: YPL132W, YBR024W, YLR131C.]

• pair-wise, sequence-dependent

• finds common substructure (shown in blue)

CYSP = Comparison of Yeast 3D Structure Predictions

578 three-way supported MaxSub-substructures > 45 aa from 545 proteins

(Linked from www.openk.org)
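One plausible reading of "three-way supported" is that a substructure counts only if all three pair-wise comparisons agree on it. The sketch below illustrates that reading and nothing more: it does not implement MaxSub, the input format (sets of residue positions) is an assumption, and the function name is hypothetical.

```python
MIN_LENGTH = 45  # residues; the slide reports substructures > 45 aa


def three_way_supported(swiss_sam, modbase_sam, swiss_modbase, min_length=MIN_LENGTH):
    """Return residue positions common to all three pair-wise substructures.

    Each argument is an iterable of residue positions that a pair-wise
    comparison reported as the common substructure (assumed input format).
    """
    common = set(swiss_sam) & set(modbase_sam) & set(swiss_modbase)
    return sorted(common) if len(common) > min_length else []


supported = three_way_supported(range(1, 101), range(20, 90), range(10, 70))
too_short = three_way_supported(range(1, 101), range(20, 60), range(10, 70))
```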

Pair-wise MaxSub Comparisons: Results

           SWISS        ModBase        SAM
SWISS      769 (717)    649 (594)      585 (559)
ModBase                 2546 (2280)    620 (594)
SAM                                    2211 (2211)

Proteomic Analysis

Expression Proteomics

– proteins are extracted from cells and tissues

– proteins are separated

• two-dimensional gel electrophoresis

• liquid chromatography

– proteins are digested and identified

• various mass spectrometry methods

Bioinformatic Analysis

– primary, secondary, tertiary structures

– sequence alignment and homology

– motifs and domains

– protein interactions and networks

Functional Proteomics


Peptide/Protein Identification

• Sequencing information in archives that does not produce clear identifications is rarely accessible to other groups

– most of it will never be reflected in protein DBs

– the information is discarded

• This information is of high importance for other groups analysing the sequence/function of homologous proteins

– it contains sequences with post-translational modifications not to be found in current protein DBs

• Spectra and sequence tags generated in one lab could be used by other labs to evaluate the confidence of experimental or predicted sequences

Information Overflow

• Proteomic analysis is currently an inhumane task:

– LC-MS analysis produces >10,000 spectra

– each spectrum yields (after sequencing and DB search) several peptide or peptide tag candidates

– each step produces an identification score whose final evaluation is performed manually (using probability data)

• Many proteomic labs are involved in the characterization of proteomes, protein complexes and networks

– the rate of information production is increasing very fast


P2P Proteomics with OK

Sequence Identification Scenario

• An investigator asks an identifier to match a sequence against the repositories of proteomics labs.

• The identifier acts as a searcher: it queries each known proteomics lab to retrieve hits for the given input sequence, collects the results, and then sends them back to the investigator.

• The queried proteomics lab could store high-scoring queries to increase the reliability of the matching sequences.

• The final step of sequence data-mining at each proteomics lab is performed by BLAST engines local to each peer.

• The first prototype only matches input sequences; the next release could also directly accept mass spectra as input. For this task we will use an OMSSA engine capable of matching spectra against the same sequence database used by the BLAST engine.

Sequence Identification IM in LCC

a(investigator,A) ::
    identify(Seqs,P) => a(identifier,B) <- get_sequences(Seqs,P) then
    visualise(Result_set) <- answer(Result_set) <= a(identifier,B)

a(identifier,B) ::
    identify(Seqs,P) <= a(investigator,A) then
    a(searcher(Seqs,P,Ls,Result_set),B) <- lab_list(Ls) then
    answer(Result_set) => a(investigator,A) then
    a(identifier,B)

a(searcher(Seqs,P,Ls,Result_set),B) ::
    ( query(Seqs,P) => a(proteomics_lab,L) <- Ls = [L|RLs] then
      Result_set = [(Result,L)|RSs] <- answer(Result) <= a(proteomics_lab,L) then
      a(searcher(Seqs,P,RLs,RSs),B) )
    or
    null <- Ls = [] and Result_set = []

a(proteomics_lab,L) ::
    query(Seqs,P) <= a(searcher(_,_,_,_),B) then
    answer(Result) => a(searcher(_,_,_,_),B) <- find_hit(Seqs,P,Result) then
    a(proteomics_lab,L)
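The searcher recursion in this interaction model can be mirrored in plain Python as below. This is a sketch under assumptions, not the prototype: peers are ordinary objects rather than networked processes, the `ProteomicsLab` class and the `min_length` parameter are hypothetical, and `find_hit` uses exact substring matching as a placeholder where the real system uses a BLAST engine.

```python
class ProteomicsLab:
    """Stand-in for a proteomics_lab peer with a local sequence repository."""

    def __init__(self, name, repository):
        self.name = name
        self.repository = repository

    def find_hit(self, seqs, params):
        # Placeholder matching: exact substring search, NOT a BLAST search.
        min_len = params.get("min_length", 1)
        return [
            (query, entry)
            for query in seqs
            if len(query) >= min_len
            for entry in self.repository
            if query in entry
        ]


def searcher(seqs, params, labs):
    """Mirror of the searcher role: recurse over the lab list Ls."""
    if not labs:                         # null <- Ls = [] and Result_set = []
        return []
    lab, rest = labs[0], labs[1:]        # Ls = [L|RLs]
    result = lab.find_hit(seqs, params)  # query(Seqs,P) then answer(Result)
    return [(result, lab.name)] + searcher(seqs, params, rest)


def identifier(seqs, params, lab_list):
    """Mirror of the identifier role: become a searcher, then answer."""
    return searcher(seqs, params, lab_list)


labs = [
    ProteomicsLab("lab_1", ["MKTAYIAKQR", "GSHMLEDPVA"]),
    ProteomicsLab("lab_2", ["AAATAYIAAA"]),
]
result_set = identifier(["TAYIA"], {"min_length": 4}, labs)
```

As in the LCC clause, each result is paired with the lab that produced it, so the investigator can see which repository supports each hit.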

[Figure: message flow between the roles investigator, identifier, searcher and proteomics_lab. Messages: identify(Seqs, P), query(Seqs, P), answer(result), answer(result_set). The investigator's constraints get_sequence(Seqs, P) and visualise(result_set) are linked to a GUI.]

An investigator uses a GUI to get the input sequences and a set of parameters P

investigator sends the message identify(Seqs, P) to an identifier

identifier retrieves a list of known proteomics labs

identifier becomes searcher and sends a query to the first proteomics_lab in the list

proteomics_lab resolves the find_hit constraint and sends back an answer with the result (i.e. a URL for an XML file)

searcher loops the queries over the list of proteomics_labs and collects the results in a result_set

searcher returns to the role identifier and sends the result_set back to investigator

investigator receives the result_set and displays it on a GUI

Step by Step

[Figure legend: peer, message, constraint. Constraints shown: lab_list(Ls), find_hit(Seqs, P).]

The find_hit() constraint also starts a process inside the proteomics_lab peer which stores high-scoring queries
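The side effect of find_hit described above could look roughly like this. Everything here is an assumption made for illustration: the `HitStore` class, the numeric score, the threshold value and the stand-in `search` function are not taken from the prototype.

```python
class HitStore:
    """Keeps queries whose best hit score reaches a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.stored = []  # list of (query, best_score)

    def record(self, query, hits):
        # hits: list of (matched_sequence, score) pairs
        best = max((score for _, score in hits), default=None)
        if best is not None and best >= self.threshold:
            self.stored.append((query, best))
        return hits


def find_hit(query, search, store):
    """Resolve the constraint and, as a side effect, store high-scoring queries."""
    return store.record(query, search(query))


# Stand-in search engine returning fixed scores (hypothetical values).
def search(query):
    table = {
        "TAYIA": [("MKTAYIAKQR", 0.95), ("AAATAYIAAA", 0.60)],
        "GGGGG": [("GGGAGGGGAG", 0.30)],
    }
    return table.get(query, [])


store = HitStore(threshold=0.9)
hits_a = find_hit("TAYIA", search, store)
hits_b = find_hit("GGGGG", search, store)
hits_c = find_hit("WWWWW", search, store)
```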
