Genome_annotation@BioDec: Python all over the place


Ivan Rossi

[email protected]

@rouge2507

Hello

● BioDec has been doing bioinformatics since 2002
● Bioinformatics software development
● Bioinformation management system, BioDecoders
● Bioinformatics consulting
● Development, engineering and integration of custom solutions
● Annotated databases of biosequences (e.g. genomes)
● Our forte
● Protein-sequence analysis
● Trans-membrane proteins
● Machine learning
● Python is everywhere

The Challenge: from Sequence to Function

>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

Protein Function

Gene Sequence

Protein Sequence (~10^7)

Protein Structure (10^5)

Problems in Sequence Analysis

Information Overflow: very large sets of data available

High Throughput: new data must be processed at high speed (volume of data, time constraints)

Open Problems: difficult to provide a simple first-principles or model-based solution

Alignments

OmpA   APKDNTWYTGAKLGWS QYHDTGLINNNGPTHEN KLGAGAFGGYQV NPYVGFEMGYDWLGR
OEP21  IDTNTFFQVRGGLD TKT---------------GQPS SGSALIRHF YPNFSATLGVGVRYD

OmpA   MPYKGSVENGA YKAQGVQLTAKLGYP ITDDLDIYTRLGGMVWRADT YSNVYGKN HDTGVS
OEP21  KQDSVGVRYAKND KLRYTVLAKKT FPVTNDGLVNFKIK GGCDVDQD-------FKE WKSR

OmpA   PVFAGGVEYA I-TPEIATRLEYQW TNNIGDAHTIGTRPDNG MLSLGVSYRF G-----
OEP21  GGAEFSWNVF NFQKDQDVRLRIGYE AFEQV-PYLQIRE NNWTFNADYKGRWNVRYD L

Alignments of some kind are the main tool for sequence comparison and database search

OmpA: PDB 1BXW, SwissProt OMPA_ECOLI

OEP21: Transmembrane Domain (24-177)

Tools from machine learning

[Figure: training and prediction scheme. Labels: Known sequences (DB subsets), e.g. TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN; Known structures; Known mapping; ANN, HMM, SVM; General Rules; New sequence; Prediction]

● Artificial Neural Networks (ANNs)
● Hidden Markov Models (HMMs)
● Support Vector Machines (SVMs)

     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
A    0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D    0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E    0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F    0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G   10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H    0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K    0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I    0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L    0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M    0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N    0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q    0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R    0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S    0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T   20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V    0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W    0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y   70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0

Evolutionary Information

 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K

Sequence position

MSA

Seq. Profile

Sequence profile

Given a Multiple Sequence Alignment (MSA) of similar sequences, associate with each position a 20-valued vector containing the relative amino-acid composition of the aligned sequences.
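
A minimal Python sketch of this definition, applied to the ten aligned sequences of the MSA example. The function name `sequence_profile` and the choice to leave gaps out of the normalization are assumptions; the latter is consistent with values such as 33/33/33 at position 6 and 100 at position 7 in the profile shown earlier.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHKILMNPQRSTVWY"  # row order used in the profile shown earlier


def sequence_profile(msa):
    """Return one dict per alignment column: residue -> percentage.

    Gaps ('-') are ignored (assumption), so every non-empty column sums to 100.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({
            aa: (100.0 * counts[aa] / total if total else 0.0)
            for aa in AMINO_ACIDS
        })
    return profile


# The ten rows of the MSA example, one string per aligned sequence.
msa = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]

profile = sequence_profile(msa)
print(profile[0]["Y"], profile[0]["T"], profile[0]["G"])
```

Each element of `profile` is the 20-valued vector for one sequence position; a predictor would take these vectors, rather than the raw residues, as input.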

Why Python? (2.1.x, in 2002)

● Common ground, easy to pick up
● Expressive: productive, fast prototyping
● Maintainable: readable after months
● Useful tools and libs (e.g. BioPython)

● Retrospective:

We were f...ing RIGHT!

Hidden Markov Models

Very powerful tools when:

● The system can be modeled in probabilistic terms.
● There is a ‘grammar of the problem’.
● There is a “limited sequential dependency” that can model the problem (at least to a rough approximation).

[Figure: two-state model with states N and T; transition probabilities 0.99 and 0.01]
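
To illustrate the “limited sequential dependency” idea, here is a minimal Viterbi decoder for a two-state model. The state names N and T and the probabilities 0.99 and 0.01 come from the figure; assigning 0.99 to the self-transitions, the uniform start distribution, the two-letter alphabet (`h` for hydrophobic, `p` for polar) and all emission probabilities are assumptions made up for this sketch.

```python
import math

STATES = ("N", "T")

# 0.99 / 0.01 are the values in the figure; which arrow carries which value
# is an assumption (self-transitions get 0.99 here).
TRANS = {
    "N": {"N": 0.99, "T": 0.01},
    "T": {"T": 0.99, "N": 0.01},
}

# Hypothetical start and emission probabilities, for illustration only.
START = {"N": 0.5, "T": 0.5}
EMIT = {
    "N": {"h": 0.3, "p": 0.7},
    "T": {"h": 0.8, "p": 0.2},
}


def viterbi(observations):
    """Return the most probable state path for a string of 'h'/'p' symbols."""
    if not observations:
        return ""
    # Log space avoids numerical underflow on long sequences.
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for symbol in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(
                STATES, key=lambda p: scores[p] + math.log(TRANS[p][s])
            )
            pointers[s] = best_prev
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][symbol])
            )
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


long_stretch = viterbi("p" * 10 + "h" * 20 + "p" * 10)
short_stretch = viterbi("p" * 4 + "h" * 2 + "p" * 4)
print(long_stretch)
print(short_stretch)
```

With these made-up parameters a long hydrophobic stretch is labeled T, while a two-residue stretch is not: the low switching probability acts as the “grammar” that suppresses isolated state changes.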

99HMMers

End

Start

Signal Peptide

TM1

TM2

TM3

TM4

TM5

TM6

TM7

Insertion loop

Inside loop

Outside loop

Profile-HMM, based on: http://www.biocomp.unibo.it/piero/PHMM

BioPython

BioPython (http://biopython.org) is a community-developed (O|B|F) set of Python libraries and tools for bioinformatics.

● The parsers for file formats and applications (vital)
● The Sequence objects
● Bio.SeqIO, Bio.AlignIO, Bio.PDB
● Specialized external-application wrappers
● BioSQL interface


BioSQL

BioSQL (http://www.biosql.org) is a generic relational model (a schema) covering sequences, features, sequence and feature annotation, a reference taxonomy, and ontologies.

● Works with all O|B|F Bio* projects
● We extended it to suit our special needs

Ruffus

Ruffus (http://www.ruffus.org.uk/) is a Computation Pipeline library for Python, designed to allow easy analysis automation.

● Acts like a pythonic Make on steroids
● Write your Python functions and decorate them
– @originate, @transform, @merge and more
● Pipeline handling
– Run pipelines make-style (pipeline_run)
– Schedule pipelines on SGE compute clusters (run_job)


Angler pipeline

Proteome

Generate profiles

Predictions: signal peptides, beta barrels, alpha-helical TMP, fold recognition, coiled coils, disordered regions, sub-cellular localization

Classify

ProteomeAtlas (a DB)

Angler annotates and classifies protein sequences

ZenDock

Analyzes the protein's solvent-exposed surface for putative “interactor” residues, returning a “fuzzy” (probabilistic) answer.

Interactors are correlated and grouped into patches

Results are mapped on the protein 3D structure and made available through a web interface

Contact-shell profile

Int non-Int

If you can't outrun them...

The Problem

● Full profile building is the slow step
– It takes from 30 seconds to 5 minutes for a 3-pass PsiBlast run (uniref90)
– Repeat for ~10^5 … CPU weeks per genome.
● Major genomes are updated every 3 months
● Micro-SME: limited resources
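
The “CPU weeks” figure follows from simple arithmetic on the numbers above. A back-of-the-envelope sketch, assuming ~10^5 runs on a single CPU with no parallelism (variable names are illustrative):

```python
RUNS = 10**5                      # ~10^5 PsiBlast runs, as on the slide
SECONDS_PER_WEEK = 7 * 24 * 3600

fastest_run = 30                  # seconds per 3-pass run (lower bound)
slowest_run = 5 * 60              # seconds per 3-pass run (upper bound)

low_weeks = RUNS * fastest_run / SECONDS_PER_WEEK
high_weeks = RUNS * slowest_run / SECONDS_PER_WEEK

print(f"{low_weeks:.1f} to {high_weeks:.1f} CPU weeks")
```

That is roughly 5 to 50 CPU weeks for a full pass, to be compared with a 3-month update cycle.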

… try to outsmart them.

● Sequence space is redundant
– Both intra-genome and inter-genome
● Profiles are built incrementally
– PsiBlast is an iterative algorithm
● PsiBlast is deterministic
– Given the same sequence, database, and number of iterations you get the same profile

Our accelerator: the PyBlastCache

1) Hash the sequence
2) Version the reference protein database
3) Store computed profiles in a key-value store
– Key as a combination of seq. hash and DB version
4) Compute
● If full_key_match: skip_and_copy()
● If seq_key_match: update_profile(seq, itn=1)
● If no_key: create_profile(seq, itn=3)
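
The four steps can be sketched in a few lines of Python. This is not the actual PyBlastCache code: the class `ProfileCache`, the helper `sequence_hash`, the use of SHA-256, the in-memory dict standing in for the key-value store, the DB version strings and the two stub functions are all assumptions made for illustration. In a real setup the stubs would run PsiBlast.

```python
import hashlib


def sequence_hash(seq):
    """Hash a protein sequence, ignoring case and surrounding whitespace."""
    return hashlib.sha256(seq.strip().upper().encode("ascii")).hexdigest()


class ProfileCache:
    """Hypothetical cache keyed on (sequence hash, DB version)."""

    def __init__(self, create_profile, update_profile):
        self._store = {}    # (seq_hash, db_version) -> profile
        self._latest = {}   # seq_hash -> last DB version stored
        self._create = create_profile
        self._update = update_profile

    def get_profile(self, seq, db_version):
        seq_key = sequence_hash(seq)
        full_key = (seq_key, db_version)
        if full_key in self._store:
            # Full key match: nothing to compute, reuse the stored profile.
            return self._store[full_key], "hit"
        if seq_key in self._latest:
            # Sequence seen with an older DB: one extra iteration on top
            # of the stored profile.
            old = self._store[(seq_key, self._latest[seq_key])]
            profile = self._update(seq, old, db_version, itn=1)
            outcome = "update"
        else:
            # Never seen: full 3-iteration run.
            profile = self._create(seq, db_version, itn=3)
            outcome = "create"
        self._store[full_key] = profile
        self._latest[seq_key] = db_version
        return profile, outcome


# Stub profile builders (placeholders for the real PsiBlast calls).
def fake_create(seq, db_version, itn):
    return {"db": db_version, "iterations": itn}


def fake_update(seq, old_profile, db_version, itn):
    return {"db": db_version, "iterations": old_profile["iterations"] + itn}


cache = ProfileCache(fake_create, fake_update)
outcomes = [
    cache.get_profile("MYSFPNSF", "2015_01")[1],   # new sequence
    cache.get_profile("mysfpnsf", "2015_01")[1],   # same sequence, same DB
    cache.get_profile("MYSFPNSF", "2015_02")[1],   # same sequence, new DB
]
print(outcomes)
```

The design relies on the determinism noted earlier: because the same sequence, database and iteration count always give the same profile, a full key match can safely be served from the store.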

The (Python) front-ends

● Plone: a CMS
– https://plone.org
● Web2py: an MVC framework
– http://www.web2py.com
● Galaxy: web interface + workflow engine
– Focus on reproducible research
– https://wiki.galaxyproject.org/
– SaaS: https://usegalaxy.org


Plone4Bio

● A BioSQL browser, based on Plone, to search and display data and metadata (annotations) from biosequence databases. Could integrate predictors;
● We publicly released the base version as open-source software at http://plone4bio.org;
● It used to be the base for some commercial software we sold to clients.

Plone4Bio screenshots

Bologna, 21/1/2010

LIMS features

Galaxy

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.

– Users without programming experience can easily specify parameters and run tools and workflows.

– Galaxy captures information in order to allow complete repeats of a computational analysis.

– Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis.

● Accepted as material by peer-reviewed journals

Galaxy highlights

Galaxy is useful to both end users and bioinformatics devs.

● Get data directly from online DBs (UCSC, Biomart,...)

● Handling of data from lab instrumentation (e.g. NGS seqs)

● Map calculated data on online viewers (e.g. genome viewer)

● Easily extensible: wrapping a foreign tool is as simple as writing an XML file.

● Data sharing (workflows, libraries, tools...)

● The community!

Snapshots

From https://usegalaxy.org

Visual programming

Thou Shalt Care For The DATA

● So much junk in the literature!!
– Both for features and data sets
● Use training, testing and validation sets
● The sets should always be disjoint
– Below 25% seq ID
● Redundancy is THE ENEMY
● Avoid feature bloat, use feature selection
● Always compare results with a nearest-neighbor method
– Good ones are really hard to beat
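
One way to obtain sets that are disjoint below an identity threshold is to link every pair of sequences at or above the threshold, take the connected components as clusters, and assign whole clusters to the sets. The sketch below does this; all function names are illustrative, and `naive_identity` is only a toy stand-in (ungapped, position by position) for a real alignment-based sequence identity.

```python
def naive_identity(a, b):
    """Toy identity: fraction of equal positions over the shorter length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def linked_clusters(sequences, identity=naive_identity, threshold=0.25):
    """Single-linkage clusters: any two sequences at or above the threshold
    end up in the same cluster, so different clusters are below it."""
    parent = list(range(len(sequences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            if identity(sequences[i], sequences[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, seq in enumerate(sequences):
        groups.setdefault(find(i), []).append(seq)
    return list(groups.values())


def split_by_cluster(clusters, n_sets=3):
    """Assign whole clusters round-robin to n_sets
    (e.g. training, testing, validation)."""
    sets = [[] for _ in range(n_sets)]
    for i, cluster in enumerate(clusters):
        sets[i % n_sets].extend(cluster)
    return sets


seqs = ["AAAAAAAA", "AAAAAAAC", "CCCCCCCC", "CCCCCCCD", "EEEEEEEE", "GGGGGGGG"]
clusters = linked_clusters(seqs)
sets = split_by_cluster(clusters)
print(sets)
```

The pairwise loop is quadratic in the number of sequences, so this is only practical for small data sets; the point is that similar sequences never end up on opposite sides of a split.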

No Free Lunch

● There is no killer method
– Choose the method that best models your domain (e.g. sequences → HMMs)
– Data curation is always more important

● Be Humble, be Honest!

Meditation hint: http://www.no-free-lunch.org/


The community is your friend.

Give back to the community.
