Hello
● BioDec has been doing bioinformatics since 2002
● Bioinformatics software development
● Bioinformation management system, BioDecoders
● Bioinformatics Consulting
● Development, engineering and integration of custom solutions
● Annotated databases of biosequences (e.g. genomes)
● Our Forte
● Protein-sequence analysis
● Trans-membrane proteins
● Machine-learning
● Python is everywhere
The Challenge: from Sequence to Function
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein Function
Gene Sequence
Protein Sequence (~10^7)
Protein Structure (10^5)
Problems in Sequence Analysis
Information Overflow: very large sets of data available
High Throughput: New data must be processed at high speed (volume of data, time constraints)
Open Problems: difficult to provide a simple first-principle or a model-based solution
Alignments
OmpA APKDNTWYTGAKLGWS QYHDTGLINNNGPTHEN KLGAGAFGGYQV NPYVGFEMGYDWLGR
OEP21 IDTNTFFQVRGGLD TKT---------------GQPS SGSALIRHF YPNFSATLGVGVRYD
OmpA MPYKGSVENGA YKAQGVQLTAKLGYP ITDDLDIYTRLGGMVWRADT YSNVYGKN HDTGVS
OEP21 KQDSVGVRYAKND KLRYTVLAKKT FPVTNDGLVNFKIK GGCDVDQD-------FKE WKSR
OmpA PVFAGGVEYA I-TPEIATRLEYQW TNNIGDAHTIGTRPDNG MLSLGVSYRF G-----
OEP21 GGAEFSWNVF NFQKDQDVRLRIGYE AFEQV-PYLQIRE NNWTFNADYKGRWNVRYD L
Alignments of some kind are the main tool for sequence comparison and database search
OmpA: PDB 1BXW, SwissProt OMPA_ECOLI
OEP21: Transmembrane Domain (24-177)
Tools from machine learning
Artificial Neural Networks (ANNs), Hidden Markov Models (HMMs), Support Vector Machines (SVMs)
[Diagram: known sequences (DB subsets, e.g. TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN) and known structures provide a known mapping; ANN, HMM, SVM extract general rules from it and turn a new sequence into a prediction.]
A    0   0   0   0   0   0   0   0   0   0   0  10   0   0   0   0
C    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D    0   0  70   0   0   0   0  60   0   0   0   0  20   0   0   0
E    0   0   0   0   0   0   0   0   0   0   0   0  70   0   0   0
F    0   0   0  10   0  33   0   0   0   0   0   0   0   0   0   0
G   10   0  30   0  30   0 100   0   0   0   0  50   0   0   0   0
H    0   0   0   0  10   0   0  10  30   0   0   0   0   0   0   0
K    0  40   0   0   0   0   0   0  10 100  70   0   0   0   0 100
I    0   0   0   0   0   0   0   0  30   0   0   0   0   0   0   0
L    0   0   0   0   0   0   0  30   0   0   0   0   0   0   0   0
M    0   0   0   0   0   0   0   0   0   0   0   0   0  60   0   0
N    0   0   0   0  10   0   0   0   0   0  30  10   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Q    0   0   0   0  40   0   0   0  30   0   0   0   0   0   0   0
R    0  50   0   0   0   0   0   0   0   0   0   0   0   0   0   0
S    0   0   0   0   0  33   0   0   0   0   0   0  10  10   0   0
T   20   0   0   0   0  33   0   0   0   0   0  30   0  30 100   0
V    0   0   0   0  10   0   0   0   0   0   0   0   0   0   0   0
W    0  10   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Y   70   0   0  90   0   0   0   0   0   0   0   0   0   0   0   0
Evolutionary Information
 1 Y K D Y H S - D K K K G E L - -
 2 Y R D Y Q T - D Q K K G D L - -
 3 Y R D Y Q S - D H K K G E L - -
 4 Y R D Y V S - D H K K G E L - -
 5 Y R D Y Q F - D Q K K G S L - -
 6 Y K D Y N T - H Q K K N E S - -
 7 Y R D Y Q T - D H K K A D L - -
 8 G Y G F G - - L I K N T E T T K
 9 T K G Y G F G L I K N T E T T K
10 T K G Y G F G L I K N T E T T K
Sequence position
MSA
Seq. Profile
Sequence profile
Given a Multiple Sequence Alignment (MSA) of similar sequences, associate with each position a 20-valued vector containing the relative amino-acid composition of the aligned sequences.
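The definition above can be sketched in a few lines of Python. This is a minimal illustration, not BioDec's implementation: the function name `sequence_profile` is hypothetical, and ignoring gaps when normalising each column is an assumption (it matches columns of the slide's example where some sequences have a gap).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_profile(msa):
    """Return one {residue: percentage} dict per alignment column.

    Gaps (and any non-standard symbol) are skipped, so each column is
    normalised by the number of residues actually present in it.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (100.0 * counts[aa] / total if total else 0.0)
             for aa in AMINO_ACIDS}
        )
    return profile


# The ten aligned sequences from the "Evolutionary Information" slide.
msa = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]
profile = sequence_profile(msa)
```

For the first column, `profile[0]` holds Y at 70, T at 20 and G at 10, with every other residue at 0.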
Why Python? (2.1.x, in 2002)
● Common ground, easy to pick up
● Expressive: productive, fast prototyping
● Maintainable: readable after months
● Useful tools and libs (e.g. BioPython)
● Retrospective:
We were f...ing RIGHT!
Hidden Markov Models
Very powerful tools when:
● The system can be modeled in probabilistic terms.
● There is a ‘grammar of the problem’
● There is a “limited sequential dependency” that can model the problem (at least to a rough approximation)
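To make the "limited sequential dependency" concrete, here is a minimal Viterbi decoder for a two-state model. It is a sketch under stated assumptions, not the model used in the talk: the states N and T and the values 0.99 and 0.01 come from the slide's diagram, but reading 0.99 as the stay-in-state probability, the uniform start distribution, the hydrophobic residue set and the emission probabilities are all hypothetical choices made for illustration.

```python
import math

STATES = ("N", "T")
START = {"N": 0.5, "T": 0.5}                      # assumption: uniform start
TRANS = {"N": {"N": 0.99, "T": 0.01},             # assumption: 0.99 = stay
         "T": {"N": 0.01, "T": 0.99}}
HYDROPHOBIC = set("AFILMVW")                      # hypothetical residue class
EMIT_HYDROPHOBIC = {"N": 0.3, "T": 0.8}           # hypothetical emissions


def emission(state, residue):
    """Probability that `state` emits a hydrophobic or a non-hydrophobic residue."""
    p = EMIT_HYDROPHOBIC[state]
    return p if residue in HYDROPHOBIC else 1.0 - p


def viterbi(sequence):
    """Return the most probable state path for `sequence` as a string."""
    scores = {s: math.log(START[s]) + math.log(emission(s, sequence[0]))
              for s in STATES}
    backpointers = []
    for residue in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor depends only on the previous state.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(emission(s, residue)))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

With these toy parameters, a run of 20 lysines followed by 20 leucines decodes to 20 N states followed by 20 T states: the rare 0.01 switch is paid once, where the emissions justify it.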
[Diagram: two-state HMM with states N and T and transition probabilities 0.99 and 0.01]
HMMers
End
Start
Signal Peptide
TM1
TM2
TM3
TM4
TM5
TM6
TM7
Insertion loop
Inside loop
Outside loop
Profile-HMM, based on: http://www.biocomp.unibo.it/piero/PHMM
BioPython
BioPython (http://biopython.org) is a community-developed (O|B|F) set of Python libraries and tools for bioinformatics.
● Parsers for file formats and applications (vital)
● The Sequence objects
● Bio.SeqIO, Bio.AlignIO, Bio.PDB
● Specialized external-application wrappers
● BioSQL interface
BioSQL
BioSQL (http://www.biosql.org) is a generic relational model (a schema) covering sequences, features, sequence and feature annotation, a reference taxonomy, and ontologies.
● Works with all O|B|F Bio* projects
● We extended it to suit our special needs
Ruffus
Ruffus (http://www.ruffus.org.uk/) is a Computation Pipeline library for Python, designed to allow easy analysis automation.
● Acts like a pythonic Make on steroids
● Write your Python functions and decorate them
– @originate, @transform, @merge and more
● Pipeline handling
– Run pipelines make-style (pipeline_run)
– Schedule pipelines on SGE compute clusters (run_job)
Angler pipeline
Proteome → Generate profiles → Predictions → Classify → ProteomeAtlas (a DB)
Predictions:
● Signal peptides
● Beta barrels
● Alpha-helical TMP
● Fold recognition
● Coiled coils
● Disordered regions
● Sub-cellular localization
Angler annotates and classifies Protein sequences
ZenDock
Analyzes the protein solvent-exposed surface for putative “interactor” residues, returning a “fuzzy” (probabilistic) answer.
Interactors are correlated and grouped into patches
Results are mapped on the protein 3D structure and made available through a web interface
Contact-shell profile
Int non-Int
If you can't outrun them...
The Problem
● Full Profile building is the slow step
– It takes 30 s to 5 min for a 3-pass PsiBlast run (uniref90)
– Repeat for ~10^5 sequences … CPU weeks per genome.
● Major genomes updated every 3 months
● Micro-SME: limited resources
… try to outsmart them.
● Sequence space is redundant
– Both intra-genome and inter-genome
● Profiles are built incrementally
– PsiBlast is an iterative algorithm
● PsiBlast is deterministic
– Given the same sequence, database, and number of iterations you get the same profile
Our accelerator: the PyBlastCache
1) Hash the sequence
2) Version the reference protein database
3) Store computed profiles in a key-value store
– Key as a combination of seq. hash and DB version
4) Compute
● If full_key_match: skip_and_copy()
● If seq_key_match: update_profile(seq, itn=1)
● If no_key: create_profile(seq, itn=3)
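The four steps above can be sketched as a small class. This is not the actual PyBlastCache code: the class name `ProfileCache`, the SHA-256 hash, the in-memory dicts standing in for the key-value store, the version labels and the stub profile builders are all assumptions made so that the example runs without PsiBlast.

```python
import hashlib


class ProfileCache:
    """Toy in-memory stand-in for the key-value store of computed profiles."""

    def __init__(self, create_profile, update_profile):
        self._create = create_profile    # called as create_profile(seq, itn=3)
        self._update = update_profile    # called as update_profile(seq, itn=1)
        self._profiles = {}              # (seq_hash, db_version) -> profile
        self._versions = {}              # seq_hash -> last db_version stored

    def get(self, seq, db_version):
        """Return (profile, outcome); outcome is 'copy', 'update' or 'create'."""
        seq_key = hashlib.sha256(seq.encode("ascii")).hexdigest()
        full_key = (seq_key, db_version)
        if full_key in self._profiles:       # full key match: nothing to compute
            return self._profiles[full_key], "copy"
        if seq_key in self._versions:        # known sequence, different DB version
            profile, outcome = self._update(seq, itn=1), "update"
        else:                                # never seen before: full run
            profile, outcome = self._create(seq, itn=3), "create"
        self._profiles[full_key] = profile
        self._versions[seq_key] = db_version
        return profile, outcome


# Stub profile builders (hypothetical) in place of real PsiBlast runs.
def create_profile(seq, itn):
    return f"profile of {len(seq)} residues after {itn} iterations"


def update_profile(seq, itn):
    return f"profile of {len(seq)} residues refreshed with {itn} iteration"


cache = ProfileCache(create_profile, update_profile)
outcomes = [
    cache.get("MYSFPNSFRFGW", "uniref90-v1")[1],   # first time: create
    cache.get("MYSFPNSFRFGW", "uniref90-v1")[1],   # same key: copy
    cache.get("MYSFPNSFRFGW", "uniref90-v2")[1],   # new DB version: update
]
```

A real update step would presumably start from the stored profile; the stub ignores it to keep the sketch short. Caching is only safe because PsiBlast is deterministic for a given sequence, database and iteration count.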
The (Python) front-ends
● Plone: a CMS
– https://plone.org
● Web2py: an MVC framework
– http://www.web2py.com
● Galaxy: web interface + workflow engine
– Focus on reproducible research
– https://wiki.galaxyproject.org/
– SaaS: https://usegalaxy.org
Plone4Bio
● A BioSQL browser, based on Plone, to search and display data and metadata (annotations) from biosequence databases. It could integrate predictors;
● We publicly released the base version as open-source software at http://plone4bio.org;
● It used to be the base for some commercial software we sold to clients.
Plone4Bio screenshots
Bologna, 21/1/2010
LIMS features
Galaxy
Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.
– Users without programming experience can easily specify parameters and run tools and workflows.
– Galaxy captures information in order to allow complete repeats of a computational analysis.
– Users share and publish analyses via the web and create Pages, interactive, web-based documents that describe a complete analysis.
● Accepted as material by peer reviewed journals
Galaxy highlights
Galaxy is useful to both end users and bioinformatics developers.
● Get data directly from online DBs (UCSC, Biomart,...)
● Handling of data from lab instrumentation (e.g. NGS seqs)
● Map calculated data on online viewers (e.g. genome viewer)
● Easily extensible: wrapping a foreign tool is as simple as writing an XML file.
● Data sharing (workflows, libraries, tools...)
● The community!
Snapshots
From https://usegalaxy.org
Visual programming
Thou Shalt Care For The DATA
● So much junk in the literature!!
– Both for features and data sets
● Use training, testing and validation sets
● The sets should always be disjoint
– Below 25% seq ID
● Redundancy is THE ENEMY
● Avoid feature bloat, use feature selection
● Always compare results with a nearest-neighbor method
– Good ones are really hard to beat
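Two of the rules above, disjoint sets and a nearest-neighbor baseline, fit in a short sketch. It is a toy under loud assumptions: `identity` compares pre-aligned, equal-length strings position by position, whereas real sequence identity comes from an alignment, and all three function names are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return matches / len(a)


def is_disjoint(train, test, max_identity=0.25):
    """True if no train/test pair reaches the identity threshold."""
    return all(identity(a, b) < max_identity for a in train for b in test)


def nearest_neighbor_label(query, labeled_train):
    """Baseline predictor: copy the label of the most similar training sequence."""
    _, label = max(labeled_train, key=lambda item: identity(query, item[0]))
    return label


train = [("AAAAAAAA", "x"), ("CCCCCCCC", "y")]
prediction = nearest_neighbor_label("AAAACCCA", train)
```

Here the query shares 5 of 8 positions with the first training sequence and 3 of 8 with the second, so the baseline predicts "x". A learned method that cannot beat this on a properly disjoint test set has not learned much.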
No Free Lunch
● There is no killer method
– Choose the method that best models your domain (e.g. sequences → HMMs)
– Data curation is always more important
● Be Humble, be Honest!
Meditation hint: http://www.no-free-lunch.org/
The community is your friend.
Give back to the community.