The Bioinformatics Toolkit at the MPI for Developmental Biology
Workshop Systems Biology
Berlin, March 3, 2006
Johannes Söding
Department for Protein Evolution (Andrei Lupas)
Max-Planck-Institute for Developmental Biology
Our toolkit assists the department’s research in protein evolution …
… and makes methods developed in our group accessible to a larger public
Sequence similarity searches
Multiple sequence alignment
Sequence analysis (repeats, periodicities, subtyping)
Secondary structure and transmembrane prediction
Tertiary structure prediction and structure analysis
Phylogeny and classification
Utilities (reformatting, sequence retrieval, filtering)
Overview page for Sequence Search
PSI-BLAST has enhanced functionality over NCBI
• Select subsets out of >300 genomes
• Upload personal databases
• Change databases between search rounds
• Show colored multiple alignment (JalView)
• Submit results to other tools
Quick2D integrates results of various secondary structure prediction programs
Contributed by Christian Mayer, MPI-DevBio
REPPER detects periodic regions in proteins
Gruber M, Söding J, and Lupas AN. (2005) NAR 33, W239-243.
Several tools rely on a sensitive new method for remote homology detection
HHrep: De novo repeat detection
HHpred: Structure and function prediction by detecting remote homologs in databases such as the PDB, SCOP, Pfam, Smart, InterPro, CDD at NCBI, …
HHsenser: Sequence search method that employs exhaustive intermediate profile search
Underlying method: Pairwise comparison of profile hidden Markov models (HMMs)
What is a sequence profile?
What is a profile HMM?
Sequence profiles are a condensed representation of multiple alignments
HBA_human   ... W G K V G A - - H A G E ...
HBB_human   ... W G K V - - - - N V D E ...
MYG_phyca   ... W G K V E A - - D V A G ...
LGB2_luplu  ... W K D F N A - - N I P K ...
GLB1_glydi  ... W E E I A G A D N G A G ...

Each column of the profile p_j(a) contains the amino acid frequencies in the multiple sequence alignment.

Profile columns for the alignment columns V G A H A G E (residues of HBA_human):

       V     G     A     H     A     G     E
A  ... 0     0.25  0.75  0     0.2   0.4   0    ...
D  ... 0     0     0     0.2   0     0.2   0    ...
E  ... 0     0.25  0     0     0     0     0.4  ...
F  ... 0.2   0     0     0     0     0     0    ...
G  ... 0     0.25  0.25  0     0.2   0.2   0.4  ...
H  ... 0     0     0     0.2   0     0     0    ...
I  ... 0.2   0     0     0     0.2   0     0    ...
K  ... 0     0     0     0     0     0     0.2  ...
L  ... 0     0     0     0     0     0     0    ...
N  ... 0     0.25  0     0.6   0     0     0    ...
P  ... 0     0     0     0     0     0.2   0    ...
V  ... 0.6   0     0     0     0.4   0     0    ...
W  ... 0     0     0     0     0     0     0    ...
C  ... 0     0     0     0     0     0     0    ...
M  ... 0     0     0     0     0     0     0    ...
T  ... 0     0     0     0     0     0     0    ...
Q  ... 0     0     0     0     0     0     0    ...
R  ... 0     0     0     0     0     0     0    ...
S  ... 0     0     0     0     0     0     0    ...
Y  ... 0     0     0     0     0     0     0    ...

[Figure label: master sequence]
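A minimal Python sketch of how such a profile can be computed from the alignment block shown on this slide. It assumes plain relative frequencies over the non-gap residues of each column, without pseudocounts or sequence weighting; the function names are illustrative and not part of the toolkit.

```python
from collections import Counter

# Alignment block copied from the slide ('-' marks a gap).
ALIGNMENT = {
    "HBA_human":  "WGKVGA--HAGE",
    "HBB_human":  "WGKV----NVDE",
    "MYG_phyca":  "WGKVEA--DVAG",
    "LGB2_luplu": "WKDFNA--NIPK",
    "GLB1_glydi": "WEEIAGADNGAG",
}

def column_profile(sequences, j, gap="-"):
    """Relative amino acid frequencies in alignment column j, ignoring gaps."""
    residues = [s[j] for s in sequences if s[j] != gap]
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}

def build_profile(alignment):
    """One frequency dictionary per alignment column."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "rows must have equal length"
    return [column_profile(seqs, j) for j in range(length)]

profile = build_profile(ALIGNMENT)
for j, column in enumerate(profile):
    # Print the non-zero entries of each column, e.g. column 6: A and G.
    print(j + 1, {aa: round(p, 2) for aa, p in sorted(column.items())})
```

Column 6 of the alignment (A, -, A, A, G) gives A = 0.75 and G = 0.25, the values listed in the profile on the slide.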
HMMs include position-specific gap penalties
HBA_human   ... V G A . . H A G E Y ...
HBB_human   ... V - - . . N V D E V ...
MYG_phyca   ... V E A . . D V A G H ...
LGB2_luplu  ... F N A . . N I P K H ...
GLB1_glydi  ... I A G a d N G A G V ...

Column states (M/D = Match or Delete, I = Insert): M/D M/D M/D I I M/D M/D M/D M/D M/D
[Figure labels: Deletions ('-' in HBB_human), Insertions ('a d' in GLB1_glydi)]

Emission probabilities per match column (residues of HBA_human):

       V     G     A     H     A     G     E     Y
A  ... 0     0.25  0.75  0     0.2   0.4   0     0    ...
D  ... 0     0     0     0.2   0     0.2   0     0    ...
E  ... 0     0.25  0     0     0     0     0.4   0    ...
F  ... 0.2   0     0     0     0     0     0     0    ...
G  ... 0     0.25  0.25  0     0.2   0.2   0.4   0    ...
H  ... 0     0     0     0.2   0     0     0     0.4  ...
I  ... 0.2   0     0     0     0.2   0     0     0    ...
K  ... 0     0     0     0     0     0     0.2   0    ...
N  ... 0     0.25  0     0.6   0     0     0     0    ...
P  ... 0     0     0     0     0     0.2   0     0    ...
V  ... 0.6   0     0     0     0.4   0     0     0.4  ...
Y  ... 0     0     0     0     0     0     0     0.2  ...
C, L, M, W and all other amino acids: 0 in every column shown

Probabilities for Insert Open (MI), Insert Extend (II), Delete Open (MD), Delete Extend (DD):

       V     G     A     H     A     G     E     Y
MI ... 0     0     0.25  0     0     0     0     0    ...
II ... 0     0     0.5   0     0     0     0     0    ...
MD ... 0.2   0     0     0     0     0     0     0    ...
DD ... 0     1.0   0     0     0     0     0     0    ...
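A minimal Python sketch of how position-specific transition probabilities can be counted from the alignment on this slide. The counting convention is an assumption (plain relative frequencies of the transitions leaving each state, no pseudocounts); it reproduces the values 0.2 for MD, 1.0 for DD, 0.25 for MI and 0.5 for II that appear on the slide. Function names are illustrative.

```python
from collections import defaultdict

# Alignment rows from the slide. Upper-case letters and '-' occupy match
# columns; lower-case letters are inserted residues; '.' pads insert columns.
ROWS = [
    "VGA..HAGEY",  # HBA_human
    "V--..NVDEV",  # HBB_human
    "VEA..DVAGH",  # MYG_phyca
    "FNA..NIPKH",  # LGB2_luplu
    "IAGadNGAGV",  # GLB1_glydi
]

def state_path(row):
    """Return the (match_column, state) sequence visited by one aligned row."""
    path, column = [], -1
    for symbol in row:
        if symbol == ".":
            continue                      # padding only: no state is visited
        if symbol.islower():
            path.append((column, "I"))    # insert state after the current match column
        else:
            column += 1
            path.append((column, "D" if symbol == "-" else "M"))
    return path

def transition_probabilities(rows):
    """Relative frequency of each transition X->Y leaving state X at a column."""
    counts = defaultdict(int)
    leaving = defaultdict(int)
    for row in rows:
        path = state_path(row)
        for (column, state), (_, next_state) in zip(path, path[1:]):
            counts[(column, state, next_state)] += 1
            leaving[(column, state)] += 1
    return {key: n / leaving[key[:2]] for key, n in counts.items()}

probs = transition_probabilities(ROWS)
print("Delete Open   (MD) at column 1:", probs[(0, "M", "D")])
print("Delete Extend (DD) at column 2:", probs[(1, "D", "D")])
print("Insert Open   (MI) at column 3:", probs[(2, "M", "I")])
print("Insert Extend (II) at column 3:", probs[(2, "I", "I")])
```

Keys are (zero-based match column, from-state, to-state); transitions that never occur are simply absent from the dictionary.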
Profile HMMs can be represented as states connected by transitions
HBA_human   ... V G A . . H A G E Y ...
HBB_human   ... V - - . . N V D E V ...
MYG_phyca   ... V E A . . D V A G H ...
LGB2_luplu  ... F N A . . N I P K H ...
GLB1_glydi  ... I A G a d N G - G V ...
Column states: M/D M/D M/D I I M/D M/D M/D M/D M/D
[Figure: HMM p drawn as a chain of match (M), delete (D) and insert (I) states, one M/D/I triple per match column, annotated with the matrix of emission probabilities pi(a) and transition probabilities pi(XY)]
Find path through two HMMs that maximizes co-emission probability
[Figure: HMM q and HMM p, each drawn as a chain of M, D and I states; at each step the alignment path pairs a state of q with a state of p, and the paired states co-emit the residues x1 x2 x3 x4 x5 x6]
Söding, J. (2005) Bioinformatics 21, 951-960.
Include null model → maximize “log-sum-of-odds score”
Co-emitted sequence
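A minimal Python sketch of the column score behind the log-sum-of-odds score, following the form given in the cited paper: S(q_i, p_j) = log2 sum_a q_i(a) p_j(a) / f(a). The uniform background f(a) = 1/20 is an assumption for illustration only, and transition terms and the dynamic programming over the two HMMs are not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption for illustration: uniform background frequencies f(a) = 1/20.
BACKGROUND = {a: 1 / 20 for a in AMINO_ACIDS}

def column_score(q_col, p_col, background=BACKGROUND):
    """Log-sum-of-odds score (in bits) for co-emission from two profile columns.

    q_col and p_col map amino acids to emission probabilities; missing
    amino acids count as probability 0.
    """
    odds = sum(q_col.get(a, 0.0) * p_col.get(a, 0.0) / background[a]
               for a in background)
    # Columns with no shared amino acid cannot co-emit anything.
    return math.log2(odds) if odds > 0 else float("-inf")

# Two profile columns taken from the globin example on the earlier slides.
q_col = {"V": 0.6, "F": 0.2, "I": 0.2}
p_col = {"A": 0.2, "V": 0.4, "I": 0.2, "G": 0.2}

print(column_score(q_col, p_col))          # positive: columns share V and I
print(column_score(q_col, {"W": 1.0}))     # -inf: no amino acid in common
print(column_score(BACKGROUND, BACKGROUND))  # about 0: two background-like columns
```

Real profiles contain pseudocounts, so the zero-overlap case does not occur in practice; it is handled here only to keep the sketch runnable.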
HHrep detects repeats by HMM-HMM comparison of the sequence with itself
The dotplot with suboptimal alignments reveals internal symmetries
[Dot plot: repeat 1 to repeat 4 along both axes]
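A toy Python sketch of the idea behind a self-comparison dot plot. It swaps in exact word matching for the HMM-HMM comparison that HHrep uses, so it only finds perfect repeats; the sequence and function names are made up for illustration.

```python
from collections import Counter

def self_dotplot(seq, window=3):
    """Off-diagonal dots (i, j), i < j, where two windows of seq are identical."""
    n = len(seq) - window + 1
    return {(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if seq[i:i + window] == seq[j:j + window]}

def diagonal_offsets(dots):
    """Number of dots on each off-diagonal j - i; peaks hint at the repeat length."""
    return Counter(j - i for i, j in dots)

# Made-up sequence: one 8-residue unit repeated four times.
sequence = "GKVEAHLT" * 4

dots = self_dotplot(sequence)
offsets = diagonal_offsets(dots)
for offset, count in sorted(offsets.items()):
    print(f"offset {offset:2d}: {count} dots")
```

The dots fall on diagonals at multiples of the unit length, which is the kind of internal symmetry the dot plot on the slide makes visible for diverged repeats.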
Outer membrane barrels might have evolved by duplication of a single hairpin
OmpA
… but is there an internal symmetry in the sequences?
HHrep indeed finds a fourfold sequence symmetry in OMPs
[Dot plot of OmpA against itself; axis ticks at 50, 100, 150 on both axes; blue: significant alignments]
TIM barrels possess approximate structural symmetry …
… but up to now it has not been possible to detect this repeat pattern on the sequence level
HHrep detects structural repeats in TIMs
Did TIM barrels evolve by duplication of a quarter barrel peptide?
HisF
KDPG aldolase
Fourfold symmetry Eightfold symmetry
profile-profile dot plot
after consistency transformation
same, but lower score threshold
HMM-HMM comparison improves upon profile-profile comparison
All-against-all benchmark on SCOP (20% seq. id.)
[Plot legend: seq-seq, profile-seq, HMM-seq, profile-profile, HMM-HMM, HMM-HMM+SS, HMM-HMM+corr, HMM-HMM+predSS; annotation: 10% rate of false positives]
The HHpred input page
1. Paste ScbA sequence
2. Select database
3. Submit job
All input parameters are linked to explanations on help pages
ScbA from Streptomyces is involved in regulating the onset of antibiotic production, but its function is unknown
Search results: alignment view
Query sequence (ScbA)
Template sequence: (from database)
Predicted secondary structure (query)
Predicted secondary structure (template)
Actual secondary structure (template)
Graphical representation of best database hits along query sequence
View alignments as histograms
View template alignment
View template structure
Match quality
Statistical significance
Summary hit list for best database matches
Alignments with database sequences (templates)
Interesting region of high similarity
Six best hits belong to a superfamily of enzymes from the fatty acid synthesis pathway!
Create 3D model
Histogram view
Highly conserved residues E and Q are catalytic residues in FabZ / FabA!
FabZ
FabA
Highly conserved arginine: catalytic?
Homology between histones and C-terminal subdomain in AAA+ ATPases
RuvB (AAA+)
kink
TAFII62
TAFII42
Work in progress, V. Alva Kullanja and M. Ammelburg et al.
The prediction of transmembrane barrel proteins is a challenging problem
• TM β-barrel proteins occur in outer membranes of bacteria, mitochondria and plastids
• TM β-barrel proteins are normally amphiphilic → more difficult to identify than α-helical TMPs
• Only a handful of known structures exist
• No structure of OmpW has yet been released → use OmpW as test case
OmpA MspA porin
Most dedicated TM β-barrel predictors fail to predict Erwinia carotovora OmpW correctly
Server                             Result
TBBpred (Chandigarh, India)        “Protein is likely to be globular”
TMBETA-NET (AIST, Tokyo)           Confidence? Nine strands predicted with unrealistic positions
PROFtmb (Columbia University)      Low confidence (Z-score 5.8 ≈ 35% accuracy); six strands predicted with realistic positions
Pred-TMBB (University of Athens)   Score below threshold; nine strands predicted, 4 probably misplaced, 5 correct
HHpred model of Erwinia carotovora OmpW (default parameters, no refinements)
Correct topology predicted, with 8 strands at realistic positions;
High confidence for OMP prediction (Probability = 100%)
Only needs refinement for precise placement of loop inserts
HHsenser is a novel method to search for remote homologs in sequence databases
• Recursive search strategy employing PSI-BLAST to build new alignments that may be homologous to query
• HMM-HMM comparison for validation of homology between newly built alignment and alignment of validated sequences
• Very sensitive!
[Figure: sequences (dots) around the query, with search thresholds E < 10 and E < 10^-3; shaded: accepted sequences]
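A toy Python sketch of a recursive search of this kind. How the two E-value thresholds are combined is an assumption made for illustration: hits below the strict threshold are accepted directly, hits below the permissive threshold only after validation. The `search` and `validate` callables are hypothetical stand-ins for PSI-BLAST and HMM-HMM comparison, and the hit table is invented.

```python
from collections import deque

def recursive_search(query, search, validate, strict=1e-3, permissive=10.0):
    """Collect sequences reachable from the query through accepted hits.

    search(seed) yields (hit, evalue) pairs (stand-in for a PSI-BLAST run);
    validate(hit, accepted) decides on weak hits (stand-in for HMM-HMM comparison).
    """
    accepted = {query}
    queue = deque([query])
    while queue:
        seed = queue.popleft()
        for hit, evalue in search(seed):
            if hit in accepted or evalue >= permissive:
                continue                    # already known, or too weak to consider
            if evalue < strict or validate(hit, accepted):
                accepted.add(hit)
                queue.append(hit)           # accepted hits seed further searches
    return accepted

# Invented hit table: seed -> list of (hit, E-value).
TOY_HITS = {
    "query": [("a", 1e-5), ("b", 0.5), ("z", 50.0)],
    "a": [("c", 1e-4)],
    "b": [("d", 1e-6), ("e", 2.0)],
}
# Invented validation outcome: only 'b' passes the stand-in homology check.
VALIDATED = {"b"}

result = recursive_search(
    "query",
    search=lambda seed: TOY_HITS.get(seed, []),
    validate=lambda hit, accepted: hit in VALIDATED,
)
print(sorted(result))
```

In this toy run 'd' is only reachable through the weak but validated hit 'b', which is the kind of intermediate that a single direct search from the query would miss.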
HHsenser defines a diverse superfamily of transcription factors around AbrB/SpoVT
[Figure: structures of MazE (1mvf), MraZ (1n0g), AbrB (new, 1yfb) and AbrB (1ekt), with N- and C-termini labeled (N, C, N’, C’)]
Sequences obtained with HHsenser, clustered with CLANS:
[Cluster map labels: SpoVT, AbrB, MraZ-N, MraZ-C, cyano TF, YjiW, Archaeal PhoU, PemI / MazE, Vir, VagC, PrlF; structures 1n0g, 1yfb, 1mvf]
M. Coles et al. (2005) Structure 13, 919-928.
Retroactive from Drosophila was identified in a screen for chitin-associated defects
● The retroactive fly larvae are bloated and show a characteristic disarrangement of chitin fibres in the cuticle
● Except for the orthologous genes from D. pseudoobscura and Anopheles, no homologs are found in the databases
● Understanding chitin-related developmental and metabolic pathways is important for pest control
wildtype
rtv mutant
Based on remote homology with CD59 and snake toxins, HHpred could generate a 3D model for Rtv
B. Moussian, J. Söding, H. Schwarz, and C. Nüsslein-Volhard, Dev Dyn 2005
• Rtv is membrane-bound and adopts a three-finger neurotoxin fold
• The long fingers carry two exposed aromatic residues each
• These exposed residues are likely to bind chitin at the surface of epidermal cells
HHsenser finds homology between P5 protein of phage phi-6 and lytic transglycosylases
(default parameters)
HHpred confidently predicts Gas1 (target 5 from AFP-SIG) to be a GDNF receptor
(default parameters, database: CDD)
In collaboration with Mart Saarma, Helsinki
Outlook
Toolkit as open-source package
Continuous integration of the best available tools
Several new tools planned or in development
• Cluster known folds by sequence similarity (Galaxy of folds)
• Functional subtyping
• PDB remote homology alert
• Barrel membrane protein prediction
• Repeat detection (database-assisted)
• Expert system
The Toolkit Team
Andreas Biegert
Michael Remmert
Christian Mayer
Andrei Lupas
Johannes Söding
Many thanks to
• Tancred Frickey, Markus Gruber, Alex Diemand, and Pavel Szczesny for contributing tools
• Alexander Diemand for systems admin and support
• Members of our group for critical feedback
http://toolkit.tuebingen.mpg.de
Structure is more conserved in evolution than function
Conservation of structure
Sequence identity                          60%      50%     40%     30%     20%
Main-chain RMSD in conserved core          0.85 Å   1.0 Å   1.2 Å   1.5 Å   1.8 Å
Fraction of amino acids in conserved core  90%      80%     70%     60%     50%
Structure prediction based on homology to a template with known structure can yield useful 3D models even at sequence identities below 20% (twilight zone)
Sequence identity is a good indicator of functional similarity …
Conservation of enzyme function (EC code) in proteins
Sequence identity                                            50% - 60%   40% - 50%   30% - 40%   20% - 30%
Conservation of substrate specificity (all four EC digits)   75%         60%         35%         15%
Conservation of reaction mechanism (first three EC digits)   85%         70%         50%         25%
… but function evolves quickly: below 50% direct functional inference gets problematic
Analysis of conserved functional residues, comparative sequence analysis, structure prediction, …
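A minimal Python sketch of the two EC-based criteria on this slide: two enzymes share substrate specificity if all four EC digits agree, and share the reaction mechanism if the first three agree. The helper name and the example EC numbers are chosen for illustration only.

```python
def shared_ec_digits(ec_a, ec_b):
    """Number of leading EC digits (0 to 4) on which two EC numbers agree.

    An unassigned level, written '-', never counts as agreement.
    """
    shared = 0
    for x, y in zip(ec_a.split("."), ec_b.split(".")):
        if x != y or x == "-":
            break
        shared += 1
    return shared

def same_reaction_mechanism(ec_a, ec_b):
    """First three EC digits conserved."""
    return shared_ec_digits(ec_a, ec_b) >= 3

def same_substrate_specificity(ec_a, ec_b):
    """All four EC digits conserved."""
    return shared_ec_digits(ec_a, ec_b) == 4

# Illustrative pair: same first three digits, different fourth digit.
print(same_reaction_mechanism("1.1.1.1", "1.1.1.2"))
print(same_substrate_specificity("1.1.1.1", "1.1.1.2"))
```

Because the four-digit criterion implies the three-digit one, the percentages for substrate specificity can never exceed those for reaction mechanism at the same sequence identity.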
Global versus local alignment
BLAST and PSI-BLAST use a local alignment method
HHpred can construct both local and global alignments
• Probabilities / E-values more reliable for local alignment
• Global alignment mode useful for making 3D models and for determination of structural domain boundaries
[Figure: query and db match shown as bars, once for global alignment and once for local alignment]
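A minimal Python sketch of the difference between the two modes, using textbook dynamic programming on plain sequences (Needleman-Wunsch for global, Smith-Waterman for local). The scoring values are arbitrary assumptions; BLAST and HHpred use their own scoring schemes and heuristics, which are not reproduced here.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score of a and b with a linear gap penalty.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode pays for leading gaps; local mode starts anywhere for free.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)       # a local alignment may restart at any cell
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]

query = "WGKV"
db_match = "AAAAWGKVAAAA"   # made-up database sequence containing the query

print("local: ", alignment_score(query, db_match, local=True))
print("global:", alignment_score(query, db_match, local=False))
```

The short query matches a region of the longer sequence perfectly, so the local score is high, while the global score is dragged down by the gaps needed to cover the unmatched flanks.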