Swiss Institute of Bioinformatics
Institut Suisse de Bioinformatique
CN+LF-2006.01
An introduction
to multiple alignments
Original version by Cédric Notredame, updated by Laurent Falquet
Overview
! Multiple alignments
! How-to, Goal, problems, use
! Patterns
! PROSITE database, syntax, use
! PSI-BLAST
! BLAST, matrices, use
! [ Profiles/HMMs ] …
Overview
! What are multiple alignments?
! How can I use my alignments?
! How does the computer align the sequences?
! The progressive alignment algorithm
! What are the difficulties?
! Pre-requisite?
! How can we compare sequences?
! How can we align sequences?
Sometimes two sequences are not enough
The man with TWO watches
NEVER knows the exact time
What is a multiple sequence alignment?
! What can it do for me?
! How can I produce one of these?
! How can I use it?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
What is a multiple sequence alignment?
! Structural/biochemical criterion: residues playing a similar role end up in the same column.
! Evolutionary criterion: residues having the same ancestor end up in the same column.
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
How can I use a multiple alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
unknown -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
unknown AKDDRIRYDNEMKSWEEQMAE
* : .* . :
Extrapolation
SwissProt
Unknown Sequence
Homology?
Less than 30% identity, BUT
Conserved where it MATTERS
How can I use a multiple alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
Extrapolation
Prosite Patterns
P-K-R-[PA]-x(1)-[ST]…
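A PROSITE pattern can be read as a regular expression. The sketch below translates the basic syntax elements (single residues, `[..]` alternatives, `{..}` exclusions, `x` wildcards and repeat counts) into a Python regex. The function name `prosite_to_regex` is an illustrative assumption, anchors such as `<` and `>` are not handled, and only the visible prefix of the pattern above is used because the rest is elided on the slide.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Separate an optional repeat count, e.g. x(1) or x(2,4), from the element.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # x matches any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # {AB}: any residue except A or B
        # [AB] and single letters are already valid regex fragments.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)

# Only the prefix shown on the slide; the elided remainder is not guessed.
regex = prosite_to_regex("P-K-R-[PA]-x(1)-[ST]")
hit = re.search(regex, "ADKPKRPLSAYMLWLN")   # start of the chite sequence
print(regex, hit.group() if hit else None)
```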
How can I use a multiple alignment?
Extrapolation
Prosite Patterns
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-IQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
L?K>R
Prosite Profiles -More Sensitive
-More Specific
AFDEFGHQIVLW
PROSITE profile (see also HMMs)
A Substitution Cost For Every Amino
Acid, At Every Position
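The idea of "a cost for every amino acid at every position" can be sketched as a position-specific scoring table built from alignment columns. This is not the PROSITE profile construction method: the uniform background frequency, the pseudocount and the names `build_profile` and `score_segment` are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    profile = []
    for column in zip(*aligned):
        # Gap characters are simply ignored when counting.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile

def score_segment(profile, segment):
    """Sum the per-position scores of an ungapped segment of the same length."""
    return sum(col[res] for col, res in zip(profile, segment))

# Ungapped block (columns holding K/A-P-K-R...) taken from the four sequences above.
block = ["KPKRPLSAY", "KPKRAPSAF", "APKRAMTSF", "KPKRPRSAY"]
profile = build_profile(block)
print(round(score_segment(profile, "KPKRPLSAY"), 2))
```

A residue seen in every sequence at a position gets a high score there, and the same residue can score poorly at another position, which is what makes a profile more informative than a single substitution matrix.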
How can I use a multiple alignment?
Phylogeny
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
[Figure: phylogenetic tree with leaves chite, wheat, trybr and mouse]
-Evolution
-Paralogy/Orthology
How can I use a multiple alignment?
Phylogeny
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
Struc. Prediction
Column Constraint
Evolution Constraint
Structure Constraint
How can I use a multiple alignment?
Phylogeny
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
Struc. Prediction
PsiPred or PHD for secondary structure prediction: 75% accurate.
Threading is improving but is not yet as good.
How can I use a multiple alignment?
Phylogeny
Struc. Prediction
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
Caution! Automatic multiple sequence alignment methods are not always perfect…
The problem
! Why is it difficult to compute a multiple sequence alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
Computation
What is the good alignment?
Biology
What is a good alignment?
The problem
! Why is it difficult to compute a multiple sequence alignment?
CIRCULAR PROBLEM....
Good Sequences
Good Alignment
The problem
! Same as pairwise alignment problem
! We do NOT know how sequences evolve.
! We do NOT understand the relation between structures and sequences.
! We would NOT recognize the “correct” alignment if we had it IN FRONT of our eyes…
The Charlie Chaplin paradox
What do I need to know to make a good multiple alignment?
! How do sequences evolve?
! How does the computer align the sequences?
! How can I choose my sequences?
! What is the best program?
! How can I use my alignment?
An alignment is a story
ADKPKRPLSAYMLWLN
ADKPKRPLSAYMLWLN ADKPKRPLSAYMLWLN
ADKPRRPLS-YMLWLN ADKPKRPKPRLSAYMLWLN
Mutations
+
Selection
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Insertion
Deletion
Mutation
Homology
! Same sequences -> same origin? -> same function? -> same 3D fold?
[Figure: % sequence identity plotted against alignment length, with marks at 30% identity and length 100; regions labeled "Same 3D Fold" and "Twilight Zone"]
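Percent identity, the quantity behind the 30% threshold, can be computed directly from two aligned rows. Several definitions are in use; the sketch below assumes one of them (identical residues divided by the number of columns where neither row has a gap), and the function name `percent_identity` is illustrative.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

# First 11 columns of the chite and mouse rows shown earlier.
print(percent_identity("---ADKPKRPL", "-----KPKRPR"))
```

A short, well conserved stretch like this one gives a high value; the threshold on the plot applies to identity measured over a sufficiently long alignment.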
Residues and mutations
! All residues are equal, but some more than others…
[Figure: Venn diagram grouping the amino acids by property: aliphatic, aromatic, hydrophobic, polar, small]
Accurate matrices are data driven rather than knowledge driven
Substitution matrices
Different Flavors:
• PAM: 250, 350
• BLOSUM: 45, 62
• …
What is the best substitution matrix?
! Mutation rates depend on families
! Choosing the right matrix may be tricky
! Gonnet250 > BLOSUM62 > PAM250
! Depends on the family, the program used and its tuning
Family            S     N
Histone3          6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
α-Globin          5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8
Rates in substitutions/site/billion years, as measured on mouse vs human (0.08 billion years)
Insertions and deletions?
[Figure: three plots of indel cost against indel length L]
Affine Gap Penalty
Cost = GOP + GEP*L
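The affine formula is easy to check numerically. In the sketch below the default values for GOP (gap opening penalty) and GEP (gap extension penalty) are illustrative assumptions, not the defaults of any particular program.

```python
def affine_gap_cost(length, gop=10.0, gep=0.5):
    """Cost of one gap of the given length: GOP + GEP * L (zero if there is no gap)."""
    if length <= 0:
        return 0.0
    return gop + gep * length

# One long indel is cheaper than several short ones covering the same length.
print(affine_gap_cost(4))        # a single gap of length 4
print(4 * affine_gap_cost(1))    # four separate gaps of length 1
```

With an affine penalty the aligner prefers a few long indels over many scattered ones, which matches the "nice ungapped blocks separated by indels" picture.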
How to align many sequences?
! Exact algorithms are computationally expensive
! Needleman & Wunsch
! Smith & Waterman
! -> a heuristic is definitely required!
2 Globins => 1 sec
3 Globins => 2 min
4 Globins => 5 hours
5 Globins => 3 weeks
6 Globins => 9 years
7 Globins => 1000 years
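The explosion in running time comes from the size of the dynamic programming table, which gains one dimension per sequence. Below is a minimal sketch of the pairwise Needleman & Wunsch score computation plus a helper that counts table cells for N sequences. The match/mismatch/gap scores are illustrative assumptions (real programs use a substitution matrix and affine gaps), and the names `needleman_wunsch_score` and `dp_cells` are hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]      # first row: b aligned to gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                             # first column: a aligned to gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def dp_cells(length, n_sequences):
    """Number of cells in the N-dimensional table an exact method must fill."""
    return (length + 1) ** n_sequences

print(needleman_wunsch_score("ADKPKRPLS", "ADKPRRPLS"))
for n in range(2, 8):
    print(n, "sequences of length 150:", dp_cells(150, n), "cells")
```

Each additional sequence multiplies the number of cells by roughly the sequence length, so the growth is exponential in the number of sequences.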
Existing methods
1-Carrillo and Lipman:
-MSA, DCA.
-Few Small Closely Related Sequences.
2-Segment Based:
-DIALIGN, MACAW.
-May Align Too Few Residues
-Do Well When They Can Run.
3-Iterative:
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
4-Progressive:
-ClustalW, Pileup, Multalin…
-Fast and Sensitive
5-Mixtures:
-T-Coffee, MAFFT, MUSCLE, ProbCons, Psi-Praline
-Very Sensitive
Progressive alignment
Feng and Doolittle, 1980; Taylor 1981
Dynamic Programming Using A Substitution Matrix
Progressive alignment
Feng and Doolittle, 1980; Taylor 1981
-Depends on the ORDER of the sequences (Tree).
-Depends on the CHOICE of the sequences.
-Depends on the PARAMETERS:
•Substitution Matrix.
•Penalties (Gop, Gep).
•Sequence Weight.
•Tree making Algorithm.
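The dependence on the ORDER of the sequences can be made concrete with a small guide-tree sketch: compute pairwise distances, then repeatedly join the two closest clusters. This is a simplification under stated assumptions: the distance is a k-mer Jaccard distance and the clustering is average linkage, whereas real programs typically derive distances from pairwise alignments and may use other tree-making algorithms. The names `kmer_distance` and `guide_order` are illustrative.

```python
from itertools import combinations

def kmer_distance(a, b, k=2):
    """1 minus the Jaccard similarity of the two sequences' k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:
        return 1.0
    return 1.0 - len(ka & kb) / len(ka | kb)

def guide_order(seqs):
    """Return the merges (cluster_a, cluster_b) in the order they would be aligned.

    seqs maps unique names to unaligned sequences; clusters are tuples of names.
    """
    dist = {frozenset((p, q)): kmer_distance(seqs[p], seqs[q])
            for p, q in combinations(seqs, 2)}

    def average(pair):
        # Average-linkage distance between two clusters.
        c1, c2 = pair
        total = sum(dist[frozenset((p, q))] for p in c1 for q in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=average)
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

example = {
    "seq_a": "ADKPKRPLSAYMLWLN",
    "seq_b": "ADKPRRPLSAYMLWLN",
    "seq_c": "WWWWHHHHGGGGCCCC",   # hypothetical unrelated sequence
}
for merge in guide_order(example):
    print(merge)
```

The closest pair is aligned first and its gaps are frozen, so an early mistake caused by a poor distance estimate propagates to every later step.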
Progressive alignment
! Works well when phylogeny is dense
! No outlier sequence
! Example: river crossing
Selecting sequences from a BLAST output
A common mistake
! Sequences too closely related
! Identical sequences bring no information
! Multiple sequence alignments thrive on diversity
PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
:**::*.*******:***:* :****************..::******:***********
PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
:*** ******.******.**** *:************.:******:**
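One way to avoid a set dominated by near-identical sequences is to filter it by pairwise identity before aligning. The sketch below is a greedy filter over already aligned rows; the 90% threshold, the identity definition and the name `filter_redundant` are assumptions for illustration, and the test sequences are made up.

```python
def filter_redundant(aligned, max_identity=90.0):
    """Greedily keep sequences at most max_identity % identical to every kept one.

    aligned maps names to aligned rows of equal length.
    """
    def identity(a, b):
        pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
        if not pairs:
            return 0.0
        return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

    kept = {}
    for name, seq in aligned.items():
        if all(identity(seq, other) <= max_identity for other in kept.values()):
            kept[name] = seq
    return kept

# Hypothetical toy rows: s2 is 90% identical to s1, s3 only 50%.
toy = {"s1": "AAAAAAAAAA", "s2": "AAAAAAAAAC", "s3": "AAAAACCCCC"}
print(list(filter_redundant(toy, max_identity=80.0)))
```

The result depends on input order, since the first member of each near-identical group is the one that survives.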
Respect information!
-This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.
-A better spread of the sequences is needed
PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
: :*. .*::::
PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
:. . * .*..:*: *: * *. :::..:*:::**: .*:*: :** :
PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES-
PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RAT LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES-
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
*: . .. :: .: : *: ***:.**:*. :** ::
Selecting diverse sequences
-A REASONABLE model now exists.
-Going further: remote homologues.
PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE
: *: .: . .* .:*. * ** *: * : * :* * **:**
PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA
:** .*:.* .* *: ** :: .* **** **::** **
Aligning remote homologues
PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
: ::
PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
: . .: .. . *: * : * :* : .*:*: :** .
PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
:: .. :: : :: .* :.** *. :** ::
Going further…
PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
TPC_PATYE SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI
. : .. . :: . : * :* : .* *. : * .
PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCS_PIG FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-
TPC_PATYE LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA
: . :: : :: * :..* :. :** ::
What makes a good alignment…
! The more divergent the sequences, the better
! The fewer indels, the better
! Nice ungapped blocks separated with indels
! Different classes of residues within a block:
! Completely conserved (*)
! Size and hydropathy conserved (:)
! Size or hydropathy conserved (.)
! The ultimate evaluation is a matter of personal judgment and knowledge
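The `*` marks in the consensus line can be reproduced with a few lines of code. The sketch below handles only the fully conserved class; the `:` and `.` classes depend on residue group tables that are not given here, so they are left out rather than guessed. The function name `conserved_columns` is illustrative.

```python
def conserved_columns(aligned):
    """Return a string with '*' under columns where every row has the same residue."""
    marks = []
    for column in zip(*aligned):
        first = column[0]
        same = first != "-" and all(res == first for res in column)
        marks.append("*" if same else " ")
    return "".join(marks)

# First 11 columns of the chite, wheat, trybr and mouse rows shown earlier.
rows = ["---ADKPKRPL", "--DPNKPKRAP", "KKDSNAPKRAM", "-----KPKRPR"]
for row in rows:
    print(row)
print(conserved_columns(rows))
```

On these columns only the P-K-R positions are marked, which matches the run of three stars at the start of the consensus line.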
Avoiding pitfalls
Naming your sequences the right way
! Never use white spaces in your sequence names
! Never use special symbols. Stick to plain letters, numbers and the underscore sign (_) to replace spaces. Avoid ALL other signs, especially the most tempting ones like @, #, |, *, >, <…
! Never use names longer than 15 characters
! Never give the same name to 2 different sequences in your set. Some programs accept it, others like ClustalW don’t.
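These naming rules are easy to enforce automatically before running an aligner. The sketch below rewrites a list of names accordingly; the numeric suffix used to break ties and the fallback name `seq` are assumptions of this example, not a convention of any program, and the input names are hypothetical.

```python
import re

def clean_names(names, max_len=15):
    """Return names restricted to letters, digits and '_', at most max_len long, all unique."""
    cleaned, seen = [], set()
    for name in names:
        base = re.sub(r"\s+", "_", name.strip())      # white space becomes an underscore
        base = re.sub(r"[^A-Za-z0-9_]", "", base)     # every other symbol is dropped
        base = base[:max_len] or "seq"                # assumption: fallback for empty names
        candidate, n = base, 1
        while candidate in seen:                      # make duplicates unique
            suffix = "_%d" % n
            candidate = base[:max_len - len(suffix)] + suffix
            n += 1
        seen.add(candidate)
        cleaned.append(candidate)
    return cleaned

print(clean_names(["mouse parvalbumin alpha (copy)", "my seq #1", "my seq #1"]))
```

Keep a table mapping the cleaned names back to the original ones, since truncation discards information.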
Do not use too many sequences!
Beware of Repeats
! There is a problem when two sequences do not contain the same number of repeats
! It is then better to manually extract the repeats and to align them separately. Individual repeats can be recognized using Dotlet or Dotter.
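The principle behind a dot plot can be sketched in a few lines: compare a sequence against itself and mark every pair of positions where a short window matches. Repeats then show up as diagonals away from the main diagonal. This is a simplified exact-match version written for illustration, not the scoring scheme used by Dotlet or Dotter, and the names `dotplot` and `render` are hypothetical.

```python
def dotplot(a, b, window=3):
    """Return the set of (i, j) where a[i:i+window] equals b[j:j+window]."""
    index = {}
    for j in range(len(b) - window + 1):
        index.setdefault(b[j:j + window], []).append(j)
    return {(i, j)
            for i in range(len(a) - window + 1)
            for j in index.get(a[i:i + window], [])}

def render(a, b, dots):
    """Draw the plot as text, one row per position of a."""
    return "\n".join(
        "".join("x" if (i, j) in dots else "." for j in range(len(b)))
        for i in range(len(a))
    )

seq = "ABCDABCD"   # toy sequence containing two copies of the same repeat
dots = dotplot(seq, seq)
print(render(seq, seq, dots))
```

The offset between the main diagonal and a parallel diagonal gives the distance between two copies of the repeat, which helps when cutting the repeats out by hand.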
Keep a biological perspective
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
chite AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS
* *** .:: ::... : * . . . : * . *: *
chite KSEWEAKAATAKQNY-I--RALQE-YERNG-G-
wheat KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA
trybr RKVYEEMAEKDKERY----K--RE-M-------
mouse KQAYIQLAKDDRIRYDNEMKSWEEQMAE-----
: : * : .* :
DIFFERENT PARAMETERS
Do not overtune!!!
DO NOT PLAY WITH PARAMETERS!
IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. ::: .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
chite ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP
***. * .: .. . : . . * . *: *
chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE
* : .* . :
BaliBase classification and benchmark
PROBLEM: Description
-Even phylogenetic spread
-One outlier sequence
-Two distantly related groups
-Long internal indel
-Long terminal indel
Choosing the right method
Source: BaliBase, Thompson et al., NAR (1999); Do et al., Genome Res. (2005)
PROBLEM / Program / Strategy
ClustalW, T-Coffee, MUSCLE, ProbCons
ProbCons, MUSCLE, MAFFT
Dialign II, ProbCons, T-Coffee
T-Coffee, MUSCLE, ProbCons
Dialign II, ProbCons, MAFFT
Some interesting links
More links
! MUSCLE
! http://www.drive5.com/muscle
! MAFFT
! http://timpani.genome.ad.jp/~mafft/server
! PROBCONS
! http://probcons.stanford.edu
! PSI-PRALINE
! http://ibivu.cs.vu.nl/programs/pralinewww
! 3D-COFFEE
! http://www.igs.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
Conclusion
! The best alignment method:
! Your brain
! The right data
! The best evaluation method:
! Your eyes
! Experimental information (SwissProt)
! Choosing the sequences well is important
! Beware of repeated elements
! What can I conclude?
! Homology -> information extrapolation
! How can I go further?
! Patterns
! Profiles
! HMMs
! …