Methods course
Multiple sequence alignment and reconstruction of phylogenetic trees
Burkhard Morgenstern, Fabian Schreiber
Göttingen, October/November 2007
Tools for multiple sequence alignment
Multiple alignment is the basis of (almost) all methods for sequence analysis in bioinformatics
Tools for multiple sequence alignment
T Y I M R E A Q Y E
T C I V M R E A Y E
Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M R E A - Y E
Tools for multiple sequence alignment
T Y I M R E A Q Y E
T C I V M R E A Y E
Y I M Q E V Q Q E
Y I A M R E Q Y E
Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M R E A - Y E
Y - I - M Q E V Q Q E
Y - I A M R E - Q Y E
Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M R E A - Y E
- Y I - M Q E V Q Q E
Y - I A M R E - Q Y E
Astronomical Number of possible alignments!
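To make "astronomical" concrete, the following minimal sketch counts the possible alignments of just two sequences. It assumes that every alignment column is either a residue pair, a gap in the first sequence or a gap in the second, and that alignments differing only in the order of adjacent gaps count as different. The function name `count_alignments` is illustrative and not part of the course material.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of gapped alignments of two sequences with lengths m and n."""
    if m == 0 or n == 0:
        return 1  # only gap columns remain, so there is exactly one way
    return (count_alignments(m - 1, n - 1)   # last column: residue pair
            + count_alignments(m - 1, n)     # last column: gap in sequence 2
            + count_alignments(m, n - 1))    # last column: gap in sequence 1


# Two sequences of ten residues each, as in the first example above.
print(count_alignments(10, 10))
```

The count already runs into the millions for two ten-residue sequences, and it grows further with every additional sequence.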
Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V - M R E A Y E
- Y I - M Q E V Q Q E
Y - I A M R E - Q Y E
Astronomical Number of possible alignments!
Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M R E A - Y E
- Y I - M Q E V Q Q E
Y - I A M R E - Q Y E
Which one is the best ???
Tools for multiple sequence alignment
Questions in development of alignment programs:
(1) What is a good alignment?
→ objective function ('score')
(2) How to find a good alignment?
→ optimization algorithm
Tools for multiple sequence alignment
What is a biologically good alignment ??
Tools for multiple sequence alignment
Criteria for alignment quality:
1. 3D-Structure: align residues at corresponding positions in 3D structure of protein!
2. Evolution: align residues with common ancestors!
Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M - R E A Y E
- Y I - M Q E V Q Q E
- Y I A M R E - Q Y E
An alignment is a hypothesis about sequence evolution
Search for most plausible hypothesis!
Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V - M R E A Y E
- Y I - M Q E V Q Q E
- Y I A M R E - Q Y E
An alignment is a hypothesis about sequence evolution
Search for most plausible hypothesis!
Tools for multiple sequence alignment
Compute for amino acids a and b:
Probability p(a,b) of substitution a → b (or b → a)
Frequency q(a) of a
Define similarity score s(a,b) based on p(a,b) and q(a)
Result: similarity matrix (substitution matrix), e.g. PAM (Dayhoff matrix), BLOSUM, …
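The slides do not give the formula for s(a,b). One widely used form is the log-odds ratio, and the sketch below assumes s(a,b) = log2( p(a,b) / (q(a) q(b)) ). The numbers are made-up toy values, not real PAM or BLOSUM data, and `log_odds_score` is an illustrative name.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """Log-odds similarity score in bits (assumed form, not from the slides)."""
    return math.log2(p_ab / (q_a * q_b))


# Hypothetical toy values for isoleucine (I) and leucine (L).
q = {"I": 0.06, "L": 0.10}
p_IL = 0.012  # pair I/L observed twice as often as expected by chance (0.006)

score_IL = log_odds_score(p_IL, q["I"], q["L"])
print(round(score_IL, 2))
```

Under this assumption a positive score means that the pair is aligned more often than expected by chance, and a negative score means less often.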
Tools for multiple sequence alignment
Traditional objective functions:
Define the score of an alignment as
Sum of individual similarity scores s(a,b) of aligned amino acid residues
minus a gap penalty g for each gap position in the alignment
An optimal alignment can be calculated for two sequences, but in practice not for more than 8 sequences
Example:
T Y W I V
T - - L V
Score = s(T,T) + s(I,L) + s(V,V) - 2g
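The example score can be reproduced with a few lines of code. This is a minimal sketch that charges the penalty g once per gap position, as in the example; the similarity function `toy_s` and its values are hypothetical stand-ins for a real substitution matrix.

```python
def alignment_score(row1: str, row2: str, s, g: float) -> float:
    """Sum of s(a, b) over aligned residue pairs minus g per gap position."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0.0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            continue        # column with gaps only contributes nothing
        if a == "-" or b == "-":
            total -= g      # one gap position costs g
        else:
            total += s(a, b)
    return total


def toy_s(a: str, b: str) -> int:
    """Hypothetical similarity: 2 for identity, 1 for I/L, -1 otherwise."""
    if a == b:
        return 2
    if {a, b} == {"I", "L"}:
        return 1
    return -1


# s(T,T) + s(I,L) + s(V,V) - 2g with g = 1.5
print(alignment_score("TYWIV", "T--LV", toy_s, g=1.5))
```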
Tools for multiple sequence alignment
Most commonly used heuristic for multiple alignment:
Progressive alignment (mid 1980s):
Idea: calculate multiple alignment as series of pairwise alignments of sequences and profiles
Use guide tree to determine order of pairwise alignments
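Each step of the progressive scheme is a pairwise alignment. The slides do not name the algorithm used for this step; the sketch below assumes standard dynamic programming (Needleman-Wunsch style) with a linear gap penalty, and it aligns two plain sequences rather than profiles. The function name and the identity scoring are illustrative.

```python
def global_align(x: str, y: str, s, g: float):
    """Optimal global alignment of x and y; returns (score, row_x, row_y)."""
    m, n = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] - g
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] - g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # pair
                          F[i - 1][j] - g,                          # gap in y
                          F[i][j - 1] - g)                          # gap in x
    # Traceback from the bottom-right corner to recover one optimal alignment.
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1])
            i -= 1; j -= 1
        elif i > 0 and (j == 0 or F[i][j] == F[i - 1][j] - g):
            row_x.append(x[i - 1]); row_y.append("-")
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


def identity(a: str, b: str) -> float:
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


score, row_x, row_y = global_align("TYIMREAQYE", "TCIVMREAYE", identity, g=1.0)
print(score)
print(row_x)
print(row_y)
```

In a progressive aligner the same recursion would be applied to profiles (columns of already aligned sequences) instead of single residues, in the order given by the guide tree.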
'Progressive' Alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP
Guide tree
'Progressive' Alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASFQPVAALERIN
WLNYNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”
'Progressive' Alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN------
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”
'Progressive' Alignment
WCEAQTKNGQGWVPSNYITPVN-
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN------
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”
'Progressive' Alignment
WCEAQTKNGQGWVPSNYITPVN--------
WW--RLNDKEGYVPRNLLGLYP--------
AVVIQDNSDIKVVP--KAKIIRD-------
YAVESEA---SVQ--PVAALERIN------
WLN-YNE---ERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”
CLUSTAL W
Most important software program: CLUSTAL W:
J. Thompson, T. Gibson, D. Higgins (1994, Nuc. Acids Res.)
(22,327 citations in the literature as of Oct 2007)
Tools for multiple sequence alignment
Problems with traditional approach:
Results depend on gap penalty
Heuristic guide tree determines the alignment, but the alignment is in turn used for phylogeny reconstruction
Algorithm produces global alignments.
Tools for multiple sequence alignment
Problems with traditional approach:
But:
Many sequence families share only local similarity
E.g. sequences share one conserved motif
Local sequence alignment
Find common motif in sequences; ignore the rest
EYENS
ERYENS
ERYAS
Local sequence alignment
Find common motif in sequences; ignore the rest
E-YENS
ERYENS
ERYA-S
Local sequence alignment
Find common motif in sequences; ignore the rest – Local alignment
E-YENS
ERYENS
ERYA-S
Gibbs Motif Sampler
Local multiple alignment without gaps:
E.g. Gibbs sampling
C.E. Lawrence et al. (1993, Science)
Traditional alignment approaches:
Either global or local methods!
New question: sequence families with multiple local similarities
Neither local nor global methods applicable
New question: sequence families with multiple local similarities
Alignment possible if order conserved
The DIALIGN approach
Morgenstern, Dress, Werner (1996, Proc Natl. Acad. Sci.)
Combination of global and local methods
Assemble multiple alignment from gap-free local pairwise alignments (“fragments”)
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
The DIALIGN approach
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac----------gg-ttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa
Consistency!
The DIALIGN approach
atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
caaa--GAGTATCAcc----------CCTGaaTTGAATaa
The DIALIGN approach
Advantages of segment-based approach:
Program can produce global and local alignments!
Sequence families alignable that cannot be aligned with standard methods
T-COFFEE
C. Notredame, D. Higgins, J. Heringa (2000, J. Mol. Biol.)
Combination of global and local methods
T-COFFEE
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT
T-COFFEE
Input sequences:
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT
Progressive Alignment:
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD ---------THE ---- FA-T CAT
Pairwise Alignments:
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---
SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
SeqA GARFIELD THE LAST FAT CAT
SeqD ---------THE ---- FAT CAT
SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqB GARFIELD THE FAST CAT
SeqD ---------THE FA-T CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD ---------THE ---- FA-T CAT
Mixing Heterogeneous Data With T-Coffee
[Figure: inputs (local alignment, global alignment, multiple alignment, structural, specialist) are combined into one multiple sequence alignment]
T-COFFEE
Idea:
1. Build library of pairwise alignments
2. Alignments of seq i, j and of seq j, k support the alignment of seq i, k.
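The two steps above can be sketched as a "library extension". The sketch assumes that every aligned residue pair from the pairwise alignments carries a weight, and that support through a third residue z counts as min(weight(x,z), weight(z,y)); the slides do not spell out the weighting, so this rule, the data layout and the numbers are assumptions for illustration.

```python
from collections import defaultdict


def extend_library(library):
    """Extend a library of weighted residue pairs.

    library maps (x, y) to a weight, where x and y are residues written as
    (sequence name, position) and each unordered pair appears only once.
    Returns {frozenset({x, y}): direct weight + support via third residues}.
    """
    neighbours = defaultdict(dict)
    for (x, y), w in library.items():
        neighbours[x][y] = w
        neighbours[y][x] = w

    extended = defaultdict(float)
    for (x, y), w in library.items():
        extended[frozenset((x, y))] += w               # direct support
    for z, partners in neighbours.items():             # z: intermediate residue
        items = list(partners.items())
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                (x, wx), (y, wy) = items[i], items[j]
                if x[0] != y[0]:                       # different sequences only
                    extended[frozenset((x, y))] += min(wx, wy)
    return dict(extended)


# Hypothetical weights for one residue from each of three sequences.
a, b, c = ("SeqA", 0), ("SeqB", 0), ("SeqC", 0)
library = {(a, b): 88.0, (b, c): 77.0, (a, c): 77.0}
extended = extend_library(library)
print(extended[frozenset((a, c))])
```

A pair that is supported by alignments through other sequences ends up with a higher weight than a pair that appears in only one pairwise alignment.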
T-COFFEE
Less sensitive to spurious pairwise similarities
Can handle local homologies better than CLUSTAL
Evaluation of multi-alignment methods
Alignment evaluation by comparison to trusted benchmark alignments.
'True' alignment known by information about structure or evolution.
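One simple way to compare a program's output with a benchmark alignment is to ask which fraction of the residue pairs aligned in the reference is also aligned in the test alignment. The sketch below assumes this sum-of-pairs style measure, uses "-" as the gap character and scores all columns; benchmark suites may restrict scoring to selected regions and define further measures. All names are illustrative.

```python
def aligned_pairs(alignment):
    """Set of residue pairs ((seq, pos), (seq, pos)) that share a column."""
    counters = [0] * len(alignment)     # next residue index in each sequence
    pairs = set()
    for column in zip(*alignment):
        present = []
        for k, ch in enumerate(column):
            if ch != "-":
                present.append((k, counters[k]))
                counters[k] += 1
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref)


reference = ["E-YENS", "ERYENS", "ERYA-S"]
shifted = ["-EYENS", "ERYENS", "ERYA-S"]   # first residue of sequence 1 moved
print(sum_of_pairs_score(reference, reference))
print(sum_of_pairs_score(shifted, reference))
```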
Reference alignments (BAliBASE)
1aboA  1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn..............gE
1ycsB  1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede............deiE
1pht   1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA  1 .NFRVYYRDsrd......pvwkGPAKLLWkg.................eG
1vie   1 .drvrkksga.........awqGQIVGWYctnlt.............peG
1aboA 36 WCEAQt..kngqGWVPSNYITPVN......
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......
1pht  51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd.....
1vie  28 YAVESeahpgsvQIYPVAALERIN......
Key: alpha helix RED, beta strand GREEN, core blocks UNDERSCORE
Evaluation of multi-alignment methods
Result: DIALIGN best method for distantly related sequences, T-Coffee best for globally related proteins
Evaluation of multi-alignment methods
Conclusion: no single best multi alignment program!
Advice: try different methods!
Tools for phylogeny reconstruction
Two approaches covered in this course:
Distance methods, e.g. Neighbour-Joining
Maximum Likelihood
Other important methods (not covered in this course):
Maximum parsimony
Bayesian approaches
Tools for phylogeny reconstruction
Phylogenetic trees:
rooted trees
unrooted trees
Many methods produce unrooted trees: find root using outgroup!
Phylogenetic Reconstruction: An Example
Biological question: Are sponges mono- or paraphyletic?
Organism of interest: sponge
1. Build dataset
Query sequence: DNA/protein sequence from a sponge gene
Search for homologs using e.g. BLAST
Hits from search: “putative” homologs
2. Sequence alignment
Use alignment tools to relate the sequences to each other: ClustalW, T-Coffee, DIALIGN, ... many more
3. Estimate phylogeny
Phylogeny methods:
Distance-based: NJ, UPGMA
Parsimony: Max. Parsimony (Phylip/Paup)
Statistical: Max. Likelihood (Phyml), Bayesian Inf. (MrBayes)
4. Interpret results
Hypothesis: Sponges are monophyletic
Tools for phylogeny reconstruction
Distance methods:
For N sequences S1, …, SN: calculate the distance d(i,j) for any two sequences Si and Sj
Goal: find the tree that represents all distances d(i,j) as closely as possible
To calculate the distances d(i,j): construct a multiple alignment of the input sequences and consider the substitutions implied by the alignment
Matrix of pairwise distances d(i,j)
Find tree that corresponds to distances d(i,j)
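The first step, turning an alignment into a matrix of pairwise distances, can be sketched as follows. The slides do not say how substitutions are converted into distances; this sketch assumes DNA sequences, counts the fraction p of differing positions (ignoring columns with gaps) and applies the Jukes-Cantor correction d = -3/4 ln(1 - 4p/3). Function names are illustrative.

```python
import math


def p_distance(row_i: str, row_j: str) -> float:
    """Fraction of differing residues among columns without gaps."""
    compared = differences = 0
    for a, b in zip(row_i, row_j):
        if a == "-" or b == "-":
            continue
        compared += 1
        differences += a != b
    return differences / compared   # assumes at least one gap-free column


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for DNA (assumed model, needs p < 0.75)."""
    if p == 0:
        return 0.0
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment, correct=jukes_cantor):
    """Symmetric matrix of pairwise distances d(i,j)."""
    n = len(alignment)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = correct(p_distance(alignment[i], alignment[j]))
    return d


toy_alignment = ["ACGT", "ACGA", "ACGT"]   # hypothetical toy data
for row in distance_matrix(toy_alignment):
    print([round(v, 3) for v in row])
```

A tree-building method such as Neighbour-Joining or UPGMA would take this matrix as its input.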
Tools for phylogeny reconstruction
Maximum likelihood:
Consider evolution of sequences as a random process. A stochastic model assigns probabilities to substitutions.
Consider tree T as hypothesis about observed sequence data D
Search for the tree with the highest likelihood P(D|T)
Tools for phylogeny reconstruction
Assumptions:
Positions in sequences (columns in alignment) independent of each other
Events on different branches of tree independent of each other
Result: probabilities can be multiplied
Probability P(D|T) can be calculated directly for given residues at internal nodes
Residues at internal nodes are unknown: sum over all possible residues for internal nodes
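The sum over all possible residues at the internal nodes can be carried out node by node instead of enumerating every combination. The sketch below does this for a fixed tree with fixed branch lengths. It assumes DNA data, the Jukes-Cantor substitution model with uniform base frequencies and no gaps in the alignment; the course does not prescribe a model, so these choices and all names are assumptions.

```python
import math

BASES = "ACGT"


def jc_prob(a: str, b: str, t: float) -> float:
    """Jukes-Cantor probability of a -> b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, column):
    """{x: P(residues observed below node | node carries residue x)}."""
    if isinstance(node, int):               # leaf: index of a sequence
        return {x: 1.0 if x == column[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:                   # internal node: (child, length) pairs
        below = partial(child, column)
        for x in BASES:
            # branches are independent, so their probabilities are multiplied
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result


def site_likelihood(tree, column) -> float:
    """P(column | tree), summed over all residues at the internal nodes."""
    root = partial(tree, column)
    return sum(0.25 * root[x] for x in BASES)


def log_likelihood(tree, alignment) -> float:
    """log P(D|T): columns are independent, so log-probabilities are added."""
    return sum(math.log(site_likelihood(tree, col)) for col in zip(*alignment))


# Hypothetical tree ((seq0, seq1), seq2) with made-up branch lengths.
tree = ((((0, 0.1), (1, 0.1)), 0.05), (2, 0.2))
print(log_likelihood(tree, ["ACGT", "ACGA", "ACTT"]))
```

A maximum likelihood program would additionally search over tree topologies and branch lengths to maximise this value.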
Testing the reliability of a tree (or parts of it): the bootstrap approach
Bootstrap in general: repeat a statistical analysis after random “re-sampling”, i.e. by drawing new samples, with replacement, from the original data.
In phylogeny:
1. Randomly select columns from the alignment (with replacement, as many as the alignment has columns) and repeat tree reconstruction with the same method (e.g. 1000 times)
2. Calculate for every branch: how often is it observed in newly constructed trees?
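Both bootstrap steps can be sketched in a few lines. The tree reconstruction itself is left open: `build_splits` is a hypothetical callable, supplied by the user, that takes an alignment and returns the set of branches of the reconstructed tree, each in some hashable form (for example the set of sequences on one side of the branch). Columns are drawn with replacement.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picks) for row in alignment]


def bootstrap_support(alignment, build_splits, replicates=1000, seed=0):
    """Fraction of replicate trees in which each branch is observed.

    build_splits is a hypothetical user-supplied function: alignment -> set
    of hashable branches of the tree reconstructed from that alignment.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(build_splits(bootstrap_alignment(alignment, rng)))
    return {branch: c / replicates for branch, c in counts.items()}


toy_alignment = ["ACGT", "AGGT", "ACTT"]   # hypothetical toy data
print(bootstrap_alignment(toy_alignment, random.Random(1)))
```

Fixing the seed makes the replicates reproducible; the support values are only meaningful if `build_splits` uses the same reconstruction method as the original analysis.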