Methods course Multiple sequence alignment and Reconstruction of phylogenetic trees Burkhard Morgenstern, Fabian Schreiber Göttingen, October/November

Methods course

Multiple sequence alignment and Reconstruction of phylogenetic trees

Burkhard Morgenstern, Fabian Schreiber

Göttingen, October/November 2007

Tools for multiple sequence alignment

Multiple alignment: basis of (almost) all methods for sequence analysis in bioinformatics


T Y I M R E A Q Y E

T C I V M R E A Y E


T Y I - M R E A Q Y E

T C I V M R E A - Y E


T Y I M R E A Q Y E

T C I V M R E A Y E

Y I M Q E V Q Q E

Y I A M R E Q Y E




Y - I - M Q E V Q Q E

Y - I A M R E - Q Y E




- Y I - M Q E V Q Q E


Astronomical Number of possible alignments!



T C I V - M R E A Y E








Which one is the best ???
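To make "astronomical" concrete, here is a minimal sketch that counts the possible alignments of two sequences. It assumes that an alignment is a sequence of columns, each being residue/residue, residue/gap or gap/residue, and that two alignments differ whenever their columns differ. The function name `count_alignments` is made up for this illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of pairwise alignments of sequences with m and n residues.

    Assumption: the last column is either residue/residue, residue/gap
    or gap/residue, which gives a three-way recursion.
    """
    if m == 0 or n == 0:
        return 1  # only one way: align the remaining residues with gaps
    return (count_alignments(m - 1, n - 1)   # last column: residue / residue
            + count_alignments(m - 1, n)     # last column: residue / gap
            + count_alignments(m, n - 1))    # last column: gap / residue

# The two sequences shown above have 10 residues each.
print(count_alignments(10, 10))
```

Even for two sequences of only ten residues the count is in the millions, and it grows exponentially with sequence length and with the number of sequences.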


Questions in development of alignment programs:

(1) What is a good alignment?

→ objective function ("score")

(2) How to find a good alignment?

→ optimization algorithm


What is a biologically good alignment ??


Criteria for alignment quality:

1. 3D-Structure: align residues at corresponding positions in 3D structure of protein!

2. Evolution: align residues with common ancestors!



T C I V M - R E A Y E


- Y I A M R E - Q Y E

An alignment is a hypothesis about sequence evolution

Search for most plausible hypothesis!



T C I V - M R E A Y E


- Y I A M R E - Q Y E



Compute for amino acids a and b:

Probability $p_{a,b}$ of substitution a → b (or b → a)

Frequency $q_a$ of a

Define similarity score s(a,b) based on $p_{a,b}$ and $q_a$

Result: similarity matrix (substitution matrix), e.g. PAM (Dayhoff matrix), BLOSUM, …
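One common way to turn substitution probabilities and residue frequencies into a score is a log-odds ratio. The sketch below assumes this form, s(a,b) = log2(p(a,b) / expected), and estimates both quantities from a toy list of aligned residue pairs. The function name `log_odds_scores` and the input data are made up for this illustration; real PAM and BLOSUM matrices are built with considerably more care.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs):
    """Log-odds similarity scores from a list of aligned residue pairs (a, b).

    Assumption: s(a,b) = log2(observed pair frequency / frequency expected
    by chance from the residue frequencies q_a and q_b).
    """
    pair_counts = Counter(frozenset(pair) for pair in aligned_pairs)  # a->b and b->a pooled
    total_pairs = sum(pair_counts.values())
    residue_counts = Counter(r for pair in aligned_pairs for r in pair)
    total_residues = sum(residue_counts.values())
    q = {a: c / total_residues for a, c in residue_counts.items()}

    scores = {}
    for key, count in pair_counts.items():
        a, b = (tuple(key) * 2)[:2]          # frozenset({a}) stands for the pair (a, a)
        p_ab = count / total_pairs
        expected = q[a] * q[b] if a == b else 2 * q[a] * q[b]
        scores[key] = math.log2(p_ab / expected)
    return scores

# Toy data (hypothetical): leucine and isoleucine in aligned columns.
pairs = [("L", "L"), ("L", "L"), ("L", "I"), ("I", "I")]
for key, value in log_odds_scores(pairs).items():
    print(sorted(key), round(value, 3))
```

Pairs that occur more often than expected by chance get a positive score, pairs that occur less often get a negative one.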



Traditional objective functions:

Define Score of alignments as

Sum of individual similarity scores s(a,b) of aligned amino acid residues

Gap penalty g for each gap in alignment

Optimal alignment can be calculated for two sequences but in practice not for > 8 sequences

Example:

T Y W I V

T - - L V

Score = s(T,T) + s(I,L) + s(V,V) - 2g
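The objective function above can be sketched in a few lines. The code below scores a given pairwise alignment and also computes the optimal score for two sequences by dynamic programming (the Needleman-Wunsch recursion). The similarity function `s` is a toy assumption (+2 for identical residues, -1 otherwise), not a real substitution matrix, and the gap penalty g is charged once per gap position, as in the example.

```python
def s(a, b):
    """Toy similarity score (assumption): +2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1

def alignment_score(row1, row2, g=1):
    """Sum of similarity scores of aligned residues minus g per gap position."""
    assert len(row1) == len(row2), "alignment rows must have equal length"
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total -= g
        else:
            total += s(a, b)
    return total

def optimal_score(x, y, g=1):
    """Score of the best global alignment of x and y (dynamic programming)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -g * i          # prefix of x aligned with gaps only
    for j in range(1, m + 1):
        F[0][j] = -g * j          # prefix of y aligned with gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # residue / residue
                          F[i - 1][j] - g,                          # residue / gap
                          F[i][j - 1] - g)                          # gap / residue
    return F[n][m]

print(alignment_score("TYWIV", "T--LV"))  # the example alignment
print(optimal_score("TYWIV", "TLV"))      # best achievable score
```

The table has (n+1)(m+1) cells for two sequences. For N sequences it would need one dimension per sequence, which is why the exact optimum is out of reach for more than a handful of sequences.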


Most commonly used heuristic for multiple alignment:

Progressive alignment (mid 1980s):

Idea: calculate multiple alignment as a series of pairwise alignments of sequences and profiles

Use guide tree to determine order of pairwise alignments

"Progressive" Alignment

WCEAQTKNGQGWVPSNYITPVN

WWRLNDKEGYVPRNLLGLYP

AVVIQDNSDIKVVPKAKIIRD

YAVESEAHPGSFQPVAALERIN

WLNYNETTGERGDFPGTYVEYIGRKKISP



WWRLNDKEGYVPRNLLGLYP

AVVIQDNSDIKVVPKAKIIRD

YAVESEAHPGSFQPVAALERIN

WLNYNETTGERGDFPGTYVEYIGRKKISP

Guide tree



WW--RLNDKEGYVPRNLLGLYP-

AVVIQDNSDIKVVP--KAKIIRD

YAVESEASFQPVAALERIN

WLNYNEERGDFPGTYVEYIGRKKISP

Profile alignment, “once a gap - always a gap”





YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP



WCEAQTKNGQGWVPSNYITPVN-



YAVESEASVQ--PVAALERIN------

WLN-YNEERGDFPGTYVEYIGRKKISP



WCEAQTKNGQGWVPSNYITPVN--------

WW--RLNDKEGYVPRNLLGLYP--------

AVVIQDNSDIKVVP--KAKIIRD-------

YAVESEA---SVQ--PVAALERIN------

WLN-YNE---ERGDFPGTYVEYIGRKKISP
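The progressive strategy illustrated above can be sketched as follows. This is a simplified illustration, not the CLUSTAL W algorithm: the residue score `s` is a toy assumption, profile columns are compared by their average pairwise score, and instead of a precomputed guide tree the code greedily merges the two groups whose profile alignment scores best. Gaps inserted at one stage are never removed later ("once a gap - always a gap").

```python
from itertools import combinations

def s(a, b):
    """Toy residue score (assumption): gaps 0, identities +2, mismatches -1."""
    if a == "-" or b == "-":
        return 0
    return 2 if a == b else -1

def column_score(c1, c2):
    """Average pairwise score between two profile columns."""
    return sum(s(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def align_profiles(p, q, g=1):
    """Globally align two profiles (lists of equal-length rows); returns (rows, score)."""
    cp, cq = list(zip(*p)), list(zip(*q))
    n, m = len(cp), len(cq)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -g * i
    for j in range(1, m + 1):
        F[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + column_score(cp[i - 1], cq[j - 1]),
                          F[i - 1][j] - g, F[i][j - 1] - g)
    # Traceback: rebuild the merged columns from the end of both profiles.
    gap_p, gap_q = ("-",) * len(p), ("-",) * len(q)
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + column_score(cp[i - 1], cq[j - 1]):
            merged.append(cp[i - 1] + cq[j - 1]); i -= 1; j -= 1
        elif j == 0 or (i > 0 and F[i][j] == F[i - 1][j] - g):
            merged.append(cp[i - 1] + gap_q); i -= 1
        else:
            merged.append(gap_p + cq[j - 1]); j -= 1
    merged.reverse()
    rows = ["".join(col[k] for col in merged) for k in range(len(p) + len(q))]
    return rows, F[n][m]

def progressive_alignment(seqs, g=1):
    """Greedy stand-in for a guide tree: always merge the best-scoring pair of groups."""
    groups = [[x] for x in seqs]
    while len(groups) > 1:
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda ij: align_profiles(groups[ij[0]], groups[ij[1]], g)[1])
        merged, _ = align_profiles(groups[i], groups[j], g)
        groups = [grp for k, grp in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0]

seqs = ["WCEAQTKNGQGWVPSNYITPVN", "WWRLNDKEGYVPRNLLGLYP", "AVVIQDNSDIKVVPKAKIIRD",
        "YAVESEAHPGSFQPVAALERIN", "WLNYNETTGERGDFPGTYVEYIGRKKISP"]
for row in progressive_alignment(seqs):
    print(row)
```

The rows come out in merge order rather than input order. Because each merge is final, an early misalignment cannot be corrected later, which is one of the weaknesses of the approach.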


CLUSTAL W

Most important software program: CLUSTAL W:

J. Thompson, T. Gibson, D. Higgins (1994, Nuc. Acids Res.)

(22,327 citations in the literature!, Oct 2007)


Problems with traditional approach:

Results depend on gap penalty

Heuristic guide tree determines alignment;

alignment used for phylogeny reconstruction

Algorithm produces global alignments.


But:

Many sequence families share only local similarity

E.g. sequences share one conserved motif

Local sequence alignment

Find common motif in sequences; ignore the rest

EYENS

ERYENS

ERYAS


Find common motif in sequences; ignore the rest

E-YENS

ERYENS

ERYA-S


Find common motif in sequences; ignore the rest – Local alignment

E-YENS

ERYENS

ERYA-S

Gibbs Motif Sampler

Local multiple alignment without gaps:

E.g. Gibbs sampling

C.E. Lawrence et al. (1993, Science)
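The idea of Gibbs sampling for gap-free local multiple alignment can be sketched as follows. This is a deliberately simplified illustration and not the published algorithm: the motif width `w` is fixed in advance, the background distribution is assumed to be uniform, and the function name `gibbs_motif_search` and its parameters are made up for this example.

```python
import random
from collections import Counter

def gibbs_motif_search(seqs, w, iters=500, seed=0, pseudo=1.0):
    """Return one motif start position per sequence (gap-free windows of width w)."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    # Start with a random window in every sequence.
    pos = [rng.randrange(len(x) - w + 1) for x in seqs]
    for _ in range(iters):
        k = rng.randrange(len(seqs))                       # sequence to re-sample
        others = [seqs[i][pos[i]:pos[i] + w] for i in range(len(seqs)) if i != k]
        # Position-specific residue frequencies from all other windows.
        profile = []
        for c in range(w):
            counts = Counter(window[c] for window in others)
            total = len(others) + pseudo * len(alphabet)
            profile.append({a: (counts[a] + pseudo) / total for a in alphabet})
        # Probability of each candidate window of sequence k under the profile.
        weights = []
        for start in range(len(seqs[k]) - w + 1):
            prob = 1.0
            for c in range(w):
                prob *= profile[c][seqs[k][start + c]]
            weights.append(prob)
        # Draw the new window at random, proportional to its weight.
        pos[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

seqs = ["EYENS", "ERYENS", "ERYAS"]
starts = gibbs_motif_search(seqs, w=3)
for x, p in zip(seqs, starts):
    print(x[p:p + 3])
```

Because the method is stochastic, different seeds can give different windows. In practice several runs are compared.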

Traditional alignment approaches:

Either global or local methods!

New question: sequence families with multiple local similarities

Neither local nor global methods applicable

Alignment possible if order conserved

The DIALIGN approach

Morgenstern, Dress, Werner (1996, Proc. Natl. Acad. Sci.)

Combination of global and local methods

Assemble multiple alignment from gap-free local pairwise alignments ("fragments")


atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa






















atc------taatagttaaactcccccgtgcttag






caaa--gagtatcacccctgaattgaataa




caaa--gagtatcacc----------cctgaattgaataa


atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg





Consistency!


atc------TAATAGTTAaactccccCGTGC-TTag

cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg

caaa--GAGTATCAcc----------CCTGaaTTGAATaa
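The notion of consistent fragments can be illustrated for two sequences. The sketch below is not the DIALIGN algorithm: it only uses exact matches as fragments, weights them by length, and selects them greedily. DIALIGN itself scores fragments statistically and also handles more than two sequences. All function names are made up for this example.

```python
def exact_fragments(x, y, min_len=3):
    """Maximal gap-free exact matches as (start_x, start_y, length), length >= min_len."""
    frags = []
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] != y[j]:
                continue
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue                      # not the start of a maximal match
            k = 0
            while i + k < len(x) and j + k < len(y) and x[i + k] == y[j + k]:
                k += 1
            if k >= min_len:
                frags.append((i, j, k))
    return frags

def consistent(f, h):
    """Two fragments are consistent if one lies before the other in BOTH sequences."""
    (i1, j1, l1), (i2, j2, l2) = f, h
    return (i1 + l1 <= i2 and j1 + l1 <= j2) or (i2 + l2 <= i1 and j2 + l2 <= j1)

def greedy_consistent(frags):
    """Accept fragments by decreasing length if consistent with all accepted ones."""
    chosen = []
    for f in sorted(frags, key=lambda f: -f[2]):
        if all(consistent(f, h) for h in chosen):
            chosen.append(f)
    return sorted(chosen)

x, y = "qqHELLOwwWORLDee", "HELLOzzzWORLD"
print(greedy_consistent(exact_fragments(x, y)))          # both fragments fit together
print(greedy_consistent(exact_fragments("ABCxxDEF", "DEFxxABC")))  # crossing fragments: only one kept
```

Residues outside the selected fragments stay unaligned, which is why the same procedure can yield local or global alignments depending on how much similarity is found.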


Advantages of segment-based approach:

Program can produce global and local alignments!

Sequence families alignable that cannot be aligned with standard methods

T-COFFEE

C. Notredame, D. Higgins, J. Heringa (2000, J. Mol. Biol.)

Combination of global and local methods

T-COFFEE

SeqA GARFIELD THE LAST FAT CAT

SeqB GARFIELD THE FAST CAT

SeqC GARFIELD THE VERY FAST CAT

SeqD THE FAT CAT

T-COFFEE

Progressive Alignment

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD ---------THE ---- FA-T CAT

Pairwise Alignments

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT ---

SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT

SeqA GARFIELD THE LAST FAT CAT
SeqD ---------THE ---- FAT CAT

SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT

SeqB GARFIELD THE FAST CAT
SeqD ---------THE FA-T CAT

SeqC GARFIELD THE VERY FAST CAT
SeqD ---------THE ---- FA-T CAT

Mixing Heterogeneous Data With T-Coffee

[Figure: diagram with the labels Local Alignment, Global Alignment, Multiple Alignment, Structural, Specialist, T-COFFEE and Multiple Sequence Alignment]

T-COFFEE

Idea:

1. Build library of pairwise alignments

2. Alignment from seq i, j and seq j, k supports alignment from seq i, k.
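The two steps above can be sketched as a "library extension". The code assumes that the library stores a weight for each pair of aligned residues and that support via a third sequence is added as the minimum of the two weights involved. The data structure and the function name `extend_library` are made up for this illustration.

```python
from collections import defaultdict

def extend_library(lib):
    """Add indirect support to a library of aligned residue pairs.

    lib maps frozenset({r1, r2}) -> weight, where each residue is a tuple
    (sequence_name, position) and r1, r2 come from different sequences.
    Assumption: if x is aligned to z and z is aligned to y, the pair (x, y)
    gains min(weight(x, z), weight(z, y)).
    """
    neighbours = defaultdict(dict)
    for pair, w in lib.items():
        a, b = tuple(pair)
        neighbours[a][b] = w
        neighbours[b][a] = w

    extended = defaultdict(float)
    for pair, w in lib.items():
        extended[pair] += w                      # direct support
    for z, partners in neighbours.items():       # z is the intermediate residue
        items = list(partners.items())
        for idx, (x, wx) in enumerate(items):
            for y, wy in items[idx + 1:]:
                if x[0] != y[0]:                 # only pairs from different sequences
                    extended[frozenset((x, y))] += min(wx, wy)
    return dict(extended)

# Toy library (hypothetical weights): residue 1 of SeqA, SeqB and SeqC.
a1, b1, c1 = ("SeqA", 1), ("SeqB", 1), ("SeqC", 1)
lib = {frozenset((a1, b1)): 88, frozenset((b1, c1)): 77}
for pair, weight in extend_library(lib).items():
    print(sorted(pair), weight)
```

In this toy library SeqA and SeqC were never aligned directly, yet their residues receive support through SeqB. A pair that is only similar by chance rarely collects such support from third sequences.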

T-COFFEE

Less sensitive to spurious pairwise similarities

Can handle local homologies better than CLUSTAL

Evaluation of multi-alignment methods

Alignment evaluation by comparison to trusted benchmark alignments.

"True" alignment known by information about structure or evolution.

1aboA   1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn..............gE
1ycsB   1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede............deiE
1pht    1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA   1 .NFRVYYRDsrd......pvwkGPAKLLWkg.................eG
1vie    1 .drvrkksga.........awqGQIVGWYctnlt.............peG

1aboA  36 WCEAQt..kngqGWVPSNYITPVN......
1ycsB  39 WWWARl..ndkeGYVPRNLLGLYP......
1pht   51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA  27 AVVIQd..nsdiKVVPRRKAKIIRd.....
1vie   28 YAVESeahpgsvQIYPVAALERIN......

Key: alpha helix RED, beta strand GREEN, core blocks UNDERSCORE

BAliBASE reference alignments
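Comparison to a reference alignment can be quantified, for example, by the fraction of residue pairs aligned in the reference that are also aligned in the test alignment (a sum-of-pairs score). Whether the evaluation reported here used exactly this measure is an assumption, and the function names below are made up for this sketch.

```python
def aligned_pairs(msa):
    """Set of aligned residue pairs ((row, residue_index), (row, residue_index))."""
    pairs = set()
    seen = [0] * len(msa)                 # residues consumed so far in each row
    for column in zip(*msa):
        residues = []
        for row, ch in enumerate(column):
            if ch != "-":
                residues.append((row, seen[row]))
                seen[row] += 1
        for a in range(len(residues)):
            for b in range(a + 1, len(residues)):
                pairs.add((residues[a], residues[b]))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref) / len(ref)

reference = ["TYI-MREAQYE", "TCIVMREA-YE"]   # gapped alignment from the first example
test = ["TYIMREAQYE", "TCIVMREAYE"]          # the same sequences without gaps
print(sum_of_pairs_score(reference, reference))
print(sum_of_pairs_score(test, reference))
```

Rows must be given in the same order in both alignments, since residues are identified by row number and position within the ungapped sequence.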


Result: DIALIGN best method for distantly related sequences, T-Coffee best for globally related proteins


Conclusion: no single best multi alignment program!

Advice: try different methods!

Tools for phylogeny reconstruction

Two approaches covered in this course:

Distance methods, e.g. Neighbour-Joining

Maximum Likelihood

Other important methods (not covered in this course):

Maximum parsimony

Bayesian approaches


Phylogenetic trees:

rooted trees

unrooted trees

Many methods produce unrooted trees: find root using outgroup!

Phylogenetic Reconstruction: An Example

Biological question: Are sponges mono-/paraphyletic?

Organisms of interest: sponges

1. Build dataset

Query sequence: DNA/protein sequence from a sponge gene

Search for homologs using e.g. BLAST

Hits from search: "putative" homologs

2. Sequence alignment

Use alignment tools to relate the sequences to each other:
- ClustalW
- T-Coffee
- DIALIGN
- ... many more

3. Estimate phylogeny (alignment → phylogenetic tree)

Phylogeny methods:
- Distance-based: NJ, UPGMA
- Parsimony: Max. Parsimony (Phylip/Paup)
- Statistical: Max. Likelihood (Phyml), Bayesian Inf. (MrBayes)

4. Interpret results

Hypothesis: Sponges are monophyletic


Distance methods:

For N sequences S1, …, SN: calculate distance d(i,j) for any two sequences Si and Sj

Goal: find tree that represents all distances d(i,j) as closely as possible

To calculate distances d(i,j) : construct multiple alignment of input sequences, consider substitutions implied by alignment

Matrix of pairwise distances d(i,j)

Find tree that corresponds to distances d(i,j)
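A minimal sketch of both steps: distances are estimated from alignment rows as the fraction of differing positions (an uncorrected distance, chosen here as a simplifying assumption), and a tree is built from a distance matrix with the Neighbour-Joining recursion. The function names and the five-taxon toy matrix are made up for this illustration.

```python
def p_distance(row1, row2):
    """Fraction of mismatches among columns without gaps (uncorrected distance)."""
    cols = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    return sum(a != b for a, b in cols) / len(cols)

def neighbor_joining(names, dist):
    """Return tree edges (node, neighbour, branch_length) for distances dist[a][b]."""
    d = {a: dict(row) for a, row in dist.items()}    # work on a copy
    nodes, edges, next_id = list(names), [], 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):                # pair with the smallest Q value
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"                         # new internal node joining a and b
        next_id += 1
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        edges += [(a, u, len_a), (b, u, d[a][b] - len_a)]
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))                    # connect the last two nodes
    return edges

names = ["a", "b", "c", "d", "e"]
values = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8, ("b", "c"): 10,
          ("b", "d"): 10, ("b", "e"): 9, ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {x: {} for x in names}
for (x, y), v in values.items():
    dist[x][y] = dist[y][x] = v
for edge in neighbor_joining(names, dist):
    print(edge)
```

The result is an unrooted tree given as a list of edges. For real data, distances are usually corrected for multiple substitutions at the same site before tree construction.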


Maximum likelihood:

Consider evolution of sequences as a random process. A stochastic model assigns probabilities to substitutions.

Consider tree T as hypothesis about observed sequence data D

Search for the tree with highest likelihood P(D|T)


Assumptions:

Positions in sequences (columns in alignment) independent of each other

Events on different branches of tree independent of each other

Result: probabilities can be multiplied

Probability P(D|T) for given residues at internal nodes

Consider all possible residues for internal nodes
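The likelihood computation described above can be sketched for DNA sequences. The code assumes the Jukes-Cantor substitution model (all substitutions equally likely) and a tree written as nested lists of (child, branch_length) pairs, where leaves are sequence names. Both choices, and all function names, are assumptions made for this illustration.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of observing b after time t, starting from a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, column):
    """P(observed residues below node | residue x at node), for every x."""
    if isinstance(node, str):                        # leaf: residue is observed
        return {x: 1.0 if x == column[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:                            # branches are independent: multiply
        below = partial(child, column)
        for x in BASES:
            # sum over all possible residues y at the child node
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result

def site_likelihood(tree, column):
    """P(column | tree): sum over all residues at the root, each with frequency 1/4."""
    root = partial(tree, column)
    return sum(0.25 * root[x] for x in BASES)

def log_likelihood(tree, alignment):
    """log P(D|T): columns are independent, so their log-likelihoods add up."""
    names = list(alignment)
    length = len(alignment[names[0]])
    return sum(math.log(site_likelihood(tree, {n: alignment[n][i] for n in names}))
               for i in range(length))

# Hypothetical tree and data: ((s2, s3), s1) with made-up branch lengths.
tree = [("s1", 0.1), ([("s2", 0.2), ("s3", 0.2)], 0.05)]
alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "ACTA"}
print(log_likelihood(tree, alignment))
```

A maximum-likelihood program evaluates this quantity for many candidate trees and branch lengths and keeps the best one. The search over trees is not shown here.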

Testing the reliability of a tree (or parts of it): the bootstrap approach

Bootstrap in general: repeat statistical test after random “re-sampling”, i.e. by drawing new samples (with replacement) from the observed data.

In phylogeny:

1. Randomly select columns from the alignment (with replacement) and repeat tree reconstruction with the same method (e.g. 1000 times)

2. Calculate for every branch: how often is it observed in newly constructed trees?
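The two steps can be sketched as follows. Columns are drawn with replacement until the resampled alignment has the original length, and the support of a grouping is the fraction of replicates in which it reappears. The function `closest_pair` is a toy stand-in for a real tree reconstruction method and reports only a single grouping. It and the other names are assumptions made for this illustration.

```python
import random
from collections import Counter
from itertools import combinations

def resample_columns(rows, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(rows[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in cols) for row in rows]

def closest_pair(rows):
    """Toy 'tree reconstruction' (assumption): the pair of rows with fewest mismatches."""
    def mismatches(i, j):
        return sum(a != b for a, b in zip(rows[i], rows[j]))
    pair = min(combinations(range(len(rows)), 2), key=lambda p: mismatches(*p))
    return {frozenset(pair)}

def bootstrap_support(rows, build=closest_pair, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each grouping is observed."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(build(resample_columns(rows, rng)))
    return {group: n / replicates for group, n in counts.items()}

rows = ["AAAAAAAA", "AAAAAAAA", "CCCCCCCC", "GGGGTTTT"]
print(bootstrap_support(rows, replicates=100))
```

With a real method, `build` would return all branches (splits) of the reconstructed tree, and every branch of the original tree would be annotated with its support value.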
