Computational Biology and Bioinformatics
What is a multiple sequence alignment?
Wednesday 13 February 2013
What is an MSA?
• A multiple sequence alignment (MSA) is an alignment of more than two protein sequences that allows for a better identification of the similarities or detection of evolutionarily conserved regions, ...
© Tom Lenaerts ULB
Why MSA?
• An alignment of 2 sequences is a hypothesis that is supported or rejected by the score one obtains for the alignment
• Yet the score does not guarantee that there really is a common ancestor
• Moreover, it is not guaranteed that the elements are correctly aligned
• We can reduce this uncertainty by aligning additional sequences
• A multiple sequence alignment (MSA) algorithm performs this task and can clearly show:
• The similarity between positions
• The conservation of certain amino acids within the family of homologous proteins
Why MSA? 2
The conserved regions: identical residues are shown in green, elements with the same properties in blue
Improve the alignment between two sequences
Available tools
• CLUSTALW http://www.clustal.org
• TCOFFEE http://www.ebi.ac.uk/Tools/t-coffee/
• MUSCLE http://www.ebi.ac.uk/Tools/msa/muscle/
The problem
• Determining the alignment between N sequences is a hard problem: a combinatorial optimization problem (COP)
• To solve this COP we need to provide two systems:
1. Find the best way of calculating the score of the alignment between multiple sequences
2. Find the algorithm that can find the MSA with the optimal (or close to optimal) score
• These algorithms are again global or local
The scoring system
How can we calculate the score of an MSA?
The total score:
$S(m) = \sum_i S(m_i)$
So the assumption is made that the score of one column is independent of the other columns
The score of one column (sum-of-pairs or SP):
$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
where $m_i^k$ is the residue of sequence k in column i and $s(m_i^k, m_i^l)$ is the score from the substitution matrix
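The sum-of-pairs score can be sketched in a few lines of Python. This is only an illustration: `toy_score` is a hypothetical stand-in for a real substitution matrix such as PAM250, and the gap handling (residue against gap costs 2, gap against gap costs nothing) is an assumption that the slides do not specify.

```python
from itertools import combinations

def toy_score(a: str, b: str) -> int:
    """Hypothetical stand-in for a substitution matrix such as PAM250."""
    if a == "-" and b == "-":
        return 0           # gap against gap is ignored (assumption)
    if a == "-" or b == "-":
        return -2          # residue against gap (assumed penalty)
    return 1 if a == b else -1

def column_score(column, s=toy_score):
    """Sum-of-pairs score S(m_i): every pair of rows in the column is scored once."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def msa_score(msa, s=toy_score):
    """Total score S(m): columns are scored independently and summed."""
    return sum(column_score(col, s) for col in zip(*msa))

msa = ["GATTGTA", "GATGGTA", "GAT-GTA"]
print(msa_score(msa))  # six identical columns (3 each) plus one mixed column (-5)
```

Because the columns are assumed independent, the total is a plain sum over `zip(*msa)`; affine gap costs would break that independence and need a different treatment.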
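The scoring system 2
SP and its alternatives
Sum-of-pairs: $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
Comparison with the first sequence only: $S(m_i) = \sum_{k>1} s(m_i^1, m_i^k)$
Weighted sum-of-pairs: $S(m_i) = \sum_{k<l} w_{k,l}\, s(m_i^k, m_i^l)$
Minimum entropy / maximum likelihood: $S(m_i) = \sum_k f_{ki} \ln(f_{ki})$ (sum over the residue types k, with $f_{ki}$ their frequency in column i)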
Global methods
Smith-Waterman and Needleman-Wunsch could be used to align more than 2 sequences
BUT: the approach is far from efficient since a lot of computational resources are required (assume sequences of length L = 200). The time complexity is $O(2^n L^n)$ for n sequences:
Number of sequences    Operations
2                      $2^2 \times 200^2$ = 0.16M
3                      $2^3 \times 200^3$ = 64M
4                      $2^4 \times 200^4$ = 25600M
6                      ...
Global methods 2
Smith-Waterman and Needleman-Wunsch could be used to align more than 2 sequences
BUT: the approach is far from efficient since a lot of memory is required to store the matrix (assume size = 200).
Number of sequences    Memory (1 byte/position)
2                      40000 bytes
3                      7.63 Mbytes
4                      1.5 Gbytes
6                      60000 Gbytes
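The growth in time and memory can be checked with a short calculation. The sketch below assumes, as on the slides, sequences of length 200, a cost of $2^n L^n$ steps and one byte per cell of the n-dimensional matrix; the function name `dp_cost` is only illustrative.

```python
def dp_cost(n_seq: int, length: int = 200):
    """Return (steps, bytes) for full dynamic programming over n_seq sequences."""
    cells = length ** n_seq          # one matrix cell per combination of positions
    steps = (2 ** n_seq) * cells     # O(2^n L^n) predecessor checks
    return steps, cells              # 1 byte per cell, so bytes == cells

for n in (2, 3, 4, 6):
    steps, cells = dp_cost(n)
    print(f"n={n}: {steps / 1e6:g}M steps, {cells / 2**20:.2f} MiB")
```

For three sequences this gives 64M steps and about 7.63 MiB, and for six sequences the matrix alone needs roughly 60000 GiB, which is why exact dynamic programming is only practical for a handful of sequences.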
Global methods 3
• Well-known methods:
• An optimized version of the standard DP algorithms (the MSA system)
• limited applicability but exact
• Progressive alignment (the CLUSTAL system)
• Stochastic methods for multiple sequence alignment (the SAGA system)
Global methods 4
Progressive alignment (the CLUSTAL system)
We already implemented step (A)
By next week step (B)
Progressive alignment
A progressive alignment method is a heuristic, meaning that we have no guarantee to find the optimal solution
3 stages:
1. Calculate a distance matrix between all sequence pairs
2. Use this matrix to construct a guide tree
3. Use the ordering provided by the tree to align the sequences into an MSA
Progressive alignment 3
How to construct the tree (more later)
First, put the closest sequences in the same group (e.g. 1)
Afterwards, group:
A. the next two sequences that are close (e.g. 2)
B. a sequence with a previously constructed group (e.g. 4)
C. two groups (e.g. 3)
[Figure: guide tree over SEQ1 to SEQ5 with the grouping steps labelled 1 to 4]
Progressive alignment 4
The tree determines the order in which the elements are added to the MSA
Step 1:
GATTGTAGTA
GATGGTAGTA
Step 2:
GATTGTTC--GTA
GATTGTTCGGGTA
Step 3:
GATTGTA---GTA
GATGGTA---GTA
GATTGTTC--GTA
GATTGTTCGGGTA
Step 4:
GATTGTA-----GTA
GATGGTA-----GTA
GATTGTTC----GTA
GATTGTTCGG--GTA
GATGGTAGGCGTGTA
Progressive alignment 5
The Feng and Doolittle system for progressive alignment:
D.-F. Feng and R.F. Doolittle (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351-360
It was constructed out of the concern that the MSA systems of that time shuffled or removed previously created gaps, which may be plausible from an optimization perspective but not from a biological perspective: “once a gap, always a gap”
The system is composed of 6 functions, of which we only discuss the essential ones: SCORE, BORD and DFAlign
Progressive alignment 6
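SCORE: pairwise alignment and calculation of the distance matrix
$D_{ij} = -\ln \frac{S_{ij} - S_{rand}}{S_{iden} - S_{rand}}$
The ratio can also be written as a percentage (×100); the worked example that follows uses the unscaled ratio.
$S_{ij}$: the alignment score (using for instance PAM250)
$S_{iden} = \frac{S_{ii} + S_{jj}}{2}$
$S_{rand}$: the alignment score for two random sequences with the same amino acid composition and the same length:
$S_{rand} = \frac{1}{L} \sum_a \sum_b S(a,b)\, N_i(a)\, N_j(b) - N(g) \cdot g_{penalty}$
with L the alignment length, $N_i(a)$ the number of residues a in sequence i and N(g) the number of gaps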
Progressive alignment 6
For instance, take the following 4 segments obtained from 4 proteins of the I-immunoglobulin family
SCORE
x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x2 RDPVKTHEGWGVMLPCNPPAHYPGLSYRWLLNEFPNFIPTDGRHFVSQTT
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPLASQNRVEVLA
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTEILGNSTRVTVTSD
the substitution matrix is PAM250
gpenalty = 8
Pairwise alignments include an alignment between the sequence and itself (DP)
Progressive alignment 7
SCORE
x1 ILDMDVVEGSAARFDCKVEG-YPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x2 RDPVKTHEGWGVMLPCNPPAHYPGLSYRWLLNEFPNFIPTD-GRHFVSQTT

x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-

x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTE-ILGNST-RV--TVTSD
Alignment 1 : S12=31
Alignment 2 : S13=44
Alignment 3 : S14=13
...
S11=262, S22=287 ...
Progressive alignment 8
SCORE
Sij     x1      x2      x3      x4
x1      262     31      44      13
x2              287     15      16
x3                      222     45
x4                              215

Srand   x1      x2      x3      x4
x1              -66.94  -80.28  -70.48
x2                      -82.86  -72.52
x3                              -37.85
x4

Dij     x1      x2      x3      x4
x1      0       1.25    0.95    1.31
x2              0       1.24    1.30
x3                      0       1.13
x4                              0

$D_{ij} = -\ln \frac{S_{ij} - S_{rand}}{S_{iden} - S_{rand}}$
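The distance table can be recomputed from the two score tables. The sketch below copies the Sij and Srand values from the slides; the function name `fd_distance` and the dictionary layout are only illustrative choices.

```python
import math

self_score = {"x1": 262, "x2": 287, "x3": 222, "x4": 215}        # S_ii
pair_score = {("x1", "x2"): 31, ("x1", "x3"): 44, ("x1", "x4"): 13,
              ("x2", "x3"): 15, ("x2", "x4"): 16, ("x3", "x4"): 45}
rand_score = {("x1", "x2"): -66.94, ("x1", "x3"): -80.28, ("x1", "x4"): -70.48,
              ("x2", "x3"): -82.86, ("x2", "x4"): -72.52, ("x3", "x4"): -37.85}

def fd_distance(s_ij, s_ii, s_jj, s_rand):
    """D_ij = -ln((S_ij - S_rand) / (S_iden - S_rand)), S_iden = (S_ii + S_jj) / 2."""
    s_iden = (s_ii + s_jj) / 2
    return -math.log((s_ij - s_rand) / (s_iden - s_rand))

dist = {
    (i, j): fd_distance(s, self_score[i], self_score[j], rand_score[(i, j)])
    for (i, j), s in pair_score.items()
}
for pair, value in dist.items():
    print(pair, round(value, 2))
```

Rounded to two decimals this reproduces the Dij table (1.25, 0.95, 1.31, 1.24, 1.30, 1.13), which confirms that the tabulated distances use the ratio without the factor 100.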
Progressive alignment 9
BORD
Use the Fitch-Margoliash (FM) algorithm to construct the guide tree
W.M. Fitch and E. Margoliash (1967) Construction of phylogenetic trees, Science 155(3760):279-284
At each step, join the sequences or groups of sequences with the smallest distance between them and recalculate the distance between the new group and the remaining sequences (or groups)
FM algorithm
This algorithm produces unrooted additive trees
It is based on the analysis of a three-leaf tree (leaves A, B and C joined at a single internal node)
The distances between the pairs of sequences determine the branch lengths b1, b2 and b3:
$b_1 = \frac{1}{2}(d_{AB} + d_{AC} - d_{BC})$
$b_2 = \frac{1}{2}(d_{AB} + d_{BC} - d_{AC})$
$b_3 = \frac{1}{2}(d_{AC} + d_{BC} - d_{AB})$
which shows the additive nature of the tree: $d_{AB} = b_1 + b_2$, $d_{AC} = b_1 + b_3$ and $d_{BC} = b_2 + b_3$
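As a quick check of the three-leaf equations, the following sketch computes the branch lengths for made-up distances (the values 5, 7 and 8 are purely illustrative) and verifies additivity.

```python
def three_leaf_branches(d_ab: float, d_ac: float, d_bc: float):
    """Branch lengths (b1, b2, b3) of the leaves A, B, C in a three-leaf tree."""
    b1 = (d_ab + d_ac - d_bc) / 2
    b2 = (d_ab + d_bc - d_ac) / 2
    b3 = (d_ac + d_bc - d_ab) / 2
    return b1, b2, b3

b1, b2, b3 = three_leaf_branches(5, 7, 8)
print(b1, b2, b3)                      # branch lengths of A, B and C
print(b1 + b2, b1 + b3, b2 + b3)       # recovers d_AB, d_AC, d_BC
```

Note that a branch length becomes negative as soon as one distance exceeds the sum of the other two, for example with distances 1, 1 and 3.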
FM algorithm
The guide tree is produced in a stepwise fashion
At every stage three clusters are defined, with all sequences belonging to one of the clusters
The first two clusters contain the two elements which have the smallest evolutionary distance and the third cluster contains the rest of the sequences
The branch length between the third cluster and each of the other two is calculated using the equations of the three-leaf tree, including the location of the internal node that connects them
The first and the second cluster are then combined into a new cluster
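The stepwise procedure can be sketched as follows. Several details are assumptions rather than something the slides fix: the distance between two clusters is taken as the mean of all leaf-to-leaf distances, `depth` tracks the mean distance from a cluster's internal node to its leaves so that internal branch lengths can be obtained by subtraction, and the result is returned as a Newick-like string. The function name and the four-leaf test distances are illustrative.

```python
from itertools import combinations

def fitch_margoliash(names, dist):
    """Sketch of the FM guide tree; dist maps frozenset({a, b}) to a distance."""
    def d(c1, c2):
        # Assumed rule: cluster distance = mean of all leaf-to-leaf distances.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    label = {(n,): n for n in names}    # cluster (tuple of leaves) -> subtree text
    depth = {(n,): 0.0 for n in names}  # mean distance from cluster node to its leaves
    while len(label) > 2:
        # The two closest clusters; everything else forms the third cluster.
        x, y = min(combinations(label, 2), key=lambda p: d(*p))
        rest = tuple(leaf for c in label if c not in (x, y) for leaf in c)
        d_xy, d_xw, d_yw = d(x, y), d(x, rest), d(y, rest)
        mx = (d_xy + d_xw - d_yw) / 2   # three-leaf equations
        my = (d_xy + d_yw - d_xw) / 2
        new = x + y
        text = f"({label.pop(x)}:{mx - depth[x]:.2f},{label.pop(y)}:{my - depth[y]:.2f})"
        label[new] = text
        depth[new] = (len(x) * mx + len(y) * my) / len(new)
    x, y = label                        # the two remaining clusters
    middle = d(x, y) - depth[x] - depth[y]
    return f"({label[x]}:{middle:.2f},{label[y]});"

raw = {"AB": 2, "AC": 5, "AD": 6, "BC": 5, "BD": 6, "CD": 3}
dist = {frozenset(k): v for k, v in raw.items()}
print(fitch_margoliash(list("ABCD"), dist))
```

The test distances come from an additive tree with leaf branches A=1, B=1, C=1, D=2 and an internal branch of length 3, and the sketch recovers exactly those lengths. When the closest pair is not a true pair of neighbours, the same code can produce negative branch lengths.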
FM algorithm
The three clusters are initially {A}, {C} and W={B,D,E}, where W contains the remaining elements
Afterwards {A} and {C} become the cluster X={A,C}
[Figure panels: calculate branch lengths; recalculate distances]
FM algorithm
The three clusters are now X, {B} and Y={D,E}, where Y contains the remaining elements
Afterwards X and {B} become the cluster Z={B,{A,C}}
[Figure panels: calculate branch lengths; recalculate distances]
FM algorithm
We immediately see a weakness of this method in this example:
If there are different evolutionary rates along different tree branches, the two closest elements may not really be neighbors
Here this has led to a negative branch length, which cannot arise in a true evolutionary history
FM algorithm
The same problem becomes apparent when one compares the distances measured on the tree itself (patristic distances) with the original distances and looks at the errors that follow from that
The neighbour-joining (NJ) algorithm solves this problem
Progressive alignment 11
DFAlign: use the tree to construct the MSA
First step:
x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-
Second step:
x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTEIL-GNST-RV--TVTSD
Third step:
x1 ILDMDVVEGSAARFDCKVEG-YPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAG-KPRPMVRWLRNGEPL-ASQN-RV--EVLA-
x4 RRLIPAARGGEISILCQPRA-APKATILWSKGTEIL-GNST-RV--TVTSD
x2 RDPVKTHEGWGVMLPCNPPAHYPGLSYRWLLNEFPNFIPTD-GRHFVSQTT
[Figure: guide tree with leaves x1, x3, x4 and x2 and internal nodes 5 and 6]
The outcome depends on the kind of substitution matrix and the gap penalty g
How to combine?
• Align the new sequence to a profile (PSSM)
• Make a consensus sequence and align to this consensus sequence
x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-
co ISDTEAIGSNLRWGCWWWKPRPMVRWLRNGEPLXASQNXRVXXEVLAX
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTEILGNSTRVTVTSD
Profiles 2
Profiles store general properties of a set of sequences:
1.) Frequency information concerning the residues in each column of the MSA
2.) The evolutionary importance of every residue in that position
Take for instance:
TGVEAENLLL
PRAKAEEMLS
GRKDAERQLL
$f_{1,R} = 2/3$, $f_{9,S} = 1/3$, $f_{5,E} = 3/3$ (columns numbered from 0)
$f_{u,b} = \frac{n_{u,b}}{N_{seq}}$
or, on a logarithmic scale,
$f_{u,b} = \frac{\ln(1 - n_{u,b}/(N_{seq}+1))}{\ln(1/(N_{seq}+1))}$
The frequencies are:
Profiles 3
[Figure: matrix of the frequencies $f_{u,b}$ for the example; the rows are all possible amino acids (R H K D E S T N Q C G P A I L M F W Y V), the columns are the positions 0 to 9 in the MSA; the non-zero entries are 0.333, 0.667 and 1]
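The column frequencies $f_{u,b} = n_{u,b}/N_{seq}$ of the three example sequences can be computed as in the sketch below. Columns are numbered from 0, and residues that do not occur in a column are simply absent from its dictionary (an implementation choice, equivalent to a frequency of 0).

```python
from collections import Counter

msa = ["TGVEAENLLL", "PRAKAEEMLS", "GRKDAERQLL"]

def column_frequencies(msa):
    """Return one dict per column mapping residue -> n_{u,b} / N_seq."""
    n_seq = len(msa)
    return [
        {residue: count / n_seq for residue, count in Counter(column).items()}
        for column in zip(*msa)
    ]

freqs = column_frequencies(msa)
print(freqs[1])   # column 1 holds G, R, R
print(freqs[5])   # column 5 holds E, E, E
```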
Profiles 4
A profile stores, for each column in the MSA, the frequency multiplied by the alignment score with every other amino acid (AA):
$m_{u,a} = \sum_{b \in \{AA\}} f_{u,b}\, s_{a,b}$
Thus $m_{u,a}$ is actually an alignment score between a residue a and the column u
Alternatively, a log-odds score can be used:
$m_{u,a} = \log \frac{q_{u,a}}{p_a}$
This method is only useful when there are sufficient sequences and all amino acids appear at least once
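The first definition, $m_{u,a} = \sum_b f_{u,b}\, s_{a,b}$, can be illustrated for one column. In this sketch `toy_s` is a hypothetical substitution score (+2 for a match, -1 otherwise) used in place of a real PAM or BLOSUM matrix, so the numbers differ from those on the slides.

```python
def toy_s(a: str, b: str) -> int:
    """Hypothetical substitution score: +2 for a match, -1 otherwise."""
    return 2 if a == b else -1

def profile_score(column_freqs: dict, a: str, s=toy_s) -> float:
    """m_{u,a} = sum over b of f_{u,b} * s(a, b), for a single column u."""
    return sum(f * s(a, b) for b, f in column_freqs.items())

column_1 = {"G": 1 / 3, "R": 2 / 3}   # column 1 of the example alignment (G, R, R)
for residue in "RGK":
    print(residue, round(profile_score(column_1, residue), 3))
```

A residue that is frequent in the column (R) scores highest, a rarer one (G) scores around zero, and a residue that never occurs (K) gets the mismatch score.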
Profiles 5
The probability $p_a$ is again the background probability of finding amino acid a in any of the positions
Information obtained from SwissProt
Profiles 6
$m_{u,a} = \sum_{b \in \{AA\}} f_{u,b}\, s_{a,b}$
Worked examples for R:
$m_{0,R}$ = 0.333 (-1) + 0.333 (0) + 0.333 (-1) = -0.666
$m_{1,R}$ = 0.667 (5) + 0.333 (0) = 3.335
$m_{2,R}$ = 0.333 (4) + 0.333 (-3) + 0.333 (-1) = 0
The resulting profile (rows: amino acids R H K D E S T N Q C G P A I L M F W Y V; columns: positions 0 to 9; last row: penalties):
R     -0.666  3.335   0       1       -1   0    1.665   -1     -3   -2.334
H     -1      0.668   -1.332  0.666   -2   0    1       -1     -3   -2.334
...   ...     ...     ...     ...     ...  ...  ...     ...    ...  ...
V     -1      -1.666  0.666   -2.331  1    -2   -2.331  0.333  2    1
+/-   9       9       9       9       9    9    9       9      9    9
Without the penalty line, the matrix is called a PSSM (position-specific scoring matrix)
Profiles 7
The score $m_{u,a}$ is the score for aligning a residue a with the column u. We can therefore use the same DP algorithms as before to align a sequence to a profile
The biggest problem in creating profiles is that the number of sequences can be too small, and as a consequence not every amino acid is present in each column:
log 0 = -∞
As a result the alignment cannot be made using the log-formula (see PAM and BLOSUM discussion)
→ PSEUDOCOUNTS
Profiles 8
Pseudocounts are constants that one adds to the values in the profile to avoid zeros
$q_{u,a} = \frac{n_{u,a} + 1}{N_{seq} + 20}$
Thus, $q_{u,a}$ will never be 0!
$q_{u,a} = \frac{n_{u,a} + \beta p_a}{N_{seq} + \beta}$
β is a weighting factor that tunes the importance of the pseudocounts in the value of $q_{u,a}$, for instance $\beta = \sqrt{N_{seq}}$
Pseudocounts provide the prior information that we have on the amino acids
Profiles 9
We can further generalize this function and introduce $f_{u,a}$ directly:
$q_{u,a} = \frac{\alpha f_{u,a} + \beta p_a}{\alpha + \beta}$
α is also a weighting factor that tunes the importance of the actual data. One sometimes uses $\alpha = N_{seq} - 1$
When there is no data (no sequence) then the pseudocounts will determine the values in the profiles
Therefore, pseudocounts correspond to the prior distribution, which corresponds to what we know about the system before introducing the data
Profiles 10
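$q_{u,a} = \frac{n_{u,a} + \beta p_a}{N_{seq} + \beta}$ and $m_{u,a} = \log \frac{q_{u,a}}{p_a}$, with β = 1
Worked examples for R:
$q_{0,R} = \frac{0.06}{4}$, $m_{0,R} = \log \frac{0.04}{0.06}$
$q_{1,R} = \frac{2.06}{4}$, $m_{1,R} = \log \frac{0.13}{0.06}$
$q_{6,R} = \frac{1.06}{4}$, $m_{6,R} = \log \frac{0.09}{0.06}$
The resulting profile (rows: amino acids R H K D E S T N Q C G P A I L M F W Y V; columns: positions 0 to 9; last row: penalties):
R     -0.65  0.934  -0.65  -0.65  -0.65  -0.65  0.645  -0.65  -0.65  -0.65
H     -0.60  -0.60  -0.60  -0.60  -0.60  -0.60  -0.60  -0.60  -0.60  -0.60
...   ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
V     -0.60  -0.60  0.582  -0.60  -0.60  -0.60  -0.60  -0.60  -0.60  -0.60
+/-   9      9      9      9      9      9      9      9      9      9
Note that the substitution score is not taken into account here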
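The pseudocount formula and the log-odds score can be sketched as below. The background probability $p_R = 0.06$ and β = 1 are read off the worked example, while the base-10 logarithm is an assumption; with that choice the sketch reproduces the entries 0.934 and 0.645 in the row for R.

```python
import math

def pseudocount_prob(n_ua: int, n_seq: int, p_a: float, beta: float = 1.0) -> float:
    """q_{u,a} = (n_{u,a} + beta * p_a) / (N_seq + beta)."""
    return (n_ua + beta * p_a) / (n_seq + beta)

def log_odds(q_ua: float, p_a: float) -> float:
    """m_{u,a} = log(q_{u,a} / p_a); base 10 is an assumption."""
    return math.log10(q_ua / p_a)

p_r = 0.06                              # background probability of R
for column, count in ((0, 0), (1, 2), (6, 1)):
    q = pseudocount_prob(count, n_seq=3, p_a=p_r)
    print(column, round(q, 3), round(log_odds(q, p_r), 3))
```

Even for a residue that never occurs in a column (count 0) the probability stays positive, so the logarithm is always defined.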
Profiles 11
How can we make use of the information inside the substitution matrices?
Each log-odds score contains information concerning the probability of aligning two specific amino acids:
$\frac{q_{a,b}}{p_a p_b} = e^{\lambda s(a,b)}$
If a column u contains a fraction $f_{u,b}$ of amino acids of type b, the probability of finding an alignment with an amino acid a is proportional to
$\frac{q_{a,b}}{p_a p_b} f_{u,b}$
The sum of all these probabilities (over b) produces the total probability for a
Profiles 12
Multiplying this probability for column u with $p_a$ provides a better pseudocount for a:
$g_{u,a} = \sum_b \frac{q_{a,b}}{p_b} f_{u,b}$
As a result, the equation used for $q_{u,a}$ now becomes
$q_{u,a} = \frac{\alpha f_{u,a} + \beta g_{u,a}}{\alpha + \beta}$
These $g_{u,a}$ values can be recovered from the substitution scores listed in the PAM or BLOSUM matrices
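A toy example makes the effect of these substitution-based pseudocounts visible. Everything numeric below is an illustrative assumption: a two-letter alphabet, a made-up joint probability table $q_{a,b}$, α = 2 (that is, $N_{seq} - 1$ for three sequences) and β = 1.

```python
# Toy two-letter alphabet; all numbers are illustrative assumptions.
alphabet = ["A", "B"]
pair_prob = {("A", "A"): 0.4, ("A", "B"): 0.1,
             ("B", "A"): 0.1, ("B", "B"): 0.4}                    # q_{a,b}
background = {b: sum(pair_prob[(a, b)] for a in alphabet) for b in alphabet}  # p_b

def matrix_pseudocount(col_freqs: dict, a: str) -> float:
    """g_{u,a} = sum over b of (q_{a,b} / p_b) * f_{u,b}."""
    return sum(pair_prob[(a, b)] / background[b] * f for b, f in col_freqs.items())

def profile_prob(col_freqs: dict, a: str, alpha: float, beta: float) -> float:
    """q_{u,a} = (alpha * f_{u,a} + beta * g_{u,a}) / (alpha + beta)."""
    g = matrix_pseudocount(col_freqs, a)
    return (alpha * col_freqs.get(a, 0.0) + beta * g) / (alpha + beta)

column = {"A": 1.0}                     # a column in which only A was observed
q = {a: profile_prob(column, a, alpha=2, beta=1) for a in alphabet}
print(q)
```

Although B was never observed in the column, it receives a small probability that reflects how often B substitutes for A in the toy table, and the probabilities still sum to 1.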
Aligning to a profile
Simply use either Needleman-Wunsch (L3) or Smith-Waterman (L3) to make the alignment
profile (built from):
TGVEAENLLL
PRAKAEEMLS
GRKDAERQLL
sequence:
SRNAAEYLLS
The profile now contains the alignment scores
The major difficulty lies in the way in which gap penalties are assigned.
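A minimal sketch of a Needleman-Wunsch alignment of a sequence against a profile is given below. It sidesteps the gap-penalty question with a single linear penalty (assumed to be -2), and `toy_s` is again a hypothetical substitution score; lower-case letters in the output mark residues inserted relative to the profile.

```python
from collections import Counter

def toy_s(a, b):
    """Hypothetical substitution score: +2 for a match, -1 otherwise."""
    return 2 if a == b else -1

def build_profile(msa):
    """One dict of residue frequencies per column."""
    return [{r: n / len(msa) for r, n in Counter(col).items()} for col in zip(*msa)]

def align_to_profile(profile, seq, s=toy_s, gap=-2.0):
    """Global alignment of seq to the profile; returns (score, aligned sequence)."""
    def m(u, a):                        # score of residue a against column u
        return sum(f * s(a, b) for b, f in profile[u].items())

    n, k = len(profile), len(seq)
    F = [[0.0] * (k + 1) for _ in range(n + 1)]
    for u in range(1, n + 1):
        F[u][0] = u * gap
    for j in range(1, k + 1):
        F[0][j] = j * gap
    for u in range(1, n + 1):
        for j in range(1, k + 1):
            F[u][j] = max(F[u - 1][j - 1] + m(u - 1, seq[j - 1]),
                          F[u - 1][j] + gap,     # column aligned to a gap
                          F[u][j - 1] + gap)     # residue inserted
    aligned, u, j = [], n, k
    while u > 0 or j > 0:                        # traceback
        if u > 0 and j > 0 and F[u][j] == F[u - 1][j - 1] + m(u - 1, seq[j - 1]):
            aligned.append(seq[j - 1]); u -= 1; j -= 1
        elif j == 0 or (u > 0 and F[u][j] == F[u - 1][j] + gap):
            aligned.append("-"); u -= 1
        else:
            aligned.append(seq[j - 1].lower()); j -= 1
    return F[n][k], "".join(reversed(aligned))

profile = build_profile(["TGVEAENLLL", "PRAKAEEMLS", "GRKDAERQLL"])
score, aligned = align_to_profile(profile, "SRNAAEYLLS")
print(score, aligned)
```

With these toy parameters the fourth sequence aligns to the profile without gaps. Position-specific or affine gap penalties, as discussed in the Gotoh articles, would replace the single `gap` value.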
Remember
• In the articles by Osamu Gotoh, a detailed analysis is made of how to align groups of sequences.
O. Gotoh (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment CABIOS 9(3):361-370
O. Gotoh (1994) Further improvement in methods of group-to-group sequence alignment with generalized profile operations CABIOS 10(4):379-387
• For next week, read the articles and 2 people provide a presentation in which they discuss the work in those articles and how to use them.
• All other students also examine the articles since they will need to help in deciding which one we will implement in the project.
Progressive alignment 12
This method suffers from 2 problems:
1. The local minimum problem: new sequences are added on top of existing alignments. Hence any early error (misaligned regions) propagates to the complete alignment (due to divergence between sequences)
2. The choice of alignment parameters: one selects a scoring matrix, a gap-open and a gap-extension value. This often works for highly related sequences, yet starts to fail seriously as soon as sequences diverge
CLUSTAL W tried to resolve this problem
Progressive alignment 13
Parameter adjustments introduced by CLUSTAL W:
J.D. Thompson, D.G. Higgins and T.J. Gibson (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680
Dynamically varying gap penalties in a residue- and position-specific manner
Information concerning the probability of finding a gap next to one of the 20 amino acids is used to locally adjust the gap-opening penalty
Short stretches of hydrophilic residues usually indicate loop or random coil regions, requiring reduction in the gap-opening penalty
...
Progressive alignment 14
Parameter adjustments introduced by CLUSTAL W:
Scoring matrices are dynamically used depending on the divergence of the sequences to be aligned at each stage
Sequences are weighted to correct for unequal sampling across all evolutionary distances in the data set
Frequently occurring sequences are down-weighted (weights are calculated from the branch lengths of the guide tree, constructed using the neighbor-joining method)
Assignment 2
• Second step, to complete by our meeting next week
• Implement your own version of the Fitch-Margoliash algorithm
• Use the distance calculation explained for the Feng and Doolittle progressive alignment algorithm
• Test your algorithm on the 5 hemoglobin sequences
• Compare the results to those you obtain in UniProt
• First look for “sapiens globin blood” and select the checkboxes of HBD_HUMAN, HBB_HUMAN, HBA_HUMAN, HBG1_HUMAN and HBG2_HUMAN
• A green bar appears at the bottom with a button “Align” on the right side. Click it.
• This will produce an MSA for these 5 sequences and, when scrolling down a bit, an image of the guide tree you should expect.
• Remember: deadline for putting this project on line is April 1, 2013