
Computational Biology and Bioinformatics

What is a multiple sequence alignment?

Wednesday 13 February 2013

What is a MSA?

• A multiple sequence alignment (MSA) is an alignment of more than two protein sequences that allows for a better identification of the similarities or detection of evolutionarily conserved regions, ...


© Tom Lenaerts ULB

Why MSA?

• An alignment of 2 sequences is a hypothesis that is supported or rejected by the score one obtains for the alignment

• Yet the score does not guarantee that there really is a common ancestor

• Moreover, it is not guaranteed that the elements are correctly aligned

• We can reduce this uncertainty by aligning additional sequences

• A multiple sequence alignment (MSA) algorithm performs this task and can clearly show:

• The similarity between positions

• The conservation of certain amino acids within the family of homologous proteins

Why MSA? 2

Improve the alignment between two sequences

[Figure: the conserved regions, with the identical residues in green and the elements with the same properties in blue]

Available tools

• CLUSTALW http://www.clustal.org

• TCOFFEE http://www.ebi.ac.uk/Tools/t-coffee/

• MUSCLE http://www.ebi.ac.uk/Tools/msa/muscle/



The problem

1. Find the best way of calculating the score of the alignment between multiple sequences

2. Find the algorithm that can find the MSA with the optimal (or close to optimal) score

• These algorithms are again global or local

• Determining the alignment between N sequences is a hard problem: a combinatorial optimization problem (COP)

• To solve this COP we need to provide the two systems listed above: a scoring system and a search algorithm

The scoring system

How can we calculate the score of an MSA?

The total score: $S(m) = \sum_i S(m_i)$

So the assumption is made that the score of one column is independent of the other columns

The score of 1 column (sum-of-pairs or SP): $S(m_i) = \sum_{k,l} s(m_i^k, m_i^l)$

$m_i^k$ is the residue of sequence k in column i

$s(m_i^k, m_i^l)$ is the score from the substitution matrix
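The sum-of-pairs definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in for a real substitution matrix such as PAM250, the gap handling (a fixed cost for residue against gap, zero for gap against gap) is an assumption, and the sum runs over unordered pairs k < l.

```python
from itertools import combinations

def toy_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical stand-in for a substitution matrix such as PAM250."""
    if a == "-" and b == "-":
        return 0              # gap against gap scored 0 (an assumption)
    if a == "-" or b == "-":
        return gap            # residue against gap
    return match if a == b else mismatch

def column_score(column, s=toy_score):
    """S(m_i): sum of s over all unordered pairs of rows in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def sp_score(msa, s=toy_score):
    """S(m): sum of the column scores, each column treated independently."""
    return sum(column_score(col, s) for col in zip(*msa))

msa = ["GATTGTA", "GATGGTA", "GAT-GTA"]
print(sp_score(msa))
```

Because the columns are scored independently, `sp_score` is just a sum over `zip(*msa)`; swapping `toy_score` for a lookup in a real matrix would not change the structure.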



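The scoring system 2

SP and its alternatives

Sum-of-pairs: $S(m_i) = \sum_{k,l} s(m_i^k, m_i^l)$

Comparison with the first sequence only: $S(m_i) = \sum_{k>1} s(m_i^1, m_i^k)$

Weighted sum-of-pairs: $S(m_i) = \sum_{k,l} w_{k,l}\, s(m_i^k, m_i^l)$

Minimum entropy / maximum likelihood: $S(m_i) = \sum_k f_i^k \ln(f_i^k)$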

Global methods

Smith-Waterman and Needleman-Wunsch could be used to align more than 2 sequences

BUT: the approach is far from efficient since a lot of computational resources are required (assume sequence length L = 200).

Number of sequences (n)    Operations, O(2^n L^n)
2                          2^2 × 200^2 = 0.16M
3                          2^3 × 200^3 = 64M
4                          2^4 × 200^4 = 25600M
6                          ...

Global methods 2

Smith-Waterman and Needleman-Wunsch could be used to align more than 2 sequences

BUT: the approach is far from efficient since a lot of memory is required to store the matrix (assume sequence length L = 200).

Number of sequences (n)    Memory (1 byte/position)
2                          40000 bytes
3                          7.63 Mbytes
4                          1.5 Gbytes
6                          60000 Gbytes
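The growth in time and memory can be checked with a short calculation. The sketch below assumes, as on the slides, n sequences of equal length L = 200, a cost of 2^n L^n operations and one byte per matrix position (L^n bytes); the function name `dp_cost` is only illustrative. For two sequences it gives 200^2 = 40,000 matrix positions.

```python
def dp_cost(n_seqs, length=200):
    """Return (operations, bytes) for exact DP over n sequences of equal length."""
    operations = (2 ** n_seqs) * (length ** n_seqs)   # O(2^n L^n)
    memory_bytes = length ** n_seqs                   # 1 byte per matrix position
    return operations, memory_bytes

for n in (2, 3, 4, 6):
    ops, mem = dp_cost(n)
    print(f"n={n}: {ops / 1e6:.2f}M operations, {mem / 2**20:.2f} Mbytes")
```

Each extra sequence multiplies the memory by L and the work by 2L, which is why exact dynamic programming stops being practical after a handful of sequences.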



Global methods 3

• Well-known methods:

• An optimized version of the standard DP algorithms (the MSA system)

• limited applicability but exact

• Progressive alignment (the CLUSTAL system)

• Stochastic methods for multiple sequence alignment (the SAGA system)



Global methods 4

Progressive alignment (the CLUSTAL system)

We already implemented step (A)

By next week step (B)



Progressive alignment

A progressive alignment method is a heuristic, meaning that we have no guarantee of finding the optimal solution

3 stages:

1. Calculate a distance matrix between all sequence pairs

2. Use this matrix to construct a guide tree

3. Use the ordering provided by the tree to align the sequences into an MSA

Progressive alignment 3

How to construct the tree (more later)

First, put the closest sequences in the same group (e.g. 1)

Afterwards, group:

A. the next two sequences that are close (e.g. 2)

B. a sequence with a previously constructed group (e.g. 4)

C. two groups (e.g. 3)

[Figure: guide tree over SEQ1 to SEQ5 with the grouping steps numbered 1 to 4]


The tree determines the order in which the elements are added to the MSA

Step 1:

GATTGTAGTA
GATGGTAGTA

Step 2:

GATTGTTC--GTA
GATTGTTCGGGTA

Step 3:

GATTGTA---GTA
GATGGTA---GTA
GATTGTTC--GTA
GATTGTTCGGGTA

Step 4:

GATTGTA-----GTA
GATGGTA-----GTA
GATTGTTC----GTA
GATTGTTCGG--GTA
GATGGTAGGCGTGTA

Progressive alignment 5

The Feng and Doolittle system for progressive alignment:

D.-F. Feng and R.F. Doolittle (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351-360

It was constructed out of the concern that the MSA systems at that point in time shuffled or removed earlier created gaps, which may be plausible from an optimization perspective but not from a biological perspective: “once a gap, always a gap”

The system is composed of 6 functions, of which we only discuss the essential ones: SCORE, BORD and DFAlign


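SCORE

Pairwise alignment and calculation of distance matrix

$D_{ij} = -\ln\left(\frac{S_{ij} - S_{rand}}{S_{iden} - S_{rand}}\right) \times 100$

The factor 100 only rescales the distance; the worked example that follows reports $D_{ij}$ without it.

$S_{ij}$: the alignment score (using for instance PAM250)

$S_{iden} = \frac{S_{ii} + S_{jj}}{2}$

$S_{rand}$: the alignment score for two random sequences with the same amino acid composition and the same length

$S_{rand} = \frac{1}{L} \sum_a \sum_b S(a,b)\, N_i(a)\, N_j(b) - N(g)\, g_{penalty}$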

Progressive alignment 6

SCORE

For instance, take the following 4 segments obtained from 4 proteins of the I-immunoglobulin family

x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x2 RDPVKTHEGWGVMLPCNPPAHYPGLSYRWLLNEFPNFIPTDGRHFVSQTT
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPLASQNRVEVLA
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTEILGNSTRVTVTSD

The substitution matrix is PAM250

gpenalty = 8

Pairwise alignments include an alignment between each sequence and itself (DP)


SCORE

x1 ILDMDVVEGSAARFDCKVEG-YPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x2 RDPVKTHEGWGVMLPCNPPAHYPGLSYRWLLNEFPNFIPTD-GRHFVSQTT

x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-

x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTE-ILGNST-RV--TVTSD

Alignment 1: S12 = 31

...

S11 = 262, S22 = 287 ...


SCORE

Sij      x1       x2       x3       x4
x1       262      31       44       13
x2                287      15       16
x3                         222      45
x4                                  215

Srand    x1       x2       x3       x4
x1                -66.94   -80.28   -70.48
x2                         -82.86   -72.52
x3                                  -37.85
x4

$D_{ij} = -\ln\left(\frac{S_{ij} - S_{rand}}{S_{iden} - S_{rand}}\right)$

Dij      x1       x2       x3       x4
x1       0        1.25     0.95     1.31
x2                0        1.24     1.30
x3                         0        1.13
x4                                  0
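The distance table can be recomputed from the Sij and Srand tables. The sketch below copies those values and applies the distance formula, with Siden taken as the mean of the two self-alignment scores; the dictionary layout and the function name `fd_distance` are illustrative choices, not part of the Feng and Doolittle system.

```python
import math

# Scores copied from the SCORE tables (PAM250, gap penalty 8).
S_self = {"x1": 262, "x2": 287, "x3": 222, "x4": 215}
S_pair = {("x1", "x2"): 31, ("x1", "x3"): 44, ("x1", "x4"): 13,
          ("x2", "x3"): 15, ("x2", "x4"): 16, ("x3", "x4"): 45}
S_rand = {("x1", "x2"): -66.94, ("x1", "x3"): -80.28, ("x1", "x4"): -70.48,
          ("x2", "x3"): -82.86, ("x2", "x4"): -72.52, ("x3", "x4"): -37.85}

def fd_distance(s_ij, s_rand, s_ii, s_jj):
    """D_ij = -ln((S_ij - S_rand) / (S_iden - S_rand)), S_iden = (S_ii + S_jj) / 2."""
    s_iden = (s_ii + s_jj) / 2
    return -math.log((s_ij - s_rand) / (s_iden - s_rand))

D = {(i, j): fd_distance(S_pair[(i, j)], S_rand[(i, j)], S_self[i], S_self[j])
     for (i, j) in S_pair}
for pair, d in D.items():
    print(pair, round(d, 2))
```

The normalisation puts the observed score on a scale between the random expectation and the score of identical sequences, so identical sequences get a distance of 0 and the distance grows as the score approaches the random level.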




BORD

Use the Fitch-Margoliash (FM) algorithm to construct the guide tree

W.M. Fitch and E. Margoliash (1967) Construction of phylogenetic trees, Science 155(3760):279-284

At each step, join the sequences or groups of sequences with the smallest distance between them and recalculate the distance between the new group and the remaining sequences (or groups)


FM algorithm

This algorithm produces unrooted additive trees

It is based on the analysis of a three-leaf tree with leaves A, B and C

The distances between the pairs of sequences determine the branch lengths b1, b2 and b3:

$b_1 = \frac{1}{2}(d_{AB} + d_{AC} - d_{BC})$

$b_2 = \frac{1}{2}(d_{AB} + d_{BC} - d_{AC})$

$b_3 = \frac{1}{2}(d_{AC} + d_{BC} - d_{AB})$

which shows the additive nature of the tree: $d_{AB} = b_1 + b_2$, $d_{AC} = b_1 + b_3$ and $d_{BC} = b_2 + b_3$
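The three-leaf equations translate directly into code. The sketch below uses hypothetical distances chosen only for illustration and checks that the resulting branch lengths add back up to the input distances.

```python
def three_leaf_branches(d_ab, d_ac, d_bc):
    """Branch lengths b1 (to A), b2 (to B), b3 (to C) of a three-leaf tree."""
    b1 = (d_ab + d_ac - d_bc) / 2
    b2 = (d_ab + d_bc - d_ac) / 2
    b3 = (d_ac + d_bc - d_ab) / 2
    return b1, b2, b3

# Hypothetical pairwise distances, not taken from the slides.
b1, b2, b3 = three_leaf_branches(d_ab=3.0, d_ac=9.5, d_bc=10.5)
print(b1, b2, b3)
```

Note that nothing prevents one of the results from being negative when the three distances violate the triangle inequality.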


FM algorithm

The guide tree is produced in a stepwise fashion

At every stage three clusters are defined, with all sequences belonging to one of the clusters

The first two clusters contain the two elements which have the smallest evolutionary distance and the third cluster contains the rest of the sequences

Branch lengths are determined using the equations shown on the previous slide

The branch length between the third and each of the other clusters is calculated, including the location of the internal node that connects them

The first and the second cluster are then combined into a new cluster
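One stage of this procedure might be sketched as follows. The sketch rests on several assumptions: clusters are tuples of leaf names, the distance between two clusters is the average of the leaf-to-leaf distances, and the returned lengths are measured from the new internal node without subtracting the depth already assigned inside a composite cluster. The names `avg_dist` and `fm_step` and the example distances are hypothetical.

```python
from itertools import combinations

def avg_dist(dist, group_a, group_b):
    """Average leaf-to-leaf distance between two clusters."""
    total = sum(dist[frozenset((a, b))] for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))

def fm_step(dist, clusters):
    """Join the two closest clusters; return (new_clusters, branch_lengths)."""
    if len(clusters) < 3:
        raise ValueError("a step needs at least three clusters")
    first, second = min(combinations(clusters, 2),
                        key=lambda pair: avg_dist(dist, *pair))
    # Third cluster: every leaf that is not in the two closest clusters.
    rest = tuple(leaf for c in clusters if c not in (first, second) for leaf in c)
    d_12 = avg_dist(dist, first, second)
    d_1w = avg_dist(dist, first, rest)
    d_2w = avg_dist(dist, second, rest)
    lengths = {
        first: (d_12 + d_1w - d_2w) / 2,
        second: (d_12 + d_2w - d_1w) / 2,
        rest: (d_1w + d_2w - d_12) / 2,
    }
    merged = [c for c in clusters if c not in (first, second)] + [first + second]
    return merged, lengths

# Hypothetical additive distances for four leaves.
pairs = {"AB": 3, "AC": 9, "AD": 10, "BC": 10, "BD": 11, "CD": 7}
dist = {frozenset(k): v for k, v in pairs.items()}
clusters, lengths = fm_step(dist, [("A",), ("B",), ("C",), ("D",)])
print(clusters, lengths)
```

Calling `fm_step` repeatedly on the returned cluster list until only two clusters remain gives the joining order that a guide tree needs.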


FM algorithm

The three clusters are initially {A}, {C} and W={B,D,E}, where W contains the remaining elements

Afterwards {A} and {C} become the cluster X={A,C}

Then: recalculate the distances and calculate the branch lengths


FM algorithm

The three clusters are now X, {B} and Y={D,E}, where Y contains the remaining elements

Afterwards X and {B} become the cluster Z={B,{A,C}}

Then: recalculate the distances and calculate the branch lengths


FM algorithm

We immediately see a weakness of this method in this example :

If there are different evolutionary rates along different tree branches, the two closest elements may not really be neighbors

Here this has led to a negative branch length, which cannot arise in a true evolutionary history


FM algorithm

The same problem becomes apparent when one compares the distances measured on the tree itself (patristic distances) with the original distances and examines the errors that follow from that

The neighbour-joining (NJ) algorithm solves this problem




DFAlign

Use the tree to construct the MSA

[Figure: guide tree in which x1 and x3 are joined first (first step), x4 is added at node 5 (second step) and x2 at node 6 (third step)]

x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTEIL-GNST-RV--TVTSD

x1 ILDMDVVEGSAARFDCKVEG-YPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAG-KPRPMVRWLRNGEPL-ASQN-RV--EVLA-
x4 RRLIPAARGGEISILCQPRA-APKATILWSKGTEIL-GNST-RV--TVTSD
x2 RDPVKTHEGWGVMLPCNPPAHYPGLSYRWLLNEFPNFIPTD-GRHFVSQTT

The outcome depends on the kind of substitution matrix and the penalty g

How to combine?

• Align the new sequence to a profile (PSSM)

• Make a consensus sequence and align to this consensus sequence


co ISDTEAIGSNLRWGCWWWKPRPMVRWLRNGEPLXASQNXRVXXEVLAX
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTEILGNSTRVTVTSD

Profiles 2

Profiles store general properties of a set of sequences:

1.) Frequency information concerning the residues in each column of the MSA

2.) The evolutionary importance of every residue in that position

Take for instance:

TGVEAENLLL
PRAKAEEMLS
GRKDAERQLL

The frequencies are:

$f_{u,b} = \frac{n_{u,b}}{N_{seq}}$

or

$f_{u,b} = \frac{\ln\left(1 - \frac{n_{u,b}}{N_{seq}+1}\right)}{\ln\left(\frac{1}{N_{seq}+1}\right)}$

For example: $f_{7,R} = 2/3$, $f_{23,S} = 1/3$, $f_{11,E} = 3/3$
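Both frequency definitions can be tried on the three example sequences. In this sketch the column index u is 0-based, which is an assumption about the numbering, and the function names are illustrative.

```python
from collections import Counter
import math

msa = ["TGVEAENLLL", "PRAKAEEMLS", "GRKDAERQLL"]

def column_counts(msa):
    """n_{u,b}: number of times residue b occurs in column u (0-based)."""
    return [Counter(col) for col in zip(*msa)]

def linear_freq(n, n_seq):
    """f = n / N_seq."""
    return n / n_seq

def log_freq(n, n_seq):
    """f = ln(1 - n / (N_seq + 1)) / ln(1 / (N_seq + 1))."""
    return math.log(1 - n / (n_seq + 1)) / math.log(1 / (n_seq + 1))

counts = column_counts(msa)
n_seq = len(msa)
for u, residue in [(1, "R"), (5, "E"), (9, "S")]:
    n = counts[u][residue]
    print(u, residue, linear_freq(n, n_seq), round(log_freq(n, n_seq), 3))
```

The logarithmic variant agrees with the linear one at the extremes (0 for an absent residue, 1 for a fully conserved column) but gives more weight to intermediate counts.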



Profiles 3

[Figure: matrix of the frequencies (values 0.333, 0.667 and 1); the horizontal axis is the position in the MSA (0 to 9), the vertical axis lists all possible amino acids in the order R H K D E S T N Q C G P A I L M F W Y V]

Profiles 4

$m_{u,a} = \sum_{b \in \{AA\}} f_{u,b}\, s_{a,b}$

A profile stores, for each column in the MSA, the frequency multiplied by the alignment score with every other AA.

Thus $m_{u,a}$ is actually an alignment score between a residue a and the column u

$m_{u,a} = \log \frac{q_{u,a}}{p_a}$

This method is only useful when there are sufficient sequences and all amino acids appear at least once

Profiles 5

The probability $p_a$ is again the probability that one can find a particular amino acid in any of the positions

Information obtained from SwissProt

Profiles 6

Position   0        1        2        3        4     5     6        7       8     9
R          -0.666   3.335    0        1        -1    0     1.665    -1      -3    -2.334
H          -1       0.668    -1.332   0.666    -2    0     1        -1      -3    -2.334
...        ...      ...      ...      ...      ...   ...   ...      ...     ...   ...
V          -1       -1.666   0.666    -2.331   1     -2    -2.331   0.333   2     1
+/-        9        9        9        9        9     9     9        9       9     9

$m_{u,a} = \sum_{b \in \{AA\}} f_{u,b}\, s_{a,b}$

$m_{0,R} = 0.333\,(-1) + 0.333\,(0) + 0.333\,(-1) = -0.666$

$m_{1,R} = 0.667\,(5) + 0.333\,(0) = 3.335$

$m_{2,R} = 0.333\,(4) + 0.333\,(-1) + 0.333\,(-3) = 0$

Without the penalty line (+/-), the matrix is called a PSSM (Position-specific scoring matrix)
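The entry for R in column 1 can be recomputed with a few lines of Python. Column 1 of the example MSA holds G, R and R; the two scores are the ones that appear in the worked example for that entry, and assigning 5 to the pair (R, R) and 0 to the pair (R, G) is an assumption. With exact fractions the result is 3.333 rather than 3.335, because the slide rounds the frequencies to 0.667 and 0.333 first.

```python
def profile_score(freqs, scores, a):
    """m_{u,a} = sum over b of f_{u,b} * s_{a,b} for one column u."""
    return sum(f * scores[(a, b)] for b, f in freqs.items())

# Column 1 of the example MSA holds G, R, R.
freqs_col1 = {"R": 2 / 3, "G": 1 / 3}
# Scores taken from the worked example; which pair gets which value is assumed.
scores = {("R", "R"): 5, ("R", "G"): 0}

m_1R = profile_score(freqs_col1, scores, "R")
print(round(m_1R, 3))
```

Filling the whole matrix is a matter of calling `profile_score` for every column and every amino acid, with a full substitution matrix in place of the two-entry `scores` dictionary.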



Profiles 7

The score $m_{u,a}$ is the score for aligning a residue a with the column u. We can therefore use the same DP algorithms as before to align a sequence to a profile

The biggest problem in creating profiles is that the number of sequences can be too small and as a consequence not every amino acid is present in each column

$\log 0 = -\infty$

As a result the alignment cannot be made using the log-formula (see PAM and BLOSUM discussion)

→ PSEUDOCOUNTS

Profiles 8

Pseudocounts are constants that one adds to the values in the profile to avoid zeros

$q_{u,a} = \frac{n_{u,a} + 1}{N_{seq} + 20}$

Thus, $q_{u,a}$ will never be 0!

$q_{u,a} = \frac{n_{u,a} + \beta p_a}{N_{seq} + \beta}$

β is a weighting factor that tunes the importance of the pseudocounts in the value of $q_{u,a}$

$\beta = \sqrt{N_{seq}}$

Pseudocounts provide the prior information that we have on the amino acids

Profiles 9

We can further generalize this function and introduce $f_{u,a}$ directly

$q_{u,a} = \frac{\alpha f_{u,a} + \beta p_a}{\alpha + \beta}$

α is also a weighting factor that tunes the importance of the actual data. One sometimes uses $\alpha = N_{seq} - 1$

When there is no data (no sequence) then the pseudocounts will determine the values in the profiles

Therefore, pseudocounts correspond to the prior distribution, which corresponds to what we know about the system before introducing the data

Profiles 10

$q_{u,a} = \frac{n_{u,a} + \beta p_a}{N_{seq} + \beta}$, $m_{u,a} = \log \frac{q_{u,a}}{p_a}$, with β = 1

Position   0       1       2       3       4       5       6       7       8       9
R          -0.65   0.934   -0.65   -0.65   -0.65   -0.65   0.645   -0.65   -0.65   -0.65
H          -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60
...        ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
V          -0.60   -0.60   0.582   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60
+/-        9       9       9       9       9       9       9       9       9       9

$q_{0,R} = \frac{0.06}{4}$, $m_{0,R} = \log \frac{0.04}{0.06}$

$q_{1,R} = \frac{2.06}{4}$, $m_{1,R} = \log \frac{0.13}{0.06}$

$q_{6,R} = \frac{1.06}{4}$, $m_{6,R} = \log \frac{0.09}{0.06}$

Note that here the substitution scores are not taken into account
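The pseudocount formula can be evaluated for R in columns 0, 1 and 6, where R occurs 0, 2 and 1 times in the three example sequences. The sketch uses β = 1 and a background probability of 0.06 for R, as in the fractions above; the base-10 logarithm is an assumption, chosen because it reproduces the table entries 0.934 and 0.645.

```python
import math

def pseudocount_q(n, n_seq, p_a, beta):
    """q_{u,a} = (n_{u,a} + beta * p_a) / (N_seq + beta)."""
    return (n + beta * p_a) / (n_seq + beta)

def log_odds(q, p_a):
    """m_{u,a} = log(q_{u,a} / p_a); base 10 is an assumption."""
    return math.log10(q / p_a)

n_seq, beta, p_R = 3, 1.0, 0.06
for u, n in [(0, 0), (1, 2), (6, 1)]:
    q = pseudocount_q(n, n_seq, p_R, beta)
    print(u, round(q, 3), round(log_odds(q, p_R), 3))
```

Even with a count of zero, q stays strictly positive, so the logarithm is always defined.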



Profiles 11

How can we make use of the information inside the substitution matrices?

Each log-odds score contains information concerning the probability of aligning two specific amino acids

$\frac{q_{a,b}}{p_a p_b} = e^{\lambda s(a,b)}$

If a column u contains $f_{u,b}$ amino acids of type b, the probability of finding an alignment with an amino acid a is proportional to

$\frac{q_{a,b}}{p_a p_b} f_{u,b}$

The sum of all these probabilities produces the total probability for a

Profiles 12

Multiplying this probability for column u with $p_a$ provides a better pseudocount for a

$g_{u,a} = \sum_b \frac{q_{a,b}}{p_b} f_{u,b}$

As a result, the equation used for $q_{u,a}$ now becomes

$q_{u,a} = \frac{\alpha f_{u,a} + \beta g_{u,a}}{\alpha + \beta}$

These $g_{u,a}$ values can be recovered from the substitution scores listed in the matrices PAM or BLOSUM
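Since the log-odds relation gives q_{a,b}/p_b = p_a e^{λ s(a,b)}, the pseudocount can be written as g_{u,a} = p_a Σ_b e^{λ s(a,b)} f_{u,b}. The sketch below implements that rewriting; the two-letter alphabet, the background probabilities, the scores and the value of λ are all hypothetical and serve only to make the example runnable.

```python
import math

def g_pseudocount(freqs, scores, p, lam, a):
    """g_{u,a} = p_a * sum_b exp(lam * s(a,b)) * f_{u,b} for one column u."""
    return p[a] * sum(math.exp(lam * scores[(a, b)]) * f
                      for b, f in freqs.items())

def q_estimate(f_ua, g_ua, alpha, beta):
    """q_{u,a} = (alpha * f_{u,a} + beta * g_{u,a}) / (alpha + beta)."""
    return (alpha * f_ua + beta * g_ua) / (alpha + beta)

# Hypothetical toy values.
p = {"R": 0.06, "G": 0.08}
scores = {("R", "R"): 5, ("R", "G"): 0, ("G", "R"): 0, ("G", "G"): 5}
freqs = {"R": 2 / 3, "G": 1 / 3}
lam = 0.3

g_R = g_pseudocount(freqs, scores, p, lam, "R")
print(g_R, q_estimate(freqs["R"], g_R, alpha=2, beta=1))
```

When all substitution scores are zero the expression reduces to g_{u,a} = p_a, i.e. the plain background pseudocount used before.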



Aligning to a profile

Simply use either Needleman-Wunsch (L3) or Smith-Waterman (L3) to make the alignment

profile:

TGVEAENLLL
PRAKAEEMLS
GRKDAERQLL

sequence:

SRNAAEYLLS

The profile now contains the alignment scores

The major difficulty lies in the way in which gap penalties are assigned.
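A minimal sketch of a Needleman-Wunsch alignment of one sequence against a profile is given below. It sidesteps the gap-penalty difficulty by assuming a single linear gap cost, and `toy_score` is a hypothetical substitution score standing in for PAM or BLOSUM. In the returned string, "-" marks a profile column left without a residue and a lower-case letter marks a residue that would open a new gap column in the profile.

```python
from collections import Counter

def toy_score(a, b):
    """Hypothetical substitution score: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1

def profile_columns(msa):
    """Residue frequencies f_{u,b} for each column of the MSA."""
    n = len(msa)
    return [{b: c / n for b, c in Counter(col).items()} for col in zip(*msa)]

def align_to_profile(seq, msa, score=toy_score, gap=2):
    cols = profile_columns(msa)

    def m(u, a):
        # m_{u,a}: score of residue a against profile column u.
        return sum(f * score(a, b) for b, f in cols[u].items())

    rows, width = len(seq) + 1, len(cols) + 1
    F = [[0.0] * width for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = -gap * i
    for u in range(1, width):
        F[0][u] = -gap * u
    for i in range(1, rows):
        for u in range(1, width):
            F[i][u] = max(F[i - 1][u - 1] + m(u - 1, seq[i - 1]),
                          F[i - 1][u] - gap,
                          F[i][u - 1] - gap)
    # Traceback from the bottom-right corner.
    aligned, i, u = [], len(seq), len(cols)
    while i > 0 or u > 0:
        if i > 0 and u > 0 and F[i][u] == F[i - 1][u - 1] + m(u - 1, seq[i - 1]):
            aligned.append(seq[i - 1])
            i, u = i - 1, u - 1
        elif i == 0 or (u > 0 and F[i][u] == F[i][u - 1] - gap):
            aligned.append("-")                   # column without a residue
            u -= 1
        else:
            aligned.append(seq[i - 1].lower())    # residue opposite a new gap column
            i -= 1
    return F[-1][-1], "".join(reversed(aligned))

profile = ["TGVEAENLLL", "PRAKAEEMLS", "GRKDAERQLL"]
print(align_to_profile("SRNAAEYLLS", profile))
```

The only change with respect to pairwise Needleman-Wunsch is that the substitution score s(a, b) is replaced by the column score m(u, a); position-specific gap penalties would replace the constant `gap`.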



Remember

• In the articles by Osamu Gotoh, a detailed analysis is made of how to align groups of sequences.

O. Gotoh (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment CABIOS 9(3):361-370

O. Gotoh (1994) Further improvement in methods of group-to-group sequence alignment with generalized profile operations CABIOS 10(4):379-387

• For next week, read the articles and 2 people provide a presentation in which they discuss the work in those articles and how to use them.

• All other students also examine the articles since they will need to help in deciding which one we will implement in the project.




This method suffers from 2 problems:

• The local minimum problem: new sequences are added on top of existing alignments. Hence any early error (misaligned regions) propagates to the complete alignment (due to divergence between sequences)

• The choice of alignment parameters: one selects a scoring matrix, a gap-open and a gap-extension value. This often works for highly related sequences, yet starts to fail seriously as soon as sequences diverge

CLUSTAL W tried to resolve this problem


Parameter adjustments introduced by CLUSTAL W:

J.D. Thompson, D.G. Higgins and T.J. Gibson (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680

Dynamically varying gap penalties in a residue- and position-specific manner

Information concerning the probability of finding a gap next to one of the 20 amino acids is used to locally adjust the gap-opening penalty

Short stretches of hydrophilic residues usually indicate loop or random coil regions, requiring reduction in the gap-opening penalty

...


Parameter adjustments introduced by CLUSTAL W:


Scoring matrices are dynamically used depending on the divergence of the sequences to be aligned at each stage

Sequences are weighted to correct for unequal sampling across all evolutionary distances in the data set

Frequently occurring sequences are down-weighted (weights are calculated from the branch lengths of the guide tree, constructed using the neighbor-joining method)

Assignment 2

• Second step, to complete by our meeting next week

• Implement your own version of the Fitch-Margoliash algorithm

• Use the distance calculation explained for the Feng and Doolittle progressive alignment algorithm

• Test your algorithm on the 5 hemoglobin sequences

• Compare the result to those you obtain in UniProt

• First look for “sapiens globin blood” and select the checkboxes of HBD_HUMAN, HBB_HUMAN, HBA_HUMAN, HBG1_HUMAN and HBG2_HUMAN

• A green bar appears at the bottom with a button “Align” on the right side. Click it.

• This will produce an MSA for these 5 sequences and, when scrolling down a bit, an image of the guide tree you should expect.

Assignment 2

• Remember: the deadline for putting this project online is April 1, 2013

