
Computational Biology and Bioinformatics

What is a multiple sequence alignment?

Wednesday 13 February 2013

What is a MSA?

• A multiple sequence alignment (MSA) is an alignment of more than two protein sequences that allows for a better identification of the similarities or detection of evolutionarily conserved regions, ...

© Tom Lenaerts ULB

Why MSA?

• An alignment of 2 sequences is a hypothesis that is supported or rejected by the score one obtains for the alignment

• Yet the score does not guarantee that there really is a common ancestor

• Moreover, it is not guaranteed that the elements are correctly aligned

• We can reduce this uncertainty by aligning additional sequences

• A multiple sequence alignment (MSA) algorithm performs this task and can clearly show:

  • the similarity between positions

  • the conservation of certain amino acids within the family of homologous proteins


Why MSA? 2

The conserved regions: in green the identical residues and in blue the elements with the same properties

Improve the alignment between two sequences


Available tools

• CLUSTALW http://www.clustal.org

• TCOFFEE http://www.ebi.ac.uk/Tools/t-coffee/

• MUSCLE http://www.ebi.ac.uk/Tools/msa/muscle/


The problem

• Determining the alignment between N sequences is a hard problem: a combinatorial optimization problem (COP)

• To solve this COP we need to provide two systems:

1. Find the best way of calculating the score of the alignment between multiple sequences

2. Find the algorithm that can find the MSA with the optimal (or close to optimal) score

• These algorithms are again global or local


The scoring system

How can we calculate the score of an MSA?

The total score: $S(m) = \sum_i S(m_i)$

So the assumption is made that the score of one column is independent of the other columns

The score of 1 column: $S(m_i) = \sum_{k,l} s(m_i^k, m_i^l)$ (sum-of-pairs or SP)

$m_i^k$ is the residue of sequence $k$ in column $i$

$s(m_i^k, m_i^l)$ is the score from the substitution matrix
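As a minimal sketch of the sum-of-pairs score, the following Python functions score an alignment column by column. The pair score `toy_score` (match +1, mismatch -1, residue against gap -2, gap against gap 0) is an assumption made for illustration only; in practice s would come from a substitution matrix such as PAM250. Each unordered pair of sequences is counted once.

```python
from itertools import combinations


def toy_score(a, b):
    """Hypothetical pair score standing in for a substitution matrix."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def column_score(column, s=toy_score):
    """S(m_i): sum of s over all pairs of residues in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))


def sp_score(alignment, s=toy_score):
    """S(m): sum of the column scores, each column scored independently."""
    return sum(column_score(col, s) for col in zip(*alignment))


msa = ["GATTGTA", "GATGGTA", "GAT-GTA"]
print(sp_score(msa))
```

Because the columns are treated as independent, the total is just the sum of the per-column values, which is what makes dynamic programming applicable.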


The scoring system 2

SP and its alternatives

$S(m_i) = \sum_{k,l} s(m_i^k, m_i^l)$

$S(m_i) = \sum_{k>1} s(m_i^1, m_i^k)$

$S(m_i) = \sum_{k,l} w_{k,l}\, s(m_i^k, m_i^l)$

Minimum entropy

Maximum likelihood

$S(m_i) = \sum_k f_{ki} \ln(f_{ki})$


Global methods

Smith-Waterman and Needleman-Wunsch could be used to align more than 2 sequences

BUT: the approach is far from efficient since a lot of computational resources are required (assume size=200).

Number of sequences    O(2^n L^n)
2                      2^2 × 200^2 = 0.16M
3                      2^3 × 200^3 = 64M
4                      2^4 × 200^4 = 25600M
6                      ...


Global methods 2

Smith-Waterman and Needleman-Wunsch could be used to align more than 2 sequences

BUT: the approach is far from efficient since a lot of memory is required to store the matrix (assume size=200).

Number of sequences    Memory (1 byte/position)
2                      40000 bytes
3                      7.63 Mbytes
4                      1.5 Gbytes
6                      60000 Gbytes
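The growth in time and memory can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are hypothetical, the operation count simply evaluates 2^n × L^n, and the memory figure assumes 1 byte per matrix position and 1024-based units.

```python
def dp_cells(n_sequences, length=200):
    """Number of positions in the n-dimensional DP matrix: L^n."""
    return length ** n_sequences


def dp_operations(n_sequences, length=200):
    """Order-of-magnitude operation count: 2^n * L^n."""
    return (2 ** n_sequences) * dp_cells(n_sequences, length)


for n in (2, 3, 4, 6):
    gib = dp_cells(n) / 2 ** 30  # 1 byte per position, 1024-based units
    print(n, f"{dp_operations(n):.3g} operations", f"{gib:.3g} Gbytes")
```

Already for 6 sequences of length 200 the matrix alone needs tens of thousands of gigabytes, which is why exact multidimensional DP is limited to a handful of sequences.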


Global methods 3

• Well-known methods:

  • An optimized version of the standard DP algorithms (the MSA system)

    • limited applicability but exact

  • Progressive alignment (the CLUSTAL system)

  • Stochastic methods for multiple sequence alignment (the SAGA system)


Global methods 4

Progressive alignment (the CLUSTAL system)

We already implemented step (A)

By next week step (B)


Progressive alignment

A progressive alignment method is a heuristic, meaning that we have no guarantee of finding the optimal solution

3 stages:

1. Calculate a distance matrix between all sequence pairs

2. Use this matrix to construct a guide tree

3. Use the ordering provided by the tree to align the sequences into an MSA


Progressive alignment 3

How to construct the tree (more later)

First, put the closest sequences in the same group (e.g. 1)

Afterwards, group:

A. the next two sequences that are close (e.g. 2)

B. a sequence with a previously constructed group (e.g. 4)

C. two groups (e.g. 3)

[Figure: guide tree over SEQ1 to SEQ5, with the joins numbered 1 to 4]


Progressive alignment 4

The tree determines the order in which the elements are added to the MSA

Step 1:
GATTGTAGTA
GATGGTAGTA

Step 2:
GATTGTTC--GTA
GATTGTTCGGGTA

Step 3:
GATTGTA---GTA
GATGGTA---GTA
GATTGTTC--GTA
GATTGTTCGGGTA

Step 4 (adding GATGGTAGGCGTGTA):
GATTGTA-----GTA
GATGGTA-----GTA
GATTGTTC----GTA
GATTGTTCGG--GTA
GATGGTAGGCGTGTA


Progressive alignment 5

The Feng and Doolittle system for progressive alignment:

D.-F. Feng and R.F. Doolittle (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351-360

It was constructed out of the concern that the MSA systems at that point in time shuffled or removed earlier created gaps, which may be plausible from an optimization perspective but not from a biological one: “once a gap, always a gap”

The system is composed of 6 functions, of which we only discuss the essential ones: SCORE, BORD and DFAlign


Progressive alignment 6

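SCORE: pairwise alignment and calculation of the distance matrix

$D_{ij} = -\ln\left(\frac{S_{ij} - S_{rand}}{S_{iden} - S_{rand}}\right) \times 100$

(the worked example on the following slides reports $D_{ij}$ without the factor 100)

$S_{ij}$: the alignment score (using for instance PAM250)

$S_{iden} = \frac{S_{ii} + S_{jj}}{2}$

$S_{rand} = \frac{1}{L} \sum_a \sum_b S(a,b)\, N_i(a)\, N_j(b) - N(g) \times g_{penalty}$: the alignment score for two random sequences with the same amino acid composition and the same length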


Progressive alignment 6

SCORE

For instance, take the following 4 segments obtained from 4 proteins of the I-immunoglobulin family

x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x2 RDPVKTHEGWGVMLPCNPPAHYPGLSYRWLLNEFPNFIPTDGRHFVSQTT
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPLASQNRVEVLA
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTEILGNSTRVTVTSD

The substitution matrix is PAM250

gpenalty = 8

Pairwise alignments include an alignment between the sequence and itself (DP)


Progressive alignment 7

SCORE

Alignment 1: S12=31
x1 ILDMDVVEGSAARFDCKVEG-YPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x2 RDPVKTHEGWGVMLPCNPPAHYPGLSYRWLLNEFPNFIPTD-GRHFVSQTT

Alignment 2: S13=44
x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-

Alignment 3: S14=13
x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTE-ILGNST-RV--TVTSD

...

S11=262, S22=287 ...


Progressive alignment 8

SCORE

Sij     x1      x2      x3      x4
x1      262     31      44      13
x2              287     15      16
x3                      222     45
x4                              215

Dij     x1      x2      x3      x4
x1      0       1.25    0.95    1.31
x2              0       1.24    1.30
x3                      0       1.13
x4                              0

Srand   x1      x2      x3      x4
x1              -66.94  -80.28  -70.48
x2                      -82.86  -72.52
x3                              -37.85
x4

$D_{ij} = -\ln\frac{S_{ij} - S_{rand}}{S_{iden} - S_{rand}}$
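The distance table can be recomputed from the score tables with a short Python sketch. The function name `fd_distance` is hypothetical, and the factor 100 is left out so that the values match the Dij table above.

```python
import math


def fd_distance(s_ij, s_ii, s_jj, s_rand):
    """D_ij = -ln((S_ij - S_rand) / (S_iden - S_rand)), S_iden = (S_ii + S_jj) / 2."""
    s_iden = (s_ii + s_jj) / 2
    return -math.log((s_ij - s_rand) / (s_iden - s_rand))


# Values copied from the Sij and Srand tables of the worked example.
self_scores = {"x1": 262, "x2": 287, "x3": 222, "x4": 215}
pair_scores = {  # (i, j): (S_ij, S_rand)
    ("x1", "x2"): (31, -66.94),
    ("x1", "x3"): (44, -80.28),
    ("x1", "x4"): (13, -70.48),
    ("x2", "x3"): (15, -82.86),
    ("x2", "x4"): (16, -72.52),
    ("x3", "x4"): (45, -37.85),
}

distances = {
    (i, j): fd_distance(s_ij, self_scores[i], self_scores[j], s_rand)
    for (i, j), (s_ij, s_rand) in pair_scores.items()
}
for pair, d in distances.items():
    print(pair, round(d, 2))
```

The smallest distance is the one between x1 and x3, so these two segments are the first candidates to be joined in the guide tree.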


Progressive alignment 9

BORD

Use the Fitch-Margoliash (FM) algorithm to construct the guide tree

W.M. Fitch and E. Margoliash (1967) Construction of phylogenetic trees, Science 155(3760):279-284

At each step, join the sequences or groups of sequences with the smallest distance between them and recalculate the distance between the new group and the remaining sequences (or groups)


FM algorithm

This algorithm produces unrooted additive trees

It is based on the analysis of a three-leaf tree, like the one shown below

The distances between the pairs of sequences determine the branch lengths $b_1$, $b_2$ and $b_3$

$b_1 = \frac{1}{2}(d_{AB} + d_{AC} - d_{BC})$

$b_2 = \frac{1}{2}(d_{AB} + d_{BC} - d_{AC})$

$b_3 = \frac{1}{2}(d_{AC} + d_{BC} - d_{AB})$

which shows the additive nature of the tree: $d_{AB} = b_1 + b_2$, $d_{AC} = b_1 + b_3$ and $d_{BC} = b_2 + b_3$
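The three branch-length equations translate directly into code. The sketch below uses a hypothetical function name and made-up distances, and checks the additive property for one example.

```python
def three_leaf_branches(d_ab, d_ac, d_bc):
    """Branch lengths b1 (to A), b2 (to B), b3 (to C) of a three-leaf tree."""
    b1 = (d_ab + d_ac - d_bc) / 2
    b2 = (d_ab + d_bc - d_ac) / 2
    b3 = (d_ac + d_bc - d_ab) / 2
    return b1, b2, b3


b1, b2, b3 = three_leaf_branches(3.0, 4.0, 5.0)
print(b1, b2, b3)          # branch lengths
print(b1 + b2, b1 + b3, b2 + b3)  # recovers d_AB, d_AC, d_BC
```

Note that nothing in these equations prevents a negative result: with the made-up distances d_AB = 1, d_AC = 5 and d_BC = 3, the value of b2 is -0.5.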


FM algorithm

The guide tree is produced in a stepwise fashion

At every stage three clusters are defined, with all sequences belonging to one of the clusters

The first two clusters contain the two elements which have the smallest evolutionary distance and the third cluster contains the rest of the sequences

Branch lengths are determined using the equations shown on the previous slide

The branch length between the third and each of the other clusters is calculated, including the location of the internal node that connects them

The two newly added clusters (the first and the second) are then combined into a new cluster
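The stepwise procedure can be sketched in Python as follows. This is an illustrative outline rather than a full implementation: it only records which clusters are merged and the two values b1 and b2 from the three-leaf equations, it assumes that the distance between clusters is the average of the original distances between their members, and the function names and the example distances are made up.

```python
from itertools import combinations


def mean_distance(group_a, group_b, dist):
    """Average original distance between the members of two clusters (assumed rule)."""
    total = sum(dist[frozenset((a, b))] for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))


def fm_merge_order(leaves, dist):
    """Return the merges as tuples (cluster_1, cluster_2, b1, b2)."""
    clusters = [(leaf,) for leaf in leaves]
    merges = []
    while len(clusters) > 2:
        # Clusters 1 and 2 are the closest pair; cluster 3 is everything else.
        first, second = min(
            combinations(clusters, 2),
            key=lambda pair: mean_distance(pair[0], pair[1], dist),
        )
        rest = tuple(x for c in clusters if c not in (first, second) for x in c)
        d12 = mean_distance(first, second, dist)
        d13 = mean_distance(first, rest, dist)
        d23 = mean_distance(second, rest, dist)
        b1 = (d12 + d13 - d23) / 2
        b2 = (d12 + d23 - d13) / 2
        merges.append((first, second, b1, b2))
        clusters = [c for c in clusters if c not in (first, second)]
        clusters.append(first + second)
    return merges


# Made-up additive distances for four sequences A, B, C, D.
raw = {("A", "B"): 2.0, ("A", "C"): 4.0, ("A", "D"): 4.5,
       ("B", "C"): 4.0, ("B", "D"): 4.5, ("C", "D"): 2.5}
dist = {frozenset(pair): d for pair, d in raw.items()}
for merge in fm_merge_order("ABCD", dist):
    print(merge)
```

For clusters with more than one member, b1 and b2 measure the distance from the cluster as a whole to the new internal node; converting them into individual branch lengths requires subtracting the depth already assigned inside the cluster, which this sketch leaves out.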


FM algorithm

The three clusters are initially {A}, {C} and W={B,D,E}, where W contains the remaining elements

Afterwards {A} and {C} become the cluster X={A,C}

recalculate distance calculate branch lengths


FM algorithm

The three clusters are initially X, {B} and Y={D,E}, where Y contains the remaining elements

Afterwards X and {B} become the cluster Z={B,{A,C}}

recalculate distance calculate branch lengths


FM algorithm


FM algorithm

We immediately see a weakness of this method in this example :

If there are different evolutionary rates along different tree branches, the two closest elements may not really be neighbors

Here this has led to a negative branch length, which cannot arise in a true evolutionary history

FM algorithm

The same problem becomes apparent when one compares the distances measured on the tree itself (patristic distances) with the original distances and looks at the errors that follow from that

The neighbour-joining (NJ) algorithm solves this problem


Progressive alignment 11

DFAlign: use the tree to construct the MSA

First step:
x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-

Second step:
x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTEIL-GNST-RV--TVTSD

Third step:
x1 ILDMDVVEGSAARFDCKVEG-YPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAG-KPRPMVRWLRNGEPL-ASQN-RV--EVLA-
x4 RRLIPAARGGEISILCQPRA-APKATILWSKGTEIL-GNST-RV--TVTSD
x2 RDPVKTHEGWGVMLPCNPPAHYPGLSYRWLLNEFPNFIPTD-GRHFVSQTT

[Figure: guide tree in which x1 and x3 are joined first, x4 is added at node 5 and x2 at node 6]

The outcome depends on the kind of substitution matrix and the penalty g


How to combine?

• Align the new sequence to a profile (PSSM)

• Make a consensus sequence and align to this consensus sequence

x1 ILDMDVVEGSAARFDCKVEGYPDPEVMWFKDDNPVKESRHFQIDYDEEGN
x3 ISDTEADIGSNLRWGCAAAGKPRPMVRWLRNGEPL-ASQN-RV--EVLA-

co ISDTEAIGSNLRWGCWWWKPRPMVRWLRNGEPLXASQNXRVXXEVLAX
x4 RRLIPAARGGEISILCQPRAAPKATILWSKGTEILGNSTRVTVTSD


Profiles 2

Profiles store general properties of a set of sequences:

1.) Frequency information concerning the residues in each column of the MSA

2.) The evolutionary importance of every residue in that position

Take for instance:

TGVEAENLLL
PRAKAEEMLS
GRKDAERQLL

f7,R=2/3 f23,S=1/3f11,E=3/3

The frequencies are:

$f_{u,b} = \frac{n_{u,b}}{N_{seq}}$

$f_{u,b} = \frac{\ln\left(1 - n_{u,b}/(N_{seq}+1)\right)}{\ln\left(1/(N_{seq}+1)\right)}$
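Both frequency definitions can be sketched in a few lines of Python, applied to the three example sequences. The function names are hypothetical, and columns are numbered from 0, which is an assumption about the numbering used on the slides.

```python
import math
from collections import Counter


def column_frequencies(msa):
    """f[u][b] = n_{u,b} / N_seq for every column u of the alignment."""
    n_seq = len(msa)
    return [
        {residue: count / n_seq for residue, count in Counter(column).items()}
        for column in zip(*msa)
    ]


def log_weighted_frequency(n, n_seq):
    """Alternative: ln(1 - n / (N_seq + 1)) / ln(1 / (N_seq + 1))."""
    return math.log(1 - n / (n_seq + 1)) / math.log(1 / (n_seq + 1))


msa = ["TGVEAENLLL", "PRAKAEEMLS", "GRKDAERQLL"]
freqs = column_frequencies(msa)
print(freqs[1])  # column 1 contains G, R, R
print(log_weighted_frequency(2, len(msa)))
```

The logarithmic variant is 0 for a residue that never occurs and 1 for a residue that occurs in every sequence, like the plain frequency, but it rises more slowly in between.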


Profiles 3

[Figure: frequency matrix of the profile. Columns: the position in the MSA (0 to 9); rows: all possible amino acids (RHKDESTNQCGPAILMFWYV); entries: the frequencies (0.333, 0.667 or 1)]


Profiles 4

$m_{u,a} = \sum_{b \in \{AA\}} f_{u,b}\, s_{a,b}$

A profile stores for each column in the MSA, the frequency multiplied by the alignment score with every other AA.

Thus $m_{u,a}$ is actually an alignment score between a residue a and the column u

$m_{u,a} = \log\frac{q_{u,a}}{p_a}$

This method is only useful when there are sufficient sequences and all amino acids appear at least once
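A minimal sketch of the average-score entry m_{u,a} for a single column is given below. The pair score `toy_s` (5 for identical residues, -1 otherwise) is a made-up stand-in for a real substitution matrix, and the function name is hypothetical.

```python
def profile_score(column_freqs, residue, s):
    """m_{u,a}: sum over b of f_{u,b} * s(a, b) for one column u."""
    return sum(f * s(residue, b) for b, f in column_freqs.items())


def toy_s(a, b):
    """Hypothetical substitution score, for illustration only."""
    return 5 if a == b else -1


column_1 = {"G": 1 / 3, "R": 2 / 3}  # frequencies of a column containing G, R, R
print(profile_score(column_1, "R", toy_s))
print(profile_score(column_1, "K", toy_s))
```

A residue that dominates the column gets a high score, while a residue that is absent from the column is scored purely through its substitution scores with the residues that are present.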


Profiles 5

The probability pa is again the probability that one can find a particular amino acid in any of the positions

Information obtained from SwissProt

Profiles 6

$m_{u,a} = \sum_{b \in \{AA\}} f_{u,b}\, s_{a,b}$

       0        1        2        3        4    5    6        7        8    9
R     -0.666    3.335    0        1       -1    0    1.665   -1       -3   -2.334
H     -1        0.668   -1.332    0.666   -2    0    1       -1       -3   -2.334
...    ...      ...      ...      ...      ...  ...  ...      ...      ...  ...
V     -1       -1.666    0.666   -2.331    1   -2   -2.331    0.333    2    1
+/-    9        9        9        9        9    9    9        9        9    9

(rows: the amino acids in the order RHKDESTNQCGPAILMFWYV; columns: positions 0 to 9; last row: the penalty line)

$m_{0,R} = 0.333(-1) + 0.333(0) + 0.333(-1) = -0.666$

$m_{1,R} = 0.667(5) + 0.333(0) = 3.335$

$m_{2,R} = 0.333(4) + 0.333(-1) + 0.333(-3) = 0$

Without the penalty line, the matrix is called a PSSM (Position-specific scoring matrix)


Profiles 7

The score $m_{u,a}$ is the score for aligning a residue a with the column u. We can therefore use the same DP algorithms as before to align a sequence to a profile

The biggest problem in creating profiles is that the number of sequences can be too small and, as a consequence, not every amino acid is present in each column

$\log 0 = -\infty$

As a result the alignment cannot be made using the log-formula (see PAM and BLOSUM discussion)

→ PSEUDOCOUNTS


Profiles 8

Pseudocounts are constants that one adds to the values in the profile to avoid zeros

$q_{u,a} = \frac{n_{u,a} + 1}{N_{seq} + 20}$

Thus, $q_{u,a}$ will never be 0!

$q_{u,a} = \frac{n_{u,a} + \beta p_a}{N_{seq} + \beta}$

β is a weighting factor that tunes the importance of the pseudocounts in the value of $q_{u,a}$

$\beta = \sqrt{N_{seq}}$

Pseudocounts provide the prior information that we have on the amino acids
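The two pseudocount formulas can be sketched as small Python functions. The function names are hypothetical, and the background probability 0.06 used in the example is simply an illustrative value.

```python
import math


def q_add_one(n, n_seq, alphabet_size=20):
    """q_{u,a} = (n_{u,a} + 1) / (N_seq + 20): one pseudocount per amino acid."""
    return (n + 1) / (n_seq + alphabet_size)


def q_background(n, n_seq, p_a, beta=None):
    """q_{u,a} = (n_{u,a} + beta * p_a) / (N_seq + beta); beta defaults to sqrt(N_seq)."""
    if beta is None:
        beta = math.sqrt(n_seq)
    return (n + beta * p_a) / (n_seq + beta)


# A residue that never occurs in a column of 3 sequences still gets q > 0.
print(q_add_one(0, 3))
print(q_background(0, 3, p_a=0.06))
print(q_background(2, 3, p_a=0.06, beta=1))
```

Since q is now strictly positive, the logarithm in the log-odds score is always defined.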


Profiles 9

We can further generalize this function and introduce $f_{u,a}$ directly

$q_{u,a} = \frac{\alpha f_{u,a} + \beta p_a}{\alpha + \beta}$

α is also a weighting factor that tunes the importance of the actual data. One sometimes uses $\alpha = N_{seq} - 1$

When there is no data (no sequence) then the pseudocounts will determine the values in the profiles

Therefore, pseudocounts correspond to the prior distribution, which corresponds to what we know about the system before introducing the data
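A short Python sketch of the generalized formula illustrates both remarks: without data (α = 0) the result is the prior p_a, and with α = N_seq the count-based formula is recovered. The function name and the numbers are illustrative assumptions.

```python
def q_general(f, p_a, alpha, beta):
    """q_{u,a} = (alpha * f_{u,a} + beta * p_a) / (alpha + beta)."""
    return (alpha * f + beta * p_a) / (alpha + beta)


p_r = 0.06  # illustrative background probability

# No data: the pseudocounts alone determine the value.
print(q_general(0.0, p_r, alpha=0, beta=1))

# alpha = N_seq = 3 and f = 2/3 gives (2 + beta * p) / (3 + beta).
print(q_general(2 / 3, p_r, alpha=3, beta=1))
```

Increasing α relative to β moves q towards the observed frequency, while increasing β moves it towards the prior.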


Profiles 10

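$q_{u,a} = \frac{n_{u,a} + \beta p_a}{N_{seq} + \beta}$ with $\beta = 1$

$m_{u,a} = \log\frac{q_{u,a}}{p_a}$

       0       1       2       3       4       5       6       7       8       9
R     -0.65    0.934  -0.65   -0.65   -0.65   -0.65    0.645  -0.65   -0.65   -0.65
H     -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60
...    ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
V     -0.60   -0.60    0.582  -0.60   -0.60   -0.60   -0.60   -0.60   -0.60   -0.60
+/-    9       9       9       9       9       9       9       9       9       9

(rows: the amino acids in the order RHKDESTNQCGPAILMFWYV; columns: positions 0 to 9; last row: the penalty line)

$q_{0,R} = \frac{0.06}{4}$, $m_{0,R} = \log\frac{0.04}{0.06}$

$q_{1,R} = \frac{2.06}{4}$, $m_{1,R} = \log\frac{0.13}{0.06}$

$q_{6,R} = \frac{1.06}{4}$, $m_{6,R} = \log\frac{0.09}{0.06}$

Note that here the substitution score is not taken into account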


Profiles 11

How can we make use of the information inside the substitution matrices?

Each log-odds score contains information concerning the probability of aligning two specific amino acids

$\frac{q_{a,b}}{p_a p_b} = e^{\lambda s(a,b)}$

If a column u contains $f_{u,b}$ amino acids of type b, the probability of finding an alignment with an amino acid a is proportional to

$\frac{q_{a,b}}{p_a p_b} f_{u,b}$

The sum of all these probabilities produces the total probability for a


Profiles 12

Multiplying this probability for column u with $p_a$ provides a better pseudocount for a

$g_{u,a} = \sum_b \frac{q_{a,b}}{p_b} f_{u,b}$

As a result, the equation used for $q_{u,a}$ now becomes

$q_{u,a} = \frac{\alpha f_{u,a} + \beta g_{u,a}}{\alpha + \beta}$

These $g_{u,a}$ values can be recovered from the substitution scores listed in the matrices PAM or BLOSUM
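The substitution-based pseudocounts can be sketched in Python with a toy two-letter alphabet. The pair probabilities `q`, the background `p` and the function names are all made up for illustration; with real data, q_{a,b} would be derived from the scores of a PAM or BLOSUM matrix through the log-odds relation.

```python
def substitution_pseudocounts(column_freqs, q, p):
    """g_{u,a} = sum over b of (q_{a,b} / p_b) * f_{u,b}, for every residue a."""
    return {
        a: sum(q[a][b] / p[b] * f for b, f in column_freqs.items())
        for a in p
    }


def q_profile(column_freqs, q, p, alpha, beta):
    """q_{u,a} = (alpha * f_{u,a} + beta * g_{u,a}) / (alpha + beta)."""
    g = substitution_pseudocounts(column_freqs, q, p)
    return {
        a: (alpha * column_freqs.get(a, 0.0) + beta * g[a]) / (alpha + beta)
        for a in p
    }


# Toy alphabet {A, B}: pair probabilities whose marginals equal p.
q = {"A": {"A": 0.4, "B": 0.1}, "B": {"A": 0.1, "B": 0.4}}
p = {"A": 0.5, "B": 0.5}
column = {"A": 1.0}  # a column that contains only A

g = substitution_pseudocounts(column, q, p)
print(g)
print(q_profile(column, q, p, alpha=2, beta=1))
```

Although B never occurs in the column, it receives a non-zero value because A is sometimes substituted by B; this is the extra information that the constant background pseudocounts do not provide.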


Aligning to a profile

Simply use either Needleman-Wunsch (L3) or Smith-Waterman (L3) to make the alignment

profile:
TGVEAENLLL
PRAKAEEMLS
GRKDAERQLL

sequence:
SRNAAEYLLS

The profile now contains the alignment scores

The major difficulty lies in the way in which gap penalties are assigned.
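As a minimal sketch of a global (Needleman-Wunsch style) alignment of a sequence to a profile, the following Python function only computes the optimal score. It sidesteps the gap-penalty difficulty by assuming one constant linear penalty, and the toy profile with two columns over the alphabet {A, C} is made up for illustration.

```python
def align_to_profile(profile, sequence, gap=-2):
    """Optimal global score of `sequence` against `profile`.

    profile[u][a] is the score m_{u,a} of residue a in column u;
    `gap` is an assumed constant penalty per gap position.
    """
    n_cols, n_res = len(profile), len(sequence)
    score = [[0] * (n_res + 1) for _ in range(n_cols + 1)]
    for u in range(1, n_cols + 1):
        score[u][0] = u * gap
    for j in range(1, n_res + 1):
        score[0][j] = j * gap
    for u in range(1, n_cols + 1):
        for j in range(1, n_res + 1):
            match = score[u - 1][j - 1] + profile[u - 1][sequence[j - 1]]
            skip_column = score[u - 1][j] + gap   # gap in the sequence
            skip_residue = score[u][j - 1] + gap  # gap in the profile
            score[u][j] = max(match, skip_column, skip_residue)
    return score[n_cols][n_res]


toy_profile = [{"A": 2, "C": -1}, {"A": -1, "C": 2}]
print(align_to_profile(toy_profile, "AC"))
print(align_to_profile(toy_profile, "A"))
```

The only change with respect to pairwise alignment is that the substitution score s(a, b) is replaced by the profile entry for the current column.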


Remember

• In the articles by Osamu Gotoh, a detailed analysis is made of how to align groups of sequences.

O. Gotoh (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment CABIOS 9(3):361-370

O. Gotoh (1994) Further improvement in methods of group-to-group sequence alignment with generalized profile operations CABIOS 10(4):379-387

• For next week, read the articles and 2 people provide a presentation in which they discuss the work in those articles and how to use them.

• All other students also examine the articles since they will need to help in deciding which one we will implement in the project.


Progressive alignment 12

This method suffers from 2 problems:

1. The local minimum problem

New sequences are added on top of existing alignments.

Hence any early error (misaligned regions) propagates to the complete alignment (due to divergence between sequences)

2. The choice of alignment parameters

One selects a scoring matrix, a gap-open and a gap-extension value

This often works for highly related sequences, yet starts to fail seriously as soon as sequences diverge

CLUSTAL W tried to resolve this problem


Progressive alignment 13

Parameter adjustments introduced by CLUSTAL W:

J.D. Thompson, D.G. Higgins and T.J. Gibson (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680

dynamically varying gap penalties in a residue and position specific manner

Information concerning the probability of finding a gap next to one of the 20 amino acids is used to locally adjust the gap-opening penalty

Short stretches of hydrophilic residues usually indicate loop or random coil regions, requiring reduction in the gap-opening penalty

...


Progressive alignment 14

Parameter adjustments introduced by CLUSTAL W:

J.D. Thompson, D.G. Higgins and T.J. Gibson (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research 22:4673-4680

Scoring matrices are dynamically used depending on the divergence of the sequences to be aligned at each stage

Sequences are weighted to correct for unequal sampling across all evolutionary distances in the data set

Frequently occurring sequences are down-weighted (weights are calculated from the branch lengths of the guide tree, constructed using the neighbor-joining method)

Assignment 2

• Second step to complete by our meeting next week

• Implement your own version of the Fitch Margoliash algorithm

• Use the distance calculation explained for the Feng and Doolittle progressive alignment algorithm

• Test your algorithm on the 5 hemoglobin sequences

• Compare the results to those you obtain in UniProt

• first look for “sapiens globin blood” and select the checkbox of HBD_HUMAN, HBB_HUMAN, HBA_HUMAN, HBG1_HUMAN and HBG2_HUMAN

• A green bar appears at the bottom with on the right side a button “Align”. Click it.

• This will produce a MSA for these 5 sequences, and when scrolling down a bit an image of the guide tree you should expect.


Assignment 2

• Remember: deadline for putting this project on line is April 1, 2013
