Seminar in Structural Bioinformatics - Multiple sequence alignment algorithms.
Elya Flax & Inbar Matarasso
Multiple sequence alignment algorithms
Outline
The importance of multiple string alignments in molecular biology.
CLUSTAL W.
Family representation.
How to score multiple alignments.
The center star method for SP alignment.
Consensus strings.
Approximating the optimal consensus multiple alignment.
Iterative pairwise alignment.
Progressive alignment and contemporary improvements.
Repeated-motif methods.
Motivation
Why multiple string comparison?
Because many important commonalities are faint or widely dispersed, they might not be apparent when comparing two strings alone but may become clear, or even obvious, when comparing a set of related strings.
Definition
Definition: A global multiple alignment of k>2 strings S={S1,S2,…,Sk} is a natural generalization of alignment for two strings. Chosen spaces are inserted into each of the k strings so that the resulting strings have the same length, defined to be l. Then the strings are arrayed in k rows of l columns each, so that each character and space of each string is in a unique column.
Biological basis for multiple string comparison
The second fact of biological sequence comparison: Evolutionarily and functionally related molecular strings can differ significantly throughout much of the string and yet preserve the same three-dimensional structure(s), or the same two-dimensional substructure(s) (motifs, domains), or the same active sites, or the same or related dispersed residues (DNA or amino acid).
Three “big-picture” biological uses for multiple string comparison
The representation of protein families and superfamilies.
The identification and representation of conserved sequence features of DNA or protein that correlate with structure and/or function.
The deduction of evolutionary history from DNA or protein sequences.
CLUSTAL W
Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
http://www.ebi.ac.uk/clustalw/
[Figure: CLUSTAL W input sequences and the resulting alignment.]
Family and superfamily representation
Often a set of strings (a family) is defined by biological similarity, and one wants to find subsequence commonalities that characterize or represent the family.
There are three common kinds of family representations that come from multiple string comparison:
I. Profile representation
II. Consensus sequence representation
III. Signature representation
Family representation and alignment with profiles
Definition: Given a multiple alignment of a set of strings, a profile for that multiple alignment specifies for each column the frequency that each character appears in the column. A profile is sometimes also called a weight matrix in the biological literature.
Family representation and alignment with profiles
a b c _ a
a b a b a
a c c b _
c b _ b c

     C1    C2    C3    C4    C5
a   .75     0   .25     0   .50
b     0   .75     0   .75     0
c   .25   .25   .50     0   .25
_     0     0   .25   .25   .25

Often the values in the profile are converted to log-odds ratios: if p(y,j) is the frequency that character y appears in column j, and p(y) is the frequency that character y appears anywhere in the multiply aligned sequences, then log( p(y,j)/p(y) ) is commonly used as the (y,j) profile entry.
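As an illustration, the profile and its log-odds form can be computed directly from the four-row example alignment. This is a minimal sketch: the function names `build_profile` and `log_odds` are hypothetical, spaces are counted like any other character when estimating p(y) (an assumption), and log-odds entries are only produced for characters that actually occur in a column.

```python
import math
from collections import Counter

def build_profile(rows):
    """Return one dict per column mapping each character to its frequency."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    k = len(rows)
    return [{ch: n / k for ch, n in Counter(col).items()} for col in zip(*rows)]

def log_odds(rows):
    """Convert profile frequencies p(y,j) to log(p(y,j) / p(y))."""
    cells = "".join(rows)
    # p(y): frequency of y over every cell of the alignment (assumption:
    # the space character "_" is counted like any other character).
    background = {ch: n / len(cells) for ch, n in Counter(cells).items()}
    return [
        {ch: math.log(p / background[ch]) for ch, p in col.items()}
        for col in build_profile(rows)
    ]

rows = ["abc_a", "ababa", "accb_", "cb_bc"]
profile = build_profile(rows)
odds = log_odds(rows)
print(profile[0])   # frequencies in column C1
print(odds[0])      # the same column as log-odds entries
```

Columns are stored as dictionaries so that characters with frequency zero simply do not appear, which mirrors the empty cells of a hand-drawn profile.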
Aligning a string to a profile
Given a profile P and a new string S, we want to answer the question: “How well does S, or a substring of S, fit the profile P?”
Since space is a legal character of a profile, a fit of S to P should also allow the insertion of spaces into S, and hence the question is naturally formalized as an easy generalization of pure string alignment.
a a b b c
1 2 3 4 5
An alignment of string aabbc to the column positions of the previous alignment.
How to optimally align a string to a profile
Recall that for two characters x and y, s(x,y) denotes the alphabet-weight value assigned to aligning x with y in the pure string alignment problem.
Definition: For character y and column j, let p(y,j) be the frequency that character y appears in column j of the profile, and let S(x,j) denote Σ_y [s(x,y) × p(y,j)], the score for aligning x with column j.
How to optimally align a string to a profile
Definition: Let V(i,j) denote the value of the optimal alignment of substring S[1..i] with the first j columns of the profile.
The recurrence: V(i,0) = Σ_{k≤i} s(S(k),_) and V(0,j) = Σ_{k≤j} S(_,k).
For i and j both strictly positive, the general recurrence is:
V(i,j) = max [ V(i-1,j-1) + S(S(i),j), V(i-1,j) + s(S(i),_), V(i,j-1) + S(_,j) ].
Time analysis: O(σnm), where n is the length of S, m is the number of columns in the profile, and σ is the size of the alphabet.
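The recurrence can be sketched in a few lines. The scoring function `s` below (match 2, mismatch or space -1, space against space 0) is a hypothetical choice made only for illustration, and the names `column_score` and `align_to_profile` are not from the source; the profile is a list of per-column frequency dictionaries.

```python
def s(x, y):
    """Hypothetical alphabet-weight scores; '_' stands for a space."""
    if x == "_" and y == "_":
        return 0
    return 2 if x == y else -1

def column_score(x, column):
    """S(x,j): sum over y of s(x,y) * p(y,j) for one profile column."""
    return sum(s(x, y) * p for y, p in column.items())

def align_to_profile(string, profile):
    """Value V(n,m) of the best global alignment of `string` to `profile`."""
    n, m = len(string), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # V(i,0): every character faces a space
        V[i][0] = V[i - 1][0] + s(string[i - 1], "_")
    for j in range(1, m + 1):            # V(0,j): every column faces a space
        V[0][j] = V[0][j - 1] + column_score("_", profile[j - 1])
    for i in range(1, n + 1):
        x = string[i - 1]
        for j in range(1, m + 1):
            col = profile[j - 1]
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(x, col),   # x aligned to column j
                V[i - 1][j] + s(x, "_"),                  # x aligned to a space
                V[i][j - 1] + column_score("_", col),     # column j gets a space
            )
    return V[n][m]

toy_profile = [{"a": 1.0}, {"b": 1.0}]
print(align_to_profile("ab", toy_profile))
```

Each cell evaluates S(x,j) by scanning one column, which is where the alphabet-size factor in the running time comes from.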
Profile to profile alignment
Another way that profiles are used is to compare one protein set to another. In that case, the profile for one set is compared to the profile of the other.
Introduction to computing multiple string alignments
Definition: Given a set of k > 2 strings S={S1, S2, ...,Sk}, a local multiple alignment of S is obtained by selecting one substring Si’ from each string Si ∈ S and then globally aligning those substrings.
How to score multiple alignments
To date, there is no objective function that has been as well accepted for multiple alignment as edit distance or similarity has been for two-string alignment.
We will discuss three types of objective functions:
I. sum-of-pairs functions
II. consensus functions
III. tree functions
How to score multiple alignments
Definition: Given a multiple alignment M, the induced pairwise alignment of two strings Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. That is, the induced alignment is the multiple alignment M restricted to Si and Sj. Any two opposing spaces in that induced alignment can be removed if desired.
How to score multiple alignments
Definition: The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.
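A A G A A _ A
A T _ A A T G
C T G _ G _ G
-------------
A T G A A _ G
A multiple alignment of three strings (above the line); the row below the line is the column-by-column consensus. The three induced pairwise alignments have scores 4, 5 and 5, for an SP score of 14. These values are consistent with scoring 0 for a match and 1 for a mismatch or for a space opposite a character.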
Multiple alignment with the sum-of-pairs (SP) objective function
Definition: The sum of pairs (SP) score of multiple alignment M is the sum of the scores of pairwise global alignments induced by M.
The SP alignment problem: Compute a global multiple alignment M with minimum sum-of-pairs score.
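The SP score of a given alignment is simple to evaluate. The sketch below assumes a unit-cost scheme (0 for a match, 1 for a mismatch or a space opposite a character, nothing for two opposing spaces); that scheme and the function names are assumptions for illustration. The rows are the three-string example alignment from the scoring slides.

```python
from itertools import combinations

def pair_score(row_a, row_b, match=0, mismatch=1, space=1):
    """Score of the pairwise alignment induced by two rows of M."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "_" and y == "_":
            continue                      # opposing spaces are simply dropped
        if x == "_" or y == "_":
            total += space
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def sp_score(rows):
    """Sum of the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

rows = ["AAGAA_A", "AT_AATG", "CTG_G_G"]
print([pair_score(a, b) for a, b in combinations(rows, 2)])  # [4, 5, 5]
print(sp_score(rows))                                        # 14
```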
An exact solution to the SP alignment problem
Via dynamic programming: for k strings of length n, it takes Θ(n^k) time.
We will develop the dynamic programming recurrence only for the case of three strings.
We will develop an accelerant to the basic dynamic programming solution that somewhat increases the number of strings that can be optimally aligned.
An exact solution to the SP alignment problem
Definition: Let S1, S2 and S3 denote three strings of length n1, n2 and n3, respectively, and let D(i,j,k) be the optimal SP score for aligning S1[1..i], S2[1..j] and S3[1..k]. The score for a match, mismatch, or space is specified by the variables smatch, smis and sspace, respectively.
Recurrences for a nonboundary cell (i,j,k)
for i:=1 to n1 do
 for j:=1 to n2 do
  for k:=1 to n3 do
  begin
   if (S1(i)=S2(j)) then cij:=smatch
   else cij:=smis;
   if (S1(i)=S3(k)) then cik:=smatch
   else cik:=smis;
   if (S2(j)=S3(k)) then cjk:=smatch
   else cjk:=smis;
   d1:=D(i-1,j-1,k-1)+cij+cik+cjk;
   d2:=D(i-1,j-1,k)+cij+2*sspace;
   d3:=D(i-1,j,k-1)+cik+2*sspace;
   d4:=D(i,j-1,k-1)+cjk+2*sspace;
   d5:=D(i-1,j,k)+2*sspace;
   d6:=D(i,j-1,k)+2*sspace;
   d7:=D(i,j,k-1)+2*sspace;
   D(i,j,k):=min[d1,d2,d3,d4,d5,d6,d7];
  end;
D values for boundary cells
Let D1,2(i,j) denote the familiar pairwise distance between substrings S1[1..i] and S2[1..j], and let D1,3(i,k) and D2,3(j,k) denote the analogous pairwise distances. Then,
I. D(i,j,0)=D1,2(i,j)+(i+j)*sspace
II. D(i,0,k)=D1,3(i,k)+(i+k)*sspace
III. D(0,j,k)=D2,3(j,k)+(j+k)*sspace
IV. D(0,0,0)=0
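Putting the nonboundary recurrence and the boundary values together gives a complete, if cubic, procedure for three strings. The following is a sketch under assumed default costs (smatch=0, smis=1, sspace=1); the function names `pairwise` and `sp_three` are hypothetical.

```python
def pairwise(a, b, smatch, smis, sspace):
    """Table of weighted edit distances between all prefixes of a and b."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i * sspace
    for j in range(1, len(b) + 1):
        D[0][j] = j * sspace
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            c = smatch if a[i - 1] == b[j - 1] else smis
            D[i][j] = min(D[i - 1][j - 1] + c,
                          D[i - 1][j] + sspace,
                          D[i][j - 1] + sspace)
    return D

def sp_three(s1, s2, s3, smatch=0, smis=1, sspace=1):
    """Optimal SP score D(n1,n2,n3) for three strings."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    D12 = pairwise(s1, s2, smatch, smis, sspace)
    D13 = pairwise(s1, s3, smatch, smis, sspace)
    D23 = pairwise(s2, s3, smatch, smis, sspace)
    D = [[[0] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if k == 0:                       # boundary case I (and IV)
                    D[i][j][k] = D12[i][j] + (i + j) * sspace
                elif j == 0:                     # boundary case II
                    D[i][j][k] = D13[i][k] + (i + k) * sspace
                elif i == 0:                     # boundary case III
                    D[i][j][k] = D23[j][k] + (j + k) * sspace
                else:
                    cij = smatch if s1[i - 1] == s2[j - 1] else smis
                    cik = smatch if s1[i - 1] == s3[k - 1] else smis
                    cjk = smatch if s2[j - 1] == s3[k - 1] else smis
                    D[i][j][k] = min(
                        D[i - 1][j - 1][k - 1] + cij + cik + cjk,
                        D[i - 1][j - 1][k] + cij + 2 * sspace,
                        D[i - 1][j][k - 1] + cik + 2 * sspace,
                        D[i][j - 1][k - 1] + cjk + 2 * sspace,
                        D[i - 1][j][k] + 2 * sspace,
                        D[i][j - 1][k] + 2 * sspace,
                        D[i][j][k - 1] + 2 * sspace,
                    )
    return D[n1][n2][n3]

print(sp_three("AXZ", "AXZ", "AXZ"))   # identical strings align at cost 0
```

The table has (n1+1)(n2+1)(n3+1) cells with seven candidates each, which is why this exact method is practical only for a handful of short strings.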
A speed up for the exact solution
The program for multiple alignment shown above uses the recurrences in a backward direction.
In forward dynamic programming, when D(i,j,k) is set, it is sent forward to the seven cells that can be influenced by it.
A speed up for the exact solution
Definition: Let d1,2(i,j) be the edit distance between suffixes S1[i..n] and S2[j..n] of strings S1 and S2. Define d1,3(i,k) and d2,3(j,k) analogously.
All these d values can be computed in O(n^2) time by reversing the strings and computing three pairwise distances.
A speed up for the exact solution
Suppose that some multiple alignment of S1, S2, and S3 is known and that the alignment has SP score z.
Key idea of the heuristic speed up: Recall that D(i,j,k) is the optimal SP score for aligning S1[1..i], S2[1..j], and S3[1..k]. If D(i,j,k)+d1,2(i,j)+d1,3(i,k)+d2,3(j,k) is greater than z, then node (i,j,k) cannot be on any optimal path and so D(i,j,k) need not be sent forward to any cell.
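The pruning test can be sketched as follows. The suffix tables are obtained by reversing the strings, as described above; unit costs are assumed, the function names are hypothetical, and the code uses 0-based indices so that `d[i][j]` is the distance between the suffixes that remain after the first i and j characters have been aligned (an indexing convention chosen here, not taken from the source).

```python
def prefix_distances(a, b, mismatch=1, space=1):
    """Edit distances between all prefixes of a and b (unit costs assumed)."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = i * space
    for j in range(1, len(b) + 1):
        D[0][j] = j * space
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            c = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + c, D[i - 1][j] + space, D[i][j - 1] + space)
    return D

def suffix_distances(a, b):
    """d[i][j] = distance between a[i:] and b[j:], via the reversed strings."""
    R = prefix_distances(a[::-1], b[::-1])
    n, m = len(a), len(b)
    return [[R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]

def can_prune(D_ijk, i, j, k, d12, d13, d23, z):
    """True if cell (i,j,k) cannot lie on a path that beats the known score z."""
    return D_ijk + d12[i][j] + d13[i][k] + d23[j][k] > z

s1, s2, s3 = "AXZ", "AYZ", "AZ"
d12, d13, d23 = suffix_distances(s1, s2), suffix_distances(s1, s3), suffix_distances(s2, s3)
print(can_prune(0, 0, 0, 0, d12, d13, d23, z=3))   # False: the bound equals z
```

The test is safe because any completion of a partial alignment induces pairwise alignments of the remaining suffixes, and each of those costs at least the corresponding pairwise optimum.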
A bounded-error approximation method for SP alignment
The method is provably fast (it runs in polynomial worst-case time) and yet produces alignments whose SP score is guaranteed to be less than twice the score of the optimal SP alignment.
Recall that for two strings, D(Si,Sj) is the (optimal) weighted edit distance between Si and Sj.
An initial key idea: alignments consistent with a tree
Definition: Let S be a set of strings, and let T be a tree where each node is labeled with a distinct string from S. Then, a multiple alignment M of S is called consistent with T if the induced pairwise alignment of Si and Sj has score D(Si,Sj) for each pair of strings (Si,Sj) that label adjacent nodes in T.
A bounded-error approximation method for SP alignment
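a) A tree with five nodes, numbered 1 to 5 and labeled with the strings AXZ, AXZ, AXXZ, AYZ and AYXYZ (the drawing of the tree is not reproduced here).
b) A multiple alignment of those strings that is consistent with the tree:
3 A X X _ Z
1 A X _ _ Z
2 A _ X _ Z
4 A Y _ _ Z
5 A Y X Y Z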
An initial key idea: alignments consistent with a tree
Theorem: For any set of strings S and for any tree T whose nodes are labeled by distinct strings of S, we can efficiently find a multiple alignment M(T) of S that is consistent with T.
The center star method for SP alignment
We will describe the method in terms of an alphabet-weighted scoring scheme for two-string alignment, and let s(x,y) be the score contributed when a character x is aligned opposite a character y.
Definition: A scoring scheme satisfies the triangle inequality if for any three characters x,y and z, s(x,z)≤ s(x,y) + s(y,z).
The center star method for SP alignment
Definition: Given a set of k strings S, define a center string Sc ∈ S as a string in S that minimizes Σ_{Sj ∈ S} D(Sc,Sj), and let M denote the minimum sum. Define the center star to be a star tree of k nodes, with the center node labeled Sc and with each of the k-1 remaining nodes labeled by a distinct string in S − {Sc}.
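Choosing the center string only requires all pairwise distances. The sketch below assumes unit-cost edit distance for D and uses hypothetical function names; it selects Sc and M but does not build the center star alignment itself. The input strings are the ones that label the tree example earlier in this document.

```python
def edit_distance(a, b, mismatch=1, space=1):
    """Weighted edit distance D(a,b), computed row by row."""
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * space]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + space, cur[j - 1] + space))
        prev = cur
    return prev[-1]

def center_string(strings):
    """Return (Sc, M): the string minimizing the summed distance to all others."""
    sums = [sum(edit_distance(c, t) for t in strings) for c in strings]
    M = min(sums)
    return strings[sums.index(M)], M

strings = ["AXZ", "AXXZ", "AYZ", "AYXYZ"]
print(center_string(strings))   # ('AXZ', 4)
```

With k strings of length about n this costs on the order of k^2 pairwise alignments, each quadratic in n, which keeps the whole method polynomial.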
The center star method for SP alignment
[Figure: a generic center star for six strings, where the center string Sc is S3.]
The center star method for SP alignment
Definition: Define the multiple alignment Mc of the set of strings S to be the multiple alignment consistent with the center star.
Definition: Define d(Si,Sj) as the score of the pairwise alignment of strings Si and Sj induced by Mc. Denote the score of an alignment M as d(M).
d(Si,Sj) ≥ D(Si,Sj), d(Mc) = Σ_{i<j} d(Si,Sj), d(Si,Sc) = D(Si,Sc)
The center star method for SP alignment
Lemma: Assume that the two-string scoring scheme satisfies the triangle inequality. Then for any strings Si and Sj in S, d(Si,Sj) ≤ d(Si,Sc) + d(Sc,Sj) = D(Si,Sc) + D(Sc,Sj).
The center star method for SP alignment
Definition: Let M* be the optimal multiple alignment of the k strings of S. Let d*(Si,Sj) be the score of the pairwise alignment of strings Si and Sj induced by M*. Then d(M*) = Σ_{i<j} d*(Si,Sj).
The center star method for SP alignment
Theorem: d(Mc)/d(M*) ≤ 2(k-1)/k <2.
Corollary:
kM ≤ Σ_{i<j} D(Si,Sj) ≤ d(M*) ≤ d(Mc) ≤ [2(k-1)/k] Σ_{i<j} D(Si,Sj).
Steiner consensus strings
Definition: Given a set of strings S, and given another string S’, the consensus error of S’ relative to S is
E(S’) = Σ_{Si ∈ S} D(S’,Si). Note that S’ need not be from S.
Steiner consensus strings
Definition: Given a set of strings S, an optimal Steiner string S* for S is a string that minimizes the consensus error E(S*) over all possible strings.
Steiner consensus strings
Lemma: Let S have k strings, and assume that the two-string scoring scheme satisfies the triangle inequality. Then there exists a string S̄ ∈ S such that
E(S̄) / E(S*) ≤ 2 – 2/k < 2
Steiner consensus strings
Recall that Sc is a string that minimizes
Σ_{Si ∈ S} D(Sc,Si) over all strings in S.
Theorem: Assuming that the scoring scheme satisfies the triangle inequality,
E(Sc) / E(S*) ≤ 2 – 2/k < 2
Consensus strings from multiple alignment
Definition: Given a multiple alignment M of a set of strings S, the consensus character of column i of M is the character that minimizes the summed distance to it from all the characters in column i. Let d(i) denote the minimum sum in column i.
Consensus strings from multiple alignment
Definition: The consensus string SM derived from alignment M is the concatenation of the consensus characters for each column of M.
Consensus strings from multiple alignment
Definition: Let M be a multiple alignment of a set of strings S, and let SM be its consensus string containing q characters. Then the alignment error of SM equals
Σ_{i=1}^{q} d(i), and the alignment error of M is defined as the alignment error of SM.
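The consensus string and its alignment error follow column by column from these definitions. The sketch below assumes a 0/1 character distance (with `_` as the space character) and considers as candidates only the characters that occur in the alignment; both choices, and the function names, are assumptions for illustration. The rows are the three-string example alignment from the scoring slides.

```python
def distance(x, y):
    """Hypothetical character distance: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1

def consensus(rows, alphabet=None):
    """Return (S_M, alignment error) for the multiple alignment `rows`."""
    if alphabet is None:
        alphabet = sorted(set("".join(rows)))   # candidate consensus characters
    chars, error = [], 0
    for column in zip(*rows):
        # d(i): smallest summed distance from one character to the column.
        costs = {c: sum(distance(c, x) for x in column) for c in alphabet}
        best = min(alphabet, key=costs.get)
        chars.append(best)
        error += costs[best]
    return "".join(chars), error

rows = ["AAGAA_A", "AT_AATG", "CTG_G_G"]
print(consensus(rows))   # ('ATGAA_G', 7)
```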
Consensus strings from multiple alignment
Definition: The optimal consensus multiple alignment is a multiple alignment M for input set S whose consensus string SM has the smallest alignment error over all possible multiple alignments of S.
Consensus strings from multiple alignment
Definition: Given set S of k strings, let T be the star tree with Steiner string S* at the root and each of the k strings at distinct leaves of T. Then the multiple alignment of S ∪ {S*} consistent with T is said to be consistent with S*.
Consensus strings from multiple alignment
Theorem: Let S’ denote the consensus string of the optimal consensus multiple alignment. Then, removal of the spaces from S’ creates the optimal Steiner string S*. Conversely, removal of the row for S* from the multiple alignment consistent with S* creates the optimal consensus multiple alignment of S.
Approximating the optimal consensus multiple alignment
Theorem: Assuming the triangle inequality, the multiple alignment Mc created by the center star method has an SP score that is never more than 2 – 2/k times the SP score of the optimal SP alignment, and it has a (consensus) alignment error that is never more than 2 – 2/k times the alignment error of the optimal consensus multiple alignment.
Multiple alignment to a (phylogenetic) tree
Definition: Given an input tree T with a distinct string (from a set of strings S) written at each leaf, a phylogenetic alignment for T is an assignment of one string to each internal node of T. Note that the strings assigned to internal nodes need not be distinct and need not be from the input strings S.
Multiple alignment to a (phylogenetic) tree
Definition: If strings S and S’ are assigned to the endpoints of an edge (i,j), then (i,j) has edge distance D(S,S’). The distance along a path is the sum of the distances on the edges in the path. The distance of a phylogenetic alignment is the total of all the edge distances in the tree.
Multiple alignment to a (phylogenetic) tree
The phylogenetic alignment problem for T: Find an assignment of strings to internal nodes of T (one string to each node) that minimizes the distance of the alignment.
The consensus alignment problem is a special case of the phylogenetic alignment problem (i.e., when tree T is a star).
A heuristic for phylogenetic alignment
Definition: A phylogenetic alignment is called a lifted alignment if for every internal node V, the string assigned to V is also assigned to one of V’s children.
We will show that the best lifted alignment in T has a total distance less than twice that of the optimal phylogenetic alignment.
A heuristic for phylogenetic alignment
[Figure: a lifted alignment on a tree with leaf strings S1 to S8; each internal node is labeled with the string of one of its children.]
The transformation creating T^L
We will construct the lifted alignment T^L out of T*, the optimal phylogenetic alignment.
Definition: We say a node has been lifted after it has been labeled by a string in the leaf set S.
Let Sv* be the string labeling internal node V in T*, and let S1, S2, …, Sk be the strings labeling V’s children. We lift Sj to V if D(Sv*,Sj) ≤ D(Sv*,Si) for every i from 1 to k.
The transformation creating T^L
[Figure: the lifting operation at node V. The numbers on the edges are the distances from Sv* to the lifted strings labeling its children. Note that after the lift, one edge will have zero distance.]
The error analysis
Theorem: The lifted alignment T^L has total distance less than or equal to twice that of the optimal phylogenetic alignment T* of T.
Computing the minimum distance lifted alignment
The best lifted alignment is computed by dynamic programming.
Definition: Let Tv be the subtree of T rooted at node V. Let d(V,S) denote the distance of the best lifted alignment of Tv under the requirement that string S is assigned to node V (assuming of course that S is a string at a leaf of Tv).
Computing the minimum distance lifted alignment
We start with the assumption that all the leaves have already been processed.
Notation: S’ is a string written at a leaf, and V’ is a child of V.
If V is a node all of whose children are leaves, then
d(V,S) = Σ_{S’} D(S,S’), where the sum is over the strings S’ at the children of V.
For a general internal node V, the dynamic programming recurrence is
d(V,S) = Σ_{V’} min_{S’} [ D(S,S’) + d(V’,S’) ], where the sum is over the children V’ of V and the minimum is over the strings S’ at the leaves of the subtree rooted at V’.
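The dynamic program can be sketched with a tree given as nested lists whose leaves are strings. Unit-cost edit distance is assumed for D, leaf strings are assumed distinct, and the function names are hypothetical. One detail is a reading of the lifted definition rather than something stated in the recurrence: for the child whose subtree contains S, the sketch keeps S at that child, so that the result is always a lifted alignment.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, standing in for D(S, S')."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            cur.append(min(sub, prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def lifted(tree):
    """Return {S: d(V,S)} for the root V of `tree` (a str leaf or a list)."""
    if isinstance(tree, str):
        return {tree: 0}                       # a leaf is already labeled
    child_tables = [lifted(child) for child in tree]
    table = {}
    for own in child_tables:
        for S in own:                          # S comes from this child's leaves
            total = own[S]                     # that child keeps S: edge cost 0
            for other in child_tables:
                if other is not own:
                    total += min(edit_distance(S, S2) + cost
                                 for S2, cost in other.items())
            table[S] = total
    return table

tree = [["AXZ", "AXXZ"], ["AYZ", "AYXYZ"]]
root = lifted(tree)
print(min(root, key=root.get), min(root.values()))
```

Each table has one entry per leaf below the node, so the number of (S, S’) pairs examined is polynomial in the number of leaves, in line with the polynomial-time claim for lifted alignments.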
Computing the minimum distance lifted alignment
Theorem: The optimal lifted alignment can be computed in polynomial time as a function of the size of the tree and the lengths of the input strings.
Iterative pairwise alignment
The target is to iteratively merge two multiple alignments of two subsets of strings into a single multiple alignment of the union of those subsets.
As an example we will explain the average linkage method, which is also known as UPGMA, for “Unweighted Pair-Group Method using arithmetic Averages”. At each merge step, the new multiple alignment could be created by aligning some representation of the two smaller alignments (for example, by aligning profiles or consensus sequences).
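The merge order chosen by average linkage can be sketched as below. Only the clustering is shown; the alignment performed at each merge is omitted. The function name and data layout are hypothetical, and the distances are five of the globin values from the pairwise distance matrix in the progressive alignment slides of this document.

```python
from itertools import combinations

def upgma_merge_order(names, dist):
    """Return the list of (cluster_a, cluster_b) merges chosen by UPGMA.

    `dist` maps frozenset({x, y}) to the distance between sequences x and y.
    """
    def average(pair):
        a, b = pair
        total = sum(dist[frozenset([x, y])] for x in a for y in b)
        return total / (len(a) * len(b))       # mean over all cross pairs

    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=average)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return merges

names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Phyca"]
triples = [
    ("Hbb_Human", "Hbb_Horse", .17), ("Hbb_Human", "Hba_Human", .59),
    ("Hbb_Horse", "Hba_Human", .60), ("Hbb_Human", "Hba_Horse", .59),
    ("Hbb_Horse", "Hba_Horse", .59), ("Hba_Human", "Hba_Horse", .13),
    ("Hbb_Human", "Myg_Phyca", .77), ("Hbb_Horse", "Myg_Phyca", .77),
    ("Hba_Human", "Myg_Phyca", .75), ("Hba_Horse", "Myg_Phyca", .75),
]
dist = {frozenset([x, y]): d for x, y, d in triples}
merges = upgma_merge_order(names, dist)
for a, b in merges:
    print(sorted(a), "+", sorted(b))
```

The two alpha chains merge first, then the two beta chains, then the two hemoglobin groups, and myoglobin joins last; each merge corresponds to one internal node of the binary tree of merges.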
Iterative pairwise alignment
Multiple alignments serve to characterize protein families and to identify important molecular structures, but…
Doolittle: “…what we’re really interested in is a historical alignment. The historical alignment ought to reflect, as accurately as possible, the series of divergences that led to the contemporary sequences…”
Iterative pairwise alignment
Iterative alignment methods determine a sequence of merges of disjoint subsets of strings. Hence the history of those merges can be described by a binary tree T. Each leaf of T represents a single string from the input set, and each node of T specifies a merge of the strings found at the leaves of its subtree. Each node also represents a multiple alignment created by the merge at that node.
Progressive alignment
A pair of strings with minimum edit distance (or greatest similarity) is likely obtained from the pair of taxa that has most recently diverged.
Any spaces (gaps) that appear in the optimal pairwise alignment of those two strings are preserved throughout the entire sequence of successive merges.
Progressive alignment
The progressive alignment method is explicitly aimed at building an evolutionary tree from molecular data while simultaneously constructing an evolutionarily informative multiple alignment.
Improvements to progressive alignment
Sequence weighting – the weights are normalized such that the biggest one is set to 1. Closely related sequences receive lowered weights. Highly divergent sequences receive high weights.
Initial gap penalties – a gap opening penalty (GOP) is given for every gap, and gap extension penalty (GEP) gives the cost of every space in the gap.
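Under this scheme a gap of L spaces costs GOP + GEP × L. The sketch below totals that cost over one aligned row; the function name and the default penalty values (GOP = 10, GEP = 1) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import re

def gap_penalty(row, gop=10, gep=1, space="_"):
    """Total penalty of a row: GOP once per gap plus GEP per space in it."""
    gaps = re.findall(re.escape(space) + "+", row)   # maximal runs of spaces
    return sum(gop + gep * len(gap) for gap in gaps)

print(gap_penalty("A__B_C"))   # two gaps, three spaces: (10 + 2) + (10 + 1) = 23
```

Charging the opening penalty once per gap makes one long gap cheaper than several short ones with the same total number of spaces.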
Improvements to progressive alignment
Weight matrices – Two main series of weight matrices are offered to the user: Dayhoff PAM, BLOSUM.
Divergent sequences – The most divergent sequences are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences are merged first.
Progressive alignment
Pairwise alignment: calculate distance matrix
Hbb_Human    1    -
Hbb_Horse    2   .17   -
Hba_Human    3   .59  .60   -
Hba_Horse    4   .59  .59  .13   -
Myg_Phyca    5   .77  .77  .75  .75   -
Glb5_Petma   6   .81  .82  .73  .74  .80   -
Lgb2_Luplu   7   .87  .86  .86  .88  .93  .90
                  1    2    3    4    5    6
Progressive alignment
[Figure: unrooted neighbor-joining tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Phyca, Glb5_Petma and Lgb2_Luplu.]
Progressive alignment
[Figure: rooted NJ tree (guide tree) and sequence weights for Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Phyca, Glb5_Petma and Lgb2_Luplu.]
Progressive alignment: align following the guide tree.
Repeated-motif methods
The second major approach used in multiple alignment methods.
Definition: A motif is a substring or a small subsequence that is common to many of the strings in the set.
“Width” refers to the length of the motif, and “multiplicity” refers to the number of strings that it appears in.
Repeated-motif methods
Repeated-motif method general algorithm:
1. Find a “good” motif (wide and with high multiplicity).
2. The strings containing it are shifted so that the occurrences of the motif are aligned with each other.
3. The problem divides into two subproblems, one for the substrings on each side of the motif.
4. Continue this recursion until no motif that is sufficiently wide or of sufficiently high multiplicity is found.
5. The remaining subproblems can be solved by iterative alignment methods.
6. Strings that did not contain the first good motif are aligned separately.
7. Finally, the two alignments are merged.
Summary
The importance of multiple string alignments in molecular biology.
CLUSTAL W.
Family representation.
How to score multiple alignments.
The center star method for SP alignment.
Consensus strings.
Approximating the optimal consensus multiple alignment.
Iterative pairwise alignment.
Progressive alignment and contemporary improvements.
Repeated-motif methods.
Bibliography
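Gusfield, Dan. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1997.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 1994, Vol. 22, No. 22, Oxford University Press.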