• Exact methods, like Needleman and Wunsch, generate optimal alignments but aren’t feasible for alignments of many sequences.
• The computational time (T = number of steps) for this approach is described in big O notation as O(2^N L^N), where N is the number of sequences and L is the average sequence length.
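To get a feel for how quickly this bound grows, here is a minimal sketch that simply evaluates 2^N × L^N. The function name `exact_msa_steps` and the example values of N and L are illustrative assumptions, not part of any alignment program.

```python
def exact_msa_steps(n_sequences: int, avg_length: int) -> int:
    """Order-of-magnitude step count for exact MSA: 2^N * L^N."""
    return (2 ** n_sequences) * (avg_length ** n_sequences)


if __name__ == "__main__":
    # Hypothetical average length of 300 residues, chosen only for illustration.
    for n in (2, 3, 5, 10):
        print(f"N={n:2d}  steps ~ {exact_msa_steps(n, 300):.3e}")
```

Even a handful of sequences pushes the step count far beyond what is practical, which is why heuristic methods are used instead.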
• Build MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Follows the rule: “once a gap, always a gap.”
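The joining order described above can be sketched roughly as follows. This is a simplification under stated assumptions: a real guide tree can also merge one group of sequences with another group, whereas this sketch only adds one sequence at a time. The function name `progressive_order` and the distance values are made up for illustration.

```python
def progressive_order(names, dist):
    """Return the order in which sequences would join the alignment.

    dist maps frozenset({a, b}) -> pairwise distance (smaller = more related).
    """
    # Start with the two most closely related sequences.
    first_pair = min(dist, key=dist.get)
    order = sorted(first_pair)
    remaining = [n for n in names if n not in order]
    # Repeatedly add the sequence closest to anything already aligned.
    while remaining:
        nxt = min(
            remaining,
            key=lambda r: min(dist[frozenset((r, a))] for a in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    # Hypothetical distances, not measured values.
    names = ["human", "chimp", "mouse", "rice"]
    dist = {
        frozenset(("human", "chimp")): 0.01,
        frozenset(("human", "mouse")): 0.15,
        frozenset(("chimp", "mouse")): 0.16,
        frozenset(("human", "rice")): 0.60,
        frozenset(("chimp", "rice")): 0.61,
        frozenset(("mouse", "rice")): 0.62,
    }
    print(progressive_order(names, dist))
```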
2 closest alignments
Why “once a gap, always a gap”?
• There are many possible ways to make a MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most believable
• Insertions receive higher penalties than deletions, and are propagated throughout alignment
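One way to picture the rule: when a later sequence needs a new gap, the gap is added as a whole column to the block that is already aligned, so the earlier gaps are never moved or removed. The sketch below shows only that bookkeeping step; the helper name `add_gap_columns` and the toy sequences are assumptions for illustration.

```python
def add_gap_columns(alignment, positions):
    """Insert an all-gap column into every row at each given column index.

    Existing characters (including existing gaps) keep their relative order,
    which is the "once a gap, always a gap" behaviour.
    """
    rows = list(alignment)
    # Insert from right to left so earlier indices stay valid.
    for pos in sorted(positions, reverse=True):
        rows = [row[:pos] + "-" + row[pos:] for row in rows]
    return rows


if __name__ == "__main__":
    aligned_pair = ["MKT-AL", "MKTGAL"]  # toy alignment of the two closest sequences
    print(add_gap_columns(aligned_pair, [2]))
```

The original gap in the first row is still there after the new column is added.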
Note placement of M and A at end of gap
Partial ClustalW Output for CD2 Protein
1. Color coding indicates AA property class
2. * Indicates 100% conserved over entire alignment
3. : Conservative mutations
4. . Less conservative mutations
5. [blank] gap or least conserved mutations
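The symbols in the list above can be reproduced with a short sketch. The "strong" and "weak" residue groups below are taken from memory of the Clustal documentation and should be treated as an assumption to check against the version you use; the toy alignment is made up.

```python
# Residue groups assumed from the Clustal documentation (verify before relying on them).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
        "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = {c.upper() for c in column}
    if "-" in residues:
        return " "  # any gap leaves the column unmarked
    if len(residues) == 1:
        return "*"  # fully conserved
    if any(residues <= set(group) for group in STRONG):
        return ":"  # conservative substitution
    if any(residues <= set(group) for group in WEAK):
        return "."  # less conservative substitution
    return " "


def consensus_line(alignment):
    """Build the symbol line printed under an alignment block."""
    return "".join(column_symbol(col) for col in zip(*alignment))


if __name__ == "__main__":
    toy = ["MKCW", "MRSW", "MKAG"]
    for row in toy:
        print(row)
    print(consensus_line(toy))
```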
ClustalW Output for CD2 Protein
Can use to build phylogeny tree
Clustal W alignment of 5 distantly related globins
Clustal W alignment of 5 closely related globins
* asterisks indicate identity in a column
Additional features of ClustalW improve its ability to generate accurate MSAs
• Individual weights are assigned to sequences; very closely related sequences are given less weight, while distantly related sequences are given more weight
• Scoring matrices are varied dependent on the presence of conserved or divergent sequences, e.g.:
– PAM20: 80-100% identity
– PAM60: 60-80% identity
– PAM120: 40-60% identity
– PAM350: 0-40% identity
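The matrix ranges above amount to a simple lookup by percent identity. In this minimal sketch, the function name `choose_pam` is hypothetical, and assigning each shared boundary (80, 60, 40) to the higher-identity matrix is an assumption, since the ranges as listed overlap at their endpoints.

```python
def choose_pam(percent_identity: float) -> str:
    """Pick a PAM matrix name from percent identity, following the ranges listed."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "PAM20"
    if percent_identity >= 60:
        return "PAM60"
    if percent_identity >= 40:
        return "PAM120"
    return "PAM350"


if __name__ == "__main__":
    for pid in (95, 70, 45, 20):
        print(pid, choose_pam(pid))
```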
• Residue-specific gap penalties are applied
In-Class Assignment: Multiple sequence alignments using ClustalW
• Example of MSA using ClustalW: two data sets
• Five distantly related globins (human to plant)
• Five closely related beta globins
• Obtain your sequences in the FASTA format!
• You can save them in Notepad or another text editor.
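If you want to check that a saved file really is in FASTA format (a ">" header line followed by one or more sequence lines), a small reader like the following can help. It is a minimal sketch: the function name `parse_fasta` and the placeholder sequences are assumptions, and the sequences shown are not real globins.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    # Placeholder records for illustration only.
    example = """>seq1 example protein
ACDEFGHIK
LMNPQRS
>seq2 another example
ACDEYGHIK
"""
    for name, seq in parse_fasta(example).items():
        print(name, len(seq), seq)
```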
MSA: Iterative Methods
• Compute a sub-optimal solution and keep modifying that intelligently using dynamic programming or other methods until the solution converges.
• Available at http://mafft.cbrc.jp/alignment/software/
• Uses Fast Fourier Transform to speed up profile alignment
• Uses fast two-stage method for building alignments using k-mer (matching 6-tuples) frequencies
• Offers many different scoring and aligning techniques
• One of the more accurate programs available
• Available as standalone or web interface
• Many output formats, including interactive phylogenetic
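The k-mer idea mentioned above can be illustrated with a rough similarity measure: count the 6-tuples two sequences share without aligning them. This is a sketch of the general idea only, not the scoring any particular program uses; the function names and the toy sequences are assumptions.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 6) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a: str, b: str, k: int = 6) -> float:
    """Fraction of k-mers shared, relative to the sequence with fewer k-mers."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((counts_a & counts_b).values())  # multiset intersection
    smallest = min(sum(counts_a.values()), sum(counts_b.values()))
    return shared / smallest if smallest else 0.0


if __name__ == "__main__":
    print(kmer_similarity("ACDEFGHIK", "ACDEFGHIK"))
    print(kmer_similarity("ACDEFGHIK", "ACDEFGYYY"))
```

Because no alignment is needed, a measure like this is cheap enough to compute for every pair of sequences before any alignment is built.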
• Use database of both local high-scoring alignments and long-range global alignments to create a final alignment
• Incorporates evidence from multiple sequences to guide pairwise alignment
– In a sequence, if x is related to y, and y is related to z, then x should be related to z.
• Fast and accurate
• Examples: T-COFFEE, Prrp, DiAlign, ProbCons
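The "x is related to y, y is related to z" idea can be sketched with pairwise alignments stored as sets of matched residue positions. This is a toy illustration of the consistency principle, not the weighting scheme of T-COFFEE or ProbCons; all names and the example pairs are assumptions.

```python
def transitive_support(xy_pairs, yz_pairs):
    """Residue pairs (i, k) between x and z that are implied by going through y."""
    y_to_z = dict(yz_pairs)  # position in y -> matched position in z
    return {(i, y_to_z[j]) for i, j in xy_pairs if j in y_to_z}


def consistency_scores(xz_pairs, xy_pairs, yz_pairs):
    """Score each direct x-z match: 1 on its own, 2 if sequence y also supports it."""
    implied = transitive_support(xy_pairs, yz_pairs)
    return {pair: 1 + int(pair in implied) for pair in xz_pairs}


if __name__ == "__main__":
    # Hypothetical matched positions (0-based residue indices).
    xy = {(0, 0), (1, 1), (2, 3)}
    yz = {(0, 0), (1, 2), (3, 3)}
    xz = {(0, 0), (1, 1), (2, 3)}
    print(consistency_scores(xz, xy, yz))
```

Matches that a third sequence agrees with receive more weight, so they are more likely to survive into the final alignment.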
Which methods are best?
• Depends on:
– Number of sequences to align.
– What you are trying to do.
– Level of user expertise.
– Personal preference.
• Other considerations:
– Does method use benchmarking of multiple structures?
– Do you want to evaluate 3D protein structures (e.g., try Expresso
• You might want to:
– Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).
– Compare results.
Example: 5 alignments of 5 globins
Let’s look at a multiple sequence alignment (MSA) of five globin proteins. We’ll use five prominent MSA programs: ClustalW, Praline, MUSCLE (used at HomoloGene), ProbCons, and T-Coffee. Each program offers unique strengths.
We’ll focus on a histidine (H) residue that has a critical role in binding oxygen in globins, and should be aligned. But often it’s not aligned, and all five programs give different answers.
Our conclusion will be that there is no single best approach to MSA.
Note how the region of a conserved histidine (▼) varies depending on which of five prominent algorithms is used
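To check whether a particular residue, such as the conserved histidine, ends up in the same column for every sequence, you can map its position in the ungapped sequence to its alignment column. The sketch below assumes you already know the residue's 0-based index in each sequence; the function names and the toy alignment are illustrative, not real globin data.

```python
def column_of_residue(aligned_seq: str, residue_index: int) -> int:
    """Return the 0-based alignment column holding the residue at residue_index."""
    seen = -1
    for col, ch in enumerate(aligned_seq):
        if ch != "-":
            seen += 1
            if seen == residue_index:
                return col
    raise IndexError("residue index is beyond the end of the sequence")


def residue_is_aligned(alignment: dict, residue_indices: dict) -> bool:
    """True if the chosen residue of every sequence falls in one column."""
    columns = {
        column_of_residue(alignment[name], idx)
        for name, idx in residue_indices.items()
    }
    return len(columns) == 1


if __name__ == "__main__":
    toy = {"a": "MK-HL", "b": "MKAHL", "c": "M-KHL"}
    his_index = {"a": 2, "b": 3, "c": 2}  # position of H in each ungapped sequence
    print(residue_is_aligned(toy, his_index))
```

Running the same check on the output of each program makes the differences between alignments easy to compare.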
ClustalW Results
Praline Results
MUSCLE Results
ProbCons Results
T-Coffee Results
See Thompson et al. (1994) for an explanation of the three stages of progressive alignment implemented in ClustalW