rafe-armstrong
Biology 4900
Biocomputing
Chapter 6
Multiple Sequence Alignments
Relationships between biological sequences
• Biological sequences tend to occur in families
  – These may be related genes within an organism (paralogs) or between species (orthologs)
  – Presumably derived from a common ancestor
• Nucleotides corresponding to coding regions are typically less well conserved than proteins due to degeneracy of the genetic code
  – More difficult to align
Sequences evolve faster than structures, but homologous sequences tend to retain similar structure and function (e.g., rat vs. human CaM)
Multiple sequence alignments
• Homology can be observed through multiple sequence alignments (MSA)
• MSA: 3 or more protein (or nucleic acid) sequences that are partially or completely aligned
• Homologous residues are aligned in columns across the length of the sequences
1exr_A -EQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 59
1N0Y_A AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 60
3cln_  ----TEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 56
          :************:*******************************************

1exr_A GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 119
1N0Y_A GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 120
3cln_  GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE 116
       *********::******: *****: ***:***:**** *******************:*

1exr_A VDEMIREADIDGDGHINYEEFVRMMVS- 146
1N0Y_A VDEMIREADIDGDGHINYEEFVRMMVSK 148
3cln_  VDEMIREANIDGDGQVNYEEFVQMMTA- 143
       ********:*****::******:**.:
Multiple sequence alignments
• MSAs are powerful because they can reveal a relationship between 2 sequences that becomes apparent only through their relationships with a third sequence
[Figure: alignment of Seq 1 and Seq 2, followed by an alignment of Seq 1, Seq 3, and Seq 2. Sequence text as extracted:]
AVGYDFGEKMLSGADDWLVGERADLTGAEIDE
AVGYDFGEKMLSGA--DDWLVGYDRADK-LTGAE-DD-LVG-ERAD--LTGAEIDE-
How are MSAs determined?
MSAs can be determined based on:
• Presence of highly conserved residues such as cysteine
• Conserved motifs and domains
• Conserved features of protein secondary structure
• Regions showing consistent patterns of insertions or deletions
C-terminal domain of CaM (from 3cln.pdb)
Conserved 2° structure (α-helices)
Why use MSAs?
• If the protein (or gene) you are studying is part of a larger group, you may be able to gain insight into the structure, function, and evolution of the sequence.
• MSAs are more sensitive than pairwise alignments for detecting homologs.
• MSAs can reveal conserved residues, motifs, and domains.
• Useful for generating phylogenetic trees.
• Regulatory regions of many genes contain conserved consensus sequences.
Benchmarking
• Q: How good is a MSA?
• A: Compare sequence alignment against known structure alignments (reference scores).
  – Measured by an objective scoring system such as sum-of-pairs scores (SPS).
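Example alignment (N rows, M columns):
Columns: Ai1 Ai2 Ai3 Ai4 Ai5
1 A V L I
2 A G M L I
3 A V M R

$S_i$ = sum of scores for all pairs in column $i$

$\mathrm{SPS} = \dfrac{\sum_{i=1}^{M} S_i}{\sum_{i} S_{r,i}}$

Numerator: sum of scores for all your aligned columns. Denominator: sum of reference scores.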
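The sum-of-pairs idea can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code. It assumes a pair of residues scores 1 when the test alignment places them in the same column as the reference does, and 0 otherwise. The function names and the gap placements in the example are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (row index, position in the ungapped sequence).
    """
    counters = [0] * len(alignment)  # next residue position for each row
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, char in enumerate(column):
            if char != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)


# Hypothetical gap placements for the three example rows
reference = ["AV-LI", "AGMLI", "AVMR-"]
test = ["AVL-I", "AGMLI", "AVMR-"]
print(round(sum_of_pairs_score(test, reference), 3))
```

A score of 1.0 means every residue pair aligned in the reference is also aligned in the test alignment.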
Five MSA Approaches
1. Exact methods
2. Progressive alignment (e.g., ClustalW)
3. Iterative approaches (e.g., PRALINE, IterAlign, MUSCLE)
4. Consistency-based methods (e.g., MAFFT, ProbCons)
5. Structure-based methods (e.g., Expresso)
Our Focus
Exact Methods
• Exact methods, like Needleman and Wunsch, generate optimal alignments but aren’t feasible for alignments of many sequences.
• Computational time for this approach is described in Big O notation as O(2^N L^N): the number of steps T grows on the order of 2^N L^N, where N is the number of sequences and L is the average sequence length.
Progressive Sequence Alignment (Feng-Doolittle)
How it works:
1. Calculates pairwise sequence alignment scores between all proteins (or nucleic acid sequences)
2. Aligns the 2 closest sequences using a guide tree
3. Progressively aligns more sequences to the first 2
• Advantages: Permits rapid alignment of 100s of sequences.
• Disadvantages: May not provide the most accurate alignment, depending on how the alignment is started.
ClustalW
MUSCLE
What these numbers mean…
N=10 sequences, L=100 residues (Avg.)
• Needleman & Wunsch: Too large to calculate
• ClustalW: 20,000
• MUSCLE: 110,000
best score
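The step counts for N=10 and L=100 can be checked with a short calculation. The exact-method bound is read here as 2^N · L^N. The ClustalW and MUSCLE expressions, N^4 + L^2 and N^4 + N·L^2, are assumptions: they do not appear in the slide text, but they are the complexities usually quoted for these programs and they reproduce the 20,000 and 110,000 figures.

```python
def exact_steps(n, length):
    """Exact (Needleman-Wunsch style) MSA, read as 2^N * L^N."""
    return (2 ** n) * (length ** n)


def clustalw_steps(n, length):
    """Assumed ClustalW complexity: N^4 + L^2."""
    return n ** 4 + length ** 2


def muscle_steps(n, length):
    """Assumed MUSCLE complexity: N^4 + N * L^2."""
    return n ** 4 + n * length ** 2


N, L = 10, 100
for name, steps in [
    ("Needleman & Wunsch", exact_steps(N, L)),
    ("ClustalW", clustalw_steps(N, L)),
    ("MUSCLE", muscle_steps(N, L)),
]:
    print(f"{name}: {steps:,} steps")
```

For ten sequences the exact method already needs on the order of 10^23 steps, which is why it is listed as too large to calculate.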
Progressive MSA stage 1 of 3: generate global pairwise alignments
• For n sequences, (n-1)(n) / 2 = number of alignments
• For 5 sequences, (4)(5) / 2 = 10 alignments
* First find the two that produce the highest score
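Stage 1 can be sketched as an all-against-all loop over sequence pairs. The scoring scheme (match +1, mismatch -1, linear gap -2) and the five DNA-like sequences are hypothetical placeholders. A real run would use a substitution matrix such as PAM or BLOSUM and the program's own gap penalties.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def all_pairwise_scores(seqs):
    """Score every unordered pair: n(n-1)/2 alignments for n sequences."""
    return {
        (i, j): global_alignment_score(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


# Hypothetical toy sequences
seqs = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "GGGGCCCC", "CCCCGGGG"]
scores = all_pairwise_scores(seqs)
closest = max(scores, key=scores.get)
print(len(scores), "alignments; closest pair:", closest)
```

The pair with the highest score is the one the progressive method aligns first.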
Tree Views of alignments
• Alignments may be evaluated by either similarity or distance measures
• A tree shows the distance between objects
Closely-related Sequences
Distantly-related Sequences
How to read tree views of alignments
Closely-related Sequences
5 closely related globins
Feng-Doolittle stage 2: guide tree
• Convert similarity scores to distance scores
• Use unweighted pair group method of arithmetic averages (UPGMA; defined in Chapter 7)
• ClustalW output shown below. Use JalView in ClustalW to display tree view.
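A guide tree can be built from a distance matrix with a compact UPGMA routine. The sketch below is illustrative only. The four labels and their distances are hypothetical and are assumed to have already been converted from similarity scores. The tree is returned as nested tuples without branch lengths.

```python
def upgma(names, matrix):
    """Build a UPGMA tree (nested tuples) from a symmetric distance matrix."""
    trees = {i: name for i, name in enumerate(names)}
    sizes = {i: 1 for i in trees}
    dist = {(i, j): matrix[i][j] for i in trees for j in trees if i < j}
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        new = next_id
        next_id += 1
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Average over all original members of the merged cluster
            dist[(k, new)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        trees[new] = (trees[i], trees[j])
        sizes[new] = sizes[i] + sizes[j]
        for old in (i, j):
            del trees[old]
            del sizes[old]
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
    return next(iter(trees.values()))


# Hypothetical distances between four sequences
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(names, matrix)
print(tree)
```

The innermost tuple holds the two closest sequences, which is where the progressive alignment starts.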
Feng-Doolittle stage 3: progressive alignment
• Build MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Follows the rule: “once a gap, always a gap.”
2 closest alignments
Why “once a gap, always a gap”?
• There are many possible ways to make a MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most believable
• Insertions receive higher penalties than deletions, and are propagated throughout alignment
Note placement of M and A at end of gap
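The rule can be made concrete with a small merge routine. This is a sketch, not ClustalW's implementation. It assumes an earlier step has already aligned the new sequence against the existing alignment and produced an edit string: 'M' pairs an existing column with a residue, 'D' pairs an existing column with a gap in the new sequence, and 'I' opens a new column. The function name and the example data are hypothetical.

```python
def add_to_alignment(msa, new_seq, ops):
    """Merge new_seq into msa following ops, never removing existing gaps."""
    rows = [[] for _ in msa]
    new_row = []
    col = 0  # next existing alignment column
    pos = 0  # next residue of new_seq
    for op in ops:
        if op in "MD":
            # Existing column is copied unchanged, gaps included
            for out, row in zip(rows, msa):
                out.append(row[col])
            col += 1
            if op == "M":
                new_row.append(new_seq[pos])
                pos += 1
            else:
                new_row.append("-")
        elif op == "I":
            # A new gap is propagated to every sequence already aligned
            for out in rows:
                out.append("-")
            new_row.append(new_seq[pos])
            pos += 1
        else:
            raise ValueError(f"unknown operation: {op!r}")
    if col != len(msa[0]) or pos != len(new_seq):
        raise ValueError("ops do not cover the alignment and the new sequence")
    return ["".join(out) for out in rows] + ["".join(new_row)]


print(add_to_alignment(["AC-T", "AGGT"], "ACGTT", "MMMMI"))
```

Existing columns are only ever copied or padded, so a gap placed between the first two sequences survives every later addition.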
Partial ClustalW Output for CD2 Protein
1. Color coding indicates AA property class
2. * Indicates 100% conserved over entire alignment
3. : Conservative mutations
4. . Less conservative mutations
5. [blank] gap or least conserved mutations
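The conservation symbols can be reproduced with a short routine. The "strong" and "weak" residue groups below are an assumption: they follow the groupings commonly documented for Clustal output and should be checked against the version in use. The test data is the last block of the calmodulin alignment shown earlier in these slides.

```python
# Assumed Clustal residue groups (not given on the slide)
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.', or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # gap in the column
    if len(residues) == 1:
        return "*"  # fully conserved
    if any(residues <= set(group) for group in STRONG):
        return ":"  # conservative substitution
    if any(residues <= set(group) for group in WEAK):
        return "."  # less conservative substitution
    return " "


def conservation_line(alignment):
    """Build the symbol line printed under an alignment block."""
    return "".join(column_symbol(col) for col in zip(*alignment))


block = [
    "VDEMIREADIDGDGHINYEEFVRMMVS-",  # 1exr_A
    "VDEMIREADIDGDGHINYEEFVRMMVSK",  # 1N0Y_A
    "VDEMIREANIDGDGQVNYEEFVQMMTA-",  # 3cln_
]
print(conservation_line(block))
```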
ClustalW Output for CD2 Protein
Can be used to build a phylogenetic tree
Alignment Size
Medium
Medium
Small
Clustal W alignment of 5 distantly related globins
Clustal W alignment of 5 closely related globins
* asterisks indicate identity in a column
Additional features of ClustalW improve its ability to generate accurate MSAs
• Individual weights are assigned to sequences; very closely related sequences are given less weight, while distantly related sequences are given more weight
• Scoring matrices are varied dependent on the presence of conserved or divergent sequences, e.g.:
  PAM20   80-100% id
  PAM60   60-80% id
  PAM120  40-60% id
  PAM350  0-40% id
• Residue-specific gap penalties are applied
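The identity ranges for the PAM series can be written as a simple lookup. This is only a sketch of the table above. The function name is hypothetical, and assigning each boundary value (80, 60, 40) to the higher-identity matrix is an assumption, since the listed ranges share their endpoints.

```python
def choose_pam_matrix(percent_identity):
    """Pick a PAM matrix name from the percent identity of two sequences."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "PAM20"
    if percent_identity >= 60:
        return "PAM60"
    if percent_identity >= 40:
        return "PAM120"
    return "PAM350"


for pid in (95, 70, 45, 20):
    print(pid, choose_pam_matrix(pid))
```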
In-Class Assignment: Multiple sequence alignments using ClustalW
• Example of MSA using ClustalW: two data sets
• Five distantly related globins (human to plant)
• Five closely related beta globins
• Obtain your sequences in the FASTA format!
• You can save them in Notepad or other text editor.
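A FASTA file is plain text: each record has a header line starting with ">" followed by one or more lines of sequence. The reader below is a minimal sketch for checking a saved file before submitting it. The function name and the two records are hypothetical, and it keeps only the first word of each header as the identifier.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of identifier -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            name = header[0]
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Hypothetical two-record example
example = """>seq1 example protein
MVLSPADKTN
VKAAWGKV
>seq2
MVHLTPEEKS
"""
sequences = read_fasta(example)
print(sequences)
```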
MSA: Iterative Methods
• Compute a sub-optimal solution and keep modifying that intelligently using dynamic programming or other methods until the solution converges.
• Unlike progressive methods, iterative methods can dynamically correct alignment errors
• Examples:
  – MUSCLE: Multiple Sequence Comparison by Log-Expectation (Edgar, 2004)
  – Iteralign: (Karlin and Brocchieri, 1998)
  – Praline: PRofile ALIgNmEnt (Heringa, 1999; Simossis and Heringa, 2005)
  – MAFFT: Multiple Alignment using Fast Fourier-Transform (Katoh et al., 2005)
Iterative approaches: MAFFT
• Available at http://mafft.cbrc.jp/alignment/software/
• Uses Fast Fourier Transform to speed up profile alignment
• Uses fast two-stage method for building alignments using k-mer (matching 6-tuples) frequencies
• Offers many different scoring and aligning techniques
• One of the more accurate programs available
• Available as standalone or web interface
• Many output formats, including interactive phylogenetic trees
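The idea of estimating relatedness from shared k-mers, without aligning the sequences first, can be sketched as follows. This is a generic illustration with k=6 and is not MAFFT's actual scoring. The function names and the normalization by the shorter sequence are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=6):
    """Return 1 minus the fraction of k-mers shared by two sequences."""
    possible = min(len(a), len(b)) - k + 1
    if possible <= 0:
        raise ValueError("both sequences must be at least k residues long")
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1 - shared / possible


print(kmer_distance("TEEQIAEFKEAF", "TEEQIAEFKEAF"))
print(kmer_distance("TEEQIAEFKEAF", "GGGGGGGGGGGG"))
```

Because no alignment is computed, distances like this are cheap enough to build a first rough guide tree for many sequences.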
Iterative approaches: MUSCLE
• Available at http://www.ebi.ac.uk/Tools/msa/muscle/
• 3-stage approach
• Stage 1:
  – Algorithm builds initial alignment based on similarities of paired alignments
  – Calculates distance matrix and generates rooted tree
• Stage 2:
  – Improves tree by recalculating similarities
• Stage 3:
  – Rescores pairs at branches
MSA: Consistency-based algorithms
• Use database of both local high-scoring alignments and long-range global alignments to create a final alignment
• Incorporates evidence from multiple sequences to guide pairwise alignment
  – In a sequence, if x is related to y, and y is related to z, then x should be related to z.
• Fast and accurate
• Examples: T-COFFEE, Prrp, DiAlign, ProbCons
Which methods are best?
• Depends on:
  – Number of sequences to align.
  – What you are trying to do.
  – Level of user expertise.
  – Personal preference.
• Other considerations:
  – Does method use benchmarking of multiple structures?
  – Do you want to evaluate 3D protein structures (e.g., try Expresso at http://www.tcoffee.org)?
• You might want to:
  – Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).
  – Compare results.
Example: 5 alignments of 5 globins
Let’s look at a multiple sequence alignment (MSA) of five globin proteins. We’ll use five prominent MSA programs: ClustalW, Praline, MUSCLE (used at HomoloGene), ProbCons, and TCoffee. Each program offers unique strengths.
We’ll focus on a histidine (H) residue that has a critical role in binding oxygen in globins, and should be aligned. But often it’s not aligned, and all five programs give different answers.
Our conclusion will be that there is no single best approach to MSA.
Note how the region of a conserved histidine (▼) varies depending on which of five prominent algorithms is used
ClustalW Results
Praline Results
Muscle Results
ProbCons Results
Tcoffee Results
ClustalW
Praline
Muscle
ProbCons
See Thompson et al. (1994) for an explanation of the three stages of progressive alignment implemented in ClustalW