Biology 4900 Biocomputing. Chapter 6 Multiple Sequence Alignments


Relationships between biological sequences

• Biological sequences tend to occur in families
  – These may be related genes within an organism (paralogs) or between species (orthologs)
  – Presumably derived from a common ancestor

• Nucleotides corresponding to coding regions are typically less well conserved than proteins due to degeneracy of the genetic code
  – More difficult to align

Sequences evolve faster than structures, but homologous sequences tend to retain similar structure and function (e.g., rat vs. human CaM)

Multiple sequence alignments

• Homology can be observed through multiple sequence alignments (MSA)

• MSA: 3 or more protein (or nucleic acid) sequences that are partially or completely aligned

• Homologous residues are aligned in columns across the length of the sequences

1exr_A  -EQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 59
1N0Y_A  AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 60
3cln_   ----TEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 56
           :************:*******************************************

1exr_A  GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 119
1N0Y_A  GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 120
3cln_   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE 116
        *********::******: *****: ***:***:**** *******************:*

1exr_A  VDEMIREADIDGDGHINYEEFVRMMVS- 146
1N0Y_A  VDEMIREADIDGDGHINYEEFVRMMVSK 148
3cln_   VDEMIREANIDGDGQVNYEEFVQMMTA- 143
        ********:*****::******:**.:

Multiple sequence alignments

• MSAs are powerful because they can reveal relationships between 2 sequences that can only be observed by their relationships with a third sequence

[Figure: Seq 1 aligned with Seq 2, then Seq 1, Seq 3, and Seq 2 aligned together]

How are MSAs determined?

MSAs can be determined based on:

• Presence of highly conserved residues such as cysteine
• Conserved motifs and domains
• Conserved features of protein secondary structure
• Regions showing consistent patterns of insertions or deletions

C-terminal domain of CaM (from 3cln.pdb)

Conserved 2° structure (α-helices)

Why use MSAs?

• If protein (or gene) you are studying is part of a larger group, you may be able to gain insight into structure, function and evolution of the sequence.

• MSAs are more sensitive than pairwise alignments at detecting homologs.

• MSAs can reveal conserved residues, motifs, domains.

• Useful for generating phylogenetic trees.

• Regulatory regions of many genes contain conserved consensus sequences.

Benchmarking

• Q: How good is an MSA?
• A: Compare the sequence alignment against known structure alignments (reference scores).
  – Measured by an objective scoring system such as sum-of-pairs scores (SPS).

[Figure: example alignment of N = 3 rows by M = 5 columns (Ai1 to Ai5). Row 1: A V L I; row 2: A G M L I; row 3: A V M R.]

S_i = sum of scores for all pairs in column i

SPS = (sum of S_i over all your aligned columns) / (sum of reference scores)
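The sum-of-pairs idea can be sketched in a few lines of Python. This is a minimal sketch, assuming a pair of residues scores 1 when it is also aligned in the reference alignment and 0 otherwise; the function names and the two toy alignments are hypothetical and are not taken from the slides.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (row index, residue index within that row),
    so the same residue can be recognised in two different alignments.
    """
    counters = [0] * len(alignment)          # residues seen so far in each row
    pairs = set()
    for column in zip(*alignment):
        present = []
        for row, symbol in enumerate(column):
            if symbol != "-":                # gaps never form a pair
                present.append((row, counters[row]))
                counters[row] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    reference_pairs = aligned_pairs(reference)
    return len(aligned_pairs(test) & reference_pairs) / len(reference_pairs)

# Hypothetical toy alignments: the test alignment misplaces one gap.
reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]
print(sum_of_pairs_score(test, reference))   # 3 of 4 reference pairs recovered
```

An SPS of 1.0 means every residue pair in the reference alignment was reproduced.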

Five MSA Approaches

1. Exact methods

2. Progressive alignment (e.g., ClustalW)

3. Iterative approaches (e.g., PRALINE, IterAlign, MUSCLE)

4. Consistency-based methods (e.g., MAFFT, ProbCons)

5. Structure-based methods (e.g., Expresso)

Our Focus

Exact Methods

• Exact methods, like Needleman and Wunsch, generate optimal alignments but aren’t feasible for alignments of many sequences.

• Computational time for this approach is described in Big O notation as O(2^N L^N).
• The number of steps T grows on the order of 2^N L^N, where N is the number of sequences and L is the average sequence length.
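To see why exact methods stop being feasible, the step count can be evaluated directly. This is a minimal sketch, reading the bound as 2^N · L^N; the function name is hypothetical, and the count is an order-of-magnitude estimate, not a measured run time.

```python
def exact_alignment_steps(n_sequences, avg_length):
    """Order-of-magnitude step count for exact alignment: 2^N * L^N."""
    return (2 ** n_sequences) * (avg_length ** n_sequences)

# Growth with the number of sequences, for L = 100 residues.
for n in (2, 3, 5, 10):
    print(f"N = {n:2d}: {exact_alignment_steps(n, 100):.3e} steps")
```

Each added sequence multiplies the work by 2L, which is why the estimate for 10 sequences is already astronomically large.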

Progressive Sequence Alignment (Feng-Doolittle)

How it works:

1. Calculates pairwise sequence alignment scores between all proteins (or nucleic acid sequences)

2. Aligns the 2 closest sequences using a guide tree

3. Progressively aligns more sequences to the first 2

• Advantages: Permits rapid alignment of 100s of sequences.
• Disadvantages: May not provide the most accurate alignment, depending on how the alignment is started.

ClustalW

MUSCLE

What these numbers mean…

Approximate number of steps for N = 10 sequences, L = 100 residues (avg.):

Needleman & Wunsch: too large to calculate
ClustalW: 20,000
MUSCLE: 110,000

best score
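The two finite figures can be reproduced with a short calculation. This is a sketch under a stated assumption: the complexity formulas did not survive on the slide, so the code uses O(N^4 + L^2) for ClustalW and O(N^4 + N·L^2) for MUSCLE, which are the commonly quoted bounds and which happen to give 20,000 and 110,000 for N = 10, L = 100. The function names are hypothetical.

```python
def clustalw_steps(n, length):
    """Assumed ClustalW bound: N^4 + L^2."""
    return n ** 4 + length ** 2

def muscle_steps(n, length):
    """Assumed MUSCLE bound: N^4 + N * L^2."""
    return n ** 4 + n * length ** 2

n, length = 10, 100
print("ClustalW:", clustalw_steps(n, length))
print("MUSCLE:  ", muscle_steps(n, length))
```

Both heuristics stay within hundreds of thousands of steps, whereas the exact method for the same input is far beyond what can be computed.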

Progressive MSA stage 1 of 3: generate global pairwise alignments

• For n sequences, number of alignments = (n-1)(n) / 2
• For 5 sequences, (4)(5) / 2 = 10 alignments
• First find the two that produce the highest score
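Stage 1 can be sketched as scoring every pair and keeping the best one. This is a minimal sketch, assuming a simple match/mismatch score with a linear gap penalty in place of a real substitution matrix; the five toy sequences and all names are hypothetical.

```python
from itertools import combinations

def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (score only, linear gap penalty)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]

def all_pairwise_scores(sequences):
    """Score every unordered pair: n(n-1)/2 alignments for n sequences."""
    return {
        (x, y): global_score(sequences[x], sequences[y])
        for x, y in combinations(sequences, 2)
    }

# Hypothetical toy sequences standing in for five proteins.
sequences = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "TTTTGGGG",
    "s4": "ACCAACGA",
    "s5": "CCCCCCCC",
}
scores = all_pairwise_scores(sequences)
best = max(scores, key=scores.get)
print(len(scores), "alignments; closest pair:", best, "score", scores[best])
```

The closest pair found here is where the progressive alignment would start.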

Tree Views of alignments

• Alignments may be evaluated by either similarity or distance measures

• A tree shows the distance between objects

Closely-related Sequences

Distantly-related Sequences

How to read tree views of alignments

Closely-related Sequences

5 closely related globins

Feng-Doolittle stage 2: guide tree

• Convert similarity scores to distance scores
• Use the unweighted pair group method with arithmetic mean, UPGMA (defined in Chapter 7)
• ClustalW output shown below. Use JalView in ClustalW to display the tree view.
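The two steps of stage 2 can be sketched as follows. This is a minimal sketch, assuming distance = 1 - fractional identity (real programs use other conversions) and a bare-bones UPGMA that returns only the tree topology as nested tuples. The identity values and all names are hypothetical.

```python
from itertools import combinations

def upgma(distances):
    """Build a guide tree from {(a, b): distance} covering every pair of leaves.

    Repeatedly joins the two closest clusters; the distance from the new
    cluster to any other is the size-weighted average of the old distances.
    """
    dist = {frozenset(pair): d for pair, d in distances.items()}
    sizes = {leaf: 1 for pair in distances for leaf in pair}   # cluster -> leaf count
    while len(sizes) > 1:
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        for other in sizes:
            if other != a and other != b:
                dist[frozenset((merged, other))] = (
                    sizes[a] * dist[frozenset((a, other))]
                    + sizes[b] * dist[frozenset((b, other))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))

# Hypothetical fractional identities between four sequences.
identity = {
    ("human", "chimp"): 0.98,
    ("human", "mouse"): 0.80,
    ("chimp", "mouse"): 0.80,
    ("human", "plant"): 0.30,
    ("chimp", "plant"): 0.30,
    ("mouse", "plant"): 0.35,
}
distances = {pair: 1.0 - value for pair, value in identity.items()}
tree = upgma(distances)
print(tree)
```

The innermost tuple is the closest pair, which is the pair the progressive alignment starts from.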

Feng-Doolittle stage 3: progressive alignment

• Build the MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Follows the rule: “once a gap, always a gap.”

2 closest alignments

Why “once a gap, always a gap”?

• There are many possible ways to make an MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most believable
• Insertions receive higher penalties than deletions, and are propagated throughout the alignment

Note placement of M and A at end of gap
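The rule can be illustrated with a small helper. This is a minimal sketch, assuming that adding a new sequence only ever inserts whole gap columns into the rows already aligned and never removes an existing gap; the function name and the toy alignment are hypothetical.

```python
def add_gap_column(alignment, column):
    """Insert a gap at position `column` in every row of an existing alignment."""
    return [row[:column] + "-" + row[column:] for row in alignment]

# Alignment of the two closest sequences, already containing one gap.
existing = ["AVGY-DF", "AVGYEDF"]

# Suppose the next sequence needs an extra column at index 2.
updated = add_gap_column(existing, 2)
for row in updated:
    print(row)
```

The original gap in the first row is still present after the update; it has only shifted one column to the right.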

Partial ClustalW Output for CD2 Protein

1. Color coding indicates AA property class
2. * Indicates 100% conserved over entire alignment
3. : Conservative mutations
4. . Less conservative mutations
5. [blank] gap or least conserved mutations
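The symbols in the conservation line can be computed column by column. This is a minimal sketch, assuming the "strong" and "weak" residue groups below, which are recalled from the Clustal documentation and should be checked against the version you use. The function name is hypothetical; the three rows are the final block of the calmodulin alignment shown earlier in this chapter.

```python
# Assumed Clustal residue groups (verify against your Clustal version).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def conservation_line(rows):
    """Return one symbol per column: '*', ':', '.', or a blank."""
    symbols = []
    for column in zip(*rows):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")                      # any gap: blank
        elif len(residues) == 1:
            symbols.append("*")                      # fully conserved
        elif any(residues <= set(group) for group in STRONG_GROUPS):
            symbols.append(":")                      # conservative substitution
        elif any(residues <= set(group) for group in WEAK_GROUPS):
            symbols.append(".")                      # less conservative
        else:
            symbols.append(" ")
    return "".join(symbols)

rows = [
    "VDEMIREADIDGDGHINYEEFVRMMVS-",   # 1exr_A
    "VDEMIREADIDGDGHINYEEFVRMMVSK",   # 1N0Y_A
    "VDEMIREANIDGDGQVNYEEFVQMMTA-",   # 3cln_
]
print(conservation_line(rows))
```

For these rows the result matches the conservation line printed under that block of the alignment.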

ClustalW Output for CD2 Protein

[Figure: ClustalW output with callouts 1 to 5]

Can be used to build a phylogenetic tree


ClustalW alignment of 5 distantly related globins

ClustalW alignment of 5 closely related globins

* asterisks indicate identity in a column

Additional features of ClustalW improve its ability to generate accurate MSAs

• Individual weights are assigned to sequences; very closely related sequences are given less weight, while distantly related sequences are given more weight

• Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g.:

  PAM20     80-100% id
  PAM60     60-80% id
  PAM120    40-60% id
  PAM350    0-40% id

• Residue-specific gap penalties are applied
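The matrix choice in the table above amounts to a lookup by percent identity. This is a minimal sketch, assuming each boundary value (80, 60, 40) belongs to the higher band, which the table leaves open; the function name is hypothetical and this is not ClustalW's own code.

```python
def pick_pam_matrix(percent_identity):
    """Choose a PAM matrix name from the percent identity of two sequences."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "PAM20"
    if percent_identity >= 60:
        return "PAM60"
    if percent_identity >= 40:
        return "PAM120"
    return "PAM350"

for identity in (95, 72, 45, 18):
    print(identity, "% id ->", pick_pam_matrix(identity))
```

Closely related sequences get a low-numbered PAM matrix; divergent sequences get a high-numbered one.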

In-Class Assignment: Multiple sequence alignments using ClustalW

• Example of MSA using ClustalW: two data sets

• Five distantly related globins (human to plant)

• Five closely related beta globins

• Obtain your sequences in the FASTA format!

• You can save them in Notepad or another text editor.
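Before submitting a file, it can help to check that it parses as FASTA. This is a minimal sketch, assuming the usual layout of a ">" header line followed by one or more sequence lines; the parser, the record names, and the short sequences are hypothetical toy data.

```python
def parse_fasta(text):
    """Return {record name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                                  # skip blank lines
        if line.startswith(">"):
            name = line[1:].split()[0]                # first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)                # sequence may span lines
    return {key: "".join(parts) for key, parts in records.items()}

# Hypothetical two-record FASTA file contents.
fasta_text = """>toy_a first toy record
MVHLTPEEK
SAVTALWGK
>toy_b
MVLSPADKT
"""
sequences = parse_fasta(fasta_text)
for name, sequence in sequences.items():
    print(name, len(sequence), sequence)
```

If a record comes back empty or a name is missing, the header lines in the saved file are the first thing to check.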

MSA: Iterative Methods

• Compute a sub-optimal solution and keep modifying that intelligently using dynamic programming or other methods until the solution converges.

• Unlike progressive methods, iterative methods can dynamically correct alignment errors

• Examples:
  – MUSCLE: Multiple Sequence Comparison by Log-Expectation (Edgar, 2004)
  – IterAlign (Karlin and Brocchieri, 1998)
  – PRALINE: PRofile ALIgNmEnt (Heringa, 1999; Simossis and Heringa, 2005)
  – MAFFT: Multiple Alignment using Fast Fourier Transform (Katoh et al., 2005)

Iterative approaches: MAFFT

• Available at http://mafft.cbrc.jp/alignment/software/
• Uses Fast Fourier Transform to speed up profile alignment
• Uses a fast two-stage method for building alignments using k-mer (matching 6-tuples) frequencies
• Offers many different scoring and aligning techniques
• One of the more accurate programs available
• Available as standalone or web interface
• Many output formats, including interactive phylogenetic trees
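The idea of comparing sequences by shared 6-tuples can be sketched as below. This is a minimal sketch of k-mer counting only, assuming plain residue letters; it does not reproduce MAFFT's actual distance measure or any alphabet grouping it may apply. The function names and toy sequences are hypothetical.

```python
from collections import Counter

def kmer_counts(sequence, k=6):
    """Count every overlapping k-mer (k-tuple) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def shared_kmers(a, b, k=6):
    """Number of k-mers the two sequences have in common (with multiplicity)."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())

# Hypothetical toy sequences that differ only at the last two residues.
a = "MKTAYIAK"
b = "MKTAYIQR"
print(shared_kmers(a, b), "shared 6-tuples out of", len(a) - 6 + 1)
```

Counting shared tuples needs no alignment at all, which is why it is a fast way to estimate which sequences are close.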


Iterative approaches: MUSCLE

• Available at http://www.ebi.ac.uk/Tools/msa/muscle/

• 3-stage approach
• Stage 1:
  – Algorithm builds initial alignment based on similarities of paired alignments
  – Calculates distance matrix and generates rooted tree
• Stage 2:
  – Improves tree by recalculating similarities
• Stage 3:
  – Rescores pairs at branches


MSA: Consistency-based algorithms

• Use a database of both local high-scoring alignments and long-range global alignments to create a final alignment

• Incorporate evidence from multiple sequences to guide pairwise alignment
  – In a sequence, if x is related to y, and y is related to z, then x should be related to z.

• Fast and accurate
• Examples: T-COFFEE, Prrp, DiAlign, ProbCons
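The "x to y, y to z, so x to z" idea can be sketched as a weight update. This is a minimal sketch in the style of T-Coffee's library extension, assuming that each intermediate residue y contributes the smaller of the two weights w(x, y) and w(y, z); the residue labels, weights, and function name are hypothetical, and real programs apply this across whole sequences.

```python
from itertools import combinations

def extend_weights(weights, residues):
    """Add indirect support through every intermediate residue to each pair."""
    def weight(p, q):
        return weights.get(frozenset((p, q)), 0)

    extended = {}
    for x, z in combinations(residues, 2):
        total = weight(x, z)                          # direct evidence, if any
        for y in residues:
            if y != x and y != z:
                total += min(weight(x, y), weight(y, z))
        extended[frozenset((x, z))] = total
    return extended

# Hypothetical residues, labelled (sequence name, position).
x, y, z = ("seq1", 5), ("seq2", 5), ("seq3", 4)
weights = {frozenset((x, y)): 80, frozenset((y, z)): 60}   # no direct x-z evidence
extended = extend_weights(weights, [x, y, z])
print(extended[frozenset((x, z))])
```

After the update, x and z carry a weight even though they were never aligned directly, because both align well to y.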

Which methods are best?

• Depends on:
  – Number of sequences to align.
  – What you are trying to do.
  – Level of user expertise.
  – Personal preference.

• Other considerations:
  – Does the method use benchmarking of multiple structures?
  – Do you want to evaluate 3D protein structures (e.g., try Expresso at http://www.tcoffee.org)?

• You might want to:
  – Try making multiple sequence alignments with many different sets of proteins (very related, very distant, few gaps, many gaps, insertions, outliers).
  – Compare results.

Example: 5 alignments of 5 globins

Let’s look at a multiple sequence alignment (MSA) of five globin proteins. We’ll use five prominent MSA programs: ClustalW, Praline, MUSCLE (used at HomoloGene), ProbCons, and T-Coffee. Each program offers unique strengths.

We’ll focus on a histidine (H) residue that has a critical role in binding oxygen in globins, and should be aligned. But often it’s not aligned, and all five programs give different answers.

Our conclusion will be that there is no single best approach to MSA.

Note how the region of a conserved histidine (▼) varies depending on which of five prominent algorithms is used

ClustalW Results

Praline Results

MUSCLE Results

ProbCons Results

T-Coffee Results

See Thompson et al. (1994) for an explanation of the three stages of progressive alignment implemented in ClustalW
