COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering

Page 1

COT 6930 HPC and Bioinformatics

Multiple Sequence Alignment

Xingquan Zhu

Dept. of Computer Science and Engineering

Page 2

Outline

- Multiple Sequence Alignment
  - What, Why, and How
- Multiple Sequence Alignment Methods
  - Multidimensional dynamic programming
  - Star Alignment
  - Tree Alignment
  - Progressive Alignment
    - ClustalW: a widely used algorithm
  - Iterative Alignment
    - Genetic Algorithm

Page 3

What is a Multiple Sequence Alignment?

- Pairwise alignments involve two sequences.
- Multiple sequence alignments involve more than two sequences (often hundreds, either nucleotide or protein).
- A formal definition: a multiple alignment of strings S1, …, Sk is a series of strings S1’, …, Sk’ with spaces such that
  - |S1’| = … = |Sk’|
  - Sj’ is an extension of Sj by insertion of spaces
- Goal: find an optimal multiple alignment.

Hs ---MK----- --LSLVAAML LLLSAARAEE EDKK-EDVGT VVGIDLGTTY

Sp ---MKKFQLF SILSYFVALF LLPMAFASGD DNST-ESYGT VIGIDLGTTY

Tg MTAAKKLSLF SLAALFCLLS VATLRPVAAS DAEEGKVKDV VIGIDLGTTY

Pf --------MN QIRPYILLLI VSLLKFISAV DSN---IEGP VIGIDLGTTY
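The formal definition above can be checked mechanically. The following is a minimal sketch, assuming "-" is the space (gap) character; the function name `is_valid_alignment` and the short toy rows are illustrative and not part of the original slides.

```python
def is_valid_alignment(originals, aligned, gap="-"):
    """Return True if `aligned` is a multiple alignment of `originals`."""
    if len(originals) != len(aligned):
        return False
    # Condition 1: all aligned strings have the same length.
    if len({len(row) for row in aligned}) > 1:
        return False
    # Condition 2: each aligned string is the original with gaps inserted,
    # so removing the gaps must give back the original sequence.
    return all(row.replace(gap, "") == seq for seq, row in zip(originals, aligned))


seqs = ["MPE", "MKE", "MSKE", "SKE"]
rows = ["-MPE", "-MKE", "MSKE", "-SKE"]
print(is_valid_alignment(seqs, rows))   # rows have equal length and match seqs
print(is_valid_alignment(seqs, seqs))   # unequal lengths, so not an alignment
```

The check says nothing about whether the alignment is optimal; that depends on the scoring scheme.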

Page 4

Why do we do multiple alignments?

- To reveal the relationship between a group of sequences (homology)
- Simultaneous alignment of similar gene sequences may
  - Discover the conserved regions in genes
  - Determine the consensus sequence of the aligned sequences
  - Help define a protein family that may share a common biochemical function or evolutionary origin, and thus reveal an evolutionary history of the sequences
  - Help predict the secondary and tertiary structures of new sequences

Page 5

MSA Methods

- Multidimensional dynamic programming
  - Extension of DP to multiple (3) sequences
- Star Alignment, Tree Alignment, Progressive Alignment
  - Start with an alignment of the most alike sequences and build the alignment by adding more sequences
- Iterative methods
  - Make an initial alignment of groups of sequences and revise the alignment to achieve a more reasonable result


Page 7

Multiple Sequence Alignment by DP

- Pairwise sequence alignment
  - A scoring matrix where each position provides the best alignment up to that point
- Extension to 3 sequences
  - The lattice of a cube that is to be filled with calculated dynamic programming scores
  - Scoring positions on the 3 surfaces of the cube represent the alignment of a pair

Page 8

Scoring of MSA: Sum of Pairs

- Score = sum over all possible pairs of amino acids in a column
- Using the BLOSUM62 matrix, gap penalty -8
- k(k-1)/2 pairs per column

Example alignment:

    - I K
    S I K
    S S E

- In column 1, we have the pairs (-,S), (-,S), (S,S)
- Column 1 score: -8 - 8 + 4 = -12

Page 9

Sum of Pairs

Given 5 sequences:

    N C C E
    N N C E
    N - C N
    S C S N
    S C S E

How many possible combinations of pairwise alignments for each position?

    $C_5^2 = \frac{5!}{2!\,3!} = 10$

Page 10

Sum of Pairs

Assume: match/mismatch/gap = 1/0/-1

    N C C E
    N N C E
    N - C N
    S C S N
    S C S E

- The 1st position: # of N-N (3), # of S-S (1), # of N-S (6)
  SP(1) = 4*1 + 6*0 + 0*(-1) = 4
- The 2nd position: # of C-C (3), # of N-C (3), # of gaps (4)
  SP(2) = 3*1 + 3*0 + 4*(-1) = -1
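The column scores above can be reproduced with a short sum-of-pairs routine. This is a sketch under the slide's match/mismatch/gap = 1/0/-1 scheme; the function names are illustrative, and scoring a gap paired with a gap as 0 is an assumption (it does not arise in this example).

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one pair of aligned characters."""
    if a == "-" and b == "-":
        return 0  # assumption: a gap-gap pair contributes nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum the scores of all k(k-1)/2 pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """Sum-of-pairs score of a whole alignment (rows of equal length)."""
    return sum(sp_column(col) for col in zip(*rows))


rows = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
columns = list(zip(*rows))
print(sp_column(columns[0]))  # SP(1) = 4
print(sp_column(columns[1]))  # SP(2) = -1
print(sp_score(rows))         # total over all four columns
```

Swapping `pair_score` for a BLOSUM62 lookup with a gap penalty of -8 would give the scheme used on the earlier slide.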

Page 11

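Pairwise alignment: dynamic programming matrix

Figure: DP matrix with one sequence (G T G C T T G A) along the top and the other (T G G C C T) down the side, labeled Seq 1 and Seq 2. Each cell is reached by one of three moves: match/mismatch, gap in sequence 1, or gap in sequence 2.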
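The three moves in the pairwise matrix can be sketched as a Needleman-Wunsch score computation. The scoring values (match 1, mismatch 0, gap -1) are borrowed from the sum-of-pairs slides and are an assumption here, as is the function name.

```python
def global_alignment_score(s, t, match=1, mismatch=0, gap=-1):
    """Fill the pairwise DP matrix and return the optimal global score."""
    rows, cols = len(s) + 1, len(t) + 1
    F = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = F[i - 1][j] + gap    # s[i-1] aligned with a gap in t
            left = F[i][j - 1] + gap  # t[j-1] aligned with a gap in s
            F[i][j] = max(diag, up, left)
    return F[-1][-1]


print(global_alignment_score("GTGCTTGA", "TGGCCT"))
```

Each cell keeps the best score of an alignment of the two prefixes ending there, so the bottom-right cell holds the score of the best full alignment.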

Page 12

Multiple sequence alignment: dynamic programming matrix

Figure: three-dimensional DP lattice with Seq 1, Seq 2, and Seq 3 on its axes; each cell can be reached in many possible ways.

Page 13

DP Alignment Examples

Each cell of the three-sequence lattice can be reached in seven ways:

- All three match/mismatch
- Sequences 1 & 2 match/mismatch with gap in 3
- Sequences 1 & 3 match/mismatch with gap in 2
- Sequences 2 & 3 match/mismatch with gap in 1
- Sequence 1 with gaps in 2 & 3
- Sequence 2 with gaps in 1 & 3
- Sequence 3 with gaps in 1 & 2

Choose the largest value among the above seven possibilities
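The seven possibilities can be sketched as a three-dimensional DP. This is an illustrative implementation, assuming sum-of-pairs column scoring with match/mismatch/gap = 1/0/-1 and a gap-gap pair scored as 0; the function names are hypothetical.

```python
from itertools import combinations, product


def pair_score(a, b, match=1, mismatch=0, gap=-1):
    if a == "-" and b == "-":
        return 0  # assumption: a gap-gap pair contributes nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def three_way_score(s1, s2, s3):
    """Optimal sum-of-pairs score for aligning three sequences."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    neg_inf = float("-inf")
    D = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    # The seven moves: each sequence either consumes a character (1)
    # or receives a gap (0); the all-gap move (0, 0, 0) is excluded.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    column = (
                        s1[pi] if di else "-",
                        s2[pj] if dj else "-",
                        s3[pk] if dk else "-",
                    )
                    col = sum(pair_score(a, b) for a, b in combinations(column, 2))
                    # Keep the largest value among the possibilities.
                    D[i][j][k] = max(D[i][j][k], D[pi][pj][pk] + col)
    return D[n1][n2][n3]


print(three_way_score("MPE", "MKE", "SKE"))
```

The triple loop visits every cell of the cube, which is why the cost grows with the product of the three sequence lengths.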

Page 14

Computational Complexity

- For protein sequences each 300 amino acids in length, excluding gaps, the DP algorithm needs
  - Two sequences: 300^2 comparisons
  - Three sequences: 300^3 comparisons
  - N sequences: 300^N comparisons
- In general O(L^N), where L is the length of the sequences and N is the number of sequences
- The number of comparisons and the memory required are too large for N > 3, so the approach is not practical
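To make the growth concrete, the following sketch prints the number of DP cells for a few values of N. The helper name `dp_cells` is illustrative; the counts ignore the constant factor for the moves considered at each cell.

```python
def dp_cells(length, n_sequences):
    """Number of lattice cells for n_sequences sequences of the given length."""
    return length ** n_sequences


for n in (2, 3, 4, 5):
    print(n, dp_cells(300, n))
```

Going from two to three sequences multiplies the work by 300, and each further sequence multiplies it by 300 again.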


Page 16

Star Alignments

- Heuristic method for multiple sequence alignments
- Select a sequence sc as the center of the star
- For each sequence si among s1, …, sk with index i ≠ c, perform a global alignment of si with sc (using DP)
- Aggregate the alignments with the principle “once a gap, always a gap.”

Page 17

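Star Alignments Example

Sequences (s2 is at the center of the star, joined to s1, s3, and s4):

    s1: MPE
    s2: MKE
    s3: MSKE
    s4: SKE

Pairwise alignments with s2:

    MPE
    | |
    MKE

    MSKE
      ||
    -MKE

    MKE
     ||
    SKE

Merging the pairwise alignments one at a time:

    MPE
    MKE

    -MPE
    -MKE
    MSKE

    -MPE
    -MKE
    MSKE
    -SKE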
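The merging step with "once a gap, always a gap" can be sketched as follows. The function `merge_star` is a hypothetical helper, and it assumes the pairwise alignments against the center have already been computed (here they are copied from the example). The center is kept as the first row of the result.

```python
def merge_star(center, pairwise):
    """Merge pairwise alignments (center_row, other_row) into one MSA."""
    msa = [list(center)]  # row 0 is always the (gapped) center
    for c_row, o_row in pairwise:
        old = msa
        new = [[] for _ in old]
        added = []  # the row for the newly merged sequence
        i = j = 0
        width = len(old[0])
        while i < width or j < len(c_row):
            old_gap = i < width and old[0][i] == "-"
            new_gap = j < len(c_row) and c_row[j] == "-"
            if old_gap and not new_gap:
                # The MSA already has a gap in the center here:
                # keep the column and give the new sequence a gap.
                for r, row in enumerate(old):
                    new[r].append(row[i])
                added.append("-")
                i += 1
            elif new_gap and not old_gap:
                # The new pairwise alignment opens a gap in the center:
                # insert a gap column into every existing row.
                for r in range(len(old)):
                    new[r].append("-")
                added.append(o_row[j])
                j += 1
            else:
                # Same center position in both: copy the column as is.
                for r, row in enumerate(old):
                    new[r].append(row[i])
                added.append(o_row[j])
                i += 1
                j += 1
        msa = new + [added]
    return ["".join(row) for row in msa]


pairs = [("MKE", "MPE"), ("-MKE", "MSKE"), ("MKE", "SKE")]
for row in merge_star("MKE", pairs):
    print(row)
```

Gaps are only ever added to existing rows, never removed, which is exactly what the principle requires.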

Page 18

Choosing a center

- Try them all and pick the one with the best score
- Calculate all O(k^2) alignments, and pick the sequence sc that maximizes

    $\sum_{i \neq c} \mathrm{score}(s_c, s_i)$

Page 19

Star Alignment Example

    S1 = ATTGCCATT
    S2 = ATGGCCATT
    S3 = ATCCAATTTT
    S4 = ATCTTCTT
    S5 = ATTGCCGATT

Pairwise alignment scores, with the row sum in the last column:

          s1    s2    s3    s4    s5    sum
    s1           7    -2     0    -3      2
    s2     7          -2     0    -4      1
    s3    -2    -2           0    -7    -11
    s4     0     0     0          -3     -3
    s5    -3    -4    -7    -3          -17
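The row sums and the resulting choice of center can be reproduced with a few lines. The pairwise scores below are copied from the table; the function name `choose_center` is illustrative.

```python
scores = {
    ("s1", "s2"): 7, ("s1", "s3"): -2, ("s1", "s4"): 0, ("s1", "s5"): -3,
    ("s2", "s3"): -2, ("s2", "s4"): 0, ("s2", "s5"): -4,
    ("s3", "s4"): 0, ("s3", "s5"): -7,
    ("s4", "s5"): -3,
}


def choose_center(names, pair_scores):
    """Return the sequence with the largest summed score to all others."""
    totals = {name: 0 for name in names}
    for (a, b), score in pair_scores.items():
        # Each pairwise score counts toward both sequences of the pair.
        totals[a] += score
        totals[b] += score
    center = max(names, key=lambda name: totals[name])
    return center, totals


names = ["s1", "s2", "s3", "s4", "s5"]
center, totals = choose_center(names, scores)
print(center, totals)
```

With these scores s1 has the largest sum, so it would be taken as the center of the star.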

Page 20

Star Alignments Example

Merging Pairwise Alignment

Page 21

Star Alignment Example

Merging Pairwise Alignment

Page 22

Analysis

- Assume all sequences have length n
- O(n^2) to calculate one global alignment
- O(k^2) global alignments to calculate
- Using a reasonable data structure for joining alignments, no worse than O(kl), where l is an upper bound on alignment lengths
- O(k^2 n^2 + kl) = O(k^2 n^2) overall cost


Page 24

Tree Alignment

- Compute the overall similarity based on pairwise alignment along each edge
- The sum of all these weights is the score of the tree

Figure: a tree whose nodes are sequences; the edge between sequence S1 and sequence S2 carries the weight sim(s1, s2).

Consensus String

- The consensus string derived from a multiple alignment is the concatenation of the consensus characters for each column.
- The consensus character for a column is the character that minimizes the summed distance to it from all the characters in that column.
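A consensus string can be sketched as below, assuming a 0/1 distance (0 for identical characters, 1 otherwise). Under that assumption the character minimizing the summed distance is simply the most frequent one in the column. The function name is illustrative, and the gap character is treated like any other symbol.

```python
from collections import Counter


def consensus(rows):
    """Concatenate the consensus character of each alignment column."""
    result = []
    for column in zip(*rows):
        # With a 0/1 distance, the most frequent character has the
        # smallest summed distance to all characters in the column.
        char, _count = Counter(column).most_common(1)[0]
        result.append(char)
    return "".join(result)


rows = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
print(consensus(rows))
```

A different distance function (for example one derived from a substitution matrix) could select a different consensus character.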

Page 25

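Tree Alignment Example

Scoring system used is

    $p(a,b) = 1$ if $a = b$
    $p(a,b) = 0$ if $a \neq b$
    $p(a,-) = -1$

Figure: a tree with leaf sequences CAT, GT, CTG, and CG and internal sequences CAT and CTG. The edge weights are 3, 0, 1, 3, and 1. Two of the edge alignments shown:

    CAT
    -GT

    CTG
    C-G

We have a score of 3 + 0 + 1 + 3 + 1 = 8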
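The score of 8 can be reproduced by summing the global alignment similarity along each edge. The edge list below is a reconstruction from the sequences and weights in the example (internal node CAT joined to leaves CAT and GT, internal node CTG joined to leaves CTG and CG, and the two internal nodes joined to each other); treat it as an assumption. The function name `sim` follows the slide's notation.

```python
def sim(s, t, match=1, mismatch=0, gap=-1):
    """Global alignment similarity with p(a,a)=1, p(a,b)=0, p(a,-)=-1."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


# Assumed tree edges: (node sequence, neighbor sequence).
edges = [
    ("CAT", "CAT"),
    ("CAT", "GT"),
    ("CAT", "CTG"),
    ("CTG", "CTG"),
    ("CTG", "CG"),
]
weights = [sim(a, b) for a, b in edges]
print(weights, sum(weights))
```

Choosing different internal sequences would change the edge weights, which is what makes finding the best labeling hard.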

Pages 26 to 33

Tree Alignment Example (continued; figure-only slides)

Page 34

Analysis

- We don’t know the correct tree
- Without the tree, the tree alignment problem is NP-complete
- Likely only exponential-time solutions are available (for optimal answers)


Page 36

Progressive Methods

- DP-based MSA programs are limited to 3 sequences or to a small number of relatively short sequences
- Progressive alignment uses DP to build an MSA, starting with the most related sequences and then progressively adding less related sequences or groups of sequences to the initial alignment
- Most commonly used approach

Page 37

Progressive Methods

- Progressive alignment is heuristic.
  - It does not separate the process of scoring an alignment from the optimization algorithm
  - It does not directly optimize any global scoring function of “alignment correctness”
  - It is fast and efficient, and the results are reasonable
- We will illustrate this using ClustalW.

Page 38

Progressive MSA occurs in 3 stages

1. Do a set of global pairwise alignments (Needleman and Wunsch)

2. Create a guide tree

3. Progressively align the sequences
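Stage 2 can be sketched with a simple agglomerative clustering of the pairwise distances from stage 1. This sketch uses average linkage (UPGMA-style) purely for illustration; ClustalW is usually described as building its guide tree with neighbor joining, so this is a swapped-in technique, and the distance values and names are hypothetical.

```python
def guide_tree(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(d(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                dxy = cluster_distance(clusters[x], clusters[y])
                if best is None or dxy < best[0]:
                    best = (dxy, x, y)
        _, x, y = best
        # Join the two closest clusters into one subtree.
        merged = ((clusters[x][0], clusters[y][0]), clusters[x][1] + clusters[y][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical pairwise distances from stage 1.
dist = {
    ("A", "B"): 1, ("C", "D"): 2,
    ("A", "C"): 5, ("B", "C"): 5,
    ("A", "D"): 6, ("B", "D"): 6,
}
print(guide_tree(["A", "B", "C", "D"], dist))
```

The nested tuples give the order for stage 3: the innermost pairs are aligned first, and the resulting groups are then aligned to each other.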

Page 39

ClustalW Procedure

Page 40

Progressive Methods: ClustalW

- http://www.ebi.ac.uk/clustalw/
- ClustalW is a general-purpose multiple alignment program for DNA or proteins.
- ClustalW: the W stands for “weighting”, representing the ability of the program to provide weights to the sequence and program parameters.
- CLUSTALX provides a graphical interface.

Page 41

Use ClustalW to do a progressive MSA

Figure labels:

- Operational options
- Output options
- Input options, matrix choice, gap opening penalty
- Gap information, output tree type
- File input in GCG, FASTA, EMBL, GenBank, Phylip, or several other formats

Page 42

Progressive MSA stage 3 of 3: progressive alignment

- Make an MSA based on the order in the guide tree
- Start with the two most closely related sequences
- Then add the next closest sequence
- Continue until all sequences are added to the MSA

Page 43

Problems with Progressive Alignment

- Highly sensitive to the choice of the initial pair to align
  - The very first sequences to be aligned are the most closely related on the sequence tree
  - If these sequences align well, there are few errors in the initial alignment
  - The more distantly related these sequences are, the more errors
- Errors in the initial alignment are propagated to the MSA


Page 45

Iterative Methods

- Results do NOT depend on the initial pairwise alignment (recall progressive methods)
- Start with an initial alignment and repeatedly realign groups of the sequences
- Repeat until one MSA doesn’t change significantly from the next
- With each iteration the alignment gets better
- An example is the genetic algorithm approach

Page 46

Genetic Algorithms

- A general problem-solving method modeled on evolutionary change
  - Inspired by the biological evolution process
  - Uses the concepts of “Natural Selection” and “Genetic Inheritance” (Darwin 1859)
- Create a set of candidate solutions to your problem, and cause these solutions to evolve and become more and more fit over repeated generations
- Use survival of the fittest, mutation, and crossover to guide evolution

Page 47

Genetic Search Algorithms

1. Random generation (candidate solutions)
2. Evaluation (fitness function)
3. Selection (candidate solutions with larger fitness values have a larger chance of being included)
4. Crossover + Mutation (change some selected candidate solutions so as to converge to the optimal solution and to avoid getting stuck at a local extreme)
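The four steps can be sketched as a toy genetic search over alignments. Everything here is an illustrative assumption rather than a published method: fitness is the sum-of-pairs score with match/mismatch/gap = 1/0/-1, selection simply keeps the fitter half of the population (a simplification of fitness-weighted selection), crossover picks each row from one of two parents, and mutation re-places the gaps of one row at random.

```python
import random
from itertools import combinations


def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else 0


def sp_score(rows):
    """Fitness function: sum-of-pairs score of an alignment."""
    return sum(pair_score(a, b) for col in zip(*rows) for a, b in combinations(col, 2))


def random_row(seq, width, rng):
    """Place the gaps of one sequence at random positions."""
    gaps = set(rng.sample(range(width), width - len(seq)))
    chars = iter(seq)
    return "".join("-" if i in gaps else next(chars) for i in range(width))


def genetic_msa(seqs, width, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    # 1. Random generation of candidate alignments.
    pop = [[random_row(s, width, rng) for s in seqs] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # 2. Evaluation.
        pop.sort(key=sp_score, reverse=True)
        history.append(sp_score(pop[0]))
        # 3. Selection: keep the fitter half (the best one always survives).
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            # 4. Crossover: take each row from one of two parents.
            mother, father = rng.choice(survivors), rng.choice(survivors)
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # 4. Mutation: occasionally re-place the gaps of one row.
            if rng.random() < 0.3:
                r = rng.randrange(len(child))
                child[r] = random_row(seqs[r], width, rng)
            children.append(child)
        pop = survivors + children
    return max(pop, key=sp_score), history


seqs = ["MPE", "MKE", "MSKE", "SKE"]
best, history = genetic_msa(seqs, width=4)
print(best, sp_score(best))
```

Because the best candidate always survives, the best fitness never decreases from one generation to the next, although the search can still stall at a local optimum.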
