Alignment Class II

We continue where we stopped last week: Dynamic programming


Reminder: Structure of a genome

[Figure: a gene is transcribed into pre-mRNA, which is spliced into mature mRNA and then translated into protein.]


Pairwise Sequence Alignment

Example

Which one is better?

Sequences: HEAGAWGHEE and PAWHEAE

Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE

Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE


Example

      A    E    G    H    W
A     5   -1    0   -2   -3
E    -1    6   -3    0   -3
H    -2    0   -2   10   -3
P    -1   -1   -2   -2   -4
W    -3   -3   -3   -3   15

• Gap penalty: -8

• Gap extension: -3

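Alignment 1:
HEAGAWGHE-E
P-A--W-HEAE

Alignment 2:
HEAGAWGHE-E
--P-AW-HEAE

Score of alignment 2, charging -8 for every gap position:
(-8) + (-8) + (-1) + (-8) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 1

Exercise: Calculate the score for alignment 1.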
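The column-by-column scoring above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the names `SCORES`, `GAP` and `score_alignment` are made up here, the table is the subset shown on the slide, and every gap position is assumed to cost the same -8 (the gap extension value is ignored).

```python
# Substitution scores from the slide: outer key = row letter, inner key = column letter.
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}
GAP = -8  # assumption: the same penalty for every gap position


def score_alignment(top, bottom, gap=GAP):
    """Sum the column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        else:
            total += SCORES[b][a]  # row = letter of bottom string, column = letter of top string
    return total


if __name__ == "__main__":
    print(score_alignment("HEAGAWGHE-E", "--P-AW-HEAE"))
```

The same function can be used to check the exercise by passing the other alignment.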


Global Alignment

Notation
• xi – ith letter of string x
• yj – jth letter of string y
• x1..i – prefix of x from letters 1 through i
• F – matrix of optimal scores; F(i,j) represents the optimal score for lining up x1..i with y1..j
• d – gap penalty
• s – scoring matrix


Global Alignment

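The work is to build up F.
• Initialize: F(0,0) = 0, F(i,0) = id, F(0,j) = jd
• Fill from top left to bottom right using the recursive relation

\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
\max_{k \ge 1} \left[ F(i-k,j) + \mathrm{gap}(k) \right] \\
\max_{k \ge 1} \left[ F(i,j-k) + \mathrm{gap}(k) \right]
\end{cases}
\]

where gap(k) is the score of a gap of length k (gap(k) = kd for the linear penalty used in the initialization).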


Example

[Figure: diagram of candidate gapped arrangements of short strings built from X and Y, for example X_ over _X and X__X over _XX_.]


Global Alignment

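With a linear gap penalty d, F(i,j) is computed from three neighbouring cells:

\[
F(i,j) = \max \{\, F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) + d,\; F(i,j-1) + d \,\}
\]

• From F(i-1,j-1), adding s(xi,yj): move ahead in both
• From F(i-1,j), adding d: xi aligned to gap
• From F(i,j-1), adding d: yj aligned to gap

While building the table, keep track of where each optimal score came from by storing an arrow that points back to the source cell.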
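The fill-and-traceback procedure can be sketched as follows. This is an illustrative Python version under stated assumptions: a linear gap penalty d = -8, the substitution scores from the earlier example slide, and hypothetical names (`SCORES`, `s`, `needleman_wunsch`). Ties in the traceback are broken in the order diagonal, then gap under xi, then gap over yj, which is only one of several valid choices.

```python
# Substitution scores from the example slide: outer key = letter of y, inner key = letter of x.
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}


def s(a, b):
    """Score for aligning letter a of x with letter b of y."""
    return SCORES[b][a]


def needleman_wunsch(x, y, d=-8):
    n, m = len(x), len(y)
    # F[i][j] = optimal score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # move ahead in both
                F[i - 1][j] + d,                          # xi aligned to gap
                F[i][j - 1] + d,                          # yj aligned to gap
            )
    # Traceback from the lower-right cell, recomputing which move produced each score.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    F, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
    print(F[-1][-1])
    print(top)
    print(bottom)
```

Recomputing the source cell during traceback avoids storing a separate arrow matrix, at the cost of a few extra comparisons per step.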


Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56


Completed Table

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Score: Gap –8, Error –2, Fit +6


Traceback

Trace arrows back from the lower right to the top left of the completed table:

• Diagonal – both letters aligned
• Up – gap in the upper sequence
• Left – gap in the lower sequence

Resulting alignment:
HEAGAWGHE-E
--P-AW-HEAE


Summary

• Uses recursion to fill in a table of intermediate results
• Uses O(nm) space and time, i.e. an O(n²) algorithm when the two sequences have similar length
• Feasible for moderate-sized sequences, but not for aligning whole genomes


Local Alignment

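Smith-Waterman (1981): another dynamic programming solution

\[
F(i,j) = \max
\begin{cases}
0 \\
F(i-1,j-1) + s(x_i, y_j) \\
\max_{k \ge 1} \left[ F(i-k,j) + \mathrm{gap}(k) \right] \\
\max_{k \ge 1} \left[ F(i,j-k) + \mathrm{gap}(k) \right]
\end{cases}
\]

where gap(k) is the score of a gap of length k.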
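A compact sketch of the local alignment recurrence, assuming the linear special case gap(k) = kd with d = -8 and the substitution scores from the earlier example slide. The names `SCORES`, `s` and `smith_waterman` are illustrative only. The traceback starts at the highest-scoring cell and stops at the first 0.

```python
# Substitution scores from the example slide: outer key = letter of y, inner key = letter of x.
SCORES = {
    "A": {"A": 5, "E": -1, "G": 0, "H": -2, "W": -3},
    "E": {"A": -1, "E": 6, "G": -3, "H": 0, "W": -3},
    "H": {"A": -2, "E": 0, "G": -2, "H": 10, "W": -3},
    "P": {"A": -1, "E": -1, "G": -2, "H": -2, "W": -4},
    "W": {"A": -3, "E": -3, "G": -3, "H": -3, "W": 15},
}


def s(a, b):
    return SCORES[b][a]


def smith_waterman(x, y, d=-8):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                        # start a new local alignment
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))
```

The only changes relative to the global algorithm are the extra 0 in the maximum, the zero initialization, and the start and stop conditions of the traceback.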


Example

        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26


Traceback

Start at the highest score in the table (28) and trace back to the first 0:

AWGHE
AW-HE


Summary

• Similar to the global alignment algorithm
• For this to work, the expected score of a match with a random sequence must be negative; otherwise the behavior is like global alignment
• Similar extensions exist for repeated and overlap matching
• Care must be given to gap penalties to maintain O(nm) time complexity


Statistical Significance of Sequence Alignments

STATISTICAL SIGNIFICANCE = probability that our score would be found between random (or unrelated) sequences

• Examine the alignment for long runs of matches and the placement of gaps
• Try alternative alignments and compare scores
• Calculate the statistical significance of the alignment score using the extreme value distribution formula
• Scramble one of the sequences 1000s of times and realign to obtain an idea of the distribution of scores with unrelated sequences of the same size


Alternative alignments

Programs like LALIGN (stands for local alignment) produce as many different alignments as you like. Each subsequent alignment does not align the same sequence positions.


Extreme Value Distributions

The average of n samples taken from any distribution with finite mean and variance will be approximately normally distributed for large n. This is the central limit theorem (CLT).

The largest member of a sample of size n has a LEV (Type I largest extreme value, also called Gumbel) distribution, regardless of the parent population, IF the parent has an unbounded tail that decreases at least as fast as an exponential function and has finite moments (as does the normal, for example). The LEV has pdf given by

\[
f(x \mid \theta_1, \theta_2) = \frac{1}{\theta_2} \exp\left( -z - e^{-z} \right), \qquad z = \frac{x - \theta_1}{\theta_2},
\]

where θ1 and θ2 are the location and scale parameters, respectively, and θ2 > 0.


Normal Distribution versus Extreme Value Distribution

[Figure: normal and extreme value densities plotted for x from -4 to 4.]

Extreme value distribution:

y = exp(-x – exp(-x))

Normal distribution:

y = exp(-x²/2) / sqrt(2π)


Extreme Value Distribution

Probability density function: f(x) = exp(-x - exp(-x))

Cumulative distribution function: Prob(S < x) = exp(-exp(-x)), so Prob(S ≥ x) = 1 - Prob(S < x) = 1 - exp(-exp(-x))

Sample mean m, sample variance σ²

λ = 1.2825 / σ, μ = m – 0.45σ

With these parameters the density is f(x) = λ exp(-λ(x-μ) - exp(-λ(x-μ))) and the tail probability is Prob(S ≥ x) = 1 - exp(-exp(-λ(x-μ)))
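As a rough illustration of the moment estimates above, the following Python sketch computes λ and μ from a list of scores and evaluates the tail probability of the fitted distribution. The function names and the sample scores are invented for the example; only the constants 1.2825 and 0.45 and the cumulative distribution come from the slide.

```python
import math
import statistics


def fit_extreme_value(scores):
    """Estimate (lambda, mu) from the sample mean and standard deviation."""
    m = statistics.mean(scores)
    sigma = statistics.stdev(scores)  # square root of the sample variance
    lam = 1.2825 / sigma
    mu = m - 0.45 * sigma
    return lam, mu


def tail_probability(x, lam, mu):
    """Prob(S >= x) = 1 - exp(-exp(-lambda * (x - mu)))."""
    return 1.0 - math.exp(-math.exp(-lam * (x - mu)))


if __name__ == "__main__":
    # Hypothetical scores of unrelated sequence pairs, for illustration only.
    scores = [30, 32, 35, 31, 40, 33, 36, 34]
    lam, mu = fit_extreme_value(scores)
    print(lam, mu, tail_probability(50, lam, mu))
```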


Calculating the statistical significance

How is this done?
• Make many random protein sequences of varying lengths
• Locally align two sequences of a given length with the same scoring matrix and gap penalty as used for our alignment
• Repeat for many pairs of sequences of approximately the same length to see how high a score we can get
• Look at the distribution of scores for a given range of lengths

[Figure: extreme value density Yev plotted against X, for X from -2 to 5.]


The formula for extreme values

The probability that a score S between two unrelated sequences is equal to or greater than a value x, P(S ≥ x), is given by:

\[
P(S \ge x) = 1 - \exp\left( -K m n \, e^{-\lambda x} \right)
\]

where m and n are the sequence lengths, λ is a “scaling factor” for the scoring matrix used, and K is a constant that depends on the scoring matrix and gap penalty combination that is used.

What we want is a score that gives a very low value of P, say less than 0.01-0.05. However, there is a trick here. We usually calculate another value E, the expect value for the alignment score, that depends on how we found P. This method is used by BLAST. For BLOSUM62 with gap penalties –11/–1, λ = 0.3 and K = 0.1 (roughly).
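The formula can be evaluated directly. The sketch below is an illustration with hypothetical function names; it also assumes the usual Karlin-Altschul form of the expect value, E = K m n e^(-λx), which the slide does not spell out, so that P = 1 - exp(-E). The defaults λ = 0.3 and K = 0.1 are the rough values quoted above.

```python
import math


def expect_value(score, m, n, lam=0.3, K=0.1):
    """Assumed form E = K * m * n * exp(-lambda * score)."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, lam=0.3, K=0.1):
    """P(S >= score) = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-expect_value(score, m, n, lam, K))


if __name__ == "__main__":
    # Two sequences of length 100 and an alignment score of 50 (made-up numbers).
    print(expect_value(50, 100, 100), p_value(50, 100, 100))
```

For small E the two values are nearly identical, which is why E is often read as if it were a probability.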


Values Describing Scores

Both the Gumbel extreme value distribution and the Karlin-Altschul distribution use E values and P values.
• E-value (Expect value): the average number of times such a match would be found
• P-value (probability): probability of finding an alignment under assumptions
• Important note: alignments that are statistically important may not be biologically important


Dot Matrices


The Dot Matrix Method

Provides a ‘Gestalt’ of all possible alignments between two sequences.

To begin — Let us use a very simple 0, 1 (match, no-match) identity scoring function without any windowing. The sequences to be compared are written out along the x and y axes of a matrix.

Put a dot wherever symbols match; identities are highlighted.
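The identity scoring described here is easy to sketch in Python. The function names below are illustrative only; the matrix has one row per symbol of the vertical sequence and one column per symbol of the horizontal sequence, with no windowing.

```python
def dot_matrix(seq_x, seq_y):
    """Return rows of booleans: True wherever the two symbols are identical."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Draw the matrix as text, seq_x along the top and seq_y down the side."""
    lines = ["  " + " ".join(seq_x)]
    for b, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(b + " " + " ".join("•" if hit else " " for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("SEQUENCE", "SEQUENCE"))
```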


[Figure: dot matrix of the sequence SEQUENCEANALYSISPRIMER compared with itself; a dot marks every pair of matching letters.]

Since this is a comparison between two of the same sequences, an intra-sequence comparison, the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses directly off the main diagonal; they are “ANA” and “SIS.”


Dot Matrices

The biggest asset of dot matrix analysis is it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region, but rather giving you the ‘Gestalt’ of the whole thing.

Since your own mind and eyes are still better than computers at discerning complex visual patterns, especially when more than one pattern is being considered, you can see all these ‘less than best’ comparisons as well as the main one and then you can ‘zoom-in’ on those regions of interest using more detailed procedures.


It is impossible to tell whether the evolutionary event that caused the discrepancy between the two sequences was an insertion or a deletion, and hence this phenomenon is called an ‘indel’.


[Figure: dot matrix of SEQUENCEANALYSISPRIMER (horizontal axis) against SEQUENCESEQUENCESEQUENCE (vertical axis).]

The ‘duplication’ here is seen as a distinct column of diagonals; whenever you see either a row or column of diagonals in a dotplot, you are looking at direct repeats.


[Figure: dot matrix of SEQUENCEANALYSISPRIMER (horizontal axis) against ANALYZESEQUENCES (vertical axis).]

Again, notice the diagonals. However, they have now been displaced off the center diagonal of the plot and, in fact, in this example, show the occurrence of a ‘transposition.’ Dot matrix analysis is one of the only sensible ways to locate such transpositions in sequences. Inverted repeats still show up as lines perpendicular to the diagonals; they are just no longer at the center of the plot. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.


Sequence comparison – dot matrix alignment


Filtered Windowing

Reconsider the same plot. Notice the extraneous dots that neither indicate runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, the composition of the sequences themselves.

How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach; a dot will only be placed if some ‘stringency’ is met.

That is, a dot is placed at the middle of a window of some defined size if, and only if, some defined criterion is met within that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.


Windowing

In this plot a window of size three and a stringency of two are used to considerably improve the signal-to-noise ratio.
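A filtered dot plot with these settings might be computed as in the sketch below. It makes two assumptions that the slides do not state: the window is odd-sized and laid along the diagonal through the candidate position, and the criterion is "at least `stringency` identical pairs inside the window". Positions too close to an edge to hold a full window get no dot. The function name is hypothetical.

```python
def windowed_dot_matrix(seq_x, seq_y, window=3, stringency=2):
    """Place a dot at the middle of each diagonal window with enough matches."""
    half = window // 2  # assumes an odd window size
    n, m = len(seq_x), len(seq_y)
    dots = [[False] * n for _ in range(m)]
    for j in range(half, m - half):
        for i in range(half, n - half):
            matches = sum(
                seq_x[i + k] == seq_y[j + k] for k in range(-half, half + 1)
            )
            dots[j][i] = matches >= stringency
    return dots


if __name__ == "__main__":
    dots = windowed_dot_matrix("SEQUENCE", "SEQUENCE")
    for letter, row in zip("SEQUENCE", dots):
        print(letter, " ".join("•" if hit else "." for hit in row))
```

With `window=1` and `stringency=1` the function reduces to the unfiltered identity plot.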


Alignment to Databases

Fasta

Blast


FASTA

FastA is a family of programs: FastA, TFastA, FastX, FastY

Query: DNA Protein

Database: DNA Protein


FastA

Blosum50 is the default. Use a lower PAM or higher blosum matrix to detect close sequences, and a higher PAM or lower blosum matrix to detect distant sequences.

Gap opening penalty -12, -16 by default for fasta with proteins and DNA, respectively

Gap extension penalty -2, -4 by default for fasta with proteins and DNA, respectively

The larger the word-length the less sensitive, but faster the search will be

Max number of scores and alignments is 100


FastA Output

Database code hyperlinked to the SRS database at EBI

Accession number

Description Length

Initn, init1, opt, z-score calculated during run

E score - expectation value, how many hits are expected to be found by chance with such a score while comparing this query to this database.

E() does not represent the % similarity


FASTA Output


FASTA-Stages

1. Find k-tups in the two sequences (k=1,2 for proteins, 4-6 for DNA sequences)

2. Score and select top 10 scoring “local diagonals”

a. For proteins, each k-tup found is scored using the PAM250 matrix

b. For DNA, the number of k-tups found

c. Penalize intervening gaps


FASTA-Stages

3. Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores.

4. Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score

5. After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.


Finding k-tups

position    1  2  3  4  5  6  7  8  9 10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

amino acid   position in protein A   position in protein B   offset (pos A - pos B)
a            6                       6                        0
c            2                       7                       -5
k            -                       11
n            1                       -
p            4                       9                       -5
r            -                       10
s            3                       8                       -5
t            5                       -

Note the common offset for the 3 amino acids c, s and p. A possible alignment is thus quickly found:

protein 1   n c s p t a
              | | |
protein 2     a c s p r k
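The offset bookkeeping in this table can be sketched with a lookup of k-tup positions followed by a tally of position differences. The function names are hypothetical, positions are 1-based as on the slide, and "." is treated as an unknown residue that never forms a k-tup.

```python
from collections import Counter, defaultdict


def ktup_positions(seq, k=1):
    """Map each k-tup to the list of 1-based positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if "." not in word:  # skip placeholder residues
            positions[word].append(i + 1)
    return positions


def offset_counts(seq_a, seq_b, k=1):
    """Count how often each offset (pos A - pos B) occurs among shared k-tups."""
    pos_a = ktup_positions(seq_a, k)
    pos_b = ktup_positions(seq_b, k)
    counts = Counter()
    for word, starts_a in pos_a.items():
        for a in starts_a:
            for b in pos_b.get(word, []):
                counts[a - b] += 1
    return counts


if __name__ == "__main__":
    counts = offset_counts("ncspta.....", ".....acsprk")
    print(counts.most_common())
```

The most frequent offset identifies the diagonal on which the two sequences share the most k-tups.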


FASTA, K-tups with common offset


BLAST

Basic Local Alignment Search Tool (Altschul et al. 1990, 1994, 1997)

• Heuristic method for local alignment
• Designed specifically for database searches
• Idea: good alignments contain short lengths of exact matches


Blast Application

Blast is a family of programs: BlastN, BlastP, BlastX, tBlastN, tBlastX

• BlastN - nt versus nt database
• BlastP - protein versus protein database
• BlastX - translated nt versus protein database
• tBlastN - protein versus translated nt database
• tBlastX - translated nt versus translated nt database

Query: DNA Protein

Database: DNA Protein


Blast – Basic Local Alignment Search Tool

Blast uses a heuristic search algorithm and the statistical methods of Karlin and Altschul (1990)

Blast programs were designed for fast database searching with minimal sacrifice of sensitivity for distantly related sequences


Mathematical Basis of BLAST

• Model matches as a sequence of coin tosses
• Let p be the probability of a “head”; for a “fair” coin, p = 0.5
• (Erdős-Rényi) If there are n throws, then the expected length R of the longest run of heads is R = log_{1/p}(n)
• Example: suppose n = 20 for a “fair” coin; then R = log_2(20) = 4.32
• The trick is how to model DNA (or amino acid) sequence alignments as coin tosses


Mathematical Basis of BLAST

To model random sequence alignments, replace a match with a “head” and mismatch with a “tail”.

For DNA, the probability of a “head” is 1/4. The same logic applies to amino acids.

AATCAT
ATTCAG
HTHHHT


Mathematical Basis of BLAST

• So, for one particular alignment, the Erdős-Rényi property can be applied
• What about for all possible alignments? Consider that the sequences are being shifted back and forth (think of a dot matrix plot)
• The expected length of the longest match is R = log_{1/p}(mn), where m and n are the lengths of the two sequences
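The coin-toss model can be tried out with a few helper functions. This is only a sketch with made-up names: one function turns an ungapped alignment into a head/tail string, one finds the longest run of heads, and one evaluates R = log_{1/p}(n), where n may be the number of throws or the product mn.

```python
import math


def toss_string(seq_a, seq_b):
    """'H' for a match and 'T' for a mismatch at each aligned position."""
    return "".join("H" if a == b else "T" for a, b in zip(seq_a, seq_b))


def longest_run(tosses, symbol="H"):
    """Length of the longest uninterrupted run of the given symbol."""
    best = current = 0
    for t in tosses:
        current = current + 1 if t == symbol else 0
        best = max(best, current)
    return best


def expected_longest_run(n, p):
    """R = log base 1/p of n."""
    return math.log(n) / math.log(1 / p)


if __name__ == "__main__":
    tosses = toss_string("AATCAT", "ATTCAG")
    print(tosses, longest_run(tosses))
    print(expected_longest_run(20, 0.5))        # fair coin, 20 throws
    print(expected_longest_run(100 * 100, 0.25))  # two DNA sequences of length 100
```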


Steps of BLAST

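1. Filter out low-complexity regions, using the complexity measure

\[
K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)
\]

where L is the length, N is the alphabet size, and n_i is the number of times letter i appears in the sequence. Example: AAAT

K = 1/4 log4(24/(3!*1!*0!*0!)) = 0.25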
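The complexity measure is straightforward to compute for a window of sequence. The sketch below uses a hypothetical function name and defaults to the DNA alphabet size N = 4; letters that do not occur contribute 0! = 1 and can be skipped.

```python
import math
from collections import Counter


def complexity(window, alphabet_size=4):
    """K = (1/L) * log_N( L! / prod(n_i!) ) for the letters in `window`."""
    length = len(window)
    counts = Counter(window)
    log_ratio = math.log(math.factorial(length)) - sum(
        math.log(math.factorial(c)) for c in counts.values()
    )
    return log_ratio / math.log(alphabet_size) / length


if __name__ == "__main__":
    print(complexity("AAAT"))  # the worked example
    print(complexity("AAAA"))  # a homopolymer has the lowest complexity
    print(complexity("ACGT"))
```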


Steps of BLAST

2. Query words of length 3 (for proteins) or 11 (for DNA) are created from query sequence using a sliding window

Expected run length in sequences of ~90 for proteins and 64 for DNA.

Query: MEFPGLGSLGTSEPLPQFVDPALVSS

First words: MEF, EFP, FPG, PGL, GLG
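The sliding window of step 2 amounts to taking every overlapping substring of the word length. A minimal sketch, with an illustrative function name:

```python
def query_words(seq, w=3):
    """All overlapping words of length w, in order of their start position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


if __name__ == "__main__":
    words = query_words("MEFPGLGSLGTSEPLPQFVDPALVSS")
    print(words[:5])
    print(len(words))
```

A query of length L yields L - w + 1 words, so the word list grows linearly with the query.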


Steps of BLAST

3. Using BLOSUM62 (for proteins) or scores of +5/-4 (DNA, PAM40), score all possible words of length 3 or 11 respectively against a query word.

4. Select a neighborhood word score threshold (T) so that only most significant sequences are kept. Approximately 50 hits per query word.

5. Repeat 3 and 4 for each query word in step 2. Total number of high scoring words is approximately 50 * sequence length.


Steps of BLAST

6. Organize the high-scoring words into a search tree

7. Scan each database sequence for match to high-scoring words. Each match is a seed for an ungapped alignment.

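[Figure: search tree built from the high-scoring words, with branches labeled by letters such as M, E, F, G and P.]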


Steps of BLAST

8. (Original BLAST) extend matching words to the left and right using ungapped alignments. Extension continues as long as score increases or stays same. This is a HSP (high scoring pair).

(BLAST2) Matches along the same diagonal (think dot plot) within a distance A of each other are joined and then the longer sequence extended as before. (Requires lower T)


Steps of BLAST

9. Using a cutoff score S, keep only the extended matches that have a score at least S.

10. Determine statistical significance of each remaining match (from last time).

11. Try to extend the HSPs if possible.

12. Show Smith-Waterman local alignments.


Statistical Significance of Blast

Probability (P) – score of the likelihood of its having arisen by chance. The closer the p-value approaches zero, the greater the confidence that the match is real. The closer the value is to unity, the greater the chance that the match is spurious

Expected Frequency (E) value – number of hits one can expect to see by chance (noise) when searching a database of a particular size. E value of 1 – one match with a similar score by chance. E value of 0 – no matches expected by chance


Low-complexity and Gapped Blast Algorithm

The SEG program has been implemented as part of the blast routine in order to mask low-complexity regions

Low-complexity regions are denoted by strings of Xs in the query sequence

In 1997 a modification introduced the generation of gapped alignments

The gapped algorithm seeks only ONE ungapped alignment that makes up a significant match and hence speeds the initial database search

Dynamic programming is used to extend a central pair of aligned residues in both directions to yield the final gapped alignment

Gapped blast is 3 times faster than ungapped blast


Smith and Waterman

• Compare the query to each sequence in the database
• Perform a full Smith and Waterman pairwise alignment to find the optimal alignment
• SW searching is exhaustive and therefore slow, so it is run on special hardware (Bioccelerator)


The sequence scrambling method

• We first align the two sequences and obtain the optimal score S
• Next, we scramble one of the sequences many 1000s of times (N), align it with the other sequence each time and obtain a distribution of scores (the scrambled sequences are not related but have the same composition* as our sequences)
• We fit the scores to an extreme value distribution and calculate our λ and K
• Then, we calculate P, as before, for the probability that one of the scrambled sequence pairs would exceed our optimal score S
• Finally, we calculate an E (expect value), which is (usually) P times the number of sequence pairs we compared. (If P = 10^-6 and N = 10,000, then E = 10^-2.)
• E is the number we want to be < 0.01-0.05
• Method used by the FASTA suite of programs

* We can scramble the whole sequence (pick 20 kinds of marbles from a bag), or use a sliding window (pick the first one from a bag with the first 10 or so, then slide ahead one and pick one from the next 10). The window method is more stringent. Why? (Think about low complexity.)
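The whole-sequence variant of the scrambling method can be sketched as follows. Several choices here are assumptions made for the illustration, not the FASTA implementation: a simple +5/-4 match/mismatch score with a linear gap penalty of -8, a fit by the moment estimates λ = 1.2825/σ and μ = m - 0.45σ instead of estimating λ and K, and hypothetical function names throughout.

```python
import math
import random
import statistics


def scramble(seq, rng):
    """Shuffle the whole sequence; composition is preserved, order is not."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def local_score(x, y, match=5, mismatch=-4, gap=-8):
    """Best Smith-Waterman score with a linear gap penalty (score only)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            sub = match if a == b else mismatch
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def scrambled_scores(x, y, n_shuffles=1000, seed=0):
    """Scores of x against n_shuffles scrambled copies of y."""
    rng = random.Random(seed)
    return [local_score(x, scramble(y, rng)) for _ in range(n_shuffles)]


def expect_value(observed, scores):
    """Return (P, E) with P from the fitted distribution and E = P * N."""
    sigma = statistics.stdev(scores)
    lam = 1.2825 / sigma
    mu = statistics.mean(scores) - 0.45 * sigma
    p = -math.expm1(-math.exp(-lam * (observed - mu)))
    return p, p * len(scores)


if __name__ == "__main__":
    x, y = "ACGTACGTACGGTCA", "ACGTTCGTACGGACA"
    observed = local_score(x, y)
    scores = scrambled_scores(x, y, n_shuffles=200)
    print(observed, expect_value(observed, scores))
```

A windowed scramble would replace `scramble` with a function that shuffles only within a sliding window, which keeps local composition (and therefore low-complexity regions) in place.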