Fa05 CSE 182 L3: Blast: Keyword match basics

Silly Quiz

• TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with exactly the same number of hairs!

Assignment 1 is online

• Due 10/6 (Thursday) in class.

An O(nm) algorithm for score computation

• The iteration ensures that all values on the right-hand side are computed in earlier steps.

• How much space do you need?

For i = 1 to n
  For j = 1 to m

$$S[i,j] = \max \begin{cases} S[i-1,j-1] + C(s_i, t_j) \\ S[i-1,j] + C(s_i, -) \\ S[i,j-1] + C(-, t_j) \end{cases}$$
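The recurrence and loop order above can be sketched in Python. This is only an illustration: the scoring function `cost` (+1 match, -1 mismatch, -1 gap) and the boundary initialisation (a prefix aligned entirely against gaps) are assumptions, since the slide does not specify them.

```python
def cost(a, b):
    """Assumed scoring C: +1 match, -1 mismatch, -1 for a gap ('-')."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def alignment_score(s, t, C=cost):
    """Fill the full (n+1) x (m+1) table S and return S[n][m]."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary cells (assumed): a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + C(s[i - 1], "-")
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + C("-", t[j - 1])
    # Row by row, so every value on the right-hand side already exists.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + C(s[i - 1], t[j - 1]),
                S[i - 1][j] + C(s[i - 1], "-"),
                S[i][j - 1] + C("-", t[j - 1]),
            )
    return S[n][m]
```

The table holds (n+1)(m+1) entries, which is the O(nm) space the question on the slide is pointing at.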

Alignment?

• Is O(nm) space too much?
• What if the query and database are each 1 Mbp?

$$S[i,j] = \max \begin{cases} S[i-1,j-1] + C(s_i, t_j) \\ S[i-1,j] + C(s_i, -) \\ S[i,j-1] + C(-, t_j) \end{cases}$$

Alignment (Linear Space)

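• Score computation

For i = 1 to n
  For j = 1 to m

The full-table recurrence

$$S[i,j] = \max \begin{cases} S[i-1,j-1] + C(s_i, t_j) \\ S[i-1,j] + C(s_i, -) \\ S[i,j-1] + C(-, t_j) \end{cases}$$

is replaced by a version that keeps only two rows, with i2 = i % 2 and i1 = (i − 1) % 2:

$$S[i_2,j] = \max \begin{cases} S[i_1,j-1] + C(s_i, t_j) \\ S[i_1,j] + C(s_i, -) \\ S[i_2,j-1] + C(-, t_j) \end{cases}$$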
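A minimal Python sketch of the two-row idea follows. The scoring function `cost` and the boundary handling are assumptions made for illustration; only the `i % 2` row indexing comes from the slide.

```python
def cost(a, b):
    """Assumed scoring C: +1 match, -1 mismatch, -1 for a gap ('-')."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def alignment_score_two_rows(s, t, C=cost):
    """Return the optimal score using only two rows of the table."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(2)]
    # Row 0 (assumed boundary): the empty prefix of s against prefixes of t.
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + C("-", t[j - 1])
    for i in range(1, n + 1):
        i2, i1 = i % 2, (i - 1) % 2  # current row, previous row
        S[i2][0] = S[i1][0] + C(s[i - 1], "-")
        for j in range(1, m + 1):
            S[i2][j] = max(
                S[i1][j - 1] + C(s[i - 1], t[j - 1]),
                S[i1][j] + C(s[i - 1], "-"),
                S[i2][j - 1] + C("-", t[j - 1]),
            )
    return S[n % 2][m]
```

This returns the score in O(m) space but keeps no traceback information, so the alignment itself cannot be read back from the two rows.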

Linear Space

• In Linear Space, we can do each row of the D.P.

• We need to compute the optimum path from the origin (0,0) to (m,n)

Linear Space (cont’d)

• At i=n/2, we know scores of all the optimal paths ending at that row.

• Define F[j] = S[n/2,j]
• One of these j is on the true path. Which one?

Backward alignment

• Let S’[i,j] be the optimal score of aligning s[i..n] with t[j..m]

• Define B[j] = S’[n/2,j]
• One of these j is on the true path. Which one?

Forward, Backward computation

• At the optimal coordinate, j
  – F[j]+B[j]=S[n,m]

• In O(nm) time, and O(m) space, we can compute one of the coordinates on the optimum path.

Linear Space Alignment

• Align(1..n,1..m)
  – For all 1 <= j <= m
    • Compute F[j] = S(n/2,j)
  – For all 1 <= j <= m
    • Compute B[j] = Sb(n/2,j)
  – j* = argmax_j { F[j]+B[j] }
  – X = Align(1..n/2,1..j*)
  – Y = Align(n/2..n,j*..m)
  – Return X,j*,Y
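The recursion above can be sketched in Python roughly as follows. Several choices here are assumptions rather than part of the slides: the scoring function `cost`, the helper name `last_row`, the convention that F covers the first half of s against t[:j] while B covers the second half against t[j:], the single-character base case, and returning the alignment as two gapped strings instead of the triple X, j*, Y.

```python
def cost(a, b):
    """Assumed scoring C: +1 match, -1 mismatch, -1 for a gap ('-')."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def last_row(s, t, C=cost):
    """Scores of aligning all of s with every prefix t[:j], in O(len(t)) space."""
    prev = [0] * (len(t) + 1)
    for j in range(1, len(t) + 1):
        prev[j] = prev[j - 1] + C("-", t[j - 1])
    for a in s:
        cur = [prev[0] + C(a, "-")] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cur[j] = max(prev[j - 1] + C(a, t[j - 1]),
                         prev[j] + C(a, "-"),
                         cur[j - 1] + C("-", t[j - 1]))
        prev = cur
    return prev


def align(s, t, C=cost):
    """Return an optimal global alignment of s and t as two gapped strings."""
    n, m = len(s), len(t)
    if n == 0:
        return "-" * m, t
    if m == 0:
        return s, "-" * n
    if n == 1:
        # Base case: pair the single character with its best partner in t,
        # or leave it against a gap if that scores higher.
        gap_total = sum(C("-", b) for b in t)
        best, k_best = C(s, "-") + gap_total, None
        for k, b in enumerate(t):
            sc = C(s, b) + gap_total - C("-", b)
            if sc > best:
                best, k_best = sc, k
        if k_best is None:
            return s + "-" * m, "-" + t
        return "-" * k_best + s + "-" * (m - k_best - 1), t
    mid = n // 2
    F = last_row(s[:mid], t, C)               # forward scores at row n/2
    B = last_row(s[mid:][::-1], t[::-1], C)   # backward scores, on reversed strings
    j_star = max(range(m + 1), key=lambda j: F[j] + B[m - j])
    x1, y1 = align(s[:mid], t[:j_star], C)
    x2, y2 = align(s[mid:], t[j_star:], C)
    return x1 + x2, y1 + y2
```

A possible use: `align("AGTACGCA", "TATGC")` returns two equal-length strings whose column scores sum to the optimal global score under the assumed `cost`.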

Linear Space complexity

• T(nm) = c·nm + T(nm/2) = O(nm)
• Space = O(m)
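One way to see why the recurrence stays O(nm): the two subproblems have areas (n/2)·j* and (n/2)·(m − j*), which sum to nm/2, so the work halves at every level. Unrolling it as a geometric series (a sketch, treating c as the constant per-cell cost):

```latex
T(nm) = c\,nm + c\,\frac{nm}{2} + c\,\frac{nm}{4} + \cdots
      \le c\,nm \sum_{k \ge 0} 2^{-k}
      = 2c\,nm = O(nm)
```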

Why is Blast Fast?

Large database search

Query (m)

Database (n)

Database size n = 10M, query size m = 300. O(nm) = $3 \times 10^9$ computations

Observations

• Much of the database is random from the query’s perspective

• Consider a random DNA string of length n.
  – Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25

• Assume for the moment that the query is all 1’s (length m).

• What is the probability that an exact match to the query can be found?

Basic probability

• Probability that there is a match starting at a fixed position i = $0.25^m$

• What is the probability that some position i has a match?

• Dependencies confound probability estimates.

Basic Probability:Expectation

• Q: Toss a coin: each time it comes up heads, you get a dollar
  – What is the money you expect to get after n tosses?
  – Let $X_i$ be the amount earned in the i-th toss

$$E(X_i) = 1 \cdot p + 0 \cdot (1 - p) = p$$

Total money you expect to earn

$$E\Big(\sum_i X_i\Big) = \sum_i E(X_i) = np$$

Expected number of matches

• Expected number of matches can still be computed.

Let $X_i = 1$ if there is a match starting at position i, $X_i = 0$ otherwise

$$\Pr(\text{match at position } i) = p_i = 0.25^m$$

$$E(X_i) = p_i = 0.25^m$$

Expected number of matches =

$$E\Big(\sum_i X_i\Big) = \sum_i E(X_i) = n \left(\tfrac{1}{4}\right)^m$$

Expected number of exact matches is small!

• Expected number of matches = $n \cdot 0.25^m$
  – If $n = 10^7$, m = 10:
    • expected number of matches = 9.537
  – If $n = 10^7$, m = 11:
    • expected number of matches = 2.38
  – If $n = 10^7$, m = 12:
    • expected number of matches = 0.6 < 1

• Bottom Line: An exact match to a substring of the query is unlikely just by chance.
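The figures above can be reproduced with a few lines of Python. The function name `expected_matches` is only illustrative; the formula is the n · 0.25^m expectation from the slide.

```python
def expected_matches(n, m, p=0.25):
    """Expected number of exact matches of a length-m word in a random string of length n."""
    return n * p ** m


for m in (10, 11, 12):
    print(m, round(expected_matches(10**7, m), 3))
# 10 9.537
# 11 2.384
# 12 0.596
```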

Observation 2

• What is the pigeonhole principle?

Suppose we are looking for a database string with greater than 90% identity to the query (length 100). Partition the query into size 10 substrings. At least one must match the database string exactly: with fewer than 10 mismatches (and no gaps), the mismatches cannot touch all 10 substrings.

Why is this important?

• Suppose we are looking for sequences that are 80% identical to the query sequence of length 100.

• Assume that the mismatches are randomly distributed.
• What is the probability that there is no stretch of 10 bp, where the query and the subject match exactly?

$$\cong \left(1 - \left(\tfrac{8}{10}\right)^{10}\right)^{90} = 0.000036$$

• Rough calculations show that it is very low. The probability of an exact match of a short query substring to a truly similar subject is very high.
  – The above equation does not take dependencies into account
  – Reality is better because the matches are not randomly distributed
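The rough estimate can be checked numerically. This sketch simply evaluates the slide's expression, treating the roughly 90 window positions as independent (which, as noted, they are not); the function and parameter names are made up for illustration.

```python
def prob_no_exact_window(identity=0.8, w=10, windows=90):
    """Rough probability that none of the w-bp windows matches exactly."""
    p_window_matches = identity ** w      # all w positions agree
    return (1 - p_window_matches) ** windows


print(f"{prob_no_exact_window():.6f}")
# 0.000036
```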

Just the Facts

• Consider the set of all substrings of the query string of fixed length W.
  – Prob. of exact match to a random database string is very low.
  – Prob. of exact match to a true homolog is very high.
  – Keyword Search (exact matches) is MUCH faster than sequence alignment

BLAST

• Consider all (m-W+1) query words of size W (Default = 11)
• Scan the database for exact match to all such words
• For all regions that hit, extend using a dynamic programming alignment.
• Can be many orders of magnitude faster than SW over the entire string

Database (n)

Why is BLAST fast?

• Assume that keyword searching does not consume any time and that alignment computation is the expensive step.
• Query m = 1000, random Db $n = 10^7$, no true positives (TP)
• SW = O(nm) = $1000 \cdot 10^7 = 10^{10}$ computations
• BLAST, W = 11
  – E(#11-mer hits) = $1000 \cdot (1/4)^{11} \cdot 10^7 = 2384$
  – Number of computations = $2384 \cdot 100 \cdot 100 = 2.384 \cdot 10^7$ (a 100 × 100 alignment around each hit)
• Ratio = $10^{10} / (2.384 \cdot 10^7) \approx 420$
• Further speed improvements are possible
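The arithmetic on this slide can be replayed in Python. The 100 × 100 extension window per hit is read off the slide's 2384·100·100 term; the function and parameter names are hypothetical.

```python
def blast_speedup(m=1000, n=10**7, w=11, extension=100):
    """Compare full Smith-Waterman work with hit-driven alignment work."""
    sw_ops = m * n                                   # O(nm) cells
    expected_hits = m * n * 0.25 ** w                # random w-mer hits
    blast_ops = expected_hits * extension * extension
    return sw_ops, expected_hits, blast_ops, sw_ops / blast_ops


sw_ops, hits, blast_ops, ratio = blast_speedup()
print(round(hits), round(ratio))
# 2384 419
```

The exact ratio is about 419.4, which the slide rounds to 420.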

Keyword Matching

• How fast can we match keywords?
• Hash table/Db index? What is the size of the hash table, for m=11?
• Suffix trees? What is the size of the suffix trees?
• Trie based search. We will do this in class.

AATCA 567
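As one possible reading of the hash table option, here is a small Python sketch that maps each length-w word of the database to the positions where it occurs, in the spirit of the AATCA → 567 entry above (interpreting that entry as keyword and position is an assumption). The function names are hypothetical. For DNA and w = 11 there are 4^11 = 4,194,304 possible words, which bounds the number of keys.

```python
from collections import defaultdict


def build_index(database, w):
    """Map every length-w word in the database to its start positions."""
    index = defaultdict(list)
    for pos in range(len(database) - w + 1):
        index[database[pos:pos + w]].append(pos)
    return index


def find_hits(query, index, w):
    """Return (query_pos, database_pos) pairs of exact word matches."""
    hits = []
    for qpos in range(len(query) - w + 1):
        # .get avoids creating empty entries for words absent from the index
        for dpos in index.get(query[qpos:qpos + w], ()):
            hits.append((qpos, dpos))
    return hits


index = build_index("GGAATCAGGAATCA", 5)
print(find_hits("TAATCAT", index, 5))
# [(1, 2), (1, 9)]
```

Indexing the database costs memory proportional to its length, whereas indexing the query words and scanning the database keeps the table small; the slides leave that trade-off open.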

Related notes

• How to choose the alignment region?
  – Extend greedily until the score falls below a certain threshold
• What about protein sequences?
  – Default word size = 3, and mismatches are allowed.
• Like sequences, BLAST has been evolving continuously
  – Banded alignment
  – Seed selection
  – Scanning for exact matches, keyword search versus database indexing

P-value computation

• How significant is a score? What happens to significance when you change the score function?
• A simple empirical method:
  – Compute a distribution of scores against a random database.
  – Use an estimate of the area under the curve to get the probability.
  – OR, fit the distribution to one of the standard distributions.

Z-scores for alignment

• Initial assumption was that the scores followed a normal distribution.
• Z-score computation:
  – For any alignment, score S, shuffle one of the sequences many times, and recompute alignment. Get mean and standard deviation
  – Look up a table to get a P-value

$$Z_S = \frac{S - \mu}{\sigma}$$
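The shuffle procedure can be sketched as below. The scoring scheme, the global-alignment `score` helper, the number of shuffles, and the fixed random seed are all assumptions chosen to keep the example small; the slides do not prescribe them.

```python
import random
import statistics


def cost(a, b):
    """Assumed scoring: +1 match, -1 mismatch, -1 gap."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def score(s, t):
    """Global alignment score, keeping only the previous row."""
    prev = [0] * (len(t) + 1)
    for j in range(1, len(t) + 1):
        prev[j] = prev[j - 1] + cost("-", t[j - 1])
    for a in s:
        cur = [prev[0] + cost(a, "-")]
        for j in range(1, len(t) + 1):
            cur.append(max(prev[j - 1] + cost(a, t[j - 1]),
                           prev[j] + cost(a, "-"),
                           cur[j - 1] + cost("-", t[j - 1])))
        prev = cur
    return prev[-1]


def z_score(s, t, shuffles=200, seed=0):
    """Z-score of score(s, t) against scores of s versus shuffled copies of t."""
    rng = random.Random(seed)
    observed = score(s, t)
    letters = list(t)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        null_scores.append(score(s, "".join(letters)))
    mu = statistics.mean(null_scores)
    sigma = statistics.stdev(null_scores)
    if sigma == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mu) / sigma
```

Shuffling preserves the composition of `t`, so the null scores reflect what the same letters achieve in random order.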

Blast E-value

• Initial (and natural) assumption was that scores followed a Normal distribution
• 1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential distribution
• Practical consequence:
  – Longer tail.
  – Previously significant hits now not so significant

Exponential distribution

• Random Database, Pr(1) = p
• What is the expected number of hits to a sequence of k 1’s?

$$(n - k)\,p^k \cong n e^{k \ln p} = n e^{-k \ln \frac{1}{p}}$$

• Instead, consider a random binary Matrix. Expected # of diagonals of k 1s

$$\Lambda = (n - k)(m - k)\,p^k \cong nm\,e^{k \ln p} = nm\,e^{-k \ln \frac{1}{p}}$$

• As you increase k, the number decreases exponentially.
• The number of diagonals of k runs can be approximated by a Poisson process

$$\Pr[u] = \frac{\Lambda^u e^{-\Lambda}}{u!}$$

$$\Pr[u > 0] = 1 - e^{-\Lambda}$$

• In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul).
• As the score increases, the number of alignments that achieve the score decreases exponentially
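A short Python sketch of these quantities, under the Poisson approximation stated above. The function names are illustrative, and `math.expm1` is used only to keep 1 − e^(−Λ) accurate when Λ is tiny.

```python
import math


def expected_diagonals(n, m, k, p):
    """Lambda = (n - k)(m - k) p^k, the expected number of diagonal runs of k 1s."""
    return (n - k) * (m - k) * p ** k


def poisson_pmf(u, lam):
    """Pr[u] = lam^u e^(-lam) / u!"""
    return lam ** u * math.exp(-lam) / math.factorial(u)


def prob_at_least_one(lam):
    """Pr[u > 0] = 1 - e^(-lam), computed stably for small lam."""
    return -math.expm1(-lam)


lam = expected_diagonals(n=1000, m=1000, k=20, p=0.25)
print(lam, prob_at_least_one(lam))
```

For small Λ the two printed numbers nearly coincide, which is the sense in which an expected count and a probability become interchangeable.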

Blast E-value

• Choose a score such that the expected score between a pair of residues < 0
• Expected number of alignments with a particular score

$$E = Kmn\,e^{-\lambda S} = mn\,2^{-\frac{\lambda S - \ln K}{\ln 2}}$$

$$\Pr(S \ge x) = 1 - e^{-Kmn\,e^{-\lambda x}}$$

• For small values, E-value and P-value are the same
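The two formulas can be evaluated directly. In this sketch the values K = 0.1 and λ = 0.3 are placeholders chosen for illustration, not parameters of any real scoring matrix, and the function names are hypothetical. The exponent (λS − ln K)/ln 2 is what BLAST reports as the bit score.

```python
import math


def e_value(S, m, n, K, lam):
    """E = K m n e^(-lambda S)."""
    return K * m * n * math.exp(-lam * S)


def bit_score(S, K, lam):
    """(lambda S - ln K) / ln 2, so that E = m n 2^(-bit_score)."""
    return (lam * S - math.log(K)) / math.log(2)


def p_value(S, m, n, K, lam):
    """Pr(score >= S) = 1 - e^(-E), computed stably for small E."""
    return -math.expm1(-e_value(S, m, n, K, lam))


K, lam = 0.1, 0.3          # placeholder parameters (assumed)
m, n, S = 300, 10**7, 100
print(e_value(S, m, n, K, lam), p_value(S, m, n, K, lam))
```

With these placeholder values E is far below 1, and the P-value agrees with it to several significant digits.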

Blast Variants

1. What is mega-blast? Longer seeds.
2. What is discontiguous mega-blast? Seeds with don’t care values.
3. Phi-Blast/Psi-Blast? Later.
4. BLAT? Database pre-processing.
5. PatternHunter? Seeds with don’t care values.

Keyword Matching

[Figure: keyword matching example, scanning the text POTASTPOTATO for keywords such as POTATO]
