
CSE 308-408 · Bioinformatics: Issues and Algorithms · Lopresti · Spring 2007 · Lecture 15

Bioinformatics: Issues and Algorithms

CSE 308-408 • Fall 2007 • Lecture 15

Genetic Pattern Matching 2


Administrative notes

Your final project / paper proposal is due on Friday, November 9 at 5:00 pm.

The proposal just needs to be a couple of paragraphs telling me the problem area you plan to work on and some of the references you'll probably use.

I'll send you feedback on your proposal by the middle of the following week; then you're off and running!

If there's a possible connection between the work you'd like to do and the topics you've heard Professor Marzillier talk about, I'll discuss your proposal with her to get her feedback and suggestions (e.g., other papers you might read, datasets you might use for testing code you develop, etc.).


Outline

• Heuristic Similarity Search Algorithms

• Filtration

• Algorithm Behind BLAST

• Statistics Behind BLAST

http://www.bioalgorithms.info


Approximate vs. exact pattern matching


Recall last lecture: we studied exact pattern matching algorithms. We know this is artificial, however:

• Usually, because of mutations, it makes much more biological sense to find approximate pattern matches.

• Biologists often use fast heuristic approaches (rather than local alignment via dynamic programming) to find approximate matches.

This nicely sets the stage for today's lecture.


Heuristic similarity searches

Why look to heuristics?

• Genomes are huge: Smith-Waterman-style quadratic time alignment algorithms are too slow.

• Usually, sequences we want to align have short identical or highly similar fragments.

Many heuristic methods (e.g., FASTA) are based on filtration:

• Find short exact matches, and use them as seeds for potential match extension.

• Filter out positions with no extendable matches.



Dot matrices

• Dot matrices show similarities between two sequences.

• FASTA makes implicit dot matrix from short exact matches, and then tries to find long diagonals (allowing for some mismatches).


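  G A T T C G C T T A G T
C . . . . * . * . . . . .
T . . * * . . . * * . . *
G * . . . . * . . . . * .
A . * . . . . . . . * . .
T . . * * . . . * * . . *
T . . * * . . . * * . . *
C . . . . * . * . . . . .
C . . . . * . * . . . . .
T . . * * . . . * * . . *
T . . * * . . . * * . . *
A . * . . . . . . . * . .
G * . . . . * . . . . * .
T . . * * . . . * * . . *
C . . . . * . * . . . . .
A . * . . . . . . . * . .
G * . . . . * . . . . * .

(* = dot, . = empty cell)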

Place a “dot” at starting position of exact matches of specified length (in this case, one nucleotide)
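As a rough illustration, a dot matrix like this one can be built with a few lines of Python. This is only a sketch: the function names `dot_matrix` and `render` are made up for this example, and `w` is the length of the exact matches that earn a dot (one nucleotide here, as on the slide).

```python
def dot_matrix(rows, cols, w=1):
    """Return the set of 0-based cells (i, j) where rows[i:i+w] == cols[j:j+w]."""
    dots = set()
    for i in range(len(rows) - w + 1):
        for j in range(len(cols) - w + 1):
            if rows[i:i + w] == cols[j:j + w]:
                dots.add((i, j))
    return dots


def render(rows, cols, dots):
    """Format the matrix as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(cols)]
    for i, letter in enumerate(rows):
        cells = ("*" if (i, j) in dots else "." for j in range(len(cols)))
        lines.append(letter + " " + " ".join(cells))
    return "\n".join(lines)


# The two sequences from the slide: columns across the top, rows down the side.
cols = "GATTCGCTTAGT"
rows = "CTGATTCCTTAGTCAG"
dots = dot_matrix(rows, cols)
print(render(rows, cols, dots))
```

Raising `w` above 1 thins out the isolated dots and leaves mostly the diagonals, which is what FASTA is after.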



Dot matrices

• Identify diagonals above threshold length.

• Diagonals in dot matrix indicate exact substring matching.



Diagonals in dot matrices

• Extend diagonals and try to link them together, allowing for a small number of mismatches and indels.

• Linking diagonals reveals approximate matches over longer substrings.




Approximate pattern matching

The Approximate Pattern Matching Problem.

Find all approximate occurrences of a pattern in a text.

Input: A pattern p = p_1 … p_n, a text t = t_1 … t_m, and k (the maximum number of allowable mismatches).

Output: All positions 1 ≤ i ≤ (m – n + 1) such that t_i … t_{i+n-1} and p_1 … p_n have at most k mismatches (i.e., the Hamming distance between t_i … t_{i+n-1} and p_1 … p_n is ≤ k).



Approximate pattern matching: brute-force approach

ApproximatePatternMatching(p, t, k)
    n ← length of pattern p
    m ← length of text t
    for i ← 1 to m – n + 1
        dist ← 0
        for j ← 1 to n
            if t_{i+j-1} ≠ p_j
                dist ← dist + 1
        if dist ≤ k
            output i

Time complexity?


O(mn)
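The brute-force pseudocode translates almost line for line into Python. The sketch below assumes plain strings as input and reports 1-based positions to match the problem statement; the function name and the early exit are choices made for this example, and the early exit does not change the O(mn) worst case.

```python
def approximate_pattern_matching(p, t, k):
    """Return 1-based positions i where t[i..i+n-1] has at most k mismatches with p."""
    n, m = len(p), len(t)
    positions = []
    for i in range(m - n + 1):
        dist = 0
        for j in range(n):
            if t[i + j] != p[j]:
                dist += 1
                if dist > k:      # already too many mismatches at this position
                    break
        if dist <= k:
            positions.append(i + 1)
    return positions


print(approximate_pattern_matching("ATTG", "GATTCGATTG", 1))
```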


Approximate pattern matching

Time complexity:

• Previous algorithm runs in time O(mn).

• Landau-Vishkin algorithm requires O(km), where m is the length of the text.

We can generalize Approximate Pattern Matching Problem into Query Matching Problem:

• We want to match substrings in query to substrings in a text with at most k mismatches.

• Motivation: we want to see similarities to some gene, but we may not know which parts of gene to look for.



Query matching

The Query Matching Problem.

Find all substrings of query that approximately match text.

Input: A query q = q_1 … q_w, a text t = t_1 … t_m, n (length of matching substrings), and k (maximum number of allowable mismatches).

Output: All pairs of positions (i, j) such that n-letter substring of q starting at i approximately matches n-letter substring of t starting at j, with at most k mismatches.
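Before turning to filtration, it may help to see the problem solved the slow way. The following is a minimal brute-force sketch (function names are invented for this example) that compares every n-letter substring of the query with every n-letter substring of the text and reports 1-based pairs (i, j).

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def query_matching_brute_force(q, t, n, k):
    """Return all 1-based (i, j) with at most k mismatches between q[i..] and t[j..]."""
    pairs = []
    for i in range(len(q) - n + 1):
        for j in range(len(t) - n + 1):
            if hamming(q[i:i + n], t[j:j + n]) <= k:
                pairs.append((i + 1, j + 1))
    return pairs


print(query_matching_brute_force("ACGT", "ACCT", n=3, k=1))
```

Every pair of start positions costs up to n comparisons, which is the expense filtration tries to avoid.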



Approximate pattern matching vs. query matching

[Figure: side-by-side comparison of Approximate Pattern Matching and Query Matching.]



Query matching

Main idea:

• Approximately matching strings share some perfectly matching substrings.

• Instead of searching for approximately matching strings (difficult), search for perfectly matching substrings (easy).



Query matching

Filtration in query matching:

• We want all n-matches between a query and a text with up to k mismatches.

• Filter out positions we know do not match.

• Potential match detection: find all matches of l-tuples in query and text for some small l.

• Potential match verification: verify each potential match by extending it to left and right until (k + 1) mismatches found.



Filtration: match detection

• If x_1 … x_n and y_1 … y_n match with at most k mismatches, they must share an l-tuple that is perfectly matched, with l = ⌊n / (k + 1)⌋.

• Break string of length n into k + 1 parts, each of length at least ⌊n / (k + 1)⌋.

• k mismatches can affect at most k of these k + 1 parts.

• At least one of these k + 1 parts is perfectly matched.

[Figure: a string of length n cut by k slices into k + 1 parts; the k mismatches (marked X) leave at least one part with no mismatch.]


Filtration: match detection

Suppose k = 3. We would then have l = n / (k + 1) = n / 4.

There are at most k mismatches in n, so at very least there must be one out of (k + 1) l-tuples without a mismatch.

Part:       1        2           k            k + 1
Positions:  1 … l    l+1 … 2l    2l+1 … 3l    3l+1 … n
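The counting argument can be checked on a small case. In this sketch the helper names are made up, positions are 0-based half-open ranges, and the last part is assumed to absorb any remainder when n is not divisible by k + 1.

```python
def split_into_parts(n, k):
    """Cut positions 0..n-1 into k + 1 consecutive parts of length >= n // (k + 1)."""
    l = n // (k + 1)
    bounds = [(part * l, (part + 1) * l) for part in range(k)]
    bounds.append((k * l, n))       # last part takes whatever is left over
    return bounds


def perfectly_matched_parts(x, y, k):
    """List the parts on which equal-length strings x and y agree exactly."""
    return [(a, b) for a, b in split_into_parts(len(x), k) if x[a:b] == y[a:b]]


# n = 12, k = 3: three mismatches (positions 0, 4 and 10) still leave one clean part.
x = "ACGTACGTACGT"
y = "TCGTTCGTACAT"
print(split_into_parts(len(x), 3))
print(perfectly_matched_parts(x, y, 3))
```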



Filtration: match verification

For each l-match, we extend it further to see if it is substantial.

[Figure: query and text, with a perfect l-match between them extended in both directions.]

Extend perfect match of length l until we find approximate match of length n with k mismatches.
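Putting detection and verification together gives something like the sketch below. It is one possible reading of the filtration idea, not a reference implementation: detection uses a dictionary of all l-tuples in the text, and verification simply re-counts mismatches over every n-letter window on the seed's diagonal that contains the seed, instead of extending left and right one letter at a time. All names are invented for this example.

```python
from collections import defaultdict


def query_matching_filtration(q, t, n, k):
    """Return sorted 1-based (i, j) pairs of n-letter substrings with <= k mismatches."""
    l = n // (k + 1)
    if l == 0:
        raise ValueError("need n >= k + 1 so that l-tuples are non-empty")

    # Potential match detection: index every l-tuple of the text.
    index = defaultdict(list)
    for j in range(len(t) - l + 1):
        index[t[j:j + l]].append(j)

    found = set()
    for i in range(len(q) - l + 1):
        for j in index.get(q[i:i + l], ()):
            d = j - i                                # diagonal of this seed
            # Potential match verification: every n-window on diagonal d
            # that contains the seed and fits inside both strings.
            lo = max(0, i + l - n, -d)
            hi = min(i, len(q) - n, len(t) - n - d)
            for a in range(lo, hi + 1):
                mismatches = sum(q[a + s] != t[a + d + s] for s in range(n))
                if mismatches <= k:
                    found.add((a + 1, a + d + 1))
    return sorted(found)


print(query_matching_filtration("ACGT", "ACCT", n=3, k=1))
```

Only diagonals that contain at least one exact l-tuple match are ever verified; positions without a seed are filtered out without any mismatch counting.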



Filtration: illustration

k                0     1     2     3     4     5
l-tuple length   n     n/2   n/3   n/4   n/5   n/6

As k grows: shorter perfect matches required, performance decreases.



Local alignment is too slow …

Quadratic time local alignment is too slow when looking for similarities between long strings (e.g., entire GenBank database).
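For reference, here is a small sketch of the quadratic-time local alignment (Smith-Waterman) score computation. The scoring values (match +1, mismatch -1, indel -1) and the function name are assumptions chosen for the example; only two rows of the table are kept, since just the best score is returned.

```python
def smith_waterman_score(v, w, match=1, mismatch=-1, indel=-1):
    """Best local alignment score of v and w, in O(len(v) * len(w)) time."""
    prev = [0] * (len(w) + 1)
    best = 0
    for i in range(1, len(v) + 1):
        curr = [0] * (len(w) + 1)
        for j in range(1, len(w) + 1):
            delta = match if v[i - 1] == w[j - 1] else mismatch
            curr[j] = max(0,                    # start a fresh local alignment
                          prev[j] + indel,      # v[i] aligned with a gap
                          curr[j - 1] + indel,  # w[j] aligned with a gap
                          prev[j - 1] + delta)  # v[i] aligned with w[j]
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GGTC", "AAGGTCAA"))
```

The two nested loops touch every cell of the table, which is why the cost becomes prohibitive against an entire database.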

Local alignment recurrence (δ is the scoring function):

\[ s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases} \]

Even so:

• Guaranteed to find optimal local alignment.

• Sets standard for sensitivity.



BLAST

"Basic Local Alignment Search Tool," by Altschul, Gish, Miller, Myers, and Lipman, Journal of Molecular Biology, 1990.

Search sequence databases for local alignments to a query.

BLAST features:

• Big improvement in speed, only modest drop in sensitivity.

• Minimizes search space instead of exploring entire search space between two sequences.

• Finds short exact matches (“seeds”), only explores locally around these “hits.”

BLASTing a new gene can be used for exploring evolutionary relationships and similarities in protein function.

BLASTing a genome can be used for finding potential genes.



BLAST algorithm

Keyword search of all words of length w from query of length n in database of length m with score above threshold.

w = 11 or 12 for DNA queries, w = 3 or 4 for proteins

Local alignment extension for each found keyword. Extend result until longest match above threshold is achieved.

Running time is O(mn)

Output all local alignments with score > threshold.



BLAST algorithm

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

Keyword: GVK

Neighborhood words:

GVK 18
GAK 16
GIK 16
GGK 14
GLK 13
(neighborhood score threshold, T = 13)
GNK 12
GRK 11
GEK 11
GDK 11

Extension to a High-scoring Pair (HSP):

Query:  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
           +++DN +G +   IR L    G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263



Original BLAST: example

• w = 4

• Exact keyword match of GGTC.

• Extend diagonals with mismatches until score is under 50%.

• Output result is:

GTAAGGTCC
GTTAGGTCC

[Figure: alignment grid with the sequence A C G A A G T A A G G T C C A G T along the top axis.]



Gapped BLAST: example

• Original BLAST exact keyword search, then:

• Extend with gaps around ends of exact match until score is below threshold.

• Output result is:

GTAAGGTCCAGT
GTTAGGTC-AGT

[Figure: alignment grid with the sequence A C G A A G T A A G G T C C A G T along the top axis.]



BLAST recap

Overall BLAST approach is to locate short, high scoring segment pairs between query and database sequence. It then extends these "seeds" in both directions until it seems unlikely further extensions will raise score (this is heuristic).

Basic steps in BLAST:
(1) Compile list of high-scoring "words" from query.
(2) Search for hits for these in database (call these "seeds").
(3) Extend seeds so long as score improves.

Usual word size is 3 or 4 for proteins, 11 or 12 for nucleic acids.

Scanning database (step 2) can be facilitated by using hash table or building finite state automaton (FSA).
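Steps (2) and (3) can be sketched with a hash table of database words and an ungapped extension. Several things here are assumptions made for illustration, not BLAST's actual parameters: only exact word matches are used as seeds, scoring is +1 / -1, extension stops once the running score falls `x_drop` below the best seen, and the short database string is invented so that the seed GGTC has something to extend.

```python
from collections import defaultdict


def build_word_index(db, w):
    """Hash table mapping each length-w word to its 0-based start positions in db."""
    index = defaultdict(list)
    for j in range(len(db) - w + 1):
        index[db[j:j + w]].append(j)
    return index


def extend_ungapped(q, db, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Extend the seed q[i:i+w] == db[j:j+w] in both directions without gaps."""
    def walk(qi, dj, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(q) and 0 <= dj < len(db):
            score += match if q[qi] == db[dj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length   # remember the best extension
            elif best - score >= x_drop:
                break                            # score dropped too far: stop
            qi += step
            dj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, +1)
    left, left_len = walk(i - 1, j - 1, -1)
    # (query start, database start, length, score), all 0-based
    return (i - left_len, j - left_len,
            w + left_len + right_len, w * match + left + right)


def seed_and_extend(q, db, w):
    """Find exact word hits via the index, then extend each one."""
    index = build_word_index(db, w)
    hits = set()
    for i in range(len(q) - w + 1):
        for j in index.get(q[i:i + w], ()):
            hits.add(extend_ungapped(q, db, i, j, w))
    return sorted(hits)


query = "ACGAAGTAAGGTCCAGT"
database = "GTTAGGTCC"           # hypothetical database sequence
for qs, ds, length, score in seed_and_extend(query, database, w=4):
    print(query[qs:qs + length], database[ds:ds + length], score)
```

Several overlapping seeds on the same diagonal extend to the same segment pair, so collecting results in a set reports it once.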

Some proteins may have very different amino acid sequences, but are still similar.


Recall 250-PAM matrix

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   2
R  -2  6
N   0  0  2
D   0 -1  2  4
C  -2 -4 -4 -5 12
Q   0  1  1  2 -5  4
E   0 -1  1  3 -5  2  4
G   1 -3  0  1 -3 -1  0  5
H  -1  2  2  1 -3  3  1 -2  6
I  -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L  -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K  -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M  -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F  -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P   1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S   1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2
T   1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3
W  -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y  -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V   0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4


BLAST

BLAST defines maximal segment pair (MSP) to be highest scoring pair of identical length segments from two sequences.

With matrix such as 250-PAM, we can score protein segments:

Q  M  F  T  R
E  M  L  K  F
2  6  2  0 -4   = 6

Nucleotide segments can be scored using similar rules:

A  C  G  T  A
A  C  G  A  C
5  5  5 -4 -4   = 7

A segment pair is locally maximal if its score cannot be improved either by extending it or shortening it.
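Scoring a segment pair is just a column-by-column sum. The sketch below reproduces the nucleotide example with +5 for a match and -4 for a mismatch; the function names are invented here, and any other scoring function (for instance a 250-PAM lookup) could be passed in instead.

```python
def score_segment_pair(a, b, score):
    """Sum the column scores of two segments of identical length."""
    if len(a) != len(b):
        raise ValueError("segments must have identical length")
    return sum(score(x, y) for x, y in zip(a, b))


def nucleotide_score(x, y, match=5, mismatch=-4):
    """Simple nucleotide scoring rule: +5 for identity, -4 otherwise."""
    return match if x == y else mismatch


print(score_segment_pair("ACGTA", "ACGAC", nucleotide_score))
```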


BLAST example: compiling word list

Say that the query protein sequence is QLNFSAGW.

Divide the query into all subsequences of length w = 2:

Q L N F S A G W

yields

QL LN NF FS SA AG GW


BLAST example: compiling word list

Generate all 2-letter combinations that score at least T = 8 when aligned with query subsequence using 250-PAM:

QL   QL (10), QM (8), HL (9)
LN   LN (8)
NF   NF (11), AF (9), NY (9), DF (11), QF (10), EF (10), GF (9), HF (11), KF (10), SF (10)
FS   FS (11), FA (10), FN (10), FD (9), FG (10), FP (10), YS (9)
SA   none
AG   none
GW   AW (18), RW (14), NW (17), DW (18), QW (16), EW (17), GW (22), HW (15), IW (14), KW (15), MW (14), PW (16), SW (18), TW (17), VW (16)
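The word-list step can be sketched by brute-force enumeration over all words of length w, scored with the lower-triangular 250-PAM matrix from the earlier slide. The names below are made up for this example. Note that an exhaustive enumeration may return more words than a hand-compiled list, since every word reaching the threshold is kept.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"
PAM250_LOWER = """
2
-2 6
0 0 2
0 -1 2 4
-2 -4 -4 -5 12
0 1 1 2 -5 4
0 -1 1 3 -5 2 4
1 -3 0 1 -3 -1 0 5
-1 2 2 1 -3 3 1 -2 6
-1 -2 -2 -2 -2 -2 -2 -3 -2 5
-2 -3 -3 -4 -6 -2 -3 -4 -2 2 6
-1 3 1 0 -5 1 0 -2 0 -2 -3 5
-1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6
-4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9
1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6
1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2
1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3
-6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17
-3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10
0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4
"""

# Expand the lower triangle into a symmetric lookup keyed by (residue, residue).
PAM250 = {}
for i, line in enumerate(PAM250_LOWER.strip().splitlines()):
    for j, value in enumerate(line.split()):
        PAM250[AMINO[i], AMINO[j]] = PAM250[AMINO[j], AMINO[i]] = int(value)


def query_words(query, w):
    """All overlapping subsequences of length w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold):
    """Map every same-length word scoring >= threshold against `word` to its score."""
    result = {}
    for candidate in product(AMINO, repeat=len(word)):
        s = sum(PAM250[a, b] for a, b in zip(word, candidate))
        if s >= threshold:
            result["".join(candidate)] = s
    return result


for word in query_words("QLNFSAGW", 2):
    print(word, sorted(neighborhood(word, 8).items()))
```

Enumerating all 20^w candidates is fine for w = 2 or 3; longer words would call for pruning.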


BLAST example: identifying high scoring segment pairs

Say that the database protein sequence is NLNYTPW.

Query:    Q L N F S A G W
Database: N L N Y T P W

[Figure: word hits between query and database, annotated T = 8 and T = 16.]


BLAST example: extending a hit

Continue trying to extend hit until score falls below threshold.


[Figure: hit extensions, annotated T = 8 + 7 = 15 and T = 8 + 1 = 9.]


BLAST statistics

When is a maximal alignment score statistically significant?

Turn it around: what is probability it could arise by chance?

Consider two random sequences generated by rolling a 20-faced die an appropriate number of times. The die is loaded according to relative frequencies of amino acids in database.

When m and n are large, the expected number of distinct segment pairs between s and t with score above S is:

\[ K m n e^{-\lambda S} \]

where K and λ are parameters obtained from the scoring matrix and the amino acid probabilities (distribution is Poisson).
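The formula is easy to evaluate numerically. In the sketch below the function names are invented, and the values of K and λ used in the usage lines are placeholders only; real values depend on the scoring matrix and residue frequencies. Because the count is Poisson-distributed, the chance of seeing at least one such segment pair is 1 - e^{-E}.

```python
import math


def expected_segment_pairs(m, n, score, K, lam):
    """Expected number of segment pairs scoring above `score`: K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * score)


def p_at_least_one(e_value):
    """Poisson probability of at least one such segment pair: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Placeholder parameters, for illustration only.
E = expected_segment_pairs(m=250, n=1_000_000, score=60, K=0.1, lam=0.3)
print(E, p_at_least_one(E))
```

For small E the two numbers are nearly equal, which is why E and p are often used interchangeably for strong hits.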


BLAST statistics

From probability distribution of maximal alignment scores, we can determine probability of getting a random alignment as good as one observed. If this probability is small (say < 0.05), the alignment is deemed statistically significant.

In BLAST output, this probability p is converted to a bit score which is equal to -log_2 p. Smaller probability = larger bit score.

We can also calculate expected number of times, E, an alignment with such a score would occur in a database of the same size. BLAST lets you discard alignments expected to occur more than certain number of times (default 10).

Note: statistical significance must NOT be mistaken as biological truth.


BLAST variations


Sample BLAST Output 1

Blast of human beta globin protein against zebra fish:

Sequences producing significant alignments:                          (bits) Value

gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757...   171   3e-44
gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer...   170   7e-44
gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D...   170   7e-44
gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio]                  168   3e-43

ALIGNMENTS
>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]
Length = 148
Score = 171 bits (434), Expect = 3e-44
Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

Query:   1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60
           MV  T  E++A+  LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK
Sbjct:   1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

Query:  61 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120
           V AHG+ V+G     + ++DN+K T+A LS +H +KLHVDP+NFRLL + +    A  FG
Sbjct:  61 VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120

Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147
           +  F   VQ A+QK +A V +AL  +YH
Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148


Sample BLAST Output 2

Blast of human beta globin DNA against human DNA:

Sequences producing significant alignments:                          (bits) Value

gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1...   289   1e-75
gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end        289   1e-75
gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge...   280   1e-72
gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin     260   1e-66
gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud...   151   7e-34
gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob...   149   3e-33

ALIGNMENTS
>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11
Length = 81706
Score = 149 bits (75), Expect = 3e-33
Identities = 183/219 (83%)
Strand = Plus / Plus

Query: 267   ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326
             || ||| | ||    | || |  |||||| ||||| |||||||||||    ||||||||
Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468

Query: 327   ctgcactgtgacaagctgcatgtggatcctgagaacttc 365
             ||||||||| |||||||||| ||||| ||||||||||||
Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507


FAST

Philosophically, FAST is similar to BLAST (FAST came first):

• short matching segments between query and database sequences are identified,

• these matches are extended and refined (including running dynamic programming on likely candidates),

• matches that are statistically significant are reported.

FAST builds a lookup table of all k-tuples in query (k = 1 or 2). It then scans database sequence and records offsets of matches it finds. E.g.,

if a k-tuple at s[i] matches one at t[i], offset = 0
if a k-tuple at s[i] matches one at t[i+3], offset = -3
if a k-tuple at s[i] matches one at t[i-2], offset = +2


FAST

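Query (s):

Position:  1  2  3  4  5  6  7  8  9  10  11
Residue:   H  A  R  F  Y  A  A  Q  I  V   L

Lookup table (k = 1):

A   2, 6, 7
F   4
H   1
I   9
L   11
Q   8
R   3
V   10
Y   5

Database (t), with the offsets of the matches found at each position:

1  V   +9
2  D   none
3  M   none
4  A   -2, +2, +3
5  A   -3, +1, +2
6  Q   +2
7  I   +2
8  A   -6, -2, -1

Number of matches at each offset:

Offset:   -7  -6  -5  -4  -3  -2  -1   0  +1  +2  +3  ...  +9
Matches:   0   1   0   0   1   2   1   0   1   4   1  ...   1

Offset +2 is a good choice.

Keeping track of k-tuple matches at each offset allows FAST to identify strongest diagonals for further analysis.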
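The bookkeeping in this example is small enough to sketch directly. The function names below are made up; positions are 1-based to match the slide, and an offset is the query position minus the database position.

```python
from collections import Counter, defaultdict


def lookup_table(query, k=1):
    """Map each k-tuple of the query to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i + 1)
    return table


def offset_histogram(query, database, k=1):
    """Count k-tuple matches per offset (query position minus database position)."""
    table = lookup_table(query, k)
    counts = Counter()
    for j in range(len(database) - k + 1):
        for i in table.get(database[j:j + k], ()):
            counts[i - (j + 1)] += 1
    return counts


counts = offset_histogram("HARFYAAQIVL", "VDMAAQIA")
for offset, matches in sorted(counts.items()):
    print(f"{offset:+d}: {matches}")
```

The offset with the most matches marks the diagonal worth a closer look; here it is the run AAQI shared by both sequences.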


Wrap-up

Remember:
• Come to class having done the readings.
• Check Blackboard regularly for updates.

Readings for next time:
• IBA Sections 10.1-10.7 (hierarchical clustering, evolutionary trees).
