Upload
eddy-valdeiglesias-quispe
View
217
Download
0
Embed Size (px)
Citation preview
8/17/2019 Patr an Huntre and Blast
1/102
www.bioalgorithms.info An Introduction to Bioinformatics Algorithms
CombinatorialPattern Matching
8/17/2019 Patr an Huntre and Blast
2/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Outline
• Hash Tables• Repeat Finding• Exact Pattern Matching• Keyword Trees
• Suffix Trees• Heuristic Similarity Search Algorithms• Approximate String Matching• Filtration•
Comparing a Sequence Against a Database• Algorithm behind BLAST• Statistics behind BLAST• PatternHunter and BLAT
8/17/2019 Patr an Huntre and Blast
3/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Genomic Repeats
• Example of repeats:
• ATGGTCTAGGTCCTAGTGGTC
• Motivation to find them:
• Genomic rearrangements are oftenassociated with repeats
• Trace evolutionary secrets
• Many tumors are characterized by anexplosion of repeats
8/17/2019 Patr an Huntre and Blast
4/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Genomic Repeats
• The problem is often more difficult:
• ATGGTCTAGGACCTAGTGTTC
• Motivation to find them:
• Genomic rearrangements are oftenassociated with repeats
• Trace evolutionary secrets
• Many tumors are characterized by anexplosion of repeats
8/17/2019 Patr an Huntre and Blast
5/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
l -mer Repeats
• Long repeats are difficult to find• Short repeats are easy to find (e.g., hashing)
• Simple approach to finding long repeats:
• Find exact repeats of short l -mers (l is usually10 to 13)
• Use l -mer repeats to potentially extend intolonger, maximal repeats
8/17/2019 Patr an Huntre and Blast
6/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
l -mer Repeats (cont!d)
• There are typically many locations where anl -mer is repeated:
GCTTACAGATTCAGTCTTACAGATGGT
• The 4-mer TTAC starts at locations 3 and 17
8/17/2019 Patr an Huntre and Blast
7/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Extending l -mer Repeats
GCTTACAGATTCAGTCTTACAGATGGT
• Extend these 4-mer matches:
GCTTACAGATTCAGTCTTACAGATGGT
• Maximal repeat: TTACAGAT
8/17/2019 Patr an Huntre and Blast
8/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Maximal Repeats
• To find maximal repeats in this way, we need ALL start locations of all l -mers in the
genome
• Hashing lets us find repeats quickly in thismanner
8/17/2019 Patr an Huntre and Blast
9/102
8/17/2019 Patr an Huntre and Blast
10/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Hashing: Definitions
• Hash table: array used in hashing
• Records: data stored in a hash table
• Keys: identifies sets of records
• Hash function: uses a key to generate anindex to insert at in hash table
• Collision: when more than one record ismapped to the same index in the hash table
8/17/2019 Patr an Huntre and Blast
11/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Hashing: Example
• Where do theanimals eat?
• Records: eachanimal
• Keys: whereeach animaleats
8/17/2019 Patr an Huntre and Blast
12/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Hashing DNA sequences
• Each l -mer can be translated into a binary
string (A, T, C, G can be represented as
00, 01, 10, 11)
• After assigning a unique integer per l -mer it
is easy to get all start locations of each l -
mer in a genome
8/17/2019 Patr an Huntre and Blast
13/102
A I t d ti t Bi i f ti Al ith bi l ith i f
8/17/2019 Patr an Huntre and Blast
14/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Hashing: Collisions
• Dealing withcollisions:
• “Chain” all startlocations of l -mers
(linked list)
A I t d ti t Bi i f ti Al ith bi l ith i f
8/17/2019 Patr an Huntre and Blast
15/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Hashing: Summary
• When finding genomic repeats from l -mers:
• Generate a hash table index for each l -mer
sequence
• In each index, store all genome startlocations of the l -mer which generated that
index
• Extend l -mer repeats to maximal repeats
A I t d ti t Bi i f ti Al ith bi l ith i f
8/17/2019 Patr an Huntre and Blast
16/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Pattern Matching
• What if, instead of finding repeats in agenome, we want to find all sequences in adatabase that contain a given pattern?
• This leads us to a different problem, thePattern Matching Problem
A I t d ti t Bi i f ti Al ith bi l ith i f
8/17/2019 Patr an Huntre and Blast
17/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Pattern Matching Problem
• Goal: Find all occurrences of a pattern in a text
• Input: Pattern p = p1… pn and text t = t 1…t m
• Output: All positions 1< i < (m – n + 1) such thatthe n-letter substring of t starting at i matches p
• Motivation: Searching database for a knownpattern
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
18/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Exact Pattern Matching: A Brute-Force
Algorithm
PatternMatching(p,t)1
n ß length of pattern p2 m ß length of text t3 for i ß 1 to (m – n + 1)4
if t i …t i+n-1 = p5 output i
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
19/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Exact Pattern Matching: An Example
• PatternMatching algorithm for:
• Pattern GCAT
• Text CGCATC
GCATCGCATC
GCATCGCATC
CGCATCGCAT
CGCATC
CGCATCGCAT
GCAT
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
20/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Exact Pattern Matching: Running Time
• PatternMatching runtime: O(nm)
• Probability-wise, it’s more like O(m)
• Rarely will there be close to n comparisonsin line 4
• Better solution: suffix trees
• Can solve problem in O(m) time
• Conceptually related to keyword trees
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
21/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Example
• Keyword tree:
• Apple
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
22/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Example (cont!d)
• Keyword tree:
• Apple
• Apropos
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
23/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Example (cont!d)
• Keyword tree:
• Apple
• Apropos
• Banana
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
24/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Example (cont!d)
• Keyword tree:
• Apple
• Apropos
• Banana
• Bandana
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
25/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Example (cont!d)
• Keyword tree:
• Apple
• Apropos
• Banana
• Bandana
• Orange
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
26/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Properties
• Stores a set of keywordsin a rooted labeled tree
• Each edge labeled with a
letter from an alphabet• Any two edges coming
out of the same vertexhave distinct labels
•
Every keyword storedcan be spelled on a pathfrom root to some leaf
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
27/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Threading (cont!d)
• Thread “appeal”
• appeal
An Introduction to Bioinformatics Algorithms www bioalgorithms info
8/17/2019 Patr an Huntre and Blast
28/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Threading (cont!d)
• Thread “appeal”
• appeal
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
29/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Threading (cont!d)
• Thread “appeal”
• appeal
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
30/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Threading (cont!d)
• Thread “appeal”
• appeal
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
31/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees: Threading (cont!d)
• Thread “apple”
• apple
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
32/102
g g
Keyword Trees: Threading (cont!d)
• Thread “apple”
• apple
8/17/2019 Patr an Huntre and Blast
33/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
34/102
g g
Keyword Trees: Threading (cont!d)
• Thread “apple”
• apple
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
35/102
g g
Keyword Trees: Threading (cont!d)
• Thread “apple”
• apple
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
36/102
g g
Multiple Pattern Matching Problem
• Goal: Given a set of patterns and a text, find alloccurrences of any of patterns in text
•
Input: k patterns p1,…,p
k , and text t = t 1…t m
• Output: Positions 1 < i < m where substring of t
starting at i matches p j for 1
8/17/2019 Patr an Huntre and Blast
37/102
g g
Multiple Pattern Matching: StraightforwardApproach
• Can solve as k “Pattern Matching Problems”
• Runtime:
O(kmn)
using the PatternMatching algorithm k times
• m - length of the text
• n - average length of the pattern
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
38/102
Multiple Pattern Matching: Keyword TreeApproach
• Or, we could use keyword trees:
• Build keyword tree in O(N ) time; N is totallength of all patterns
• With naive threading: O(N + nm)
• Aho-Corasick algorithm: O(N + m)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
39/102
Keyword Trees: Threading
• To match patternsin a text using akeyword tree:
• Build keywordtree of patterns
• “Thread” the
text through thekeyword tree
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
40/102
Keyword Trees: Threading (cont!d)
• Threading is“complete” when wereach a leaf in the
keyword tree
• When threading is
“complete,” we’vefound a pattern inthe text
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
41/102
Suffix Trees=Collapsed Keyword Trees
• Similar to keyword trees,except edges that formpaths are collapsed
• Each edge is labeledwith a substring of atext
• All internal edges haveat least two outgoingedges
• Leaves labeled by theindex of the pattern.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
42/102
Suffix Tree of a Text
• Suffix trees of a text is constructed for all its suffixes
ATCATG TCATG CATG ATG
TG G
Keywor d Tree
Suffix Tree
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
43/102
Suffix Tree of a Text
• Suffix trees of a text is constructed for all its suffixes
ATCATG TCATG CATG ATG
TG G
Keywor d Tree
Suffix Tree
How much time does it take?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
44/102
Suffix Tree of a Text
• Suffix trees of a text is constructed for all its suffixes
ATCATG TCATG CATG ATG
TG G
quadratic Keywor d Tree
Suffix Tree
Time is linear in the total size of all suffixes,i.e., it is quadratic in the length of the text
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
45/102
Suffix Trees: Advantages
• Suffix trees of a text is constructed for all its suffixes• Suffix trees build faster than keyword trees
ATCATG TCATG CATG ATG
TG G
quadratic Keywor d Tree
Suffix Tree
linear (Weiner suffix tree algorithm)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
46/102
Use of Suffix Trees
• Suffix trees hold all suffixes of a text
• i.e., ATCGC: ATCGC, TCGC, CGC, GC, C• Builds in O(m) time for text of length m
• To find any pattern of length n in a text:• Build suffix tree for text• Thread the pattern through the suffix tree
• Can find pattern in text in O(n) time!• O(n + m) time for “Pattern Matching Problem”
• Build suffix tree and lookup pattern
8/17/2019 Patr an Huntre and Blast
47/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
48/102
Suffix Trees: Example
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
49/102
Multiple Pattern Matching: Summary
• Keyword and suffix trees are used to findpatterns in a text
• Keyword trees:
• Build keyword tree of patterns, and thread text through it
• Suffix trees:
• Build suffix tree of text, and thread patterns through it
8/17/2019 Patr an Huntre and Blast
50/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
51/102
Heuristic Similarity Searches
• Genomes are huge: Smith-Watermanquadratic alignment algorithms are too slow
• Alignment of two sequences usually has short
identical or highly similar fragments
• Many heuristic methods (i.e., FASTA) are
based on the same idea of filtration
• Find short exact matches, and use them as
seeds for potential match extension• “Filter” out positions with no extendable
matches
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
52/102
Dot Matrices
• Dot matrices showsimilarities betweentwo sequences
• FASTA makes animplicit dot matrix fromshort exact matches,and tries to find longdiagonals (allowing forsome mismatches)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
53/102
Dot Matrices (cont!d)
• Identify diagonalsabove a thresholdlength
• Diagonals in the dotmatrix indicate exact
substring matching
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
54/102
Diagonals in Dot Matrices
• Extend diagonalsand try to link themtogether, allowing
for minimalmismatches/indels
• Linking diagonalsreveals approximate
matches over longersubstrings
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
55/102
Approximate Pattern Matching Problem
• Goal: Find all approximate occurrences of a pattern in a text
• Input: A pattern p = p1… pn, text t = t 1…t m,
and k , the maximum number of mismatches
• Output: All positions 1 < i < (m – n + 1) such
that t i …t i +n-1 and p1… pn have at most k mismatches (i.e., Hamming distance betweent i …t i +n-1 and p < k )
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
56/102
Approximate Pattern Matching: A Brute-Force Algorithm
ApproximatePatternMatching(p, t, k )2 n ß length of pattern p3 m ß length of text t4 for i ß 1 to m – n + 15 dist ß 06 for j ß 1 to n 7
if t i+j-1 != p j 8 dist ß dist + 19 if dist < k 10 output i
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
57/102
Approximate Pattern Matching: Running Time
• That algorithm runs in O(nm).• Landau-Vishkin algorithm: O(kn)• We can generalize the “Approximate Pattern
Matching Problem” into a “Query MatchingProblem”:• We want to match substrings in a query to
substrings in a text with at most k mismatches• Motivation: we want to see similarities to
some gene, but we may not know which partsof the gene to look for
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
58/102
Query Matching Problem
• Goal: Find all substrings of the query thatapproximately match the text
• Input: Query q = q1…qw ,
text t = t 1…t m,
n (length of matching substrings), k (maximum number of mismatches)• Output: All pairs of positions (i , j ) such that the
n-letter substring of q starting at i
approximately matches then-letter substring of t starting at j ,
with at most k mismatches
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
59/102
Approximate Pattern Matching vs Query Matching
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
60/102
Query Matching: Main Idea
• Approximately matching strings share someperfectly matching substrings.
• Instead of searching for approximately
matching strings (difficult) search for perfectlymatching substrings (easy).
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
61/102
Filtration in Query Matching
• We want all n-matches between a query anda text with up to k mismatches
• “Filter” out positions we know do not match
between text and query• Potential match detection: find all matches
of l -tuples in query and text for some small l
• Potential match verification: Verify eachpotential match by extending it to the left andright, until (k + 1) mismatches are found
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
62/102
Filtration: Match Detection
• If x 1… x n and y 1…y n match with at most k
mismatches, they must share an l -tuple that
is perfectly matched, with l = ën/(k + 1)û
• Break string of length n into k +1 parts, eacheach of length ën/(k + 1)û
• k mismatches can affect at most k of these
k +1 parts• At least one of these k +1 parts is perfectly
matched
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
63/102
Filtration: Match Detection (cont!d)
• Suppose k = 3. We would then have l=n/(k+1)=n/4:
•
There are at most k mismatches in n, so at the veryleast there must be one out of the k +1 l –tuples
without a mismatch
1…l l +1…2l 2l +1…3l 3l +1…n
1 2 k k + 1
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
64/102
Filtration: Match Verification
• For each l -match we find, try to extend thematch further to see if it is substantial
query
Extend perfect
match oflength l
until we find anapproximate
match oflength n with k mismatchestext
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
65/102
Filtration: Example
k = 0 k = 1 k = 2 k = 3 k = 4 k = 5
l -tuple
length
n n/2 n/3 n/4 n/5 n/6
Shorter perfect matches required
Performance decreases
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
66/102
Local alignment is to slow…
• Quadratic local alignment is tooslow while looking for similaritiesbetween long strings (e.g. the entireGenBank database)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
67/102
Local alignment is to slow…
• Quadratic local alignment is tooslow while looking for similaritiesbetween long strings (e.g. the entireGenBank database)
8/17/2019 Patr an Huntre and Blast
68/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
69/102
Local alignment is to slow…
• Quadratic local alignment is tooslow while looking for similaritiesbetween long strings (e.g. the entireGenBank database)
• Guaranteed to find the optimallocal alignment
• Sets the standard for sensitivity
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
70/102
Local alignment is to slow…
• Quadratic local alignment is tooslow while looking for similaritiesbetween long strings (e.g. the entireGenBank database)
• Basic Local Alignment Search Tool• Altschul, S., Gish, W., Miller, W.,
Myers, E. & Lipman, D.J.
Journal of Mol. Biol., 1990
• Search sequence databases forlocal alignments to a query
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
71/102
BLAST
• Great improvement in speed, with a modestdecrease in sensitivity
• Minimizes search space instead of exploring entiresearch space between two sequences
• Finds short exact matches (“seeds”), only exploreslocally around these “hits”
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
72/102
What Similarity Reveals
• BLASTing a new gene
• Evolutionary relationship
• Similarity between protein function
• BLASTing a genome
• Potential genes
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
73/102
BLAST algorithm
• Keyword search of all words of length w fromthe query of length n in database of length mwith score above threshold•
w = 11 for DNA queries, w =3 for proteins• Local alignment extension for each foundkeyword• Extend result until longest match above
threshold is achieved• Running time O(nm)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
74/102
BLAST algorithm (cont!d)
Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60 +++DN +G + IR L G+K I+ L+ E+ RG++KSbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
keyword
GVK 18GAK 16GIK 16GGK 14GLK 13GNK 12GRK 11GEK 11GDK 11
neighborhoodscore threshold
(T = 13)
Neighborhoodwords
High-scoring Pair (HSP)
extension
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
75/102
Original BLAST
• Dictionary• All words of length w
• Alignment
• Ungapped extensions until score fallsbelow some statistical threshold
• Output
• All local alignments with score > threshold
8/17/2019 Patr an Huntre and Blast
76/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
77/102
Gapped BLAST : Example
• Original BLASTexact keywordsearch, THEN:
• Extend with gaps
around ends ofexact match untilscore < threshold
• Output result
GTAAGGTCCAGT
GTTAGGTC-AGT
A C G A A G T A A G G T C C A G T
C
T
G
A
T
C
C
T
G
G
A
T
T
G
C
G
A
From lectures by Serafim Batzoglou(Stanford)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
78/102
Incarnations of BLAST
• blastn: Nucleotide-nucleotide
• blastp: Protein-protein
• blastx: Translated query vs. protein database
• tblastn: Protein query vs. translated database
• tblastx: Translated query vs. translated
database (6 frames each)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
79/102
Incarnations of BLAST (cont!d)
• PSI-BLAST• Find members of a protein family or build a
custom position-specific score matrix
• Megablast:• Search longer sequences with fewer
differences
• WU-BLAST: (Wash U BLAST)
• Optimized, added features
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8/17/2019 Patr an Huntre and Blast
80/102
Assessing sequence similarity
• Need to know how strong an alignment can beexpected from chance alone
• “Chance” relates to comparison of sequences that aregenerated randomly based upon a certain sequence
model• Sequence models may take into account:
• G+C content• Poly-A tails•
“Junk” DNA• Codon bias• Etc.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
BLAST S S
8/17/2019 Patr an Huntre and Blast
81/102
BLAST: Segment Score
• BLAST uses scoring matrices (d) to improveon efficiency of match detection
• Some proteins may have very different
amino acid sequences, but are still similar • For any two l -mers x 1… x l and y 1…y l :
• Segment pair: pair of l -mers, one from each
sequence
• Segment score: Sl i=1 d( x i , y i )
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
BLAST L ll M i l S P i
8/17/2019 Patr an Huntre and Blast
82/102
BLAST: Locally Maximal Segment Pairs
• A segment pair is maximal if it has the bestscore over all segment pairs
• A segment pair is locally maximal if its score
can’t be improved by extending or shortening• Statistically significant locally maximal segment pairs are of biological interest
• BLAST finds all locally maximal segment
pairs with scores above some threshold• A significantly high threshold will filter out
some statistically insignificant matches
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
BLAST S i i
8/17/2019 Patr an Huntre and Blast
83/102
BLAST: Statistics
• Threshold: Altschul-Dembo-Karlin statistics• Identifies smallest segment score that is
unlikely to happen by chance
• # matches above q has mean E(q) = Kmne-lq; K is a constant, m and n are the lengths ofthe two compared sequences•
Parameter l is positive root of:S x ,y in A( p x py e
d(x,y)) = 1, where p x
and py are frequenceies of amino
acids x and y , and A is the twenty
8/17/2019 Patr an Huntre and Blast
84/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
S l BLAST t t
8/17/2019 Patr an Huntre and Blast
85/102
Sample BLAST output Score E
Sequences producing significant alignments: (bits) Value
gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757... 171 3e-44
gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer... 170 7e-44
gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D... 170 7e-44
gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio] 168 3e-43
ALIGNMENTS
>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]Length = 148
Score = 171 bits (434), Expect = 3e-44
Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)
Query: 1 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60
MV T E++A+ LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPKSbjct: 1 MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60
Query: 61 VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120
V AHG+ V+G + ++DN+K T+A LS +H +KLHVDP+NFRLL + + A FG
Sbjct: 61 VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120
Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147
+ F VQ A+QK +A V +AL +YHSb ct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148
• Blast of human beta globin protein against zebra fish
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
S l BLAST t t
8/17/2019 Patr an Huntre and Blast
86/102
Sample BLAST output (cont!d) Score E
Sequences producing significant alignments: (bits) Value
gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1... 289 1e-75
gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end 289 1e-75
gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge... 280 1e-72
gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin 260 1e-66
gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud... 151 7e-34
gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob... 149 3e-33
ALIGNMENTS
>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11
Length = 81706
Score = 149 bits (75), Expect = 3e-33
Identities = 183/219 (83%)
Strand = Plus / Plus
Query: 267 ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326
|| ||| | || | || | |||||| ||||| ||||||||||| ||||||||
Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468
Query: 327 ctgcactgtgacaagctgcatgtggatcctgagaacttc 365
||||||||| |||||||||| ||||| ||||||||||||
Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507
• Blast of human beta globin DNA against human DNA
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ti li
8/17/2019 Patr an Huntre and Blast
87/102
Timeline
• 1970: Needleman-Wunsch global alignmentalgorithm
• 1981: Smith-Waterman local alignment algorithm• 1985: FASTA• 1990: BLAST (basic local alignment search tool)• 2000s: BLAST has become too slow in “genome vs.
genome” comparisons - new faster algorithmsevolve!
• Pattern Hunter • BLAT
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
PatternHunter: faster and even
8/17/2019 Patr an Huntre and Blast
88/102
PatternHunter: faster and even
more sensitive• BLAST: matches short
consecutive sequences
(consecutive seed)
• Length = k
•
Example (k = 11):
11111111111
Each 1 represents a “match”
• PatternHunter: matchesshort non-consecutivesequences (spaced seed)
• Increases sensitivity bylocating homologies thatwould otherwise be missed
• Example (a spaced seed oflength 18 w/ 11 “matches”):
111010010100110111
Each 0 represents a “don’tcare”, so there can be amatch or a mismatch
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
S d d
8/17/2019 Patr an Huntre and Blast
89/102
Spaced seeds
Example of a hit using a spaced seed:
How does this result in better sensitivity?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Wh i PH b tt ?
8/17/2019 Patr an Huntre and Blast
90/102
Why is PH better?
• BLAST: redundanthits
! PatternHunter
This results in > 1 hit andcreates clusters ofredundant hits
This results in very fewredundant hits
8/17/2019 Patr an Huntre and Blast
91/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ad t of G d S d
8/17/2019 Patr an Huntre and Blast
92/102
Advantage of Gapped Seeds
11 positions
11 positions
10 positions
8/17/2019 Patr an Huntre and Blast
93/102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Use of Multiple Seeds
8/17/2019 Patr an Huntre and Blast
94/102
Use of Multiple Seeds
Basic Searching Algorithm2. Select a group of spaced seed models3. For each hit of each model, conduct extension to
find a homology.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Another method: BLAT
8/17/2019 Patr an Huntre and Blast
95/102
Another method: BLAT
• BLAT (BLAST-Like Alignment Tool)• Same idea as BLAST - locate short
sequence hits and extend
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
BLAT vs BLAST: Differences
8/17/2019 Patr an Huntre and Blast
96/102
BLAT vs. BLAST: Differences
• BLAT builds an index of the database andscans linearly through the query sequence,whereas BLAST builds an index of the querysequence and then scans linearly through thedatabase
• Index is stored in RAM which is memoryintensive, but results in faster searches
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
BLAT: Fast cDNA Alignments
8/17/2019 Patr an Huntre and Blast
97/102
BLAT: Fast cDNA Alignments
Steps:1. Break cDNA into 500 base chunks.
2. Use an index to find regions in genome similar toeach chunk of cDNA.
3. Do a detailed alignment between genomic regionsand cDNA chunk.
4. Use dynamic programming to stitch togetherdetailed alignments of chunks into detailed
alignment of whole.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
BLAT: Indexing
8/17/2019 Patr an Huntre and Blast
98/102
BLAT: Indexing
• An index is built that contains the positions ofeach k -mer in the genome
• Each k -mer in the query sequence iscompared to each k -mer in the index
• A list of ‘hits’ is generated - positions in cDNAand in genome that match for k bases
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Indexing: An Example
8/17/2019 Patr an Huntre and Blast
99/102
Indexing: An Example
Here is an example with k = 3:
Genome: cacaattatcacgaccgc3-mers (non-overlapping): cac aat tat cac gac cgcIndex: aat 3 gac 12 cac 0,9 tat 6 cgc 15
cDNA (query sequence): aattctcac3-mers (overlapping): aat att ttc tct ctc tca cac
0 1 2 3 4 5 6
Hits: aat 0,3cac 6,0cac 6,9
clump: cac AATtatCACgaccgc
Multiple instances map to
single index
Position of 3-mer in query, genome
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
However
8/17/2019 Patr an Huntre and Blast
100/102
However…
• BLAT was designed to find sequences of95% and greater similarity of length >40; maymiss more divergent or shorter sequencealignments
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
PatternHunter and BLAT vs BLAST
8/17/2019 Patr an Huntre and Blast
101/102
PatternHunter and BLAT vs. BLAST
• PatternHunter is 5-100 times faster thanBlastn, depending on data size, at the samesensitivity
• BLAT is several times faster than BLAST, butbest results are limited to closely relatedsequences
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Resources
8/17/2019 Patr an Huntre and Blast
102/102
Resources
• tandem.bu.edu/classes/ 2004/papers/pathunter_grp_prsnt.ppt• http://www.jax.org/courses/archives/2004/gsa04_king_presentation.pdf
• http://www.genomeblat.com/genomeblat/blatRapShow.pps