7/31/2019 Pattern Matching + Hashing
Bioinformatics Report
John Erol Evangelista
Pattern Matching
Problem: Given a string of n characters
called the text and a string of m
characters (m ≤ n) called the pattern,
find a substring of the text that matches
the pattern.
Brute Force
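A minimal sketch of the brute-force approach, assuming the usual formulation: align the pattern at every possible position of the text and compare character by character. The function name brute_force_match and the 0-based return value are illustrative choices, not taken from the slides.

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the 0-based index of the first match, or -1 if there is none."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):              # every possible alignment of the pattern
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                          # all m characters matched
            return i
    return -1


print(brute_force_match("ATGCATACATGG", "CATG"))
```

Each alignment costs at most m comparisons and there are n - m + 1 alignments, which is where the O(nm) worst case comes from.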
Analysis
Worst case: O(nm) character comparisons
Average case (random text): O(n+m) = O(n), since m ≤ n
Combinatorial Pattern Matching
Finds the exact or approximate occurrences of a given pattern in a long
text.
Relies on organizing the text into efficient data structures.
Suffix Trees
A suffix tree for text t = t1t2t3...tn is a labeled tree with n leaves (numbered 1
to n) satisfying the following conditions:
Each edge is labeled with a substring of the text.
Each internal vertex (except possibly the root) has at least two children.
Any two edges out of the same vertex start with a different letter.
Every suffix of text t is spelled out on a path from the root to a leaf.
No suffix is a prefix of another suffix (a terminator character such as $ is commonly appended to t to guarantee this).
Keyword Tree vs. Suffix tree
[Figure: (a) Keyword tree and (b) Suffix tree, each with leaves numbered 1 to 6.]
Building suffix trees
Naive construction can take up to quadratic time.
Weiner's algorithm builds the suffix tree in O(n) time.
Pattern Matching on Suffix Trees
Problem: Given a pattern p, find the
exact occurrences of it in a long text t.
Why suffix trees?
Allow one to preprocess a text in such a
way that given any pattern of length m,
one can answer whether or not it occurs
in the text using only O(m) time,
regardless of how long the text is.
SuffixTreePatternMatching
SUFFIXTREEPATTERNMATCHING(p, t)
  Build the suffix tree for text t
  Thread pattern p through the suffix tree
  if threading is complete
    output positions of every p-matching leaf in the tree
  else
    output "pattern does not appear anywhere in the text"
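The procedure above can be sketched in Python. This is only an illustration under simplifying assumptions: it builds an uncompressed suffix trie (one character per edge) by inserting every suffix, which takes quadratic time and space, rather than a true compressed suffix tree built with a linear-time algorithm. The names build_suffix_trie, leaves_below and the "leaf" dictionary key are hypothetical.

```python
def build_suffix_trie(text: str) -> dict:
    """Naive O(n^2) construction: insert every suffix of text + '$'."""
    text += "$"                          # terminator: no suffix is a prefix of another
    root: dict = {}
    for start in range(len(text) - 1):   # skip the suffix consisting of '$' alone
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1         # 1-based starting position of this suffix
    return root


def leaves_below(node: dict) -> list:
    """Collect the positions stored at all leaves below node."""
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def suffix_tree_pattern_matching(pattern: str, text: str) -> list:
    """Return the 1-based positions where pattern occurs in text."""
    node = build_suffix_trie(text)
    for ch in pattern:                   # thread the pattern from the root
        if ch not in node:
            return []                    # threading failed: no occurrence
        node = node[ch]
    return leaves_below(node)            # the p-matching leaves


print(suffix_tree_pattern_matching("ATG", "ATGCATACATGG"))
```

Threading itself touches one node per pattern character, so it takes O(m) steps regardless of the text length; only the naive construction here is slower than the linear-time bound.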
Threading
Figure 9.6 Threading the pattern ATG through the suffix tree for the text ATGCATACATGG. The suffixes ATGCATACATGG and ATGG both match, as noted by the gray vertices in the tree (the p-matching leaves). Each p-matching leaf corresponds to a position in the text where p occurs.
Analysis
Threading takes O(m) time, where m is
the length of the pattern.
Combined with the O(n) construction of the
suffix tree, the total running time is:
O(n+m)
Hashing
Based on the idea of distributing keys
among a one-dimensional array H[0,...,m-1] called a hash table.
Hash function: assigns an integer
from 0 to m-1 to a key. This integer is called the hash address.
Hash Functions
Must satisfy two requirements:
A hash function needs to distribute keys among the cells of a hash table
as evenly as possible.
A hash function must be easy to compute.
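As an illustration of these two requirements, here is a sketch of one common way to hash a string key: fold its character codes together with a constant multiplier and reduce modulo the table size m. The function name hash_address and the multiplier c = 31 are assumptions made for this example, not something the slides specify.

```python
def hash_address(key: str, m: int, c: int = 31) -> int:
    """Map a string key to a hash address in the range 0..m-1."""
    h = 0
    for ch in key:
        # Mixing in each character with a multiplier makes the result depend
        # on character order, so anagrams usually get different addresses.
        h = (h * c + ord(ch)) % m
    return h


for key in ("ATG", "GTA", "CAT"):
    print(key, hash_address(key, 13))
```

The loop does a constant amount of work per character, so the function is cheap to compute. Choosing m to be prime is a common way to help spread keys more evenly.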
Collisions
A collision occurs when two (or more) keys
are hashed into the same cell of a hash table.
Worst case: all keys are hashed to the
same hash address.
Open Hashing
Keys are stored in linked lists attached
to cells of a hash table.
Analysis
Efficiency depends on the lengths of the linked
lists.
If a hash function distributes n keys among m cells of the hash table about evenly, each list
will be about n/m keys long. This ratio, α = n/m, is the
load factor of the hash table.
The average numbers of pointers inspected in a
successful search S and an unsuccessful search U
are:
S ≈ 1 + α/2 and U = α
Insertion and Deletion
Insertion: Normally done at the end of the list.
Deletion: Search for the key to be
deleted, then remove it from its list.
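Open hashing with the insertion and deletion rules above can be sketched as follows. This is an illustrative sketch only: Python lists stand in for the linked lists, and the class name ChainedHashTable and the simple character-sum hash function are assumptions made for this example.

```python
class ChainedHashTable:
    """Open hashing (separate chaining): each cell holds a list of keys."""

    def __init__(self, m: int = 13):
        self.m = m
        self.cells = [[] for _ in range(m)]   # one chain per cell
        self.n = 0                            # number of stored keys

    def _address(self, key: str) -> int:
        # Hypothetical hash function: sum of character codes modulo m.
        return sum(ord(ch) for ch in key) % self.m

    def search(self, key: str) -> bool:
        return key in self.cells[self._address(key)]

    def insert(self, key: str) -> None:
        chain = self.cells[self._address(key)]
        if key not in chain:
            chain.append(key)                 # insert at the end of the list
            self.n += 1

    def delete(self, key: str) -> bool:
        chain = self.cells[self._address(key)]
        if key in chain:                      # search for the key, then remove it
            chain.remove(key)
            self.n -= 1
            return True
        return False

    def load_factor(self) -> float:
        return self.n / self.m                # n keys spread over m cells


table = ChainedHashTable(m=5)
for key in ("ATG", "GTA", "CAT"):
    table.insert(key)
print(table.cells, table.load_factor())
```

With this hash function the anagrams ATG and GTA collide and end up in the same chain, which shows how collisions are absorbed by the lists instead of displacing keys.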
Closed Hashing
All keys are stored in the hash table itself without the use of linked lists.
The simplest collision resolution strategy is linear probing.
Linear Probing
If a collision occurs, the algorithm checks the cell
next to the cell where the collision occurred. If it is empty, the new key is
placed there; otherwise it checks the
next cell, and so on.
The table is treated as a circular array: after the last cell, probing wraps around to the first.
Analysis
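Insertion and search are
straightforward.
Deletion is trickier: emptying a cell can cut off keys stored after it in the same probe sequence. Lazy deletion (marking the cell as deleted instead of emptying it) makes the problem easier.
Given the load factor α, the number of
times the algorithm must access the table for successful and unsuccessful
searches is, respectively:
S ≈ (1/2)(1 + 1/(1 − α)) and U ≈ (1/2)(1 + 1/(1 − α)^2)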
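Linear probing with lazy deletion can be sketched as below. This is a minimal illustration, not a reference implementation: the class name LinearProbingTable, the EMPTY and DELETED sentinel objects, and the character-sum hash function are all assumptions made for this example.

```python
EMPTY, DELETED = object(), object()       # sentinels: never-used cell / lazily deleted cell


class LinearProbingTable:
    """Closed hashing: all keys live in the table itself."""

    def __init__(self, m: int = 11):
        self.m = m
        self.cells = [EMPTY] * m

    def _address(self, key: str) -> int:
        # Hypothetical hash function: sum of character codes modulo m.
        return sum(ord(ch) for ch in key) % self.m

    def _probe(self, key: str):
        start = self._address(key)
        for step in range(self.m):
            yield (start + step) % self.m  # circular array: wrap past the last cell

    def search(self, key: str) -> int:
        """Return the cell index holding key, or -1 if it is absent."""
        for i in self._probe(key):
            if self.cells[i] is EMPTY:     # a never-used cell ends the search
                return -1
            if self.cells[i] == key:       # DELETED markers are skipped over
                return i
        return -1

    def insert(self, key: str) -> int:
        found = self.search(key)
        if found != -1:
            return found                   # key already present
        for i in self._probe(key):
            if self.cells[i] is EMPTY or self.cells[i] is DELETED:
                self.cells[i] = key        # first free cell, reusing deleted ones
                return i
        raise OverflowError("hash table is full")

    def delete(self, key: str) -> bool:
        i = self.search(key)
        if i == -1:
            return False
        self.cells[i] = DELETED            # lazy deletion: mark, do not empty
        return True


table = LinearProbingTable(m=5)
print([table.insert(k) for k in ("ATG", "GTA", "TAG")])
```

In the usage example all three keys hash to the same address, so they occupy consecutive cells. Deleting the middle key leaves a marker, which keeps the third key reachable by a later search.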
Clustering
A cluster in linear probing is a sequence
of contiguously occupied cells.
Double Hashing
We use another hash function s(K) to determine a fixed increment for the
probing sequence to be used after a
collision at location l = h(K):
(l + s(K)) mod m, (l + 2s(K)) mod m, ...
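The probing sequence of double hashing can be sketched for integer keys as follows. The two functions chosen here, h(K) = K mod m and s(K) = m - 2 - K mod (m - 2), are one common textbook choice for small tables and are an assumption of this example, as are the names probe_sequence and insert. The table size m is assumed to be prime so that every increment visits all cells.

```python
def h(key: int, m: int) -> int:
    """Primary hash function: the initial location l."""
    return key % m


def s(key: int, m: int) -> int:
    """Secondary hash function: a fixed increment in the range 1..m-2, never 0."""
    return m - 2 - key % (m - 2)


def probe_sequence(key: int, m: int) -> list:
    """Cells examined for key: l, (l + s) mod m, (l + 2s) mod m, ..."""
    l, step = h(key, m), s(key, m)
    return [(l + i * step) % m for i in range(m)]


def insert(table: list, key: int) -> int:
    """Place key in the first free cell of its probe sequence."""
    for i in probe_sequence(key, len(table)):
        if table[i] is None:
            table[i] = key
            return i
    raise OverflowError("hash table is full")


table = [None] * 11                  # m = 11 is prime
print(insert(table, 58), insert(table, 14))
print(probe_sequence(58, 11))
```

Both keys in the usage example start at the same location, but they use different increments, so their probe sequences diverge after the collision. This is how double hashing reduces the clustering seen with linear probing.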
References
Jones, Neil C. and Pevzner, Pavel A. An Introduction to Bioinformatics Algorithms.
Levitin, Anany. Introduction to the Design and Analysis of Algorithms. 2nd ed.