Pattern Matching + Hashing

7/31/2019 Pattern Matching + Hashing

1/29

Bioinformatics ReportJohn Erol Evangelista


2/29

Pattern Matching

Problem: Given a string ofn characters

called the textand a string ofm

characters (m n) called thepattern,

find a substring of the text that matches

the pattern.


3/29

Brute Force


4/29

Brute Force


5/29

Analysis

Worst case: O(nm)

Average case O(n+m) = O(n)


6/29

Combinatorial Pattern Matching

Finds the exact or approximateoccurrences of a given pattern in a long

text.

Organize efficient Data Structures


7/29

Suffix Trees

A suffix tree for text t = t1t2t3...tn is alabeled tree with n leaves (numbered 1

to n) satisfying the followingconditions:

Each edge is labeled with a substring of the text.

Each internal vertex (except for possibly the root) has at least two children

Any two edges out of the same vertex start with a different letter.

Every suffix of text t is spelled out in a path from the root to the leaf.

No suffix is a prefix of another suffix.


8/29

Keyword Tree vs. Suffix tree

root

A

T

4

G C

A

T

1

G

6

G T

5

G C

A

T

2

G

C

A

T

3

G

(a) Keyword tree

root

AT

4

G

1

CATG 6

G T

5

G

2

CATG 3

CATG

(b) Suffix tree


9/29

Building suffix trees

Can take up to quadratic time.

Weiner: O(n) running time.


10/29

Pattern Matching on Suffix Trees

Problem: Given a pattern p, find the

exactoccurrence of it in a long text t.


11/29

Why suffix trees?

Allow one to preprocess a text in such a

way that given any pattern of length n,

one can answer whether or not it occurs

in the text using only O(n) time,

regardless of how long the text is.


12/29

SuffixTreePatternMatching

SUFFIXTRE EPATTERNMATCHING(p, t)

1 Build the suffix tree for text t

2 Thread pattern p through the suffix tree.

3 if threading is complete

4 output positions of every p-matching leaf in the tree

5 else

6 output pattern does not appear anywhere in the text


13/29

Threading

root

A

7

CATGG T

G

1

CATACATG

9

G

5

ACATGG

T

6

ACATGG G

10

G

2

CATACATGG

G

3

CATACATGG

11

G

CAT

8

GG

4

ACATGG

Figure 9.6 Threading the pattern ATG through the suffix tree for the text ATGCATA-CATGG. The suffixes ATGCATACATGG and ATGG both match, as noted by the grayvertices in the tree (the p-matching leaves). Each p-matching leaf corresponds to aposition in the text where p occurs.


14/29

Analysis

Threading takes O(m) time, where m is

the length of the pattern.

Combining with the construction of the

suffix trees, total running time is:

O(n+m)


15/29

Hashing

based on the idea of distributing keys

among a one dimensional arrayH[0,...,m-1] called a hash table.

Hash Functions: assigns an integer

from 1 to m-1 to a key. This integer iscalled the hash address.


16/29

Hash Functions

Must satisfy two requirements:

A hash function needs to distributekeys among the cells of a hash table

as evenly as possible.

A hash function must be easy tocompute.


17/29

Collisions

A phenomenon of two (or more) keys

being hashed in the same cell of a hashtable.

Worst case: All keys are hashed on the

same hash address


18/29

Open Hashing

Keys are stored in linked lists attached

to cells of a hash table.


19/29

Open Hashing


20/29

Open Hashing


21/29

Analysis

Efficiency depends on the lengths of the linked

lists.

If a hash function distributes n keys among mcells of the hash tables about evenly, each list

will be about n/m keys long. This ratio is the

load factorof the hash table.

Average number of pointers inspected in a

successful search S and unsuccessful search U

are:


22/29

Insertion and Deletion

Insertion: Normally done at the end ofthe list.

Deletion: Searching for key to be

deleted and deleting it


23/29

Closed Hashing

All keys are stored in the hash tableitself without the use of linked lists.

Simplest is linear probing.


24/29

Linear Probing

If a collision occurs, it checks the cell

next to the cell were the collisionoccurs. If it is empty, the new key is

placed there, otherwise it checks for the

next cell and so on.

Circular array design.


25/29

Linear Probing


26/29

Analysis

Insertion and Search are pretty much

straightforward.

Deletion can be tricky, but a lazy deletioncan make the problem easier.

Given the load factor, the number of

times the algorithm must access thetable for successful and unsuccessful

searches is, respectively:


27/29

Clustering

A clusterin linear probing is a sequence

of contiguously occupied cells.


28/29

Double Hashing

We use another hash function s(K) todetermine a fixed increment for the

probing sequence to be used after a

collision at location l = h(K):


29/29

References

Jones, Neil C. and Pevzner, Pavel A. An

Introduction to BioinformaticsAlgorithms.

Levitin, Anany. The Design and

Analysis of Algorithms. 2nd ed.

Documents

Pattern Matching + Hashing