Pattern Matching + Hashing

Embed Size (px)

Citation preview

  • 7/31/2019 Pattern Matching + Hashing

    1/29

    Bioinformatics ReportJohn Erol Evangelista

  • 7/31/2019 Pattern Matching + Hashing

    2/29

    Pattern Matching

    Problem: Given a string ofn characters

    called the textand a string ofm

    characters (m n) called thepattern,

    find a substring of the text that matches

    the pattern.

  • 7/31/2019 Pattern Matching + Hashing

    3/29

    Brute Force

  • 7/31/2019 Pattern Matching + Hashing

    4/29

    Brute Force

  • 7/31/2019 Pattern Matching + Hashing

    5/29

    Analysis

    Worst case: O(nm)

    Average case O(n+m) = O(n)

  • 7/31/2019 Pattern Matching + Hashing

    6/29

    Combinatorial Pattern Matching

    Finds the exact or approximateoccurrences of a given pattern in a long

    text.

    Organize efficient Data Structures

  • 7/31/2019 Pattern Matching + Hashing

    7/29

    Suffix Trees

    A suffix tree for text t = t1t2t3...tn is alabeled tree with n leaves (numbered 1

    to n) satisfying the followingconditions:

    Each edge is labeled with a substring of the text.

    Each internal vertex (except for possibly the root) has at least two children

    Any two edges out of the same vertex start with a different letter.

    Every suffix of text t is spelled out in a path from the root to the leaf.

    No suffix is a prefix of another suffix.

  • 7/31/2019 Pattern Matching + Hashing

    8/29

    Keyword Tree vs. Suffix tree

    root

    A

    T

    4

    G C

    A

    T

    1

    G

    6

    G T

    5

    G C

    A

    T

    2

    G

    C

    A

    T

    3

    G

    (a) Keyword tree

    root

    AT

    4

    G

    1

    CATG 6

    G T

    5

    G

    2

    CATG 3

    CATG

    (b) Suffix tree

  • 7/31/2019 Pattern Matching + Hashing

    9/29

    Building suffix trees

    Can take up to quadratic time.

    Weiner: O(n) running time.

  • 7/31/2019 Pattern Matching + Hashing

    10/29

    Pattern Matching on Suffix Trees

    Problem: Given a pattern p, find the

    exactoccurrence of it in a long text t.

  • 7/31/2019 Pattern Matching + Hashing

    11/29

    Why suffix trees?

    Allow one to preprocess a text in such a

    way that given any pattern of length n,

    one can answer whether or not it occurs

    in the text using only O(n) time,

    regardless of how long the text is.

  • 7/31/2019 Pattern Matching + Hashing

    12/29

    SuffixTreePatternMatching

    SUFFIXTRE EPATTERNMATCHING(p, t)

    1 Build the suffix tree for text t

    2 Thread pattern p through the suffix tree.

    3 if threading is complete

    4 output positions of every p-matching leaf in the tree

    5 else

    6 output pattern does not appear anywhere in the text

  • 7/31/2019 Pattern Matching + Hashing

    13/29

    Threading

    root

    A

    7

    CATGG T

    G

    1

    CATACATG

    9

    G

    5

    ACATGG

    T

    6

    ACATGG G

    10

    G

    2

    CATACATGG

    G

    3

    CATACATGG

    11

    G

    CAT

    8

    GG

    4

    ACATGG

    Figure 9.6 Threading the pattern ATG through the suffix tree for the text ATGCATA-CATGG. The suffixes ATGCATACATGG and ATGG both match, as noted by the grayvertices in the tree (the p-matching leaves). Each p-matching leaf corresponds to aposition in the text where p occurs.

  • 7/31/2019 Pattern Matching + Hashing

    14/29

    Analysis

    Threading takes O(m) time, where m is

    the length of the pattern.

    Combining with the construction of the

    suffix trees, total running time is:

    O(n+m)

  • 7/31/2019 Pattern Matching + Hashing

    15/29

    Hashing

    based on the idea of distributing keys

    among a one dimensional arrayH[0,...,m-1] called a hash table.

    Hash Functions: assigns an integer

    from 1 to m-1 to a key. This integer iscalled the hash address.

  • 7/31/2019 Pattern Matching + Hashing

    16/29

    Hash Functions

    Must satisfy two requirements:

    A hash function needs to distributekeys among the cells of a hash table

    as evenly as possible.

    A hash function must be easy tocompute.

  • 7/31/2019 Pattern Matching + Hashing

    17/29

    Collisions

    A phenomenon of two (or more) keys

    being hashed in the same cell of a hashtable.

    Worst case: All keys are hashed on the

    same hash address

  • 7/31/2019 Pattern Matching + Hashing

    18/29

    Open Hashing

    Keys are stored in linked lists attached

    to cells of a hash table.

  • 7/31/2019 Pattern Matching + Hashing

    19/29

    Open Hashing

  • 7/31/2019 Pattern Matching + Hashing

    20/29

    Open Hashing

  • 7/31/2019 Pattern Matching + Hashing

    21/29

    Analysis

    Efficiency depends on the lengths of the linked

    lists.

    If a hash function distributes n keys among mcells of the hash tables about evenly, each list

    will be about n/m keys long. This ratio is the

    load factorof the hash table.

    Average number of pointers inspected in a

    successful search S and unsuccessful search U

    are:

  • 7/31/2019 Pattern Matching + Hashing

    22/29

    Insertion and Deletion

    Insertion: Normally done at the end ofthe list.

    Deletion: Searching for key to be

    deleted and deleting it

  • 7/31/2019 Pattern Matching + Hashing

    23/29

    Closed Hashing

    All keys are stored in the hash tableitself without the use of linked lists.

    Simplest is linear probing.

  • 7/31/2019 Pattern Matching + Hashing

    24/29

    Linear Probing

    If a collision occurs, it checks the cell

    next to the cell were the collisionoccurs. If it is empty, the new key is

    placed there, otherwise it checks for the

    next cell and so on.

    Circular array design.

  • 7/31/2019 Pattern Matching + Hashing

    25/29

    Linear Probing

  • 7/31/2019 Pattern Matching + Hashing

    26/29

    Analysis

    Insertion and Search are pretty much

    straightforward.

    Deletion can be tricky, but a lazy deletioncan make the problem easier.

    Given the load factor, the number of

    times the algorithm must access thetable for successful and unsuccessful

    searches is, respectively:

  • 7/31/2019 Pattern Matching + Hashing

    27/29

    Clustering

    A clusterin linear probing is a sequence

    of contiguously occupied cells.

  • 7/31/2019 Pattern Matching + Hashing

    28/29

    Double Hashing

    We use another hash function s(K) todetermine a fixed increment for the

    probing sequence to be used after a

    collision at location l = h(K):

  • 7/31/2019 Pattern Matching + Hashing

    29/29

    References

    Jones, Neil C. and Pevzner, Pavel A. An

    Introduction to BioinformaticsAlgorithms.

    Levitin, Anany. The Design and

    Analysis of Algorithms. 2nd ed.