Pattern Matching

Rhys Price Jones

Anne R. Haake

Pattern matching algorithms - Review

• Finding all occurrences of pattern p in text t

• p has length m, t has length n

• The naïve algorithm and Rabin-Karp both have worst-case O(mn) and expected-case O(m+n) behavior

• The automaton approach preprocesses p to yield an O(n) algorithm
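
As a reminder of the baseline, here is a minimal Python sketch of the naïve algorithm (Python is used for illustration only; the function name `naive_find_all` is not from the slides). It tries every alignment of p against t, which is where the O(mn) worst case comes from.

```python
def naive_find_all(p, t):
    """Return every index i at which pattern p occurs in text t."""
    m, n = len(p), len(t)
    hits = []
    for i in range(n - m + 1):
        # Compare p with t[i:i+m] one character at a time,
        # stopping at the first mismatch.
        j = 0
        while j < m and t[i + j] == p[j]:
            j += 1
        if j == m:
            hits.append(i)
    return hits


print(naive_find_all("ACT", "ACACTACT"))  # [2, 5]
```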

Suffix Tree

• Preprocess t to yield an O(m) search algorithm

• Useful if t is fixed and there are lots of p’s that you want to search for

Tries

• Are often used for retrieval of keywords to provide efficient indexing.

• Suppose you want to index:

– PATTERN

– MONKEY

– PATAPAN

– PROBOSCIS

– PATHETIC

Build a TRIE

In the diagrams below, each indented line is an edge label hanging below its parent.

• For PATTERN

    PATTERN

• For PATTERN MONKEY

    PATTERN
    MONKEY

• For PATTERN MONKEY PATAPAN

    PAT
        APAN
        TERN
    MONKEY

• For PATTERN MONKEY PATAPAN PROBOSCIS

    P
        AT
            APAN
            TERN
        ROBOSCIS
    MONKEY

• For PATTERN MONKEY PATAPAN PROBOSCIS PATHETIC

    P
        AT
            APAN
            HETIC
            TERN
        ROBOSCIS
    MONKEY

Each keyword can be located in O(k) steps where k is the length of the keyword
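
The O(k) lookup can be illustrated with a minimal Python sketch. It is an assumption-laden simplification: the class `Trie` and its methods are illustrative names, and each edge carries a single character rather than the compressed multi-character labels drawn on the slides. A lookup still follows one edge per character of the keyword, so it takes k steps.

```python
class Trie:
    """Uncompressed trie: one character per edge."""

    def __init__(self):
        self.children = {}    # maps a character to the child Trie node
        self.is_word = False  # True if some keyword ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        # Follow one edge per character: len(word) steps at most.
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


index = Trie()
for keyword in ["PATTERN", "MONKEY", "PATAPAN", "PROBOSCIS", "PATHETIC"]:
    index.insert(keyword)

print(index.contains("PATHETIC"))  # True
print(index.contains("PAT"))       # False: PAT is only a shared prefix
```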

Applications of a Trie

• Dictionary

– Just need to check if you get to a leaf to know the word exists

– Or store a link to the word’s definition at the leaf

• Index for a book

– Store a list of all pages where the keyword appears at the leaf

• Finding reserved words or filtering unwanted words …

Suffix Tree

• For a text t

• Is a Trie for the set of suffixes of t

• For example, the suffixes of BIOINFORMATICS: BIOINFORMATICS, IOINFORMATICS, OINFORMATICS, INFORMATICS, …, ICS, CS, S

• Build it on the board
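
Listing the suffixes is the first step, and a one-line Python sketch makes it concrete. The name `suffixes_of` mirrors the Scheme helper used later in these slides; the Python version itself is only an illustration.

```python
def suffixes_of(t):
    """Return all non-empty suffixes of t, longest first."""
    return [t[i:] for i in range(len(t))]


print(suffixes_of("ACACTACT"))
# ['ACACTACT', 'CACTACT', 'ACTACT', 'CTACT', 'TACT', 'ACT', 'CT', 'T']
```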

Suffix Tree For ACACTACT

Suffixes: ACACTACT CACTACT ACTACT CTACT TACT ACT CT T

In the diagrams below, each indented line is an edge label hanging below its parent, and [i] marks the leaf for the suffix that starts at position i.

• After inserting ACACTACT

    ACACTACT [0]

• After inserting CACTACT

    ACACTACT [0]
    CACTACT [1]

• After inserting ACTACT

    AC
        ACTACT [0]
        TACT [2]
    CACTACT [1]

• After inserting CTACT

    AC
        ACTACT [0]
        TACT [2]
    C
        ACTACT [1]
        TACT [3]

• After inserting TACT

    AC
        ACTACT [0]
        TACT [2]
    C
        ACTACT [1]
        TACT [3]
    TACT [4]

What’s the Problem

• The tree built so far already contains the suffix ACTACT (leaf 2); the next suffix to insert is ACT

• This suffix is a prefix of another suffix, so it ends in the middle of an existing path and gets no leaf of its own

What’s the Fix?

• Add a new symbol to the end of the string

• A symbol $ that does not appear elsewhere
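
A small Python check shows why the extra symbol helps. This is a sketch, and the function name `suffixes_that_are_prefixes` is illustrative, not from the slides. Without $, several suffixes of ACACTACT are prefixes of longer suffixes; once $ is appended, none are, because $ would have to occur in the middle of the longer suffix.

```python
def suffixes_that_are_prefixes(t):
    """Return pairs (i, j) such that the suffix starting at i is a
    proper prefix of the suffix starting at j."""
    n = len(t)
    return [(i, j)
            for i in range(n)
            for j in range(n)
            if i != j and t[j:].startswith(t[i:])]


print(suffixes_that_are_prefixes("ACACTACT"))   # [(5, 2), (6, 3), (7, 4)]
print(suffixes_that_are_prefixes("ACACTACT$"))  # []
```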

Suffix Tree For ACACTACT$

Suffixes: ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $

In the diagrams below, each indented line is an edge label hanging below its parent, and [i] marks the leaf for the suffix that starts at position i.

• After inserting ACACTACT$

    ACACTACT$ [0]

• After inserting CACTACT$

    ACACTACT$ [0]
    CACTACT$ [1]

• After inserting ACTACT$

    AC
        ACTACT$ [0]
        TACT$ [2]
    CACTACT$ [1]

• After inserting CTACT$

    AC
        ACTACT$ [0]
        TACT$ [2]
    C
        ACTACT$ [1]
        TACT$ [3]

• After inserting TACT$

    AC
        ACTACT$ [0]
        TACT$ [2]
    C
        ACTACT$ [1]
        TACT$ [3]
    TACT$ [4]

• After inserting ACT$ (suffix is prefix problem went away)

    AC
        ACTACT$ [0]
        T
            ACT$ [2]
            $ [5]
    C
        ACTACT$ [1]
        TACT$ [3]
    TACT$ [4]

• After inserting CT$

    AC
        ACTACT$ [0]
        T
            ACT$ [2]
            $ [5]
    C
        ACTACT$ [1]
        T
            ACT$ [3]
            $ [6]
    TACT$ [4]

• After inserting T$

    AC
        ACTACT$ [0]
        T
            ACT$ [2]
            $ [5]
    C
        ACTACT$ [1]
        T
            ACT$ [3]
            $ [6]
    T
        ACT$ [4]
        $ [7]

• After inserting $

    AC
        ACTACT$ [0]
        T
            ACT$ [2]
            $ [5]
    C
        ACTACT$ [1]
        T
            ACT$ [3]
            $ [6]
    T
        ACT$ [4]
        $ [7]
    $ [8]

Use for Testing Substrings

• ACT: follow links, get 2, 5

• CAC: follow links, get 1

    AC
        ACTACT$ [0]
        T
            ACT$ [2]
            $ [5]
    C
        ACTACT$ [1]
        T
            ACT$ [3]
            $ [6]
    T
        ACT$ [4]
        $ [7]
    $ [8]
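
The lookup can be sketched in Python as follows. This is a simplified illustration, not the construction from the slides: it builds an uncompressed suffix trie as nested dictionaries with one character per edge, and the names `build_suffix_trie` and `occurrences` are made up for this example. The idea is the same: walk down the edges spelling out p, then collect the leaf numbers below the point you reach.

```python
def build_suffix_trie(t):
    """Uncompressed suffix trie of t + '$' as nested dicts.
    The key 'leaf' stores the start position of the suffix ending there;
    it cannot clash with an edge because edges are single characters."""
    text = t + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start
    return root


def occurrences(root, p):
    """Return the sorted start positions of p in the indexed text."""
    node = root
    for ch in p:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below this node is a suffix that starts with p.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


tree = build_suffix_trie("ACACTACT")
print(occurrences(tree, "ACT"))  # [2, 5]
print(occurrences(tree, "CAC"))  # [1]
```

The uncompressed form uses more space than the tree drawn on the slides, which merges chains of single-child nodes into one labelled edge.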

Reprise

• To review the procedure, let’s build a suffix tree for MISSISSIPPI

• On the board

• Don’t forget the $

Code

    (define suffix-tree
      ; input string, output suffix tree
      (lambda (t)
        (trie (suffixes-of t) (string-length t))))

    (define suffixes-of
      ; input string, output list of all its suffixes
      (lambda (t)
        (cond ((zero? (string-length t)) '())
              (else (cons t (suffixes-of (substring t 1 (string-length t))))))))

    (define trie
      ; input list of strings and n
      ; output suffix-tree style trie with
      ; n-(length of keyword) at the leaves
      (lambda (l n) ; list of keywords to put in a trie
        (tries (sort (lambda (x y) (string<=? x y)) l) n)))

More code

    (define tries
      ; builds a trie from sorted list l. Leaves as above
      ; samestarts, commonprefix, chop, nthcdr, singleton? and the make-*
      ; constructors are helpers that are not shown on these slides
      (lambda (l n)
        (cond
          ((null? l) (make-empty-trie))
          ((singleton? (samestarts l))
           (make-internal-node
             (make-edge (car l)
                        (make-leaf (- n (string-length (car l)))))
             (tries (cdr l) n)))
          (else
           (let ((childstrings (samestarts l)))
             (let ((label (commonprefix childstrings)))
               (let ((childnode
                      (trie (map (chop (string-length label)) childstrings)
                            (- n (string-length label)))))
                 (make-internal-node
                   (make-edge label childnode)
                   (tries (nthcdr l (length childstrings)) n)))))))))
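
Because the helper procedures are not shown, here is a self-contained Python sketch of the same idea, offered as one possible reading of the Scheme rather than a faithful port. It assumes the helpers behave as their names suggest: sort the keywords, group the ones that start with the same character, and pull out each group's common prefix as the edge label. A node is a dict from edge label to child, and a leaf is the integer n minus the keyword length, as in the comments above. `os.path.commonprefix` is used only for its character-by-character prefix behaviour.

```python
import os


def trie(keywords, n):
    """Compressed trie: dict {edge_label: child}; a leaf is n - len(keyword)."""
    # set() drops duplicate keywords, which would otherwise never separate.
    return tries(sorted(set(keywords)), n)


def tries(words, n):
    node = {}
    i = 0
    while i < len(words):
        # Group the run of words sharing a first character
        # (this plays the role of samestarts in the Scheme).
        j = i
        while j < len(words) and words[j][:1] == words[i][:1]:
            j += 1
        group = words[i:j]
        if len(group) == 1:
            node[group[0]] = n - len(group[0])
        else:
            label = os.path.commonprefix(group)
            rest = [w[len(label):] for w in group]  # still sorted
            node[label] = tries(rest, n - len(label))
        i = j
    return node


def suffix_tree(t):
    """The caller supplies the terminating '$', as on the slides."""
    return trie([t[i:] for i in range(len(t))], len(t))


print(suffix_tree("ACACTACT$"))
```

For ACACTACT$ this yields the root edges $, AC, C and T with leaves 0 to 8, the same shape as the tree built step by step earlier.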

Analysis

• Building suffix tree: O(n²)

• Searching for p: O(m+k)

– Where p appears k times
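
One way to see where a quadratic bound comes from, assuming the tree is built by inserting the suffixes one at a time and reading each suffix character by character: the total number of characters across all suffixes of a text of length n is

```latex
\sum_{i=1}^{n} i \;=\; \frac{n(n+1)}{2} \;=\; O(n^2).
```

For ACACTACT$ (n = 9) that is 45 characters in total.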

Improvement possible

• A suffix tree for a text of length n can be built in time O(n)

• Thereafter all searches are O(m)

Applications in Biology

• Suffix Trees in Computational Biology

• Link doesn’t work
