Design & Analysis of Algorithms COMP 482 / ELEC 420
John Greiner
2
String Matching
Pattern: CCATT
Text: ACTGCCATTCCTTAGGGCCATGTG
• Brute force & variant
• Automata-like strategies
• Suffix trees
To do:
[CLRS] 32
Supplements
#5
3
Some Applications
• Text & programming languages
– Spell-checking
– Linguistic analysis
– Tokenization
– Virus scanning
– Spam filtering
– Database querying
• DNA sequence analysis
• Music identification & analysis
4
Brute Force Exact Match
Pattern: CCATT
Text: ACTGCCATTCCTTAGGGCCATGTG
Algorithm?
Example of worst case?
Running time?
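One possible answer to the "Algorithm?" question, as a minimal sketch: try every alignment of the pattern against the text and compare character by character. The function name brute_force_find is illustrative, not from the slides.

```python
def brute_force_find(pattern: str, text: str) -> list[int]:
    """Return every 0-based index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    matches = []
    for i in range(n - m + 1):
        # Compare the pattern against text[i:i+m], stopping at the first mismatch.
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i)
    return matches


print(brute_force_find("CCATT", "ACTGCCATTCCTTAGGGCCATGTG"))  # prints [4]
```

A worst case arises when almost all of the pattern matches at every alignment, for example pattern AAAB against a text of all A's, giving about m comparisons at each of about n positions.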
5
Rabin-Karp, 1981
Pattern: CCATT
Text: ACTGCCATTCCTTAGGGCCATGTG
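Hash pattern & length-m substrings of the text (m = pattern length)
– Compare hashes.
– Compare strings only when hashes match.
– Hash of each substring easily computed from the hash of the previous substring, the character leaving the window, and the character entering it.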
Best and worst cases?
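The rolling-hash idea can be sketched as follows. The polynomial hash, the base 256, and the modulus 1_000_003 are assumptions chosen for illustration; the slides do not specify a hash function.

```python
def rabin_karp_find(pattern: str, text: str,
                    base: int = 256, mod: int = 1_000_003) -> list[int]:
    """Return every 0-based index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        t_hash = (t_hash * base + ord(text[k])) % mod
    matches = []
    for i in range(n - m + 1):
        # Compare the strings themselves only when the hashes agree.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches


print(rabin_karp_find("CCATT", "ACTGCCATTCCTTAGGGCCATGTG"))  # prints [4]
```

With few hash collisions the string comparison runs rarely and the scan is close to linear; if many windows collide with the pattern's hash, every window is compared in full and the cost approaches that of brute force.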
6
Intuition for Better Algorithms
Pattern: CCATT
Text: CCAGCCATTCCTTAGGGCCATGTG
After failing match at 1st position, what should we do?
7
Quick Overview of Two Algorithms
Knuth-Morris-Pratt, 1977
– Use previous intuition.
– Preprocessing builds shift table based upon pattern prefixes.
– Each text character compared once or twice.
Boyer-Moore, 1977 & Turbo Boyer-Moore, 1992
– Use previous intuition, along with similar heuristics. Complicated.
– Match from right end of pattern.
– Can often skip some text characters.
– Linear worst case for Turbo Boyer-Moore, with O(n/m) best case (text length n, pattern length m).
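A minimal sketch of the Knuth-Morris-Pratt idea, assuming the usual formulation in which the shift table stores, for each pattern prefix, the length of its longest proper prefix that is also a suffix. The names prefix_table and kmp_find are illustrative.

```python
def prefix_table(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find(pattern: str, text: str) -> list[int]:
    """Return every 0-based index at which pattern occurs in text."""
    if not pattern:
        return []
    fail = prefix_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without moving backward in the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches


print(prefix_table("CCATT"))                           # prints [0, 1, 0, 0, 0]
print(kmp_find("CCATT", "CCAGCCATTCCTTAGGGCCATGTG"))   # prints [4]
```

The text index never moves backward, which is where the linear matching time comes from.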
8
Finite Automata Matching Example
Pattern: CCATT
Text: CCAGCCATTCCTTAGGGCCATGTG
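[Figure: automaton whose states are the matched prefixes (none), C, CC, CCA, CCAT, CCATT. Forward transitions spell CCATT; the other transitions on C, A, T fall back to shorter prefixes.]
Automaton takes preprocessing time to build.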
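A sketch of the automaton approach, assuming the straightforward textbook construction: the next state is the length of the longest pattern prefix that is a suffix of what has been read. This builder is deliberately simple rather than fast, and the default alphabet "ACGT" is an assumption suited to the DNA example.

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """dfa[state][ch] = next state, where a state is the number of
    pattern characters matched so far."""
    m = len(pattern)
    dfa = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            candidate = pattern[:state] + ch
            # Longest pattern prefix that is a suffix of candidate.
            k = min(m, state + 1)
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            dfa[state][ch] = k
    return dfa


def dfa_find(pattern: str, text: str, alphabet: str = "ACGT") -> list[int]:
    """Return every 0-based index at which pattern occurs in text."""
    if not pattern:
        return []
    dfa = build_dfa(pattern, alphabet)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Characters outside the alphabet reset the automaton.
        state = dfa[state].get(ch, 0)
        if state == len(pattern):
            matches.append(i - len(pattern) + 1)
    return matches


print(dfa_find("CCATT", "CCAGCCATTCCTTAGGGCCATGTG"))  # prints [4]
```

Once the table exists, matching does one table lookup per text character.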
9
Regular Expressions
Syntax: ε, a, r+s, rs, r*
10
Regular Expression to Finite Automaton
[Figure: two-state automata, one with a single transition on symbol a, one with a single ε-transition.]
11
Regular Expression to Finite Automaton
[Figure: automata for r+s, rs, and r*, each built from the automata for r and s joined by ε-transitions.]
12
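Eliminating ε
[Figure: a transition on a followed by ε-transitions is replaced by direct transitions on a to each state the ε-transitions reach.]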
13
Eliminating Non-Determinism
Each state is a set of the old states.
[Figure: NFA with states 0 and 1 over the alphabet {a, b}, and the equivalent DFA with states 0, 1, and {0,1}.]
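The "each state is a set of the old states" idea can be sketched with the subset construction below. The NFA used in the example is hypothetical, not the one drawn on the slide, and ε-transitions are assumed to have been eliminated already.

```python
def nfa_to_dfa(nfa: dict, start, alphabet: str) -> dict:
    """Subset construction.

    nfa maps state -> {symbol -> set of next states}.
    Returns dfa mapping frozenset-of-NFA-states -> {symbol -> frozenset}.
    Only DFA states reachable from the start set are generated.
    """
    start_set = frozenset([start])
    dfa = {}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for sym in alphabet:
            # Union of all NFA moves on sym from any state in the current set.
            target = frozenset(
                t for s in current for t in nfa.get(s, {}).get(sym, ())
            )
            dfa[current][sym] = target
            if target not in dfa:
                worklist.append(target)
    return dfa


# Hypothetical two-state NFA: on a, state 0 may stay or move to 1.
example_nfa = {0: {"a": {0, 1}, "b": {0}}, 1: {"b": {1}}}
example_dfa = nfa_to_dfa(example_nfa, 0, "ab")
print(len(example_dfa))  # prints 2
```

Generating states from a worklist means unreachable subsets are never created, which already addresses part of the minimization question; merging equivalent states is a separate step not shown here.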
14
Can Minimize
States unreachable or equivalent?
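[Figure: the same NFA with states 0 and 1, and its DFA with states 0, {0,1}, and 1, to be checked for unreachable and equivalent states.]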
15
Finite Automaton Approach Summary
Preprocessing expensive for REs.
Matching is flexible and still linear.
16
Suffix Tree (Trie)
Text: CCAGAT
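How many leaves, nodes, edges? Total space?
How to search for a pattern? Running time?
[Figure: suffix tree for CCAGAT; each leaf is numbered with the start position of its suffix.]
root
  T (leaf 6)
  A
    T (leaf 5)
    GAT (leaf 3)
  GAT (leaf 4)
  C
    CAGAT (leaf 1)
    AGAT (leaf 2)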
Each edge labeled with two indices.
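To make the search question concrete, here is a minimal sketch using an uncompressed suffix trie built from nested dictionaries. This is a simplification: it stores one node per character and so uses quadratic space, whereas the tree on the slide compresses each chain into one edge labeled with two indices. The function names are illustrative.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("CCAGAT")
print(contains(trie, "AGA"))  # prints True
print(contains(trie, "GAC"))  # prints False
```

Searching follows one edge per pattern character, so it takes time proportional to the pattern length, independent of the text length.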
17
Require Suffixes to End at Leaves
Text: CCAGAT
Reason: Simplicity. Distinguish nodes & leaves.
Example text that would break that?
[Figure: the suffix tree for CCAGAT from the previous slide, in which every suffix ends at a leaf.]
18
Forcing Suffixes to End at Leaves
Text: CCAGAC$
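[Figure: suffix tree for CCAGAC$; each leaf is numbered with the start position of its suffix.]
root
  $ (leaf 7)
  A
    C$ (leaf 5)
    GAC$ (leaf 3)
  GAC$ (leaf 4)
  C
    $ (leaf 6)
    AGAC$ (leaf 2)
    CAGAC$ (leaf 1)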
19
How Would You Create the Tree?
Text: CCAGAC$
[Figure: the suffix tree for CCAGAC$ from the previous slide.]
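One straightforward answer, sketched under stated assumptions: insert the suffixes one at a time, walking down from the root and splitting an edge where the new suffix diverges. This is the quadratic-time baseline, not the linear-time construction developed on the following slides. The Node class and function names are illustrative, and the text is assumed to end with a terminator that appears nowhere else.

```python
class Node:
    def __init__(self):
        # first character -> (start, end, child); edge label is text[start:end]
        self.children = {}
        self.suffix_index = None  # 1-based suffix start, set on leaves only


def build_suffix_tree_naive(text: str) -> Node:
    """Build a compressed suffix tree by inserting each suffix in turn."""
    assert text and text.count(text[-1]) == 1, "needs a unique terminator"
    n = len(text)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while True:
            ch = text[pos]
            if ch not in node.children:
                # No edge starts with ch: hang a new leaf here.
                leaf = Node()
                leaf.suffix_index = i + 1
                node.children[ch] = (pos, n, leaf)
                break
            start, end, child = node.children[ch]
            k = 0
            while start + k < end and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:
                # Whole edge matched: continue from the child.
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it and add the new leaf.
            mid = Node()
            mid.children[text[start + k]] = (start + k, end, child)
            leaf = Node()
            leaf.suffix_index = i + 1
            mid.children[text[pos + k]] = (pos + k, n, leaf)
            node.children[ch] = (start, start + k, mid)
            break
    return root


def leaf_indices(node: Node) -> list[int]:
    """Collect the suffix numbers stored at the leaves below node."""
    if not node.children:
        return [node.suffix_index]
    found = []
    for _start, _end, child in node.children.values():
        found.extend(leaf_indices(child))
    return found


tree = build_suffix_tree_naive("CCAGAC$")
print(sorted(leaf_indices(tree)))  # prints [1, 2, 3, 4, 5, 6, 7]
```

Storing each edge as a pair of indices into the text, rather than as a copied string, keeps the total space linear in the text length.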
20
Suffix Tree Construction Example
Text: nananabanana$
nananabanana$
21
Suffix Tree Construction Example
Text: nananabanana$
[Figure: root with two leaf edges, nananabanana$ and ananabanana$.]
22
Suffix Tree Construction Example
Text: nananabanana$
Match nanabanana$, nananabanana$: 4 comparisons
[Figure: root has edge nana leading to a new internal node with leaf edges nabanana$ and banana$, plus leaf edge ananabanana$.]
23
Suffix Tree Construction Example
Text: nananabanana$
Match anabanana$, ananabanana$: 3 redundant comparisons
Next …
Match nabanana$, nanabanana$: 2 redundant comparisons
Match abanana$, anabanana$: 1 redundant comparison
[Figure: root has edge nana (to leaf edges nabanana$ and banana$) and edge ana (to leaf edges nabanana$ and banana$).]
24
First Algorithm Improvement: No Redundant Matching
When inserting a suffix, e.g., nanabanana$:
If we discover aB is a prefix in tree (a one character, B a string), e.g., nana,
then B will also be a prefix in tree, e.g., ana.
So, don’t bother matching to verify.
25
Suffix Tree Construction Example
Text: nananabanana$
Match nanabanana$, nananabanana$: 4 comparisons
[Figure: root has edge nana leading to an internal node with leaf edges nabanana$ and banana$, plus leaf edge ananabanana$.]
26
Suffix Tree Construction Example
Text: nananabanana$
No matching. Just form new node and adjust edge string indices.
[Figure: leaf edge ananabanana$ is split into edge ana with leaf edges nabanana$ and banana$; edge nana is unchanged.]
27
Suffix Tree Construction Example
Text: nananabanana$
No matching. Just form new node and adjust edge string indices.
[Figure: edge nana is split into na followed by na; the upper na node gains leaf edge banana$.]
28
Suffix Tree Construction Example
Text: nananabanana$
No matching. Just form new node and adjust edge string indices.
[Figure: edge ana is split into a followed by na; the a node gains leaf edge banana$.]
29
Suffix Tree Construction Example
Text: nananabanana$
[Figure: new leaf edge banana$ added at the root.]
30
Suffix Tree Construction Example
Text: nananabanana$
anana is there, but it takes work to match & follow links.
[Figure: below a and na, leaf edge nabanana$ is split into na followed by banana$; the new node for anana gains leaf edge $.]
31
Suffix Tree Construction Example
Text: nananabanana$
nana is there, but it takes work to match & follow links.
[Figure: the existing node for nana gains leaf edge $.]
32
Second Algorithm Improvement: Suffix Links
Text: nananabanana$
Each internal node for aB has a pointer to the node for B.
[Figure: completed suffix tree for nananabanana$ with suffix links drawn between internal nodes.]
33
Suffix Tree Construction Example
Text: nananabanana$
Have to follow links for match on first time.
Need to create suffix link for new node.
[Figure: tree after inserting anana$; the new node for anana has no suffix link yet.]
34
Suffix Tree Construction Example
Text: nananabanana$
Need to create suffix link for new node.
Use parent’s suffix link to get close.
[Figure: same tree; the suffix link of the new node's parent leads toward the node for nana.]
35
Suffix Tree Construction Example
Text: nananabanana$
[Figure: same tree, with the suffix link for the new node in place.]
36
Suffix Tree Construction Example
Text: nananabanana$
Already found node. No searching or matching.
[Figure: the node for nana gains leaf edge $.]
37
Suffix Tree Construction Example
Text: nananabanana$
Follow suffix link.
[Figure: the node for ana gains leaf edge $.]
38
Suffix Tree Construction Example
Text: nananabanana$
Follow suffix link.
[Figure: the node for na gains leaf edge $.]
39
Suffix Tree Construction Example
Text: nananabanana$
Follow suffix link.
[Figure: the node for a gains leaf edge $.]
40
Suffix Tree Construction Example
Text: nananabanana$
Done., but roughly how big is ?
[Figure: completed suffix tree; the root gains leaf edge $.]
41
Almost Correct Analysis of Construction
Two indices into the text:
Match position: Each increment takes constant time.
– Just search for one more character.
Suffix start: Each increment takes constant time.
– Follow suffix link, or from root, get to suffix match.
– Possibly split node & create suffix link.
– Add one edge & leaf.
Each index is incremented at most n times total, for text length n.
42
Creating & Following Suffix Links
When creating new node, need new suffix link.Follow parent’s suffix link to get close, then search down.
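[Figure: text divided into labeled substrings a, B, C, D, E, F, G, showing the new node for aB, its parent's suffix link, and the search down to the node for B.]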
43
Correct Analysis of Construction
Three indices: ( not part of alg.)
: Each increment takes time.: Follow some number of links in time. Increments by at least .: Each increment takes time in addition to that considered for .
, , each incremented at most times total.
44
Some Applications of Suffix Trees
Search for fixed patterns
Search for regular expressions
Find longest common substrings, longest repeated substrings
Find most commonly repeated substrings
Find maximal palindromes
Find Lempel-Ziv decomposition (for text compression)
As used in
Bioinformatics
Data compression
Data clustering
45
Supplementary Resources
Exact String Matching Algorithms: >30 algorithms, with animations
Course on string matching (Biosequencing)
Wikipedia: Suffix trees
Tutorial on Suffix trees
Tutorial on Suffix trees with applet & code
Notes on string matching and building suffix trees
Suffix tree slides adapted from those by Guy Blelloch, CMU 15-853.