/course/eleg67701-f/Topic-1b 2
Outline
Data Structures
Space Complexity
Case Study: string matching
  Array implementation (e.g. KMP algorithm)
  Tree implementation (e.g. suffix tree)
Algorithm in action: data structure transformation
Input data structure → Algorithm (intermediate data structure) → Output data structure
Basic Data Structures
Scalar or “atomic” data structures
  Building blocks for other data structures
  Cannot be divided into sub-elements
  Integer, floating-point, character, access (pointer) types
Composite data structures
  Arrays, records
Data abstraction
  Abstract data types: a collection of data values together with a set of well-specified operations on that data, e.g. list, stack, queue, tree
Scalar Data Structure
Conceptual view: a variable name (var1) bound to a single value.
Physical layout in the computer memory: memory addresses 0238 to 0245, with the value stored in one cell.
Assignment operation:
var1 := value;
var2 := var1;
var1 := var3;
Composite Data Structure: Array
Conceptual view: variable name A, array A[0..4] holding the values v1, v2, v3, v4, v5 at indices 0 to 4.
Accessing array elements:
A[0] := 5
k := 1
A[k] := 11
A[k+1] := A[k] + 3
Physical layout in the computer memory: v1 to v5 occupy consecutive cells within memory addresses 0238 to 0245, followed by nil.
Data Abstraction: Tree
Conceptual view: a tree T whose root holds v1, with children v2 and v3, and v4 on the next level.
Accessing the elements:
T.value := 12
T.left := new(T)
T.right := new(T)
Physical layout in the computer memory (addresses 0238 to 0245, ...): T refers to address 0238; each node stores its value together with the addresses of its children (e.g. 0241, 0244, 0247), or nil where a child is absent.
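The access operations on this slide can be mirrored in a few lines. This is a minimal sketch, assuming a binary tree and using Python's None in place of nil; the class name Node and the helper size are illustrative, not part of the slides.

```python
class Node:
    """A binary-tree node: one value plus references to two children."""

    def __init__(self, value=None):
        self.value = value
        self.left = None    # None plays the role of nil
        self.right = None


def size(node):
    """Count the nodes reachable from node (0 for an empty tree)."""
    if node is None:
        return 0
    return 1 + size(node.left) + size(node.right)


# The three operations from the slide.
T = Node()
T.value = 12
T.left = Node()     # T.left := new(T)
T.right = Node()    # T.right := new(T)
```

Each reference stored in left or right corresponds to a memory address in the physical layout.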
Space Analysis
Storage space, like time, is another limited resource that is important to programmers
Space requirements are also expressed as a function of the input size
Space functions are classified in the same manner as running times
Complexity Analysis: Sorting
Algorithm        Time complexity             Space complexity
Insertion sort   O(n^2)                      O(n)
Quicksort        O(n log n) (average case)   O(n)
Space-Time Tradeoff
Reductions in running time are often possible if we increase storage requirements.
Decreasing the amount of storage used by an algorithm usually results in longer running times.
Using an array to look up previously computed values can drastically increase the speed of a function.
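As a small illustration of the tradeoff (the Fibonacci function is an example chosen here, not one from the slides), the sketch below contrasts a version that recomputes everything with one that spends O(n) extra storage on a lookup array:

```python
def fib_slow(n):
    """No extra storage: subproblems are recomputed, so time grows exponentially."""
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)


def fib_table(n):
    """O(n) extra storage: every previously computed value is looked up in an array."""
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]
```

Both functions return the same values; only the balance between time and space differs.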
Case Study: Searching for Patterns
Problem: find the first occurrence of pattern P of length m inside the text S of length n.
String matching problem
String Matching - Applications
Text editing
Term rewriting
Lexical analysis
Information retrieval
Bioinformatics
Model for Pattern-Matching Problem
Pattern P → pattern matcher generator → pattern matcher
Input string S → pattern matcher → Yes / No
Array Implementation
Text S represented as an array of characters: S[1..n]
Pattern P represented as an array of characters: P[1..m]
S: a g c a g a a g a g t a
[Figure: pattern P aligned under S and shifted right by one position after each failed attempt]
Time complexity = O(m·n)
Space complexity = O(m + n)
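The quadratic bound comes from trying every alignment of P against S and comparing up to m characters at each one. A minimal sketch of that brute-force scan, assuming Python strings in place of the arrays S[1..n] and P[1..m] (the function name naive_find is illustrative):

```python
def naive_find(p, s):
    """Return the 1-based position of the first occurrence of p in s, or 0 if none."""
    m, n = len(p), len(s)
    for start in range(n - m + 1):       # every possible alignment of p against s
        k = 0
        while k < m and s[start + k] == p[k]:
            k += 1                       # compare character by character
        if k == m:                       # all m characters matched
            return start + 1
    return 0


text = "agcagaagagta"                    # the text S from the slide
position = naive_find("gaag", text)
```

After a mismatch the scan restarts from scratch at the next alignment, which is exactly the repeated work that KMP avoids.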
Can we be more clever?
When a mismatch is detected, say at position k in the pattern string, we have already successfully matched k-1 characters.
We try to take advantage of this to decide where to restart matching.
S: a g c a g a a g a g t a
[Figure: pattern P aligned under S at successive restart positions]
Problem of Matching Keyword
PROBLEM. Given a pattern p consisting of a single keyword and an input string s, answer “yes” if p occurs as a substring of s, that is, if s = xpy for some x and y; “no” otherwise.
For convenience, we will assume p = p_1 p_2 … p_m and s = s_1 s_2 … s_n, where p_i represents the ith character of the pattern and s_j the jth character of the input string.
The Knuth-Morris-Pratt Algorithm
Observation: when a mismatch occurs, we may not need to restart the comparison all the way back (from the next input position).
What to do:
Construct a table h, called the next function, that determines how many characters to slide the pattern to the right in case of a mismatch during the pattern-matching process.
Knuth, D. E., Morris, J. H. and Pratt, V. R., Fast Pattern Matching in Strings, SIAM J. Comput., 6(2), 1977, 323-350
The key idea is that if we have successfully matched the prefix p_1 p_2 … p_{i-1} of the keyword with the substring s_{j-i+1} s_{j-i+2} … s_{j-1} of the input string and p_i ≠ s_j, then we do not need to reprocess any of s_{j-i+1} s_{j-i+2} … s_{j-1}, since we know this portion of the text string is the prefix of the keyword that we have just matched.
Note that the inner while loop iterates as long as p_i and s_j do not match each other. Once they match, the inner while loop terminates, both i and j shift by one, and the inner loop starts again at the new position.
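The two nested loops in this note can be sketched as follows. This assumes the standard two-loop formulation of KMP with a precomputed next table h passed in as a list; the function name kmp_contains and the padding used to get 1-based indices are illustrative choices, not taken from the slides.

```python
def kmp_contains(p, s, h):
    """Answer True if p occurs in s, given the next table h for p (h[0] is h_1)."""
    m, n = len(p), len(s)
    P, S, H = " " + p, " " + s, [0] + list(h)   # pad so indices start at 1
    i = j = 1
    while i <= m and j <= n:
        # Inner loop: runs while p_i and s_j differ, sliding the pattern via h.
        while i > 0 and P[i] != S[j]:
            i = H[i]
        # Either p_i matched s_j or i dropped to 0: advance both indices.
        i += 1
        j += 1
    return i > m        # the whole pattern was matched; j never moved backwards


# Pattern, next table and input string from the example slides.
pattern = "abaababaabaab"
table = [0, 1, 0, 2, 1, 0, 4, 0, 2, 1, 0, 7, 1]
found = kmp_contains(pattern, "abaababaabacabaababaabaab", table)
```

The index j only ever increases, which is the "no backtracking on the text" property used in the analysis.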
An Important Property of the Next Function in the KMP Algorithm
h_i is the largest k less than i such that p_1 p_2 … p_{k-1} is a suffix of p_1 p_2 … p_{i-1} (i.e., p_1 … p_{k-1} = p_{i-k+1} … p_{i-1}) and p_i ≠ p_k. If there is no such k, then h_i = 0.
Backtrack or Not Backtrack?
Assume P(i) ≠ S(j) for some i and j; what should we do?
The KMP algorithm chooses not to backtrack on the text S (i.e. j), for a good reason.
The choice is how to shift the pattern P (i.e. i), that is, by how much.
If, for each j, the shift of P costs only a small constant, then the total time complexity is clearly linear in n.
An Example
Given:
Position:      1 2 3 4 5 6 7 8 9 10 11 12 13
Pattern:       a b a a b a b a a b  a  a  b
Next function: 0 1 0 2 1 0 4 0 2 1  0  7  1
Input string: abaababaabacabaababaabaab
Scenario 1: i = 12, j = 12
  Input:   a b a a b a b a a b a c a b a a b a b a a b a a b
  Pattern: a b a a b a b a a b a a b
  Mismatch: p_12 = a, s_12 = c. What is h_i = h_12? h_12 = 7.
Scenario 2: h_12 = 7, so i = 7 (j stays at 12)
  Input:   a b a a b a b a a b a c a b a a b a b a a b a a b
  Pattern:           a b a a b a b a a b a a b
An Example (cont'd)
Scenario 3: h_7 = 4, so i = 4 (j stays at 12)
  Input:   a b a a b a b a a b a c a b a a b a b a a b a a b
  Pattern:                 a b a a b a b a a b a a b
Subsequently i = 2, 1, 0
Finally, a match is found:
  Input:   a b a a b a b a a b a c a b a a b a b a a b a a b
  Pattern:                         a b a a b a b a a b a a b
Question: when P(i) ≠ S(j), how much should we shift?
Observations:
  We should shift P to the right.
  But by how much?
  One answer is: do not backtrack on S (i.e. j).
[Figure: pattern P (index i, starting at i = 1) aligned under input S (index j, starting at j = 1), with the current comparison at P_i and S_j]
How to Compute the Next Function?
Key assignments in the code: h_i := h_j, h_i := j, and j := h_j.
Note: once p_i does not match p_j, we know that j is the index to be found: the prefix p_1 … p_{j-1} of the pattern matches the suffix of p_1 … p_{i-1}.
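The three assignments fit together as in the sketch below. It assumes the standard two-loop construction in which h_i := h_j is taken when p_i = p_j and h_i := j otherwise; the surrounding loop structure is a reconstruction, not copied from the slide, and the function name next_function is illustrative.

```python
def next_function(p):
    """Return the next table [h_1, ..., h_m] for pattern p."""
    m = len(p)
    P = " " + p                 # pad so that P[1] is p_1
    h = [0] * (m + 1)           # h[1] = 0 by definition
    i, j = 1, 0
    while i < m:
        # Fall back until p_i matches p_j or no candidate prefix is left.
        while j > 0 and P[i] != P[j]:
            j = h[j]
        i += 1
        j += 1
        if P[i] == P[j]:
            h[i] = h[j]         # a shift to j would fail the same way, so reuse h_j
        else:
            h[i] = j            # p_i differs from p_j: j is the index to be found
    return h[1:]


table = next_function("abaababaabaab")
```

For the pattern used in the example slides, this construction reproduces the table listed there.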
Interpretation of the Next Function
Question: how to compute the next function?
Position:      1 2 3 4 5 6 7 8 9
Pattern:       a b a a b a b a a
Next function: 0 1 0 2 1 0 4 0 2
Note: P2 = P5, P4 = P9
Interpretation of the Next Function
Question: how to compute the next function?
Position:      1 2 3 4 5 6 7 8 9
Pattern:       a b a a b a b a a
Next function: 0 1 0 2 1 0 4 0 2
Note: P1 ≠ P5, P4 = P9
KMP - Analysis
The KMP algorithm never needs to backtrack on the text string.
Time complexity = O(m + n): O(m) for preprocessing, O(n) for searching
Space complexity = O(m + n)
KMP Algorithm Complexity Analysis Hints
What is the cost in the building of the next function? (hint: in the code for the next function, the operation j=h_j in the inner loop is never executed more often than the statement i := i+1 in the outer loop)
What is the cost of the matching itself? (hint: similar to the above)
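One way to check the first hint empirically is to count both statements while the table is built. The sketch below assumes the standard two-loop construction of the next function; the counters inner and outer and the function name are illustrative.

```python
def count_next_function_steps(p):
    """Build the next table for p; return (executions of j := h_j, executions of i := i+1)."""
    m = len(p)
    P = " " + p                 # pad so that P[1] is p_1
    h = [0] * (m + 1)
    inner = outer = 0
    i, j = 1, 0
    while i < m:
        while j > 0 and P[i] != P[j]:
            j = h[j]            # the inner-loop statement from the hint
            inner += 1
        i += 1                  # the outer-loop statement from the hint
        j += 1
        outer += 1
        h[i] = h[j] if P[i] == P[j] else j
    return inner, outer


inner, outer = count_next_function_steps("abaababaabaab")
```

Each execution of j := h_j lowers j by at least one, and j is raised only once per outer iteration, so the inner count cannot overtake the outer one. The outer loop runs m-1 times, which gives O(m) for the whole construction.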
Other String Matching Algorithms
The Boyer-Moore Algorithm [Boyer, R. S. and Moore, J. S., A Fast String Searching Algorithm, CACM, 20(10), 1977, 762-772]
The Karp-Rabin Algorithm [Karp, R. M. and Rabin, M. O., Efficient Randomized Pattern-Matching Algorithms, IBM J. of Res. and Develop., 31(2), 1987, 249-260]
Matching of a Set of Keywords?
Given a pattern consisting of a set of keywords and an input string S, answer “yes” if some keyword occurs as a substring of S, and “no” otherwise.
How to solve this?
How about repeatedly applying KMP?
What time complexity will the KMP algorithm have when matching k patterns?
- Preprocessing each of the k patterns: assuming each pattern has length O(m), this takes O(km) time
- Searching takes O(n) time per pattern
So, total time = k · O(m + n)
Question: Can we improve the time complexity when k is large?
Answer: Yes, by preprocessing the input string (tree implementation).
Model for Pattern-Matching Problem
Pattern P → pattern matcher generator → pattern matcher
Input string S → preprocessing → pattern matcher → Yes / No
Tree Implementation -- suffix tree
Instead of preprocessing the pattern (P), preprocess the text T!
Use a tree structure where all suffixes of the text are represented.
Search for the pattern by looking for substrings of the text.
You can easily test whether P is a substring of T because any substring of T is the prefix of some suffix.
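The last observation can be checked directly, before any tree is built. The rough sketch below lists every suffix explicitly, which takes quadratic space, so it only illustrates the property and is not a suffix tree; the helper names are made up for this example.

```python
def suffixes(text):
    """All non-empty suffixes of text, longest first."""
    return [text[i:] for i in range(len(text))]


def is_substring(pattern, text):
    """pattern occurs in text exactly when it is a prefix of some suffix of text."""
    return any(suffix.startswith(pattern) for suffix in suffixes(text))


hit = is_substring("bxa", "xabxac")     # "bxa" is a prefix of the suffix "bxac"
miss = is_substring("xc", "xabxac")     # no suffix starts with "xc"
```

A suffix tree stores the same set of suffixes in a shared structure, so the prefix test becomes a single walk from the root.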
Suffix Tree
[Figure] Suffix tree for string xabxac. The node labels u and w on the two interior nodes will be used.
Cont’d
A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i…m].
Note on Suffix Tree
Not all strings are guaranteed to have a corresponding suffix tree.
For example, consider xabxa: it does not have a suffix tree, because here xa is both a prefix and a suffix (i.e. xa does not necessarily end at a leaf).
How to fix the problem: add $, a special “termination” character, to the alphabet.
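The effect of the termination character can be seen on a simplified structure. The sketch below builds an uncompressed suffix trie (one character per edge, quadratic size), which is an assumption made for brevity and not the compact, linear-size suffix tree defined earlier; nested dictionaries stand in for nodes, and all function names are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (one character per edge)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A node with no children is a leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def contains(root, pattern):
    """Walk down from the root: pattern occurs iff the walk never gets stuck."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


without_marker = count_leaves(build_suffix_trie("xabxa"))   # xa and a end at interior nodes
with_marker = count_leaves(build_suffix_trie("xabxa$"))     # every suffix now ends at its own leaf
```

Without $, the suffixes xa and a are swallowed by longer paths, so there are fewer leaves than suffixes. With $, no suffix is a prefix of another, and each one gets its own leaf.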
Algorithm for Constructing a Suffix Tree
A suffix tree can be constructed in linear time
[Weiner73, McCreight76, Ukkonen95]
Suffix Tree
Time complexity = O(n + m): O(n) for preprocessing, O(m) for searching
Space complexity = O(m + n)
Question
How can a suffix tree be used to help solve the string matching problem?