
Algorithms and Data Structures

/course/eleg67701-f/Topic-1b

Outline

- Data Structures
- Space Complexity
- Case Study: string matching
  - Array implementation (e.g. KMP algorithm)
  - Tree implementation (e.g. suffix tree)


Algorithm in action: data structure transformation

[Figure: an algorithm transforms an input data structure into an output data structure, possibly through an intermediate data structure.]


Basic Data Structures

Scalar or "atomic" data structures
- Building blocks for other data structures
- Cannot be divided into sub-elements
- Integer, floating-point, character, access (pointer) types

Composite data structures
- Arrays, records

Data abstraction
- Abstract Data Types: a collection of data values together with a set of well-specified operations on that data, e.g. list, stack, queue, tree.


Scalar Data Structure

Conceptual View

[Figure: the variable name var1 bound to a value.]

Physical Layout in the Computer Memory

[Figure: memory addresses 0238 to 0245; the value of var1 is stored at one of these addresses.]

Assignment operation:

var1 := value;

var2 := var1;

var1 := var3;


Composite Data Structure: Array

Conceptual View

Variable name: A, an array of five elements (v1 to v5) at indices 0 to 4

Accessing array elements:

A[0] := 5

k := 1

A[k] := 11

A[k+1] := A[k] + 3

[Figure: memory addresses 0238 to 0245; the elements v1 to v5 occupy consecutive addresses, followed by nil.]
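The array accesses above can be tried directly. The following is a minimal sketch in Python, assuming a Python list stands in for the array A and that unassigned elements hold None; neither choice comes from the slides.

```python
# A five-element array with indices 0..4; None marks elements not yet assigned.
A = [None] * 5

A[0] = 5             # A[0] := 5
k = 1                # k := 1
A[k] = 11            # A[k] := 11
A[k + 1] = A[k] + 3  # A[k+1] := A[k] + 3

print(A)  # [5, 11, 14, None, None]
```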


Data Abstraction: Tree

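Conceptual View

[Figure: a tree T whose root holds v1; v2 and v3 are its children, and v4 lies below them.]

Accessing the elements:

T.value := 12

T.left := new(T)

T.right := new(T)

[Figure: the nodes laid out at memory addresses 0238 to 0245 and beyond. T refers to the root at address 0238; each node stores a value together with child pointers, each of which is either an address (such as 0241, 0244 or 0247) or nil.]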
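The three tree operations above can be sketched in Python as follows. The class name Node and its field layout are assumptions made for illustration; None plays the role of nil, and object references play the role of the stored addresses.

```python
class Node:
    """A binary tree node: one value plus two child references."""

    def __init__(self, value=None):
        self.value = value
        self.left = None   # None plays the role of nil
        self.right = None


T = Node()
T.value = 12        # T.value := 12
T.left = Node()     # T.left := new(T)
T.right = Node()    # T.right := new(T)
```

Each new node starts with both child references set to None, which mirrors the nil entries in the memory layout.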


Space Analysis

Storage space, like time, is another limited resource that is important to programmers

Space requirements are also expressed as a function of the input size

Space functions are classified in the same manner as running times


Complexity Analysis: Sorting

Algorithm        Time complexity               Space complexity
Insertion sort   O(n^2)                        O(n)
Quicksort        O(n log n) (average case)     O(n)


Space-Time Tradeoff

Reductions in running time are often possible if we increase storage requirements.

Decreasing the amount of storage used by an algorithm usually results in longer running times.

Using an array to look up previously computed values can drastically increase the speed of a function.
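As an illustration of the lookup-array idea, here is a minimal Python sketch. The Fibonacci function is an example chosen here, not one taken from the slides, and the names fib_slow and fib_table are hypothetical. The first version uses no extra array but repeats work; the second spends O(n) extra space on a table and runs in O(n) time.

```python
def fib_slow(n):
    """No lookup array: recomputes the same values over and over."""
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)


def fib_table(n):
    """O(n) extra space: each value is computed once and then looked up."""
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]


print(fib_slow(20), fib_table(20))  # 6765 6765
```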


Case Study: Searching for Patterns

Problem: find the first occurrence of pattern P of length m inside the text S of length n.

String matching problem


String Matching - Applications

- Text editing
- Term rewriting
- Lexical analysis
- Information retrieval
- And bioinformatics


Model for Pattern-Matching Problem

[Figure: a pattern-matcher generator takes the pattern P and produces a pattern matcher; the pattern matcher reads the input string S and answers Yes or No.]


Array Implementation

Text S represented as an array of characters: S[1..n]

Pattern P represented as an array of characters: P[1..m]

S = a g c a g a a g a g t a

[Figure: the pattern P is aligned under S at each successive starting position and compared character by character.]

Time complexity = O(m·n)

Space complexity = O(m + n)
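The O(m·n) array approach can be sketched in Python as follows. This is a minimal sketch: the function name naive_match is hypothetical, and the pattern "agag" in the usage lines is an assumed example, since the pattern on the slide is not recoverable. Only the text S is taken from the slide.

```python
def naive_match(pattern, text):
    """Return the 1-based position of the first occurrence, or 0 if none."""
    m, n = len(pattern), len(text)
    for start in range(n - m + 1):      # try every alignment of P under S
        k = 0
        while k < m and text[start + k] == pattern[k]:
            k += 1                      # compare character by character
        if k == m:                      # all m characters matched
            return start + 1
    return 0


S = "agcagaagagta"
print(naive_match("agag", S))  # 7
```

Each of the up to n - m + 1 alignments may cost up to m comparisons, which gives the O(m·n) bound.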


Can we be more clever?

When a mismatch is detected, say at position k in the pattern string, we have already successfully matched k-1 characters.

We try to take advantage of this to decide where to restart matching.

S = a g c a g a a g a g t a

[Figure: after a mismatch, the pattern P is shifted to a later alignment under S instead of being tried at every position.]


Problem of Matching Keyword

PROBLEM. Given a pattern p consisting of a single keyword and an input string s, answer “yes” if p occurs as a substring of s, that is, if s=xpy, for some x and y; “no” otherwise.

For convenience, we will assume p = p_1 p_2 … p_m and s = s_1 s_2 … s_n, where p_i represents the ith character of the pattern and s_j the jth character of the input string.


The Knuth-Morris-Pratt Algorithm

Observation: when a mismatch occurs, we may not need to restart the comparison all the way back (from the next input position).

What to do:

Construct a table h, called the next function, that determines how many characters to slide the pattern to the right in case of a mismatch during the pattern-matching process.

Knuth, D. E., Morris, J. H. and Pratt, V. R., Fast Pattern Matching in Strings, SIAM J. Comput., 6(2), 1977, 323-350


The key idea is that if we have successfully matched the prefix p_1 p_2 … p_{i-1} of the keyword with the substring s_{j-i+1} s_{j-i+2} … s_{j-1} of the input string and p_i ≠ s_j, then we do not need to reprocess any of the suffix s_{j-i+1} s_{j-i+2} … s_{j-1}, since we know this portion of the text string is the prefix of the keyword that we have just matched.


Note that the inner while loop iterates as long as p_i and s_j do not match (and i has not dropped to 0). Once they match, the inner while loop terminates, both i and j advance by one, and the inner loop repeats ...
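The loop structure described in this note can be sketched in Python as follows. This is a sketch under stated assumptions rather than the code shown on the slide: the function name kmp_match is hypothetical, the next function h is passed in as a precomputed list with h[0] unused, and both strings are padded so that indices start at 1 as in the slides.

```python
def kmp_match(pattern, text, h):
    """Answer True if pattern occurs in text, given its next function h."""
    m, n = len(pattern), len(text)
    p = " " + pattern   # pad so that p[1] is the first pattern character
    s = " " + text      # pad so that s[1] is the first text character
    i, j = 1, 1
    while i <= m and j <= n:
        # Inner loop: on a mismatch, slide the pattern using h; j never moves back.
        while i > 0 and p[i] != s[j]:
            i = h[i]
        i += 1
        j += 1
    return i > m        # True exactly when all m pattern characters matched


# Pattern, next function and input string from the worked example in these slides.
h = [0, 0, 1, 0, 2, 1, 0, 4, 0, 2, 1, 0, 7, 1]   # h[0] is unused
print(kmp_match("abaababaabaab", "abaababaabacabaababaabaab", h))  # True
```

Because j only ever increases, the text is scanned once from left to right.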


An Important Property of the Next Function in KMP Algorithm

h_i is the largest k less than i such that p_1 p_2 … p_{k-1} is a suffix of p_1 p_2 … p_{i-1} (i.e., p_1 … p_{k-1} = p_{i-k+1} … p_{i-1}) and p_i ≠ p_k. If there is no such k, then h_i = 0.
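This property can be checked by brute force. The sketch below evaluates the definition directly, reading the mismatch condition as p_i ≠ p_k, for every i. The function name next_by_definition is hypothetical, and the code is deliberately slow and only meant to make the definition concrete.

```python
def next_by_definition(pattern):
    """Compute h straight from the definition; h[0] is unused."""
    m = len(pattern)
    p = " " + pattern               # pad so that p[1] is the first character
    h = [0] * (m + 1)
    for i in range(1, m + 1):
        for k in range(i - 1, 0, -1):           # try the largest k first
            prefix = p[1:k]                     # p_1 ... p_{k-1}
            suffix = p[i - k + 1:i]             # p_{i-k+1} ... p_{i-1}
            if prefix == suffix and p[i] != p[k]:
                h[i] = k
                break                           # otherwise h[i] stays 0
    return h


print(next_by_definition("abaababaabaab")[1:])
# [0, 1, 0, 2, 1, 0, 4, 0, 2, 1, 0, 7, 1]
```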


Backtrack or Not Backtrack ?

Assume P(i) ≠ S(j) for some i and j. What should we do?

The KMP algorithm chooses not to backtrack on the text S (i.e. j), for a good reason.

The choice is how to shift the pattern P (i.e. i), that is, by how much.

If for each j the shift of P is a small constant, then the total time complexity is clearly linear in n.


An Example

Given:

Position:       1  2  3  4  5  6  7  8  9 10 11 12 13
Pattern:        a  b  a  a  b  a  b  a  a  b  a  a  b
Next function:  0  1  0  2  1  0  4  0  2  1  0  7  1

Input string: abaababaabacabaababaabaab

Scenario 1: i = 12, j = 12

Pattern:       a b a a b a b a a b a a b
Input string:  a b a a b a b a a b a c a b a a b a b a a b a a b

What is h_i = h_12? h_12 = 7.

Scenario 2: h_12 = 7, so i = 7

[Figure: the pattern shifted right under the input string, with i and j marking the characters now being compared.]


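An Example (cont'd)

Scenario 3: h_7 = 4, so i = 4

Subsequently i = 2, 1, 0

Finally, a match is found.

[Figure: the pattern aligned with the input string at the position of the match, with i and j marking the current characters.]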



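Question: when P(i) ≠ S(j), how much should we shift?

Observations:
- We should shift P to the right.
- But by how much?
- One answer is: do not backtrack S(j).

[Figure: the pattern P and the input S, each scanned from position 1 (i = 1, j = 1); P_i is the pattern character currently aligned with the input character S_j.]

Observation: never backtrack on the input string S.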


How to Compute the Next Function?

[Code figure: the computation of the next function. The statements that survive are h_i := h_j, h_i := j, and j := h_j.]


Note: once p_i does not match p_j, j is replaced by h_j, the next candidate index at which a prefix of the pattern matches a suffix of the part of the pattern ending just before i.
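The fragments h_i := h_j, h_i := j and j := h_j fit the standard way of computing the next function, which can be sketched in Python as follows. This is a reconstruction under assumptions, not the slide's own code: the function name compute_next is hypothetical, the pattern is padded so that indices start at 1, and h[0] is unused.

```python
def compute_next(pattern):
    """Return the next function h for pattern; h[i] is defined for i = 1..m."""
    m = len(pattern)
    p = " " + pattern          # pad so that p[1] is the first character
    h = [0] * (m + 1)          # h[1] = 0; h[0] is unused
    i, j = 1, 0
    while i < m:
        while j > 0 and p[i] != p[j]:
            j = h[j]           # j := h_j, fall back to a shorter prefix
        i += 1
        j += 1
        if p[i] == p[j]:
            h[i] = h[j]        # h_i := h_j
        else:
            h[i] = j           # h_i := j
    return h


print(compute_next("abaababaabaab")[1:])
# [0, 1, 0, 2, 1, 0, 4, 0, 2, 1, 0, 7, 1]
```

The printed values agree with the next function listed for the 13-character example pattern in these slides.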


Interpretation of the Next Function

Interpretation

Question: how to compute the next function?

Position:       1 2 3 4 5 6 7 8 9
Pattern:        a b a a b a b a a
Next function:  0 1 0 2 1 0 4 0 2

Note: P2 = P5 and P4 = P9, so h_5 = h_2 = 1 and h_9 = h_4 = 2.


KMP - Analysis

The KMP algorithm never needs to backtrack on the text string.

Time complexity = O(m + n): O(m) for preprocessing plus O(n) for searching


KMP Algorithm Complexity Analysis Hints

What is the cost in the building of the next function? (hint: in the code for the next function, the operation j=h_j in the inner loop is never executed more often than the statement i := i+1 in the outer loop)

What is the cost of the matching itself? (hint: similar to the above)


Other String Matching Algorithms

The Boyer-Moore Algorithm [Boyer, R. S. and Moore, J. S., A Fast String Searching Algorithm, CACM, 20(10), 1977, 762-772]

The Karp-Rabin Algorithm [Karp, R. M. and Rabin, M. O., Efficient Randomized Pattern-Matching Algorithms, IBM J. of Res. and Develop., 31(2), 1987, 249-260]


Matching of A Set of Key Words ?

Given a pattern consisting of a set of keywords and an input string S, answer “yes” if some keyword occurs as a substring of S, and “no” otherwise.

How to solve this?


How about repeatedly applying KMP?

What time complexity will the KMP algorithm have when matching k patterns?

- Preprocessing each of the k patterns: assuming each pattern has length O(m), this takes O(km) time.
- Searching for each pattern takes O(n) time per pattern.

So, total time = k · O(m + n)


Question: Can we improve the time complexity when k is large?

Answer:

Yes, preprocessing the input string – tree implementation.


Model for Pattern-Matching Problem

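[Figure: the same model as before, except that the input string S now goes through a preprocessing step before the pattern matcher reads it; the matcher still takes the pattern P and answers Yes or No.]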


Tree Implementation -- suffix tree

Instead of preprocessing the pattern (P), preprocess the text T!

Use a tree structure where all suffixes of the text are represented;

Search for the pattern by looking for substrings of the text;

You can easily test whether P is a substring of T because any substring of T is the prefix of some suffix.
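The idea that any substring of T is a prefix of some suffix can be sketched in Python as follows. This sketch builds an uncompressed suffix trie out of nested dictionaries, which takes O(n^2) time and space; it is not a linear-time suffix tree and is only meant to show the search step. The function names build_suffix_trie and is_substring are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dictionaries."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})   # follow or create the child for ch
    return root


def is_substring(trie, pattern):
    """Walk down from the root; P occurs in T iff the walk never gets stuck."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("xabxac")
print(is_substring(trie, "bxa"), is_substring(trie, "xb"))  # True False
```

Once the text has been preprocessed, each query costs time proportional to the pattern length only, which is what makes this attractive when many patterns are searched in the same text.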


Suffix Tree

[Figure: suffix tree for string xabxac. The node labels u and w on the two interior nodes will be used.]

Suffix Tree (cont'd)

A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i…m].


Note on Suffix Tree

Not all strings are guaranteed to have corresponding suffix trees.

For example, consider xabxa: it does not have a suffix tree, because here xa is both a prefix and a suffix (i.e. the path for xa does not necessarily end at a leaf).

How to fix the problem: add $, a special “termination” character, to the alphabet.
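The effect of the termination character can be checked with a small Python sketch. It lists the suffixes of a string that are also prefixes of some other suffix; such suffixes cannot end at a leaf. The function name suffixes_that_prefix_another is hypothetical, and the check assumes that $ does not occur elsewhere in the string.

```python
def suffixes_that_prefix_another(s):
    """Return the suffixes of s that are proper prefixes of another suffix."""
    suffixes = [s[i:] for i in range(len(s))]
    return [a for a in suffixes
            if any(b != a and b.startswith(a) for b in suffixes)]


print(suffixes_that_prefix_another("xabxa"))   # ['xa', 'a']
print(suffixes_that_prefix_another("xabxa$"))  # []
```

With $ appended, every suffix ends in a character that appears nowhere else, so no suffix can be a prefix of another and each one gets its own leaf.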


Algorithm for Constructing a Suffix Tree

A suffix tree can be constructed in linear time

[Weiner73, McCreight76, Ukkonen95]


Suffix Tree

Time complexity = O(n + m): O(n) for preprocessing plus O(m) for searching


Question

How can a suffix tree be used to help solve the string matching problem?


Other Tree based Methods

The suffix tree is not the only one ...
