Page 1: Vakhitov Alexander Approximate Text Indexing

Vakhitov Alexander

Approximate Text Indexing. Using simple mathematical arguments, the matching probabilities in the suffix tree are bounded, and by a clever division of the search pattern sub-linear time is achieved.

Based on “A Hybrid Method for Approximate String Matching” by G. Navarro and R. Baeza-Yates.

Page 2: Vakhitov Alexander Approximate Text Indexing

The Task

The task is to find substrings of a long text T that approximately match our pattern P.

For example, we have text T='adbc' and pattern P='abc' (s is the starting position of the substring):

adbc = a+d+b+c – insertion of 'd' (s=1)
dbc = (a->d)+b+c – replacement of 'a' with 'd' (s=2)
bc = (a)+b+c – deletion of 'a' (s=3)

Page 3: Vakhitov Alexander Approximate Text Indexing

Errors

There are 3 kinds of transformations which introduce errors into the initial string: insertion, replacement and deletion. If we transform S into S' with such changes, then we can transform S' back into S with the same number of changes.

The minimal number of deletions, insertions and replacements needed to transform string A into string B is called the edit distance between A and B (ed(A,B)).

Example: ed('abc','adbc') = ed('dbc','abc') = ed('bc','abc') = 1; ed('survey','surgery') = 2

survey → surgey (replace 'v' with 'g') → surgery (insert 'r')

Page 4: Vakhitov Alexander Approximate Text Indexing

The resulting algorithm

The algorithm solves the approximate string matching problem in O(n^λ log n) time (n is the size of the text T, λ ∈ (0,1)), if the error level α < 1 − e/√σ, where σ is the size of the alphabet, e = 2.718..., α = k/m, k is the number of errors, and m is the size of the pattern P.
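As a small worked instance of this condition (taking, purely for illustration, σ = 26 as for the English alphabet):

\[
\alpha < 1 - \frac{e}{\sqrt{\sigma}} = 1 - \frac{2.718}{\sqrt{26}} \approx 1 - \frac{2.718}{5.10} \approx 0.47 ,
\]

so up to roughly 47% of the pattern characters may be erroneous before the guarantee stops applying.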

Page 5: Vakhitov Alexander Approximate Text Indexing

Plan of the report

● Some useful ideas & basic algorithms
● The main algorithm
● Analysis of the complexity of the algorithm in different cases

Page 6: Vakhitov Alexander Approximate Text Indexing

Dividing the pattern

Lemma. Let A and B be strings with ed(A,B) ≤ k, and divide A into j substrings A_i. Then at least one of the A_i appears in B with at most ⌊k/j⌋ errors.

Proof idea: we need k changes to transform A into B. Each change affects one of the A_i, so the k changes are distributed among the j substrings => the average number of changes per piece is k/j, and the piece receiving the fewest changes gets at most ⌊k/j⌋ of them.

Example: ed('he likes','they like') = 3 = k; A_1 = 'he ', A_2 = 'likes' => j = 2;
ed('he ','they ') = 2, while ed('likes','like') = 1 = ⌊k/j⌋, so the piece 'likes' appears in B with at most ⌊k/j⌋ errors.
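A minimal Python check of the lemma on this example (the helper names are mine):

    def ed(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(prev[j - 1] if ca == cb
                           else 1 + min(prev[j - 1], prev[j], cur[j - 1]))
            prev = cur
        return prev[-1]

    def best_occurrence(piece, text):
        # fewest errors with which `piece` matches some substring of `text`
        return min(ed(piece, text[s:e])
                   for s in range(len(text) + 1)
                   for e in range(s, len(text) + 1))

    A, B, k = 'he likes', 'they like', 3
    pieces = ['he ', 'likes']                       # j = 2
    assert ed(A, B) == k
    errors = [best_occurrence(p, B) for p in pieces]
    print(errors)                                   # at least one value is <= k // j == 1
    assert min(errors) <= k // len(pieces)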

Page 7: Vakhitov Alexander Approximate Text Indexing

Computing edit distance

There are strings x and y; ed(x,y) = ?
x = x_1 x_2 ... x_m, y = y_1 y_2 ... y_n, x_i, y_j ∈ Σ.
C_{i,j} = ed(x_1..x_i, y_1..y_j); C_{0..|x|, 0..|y|} is the matrix filled with the values C_{i,j}.

Computing C_{i,j}:

C_{0,j} = j;  C_{i,0} = i;
C_{i,j} = C_{i-1,j-1} if x_i = y_j, else 1 + min{ C_{i-1,j-1}, C_{i-1,j}, C_{i,j-1} }.

Example: x = 'survey', y = 'surgery'.

[The slide shows the matrix of C_{i,j} values for this example: green cells mark positions where x_i ≠ y_j, red cells mark x_i = y_j, and arrows show the element used to compute each C_{i,j}.]
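A direct transcription of this recurrence as a minimal Python sketch (the function name is mine):

    def edit_distance_matrix(x, y):
        """Fill the (|x|+1) x (|y|+1) matrix C with C[i][j] = ed(x_1..x_i, y_1..y_j)."""
        m, n = len(x), len(y)
        C = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            C[i][0] = i                      # C_{i,0} = i
        for j in range(n + 1):
            C[0][j] = j                      # C_{0,j} = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if x[i - 1] == y[j - 1]:     # characters match: no new error
                    C[i][j] = C[i - 1][j - 1]
                else:                        # replacement, deletion or insertion
                    C[i][j] = 1 + min(C[i - 1][j - 1], C[i - 1][j], C[i][j - 1])
        return C

    C = edit_distance_matrix('survey', 'surgery')
    print(C[-1][-1])                         # 2, matching the earlier example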

Page 8: Vakhitov Alexander Approximate Text Indexing

Edit distance in the case of text T and pattern P

We need to find a substring of the text T which matches P with the minimal number of errors. Let x be the pattern and y be the text.

The matching text substring can begin at every text position – so we have to initialize C_{0,j} with 0. The rest is kept from the previous task.

The algorithm can store only the last column and analyze the text incrementally from the beginning. It moves left to right through the matrix (one column per text character), filling each column top to bottom with the values C_{i,j}.
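A minimal Python sketch of this column-at-a-time scan (the function name and output format are mine); it reports the 0-based end positions of approximate occurrences:

    def approx_search(pattern, text, k):
        """Report end positions i such that some substring of `text` ending
        at i matches `pattern` with <= k errors, keeping only one column."""
        m = len(pattern)
        col = list(range(m + 1))             # C_{i,0} = i
        matches = []
        for pos, c in enumerate(text):
            new = [0]                        # C_{0,j} = 0: a match may start anywhere
            for i in range(1, m + 1):
                if pattern[i - 1] == c:
                    new.append(col[i - 1])
                else:
                    new.append(1 + min(col[i - 1], col[i], new[i - 1]))
            col = new
            if col[m] <= k:                  # last cell: best match ending here
                matches.append(pos)
        return matches

    print(approx_search('abc', 'adbc', 1))   # [3]: matches of 'abc' in 'adbc' end at the last character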

Page 9: Vakhitov Alexander Approximate Text Indexing

Examples of the matrices

Page 10: Vakhitov Alexander Approximate Text Indexing

Construction of the NFA

We build a Nondeterministic Finite Automaton which searches for text substrings approximately matching the pattern with k errors. It consists of k+1 rows and m+1 columns.

Transitions:

Pattern and text characters are the same (horizontal)

Insert characters into pattern (vertical)

Replace the pattern character with the text one (solid diagonal)

Delete the pattern character (dashed diagonal)

Page 11: Vakhitov Alexander Approximate Text Indexing

Nondeterministic Finite Automaton

This is the automaton for approximate string matching with the pattern 'survey' and 2 errors.
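A set-based simulation of such an automaton (a sketch with my own function names; in practice bit-parallel simulations are used, but the explicit state set makes the transitions easy to see):

    def eps_closure(states, m, k):
        # dashed diagonal: delete a pattern character (advance in the pattern,
        # pay one error, consume no text character)
        stack, closed = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in closed:
                closed.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return closed

    def nfa_search(text, pattern, k):
        """Report 0-based end positions of substrings of `text` that match
        `pattern` with at most k errors, by simulating the NFA state set."""
        m = len(pattern)
        active = eps_closure({(0, 0)}, m, k)   # state (i, e): i pattern chars matched, e errors
        matches = []
        for pos, c in enumerate(text):
            nxt = {(0, 0)}                     # self-loop: a match may start at any text position
            for i, e in active:
                if i < m and pattern[i] == c:  # horizontal: pattern and text characters match
                    nxt.add((i + 1, e))
                if e < k:
                    nxt.add((i, e + 1))        # vertical: insert the text character into the pattern
                    if i < m:
                        nxt.add((i + 1, e + 1))  # solid diagonal: replace the pattern character
            active = eps_closure(nxt, m, k)
            if any(i == m for i, _ in active): # a state in the last column is active
                matches.append(pos)
        return matches

    print(nfa_search("surgery", "survey", 2))  # [4, 5, 6]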

Page 12: Vakhitov Alexander Approximate Text Indexing

Depth-first search

DEF. The “k-neighborhood” is the set of strings that match P with at most k errors:

U_k(P) = { x ∈ Σ* : ed(x,P) ≤ k }

Searching for these strings in the text (without errors) can solve the problem, but |U_k(P)| = O(m^k σ^k) is quite large.

We can determine which strings from U_k(P) appear in the text by traversing the text suffix tree. Here we can use the set U_kt(P): the set of neighborhood elements none of whose proper prefixes belong to the neighborhood, so the search can stop as soon as a match is found.

Page 13: Vakhitov Alexander Approximate Text Indexing

Algorithm for searching on the suffix tree

● Starts from the root
● Considers the string x incrementally
● Determines when ed(P,x) ≤ k
● Determines when ed(P,xy) > k for any y

Page 14: Vakhitov Alexander Approximate Text Indexing

Algorithm for searching on the suffix tree

– Each new character of x corresponds to a new column in the matrix (appending a character to x <=> updating the column in O(m) time).

– A match is detected when the last element of the column is ≤ k.

– x cannot be extended to match P when all the values of the current column are > k (a code sketch follows below).
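A compact illustration of this traversal (a plain trie of all text suffixes stands in for the suffix tree; the search logic is the same, only the data structure is simplified, and all names are mine):

    def build_suffix_trie(text):
        # stand-in for the suffix tree: a trie containing every suffix of the text
        root = {}
        for s in range(len(text)):
            node = root
            for c in text[s:]:
                node = node.setdefault(c, {})
        return root

    def neighborhood_search(text, pattern, k):
        """DFS from the root, returning every string x spelled in the trie
        with ed(pattern, x) <= k (each such x occurs somewhere in the text)."""
        m = len(pattern)
        found = []

        def dfs(node, x, col):
            if col[m] <= k:                    # last element of the column <= k: x matches
                found.append(x)
            if min(col) > k:                   # no extension of x can ever match: prune
                return
            for c, child in node.items():      # one more character of x => one new column
                new = [col[0] + 1]
                for i in range(1, m + 1):
                    if pattern[i - 1] == c:
                        new.append(col[i - 1])
                    else:
                        new.append(1 + min(col[i - 1], col[i], new[i - 1]))
                dfs(child, x + c, new)

        dfs(build_suffix_trie(text), "", list(range(m + 1)))  # column for the empty x
        return found

    print(neighborhood_search("surgery", "survey", 2))  # ['surge', 'surger', 'surgery']

In the real index one would stop descending as soon as a match is reported (the U_kt idea) and report the text positions stored in the subtree below that node.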

Page 15: Vakhitov Alexander Approximate Text Indexing

Algorithm for searching on the suffix tree (illustration)

Page 16: Vakhitov Alexander Approximate Text Indexing

Partitioning the pattern

The cost of the suffix tree search is exponential in m and k, so it is better to perform j searches with pieces of length m/j and k/j errors – that is why we divide the pattern.

So, we divide our pattern into j pieces and search for them using the above algorithm. Then, for each match found ending at text position i, we check the text area [i-m-k .. i+m+k] for complete occurrences of P with at most k errors.

But the larger j is, the more text positions need to be verified; the optimal j is determined later in the analysis.
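A simplified end-to-end sketch of this filter-and-verify scheme (it scans the text directly for the pieces instead of using the suffix tree, so it only illustrates the partitioning and verification logic; all names are mine):

    def approx_end_positions(pattern, text, k):
        # end positions of substrings of `text` matching `pattern` with <= k errors
        m, col, hits = len(pattern), list(range(len(pattern) + 1)), []
        for pos, c in enumerate(text):
            new = [0]
            for i in range(1, m + 1):
                new.append(col[i - 1] if pattern[i - 1] == c
                           else 1 + min(col[i - 1], col[i], new[i - 1]))
            col = new
            if col[m] <= k:
                hits.append(pos)
        return hits

    def partitioned_search(pattern, text, k, j):
        m = len(pattern)
        step = -(-m // j)                                  # ceil(m / j)
        pieces = [pattern[p:p + step] for p in range(0, m, step)]
        matches = set()
        for piece in pieces:
            for i in approx_end_positions(piece, text, k // j):   # cheap filter
                lo, hi = max(0, i - m - k), min(len(text), i + m + k)
                # verify the surrounding text area with the full k errors
                for end in approx_end_positions(pattern, text[lo:hi], k):
                    matches.add(lo + end)
        return sorted(matches)

    print(partitioned_search("survey", "a surgery story", 2, 2))  # [6, 7, 8]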

Page 17: Vakhitov Alexander Approximate Text Indexing

Searching pieces of the pattern

Let's use the NFA with the depth-first search (DFS) technique (the suffixes from the suffix tree will be the input of the automaton).

First, we transform our NFA:
– The initial self-loop isn't needed (earlier it allowed us to start matching at every position of the text);
– We remove the lower-left triangle of the automaton, because we avoid initial insertions into the pattern;
– We can start matching only with the first k+1 pattern characters.

Page 18: Vakhitov Alexander Approximate Text Indexing

The changes to NFA

Page 19: Vakhitov Alexander Approximate Text Indexing

Using suffix array instead of suffix tree

The suffix array can replace the suffix tree in our algorithm. It has lower space requirements, but the time complexity is multiplied by log n.

The suffix array replaces tree nodes with intervals, and moving to a node corresponds to moving to its interval. If a node has children, then the node's interval contains the children's intervals.
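A minimal sketch of this node-to-interval correspondence (naive suffix-array construction, names mine): descending by one character narrows the current interval with two binary searches.

    def build_suffix_array(text):
        # naive construction (sorting all suffixes); fine for a sketch, not for large texts
        return sorted(range(len(text)), key=lambda i: text[i:])

    def descend(text, sa, lo, hi, depth, ch):
        """Narrow the interval [lo, hi) of the suffix array to the suffixes whose
        character at offset `depth` is `ch` (the analogue of following a tree edge)."""
        def char_at(idx):
            p = sa[idx] + depth
            return text[p] if p < len(text) else ""      # shorter suffixes sort first
        l, r = lo, hi                                    # leftmost suffix with char >= ch
        while l < r:
            mid = (l + r) // 2
            if char_at(mid) < ch:
                l = mid + 1
            else:
                r = mid
        new_lo = l
        r = hi                                           # leftmost suffix with char > ch
        while l < r:
            mid = (l + r) // 2
            if char_at(mid) <= ch:
                l = mid + 1
            else:
                r = mid
        return new_lo, l

    text = "surgery"
    sa = build_suffix_array(text)
    lo, hi = 0, len(sa)
    for depth, ch in enumerate("sur"):                   # walk down the virtual path 's' -> 'u' -> 'r'
        lo, hi = descend(text, sa, lo, hi, depth, ch)
    print([sa[i] for i in range(lo, hi)])                # [0]: 'sur' occurs at text position 0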

Page 20: Vakhitov Alexander Approximate Text Indexing

Analysis for the algorithm: the average number of nodes at level l

● For a small l, all the text suffixes (except the last l) are longer than l, so nearly n suffixes reach level l;

● The maximum number of nodes at level l is σ^l, where σ = |Σ|;

● We use the model of n balls randomly thrown into σ^l urns. The average number of filled urns is

σ^l · (1 - (1 - 1/σ^l)^n) ≈ σ^l · (1 - e^(-n/σ^l)) = Θ(min{n, σ^l})

(a small numeric check follows below).
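A quick numeric check of this approximation (σ = 4 here is an arbitrary illustrative choice):

    sigma = 4
    for l in (2, 5, 8):
        for n in (100, 10_000, 1_000_000):
            urns = sigma ** l
            filled = urns * (1 - (1 - 1 / urns) ** n)   # exact expected number of filled urns
            print(f"l={l}  n={n:>9}  sigma^l={urns:>6}  filled={filled:>9.1f}  min(n, sigma^l)={min(n, urns)}")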

Page 21: Vakhitov Alexander Approximate Text Indexing

Probability of processing a given node at depth l in the suffix tree.

If l ≥ m', at least l−k text characters must match the pattern (m' is the pattern size), and if l ≤ m', at least m'−k pattern characters must match the text. We sum the matching probabilities over the different pattern prefixes.

The largest term of the first sum is the first one; applying Stirling's approximation to it, we get a bound...

Page 22: Vakhitov Alexander Approximate Text Indexing

Probability of processing a given node at depth l in the suffix tree.

...which is

γ(α)^l · O(1/l), where γ(α) = 1 / ( σ^(1−α) · α^(2α) · (1−α)^(2(1−α)) ) and α = k/l.

The whole first summation is bounded by l−k times this largest term, so we get (l−k) · γ(α)^l · O(1/l) = O(γ(α)^l). The first summation decreases exponentially if γ(α) < 1. This means that

σ > e² / (1−α)²

(because e^(−1) ≤ α^(α/(1−α)) for α ∈ [0,1]).

Page 23: Vakhitov Alexander Approximate Text Indexing

Probability of processing a given node at depth l in the suffix tree.

...or, equivalently, α < 1 − e/√σ.
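The equivalence with the condition on the previous slide is a direct rearrangement (both sides are positive since σ > 0 and 0 ≤ α < 1):

\[
\sigma > \frac{e^2}{(1-\alpha)^2}
\iff \sqrt{\sigma} > \frac{e}{1-\alpha}
\iff 1-\alpha > \frac{e}{\sqrt{\sigma}}
\iff \alpha < 1 - \frac{e}{\sqrt{\sigma}} .
\]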

The second summation can also be bounded by this O(γ(α)^l). So the upper bound for the probability of processing a given node at depth l in the suffix tree is O(γ(α)^l).

In practice, e should be replaced by c = 1.09 (determined experimentally), because we have only found an upper bound on the probability.

Page 24: Vakhitov Alexander Approximate Text Indexing

Analysis of the single pattern search in the suffix tree

Using the formulas bounding the matching probability, let's consider that at levels l ≤ L(k) all the nodes are visited, while nodes at level l > L(k) are visited with probability O(γ(k/l)^l).

Remember that the average number of visited nodes at level l (for small l) is Θ(min{n, σ^l}).

Page 25: Vakhitov Alexander Approximate Text Indexing

Three cases of analysis

Page 26: Vakhitov Alexander Approximate Text Indexing

The cases of analysis

(a) log_σ n ≤ L(k), i.e. n ≤ σ^L(k) (“small n”): an online search is preferable, no index is needed (since the total work is Θ(n));

(b) m+k < log_σ n, i.e. n > σ^(m+k) (“large n”): the total cost is independent of n;

(c) L(k) ≤ log_σ n ≤ m+k (“intermediate n”): time sublinear in n.

Page 27: Vakhitov Alexander Approximate Text Indexing

Analysis of pattern partitioning

We need to perform j searches and then verify all the possible matches. We again distinguish three cases, analogous to the previous slide:

(a) log_σ n ≤ L(k/j), i.e. n ≤ σ^L(k/j): complexity O(n);

(b) (m+k)/j < log_σ n, i.e. n > σ^((m+k)/j): if the error level α < 1 − e/√σ, the complexity is O(n^λ) for some λ < 1, i.e. sublinear in n (using j = (m+k) / log_σ n; see the numeric illustration below);

(c) with the same error level as in (b) and the same j, we also get sublinear complexity.
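A hypothetical numeric instance of this choice of j (the values n = 10^6, σ = 26, m = 20, k = 4 are mine, purely for illustration):

\[
\log_\sigma n = \frac{\ln 10^6}{\ln 26} \approx \frac{13.82}{3.26} \approx 4.24 ,
\qquad
j = \frac{m+k}{\log_\sigma n} \approx \frac{24}{4.24} \approx 5.7 ,
\]

so one would take j = 6 pieces of length about m/j ≈ 3–4 characters, search each piece with ⌊k/j⌋ = 0 errors (i.e. exactly), and verify every piece occurrence with the full k = 4 errors.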

Page 28: Vakhitov Alexander Approximate Text Indexing

Other types of algorithms

● The limited depth-first search technique determines viable prefixes (the prefixes of possible matches of the pattern) and searches for them in the suffix tree; it is expensive and cannot be implemented on the suffix array.

● Filtering algorithms discard large parts of the text by checking a necessary condition (simpler than the matching condition). Most existing filters are based on finding substrings of the pattern without errors, and they stop working at high error levels.

Page 29: Vakhitov Alexander Approximate Text Indexing

Summary & Conclusions

● The splitting technique balances between traversing too many nodes of the suffix tree and verifying too many text positions

● The resulting index has sublinear retrieval time (O(n^λ), 0 < λ < 1) if the error level is moderate

● In the future, more precise methods may appear for determining the right number of pieces into which the pattern is divided, and there are (and may yet appear) better algorithms for the verification performed after a piece of the pattern is matched.