Page 1: Vakhitov Alexander Approximate Text Indexing

Vakhitov Alexander

Approximate Text Indexing. Using simple mathematical arguments, the matching probabilities in the suffix tree are bounded, and by a clever division of the search pattern sub-linear time is achieved.

Based on “A Hybrid Method for Approximate String Matching” by G. Navarro and R. Baeza-Yates.

Page 2: Vakhitov Alexander Approximate Text Indexing

The Task

The task is to find substrings of a long text T that approximately match our pattern P.

For example, we have text T='adbc' and pattern P='abc' (s is the starting position of the substring):

adbc = a+d+b+c – insertion of 'd' (s=1)
dbc = (a->d)+b+c – replacement of 'a' with 'd' (s=2)
bc = (a)+b+c – deletion of 'a' (s=3)

Page 3: Vakhitov Alexander Approximate Text Indexing

Errors

There are 3 kinds of transformations which introduce errors into the initial string: insertion, replacement and deletion. If we transform S into S' with such changes, then we can transform S' back into S with the same number of changes.

The minimal number of deletions, insertions and replacements needed to transform string A into string B is called the edit distance between A and B (ed(A,B)).

Example: ed('abc','adbc') = ed('dbc','abc') = ed('bc','abc') = 1; ed('survey','surgery') = 2

survey → surgey (replace 'v' with 'g') → surgery (insert 'r')

Page 4: Vakhitov Alexander Approximate Text Indexing

The resulting algorithm

The algorithm solves the approximate string matching problem in O(n^λ log n) time (n is the size of the text T, λ ∈ (0,1)), if the error level α < 1 − e/√σ, where σ is the size of the alphabet, e = 2.718..., α = k/m, k is the number of errors, and m is the size of the pattern P.
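As a small worked instance of this condition (taking, purely for illustration, σ = 26 as for the English alphabet):

\[
\alpha < 1 - \frac{e}{\sqrt{\sigma}} = 1 - \frac{2.718}{\sqrt{26}} \approx 1 - \frac{2.718}{5.10} \approx 0.47 ,
\]

so up to roughly 47% of the pattern characters may be erroneous before the guarantee stops applying.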

Page 5: Vakhitov Alexander Approximate Text Indexing

Plan of the report

● Some useful ideas & basic algorithms
● The main algorithm
● Analysis of the complexity of the algorithm in different cases

Page 6: Vakhitov Alexander Approximate Text Indexing

Dividing the pattern

Lemma. Let A and B be strings with ed(A,B) ≤ k, and divide A into j substrings A_i. Then at least one of the A_i appears in B with at most ⌊k/j⌋ errors.

Proof idea: we need k changes to transform A into B. Each change affects one of the A_i, so the k changes are distributed among the j substrings => the average number of changes per piece is k/j, and the piece receiving the fewest changes gets at most ⌊k/j⌋ of them.

Example: ed('he likes','they like') = 3 = k; A_1 = 'he ', A_2 = 'likes' => j = 2;
ed('he ','they ') = 2, while ed('likes','like') = 1 = ⌊k/j⌋, so the piece 'likes' appears in B with at most ⌊k/j⌋ errors.
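A minimal Python check of the lemma on this example (the helper names are mine):

    def ed(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(prev[j - 1] if ca == cb
                           else 1 + min(prev[j - 1], prev[j], cur[j - 1]))
            prev = cur
        return prev[-1]

    def best_occurrence(piece, text):
        # fewest errors with which `piece` matches some substring of `text`
        return min(ed(piece, text[s:e])
                   for s in range(len(text) + 1)
                   for e in range(s, len(text) + 1))

    A, B, k = 'he likes', 'they like', 3
    pieces = ['he ', 'likes']                       # j = 2
    assert ed(A, B) == k
    errors = [best_occurrence(p, B) for p in pieces]
    print(errors)                                   # at least one value is <= k // j == 1
    assert min(errors) <= k // len(pieces)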

Page 7: Vakhitov Alexander Approximate Text Indexing

Computing edit distance

There are strings x and y; ed(x,y) = ?
x = x_1 x_2 ... x_m, y = y_1 y_2 ... y_n, x_i, y_j ∈ Σ.
C_{i,j} = ed(x_1..x_i, y_1..y_j); C_{0..|x|, 0..|y|} is the matrix filled with the values C_{i,j}.

Computing C_{i,j}:

C_{0,j} = j;  C_{i,0} = i;
C_{i,j} = C_{i-1,j-1} if x_i = y_j, else 1 + min{ C_{i-1,j-1}, C_{i-1,j}, C_{i,j-1} }.

Example: x = 'survey', y = 'surgery'.

[The slide shows the matrix of C_{i,j} values for this example: green cells mark positions where x_i ≠ y_j, red cells mark x_i = y_j, and arrows show the element used to compute each C_{i,j}.]
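A direct transcription of this recurrence as a minimal Python sketch (the function name is mine):

    def edit_distance_matrix(x, y):
        """Fill the (|x|+1) x (|y|+1) matrix C with C[i][j] = ed(x_1..x_i, y_1..y_j)."""
        m, n = len(x), len(y)
        C = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            C[i][0] = i                      # C_{i,0} = i
        for j in range(n + 1):
            C[0][j] = j                      # C_{0,j} = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if x[i - 1] == y[j - 1]:     # characters match: no new error
                    C[i][j] = C[i - 1][j - 1]
                else:                        # replacement, deletion or insertion
                    C[i][j] = 1 + min(C[i - 1][j - 1], C[i - 1][j], C[i][j - 1])
        return C

    C = edit_distance_matrix('survey', 'surgery')
    print(C[-1][-1])                         # 2, matching the earlier example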

Page 8: Vakhitov Alexander Approximate Text Indexing

Edit distance in the case of text T and pattern P

We need to find a substring of the text T which matches P with the minimal number of errors. Let x be the pattern and y be the text.

The matching text substring can begin at every text position – so we have to initialize C_{0,j} with 0. The rest is kept from the previous task.

The algorithm can store only the last column and analyze the text incrementally from the beginning. It moves left to right through the matrix (one column per text character), filling each column top to bottom with the values C_{i,j}.
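A minimal Python sketch of this column-at-a-time scan (the function name and output format are mine); it reports the 0-based end positions of approximate occurrences:

    def approx_search(pattern, text, k):
        """Report end positions i such that some substring of `text` ending
        at i matches `pattern` with <= k errors, keeping only one column."""
        m = len(pattern)
        col = list(range(m + 1))             # C_{i,0} = i
        matches = []
        for pos, c in enumerate(text):
            new = [0]                        # C_{0,j} = 0: a match may start anywhere
            for i in range(1, m + 1):
                if pattern[i - 1] == c:
                    new.append(col[i - 1])
                else:
                    new.append(1 + min(col[i - 1], col[i], new[i - 1]))
            col = new
            if col[m] <= k:                  # last cell: best match ending here
                matches.append(pos)
        return matches

    print(approx_search('abc', 'adbc', 1))   # [3]: matches of 'abc' in 'adbc' end at the last character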

Page 9: Vakhitov Alexander Approximate Text Indexing

Examples of the matrices

Page 10: Vakhitov Alexander Approximate Text Indexing

Construction of the NFA

We build a Nondeterministic Finite Automaton which searches for text substrings approximately matching the pattern with k errors. It consists of k+1 rows and m+1 columns.

Transitions:

Pattern and text characters are the same (horizontal)

Insert characters into pattern (vertical)

Replace the pattern character with the text one (solid diagonal)

Delete the pattern character (dashed diagonal)

Page 11: Vakhitov Alexander Approximate Text Indexing

Nondeterministic Finite Automaton

This is the automaton for approximate string matching with the pattern 'survey' and 2 errors.
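A set-based simulation of such an automaton (a sketch with my own function names; in practice bit-parallel simulations are used, but the explicit state set makes the transitions easy to see):

    def eps_closure(states, m, k):
        # dashed diagonal: delete a pattern character (advance in the pattern,
        # pay one error, consume no text character)
        stack, closed = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in closed:
                closed.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return closed

    def nfa_search(text, pattern, k):
        """Report 0-based end positions of substrings of `text` that match
        `pattern` with at most k errors, by simulating the NFA state set."""
        m = len(pattern)
        active = eps_closure({(0, 0)}, m, k)   # state (i, e): i pattern chars matched, e errors
        matches = []
        for pos, c in enumerate(text):
            nxt = {(0, 0)}                     # self-loop: a match may start at any text position
            for i, e in active:
                if i < m and pattern[i] == c:  # horizontal: pattern and text characters match
                    nxt.add((i + 1, e))
                if e < k:
                    nxt.add((i, e + 1))        # vertical: insert the text character into the pattern
                    if i < m:
                        nxt.add((i + 1, e + 1))  # solid diagonal: replace the pattern character
            active = eps_closure(nxt, m, k)
            if any(i == m for i, _ in active): # a state in the last column is active
                matches.append(pos)
        return matches

    print(nfa_search("surgery", "survey", 2))  # [4, 5, 6]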

Page 12: Vakhitov Alexander Approximate Text Indexing

Depth-first search

DEF. The “k-neighborhood” is the set of strings that match P with at most k errors:

U_k(P) = { x ∈ Σ* : ed(x,P) ≤ k }

Searching for these strings in the text (without errors) can solve the problem, but |U_k(P)| = O(m^k σ^k) is quite large.

We can determine which strings from U_k(P) appear in the text by traversing the text suffix tree. Here we can use the set U_kt(P): the set of neighborhood elements none of whose proper prefixes belong to the neighborhood, so the search can stop as soon as a match is found.

Page 13: Vakhitov Alexander Approximate Text Indexing

Algorithm for searching on the suffix tree

● Starts from the root
● Considers the string x incrementally
● Determines when ed(P,x) ≤ k
● Determines when ed(P,xy) > k for any y

Page 14: Vakhitov Alexander Approximate Text Indexing

Algorithm for searching on the suffix tree

– Each new character of x corresponds to a new column in the matrix (appending a character to x <=> updating the column in O(m) time).

– A match is detected when the last element of the column is ≤ k.

– x cannot be extended to match P when all the values of the current column are > k (a code sketch follows below).
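A compact illustration of this traversal (a plain trie of all text suffixes stands in for the suffix tree; the search logic is the same, only the data structure is simplified, and all names are mine):

    def build_suffix_trie(text):
        # stand-in for the suffix tree: a trie containing every suffix of the text
        root = {}
        for s in range(len(text)):
            node = root
            for c in text[s:]:
                node = node.setdefault(c, {})
        return root

    def neighborhood_search(text, pattern, k):
        """DFS from the root, returning every string x spelled in the trie
        with ed(pattern, x) <= k (each such x occurs somewhere in the text)."""
        m = len(pattern)
        found = []

        def dfs(node, x, col):
            if col[m] <= k:                    # last element of the column <= k: x matches
                found.append(x)
            if min(col) > k:                   # no extension of x can ever match: prune
                return
            for c, child in node.items():      # one more character of x => one new column
                new = [col[0] + 1]
                for i in range(1, m + 1):
                    if pattern[i - 1] == c:
                        new.append(col[i - 1])
                    else:
                        new.append(1 + min(col[i - 1], col[i], new[i - 1]))
                dfs(child, x + c, new)

        dfs(build_suffix_trie(text), "", list(range(m + 1)))  # column for the empty x
        return found

    print(neighborhood_search("surgery", "survey", 2))  # ['surge', 'surger', 'surgery']

In the real index one would stop descending as soon as a match is reported (the U_kt idea) and report the text positions stored in the subtree below that node.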

Page 15: Vakhitov Alexander Approximate Text Indexing

Algorithm for searching on the suffix tree (illustration)

Page 16: Vakhitov Alexander Approximate Text Indexing

Partitioning the pattern

The cost of the suffix tree search is exponential in m and k, so it is better to perform j searches with pieces of length m/j and k/j errors – that is why we divide the pattern.

So, we divide our pattern into j pieces and search for them using the above algorithm. Then, for each match found ending at text position i, we check the text area [i-m-k .. i+m+k] for complete occurrences of P with at most k errors.

But the larger j is, the more text positions need to be verified; the optimal j is determined later in the analysis.
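A simplified end-to-end sketch of this filter-and-verify scheme (it scans the text directly for the pieces instead of using the suffix tree, so it only illustrates the partitioning and verification logic; all names are mine):

    def approx_end_positions(pattern, text, k):
        # end positions of substrings of `text` matching `pattern` with <= k errors
        m, col, hits = len(pattern), list(range(len(pattern) + 1)), []
        for pos, c in enumerate(text):
            new = [0]
            for i in range(1, m + 1):
                new.append(col[i - 1] if pattern[i - 1] == c
                           else 1 + min(col[i - 1], col[i], new[i - 1]))
            col = new
            if col[m] <= k:
                hits.append(pos)
        return hits

    def partitioned_search(pattern, text, k, j):
        m = len(pattern)
        step = -(-m // j)                                  # ceil(m / j)
        pieces = [pattern[p:p + step] for p in range(0, m, step)]
        matches = set()
        for piece in pieces:
            for i in approx_end_positions(piece, text, k // j):   # cheap filter
                lo, hi = max(0, i - m - k), min(len(text), i + m + k)
                # verify the surrounding text area with the full k errors
                for end in approx_end_positions(pattern, text[lo:hi], k):
                    matches.add(lo + end)
        return sorted(matches)

    print(partitioned_search("survey", "a surgery story", 2, 2))  # [6, 7, 8]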

Page 17: Vakhitov Alexander Approximate Text Indexing

Searching pieces of the pattern

Let's use the NFA with the depth-first search (DFS) technique (the suffixes from the suffix tree will be the input of the automaton).

First, we transform our NFA:
– The initial self-loop isn't needed (earlier it allowed us to start matching at every position of the text);
– We remove the lower-left triangle of the automaton, because we avoid initial insertions into the pattern;
– We can start matching only with the first k+1 pattern characters.

Page 18: Vakhitov Alexander Approximate Text Indexing

The changes to NFA

Page 19: Vakhitov Alexander Approximate Text Indexing

Using suffix array instead of suffix tree

The suffix array can replace the suffix tree in our algorithm. It has lower space requirements, but the time complexity is multiplied by log n.

The suffix array replaces tree nodes with intervals, and moving to a node corresponds to moving to its interval. If a node has children, then the node's interval contains the children's intervals.
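A minimal sketch of this node-to-interval correspondence (naive suffix-array construction, names mine): descending by one character narrows the current interval with two binary searches.

    def build_suffix_array(text):
        # naive construction (sorting all suffixes); fine for a sketch, not for large texts
        return sorted(range(len(text)), key=lambda i: text[i:])

    def descend(text, sa, lo, hi, depth, ch):
        """Narrow the interval [lo, hi) of the suffix array to the suffixes whose
        character at offset `depth` is `ch` (the analogue of following a tree edge)."""
        def char_at(idx):
            p = sa[idx] + depth
            return text[p] if p < len(text) else ""      # shorter suffixes sort first
        l, r = lo, hi                                    # leftmost suffix with char >= ch
        while l < r:
            mid = (l + r) // 2
            if char_at(mid) < ch:
                l = mid + 1
            else:
                r = mid
        new_lo = l
        r = hi                                           # leftmost suffix with char > ch
        while l < r:
            mid = (l + r) // 2
            if char_at(mid) <= ch:
                l = mid + 1
            else:
                r = mid
        return new_lo, l

    text = "surgery"
    sa = build_suffix_array(text)
    lo, hi = 0, len(sa)
    for depth, ch in enumerate("sur"):                   # walk down the virtual path 's' -> 'u' -> 'r'
        lo, hi = descend(text, sa, lo, hi, depth, ch)
    print([sa[i] for i in range(lo, hi)])                # [0]: 'sur' occurs at text position 0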

Page 20: Vakhitov Alexander Approximate Text Indexing

Analysis for the algorithm: the average number of nodes at level l

● For a small l, all the text suffixes (except the last l) are longer than l, so nearly n suffixes reach level l;

● The maximum number of nodes at level l is σ^l, where σ = |Σ|;

● We use the model of n balls randomly thrown into σ^l urns. The average number of filled urns is

σ^l · (1 - (1 - 1/σ^l)^n) ≈ σ^l · (1 - e^(-n/σ^l)) = Θ(min{n, σ^l})

(a small numeric check follows below).
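A quick numeric check of this approximation (σ = 4 here is an arbitrary illustrative choice):

    sigma = 4
    for l in (2, 5, 8):
        for n in (100, 10_000, 1_000_000):
            urns = sigma ** l
            filled = urns * (1 - (1 - 1 / urns) ** n)   # exact expected number of filled urns
            print(f"l={l}  n={n:>9}  sigma^l={urns:>6}  filled={filled:>9.1f}  min(n, sigma^l)={min(n, urns)}")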

Page 21: Vakhitov Alexander Approximate Text Indexing

Probability of processing a given node at depth l in the suffix tree.

If l ≥ m', at least l−k text characters must match the pattern (m' is the pattern size), and if l ≤ m', at least m'−k pattern characters must match the text. We sum the matching probabilities over the different pattern prefixes.

The largest term of the first sum is the first one; applying Stirling's approximation to it, we get a bound...

Page 22: Vakhitov Alexander Approximate Text Indexing

Probability of processing a given node at depth l in the suffix tree.

...which is

γ(α)^l · O(1/l), where γ(α) = 1 / ( σ^(1−α) · α^(2α) · (1−α)^(2(1−α)) ) and α = k/l.

The whole first summation is bounded by l−k times this largest term, so we get (l−k) · γ(α)^l · O(1/l) = O(γ(α)^l). The first summation decreases exponentially if γ(α) < 1. This means that

σ > e² / (1−α)²

(because e^(−1) ≤ α^(α/(1−α)) for α ∈ [0,1]).

Page 23: Vakhitov Alexander Approximate Text Indexing

Probability of processing a given node at depth l in the suffix tree.

...or, equivalently, α < 1 − e/√σ.
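The equivalence with the condition on the previous slide is a direct rearrangement (both sides are positive since σ > 0 and 0 ≤ α < 1):

\[
\sigma > \frac{e^2}{(1-\alpha)^2}
\iff \sqrt{\sigma} > \frac{e}{1-\alpha}
\iff 1-\alpha > \frac{e}{\sqrt{\sigma}}
\iff \alpha < 1 - \frac{e}{\sqrt{\sigma}} .
\]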

The second summation can also be bounded by this O(γ(α)^l). So the upper bound for the probability of processing a given node at depth l in the suffix tree is O(γ(α)^l).

In practice, e should be replaced by c = 1.09 (determined experimentally), because we have only found an upper bound on the probability.

Page 24: Vakhitov Alexander Approximate Text Indexing

Analysis of the single pattern search in the suffix tree

Using the formulas bounding the matching probability, let's consider that at levels l ≤ L(k) all the nodes are visited, while nodes at level l > L(k) are visited with probability O(γ(k/l)^l).

Remember that the average number of visited nodes at level l (for small l) is Θ(min{n, σ^l}).

Page 25: Vakhitov Alexander Approximate Text Indexing

Three cases of analysis

Page 26: Vakhitov Alexander Approximate Text Indexing

The cases of analysis

(a) log_σ n ≤ L(k), i.e. n ≤ σ^L(k) (“small n”): an online search is preferable, no index is needed (since the total work is Θ(n));

(b) m+k < log_σ n, i.e. n > σ^(m+k) (“large n”): the total cost is independent of n;

(c) L(k) ≤ log_σ n ≤ m+k (“intermediate n”): time sublinear in n.

Page 27: Vakhitov Alexander Approximate Text Indexing

Analysis of pattern partitioning

We need to perform j searches and then verify all the possible matches. We again distinguish three cases, analogous to the previous slide:

(a) log_σ n ≤ L(k/j), i.e. n ≤ σ^L(k/j): complexity O(n);

(b) (m+k)/j < log_σ n, i.e. n > σ^((m+k)/j): if the error level α < 1 − e/√σ, the complexity is O(n^λ) for some λ < 1, i.e. sublinear in n (using j = (m+k) / log_σ n; see the numeric illustration below);

(c) with the same error level as in (b) and the same j, we also get sublinear complexity.
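A hypothetical numeric instance of this choice of j (the values n = 10^6, σ = 26, m = 20, k = 4 are mine, purely for illustration):

\[
\log_\sigma n = \frac{\ln 10^6}{\ln 26} \approx \frac{13.82}{3.26} \approx 4.24 ,
\qquad
j = \frac{m+k}{\log_\sigma n} \approx \frac{24}{4.24} \approx 5.7 ,
\]

so one would take j = 6 pieces of length about m/j ≈ 3–4 characters, search each piece with ⌊k/j⌋ = 0 errors (i.e. exactly), and verify every piece occurrence with the full k = 4 errors.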

Page 28: Vakhitov Alexander Approximate Text Indexing

Other types of algorithms

● The limited depth-first search technique determines viable prefixes (the prefixes of possible matches of the pattern) and searches for them in the suffix tree; it is expensive and cannot be implemented on the suffix array.

● Filtering algorithms discard large parts of the text by checking a necessary condition (simpler than the matching condition). Most existing filters are based on finding substrings of the pattern without errors, and they stop working at high error levels.

Page 29: Vakhitov Alexander Approximate Text Indexing

Summary & Conclusions

● The splitting technique balances between traversing too many nodes of the suffix tree and verifying too many text positions

● The resulting index has sublinear retrieval time (O(n^λ), 0 < λ < 1) if the error level is moderate

● In the future, more precise methods may appear for determining the right number of pieces into which the pattern is divided, and there are (and may yet appear) better algorithms for the verification performed after a piece of the pattern is matched.