SUFFIX TREES: From exact to approximate string matching. 17 December 2003. Luca Bortolussi


An introduction to string matching

String matching is an important branch of algorithmics, and it has applications in many fields, such as:

• Text searching

• Molecular biology

• Data compression

• and so on…

Exact String matching: a brief history

• Naive algorithm

• Knuth-Morris-Pratt (1977)

• Boyer-Moore (1977)

• Suffix Trees: Weiner (1973), McCreight (1976), Ukkonen (1995)

Suffix Trees

Definition: A suffix tree for a string T of length m is a rooted tree such that:

1. It has exactly m leaves, numbered from 1 to m;

2. Every edge has a label, which is a substring of T;

3. Every internal node has at least two children;

4. Labels of two edges starting at an internal node do not start with the same character;

5. The label of the path from the root to the leaf numbered i is the suffix of T starting at position i, i.e. T[i..m]

Suffix Trees - II

[Figure: suffix tree for the string abbcbab#, with leaves numbered 1 to 7.]

Suffix Trees – searching a pattern

[Figure: the suffix tree for abbcbab#, used to search for a pattern.]

Pattern: bcb

Suffix Trees – naive construction

[Figure: naive construction of the suffix tree for abbcbab#, inserting the suffixes abbcbab#, bbcbab#, and so on, one at a time.]
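The naive construction and the pattern search can be sketched in Python as follows. This is only an illustration under stated assumptions: the names Node, naive_suffix_tree and occurrences are hypothetical, edge labels are stored as explicit strings (so space is quadratic), and the input is assumed to end with a unique marker such as #.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child node)
        self.leaf = None    # suffix number for leaves, None for internal nodes


def naive_suffix_tree(t):
    """Insert the suffixes of t one at a time (quadratic time).

    Assumption: t ends with a marker that occurs nowhere else in t.
    """
    root = Node()
    for i in range(len(t)):
        node, suffix = root, t[i:]
        while True:
            first = suffix[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = i + 1  # leaves are numbered from 1
                node.children[first] = (suffix, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
                k += 1
            if k < len(label):
                # The suffix leaves the edge in the middle: split the edge.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, suffix = child, suffix[k:]
    return root


def occurrences(root, pattern):
    """Return the sorted (1-based) starting positions of pattern."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return []
        label, child = node.children[rest[0]]
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return []
        node, rest = child, rest[n:]
    # Every leaf below the point where the pattern ends is an occurrence.
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.leaf is not None:
            found.append(cur.leaf)
        stack.extend(child for _, child in cur.children.values())
    return sorted(found)


tree = naive_suffix_tree("abbcbab#")
print(occurrences(tree, "bcb"))  # [3]
```

The search follows one path from the root, so its cost depends on the pattern length and on the number of occurrences reported, not on the length of the text.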

Suffix Trees – Ukkonen Algorithm

Ukkonen's algorithm was published in 1995. It builds a suffix tree in linear time and performs well in practice.

The basic idea is to construct iteratively the implicit suffix trees for S[1..i], in the following way:

Construct tree 1
For i = 1 to m-1                // phase i+1
    For j = 1 to i+1            // extension j
        Find the end of the path from the root with label S[j..i] in the current tree.
        Extend the path by adding character S(i+1), so that S[j..i+1] is in the tree.

Each extension follows one of the next three rules, where β = S[j..i]:

1. β ends at a leaf. Add S(i+1) at the end of the label of the path to the leaf.

2. At least one path continues from the end of β, but none starts with S(i+1). Add a node at the end of β (if β ends inside an edge) and an edge from that node with label S(i+1), terminating in a leaf with number j.

3. There is a path from the end of β starting with S(i+1). In this case do nothing.

Suffix Trees – Ukkonen Algorithm - II

The main idea to speed up the construction of the tree is the concept of suffix link.

Suffix links are pointers from a node v with path label xα to a node s(v) with path label α (α is a string and x a character).

The interesting feature of suffix trees is that every internal node, except the root, has a suffix link towards another node.

[Figure: the suffix tree for abbcbab#, showing a suffix link from a node v to the node s(v).]

Suffix Trees – Ukkonen Algorithm - III

With suffix links, we can speed up the construction of the ST.

In addition, every edge can be crossed in constant time, just by keeping track of the length of each edge's label. This works because no two edges leaving a node can start with the same character, hence a single comparison is enough to decide which path must be taken.

However, even with suffix links, the complexity is still quadratic.

Suffix Trees – Ukkonen Algorithm - IV

To complete the speed-up of the algorithm, we need the following observations:

• Once a leaf is created, it will remain a leaf forever.

• Once rule 3 is used in a phase, all successive extensions of that phase use it too, hence we can ignore them.

• If in phase i rules 1 and 2 are applied in the first j_i extensions, then in phase i+1 the first j_i extensions can be made in constant time, simply by adding the character S(i+2) at the end of the paths to the first j_i leaves (we will use a global variable e to do this). Hence the extensions are computed explicitly starting from j_i + 1, reducing their total number to at most 2m.

Storing the edge labels explicitly would cost quadratic space. However, each edge needs only constant space, i.e. two pointers, one to the beginning and one to the end of the substring it has as label.
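A compact sketch of the whole construction in Python, under stated assumptions. It uses the common "active point" formulation of Ukkonen's algorithm rather than the explicit phase/extension loops, and the names active_node, active_edge, active_len and remainder are not from the slides. Leaves keep end = None, which plays the role of the global variable e, and every edge label is stored as a pair of 0-based indices (start, end) into S, with end exclusive. The input is assumed to end with a unique marker.

```python
class Node:
    def __init__(self, start, end):
        self.start = start   # label of the edge entering this node: s[start:end]
        self.end = end       # None for a leaf: the label runs to the current end
        self.children = {}   # first character of an edge label -> child node
        self.link = None     # suffix link (treated as the root when unset)


def build_suffix_tree(s):
    root = Node(-1, -1)
    active_node, active_edge, active_len = root, 0, 0
    remainder = 0  # suffixes still waiting to be inserted explicitly
    for i, c in enumerate(s):
        last_new = None  # internal node created in this phase, awaiting a link
        remainder += 1
        while remainder > 0:
            if active_len == 0:
                active_edge = i
            ch = s[active_edge]
            if ch not in active_node.children:
                # Rule 2, no split needed: hang a new leaf on the active node.
                active_node.children[ch] = Node(i, None)
                if last_new is not None:
                    last_new.link = active_node
                    last_new = None
            else:
                nxt = active_node.children[ch]
                end = i + 1 if nxt.end is None else nxt.end
                edge_len = end - nxt.start
                if active_len >= edge_len:
                    # Skip over the whole edge using only its length.
                    active_edge += edge_len
                    active_len -= edge_len
                    active_node = nxt
                    continue
                if s[nxt.start + active_len] == c:
                    # Rule 3: the character is already there; end the phase.
                    active_len += 1
                    if last_new is not None:
                        last_new.link = active_node
                    break
                # Rule 2 with a split: new internal node plus a new leaf.
                split = Node(nxt.start, nxt.start + active_len)
                active_node.children[ch] = split
                split.children[c] = Node(i, None)
                nxt.start += active_len
                split.children[s[nxt.start]] = nxt
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if active_node is root and active_len > 0:
                active_len -= 1
                active_edge = i - remainder + 1
            elif active_node is not root:
                active_node = active_node.link or root
    return root


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children.values())


def contains(root, s, pattern):
    """Check whether pattern labels a path starting at the root."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return False
        end = len(s) if child.end is None else child.end
        label = s[child.start:end]
        chunk = pattern[i:i + len(label)]
        if not label.startswith(chunk):
            return False
        i += len(chunk)
        node = child
    return True


text = "abbcbab#"
tree = build_suffix_tree(text)
print(count_leaves(tree), contains(tree, text, "bcb"))
```

Rule 1 never appears explicitly: because a leaf stores end = None, every existing leaf edge grows by one character at each phase without any work.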

Generalized Suffix Trees

A generalized suffix tree is simply a ST for a set of strings, each one ending with a different marker. The leaves have two numbers, one identifying the string and the other identifying the position inside the string.

S1 = abbc$

S2 = babc#

[Figure: generalized suffix tree for S1 and S2, with leaves labeled (1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3) and (2,4).]

Longest common substring

Let S1 and S2 be two strings over the same alphabet. The Longest Common Substring problem is to find the longest substring of S1 that is also a substring of S2.

Knuth in 1970 conjectured that this problem was Ω(n²).

After building a generalized suffix tree for S1 and S2, to solve the problem one has to identify the nodes which belong to both the suffix tree of S1 and that of S2 (those with leaves of both strings below them) and choose the one with the greatest string depth (the length of the path label from the root to the node). All these operations cost O(n).
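The idea of marking the nodes shared by both strings and taking the deepest one can be illustrated with a small Python sketch. As a simplifying assumption it uses an uncompressed trie of all suffixes instead of a true generalized suffix tree, so it takes quadratic time and space rather than O(n); the function name and the dictionary layout are hypothetical.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node reached by suffixes of both strings."""
    root = {"kids": {}, "owners": set()}
    # Insert every suffix of each string, recording which string reached each node.
    for owner, s in enumerate((s1, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)
    # Walk only through nodes owned by both strings, keeping the deepest path.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, kid in node["kids"].items():
            if len(kid["owners"]) == 2:
                stack.append((kid, path + ch))
    return best


print(longest_common_substring("abbc", "babc"))
```

For abbc and babc there are two longest common substrings of length 2 (ab and bc); which one is returned depends on the traversal order.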

Longest Common Extension

A problem that can be solved efficiently using suffix trees is the Longest Common Extension problem: for a pair of indices (i,j), find the length of the longest substring of T starting at position i that matches a substring of P starting at position j.

After O(n+m) preprocessing, each query can be answered in constant time: build a generalized suffix tree for T and P and, for leaf i of T and leaf j of P, find their lowest common ancestor in the tree (this can be done in constant time after preprocessing the tree). The string depth of that ancestor is the answer.

Hamming and Edit Distances

Hamming Distance: two strings of the same length are aligned and the distance is the number of mismatches between them.

abbcdbaabbc
abbdcbbbaac
H = 6

Edit Distance: it is the minimum number of insertions, deletions and substitutions needed to transform one string into another.

abbcdbaabbc
cbcdbaabc
E = 3
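The Hamming distance of the example pair can be checked with a few lines of Python. This is a minimal sketch and the function name is hypothetical.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


print(hamming_distance("abbcdbaabbc", "abbdcbbbaac"))  # 6
```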

The k-mismatches problem

We have a text T and a pattern P, and we want to find the occurrences of P in T allowing a maximum of k mismatches, i.e. we want to find all the substrings T' of T such that H(P,T') ≤ k.

We can use suffix trees directly, but they no longer perform well: the algorithm scans all the paths to the leaves, keeping track of errors, and abandons a path when the number of errors becomes greater than k.

The algorithm is made faster using longest common extensions. For every suffix of T, the stretches of agreement between the suffix and P are matched one after another until P is exhausted or the errors exceed k. Every stretch is found in constant time. The complexity of the resulting algorithm is O(k|T|).

T = aaacaabaaaaa….
P = aabaab

An occurrence is found at position 2 of T, with one error (c in T against b in P).
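The jumping strategy can be sketched in Python as follows. One important assumption: the longest common extension is computed here by a direct character scan, as a stand-in for the constant-time lowest common ancestor query, so this sketch does not reach the O(k|T|) bound. The function names are hypothetical, and the text is the visible part of the example above.

```python
def lce(t, i, p, j):
    """Length of the longest common extension of t[i:] and p[j:] (0-based)."""
    n = 0
    while i + n < len(t) and j + n < len(p) and t[i + n] == p[j + n]:
        n += 1
    return n


def k_mismatches(t, p, k):
    """Return (1-based position, mismatches) for each occurrence of p in t."""
    hits = []
    for start in range(len(t) - len(p) + 1):
        j, errors = 0, 0
        while True:
            j += lce(t, start + j, p, j)  # jump over a stretch of agreement
            if j >= len(p):               # P is exhausted
                break
            errors += 1                   # position j is a mismatch
            if errors > k:
                break
            j += 1                        # step over the mismatch
        if errors <= k:
            hits.append((start + 1, errors))
    return hits


print(k_mismatches("aaacaabaaaaa", "aabaab", 1))  # [(2, 1), (5, 1)]
```

Each alignment performs at most k + 1 extension queries, which is where the factor k in the complexity comes from once every query costs constant time.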

Inexact Matching

In biology, inexact matching is very important:

• Similarity between DNA sequences often implies that they have the same biological function (the converse is not true);

• Mutations and transcription errors make exact comparison not very useful.

There are many algorithms that deal with inexact matching (with respect to edit distance), and they are mainly based on dynamic programming or on automata. Suffix trees are used as a secondary tool in some of them, because their structure is ill-suited to dealing with insertions and deletions, and even with substitutions.

Most effort is spent on speeding up the average-case behaviour of the algorithms; this is justified by the fact that random sequences often fall in these cases (and DNA sequences have a high degree of randomness).

Dynamic Programming

We aim to compute the edit distance (global alignment) between two strings S and T.

The main idea is to compute the edit distance between every prefix of S and every prefix of T. Let D(i,j) be the distance between S[1..i] and T[1..j]. Of course, the edit distance between S and T is D(n,m), where n=|S| and m=|T|.

The following properties hold:

1. D(i,0) = i, D(0,j) = j;

2. D(i,j) = min { D(i,j-1) + 1, D(i-1,j) + 1, D(i-1,j-1) + t(i,j) }, where t(i,j) = 0 if S(i) = T(j) and 1 otherwise.

Hence in O(mn) time we can compute a matrix which encodes not only the edit distance, but also the way to transform one string into the other (just by keeping track, by means of pointers, of which elements realize the minimum).
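The recurrence translates almost directly into code. A minimal Python sketch follows; the function name is hypothetical, and it returns the full matrix without the traceback pointers.

```python
def edit_distance_table(s, t):
    """Fill D(i,j) for all prefixes of s (rows) and t (columns)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # i deletions turn s[:i] into the empty string
    for j in range(m + 1):
        D[0][j] = j  # j insertions turn the empty string into t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1  # this is t(i,j)
            D[i][j] = min(D[i][j - 1] + 1,
                          D[i - 1][j] + 1,
                          D[i - 1][j - 1] + cost)
    return D


for row in edit_distance_table("ARE", "CASE"):
    print(row)
```

For the words ARE and CASE the last row is 3 3 3 3 2, so their edit distance is 2.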

Dynamic Programming II

        C   A   S   E
    0   1   2   3   4
A   1   1   1   2   3
R   2   2   2   2   3
E   3   3   3   3   2

Non-Deterministic Automata

[Figure: non-deterministic automaton for the pattern CASE, drawn as three rows of states, each labeled C A S E.]

To recognize the approximate occurrences of a pattern P in a text T, we can build a non-deterministic automaton for P, and run it with T as input. This leads to faster algorithms for the search, but the problem is building the automaton.
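One way to picture such an automaton is to simulate it directly by keeping the set of active states. The Python sketch below assumes a particular layout that may differ from the one in the figure: a state (i, e) means that i characters of P have been matched using e errors, with at most k errors allowed. The function name is hypothetical.

```python
def nfa_search(pattern, text, k):
    """Return the 1-based end positions of occurrences with at most k errors."""
    m = len(pattern)

    def closure(states):
        # Epsilon moves: skip one pattern character at the cost of one error.
        seen, stack = set(states), list(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    active = closure({(0, 0)})
    ends = []
    for pos, c in enumerate(text, start=1):
        nxt = {(0, 0)}  # self-loop: an occurrence may start anywhere
        for i, e in active:
            if i < m and pattern[i] == c:
                nxt.add((i + 1, e))          # match
            if e < k:
                nxt.add((i, e + 1))          # extra character in the text
                if i < m:
                    nxt.add((i + 1, e + 1))  # substitution
        active = closure(nxt)
        if any(i == m for i, _ in active):
            ends.append(pos)
    return ends


print(nfa_search("CASE", "CAE", 1))  # [3]
```

Simulating the state set costs up to O(k|P|) per text character; the faster algorithms mentioned above come from representing or determinizing the automaton more cleverly.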

Longest Common Subsequence

The Longest Common Subsequence of two strings S1 and S2 is the greatest number of characters of S1 that can be aligned to equal characters of S2, preserving their order.

It is a global alignment problem, which is obviously connected with edit distance. Often it is modelled with a scoring scheme, which gives a positive score to matches and a negative one to mismatches, insertions and deletions. So the best global alignment is the one which maximizes the total score. Given the best global alignment under a suitable scoring scheme (for example, 1 for a match and 0 otherwise), its matches form a longest common subsequence.

a b b c d a b b a

a b _ c b a b _ a
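The length of the longest common subsequence can be computed with the same kind of table used for edit distance. A minimal Python sketch, with a hypothetical function name, applied to the two strings of the alignment above:

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence of s1 and s2."""
    n, m = len(s1), len(s2)
    # L[i][j] is the LCS length of the prefixes s1[:i] and s2[:j].
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]


print(lcs_length("abbcdabba", "abcbaba"))  # 6
```

The value 6 agrees with the alignment shown, which has six matching columns, one mismatch and two gaps.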

The k-differences problem

This problem is to find all the occurrences of a pattern P in a text T, allowing a maximum number of k insertions, deletions or substitutions.

The Landau-Vishkin algorithm solves it in O(k|T|) time, and implements a hybrid dynamic programming technique, which uses suffix trees to solve a subproblem.

The algorithm looks for paths in the dynamic programming matrix (which start in the upper row), in particular for d-paths, which are paths that specify exactly d mismatches and spaces.

Some of these paths are computed, for d ≤ k, and the ones that reach the bottom row correspond to approximate occurrences of P in T, with exactly d mismatches or spaces.
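To make the matrix concrete, here is a Python sketch of the plain O(|P||T|) dynamic programming baseline for this problem. It is not the Landau-Vishkin algorithm: it fills every cell instead of following d-paths. It does show the two features the d-paths rely on, a top row of zeros (an occurrence may start anywhere in T) and a bottom row whose entries not exceeding k mark the ends of occurrences. The function name is hypothetical.

```python
def k_differences(pattern, text, k):
    """Return (1-based end position, differences) for each approximate occurrence."""
    m = len(pattern)
    prev = list(range(m + 1))  # column for the empty text prefix: D(i,0) = i
    ends = []
    for j, c in enumerate(text, start=1):
        cur = [0]  # top row is zero: a path may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur.append(min(prev[i] + 1,          # extra character in the text
                           cur[i - 1] + 1,       # pattern character skipped
                           prev[i - 1] + cost))  # match or substitution
        if cur[m] <= k:  # bottom row reached with at most k differences
            ends.append((j, cur[m]))
        prev = cur
    return ends


print(k_differences("CASE", "CAE", 1))  # [(3, 1)]
```

Only two columns are kept at a time, so the space used is O(|P|).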

Inexact Matching, a new approach

Suffix trees work very well for exact matching, but they fail when we admit errors in the matching process. This happens because the only way to find approximate occurrences of a pattern, when we search for it in a suffix tree, is to walk down every path, keeping track of errors and discarding the paths which exceed the tolerance level previously chosen.

A different approach may be to define a different data structure, similar to suffix trees, which encodes in some way a concept of distance, in particular the Hamming Distance.

A possible way is to shift from the alphabet Σ to the alphabet Σ^k, encoding the distance in a relation between letters: two letters of Σ^k are said to be “equivalent” if and only if their Hamming distance is less than a given threshold.

Equivalence between letters

Let's show an example of this idea of equivalence, with Σ = {0,1} and k = 3. So, we can build the following table for A3:

If the distance between two letters is less than or equal to 1, we define them equivalent. For example a ≡ b, b ≡ d, but NOT(a ≡ d).
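A small Python sketch of this equivalence follows. It assumes that the eight binary strings of length 3 are named a to h in lexicographic order (a = 000, b = 001, c = 010, d = 011, and so on). This naming is an assumption, chosen because it agrees with the examples a ≡ b, b ≡ d and NOT(a ≡ d); the function names are hypothetical too.

```python
from itertools import product


def macro_letters(alphabet, k):
    """All strings of length k over the alphabet, in lexicographic order."""
    return ["".join(chars) for chars in product(alphabet, repeat=k)]


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def equivalent(u, v, threshold=1):
    """Two letters are equivalent when their Hamming distance is at most threshold."""
    return hamming(u, v) <= threshold


# Assumed naming: a = 000, b = 001, c = 010, d = 011, ..., h = 111.
names = dict(zip("abcdefgh", macro_letters("01", 3)))

for x, y in [("a", "b"), ("b", "d"), ("a", "d")]:
    print(x, y, equivalent(names[x], names[y]))
```

The three printed pairs show that the relation is not transitive: a is equivalent to b and b to d, but a is not equivalent to d.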

Bundled Suffix Trees

Given this equivalence relation (which is not transitive), we want to incorporate it in a tree structure.

For simplicity, we assume that the tree for the sequence S is the smallest tree which contains, for every substring of S, all the exact paths and all the equivalent paths that can be found in S. For “historical reasons”, we will call it a Bundled Suffix Tree.

Definition: A bundled suffix tree for a string S of length m is a rooted tree such that:

• It has exactly m leaves, numbered from 1 to m;

• Every edge has a label, which is a substring of S;

• Every node has a set of labels, which is a subset of {1,2,..,m} extended with one special symbol;

• The tree obtained by deleting all nodes which do not have the special symbol as a label is the suffix tree for S;

• For every substring P of S, the subtree rooted at the end of the path labeled with P has node labels whose union (discarding the special symbol) gives all the starting positions of substrings of S equivalent to P;

• In every path from the root to a leaf no two nodes can be labelled with the same number.

Bundled Suffix Trees - II

[Figure: bundled suffix tree for the string abbcda#, with sets of position numbers as node labels.]

Open Problems

1. Do BSTs work well for Hamming distance? (They seem to need a distributed distance.)

2. How can BSTs be used to manage approximate searching using edit distance? At what price?

3. What is the expected average number of “red” nodes? Is it linear or does it grow quadratically?

4. Is there a linear algorithm for building a BST?

5. Do BSTs manage to improve existing algorithms, or is the interest just theoretical?
