40
1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249 Advisor: Prof. R. C. T. Lee Speaker: C. W. Lu

1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

Embed Size (px)

Citation preview

Page 1: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

1

Approximate String Matching Using Compressed Suffix Arrays

Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249

Advisor: Prof. R. C. T. Lee

Speaker: C. W. Lu

Page 2: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

2

• Let x and y be two strings. Edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements to covert string x to y.

• k-difference string matching problem:– Given a text T with length n, a pattern P with lengt

h m, and an error bound k.– Find all position i of T such that there exists an suf

fix S of T(1, i), d(S, P) ≦ k.

Page 3: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

3

• The approach of this paper is as the follows:

• Given a pattern P and an error bound k, we generate all possible P’s which contain (≦k) errors deduced from P.

• Then we conduct an exact match of all such P’s against T.

Page 4: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

4

• Example:

T=abbaaa,

P=aba and k=1.

From P and k, we generate the following P’s:

ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abb, aba.

Page 5: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

5

• Then we conduct an exact matching of all P’s against T. Any success indicates that there is a substring S in T such that d(S,T)≦k.

• How can we generate all P’s which we want?

• We use the following observation.

Page 6: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

6

T

P

S2

Let S be a substring of T, and S= S1S2.

P = P1P2.

If d(S1, P1) ≦k, and Dist(S2, P2) = 0,

d(S, P) ≦ k.

S1

S

P1 P2

Page 7: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

7

Example:

T A C A C A A A A A C A C C

1 2 3 4 5 6 7 8 9 10 11 12 13

A G A B C AP1 2 3 4 5 6

k = 2

Consider the substring S = T(6, 11) = AAAACA,

Let S1 = T(6, 9) = AAAA, and S2 = T(10, 11) = CA.

Dist(S1, P1) = 2 ≦k, and Dist(S2, P2) = 0.

We have Dist(S, P) = 2 ≦k.

S1

P1

S2

P2

Page 8: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

8

Example:

T A C A C A A A A A C A C C

1 2 3 4 5 6 7 8 9 10 11 12 13

A G A B C AP1 2 3 4 5 6

k = 2

Consider the substring S = T(8, 11) = AACA,

Let S1 = T(8, 9) = AA, and S2 = T(10, 11) = CA.

Dist(S1, P1) = 2 ≦k, and Dist(S2, P2) = 0.

We have Dist(S, P) = 2 ≦k.

S1

P1

S2

P2

Page 9: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

9

• Based upon the above observation, we can generate all edited pattern P’s by editing the prefix and keeping the suffix untouched, in some manner.

• Consider P=aba, k=1.

Page 10: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

10

• P=aba, k=1.

P = aba

ba (Deletion) k = 1

i = 1 aaba (Insertion) k = 1baba (Insertion) k = 1

bba (Substution) k = 1

aba k = 0

i = 2

aa (Deletion) k = 1aaba (Insertion) k = 1abba (Insertion) k = 1

aaa (Substution) k = 1

aba k = 0ab (Deletion) k = 1abaa (Insertion) k = 1abba (Insertion) k = 1

abb (Substution) k = 1

aba k = 0

i = 3

i = 4

abaa (Insertion) k = 1abab (Insertion) k = 1

Page 11: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

11

• P=aba, k=2.

P = aba

ba (Deletion) k = 1

i = 1 aaba (Insertion) k = 1baba (Insertion) k = 1

bba (Substution) k = 1

aba k = 0

i = 2

aa (Deletion) k = 1aaba (Insertion) k = 1abba (Insertion) k = 1

aaa (Substution) k = 1

aba k = 0ab (Deletion) k = 1abaa (Insertion) k = 1abba (Insertion) k = 1

abb (Substution) k = 1

aba k = 0

i = 3

i = 4

abaa (Insertion) k = 1abab (Insertion) k = 1

Page 12: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

12

• P=aba, k=2.

ba

(k = 1)

a (Deletion) k = 2i = 2 aba (Insertion) k = 2

bba (Insertion) k = 2

aa (Substution) k = 2

ba k = 1

i = 3

b (Deletion) k = 2baa (Insertion) k = 2bba (Insertion) k = 2

bb (Substution) k = 2

ba k = 1

i = 4

baa (Insertion) k = 2

bab (Insertion) k = 2

Page 13: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

13

For i=1 to m+1

PL’ PR’P’

k’=Dist(PL’, PL)≦k.

Dist(PR’, PR) = 0

iPL’ PR’

P’

iPL

PR

P

Deletion, k’++

A

PL’ PR’

P’

CP’…

Replacement , k’++

A

PL’ PR’

P’

CP’…

Insertion, k’++

PL’ PR’

P’ No operation.

i

Terminate if k’ > k.

Page 14: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

14

• Our problem now becomes the following: Given a pattern P, we produce a modified pattern P’. Our job is to determine whether P’ exactly matches some substring of T or not.

• For example, Suppose P=aba. We have ba as one of the modified patterns. So, we like to find out whether ba matches exactly with a substring in T.

Page 15: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

15

• This exact matching can be found by using the suffix array and the inverse suffix array.

Page 16: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

16

Suffix Array

• Let , where t0, t1, …tn-1 an alphabet A and tn=$ is a special symbol that is not in A and smaller than any symbol in A.

• The jth suffix of T is defined as T(j, n) = tj…tn and is denoted by Tj.

• The suffix array SA[0..n] of T is an array of integers j that represent suffix Tj and the integers are sorted in lexicographic order of corresponding suffixes.

nn- t...tttT 110

Page 17: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

17

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Suffixes of T:

{GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $}

Lexicographic order:

$, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$.

= T9, T1, T3, T2, T7, T8, T0, T4, T6, T5

SA[i]

9 1 3 2 7 8 0 4 6 5

0 1 2 3 4 5 6 7 8 9i

Page 18: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

18

Inverse Suffix Array

• The inverse suffix array of T is denoted as SA-1[i].• SA-1[i] equals the number of suffix which are

lexicographically smaller then Ti.

Page 19: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

19

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Lexicographic order: $

(T9)ACAGTTCG$ (T1)AGTTCG$

(T3)CAGTTCG$

(T2)CG$

(T7)G$

(T8)GACAGTTCG$

(T0)GTTCG$

(T4)TCG$

(T6)TTCG$.

(T5)

SA[i]9

1

3

2

7

8

0

4

6

5

0

1

2

3

4

5

6

7

8

9

i SA-1[i]6

1

3

2

7

9

8

4

5

0

SA-1[SA[x] ] = x.

SA-1[0]=6 because there are 6 suffixes smaller than T0=

GACAGTTCG.

Page 20: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

20

• The size of SA and SA-1 are O(nlogn) bits. Both data structures can be constructed in linear time[13, 15, 17].

Page 21: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

21

• In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix Tj for j = SA[st], SA[st+1], …, SA[ed].

We write [st..ed ] = range(T, P).

Page 22: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

22

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Lexicographic order: $

(T9)ACAGTTCG$ (T1)AGTTCG$

(T3)CAGTTCG$

(T2)CG$

(T7)G$

(T8)GACAGTTCG$

(T0)GTTCG$

(T4)TCG$

(T6)TTCG$.

(T5)

SA[i]9

1

3

2

7

8

0

4

6

5

0

1

2

3

4

5

6

7

8

9

i P = G.

G is a prefix of T8, T0 and T4.

T8 = TSA[5]

T0 = TSA[6]

T4 = TSA[7]

st=5, ed=7,

range(T, P) = [5..7].

Page 23: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

23

Lemma 1 (Gusfild [12])

Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval[st’..ed’] = range(T, Pc) can be computed in O(logn) time.

Page 24: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

24

Lemma 2

Given the interval [st1..ed1] = range(T , P1) and the interval [st2..ed2] = range(T , P2), we can find the interval [st..ed] = range(T , P1P2) in O(logn) time using the suffix array and the inverse suffix array of T.

Page 25: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

25

Let [st1..ed1] = range(T , P1),

[st2..ed2] = range(T , P2),

[st..ed] = range(T , P1P2).

[st..ed] is a subinterval of [st1..ed1].

Page 26: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

26

Example:

T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Lexicographic order: $

(T9)ACAGTTCG$ (T1)AGTTCG$

(T3)CAGTTCG$

(T2)CG$

(T7)G$

(T8)GACAGTTCG$

(T0)GTTCG$

(T4)TCG$

(T6)TTCG$.

(T5)

SA[i]9

1

3

2

7

8

0

4

6

5

0

1

2

3

4

5

6

7

8

9

iP1 = G. P2 = A.

range(T, P1) = [5..7].

range(T, P1P2) must be

within [5..7].

How can we find the

exact interval with [5..7]?

Page 27: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

27

• By the definition of suffix array, the lexicographic order of are increasing.

• The lexicographic order of

are also increasing.

][]1[][ 111 edSAstSAstSA , ..., T, TT

||][||]1[||][ 111111 PedSAPstSAPstSA , ..., T, TT

Page 28: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

28

Lexicographic order: $

(T9)ACAGTTCG$ (T1)AGTTCG$

(T3)CAGTTCG$

(T2)CG$

(T7)G$

(T8)GACAGTTCG$

(T0)GTTCG$

(T4)TCG$

(T6)TTCG$.

(T5)

T2 = CAGTTCG$

T2+1 = T3 = AGTTCG$

T2+1 is obtained by deleting the prefix with length 1 from T2.

In general, Ti+1 can be obtained by deleting the prefix with length 1 from Ti.

Page 29: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

29

Example:T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Lexicographic order: $

(T9)ACAGTTCG$

(T1)AGTTCG$

(T3)CAGTTCG$

(T2)CG$

(T7)G$

(T8)GACAGTTCG$

(T0)GTTCG$

(T4)TCG$

(T6)TTCG$. (T5)

SA[i]9

1

3

2

7

8

0

4

6

5

0

1

2

3

4

5

6

7

8

9

i P1 = G. P2 = A.

range(T, P1) = [5..7].

][]1[][ 111 edSAstSAstSA , ..., T, TT

T8 < T0 < T4

||][||]1[||][ 111111 PedSAPstSAPstSA , ..., T, TT

T8+1, T0+1, T4+1

T9 < T1 < T5

Page 30: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

30

• The lexicographic order of

are also increasing.

• Thus

• To find st and ed, we find the smallest st such that and the largest ed such that

||][||]1[||][ 111111 PedSAPstSAPstSA , ..., T, TT

|]|][[ ... |]|1][[ |]|][[ 11-1

11-1

11-1 PedSASAPstSASAPstSASA

21-1

2 |]|][[ edPstSASAst . |]|][[ 21

-12 edPedSASAst

Page 31: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

31

Example:T G A C A G A T C G $

0 1 2 3 4 5 6 7 8 9

Lexicographic order: $

(T9)ACAGTTCG$

(T1)AGTTCG$

(T3)ATCG$. (T5)CAGTTCG$

(T2)CG$

(T7)G$

(T8)GACAGTTCG$

(T0)GATCG$

(T4)TCG$

(T6)

SA[i]9

1

3

5

2

7

8

0

4

6

0

1

2

3

4

5

6

7

8

9

i P1 = G. P2 = A.

range(T, P1) = [6..8].

6 ≦ st, ed ≦ 8

SA-1[i]7

1

4

2

8

3

9

5

6

0

range(T, P2) = [1..3].

range(T, P1P2) = [st..ed].

st = 7 and ed = 8.

3 1 1 1, 1][7][-1 SASA

3 3 1 3, 1][8][-1 SASA

1 0 0, 1][6][-1 SASA

Page 32: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

32

• To find the interval of the first character of P:

We construct an array C such that for any c in A, C[c] stores the total number of occurrences of all c’ in T, where c’ ≦ c.

range(T, p1) = [C[c2]+1 … C[c]] where c2 is a character immediately before c in A.

Page 33: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

33

Example:T G A C A G T T C G $

0 1 2 3 4 5 6 7 8 9

Lexicographic order: $

(T9)ACAGTTCG$

(T1)AGTTCG$

(T3)CAGTTCG$

(T2)CG$

(T7)G$

(T8)GACAGTTCG$

(T0)GTTCG$

(T4)TCG$

(T6)TTCG$. (T5)

SA[i]9

1

3

2

7

8

0

4

6

5

0

1

2

3

4

5

6

7

8

9

i

P = GACAGCA

C[A] = 2

C[C] = 4

C[G] = 7

C[T] = 9

range(T, p1)

= [C[C]+1…C[G] ]

= [5…7].

Page 34: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

34

• Lemma 3

Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, assume we have in advance the array C, we can find the interval [st’..ed’] = range(T, cP) in O(logn) time.

Page 35: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

35

I Construct Fst [1..m+1] and Fed [1..m+1] such that [Fst [i]..Fed [i]]= range(T ,P[i..m]).II Call kapproximate([0..n], 1, 0, ε, ε).

kapproximate([s’..e’], i, k’, PL’, Υ )begin 1. Given [Fst [i]..Fed [i]] = range(T , P[i..m]) and [s’..e’] = range(T , PL’), by Lemma 2 find [st..ed] = range(T , PL’P[i..m]). 2. Report occurrences of P∗ = PL’P[i..m] in [st..ed] if the interval exists. 3. If (k’ = k) return. 4. For j :=i to m+1 (a) (when j ≦m, deletion at j) Call kapproximate([s’..e’], j+1, k’+1, PL’, dΥ). (b) (when j ≦ m, replacement at j ) for each c in A i. Given [s’..e’] = range(T , PL’), by Lemma 1 find [s’’..e’’] = range(T , PL’c). ii. Call kapproximate([s’’..e’’], j+1, k’+1, PL’c, rΥ). (c) (insertion at j) for each c in A i. Given [s’..e’] = range(T , PL’), by Lemma 1 find [s’’..e’’] = range(T , PL’c). ii. Call kapproximate([s’’..e’’], j, k’+1, PL’c, iΥ). (d) (when j≦m) Given [s’..e’] = range(T , PL’), by Lemma 1 find [s’’..e’’] = range(T , PL’P[j]). s’ := s’’; e’ := e’’; PL’ := PL’P[j]; Υ := uΥ;end

Page 36: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

36

• After an O(n) time preprocessing the text T into an O(nlogn)-bit data structure, the algorithm solves the k-difference problem in O(|A|kmklogn + outputtime) time.

Page 37: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

37

References

• [1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in: Proc.

• Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181–192.

• [2] A. Amir, M. Lewenstein, Ely. Porat, Faster algorithms for string matching with k mismatches, in: Proc. 11th Ann. ACM-SIAM Symp. on

• Discrete Algorithms, 2000, pp. 794–803.• [3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in: Pro

c. Seventh Ann. Symp. on Combinatorial Pattern• Matching (CPM’96), pp. 1–23.• [4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in: CLE

I, vol. 1, November 1997, pp. 273–282.• [5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762–772.• [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products.

in: ESA 2000, pp. 120–131.• [7] A. Cobbs, Fast approximate matching using suffix trees. in: Proc. Sixth Ann. Symp. on Co

mbinatorial Pattern Matching (CPM’95), Lecture• Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41–54.• [8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and

don’t cares, in: Proc. 36th Ann. ACM Symp. on• Theory of Computing, 2004, pp. 91–100.• [9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in: Proc. 41st IE

EE Symp. on Foundations of Computer Science• (FOCS’00), 2000, pp. 390–398.

Page 38: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

38

• [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland,

• 1992.• [11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text i

ndexing and string matching, in: Proc. 32nd ACM• Symp. on Theory of Computing, 2000, pp. 397–406.• [12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Compu

tational Biology, Cambridge University Press,• Cambridge, 1997.• [13] W.K. Hon, K. Sadakane,W.K. Sung. Breaking a time-and-space barrier in constructing fu

ll-text indices, in: Proc. IEEE Symp. on Foundations• of Computer Science, 2003.• [14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts. i

n: Proc. MFCS’91, Lecture Notes in Computer Science,• vol. 520, Springer, Berlin, 1991, pp. 240–248.• [15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in: CPM

2003, pp. 186–199.• [16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (197

7) 323–350.• [17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays. in: CPM 2003, p

p. 200–210.• [18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorit

hms 10 (1989) 157–169.• [19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. C

omput. 22 (5) (1993) 935–948.

Page 39: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

39

• [20] E.M. MCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262–272.

• [21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31–88.

• [22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in: Proc. 10th Ann. Symp. on Combinatorial Pattern

• Matching (CPM’99), pp. 163–185.• [23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matchin

g, J. Discrete Algorithms 1 (1) (2000) 205–239 18.• [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate stri

ng matching, IEEE Data Eng. Bull. 24 (4) (2001)• 19–27.• [25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, i

n: Proc. 11th Ann. Symp. on Combinatorial Pattern• Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.• [26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems,

Genome Informatics 12 (2001) 175–183.• [27] F. Shi, Fast approximate string matching with q-blocks sequences, in: Proc. Third South

American Workshop on String Processing (WSP’96),• Carleton University Press, 1996.• [28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching. in: Proc.

Seventh Ann. Symp. on Combinatorial Pattern Matching• (CPM’96), pp. 50–63.• [29] E. Ukkonen, Approximate matching over suffix trees, in: Proc. Combinatorial Pattern Ma

tching 1993, vol. 4, Springer, Berlin, June 1993,• pp. 228–242.• [30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 16

8–173.

Page 40: 1 Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol

40

Thank you!