ETRIETRI
Linear-Time Search in Suffix Arrays
July 14, 2003
Jeong Seop Sim, Dong Kyue Kim
Heejin Park, Kunsoo Park
Suffix arrays
Suffix array of text T: the lexicographically sorted list of all suffixes of text T.
Suffix arrays
Example for T = abbabaababbb#
The suffixes of T
abbabaababbb# (1)
bbabaababbb# (2)
abaababbb# (3)
…
b# (12)
# (13)
are stored in lexicographical order.
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
# is the lexicographically smallest special character.
Suffix arrays
Example for T = abbabaababbb#
In actual suffix arrays, we store only the starting positions of the suffixes in T, but for convenience we assume that the suffixes themselves are stored.
In the table below, the first column is the index in the suffix array and the second column is the starting position of the suffix in T.
1 13 #
2 6 a a b a b b b #
3 4 a b a a b a b b b #
4 7 a b a b b b #
5 1 a b b a b a a b a b b b #
6 9 a b b b #
7 12 b #
8 5 b a a b a b b b #
9 3 b a b a a b a b b b #
10 8 b a b b b #
11 11 b b #
12 2 b b a b a a b a b b b #
13 10 b b b #
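The table above can be reproduced with a short sketch. It is a naive construction that sorts the suffixes directly, not a linear-time construction; the name `build_suffix_array` is illustrative, and the code relies on `#` sorting before the letters in ASCII.

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of the suffixes of text, in sorted order."""
    # Naive construction: compare the suffixes themselves.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [i + 1 for i in order]

T = "abbabaababbb#"
SA = build_suffix_array(T)
for index, pos in enumerate(SA, start=1):
    # index in the suffix array, starting position in T, the suffix itself
    print(index, pos, T[pos - 1:])
```

Storing only `SA` takes one integer per suffix; the suffix text is recovered from T when needed.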
Suffix arrays
Definition: s-suffixes are the suffixes starting with string s.
Examples: a-suffixes, ba-suffixes, …
In the suffix array of T = abbabaababbb#, the a-suffixes are entries 2 to 6 and the ba-suffixes are entries 8 to 10.
Suffix arrays vs. Suffix trees
Construction time: Suffix Array = Suffix Tree
Space: Suffix Array = Suffix Tree. In practice, suffix arrays are more space efficient than suffix trees.
Search time (p = |P|, n = |T|, |Σ| = alphabet size)
Suffix Array: O(p + log n), O(p|Σ|)
Suffix Tree: O(p log |Σ|)
Contribution
Construction time: Suffix Array = Suffix Tree
Space: Suffix Array = Suffix Tree. In practice, suffix arrays are more space efficient than suffix trees.
Search time
Suffix Array: O(p + log n), O(p|Σ|), O(p log |Σ|)
Suffix Tree: O(p log |Σ|)
The meaning of our contribution
Construction time and space: Suffix Array = Suffix Tree. In practice, suffix arrays are more space efficient than suffix trees.
Search time: with the O(p log |Σ|) bound, suffix arrays match suffix trees.
Suffix arrays are more powerful than suffix trees.
Our search algorithm
Search in a suffix array
Definition: Search in a suffix array
Input: a pattern P and a suffix array of T
Output: all P-suffixes of T
Search in a suffix array
A search example
P = ab
T = abbabaababbb#
Find all ab-suffixes.
All ab-suffixes are neighbors: they are entries 3 to 6 of the suffix array.
Search in a suffix array
A search example
P = ab
T = abbabaababbb#
We only have to find the first and the last ab-suffixes, because the other ab-suffixes are stored between them.
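Because the P-suffixes are contiguous, two binary searches are enough to locate the first and the last one. The sketch below is the classic O(p log n) approach, shown only to illustrate this observation; it is not the backward search of this talk. `find_range` and the naive `build_suffix_array` helper are illustrative names.

```python
def build_suffix_array(text):
    # Naive construction, 1-based starting positions.
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]

def find_range(text, sa, pattern):
    """Return (first, last) 1-based entries of the pattern-suffixes, or None."""
    p = len(pattern)

    def prefix(k):
        # First p characters of the suffix stored at 0-based array index k.
        start = sa[k] - 1
        return text[start:start + p]

    # Leftmost index whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Leftmost index whose prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return (first + 1, lo) if first < lo else None

T = "abbabaababbb#"
SA = build_suffix_array(T)
print(find_range(T, SA, "ab"))
```

Each comparison costs up to p character comparisons, which is where the p log n factor comes from.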
Related work
In developing our search algorithm, we adopt the idea suggested by Ferragina and Manzini (FOCS 2001).
Search for a pattern in a file compressed by the Burrows-Wheeler compression algorithm.
Search P from the last character to the first character of P.
P = ababaaabbabaaabb
We adopt this backward pattern searching idea.
Algorithm outline
Outline of our search algorithm
P = aba
T = abbabaababbb#
We find all aba-suffixes by searching P backward.
Our algorithm has p stages. (In this case, there are 3 stages.)
Algorithm outline
Outline of our search algorithm
P = aba
T = abbabaababbb#
We find all aba-suffixes by searching P backward.
Stage 1: find all a-suffixes (entries 2 to 6).
Stage 2: find all ba-suffixes (entries 8 to 10).
Stage 3: find all aba-suffixes (entries 3 and 4).
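The three stages can be traced with a brute-force sketch that rescans the whole array at every stage. It only shows what each stage outputs, not how the search algorithm computes it; `stage_ranges` is an illustrative name.

```python
def stage_ranges(text, pattern):
    """For each stage, return (suffix of pattern, first entry, last entry), 1-based."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    result = []
    # Stage k looks for the suffixes that start with the last k characters of the pattern.
    for k in range(1, len(pattern) + 1):
        s = pattern[-k:]
        entries = [i for i, suf in enumerate(suffixes, start=1) if suf.startswith(s)]
        result.append((s, entries[0], entries[-1]) if entries else (s, None, None))
    return result

stages = stage_ranges("abbabaababbb#", "aba")
for stage in stages:
    print(stage)
```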
Elaborate stage 2
We explain a stage by elaborating stage 2.
P = aba
We find all ba-suffixes using the a-suffixes found in stage 1.
We find the first ba-suffix from the first a-suffix and the last ba-suffix from the last a-suffix.
Elaborate stage 2
P = aba
We only explain how to find the first ba-suffix from the first a-suffix. Finding the last ba-suffix is similar.
Elaborate stage 2
P = aba
To find the first ba-suffix, we count the number of suffixes that precede the ba-suffixes in the suffix array. In the example, 7 suffixes precede the ba-suffixes, so the first ba-suffix is entry 8.
Elaborate stage 2
Suffixes preceding the ba-suffixes are divided into two categories.
- A-type: Suffixes starting with characters lexicographically smaller than b (#-suffixes, a-suffixes). In the example, these are entries 1 to 6.
- B-type: Suffixes starting with the same character b and preceding the ba-suffixes. In the example, this is entry 7 (b#).
We count A-type and B-type suffixes in different ways.
Count the number of A-type suffixes
The number of A-type suffixes = the number of #-suffixes and a-suffixes = the position of the last a-suffix (6 in the example).
Count the number of A-type suffixes
We generate an array that stores the positions of the last #-suffix, the last a-suffix, and the last b-suffix.
With this array, we can count A-type suffixes in O(1) time.
# 1
a 6
b 13
Count the number of A-type suffixes
Array
# 1
a 6
b 13
Space: O(|Σ|)
Time: O(n) (one scan)
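A minimal sketch of this array, called `M` here after the name used in the summary. It assumes the suffix array is available as the sorted suffixes themselves; `build_M` and `a_type_counts` are illustrative names, and the second helper simply precomputes the answer for every character so that a lookup is a single access.

```python
def build_M(sorted_suffixes):
    """M[c] = 1-based position of the last c-suffix, filled in one scan."""
    M = {}
    for pos, suffix in enumerate(sorted_suffixes, start=1):
        M[suffix[0]] = pos  # later positions overwrite earlier ones
    return M

def a_type_counts(M):
    """A[c] = number of suffixes whose first character is smaller than c."""
    A, previous_last = {}, 0
    for ch in sorted(M):
        A[ch] = previous_last  # last position of the preceding character
        previous_last = M[ch]
    return A

T = "abbabaababbb#"
suffixes = sorted(T[i:] for i in range(len(T)))
M = build_M(suffixes)
A = a_type_counts(M)
print(M)
print(A["b"])  # number of A-type suffixes when the current character is b
```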
Count the number of B-type suffixes
B-type suffixes: b-suffixes preceding the ba-suffixes.
Count the number of B-type suffixes
B-type suffixes: b-suffixes preceding the ba-suffixes.
A suffix generated by removing the leftmost b from a B-type suffix appears in the suffix subarray that precedes the a-suffixes found in stage 1.
In the example, the only B-type suffix is b# (entry 7); removing its leftmost b gives # (entry 1), which precedes the a-suffixes (entries 2 to 6).
Count the number of B-type suffixes
The number of B-type suffixes is the number of suffixes, in the suffix subarray that precedes the a-suffixes, whose previous characters are b.
We count this with array N.
Let U be the conceptual array of previous characters of suffixes.
i   U  suffix
1   b  #
2   b  a a b a b b b #
3   b  a b a a b a b b b #
4   a  a b a b b b #
5   #  a b b a b a a b a b b b #
6   b  a b b b #
7   b  b #
8   a  b a a b a b b b #
9   b  b a b a a b a b b b #
10  a  b a b b b #
11  b  b b #
12  a  b b a b a a b a b b b #
13  a  b b b #
Count the number of B-type suffixes
Array N
N[i,b] stores the number of suffixes whose previous characters are b in the suffix subarray SA[1,i].
N has n|Σ| entries.
i   U  #  a  b
1   b  0  0  1
2   b  0  0  2
3   b  0  0  3
4   a  0  1  3
5   #  1  1  3
6   b  1  1  4
7   b  1  1  5
8   a  1  2  5
9   b  1  2  6
10  a  1  3  6
11  b  1  3  7
12  a  1  4  7
13  a  1  5  7
Example: N[7,b] = 5
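The arrays U and N can be computed as in the following sketch. It assumes, as the U column of the example suggests, that the suffix starting at position 1 takes the final `#` as its previous character; `previous_characters` and `build_N` are illustrative names, and an extra all-zero row N[0] is added for convenience.

```python
def previous_characters(text, sa):
    """U[i] = character preceding the suffix at entry i (wrapping around for position 1)."""
    # Positions in sa are 1-based; for position 1 the index -1 wraps to the final '#'.
    return [text[pos - 2] for pos in sa]

def build_N(U, alphabet):
    """N[i][c] = number of occurrences of c in U[1..i]; N[0] is all zeros."""
    N = [dict.fromkeys(alphabet, 0)]
    for ch in U:
        row = dict(N[-1])
        row[ch] += 1
        N.append(row)
    return N

T = "abbabaababbb#"
SA = [13, 6, 4, 7, 1, 9, 12, 5, 3, 8, 11, 2, 10]
U = previous_characters(T, SA)
N = build_N(U, "#ab")
print("".join(U))
print(N[7]["b"])
```

This full table has one row per entry and one column per character, which is the n|Σ| entries mentioned above.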
Count the number of B-type suffixes
We can count B-type suffixes in O(1) time by accessing an entry of N.
In stage 2, the a-suffixes start at entry 2, so the number of B-type suffixes is N[1,b] = 1.
Array N
Space: O(n|Σ|)
An alternative way
Space: O(n)
O(log |Σ|) time for counting B-type suffixes.
Query for N[i,b]
Counting B-type suffixes
O(log n) time
O(log |Σ|) time
Query for N[i,b]: O(log n) time
In the O(log n) time algorithm, we generate an array whose ith entry stores the location of the ith b in U.
U = b b b a # b b a b a b a a
i  location of the ith b in U
1  1
2  2
3  3
4  6
5  7
6  9
7  11
Query for N[i,b]: O(log n) time
Example: count the suffixes whose previous characters are b in SA[1,8], that is, count the b's in U[1,8].
Find the largest value not exceeding 8 in the array of locations (1, 2, 3, 6, 7, 9, 11). This value is 7.
To find 7 in this array, we perform binary search, which takes O(log n) time.
The index of 7 in the array, which is 5, is the number of b's in U[1,8].
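The query can be sketched with Python's standard `bisect` module. The names `occurrence_lists` and `count_in_prefix` are illustrative; the sketch builds one location array per character and answers a query with a single binary search.

```python
from bisect import bisect_right

def occurrence_lists(U):
    """For each character, the sorted 1-based locations where it occurs in U."""
    lists = {}
    for i, ch in enumerate(U, start=1):
        lists.setdefault(ch, []).append(i)
    return lists

def count_in_prefix(lists, i, ch):
    """Number of occurrences of ch in U[1..i], found by binary search."""
    # bisect_right returns how many stored locations are <= i.
    return bisect_right(lists.get(ch, []), i)

U = "bbba#bbababaa"
lists = occurrence_lists(U)
print(lists["b"])
print(count_in_prefix(lists, 8, "b"))
```

Every location of U appears in exactly one list, so the lists hold n values in total.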
Query for N[i,b]: O(log n) time
Generally, we require arrays for all characters.
#: 5
a: 4, 8, 10, 12, 13
b: 1, 2, 3, 6, 7, 9, 11
O(n) space
Query for N[i,b]
O(log n) time
O(log |Σ|) time
Query for N[i,b]: O(log |Σ|) time
Divide U into |Σ|-sized blocks.
For the last character of each block, we compute the entries of N.
i   #  a  b
3   0  0  3
6   1  1  4
9   1  2  6
12  1  4  7
Query for N[i,b]: O(log |Σ|) time
For the other entries in each block, we generate a data structure similar to the one used in the O(log n) time algorithm.
Binary search takes O(log |Σ|) time.
Still O(n) space in total.
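One possible reading of the block scheme is sketched below; the slides do not give implementation details, so the layout is an assumption. Counts are stored at block boundaries, and inside each block a short sorted list of offsets is kept per character. `build_blocked_rank` and `rank` are illustrative names, and dictionary lookups are assumed to take constant time.

```python
from bisect import bisect_right

def build_blocked_rank(U, alphabet):
    """Store N only at block boundaries, plus per-block offset lists for each character."""
    s = len(alphabet)                              # block size |Σ|
    running = dict.fromkeys(alphabet, 0)
    checkpoints = [dict(running)]                  # counts before the first block
    blocks = []
    for start in range(0, len(U), s):
        in_block = {}
        for offset, ch in enumerate(U[start:start + s], start=1):
            in_block.setdefault(ch, []).append(offset)
            running[ch] += 1
        blocks.append(in_block)
        checkpoints.append(dict(running))          # N at the last character of the block
    return s, checkpoints, blocks

def rank(structure, i, ch):
    """Return N[i, ch], the number of occurrences of ch in U[1..i]."""
    s, checkpoints, blocks = structure
    full, rest = divmod(i, s)
    count = checkpoints[full][ch]                  # whole blocks before position i
    if rest:
        # Binary search over at most |Σ| offsets inside the partial block.
        count += bisect_right(blocks[full].get(ch, []), rest)
    return count

U = "bbba#bbababaa"
structure = build_blocked_rank(U, "#ab")
print(rank(structure, 8, "b"))
```

With blocks of |Σ| characters there are about n/|Σ| checkpoints of |Σ| counts each, and the offset lists hold n values overall, which keeps the total at O(n).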
Summary
p stages
Each stage
- Count A-type suffixes. Time: O(1). Space: O(n) for the M array.
- Count B-type suffixes. Time: O(log |Σ|). Space: O(n) for computing the value of an entry of N.
In total, O(p log |Σ|) time with O(n) space.
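Putting the stages together, a backward search can be sketched as follows. For simplicity this sketch uses a naive suffix sort and the full table N (constant-time lookups, n|Σ| entries) rather than the O(n)-space structure; `build_index` and `backward_search` are illustrative names, and the text is assumed to end with a unique `#` that does not occur in the pattern.

```python
def build_index(text):
    """Build the suffix array, the A-type counts and the array N for text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # 0-based starts, naive sort
    alphabet = sorted(set(text))
    # A[c]: number of suffixes whose first character is smaller than c.
    A, total = {}, 0
    for c in alphabet:
        A[c] = total
        total += text.count(c)
    # N[i][c]: occurrences of c among the previous characters U[1..i].
    N = [dict.fromkeys(alphabet, 0)]
    for start in sa:
        row = dict(N[-1])
        row[text[start - 1]] += 1                           # index -1 wraps to the final '#'
        N.append(row)
    return sa, A, N

def backward_search(index, pattern):
    """Return (first, last) 1-based entries of the pattern-suffixes, or None."""
    sa, A, N = index
    first, last = 1, len(sa)                                # every suffix matches the empty string
    for c in reversed(pattern):                             # one stage per character
        if c not in A:
            return None
        first = A[c] + N[first - 1][c] + 1                  # A-type count + B-type count + 1
        last = A[c] + N[last][c]
        if first > last:
            return None
    return first, last

T = "abbabaababbb#"
index = build_index(T)
print(backward_search(index, "aba"))
```

For P = aba the three stages produce the ranges (2, 6), (8, 10) and (3, 4), matching the outline of the example.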
Conclusion
In a suffix array, one can choose the O(p log |Σ|) or the O(p + log n) search time algorithm depending on the alphabet size.
Suffix arrays are more powerful than suffix trees.