ETRI. Linear-Time Search in Suffix Arrays. July 14, 2003. Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park.

Linear-Time Search in Suffix Arrays


Page 1: Linear-Time Search in Suffix Arrays

ETRI

Linear-Time Search in Suffix Arrays

July 14, 2003

Jeong Seop Sim, Dong Kyue Kim

Heejin Park, Kunsoo Park

Page 2: Linear-Time Search in Suffix Arrays


Suffix arrays

Suffix array of text T

The lexicographically sorted list of all suffixes of text T

Page 3: Linear-Time Search in Suffix Arrays


Suffix arrays

Example for T = abbabaababbb#

The suffixes of T

abbabaababbb# (1)

bbabaababbb# (2)

babaababbb# (3)

...

b# (12)

# (13)

are stored in lexicographical order.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

# is the lexicographically smallest special character.

Page 4: Linear-Time Search in Suffix Arrays


Suffix arrays

Example for T = abbabaababbb#

The suffixes of T are

abbabaababbb# (1)

bbabaababbb# (2)

babaababbb# (3)

...

b# (12)

# (13)

In actual suffix arrays, we store only the starting positions of the suffixes in T, but for convenience we assume that the suffixes themselves are stored.

 i   start   suffix
 1   13      #
 2   6       a a b a b b b #
 3   4       a b a a b a b b b #
 4   7       a b a b b b #
 5   1       a b b a b a a b a b b b #
 6   9       a b b b #
 7   12      b #
 8   5       b a a b a b b b #
 9   3       b a b a a b a b b b #
10   8       b a b b b #
11   11      b b #
12   2       b b a b a a b a b b b #
13   10      b b b #
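The table above can be reproduced with a few lines of Python. This is a minimal sketch that sorts the suffixes directly, which is far from the fastest construction; the function name `build_suffix_array` is an assumption made for illustration. It relies on `#` sorting before `a` and `b` in ASCII, which matches the convention of the slides.

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of the suffixes of text,
    listed in lexicographical order of the suffixes."""
    # Naive construction: compare whole suffixes. Fine for a small example.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


T = "abbabaababbb#"
SA = build_suffix_array(T)
print(SA)  # [13, 6, 4, 7, 1, 9, 12, 5, 3, 8, 11, 2, 10]
```

The printed list is the second column of the table (the starting positions).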

Page 5: Linear-Time Search in Suffix Arrays


Suffix arrays

Definition: s-suffixes

Suffixes starting with string s

a-suffixes, ba-suffixes, …

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 6: Linear-Time Search in Suffix Arrays


Suffix arrays vs. Suffix trees

Construction time

Suffix Array = Suffix Tree

Space

Suffix Array = Suffix Tree

In practice, suffix arrays are more space efficient than suffix trees.

Search time (p = |P|, n = |T|)

Suffix Array: O(p + log n), O(p|Σ|)

Suffix Tree: O(p log |Σ|)

Page 7: Linear-Time Search in Suffix Arrays


Contribution

Construction time: Suffix Array = Suffix Tree

Space: Suffix Array = Suffix Tree

In practice, suffix arrays are more space efficient than suffix trees.

Search time

Suffix Array: O(p + log n), O(p|Σ|), O(p log |Σ|)

Suffix Tree: O(p log |Σ|)

Page 8: Linear-Time Search in Suffix Arrays


The meaning of our contribution

Construction time: Suffix Array = Suffix Tree

Space: Suffix Array = Suffix Tree

In practice, suffix arrays are more space efficient than suffix trees.

Search time

Suffix Array: O(p + log n), O(p|Σ|), O(p log |Σ|)

Suffix Tree: O(p log |Σ|)

Search time: SA ST

Page 9: Linear-Time Search in Suffix Arrays


The meaning of our contribution

Construction time: Suffix Array = Suffix Tree

Space: Suffix Array = Suffix Tree

In practice, suffix arrays are more space efficient than suffix trees.

Search time

Suffix Array: O(p + log n), O(p|Σ|), O(p log |Σ|)

Suffix Tree: O(p log |Σ|)

Search time: SA ST

Suffix arrays are more powerful than suffix trees.

Page 10: Linear-Time Search in Suffix Arrays


Our search algorithm


Page 11: Linear-Time Search in Suffix Arrays


Search in a suffix array

Definition: Search in a suffix array

Input

A pattern P

A suffix array of T

Output

All P-suffixes of T

Page 12: Linear-Time Search in Suffix Arrays


Search in a suffix array

All ab-suffixes are neighbors.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = ab

T = abbabaababbb#

Find all ab-suffixes.

A search example

Page 13: Linear-Time Search in Suffix Arrays


Search in a suffix array

We only have to find the first and the last ab-suffixes, because the other ab-suffixes are stored between them.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = ab

T = abbabaababbb#

A search example
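For comparison with what follows, the first and last P-suffix can be located with two plain binary searches over the suffix array. This is a sketch of that classical baseline (roughly O(p log n) character comparisons), not the algorithm of this talk; the function name `p_suffix_range` is an assumption, and ranks are 1-based as on the slides.

```python
def p_suffix_range(text, sa, pattern):
    """Return (first, last) 1-based ranks of the pattern-suffixes in sa,
    or None if the pattern does not occur in text."""
    p = len(pattern)

    def prefix(rank0):
        # First p characters of the suffix stored at 0-based rank rank0.
        start = sa[rank0] - 1
        return text[start:start + p]

    # Lower bound: first rank whose p-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose p-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    if first == lo:
        return None
    return first + 1, lo


T = "abbabaababbb#"
SA = [13, 6, 4, 7, 1, 9, 12, 5, 3, 8, 11, 2, 10]  # from the slides
print(p_suffix_range(T, SA, "ab"))  # (3, 6)
```

Rows 3 to 6 of the suffix array are exactly the ab-suffixes, so every occurrence of P is reported by scanning that range.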

Page 14: Linear-Time Search in Suffix Arrays


Related work

In developing our search algorithm, we adopt the idea suggested by Ferragina and Manzini (FOCS 2001).

Their algorithm searches for a pattern in a file compressed by the Burrows-Wheeler compression algorithm.

It searches P from the last character to the first character of P.

P = ababaaabbabaaabb

We adopt this backward pattern searching idea.

Page 15: Linear-Time Search in Suffix Arrays


Algorithm outline

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = aba

T = abbabaababbb#

Outline of our search algorithm

We find all aba-suffixes by searching P backward.

Our algorithm has p stages. (In this case, there are 3 stages.)

Page 16: Linear-Time Search in Suffix Arrays


Algorithm outline

P = aba

T = abbabaababbb#

Outline of our search algorithm

We find all aba-suffixes by searching P backward.

Stage 1: find all a-suffixes.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 17: Linear-Time Search in Suffix Arrays


Algorithm outline

P = aba

T = abbabaababbb#

Outline of our search algorithm

We find all aba-suffixes by searching P backward.

stage 1: find all a-suffixes.

stage 2: find all ba-suffixes.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 18: Linear-Time Search in Suffix Arrays


Algorithm outline

P = aba

T = abbabaababbb#

Outline of our search algorithm

We find all aba-suffixes by searching P backward.

stage 1: find all a-suffixes.

stage 2: find all ba-suffixes.

stage 3: find all aba-suffixes.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #
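The three stages can be checked by brute force. The sketch below only illustrates what each stage must output (the range of ranks for a, then ba, then aba); it scans all suffixes at every stage and is not the efficient method developed on the following slides. The function name `stage_ranges` is an assumption.

```python
def stage_ranges(text, pattern):
    """For each stage k, report (s, first, last): the suffix s of the pattern
    handled at that stage and the 1-based ranks of the first and last
    s-suffix of text (None, None if there is no s-suffix)."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    result = []
    # Stage 1 uses the last character, stage 2 the last two, and so on.
    for k in range(len(pattern) - 1, -1, -1):
        s = pattern[k:]
        ranks = [r for r, suf in enumerate(suffixes, start=1)
                 if suf.startswith(s)]
        if ranks:
            result.append((s, ranks[0], ranks[-1]))
        else:
            result.append((s, None, None))
    return result


print(stage_ranges("abbabaababbb#", "aba"))
# [('a', 2, 6), ('ba', 8, 10), ('aba', 3, 4)]
```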

Page 19: Linear-Time Search in Suffix Arrays


Elaborate stage 2

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = aba

A stage, explained by elaborating stage 2

We find all ba-suffixes using the a-suffixes found in stage 1.

We find the first ba-suffix from the first a-suffix and the last ba-suffix from the last a-suffix.

Page 20: Linear-Time Search in Suffix Arrays


Elaborate stage 2

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

P = aba

A stage, explained by elaborating stage 2

We only explain how to find the first ba-suffix from the first a-suffix. Finding the last ba-suffix is similar.

Page 21: Linear-Time Search in Suffix Arrays


Elaborate stage 2

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

To find the first ba-suffix, we count the number of suffixes that precede ba-suffixes in this suffix array.

P = aba

Page 22: Linear-Time Search in Suffix Arrays


1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Elaborate stage 2

Suffixes preceding ba-suffixes are divided into two categories.

- A-type: Suffixes starting with characters lexicographically smaller than b, i.e. #-suffixes and a-suffixes (rows 1 to 6 above).

- B-type: Suffixes starting with the same character b and preceding ba-suffixes (row 7 above).

We count A-type and B-type suffixes in different ways.
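The two categories can be made concrete with a brute-force count. This is only a sketch to check the definitions on the example; the function name `count_types` is an assumption, and the scan over all suffixes is exactly what the later slides avoid.

```python
def count_types(text, target):
    """Return (a_type, b_type, first_rank) for the target-suffixes of text.

    a_type: suffixes whose first character is smaller than target[0].
    b_type: suffixes that start with target[0] and precede all target-suffixes.
    first_rank: 1-based rank where the first target-suffix would be stored.
    """
    suffixes = sorted(text[i:] for i in range(len(text)))
    a_type = sum(1 for s in suffixes if s[0] < target[0])
    b_type = sum(1 for s in suffixes if s[0] == target[0] and s < target)
    return a_type, b_type, a_type + b_type + 1


print(count_types("abbabaababbb#", "ba"))  # (6, 1, 8)
```

Six A-type suffixes plus one B-type suffix precede the ba-suffixes, so the first ba-suffix is at rank 8.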

Page 23: Linear-Time Search in Suffix Arrays


Count the number of A-type suffixes

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

The number of A-type suffixes = The number of #-suffixes and a-suffixes = The position of the last a-suffix.


Page 24: Linear-Time Search in Suffix Arrays


Count the number of A-type suffixes

We generate an array that stores the positions of the last #-suffix, the last a-suffix, and the last b-suffix.

With this array, we can count A-type suffixes in O(1) time.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

# 1

a 6

b 13

Page 25: Linear-Time Search in Suffix Arrays


Count the number of A-type suffixes

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Array

Space: O(|Σ|)

Time: O(n) (one scan)

# 1

a 6

b 13
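The array of last positions can be filled in one scan over the suffix array. The sketch below assumes 1-based ranks and uses a dictionary named `M` (the summary slide refers to this array as M); the helper `count_a_type` is a hypothetical name for the A-type lookup.

```python
def last_positions(text, sa):
    """M[c] = 1-based rank of the last suffix that starts with character c."""
    M = {}
    for rank, start in enumerate(sa, start=1):
        # Ranks are visited in increasing order, so the final write wins.
        M[text[start - 1]] = rank
    return M


def count_a_type(M, c):
    """Number of suffixes starting with a character smaller than c.
    This scans M for clarity; a real implementation would precompute
    the predecessor of each character to make the lookup O(1)."""
    return max((M[d] for d in M if d < c), default=0)


T = "abbabaababbb#"
SA = [13, 6, 4, 7, 1, 9, 12, 5, 3, 8, 11, 2, 10]  # from the slides
M = last_positions(T, SA)
print(M)                     # {'#': 1, 'a': 6, 'b': 13}
print(count_a_type(M, "b"))  # 6
```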

Page 26: Linear-Time Search in Suffix Arrays


Count the number of B-type suffixes

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Count B-type suffixes: b-suffixes preceding ba-suffixes.

Page 27: Linear-Time Search in Suffix Arrays


Count the number of B-type suffixes

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

B-type suffixes: b-suffixes preceding ba-suffixes.

The suffix obtained by removing the leftmost b from a B-type suffix appears in the part of the suffix array that precedes the a-suffixes found in stage 1.

Page 28: Linear-Time Search in Suffix Arrays


1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Count the number of B-type suffixes

The number of B-type suffixes is the number of suffixes that lie in the suffix subarray preceding the a-suffixes and whose previous characters are b.

We count this with array N.

Let U be the conceptual array of the previous characters of the suffixes.

U = b b b a # b b a b a b a a

Page 29: Linear-Time Search in Suffix Arrays


1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Count the number of B-type suffixes

U = b b b a # b b a b a b a a

Array N (n|Σ| entries)

N[i,b] stores the number of suffixes whose previous characters are b in the suffix subarray SA[1,i]. For example, N[7,b] = 5.

 i   U[i]   #   a   b
 1   b      0   0   1
 2   b      0   0   2
 3   b      0   0   3
 4   a      0   1   3
 5   #      1   1   3
 6   b      1   1   4
 7   b      1   1   5
 8   a      1   2   5
 9   b      1   2   6
10   a      1   3   6
11   b      1   3   7
12   a      1   4   7
13   a      1   5   7
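U and N can both be derived from T and the suffix array. The following sketch assumes that the suffix starting at position 1 is treated as being preceded by the final `#`, which is what the U column on the slide shows; the function names are assumptions for illustration.

```python
def previous_chars(text, sa):
    """U[i] = character that precedes the suffix of rank i+1 in text."""
    # For start == 1 the index -1 wraps around to the final '#'.
    return [text[start - 2] for start in sa]


def build_N(U, alphabet):
    """N[i-1][c] = number of occurrences of c in U[1..i] (1-based i)."""
    counts = {c: 0 for c in alphabet}
    N = []
    for ch in U:
        counts[ch] += 1
        N.append(dict(counts))  # snapshot of the running counts
    return N


T = "abbabaababbb#"
SA = [13, 6, 4, 7, 1, 9, 12, 5, 3, 8, 11, 2, 10]  # from the slides
U = previous_chars(T, SA)
N = build_N(U, "#ab")
print("".join(U))   # bbba#bbababaa
print(N[7 - 1]["b"])  # 5, i.e. N[7,b]
```

Storing every snapshot is what costs n|Σ| entries; the later slides trade this for a logarithmic query time.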

Page 30: Linear-Time Search in Suffix Arrays


1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Count the number of B-type suffixes

U = b b b a # b b a b a b a a

 i   U[i]   #   a   b
 1   b      0   0   1
 2   b      0   0   2
 3   b      0   0   3
 4   a      0   1   3
 5   #      1   1   3
 6   b      1   1   4
 7   b      1   1   5
 8   a      1   2   5
 9   b      1   2   6
10   a      1   3   6
11   b      1   3   7
12   a      1   4   7
13   a      1   5   7

We can count B-type suffixes in O(1) time by accessing an entry of N.
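Putting the two counts together gives the ranks of the first and last ba-suffix directly. The worked arithmetic below is a reading of the slides rather than a formula quoted from them: the symbols f_a = 2 and l_a = 6 (ranks of the first and last a-suffix from stage 1) and M[a] = 6 (position of the last a-suffix) are notation assumed here.

```latex
\begin{align*}
\mathrm{first}(ba) &= M[a] + N[f_a - 1,\, b] + 1 = 6 + N[1, b] + 1 = 6 + 1 + 1 = 8,\\
\mathrm{last}(ba)  &= M[a] + N[l_a,\, b] = 6 + N[6, b] = 6 + 4 = 10.
\end{align*}
```

Rows 8 to 10 of the suffix array are indeed the ba-suffixes.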

Page 31: Linear-Time Search in Suffix Arrays


Array N

Space: O(n|Σ|)

An alternative way

Space: O(n)

O(log |Σ|) time for counting B-type suffixes.

 i   #   a   b
 1   0   0   1
 2   0   0   2
 3   0   0   3
 4   0   1   3
 5   1   1   3
 6   1   1   4
 7   1   1   5
 8   1   2   5
 9   1   2   6
10   1   3   6
11   1   3   7
12   1   4   7
13   1   5   7

Page 32: Linear-Time Search in Suffix Arrays


Query for N[i,b]

Counting B-type suffixes

O(log n) time

O(log |Σ|) time

Page 33: Linear-Time Search in Suffix Arrays


1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Query for N[i,b]: O(log n) time

U = b b b a # b b a b a b a a

In an O(log n) time algorithm, we generate an array whose ith entry stores the location of the ith b in U.

 i   location of the ith b in U
 1   1
 2   2
 3   3
 4   6
 5   7
 6   9
 7   11

Page 34: Linear-Time Search in Suffix Arrays


Query for N[i,b]: O(log n) time

U = b b b a # b b a b a b a a

Locations of b in U: 1, 2, 3, 6, 7, 9, 11

Counting the suffixes whose previous characters are b in SA[1,8] = counting the b's in U[1,8].

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 35: Linear-Time Search in Suffix Arrays


Query for N[i,b]: O(log n) time

U = b b b a # b b a b a b a a

Locations of b in U: 1, 2, 3, 6, 7, 9, 11

Find the largest value not exceeding 8 in this array.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 36: Linear-Time Search in Suffix Arrays


Query for N[i,b]: O(log n) time

U = b b b a # b b a b a b a a

Locations of b in U: 1, 2, 3, 6, 7, 9, 11

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

To find 7 in this array, we perform binary search, which takes O(log n) time.

Page 37: Linear-Time Search in Suffix Arrays


Query for N[i,b]: O(log n) time

U = b b b a # b b a b a b a a

Locations of b in U: 1, 2, 3, 6, 7, 9, 11

The index of 7 in this array, which is 5, is the number of b's in U[1,8].

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Page 38: Linear-Time Search in Suffix Arrays


Query for N[i,b]: O(log n) time

U = b b b a # b b a b a b a a

Locations of b in U: 1, 2, 3, 6, 7, 9, 11

Locations of a in U: 4, 8, 10, 12, 13

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Locations of # in U: 5

Generally, we require arrays for all characters (#, a, b).

O(n) space
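The O(log n) query can be sketched with one sorted location list per character and the standard-library `bisect` module. The names `occurrence_lists` and `rank` are assumptions for illustration; `rank(occ, i, c)` plays the role of N[i,c].

```python
from bisect import bisect_right


def occurrence_lists(U):
    """occ[c] = increasing list of 1-based locations of c in U.
    The lists hold n entries in total, so the space is O(n)."""
    occ = {}
    for i, ch in enumerate(U, start=1):
        occ.setdefault(ch, []).append(i)
    return occ


def rank(occ, i, c):
    """Number of occurrences of c in U[1..i], found by binary search for
    the largest stored location that does not exceed i."""
    return bisect_right(occ.get(c, []), i)


U = "bbba#bbababaa"  # previous characters, from the slides
occ = occurrence_lists(U)
print(occ["b"])           # [1, 2, 3, 6, 7, 9, 11]
print(rank(occ, 8, "b"))  # 5
```

`bisect_right` returns the number of stored locations that are at most i, which is the index of the largest value not exceeding i, as on the slides.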

Page 39: Linear-Time Search in Suffix Arrays


Query for N[i,b]

O(log n) time

O(log |Σ|) time

Page 40: Linear-Time Search in Suffix Arrays


Query for N[i,b]: O(log |Σ|) time

U = b b b a # b b a b a b a a

For the last character of each block of U, we compute the entries of N.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Divide U into |Σ|-sized blocks (here |Σ| = 3).

Entries of N stored for the last character of each block:

 i   #   a   b
 3   0   0   3
 6   1   1   4
 9   1   2   6
12   1   4   7

Page 41: Linear-Time Search in Suffix Arrays


Query for N[i,b]: O(log |Σ|) time

U = b b b a # b b a b a b a a

For the other entries in each block, we generate a data structure similar to the one used in the O(log n) time algorithm.

O(log |Σ|) time for binary search.

Still O(n) space in total.

1 #

2 a a b a b b b #

3 a b a a b a b b b #

4 a b a b b b #

5 a b b a b a a b a b b b #

6 a b b b #

7 b #

8 b a a b a b b b #

9 b a b a a b a b b b #

10 b a b b b #

11 b b #

12 b b a b a a b a b b b #

13 b b b #

Entries of N stored for the last character of each |Σ|-sized block:

 i   #   a   b
 3   0   0   3
 6   1   1   4
 9   1   2   6
12   1   4   7
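The blocked structure can be sketched as follows: one row of N per block boundary, plus a small sorted location list per character inside each block. This is an illustrative reading of the slides, not the authors' implementation; `build_blocked` and `rank` are assumed names, and the block size is taken to be the alphabet size.

```python
from bisect import bisect_right


def build_blocked(U, alphabet):
    """Sample N at the end of every block of size |alphabet| and keep,
    for each block, the in-block offsets (1-based) of each character."""
    b = len(alphabet)
    counts = {c: 0 for c in alphabet}
    samples = []  # samples[k][c] = occurrences of c in U[1 .. (k+1)*b]
    local = []    # local[k][c]   = sorted offsets of c inside block k
    for k in range(0, len(U), b):
        occ = {}
        for off, ch in enumerate(U[k:k + b], start=1):
            counts[ch] += 1
            occ.setdefault(ch, []).append(off)
        samples.append(dict(counts))
        local.append(occ)
    return b, samples, local


def rank(struct, i, c):
    """Number of occurrences of c in U[1..i]."""
    b, samples, local = struct
    k, off = divmod(i, b)  # k complete blocks, then off more characters
    base = samples[k - 1].get(c, 0) if k > 0 else 0
    if off == 0:
        return base
    # Binary search over at most |alphabet| offsets: O(log |alphabet|).
    return base + bisect_right(local[k].get(c, []), off)


U = "bbba#bbababaa"  # previous characters, from the slides
struct = build_blocked(U, "#ab")
print(rank(struct, 8, "b"))  # 5
```

Each block contributes one sampled row of |Σ| counters and at most |Σ| stored offsets, so the total space stays linear in n.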

Page 42: Linear-Time Search in Suffix Arrays


Summary

p stages

Each stage

Count A-type suffixes

Time: O(1)

Space: O(n) for M array

Count B-type suffixes

Time: O(log |Σ|)

Space: O(n) for computing the value of an entry of N

In total, O(p log |Σ|) time with O(n) space.
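The whole search can be sketched end to end. The code below is a minimal illustration under several assumptions: the suffix array is built by naive sorting, the N queries use the simpler O(log n) location lists instead of the blocked structure, the names `build_index`, `less` and `backward_search` are hypothetical, and the pattern is assumed not to contain `#`.

```python
from bisect import bisect_right


def build_index(text):
    """Build the suffix array plus the two counting structures."""
    sa = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
    M, occ = {}, {}
    for r, start in enumerate(sa, start=1):
        M[text[start - 1]] = r                          # last rank per character
        occ.setdefault(text[start - 2], []).append(r)   # locations in U
    # less[c] = number of suffixes starting with a character smaller than c.
    less, prev = {}, 0
    for c in sorted(M):
        less[c] = prev
        prev = M[c]
    return sa, less, occ


def backward_search(index, pattern):
    """Return (first, last) 1-based ranks of the pattern-suffixes, or None."""
    sa, less, occ = index
    first, last = 1, len(sa)
    for c in reversed(pattern):          # one stage per pattern character
        if c not in less:
            return None
        locs = occ.get(c, [])
        # A-type count plus B-type count, plus one, gives the new first rank.
        first = less[c] + bisect_right(locs, first - 1) + 1
        last = less[c] + bisect_right(locs, last)
        if first > last:
            return None
    return first, last


index = build_index("abbabaababbb#")
print(backward_search(index, "aba"))  # (3, 4)
```

For P = aba the intermediate ranges are (2, 6), (8, 10) and (3, 4), matching the three stages of the example. Swapping the location lists for the blocked structure would give the O(log |Σ|) per-stage bound stated above.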

Page 43: Linear-Time Search in Suffix Arrays


Conclusion

In a suffix array, one can choose an O(p log |Σ|) or an O(p + log n) search time algorithm depending on the alphabet size.

Suffix arrays are more powerful than suffix trees.