FM-KZ: An even simpler alphabet-independent FM-index

Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland, sgrabow@kis.p.lodz.pl
Gonzalo Navarro, Dept. of Computer Science, Univ. of Chile, Chile, [email protected]
Alejandro Salinger, David R. Cheriton School of Computer Science, Univ. of Waterloo, Canada, [email protected]
Rafał Przywarski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland, rafal.przywarski@svensson.com.pl

Prague Stringology Club, Praha, Aug. 2006


Full text indexing – past

suffix tree (aka lord of the strings): powerful, flexible, but needs at least 10n space (avg. case, assuming indices 4x larger than characters);

suffix array: 4n space, otherwise quite practical.

Full text indexing – now

compressed suffix array (CSA) (Grossi & Vitter, 2000; ...)

FM-index based on the BWT (Ferragina & Manzini, 2000)

LZ-index based on the suffix tree with LZ78 (Navarro, 2003)

alphabet-friendly FM (Ferragina et al., 2004)

...

R.Przywarski et al., FM-KZ: An even simpler alphabet-independent FM-index, PSC’06


Compressed indexes

Common feature: the original text may be omitted, only its compressed representation suffices for handling queries.

Most of the compressed indexes are based on the Burrows–Wheeler transform (BWT).

Rapid development in theory (see the survey by Navarro & Mäkinen, 2006); implementations somewhat lag behind...

This work is practice-oriented: a step on from our earlier work (SPIRE04, PSC05).


Burrows–Wheeler transform (BWT), an example

[Figure: rotations as they go; sorted rotations, with columns F and L]
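The figure for this slide did not survive, so here is a minimal sketch of the same construction. It assumes a sentinel character `$` that is smaller than every text character, and the text "banana" is only an illustrative input, not the example from the slide. The function name `bwt` is hypothetical.

```python
def bwt(text, sentinel="$"):
    """Return the F and L columns of the sorted-rotation matrix of text + sentinel."""
    s = text + sentinel  # assumption: sentinel sorts before every text character
    rotations = [s[i:] + s[:i] for i in range(len(s))]  # rotations as they go
    rotations.sort()                                    # sorted rotations
    first = "".join(r[0] for r in rotations)   # column F (sorted characters)
    last = "".join(r[-1] for r in rotations)   # column L (the BWT itself)
    return first, last


F, L = bwt("banana")
print(F)  # $aaabnn
print(L)  # annb$aa
```

Sorting full rotations takes quadratic space and is only meant to mirror the picture; practical indexes build the BWT from a suffix array.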


Pattern searching in BWT sequence: LF-mapping mechanism

Starting point in Ferragina & Manzini’s index (2000):
search time: O(m log n),
space occupancy: O(n log n) bits.

Note that in such form the complexities are like with the plain suffix array, but text T itself may be eliminated!

Better? Ferragina & Manzini (2000) reach O(m) time with (roughly) O(Hk n) space, assuming small alphabet.


Searching in BWT sequence, an example

[Figure: BWT matrix with columns F and L; feasible form of the L column]
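Since the worked example was a figure, the following is a rough sketch of backward search with the LF-mapping on a character-level BWT. It keeps only the L column plus a table C of cumulative character counts. The names `bwt_last_column`, `count_occurrences` and `occ` are hypothetical, and `occ` is a naive linear scan here, whereas a real index answers it with rank structures.

```python
def bwt_last_column(text, sentinel="$"):
    """L column of the sorted rotations (naive construction, for illustration)."""
    s = text + sentinel
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_occurrences(L, pattern):
    """Count occurrences of pattern by backward search over the L column."""
    # C[c] = number of characters in L that are smaller than c
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)

    def occ(c, i):
        # number of occurrences of c in L[0:i]; naive scan, not O(1)
        return L[:i].count(c)

    sp, ep = 0, len(L)  # half-open range [sp, ep) of matching rows
    for c in reversed(pattern):  # the pattern is processed right to left
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp)
        ep = C[c] + occ(c, ep)
        if sp >= ep:
            return 0
    return ep - sp


L = bwt_last_column("banana")
print(count_occurrences(L, "ana"))  # 2
```

The half-open range is a choice made for this sketch; the slides use an inclusive range [sp, ep], which is why their occurrence count is ep – (sp – 1).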


FM-Huffman (Grabowski et al., 2004)

Idea: Search in BWT sequence, but use a binary (or, generally, constant size) alphabet.

Use rank() operation in binary sequence (Jacobson, 1989; Munro, 1996; Clark, 1996).

Rank(k) tells the number of 1’s in T[1...k], k ≤ n, in O(1) time and needs o(n) extra space.
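To make the rank() operation concrete, here is a simplified one-level sketch: it stores the number of 1's before each block and scans inside a single block at query time. The class name `RankBitVector` and the `block` parameter are assumptions for illustration. The cited structures use two levels of counters plus a lookup table to reach true O(1) time with o(n) extra bits, which this sketch does not attempt.

```python
class RankBitVector:
    """Bit vector with precomputed per-block counts of 1's (one-level sketch)."""

    def __init__(self, bits, block=8):
        self.bits = list(bits)  # sequence of 0/1 integers
        self.block = block
        # counts[j] = number of 1's in bits[0 : j * block]
        self.counts = [0]
        for start in range(0, len(self.bits), block):
            self.counts.append(self.counts[-1] + sum(self.bits[start:start + block]))

    def rank(self, k):
        """Number of 1's among the first k bits, i.e. in T[1...k] of the slide."""
        j, r = divmod(k, self.block)
        start = j * self.block
        # one stored counter plus a scan of at most block - 1 bits
        return self.counts[j] + sum(self.bits[start:start + r])


rb = RankBitVector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0], block=4)
print(rb.rank(5))  # 3
```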

Binary representation?

Yes, you guessed: Huffman coding (approximation of order-0 entropy).

Soon we’ll see this is not as good as it might first seem.


Searching (counting query) in FM-Huffman

Searching for pattern P’ in bit-vector B


FM-Huffman index

Index construction:
1. Huffman encode the text T, obtaining T’ (n’ bits).
2. Calculate the BWT for T’, call it B.
3. Create another bit array, Bh, which indicates the bits in B that start Huffman codewords.

Query handling:
4. Huffman encode the pattern P, obtaining P’.
5. Search in a similar manner as shown at slide 7, BUT the BWT sequence is kept naturally (array B) and the additional space overhead is sublinear in n’.
6. Verify a match with additional bits (Bh array + again extra structures sublinear in n’).

Main drawback: Bh as large as B.


What instead of the binary Huffman? (Consider both space and search time.)

k-ary Huffman (Grabowski et al., PSC05), k typically 4 or 16:

- B array needs more space (usu. slightly more).
+++ Bh array is almost halved.
-- rank structures for each of 4 symbols needed (but for a halved sequence).

In total: some 10% space gain for English and proteins (almost no gain for DNA).

Significant speedup in most cases (fewer codeword chars → fewer rank operations).


Now more radical: remove Bh completely

Removing Bh is possible if our encoding has some self-synchronizing property.

Every codeword beginning must be recognized instantaneously.

Very naïve solution: unary coding. Anything better?

Yes. Kautz-Zeckendorf coding.

The search is exactly like in slide 8 (for binary FM-Huffman), only line 9 will now be

if ep < sp then occ = 0 else occ = ep – (sp – 1)


Kautz–Zeckendorf coding

Basic variant (we denote it as KZ2): all the codewords start with 110; 110 appears nowhere else.

Let B be encoded with KZ2. If during the LF-mapping we read 0 followed by two 1’s, we know we are at a codeword boundary.

Note we allow 1 at a codeword end! So even three 1’s can be “in a row”. But 110 only at a codeword beginning.
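The self-synchronizing property can be illustrated with a small sketch that recovers codeword boundaries from a concatenated bit string just by looking for 110. The codewords below are hand-picked, hypothetical examples that satisfy the stated property (they start with 110, and 110 appears nowhere else, even across boundaries); they are not claimed to be the actual KZ2 codebook. The function name `split_codewords` is likewise an assumption.

```python
def split_codewords(bits):
    """Split a concatenation of codewords at every occurrence of '110'."""
    starts = []
    i = bits.find("110")
    while i != -1:
        starts.append(i)           # every 110 marks a codeword beginning
        i = bits.find("110", i + 1)
    ends = starts[1:] + [len(bits)]
    return [bits[a:b] for a, b in zip(starts, ends)]


# hypothetical codewords; note that "1101" ends with 1, so "1101" + "110"
# contains three 1's in a row, yet 110 still occurs only at codeword starts
words = ["1101", "110", "11000", "11010", "1100", "11001"]
print(split_codewords("".join(words)) == words)  # True
```

No marker array is consulted here, which is the point of removing Bh: the boundary information is carried by the bit pattern itself.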


Kautz–Zeckendorf coding, cont’d

KZ2 encoding (in an alternative variant: each codeword has 1 at the beginning and at a start and no two adjacent 1’s elsewhere) presents an integer as a sum of Fibonacci numbers in a unique form.

Fib. sequence (note a single 1 at the start): 1, 2, 3, 5, 8, 13, 21, 34, 55...

So, for example, 27 will be represented as (LSDigit first): 1001001, since 27 = 1 + 5 + 21.
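The representation of 27 above can be reproduced with a short greedy sketch: repeatedly take the largest Fibonacci number that still fits. The function name `zeckendorf` is hypothetical, and the sketch produces only the digits of the Fibonacci-base representation, not a complete KZ2 codeword with its prefix.

```python
def zeckendorf(n):
    """Fibonacci-base digits of n >= 1, least significant digit first."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]  # Fibonacci sequence with a single 1 at the start
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs = [f for f in fibs if f <= n]
    digits = ["0"] * len(fibs)
    # greedy choice from the largest Fibonacci number downwards;
    # this never selects two adjacent Fibonacci numbers
    for i in range(len(fibs) - 1, -1, -1):
        if fibs[i] <= n:
            digits[i] = "1"
            n -= fibs[i]
    return "".join(digits)


print(zeckendorf(27))  # 1001001
```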


What is the avg codeword length for KZ?

We don’t know.

But asymptotically (large alphabet) it can be upper-bounded by 1.618... • (H0 + r), where r < 1 is the Huffman redundancy for a given distribution.

1.618... = (1 + sqrt(5)) / 2 (golden ratio)


Benefits of KZ

No Bh array (and its rank structure).

...So we don’t perform the final pair of rank operations either.

With FM-Huffman, selectnext (telling the pos. of the next 1) is needed at a start of report / display query handling.

Now all the matches are in a contiguous range of rows.

Drawbacks of KZ

B (and its rank) is longer, as KZ code is longer than Huffman.

Longer encoded patterns mean more rank operations (as opposed to FM-Huff4).

Harder analysis...


On a Fibonacci numbers application...

The number 1.618... Does it ring a bell?


1 mile = 1.609... km

How does a mathematician convert miles into kilometers? (According to Graham, Knuth, and Patashnik, Concrete Mathematics.)

Represent the distance in the Fibonacci base (e.g. KZ2), shift left by 1, sum what you’ve obtained.

Example: 80 miles.
80 = 1+3+21+55 = 101000101(fib)

After the << : 0101000101(fib) = 2+5+34+89 = 130 km
Ratio 1.625, not bad...
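The conversion trick can be written down as a short sketch: decompose the distance greedily into Fibonacci numbers and replace each one by its successor, which is what the shift does. The function name `miles_to_km` is hypothetical, and the result is only the approximation described on the slide.

```python
def miles_to_km(miles):
    """Approximate miles -> km by shifting the Fibonacci-base representation."""
    fibs = [1, 2]
    while fibs[-1] <= miles:
        fibs.append(fibs[-1] + fibs[-2])  # keep one Fibonacci number above miles
    km = 0
    # greedy decomposition; each chosen fibs[i] contributes fibs[i + 1] km
    for i in range(len(fibs) - 2, -1, -1):
        if fibs[i] <= miles:
            miles -= fibs[i]
            km += fibs[i + 1]
    return km


print(miles_to_km(80))  # 130
```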


Generalized Kautz–Zeckendorf

KZ1: 10 prefix (unary coding!)
KZ2: 110 prefix
KZ3: 1110 prefix, etc.

Is KZ2 best? Not always. For example, for DNA (4 symbols) the seemingly very naive unary coding has 2.5 bit avg codeword length (assuming non-compressible symbols, i.e. H0 = 2 bits / symbol).

...Oops, this is for a slightly twisted variant: the codewords are simply 1, 10, 100 and 1000.
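The 2.5-bit figure can be checked with a few lines, assuming the four symbols are equally likely (which is what H0 = 2 bits / symbol implies for a 4-letter alphabet). The helper names are hypothetical.

```python
import math


def average_length(codewords, probabilities):
    """Expected codeword length in bits under the given symbol probabilities."""
    return sum(p * len(w) for w, p in zip(codewords, probabilities))


def order0_entropy(probabilities):
    """Order-0 entropy H0 in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


codewords = ["1", "10", "100", "1000"]  # the "twisted" unary variant
uniform = [0.25] * 4                    # assumption: non-compressible DNA

print(average_length(codewords, uniform))  # 2.5
print(order0_entropy(uniform))             # 2.0
```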


Reporting queries (basic idea) – same for all the FM-*

One extra bit per original symbol needed (plus some sublinear data), and one position index per h symbols (h user-selected parameter, e.g., 32).

We sample positions of T’ in regular intervals, but only at codewords’ beginnings.

Handling a query: for each found occurrence of P, continue moving backward digit by digit in T’ until a sampled position (signalled with a flag) is met. Read its index (original position) and it’s done.

The backward moving in T’ is limited.
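The sampling idea can be sketched on a plain character-level BWT, which is a deliberate simplification: the slides work on the encoded bit sequence T’ and sample only at codeword beginnings, while this sketch samples every h-th text position directly. All names (`build_index`, `lf`, `locate`) are hypothetical, the suffix array is built naively, and a dictionary stands in for the flag bits plus the array of sampled positions.

```python
def build_index(text, h=4, sentinel="$"):
    """Naive BWT index with text positions sampled every h characters."""
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    L = "".join(s[i - 1] for i in sa)                # BWT; s[-1] is the sentinel
    C, total = {}, 0
    for c in sorted(set(s)):
        C[c] = total
        total += s.count(c)
    # a row is "flagged" when its suffix starts at a multiple of h
    samples = {row: pos for row, pos in enumerate(sa) if pos % h == 0}
    return L, C, samples


def lf(L, C, row):
    """LF-mapping: row of the suffix that starts one position earlier."""
    c = L[row]
    return C[c] + L[:row].count(c)


def locate(index, pattern):
    """Report all text positions of pattern (0-based, sorted)."""
    L, C, samples = index
    sp, ep = 0, len(L)  # half-open row range
    for c in reversed(pattern):
        if c not in C:
            return []
        sp, ep = C[c] + L[:sp].count(c), C[c] + L[:ep].count(c)
        if sp >= ep:
            return []
    positions = []
    for row in range(sp, ep):
        steps = 0
        while row not in samples:  # walk backward until a flagged row
            row = lf(L, C, row)
            steps += 1
        positions.append(samples[row] + steps)
    return sorted(positions)


print(locate(build_index("banana", h=4), "ana"))  # [1, 3]
```

Because position 0 is always sampled, each walk stops after fewer than h steps, which matches the remark that the backward moving is limited.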


Experimental results

Datasets:
80 MB of English text (from TREC-3)
60 MB of DNA (from BLAST database), 5 characters!
55 MB of proteins (from BLAST database)

Test platform:
Intel Xeon 3.06 GHz
2 GB of RAM
512 KB cache
Gentoo Linux 2.6.10
Gcc 3.3.5 -O9


Experimental results, cont’d

Counting queries: pattern length from 10 to 100, for each length 1000 patterns taken from random positions of each text.

Reporting queries: pattern length 10, 1000 random patterns taken.

Display queries: 1000 random patterns, 100 chars to display around each found occurrence.

Competitors:
FM-index (very simple and fast (byte-oriented) variant by Navarro, 2004),
Compressed Suffix Array (CSA) (Sadakane, 2000),
Run-Length FM (RLFM) (Mäkinen & Navarro, 2005),
Succinct Suffix Array (SSA) (Mäkinen & Navarro, 2005),
LZ-index (Navarro, 2004),
FM-Huffman2 and FM-Huffman4 (Grabowski et al., 2005),
FM-KZ1 and FM-KZ2 (this work).


English text, search time

[Plot: time in sec for varying pattern lengths]

English text, space vs. search time

[Plot: time in sec per character]

DNA, space vs. search time

[Plot: time in sec per character]

Proteins, space vs. search time

[Plot: time in sec per character]

Observations

CSA and RLFM: hardly ever competitive.

FM-Huff16 fastest for counting queries for English and proteins.

FM-KZ1: most succinct and among the fastest on DNA.

Reporting time: FM-Huff variants lose to FM-index for English and proteins. They (k=2 and k=4) win on DNA instead (but there SSA is even better, and more flexible for low space use).

Display time: FM-KZ1 best for DNA. Best for proteins: FM and then FM-KZ2 (but the fastest is FM-Huff16). English text: similar to proteins, but LZ-index is as fast as FM-Huff16 and needs about 25% less space.

Original binary Huffman: never competitive.


Presented algorithm – properties

Search time: O((H0+1)m + occ) avg search time, O(m log n + occ) worst-case search time.

Space occupancy: less than 1.618... • (H0+1)n + o(H0 n) bits.
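To get a feel for the space bound, the leading term 1.618... • (H0+1)n can be evaluated for a concrete text. The sketch below computes the empirical order-0 entropy and that leading term only; the o(H0 n) part is ignored, and the function names and the sample text are assumptions for illustration.

```python
import math
from collections import Counter

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, 1.618...


def order0_entropy(text):
    """Empirical order-0 entropy H0 of text, in bits per symbol."""
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def space_bound_bits(text):
    """Leading term of the space bound, PHI * (H0 + 1) * n; lower-order term omitted."""
    return PHI * (order0_entropy(text) + 1) * len(text)


sample = "acgt" * 10                   # hypothetical uniform 4-symbol text, n = 40
print(order0_entropy(sample))          # 2.0
print(round(space_bound_bits(sample)))
```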

Pros and cons (summary):
• very simple and practical succinct index;
• no dependence on the alphabet size;
• among the fastest (but not the most succinct) compressed indexes;
• worse “in theory” than some recent indexes (but simpler);
• quite flexible.


To do:

Better analysis?

Some more little tricks (and tweaks), e.g., the B array may be truncated somewhat. Good for space and even for speed (elimination of some rank operations).

More experiments with more succinct rank (e.g. 5% overhead rank is only moderately slower than the 10% one; definitely not twice; quite an option for Huff4 and Huff16).

Higher arity KZ?
