
FM-KZ: An even simpler alphabet-independent FM-index

Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland

[email protected]

Prague Stringology Club, Praha, Aug. 2006

Gonzalo Navarro, Dept. of Computer Science, Univ. of Chile

[email protected]

Alejandro Salinger, David R. Cheriton School of Computer Science, Univ. of Waterloo, Canada

[email protected]

Rafał Przywarski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland

[email protected]

2

Full text indexing – past

suffix tree (aka lord of the strings): powerful, flexible, but needs at least 10n space (avg. case, assuming indices 4x larger than characters);

suffix array: 4n space, otherwise quite practical.

Full text indexing – now

compressed suffix array (CSA) (Grossi & Vitter, 2000; ...)

FM-index based on the BWT (Ferragina & Manzini, 2000)

LZ-index based on the suffix tree with LZ78 (Navarro, 2003)

alphabet-friendly FM (Ferragina et al., 2004)

...

R.Przywarski et al., FM-KZ: An even simpler alphabet-independent FM-index, PSC’06

3

Compressed indexes

Common feature: the original text may be omitted; only its compressed representation suffices for handling queries.

Most of the compressed indexes are based on the Burrows–Wheeler transform (BWT).

Rapid development in theory (see the survey by Navarro & Mäkinen, 2006); implementations somewhat lag behind...

This work – practice oriented. A step on from our earlier work (SPIRE04, PSC05).


4

Burrows–Wheeler transform (BWT), an example

[Figure: the rotations as they go (left) and the sorted rotations (right), with the F and L columns marked]
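The construction shown in the figure can be sketched in a few lines. This is a minimal illustration, not the authors' code: it builds all rotations explicitly (quadratic space), and the function name `bwt` and the `$` sentinel are assumptions made for the example.

```python
def bwt(text, sentinel="$"):
    """Return the F and L columns of the sorted-rotation matrix of text + sentinel.

    Assumption: the sentinel does not occur in the text and sorts before
    every other symbol (true for "$" versus letters in ASCII).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # All cyclic rotations of s, sorted lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    first = "".join(row[0] for row in rotations)   # F column
    last = "".join(row[-1] for row in rotations)   # L column = the BWT
    return first, last


if __name__ == "__main__":
    f_col, l_col = bwt("banana")
    print("F =", f_col)
    print("L =", l_col)
```

For the text "banana" the L column is "annb$aa"; practical indexes obtain the same column from a suffix array instead of materialising the rotations.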


5

Pattern searching in BWT sequence: LF-mapping mechanism

Starting point in Ferragina & Manzini’s index (2000):

search time: O(m log n),

space occupancy: O(n log n) bits.

Note that in such form the complexities are like with the plain suffix array, but text T itself may be eliminated!

Better? Ferragina & Manzini (2000) reach O(m) time with (roughly) O(Hk n) space, assuming small alphabet.


6

Searching in BWT sequence, an example

[Figure: the BWT matrix with the F and L columns, and a feasible form of the L column]
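A counting query over the L column can be sketched as follows. This is a hedged illustration of backward search with the LF-mapping, not the index from the talk: the occurrence counts are computed by a naive scan instead of a rank structure, the row range is half-open and 0-based (the slides use 1-based sp and ep), and all names are assumptions.

```python
def bwt_last_column(text, sentinel="$"):
    """L column of the sorted-rotation matrix (naive construction)."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def count_occurrences(last, pattern):
    """Count occurrences of pattern in the text whose BWT is `last`."""
    # C[c] = number of symbols in the text that are smaller than c.
    C, total = {}, 0
    for ch in sorted(set(last)):
        C[ch] = total
        total += last.count(ch)

    def occ(ch, i):
        # Number of ch in last[0:i]; a real index answers this with rank().
        return last.count(ch, 0, i)

    sp, ep = 0, len(last)          # half-open range of matrix rows
    for ch in reversed(pattern):   # process the pattern right to left
        if ch not in C:
            return 0
        sp = C[ch] + occ(ch, sp)   # LF-mapping applied to both range ends
        ep = C[ch] + occ(ch, ep)
        if sp >= ep:
            return 0
    return ep - sp


if __name__ == "__main__":
    L = bwt_last_column("banana")
    print(count_occurrences(L, "ana"))
```

Each pattern symbol costs two occurrence-count queries, which is why the speed of rank dominates the search time.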


7

FM-Huffman (Grabowski et al., 2004)

Idea: Search in BWT sequence, but use a binary (or, generally, constant size) alphabet.

Use rank() operation in binary sequence (Jacobson, 1989; Munro, 1996; Clark, 1996).

Rank(k) tells the number of 1’s in T[1...k], k ≤ n, in O(1) time and needs o(n) extra space.
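The idea behind rank can be illustrated with a single level of sampled counters. This is a simplified sketch, not the cited O(1)-time, o(n)-space structures (those use two levels of counters plus lookup tables); the class name `RankBitVector` and the `block` parameter are assumptions made here.

```python
class RankBitVector:
    """Bit vector with sampled prefix counts for rank queries."""

    def __init__(self, bits, block=8):
        self.bits = list(bits)
        self.block = block
        # samples[j] = number of 1s among the first j * block bits.
        self.samples = [0]
        running = 0
        for i, bit in enumerate(self.bits, start=1):
            running += bit
            if i % block == 0:
                self.samples.append(running)

    def rank1(self, k):
        """Number of 1s among the first k bits (0 <= k <= n)."""
        if not 0 <= k <= len(self.bits):
            raise IndexError("k out of range")
        j = k // self.block
        # Stored count up to the block boundary, plus a scan of fewer
        # than `block` bits inside the current block.
        return self.samples[j] + sum(self.bits[j * self.block:k])


if __name__ == "__main__":
    bv = RankBitVector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], block=4)
    print([bv.rank1(k) for k in range(11)])
```

A smaller `block` means faster queries and more counter overhead, which is the space/time trade-off behind the rank variants with different overheads.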

Binary representation?

Yes, you guessed: Huffman coding (approximation of order-0 entropy).

Soon we’ll see this is not as good as it might first seem.


8

Searching (counting query) in FM-Huffman

Searching for pattern P’ in bit-vector B


9

FM-Huffman index

Index construction:

1. Huffman encode the text T, obtaining T’ (n’ bits).

2. Calculate the BWT for T’, call it B.

3. Create another bit array, Bh, which indicates the bits in B that start Huffman codewords.

Query handling:

4. Huffman encode the pattern P, obtaining P’.

5. Search in a similar manner as shown at slide 7, BUT the BWT sequence is kept naturally (array B) and the additional space overhead is sublinear in n’.

6. Verify a match with additional bits (Bh array + again extra structures sublinear in n’).

Main drawback: Bh as large as B.


10

What instead of the binary Huffman? (Consider both space and search time.)

k-ary Huffman (Grabowski et al., PSC05), k typically 4 or 16:

- B array needs more space (usu. slightly more).

+++ Bh array is almost halved.

-- rank structures for each of 4 symbols needed (but for a halved sequence).

In total: some 10% space gain for English and proteins (almost no gain for DNA).

Significant speedup in most cases (fewer codeword chars → fewer rank operations).


11

Now more radical: remove Bh completely

Removing Bh is possible if our encoding has some self-synchronizing property.

Every codeword beginning must be recognized instantaneously.

Very naïve solution: unary coding. Anything better?

Yes. Kautz-Zeckendorf coding.

The search is exactly like in slide 8 (for binary FM-Huffman), only line 9 will be now

if ep < sp then occ = 0 else occ = ep – (sp – 1)

12

Kautz–Zeckendorf coding

Basic variant (we denote it as KZ2): all the codewords start with 110; nowhere else does 110 appear.

Let B be encoded with KZ2. If during the LF-mapping we read 0 followed by two 1’s, we know we are at a codeword boundary.

Note we allow 1 at a codeword end! So even three 1’s can be “in a row”. But 110 only at a codeword beginning.
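The self-synchronizing property can be checked on a small bit stream. This is a sketch under stated assumptions: the four codewords below are hypothetical strings chosen only because they start with 110 and contain 110 nowhere else (they are not the code table from the paper), and `split_kz2` is a name made up for the illustration.

```python
def split_kz2(stream):
    """Split a concatenation of KZ2-style codewords at every '110' marker."""
    starts = []
    pos = stream.find("110")
    while pos != -1:
        starts.append(pos)
        pos = stream.find("110", pos + 1)
    if not starts or starts[0] != 0:
        raise ValueError("stream must begin with the 110 marker")
    ends = starts[1:] + [len(stream)]
    return [stream[a:b] for a, b in zip(starts, ends)]


if __name__ == "__main__":
    # Hypothetical codewords: each starts with 110, some end with 1 or 11.
    codewords = ["1101", "110", "11011", "1100"]
    stream = "".join(codewords)
    print(stream)
    print(split_kz2(stream))
```

Codewords ending in 1 or 11 create runs of three or more 1's before the next marker, yet the pattern 110 still occurs only at the true codeword starts, so no Bh-like side array is needed to find the boundaries.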

13

Kautz–Zeckendorf coding, cont’d

KZ2 encoding (in an alternative variant: each codeword has 1 at the beginning and at a start and no two adjacent 1’s elsewhere) represents an integer as a sum of Fibonacci numbers in a unique form.

Fib. sequence (note a single 1 at the start): 1, 2, 3, 5, 8, 13, 21, 34, 55...

So, for example, 27 will be represented as (LSDigit first): 1001001, since 27 = 1 + 5 + 21.
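The representation of 27 can be reproduced with the usual greedy procedure. This is a minimal sketch of the plain Fibonacci (Zeckendorf) digits only; it does not add any codeword marker, and the function name `zeckendorf_bits` is an assumption for the example.

```python
def zeckendorf_bits(n):
    """Digits of n in the Fibonacci base 1, 2, 3, 5, 8, ..., least significant first."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    while fibs[-1] > n:            # only triggers for n = 1
        fibs.pop()
    digits = []
    remaining = n
    # Greedy: always take the largest Fibonacci number that still fits.
    for f in reversed(fibs):
        if f <= remaining:
            digits.append("1")
            remaining -= f
        else:
            digits.append("0")
    return "".join(reversed(digits))   # least significant digit first


if __name__ == "__main__":
    print(zeckendorf_bits(27))
```

Because the greedy choice never picks two consecutive Fibonacci numbers, the output never contains two adjacent 1's, which is what leaves room for a marker made of consecutive 1's.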


14

What is the avg codeword length for KZ?

We don’t know.

But asymptotically (large alphabet) it can be upper-bounded by 1.618... • (H0 + r), where r < 1 is the Huffman redundancy for a given distribution.

1.618... = (1 + sqrt(5)) / 2 (golden ratio)


15

Benefits of KZ

No Bh array (and its rank structure).

...So we don’t perform the final pair of rank operations either.

With FM-Huffman, selectnext (telling the pos. of the next 1) is needed at the start of report / display query handling.

Now all the matches are in a contiguous range of rows.

Drawbacks of KZ

B (and its rank) is longer, as KZ code is longer than Huffman.

Longer encoded patterns mean more rank operations (as opposed to FM-Huff4).

Harder analysis...


16

On a Fibonacci numbers application...

The number 1.618... Does it ring a bell?


1 mile = 1.609... km

How does a mathematician convert miles into kilometers? (According to Graham, Knuth, and Patashnik, Concrete Mathematics.)

Represent the distance in the Fibonacci base (e.g. KZ2), shift left by 1, sum what you’ve obtained.

Example: 80 miles. 80 = 1+3+21+55 = 101000101(fib)

After the << : 0101000101(fib) = 2+5+34+89 = 130 km

Ratio 1.625, not bad...
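The trick is easy to try out. The sketch below assumes whole miles and the greedy Fibonacci decomposition; the function name `miles_to_km_fib` is made up for the illustration.

```python
def miles_to_km_fib(miles):
    """Approximate miles -> km by shifting the Fibonacci representation by one."""
    if miles < 0:
        raise ValueError("miles must be non-negative")
    fibs = [1, 2]
    while fibs[-1] <= miles:       # keep one Fibonacci number above `miles`
        fibs.append(fibs[-1] + fibs[-2])
    km = 0
    remaining = miles
    # Greedy decomposition; each chosen Fibonacci number is replaced by
    # its successor, which is what the one-position shift does.
    for i in range(len(fibs) - 2, -1, -1):
        if fibs[i] <= remaining:
            remaining -= fibs[i]
            km += fibs[i + 1]
    return km


if __name__ == "__main__":
    print(miles_to_km_fib(80))
```

For 80 miles this gives 130, as on the slide. The result drifts from the true conversion because the ratio of consecutive Fibonacci numbers tends to 1.618..., not 1.609....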

17

Generalized Kautz–Zeckendorf

KZ1: 10 prefix (unary coding!)

KZ2: 110 prefix

KZ3: 1110 prefix, etc.

Is KZ2 best? Not always.

For example, for DNA (4 symbols) the seemingly very naive unary coding has 2.5 bit avg codeword length (assuming non-compressible symbols, i.e. H0 = 2 bits / symbol).

...Oops, this is for a slightly twisted variant: the codewords are simply 1, 10, 100 and 1000.
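The 2.5-bit figure follows directly from the four codeword lengths. This is a small check, assuming equiprobable symbols as stated on the slide; the helper name `average_codeword_length` is hypothetical.

```python
from math import log2


def average_codeword_length(probabilities, codewords):
    """Expected codeword length in bits for the given symbol probabilities."""
    return sum(p * len(c) for p, c in zip(probabilities, codewords))


# Four equiprobable DNA symbols and the "twisted" unary codewords.
probs = [0.25, 0.25, 0.25, 0.25]
code = ["1", "10", "100", "1000"]

avg_len = average_codeword_length(probs, code)   # (1 + 2 + 3 + 4) / 4
entropy = -sum(p * log2(p) for p in probs)       # H0 for the uniform case

if __name__ == "__main__":
    print(avg_len, entropy)
```

So the unary-like code costs 0.5 bit per symbol above H0 = 2 in this setting.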

18

Reporting queries (basic idea) – same for all the FM-*

One extra bit per original symbol needed (plus some sublinear data), and one position index per h symbols (h user-selected parameter, e.g., 32).

We sample positions of T’ in regular intervals, but only at codewords’ beginnings.

Handling a query: for each found occurrence of P, continue moving backward digit-by-digit in T’ until a sampled position (signalled with a flag) is met. Read its index (original position) and it’s done.

The backward moving in T’ is limited.
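The sampling idea can be sketched on a plain character-level BWT. This illustration makes several assumptions that differ from the real FM-* indexes: it works on the text itself instead of the encoded T’, keeps the samples in a dictionary instead of a flag bit array with rank, counts occurrences by a naive scan, and uses made-up names (`build_index`, `report`, `h`).

```python
def build_index(text, h):
    """Return (L column, C table, sampled rows) for text + '$'."""
    s = text + "$"                        # assumes "$" is not in the text
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive suffix array
    last = "".join(s[i - 1] for i in sa)
    C, total = {}, 0
    for ch in sorted(set(s)):
        C[ch] = total
        total += s.count(ch)
    # Keep the text position only for rows whose suffix starts at a multiple of h.
    sampled = {row: pos for row, pos in enumerate(sa) if pos % h == 0}
    return last, C, sampled


def report(text, pattern, h=4):
    """Positions of a non-empty pattern in text, found via the sampled index."""
    last, C, sampled = build_index(text, h)

    def occ(ch, i):
        return last.count(ch, 0, i)       # stand-in for a rank structure

    sp, ep = 0, len(last)                 # backward search, half-open range
    for ch in reversed(pattern):
        if ch not in C:
            return []
        sp, ep = C[ch] + occ(ch, sp), C[ch] + occ(ch, ep)
        if sp >= ep:
            return []
    positions = []
    for row in range(sp, ep):
        steps = 0
        while row not in sampled:         # walk backward in the text via LF
            ch = last[row]
            row = C[ch] + occ(ch, row)
            steps += 1
        positions.append(sampled[row] + steps)
    return sorted(positions)


if __name__ == "__main__":
    print(report("abracadabra", "abra", h=4))
```

Position 0 is always sampled, so each walk stops after fewer than h steps; a larger h saves space and makes reporting slower.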


19

Experimental results

Datasets:

80 MB of English text (from TREC-3)

60 MB of DNA (from BLAST database), 5 characters!

55 MB of proteins (from BLAST database)

Test platform:

Intel Xeon 3.06 GHz

2 GB of RAM

512 KB cache

Gentoo Linux 2.6.10

Gcc 3.3.5 -O9


20

Experimental results, cont’d

Counting queries: Pattern length from 10 to 100, for each length 1000 patterns taken from random positions of each text.

Reporting queries: Pattern length 10. 1000 random patterns taken.

Display queries: 1000 random patterns, 100 chars to display around each of the found occurrences.

Competitors:

FM-index (very simple and fast (byte-oriented) variant by Navarro, 2004),

Compressed Suffix Array (CSA) (Sadakane, 2000),

Run-Length FM (RLFM) (Mäkinen & Navarro, 2005),

Succinct Suffix Array (SSA) (Mäkinen & Navarro, 2005),

LZ-index (Navarro, 2004),

FM-Huffman2 and FM-Huffman4 (Grabowski et al., 2005),

FM-KZ1 and FM-KZ2 (this work).


21

English text, search time


[Plot: time in sec for varying pattern lengths]

22

English text, space vs. search time


[Plot: time in sec per character]

23

DNA, space vs. search time



24

Proteins, space vs. search time



25

Observations

CSA and RLFM: hardly ever competitive.

FM-Huff16 fastest for counting queries for English and proteins.

FM-KZ1: most succinct and among the fastest on DNA.

Reporting time: FM-Huff variants lose to FM-index for English and proteins. They (k=2 and k=4) win on DNA instead (but there SSA is even better, and more flexible for low space use).

Display time: FM-KZ1 best for DNA. Best for proteins: FM and then FM-KZ2 (but the fastest is FM-Huff16).

English text: similar to proteins but LZ-index is as fast as FM-Huff16 and needs about 25% less space.

Original binary Huffman: never competitive.


26

Presented algorithm – properties

Search time: O((H0+1)m + occ) avg search time.

O(m log n + occ) worst-case search time.

Space occupancy: less than 1.618... • (H0+1)n + o(H0 n) bits.

Pros and cons (summary):

• very simple and practical succinct index;

• no dependence on the alphabet size;

• among the fastest (but not the most succinct) compressed indexes;

• worse “in theory” than some recent indexes (but simpler);

• quite flexible


27

To do:

Better analysis?

Some more little tricks (and tweaks), e.g., the B array may be truncated somewhat. Good for space and even also for speed (elimination of some rank operations).

More experiments with more succinct rank (e.g. 5% overhead rank is only moderately slower than the 10% one; definitely not twice; quite an option for Huff4 and Huff16).

Higher arity KZ?

