Faster Approximate String Matching over Compressed Text By Gonzalo Navarro *, Takuya Kida †, Masayuki Takeda †, Ayumi Shinohara †, and Setsuo Arikawa

Faster Approximate String Matching over Compressed Text

By Gonzalo Navarro*, Takuya Kida†, Masayuki Takeda†, Ayumi Shinohara†, and Setsuo Arikawa†

* Dept. of Computer Science, University of Chile

† Dept. of Informatics, Kyushu University

Contents

• Introduction
– Motivation
– Related works and our goal
• Our search approach on LZ78/LZW
– Basic idea
– Filtration technique
– Multiple pattern matching algorithms on compressed text
• Experimental results
• Conclusion

Motivation

• Compressed pattern matching
– Let sleeping files lie.
– Reduce space, reduce searching time.

[Figure: conventional approach. The compressed file on secondary disk storage is transferred to memory, decompressed in memory, and then searched.]

Motivation

[Figure: compressed pattern matching. The compressed file on secondary disk storage is transferred to memory and searched directly, without decompression.]

• Compressed pattern matching
– Let sleeping files lie.
– Reduce space, reduce searching time.

Related Works (1)

Year  Researcher                        Compression
1988  Eilam-Tzoreff and Vishkin         run-length
1992  Amir, Landau, and Vishkin         two-dimensional run-length
1992  Amir and Benson                   two-dimensional run-length
1994  Amir, Benson, and Farach          two-dimensional run-length
1994  Manber                            original compression scheme
1995  Farach and Thorup                 LZ77
1996  Amir, Benson, and Farach          LZW
1996  Gąsieniec, et al.                 LZ77
1997  Karpinski, Rytter, and Shinohara  straight-line programs
1997  Miyazaki, Shinohara, and Takeda   straight-line programs
1997  Takeda                            finite state encoding
1998  Shibata, et al.                   byte pair encoding
1998  Miyazaki, et al.                  Huffman encoding
1998  Kida, et al.                      LZ78/LZW
1998  Moura, et al.                     word-based encoding

Related Works (2)

Year  Researcher                        Compression
1999  Shibata, et al.                   antidictionary based
1999  Kida, et al.                      LZ78/LZW
1999  Navarro and Raffinot              LZ family, hybrid LZ
1999  Kida, et al.                      dictionary-based methods (collage system)
1999  Gąsieniec and Rytter              LZW
2000  Shibata, et al.                   collage systems
2000  Kärkkäinen, Navarro, and Ukkonen  LZ family
2000  Matsumoto, et al.                 simple collage systems
2000  Navarro and Tarhio                LZ family
2000  Klein and Shapira                 LZSS variant
2001  Klein and Shapira                 Huffman encoding

Approximate String Matching

• Edit distance ed(P, P’)
– Insertions, deletions and replacements
• Report all occurrences of any string P’ such that ed(P, P’) ≤ k, for a given pattern P and error threshold k.
• Survey paper: G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 2000.

Example.

Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC

Pattern: TAAATCACGGCATACT

k = 2
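The problem statement above can be made concrete with a minimal sketch. It uses the classical column-wise dynamic programming scan over plain (uncompressed) text, not any algorithm from these slides; the function name `approx_ends` is hypothetical.

```python
def approx_ends(pattern, text, k):
    """Return every end position j (1-based) such that some substring of
    text ending at j is within edit distance k of pattern.

    Illustrative sketch only: plain dynamic programming on uncompressed text.
    """
    m = len(pattern)
    # col[i] = best edit distance between pattern[:i] and a text suffix
    # ending at the current position; a match may start anywhere, so the
    # top cell of every column is 0.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]
        col[0] = 0
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                          # skip text char
                      col[i - 1] + 1,                      # skip pattern char
                      prev_diag + (pattern[i - 1] != c))   # match or replace
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    text = "ACCCTGTTTAGATCACGGCACTACTGTAAAC"
    pattern = "TAAATCACGGCATACT"
    print(approx_ends(pattern, text, 2))
```

On the slide's example, the reported end positions include 25: the text substring TAGATCACGGCACTACT differs from the pattern by one replacement and one inserted character.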

Previous Results

• J. Kärkkäinen, G. Navarro, and E. Ukkonen. Approximate string matching over Ziv-Lempel compressed text. In Proc. CPM 2000.
– Dynamic programming technique
– O(mkn + R) worst case, O(k²n + R) average case

• T. Matsumoto, T. Kida, M. Takeda, A. Shinohara, and S. Arikawa. Bit-parallel approach to approximate string matching in compressed texts. In Proc. SPIRE 2000.
– Bit-parallel technique
– O(mk³n/w) worst case

Our Search Approach on LZ78/LZW


• Our search approach on LZ78/LZW
– Basic idea
– Multiple pattern matching algorithms on compressed text


Basic Idea

• Filtration technique (Wu and Manber, 1992)
– Split the pattern into k+1 equal-length pieces
– Find the pattern pieces (multiple pattern matching)
– Direct verification of the candidate text areas (we have chosen Myers’ algorithm)

Example.

Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC

Pattern: TAAATCACGGCATACT

k = 2

Pattern pieces: TAAAT, CACGG, CATACT
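The filtration steps above can be sketched as follows. The idea rests on a counting argument: k errors can damage at most k of the k+1 pieces, so every approximate occurrence contains at least one piece unchanged. This sketch works on plain text and makes two substitutions that are assumptions, not the slides' method: pieces are located with `str.find` instead of a multiple pattern matcher on compressed text, and candidates are verified with plain dynamic programming instead of Myers' algorithm. All function names are hypothetical.

```python
def split_pattern(pattern, k):
    """Split pattern into k+1 contiguous pieces of nearly equal length.
    Returns (offset_in_pattern, piece) pairs. Assumes len(pattern) > k."""
    m = len(pattern)
    cuts = [i * m // (k + 1) for i in range(k + 2)]
    return [(cuts[i], pattern[cuts[i]:cuts[i + 1]]) for i in range(k + 1)]


def verify(pattern, window, k):
    """End positions (1-based, relative to window) of substrings of window
    within edit distance k of pattern. Plain DP stands in for Myers."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(window, start=1):
        prev_diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            cur = min(col[i] + 1, col[i - 1] + 1,
                      prev_diag + (pattern[i - 1] != c))
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends


def filter_search(pattern, text, k):
    """Filtration: find exact pieces, then verify only around each hit."""
    m = len(pattern)
    ends = set()
    for offset, piece in split_pattern(pattern, k):
        pos = text.find(piece)
        while pos != -1:
            # An occurrence containing this piece must lie in this window.
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + m + k)
            ends.update(lo + e for e in verify(pattern, text[lo:hi], k))
            pos = text.find(piece, pos + 1)
    return sorted(ends)


if __name__ == "__main__":
    text = "ACCCTGTTTAGATCACGGCACTACTGTAAAC"
    pattern = "TAAATCACGGCATACT"
    print([p for _, p in split_pattern(pattern, 2)])
    print(filter_search(pattern, text, 2))
```

On the slide's example only the piece CACGG occurs in the text, so a single window of 20 characters is verified instead of the whole text.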

Why LZ78/LZW?

• We have already developed a multiple pattern matching algorithm on LZW.

• Easy to decompress locally.
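A minimal sketch of why local decompression is easy in this family: each LZ78 phrase is an earlier phrase plus one character, so a single phrase can be expanded by following parent links without decoding the rest of the text. The (parent, character) encoding and the function names below are assumptions for illustration, not the format used in the experiments.

```python
def lz78_compress(text):
    """Parse text into LZ78 phrases, returned as (parent_index, char) pairs.
    Phrase indices are 1-based; parent 0 denotes the empty phrase."""
    trie = {}        # (parent_index, char) -> phrase index
    phrases = []
    node = 0
    for c in text:
        nxt = trie.get((node, c))
        if nxt is not None:
            node = nxt                      # keep extending current phrase
        else:
            phrases.append((node, c))       # new phrase = parent + c
            trie[(node, c)] = len(phrases)
            node = 0
    if node != 0:
        phrases.append((node, ""))          # text ended inside a known phrase
    return phrases


def expand_phrase(phrases, idx):
    """Decode only phrase idx by walking its parent links back to the root."""
    chars = []
    while idx != 0:
        parent, c = phrases[idx - 1]
        chars.append(c)
        idx = parent
    return "".join(reversed(chars))


if __name__ == "__main__":
    text = "CTTAATTAAGCCCCCTGCTAAGCT"
    phrases = lz78_compress(text)
    print([expand_phrase(phrases, i) for i in range(1, len(phrases) + 1)])
```

Expanding a phrase costs time proportional to its length, which is what makes direct verification of a candidate text area cheap.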

Multiple Pattern Matching Algorithms on Compressed Text

• Aho-Corasick technique

• Boyer-Moore technique

• Bit parallel technique

Aho-Corasick Technique

• T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa. Multiple pattern matching in LZW compressed text. In Proc. DCC’98.

• Simulates the AC machine
• Runs over the LZW-compressed text directly
• O(m² + n + R) time, O(m² + n) space

Aho-Corasick Technique

[Figure: the Aho-Corasick machine for the patterns TTAA and AA (states 0 to 6, with goto and failure functions) run over the compressed text blocks b1 ... b7.]

Patterns: TTAA, AA

Original text: ...CTTAATTAAGCCCCCTGCTAAGCT

Compressed text: b1 b2 b3 b4 b5 b6 b7

State transition: 0 1 3 0 0 5 0 1

Pattern occurrences: TTAA, AA
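For reference, here is a minimal sketch of the ordinary Aho-Corasick machine on plain text, using the patterns from the figure. It is not the compressed-text simulation from the cited paper, which processes whole LZW blocks at a time; the function names are hypothetical.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output functions of the AC machine."""
    goto, out = [{}], [[]]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({})
                out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())        # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] = out[t] + out[fail[t]]  # inherit shorter matches
            queue.append(t)
    return goto, fail, out


def ac_search(patterns, text):
    """Return (start_position, pattern) for every occurrence, 0-based."""
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits


if __name__ == "__main__":
    print(ac_search(["TTAA", "AA"], "CTTAATTAAGCCCCCTGCTAAGCT"))
```

The scan makes one state transition per text character; the compressed-text version aims to make one per block instead.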

Boyer-Moore Technique

• G. Navarro and J. Tarhio. Boyer-Moore string matching over Ziv-Lempel compressed text. In Proc. CPM 2000.

• Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa. A Boyer-Moore type algorithm for compressed pattern matching. In Proc. CPM 2000.

• T. Kida et al. Multiple pattern matching algorithms on collage system. In Proc. CPM 2001, to appear.

Boyer-Moore Technique

1. Find all occurrences that end in the focused block.

2. Calculate the maximum safe shift.

3. Move the focus according to that shift.

[Figure: the focus moving over the compressed text blocks b1 ... b7, with the pattern occurrences marked.]

Original text: ...CTTAATTAAGCCCCCTGCTAAGCT

Compressed text: b1 b2 b3 b4 b5 b6 b7
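The check, shift, move loop above can be illustrated on plain text with a Horspool-style scan for several patterns. This is a swapped-in, simplified relative of the Boyer-Moore family, working on characters rather than compressed blocks; it is not the algorithm of the cited papers, and `multi_horspool` is a hypothetical name.

```python
def multi_horspool(patterns, text):
    """Return (start_position, pattern) for every occurrence, 0-based.

    Uses a window as long as the shortest pattern. After checking the
    window, it is moved by a safe shift that depends only on the text
    character under the window's last position.
    """
    lmin = min(len(p) for p in patterns)
    # shift[c]: distance from the rightmost c among the first lmin-1
    # characters of any pattern to the end of the window.
    shift = {}
    for p in patterns:
        for i in range(lmin - 1):
            shift[p[i]] = min(shift.get(p[i], lmin), lmin - 1 - i)
    hits = []
    pos = 0
    while pos + lmin <= len(text):
        # Step 1: report the occurrences that start at the window.
        for p in patterns:
            if text.startswith(p, pos):
                hits.append((pos, p))
        # Steps 2 and 3: compute the safe shift and move the window.
        pos += shift.get(text[pos + lmin - 1], lmin)
    return hits


if __name__ == "__main__":
    text = "ACCCTGTTTAGATCACGGCACTACTGTAAAC"
    print(multi_horspool(["TAAAT", "CACGG", "CATACT"], text))
```

Characters that occur in no pattern allow the window to jump by its full length, which is where the speed-up over character-by-character scanning comes from.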

Bit Parallel Technique

• G. Navarro and M. Raffinot. A general practical approach to pattern matching over Ziv-Lempel compressed text. In Proc. CPM’99.

• T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa. Shift-And approach to pattern matching in LZW compressed text. In Proc. CPM’99.

Bit Parallel Technique

Compressed text: ... bi-1 bi bi+1 ...

Focused phrase: AAGTTAACTTAAGCCGTT

Pattern: TTAA

Bit vectors:

(i) Pattern suffixes: 110000000000000000

(ii) Occurrences inside block bi: 000000100001000000

(iii) Pattern prefixes: 000000000000000011
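Bit vector (ii) can be reproduced with a minimal Shift-And sketch on the decoded phrase. This is the plain-text bit-parallel algorithm, not the compressed-text version of the cited papers, which also needs vectors (i) and (iii) to combine neighbouring blocks; the function name is hypothetical.

```python
def shift_and(pattern, text):
    """Return the 0-based end positions of all occurrences of pattern,
    using the bit-parallel Shift-And algorithm."""
    m = len(pattern)
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    state = 0            # bit i set: pattern[:i+1] matches a suffix read so far
    ends = []
    for j, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            ends.append(j)
    return ends


if __name__ == "__main__":
    phrase = "AAGTTAACTTAAGCCGTT"
    ends = set(shift_and("TTAA", phrase))
    print("".join("1" if j in ends else "0" for j in range(len(phrase))))
```

Each text character costs a shift, an OR and an AND on one machine word when the pattern fits in a word, which is the source of the speed of this technique.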

Experimental Results





Machine: Intel Pentium III at 550 MHz with 64 MB of RAM, running Linux

Texts: 10 MB of Wall Street Journal (WSJ) articles and 10 MB of DNA data

Compression: WSJ was compressed to 42.59% of its original size and DNA to 27.71%



Conclusion

• We applied the filtration technique to compressed texts.

• We implemented two new multiple pattern matching algorithms on compressed text.
– Boyer-Moore type and bit-parallel type.

• We showed that this is a practical solution for approximate pattern matching on compressed text.
– 10-30 times faster than previous solutions.
– Up to 3 times faster than decompressing plus searching.
