Speeding up pattern matching by text compression


Department of Informatics, Kyushu University, Japan
Department of AI, Kyushu Institute of Technology, Japan

Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa

Contents

Pattern matching on compressed text.

A unifying framework for compressed pattern matching (Collage System).

Byte pair encoding (BPE).

Pattern matching algorithm on BPE compressed text.

Experimental result.

Conclusion.

Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. Under the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism.

Pattern Matching Problem

[Figure: matching a Pattern against a Text]

Knuth-Morris-Pratt (1974)

Boyer-Moore (1977)

Aho-Corasick (1975)

Shift-Or (1992)

Pattern Matching on Compressed Text

[Figure: the original text on secondary disk storage is transferred to memory and searched. The compressed text on secondary disk storage is transferred to memory, expanded in memory, and then searched.]

It requires extra time and space.

Pattern Matching on Compressed Text

[Figure: the compressed text on secondary disk storage is transferred to memory and searched directly.]

GOAL 1: To perform a faster search in compressed texts in comparison with a regular decompression followed by an ordinary search.

GOAL 2: To perform a faster search in compressed texts in comparison with an ordinary search in the original texts (speeding up pattern matching by text compression).

Previous Results (1)

year  researcher                          compression
1988  Eilam-Tzoreff and Vishkin           run-length
1992  Amir, Landau, and Vishkin           two-dimensional run-length
1992  Amir and Benson                     two-dimensional run-length
1994  Amir, Benson, and Farach            two-dimensional run-length
1994  Manber                              original compression scheme
1995  Farach and Thorup                   LZ77
1996  Amir, Benson, and Farach            LZW
1996  Gasieniec, et al.                   LZ77
1997  Karpinski, Rytter, and Shinohara    straight-line programs
1997  Miyazaki, Shinohara, and Takeda     straight-line programs
1997  Takeda                              finite state encoding
1998  Shibata                             byte pair encoding
1998  Fukamachi, Shinohara, and Takeda    Huffman encoding
1998  Kida, et al.                        LZW

Previous Results (2)

year  researcher                                     compression
1998  de Moura, Navarro, Ziviani, and Baeza-Yates    Word based encoding
1999  Kida, Takeda, Shinohara, and Arikawa           LZW
1999  Shibata, Takeda, Shinohara, and Arikawa        Antidictionary based
1999  Navarro and Raffinot                           LZ family
1999  Kida, et al.                                   Dictionary based methods (Collage system)   [Unifying framework]
2000  Shibata, et al.                                Byte pair encoding   [Today's talk]

A Unifying Framework for Compressed Pattern Matching

Previous:
    Compression A → PM Algorithm A
    Compression B → PM Algorithm B
    Compression C → PM Algorithm C

Kida et al. [1999]:
    Compression A, Compression B, Compression C → Collage system → Pattern matching algorithm on the unifying framework

Collage System

Definition and Several Examples

Dictionary Based Compression

[Figure: the original text is factorized into a series of phrases; encoding produces a compressed text and a dictionary structure.]

• How to choose the phrases.
• How to design the data structure of the dictionary.
• How to encode phrases.

Collage System

Collage system is a pair 〈D, S〉

D : A sequence of assignments (Dictionary structure)
    X_1 := expr_1 ; X_2 := expr_2 ; ... ; X_n := expr_n ;
    ||D|| = n : number of assignments in D

S : A sequence of variables defined in D (Compressed text)
    S = X_{i_1} , X_{i_2} , ... , X_{i_l}   ( X_{i_j} ∈ D )
    |S| = l : number of variables in S

where each expr_k is one of:

    a              for a ∈ Σ ∪ {ε}               (primitive assignment)
    X_i · X_j      for i, j < k                   (concatenation)
    (X_i)^j        for i < k and an integer j     (j times repetition)
    [j]X_i         for i < k and an integer j     (prefix truncation)
    X_i[j]         for i < k and an integer j     (suffix truncation)

Example of Collage System

D :
    X1 = a ;
    X2 = b ;
    X3 = X1 · X2 ;
    X4 = X2 · X1 ;
    X5 = (X3)^3 ;
    X6 = [3]X5 ;
    X7 = X6 · X4 ;

S : X3 , X6 , X4 , X7

Text represented by S : abbabbababba

[Figure: derivation tree T(X7). X7 = X6 · X4; X6 is the prefix truncation [3] of X5; X5 is the 3 times repetition of X3 = X1 · X2 = ab; X4 = X2 · X1 = ba.]

height(X7) = 4

height(D) = 4
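The example can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the tuple tags ("char", "cat", "rep", "ptrunc", "strunc") are illustrative names, and truncation by j is assumed to drop j characters, which is the reading consistent with the text abbabbababba shown on the slide.

```python
from functools import lru_cache

# Dictionary D of the example. Tags are hypothetical names chosen for this sketch.
D = {
    1: ("char", "a"),
    2: ("char", "b"),
    3: ("cat", 1, 2),      # X3 = X1 · X2
    4: ("cat", 2, 1),      # X4 = X2 · X1
    5: ("rep", 3, 3),      # X5 = (X3)^3
    6: ("ptrunc", 5, 3),   # X6 = [3]X5, assumed to drop the first 3 characters
    7: ("cat", 6, 4),      # X7 = X6 · X4
}
S = [3, 6, 4, 7]


@lru_cache(maxsize=None)
def expand(k):
    """Return the string represented by variable X_k."""
    op = D[k]
    if op[0] == "char":
        return op[1]
    if op[0] == "cat":
        return expand(op[1]) + expand(op[2])
    if op[0] == "rep":
        return expand(op[1]) * op[2]
    if op[0] == "ptrunc":
        return expand(op[1])[op[2]:]
    if op[0] == "strunc":
        s = expand(op[1])
        return s[:len(s) - op[2]]
    raise ValueError(f"unknown expression: {op!r}")


@lru_cache(maxsize=None)
def height(k):
    """Height of the derivation tree T(X_k); primitives have height 0."""
    op = D[k]
    if op[0] == "char":
        return 0
    if op[0] == "cat":
        return 1 + max(height(op[1]), height(op[2]))
    return 1 + height(op[1])


text = "".join(expand(k) for k in S)
print(text)                           # abbabbababba
print(max(height(k) for k in D))      # 4
```

Expanding every variable like this is exactly the cost that compressed pattern matching tries to avoid; the sketch only verifies the example.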


Pattern Matching Algorithm on a Collage System

Compressed pattern matching on a collage system

m : pattern length
r : number of pattern occurrences
||D|| : number of assignments in D
|S| : number of variables in S

Theorem [Kida et al. 1999]
The problem of compressed pattern matching can be solved in O((||D|| + |S|) · height(D) + m^2 + r) time using O(||D|| + m^2) space. If D contains no truncation, it can be solved in O(||D|| + |S| + m^2 + r) time.


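Basic Idea

Pattern π = a b a b b

[Figure: KMP automaton for π with states 0 to 5. Goto function: 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -b-> 5; the failure function is drawn alongside. Initial state: 0.]

original text :    a b a b a b b a
state sequence :   1 2 3 4 3 4 5 1

compressed text S : X_{i_1} X_{i_2} X_{i_3} X_{i_4}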
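The state sequence on this slide can be reproduced with a plain KMP automaton. This is a minimal sketch of the standard KMP technique; the function names are chosen for illustration and do not come from the slides.

```python
def failure_function(pattern):
    """fail[q] = length of the longest proper border of pattern[:q]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for q in range(2, len(pattern) + 1):
        while k > 0 and pattern[q - 1] != pattern[k]:
            k = fail[k]
        if pattern[q - 1] == pattern[k]:
            k += 1
        fail[q] = k
    return fail


def kmp_step(pattern, fail, q, c):
    """One state transition of the KMP automaton on character c."""
    if q == len(pattern):          # after a full match, fall back first
        q = fail[q]
    while q > 0 and pattern[q] != c:
        q = fail[q]
    if pattern[q] == c:
        q += 1
    return q


def state_sequence(pattern, text):
    """States visited while reading text, starting from state 0."""
    fail = failure_function(pattern)
    q, states = 0, []
    for c in text:
        q = kmp_step(pattern, fail, q, c)
        states.append(q)
    return states


print(state_sequence("ababb", "abababba"))   # [1, 2, 3, 4, 3, 4, 5, 1]
```

State 5 is reached after the seventh character, which is where the pattern ababb ends in the text.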

Jump and Output

The function Jump(j, u) = δ_KMP(j, u)
• The domain is Q × D.
• It simulates the sequence of state transitions for u.
• Reply in O(1) time.

The set Output(j, u) = { 1 ≤ i ≤ |u| : P is a suffix of P[1:j] · u[1:i] }
• This set contains the pattern occurrences.
• Reply in O(l) time.
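The two definitions can be read off directly as code. The sketch below evaluates them naively on an explicit string u, so it takes time proportional to |u| and does not achieve the O(1) and O(l) bounds stated above; `delta` is a deliberately simple stand-in for the KMP transition function.

```python
def delta(pattern, q, c):
    """Longest prefix of pattern that is a suffix of pattern[:q] + c."""
    s = pattern[:q] + c
    for k in range(min(len(pattern), len(s)), -1, -1):
        if s.endswith(pattern[:k]):
            return k


def jump(pattern, q, u):
    """Jump(q, u): the state reached from q after reading all of u."""
    for c in u:
        q = delta(pattern, q, c)
    return q


def output(pattern, q, u):
    """Output(q, u): positions i (1-based) in u where an occurrence ends."""
    found = []
    for i, c in enumerate(u, start=1):
        q = delta(pattern, q, c)
        if q == len(pattern):
            found.append(i)
    return found


# From state 4 (having read "abab"), the phrase u = "abba" completes "ababb"
# at its third character and leaves the automaton in state 1.
print(jump("ababb", 4, "abba"))     # 1
print(output("ababb", 4, "abba"))   # [3]
```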

Realization of Jump and Output

for Jump(q, X_k), if X_k is ...
    a : O(1) time
    X_i · X_j : If the factor concatenation problem for a string of length m can be solved in O(1) time, it can be solved in O(1) time.

for Output(q, X_k), if X_k is ...
    a : O(1) time
    X_i · X_j : It can be enumerated in O(l) time from Output of X_i and X_j, where l is the size of the set Output.

Factor Concatenation Problem

Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.
Question: Is the string xy a factor of P? If ‘yes’ then return its node number.

example: P = COPACABANA
    x = OPA, y = CABAN; concatenate: xy = OPACABAN. ‘Yes’! xy = P[2:9]

Solution to the problem

• Using a suffix trie, it can be solved in O(m) time after preprocessing of O(m^2) time and space.

• Using a two-dimensional lookup table, it can be solved in O(1) time, but we need O(m^4) time and space preprocessing.

It can be solved in O(1) time after O(m^2) space and time preprocessing.
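To make the problem statement concrete, here is a small sketch that answers factor concatenation queries with a hash table of all factors. It is a stand-in for illustration only: it is neither the suffix-trie solution nor the O(m^2) preprocessing result quoted above, and it identifies a factor by the 1-based interval of its first occurrence instead of a trie node number.

```python
def build_factor_index(p):
    """Map every distinct factor of p to the 1-based (start, end) of its first occurrence."""
    index = {}
    for i in range(len(p)):
        for j in range(i + 1, len(p) + 1):
            index.setdefault(p[i:j], (i + 1, j))
    return index


def concat_factor(index, x, y):
    """Return the interval of xy in P if xy is a factor of P, otherwise None."""
    return index.get(x + y)


index = build_factor_index("COPACABANA")
print(concat_factor(index, "OPA", "CABAN"))   # (2, 9), i.e. P[2:9]
print(concat_factor(index, "ANA", "C"))       # None: ANAC is not a factor
```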

Outline of Our Algorithm

Input. Pattern P and collage system 〈D, S〉 ( S = X_{i_1} , X_{i_2} , ... , X_{i_n} )
Output. All occurrences of the pattern.

/* preprocessing of D and P */
preprocess(D); preprocess(P);

l := 0; q := 0;
for j := 1 to n do begin
    for each d ∈ Output(q, X_{i_j}) do
        report ‘pattern occurs at position l + d’;
    q := Jump(q, X_{i_j});      /* state transition */
    l := l + |X_{i_j}|;         /* calculation of the offset */
end
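The outline can be turned into a short runnable sketch for dictionaries that contain only primitive assignments and concatenations. This is a simplification, not the algorithm of Kida et al.: Jump and Output are stored as full tables over all states and variables, which costs more space than the O(||D|| + m^2) of the theorem, and repetition and truncation are not handled. The dictionary encoding (a character, or a pair of earlier variable indices) is an assumption of this sketch.

```python
def delta(pattern, q, c):
    """Longest prefix of pattern that is a suffix of pattern[:q] + c."""
    s = pattern[:q] + c
    for k in range(min(len(pattern), len(s)), -1, -1):
        if s.endswith(pattern[:k]):
            return k


def preprocess(D, pattern):
    """Build |X_k|, Jump(q, X_k) and Output(q, X_k) for every variable and state.

    D maps k to a single character (primitive) or to a pair (i, j) with i, j < k.
    """
    m = len(pattern)
    length, jump, output = {}, {}, {}
    for k in sorted(D):
        expr = D[k]
        if isinstance(expr, str):                     # primitive assignment
            length[k] = 1
            jump[k] = [delta(pattern, q, expr) for q in range(m + 1)]
            output[k] = [[1] if jump[k][q] == m else [] for q in range(m + 1)]
        else:                                         # concatenation X_i · X_j
            i, j = expr
            length[k] = length[i] + length[j]
            jump[k] = [jump[j][jump[i][q]] for q in range(m + 1)]
            output[k] = [
                output[i][q] + [length[i] + d for d in output[j][jump[i][q]]]
                for q in range(m + 1)
            ]
    return length, jump, output


def search(D, S, pattern):
    """Return the 1-based end positions of all pattern occurrences."""
    length, jump, output = preprocess(D, pattern)
    occurrences = []
    l, q = 0, 0
    for x in S:
        occurrences.extend(l + d for d in output[x][q])
        q = jump[x][q]        # state transition
        l += length[x]        # calculation of the offset
    return occurrences


# A dictionary with primitives A..F and three concatenations; S expands to
# ABABCDEBDEFABDEABC.
D = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F",
     7: (1, 2), 8: (4, 5), 9: (7, 3)}
S = [7, 9, 8, 2, 8, 6, 7, 8, 9]
print(search(D, S, "DE"))   # [7, 10, 15]
print(search(D, S, "BC"))   # [5, 18]
```

The main loop touches each variable of S once and never expands it, which is the point of the outline: the work is proportional to the compressed length plus the number of reported occurrences.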

Compressed pattern matching on a collage system

no truncation (LZ78, LZW, BPE, Run-length, etc.):
    O(||D|| + |S| + m^2 + r) time

truncation (LZ77, LZSS, etc.):
    O((||D|| + |S|) · height(D) + m^2 + r) time
    not suitable for speeding up pattern matching

Byte Pair Encoding

original encoding algorithm and modified algorithm

Byte Pair Encoding

Text: T = ABABCDEBDEFABDEABC

Used characters: A B C D E F (character codes: A B C D E F G H I)

    AB → G :  GGCDEBDEFGDEGC
    DE → H :  GGCHBHFGHGC
    GC → I :  GIHBHFGHI

Pair Table

    Code  Pair
    G     AB
    H     DE
    I     GC
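The three replacement steps can be reproduced with a short sketch of the basic BPE loop. It is a simplified illustration and not the authors' encoder: pairs are counted with overlaps (which matters only for runs such as AAA), ties are broken by first appearance, and the set of unused characters is passed in explicitly.

```python
from collections import Counter


def bpe_compress(text, unused):
    """Repeatedly replace the most frequent adjacent pair by an unused character."""
    rules = []                                  # (code, pair) in order of creation
    for code in unused:
        counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:                            # no pair repeats, nothing to gain
            break
        text = text.replace(pair, code)         # left-to-right, non-overlapping
        rules.append((code, pair))
    return text, rules


def bpe_decompress(text, rules):
    """Undo the replacements in reverse order."""
    for code, pair in reversed(rules):
        text = text.replace(code, pair)
    return text


compressed, rules = bpe_compress("ABABCDEBDEFABDEABC", unused="GHI")
print(compressed)   # GIHBHFGHI
print(rules)        # [('G', 'AB'), ('H', 'DE'), ('I', 'GC')]
```

Each round rescans the whole text, so this loop costs one pass per character code.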

Byte Pair Encoding as a collage system

Text: T = ABABCDEBDEFABDEABC

    AB → G :  GGCDEBDEFGDEGC
    DE → H :  GGCHBHFGHGC
    GC → I :  GIHBHFGHI

D :
    X1 = A ; X2 = B ; X3 = C ; X4 = D ; X5 = E ; X6 = F ;
    X7 = X1 · X2 ;
    X8 = X4 · X5 ;
    X9 = X7 · X3 ;

S : X7 , X9 , X8 , X2 , X8 , X6 , X7 , X8 , X9
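The translation from a BPE pair table to a collage system is mechanical: one primitive assignment per used character, then one concatenation per replaced pair. A minimal sketch follows; the function name and the dictionary encoding (a character, or a pair of variable indices) are choices made for this illustration.

```python
def bpe_to_collage(compressed, rules, alphabet):
    """Return (D, S): D maps a variable index to a character or a pair (i, j)."""
    D, var = {}, {}
    for ch in alphabet:                         # primitive assignments
        var[ch] = len(D) + 1
        D[var[ch]] = ch
    for code, pair in rules:                    # one concatenation per pair
        var[code] = len(D) + 1
        D[var[code]] = (var[pair[0]], var[pair[1]])
    S = [var[ch] for ch in compressed]
    return D, S


D, S = bpe_to_collage(
    "GIHBHFGHI",
    [("G", "AB"), ("H", "DE"), ("I", "GC")],
    alphabet="ABCDEF",
)
print(D[7], D[8], D[9])   # (1, 2) (4, 5) (7, 3)
print(S)                  # [7, 9, 8, 2, 8, 6, 7, 8, 9]
```

Because only primitives and concatenations appear, a BPE dictionary never contains truncation.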

Speeding up of compression

Time complexity of BPE: O(uN)

    u : the number of character codes
    N : text length

Using a doubly-linked list: O(u + N) time

Speed-up of compression

We apply the BPE algorithm to the first block of the original text to obtain D:

    X1 = A
    X2 = C
    X3 = X2 · X1
    ...
    X255 = X247 · X8
    X256 = X125 · X48

[Figure: using D, a Pattern Matching Machine for multiple replacement [Arikawa et al. 1984] converts the original text into the BPE compressed text.]

Comparison of Compression Ratio and Time

Compression ratio (%)

                          BPE original   BPE modified   Compress   Gzip
    Brown corpus (6.8Mb)      51.0           59.0         43.7     39.0
    Medline (60.3Mb)          56.2           59.0         42.3     33.3
    Genbank (17.1Mb)          30.8           32.5         26.8     23.1

Compression time (sec)

                          BPE original   BPE modified   Compress   Gzip
    Brown corpus             196.9            8.0         12.7     37.7
    Medline                 1699.9           60.7         73.3    242.2
    Genbank                  440.6           16.5         19.3    100.9

The compression ratios of BPE are worse than those of “Compress” and “Gzip”.

The compression time is drastically reduced by our modification.

Compressed pattern matching on BPE compressed text

The problem of compressed pattern matching on BPE compressed text can be solved in O(||D|| + |S| + m^2 + r) time, since D contains no truncation.

||D|| ≤ 256

-The dictionary D is encoded separately from the sequence S.

-The size of D is small enough.

-The variables of S are encoded using a fixed length code.

Experimental result

[Figure: two plots of running time (sec) against pattern length (5 to 30), comparing KMP, Agrep, and our algorithm. One plot is for Medline data, a clinically-oriented subset of Medline (compression ratio is 59%); the other is for Genbank data, a data set from GenBank (compression ratio is 32%).]

Concluding Remarks

Conclusion and Future Works

Conclusion

We introduced compressed pattern matching from practical viewpoints.

We observed that the running time of our algorithm is reduced at about the same rate as the compression ratio, compared with searching the uncompressed text.

We also observed that it is occasionally faster than Agrep.

Future Works

• Can we reduce the complexity of the preprocessing from O(m^2) to O(m)?

• To develop a sublinear algorithm on BPE compressed texts.

• To develop an approximate pattern matching algorithm on a collage system.

• To develop a new compression method that is suitable for compressed pattern matching.

More recent work

A Boyer-Moore type algorithm for compressed pattern matching [CPM2000]

We proposed a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed texts.

Does text compression speed up such a sublinear time algorithm?

[Figure: two plots of running time (sec) against pattern length (5 to 30), comparing KMP, Agrep, and the most recent work. One plot is for Medline data (compression ratio is 59%); the other is for Genbank data (compression ratio is 32%).]
