
Speeding up pattern matching by text compression


Speeding up pattern matching by text compression. Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa. Department of Informatics, Kyushu University, Japan; Department of AI, Kyushu Institute of Technology, Japan.

Page 1: Speeding up pattern matching  by text compression

Speeding up pattern matching by text compression

Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa

Department of Informatics, Kyushu University, Japan
Department of AI, Kyushu Institute of Technology, Japan

Page 2: Speeding up pattern matching  by text compression

Contents

Pattern matching on compressed text.

A unifying framework for compressed pattern matching (Collage System).

Byte pair encoding (BPE).

Pattern matching algorithm on BPE compressed text.

Experimental result.

Conclusion.

Page 3: Speeding up pattern matching  by text compression

Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. By the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism.

Pattern Matching Problem

[Figure: matching a Pattern against a Text.]

Knuth-Morris-Pratt (1974)

Boyer-Moore (1977)

Aho-Corasick (1975)

Shift-Or (1992)

Page 4: Speeding up pattern matching  by text compression

Pattern Matching on Compressed Text

[Figure: two search pipelines. Original text: secondary disk storage → file transfer → memory → search. Compressed text: secondary disk storage → file transfer → memory → expand → search.]

Expanding the compressed text requires extra time and space.

Page 5: Speeding up pattern matching  by text compression

Pattern Matching on Compressed Text

[Figure: compressed text on secondary disk storage → file transfer → compressed text in memory → search directly.]

GOAL 1: To perform a faster search in compressed texts in comparison with a regular decompression followed by an ordinary search.

GOAL 2: To perform a faster search in compressed texts in comparison with an ordinary search in the original texts (speeding up pattern matching by text compression).

Page 6: Speeding up pattern matching  by text compression

Previous Results(1)

year  researcher                         compression
1988  Eilam-Tzoreff and Vishkin          run-length
1992  Amir and Benson                    two-dimensional run-length
1992  Amir, Landau, and Vishkin          two-dimensional run-length
1994  Amir, Benson, and Farach           two-dimensional run-length
1994  Manber                             original compression scheme
1995  Farach and Thorup                  LZ77
1996  Amir, Benson, and Farach           LZW
1996  Gasieniec, et al.                  LZ77
1997  Karpinski, Rytter, and Shinohara   straight-line programs
1997  Miyazaki, Shinohara, and Takeda    straight-line programs
1997  Takeda                             finite state encoding
1998  Fukamachi, Shinohara, and Takeda   Huffman encoding
1998  Kida, et al.                       LZW
1998  Shibata                            byte pair encoding

Page 7: Speeding up pattern matching  by text compression

Previous Results(2)

year  researcher                                    compression
1998  de Moura, Navarro, Ziviani, and Baeza-Yates   Word based encoding
1999  Kida, et al.                                  Dictionary based methods (Collage system)
1999  Kida, Takeda, Shinohara, and Arikawa          LZW
1999  Navarro and Raffinot                          LZ family
1999  Shibata, Takeda, Shinohara, and Arikawa       Antidictionary based
2000  Shibata, et al.                               Byte pair encoding

Unifying framework: Kida, et al. (1999), Collage system.

Today’s talk: Shibata, et al. (2000), Byte pair encoding.

Page 8: Speeding up pattern matching  by text compression

A Unifying Framework for Compressed Pattern Matching

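Previous: a separate pattern matching (PM) algorithm for each compression method.

Compression A → PM Algorithm A
Compression B → PM Algorithm B
Compression C → PM Algorithm C

Kida et al. [1999]: Compression A, Compression B, and Compression C are all expressed as a collage system, and a single pattern matching algorithm runs on the unifying framework.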

Page 9: Speeding up pattern matching  by text compression

Collage System

Definition and Several Examples

Page 10: Speeding up pattern matching  by text compression

Dictionary Based Compression

[Figure: the original text is factorized into a series of phrases; encoding produces the dictionary structure and the compressed text.]

• How to choose the phrases.
• How to design the data structure of the dictionary.
• How to encode phrases.

Page 11: Speeding up pattern matching  by text compression

Collage System

A collage system is a pair $\langle D, S \rangle$.

D : A sequence of assignments (Dictionary structure)

$X_1 := expr_1;\; X_2 := expr_2;\; \ldots;\; X_n := expr_n$

$\|D\| = n$ : number of assignments in D

S : A sequence of variables defined in D (Compressed text)

$S = X_{i_1}, X_{i_2}, \ldots, X_{i_l}$ ($X_{i_k} \in D$)

$|S| = l$ : number of variables in S

Page 12: Speeding up pattern matching  by text compression

Collage System

D : A sequence of assignments (Dictionary structure)

$X_1 = expr_1;\; X_2 = expr_2;\; \ldots;\; X_n = expr_n$

where each $expr_k$ is one of:

• $a$, for $a \in \Sigma \cup \{\varepsilon\}$ (primitive assignment)
• $X_i \cdot X_j$, for $i, j < k$ (concatenation)
• $(X_i)^j$, for $i < k$ and an integer $j$ ($j$ times repetition)
• ${}^{[j]}X_i$, for $i < k$ and an integer $j$ (prefix truncation)
• $X_i^{[j]}$, for $i < k$ and an integer $j$ (suffix truncation)

Page 13: Speeding up pattern matching  by text compression

Example of Collage System

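D :

$X_1 = a$ ;
$X_2 = b$ ;
$X_3 = X_1 \cdot X_2$ ;
$X_4 = X_2 \cdot X_1$ ;
$X_5 = (X_3)^3$ ; (3 times repetition)
$X_6 = {}^{[3]}X_5$ ; (prefix truncation)
$X_7 = X_6 \cdot X_4$ ;

S : $X_3, X_6, X_4, X_7$

Text: abbabbababba

[Figure: the tree $T(X_7)$. The root $X_7$ has children $X_6$ and $X_4$; $X_6$ is obtained from $X_5$ by prefix truncation; $X_5$ is obtained from $X_3$ by 3 times repetition; $X_3$ has the leaves $X_1$, $X_2$ (a, b) and $X_4$ has the leaves $X_2$, $X_1$ (b, a).]

$height(X_7) = 4$

$height(D) = 4$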
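The example can be checked mechanically. The following is a minimal Python sketch, not the authors' implementation: the tuple encoding of assignments (`"prim"`, `"cat"`, `"rep"`, `"ptrunc"`, `"strunc"`) and the helper names `expand` and `height` are hypothetical, and truncation is assumed to remove the length-j prefix or suffix, which is consistent with the text shown in the example.

```python
# Hypothetical encoding of the example dictionary D: variable index -> assignment.
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),      # X3 = X1 . X2
    4: ("cat", 2, 1),      # X4 = X2 . X1
    5: ("rep", 3, 3),      # X5 = (X3)^3
    6: ("ptrunc", 5, 3),   # X6 = [3]X5
    7: ("cat", 6, 4),      # X7 = X6 . X4
}
S = [3, 6, 4, 7]


def expand(D, k, memo=None):
    """Return the string represented by variable X_k (memoized)."""
    if memo is None:
        memo = {}
    if k not in memo:
        op, *args = D[k]
        if op == "prim":
            memo[k] = args[0]
        elif op == "cat":
            memo[k] = expand(D, args[0], memo) + expand(D, args[1], memo)
        elif op == "rep":
            memo[k] = expand(D, args[0], memo) * args[1]
        elif op == "ptrunc":   # assumed: drop the first j characters
            memo[k] = expand(D, args[0], memo)[args[1]:]
        elif op == "strunc":   # assumed: drop the last j characters
            s = expand(D, args[0], memo)
            memo[k] = s[:len(s) - args[1]]
        else:
            raise ValueError(f"unknown assignment type: {op}")
    return memo[k]


def height(D, k):
    """Height of the tree T(X_k); primitive assignments have height 0."""
    op, *args = D[k]
    if op == "prim":
        return 0
    children = args if op == "cat" else args[:1]
    return 1 + max(height(D, c) for c in children)


text = "".join(expand(D, k) for k in S)
print(text)           # abbabbababba
print(height(D, 7))   # 4
```

Expanding every variable like this takes time proportional to the original text, so it only serves to check a small example; the point of compressed pattern matching is to avoid this expansion.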

Page 14: Speeding up pattern matching  by text compression

Pattern Matching Algorithm on a Collage System

Page 15: Speeding up pattern matching  by text compression

Compressed pattern matching on a collage system

$m$ : pattern length
$r$ : number of pattern occurrences
$\|D\|$ : number of assignments in D
$|S|$ : number of variables in S

Theorem [Kida et al. 1999] The problem of compressed pattern matching can be solved in $O((\|D\| + |S|) \cdot height(D) + m^2 + r)$ time using $O(\|D\| + m^2)$ space. If D contains no truncation, it can be solved in $O(\|D\| + |S| + m^2 + r)$ time.

Page 16: Speeding up pattern matching  by text compression

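Basic Idea

Pattern $\pi$ = ababb

[Figure: the KMP automaton for $\pi$ with states 0 to 5 (initial state 0). The goto function reads a, b, a, b, b along the path 0 → 1 → 2 → 3 → 4 → 5; the failure function is drawn as back edges.]

original text:  a b a b a b b a
state:          1 2 3 4 3 4 5 1

compressed text S : $X_{i_1}\, X_{i_2}\, X_{i_3}\, X_{i_4}$ (representing abababba)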
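The state sequence on this slide can be reproduced with a small KMP simulation. This is a sketch in Python for illustration only; the function names `failure_function` and `kmp_states` are hypothetical, and a non-empty pattern is assumed.

```python
def failure_function(p):
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k > 0 and p[q - 1] != p[k]:
            k = fail[k]
        if p[q - 1] == p[k]:
            k += 1
        fail[q] = k
    return fail


def kmp_states(p, text):
    """Return the KMP state (matched prefix length) after each text character."""
    fail = failure_function(p)
    q, states = 0, []
    for c in text:
        if q == len(p):              # after a full match, fall back first
            q = fail[q]
        while q > 0 and p[q] != c:   # follow the failure function on a mismatch
            q = fail[q]
        if p[q] == c:                # goto function
            q += 1
        states.append(q)
    return states


states = kmp_states("ababb", "abababba")
print(states)   # [1, 2, 3, 4, 3, 4, 5, 1]
```

State 5 is reached after the seventh character, so the pattern occurs ending at position 7. The compressed algorithm aims to produce the same information while consuming one variable of S per step instead of one character.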

Page 17: Speeding up pattern matching  by text compression

Jump and Output

The function Jump(j, u) = $\delta_{KMP}(j, u)$

• It simulates the sequence of state transitions for u.
• The domain is $Q \times D$.
• Reply in $O(1)$ time.

The set Output(j, u) = $\{\, 1 \le i \le |u| \;:\; P \text{ is a suffix of } P[1:j] \cdot u[1:i] \,\}$

• This set contains the pattern occurrences.
• Reply in $O(l)$ time.

Page 18: Speeding up pattern matching  by text compression

Realization of Jump and Output

For Jump(q, $X_k$), if $X_k$ is ...

• $a$ : $O(1)$ time.
• $X_i \cdot X_j$ : $O(1)$ time, provided that the factor concatenation problem for a string of length $m$ can be solved in $O(1)$ time.

For Output(q, $X_k$), if $X_k$ is ...

• $a$ : $O(1)$ time.
• $X_i \cdot X_j$ : the set can be enumerated in $O(l)$ time from the Output of $X_i$ and $X_j$, where $l$ is the size of the set Output.
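The concatenation case can be read off the definitions of Jump and Output: reading $X_i \cdot X_j$ from state $q$ is the same as reading $X_i$ from $q$ and then $X_j$ from the state reached. Written out (a sketch that follows from the definitions only; $|X_i|$ denotes the length of the string that $X_i$ represents, and nothing here says how to evaluate the right-hand sides within the stated time bounds):

```latex
\begin{align*}
\mathrm{Jump}(q,\, X_i \cdot X_j) &= \mathrm{Jump}\bigl(\mathrm{Jump}(q, X_i),\, X_j\bigr),\\
\mathrm{Output}(q,\, X_i \cdot X_j) &= \mathrm{Output}(q, X_i)\;\cup\;
  \bigl\{\, |X_i| + d \;:\; d \in \mathrm{Output}\bigl(\mathrm{Jump}(q, X_i),\, X_j\bigr) \bigr\}.
\end{align*}
```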

Page 19: Speeding up pattern matching  by text compression

Factor Concatenation Problem

Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.
Question: Is the string xy a factor of P? If ‘yes’, then return its node number.

Example: P = COPACABANA. Concatenating x = OPA and y = CABAN gives OPACABAN = P[2:9], so the answer is ‘Yes’.

Page 20: Speeding up pattern matching  by text compression

Solution to the problem

• Using a suffix trie, it can be solved in $O(m)$ time after $O(m^2)$ time and space preprocessing.

• Using a two-dimensional lookup table, it can be solved in $O(1)$ time, but it needs $O(m^4)$ time and space preprocessing.

It can be solved in $O(1)$ time after $O(m^2)$ time and space preprocessing.
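To make the problem statement concrete, here is a deliberately naive Python sketch. It is not the $O(m^2)$ preprocessing structure from the talk: it stores every factor of P as a dictionary key (so it uses far more space) and hashes the whole string xy on each query. The names `build_factor_index` and `concat_factors` are hypothetical, and a factor is identified by the 1-based position of its first occurrence rather than by a suffix trie node number.

```python
def build_factor_index(p):
    """Map every factor of p to the 1-based (start, end) of its first occurrence."""
    index = {}
    for i in range(len(p)):
        for j in range(i + 1, len(p) + 1):
            index.setdefault(p[i:j], (i + 1, j))
    return index


def concat_factors(index, x, y):
    """Return (start, end) of xy in p if xy is a factor of p, otherwise None."""
    return index.get(x + y)


index = build_factor_index("COPACABANA")
print(concat_factors(index, "OPA", "CABAN"))   # (2, 9)
print(concat_factors(index, "OPA", "B"))       # None
```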

Page 21: Speeding up pattern matching  by text compression

Outline of Our Algorithm

Input. Pattern P and collage system 〈 D, S 〉 ( S := X_{i_1}, X_{i_2}, ・・・ , X_{i_n} )
Output. All occurrences of the pattern.

/* preprocessing of D and P */
preprocess(D); preprocess(P);

l := 0; q := 0;
for j := 1 to n do begin
    for each d ∈ Output(q, X_{i_j}) do
        report ‘pattern occurs at position l + d’;
    q := Jump(q, X_{i_j});     /* state transition */
    l := l + |X_{i_j}|;        /* calculation of the offset */
end
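The outline above can be sketched in Python for a collage system without truncation or repetition. This is a simplified illustration, not the authors' data structures: Jump and Output are memoized per (state, variable) pair, which costs more space than the $O(\|D\| + m^2)$ bound of the theorem. The tuple encoding of D, the helper names `build_delta` and `search`, and the sample D and S (the ones from the BPE example T = ABABCDEBDEFABDEABC) are assumptions made for the example.

```python
def build_delta(p):
    """KMP automaton as a list of dicts; delta[q].get(c, 0) is the next state."""
    m = len(p)
    delta = [{} for _ in range(m + 1)]
    delta[0][p[0]] = 1
    x = 0                                       # restart state for p[1:q]
    for q in range(1, m + 1):
        for c in set(p):
            delta[q][c] = delta[x].get(c, 0)    # mismatch: act like the restart state
        if q < m:
            delta[q][p[q]] = q + 1              # match: advance
            x = delta[x].get(p[q], 0)
    return delta


def search(D, S, p):
    """Return the 1-based end positions of all occurrences of p."""
    delta, m = build_delta(p), len(p)
    length, table = {}, {}

    def length_of(k):
        if k not in length:
            e = D[k]
            length[k] = 1 if e[0] == "prim" else length_of(e[1]) + length_of(e[2])
        return length[k]

    def jump_output(q, k):
        """Return (Jump(q, X_k), Output(q, X_k)), memoized."""
        if (q, k) not in table:
            e = D[k]
            if e[0] == "prim":
                nq = delta[q].get(e[1], 0)
                table[q, k] = (nq, [1] if nq == m else [])
            else:                               # ("cat", i, j)
                qi, out_i = jump_output(q, e[1])
                qj, out_j = jump_output(qi, e[2])
                shifted = [length_of(e[1]) + d for d in out_j]
                table[q, k] = (qj, out_i + shifted)
        return table[q, k]

    occurrences, l, q = [], 0, 0
    for k in S:
        nq, out = jump_output(q, k)
        occurrences.extend(l + d for d in out)  # report occurrences
        q = nq                                  # state transition
        l += length_of(k)                       # calculation of the offset
    return occurrences


D = {1: ("prim", "A"), 2: ("prim", "B"), 3: ("prim", "C"),
     4: ("prim", "D"), 5: ("prim", "E"), 6: ("prim", "F"),
     7: ("cat", 1, 2), 8: ("cat", 4, 5), 9: ("cat", 7, 3)}
S = [7, 9, 8, 2, 8, 6, 7, 8, 9]
print(search(D, S, "BDE"))   # [10, 15]
```

Both occurrences of BDE cross phrase boundaries in S, which is the case the state carried between variables is there to handle.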

Page 22: Speeding up pattern matching  by text compression

Compressed pattern matching on a collage system

No truncation (LZ78, LZW, BPE, Run-length, etc.): $O(\|D\| + |S| + m^2 + r)$ time.

Truncation (LZ77, LZSS, etc.): $O((\|D\| + |S|) \cdot height(D) + m^2 + r)$ time. Not suitable for speeding up pattern matching.

Page 23: Speeding up pattern matching  by text compression

Byte Pair Encoding

original encoding algorithm and modified algorithm

Page 24: Speeding up pattern matching  by text compression

Byte Pair Encoding

Text:  T = ABABCDEBDEFABDEABC

Used characters: A B C D E F

AB→G:  GGCDEBDEFGDEGC
DE→H:  GGCHBHFGHGC
GC→I:  GIHBHFGHI

Pair Table

Code  Pair
G     AB
H     DE
I     GC
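The three substitution steps above can be reproduced with a short Python sketch of the basic (unoptimized) BPE loop: repeatedly replace the most frequent adjacent pair with an unused character. The function names `bpe_compress` and `bpe_expand`, the explicit list of unused codes, and the stopping rule (stop when no pair occurs at least twice) are assumptions made for this illustration.

```python
from collections import Counter


def bpe_compress(text, unused_codes):
    """Replace the most frequent adjacent pair with an unused code, repeatedly."""
    pair_table = {}
    for code in unused_codes:
        counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:                      # assumed stopping rule
            break
        pair_table[code] = pair
        text = text.replace(pair, code)   # left-to-right, non-overlapping
    return text, pair_table


def bpe_expand(text, pair_table):
    """Undo the substitutions, newest code first."""
    for code in reversed(list(pair_table)):
        text = text.replace(code, pair_table[code])
    return text


compressed, pair_table = bpe_compress("ABABCDEBDEFABDEABC", "GHI")
print(compressed)    # GIHBHFGHI
print(pair_table)    # {'G': 'AB', 'H': 'DE', 'I': 'GC'}
```

Recounting all pairs on every round is what makes this naive version slow on large texts.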

Page 25: Speeding up pattern matching  by text compression

Byte Pair Encoding “collage system”

Text:  T = ABABCDEBDEFABDEABC

AB→G:  GGCDEBDEFGDEGC
DE→H:  GGCHBHFGHGC
GC→I:  GIHBHFGHI

D :

$X_1 = A$ ; $X_2 = B$ ; $X_3 = C$ ; $X_4 = D$ ; $X_5 = E$ ; $X_6 = F$ ;
$X_7 = X_1 \cdot X_2$ ;
$X_8 = X_4 \cdot X_5$ ;
$X_9 = X_7 \cdot X_3$ ;

S : $X_7, X_9, X_8, X_2, X_8, X_6, X_7, X_8, X_9$
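The translation from a BPE pair table to a collage system is mechanical: every used character becomes a primitive assignment and every pair becomes a concatenation. A minimal Python sketch, assuming the pair table from the example is given in the order the codes were introduced; the function name `bpe_to_collage` and the tuple encoding are hypothetical.

```python
def bpe_to_collage(compressed, pair_table, alphabet):
    """Build (D, S): D is a list of assignments (X_1 is D[0]), S a list of indices."""
    index = {}          # symbol -> 1-based variable number
    D = []
    for ch in alphabet:                     # primitive assignments
        D.append(("prim", ch))
        index[ch] = len(D)
    for code, pair in pair_table.items():   # concatenations, in creation order
        D.append(("cat", index[pair[0]], index[pair[1]]))
        index[code] = len(D)
    S = [index[c] for c in compressed]
    return D, S


D, S = bpe_to_collage("GIHBHFGHI", {"G": "AB", "H": "DE", "I": "GC"}, "ABCDEF")
print(D[6:])   # [('cat', 1, 2), ('cat', 4, 5), ('cat', 7, 3)]
print(S)       # [7, 9, 8, 2, 8, 6, 7, 8, 9]
```

Since only primitive assignments and concatenations appear, the resulting collage system contains no truncation.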

Page 26: Speeding up pattern matching  by text compression

Speeding up of compression

Time complexity of BPE: $O(uN)$

$u$ : the number of character codes
$N$ : text length

Using a doubly-linked list: $O(u + N)$ time

Page 27: Speeding up pattern matching  by text compression

Speed-up of compression

We apply the BPE algorithm to the first block of the original text.

D: $X_1 = A$ ; $X_2 = C$ ; $X_3 = X_2 \cdot X_1$ ; ... ; $X_{255} = X_{247} \cdot X_8$ ; $X_{256} = X_{125} \cdot X_{48}$

[Figure: original text → Pattern Matching Machine for multiple replacement [Arikawa et al. 1984], built from D → BPE compressed text.]

Page 28: Speeding up pattern matching  by text compression

Comparison of Compression Ratio and Time

Compression ratio (%)

                        BPE original  BPE modified  Compress  Gzip
Brown corpus ( 6.8Mb)   51.0          59.0          43.7      39.0
Medline      (60.3Mb)   56.2          59.0          42.3      33.3
Genbank      (17.1Mb)   30.8          32.5          26.8      23.1

Compression time (sec)

                        BPE original  BPE modified  Compress  Gzip
Brown corpus            196.9         8.0           12.7      37.7
Medline                 1699.9        60.7          73.3      242.2
Genbank                 440.6         16.5          19.3      100.9

The compression ratios of BPE are worse than those of “Compress” and “Gzip”.

The compression time of BPE is drastically reduced by our modification.

Page 29: Speeding up pattern matching  by text compression

Compressed pattern matching on BPE compressed text

The problem of compressed pattern matching on BPE compressed text can be solved in $O(\|D\| + |S| + m^2 + r)$ time.

$\|D\| \le 256$

-The dictionary D is encoded separately from the sequence S.

-The size of D is small enough.

-The variables of S are encoded using a fixed length code.

Page 30: Speeding up pattern matching  by text compression

Experimental result

[Figure: running time (sec) versus pattern length (5 to 30) for KMP, Agrep, and our algorithm. Two panels: Medline data (compression ratio is 59%) and Genbank data (compression ratio is 32%).]

Ultra ...

Medline data: a clinically-oriented subset of Medline.

Genbank data: a data set from GenBank.

Page 31: Speeding up pattern matching  by text compression

Concluding Remarks

Conclusion and Future Works

Page 32: Speeding up pattern matching  by text compression

Conclusion

We introduced compressed pattern matching from practical viewpoints.

We observed that the search time of our algorithm is reduced at the same rate as the compression ratio, compared with the uncompressed case.

We also observed that it is occasionally faster than Agrep.

Page 33: Speeding up pattern matching  by text compression

Future Works

• Can we reduce the complexity of the preprocessing from $O(m^2)$ to $O(m)$?

• To develop a sublinear algorithm on BPE compressed texts.

• To develop an approximate pattern matching algorithm on a collage system.

• To develop a new compression method that is suitable for compressed pattern matching.


Page 34: Speeding up pattern matching  by text compression

More recent work

A Boyer-Moore type algorithm for compressed pattern matching [CPM2000]

Does text compression speed up such a sublinear time algorithm?

We proposed a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed texts.

Page 35: Speeding up pattern matching  by text compression

More recent work

[Figure: running time (sec) versus pattern length (5 to 30) for KMP, Agrep, our algorithm, and the most recent work. Two panels: Medline data (compression ratio is 59%) and Genbank data (compression ratio is 32%).]