Speeding up pattern matching by text compression
Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, Setsuo Arikawa
Department of Informatics, Kyushu University, Japan
Department of AI, Kyushu Institute of Technology, Japan
Contents
Pattern matching on compressed text.
A unifying framework for compressed pattern matching (Collage System).
Byte pair encoding (BPE).
Pattern matching algorithm on BPE compressed text.
Experimental result.
Conclusion.
Pattern matching is one of the most fundamental operations in string processing. Recently, a new trend for accelerating pattern matching has emerged: speeding up pattern matching by text compression. By the traditional criteria for data compression, i.e., compression ratio and compression/decompression time, adaptive dictionary methods such as the Lempel-Ziv family are often preferred. However, such methods cannot speed up pattern matching, since extra work is needed to keep track of the compression mechanism.
Pattern Matching Problem
[Figure: a pattern is matched against a text.]
Knuth-Morris-Pratt (1974)
Boyer-Moore (1977)
Aho-Corasick (1975)
Shift-Or (1992)
Pattern Matching on Compressed Text
[Figure: two search pipelines. Original text: on secondary disk storage → file transfer → on memory → search. Compressed text: on secondary disk storage → file transfer → on memory → expand → search.]
Expanding the compressed text before searching requires extra time and space.
Pattern Matching on Compressed Text
[Figure: compressed text on secondary disk storage → file transfer → on memory → search directly.]
GOAL 1: To perform a faster search in compressed texts in comparison with a regular decompression followed by an ordinary search.
GOAL 2: To perform a faster search in compressed texts in comparison with an ordinary search in the original texts.
Speeding up pattern matching by text compression
Previous Results (1)
year  researcher                              compression
1988  Eilam-Tzoreff and Vishkin               run-length
1992  Amir, Landau, and Vishkin               two-dimensional run-length
1992  Amir and Benson                         two-dimensional run-length
1994  Amir, Benson, and Farach                two-dimensional run-length
1994  Manber                                  original compression scheme
1995  Farach and Thorup                       LZ77
1996  Amir, Benson, and Farach                LZW
1996  Gasieniec, et al.                       LZ77
1997  Karpinski, Rytter, and Shinohara        straight-line programs
1997  Miyazaki, Shinohara, and Takeda         straight-line programs
1997  Takeda                                  finite state encoding
1998  Shibata                                 byte pair encoding
1998  Fukamachi, Shinohara, and Takeda        Huffman encoding
1998  Kida, et al.                            LZW
Previous Results (2)
year  researcher                              compression
1998  de Moura, Navarro, Ziviani, and Baeza-Yates   word based encoding
1999  Shibata, Takeda, Shinohara, and Arikawa       antidictionary based
1999  Kida, Takeda, Shinohara, and Arikawa          LZW
1999  Navarro and Raffinot                          LZ family
1999  Kida, et al.                                  dictionary based methods (Collage system) [unifying framework]
2000  Shibata, et al.                               byte pair encoding [today's talk]
A Unifying Framework for Compressed Pattern Matching
Previous:
Compression A → PM Algorithm A
Compression B → PM Algorithm B
Compression C → PM Algorithm C
Kida et al. [1999]:
Compression A, Compression B, Compression C → Collage system → Pattern matching algorithm on the unifying framework
Collage System
Definition and Several Examples
Dictionary Based Compression
[Figure: the original text is factorized into a series of phrases, giving a dictionary structure; encoding then produces the compressed text.]
• How to choose the phrases.
• How to design the data structure of the dictionary.
• How to encode phrases.
Collage System
A collage system is a pair 〈 D, S 〉
D : A sequence of assignments (Dictionary structure)
X_1 := expr_1 ; X_2 := expr_2 ; ... ; X_n := expr_n ;
where expr_k is one of
a            for a ∈ Σ ∪ {ε}                (primitive assignment)
X_i · X_j    for i, j < k                   (concatenation)
(X_i)^j      for i < k and integer j        (j times repetition)
[j]X_i       for i < k and integer j        (prefix truncation)
X_i[j]       for i < k and integer j        (suffix truncation)
S : A sequence of variables defined in D (Compressed text)
S = X_{i_1} , X_{i_2} , ... , X_{i_l}   ( each X_{i_k} is defined in D )
||D|| = n : number of assignments in D
|S| = l : number of variables in S
Example of Collage System
D :
X1 = a ;
X2 = b ;
X3 = X1 · X2 ;
X4 = X2 · X1 ;
X5 = (X3)^3 ;    (3 times repetition)
X6 = [3]X5 ;     (prefix truncation)
X7 = X6 · X4 ;
S : X3 , X6 , X4 , X7
Represented text: abbabbababba
[Figure: derivation tree T(X7) with X7 at the root, X6 and X4 as its children, X5 below X6, X3 below X5, and the primitive variables X1 and X2 at the leaves.]
height(X7) = 4
height(D) = 4
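The example can be checked mechanically. The Python sketch below expands a collage system into the text it represents. The tuple encoding of assignments and the names `expand_collage` and `height` are illustrative assumptions, not part of the original slides. It also assumes that prefix truncation [j]X removes the first j characters of X, that suffix truncation X[j] removes the last j characters, and that primitive assignments have height 0; the first and last of these agree with the example above.

```python
def expand_collage(D, S):
    """Expand a collage system <D, S> into the plain text it represents.

    D maps a variable index k to one assignment (hypothetical encoding):
      ("prim", a)       primitive assignment, a is a character or ""
      ("cat", i, j)     X_i . X_j
      ("rep", i, j)     (X_i)^j
      ("ptrunc", i, j)  [j]X_i, assumed to drop the first j characters
      ("strunc", i, j)  X_i[j], assumed to drop the last j characters
    Assignments must be listed in increasing order of k.
    """
    val = {}
    for k, expr in D.items():
        op = expr[0]
        if op == "prim":
            val[k] = expr[1]
        elif op == "cat":
            val[k] = val[expr[1]] + val[expr[2]]
        elif op == "rep":
            val[k] = val[expr[1]] * expr[2]
        elif op == "ptrunc":
            val[k] = val[expr[1]][expr[2]:]
        elif op == "strunc":
            base = val[expr[1]]
            val[k] = base[:len(base) - expr[2]]
        else:
            raise ValueError(f"unknown assignment type: {op}")
    return "".join(val[x] for x in S)


def height(D):
    """Height of D, taking primitive assignments to have height 0."""
    h = {}
    for k, expr in D.items():
        if expr[0] == "prim":
            h[k] = 0
        elif expr[0] == "cat":
            h[k] = 1 + max(h[expr[1]], h[expr[2]])
        else:
            h[k] = 1 + h[expr[1]]
    return max(h.values())


# The collage system from the example slide.
D_example = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),
    4: ("cat", 2, 1),
    5: ("rep", 3, 3),
    6: ("ptrunc", 5, 3),
    7: ("cat", 6, 4),
}
S_example = [3, 6, 4, 7]

print(expand_collage(D_example, S_example))  # abbabbababba
print(height(D_example))                     # 4
```

This naive expansion materializes every phrase, so it is only meant for checking small examples, not for searching compressed text.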
Pattern Matching Algorithm on a Collage System
Compressed pattern matching on a collage system
m : pattern length
r : number of pattern occurrences
||D|| : number of assignments in D
|S| : number of variables in S
Theorem [Kida et al. 1999] The problem of compressed pattern matching can be solved in O( (||D|| + |S|) · height(D) + m^2 + r ) time using O( ||D|| + m^2 ) space. If D contains no truncation, it can be solved in O( ||D|| + |S| + m^2 + r ) time.
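Basic Idea
Pattern π = a b a b b
[Figure: KMP automaton for π with states 0 to 5. Goto function: 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -b-> 5; the failure function is drawn as a second kind of arrow.]
original text:   a b a b a b b a
state (from 0):  1 2 3 4 3 4 5 1
S : X_{i_1} X_{i_2} X_{i_3} X_{i_4}, four variables that together represent abababba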
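The state sequence on this slide can be reproduced with a small simulation of the KMP automaton. The sketch below is a textbook-style construction written for illustration; the function names `kmp_failure`, `kmp_step`, and `kmp_states` are assumptions and do not come from the slides. It assumes a non-empty pattern.

```python
def kmp_failure(pattern):
    """fail[q] = length of the longest proper border of pattern[:q]."""
    m = len(pattern)
    fail = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = fail[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        fail[q] = k
    return fail


def kmp_step(pattern, fail, q, c):
    """One transition of the KMP automaton from state q on character c."""
    m = len(pattern)
    # Follow failure links until c can extend the current match.
    while q > 0 and (q == m or pattern[q] != c):
        q = fail[q]
    return q + 1 if pattern[q] == c else q


def kmp_states(pattern, text):
    """States visited while reading text, starting from state 0."""
    fail = kmp_failure(pattern)
    q, states = 0, []
    for c in text:
        q = kmp_step(pattern, fail, q, c)
        states.append(q)
    return states


print(kmp_states("ababb", "abababba"))  # [1, 2, 3, 4, 3, 4, 5, 1]
```

State 5 is reached after the seventh character, so the pattern ababb occurs ending at position 7 of the text.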
Jump and Output
The function Jump( j, u ) = δ_KMP( j, u )
• The domain is Q × D.
• It simulates the sequence of state transitions for u.
• Reply in O(1) time.
The set Output( j, u ) = { 1 ≤ i ≤ |u| : P is a suffix of P[1 : j] · u[1 : i] }
• This set contains the pattern occurrences.
• Reply in O( l ) time.
Realization of Jump and Output
for Jump( q, X_k ), if X_k is ...
a :          O(1) time
X_i · X_j :  If the factor concatenation problem for a string of length m can be solved in O(1) time, it can be solved in O(1) time.
for Output( q, X_k ), if X_k is ...
a :          O(1) time
X_i · X_j :  It can be enumerated in O( l ) time from Output of X_i and X_j, where l is the size of the set Output.
Factor Concatenation Problem
Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.
Question: Is the string xy a factor of P? If ‘yes’ then return its node number.
example: P = COPACABANA
x = OPA , y = CABAN : concatenating gives OPACABAN = P[2:9], so the answer is ‘Yes’!
Solution to the problem
• Using a suffix trie, it can be solved in O(m) time after preprocessing of O(m^2) time and space.
• Using a two-dimensional lookup table, it can be solved in O(1) time, but we need O(m^4) time and space preprocessing.
It can be solved in O(1) time after O(m^2) space and time preprocessing.
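To make the problem statement concrete, here is a deliberately naive Python sketch. It uses a dictionary keyed by factor strings as a stand-in for suffix trie nodes; the names `build_factor_ids` and `concat_factors` are hypothetical. It does not achieve the preprocessing or query bounds listed above (hashing the concatenated string alone costs time proportional to its length); it only illustrates the input and output of the problem.

```python
def build_factor_ids(P):
    """Assign a distinct id to every distinct factor (substring) of P.

    The ids play the role of suffix trie node numbers in this sketch.
    """
    ids = {"": 0}
    for i in range(len(P)):
        for j in range(i + 1, len(P) + 1):
            ids.setdefault(P[i:j], len(ids))
    return ids


def concat_factors(ids, x, y):
    """Return the id of xy if xy is a factor of P, otherwise None."""
    if x not in ids or y not in ids:
        raise ValueError("x and y must both be factors of P")
    return ids.get(x + y)


ids = build_factor_ids("COPACABANA")
print(concat_factors(ids, "OPA", "CABAN") is not None)  # True
print(concat_factors(ids, "ANA", "C"))                  # None
```

The first query succeeds because OPACABAN is P[2:9]; the second fails because ANAC does not occur in COPACABANA.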
Outline of Our Algorithm
Input. Pattern P and collage system 〈 D, S 〉 ( S = X_{i_1} , X_{i_2} , ... , X_{i_n} )
Output. All occurrences of the pattern.
/* preprocessing of D and P */
preprocess(D); preprocess(P);
l := 0; q := 0;
for j := 1 to n do begin
    for each d ∈ Output(q, X_{i_j}) do
        report ‘pattern occurs at position l + d’;
    q := Jump(q, X_{i_j});      /* state transition */
    l := l + |X_{i_j}|;         /* calculation of the offset */
end
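The loop above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the dictionary is given as a map `phrases` from variable names to their expanded strings, and Jump and Output are computed together by running the KMP automaton over a phrase and caching the result per (state, variable) pair. This does not reach the O(1) bound for Jump stated earlier. All function and parameter names are hypothetical, and the pattern is assumed to be non-empty.

```python
def kmp_failure(pattern):
    """fail[q] = length of the longest proper border of pattern[:q]."""
    m = len(pattern)
    fail = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = fail[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        fail[q] = k
    return fail


def search_collage(pattern, phrases, S):
    """Report the end positions (1-based) of all pattern occurrences.

    phrases: variable name -> expanded string (stand-in for D)
    S:       sequence of variable names (the compressed text)
    """
    m = len(pattern)
    fail = kmp_failure(pattern)

    def step(q, c):
        while q > 0 and (q == m or pattern[q] != c):
            q = fail[q]
        return q + 1 if pattern[q] == c else q

    cache = {}

    def jump_and_output(q, var):
        # Jump(q, X): state after reading the phrase of X from state q.
        # Output(q, X): offsets inside the phrase where a match ends.
        if (q, var) not in cache:
            state, out = q, []
            for i, c in enumerate(phrases[var], start=1):
                state = step(state, c)
                if state == m:
                    out.append(i)
            cache[(q, var)] = (state, out)
        return cache[(q, var)]

    occurrences = []
    l, q = 0, 0
    for var in S:
        next_q, out = jump_and_output(q, var)
        occurrences.extend(l + d for d in out)  # report position l + d
        q = next_q                              # state transition
        l += len(phrases[var])                  # calculation of the offset
    return occurrences


phrases = {"X1": "a", "X2": "b", "X3": "ab", "X4": "ba"}
S = ["X3", "X3", "X3", "X4"]  # represents abababba
print(search_collage("ababb", phrases, S))  # [7]
```

The cache means that a variable repeated in S from the same state, such as the third X3 here, is processed without rescanning its phrase, which is the source of the speed-up the slides describe.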
Compressed pattern matching on a collage system
• no truncation (LZ78, LZW, BPE, Run-length, etc.): O( ||D|| + |S| + m^2 + r ) time
• truncation (LZ77, LZSS, etc.): O( (||D|| + |S|) · height(D) + m^2 + r ) time; not suitable for speeding up pattern matching
Byte Pair Encoding
original encoding algorithm and modified algorithm
Byte Pair Encoding
Text: T = ABABCDEBDEFABDEABC
Used characters: A B C D E F
AB→G : GGCDEBDEFGDEGC
DE→H : GGCHBHFGHGC
GC→I : GIHBHFGHI
Pair Table
Code  Pair
G     AB
H     DE
I     GC
Byte Pair Encoding “collage system”
Text: T = ABABCDEBDEFABDEABC
AB→G : GGCDEBDEFGDEGC
DE→H : GGCHBHFGHGC
GC→I : GIHBHFGHI
D :
X1 = A ;
X2 = B ;
X3 = C ;
X4 = D ;
X5 = E ;
X6 = F ;
X7 = X1 · X2 ;
X8 = X4 · X5 ;
X9 = X7 · X3 ;
S : X7 , X9 , X8 , X2 , X8 , X6 , X7 , X8 , X9
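The replacement steps on this slide can be sketched in a few lines of Python. This is a minimal illustration of the basic BPE idea (repeatedly replace the most frequent adjacent pair with an unused code), not the authors' original or modified encoder. The names `bpe_compress` and `bpe_expand` and the `fresh_codes` parameter are assumptions; pair counting here is naive and includes overlapping pairs.

```python
from collections import Counter


def bpe_compress(text, fresh_codes):
    """Replace the most frequent adjacent pair by an unused code, repeatedly.

    fresh_codes: characters that do not occur in text, used in order.
    Returns the compressed text and the pair table (code -> pair).
    """
    pair_table = {}
    for code in fresh_codes:
        counts = Counter(zip(text, text[1:]))
        if not counts:
            break
        (a, b), freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so replacing it would not help
        pair_table[code] = a + b
        text = text.replace(a + b, code)  # left to right, non-overlapping
    return text, pair_table


def bpe_expand(text, pair_table):
    """Undo bpe_compress by expanding codes in reverse order of creation."""
    for code in reversed(list(pair_table)):
        text = text.replace(code, pair_table[code])
    return text


compressed, table = bpe_compress("ABABCDEBDEFABDEABC", "GHI")
print(compressed)  # GIHBHFGHI
print(table)       # {'G': 'AB', 'H': 'DE', 'I': 'GC'}
```

The pair table corresponds to the concatenation assignments X7, X8, and X9 of the collage system above, and the compressed string corresponds to S.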
Speeding up of compression
Time complexity of BPE: O(uN)
u : the number of character codes, N : text length
Using a doubly-linked list: O(u + N) time
Speed-up of compression
We apply the BPE algorithm to the first block of the original text.
D : X1 = A ; X2 = C ; X3 = X2 · X1 ; ... ; X255 = X247 · X8 ; X256 = X125 · X48
[Figure: original text → Pattern Matching Machine for multiple replacement [Arikawa et al. 1984] → BPE compressed text.]
Comparison of Compression Ratio and Time

compression ratio (%)
                         BPE original   BPE modified   Compress   Gzip
Brown corpus ( 6.8Mb)    51.0           59.0           43.7       39.0
Medline      (60.3Mb)    56.2           59.0           42.3       33.3
Genbank      (17.1Mb)    30.8           32.5           26.8       23.1

compression time (sec)
                         BPE original   BPE modified   Compress   Gzip
Brown corpus             196.9          8.0            12.7       37.7
Medline                  1699.9         60.7           73.3       242.2
Genbank                  440.6          16.5           19.3       100.9

The compression ratios of BPE are worse than those of “Compress” and “Gzip”.
Compression is drastically accelerated by our modification.
Compressed pattern matching on BPE compressed text
The problem of compressed pattern matching on BPE compressed text can be solved in O( ||D|| + |S| + m^2 + r ) time.
||D|| ≤ 256
-The dictionary D is encoded separately from the sequence S.
-The size of D is small enough.
-The variables of S are encoded using a fixed length code.
Experimental result
[Figure: running time (sec) versus pattern length (5 to 30) for KMP, Agrep, and our algorithm, in two panels: Medline data (compression ratio is 59%) and Genbank data (compression ratio is 32%).]
Ultra ...
Medline data: a clinically-oriented subset of Medline
Genbank data: a data set from GenBank
Concluding Remarks
Conclusion and Future Works
Conclusion
We introduced compressed pattern matching from practical viewpoints.
We observed that the running time of our algorithm is reduced at roughly the same rate as the compression ratio, compared with the uncompressed case.
We also observed that it is occasionally faster than Agrep.
Future Works
• Can we reduce the complexity of the preprocessing from O(m^2) to O(m)?
• To develop a sublinear algorithm on BPE compressed texts.
• To develop an approximate pattern matching algorithm on a collage system.
• To develop a new compression which is suitable for compressed pattern matching.
More recent work
A Boyer-Moore type algorithm for compressed pattern matching [CPM2000]
We proposed a Boyer-Moore (BM) type algorithm for pattern matching in BPE compressed texts.
Does text compression speed up such a sublinear time algorithm?
More recent work
[Figure: running time (sec) versus pattern length (5 to 30) for KMP, Agrep, our algorithm, and the most recent work, in two panels: Medline data (compression ratio is 59%) and Genbank data (compression ratio is 32%).]