1
Pattern Matching Using n-gram Sampling
of Cumulative Algebraic Signatures: Preliminary Results
Witold Litwin[1], Riad Mokadem[1], Philippe Rigaux[1] & Thomas Schwarz[2]
[1] Université Paris Dauphine
[2] Santa Clara University
2
n-gram Search
• New pattern matching idea
• Matches algebraic signatures
• Preprocesses both the pattern and the string (record)
  – String preprocessing is a new idea
    • To the best of our knowledge
• Provides incidental protection of stored data
  – Important for P2P & grid systems
• Fast processing
• Especially useful for DBs & longer patterns
  – ASCII, Unicode, DNA…
  – Should then often be faster than Boyer-Moore
  – Possibly the fastest known in this context
3
Algebraic Signature
• Symbols of the alphabet are elements of a Galois Field
  – Usually GF(256)
• We choose one primitive element $\alpha$ of the field
  – Usually $\alpha = 2$
• The algebraic signature of the string of i symbols $p_1 \dots p_i$ is the sum:
  $p'_i = p_1 \alpha + \dots + p_i \alpha^i$
• Here the addition and the multiplication are the operations in the GF.
4
Algebraic Signature
• In our GF($2^f$), where f = 8 or 16:
  $p + q = p - q = p \mathbin{\mathrm{XOR}} q$
• One method for multiplying is (for f = 8):
  $p * q = \mathrm{antilog}((\log p + \log q) \bmod 255)$
• The division is then:
  $p / q = \mathrm{antilog}((\log p - \log q) \bmod 255)$
• The log and antilog values are stored in log and antilog tables with $2^f$ elements each.
  – Entry 0 is for element 0 of the GF and is by convention set to $2^f - 1$.
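The table-based arithmetic above can be sketched as follows for GF(256). This is only an illustration: the slides do not name the generator polynomial, so `PRIM_POLY = 0x11D` (for which 2 is a primitive element) is an assumption, as are the names `LOG`, `ANTILOG`, `gf_mul` and `gf_div`. The antilog table here holds only the 255 distinct powers of the primitive element.

```python
# Assumed generator polynomial x^8 + x^4 + x^3 + x^2 + 1; alpha = 2 is primitive for it.
PRIM_POLY = 0x11D


def build_tables():
    """Build log and antilog tables for GF(256) with alpha = 2."""
    antilog = [0] * 255      # antilog[i] = alpha^i for i = 0..254
    log = [255] * 256        # log[0] stays 255 = 2^f - 1, the convention for element 0
    x = 1
    for i in range(255):
        antilog[i] = x
        log[x] = i
        x <<= 1              # multiply by alpha = 2
        if x & 0x100:        # reduce modulo the generator polynomial
            x ^= PRIM_POLY
    return log, antilog


LOG, ANTILOG = build_tables()


def gf_add(p, q):
    """Addition and subtraction are both XOR."""
    return p ^ q


def gf_mul(p, q):
    """p * q = antilog((log p + log q) mod 255); zero is handled separately."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % 255]


def gf_div(p, q):
    """p / q = antilog((log p - log q) mod 255); q must be nonzero."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % 255]
```

Zero has no logarithm, so the sketch tests for it explicitly instead of relying on the table entry 255.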
5
Cumulative Algebraic Signature
• We encode every symbol $p_i$ in a string into $p'_i$, the signature of the prefix $p_1 \dots p_i$
• The value of a CAS symbol now also encodes the knowledge of the values of all the previous ones
• Matching a single symbol means prefix matching
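A minimal sketch of CAS encoding and decoding over GF(256), assuming the generator polynomial `0x11D` with primitive element 2 (the slides do not state the polynomial). The function names `cas_encode` and `cas_decode` are illustrative, not taken from the authors' implementation.

```python
PRIM_POLY = 0x11D  # assumed generator polynomial; alpha = 2 is primitive for it

LOG = [255] * 256      # LOG[0] = 255 by convention
ANTILOG = [0] * 255    # ANTILOG[i] = alpha^i
_x = 1
for _i in range(255):
    ANTILOG[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM_POLY


def gf_mul(p, q):
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % 255]


def cas_encode(symbols):
    """Return the CAS symbols p'_1..p'_K, where p'_i = p'_(i-1) XOR p_i * alpha^i."""
    out, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        acc ^= gf_mul(p, ANTILOG[i % 255])   # alpha^i, exponent taken mod 255
        out.append(acc)
    return out


def cas_decode(cas):
    """Recover p_i = (p'_i XOR p'_(i-1)) / alpha^i."""
    out, prev = [], 0
    for i, c in enumerate(cas, start=1):
        x = c ^ prev
        out.append(0 if x == 0 else ANTILOG[(LOG[x] - i) % 255])
        prev = c
    return out
```

Because each CAS symbol is the signature of a whole prefix, `cas_encode(b"Dau")` equals the first three symbols of `cas_encode(b"Dauphine")`, which is what makes a single-symbol comparison a prefix match.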
6
Application of CASs
• Incidental stored data protection
  – On P2P & Grid servers especially
• Numerous CAS-encoded string matching algorithms
  – Prefix match with O(1) complexity
  – Pattern match by signature only
    • Karp-Rabin like, linear O(L) complexity
  – Longest common string search
  – Longest common prefix search
  – …
7
CAS Properties
• O(K) encoding and decoding speed
• For encoding, for instance:
  $p'_i = p'_{i-1} + p_i \alpha^i = \mathrm{CAS}(p_{i-1}) + p_i \alpha^i$
• Fast n-gram signature calculus
  – For $S_{k,l} = p_k \dots p_l$ with $k > 1$ and $l - k + 1 = n$:
    $AS(S_{k,l}) = AS(S_{l-k+1}) = (p'_l \mathbin{\mathrm{XOR}} p'_{k-1}) / \alpha^{k-1}$
• Logarithmic Algebraic Signature (LAS):
  $LAS(S_{k,l}) = \log AS(S_{k,l}) = (\log(p'_l \mathbin{\mathrm{XOR}} p'_{k-1}) - (k-1)) \bmod (2^f - 1)$
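The n-gram signature calculus can be illustrated with the sketch below, which derives the LAS of any n-gram from two CAS symbols and checks it against a direct computation on the raw symbols. It assumes GF(256) with generator polynomial `0x11D` and primitive element 2; the helper names (`ngram_las`, `las_direct`) are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed generator polynomial; alpha = 2 is primitive for it

LOG = [255] * 256      # LOG[0] = 255 by convention
ANTILOG = [0] * 255    # ANTILOG[i] = alpha^i
_x = 1
for _i in range(255):
    ANTILOG[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM_POLY


def gf_mul(p, q):
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % 255]


def cas_encode(symbols):
    """p'_i = p'_(i-1) XOR p_i * alpha^i."""
    out, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        acc ^= gf_mul(p, ANTILOG[i % 255])
        out.append(acc)
    return out


def las_direct(ngram):
    """LAS of a stand-alone n-gram: log of the sum of p_j * alpha^j, j = 1..n."""
    sig = 0
    for j, p in enumerate(ngram, start=1):
        sig ^= gf_mul(p, ANTILOG[j % 255])
    return 255 if sig == 0 else LOG[sig]


def ngram_las(cas, k, l):
    """LAS of p_k..p_l (1-based, inclusive) from the CAS symbols alone.

    LAS = (log(p'_l XOR p'_(k-1)) - (k - 1)) mod 255, with LAS(0) = 255.
    """
    prev = cas[k - 2] if k > 1 else 0
    x = cas[l - 1] ^ prev
    if x == 0:
        return 255
    return (LOG[x] - (k - 1)) % 255
```

The subtraction of `k - 1` removes the position-dependent factor, so the same n-gram yields the same LAS wherever it occurs in the record. One XOR, one table lookup and one subtraction suffice regardless of n.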
8
The n-gram Search: Key ideas
• Design a sublinear pattern match search
  – With speed about L / K
• Apply to a CAS-encoded DB
  – New idea for a string search algorithm with preprocessing
  – Justified for a DB
    • Store once, search many times
9
The n-gram Search: Key ideas
• Preprocess the pattern to create a jump table
  – As in Boyer-Moore
• Use n-grams with n > 1 to increase the discriminative power of an attempt
  – Comparison of a sample from the pattern
    • a single symbol for BM
    • the LAS of an n-gram for a CAS-encoded string
10
The n-gram Search: Key ideas
• If the alphabet uses m symbols, the probability that a symbol matches is 1/m
  – Assuming all symbols are equally likely
• For usual ASCII pattern matching, m = 20-25
• For DNA, m = 4
• A single symbol may often match without the whole pattern matching
  • E.g., 1/4 of the time for DNA on average
• Leading to small jumps
  – by m symbols on average
11
The n-gram Search: Key ideas
• The probability of an n-gram matching may be as low as $1 / \min(2^f, m^n)$
  – An n-gram signature takes at most $\min(2^f, m^n)$ distinct values
• In our examples it can reach 1/256
  – More discriminative sampling
  – Longer jumps
    • By almost K or 256 symbols in general
• Useful for longer strings
  – DNA, text, images…
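As a small worked illustration of this discriminative power, the sketch below assumes equally likely symbols and signatures, so that an n-gram signature takes one of min(2^f, m^n) equally likely values. The function name is hypothetical and the model is a simplification, not a result from the slides.

```python
from fractions import Fraction


def ngram_match_probability(m, n, f=8):
    """Chance that a random n-gram signature matches, under a uniform model.

    m: alphabet size, n: n-gram length, f: the field is GF(2^f).
    The signature can take at most min(2^f, m^n) distinct values.
    """
    return Fraction(1, min(2 ** f, m ** n))


# DNA (m = 4): a single symbol matches 1/4 of the time, a 4-gram 1/256 of the time.
dna_single = ngram_match_probability(4, 1)
dna_4gram = ngram_match_probability(4, 4)

# ASCII text with about 25 common symbols: 2-grams already saturate GF(256).
ascii_2gram = ngram_match_probability(25, 2)
```

Under this model, increasing n beyond the point where m^n exceeds 2^f brings no further gain unless a larger field is used.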
12
ASCII Example: Usual Alphabet
2-grams => 5 jumps
1-gram => 6 jumps
13
DNA Example: 4-letter Alphabet
[Figure: search traces on a DNA string, labelled 3 jumps, 4 jumps, 4 jumps and 11 jumps]
14
The n-gram Search Preprocessing
• Encode every record (string) into its CAS
  – Done anyhow for incidental protection in SDDS-2006
• Encode the terminal n-gram of the searched pattern $S_K$ into its LAS in variable V
• Fill up the jump table T for every other n-gram in $S_K$
  – calculate every LAS
  – for each LAS, store in T its rightmost offset with respect to the end of $S_K$
15
The n-gram Search Jump Table
• For GF(256), for every n-gram $S_{j,j+n-1}$ in the pattern, with $i = LAS(S_{j,j+n-1})$:
  – T(i) = the offset
  – T(i) = K − n + 1 otherwise
• Reminder: LAS(0) = 255
• T can also be a hash table
  – See the paper
  – Slower to use but possibly more memory efficient
    • Probably more useful for a larger GF
16
ASCII Example
Pattern: Dauphine
V = ne''
Jump table T (index : jump):
0 : 7
1 : 7
…
in'' : 1
…
au'' : 5
…
ph'' : 3
…
255 : 7
Notation: xy'' = LAS(xy)
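The preprocessing of the pattern can be sketched as below and checked against the Dauphine example. The sketch assumes GF(256) with generator polynomial `0x11D` and primitive element 2, which the slides do not specify, and the names `las` and `build_jump_table` are illustrative.

```python
PRIM_POLY = 0x11D  # assumed generator polynomial; alpha = 2 is primitive for it

LOG = [255] * 256      # LOG[0] = 255 by convention
ANTILOG = [0] * 255    # ANTILOG[i] = alpha^i
_x = 1
for _i in range(255):
    ANTILOG[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM_POLY


def gf_mul(p, q):
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % 255]


def las(ngram):
    """LAS of a stand-alone n-gram, with LAS(0) = 255."""
    sig = 0
    for j, p in enumerate(ngram, start=1):
        sig ^= gf_mul(p, ANTILOG[j % 255])
    return 255 if sig == 0 else LOG[sig]


def build_jump_table(pattern, n):
    """Return (V, T) for a pattern of K >= n symbols.

    V is the LAS of the terminal n-gram. T has 256 entries indexed by LAS:
    the rightmost offset of a non-terminal n-gram from the end of the
    pattern, or K - n + 1 when no n-gram of the pattern has that LAS.
    """
    K = len(pattern)
    T = [K - n + 1] * 256
    for s in range(K - n):                    # left to right, so the rightmost wins
        T[las(pattern[s:s + n])] = K - n - s
    V = las(pattern[K - n:])
    return V, T


V, T = build_jump_table(b"Dauphine", 2)
```

For this pattern the seven 2-grams happen to have distinct signatures under the assumed polynomial, so the entries for `in`, `ph` and `au` are 1, 3 and 5, and every unused entry, including index 255, holds 7.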
17
The n-gram Search Processing
• Calculate the LAS of the current n-gram in the string
  – Start with the n-gram $S_{K-n+1,K}$
  – Continue depending on the jump calculus
• Attempt to match V
  – If true, then calculate the LAS of the entire current, possibly matching, substring
    • of length K and ending with the current n-gram
• If true as well, then resolve the possible collision
  – Either attempt to match all the K symbols
  – Or match enough terminal n-grams or symbols to decrease the probability of collision to a very small value
18
The n-gram Search Processing
• Otherwise
  – Go to T using the LAS of the n-gram
  – Jump by the number of symbols found in T
    • Update the "current" position of the n-gram to attempt the match
  – Re-attempt the match as above
    • Unless the n-gram to attempt is beyond the end of the string
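The whole search loop can be sketched as follows. This is a reading of the steps on the slides rather than the SDDS-2006 code: it assumes GF(256) with generator polynomial `0x11D` and primitive element 2, resolves collisions by decoding and comparing all K symbols, and also uses T to jump after a failed match attempt. All function names are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed generator polynomial; alpha = 2 is primitive for it

LOG = [255] * 256      # LOG[0] = 255 by convention
ANTILOG = [0] * 255    # ANTILOG[i] = alpha^i
_x = 1
for _i in range(255):
    ANTILOG[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM_POLY


def gf_mul(p, q):
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % 255]


def cas_encode(symbols):
    """p'_i = p'_(i-1) XOR p_i * alpha^i."""
    out, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        acc ^= gf_mul(p, ANTILOG[i % 255])
        out.append(acc)
    return out


def las_from_cas(cas, k, l):
    """LAS of p_k..p_l (1-based, inclusive); LAS(0) = 255."""
    x = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return 255 if x == 0 else (LOG[x] - (k - 1)) % 255


def symbol_at(cas, j):
    """Decode the original symbol p_j (1-based) from the CAS."""
    x = cas[j - 1] ^ (cas[j - 2] if j > 1 else 0)
    return 0 if x == 0 else ANTILOG[(LOG[x] - j) % 255]


def preprocess(pattern, n):
    """Return V, the jump table T and the LAS of the whole pattern."""
    K = len(pattern)
    pcas = cas_encode(pattern)
    T = [K - n + 1] * 256
    for s in range(1, K - n + 1):             # every non-terminal n-gram, left to right
        T[las_from_cas(pcas, s, s + n - 1)] = K - n + 1 - s
    return las_from_cas(pcas, K - n + 1, K), T, las_from_cas(pcas, 1, K)


def ngram_search(cas, pattern, n=2):
    """Return the 0-based start of the first match in the CAS-encoded record, or -1."""
    K, L = len(pattern), len(cas)
    if K < n or L < K:
        return -1
    V, T, whole = preprocess(pattern, n)
    pos = K                                   # 1-based end of the current n-gram
    while pos <= L:
        cur = las_from_cas(cas, pos - n + 1, pos)
        if cur == V and las_from_cas(cas, pos - K + 1, pos) == whole:
            start = pos - K
            # Resolve a possible collision by comparing all K symbols.
            if all(symbol_at(cas, start + 1 + t) == pattern[t] for t in range(K)):
                return start
        pos += T[cur]                         # always at least 1
    return -1
```

For instance, `ngram_search(cas_encode(text), b"Dauphine", 2)` is expected to agree with `text.find(b"Dauphine")` for a byte string `text`. A signature collision in T can only shorten a jump, never skip a match, since T stores the smallest offset for each LAS value.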
19
ASCII Example Again
2-grams => 5 jumps
1-gram => 6 jumps
20
DNA Example Again
[Figure: search traces on a DNA string, labelled 3 jumps, 4 jumps, 4 jumps and 11 jumps]
21
Related Work
• Implemented in SDDS-2006
• Applies best to
  – longer patterns
    • where many jumps occur
  – alphabets much smaller than the size of the GF used
• Instead of jumps of size m on average, one reaches almost $\min(K, 2^f)$ per jump
  – up to almost 256 for DNA or ASCII with GF(256)
  – up to almost 64K for DNA or Unicode with GF(64K)
    • instead of 4 or 25 respectively
  – For Boyer-Moore especially
22
n-grams / BM
• Jumps with n-grams can typically be longer
• Calculating an attempt and a jump is more expensive as well
  – About twice as long, as a first approximation
  – The precise analysis remains to be done
• Rule of thumb: if jumps are more than 2 times longer, n-grams with n > 1 should be faster than BM.
• In both our examples, this should be the case for patterns longer than:
  – 50 symbols for ASCII
  – 8 symbols for DNA
23
Related Work
• In SDDS-2006 and P2P or Grid systems in general
• Wish to hide what is searched for?
  • Use the signature-only search
    – Usually slower, since it is only linear
24
Conclusion
• A new pattern matching algorithm
• Uses algebraic signatures
• Preprocesses both the pattern and the string
• Appears particularly efficient
  – For databases
  – For longer patterns
• Possibly faster in this context than any other known algorithm
• But all these are only preliminary results
25
Future Work
• Performance Analysis
  – Theoretical
    • Jump Length
      – Median, Average…
  – Experimental
    • Actual text
      – Non-uniform symbol distribution
    • DNA
      – Actual DNA strings
26
Future Work
• Variants
  – Jump Table
  – Partial Signatures of n-grams
    • Symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \dots p_i$
      – No more XORing & division to find this signature
      – Faster unsuccessful attempt to match
  – Approximate Match
    • Tolerating match errors
      – E.g., of at most 1 symbol