
Pattern Matching Using n-gram Sampling of Cumulative Algebraic Signatures: Preliminary Results

Witold Litwin [1], Riad Mokadem [1], Philippe Rigaux [1] & Thomas Schwarz [2]

[1] Université Paris Dauphine
[2] Santa Clara University

n-gram Search

• New pattern matching idea
• Matches algebraic signatures
• Preprocesses both the pattern & the string (record)
  – String preprocessing is a new idea
    • To the best of our knowledge
• Provides incidental protection of stored data
  • Important for P2P & grid systems
• Fast processing
  • Especially useful for DBs & longer patterns
    – ASCII, Unicode, DNA…
    – Should then often be faster than Boyer-Moore
    – Possibly the fastest known in this context

Algebraic Signature

• Symbols of the alphabet are elements of a Galois Field
  – GF(256) usually
• We choose one primitive element α of the field
  – Usually α = 2
• The algebraic signature of the string of i symbols $p_1 \dots p_i$ is the sum:

  $p'_i = p_1 \alpha + p_2 \alpha^2 + \dots + p_i \alpha^i$

• Here the addition and the multiplication are the operations in GF.

Algebraic Signature

• In our GF($2^f$), where f = 8 or 16: $p + q = p - q = p \oplus q$ (XOR)
• One method for multiplying nonzero p and q (shown for f = 8):
  $p \cdot q = \mathrm{antilog}((\log p + \log q) \bmod 255)$
• The division is then:
  $p / q = \mathrm{antilog}((\log p - \log q) \bmod 255)$
• The log and antilog are encoded in log and antilog tables with $2^f$ elements each.
  – Entry 0 is for element 0 of the GF and is by convention set to $2^f - 1$.
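The table-based arithmetic above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not name the generator polynomial, so the sketch assumes x^8 + x^4 + x^3 + x^2 + 1 (0x11D), for which the element 2 is primitive. The names `build_tables`, `gf_add`, `gf_mul` and `gf_div` are hypothetical.

```python
# GF(2^8) arithmetic with log/antilog tables.
# Assumption: generator polynomial 0x11D, for which alpha = 2 is primitive.
PRIM_POLY = 0x11D
ORDER = 255  # 2^f - 1 for f = 8


def build_tables():
    """Return (log, antilog) tables of 2^f entries each; log[0] = 255 by convention."""
    log = [ORDER] * 256
    antilog = [1] * 256          # antilog[255] stays 1 because alpha^255 = 1
    x = 1
    for e in range(ORDER):
        antilog[e] = x           # alpha^e
        log[x] = e
        x <<= 1                  # multiply by alpha = 2
        if x & 0x100:            # reduce modulo the generator polynomial
            x ^= PRIM_POLY
    return log, antilog


LOG, ANTILOG = build_tables()


def gf_add(p, q):
    """Addition and subtraction are both XOR."""
    return p ^ q


def gf_mul(p, q):
    """p * q = antilog((log p + log q) mod 255); zero is handled separately."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]


def gf_div(p, q):
    """p / q = antilog((log p - log q) mod 255), for q != 0."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]
```

Zero has no logarithm, which is why the conventional entry log[0] = 255 must never be fed into the modular sum; the sketch tests for zero operands first.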

Cumulative Algebraic Signature

• We encode every symbol $p_i$ in a string into the signature $p'_i$ of the prefix $p_1 \dots p_i$
• The value of a CAS symbol now also encodes the knowledge of the values of all the previous ones
• Matching a single symbol means prefix matching
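As an illustration of the encoding, the sketch below replaces each symbol by the signature of the prefix ending at it, and decodes it back. It assumes GF(256) with generator polynomial 0x11D and primitive element 2 (neither is fixed by the slides); `cas_encode` and `cas_decode` are hypothetical names.

```python
# Assumption: GF(256) built from polynomial 0x11D with primitive element alpha = 2.
PRIM_POLY, ORDER = 0x11D, 255
LOG, ANTILOG = [ORDER] * 256, [1] * 256
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (PRIM_POLY if x & 0x80 else 0)


def cas_encode(symbols):
    """Replace each symbol p_i by the signature of the prefix p_1..p_i."""
    cas, sig = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:                                   # p * alpha^i, skipped when p = 0
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(sig)
    return cas


def cas_decode(cas):
    """Recover p_i = (p'_i XOR p'_{i-1}) / alpha^i."""
    symbols, prev = [], 0
    for i, c in enumerate(cas, start=1):
        diff = c ^ prev
        symbols.append(0 if diff == 0 else ANTILOG[(LOG[diff] - i) % ORDER])
        prev = c
    return symbols


record = cas_encode(b"Dauphine")
# The 4th CAS symbol alone identifies the prefix "Daup" (up to signature collisions).
same_prefix = record[3] == cas_encode(b"Daup")[3]
```

Both directions take one pass over the string, and comparing a single CAS symbol compares whole prefixes.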

Application of CASs

• Incidental stored data protection
  – On P2P & Grid servers especially
• Numerous string matching algorithms over CAS-encoded data
  – Prefix match with O(1) complexity
  – Pattern match by signature only
    • Karp-Rabin like, linear O(L) complexity
  – Longest common string search
  – Longest common prefix search
  – …
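Two of the applications listed above can be sketched as follows, assuming GF(256) with generator polynomial 0x11D and primitive element 2 (an assumption, as are the function names). Both compare signatures only, so a reported hit is a match up to a signature collision.

```python
# Assumption: GF(256) built from polynomial 0x11D with primitive element alpha = 2.
PRIM_POLY, ORDER = 0x11D, 255
LOG, ANTILOG = [ORDER] * 256, [1] * 256
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (PRIM_POLY if x & 0x80 else 0)


def cas_encode(symbols):
    """CAS of a string: the i-th value is the signature of the prefix p_1..p_i."""
    cas, sig = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(sig)
    return cas


def signature(symbols):
    """Algebraic signature of a whole (non-empty) string."""
    return cas_encode(symbols)[-1]


def prefix_match(cas, pattern_sig, K):
    """O(1) test: one CAS symbol against the precomputed signature of a K-symbol pattern."""
    return K <= len(cas) and cas[K - 1] == pattern_sig


def signature_only_search(cas, pattern):
    """Karp-Rabin like O(L) scan comparing logarithmic signatures of each K-symbol window."""
    K, L = len(pattern), len(cas)
    target = LOG[signature(pattern)]            # LOG[0] = 255 by convention
    hits = []
    for end in range(K, L + 1):                 # 1-based index of the window's last symbol
        diff = cas[end - 1] ^ (cas[end - K - 1] if end > K else 0)
        las = ORDER if diff == 0 else (LOG[diff] - (end - K)) % ORDER
        if las == target:
            hits.append(end - K)                # 0-based start of the candidate match
    return hits


text = b"Paris Dauphine"
cas = cas_encode(text)
starts_with_paris = prefix_match(cas, signature(b"Paris"), 5)
candidates = signature_only_search(cas, b"Dauphine")
```

The scan never looks at the original symbols, which is what makes a signature-only search possible on data stored in CAS form.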

CAS Properties

• O(K) encoding and decoding speed
  • For encoding, for instance:
    $p'_i = p'_{i-1} + p_i \alpha^i = \mathrm{CAS}(p_{i-1}) + p_i \alpha^i$
• Fast n-gram signature calculus
  – For $S_{k,l} = p_k \dots p_l$ with k > 1 and l − k + 1 = n:
    $AS(S_{k,l}) = (p'_l \oplus p'_{k-1}) / \alpha^{k-1}$
    i.e., the signature of the n-gram taken as a string of l − k + 1 symbols
• Logarithmic Algebraic Signature (LAS):
    $LAS(S_{k,l}) = \log AS(S_{k,l}) = (\log(p'_l \oplus p'_{k-1}) - (k-1)) \bmod (2^f - 1)$
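The n-gram formulas above can be checked with a short sketch that computes the signature of an n-gram from two CAS symbols only. It assumes GF(256) with generator polynomial 0x11D and primitive element 2; `ngram_as` and `ngram_las` are hypothetical names, and positions are 1-based as on the slide.

```python
# Assumption: GF(256) built from polynomial 0x11D with primitive element alpha = 2.
PRIM_POLY, ORDER = 0x11D, 255
LOG, ANTILOG = [ORDER] * 256, [1] * 256
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (PRIM_POLY if x & 0x80 else 0)


def cas_encode(symbols):
    """p'_i = p'_{i-1} XOR p_i * alpha^i, one step per symbol."""
    cas, sig = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(sig)
    return cas


def signature(symbols):
    """Signature of a standalone (non-empty) string."""
    return cas_encode(symbols)[-1]


def ngram_las(cas, k, l):
    """LAS(S_{k,l}) = (log(p'_l XOR p'_{k-1}) - (k - 1)) mod 255; 255 for a zero signature."""
    diff = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if diff == 0 else (LOG[diff] - (k - 1)) % ORDER


def ngram_as(cas, k, l):
    """AS(S_{k,l}) = (p'_l XOR p'_{k-1}) / alpha^(k-1)."""
    las = ngram_las(cas, k, l)
    return 0 if las == ORDER else ANTILOG[las]


cas = cas_encode(b"Dauphine")
# Signature of the 2-gram "up" (positions 3..4) without touching the raw symbols.
up_sig = ngram_as(cas, 3, 4)
```

Working with the LAS saves the antilog lookup: one XOR, one log lookup and one modular subtraction per n-gram.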

The n-gram Search

Key ideas

• Design a sublinear pattern match search
  – With a search cost of about L / K, for a string of L symbols and a pattern of K symbols
• Apply to a CAS-encoded DB
  – New idea for a string search algorithm with preprocessing
  – Justified for a DB
    • Store once, search many times

• Preprocess the pattern to create a jump table
  – As in Boyer-Moore
• Use n-grams with n > 1 to increase the discriminative power of an attempt
  – An attempt compares a sample from the pattern:
    • a single symbol for BM
    • an LAS of an n-gram for a CAS-encoded string

• If the alphabet uses m symbols, the probability that a symbol matches is 1/m
  – Assuming all symbols are equally likely
• For usual ASCII pattern matching, m = 20-25
• For DNA, m = 4
• A single symbol may often match without the whole pattern matching
  • E.g., 1/4 of the time for DNA on average
• Leading to small jumps
  – By m symbols on average

• The probability of an n-gram matching can be as low as $1 / \min(2^f, m^n)$
• In our examples it can reach 1/256
  – More discriminative sampling
  – Longer jumps
    • By almost K or 256 symbols in general
• Useful for longer strings
  – DNA, text, images…
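A small numeric sketch of this estimate follows. It reads the bound as 1 / min(2^f, m^n), which assumes equally likely, independent symbols and signatures spread evenly over the field; both the reading and the function name are assumptions made for illustration.

```python
from fractions import Fraction


def ngram_match_probability(m, n, f=8):
    """Chance that a random n-gram over m symbols has a given signature in GF(2^f).

    Assumption: at most min(2^f, m^n) distinct signatures, all equally likely.
    """
    return Fraction(1, min(2 ** f, m ** n))


dna_1gram = ngram_match_probability(4, 1)     # single DNA symbol
dna_4gram = ngram_match_probability(4, 4)     # 4^4 = 256 reaches the GF(256) limit
ascii_2gram = ngram_match_probability(25, 2)  # 25^2 = 625 > 256, capped by the field size
```

Under these assumptions, going beyond n = 4 for DNA or n = 2 for ASCII text does not help in GF(256), since the field size becomes the limit.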

ASCII Example, Usual Alphabet

[Figure: jumps over an ASCII string. 2-grams => 5 jumps; 1-gram => 6 jumps]

DNA Example, 4-letter Alphabet

[Figure: jumps over a DNA string, with jump counts of 3, 4, 4 and 11 for the cases shown]

The n-gram Search Preprocessing

• Encode every record (string) into its CAS
  – Done anyhow for incidental protection in SDDS-2006
• Encode the terminal n-gram of the searched pattern $S_K$ into its LAS, stored in variable V
• Fill up the jump table T for every other n-gram in $S_K$
  – Calculate every LAS
  – For each LAS, store in T its rightmost offset with respect to the end of $S_K$

The n-gram Search Jump Table

• For GF(256), for every n-gram $S_{k,k+n-1}$ in the pattern, with $i = LAS(S_{k,k+n-1})$:
  – T(i) = the offset
  – T(i) = K − n + 1 otherwise, i.e., for every i that is not the LAS of such an n-gram
• Reminder: LAS(0) = 255
• T can also be a hash table
  – See the paper
  – Slower to use but possibly more memory efficient
    • Probably more useful for a larger GF
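The preprocessing of the pattern can be sketched as below: V holds the LAS of the terminal n-gram, and T is a 256-entry array indexed by LAS. The sketch assumes GF(256) with generator polynomial 0x11D and primitive element 2; `las` and `preprocess` are hypothetical names, and the array form of T is the one described on this slide, not the hash-table variant.

```python
# Assumption: GF(256) built from polynomial 0x11D with primitive element alpha = 2.
PRIM_POLY, ORDER = 0x11D, 255
LOG, ANTILOG = [ORDER] * 256, [1] * 256
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (PRIM_POLY if x & 0x80 else 0)


def las(ngram):
    """Logarithmic algebraic signature of a standalone n-gram (255 for a zero signature)."""
    sig = 0
    for i, p in enumerate(ngram, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]
    return LOG[sig]


def preprocess(pattern, n):
    """Return (V, T) for a pattern of K >= n symbols."""
    K = len(pattern)
    V = las(pattern[K - n:])                  # terminal n-gram
    T = [K - n + 1] * 256                     # default jump for an LAS absent from the pattern
    for start in range(K - n):                # every other n-gram, scanned left to right
        offset = (K - n) - start              # distance from the n-gram's end to the pattern's end
        T[las(pattern[start:start + n])] = offset   # later writes keep the rightmost offset
    return V, T


V, T = preprocess(b"Dauphine", 2)
```

For the pattern Dauphine with 2-grams, this yields the offsets 1, 5 and 3 for "in", "au" and "ph", and the default jump K − n + 1 = 7 elsewhere.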

ASCII Example

Pattern: Dauphine

V = ne''

Jump table T:

  i       T(i)
  0       7
  1       7
  …       …
  in''    1
  …       …
  au''    5
  …       …
  ph''    3
  …       …
  255     7

Notation: xy'' = LAS(xy)

The n-gram Search Processing

• Calculate the LAS of the current n-gram in the string
  – Start with the n-gram $S_{K-n+1,K}$
  – Continue depending on the jump calculus
• Attempt to match V
  – If it matches, then calculate the LAS of the entire current, possibly matching substring
    • Of length K and ending with the current n-gram
  – If this LAS matches that of the pattern as well, then resolve the possible collision
    • Either attempt to match all the K symbols
    • Or match enough terminal n-grams or symbols to decrease the probability of collision to a very small value

The n-gram Search Processing

• Otherwise
  – Go to T using the LAS of the n-gram
  – Jump by the number of symbols found in T
    • Update the “current” position of the n-gram to attempt the match
  – Re-attempt the match as above
    • Unless the n-gram to attempt is beyond the end of the string
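The processing steps can be put together in the following sketch. It is one possible reading of the slides, not the SDDS-2006 code: it assumes GF(256) with generator polynomial 0x11D and primitive element 2, resolves collisions by decoding and comparing all K symbols, and keeps jumping after a confirmed match so that every occurrence is reported. All function names are hypothetical.

```python
# Assumption: GF(256) built from polynomial 0x11D with primitive element alpha = 2.
PRIM_POLY, ORDER = 0x11D, 255
LOG, ANTILOG = [ORDER] * 256, [1] * 256
x = 1
for e in range(ORDER):
    ANTILOG[e], LOG[x] = x, e
    x = (x << 1) ^ (PRIM_POLY if x & 0x80 else 0)


def cas_encode(symbols):
    """CAS of a string: the i-th value is the signature of the prefix p_1..p_i."""
    cas, sig = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]
        cas.append(sig)
    return cas


def las_of(cas, k, l):
    """LAS of p_k..p_l (1-based, inclusive), computed from two CAS symbols."""
    diff = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if diff == 0 else (LOG[diff] - (k - 1)) % ORDER


def symbol_at(cas, i):
    """Decode the original symbol p_i (1-based) from the CAS."""
    diff = cas[i - 1] ^ (cas[i - 2] if i > 1 else 0)
    return 0 if diff == 0 else ANTILOG[(LOG[diff] - i) % ORDER]


def preprocess(pattern, n):
    """Return V, the jump table T, and the LAS of the whole pattern."""
    K = len(pattern)
    pcas = cas_encode(pattern)
    T = [K - n + 1] * 256
    for k in range(1, K - n + 1):             # non-terminal n-grams, left to right
        T[las_of(pcas, k, k + n - 1)] = K - (k + n - 1)
    return las_of(pcas, K - n + 1, K), T, las_of(pcas, 1, K)


def ngram_search(cas, pattern, n=2):
    """Return the 0-based start positions of pattern in the CAS-encoded record."""
    K, L = len(pattern), len(cas)
    if not n <= K <= L:
        return []
    V, T, pattern_las = preprocess(pattern, n)
    hits, end = [], K                         # end: 1-based end of the current n-gram
    while end <= L:
        cur = las_of(cas, end - n + 1, end)
        if (cur == V and las_of(cas, end - K + 1, end) == pattern_las
                and all(symbol_at(cas, end - K + 1 + j) == pattern[j] for j in range(K))):
            hits.append(end - K)
        end += T[cur]                         # jump by the offset stored for this LAS
    return hits


text = b"Universite Paris Dauphine, Place Dauphine"
found = ngram_search(cas_encode(text), b"Dauphine", n=2)
```

Every entry of T is at least 1, so the loop always advances, and the default entry K − n + 1 moves the window just past an n-gram that cannot occur inside the pattern.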

ASCII Example Again

[Figure: jumps over an ASCII string. 2-grams => 5 jumps; 1-gram => 6 jumps]

DNA Example Again

[Figure: jumps over a DNA string, with jump counts of 3, 4, 4 and 11 for the cases shown]

Related Work

• Implemented in SDDS-2006
• Applies best to
  – longer patterns
    • where many jumps occur
  – alphabets much smaller than the size of the GF used
• Instead of jumps of size m on average, one reaches almost min(K, $2^f$) per jump
  – up to almost 256 for DNA or ASCII with GF(256)
  – up to almost 64K for DNA or Unicode with GF(64K)
    • instead of 4 or 25 respectively
      – For Boyer-Moore especially

n-grams / BM

• Jumps with n-grams can typically be longer
• Calculating an attempt & a jump is more expensive as well
  – About twice as long, as a first approximation
  – The precise analysis remains to be done
• Rule of thumb: if jumps are more than 2 times longer, n-grams with n > 1 should be faster than BM.
• In both our examples, this should be the case for patterns longer than:
  – 50 symbols for ASCII
  – 8 symbols for DNA
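The two thresholds follow from the rule of thumb if one takes the average jump estimates used in this talk, about m symbols for BM and almost min(K, 2^f) for n-grams. A hedged worked version, with m = 25 taken as the upper ASCII estimate:

```latex
% Break-even condition: the n-gram jump must be more than twice the BM jump.
\min(K, 2^f) > 2m
\quad\Longrightarrow\quad
K > 2m \qquad \text{(provided } 2m < 2^f\text{)}

% ASCII, m \approx 25: \quad K > 2 \cdot 25 = 50
% DNA,   m = 4:        \quad K > 2 \cdot 4  = 8
```

With GF(256), both 50 and 8 are below 2^f = 256, so the pattern length K is the binding term in both examples.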

Related Work

• In SDDS-2006 & P2P or Grid systems in general
• Wish to hide what is searched for?
  • Use the signature-only search
    – Usually slower since only linear

Conclusion

• A new pattern matching algorithm
• Uses algebraic signatures
• Preprocesses both the pattern and the string
• Appears particularly efficient
  – For databases
  – For longer patterns
• Possibly faster in this context than any other algorithm known now
• But all these are only preliminary results

Future Work

• Performance Analysis
  – Theoretical
    • Jump length
      – Median, average…
  – Experimental
    • Actual text
      – Non-uniform symbol distribution
    • DNA
      – Actual DNA strings

Future Work

• Variants
  – Jump table
  – Partial signatures of n-grams
    • Symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \dots p_i$
      – No more XORing & division to find this signature
      – Faster unsuccessful attempt to match
  – Approximate match
    • Tolerating match errors
      – E.g., of at most 1 symbol


Thank You for Your Attention
