Fast Approximate Point Set Matching for Information Retrieval

Raphaël Clifford and Benjamin Sach



Contents

• What’s the problem?

• What use is it?

• Is it (3-SUM) hard?

• How have we solved it?

• How good is our solution?



The Maximal Subset Matching problem

Given a pattern P of size m and a text T of size n:

We want to find the largest “match” of P in T

This is also referred to as the “constellation” problem (originally by B. Chazelle)




What is a “match”?





A point pi in P matches a point tj in T with a shift v if:

pi + v = tj




A subset M of P is a subset match if there exists a shift v with which all points in M match points in T

The Maximal Subset Matching problem is to find the size of the largest subset match for a given P and T
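As a point of reference, the definition can be turned directly into a brute-force baseline. This is only a sketch for illustration, not one of the algorithms presented in these slides: the function name `largest_subset_match` and the sample points are made up, and points are assumed to be tuples of integers. Every pair (p, t) votes for the shift v = t - p, and the shift with the most votes gives the largest subset match.

```python
from collections import Counter


def largest_subset_match(pattern, text):
    """Return (size, shift) of the largest subset match of pattern in text.

    Brute force over all m*n pairs: each pair (p, t) votes for v = t - p.
    Duplicated points are removed first so that a vote count equals the
    number of distinct pattern points matched at that shift.
    """
    votes = Counter()
    for p in set(pattern):
        for t in set(text):
            shift = tuple(tc - pc for pc, tc in zip(p, t))
            votes[shift] += 1
    shift, size = max(votes.items(), key=lambda item: item[1])
    return size, shift


# Hypothetical example data: two of the three pattern points match at (5, 5).
pattern = [(0, 0), (1, 2), (3, 1)]
text = [(5, 5), (6, 7), (9, 9), (2, 2)]
print(largest_subset_match(pattern, text))
```

The vote table can hold up to m*n distinct shifts, which is the kind of cost the approximate algorithms aim to avoid.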




Application to Music Information Retrieval

• Allows for matches shifted in time and pitch

• Intrinsically handles polyphonic music, which traditional string-based methods do not

Other Applications:

• Protein structure alignment

• Pharmacophore identification

• Image registration

• Model-based object recognition



Is Maximal Subset Matching hard?

3-SUM is…

Given a set T of n integers: is there a triple a, b, c ∈ T such that a + b + c = 0?

• There is a simple algorithm to solve 3-SUM in O(n^2) time

• No solutions of lower complexity are known

• It is conjectured that this is a lower bound

“Many fundamental geometric problems fall in this class”

Maximal Subset Matching has been proven to be 3-SUM hard
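The simple quadratic algorithm for 3-SUM mentioned above can be sketched as follows. This is the standard sort-and-two-pointers approach; the function name `has_three_sum` is made up for illustration, and the sketch assumes the triple must use three distinct positions of the input.

```python
def has_three_sum(values):
    """Return True if three entries (at distinct positions) sum to zero.

    Sort once, then for each first element sweep two pointers inwards.
    Sorting costs O(n log n) and the sweeps cost O(n^2) in total.
    """
    nums = sorted(values)
    n = len(nums)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            total = nums[i] + nums[lo] + nums[hi]
            if total == 0:
                return True
            if total < 0:
                lo += 1  # sum too small: move to a larger middle value
            else:
                hi -= 1  # sum too large: move to a smaller last value
    return False


print(has_three_sum([-5, 1, 4, 10]))
print(has_three_sum([1, 2, 3]))
```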



The Algorithms

MSMBP

• Bit-parallel implementation

• O(nm) time

• O(n) space with very low constants

• Cross-correlation implemented via bit-sets

MSMFT

• FFT-based implementation

• O(n*log(m)) time

• O(n) space

• Cross-correlation implemented via Fast Fourier Transforms

The Structure

1. Randomly project the pattern and the text into 1D

2. “Length reduce” the data to decrease sparsity

3. Perform a cross-correlation at each alignment of the length reduced pattern and text

4. Find the shift in the length reduced pattern that gave the largest value in the cross-correlation

5. Using the “improved estimate”, infer the shift in the original data.

6. Return the size of the match with this shift.
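The slides do not spell out how step 1 is carried out. One simple possibility, assumed here purely for illustration, is a random linear map with integer coefficients. The useful property is that any linear map preserves shifts: if p + v = t in the original space, the projected values satisfy the same relation. The names `random_projection` and `max_coeff` are hypothetical.

```python
import random


def random_projection(dim, max_coeff, rng):
    """Build a random linear map from dim-dimensional integer points to 1D.

    Assumption: coefficients are drawn uniformly from [1, max_coeff].
    """
    coeffs = [rng.randint(1, max_coeff) for _ in range(dim)]

    def project(point):
        return sum(c * x for c, x in zip(coeffs, point))

    return project


rng = random.Random(0)
project = random_projection(2, 1000, rng)

p, v = (3, 7), (10, -2)
t = (p[0] + v[0], p[1] + v[1])
# Linearity: a match p + v = t survives the projection.
print(project(p) + project(v) == project(t))
```

Distinct shifts can collide after projection, which is one reason the overall method is approximate.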



(a) Randomised Projection and (b) Length Reduction

Using hash functions:

g(x) = ax mod q, h(x) = g(x) mod s and h2(x) = (g(x) + q) mod s

Where:

q = a random prime in [2N,…,4N] (N is the maximum of the projected values of P’ and T’)

a = a random integer in [1,…,q-1]

s = r*n, where r > 1 is a constant

• Projected pattern points are mapped to h(x) in the pattern binary array

• Projected text points are mapped to h(x) and h2(x) in the text binary array

• Both arrays are of length r*n, where n is the number of text points

(See Cole and Hariharan [3])
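A minimal sketch of the length reduction step, assuming the projected points are non-negative integers. The helper names (`make_hashes`, `reduce_pattern`, `reduce_text`), the trial-division primality test and the default r = 2 are choices made for this example, not details taken from the slides.

```python
import random


def is_prime(k):
    """Trial division; adequate for the small values used in a sketch."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True


def make_hashes(N, n, r=2, rng=random):
    """Draw q and a, and return (g, h, h2, q, s) as defined above."""
    primes = [k for k in range(2 * N, 4 * N + 1) if is_prime(k)]
    q = rng.choice(primes)      # random prime in [2N, 4N]
    a = rng.randint(1, q - 1)   # random multiplier in [1, q-1]
    s = r * n                   # length of both binary arrays

    def g(x):
        return (a * x) % q

    def h(x):
        return g(x) % s

    def h2(x):
        return (g(x) + q) % s

    return g, h, h2, q, s


def reduce_pattern(points, h, s):
    """Pattern points set a single bit each, at h(x)."""
    bits = [0] * s
    for x in points:
        bits[h(x)] = 1
    return bits


def reduce_text(points, h, h2, s):
    """Text points set two bits each, at h(x) and h2(x)."""
    bits = [0] * s
    for x in points:
        bits[h(x)] = 1
        bits[h2(x)] = 1
    return bits
```

Setting both h(x) and h2(x) for text points is what lets a shifted pattern bit land on a text bit in either case of Lemma 1.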



Lemma 1:

\[
(h(x) + h(y)) \bmod s =
\begin{cases}
h(x+y) & \text{if } g(x) + g(y) < q, \\
h_2(x+y) & \text{otherwise.}
\end{cases}
\]

Significance:

• If some point matches so that p + v = t then (h(p) + h(v)) mod s matches either h(t) or h2(t)

• By counting the number of 1’s in common at each alignment we can estimate the true subset match in the original data

Why does this work?

Proof:

\[
\begin{aligned}
(h(x) + h(y)) \bmod s &= (g(x) \bmod s + g(y) \bmod s) \bmod s && \text{(as } h(x) = g(x) \bmod s\text{)} \\
&= (g(x) + g(y)) \bmod s.
\end{aligned}
\]

As g(x) = ax mod q:

If g(x) + g(y) < q, then g(x+y) = g(x) + g(y), so the result is g(x+y) mod s = h(x+y).

If g(x) + g(y) ≥ q, then g(x+y) = g(x) + g(y) − q, so the result is (g(x+y) + q) mod s = h2(x+y).
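Lemma 1 is easy to check numerically. The sketch below fixes small illustrative values for q, a and s (chosen for this example only, not drawn as the slides prescribe) and tests the identity exhaustively over a range of non-negative integers.

```python
# Illustrative constants: q is prime, 1 <= a <= q-1, s is the array length.
q, a, s = 101, 37, 16


def g(x):
    return (a * x) % q


def h(x):
    return g(x) % s


def h2(x):
    return (g(x) + q) % s


def lemma_holds(x, y):
    """Check (h(x) + h(y)) mod s against the two cases of Lemma 1."""
    lhs = (h(x) + h(y)) % s
    rhs = h(x + y) if g(x) + g(y) < q else h2(x + y)
    return lhs == rhs


print(all(lemma_holds(x, y) for x in range(200) for y in range(200)))
```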



Estimating the Size of the Largest Subset Match

• Estimation based on projected and length reduced matches: high variance, which grows linearly as the number of true matches decreases (discussed in paper)

• An improved Estimate:

1. Find the best match of the length reduced pattern in the text.

2. Determine in O(m) time which points in the reduced pattern match the text at that shift.

3. Look up, by the use of a precalculated hash table, where each of the matching points were matched from in the 1D projections, P’ and T’.

4. Now we have a shift for each pair of points in P’ and T’. This may have rare inconsistencies due to collisions. We therefore perform a count and take the most frequent shift.

5. Finally we return the size of the match at this shift.

When does this work?
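The voting and final counting steps of the improved estimate can be sketched as below. The sketch assumes the earlier steps have already produced a list of (pattern value, text value) pairs in the 1D projection; the function names and the sample data are hypothetical.

```python
from collections import Counter


def most_frequent_shift(matched_pairs):
    """Each pair (p, t) from P' and T' votes for the shift t - p."""
    votes = Counter(t - p for p, t in matched_pairs)
    shift, _count = votes.most_common(1)[0]
    return shift


def match_size(pattern_1d, text_1d, shift):
    """Count the projected pattern points that land on a text point."""
    text_set = set(text_1d)
    return sum(1 for p in pattern_1d if p + shift in text_set)


# Hypothetical data: the last pair stands for a spurious hash collision.
pairs = [(2, 12), (5, 15), (9, 19), (4, 11)]
shift = most_frequent_shift(pairs)
size = match_size([2, 4, 5, 9], [11, 12, 15, 19, 30], shift)
print(shift, size)
```

Taking the most frequent shift means that a few collision-induced pairs are outvoted as long as the true match contributes more pairs than any single spurious shift.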



Bit-Parallel Cross-correlation (MSMBP)

We store the reduced pattern and text arrays as bitsets and perform a bit-parallel correlation using ANDs and counts:

– Correlation of two architectural words can be found in constant time using an AND followed by a count of the number of 1’s in the result

– Count implemented by use of a look-up table.

– Each reduced array is of size r*n, so the bitset has O(n) words, giving each correlation in O(n) time

– We need to find the correlation at each shift.

– To shift the text we must shift every word in the text, which takes O(n) time again.

Therefore, naively, this method takes O(n^2) time:

(O(n) [correlation] + O(n) [shift]) × O(n) [alignments] = O(n^2)
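The naive scheme can be illustrated with Python integers standing in for arrays of machine words. This is a sketch under two assumptions made for the example: the correlation is taken cyclically (because the hashed positions are computed mod s), and the population count uses a 256-entry look-up table applied byte by byte.

```python
# Look-up table: number of 1 bits in every possible byte value.
POPCOUNT = [bin(b).count("1") for b in range(256)]


def popcount(x):
    """Count the 1 bits of a non-negative integer, one byte at a time."""
    count = 0
    while x:
        count += POPCOUNT[x & 0xFF]
        x >>= 8
    return count


def to_bitset(bits):
    """Pack a 0/1 list into an integer; list position i becomes bit i."""
    value = 0
    for i, bit in enumerate(bits):
        if bit:
            value |= 1 << i
    return value


def correlate_all_shifts(pattern_bits, text_bits):
    """Cyclic correlation: score[k] = sum_i pattern[i] * text[(i + k) mod s]."""
    s = len(text_bits)
    assert len(pattern_bits) == s
    mask = (1 << s) - 1
    p = to_bitset(pattern_bits)
    t = to_bitset(text_bits)
    scores = []
    for k in range(s):
        rotated = ((t >> k) | (t << (s - k))) & mask  # shift the whole text
        scores.append(popcount(p & rotated))          # AND, then count
    return scores


pattern_bits = [1, 0, 1, 0, 0, 0, 0, 0]
text_bits = [0, 0, 0, 1, 0, 1, 0, 0]
print(correlate_all_shifts(pattern_bits, text_bits))
```

Each of the s alignments rotates and scans the full bitset, which mirrors the quadratic cost described above.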



Bit-Parallel Cross-correlation (MSMBP)

We reduce this complexity by taking advantage of the sparseness of the reduced pattern array when m << n:

– p has O(n) words but only O(m) non-zero values:

• we store only the non-zero words, at worst m of them

• this reduces each correlation computation to O(m) time

However, we also need to reduce the number of shifts required:

By use of pointer arithmetic, we can align the data to any constant*b alignment (where b is the byte-size) in constant time

|01010010|01000100|01011011|10000100|10100100|10010010|…

A single full shift of t gives us access to alignments c*b + 1 for any c

|10100100|10001000|10110111|00001001|01001001|00100100|…

So by calculating the correlations out of order, we need to perform only b shifts

This results in an O(nm) time complexity algorithm
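The two ideas on this slide, storing only the non-zero pattern words and computing alignments out of order with only b full shifts, can be sketched as follows. The sketch uses 8-bit words, treats the correlation as cyclic, and requires the array length to be a multiple of 8; these are assumptions made for the example, as are the function names. A direct O(s^2) version is included only as a cross-check.

```python
WORD = 8  # word size b used in this sketch (one byte)
POPCOUNT = [bin(b).count("1") for b in range(256)]


def to_int(bits):
    """Pack a 0/1 list into an integer; list position i becomes bit i."""
    value = 0
    for i, bit in enumerate(bits):
        if bit:
            value |= 1 << i
    return value


def sparse_correlate(pattern_bits, text_bits):
    """Cyclic correlation using sparse pattern words and only WORD shifts."""
    s = len(text_bits)
    assert len(pattern_bits) == s and s % WORD == 0
    nwords = s // WORD
    mask = (1 << s) - 1
    p_words = to_int(pattern_bits).to_bytes(nwords, "little")
    sparse_p = [(j, w) for j, w in enumerate(p_words) if w]  # non-zero words
    t = to_int(text_bits)
    scores = [0] * s
    for r in range(WORD):  # one full shift of the text per bit offset
        rotated = ((t >> r) | (t << (s - r))) & mask
        t_words = rotated.to_bytes(nwords, "little")
        for c in range(nwords):  # word-aligned offsets need only indexing
            scores[WORD * c + r] = sum(
                POPCOUNT[w & t_words[(c + j) % nwords]] for j, w in sparse_p
            )
    return scores


def naive_correlate(pattern_bits, text_bits):
    """Reference implementation straight from the definition."""
    s = len(text_bits)
    return [
        sum(pattern_bits[i] & text_bits[(i + k) % s] for i in range(s))
        for k in range(s)
    ]


pattern_bits = [1 if i in (0, 3, 17) else 0 for i in range(32)]
text_bits = [1 if i % 5 == 0 or i % 7 == 3 else 0 for i in range(32)]
print(sparse_correlate(pattern_bits, text_bits) == naive_correlate(pattern_bits, text_bits))
```

Alignment k = c*b + r is read from the text shifted by r bits at word offset c, so the scores are filled in out of order while the text is shifted only b times.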



FFT Cross-correlation (MSMFT)

Uses the same steps as MSMBP except the cross-correlation step is implemented using FFTs (Fast Fourier Transforms):

This uses the property of the FFT that, for numerical strings p and t, the cross-correlation

\[
p \cdot t^{(i)} \stackrel{\text{def}}{=} \sum_{j=1}^{m} p_j \, t_{i+j-1}, \qquad 1 \le i \le n,
\]

(where t^{(i)} is the length-m substring of t beginning at position i)

can be calculated accurately and efficiently in O(n*log(m)) time

(thanks to the FFTW team for the implementation used, see [5])
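To illustrate the idea without depending on FFTW, here is a self-contained sketch with a small recursive radix-2 FFT written in pure Python. It is not the implementation used in the talk. It uses 0-based indices, returns only the alignments where the pattern fits entirely inside the text, and performs a single transform over the whole text, which costs O(n log n); the O(n*log(m)) bound is usually obtained by splitting the text into overlapping blocks of length about 2m, which is omitted here.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlate(p, t):
    """Return c[i] = sum_j p[j] * t[i + j] for i = 0 .. len(t) - len(p).

    Correlation is computed as a convolution of the reversed pattern with
    the text; inputs are assumed to be integers, so results are rounded.
    """
    m, n = len(p), len(t)
    size = 1
    while size < n + m:
        size *= 2
    fp = fft(list(reversed(p)) + [0] * (size - m))
    ft = fft(list(t) + [0] * (size - n))
    product = [x * y for x, y in zip(fp, ft)]
    conv = [v.real / size for v in fft(product, invert=True)]
    return [round(conv[i + m - 1]) for i in range(n - m + 1)]


p = [1, 0, 1]
t = [0, 1, 0, 1, 1, 0]
print(cross_correlate(p, t))
```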



Speed Comparisons (1)

Increasing Text size with proportional Pattern size (25%,75%)

(P3 is the queue-based method of Ukkonen et al. [7] with complexity O(n*m*log(m)))



Speed Comparisons (2)

Increasing Text size with fixed Pattern size (40 points)

Constant Text size (960000 points) with increasing Pattern size



Accuracy Tests

Match %   Actual   Run 1   Run 2   Run 3   Avr. Diff
90%       180      180     180     180     100%
75%       150      150     150     150     100%
25%       50       50      50      50      100%
10%       20       4       5       5       23%

Match % (1st, 2nd)   Actual (1st, 2nd)   Run 1   Run 2   Run 3   Avr. Diff
100%, 10%            200, 20             200     200     200     100%
100%, 50%            200, 100            200     200     200     100%
100%, 90%            200, 180            200     200     200     100%
100%, 99%            200, 198            200     200     200     100%
75%, 10%             150, 20             150     150     150     100%
75%, 65%             150, 130            150     150     150     100%
75%, 70%             150, 140            150     150     140     98%
75%, 73%             150, 146            150     150     150     100%
50%, 10%             100, 20             100     100     100     100%
50%, 40%             100, 80             100     100     100     100%
50%, 45%             100, 90             100     100     90      97%
25%, 5%              50, 10              50      50      50      100%
25%, 15%             50, 30              50      50      50      100%
25%, 20%             50, 40              40      50      50      93%

Match % - The percentage of the pattern that existed in the text

Actual – The sizes of the actual best matches

Run 1,2,3 – The sizes of the matches found by the algorithm in each test.

Avr. Diff – The average percentage of the largest present match that was returned.

The text used was 4000 points in both cases

Only MSMBP was used for accuracy testing, as the two algorithms differ only in performance
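The Avr. Diff column appears to be the mean of the three run results expressed as a percentage of the largest actual match. That reading is an assumption, but it reproduces the tabulated values for the rows where the runs differ, as this short check shows (the function name `average_diff` is made up for the example).

```python
def average_diff(actual, runs):
    """Mean returned match size as a percentage of the actual best match."""
    return 100 * sum(runs) / (len(runs) * actual)


# Rows from the tables above where the three runs were not all exact.
print(round(average_diff(20, [4, 5, 5])))
print(round(average_diff(150, [150, 150, 140])))
print(round(average_diff(100, [100, 100, 90])))
print(round(average_diff(50, [40, 50, 50])))
```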



Conclusions

• We have presented two algorithms, MSMBP with O(nm) and MSMFT with O(n*log(m)) time complexity, both with O(n) space

• We have shown that these are efficient on large random point sets

• We have also shown that the accuracy is very high, even in situations theorised in the paper to have a lower probability of success.

• We have shown experimentally speed-ups of several orders of magnitude in some cases, without a significant decrease in accuracy

The Authors would like to thank Manolis Christodoulakis for the original implementation of the MSMFT algorithm and the EPSRC for the funding of the second author.



Questions?

(from xkcd.com)
