Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order


Edgar Chávez

Karina Figueroa

Gonzalo Navarro

Universidad Michoacana, Mexico

Universidad de Chile, Chile

Content

1. About the problem

2. Basic concepts

3. Previous work

4. Our technique

5. Experiments

6. Conclusion and future work

Proximity Searching

Huge Database

• Exact searching is not possible

Expensive distance

Applications

• Information retrieval

• Classification

• People finder through the web

• Clustering

• Currently used for
  – Classification of spider webs
  – Face recognition on the Chilean Web

Problems (metric spaces)

Index

Extraction of characteristics

Complex objects

High dimension

Limited memory

Huge databases

Terminology

• Queries
  – Range query
  – K nearest neighbor

Properties
• Symmetry
• Strict positiveness
• Triangle inequality
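
The two query types can be illustrated with a brute-force scan. This is only a sketch: the Euclidean distance, the sample points, and the names `range_query` and `knn_query` are assumptions for illustration, not part of the slides. Any distance satisfying the three properties above could be passed as `dist`.

```python
import heapq
import math


def range_query(db, q, r, dist=math.dist):
    """Return every element of db within distance r of the query q."""
    return [u for u in db if dist(q, u) <= r]


def knn_query(db, q, k, dist=math.dist):
    """Return the k elements of db closest to the query q."""
    return heapq.nsmallest(k, db, key=lambda u: dist(q, u))


# Hypothetical toy database of 2-D points.
db = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
q = (0.0, 0.0)

in_range = range_query(db, q, 1.5)
nearest = knn_query(db, q, 2)
```

Both functions evaluate the distance once per database element, which is exactly the cost an index tries to avoid when the distance is expensive.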

Previous work

• Pivot based
• Partition based

[Figure: pivot-based indexing, with a pivot, a distance, and the query q]

[Figure: partition-based indexing, with a center and the query q]

Our technique: Permutation

[Figure: an element u and the permutants p1, p2, p3, p4, p5, p6]

Our technique

• Exact matching elements have the same permutation

• Similar elements must have a similar permutation (we guess)

• Spearman footrule metric
  – Measures the similarity of the permutations
  – Promising elements first

Spearman Footrule metric: Example

3-1, 6-2, 3-2, 4-1, 5-5, 6-4

Difference of positions
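
A minimal sketch of the Spearman footrule, taken here as the sum over all permutants of the absolute difference between their positions in two permutations. The function name `spearman_footrule` and the sample permutations are assumptions for illustration. The numbers on the slide are not reproduced, since the permutations they came from are not recoverable.

```python
def spearman_footrule(perm_a, perm_b):
    """Sum of absolute position differences of each permutant in two permutations."""
    # Position of every permutant inside perm_b.
    pos_b = {p: i for i, p in enumerate(perm_b)}
    return sum(abs(i - pos_b[p]) for i, p in enumerate(perm_a))


# Hypothetical permutations over three permutants.
same = spearman_footrule(["p2", "p1", "p3"], ["p2", "p1", "p3"])
swapped = spearman_footrule(["p2", "p1", "p3"], ["p2", "p3", "p1"])
far = spearman_footrule(["p2", "p1", "p3"], ["p3", "p1", "p2"])
```

Identical permutations give 0, and the value grows as permutants move further from their positions in the other permutation.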

Searching process (part 1): Preprocessing time

[Figure: permutants p1, p2, p3; each database element stores its permutation of the permutants: (p3,p1,p2), (p3,p2,p1), (p2,p1,p3), (p2,p3,p1)]
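
The preprocessing step can be sketched as follows, assuming each element's permutation is the list of permutants sorted by increasing distance to that element. The names `permutation_of` and `build_index`, the Euclidean distance, and the sample points are hypothetical.

```python
import math


def permutation_of(u, permutants, dist=math.dist):
    """Indices of the permutants, ordered from closest to farthest from u."""
    order = sorted(range(len(permutants)), key=lambda i: dist(u, permutants[i]))
    return tuple(order)


def build_index(db, permutants, dist=math.dist):
    """Store one permutation per database element (the whole index)."""
    return [permutation_of(u, permutants, dist) for u in db]


# Hypothetical permutants and database; index 0 plays the role of p1, and so on.
permutants = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
db = [(1.0, 0.0), (3.0, 0.0), (0.0, 3.0), (3.0, 1.0)]

index = build_index(db, permutants)
```

Only the permutations are kept, so the index needs no stored distances.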

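Searching process (part 2): Query time

[Figure: the same permutants p1, p2, p3 and database elements with permutations (p3,p1,p2), (p3,p2,p1), (p2,p1,p3), (p2,p3,p1); the query q has permutation (p2,p1,p3)]

Sorting elements by Spearman Footrule metric

p2,p1,p3
p2,p3,p1
…
p3,p1,p2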
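
The query step can be sketched like this, under stated assumptions: the database is sorted by footrule distance between each stored permutation and the query's permutation, and only a leading fraction of that order is compared against the query with the real distance. The function names, the `fraction` parameter, and the sample data are hypothetical.

```python
import heapq
import math


def permutation_of(u, permutants, dist=math.dist):
    """Indices of the permutants, ordered from closest to farthest from u."""
    order = sorted(range(len(permutants)), key=lambda i: dist(u, permutants[i]))
    return tuple(order)


def footrule(perm_a, perm_b):
    """Spearman footrule: sum of absolute position differences."""
    pos_b = {p: i for i, p in enumerate(perm_b)}
    return sum(abs(i - pos_b[p]) for i, p in enumerate(perm_a))


def approx_knn(db, index, permutants, q, k, fraction, dist=math.dist):
    """Compare q only against the most promising fraction of the database."""
    q_perm = permutation_of(q, permutants, dist)
    # Most similar permutations first.
    order = sorted(range(len(db)), key=lambda j: footrule(q_perm, index[j]))
    budget = max(k, math.ceil(fraction * len(db)))
    candidates = [db[j] for j in order[:budget]]
    return heapq.nsmallest(k, candidates, key=lambda u: dist(q, u))


# Hypothetical permutants, database, and query.
permutants = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
db = [(1.0, 0.0), (3.0, 0.0), (0.0, 3.0), (3.0, 1.0)]
index = [permutation_of(u, permutants) for u in db]

result = approx_knn(db, index, permutants, q=(3.5, 0.2), k=1, fraction=0.5)
```

The answer is approximate: a true neighbor whose permutation differs a lot from the query's can fall outside the compared fraction.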

Experiments

• 93% retrieved, comparing 10% of database
• 90% retrieved, comparing 60% of database
• Pivot based algorithm: retrieved 48%

[Plot: y-axis: % retrieved]

Experiments

• 100% retrieved, comparing 15% of database
• 100% retrieved, comparing 90% of database

[Plot: y-axis: % retrieved]

How good is our prediction?

[Plot: Dimension 256, using 256 pivots; x-axis: percentage of the database compared; y-axis: retrieved]

Metric algorithms are using one of them

Similarities between permutations

Almost the same value

Conclusion

• A new probabilistic algorithm for proximity searching in metric spaces.
• Our technique is based on permutations.
• Close elements will have similar permutations.
• This technique is the fastest known algorithm for high dimension.
• Permutations are good predictors.

Future Work

• Can non-metric spaces be tackled with this technique?

• Approximate all-K-nearest-neighbor algorithm.

• Improving other metric indexes.

Thank you
