Nearest Neighbours Search using the PM-tree

DASFAA 2005, Beijing

Tomáš Skopal (1), Jaroslav Pokorný (1), Václav Snášel (2)

(1) Charles University in Prague, Department of Software Engineering, Czech Republic

(2) VSB - Technical University of Ostrava, Department of Computer Science, Czech Republic


Presentation Outline

- Similarity search in Metric Spaces
- M-tree
  - the structure
  - k-NN search
- PM-tree (an extension of M-tree)
  - motivation
  - the structure
  - k-NN search
- Experimental Results


Similarity search in Metric Spaces

Similarity search
- methods for content-based retrieval in multimedia databases
- the similarity measure is often modelled by a metric d (satisfying the triangular inequality, symmetry, reflexivity, and non-negativity)
- similarity queries (query by example) are realized as metric queries
  - range query (Q, rQ), specified by a query object Q and a query radius rQ
  - k-NN query (Q, k), specified by a query object Q and the number of nearest neighbours k

Metric Access Methods (MAMs)
- designed to search in metric datasets while keeping the search costs minimal
- search costs = number of distance computations + I/O costs
- only distances between objects are used for indexing (the structure of the object representation is not used)
- many MAMs are not suitable for similarity search in large datasets: they are either static methods or incur high I/O search costs
- M-tree and (recently) D-index are the only suitable candidates so far
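As a point of reference for the two query types above, the following is a minimal sketch of a sequential scan, the baseline that a MAM tries to beat. The function names and the choice of the Euclidean metric are assumptions made for illustration; any metric d could be passed in.

```python
import heapq
import math

def euclidean(a, b):
    """Euclidean (L2) distance, used here only as an example metric d."""
    return math.dist(a, b)

def range_query(data, d, q, r_q):
    """Range query (Q, rQ): all objects within distance r_q of q.
    Costs one distance computation per indexed object."""
    return [o for o in data if d(q, o) <= r_q]

def knn_query(data, d, q, k):
    """k-NN query (Q, k): the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, data, key=lambda o: d(q, o))

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
print(range_query(points, euclidean, (0.0, 0.0), 1.5))
print(knn_query(points, euclidean, (0.0, 0.0), 3))
```

Both functions compute d(Q, O) for every object, which is exactly the cost an index structure is meant to reduce.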

M-tree (metric tree)

- dynamic, balanced, and paged tree structure (like the B+-tree or R-tree)
- the leaves are clusters of indexed objects Oj (ground objects)
- routing entries in the inner nodes represent hyper-spherical metric regions (Oi, rOi), recursively bounding the object clusters in the leaves
- the triangular inequality allows irrelevant M-tree branches (i.e. metric regions) to be discarded during query evaluation

[Figure: M-tree regions and a range query around Q (Euclidean 2D space)]
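The pruning rule behind the M-tree range query can be sketched as follows. This is a simplified in-memory illustration, not the paged structure from the slides: the `Node` class, its field names, and the hand-built tree are hypothetical, and each node carries its own region (Oi, rOi) instead of storing routing entries in its parent.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    center: tuple                                  # routing object Oi
    radius: float                                  # covering radius rOi
    children: list = field(default_factory=list)   # child nodes (inner node)
    objects: list = field(default_factory=list)    # ground objects (leaf)

def range_query(node, q, r_q, d=math.dist, result=None):
    """Collect all ground objects within r_q of q, skipping pruned subtrees."""
    result = [] if result is None else result
    # Triangle inequality: if d(Q, Oi) > rOi + rQ, no object inside the
    # region can lie within r_q of Q, so the whole branch is discarded.
    if d(q, node.center) > node.radius + r_q:
        return result
    result.extend(o for o in node.objects if d(q, o) <= r_q)
    for child in node.children:
        range_query(child, q, r_q, d, result)
    return result

# Hypothetical two-level tree in the Euclidean plane.
leaf_a = Node((0, 0), 1.5, objects=[(0, 0), (1, 0), (0, 1)])
leaf_b = Node((10, 10), 1.5, objects=[(10, 10), (11, 10)])
root = Node((5, 5), 9.0, children=[leaf_a, leaf_b])

print(range_query(root, (0.5, 0.5), 1.0))
```

In this example the branch `leaf_b` is discarded after a single distance computation to its routing object.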


k-NN search in the M-tree

- branch-and-bound algorithm (similar to that of the R-tree)
- a modification of the range query algorithm in which the query radius rQ is dynamic: rQ decreases from infinity to the distance of the k-th nearest neighbour
- two structures are used: a priority queue PR and a sorted array NN

PR: stores requests for nodes not yet filtered from the search
- a request has the form [routing entry to a node N, dmin(N)], where dmin(N) is a lower bound on the distance from Q to any object in N:

  dmin(rout. entry to N) = max {0, d(Q, Oi) – rOi}

  where (Oi, rOi) is the region of N's routing entry
- requests in PR are sorted by dmin(N)

NN: stores k candidate objects (or distance upper bounds)
- at the end of the algorithm run, NN contains the result, i.e. the k nearest neighbours
- an entry has the form [candidate object Oi, d(Q, Oi)] or [-, dmax(N)], where dmax(N) is an upper bound on the distance from Q to any object in N:

  dmax(rout. entry to N) = d(Q, Oi) + rOi

- PR keeps only requests whose dmin(·) is smaller than the largest distance (or dmax(·) bound) currently held in NN, i.e. smaller than the current rQ; requests whose regions do not overlap the dynamic query region (Q, rQ) are removed

Query processing
- PR is initialized to ([root, ∞]); NN is initialized with k entries [-, ∞], i.e. ([-, ∞], [-, ∞], ...)
- the requests are taken from the head of PR (smallest dmin first) → a node N is retrieved, and PR and NN are updated with the routing/ground entries of N
- optimal in I/O costs (the same I/O costs as the range query (Q, d(Q, NN[k])), where NN[k] is the k-th nearest neighbour)
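The branch-and-bound loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Node` class and the sample tree are hypothetical, the tree is held in memory, and NN stores only real candidate objects, so the [-, dmax(N)] placeholder entries that let rQ shrink earlier are left out.

```python
import heapq
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    center: tuple                                  # routing object Oi
    radius: float                                  # covering radius rOi
    children: list = field(default_factory=list)
    objects: list = field(default_factory=list)

def d_min(node, q, d):
    """Lower bound on the distance from q to any object in the region."""
    return max(0.0, d(q, node.center) - node.radius)

def knn(root, q, k, d=math.dist):
    """Return ([(distance, object), ...] nearest first, nodes accessed)."""
    nn = []              # max-heap of candidates, stored as (-distance, object)
    r_q = math.inf       # dynamic query radius
    counter = 0          # tie-breaker so the heap never compares Node objects
    pr = [(d_min(root, q, d), counter, root)]
    accessed = 0
    while pr:
        dmin, _, node = heapq.heappop(pr)
        if dmin > r_q:   # every remaining request is at least this far away
            break
        accessed += 1
        for o in node.objects:
            dist = d(q, o)
            if len(nn) < k:
                heapq.heappush(nn, (-dist, o))
            elif dist < -nn[0][0]:
                heapq.heapreplace(nn, (-dist, o))
            if len(nn) == k:
                r_q = -nn[0][0]          # distance of the current k-th candidate
        for child in node.children:
            child_dmin = d_min(child, q, d)
            if child_dmin <= r_q:        # region may overlap (Q, rQ)
                counter += 1
                heapq.heappush(pr, (child_dmin, counter, child))
    return sorted((-neg, o) for neg, o in nn), accessed

leaf_a = Node((0, 0), 1.5, objects=[(0, 0), (1, 0), (0, 1)])
leaf_b = Node((10, 10), 1.5, objects=[(10, 10), (11, 10)])
root = Node((5, 5), 9.0, children=[leaf_a, leaf_b])

neighbours, accessed = knn(root, (0.4, 0.1), 2)
print(neighbours, accessed)
```

Because requests are popped in order of increasing dmin, the loop can stop as soon as the head of PR lies outside the current query radius.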


k-NN search in M-tree: example (k=2)

[Figure sequence: read root (rQ = ∞); read node(II.); read node(D); read node(I.); read node(B). The panels are annotated with dmin(·) and dmax(·) for regions I., II. (dmin(II.) = 0), B, C and D, and with dmax(·) for objects O4, O5 and O6.]

5 nodes accessed, the same nodes as accessed by the range query (Q, d(Q,O5))



PM-tree motivation

- metric regions in the M-tree are unnecessarily large
  - they index large portions of empty space (the “dead” space)
  - higher probability of intersection with the query region
  - less efficient search
- reducing the metric region “volume” should lead to more effective discarding of irrelevant subtrees
- the question is how to specify a compact metric region bounding all the objects more “tightly”
- a generalization of the M-tree to other metric region shapes


PM-tree region

- utilization of global pivots (inspired by LAESA-like methods)
- given a fixed set of p global pivots Pi (selected from (a part of) the dataset)
- p hyper-ring regions (Pi, HR[i]) are defined for each routing entry
  - an array HR of p intervals <HR[i].min, HR[i].max>
  - each interval HR[i] bounds the distances of the objects to the respective pivot Pi
- PM-tree region = M-tree region + HR array (the pivots Pi are shared by all PM-tree regions)
- the intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects in the leaves
- the more pivots, the more tightly bounded the region
- the PM-tree is built the same way as the M-tree, i.e. the hyper-rings only “cut off” parts of the M-tree sphere
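To make the HR array concrete, here is a minimal sketch of how the p intervals could be computed for one cluster of objects. The function name, the tuple representation of an interval, and the sample pivots are assumptions for illustration; in a real PM-tree the intervals would be maintained incrementally during insertions and node splits.

```python
import math

def hyper_rings(objects, pivots, d=math.dist):
    """Return HR as a list of (min, max) pairs, one per pivot.
    HR[t] bounds the distances from pivot P_t to every object given."""
    rings = []
    for p in pivots:
        dists = [d(p, o) for o in objects]
        rings.append((min(dists), max(dists)))
    return rings

pivots = [(0.0, 0.0), (10.0, 0.0)]      # hypothetical global pivots
cluster = [(3.0, 4.0), (6.0, 8.0)]      # objects under one routing entry
hr = hyper_rings(cluster, pivots)
print(hr)
```

Each added pivot contributes one more interval, and therefore one more constraint that an object must satisfy to lie inside the region.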


PM-tree, query processing

- the distances d(Q, Pi) for all i ≤ p must be computed prior to processing a query
- a metric region (Oi, rOi, HR) is relevant to (intersected by) a range query (Q, rQ) only if the hyper-sphere and all the hyper-rings overlap the query region
- the more hyper-rings, the lower the probability of intersection with the query
- no additional distance computations are needed for the intersection test

[Figure: a query Q against an M-tree region and against the corresponding PM-tree region]
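The relevance test for a PM-tree region can be sketched as below. The function and parameter names are hypothetical; the point of the sketch is that, once d(Q, Oi) is known as in the M-tree, the hyper-ring checks reuse the distances d(Q, Pt) computed once per query and need no further distance computations.

```python
def region_intersects_query(d_q_center, r_region, hr, d_q_pivots, r_q):
    """True if the region (Oi, rOi, HR) may contain objects within r_q of Q.

    d_q_center -- d(Q, Oi), computed as in the M-tree
    r_region   -- covering radius rOi
    hr         -- list of (min, max) intervals, one per pivot
    d_q_pivots -- distances d(Q, P_t), precomputed once per query
    """
    # Hyper-sphere test, identical to the M-tree.
    if d_q_center > r_region + r_q:
        return False
    # Hyper-ring tests: the query ball must overlap every ring.
    for (ring_min, ring_max), d_qp in zip(hr, d_q_pivots):
        if d_qp - r_q > ring_max or d_qp + r_q < ring_min:
            return False
    return True

# The sphere alone would accept this query; the single ring rejects it.
print(region_intersects_query(3.0, 2.0, [(5.0, 10.0)], [12.0], 1.5))
print(region_intersects_query(3.0, 2.0, [(5.0, 10.0)], [12.0], 2.5))
```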


k-NN search in the PM-tree

3 modifications of the M-tree's k-NN algorithm:

- different intersection test between the query region (Q, rQ) and a PM-tree region (Oi, rOi, HR); in addition to the hyper-sphere test, for all t = 1..p:

  d(Pt, Q) – rQ ≤ HR[t].max  ∧  d(Pt, Q) + rQ ≥ HR[t].min

- different dmin construction (+ possible distance increase to the farthest hyper-ring)

  dmin(rout. entry to N) = max {0, d(Q, Oi) – rOi, HRfarthest}

  HRfarthest = max over t = 1..p of { d(Pt, Q) – HR[t].max, HR[t].min – d(Pt, Q) }

- different dmax construction (+ possible distance decrease to the farthest object in the nearest hyper-ring)

  dmax(rout. entry to N) = min { d(Q, Oi) + rOi, HRnearest }

  HRnearest = min over t = 1..p of { d(Pt, Q) + HR[t].max }
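The tightened bounds can be checked numerically with a small sketch. It assumes the upper bound is taken as the smaller of the sphere bound d(Q, Oi) + rOi and the ring bounds d(Pt, Q) + HR[t].max, which is what the triangular inequality guarantees; the function names, pivots, and sample cluster are hypothetical, and at least one pivot is assumed.

```python
import math

def pm_dmin(d_q_center, r_region, hr, d_q_pivots):
    """Lower bound on d(Q, O) for any object O in the PM-tree region."""
    hr_farthest = max(max(d_qp - ring_max, ring_min - d_qp)
                      for (ring_min, ring_max), d_qp in zip(hr, d_q_pivots))
    return max(0.0, d_q_center - r_region, hr_farthest)

def pm_dmax(d_q_center, r_region, hr, d_q_pivots):
    """Upper bound on d(Q, O) for any object O in the PM-tree region."""
    hr_nearest = min(d_qp + ring_max
                     for (_, ring_max), d_qp in zip(hr, d_q_pivots))
    return min(d_q_center + r_region, hr_nearest)

pivots = [(0.0, 0.0), (10.0, 0.0)]
objects = [(3.0, 4.0), (6.0, 8.0)]
center, radius = (3.0, 4.0), 5.0        # hyper-sphere covering both objects
hr = [(min(math.dist(p, o) for o in objects),
       max(math.dist(p, o) for o in objects)) for p in pivots]

q = (0.0, 0.0)
d_q_pivots = [math.dist(q, p) for p in pivots]
d_q_center = math.dist(q, center)
lower = pm_dmin(d_q_center, radius, hr, d_q_pivots)
upper = pm_dmax(d_q_center, radius, hr, d_q_pivots)
print(lower, upper)
```

In this example the M-tree lower bound max {0, d(Q, Oi) – rOi} is 0, while the hyper-ring of the first pivot raises the lower bound to the true nearest distance.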


k-NN search in PM-tree: example (k=2)

[Figure sequence: read root; read node(I.); read node(B); read node(II.); read node(D). The panels are annotated with dmin(·) and dmax(·) for regions I. and II.]

5 nodes accessed, the same nodes as accessed by the range query (Q, d(Q,O5))



Experimental Results (synthetic datasets)

- synthetic vector datasets (4D – 60D); 100,000 tuples; 1000 clusters
- disk page sizes: 1 KB – 4 KB; index sizes: 4.5 MB – 55 MB


Experimental Results (image database)

- WBIIS image database; approx. 10,000 256D vectors (gray histograms)
- disk page size: 32 KB; index sizes: 16 MB – 20 MB


References

[1] Skopal T., Pokorný J., Snášel V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases. ADBIS 2004, Budapest, Hungary

[2] Skopal T.: Pivoting M-tree: A Metric Access Method for Efficient Similarity Search. DATESO 2004, Desná, Czech Republic

[3] Skopal T., Pokorný J., Krátký M., Snášel V.: Revisiting M-tree Building Principles. ADBIS 2003, Dresden, Germany, LNCS 2798, Springer

[4] Skopal T.: Metric Indexing in Information Retrieval. PhD thesis, VSB-Technical University of Ostrava. http://urtax.ms.mff.cuni.cz/~skopal/phd/thesis.pdf
