Nearest Neighbours Search using the PM-tree

Tomáš Skopal (1), Jaroslav Pokorný (1), Václav Snášel (2)

(1) Charles University in Prague, Department of Software Engineering, Czech Republic
(2) VSB - Technical University of Ostrava, Department of Computer Science, Czech Republic

DASFAA 2005, Beijing


Presentation Outline

  • Similarity search in metric spaces
  • M-tree
    - the structure
    - k-NN search
  • PM-tree (an extension of the M-tree)
    - motivation
    - the structure
    - k-NN search
  • Experimental results

Similarity search in Metric Spaces

Similarity search
  • methods for content-based retrieval in multimedia databases
  • the similarity measure is often modelled by a metric d (satisfying the triangle inequality, symmetry, reflexivity, and non-negativity)
  • similarity queries (query by example) are realized as metric queries
    - range query (Q, rQ), specified by a query object Q and a query radius rQ
    - k-NN query (Q, k), specified by a query object Q and the number of nearest neighbours k

Metric Access Methods (MAMs)
  • designed to search metric datasets while keeping the search costs minimal
    - search costs = number of distance computations + I/O costs
  • only distances between objects are used for indexing (the structure of the object representation is not used)
  • many MAMs are not suitable for similarity search in large datasets
    - they are either static methods or incur high I/O search costs
    - the M-tree and (recently) the D-index are the only suitable candidates so far
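
As a point of comparison for the access methods above, the two query types can be sketched as a plain sequential scan, which computes one distance per indexed object. This is only an illustrative baseline: the function names and the sample points are assumptions, and the Euclidean distance stands in for an arbitrary metric d.

```python
import heapq
import math


def range_query(data, d, q, r_q):
    """Range query (Q, rQ): all objects within distance rQ of Q."""
    return [o for o in data if d(q, o) <= r_q]


def knn_query(data, d, q, k):
    """k-NN query (Q, k): the k objects closest to Q."""
    return heapq.nsmallest(k, data, key=lambda o: d(q, o))


# Hypothetical 2D data; math.dist is the Euclidean metric.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
query = (0.0, 0.0)

in_range = range_query(points, math.dist, query, 2.0)
nearest = knn_query(points, math.dist, query, 2)
```

Both functions evaluate d once for every object, which is the cost a MAM tries to reduce.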

M-tree (metric tree)

  • dynamic, balanced, and paged tree structure (like e.g. the B+-tree or the R-tree)
  • the leaves are clusters of indexed objects Oj (ground objects)
  • routing entries in the inner nodes represent hyper-spherical metric regions (Oi, rOi), recursively bounding the object clusters in the leaves
  • the triangle inequality allows irrelevant M-tree branches (i.e. metric regions) to be discarded during query evaluation

[Figure: an M-tree in Euclidean 2D space with a range query centred at Q]
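
The discarding step can be illustrated with the usual sphere-overlap condition that follows from the triangle inequality: a region (Oi, rOi) can contain an answer to the range query (Q, rQ) only if d(Q, Oi) ≤ rQ + rOi. The sketch below is a minimal illustration; the region names, centres, and radii are made up.

```python
import math


def sphere_overlaps_query(d_q_o, r_o, r_q):
    """True if the region (O_i, r_Oi) may contain objects within r_Q of Q.

    d_q_o is the already computed distance d(Q, O_i).
    """
    return d_q_o <= r_q + r_o


# Hypothetical routing entries: name -> (centre O_i, covering radius r_Oi).
regions = {"A": ((1.0, 1.0), 1.0), "B": ((5.0, 5.0), 1.5)}
q, r_q = (0.0, 0.0), 1.0

visited = [name for name, (center, radius) in regions.items()
           if sphere_overlaps_query(math.dist(q, center), radius, r_q)]
```

Here only region A has to be searched; the whole subtree under B is skipped after a single distance computation.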

k-NN search in the M-tree

  • branch-and-bound algorithm (similar to that of the R-tree)
  • a modification of the range query algorithm in which the query radius rQ is dynamic
    - rQ decreases from infinity to the distance to the k-th nearest neighbour
  • uses two structures: a priority queue PR and a sorted array NN

PR: stores requests for nodes that have not yet been filtered from the search
  • a request has the form [routing entry to a node N, dmin(N)], where dmin(N) is the lower bound on the distance from Q to any object in N, i.e.

    $d_{\min}(N) = \max\{0,\; d(Q, O_i) - r_{O_i}\}$

    where (Oi, rOi) is the region of N's routing entry
  • requests in PR are sorted by dmin(N)

NN: stores k candidate objects (or distance upper bounds)
  • at the end of the algorithm run, NN contains the result, i.e. the k nearest neighbours
  • an entry has the form [candidate object Oi, d(Q, Oi)] or [-, dmax(N)], where dmax(N) is the upper bound on the distance from Q to any object in N, i.e.

    $d_{\max}(N) = d(Q, O_i) + r_{O_i}$

  • PR keeps only requests whose dmin(·) is smaller than the current rQ, i.e. the k-th (largest) distance or dmax(·) bound in NN; requests whose regions do not overlap the dynamic query region (Q, rQ) are removed from PR

Query processing
  • the requests in PR are processed in order of increasing dmin(N): a node N is retrieved, and PR and NN are updated with the routing/ground entries of N
  • PR is initialized to ([root, ∞]); NN is initialized with k entries [-, ∞], i.e. ([-, ∞], [-, ∞], ...)
  • optimal in I/O costs: the same I/O costs as the range query (Q, d(Q, NN[k])), whose radius is the distance to the k-th nearest neighbour
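
The branch-and-bound procedure described above can be sketched on a tiny in-memory tree. This is a simplified illustration, not the authors' implementation: the `Node` class, the hand-built two-level tree, and the tuple layout of PR and NN are assumptions. The sketch also gives the root request dmin = 0, assumes every subtree is non-empty, and drops a node's [-, dmax(N)] placeholder from NN when that node is read, so that a placeholder and the object it stands for are never counted twice.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    is_leaf: bool
    # Leaf: list of ground objects. Inner node: list of (O_i, r_Oi, child Node).
    entries: list = field(default_factory=list)


def knn_search(root, q, k, d=math.dist):
    tie = itertools.count()               # tie-breaker: heapq never compares Nodes
    pr = [(0.0, next(tie), root)]         # PR: requests (dmin, tie, node)
    nn = [(math.inf, None, None)] * k     # NN: (distance or dmax, object, owner node)
    accessed = 0

    def nn_update(entry):
        nonlocal nn
        nn = sorted(nn + [entry], key=lambda e: e[0])[:k]

    while pr:
        d_min, _, node = heapq.heappop(pr)
        if d_min > nn[-1][0]:             # nn[-1][0] is the dynamic radius rQ
            break                         # every remaining request is farther still
        accessed += 1
        # The [-, dmax(N)] placeholder of N is superseded by N's own entries.
        kept = [e for e in nn if e[2] is not node]
        nn = kept + [(math.inf, None, None)] * (k - len(kept))
        if node.is_leaf:
            for obj in node.entries:
                dist = d(q, obj)
                if dist < nn[-1][0]:
                    nn_update((dist, obj, None))
        else:
            for center, radius, child in node.entries:
                dist = d(q, center)
                lo, hi = max(0.0, dist - radius), dist + radius
                if lo <= nn[-1][0]:       # region overlaps the query region (Q, rQ)
                    heapq.heappush(pr, (lo, next(tie), child))
                    if hi < nn[-1][0]:
                        nn_update((hi, None, child))
    return [(o, dist) for dist, o, _ in nn if o is not None], accessed


# Hypothetical tree: a root with three leaves, covering radii chosen by hand.
leaf_a = Node(True, [(1.0, 1.0), (2.0, 1.0)])
leaf_b = Node(True, [(8.0, 8.0), (9.0, 8.0)])
leaf_c = Node(True, [(1.0, 5.0), (1.0, 6.0)])
root = Node(False, [((1.0, 1.0), 1.0, leaf_a),
                    ((8.0, 8.0), 1.0, leaf_b),
                    ((1.0, 5.0), 1.0, leaf_c)])

result, accessed = knn_search(root, (0.0, 0.0), k=2)
```

In this example only the root and one leaf are read: once both neighbours are found in the first leaf, the dmin values of the other two requests exceed rQ and the loop stops.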

k-NN search in M-tree: example (k=2)

[Figure sequence: animated k-NN search around Q, starting with rQ = ∞. Nodes are read in the order root, II., D, I., B. The frames are labelled with the bounds dmin(I.), dmin(II.) = 0, dmax(I.), dmax(II.); then dmin(C), dmin(D), dmax(C), dmax(D), dmax(O5), dmax(O6); then dmin(B) and dmax(O4).]

5 nodes accessed, the same nodes as accessed by the range query (Q, d(Q, O5))

PM-tree motivation

  • metric regions in the M-tree are unnecessarily large
    - large portions of empty space (the "dead" space) are indexed
    - higher probability of intersection with the query region
    - less efficient search
  • reducing the "volume" of a metric region should lead to more effective discarding of irrelevant subtrees
  • the question is how to specify a compact metric region bounding all the objects more "tightly"
  • a generalization of the M-tree to other representations of the metric region shape

PM-tree region

  • utilizes global pivots (inspired by LAESA-like methods)
  • given a fixed set of p global pivots Pi (selected from the dataset or a part of it)
  • p hyper-ring regions (Pi, HR[i]) are defined for each routing entry
    - HR is an array of p intervals <HR[i].min, HR[i].max>
    - each interval HR[i] bounds the distances of the objects to the respective pivot Pi
  • PM-tree region = M-tree region + HR array (the pivots Pi are shared by all PM-tree regions)
  • the intersection of the hyper-sphere and the hyper-rings forms a smaller region bounding all the objects in the leaves
    - the more pivots, the more tightly bounded the region
  • the PM-tree is built the same way as the M-tree, i.e. the hyper-rings only "cut off" parts of the M-tree sphere
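
To make the HR array concrete, the sketch below computes the p intervals for one cluster of objects as the minimum and maximum distance to each pivot. It is a minimal illustration of the definition above; the function name, the pivots, and the cluster are assumptions, and the Euclidean distance stands in for the metric d.

```python
import math


def hyper_rings(objects, pivots, d=math.dist):
    """Return HR as a list of (min, max) intervals, one per pivot P_i.

    Each interval bounds the distances from the objects to that pivot.
    """
    rings = []
    for p in pivots:
        dists = [d(p, o) for o in objects]
        rings.append((min(dists), max(dists)))
    return rings


# Hypothetical pivots and one cluster of ground objects.
pivots = [(0.0, 0.0), (10.0, 0.0)]
cluster = [(3.0, 4.0), (6.0, 8.0)]

hr = hyper_rings(cluster, pivots)   # hr[i] corresponds to pivot P_i
```

Every object of the cluster then satisfies hr[i][0] ≤ d(Pi, O) ≤ hr[i][1] for each pivot, which is what the intersection tests rely on.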

PM-tree, query processing

  • the distances d(Q, Pi) for all i ≤ p must be computed prior to processing a query
  • a metric region (Oi, rOi, HR) is relevant to (intersected by) a range query (Q, rQ) only if all the hyper-rings and the hyper-sphere overlap the range query region
    - the more hyper-rings, the lower the probability of intersection with the query
  • no additional distance computations are needed for the intersection test

[Figure: two panels, "PM-tree region" and "M-tree region", each with a query region centred at Q]
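
The relevance test for a range query can be sketched as the sphere-overlap check followed by one interval check per hyper-ring, reusing the distances d(Q, Pi) computed once per query. The function and parameter names and the sample numbers are assumptions made for illustration.

```python
def region_overlaps_query(d_q_o, r_o, hr, d_q_p, r_q):
    """True if the PM-tree region (O_i, r_Oi, HR) may intersect (Q, r_Q).

    d_q_o -- distance d(Q, O_i), computed for this routing entry
    hr    -- list of (min, max) intervals, one per pivot
    d_q_p -- list of distances d(Q, P_t), precomputed once per query
    """
    if d_q_o > r_q + r_o:                       # hyper-sphere test (as in the M-tree)
        return False
    return all(dqp - r_q <= hi and dqp + r_q >= lo    # hyper-ring tests
               for (lo, hi), dqp in zip(hr, d_q_p))


# Hypothetical entry: the sphere overlaps the query, the single hyper-ring does not.
sphere_only = region_overlaps_query(3.0, 2.5, [], [], 1.0)
with_rings = region_overlaps_query(3.0, 2.5, [(5.0, 6.0)], [3.0], 1.0)
```

With no pivots the test degenerates to the M-tree test; with the hyper-ring included, the same entry is discarded without computing any further distance.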

k-NN search in the PM-tree

3 modifications of the M-tree's k-NN algorithm:

  • a different intersection test between the query region (Q, rQ) and the PM-tree region (Oi, rOi, HR); every hyper-ring must overlap the query region:

    $\bigwedge_{t=1}^{p} \big( d(P_t, Q) - r_Q \le \mathrm{HR}[t].\mathrm{max} \;\wedge\; d(P_t, Q) + r_Q \ge \mathrm{HR}[t].\mathrm{min} \big)$

  • a different dmin construction (+ a possible distance increase to the farthest hyper-ring):

    $d_{\min}(N) = \max\{0,\; d(Q, O_i) - r_{O_i},\; \mathrm{HR}_{farthest}\}$
    $\mathrm{HR}_{farthest} = \max_{t=1..p} \{ d(P_t, Q) - \mathrm{HR}[t].\mathrm{max},\; \mathrm{HR}[t].\mathrm{min} - d(P_t, Q) \}$

  • a different dmax construction (+ a possible distance decrease to the farthest object in the nearest hyper-ring):

    $d_{\max}(N) = \min\{ d(Q, O_i) + r_{O_i},\; \mathrm{HR}_{nearest} \}$
    $\mathrm{HR}_{nearest} = \min_{t=1..p} \{ d(P_t, Q) + \mathrm{HR}[t].\mathrm{max} \}$
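
The two bounds can be sketched as small functions over precomputed distances. The sketch assumes that the pivot-based upper bound is d(Pt, Q) + HR[t].max and that dmax takes the smaller of the sphere bound and the pivot bound, which is the reading under which the hyper-rings can only decrease dmax; both follow from the triangle inequality. Names and sample values are illustrative, and at least one pivot (p ≥ 1) is assumed.

```python
def pm_dmin(d_q_o, r_o, hr, d_q_p):
    """Lower bound on the distance from Q to any object in the region."""
    hr_farthest = max(max(dqp - hi, lo - dqp)
                      for (lo, hi), dqp in zip(hr, d_q_p))
    return max(0.0, d_q_o - r_o, hr_farthest)


def pm_dmax(d_q_o, r_o, hr, d_q_p):
    """Upper bound on the distance from Q to any object in the region.

    Assumption: the tighter (smaller) of the sphere and pivot bounds is used.
    """
    hr_nearest = min(dqp + hi for (_, hi), dqp in zip(hr, d_q_p))
    return min(d_q_o + r_o, hr_nearest)


# Hypothetical routing entry: d(Q, O_i) = 6, r_Oi = 3, one pivot with
# HR[1] = <5, 6> and d(P_1, Q) = 1.
d_min = pm_dmin(6.0, 3.0, [(5.0, 6.0)], [1.0])
d_max = pm_dmax(6.0, 3.0, [(5.0, 6.0)], [1.0])
```

For this entry the M-tree bounds would be dmin = 3 and dmax = 9, while the hyper-ring tightens them to 4 and 7, so the request enters PR later and NN receives a smaller upper bound.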

k-NN search in PM-tree: example (k=2)

[Figure sequence: animated k-NN search around Q, labelled with the bounds dmin(I.), dmin(II.), dmax(I.), dmax(II.). Nodes are read in the order root, I., B, II., D.]

5 nodes accessed, the same nodes as accessed by the range query (Q, d(Q, O5))

Experimental Results (synthetic datasets)

  • synthetic vector datasets (4D – 60D); 100,000 tuples; 1000 clusters
  • disk page sizes: 1 KB – 4 KB; index sizes: 4.5 MB – 55 MB

Experimental Results (image database)

  • WBIIS image database; approx. 10,000 256D vectors (gray histograms)
  • disk page size: 32 KB; index sizes: 16 MB – 20 MB

References

[1] Skopal T., Pokorný J., Snášel V.: PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases. ADBIS 2004, Budapest, Hungary
[2] Skopal T.: Pivoting M-tree: A Metric Access Method for Efficient Similarity Search. DATESO 2004, Desná, Czech Republic
[3] Skopal T., Pokorný J., Krátký M., Snášel V.: Revisiting M-tree Building Principles. ADBIS 2003, Dresden, Germany, LNCS 2798, Springer
[4] Skopal T.: Metric Indexing in Information Retrieval. PhD thesis, VSB-Technical University of Ostrava. http://urtax.ms.mff.cuni.cz/~skopal/phd/thesis.pdf