Clustered Pivot Tables for I/O-optimized Similarity Search Juraj Moško, Jakub Lokoč, Tomáš Skopal Department of Software Engineering Faculty of Mathematics

SISAP 2011, Lipari

Clustered Pivot Tables for I/O-optimized Similarity Search

Juraj Moško, Jakub Lokoč, Tomáš Skopal

Department of Software Engineering

Faculty of Mathematics and Physics

Charles University in Prague


Presentation outline

Similarity search in metric spaces

Pivot tables

Clustered pivot tables
◦ Static variant
◦ Dynamic variant

Experiments


Similarity search

Suitable for unstructured data; the query object is often not in the DB

Similarity is often modeled by a metric distance

Expensive distance functions: EMD, SQFD, DTW, …

Metric indexing
◦ Based on lower-bounding
◦ If abs(d(p, q) – d(p, o)) > r, filter out object o (p = pivot, q = query object, r = query radius)
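The filtering rule follows from the triangle inequality: abs(d(p, q) – d(p, o)) never exceeds d(q, o), so if it is already larger than r, object o cannot be in the result. A minimal sketch in Python, assuming p is a pivot, q the query object and r the query radius; the name can_filter and the 1-D toy metric are illustrative only:

```python
def can_filter(d_pq: float, d_po: float, r: float) -> bool:
    """Return True if o can be discarded without computing d(q, o)."""
    # abs(d(p, q) - d(p, o)) is a lower bound on d(q, o).
    return abs(d_pq - d_po) > r


def d(x: float, y: float) -> float:
    """Toy metric on the real line (stand-in for an expensive distance)."""
    return abs(x - y)


p, q, o, r = 0.0, 10.0, 2.0, 3.0
lower_bound = abs(d(p, q) - d(p, o))
assert lower_bound <= d(q, o)  # the bound never overestimates
print(can_filter(d(p, q), d(p, o), r))
```

Only d(p, q) has to be computed at query time; d(p, o) can be precomputed, which is what makes the check cheap.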


Pivot tables

Simple yet efficient main-memory metric index

Given k static pivots Pi and a database S of n objects Oj, the pivot table stores all the distances d(Pi, Oj) in a matrix of size k x n

Pivot tables = two structures - distance matrix + data file

Cheap filtering of non-relevant objects (lower-bounding)

Non-filtered objects are refined by the original expensive distance function
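The filter-then-refine scheme of a pivot table can be sketched as follows. This is an illustrative in-memory version, not the authors' implementation: Euclidean distance stands in for an expensive metric, and the names build_pivot_table and range_query are assumptions.

```python
from typing import List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Stand-in for an expensive metric such as EMD or SQFD."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build_pivot_table(pivots: List[Point], objects: List[Point]) -> List[List[float]]:
    """k x n matrix: table[i][j] = d(P_i, O_j)."""
    return [[euclidean(p, o) for o in objects] for p in pivots]


def range_query(q: Point, r: float, pivots: List[Point],
                objects: List[Point], table: List[List[float]]) -> List[int]:
    """Return indices of objects within distance r of q."""
    q_to_pivots = [euclidean(q, p) for p in pivots]  # k distance computations
    result = []
    for j, obj in enumerate(objects):
        # Tightest lower bound on d(q, O_j) over all pivots.
        lb = max(abs(q_to_pivots[i] - table[i][j]) for i in range(len(pivots)))
        if lb > r:
            continue  # filtered cheaply, no distance computation
        if euclidean(q, obj) <= r:  # refinement with the original distance
            result.append(j)
    return result


objects = [(0, 0), (1, 0), (5, 5), (9, 9)]
pivots = [(0, 0), (10, 10)]
table = build_pivot_table(pivots, objects)
print(range_query((1, 1), 1.5, pivots, objects, table))
```

The result is always identical to a full sequential scan; the pivots only reduce how many times the original distance has to be evaluated.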


Clustered pivot tables

What if the pivot table does not fit into main memory?

Solution 1 – just slice the data file
◦ + simple to construct
◦ - sequential scan => high I/O cost

Solution 2 – reorganize and slice the data file
◦ + similar objects in one page (page = cluster) => higher probability that all objects are filtered => lower I/O cost
◦ - metric clustering is expensive


Metric clustering? M-tree!

Dynamic, persistent, balanced structure

Leaf node represents a cluster of similar objects

Many construction strategies considering quality of the M-tree hierarchy with complexity < O(n^2)
◦ Single/Multi/Hybrid-way leaf selection
◦ Slim-down algorithm
◦ Reinsertions


Static CPT

Data file = objects serialized from M-tree leaves
◦ Classic pivot table reorganizing input

Fixed page size in a paged data file

Preserve M-tree?
◦ Future re-indexing
◦ Query processing
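One way to read the static layout is sketched below: objects are written out in leaf order and then cut into fixed-size pages. The sketch assumes the leaves are already available as plain lists (no M-tree is built here) and that page size is counted in objects rather than bytes; serialize_leaves is a hypothetical helper name.

```python
from typing import List, TypeVar

T = TypeVar("T")


def serialize_leaves(leaves: List[List[T]], page_size: int) -> List[List[T]]:
    """Concatenate leaves in order, then slice into pages of page_size objects."""
    ordered = [obj for leaf in leaves for obj in leaf]
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


leaves = [["a", "b", "c"], ["d"], ["e", "f"]]
print(serialize_leaves(leaves, page_size=2))
```

Under these assumptions every page except possibly the last is full, but a page may mix objects from neighbouring leaves (here "c" and "d" share a page).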


Dynamic CPT

Data file = set of M-tree leaves
◦ Distance matrix connected to the M-tree leaves

Internal fragmentation
◦ M-tree leaves contain different numbers of data objects, utilization is not 100%

Dynamic operations do not degenerate created clusters
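The cost of internal fragmentation can be illustrated with a small calculation. The sketch assumes equally sized objects, one page per leaf in the dynamic variant, and fully packed pages in the static variant; the leaf sizes are made up for illustration and page_counts is a hypothetical helper.

```python
import math
from typing import List, Tuple


def page_counts(leaf_sizes: List[int], capacity: int) -> Tuple[int, int, float]:
    """Return (dynamic pages, static pages, utilization of dynamic pages)."""
    n = sum(leaf_sizes)
    dynamic_pages = len(leaf_sizes)         # one page per leaf
    static_pages = math.ceil(n / capacity)  # pages packed to capacity
    utilization = n / (dynamic_pages * capacity)
    return dynamic_pages, static_pages, utilization


print(page_counts([5, 6, 4, 5], capacity=10))
```

With these numbers the dynamic layout occupies twice as many pages as the packed one, which is the price paid for keeping clusters intact under insertions.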


CPT - Querying

Filtering based on lower-bounding

If all data objects from one page are filtered out, the page is not loaded from the data file into memory => I/O optimization
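The page-skipping idea can be sketched as a range query that consults the distance matrix first and touches the data file only when at least one object on a page survives filtering. Everything here is an illustrative assumption: PagedFile simulates a disk file with a read counter, the matrix is stored per page as one row of k pivot distances per object, and Euclidean distance stands in for the real metric.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class PagedFile:
    """Simulated data file that counts page reads (I/O cost)."""

    def __init__(self, pages: List[List[Point]]):
        self._pages = pages
        self.reads = 0

    def read_page(self, page_no: int) -> List[Point]:
        self.reads += 1
        return self._pages[page_no]


def build_matrix(pivots: List[Point], pages: List[List[Point]]):
    """matrix[page][slot] = distances from that object to every pivot."""
    return [[[euclidean(p, o) for p in pivots] for o in page] for page in pages]


def cpt_range_query(q: Point, r: float, pivots: List[Point], matrix,
                    data_file: PagedFile) -> List[Tuple[int, int]]:
    """Return (page, slot) pairs of all objects within distance r of q."""
    q_to_pivots = [euclidean(q, p) for p in pivots]
    hits = []
    for page_no, rows in enumerate(matrix):
        candidates = [slot for slot, row in enumerate(rows)
                      if max(abs(a - b) for a, b in zip(q_to_pivots, row)) <= r]
        if not candidates:
            continue  # whole page filtered out: no I/O
        page = data_file.read_page(page_no)
        hits += [(page_no, s) for s in candidates if euclidean(q, page[s]) <= r]
    return hits


pages = [[(0, 0), (1, 0)], [(9, 9), (10, 9)]]
pivots = [(0, 0), (10, 10)]
data_file = PagedFile(pages)
matrix = build_matrix(pivots, pages)
hits = cpt_range_query((1, 1), 1.5, pivots, matrix, data_file)
print(hits, "page reads:", data_file.reads)
```

In this toy run the second page holds only far-away objects, so it is never read; with unclustered pages a single surviving object would force the read.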


CPT - Querying problems

Problem 1 – LAESA kNN algorithm sorts DB objects according to their lower bound to the query object – not optimal for I/O cost
◦ Solution - CPT does not sort objects => objects are processed sequentially
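Sequential kNN processing with a shrinking dynamic radius might look like the sketch below. It is an assumption-laden illustration, not the authors' algorithm: pages are visited in file order, the radius is the distance of the current k-th neighbour (infinite until k candidates exist), PagedFile simulates the data file, and Euclidean distance stands in for the real metric.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class PagedFile:
    """Simulated data file that counts page reads."""

    def __init__(self, pages: List[List[Point]]):
        self._pages = pages
        self.reads = 0

    def read_page(self, page_no: int) -> List[Point]:
        self.reads += 1
        return self._pages[page_no]


def cpt_knn(q: Point, k: int, pivots: List[Point], matrix,
            data_file: PagedFile) -> List[Tuple[float, int, int]]:
    """Return the k nearest objects as sorted (distance, page, slot) tuples."""
    q_to_pivots = [euclidean(q, p) for p in pivots]
    heap: List[Tuple[float, int, int]] = []  # max-heap via negated distances
    radius = math.inf
    for page_no, rows in enumerate(matrix):  # file order, no sorting
        lbs = [max(abs(a - b) for a, b in zip(q_to_pivots, row)) for row in rows]
        if all(lb > radius for lb in lbs):
            continue  # page skipped, no I/O
        page = data_file.read_page(page_no)
        for slot, lb in enumerate(lbs):
            if lb > radius:
                continue
            dist = euclidean(q, page[slot])
            if len(heap) < k:
                heapq.heappush(heap, (-dist, page_no, slot))
            elif dist < -heap[0][0]:
                heapq.heapreplace(heap, (-dist, page_no, slot))
            if len(heap) == k:
                radius = -heap[0][0]  # distance of the current k-th neighbour
    return sorted((-nd, p, s) for nd, p, s in heap)


pages = [[(0, 0), (1, 0)], [(9, 9), (10, 9)], [(1, 2), (2, 1)]]
pivots = [(0, 0), (10, 10)]
matrix = [[[euclidean(p, o) for p in pivots] for o in page] for page in pages]
data_file = PagedFile(pages)
neighbours = cpt_knn((1, 1), 2, pivots, matrix, data_file)
print(neighbours, "page reads:", data_file.reads)
```

The first page must always be read because the radius starts at infinity; how quickly the radius shrinks afterwards determines how many later pages can be skipped.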


CPT – Querying problems

Problem 2 – in CPT the dynamic radius decreases more slowly during the kNN processing
◦ Solution - First bunch of objects is not clustered
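One possible reading of this solution is that the data file starts with a small unclustered prefix, so that sequential kNN processing sees a varied sample early and tightens the radius before reaching the clustered pages. The sketch below builds such a layout under that interpretation; the random choice of the prefix, the cluster labels (a stand-in for M-tree leaves) and the function name are all assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def layout_with_unclustered_prefix(objects: Sequence, labels: Sequence[Hashable],
                                   prefix_size: int, page_size: int,
                                   seed: int = 0) -> List[List[int]]:
    """Return pages of object ids: random prefix first, clustered pages after."""
    rng = random.Random(seed)
    ids = list(range(len(objects)))
    prefix = set(rng.sample(ids, prefix_size))
    unclustered = [i for i in ids if i in prefix]  # kept in input order
    clusters = defaultdict(list)
    for i in ids:
        if i not in prefix:
            clusters[labels[i]].append(i)
    pages = [unclustered[s:s + page_size]
             for s in range(0, len(unclustered), page_size)]
    for members in clusters.values():  # each cluster gets its own pages
        pages += [members[s:s + page_size]
                  for s in range(0, len(members), page_size)]
    return pages


objects = list("abcdefgh")
labels = [0, 0, 1, 1, 0, 1, 0, 1]
pages = layout_with_unclustered_prefix(objects, labels, prefix_size=2, page_size=2)
print(pages)
```

Every object appears exactly once, and pages after the prefix never mix clusters, so page skipping still works on the clustered part.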




Experiments (1)

2 real datasets
◦ subset of CoPhIR, subset of Corel

2 synthetic datasets
◦ Cloud, PolygonSet

We considered more M-tree variants
◦ Single/Multi-way leaf selection
◦ Reinsertions

Measured I/O cost

CPT vs. PT vs. M-tree


Experiments (2)


Experiments (3)


Conclusion

We have designed an I/O-optimized method for persistent pivot tables

Future work
◦ Thorough experiments on SSD disks
◦ Use other metric clustering techniques


Thank you
