Clustered Pivot Tables for I/O-optimized Similarity Search

Juraj Moško, Jakub Lokoč, Tomáš Skopal

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague

SISAP 2011, Lipari




Presentation outline

Similarity search in metric spaces

Pivot tables

Clustered pivot tables
◦ Static variant
◦ Dynamic variant

Experiments


Similarity search

Suitable for unstructured data; the query is often not in the DB

Similarity is often modeled by a metric distance

Expensive distance functions: EMD, SQFD, DTW, …

Metric indexing
◦ Based on lower-bounding
◦ If |d(p, q) - d(p, o)| > r, filter out object o
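
The filtering rule follows from the triangle inequality: for any pivot p, |d(p, q) - d(p, o)| never exceeds d(q, o), so a lower bound above r proves that o lies outside the query range. A minimal Python sketch, assuming Euclidean distance as a cheap stand-in for an expensive metric (the function names are illustrative, not from the talk):

```python
import math

def lower_bound(d_pq, d_po):
    """Lower bound on d(q, o) derived from the distances of q and o to a pivot p."""
    return abs(d_pq - d_po)

def can_filter(d_pq, d_po, r):
    """True if o can be discarded for a range query (q, r) without computing d(q, o)."""
    return lower_bound(d_pq, d_po) > r

# Assumed toy data: 2-D points under Euclidean distance.
p, q, o = (0.0, 0.0), (3.0, 4.0), (0.0, 1.0)
d_pq = math.dist(p, q)   # computed once per query
d_po = math.dist(p, o)   # would be precomputed at indexing time

filtered = can_filter(d_pq, d_po, r=2.0)
```

Only d(p, q) has to be computed at query time; d(p, o) comes from the index, so the check itself costs one subtraction per pivot.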


Pivot tables

Simple yet efficient main-memory metric index

Given k static pivots P_i and a database S of n objects O_j, the pivot table stores all distances d(P_i, O_j) in a matrix of size k × n

Pivot table = two structures: distance matrix + data file

Cheap filtering of non-relevant objects (lower-bounding)

Non-filtered objects are refined by the original expensive distance function
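
The filter-and-refine scheme described on this slide can be sketched as follows. This is an illustrative in-memory version, assuming Euclidean distance and hand-picked pivots; `build_pivot_table` and `range_query` are hypothetical names, not the authors' implementation:

```python
import math

def build_pivot_table(pivots, objects, dist):
    """k x n distance matrix: table[i][j] = d(P_i, O_j)."""
    return [[dist(p, o) for o in objects] for p in pivots]

def range_query(q, r, pivots, objects, table, dist):
    """Return (indices of objects within r of q, number of refined objects)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    result, refined = [], 0
    for j, o in enumerate(objects):
        # Tightest lower bound on d(q, O_j) over all pivots.
        lb = max(abs(q_to_pivots[i] - table[i][j]) for i in range(len(pivots)))
        if lb > r:
            continue              # filtered out cheaply
        refined += 1              # expensive distance is needed
        if dist(q, o) <= r:
            result.append(j)
    return result, refined

# Assumed toy data.
objects = [(0, 0), (1, 0), (5, 5), (10, 10)]
pivots = [(0, 0), (10, 10)]
table = build_pivot_table(pivots, objects, math.dist)
hits, refined = range_query((1, 1), 1.5, pivots, objects, table, math.dist)
```

In this toy run two of the four objects are filtered by the matrix alone, so the distance function is evaluated only for the remaining two.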


Clustered pivot tables

What if the pivot table does not fit into main memory?

Solution 1: just slice the data file
◦ + simple to construct
◦ - sequential scan => high I/O cost

Solution 2: reorganize and slice the data file
◦ + similar objects in one page (page = cluster) => higher probability that all objects of a page are filtered => lower I/O cost
◦ - metric clustering is expensive
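
A toy illustration of why reorganizing the data file helps. It assumes 1-D values with d(x, y) = |x - y|, uses sorting as a stand-in for a real metric clustering, and uses an exact range test as a stand-in for "not filtered by the pivots"; none of these choices come from the talk:

```python
def pages_to_load(data_file, page_size, is_candidate):
    """Ids of the pages that hold at least one non-filtered object."""
    return {j // page_size for j, obj in enumerate(data_file) if is_candidate(obj)}

# Assumed toy data: two natural groups, interleaved in arrival order.
arrival_order = [1, 50, 2, 51, 3, 52, 4, 53]
reorganized = sorted(arrival_order)     # stand-in for metric clustering

q, r = 2.5, 1.5

def survives(x):
    """Stand-in for an object that the pivot filter could not discard."""
    return abs(x - q) <= r

sliced = pages_to_load(arrival_order, 2, survives)     # Solution 1
clustered = pages_to_load(reorganized, 2, survives)    # Solution 2
```

With the interleaved layout every page holds one candidate, so all four pages are read; after reorganization the same candidates sit in two pages.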


Metric clustering? M-tree!

Dynamic, persistent, balanced structure

Leaf node represents a cluster of similar objects

Many construction strategies that improve the quality of the M-tree hierarchy with complexity < O(n^2)
◦ Single/Multi/Hybrid-way leaf selection
◦ Slim-down algorithm
◦ Reinsertions


Static CPT

Data file = objects serialized from M-tree leaves
◦ Classic pivot table reorganizing input

Fixed page size in a paged data file

Preserve M-tree?
◦ Future re-indexing
◦ Query processing
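
The static layout can be sketched as a two-step serialization: concatenate the objects leaf by leaf, then cut the sequence into fixed-size pages. The sketch assumes the M-tree leaves are already available as plain Python lists (building the M-tree is out of scope here), and `serialize_leaves` is a hypothetical helper name:

```python
def serialize_leaves(leaves, page_capacity):
    """Concatenate objects in leaf order and cut them into fixed-size pages."""
    ordered = [obj for leaf in leaves for obj in leaf]
    return [ordered[i:i + page_capacity]
            for i in range(0, len(ordered), page_capacity)]

# Assumed toy input: three leaves of different sizes, objects given by id.
leaves = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
pages = serialize_leaves(leaves, page_capacity=4)
```

Every page except possibly the last is full, at the price that page boundaries no longer coincide exactly with leaf boundaries.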


Dynamic CPT

Data file = set of M-tree leaves
◦ Distance matrix connected to the M-tree leaves

Internal fragmentation
◦ M-tree leaves contain different numbers of data objects, so utilization is not 100%

Dynamic operations do not degrade the created clusters
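
The cost of internal fragmentation can be quantified with a small calculation. The sketch below assumes leaves are given as plain lists and that one leaf occupies one page of fixed capacity; the helper names are illustrative:

```python
import math

def dynamic_cpt_utilization(leaves, leaf_capacity):
    """Fraction of the allocated slots that actually hold an object."""
    stored = sum(len(leaf) for leaf in leaves)
    return stored / (len(leaves) * leaf_capacity)

def static_cpt_page_count(leaves, page_capacity):
    """Pages needed when the same objects are packed densely (static variant)."""
    stored = sum(len(leaf) for leaf in leaves)
    return math.ceil(stored / page_capacity)

# Assumed toy input: four leaves with a capacity of 4 objects each.
leaves = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11]]
utilization = dynamic_cpt_utilization(leaves, 4)
dynamic_pages = len(leaves)
static_pages = static_cpt_page_count(leaves, 4)
```

Here the dynamic variant occupies four pages at 11/16 utilization, while dense packing of the same eleven objects needs three pages.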


CPT - Querying

Filtering based on lower-bounding

If all data objects from one page are filtered out, the page is not loaded from the data file into memory => I/O optimization
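
A minimal sketch of a page-aware range query, assuming the distance matrix is kept per page and the data file is reached through a `load_page` callback that simulates disk access. All names and the toy data are assumptions made for illustration:

```python
import math

def cpt_range_query(q, r, pivots, page_tables, load_page, dist):
    """page_tables[p][j] holds the pivot distances of the j-th object on page p."""
    q_to_pivots = [dist(q, p) for p in pivots]
    results, pages_loaded = [], 0
    for page_id, rows in enumerate(page_tables):
        survivors = [
            j for j, row in enumerate(rows)
            if max(abs(dq - dp) for dq, dp in zip(q_to_pivots, row)) <= r
        ]
        if not survivors:
            continue                    # whole page filtered: no I/O
        page = load_page(page_id)       # the only place the data file is touched
        pages_loaded += 1
        results.extend(page[j] for j in survivors if dist(q, page[j]) <= r)
    return results, pages_loaded

# Assumed toy data: three pages, each holding one cluster of 2-D points.
pages = [[(0, 0), (1, 0)], [(5, 5), (6, 5)], [(10, 10), (9, 10)]]
pivots = [(0, 0), (10, 10)]
page_tables = [[[math.dist(p, o) for p in pivots] for o in page] for page in pages]

found, pages_loaded = cpt_range_query(
    (1, 1), 1.5, pivots, page_tables, lambda i: pages[i], math.dist
)
```

In this run only the first page is read; the other two are skipped because every object on them is filtered by the distance matrix.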


CPT - Querying problems

Problem 1: the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object, which is not optimal for I/O cost
◦ Solution: CPT does not sort objects => objects are processed sequentially
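
A sketch of sequential kNN processing with a dynamic radius, assuming an in-memory distance matrix and Euclidean distance. Objects are visited in data-file order rather than sorted by lower bound; the function name and the toy data are illustrative:

```python
import heapq
import math

def knn_sequential(q, k, pivots, objects, table, dist):
    """kNN in data-file order; table[i][j] = d(P_i, O_j)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    heap = []                 # max-heap of the k best, stored as (-distance, index)
    radius = math.inf         # dynamic radius: distance of the current k-th neighbor
    refined = 0
    for j, o in enumerate(objects):
        lb = max(abs(q_to_pivots[i] - table[i][j]) for i in range(len(pivots)))
        if lb > radius:
            continue          # cannot enter the current result
        d = dist(q, o)
        refined += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, j))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, j))
        if len(heap) == k:
            radius = -heap[0][0]
    return sorted((-nd, j) for nd, j in heap), refined

# Assumed toy data.
objects = [(0, 0), (1, 0), (5, 5), (10, 10), (1, 2)]
pivots = [(0, 0), (10, 10)]
table = [[math.dist(p, o) for o in objects] for p in pivots]
neighbors, refined = knn_sequential((1, 1), 2, pivots, objects, table, math.dist)
```

Because the scan follows the file order, each page is visited at most once; the trade-off is that the radius shrinks only as fast as good candidates happen to appear in that order.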


CPT - Querying problems

Problem 2: in CPT the dynamic radius decreases more slowly during kNN processing
◦ Solution: the first bunch of objects is not clustered
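
One possible reading of this solution is a data-file layout whose head is a non-clustered sample, so that a sequential kNN scan meets objects from many clusters early and tightens the radius sooner. The slide does not say how the first bunch is chosen; the random sample, the seed, and the function name below are assumptions:

```python
import random

def layout_with_unclustered_head(clustered_ids, head_size, seed=0):
    """Move a random sample to the front; keep the rest in clustered order."""
    rng = random.Random(seed)
    head = rng.sample(clustered_ids, head_size)   # assumed selection strategy
    chosen = set(head)
    tail = [i for i in clustered_ids if i not in chosen]
    return head + tail

# Assumed toy input: object ids already in clustered (leaf) order.
clustered_ids = list(range(12))
layout = layout_with_unclustered_head(clustered_ids, head_size=3)
```

The tail keeps the clustered order, so page skipping still applies to everything after the head.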



Experiments (1)

2 real datasets
◦ subset of CoPhIR, subset of Corel

2 synthetic datasets
◦ Cloud, PolygonSet

We considered several M-tree variants
◦ Single/Multi-way leaf selection
◦ Reinsertions

Measured I/O cost

CPT vs. PT vs. M-tree


Experiments (2)


Experiments (3)


Conclusion

We have designed an I/O-optimized method for persistent pivot tables

Future work
◦ Thorough experiments on SSDs
◦ Use other metric clustering techniques


Thank you