
NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms

Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research)

Chen Li (UC Irvine)

Supported by NSF CAREER No. IIS-0238586

EDBT 2004


NN (nearest-neighbor) search

KNN: find the k nearest neighbors of a query object q.

NN-join: for each object in the 1st dataset (D1), find the k nearest neighbors in the 2nd dataset (D2).
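As a point of reference, the two query types can be written down as a brute-force sketch. It assumes objects are points stored as tuples and that the distance is Euclidean; the function names `knn` and `knn_join` are illustrative only and do not come from the slides.

```python
import heapq
import math

def knn(q, data, k):
    """Return the k objects of `data` closest to q, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: math.dist(q, p))

def knn_join(d1, d2, k):
    """For each object in d1, return its k nearest neighbors in d2."""
    return {o1: knn(o1, d2, k) for o1 in d1}

d1 = [(0.0, 0.0), (5.0, 5.0)]
d2 = [(1.0, 0.0), (0.0, 2.0), (4.0, 4.0), (9.0, 9.0)]
print(knn((0.0, 0.0), d2, 2))
print(knn_join(d1, d2, 1))
```

This scans every object for every query, which is exactly the cost that tree-based algorithms try to avoid.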


Example: image search

Images represented as features (color histogram, texture moments, etc.)
Similarity search using these features: “Find the 10 most similar images for the query image”

Other applications:
  Web-page search: “Find the 100 most similar pages for a given page”
  GIS: “Find the 5 closest cities to Irvine”
  Data cleaning


NN Algorithms

Distance measurement:
  For objects that are points, distance is well defined
    Usually Euclidean
    Other distances possible
  For arbitrary-shaped objects, assume we have a distance function between them

Most algorithms assume a high-dimensional tree structure for the datasets (e.g., R-tree).


Search process (1-NN for example)

Most algorithms traverse the structure (e.g., R-tree) top down, and follow a branch-and-bound approach
  Keep a priority queue of nodes (“MBRs”) to be visited
  Sorted based on the “minimum distance” between q and each node

Improvement:
  Use MINDIST and MINMAXDIST
  Reduce the queue size
  Avoid unnecessary disk I/Os to access MBRs
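The search process can be sketched as a best-first traversal. This is a minimal illustration under stated assumptions, not the algorithm used in the talk: nodes are hypothetical tuples `(mbr, is_leaf, items)`, an MBR is a pair `(lo, hi)` of corner points, distances are Euclidean, and MINMAXDIST is left out for brevity.

```python
import heapq
import math

def mindist(q, mbr):
    """Smallest possible distance from point q to the box mbr = (lo, hi)."""
    lo, hi = mbr
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def nn_search(q, root):
    """Best-first 1-NN: always expand the queued node with the smallest MINDIST."""
    best, best_d = None, math.inf
    tie = 0                                   # unique tie-breaker for heap entries
    queue = [(mindist(q, root[0]), tie, root)]
    while queue:
        d, _, (_mbr, is_leaf, items) = heapq.heappop(queue)
        if d >= best_d:
            break                             # bound: no queued node can hold a closer object
        for item in items:
            if is_leaf:
                dist = math.dist(q, item)
                if dist < best_d:
                    best, best_d = item, dist
            else:
                tie += 1
                heapq.heappush(queue, (mindist(q, item[0]), tie, item))
    return best, best_d

leaf_a = (((0.0, 0.0), (2.0, 2.0)), True, [(0.0, 0.0), (2.0, 2.0), (1.0, 2.0)])
leaf_b = (((5.0, 5.0), (7.0, 7.0)), True, [(5.0, 5.0), (7.0, 7.0)])
root = (((0.0, 0.0), (7.0, 7.0)), False, [leaf_a, leaf_b])
print(nn_search((4.0, 4.0), root))
```

Every child of an expanded node enters the queue here, which is why the queue can grow large on high-dimensional data.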


Problem

Queue size may be large:
  60,000 objects, 32-d (image) vectors, 50 NNs
  Max queue size: 15K entries
  Avg queue size: half (7.5K entries)
  If queue can’t fit in memory, more disk I/Os!

Problem worse for k-NN joins
  E.g., 1500 x 1500 join:
    Max queue size: 1.7M entries: >= 1GB memory!
    750 seconds to run
  Couldn’t scale up to 2000 objects!
    Disk thrashing


Our Solution: Nearest-Neighbor Histogram (NNH)

Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work


NNH: Nearest-Neighbor Histograms

Pivots: p1, p2, …, pm
  m: # of pivots
  They are not part of the database

For each pivot, distances of its nearest neighbors: r1, r2, …


Structure

Nearest Neighbor Vectors: $v(p) = \langle r_1, \ldots, r_T \rangle$
  each $r_i$ is the distance of p’s i-th NN
  T: length of each vector

Nearest Neighbor Histogram
  Collection of m pivots with their NN vectors
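One way to picture the structure is as a mapping from each pivot to its NN vector. The sketch below is only an illustration: the name `build_nnh` is hypothetical, points are tuples under Euclidean distance, and a brute-force scan stands in for the KNN queries that would normally fill the vectors.

```python
import math

def build_nnh(pivots, data, t):
    """Map each pivot to its NN vector: the sorted distances to its t nearest objects."""
    return {p: sorted(math.dist(p, o) for o in data)[:t] for p in pivots}

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
pivots = [(0.5, 0.5), (5.5, 5.5)]     # pivots need not be objects of the dataset
nnh = build_nnh(pivots, data, t=3)
for pivot, vector in nnh.items():
    print(pivot, vector)
```

The whole histogram holds m * T distances, so its size is independent of the number of objects in the dataset.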


Outline

Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work


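Estimate NN distance for query object

NNH does not give exact NN information for an object
But we can estimate an upper bound $\delta_q^{est}$ for the k-NN distance of q

Triangle inequality:

$$\delta_q \le \Delta(q, p_i) + H(p_i, k), \quad 1 \le i \le m$$

  $\delta_q$: distance of the k-th NN of q
  $\Delta(q, p_i)$: distance between q and pivot $p_i$
  $H(p_i, k)$: distance of the k-th NN of $p_i$, taken from its NN vector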


Estimate NN for query object (cont’d)

Apply the triangle inequality to all pivots
Upper bound estimate of NN distance of q:

$$\delta_q^{est} = \min_{1 \le i \le m} \left( \Delta(q, p_i) + H(p_i, k) \right)$$

Complexity: O(m)
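The estimate is a single pass over the pivots, which is where the O(m) cost comes from. A minimal sketch, assuming Euclidean points and a histogram stored as a dict from pivot to NN vector; all function names here are illustrative, and the brute-force `true_knn_distance` exists only to check the bound.

```python
import math

def build_nnh(pivots, data, t):
    """Brute-force NNH: pivot -> sorted distances of its t nearest objects."""
    return {p: sorted(math.dist(p, o) for o in data)[:t] for p in pivots}

def estimate_knn_distance(q, nnh, k):
    """Upper bound on q's k-NN distance: min over pivots of dist(q, p_i) + H(p_i, k)."""
    return min(math.dist(q, p) + vec[k - 1] for p, vec in nnh.items())

def true_knn_distance(q, data, k):
    """Exact k-NN distance by scanning all objects (for comparison only)."""
    return sorted(math.dist(q, o) for o in data)[k - 1]

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
nnh = build_nnh([(0.5, 0.5), (5.5, 5.5)], data, t=3)
q = (1.0, 1.0)
print(estimate_knn_distance(q, nnh, k=2), true_knn_distance(q, data, k=2))
```

The bound holds because the k nearest neighbors of a pivot all lie within dist(q, p_i) + H(p_i, k) of q, so at least k objects are that close to q.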


Utilizing estimates in NN search

More pruning: prune an mbr if:

$$\delta_q^{est} < \mathrm{MINDIST}(q, mbr)$$
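To see how the estimate shrinks the queue, here is a hedged sketch of a best-first k-NN search that skips any child node whose MINDIST exceeds the estimate. The node layout `(mbr, is_leaf, items)`, the helper `leaf`, and the toy tree are all assumptions made for illustration; the value of `est` is computed from an assumed pivot at (1, 1) whose 2nd-NN distance is 1.0.

```python
import heapq
import math

def mindist(q, mbr):
    """Smallest possible distance from point q to the box mbr = (lo, hi)."""
    lo, hi = mbr
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def knn_search(q, root, k, est=math.inf):
    """Best-first k-NN; returns (sorted k-NN distances, number of nodes enqueued)."""
    nearest = []                          # max-heap of negated distances, size <= k
    queue = [(mindist(q, root[0]), 0, root)]
    enqueued = 1
    while queue:
        d, _, (_mbr, is_leaf, items) = heapq.heappop(queue)
        if len(nearest) == k and d > -nearest[0]:
            break                         # no remaining node can improve the result
        for item in items:
            if is_leaf:
                dist = math.dist(q, item)
                if len(nearest) < k:
                    heapq.heappush(nearest, -dist)
                elif dist < -nearest[0]:
                    heapq.heapreplace(nearest, -dist)
            else:
                md = mindist(q, item[0])
                if md > est:
                    continue              # pruned: est < MINDIST(q, mbr)
                heapq.heappush(queue, (md, enqueued, item))
                enqueued += 1
    return sorted(-x for x in nearest), enqueued

def leaf(points):
    """Hypothetical helper: wrap points in a leaf node with a tight MBR."""
    lo = tuple(min(c) for c in zip(*points))
    hi = tuple(max(c) for c in zip(*points))
    return ((lo, hi), True, points)

leaves = [leaf([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]),
          leaf([(10.0, 0.0), (11.0, 1.0)]),
          leaf([(0.0, 10.0), (1.0, 11.0)]),
          leaf([(10.0, 10.0), (11.0, 11.0)])]
root = (((0.0, 0.0), (11.0, 11.0)), False, leaves)

q, k = (0.5, 0.5), 2
est = math.dist(q, (1.0, 1.0)) + 1.0      # assumed pivot (1, 1) with H(p, 2) = 1.0
plain = knn_search(q, root, k)
pruned = knn_search(q, root, k, est)
print(plain, pruned)
```

Both calls return the same neighbors; the second one simply never enqueues the three far leaves, which is the memory saving the slide is after.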


Utilizing estimates in NN join

K-NN join: for each object o1 in D1, find its k-nearest neighbors in D2.
Traverse two trees top down; keep a queue of pairs


Utilizing estimates in NN join (cont’d)

Construct NNH for D2.
For each object o1 in D1, keep its estimated NN radius $\delta_{o_1}^{est}$ using NNH of D2.
Similar to k-NN query, ignore mbr for o1 if:

$$\delta_{o_1}^{est} < \mathrm{MINDIST}(o_1, mbr)$$


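More powerful: prune MBR pairs

$$\delta_{mbr_1}^{est} = \min_{p_i \in H_2} \left( \mathrm{MAXDIST}(mbr_1, p_i) + H_2(p_i, k) \right)$$

where $H_2$ is the NNH of D2, because for every pivot $p_i$:

$$\forall o_1 \in mbr_1: \quad \delta_{o_1} \le \Delta(o_1, p_i) + H_2(p_i, k)$$

$$\delta_{mbr_1} \le \mathrm{MAXDIST}(mbr_1, p_i) + H_2(p_i, k)$$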


Prune MBR pairs (cont’d)

Prune this MBR pair if:

$$\delta_{mbr_1}^{est} < \mathrm{MINDIST}(mbr_1, mbr_2)$$
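The pair-pruning test can be sketched in a few lines. The slides do not define MAXDIST or the box-to-box MINDIST, so the versions below are the usual formulas for axis-aligned rectangles under Euclidean distance and should be read as assumptions; `nnh2` is a hypothetical dict from each pivot of D2's histogram to its NN vector.

```python
import math

def maxdist(mbr, p):
    """Largest possible distance from point p to any point inside the box."""
    lo, hi = mbr
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, lo, hi)))

def mindist_mbr(a, b):
    """Smallest possible distance between a point of box a and a point of box b."""
    (alo, ahi), (blo, bhi) = a, b
    return math.sqrt(sum(max(bl - ah, al - bh, 0.0) ** 2
                         for al, ah, bl, bh in zip(alo, ahi, blo, bhi)))

def mbr_estimate(mbr1, nnh2, k):
    """Upper bound on the k-NN distance (in D2) of every object inside mbr1."""
    return min(maxdist(mbr1, p) + vec[k - 1] for p, vec in nnh2.items())

def prune_pair(mbr1, mbr2, nnh2, k):
    """True if no object in mbr2 can be a k-NN of any object in mbr1."""
    return mbr_estimate(mbr1, nnh2, k) < mindist_mbr(mbr1, mbr2)

nnh2 = {(0.0, 0.0): [0.5, 1.0]}                 # one pivot with NN vector (r1, r2)
mbr1 = ((0.0, 0.0), (1.0, 1.0))
far = ((10.0, 10.0), (11.0, 11.0))
near = ((2.0, 0.0), (3.0, 1.0))
print(prune_pair(mbr1, far, nnh2, k=2), prune_pair(mbr1, near, nnh2, k=2))
```

One estimate per node of the first tree covers all objects below it, so a pruned pair removes a whole subtree-to-subtree comparison from the queue.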


Outline

Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work


NNH Construction

If we have selected the m pivots:
  Just run KNN queries for them to construct NNH
  Time is O(m) (one KNN query per pivot)
  Offline

Important: selecting pivots
  Size-Constraint Construction
  Error-Constraint Construction (see paper)


Size-constraint NNH construction

# of pivots “m” determines
  Storage size
  Initial construction cost
  Incremental-maintenance cost

Choose m “best” pivots


Size-constraint NNH construction

Given m (# of pivots), assume:
  query objects are from the database D
  $H(p_i, k)$ doesn’t vary too much

Upper bound from the triangle inequality:

$$\delta_q \le \Delta(q, p_i) + H(p_i, k), \quad 1 \le i \le m$$

Goal: Find pivots p1, p2, …, pm to minimize object distances to the pivots:

$$\Delta(q, p_i), \quad q \in D, \; i = 1, \ldots, m$$

Clustering problem:
  Many algorithms available
  Use K-means for its simplicity and efficiency
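Pivot selection by clustering can be sketched with a plain Lloyd-style k-means loop. This is not the implementation from the talk: the function name `kmeans_pivots`, the evenly spaced initialization, and the fixed iteration cap are simplifying assumptions chosen to keep the example deterministic.

```python
import math

def kmeans_pivots(data, m, iters=20):
    """Pick m pivots as k-means centroids of the data (assumes m <= len(data))."""
    step = len(data) // m
    pivots = data[::step][:m]                 # simple deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(m)]
        for o in data:                        # assign each object to its closest pivot
            nearest = min(range(m), key=lambda i: math.dist(o, pivots[i]))
            clusters[nearest].append(o)
        updated = [tuple(sum(c) / len(members) for c in zip(*members))
                   if members else pivots[i]  # keep the old pivot for an empty cluster
                   for i, members in enumerate(clusters)]
        if updated == pivots:
            break                             # assignments are stable
        pivots = updated
    return pivots

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
pivots = kmeans_pivots(data, 2)
print(pivots)
```

Centroids are generally not objects of the dataset, which is consistent with pivots not being part of the database.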


Incremental Maintenance

How to update the NNH when inserting or deleting objects?
Need to “shift” each vector:
  Associate a valid length Ei to each NN vector.
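The slide does not spell out the update rules, so the sketch below is one plausible reading rather than the authors' procedure. It assumes Euclidean points; the class name `NNVector` and its methods are hypothetical. An insertion shifts entries to the right and drops the last one if the vector is full; a deletion shifts entries to the left, and since the next-farther neighbor is unknown, the valid length shrinks by one.

```python
import bisect
import math

class NNVector:
    """NN vector of one pivot; the first `valid_length` distances are known to be exact."""

    def __init__(self, pivot, dists):
        self.pivot = pivot
        self.dists = sorted(dists)            # r_1 <= r_2 <= ... <= r_T
        self.capacity = len(self.dists)       # T

    @property
    def valid_length(self):                   # E_i on the slide
        return len(self.dists)

    def insert(self, obj):
        d = math.dist(self.pivot, obj)
        if not self.dists or d >= self.dists[-1]:
            return                            # beyond the known range: nothing certain changes
        bisect.insort(self.dists, d)          # shift farther entries to the right
        if len(self.dists) > self.capacity:
            self.dists.pop()                  # keep at most T entries

    def delete(self, obj):
        d = math.dist(self.pivot, obj)
        i = bisect.bisect_left(self.dists, d)
        if i < len(self.dists) and self.dists[i] == d:
            del self.dists[i]                 # shift left; valid length drops by one

v = NNVector((0.0, 0.0), [1.0, 2.0, 3.0])
v.insert((0.0, 1.5))
v.delete((1.0, 0.0))
print(v.dists, v.valid_length)
```

Under this reading, a vector whose valid length falls below the k needed by queries would have to be rebuilt with a fresh KNN query for its pivot.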


Outline

Main idea
Utilizing NNH in a search (KNN, join)
Construction and incremental maintenance
Experiments
Related work


Experiments

Datasets:
  Corel image database
    Contains 60,000 images
    Each image represented by a 32-dimensional float vector
  Time-series data from AT&T
  Similar trends. Report results for Corel data set

Test bed:
  PC: 1.5 GHz Athlon, 512 MB memory, 80 GB HD, Windows 2000.
  GNU C++ in CYGWIN


Goal

Is the pruning using NNH estimates powerful?
  KNN queries
  NN-join queries

Is it “cheap” to have such a structure?
  Storage
  Initial construction
  Incremental maintenance


Improvement in k-NN search

Ran k-means algorithm to generate 400 pivots for 60K objects, and constructed NNH
Performed K-NN queries on 100 randomly selected query objects.
Queue size to measure memory usage.
  Max queue size
  Average queue size


Reduced Memory Requirement


Reduced running time


Effects of different # of pivots


Join: Reduced Memory Requirement


Join: Reduced running time


Join: Running time for different data sizes


Cost/Benefit of NNH

Pivot # (m)                           10    50   100   150   200   250   300   350   400
Construction time (sec)              0.7  3.59   6.6   9.4  11.5  13.7  15.7  17.8  20.4
Storage space (kB)                     2    10    20    30    40    50    60    70    80
Incremental maintenance time (ms)     ~0    ~0    ~0    ~0    ~0    ~0    ~0    ~0    ~0
Improved q-size (kNN) (%)             40    30    28    24    24    24    23    20    18
Improved q-size (join) (%)            45    34    28    26    26    25    24    24    22

“~0” means almost zero.
For 60,000 float vectors (32-d).


Conclusion

NNH: efficient, effective approach to improving NN-search performance.
Can be easily embedded into current implementation of NN algorithms.
Can be efficiently constructed and maintained.
Offers substantial performance advantages.


Related work

Summary histograms
  E.g., [Jagadish et al VLDB98], [Mattias et al VLDB00]
  Objective: approximate frequency values

NN Search algorithms
  Many algorithms developed
  Many of them can benefit from NNH

Algorithms based on “pivots/foci/anchors”
  E.g., Omni [Filho et al, ICDE01], Vantage objects [Vleugels et al VIIS99], M-trees [Ciaccia et al VLDB97]
  Choose pivots far from each other (to represent the “intrinsic dimensionality”)
  NNH: pivots depend on how clustered the objects are
  Experiments show the differences


Work conducted in the Flamingo Project on Data Cleansing at UC Irvine