Approximate Nearest Neighbor Queries with a Tiny Index
Wei Wang, University of New South Wales
NICTA Machine Learning Research Group Seminar, 26 June 2014


Page 1: Approximate Nearest Neighbor Queries with a Tiny Index


Page 2: Approximate Nearest Neighbor Queries with a Tiny Index


Outline
- Overview of Our Research
- SRS: c-Approximate Nearest Neighbor with a Tiny Index [PVLDB 2015]
- Conclusions


Page 3: Approximate Nearest Neighbor Queries with a Tiny Index


Research Projects
- Similarity query processing
- Keyword search on (semi-)structured data
- Graphs
- Succinct data structures


Page 4: Approximate Nearest Neighbor Queries with a Tiny Index


NN and c-ANN Queries

Definitions
- A set of points D = ∪_{i=1..n} {O_i} in d-dimensional Euclidean space, where d is large (e.g., hundreds).
- NN query: given a query point q, find the closest point O* in D.
- Relaxed version, the c-ANN query (aka the (1+ε)-ANN query): return a point whose distance to q is at most c · Dist(O*, q); the answer may be required to be correct only with at least constant probability.

Example from the slide figure: for D = {a, b, x}, the NN is a, and x is an acceptable c-ANN answer. A minimal code sketch of both notions follows.
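Below is a minimal linear-scan sketch (not from the paper; the names and toy data are illustrative) of an exact NN query and of checking whether a candidate qualifies as a c-ANN answer:

```python
import numpy as np

def nearest_neighbor(D, q):
    """Return (index, distance) of the exact NN of q in the n x d point set D."""
    dists = np.linalg.norm(D - q, axis=1)
    i = int(np.argmin(dists))
    return i, dists[i]

def is_c_ann(D, q, candidate, c):
    """True iff Dist(candidate, q) <= c * Dist(O*, q)."""
    _, nn_dist = nearest_neighbor(D, q)
    return np.linalg.norm(candidate - q) <= c * nn_dist

# Toy example mirroring the slide: D = {a, b, x}.
D = np.array([[0.0, 1.0], [5.0, 5.0], [0.0, 1.9]])   # a, b, x
q = np.zeros(2)
print(nearest_neighbor(D, q))        # -> (0, 1.0): the NN is a
print(is_c_ann(D, q, D[2], c=2.0))   # -> True: x is an acceptable 2-ANN answer
```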


Page 5: Approximate Nearest Neighbor Queries with a Tiny Index


Applications and Challenges

Applications
- Feature vectors: data mining, multimedia databases
- A fundamental geometric problem: the "post-office problem"
- Quantization in coding/compression
- …

Challenges
- Curse of dimensionality / concentration of measure: it is hard to find algorithms that are sub-linear in n and polynomial in d.
- Large data size: 1 KB for a single point with 256 dimensions.


Page 6: Approximate Nearest Neighbor Queries with a Tiny Index


Existing Solutions

Exact NN:
- O(d^5 · log n) query time, O(n^(2d+δ)) space
- O(d · n^(1−ε(d))) query time, O(dn) space
- Linear scan: O(dn/B) I/Os, O(dn) space

(1+ε)-ANN:
- O(log n + 1/ε^((d−1)/2)) query time, O(n · log(1/ε)) space
- Probabilistic tests remove the exponential dependency on d
- Fast JLT: O(d · log d + ε^(−3) · log² n) query time, O(n^max(2, ε^(−2))) space
- LSH-based: Õ(d · n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1)

LSH is the best approach using sub-quadratic space; linear scan is (practically) the best approach using linear space and time.


Page 7: Approximate Nearest Neighbor Queries with a Tiny Index


Approximate NN for Multimedia Retrieval

- Cover-tree
- Spill-tree
- Reduction to NN search under Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)


Page 8: Approximate Nearest Neighbor Queries with a Tiny Index


Locality Sensitive Hashing (LSH)

Equality search (for intuition):
- Index: store o in bucket h(o).
- Query: retrieve every o in bucket h(q) and verify whether o = q.

LSH:
- For every h in an LSH family, Pr[h(q) = h(o)] ∝ 1/Dist(q, o) (loosely speaking); h :: R^d → Z; technically, the collision probability also depends on a radius r.
- "Near-by" points (blue in the slide figure) have a higher chance of colliding with q than "far-away" points (red).

LSH is the best approach using sub-quadratic space. A sketch of one standard Euclidean LSH family follows.
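As an illustration, here is a minimal sketch of one standard Euclidean LSH family (the p-stable construction with h(o) = ⌊(a·o + b)/w⌋); the slides do not commit to a particular family, so the hash form and the parameter values below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_lsh(d, w):
    """One random hash function h: R^d -> Z; nearby points collide more often."""
    a = rng.standard_normal(d)       # 2-stable (Gaussian) projection direction
    b = rng.uniform(0.0, w)          # random offset
    return lambda o: int(np.floor((a @ o + b) / w))

# Empirical check: a near point collides with q far more often than a far point.
d, w, trials = 64, 4.0, 2000
q = rng.standard_normal(d)
near = q + 0.1 * rng.standard_normal(d)
far = q + 5.0 * rng.standard_normal(d)

def collision_rate(o):
    hits = 0
    for _ in range(trials):
        h = make_lsh(d, w)           # a fresh random hash function per trial
        hits += (h(q) == h(o))
    return hits / trials

print("near:", collision_rate(near), "far:", collision_rate(far))
```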


Page 9: Approximate Nearest Neighbor Queries with a Tiny Index


LSH: Indexing and Query Processing

Index:
- For a fixed r: sig(o) = ⟨h1(o), h2(o), …, hk(o)⟩; store o in bucket sig(o). Composite signatures reduce the query cost.
- Iteratively increase r.

Query:
- Search with a fixed r: retrieve and "verify" the points in bucket sig(q).
- Repeat this L times (boosting), giving constant success probability.
- Galloping search finds the first good r, but incurs additional cost and gives only a c² quality guarantee.

A code sketch of the composite-signature tables follows.
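A sketch of the composite-signature scheme just described; the hash family, the parameter choices, and the in-memory dictionary buckets are illustrative assumptions rather than the exact external-memory layout used by LSH implementations:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)

def build_tables(D, k, L, w):
    """L hash tables, each keyed by a k-wise composite signature sig(o) = (h1(o), ..., hk(o))."""
    funcs, tables = [], []
    for _ in range(L):
        A = rng.standard_normal((k, D.shape[1]))
        b = rng.uniform(0.0, w, size=k)
        sig = lambda o, A=A, b=b: tuple(np.floor((A @ o + b) / w).astype(int))
        table = defaultdict(list)
        for i, o in enumerate(D):
            table[sig(o)].append(i)
        funcs.append(sig)
        tables.append(table)
    return funcs, tables

def query(D, q, funcs, tables, radius):
    """Collect candidates from the L buckets sig(q) and verify them by true distance."""
    cand = {i for sig, t in zip(funcs, tables) for i in t.get(sig(q), [])}
    return [i for i in cand if np.linalg.norm(D[i] - q) <= radius]

# Toy usage: a point close to D[0] should retrieve index 0 for a suitable radius.
D = rng.standard_normal((1000, 32))
funcs, tables = build_tables(D, k=8, L=10, w=4.0)
print(query(D, D[0] + 0.01, funcs, tables, radius=1.0))
```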


Page 10: Approximate Nearest Neighbor Queries with a Tiny Index


Locality Sensitive Hashing (LSH)

Standard LSH: answers c²-ANN queries via binary search over a series of (c^i, c^(i+1))-near-neighbor problems.

LSH on external memory:
- LSB-forest [SIGMOD'09, TODS'10]: a different reduction from c²-ANN to a (c^i, c^(i+1))-near-neighbor problem. O((dn/B)^0.5) query, O((dn/B)^1.5) space.
- C2LSH [SIGMOD'12]: does not use composite hash keys; performs fine-granular counting of the number of collisions over m LSH projections. O(n · log(n)/B) query, O(n · log(n)/B) space.
- SRS (ours): O(n/B) query, O(n/B) space.

Page 11: Approximate Nearest Neighbor Queries with a Tiny Index


Weaknesses of External-Memory LSH Methods

- Existing methods use super-linear space: thousands (or more) of hash tables are needed if the guarantees are enforced rigorously. As a result, people resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval.
- They can only handle approximation ratios c = x², for integer x ≥ 2, in order to reuse the hash tables (by merging buckets).
- Valuable information is lost (due to quantization).
- Updates are problematic (changes to n and c).

Index sizes:

Dataset        | LSB-forest | C2LSH+ | SRS (ours)
Audio, 40 MB   | 1500 MB    | 127 MB | 2 MB

Page 12: Approximate Nearest Neighbor Queries with a Tiny Index


SRS: Our Proposed Method

- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability; the constants hidden in the O(·) notation are very small.
- The early-termination condition is provably effective.
- Advantages: small index, rich functionality, simplicity.
- Central idea: reduce a c-ANN query in d dimensions to a kNN query in m dimensions with filtering, and model the distribution of the m "stable random projections".


Page 13: Approximate Nearest Neighbor Queries with a Tiny Index


2-Stable Random Projection

- Let D be the 2-stable distribution, i.e., the standard normal distribution N(0, 1).
- For two i.i.d. random variables A ~ D and B ~ D: x·A + y·B ~ (x² + y²)^(1/2) · D.
- Illustration: projecting a vector v onto random directions r1 and r2 gives ⟨v, r1⟩ ~ N(0, ‖v‖) and ⟨v, r2⟩ ~ N(0, ‖v‖).

A quick numerical check of this property follows.
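A quick numerical check of the 2-stability property (a sketch, not code from the paper): projecting a fixed vector v onto i.i.d. N(0, 1) directions yields samples whose standard deviation is ‖v‖:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256
v = rng.standard_normal(d)                      # an arbitrary fixed vector
R = rng.standard_normal((100_000, d))           # rows are random directions r_i
projections = R @ v                             # each entry is <v, r_i>
print(np.std(projections), np.linalg.norm(v))   # the two values should be close
```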


Page 14: Approximate Nearest Neighbor Queries with a Tiny Index


Dist(O) and ProjDist(O) and Their Relationship

- Apply m 2-stable random projections r1, …, rm to v = Q − O (so ‖v‖ = Dist(O)); this maps O in d dimensions to (z1, …, zm) in m dimensions, where zi ≔ ⟨v, ri⟩ ~ N(0, ‖v‖).
- Then z1² + … + zm² ~ ‖v‖² · χ²_m, i.e., a scaled chi-squared distribution with m degrees of freedom. Let Ψ_m(x) denote the cdf of the standard χ²_m distribution.
- Since the projection is linear, ProjDist(O), the distance between Proj(Q) and Proj(O), satisfies ProjDist²(O) = Σ_i zi², and therefore ProjDist²(O) / Dist²(O) ~ χ²_m.

A numerical check of this relationship follows.
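A numerical check of this relationship (again a sketch assuming Gaussian projections): the ratio ProjDist²(O)/Dist²(O) should follow the χ²_m distribution, whose cdf Ψ_m is available as scipy.stats.chi2.cdf:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
d, m, trials = 64, 6, 20_000
v = rng.standard_normal(d)                       # v = Q - O, so Dist(O) = ||v||
R = rng.standard_normal((trials, m, d))          # 'trials' independent sets of m projections
ratios = np.sum((R @ v) ** 2, axis=1) / (v @ v)  # ProjDist^2(O) / Dist^2(O) per trial
for x in (1.0, float(m), 2.0 * m):               # compare the empirical cdf with Psi_m
    print(x, np.mean(ratios <= x), stats.chi2.cdf(x, df=m))
```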


Page 15: Approximate Nearest Neighbor Queries with a Tiny Index


LSH-like Property

Intuitive idea: if Dist(O1) ≪ Dist(O2), then ProjDist(O1) < ProjDist(O2) with high probability. The converse is NOT true, however: with only a few projections, the NN object in the projected space is most likely not the NN object in the original space, because many far-away objects are projected before the NN/c-ANN objects. But we can bound the expected number of such objects (say, by T).

Solution: perform incremental k-NN search in the projected space until T objects have been accessed, plus an early-termination test.


Page 16: Approximate Nearest Neighbor Queries with a Tiny Index


Indexing

Finding the minimum m:
- Input: n, c, and T ≔ the maximum number of points the algorithm may access.
- Output: m, the number of 2-stable random projections, and T' ≤ T, a better bound on T.
- m = O(n/T); we use T = O(n), so m = O(1), which yields a linear-space index.

Index construction:
- Generate m 2-stable random projections, giving n projected points in an m-dimensional space.
- Index the projected points with any index that supports incremental kNN search, e.g., an R-tree.
- Space cost: O(m · n) = O(n).

A minimal construction sketch follows.
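A minimal construction sketch of the indexing step above, under two assumptions: the 2-stable projections are drawn as a standard normal matrix, and scipy's cKDTree stands in for the R-tree (any structure supporting incremental kNN search would do):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_srs_index(D, m, seed=0):
    """Project the n x d dataset D to m dimensions and index the projections (O(mn) space)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, D.shape[1]))   # m 2-stable (Gaussian) projection vectors
    P = D @ A.T                                # n projected points in m dimensions
    return A, cKDTree(P)
```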


Page 17: Approximate Nearest Neighbor Queries with a Tiny Index


SRS-αβ(T, c, pτ)

1. Compute Proj(Q).
2. Perform incremental kNN search from Proj(Q), for k = 1 to T:   // stopping condition α
   - Compute Dist(Ok) and maintain Omin = argmin_{1≤i≤k} Dist(Oi).
   - If the early-termination test (c, pτ) returns TRUE, BREAK.   // stopping condition β
3. Return Omin.

Early-termination test: Ψ_m( c² · ProjDist²(Ok) / Dist²(Omin) ) > pτ.

Example setting: c = 4, d = 256, m = 6, T = 0.00242·n, B = 1024, pτ = 0.18; then Index = 0.0059·n, Query = 0.0084·n, success probability = 0.13.

Main theorem: SRS-αβ returns a c-ANN point with probability pτ − f(m, c), with O(n) I/O cost.

A query sketch, reusing the index construction above, follows.
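A sketch of the SRS-αβ query, reusing build_srs_index from the construction sketch. Two simplifying assumptions: the k-d tree's batch k-NN call stands in for truly incremental kNN search, and Ψ_m is taken to be scipy's chi-squared cdf; this illustrates the pseudocode above, not the paper's implementation:

```python
import numpy as np
from scipy import stats

def srs_query(D, A, tree, q, T, c, p_tau):
    """Return (index, distance) of a candidate c-ANN point for query q."""
    m = A.shape[0]
    proj_q = A @ q
    # Incremental kNN in the projected space, approximated by asking for T neighbors at once.
    proj_dists, order = tree.query(proj_q, k=T)
    o_min, d_min = None, np.inf
    for proj_dist, i in zip(np.atleast_1d(proj_dists), np.atleast_1d(order)):
        dist = np.linalg.norm(D[i] - q)          # verify O_k in the original d-dim space
        if dist < d_min:
            o_min, d_min = int(i), dist
        # Stopping condition beta: Psi_m(c^2 * ProjDist^2(O_k) / Dist^2(O_min)) > p_tau
        if stats.chi2.cdf(c**2 * proj_dist**2 / d_min**2, df=m) > p_tau:
            break
    return o_min, d_min                          # otherwise stopping condition alpha: k reached T

# Toy usage with the slide's illustrative parameters (c = 4, m = 6, T ~ 0.00242 n):
# rng = np.random.default_rng(4); D = rng.standard_normal((100_000, 256))
# A, tree = build_srs_index(D, m=6)
# print(srs_query(D, A, tree, D[42] + 0.01, T=242, c=4, p_tau=0.18))
```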

Page 18: Approximate Nearest Neighbor Queries with a Tiny Index


Variations of SRS-αβ(T, c, pτ)

The algorithm is the same as SRS-αβ(T, c, pτ) above; the variants differ in which stopping condition is used:

1. SRS-α: better quality; query cost is O(T).
2. SRS-β: best quality; query cost bounded by O(n); handles c = 1.
3. SRS-αβ(T, c′, pτ): better quality; query cost bounded by O(T).

All with success probability at least the bound stated in the main theorem.

Page 19: Approximate Nearest Neighbor Queries with a Tiny Index


Other Results

- SRS can be easily extended to support top-k c-ANN queries (k > 1).
- No previously known method guarantees the correctness of the returned results.
- We guarantee correctness with probability at least pτ whenever SRS-αβ stops due to the early-termination condition: ≈100% in practice (97% in theory).


Page 20: Approximate Nearest Neighbor Queries with a Tiny Index


Analysis


Page 21: Approximate Nearest Neighbor Queries with a Tiny Index


Stopping Condition α

- "Near" point: the NN point; denote its distance by r.
- "Far" points: points whose distance exceeds c · r.
- Then, for any κ > 0 and any point o (both bounds hold because ProjDist²(o)/Dist²(o) ~ χ²_m):
  - Pr[ProjDist(o) ≤ κ·r | o is the near point] ≥ Ψ_m(κ²)
  - Pr[ProjDist(o) ≤ κ·r | o is a far point] ≤ Ψ_m(κ²/c²)
- P1 ≔ Pr[the NN point is projected before κ·r] ≥ Ψ_m(κ²).
- P2 ≔ Pr[fewer than T far points are projected before κ·r] ≥ 1 − (n/T) · Ψ_m(κ²/c²), by Markov's inequality.
- Choose κ such that P1 + P2 − 1 > 0; this is feasible thanks to the good concentration bounds for χ²_m.

A small numerical feasibility check follows.
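A small numerical feasibility check of the argument above (a sketch; the parameter values are the illustrative ones from the earlier example slide): scan over κ and report the best value of P1 + P2 − 1:

```python
import numpy as np
from scipy import stats

def success_margin(kappa, m, c, n, T):
    P1 = stats.chi2.cdf(kappa**2, df=m)                          # NN projected before kappa*r
    P2 = 1.0 - (n / T) * stats.chi2.cdf(kappa**2 / c**2, df=m)   # < T far points before kappa*r
    return P1 + P2 - 1.0

m, c, n = 6, 4, 1_000_000
T = int(0.00242 * n)
kappas = np.linspace(0.5, 10.0, 200)
margins = [success_margin(k, m, c, n, T) for k in kappas]
best = int(np.argmax(margins))
print("best kappa:", round(float(kappas[best]), 2), "P1 + P2 - 1:", round(margins[best], 3))
```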


Page 22: Approximate Nearest Neighbor Queries with a Tiny Index


Choosing κ

[Figure: for c = 4, the densities of the scaled χ²_m distributions of ProjDist² for a near point (blue, scale 4) and a far point (red, scale 4·c² = 64); the mode of the χ²_m density is at m − 2, and the marker κ·r indicates the chosen threshold.]

Page 23: Approximate Nearest Neighbor Queries with a Tiny Index


ProjDist(OT): Case I

Consider the cases where both conditions (on the near point and the far points) hold, which happens with probability at least P1 + P2 − 1.

Case I: the NN point is projected no later than OT, so it is among the T accessed points and Omin = the NN point.

[Figure: positions of the projected distances in m dimensions relative to ProjDist(OT).]


Page 24: Approximate Nearest Neighbor Queries with a Tiny Index


ProjDist(OT): Case II

Case II: the NN point is projected after OT. Since the NN point is still projected before κ·r, all T accessed points have projected distance below κ·r; and since fewer than T far points are projected before κ·r, at least one accessed point has distance at most c·r. Hence Omin = a c-ANN point.

Again, both conditions hold with probability at least P1 + P2 − 1.

[Figure: positions of the projected distances in m dimensions relative to ProjDist(OT).]


Page 25: Approximate Nearest Neighbor Queries with a Tiny Index


Early-Termination Condition (β)

- We omit the proof here; it also relies on the fact that the squared projected distance (the sum of the m squared projections) follows a scaled χ²_m distribution.
- It is the key to handling the case c = 1: SRS returns the NN point with a guaranteed probability, which is impossible for LSH-based methods.
- It also guarantees the correctness of the top-k c-ANN points returned when the algorithm stops under this condition; no previous method offers such a guarantee.


Page 26: Approximate Nearest Neighbor Queries with a Tiny Index


Experiment Setup

Algorithms:
- LSB-forest [SIGMOD'09, TODS'10]
- C2LSH [SIGMOD'12]
- SRS-* [PVLDB 2015]

Data: see the datasets on the next slide.

Measures: index size, query cost, result quality, success probability.


Page 27: Approximate Nearest Neighbor Queries with a Tiny Index


Datasets

[Table of datasets; reported sizes include 16 GB, 369 GB, and 5.6 PB.]


Page 28: Approximate Nearest Neighbor Queries with a Tiny Index


Tiny Image Dataset (8M points, 384 dimensions)

- Fastest: SRS-αβ; slowest: C2LSH. Result quality is the other way around.
- SRS-α has quality comparable to C2LSH, yet at a much lower cost.
- SRS-* dominates LSB-forest.


Page 29: Approximate Nearest Neighbor Queries with a Tiny Index


Approximate Nearest Neighbor

Empirically better than the theoretical guarantee:
- With 15% of the I/Os of a linear scan, SRS returns the NN with probability 71%.
- With 62% of the I/Os of a linear scan, SRS returns the NN with probability 99.7%.


Page 30: Approximate Nearest Neighbor Queries with a Tiny Index


Large Dataset (0.45 Billion)


Page 31: Approximate Nearest Neighbor Queries with a Tiny Index


Summary

- c-ANN queries in arbitrarily high-dimensional space are reduced to kNN queries in a low-dimensional space.
- Our index size is approximately m/d of the size of the data file.
- This opens up a new direction for c-ANN queries in high-dimensional space: finding efficient solutions to the kNN problem in 6-10 dimensional space.


Page 32: Approximate Nearest Neighbor Queries with a Tiny Index


Q&A

Similarity Query Processing Project Homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
