Approximate Nearest Neighbor Queries with a Tiny Index


Wei Wang, University of New South Wales
NICTA Machine Learning Research Group Seminar, 26 June 2014

Outline

- Overview of Our Research
- SRS: c-Approximate Nearest Neighbor with a Tiny Index [PVLDB 2015]
- Conclusions


Research Projects

- Similarity Query Processing
- Keyword Search on (Semi-)Structured Data
- Graphs
- Succinct Data Structures


NN and c-ANN Queries

Definitions:
- A set of points, D = ∪_{i=1..n} {Oi}, in d-dimensional Euclidean space (d is large, e.g., hundreds)
- NN query: given a query point q, find the closest point, O*, in D
- Relaxed version (c-ANN, aka (1+ε)-ANN): return a point whose distance to q is at most c·Dist(O*, q); such a point may be returned only with at least constant probability
- Example (from the slide's figure): for D = {a, b, x}, the NN is a, and x is a valid c-ANN answer
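To make these definitions concrete, here is a minimal brute-force sketch (not from the slides; the points and the query below are made up for illustration):

```python
import numpy as np

def nearest_neighbor(D, q):
    """Exact NN by linear scan: returns (index, distance) of the closest point to q."""
    dists = np.linalg.norm(D - q, axis=1)
    i = int(np.argmin(dists))
    return i, dists[i]

def is_c_ann(D, q, o, c):
    """Check whether point o is a valid c-ANN answer for query q over data set D."""
    _, d_star = nearest_neighbor(D, q)          # Dist(O*, q)
    return np.linalg.norm(o - q) <= c * d_star  # Dist(o, q) <= c * Dist(O*, q)

# Tiny illustration with made-up points standing in for a, b, x.
D = np.array([[0.0, 0.0], [5.0, 5.0], [1.5, 0.0]])  # a, b, x
q = np.array([0.5, 0.0])
print(nearest_neighbor(D, q))   # a is the exact NN
print(is_c_ann(D, q, D[2], 2))  # x is a valid 2-ANN answer here
```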


Applications and Challenges

Applications:
- Feature vectors: data mining, multimedia databases
- Fundamental geometric problem: the "post-office problem"
- Quantization in coding/compression
- …

Challenges:
- Curse of dimensionality / concentration of measure
- Hard to find algorithms sub-linear in n and polynomial in d
- Large data size: 1 KB for a single point with 256 dimensions (at 4 bytes per dimension)


Existing Solutions

NN (exact):
- O(d^5·log n) query time, O(n^(2d+δ)) space
- O(d·n^(1−ε(d))) query time, O(dn) space
- Linear scan: O(dn/B) I/Os, O(dn) space

(1+ε)-ANN:
- O(log n + 1/ε^((d−1)/2)) query time, O(n·log(1/ε)) space
- Probabilistic test removes the exponential dependency on d
- Fast JLT: O(d·log d + ε^(−3)·log^2 n) query time, O(n^(max(2, ε^(−2)))) space
- LSH-based: Õ(d·n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1)

LSH is the best approach using sub-quadratic space; linear scan is (practically) the best approach using linear space and time.


Approximate NN for Multimedia Retrieval

- Cover-tree
- Spill-tree
- Reduce to NN search with Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)


Locality Sensitive Hashing (LSH)

Equality search:
- Index: store o into bucket h(o)
- Query: retrieve every o in bucket h(q) and verify whether o = q

LSH:
- ∀h ∈ LSH-family, Pr[h(q) = h(o)] decreases with Dist(q, o); h :: R^d → Z (technically, dependent on a radius r)
- "Near-by" points have a higher chance of colliding with q than "far-away" points
- LSH is the best approach using sub-quadratic space
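The slides do not spell out a concrete hash family; the sketch below uses the standard 2-stable (E2LSH-style) family for Euclidean distance, h(o) = ⌊(a·o + b)/w⌋, where the bucket width w plays the role of the radius r. All parameters here are illustrative assumptions.

```python
import numpy as np

class E2LSHHash:
    """One hash function from the 2-stable LSH family for Euclidean distance:
    h(o) = floor((a . o + b) / w), with a ~ N(0, I) and b ~ Uniform[0, w).
    Closer points are more likely to land in the same bucket."""
    def __init__(self, d, w, rng):
        self.a = rng.standard_normal(d)   # 2-stable (Gaussian) projection vector
        self.b = rng.uniform(0.0, w)      # random offset
        self.w = w                        # bucket width (depends on the radius r)

    def __call__(self, o):
        return int(np.floor((self.a @ o + self.b) / self.w))

rng = np.random.default_rng(0)
h = E2LSHHash(d=128, w=4.0, rng=rng)
q = rng.standard_normal(128)
near, far = q + 0.01 * rng.standard_normal(128), q + 10 * rng.standard_normal(128)
print(h(q) == h(near), h(q) == h(far))  # the near point is far more likely to collide
```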


LSH: Indexing & Query Processing

Index:
- For a fixed r: sig(o) = ⟨h1(o), h2(o), …, hk(o)⟩; store o into bucket sig(o) (composite keys reduce the query cost)
- Iteratively increase r

Query:
- Search with a fixed r: retrieve and "verify" the points in bucket sig(q); repeat over L hash tables (boosting, to reach a constant success probability); see the sketch below for this fixed-r part
- Galloping search to find the first good r: incurs additional cost and yields only a c² quality guarantee
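A compact sketch of the fixed-r indexing and querying steps described above, with k composite hashes per table and L tables for boosting; k, L, and w are placeholder parameters, not values from the slides.

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(D, k, L, w, rng):
    """L hash tables, each keyed by a composite signature of k 2-stable hashes."""
    tables = []
    for _ in range(L):
        A = rng.standard_normal((k, D.shape[1]))
        b = rng.uniform(0.0, w, size=k)
        buckets = defaultdict(list)
        for i, o in enumerate(D):
            sig = tuple(np.floor((A @ o + b) / w).astype(int))
            buckets[sig].append(i)
        tables.append((A, b, buckets))
    return tables

def lsh_query(D, q, tables, w):
    """Collect candidates colliding with q in any table, then verify by exact distance."""
    cand = set()
    for A, b, buckets in tables:
        sig = tuple(np.floor((A @ q + b) / w).astype(int))
        cand.update(buckets.get(sig, []))
    if not cand:
        return None
    cand = list(cand)
    dists = np.linalg.norm(D[cand] - q, axis=1)
    return cand[int(np.argmin(dists))]
```

Larger k makes a single collision more selective (fewer candidates to verify), while larger L boosts the probability that a near point collides in at least one table.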


Locality Sensitive Hashing (LSH)

Standard LSH:
- Solves c²-ANN via binary search over (c^i, c^(i+1))-NN problems

LSH on external memory:
- LSB-forest [SIGMOD'09, TODS'10]: a different reduction from c²-ANN to a single (c^i, c^(i+1))-NN problem; O((dn/B)^0.5) query, O((dn/B)^1.5) space
- C2LSH [SIGMOD'12]: does not use composite hash keys; performs fine-granular counting of collisions across m LSH projections; O(n·log(n)/B) query, O(n·log(n)/B) space
- SRS (ours): O(n/B) query, O(n/B) space


Weakness of Ext-Memory LSH Methods

- Existing methods use super-linear space: thousands (or more) of hash tables are needed if done rigorously, so people resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval
- Can only handle c = x² for integer x ≥ 2, to enable reusing the hash tables (merging buckets)
- Valuable information is lost (due to quantization)
- Updates? (changes to n and c)

Index size on the Audio dataset (40 MB of raw data): LSB-forest 1500 MB, C2LSH 127 MB, SRS (ours) 2 MB.


SRS: Our Proposed Method

- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability; the constants hidden in O(·) are very small; the early-termination condition is provably effective
- Advantages: small index, rich functionality, simple
- Central idea: reduce a c-ANN query in d dimensions to a kNN query in m dimensions with filtering; model the distribution of the m "stable random projections"


2-stable Random Projection

- Let D be the 2-stable distribution, i.e., the standard Normal distribution N(0, 1)
- For two i.i.d. random variables A ~ D and B ~ D: x·A + y·B ~ (x² + y²)^(1/2) · D
- Illustration: for a vector V and projection vectors r1, r2 with i.i.d. N(0, 1) entries, ⟨V, r1⟩ ~ N(0, ‖V‖) and ⟨V, r2⟩ ~ N(0, ‖V‖), i.e., Normal with standard deviation ‖V‖
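A quick numerical check of the 2-stability property and its consequence for projections (all parameters here are illustrative, not from the slides):

```python
import numpy as np

# 2-stability of N(0, 1): for i.i.d. A, B ~ N(0, 1), x*A + y*B ~ sqrt(x^2 + y^2) * N(0, 1).
rng = np.random.default_rng(1)
x, y, trials = 3.0, 4.0, 100_000
samples = x * rng.standard_normal(trials) + y * rng.standard_normal(trials)
print(samples.std(), np.hypot(x, y))       # both close to 5.0

# Consequence: for a fixed v and r with i.i.d. N(0, 1) entries, <v, r> ~ N(0, ||v||).
d = 128
v = rng.standard_normal(d)
proj = rng.standard_normal((50_000, d)) @ v
print(proj.std(), np.linalg.norm(v))       # the two values should be close
```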


Dist(O) and ProjDist(O) and Their Relationship

- Apply m 2-stable random projections r1, …, rm: a point O in d dims is mapped to (z1, …, zm) in m dims; let V = Q − O, so ‖V‖ = Dist(O)
- zi ≔ ⟨V, ri⟩ ~ N(0, ‖V‖), hence z1² + … + zm² ~ ‖V‖²·χ²_m, i.e., a scaled Chi-squared distribution with m degrees of freedom
- Ψm(x): cdf of the standard χ²_m distribution
- In other words, ProjDist(O), the distance between Proj(Q) and Proj(O) in the projected space, satisfies ProjDist²(O)/Dist²(O) ~ χ²_m
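A small simulation of the distributional fact used throughout SRS: the ratio ProjDist²(O)/Dist²(O) follows χ²_m, whose cdf is Ψm (the dimensions and trial count below are assumptions for illustration):

```python
import numpy as np
from scipy.stats import chi2

# ProjDist^2(O) / Dist^2(O) ~ chi-squared with m degrees of freedom when the m
# projection vectors have i.i.d. N(0, 1) entries.
rng = np.random.default_rng(2)
d, m, trials = 128, 6, 20_000
v = rng.standard_normal(d)                 # v = Q - O, so Dist(O) = ||v||
sq_proj = np.zeros(trials)
for _ in range(m):
    r = rng.standard_normal((trials, d))   # one fresh projection vector per trial
    sq_proj += (r @ v) ** 2                # accumulate the m squared projections
ratio = sq_proj / (v @ v)                  # ProjDist^2 / Dist^2 for each trial
# Psi_m(x) is the cdf of the standard chi-squared distribution with m dof:
print(np.mean(ratio <= 1.0), chi2.cdf(1.0, df=m))   # empirical vs. Psi_m(1)
```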


LSH-like Property

Intuitive idea:
- If Dist(O1) ≪ Dist(O2), then ProjDist(O1) < ProjDist(O2) with high probability
- But the converse is NOT true: with only a few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-NN objects
- However, we can bound the expected number of such cases (say, by T)

Solution:
- Perform incremental k-NN search in the projected space until T objects have been accessed, plus an early-termination test


Indexing

Finding the minimum m:
- Input: n, c, and T ≔ max # of points to be accessed by the algorithm
- Output: m (# of 2-stable random projections) and T' ≤ T (a better bound on T)
- m = O(n/T); we use T = O(n), so m = O(1), to achieve a linear-space index

Building the index:
- Generate m 2-stable random projections → n projected points in an m-dimensional space
- Index these projections using any index that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m·n) = O(n)
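A sketch of the index-construction step under the assumptions above; the kd-tree here is only a stand-in for the R-tree (or any structure supporting incremental kNN), and the dataset and parameters are made up:

```python
import numpy as np
from scipy.spatial import cKDTree

def build_srs_index(D, m, rng):
    """Project the d-dimensional data onto m 2-stable (Gaussian) directions and
    index the m-dimensional projections with a structure supporting kNN search."""
    d = D.shape[1]
    A = rng.standard_normal((m, d))   # the m random projection vectors (kept for queries)
    P = D @ A.T                       # n projected points in m dimensions
    return A, cKDTree(P)

rng = np.random.default_rng(3)
D = rng.standard_normal((10_000, 256))
A, tree = build_srs_index(D, m=6, rng=rng)
# The index stores only the m-dim projections: roughly m/d of the raw data size.
```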


SRS-αβ(T, c, pτ)

1. Compute proj(Q)
2. Perform incremental kNN search from proj(Q), for k = 1 to T   // stopping condition α
   - Compute Dist(Ok)
   - Maintain Omin = argmin_{1 ≤ i ≤ k} Dist(Oi)
   - If the early-termination test(c, pτ) is TRUE, BREAK          // stopping condition β
3. Return Omin

Early-termination test: Ψm( c²·ProjDist²(Ok) / Dist²(Omin) ) > pτ

Example setting: c = 4, d = 256, m = 6, T = 0.00242·n, B = 1024, pτ = 0.18; index size = 0.0059·n, query cost = 0.0084·n, success probability = 0.13

Main Theorem: SRS-αβ returns a c-NN point with probability pτ − f(m, c), with O(n) I/O cost
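A sketch of the SRS-αβ query procedure, reusing the hypothetical build_srs_index from the previous sketch. The single batch kd-tree query approximates the incremental kNN search over the projected points; this is a simplification, not the paper's exact external-memory procedure:

```python
import numpy as np
from scipy.stats import chi2

def srs_query(D, A, tree, q, T, c, p_tau):
    """SRS-alpha-beta sketch: examine at most T points in increasing order of
    projected distance; stop early once the chi-squared test exceeds p_tau."""
    m = A.shape[0]
    proj_q = A @ q
    T = min(T, len(D))
    proj_dists, idx = tree.query(proj_q, k=T)        # first T points by ProjDist
    best_i, best_dist = -1, np.inf
    for pd, i in zip(np.atleast_1d(proj_dists), np.atleast_1d(idx)):
        dist = np.linalg.norm(D[i] - q)              # one "I/O": fetch the real point
        if dist < best_dist:
            best_i, best_dist = i, dist              # maintain O_min
        # Early-termination test: Psi_m( c^2 * ProjDist^2(o_k) / Dist^2(o_min) ) > p_tau
        if chi2.cdf((c * pd) ** 2 / best_dist ** 2, df=m) > p_tau:
            break                                    # stopping condition beta
    # Exhausting the loop after T points corresponds to stopping condition alpha.
    return best_i, best_dist
```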


Variations of SRS-αβ(T, c, pτ)

(Same algorithm as above; the variants differ in which stopping condition is applied and with what parameters.)

1. SRS-α: better quality; query cost is O(T)
2. SRS-β: best quality; query cost bounded by O(n); handles c = 1
3. SRS-αβ(T, c', pτ): better quality; query cost bounded by O(T)

All with success probability at least the bound of the main theorem.


Other Results

- Can be easily extended to support top-k c-ANN queries (k > 1)
- No previously known guarantee on the correctness of the returned results; we guarantee correctness with probability at least pτ if SRS-αβ stops due to the early-termination condition (≈100% in practice, 97% in theory)


Analysis


Stopping Condition α

- "Near" point: the NN point; its distance ≕ r. "Far" points: points whose distance > c·r
- Then for any κ > 0 and any o:
  Pr[ProjDist(o) ≤ κ·r | o is a near point] ≥ Ψm(κ²)
  Pr[ProjDist(o) ≤ κ·r | o is a far point] ≤ Ψm(κ²/c²)
  Both hold because ProjDist²(o)/Dist²(o) ~ χ²_m
- So P1 ≔ Ψm(κ²) lower-bounds Pr[the NN point is projected before κ·r], and P2 ≔ 1 − Ψm(κ²/c²)·(n/T) lower-bounds Pr[# of far points projected before κ·r < T] (by Markov's inequality on their expected number)
- Choose κ such that P1 + P2 − 1 > 0; this is feasible thanks to the good concentration bound for χ²_m (see the numeric sketch below)
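A small numeric sketch of this feasibility argument: for given n, c, T, and m, scan for a κ that makes P1 + P2 − 1 positive (c, m, and T/n reuse the example on the SRS-αβ slide; n itself is an assumed value):

```python
import numpy as np
from scipy.stats import chi2

def success_prob_bound(kappa, m, c, n, T):
    """Lower bound P1 + P2 - 1 on the success probability of stopping condition alpha.
    P1 = Psi_m(kappa^2): the NN point is projected before kappa * r.
    P2 = 1 - Psi_m(kappa^2 / c^2) * (n / T): fewer than T far points are projected
    before kappa * r (via Markov's inequality on their expected number)."""
    p1 = chi2.cdf(kappa ** 2, df=m)
    p2 = 1.0 - chi2.cdf(kappa ** 2 / c ** 2, df=m) * (n / T)
    return p1 + p2 - 1.0

# Illustrative numbers (c, m, and T/n follow the example on the SRS-alpha-beta slide).
n, m, c = 1_000_000, 6, 4
T = int(0.00242 * n)
kappas = np.linspace(0.1, 10, 1000)
bounds = success_prob_bound(kappas, m, c, n, T)
best = kappas[np.argmax(bounds)]
print(best, bounds.max())   # a feasible kappa exists when the bound is positive
```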


Choosing κ

[Figure: densities of the scaled χ²_m distributions of ProjDist² for the near point (scale 4, blue) and a far point (scale 4·c² = 64, red), with c = 4; the mode of χ²_m is at m − 2, and the threshold κ·r is marked]

ProjDist(OT): Case I

[Figure: data points ordered by ProjDist(o) in m dims, with the cut-off ProjDist(OT) marked; here Omin = the NN point]
- Consider the cases where both conditions (on the near point and the far points) hold, i.e., with probability at least P1 + P2 − 1

ProjDist(OT): Case II

[Figure: as in Case I, but here Omin = a c-NN point]
- Again, both conditions hold with probability at least P1 + P2 − 1

Early-termination Condition (β)

- Proof omitted here; it also relies on the fact that the sum of the m squared projected distances follows a scaled χ²_m distribution
- Key to handling the case where c = 1: returns the NN point with guaranteed probability, which is impossible for LSH-based methods
- Guarantees the correctness of the top-k c-ANN points returned when stopped by this condition; no such guarantee by any previous method


Experiment Setup

Algorithms:
- LSB-forest [SIGMOD'09, TODS'10]
- C2LSH [SIGMOD'12]
- SRS-* [PVLDB 2015]

Data: see the datasets below

Measures: index size, query cost, result quality, success probability


Datasets

[Dataset table (details not recoverable from the transcript); figures shown on the slide: 5.6 PB, 369 GB, 16 GB]


Tiny Image Dataset (8M points, 384 dims)

- Fastest: SRS-αβ; slowest: C2LSH; quality is the other way around
- SRS-α has comparable quality to C2LSH yet much lower cost
- SRS-* dominates LSB-forest


Approximate Nearest Neighbor

- Empirically better than the theoretical guarantee
- With 15% of the I/Os of a linear scan, returns the NN with probability 71%
- With 62% of the I/Os of a linear scan, returns the NN with probability 99.7%


Large Dataset (0.45 billion points)


Summary

- c-ANN queries in arbitrarily high-dimensional space → kNN queries in low-dimensional space
- Our index size is approximately m/d of the size of the data file
- Opens up a new direction for c-ANN queries in high-dimensional space: finding efficient solutions to the kNN problem in 6-10 dimensional space


Q&A

Similarity Query Processing Project Homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
