Approximate Nearest Neighbor Queries with a Tiny Index
Wei Wang, University of New South Wales
NICTA Machine Learning Research Group Seminar, 26 June 2014


Page 1: Approximate Nearest Neighbor Queries with a Tiny Index


Page 2: Approximate Nearest Neighbor Queries with a Tiny Index


Outline
- Overview of Our Research
- SRS: c-Approximate Nearest Neighbor with a Tiny Index [PVLDB 2015]
- Conclusions


Page 3: Approximate Nearest Neighbor Queries with a Tiny Index


Research Projects
- Similarity query processing
- Keyword search on (semi-)structured data
- Graphs
- Succinct data structures


Page 4: Approximate Nearest Neighbor Queries with a Tiny Index


NN and c-ANN Queries

Definitions
- A set of points D = ∪_{i=1..n} {O_i} in d-dimensional Euclidean space, where d is large (e.g., hundreds).
- NN query: given a query point q, find the closest point O* in D.
- Relaxed version, the c-ANN query (aka the (1+ε)-ANN query): return a point whose distance to q is at most c · Dist(O*, q); the answer may be required to be correct only with at least constant probability.

Example from the slide figure: for D = {a, b, x}, the NN is a, and x is an acceptable c-ANN answer. A minimal code sketch of both notions follows.
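Below is a minimal linear-scan sketch (not from the paper; the names and toy data are illustrative) of an exact NN query and of checking whether a candidate qualifies as a c-ANN answer:

```python
import numpy as np

def nearest_neighbor(D, q):
    """Return (index, distance) of the exact NN of q in the n x d point set D."""
    dists = np.linalg.norm(D - q, axis=1)
    i = int(np.argmin(dists))
    return i, dists[i]

def is_c_ann(D, q, candidate, c):
    """True iff Dist(candidate, q) <= c * Dist(O*, q)."""
    _, nn_dist = nearest_neighbor(D, q)
    return np.linalg.norm(candidate - q) <= c * nn_dist

# Toy example mirroring the slide: D = {a, b, x}.
D = np.array([[0.0, 1.0], [5.0, 5.0], [0.0, 1.9]])   # a, b, x
q = np.zeros(2)
print(nearest_neighbor(D, q))        # -> (0, 1.0): the NN is a
print(is_c_ann(D, q, D[2], c=2.0))   # -> True: x is an acceptable 2-ANN answer
```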


Page 5: Approximate Nearest Neighbor Queries with a Tiny Index


Applications and Challenges

Applications
- Feature vectors: data mining, multimedia databases
- A fundamental geometric problem: the "post-office problem"
- Quantization in coding/compression
- …

Challenges
- Curse of dimensionality / concentration of measure: it is hard to find algorithms that are sub-linear in n and polynomial in d.
- Large data size: 1 KB for a single point with 256 dimensions.


Page 6: Approximate Nearest Neighbor Queries with a Tiny Index


Existing Solutions

Exact NN:
- O(d^5 · log n) query time, O(n^(2d+δ)) space
- O(d · n^(1−ε(d))) query time, O(dn) space
- Linear scan: O(dn/B) I/Os, O(dn) space

(1+ε)-ANN:
- O(log n + 1/ε^((d−1)/2)) query time, O(n · log(1/ε)) space
- Probabilistic tests remove the exponential dependency on d
- Fast JLT: O(d · log d + ε^(−3) · log² n) query time, O(n^max(2, ε^(−2))) space
- LSH-based: Õ(d · n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1)

LSH is the best approach using sub-quadratic space; linear scan is (practically) the best approach using linear space and time.


Page 7: Approximate Nearest Neighbor Queries with a Tiny Index


Approximate NN for Multimedia Retrieval

- Cover-tree
- Spill-tree
- Reduction to NN search under Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)


Page 8: Approximate Nearest Neighbor Queries with a Tiny Index


Locality Sensitive Hashing (LSH)

Equality search (for intuition):
- Index: store o in bucket h(o).
- Query: retrieve every o in bucket h(q) and verify whether o = q.

LSH:
- For every h in an LSH family, Pr[h(q) = h(o)] ∝ 1/Dist(q, o) (loosely speaking); h :: R^d → Z; technically, the collision probability also depends on a radius r.
- "Near-by" points (blue in the slide figure) have a higher chance of colliding with q than "far-away" points (red).

LSH is the best approach using sub-quadratic space. A sketch of one standard Euclidean LSH family follows.
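As an illustration, here is a minimal sketch of one standard Euclidean LSH family (the p-stable construction with h(o) = ⌊(a·o + b)/w⌋); the slides do not commit to a particular family, so the hash form and the parameter values below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_lsh(d, w):
    """One random hash function h: R^d -> Z; nearby points collide more often."""
    a = rng.standard_normal(d)       # 2-stable (Gaussian) projection direction
    b = rng.uniform(0.0, w)          # random offset
    return lambda o: int(np.floor((a @ o + b) / w))

# Empirical check: a near point collides with q far more often than a far point.
d, w, trials = 64, 4.0, 2000
q = rng.standard_normal(d)
near = q + 0.1 * rng.standard_normal(d)
far = q + 5.0 * rng.standard_normal(d)

def collision_rate(o):
    hits = 0
    for _ in range(trials):
        h = make_lsh(d, w)           # a fresh random hash function per trial
        hits += (h(q) == h(o))
    return hits / trials

print("near:", collision_rate(near), "far:", collision_rate(far))
```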


Page 9: Approximate Nearest Neighbor Queries with a Tiny Index


LSH: Indexing and Query Processing

Index:
- For a fixed r: sig(o) = ⟨h1(o), h2(o), …, hk(o)⟩; store o in bucket sig(o). Composite signatures reduce the query cost.
- Iteratively increase r.

Query:
- Search with a fixed r: retrieve and "verify" the points in bucket sig(q).
- Repeat this L times (boosting), giving constant success probability.
- Galloping search finds the first good r, but incurs additional cost and gives only a c² quality guarantee.

A code sketch of the composite-signature tables follows.
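A sketch of the composite-signature scheme just described; the hash family, the parameter choices, and the in-memory dictionary buckets are illustrative assumptions rather than the exact external-memory layout used by LSH implementations:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)

def build_tables(D, k, L, w):
    """L hash tables, each keyed by a k-wise composite signature sig(o) = (h1(o), ..., hk(o))."""
    funcs, tables = [], []
    for _ in range(L):
        A = rng.standard_normal((k, D.shape[1]))
        b = rng.uniform(0.0, w, size=k)
        sig = lambda o, A=A, b=b: tuple(np.floor((A @ o + b) / w).astype(int))
        table = defaultdict(list)
        for i, o in enumerate(D):
            table[sig(o)].append(i)
        funcs.append(sig)
        tables.append(table)
    return funcs, tables

def query(D, q, funcs, tables, radius):
    """Collect candidates from the L buckets sig(q) and verify them by true distance."""
    cand = {i for sig, t in zip(funcs, tables) for i in t.get(sig(q), [])}
    return [i for i in cand if np.linalg.norm(D[i] - q) <= radius]

# Toy usage: a point close to D[0] should retrieve index 0 for a suitable radius.
D = rng.standard_normal((1000, 32))
funcs, tables = build_tables(D, k=8, L=10, w=4.0)
print(query(D, D[0] + 0.01, funcs, tables, radius=1.0))
```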


Page 10: Approximate Nearest Neighbor Queries with a Tiny Index


Locality Sensitive Hashing (LSH)

Standard LSH: answers c²-ANN queries via binary search over a series of (c^i, c^(i+1))-near-neighbor problems.

LSH on external memory:
- LSB-forest [SIGMOD'09, TODS'10]: a different reduction from c²-ANN to a (c^i, c^(i+1))-near-neighbor problem. O((dn/B)^0.5) query, O((dn/B)^1.5) space.
- C2LSH [SIGMOD'12]: does not use composite hash keys; performs fine-granular counting of the number of collisions over m LSH projections. O(n · log(n)/B) query, O(n · log(n)/B) space.
- SRS (ours): O(n/B) query, O(n/B) space.

Page 11: Approximate Nearest Neighbor Queries with a Tiny Index


Weaknesses of External-Memory LSH Methods

- Existing methods use super-linear space: thousands (or more) of hash tables are needed if the guarantees are enforced rigorously. As a result, people resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval.
- They can only handle approximation ratios c = x², for integer x ≥ 2, in order to reuse the hash tables (by merging buckets).
- Valuable information is lost (due to quantization).
- Updates are problematic (changes to n and c).

Index sizes:

Dataset        | LSB-forest | C2LSH+ | SRS (ours)
Audio, 40 MB   | 1500 MB    | 127 MB | 2 MB

Page 12: Approximate Nearest Neighbor Queries with a Tiny Index


SRS: Our Proposed Method

- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability; the constants hidden in the O(·) notation are very small.
- The early-termination condition is provably effective.
- Advantages: small index, rich functionality, simplicity.
- Central idea: reduce a c-ANN query in d dimensions to a kNN query in m dimensions with filtering, and model the distribution of the m "stable random projections".


Page 13: Approximate Nearest Neighbor Queries with a Tiny Index


2-Stable Random Projection

- Let D be the 2-stable distribution, i.e., the standard normal distribution N(0, 1).
- For two i.i.d. random variables A ~ D and B ~ D: x·A + y·B ~ (x² + y²)^(1/2) · D.
- Illustration: projecting a vector v onto random directions r1 and r2 gives ⟨v, r1⟩ ~ N(0, ‖v‖) and ⟨v, r2⟩ ~ N(0, ‖v‖).

A quick numerical check of this property follows.
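A quick numerical check of the 2-stability property (a sketch, not code from the paper): projecting a fixed vector v onto i.i.d. N(0, 1) directions yields samples whose standard deviation is ‖v‖:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256
v = rng.standard_normal(d)                      # an arbitrary fixed vector
R = rng.standard_normal((100_000, d))           # rows are random directions r_i
projections = R @ v                             # each entry is <v, r_i>
print(np.std(projections), np.linalg.norm(v))   # the two values should be close
```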


Page 14: Approximate Nearest Neighbor Queries with a Tiny Index


Dist(O) and ProjDist(O) and Their Relationship

- Apply m 2-stable random projections r1, …, rm to v = Q − O (so ‖v‖ = Dist(O)); this maps O in d dimensions to (z1, …, zm) in m dimensions, where zi ≔ ⟨v, ri⟩ ~ N(0, ‖v‖).
- Then z1² + … + zm² ~ ‖v‖² · χ²_m, i.e., a scaled chi-squared distribution with m degrees of freedom. Let Ψ_m(x) denote the cdf of the standard χ²_m distribution.
- Since the projection is linear, ProjDist(O), the distance between Proj(Q) and Proj(O), satisfies ProjDist²(O) = Σ_i zi², and therefore ProjDist²(O) / Dist²(O) ~ χ²_m.

A numerical check of this relationship follows.
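A numerical check of this relationship (again a sketch assuming Gaussian projections): the ratio ProjDist²(O)/Dist²(O) should follow the χ²_m distribution, whose cdf Ψ_m is available as scipy.stats.chi2.cdf:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
d, m, trials = 64, 6, 20_000
v = rng.standard_normal(d)                       # v = Q - O, so Dist(O) = ||v||
R = rng.standard_normal((trials, m, d))          # 'trials' independent sets of m projections
ratios = np.sum((R @ v) ** 2, axis=1) / (v @ v)  # ProjDist^2(O) / Dist^2(O) per trial
for x in (1.0, float(m), 2.0 * m):               # compare the empirical cdf with Psi_m
    print(x, np.mean(ratios <= x), stats.chi2.cdf(x, df=m))
```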


Page 15: Approximate Nearest Neighbor Queries with a Tiny Index


LSH-like Property

Intuitive idea: if Dist(O1) ≪ Dist(O2), then ProjDist(O1) < ProjDist(O2) with high probability. The converse is NOT true, however: with only a few projections, the NN object in the projected space is most likely not the NN object in the original space, because many far-away objects are projected before the NN/c-ANN objects. But we can bound the expected number of such objects (say, by T).

Solution: perform incremental k-NN search in the projected space until T objects have been accessed, plus an early-termination test.


Page 16: Approximate Nearest Neighbor Queries with a Tiny Index


Indexing

Finding the minimum m:
- Input: n, c, and T ≔ the maximum number of points the algorithm may access.
- Output: m, the number of 2-stable random projections, and T' ≤ T, a better bound on T.
- m = O(n/T); we use T = O(n), so m = O(1), which yields a linear-space index.

Index construction:
- Generate m 2-stable random projections, giving n projected points in an m-dimensional space.
- Index the projected points with any index that supports incremental kNN search, e.g., an R-tree.
- Space cost: O(m · n) = O(n).

A minimal construction sketch follows.
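A minimal construction sketch of the indexing step above, under two assumptions: the 2-stable projections are drawn as a standard normal matrix, and scipy's cKDTree stands in for the R-tree (any structure supporting incremental kNN search would do):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_srs_index(D, m, seed=0):
    """Project the n x d dataset D to m dimensions and index the projections (O(mn) space)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, D.shape[1]))   # m 2-stable (Gaussian) projection vectors
    P = D @ A.T                                # n projected points in m dimensions
    return A, cKDTree(P)
```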


Page 17: Approximate Nearest Neighbor Queries with a Tiny Index


SRS-αβ(T, c, pτ)

1. Compute Proj(Q).
2. Perform incremental kNN search from Proj(Q), for k = 1 to T:   // stopping condition α
   - Compute Dist(Ok) and maintain Omin = argmin_{1≤i≤k} Dist(Oi).
   - If the early-termination test (c, pτ) returns TRUE, BREAK.   // stopping condition β
3. Return Omin.

Early-termination test: Ψ_m( c² · ProjDist²(Ok) / Dist²(Omin) ) > pτ.

Example setting: c = 4, d = 256, m = 6, T = 0.00242·n, B = 1024, pτ = 0.18; then Index = 0.0059·n, Query = 0.0084·n, success probability = 0.13.

Main theorem: SRS-αβ returns a c-ANN point with probability pτ − f(m, c), with O(n) I/O cost.

A query sketch, reusing the index construction above, follows.
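A sketch of the SRS-αβ query, reusing build_srs_index from the construction sketch. Two simplifying assumptions: the k-d tree's batch k-NN call stands in for truly incremental kNN search, and Ψ_m is taken to be scipy's chi-squared cdf; this illustrates the pseudocode above, not the paper's implementation:

```python
import numpy as np
from scipy import stats

def srs_query(D, A, tree, q, T, c, p_tau):
    """Return (index, distance) of a candidate c-ANN point for query q."""
    m = A.shape[0]
    proj_q = A @ q
    # Incremental kNN in the projected space, approximated by asking for T neighbors at once.
    proj_dists, order = tree.query(proj_q, k=T)
    o_min, d_min = None, np.inf
    for proj_dist, i in zip(np.atleast_1d(proj_dists), np.atleast_1d(order)):
        dist = np.linalg.norm(D[i] - q)          # verify O_k in the original d-dim space
        if dist < d_min:
            o_min, d_min = int(i), dist
        # Stopping condition beta: Psi_m(c^2 * ProjDist^2(O_k) / Dist^2(O_min)) > p_tau
        if stats.chi2.cdf(c**2 * proj_dist**2 / d_min**2, df=m) > p_tau:
            break
    return o_min, d_min                          # otherwise stopping condition alpha: k reached T

# Toy usage with the slide's illustrative parameters (c = 4, m = 6, T ~ 0.00242 n):
# rng = np.random.default_rng(4); D = rng.standard_normal((100_000, 256))
# A, tree = build_srs_index(D, m=6)
# print(srs_query(D, A, tree, D[42] + 0.01, T=242, c=4, p_tau=0.18))
```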

Page 18: Approximate Nearest Neighbor Queries with a Tiny Index


Variations of SRS-αβ(T, c, pτ)

The algorithm is the same as SRS-αβ(T, c, pτ) above; the variants differ in which stopping condition is used:

1. SRS-α: better quality; query cost is O(T).
2. SRS-β: best quality; query cost bounded by O(n); handles c = 1.
3. SRS-αβ(T, c′, pτ): better quality; query cost bounded by O(T).

All with success probability at least the bound stated in the main theorem.

Page 19: Approximate Nearest Neighbor Queries with a Tiny Index


Other Results

- SRS can be easily extended to support top-k c-ANN queries (k > 1).
- No previously known method guarantees the correctness of the returned results.
- We guarantee correctness with probability at least pτ whenever SRS-αβ stops due to the early-termination condition: ≈100% in practice (97% in theory).


Page 20: Approximate Nearest Neighbor Queries with a Tiny Index


Analysis


Page 21: Approximate Nearest Neighbor Queries with a Tiny Index


Stopping Condition α

- "Near" point: the NN point; denote its distance by r.
- "Far" points: points whose distance exceeds c · r.
- Then, for any κ > 0 and any point o (both bounds hold because ProjDist²(o)/Dist²(o) ~ χ²_m):
  - Pr[ProjDist(o) ≤ κ·r | o is the near point] ≥ Ψ_m(κ²)
  - Pr[ProjDist(o) ≤ κ·r | o is a far point] ≤ Ψ_m(κ²/c²)
- P1 ≔ Pr[the NN point is projected before κ·r] ≥ Ψ_m(κ²).
- P2 ≔ Pr[fewer than T far points are projected before κ·r] ≥ 1 − (n/T) · Ψ_m(κ²/c²), by Markov's inequality.
- Choose κ such that P1 + P2 − 1 > 0; this is feasible thanks to the good concentration bounds for χ²_m.

A small numerical feasibility check follows.
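A small numerical feasibility check of the argument above (a sketch; the parameter values are the illustrative ones from the earlier example slide): scan over κ and report the best value of P1 + P2 − 1:

```python
import numpy as np
from scipy import stats

def success_margin(kappa, m, c, n, T):
    P1 = stats.chi2.cdf(kappa**2, df=m)                          # NN projected before kappa*r
    P2 = 1.0 - (n / T) * stats.chi2.cdf(kappa**2 / c**2, df=m)   # < T far points before kappa*r
    return P1 + P2 - 1.0

m, c, n = 6, 4, 1_000_000
T = int(0.00242 * n)
kappas = np.linspace(0.5, 10.0, 200)
margins = [success_margin(k, m, c, n, T) for k in kappas]
best = int(np.argmax(margins))
print("best kappa:", round(float(kappas[best]), 2), "P1 + P2 - 1:", round(margins[best], 3))
```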


Page 22: Approximate Nearest Neighbor Queries with a Tiny Index


Choosing κ

[Figure: for c = 4, the densities of the scaled χ²_m distributions of ProjDist² for a near point (blue, scale 4) and a far point (red, scale 4·c² = 64); the mode of the χ²_m density is at m − 2, and the marker κ·r indicates the chosen threshold.]

Page 23: Approximate Nearest Neighbor Queries with a Tiny Index


ProjDist(OT): Case I

Consider the cases where both conditions (on the near point and the far points) hold, which happens with probability at least P1 + P2 − 1.

Case I: the NN point is projected no later than OT, so it is among the T accessed points and Omin = the NN point.

[Figure: positions of the projected distances in m dimensions relative to ProjDist(OT).]


Page 24: Approximate Nearest Neighbor Queries with a Tiny Index


ProjDist(OT): Case II

Case II: the NN point is projected after OT. Since the NN point is still projected before κ·r, all T accessed points have projected distance below κ·r; and since fewer than T far points are projected before κ·r, at least one accessed point has distance at most c·r. Hence Omin = a c-ANN point.

Again, both conditions hold with probability at least P1 + P2 − 1.

[Figure: positions of the projected distances in m dimensions relative to ProjDist(OT).]


Page 25: Approximate Nearest Neighbor Queries with a Tiny Index


Early-Termination Condition (β)

- We omit the proof here; it also relies on the fact that the squared projected distance (the sum of the m squared projections) follows a scaled χ²_m distribution.
- It is the key to handling the case c = 1: SRS returns the NN point with a guaranteed probability, which is impossible for LSH-based methods.
- It also guarantees the correctness of the top-k c-ANN points returned when the algorithm stops under this condition; no previous method offers such a guarantee.


Page 26: Approximate Nearest Neighbor Queries with a Tiny Index


Experiment Setup

Algorithms:
- LSB-forest [SIGMOD'09, TODS'10]
- C2LSH [SIGMOD'12]
- SRS-* [PVLDB 2015]

Data: see the datasets on the next slide.

Measures: index size, query cost, result quality, success probability.


Page 27: Approximate Nearest Neighbor Queries with a Tiny Index


Datasets

[Table of datasets; reported sizes include 16 GB, 369 GB, and 5.6 PB.]


Page 28: Approximate Nearest Neighbor Queries with a Tiny Index


Tiny Image Dataset (8M points, 384 dimensions)

- Fastest: SRS-αβ; slowest: C2LSH. Result quality is the other way around.
- SRS-α has quality comparable to C2LSH, yet at a much lower cost.
- SRS-* dominates LSB-forest.


Page 29: Approximate Nearest Neighbor Queries with a Tiny Index


Approximate Nearest Neighbor

Empirically better than the theoretical guarantee:
- With 15% of the I/Os of a linear scan, SRS returns the NN with probability 71%.
- With 62% of the I/Os of a linear scan, SRS returns the NN with probability 99.7%.


Page 30: Approximate Nearest Neighbor Queries with a Tiny Index


Large Dataset (0.45 Billion)


Page 31: Approximate Nearest Neighbor Queries with a Tiny Index


Summary

- c-ANN queries in arbitrarily high-dimensional space are reduced to kNN queries in a low-dimensional space.
- Our index size is approximately m/d of the size of the data file.
- This opens up a new direction for c-ANN queries in high-dimensional space: finding efficient solutions to the kNN problem in 6-10 dimensional space.


Page 32: Approximate Nearest Neighbor Queries with a Tiny Index


Q&A

Similarity Query Processing Project Homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
