Sketching, Sampling and other Sublinear Algorithms:
Nearest Neighbor Search
Alex Andoni (MSR SVC)
Nearest Neighbor Search (NNS)
Preprocess: a set of points P
Query: given a query point q, report a point p ∈ P with the smallest distance to q
Motivation
Generic setup:
- Points model objects (e.g. images)
- Distance models a (dis)similarity measure
Application areas:
- machine learning: k-NN rule
- speech/image/video/music recognition, vector quantization, bioinformatics, etc.
Distance can be:
- Hamming, Euclidean, edit distance, Earth-mover distance, etc.
Primitive for other problems:
- find the similar pairs in a set D, clustering, …
[Figure: two near-identical bit strings q and p, 000000011100010100000100010100011111 vs. 000000001100000100000100110100111111, illustrating Hamming distance]
Lecture Plan
1. Locality-Sensitive Hashing
2. LSH as a Sketch
3. Towards Embeddings
2D case
Compute the Voronoi diagram of the points
Given a query q, perform point location in the diagram
Performance:
- Space: O(n)
- Query time: O(log n)
High-dimensional case
All exact algorithms degrade rapidly with the dimension
In practice:
- When the dimension d is “low-medium”, kd-trees work reasonably
- When d is “high”, the state of the art is unsatisfactory
Algorithm                  | Query time   | Space
Full indexing              | O(d · log n) | n^O(d) (Voronoi diagram size)
No indexing (linear scan)  | O(d · n)     | O(d · n)
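The no-indexing baseline in the table above can be sketched in a few lines of Python (a toy illustration; the function name `linear_scan_nns` and the sample points are invented for this example):

```python
import math

def linear_scan_nns(points, q):
    """Exact NNS by brute force: O(n * d) per query, O(n * d) space."""
    best, best_dist = None, math.inf
    for p in points:
        d = math.dist(p, q)  # Euclidean distance between p and q
        if d < best_dist:
            best, best_dist = p, d
    return best

# The closest point to (0.9, 1.2) among the three is (1.0, 1.0)
print(linear_scan_nns([(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)], (0.9, 1.2)))
```

Every query touches every point, which is exactly the linear dependence on n that the indexing approaches below try to avoid.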
Approximate NNS
r-near neighbor: given a new point q, report a point p ∈ D s.t. dist(p, q) ≤ r
- Randomized: such a point is returned with 90% probability
c-approximate: report a point at distance at most cr if there exists a point at distance at most r
[Figure: query q, near neighbor p within radius r, approximation radius cr]
Heuristic for Exact NNS
r-near neighbor: given a new point q, report a set containing all points p s.t. dist(p, q) ≤ r (each with 90% probability)
- The set may also contain some c-approximate near neighbors p′ s.t. dist(p′, q) ≤ cr
- Can filter out these bad answers
Approximation Algorithms for NNS
A vast literature:
- milder dependence on dimension: [Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02], …, [Aiger-Kaplan-Sharir’13]
- little to no dependence on dimension: [Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ’01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A-Indyk’06], …, [A-Indyk-Nguyen-Razenshteyn’??]
Locality-Sensitive Hashing
[Indyk-Motwani’98]
Random hash function g on R^d s.t. for any points p, q:
- Close: when dist(p, q) ≤ r, Pr[g(p) = g(q)] = P1 is “not-so-small”
- Far: when dist(p, q) > cr, Pr[g(p) = g(q)] = P2 is “small”
Use several hash tables: L = n^ρ, where ρ = log(1/P1) / log(1/P2)
[Figure: Pr[g(p) = g(q)] as a function of dist(p, q), dropping from P1 at distance r to P2 at distance cr]
Locality sensitive hash functions
Hash function g is usually a concatenation of “primitive” functions:
g(p) = ⟨h1(p), h2(p), …, hk(p)⟩
Example: Hamming space {0,1}^d
- h(p) = p_i, i.e., choose the i-th bit for a random i
- g chooses k bits at random
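A sketch of such a primitive-concatenating hash for Hamming space (the helper `make_g` is a name invented here; points are bit strings):

```python
import random

def make_g(d, k, rng):
    """g concatenates k primitive functions h_i(p) = p[j] for random coords j."""
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[j] for j in coords)

rng = random.Random(0)
g = make_g(6, 3, rng)       # one random hash function on {0,1}^6
p, q = "010110", "010100"   # Hamming distance 1 apart
print(g(p), g(q))           # collide unless a sampled coord hits the差 differing bit
```

Close points collide unless one of the k sampled coordinates lands on a differing bit, which is what makes P1 large for small Hamming distance.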
Formal description
Data structure is just L hash tables:
- Each hash table uses a fresh random function g_i(p) = ⟨h_{i,1}(p), …, h_{i,k}(p)⟩
- Hash all dataset points into the table
Query:
- Check for collisions in each of the L hash tables until we encounter a point within distance cr
Guarantees:
- Space: O(nL), plus space to store the points
- Query time: O(L · (k + d)) (in expectation)
- 50% probability of success
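Putting the L tables together, a minimal toy version of this data structure for Hamming space might look as follows (the class name `HammingLSH` and the parameter values are illustrative, not tuned):

```python
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    """L hash tables; table i hashes points by k random bit positions."""
    def __init__(self, points, d, k, L, seed=0):
        rng = random.Random(seed)
        # A fresh random g_i per table: k coordinates chosen at random
        self.projs = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for p in points:
            for proj, table in zip(self.projs, self.tables):
                table[tuple(p[j] for j in proj)].append(p)

    def query(self, q, r):
        """Scan colliding buckets; return the first point within distance r."""
        for proj, table in zip(self.projs, self.tables):
            for p in table.get(tuple(q[j] for j in proj), []):
                if hamming(p, q) <= r:
                    return p
        return None

idx = HammingLSH(["0101", "1111"], d=4, k=2, L=3)
print(idx.query("0101", r=0))
```

The query stops at the first acceptable collision, matching the "until we encounter a point within distance cr" rule above.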
Analysis of LSH Scheme
How did we pick k and L? For fixed k, we have
- Pr[collision of a close pair] = P1^k
- Pr[collision of a far pair] = P2^k
Want to make P2^k = 1/n: set k = log n / log(1/P2)
Then P1^k = n^(−ρ), so set L = n^ρ, where ρ = log(1/P1) / log(1/P2)
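A quick numeric sketch of this parameter choice (the helper `lsh_params` is a name invented here, and the values of P1, P2 are made up for illustration):

```python
import math

def lsh_params(n, p1, p2):
    """Set k so that p2**k = 1/n, then L = n**rho tables."""
    k = math.ceil(math.log(n) / math.log(1 / p2))
    rho = math.log(1 / p1) / math.log(1 / p2)
    L = math.ceil(n ** rho)
    return k, L, rho

k, L, rho = lsh_params(n=10**6, p1=0.9, p2=0.5)
print(k, L, round(rho, 3))  # 20 9 0.152
```

Even with a million points, ρ ≈ 0.15 keeps the number of tables in the single digits here, which is the point of making ρ < 1.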
Analysis: Correctness
Let p be an r-near neighbor of q
- If p does not exist, the algorithm can output anything
Algorithm fails when: the near neighbor p is not in the searched buckets
Probability of failure:
- Probability p, q do not collide in one hash table: ≤ 1 − P1^k
- Probability they do not collide in any of the L hash tables: at most (1 − P1^k)^L = (1 − n^(−ρ))^(n^ρ) ≤ 1/e
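The failure bound can be checked numerically (parameter values are illustrative, chosen as in a P1 = 0.9, P2 = 0.5, n = 10^6 example):

```python
import math

# A near pair collides in one table w.p. p1**k; with L tables the failure
# probability is (1 - p1**k)**L, which the analysis bounds by 1/e.
p1, k, L = 0.9, 20, 9
per_table = p1 ** k          # collision prob. of a near pair in one table
fail = (1 - per_table) ** L  # prob. the near neighbor is missed everywhere
print(round(fail, 3), fail < 1 / math.e)
```

Repeating the whole structure a constant number of times drives this constant failure probability down to any desired level.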
Analysis: Runtime
Runtime dominated by:
- Hash function evaluation: O(L · k) time
- Distance computations to points in buckets
Distance computations:
- Care only about far points, at distance > cr
- In one hash table, the probability that a far point collides with q is at most P2^k = 1/n
- Expected number of far points in a bucket: n · (1/n) = 1
- Over L hash tables, the expected number of far points is L
Total: O(L · (k + d)) in expectation
LSH in the wild
If we want exact NNS, what is c?
- Can choose any parameters L, k
- Correct as long as the near neighbor collides in at least one searched table
Performance:
- Trade-off between the number of tables and false positives
- Will depend on dataset “quality”
- Can tune k, L to optimize for a given dataset
Further advantages:
- Point insertions/deletions are easy
- Natural to distribute computation/hash tables in a cluster
[Figure: larger k → fewer false positives; smaller L → fewer tables, but safety not guaranteed]
LSH Zoo
Hamming distance [IM’98]
- h(p) = p_i: pick a random coordinate (or several)
Manhattan distance: homework
Jaccard coefficient between sets A, B: J(A, B) = |A ∩ B| / |A ∪ B|
- h: pick a random permutation π on the universe; min-wise hashing [Bro’97]: h(A) = min_{a ∈ A} π(a)
Euclidean distance: next lecture
[Example: min-wise hashing of “To be or not to be” → {be, not, or, to} and “To sketch or not to sketch” → {not, or, to, sketch}; a random permutation π of the universe {be, to, sketch, or, not} maps each set to a single representative element, and the two sets collide when those representatives coincide]
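The min-wise hashing idea behind this example can be sketched as follows; the collision rate of the signatures estimates the Jaccard coefficient (the helper names `minhash_sig` and `jaccard` are invented for illustration):

```python
import random

def minhash_sig(s, perms):
    """One min-wise hash per permutation: h(S) = min over a in S of pi(a)."""
    return tuple(min(pi[x] for x in s) for pi in perms)

def jaccard(a, b):
    return len(a & b) / len(a | b)

universe = ["be", "to", "sketch", "or", "not"]
rng = random.Random(1)
perms = []
for _ in range(200):                       # 200 random permutations
    order = universe[:]
    rng.shuffle(order)
    perms.append({w: i for i, w in enumerate(order)})

A = {"be", "not", "or", "to"}              # "to be or not to be"
B = {"not", "or", "to", "sketch"}          # "to sketch or not to sketch"
sa, sb = minhash_sig(A, perms), minhash_sig(B, perms)
est = sum(x == y for x, y in zip(sa, sb)) / len(perms)
print(jaccard(A, B), est)  # Pr[minhash collision] = Jaccard = 3/5
```

Since Pr[h(A) = h(B)] equals J(A, B) exactly, the empirical collision fraction concentrates around 0.6 as the number of permutations grows.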