Sketching, Sampling and other Sublinear Algorithms:
Nearest Neighbor Search
Alex Andoni (MSR SVC)
Nearest Neighbor Search (NNS)
Preprocess: a set of points P
Query: given a query point q, report a point p ∈ P with the smallest distance to q
Motivation
Generic setup:
- Points model objects (e.g. images)
- Distance models a (dis)similarity measure
Application areas:
- machine learning: k-NN rule
- speech/image/video/music recognition, vector quantization, bioinformatics, etc.
Distance can be:
- Hamming, Euclidean, edit distance, Earth-mover distance, etc.
Primitive for other problems:
- find the similar pairs in a set D, clustering, …
[Figure: two near-identical bit strings q and p, 000000011100010100000100010100011111 vs. 000000001100000100000100110100111111, illustrating Hamming distance]
Lecture Plan
1. Locality-Sensitive Hashing
2. LSH as a Sketch
3. Towards Embeddings
2D case
Compute the Voronoi diagram of the points
Given a query q, perform point location in the diagram
Performance:
- Space: O(n)
- Query time: O(log n)
High-dimensional case
All exact algorithms degrade rapidly with the dimension
In practice:
- When the dimension d is “low-medium”, kd-trees work reasonably
- When d is “high”, the state of the art is unsatisfactory
Algorithm                  | Query time   | Space
Full indexing              | O(d · log n) | n^O(d) (Voronoi diagram size)
No indexing (linear scan)  | O(d · n)     | O(d · n)
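The no-indexing baseline in the table above can be sketched in a few lines of Python (a toy illustration; the function name `linear_scan_nns` and the sample points are invented for this example):

```python
import math

def linear_scan_nns(points, q):
    """Exact NNS by brute force: O(n * d) per query, O(n * d) space."""
    best, best_dist = None, math.inf
    for p in points:
        d = math.dist(p, q)  # Euclidean distance between p and q
        if d < best_dist:
            best, best_dist = p, d
    return best

# The closest point to (0.9, 1.2) among the three is (1.0, 1.0)
print(linear_scan_nns([(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)], (0.9, 1.2)))
```

Every query touches every point, which is exactly the linear dependence on n that the indexing approaches below try to avoid.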
Approximate NNS
r-near neighbor: given a new point q, report a point p ∈ D s.t. dist(p, q) ≤ r
- Randomized: such a point is returned with 90% probability
c-approximate: report a point at distance at most cr if there exists a point at distance at most r
[Figure: query q, near neighbor p within radius r, approximation radius cr]
Heuristic for Exact NNS
r-near neighbor: given a new point q, report a set containing all points p s.t. dist(p, q) ≤ r (each with 90% probability)
- The set may also contain some c-approximate near neighbors p′ s.t. dist(p′, q) ≤ cr
- Can filter out these bad answers
Approximation Algorithms for NNS
A vast literature:
- milder dependence on dimension: [Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02], …, [Aiger-Kaplan-Sharir’13]
- little to no dependence on dimension: [Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ’01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A-Indyk’06], …, [A-Indyk-Nguyen-Razenshteyn’??]
Locality-Sensitive Hashing
[Indyk-Motwani’98]
Random hash function g on R^d s.t. for any points p, q:
- Close: when dist(p, q) ≤ r, Pr[g(p) = g(q)] = P1 is “not-so-small”
- Far: when dist(p, q) > cr, Pr[g(p) = g(q)] = P2 is “small”
Use several hash tables: L = n^ρ, where ρ = log(1/P1) / log(1/P2)
[Figure: Pr[g(p) = g(q)] as a function of dist(p, q), dropping from P1 at distance r to P2 at distance cr]
Locality sensitive hash functions
Hash function g is usually a concatenation of “primitive” functions:
g(p) = ⟨h1(p), h2(p), …, hk(p)⟩
Example: Hamming space {0,1}^d
- h(p) = p_i, i.e., choose the i-th bit for a random i
- g chooses k bits at random
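A sketch of such a primitive-concatenating hash for Hamming space (the helper `make_g` is a name invented here; points are bit strings):

```python
import random

def make_g(d, k, rng):
    """g concatenates k primitive functions h_i(p) = p[j] for random coords j."""
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[j] for j in coords)

rng = random.Random(0)
g = make_g(6, 3, rng)       # one random hash function on {0,1}^6
p, q = "010110", "010100"   # Hamming distance 1 apart
print(g(p), g(q))           # collide unless a sampled coord hits the差 differing bit
```

Close points collide unless one of the k sampled coordinates lands on a differing bit, which is what makes P1 large for small Hamming distance.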
Formal description
Data structure is just L hash tables:
- Each hash table uses a fresh random function g_i(p) = ⟨h_{i,1}(p), …, h_{i,k}(p)⟩
- Hash all dataset points into the table
Query:
- Check for collisions in each of the L hash tables until we encounter a point within distance cr
Guarantees:
- Space: O(nL), plus space to store the points
- Query time: O(L · (k + d)) (in expectation)
- 50% probability of success
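Putting the L tables together, a minimal toy version of this data structure for Hamming space might look as follows (the class name `HammingLSH` and the parameter values are illustrative, not tuned):

```python
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    """L hash tables; table i hashes points by k random bit positions."""
    def __init__(self, points, d, k, L, seed=0):
        rng = random.Random(seed)
        # A fresh random g_i per table: k coordinates chosen at random
        self.projs = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for p in points:
            for proj, table in zip(self.projs, self.tables):
                table[tuple(p[j] for j in proj)].append(p)

    def query(self, q, r):
        """Scan colliding buckets; return the first point within distance r."""
        for proj, table in zip(self.projs, self.tables):
            for p in table.get(tuple(q[j] for j in proj), []):
                if hamming(p, q) <= r:
                    return p
        return None

idx = HammingLSH(["0101", "1111"], d=4, k=2, L=3)
print(idx.query("0101", r=0))
```

The query stops at the first acceptable collision, matching the "until we encounter a point within distance cr" rule above.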
Analysis of LSH Scheme
How did we pick k and L? For fixed k, we have
- Pr[collision of a close pair] = P1^k
- Pr[collision of a far pair] = P2^k
Want to make P2^k = 1/n: set k = log n / log(1/P2)
Then P1^k = n^(−ρ), so set L = n^ρ, where ρ = log(1/P1) / log(1/P2)
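A quick numeric sketch of this parameter choice (the helper `lsh_params` is a name invented here, and the values of P1, P2 are made up for illustration):

```python
import math

def lsh_params(n, p1, p2):
    """Set k so that p2**k = 1/n, then L = n**rho tables."""
    k = math.ceil(math.log(n) / math.log(1 / p2))
    rho = math.log(1 / p1) / math.log(1 / p2)
    L = math.ceil(n ** rho)
    return k, L, rho

k, L, rho = lsh_params(n=10**6, p1=0.9, p2=0.5)
print(k, L, round(rho, 3))  # 20 9 0.152
```

Even with a million points, ρ ≈ 0.15 keeps the number of tables in the single digits here, which is the point of making ρ < 1.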
Analysis: Correctness
Let p be an r-near neighbor of q
- If p does not exist, the algorithm can output anything
Algorithm fails when: the near neighbor p is not in the searched buckets
Probability of failure:
- Probability p, q do not collide in one hash table: ≤ 1 − P1^k
- Probability they do not collide in any of the L hash tables: at most (1 − P1^k)^L = (1 − n^(−ρ))^(n^ρ) ≤ 1/e
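The failure bound can be checked numerically (parameter values are illustrative, chosen as in a P1 = 0.9, P2 = 0.5, n = 10^6 example):

```python
import math

# A near pair collides in one table w.p. p1**k; with L tables the failure
# probability is (1 - p1**k)**L, which the analysis bounds by 1/e.
p1, k, L = 0.9, 20, 9
per_table = p1 ** k          # collision prob. of a near pair in one table
fail = (1 - per_table) ** L  # prob. the near neighbor is missed everywhere
print(round(fail, 3), fail < 1 / math.e)
```

Repeating the whole structure a constant number of times drives this constant failure probability down to any desired level.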
Analysis: Runtime
Runtime dominated by:
- Hash function evaluation: O(L · k) time
- Distance computations to points in buckets
Distance computations:
- Care only about far points, at distance > cr
- In one hash table, the probability that a far point collides with q is at most P2^k = 1/n
- Expected number of far points in a bucket: n · (1/n) = 1
- Over L hash tables, the expected number of far points is L
Total: O(L · (k + d)) in expectation
LSH in the wild
If we want exact NNS, what is c?
- Can choose any parameters L, k
- Correct as long as the near neighbor collides in at least one searched table
Performance:
- Trade-off between the number of tables and false positives
- Will depend on dataset “quality”
- Can tune k, L to optimize for a given dataset
Further advantages:
- Point insertions/deletions are easy
- Natural to distribute computation/hash tables in a cluster
[Figure: larger k → fewer false positives; smaller L → fewer tables, but safety not guaranteed]
LSH Zoo
Hamming distance [IM’98]
- h(p) = p_i: pick a random coordinate (or several)
Manhattan distance: homework
Jaccard coefficient between sets A, B: J(A, B) = |A ∩ B| / |A ∪ B|
- h: pick a random permutation π on the universe; min-wise hashing [Bro’97]: h(A) = min_{a ∈ A} π(a)
Euclidean distance: next lecture
[Example: min-wise hashing of “To be or not to be” → {be, not, or, to} and “To sketch or not to sketch” → {not, or, to, sketch}; a random permutation π of the universe {be, to, sketch, or, not} maps each set to a single representative element, and the two sets collide when those representatives coincide]
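The min-wise hashing idea behind this example can be sketched as follows; the collision rate of the signatures estimates the Jaccard coefficient (the helper names `minhash_sig` and `jaccard` are invented for illustration):

```python
import random

def minhash_sig(s, perms):
    """One min-wise hash per permutation: h(S) = min over a in S of pi(a)."""
    return tuple(min(pi[x] for x in s) for pi in perms)

def jaccard(a, b):
    return len(a & b) / len(a | b)

universe = ["be", "to", "sketch", "or", "not"]
rng = random.Random(1)
perms = []
for _ in range(200):                       # 200 random permutations
    order = universe[:]
    rng.shuffle(order)
    perms.append({w: i for i, w in enumerate(order)})

A = {"be", "not", "or", "to"}              # "to be or not to be"
B = {"not", "or", "to", "sketch"}          # "to sketch or not to sketch"
sa, sb = minhash_sig(A, perms), minhash_sig(B, perms)
est = sum(x == y for x, y in zip(sa, sb)) / len(perms)
print(jaccard(A, B), est)  # Pr[minhash collision] = Jaccard = 3/5
```

Since Pr[h(A) = h(B)] equals J(A, B) exactly, the empirical collision fraction concentrates around 0.6 as the number of permutations grows.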