Sketching, Sampling and other Sublinear Algorithms: Nearest Neighbor Search Alex Andoni (MSR SVC)


DESCRIPTION

We will learn about modern algorithmic techniques for handling large datasets, often by using imprecise but concise representations of the data, such as a sketch or a sample. The lectures cluster around three themes:

Nearest Neighbor Search (similarity search): given a set of objects (e.g., images), construct a data structure so that later, given a query object, one can efficiently find the most similar object in the database.

Streaming: solve a problem on a large collection of items that is streamed through once, i.e., the algorithm's memory footprint is much smaller than the dataset itself. For example, how can a router with 1Mb of memory estimate the number of distinct IPs it sees in a multi-gigabyte real-time traffic stream?

Parallel: problems where neither the data nor the output fits on a single machine. For example, given a set of 2D points, how can we compute their minimum spanning tree over a cluster of machines?

The focus will be on techniques such as sketching, dimensionality reduction, sampling, and hashing.


Page 1: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Sketching, Sampling and other Sublinear Algorithms:

Nearest Neighbor Search

Alex Andoni (MSR SVC)

Page 2: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Nearest Neighbor Search (NNS)

Preprocess: a set of points D

Query: given a query point q, report a point p ∈ D with the smallest distance to q

Page 3: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Motivation

Generic setup:
Points model objects (e.g., images)
Distance models (dis)similarity measure

Application areas: machine learning (k-NN rule), speech/image/video/music recognition, vector quantization, bioinformatics, etc.

Distance can be: Hamming, Euclidean, edit distance, Earth-mover distance, etc.

Primitive for other problems: find the similar pairs in a set D, clustering, …

(Illustration: two binary strings p and q compared under the Hamming distance: 000000011100010100000100010100011111 and 000000001100000100000100110100111111.)

Page 4: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Lecture Plan
1. Locality-Sensitive Hashing
2. LSH as a Sketch
3. Towards Embeddings

Page 5: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

2D case

Compute the Voronoi diagram of the point set

Given a query q, perform point location in the diagram

Performance:
Space: O(n)
Query time: O(log n)

Page 6: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

High-dimensional case

All exact algorithms degrade rapidly with the dimension d

In practice:
When d is "low-medium", kd-trees work reasonably (a minimal kd-tree sketch follows the table below)
When d is "high", the state of the art is unsatisfactory

Algorithm                    Query time     Space
Full indexing                O(d · log n)   n^O(d)  (Voronoi diagram size)
No indexing – linear scan    O(n · d)       O(n · d)
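As a point of comparison for the "low-medium" dimensional regime, here is a minimal sketch of the kd-tree baseline, assuming SciPy is available; the dataset size, dimension, and query below are made-up illustrations, not anything from the lecture.

```python
# Sketch: exact NNS with a kd-tree (works well in low/medium dimension).
# Assumes SciPy is installed; data and query are arbitrary examples.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 8))      # n = 10,000 points in d = 8 dimensions
tree = cKDTree(points)                # preprocessing

query = rng.random(8)
dist, idx = tree.query(query, k=1)    # exact nearest neighbor
print(f"nearest point index {idx}, distance {dist:.4f}")
```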

Page 7: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Approximate NNS

r-near neighbor: given a new query point q, report a point p s.t. dist(p, q) ≤ r

Randomized: a point is returned with 90% probability

c-approximate: it suffices to report a point within distance cr, provided there exists a point at distance ≤ r
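To make the definition concrete, here is a minimal brute-force routine for the (c, r)-near neighbor problem; the function name and the choice of Euclidean distance are illustrative assumptions, and this scan is deterministic rather than succeeding with 90% probability.

```python
# Sketch: the (c, r)-near neighbor problem solved by brute force.
# A correct data structure may return any point within distance c*r,
# provided some point lies within distance r of the query.
import numpy as np

def near_neighbor_linear_scan(points, q, r, c):
    """Return a point within distance c*r of q, or None."""
    dists = np.linalg.norm(points - q, axis=1)
    best = np.argmin(dists)
    if dists[best] <= r:          # an r-near neighbor exists ...
        return points[best]       # ... so any point within c*r is acceptable
    return None                   # otherwise any answer (or none) is allowed
```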

Page 8: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Heuristic for Exact NNS

r-near neighbor: given a new query point q, report the set of all points p s.t. dist(p, q) ≤ r (each reported with 90% probability)

The set may also contain some c-approximate neighbors, i.e. points p with dist(p, q) ≤ cr

These bad answers can be filtered out by checking the distances explicitly

Page 9: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Approximation Algorithms for NNS

A vast literature: milder dependence on dimension

[Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02],…[Aiger-Kaplan-Sharir’13],

little to no dependence on dimension

[Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ‘01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A-Indyk’06],… [A-Indyk-Nguyen-Razenshteyn’??]

Page 10: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Locality-Sensitive Hashing

Random hash function g on R^d s.t. for any points p, q:
Close: when dist(p, q) ≤ r, Pr[g(p) = g(q)] = P1 is "not-so-small"
Far: when dist(p, q) > cr, Pr[g(p) = g(q)] = P2 is "small"

Use several hash tables: L = n^ρ, where

ρ = log(1/P1) / log(1/P2)

[Indyk-Motwani’98]

(Figure: collision probability Pr[g(p) = g(q)] as a function of dist(p, q), dropping from P1 at distance r to P2 at distance cr.)

Page 11: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Locality sensitive hash functions


Hash function g is usually a concatenation of k "primitive" functions: g(p) = ⟨h1(p), h2(p), …, hk(p)⟩

Example: Hamming space: h(p) = p_j, i.e., the j-th bit of p for a random coordinate j; g thus chooses k bits at random (a minimal sketch of this follows below)
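A minimal sketch of the bit-sampling construction just described, assuming points are 0/1 vectors; the helper name make_g and the specific parameters are illustrative.

```python
# Sketch: LSH for Hamming space by bit sampling.
# A primitive function h picks one random coordinate; g concatenates k of them.
import random

def make_g(d, k, seed=None):
    """Return g: {0,1}^d -> tuple of k sampled bits."""
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]   # k random coordinates
    return lambda p: tuple(p[j] for j in coords)

g = make_g(d=6, k=3, seed=1)
print(g([0, 1, 1, 0, 1, 0]))   # a 3-bit hash value, depending on the sampled coordinates
```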

Page 12: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Formal description


Data structure is just L hash tables:
Each hash table uses a fresh random function g_i(p) = ⟨h_{i,1}(p), …, h_{i,k}(p)⟩
Hash all dataset points into the table

Query:
Check for collisions in each of the L hash tables until we encounter a point within distance cr

Guarantees:
Space: O(nL), plus the space to store the points
Query time: O(L · (k + d)) in expectation
50% probability of success
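Putting the slide together, here is a minimal self-contained sketch of the data structure it describes, specialized to Hamming space with bit sampling; the class name HammingLSH and all parameter values are illustrative assumptions.

```python
# Sketch: the LSH data structure from this slide, for 0/1 vectors
# under Hamming distance, using bit sampling as the primitive hash.
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    def __init__(self, points, k, L, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        self.points = points
        # L independent hash functions, each a concatenation of k sampled bits
        self.coords = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for i, p in enumerate(points):
            for coords, table in zip(self.coords, self.tables):
                table[tuple(p[j] for j in coords)].append(i)

    def query(self, q, cr):
        """Scan colliding points table by table; stop at a point within cr."""
        for coords, table in zip(self.coords, self.tables):
            for i in table.get(tuple(q[j] for j in coords), []):
                if hamming(self.points[i], q) <= cr:
                    return self.points[i]
        return None
```

In practice k and L would be chosen as on the following slides (k ≈ log n / log(1/P2), L ≈ n^ρ).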

Page 13: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Analysis of LSH Scheme


How did we pick k and L? For a fixed k, we have

Pr[collision of a close pair] = P1^k
Pr[collision of a far pair] = P2^k

Want to make P2^k ≤ 1/n (so each table has O(1) far collisions in expectation)
Set k = log n / log(1/P2); then P1^k = n^(−ρ), so L = n^ρ tables suffice, where ρ = log(1/P1) / log(1/P2)
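For concreteness, a small calculation of these parameters, using the bit-sampling collision probabilities P1 = 1 − r/d and P2 = 1 − cr/d for Hamming space as an illustrative choice; all the numbers are made up.

```python
# Sketch: choosing k and L as in the analysis above.
# P1, P2 use the bit-sampling probabilities for Hamming space
# (an illustrative choice; any LSH family plugs in the same way).
import math

n, d, r, c = 100_000, 128, 8, 2
P1 = 1 - r / d            # collision prob. of a close pair (distance <= r)
P2 = 1 - c * r / d        # collision prob. of a far pair (distance > c*r)

k = math.ceil(math.log(n) / math.log(1 / P2))   # make P2^k <= 1/n
rho = math.log(1 / P1) / math.log(1 / P2)
L = math.ceil(n ** rho)                          # number of hash tables

print(f"k = {k}, rho = {rho:.3f}, L = {L}")
```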

Page 14: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Analysis: Correctness


Let p* be an r-near neighbor of q
If p* does not exist, the algorithm can output anything

The algorithm fails when the near neighbor p* is not in any of the searched buckets

Probability of failure:
Probability that q and p* do not collide in one hash table: at most 1 − P1^k
Probability that they do not collide in any of the L hash tables: at most (1 − P1^k)^L ≤ (1 − 1/L)^L ≤ 1/e  (using P1^k = n^(−ρ) = 1/L)
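A tiny numeric check of this bound, under the illustrative values from the previous sketch and the assumption L = ⌈1/P1^k⌉:

```python
# Sketch: failure probability (1 - P1^k)^L when L = ceil(1 / P1^k).
# The bound in the text says this is at most (1 - 1/L)^L <= 1/e ~ 0.37.
import math

P1, k = 0.9375, 87                      # illustrative values
L = math.ceil(1 / P1 ** k)
failure = (1 - P1 ** k) ** L
print(f"L = {L}, failure probability <= {failure:.3f} (1/e = {1 / math.e:.3f})")
```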

Page 15: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

Analysis: Runtime


Runtime is dominated by:
Hash function evaluations: O(L · k) time
Distance computations to the points in the scanned buckets

Distance computations:
We only care about far points, at distance > cr
In one hash table, the probability that a far point collides with q is at most P2^k = 1/n
Expected number of far points in a bucket: n · (1/n) = 1
Over the L hash tables, the expected number of far points examined is L

Total: O(L · (k + d)) in expectation

Page 16: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

LSH in the wild


If we want exact NNS, how do we set the parameters?
Can choose any parameters L and k
Correct as long as the near neighbor collides with the query in at least one table

Performance:
trade-off between the number of tables and the number of false positives
will depend on the dataset "quality"
Can tune k and L to optimize for a given dataset

Further advantages:
Point insertions/deletions are easy
Natural to distribute the computation/hash tables over a cluster

(Diagram: increasing k gives fewer false positives; decreasing L gives fewer tables; tuned heuristically, correctness ("safety") is no longer guaranteed.)

Page 17: Sketching, Sampling, and other Sublinear Algorithms 1 (Lecture by Alex Andoni)

LSH Zoo


Hamming distance [IM’98]: h(p) = p_j, i.e., pick a random coordinate (or several)

Manhattan distance: homework

Jaccard distance between sets A, B: J(A, B) = 1 − |A ∩ B| / |A ∪ B|
min-wise hashing [Bro’97]: pick a random permutation π on the universe and hash a set to its π-smallest element; then Pr[h(A) = h(B)] = |A ∩ B| / |A ∪ B|

Euclidean distance: next lecture

Example: "To be or not to be" → {be, not, or, to}; "To sketch or not to sketch" → {not, or, to, sketch}. Over the universe {be, to, sketch, or, not}, a random permutation makes the two sets collide on their minimum element with probability equal to their Jaccard similarity, here 3/5 (a minimal min-hash sketch follows below).
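A minimal min-hash sketch matching the example above; the Monte Carlo estimate of the collision rate is only for illustration.

```python
# Sketch: min-wise hashing for Jaccard similarity, as in the example above.
# Two sets collide (have equal min-hash) with probability |A ∩ B| / |A ∪ B|.
import random

A = {"be", "not", "or", "to"}            # "to be or not to be"
B = {"not", "or", "to", "sketch"}        # "to sketch or not to sketch"
universe = sorted(A | B)

def minhash(S, perm):
    # the element of S that appears first under the permutation `perm`
    return min(S, key=perm.index)

trials, collisions = 10_000, 0
rng = random.Random(0)
for _ in range(trials):
    perm = universe[:]
    rng.shuffle(perm)
    collisions += minhash(A, perm) == minhash(B, perm)

print(f"empirical collision rate {collisions / trials:.3f}, "
      f"Jaccard similarity = {len(A & B) / len(A | B):.3f}")
```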