Sketching, Sampling and other Sublinear Algorithms: Euclidean space: dimension reduction and NNS
Alex Andoni (MSR SVC)



DESCRIPTION

We will learn about modern algorithmic techniques for handling large datasets, often by using imprecise but concise representations of the data, such as a sketch or a sample of the data. The lectures will cluster around three themes. Nearest Neighbor Search (similarity search): the general problem is, given a set of objects (e.g., images), to construct a data structure so that later, given a query object, one can efficiently find the most similar object from the database. Streaming framework: we are required to solve a certain problem on a large collection of items that we stream through once (i.e., the algorithm's memory footprint is much smaller than the dataset itself). For example, how can a router with 1 MB of memory estimate the number of distinct IPs it sees in multi-gigabyte real-time traffic? Parallel framework: we look at problems where neither the data nor the output fits on a single machine. For example, given a set of 2D points, how can we compute the minimum spanning tree over a cluster of machines? The focus will be on techniques such as sketching, dimensionality reduction, sampling, hashing, and others.


Page 1: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Sketching, Sampling and other Sublinear Algorithms:

Euclidean space: dimension reduction and NNS

Alex Andoni (MSR SVC)

Page 2: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

A Sketching Problem


Sketching: a map 𝑓 from objects to short bit-strings, such that given 𝑓(𝑥) and 𝑓(𝑦) one should be able to deduce whether 𝑥 and 𝑦 are "similar".

Why? To reduce the space and time needed to compute similarity.

[Figure: "To be or not to be" and "To sketch or not to sketch" are each mapped by 𝑓 to short bit-strings (010110, 010101); from the sketches alone, decide: similar?]

Page 3: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Sketch from LSH


LSH often has the property that the collision probability Pr[g(p) = g(q)] is a (monotone) function of dist(p, q).

Sketching from LSH: set f(p) = (g_1(p), ..., g_k(p)) for k independent LSH functions. Estimate Pr[g(p) = g(q)] by the fraction of collisions between the coordinates of f(p) and f(q); k controls the variance of the estimate.

[Figure: Pr[g(p) = g(q)] plotted against dist(p, q), decreasing from 1.]

[Broder’97]: min-wise hashing gives such a sketch for the Jaccard coefficient.
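As a concrete instance of the [Broder’97] construction above, here is a minimal min-hash sketch in Python (illustrative only; the hash functions, the sketch length k, and the two example sets are assumptions, not from the slides). The fraction of coordinate collisions between two sketches estimates the Jaccard coefficient |A ∩ B| / |A ∪ B|, and k controls the variance of that estimate.

```python
import random

def minhash_sketch(items, k, seed=0):
    """Sketch a set as k minimum hash values, one per random hash function."""
    rng = random.Random(seed)
    # k independent "hash functions": random 64-bit salts mixed into Python's hash()
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_jaccard(sk_a, sk_b):
    """Fraction of coordinates on which the two sketches collide."""
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

A = set("to be or not to be".split())
B = set("to sketch or not to sketch".split())
k = 200
est = estimate_jaccard(minhash_sketch(A, k), minhash_sketch(B, k))
true = len(A & B) / len(A | B)
print(f"estimate={est:.2f}  true Jaccard={true:.2f}")
```

Using Python's built-in hash with random salts only approximates the random permutations of the min-wise hashing analysis, but it is enough to see the estimate tighten as k grows.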

Page 4: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

General Theory: embeddings

The above map f is an embedding. General motivation: given a distance (metric) d, solve a computational problem P under d.

Distances d: Euclidean distance (ℓ2); Hamming distance; edit distance between two strings; Earth-Mover (transportation) Distance.

Problems P: compute the distance between two points; diameter / closest pair of a set S; clustering, MST, etc.; Nearest Neighbor Search.

Reduce <problem P under a hard metric> to <P under a simpler metric> via the map f.

Page 5: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Embeddings: landscape

Definition: an embedding of a metric (X, d_X) into a host metric (Y, d_Y) is a map f : X → Y such that for any x, y in X:
d_X(x, y) ≤ d_Y(f(x), f(y)) ≤ c · d_X(x, y),
where c ≥ 1 is the distortion (approximation) of the embedding f.

Embeddings come in all shapes and colors: source/host spaces; distortion c; can be randomized (the guarantee holds with probability 1 − δ); time to compute f(x).

Types of embeddings:
From a norm (ℓ2) into the same norm but of lower dimension (dimension reduction).
From non-norms (edit distance, Earth-Mover Distance) into a norm (ℓ1).
From a given finite metric (e.g., shortest path on a planar graph) into a norm (ℓ1).
Into not a metric but a computational procedure: sketches.

Page 6: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Dimension Reduction

Johnson–Lindenstrauss Lemma: there is a randomized linear map f : R^d → R^k, with k = O(ε^{-2} log(1/δ)), that preserves the distance between any two fixed vectors up to distortion 1 + ε with probability 1 − δ (the constant hidden in the O(·) is absolute).

By a union bound over pairs, it preserves all distances among n points for k = O(ε^{-2} log n).

Motivation, e.g.: the diameter of a pointset of n points in d-dimensional Euclidean space.
Trivially: O(n² · d) time.
Using the lemma: project to k = O(ε^{-2} log n) dimensions in O(n · d · k) time, then compute the diameter there in O(n² · k) time, for a 1 + ε approximation.
MANY applications: nearest neighbor search, streaming, pattern matching, approximation algorithms (clustering), …
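A minimal illustration of the diameter application just described (assuming numpy; the toy point set, ε, and the constant in the target dimension are illustrative choices, not taken from the lecture): project the n points by a single scaled Gaussian matrix and compute the diameter in the reduced dimension.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 200, 5000, 0.3
X = rng.standard_normal((n, d))               # n toy points in R^d

k = int(np.ceil(4 * np.log(n) / eps**2))      # k = O(eps^-2 log n); the constant 4 is a guess
G = rng.standard_normal((k, d)) / np.sqrt(k)  # scaled Gaussian projection matrix
Y = X @ G.T                                   # project all points: O(n * d * k) time

def diameter(P):
    # brute force over all pairs: O(n^2 * dim) time
    return max(np.linalg.norm(P[i] - P[j])
               for i in range(len(P)) for j in range(i + 1, len(P)))

print("diameter in R^d:", round(diameter(X), 2))   # O(n^2 * d)
print("diameter in R^k:", round(diameter(Y), 2))   # O(n^2 * k); a (1 +/- eps)-approximation whp
```

The point of the exercise is the running-time split: the O(n · d · k) projection is paid once, after which all pairwise work happens in k rather than d dimensions.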

Page 7: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Main intuition

The map f can simply be a projection onto a random subspace of dimension k.

Page 8: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

1D embedding

How about one dimension (k = 1)?
Map f(x) = g · x = Σ_i g_i x_i, where the g_i are i.i.d. normal (Gaussian) random variables.

Why Gaussian? Stability property: g · x is distributed as ||x|| · g′, where g′ is also Gaussian.
Equivalently: g is spherically (centrally) distributed, i.e., it has a random direction, and the projection of x onto a random direction depends only on the length of x.

Gaussian pdf: (1/√(2π)) · e^{−s²/2}, with E[g] = 0 and E[g²] = 1.

Page 9: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

1D embedding (continued)

Map f(x) = g · x = Σ_i g_i x_i.
Linear: f(x − y) = f(x) − f(y) for any x, y.
Want: |f(x) − f(y)| ≈ ||x − y||.
Claim: for any x we have
  Expectation: E[f(x)²] = ||x||².
  Standard deviation of f(x)²: √2 · ||x||².
Proof: it suffices to prove the claim for ||x|| = 1, since f is linear.
(Gaussian pdf (1/√(2π)) · e^{−s²/2}, E[g] = 0, E[g²] = 1.)
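The expectation step in this proof can be written out explicitly; the following is a standard computation using only the independence of the g_i and the facts E[g_i] = 0, E[g_i²] = 1 from the slide.

```latex
\begin{align*}
\mathbb{E}\!\left[f(x)^2\right]
  &= \mathbb{E}\Big[\Big(\textstyle\sum_i g_i x_i\Big)^{\!2}\Big]
   = \sum_{i,j} x_i x_j\, \mathbb{E}[g_i g_j] \\
  &= \sum_{i} x_i^2\, \underbrace{\mathbb{E}[g_i^2]}_{=\,1}
   \;+\; \sum_{i \neq j} x_i x_j\, \underbrace{\mathbb{E}[g_i]\,\mathbb{E}[g_j]}_{=\,0}
   \;=\; \lVert x \rVert_2^2 .
\end{align*}
```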

Page 10: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Full Dimension Reduction

Just repeat the 1D embedding k times!
f(x) = (1/√k) · G x, where G is a k × d matrix of i.i.d. Gaussian random variables.
Want to prove: ||f(x) − f(y)|| = (1 ± ε) · ||x − y|| with probability 1 − δ.
OK to prove for a fixed pair x, y; by linearity it suffices to consider the single vector z = x − y.

Page 11: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Concentration

||G z||² is distributed as Σ_{i=1}^{k} g_i², where each g_i is a Gaussian with variance ||z||²; for ||z|| = 1 this is called the chi-squared distribution with k degrees of freedom.
Fact: chi-squared is very well concentrated: it equals k · (1 ± ε) with probability at least 1 − e^{−Ω(ε² k)}.
Akin to the central limit theorem.
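A quick empirical check of this concentration fact (a minimal numpy sketch; the sizes d, k, ε and the number of trials are arbitrary illustrative choices): for a fixed unit vector, the squared norm of the scaled Gaussian projection should land in 1 ± ε in the vast majority of trials.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, eps, trials = 500, 200, 0.2, 500
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                      # fix ||x|| = 1, as in the proof

within = 0
for _ in range(trials):
    G = rng.standard_normal((k, d)) / np.sqrt(k)
    sq = np.linalg.norm(G @ x) ** 2         # chi-squared with k degrees of freedom, divided by k
    within += (1 - eps <= sq <= 1 + eps)
print(f"fraction of trials within 1±{eps}: {within / trials:.3f}")
```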

Page 12: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Dimension Reduction: wrap-up

f(x) = (1/√k) · G x preserves distances up to 1 ± ε with high probability, using k = O(ε^{-2} log n) dimensions for n points.

Extra:
Linear: can update f(x) as x changes.
Can use ±1 entries instead of Gaussians [AMS’96, Ach’01, TZ’04, …].
Fast JL: can compute f(x) faster than the naive O(d · k) time [AC’06, AL’07’09, DKS’10, KN’10’12, …].
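For the "±1 instead of Gaussians" remark, here is a minimal variant in the same numpy style (illustrative only; it mirrors the random-sign construction in spirit and is not the exact parameterization of the cited papers).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5000, 300                                         # illustrative sizes
S = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)    # random +/-1 entries instead of Gaussians

x = rng.standard_normal(d)
print(np.linalg.norm(x), np.linalg.norm(S @ x))          # close, up to 1 +/- eps, with high probability
```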

Page 13: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

NNS for Euclidean space


Can use dimensionality reduction to get LSH for ℓ2.

LSH function h(p): pick a random line, project p onto it, and quantize:
h(p) = ⌊(g · p + b) / w⌋,
where g is a random Gaussian vector, b is random in [0, w), and w is a parameter (e.g., 4).

[Datar-Immorlica-Indyk-Mirrokni’04]
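A minimal sketch of this hash family (numpy; the width w = 4 follows the slide's example, while the dimension and the test points are assumptions made for illustration):

```python
import numpy as np

def make_lsh(d, w=4.0, rng=None):
    """One hash in the [DIIM'04] style: h(p) = floor((g . p + b) / w)."""
    rng = rng or np.random.default_rng()
    g = rng.standard_normal(d)      # random Gaussian direction (the "random line")
    b = rng.uniform(0, w)           # random shift in [0, w)
    return lambda p: int(np.floor((g @ p + b) / w))

rng = np.random.default_rng(0)
d = 100
h = make_lsh(d, rng=rng)

p = rng.standard_normal(d)
q_near = p + 0.1 * rng.standard_normal(d)   # a point close to p
q_far = rng.standard_normal(d)              # an unrelated point
# Near points fall into the same bucket far more often than far points
# (over the random choice of g and b; a single draw may vary).
print(h(p) == h(q_near), h(p) == h(q_far))
```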

Page 14: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Near-Optimal LSH [A-Indyk’06]

Regular grid → grid of balls: p can hit empty space, so take more such grids until p is in a ball.
Need (too) many grids of balls, so start by projecting to dimension t (p ∈ R^d ↦ p ∈ R^t).
Analysis gives ρ = 1/c² + o(1). Choice of the reduced dimension t? Tradeoff between the number of hash tables, n^ρ, and the time to hash, t^{O(t)}.
Total query time: d · n^{1/c² + o(1)}.

[Figure: a 2D grid-of-balls partition, with the point p projected into R^t.]

Page 15: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Open question:

More practical variant of the above hashing? Design a space partitioning of R^t that is
efficient: point location in poly(t) time;
qualitative: regions are "sphere-like", i.e.,

[Prob. a needle of length 1 is not cut] ≥ [Prob. a needle of length c is not cut]^{1/c²}

(equivalently, an LSH exponent of 1/c²).

Page 16: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Time-Space Trade-offs

Space     Time     Comment                  Reference
low       high                              [Ind’01, Pan’06]
medium    medium                            [IM’98, DIIM’04, AI’06]
high      low      one hash table lookup!   [KOR’98, IM’98, Pan’06]

Lower bounds:
n^{o(1/ε²)} space requires ω(1) memory lookups [AIP’06]
n^{1+o(1/c²)} space requires ω(1) memory lookups [PTW’08, PTW’10]

Page 17: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

NNS beyond LSH


Data-dependent partitions…

Practice: trees (kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees, …), often with no guarantees.

Theory: can improve standard LSH by random data-dependent space partitions [A-Indyk-Nguyen-Razenshteyn’??]; a tree-based approach to the max-norm (ℓ∞).

Page 18: Sketching, Sampling, and other Sublinear Algorithms 2 (Lecture by Alex Andoni)

Finale

Dimension Reduction in Euclidean space (ℓ2): a random projection preserves distances, and only O(ε^{-2} log n) dimensions suffice for the distances among n points!

NNS for Euclidean space: random projections give LSH; even better with ball partitioning. Or better yet with cool lattices?