Ryan O’Donnell (CMU, IAS) joint work with Yi Wu (CMU, IBM), Yuan Zhou (CMU)


Locality Sensitive Hashing [Indyk–Motwani ’98]

h : objects → sketches

H : family of hash functions h s.t.

“similar” objects collide w/ high prob.

“dissimilar” objects collide w/ low prob.


Abbreviated history


Broder ’97, Altavista

A = 0 1 1 1 0 0 1 0 0

B = 1 1 1 0 0 0 1 0 1

(coordinates: word 1? word 2? word 3? … word d?)

Jaccard similarity: |A ∩ B| / |A ∪ B|

Invented simple H s.t. Pr[h(A) = h(B)] = |A ∩ B| / |A ∪ B|
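
The slide does not spell out the family H. A minimal sketch, assuming it is the usual min-wise hashing construction (pick a random ordering of the d words, output the first word of the set in that ordering), checks the collision probability empirically on the two sets A and B above. The function names and the trial count are illustrative choices, not part of the talk.

```python
import random

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)

def minhash_collision_rate(a, b, universe_size, trials=20000, seed=0):
    """Estimate Pr[h(A) = h(B)] over random orderings of the universe."""
    rng = random.Random(seed)
    rank = list(range(universe_size))
    hits = 0
    for _ in range(trials):
        rng.shuffle(rank)  # rank[w] = position of word w in a random ordering
        # h(A) is the smallest rank in A; ranks are distinct, so equal
        # minima mean the same word comes first in both sets.
        if min(rank[w] for w in a) == min(rank[w] for w in b):
            hits += 1
    return hits / trials

# The two bit vectors from the slide, read as sets of word indices.
A = {i for i, bit in enumerate([0, 1, 1, 1, 0, 0, 1, 0, 0]) if bit}
B = {i for i, bit in enumerate([1, 1, 1, 0, 0, 0, 1, 0, 1]) if bit}

similarity = jaccard(A, B)
estimate = minhash_collision_rate(A, B, universe_size=9)
```

Here |A ∩ B| = 3 and |A ∪ B| = 6, so the estimate should land close to 1/2.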


Indyk–Motwani ’98 (cf. Gionis–I–M ’98)

Defined LSH.

Invented very simple H good for {0,1}^d under Hamming distance.

Showed good LSH implies good nearest-neighbor-search data structs.


Charikar ’02, STOC

Proposed alternate H (“simhash”) for

Jaccard similarity.


Many papers about LSH


Practice

Free code base [AI’04]

Sequence comparison in bioinformatics

Association-rule finding in data mining

Collaborative filtering

Clustering nouns by meaning in NLP

Pose estimation in vision

• • •

Theory

[Tenesawa–Tanaka ’07]

[Broder ’97]

[Indyk–Motwani ’98]

[Gionis–Indyk–Motwani ’98]

[Charikar ’02]

[Datar–Immorlica–Indyk–Mirrokni ’04]

[Motwani–Naor–Panigrahi ’06]

[Andoni–Indyk ’06]

[Neylon ’10]

[Andoni–Indyk ’08, CACM]


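Given: (X, dist), r > 0, c > 1

(distance space, “radius”, “approx factor”)

Goal: Family H of functions X → S

(S can be any finite set)

s.t. ∀ x, y ∈ X,

dist(x, y) ≤ r ⇒ Pr[h(x) = h(y)] ≥ p

dist(x, y) ≥ cr ⇒ Pr[h(x) = h(y)] ≤ q

(probability over h drawn at random from H)

Write p in terms of q: p ≥ q^.5, ≥ q^.25, ≥ q^.1, in general p ≥ q^ρ. Smaller ρ is better.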


Theorem

[IM’98, GIM’98]

Given LSH family for (X, dist),

can solve “(r,cr)-near-neighbor search”

for n points with data structure of

size: O(n^(1+ρ))

query time: Õ(n^ρ) hash fcn evals.
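
The slide states the theorem without the construction. A minimal sketch, assuming the standard amplification scheme (concatenate k hashes per table, build L independent tables), shows where the two bounds come from: each of the roughly n^ρ tables stores all n points, and a query evaluates k hashes per table. The class and helper names are hypothetical. Bit sampling on {0,1}^d is used only as a stand-in family, and the values of d, n, ϵ and c are arbitrary.

```python
import math
import random
from collections import defaultdict

def choose_params(n, p, q):
    """k hashes per table and L tables, for collision bounds p (near) and q (far)."""
    k = math.ceil(math.log(n) / math.log(1 / q))  # far pairs collide w.p. <= q^k <= 1/n
    L = math.ceil(p ** -k)                        # about n^rho / p tables
    return k, L

class LSHIndex:
    def __init__(self, points, draw_hash, k, L, seed=0):
        rng = random.Random(seed)
        self.points = points
        self.tables = []
        for _ in range(L):
            hs = [draw_hash(rng) for _ in range(k)]
            buckets = defaultdict(list)
            for idx, pt in enumerate(points):
                buckets[tuple(h(pt) for h in hs)].append(idx)
            self.tables.append((hs, buckets))

    def query(self, x, dist, cr):
        """Return some stored point within distance cr of x, or None."""
        for hs, buckets in self.tables:
            key = tuple(h(x) for h in hs)  # k hash evaluations per table
            for idx in buckets.get(key, ()):
                if dist(self.points[idx], x) <= cr:
                    return self.points[idx]
        return None

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def draw_bit_sampler(d):
    """Stand-in hash family (an assumption here): output one random coordinate."""
    def draw(rng):
        i = rng.randrange(d)
        return lambda x: x[i]
    return draw

# Illustrative parameters only.
d, n, eps, c = 64, 200, 0.05, 5
rng = random.Random(1)
points = [tuple(rng.randrange(2) for _ in range(d)) for _ in range(n)]
k, L = choose_params(n, p=1 - eps, q=1 - c * eps)
index = LSHIndex(points, draw_bit_sampler(d), k, L)
found = index.query(points[0], hamming, cr=c * eps * d)
```

Querying with a stored point always succeeds, because that point collides with itself in every table. For a general query with a neighbor within distance r, each table succeeds with probability at least p^k, so L = p^(−k) tables give constant success probability.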


Example

X = {0,1}^d, dist = Hamming

r = ϵd, c = 5

0 1 1 1 0 0 1 0 0

1 1 1 0 0 0 1 0 1

dist ≤ ϵd or ≥ 5ϵd

H = { h_1, h_2, …, h_d }, h_i(x) = x_i [IM’98]

“output a random coord.”


Analysis

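Pr[h(x) = h(y)] = 1 − dist(x, y)/d

dist(x, y) ≤ ϵd ⇒ Pr[h(x) = h(y)] ≥ 1 − ϵ = q^ρ

dist(x, y) ≥ 5ϵd ⇒ Pr[h(x) = h(y)] ≤ 1 − 5ϵ = q

(1 − 5ϵ)^(1/5) ≈ 1 − ϵ, and in fact (1 − 5ϵ)^(1/5) ≤ 1 − ϵ. ∴ ρ ≤ 1/5

In general, achieves ρ ≤ 1/c ∀ c (∀ r).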
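
As a quick numeric check of this analysis, the sketch below computes ρ = ln(1/p) / ln(1/q), which is just p = q^ρ solved for ρ, for the random-coordinate family with p = 1 − ϵ and q = 1 − cϵ. The function names are illustrative.

```python
import math

def rho(p, q):
    """Solve p = q^rho for rho."""
    return math.log(1 / p) / math.log(1 / q)

def bit_sampling_rho(eps, c):
    """rho of the 'output a random coord.' family at radius eps*d, approx factor c."""
    p = 1 - eps      # collision probability at distance eps*d
    q = 1 - c * eps  # collision probability at distance c*eps*d
    return rho(p, q)

# rho approaches 1/c from below as eps shrinks.
values = {(eps, c): bit_sampling_rho(eps, c)
          for eps in (0.1, 0.01, 0.001)
          for c in (2, 5)}
```

Every value stays at or below 1/c, and for small ϵ it is close to 1/c, matching ρ ≈ 1/5 in the c = 5 example.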


Optimal upper bound

( {0,1}^d, Ham ), r > 0, c > 1.

S ≝ {0,1}^d ∪ {✔}, H ≝ {h_ab : dist(a, b) ≤ r}

h_ab(x) = ✔ if x = a or x = b; x otherwise

dist(x, y) ≤ r ⇒ Pr[h(x) = h(y)] is positive ≥ 0^.5 ≥ 0^.1 ≥ 0^.01 ≥ 0^.0001

dist(x, y) ≥ cr ⇒ Pr[h(x) = h(y)] = 0


Wait, what?

[IM’98, GIM’98] Theorem:

Given LSH family for (X, dist),

can solve “(r,cr)-near-neighbor search”

for n points with data structure of

size: Õ(n^(1+ρ))

query time: Õ(n^ρ) hash fcn evals



More results

For ℝ^d with ℓp-distance: ρ ≤ 1/c^p

when p = 1 [IM’98], 0 < p < 1 [DIIM’04], p = 2 [AI’06]

For Jaccard similarity: ρ ≤ 1/c [Bro’97]

For {0,1}^d with Hamming distance:

ρ ≥ .462/c − o_d(1) (assuming q ≥ 2^(−o(d))) [MNP’06]

immediately ⇒ ρ ≥ .462/c^p for ℓp-distance


Our Theorem

For {0,1}^d with Hamming distance:

ρ ≥ 1/c − o_d(1) (assuming q ≥ 2^(−o(d))) (∃ r s.t.)

immediately ⇒ ρ ≥ 1/c^p for ℓp-distance

Proof also yields ρ ≥ 1/c for Jaccard.


Proof:


Noise-stability is log-convex.


A definition, and two lemmas.


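Fix any arbitrary function h : {0,1}^d → S.

Pick x ∈ {0,1}^d at random:

x = 0 1 1 1 0 0 1 0 0, h(x) = s

Continuous-time (lazy) random walk for time τ.

y = 0 0 1 1 0 0 1 1 0, h(y) = s’

def: K_h(τ) = Pr[h(x) = h(y)], with x random and y the result of walking from x for time τ (written x ∼τ y).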


Lemma 1: For x ∼τ y, dist(x, y) ≈ (τ/2)·d w.v.h.p., when τ ≪ 1.

Lemma 2: K_h(τ) is a log-convex function of τ. (for any h)

[Plot: K_h(τ) against τ, vertical axis from 0 to 1.]

From which the proof of ρ ≥ 1/c follows easily.


Continuous-Time Random Walk

Alarm clock: repeatedly waits Exponential(1) seconds, then dings.

(Reminder: T ~ Expon(1) means Pr[T > u] = e^(−u).)

In C.T.R.W. on {0,1}^d, each coord. gets its own independent alarm clock.

When ith clock dings, coord. i is rerandomized.


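x = 0 1 1 1 0 0 1 0 0 1

y = 0 1 0 1 0 0 1 0 1 1

[Figure: timeline from time 0 to time τ; each time a coordinate's clock dings, that coordinate is rerandomized.]

Pr[coord. i never updated] = Pr[Exp(1) > τ] = e^(−τ)

∴ Pr[x_i ≠ y_i] = (1 − e^(−τ))/2

⇒ Lemma 1: dist(x, y) ≈ ((1 − e^(−τ))/2)·d ≈ (τ/2)·d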
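
The walk is easy to simulate. A minimal sketch, with an arbitrary dimension and walk time: a coordinate's clock dings at least once in [0, τ] exactly when its first Exponential(1) waiting time is at most τ, and in that case the final value of the coordinate is a fresh uniform bit. The function name is illustrative.

```python
import math
import random

def ctrw_step(x, tau, rng):
    """Run the continuous-time random walk on {0,1}^d from x for time tau."""
    y = list(x)
    for i in range(len(x)):
        # First ding of clock i arrives after an Exponential(1) wait.
        if rng.expovariate(1.0) <= tau:
            y[i] = rng.randrange(2)  # rerandomized at least once: uniform bit
    return y

rng = random.Random(0)
d, tau = 20000, 0.2
x = [rng.randrange(2) for _ in range(d)]
y = ctrw_step(x, tau, rng)

flipped_fraction = sum(a != b for a, b in zip(x, y)) / d
predicted = (1 - math.exp(-tau)) / 2  # Pr[coordinate updated] * Pr[new bit differs]
```

With d this large, the observed fraction of differing coordinates concentrates around the predicted value, which is itself close to τ/2 for small τ.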


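Lemma 2: K_h(τ) is a log-convex function of τ.

Remark: True for any reversible C.T.M.C.

Recall: For f : {0,1}^d → ℝ, with x random and y reached from x by the time-τ walk,

E[f(x) f(y)] = Σ_{T ⊆ [d]} f̂(T)^2 e^(−τ|T|)

Given hash function h : {0,1}^d → S, for each s ∈ S, introduce

h_s : {0,1}^d → {0,1}, h_s(x) = 1{h(x) = s}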


Proof of Lemma 2:

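K_h(τ) = Pr[h(x) = h(y)] = Σ_{s ∈ S} E[h_s(x) h_s(y)] = Σ_{s ∈ S} Σ_{T ⊆ [d]} ĥ_s(T)^2 e^(−τ|T|)

Each e^(−τ|T|) is log-convex in τ, and a non-neg. lin. comb. of log-convex fcns. is log-convex.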
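
Lemma 2 can be checked numerically on a small cube. A minimal sketch, assuming only the walk described earlier: each coordinate of y independently differs from x with probability (1 − e^(−τ))/2, so K_h(τ) can be computed exactly by summing over all pairs (x, y). The hash function here is an arbitrary random labelling with three labels, chosen purely for illustration.

```python
import itertools
import math
import random

def K(h, d, tau):
    """Exact K_h(tau) = Pr[h(x) = h(y)] for x uniform and y a time-tau walk from x."""
    same = (1 + math.exp(-tau)) / 2  # Pr[y_i == x_i]
    diff = (1 - math.exp(-tau)) / 2  # Pr[y_i != x_i]
    cube = list(itertools.product((0, 1), repeat=d))
    total = 0.0
    for x in cube:
        for y in cube:
            if h[x] == h[y]:
                dist = sum(a != b for a, b in zip(x, y))
                total += same ** (d - dist) * diff ** dist
    return total / len(cube)

d = 4
rng = random.Random(0)
h = {x: rng.randrange(3) for x in itertools.product((0, 1), repeat=d)}

taus = [0.0, 0.1, 0.5, 1.0, 2.0, 4.0]
log_K = {t: math.log(K(h, d, t)) for t in taus}

def midpoint_gap(t1, t2):
    """Chord value minus curve value at the midpoint; >= 0 for a convex function."""
    mid = math.log(K(h, d, (t1 + t2) / 2))
    return (log_K[t1] + log_K[t2]) / 2 - mid

gaps = [midpoint_gap(t1, t2) for t1, t2 in itertools.combinations(taus, 2)]
```

All midpoint gaps come out non-negative up to floating-point error, as log-convexity of K_h requires, and K_h(0) = 1 since y = x at time 0.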


Lemma 1: For x ∼τ y, dist(x, y) ≈ (τ/2)·d w.v.h.p. (when τ ≪ 1)

Lemma 2: K_h(τ) is a log-convex function of τ.

Theorem: LSH for {0,1}^d requires ρ ≥ 1/c − o_d(1).


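Proof: Say H is an LSH family for {0,1}^d with params (r, (c − o(1)) r, q^ρ, q).

def: K_H(τ) = average of K_h(τ) over h drawn from H.

(Non-neg. lin. comb. of log-convex fcns. ∴ K_H(τ) is also log-convex.)

Walk for time ϵ: w.v.h.p., dist(x, y) ≈ r. ∴ K_H(ϵ) ≳ q^ρ

Walk for time cϵ: w.v.h.p., dist(x, y) ≈ (c − o(1)) r. ∴ K_H(cϵ) ≲ q

(in truth, q + 2^(−Θ(d)); we assume q not tiny)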


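∴ K_H(0) = 1, K_H(ϵ) ≳ q^ρ, K_H(cϵ) ≲ q

∴ ln K_H(0) = 0, ln K_H(ϵ) ≳ ρ ln q, ln K_H(cϵ) ≲ ln q

K_H(τ) is log-convex

[Plot: ln K_H(τ) against τ, a convex curve equal to 0 at τ = 0, ≳ ρ ln q at τ = ϵ, and ≲ ln q at τ = cϵ.]

Convexity: ln K_H(ϵ) ≤ (1 − 1/c) ln K_H(0) + (1/c) ln K_H(cϵ), so ρ ln q ≲ (1/c) ln q. Since ln q < 0, ρ ≳ 1/c.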


Super-tedious, super-straightforward

Make Lemma 1 precise. (Chernoff)

Make the ≈ steps precise. (Taylor)

Choose ϵ = ϵ(c, q, d) very carefully.

Theorem: ρ ≥ 1/c − o_d(1)

Meaningful iff q ≥ 2^(−o(d)); i.e., not tiny.
