Overcoming the L1 Non-Embeddability Barrier

Robert Krauthgamer (Weizmann Institute)

Joint work with Alexandr Andoni and Piotr Indyk (MIT)


Algorithms on Metric Spaces
Fix a metric M. Fix a computational problem. Solve the problem under M.
Metrics:
  Hamming distance
  Ulam metric
  Earthmover distance
  Edit distance: ED(x,y) = minimum number of edit operations that transform x into y, where an edit operation = insert/delete/substitute a character. Example: ED(0101010, 1010101) = 2.
Problems:
  Compute the distance between x and y
  Nearest Neighbor Search: preprocess n strings so that, given a query string, we can find the closest string to it
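The definition of ED above can be checked on small inputs with the textbook dynamic program. The following is a minimal sketch, not the algorithm discussed in this talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insert/delete/substitute operations turning x into y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free if the chars match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # 2
```

The quadratic running time of this program is what makes sublinear and embedding-based approaches attractive for long strings.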


Motivation for Nearest Neighbor
Many applications:
  Image search (Euclidean distance, Earth-mover distance)
  Processing of genetic information, text processing (edit distance)
  Many others…
[Figure: generic search engine]


A General Tool: Embeddings
An embedding of M into a host metric $(H, d_H)$ is a map $f : M \to H$ that preserves distances approximately.
f has distortion $A \ge 1$ if for all $x, y \in M$: $d_M(x,y) \le d_H(f(x), f(y)) \le A \cdot d_M(x,y)$.
Why? If H is "easy" (= can solve efficiently computational problems like NNS), then we get good algorithms for the original space M!
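To make the notion of distortion concrete, here is a small sketch that measures it on a finite point set. It assumes distortion is taken up to rescaling of the host metric (largest ratio $d_H/d_M$ divided by the smallest); the 4-cycle example and all function names are illustrative choices, not taken from the talk.

```python
import math
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Distortion of f on a finite point set, up to rescaling of the host metric."""
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


def d_cycle(x, y):
    """Shortest-path metric on the 4-cycle 0-1-2-3-0."""
    k = abs(x - y)
    return min(k, 4 - k)


def d_l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def d_l2(u, v):
    return math.dist(u, v)


# Map the cycle to the corners of the unit square.
corners = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (1, 0)}

print(distortion(range(4), d_cycle, d_l1, corners.get))  # 1.0 (isometric in l_1)
print(distortion(range(4), d_cycle, d_l2, corners.get))  # about 1.414 in l_2
```

The same map is exact under one host metric and lossy under another, which is why the choice of host space matters.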


Host space?
Popular target metric: $\ell_1$
Have efficient algorithms:
  Distance estimation: O(d) for d-dimensional space (often less)
  NNS: c-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space [IM98]
Powerful enough for some things…
$\ell_1$ = real space with $d_1(x,y) = \sum_i |x_i - y_i|$


Below logarithmic?
Cannot work with $\ell_1$
Other possibilities?
  $(\ell_2)^p$ is bigger and algorithmically tractable, but not rich enough (often same lower bounds)
  $\ell_\infty$ is rich (includes all metrics), but usually not computationally efficient (high dimension)
And that's roughly it (at least for efficient NNS)
$(\ell_2)^p$ = real space with $\mathrm{dist}_{2^p}(x,y) = \|x-y\|_2^p$
$\ell_\infty$ = real space with $\mathrm{dist}_\infty(x,y) = \max_i |x_i - y_i|$


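Meet our new host
Iterated product space $\bigoplus_{(\ell_2)^2}^{\alpha}\bigoplus_{\ell_\infty}^{\beta}\ell_1^{\gamma}$, built in three levels:
  $\ell_1^{\gamma}$, with distance $d_1(x,y) = \sum_i |x_i - y_i|$
  $\ell_\infty$-product of $\beta$ copies, with distance $d_{\infty,1}(x,y) = \max_j d_1(x_j, y_j)$
  $(\ell_2)^2$-product of $\alpha$ copies, with distance $d_{2^2,\infty,1}(x,y) = \sum_k \big(d_{\infty,1}(x_k, y_k)\big)^2$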
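The iterated product distance can be computed level by level: an $\ell_1$ distance on the innermost vectors, a maximum over the $\beta$ vectors in each block, and a sum of squares over the $\alpha$ blocks. The sketch below assumes a point is stored as a nested list with shape alpha × beta × gamma; the function names are made up for this example.

```python
def d_1(u, v):
    """l_1 distance between two vectors of length gamma."""
    return sum(abs(a - b) for a, b in zip(u, v))


def d_inf_1(x, y):
    """l_infinity-product: maximum of the l_1 distances over beta coordinates."""
    return max(d_1(u, v) for u, v in zip(x, y))


def d_22_inf_1(x, y):
    """(l_2)^2-product: sum of squared d_inf_1 distances over alpha blocks."""
    return sum(d_inf_1(xb, yb) ** 2 for xb, yb in zip(x, y))


# alpha = 2 blocks, each holding beta = 2 vectors of dimension gamma = 3.
x = [[[0, 0, 0], [1, 1, 1]], [[2, 0, 0], [0, 0, 0]]]
y = [[[1, 0, 0], [1, 3, 1]], [[2, 0, 0], [0, 0, 4]]]

print(d_22_inf_1(x, y))  # 2**2 + 4**2 = 20
```

Note that the outermost level uses squared distances, so the resulting function is not itself a metric, only the square of one.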


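Why the iterated product space $\bigoplus_{(\ell_2)^2}\bigoplus_{\ell_\infty}\ell_1$?
Because we can…
Theorem 1 (rich). Ulam embeds into $\bigoplus_{(\ell_2)^2}^{\alpha}\bigoplus_{\ell_\infty}^{\beta}\ell_1^{\gamma}$ with O(1) distortion. Dimensions $(\alpha,\beta,\gamma) = (d, \log d, d)$.
Theorem 2 (algorithmically tractable). $\bigoplus_{(\ell_2)^2}^{\alpha}\bigoplus_{\ell_\infty}^{\beta}\ell_1^{\gamma}$ admits NNS on n points with O(log log n) approximation, $O(n^{\epsilon})$ query time and $O(n^{1+\epsilon})$ space.
In fact, there is more for Ulam…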


Our Algorithms for Ulam
Ulam = edit on strings where each symbol appears at most once
  A classical distance between rankings
  Exhibits hardness of misalignments (as in general edit)
  All lower bounds same as for general edit (up to $\tilde\Theta(\cdot)$)
  Distortion of embedding into $\ell_1$ (and $(\ell_2)^p$, etc.): $\tilde\Theta(\log d)$
  Example: ED(1234567, 7123456) = 2
Our approach implies new algorithms for Ulam:
1. NNS with O(log log n) approx, $O(n^{\epsilon})$ query time. Can improve to O(log log d) approx.
2. Sketching with O(1)-approx in $\log^{O(1)} d$ space
3. Distance estimation with O(1)-approx in time […]
   [BEKMRRS03]: when $ED \approx d^{1-\epsilon}$, approx $d^{\epsilon}$ in $O(d^{1-2\epsilon})$ time
If we ever hope for approximation […]

Theorem 1
Theorem 1. Can embed Ulam into $\bigoplus_{(\ell_2)^2}^{\alpha}\bigoplus_{\ell_\infty}^{\beta}\ell_1^{\gamma}$ with O(1) distortion. Dimensions $(\alpha,\beta,\gamma) = (d, \log d, d)$.
Proof: "Geometrization" of Ulam characterizations
  Previously studied in the context of testing monotonicity (sortedness):
    Sublinear algorithms [EKKRV98, ACCL04]
    Data-stream algorithms [GJKK07, GG07, EH08]


Thm 1: Characterizing Ulam
Consider permutations x, y over [d]
  Assume for now: x = identity permutation
Idea:
  Count # chars in y to delete to obtain an increasing sequence ($\approx$ Ulam(x,y))
  Call them faulty characters
Issues:
  Ambiguity…
  How do we count them?
Examples: x = 123456789, y = 234657891; x = 123456789, y = 341256789


Thm 1: Characterization - inversions
Definition: chars a<b form an inversion if b precedes a in y

Thm 1: Characterization - faulty chars
Definition 1: a is faulty if there exists K>0 s.t. a is inverted w.r.t. a majority of the K symbols preceding a in y (ok to consider $K = 2^k$)
Lemma [ACCL04, GJKK07]: # faulty chars = $\Theta(\mathrm{Ulam}(x,y))$.
Example: x = 123456789, y = 234567891. The 4 characters preceding 1 are all inversions with 1.
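The lemma can be explored numerically on small permutations. The sketch below assumes x is the identity, reads "majority" as a strict majority, and measures Ulam(x,y) as the number of characters to delete from y to leave an increasing sequence (length minus longest increasing subsequence). These conventions and the function names are assumptions made for illustration.

```python
from bisect import bisect_left


def ulam_deletions(y):
    """Number of chars to delete from y so that the rest is increasing."""
    tails = []  # tails[i] = smallest last value of an increasing subsequence of length i+1
    for c in y:
        i = bisect_left(tails, c)
        if i == len(tails):
            tails.append(c)
        else:
            tails[i] = c
    return len(y) - len(tails)


def faulty_chars(y):
    """Chars a inverted w.r.t. a strict majority of the K symbols preceding a, for some K > 0."""
    faulty = []
    for pos, a in enumerate(y):
        inverted = 0
        for K in range(1, pos + 1):
            inverted += y[pos - K] > a  # a larger symbol precedes a: an inversion
            if 2 * inverted > K:
                faulty.append(a)
                break
    return faulty


y = [2, 3, 4, 5, 6, 7, 8, 9, 1]
print(faulty_chars(y), ulam_deletions(y))  # [1] 1
```

On the other two example permutations from the earlier slide, 234657891 and 341256789, the same code reports two faulty characters and two deletions each.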


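Thm 1: Characterization $\to$ Embedding
To get embedding, need:
1. Symmetrization (neither string is identity)
2. Deal with "exists", "majority"…?
To resolve (1), use instead X[a;K], the set of K symbols preceding a in x (and Y[a;K] likewise for y)
Definition 2: a is faulty if there exists $K = 2^k$ such that $|X[a;2^k] \,\triangle\, Y[a;2^k]| > 2^k$ (symmetric difference)
Example: x = 123456789, y = 123467895: X[5;4] = {1,2,3,4}, Y[5;4] = {6,7,8,9}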
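Definition 2 translates directly into set operations. In the sketch below, a window near the start of a string is simply truncated when fewer than K symbols precede a; that boundary convention, like the function names, is an assumption of this example rather than something stated on the slide.

```python
def preceding(s, a, K):
    """Set of the (up to) K symbols preceding a in s; truncated at the start of s."""
    pos = s.index(a)
    return set(s[max(0, pos - K):pos])


def is_faulty(x, y, a):
    """True if |X[a;K] symmetric-difference Y[a;K]| > K for some K = 1, 2, 4, ..."""
    K = 1
    while K <= len(x):
        if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
            return True
        K *= 2
    return False


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]

print(sorted(preceding(x, 5, 4)), sorted(preceding(y, 5, 4)))  # [1, 2, 3, 4] [6, 7, 8, 9]
print([a for a in x if is_faulty(x, y, a)])  # [5, 6]
```

Here moving the single symbol 5 makes two symbols faulty, which is consistent with a count that matches the Ulam distance only up to a constant factor.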


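Thm 1: Embedding - final step
We have $\mathrm{Ulam}(x,y) \approx \sum_{a \in [d]} \max_{k} \mathbf{1}\big[\, |X[a;2^k] \,\triangle\, Y[a;2^k]| > 2^k \,\big]$, where the bracket equals 1 iff the condition is true
Replace the indicator by a weight?
Final embedding: $f(x) = \bigoplus_{a \in [d]} \bigoplus_{k} \frac{1}{2^{k+1}} \mathbf{1}_{X[a;2^k]}$, so that $d_{2^2,\infty,1}(f(x), f(y)) = \sum_{a \in [d]} \Big( \max_{k} \frac{|X[a;2^k] \,\triangle\, Y[a;2^k]|}{2^{k+1}} \Big)^2$
Example: x = 123456789, y = 123467895, with $X[5;2^2]$ and $Y[5;2^2]$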
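One way to see the final embedding in action is to build the vectors explicitly: for each symbol a and each scale $K = 2^k$, take the indicator vector of the K symbols preceding a, scaled by $1/(2K)$, and then measure distance with sum-of-squares over a, max over k, and $\ell_1$ inside. The sketch below assumes the alphabet is 1..d, truncates windows at the start of the string, and uses this particular normalization; all three are assumptions of the example.

```python
import math


def embed(x):
    """Nested list f(x)[a][k]: indicator of the 2^k symbols preceding a, scaled by 1/2^(k+1)."""
    d = len(x)
    pos = {a: i for i, a in enumerate(x)}
    levels = max(1, math.ceil(math.log2(d)))
    out = []
    for a in range(1, d + 1):
        block = []
        for k in range(levels):
            K = 2 ** k
            window = set(x[max(0, pos[a] - K):pos[a]])
            block.append([1 / (2 * K) if s in window else 0.0 for s in range(1, d + 1)])
        out.append(block)
    return out


def product_distance(fx, fy):
    """Sum over a of (max over k of the l_1 distance) squared."""
    return sum(
        max(sum(abs(u - v) for u, v in zip(vx, vy)) for vx, vy in zip(bx, by)) ** 2
        for bx, by in zip(fx, fy)
    )


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(product_distance(embed(x), embed(y)))
```

The embedded point has d blocks, about log d scales per block, and d coordinates per scale, matching the dimensions quoted for Theorem 1.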


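Theorem 2
Theorem 2. $\bigoplus_{(\ell_2)^2}^{\alpha}\bigoplus_{\ell_\infty}^{\beta}\ell_1^{\gamma}$ admits NNS on n points with O(log log n) approximation, $O(n^{\epsilon})$ query time and $O(n^{1+\epsilon})$ space, for any small $\epsilon > 0$ (ignoring $(\alpha\beta\gamma)^{O(1)}$ factors)
A rather general approach:
  "LSH" on $\ell_1$-products of general metric spaces
  Of course, cannot do, but can reduce to $\ell_\infty$-products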


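Thm 2: Proof
Let's start from basics: $\ell_1$
  [IM98]: c-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space (ignoring factors polynomial in the dimension)
Ok, what about an $\ell_\infty$-product of a metric M?
  Suppose: NNS for M with $c_M$-approx, $Q_M$ query time, $S_M$ space.
  Then: NNS for the $\ell_\infty$-product of M with $O(c_M \cdot \log\log n)$-approx, $O(Q_M)$ query time, $O(S_M \cdot n^{1+\epsilon})$ space. [I02]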


Thm 2: What about $(\ell_2)^2$-product?
Enough to consider the $\ell_1$-product of a metric M (for us, M is the $\ell_\infty$-product of $\ell_1$)
Off-the-shelf?
  [I04]: gives space ~n or >log n approximation
We reduce to multiple NNS queries under the $\ell_\infty$-product of M
Instructive to first look at NNS for standard $\ell_1$


Thm 2: Review of NNS for $\ell_1$
LSH family: collection H of hash functions such that:
  For random $h \in H$ (parameter $\delta > 0$): $\Pr[h(q) = h(p)] \approx 1 - \|q-p\|_1 / \delta$
Query just uses primitive: return all points p such that h(q) = h(p)
Can obtain H by imposing a randomly-shifted grid of side-length $\delta$
Then for h defined by $r_i \in [0, \delta]$ at random, primitive becomes: return all p s.t. $|q_i - p_i| < r_i$ for all i
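The randomly-shifted grid is easy to simulate. The sketch below draws one shift per coordinate, hashes a point to the index of its grid cell, and estimates the collision probability empirically; for a grid the exact value is $\prod_i (1 - |q_i - p_i|/\delta)$, which is close to $1 - \|q-p\|_1/\delta$ when the points are near each other relative to $\delta$. Function names, the seed, and the trial count are illustrative assumptions.

```python
import math
import random


def make_grid_hash(dim, delta, rng):
    """Hash a point to its cell in a grid of side delta with a random shift per coordinate."""
    shifts = [rng.uniform(0, delta) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((pi + si) / delta) for pi, si in zip(p, shifts))

    return h


def collision_rate(p, q, delta, trials=20000, seed=0):
    """Fraction of random grids under which p and q fall into the same cell."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(p), delta, rng)
        hits += h(p) == h(q)
    return hits / trials


q, p, delta = (0.0, 0.0), (0.5, 0.3), 10.0
print(collision_rate(p, q, delta))  # empirical estimate
print(1 - 0.8 / delta)              # first-order prediction 1 - ||q-p||_1 / delta
```

Points that differ by more than $\delta$ in some coordinate can never share a cell, so $\delta$ acts as the distance scale of the hash.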

Thm 2: LSH for $\ell_1$-product
Intuition: abstract LSH!
Recall we had: for $r_i$ random from $[0, \delta]$, point p returned if for all i: $|q_i - p_i| < r_i$
Equivalently: $\max_i |q_i - p_i| / r_i < 1$

Thm 2: Final
Thus, sufficient to solve primitive: return all points p such that $\max_i d_M(q_i, p_i)/r_i < 1$
We reduced NNS over the $\ell_1$-product of M to several instances of NNS over the $\ell_\infty$-product of M (with appropriately scaled coordinates)
Approximation is $O(1) \cdot O(\log\log n)$
Done!
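The primitive can be stated as a filter over the data set: keep a point p exactly when every coordinate satisfies $d_M(q_i, p_i) < r_i$, which is the same as $\max_i d_M(q_i, p_i)/r_i < 1$. The sketch below uses a brute-force scan as a stand-in for the NNS structure on the $\ell_\infty$-product, takes M to be $\ell_1$, and fixes the radii by hand instead of drawing them at random; all of these are simplifying assumptions for illustration.

```python
def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def primitive(points, q, radii, d_M=l1):
    """Return all p with d_M(q_i, p_i) < r_i for every coordinate i."""
    return [
        p for p in points
        if all(d_M(qi, pi) < ri for qi, pi, ri in zip(q, p, radii))
    ]


# Each point has two coordinates, and each coordinate is itself a point of M.
q = ((0.0, 0.0), (0.0, 0.0))
near = ((0.5, 0.0), (1.0, 0.5))   # per-coordinate distances 0.5 and 1.5
far = ((0.5, 0.6), (0.0, 0.0))    # first coordinate is at distance 1.1
radii = [1.0, 2.0]                # hand-picked here; random in the actual scheme

print(primitive([near, far], q, radii) == [near])  # True
```

Dividing coordinate i by $r_i$ turns the condition into a ball of radius 1 in an $\ell_\infty$-product, which is the "appropriately scaled coordinates" step.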

Take-home message:
  Can embed combinatorial metrics into iterated product spaces
  Works for Ulam (= edit on non-repetitive strings)
  Approach bypasses non-embeddability results into usual-suspect spaces like $\ell_1$, $(\ell_2)^2$
Open:
  Embeddings for edit over $\{0,1\}^d$, EMD, other metrics?
  Understanding product spaces? [Jayram-Woodruff]: sketching

