Overcoming the L1 Non-Embeddability Barrier



  • Overcoming the L1 Non-Embeddability Barrier
    Robert Krauthgamer (Weizmann Institute)

    Joint work with Alexandr Andoni and Piotr Indyk (MIT)

    Overcoming the L_1 non-embeddability barrier

  • Algorithms on Metric Spaces
    Fix a metric M
    Fix a computational problem
    Solve problem under M

    Metrics: Hamming distance, Ulam metric, Earthmover distance, edit distance
    ED(x,y) = minimum number of edit operations that transform x into y.
    edit operation = insert/delete/substitute a character
    ED(0101010, 1010101) = 2

    Problems:
    Nearest Neighbor Search: Preprocess n strings, so that given a query string, can find the closest string to it.
    Compute distance between x,y
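The edit-distance definition on this slide can be checked with the textbook dynamic program. This is a minimal sketch, not something taken from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] = distance between the prefix of x processed so far and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,                # delete cx
                curr[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),   # substitute (free when the characters match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # the slide's example, 2
```

One deletion at the front and one insertion at the end realize the distance, which is why the Hamming distance (7 here) badly overestimates it.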


  • Motivation for Nearest Neighbor
    Many applications:
    Image search (Euclidean dist, Earth-mover dist)
    Processing of genetic information, text processing (edit dist.)
    many others...
    [Figure: generic search engine]


  • A General Tool: Embeddings
    An embedding of M into a host metric $(H, d_H)$ is a map $f : M \to H$ that preserves distances approximately

    f has distortion $A \ge 1$ if for all $x,y \in M$,
    $d_M(x,y) \le d_H(f(x),f(y)) \le A \cdot d_M(x,y)$

    Why?
    If H is "easy" (= can solve efficiently computational problems like NNS)
    Then get good algorithms for the original space M!
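To make the distortion definition concrete, the following sketch measures the distortion of a map on a small finite point set. It assumes the map may be rescaled freely, so the distortion is the largest expansion ratio divided by the smallest; the helper names (`distortion`, `hamming`, `l1`) are illustrative and not from the talk.

```python
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Smallest A with d_M <= c * d_H(f(x), f(y)) <= A * d_M for some scaling c > 0."""
    # Assumes the points are distinct and f does not collapse any pair.
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Hamming distance on bit strings embeds isometrically into l_1.
strings = ["000", "011", "101", "110", "111"]
print(distortion(strings, hamming, l1, lambda s: [int(c) for c in s]))  # 1.0

# Squaring points of the real line stretches far pairs more than near ones.
line = [0, 1, 3]
print(distortion(line, lambda a, b: abs(a - b), lambda a, b: abs(a - b), lambda t: t * t))  # 4.0
```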


  • Host space?
    Popular target metric: $\ell_1$
    Have efficient algorithms:
    Distance estimation: $O(d)$ for d-dimensional space (often less)
    NNS: c-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space [IM98]
    Powerful enough for some things...
    $\ell_1$ = real space with $d_1(x,y) = \sum_i |x_i - y_i|$


  • Below logarithmic?
    Cannot work with $\ell_1$

    Other possibilities?
    $(\ell_2)^p$ is bigger and algorithmically tractable, but not rich enough (often same lower bounds)
    $\ell_\infty$ is rich (includes all metrics), but usually not efficient computationally (high dimension)
    And that's roughly it... (at least for efficient NNS)

    $(\ell_2)^p$ = real space with $\mathrm{dist}_{2^p}(x,y) = \|x-y\|_2^p$
    $\ell_\infty$ = real space with $\mathrm{dist}_\infty(x,y) = \max_i |x_i - y_i|$


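  • Meet our new host
    Iterated product space $\ell_{2^2,\infty,1}$, built in three levels:
    $\ell_1$ with $d_1(x,y) = \sum_i |x_i - y_i|$
    $\ell_\infty$-product of $\ell_1$ spaces with $d_{\infty,1}(x,y) = \max_j d_1(x_j,y_j)$
    $(\ell_2)^2$-product of those with $d_{2^2,\infty,1}(x,y) = \sum_k d_{\infty,1}(x_k,y_k)^2$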
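As a rough illustration of the iterated product distance, the sketch below evaluates it on nested lists. It assumes the reading that the host distance is a sum, over outer blocks, of the squared maximum over inner blocks of an $\ell_1$ distance; the function names are hypothetical.

```python
def d_1(u, v):
    """l_1 distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def d_inf_1(U, V):
    """l_inf-product of l_1: the largest l_1 distance over corresponding blocks."""
    return max(d_1(u, v) for u, v in zip(U, V))


def d_22_inf_1(X, Y):
    """(l_2)^2-product on top: sum of squared d_inf_1 distances over outer blocks."""
    return sum(d_inf_1(U, V) ** 2 for U, V in zip(X, Y))


# Two points, each made of 2 outer blocks x 2 inner blocks x 2 coordinates.
X = [[[0, 0], [1, 0]], [[2, 2], [0, 0]]]
Y = [[[1, 0], [1, 3]], [[2, 2], [0, 1]]]
print(d_22_inf_1(X, Y))  # max(1, 3)**2 + max(0, 1)**2 = 10
```

Note that the squared outer level means this "distance" need not satisfy the triangle inequality, just as $(\ell_2)^2$ does not.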


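  • Why $\ell_{2^2,\infty,1}$?
    Because we can...
    Theorem 1. Ulam embeds into $\ell_{2^2,\infty,1}$ with $O(1)$ distortion
    Dimensions of the three levels ($(\ell_2)^2$, $\ell_\infty$, $\ell_1$) = (d, log d, d)

    Theorem 2. $\ell_{2^2,\infty,1}$ admits NNS on n points with
    $O(\log\log n)$ approximation
    $O(n^\varepsilon)$ query time and $O(n^{1+\varepsilon})$ space

    In fact, there is more for Ulam...

    Rich; algorithmically tractable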


  • Our Algorithms for Ulam
    Ulam = edit on strings where each symbol appears at most once
    A classical distance between rankings
    Exhibits hardness of misalignments (as in general edit)
    All lower bounds same as for general edit (up to $\tilde\Theta(\cdot)$)
    Distortion of embedding into $\ell_1$ (and $(\ell_2)^p$, etc.): $\tilde\Theta(\log d)$
    ED(1234567, 7123456) = 2

    Our approach implies new algorithms for Ulam:
    1. NNS with $O(\log\log n)$ approx, $O(n^\varepsilon)$ query time
       Can improve to $O(\log\log d)$ approx

    2. Sketching with $O(1)$-approx in $\log^{O(1)} d$ space

    3. Distance estimation with $O(1)$-approx in time
       [BEKMRRS03]: when ED ≈ d, approx $d^\varepsilon$ in $O(d^{1-2\varepsilon})$ time

    If we ever hope for approximation

  • Theorem 1
    Theorem 1. Can embed Ulam into $\ell_{2^2,\infty,1}$ with $O(1)$ distortion
    Dimensions of the three levels ($(\ell_2)^2$, $\ell_\infty$, $\ell_1$) = (d, log d, d)

    Proof
    "Geometrization" of Ulam characterizations
    Previously studied in the context of testing monotonicity (sortedness):
    Sublinear algorithms [EKKRV98, ACCL04]
    Data-stream algorithms [GJKK07, GG07, EH08]


  • Thm 1: Characterizing Ulam
    Consider permutations x,y over [d]
    Assume for now: x = identity permutation
    Idea:
    Count # chars in y to delete to obtain increasing sequence (≈ Ulam(x,y))
    Call them faulty characters
    Issues:
    Ambiguity...
    How do we count them?
    Examples: x = 123456789, y = 234657891; x = 123456789, y = 341256789
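The count described on this slide (characters to delete from y so that the rest is increasing, with x the identity) equals d minus the length of a longest increasing subsequence. A minimal sketch using the standard patience-sorting technique follows; the function name `chars_to_delete` is illustrative.

```python
from bisect import bisect_left


def chars_to_delete(y):
    """Fewest characters to delete from permutation y so that the rest is increasing."""
    tails = []  # tails[i] = smallest possible last element of an increasing subsequence of length i + 1
    for c in y:
        i = bisect_left(tails, c)
        if i == len(tails):
            tails.append(c)   # c extends the longest subsequence found so far
        else:
            tails[i] = c      # c gives a smaller tail for subsequences of length i + 1
    return len(y) - len(tails)


y1 = [int(c) for c in "234657891"]
y2 = [int(c) for c in "341256789"]
print(chars_to_delete(y1), chars_to_delete(y2))  # 2 2
```

The second example shows the ambiguity the slide mentions: deleting {1, 2} or {3, 4} both work, so the count is well defined but the set of "faulty" characters is not.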


  • Thm 1: Characterization - inversions
    Definition: chars a
  • Thm 1: Characterization - faulty chars
    Definition 1: a is faulty if there exists K>0 s.t. a is inverted w.r.t. a majority of the K symbols preceding a in y
    (ok to consider $K = 2^k$)

    Lemma [ACCL04, GJKK07]: # faulty chars = $\Theta(\mathrm{Ulam}(x,y))$.

    Example: x = 123456789, y = 234567891. The 4 characters preceding 1 in y are all inversions with 1.
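Definition 1 can be tried out directly when x is the identity permutation. The sketch below assumes "majority" means a strict majority and that a symbol b preceding a in y is inverted with a exactly when b > a; it scans every K rather than only powers of two. The function name `faulty_chars` is illustrative.

```python
def faulty_chars(y):
    """Characters of y that are faulty per Definition 1, taking x as the identity."""
    faulty = []
    for pos, a in enumerate(y):
        inverted = 0
        # Grow the window to the K = 1, 2, ..., pos symbols immediately preceding a.
        for K in range(1, pos + 1):
            b = y[pos - K]
            inverted += b > a          # b precedes a in y but follows it in x
            if 2 * inverted > K:       # strict majority of the window is inverted
                faulty.append(a)
                break
    return faulty


print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))  # [1]
print(faulty_chars([3, 4, 1, 2, 5, 6, 7, 8, 9]))  # [1, 2]
```

On both examples the number of faulty characters matches the number of deletions needed, consistent with the lemma's $\Theta(\cdot)$ statement, and the rule resolves the ambiguity by always picking a specific set.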


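  • Thm 1: Characterization → Embedding
    To get embedding, need:
    1. Symmetrization (neither string is identity)
    2. Deal with "exists", "majority"...?

    To resolve (1), use instead X[a;K], the set of K symbols preceding a in x (and Y[a;K] likewise for y)

    Definition 2: a is faulty if there exists $K = 2^k$ such that $|X[a;2^k] \,\triangle\, Y[a;2^k]| > 2^k$ (symmetric difference)

    Example: x = 123456789, y = 123467895, with windows X[5;4] and Y[5;4]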
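The symmetric version can be sketched as follows. Several details are assumptions made here for illustration: windows are truncated when fewer than K symbols precede a, K runs over 1, 2, 4, ... up to the string length, and the threshold is read as $2^k$. The names `window` and `is_faulty` are hypothetical.

```python
def window(s, a, K):
    """Set of the (up to) K symbols immediately preceding a in s."""
    pos = s.index(a)
    return set(s[max(0, pos - K):pos])


def is_faulty(x, y, a):
    """Definition 2: some window size K = 2^k has symmetric difference larger than K."""
    K = 1
    while K <= len(x):
        if len(window(x, a, K) ^ window(y, a, K)) > K:
            return True
        K *= 2
    return False


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(window(x, 5, 4), window(y, 5, 4))   # the windows from the slide's example
print(is_faulty(x, y, 5), is_faulty(x, y, 7))
```

Neither string plays the role of the identity here, which is the point of the symmetrization step.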


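  • Thm 1: Embedding - final step
    We have
    $\mathrm{Ulam}(x,y) \approx \sum_{a \in [d]} \max_k \mathbf{1}\big[\, |X[a;2^k] \,\triangle\, Y[a;2^k]| > 2^k \,\big]$
    (the indicator equals 1 iff the condition is true)

    Replace the indicator by a weight?

    Final embedding: one block per pair (a, k), holding the scaled indicator vector of $X[a;2^k]$; distances combine by $\ell_1$ within a block, max over k, and a sum of squares $(\cdot)^2$ over a

    Example: x = 123456789, y = 123467895, with windows $X[5;2^2]$ and $Y[5;2^2]$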


  • Theorem 2
    Theorem 2. $\ell_{2^2,\infty,1}$ admits NNS on n points with
    $O(\log\log n)$ approximation
    $O(n^\varepsilon)$ query time and $O(n^{1+\varepsilon})$ space for any small $\varepsilon$
    (ignoring factors polynomial in the dimensions)

    A rather general approach
    "LSH" on $\ell_1$-products of general metric spaces
    Of course, cannot do, but can reduce to $\ell_\infty$-products


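  • Thm 2: Proof
    Let's start from basics: $\ell_1$
    [IM98]: c-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space
    (ignoring factors polynomial in the dimension)

    Ok, what about the $\ell_\infty$-product of M?
    Suppose: NNS for M with $c_M$-approx, $Q_M$ query time, $S_M$ space.
    Then: NNS for the $\ell_\infty$-product of M with $O(c_M \cdot \log\log n)$-approx, $O(Q_M)$ query time, $O(S_M \cdot n^{1+\varepsilon})$ space. [I02]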


  • Thm 2: What about $(\ell_2)^2$-product?
    Enough to consider the $\ell_1$-product of M

    (for us, M is the l1-product)

    Off-the-shelf?
    [I04]: gives space ~n or >log n approximation

    We reduce to multiple NNS queries under the $\ell_\infty$-product of M
    Instructive to first look at NNS for standard $\ell_1$


  • Thm 2: Review of NNS for $\ell_1$
    LSH family: collection H of hash functions such that:
    For random $h \in H$ (parameter $\lambda > 0$): $\Pr[h(q)=h(p)] \approx 1 - \|q-p\|_1 / \lambda$

    Query just uses primitive:
    return all points p such that h(q)=h(p)

    Can obtain H by imposing a randomly-shifted grid of side-length $\lambda$

    Then for h defined by $r_i \in [0, \lambda]$ at random, primitive becomes:
    return all p s.t. $|q_i - p_i| < r_i$ for all i
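The randomly-shifted grid can be simulated to see the collision probability from this slide. This is only an illustrative sketch: the names `make_grid_hash` and `collision_rate`, the parameter `side` (the grid side length), and the sample points are all chosen here and do not come from the talk.

```python
import math
import random


def make_grid_hash(dim, side, rng):
    """Hash function mapping a point of R^dim to its cell in a randomly shifted grid."""
    shifts = [rng.uniform(0, side) for _ in range(dim)]  # one random shift per coordinate

    def h(p):
        return tuple(math.floor((p_i + s_i) / side) for p_i, s_i in zip(p, shifts))

    return h


def collision_rate(p, q, side, trials=20000, seed=0):
    """Empirical estimate of Pr[h(p) == h(q)] over random grid shifts."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(p), side, rng)
        hits += h(p) == h(q)
    return hits / trials


p, q = (0.0, 0.0), (0.3, 0.2)
side = 2.0
l1_dist = sum(abs(a - b) for a, b in zip(p, q))
# The estimate should sit slightly above the first-order value 1 - ||q-p||_1 / side = 0.75.
print(collision_rate(p, q, side), 1 - l1_dist / side)
```

For this grid the exact collision probability is the product over coordinates of $1 - |q_i - p_i|/\text{side}$ (0.765 in the example), which is at least $1 - \|q-p\|_1/\text{side}$ and close to it when the points are near each other relative to the side length.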

  • Thm 2: LSH for $\ell_1$-product
    Intuition: abstract LSH!
    Recall we had: for $r_i$ random from $[0, \lambda]$,
    point p returned if for all i: $|q_i - p_i| < r_i$
  • Thm 2: Final
    Thus, sufficient to solve primitive:
    return all points p such that $\max_i d_M(q_i,p_i)/r_i < 1$

    We reduced NNS over the $\ell_1$-product of M
    to several instances of NNS over the $\ell_\infty$-product of M
    (with appropriately scaled coordinates)

    Approximation is $O(1) \cdot O(\log\log n)$
    Done!

  • Take-home message:
    Can embed combinatorial metrics into iterated product spaces
    Works for Ulam (= edit on non-repetitive strings)
    Approach bypasses non-embeddability results into usual-suspect spaces like $\ell_1$, $(\ell_2)^2$ ...
    Open:
    Embeddings for edit over $\{0,1\}^d$, EMD, other metrics?
    Understanding product spaces?
    [Jayram-Woodruff]: sketching
