
Efficient Algorithms for Substring Near Neighbor Problem

Alexandr Andoni

Piotr Indyk

MIT


What’s SNN?

SNN ≈ text indexing with mismatches.

Text indexing: construct a data structure on a text T[1..n] such that, given a query P[1..m], it finds the occurrences of P in T.

Text indexing with mismatches: given P, find the substrings of T that are equal to P except in ≤ R characters.

Motivation: e.g., computational biology (BLAST).

T = GAGTAACTCAATA
P = AGTA
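As a point of reference, here is a minimal brute-force sketch of the problem statement (our own illustration, not the talk's algorithm): scan every length-m substring of T and report those within Hamming distance R of P.

```python
# Brute-force "text indexing with mismatches": O(n*m) per query.
# The point of the talk is to avoid exactly this scan.

def hamming(a: str, b: str) -> int:
    # Number of mismatching positions; assumes len(a) == len(b).
    return sum(x != y for x, y in zip(a, b))

def mismatch_occurrences(T: str, P: str, R: int) -> list:
    m = len(P)
    return [i for i in range(len(T) - m + 1) if hamming(T[i:i + m], P) <= R]

print(mismatch_occurrences("GAGTAACTCAATA", "AGTA", 1))
# -> [1, 9]: the exact occurrence AGTA and the 1-mismatch occurrence AATA
```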


Outline

- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Approach (Or, why SNN?)

SNN = a near neighbor problem in the Hamming metric with m dimensions:
- Construct a data structure on D = {all substrings of T of length m}, s.t. given P, it finds a point in D at distance ≤ R from P.
- Use an NN data structure for the Hamming metric.

T = GAGTAACTCAATA
D = {GAGT, AGTA, GTAA, …, AATA}
P = AGTA
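Concretely, the reduction treats every length-m substring as a point in m-dimensional Hamming space; a two-line sketch of the point set D:

```python
# The point set D of the reduction: all length-m substrings of T.
T = "GAGTAACTCAATA"
m = 4
D = {T[i:i + m] for i in range(len(T) - m + 1)}
print(sorted(D))
# -> ['AACT', 'AATA', 'ACTC', 'AGTA', 'CAAT', 'CTCA', 'GAGT', 'GTAA', 'TAAC', 'TCAA']
```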


Approximate NN

The exact NN problem seems hard (i.e., hard to solve without exponential space or O(n) query time).

Approximate NN is easier. It is defined for an approximation factor c = 1+ε: it is OK to report a point at distance ≤ cR (when there is a point at distance ≤ R).

                 Query             Space
[KOR98, IM98]    poly(log n, m)    n^{O(1/ε^2)}
LSH [IM98]       n^{1/c} + m       n^{1+1/c}

[Figure: a query point q with balls of radius R and cR around it.]


Our contribution

Problem: NN data structures need m in advance, so we would have to construct one for each m ≤ M.

Here: an approximate SNN data structure for unknown m, without degradation in space or query time.

Our algorithm for SNN, based on LSH:
- Supports patterns of length m ≤ M
- Optimal* space: n^{1+1/c}
- Optimal* query time: n^{1/c}
- Slightly worse preprocessing time if c > 3
- Also extends to l_1

(* Optimal w.r.t. LSH, modulo subpolynomial factors)


Outline

- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Locality-Sensitive Hashing

Based on a family of hash functions {g}. For points P[1..m], Q[1..m]:
- If dist(P,Q) ≤ R, then Pr_g[g(P)=g(Q)] = “medium”
- If dist(P,Q) > cR, then Pr_g[g(P)=g(Q)] = “low”

Idea:
- Construct L hash tables with random g_1, g_2, …, g_L
- For query P, look at buckets g_1(P), g_2(P), …, g_L(P)
- Space: L·n. Query time: L. (See the sketch below.)
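A minimal sketch of this generic index (names and structure are ours; any LSH family can be plugged in as hash_fns):

```python
from collections import defaultdict

class LSHIndex:
    def __init__(self, points, hash_fns):
        self.hash_fns = hash_fns
        self.tables = []
        for g in hash_fns:                    # one hash table per g_i
            table = defaultdict(list)
            for p in points:
                table[g(p)].append(p)         # p lands in bucket g_i(p)
            self.tables.append(table)

    def query(self, q, dist, radius):
        # Look only at buckets g_1(q), ..., g_L(q); return any point within
        # the radius (a true near neighbor is missed only with small probability).
        for g, table in zip(self.hash_fns, self.tables):
            for p in table[g(q)]:
                if dist(p, q) <= radius:
                    return p
        return None
```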


LSH for Hamming

Hash function g: projection on k random coordinates.
E.g.: g_1(“AGTA”) = “AA” (k=2)

L = #hash tables = n^{1/c}
k = log n / log(1/(1 − cR/m)) < m · log n

T = GAGTAACTCAATA
D = {GAGT, AGTA, GTAA, …, AATA}

HT_1 (projection on coordinates 1 and 4):
  GT → GAGT
  AA → AGTA, AATA
  GA → GTAA
  …

P = AGTA, R = 1
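A sketch of this Hamming family with the parameter setting above (the rounding of k and L to integers is our choice):

```python
import math
import random

def make_projection(m: int, k: int):
    # One LSH function g: projection onto k random coordinates of an m-char string.
    coords = sorted(random.sample(range(m), k))
    return lambda s: "".join(s[j] for j in coords)

# Parameters as on the slide; rounding is our choice.
n, m, R, c = 10, 4, 1, 2.0
L = round(n ** (1 / c))                                   # number of hash tables
k = round(math.log(n) / math.log(1 / (1 - c * R / m)))    # projection size
gs = [make_projection(m, k) for _ in range(L)]

g1 = make_projection(4, 2)
print(g1("AGTA"))   # "AA" if the sampled coordinates are {0, 3} (slide's 1 and 4)
```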


Outline

- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Unknown m

Bad news: k depends on m! Distinct m ⇒ distinct hash tables.

T = GAGTAACTCAATA
D = {GAG, AGT, …, ACT, …}

HT_1:
  GG → GAG
  AT → AGT, ACT, …
  …

P = AGT, R = 1
g_1(“AGT”) = “AT”


Solution

Let’s just reuse the same data structure for all m:
- g(“AGTA”) = “AA”
- On “AGT” we have to guess the last character: g(“AGT?”) = “A?”
- Like in [exact] text indexing…

T = GAGTAACTCAATA
D = {GAGT, AGTA, …, ACTC, …}

HT_1:
  GT → GAGT
  AA → AGTA, AATA
  GA → GTAA
  AC → ACTC
  …

P = AGT, R = 1
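To see what naive guessing costs (our illustration; the deck's actual fix is the trie on the next slide), enumerate every padding of P and probe each resulting bucket:

```python
from itertools import product

SIGMA = "ACGT"
g = lambda s: s[0] + s[3]   # projection onto coordinates 1 and 4 (0-indexed: 0 and 3)

def padded_buckets(P: str, m: int = 4):
    # All buckets a short query must probe after guessing the missing characters.
    return {g(P + "".join(tail)) for tail in product(SIGMA, repeat=m - len(P))}

print(sorted(padded_buckets("AGT")))   # ['AA', 'AC', 'AG', 'AT']: 4 buckets instead of 1
```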


Tries*!

Replace HT_1 with a trie on g_1(suffixes).
Stop the search when we step outside P.
Same analysis!

T = GAGTAACTCAATA
D = {GAGT, AGTA, …, ACTC, …}

HT_1:
  GT → GAGT
  AA → AGTA, AATA
  GA → GTAA
  AC → ACTC
  …

P = AGT, R = 1

[Figure: the trie over g_1(suffixes), branching on the projected characters, with the suffixes of T stored at the leaves.]

* Tries have been used with LSH before in [MS02], but in a different context
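A compact sketch of the trie variant (our reconstruction of the idea; candidate verification is omitted): build a trie over g(suffix) for every suffix of T, and let a query walk down g(P) until a projected coordinate falls outside P.

```python
def build_trie(T, coords):
    # One trie per hash function; coords = the (increasing) projected coordinates.
    root = {}
    for i in range(len(T)):                 # one path per suffix T[i:]
        node = root
        for j in coords:
            if i + j >= len(T):
                break                       # suffix too short to project further
            node = node.setdefault(T[i + j], {})
        node.setdefault("$", []).append(i)  # suffix i's path ends at this node
    return root

def collect(node):
    # All suffix start positions stored in this subtree.
    out = []
    for key, child in node.items():
        out += child if key == "$" else collect(child)
    return out

def candidates(root, coords, P):
    # Walk g(P) down the trie; stop at the first coordinate outside P.
    node = root
    for j in coords:
        if j >= len(P):
            break                           # outside P: the whole subtree agrees with P
        if P[j] not in node:
            return []
        node = node[P[j]]
    return collect(node)

root = build_trie("GAGTAACTCAATA", [0, 3])
print(sorted(candidates(root, [0, 3], "AGT")))   # [1, 4, 5, 9, 10, 12]
# Candidates agree with P on every projected coordinate inside P; they still
# need verification (substring long enough, Hamming distance <= R).
```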


Resulting performance

Space: n^{1+1/c} (using compressed tries, one trie takes n space). Optimal!

Query time: n^{1/c} · m (m = length of P).
- Not [yet] really optimal: originally, one could do dimensionality reduction.
- Can improve to n^{1/c} + m · n^{o(1)}.

Preprocessing time: n^{1+1/c} · M (M = max m).
- Not optimal (optimal = n^{1+1/c}).
- Can improve to n^{1+1/c} + M^{1/3} · n^{1+o(1)}; optimal for c < 3.


Outline

- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Better query & preprocessing

Redesign LSH to improve query and preprocessing:
- Query: n^{1/c} · m → n^{1/c} + m · n^{o(1)}
- Preprocessing: n^{1+1/c} · M → n^{1+1/c} + n^{1+o(1)} · M

Idea for the new LSH:
- Use the same number of hash tables/tries (L = n^{1/c})
- But use “less randomness” in choosing the hash functions g_1, g_2, …, g_L
- S.t. each g_i looks random, but the g’s are not independent


New LSH scheme

Old scheme:
- Choose L hash functions g_i
- Each g_i = projection on k random coordinates

New scheme:
- Construct the L functions g_i from a smaller number of “base” hash functions
- A “base” hash function = projection on k/2 random coordinates
- {g_i, i = 1..L} = all pairs of “base” hash functions
- Need only ~L^{1/2} “base” hash functions! (See the sketch below.)
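A sketch of the pairing construction (our code; it ignores the subtleties of overlapping coordinate sets and of analyzing the dependent g_i):

```python
import random
from itertools import combinations

def make_projection(m, k):
    coords = sorted(random.sample(range(m), k))
    return lambda s: "".join(s[j] for j in coords)

def paired_lsh(m, k, L):
    # Smallest w with C(w, 2) >= L base functions, each on k/2 coordinates.
    w = 2
    while w * (w - 1) // 2 < L:
        w += 1
    base = [make_projection(m, k // 2) for _ in range(w)]
    # g_i = concatenation of a pair of base functions; all pairs, trimmed to L.
    return [lambda s, a=u, b=v: a(s) + b(s)
            for u, v in combinations(base, 2)][:L]
```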


Example

k = 4, w = #base functions = 4, L = C(w, 2) = C(4, 2) = 6

u_1, u_2, u_3, u_4 = projections on k/2 = 2 random coordinates each
g_1 = <u_1, u_2>, g_2 = <u_1, u_3>, g_3 = <u_1, u_4>, …


Saving time

Can save time since there are fewer “base” hash functions.

E.g.: computing fingerprints. We want to compute FP(g_i(P)) for i = 1..L, where

  FP(g_i(P)) = ( Σ_j P[j] · χ_j^i · 2^j ) mod prime

and χ_j^i indicates whether coordinate j is selected by g_i.

Old way: would take L · m time for the L functions g.
New way: takes L^{1/2} · m time for the L^{1/2} functions u_i.
- Need only L time to combine the FP(u(P)) into the FP(g(P)):
- If g = <u_1, u_2>, then FP(g(P)) = ( FP(u_1(P)) + FP(u_2(P)) ) mod prime.

Total: L + L^{1/2} · m
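A sketch of the combination step (our code; the specific prime, the ord() encoding of characters, and the disjoint coordinate sets are our simplifying assumptions):

```python
PRIME = (1 << 61) - 1   # an arbitrary large prime (our choice)

def fingerprint(P, coords):
    # FP over the selected coordinates: sum of P[j] * 2^j, mod a prime.
    return sum(ord(P[j]) * pow(2, j, PRIME) for j in coords) % PRIME

# If g = <u1, u2> and the coordinate sets are disjoint, the terms of
# FP(g(P)) split exactly into the two base sums, so combining is O(1) per g_i.
u1, u2 = [0, 2], [1, 3]   # hypothetical base coordinate sets
P = "AGTA"
fp_g = (fingerprint(P, u1) + fingerprint(P, u2)) % PRIME
assert fp_g == fingerprint(P, u1 + u2)
```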


Better query & preprocessing (2)

E.g., for the query: use fingerprints to leap faster in the trie.
- Yields time n^{1/c} + n^{1/(2c)} · m (since L = n^{1/c}).

To get n^{1/c} + n^{o(1)} · m, generalize:
- g = a tuple of t base functions
- a base function = projection on k/t random coordinates
- Other details are similar to the fingerprints. (See the sketch below.)
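The pairing sketch above generalizes directly to t-tuples (our code; base_fns would be built as before, now on k/t coordinates each):

```python
from itertools import combinations

def tupled_lsh(base_fns, t, L):
    # g = concatenation of a t-tuple of base functions: C(len(base_fns), t) >= L
    # combined functions from only ~t * L^(1/t) base functions.
    return [lambda s, fs=fs: "".join(f(s) for f in fs)
            for fs in combinations(base_fns, t)][:L]
```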


Better preprocessing (3)

Preprocessing: from n^{1+1/c} + n^{1+o(1)} · M, can get n^{1+1/c} + n^{1+o(1)} · M^{1/3}:
- Can construct a trie in n · M^{1/3} time (instead of n · M), using FFT, etc.


Outline

- General approach
  - View: Near Neighbor problem in Hamming metric
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution = LSH + Tries
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Conclusions

Problem: Substring Near Neighbor (a.k.a. text indexing with mismatches).

Approach:
- View as NN in m-dimensional Hamming space
- Use LSH

Challenge: variable-length patterns without degradation in performance.

Solution:
- Space/query optimal (w.r.t. LSH)
- Preprocessing optimal (w.r.t. LSH) for c < 3


Extensions

Extends to l_1:
- Nontrivial, since it needs quite different LSH functions
- Preprocessing is slightly worse: n^{1+1/c} + n^{1+o(1)} · M^{2/3}
- Uses the “less-than matching” problem [Amir-Farach’95]


Remarks

Other approaches? Or, why LSH for SNN?
- A better SNN would give a better NN…
- And LSH is the “best” known algorithm for high-dimensional NN (using reasonable space).


Thanks!