
Efficient Algorithms for Substring Near Neighbor Problem

Alexandr Andoni

Piotr Indyk

MIT


What’s SNN?

SNN ≈ text indexing with mismatches.

Text indexing: construct a data structure on a text T[1..n] such that, given a query P[1..m], it finds the occurrences of P in T.

Text indexing with mismatches: given P, find the substrings of T that are equal to P except in ≤ R characters.

Motivation: e.g., computational biology (BLAST).

T = GAGTAACTCAATA
P = AGTA
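As a point of reference, here is a minimal brute-force sketch of the problem statement (our own illustration, not the talk's algorithm): scan every length-m substring of T and report those within Hamming distance R of P.

```python
# Brute-force "text indexing with mismatches": O(n*m) per query.
# The point of the talk is to avoid exactly this scan.

def hamming(a: str, b: str) -> int:
    # Number of mismatching positions; assumes len(a) == len(b).
    return sum(x != y for x, y in zip(a, b))

def mismatch_occurrences(T: str, P: str, R: int) -> list:
    m = len(P)
    return [i for i in range(len(T) - m + 1) if hamming(T[i:i + m], P) <= R]

print(mismatch_occurrences("GAGTAACTCAATA", "AGTA", 1))
# -> [1, 9]: the exact occurrence AGTA and the 1-mismatch occurrence AATA
```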


Outline

- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Approach (Or, why SNN?)

SNN = a near neighbor problem in the Hamming metric with m dimensions:
- Construct a data structure on D = {all substrings of T of length m}, s.t. given P, it finds a point in D at distance ≤ R from P.
- Use an NN data structure for the Hamming metric.

T = GAGTAACTCAATA
D = {GAGT, AGTA, GTAA, …, AATA}
P = AGTA
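Concretely, the reduction treats every length-m substring as a point in m-dimensional Hamming space; a two-line sketch of the point set D:

```python
# The point set D of the reduction: all length-m substrings of T.
T = "GAGTAACTCAATA"
m = 4
D = {T[i:i + m] for i in range(len(T) - m + 1)}
print(sorted(D))
# -> ['AACT', 'AATA', 'ACTC', 'AGTA', 'CAAT', 'CTCA', 'GAGT', 'GTAA', 'TAAC', 'TCAA']
```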


Approximate NN

The exact NN problem seems hard (i.e., hard to solve without exponential space or O(n) query time).

Approximate NN is easier. It is defined for an approximation factor c = 1+ε: it is OK to report a point at distance ≤ cR (when there is a point at distance ≤ R).

                 Query             Space
[KOR98, IM98]    poly(log n, m)    n^{O(1/ε^2)}
LSH [IM98]       n^{1/c} + m       n^{1+1/c}

[Figure: a query point q with balls of radius R and cR around it.]


Our contribution

Problem: NN data structures need m in advance, so we would have to construct one for each m ≤ M.

Here: an approximate SNN data structure for unknown m, without degradation in space or query time.

Our algorithm for SNN, based on LSH:
- Supports patterns of length m ≤ M
- Optimal* space: n^{1+1/c}
- Optimal* query time: n^{1/c}
- Slightly worse preprocessing time if c > 3
- Also extends to l_1

(* Optimal w.r.t. LSH, modulo subpolynomial factors)


Outline

- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Locality-Sensitive Hashing

Based on a family of hash functions {g}. For points P[1..m], Q[1..m]:
- If dist(P,Q) ≤ R, then Pr_g[g(P)=g(Q)] = “medium”
- If dist(P,Q) > cR, then Pr_g[g(P)=g(Q)] = “low”

Idea:
- Construct L hash tables with random g_1, g_2, …, g_L
- For query P, look at buckets g_1(P), g_2(P), …, g_L(P)
- Space: L·n. Query time: L. (See the sketch below.)
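A minimal sketch of this generic index (names and structure are ours; any LSH family can be plugged in as hash_fns):

```python
from collections import defaultdict

class LSHIndex:
    def __init__(self, points, hash_fns):
        self.hash_fns = hash_fns
        self.tables = []
        for g in hash_fns:                    # one hash table per g_i
            table = defaultdict(list)
            for p in points:
                table[g(p)].append(p)         # p lands in bucket g_i(p)
            self.tables.append(table)

    def query(self, q, dist, radius):
        # Look only at buckets g_1(q), ..., g_L(q); return any point within
        # the radius (a true near neighbor is missed only with small probability).
        for g, table in zip(self.hash_fns, self.tables):
            for p in table[g(q)]:
                if dist(p, q) <= radius:
                    return p
        return None
```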


LSH for Hamming

Hash function g: projection on k random coordinates.
E.g.: g_1(“AGTA”) = “AA” (k=2)

L = #hash tables = n^{1/c}
k = log n / log(1/(1 − cR/m)) < m · log n

T = GAGTAACTCAATA
D = {GAGT, AGTA, GTAA, …, AATA}

HT_1 (projection on coordinates 1 and 4):
  GT → GAGT
  AA → AGTA, AATA
  GA → GTAA
  …

P = AGTA, R = 1
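A sketch of this Hamming family with the parameter setting above (the rounding of k and L to integers is our choice):

```python
import math
import random

def make_projection(m: int, k: int):
    # One LSH function g: projection onto k random coordinates of an m-char string.
    coords = sorted(random.sample(range(m), k))
    return lambda s: "".join(s[j] for j in coords)

# Parameters as on the slide; rounding is our choice.
n, m, R, c = 10, 4, 1, 2.0
L = round(n ** (1 / c))                                   # number of hash tables
k = round(math.log(n) / math.log(1 / (1 - c * R / m)))    # projection size
gs = [make_projection(m, k) for _ in range(L)]

g1 = make_projection(4, 2)
print(g1("AGTA"))   # "AA" if the sampled coordinates are {0, 3} (slide's 1 and 4)
```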


Outline

- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Unknown m

Bad news: k depends on m! Distinct m ⇒ distinct hash tables.

T = GAGTAACTCAATA
D = {GAG, AGT, …, ACT, …}

HT_1:
  GG → GAG
  AT → AGT, ACT, …
  …

P = AGT, R = 1
g_1(“AGT”) = “AT”


Solution

Let’s just reuse the same data structure for all m:
- g(“AGTA”) = “AA”
- On “AGT” we have to guess the last character: g(“AGT?”) = “A?”
- Like in [exact] text indexing…

T = GAGTAACTCAATA
D = {GAGT, AGTA, …, ACTC, …}

HT_1:
  GT → GAGT
  AA → AGTA, AATA
  GA → GTAA
  AC → ACTC
  …

P = AGT, R = 1
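To see what naive guessing costs (our illustration; the deck's actual fix is the trie on the next slide), enumerate every padding of P and probe each resulting bucket:

```python
from itertools import product

SIGMA = "ACGT"
g = lambda s: s[0] + s[3]   # projection onto coordinates 1 and 4 (0-indexed: 0 and 3)

def padded_buckets(P: str, m: int = 4):
    # All buckets a short query must probe after guessing the missing characters.
    return {g(P + "".join(tail)) for tail in product(SIGMA, repeat=m - len(P))}

print(sorted(padded_buckets("AGT")))   # ['AA', 'AC', 'AG', 'AT']: 4 buckets instead of 1
```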


Tries*!

Replace HT_1 with a trie on g_1(suffixes).
Stop the search when we step outside P.
Same analysis!

T = GAGTAACTCAATA
D = {GAGT, AGTA, …, ACTC, …}

HT_1:
  GT → GAGT
  AA → AGTA, AATA
  GA → GTAA
  AC → ACTC
  …

P = AGT, R = 1

[Figure: the trie over g_1(suffixes), branching on the projected characters, with the suffixes of T stored at the leaves.]

* Tries have been used with LSH before in [MS02], but in a different context
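A compact sketch of the trie variant (our reconstruction of the idea; candidate verification is omitted): build a trie over g(suffix) for every suffix of T, and let a query walk down g(P) until a projected coordinate falls outside P.

```python
def build_trie(T, coords):
    # One trie per hash function; coords = the (increasing) projected coordinates.
    root = {}
    for i in range(len(T)):                 # one path per suffix T[i:]
        node = root
        for j in coords:
            if i + j >= len(T):
                break                       # suffix too short to project further
            node = node.setdefault(T[i + j], {})
        node.setdefault("$", []).append(i)  # suffix i's path ends at this node
    return root

def collect(node):
    # All suffix start positions stored in this subtree.
    out = []
    for key, child in node.items():
        out += child if key == "$" else collect(child)
    return out

def candidates(root, coords, P):
    # Walk g(P) down the trie; stop at the first coordinate outside P.
    node = root
    for j in coords:
        if j >= len(P):
            break                           # outside P: the whole subtree agrees with P
        if P[j] not in node:
            return []
        node = node[P[j]]
    return collect(node)

root = build_trie("GAGTAACTCAATA", [0, 3])
print(sorted(candidates(root, [0, 3], "AGT")))   # [1, 4, 5, 9, 10, 12]
# Candidates agree with P on every projected coordinate inside P; they still
# need verification (substring long enough, Hamming distance <= R).
```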


Resulting performance

Space: n^{1+1/c} (using compressed tries, one trie takes n space). Optimal!

Query time: n^{1/c} · m (m = length of P).
- Not [yet] really optimal: originally, one could do dimensionality reduction.
- Can improve to n^{1/c} + m · n^{o(1)}.

Preprocessing time: n^{1+1/c} · M (M = max m).
- Not optimal (optimal = n^{1+1/c}).
- Can improve to n^{1+1/c} + M^{1/3} · n^{1+o(1)}; optimal for c < 3.


Outline

- General approach
  - View: Near Neighbor in Hamming
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Better query & preprocessing

Redesign LSH to improve query and preprocessing:
- Query: n^{1/c} · m → n^{1/c} + m · n^{o(1)}
- Preprocessing: n^{1+1/c} · M → n^{1+1/c} + n^{1+o(1)} · M

Idea for the new LSH:
- Use the same number of hash tables/tries (L = n^{1/c})
- But use “less randomness” in choosing the hash functions g_1, g_2, …, g_L
- S.t. each g_i looks random, but the g’s are not independent


New LSH scheme

Old scheme:
- Choose L hash functions g_i
- Each g_i = projection on k random coordinates

New scheme:
- Construct the L functions g_i from a smaller number of “base” hash functions
- A “base” hash function = projection on k/2 random coordinates
- {g_i, i = 1..L} = all pairs of “base” hash functions
- Need only ~L^{1/2} “base” hash functions! (See the sketch below.)
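A sketch of the pairing construction (our code; it ignores the subtleties of overlapping coordinate sets and of analyzing the dependent g_i):

```python
import random
from itertools import combinations

def make_projection(m, k):
    coords = sorted(random.sample(range(m), k))
    return lambda s: "".join(s[j] for j in coords)

def paired_lsh(m, k, L):
    # Smallest w with C(w, 2) >= L base functions, each on k/2 coordinates.
    w = 2
    while w * (w - 1) // 2 < L:
        w += 1
    base = [make_projection(m, k // 2) for _ in range(w)]
    # g_i = concatenation of a pair of base functions; all pairs, trimmed to L.
    return [lambda s, a=u, b=v: a(s) + b(s)
            for u, v in combinations(base, 2)][:L]
```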


Example

k = 4, w = #base functions = 4, L = C(w, 2) = C(4, 2) = 6

u_1, u_2, u_3, u_4 = projections on k/2 = 2 random coordinates each
g_1 = <u_1, u_2>, g_2 = <u_1, u_3>, g_3 = <u_1, u_4>, …


Saving time

Can save time since there are fewer “base” hash functions.

E.g.: computing fingerprints. We want to compute FP(g_i(P)) for i = 1..L, where

  FP(g_i(P)) = ( Σ_j P[j] · χ_j^i · 2^j ) mod prime

and χ_j^i indicates whether coordinate j is selected by g_i.

Old way: would take L · m time for the L functions g.
New way: takes L^{1/2} · m time for the L^{1/2} functions u_i.
- Need only L time to combine the FP(u(P)) into the FP(g(P)):
- If g = <u_1, u_2>, then FP(g(P)) = ( FP(u_1(P)) + FP(u_2(P)) ) mod prime.

Total: L + L^{1/2} · m
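A sketch of the combination step (our code; the specific prime, the ord() encoding of characters, and the disjoint coordinate sets are our simplifying assumptions):

```python
PRIME = (1 << 61) - 1   # an arbitrary large prime (our choice)

def fingerprint(P, coords):
    # FP over the selected coordinates: sum of P[j] * 2^j, mod a prime.
    return sum(ord(P[j]) * pow(2, j, PRIME) for j in coords) % PRIME

# If g = <u1, u2> and the coordinate sets are disjoint, the terms of
# FP(g(P)) split exactly into the two base sums, so combining is O(1) per g_i.
u1, u2 = [0, 2], [1, 3]   # hypothetical base coordinate sets
P = "AGTA"
fp_g = (fingerprint(P, u1) + fingerprint(P, u2)) % PRIME
assert fp_g == fingerprint(P, u1 + u2)
```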


Better query & preprocessing (2)

E.g., for the query: use fingerprints to leap faster in the trie.
- Yields time n^{1/c} + n^{1/(2c)} · m (since L = n^{1/c}).

To get n^{1/c} + n^{o(1)} · m, generalize:
- g = a tuple of t base functions
- a base function = projection on k/t random coordinates
- Other details are similar to the fingerprints. (See the sketch below.)
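The pairing sketch above generalizes directly to t-tuples (our code; base_fns would be built as before, now on k/t coordinates each):

```python
from itertools import combinations

def tupled_lsh(base_fns, t, L):
    # g = concatenation of a t-tuple of base functions: C(len(base_fns), t) >= L
    # combined functions from only ~t * L^(1/t) base functions.
    return [lambda s, fs=fs: "".join(f(s) for f in fs)
            for fs in combinations(base_fns, t)][:L]
```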


Better preprocessing (3)

Preprocessing: from n^{1+1/c} + n^{1+o(1)} · M, can get n^{1+1/c} + n^{1+o(1)} · M^{1/3}:
- Can construct a trie in n · M^{1/3} time (instead of n · M), using FFT, etc.


Outline

- General approach
  - View: Near Neighbor problem in Hamming metric
  - Focus: reducing space
- Background
  - Locality-Sensitive Hashing (LSH)
- Solution = LSH + Tries
  - Reducing query & preprocessing
  - Redesign LSH
- Concluding remarks


Conclusions

Problem: Substring Near Neighbor (a.k.a. text indexing with mismatches).

Approach:
- View as NN in m-dimensional Hamming space
- Use LSH

Challenge: variable-length patterns without degradation in performance.

Solution:
- Space/query optimal (w.r.t. LSH)
- Preprocessing optimal (w.r.t. LSH) for c < 3


Extensions

Extends to l_1:
- Nontrivial, since it needs quite different LSH functions
- Preprocessing is slightly worse: n^{1+1/c} + n^{1+o(1)} · M^{2/3}
- Uses the “less-than matching” problem [Amir-Farach’95]


Remarks

Other approaches? Or, why LSH for SNN?
- A better SNN would give a better NN…
- And LSH is the “best” known algorithm for high-dimensional NN (using reasonable space).


Thanks!