A Sublinear Algorithm For Weakly Approximating Edit Distance

Batu, Ergun, Kilian, Magen, Raskhodnikova, Rubinfeld, Sami

Presentation by Itai Dinur

Edit Distance (Levenshtein distance)

Let A,B be two strings over a fixed alphabet Σ. The edit distance D(A,B) between A and B is defined as the minimum number of character insertions, deletions, and substitutions that transform A into B, or vice versa.

Applications

Bioinformatics

Text processing

Web search

Algorithms

Wagner and Fischer gave a dynamic programming algorithm that runs in time O(n^2).

Masek and Paterson gave an improved algorithm that runs in time O(n^2/log n).
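For reference, the quadratic dynamic program can be sketched as follows. This is a minimal illustration of the Wagner-Fischer recurrence; the function name `edit_distance` and the two-row layout are choices made for this sketch, not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Wagner-Fischer dynamic program, O(len(a) * len(b)) time."""
    # prev[j] holds D(a[:i-1], b[:j]); only two rows of the table are kept.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # D(a[:i], "") costs i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters agree
            ))
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # prints 3
```

The table has (|A|+1)(|B|+1) cells and each is filled in constant time, which is where the O(n^2) bound comes from.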

The Edit Distance Testing Problem

On input A,B and parameters 0<α<1, C>1:

If D(A,B)≤n^α, output CLOSE with probability at least 2/3.

If D(A,B)>n/C, output FAR with probability at least 2/3.

Note that the output is unrestricted for n^α<D(A,B)≤n/C. E.g., the test cannot distinguish between distances n^0.1 and n^0.9.

The algorithm presented for the problem runs in time Õ(n^max{α/2,2α-1}).

Motivation

In some applications, given many pairs of strings, one is interested in computing the edit distance only for close strings.

For string pairs where the edit distance is above a certain threshold, the actual value of the distance is irrelevant.

Lower Bound

Any probabilistic algorithm for the edit distance testing problem requires Ω(n^(α/2)) queries.

The algorithm presented for the problem runs in time Õ(n^max{α/2,2α-1}), which is close to optimal for α≤2/3.
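The relation between the two bounds is easy to check numerically. The sketch below only evaluates the exponents stated on the slides; the helper names `upper_exponent` and `lower_exponent` are introduced here for illustration.

```python
def upper_exponent(alpha: float) -> float:
    """Exponent of n in the running time, ignoring polylog factors."""
    return max(alpha / 2, 2 * alpha - 1)


def lower_exponent(alpha: float) -> float:
    """Exponent of n in the query lower bound."""
    return alpha / 2


for alpha in (0.25, 0.5, 0.6, 0.75, 0.9):
    gap = upper_exponent(alpha) - lower_exponent(alpha)
    print(f"alpha={alpha}: upper={upper_exponent(alpha):.3f}, "
          f"lower={lower_exponent(alpha):.3f}, gap={gap:.3f}")
```

The two terms inside the max are equal when α/2 = 2α-1, i.e. at α = 2/3. Below that point the exponents of the upper and lower bounds coincide; above it the 2α-1 term dominates and a gap opens.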

Other Approximations

There are several papers that give better approximation results, but none run in sublinear time.

Andoni and Onak give an algorithm that computes the edit distance between two strings up to a factor of 2^Õ(√log n) in n^(1+o(1)) time.

Algorithm Overview

A recursive divide and conquer algorithm.

B is broken into substrings which are recursively matched against A.

The matches are pieced together to form a matching for A.

It is too expensive to match all the substrings.

A small number of them are sampled and matched, relying on statistical properties of the matchings.

Approximate Matching

Definition 1: An interval I = B[s…e] has a (t,E)-(approximate) matching with respect to A if for some interval A[s’…e’], s’=s+t and D(A[s’…e’],I)≤E

A abcd1234efgh5678

B cd02

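Here I = B[1…4] = cd02 is compared with A[3…6] = cd12, which differs from it by one substitution, so I has a (2,1)-(approximate) matching with respect to A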
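Definition 1 can be checked by brute force on this example. The sketch below is not part of the sublinear algorithm: it uses the quadratic edit distance and tries every end position e', and the helper names `edit_distance` and `has_matching` are made up for this illustration. Positions are 1-indexed and inclusive, as on the slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Quadratic dynamic program for the edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def has_matching(A: str, B: str, s: int, e: int, t: int, E: int) -> bool:
    """True if I = B[s..e] has a (t, E)-matching with respect to A."""
    I = B[s - 1:e]
    s2 = s + t  # Definition 1 fixes the start: s' = s + t
    if s2 < 1 or s2 > len(A) + 1:
        return False
    # The end e' is free, so try every interval A[s'..e'] (including the empty one).
    return any(edit_distance(A[s2 - 1:e2], I) <= E
               for e2 in range(s2 - 1, len(A) + 1))


A = "abcd1234efgh5678"
B = "cd02"
print(has_matching(A, B, 1, 4, t=2, E=1))  # True: A[3..6] = "cd12"
print(has_matching(A, B, 1, 4, t=2, E=0))  # False: no exact copy starts at A[3]
```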

Coordinated Matching

Definition 2: Let I = (I1,…,Ik) be a collection of intervals. We say that I has a (t,σ,E,D)-coordinated matching with A if for all but D of the intervals Ii ∈ I, Ii has a (ti,E)-matching with A, where |t-ti|≤σ

A abcd1234efgh5678

B cd0236gjfkl5

I has a (1,1,2,1)-coordinated matching with A

Coordinated Matching to Approximate Matching

We decompose an interval I of size S into k disjoint continuous subintervals, I=(I1,…Ik), each of size S’=S/k (assuming k|S)

Lemma 1: If (I1,…Ik) has a (t,σ,εS’,δk)-coordinated matching with A, then I has a (t,βS)-(approximate) matching with A, where β = (2σ/S’ + ε+δ)

Approximate Matching to Coordinated Matching

Lemma 2: Let c>1 and S>cE. If I has a (t,E)-matching with A, then I=(I1,…,Ik) has a (t,E,cE/k,k/c)-coordinated matching with A

Lemma 3: If I has a (t,E)-matching with A, and k≥E, then I=(I1,…,Ik) has a (t,E,0,E)-coordinated matching with A

To match A and B

Decompose B into a set of continuous disjoint intervals I.

Lemma 2 argues that a match for A and B gives a coordinated matching for A and I.

Use a subroutine (COORD-MATCHES) to find coordinated matches for I.

Lemma 1 infers the existence of good matches for B from coordinated matches for I.

COORD-MATCHES

COORD-MATCHES(A,I,σ,E,D,ε,c)

Let d be a constant, l=d·log(n).

Choose samples i1,…,il uniformly and independently from [1,…,k].

For each chosen sample ij, compute Tj=MATCHES(A,I',E), where I' is the interval of I with index ij.

Let Δ=(D/k+ε/2)l.

Return the set T, where t ∈ T iff Tj∩[t-σ…t+σ]=Ø for at most Δ sets Tj.
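The steps of COORD-MATCHES can be sketched as below. Several things here are assumptions made for the sketch and are not specified on the slides: the `matches` subroutine is passed in as a callable, the candidate shifts are given explicitly as `candidates`, intervals are represented as (0-indexed start, text) pairs, the constant `d` defaults to 8, and the demo plugs in a toy exact-occurrence matcher in place of the recursive MATCHES.

```python
import math
import random


def coord_matches(A, intervals, sigma, E, D, eps, matches, candidates, d=8, rng=random):
    """Sampling step of COORD-MATCHES; `matches(A, interval, E)` returns a set of shifts."""
    k = len(intervals)
    l = max(1, math.ceil(d * math.log(len(A))))  # l = d log n samples
    # Sample l interval indices uniformly and independently, and match each one.
    sampled = [matches(A, intervals[rng.randrange(k)], E) for _ in range(l)]
    delta = (D / k + eps / 2) * l  # number of sampled misses we tolerate
    T = set()
    for t in candidates:
        # A sample "misses" t if none of its shifts lies in [t - sigma, t + sigma].
        misses = sum(1 for Tj in sampled
                     if not any(abs(t - tj) <= sigma for tj in Tj))
        if misses <= delta:
            T.add(t)
    return T


def exact_matches(A, interval, E):
    """Toy stand-in for MATCHES: shifts at which the interval occurs verbatim (E ignored)."""
    start, text = interval
    return {p - start for p in range(len(A) - len(text) + 1) if A.startswith(text, p)}


A = "xxabcdefgh"
intervals = [(0, "ab"), (2, "cd"), (4, "ef"), (6, "gh")]  # B = "abcdefgh" cut into 4 pieces
result = coord_matches(A, intervals, sigma=0, E=0, D=0, eps=0.1,
                       matches=exact_matches, candidates=range(-3, 4))
print(result)  # {2}
```

In the demo every piece of B occurs in A only at shift 2, so the outcome does not depend on which pieces are sampled.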

Sampling Lemma

Lemma 4: Suppose that a random element of a set S of size n has a property Z with probability p. For any positive ε and c, there exists d such that for d·log(n) random samples from S the fraction p’ of these samples with property Z satisfies p-ε/2≤p’≤p+ε/2 with probability 1-1/n^c
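One standard way to obtain such a constant d is Hoeffding's inequality; the slides do not say which concentration bound is used, so this is an assumption of the sketch. For m independent samples, P(|p’-p| > ε/2) ≤ 2·exp(-m·ε^2/2), which is at most 1/n^c once m ≥ 2(c·ln n + ln 2)/ε^2. The function name `samples_needed` is made up for this illustration.

```python
import math


def samples_needed(n: int, eps: float, c: float) -> int:
    """Samples m with 2*exp(-m*eps**2/2) <= n**(-c), via Hoeffding's inequality."""
    return math.ceil(2 * (c * math.log(n) + math.log(2)) / eps ** 2)


n, eps, c = 10 ** 6, 0.1, 2
m = samples_needed(n, eps, c)
failure_bound = 2 * math.exp(-m * eps ** 2 / 2)
print(m, failure_bound <= n ** (-c))
```

The sample count grows like (2c/ε^2)·ln n, so it is logarithmic in n for fixed ε and c, as the lemma states.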

COORD-MATCHES

Lemma 5: With probability 1-1/n^(c-1) over the random coins of COORD-MATCHES, the output T of COORD-MATCHES(A,I,σ,E,D,ε,c) has the following properties:

If I has a (t,σ,E,D)-coordinated matching then t ∈ T.

If t ∈ T then I has a (t,σ,E,D+εk)-coordinated matching.

MATCHES(A,I,E)

If E≥1, use a recursive call to COORD-MATCHES

If E<1 (i.e., E=0), then A must contain the interval I unchanged. The set of t values is computed directly using the algorithm SHIFTS

Implementing SHIFTS

A naïve implementation of SHIFTS may give an output set T consisting of n elements.

We may restrict the allowed shifts to [-n^α,…,+n^α].

However, we need a running time of o(n^α), so we must further restrict the set of possible outputs.

The Approximate Matching problem

Actually, we will solve the approximate matching problem: Given a block I=B[s…e] of length b=e-s+1, and a constant c2>1, find all indexes s’ such that A[s’…(s’+b-1)] matches I, in the sense that the two substrings have Hamming distance at most b/c2.

Note that if D(A,B)<n^α, it is enough to consider s’ in the interval [s-n^α,s+n^α].

The Approximate Matching problem

Naively, for a given t we can randomly sample O(log(n)) indexes i to determine (with high probability) whether the substring A[(t+1)…(t+b)] matches I, and try all 2n^α possible shifts.

This requires Ω(n^α) queries to A.

The Ruler Procedure

We can compare pairs of characters A[i],I[j] so that some pair is compared for every difference i-j from 0 to u=2n^α, using only √u queries to each string, provided that b>√u.

In A, the character positions divisible by √u are queried: A[√u,2√u,…,u]. In I, √u consecutive positions are queried: I[1…√u].

Define cen=⌊t/√u⌋+1 and mil=t (mod √u). Then for i=cen·√u and j=√u-mil we get i-j=t.
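The decomposition of a shift into one coarse mark and one fine mark can be checked directly. The sketch assumes the reading cen = ⌊t/√u⌋ + 1 and mil = t mod √u, which is the one under which i - j = t holds, and it assumes u is a perfect square; the function name `ruler_marks` is made up for this illustration.

```python
import math


def ruler_marks(t: int, u: int) -> tuple[int, int]:
    """Return (i, j): a multiple of sqrt(u) in A and a position 1..sqrt(u) in I with i - j = t."""
    r = math.isqrt(u)   # sqrt(u), assuming u is a perfect square
    cen = t // r + 1    # which coarse ("cen") mark of A to use
    mil = t % r         # remainder, selects the fine ("mil") mark of I
    return cen * r, r - mil


u = 9
for t in range(u):
    i, j = ruler_marks(t, u)
    print(f"t={t}: i={i}, j={j}, i-j={i - j}")
```

With u = 9 only the positions 3, 6, 9 of A and 1, 2, 3 of I are ever used, yet every shift t from 0 to 8 appears as a difference i - j. That is 2√u queried positions instead of u.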

The Ruler Procedure

To test whether a block matches: pick l=Θ(log(n)) random numbers m1,m2…,ml from [0,b-√u]

For each cen and mil marks construct a fingerprint with l offsets e.g. f(√u)=A[√u+m1,√u+m2,…,√u+ml]

Detect with high probability whether a block matches with shift t by comparing the cen and mil fingerprints, i.e.

f(cen√u)=A[cen√u+m1,…,cen√u+ml] and f(j)=I[j+m1,…,j+ml], where j=√u-(t mod √u)

The Ruler Procedure

If b≤√u we have only O(b) mil marks and Ω(u/b) cen marks

We can find all matching shifts by using O(max{√u,u/b}log(n)) queries

Efficient Implementation of the Ruler

We need an efficient algorithm to compare all fingerprints and return valid shifts

A dbadaabcdabddcd

B abcdab

u=|A|-|B|=9, √u=3, l=2, m1=1, m2=3

Fingerprint   A-List   B-List
da            3
bd            6        1
ad            9
ca                     2
db                     3
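The fingerprint table for this example can be reproduced with a dictionary keyed by fingerprint. The sketch uses the parameters from the slide (√u=3, m1=1, m2=3) with 1-indexed positions; the names `fingerprint` and `table` are chosen for this illustration, and the final verification of the candidate shift is an addition of the sketch.

```python
from collections import defaultdict

A = "dbadaabcdabddcd"
B = "abcdab"
r = 3             # sqrt(u), with u = |A| - |B| = 9
offsets = [1, 3]  # m1, m2 from the slide


def fingerprint(s: str, pos: int) -> str:
    """Characters s[pos + m] for each offset m, with 1-indexed positions."""
    return "".join(s[pos + m - 1] for m in offsets)


# Each fingerprint maps to a pair (A-list, B-list) of positions.
table = defaultdict(lambda: ([], []))
for i in range(r, len(A) - len(B) + 1, r):  # cen marks 3, 6, 9 in A
    table[fingerprint(A, i)][0].append(i)
for j in range(1, r + 1):                   # mil marks 1, 2, 3 in B
    table[fingerprint(B, j)][1].append(j)

# A fingerprint with entries in both lists yields candidate shifts i - j.
shifts = sorted(i - j for a_list, b_list in table.values()
                for i in a_list for j in b_list)
print(dict(table))
print(shifts)                      # [5]
print(A[5:5 + len(B)] == B)        # True: A[6..11] equals B
```

Only the fingerprint bd appears in both lists, with A-list entry 6 and B-list entry 1, which gives the single shift t = 5. Equal fingerprints only make a shift a candidate with high probability, so the last line confirms it by comparing the full block.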

Quantizing the Ruler

The explicit list of all matching t can have Ω(u) values

We round the values of t to multiples of some integer Q and return all quantized shifts

The running time is O(max{√u,u/b,u/Q}log(n))

SHIFTS(A,I,Q)

Initialize the fingerprint data structure.

Pick l=Θ(log(n)) random numbers m1,m2,…,ml.

Add all the fingerprints f(i) of A to the data structure, adding i to the A-list of f(i).

Add all the fingerprints f(j) of I to the data structure, adding j to the B-list of f(j).

Quantize all A-lists and B-lists.

For each fingerprint, output the list of quantized shifts (differences).

SHIFTS(A,I,Q)

Theorem 1: Procedure SHIFTS finds all quantized shifts of interval I in A, with high probability. It runs in time O(max{√u,u/b,u/Q}log(n)), where u=|A|-b

MATCHES(A,I,E)

If E<1, use SHIFTS to compute T.

If E≥1:

Set k=min{εn^(1-α),2c1E}.

Decompose I into a set I of continuous disjoint intervals of size |I|/k.

Compute T=COORD-MATCHES(A,I,E,c1E/k,k/c1).

Return T.

DECIDE(A,B,α,C)

Choose sufficiently small ε and sufficiently large c1 (given α,C).

Let the quantization parameter be Q=εmin{n^(1-α),n^(α/2)}.

Set T = MATCHES(A,B,n^α).

If T is nonempty, output CLOSE; otherwise output FAR.

DECIDE(A,B,α,C)

For any fixed α<1, we can choose constants ε and c1 such that procedure DECIDE solves the edit distance testing problem with high probability

Running Time Analysis

Note that when k=2c1E, COORD-MATCHES is called with edit distance parameter c1E/k=1/2<1, i.e., the next call to MATCHES will call SHIFTS and end the recursion.

At each level, the interval input to MATCHES shrinks by a factor of k=Ω(n^(1-α)). After r=α/(1-α) levels the intervals are of length n/n^(r(1-α))=O(n^(1-α)) and E=O(n^α/n^(r(1-α)))=O(1), so SHIFTS will be called next.

Running Time Analysis α<1/2

One level of recursion.

B is broken to intervals of size O(n^α).

d·log(n) calls to SHIFTS with Q=εn^(α/2).

Each call takes O(max{√u,u/b,u/Q}log(n)) = O(max{n^(α/2),1,n^(α/2)}log(n)) = O(n^(α/2) log(n)).

One merge taking O(n^(α/2) log(n)).

Total running time O(n^(α/2) log^2(n)).

Running Time Analysis 1/2<α<2/3

Two levels of recursion.

At the last level, B is broken to intervals of size O(n^(α/2)).

log^2(n) calls to SHIFTS with Q=εn^(α/2).

Each call takes O(n^(α/2) log(n)).

log(n) merges, each taking O(n^(α/2) log(n)).

Total running time O(n^(α/2) log^3(n)).

Running Time Analysis α>2/3

r>2 levels of recursion.

At the last level, B is broken to intervals of size O(n^(1-α)).

log^O(1)(n) calls to SHIFTS with Q=εn^(1-α).

Note that n^(1-α)<n^(α/2).

Each call takes O(max{√u,u/b,u/Q}log(n)) = O((u/b)log(n)) = O(n^(2α-1) log(n)).

Total running time Õ(n^(2α-1) log(n)).

Conclusion

We saw an algorithm for the edit distance testing problem that runs in time Õ(n^max{α/2,2α-1}).

Any probabilistic algorithm for the edit distance testing problem requires Ω(n^(α/2)) queries.