A Sublinear Algorithm For Weakly Approximating Edit Distance
Batu, Ergun, Kilian, Magen, Raskhodnikova, Rubinfeld, Sami
Presentation by Itai Dinur
Edit Distance (Levenshtein distance)
Let A,B be two strings over a fixed alphabet Σ. The edit distance D(A,B) between A and B is defined as the minimum number of character insertions, deletions, and substitutions that transform A into B, or vice versa.
Applications
Bioinformatics
Text processing
Web search
Algorithms
Wagner and Fischer gave a dynamic programming algorithm that runs in time O(n^2)
Masek and Paterson gave an improved algorithm that runs in time O(n^2/log n)
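As a point of reference for the quadratic bound, here is a minimal Python sketch of the textbook dynamic program for edit distance. The function name `edit_distance` and the two-row layout are illustrative choices, not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic O(|a|*|b|) dynamic program."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so the sketch uses O(|b|) memory while still doing quadratic work.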
The Edit Distance Testing Problem
On input A,B and parameters 0<α<1, C>1:
If D(A,B)≤n^α, output CLOSE with probability at least 2/3
If D(A,B)>n/C, output FAR with probability at least 2/3
Note that the output is unrestricted for n^α<D(A,B)≤n/C
E.g. the test cannot distinguish between distance n^0.1 and distance n^0.9
The algorithm presented for the problem runs in time Õ(n^max{α/2,2α-1})
Motivation
In some applications, given many pairs of strings, one is interested in computing the edit distance only for close strings
For string pairs where the edit distance is above a certain threshold, the actual value of the distance is irrelevant
Lower Bound
Any probabilistic algorithm for the edit distance test problem requires Ω(n^(α/2)) queries
The algorithm presented for the problem runs in time Õ(n^max{α/2,2α-1}), which is close to optimal for α≤2/3
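To see where the upper and lower bounds meet, the exponents can be compared directly. The sketch below is only arithmetic on the two stated bounds; the function names are hypothetical and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def upper_exponent(alpha):
    """Exponent of n in the stated running time, max{alpha/2, 2*alpha - 1}."""
    return max(alpha / 2, 2 * alpha - 1)


def lower_exponent(alpha):
    """Exponent of n in the stated query lower bound, alpha/2."""
    return alpha / 2


for alpha in (Fraction(1, 2), Fraction(2, 3), Fraction(3, 4)):
    print(alpha, upper_exponent(alpha), lower_exponent(alpha))
```

For α≤2/3 the two exponents coincide at α/2; above 2/3 the term 2α-1 dominates and a gap opens between the bounds.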
Other Approximations
There are several papers that give better approximation results, but none run in sublinear time
Andoni and Onak give an algorithm that computes the edit distance between two strings up to a factor of 2^Õ(√log n) in n^(1+o(1)) time
Algorithm Overview
A recursive divide and conquer algorithm
B is broken into substrings which are recursively matched against A
The matches are pieced together to form a matching for A
It is too expensive to match all the substrings
A small number of them are sampled and matched, relying on statistical properties of the matchings
Approximate Matching
Definition 1: An interval I = B[s…e] has a (t,E)-(approximate) matching with respect to A if for some interval A[s’…e’], s’=s+t and D(A[s’…e’],I)≤E
A abcd1234efgh5678
B cd02
I has a (2,1)-(approximate) matching with respect to A
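The example can be checked mechanically. The following Python sketch tests Definition 1 by brute force, assuming 1-indexed inclusive positions as on the slide; `has_matching` and `edit_distance` are illustrative helper names, and trying every end position e’ is a simplification chosen for clarity, not efficiency.

```python
def edit_distance(a, b):
    """Standard quadratic dynamic program for Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def has_matching(A, B, s, e, t, E):
    """True if I = B[s..e] has a (t, E)-matching with respect to A."""
    I = B[s - 1:e]
    start = s + t  # s' = s + t, 1-indexed
    if start < 1 or start > len(A) + 1:
        return False
    # Try every end position e' >= s' - 1 (the empty interval included).
    return any(edit_distance(A[start - 1:end], I) <= E
               for end in range(start - 1, len(A) + 1))


A = "abcd1234efgh5678"
B = "cd02"
print(has_matching(A, B, s=1, e=4, t=2, E=1))
```

With t=2 the candidate interval starts at A[3]; A[3…6] is "cd12", which differs from "cd02" by one substitution.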
Coordinated Matching
Definition 2: Let I = (I1,…,Ik) be a collection of intervals. We say that I has a (t,σ,E,D)-coordinated matching with A if for all but D of the intervals Ii ∈ I, Ii has a (ti,E)-matching with A, where |t-ti|≤σ
A abcd1234efgh5678
B cd0236gjfkl5
I has a (1,1,2,1)-coordinated matching with A
Coordinated Matching to Approximate Matching
We decompose an interval I of size S into k disjoint contiguous subintervals, I=(I1,…,Ik), each of size S’=S/k (assuming k|S)
Lemma 1: If (I1,…Ik) has a (t,σ,εS’,δk)-coordinated matching with A, then I has a (t,βS)-(approximate) matching with A, where β = (2σ/S’ + ε+δ)
Approximate Matching to Coordinated Matching
Lemma 2: Let c>1 and S>cE. If I has a (t,E)-matching with A, then I=(I1,…,Ik) has a (t,E,cE/k,k/c)-coordinated matching with A
Lemma 3: If I has a (t,E)-matching with A, and k≥E, then I=(I1,…,Ik) has a (t,E,0,E)-coordinated matching with A
To match A and B
Decompose B into a set of contiguous disjoint intervals I
Lemma 2 argues that a match for A and B gives a coordinated matching for A and I
Use a subroutine (COORD-MATCHES) to find coordinated matches for I
Lemma 1 infers the existence of good matches for B from coordinated matches for I
COORD-MATCHES
COORD-MATCHES(A,I,σ,E,D,ε,c)
Let d be a constant, l=d·log(n)
Choose samples i1,…,il uniformly and independently from [1,…,k]
For each chosen sample ij compute Tj=MATCHES(A,Iij,E), where Iij is the ij-th interval of I
Let Δ=(D/k+ε/2)l
Return the set T, where t ∈ T iff Tj∩[t-σ…t+σ]=Ø for at most Δ sets Tj
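The sampling-and-voting step can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the callback `matches(i)` stands in for MATCHES(A, Ii, E) and is assumed to return a set of shifts, `candidates` is an assumed finite range of shifts to test, and the constant `d=4` and the base-2 logarithm are arbitrary choices.

```python
import math
import random


def coord_matches(matches, k, n, sigma, D, eps, candidates, d=4, rng=None):
    """Sketch of COORD-MATCHES over k intervals indexed 0..k-1."""
    rng = rng or random.Random()
    l = max(1, math.ceil(d * math.log2(n)))          # number of samples
    samples = [rng.randrange(k) for _ in range(l)]   # uniform, independent
    shift_sets = [set(matches(i)) for i in samples]  # T_j for each sample
    delta = (D / k + eps / 2) * l                    # tolerated misses
    result = set()
    for t in candidates:
        # Count sampled intervals with no shift inside [t - sigma, t + sigma].
        misses = sum(
            1 for T_j in shift_sets
            if not any(t - sigma <= x <= t + sigma for x in T_j)
        )
        if misses <= delta:
            result.add(t)
    return result


# Toy usage: every interval reports shift 5, so only 4, 5, 6 survive.
T = coord_matches(lambda i: {5}, k=10, n=1024, sigma=1, D=1, eps=0.2,
                  candidates=range(0, 11), rng=random.Random(0))
print(sorted(T))
```

A shift t is kept when at most Δ of the sampled intervals fail to have a matching within σ of t, which mirrors the return condition stated above.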
Sampling Lemma
Lemma 4: Suppose that a random element of a set S of size n has a property Z with probability p. For any positive ε and c, there exists d such that for d·log(n) random samples from S the fraction p’ of these samples with property Z satisfies p-ε/2≤p’≤p+ε/2 with probability at least 1-1/n^c
COORD-MATCHES
Lemma 5: With probability 1-1/n^(c-1) over the random coins of COORD-MATCHES, the output T of COORD-MATCHES(A,I,σ,E,D,ε,c) has the following properties:
If I has a (t,σ,E,D)-coordinated matching then t ∈ T
If t ∈ T then I has a (t,σ,E,D+εk)-coordinated matching
MATCHES(A,I,E)
If E≥1, use a recursive call to COORD-MATCHES
If E<1 (i.e. E=0), then A must contain the interval I unchanged. The set of t values is computed directly using the algorithm SHIFTS
Implementing SHIFTS
A naïve implementation of SHIFTS may give an output set T consisting of n elements
We may restrict the allowed shifts to [-n^α,…,+n^α]
However, we need a running time of o(n^α), so we must further restrict the set of possible outputs
The Approximate Matching problem
Actually, we will solve the approximate matching problem: Given a block I=B[s…e] of length b=e-s+1, and a constant c2>1, find all indexes s’ such that A[s’…(s’+b-1)] matches I, in the sense that the two substrings have Hamming distance at most b/c2
Note that if D(A,B)<n^α, it is enough to consider s’ in the interval [s-n^α,s+n^α]
The Approximate Matching problem
Naively, we can randomly sample O(log(n)) indexes i to determine (with high probability) whether the substring A[(t+1)…(t+b)] matches I, for a given t, and try all 2n^α possible shifts
This requires Ω(n^α) queries to A
The Ruler Procedure
We can compare pairs of characters A[i],I[j] such that some pair is compared for every difference i-j from 0 to u=2n^α, with √u queries to each string, given that b>√u
In A, character positions divisible by √u are queried: A[√u,2√u,…,u]. In I, √u consecutive positions are queried: I[1…√u]
Define cen=⌊t/√u⌋+1 and mil=t (mod √u); then for i=cen·√u and j=√u-mil, i-j=t
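The decomposition of a shift into one cen mark and one mil mark can be verified with a few lines of Python. The sketch assumes u is a perfect square and covers shifts 0 through u-1 with cen marks √u, 2√u, …, u; the function names are illustrative.

```python
import math


def ruler_marks(u):
    """Cen marks (positions in A) and mil marks (positions in I), 1-indexed."""
    root = math.isqrt(u)
    if root * root != u:
        raise ValueError("this sketch assumes u is a perfect square")
    cen_marks = [c * root for c in range(1, root + 1)]  # root, 2*root, ..., u
    mil_marks = list(range(1, root + 1))                # 1, 2, ..., root
    return cen_marks, mil_marks


def marks_for_shift(t, u):
    """Return (i, j) with i a cen mark, j a mil mark and i - j == t."""
    root = math.isqrt(u)
    cen = t // root + 1
    mil = t % root
    return cen * root, root - mil


cen_marks, mil_marks = ruler_marks(9)
for t in range(9):
    i, j = marks_for_shift(t, 9)
    print(t, i, j)
```

With u=9 only 3+3 positions are queried, yet every shift from 0 to 8 is the difference of some queried pair.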
The Ruler Procedure
To test whether a block matches: pick l=Θ(log(n)) random numbers m1,m2,…,ml from [0,b-√u]
For each cen and mil mark, construct a fingerprint with l offsets, e.g. f(√u)=A[√u+m1,√u+m2,…,√u+ml]
Detect with high probability whether a block matches with shift t by comparing the cen and mil fingerprints, i.e. f(cen√u)=A[cen√u+m1,…,cen√u+ml] and f(t (mod √u))=I[t (mod √u)+m1,…,t (mod √u)+ml]
The Ruler Procedure
If b≤√u we have only O(b) mil marks and Ω(u/b) cen marks
We can find all matching shifts by using O(max{√u,u/b}log(n)) queries
Efficient Implementation of the Ruler
We need an efficient algorithm to compare all fingerprints and return valid shifts
A dbadaabcdabddcd
B abcdab
u=|A|-|B|=9, √u=3, l=2, m1=1, m2=3
Fingerprint   A-List   B-List
da            3        -
bd            6        1
ad            9        -
ca            -        2
db            -        3
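The worked example above can be reproduced with a short Python sketch. It assumes 1-indexed positions as on the slides, cen marks at multiples of √u in A, mil marks 1…√u in B, and the offsets m1=1, m2=3; the function names are illustrative and not from the paper.

```python
from collections import defaultdict


def fingerprint(s, pos, offsets):
    """Characters of s at 1-indexed positions pos + m, for each offset m."""
    return "".join(s[pos + m - 1] for m in offsets)


def fingerprint_table(a, b, root_u, offsets):
    """Map each fingerprint to its A-list (cen marks) and B-list (mil marks)."""
    u = len(a) - len(b)
    table = defaultdict(lambda: {"A": [], "B": []})
    for i in range(root_u, u + 1, root_u):   # cen marks in A
        table[fingerprint(a, i, offsets)]["A"].append(i)
    for j in range(1, root_u + 1):           # mil marks in B
        table[fingerprint(b, j, offsets)]["B"].append(j)
    return dict(table)


def candidate_shifts(table):
    """Differences i - j for marks that share a fingerprint."""
    return sorted({i - j for row in table.values()
                   for i in row["A"] for j in row["B"]})


A = "dbadaabcdabddcd"
B = "abcdab"
table = fingerprint_table(A, B, root_u=3, offsets=(1, 3))
shifts = candidate_shifts(table)
print(table)
print(shifts)
```

The only fingerprint with entries in both lists is "bd" (A-list 6, B-list 1), giving shift 5, and indeed A[6…11] equals B.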
Quantizing the Ruler
The explicit list of all matching t can have Ω(u) values
We round the values of t to multiples of some integer Q and return all quantized shifts
The running time is O(max{√u,u/b,u/Q}log(n))
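As a small illustration of quantization, the sketch below collapses a list of shifts onto multiples of Q. Rounding down is an assumption made here for concreteness (the slides only say the values are rounded), and `quantize_shifts` is a hypothetical name.

```python
def quantize_shifts(shifts, Q):
    """Round each shift down to a multiple of Q and drop duplicates."""
    return sorted({(t // Q) * Q for t in shifts})


print(quantize_shifts([0, 1, 2, 5, 6, 7, 8], Q=3))
```

At most about u/Q distinct values remain, which is the source of the u/Q term in the running time.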
SHIFTS(A,I,Q)
Initialize the fingerprint data structure
Pick l=Θ(log(n)) random numbers m1,m2,…,ml
Add all the fingerprints f(i) of A to the data structure, adding i to the A-list of f(i)
Add all the fingerprints f(j) of I to the data structure, adding j to the B-list of f(j)
Quantize all A-lists and B-lists
For each fingerprint, output the list of quantized shifts (differences)
SHIFTS(A,I,Q)
Theorem 1: Procedure SHIFTS finds all quantized shifts of interval I in A, with high probability. It runs in time O(max{√u,u/b,u/Q}log(n)), where u=|A|-b
MATCHES(A,I,E)
If E<1, use SHIFTS to compute T
If E≥1:
Set k=min{εn^(1-α),2c1E}
Decompose I into a set I of contiguous disjoint intervals of size |I|/k
Compute T=COORD-MATCHES(A,I,E,c1E/k,k/c1)
Return T
DECIDE(A,B,α,C)
Choose sufficiently small ε, and sufficiently large c1 (given α,C)
Let the quantization parameter be Q=ε·min{n^(1-α),n^(α/2)}
Set T = MATCHES(A,B,n^α)
If T is nonempty, output CLOSE, otherwise output FAR
DECIDE(A,B,α,C)
For any fixed α<1, we can choose constants ε and c1 such that procedure DECIDE solves the edit distance testing problem with high probability
Running Time Analysis
Note that when k=2c1E, COORD-MATCHES is called with edit distance parameter c1E/k=1/2<1, i.e. the next call to MATCHES will call SHIFTS and end the recursion
At each level, the interval input to MATCHES shrinks by a factor of k=Ω(n^(1-α)). After r=α/(1-α) levels the intervals are of length n/n^(r(1-α))=O(n^(1-α)) and E=O(n^α/n^(r(1-α)))=O(1), so SHIFTS will be called next
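The exponent bookkeeping can be checked with exact arithmetic. The sketch below ignores constants and only tracks powers of n; rounding r up to a whole number of levels is an assumption of this sketch, and the function names are hypothetical.

```python
import math
from fractions import Fraction


def recursion_levels(alpha):
    """Number of levels, taking r = alpha / (1 - alpha) rounded up."""
    return math.ceil(alpha / (1 - alpha))


def exponents_after(alpha, r):
    """Exponents of n for the interval length and for E after r levels,
    each level dividing both by n^(1 - alpha)."""
    length_exp = 1 - r * (1 - alpha)   # n / n^(r(1-alpha))
    error_exp = alpha - r * (1 - alpha)  # n^alpha / n^(r(1-alpha))
    return length_exp, error_exp


alpha = Fraction(3, 4)
r = alpha / (1 - alpha)
print(r, exponents_after(alpha, r))
print([recursion_levels(a) for a in (Fraction(2, 5), Fraction(3, 5), Fraction(3, 4))])
```

For α=3/4 this gives r=3, interval length n^(1/4)=n^(1-α) and E=n^0=O(1), in line with the claim above.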
Running Time Analysis α<1/2
One level of recursion
B is broken into intervals of size O(n^α)
d·log(n) calls to SHIFTS with Q=εn^(α/2)
Each call takes O(max{√u,u/b,u/Q}log(n)) = O(max{n^(α/2),1,n^(α/2)}log(n)) = O(n^(α/2)log(n))
One merge taking O(n^(α/2)log(n))
Total running time O(n^(α/2)log^2(n))
Running Time Analysis 1/2<α<2/3
Two levels of recursion
At the last level, B is broken into intervals of size O(n^(α/2))
log^2(n) calls to SHIFTS with Q=εn^(α/2)
Each call takes O(n^(α/2)log(n))
log(n) merges, each taking O(n^(α/2)log(n))
Total running time O(n^(α/2)log^3(n))
Running Time Analysis α>2/3
r>2 levels of recursion
At the last level, B is broken into intervals of size O(n^(1-α))
log^O(1)(n) calls to SHIFTS with Q=εn^(1-α)
Note that n^(1-α)<n^(α/2)
Each call takes O(max{√u,u/b,u/Q}log(n)) = O((u/b)log(n)) = O(n^(2α-1)log(n))
Total running time Õ(n^(2α-1))
Conclusion
We saw an algorithm for the edit distance test problem that runs in time Õ(n^max{α/2,2α-1})
Any probabilistic algorithm for the edit distance test problem requires Ω(n^(α/2)) queries