Dynamic Edit Distance Table under a General Weighted Cost Function
Heikki Hyyrö (University of Tampere, Finland), Kazuyuki Narisawa (Kyushu University, Japan),
and Shunsuke Inenaga (Kyushu University, Japan)
Contents
• Edit Distance
• Left Increment/Decrement Edit Distance Problem
• Related Work
• Our Algorithm
• Experiments
• Summary
Edit Distance
The edit distance is the minimum total cost d of transforming string x[1:n] into y[1:m].

Edit operation costs
• Insertion: Ins. = δ(ε, b)
• Deletion: Del. = δ(a, ε)
• Substitution: Sub. = δ(a, b)

Example
x = prague, y = passage, Ins. = Del. = Sub. = 1

x:  p r - - a g u e
      ⇓ ⇓ ⇓     ⇓
y:  p a s s a g - e

Edit Distance = Sub. + Ins. + Ins. + Del. = 1 + 1 + 1 + 1 = 4

Table D (rows: passage, columns: prague)
        p  r  a  g  u  e
     0  1  2  3  4  5  6
  0  0  1  2  3  4  5  6
p 1  1  0  1  2  3  4  5
a 2  2  1  1  1  2  3  4
s 3  3  2  2  2  2  3  4
s 4  4  3  3  3  3  3  4
a 5  5  4  4  3  4  4  4
g 6  6  5  5  4  3  4  5
e 7  7  6  6  5  4  4  4
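Dynamic Programming
For strings A = a_1 … a_m (rows) and B = b_1 … b_n (columns), the table D is filled by:

\[
D[i, 0] = \sum_{h=1}^{i} \delta(a_h, \varepsilon) \quad (0 \le i \le m)
\]
\[
D[0, j] = \sum_{h=1}^{j} \delta(\varepsilon, b_h) \quad (0 \le j \le n)
\]
\[
D[i, j] = \min \left\{ \begin{array}{l}
D[i-1, j-1] + \delta(a_i, b_j) \\
D[i-1, j] + \delta(a_i, \varepsilon) \\
D[i, j-1] + \delta(\varepsilon, b_j)
\end{array} \right. \quad (1 \le i \le m,\ 1 \le j \le n)
\]

Example of one step (unit costs, a_i = a, b_j = a): the new entry is min{1 + 0, 2 + 1, 1 + 1} = 1.
     a  a
b    1  2
a    1  1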
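The recurrence can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `edit_distance_table` and the convention of passing the three costs as callables `ins`, `dele`, `sub` are assumptions made for the example. Rows follow the first string and columns the second, as in the table for passage and prague.

```python
def edit_distance_table(a, b, ins, dele, sub):
    """Return D where D[i][j] is the edit distance between a[:i] and b[:j]."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # Column 0: only deletions of a_1 .. a_i.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + dele(a[i - 1])
    # Row 0: only insertions of b_1 .. b_j.
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins(b[j - 1])
    # Interior cells: minimum over substitution, deletion, insertion.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                D[i - 1][j] + dele(a[i - 1]),
                D[i][j - 1] + ins(b[j - 1]),
            )
    return D


# Unit costs, as in the slide example (hypothetical helper names).
unit_ins = lambda ch: 1
unit_del = lambda ch: 1
unit_sub = lambda x, y: 0 if x == y else 1

D = edit_distance_table("passage", "prague", unit_ins, unit_del, unit_sub)
print(D[7][6])  # 4, the bottom-right entry of the table
```

Filling the whole table this way takes O(nm) time, which is the cost of the naïve method for every left increment or decrement.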
Right Increment/Decrement
• Right I/D of Edit Distance
  ▫ input: D of strings A and B
  ▫ output: D' of strings A and B' (B = B'a for decrement, Ba = B' for increment)
  ▫ easy to compute
  ▫ insert or delete the rightmost column of D → D': O(m)

D (B = prague)                  D' (B' = pragu)
        p  r  a  g  u  e                p  r  a  g  u
     0  1  2  3  4  5  6             0  1  2  3  4  5
  0  0  1  2  3  4  5  6          0  0  1  2  3  4  5
p 1  1  0  1  2  3  4  5        p 1  1  0  1  2  3  4
a 2  2  1  1  1  2  3  4        a 2  2  1  1  1  2  3
s 3  3  2  2  2  2  3  4        s 3  3  2  2  2  2  3
s 4  4  3  3  3  3  3  4        s 4  4  3  3  3  3  3
a 5  5  4  4  3  4  4  4        a 5  5  4  4  3  4  4
g 6  6  5  5  4  3  4  5        g 6  6  5  5  4  3  4
e 7  7  6  6  5  4  4  4        e 7  7  6  6  5  4  4

D → D': decrement
D' → D: increment
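A small Python sketch of why the right-hand case is easy: appending a character to B only adds one column, and each new entry depends on entries that are already known. The helper names (`base_column`, `append_column`, `drop_column`) are made up for this illustration, and unit costs are assumed.

```python
def base_column(a, dele):
    """Column 0 of D: the cost of deleting each prefix of a."""
    D = [[0]]
    for ch in a:
        D.append([D[-1][0] + dele(ch)])
    return D


def append_column(D, a, ch, ins, dele, sub):
    """Right increment: extend D in place by one column for B + ch, in O(m)."""
    D[0].append(D[0][-1] + ins(ch))
    for i in range(1, len(a) + 1):
        left = D[i][-1]      # D[i][j-1]: row i has not been extended yet
        up = D[i - 1][-1]    # D[i-1][j]: row i-1 was extended in the previous step
        diag = D[i - 1][-2]  # D[i-1][j-1]
        D[i].append(min(diag + sub(a[i - 1], ch),
                        up + dele(a[i - 1]),
                        left + ins(ch)))


def drop_column(D):
    """Right decrement: remove the last column of D in place, in O(m)."""
    for row in D:
        row.pop()


ins = lambda ch: 1
dele = lambda ch: 1
sub = lambda x, y: 0 if x == y else 1

A = "passage"
D = base_column(A, dele)
for ch in "prague":          # build the table by six right increments
    append_column(D, A, ch, ins, dele, sub)
last_row_full = list(D[-1])  # A against "prague"
drop_column(D)               # one right decrement
last_row_short = list(D[-1])  # A against "pragu"
```

No existing entry is rewritten in either direction, which is the property that fails for the left-hand case.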
Left Increment/Decrement
• Left I/D of ED
  ▫ input: D of strings A and B
  ▫ output: D' of strings A and B' (B = aB' for decrement, aB = B' for increment)
  ▫ difficult to compute
  ▫ values on the left side affect the values on the right side

D' (B' = rague)              D (B = prague)
        r  a  g  u  e                p  r  a  g  u  e
     0  2  3  4  5  6             0  1  2  3  4  5  6
  0  0  1  2  3  4  5          0  0  1  2  3  4  5  6
p 1  1  1  2  3  4  5        p 1  1  0  1  2  3  4  5
a 2  2  2  1  2  3  4        a 2  2  1  1  1  2  3  4
s 3  3  3  2  2  3  4        s 3  3  2  2  2  2  3  4
s 4  4  4  3  3  3  4        s 4  4  3  3  3  3  3  4
a 5  5  5  4  4  4  4        a 5  5  4  4  3  4  4  4
g 6  6  6  5  4  5  5        g 6  6  5  5  4  3  4  5
e 7  7  7  6  5  5  5        e 7  7  6  6  5  4  4  4

D → D': decrement
D' → D: increment
Contribution
• Propose an efficient algorithm for the Left I/D problem with any nonnegative integer costs
• Left I/D problem
  ▫ input: ED table D of strings A and B
  ▫ output: ED table D' of strings A and B', where B = aB' (decrement) or B' = aB (increment)
  ▫ costs of operations are nonnegative integers
Applications
• Cyclic string comparison [Landau et al. 1998]
• Computing approximate periods [Schmidt 1998]
• Edit distance for a sliding window
• String kernel based on edit distance
  ▫ a kernel is a mapping to a high-dimensional feature space, used in Support Vector Machines (classifiers)
Related Work
• Naïve method
  ▫ compute D' from scratch
  ▫ O(nm) time
• Kim & Park algorithm [2004]
  ▫ each operation has cost 1
  ▫ computes the difference representation DR of table D using the change table Ch
  ▫ O(n+m) time
Definition
• Left Increment/Decrement Problem
  ▫ input: DR table of strings A and B
  ▫ output: DR' table of strings A and B'
    B = aB' (decrement)
    B' = aB (increment)
• Each cost (Ins., Del., Sub.) is a nonnegative integer
  ▫ Kim & Park algorithm: each cost is 1
Difference Representation
\[
DR[i, j].U = D[i, j] - D[i-1, j] \quad \text{(lower entry minus upper entry)}
\]
\[
DR[i, j].L = D[i, j] - D[i, j-1] \quad \text{(right entry minus left entry)}
\]

D
        p  r  a  g  u  e
     0  1  2  3  4  5  6
  0  0  1  2  3  4  5  6
p 1  1  0  1  2  3  4  5
a 2  2  1  1  1  2  3  4
s 3  3  2  2  2  2  3  4
s 4  4  3  3  3  3  3  4
a 5  5  4  4  3  4  4  4
g 6  6  5  5  4  3  4  5
e 7  7  6  6  5  4  4  4

DR.U                     DR.L
     p  r  a  g  u  e         p  r  a  g  u  e
p 1 -1 -1 -1 -1 -1 -1    p 1 -1  1  1  1  1  1
a 2  1  0 -1 -1 -1 -1    a 2 -1  0  0  1  1  1
s 3  1  1  1  0  0  0    s 3 -1  0  0  0  1  1
s 4  1  1  1  1  0  0    s 4 -1  0  0  0  0  1
a 5  1  1  0  1  1  0    a 5 -1  0 -1  1  0  0
g 6  1  1  1 -1  0  1    g 6 -1  0 -1 -1  1  1
e 7  1  1  1  1  0 -1    e 7 -1  0 -1 -1  0  0
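The two definitions can be checked directly against the example table. The Python sketch below hard-codes the table D for passage and prague from the slide; the function name `difference_representation` and the list layout (row `i-1` of `U` holds DR[i, ·].U) are choices made only for this illustration.

```python
# Table D for A = "passage" (rows) and B = "prague" (columns), unit costs.
D = [
    [0, 1, 2, 3, 4, 5, 6],
    [1, 0, 1, 2, 3, 4, 5],
    [2, 1, 1, 1, 2, 3, 4],
    [3, 2, 2, 2, 2, 3, 4],
    [4, 3, 3, 3, 3, 3, 4],
    [5, 4, 4, 3, 4, 4, 4],
    [6, 5, 5, 4, 3, 4, 5],
    [7, 6, 6, 5, 4, 4, 4],
]


def difference_representation(D):
    """Return (U, L): vertical and horizontal differences of D.

    U[i-1][j] = D[i][j] - D[i-1][j]   for i = 1..m, j = 0..n
    L[i][j-1] = D[i][j] - D[i][j-1]   for i = 0..m, j = 1..n
    """
    m, n = len(D) - 1, len(D[0]) - 1
    U = [[D[i][j] - D[i - 1][j] for j in range(n + 1)] for i in range(1, m + 1)]
    L = [[D[i][j] - D[i][j - 1] for j in range(1, n + 1)] for i in range(m + 1)]
    return U, L


U, L = difference_representation(D)

# The edit distance is still recoverable: walk down column 0, then along the last row.
distance = sum(U[i][0] for i in range(len(U))) + sum(L[-1])
print(distance)  # 4
```

The last two lines show that storing only differences loses no information about the final distance.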
DR' – DR
We need not update all cells. Columns are aligned on the same character of B (B = prague, B' = rague).

DR'.U                 DR.U                     DR'.U – DR.U
     r  a  g  u  e         p  r  a  g  u  e         r  a  g  u  e
p 1  0  0  0  0  0    p 1 -1 -1 -1 -1 -1 -1    p 1  1  1  1  1  1
a 2  1 -1 -1 -1 -1    a 2  1  0 -1 -1 -1 -1    a 2  1  0  0  0  0
s 3  1  1  0  0  0    s 3  1  1  1  0  0  0    s 3  0  0  0  0  0
s 4  1  1  1  0  0    s 4  1  1  1  1  0  0    s 4  0  0  0  0  0
a 5  1  1  1  1  0    a 5  1  1  0  1  1  0    a 5  0  1  0  0  0
g 6  1  1  0  1  1    g 6  1  1  1 -1  0  1    g 6  0  0  1  1  0
e 7  1  1  1  0  0    e 7  1  1  1  1  0 -1    e 7  0  0  0  0  1

DR'.L                 DR.L                     DR'.L – DR.L
     r  a  g  u  e         p  r  a  g  u  e         r  a  g  u  e
p 1  0  1  1  1  1    p 1 -1  1  1  1  1  1    p 1 -1  0  0  0  0
a 2  0 -1  1  1  1    a 2 -1  0  0  1  1  1    a 2  0 -1  0  0  0
s 3  0 -1  0  1  1    s 3 -1  0  0  0  1  1    s 3  0 -1  0  0  0
s 4  0 -1  0  0  1    s 4 -1  0  0  0  0  1    s 4  0 -1  0  0  0
a 5  0 -1  0  0  0    a 5 -1  0 -1  1  0  0    a 5  0  0 -1  0  0
g 6  0 -1 -1  1  0    g 6 -1  0 -1 -1  1  1    g 6  0  0  0  0 -1
e 7  0 -1 -1  0  0    e 7 -1  0 -1 -1  0  0    e 7  0  0  0  0  0
Change Table
• Ch[i, j] = D'[i, j] – D[i, j]
• cost = 1
  ▫ values in Ch: –1, 0, 1
  ▫ Ch is separated into three areas

D' – D = Ch

D'                       D                           Ch
        r  a  g  u  e            p  r  a  g  u  e         p  r  a  g  u  e
  0  0  1  2  3  4  5      0  0  1  2  3  4  5  6      0 -1 -1 -1 -1 -1 -1
p 1  1  1  2  3  4  5    p 1  1  0  1  2  3  4  5    p 1  1  0  0  0  0  0
a 2  2  2  1  2  3  4    a 2  2  1  1  1  2  3  4    a 2  1  1  0  0  0  0
s 3  3  3  2  2  3  4    s 3  3  2  2  2  2  3  4    s 3  1  1  0  0  0  0
s 4  4  4  3  3  3  4    s 4  4  3  3  3  3  3  4    s 4  1  1  0  0  0  0
a 5  5  5  4  4  4  4    a 5  5  4  4  3  4  4  4    a 5  1  1  1  0  0  0
g 6  6  6  5  4  5  5    g 6  6  5  5  4  3  4  5    g 6  1  1  1  1  1  0
e 7  7  7  6  5  5  5    e 7  7  6  6  5  4  4  4    e 7  1  1  1  1  1  1
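The change table for the example can be reproduced with a few lines of Python. This is a sketch under assumptions: the helper names `edit_table` and `change_table` are invented here, costs are constants, and the columns are aligned as D'[i][j] against D[i][j+1], which is how the tables on the slide line up (column 0 of D' sits under column p of D).

```python
def edit_table(a, b, ins=1, dele=1, sub=1):
    """Edit distance table with constant costs; a match costs 0."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = 0 if a[i - 1] == b[j - 1] else sub
            D[i][j] = min(D[i - 1][j - 1] + step,
                          D[i - 1][j] + dele,
                          D[i][j - 1] + ins)
    return D


def change_table(a, b, **costs):
    """Ch for a left decrement of b: new table minus old table, column-aligned."""
    old = edit_table(a, b, **costs)
    new = edit_table(a, b[1:], **costs)
    return [[new[i][j] - old[i][j + 1] for j in range(len(b))]
            for i in range(len(a) + 1)]


Ch = change_table("passage", "prague")
for row in Ch:
    print(row)
values = sorted({v for row in Ch for v in row})
```

With unit costs every entry lies in {-1, 0, 1}, and the printed rows show the three areas as contiguous blocks.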
Affected Entries
• entries where DR'[i, j] ≠ DR[i, j]
  ▫ they must be updated
  ▫ affected entries lie along the borders of the three areas in Ch

DR'.U – DR.U          DR'.L – DR.L          Ch = D' – D
     r  a  g  u  e         r  a  g  u  e         r  a  g  u  e
                                              0 -1 -1 -1 -1 -1
p 1  1  1  1  1  1    p 1 -1  0  0  0  0    p 1  0  0  0  0  0
a 2  1  0  0  0  0    a 2  0 -1  0  0  0    a 2  1  0  0  0  0
s 3  0  0  0  0  0    s 3  0 -1  0  0  0    s 3  1  0  0  0  0
s 4  0  0  0  0  0    s 4  0 -1  0  0  0    s 4  1  0  0  0  0
a 5  0  1  0  0  0    a 5  0  0 -1  0  0    a 5  1  1  0  0  0
g 6  0  0  1  1  0    g 6  0  0  0  0 -1    g 6  1  1  1  1  0
e 7  0  0  0  0  1    e 7  0  0  0  0  0    e 7  1  1  1  1  1
Sketch of Kim & Park Algorithm
• Update affected entries
  ▫ scan the borders in Ch, computing Ch and DR'
• Time complexity: O(n+m)
General Costs
• Ch can be separated into more than three areas
  ▫ the number of areas depends on the costs
  ▫ the values are not limited to –1, 0, 1
• Kim & Park algorithm
  ▫ is specialized to the three-area case
  ▫ cannot be applied with general costs

Example: Ins. = 2, Del. = 2, Sub. = 1

Ch
     r  a  g  u  e
  0 -2 -2 -2 -2 -2
p 1 -1 -1 -1 -1 -1
a 2  2 -1 -1 -1 -1
s 3  2  1 -1 -1 -1
s 4  2  1  1 -1 -1
a 5  2  2  1  1 -1
g 6  2  2  2  1  1
e 7  2  2  2  2  1
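To see how the value range of Ch depends on the costs, the Python sketch below compares the unit-cost setting with the Ins. = 2, Del. = 2, Sub. = 1 example. The helper names (`edit_table`, `change_values`) and the column alignment D'[i][j] against D[i][j+1] are assumptions of this illustration.

```python
def edit_table(a, b, ins, dele, sub):
    """Edit distance table with constant costs; a match costs 0."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            step = 0 if a[i - 1] == b[j - 1] else sub
            D[i][j] = min(D[i - 1][j - 1] + step,
                          D[i - 1][j] + dele,
                          D[i][j - 1] + ins)
    return D


def change_values(a, b, ins, dele, sub):
    """Sorted distinct values of Ch for a left decrement of b."""
    old = edit_table(a, b, ins, dele, sub)
    new = edit_table(a, b[1:], ins, dele, sub)
    return sorted({new[i][j] - old[i][j + 1]
                   for i in range(len(a) + 1)
                   for j in range(len(b))})


unit = change_values("passage", "prague", 1, 1, 1)
weighted = change_values("passage", "prague", 2, 2, 1)
print("unit costs:    ", unit)
print("Ins=2,Del=2,Sub=1:", weighted)
```

Under the weighted costs the values reach -2 and 2, so a method that assumes exactly three areas with values -1, 0, 1 no longer fits.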
Our Algorithm
• Update only affected entries
  ▫ without Ch
  ▫ compute only DR'.U and DR'.L
• Time complexity: O(min{c(n+m), nm})
  ▫ c is the maximum cost

Example: Ins. = 2, Del. = 2, Sub. = 1

DR'.U – DR.U          DR'.L – DR.L          Ch = D' – D
     r  a  g  u  e         r  a  g  u  e         r  a  g  u  e
                                              0 -2 -2 -2 -2 -2
p 1  1  1  1  1  1    p 1 -3  0  0  0  0    p 1 -1 -1 -1 -1 -1
a 2  3  0  0  0  0    a 2  0 -3  0  0  0    a 2  2 -1 -1 -1 -1
s 3  0  2  0  0  0    s 3  0 -1 -2  0  0    s 3  2  1 -1 -1 -1
s 4  0  0  2  0  0    s 4  0 -1  0 -2  0    s 4  2  1  1 -1 -1
a 5  0  1  0  2  0    a 5  0  0 -1  0 -2    a 5  2  2  1  1 -1
g 6  0  0  1  0  2    g 6  0  0  0 -1  0    g 6  2  2  2  1  1
e 7  0  0  0  1  0    e 7  0  0  0  0 -1    e 7  2  2  2  2  1
Affected Entry
• An entry is affected if DR'[i, j] ≠ DR[i, j]
• Kim & Park algorithm
  ▫ computes DR' and Ch to find the affected entries
• Our algorithm
  ▫ finds the affected entries using only the DR table
  ▫ uses the following lemma

Lemma: DR'[i, j] is an affected entry ⇔ DR'[i–1, j].L ≠ DR[i–1, j].L or DR'[i, j–1].U ≠ DR[i, j–1].U
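The reason only two neighbours matter can be seen by rewriting the edit distance recurrence in terms of differences. The short derivation below is a sketch that follows from the definitions of DR[i, j].U and DR[i, j].L; it matches the variables x, y, z used in the pseudocode of the proposed algorithm, and it only justifies the direction that an affected entry must have a changed upper or left neighbour.

```latex
\begin{align*}
z &= D[i,j] - D[i-1,j-1] \\
  &= \min\{\, DR[i-1,j].L + \delta(a_i,\varepsilon),\;
             DR[i,j-1].U + \delta(\varepsilon,b_j),\;
             \delta(a_i,b_j) \,\}, \\
DR[i,j].L &= D[i,j] - D[i,j-1] = z - DR[i,j-1].U, \\
DR[i,j].U &= D[i,j] - D[i-1,j] = z - DR[i-1,j].L .
\end{align*}
```

Since a_i and b_j are the same before and after the update, DR[i, j] can differ from DR'[i, j] only if DR[i–1, j].L or DR[i, j–1].U has changed.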
Comparison of pseudocodes

Our Algorithm
for i = 1 to m do
    prev[i] = i + 1; DR[i, 1].U = δ(a_i, ε);
i = 1; j = 1; DR[0, j].L = δ(ε, b_j); currIdx = 1; prevIdx = 1;
while i ≤ m and j ≤ n do
    while i ≤ m do
        x = DR[i–1, j].L; y = DR[i, j–1].U;
        z = min{x + δ(a_i, ε), y + δ(ε, b_j), δ(a_i, b_j)};
        new.L = z – y; new.U = z – x;
        old.L = DR[i, j].L; old.U = DR[i, j].U;
        DR[i, j].L = new.L; DR[i, j].U = new.U;
        if old.U ≠ new.U then
            curr[currIdx] = i; currIdx = currIdx + 1;
        i = i + 1;
        if old.L = new.L then
            now = i;
            repeat
                i = prev[prevIdx]; prevIdx = prevIdx + 1;
            until i ≥ now
    curr[currIdx] = m + 1;
    Interchange the roles of the tables curr and prev;
    currIdx = 1; i = prev[1]; prevIdx = 2; j = j + 1;
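The propagation idea of the pseudocode above can be sketched in runnable Python for the left decrement case. This is an illustrative reimplementation under stated assumptions, not the authors' code: the names `build_dr` and `left_decrement` are invented, costs are passed as callables, and the tables are copied with slicing for clarity (which itself costs O(nm); an implementation aiming at the stated bound would keep a column offset instead).

```python
def build_dr(A, B, ins, dele, sub):
    """Reference construction of DR (U, L) via the full table D."""
    m, n = len(A), len(B)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + dele(A[i - 1])
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins(B[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + sub(A[i - 1], B[j - 1]),
                          D[i - 1][j] + dele(A[i - 1]),
                          D[i][j - 1] + ins(B[j - 1]))
    U = [[None] * (n + 1)] + [[D[i][j] - D[i - 1][j] for j in range(n + 1)]
                              for i in range(1, m + 1)]
    L = [[None] + [D[i][j] - D[i][j - 1] for j in range(1, n + 1)]
         for i in range(m + 1)]
    return U, L


def left_decrement(U, L, A, B, ins, dele, sub):
    """Update DR of (A, B) to DR of (A, B[1:]), recomputing only entries
    whose upper or left neighbour changed. Returns (U', L', entries visited)."""
    m, n = len(A), len(B)
    Bp = B[1:]
    U2 = [row[1:] for row in U]            # old column j becomes column j-1
    L2 = [[None] + row[2:] for row in L]
    prev = []                               # rows whose U changed in the previous column
    for i in range(1, m + 1):               # new column 0 holds pure deletion costs
        if U2[i][0] != dele(A[i - 1]):
            U2[i][0] = dele(A[i - 1])
            prev.append(i)
    visited = 0
    for j in range(1, n):
        if not prev:                        # nothing changed to the left: done
            break
        curr, k, i = [], 0, prev[0]
        while i <= m:
            x, y = L2[i - 1][j], U2[i][j - 1]
            z = min(x + dele(A[i - 1]), y + ins(Bp[j - 1]), sub(A[i - 1], Bp[j - 1]))
            new_l, new_u = z - y, z - x
            l_changed = new_l != L2[i][j]
            if new_u != U2[i][j]:
                curr.append(i)              # affects row i of the next column
            L2[i][j], U2[i][j] = new_l, new_u
            visited += 1
            if l_changed:
                i += 1                      # the entry below has a changed upper neighbour
            else:
                while k < len(prev) and prev[k] <= i:
                    k += 1                  # skip to the next row with a changed left neighbour
                if k == len(prev):
                    break
                i = prev[k]
        prev = curr
    return U2, L2, visited


ins = lambda b: 2
dele = lambda a: 2
sub = lambda a, b: 0 if a == b else 1
U, L = build_dr("passage", "prague", ins, dele, sub)
U2, L2, visited = left_decrement(U, L, "passage", "prague", ins, dele, sub)
print(visited, "entries visited out of", 7 * 5)
```

The list `prev` plays the role of the prev table in the pseudocode and `curr` collects the rows for the next column; comparing the result with a table rebuilt from scratch is a convenient correctness check.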
Kim & Park Algorithm
Let k be the smallest index in A such that A[k] = B[1]
i_{-1} = 0; j_{-1} = 1; i_1 = k; j_1 = 0; f(0) = 0; g(0) = k;
finished_{-1} = false; finished_1 = false;
while (finished_{-1} = false) or (finished_1 = false) do
    if i_{-1} < i_1 – 1 then /* case 1 */
        if j_{-1} > j_1 + 1 then
            if j_{-1} > j_1 + 1 then X = –1;
            else X = 0;
            Y = 0;
        else
            if f(i_{-1}) < j_{-1} then X = –1;
            else if g(j_1) ≤ i_{-1} then X = 1;
            else X = 0;
            if g(j_1) ≤ i_{-1} + 1 then Y = 1;
            else Y = 0;
        Z = –1;
        Ch[i_{-1}+1, j_{-1}] = min{ –DR[i_{-1}+1, j_{-1}+1].UL + X + δ_{i_{-1}+1, j_{-1}+1},
                                    –DR[i_{-1}+1, j_{-1}+1].U + Z + 1,
                                    –DR[i_{-1}+1, j_{-1}+1].L + Y + 1 };
        DR'[i_{-1}+1, j_{-1}].U = DR[i_{-1}+1, j_{-1}+1].U – Ch[i_{-1}+1, j_{-1}] + Z;
        DR'[i_{-1}+1, j_{-1}].L = DR[i_{-1}+1, j_{-1}+1].L – Ch[i_{-1}+1, j_{-1}] + Y;
        if Ch[i_{-1}+1, j_{-1}] = –1 then i_{-1} = i_{-1} + 1; f(i_{-1}) = j_{-1};
        else j_{-1} = j_{-1} + 1;
    else if j_1 < j_{-1} – 1 then /* case 2 */
        if i_1 > i_{-1} + 1 then
            if g(j_1) < i_1 then X = 1;
            else X = 0;
            Y = 0;
        else
            if g(j_1) < i_1 then X = 1;
            else if f(i_{-1}) ≤ j_1 then X = 0;
            else X = 0;
            if f(i_1 – 1) ≤ j_1 + 1 then Y = –1;
            else Y = 0;
        Z = 1;
        Ch[i_1, j_1+1] = min{ –DR[i_1, j_1+2].UL + X + δ_{i_1, j_1+2},
                              –DR[i_1, j_1+2].U + Y + 1,
                              –DR[i_1, j_1+2].L + Z + 1 };
        DR'[i_1, j_1+1].U = DR[i_1, j_{-1}+2].U – Ch[i_1, j_1+1] + Y;
        DR'[i_1, j_1+1].L = DR[i_1, j_{-1}+2].L – Ch[i_1, j_1+1] + Z;
        if Ch[i_1, j_1+1] = 1 then j_1 = j_1 + 1; g(j_1) = i_1;
        else i_1 = i_1 + 1;
    else /* case 3 */
        if f(i_{-1}) < j_{-1} then X = –1;
        else if g(j_1) ≤ i_{-1} then X = 1;
        else X = 0;
        Y = –1; Z = 1;
        Ch[i_{-1}+1, j_{-1}] = min{ –DR[i_{-1}+1, j_{-1}+1].UL + X + δ_{i_{-1}+1, j_{-1}+1},
                                    –DR[i_{-1}+1, j_{-1}+1].U + Y + 1,
                                    –DR[i_{-1}+1, j_{-1}+1].L + Z + 1 };
        DR'[i_{-1}+1, j_{-1}].U = DR[i_{-1}+1, j_{-1}+1].U – Ch[i_{-1}+1, j_{-1}] + Y;
        DR'[i_{-1}+1, j_{-1}].L = DR[i_{-1}+1, j_{-1}+1].L – Ch[i_{-1}+1, j_{-1}] + Z;
        if Ch[i_{-1}+1, j_{-1}] = 1 then j_{-1} = j_{-1} + 1; j_1 = j_1 + 1; g(j_1) = i_1;
        else if Ch[i_{-1}+1, j_{-1}] = 1 then j_{-1} = j_{-1} + 1; j_1 = j_1 + 1; g(j_1) = i_1;
        else j_{-1} = j_{-1} + 1; i_1 = i_1 + 1;
    if (i_{-1} = m) or (j_{-1} = n) then
        finished_{-1} = true;
    if (i_1 = m+1) or (j_1 = n–1) then
        finished_1 = true;
Comparison of behaviors
[Figure: four grids with axes 0 … m and 0 … n, comparing our algorithm with the Kim & Park algorithm]
Experiments
• Strings A[1:m] and B[1:m]
  ▫ total time for computing the representations of the edit distance between A and B[j:m] for j = m, m–1, …, 1 (left incremental computation)
• Machine specifications
  ▫ CentOS Linux
  ▫ Xeon 3.0 GHz
  ▫ 16 GB memory
Experiment 1
• Time comparison with naïve algorithm
• Costs
  ▫ chosen randomly: Insertion = 137, Deletion = 116, Substitution = 242
• Random data
  ▫ alphabet size 2, 3, …, 52
  ▫ string length 100, 200, …, 5000
Result 1
Experiment 2
• Time comparison with Kim & Park algorithm
• Costs
  ▫ Insertion = Deletion = Substitution = 1
• Random data
  ▫ alphabet size 2, 3, …, 52
  ▫ string length 100, 200, …, 5000
Result 2
Experiment 3
• Time comparison with naïve algorithm
• Corpus
  ▫ English (Reuters news)
    costs: Insertion = 137, Deletion = 116, Substitution = 242
    string length: 1000, 2000, 3000, 4000, 5000
  ▫ Protein data (Canterbury corpus: E.coli)
    costs proposed in [Kurtz 1996]
    string length: 1000, 2000, 3000, 4000, 5000

Cost function δ for the E.coli data
δ   ε  A  C  G  T
ε   0  3  3  3  3
A   3  0  2  1  3
C   3  2  0  2  1
G   3  1  2  0  2
T   3  2  1  2  0
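A character-dependent cost function like the one in this table fits naturally into a lookup table. The Python sketch below is only an illustration: the names `COST`, `delta` and `weighted_distance` are invented, the empty string stands in for ε, and it assumes that the row of the table is the first argument of δ and the column the second.

```python
EPS = ""  # stands for the empty string ε

# COST[a][b] = δ(a, b), copied from the table (row = a, column = b; an assumption).
COST = {
    EPS: {EPS: 0, "A": 3, "C": 3, "G": 3, "T": 3},
    "A": {EPS: 3, "A": 0, "C": 2, "G": 1, "T": 3},
    "C": {EPS: 3, "A": 2, "C": 0, "G": 2, "T": 1},
    "G": {EPS: 3, "A": 1, "C": 2, "G": 0, "T": 2},
    "T": {EPS: 3, "A": 2, "C": 1, "G": 2, "T": 0},
}


def delta(a, b):
    """Cost of turning a into b; EPS on either side means insertion or deletion."""
    return COST[a][b]


def weighted_distance(x, y):
    """Weighted edit distance from x to y, keeping only two rows of the table."""
    prev = [0]
    for b in y:
        prev.append(prev[-1] + delta(EPS, b))      # row 0: insertions only
    for a in x:
        curr = [prev[0] + delta(a, EPS)]           # column 0: deletions only
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j - 1] + delta(a, b),
                            prev[j] + delta(a, EPS),
                            curr[j - 1] + delta(EPS, b)))
        prev = curr
    return prev[-1]


print(weighted_distance("A", "G"))        # 1: a single substitution
print(weighted_distance("ACGT", "ACGT"))  # 0: identical strings
```

Because the table as printed is not symmetric (δ(A, T) = 3 but δ(T, A) = 2), the direction of the transformation matters in this sketch.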
Result 3
Time (seconds)

English News
length   Our algorithm   Naïve algorithm
1000     0.04            1.50
2000     0.27            12.0
3000     0.71            40.4
4000     1.36            97.1
5000     2.29            189

Protein Data
length   Our algorithm   Naïve algorithm
1000     0.01            1.43
2000     0.09            11.5
3000     0.23            38.8
4000     0.43            92.8
5000     0.70            181
Summary
• Algorithm for the Left I/D problem
  ▫ nonnegative integer costs
  ▫ O(min{c(n+m), nm}) time, where c is the maximum cost
  ▫ experimentally fast

                    Our Algorithm          Naïve Algorithm   Kim & Park Algorithm
Costs               Nonnegative integer    Real number       1
Tables to compute   DR                     D                 DR and Ch
Time complexity     O(min{c(n+m), nm})     O(nm)             O(n+m)
Source code         Simple                 Simple            Cumbersome
Speed               Fast                   Very slow         Slower
Related Work
• Naïve method
  ▫ compute D' from scratch
  ▫ O(nm) time
• Kim & Park algorithm [2004]
  ▫ each operation has cost 1
  ▫ computes the difference representation DR → DR' using the change table Ch
  ▫ O(n+m) time
[Figure: diagram relating the tables D, D', (DR, Ch), (DR', Ch) and the edit distance for the naïve method and the Kim & Park algorithm; edge labels O(nm), O(nm), O(1), O(n+m), O(n+m)]