1 Edit Distance and Large Data Sets Ziv Bar-Yossef Robert Krauthgamer Ravi Kumar T.S. Jayram IBM Almaden Technion

1

Edit Distance and

Large Data Sets

Ziv Bar-Yossef

Robert Krauthgamer

Ravi Kumar

T.S. Jayram

IBM Almaden, Technion

2

Motivating Example: Near-Duplicate Elimination

Syntactic clustering [Broder, Glassman, Manasse, Zweig 97]

• Group pages into clusters of “similar” pages

• Keep one “representative” from each cluster

[Diagram labels: Web, Crawler, Duplicate elimination, Page Repository (shown twice)]

3

Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97]

Challenges

• Corpus is huge (billions of pages, 10K/page)
• Streaming access
• Limited main memory
• Linear running time

Locality Sensitive Hashes [Indyk, Motwani 98]

p → h(p)

$\Pr_h[h(p) = h(q)] = \mathrm{sim}(p,q)$

Cluster:

Collection of pages that have a common sketch

• Can compute sketches in one pass

• Sketches can be stored and processed on a single machine

4

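Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]

w-shingling:

$S_w(p)$ = all substrings of p of length w

$\mathrm{resemblance}_w(p,q) = \frac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$

For a random permutation $\pi$ of the shingles:

$\Pr_\pi[\min(\pi(S_w(p))) = \min(\pi(S_w(q)))] = \frac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$

[Venn diagram: $S_w(p)$ and $S_w(q)$]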
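A minimal sketch of w-shingling and the min-wise estimate of resemblance, assuming both strings have length at least w. The function names and the use of random ranks in place of an explicit permutation are illustrative choices, not taken from the cited papers.

```python
import random

def shingles(p, w):
    """Return S_w(p): the set of all length-w substrings of p."""
    return {p[i:i + w] for i in range(len(p) - w + 1)}

def resemblance(p, q, w):
    """Exact resemblance: |S_w(p) & S_w(q)| / |S_w(p) | S_w(q)|."""
    sp, sq = shingles(p, w), shingles(q, w)
    union = sp | sq
    return len(sp & sq) / len(union) if union else 1.0

def minhash_estimate(p, q, w, trials=2000, seed=0):
    """Estimate resemblance as the fraction of random orderings under
    which both shingle sets have the same minimum element.
    Assumes len(p) >= w and len(q) >= w, so neither set is empty."""
    rng = random.Random(seed)
    sp, sq = shingles(p, w), shingles(q, w)
    universe = sorted(sp | sq)
    hits = 0
    for _ in range(trials):
        # Random ranks stand in for a random permutation of the shingles.
        rank = {s: rng.random() for s in universe}
        hits += min(sp, key=rank.get) == min(sq, key=rank.get)
    return hits / trials
```

For p = "abcde", q = "bcdef" and w = 2, the two shingle sets share 3 of 5 distinct shingles, so the exact resemblance is 3/5; the estimate approaches that value as the number of trials grows.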

5

The Sketching Model

Alice holds x, Bob holds y; both have access to shared randomness.

Alice sends a sketch of x, and Bob sends a sketch of y, to the Referee.

Referee decides: $d(x,y) \le k$ or $d(x,y) \ge r$

k vs. r Gap Problem (approximation)

Promise: $d(x,y) \le k$ or $d(x,y) \ge r$

Goal: Decide which of the two holds.

6

Applications of Sketching

Large data sets

• Clustering
• Nearest Neighbor schemes
• Data streams

Management of Files over the Network

• Differential backup
• Synchronization

Theory

• Low distortion embeddings
• Simultaneous messages communication complexity

7

Known Sketching Schemes

• Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]

• Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]

• Cosine similarity [Charikar 02]

• Earth mover distance [Charikar 02]

In this talk: Edit Distance

8

Edit Distance

ED(x,y): for $x \in \Sigma^n$, $y \in \Sigma^m$, the minimum number of character insertions, deletions and substitutions that transform x to y.

Examples:

ED(00000, 1111) = 5

ED(01010, 10101) = 2

Applications

• Genomics
• Text processing
• Web search

For simplicity: m = n, $\Sigma = \{0,1\}$.
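The definition and the two examples can be checked with the classical quadratic-time dynamic program. This is a minimal sketch; the function name `edit_distance` is illustrative.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and substitutions
    that transform x into y (row-by-row dynamic program)."""
    # prev[j] = distance between the current prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]  # transforming x[:i] into the empty string takes i deletions
        for j, b in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # substitute a by b, or match
            ))
        prev = cur
    return prev[-1]

print(edit_distance("00000", "1111"))   # 5
print(edit_distance("01010", "10101"))  # 2
```

The table has (n+1)(m+1) entries, which is the quadratic cost that motivates the approximation questions in this talk.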

9

Computing Edit Distance

Exact Computation

• Dynamic programming (1970): $O(n^2)$

• Masek and Paterson (1980): $O(n^2/\log n)$

• Impractical for comparing two very long strings.

• Natural question 1: can we do it in linear time?

• Impractical for handling massive document repositories.

• Natural question 2: are there constant size sketches of edit distance?

Can we solve the above problems if we settle for approximation? (Focus of this talk)

10

Sketching Schemes for Edit Distance

Algorithm                             Gap                               Sketch size
Batu et al                            $O(n^\alpha)$ vs. $\Omega(n)$     $O(n^{\max(\alpha/2,\, 2\alpha-1)})$
This paper                            k vs. $O((kn)^{2/3})$             $O(1)$
This paper (non-repetitive strings)   k vs. $O(k^2)$                    $O(1)$

Negative Indications

• No known embeddings of Edit distance into a normed space.

• Every embedding of Edit distance into $L_1$ incurs $\ge 3/2$ distortion [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03]

• Weak nearest neighbor schemes [Indyk 04]

11

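Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98]

Ham(x,y) = # of positions in which x,y differ

Gap: k vs. 2k

Sketch size: O(1)

Shared randomness:

$r_1,\ldots,r_n \in \{0,1\}$ are independent and $\Pr[r_i = 1] = 1/(2k)$ (the bias that yields the probability in the analysis)

Sketch: $h(x) = (\sum_i x_i r_i) \bmod 2$

$h(y) = (\sum_i y_i r_i) \bmod 2$

Analysis:

$\Pr[h(x) \ne h(y)] = \Pr[h(x) + h(y) \equiv 1 \pmod 2] = \Pr[\sum_{i:\, x_i \ne y_i} r_i \equiv 1 \pmod 2] = \frac{1}{2}\left(1 - (1 - 1/k)^{\mathrm{Ham}(x,y)}\right)$

Full sketch: sketch(x) = $(h_1(x),\ldots,h_t(x))$, sketch(y) = $(h_1(y),\ldots,h_t(y))$, t = O(1)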
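A small simulation of the one-bit parity sketch, as a sketch under one stated assumption: each shared random bit is 1 with probability 1/(2k), which is the bias that makes the mismatch probability equal ½(1 − (1 − 1/k)^Ham(x,y)). The function names are illustrative.

```python
import random

def sample_mask(n, k, rng):
    """Shared randomness r_1..r_n: independent bits, each 1 with
    probability 1/(2k) (assumed bias, see lead-in)."""
    p = 1.0 / (2 * k)
    return [1 if rng.random() < p else 0 for _ in range(n)]

def parity_sketch(x, mask):
    """One-bit sketch h(x) = (sum_i x_i * r_i) mod 2."""
    return sum(xi & ri for xi, ri in zip(x, mask)) % 2

def predicted_mismatch(k, ham):
    """Closed form from the analysis: 1/2 * (1 - (1 - 1/k)^ham)."""
    return 0.5 * (1 - (1 - 1.0 / k) ** ham)

def estimate_mismatch(x, y, k, trials, seed=0):
    """Empirical Pr[h(x) != h(y)] over fresh shared randomness."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        mask = sample_mask(len(x), k, rng)
        mismatches += parity_sketch(x, mask) != parity_sketch(y, mask)
    return mismatches / trials
```

Because the mismatch probability grows with Ham(x,y), repeating the sketch t = O(1) times and counting mismatching bits lets the referee separate Ham(x,y) ≤ k from Ham(x,y) ≥ 2k with constant success probability.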

12

Edit Distance Sketches: Basic Framework

Underlying Principle

ED(x,y) is small iff x and y share many common substrings at nearby positions.

$S_x$ = set of pairs of the form $(\sigma, h(i))$

$\sigma$: a substring of x

h(i): a “locality sensitive” encoding of the substring’s position

[Figure: x is mapped to $S_x$ and y to $S_y$; common substrings at nearby positions fall in the intersection]

ED(x,y) small iff intersection $S_x \cap S_y$ large

13

Basic Framework (cont.)

• Need to estimate size of symmetric difference

• Hamming distance computation of characteristic vectors

• Use constant size sketches [KOR]

[Figure: x is mapped to $S_x$ and y to $S_y$]

ED(x,y) small iff symmetric difference $S_x \,\triangle\, S_y$ small

Reduced Edit Distance to Hamming Distance
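The reduction rests on a simple identity: the size of the symmetric difference of two sets equals the Hamming distance between their characteristic vectors. A minimal illustration follows; the example sets and helper names are hypothetical.

```python
def characteristic_vector(s, universe):
    """0/1 vector indexed by `universe`, with 1 where the element is in s."""
    return [1 if u in s else 0 for u in universe]

def hamming(u, v):
    """Number of positions in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))

# Hypothetical encoded sets of (substring, position code) pairs.
S_x = {("ab", 0), ("bc", 0), ("cd", 1)}
S_y = {("ab", 0), ("cd", 1), ("de", 1)}

universe = sorted(S_x | S_y)
dist = hamming(characteristic_vector(S_x, universe),
               characteristic_vector(S_y, universe))
print(dist, len(S_x ^ S_y))  # 2 2
```

In the sketching setting the vectors are never written out explicitly; they only serve as the inputs to the constant size Hamming distance sketch.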

14

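General Case: Encoding Scheme

Gap: k vs. $O((kn)^{2/3})$

$B = n^{2/3}/k^{1/3}$, $W = n/B$

B windows of size W each.

[Figure: x and y with positions 1–14 grouped into windows; substrings $\sigma_1, \sigma_2, \sigma_3, \ldots$ of x and $\tau_1, \tau_2, \tau_3, \ldots$ of y]

$S_x = \{ (\sigma_1, 1), (\sigma_2, 1), (\sigma_3, 2), \ldots, (\sigma_i, \mathrm{win}(i)), \ldots \}$

$S_y = \{ (\tau_1, 1), (\tau_2, 1), (\tau_3, 2), \ldots, (\tau_i, \mathrm{win}(i)), \ldots \}$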
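A rough sketch of this encoding under assumptions the slide does not state: every length-B substring of the input is encoded, positions are 0-based, and win(i) is taken to be i // W. The names `parameters` and `encode`, and the rounding of B and W, are illustrative.

```python
import math

def parameters(n, k):
    """B = n^(2/3) / k^(1/3) and W = n / B, rounded to integers (assumed rounding)."""
    B = max(1, round(n ** (2 / 3) / k ** (1 / 3)))
    W = math.ceil(n / B)
    return B, W

def encode(x, B, W):
    """S_x = {(substring starting at i, window index of i)}.
    Assumes substrings have length B and win(i) = i // W."""
    return {(x[i:i + B], i // W) for i in range(len(x) - B + 1)}

# One substitution changes at most B substrings in each string.
S_x = encode("abcdefgh", B=2, W=4)
S_y = encode("abcXefgh", B=2, W=4)
print(len(S_x ^ S_y))  # 4
```

In this toy run a single substitution touches B = 2 substrings of each string, which mirrors the counting of “marked” substrings in the analysis.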

15

Analysis

[Figure: substring $\sigma_i$ of x and its companion $\tau_j$ in y; positions 1–14 marked on both strings]

Case 1: $\mathrm{ED}(x,y) \le k$

• If substring $\sigma_i$ of x is “unmarked”, it has a matching “companion” $\tau_j$ in y

• $(\sigma_i, \mathrm{win}(i)) \in S_x \setminus S_y$ only if:

  • either $\sigma_i$ is “marked”

  • or $\sigma_i$ is unmarked, but $\mathrm{win}(i) \ne \mathrm{win}(j)$

• At most kB marked substrings

• At most $k \cdot n/W = kB$ companions with mismatched windows

• Therefore, $\mathrm{Ham}(S_x,S_y) \le 4kB$

16

Analysis (cont.)

[Figure: x and y with positions 1–14; disjoint substrings of x starting at positions 1, B+1, 2B+1, …]

Case 2: $\mathrm{Ham}(S_x,S_y) \le 8kB$

• If $\sigma_i$ has a “companion” $\tau_j$ and $\mathrm{win}(i) = \mathrm{win}(j)$, can align $\sigma_i$ with $\tau_j$ using at most W operations

• Otherwise, substitute first character of $\sigma_i$

• At most 8kB substrings of x have no companion

• Therefore, $\mathrm{ED}(x,y) \le 8kB + W \cdot n/B = O((kn)^{2/3})$

17

Non-repetitive Case: Encoding Scheme

$t \ge 1$: “non-repetitiveness” parameter, $W = O(k \cdot t)$; no substring of length t repeats within a window of size W

[Figure: x and y, each with a first window of size W ($x_1$, $y_1$) followed by $x_2$, $y_2$; substrings numbered 1–7]

Alice and Bob choose a sequence of “anchors” in a coordinated way

$\pi_1$: a random permutation on $\{0,1\}^t$

$\alpha_1$: minimal length-t substring of $x_1$ (under $\pi_1$)

$\beta_1$: minimal length-t substring of $y_1$ (under $\pi_1$)

Gap: k vs. O(k W)
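A minimal sketch of coordinated anchor selection for a single window. As a stand-in for the shared random permutation, it ranks each length-t string by a seeded SHA-256 digest; that substitution, the seed value, and the names `priority` and `pick_anchor` are assumptions made for illustration only.

```python
import hashlib

def priority(seed, s):
    """Rank of string s under the shared randomness.
    A seeded hash stands in for a random permutation on {0,1}^t."""
    return hashlib.sha256(f"{seed}:{s}".encode()).digest()

def pick_anchor(window, t, seed):
    """Return (offset, substring) of the minimal length-t substring
    of `window` under the shared ranking."""
    candidates = [(priority(seed, window[i:i + t]), i)
                  for i in range(len(window) - t + 1)]
    _, i = min(candidates)
    return i, window[i:i + t]

# Two windows with the same set of distinct length-3 substrings
# (y1 is x1 after one insertion at the front and one deletion at the end).
x1 = "0001011100"
y1 = "1000101110"
ox, ax = pick_anchor(x1, 3, seed=42)
oy, ay = pick_anchor(y1, 3, seed=42)
print(ax == ay)  # True: both parties pick the same substring as anchor
```

Because both parties rank substrings with the same shared function and neither window repeats a length-t substring, windows that contain the same substrings yield the same anchor without any communication.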

18

Encoding scheme (cont.)

[Figure: x and y divided into substrings numbered 1–8]

$S_x = \{ (\sigma_1,1),\ldots,(\sigma_8,8) \}$

$S_y = \{ (\tau_1,1),\ldots,(\tau_8,8) \}$

19

Analysis

[Figure: x and y with substrings numbered 1–8]

Case 1: $\mathrm{ED}(x,y) \le k$.

• All anchors are “unmarked” with probability $1 - kt/W = \Omega(1)$

• If anchors $\alpha_i$ (in x) and $\beta_i$ (in y) are unmarked, they are aligned

• # of mismatching substrings $\le 2k$

• $\mathrm{Ham}(S_x,S_y) \le 2k$

20

Analysis (cont.)

[Figure: x and y with substrings numbered 1–8]

Case 2: $\mathrm{Ham}(S_x,S_y) \le 4k$

• # of mismatching substrings $\le 4k$

• $\mathrm{ED}(x,y) \le 2 \cdot W \cdot 4k = O(kW)$.

21

Approximation in Linear Time

Arbitrary Strings

Algorithm             Gap                             Time                                    Approx. factor in O(n) time
Dynamic Programming   k vs. k+1                       $O(kn)$                                 None
Batu et al            $O(n^\alpha)$ vs. $\Omega(n)$   $O(n^{\max(\alpha/2,\, 2\alpha-1)})$    None
Cole, Hariharan       k vs. 2k                        $O(n + k^4)$                            $O(n^{3/4})$
This paper            k vs. $k^{7/4}$                 $O(n)$                                  $O(n^{3/7})$

Non-repetitive Strings

Algorithm             Gap                             Time                                    Approx. factor in O(n) time
Cole, Hariharan       k vs. 2k                        $O(n + k^3)$                            $O(n^{2/3})$
This paper            k vs. $k^{3/2}$                 $O(n)$                                  $O(n^{1/3})$

22

Summary and Open Problems

• Designed efficient approximation schemes for edit distance.
  – Best sketching and linear-time approximations to date

• Subsequent work:
  – $O(n^{2/3})$ distortion embedding of edit distance into $L_1$ [Indyk 04] [Rabani 04]
  – Better embeddings of edit distance into $L_1$ [Ostrovsky, Rabani, 05]
  – Embeddings of the Ulam metric into $L_1$ [Charikar, Krauthgamer, 05]

• Open Problems
  – Sketch size lower bounds
  – Constant factor approximations in linear time
  – Better embeddings of edit distance
  – Sketching schemes for other distance measures

23

Thank You
