Edit Distance and Large Data Sets. Ziv Bar-Yossef, Robert Krauthgamer, Ravi Kumar, T.S. Jayram. IBM Almaden, Technion.

1

Edit Distance and

Large Data Sets

Ziv Bar-Yossef

Robert Krauthgamer

Ravi Kumar

T.S. Jayram

IBM Almaden, Technion

2

Motivating Example: Near-Duplicate Elimination

Syntactic clustering [Broder, Glassman, Manasse, Zweig 97]

• Group pages into clusters of “similar” pages

• Keep one “representative” from each cluster

[Figure: Web → Crawler → Page Repository → Duplicate elimination → Page Repository]

3

Syntactic Clustering via Sketching [Broder, Glassman, Manasse, Zweig 97]

Challenges

• Corpus is huge (billions of pages, 10K/page)

• Streaming access

• Limited main memory

• Linear running time

Sketch: $p \mapsto h(p)$

Locality Sensitive Hashes [Indyk, Motwani 98]

$\Pr_h[h(p) = h(q)] = \mathrm{sim}(p,q)$

Cluster: collection of pages that have a common sketch

• Can compute sketches in one pass

• Sketches can be stored and processed on a single machine

4

Shingling and Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]

w-shingling:

$S_w(p)$ = all substrings of p of length w

$\mathrm{resemblance}_w(p,q) = \dfrac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$

$\Pr_\pi[\min(\pi(S_w(p))) = \min(\pi(S_w(q)))] = \dfrac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$

[Figure: Venn diagram of $S_w(p)$ and $S_w(q)$]
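The shingling and min-wise hashing idea can be illustrated with a minimal Python sketch. The function names, the trial count and the use of a freshly shuffled list as the random permutation are assumptions made for illustration, not details taken from the slides; the code also assumes both strings have length at least w.

```python
import random

def shingles(p, w):
    """S_w(p): the set of all length-w substrings of p."""
    return {p[i:i + w] for i in range(len(p) - w + 1)}

def resemblance(p, q, w):
    """Exact resemblance: |intersection| / |union| of the two shingle sets."""
    sp, sq = shingles(p, w), shingles(q, w)
    return len(sp & sq) / len(sp | sq)

def minhash_estimate(p, q, w, trials=2000, seed=0):
    """Estimate resemblance as the fraction of random permutations under
    which both shingle sets have the same minimal element."""
    rng = random.Random(seed)
    sp, sq = shingles(p, w), shingles(q, w)
    universe = sorted(sp | sq)
    hits = 0
    for _ in range(trials):
        # A random permutation, restricted to the shingles that occur.
        order = universe[:]
        rng.shuffle(order)
        rank = {s: r for r, s in enumerate(order)}
        if min(sp, key=rank.__getitem__) == min(sq, key=rank.__getitem__):
            hits += 1
    return hits / trials

if __name__ == "__main__":
    p, q = "the quick brown fox", "the quick brown dog"
    print(resemblance(p, q, 3), minhash_estimate(p, q, 3))
```

In practice one would store only a few minima per page as its sketch; the loop above recomputes permutations purely to show that the collision probability tracks the resemblance.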

5

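The Sketching Model

[Figure: Alice holds x and sends a sketch of x; Bob holds y and sends a sketch of y; Alice and Bob use shared randomness; the Referee receives both sketches and decides whether $d(x,y) \le k$ or $d(x,y) \ge r$]

k vs. r Gap Problem

Promise: $d(x,y) \le k$ or $d(x,y) \ge r$

Goal: Decide which of the two holds.

Approximation: the gap between k and r plays the role of the approximation factor.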

6

Applications of Sketching

Large data sets

• Clustering

• Nearest Neighbor schemes

• Data streams

Management of Files over the Network

• Differential backup

• Synchronization

Theory

• Low distortion embeddings

• Simultaneous messages communication complexity

7

Known Sketching Schemes

• Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]

• Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]

• Cosine similarity [Charikar 02]

• Earth mover distance [Charikar 02]

In this talk: Edit Distance

8

Edit Distance

$x \in \Sigma^n$, $y \in \Sigma^m$

ED(x,y): Minimum number of character insertions, deletions and substitutions that transform x to y.

Examples:

ED(00000, 1111) = 5

ED(01010, 10101) = 2

Applications

• Genomics

• Text processing

• Web search

For simplicity: $m = n$, $\Sigma = \{0,1\}$.
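For reference, the definition above can be checked with the textbook quadratic-time dynamic program. This is a minimal sketch; the function name and the two-row layout are choices made here, not taken from the slides.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    print(edit_distance("00000", "1111"))   # the first example on the slide
    print(edit_distance("01010", "10101"))  # the second example on the slide
```

Keeping only two rows uses O(min-side) memory but still O(nm) time, which is the cost the rest of the talk tries to avoid.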

9

Computing Edit Distance

Exact Computation

• Dynamic programming (1970): $O(n^2)$

• Masek and Paterson (1980): $O(n^2/\log n)$

• Impractical for comparing two very long strings.

• Natural question 1: can we do it in linear time?

• Impractical for handling massive document repositories.

• Natural question 2: are there constant size sketches of edit distance?

Can we solve the above problems if we settle for approximation?

Focus of this talk

10

Sketching Schemes for Edit Distance

Algorithm                              Gap                              Sketch size
Batu et al                             $O(n^\alpha)$ vs. $\Omega(n)$    $O(n^{\max(\alpha/2,\ 2\alpha - 1)})$
This paper                             $k$ vs. $O((kn)^{2/3})$          $O(1)$
This paper (non-repetitive strings)    $k$ vs. $O(k^2)$                 $O(1)$

Negative Indications

• No known embeddings of Edit distance into a normed space.

• Every embedding of Edit distance into $L_1$ incurs $\ge 3/2$ distortion [Andoni, Deza, Gupta, Indyk, Raskhodnikova 03]

• Weak nearest neighbor schemes [Indyk 04]

11

Hamming Distance Sketches [Kushilevitz, Ostrovsky, Rabani 98]

Ham(x,y) = # of positions in which x,y differ

Gap: $k$ vs. $2k$. Sketch size: $O(1)$

Shared randomness:

$r_1,\dots,r_n \in \{0,1\}$ are independent and $\Pr[r_i = 1] = 1/(2k)$

Sketch: $h(x) = (\sum_i x_i r_i) \bmod 2$

$h(y) = (\sum_i y_i r_i) \bmod 2$

Analysis:

$\Pr[h(x) \ne h(y)] = \Pr[h(x) + h(y) \equiv 1 \pmod 2] = \Pr[\sum_{i:\, x_i \ne y_i} r_i \equiv 1 \pmod 2] = \tfrac{1}{2}\left(1 - (1 - 1/k)^{Ham(x,y)}\right)$

Sketch of x: $(h_1(x),\dots,h_t(x))$, sketch of y: $(h_1(y),\dots,h_t(y))$, $t = O(1)$
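The parity sketch can be simulated in a few lines of Python. This is a hedged illustration: it assumes each shared bit r_i equals 1 with probability 1/(2k), which is the bias that makes the disagreement probability equal ½(1 − (1 − 1/k)^Ham(x,y)); the helper names and the seeded generator standing in for shared randomness are hypothetical. The scheme needs only t = O(1) parities; the demo uses a large t solely to make the empirical rate visible.

```python
import random

def shared_randomness(n, k, t, seed=0):
    """t independent vectors in {0,1}^n, each bit set with probability 1/(2k)."""
    rng = random.Random(seed)  # the seed stands in for the shared random string
    p = 1.0 / (2 * k)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(t)]

def sketch(x, rs):
    """One parity bit h_j(x) = (sum_i x_i * r_i) mod 2 per shared vector."""
    return [sum(xi & ri for xi, ri in zip(x, r)) % 2 for r in rs]

def disagreement_rate(sx, sy):
    """Fraction of sketch coordinates on which the two sketches differ."""
    return sum(a != b for a, b in zip(sx, sy)) / len(sx)

def predicted_rate(k, ham):
    """The analysis formula: 1/2 * (1 - (1 - 1/k)^ham)."""
    return 0.5 * (1 - (1 - 1 / k) ** ham)

def flip(x, positions):
    """Copy of x with the bits at the given positions flipped."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(x)]

if __name__ == "__main__":
    n, k, t = 200, 10, 4000
    rs = shared_randomness(n, k, t)
    x = [0] * n
    for ham in (k, 2 * k):
        y = flip(x, set(range(ham)))
        rate = disagreement_rate(sketch(x, rs), sketch(y, rs))
        print(ham, round(rate, 3), round(predicted_rate(k, ham), 3))
```

A referee would compare the observed disagreement rate with a threshold placed between the predicted values for distance k and distance 2k.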

12

Edit Distance Sketches: Basic Framework

Underlying Principle

ED(x,y) is small iff x and y share many common substrings at nearby positions.

$S_x$ = set of pairs of the form (substring of x at position i, $h(i)$)

$h(i)$: a “locality sensitive” encoding of the substring’s position

ED(x,y) small iff intersection $S_x \cap S_y$ large

[Figure: x is mapped to $S_x$ and y to $S_y$; the intersection holds the common substrings at nearby positions]

13

Basic Framework (cont.)

ED(x,y) small iff symmetric difference $S_x \,\Delta\, S_y$ small

• Need to estimate size of symmetric difference

• Hamming distance computation of characteristic vectors

• Use constant size sketches [KOR]

Reduced Edit Distance to Hamming Distance

[Figure: x is mapped to $S_x$ and y to $S_y$]

14

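General Case: Encoding Scheme

Gap: $k$ vs. $O((kn)^{2/3})$

$B = n^{2/3}/k^{1/3}$, $W = n/B$

B windows of size W each.

$S_x$ = { (substring 1, 1), (substring 2, 1), (substring 3, 2), …, (substring i, win(i)), … }

$S_y$ = { (substring 1, 1), (substring 2, 1), (substring 3, 2), …, (substring i, win(i)), … }

[Figure: x and y with positions numbered 1–14, divided into windows; each substring is paired with the index of its window]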
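The encoding can be sketched in Python as follows. Several details here are assumptions made for illustration rather than statements from the slides: substrings are taken to have length B and to start at every position where a full substring fits, windows are indexed from 0, and B and W are rounded to integers.

```python
import math

def parameters(n, k):
    """Rounded values of B = n^(2/3) / k^(1/3) and W = n / B."""
    B = max(1, round(n ** (2 / 3) / k ** (1 / 3)))
    W = math.ceil(n / B)
    return B, W

def encode(x, B, W):
    """Set of (substring, window index) pairs for the string x."""
    pairs = set()
    for i in range(len(x) - B + 1):          # 0-based start position
        pairs.add((x[i:i + B], i // W))      # win(i): window containing position i
    return pairs

def set_hamming(sx, sy):
    """Hamming distance between characteristic vectors = |symmetric difference|."""
    return len(sx ^ sy)

if __name__ == "__main__":
    x = "0110100110010110"
    y = x[:5] + ("1" if x[5] == "0" else "0") + x[6:]   # one substitution
    B, W = 4, 4
    print(set_hamming(encode(x, B, W), encode(y, B, W)))
```

A single substitution touches at most B substrings on each side, so the symmetric difference stays within 2B pairs; an insertion or deletion additionally shifts substrings across window boundaries, which is the second source of mismatches counted in the analysis.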

15

Analysis

Case 1: $ED(x,y) \le k$

• If substring i is “unmarked”, it has a matching “companion” substring j

• (substring i, win(i)) $\in S_x \setminus S_y$ only if:

  • either i is “marked”

  • or i is unmarked, but $win(i) \ne win(j)$

• At most $kB$ marked substrings

• At most $k \cdot n/W = kB$ companions with mismatched windows

• Therefore, $Ham(S_x,S_y) \le 4kB$ (at most $2kB$ pairs from each of $S_x \setminus S_y$ and $S_y \setminus S_x$)

[Figure: x and y with positions numbered 1–14; substring i in x and its companion j in y]

16

Analysis (cont.)

Case 2: $Ham(S_x,S_y) \le 8kB$

• If i has a “companion” j and $win(i) = win(j)$, can align i with j using at most W operations

• Otherwise, substitute first character of i

• At most $8kB$ substrings of x have no companion

• Therefore, $ED(x,y) \le 8kB + W \cdot n/B = O((kn)^{2/3})$

[Figure: x and y with positions numbered 1–14; substrings 1 and 2 marked, with boundaries at B−1, B+1 and 2B+1]
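The final bound follows by substituting the parameters $B = n^{2/3}/k^{1/3}$ and $W = n/B$ from the encoding scheme. A short worked calculation (the constant 9 is simply the sum of the two terms, not a figure from the slides):

```latex
\begin{align*}
kB &= k \cdot \frac{n^{2/3}}{k^{1/3}} = k^{2/3} n^{2/3} = (kn)^{2/3}, \\
W \cdot \frac{n}{B} &= \frac{n^2}{B^2} = n^2 \cdot \frac{k^{2/3}}{n^{4/3}} = (kn)^{2/3}, \\
8kB + W \cdot \frac{n}{B} &= 9\,(kn)^{2/3} = O\big((kn)^{2/3}\big).
\end{align*}
```

This choice of B balances the two terms, which is why both come out to $(kn)^{2/3}$.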

17

Non-repetitive Case: Encoding Scheme

Gap: $k$ vs. $O(kW)$

$t \ge 1$: “non-repetitiveness” parameter, $W = O(k \cdot t)$; no substring of length $t$ repeats within a window of size $W$

Alice and Bob choose a sequence of “anchors” in a coordinated way

$\pi_1$: a random permutation on $\{0,1\}^t$

Alice's first anchor: minimal length-$t$ substring of $x_1$ (under $\pi_1$)

Bob's first anchor: minimal length-$t$ substring of $y_1$ (under $\pi_1$)

[Figure: x and y, each with windows of size W marked ($x_1$, $x_2$ in x and $y_1$, $y_2$ in y), positions numbered 1–7]
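The choice of the first anchor can be illustrated with a small Python sketch. It covers only the first anchor; how later anchors are chosen is not spelled out in the slides and is not modelled here. The function names are hypothetical, and the permutation is represented as an explicit ranking of all 2^t binary strings, which is only sensible for small t.

```python
import itertools
import random

def random_permutation_rank(t, seed=0):
    """A random permutation on {0,1}^t, stored as a rank for every t-bit string."""
    rng = random.Random(seed)  # the seed stands in for the shared randomness
    strings = ["".join(bits) for bits in itertools.product("01", repeat=t)]
    rng.shuffle(strings)
    return {s: r for r, s in enumerate(strings)}

def first_anchor(window, t, rank):
    """Position and content of the minimal length-t substring of the window."""
    candidates = [(rank[window[i:i + t]], i) for i in range(len(window) - t + 1)]
    _, pos = min(candidates)
    return pos, window[pos:pos + t]

if __name__ == "__main__":
    rank = random_permutation_rank(3, seed=1)
    x1 = "0001011100"   # no length-3 substring repeats in this window
    y1 = "1000101110"   # x1 with one character inserted in front, last one dropped
    print(first_anchor(x1, 3, rank), first_anchor(y1, 3, rank))
```

Because both parties rank substrings with the same permutation, windows that contain the same length-t substrings yield the same anchor content even when its position has shifted, which is the coordination the slide refers to.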

18

Encoding scheme (cont.)

$S_x$ = { (substring 1, 1), …, (substring 8, 8) }

$S_y$ = { (substring 1, 1), …, (substring 8, 8) }

[Figure: x and y with anchors numbered 1–7 and substrings numbered 1–8]

19

Analysis

Case 1: $ED(x,y) \le k$.

• All anchors are “unmarked” with probability $1 - kt/W = \Omega(1)$

• If the $i$-th anchors of x and y are unmarked, they are aligned

• # of mismatching substrings $\le 2k$

• $Ham(S_x,S_y) \le 2k$

[Figure: x and y with anchors numbered 1–7 and substrings numbered 1–8]

20

Analysis (cont.)

Case 2: $Ham(S_x,S_y) \le 4k$

• # of mismatching substrings $\le 4k$

• $ED(x,y) \le 2 \cdot W \cdot 4k = O(kW)$.

[Figure: x and y with anchors numbered 1–7 and substrings numbered 1–8]

21

Approximation in Linear Time

Arbitrary Strings

Algorithm              Gap                              Time                                     Approx. factor in O(n) time
Dynamic Programming    $k$ vs. $k+1$                    $O(kn)$                                  None
Batu et al             $O(n^\alpha)$ vs. $\Omega(n)$    $O(n^{\max(\alpha/2,\ 2\alpha - 1)})$    None
Cole, Hariharan        $k$ vs. $2k$                     $O(n + k^4)$                             $O(n^{3/4})$
This paper             $k$ vs. $k^{7/4}$                $O(n)$                                   $O(n^{3/7})$

Non-repetitive Strings

Algorithm              Gap                              Time                                     Approx. factor in O(n) time
Cole, Hariharan        $k$ vs. $2k$                     $O(n + k^3)$                             $O(n^{2/3})$
This paper             $k$ vs. $k^{3/2}$                $O(n)$                                   $O(n^{1/3})$

22

Summary and Open Problems

• Designed efficient approximation schemes for edit distance.

  – Best sketching and linear-time approximations to date

• Subsequent work:

  – $O(n^{2/3})$ distortion embedding of edit distance into $L_1$ [Indyk 04] [Rabani 04]

  – Better embeddings of edit distance into $L_1$ [Ostrovsky, Rabani, 05]

  – Embeddings of the Ulam metric into $L_1$ [Charikar, Krauthgamer, 05]

• Open Problems

  – Sketch size lower bounds

  – Constant factor approximations in linear time

  – Better embeddings of edit distance

  – Sketching schemes for other distance measures

23

Thank You