27
Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya 1 and Hjalte Wedel Vildhøj 2 1 Moscow State University, Department of Mechanics and Mathematics, [email protected] 2 Technical University of Denmark, DTU Compute, [email protected] CPM 2013, Bad Herrenalb, Germany June 17, 2013 1 / 27

Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

Time-Space Trade-Offs for theLongest Common Substring Problem

Tatiana Starikovskaya1 and Hjalte Wedel Vildhøj2

1Moscow State University, Department of Mechanics and Mathematics,[email protected]

2Technical University of Denmark, DTU Compute, [email protected]

CPM 2013, Bad Herrenalb, GermanyJune 17, 2013

1 / 27

Page 2: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

The Longest Common Substring ProblemDefinition

Problem: Given T1,T2, . . . ,Tm of total length n. Compute the longestsubstring, which appears in at least 2 ≤ d ≤ m strings.

Example

T1 = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

T2 = a c a c c t a c c c t a g

T3 = a c t a g t a a t g c a t

2 / 27

Page 3: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

The Longest Common Substring ProblemDefinition

Problem: Given T1,T2, . . . ,Tm of total length n. Compute the longestsubstring, which appears in at least 2 ≤ d ≤ m strings.

Example

T1 = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

T2 = a c a c c t a c c c t a g

T3 = a c t a g t a a t g c a t

d = 3 ⇒ LCS = c t a g

3 / 27

Page 4: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

The Longest Common Substring ProblemDefinition

Problem: Given T1,T2, . . . ,Tm of total length n. Compute the longestsubstring, which appears in at least 2 ≤ d ≤ m strings.

Example

T1 = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

T2 = a c a c c t a c c c t a g

T3 = a c t a g t a a t g c a t

d = 3 ⇒ LCS = c t a g

d = 2 ⇒ LCS = c t a c c

4 / 27

Page 5: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

The Longest Common Substring ProblemA patented solution

5 / 27

Page 6: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Textbook Solution

1

acctaccctag$

7ctag$

10

$

3

accctag$

t

cc

12$

6ctacct$

1gctagctacct$

ga

2acctaccctag$

8

ctag$

11

$

4

ccctag$

9

g$

at

c

12

$

5

ctag$

8

t$

cc

10

ccctag$

4

g$

g

a

tc

13

$

7

cct$

3

gctacct$

cta

2

gctagctacct$

g

13

$

6

ctag$

9

t$

cc

11

ccctag$

5

g$

g

a

t

Build Generalized Suffix Tree

T1 = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

T2 = a c a c c t a c c c t a g

6 / 27

Page 7: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Textbook Solution

1

acctaccctag$

7ctag$

10

$

3

accctag$

t

cc

12$

6ctacct$

1gctagctacct$

ga

2acctaccctag$

8

ctag$

11

$

4

ccctag$

9

g$

at

c

12

$

5

ctag$

8

t$

cc

10

ccctag$

4

g$

g

a

tc

13

$

7

cct$

3

gctacct$

cta

2

gctagctacct$

g

13

$

6

ctag$

9

t$

cc

11

ccctag$

5

g$

g

a

t

Build Generalized Suffix Tree

T1 = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

T2 = a c a c c t a c c c t a g

7 / 27

Page 8: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Textbook Solution

1

acctaccctag$

7ctag$

10

$

3

accctag$

t

cc

12$

6ctacct$

1gctagctacct$

ga

2acctaccctag$

8

ctag$

11

$

4

ccctag$

9

g$

at

c

12

$

5

ctag$

8

t$

cc

10

ccctag$

4

g$

g

a

tc

13

$

7

cct$

3

gctacct$

cta

2

gctagctacct$

g

13

$

6

ctag$

9

t$

cc

11

ccctag$

5

g$

g

a

t

Space: Θ(n)

/

Build Generalized Suffix Tree

T1 = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

T2 = a c a c c t a c c c t a g

8 / 27

Page 9: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

Our Results

QuestionCan the LCS problem be solved (deterministically) in O

(n1−ε) space

and O(n1+ε

)time for 0 ≤ ε ≤ 1?

Our AnswerYes if 0 ≤ ε ≤ 1

3 . More precisely,

For two strings (d = m = 2), the problem can be solved in:

Time: O(n1+ε

)Space: O

(n1−ε) for any 0 < ε ≤ 1

3 .

In the general case (2 ≤ d ≤ m), the problem can be solved in:

Time: O(

n1+ε log2 n(d log2 n + d2))

Space: O(n1−ε) for any 0 ≤ ε < 1

3 .

9 / 27

Page 10: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

10 / 27

Page 11: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

DCτ DCτ DCτ DCτ DCτ DCτ

Difference CoversA difference cover modulo τ is a set of integers DCτ ⊆ {0,1, . . . , τ − 1}such that for any distance d ∈ {0,1, . . . , τ − 1}, DCτ contains twoelements separated by distance d modulo τ .Ex: The set DCτ = {1,2,4} is a difference cover modulo 5.

d 0 1 2 3 4i, j 1,1 2,1 1,4 4,1 1,2

12

4

03

1

4

23

11 / 27

Page 12: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

DCτ DCτ DCτ DCτ DCτ DCτ

I Number of sampled suffixes: O( nτ |DCτ |

)= O

( n√τ

).

I The LCS is the LCP of two suffixes.I If |LCS| ≥ τ one of the first τ characters of the LCS is sampled in

both strings.I Hence the LCS corresponds to a pair (p∗1, p

∗2) maximizing

lcp(RB(p1),RB(p2)

)+ lcp

(T[p1..],T[p2..]

)− 1

12 / 27

Page 13: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

I Number of sampled suffixes: O( nτ |DCτ |

)= O

( n√τ

).

I The LCS is the LCP of two suffixes.

I If |LCS| ≥ τ one of the first τ characters of the LCS is sampled inboth strings.

I Hence the LCS corresponds to a pair (p∗1, p∗2) maximizing

lcp(RB(p1),RB(p2)

)+ lcp

(T[p1..],T[p2..]

)− 1

13 / 27

Page 14: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

I Number of sampled suffixes: O( nτ |DCτ |

)= O

( n√τ

).

I The LCS is the LCP of two suffixes.I If |LCS| ≥ τ one of the first τ characters of the LCS is sampled in

both strings.

I Hence the LCS corresponds to a pair (p∗1, p∗2) maximizing

lcp(RB(p1),RB(p2)

)+ lcp

(T[p1..],T[p2..]

)− 1

14 / 27

Page 15: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

I Number of sampled suffixes: O( nτ |DCτ |

)= O

( n√τ

).

I The LCS is the LCP of two suffixes.I If |LCS| ≥ τ one of the first τ characters of the LCS is sampled in

both strings.I Hence the LCS corresponds to a pair (p∗1, p

∗2) maximizing

lcp(RB(p1),RB(p2)

)+ lcp

(T[p1..],T[p2..]

)− 1

15 / 27

Page 16: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

RB(11)= (g c t a c)R = c a t c g

I Number of sampled suffixes: O( nτ |DCτ |

)= O

( n√τ

).

I The LCS is the LCP of two suffixes.I If |LCS| ≥ τ one of the first τ characters of the LCS is sampled in

both strings.I Hence the LCS corresponds to a pair (p∗1, p

∗2) maximizing

lcp(RB(p1),RB(p2)

)+ lcp

(T[p1..],T[p2..]

)− 1

16 / 27

Page 17: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

How to compute the pair (p∗1, p∗2) faster than O

( n2

τ

)?

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

SAτ = 14 21 17 26 6 1 16 22 11 12 19 24 4 27 7 2 9[ , , , , , , , , , , , , , , , , ]

LCPτ = 0 3 1 2 2 0 1 2 1 2 3 4 0 1 1 0[ , , , , , , , , , , , , , , , ]

SARτ = 14 1 17 21 26 6 16 22 11 19 12 24 4 2 27 7 9[ , , , , , , , , , , , , , , , , ]

LCPRτ = 0 1 1 4 3 0 2 4 1 3 2 1 0 2 4 0[ , , , , , , , , , , , , , , , ]

Main observation: lcp(T[p∗1..],T[p∗2..]) ∈ [`max − τ + 1; `max], so we canignore all pairs with lcp values smaller than `max − τ + 1.

17 / 27

Page 18: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

How to compute the pair (p∗1, p∗2) faster than O

( n2

τ

)?

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

SAτ = 14 21 17 26 6 1 16 22 11 12 19 24 4 27 7 2 9[ , , , , , , , , , , , , , , , , ]

LCPτ = 0 3 1 2 2 0 1 2 1 2 3 4 0 1 1 0[ , , , , , , , , , , , , , , , ]

SARτ = 14 1 17 21 26 6 16 22 11 19 12 24 4 2 27 7 9[ , , , , , , , , , , , , , , , , ]

LCPRτ = 0 1 1 4 3 0 2 4 1 3 2 1 0 2 4 0[ , , , , , , , , , , , , , , , ]

Main observation: lcp(T[p∗1..],T[p∗2..]) ∈ [`max − τ + 1; `max], so we canignore all pairs with lcp values smaller than `max − τ + 1.

18 / 27

Page 19: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

How to compute the pair (p∗1, p∗2) faster than O

( n2

τ

)?

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

SAτ = 14 21 17 26 6 1 16 22 11 12 19 24 4 27 7 2 9[ , , , , , , , , , , , , , , , , ]

LCPτ = 0 3 1 2 2 0 1 2 1 2 3 4 0 1 1 0[ , , , , , , , , , , , , , , , ]

SARτ = 14 1 17 21 26 6 16 22 11 19 12 24 4 2 27 7 9[ , , , , , , , , , , , , , , , , ]

LCPRτ = 0 1 1 4 3 0 2 4 1 3 2 1 0 2 4 0[ , , , , , , , , , , , , , , , ]

Main observation: lcp(T[p∗1..],T[p∗2..]) ∈ [`max − τ + 1; `max], so we canignore all pairs with lcp values smaller than `max − τ + 1.

19 / 27

Page 20: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

How to compute the pair (p∗1, p∗2) faster than O

( n2

τ

)?

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

SAτ = 14 21 17 26 6 1 16 22 11 12 19 24 4 27 7 2 9[ , , , , , , , , , , , , , , , , ]

LCPτ = 0 3 1 2 2 0 1 2 1 2 3 4 0 1 1 0[ , , , , , , , , , , , , , , , ]

SARτ = 14 1 17 21 26 6 16 22 11 19 12 24 4 2 27 7 9[ , , , , , , , , , , , , , , , , ]

LCPRτ = 0 1 1 4 3 0 2 4 1 3 2 1 0 2 4 0[ , , , , , , , , , , , , , , , ]

Main observation: lcp(T[p∗1..],T[p∗2..]) ∈ [`max − τ + 1; `max], so we canignore all pairs with lcp values smaller than `max − τ + 1.

20 / 27

Page 21: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is long

How to compute the pair (p∗1, p∗2) faster than O

( n2

τ

)?

T = a1

g2

g3

c4

t5

a6

g7

c8

t9

a10

c11

c12

t13

$1

14

a15

c16

a17

c18

c19

t20

a21

c22

c23

c24

t25

a26

g27

$2

28

SAτ = 14 21 17 26 6 1 16 22 11 12 19 24 4 27 7 2 9[ , , , , , , , , , , , , , , , , ]

LCPτ = 0 3 1 2 2 0 1 2 1 2 3 4 0 1 1 0[ , , , , , , , , , , , , , , , ]

O(τ)

Analysis (sketch): O(τ) rounds each using O(n/√τ) time and space:

Time: O (n√τ)

Space: O (n/√τ)

Time: O(n1+ε

)Space: O

(n1−ε) 0 < ε ≤ 1

2 .τ = n2ε

21 / 27

Page 22: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is shorter than τ

T1 =

Si

I The LCS is a substring of one of the strings of length 2τ .I Build the generalized suffix tree for a batch Si of strings of total

length O( n√τ

).

I Traverse the suffix tree with T2 in O(n) time to find the node ofgreatest string depth.

I Repeat for all O(√τ) batches.

22 / 27

Page 23: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A Solution for Two StringsWhen the LCS is shorter than τ

T1 =

Si

I The LCS is a substring of one of the strings of length 2τ .I Build the generalized suffix tree for a batch Si of strings of total

length O( n√τ

).

I Traverse the suffix tree with T2 in O(n) time to find the node ofgreatest string depth.

I Repeat for all O(√τ) batches.

Time: O (n√τ)

Space: O (n/√τ)

Time: O(n1+ε

)Space: O

(n1−ε) 0 ≤ ε ≤ 1

3 .τ = n2ε

τ = O(n/√τ)

23 / 27

Page 24: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A General Solution for m StringsWhen the LCS is long

Challenge: The difference cover property only holds for pairs.

T =

T1 T2 T3 T4

LCS LCS LCS

24 / 27

Page 25: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A General Solution for m StringsWhen the LCS is long

Challenge: The difference cover property only holds for pairs.

T =

T1 T2 T3 T4

LCS LCS LCS

τ

541

532

25 / 27

Page 26: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

A General Solution for m StringsWhen the LCS is long

Challenge: The difference cover property only holds for pairs.

T =

T1 T2 T3 T4

LCS LCS LCS

τ

541

532

Algorithm: Extract the maximum head until we have d− 1 distinctstrings. Repeat everything for all n possible positions of the LCS.Computing the next element in a list can be done in O(log n(log2 n + d)).Extracting it costs O(

√τ). At most O(d

√τ) extractions.

Time: O(

n√τ log2 n(log2 n + d)

)Space: O (n/

√τ)

26 / 27

Page 27: Time-Space Trade-Offs for the Longest Common Substring Problem · 2013-07-09 · Time-Space Trade-Offs for the Longest Common Substring Problem Tatiana Starikovskaya1 and Hjalte Wedel

Conclusion

ResultsFor two strings (d = m = 2), the LCS problem can be solved in:

Time: O(n1+ε

)Space: O

(n1−ε) for any 0 < ε ≤ 1

3 .

In the general case (2 ≤ d ≤ m), the LCS problem can be solved in:

Time: O(

n1+ε log2 n(d log2 n + d2))

Space: O(n1−ε) for any 0 ≤ ε < 1

3 .

Open ProblemsCan the generalized solution be improved? Can the trade-off interval ofour solutions be extended to 0 ≤ ε ≤ 1

2 ? Can the problem be solved inO(n1+ε) time and O(n1−ε) space for any 0 ≤ ε ≤ 1?

27 / 27