String Algorithms - phdopen.mimuw.edu.plphdopen.mimuw.edu.pl/zima09/PhD-Warsaw-period.pdf · matching algorithms. Lemma 2 (Weak Periodicity Lemma) If p and q are periods of a word

String Algorithms

Maxime Crochemore

King’s College London Universite Paris-Est

&

M.C. PhD Univ. Warsaw 1/63

Algorithms and combinatorics on words

⋆ Links between combinatorial properties of words and algorithms on words:

– some algorithms are based on combinatorial properties

– combinatorics is sometimes used to evaluate efficiency of algorithms

⋆ Examples:

– Text searching

– Fragment assembly and shortest common superstring

– Text indexing and suffix arrays

– Text compression and permutations

⋆ Other examples in Algorithms on Strings

[C., Hancart, Lecroq, 2007], [C., Rytter, 1994]

http://www.dcs.kcl.ac.uk/staff/mac/

⋆ Combinatorial aspects in Applied Combinatorics on Words

[Lothaire, 2004], [Lothaire, 2002]

http://igm.univ-mlv.fr/∼berstel/Lothaire/index.html


Periods and borders of words

⋆ Non-empty string u, integer p, 0 < p ≤ |u|

⋆ p is a period of u if any of these equivalent conditions is satisfied:

– u[i] = u[i + p], for 1 ≤ i ≤ |u| − p

– u is a prefix of some yk, k > 0, |y| = p

– u = yw = wz, for some strings y, z, w with |y| = |z| = p

String w is called a border of u

b o r d e r l a n d b o r d e r

b o r d e r l a n d b o r d e r

-� p

-�

p

⋆ period(u) = smallest period of u (can be |u|)

border(u) = longest proper border of u (can be empty)

⋆ Periods and borders of abaabaa

3 abaa

6 a

7 empty string


Periodicity Lemma

Lemma 1 (Periodicity Lemma)

If p and q are periods of a word x and satisfy p+q−GCD(p, q) ≤ |x|

then GCD(p, q) is a period of x.

[Fine, Wilf, 1965]

see Algebraic Combinatorics on Words [Lothaire, 2002]

a b a a b a b a a b a b a a b · · ·

a b a a b a b a a b a

a b a a b a b a a b a a b a b · · ·

Used in the analysis of KMP algorithm and of many other pattern

matching algorithms.

Lemma 2 (Weak Periodicity Lemma)

If p and q are periods of a word x and satisfy p + q ≤ |x| then

GCD(p, q) is a period of x.


Proof of the weaker statement

⋆ p and q periods of x with p + q ≤ |x| and p > q

⋆ p− q period of x

a b c

-p

�

q-�

p− q

⋆ the rest like Euclid’s induction


Proof of the weaker statement

⋆ p and q periods of x with p + q ≤ |x| and p > q

⋆ p− q period of x

a b c

-p

�

q-�

p− q

a bc�

q

-p

-�

p− q

⋆ the rest like Euclid’s induction


On-line String Matching

u c

u a

u a-�

compatible shift


On-line String Matching (1)

u c

u a

u a-�

compatible shift

⋆ compatible with match:

shift = period(u)

[Morris, Pratt, 1969]

text y · · a b a a b a c · · · · · · ·

pattern x a b a a b a a

a b a a b a a



u c

u a

u a-�

compatible shift


shift = period(u)


⋆ idem+not incompatible with c

[Knuth, Morris, Pratt, 1977]

text y · · a b a a b a c · · · · · · ·


a b a a b a a

text y · · a b a a b a c · · · · · · ·


a b a a b a a



u c

u a

u a-�

compatible shift


shift = period(u)


⋆ idem+not incompatible with c

[Knuth, Morris, Pratt, 1977]

⋆ best shift = period(uc)

[Simon, 1989], [Hancart, 1993]

[Breslauer, Colussi, Toniolo,

1993]

text y · · a b a a b a c · · · · · · ·


a b a a b a a

text y · · a b a a b a c · · · · · · ·


a b a a b a a

text y · · a b a a b a c · · · · · · ·


a b a a b a a


Delay

⋆ Delay: maximum number of comparisons on a text letter

⋆ MP algorithm

delay ≤ |x|

⋆ KMP algorithm

delay ≤ logΦ(|x| + 1)

text y · · a b a a b a c · · · · · · ·

pattern x a b a a b a b

a b a a b a a

a b a a b a a

a b a a b a a

Proof by the Periodicity Lemma

The bound is tight

⋆ Simon-Hancart algorithm

use of string-matching automaton

delay ≤ min(1 + log2 |x|, cardA)


Searching with an automaton

⋆ Uses the string-matching automaton SMA(x):

smallest deterministic automaton accepting A∗x

⋆ Example x = abaa

0 1 2 3 4a b a a

b a

b

b

a

b

⋆ Search for abaa in:

b a b b a a b a a b a a b b a · · ·

state 0 0 1 2 0 1 1 2 3 4 2 3 4 2 0 1 · · ·


Construction of SMA(x)

⋆ Unwinding arcs

⋆ From SMA(abaa) . . .

0 1 2 3 4a b a a

b a

b

b

a

b

⋆ . . . to SMA(abaab)

0 1 2 3 4 5a b a a b

b a

b

b

a

a

b


Complexity

⋆ Time and space optimization: implementation of significant arcs only

– Forward arcs: spell the pattern

– Backward arcs: arcs going backwards without reaching the

initial state

Lemma 3 SMA(x) contains at most |x| backward arcs.

⋆ Consequences:

– implementation of SMA(x) in O(|x|) space

– construction in O(|x|) time, independently of the alphabet size

⋆ There is a strategy on the choice of arcs for which:

delay ≤ min(1 + log2 |x|, cardA)

[Hancart, 1993], [Breslauer, Colussi, Toniolo, 1998]


Significant arcs

⋆ Complete SMA(ananas)

- -�� 0 1 2 3 4 5 6- - - - - -a n a n a s

��n, s ��a � ��a� ��a' $�a

� ��s � ��n� ��n, s& %�s& %�n, s& %�n, s

⋆ Forward arcs: spell the pattern

⋆ Backward arcs: arcs going backwards without reaching the

initial state

- -�� 0 1 2 3 4 5 6- - - - - -a n a n a s

��a � ��a� ��a' $�a

� ��n

Lemma 4 SMA(x) contains at most |x| backward arcs.

⋆ Consequence: the implementation of SMA(x) can be done in

O(|x|) time and space, independently of the alphabet size


Backward arcs in SMA

⋆ States of SMA(x) are identified with prefixes of x

A backward arc is of the form (u, τ, vτ ) (u, v prefixes of x, τ symbol) with

– vτ longest suffix of uτ that is a prefix of x, and u 6= v

Note: uτ is not a prefix of x

Let p(u, τ ) = |u| − |v| ; it is a period of u because v is a border of u

� ��τ

x τ σ

v τ v-�

p(u, τ)

⋆ Backward arcs to periods: p is injective

Each period p, 1 ≤ p ≤ |x|, corresponds to at most one backward arc,

thus there are at most |x| such arcs

⋆ A worst case: SMA(abm−1) has m backward arcs (a 6= b)


Backward arcs (followed)

⋆ Proof that p is injective

Two backward arcs (u, τ, vτ ), (u′, τ ′, v′τ ′)

Assume p(u, τ ) = p(u′, τ ′) = p ; we prove u = u′ and τ = τ ′.

� ��τ

x τ σ

v τ v

v′ τ ′ v′

-�p

⋆ If v = v′ then u = u′ and also τ = τ ′� ��τ� ��τ′

x τ στ ′ σ′

v τ v

v′ τ ′ v′

-�p

⋆ If v′ a proper prefix of v then v′τ ′ and v′σ′ are prefixes of v

thus τ ′ = σ′ a contradiction


Repetition

⋆ Repetition

(w, w) repetition of (u, v) if w 6= ε and both:

– w suffix of u or u suffix of w, and

– w prefix of v or v prefix of w

|w| is a local period of (u, v)

⋆ Local Period

localperiod(u, v) = minimum local period of (u, v)

a b a b b a

b b

a b a b b a

a a

a b a b b a

b a b a

a b a b b a

b b a b a b b a b a


Maximal Local Period

⋆ Word of period 5

a b a b b a

1 2 2 5 1 3 1

⋆ Note: localperiod(u, v) ≤ period(uv)

⋆ (u, v) is a critical factorization of uv if

localperiod(u, v) = period(uv)

localperiod(u, v) is maximal among all local periods

⋆ Computation of all local periods in linear time

[Duval, Kolpakov, Kucherov, Lecroq, Lefebvre, 2003]


Critical Factorization

Theorem 1 (Critical Factorization Theorem)

Any non-empty word x can be factorized into u · v with both:

• |u| < p and

• localperiod(u, v) = period(x).

[Cesari, Vincent, Duval, 1983]

a b a b b a

b b a b a b b a b a

Leads to time-space optimal string-matching algorithm:

two-way algorithm


Two-Way String Matching

text y

pattern x

w c

u w a

u

c w v

va w

v

shifts-�

|wc|-�

period(x)

⋆ Time-space optimality

– Search Time: linear time (≤ 2n comparisons) with constant extra space

– Preprocessing Time: idem, based on next theorem

[Crochemore, Perrin, 1992]

⋆ Other solutions

[Galil, Seiferas, 1983], [Crochemore, 1992]

[Crochemore, Rytter, 1994], [Rytter, 2002]


Two-way string matching (followed)

Searching for pattern x = x[0 . .m− 1] in text y = y[0 . . n− 1]

⋆ (u, v) critical factorization of x:

uv = x and |u| < localperiod(u, v) = period(x)

⋆ Searching

– search for an occurrence of v by left-to-right scan

if mismatch: shift length = scan length, else

– search for a preceding occurrence of u by right-to-left scan

shift length = period of x


Example

⋆ Critical factorization a b a b b a b a

u v

period = 5

⋆ Searching

window

a a a b b b b b a a b b a b a b b a b a b a . .

a b a b b a b a


Example (1)


u v

period = 5

⋆ Searching

window


a b a b b a b a

left-to-right scan length = 3


Example (1)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a

shift length = 3


Example (2)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a


Example (2)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a

left-to-right scan length = 4


Example (2)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a

a b a b b a b a

shift length = 4


Example (3)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a

a b a b b a b a


Example (3)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a

a b a b b a b a

left-to-right scan


Example (3)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a

a b a b b a b a

right-to-left scan


Example (3)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a

a b a b b a b a

a b a b b a b a

shift length = period = 5


Example (4)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a

a b a b b a b a

a b a b b a b a


Example (4)


u v

period = 5

⋆ Searching

window


a b a b b a b a

a b a b b a b a

a b a b b a b a

a b a b b a b a

left-to-right and right-to-left scans

occurrence found

next shift length = period = 5


Two-way string matching

Searching for pattern x = x[0 . .m− 1] in text y = y[0 . . n− 1]

⋆ (u, v) critical factorization of x:

uv = x and |u| < localperiod(u, v) = period(x)

⋆ Searching

– search for an occurrence of v by left-to-right scan

if mismatch: shift length = scan length, else

– search for a preceding occurrence of u by right-to-left scan

shift length = period of x

⋆ Preprocessing

– compute factorization (u, v)

– compute the (smallest) period of x

⋆ Algorithmic complexity

– total linear time; searching in less than 2n comparisons

– only constant additional memory space required


Orderings

⋆ Orderings

≤ lexicographic ordering based on the ordering ≤ of the alphabet

� lexicographic ordering based on the inverse ordering≤−1 of the alphabet

Theorem 2 x is a non-empty word

Let x = uv with v = suffix of x that is maximal for ≤

Let x = u′v′ with v′ = suffix of x that is maximal for �

If |v| ≤ |v′| then (u, v) is a critical factorization of x otherwise (u′, v′) is.

Moreover, |u| < period(x) and |u′| < period(x).

[Crochemore, Perrin, 1992]

a b a a b a a

a b a a b a a

a b a b a a b b a b a b a

a b a b a a b b a b a b a


Proof (1)

Four cases — w shortest repetition at (u, v)

⋆ w suffix of u and v prefix of w

v < w < wv, impossible

x u v

w w


Proof (2)




x u v

w w

⋆ w suffix of u and w prefix of v

z < v implies v = wz < wv

impossible

x u

w w

z


Proof (3)




x u v

w w



impossible

x u

w w

z

⋆ u suffix of w and v prefix of w

(u, v) is a critical factorization

x u v

w w


Proof (4)




x u v

w w



impossible

x u

w w

z

⋆ u suffix of w and v prefix of w

(u, v) is a critical factorization

x u v

w w

⋆ u suffix of w and w prefix of v

z < v; yz ≺ yv implies z ≺ v;

z prefix of v then border of v

w period of v and of x.

x

v′

w

y

y

z


Computing Maximal Suffixes

⋆ v maximal suffix of x; |w| its period; w′ proper prefix of w

x u v

w w w′

a b a c b c b a c b c b a c b cu w w w′


Computing Maximal Suffixes (followed)


x u v

w w w′


⋆ Match: the periodicity continues

a b a c b c b a c b c b a c b cu w w new w′

? ?

b




x u v

w w w′




? ?

b

⋆ Smaller letter: new w, border-free

a b a c b c b a c b c b a c b cu new w

? ?

a




x u v

w w w′




? ?

b

⋆ Smaller letter: new w, border-free

a b a c b c b a c b c b a c b cu new w

? ?

a

⋆ Greater letter: new u and recomputation on the rest

a b a c b c b a c b c b a c b cnew u recomputation

? ?

c


Perfect factorization

Theorem 3 Any non-empty word x can be factorized into u · v with both:

• |u| < 2 period(v) and

• v starts with at most one cube of a primitive word.

[Galil, Seiferas, 1983], [Crochemore, Rytter, 1994]

[Mignosi, Restivo, Salemi, 1995] see [Lothaire, 2002]

⋆ Example word with period = 10

a a a b a a b a a b a a a b a a b a a b a a a b a a · · ·

⋆ Leads to a time-space optimal string-matching algorithm


Square prefixes

w w

v v

u u

Lemma 5 (Three prefix squares)

If u2 prefix of v2, v2 prefix of w2, and u primitive then |u|+ |v| ≤ |w|.

[Crochemore, Rytter, 1995]

10 a a b a a b a a a b a a b a a b a a a b

7 a a b a a b a a a b a a b a

3 a a b a a b


Squares in a word

Lemma 6 ([Fraenkel, Simpson, 1998])

No more that 2n squares of primitive root occur in a word of length n.

y

w w

v v

u u?

rightmost positions in y? impossible!

Direct proofs [Hickerson, 2004], [Ilie, 2005]

Best bound: 2n− Θ(log n) [Ilie, 2005]

Computation in linear time [Gusfield, Stoye, 1998]

Lemma 7 ([Crochemore, 1981], [Gusfield, Stoye, 1999])

Maximal number of occurrences of squares (with primitive root) : cn log n.

Maximum reached by Fibonacci strings.

Theorem 4 ([Kolpakov, Kucherov, 1998])

Linear number of occurrences of runs (maximal powers) in a word.


Sequencing and fragment assembly

⋆ “Reading” a biological molecular sequence

⋆ Straight sequencing for short fragments: ≤ 500 bases

⋆ Splicing long sequences, “shotgun sequencing”

=⇒ reconstruction problem

⋆ Difficulties :

– loss of orientation (double-stranded DNA)

– overlaps with errors between fragments

⋆ Formalization : F set of fragments, ε ∈ [0, 1[

compute s, a short sequence for which

∀f ∈ F ∃a fragment of s d(a, f) ≤ ε|a| or d(a, f) ≤ ε|a|


Shortest Common Superstring

⋆ Problem: let F = {f1, f2, . . . , fn} a factor code

Compute a shortest word s in which each fi occurs

⋆ SCS(F) = |s|

s = T A A T A T T A T A

f1 = A T A T

f2 = T A T T

f3 = T T A T

f4 = T A T A

f5 = T A A T

f6 = A A T A

Lemma 8 Computing SCS(F) is NP-hard.


Greedy Approximation

⋆ while card F > 1 do

assemble two fragments having the longest overlap

Theorem 5 The sequence s produced by the greedy algorithm

satisfies |s| ≤ 4× SCS(F).

⋆ Conjecture : |s| ≤ 2× SCS(F)

⋆ Variations based on the computation of hamiltonian cycle of

minimal weight


Graph of Overlaps

⋆ F = {f1, f2, . . . , fn}, factor code

⋆ Overlaps

ov(fi, fj) = max{|v| | fi = uv and fj = vw}

⋆ Nodes = {f1, f2, . . . , fn}

Arcs = {(fi, ov(fi, fj), fj) | 1 ≤ i, j ≤ n, i 6= j}

⋆ ov(f, g) computed in time O(min(|f |, |g|))

[Morris et Pratt, 1970]

s = A T A T T A T A T

f1 = A T A T

f2 = T A T T

f3 = T T A T

f4 = T A T A

f1 = A T A T

↔ ↔ ↔ ↔3 2 3 3

f4

f3

f1

f2�

��

��

@@

@@

@I

@@

@@

@R�

��

��

3

3

2

3


Graph of Prefixes

⋆ F = {f1, f2, . . . , fn}, factor code

⋆ Prefixes

pr(fi, fj) = |fi| − ov(fi, fj)

⋆ Nodes = {f1, f2, . . . , fn}

Arcs = {(fi, pr(fi, fj), fj) | 1 ≤ i, j ≤ n, i 6= j}

s = A T A T T A T A T

f1 = A T A T

f2 = T A T T

f3 = T T A T

f4 = T A T A

f1 = A T A T

↔ ←→ ↔ ↔1 2 1 1 = 5

f4

f3

f1

f2�

��

��

@@

@@

@I

@@

@@

@R�

��

��

1

1

2

1


Technique

⋆ A cycle (fi1, fi2, . . . , fik , fi1) produces a superstring s of

length:

|s| =∑k

j=1 |fij | −∑k−1

j=1 ov(fij , fij+1)

=∑k−1

j=1 pr(fij , fij+1) + |fik|

⋆ Computation of an hamiltonian cycle of maximal weight

in the graph of overlaps

⋆ . . . or of minimal weight in the graph of prefixes

⋆ NP-hard problems


3×OPT Approximation

⋆ Based on cycle covers of the graph

⋆ Polynomial running time

⋆ Schema:

(i) compute a cycle cover of minimal weight for the graph of prefixes

(ii) open cycles

(iii) s = assembly with overlaps of words obtained in (ii)

⋆ Approximation : |s| ≤ 3× SCS(F)

⋆ Proof based on the Overlap Lemma

[Blum, Jiang, Li, Tromp, Yanakakis, 1994]


Periodicities

⋆ Word x, integer p, 0 < p ≤ |x| ; p period of x if x[i] = x[i + p]

⋆ period(x) = smallest period of x

Lemma 9 (Overlap Lemma) Let u, v, p = period(u), q = period(v).

If u[1 . . p] et v[1 . . q] are not conjugate, ov(u, v) < p + q − GCD(p, q).

⋆ Exemple : period(baabaabaa) = 3, period(aabaaabaaaba) = 4

ov(baabaabaa, aabaaabaaaba) = |aabaa| = 5 < 3 + 4− 1

⋆ Straight corollary of the Periodicity Lemma


Better Approximation

⋆ Based on a more efficient cycle opening

⋆ Schema :

(i) compute a cycle cover of minimal weight for the graph of prefixes

(ii) open cycles at positions deduced from the Rotation Lemma

(iii) s = assembly with overlaps of words obtained in (ii)

⋆ Approximation : |s| ≤ (2 + 23)× SCS(F)

⋆ Proof based on the Rotation Lemma

[Jiang, Jiang, Breslauer, 1996]


Rotation Lemma

Lemma 10 (Rotation Lemma)

Let x and p = period(x) be such that |x| ≥ 3p. There is a factorization

u · v of x with both:

• |u| < p, and

• each w prefix of v with q = period(w) < p satisfies |w| ≤ 23(p + q).

[Breslauer, Jiang, Jiang, 1996]

x-� p

y y y

w-� q

z z-�

≤ 23(p + q)

⋆ The bound is tight: at each position < 2n + 3 of x = (anban+1b)3

starts a factor of length ≥ 2n + 2 having a period ≥ n + 1


Choice of the rotation

⋆ y3 prefix of x with |y| = period(x) = p

⋆ ≤ ordering on the alphabet, � inverse ordering

⋆ r conjugate of y that is a Lyndon word according to ≤

⋆ t conjugate of y that is a Lyndon word according to �

⋆ i first position of r on x, j first position of t on x

⋆ If i < j and j − i ≤ p, we choose u such that y = uu′, r = u′u

-� p

y y y

r

it

j

⋆ If i < j and j − i > p, we choose u such that y = uu′, t = u′u


Proof

⋆ r Lyndon word and j − i ≤ p2

x-� p

y y y

tj

r

i

w s s-�

q

– |w| < p since q < p and r is border-free

– If q ≥ p2, then |w| < 2

3(p + q)

– Else, q < p2, and |w| < j − i + q (Critical Factorization at j)

and j − i + q < p2

+ q < 23(p + q), QED

⋆ If j − i > p2, replace r by t and t by the next occurrence of r, which

brings back to the first case


Text Indexing

⋆ Set of factors (subwords) of a static text

⋆ Basic operations

– existence of patterns in the text

– number of occurrences of patterns

– list of positions of occurrences

⋆ Other applications

– finding repetitions in texts

– finding regularities in texts

– approximate matchings

– two-dimensional pattern matching

– . . .


Implementation of indexes

pattern

� �suffix of text

Implementation with efficient data structures

⋆ Suffix Trees

digital trees, PATRICIA trees (compact trees)

⋆ Suffix Automata or DAWG’s

minimal automata, compact automata

Implementation with efficient algorithm

⋆ Suffix Arrays

binary search in the ordered list of suffixes


Suffixes

Text y ∈ A∗ of length n

⋆ Suff (y) = set of suffixes of y

⋆ Suff (ababbb)

i 0 1 2 3 4 5

y[i] a b a b b b

positions

a b a b b b 0

b a b b b 1

a b b b 2

b b b 3

b b 4

b 5

ε 6 (empty string)


Suffix Structures

⋆ Suffix trie of ababbb

0 6

1 2

3 4 5 6 0

75

8 9 10 11 1

12 13 2

14

4

15 3

a

b

b

a

b

b b b

a

b

b b b

b

b

⋆ Its suffix automaton

0 1 2 3 4 5 6

2′ 5′

a b a

b

b b b

b

a

b

b

⋆ Its suffix tree

0 6

1 0

2 1

3

4 2

55

6 37

4

ab

b

abbb

bb

abbb

b

b

⋆ Its compact suffix automaton

0 12

32′

ab abbb

bb

b

abbb

b

b

Linear size data structures and O(n log card A) construction time, except for suffix tries


Suffix Array

⋆ Structure composed of:

lexicographically sorted list of non-empty suffixes, and

maximal lengths of common prefixes between suffixes consecutive in the list

⋆ Suffix array of ababbb

j 0 1 2 3 4 5

y[j] a b a b b b

k SUF[k] LCP[k]

0 0 0 a b a b b b

1 2 2 a b b b

2 5 0 b

3 1 1 b a b b b

4 4 1 b b

5 3 2 b b b

⋆ Benefit: use of binary search and of additional combinatorial features

lead to a O(m + log n) searching time (instead of a mere O(m× log n) time)


Searching—Case one

⋆ Hypotheses

Ld < x < Lf and ld ≤ |lcp(Li, Lf)| < lf

⋆ Example

Ld a a a c a

a a a c b a

Li a a b b a b a

a a b b a b b

Lf a a b b b a b

x a a b b b a a

x a a b b b a a

⋆ Conclusion

Li < x < Lf and |lcp(x, Li)| = |lcp(Li, Lf)|


Searching—Case two

⋆ Hypotheses

Ld < x < Lf and ld ≤ lf < |lcp(Li, Lf)|

⋆ Example

Ld a a a c a

a a a c b a

Li a a b b a b a

a a b b a b b

Lf a a b b b a b

x a a b a c b

x a a b a c b

⋆ Conclusion

Ld < x < Li and |lcp(x, Li)| = |lcp(x, Lf)|


Searching—Case three

⋆ Hypotheses

Ld < x < Lf and ld ≤ lf = |lcp(Li, Lf)|

⋆ Example

Ld a a a c a

a a a c b a

Li a a b b a b a

a a b b a b b

Lf a a b b b a b

x a a b b a b

x a a b b a b

⋆ Conclusion

compare x and Li from position lf


Sorting Suffixes on a Bounded Integer Alphabet

⋆ Schema

[1] bucket sort positions i according to first3(y[i . . n− 1]), for i = 3q or i = 3q + 1

t[i]: rank of i in the sorted list

[2] recursively sort the suffixes of the 2/3-shorter word

t[0]t[3] · · · t[3q] · · · t[1]t[4] · · · t[3q + 1] · · ·

s[i]: rank of suffix i in the sorted list (i = 3q or i = 3q + 1)

[3] sort suffixes y[j . . n− 1] for j of the form 3q + 2 (i.e., bucket sort pairs (y[j], s[j + 1]))

[4] merge lists obtained at steps 2 and 3

Note: comparing suffixes i (first list) and j (second list) remains to compare:

(x[i], s[i + 1]) and (x[j], s[j + 1]) if i = 3q

(x[i]x[i + 1], s[i + 2]) and (x[j]x[j + 1], s[j + 2]) if i = 3q + 1

⋆ Running time: T (n) = T (2n/3) + O(n) then T (n) = O(n)

[Karkkainen, Sanders, 2003]


Example

i 0 1 2 3 4 5 6 7 8 9 10

y[i] a a b a a b a a b b a

Rank t

0 a

1 a a b

2 a b a

3 a b b

4 b a

Rank s i Suff (11142230)

0 10 0

1 0 1 1 1 4 2 2 3 0

2 3 1 1 4 2 2 3 0

3 6 1 4 2 2 3 0

4 1 2 2 3 0

5 4 2 3 0

6 7 3 0

7 9 4 2 2 3 0

Rank j (y[j], s[j + 1])

0 2 (b, 2)

1 5 (b, 3)

2 8 (b, 7)

i 0 1 2 3 4 5 6 7 8 9 10


r[i] 1 4 8 2 5 9 3 6 10 7 0

SUF[i] 10 0 3 6 1 4 7 9 2 5 8


Computing LCP’s of suffixes

⋆ LCP’s of suffixes

LCP[i] = |lcp(y[SUF[i− 1] . . n− 1], y[SUF[i] . . n− 1])|, for 0 ≤ i ≤ n

i 0 1 2 3 4 5 6 7 8 9 10 11


SUF[i] 10 0 3 6 1 4 7 9 2 5 8

LCP[i] 0 1 6 3 1 5 2 0 2 4 1 0

j RANK[j]

0 1 a a b a a b a a b b a

3 2 a a b a a b b a

j RANK[j]

1 4 a b a a b a a b b a

4 5 a b a a b b a

Lemma 11 Let j ∈ (1, 2, . . . , n− 1) with RANK[j] > 0.

Then LCP[RANK[j − 1]]− 1 ≤ LCP[RANK[j]].


LCP algorithm

⋆ Rank RANK: RANK[j] = rank of suffix at position j

(RANK = SUF−1)

⋆ Permutation SUF: SUF[k] = position of suffix of rank k

Lcp(y, n, SUF, RANK)

1 ℓ← 0

2 for j ← 0 to n− 1 do

3 ℓ← max{0, ℓ− 1}

4 if RANK[j] > 0 then

5 i← SUF[RANK[j]− 1]

6 while y[i + ℓ] = y[j + ℓ] do

7 ℓ← ℓ + 1

8 LCP[RANK[j]]← ℓ

9 LCP[0]← 0

10 LCP[n]← 0

11 return LCP

⋆ Running time: O(n)


Burrows-Wheeler Transform

⋆ Text w = a1a2 · · · an a primitive word

w1, w2, . . . , wn sequence of conjugates of w in increasing order

bi = last letter of wi, then T (w) = b1b2 · · · bn

⋆ T (abracadabra) = rdarcaaaabb 1 a a b r a c a d a b r

2 a b r a a b r a c a d

3 a b r a c a d a b r a

4 a c a d a b r a a b r

5 a d a b r a a b r a c

6 b r a a b r a c a d a

7 b r a c a d a b r a a

8 c a d a b r a a b r a

9 d a b r a a b r a c a

10 r a a b r a c a d a b

11 r a c a d a b r a a b

⋆ T (w) depends only on the conjugacy class of w

we may suppose that w is a Lyndon word, i.e. that w = w1


Text Compression

⋆ Basis of bzip text compression

– intial text

aabracadabr

– BW Transform: T (aabracadabr) =

rdarcaaaabb

– Either run-length encoding (using ASCII code):

rdarca4b2

– ... or move-to-front encoding, roughly like encoding of:

17, 4, 2, 2, 4, 1, 0, 0, 0, 4, 0


Related works

⋆ Reversible transformation of a word used for text compression, used

in bzip [Burrows, Wheeler, 1994]

⋆ Analysis of compression by [Manzini, 2001]

⋆ Related to combinatorics on words, Sturmian words

[Mantaci, Restivo, Sciortino, 2003]

⋆ Particular case of a bijection due to Gessel and Reutenauer

[Crochemore, Desarmenien, Perrin, 2005]

⋆ Linear-time computation (adapting suffix sorting)

[Karkkainen, Sanders, 2003][Kim, Sim, Park, Park, 2003]

[Ko, Aluru, 2003][Nong, Zhang, Chan, 2009]


Permutation σ

⋆ Permutation σ: σ(i) = rank of i-th circular shift

⋆ w = aabracadabr

i σ(i)

1 1 a a b r a c a d a b r

2 3 a b r a c a d a b r a

3 7 b r a c a d a b r a a

4 11 r a c a d a b r a a b

5 4 a c a d a b r a a b r

6 8 c a d a b r a a b r a

7 5 a d a b r a a b r a c

8 9 d a b r a a b r a c a

9 2 a b r a a b r a c a d

10 6 b r a a b r a c a d a

11 10 r a a b r a c a d a b

σ−1(j) j

1 1 a a b r a c a d a b r

9 2 a b r a a b r a c a d

2 3 a b r a c a d a b r a

5 4 a c a d a b r a a b r

7 5 a d a b r a a b r a c

10 6 b r a a b r a c a d a

3 7 b r a c a d a b r a a

6 8 c a d a b r a a b r a

8 9 d a b r a a b r a c a

11 10 r a a b r a c a d a b

4 11 r a c a d a b r a a b


Permutation π

⋆ Permutation P (w) = π: π(i) = σ(σ−1(i) + 1)

⋆ w = aabracadabr

σ =

1 2 3 4 5 6 7 8 9 10 11

1 3 7 11 4 8 5 9 2 6 10

π as a cycle

π =(

1 3 7 11 4 8 5 9 2 6 10)

π as an array

π =

1 2 3 4 5 6 7 8 9 10 11

3 6 7 8 9 10 11 5 2 1 4


Inverse Transformation

⋆ w = a1a2 · · · an

y = b1b2 · · · bn with bi = last letter of wi

z = c1c2 · · · cn with ci = first letter of wi

⋆ ai = cσ(i)

bi = aσ−1(i)−1

ai = cπi−1(1)

ci = bπ(i)

⋆ Property:

i < j and ci = cj ⇒ π(i) < π(j)

Occurrences of a in w appear in the same

relative order in both y and z

⋆ Linear-time computation of π and w from y

⋆ w = a1a2b3r4a5c6a7d8a9b10r11

y = r11d8a1r4c6a9a2a5a7b10b3

i ci bπ(i)

1 a1 · · · r11

2 a9 · · · d8

3 a2 · · · a1

4 a5 · · · r4

5 a7 · · · c6

6 b10 · · · a9

7 b3 · · · a2

8 c6 · · · a5

9 d8 · · · a7

10 r11 · · · b10

11 r4 · · · b3


Descents of permutations

⋆ Descent of permutation P (w) = π: i such that π(i) > π(i + 1)

⋆ for π = P (aabracadabr): des(π) = {7, 8, 9}

π =

1 2 3 4 5 6 7 8 9 10 11

3 6 7 8 9 10 11 5 2 1 4

⋆ Parikh vector of w: (n1, n2, . . . , nk) where k = #Alphabet

with n1 + n2 + · · · + nk = n = |w|

⋆ ρ(v) = {n1, n1 + n2, . . . , n1 + · · · + nk−1}

⋆ for π = P (aabracadabr): v = (5, 2, 1, 1, 2) and ρ(v) = (5, 7, 8, 9)

⋆ From property: des(π) ⊆ ρ(v)

Theorem 6 Let v = (n1, n2, . . . , nk) be a positive vector, n1+n2+· · ·+nk = n.

The map P : w 7→ π is one to one from the set of conjugacy classes of pri-

mitive words of length n on k letters with Parikh vector v onto the set of cyclic

permutations on {1, 2, . . . , n} such that des(π) ⊆ ρ(v).


From permutation to words

⋆ Words w having a given cyclic permutation π: P (w) = π

No Parikh vector given

⋆ π = P (aabracadabr)

π =

1 2 3 4 5 6 7 8 9 10 11

3 6 7 8 9 10 11 5 2 1 4

π as a cycle =(

1 3 7 11 4 8 5 9 2 6 10)

on 5 letters z = a a a a a b b c d r r

w = a a b r a c a d a b r

on 4 letters z = a a a a a a a c d r r

w = a a a r a c a d a a r

on 7 letters z = a a b b c c c d e f g

w = a b c g b d c e a c f

number of words (up to alphabetical renaming): 26.21 = 128


Length 4

σ π descents w

1 2 3 4 2 3 4 1 1 a a a b

a b b c

a a b c

a b c d

1 2 4 3 2 4 1 3 1 a a b b

a b c c

a a c b

a b d c

1 3 2 4 3 4 2 1 2 a b a c

a c b d

σ π descents w

1 3 4 2 3 1 4 2 2 a c b c

a d b c

1 4 2 3 4 3 1 2 2 a b c b

a c d b

1 4 3 2 4 1 2 3 1 a b b b

a c c b

a c b b

a d c b


Main references

References

[1] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms

on Strings. Cambridge University Press, 2007. 392 pages.

[2] Maxime Crochemore and Wojciech Rytter. Jewels of Stringology. World

Scientific Publishing, Hong-Kong, 2002. 310 pages.

[3] See also http://monge.univ-mlv.fr/ mac/REC/text-algorithms.pdf

or http://www.mimuw.edu.pl/ rytter/BOOKS/text-algorithms.pdf


Documents

String Algorithms - phdopen.mimuw.edu.plphdopen.mimuw.edu.pl/zima09/PhD-Warsaw-period.pdf · matching algorithms. Lemma 2 (Weak Periodicity Lemma) If p and q are periods of a word