Upload
others
View
19
Download
0
Embed Size (px)
Citation preview
String Algorithms
Maxime Crochemore
King’s College London Universite Paris-Est
&
M.C. PhD Univ. Warsaw 1/63
Algorithms and combinatorics on words
⋆ Links between combinatorial properties of words and algorithms on words:
– some algorithms are based on combinatorial properties
– combinatorics is sometimes used to evaluate efficiency of algorithms
⋆ Examples:
– Text searching
– Fragment assembly and shortest common superstring
– Text indexing and suffix arrays
– Text compression and permutations
⋆ Other examples in Algorithms on Strings
[C., Hancart, Lecroq, 2007], [C., Rytter, 1994]
http://www.dcs.kcl.ac.uk/staff/mac/
⋆ Combinatorial aspects in Applied Combinatorics on Words
[Lothaire, 2004], [Lothaire, 2002]
http://igm.univ-mlv.fr/∼berstel/Lothaire/index.html
M.C. PhD Univ. Warsaw 2/63
Periods and borders of words
⋆ Non-empty string u, integer p, 0 < p ≤ |u|
⋆ p is a period of u if any of these equivalent conditions is satisfied:
– u[i] = u[i + p], for 1 ≤ i ≤ |u| − p
– u is a prefix of some yk, k > 0, |y| = p
– u = yw = wz, for some strings y, z, w with |y| = |z| = p
String w is called a border of u
b o r d e r l a n d b o r d e r
b o r d e r l a n d b o r d e r
-� p
-�
p
⋆ period(u) = smallest period of u (can be |u|)
border(u) = longest proper border of u (can be empty)
⋆ Periods and borders of abaabaa
3 abaa
6 a
7 empty string
M.C. PhD Univ. Warsaw 3/63
Periodicity Lemma
Lemma 1 (Periodicity Lemma)
If p and q are periods of a word x and satisfy p+q−GCD(p, q) ≤ |x|
then GCD(p, q) is a period of x.
[Fine, Wilf, 1965]
see Algebraic Combinatorics on Words [Lothaire, 2002]
a b a a b a b a a b a b a a b · · ·
a b a a b a b a a b a
a b a a b a b a a b a a b a b · · ·
Used in the analysis of KMP algorithm and of many other pattern
matching algorithms.
Lemma 2 (Weak Periodicity Lemma)
If p and q are periods of a word x and satisfy p + q ≤ |x| then
GCD(p, q) is a period of x.
M.C. PhD Univ. Warsaw 4/63
Proof of the weaker statement
⋆ p and q periods of x with p + q ≤ |x| and p > q
⋆ p− q period of x
a b c
-p
�
q-�
p− q
⋆ the rest like Euclid’s induction
M.C. PhD Univ. Warsaw 5/63
Proof of the weaker statement
⋆ p and q periods of x with p + q ≤ |x| and p > q
⋆ p− q period of x
a b c
-p
�
q-�
p− q
a bc�
q
-p
-�
p− q
⋆ the rest like Euclid’s induction
M.C. PhD Univ. Warsaw 5/63
On-line String Matching
u c
u a
u a-�
compatible shift
M.C. PhD Univ. Warsaw 6/63
On-line String Matching (1)
u c
u a
u a-�
compatible shift
⋆ compatible with match:
shift = period(u)
[Morris, Pratt, 1969]
text y · · a b a a b a c · · · · · · ·
pattern x a b a a b a a
a b a a b a a
M.C. PhD Univ. Warsaw 6/63
On-line String Matching (2)
u c
u a
u a-�
compatible shift
⋆ compatible with match:
shift = period(u)
[Morris, Pratt, 1969]
⋆ idem+not incompatible with c
[Knuth, Morris, Pratt, 1977]
text y · · a b a a b a c · · · · · · ·
pattern x a b a a b a a
a b a a b a a
text y · · a b a a b a c · · · · · · ·
pattern x a b a a b a a
a b a a b a a
M.C. PhD Univ. Warsaw 6/63
On-line String Matching (3)
u c
u a
u a-�
compatible shift
⋆ compatible with match:
shift = period(u)
[Morris, Pratt, 1969]
⋆ idem+not incompatible with c
[Knuth, Morris, Pratt, 1977]
⋆ best shift = period(uc)
[Simon, 1989], [Hancart, 1993]
[Breslauer, Colussi, Toniolo,
1993]
text y · · a b a a b a c · · · · · · ·
pattern x a b a a b a a
a b a a b a a
text y · · a b a a b a c · · · · · · ·
pattern x a b a a b a a
a b a a b a a
text y · · a b a a b a c · · · · · · ·
pattern x a b a a b a a
a b a a b a a
M.C. PhD Univ. Warsaw 6/63
Delay
⋆ Delay: maximum number of comparisons on a text letter
⋆ MP algorithm
delay ≤ |x|
⋆ KMP algorithm
delay ≤ logΦ(|x| + 1)
text y · · a b a a b a c · · · · · · ·
pattern x a b a a b a b
a b a a b a a
a b a a b a a
a b a a b a a
Proof by the Periodicity Lemma
The bound is tight
⋆ Simon-Hancart algorithm
use of string-matching automaton
delay ≤ min(1 + log2 |x|, cardA)
M.C. PhD Univ. Warsaw 7/63
Searching with an automaton
⋆ Uses the string-matching automaton SMA(x):
smallest deterministic automaton accepting A∗x
⋆ Example x = abaa
0 1 2 3 4a b a a
b a
b
b
a
b
⋆ Search for abaa in:
b a b b a a b a a b a a b b a · · ·
state 0 0 1 2 0 1 1 2 3 4 2 3 4 2 0 1 · · ·
M.C. PhD Univ. Warsaw 8/63
Construction of SMA(x)
⋆ Unwinding arcs
⋆ From SMA(abaa) . . .
0 1 2 3 4a b a a
b a
b
b
a
b
⋆ . . . to SMA(abaab)
0 1 2 3 4 5a b a a b
b a
b
b
a
a
b
M.C. PhD Univ. Warsaw 9/63
Complexity
⋆ Time and space optimization: implementation of significant arcs only
– Forward arcs: spell the pattern
– Backward arcs: arcs going backwards without reaching the
initial state
Lemma 3 SMA(x) contains at most |x| backward arcs.
⋆ Consequences:
– implementation of SMA(x) in O(|x|) space
– construction in O(|x|) time, independently of the alphabet size
⋆ There is a strategy on the choice of arcs for which:
delay ≤ min(1 + log2 |x|, cardA)
[Hancart, 1993], [Breslauer, Colussi, Toniolo, 1998]
M.C. PhD Univ. Warsaw 10/63
Significant arcs
⋆ Complete SMA(ananas)
- -���� ���� ���� ���� ���� ���� ����0 1 2 3 4 5 6- - - - - -a n a n a s
����n, s ����a � ��a� ��a' $�a
� ��s � ��n� ��n, s& %�s& %�n, s& %�n, s
⋆ Forward arcs: spell the pattern
⋆ Backward arcs: arcs going backwards without reaching the
initial state
- -���� ���� ���� ���� ���� ���� ����0 1 2 3 4 5 6- - - - - -a n a n a s
����a � ��a� ��a' $�a
� ��n
Lemma 4 SMA(x) contains at most |x| backward arcs.
⋆ Consequence: the implementation of SMA(x) can be done in
O(|x|) time and space, independently of the alphabet size
M.C. PhD Univ. Warsaw 11/63
Backward arcs in SMA
⋆ States of SMA(x) are identified with prefixes of x
A backward arc is of the form (u, τ, vτ ) (u, v prefixes of x, τ symbol) with
– vτ longest suffix of uτ that is a prefix of x, and u 6= v
Note: uτ is not a prefix of x
Let p(u, τ ) = |u| − |v| ; it is a period of u because v is a border of u
� ��τ
x τ σ
v τ v-�
p(u, τ)
⋆ Backward arcs to periods: p is injective
Each period p, 1 ≤ p ≤ |x|, corresponds to at most one backward arc,
thus there are at most |x| such arcs
⋆ A worst case: SMA(abm−1) has m backward arcs (a 6= b)
M.C. PhD Univ. Warsaw 12/63
Backward arcs (followed)
⋆ Proof that p is injective
Two backward arcs (u, τ, vτ ), (u′, τ ′, v′τ ′)
Assume p(u, τ ) = p(u′, τ ′) = p ; we prove u = u′ and τ = τ ′.
� ��τ
x τ σ
v τ v
v′ τ ′ v′
-�p
⋆ If v = v′ then u = u′ and also τ = τ ′� ��τ� ��τ′
x τ στ ′ σ′
v τ v
v′ τ ′ v′
-�p
⋆ If v′ a proper prefix of v then v′τ ′ and v′σ′ are prefixes of v
thus τ ′ = σ′ a contradiction
M.C. PhD Univ. Warsaw 13/63
Repetition
⋆ Repetition
(w, w) repetition of (u, v) if w 6= ε and both:
– w suffix of u or u suffix of w, and
– w prefix of v or v prefix of w
|w| is a local period of (u, v)
⋆ Local Period
localperiod(u, v) = minimum local period of (u, v)
a b a b b a
b b
a b a b b a
a a
a b a b b a
b a b a
a b a b b a
b b a b a b b a b a
M.C. PhD Univ. Warsaw 14/63
Maximal Local Period
⋆ Word of period 5
a b a b b a
1 2 2 5 1 3 1
⋆ Note: localperiod(u, v) ≤ period(uv)
⋆ (u, v) is a critical factorization of uv if
localperiod(u, v) = period(uv)
localperiod(u, v) is maximal among all local periods
⋆ Computation of all local periods in linear time
[Duval, Kolpakov, Kucherov, Lecroq, Lefebvre, 2003]
M.C. PhD Univ. Warsaw 15/63
Critical Factorization
Theorem 1 (Critical Factorization Theorem)
Any non-empty word x can be factorized into u · v with both:
• |u| < p and
• localperiod(u, v) = period(x).
[Cesari, Vincent, Duval, 1983]
a b a b b a
b b a b a b b a b a
Leads to time-space optimal string-matching algorithm:
two-way algorithm
M.C. PhD Univ. Warsaw 16/63
Two-Way String Matching
text y
pattern x
w c
u w a
u
c w v
va w
v
shifts-�
|wc|-�
period(x)
⋆ Time-space optimality
– Search Time: linear time (≤ 2n comparisons) with constant extra space
– Preprocessing Time: idem, based on next theorem
[Crochemore, Perrin, 1992]
⋆ Other solutions
[Galil, Seiferas, 1983], [Crochemore, 1992]
[Crochemore, Rytter, 1994], [Rytter, 2002]
M.C. PhD Univ. Warsaw 17/63
Two-way string matching (followed)
Searching for pattern x = x[0 . .m− 1] in text y = y[0 . . n− 1]
⋆ (u, v) critical factorization of x:
uv = x and |u| < localperiod(u, v) = period(x)
⋆ Searching
– search for an occurrence of v by left-to-right scan
if mismatch: shift length = scan length, else
– search for a preceding occurrence of u by right-to-left scan
shift length = period of x
M.C. PhD Univ. Warsaw 18/63
Example
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
M.C. PhD Univ. Warsaw 19/63
Example (1)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
left-to-right scan length = 3
M.C. PhD Univ. Warsaw 19/63
Example (1)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
shift length = 3
M.C. PhD Univ. Warsaw 19/63
Example (2)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
M.C. PhD Univ. Warsaw 20/63
Example (2)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
left-to-right scan length = 4
M.C. PhD Univ. Warsaw 20/63
Example (2)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
a b a b b a b a
shift length = 4
M.C. PhD Univ. Warsaw 20/63
Example (3)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
a b a b b a b a
M.C. PhD Univ. Warsaw 21/63
Example (3)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
a b a b b a b a
left-to-right scan
M.C. PhD Univ. Warsaw 21/63
Example (3)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
a b a b b a b a
right-to-left scan
M.C. PhD Univ. Warsaw 21/63
Example (3)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
a b a b b a b a
a b a b b a b a
shift length = period = 5
M.C. PhD Univ. Warsaw 21/63
Example (4)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
a b a b b a b a
a b a b b a b a
M.C. PhD Univ. Warsaw 22/63
Example (4)
⋆ Critical factorization a b a b b a b a
u v
period = 5
⋆ Searching
window
a a a b b b b b a a b b a b a b b a b a b a . .
a b a b b a b a
a b a b b a b a
a b a b b a b a
a b a b b a b a
left-to-right and right-to-left scans
occurrence found
next shift length = period = 5
M.C. PhD Univ. Warsaw 22/63
Two-way string matching
Searching for pattern x = x[0 . .m− 1] in text y = y[0 . . n− 1]
⋆ (u, v) critical factorization of x:
uv = x and |u| < localperiod(u, v) = period(x)
⋆ Searching
– search for an occurrence of v by left-to-right scan
if mismatch: shift length = scan length, else
– search for a preceding occurrence of u by right-to-left scan
shift length = period of x
⋆ Preprocessing
– compute factorization (u, v)
– compute the (smallest) period of x
⋆ Algorithmic complexity
– total linear time; searching in less than 2n comparisons
– only constant additional memory space required
M.C. PhD Univ. Warsaw 23/63
Orderings
⋆ Orderings
≤ lexicographic ordering based on the ordering ≤ of the alphabet
� lexicographic ordering based on the inverse ordering≤−1 of the alphabet
Theorem 2 x is a non-empty word
Let x = uv with v = suffix of x that is maximal for ≤
Let x = u′v′ with v′ = suffix of x that is maximal for �
If |v| ≤ |v′| then (u, v) is a critical factorization of x otherwise (u′, v′) is.
Moreover, |u| < period(x) and |u′| < period(x).
[Crochemore, Perrin, 1992]
a b a a b a a
a b a a b a a
a b a b a a b b a b a b a
a b a b a a b b a b a b a
M.C. PhD Univ. Warsaw 24/63
Proof (1)
Four cases — w shortest repetition at (u, v)
⋆ w suffix of u and v prefix of w
v < w < wv, impossible
x u v
w w
M.C. PhD Univ. Warsaw 25/63
Proof (2)
Four cases — w shortest repetition at (u, v)
⋆ w suffix of u and v prefix of w
v < w < wv, impossible
x u v
w w
⋆ w suffix of u and w prefix of v
z < v implies v = wz < wv
impossible
x u
w w
z
M.C. PhD Univ. Warsaw 25/63
Proof (3)
Four cases — w shortest repetition at (u, v)
⋆ w suffix of u and v prefix of w
v < w < wv, impossible
x u v
w w
⋆ w suffix of u and w prefix of v
z < v implies v = wz < wv
impossible
x u
w w
z
⋆ u suffix of w and v prefix of w
(u, v) is a critical factorization
x u v
w w
M.C. PhD Univ. Warsaw 25/63
Proof (4)
Four cases — w shortest repetition at (u, v)
⋆ w suffix of u and v prefix of w
v < w < wv, impossible
x u v
w w
⋆ w suffix of u and w prefix of v
z < v implies v = wz < wv
impossible
x u
w w
z
⋆ u suffix of w and v prefix of w
(u, v) is a critical factorization
x u v
w w
⋆ u suffix of w and w prefix of v
z < v; yz ≺ yv implies z ≺ v;
z prefix of v then border of v
w period of v and of x.
x
v′
w
y
y
z
M.C. PhD Univ. Warsaw 25/63
Computing Maximal Suffixes
⋆ v maximal suffix of x; |w| its period; w′ proper prefix of w
x u v
w w w′
a b a c b c b a c b c b a c b cu w w w′
M.C. PhD Univ. Warsaw 26/63
Computing Maximal Suffixes (followed)
⋆ v maximal suffix of x; |w| its period; w′ proper prefix of w
x u v
w w w′
a b a c b c b a c b c b a c b cu w w w′
⋆ Match: the periodicity continues
a b a c b c b a c b c b a c b cu w w new w′
? ?
b
M.C. PhD Univ. Warsaw 26/63
Computing Maximal Suffixes (followed)
⋆ v maximal suffix of x; |w| its period; w′ proper prefix of w
x u v
w w w′
a b a c b c b a c b c b a c b cu w w w′
⋆ Match: the periodicity continues
a b a c b c b a c b c b a c b cu w w new w′
? ?
b
⋆ Smaller letter: new w, border-free
a b a c b c b a c b c b a c b cu new w
? ?
a
M.C. PhD Univ. Warsaw 26/63
Computing Maximal Suffixes (followed)
⋆ v maximal suffix of x; |w| its period; w′ proper prefix of w
x u v
w w w′
a b a c b c b a c b c b a c b cu w w w′
⋆ Match: the periodicity continues
a b a c b c b a c b c b a c b cu w w new w′
? ?
b
⋆ Smaller letter: new w, border-free
a b a c b c b a c b c b a c b cu new w
? ?
a
⋆ Greater letter: new u and recomputation on the rest
a b a c b c b a c b c b a c b cnew u recomputation
? ?
c
M.C. PhD Univ. Warsaw 26/63
Perfect factorization
Theorem 3 Any non-empty word x can be factorized into u · v with both:
• |u| < 2 period(v) and
• v starts with at most one cube of a primitive word.
[Galil, Seiferas, 1983], [Crochemore, Rytter, 1994]
[Mignosi, Restivo, Salemi, 1995] see [Lothaire, 2002]
⋆ Example word with period = 10
a a a b a a b a a b a a a b a a b a a b a a a b a a · · ·
⋆ Leads to a time-space optimal string-matching algorithm
M.C. PhD Univ. Warsaw 27/63
Square prefixes
w w
v v
u u
Lemma 5 (Three prefix squares)
If u2 prefix of v2, v2 prefix of w2, and u primitive then |u|+ |v| ≤ |w|.
[Crochemore, Rytter, 1995]
10 a a b a a b a a a b a a b a a b a a a b
7 a a b a a b a a a b a a b a
3 a a b a a b
M.C. PhD Univ. Warsaw 28/63
Squares in a word
Lemma 6 ([Fraenkel, Simpson, 1998])
No more that 2n squares of primitive root occur in a word of length n.
y
w w
v v
u u?
rightmost positions in y? impossible!
Direct proofs [Hickerson, 2004], [Ilie, 2005]
Best bound: 2n− Θ(log n) [Ilie, 2005]
Computation in linear time [Gusfield, Stoye, 1998]
Lemma 7 ([Crochemore, 1981], [Gusfield, Stoye, 1999])
Maximal number of occurrences of squares (with primitive root) : cn log n.
Maximum reached by Fibonacci strings.
Theorem 4 ([Kolpakov, Kucherov, 1998])
Linear number of occurrences of runs (maximal powers) in a word.
M.C. PhD Univ. Warsaw 29/63
Sequencing and fragment assembly
⋆ “Reading” a biological molecular sequence
⋆ Straight sequencing for short fragments: ≤ 500 bases
⋆ Splicing long sequences, “shotgun sequencing”
=⇒ reconstruction problem
⋆ Difficulties :
– loss of orientation (double-stranded DNA)
– overlaps with errors between fragments
⋆ Formalization : F set of fragments, ε ∈ [0, 1[
compute s, a short sequence for which
∀f ∈ F ∃a fragment of s d(a, f) ≤ ε|a| or d(a, f) ≤ ε|a|
M.C. PhD Univ. Warsaw 30/63
Shortest Common Superstring
⋆ Problem: let F = {f1, f2, . . . , fn} a factor code
Compute a shortest word s in which each fi occurs
⋆ SCS(F) = |s|
s = T A A T A T T A T A
f1 = A T A T
f2 = T A T T
f3 = T T A T
f4 = T A T A
f5 = T A A T
f6 = A A T A
Lemma 8 Computing SCS(F) is NP-hard.
M.C. PhD Univ. Warsaw 31/63
Greedy Approximation
⋆ while card F > 1 do
assemble two fragments having the longest overlap
Theorem 5 The sequence s produced by the greedy algorithm
satisfies |s| ≤ 4× SCS(F).
⋆ Conjecture : |s| ≤ 2× SCS(F)
⋆ Variations based on the computation of hamiltonian cycle of
minimal weight
M.C. PhD Univ. Warsaw 32/63
Graph of Overlaps
⋆ F = {f1, f2, . . . , fn}, factor code
⋆ Overlaps
ov(fi, fj) = max{|v| | fi = uv and fj = vw}
⋆ Nodes = {f1, f2, . . . , fn}
Arcs = {(fi, ov(fi, fj), fj) | 1 ≤ i, j ≤ n, i 6= j}
⋆ ov(f, g) computed in time O(min(|f |, |g|))
[Morris et Pratt, 1970]
s = A T A T T A T A T
f1 = A T A T
f2 = T A T T
f3 = T T A T
f4 = T A T A
f1 = A T A T
↔ ↔ ↔ ↔3 2 3 3
f4
f3
f1
f2�
��
���
@@
@@
@I
@@
@@
@R�
��
��
3
3
2
3
M.C. PhD Univ. Warsaw 33/63
Graph of Prefixes
⋆ F = {f1, f2, . . . , fn}, factor code
⋆ Prefixes
pr(fi, fj) = |fi| − ov(fi, fj)
⋆ Nodes = {f1, f2, . . . , fn}
Arcs = {(fi, pr(fi, fj), fj) | 1 ≤ i, j ≤ n, i 6= j}
s = A T A T T A T A T
f1 = A T A T
f2 = T A T T
f3 = T T A T
f4 = T A T A
f1 = A T A T
↔ ←→ ↔ ↔1 2 1 1 = 5
f4
f3
f1
f2�
��
���
@@
@@
@I
@@
@@
@R�
��
��
1
1
2
1
M.C. PhD Univ. Warsaw 34/63
Technique
⋆ A cycle (fi1, fi2, . . . , fik , fi1) produces a superstring s of
length:
|s| =∑k
j=1 |fij | −∑k−1
j=1 ov(fij , fij+1)
=∑k−1
j=1 pr(fij , fij+1) + |fik|
⋆ Computation of an hamiltonian cycle of maximal weight
in the graph of overlaps
⋆ . . . or of minimal weight in the graph of prefixes
⋆ NP-hard problems
M.C. PhD Univ. Warsaw 35/63
3×OPT Approximation
⋆ Based on cycle covers of the graph
⋆ Polynomial running time
⋆ Schema:
(i) compute a cycle cover of minimal weight for the graph of prefixes
(ii) open cycles
(iii) s = assembly with overlaps of words obtained in (ii)
⋆ Approximation : |s| ≤ 3× SCS(F)
⋆ Proof based on the Overlap Lemma
[Blum, Jiang, Li, Tromp, Yanakakis, 1994]
M.C. PhD Univ. Warsaw 36/63
Periodicities
⋆ Word x, integer p, 0 < p ≤ |x| ; p period of x if x[i] = x[i + p]
⋆ period(x) = smallest period of x
Lemma 9 (Overlap Lemma) Let u, v, p = period(u), q = period(v).
If u[1 . . p] et v[1 . . q] are not conjugate, ov(u, v) < p + q − GCD(p, q).
⋆ Exemple : period(baabaabaa) = 3, period(aabaaabaaaba) = 4
ov(baabaabaa, aabaaabaaaba) = |aabaa| = 5 < 3 + 4− 1
⋆ Straight corollary of the Periodicity Lemma
M.C. PhD Univ. Warsaw 37/63
Better Approximation
⋆ Based on a more efficient cycle opening
⋆ Schema :
(i) compute a cycle cover of minimal weight for the graph of prefixes
(ii) open cycles at positions deduced from the Rotation Lemma
(iii) s = assembly with overlaps of words obtained in (ii)
⋆ Approximation : |s| ≤ (2 + 23)× SCS(F)
⋆ Proof based on the Rotation Lemma
[Jiang, Jiang, Breslauer, 1996]
M.C. PhD Univ. Warsaw 38/63
Rotation Lemma
Lemma 10 (Rotation Lemma)
Let x and p = period(x) be such that |x| ≥ 3p. There is a factorization
u · v of x with both:
• |u| < p, and
• each w prefix of v with q = period(w) < p satisfies |w| ≤ 23(p + q).
[Breslauer, Jiang, Jiang, 1996]
x-� p
y y y
w-� q
z z-�
≤ 23(p + q)
⋆ The bound is tight: at each position < 2n + 3 of x = (anban+1b)3
starts a factor of length ≥ 2n + 2 having a period ≥ n + 1
M.C. PhD Univ. Warsaw 39/63
Choice of the rotation
⋆ y3 prefix of x with |y| = period(x) = p
⋆ ≤ ordering on the alphabet, � inverse ordering
⋆ r conjugate of y that is a Lyndon word according to ≤
⋆ t conjugate of y that is a Lyndon word according to �
⋆ i first position of r on x, j first position of t on x
⋆ If i < j and j − i ≤ p, we choose u such that y = uu′, r = u′u
-� p
y y y
r
it
j
⋆ If i < j and j − i > p, we choose u such that y = uu′, t = u′u
M.C. PhD Univ. Warsaw 40/63
Proof
⋆ r Lyndon word and j − i ≤ p2
x-� p
y y y
tj
r
i
w s s-�
q
– |w| < p since q < p and r is border-free
– If q ≥ p2, then |w| < 2
3(p + q)
– Else, q < p2, and |w| < j − i + q (Critical Factorization at j)
and j − i + q < p2
+ q < 23(p + q), QED
⋆ If j − i > p2, replace r by t and t by the next occurrence of r, which
brings back to the first case
M.C. PhD Univ. Warsaw 41/63
Text Indexing
⋆ Set of factors (subwords) of a static text
⋆ Basic operations
– existence of patterns in the text
– number of occurrences of patterns
– list of positions of occurrences
⋆ Other applications
– finding repetitions in texts
– finding regularities in texts
– approximate matchings
– two-dimensional pattern matching
– . . .
M.C. PhD Univ. Warsaw 42/63
Implementation of indexes
pattern
� �suffix of text
Implementation with efficient data structures
⋆ Suffix Trees
digital trees, PATRICIA trees (compact trees)
⋆ Suffix Automata or DAWG’s
minimal automata, compact automata
Implementation with efficient algorithm
⋆ Suffix Arrays
binary search in the ordered list of suffixes
M.C. PhD Univ. Warsaw 43/63
Suffixes
Text y ∈ A∗ of length n
⋆ Suff (y) = set of suffixes of y
⋆ Suff (ababbb)
i 0 1 2 3 4 5
y[i] a b a b b b
positions
a b a b b b 0
b a b b b 1
a b b b 2
b b b 3
b b 4
b 5
ε 6 (empty string)
M.C. PhD Univ. Warsaw 44/63
Suffix Structures
⋆ Suffix trie of ababbb
0 6
1 2
3 4 5 6 0
75
8 9 10 11 1
12 13 2
14
4
15 3
a
b
b
a
b
b b b
a
b
b b b
b
b
⋆ Its suffix automaton
0 1 2 3 4 5 6
2′ 5′
a b a
b
b b b
b
a
b
b
⋆ Its suffix tree
0 6
1 0
2 1
3
4 2
55
6 37
4
ab
b
abbb
bb
abbb
b
b
⋆ Its compact suffix automaton
0 12
32′
ab abbb
bb
b
abbb
b
b
Linear size data structures and O(n log card A) construction time, except for suffix tries
M.C. PhD Univ. Warsaw 45/63
Suffix Array
⋆ Structure composed of:
lexicographically sorted list of non-empty suffixes, and
maximal lengths of common prefixes between suffixes consecutive in the list
⋆ Suffix array of ababbb
j 0 1 2 3 4 5
y[j] a b a b b b
k SUF[k] LCP[k]
0 0 0 a b a b b b
1 2 2 a b b b
2 5 0 b
3 1 1 b a b b b
4 4 1 b b
5 3 2 b b b
⋆ Benefit: use of binary search and of additional combinatorial features
lead to a O(m + log n) searching time (instead of a mere O(m× log n) time)
M.C. PhD Univ. Warsaw 46/63
Searching—Case one
⋆ Hypotheses
Ld < x < Lf and ld ≤ |lcp(Li, Lf)| < lf
⋆ Example
Ld a a a c a
a a a c b a
Li a a b b a b a
a a b b a b b
Lf a a b b b a b
x a a b b b a a
x a a b b b a a
⋆ Conclusion
Li < x < Lf and |lcp(x, Li)| = |lcp(Li, Lf)|
M.C. PhD Univ. Warsaw 47/63
Searching—Case two
⋆ Hypotheses
Ld < x < Lf and ld ≤ lf < |lcp(Li, Lf)|
⋆ Example
Ld a a a c a
a a a c b a
Li a a b b a b a
a a b b a b b
Lf a a b b b a b
x a a b a c b
x a a b a c b
⋆ Conclusion
Ld < x < Li and |lcp(x, Li)| = |lcp(x, Lf)|
M.C. PhD Univ. Warsaw 48/63
Searching—Case three
⋆ Hypotheses
Ld < x < Lf and ld ≤ lf = |lcp(Li, Lf)|
⋆ Example
Ld a a a c a
a a a c b a
Li a a b b a b a
a a b b a b b
Lf a a b b b a b
x a a b b a b
x a a b b a b
⋆ Conclusion
compare x and Li from position lf
M.C. PhD Univ. Warsaw 49/63
Sorting Suffixes on a Bounded Integer Alphabet
⋆ Schema
[1] bucket sort positions i according to first3(y[i . . n− 1]), for i = 3q or i = 3q + 1
t[i]: rank of i in the sorted list
[2] recursively sort the suffixes of the 2/3-shorter word
t[0]t[3] · · · t[3q] · · · t[1]t[4] · · · t[3q + 1] · · ·
s[i]: rank of suffix i in the sorted list (i = 3q or i = 3q + 1)
[3] sort suffixes y[j . . n− 1] for j of the form 3q + 2 (i.e., bucket sort pairs (y[j], s[j + 1]))
[4] merge lists obtained at steps 2 and 3
Note: comparing suffixes i (first list) and j (second list) remains to compare:
(x[i], s[i + 1]) and (x[j], s[j + 1]) if i = 3q
(x[i]x[i + 1], s[i + 2]) and (x[j]x[j + 1], s[j + 2]) if i = 3q + 1
⋆ Running time: T (n) = T (2n/3) + O(n) then T (n) = O(n)
[Karkkainen, Sanders, 2003]
M.C. PhD Univ. Warsaw 50/63
Example
i 0 1 2 3 4 5 6 7 8 9 10
y[i] a a b a a b a a b b a
Rank t
0 a
1 a a b
2 a b a
3 a b b
4 b a
Rank s i Suff (11142230)
0 10 0
1 0 1 1 1 4 2 2 3 0
2 3 1 1 4 2 2 3 0
3 6 1 4 2 2 3 0
4 1 2 2 3 0
5 4 2 3 0
6 7 3 0
7 9 4 2 2 3 0
Rank j (y[j], s[j + 1])
0 2 (b, 2)
1 5 (b, 3)
2 8 (b, 7)
i 0 1 2 3 4 5 6 7 8 9 10
y[i] a a b a a b a a b b a
r[i] 1 4 8 2 5 9 3 6 10 7 0
SUF[i] 10 0 3 6 1 4 7 9 2 5 8
M.C. PhD Univ. Warsaw 51/63
Computing LCP’s of suffixes
⋆ LCP’s of suffixes
LCP[i] = |lcp(y[SUF[i− 1] . . n− 1], y[SUF[i] . . n− 1])|, for 0 ≤ i ≤ n
i 0 1 2 3 4 5 6 7 8 9 10 11
y[i] a a b a a b a a b b a
SUF[i] 10 0 3 6 1 4 7 9 2 5 8
LCP[i] 0 1 6 3 1 5 2 0 2 4 1 0
j RANK[j]
0 1 a a b a a b a a b b a
3 2 a a b a a b b a
j RANK[j]
1 4 a b a a b a a b b a
4 5 a b a a b b a
Lemma 11 Let j ∈ (1, 2, . . . , n− 1) with RANK[j] > 0.
Then LCP[RANK[j − 1]]− 1 ≤ LCP[RANK[j]].
M.C. PhD Univ. Warsaw 52/63
LCP algorithm
⋆ Rank RANK: RANK[j] = rank of suffix at position j
(RANK = SUF−1)
⋆ Permutation SUF: SUF[k] = position of suffix of rank k
Lcp(y, n, SUF, RANK)
1 ℓ← 0
2 for j ← 0 to n− 1 do
3 ℓ← max{0, ℓ− 1}
4 if RANK[j] > 0 then
5 i← SUF[RANK[j]− 1]
6 while y[i + ℓ] = y[j + ℓ] do
7 ℓ← ℓ + 1
8 LCP[RANK[j]]← ℓ
9 LCP[0]← 0
10 LCP[n]← 0
11 return LCP
⋆ Running time: O(n)
M.C. PhD Univ. Warsaw 53/63
Burrows-Wheeler Transform
⋆ Text w = a1a2 · · · an a primitive word
w1, w2, . . . , wn sequence of conjugates of w in increasing order
bi = last letter of wi, then T (w) = b1b2 · · · bn
⋆ T (abracadabra) = rdarcaaaabb 1 a a b r a c a d a b r
2 a b r a a b r a c a d
3 a b r a c a d a b r a
4 a c a d a b r a a b r
5 a d a b r a a b r a c
6 b r a a b r a c a d a
7 b r a c a d a b r a a
8 c a d a b r a a b r a
9 d a b r a a b r a c a
10 r a a b r a c a d a b
11 r a c a d a b r a a b
⋆ T (w) depends only on the conjugacy class of w
we may suppose that w is a Lyndon word, i.e. that w = w1
M.C. PhD Univ. Warsaw 54/63
Text Compression
⋆ Basis of bzip text compression
– intial text
aabracadabr
– BW Transform: T (aabracadabr) =
rdarcaaaabb
– Either run-length encoding (using ASCII code):
rdarca4b2
– ... or move-to-front encoding, roughly like encoding of:
17, 4, 2, 2, 4, 1, 0, 0, 0, 4, 0
M.C. PhD Univ. Warsaw 55/63
Related works
⋆ Reversible transformation of a word used for text compression, used
in bzip [Burrows, Wheeler, 1994]
⋆ Analysis of compression by [Manzini, 2001]
⋆ Related to combinatorics on words, Sturmian words
[Mantaci, Restivo, Sciortino, 2003]
⋆ Particular case of a bijection due to Gessel and Reutenauer
[Crochemore, Desarmenien, Perrin, 2005]
⋆ Linear-time computation (adapting suffix sorting)
[Karkkainen, Sanders, 2003][Kim, Sim, Park, Park, 2003]
[Ko, Aluru, 2003][Nong, Zhang, Chan, 2009]
M.C. PhD Univ. Warsaw 56/63
Permutation σ
⋆ Permutation σ: σ(i) = rank of i-th circular shift
⋆ w = aabracadabr
i σ(i)
1 1 a a b r a c a d a b r
2 3 a b r a c a d a b r a
3 7 b r a c a d a b r a a
4 11 r a c a d a b r a a b
5 4 a c a d a b r a a b r
6 8 c a d a b r a a b r a
7 5 a d a b r a a b r a c
8 9 d a b r a a b r a c a
9 2 a b r a a b r a c a d
10 6 b r a a b r a c a d a
11 10 r a a b r a c a d a b
σ−1(j) j
1 1 a a b r a c a d a b r
9 2 a b r a a b r a c a d
2 3 a b r a c a d a b r a
5 4 a c a d a b r a a b r
7 5 a d a b r a a b r a c
10 6 b r a a b r a c a d a
3 7 b r a c a d a b r a a
6 8 c a d a b r a a b r a
8 9 d a b r a a b r a c a
11 10 r a a b r a c a d a b
4 11 r a c a d a b r a a b
M.C. PhD Univ. Warsaw 57/63
Permutation π
⋆ Permutation P (w) = π: π(i) = σ(σ−1(i) + 1)
⋆ w = aabracadabr
σ =
1 2 3 4 5 6 7 8 9 10 11
1 3 7 11 4 8 5 9 2 6 10
π as a cycle
π =(
1 3 7 11 4 8 5 9 2 6 10)
π as an array
π =
1 2 3 4 5 6 7 8 9 10 11
3 6 7 8 9 10 11 5 2 1 4
M.C. PhD Univ. Warsaw 58/63
Inverse Transformation
⋆ w = a1a2 · · · an
y = b1b2 · · · bn with bi = last letter of wi
z = c1c2 · · · cn with ci = first letter of wi
⋆ ai = cσ(i)
bi = aσ−1(i)−1
ai = cπi−1(1)
ci = bπ(i)
⋆ Property:
i < j and ci = cj ⇒ π(i) < π(j)
Occurrences of a in w appear in the same
relative order in both y and z
⋆ Linear-time computation of π and w from y
⋆ w = a1a2b3r4a5c6a7d8a9b10r11
y = r11d8a1r4c6a9a2a5a7b10b3
i ci bπ(i)
1 a1 · · · r11
2 a9 · · · d8
3 a2 · · · a1
4 a5 · · · r4
5 a7 · · · c6
6 b10 · · · a9
7 b3 · · · a2
8 c6 · · · a5
9 d8 · · · a7
10 r11 · · · b10
11 r4 · · · b3
M.C. PhD Univ. Warsaw 59/63
Descents of permutations
⋆ Descent of permutation P (w) = π: i such that π(i) > π(i + 1)
⋆ for π = P (aabracadabr): des(π) = {7, 8, 9}
π =
1 2 3 4 5 6 7 8 9 10 11
3 6 7 8 9 10 11 5 2 1 4
⋆ Parikh vector of w: (n1, n2, . . . , nk) where k = #Alphabet
with n1 + n2 + · · · + nk = n = |w|
⋆ ρ(v) = {n1, n1 + n2, . . . , n1 + · · · + nk−1}
⋆ for π = P (aabracadabr): v = (5, 2, 1, 1, 2) and ρ(v) = (5, 7, 8, 9)
⋆ From property: des(π) ⊆ ρ(v)
Theorem 6 Let v = (n1, n2, . . . , nk) be a positive vector, n1+n2+· · ·+nk = n.
The map P : w 7→ π is one to one from the set of conjugacy classes of pri-
mitive words of length n on k letters with Parikh vector v onto the set of cyclic
permutations on {1, 2, . . . , n} such that des(π) ⊆ ρ(v).
M.C. PhD Univ. Warsaw 60/63
From permutation to words
⋆ Words w having a given cyclic permutation π: P (w) = π
No Parikh vector given
⋆ π = P (aabracadabr)
π =
1 2 3 4 5 6 7 8 9 10 11
3 6 7 8 9 10 11 5 2 1 4
π as a cycle =(
1 3 7 11 4 8 5 9 2 6 10)
on 5 letters z = a a a a a b b c d r r
w = a a b r a c a d a b r
on 4 letters z = a a a a a a a c d r r
w = a a a r a c a d a a r
on 7 letters z = a a b b c c c d e f g
w = a b c g b d c e a c f
number of words (up to alphabetical renaming): 26.21 = 128
M.C. PhD Univ. Warsaw 61/63
Length 4
σ π descents w
1 2 3 4 2 3 4 1 1 a a a b
a b b c
a a b c
a b c d
1 2 4 3 2 4 1 3 1 a a b b
a b c c
a a c b
a b d c
1 3 2 4 3 4 2 1 2 a b a c
a c b d
σ π descents w
1 3 4 2 3 1 4 2 2 a c b c
a d b c
1 4 2 3 4 3 1 2 2 a b c b
a c d b
1 4 3 2 4 1 2 3 1 a b b b
a c c b
a c b b
a d c b
M.C. PhD Univ. Warsaw 62/63
Main references
References
[1] Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. Algorithms
on Strings. Cambridge University Press, 2007. 392 pages.
[2] Maxime Crochemore and Wojciech Rytter. Jewels of Stringology. World
Scientific Publishing, Hong-Kong, 2002. 310 pages.
[3] See also http://monge.univ-mlv.fr/ mac/REC/text-algorithms.pdf
or http://www.mimuw.edu.pl/ rytter/BOOKS/text-algorithms.pdf
M.C. PhD Univ. Warsaw 63/63