On multiset of factors of a word

Information Processing Letters 113 (2013) 78–80


Kalpesh Kapoor ∗, Himadri Nayak

Department of Mathematics, Indian Institute of Technology Guwahati, India

Article info

Article history: Received 26 April 2012; received in revised form 26 September 2012; accepted 26 September 2012; available online 8 November 2012. Communicated by A. Tarlecki.

Keywords: Combinatorial problems; Combinatorics on words; Repeated factors

Abstract

Given two strings, we investigate the conditions under which they have a common multiset of factors of a fixed length. We show that if two strings have the same multiset of factors of length less than or equal to $k$ then they have a common prefix and suffix of length $k - 1$. We also show that if two strings have the same multiset of factors of length $k$ and $k - 1$ then they also have the same multiset of factors of length less than $k - 1$.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Let $\Sigma$ be a finite alphabet. A string or a word is a finite sequence of letters chosen from the alphabet $\Sigma$. The length of a string $u$ is denoted by $|u|$. The string of length 0 is called the empty string and is denoted by $\varepsilon$. The set of all finite-length strings, including $\varepsilon$, is denoted by $\Sigma^*$. The set of all strings of length $n$ is denoted by $\Sigma^n$.

A string $x$ is said to be a factor of another string $y$ if and only if $y = pxs$, where $p$ and $s$ are some strings in $\Sigma^*$. If $p = \varepsilon$ then $x$ is said to be a prefix of $y$. Similarly, if $s = \varepsilon$ then $x$ is said to be a suffix of $y$. The prefix and suffix of length $l$ of a string $y$ ($0 \le l \le |y|$) are denoted by $\mathrm{pref}_l(y)$ and $\mathrm{suff}_l(y)$, respectively. We define $\mathrm{pref}_0(y) = \mathrm{suff}_0(y) = \varepsilon$.

* Corresponding author. E-mail addresses: [email protected] (K. Kapoor), [email protected] (H. Nayak).

0020-0190/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.ipl.2012.09.010

Let $u$ and $x$ be two non-empty words. The number of times $x$ appears (as a factor) in $u$ is denoted by $|u|_x$. For example, $|1010101|_{101} = 3$. We define a relation $\overset{k}{\equiv}$ as follows: $u \overset{k}{\equiv} v$ iff $m_k(u) = m_k(v)$, where $m_k(u)$ and $m_k(v)$ are the multisets formed by the $k$-length factors of $u$ and $v$, respectively. This is an equivalence relation. We define $M_k(u)$ as the set of strings whose $k$-length factor multisets are equal to $m_k(u)$.
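The counting notation $|u|_x$ and the multiset $m_k(u)$ can be made concrete with a short sketch. The function names `count_factor`, `factor_multiset` and `k_equivalent` are illustrative choices and do not come from the paper; strings are modelled as Python `str` values and multisets as `collections.Counter` objects.

```python
from collections import Counter

def count_factor(u: str, x: str) -> int:
    """|u|_x: number of (possibly overlapping) occurrences of x as a factor of u."""
    return sum(1 for i in range(len(u) - len(x) + 1) if u[i:i + len(x)] == x)

def factor_multiset(u: str, k: int) -> Counter:
    """m_k(u): multiset of all length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))

def k_equivalent(u: str, v: str, k: int) -> bool:
    """True when u and v have the same multiset of length-k factors."""
    return factor_multiset(u, k) == factor_multiset(v, k)

print(count_factor("1010101", "101"))    # 3, the example from the text
print(factor_multiset("1010101", 3))
print(k_equivalent("aabab", "abaab", 2))
```

Occurrences are counted with overlaps, which is what the example $|1010101|_{101} = 3$ requires.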

The problem of identifying words $u$ and $v$ that satisfy the equations $u \overset{i}{\equiv} v$ for all $i \le k$ is studied in [1]. It is shown there that the equations have only the trivial solution (i.e. $u = v$) if $k = \lfloor |u|/2 \rfloor + 1$. The following questions were left open in [1]:

1. Let $u, v \in \Sigma^n$ with $u = r\theta s$ and $v = s\theta' r$, where $|r| = |s| = k - 1$. Further, let $u \overset{i}{\equiv} v$ for all $i \le k$, where $3 \le k \le n$. Is it possible that $r \ne s$?

2. Do there exist two words $u$ and $v$ such that $u \overset{i}{\equiv} v$ for $i \in \{k, k - 1\}$ and $u \overset{k-2}{\not\equiv} v$?

The second question is also relevant in the context of constructing a word from the multisets of its factors of length $k - 2$, $k - 1$ and $k$. In [1], it is shown that a unique solution exists when $k = \lfloor n/2 \rfloor + 1$, where $n$ is the length of the word to be constructed.
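The uniqueness statement for $k = \lfloor n/2 \rfloor + 1$ can be checked exhaustively for short words. The following is a minimal sketch, assuming a binary alphabet and small $n$; the helper names `signature` and `ambiguous_classes` are hypothetical and are not taken from [1].

```python
from collections import Counter, defaultdict
from itertools import product

def factor_multiset(u, k):
    """m_k(u): multiset of all length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))

def signature(u, k):
    """Hashable summary of the multisets m_1(u), ..., m_k(u)."""
    return tuple(tuple(sorted(factor_multiset(u, i).items())) for i in range(1, k + 1))

def ambiguous_classes(n, alphabet="01"):
    """Groups of distinct length-n words that agree on m_i for all i <= floor(n/2) + 1."""
    k = n // 2 + 1
    classes = defaultdict(list)
    for letters in product(alphabet, repeat=n):
        word = "".join(letters)
        classes[signature(word, k)].append(word)
    return [words for words in classes.values() if len(words) > 1]

for n in range(1, 9):
    # An empty list means every word of length n is determined by its multisets.
    print(n, ambiguous_classes(n))
```

An exhaustive search of this kind only covers the lengths and alphabet that are enumerated; it illustrates the statement and is not a substitute for the proof in [1].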

In the following section, we describe the framework that is used to answer the above questions. In Section 3, Lemma 5 and subsequently Corollary 6 answer the first question in the negative. At the end of Section 3, Theorem 7 answers the second question, also in the negative.


2. Structure of $M_k(u)$

In [2], Ukkonen introduced a string-distance measure based on the number of occurrences of different factors of fixed length $k$ in two given strings. Given a string $v$ in $M_k(u)$, a new string in $M_k(u)$ can be obtained by one of the following transformations.

Definition 1 (k-Transposition). (See [2].) Let $y$ and $y'$ be two strings that can be expressed in either of the two forms given below.

(1st form)
$$y = y_1 z_1 y_2 z_2 y_3 z_1 y_4 z_2 y_5, \qquad y' = y_1 z_1 y_4 z_2 y_3 z_1 y_2 z_2 y_5$$

or

(2nd form)
$$y = y_1 z y_2 z y_3 z y_4, \qquad y' = y_1 z y_3 z y_2 z y_4$$

where $|z_1| = |z_2| = |z| = k - 1$ and $y_1, y_2, y_3, y_4, y_5 \in \Sigma^*$. Then the two strings $y$ and $y'$ are k-transpositions of each other.
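Both forms of Definition 1 can be illustrated by assembling $y$ and $y'$ from their blocks and comparing the resulting $k$-length factor multisets. This is a sketch only: the function names and the sample blocks (with $k = 3$) are assumptions made for the illustration.

```python
from collections import Counter

def factor_multiset(u: str, k: int) -> Counter:
    """m_k(u): multiset of the length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))

def transpose_form1(y1, z1, y2, z2, y3, y4, y5):
    """Return (y, y') for the 1st form: the blocks y2 and y4 trade places."""
    assert len(z1) == len(z2), "z1 and z2 must both have length k - 1"
    y = y1 + z1 + y2 + z2 + y3 + z1 + y4 + z2 + y5
    y_prime = y1 + z1 + y4 + z2 + y3 + z1 + y2 + z2 + y5
    return y, y_prime

def transpose_form2(y1, z, y2, y3, y4):
    """Return (y, y') for the 2nd form: the blocks y2 and y3 trade places."""
    y = y1 + z + y2 + z + y3 + z + y4
    y_prime = y1 + z + y3 + z + y2 + z + y4
    return y, y_prime

k = 3  # so every z block has length k - 1 = 2
a, a_prime = transpose_form1("c", "ab", "cc", "ba", "d", "dd", "c")
b, b_prime = transpose_form2("c", "ab", "d", "ee", "c")
print(a, a_prime, factor_multiset(a, k) == factor_multiset(a_prime, k))
print(b, b_prime, factor_multiset(b, k) == factor_multiset(b_prime, k))
```

In both sample pairs the two strings differ while their multisets of 3-length factors coincide, because every $k$-length factor lies inside one of the segments delimited by consecutive $z$ blocks, and those segments are only reordered.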

Definition 2 (k-Rotation). (See [2].) Let $y$ and $y'$ be two strings that can be expressed as $y = z_1 y_1 z_2 y_2 z_1$ and $y' = z_2 y_2 z_1 y_1 z_2$, where $|z_1| = |z_2| = k - 1$ and $y_1, y_2 \in \Sigma^*$. Then the two strings $y$ and $y'$ are k-rotations of each other.
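A k-rotation can be illustrated in the same way. The sketch below builds one sample pair with $k = 3$ and $z_1 \ne z_2$; the function name `rotate` and the chosen blocks are assumptions for the illustration.

```python
from collections import Counter

def factor_multiset(u: str, k: int) -> Counter:
    """m_k(u): multiset of the length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))

def rotate(z1, y1, z2, y2):
    """Return (y, y') with y = z1 y1 z2 y2 z1 and y' = z2 y2 z1 y1 z2."""
    assert len(z1) == len(z2), "z1 and z2 must both have length k - 1"
    return z1 + y1 + z2 + y2 + z1, z2 + y2 + z1 + y1 + z2

k = 3
y, y_prime = rotate("ab", "c", "ba", "d")
print(y, y_prime)                                             # abcbadab badabcba
print(factor_multiset(y, k) == factor_multiset(y_prime, k))   # True
print(factor_multiset(y, k - 1) == factor_multiset(y_prime, k - 1))  # False
```

In this instance the 3-length factor multisets agree while the 2-length ones do not: $y$ contains `ab` twice and `ba` once, and $y'$ the other way round. A rotation with $z_1 \ne z_2$ therefore need not preserve the multisets of shorter factors.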

Ukkonen [2] proved that if $u \overset{k}{\equiv} v$ then $|u| = |v|$. The following lemma, proved by Pevzner [3], states that the entire set $M_k(u)$ can be generated by applying the above transformations.

Lemma 1. (See [3].) Let $u$ and $v$ be two distinct strings such that $u \overset{k}{\equiv} v$. Then $u$ and $v$ can be transformed into each other by application of one or more k-transpositions and k-rotations.

As these transformations preserve the multiset of $k$-length factors, we shall call them $k$-transformations. We denote the first and second forms of transposition between two strings $y$ and $y'$ by $k$-tr$_1$ and $k$-tr$_2$, respectively. Similarly, we refer to rotations with $z_1 \ne z_2$ and $z_1 = z_2$ as $k$-rt$_1$ and $k$-rt$_2$, respectively.

We denote the action of a $k$-rt$_1$ rotation by $y \xrightarrow[k\text{-rt}_1]{z z'} y'$, where $y$ and $y'$ are strings as given in Definition 2 (with $z = z_1$ and $z' = z_2$).

Let $u$, $v$ and $w$ be three strings. A non-empty string $s$ is said to spread over a string $uv$ if $u = u_1 u_2$, $v = v_1 v_2$ and $s = u_2 v_1$, where $|u_2|, |v_1| > 0$. Similarly, $s$ spreads over a string $uvw$ if $u = u_1 u_2$, $w = w_1 w_2$, $s = u_2 v w_1$ and both $u_2$ and $w_1$ are non-empty strings.

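Lemma 2. Let $u \overset{k}{\equiv} u'$ such that $u'$ is obtained from $u$ by applying any one of the $k$-tr$_1$, $k$-tr$_2$ or $k$-rt$_2$ transformations. Then,

(a) $\mathrm{pref}_{k-1}(u) = \mathrm{pref}_{k-1}(u')$ and $\mathrm{suff}_{k-1}(u) = \mathrm{suff}_{k-1}(u')$.

(b) $u \overset{i}{\equiv} u'$ for all $i \le k$.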

Proof. (a) This is straightforward from Definitions 1 and 2. Observe that the prefix $y_1 z_1$ or $y_1 z$ and the suffix $z_2 y_5$ or $z y_4$, each of length at least $k - 1$, remain the same in a k-transposition. Similarly, neither the suffix nor the prefix of length $k - 1$ changes in a $k$-rt$_2$ rotation.

(b) Let $s$ be any string of length $l$, where $l \le k - 1$. Assume that the strings $u$ and $u'$ are of the form given in either Definition 1 or Definition 2. We consider the following three cases.

(i) $s$ is a factor of any of the following strings: $y_1$, $y_2$, $y_3$, $y_4$, $y_5$, $z_1$, $z_2$ or $z$.

(ii) $s$ is spread over any of the following strings: $y_1 z_1$, $y_2 z_2$, $y_3 z_1$, $y_4 z_2$, $z_1 y_2$, $z_2 y_3$, $z_1 y_4$, $z_2 y_5$ (in the case of a $k$-tr$_1$ transformation); $y_1 z$, $y_2 z$, $y_3 z$, $z y_2$, $z y_3$, $z y_4$ (in the case of a $k$-tr$_2$ transformation); $z y_1$, $z y_2$, $y_1 z$, $y_2 z$ (in the case of a $k$-rt$_2$ transformation, where $z_1 = z_2 = z$).

(iii) $s$ is spread over any of the following strings: $z_1 y_2 z_2$, $z_2 y_3 z_1$, $z_1 y_4 z_2$ (in the case of a $k$-tr$_1$ transformation); $z y_2 z$, $z y_3 z$ (in the case of a $k$-tr$_2$ transformation); $z y_1 z$, $z y_2 z$ (in the case of a $k$-rt$_2$ transformation, where $z_1 = z_2 = z$).

Apart from the above three cases, $s$ cannot be a factor of $u$ or $u'$ in any other way. For example, $s$ cannot be spread over a segment $y_1 z_1 y_2$, as $|z_1| = k - 1$. We note that, in all three cases, the segments only change their positions in $u'$ relative to $u$. Thus, their frequencies remain the same. This means that there cannot be any string $s$ such that $|s| < k$ and $|u|_s \ne |u'|_s$ if $u \overset{k}{\equiv} u'$. □
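Both parts of Lemma 2 can be checked on sample instances of the three transformations. The strings below are built for $k = 3$ from illustrative blocks; the helper names `same_up_to_k` and `same_ends` are hypothetical.

```python
from collections import Counter

def factor_multiset(u: str, k: int) -> Counter:
    """m_k(u): multiset of the length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))

def same_up_to_k(u: str, v: str, k: int) -> bool:
    """Part (b): the factor multisets agree for every length i <= k."""
    return all(factor_multiset(u, i) == factor_multiset(v, i) for i in range(1, k + 1))

def same_ends(u: str, v: str, k: int) -> bool:
    """Part (a): the prefixes and suffixes of length k - 1 agree."""
    m = k - 1
    return u[:m] == v[:m] and u[len(u) - m:] == v[len(v) - m:]

k = 3
pairs = {
    # k-tr1 with z1 = "ab", z2 = "ba": blocks "cc" and "dd" are swapped
    "k-tr1": ("c" + "ab" + "cc" + "ba" + "d" + "ab" + "dd" + "ba" + "c",
              "c" + "ab" + "dd" + "ba" + "d" + "ab" + "cc" + "ba" + "c"),
    # k-tr2 with z = "ab": blocks "d" and "ee" are swapped
    "k-tr2": ("c" + "ab" + "d" + "ab" + "ee" + "ab" + "c",
              "c" + "ab" + "ee" + "ab" + "d" + "ab" + "c"),
    # k-rt2 with z1 = z2 = "ab": blocks "c" and "dd" are swapped
    "k-rt2": ("ab" + "c" + "ab" + "dd" + "ab",
              "ab" + "dd" + "ab" + "c" + "ab"),
}
for name, (u, u_prime) in pairs.items():
    print(name, same_ends(u, u_prime, k), same_up_to_k(u, u_prime, k))
```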

Lemma 3. Let $u \xrightarrow[k\text{-rt}_1]{z z'} u'$, where $u$ and $u'$ are of the form $z y z' y' z$ and $z' y' z y z'$, respectively, with $|z| = |z'| = k - 1$. Let $s$ be a string of length less than $k$. Then $|u|_s - |u'|_s = |z|_s - |z'|_s$.
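The identity in Lemma 3 can be checked numerically on small instances. The sketch below tests it for every string $s$ of length less than $k$ that occurs in $u$ or $u'$; the function name `lemma3_holds` and the sample blocks are assumptions for the illustration.

```python
def count_factor(u: str, x: str) -> int:
    """|u|_x: number of (possibly overlapping) occurrences of x in u."""
    return sum(1 for i in range(len(u) - len(x) + 1) if u[i:i + len(x)] == x)

def lemma3_holds(z: str, y: str, z_prime: str, y_prime: str) -> bool:
    """Check |u|_s - |u'|_s == |z|_s - |z'|_s for all factors s with |s| < k."""
    assert len(z) == len(z_prime), "z and z' must both have length k - 1"
    k = len(z) + 1
    u = z + y + z_prime + y_prime + z
    u_prime = z_prime + y_prime + z + y + z_prime
    # Every factor of u or u' of length 1 .. k-1 is a candidate for s.
    candidates = {
        w[i:i + length]
        for w in (u, u_prime)
        for length in range(1, k)
        for i in range(len(w) - length + 1)
    }
    return all(
        count_factor(u, s) - count_factor(u_prime, s)
        == count_factor(z, s) - count_factor(z_prime, s)
        for s in candidates
    )

print(lemma3_holds("ab", "c", "ba", "d"))
print(lemma3_holds("aa", "b", "ab", ""))
```

For instance, with $z = ab$, $y = c$, $z' = ba$, $y' = d$ we get $u = abcbadab$ and $u' = badabcba$, and for $s = ab$ both sides of the identity equal 1.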

Proof. Let $|z|_s = \lambda$ and $|z'|_s = \gamma$ ($\ne \lambda$). In addition to being a factor of $z$ and $z'$, $s$ can spread over the segments $zy$, $yz'$, $z'y'$, $y'z$, $zyz'$ and $z'y'z$ of $u$. However, $s$ cannot spread over the segment $yz'y'$ (or $y'zy$), as $|s| \le |z'| = k - 1$. All the segments over which $s$ can spread are also present in $u'$ with identical frequency. If $s$ occurs anywhere other than inside the strings $z$ and $z'$, the application of the $k$-transformation only changes the positions of those occurrences without changing their frequency. Let $\delta$ be the number of occurrences of $s$ that do not lie inside $z$ or $z'$. This implies $|u|_s = 2\lambda + \gamma + \delta$ and $|u'|_s = \lambda + 2\gamma + \delta$. Thus, $|u|_s - |u'|_s = \lambda - \gamma = |z|_s - |z'|_s$. □

Lemma 4. Let $u_i \xrightarrow[k\text{-rt}_1]{z_i z_{i+1}} u_{i+1}$ for all $i \in \{1, 2, \ldots, d - 1\}$, $d > 1$. Further, let $s$ be any string of length less than $k$. Then $|u_1|_s - |u_d|_s = |\mathrm{pref}_{k-1}(u_1)|_s - |\mathrm{pref}_{k-1}(u_d)|_s$.

Proof. The proof is immediate from Lemma 3 by mathematical induction on $d$. □

3. Multiset of factors of consecutive lengths

Let $u \overset{k}{\equiv} v$. In order to obtain $v$ from $u$ we may need to apply one or more $k$-transformations. Let $T = \langle t_1, t_2, \ldots, t_\eta \rangle$ be the sequence of $k$-transformations that yields $v$ from $u$ when they are applied in the given order. Further, let $R = \langle r_1, r_2, \ldots, r_\psi \rangle$ be the subsequence of $T$ consisting of all the $k$-rt$_1$ rotations. When a $k$-rt$_1$ rotation $r_i$ is applied to a string $u_{i-1}$ to get $u_i$, by Definition 2, $u_{i-1}$ and $u_i$ are of the form $z_{i-1} y_{i-1} z'_{i-1} y'_{i-1} z_{i-1}$ and $z_i y_i z'_i y'_i z_i$, respectively. We express the action of $r_i$ as $u \xrightarrow[k\text{-rt}_1]{z_{i-1} z_i} v$.

Lemma 5. Let $u \overset{k}{\equiv} v$ and $u \overset{k-1}{\equiv} v$. Also, let $T$ and $R$ be as defined in the previous paragraph. If $r_i$ is a transformation of the form $\xrightarrow[k\text{-rt}_1]{z_i z'_i}$ then

(a) $z'_i = z_{i+1}$ for all $i \in \{1, 2, \ldots, \psi - 1\}$.

(b) $z_1 = z'_\psi$.

Proof. (a) This is an implication of Lemma 2(a). When a $k$-rt$_1$ rotation $r_i$ is applied, the string $z'_i$ becomes the $(k-1)$-length suffix and prefix of $u_i$. From Lemma 2(a), we know that under the other $k$-transformations applied between two consecutive $k$-rt$_1$ rotations $r_i$ and $r_{i+1}$, the string $z'_i$ remains the $(k-1)$-length suffix and prefix of the intermediate strings. Thus, when the $k$-transformation $r_{i+1}$ is used, $z'_i$ becomes $z_{i+1}$. This happens for all the transformations indexed $i = 1$ to $\psi - 1$ in $R$.

(b) From the first part we see that whenever a $k$-rt$_1$ rotation of the form $\xrightarrow[k\text{-rt}_1]{z_i z_{i+1}}$ is applied, the frequency of $z_i$ decreases by 1 and the frequency of $z_{i+1}$ increases by 1. By Lemma 2, after the action of a $k$-rt$_1$ rotation none of the other three $k$-transformations changes the multiset of $l$-length substrings, where $l \le k$. Whenever the next $k$-rt$_1$ rotation is applied it must be of the form $\xrightarrow[k\text{-rt}_1]{z_{i+1} z_{i+2}}$. Thus the frequency of $z_{i+1}$ is restored. This applies to all transformations indexed from $i = 1$ to $\psi - 1$ in $R$. Thus, starting from $u$, the frequency of each of the substrings $z_2, z_3, \ldots, z_\psi$ is restored in $v$. But the frequencies of $z_1$ and $z_{\psi+1} = z'_\psi$ have not been restored yet. If $z_1 \ne z_{\psi+1}$ then the frequency of $z_1$ is one less in $v$ than in $u$. Similarly, the frequency of $z_{\psi+1}$ is one more in $v$ than in $u$. This cannot happen, as it is assumed that $u \overset{k-1}{\equiv} v$. Hence $z_1 = z'_\psi$, which means that if $u \overset{k}{\equiv} v$ and $u \overset{k-1}{\equiv} v$ then $\mathrm{pref}_{k-1}(u) = \mathrm{pref}_{k-1}(v)$. □

The proof of Lemma 5(b) gives us an idea about the relation between $z_1$ and $z'_\psi$. The contrapositive of the second part of the above lemma is as follows.

Corollary 6. Let $u$ and $v$ be two distinct strings such that $\mathrm{pref}_k(u) \ne \mathrm{pref}_k(v)$ or $\mathrm{suff}_k(u) \ne \mathrm{suff}_k(v)$, with $k > 1$. Then there exists a string $x$ of length $\le k + 1$ such that $|u|_x \ne |v|_x$.

This answers the question “Can we get two strings $w = r\theta s$ and $v = s\theta' r$ such that $r \ne s$, $|r| = |s| = k - 1$ and $v \overset{i}{\equiv} w$ for all $1 \le i \le k$, where $k \ge 3$?” in the negative.
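The conclusion used here, that words sharing the factor multisets of lengths $k$ and $k - 1$ also share the prefix and suffix of length $k - 1$, can be confirmed by exhaustive search over short words. The sketch below assumes a small alphabet and $2 \le k \le n$; the names `multiset_key` and `common_ends_hold` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

def multiset_key(u: str, k: int):
    """Hashable form of m_k(u)."""
    return tuple(sorted(Counter(u[i:i + k] for i in range(len(u) - k + 1)).items()))

def common_ends_hold(n: int, k: int, alphabet: str = "01") -> bool:
    """True if all length-n words sharing m_k and m_{k-1} share pref_{k-1} and suff_{k-1}."""
    ends_by_class = defaultdict(set)
    for letters in product(alphabet, repeat=n):
        w = "".join(letters)
        key = (multiset_key(w, k), multiset_key(w, k - 1))
        # Record the (prefix, suffix) pair of length k - 1 seen in this class.
        ends_by_class[key].add((w[:k - 1], w[n - k + 1:]))
    return all(len(ends) == 1 for ends in ends_by_class.values())

for n in range(2, 8):
    print(n, [common_ends_hold(n, k) for k in range(2, n + 1)])
```

In particular, two words $r\theta s$ and $s\theta' r$ with $|r| = |s| = k - 1$ that fall into the same class must have $r = s$, which is the negative answer stated above, restricted to the enumerated lengths.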

Theorem 7. Let $u$ and $v$ be two words such that $u \overset{i}{\equiv} v$ for $i = k, k - 1$. Then $u \overset{i}{\equiv} v$ for all $1 \le i \le k$.

Proof. Here the assumptions are the same as in Lemma 5. Thus we have $\mathrm{pref}_{k-1}(u) = \mathrm{pref}_{k-1}(v) = z_1$. Let $s$ be any string of length less than $k$. The other three $k$-transformations that are applied between any two $k$-rt$_1$ rotations do not change the multiset of substrings of any length $\le k$. We can apply Lemma 4 here, and hence we get $|u|_s - |v|_s = |z_1|_s - |z_1|_s = 0$. This implies that there is no string of length less than $k - 1$ whose frequency can be different in $u$ and $v$. Also, the assumption was $u \overset{i}{\equiv} v$ for $i = k, k - 1$. Combining these, we have $u \overset{i}{\equiv} v$ for all $1 \le i \le k$. □

This answers the question “Do there exist two words $u$ and $v$ such that $u \overset{i}{\equiv} v$ for $i = k, k - 1$ and $u \overset{k-2}{\not\equiv} v$?” in the negative.
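Theorem 7 can also be checked by brute force on short words. The following sketch groups all words of length $n$ by the pair $(m_k, m_{k-1})$ and verifies that each group agrees on every shorter factor multiset. It assumes $2 \le k \le n$ and a small alphabet; the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

def multiset_key(u: str, k: int):
    """Hashable form of m_k(u)."""
    return tuple(sorted(Counter(u[i:i + k] for i in range(len(u) - k + 1)).items()))

def theorem7_holds(n: int, k: int, alphabet: str = "01") -> bool:
    """True if agreement on m_k and m_{k-1} forces agreement on m_i for all i < k - 1."""
    classes = defaultdict(list)
    for letters in product(alphabet, repeat=n):
        w = "".join(letters)
        classes[(multiset_key(w, k), multiset_key(w, k - 1))].append(w)
    for words in classes.values():
        for i in range(1, k - 1):
            # All words in one class must have the same multiset of length-i factors.
            if len({multiset_key(w, i) for w in words}) > 1:
                return False
    return True

for n in range(2, 8):
    print(n, [theorem7_holds(n, k) for k in range(2, n + 1)])
```

The restriction $k \le n$ matters for the experiment: when both $k$ and $k - 1$ exceed the word length, the two multisets are empty for every word and carry no information.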

Acknowledgements

The authors would like to thank the anonymous reviewers for their extensive and detailed reports, which helped in significantly improving the paper.

References

[1] C. Piña, C. Uzcátegui, Reconstruction of a word from a multiset of its factors, Theoretical Computer Science 400 (1–3) (2008) 70–83.

[2] E. Ukkonen, Approximate string matching with q-grams and maximal matches, Theoretical Computer Science 92 (1992) 191–211.

[3] P.A. Pevzner, DNA physical mapping and alternating Eulerian cycles in colored graphs, Algorithmica 13 (1995) 77–105.
