Information Processing Letters 113 (2013) 78–80
Contents lists available at SciVerse ScienceDirect
Information Processing Letters
www.elsevier.com/locate/ipl
On multiset of factors of a word
Kalpesh Kapoor ∗, Himadri Nayak
Department of Mathematics, Indian Institute of Technology Guwahati, India
Article info

Article history: Received 26 April 2012. Received in revised form 26 September 2012. Accepted 26 September 2012. Available online 8 November 2012. Communicated by A. Tarlecki.

Keywords: Combinatorial problems; Combinatorics on words; Repeated factors

Abstract

Given two strings, we investigate the conditions under which they have a common multiset of factors of a fixed length. We show that if two strings have the same multiset of factors of length less than or equal to k then they have a common prefix and suffix of length k − 1. We also show that if two strings have the same multiset of factors of length k and k − 1 then they also have the same multiset of factors of length less than k − 1.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction
Let $\Sigma$ be a finite alphabet. A string or a word is a finite sequence of letters chosen from the alphabet $\Sigma$. The length of a string $u$ is denoted by $|u|$. The string of length 0 is called the empty string and is denoted by $\varepsilon$. The set of all finite-length strings, including $\varepsilon$, is denoted by $\Sigma^*$. The set of all strings of length $n$ is denoted by $\Sigma^n$.
A string $x$ is said to be a factor of another string $y$ if and only if $y = pxs$, where $p$ and $s$ are some strings in $\Sigma^*$. If $p = \varepsilon$ then $x$ is said to be a prefix of $y$. Similarly, if $s = \varepsilon$ then $x$ is said to be a suffix of $y$. The prefix and suffix of length $l$ of a string $y$ ($0 \le l \le |y|$) are denoted by $\mathrm{pref}_l(y)$ and $\mathrm{suff}_l(y)$, respectively. We define $\mathrm{pref}_0(y) = \mathrm{suff}_0(y) = \varepsilon$.
* Corresponding author. E-mail addresses: [email protected] (K. Kapoor), [email protected] (H. Nayak).
0020-0190/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.ipl.2012.09.010

Let $u$ and $x$ be two non-empty words. The number of times $x$ appears (as a factor) in $u$ is denoted by $|u|_x$. For example, $|1010101|_{101} = 3$. We define a relation $\stackrel{k}{\equiv}$ as follows: $u \stackrel{k}{\equiv} v$ iff $m_k(u) = m_k(v)$, where $m_k(u)$ and $m_k(v)$ are the multisets formed by the $k$-length factors of $u$ and $v$, respectively. This is an equivalence relation. We define $M_k(u)$ as the set of strings whose $k$-length factor multisets are equal to $m_k(u)$.
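The definitions of $|u|_x$, $m_k(u)$ and the relation $\stackrel{k}{\equiv}$ can be illustrated with a minimal Python sketch. The function names `occurrences`, `factor_multiset` and `k_equivalent` are our own illustrative choices and do not appear in the paper; occurrences are counted with overlaps, as in the example $|1010101|_{101} = 3$.

```python
from collections import Counter


def factor_multiset(u: str, k: int) -> Counter:
    """Multiset m_k(u): every length-k factor of u with its multiplicity."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))


def occurrences(u: str, x: str) -> int:
    """|u|_x: number of (possibly overlapping) occurrences of non-empty x in u."""
    return sum(1 for i in range(len(u) - len(x) + 1) if u.startswith(x, i))


def k_equivalent(u: str, v: str, k: int) -> bool:
    """True when u and v have the same multiset of k-length factors."""
    return factor_multiset(u, k) == factor_multiset(v, k)


print(occurrences("1010101", "101"))    # 3, the example from the text
print(k_equivalent("aaba", "abaa", 2))  # True: both give {aa, ab, ba}
print(k_equivalent("aaba", "abaa", 3))  # False: {aab, aba} vs {aba, baa}
```

The multiset is represented by `collections.Counter`, so equality of two multisets is simply equality of the two counters.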
The problem of identifying words $u$ and $v$ that satisfy the equations $u \stackrel{i}{\equiv} v$ for all $i \le k$ is studied in [1]. It is shown there that the equations have only the trivial solution (i.e. $u = v$) if $k = \lfloor |u|/2 \rfloor + 1$. The following questions were left open in [1]:

1. Let $u, v \in \Sigma^n$ with $u = r\theta s$ and $v = s\theta' r$, where $|r| = |s| = k - 1$. Further, let $u \stackrel{i}{\equiv} v$ for all $i \le k$, where $3 \le k \le n$. Is it possible that $r \ne s$?
2. Do there exist two words $u$ and $v$ such that $u \stackrel{i}{\equiv} v$ for $i \in \{k, k-1\}$ and $u \stackrel{k-2}{\not\equiv} v$?
The second question is also relevant in the context of constructing a word from the multisets of its factors of length $k-2$, $k-1$ and $k$. In [1], it is shown that a unique solution exists when $k = \lfloor n/2 \rfloor + 1$, where $n$ is the length of the word to be constructed.
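The uniqueness result quoted from [1] can be checked exhaustively for short words. The sketch below is only an illustration under our own assumptions: it uses a binary alphabet and small $n$, the helper names are hypothetical, and the bound is read as $k = \lfloor n/2 \rfloor + 1$. It lists all pairs of distinct words of length $n$ that agree on the factor multisets of every length from 1 to $k$.

```python
from collections import Counter
from itertools import combinations, product


def factor_multiset(u, k):
    """Multiset of the length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))


def equivalent_up_to(u, v, k):
    """True if u and v have equal factor multisets for every length 1..k."""
    return all(factor_multiset(u, i) == factor_multiset(v, i)
               for i in range(1, k + 1))


def nontrivial_pairs(n, k, alphabet="ab"):
    """All pairs of distinct words of length n that are equivalent up to k."""
    words = ["".join(w) for w in product(alphabet, repeat=n)]
    return [(u, v) for u, v in combinations(words, 2)
            if equivalent_up_to(u, v, k)]


# With k = floor(n/2) + 1 no non-trivial pair should be found.
for n in range(2, 8):
    print(n, n // 2 + 1, nontrivial_pairs(n, n // 2 + 1))

# One level lower the solution need not be unique: for n = 4 and k = 2
# the words aaba and abaa share their factors of length 1 and 2.
print(("aaba", "abaa") in nontrivial_pairs(4, 2))
```

The search is quadratic in the number of words, so it is meant for small $n$ only.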
In the following section, we describe the framework that is used to answer the above questions. In Section 3, Lemma 5 and subsequently Corollary 6 answer the first question in the negative. At the end of Section 3, Theorem 7 answers the second question, also in the negative.
2. Structure of $M_k(u)$
In [2], Ukkonen introduced a string-distance measure based on the number of occurrences of different factors of fixed length $k$ in two given strings. Given a string $v$ in $M_k(u)$, a new string in $M_k(u)$ can be obtained by one of the following transformations.
Definition 1 ($k$-Transposition). (See [2].) Let $y$ and $y'$ be two strings that can be expressed in either of the two forms given below.

(1st form)
$$y = y_1 z_1 y_2 z_2 y_3 z_1 y_4 z_2 y_5, \qquad y' = y_1 z_1 y_4 z_2 y_3 z_1 y_2 z_2 y_5$$
or (2nd form)
$$y = y_1 z y_2 z y_3 z y_4, \qquad y' = y_1 z y_3 z y_2 z y_4$$
where $|z_1| = |z_2| = |z| = k - 1$ and $y_1, y_2, y_3, y_4, y_5 \in \Sigma^*$. Then the two strings $y$ and $y'$ are $k$-transpositions of each other.
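Both forms of Definition 1 can be tried out on concrete strings. The following Python sketch is an illustration only; the function names and the sample blocks ($k = 3$, $z_1 = \texttt{ab}$, $z_2 = \texttt{cd}$, and so on) are our own choices. It builds $y$ and $y'$ from their blocks and compares the multisets of $k$-length factors.

```python
from collections import Counter


def factor_multiset(u, k):
    """Multiset of the length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))


def transpose_form1(y1, z1, y2, z2, y3, y4, y5):
    """First form: the blocks y2 and y4 are exchanged."""
    y = y1 + z1 + y2 + z2 + y3 + z1 + y4 + z2 + y5
    y_prime = y1 + z1 + y4 + z2 + y3 + z1 + y2 + z2 + y5
    return y, y_prime


def transpose_form2(y1, z, y2, y3, y4):
    """Second form: the blocks y2 and y3 are exchanged."""
    y = y1 + z + y2 + z + y3 + z + y4
    y_prime = y1 + z + y3 + z + y2 + z + y4
    return y, y_prime


k = 3  # so every z block has length k - 1 = 2
a, a_prime = transpose_form1("e", "ab", "f", "cd", "g", "h", "e")
b, b_prime = transpose_form2("", "ab", "c", "dd", "")

print(a, a_prime, factor_multiset(a, k) == factor_multiset(a_prime, k))
print(b, b_prime, factor_multiset(b, k) == factor_multiset(b_prime, k))
```

In both cases the two strings differ but the comparison of the 3-factor multisets yields `True`.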
Definition 2 ($k$-Rotation). (See [2].) Let $y$ and $y'$ be two strings that can be expressed as $y = z_1 y_1 z_2 y_2 z_1$ and $y' = z_2 y_2 z_1 y_1 z_2$, where $|z_1| = |z_2| = k - 1$ and $y_1, y_2 \in \Sigma^*$. Then the two strings $y$ and $y'$ are $k$-rotations of each other.
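A $k$-rotation can be illustrated in the same way. The sketch below uses our own function name `rotate` and sample blocks with $k = 3$ and $z_1 \ne z_2$; it shows that the multiset of $k$-length factors is preserved, while the multiset of $(k-1)$-length factors changes in this particular example.

```python
from collections import Counter


def factor_multiset(u, k):
    """Multiset of the length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))


def rotate(z1, y1, z2, y2):
    """Return (y, y') with y = z1 y1 z2 y2 z1 and y' = z2 y2 z1 y1 z2."""
    if len(z1) != len(z2):
        raise ValueError("z1 and z2 must both have length k - 1")
    return z1 + y1 + z2 + y2 + z1, z2 + y2 + z1 + y1 + z2


k = 3
y, y_prime = rotate("ab", "e", "cd", "f")
print(y, y_prime)                                          # abecdfab cdfabecd
print(factor_multiset(y, k) == factor_multiset(y_prime, k))          # True
print(factor_multiset(y, k - 1) == factor_multiset(y_prime, k - 1))  # False
```

Here `ab` occurs twice in `abecdfab` but only once in `cdfabecd`, and the opposite holds for `cd`.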
Ukkonen [2] proved that if $u \stackrel{k}{\equiv} v$ then $|u| = |v|$. The following lemma, proved by Pevzner [3], states that the entire set $M_k(u)$ can be generated by applying the above transformations.

Lemma 1. (See [3].) Let $u$ and $v$ be two distinct strings such that $u \stackrel{k}{\equiv} v$. Then $u$ and $v$ can be transformed into each other by the application of one or more $k$-transpositions and $k$-rotations.
As these transformations preserve the multiset of $k$-length factors, we shall call them $k$-transformations. We denote the first and second forms of transposition between two strings $y$ and $y'$ by $k\text{-tr}_1$ and $k\text{-tr}_2$, respectively. Similarly, we refer to rotations with $z_1 \ne z_2$ and $z_1 = z_2$ as $k\text{-rt}_1$ and $k\text{-rt}_2$, respectively.
We denote the action of a $k\text{-rt}_1$ rotation by $y \xrightarrow[k\text{-rt}_1]{z z'} y'$, where $y$ and $y'$ are strings as given in Definition 2. Let $u$, $v$ and $w$ be three strings. A non-empty string $s$ is said to spread over a string $uv$ if $u = u_1 u_2$, $v = v_1 v_2$ and $s = u_2 v_1$, where $|u_2|, |v_1| > 0$. Similarly, $s$ spreads over a string $uvw$ if $u = u_1 u_2$, $w = w_1 w_2$, $s = u_2 v w_1$ and both $u_2$ and $w_1$ are non-empty strings.
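Lemma 2. Let $u \stackrel{k}{\equiv} u'$ be such that $u'$ is obtained from $u$ by applying any one of the $k\text{-tr}_1$, $k\text{-tr}_2$ or $k\text{-rt}_2$ transformations. Then,

(a) $\mathrm{pref}_{k-1}(u) = \mathrm{pref}_{k-1}(u')$ and $\mathrm{suff}_{k-1}(u) = \mathrm{suff}_{k-1}(u')$.
(b) $u \stackrel{i}{\equiv} u'$ for all $i \le k$.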
Proof. (a) This is straightforward from Definitions 1 and 2. Observe that the prefix $y_1 z_1$ or $y_1 z$ and the suffix $z_2 y_5$ or $z y_4$, each of length at least $k - 1$, remain the same in a $k$-transposition. Similarly, neither the suffix nor the prefix of length $k - 1$ changes in a $k\text{-rt}_2$ rotation.
(b) Let $s$ be any string of length $l$, where $l \le k - 1$. Assume that the strings $u$ and $u'$ are of the form given in either Definition 1 or Definition 2. We consider the following three cases.

(i) $s$ is a factor of any of the following strings: $y_1$, $y_2$, $y_3$, $y_4$, $y_5$, $z_1$, $z_2$ or $z$.
(ii) $s$ is spread over any of the following strings: $y_1 z_1$, $y_2 z_2$, $y_3 z_1$, $y_4 z_2$, $z_1 y_2$, $z_2 y_3$, $z_1 y_4$, $z_2 y_5$ (in case of a $k\text{-tr}_1$ transformation); $y_1 z$, $y_2 z$, $y_3 z$, $z y_2$, $z y_3$, $z y_4$ (in case of a $k\text{-tr}_2$ transformation); $z y_1$, $z y_2$, $y_1 z$, $y_2 z$ (in case of a $k\text{-rt}_2$ transformation, where $z_1 = z_2 = z$).
(iii) $s$ is spread over any of the following strings: $z_1 y_2 z_2$, $z_2 y_3 z_1$, $z_1 y_4 z_2$ (in case of a $k\text{-tr}_1$ transformation); $z y_2 z$, $z y_3 z$ (in case of a $k\text{-tr}_2$ transformation); $z y_1 z$, $z y_2 z$ (in case of a $k\text{-rt}_2$ transformation, where $z_1 = z_2 = z$).

Apart from the above three cases, $s$ cannot be a factor of $u$ or $u'$ in any other way. For example, $s$ cannot be spread over a segment $y_1 z_1 y_2$, as $|z_1| = k - 1$. We note that, in all three cases, the segments only change their positions in $u'$ relative to $u$. Thus, their frequencies remain the same. This means that there cannot be any string $s$ such that $|s| < k$ and $|u|_s \ne |u'|_s$ if $u \stackrel{k}{\equiv} u'$. □
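Lemma 2 can be checked on concrete instances. The sketch below is an illustration with our own helper `lemma2_conclusions_hold` and sample strings for $k = 3$: a $k\text{-tr}_1$ pair, a $k\text{-rt}_2$ pair, and, for contrast, a $k\text{-rt}_1$ pair, to which the lemma does not apply.

```python
from collections import Counter


def factor_multiset(u, k):
    """Multiset of the length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))


def lemma2_conclusions_hold(u, u_prime, k):
    """Check (a) equal prefix and suffix of length k-1 and
    (b) equal factor multisets for every length 1..k."""
    same_prefix = u[:k - 1] == u_prime[:k - 1]
    same_suffix = u[len(u) - (k - 1):] == u_prime[len(u_prime) - (k - 1):]
    same_multisets = all(factor_multiset(u, i) == factor_multiset(u_prime, i)
                         for i in range(1, k + 1))
    return same_prefix and same_suffix and same_multisets


k = 3
# k-tr_1 with y1=e, z1=ab, y2=f, z2=cd, y3=g, y4=h, y5=e
print(lemma2_conclusions_hold("eabfcdgabhcde", "eabhcdgabfcde", k))  # True
# k-rt_2 with z=ab, y1=c, y2=dd
print(lemma2_conclusions_hold("abcabddab", "abddabcab", k))          # True
# k-rt_1 with z1=ab, y1=e, z2=cd, y2=f: outside the scope of Lemma 2
print(lemma2_conclusions_hold("abecdfab", "cdfabecd", k))            # False
```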
Lemma 3. Let $u \xrightarrow[k\text{-rt}_1]{z z'} u'$, where $u$ and $u'$ are of the form $z y z' y' z$ and $z' y' z y z'$, respectively, with $|z| = |z'| = k - 1$. Let $s$ be a string of length less than $k$. Then $|u|_s - |u'|_s = |z|_s - |z'|_s$.
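As a numerical illustration of the statement of Lemma 3 (not a substitute for its proof), the following Python sketch checks the identity $|u|_s - |u'|_s = |z|_s - |z'|_s$ for every string $s$ of length less than $k$ that occurs in $u$ or $u'$. The helper names and the sample blocks are our own assumptions.

```python
def occurrences(u, s):
    """|u|_s: number of (possibly overlapping) occurrences of s in u."""
    return sum(1 for i in range(len(u) - len(s) + 1) if u.startswith(s, i))


def short_factors(u, k):
    """All distinct factors of u of length 1..k-1."""
    return {u[i:i + l] for l in range(1, k) for i in range(len(u) - l + 1)}


def lemma3_identity_holds(z, y, z_prime, y_prime):
    """Check the identity of Lemma 3 for u = z y z' y' z and u' = z' y' z y z'."""
    if len(z) != len(z_prime):
        raise ValueError("z and z' must both have length k - 1")
    k = len(z) + 1
    u = z + y + z_prime + y_prime + z
    u_prime = z_prime + y_prime + z + y + z_prime
    candidates = short_factors(u, k) | short_factors(u_prime, k)
    return all(
        occurrences(u, s) - occurrences(u_prime, s)
        == occurrences(z, s) - occurrences(z_prime, s)
        for s in candidates
    )


print(lemma3_identity_holds("ab", "e", "cd", "f"))  # True
print(lemma3_identity_holds("aa", "", "ab", ""))    # True
```

Strings $s$ that occur in neither $u$ nor $u'$ also do not occur in $z$ or $z'$, so both sides of the identity are 0 for them and they can be skipped.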
Proof. Let $|z|_s = \lambda$ and $|z'|_s = \gamma$ ($\ne \lambda$). In addition to being a factor of $z$ and $z'$, $s$ can spread over the segments $zy$, $yz'$, $z'y'$, $y'z$, $zyz'$ and $z'y'z$ of $u$. However, $s$ cannot spread over the segment $yz'y'$ (or $y'zy$), as $|s| \le |z'| = k - 1$. All the segments over which $s$ can spread are also present in $u'$ with identical frequency. If $s$ occurs anywhere other than inside the strings $z$ and $z'$, the application of the $k$-transformation only changes the positions of these occurrences without changing their number. Let $\delta$ be the number of occurrences of $s$ that do not lie inside $z$ or $z'$. This implies $|u|_s = 2\lambda + \gamma + \delta$ and $|u'|_s = \lambda + 2\gamma + \delta$. Thus, $|u|_s - |u'|_s = \lambda - \gamma = |z|_s - |z'|_s$. □

Lemma 4. Let $u_i \xrightarrow[k\text{-rt}_1]{z_i z_{i+1}} u_{i+1}$ for all $i \in \{1, 2, \ldots, d - 1\}$, $d > 1$. Further, let $s$ be any string of length less than $k$. Then $|u_1|_s - |u_d|_s = |\mathrm{pref}_{k-1}(u_1)|_s - |\mathrm{pref}_{k-1}(u_d)|_s$.
Proof. The proof is immediate from Lemma 3 by mathematical induction on $d$. □

3. Multiset of factors of consecutive lengths
Let $u \stackrel{k}{\equiv} v$. In order to obtain $v$ from $u$ we may need to apply one or more $k$-transformations. Let $T = \langle t_1, t_2, \ldots, t_\eta \rangle$ be the sequence of $k$-transformations that are needed to get $v$ from $u$ if they are applied in the given
order. Further, let $R = \langle r_1, r_2, \ldots, r_\psi \rangle$ be the subsequence of $T$ such that the $r_i$'s are all the $k\text{-rt}_1$ rotations. When a $k\text{-rt}_1$ rotation $r_i$ is applied to a string $u_{i-1}$ to get $u_i$, by Definition 2, $u_{i-1}$ and $u_i$ are of the form $z_{i-1} y_{i-1} z'_{i-1} y'_{i-1} z_{i-1}$ and $z_i y_i z'_i y'_i z_i$, respectively. We express the action of $r_i$ as $u \xrightarrow[k\text{-rt}_1]{z_{i-1} z_i} v$.
Lemma 5. Let $u \stackrel{k}{\equiv} v$ and $u \stackrel{k-1}{\equiv} v$. Also, let $T$ and $R$ be as defined in the previous paragraph. If $r_i$ is a transformation of the form $\xrightarrow[k\text{-rt}_1]{z_i z'_i}$ then

(a) $z'_i = z_{i+1}$ for all $i \in \{1, 2, \ldots, \psi - 1\}$.
(b) $z_1 = z'_\psi$.
Proof. (a) This is an implication of Lemma 2(a). When a $k\text{-rt}_1$ rotation $r_i$ is applied, the string $z'_i$ becomes the $(k-1)$-length suffix and prefix of $u_i$. From Lemma 2(a), we know that under the application of the other $k$-transformations between any two $k\text{-rt}_1$ rotations $r_i$ and $r_{i+1}$, the string $z'_i$ remains the $(k-1)$-length suffix and prefix of the intermediate strings. Thus, when the $k$-transformation $r_{i+1}$ is used, $z'_i$ plays the role of $z_{i+1}$. This happens for all the transformations indexed $i = 1$ to $\psi - 1$ in $R$.

(b) From the first part we see that whenever a $k\text{-rt}_1$ rotation of the form $\xrightarrow[k\text{-rt}_1]{z_i z_{i+1}}$ is applied, the frequency of $z_i$ is decreased by 1 and the frequency of $z_{i+1}$ is increased by 1. Now by Lemma 2, after the action of a $k\text{-rt}_1$ rotation, none of the other three $k$-transformations changes the multiset of $l$-length substrings, where $l \le k$. Whenever the next $k\text{-rt}_1$ rotation is applied it must be of the form $\xrightarrow[k\text{-rt}_1]{z_{i+1} z_{i+2}}$. Thus the frequency of $z_{i+1}$ is restored. This is applicable for all transformations indexed from $i = 1$ to $\psi - 1$ in $R$. Thus, starting from $u$, the frequency of each of the substrings $z_2, z_3, \ldots, z_\psi$ is restored in $v$. But the frequencies of $z_1$ and $z_{\psi+1} = z'_\psi$ have not been restored yet. If $z_1 \ne z_{\psi+1}$ then the frequency of $z_1$ remains one less in $v$ than in $u$. Similarly, the frequency of $z_{\psi+1}$ becomes one more in $v$ than in $u$. This cannot happen, as it is assumed that $u \stackrel{k-1}{\equiv} v$. Hence, $z_1 = z'_\psi$, which means that if $u \stackrel{k}{\equiv} v$ and $u \stackrel{k-1}{\equiv} v$ then $\mathrm{pref}_{k-1}(u) = \mathrm{pref}_{k-1}(v)$. □
The proof of Lemma 5(b) gives us an idea about the relation between $z_1$ and $z'_\psi$. The contrapositive of the second part of the above lemma is as follows.
Corollary 6. Let $u$ and $v$ be two distinct strings such that $\mathrm{pref}_k(u) \ne \mathrm{pref}_k(v)$ or $\mathrm{suff}_k(u) \ne \mathrm{suff}_k(v)$, and let $k > 1$. Then there exists a string $x$ of length $\le k + 1$ such that $|u|_x \ne |v|_x$.
This answers the question “Can we get two strings $w = r\theta s$ and $v = s\theta' r$ such that $r \ne s$, $|r| = |s| = k - 1$ and $v \stackrel{i}{\equiv} w$ for all $1 \le i \le k$, where $k \ge 3$?” in the negative.
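The conclusion that equal factor multisets of lengths $k$ and $k-1$ force a common prefix and suffix of length $k-1$ can be tested exhaustively on short words. The sketch below is our own brute-force check, assuming a binary alphabet and words of length at most 7; the function name `end_violations` is hypothetical. It returns the pairs that would contradict the claim.

```python
from collections import Counter
from itertools import combinations, product


def factor_multiset(u, k):
    """Multiset of the length-k factors of u."""
    return Counter(u[i:i + k] for i in range(len(u) - k + 1))


def end_violations(n, k, alphabet="ab"):
    """Pairs of distinct words of length n that agree on their factors of
    length k and k-1 but differ in the prefix or suffix of length k-1."""
    words = ["".join(w) for w in product(alphabet, repeat=n)]
    bad = []
    for u, v in combinations(words, 2):
        if (factor_multiset(u, k) == factor_multiset(v, k)
                and factor_multiset(u, k - 1) == factor_multiset(v, k - 1)):
            if u[:k - 1] != v[:k - 1] or u[n - (k - 1):] != v[n - (k - 1):]:
                bad.append((u, v))
    return bad


for n in range(3, 8):
    for k in range(3, n + 1):
        print(n, k, end_violations(n, k))  # an empty list is expected each time
```

An empty result for every $(n, k)$ is consistent with the negative answer to the first question, but a finite search of this kind is of course not a proof.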
Theorem 7. Let $u$ and $v$ be two words such that $u \stackrel{i}{\equiv} v$ for $i = k, k - 1$. Then $u \stackrel{i}{\equiv} v$ for all $1 \le i \le k$.
Proof. Here the assumptions are the same as in Lemma 5. Thus we have $\mathrm{pref}_{k-1}(u) = \mathrm{pref}_{k-1}(v) = z_1$. Let $s$ be any string of length less than $k$. The other three $k$-transformations that are applied in between any two $k\text{-rt}_1$ rotations do not change the multiset of substrings of any length $\le k$. We can apply Lemma 4 here and hence we get $|u|_s - |v|_s = |z_1|_s - |z_1|_s = 0$. This implies that there is no string of length less than $k - 1$ whose frequency can be different in $u$ and $v$. Also, the assumption was $u \stackrel{i}{\equiv} v$ for $i = k, k - 1$. Combining these, we have $u \stackrel{i}{\equiv} v$ for all $1 \le i \le k$. □

This answers the question “Do there exist two words $u$ and $v$ such that $u \stackrel{i}{\equiv} v$ for $i = k, k - 1$ and $u \stackrel{k-2}{\not\equiv} v$?” in the negative.
Acknowledgements
The authors would like to thank the anonymous reviewers for their extensive and detailed reports, which helped in significantly improving the paper.
References
[1] C. Piña, C. Uzcátegui, Reconstruction of a word from a multiset of its factors, Theoretical Computer Science 400 (1–3) (2008) 70–83.
[2] E. Ukkonen, Approximate string matching with q-grams and maximal matches, Theoretical Computer Science 92 (1992) 191–211.
[3] P.A. Pevzner, DNA physical mapping and alternating Eulerian cycles in colored graphs, Algorithmica 13 (1995) 77–105.