


Synchronization Strings: Highly Efficient Deterministic Constructions over

Small Alphabets∗

Kuan Cheng† Bernhard Haeupler ‡ Xin Li§ Amirbehshad Shahrasbi ¶ Ke Wu‖

Abstract

Synchronization strings were recently introduced by Haeupler and Shahrasbi [1] in the study of codes for correcting insertion and deletion errors (insdel codes). A synchronization string is an encoding of the indices of the symbols in a string; together with an appropriate decoding algorithm, it can transform insertion and deletion errors into standard symbol erasures and corruptions. This reduces the problem of constructing insdel codes to the problem of constructing standard error correcting codes, which is much better understood. Besides this, synchronization strings are also useful in other applications such as synchronization sequences and interactive coding schemes. For all such applications, synchronization strings are desired to be over alphabets that are as small as possible, since a larger alphabet size corresponds to more redundant information added.

Haeupler and Shahrasbi [1] showed that for any parameter ε > 0, synchronization strings of arbitrary length exist over an alphabet whose size depends only on ε. Specifically, [1] obtained an alphabet size of O(ε^{-4}), which left open the question of where the minimal size of such alphabets lies between Ω(ε^{-1}) and O(ε^{-4}). In this work, we partially bridge this gap by providing an improved lower bound of Ω(ε^{-3/2}) and an improved upper bound of O(ε^{-2}). We also provide fast explicit constructions of synchronization strings over small alphabets.

Further, along the lines of previous work on similar combinatorial objects, we study the extremal question of the smallest possible alphabet size over which synchronization strings can exist for some constant ε < 1. We show that one can construct ε-synchronization strings over alphabets of size four, while no such string exists over binary alphabets. This reduces the extremal question to whether synchronization strings exist over ternary alphabets.

∗Supported in part by NSF grants CCF-1527110, CCF-1618280, CCF-1617713, CCF-1814603 and NSF CAREER award CCF-1750808.
†[email protected]. Department of Computer Science, Johns Hopkins University.
‡[email protected]. Department of Computer Science, Carnegie Mellon University.
§[email protected]. Department of Computer Science, Johns Hopkins University.
¶[email protected]. Department of Computer Science, Carnegie Mellon University.
‖[email protected]. Department of Computer Science, Johns Hopkins University.

1 Introduction

This paper focuses on the study of a combinatorial object called a synchronization string. Intuitively, a synchronization string is a (finite or infinite) string that avoids similarities between pairs of intervals in the string. Synchronization strings and their nice properties can be motivated from at least two different aspects: coding theory and pattern avoidance. We discuss the motivations and previous work in each aspect below.

1.1 Motivation and Previous Work in Coding Theory

The general and most important goal of coding theory is to obtain reliable transmission of information in the presence of noise or adversarial errors. Starting from the pioneering works of Shannon, Hamming, and many others, coding theory has evolved into an extensively studied field, with applications found in various areas of computer science. Regarding the general goal of correcting errors, we now have a very sophisticated and almost complete understanding of how to deal with Hamming errors, i.e., symbol erasures and substitutions. On the other hand, the knowledge of codes for edit errors (i.e., symbol insertions and deletions) has lagged far behind, despite these errors also being studied intensively since the 1960s. In practice, this is one of the main reasons why communication systems require substantial effort and resources to maintain strict synchronization.

One major difficulty in designing codes for insertion and deletion errors is that in the received codeword, the positions of the symbols may have changed. This is in contrast to standard symbol erasures and substitutions, where the positions of the symbols always stay the same. Thus, many of the known techniques in designing codes for standard Hamming errors cannot be directly utilized to protect against insertion and deletion errors.

Copyright © 2019 by SIAM. Unauthorized reproduction of this article is prohibited.

Downloaded 03/24/19 to 71.61.179.136. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php


In this context, a recent work of Haeupler and Shahrasbi [1] introduced synchronization strings, which enable a black-box transformation of Hamming-type error correcting codes to codes that protect against edit errors (insdel codes for short). Informally, a synchronization string of length n is an encoding of the indices of the n positions into a string over some alphabet Σ, such that, despite some insertion and deletion errors, one can still recover the correct indices of many symbols. With these correct indices, a standard error correcting code can then be used to recover the original message. This gives a code for insertion and deletion errors, which is the combination of a standard error correcting code and a synchronization string.

The simplest example of a synchronization string is just to record the index of each symbol, i.e., the string 1, 2, · · · , n. It can be easily checked that even if a (1 − ε) fraction of these indices are deleted, one can still correctly recover the positions of the remaining εn symbols. However, this synchronization string uses an alphabet whose size grows with the length of the string. The main contribution of [1] is to show that under a slight relaxation, there exist synchronization strings of arbitrary length n over an alphabet of fixed size. Further, [1] provided efficient and streaming algorithms to correctly recover the indices of many symbols from a synchronization string after it has been altered by insertion and deletion errors. Formally, [1] defines ε-synchronization strings as follows. A string S is an ε-synchronization string if the edit distance (the number of insertions and deletions needed to transform one string into the other) of any two consecutive substrings S[i, j) and S[j, k) is at least (1 − ε)(k − i).

Definition 1.1. (ε-synchronization string) A string S is an ε-synchronization string if ∀ 1 ≤ i < j < k ≤ |S| + 1, ED(S[i, j), S[j, k)) > (1 − ε)(k − i). A string S is an infinite ε-synchronization string if it has infinite length and ∀ 1 ≤ i < j < k, ED(S[i, j), S[j, k)) > (1 − ε)(k − i). Here ED(·, ·) stands for the edit distance.
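Definition 1.1 is concrete enough to check by brute force on small strings. The sketch below (illustrative helper names, not code from the paper) uses the fact that the insertion/deletion edit distance satisfies ED(a, b) = |a| + |b| − 2·LCS(a, b):

```python
def edit_distance(a, b):
    # insertion/deletion edit distance via longest common subsequence:
    # ED(a, b) = len(a) + len(b) - 2 * LCS(a, b)
    m, n = len(a), len(b)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            lcs[i + 1][j + 1] = (lcs[i][j] + 1 if a[i] == b[j]
                                 else max(lcs[i][j + 1], lcs[i + 1][j]))
    return m + n - 2 * lcs[m][n]

def is_synchronization_string(S, eps):
    # Definition 1.1: ED(S[i:j], S[j:k]) > (1 - eps) * (k - i)
    # for every triple i < j < k (0-indexed slices here)
    n = len(S)
    return all(edit_distance(S[i:j], S[j:k]) > (1 - eps) * (k - i)
               for i in range(n) for j in range(i + 1, n)
               for k in range(j + 1, n + 1))
```

For instance, the binary string 01 passes for ε = 0.5, while 0101 fails, since its two identical halves have edit distance zero.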

Using the construction and decoding algorithms for ε-synchronization strings, [1] gives a code that, for any δ ∈ (0, 1) and ε > 0, can correct a δ fraction of insertion and deletion errors with rate 1 − δ − ε. Besides this, synchronization strings have found a variety of applications, such as in synchronization sequences [2], interactive coding schemes [3, 4, 5, 6, 7, 8, 9, 10], coding against synchronization errors [10, 11, 12], and edit distance tree codes [9, 13].

For all such applications, synchronization strings are desired to be over alphabets that are as small as possible, since a larger alphabet size corresponds to more redundant information added. Thus a natural question here is how small the alphabet size can be. In [1], Haeupler and Shahrasbi showed that ε-synchronization strings of arbitrary length exist over an alphabet of size O(ε^{-4}); they also gave a randomized polynomial time algorithm to construct such strings. In a very recent work [10], they further gave various efficient deterministic constructions for finite/infinite ε-synchronization strings, which have alphabets of size poly(ε^{-1}) for some unspecified large polynomial. On the other hand, the definition of synchronization strings implies that any ε^{-1} consecutive symbols in an ε-synchronization string have to be distinct, providing an Ω(ε^{-1}) lower bound for the alphabet size.

Haeupler and Shahrasbi [10] further introduced another notion called long-distance synchronization strings. Besides requiring that the edit distance of any adjacent intervals be large, long-distance synchronization strings also require that any large, nearby intervals have large edit distance.

Definition 1.2. (f(l)-distance ε-synchronization string) A string S ∈ Σ^n is an f(l)-distance ε-synchronization string if for every 1 ≤ i < j ≤ i′ < j′ ≤ n + 1 with i′ − j ≤ f(l), ED(S[i, j), S[i′, j′)) > (1 − ε)l, where l = (j − i) + (j′ − i′).

As a special case, 0-distance synchronization strings are standard synchronization strings. Similar to [10], we focus on f(l) = n · 1_{l > c log n}, where 1_{l > c log n} is the indicator function for the event l > c log n. This definition considers the edit distance of all pairs of large intervals and of adjacent small intervals.

Definition 1.3. (c-long-distance ε-synchronization string) We call n · 1_{l > c log n}-distance ε-synchronization strings c-long-distance ε-synchronization strings.
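On small examples, the c-long-distance condition can likewise be checked by brute force. The sketch below is illustrative and assumes log base 2 (the base of log n is not specified here); it checks adjacent interval pairs of any size, plus arbitrary-gap pairs whose total length l exceeds c log n:

```python
import math

def edit_distance(a, b):
    # insertion/deletion edit distance via LCS
    m, n = len(a), len(b)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            lcs[i + 1][j + 1] = (lcs[i][j] + 1 if a[i] == b[j]
                                 else max(lcs[i][j + 1], lcs[i + 1][j]))
    return m + n - 2 * lcs[m][n]

def is_c_long_distance_sync(S, eps, c):
    # f(l) = n if l > c * log2(n), else 0: adjacent pairs (gap 0) are
    # always checked; larger gaps only when the total length l is big
    n = len(S)
    for i in range(n):
        for j in range(i + 1, n + 1):
            for i2 in range(j, n):
                for j2 in range(i2 + 1, n + 1):
                    l = (j - i) + (j2 - i2)
                    f = n if l > c * math.log2(n) else 0
                    if i2 - j <= f and \
                       edit_distance(S[i:j], S[i2:j2]) <= (1 - eps) * l:
                        return False
    return True
```

For example, a string of four distinct symbols passes for ε = 0.5, while "abab" fails already on the adjacent pair "ab", "ab".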

Haeupler and Shahrasbi [10] showed that long-distance synchronization strings are locally decodable. This crucial property is used in [10] to construct codes for edit errors with block transpositions and block repetitions. Long-distance synchronization strings can also be applied to achieve infinite channel simulations with exponentially smaller memory and decoding time requirements [10]. We note that the constructions in [10] have alphabets of size poly(ε^{-1}) for some unspecified large polynomial, which motivates us to reduce the alphabet size.

1.2 Motivation and Previous Work in Pattern Avoidance

Apart from applications in coding theory and other communication problems involving insertions


and deletions, synchronization strings are also interesting combinatorial objects from a mathematical perspective. As a matter of fact, plenty of very similar combinatorial objects have been studied prior to this work.

A classical work of Thue [14] introduces and studies square-free strings, i.e., strings that do not contain two identical consecutive substrings. Thue shows that such strings exist over alphabets of size three and provides a fast construction of such strings using morphisms. The seminal work of Thue inspired further works on the same problem [15, 16, 17, 18, 19, 20] and on problems with a similar pattern-avoidance theme.
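Thue's own construction uses morphisms; as a quick empirical companion, the sketch below (illustrative, not from [14]) finds square-free strings over three symbols by backtracking, rejecting any extension that completes a square at the end of the string:

```python
def has_square_suffix(s):
    # a square ending at the last position: s[-2L:-L] == s[-L:]
    return any(s[-2 * L:-L] == s[-L:] for L in range(1, len(s) // 2 + 1))

def square_free_string(n, alphabet="012", s=""):
    # depth-first search extending s one symbol at a time;
    # returns a square-free string of length n, or None if none exists
    if len(s) == n:
        return s
    for c in alphabet:
        if not has_square_suffix(s + c):
            r = square_free_string(n, alphabet, s + c)
            if r is not None:
                return r
    return None
```

The same search returns None for binary square-free strings of length four, matching the binary impossibility discussed later for synchronization strings.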

Krieger et al. [21] study strings that satisfy relaxed variants of square-freeness, i.e., strings that avoid approximate squares. Their study provides several results on strings that avoid consecutive substrings of equal length with small additive or multiplicative Hamming distance in terms of their length. In each of these regimes, [21] gives constructions of approximate-square-free strings over alphabets of small constant size for different parameters.

Also, Camungol and Rampersad [22] study approximate squares with respect to edit distance, which is equivalent to the ε-synchronization string notion except that the edit distance property is only required to hold for pairs of consecutive substrings of equal length. [22] employs a technique based on entropy compression to prove that such strings exist over alphabets that are constant in terms of string length but exponentially large in terms of ε^{-1}. We note that the previous result of Haeupler and Shahrasbi [1] already improves this dependence to O(ε^{-4}).

Finally, Beck [31] uses the Lovász local lemma to show the existence of words in which subsequent occurrences of identical substrings of length n occur exponentially (in n) far away from each other, which is very similar to long-distance synchronization strings. The difference is that Beck mainly cares about identical substrings being widely separated.

Again, a main question addressed in most of the above-mentioned previous works on similar mathematical objects is how small the alphabet size can be.

1.3 Our Results

In this paper we study the question of how small the alphabet size of an ε-synchronization string can be. We address this question both for a specified ε and for an unspecified ε. In the first case, we try to bridge the gap between the upper bound of O(ε^{-4}) provided in [1] and the lower bound of Ω(ε^{-1}). In the second case, we study how small the alphabet size can be to ensure the existence of an ε-synchronization string for some constant ε < 1. In both

cases we also give efficient constructions that improve previous results.

1.3.1 New Bounds on Minimal Alphabet Size for a Given ε

Our first theorem gives improved upper and lower bounds on the alphabet size of an ε-synchronization string for a given ε.

Theorem 1.1. For any 0 < ε < 1, there exists an alphabet Σ of size O(ε^{-2}) such that an infinite ε-synchronization string exists over Σ. In addition, ∀n ∈ N, a randomized algorithm can construct an ε-synchronization string of length n in expected time O(n^5 log n). Further, the alphabet size of any ε-synchronization string has to be at least Ω(ε^{-3/2}).

Remark 1.1. Here ε can be any number in (0, 1), not just a constant. The lower bound is conditioned on the synchronization string being long enough in terms of ε, as short synchronization strings can exist over small alphabets. For example, the binary string 01 is a synchronization string, but we are not very interested in such cases.

Next, we provide efficient and even linear-time constructions of ε-synchronization strings over drastically smaller alphabets than the efficient constructions in [10]. The construction is highly-explicit in the sense that the i'th position of a synchronization string can be computed in time O(log i). Moreover, our construction also works for the notion of long-distance synchronization strings.

Theorem 1.2. For every n ∈ N and any constant ε ∈ (0, 1), there is a deterministic construction of a (long-distance) ε-synchronization string of length n over an alphabet of size O(ε^{-2}) that runs in poly(n) time. Further, there is a highly-explicit linear time construction of such strings over an alphabet of size O(ε^{-3}).

In addition, in Section 5.3, we present a method to construct infinite synchronization strings using constructions of finite ones, which only increases the alphabet size by a constant factor, as opposed to the construction in [10], which increases the alphabet size quadratically.

Theorem 1.3. For any constant 0 < ε < 1, there exists an explicit construction of an infinite ε-synchronization string S over an alphabet of size O(ε^{-2}). Further, there exists a highly-explicit construction of an infinite ε-synchronization string S over an alphabet of size O(ε^{-3}) such that for any i ∈ N, the first i symbols can be computed in O(i) time and S[i, i + log i] can be computed in O(log i) time.


1.3.2 Minimal Alphabet Size for Unspecified ε: Three or Four?

One interesting question that has been commonly addressed by previous work on similar combinatorial objects is the size of the smallest alphabet over which one can find such objects. Along the lines of [14, 20, 21, 22], we study the existence of synchronization strings over alphabets of minimal constant size.

It is easy to observe that no such string can exist over a binary alphabet, since any binary string of length four either contains two consecutive identical symbols or two consecutive identical substrings of length two. On the other hand, one can extract constructions over constant-sized alphabets from the existence proofs in [1, 10], but the unspecified constants there would be quite large. In Section 7.2, for some ε < 1, we provide a construction of arbitrarily long ε-synchronization strings over an alphabet of size four. This narrows down the question to whether such strings exist over alphabets of size three.
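This observation is small enough to confirm exhaustively. Since a square S[i, j) = S[j, k) has edit distance zero, any ε-synchronization string with ε < 1 must avoid such squares; a quick illustrative check over all 16 binary strings of length four:

```python
from itertools import product

def has_square(s):
    # two identical consecutive blocks of length 1 or 2
    return any(s[i:i + L] == s[i + L:i + 2 * L]
               for L in (1, 2) for i in range(len(s) - 2 * L + 1))

# every binary string of length 4 contains such a square, so no binary
# epsilon-synchronization string of length >= 4 exists for any eps < 1
assert all(has_square("".join(p)) for p in product("01", repeat=4))
```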

To construct such strings, we introduce the notion of a weak synchronization string, which requires substrings to satisfy a similar property as that of an ε-synchronization string, except that the lower bound on edit distance is rounded down to the nearest integer. We show that weak synchronization strings exist over binary alphabets and use one such string to modify a ternary square-free string ([14]) into a synchronization string over an alphabet of size four.

1.3.3 Constructing Synchronization Strings Using Uniform Morphisms

Morphisms have been widely used in previous work as a tool to construct similar combinatorial objects. A uniform morphism of rank r over an alphabet Σ is a function φ : Σ → Σ^r that maps any symbol of Σ to a string of length r over the same alphabet. Using this technique, some similar combinatorial objects in previous work have been constructed by taking a symbol from the alphabet and then repeatedly using an appropriate morphism to replace each symbol with a string [20, 21]. Here we investigate whether such tools can also be utilized to construct synchronization strings. In Section 7.1, we show that no such morphism can construct arbitrarily long ε-synchronization strings for any ε < 1.
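The morphism-iteration mechanism is easy to state in code. The sketch below iterates the classical rank-2 morphism 0 → 01, 1 → 10 (the Thue–Morse morphism), used here only to illustrate the mechanism; Section 7.1 shows that no uniform morphism yields arbitrarily long ε-synchronization strings for ε < 1:

```python
def iterate_morphism(phi, seed, rounds):
    # repeatedly replace every symbol c by the block phi[c]
    s = seed
    for _ in range(rounds):
        s = "".join(phi[c] for c in s)
    return s

thue_morse = {"0": "01", "1": "10"}
print(iterate_morphism(thue_morse, "0", 4))  # -> 0110100110010110
```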

1.4 Applications

Our new constructions of synchronization strings can be used in all known applications of these objects, including converting standard error correcting codes into error correcting codes for edit errors that approach the Singleton bound, giving locally decodable codes for edit errors, constructing codes for block transpositions and replications, obtaining near-linear time interactive coding schemes for edit errors, achieving exponentially more efficient infinite channel simulations, and so on.

For all these applications, our constructions can reduce the alphabet size of the code by at least an ε^{-2} factor, where ε is the error parameter of the synchronization string. This also further improves the alphabet size of an approximate-square-free string with respect to edit distance in [22] to O(ε^{-2}).

In addition, our construction relies on a novel implementation of a deterministic Lovász Local Lemma. This may have further applications in other related problems.

1.5 Overview of Our Techniques

Our constructions differ from previous constructions in several key aspects. Here we first highlight the new ideas we use to achieve the improvements.

1. Non-uniform Lovász Local Lemma. To show the existence of ε-synchronization strings over an alphabet of size O(ε^{-2}), as in previous works we use the Lovász Local Lemma. However, instead of using a uniform sample space, we use a non-uniform sample space. We carefully set up this sample space so that the dependency graph is still sparse, and thus the Local Lemma can be used to establish the existence of synchronization strings.

2. Adapted algorithmic Lovász Local Lemma. To get an explicit construction of our synchronization strings, we need to use an algorithmic version of the Lovász Local Lemma. However, since our sample space is non-uniform, we also need to adapt the algorithm accordingly. Specifically, we need to adjust the sampling process that generates the non-uniform sample space very carefully.

3. Adapted deterministic Lovász Local Lemma. To get a deterministic construction of our synchronization strings, we further need a deterministic algorithmic Lovász Local Lemma. A direct application of previously known results, such as that of Chandrasekaran et al. [23], is problematic, since it would make the dependency graph too dense. Thus, we need to carefully redefine the bad events in the Local Lemma to make this work.

4. Synchronization circles. To protect small intervals in a synchronization string, we introduce a new combinatorial object called a synchronization circle. A synchronization circle is a generalization of a synchronization string, which can be viewed as a circle of symbols. It has the property that no matter


where one cuts the circle, it becomes a synchronization string. We show that a synchronization circle can be constructed by concatenating two synchronization strings over two disjoint alphabets, which only increases the alphabet size by a factor of two. This idea is also used in our construction of infinite synchronization strings to reduce the alphabet size.

5. Reverse connection between insdel codes and synchronization strings. We give a new way to construct synchronization strings, by concatenating different codewords from an appropriately designed insdel code. This reduces the task of constructing synchronization strings to a much smaller size, which is important for achieving our highly explicit constructions. This also complements the connection established in [1], which shows that insdel codes can be constructed from synchronization strings.

We now explain our techniques in more detail.

1.5.1 Existence of ε-synchronization strings over small alphabets

Haeupler and Shahrasbi [1] showed the existence of ε-synchronization strings over an alphabet of size O(ε^{-4}). Using the Local Lemma, they first show that with positive probability, a uniformly random string S over an alphabet of size O(ε^{-2}) satisfies the requirement of a synchronization string when the intervals have size at least t, for some t = Ω(ε^{-2}). To take care of the small intervals, they do a symbol-wise concatenation of S and another string R whose i'th position is R[i] = (i mod t). This results in an alphabet size of O(ε^{-4}).
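The symbol-wise concatenation is simply pairing: position i carries (S[i], R[i]) over the product alphabet Σ × {0, . . . , t − 1} of size |Σ| · t; with |Σ| = O(ε^{-2}) and t = Ω(ε^{-2}), this is where the O(ε^{-4}) bound comes from. A minimal sketch:

```python
def symbolwise_concat_with_index(S, t):
    # pair the i'th symbol of S with R[i] = i mod t,
    # yielding a string over the product alphabet of size |Sigma| * t
    return [(s, i % t) for i, s in enumerate(S)]

print(symbolwise_concat_with_index("abcab", 3))
# -> [('a', 0), ('b', 1), ('c', 2), ('a', 0), ('b', 1)]
```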

To get rid of the concatenation, we directly use a non-uniform sample space in the Local Lemma. In our construction, we randomly sample each symbol (say the i'th symbol) in the string S from an alphabet Σ of size O(ε^{-2}), conditioned on this symbol being different from the previous t − 1 or i − 1 symbols, whichever is smaller. We then define the bad events in the following way: an interval S[i, k] is bad if there is a j, i < j < k, s.t. ED(S[i, j), S[j, k)) ≤ (1 − ε)(k − i). At first glance, as each symbol now depends on the previous t − 1 symbols, we may not directly apply the Lovász Local Lemma. However, our key observation is that, although the symbols are not independent of each other, the badness of an interval I is mutually independent of the badness of all intervals that do not intersect I. We show this by a direct computation of the probabilities. Now we can apply the Lovász Local Lemma to show that S is an ε-synchronization string with positive probability.

1.5.2 Constructions of ε-synchronization strings over small alphabets

Randomized construction. In order to apply the algorithmic framework of the Lovász Local Lemma in [24] and [25], we need the probability space in the Local Lemma to be sampled from n independent random variables P = {P1, . . . , Pn}, and each event in the collection of bad events A = {A1, . . . , Am} to be determined by some subset of P. However, our non-uniform sample space discussed above does not satisfy these conditions.

To rectify this, we implement our previous sampling process as follows (which gives an alternative view of the previous sampling procedure): we initially fix an arbitrary order for the symbols in the alphabet Σ. When we sample the i'th symbol of the string, we reorder the alphabet such that the first t − 1 symbols in the alphabet are exactly the previous t − 1 symbols in the string, and the rest of the symbols are placed in their current order as the last |Σ| − t + 1 symbols of the alphabet. Then we uniformly randomly choose an integer Pi in {1, 2, . . . , |Σ| − t + 1} and let the i'th symbol be the (t − 1 + Pi)'th symbol in the new order of Σ. In this way, the random variables P1, . . . , Pn are indeed independent, and the i'th symbol is different from the previous t − 1 symbols. Furthermore, the event of any interval S[i, k] being bad depends only on Pi, . . . , Pk. This enables us to follow the framework in [24] and [25] and give a randomized construction of synchronization strings over small alphabets.
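This sampling procedure can be sketched as follows (illustrative names; drawing uniformly from the non-forbidden symbols is equivalent to choosing the (t − 1 + Pi)'th symbol in the reordered alphabet, and for i < t the symbol is conditioned on the min(t − 1, i) previous symbols, matching the "whichever is smaller" clause above):

```python
import random

def sample_candidate(n, t, sigma, rng=random):
    # each new symbol is drawn uniformly from the alphabet symbols
    # NOT among the previous min(t - 1, i) symbols of the string
    S = []
    for i in range(n):
        recent = set(S[-(t - 1):]) if t > 1 else set()
        choices = [c for c in range(sigma) if c not in recent]
        S.append(rng.choice(choices))   # plays the role of P_i
    return S

S = sample_candidate(200, 5, 12)
# by construction, every window of t consecutive symbols is distinct
assert all(len(set(S[i:i + 5])) == 5 for i in range(196))
```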

Deterministic polynomial time construction.

We combine the sampling algorithm described above with the deterministic Lovász Local Lemma of Chandrasekaran et al. [23] to give a polynomial-time construction of c-long-distance ε-synchronization strings. In addition to the requirement on the edit distance of adjacent intervals, long-distance synchronization strings also require that any pair of large disjoint intervals S[i1, i1 + l1), S[i2, i2 + l2), where i1 + l1 ≤ i2 and l1 + l2 ≥ c log n, has large enough edit distance. However, it is problematic to define the bad event corresponding to intervals S[i1, i1 + l1) and S[i2, i2 + l2) in the usual way: ED(S[i1, i1 + l1), S[i2, i2 + l2)) ≤ (1 − ε)(l1 + l2). This is because under this definition, the event not only depends on the random variables Pj within these intervals, but also on the random variables between the two intervals. This would make the dependency graph too dense to apply the Local Lemma.

To fix this, we redefine the badness of the pair of intervals S[i1, i1 + l1) and S[i2, i2 + l2) as follows: the pair is bad if there exists a choice of the random variables in the middle such that, according to the sampling algorithm, ED(S[i1, i1 + l1), S[i2, i2 + l2)) ≤ (1 − ε)(l1 + l2).


This definition removes the dependence on the random variables between the two intervals, and we show that under this definition, the badness of a pair of intervals is still mutually independent of the badness of all disjoint pairs of intervals. Fortunately, the probability of the bad events under this new definition can also be bounded appropriately. Thus we can apply the deterministic Lovász Local Lemma of Chandrasekaran et al. [23].

Deterministic highly explicit construction.

A long-distance synchronization string of length n considers the edit distance of all pairs of large intervals and adjacent small intervals. Our construction of long-distance synchronization strings uses concatenations of different codewords of a carefully designed insdel code C, where each codeword has length m = O(log n), and we show that this construction can handle both large intervals and small intervals.

For large intervals, the intuition is the following. If each codeword has length m = c log n, then after concatenating different codewords, any pair of intervals whose total length is larger than m must have large edit distance, since the intervals must contain parts from at least two different codewords.

We are now left with small adjacent intervals. This is where we need to carefully design the code C. [1] showed that an insdel code can be constructed by a symbol-wise concatenation of a standard error correcting code with a synchronization string. If we construct C in this way, each codeword of C is a synchronization string itself, which protects adjacent small intervals if they lie within one codeword. However, we have no guarantee for small intervals crossing two codewords. To address this problem, we introduce a new combinatorial object called a synchronization circle, which is a generalization of a synchronization string. A synchronization circle S of length n requires that S is a synchronization string beginning from any position i. That is, for any 1 ≤ i ≤ n, S[i], . . . , S[n], S[1], . . . , S[i−1] is a synchronization string. We give explicit constructions of synchronization circles by concatenating two synchronization strings over disjoint alphabets. Now, by a symbol-wise concatenation of a standard error correcting code with a synchronization circle, every codeword of C is itself a synchronization circle. This guarantees that any two adjacent small intervals in our synchronization string have large edit distance.

Note that our construction is highly explicit because we can construct each block (each codeword of C in the construction) independently.

Explicit construction of infinite ε-synchronization strings. To construct an infinite ε-synchronization string, we first construct a sequence of synchronization strings whose lengths increase exponentially. That is, for some k = O(1/ε), we construct synchronization strings S1, S2, . . . such that Si has length k^i, and concatenate them sequentially. For the analysis, consider any pair of intervals whose total length is between k^t and k^{t+1} for some t; then we can ignore the part of these two intervals before St, because the number of symbols before St is quite small compared to the rest of the string. The remaining part either lies within a single synchronization string Si, or crosses two synchronization strings S_{i−1} and Si. The first case is already taken care of by the property of a synchronization string. To take care of the second case, we use two disjoint alphabets Σ1, Σ2 to construct these synchronization strings. That is, we construct Si for odd i over Σ1 and Si for even i over Σ2. If we use our previous highly-explicit constructions of synchronization strings to construct these Si, then this construction is also highly-explicit.
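The block-indexing step above is easy to make concrete. The Python sketch below locates the block containing a given position of the infinite string S1 S2 S3 . . .; the constructor `build(length, parity)` is a hypothetical stand-in for any finite synchronization-string construction over one of two disjoint alphabets (parity 0 or 1).

```python
def infinite_sync_symbol(pos, k, build):
    # Returns the symbol at 0-indexed position `pos` of the infinite string
    # S1 S2 S3 ... where |S_i| = k**i and S_i alternates between two
    # disjoint alphabets.  `build(length, parity)` is an assumed constructor
    # returning a finite synchronization string over alphabet `parity`.
    i, start = 1, 0
    # Walk past whole blocks S_1, ..., S_{i-1} until pos falls inside S_i.
    while pos >= start + k ** i:
        start += k ** i
        i += 1
    return build(k ** i, i % 2)[pos - start]
```

Only the block containing the queried position is ever built, which is what makes the concatenated construction highly explicit.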

1.5.3 Existence of ε-synchronization strings over an alphabet of size 4

To construct a synchronization string over an alphabet of size four, we introduce the notion of a weak synchronization string, which requires substrings to satisfy a property similar to that of an ε-synchronization string, except that the lower bound on edit distance is rounded down to the nearest integer. We show that weak synchronization strings exist over binary alphabets, and we use one such string to modify a ternary square-free string ([14]) into a synchronization string over an alphabet of size four.

2 Discussions and Open Problems

In this paper we give highly efficient constructions of various ε-synchronization strings over small alphabets. In particular, we reduce the alphabet size from O(ε^{−4}) or poly(1/ε) in previous works to O(ε^{−2}). We also obtain an improved lower bound of Ω(ε^{−3/2}). This leaves the open problem of determining the tight bound. Similarly, for synchronization strings with unspecified ε, we show that such strings exist over an alphabet of size 4 but cannot exist over an alphabet of size 2, which leaves the open problem of whether synchronization strings can exist over an alphabet of size 3. Experimental results seem to suggest that the answer is yes.

It would be interesting to find other applications of synchronization strings, and we believe our new way of using the deterministic Lovasz Local Lemma may also find other applications.

3 Notations and Definitions

We usually use Σ (possibly with subscripts) to denote the alphabet and Σ* to denote the set of all strings over the alphabet Σ.


Definition 3.1. (Subsequence) A subsequence of a string S is any sequence of symbols obtained from S by deleting some symbols; the remaining symbols need not be contiguous in S.

Definition 3.2. (Edit distance) For every n ∈ N, the edit distance ED(S, S′) between two strings S, S′ ∈ Σ^n is the minimum number of insertions and deletions required to transform S into S′.

Definition 3.3. (Longest Common Subsequence) For any strings S, S′ over Σ, a longest common subsequence of S and S′ is a longest pair of subsequences of S and S′ that are equal as strings. We denote by LCS(S, S′) the length of the longest common subsequence of S and S′.

Note that ED(S, S′) = |S| + |S′| − 2·LCS(S, S′), where |S| denotes the length of S.
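As a quick sanity check of this identity, here is a minimal Python sketch: LCS by the standard dynamic program, and the insertion/deletion edit distance derived from it.

```python
def lcs_len(s, t):
    # Classic O(|s|*|t|) dynamic program for the LCS length.
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(s)][len(t)]

def edit_distance(s, t):
    # Insertion/deletion edit distance via the identity in the text:
    # ED(S, S') = |S| + |S'| - 2 * LCS(S, S').
    return len(s) + len(t) - 2 * lcs_len(s, t)
```

For example, edit_distance("abc", "abd") is 2: delete "c" and insert "d".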

Definition 3.4. (Square-free string) A string S is a square-free string if for all 1 ≤ i < i+2l ≤ |S|+1, the substrings S[i, i+l) and S[i+l, i+2l) are different as words.
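A brute-force check of this definition (shifted to 0-indexing) can be sketched as follows; it simply compares every pair of adjacent equal-length blocks.

```python
def is_square_free(s):
    # A string is square-free if no two adjacent blocks S[i, i+l) and
    # S[i+l, i+2l) are equal (the text is 1-indexed; this is 0-indexed).
    n = len(s)
    for i in range(n):
        for l in range(1, (n - i) // 2 + 1):
            if s[i:i + l] == s[i + l:i + 2 * l]:
                return False
    return True
```

The classical ternary square-free strings referenced from [14] would pass this check; any string containing a repeated adjacent block, such as "abab", fails it.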

We also introduce the following generalization of a synchronization string, which will be useful in our deterministic constructions of synchronization strings.

Definition 3.5. (ε-synchronization circle) A string S is an ε-synchronization circle if for every 1 ≤ i ≤ |S|, the rotation S_i, S_{i+1}, . . . , S_{|S|}, S_1, S_2, . . . , S_{i−1} is an ε-synchronization string.
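For intuition (and for testing tiny examples), both the ε-synchronization property and its circle variant can be checked by brute force. The sketch below translates the interval convention S[i, j], S[j+1, k] into 0-indexed half-open slices; it is a cubic-time verifier, not part of any construction in the paper.

```python
def lcs_len(s, t):
    # Standard LCS dynamic program.
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if s[i - 1] == t[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[-1][-1]

def is_sync_string(s, eps):
    # Checks LCS(S[i,j], S[j+1,k]) < (eps/2)(k-i) for all splits, which is
    # equivalent to ED(S[i,j], S[j+1,k]) > (1-eps)(k-i).
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                if lcs_len(s[i:j], s[j:k]) >= (eps / 2) * (k - i):
                    return False
    return True

def is_sync_circle(s, eps):
    # Every rotation of an eps-synchronization circle must itself be an
    # eps-synchronization string.
    return all(is_sync_string(s[r:] + s[:r], eps) for r in range(len(s)))
```

A string of pairwise-distinct symbols trivially passes both checks; any adjacent repeated symbol fails them.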

Note that in a communication that suffers from symbol corruptions and erasures, an erasure contributes half a unit of Hamming distance from the original string, while a corruption contributes a full unit. To easily quantify errors consisting of both types, we define half-errors as follows. The name half-error comes from the fact that a symbol erasure is half as bad as a symbol corruption.

Definition 3.6. (Half-Errors) Any set of a symbol substitutions and b symbol erasures is referred to as 2a + b half-errors: one symbol substitution counts as two half-errors, and one symbol erasure counts as one half-error.

4 ε-synchronization Strings and Circles with Alphabet Size O(ε^{−2})

In this section we show that by using a non-uniform sample space together with the Lovasz Local Lemma, we obtain a randomized polynomial-time construction of an ε-synchronization string with alphabet size O(ε^{−2}). We then use this to give a simple construction of an ε-synchronization circle with alphabet size O(ε^{−2}) as well. Although the constructions here are randomized, the parameter ε can be anything in (0, 1) (even sub-constant), while our deterministic constructions in later sections usually require ε to be a constant in (0, 1).

We first recall the general Lovasz Local Lemma [30].

Lemma 4.1. (General Lovasz Local Lemma) Let A1, . . . , An be a set of bad events. G(V, E) is a dependency graph for this set of events if V = {1, . . . , n} and each event Ai is mutually independent of all the events {Aj : (i, j) ∉ E}.

If there exist x1, . . . , xn ∈ [0, 1) such that for all i we have

Pr[Ai] ≤ xi ∏_{(i,j)∈E} (1 − xj),

then the probability that none of these events happens is bounded by

Pr[ ∧_{i=1}^{n} ¬Ai ] ≥ ∏_{i=1}^{n} (1 − xi) > 0.

Using this lemma, we have the following theorem showing the existence of ε-synchronization strings over an alphabet of size O(ε^{−2}).

Theorem 4.1. For every ε ∈ (0, 1) and every n ∈ N, there exists an ε-synchronization string S of length n over an alphabet Σ of size Θ(ε^{−2}).

Proof. Suppose |Σ| = c1 ε^{−2} where c1 is a constant. Let t = c2 ε^{−2} with 0 < c2 < c1. The sampling algorithm is as follows:

1. Randomly pick t distinct symbols from Σ and let them be the first t symbols of S. If t ≥ n, we just pick n distinct symbols.

2. For t+1 ≤ i ≤ n, pick the i-th symbol S[i] uniformly at random from Σ \ {S[i−1], . . . , S[i−t+1]}.
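The two steps above can be sketched directly in Python; only the interface (symbols as integers, a pluggable random source) is illustrative.

```python
import random

def sample_string(n, sigma_size, t, rng=random):
    # Step 1: the first min(t, n) symbols are distinct.
    sigma = list(range(sigma_size))
    s = rng.sample(sigma, min(t, n))
    # Step 2: S[i] is uniform over the alphabet minus the previous t-1 symbols.
    for i in range(len(s), n):
        forbidden = set(s[max(0, i - (t - 1)):i])
        s.append(rng.choice([a for a in sigma if a not in forbidden]))
    return s
```

By construction, any window of t consecutive symbols of the output contains no repeated symbol, which is exactly why intervals of length at most t are automatically good in the proof below.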

Now we prove that with positive probability, S contains no bad interval S[i, k], i.e., no interval violating the requirement that ED(S[i, j], S[j+1, k]) > (1−ε)(k−i) for all i < j < k. This requirement is equivalent to LCS(S[i, j], S[j+1, k]) < (ε/2)(k−i).

Notice that for k−i ≤ t, the symbols in S[i, k] are completely distinct. Hence we only need to consider the case where k−i > t. First, let us upper bound the probability that an interval is bad:

Pr[interval I of length l is bad] ≤ (l choose εl) · (|Σ|−t)^{−εl/2} ≤ ( e / (ε√(|Σ|−t)) )^{εl}.


The first inequality holds because if the interval is bad, then it has to contain a repeating sequence a1 a2 . . . ap a1 a2 . . . ap where p is at least εl/2. Such a sequence can be specified by choosing εl positions in the interval, and the probability that a given sequence is valid for the string in this construction is at most (|Σ|−t)^{−εl/2}. The second inequality comes from Stirling's inequality.

The inequality above shows that the probability that an interval of length l is bad can be upper bounded by C^{−εl}, where C is a constant that can be made arbitrarily large by adjusting c1 and c2.

Now we use the general Lovasz Local Lemma to show that S contains no bad interval with positive probability. First we show the following claim.

Claim 1. The badness of an interval I = S[i, j] is mutually independent of the badness of all intervals that do not intersect I.

Proof. Suppose the intervals before I that do not intersect I are I1, . . . , Im, and those after I are I′1, . . . , I′m′. We denote the indicator variables of these intervals being bad by b, bk, and b′k′.

First we prove that there exists p ∈ (0, 1) such that for all x1, x2, . . . , xm ∈ {0, 1},

Pr[b = 1 | bk = xk, k = 1, . . . , m] = p.

According to our construction, for any fixed prefix S[1, i−1], the probability that I is bad is a fixed real number p′. That is,

∀ valid S̄ ∈ Σ^{i−1}, Pr[b = 1 | S[1, i−1] = S̄] = p′.

This comes from the fact that the sampling of the symbols in S[i, k] only depends on the previous h = min{i−1, t−1} distinct symbols, and up to a relabeling these h symbols are the same h symbols (e.g., we can relabel them as {1, . . . , h} and the rest of the symbols as {h+1, . . . , |Σ|}). On the other hand, the probability that b = 1 remains unchanged under any relabeling of the symbols: if two sampled symbols are the same, they stay the same, and if they are different, they stay different. Thus we have

Pr[b = 1 | bk = xk, k = 1, . . . , m]
= ( Σ_{S̄} Pr[b = 1, S[1, i−1] = S̄] ) / ( Σ_{S̄} Pr[S[1, i−1] = S̄] )
= p′ · ( Σ_{S̄} Pr[S[1, i−1] = S̄] ) / ( Σ_{S̄′} Pr[S[1, i−1] = S̄′] )
= p′.

In these equations, S̄ ranges over all valid strings that the prefix S[1, i−1] can be such that bk = xk for k = 1, . . . , m. Hence b is independent of {bk : k = 1, . . . , m}. Similarly, we can prove that the joint distribution of {b′k′ : k′ = 1, . . . , m′} is independent of that of {b, bk : k = 1, . . . , m}. Hence b is independent of {bk, b′k′ : k = 1, . . . , m, k′ = 1, . . . , m′}, which means the badness of interval I is mutually independent of the badness of all intervals that do not intersect I.

Obviously, an interval of length l intersects at most l + l′ intervals of length l′. To use the Lovasz Local Lemma, we need to find a sequence of real numbers x_{i,k} ∈ [0, 1) for the intervals S[i, k] such that

Pr[S[i, k] is bad] ≤ x_{i,k} · ∏_{S[i,k] ∩ S[i′,k′] ≠ ∅} (1 − x_{i′,k′}).

The rest of the proof is the same as that of Theorem 5.7 in [1].

We propose x_{i,k} = D^{−ε(k−i)} for some constant D ≥ 1. Hence we only need to find a constant D such that for all S[i, k],

C^{−ε(k−i)} ≤ D^{−ε(k−i)} ∏_{l=t}^{n} [1 − D^{−εl}]^{l+(k−i)}.

That is, for all l′ ∈ {1, . . . , n},

C^{−l′} ≤ D^{−l′} ∏_{l=t}^{n} [1 − D^{−εl}]^{(l+l′)/ε}
⇒ C ≥ D / ∏_{l=t}^{n} [1 − D^{−εl}]^{(l/l′+1)/ε}.

Notice that the right-hand side is maximized when n = ∞ and l′ = 1. Hence it is sufficient to show that

C ≥ D / ∏_{l=t}^{∞} [1 − D^{−εl}]^{(l+1)/ε}.

Let L = max_{D>1} D / ∏_{l=t}^{∞} [1 − D^{−εl}]^{(l+1)/ε}. We only need to guarantee that C > L.

We claim that L = Θ(1). Since t = c2 ε^{−2} = ω(log(1/ε)/ε),

D / ∏_{l=t}^{∞} [1 − D^{−εl}]^{(l+1)/ε} ≤ D / ( 1 − (2/ε^3) · D^{−1/ε} / (1 − D^{−ε})^2 ).

We can see that for D = 7, max_ε (2/ε^3) · D^{−1/ε} / (1 − D^{−ε})^2 < 0.9. Therefore the right-hand side is bounded by a constant, which means L = Θ(1), and the proof is complete.


Using a modification of an argument in [1], we can also obtain a randomized construction.

Lemma 4.2. There exists a randomized algorithm which, for any ε ∈ (0, 1) and any n ∈ N, constructs an ε-synchronization string of length n over an alphabet of size O(ε^{−2}) in expected time O(n^5 log n).

Proof. The algorithm is similar to that of Lemma 5.8 in [1], using the algorithmic Lovasz Local Lemma [24] and the extension in [25].

Theorem 4.2. (Theorem 3.1 in [25]) Suppose there is a ξ ∈ [0, 1) and an assignment of reals x : A → (0, 1) such that

∀A ∈ A : Pr[A] ≤ (1 − ξ) x(A) ∏_{B∈Γ(A)} (1 − x(B)).

Here, T := Σ_{A∈A} x(A), and Γ(A) = Γ_A(A) is the set of all events B ≠ A in A with vbl(B) ∩ vbl(A) ≠ ∅, where vbl(A) is the smallest subset of random variables determining event A.

Furthermore:

1. if ξ = 0, then the expected number of re-samplings done by the MT algorithm is at most v1 = T · max_{A∈A} 1/(1−x(A)), and for any parameter λ ≥ 1, the MT algorithm terminates within λv1 re-samplings with probability at least 1 − 1/λ;

2. if ξ > 0, then the expected number of re-samplings done by the MT algorithm is at most v2 = O((n/ξ) log(T/ξ)), and for any parameter λ ≥ 1, the MT algorithm terminates within λv2 re-samplings with probability at least 1 − exp(−λ).

Here, the MT algorithm is the resampling algorithm introduced in [24], which terminates quickly with high probability.

The algorithm starts with a string sampled according to the sampling algorithm in the proof of Theorem 4.1, over an alphabet Σ of size Cε^{−2} for some large enough constant C. Then the algorithm checks all O(n^2) intervals for a violation of the requirements for an ε-synchronization string. If a bad interval is found, this interval is re-sampled by randomly choosing every symbol such that each of them is different from the previous t−1 symbols, where t = c′ε^{−2} with c′ a constant smaller than C.

One subtle point of our algorithm is the following. Note that in order to apply the algorithmic framework of [24] and [25], one needs the probability space to be sampled from n independent random variables P = {P1, . . . , Pn} so that each event in the collection A = {A1, . . . , Am} is determined by some subset of P. Then, when some bad event Ai happens, one only resamples the random variables that determine Ai. At first glance, it may appear that in our application of the Lovasz Local Lemma, the sampling of the i-th symbol depends on the previous h = min{i−1, t−1} symbols, which in turn depend on earlier symbols, and so on; thus the sampling of the i-th symbol would depend on the sampling of all previous symbols. However, we can implement our sampling process as follows: for the i-th symbol we first independently generate a random variable Pi which is uniform over {1, 2, . . . , |Σ|−h}, and then we use the random variables P1, . . . , Pn to determine the symbols in the following way. Initially we fix some arbitrary order of the symbols in Σ. Then, for i = 1, . . . , n, to get the i-th symbol, we first reorder Σ so that the previous h chosen symbols are labeled as the first h symbols of Σ, while the remaining symbols keep their current relative order as the last |Σ|−h symbols. We then choose the i-th symbol to be the (h+Pi)-th symbol in this new order. In this way, the random variables P1, . . . , Pn are indeed independent, and the i-th symbol is indeed chosen uniformly from the |Σ|−h symbols excluding the previous h symbols. Furthermore, the event of any interval S[i, k] being bad only depends on the random variables (Pi, . . . , Pk), since no matter what the previous h symbols are, they are relabeled as {1, . . . , h} and the rest of the symbols are labeled as {h+1, . . . , |Σ|}. From here, the same sequence (Pi, . . . , Pk) results in the same behavior of S[i, k] in terms of which symbols are equal. We can thus apply the same algorithm as in [1].
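The seed-based implementation just described can be sketched as follows. Here `P[i]` plays the role of P_{i+1} (0-indexed), and the derivation is deterministic in the seeds, so resampling an interval amounts to redrawing its seeds and rerunning.

```python
def symbols_from_seeds(P, sigma_size, t):
    # Derives the string from independent seeds: P[i] is assumed uniform over
    # {1, ..., sigma_size - h} with h = min(i, t-1), following the reordering
    # scheme described above.
    order = list(range(sigma_size))   # arbitrary fixed initial order of Sigma
    s = []
    for i, p in enumerate(P):
        h = min(i, t - 1)
        prev = s[len(s) - h:]                       # previous h chosen symbols
        rest = [a for a in order if a not in prev]  # keep current relative order
        order = prev + rest                         # relabel: prev symbols first
        s.append(order[h + p - 1])                  # the (h + P_i)-th symbol
    return s
```

Because any window of h ≤ t−1 consecutive output symbols is distinct, `prev` is always a set of h distinct symbols, and the chosen symbol avoids all of them.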

Note that the time to get the i-th symbol from the random variables P1, . . . , Pn is O(n log(1/ε)), since we need O(n) operations, each on a symbol of an alphabet of size Cε^{−2}. Thus resampling each interval takes O(n^2 log(1/ε)) time, since we need to resample at most n symbols. For every interval, the edit distance can be computed using the Wagner-Fischer dynamic programming in O(n^2 log(1/ε)) time. Theorem 4.2 shows that the expected number of re-samplings is O(n). The algorithm repeats until no bad interval can be found. Hence the overall expected running time is O(n^5 log(1/ε)).

Note that without loss of generality we can assume that ε > 1/√n, because for smaller ε we can always use the indices directly, which have alphabet size n. So the overall expected running time is O(n^5 log n).

We can now construct an ε-synchronization circle using Theorem 4.1.

Theorem 4.3. For every ε ∈ (0, 1) and every n ∈ N, there exists an ε-synchronization circle S of length n over an alphabet Σ of size O(ε^{−2}).


Proof. First, by Theorem 4.1, we can obtain two ε-synchronization strings: S1 of length ⌈n/2⌉ over Σ1, and S2 of length ⌊n/2⌋ over Σ2, where Σ1 ∩ Σ2 = ∅ and |Σ1| = |Σ2| = O(ε^{−2}). Let S be the concatenation of S1 and S2. Then S is over the alphabet Σ = Σ1 ∪ Σ2, whose size is O(ε^{−2}). Now we prove that S is an ε-synchronization circle.

For every 1 ≤ m ≤ n, consider the string S′ = s_m, s_{m+1}, . . . , s_n, s_1, s_2, . . . , s_{m−1}. Notice that for two strings T and T′ over alphabet Σ, LCS(T, T′) ≤ (ε/2)(|T|+|T′|) is equivalent to ED(T, T′) ≥ (1−ε)(|T|+|T′|). For any i < j < k, we call an interval S′[i, k] good if LCS(S′[i, j], S′[j+1, k]) ≤ (ε/2)(k−i). It suffices to show that for all 1 ≤ i < k ≤ n, the interval S′[i, k] is good.

Without loss of generality, assume m ∈ [⌈n/2⌉, n].

Intervals that are substrings of S1 or S2 are good, since S1 and S2 are ε-synchronization strings. We are left with intervals crossing the ends of S1 or S2.

If S′[i, k] contains s_n, s_1 but does not contain s_{⌈n/2⌉}: If j < n−m+1, then there is no common subsequence between S′[i, j] and S′[n−m+2, k]. Thus

LCS(S′[i, j], S′[j+1, k]) ≤ (ε/2)(n−m+1−i) < (ε/2)(k−i).

If j ≥ n−m+1, then there is no common subsequence between S′[j+1, k] and S′[i, n−m+1]. Thus

LCS(S′[i, j], S′[j+1, k]) ≤ (ε/2)(k−(n−m+2)) < (ε/2)(k−i).

Thus intervals of this kind are good.

If S′[i, k] contains s_{⌊n/2⌋}, s_{⌈n/2⌉} but does not contain s_n: If j ≤ n−m+⌊n/2⌋+1, then there is no common subsequence between S′[i, j] and S′[n−m+⌈n/2⌉+1, k]; thus

LCS(S′[i, j], S′[j+1, k]) < (ε/2)(k−i).

If j ≥ n−m+⌊n/2⌋+1, then there is no common subsequence between S′[j+1, k] and S′[i, n−m+⌊n/2⌋+1]. Thus

LCS(S′[i, j], S′[j+1, k]) < (ε/2)(k−i).

Thus intervals of this kind are good.

If S′[i, k] contains both s_{⌈n/2⌉} and s_n: If n−m+2 ≤ j ≤ n−m+⌊n/2⌋+1, then any common subsequence is either a common subsequence of S′[i, n−m+1] and S′[n−m+⌈n/2⌉+1, k], or one of S′[n−m+2, j] and S′[j+1, n−m+⌊n/2⌋+1]. This is because Σ1 ∩ Σ2 = ∅. Thus

LCS(S′[i, j], S′[j+1, k]) ≤ max{ LCS(S′[i, n−m+1], S′[n−m+⌈n/2⌉+1, k]), LCS(S′[n−m+2, j], S′[j+1, n−m+⌊n/2⌋+1]) } < (ε/2)(k−i).

If j ≤ n−m+1, then there is no common subsequence between S′[i, j] and S′[n−m+2, n−m+⌊n/2⌋+1]. Thus

LCS(S′[i, j], S′[j+1, k]) < (ε/2)(k−i).

If j ≥ n−m+⌈n/2⌉+1, the proof is similar to the case j ≤ n−m+1.

This shows that S′ is an ε-synchronization string. Thus, by the definition of a synchronization circle, the construction gives an ε-synchronization circle.
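The construction in this proof is mechanically simple; a minimal sketch follows, with a helper listing the rotations whose ε-synchronization property the proof establishes (the function names are illustrative).

```python
def sync_circle_from_strings(s1, s2):
    # Theorem 4.3 construction: concatenate an eps-synchronization string s1
    # over Sigma1 with one over a disjoint alphabet Sigma2.  Disjointness of
    # the alphabets is what makes every rotation work.
    assert set(s1).isdisjoint(s2), "the two alphabets must be disjoint"
    return list(s1) + list(s2)

def rotations(s):
    # The rotations S', one per starting position m, analyzed in the proof.
    return [s[m:] + s[:m] for m in range(len(s))]
```

Intuitively, a rotation splits each of S1 and S2 into at most two runs, and any interval pair straddling a boundary has no common subsequence across the Σ1/Σ2 divide, which is exactly the case analysis above.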

5 Deterministic Constructions of Long-Distance Synchronization Strings

In this section, we give deterministic constructions of synchronization strings. In fact, we first consider a generalized version of synchronization strings, i.e., the f(l)-distance ε-synchronization strings defined by Haeupler and Shahrasbi [10]. Throughout this section, ε is considered to be a constant in (0, 1).

5.1 Polynomial Time Constructions of Long-Distance Synchronization Strings

Here, by combining the deterministic Lovasz Local Lemma of Chandrasekaran et al. [23] and the non-uniform sample space used in Theorem 4.1, we give a deterministic polynomial-time construction of c-long ε-synchronization strings over an alphabet of size O(ε^{−2}).

Theorem 5.1. (Theorem 2 in [23]) Let the time needed to compute the conditional probability Pr[A | ∀i ∈ I : Pi = vi] for any A ∈ A and any partial evaluation (vi ∈ Di)_{i∈I}, I ⊆ [n], be at most t_C. If there is an ε ∈ (0, 1) and an assignment of reals x : A → (0, 1) such that

∀A ∈ A : Pr[A] ≤ x′(A)^{1+ε} = ( x(A) ∏_{B∈Γ(A)} (1 − x(B)) )^{1+ε},

then a deterministic algorithm can find a good evaluation in time

O(t_C) · ( D M^{2(1+1/ε)} log M ) / ( ε w_min ) = O( t_C · D · M^{3+2/ε} ).

Here,

• x′(A) = x(A) ∏_{B∈Γ(A)} (1 − x(B));

• D := max_{P∈P} |domain(P)|, where P is the set of random variables;

• M := max{ n, 4m, 4 Σ_{A∈Ā} 2|vbl(A)| x′(A) · x(A)/(1 − x(A)) }, where Ā = {A ∈ A : x′(A) ≥ 1/(4m)}, n is the number of random variables, and m is the number of events in A;


• w_min := min_{A∈A} (− log x′(A));

• Γ(A) = Γ_A(A), the set of all events B ≠ A in A with vbl(B) ∩ vbl(A) ≠ ∅, in which vbl(A) is the smallest subset of random variables determining event A.

First we recall the following property of c-long synchronization strings.

Lemma 5.1. (Corollary 4.4 of [10]) If S is a string which satisfies the edit distance requirement of the c-long-distance ε-synchronization property stated in Definition 1.2 for any two non-adjacent intervals of total length 2c log n or less, then it satisfies this requirement for all pairs of non-adjacent intervals.

We now have the following theorem.

Theorem 5.2. For any n ∈ N and any constant ε ∈ (0, 1), there is a deterministic construction of an O(1/ε)-long-distance ε-synchronization string of length n, over an alphabet of size O(ε^{−2}), in time poly(n).

Proof. To prove this, we will use the Lovasz Local Lemma and its deterministic algorithm from [23]. Suppose the alphabet is Σ with |Σ| = q = c1 ε^{−2}, where c1 is a constant. Let t = c2 ε^{−2} with 0 < c2 < c1. The sampling algorithm for the string S (1-indexed) is as follows:

• Initialize an arbitrary order for Σ.

• For the i-th symbol:

  – Let h = min{t−1, i−1}. Generate a random variable Pi uniformly over {1, 2, . . . , |Σ|−h}.

  – Reorder Σ such that the previous h chosen symbols are labeled as the first h symbols of Σ, and the rest are ordered in the current order as the last |Σ|−h symbols.

  – Choose the (Pi + h)-th symbol in this new order as S[i].

Define the bad event A_{i1,l1,i2,l2} as the event that the intervals S[i1, i1+l1) and S[i2, i2+l2), with i1+l1 ≤ i2, violate the c = O(1/ε)-long-distance synchronization string property. In other words, A_{i1,l1,i2,l2} occurs if and only if ED(S[i1, i1+l1), S[i2, i2+l2)) ≤ (1−ε)(l1+l2), which is equivalent to LCS(S[i1, i1+l1), S[i2, i2+l2)) ≥ (ε/2)(l1+l2).

Note that according to the definition of a c-long distance ε-synchronization string and Lemma 5.1, we only need to consider the events A_{i1,l1,i2,l2} with either l1+l2 < c log n (for adjacent intervals) or c log n ≤ l1+l2 ≤ 2c log n. Thus we can upper bound the probability of A_{i1,l1,i2,l2}:

Pr[A_{i1,l1,i2,l2}] ≤ ( el / (εl √(|Σ|−t)) )^{εl} = ( e / (ε √(|Σ|−t)) )^{εl} = C^{εl},

where l = l1+l2 and C is a constant which depends on c1 and c2.

However, to apply the deterministic Lovasz Local Lemma (LLL), we need to meet two additional requirements. The first is that each bad event depends on at most logarithmically many variables, and the second is that the inequalities in the Lovasz Local Lemma hold with a constant exponential slack [10].

The first requirement may not hold under the current definition of badness. Consider for example the random variables P_{i1}, . . . , P_{i1+l1−1}, P_{i2}, . . . , P_{i2+l2−1} for a pair of split intervals S[i1, i1+l1), S[i2, i2+l2) where the total length l1+l2 is at least 2c log n. The event A_{i1,l1,i2,l2} may depend on too many random variables (namely, P_{i1}, . . . , P_{i2+l2−1}).

To overcome this, we redefine the badness of the split interval pair S[i1, i1+l1) and S[i2, i2+l2) as follows: let B_{i1,l1,i2,l2} be the event that there exists a choice of P_{i1+l1}, . . . , P_{i2−1} (i.e., the random variables chosen between the two intervals) such that the two intervals generated by P_{i1}, . . . , P_{i1+l1−1} and P_{i2}, . . . , P_{i2+l2−1} (together with P_{i1+l1}, . . . , P_{i2−1}) satisfy LCS(S[i1, i1+l1), S[i2, i2+l2)) ≥ (ε/2)(l1+l2) according to the sampling algorithm. Note that if B_{i1,l1,i2,l2} does not happen, then certainly A_{i1,l1,i2,l2} does not happen.

Notice that with this new definition of badness, B_{i1,l1,i2,l2} is independent of P_{i1+l1}, . . . , P_{i2−1} and depends only on P_{i1}, . . . , P_{i1+l1−1}, P_{i2}, . . . , P_{i2+l2−1}. In particular, this implies that B_{i1,l1,i2,l2} is independent of the badness of all other interval pairs that have no intersection with S[i1, i1+l1) ∪ S[i2, i2+l2).

We now bound Pr[B_{i1,l1,i2,l2}]. When considering the two intervals S[i1, i1+l1), S[i2, i2+l2) and their edit distance under our sampling algorithm, without loss of generality we can assume that the order of the alphabet at the point of sampling S[i1] is (1, 2, . . . , q), just by renaming the symbols. Now, if we fix the order of the alphabet at the point of sampling S[i2], then S[i2, i2+l2) depends only on P_{i2}, . . . , P_{i2+l2−1}, and thus LCS(S[i1, i1+l1), S[i2, i2+l2)) depends only on P_{i1}, . . . , P_{i1+l1−1}, P_{i2}, . . . , P_{i2+l2−1}.

Conditioned on any fixed order of the alphabet at the point of sampling S[i2], the event LCS(S[i1, i1+l1), S[i2, i2+l2)) ≥ (ε/2)(l1+l2) happens with probability at most C^{εl}, by the same computation used to upper bound Pr[A_{i1,l1,i2,l2}]. Note that there are at most q! different


orders of the alphabet. Thus by a union bound we have

Pr[B_{i1,l1,i2,l2}] ≤ C^{εl} × q! = C′^{εl}

for some constant C′. In order to meet the second requirement of the deterministic algorithm for the Lovasz Local Lemma, we also need to find real numbers x_{i1,l1,i2,l2} ∈ [0, 1] such that for every B_{i1,l1,i2,l2},

Pr[B_{i1,l1,i2,l2}] ≤ [ x_{i1,l1,i2,l2} ∏_X (1 − x_{i′1,l′1,i′2,l′2}) ]^{1.01},

where X denotes the condition [S[i1, i1+l1) ∪ S[i2, i2+l2)] ∩ [S[i′1, i′1+l′1) ∪ S[i′2, i′2+l′2)] ≠ ∅. We propose x_{i1,l1,i2,l2} = D^{−ε(l1+l2)} for some D > 1 to be determined later. D has to be chosen such that for all i1, l1, i2, l2 and l = l1+l2:

( e / (ε √(|Σ|−t)) )^{εl} ≤ [ D^{−εl} ∏_X (1 − D^{−ε(l′1+l′2)}) ]^{1.01}.

Notice that

D^{−εl} ∏_X (1 − D^{−ε(l′1+l′2)})

≥ D^{−εl} × ∏_{l′=c log n}^{2c log n} ∏_{l′1=1}^{l′} (1 − D^{−εl′})^{ñ} × ∏_{l′′=t}^{c log n} (1 − D^{−εl′′})^{l+l′′}

≥ D^{−εl} × ∏_{l′=c log n}^{2c log n} ∏_{l′1=1}^{l′} (1 − D^{−εl′})^{4(l+l′)n} × ∏_{l′′=t}^{c log n} (1 − D^{−εl′′})^{l+l′′}

≥ D^{−εl} ( 1 − 32c^3 n log^3 n · D^{−εc log n} ) [ 1 − D^{−c2/ε} / (1 − D^{−ε}) ]^{l} × ( 1 − D^{−c2/ε} (D^{−ε} + c2/ε^2 − c2 D^{−ε}/ε^2) / (1 − D^{−ε})^2 ),

where in the inequalities ñ = [(l1+l′1) + (l1+l′2) + (l2+l′1) + (l2+l′2)] · n ≤ 4(l+l′)n bounds the number of placements of a pair of intervals of lengths l′1 and l′2 = l′−l′1 that intersect the given pair.

For D = 2 and c = 2/ε, lim_{ε→0} 2^{−c2/ε} / (1 − 2^{−ε}) = 0. Thus, for sufficiently small ε, 2^{−c2/ε} / (1 − 2^{−ε}) < 1/2. Moreover, 32c^3 n log^3 n · D^{−εc log n} = (2^8/ε^3) · (log^3 n)/n = o(1). Finally, for sufficiently small ε,

1 − D^{−c2/ε} (D^{−ε} + c2/ε^2 − c2 D^{−ε}/ε^2) / (1 − D^{−ε})^2 > 2^{−ε}.

Therefore, for sufficiently small ε and sufficiently large n, the requirement for the LLL is satisfied under the condition

D^{−εl} ∏_X (1 − D^{−ε(l′1+l′2)}) ≥ 2^{−εl} (1 − 1/2) (2^{−ε})^{l} (1 − 1/2) ≥ 4^{−εl}/4.

So for the LLL to work, the following should be guaranteed:

( e / (ε √(|Σ|−t)) )^{εl/1.01} ≤ 4^{−εl}/4  ⇐  4^{2.02(1+ε)} e^2 / ε^2 ≤ |Σ| − t.

Hence the second requirement holds for |Σ| − t = 4^{4.04} e^2 / ε^2 = O(ε^{−2}).

Corollary 5.1. For any n ∈ N and any constant ε ∈ (0, 1), there is a deterministic construction of an ε-synchronization string of length n, over an alphabet of size O(ε^{−2}), in time poly(n).

By a concatenation construction similar to the one used in the proof of Theorem 4.3, we also obtain a deterministic construction of synchronization circles.

Corollary 5.2. For any n ∈ N and any constant ε ∈ (0, 1), there is a deterministic construction of an ε-synchronization circle of length n, over an alphabet of size O(ε^{−2}), in time poly(n).

5.2 Deterministic Linear Time Constructions of c-long Distance ε-synchronization Strings

Here we give a much more efficient construction of c-long distance ε-synchronization strings, using synchronization circles and standard error correcting codes. We show that the following algorithm gives a construction of c-long distance synchronization strings.

Algorithm 5.1. (Explicit Linear Time Construction of a c-long Distance ε-synchronization String)

Input:

• An error correcting code C0 ⊂ Σ_{C0}^m with distance δm and block length m = c log n.

• An ε0-synchronization circle SC = (sc1, . . . , scm) of length m over alphabet Σ_SC.

Operations:

• Construct a code C ⊂ Σ^m such that

C = { ((c1, sc1), . . . , (cm, scm)) | (c1, . . . , cm) ∈ C0 }, where Σ = Σ_{C0} × Σ_SC.
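This symbol-wise pairing step is easy to sketch; the function and argument names below are illustrative, not part of the paper's notation.

```python
def pair_with_circle(code0, sc):
    # Pair each codeword (c1..cm) of the input ECC symbol-wise with the fixed
    # synchronization circle (sc1..scm), giving words over the product alphabet.
    out = []
    for cw in code0:
        assert len(cw) == len(sc), "codeword and circle lengths must match"
        out.append(list(zip(cw, sc)))
    return out
```

Every combined codeword inherits the circle in its second coordinate, which is what makes each block of the final concatenated string a synchronization circle.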


(5.1)

LCS(S1, S2) < ( s1/m + 2 + s2/m + 2 ) · αm = α(s1 + s2 + 4m) < 5α(s1 + s2) = 5α(k − i) = (ε1/2)(k − i).

Case 2: If k − i ≤ m, then by the property of the synchronization circle SC, the longest common subsequence of S1 and S2 has length less than (ε0/2)(k − i) ≤ α(k − i) ≤ (ε1/2)(k − i).

As a result, the longest common subsequence of S[i, j] and S[j+1, k] has length less than (ε1/2)(k − i), which means that S is an ε1-synchronization circle.

Similarly, we also have the following lemma.

Lemma 5.4. The output S of Algorithm 5.1 is a c-long distance ε-synchronization string of length n = Nm, where N is the number of codewords in C and ε = 12(1 − ((1−ε0)/(1+ε0)) δ).

Proof. By Lemma 5.3, S is an ε1-synchronization string; thus the longest common subsequence of adjacent intervals S1, S2 with total length l < c log n has length less than (ε1/2)l. We only need to consider pairs of intervals S1, S2 whose total length l ∈ [c log n, 2c log n].

Notice that the total length of S1 and S2 is at most 2c log n, which means that S1 and S2 each intersect at most 3 codewords of C. Using Lemma 5.2, we have that LCS(S1, S2) ≤ 6αl.

Thus, picking ε = max{12α, ε1} = 12α = 12(1 − ((1−ε0)/(1+ε0)) δ), the output S of Algorithm 5.1 is a c-long distance ε-synchronization string.

We need the following code, constructed by Guruswami and Indyk [26].

Lemma 5.5. (Theorem 3 of [26]) For every 0 < r < 1 and all sufficiently small ε > 0, there exists a family of codes of rate r and relative distance (1 − r − ε) over an alphabet of size 2^O(ε^−4 r^−1 log(1/ε)), such that codes from the family can be encoded in linear time and can also be uniquely decoded in linear time from a 2(1 − r − ε) fraction of half-errors.

Lemma 5.6. [Error Correcting Code by Brute Force Search] For any n ∈ N and any ε ∈ [0, 1], one can construct an error correcting code in time O(2^εn (2e/ε)^n n log(1/ε)) and space O(2^εn n log(1/ε)), with block length n, 2^εn codewords, distance d = (1 − ε)n, and alphabet size 2e/ε.

Proof. We conduct a brute-force search to find the codewords one by one.

Denote the code by C and the alphabet by Σ, and let |Σ| = q. Initially, C = ∅, and we add an arbitrary element of Σ^n to C. Each time a new element c is added to C, we exclude every element of Σ^n that has distance less than d from c; then we pick an arbitrary one of the remaining elements and add it to C. We keep doing this until |C| = 2^εn.

Note that given c ∈ Σ^n, the total number of elements that have distance less than d from c is at most

(n choose d) q^d = (n choose n − d) q^d ≤ (e/ε)^εn q^(1−ε)n.

We require that |C| · (e/ε)^εn q^(1−ε)n ≤ q^n. Letting q = 2e/ε, C can have 2^εn codewords.

The exclusion operation takes time O((2e/ε)^n n log(1/ε)), as we have to exhaustively search the space and, for each word, compute its Hamming distance to the newly added codeword. Since there are 2^εn codewords, the time complexity is as stated.

We have to store those codewords, so the space complexity is also as stated.
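The greedy search in this proof can be sketched directly. The parameters below are toy-sized for illustration; skipping any word within distance < d of a chosen codeword is equivalent to the proof's explicit exclusion of the ball around each new codeword.

```python
from itertools import product

def hamming(u, v):
    """Number of positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def greedy_code(q, n, d, max_size=None):
    """Greedily pick words of length n over alphabet {0, ..., q-1} that are
    pairwise at Hamming distance >= d (the brute-force search of Lemma 5.6)."""
    code = []
    for w in product(range(q), repeat=n):
        if all(hamming(w, c) >= d for c in code):
            code.append(w)
            if max_size is not None and len(code) == max_size:
                break
    return code

# Toy parameters: length-4 words over a 3-letter alphabet, distance 3.
C = greedy_code(q=3, n=4, d=3)
```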

We can now use Algorithm 5.1 to give a linear time construction of c-long distance ε-synchronization strings.

Theorem 5.4. For every n ∈ N and any constant 0 < ε < 1, there is a deterministic construction of a c = O(ε^−2)-long distance ε-synchronization string S ∈ Σ^n, where |Σ| = O(ε^−3), in time O(n). Moreover, S[i, i + log n] can be computed in O(log n / ε^2) time.

Proof. Suppose we have an error correcting code C with distance rate 1 − ε′ = ((1 + ε/36)/(1 − ε/36))(1 − ε/12), message rate rc = O(ε′^2), over an alphabet of size |Σc| = O(ε′^−1), with block length m = O(ε′^−2 log n). Let c = O(ε′^−2) = O(ε^−2). We apply Algorithm 5.1, using C and an ε/36-synchronization circle SC of length m over an alphabet of size O(ε^−2). Here SC is constructed by Corollary 5.2 in time poly(m) = poly(log n). By Lemma 5.4, we have a c-long-distance 12(1 − ((1 − ε/36)/(1 + ε/36))(1 − ε′)) = ε-synchronization string of length m · |Σc|^(rc m) ≥ n.

It remains to show that we can have such a C with linear time encoding. We use the code in Lemma 5.5 as the outer code and the one in Lemma 5.6 as the inner code. Let Cout be an instantiation of the code in Lemma 5.5 with rate ro = εo = ε′/3, relative distance do = (1 − 2εo), alphabet size 2^O(εo^−5 log(1/εo)), and block length no = εo^4 log n / log(1/εo), which is encodable and decodable in linear time.

Further, according to Lemma 5.6, one can find a code Cin with rate ri = O(εi), where εi = ε′/3, relative


If we instantiate Algorithm 5.2 using Corollary 5.3, then we have the following theorem.

Theorem 5.6. For any constant 0 < ε < 1, there exists a highly explicit construction of an infinite ε-synchronization string S over an alphabet of size O(ε^−3). Moreover, for any i ∈ N, the first i symbols can be computed in O(i) time and S[i, i + log i] can be computed in O(log i) time.

Proof. Combine Algorithm 5.2 and Corollary 5.3. In the algorithm, we can construct every substring S_{k_i} in linear time with alphabet size q = O(ε^−3), by Corollary 5.3. So the first i symbols can be computed in O(i) time. Also, any substring S[i, i + log i] can be computed in time O(log i).

By Lemma 5.7, S is an infinite ε-synchronization string over an alphabet of size 2q = O(ε^−3).

6 Ω(ε^−3/2) Lower Bound on Alphabet Size

The twin word problem was introduced by Axenovich, Person, and Puzynina [27] and further studied by Bukh and Zhou [28]. Any set of two identical disjoint subsequences in a given string is called a twin word. [27, 28] provide a variety of results on the relations between the length of a string, the size of the alphabet over which it is defined, and the size of the longest twin word it contains. We will make use of the following result from [28], which builds on Lemma 5.9 of [29], to provide a new lower bound on the alphabet size of synchronization strings.

Theorem 6.1. (Theorem 3 from [28]) There exists a constant c such that every word of length n over a q-letter alphabet contains two disjoint equal subsequences of length cnq^−2/3.

Further, Theorem 6.4 of [1] states that any ε-synchronization string of length n has to satisfy the ε-self-matching property, which essentially means that it cannot contain two equal (not necessarily disjoint) subsequences of length εn or more. These two requirements lead to the following inequality for an ε-synchronization string of length n over an alphabet of size q.

cnq^−2/3 ≤ εn  ⇒  c′ε^−3/2 ≤ q

7 Synchronization Strings over Small Alphabets

In this section, we focus on synchronization strings over small constant-sized alphabets. We study the question of what is the smallest possible alphabet size over which arbitrarily long ε-synchronization strings can exist for some ε < 1, and how such synchronization strings can be constructed.

Throughout this section, we will make use of square-free strings, introduced by Thue [14]: a weaker notion than synchronization strings that requires all pairs of consecutive equal-length substrings to be non-identical. Note that no synchronization string or square-free string of length four or more exists over a binary alphabet, since a binary string of length four either contains two consecutive identical symbols or two identical consecutive substrings of length two. However, over ternary alphabets, arbitrarily long square-free strings exist and can be constructed efficiently using uniform morphisms [20]. In Section 7.1, we will briefly review this construction and show that no uniform morphism can be used to construct arbitrarily long synchronization strings. In Section 7.2, we make use of ternary square-free strings to show that arbitrarily long ε-synchronization strings exist over alphabets of size four for some ε < 1.
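The binary impossibility claim above is small enough to verify exhaustively; the following sketch checks square-freeness directly from the definition and confirms that no binary string of length four is square-free.

```python
from itertools import product

def is_square_free(s):
    """True iff no two adjacent equal-length substrings of s are identical
    (the square-free property discussed above)."""
    n = len(s)
    return not any(s[i:i + l] == s[i + l:i + 2 * l]
                   for l in range(1, n // 2 + 1)
                   for i in range(n - 2 * l + 1))

# Exhaustively confirm the claim: no binary string of length 4 is square-free.
no_binary_len4 = not any(is_square_free(w) for w in product("01", repeat=4))
```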

7.1 Morphisms cannot Generate Synchronization Strings  Previous works show that one can construct infinitely long square-free or approximately square-free strings using uniform morphisms. A uniform morphism of rank r over an alphabet Σ is a function φ : Σ → Σ^r that maps any symbol of an alphabet Σ to a string of length r over the same alphabet. Applying the function φ to some string S ∈ Σ* is defined as replacing each symbol of S with its image under φ.
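Iterating a uniform morphism is mechanically simple. As a concrete binary example we use the Thue-Morse morphism 0 → 01, 1 → 10, which is uniform of rank 2 and whose fixed point is the overlap-free Thue-Morse word; the square-free ternary constructions cited above work the same way with a different, larger-rank morphism.

```python
def apply_morphism(phi, s):
    """Replace every symbol of s by its image under phi."""
    return "".join(phi[ch] for ch in s)

def iterate(phi, seed, steps):
    """Repeatedly apply the morphism phi, starting from a single letter."""
    s = seed
    for _ in range(steps):
        s = apply_morphism(phi, s)
    return s

thue_morse = {"0": "01", "1": "10"}
w = iterate(thue_morse, "0", 4)   # rank 2, so the length is 2**4 = 16
```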

[21, 15, 16, 17, 20] show that there are uniform morphisms that generate the combinatorial objects they respectively study. More specifically, one can start from any letter of the alphabet and repeatedly apply the morphism to it to construct those objects. For instance, using the uniform morphism of rank 11 suggested in [20], all such strings will be square-free. In this section, we investigate the possibility of finding similar constructions for synchronization strings. We will show that no such morphism can possibly generate an infinite ε-synchronization string for any fixed 0 < ε < 1.

The key to this claim is that a matching between two substrings is preserved under an application of the uniform morphism φ. Hence, we can always increase the size of a matching between two substrings by applying the morphism sufficiently many times and then adding new matches to the matching from previous steps.

Theorem 7.1. Let φ be a uniform morphism of rank r over alphabet Σ. Then φ does not generate an infinite ε-synchronization string, for any 0 < ε < 1.

Proof. To prove this, we show that for any 0 < ε < 1, applying the morphism φ sufficiently many times to any symbol of the alphabet Σ produces a string that has two neighboring intervals which contradict the ε-synchronization property.


following for m = m1 + m2 + 1.

LCS(φ^m(a), φ^m(b))
  ≥ [ 1/(|Σ|^2 r) + (1 − (1 − 1/(|Σ|^2 r))^n − δ/2) · (1 − 1/(|Σ|^2 r) − δ/2) ] r^m
  ≥ [ 1 − (1 − 1/(|Σ|^2 r))^(n+1) − δ ] r^m

This completes the induction step and finishes the proof.

7.2 Synchronization Strings over Alphabets of Size Four  In this section, we show that synchronization strings of arbitrary length exist over alphabets of size four. In order to do so, we first introduce the notion of weak ε-synchronization strings. This weaker notion is very similar to the synchronization string property, except that the edit distance requirement is rounded down.

Definition 7.1. (weak ε-synchronization strings) A string S of length n is a weak ε-synchronization string if for every 1 ≤ i < j < k ≤ n,

ED(S[i, j), S[j, k)) ≥ ⌊(1 − ε)(k − i)⌋.
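Definition 7.1 can be checked mechanically on short strings. The brute-force sketch below (a hypothetical helper, cubic in the triple enumeration on top of a quadratic edit-distance DP, so only for tiny inputs) uses 0-based indices, unlike the 1-based statement.

```python
from math import floor

def edit_distance(a, b):
    """Standard DP edit distance (insertions, deletions, substitutions),
    with a single rolling row."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # delete
                        dp[j - 1] + 1,                    # insert
                        prev + (a[i - 1] != b[j - 1]))    # substitute/keep
            prev = cur
    return dp[n]

def is_weak_sync(s, eps):
    """Check ED(S[i,j), S[j,k)) >= floor((1-eps)(k-i)) for all i < j < k."""
    n = len(s)
    return all(
        edit_distance(s[i:j], s[j:k]) >= floor((1 - eps) * (k - i))
        for i in range(n) for j in range(i + 1, n) for k in range(j + 1, n + 1)
    )
```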

We start by showing that binary weak ε-synchronization strings exist for some ε < 1.

7.2.1 Binary Weak ε-Synchronization Strings  Here we prove that an infinite binary weak ε-synchronization string exists. The main idea is to take a synchronization string over some large alphabet and convert it to a binary weak synchronization string by mapping each symbol of that large alphabet to a binary string and separating the binary encoded blocks with blocks of the form 0^k 1^k.
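The encoding can be sketched as follows; appending the marker 0^k 1^k after each translated symbol is one way to realize the separation described above (the helper name and the toy alphabet are illustrative, not from the paper).

```python
from math import ceil, log2

def weak_binary_encode(s, alphabet):
    """Translate each symbol of s into k = ceil(log2(|alphabet|)) bits and
    follow every translated block with the separator 0^k 1^k."""
    index = {a: i for i, a in enumerate(sorted(alphabet))}
    k = max(1, ceil(log2(len(alphabet))))
    marker = "0" * k + "1" * k
    blocks = [format(index[ch], f"0{k}b") + marker for ch in s]
    return "".join(blocks)

# Toy example over an 8-letter alphabet, so k = 3 and the marker is 000111.
T = weak_binary_encode("ace", "abcdefgh")
```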

Theorem 7.2. There exists a constant ε < 1 and an infinite binary weak ε-synchronization string.

Proof. Take some arbitrary ε′ ∈ (0, 1). According to [1], there exists an infinite ε′-synchronization string S over a sufficiently large alphabet Σ. Let k = ⌈log |Σ|⌉. Translate each symbol of S into k binary bits, and separate the translated k-blocks with 0^k 1^k. We claim that this new string T is a weak ε-synchronization binary string for some ε < 1.

First, call a translated k-length symbol followed by 0^k 1^k a full block. Call any other (possibly empty) substring a half block. Then any substring of T is a half block followed by multiple full blocks, ending with a half block.

Let A and B be two consecutive substrings in T. Without loss of generality, assume |A| ≤ |B| (because edit distance is symmetric). Let M be a longest common subsequence between A and B. Partition the blocks of B into the following 4 types:

1. Full blocks that match completely to another full block in A.

2. Full blocks that match completely, but not to just one full block in A.

3. Full blocks in which not all bits are matched.

4. Half blocks.

The key claim is that the 3k symbols of a type-2 block are not matched to contiguous positions in A and, therefore, there is at least one unmatched symbol in A surrounded by these matches. To see this, assume by contradiction that all letters of some type-2 block are matched contiguously in A. The following simple analysis of 3 cases contradicts this assumption:

• The match starts in the middle of some translated k-length symbol, say at position p ∈ [2, k]. Then the first 1 of 1^k in A will be matched to the (k − p + 2)-th 0 of 0^k in B, a contradiction.

• The match starts in the 0-portion of a 0^k 1^k block, say at the p-th 0. Then the p-th 1 of 1^k in A will be matched to the first 0 of 0^k in B, a contradiction.

• The match starts in the 1-portion of a 0^k 1^k block, say at the p-th 1. Then the p-th 0 of 0^k in A will be matched to the first 1 of 1^k in B, a contradiction.

Let the number of type-i blocks in B be t_i. For every type-2 block, there is an unmatched letter in A between its first and last matches; hence, |A| ≥ |M| + t_2. For every type-3 block, there is an unmatched letter in B within it; hence, |B| ≥ |M| + t_3. Therefore, |A| + |B| ≥ 2|M| + t_2 + t_3.

Since |A| ≤ |B|, the total number of full blocks in A and B together is at most 2(t_1 + t_2 + t_3) + 1 (the additional +1 comes from the possibility that the two half blocks in B allow for one extra full block in A). Note that the type-1 blocks give a matching of size t_1 between the full blocks in A and the full blocks in B. So, due to the ε′-synchronization property of S, we obtain the following.

t_1 ≤ (ε′/2)(2(t_1 + t_2 + t_3) + 1)
  ⇒ t_1 ≤ (ε′/(1 − ε′))(t_2 + t_3) + ε′/(2(1 − ε′))


Furthermore, t_1 + t_2 + t_3 + 2 > |B|/(3k) ≥ (|A| + |B|)/(6k). This, along with the above inequality, implies the following.

(1/(1 − ε′))(t_2 + t_3) + (4 − 3ε′)/(2(1 − ε′)) > (|A| + |B|)/(6k).

The edit distance between A and B is

ED(A, B) = |A| + |B| − 2|M| ≥ t_2 + t_3
         > ((1 − ε′)/(6k))(|A| + |B|) − (4 − 3ε′)/2
         > ((1 − ε′)/(6k))(|A| + |B|) − 2.

Set ε = 1 − (1 − ε′)/(18k). If |A| + |B| ≥ 1/(1 − ε), then

((1 − ε′)/(6k))(|A| + |B|) − 2
  ≥ ((1 − ε′)/(6k) − 2(1 − ε))(|A| + |B|)
  = (1 − ε)(|A| + |B|) ≥ ⌊(1 − ε)(|A| + |B|)⌋.

As the weak ε-synchronization property trivially holds for |A| + |B| < 1/(1 − ε), this proves that T is a weak ε-synchronization string.

7.2.2 ε-Synchronization Strings over Alphabets of Size Four  A corollary of Theorem 7.2 is the existence of infinite synchronization strings over alphabets of size four. Here we make use of the fact that infinite ternary square-free strings exist, which was proven in previous work [14]. We then modify such a string to fulfill the synchronization string property, using the existence of an infinite binary weak synchronization string.

Theorem 7.3. There exists some ε ∈ (0, 1) and an infinite ε-synchronization string over an alphabet of size four.

Proof. Take an infinite ternary square-free string T over the alphabet {1, 2, 3} [14] and some ε ∈ (11/12, 1). Let S be an infinite weak binary ε′ = (12ε − 11)-synchronization string. Consider the string W that is identical to T except that the i-th occurrence of symbol 1 in T is replaced with symbol 4 if S[i] = 1. Note that W is still square-free. We claim that W is an ε-synchronization string as well.
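The substitution in this proof can be sketched as follows; T and S below are short placeholders (the proof uses an infinite square-free string and an infinite weak synchronization string).

```python
def relabel(T, S):
    """Relabel the i-th occurrence of symbol 1 in T as 4 whenever S[i] = 1,
    leaving all other symbols unchanged."""
    ones_seen = 0
    W = []
    for sym in T:
        if sym == 1:
            # S[i] decides whether the i-th occurrence of 1 becomes a 4.
            W.append(4 if ones_seen < len(S) and S[ones_seen] == 1 else 1)
            ones_seen += 1
        else:
            W.append(sym)
    return W

T = [1, 2, 1, 3, 1, 2, 3]        # placeholder ternary square-free prefix
S = [1, 0, 1]                    # placeholder binary string
W = relabel(T, S)
```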

Let A = W[i, j), B = W[j, k) be two consecutive substrings of W. If k − i < 1/(1 − ε), then ED(A, B) ≥ 1 > (1 − ε)(k − i) by square-freeness.

Otherwise, k − i ≥ 1/(1 − ε) ≥ 12. Consider all occurrences of 1 and 4 in A and B, which form consecutive subsequences A_s and B_s of S, respectively.

Note that |A_s| + |B_s| ≥ (k − i − 3)/4, because, by square-freeness, there cannot be a length-4 substring of W consisting only of 2's and 3's.

By the weak synchronization property,

ED(A_s, B_s) ≥ ⌊(1 − ε′)(|A_s| + |B_s|)⌋
            ≥ ⌊3(1 − ε)(k − i − 3)⌋
            > 3(1 − ε)(k − i) − 9(1 − ε) − 1
            ≥ (1 − ε)(k − i),

and hence, ED(A, B) ≥ ED(A_s, B_s) ≥ (1 − ε)(k − i). Therefore, W is an ε-synchronization string.

References

[1] Bernhard Haeupler and Amirbehshad Shahrasbi. Synchronization strings: codes for insertions and deletions approaching the Singleton bound. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 33-46. ACM, 2017.

[2] Hugues Mercier, Vijay K. Bhargava, and Vahid Tarokh. A survey of error-correcting codes for channels with symbol synchronization errors. IEEE Communications Surveys & Tutorials, 12(1), 2010.

[3] Ran Gelles. Coding for interactive communication: A survey. 2015.

[4] Ran Gelles and Bernhard Haeupler. Capacity of interactive communication over erasure channels and channels with feedback. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1296-1311. Society for Industrial and Applied Mathematics, 2015.

[5] Mohsen Ghaffari, Bernhard Haeupler, and Madhu Sudan. Optimal error rates for interactive coding I: Adaptivity and other settings. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 794-803. ACM, 2014.

[6] Mohsen Ghaffari and Bernhard Haeupler. Optimal error rates for interactive coding II: Efficiency and list decoding. In Foundations of Computer Science (FOCS), IEEE 55th Annual Symposium on, pages 394-403, 2014.

[7] Bernhard Haeupler. Interactive channel capacity revisited. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 226-235. IEEE, 2014.

[8] Gillat Kol and Ran Raz. Interactive channel capacity. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 715-724. ACM, 2013.

[9] Bernhard Haeupler, Amirbehshad Shahrasbi, and Ellen Vitercik. Synchronization strings: Channel simulations and interactive coding for insertions and deletions. In 45th International Colloquium on Automata, Languages, and Programming (ICALP), pages 75:1-75:14, 2018.


[10] Bernhard Haeupler and Amirbehshad Shahrasbi. Synchronization strings: Explicit constructions, local decoding, and applications. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018.

[11] Bernhard Haeupler, Amirbehshad Shahrasbi, and Madhu Sudan. Synchronization strings: List decoding for insertions and deletions. In 45th International Colloquium on Automata, Languages, and Programming (ICALP), pages 76:1-76:14, 2018.

[12] Bernhard Haeupler, Aviad Rubinstein, and Amirbehshad Shahrasbi. Near-linear time insertion-deletion codes and (1+ε)-approximating edit distance via indexing. arXiv preprint arXiv:1810.11863, 2018.

[13] Mark Braverman, Ran Gelles, Jieming Mao, and Rafail Ostrovsky. Coding for interactive communication correcting insertions and deletions. IEEE Transactions on Information Theory, 63(10):6256-6270, 2017.

[14] Axel Thue. Über unendliche Zeichenreihen (1906). Selected Mathematical Papers of Axel Thue. Universitetsforlaget, 1977.

[15] Axel Thue. Über die gegenseitige Lage gleicher Teile gewisser Zeichenreihen. Christiana Videnskabs-Selskabs Skrifter, I. Math. Naturv. Klasse, 1, 1912.

[16] John Leech. 2726. A problem on strings of beads. The Mathematical Gazette, 41(338):277-278, 1957.

[17] Max Crochemore. Sharp characterizations of squarefree morphisms. Theoretical Computer Science, 18(2):221-226, 1982.

[18] Robert Shelton. Aperiodic words on three symbols. Journal für die Reine und Angewandte Mathematik, 321:195-209, 1981.

[19] Robert O. Shelton and Raj P. Soni. Aperiodic words on three symbols. III. Journal für die Reine und Angewandte Mathematik, 330:44-52, 1982.

[20] Boris Zolotov. Another solution to the Thue problem of non-repeating words. arXiv preprint arXiv:1505.00019, 2015.

[21] Dalia Krieger, Pascal Ochem, Narad Rampersad, and Jeffrey Shallit. Avoiding approximate squares. In International Conference on Developments in Language Theory, pages 278-289. Springer, 2007.

[22] Serina Camungol and Narad Rampersad. Avoiding approximate repetitions with respect to the longest common subsequence distance. Involve: A Journal of Mathematics, 9(4):657-666, 2016.

[23] Karthekeyan Chandrasekaran, Navin Goyal, and Bernhard Haeupler. Deterministic algorithms for the Lovász local lemma. SIAM Journal on Computing, 42(6):2132-2155, 2013.

[24] Robin A. Moser and Gábor Tardos. A constructive proof of the general Lovász local lemma. Journal of the ACM (JACM), 57(2):11, 2010.

[25] Bernhard Haeupler, Barna Saha, and Aravind Srinivasan. New constructive aspects of the Lovász local lemma. Journal of the ACM (JACM), 58(6):28, 2011.

[26] V. Guruswami and P. Indyk. Linear-time encodable/decodable codes with near-optimal rate. IEEE Transactions on Information Theory, 51(10):3393-3400, 2005.

[27] Maria Axenovich, Yury Person, and Svetlana Puzynina. A regularity lemma and twins in words. Journal of Combinatorial Theory, Series A, 120(4):733-743, 2013.

[28] Boris Bukh and Lidong Zhou. Twins in words and long common subsequences in permutations. Israel Journal of Mathematics, 213(1):183-209, 2016.

[29] Paul Beame and Dang-Trinh Huynh-Ngoc. On the value of multiple read/write streams for approximating frequency moments. In 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 499-508, 2008.

[30] N. Alon and J. H. Spencer. The Probabilistic Method. Wiley Series in Discrete Mathematics and Optimization. Wiley, 2004.

[31] J. Beck. An application of Lovász local lemma: There exists an infinite 01-sequence containing no near identical intervals. Colloq. Mathematica Societatis János Bolyai, 37, Finite and Infinite Sets, 1981, pp. 103-107.
