Subwords, Regular Languages, and Prime Numbers


Jeffrey Shallit

School of Computer Science

University of Waterloo

Waterloo, Ontario N2L 3G1

[email protected]

https://www.cs.uwaterloo.ca/~shallit

Joint work with Curtis Bright and Raymond Devillers.


Partial orders

Recall: a partial order “≤” on a set S is a subset T ⊆ S × S satisfying three properties (where we write x ≤ y if (x, y) ∈ T):

1. Reflexive: ∀x: x ≤ x

2. Transitive: ∀x, y, z: x ≤ y and y ≤ z implies x ≤ z

3. Anti-symmetric: ∀x, y: x ≤ y and y ≤ x implies x = y

So partial orders mimic the behavior of “≤” on the real numbers.


Comparable and incomparable elements

We say x, y ∈ S are comparable according to the partial order if either x ≤ y or y ≤ x.

Otherwise they are incomparable.

An antichain is a list of pairwise incomparable elements.

Some partial orders have infinite antichains and some do not...


Antichains in N^k

Consider the following partial order on k-tuples of natural numbers (N^k):

a point (a_1, a_2, . . . , a_k) is ≤_p (b_1, b_2, . . . , b_k) if a_1 ≤ b_1, a_2 ≤ b_2, . . . , a_k ≤ b_k.
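As a small illustration of this componentwise order, here is a minimal Python sketch. The function names `leq_p`, `comparable`, and `is_antichain` are made up for this example and do not come from the talk.

```python
from itertools import combinations

def leq_p(a, b):
    """Componentwise order: a <=_p b iff a_i <= b_i for every coordinate i."""
    return len(a) == len(b) and all(x <= y for x, y in zip(a, b))

def comparable(a, b):
    """Two points are comparable if one is <=_p the other."""
    return leq_p(a, b) or leq_p(b, a)

def is_antichain(points):
    """True if the given points are pairwise incomparable."""
    return not any(comparable(p, q) for p, q in combinations(points, 2))

# The anti-diagonal x + y = 4 in N^2 is a (finite) antichain of size 5.
diagonal = [(i, 4 - i) for i in range(5)]
print(is_antichain(diagonal))        # True
print(comparable((1, 2), (3, 5)))    # True
print(comparable((1, 3), (2, 2)))    # False
```

Finite antichains like this one can be made as large as we like; the question is whether an infinite one exists.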

Are there infinite antichains in this partial order?


No! We prove this by induction on k .

For k = 1 this is clear: any two elements of N are comparable.

Otherwise assume true for k − 1 and we prove for k .

Let p_1, p_2, . . . be an infinite antichain in N^k.

Since each of p_2, p_3, . . . is incomparable to p_1, each such p_i has some coordinate where it is less than the corresponding coordinate of p_1.

Since there are only k coordinates, some coordinate has the property that infinitely many of the p_i are less than p_1 in that coordinate.

Without loss of generality, let it be the first coordinate.


Call these infinitely many p_i

q_1, q_2, q_3, . . . .

Now there are only finitely many non-negative integers less than the first coordinate of p_1, so there is some non-negative integer d, less than the first coordinate of p_1, such that infinitely many of the q_i have first coordinate equal to d.

Call these r_1, r_2, . . . .

Since the r_i all agree in their first coordinate, deleting that coordinate from each of them preserves incomparability. This gives infinitely many pairwise incomparable elements in N^(k−1), a contradiction.

That completes the proof.


Partial orders on words

There are a number of obvious partial orders on words:

x ≤ y if |x | ≤ |y |

x ≤ y if x is a factor of y (a contiguous block sitting inside y, the way ore is a factor of theorem)

x ≤ y if x precedes y in alphabetic order

x ≤ y if x is a subword of y (alternatively, x is obtained from y by striking out 0 or more letters of y, the way them is a subword of theorem)

Note: “subword” is also called “scattered subword” or “substring” or “subsequence”.
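The difference between factors and subwords can be checked directly. The following is a minimal Python sketch; the helper names `is_factor` and `is_subword` are chosen here for illustration only.

```python
def is_factor(x, y):
    """x is a factor of y: x occurs as a contiguous block inside y."""
    return x in y

def is_subword(x, y):
    """x is a subword of y: x is obtained from y by striking out letters.

    The iterator is consumed left to right, so each letter of x must be
    found strictly after the previous match in y.
    """
    remaining = iter(y)
    return all(letter in remaining for letter in x)

print(is_factor("ore", "theorem"))     # True
print(is_factor("them", "theorem"))    # False
print(is_subword("them", "theorem"))   # True
print(is_subword("meh", "theorem"))    # False (letters are out of order)
```

Every factor is a subword, but not conversely, as the word them shows.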


The factor partial order has infinite antichains

For example, the set

{ab^n a : n ≥ 1} = {aba, abba, abbba, . . .}

is an infinite set in which no word is a factor of another.
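A quick sanity check of this antichain on an initial segment, as a Python sketch (the bound 30 is an arbitrary choice for the example):

```python
# The words a b^n a for n = 1, ..., 29.
words = ["a" + "b" * n + "a" for n in range(1, 30)]

# Collect every pair (u, v) of distinct words with u a factor of v.
violations = [(u, v) for u in words for v in words if u != v and u in v]
print(violations)   # []
```

The check only covers finitely many words, of course. The general reason is that each word has its only two a's at its ends, so a factor containing both a's must be the whole word.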


Higman-Haines theorem: the subword partial order has no infinite antichains

Write x ⊳ y for the partial order “x is a subword of y” and x ⊳/ y for “x is not a subword of y”.

Proof strategy: assume there is an infinite antichain.

This implies the weaker result that there is an infinite division-free sequence of words (f_i)_{i≥1}, i.e., a sequence of strings f_1, f_2, . . . such that i < j ⟹ f_i ⊳/ f_j.

Now iteratively choose a minimal such sequence, as follows:

- Let f_1 be a shortest word beginning an infinite division-free sequence;

- Let f_2 be a shortest word such that f_1, f_2 begins an infinite division-free sequence;

- Let f_3 be a shortest word such that f_1, f_2, f_3 begins an infinite division-free sequence; etc.


No f_i is empty, since the empty word is a subword of every word. So, by the pigeonhole principle, there exists an infinite subsequence of the f_i, say f_{i_1}, f_{i_2}, f_{i_3}, . . . , such that each of the strings in this subsequence starts with the same letter, say a.

Define x_j for j ≥ 1 by f_{i_j} = a x_j. Then

f_1, f_2, f_3, . . . , f_{i_1 − 1}, x_1, x_2, x_3, . . .

is an infinite division-free sequence which precedes (f_i)_{i≥1}, contradicting the supposed minimality of (f_i)_{i≥1}.

To see this, note that f_i ⊳/ f_j for 1 ≤ i < j < i_1 by assumption.

Next, if f_i ⊳ x_j for some i with 1 ≤ i < i_1 and j ≥ 1, then f_i ⊳ a x_j = f_{i_j}, a contradiction.

Finally, if x_j ⊳ x_k for some j < k, then a x_j ⊳ a x_k, and hence f_{i_j} ⊳ f_{i_k}, a contradiction. That completes the proof.


The difference between infinite and very large

Notice that although we have proved there are no infinite pairwise incomparable sets for the subword ordering, there are arbitrarily large such sets.

For example, the language {0, 1}^n consists of 2^n strings that are pairwise incomparable.
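The example can be confirmed for a small n with a brute-force Python sketch. The helper `is_subword` and the choice n = 4 are assumptions made for this illustration.

```python
from itertools import combinations, product

def is_subword(x, y):
    """True if x can be obtained from y by striking out 0 or more letters."""
    remaining = iter(y)
    return all(letter in remaining for letter in x)

n = 4
words = ["".join(bits) for bits in product("01", repeat=n)]

# Distinct words of the same length can never be subwords of one another.
pairwise_incomparable = not any(
    is_subword(u, v) or is_subword(v, u) for u, v in combinations(words, 2)
)
print(len(words), pairwise_incomparable)   # 16 True
```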


Two operations on languages

We now introduce two operations on languages, the subword and superword operations.

Let L ⊆ Σ∗.

We define

sup(L) = {x ∈ Σ∗ : there exists y ∈ L such that y ⊳ x}

sub(L) = {x ∈ Σ∗ : there exists y ∈ L such that x ⊳ y}
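For a finite language both operations are easy to experiment with. The sketch below is only an illustration under that finiteness assumption: `sub_finite` lists sub(L) explicitly (it is finite when L is), while `in_sup` is a membership test, since sup(L) is infinite whenever L is nonempty. Both names are hypothetical.

```python
from itertools import combinations

def is_subword(x, y):
    """True if x can be obtained from y by striking out 0 or more letters."""
    remaining = iter(y)
    return all(letter in remaining for letter in x)

def sub_finite(language):
    """sub(L) for a finite language L: every subword of every word in L."""
    result = set()
    for w in language:
        for size in range(len(w) + 1):
            for positions in combinations(range(len(w)), size):
                result.add("".join(w[i] for i in positions))
    return result

def in_sup(x, language):
    """Membership test for sup(L): some word of L is a subword of x."""
    return any(is_subword(y, x) for y in language)

L = {"ab", "ba"}
print(sorted(sub_finite(L)))   # ['', 'a', 'ab', 'b', 'ba']
print(in_sup("cacbc", L))      # True, because ab is a subword of cacbc
print(in_sup("aac", L))        # False
```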

Our goal is to prove that if L is any language, then sub(L) and sup(L) are regular.


Basics

Lemma. Let L ⊆ Σ∗. Then

(a) L ⊆ sup(L);

(b) L ⊆ sub(L);

(c) sub(L) = sub(sub(L)).


Minimal elements

Let R be a partial order on a set S.

Then we say x ∈ S is minimal if

y R x ⟹ y = x

for all y ∈ S.


Basic properties of minimal elements

Let D(y) be the set {x ∈ S : xRy}.

Lemma. Let R be a partial order on a set S.

(a) If x, y are distinct minimal elements, then x, y are incomparable.

(b) Suppose the set D(y) is finite. Then there exists a minimal y′ such that y′ R y.


The result for sup

Lemma. Let L ⊆ Σ∗. Then there exists a finite subset M ⊆ L such that sup(L) = sup(M).

Proof. Let M be the set of minimal elements of L.

We proved that the elements of M are pairwise incomparable. Hence M is finite, since the subword order has no infinite antichains.

It remains to see that sup(L) = sup(M).

Clearly sup(M) ⊆ sup(L). Now suppose x ∈ sup(L).

Then there exists y ∈ L such that y ⊳ x. Since y has only finitely many subwords, part (b) of the lemma above gives a y′ ∈ M such that y′ ⊳ y.

Then y′ ⊳ y ⊳ x, and so x ∈ sup(M).
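For a finite language the set M of minimal elements can be computed by scanning the words from shortest to longest. This is a minimal sketch with a hypothetical helper `minimal_elements`; it says nothing about infinite languages, where no such scan terminates.

```python
def is_subword(x, y):
    """True if x can be obtained from y by striking out 0 or more letters."""
    remaining = iter(y)
    return all(letter in remaining for letter in x)

def minimal_elements(language):
    """Minimal elements of a finite language under the subword order.

    Words are visited shortest first, so any word lying below the current
    one has already been seen and is either minimal or has a minimal
    element below it.
    """
    minimal = []
    for w in sorted(set(language), key=lambda word: (len(word), word)):
        if not any(is_subword(m, w) for m in minimal):
            minimal.append(w)
    return minimal

L = ["2", "10", "12", "21", "102", "111", "122", "201", "212", "1002"]
M = minimal_elements(L)
print(M)   # ['2', '10', '111']

# Every word of L lies above some element of M, which is why sup(L) = sup(M).
print(all(any(is_subword(m, w) for m in M) for w in L))   # True
```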

The second lemma

Lemma. Let L ⊆ Σ∗. Then there exists a finite subset G ⊆ Σ∗ such that sub(L) = Σ∗ − sup(G).

Proof. Let T = Σ∗ − sub(L). I claim that T = sup(T).

Clearly T ⊆ sup(T).

Suppose sup(T) ⊈ T.

Then there exists an x ∈ sup(T) with x ∉ T.

Since T = Σ∗ − sub(L), this means x ∈ sub(L).

Since x ∈ sup(T), there exists y ∈ T such that y ⊳ x.

Hence y ∈ sub(sub(L)) = sub(L), by part (c) of the lemma under “Basics”.

But then y ∉ T, a contradiction.

Finally, by the previous lemma applied to T, there exists a finite subset G ⊆ T such that sup(G) = sup(T).

Then sup(G) = sup(T) = T = Σ∗ − sub(L), and so sub(L) = Σ∗ − sup(G).


The main result

Theorem. Let L be a language (not necessarily regular). Then both sub(L) and sup(L) are regular.

Proof.

Clearly sup(L) is regular if L = {w} for some single word w .

This is because if w = a_1 a_2 · · · a_k, then

sup({w}) = Σ∗ a_1 Σ∗ a_2 Σ∗ · · · Σ∗ a_k Σ∗.

Similarly, for any finite language F ⊆ Σ∗, sup(F) is regular because

sup(F) = ⋃_{w ∈ F} sup({w}).
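The regular expression for sup(F) can be written down mechanically. Here is a sketch using Python's standard `re` module; the function name `sup_regex` is hypothetical, and the sketch assumes F is a nonempty finite language over a given alphabet of single characters.

```python
import re

def sup_regex(finite_language, alphabet):
    """Compile a regex for sup(F) = union over w in F of S* a_1 S* ... a_k S*."""
    if not finite_language:
        raise ValueError("this sketch assumes a nonempty finite language")
    # S* as a character class repeated zero or more times.
    sigma_star = "[" + "".join(re.escape(ch) for ch in sorted(alphabet)) + "]*"
    branches = [
        sigma_star + "".join(re.escape(ch) + sigma_star for ch in w)
        for w in sorted(finite_language)
    ]
    return re.compile("|".join("(?:" + branch + ")" for branch in branches))

pattern = sup_regex({"2", "10", "111"}, "012")
print(pattern.fullmatch("1002") is not None)   # True  (contains the letter 2)
print(pattern.fullmatch("0111") is not None)   # True  (111 is a subword)
print(pattern.fullmatch("0011") is not None)   # False
```

`fullmatch` is used so that the whole input word, not just a prefix, must belong to the language.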


Now let L ⊆ Σ∗, and let M and G be the finite sets given by the two preceding lemmas.

Then sup(L) = sup(M), and so sup(L) is regular, since M is finite.

Also, sub(L) = Σ∗ − sup(G), and so sub(L) is regular, since G is finite and the regular languages are closed under complement. That completes the proof.


Representations of integers

We’ll represent integers in base k using the digits 0, 1, . . . , k − 1.

We’ll write (n)_k for the word giving the canonical representation of the integer n in base k (with no leading zeroes).

We’ll write [w]_k for the integer represented by the word w in base k (where w can have leading zeroes).
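The two conversions can be sketched in Python as follows. The names `to_base` and `from_base` are made up for this example, and two conventions are assumptions of the sketch rather than statements from the talk: digits beyond 9 are written as lowercase letters, and 0 is represented by the word 0.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base(n, k):
    """(n)_k: canonical base-k representation of n >= 0, no leading zeroes."""
    if n == 0:
        return "0"          # convention assumed for this sketch
    digits = []
    while n > 0:
        n, r = divmod(n, k)
        digits.append(DIGITS[r])
    return "".join(reversed(digits))

def from_base(w, k):
    """[w]_k: the integer represented by w in base k; leading zeroes allowed."""
    return int(w, k) if w else 0

print(to_base(13, 3))         # 111
print(from_base("0111", 3))   # 13
print(to_base(1089, 16))      # 441
```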


Minimal elements for the prime numbers

Consider the language

P3 = {2, 10, 12, 21, 102, 111, 122, 201, 212, 1002, . . .},

which represents the primes in base 3.

I claim that the minimal elements of P3 are {2, 10, 111}.

Clearly each of these is in P3 and no proper subword is in P3.

Now let x ∈ P3.

If 2 ⊳/ x, then x ∈ {0, 1}∗.

If further 10 ⊳/ x, then x ∈ 0∗1∗.


An example involving prime numbers

Since x represents a number, x cannot have leading zeroes.

It follows that x ∈ 1∗.

But the numbers represented by the strings 1 and 11 are not primes.

However, 111 represents the number 13, which is prime.

It now follows that

sup(P3) = Σ∗2Σ∗ ∪ Σ∗1Σ∗0Σ∗ ∪ Σ∗1Σ∗1Σ∗1Σ∗

where Σ = {0, 1, 2}.
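The claim about the minimal elements of P3 can be spot-checked numerically. The sketch below tests only the primes below 20000, an arbitrary bound chosen for the example, so it supports the argument above rather than replacing it. All helper names are hypothetical.

```python
def is_prime(n):
    """Trial division; adequate for the small numbers used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def to_base3(n):
    """Canonical base-3 representation of n >= 1."""
    digits = []
    while n > 0:
        n, r = divmod(n, 3)
        digits.append(str(r))
    return "".join(reversed(digits))

def is_subword(x, y):
    remaining = iter(y)
    return all(letter in remaining for letter in x)

claimed_minimal = ["2", "10", "111"]
primes_base3 = [to_base3(p) for p in range(2, 20000) if is_prime(p)]

each_is_prime = all(is_prime(int(m, 3)) for m in claimed_minimal)
covers_sample = all(
    any(is_subword(m, w) for m in claimed_minimal) for w in primes_base3
)
print(primes_base3[:10])
print(each_is_prime, covers_sample)   # True True
```

The first line printed reproduces the initial terms of P3 listed earlier.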

On the other hand, sub(P3) = Σ∗. This follows from Dirichlet’s theorem on primes in arithmetic progressions, which states that every arithmetic progression of the form (a + nb)_{n≥0} with gcd(a, b) = 1 contains infinitely many primes.


The base-10 case

THE PRIME GAME

Ask a friend to write down a prime number. Bet them that you can always strike out 0 or more digits to get a prime on this card.

2, 3, 5, 7, 11, 19, 41, 61, 89, 409, 449, 499, 881, 991,

6469, 6949, 9001, 9049, 9649, 9949, 60649,

666649, 946669, 60000049, 66000049, 66600049

©2007 - [email protected]
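The card can be tested against small primes with a short Python sketch. It checks three things: every number on the card is prime, no card entry is a subword of another, and every prime below 100000 (an arbitrary bound for the example) has some card entry as a subword. This is only a finite check, not a proof that the card is complete.

```python
from itertools import permutations

def is_prime(n):
    """Trial division; fast enough for numbers below 10^8."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def is_subword(x, y):
    remaining = iter(y)
    return all(letter in remaining for letter in x)

CARD = [2, 3, 5, 7, 11, 19, 41, 61, 89, 409, 449, 499, 881, 991,
        6469, 6949, 9001, 9049, 9649, 9949, 60649,
        666649, 946669, 60000049, 66000049, 66600049]
card_words = [str(n) for n in CARD]

all_prime = all(is_prime(n) for n in CARD)
incomparable = not any(is_subword(u, v) for u, v in permutations(card_words, 2))
you_win = all(
    any(is_subword(c, str(p)) for c in card_words)
    for p in range(2, 100000) if is_prime(p)
)
print(all_prime, incomparable, you_win)   # True True True
```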


Minimal elements for the primes in other bases

A computationally difficult problem! No algorithm is known that is guaranteed to halt.

There is a “sort-of” algorithm:

(1) M := ∅
(2) while (L ≠ ∅) do
(3)     choose x, a shortest string in L
(4)     M := M ∪ {x}
(5)     L := L − sup({x})

It’s hard to carry out step (5)!
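On a finite truncation of L every step is easy, which gives a way to experiment with the loop. The sketch below runs steps (1) to (5) on the decimal primes below 1000; the bound and the names `minimal_by_sieving` and `remaining` are assumptions of this example. It sidesteps exactly the difficulty just mentioned, because removing sup({x}) from a finite set is a simple filter.

```python
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def is_subword(x, y):
    remaining = iter(y)
    return all(letter in remaining for letter in x)

def minimal_by_sieving(language):
    """Steps (1)-(5) applied to a finite set of words standing in for L."""
    remaining = set(language)
    M = []                                                  # (1)
    while remaining:                                        # (2)
        x = min(remaining, key=lambda w: (len(w), w))       # (3)
        M.append(x)                                         # (4)
        # (5): drop everything in sup({x}), which includes x itself.
        remaining = {w for w in remaining if not is_subword(x, w)}
    return M

truncated = [str(p) for p in range(2, 1000) if is_prime(p)]
print(minimal_by_sieving(truncated))
# ['2', '3', '5', '7', '11', '19', '41', '61', '89', '409', '449', '499', '881', '991']
```

The output agrees with the entries of at most three digits on the prime game card.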

In practice we work with L′, a regular “over-approximation” of L, and we assume L′ is the union of sets of the form L_1 L_2∗ L_3, and use heuristics.

Heuristics

We have to rule out prime numbers in various regular languages.

One method is to find an N such that N divides each of the numbers [x L∗ z]_b.

You might think you have to check [x L^i z]_b for all i.

But in fact

Lemma. Let x, z ∈ Σ_b∗, and let L ⊆ Σ_b∗. Then N divides all numbers of the form [x L∗ z]_b iff N divides [xz]_b and all numbers of the form [x L z]_b.


Corollary. If 1 < gcd([xz]_b, [x y_1 z]_b, . . . , [x y_n z]_b) < [xz]_b, then all numbers of the form [x {y_1, y_2, . . . , y_n}∗ z]_b are composite.

Example: since gcd(49, 469) = 7, every number with base-10 representation of the form 46∗9 is divisible by 7 and hence composite.
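The gcd test in the corollary is straightforward to compute. The following Python sketch uses a hypothetical helper `shared_divisor` and then confirms the example for the first 50 members of the family 46∗9, which is a finite check only.

```python
from functools import reduce
from math import gcd

def shared_divisor(x, ys, z, base=10):
    """gcd of [xz]_b and [x y z]_b over all y in ys, as in the corollary."""
    values = [int(x + z, base)] + [int(x + y + z, base) for y in ys]
    return reduce(gcd, values)

g = shared_divisor("4", ["6"], "9")
print(g)   # 7

# Spot check: 49, 469, 4669, ... are all divisible by g.
family = [int("4" + "6" * n + "9") for n in range(50)]
print(all(value % g == 0 for value in family))   # True
```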


Other heuristics

Difference-of-squares factorization:

An example: since

[4 4^n 1]_16 = (4^(n+1) · 8 + 7)(4^(n+1) · 8 − 7) / 15,

it follows that all numbers of the form [4 4^n 1]_16 are composite.
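The identity can be verified with exact integer arithmetic. In this Python sketch the range of n is an arbitrary choice, and the variable names are made up for the example. The second check extracts an explicit nontrivial factor: since a · b = 15 · value, the part of a coprime to 15 must divide value.

```python
from math import gcd

identity_holds = True
has_proper_factor = True
for n in range(60):
    value = int("4" + "4" * n + "1", 16)      # [4 4^n 1]_16
    a = 8 * 4 ** (n + 1) + 7
    b = 8 * 4 ** (n + 1) - 7
    identity_holds &= (15 * value == a * b)

    factor = a // gcd(a, 15)                  # divides value because a*b = 15*value
    has_proper_factor &= (1 < factor < value and value % factor == 0)

print(identity_holds, has_proper_factor)      # True True
```

For n = 0 this gives [41]_16 = 65 = (39 · 25) / 15, with proper factor 13.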


Our results

We were able to find the minimal elements for the primes in all bases up to 16, and some additional bases up to 30.

Sometimes we had to do primality tests on very large numbers (with thousands of digits).

For primes of the form 4n + 3 in base 10, the set of minimal elements consists of 13 elements, with the largest having 19153 decimal digits! This was proved prime by Francois Morain.


Minimal elements for the composite numbers

By contrast, for computing the minimal elements for the composite numbers, there is an algorithm (Devillers).

Write S_b := { (n)_b : n ≥ 4 is composite }.

Theorem. Every minimal element of S_b is of length at most b + 2.

Proof.

Consider any word w of S_b of length ≥ b + 3. We show that some proper subword of w lies in S_b, so that w is not minimal.

Since there are only b distinct digits, some digit d occurs at least twice in w, so that dd ⊳ w.

If d > 1, the number [dd]_b is composite, as it is divisible by [11]_b but not equal to it.


If d = 0, then some nonzero digit c precedes it in w, so c00 ⊳ w, and [c00]_b = c · b^2, which is composite.

Finally, if no digit other than 1 is repeated, then 1 occurs at least four times in w (the other b − 1 digits account for at most b − 1 of its symbols), so 1111 ⊳ w, and [1111]_b = [11]_b · [101]_b, and hence is composite.


Minimal elements for the composite numbers, base 10

They are:

{4, 6, 8, 9, 10, 12, 15, 20, 21, 22, 25, 27, 30, 32, 33, 35, 50,

51, 52, 55, 57, 70, 72, 75, 77, 111, 117, 171, 371, 711, 713, 731}
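This list can be reproduced by brute force. The Python sketch below scans the composites below 10^4 in increasing order and keeps those with no earlier minimal element as a subword. The bound 10^4 is a choice made for this example and is smaller than the length bound b + 2 = 12 from the theorem, so the sketch only reproduces the list and is not by itself a proof of completeness.

```python
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def is_subword(x, y):
    remaining = iter(y)
    return all(letter in remaining for letter in x)

composites = [str(n) for n in range(4, 10 ** 4) if not is_prime(n)]

minimal = []
for w in composites:          # increasing numeric order, so lengths never decrease
    if not any(is_subword(m, w) for m in minimal):
        minimal.append(w)

print(len(minimal))           # 32
print(minimal)
```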


Some open problems

1. What are the minimal elements for the powers of 2, expressed in base 10? Probably

{1, 2, 4, 8, 65536}

but nobody knows how to prove this!

2. Are there infinitely many primes whose base-10 representation consists of all 1’s? The only known “repunit” primes are those of the form (10^p − 1)/9 for p = 2, 19, 23, 317, 1031. It seems likely that those for p = 49081, 86453, 109297, 270343 are also prime, but this has not been rigorously proven.


3. Is the following problem decidable? Given a finite automaton A accepting (say) numbers expressed in base 2, does A accept the base-2 representation of at least one prime number? By contrast, the same problem with “prime” replaced with “composite” is decidable.

4. Is the following even weaker variant decidable? Given a regular expression of the form xy∗z, does it represent the base-2 expansion of at least one prime number? If this were decidable, in principle we could determine if there exists another Fermat prime in addition to 2^(2^i) + 1 for i = 0, 1, 2, 3, 4. (Choose x = 1, y = 0, and z = 0^16 1.)
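The conjectured answer to open problem 1 is easy to explore experimentally. This Python sketch computes the minimal elements among the powers 2^0, . . . , 2^1999 only; the exponent bound is an arbitrary choice for the example, and agreement on a finite range is evidence, not a proof.

```python
def is_subword(x, y):
    remaining = iter(y)
    return all(letter in remaining for letter in x)

powers = [str(2 ** k) for k in range(2000)]

minimal = []
for w in powers:              # increasing order, so lengths never decrease
    if not any(is_subword(m, w) for m in minimal):
        minimal.append(w)

print(minimal)   # ['1', '2', '4', '8', '65536']
```

Put differently, within this range 65536 is the only power of 2 whose decimal digits avoid 1, 2, 4, and 8.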
