
Thesis for the Degree of Master of Science

A Linguist’s Survey of Pumping Lemmata

Johan Behrenfeldt

June 2009

Supervisor: Peter Ljunglöf, PhD


Abstract

This thesis gives a survey of pumping lemmata from a linguist's point of view and how they can be applied to natural languages. The survey includes definitions and proofs of pumping lemmata for regular languages, context-free languages, tree adjoining languages, and multiple context-free languages. Also, the grammars that generate these classes of languages are briefly described.

Using the pumping lemmata, a number of formal languages are classified. This systematization is used to describe the classifications of natural languages that have been argued by Chomsky (1957), Bar-Hillel and Shamir (1960), Postal (1964), Huybregts (1976), Manaster-Ramer (1987), Shieber (1985), Culy (1985), and Radzinski (1991). The counter-claims that have been raised against these arguments are also presented.

The survey shows that there is strong evidence that places natural languages outside the class of context-free languages and inside the class of mildly context-sensitive languages.

Keywords: pumping lemma, natural language


En lingvists studie av pumplemman (A Linguist's Survey of Pumping Lemmata)

Abstract: This thesis describes pumping lemmata from a linguist's perspective and how they can be used to describe natural languages. The thesis contains definitions of, and proofs for, pumping lemmata for regular languages, context-free languages, tree adjoining languages, and multiple context-free languages. The grammars that generate these classes of languages are also described briefly.

With the help of the pumping lemmata, a number of formal languages are classified. This systematization is used to describe the classifications of natural languages that have been argued for by Chomsky (1957), Bar-Hillel and Shamir (1960), Postal (1964), Huybregts (1976), Manaster-Ramer (1987), Shieber (1985), Culy (1985), and Radzinski (1991). The counter-arguments against these claims are also presented.

The thesis shows that there is strong evidence for placing natural languages outside the class of context-free languages and inside the class of mildly context-sensitive languages.

Keywords: pumping lemma, natural languages


Acknowledgements

I am deeply thankful for all the help and advice I have received from my supervisor Peter Ljunglöf. Without his support and guidance, this thesis would not have been what it is today.

Also, I would like to thank Karl Erland Gadelii for his inspiring courses on linguistics.

Finally, I would like to thank my wonderful wife Malin and my beautiful daughters Agnes and Elin for all their encouragement and patience.


Contents

1 Introduction
  1.1 Overview

2 Regular Languages
  2.1 Finite State Automatons
  2.2 Languages
  2.3 Pumping Lemma
  2.4 Expressivity
  2.5 Natural Language
    2.5.1 English

3 Context-Free Languages
  3.1 Phrase Structure Grammars
  3.2 Context-Free Grammars
  3.3 Languages
  3.4 Pumping Lemma
  3.5 Expressivity
  3.6 Natural Language
    3.6.1 English
    3.6.2 Mohawk
    3.6.3 Dutch
    3.6.4 Swiss German
    3.6.5 Bambara

4 Tree Adjoining Languages
  4.1 Tree Adjoining Grammars
  4.2 Languages
  4.3 Pumping Lemma
  4.4 Expressivity
  4.5 Natural Language
    4.5.1 Mandarin Chinese

5 Multiple Context-Free Languages
  5.1 Generalized Context-free Grammars
  5.2 Multiple Context-free Grammars
  5.3 Languages
  5.4 Pumping Lemma
  5.5 Expressivity
  5.6 Natural Language
    5.6.1 Mandarin Chinese

6 Conclusion


1 Introduction

In the field of linguistics, it is often interesting to determine whether a natural language is within a certain class of languages or not. The class to which a language belongs determines how easily the language can be computationally parsed and interpreted. Also, it gives insight into how humans learn and understand the language.

A pumping lemma constitutes a set of necessary conditions that has to be met by a language for it to belong to a class of languages, and it is normally used to determine whether a language is outside of a specific class of languages or not. Pumping lemmata are most often used in conjunction with formal languages but are still highly relevant to natural languages, since natural language phenomena can be transformed into formal language constructions using intersections and homomorphisms.

A formal language is a, possibly infinite, set of strings of finite length and is usually described by a formalism such as a grammar or an automaton. Formal languages are often classified using a classification system that was initiated by Chomsky (1956, p. 118). Since pumping lemmata have been developed for

• regular languages,
• context-free languages,
• tree adjoining languages,
• multiple context-free languages,

and equivalent classes, these are the language classes that will be the subject of this thesis.

1.1 Overview

The thesis begins with a survey of the class of regular languages in the section Regular Languages. It is shown that all finite languages are regular but that the languages {a^n b^n : n ≥ 0} and {ww^R : w ∈ {a, b}∗} are examples of languages that are not regular. This limitation of regular grammars is used to present claims of the English language not being regular.

Thereafter, the class of context-free languages is studied in the section Context-Free Languages. The languages which were proven not to be regular earlier are shown to be context-free. Also, the languages {ww : w ∈ {a, b}∗}, {a^n b^n c^n : n ≥ 0}, {a^m b^n c^m d^n : m, n ≥ 0}, and {x a^m b^n y c^m d^n z : m, n ≥ 1} are demonstrated not to be context-free. This latter classification is used to recount claims of English, Mohawk, Dutch, Swiss German, and Bambara not being context-free languages.

Next, mildly context-sensitive formalisms are researched in the following two sections, Tree Adjoining Languages and Multiple Context-Free Languages. All languages which were proven not to be context-free are shown to be tree adjoining languages. Furthermore, the languages {a^n b^n c^n d^n e^n : n ≥ 0}, {www : w ∈ {a, b}∗}, and {ab^h ab^i ab^j ab^k ab^l : h > i > j > k > l ≥ 1} are established not to be tree adjoining languages, which places Mandarin Chinese outside the class of tree adjoining languages according to the claims presented in the section Tree Adjoining Languages.

In the section Multiple Context-Free Languages, the class of multiple context-free languages is shown to include the languages that were proven not to be tree adjoining. Also, the language {a_1^n a_2^n … a_{2m}^n : n ≥ 1} is demonstrated to be an m-multiple context-free language whereas the language {a_1^n a_2^n … a_{2m+1}^n : n ≥ 1} is not. More importantly, the language {ab^{k_1} ab^{k_2} … ab^{k_{2m+1}} : k_1 > k_2 > … > k_{2m+1} ≥ 1} is established not to be a multiple context-free language, and this information is used to present claims of Mandarin Chinese not being a multiple context-free language.

Finally, concluding thoughts are presented in the section Conclusion.


2 Regular Languages

The set of regular languages over an alphabet Σ can be defined recursively as follows (Kleene 1956):

• the empty language, ∅, is a regular language;
• the empty string language, {ε}, is a regular language;
• for each a ∈ Σ, the singleton language {a} is a regular language;
• given the regular languages L1 and L2, L1 ∪ L2 (union), L1 · L2 (concatenation), and L1∗ (Kleene star¹) are regular languages; and
• no other language is a regular language.
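For example, over Σ = {a, b} the language of all strings that begin with a, that is {a} · ({a} ∪ {b})∗, is regular: it is built from the singleton languages {a} and {b} using union, Kleene star, and concatenation.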

2.1 Finite State Automatons

Another way of defining the class of regular languages is through the definition of finite state automatons. Kleene (1956) proved that every finite state automaton language is a regular language and vice versa. Finite state automatons are a type of automaton that was developed by Huffman (1954), Mealy (1955), and Moore (1956). They can be expressed as a tuple 〈Q, Σ, δ, q0, F〉 (Jurafsky and Martin 2008, p. 28) where:

• Q is a finite set of states,
• Σ is a finite input alphabet of symbols,
• δ is a transition function from Q × Σ to Q,
• q0 is the start state, and
• F is the set of final states, F ⊆ Q.

2.2 Languages

The language L(A) generated by a finite state automaton A = 〈Q, Σ, δ, q0, F〉 is defined by Kleene (1956, p. 81) as:

L(A) = {w : w is accepted by A}.

To determine whether a string w ∈ Σ∗ is accepted by A, w is used as input to A. The symbols of w are read sequentially starting at the first symbol, and the automaton starts in the initial state q0. In every step of the process, the current symbol of w is matched against the transitions that leave the current state. If a matching transition exists, the next symbol of w is processed at the state reached by the transition. If no transition is found for one of the input symbols, A rejects w; if all symbols of w have been processed, A accepts w exactly when the state it has reached is a final state.
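As an illustration only (not part of the thesis), this acceptance procedure can be sketched in Python; the transition-table representation, the state names, and the example automaton for (ab)∗ below are my own assumptions.

    # A minimal sketch, assuming the automaton is given as a transition table
    # mapping (state, symbol) pairs to states.
    def accepts(delta, q0, final, w):
        """Run the automaton <Q, Sigma, delta, q0, F> on the string w."""
        q = q0
        for symbol in w:
            if (q, symbol) not in delta:   # no matching transition: reject
                return False
            q = delta[(q, symbol)]
        return q in final                  # accept iff every symbol was read and the state is final

    # Example: an automaton for the regular language (ab)*
    delta = {("q0", "a"): "q1", ("q1", "b"): "q0"}
    print(accepts(delta, "q0", {"q0"}, "abab"))   # True
    print(accepts(delta, "q0", {"q0"}, "aba"))    # False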

¹ The Kleene star operator was introduced by Kleene (1956, p. 22). Given a set of strings V, V∗ is the smallest superset of V that contains the empty string, ε, and is closed under concatenation.


2.3 Pumping Lemma

To determine whether a language is regular or not, the following pumping lemma can be used. It constitutes a necessary but not sufficient condition for a language to belong to the class of regular languages. In other words, all regular languages satisfy the lemma, but it may also be satisfied by a non-regular language.

Theorem 1. If L is a regular language, then there is a constant p such that any string w ∈ L with |w| ≥ p can be expressed as w = xyz with substrings x, y, and z of w such that |y| ≥ 1, |xy| ≤ p, and x y^t z ∈ L for all t ≥ 0.

Proof. (Bar-Hillel, Perles, and Shamir 1961) Let L be accepted by the finite state automaton A = 〈Q, Σ, δ, q0, F〉 and let p be the number of states in A, that is, p = |Q|. Given a word w = a1 a2 … an, where a1, a2, …, an ∈ Σ, the computation of A on w is a sequence of transitions q_{i+1} = δ(q_i, a_{i+1}) where 0 ≤ i < n, q0, q1, …, qn ∈ Q, and qn ∈ F.

If n ≥ p, then by Dirichlet's pigeonhole principle, the sequence of transitions must contain two states, q_i and q_j, where 0 ≤ i < j ≤ p, such that q_i = q_j. The sequence of transitions spanning between the states q_i and q_j may be repeated t times for any t ≥ 0.

Setting x = a1 a2 … a_i, y = a_{i+1} a_{i+2} … a_j, and z = a_{j+1} a_{j+2} … a_n, it follows that x y^t z ∈ L for all t ≥ 0, where |xy| ≤ p and |y| ≥ 1.

2.4 Expressivity

Arguably, the most important set of regular languages is the set of finite languages.

Theorem 2. All finite languages are regular languages.

Proof. Any finite language can be created as a finite union of its strings, each of which is a concatenation of singleton languages; unions and concatenations of regular languages are regular.

Using the pumping lemma for regular languages (see Theorem 1), it is possible to exclude a number of infinite languages from the class of regular languages.

Theorem 3. The language L = {a^n b^n : n ≥ 0} is not a regular language.

Proof. Assume that L is a regular language and satisfies the pumping lemma for regular languages for some constant p, and let w = a^p b^p. Then, according to the pumping lemma, there exist substrings x, y, and z of w, where |y| ≥ 1 and |xy| ≤ p, such that x y^t z ∈ L for all t ≥ 0. Since |xy| ≤ p, y = a^q where 1 ≤ q ≤ p. However, if y is pumped more than once, the resulting word x y^t z = a^{p+(t−1)q} b^p will not be in L. This is a contradiction, and L cannot be a regular language.
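The contradiction can also be checked mechanically. The following Python sketch is an illustration only, not part of the proof, and assumes p = 5: it enumerates every decomposition w = xyz of a^p b^p with |xy| ≤ p and |y| ≥ 1 and verifies that pumping y once more always leaves the language.

    def in_L(s):
        """Membership test for L = {a^n b^n : n >= 0}."""
        n = len(s) // 2
        return s == "a" * n + "b" * n

    p = 5
    w = "a" * p + "b" * p
    for i in range(p + 1):              # |xy| <= p, so both split points lie in the a-block
        for j in range(i + 1, p + 1):   # |y| >= 1
            x, y, z = w[:i], w[i:j], w[j:]
            assert not in_L(x + y * 2 + z)   # pumping y once more breaks membership
    print("no admissible decomposition of", w, "survives pumping")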


Theorem 4. The mirror language L = {ww^R : w ∈ {a, b}∗} is not a regular language.

Proof. Assume that L is a regular language and satisfies the pumping lemma for regular languages for some constant p, and consider the string a^p b b a^p ∈ L. Then, according to the pumping lemma, there exist substrings x, y, and z of this string, where |y| ≥ 1 and |xy| ≤ p, such that x y^t z ∈ L for all t ≥ 0. Since |xy| ≤ p, y = a^q where 1 ≤ q ≤ p. However, if y is pumped more than once, the resulting word x y^t z = a^{p+(t−1)q} b b a^p will not be in L. This is a contradiction, and L cannot be a regular language.

2.5 Natural Language

Knowing that the aforementioned languages are not regular, it can be argued that natural languages are not regular.

2.5.1 English

Chomsky (1957, pp. 21–23) claimed that the English language is not regular by studying the following three sentence formats.

(1) If S1, then S2.
(2) Either S3, or S4.
(3) The man who said that S5, is arriving today.

Replacing "then" with "or" in the first sentence, "or" with "then" in the second sentence, or "is" with "are" in the third sentence would yield ungrammatical sentences. Also, in all three sentences, there is a dependency between the words on opposite sides of the comma ("if"/"then", "either"/"or", and "man"/"is"). Since it is possible to insert a declarative sentence between these dependent words in all three sentences, and the declarative sentence to be inserted may be one of the three sentences mentioned, it is possible to create nested sentences of arbitrary length, such as:

(4) If either the man who said that S5, is arriving today, or S4, then S2.

The nesting of the sentences, claims Chomsky (1957, p. 22), is an example of mirroring, and he concludes therefore that the English language cannot be regular, since the mirror language is not regular (see Theorem 4).

This claim is fallacious, since a regular language may very well contain non-regular sub-languages.

The argument can, however, be repaired by intersecting the English language with an adequate regular language. This is more easily shown using a different example, namely the following sentence:

(5) I know, that the cats that the dog chases hunt.


In order for the sentence to be grammatical, the subjects ("the cats"/"the dog") have to agree with the verbs ("hunt"/"chases"). Also, the sentence can be expanded indefinitely to create sentences of the format:

(6) I know, (that the cats that the dog)^n (chases hunt)^n.

Given the regular language R:

R = {I know, (that the cats that the dog)^i (chases hunt)^j : i, j ≥ 1},

and a language L in which sentences of the format in example (6) above are grammatical, the intersection of L and R is:

L ∩ R = {I know, (that the cats that the dog)^n (chases hunt)^n : n ≥ 1}.

This is a homomorphism of {x(ab)^n(ba)^n : n ≥ 1}, which is not regular (see Theorem 4), and therefore L cannot be regular. Consequently, since sentences of the format in example (6) above are grammatical in the English language, the English language cannot be a regular language.

One common counter-argument to Chomsky's claim is that humans are extremely bad at processing center-embedding² and that the embedding cannot be expanded indefinitely (e.g. Miller and Chomsky 1963).

Another common counter-argument to Chomsky's claim is that natural languages are finite. Chomsky (1957, pp. 23–24) raised this argument himself after his proof of English not being regular. At the same time, he challenged this counter-argument by saying that natural languages are productive in the sense that new sentences can always be constructed which do not belong to a finite list of all possible sentences. Moreover, a regular language containing all possible sentences as a list would be too extensive for any human to learn.

² Center-embedding is the process of embedding a phrase in the middle of another phrase of the same type.


3 Context-Free Languages

More expressive than the class of regular languages is the class of context-free languages. It is a superset of the class of regular languages, so any regular language is also context-free.

3.1 Phrase Structure Grammars

Phrase structure grammars were introduced by Chomsky (1956, p. 117) as a grammar formalism to generate context-free languages. His definition of a phrase structure grammar G, using a notation that is more commonly used today (e.g. Carnie 2008, pp. 71–72), was:

• a finite set N of non-terminal symbols,
• a finite set T of terminal symbols,
• a finite set P of production rules of the form (T ∪ N)∗ N (T ∪ N)∗ → (T ∪ N)∗, and
• a start symbol, S ∈ N,

where ∗ is the Kleene star operator (Kleene 1956, p. 22). In short, the grammar G can be expressed as the tuple 〈N, T, P, S〉.

3.2 Context-Free Grammars

For context-free grammars, the production rules are of the form A → γ | ε, where A ∈ N, γ is a string of terminals and non-terminals, and ε is the empty string (Chomsky 1956, p. 119).

Every context-free grammar can be transformed into an equivalent grammar in Chomsky Normal Form (Chomsky 1957). A grammar which is in Chomsky Normal Form only contains production rules of the forms A → BC, A → a, and S → ε, where A, B, C ∈ N and a ∈ T. Also, neither B nor C may be the start symbol.
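As a small illustration (my own example, not taken from the thesis), the grammar S → aSb | ε for {a^n b^n : n ≥ 0} can be rewritten in Chomsky Normal Form with a new start symbol S0 and the rules S0 → ε | AB | AX, S → AB | AX, X → SB, A → a, and B → b.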

3.3 Languages

The language L(G) generated by a context-free grammar G = 〈N, T, P, S〉 is defined by Chomsky (1956, p. 114) as:

L(G) = {α ∈ T∗ : S ⇒∗ α},

where ⇒∗ is the reflexive transitive closure of the relation ⇒, that is:

α ⇒∗ β iff α ⇒ … ⇒ β or α = β,

and the relation ⇒ is defined as:

αBγ ⇒ αβγ iff B → β.
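For instance, with the grammar S → aSb | ε (the grammar used later in Theorem 6), S ⇒ aSb ⇒ aaSbb ⇒ aabb, so S ⇒∗ aabb.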


3.4 Pumping Lemma

A necessary but not sufficient condition for a language to belong to the class of context-free languages is the following pumping lemma.

Theorem 5. If L is a context-free language, then there is a constant p such that any string w ∈ L with |w| ≥ p can be expressed as w = rstuv with substrings r, s, t, u, and v of w such that |su| ≥ 1, |stu| ≤ p, and r s^i t u^i v ∈ L for all i ≥ 0.
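As an illustration (my own example), for the context-free language {a^n b^n : n ≥ 0} and the string w = a^p b^p with p ≥ 2, one possible decomposition is r = a^{p−1}, s = a, t = ε, u = b, v = b^{p−1}; then |su| = 2 ≥ 1, |stu| = 2 ≤ p, and r s^i t u^i v = a^{p−1+i} b^{p−1+i} ∈ L for every i ≥ 0.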

Proof. (Bar-Hillel, Perles, and Shamir 1961) Let G be a grammar in Chomsky normal form that generates L, so that each production rule is either of the form A → BC or A → a, where A, B, C ∈ N and a ∈ T. Any parse tree of G with height h can at most produce 2^{h−1} leaf nodes, and any string w ∈ L will require a parse tree of height h ≥ 1 + log₂ |w|.

Given a string w ∈ L such that |w| ≥ 2^p, the parse tree Γ has to have a height

h ≥ 1 + log₂ |w| ≥ 1 + log₂ 2^p = p + 1,

which gives a longest path P from the root node of Γ to one of the leaf nodes that has a length greater than or equal to p + 1. Let P′ be the subpath of P consisting of the p + 1 last non-terminals of P.

If p is set to be equal to the size of the set of non-terminals N, then there must exist, by Dirichlet's pigeonhole principle, two nodes, γu and γl (where γu comes before γl) in P′, which represent the same non-terminal A ∈ N.

Figure 1: Parse tree for a context-free grammar (root S, longest path P, yield r s t u v, with the repeated non-terminal at the nodes γu and γl).

Then, t is set to correspond to the substring of w that is induced at γl. Since t will remain intact above γl, the string induced at γu can be expressed as stu, with s, u, and stu being substrings of w. Since P is the longest path in Γ, P′ has to be the longest subpath starting at γu. Therefore, the height of the subtree below γu is at most p + 1 and the maximum number of leaf nodes in the subtree is 2^p. It is therefore possible to conclude that:

|stu| ≤ 2^p. (1)

Also, since the grammar G is in Chomsky normal form, at least one of s and u must not be empty, which can be expressed as:

|su| ≥ 1. (2)

The substring stu will remain intact above γu, and w can therefore be expressed as w = rstuv.

Furthermore, the set of rules between γu and γl may be removed, since γu and γl both represent the same non-terminal symbol, which makes them interchangeable. Also, the set of rules between γu and γl can be repeated indefinitely, since the subtree of Γ that can be reached from γu without passing γl can fit between itself and the subtree of Γ that can be reached from γl. This can be expressed as:

r s^i t u^i v ∈ L, i ≥ 0. (3)

The lemma holds by (1), (2), and (3).

3.5 Expressivity

The class of context-free languages includes a great number of languages, including all regular languages. Examples of non-regular but context-free languages are the languages {a^n b^n : n ≥ 0} and {ww^R : w ∈ {a, b}∗}.

Theorem 6. The language L = {a^n b^n : n ≥ 0} is a context-free language.

Proof. The language L is generated by the context-free grammar S → aSb | ε.

Theorem 7. The mirror language L = {ww^R : w ∈ {a, b}∗} is a context-free language.

Proof. The language L is generated by the context-free grammar S → aSa | bSb | ε.

Using the pumping lemma for context-free languages (see Theorem 5), it is possible to prove that the copy language {ww : w ∈ {a, b}∗} and the languages {a^n b^n c^n : n ≥ 0}, {a^m b^n c^m d^n : m, n ≥ 0}, and {x a^m b^n y c^m d^n z : m, n ≥ 1} are not context-free.


Theorem 8. The copy language L = {ww : w ∈ {a, b}∗} is not a context-free language.

Proof. Assume that L is a context-free language and satisfies the pumping lemma for context-free languages for some constant p, and let w = a^p b^p. Then, according to the pumping lemma, there exist substrings r, s, t, u, and v of ww, where |su| ≥ 1 and |stu| ≤ p, such that r s^i t u^i v ∈ L for all i ≥ 0. Since |stu| ≤ p, the substring stu must overlap the string ww = a^p b^p a^p b^p in one of the following ways:

1. stu lies entirely within the first block of a's;
2. stu spans the boundary between the first block of a's and the first block of b's;
3. stu lies entirely within the first block of b's;
4. stu spans the boundary between the first block of b's and the second block of a's;
5. stu lies entirely within the second block of a's;
6. stu spans the boundary between the second block of a's and the second block of b's; or
7. stu lies entirely within the second block of b's.

Pumping the substrings s and u i times would yield the following results for the seven cases:

1. a^{p+i|su|} b^p a^p b^p;
2. a^{p+j} b^{p+i|su|−j} a^p b^p, 0 < j < i|su|;
3. a^p b^{p+i|su|} a^p b^p;
4. a^p b^{p+j} a^{p+i|su|−j} b^p, 0 < j < i|su|;
5. a^p b^p a^{p+i|su|} b^p;
6. a^p b^p a^{p+j} b^{p+i|su|−j}, 0 < j < i|su|; and
7. a^p b^p a^p b^{p+i|su|}, respectively.

None of the resulting strings are in L for i > 0. This is a contradiction, and L cannot be a context-free language.

Theorem 9. The language L = {a^n b^n c^n : n ≥ 0} is not a context-free language.

Proof. Assume that L is a context-free language and satisfies the pumping lemma for context-free languages for some constant p, and let w = a^p b^p c^p. Then, according to the pumping lemma, there exist substrings r, s, t, u, and v of w, where |su| ≥ 1 and |stu| ≤ p, such that r s^i t u^i v ∈ L for all i ≥ 0. Since |stu| ≤ p, stu cannot contain all of the terminal symbols a, b, and c. Therefore, if s and u are pumped more than once, the resulting word will not contain an equal number of a's, b's, and c's and will not be in L. This is a contradiction, and L cannot be a context-free language.

Theorem 10. The language L = {a^m b^n c^m d^n : m, n ≥ 0} is not a context-free language.


Proof. Assume that L is a context-free language and satisfies the pumping lemma for context-free languages for some constant p, and let w = a^p b^p c^p d^p. Then, according to the pumping lemma, there exist substrings r, s, t, u, and v of w, where |su| ≥ 1 and |stu| ≤ p, such that r s^i t u^i v ∈ L for all i ≥ 0. Since |stu| ≤ p, stu cannot contain both a's and c's, or both b's and d's. Therefore, if s and u are pumped more than once, the resulting word will not contain an equal number of a's and c's, or of b's and d's, and will not be in L. This is a contradiction, and L cannot be a context-free language.

Corollary 1. The language L = {x a^m b^n y c^m d^n z : m, n ≥ 1} is not a context-free language.

Proof. The proof is the same as that for Theorem 10.

3.6 Natural Language

Regardless of whether Chomsky's claim that English is not a regular language is correct or not, the center-embeddings which disqualified English from the class of regular languages can be generated by a context-free grammar (see Theorem 7).

Chomsky (1957) also conjectured that natural languages do not belong to the class of context-free languages. Many attempts were made in the 60's, 70's, and early 80's to prove this claim.

3.6.1 English

One of the first attempts to argue the non-context-freeness of natural languages was presented by Bar-Hillel and Shamir (1960). They used English sentences of the following structure to support their argument.

(7) John, Mary, and David are a widower, a widow, and a widower, respectively.

Bar-Hillel and Shamir (1960) stated that the sentences could be expanded indefinitely to create arbitrarily long sentences.

(8) John, Mary, David, . . . are a widower, a widow, a widower, . . . , respectively.

According to Bar-Hillel and Shamir (1960), such sentences are only grammatical if the gender of the proper names agrees with the gender of the words "widow"/"widower". Bar-Hillel and Shamir claim that this is an example of a copy language, which is not context-free (see Theorem 8), and hence English would not be context-free.

Daly (1974, pp. 57–60) was not convinced by the argument of Bar-Hillel and Shamir and gave a detailed critique of it. His foremost complaint was that Bar-Hillel and Shamir had not made a formal argument for their claim.


Langendoen (1977, pp. 4–5) tried to reconstruct the argument with a different example and a solid theoretical basis. He considered the regular language R, which he defined as follows:

R = {(the woman + the men)+ and (the woman + the men) (smokes + drink)+ and (smokes + drink) respectively}.

Then, Langendoen (1977) intersected the regular language R with the English language to receive a copy language, which is not context-free (see Theorem 8).

Pullum and Gazdar (1982, pp. 481–485) found Langendoen's conclusion to be erroneous both formally and empirically. They argue that the subjects and the verbs do not agree in the way Langendoen claims, and also that the number of subjects does not have to correspond to the number of verbs.

3.6.2 Mohawk

Postal (1964) claimed that the language Mohawk³ is not context-free by analyzing a certain type of verb construction, where a noun-stem is incorporated into the verb, and combining it with the nominalization of verbs.

In the type of verb construction studied by Postal (1964), the noun-stem that is incorporated into the verb comes from the subject if the verb is intransitive and from the direct object if the verb is transitive. The resulting internal structure of the verb is Prefixes Noun-stem Verb-stem Suffixes. An example of noun incorporation can be seen in examples (9) and (10), where the noun "house" is incorporated into the verb "be white".

(9) Ka-rakv ne Sawatis hrao-nuhs-a
    3NEUT-be white POSS Sawatis 3MASC-house-SUFF
    "Sawatis' house is white"

(10) Hrao-nuhs-rakv ne Sawatis
    3MASC-house-be white POSS Sawatis
    "Sawatis' house is white"

Postal (1964) also saw that a verb like "house-be white" could be nominalized to make a noun meaning "house-being white", and that these resulting nouns could, in turn, be incorporated into a verb. Furthermore, he claimed that the nominalization and verb incorporation can be repeated indefinitely, creating an infinite set of noun-stems.

Sometimes a verb with an incorporated subject (object) noun-stem occurs with an overt subject (object) noun phrase. In such cases the noun-stem in the verb must exactly match the noun-stem in the external noun phrase.

³ Mohawk is a Native American language spoken by the Mohawk nation in the United States and Canada.


This is string-copying over an infinite set of strings (the set of noun-stems), which makes Mohawk a copy language and not context-free.

There are, however, two errors in Postal's argument, as stated by Pullum and Gazdar (1982, p. 491):

1. Postal (1964, p. 146) claims that a language which consists only of strings of the form ww, where w ∈ T∗, is not context-free. He states that this has been proven by Noam Chomsky, but no such proof exists and cannot exist since it is not true⁴.

2. Postal (1964, p. 147) also claims that Mohawk is not context-free since it contains an infinite set of sentences with formal properties of the language {ww : w ∈ T∗}. This is not correct either, as context-free languages can contain non-context-free sub-languages.

3.6.3 Dutch

In a working paper, Huybregts (1976) reached the conclusion that the Dutch language is not context-free by analyzing the cross-serial dependencies that may arise in the language.

First, Huybregts (1976) saw that structures of the following type are grammatical in the Dutch language.

(11) . . . dat Jan Marie Piet de kinderen zag helpen laten zwemmen
    . . . that Jan Marie Piet the children saw help make swim
    ". . . that Jan saw Marie help Piet make the children swim"

The structure is of the form NP1 NP2 NP3 NP4 V1 V2 V3 V4.

Second, Huybregts (1976) concluded that the structure could be expanded indefinitely by adding more pairs of noun and verb phrases, but that the number of noun phrases and verb phrases had to be the same.

Finally, Huybregts (1976) claimed that the structure is an example of a cross-serial dependency of arbitrary length. Cross-serial dependencies of arbitrary length can be mapped by a homomorphism into the copy language L = {ww : w ∈ {a, b}∗}, which is not context-free (see Theorem 8).

Later, Pullum and Gazdar (1982) challenged Huybregts' claim and argued that syntactically the structure is merely NP^n V^n, which is context-free⁵.

This was, in turn, commented on by Bresnan et al. (1982), who agreed that context-free grammars may be sufficient for classifying Dutch sentences as grammatical or not, but not for the semantic interpretation of them.

Manaster-Ramer (1987) expanded on the previous work and looked at the following type of structure in the Dutch language.

⁴ E.g. Daly (1974, p. 68) showed that the language {ab^n ab^n : n ≥ 0}, which consists only of strings of the form ww, can be generated by a context-free grammar with the rewriting rules S → aZ, Z → bZb | a.

⁵ The structure can be generated by the following grammar: S → NP S V | ε


(12) Of Jan Piet Marie hoorde ontmoeten en zag omhelzen
    Q Jan Piet Marie heard meet and saw embrace
    "Did Jan hear Piet meet Marie and see [him] embrace [her]?"

The structure is of the form NP1 NP2 V1 V2 and V1 V2.

Manaster-Ramer (1987) claimed that the structure could be expanded indefinitely by adding more noun and verb phrases, but that the number had to be equal⁶.

Using a homomorphism to map the structure into the language L = {a^n b^n c^n : n ≥ 0}, which is not a context-free language (see Theorem 9), Manaster-Ramer (1987) concluded that the Dutch language is not context-free.

3.6.4 Swiss German

Shieber (1985) was the first to present new evidence of a non-context-free language after Pullum and Gazdar (1982) had disarmed all previous proofs of non-context-freeness of natural language. Shieber analyzed the language Swiss German⁷ and argued its non-context-freeness using four claims about Swiss German.

Shieber (1985) first claimed that subordinate clauses can have a structure where all the verbs follow all the noun phrases. Specifically, some sentences of the following format are grammatically correct:

(13) Jan säit das mer NP* es huus haend wele V* aastriiche
    Jan said that we NP* the house-ACC have wanted V* paint
    "Jan said that we have wanted to V* NP* paint the house"

The second claim by Shieber (1985) was that, among such sentences, those with all dative noun phrases preceding all accusative noun phrases, and all verbs requiring dative objects preceding all verbs requiring accusative objects, are acceptable. Specifically, some sentences of the following format are grammatically correct:

(14) Jan säit das mer (d'chind)* (em Hans)* es huus haend wele (laa)* (hälfe)* aastriiche
    Jan said that we (the children-ACC)* (Hans-DAT)* the house-ACC have wanted (let)* (help)* paint
    "Jan said that we have wanted to (let the children)* (help Hans)* paint the house"

⁶ Actually, Manaster-Ramer (1987) proved his claim for sentences where the number of verbs was equal to or greater than the number of noun phrases. He did this to address the fact that not all verbs need to take an object for a sentence to be grammatical.

⁷ Swiss German is any of the Alemannic dialects spoken in Switzerland and in some Alpine communities in Northern Italy.


The third claim by Shieber (1985) was that the number of verbs requiring dative objects must be equal to the number of dative noun phrases, and the number of verbs requiring accusative objects must be equal to the number of accusative noun phrases. For instance, the following sentence is not grammatically correct:

(15) * Jan säit das mer d'chind de Hans es huus haend wele lönd hälfe aastriiche
    Jan said that we the children-ACC Hans-ACC the house-ACC have wanted let help paint

The fourth and last claim by Shieber (1985) was that the number of verbs in a subordinate clause is constrained only by performance.

Shieber (1985) showed that any language L which satisfies the four claims above cannot be context-free, because if it is intersected with the regular language R, where:

R = {Jan säit das mer (d'chind)^h (em Hans)^i es huus haend wele (laa)^j (hälfe)^k aastriiche : h, i, j, k ≥ 1},

the result is:

L ∩ R = {Jan säit das mer (d'chind)^m (em Hans)^n es huus haend wele (laa)^m (hälfe)^n aastriiche : m, n ≥ 1},

and this is a homomorphism of {x a^m b^n y c^m d^n z : m, n ≥ 1}, which is not context-free (see Corollary 1).

Since Swiss German satisfies the four claims, Shieber (1985) concluded that Swiss German is not context-free.

Actually, Swiss German does not satisfy the third claim, which was pointed out by Manaster-Ramer (1988, p. 101), since not all verbs must have an object. The claim needs to be modified to state that the number of verbs requiring dative objects must be equal to or greater than the number of dative noun phrases, and the number of verbs requiring accusative objects must be equal to or greater than the number of accusative noun phrases.

3.6.5 Bambara

In 1981, Langendoen (1981, p. 320) stated that he knew of no natural language for which the word-formation component must be more powerful than a finite state automaton. In other words, he knew of no language where the morphology was more complex than a regular language.

Four years later, as a reply to Langendoen, Culy (1985) claimed that the language Bambara⁸ contains word-formation constructions that give rise to a morphology that is not even context-free.

⁸ Bambara is a Manding language spoken in Mali and neighboring countries.


Culy (1985) focused on the two word-formation constructions Noun o Noun, which translates into "whichever Noun" or "whatever Noun", and Noun + Transitive Verb + la, which translates into "one who Transitive Verbs Nouns". Utilizing the recursive nature of the second construction, Culy was able to construct more and more complex nouns, as can be seen in the following examples:

(16) wulu + nyini + la = wulunyinina
    dog + search + for
    "one who searches for dogs", i.e., "dog searcher"

(17) wulu + filè + la = wulufilèla
    dog + watch
    "one who watches dogs", i.e., "dog watcher"

(18) wulunyinina + nyini + la = wulunyininanyinina
    dog searcher + search + for
    "one who searches for dog searchers"

(19) wulunyinina + filè + la = wulunyininafilèla
    dog searcher + watch
    "one who watches dog searchers"

According to Culy (1985), the construction can be repeated indefinitely and also be combined with the first construction to create words such as:

(20) wulunyininanyinina o wulunyininanyinina
    "whoever searches for dog searchers"

(21) wulunyininafilèla o wulunyininafilèla
    one who watches dog searchers o one who watches dog searchers
    "whoever watches dog searchers"

Using the fact that the intersection of a context-free language and a regular language is a context-free language (Hopcroft and Ullman 1979, p. 135), Culy (1985) showed that Bambara cannot be context-free. He proved this by letting B be the vocabulary of Bambara and R be the regular language:

R = {wulu (filèla)^h (nyinina)^i o wulu (filèla)^j (nyinina)^k : h, i, j, k ≥ 1},

which gives the intersection:

B ∩ R = {wulu (filèla)^m (nyinina)^n o wulu (filèla)^m (nyinina)^n : m, n ≥ 1}.


Since B ∩ R is a homomorphism of {a^m b^n a^m b^n : m, n ≥ 1}, which is not context-free, Culy (1985) concluded that B is not context-free, and therefore, that the morphology of Bambara is not context-free.

Three years later, Manaster-Ramer (1988, pp. 101–102) criticized Culy's findings, as he saw no evidence of the duplicative constructions being in the morphology rather than in the syntax. This means that Manaster-Ramer challenged the claim that the morphology of Bambara is not context-free, and not the claim that the Bambara language itself is not context-free.


4 Tree Adjoining Languages

As the examples in the previous section show, it seems likely that natural languages are context-sensitive. This is a problem, since context-sensitive languages are bereft of some of the desirable properties⁹ of context-free languages. Therefore, linguists have tried to find formalisms that can model natural languages without losing these desirable properties.

Joshi (1985) proposed a class of languages in between the class of context-free and the class of context-sensitive languages with the following three properties:

• limited crossed dependencies,
• constant growth, and
• polynomial parsing,

and he named this intermediate class of languages mildly context-sensitive.

Independently of one another, four extensions of context-free grammars were developed to capture the expressivity of natural languages:

• Tree Adjoining Grammars by Joshi, Levy, and Takahashi (1975);
• Head Grammars by Pollard (1984);
• Linear Indexed Grammars by Gazdar (1988); and
• Combinatory Categorial Grammars by Steedman (1985, 1988).

Although the four formalisms are superficially different, Vijay-Shanker and Weir (1994) managed to prove that the four extensions are equivalent. It is, therefore, possible to focus solely on tree adjoining grammars and tree adjoining languages and extend the findings to the other formalisms.

4.1 Tree Adjoining Grammars

A tree adjoining grammar G is defined as (Joshi, Levy, and Takahashi 1975, p. 139):

• a finite set N of non-terminal symbols,
• a finite set T of terminal symbols,
• a finite set I of initial trees, and
• a finite set A of auxiliary trees.

In short, the grammar G can be expressed as the tuple 〈N, T, I, A〉.

Initial trees have the form in Figure 2 and auxiliary trees have the form in Figure 3, where X ∈ N and w, w1, w2 ∈ T∗. All the nodes in the frontier of an auxiliary tree are labelled by terminal symbols, except one, which is called the foot node of the tree and is labelled by the same non-terminal symbol as the root node of the tree.

⁹ E.g. the context-free languages can be parsed in polynomial time, while context-sensitive languages require non-polynomial time to be parsed.


Figure 2: An initial tree in a tree adjoining grammar (root X, yield w).

Figure 3: An auxiliary tree in a tree adjoining grammar (root X, with a foot node labelled X between the terminal strings w1 and w2).

Trees can be adjoined to derive new trees. Given the tree in Figure 4, the tree in Figure 3 can be adjoined at the node labelled with the non-terminal symbol X to create the tree in Figure 5.

Figure 4: An auxiliary tree in a tree adjoining grammar (root Y, with an interior node X and yield w3 w4 w5).

The nodes of the initial and auxiliary trees are usually associated with a selective adjoining constraint, which specifies which auxiliary trees can be adjoined at the node. A constraint which specifies that no trees may be adjoined at a node is called a null adjoining constraint, and a constraint which specifies that a tree must be adjoined at a node is called an obligatory adjoining constraint. When a tree is adjoined at a node η of another tree, the node η adopts the constraints of its counterpart in the latter tree, and the constraints of the rest of the nodes remain the same.


Figure 5: A derived tree in a tree adjoining grammar.

It is possible to extend tree adjoining grammars by allowing substitutions, without affecting the expressivity of the grammar (Joshi, Vijay-Shanker, and Weir 1989, p. 39). A substitution attaches a tree derived from an initial tree to a substitution node of a derived tree. Substitution nodes are leaf nodes that are labelled by non-terminal symbols and are flagged for substitution.

4.2 Languages

The language L(G) generated by a tree adjoining grammar G = 〈N, T, I, A〉 is defined by Vijay-Shanker, Weir, and Joshi (1986, p. 203) as:

L(G) = {w : w is the frontier of some γ ∈ T(G)},

where T(G) is the tree set:

T(G) = ⋃_{α∈I} D(α),

and D(γ) is the set of trees that can be derived from the initial or auxiliary tree γ using zero or more adjunctions.

4.3 Pumping Lemma

A necessary but not sufficient condition for a language to belong to the class of tree adjoining languages is the following pumping lemma.

Theorem 11. If L is a tree adjoining language, then there is a constant p such that any string w ∈ L with |w| ≥ p can be expressed as w = x w1 v1 w2 y w3 v2 w4 z with substrings x, y, z, v1, v2, w1, w2, w3, w4 ∈ T∗ such that |w1 w2 w3 w4| ≥ 1, |v1 v2 w1 w2 w3 w4| ≤ p, and x w1^i v1 w2^i y w3^i v2 w4^i z ∈ L for all i ≥ 0.
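As an illustration (my own example), for the tree adjoining language {a^n b^n c^n d^n : n ≥ 0} and the string w = a^q b^q c^q d^q with q ≥ 1, one decomposition satisfying the lemma is x = a^{q−1}, w1 = a, v1 = ε, w2 = b, y = b^{q−1} c^{q−1}, w3 = c, v2 = ε, w4 = d, z = d^{q−1}; then |w1 w2 w3 w4| = |v1 v2 w1 w2 w3 w4| = 4, and x w1^i v1 w2^i y w3^i v2 w4^i z = a^{q−1+i} b^{q−1+i} c^{q−1+i} d^{q−1+i} ∈ L for every i ≥ 0.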


Proof. (Vijay-Shanker (1987)) Let G be a tree adjoining grammar that generates L, and let m be the maximum length of the yield of any derivation tree in G such that no auxiliary tree γ ∈ A occurs twice on the same path in the derivation tree.

Given a word w ∈ L such that |w| ≥ m + 1, there must, by Dirichlet's pigeonhole principle, exist a path, Γ, in the derivation tree of w, that has two nodes, γu and γl (where γu comes before γl), which represent the same auxiliary tree.

The adjoinings between γu and γl may be removed, since γu and γl both represent the same auxiliary tree, which makes them interchangeable. Also, the adjoinings between γu and γl can be repeated indefinitely, since the subtree of Γ that can be reached from γu without passing γl can fit between itself and the subtree of Γ that can be reached from γl.

If w is expressed as w = x w1 v1 w2 y w3 v2 w4 z, the pumping of the adjoinings between γu and γl yields strings of the format x w1^i v1 w2^i y w3^i v2 w4^i z (i ≥ 0), that is:

x w1^i v1 w2^i y w3^i v2 w4^i z ∈ L, i ≥ 0. (4)

Also, if γu and γl are selected so that there are no other derived auxiliary trees adjoined into themselves below γl, there is a constant n such that:

|v1 v2 w1 w2 w3 w4| ≤ n. (5)

Finally, since |w| ≥ m, γl can be chosen so that at least one of the four pumped substrings w1, w2, w3, or w4 is non-empty, that is:

|w1 w2 w3 w4| ≥ 1. (6)

The lemma holds by (4), (5), and (6) if p = max(m+ 1, n).

4.4 Expressivity

Tree adjoining grammars are slightly more expressive than context-free grammars. Unlike context-free grammars, tree adjoining grammars are able to express the copy language {ww : w ∈ {a, b}∗} as well as the languages {a^n b^n c^n : n ≥ 0} and {a^n b^n c^n d^n : n ≥ 0}.

Theorem 12. The copy language L = {ww : w ∈ {a, b}∗} is a tree adjoining language.

Proof. The copy language L is generated by the tree adjoining grammar in Figure 6.

Theorem 13. The language L = {a^n b^n c^n : n ≥ 0} is a tree adjoining language.


Figure 6: A tree adjoining grammar for the copy language {ww : w ∈ {a, b}∗}.

Figure 7: A tree adjoining grammar for the language {a^n b^n c^n : n ≥ 0}.

Proof. The language L is generated by the tree adjoining grammar in Figure 7.
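To make the effect of repeated adjunction concrete, the following Python sketch uses a toy list-based tree encoding of my own; it is not the thesis's notation, and the auxiliary tree written below, with its foot node placed between b and c, is only one possible choice rather than necessarily the tree of Figure 7. Adjoining the auxiliary tree n times into the initial tree S → ε yields a^n b^n c^n.

    # A node is (label, children); a terminal leaf is a plain string; the foot
    # node of an auxiliary tree has a label ending in "*"; "_NA" marks a node
    # with a null adjoining constraint.
    def substitute_foot(aux, subtree):
        """Return a copy of aux with its foot node replaced by subtree."""
        if isinstance(aux, str):
            return aux
        label, children = aux
        if label.endswith("*"):
            return subtree
        return (label, [substitute_foot(c, subtree) for c in children])

    def adjoin_first(tree, aux, target="S"):
        """Adjoin aux at the outermost, leftmost node of tree labelled target."""
        if isinstance(tree, str):
            return tree, False
        label, children = tree
        if label == target:
            return substitute_foot(aux, tree), True
        new_children, done = [], False
        for c in children:
            if not done:
                c, done = adjoin_first(c, aux, target)
            new_children.append(c)
        return (label, new_children), done

    def frontier(tree):
        """Concatenate the terminal symbols in the yield of tree."""
        if isinstance(tree, str):
            return tree
        return "".join(frontier(c) for c in tree[1])

    initial = ("S", [""])                                    # initial tree S -> eps
    aux = ("S_NA", ["a", ("S", ["b", ("S*", []), "c"])])     # auxiliary tree, foot between b and c

    t = initial
    for _ in range(3):
        t, _ = adjoin_first(t, aux)
    print(frontier(t))                                       # -> aaabbbccc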

Theorem 14. The language L = {a^n b^n c^n d^n : n ≥ 0} is a tree adjoining language.

Proof. The language L is generated by the tree adjoining grammar in Figure 8.

Figure 8: A tree adjoining grammar for the language {a^n b^n c^n d^n : n ≥ 0}.

Moreover, Vijay-Shanker (1987) proved that tree adjoining languages are closed under intersection with regular languages. As a consequence, tree adjoining grammars can express the language {a^i b^j a^i b^j : i, j ≥ 0}.


Theorem 15. The language L = {a^i b^j a^i b^j : i, j ≥ 0} is a tree adjoining language.

Proof. Since the copy language {ww : w ∈ {a, b}∗} is a tree adjoining language, a∗b∗a∗b∗ is a regular language, and the intersection of a tree adjoining language and a regular language is a tree adjoining language (Vijay-Shanker 1987), it follows that

{ww : w ∈ {a, b}∗} ∩ a∗b∗a∗b∗ = {a^i b^j a^i b^j : i, j ≥ 0}.

Even though tree adjoining grammars are able to express some languages which cannot be expressed by context-free grammars, there are still many examples of languages which are not tree adjoining languages. One of them is the language {a^n b^n c^n d^n e^n : n ≥ 0}.

Theorem 16. The language L = {a^n b^n c^n d^n e^n : n ≥ 0} is not a tree adjoining language.

Proof. Assume that L is a tree adjoining language and satisfies the pumping lemma for tree adjoining languages for some constant p, and let w = a^{p+1} b^{p+1} c^{p+1} d^{p+1} e^{p+1}. Then, according to the pumping lemma, there exist substrings w1, w2, w3, and w4 of w, where at least one is non-empty, that can be pumped repeatedly into w to create new words in L. Since every word in L contains an equal number of five different terminal symbols, all five symbols must be pumped, so by Dirichlet's pigeonhole principle, one of the substrings w1, w2, w3, or w4 must contain more than one terminal symbol. However, if the substring with more than one terminal symbol is pumped more than once, the symbols will be interleaved and the resulting word will not be in L. This is a contradiction, and L cannot be a tree adjoining language.

Another language which cannot be expressed by tree adjoining grammars is the double copy language.

Theorem 17. The double copy language L = {www : w ∈ {a, b}∗} is not a tree adjoining language.

Proof. Assume that L is a tree adjoining language. Then, since tree adjoining languages are closed under intersection with regular languages (Vijay-Shanker 1987), L′ = L ∩ a∗b∗a∗b∗a∗b∗ = {a^n b^m a^n b^m a^n b^m : n, m ≥ 0} is also a tree adjoining language. Assume that L′ satisfies the pumping lemma for tree adjoining languages for some constant p, and let w′ = a^{p+1} b^{p+1} a^{p+1} b^{p+1} a^{p+1} b^{p+1}.

None of the substrings w1, w2, w3, or w4 can contain both a's and b's. Moreover, at least three of them must contain the same letter and must be inserted into the three different a^{p+1} blocks or the three different b^{p+1} blocks. This is a contradiction, since it would imply that either |v1| ≥ p + 1 or |v2| ≥ p + 1, and L cannot be a tree adjoining language.


Radzinski (1991) proved that the language {ab^h ab^i ab^j ab^k ab^l : h > i > j > k > l ≥ 1} is not a tree adjoining language.

Theorem 18. The language L = {ab^h ab^i ab^j ab^k ab^l : h > i > j > k > l ≥ 1} is not a tree adjoining language.

Proof. Assume that L is a tree adjoining language and satisfies the pumping lemma for tree adjoining languages for some constant p, and let w = ab^{p+4} ab^{p+3} ab^{p+2} ab^{p+1} ab^p. Then, according to the pumping lemma, there exist substrings w1, w2, w3, and w4 of w, where at least one is non-empty, that can be pumped repeatedly into w to create new words in L.

None of the substrings w1, w2, w3, or w4 can contain any a's, since pumping a's would yield strings that have b sections which are not ordered according to length. Thus, the substrings must contain only b's.

However, since there are five sections of b's in w and only four pumpable substrings, at least one of the sections of b's will not be pumped, and the requirement that h > i > j > k > l ≥ 1 will not be satisfied unless it is the right-most section of b's that is not being pumped. In that case, though, pumping down will result in a string that does not belong to L. This is a contradiction, and L cannot be a tree adjoining language.

As mentioned earlier, the same expressivity that applies to tree adjoining grammars also applies to head grammars, linear indexed grammars, and combinatory categorial grammars.

4.5 Natural Language

All of the examples of natural languages which are not context-free presented earlier can be expressed by tree adjoining grammars. There are, however, claims of constructions in natural languages which cannot be expressed by tree adjoining grammars either.

4.5.1 Mandarin Chinese

Radzinski (1991) argued that the language Mandarin Chinese is neither context-free nor a tree adjoining language by analyzing the names of numbers and how they are constructed in the language.

Radzinski (1991) focused specifically on how huge numbers are expressed in Mandarin Chinese. There are three words in Mandarin Chinese that express powers of ten:

wan: 10^4
yi: 10^8
zhao: 10^12,

and numbers that are exponentially greater than 10^12 are expressed by stringing together instances of zhao, as can be seen in the following example:


(22) wu zhao zhao wu zhao
    five trillion trillion five trillion

According to Radzinski (1991), the stringing can be repeated indefinitely to create words such as:

(23) wu zhao zhao zhao zhao zhao wu zhao zhao zhao zhao wu zhao zhao zhao wu zhao zhao wu zhao
    five trillion trillion trillion trillion trillion five trillion trillion trillion trillion five trillion trillion trillion five trillion trillion five trillion

Using the fact that the intersection of a tree adjoining language and a regular language is a tree adjoining language (Vijay-Shanker 1987), Radzinski (1991) showed that Mandarin Chinese cannot be a tree adjoining language. He proved this by letting NC be the subset of Mandarin Chinese consisting of the names of the numbers in the language and R be the regular language:

R = {wu (zhao)^h wu (zhao)^i wu (zhao)^j wu (zhao)^k wu (zhao)^l : h, i, j, k, l ≥ 1},

which gives the intersection:

NC ∩ R = {wu (zhao)^h wu (zhao)^i wu (zhao)^j wu (zhao)^k wu (zhao)^l : h > i > j > k > l ≥ 1}.

Since NC ∩ R is a homomorphism of {ab^h ab^i ab^j ab^k ab^l : h > i > j > k > l ≥ 1}, which is not a tree adjoining language (see Theorem 18), Radzinski (1991) concluded that NC is not a tree adjoining language, and therefore, that Mandarin Chinese is not a tree adjoining language.


5 Multiple Context-Free Languages

An even more expressive formalism, which is still mildly context-sensitive, was developed by Kasami, Seki, and Fujii in 1987. It was constructed as a subclass of generalized context-free grammars.

5.1 Generalized Context-free Grammars

In order to describe head grammars mathematically, Pollard (1984) introduced the formalism of generalized context-free grammars. It is a very expressive formalism, and Kasami, Seki, and Fujii (1987) have proved that it is powerful enough to express the recursively enumerable languages. Pollard defined a generalized context-free grammar as (with the notation from Seki et al. 1991, p. 194):

• a finite set N of non-terminal symbols,
• a finite set O of n-tuples over a finite set of symbols,
• a finite set F of partial functions from O × · · · × O to O,
• a finite set P of rewriting rules, and
• a start symbol, S ∈ N.

The rewriting rules in P are written as:

A → f[A1, A2, …, Aq]

where A, A1, A2, …, Aq ∈ N are non-terminal symbols and f ∈ F is a function from O^q to O. A rewriting rule is called a terminating rule if q = 0 and is written as:

A → θ, θ ∈ O.

Otherwise, the rewriting rule is called a non-terminating rule.

In short, a generalized context-free grammar G can be expressed as the tuple 〈N, O, F, P, S〉.

5.2 Multiple Context-free Grammars

Multiple context-free grammars were developed by Kasami, Seki, and Fujii (1987) as a subclass of generalized context-free grammars. Unlike generalized context-free grammars, all rewriting rules are defined as concatenations of constant strings and components of the arguments. An m-multiple context-free grammar can, therefore, be expressed as a generalized context-free grammar that satisfies the following conditions:

• O = ⋃_{i=1}^{m} (T∗)^i, where T is a finite set of terminal symbols.
• For each function f ∈ F, which takes a(f) arguments, there are positive integers r(f) and d_i(f) (1 ≤ i ≤ a(f)), where 1 ≤ r(f) ≤ m and 1 ≤ d_i(f) ≤ m, such that f is a function from (T∗)^{d_1(f)} × (T∗)^{d_2(f)} × · · · × (T∗)^{d_{a(f)}(f)} to (T∗)^{r(f)}.
• Functions f ∈ F are defined as concatenations of constant strings in T∗ and components of their arguments. That is,

  f_h[x̄_1, x̄_2, …, x̄_{a(f)}] = α_{h,0} x_{φ_f(h,1)} α_{h,1} x_{φ_f(h,2)} … x_{φ_f(h,v_h)} α_{h,v_h},

  where 1 ≤ h ≤ r(f), α_{h,j} ∈ T∗ (1 ≤ j ≤ v_h), and φ_f is a function from {(i, j) ∈ ℕ × ℕ : 1 ≤ i ≤ r(f), 1 ≤ j ≤ v_i} to {(i, j) ∈ ℕ × ℕ : 1 ≤ i ≤ a(f), 1 ≤ j ≤ d_i(f)}.
• For each non-terminal A ∈ N, there exists a positive integer d(A) such that all tuples that can be derived from A have exactly d(A) components.
• Every rewriting rule A → f[A1, A2, …, A_{a(f)}] in P must satisfy the conditions r(f) = d(A) and d_i(f) = d(A_i) (1 ≤ i ≤ a(f)).
• For the initial symbol S, d(S) = 1.

In this thesis, the focus is on linear multiple context-free grammars and not on parallel multiple context-free grammars. The difference between the two grammar formalisms is that the former requires all rewriting rules to be defined such that no component of any argument to the rule is repeated in the result of the rule.

Moreover, Seki et al. (1991, pp. 197–198) showed that any m-multiple context-free grammar can be mapped into an m-multiple context-free grammar that is non-erasing. As a consequence, all rewriting rules may be assumed to be defined such that every component of every argument to the rule is repeated at least once in the result of the rule.

According to Ljunglöf (2004, p. 60), linear and non-erasing multiple context-free grammars and linear context-free rewriting systems are equivalent formalisms. The latter is a formalism that was introduced by Weir (1988) and Vijay-Shanker, Weir, and Joshi (1987) and was developed independently of multiple context-free grammars. Since the formalisms are equivalent, the expressivity of multiple context-free grammars can be extended to linear context-free rewriting systems.

5.3 Languages

The language L(G) generated by a multiple context-free grammar G = ⟨N, O, F, P, S⟩ is defined by Seki et al. (1991, p. 195) as:

L(G) = L_G(S),

where L_G(A), for A ∈ N, is the smallest set that satisfies the following conditions:

• If a terminating rewriting rule A → θ is in P, then θ is in L_G(A).
• If a non-terminating rewriting rule A → f[A_1, A_2, ..., A_{a(f)}] is in P and θ_i is in L_G(A_i) for all 1 ≤ i ≤ a(f), then f[θ_1, θ_2, ..., θ_{a(f)}] is in L_G(A).
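The smallest-set definition can be approximated by iterating the two conditions until no new tuples (below a length bound) appear. The sketch below, in Python, does this for a hypothetical 2-multiple context-free grammar generating {a^n b^n c^n d^n : n ≥ 1}; the grammar and all names are illustrative, not taken from the thesis.

def g(x):                              # non-terminating rule X -> g[X]
    return ("a" + x[0] + "b", "c" + x[1] + "d")

def f(x):                              # non-terminating rule S -> f[X]
    return (x[0] + x[1],)

rules = [
    ("X", None, ("ab", "cd")),         # terminating rule X -> theta
    ("X", g, ("X",)),                  # all rules here take one argument
    ("S", f, ("X",)),
]

def languages(max_len=12):
    # Iterate the closure conditions; the length bound keeps the
    # approximation of the smallest sets L_G(A) finite.
    L = {"S": set(), "X": set()}
    changed = True
    while changed:
        changed = False
        for lhs, fun, rhs in rules:
            candidates = [rhs] if fun is None else [fun(t) for t in L[rhs[0]]]
            for t in candidates:
                if sum(map(len, t)) <= max_len and t not in L[lhs]:
                    L[lhs].add(t)
                    changed = True
    return L

print(sorted((t[0] for t in languages()["S"]), key=len))
# ['abcd', 'aabbccdd', 'aaabbbcccddd']   (all members up to the length bound)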


5.4 Pumping Lemma

A necessary but not sufficient condition for a language to belong to the class of multiple context-free languages is the following pumping lemma.

Theorem 19. If L is an m-multiple context-free language, then there is a constant p such that any string w ∈ L with |w| ≥ p can be expressed as w = r_0 s_1 t_1 u_1 r_1 ... s_m t_m u_m r_m, with substrings r_j ∈ T* (0 ≤ j ≤ m) and s_j, t_j, u_j ∈ T* (1 ≤ j ≤ m), such that

∑_{j=1}^{m} |s_j u_j| ≥ 1   and   r_0 s_1^i t_1 u_1^i r_1 ... s_m^i t_m u_m^i r_m ∈ L for all i ≥ 0.

Proof. (Seki et al. 1991, pp. 201–202) Let G be an m-multiple context-free grammar that generates L, where the maximum number of arguments of any of the rewriting rules in P is n. Any parse tree of G with height h can produce at most 2^{n(h−1)} leaf nodes, so any string w ∈ L requires a parse tree of height h ≥ 1 + (1/n) log_2 |w|.

Given a string w ∈ L such that |w| > 2^{np}, the parse tree Γ must have a height

h ≥ 1 + (1/n) log_2 |w| > 1 + (1/n) log_2 2^{np} = p + 1,

which gives a longest path P from the root node of Γ to one of the leaf nodes whose length is greater than or equal to p + 1. Let P′ be the subpath of P consisting of the last p + 1 non-terminals of P.

If we let p be equal to the size of N, then there must exist, by Dirichlet's pigeonhole principle, two nodes γ_u and γ_l (where γ_u comes before γ_l) in P′ which represent the same non-terminal.

The set of rewriting rules between γ_u and γ_l may be removed, since γ_u and γ_l represent the same non-terminal, which makes them interchangeable. The set of rewriting rules between γ_u and γ_l can also be repeated indefinitely, since the subtree of Γ that can be reached from γ_u without passing γ_l can fit between itself and the subtree of Γ that can be reached from γ_l.

Since the grammar G is (or can be mapped into a grammar that is) linear and non-erasing, each of the components of the input to γ_l will remain intact at γ_u, but they will have undergone a permutation. However, it is possible to find a constant δ such that, if the set of rewriting rules between γ_u and γ_l is repeated δ times, only cycles are left in the permutation. It is also possible to find a constant λ such that the positions of the components are the same after δ + iλ repetitions, for all i ≥ 0.

If the result at γ_l is expressed as the tuple ⟨t_{μ_l(1)}, t_{μ_l(2)}, ..., t_{μ_l(m)}⟩, then repeating the rewriting rules between γ_u and γ_l δ + iλ times yields the tuple ⟨s_1^i t_{μ_u(1)} u_1^i, s_2^i t_{μ_u(2)} u_2^i, ..., s_m^i t_{μ_u(m)} u_m^i⟩, where the functions μ_l and μ_u are permutations. The yield at the root node is therefore

r_0 s_1^i t_1 u_1^i r_1 ... s_m^i t_m u_m^i r_m, for all i ≥ 0.
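As an illustration of the decomposition in Theorem 19 (not part of Seki et al.'s proof), the sketch below checks one possible decomposition of a string from the 2-multiple context-free language {a^n b^n c^n d^n : n ≥ 1}, with the constant p assumed to be 4 for the sake of the example.

p = 4
w = "a" * p + "b" * p + "c" * p + "d" * p

# One decomposition w = r0 s1 t1 u1 r1 s2 t2 u2 r2 satisfying the lemma.
r0, s1, t1, u1, r1 = "", "a", "a" * (p - 1) + "b" * (p - 1), "b", ""
s2, t2, u2, r2 = "c", "c" * (p - 1) + "d" * (p - 1), "d", ""

def pump(i):
    return r0 + s1 * i + t1 + u1 * i + r1 + s2 * i + t2 + u2 * i + r2

def in_L(x):
    n = len(x) // 4
    return n >= 1 and x == "a" * n + "b" * n + "c" * n + "d" * n

assert pump(1) == w                       # i = 1 gives back w itself
print([in_L(pump(i)) for i in range(4)])  # [True, True, True, True]

Pumping with i = 0 corresponds to the "pumping down" used later in the proofs of Theorems 23 and 24.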


5.5 Expressivity

The expressivity of m-multiple context-free grammars depends on m. 1-multiple context-free grammars are by definition equivalent to context-free grammars, and 2-multiple context-free grammars were shown by Roach (1987) to be equivalent to head grammars. For these values of m, the expressivity follows from the earlier theorems.

Unlike tree adjoining grammars, multiple context-free grammars can express the double copy language {www : w ∈ {a, b}*} and the language {ab^h ab^i ab^j ab^k ab^l : h > i > j > k > l ≥ 1}.

Theorem 20. The double copy language L = {www : w ∈ {a, b}*} is a 3-multiple context-free language.

Proof. The grammar G = ⟨N, O, F, P, S⟩ is a 3-multiple context-free grammar that generates L, where:

• N = {S, X},
• O = ⋃_{i=1}^{3} (T*)^i,
• F = {f, g_1, g_2, θ},
• P = {S → f[X], X → g_1[X], X → g_2[X], X → θ},
• f[x_1, x_2, x_3] = x_1 x_2 x_3,
• g_1[x_1, x_2, x_3] = (x_1 a, x_2 a, x_3 a),
• g_2[x_1, x_2, x_3] = (x_1 b, x_2 b, x_3 b), and
• θ = (ε, ε, ε).
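Since the rule functions are given explicitly, a derivation can be traced mechanically. The following Python sketch is a direct, illustrative transcription of the grammar above; it builds the X-tuple for a given w by applying g_1 or g_2 once per symbol and then applies f.

def f(x):  return (x[0] + x[1] + x[2],)                 # S -> f[X]
def g1(x): return (x[0] + "a", x[1] + "a", x[2] + "a")  # X -> g1[X]
def g2(x): return (x[0] + "b", x[1] + "b", x[2] + "b")  # X -> g2[X]
theta = ("", "", "")                                    # X -> theta

def derive(w):
    t = theta
    for c in w:                   # one g-rule application per symbol of w
        t = g1(t) if c == "a" else g2(t)
    return f(t)[0]

print(derive("ab"))               # 'ababab'
print(derive("ab") == "ab" * 3)   # True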

Theorem 21. The language L = {ab^h ab^i ab^j ab^k ab^l : h > i > j > k > l ≥ 1} is a 5-multiple context-free language.

Proof. The grammar G = ⟨N, O, F, P, S⟩ is a 5-multiple context-free grammar that generates L, where:

• N = {S, X},
• O = ⋃_{i=1}^{5} (T*)^i,
• F = {f, g_1, g_2, g_3, g_4, g_5, θ},
• P = {S → f[X], X → g_1[X], X → g_2[X], X → g_3[X], X → g_4[X], X → g_5[X], X → θ},
• f[x_1, x_2, x_3, x_4, x_5] = x_1 x_2 x_3 x_4 x_5,
• g_1[x_1, x_2, x_3, x_4, x_5] = (x_1 b, x_2, x_3, x_4, x_5),
• g_2[x_1, x_2, x_3, x_4, x_5] = (x_1 b, x_2 b, x_3, x_4, x_5),
• g_3[x_1, x_2, x_3, x_4, x_5] = (x_1 b, x_2 b, x_3 b, x_4, x_5),
• g_4[x_1, x_2, x_3, x_4, x_5] = (x_1 b, x_2 b, x_3 b, x_4 b, x_5),
• g_5[x_1, x_2, x_3, x_4, x_5] = (x_1 b, x_2 b, x_3 b, x_4 b, x_5 b), and
• θ = (ab^5, ab^4, ab^3, ab^2, ab).

A general measure of the expressivity of an m-multiple context-free grammar is that it is able to express the language {a_1^n a_2^n ... a_{2m}^n : n ≥ 1}.


Theorem 22. The language L = {a_1^n a_2^n ... a_{2m}^n : n ≥ 1} is an m-multiple context-free language.

Proof. The grammar G = ⟨N, O, F, P, S⟩ is an m-multiple context-free grammar that generates L, where:

• N = {S, X},
• O = ⋃_{i=1}^{m} (T*)^i,
• F = {f, g, θ},
• P = {S → f[X], X → g[X], X → θ},
• f[x_1, x_2, ..., x_m] = x_1 x_2 ... x_m,
• g[x_1, x_2, ..., x_m] = (a_1 x_1 a_2, a_3 x_2 a_4, ..., a_{2m−1} x_m a_{2m}), and
• θ = (a_1 a_2, a_3 a_4, ..., a_{2m−1} a_{2m}).
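The same grammar can be written out for an arbitrary m. The sketch below is an illustrative transcription in Python, with the symbols a_1, ..., a_{2m} spelled "a1", "a2", ... purely for readability.

def make_grammar(m):
    sym = ["a%d" % k for k in range(1, 2 * m + 1)]
    theta = tuple(sym[2 * j] + sym[2 * j + 1] for j in range(m))

    def g(x):    # wraps component j+1 in a_{2j+1} ... a_{2j+2}
        return tuple(sym[2 * j] + x[j] + sym[2 * j + 1] for j in range(m))

    def f(x):    # concatenates the m components
        return ("".join(x),)

    return theta, g, f

def derive(m, n):
    theta, g, f = make_grammar(m)
    t = theta                      # theta already corresponds to n = 1
    for _ in range(n - 1):
        t = g(t)
    return f(t)[0]

print(derive(2, 3))                # 'a1a1a1a2a2a2a3a3a3a4a4a4'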

At the same time, the pumping lemma for multiple context-free languages (see Theorem 19) can be used to show that the language {a_1^n a_2^n ... a_{2m+1}^n : n ≥ 1} is not an m-multiple context-free language.

Theorem 23. The language L = {a_1^n a_2^n ... a_{2m+1}^n : n ≥ 1} is not an m-multiple context-free language.

Proof. Assume that L is an m-multiple context-free language that satisfies the pumping lemma for multiple context-free languages for some constant p, and let w = a_1^p a_2^p ... a_{2m+1}^p. Then, according to the pumping lemma, there exist substrings s_i and u_i (1 ≤ i ≤ m) of w, at least one of which is non-empty, that can be pumped repeatedly into w to create new strings in L.

Each of the substrings s_i and u_i must consist of a single type of symbol, since pumping a substring that contains two different symbols would produce a string in which the symbols are out of order. Since there are 2m + 1 different symbols in w and only 2m pumpable substrings, at least one of the symbols will not be pumped, so pumping up yields a string in which the symbols no longer all occur equally often. Such a string does not belong to L. This is a contradiction, and L cannot be an m-multiple context-free language.

Similarly, it can be shown that the language {ab^{k_1} ab^{k_2} ... ab^{k_{2m+1}} : k_1 > k_2 > ... > k_{2m+1} ≥ 1} is not an m-multiple context-free language.

Theorem 24. The language L = {ab^{k_1} ab^{k_2} ... ab^{k_{2m+1}} : k_1 > k_2 > ... > k_{2m+1} ≥ 1} is not an m-multiple context-free language.

Proof. Assume that L is an m-multiple context-free language that satisfies the pumping lemma for multiple context-free languages for some constant p, and let w = ab^{p+2m} ab^{p+2m−1} ... ab^{p+1} ab^p. Then, according to the pumping lemma, there exist substrings s_i and u_i (1 ≤ i ≤ m) of w, at least one of which is non-empty, that can be pumped repeatedly into w to create new strings in L.

None of the substrings s_i and u_i (1 ≤ i ≤ m) can contain any a's, since pumping a's would yield strings whose sections of b's are not ordered according to length. Thus, the substrings must contain only b's.


However, since there are 2m + 1 sections of b's in w and only 2m pumpable substrings, at least one of the sections of b's will not be pumped. Pumping up will then violate the requirement that k_1 > k_2 > ... > k_{2m+1} ≥ 1, unless every section that is not pumped lies to the right of every section that is pumped.

In that case, though, pumping down will result in a string that does not belong to L: the right-most pumped section shrinks by at least one b, while its unpumped right neighbour, which is only one b shorter in w, does not. This is a contradiction, and L cannot be an m-multiple context-free language.

5.6 Natural Language

5.6.1 Mandarin Chinese

Radzinski (1991) not only showed that Mandarin Chinese is not a tree adjoining language, but also that it cannot be represented by any multiple context-free grammar. He did this by defining the regular language:

R = {(wu zhao+)+},

and intersecting it with the subset of Mandarin Chinese, NC, consisting of the names of the numbers in the language. The result of the intersection is:

NC ∩ R = {wu (zhao)^{k_1} ... wu (zhao)^{k_n} : k_1 > ... > k_n ≥ 1}.

This is the image, under a homomorphism, of the language {ab^{k_1} ... ab^{k_n} : k_1 > ... > k_n ≥ 1}, which is not a multiple context-free language (see Theorem 24). Hence, Mandarin Chinese is not a multiple context-free language.
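The homomorphism is only implicit in the text; a natural choice is h(a) = wu and h(b) = zhao, which the short illustrative sketch below applies to a string of the form used in Theorem 24.

def h(word):
    # assumed homomorphism: a -> "wu", b -> "zhao" (separated by spaces)
    return " ".join("wu" if c == "a" else "zhao" for c in word)

# ab^3 ab^2 ab^1 satisfies k1 > k2 > k3 >= 1:
print(h("abbb" + "abb" + "ab"))
# wu zhao zhao zhao wu zhao zhao wu zhao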


6 Conclusion

Pumping lemmata are used to prove that a language does not belong to a certain class of languages. For a linguist, knowing which class of languages natural languages belong to gives insight into how humans learn and interpret language.

There is convincing evidence that moves natural language outside the class of context-free languages and into the class of mildly context-sensitive languages. A more expressive formalism does not seem to be justified, although Radzinski (1991) has argued that names of numbers in Mandarin Chinese cannot be fully expressed by either tree adjoining grammars or multiple context-free grammars.

Common to all of the examples of natural language constructions that lie outside the class of context-free languages is that they quickly become very difficult for humans to comprehend. At the same time, as Chomsky (1957, pp. 23–24) stated:

. . . the assumption that languages are infinite is made in order to simplify the description of these languages. If a grammar does not have recursive devices it will be prohibitively complex. If it does have recursive devices of some sort, it will produce infinitely many sentences.

In other words, natural language may not be as complex, in practice, as some constructions indicate. However, the way in which humans interpret and learn languages may very well be.


References

Bar-Hillel, Y. and Shamir, E. (1960). Finite-State Languages: Formal Representations and Adequacy Problems. In Bar-Hillel, Y. (ed.). Language and Information. Reading, MA: Addison Wesley. Pages 87–98.

Bar-Hillel, Y., Perles, M., and Shamir, E. (1961). On Formal Properties of Simple Phrase Structure Grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, vol. 14: 2, pages 143–172.

Bresnan, J., Kaplan, R., Peters, S., and Zaenen, A. (1982). Cross-Serial Dependencies in Dutch. Linguistic Inquiry, vol. 13, pages 613–636.

Carnie, A. (2008). Constituent Structure. Oxford: Oxford University Press.

Chomsky, N. (1956). Three Models for the Description of Language. IRE Transactions on Information Theory, vol. 2, pages 113–124.

Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton & Co.

Culy, C. (1985). The Complexity of the Vocabulary of Bambara. Linguistics and Philosophy, vol. 8, pages 345–351.

Daly, R. T. (1974). Applications of the Mathematical Theory of Linguistics. The Hague: Mouton & Co.

Gazdar, G. (1988). Applicability of Indexed Grammars to Natural Languages. In Reyle, U. and Rohrer, C. (eds.). Natural Language Parsing and Linguistic Theories. Dordrecht: Reidel. Pages 69–94.

Hopcroft, J. E. and Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison Wesley.

Huffman, D. A. (1954). The Synthesis of Sequential Switching Circuits. Journal of the Franklin Institute, vol. 257: 3–4, pages 161–190 and 257–303.

Huybregts, R. (1976). Overlapping dependencies in Dutch. Utrecht Working Papers in Linguistics, vol. 1, pages 24–65.

Joshi, A. K., Levy, L. S., and Takahashi, M. (1975). Tree Adjunct Grammars. Journal of Computer and System Sciences, vol. 10, pages 136–163.

Joshi, A. K. (1985). Tree Adjoining Grammars: How Much Context-Sensitivity is Required to Provide Reasonable Structural Descriptions? In Dowty, D. R., Karttunen, L., and Zwicky, A. M. (eds.). Natural Language Parsing: Psychological, Computational, and Theoretical Perspectives. New York, NY: Cambridge University Press. Pages 206–250.


Joshi, A. K., Vijay-Shanker, K., and Weir, D. J. (1990). The Convergence of Mildly Context-Sensitive Grammar Formalisms. In Sells, P., Shieber, S. M., and Wasow, T. (eds.). Foundational Issues in Natural Language Processing. Cambridge, MA: MIT Press. Pages 31–81.

Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing (2nd Edition). Upper Saddle River, NJ: Prentice Hall.

Kasami, T., Seki, H., and Fujii, M. (1987). Generalized Context-Free Grammars, Multiple Context-Free Grammars and Head Grammars. Technical report. Osaka University.

Kleene, S. C. (1956). Representation of Events in Nerve Nets and Finite Automata. In Shannon, C. E. and McCarthy, J. (eds.). Automata Studies. Princeton, NJ: Princeton University Press. Pages 3–41.

Langendoen, D. T. (1977). On the Inadequacy of Type-3 and Type-2 Grammars for Human Languages. In Hopper, P. J. (ed.). Studies in Descriptive and Historical Linguistics: Festschrift for Winfred P. Lehmann. Amsterdam, Holland: John Benjamins. Pages 159–171.

Langendoen, D. T. (1981). The Generative Capacity of Word-Formation Components. Linguistic Inquiry, vol. 12, pages 320–322.

Ljunglöf, P. (2004). Expressivity and Complexity of the Grammatical Framework. PhD thesis. Chalmers University of Technology and Göteborg University. Göteborg, Sweden: Chalmers University of Technology.

Maclachlan, A. and Rambow, O. (2002). Cross-Serial Dependencies in Tagalog. Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+6); May 20–23 2002; Venice, Italy. Pages 100–106.

Manaster-Ramer, A. (1987). Dutch as a Formal Language. Linguistics and Philosophy, vol. 10, pages 221–246.

Manaster-Ramer, A. (1988). Review of Savitch, W. J., Bach, E., Marsh, W., and Safran-Naveh, G. (eds.). The Formal Complexity of Natural Language. Computational Linguistics, vol. 14: 4, pages 98–103.

Mealy, G. H. (1955). A Method for Synthesizing Sequential Circuits. Bell System Technical Journal, vol. 34: 5, pages 1045–1079.

Miller, G. A. and Chomsky, N. (1963). Finitary Models of Language Users. In Luce, R. D., Bush, R. R., and Galanter, E. (eds.). Handbook of Mathematical Psychology, vol. 2. New York, NY: Wiley. Pages 419–492.


Moore, E. F. (1956). Gedanken-Experiments on Sequential Machines. In Shannon, C. E. and McCarthy, J. (eds.). Automata Studies. Princeton, NJ: Princeton University Press. Pages 129–153.

Pollard, C. J. (1984). Generalized Phrase Structure Grammars, Head Grammars, and Natural Language. PhD thesis. Stanford University. Stanford, CA: Stanford University Press.

Postal, P. (1964). Limitations of Phrase Structure Grammars. In Fodor, J. A. and Katz, J. J. (eds.). The Structure of Language: Readings in the Philosophy of Language. Englewood Cliffs, NJ: Prentice Hall. Pages 137–151.

Pullum, G. K. and Gazdar, G. (1982). Natural Languages and Context-Free Languages. Linguistics and Philosophy, vol. 4, pages 471–504.

Radzinski, D. (1991). Chinese Number-Names, Tree Adjoining Languages, and Mild Context-Sensitivity. Computational Linguistics, vol. 17: 3, pages 277–299.

Roach, K. (1987). Formal Properties of Head Grammars. In Manaster-Ramer, A. (ed.). Mathematics of Language. Amsterdam, The Netherlands: John Benjamins. Pages 293–347.

Seki, H., Matsumura, T., Fujii, M., and Kasami, T. (1991). On Multiple Context-Free Grammars. Theoretical Computer Science, vol. 88: 2, pages 191–229.

Shieber, S. M. (1985). Evidence against the Context-Freeness of Natural Language. Linguistics and Philosophy, vol. 8, pages 333–343.

Steedman, M. J. (1985). Dependency and Coordination in the Grammar of Dutch and English. Language, vol. 61, pages 523–568.

Steedman, M. J. (1988). Combinators and Grammars. In Oehrle, R., Bach, E., and Wheeler, D. (eds.). Categorial Grammars and Natural Language Structures. Dordrecht: Reidel. Pages 417–442.

Vijay-Shanker, K., Weir, D. J., and Joshi, A. K. (1986). Tree Adjoining and Head Wrapping. Proceedings of the 11th Conference on Computational Linguistics; August 25–29 1986; Bonn, Germany. Pages 202–207.

Vijay-Shanker, K. (1987). A Study of Tree Adjoining Grammars. PhD thesis. University of Pennsylvania. Philadelphia, PA: University of Pennsylvania Press.

Vijay-Shanker, K., Weir, D. J., and Joshi, A. K. (1987). Characterizing Structural Descriptions Produced by Various Grammatical Formalisms. Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics; July 6–9 1987; Stanford, CA. Pages 104–111.


Vijay-Shanker, K. and Weir, D. J. (1994). The Equivalence of Four Extensions of Context-Free Grammars. Mathematical Systems Theory, vol. 27: 6, pages 511–546.

Weir, D. J. (1988). Characterizing Mildly Context-Sensitive Grammar Formalisms. PhD thesis. University of Pennsylvania. Philadelphia, PA: University of Pennsylvania.
