
Meaning Space Structure Determines the Stability of Culturally Evolved Compositional Language

Henry Brighton ([email protected])
Language Evolution and Computation Research Unit,

    Department of Theoretical and Applied Linguistics,

    The University of Edinburgh, Edinburgh, UK

Simon Kirby ([email protected])
Language Evolution and Computation Research Unit,

    Department of Theoretical and Applied Linguistics,

    The University of Edinburgh, Edinburgh, UK

    Abstract

Explanations for the evolution of compositional and recursive syntax have previously attributed these phenomena to the genetic evolution of the language acquisition device. Recent work in the field of computational evolutionary linguistics suggests that syntactic structure can instead be explained in terms of the dynamics arising from the cultural evolution of language. We build on this previous work by presenting a model of language acquisition based on the Minimum Description Length principle. Our Monte Carlo simulations show that the relative cultural stability of compositional language versus non-compositional language is greatest under conditions specific to hominids: a complex meaning space structure.

Introduction and Related Work

Human language differs greatly from other natural communication systems. Our use of compositional and recursive syntax places us in a unique position: we can comprehend and produce an ostensibly infinite number of utterances. Why are we alone in this position? Human language is a result of three adaptive systems: learning, genetic evolution, and cultural evolution. Over the past half century cognitive scientists have addressed the problem of learning. The past ten years have seen a resurgence of interest in the evolutionary basis of language (Pinker & Bloom, 1990). Only recently has the cultural evolution of language been seriously analysed. Hare & Elman (1995) outlined perhaps the first iterated learning model. The iterated learning model seeks to model the evolution of language through generations of language users, solely on the basis of each agent observing the behaviour of the previous generation (Kirby, in press b). Recent demonstrations of the cultural evolution of compositionality and recursive syntax (Kirby, in press a; Batali, in press) suggest that these properties of human language, traditionally attributed to genetic evolution, can in fact be explained as emergent properties arising from the dynamics of iterated learning. One criticism levelled at these models is that the learning bias of the individual agents is typically too strong: the observed behaviour is striking yet inevitable (Tonkes & Wiles, in press).

Here, we consider compositional syntax: the property of human language whereby the meaning of a signal is some function of the meaning of its parts. We address the criticisms of bias strength by employing the Minimum Description Length (MDL) principle, which rests on a solid mathematical justification for induction. We demonstrate that the relative stability of compositional language, with respect to non-compositional language, is at a maximum under two conditions specific to hominids: (a) a complex meaning space, and (b) limited language exposure, a situation commonly referred to as the poverty of the stimulus. Gell-Mann (1992) was perhaps the first to suggest the relevance of Kolmogorov complexity, which is closely related to MDL, to the study of language evolution. Our use of the MDL principle is similar to that of Teal et al. (1999), who model change in signal structure using the iterated learning model. Our model extends this work by considering the role of meanings, as well as allowing signals of arbitrary length. The structure of this article is as follows. First, we outline the MDL principle and introduce a novel hypothesis space. We then discuss issues of stability and learnability in the context of cultural evolution. Finally we illustrate the impact of meaning space complexity on the stability of compositional language. Our main goal is to establish properties of compositional language relative to a more sound model of linguistic generalisation.

Hypothesis Selection by MDL

Ranking potential hypotheses by minimum description length is a highly principled and very elegant approach to hypothesis selection (Li & Vitanyi, 1997). The MDL principle can be derived from Bayes's rule, and in short states that the best hypothesis for some observed data is the one that minimises the sum of (a) the encoding length of the hypothesis, and (b) the encoding length of the data when represented in terms of the hypothesis. A trade-off then exists between small hypotheses with a large data encoding length and large hypotheses with a small data encoding length. When the observed data contains no regularity, the best hypothesis is one that represents the data verbatim, as this minimises the data encoding length. However, when regularity does exist in the data, a smaller hypothesis is possible which describes the regularity, making it explicit, and as a result the hypothesis describes more than just the observed data. For this reason, the cost of encoding the data increases. MDL tells us the ideal trade-off between the length of the hypothesis encoding and the length of the data encoding described relative to the hypothesis. More formally, given some observed data $D$ and a hypothesis space $H$, the best hypothesis $h_{MDL}$ is defined as:


$$h_{MDL} = \arg\min_{h \in H} \big[ L_{C_1}(h) + L_{C_2}(D \mid h) \big] \qquad (1)$$

where $L_{C_1}(h)$ is the length in bits of the hypothesis $h$ when using an optimal coding scheme over hypotheses. Similarly, $L_{C_2}(D \mid h)$ is the length, in bits, of the encoding of the observed data using the hypothesis $h$. We use the MDL principle to find the most likely hypothesis for an observed set of meaning/signal pairs passed to an agent. When regularity exists in the observed language, the hypothesis will capture this regularity, when justified, and allow for generalisation beyond what was observed. By employing MDL we have a more theoretically solid justification for generalisation. The next sections will clarify the MDL principle: we introduce the hypothesis space and coding schemes.
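Stated as code, hypothesis selection under Equation 1 is simply an argmin over candidate hypotheses. The sketch below is ours rather than the authors' implementation; hypothesis_length and data_length are placeholders for the coding schemes $C_1$ and $C_2$ defined in the following sections.

    def mdl_select(hypotheses, data, hypothesis_length, data_length):
        """Return the hypothesis minimising L_C1(h) + L_C2(D|h) (Equation 1)."""
        return min(hypotheses,
                   key=lambda h: hypothesis_length(h) + data_length(data, h))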

    The Hypothesis Space

We introduce a novel model for mapping strings of symbols to meanings, which we term a Finite State Unification Transducer (FSUT). This model extends the scheme used by Teal et al. (1999) to include meanings and variable length strings. Given some observed data, the hypothesis space consists of all FSUTs which are consistent with the observed data. Both compositional and non-compositional languages can be represented using the FSUT model.

Throughout this paper, a meaning is defined as a set of features represented by a vector, with each feature taking a value. A meaning space profile describes the structure of a meaning space. For example, the meaning space profile (3,3) defines a meaning space with two dimensions, each dimension having three possible values. Signals are just strings of symbols (of arbitrary length) drawn from some alphabet $\Sigma$. A Finite State Unification Transducer is specified by a 6-tuple $\langle Q, \Sigma, F, \delta, q_0, q_F \rangle$, where $Q$ is the set of states used by the transducer, $\Sigma$ is the alphabet from which symbols are drawn, and $F$ is the meaning space profile which defines the structure of the meaning space. The transition function $\delta$ maps state/symbol pairs to a new state, along with the (possibly underspecified) meaning corresponding to that part of the transducer. Two states, $q_0$ and $q_F$, need to be specified; they are the initial and final state, respectively. Consider an agent A, which receives a set of meaning/signal pairs during acquisition. For example, a simple observed language might be the set:

$$L_1 = \{ \langle (2,1), \textit{cdef} \rangle,\ \langle (2,2), \textit{cdgh} \rangle,\ \langle (1,2), \textit{abgh} \rangle \}$$

Figure 1(a) depicts an FSUT which models $L_1$. We term this transducer the prefix tree transducer: the observed language, and only the observed language, is represented by the prefix tree transducer. The power of the FSUT model only becomes apparent when we consider possible generalisations made by merging states and edges in the transducer. Figure 1(c) shows a compressed transducer. Here, some of the states and meaning labels attached to the edges in the prefix tree transducer have been merged. There are two merge operations:

1. State Merge. Two states q1 and q2 can be merged if the transducer remains consistent. All edges that mention q1 or q2 now mention the new state.

2. Edge Merge. Two edges e1 and e2 can be merged if they share the same source and target states and accept the same symbol. The result of merging the two edges is a new edge with a new meaning label. Meanings are merged by finding the intersection of the two component meanings. Those features which do not have values in common take the value ?, a wildcard which matches all values. As fragments of the meanings may be lost, a check for transducer consistency is also required.

Figure 1(b) illustrates which state merge operations are applied to the prefix tree transducer in order to compress it. Figure 1 is a simple example, as the resulting transducer does not generalise: only the observed meaning/signal pairs can be accepted or produced.
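To make the two merge operations concrete, here is a minimal sketch in Python (our illustration, not the authors' implementation): edges are assumed to be (source, symbol, meaning, target) tuples, meanings are tuples of feature values, and '?' is the wildcard produced by the edge merge.

    WILDCARD = "?"

    def merge_meanings(m1, m2):
        """Edge merge: keep the feature values the two meanings share, wildcard the rest."""
        return tuple(a if a == b else WILDCARD for a, b in zip(m1, m2))

    def merge_states(edges, q1, q2, new_state):
        """State merge: every edge that mentions q1 or q2 now mentions new_state.

        A consistency check against the observed language is still required
        after either operation (not shown here).
        """
        relabel = lambda q: new_state if q in (q1, q2) else q
        return [(relabel(src), sym, meaning, relabel(dst))
                for (src, sym, meaning, dst) in edges]

    # Merging the edges g/{2 2} and g/{1 2} from Figure 1 yields g/{? 2}.
    assert merge_meanings((2, 2), (1, 2)) == (WILDCARD, 2)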

    Encoding Lengths

In order to apply the MDL principle we need an appropriate coding scheme for: (a) the hypotheses, and (b) the data using the given hypothesis. These schemes correspond to $L_{C_1}(h)$ and $L_{C_2}(D \mid h)$ introduced in Equation 1. The requirement for the coding scheme $C_1$ is that some machine can take the encoding of the hypothesis and decode it in such a way that a unique transducer results. Similarly, the coding of the data with respect to the transducer must describe the data uniquely.

To encode a transducer $T = \langle Q, \Sigma, F, \delta, q_0, q_F \rangle$ containing $n$ states and $m$ edges we must calculate the space required, in bits, of encoding a state ($S_{state} = \log_2 n$), a symbol ($S_{symbol} = \log_2 |\Sigma|$), and a meaning ($S_{meaning} = \sum_{i=1}^{|F|} \log_2 (F_i + 1)$). The encoding length of the transducer is then:

$$S_T = m \, (2 S_{state} + S_{symbol} + S_{meaning}) + S_{state}$$

which corresponds to encoding the transition function $\delta$ along with the identity of the accepting state. To enable the machine $M$ to uniquely decode this transducer we must also specify the lengths of the constituent parts. We term this part of the encoding the prefix block:

$$S_{prefix} = (S_{state} + 1) + (S_{symbol} + 1) + (S_{meaning} + 1)$$

To calculate $L_{C_1}(h)$ we then use the expression:

$$L_{C_1}(h) = S_{prefix} + S_T \qquad (2)$$
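As a sketch of the grammar coding scheme (our code, based on the reconstruction of the formulas above; reading the extra value per feature as the wildcard '?' is our assumption):

    import math

    def grammar_length(n_states, n_edges, alphabet_size, profile):
        """L_C1(h): prefix block plus transducer body (Equation 2), in bits."""
        s_state = math.log2(n_states)
        s_symbol = math.log2(alphabet_size)
        # Each feature can take one of its values or the wildcard '?'.
        s_meaning = sum(math.log2(v + 1) for v in profile)
        s_t = n_edges * (2 * s_state + s_symbol + s_meaning) + s_state
        s_prefix = (s_state + 1) + (s_symbol + 1) + (s_meaning + 1)
        return s_prefix + s_t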

The data encoding length is far simpler to calculate than the grammar encoding length. For some string $s$ composed of symbols $w_1 w_2 \ldots w_{|s|}$ we need to detail the transition we choose after accepting each symbol with respect to the given transducer. The list of choices made describes a unique path through the transducer.


Figure 1: (a) The prefix tree transducer. (b) The state merge operations required to induce the compressed transducer shown in (c). Edge labels take the form symbol/meaning, e.g. d/{2 1}.

Additional information is required when the transducer enters an accepting state, as the transducer could either accept the string or continue parsing characters (the accepting state might contain a loop transition). Given some data $D$ composed of $p$ meaning/signal pairs, $L_{C_2}(D \mid h)$ is calculated by:

$$L_{C_2}(D \mid h) = \sum_{i=1}^{p} \sum_{j=1}^{|s_i|} \log_2 \big( z_{ij} + F(s_i, j) \big) \qquad (3)$$

where $z_{ij}$ is the number of outward transitions from the state reached after parsing $j$ symbols of the $i$th string. The function $F$ handles the extra information for accepting states:

$$F(s_i, j) = \begin{cases} 1 & \text{when the transducer is in } q_F \\ 0 & \text{otherwise} \end{cases}$$
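Under this reconstruction the data encoding length can be sketched as follows (our code; the transducer interface with initial_state, accepting_state, outgoing and step is hypothetical):

    import math

    def data_length(transducer, data):
        """L_C2(D|h): bits needed to identify each observed string's path (Equation 3)."""
        total = 0.0
        for meaning, signal in data:
            state = transducer.initial_state
            for symbol in signal:
                state = transducer.step(state, symbol)
                # z: outward transitions from the state reached after this symbol;
                # one extra "stop here" option if it is the accepting state.
                z = len(transducer.outgoing(state))
                f = 1 if state == transducer.accepting_state else 0
                total += math.log2(z + f)
        return total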

Prefix tree transducers are compressed by applying the merge operators described above. We use a hill-climbing search: all the merge operators are applied in turn and the one which leads to the greatest reduction in $L_{C_1}(h) + L_{C_2}(D \mid h)$ is chosen. The process is repeated until this expression cannot be minimised further.
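The search is a simple greedy loop; a sketch assuming the helper functions above and a hypothetical candidate_merges(transducer) generator that enumerates all legal single state or edge merges:

    def compress(transducer, data):
        """Greedy hill climbing over merge operations, minimising L_C1 + L_C2."""
        def score(t):
            return (grammar_length(t.n_states, t.n_edges, t.alphabet_size, t.profile)
                    + data_length(t, data))

        current, current_score = transducer, score(transducer)
        while True:
            candidates = list(candidate_merges(current))
            if not candidates:
                return current
            best = min(candidates, key=score)
            best_score = score(best)
            if best_score >= current_score:
                return current  # no single merge reduces the description length further
            current, current_score = best, best_score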

    Iterated Learning

Cultural evolution transmits information down generations by non-genetic means. The cultural evolution of language results from language users inheriting the linguistic behaviour of previous generations. We model this process using the Iterated Learning Model (Kirby, in press b). Each generation consists of a single agent which observes the linguistic performance of the agent in the previous generation. This process is repeated over (usually) thousands of generations. Under conditions of perfect transmission, the language of each generation would be identical. This is not how human language works, as real language users suffer from the poverty of the stimulus: language users only ever see a fraction of possible utterances, yet are capable of producing and comprehending an ostensibly infinite number of utterances. Obviously, language users make generalisations from language they have observed. The sparsity of language exposure is modelled here using a communication bottleneck. The bottleneck is imposed on the agents in our simulations by restricting the number of utterances each generation observes. Initially this restriction results in each generation having a different language; the language changes down the generations: we have a dynamic system.

More precisely, the iterated learning model consists of a series of learners which are called on in turn to express a random subset of the meaning space to the next learner in the series. If the meaning space consists of $n$ different meanings, some number $m$ ($m \le n$) of distinct random meanings are observed by each agent, although the total number of meanings observed may be larger than $n$ as some are repeated. We are interested in language designs that result in stability. For stability to occur the whole mapping from meanings to signals must be recoverable from limited exposure: the language must be learnable.
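A sketch of one generation of this process (our code; induce_by_mdl and the agent's express method are placeholders for the MDL learner and the production procedure described above):

    import random

    def one_generation(previous_agent, meaning_space, bottleneck_size, induce_by_mdl):
        """One iterated-learning step: observe a random subset, induce, become the new agent."""
        # The learner only sees utterances for a random sample of meanings
        # (drawn with replacement, so some meanings repeat and many are never seen).
        sampled = [random.choice(meaning_space) for _ in range(bottleneck_size)]
        observations = [(m, previous_agent.express(m)) for m in sampled]
        return induce_by_mdl(observations)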

In the experiments that follow we analyse compositionality, the property of language in which the meaning of a sentence is a function of the meanings of its parts. Are compositional language designs stable? We contrast compositionality with non-compositionality, i.e., signals whose meaning is not a function of the meaning of its parts. In our model non-compositional languages are those where the mapping from meanings to signals is random.

    Compression and Learnability

The evolving language, as it passes through generations of language users, can be seen as a complex system. We are interested in the nature of steady state attractors: those languages which are stable and persist. One way of characterising stable languages is in terms of expressivity. If a language can express all possible meanings and is learnable then it will persist. Instead of modelling the full iterated learning model we try to establish the conditions for stability. To do this we construct two languages: some compositional language $L_{comp}$ and some random language $L_{noncomp}$. Through experimentation we identify the learnable transducers which possess the minimum description length, described above, for both types of language. We conduct Monte Carlo simulations to establish how expressiveness depends on the structure of the meaning space.

    Compositional Languages

To construct a compositional language each feature value is assigned a unique word which is used in forming the signal for meanings containing that feature value. Uniqueness is not a necessity for feature values: values occurring in other features can share the same word. For example, given a meaning space profile (3,3) we could construct the compositional language $L_2$:

$$L_2 = \{ \langle (1,1), \textit{aa} \rangle, \langle (1,2), \textit{ab} \rangle, \langle (1,3), \textit{ac} \rangle, \langle (2,1), \textit{ba} \rangle, \langle (2,2), \textit{bb} \rangle, \langle (2,3), \textit{bc} \rangle, \langle (3,1), \textit{ca} \rangle, \langle (3,2), \textit{cb} \rangle, \langle (3,3), \textit{cc} \rangle \}$$

For purposes of clarity single-symbol words are used in the signal to denote feature values. Variable length words could have been used.
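As an illustration of this construction (our code; it assumes one word per feature value, supplied as a per-feature word list, exactly as in $L_2$):

    from itertools import product

    def compositional_language(profile, words):
        """Map every meaning in the space to the concatenation of its feature-value words.

        profile: e.g. (3, 3); words[f][v-1] is the word for value v of feature f.
        """
        meanings = product(*(range(1, n + 1) for n in profile))
        return {m: "".join(words[f][v - 1] for f, v in enumerate(m)) for m in meanings}

    # Reproduces L2 for the (3,3) meaning space:
    L2 = compositional_language((3, 3), words=[["a", "b", "c"], ["a", "b", "c"]])
    assert L2[(2, 3)] == "bc" and L2[(3, 1)] == "ca"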

Figure 2: The MDL transducer for languages L2 and L3.

Which is the best hypothesis, according to MDL, for this data? Figure 2 depicts the MDL transducer which accepts this language. Consider the language $L_3 = L_2 \setminus \{ \langle (3,3), \textit{cc} \rangle \}$. Exactly the same transducer is induced by MDL when $L_3$ is observed. The transducer has generalised from the data to account for the missing sentence. $L_2$ is learnable, even when the learner is not exposed to all the sentences in $L_2$. The structure of the transducer depicted in Figure 2 is typical for a compositional language. These transducers are learnable from compositional input. From now on we term these transducers compressed transducers. Equivalent random languages are constructed by assigning random signals to each meaning. Occasionally the MDL transducer for random languages is smaller than the prefix tree transducer, but correct generalisation cannot occur. Below, we analyse the properties of these two families of transducer.

Meaning Space Structure Affects Learnability and Stability

In this section we investigate how meaning space structure impacts on the learnability and stability of compositional and non-compositional languages. We consider two types of transducer: compressed transducers and prefix tree transducers. By carrying out Monte Carlo simulations, we show how (a) the size of the communication bottleneck, and (b) the meaning space structure, affect the degree of language stability with respect to the iterated learning model.

    Preliminaries

Usually the proportion of the meaning space expressed by an agent at each generation is given by the number of random meanings the agent must express. The following analysis requires a more concrete measure of the degree of exposure to the meaning space. For example, given a meaning space composed of $n$ meanings and a bottleneck size of $m$, significantly fewer than $n$ distinct meanings will be observed. Below, we measure the bottleneck size in terms of meaning space coverage, which is just the expected proportion of the meaning space sampled when picking at random. The expected coverage, $c$, when picking $r$ elements at random (with replacement) from $n$ is:

$$c = 1 - \left( 1 - \frac{1}{n} \right)^r$$
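For example, for the (2,2) meaning space used in Figure 3 ($n = 4$ meanings), a bottleneck of $r = 6$ random picks gives an expected coverage of $1 - (3/4)^6 \approx 0.82$. A quick check (our code):

    def expected_coverage(n, r):
        """Expected fraction of an n-meaning space hit by r picks with replacement."""
        return 1 - (1 - 1 / n) ** r

    print(expected_coverage(4, 6))   # ~0.822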

Stability

A stable language is one which survives the communication bottleneck; it occurs when a transducer is induced with maximum expressivity. For example, Figure 3(a) shows that, given a compositional language, the compressed transducer reaches maximum expressivity after seeing only 20% of the meaning space. This is because exposure to feature values is required rather than exposure to whole meanings (recall the structure of compressed transducers shown in Figure 2). Maximum expressivity results when all the feature values have been observed, and as a result, induction can account for novel meanings whose individual feature values have already been seen. As one would expect, the expressivity of a prefix tree transducer, the best hypothesis for a non-compositional language, increases linearly with coverage. Here, expressivity depends on the exposure to whole meanings. In order for the entire meaning space to be communicated, an infinitely large bottleneck is required: prefix tree transducers will rarely result in maximum expressivity.

    Learnability

Given a compositional language as input, which transducer results in the smallest description length? In terms of transducer size, compressed machines will always be smaller, but what influence does the data encoding length have? Figure 3(b) shows the relative size of encoding lengths for the compressed transducer versus the prefix transducer. When the size difference lies above the baseline, compressed transducers are chosen; below the baseline, prefix tree transducers are chosen. Figure 3(b) illustrates that compressed machines are always preferable. This is not the case when we consider amplified bottlenecks, where we multiply the frequency of meanings passed through the bottleneck. For future work this observation is relevant, as we intend to investigate different probability distributions over the meaning space. Figure 3(c) shows the results for a 100-fold amplified bottleneck.

Figure 3: (a) Expressivity as a function of coverage for prefix tree transducers and compressed transducers. (b) The size advantage of compressed machines over prefix tree machines: the MDL search always chooses compressed machines. (c) How an amplified bottleneck affects learnability. A meaning space structure of (2,2) was used.

For small coverage values the prefix tree transducer is preferable. The MDL measure prefers the transducer which does not generalise: the less evidence we have of the sample space, the less we are justified in accurately postulating the existence of unseen members of the sample space. This justification for withholding generalisation becomes weaker the more we see of the sample space. MDL reflects this intuition by preferring compressed transducers at higher coverage values.

    Competing Languages

Above, we argued that compositional languages are more stable than non-compositional languages. In short, compression, and as a result generalisation and high expressivity, is only possible with compositional languages. Non-compositional languages, by definition, lack any form of regularity in the mapping between meanings and signals and are therefore far less compressible. We also illustrated that for compositional languages, high expressivity through compression is achievable for low meaning space coverage, as induction via compression is a function of degree of exposure to feature values, rather than whole meanings. This tells us that the size of the bottleneck which maximises the relative stability of compositional language versus non-compositional language is a function of meaning space structure.

Ultimately, we are interested in the question: under what circumstances is compositional language most likely to occur? But now we can reformulate the question: when is the relative stability of compositional languages versus non-compositional languages at a maximum? The relative stability of a compositional language over a non-compositional language can be measured by comparing the expressivity of the transducers chosen by MDL for each language type. For example, given some compositional language $L_{comp}$ we identify the most likely hypothesis on the basis of MDL. This gives us a transducer $T_{comp}$ with expressivity $E_{comp}$. Similarly, for a non-compositional language $L_{noncomp}$ we identify $T_{noncomp}$ with expressivity $E_{noncomp}$. The relative stability measure tells us how much of a stability advantage compositional language provides. We denote this quantity as $R$ and define it as:

$$R = \frac{E_{comp}}{E_{comp} + E_{noncomp}}$$
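As a worked example of this measure (our numbers, for illustration only): if the compressed transducer induced from $L_{comp}$ expresses the whole meaning space ($E_{comp} = 1.0$) while the prefix tree transducer induced from $L_{noncomp}$ expresses only the 20% of meanings it actually observed ($E_{noncomp} = 0.2$), then $R = 1.0 / 1.2 \approx 0.83$; a value of $R = 0.5$ means no stability advantage.

    def relative_stability(e_comp, e_noncomp):
        """R: stability advantage of the compositional language (0.5 = no advantage)."""
        return e_comp / (e_comp + e_noncomp)

    print(relative_stability(1.0, 0.2))   # ~0.83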

Figure 4: The relative expressivity, R, for a two-feature meaning space for different numbers of values per feature (relative stability of compressed machines vs. prefix machines).

Now, constructing compositional languages by fixing the number of features but varying the number of feature values, for different meaning space coverage values, and then measuring R, will provide an insight into how R depends on the meaning space structure. Figure 4 illustrates this dependency. The striking feature of these results is that compositionality is most likely, or most preferable, when the number of values per feature is large and the meaning space coverage is small.


The payoff, R, in the number of values per feature decreases as the number of values increases. Similar results occur when we fix the number of values per feature but increase the number of features. Again, small bottlenecks and many features lead to a large payoff when considering compositional languages. It appears that the more complex the meaning space, the higher the R value, especially for small bottlenecks. However, R does not increase linearly with meaning space complexity: the payoff achieved through increased meaning space complexity deteriorates.

Perhaps a more informative analysis results when we consider the problem of communicating about some fixed number of objects. For example, given the problem of describing 1000 objects, which meaning space structure leads to the occurrence of compositional language being most likely? Figure 5 shows that compositionality is most likely when the 1000 objects are discriminated more by features than by feature values. The greater the number of features, the smaller the number of observations required before all feature values are seen. Only when many feature values have been observed can induction justifiably be applied. However, as before, this payoff does not increase linearly with the number of features. After a point, more features offer little advantage.
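The intuition can be checked directly: the number of whole meanings grows as the product of the profile, while the number of distinct feature values a learner must encounter grows only as its sum. A quick sketch (our code, using two of the profiles labelled in Figure 5):

    from math import prod

    def space_stats(profile):
        """Return (number of whole meanings, number of distinct feature values)."""
        return prod(profile), sum(profile)

    # Many features with few values each cover a large space via few feature values.
    print(space_stats((2,) * 9))   # (512, 18)
    print(space_stats((10, 10)))   # (100, 20)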

Figure 5: The relative expressivity, R, for different ways of describing approximately 1000 objects. Meaning space structures: (2,2,2,2,2,2,2,2,2), (3,3,3,3,3), (6,6,6), and (10,10).

These results tell us when the stability of compositional language is at a maximum, in comparison to non-compositional language. These conditions provide the first steps of an explanation for the emergence of compositionality in human language. The poverty of stimulus coupled with the supposed complexity of the hominid mind are exactly the conditions under which these experiments predict compositional language is most likely to emerge.

    Conclusions

By providing a sound basis for induction we have addressed criticisms of the poorly justified, and arguably overly strong, inductive bias typical of earlier work on the cultural evolution of syntactic language. However, the chief point we aim to make is that compositionality is advantageous under conditions specific to hominids:

1. Complex meaning space structure: Hominids carve up their perceived environment into many features; at least, their perception is unlikely to be restricted to holistic experiences.

2. The poverty of the stimulus: A communication system with high expressivity is required if meanings drawn from a complex meaning space are to be communicated. However, limited exposure to this mass of possible utterances is all that is required for unlimited comprehension and production.

Using a mathematical model, Nowak, Plotkin, & Jansen (2000) present a similar argument with respect to the genetic evolution of syntactic structure: the complexity of the perceived environment leads to a pressure for syntax. Our argument is similar in spirit, but demonstrates that natural selection is not the only mechanism which can explain the emergence of syntax.

    References

Batali, J. (in press). The Negotiation and Acquisition of Recursive Communication Systems as a Result of Competition among Exemplars. In Briscoe, E. (Ed.), Linguistic Evolution through Language Acquisition: Formal and Computational Models. Cambridge University Press.

Gell-Mann, M. (1992). Complexity and Complex Adaptive Systems. In Hawkins, J. A., & Gell-Mann, M. (Eds.), The Evolution of Human Languages. Addison-Wesley.

Hare, M., & Elman, J. L. (1995). Learning and Morphological Change. Cognition, 56(1).

Kirby, S. (in press a). Learning, Bottlenecks and the Evolution of Recursive Syntax. In Briscoe, E. (Ed.), Linguistic Evolution through Language Acquisition: Formal and Computational Models. Cambridge University Press.

Kirby, S. (in press b). Spontaneous Evolution of Linguistic Structure: An Iterated Learning Model of the Emergence of Regularity and Irregularity. To appear in IEEE Transactions on Evolutionary Computation.

Li, M., & Vitanyi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications. New York: Springer-Verlag.

Nowak, M. A., Plotkin, J. B., & Jansen, V. A. A. (2000). The evolution of syntactic communication. Nature, 404, 495-498.

Pinker, S., & Bloom, P. (1990). Natural Language and Natural Selection. Behavioral and Brain Sciences, 13.

Teal, T., Albro, D., Stabler, E., & Taylor, C. E. (1999). Compression and Adaptation. Fifth European Conference on Artificial Life (pp. 709-719). Springer-Verlag.

Tonkes, B., & Wiles, J. (in press). Methodological Issues in Simulating the Emergence of Language. To appear in a volume arising from: Third Conference on the Evolution of Language, Paris, 2000.