
Statistical Methods in Historical Linguistics


Talk given at UCLA, March 2013


Page 1: Statistical Methods in Historical Linguistics

Statistical methods for Historical Linguistics

Robin J. Ryder

Centre de Recherche en Mathématiques de la Décision, Université Paris-Dauphine

7 March 2013, UCLA


Page 2: Statistical Methods in Historical Linguistics

Introduction

A large number of recent papers describe computationally-intensive statistical methods for Historical Linguistics

Increased computational power

Advances in statistical methodology

Complex linguistic questions which cannot be answered with traditional methods


Page 3: Statistical Methods in Historical Linguistics

Caveats

I am not a linguist

I am a statistician

These papers were not written by me; most figures were created by the papers' authors

I use the word ”evolution” in a broad sense

”All models are wrong, but some are useful”


Page 4: Statistical Methods in Historical Linguistics

Aims of this talk

Review of several recent papers on statistical models for Historical Linguistics

Statisticians won't replace linguists

When done correctly, collaborations between statisticians and linguists can provide useful results

Brief introduction to some major statistical concepts + statistical comments


Page 5: Statistical Methods in Historical Linguistics

Statistics slides

Different colour

Feel free to ignore the equations

Useful (essential?) if you wish to collaborate with statisticians

Limitations of statistical tools

Statistics should not be a black box


Page 6: Statistical Methods in Historical Linguistics

Advantages of statistical methods

Analyse (very) large datasets

Test multiple hypotheses

Cross-validation

Estimate uncertainty


Page 7: Statistical Methods in Historical Linguistics

Statistical method in a nutshell

1 Collect data

2 Design model

3 Perform inference (MCMC, ...)

4 Check convergence

5 In-model validation (is our inference method able to answer questions from our model?)

6 Model mis-specification analysis (do we need a more complex model?)

7 Conclude

In general, it is more difficult to perform inference for a more complex model.


Page 8: Statistical Methods in Historical Linguistics

Contents

1 Swadesh: glottochronology

2 Gray and Atkinson: Language phylogenies

3 Lieberman et al., Pagel et al.: Frequency of use

4 Perreault and Mathew, Atkinson: Phoneme expansion

5 Dunn et al.: Word-order universals

6 Bouchard-Cote et al: automated cognates

7 Overall conclusions

Tomorrow: phylogenetics for core vocabulary


Page 9: Statistical Methods in Historical Linguistics

Outline

1 Swadesh: glottochronology

2 Gray and Atkinson: Language phylogenies

3 Lieberman et al., Pagel et al.: Frequency of use

4 Perreault and Mathew, Atkinson: Phoneme expansion

5 Dunn et al.: Word-order universals

6 Bouchard-Cote et al: automated cognates

7 Overall conclusions


Page 10: Statistical Methods in Historical Linguistics

Swadesh (1952)

LEXICO-STATISTIC DATING OF PREHISTORIC ETHNIC CONTACTS

With Special Reference to North American Indians and Eskimos

MORRIS SWADESH

PREHISTORY refers to the long period of early human society before writing was available for the recording of events. In a few places it gives way to the modern epoch of recorded history as much as six or eight thousand years ago; in many areas this happened only in the last few centuries. Everywhere prehistory represents a great obscure depth which science seeks to penetrate. And indeed powerful means have been found for illuminating the unrecorded past, including the evidence of archeological finds and that of the geographic distribution of cultural facts in the earliest known periods. Much depends on the painstaking analysis and comparison of data, and on the effective reading of their implications. Very important is the combined use of all the evidence, linguistic and ethnographic as well as archeological, biological, and geological. And it is essential constantly to seek new means of expanding and rendering more accurate our deductions about prehistory.

One of the most significant recent trends in the field of prehistory has been the development of objective methods for measuring elapsed time. Where vague estimates and subjective judgments formerly had to serve, today we are often able to determine prehistoric time within a relatively narrow margin of accuracy. This development is important especially because it adds greatly to the possibility of interrelating the separate reconstructions.

Unquestionably of the highest value has been the development of radiocarbon dating.¹ This technique is based on W. F. Libby's discovery that all living substances contain a certain percentage of radioactive carbon, an unstable substance which tends to change into nitrogen. During the life of a plant or animal, new radiocarbon is continually taken in from the atmosphere and the percentage remains at a constant level. After death the percentage of radiocarbon is gradually dissipated at an essentially constant statistical rate. The rate of "decay" being constant, it is possible to determine the time since death of any piece of carbon by measuring the amount of radioactivity still going on. Consequently, it is possible to determine within certain limits of accuracy the time depth of any archeological site which contains a suitable bit of bone, wood, grass, or any other organic substance.

Lexicostatistic dating makes use of very different material from carbon dating, but the broad theoretical principle is similar. Researches by the present author and several other scholars within the last few years have revealed that the fundamental everyday vocabulary of any language - as against the specialized or "cultural" vocabulary - changes at a relatively constant rate. The percentage of retained elements in a suitable test vocabulary therefore indicates the elapsed time. Wherever a speech community comes to be divided into two or more parts so that linguistic change goes separate ways in each of the new speech communities, the percentage of common retained vocabulary gives an index of the amount of time that has elapsed since the separation. Consequently, wherever we find two languages which can be shown by comparative linguistics to be the end products of such a divergence in the prehistoric past, we are able to determine when the first separation took place. Before taking up the details of the method, let us examine a concrete illustrative instance.

The Eskimo and Aleut languages are by no means the same. An Eskimo cannot understand Aleut unless he learns the language like any other foreign tongue, except that structural similarities and occasional vocabulary agreements make the learning a little easier than it might otherwise be. The situation is roughly comparable to that of an English-speaking person learning Gaelic or Lithuanian. It has been shown that Eskimo and Aleut are modern divergent forms of an earlier single language.² In other words, the similarities be-

¹ See Radiocarbon dating, assembled by Frederick Johnson, Mem. Soc. Amer. Archaeol. 8, 1951.

² Concrete proof of this relationship has recently been presented in two independent studies: Knut Bergsland, Kleinschmidt Centennial IV: Aleut demonstratives and the Aleut-Eskimo relationship, Internat. Jour. Amer. Ling. 17: 167-179, 1951; Gordon Marsh and Morris

PROCEEDINGS OF THE AMERICAN PHILOSOPHICAL SOCIETY, VOL. 96, NO. 4, AUGUST 1952, 452



Page 11: Statistical Methods in Historical Linguistics

Question at hand

Aim: dating the MRCA (Most Recent Common Ancestor) of a pair of languages.
Data: "core vocabulary" (Swadesh lists). 215 or 100 words.


Page 12: Statistical Methods in Historical Linguistics

The Swadesh core-vocabulary list:

all, and, animal, ashes, at, back, bad, bark, because, belly, big, bird, bite, black, blood, blow, bone, breast, breathe, burn, child, claw, cloud, cold, come, count, cut, day, die, dig, dirty, dog, drink, dry, dull, dust, ear, earth, eat, egg, eye, fall, far, fat, father, fear, feather, few, fight, fire, fish, five, float, flow, flower, fly, fog, foot, four, freeze, full, give, good, grass, green, guts, hair, hand, he, head, hear, heart, heavy, here, hit, hold, horn, how, hunt, husband, I, ice, if, in, kill, knee, know, lake, laugh, leaf, left, leg, lie, live, liver, long, louse, man, many, meat, moon, mother, mountain, mouth, name, narrow, near, neck, new, night, nose, not, old, one, other, person, play, pull, push, rain, red, right (correct), right (side), river, road, root, rope, rotten, round, rub, salt, sand, say, scratch, sea, see, seed, sew, sharp, short, sing, sit, skin, sky, sleep, small, smell, smoke, smooth, snake, snow, some, spit, split, squeeze, stab, stand, star, stick, stone, straight, suck, sun, swell, swim, tail, ten, that, there, they, thick, thin, think, this, thou, three, throw, tie, tongue, tooth, tree, turn, two, vomit, walk, warm, wash, water, we, wet, what, when, where, white, who, wide, wife, wind, wing, wipe, with, woman, woods, worm, ye, year, yellow


Page 13: Statistical Methods in Historical Linguistics

Assumptions

Swadesh assumed that core vocabulary evolves at a constant rate (through time, space and meanings). Given a pair of languages with percentage C of shared cognates, and a constant retention rate r, the age t of the MRCA is

t = log C / (2 log r)

The constant r was estimated using a pair of languages for which the age of the MRCA is known.
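As a concrete illustration of the formula above (not part of the original slides), the sketch below plugs assumed numbers into Swadesh's dating rule; the retention rate r = 0.805 per millennium is the figure commonly associated with the 215-word list, and the cognate percentage is made up.

```python
# Illustrative sketch of Swadesh's dating formula t = log C / (2 log r).
# The retention rate and cognate percentage are assumptions, not data from the talk.
import math

def mrca_age(C, r):
    """Age of the MRCA, in millennia if r is the retention rate per millennium."""
    return math.log(C) / (2 * math.log(r))

# e.g. r = 0.805 per millennium and C = 60% shared cognates
print(round(mrca_age(C=0.60, r=0.805), 2), "millennia")  # about 1.18
```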


Page 14: Statistical Methods in Historical Linguistics

Issues with glottochronology

Many statistical shortcomings. Mainly:

1 Simplistic model

2 No evaluation of uncertainty of estimates

3 Only small amounts of data are used

See Bergsland and Vogt (1962) for a debunking of glottochronology.


Page 15: Statistical Methods in Historical Linguistics

What has changed?

More elaborate models + model misspecification analyses

We can estimate the uncertainty (⇒ easier to answer "I don't know")

Large amounts of data

Tomorrow: new analysis of Bergsland and Vogt’s data.


Page 16: Statistical Methods in Historical Linguistics

Outline

1 Swadesh: glottochronology

2 Gray and Atkinson: Language phylogenies

3 Lieberman et al., Pagel et al.: Frequency of use

4 Perreault and Mathew, Atkinson: Phoneme expansion

5 Dunn et al.: Word-order universals

6 Bouchard-Cote et al: automated cognates

7 Overall conclusions


Page 17: Statistical Methods in Historical Linguistics

Gray and Atkinson


Language-tree divergence times support the Anatolian theory of Indo-European origin
Russell D. Gray & Quentin D. Atkinson

Department of Psychology, University of Auckland, Private Bag 92019, Auckland 1020, New Zealand

Languages, like genes, provide vital clues about human history [1,2]. The origin of the Indo-European language family is "the most intensively studied, yet still most recalcitrant, problem of historical linguistics" [3]. Numerous genetic studies of Indo-European origins have also produced inconclusive results [4,5,6]. Here we analyse linguistic data using computational methods derived from evolutionary biology. We test two theories of Indo-European origin: the 'Kurgan expansion' and the 'Anatolian farming' hypotheses. The Kurgan theory centres on possible archaeological evidence for an expansion into Europe and the Near East by Kurgan horsemen beginning in the sixth millennium BP [7,8]. In contrast, the Anatolian theory claims that Indo-European languages expanded with the spread of agriculture from Anatolia around 8,000-9,500 years BP [9]. In striking agreement with the Anatolian hypothesis, our analysis of a matrix of 87 languages with 2,449 lexical items produced an estimated age range for the initial Indo-European divergence of between 7,800 and 9,800 years BP. These results were robust to changes in coding procedures, calibration points, rooting of the trees and priors in the bayesian analysis.



Page 18: Statistical Methods in Historical Linguistics

Swadesh lists, better analysed

Use Swadesh lists for 87 Indo-European languages, and a phylogenetic model from Genetics

Assume a tree-like model of evolution

Bayesian inference via MCMC (Markov Chain Monte Carlo)

Reconstruct trees and dates

Main parameter of interest: age of the root (Proto-Indo-European, PIE)


Page 19: Statistical Methods in Historical Linguistics

Lexical trees


Page 20: Statistical Methods in Historical Linguistics

Correcting the issues with glottochronology

Returning to the issues with glottochronology:

1 Simplistic model → Slightly better, but the model of evolution is rudimentary

2 No evaluation of uncertainty of estimates → Bayesian inference

3 Only small amounts of data are used → Large number of languages reduces variability of estimates


Page 21: Statistical Methods in Historical Linguistics

Why be Bayesian?

In the settings described in this talk, it usually makes sense to use Bayesian inference, because:

The models are complex

Estimating uncertainty is paramount

The output of one model is used as the input of another

We are interested in complex functions of our parameters


Page 22: Statistical Methods in Historical Linguistics

Frequentist statistics

Statistical inference deals with estimating an unknown parameter θ given some data D.

In the frequentist view of statistics, θ has a true fixed (deterministic) value.

Uncertainty is measured by confidence intervals, which are not intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20) for θ, I cannot say that there is a 95% probability that θ belongs to the interval [80 ; 120].

Frequentist statistics often use the maximum likelihood estimator: for which value of θ would the data be most likely (under our model)?

L(θ|D) = P[D|θ]

θ̂ = arg max_θ L(θ|D)
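As a numerical aside (not from the slides), here is what the maximum likelihood estimate and a Wald 95% confidence interval look like for the Bernoulli example used a few slides later (y = 72 successes in n = 100 trials):

```python
# Hypothetical check of the MLE and Wald 95% CI for a Bernoulli proportion.
import math

y, n = 72, 100
theta_hat = y / n                                   # value of theta maximising L(theta | D)
se = math.sqrt(theta_hat * (1 - theta_hat) / n)     # standard error of the MLE
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(f"MLE: {theta_hat:.2f}, 95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")  # roughly [0.63, 0.81]
```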



Page 24: Statistical Methods in Historical Linguistics

Bayesian statistics

In the Bayesian framework, the parameter θ is seen as inherently random: it has a distribution.

Before I see any data, I have a prior distribution π(θ), usually uninformative.

Once I take the data into account, I get a posterior distribution, which is hopefully more informative.

π(θ|D) ∝ π(θ)L(θ|D)

Different people have different priors, hence different posteriors. But with enough data, the choice of prior matters little.

We are now allowed to make probability statements about θ, such as "there is a 95% probability that θ belongs to the interval [78 ; 119]" (a credible interval)


Page 25: Statistical Methods in Historical Linguistics

Advantages and drawbacks of Bayesian statistics

More intuitive interpretation of the results

Easier to think about uncertainty

In a hierarchical setting, it becomes easier to take into account all the sources of variability

Prior specification: need to check that changing your prior does not change your result

Computationally intensive


Page 26: Statistical Methods in Historical Linguistics

Effect of prior

Example: Bernoulli model (biased coin). θ = probability of success.
Observe y = 72 successes out of n = 100 trials.
Frequentist estimate: θ̂ = 0.72. Confidence interval: [0.63, 0.81].
Bayesian estimate: will depend on the prior.
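A minimal sketch (not the talk's code) of how the prior affects the answer in this example, assuming conjugate Beta priors so the posterior is available in closed form; the specific priors below are illustrative choices, and SciPy is only used for the quantiles.

```python
# With a Beta(a, b) prior and y successes in n Bernoulli trials, the posterior
# is Beta(a + y, b + n - y). The priors below are illustrative assumptions.
from scipy.stats import beta

y, n = 72, 100
for a, b, label in [(1, 1, "uniform prior"),
                    (5, 5, "weak prior centred on 0.5"),
                    (50, 50, "strong prior centred on 0.5")]:
    post = beta(a + y, b + n - y)
    lo, hi = post.ppf([0.025, 0.975])   # 95% credible interval
    print(f"{label}: posterior mean {post.mean():.2f}, 95% credible interval [{lo:.2f}, {hi:.2f}]")
```

With 100 observations the uniform and weak priors essentially reproduce the frequentist answer, while the deliberately strong prior still pulls the estimate towards 0.5, which is what the next three figures illustrate.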


Page 27: Statistical Methods in Historical Linguistics

Effect of prior

[Figure: three panels showing prior (black), likelihood (green) and posterior (red) for y = 72, n = 100, under different priors.]


Page 28: Statistical Methods in Historical Linguistics

Effect of prior

[Figure: three panels showing prior (black), likelihood (green) and posterior (red) for y = 7, n = 10, under different priors.]


Page 29: Statistical Methods in Historical Linguistics

Effect of prior

[Figure: three panels showing prior (black), likelihood (green) and posterior (red) for y = 721, n = 1000, under different priors.]


Page 30: Statistical Methods in Historical Linguistics

Bayesian inference for lexical trees

The tree parameter is seen as random: it has a distribution

Via MCMC, G & A get a sample of possible trees, with associated probabilities, rather than a single tree

The uncertainty in trees is thus made explicit


Page 31: Statistical Methods in Historical Linguistics

G & A: Assumptions

Tree-like evolution

Genetics model

Independence of cognates

Constant rate of change


Page 32: Statistical Methods in Historical Linguistics

G & A: conclusions

Age of PIE: 7800-9800 BP (Before Present)

Large error bars, but this is a good thing

Reconstruct many known features of the tree of Indo-European languages

Little validation of the model, no model misspecification analysis

Much more on these data and more elaborate models and analyses tomorrow

These trees can also be used as a building block to answer other questions


Page 33: Statistical Methods in Historical Linguistics

Outline

1 Swadesh: glottochronology

2 Gray and Atkinson: Language phylogenies

3 Lieberman et al., Pagel et al.: Frequency of use

4 Perreault and Mathew, Atkinson: Phoneme expansion

5 Dunn et al.: Word-order universals

6 Bouchard-Cote et al: automated cognates

7 Overall conclusions


Page 34: Statistical Methods in Historical Linguistics

Lieberman et al. (2007)

LETTERS

Quantifying the evolutionary dynamics of language
Erez Lieberman, Jean-Baptiste Michel, Joe Jackson, Tina Tang & Martin A. Nowak

Human language is based on grammatical rules [1-4]. Cultural evolution allows these rules to change over time [5]. Rules compete with each other: as new rules rise to prominence, old ones die away. To quantify the dynamics of language evolution, we studied the regularization of English verbs over the past 1,200 years. Although an elaborate system of productive conjugations existed in English's proto-Germanic ancestor, Modern English uses the dental suffix, '-ed', to signify past tense [6]. Here we describe the emergence of this linguistic rule amidst the evolutionary decay of its exceptions, known to us as irregular verbs. We have generated a data set of verbs whose conjugations have been evolving for more than a millennium, tracking inflectional changes to 177 Old-English irregular verbs. Of these irregular verbs, 145 remained irregular in Middle English and 98 are still irregular today. We study how the rate of regularization depends on the frequency of word usage. The half-life of an irregular verb scales as the square root of its usage frequency: a verb that is 100 times less frequent regularizes 10 times as fast. Our study provides a quantitative analysis of the regularization process by which ancestral forms gradually yield to an emerging linguistic rule.

Natural languages comprise elaborate systems of rules that enable one speaker to communicate with another [7]. These rules serve to simplify the production of language and enable an infinite array of comprehensible formulations [8-10]. However, each rule has exceptions, and even the rules themselves wax and wane over centuries and millennia [11,12]. Verbs that obey standard rules of conjugation in their native language are called regular verbs [13]. In the Modern English language, regular verbs are conjugated into the simple past and past-participle forms by appending the dental suffix '-ed' to the root (for instance, infinitive/simple past/past participle: talk/talked/talked). Irregular verbs obey antiquated rules (sing/sang/sung) or, in some cases, no rule at all (go/went) [14,15]. New verbs entering English universally obey the regular conjugation (google/googled/googled), and many irregular verbs eventually regularize. It is much rarer for regular verbs to become irregular: for every 'sneak' that 'snuck' in [16], there are many more 'flews' that 'flied' out.

Although less than 3% of modern verbs are irregular, the ten most common verbs are all irregular (be, have, do, go, say, can, will, see, take, get). The irregular verbs are heavily biased towards high frequencies of occurrence [17,18]. Linguists have suggested an evolutionary hypothesis underlying the frequency distribution of irregular verbs: uncommon irregular verbs tend to disappear more rapidly because they are less readily learned and more rapidly forgotten [19,20].

To study this phenomenon quantitatively, we studied verb inflection beginning with Old English (the language of Beowulf, spoken around AD 800), continuing through Middle English (the language of Chaucer's Canterbury Tales, spoken around AD 1200), and ending with Modern English, the language as it is spoken today. The modern '-ed' rule descends from Old English 'weak' conjugation, which applied to three-quarters of all Old English verbs [21]. The exceptions - ancestors of the modern irregular verbs - were mostly members of the so-called 'strong' verbs. There are seven different classes of strong verbs with exemplars among the Modern English irregular verbs, each with distinguishing markers that often include characteristic vowel shifts. Although stable coexistence of multiple rules is one possible outcome of rule dynamics, this is not what occurred in English verb inflection [22]. We therefore define regularity with respect to the modern '-ed' rule, and call all these exceptional forms 'irregular'.

We consulted a large collection of grammar textbooks describing verb inflection in these earlier epochs, and hand-annotated every irregular verb they described (see Supplementary Information). This provided us with a list of irregular verbs from ancestral forms of English. By eliminating verbs that were no longer part of Modern English, we compiled a list of 177 Old English irregular verbs that remain part of the language to this day. Of these 177 Old English irregulars, 145 remained irregular in Middle English and 98 are still irregular in Modern English. Verbs such as 'help', 'grip' and 'laugh', which were once irregular, have become regular with the passing of time.

Next we obtained frequency data for all verbs by using the CELEX corpus, which contains 17.9 million words from a wide variety of textual sources [23]. For each of our 177 verbs, we calculated the frequency of occurrence among all verbs. We subdivided the frequency spectrum into six logarithmically spaced bins from 10^-6 to 1. Figure 1a shows the number of irregular verbs in each frequency bin. There are only two verbs, 'be' and 'have', in the highest frequency bin, with mean frequency >10^-1. Both remain irregular to the present day. There are 11 irregular verbs in the second bin, with mean frequency between 10^-2 and 10^-1. These 11 verbs have all remained irregular from Old English to Modern English. In the third bin, with a mean frequency between 10^-3 and 10^-2, we find that 37 irregular verbs of Old English all remained irregular in Middle English, but only 33 of them are irregular in Modern English. Four verbs in this frequency range, 'help', 'reach', 'walk' and 'work', underwent regularization. In the fourth frequency bin, 10^-4 to 10^-3, 65 irregular verbs of Old English have left 57 in Middle and 37 in Modern English. In the fifth frequency bin, 10^-5 to 10^-4, 50 irregulars of Old English have left 29 in Middle and 14 in Modern English. In the sixth frequency bin, 10^-6 to 10^-5, 12 irregulars of Old English decline to 9 in Middle and only 1 in Modern English: 'slink', a verb that aptly describes this quiet process of disappearance (Table 1).

Plotting the number of irregular verbs against their frequency generates a unimodal distribution with a peak between 10^-4 and 10^-3. This unimodal distribution again demonstrates that irregular verbs are not an arbitrary subset of all verbs, because a random subset of verbs (such as all verbs that contain the letter 'm') would follow Zipf's law, a power law with a slope of -0.75 (refs 24, 25).


Vol 449 | 11 October 2007 | doi:10.1038/nature06137



Page 35: Statistical Methods in Historical Linguistics

Question at hand

Wish to verify that the rate of change tends to be slower for words which are used more frequently


Page 36: Statistical Methods in Historical Linguistics

Data

Regularization of irregular verbs

Old English, Middle English and Modern English (1200 years of evolution)

177 irregular verbs, of which 79 regularize

Frequency of use


Page 37: Statistical Methods in Historical Linguistics

Results


Page 38: Statistical Methods in Historical Linguistics

Conclusions

Regularization half-life scales as the square root of usage frequency

A verb used 4 times more often regularizes at half the rate (see the quick check after this list)

Evolutionary history known

Sub-optimal: binning
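A quick numeric check of the scaling claimed above (an illustration, not the paper's code): if the half-life grows like the square root of frequency, the regularization rate falls like one over the square root.

```python
# rate ∝ frequency ** -0.5, so a verb used 4x (or 100x) more often
# regularizes at 1/2 (or 1/10) the rate.
for freq_ratio in (4, 100):
    print(f"{freq_ratio}x more frequent -> {freq_ratio ** -0.5:.2f}x the regularization rate")
```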


Page 39: Statistical Methods in Historical Linguistics

Pagel et al.

LETTERS

Frequency of word-use predicts rates of lexical evolution throughout Indo-European history
Mark Pagel, Quentin D. Atkinson & Andrew Meade

Greek speakers say "oura", Germans "schwanz" and the French "queue" to describe what English speakers call a 'tail', but all of these languages use a related form of 'two' to describe the number after one. Among more than 100 Indo-European languages and dialects, the words for some meanings (such as 'tail') evolve rapidly, being expressed across languages by dozens of unrelated words, while others evolve much more slowly - such as the number 'two', for which all Indo-European language speakers use the same related word-form [1]. No general linguistic mechanism has been advanced to explain this striking variation in rates of lexical replacement among meanings. Here we use four large and divergent language corpora (English [2], Spanish [3], Russian [4] and Greek [5]) and a comparative database of 200 fundamental vocabulary meanings in 87 Indo-European languages [6] to show that the frequency with which these words are used in modern language predicts their rate of replacement over thousands of years of Indo-European language evolution. Across all 200 meanings, frequently used words evolve at slower rates and infrequently used words evolve more rapidly. This relationship holds separately and identically across parts of speech for each of the four language corpora, and accounts for approximately 50% of the variation in historical rates of lexical replacement. We propose that the frequency with which specific words are used in everyday language exerts a general and law-like influence on their rates of evolution. Our findings are consistent with social models of word change that emphasize the role of selection, and suggest that owing to the ways that humans use language, some words will evolve slowly and others rapidly across all languages.

Languages, like species, evolve by way of a process of descent with modification (Supplementary Table 1). The remarkable diversity of languages - there are about 7,000 known living languages [7] - is a product of this process acting over thousands of years. Ancestral languages split to form daughter languages that slowly diverge as shared lexical, phonological and grammatical features are replaced by novel forms. In the study of lexical change, the basic unit of analysis is the cognate. Cognates are words of similar meaning with systematic sound correspondences indicating they are related by common ancestry. For example, cognates meaning 'water' exist in English (water), German (wasser), Swedish (vatten) and Gothic (wato), reflecting descent from proto-Germanic (*water).

Early lexicostatistical [8] studies of Malayo-Polynesian and Indo-European language families revealed that the rate at which new cognates arise varies across meaning categories [1,9]. More recently we have obtained direct estimates of rates of cognate replacement on linguistic phylogenies (family trees) of Indo-European and Bantu languages, using a statistical model of word evolution in a bayesian Markov chain Monte Carlo (MCMC) framework [10]. We found that rates of cognate replacement varied among meanings, and that rates for different meanings in Indo-European were correlated with their paired meanings in the Bantu languages. This indicates that variation in the rates of lexical replacement among meanings is not merely an historical accident, but rather is linked to some general process of language evolution.

Social and demographic factors proposed to affect rates of language change within populations of speakers include social status [11], the strength of social ties [12], the size of the population [13] and levels of outside contact [14]. These forces may influence rates of evolution on a local and temporally specific scale, but they do not make general predictions across language families about differences in the rate of lexical replacement among meanings. Drawing on concepts from theories of molecular [15] and cultural evolution [16-18], we suggest that the frequency with which different meanings are used in everyday language may affect the rate at which new words arise and become adopted in populations of speakers. If frequency of meaning-use is a shared and stable feature of human languages, then this could provide a general mechanism to explain the large differences across meanings in observed rates of lexical replacement. Here we test this idea by examining the relationship between the rates at which Indo-European language speakers adopt new words for a given meaning and the frequency with which those meanings are used in everyday language.

We estimated the rates of lexical evolution for 200 fundamental vocabulary meanings [8] in 87 Indo-European languages [6]. Rates were estimated using a statistical likelihood model of word evolution [10] applied to phylogenetic trees of the 87 languages (Supplementary Fig. 1). The number of cognates observed per meaning varied from one to forty-six. For each of the 200 meanings, we calculated the mean of the posterior distribution of rates as derived from a bayesian MCMC model that simultaneously accounts for uncertainty in the parameters of the model of cognate replacement and in the phylogenetic tree of the languages (Methods). Rate estimates were scaled to represent the expected number of cognate replacements per 10,000 years, assuming a 8,700-year age for the Indo-European language family [6]. Opinions on the age of Indo-European vary between approximately 6,000 and 10,000 years before present [19,20]. Using a different calibration would change the absolute values of the rates but not their relative values.

Figure 1a shows the inferred distribution of rate estimates, where we observe a roughly 100-fold variation in rates of lexical evolution among the meanings. At the slow end of the distribution, the rates predict zero to one cognate replacements per 10,000 years for words such as 'two', 'who', 'tongue', 'night', 'one' and 'to die'. By comparison, for the faster evolving words such as 'dirty', 'to turn', 'to stab' and 'guts', we predict up to nine cognate replacements in the same time period. In the historical context of the Indo-European language family, this range yields an expectation of between 0-1 and 43 lexical replacements throughout the ~130,000 language-years of evolution the linguistic tree represents, very close to the observed range in the


Vol 449 | 11 October 2007 | doi:10.1038/nature06176



Page 40: Statistical Methods in Historical Linguistics

Question at hand

Check the link between frequency of use and rate of change for vocabulary.
Hypothesis: when a meaning is used more often, the corresponding word is less likely to change.
Problem: since this rate is expected to be much slower than that of strong verb regularization, we need to look at much deeper history. But then the evolutionary history is unknown.


Page 41: Statistical Methods in Historical Linguistics

Workaround

Use Indo-European core vocabulary data, and frequencies from English, Greek, Russian and Spanish

Get a sample from the distribution on trees and ancestral ages using G&A's method

For each tree in the sample, estimate the rate of change for each meaning.

Average across all trees.


Page 42: Statistical Methods in Historical Linguistics

Results


Page 43: Statistical Methods in Historical Linguistics

Comments

Even if there is high uncertainty in the phylogenies, we can still answer other questions (integrating out the tree)

Similar results for Bantu (Pagel and Meade 2006)


Page 44: Statistical Methods in Historical Linguistics

Sampling from a distribution

Given a parameter θ with probability density π(θ), we often wish to estimate the expected value E[θ], or the probability mass of an interval P[a < θ < b].
If we can sample n independent and identically distributed (iid) values θ1, . . . , θn from π(·), then (under general assumptions) we have a good estimate of E[θ]:

(1/n) ∑_{i=1}^{n} θ_i → E[θ] as n → ∞

(and the same holds for I_h = ∫ h(θ) π(θ) dθ for any measurable function h).

Convergence is fast. But in general, getting an iid sample is difficult or impossible.
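A small illustration (not from the talk) of plain Monte Carlo with iid samples, using an arbitrary Gamma target whose mean is known to be 6:

```python
import numpy as np

rng = np.random.default_rng(42)
theta = rng.gamma(shape=2.0, scale=3.0, size=100_000)   # iid draws, E[theta] = 6

print("estimate of E[theta]:", theta.mean())                         # close to 6
print("estimate of P[2 < theta < 10]:", np.mean((theta > 2) & (theta < 10)))
```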


Page 45: Statistical Methods in Historical Linguistics

MCMC

If we are able to compute the value of the density π(θ) at any point θ, we can use MCMC (Markov Chain Monte Carlo) to generate correlated θ1, . . . , θn from π(·). The convergence still holds:

(1/n) ∑_{i=1}^{n} θ_i → E[θ] as n → ∞

but is much slower. Since we use a finite value of n, we need to check the convergence of the algorithm. In general, this is done by:

1 Checking the trace of the algorithm

2 Checking the autocorrelations

3 Running the algorithm for a very large value of n

Curse of dimensionality: when the dimensionality of the parameter space increases, convergence slows dramatically.
More during the TraitLab tutorial.
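For concreteness, here is a minimal random-walk Metropolis sketch (an illustration only, not TraitLab's algorithm); all it requires is the ability to evaluate an unnormalised density, here a standard normal stand-in:

```python
import numpy as np

def log_pi(theta):
    # Unnormalised log-density of the target; a standard normal as a stand-in.
    return -0.5 * theta ** 2

def metropolis(n_iter=50_000, step=1.0, theta0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    theta = theta0
    for i in range(n_iter):
        prop = theta + step * rng.normal()                  # random-walk proposal
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(theta):
            theta = prop                                    # accept; otherwise keep the current value
        chain[i] = theta
    return chain

chain = metropolis()
print("estimate of E[theta]:", chain.mean())                # correlated draws, mean near 0
# Convergence checks: plot the trace of `chain`, look at its autocorrelation,
# and rerun with a larger n_iter.
```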


Page 46: Statistical Methods in Historical Linguistics

ABC

If we are not able to compute π(θ), approximate inference may still be possible using a likelihood-free method such as Approximate Bayesian Computation (ABC).
The idea is to draw many values of the parameters; for each value, simulate a new dataset; keep only parameter values which lead to simulated data "close" to the observed data.
The theoretical properties are still mostly unknown. This method induces an approximation, and it is difficult to control the size of the approximation.
But sometimes, this is the only possibility.
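A toy ABC rejection sketch (illustration only, reusing the coin example from earlier; the tolerance is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
y_obs, n = 72, 100
tol = 2                                              # "closeness" threshold on the count

theta = rng.uniform(0.0, 1.0, size=200_000)          # 1. draw parameters from the prior
y_sim = rng.binomial(n, theta)                       # 2. simulate a dataset for each draw
accepted = theta[np.abs(y_sim - y_obs) <= tol]       # 3. keep draws whose data are close to observed

print(len(accepted), "accepted draws; approximate posterior mean:", round(accepted.mean(), 3))
```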


Page 47: Statistical Methods in Historical Linguistics

Outline

1 Swadesh: glottochronology

2 Gray and Atkinson: Language phylogenies

3 Lieberman et al., Pagel et al.: Frequency of use

4 Perreault and Mathew, Atkinson: Phoneme expansion

5 Dunn et al.: Word-order universals

6 Bouchard-Cote et al: automated cognates

7 Overall conclusions


Page 48: Statistical Methods in Historical Linguistics

Perreault and Mathew

Dating the Origin of Language Using Phonemic Diversity
Charles Perreault, Sarah Mathew


Abstract

Language is a key adaptation of our species, yet we do not know when it evolved. Here, we use data on language phonemic diversity to estimate a minimum date for the origin of language. We take advantage of the fact that phonemic diversity evolves slowly and use it as a clock to calculate how long the oldest African languages would have to have been around in order to accumulate the number of phonemes they possess today. We use a natural experiment, the colonization of Southeast Asia and Andaman Islands, to estimate the rate at which phonemic diversity increases through time. Using this rate, we estimate that present-day languages date back to the Middle Stone Age in Africa. Our analysis is consistent with the archaeological evidence suggesting that complex human behavior evolved during the Middle Stone Age in Africa, and does not support the view that language is a recent adaptation that has sparked the dispersal of humans out of Africa. While some of our assumptions require testing and our results rely at present on a single case-study, our analysis constitutes the first estimate of when language evolved that is directly based on linguistic data.

Citation: Perreault C, Mathew S (2012) Dating the Origin of Language Using Phonemic Diversity. PLoS ONE 7(4): e35289. doi:10.1371/journal.pone.0035289


Introduction

A capacity for language is a hallmark of our species [1,2], yet we know little about the timing of its appearance. Language appears in the archaeological record only recently, with the advent of lexicographic writing around 5,400 years ago [3]. Therefore, investigators have addressed the origin of language by studying the evolutionary history of anatomical features [4-7] and genes [8-15] that are associated with speech production. This research suggests that other Homo species had the ability to produce speech sounds that overlap with the range of speech sounds of modern humans, and that species such as Neanderthals possessed genes that, in humans, play a role in language. But we do not know whether these archaic hominins actually produced speech, and if so, to which extent it was similar to our capacity for language. As of now, the anatomical and genetic data lack the resolution necessary to differentiate proto-language from modern human language. Until this resolution is improved, we need alternative lines of evidence in order to better understand the timing of language origin.

Here, we use phonemic diversity data to date the origin of language. Phonemic diversity denotes the number of perceptually distinct units of sound - consonants, vowels and tones - in a language. The worldwide pattern of phonemic diversity potentially contains the statistical signal of the expansion of modern humans on the planet [16]. As human populations left Africa, 60-70 kya, and expanded into the rest of the world [1,17], they underwent a series of bottlenecks. This serial founder effect has led to a clinal loss of genetic [18-20], phenotypic [21-23] and phonemic diversity [16] that can be observed in present-day human populations. African languages today have some of the largest phonemic inventories in the world, while the smallest inventories are found in South America and Oceania, some of the last regions of the globe to be colonized. The loss of phonemes through serial founder effect is consistent with other lines of evidence that indicate that phonemic diversity is determined by cultural transmission forces, rather than cognitive or functional constraints. First, phonemic diversity varies considerably among languages, and several languages function with a restricted number of phonemes. Rotokas, a language of New Guinea, and Piraha, spoken in South-America, both have 11 phonemes [24,25], while !Xun, a language spoken in Southern Africa, has 141 phonemes. Second, as predicted by theoretical models linking cultural transmission and demography [26-28], phonemic diversity correlates positively with speaker population size [16,29]. And finally, phonemic diversity also correlates positively with the number of surrounding languages [16], suggesting that phonemes, like other cultural traits, can be borrowed. Phonemic diversity not only evolves culturally, but it also evolves slowly [16]. That the languages outside of Africa might have not recovered their original phonemic diversity, despite thousands of years of history in their respective continent, and despite all the historical, linguistic and social factors that lead to linguistic change [30-36], suggests that phonemic diversity changes over long time scales. Here, we take advantage of the fact that phonemic diversity evolves culturally and slowly, and use it as a slow-clock to date the origin of language.

By focusing on phonemes rather than cognates - words that share a common ancestry - we are able to circumvent problems that prevent current historical linguistic approaches from tackling the problem of dating the origin of language. Glottochronology uses the number of cognates that languages share to estimate when they diverged [37-39]. However, because cognates change over short time scales, the time-depth resolution of glottochronology is limited to a few thousand years [8]. Several historical, social and



Page 49: Statistical Methods in Historical Linguistics

Aim

Date Proto-Human.


Page 50: Statistical Methods in Historical Linguistics

Premises

Two small populations B and C disperse from the same parent A. B colonizes a large territory, C a small isolated island.

B will tend to accumulate new phonemes at a faster rate than C (PB > PC).

If the island is small enough, C will have about the same phonemic diversity as the ancestral language A (E(PC) = PA).

The difference in modern-day phonemic diversity is due entirely to B, so if we know the splitting time t, we can estimate the rate of phoneme accumulation in a large territory.

Two models of phoneme accumulation: linear and exponential.

r = (PB − PC) / t

k = (ln PB − ln PC) / t


Page 51: Statistical Methods in Historical Linguistics

Idea

Estimate r or k

Estimate modern phonemic diversity in Africa (PAfrica) and guess initial phonemic diversity Pinitial (take Pinitial = 11 or 29, minimum and median modern phonemic diversity).

Then we can deduce the age of the initial language.
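A hypothetical worked example of this calculation (the numbers below are illustrative placeholders, not the values estimated in the paper):

```python
import math

t_split = 55_000           # assumed Andaman colonization date, years BP
P_B, P_C = 45.0, 30.0      # assumed mean phoneme counts: mainland (B) vs island (C)
P_Africa, P_initial = 45.0, 11.0

r = (P_B - P_C) / t_split                            # linear accumulation rate
k = (math.log(P_B) - math.log(P_C)) / t_split        # exponential growth rate

age_linear = (P_Africa - P_initial) / r
age_exp = (math.log(P_Africa) - math.log(P_initial)) / k
print(f"linear: ~{age_linear:,.0f} years BP; exponential: ~{age_exp:,.0f} years BP")
```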


Page 52: Statistical Methods in Historical Linguistics

Parameter estimation

Single natural experiment: Andaman Islands (colonized 45-65 kya). Estimate PB and PC by averaging over 20 languages.


Page 53: Statistical Methods in Historical Linguistics

Results

(Linear/exponential model; 2 estimates for the colonization of the Andaman Islands; 2 guesses of Pinitial).

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 52 / 98

Page 54: Statistical Methods in Historical Linguistics

Comments

Highly dependent on strong assumptions

Needs model validation

The parameter estimates are very noisy, since they are based on related languages.

Intriguing

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 53 / 98

Page 55: Statistical Methods in Historical Linguistics

Atkinson


Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa

Quentin D. Atkinson

Science 332, 346 (15 April 2011)

Abstract: Human genetic and phenotypic diversity declines with distance from Africa, as predicted by a serial founder effect in which successive population bottlenecks during range expansion progressively reduce diversity, underpinning support for an African origin of modern humans. Recent work suggests that a similar founder effect may operate on human culture and language. Here I show that the number of phonemes used in a global sample of 504 languages is also clinal and fits a serial founder-effect model of expansion from an inferred origin in Africa. This result, which is not explained by more recent demographic history, local language diversity, or statistical non-independence within language families, points to parallel mechanisms shaping genetic and linguistic diversity and supports an African origin of modern human languages.

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 54 / 98

Page 56: Statistical Methods in Historical Linguistics

Question at hand

Can we use phonemic data to reconstruct the expansion of language?

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 55 / 98

Page 60: Statistical Methods in Historical Linguistics

Premises

Premise 1: Size of population is positively correlated with number of phonemes.

Premise 2: Language expansion occurred with a series of small founder groups.

Conclusion: During language expansion, the number of phonemes decreases at the frontier of expansion.

The effect is small, but significant. Still holds when you restrict to population < 5000.

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 56 / 98

Page 61: Statistical Methods in Historical Linguistics

Idea

Test many different possible points of origin. From the true point of origin, you should observe a cline in phoneme diversity.

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 57 / 98

Page 62: Statistical Methods in Historical Linguistics

Data

Data from WALS, 504 languages (all languages with data). Three traits, grouped into classes by WALS:

Tone structure (No tone / Simple tone / Complex tone)

Number of vowels (2–4 / 5–6 / 7–14)

Number of consonants (6–14 / 15–18 / 19–25 / 26–33 / 34+)

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 58 / 98

Page 63: Statistical Methods in Historical Linguistics

Map of languages

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 59 / 98

Page 64: Statistical Methods in Historical Linguistics

Tone map

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 60 / 98

Page 65: Statistical Methods in Historical Linguistics

Vowel map

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 61 / 98

Page 66: Statistical Methods in Historical Linguistics

Consonant map

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 62 / 98

Page 67: Statistical Methods in Historical Linguistics

Measure of phonemic diversity

A measure of phonemic diversity is created.

Scale and center the three components: tone, vowels, consonants (mean = 0, variance = 1)

Diversity is measured as the unweighted sum of the three components

This is probably the most contentious aspect of the paper. The issues this raises are dealt with in the Supplementary Material.
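
A minimal sketch of this construction (assuming the WALS classes have already been coded as numbers; the function and variable names are hypothetical):

```python
import numpy as np

def phonemic_diversity(tone, vowels, consonants):
    """Unweighted sum of the z-scored components (a sketch of the paper's index)."""
    X = np.column_stack([tone, vowels, consonants]).astype(float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # each column: mean 0, variance 1
    return Z.sum(axis=1)
```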

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 63 / 98

Page 68: Statistical Methods in Historical Linguistics

Main result

For each possible origin, perform a linear regression of phoneme diversity against distance from origin. Best regression:
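
Below is a sketch of this origin search (the paper's regressions also include log speaker population size as a covariate, which is kept here). It is only an illustration: it uses plain great-circle distances and the Gaussian-likelihood form of BIC, whereas the paper constrains distances to plausible land routes via waypoints. All argument names are placeholders.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine(lats, lons, olat, olon):
    """Great-circle distance in km from each language to a candidate origin."""
    p1, p2 = np.radians(lats), np.radians(olat)
    dlat, dlon = p2 - p1, np.radians(olon - lons)
    a = np.sin(dlat / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def bic_linear(y, X):
    """BIC of an ordinary least-squares fit with Gaussian errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + (k + 1) * np.log(n)

def best_origin(diversity, log_pop, lats, lons, candidate_origins):
    """(BIC, origin) for the candidate origin whose regression fits best."""
    results = []
    for olat, olon in candidate_origins:
        dist = haversine(lats, lons, olat, olon)
        X = np.column_stack([np.ones_like(dist), dist, log_pop])
        results.append((bic_linear(diversity, X), (olat, olon)))
    return min(results)
```

As in the paper, the set of origins within four BIC units of the best-fit location would then be reported as the most likely area of origin.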

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 64 / 98

Page 69: Statistical Methods in Historical Linguistics

Map of best origin

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 65 / 98

Page 70: Statistical Methods in Historical Linguistics

Validation

Several other factors could have an influence. The author checks:

Dependence within language families

Modern population size

Geographic area

Population density

Local language diversity (number of languages within N km, N = 500, 1000, 2000, 3000)

A second origin

Other factors certainly play a role. However, as long as they are uncorrelated with distance to Africa, they will not bias the results.

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 66 / 98

Page 72: Statistical Methods in Historical Linguistics

Tones vs Vowels vs Consonants

The measure of diversity is fishy. The author also checks for Tones, Vowels and Consonants separately.

The conclusions are unchanged.

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 67 / 98

Page 73: Statistical Methods in Historical Linguistics

Comments

Statistically sound.

The data are what they are...

Assumptions: see multiple comments in Linguistic Typology (2011)

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 68 / 98

Page 74: Statistical Methods in Historical Linguistics

Bayesian model choice

The evidence for or against a model given data is summarized in the Bayes factor:

B21(y) = ( P[M = 2 | y] / P[M = 1 | y] ) / ( P[M = 2] / P[M = 1] ) = m2(y) / m1(y)

where

mk(y) = ∫_{Θk} L(θk | y) πk(θk) dθk

This includes an automatic penalty for complexity: when a model is more complex, the integral is over a larger space, where the likelihood is most of the time very small.
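
A toy illustration (not from the talk) of approximating the marginal likelihoods, and hence a Bayes factor, by Monte Carlo for two simple Gaussian models; the data and priors are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=50)          # toy data

# Model 1: y_i ~ N(0, 1), no free parameter, so m1(y) is just the likelihood.
log_m1 = stats.norm.logpdf(y, loc=0.0, scale=1.0).sum()

# Model 2: theta ~ N(0, 10^2) prior, y_i ~ N(theta, 1).
# Crude Monte Carlo estimate of m2(y) = integral of L(theta | y) pi(theta) d theta.
theta = rng.normal(0.0, 10.0, size=100_000)            # draws from the prior
loglik = stats.norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)
log_m2 = loglik.max() + np.log(np.mean(np.exp(loglik - loglik.max())))

log10_B21 = (log_m2 - log_m1) / np.log(10.0)
print(log10_B21)   # interpret with Jeffreys' scale (next slide)
```

Because model 2 spreads its prior over a wide range of θ values, most prior draws have negligible likelihood; that is exactly the automatic complexity penalty described above.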

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 69 / 98

Page 75: Statistical Methods in Historical Linguistics

Interpreting the Bayes factor

Jeffreys' scale of evidence states that

If log10(B21) is between 0 and 0.5, then the evidence in favor of model 2 is weak

between 0.5 and 1, it is substantial

between 1 and 2, it is strong

above 2, it is decisive

(and symmetrically for negative values)
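
The scale is easy to encode directly; a small helper (hypothetical, simply restating the slide):

```python
def jeffreys_evidence(log10_B21):
    """Verbal label for the evidence; negative values favour model 1 symmetrically."""
    strength = abs(log10_B21)
    if strength < 0.5:
        label = "weak"
    elif strength < 1:
        label = "substantial"
    elif strength < 2:
        label = "strong"
    else:
        label = "decisive"
    return f"{label} evidence for model {2 if log10_B21 > 0 else 1}"

print(jeffreys_evidence(1.3))   # strong evidence for model 2
```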

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 70 / 98

Page 76: Statistical Methods in Historical Linguistics

Outline

1 Swadesh: glottochronology

2 Gray and Atkinson: Language phylogenies

3 Lieberman et al., Pagel et al.: Frequency of use

4 Perreault and Mathew, Atkinson: Phoneme expansion

5 Dunn et al.: Word-order universals

6 Bouchard-Cote et al: automated cognates

7 Overall conclusions

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 71 / 98

Page 77: Statistical Methods in Historical Linguistics

Dunn et al.

Evolved structure of language shows lineage-specific trends in word-order universals

Michael Dunn, Simon J. Greenhill, Stephen C. Levinson & Russell D. Gray

Nature (2011), doi:10.1038/nature09923

Abstract: Languages vary widely but not without limit. The central goal of linguistics is to describe the diversity of human languages and explain the constraints on that diversity. Generative linguists following Chomsky have claimed that linguistic diversity must be constrained by innate parameters that are set as a child learns a language. In contrast, other linguists following Greenberg have claimed that there are statistical tendencies for co-occurrence of traits reflecting universal systems biases, rather than absolute constraints or parametric variation. Here we use computational phylogenetic methods to address the nature of constraints on linguistic diversity in an evolutionary framework. First, contrary to the generative account of parameter setting, we show that the evolution of only a few word-order features of languages is strongly correlated. Second, contrary to the Greenbergian generalizations, we show that most observed functional dependencies between traits are lineage-specific rather than universal tendencies. These findings support the view that, at least with respect to word order, cultural evolution is the primary factor that determines linguistic structure, with the current state of a linguistic system shaping and constraining future states.

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 72 / 98

Page 78: Statistical Methods in Historical Linguistics

Question at hand

Do the following traits evolve independently?

1 Adjective-Noun order

2 Adposition-Noun phrase order

3 Demonstrative-Noun order

4 Genitive-Noun order

5 Numeral-Noun order

6 Object-Verb order

7 Relative clause-Noun order

8 Subject-Verb order

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 73 / 98

Page 79: Statistical Methods in Historical Linguistics

Idea

Difficult to get an independent sample of languages

Instead, look at languages known to be related and check for co-evolution

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 74 / 98

Page 80: Statistical Methods in Historical Linguistics

Data

4 language families:

Indo-European

Bantu

Uto-Aztecan

Austronesian

Word-order data mainly from WALS.

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 75 / 98

Page 81: Statistical Methods in Historical Linguistics

Lexical tree

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 76 / 98

Page 82: Statistical Methods in Historical Linguistics

Methods

Start with a sample of trees constructed with lexical data.

Put two traits at the leaves

Perform Bayesian model choice between an independent model and a co-evolution model
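
To make the model-choice step concrete, here is a minimal sketch of how the likelihood of two binary traits at the leaves of a known tree can be computed under an independent model and under a coupled four-state model, using Felsenstein's pruning algorithm. The tree, branch lengths, trait values and rates below are all invented; the actual analysis (BayesTraits) additionally averages over posterior tree samples and integrates over the rates to obtain Bayes factors.

```python
import numpy as np
from scipy.linalg import expm

def loglik(tree, leaf_states, Q, root_probs):
    """Log-likelihood of the leaf states under a continuous-time Markov model
    with rate matrix Q, via Felsenstein's pruning algorithm."""
    def partial(node):
        name, children = node
        if not children:                          # leaf: one-hot on the observed state
            v = np.zeros(len(root_probs))
            v[leaf_states[name]] = 1.0
            return v
        v = np.ones(len(root_probs))
        for child, t in children:                 # t = branch length to the child
            v *= expm(Q * t) @ partial(child)
        return v
    return float(np.log(root_probs @ partial(tree)))

def two_state_rates(a, b):
    """Rate matrix for a binary trait: 0 -> 1 at rate a, 1 -> 0 at rate b."""
    return np.array([[-a, a], [b, -b]])

# Invented 4-taxon tree ((L1, L2), (L3, L4)) with made-up branch lengths.
tree = ("root",
        [(("n1", [(("L1", []), 0.2), (("L2", []), 0.2)]), 0.3),
         (("n2", [(("L3", []), 0.4), (("L4", []), 0.4)]), 0.3)])

# Two binary word-order traits observed at the leaves (invented data).
trait1 = {"L1": 0, "L2": 0, "L3": 1, "L4": 1}
trait2 = {"L1": 0, "L2": 1, "L3": 1, "L4": 1}

# Independent model: each trait evolves on its own two-state chain.
ll_indep = (loglik(tree, trait1, two_state_rates(1.0, 1.0), np.array([0.5, 0.5]))
            + loglik(tree, trait2, two_state_rates(1.0, 1.0), np.array([0.5, 0.5])))

# Dependent model: one chain on the pairs (0,0), (0,1), (1,0), (1,1); here trait 2
# changes faster when trait 1 equals 1, and only one trait changes at a time.
pair_state = {lang: 2 * trait1[lang] + trait2[lang] for lang in trait1}
Q4 = np.array([[-2.0,  1.0,  1.0,  0.0],
               [ 1.0, -2.0,  0.0,  1.0],
               [ 1.0,  0.0, -4.0,  3.0],
               [ 0.0,  1.0,  3.0, -4.0]])
ll_dep = loglik(tree, pair_state, Q4, np.full(4, 0.25))

print(ll_indep, ll_dep)   # a full analysis would integrate over the rates
```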

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 77 / 98

Page 83: Statistical Methods in Historical Linguistics

Two models

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 78 / 98

Page 84: Statistical Methods in Historical Linguistics

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 79 / 98

Page 85: Statistical Methods in Historical Linguistics

Results

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 80 / 98

Page 86: Statistical Methods in Historical Linguistics

Conclusions

The networks look very different.

Co-evolution seems to be very different between different language families.

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 81 / 98

Page 87: Statistical Methods in Historical Linguistics

What went wrong?

1 Nothing about borrowing or geography

2 No attempt at cross-validation

3 Over-interpreting the data

4 Testing too many hypotheses

They did not test the hypothesis that different families have the same dependencies!

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 82 / 98

Page 88: Statistical Methods in Historical Linguistics

Outline

1 Swadesh: glottochronology

2 Gray and Atkinson: Language phylogenies

3 Lieberman et al., Pagel et al.: Frequency of use

4 Perreault and Mathew, Atkinson: Phoneme expansion

5 Dunn et al.: Word-order universals

6 Bouchard-Cote et al: automated cognates

7 Overall conclusions

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 83 / 98

Page 89: Statistical Methods in Historical Linguistics

Automatic cognate detection?

All models of lexical data take as granted cognacy judgements (and family membership)

It would be nice to do without these: we could increase the number of languages under study, as well as the number of meanings

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 84 / 98

Page 90: Statistical Methods in Historical Linguistics

Levenshtein distance

The Levenshtein distance (or edit distance) between two words is the number of changes (substitutions, insertions, deletions) necessary to change one word into the other.

Swedish tre (/tre:/) and Latin tres (/tre:s/): distance 1 (1 insertion)

Swedish tre and Dutch drie (/dri/): distance 2 (2 substitutions)

(Can be normalized by word length)

By calculating the distance between all words in a meaning list, get a distance between two languages. Deduce a tree (several possible algorithms: Neighbour joining, UPGMA...)

Unlike other methods, this does not require cognacy judgements, and it takes into account the phonetics (more information).

Serva and Petroni 2008, van der Ark et al. 2007: "the resulting tree looks close to the truth" (except it doesn't)
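
A minimal implementation of the edit distance and of the language-level average, for illustration only (phonemes are represented as lists of symbols so that /e:/ counts as a single segment, and the meaning lists are assumed to be aligned, one word per meaning):

```python
def levenshtein(a, b):
    """Edit distance with unit-cost substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute x -> y
        prev = cur
    return prev[-1]

def normalized(a, b):
    return levenshtein(a, b) / max(len(a), len(b))

def language_distance(words1, words2):
    """Mean normalized edit distance over a meaning list (aligned by meaning)."""
    return sum(normalized(a, b) for a, b in zip(words1, words2)) / len(words1)

print(levenshtein(["t", "r", "e:"], ["t", "r", "e:", "s"]))  # 1: one insertion
print(levenshtein(["t", "r", "e:"], ["d", "r", "i"]))        # 2: two substitutions
```

The resulting distance matrix over all language pairs can then be fed to Neighbour joining or UPGMA; the criticisms on the next slide apply regardless of the tree-building step.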

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 85 / 98

Page 91: Statistical Methods in Historical Linguistics

Levenshtein distance: so many problems!

Does not require that compared words be cognates

Does not require that compared languages be cousins

No model of phonetic change (hence no model validation or mis-specification analysis)

Some changes are more likely than others

Changes can be systematic (e.g. vowel shift), not independent

See Greenhill (2010) for a detailed rebuttal.

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 86 / 98

Page 92: Statistical Methods in Historical Linguistics

Bouchard-Cote et al. (2012)

Automated reconstruction of ancient languages using probabilistic models of sound change

Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein

PNAS Early Edition, doi:10.1073/pnas.1204678110

Abstract: One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient languages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algorithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruction of a set of protolanguages. Over 85% of the system's reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Being able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hypothesis about the relationship between the function of a sound and its probability of changing that was first proposed in 1955.

Keywords: ancestral | computational | diachronic

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 87 / 98

Page 93: Statistical Methods in Historical Linguistics

Aim

Model of sound change along a tree (including systematic changes)

Very large number of parameters

Attempt to perform the comparative method automatically

Possible uses: reconstruction of ancient forms, cognate detection, phylogenies...

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 88 / 98

Page 94: Statistical Methods in Historical Linguistics

Model

Assume that the phylogeny is known.

Tree: Latin is the ancestor of Italian and Spanish.

Latin: /tre:s/, /ste:lla/
Italian: /tre/, /stella/
Spanish: /tRes/, /estReLa/

Latin to Italian: /tre:s/ → /tre/
P[t → t] = 0.98
P[r → r] = 0.99
P[e: → e] = 0.95
P[s → −] = 0.20
(These numbers are made up.)
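
With the made-up numbers above, and assuming (unrealistically) that the phonemes evolve independently given the alignment, the probability of the whole change /tre:s/ → /tre/ is just the product:

```python
p_word = 0.98 * 0.99 * 0.95 * 0.20   # t->t, r->r, e:->e, s deleted (made-up values)
print(p_word)                        # about 0.18
```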

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 89 / 98

Page 96: Statistical Methods in Historical Linguistics

Transition probabilities

Branch Latin to Italian, transition probabilities for /t/:

t: 0.98
d: 0.01
k: 0.01
anything else: 0

We would like to estimate the transition probabilities from any phoneme to any other phoneme, on each branch. Huge number of parameters (number of phonemes × number of phonemes × number of branches).
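
In the simplest possible setting (alignments treated as known, no context), such a table could be estimated by counting, as in this sketch; the aligned pairs below are invented, and the actual model treats the alignments as latent variables learned by EM:

```python
from collections import Counter, defaultdict

# Hypothetical aligned phoneme pairs on the Latin -> Italian branch ("-" = deletion).
aligned = [("t", "t"), ("r", "r"), ("e:", "e"), ("s", "-"),
           ("s", "s"), ("t", "t"), ("e:", "e"), ("l", "l"), ("l", "l"), ("a", "a")]

counts = defaultdict(Counter)
for source, target in aligned:
    counts[source][target] += 1

# Maximum-likelihood estimate of P[target | source] on this branch.
transitions = {src: {tgt: n / sum(c.values()) for tgt, n in c.items()}
               for src, c in counts.items()}
print(transitions["s"])   # {'-': 0.5, 's': 0.5} with this toy data
```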

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 90 / 98

Page 97: Statistical Methods in Historical Linguistics

Context dependent

In fact, the probability that /s/ is deleted depends on its position within the word. The transition probabilities become context dependent:
P[s → − | preceded by e and is the last letter] = 0.70
P[r → r | between t and e] = 0.99
Even more parameters!

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 91 / 98

Page 98: Statistical Methods in Historical Linguistics

Reducing the number of parameters

Generic contexts of the form "CrV"

Dirichlet prior: force most parameters to be zero (sparsity)
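
A two-line illustration of the effect of the concentration parameter: with a small value, a draw from the Dirichlet distribution puts almost all of its mass on a handful of target phonemes (the dimension 30 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
diffuse = rng.dirichlet(np.full(30, 1.0))   # probabilities spread over many phonemes
sparse = rng.dirichlet(np.full(30, 0.05))   # a few entries carry nearly all the mass
print(np.sort(diffuse)[-3:].sum(), np.sort(sparse)[-3:].sum())
```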

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 92 / 98

Page 99: Statistical Methods in Historical Linguistics

Implementation

Possible to use large datasets: dumps from Wiktionary, Bible...

Expectation-Maximization algorithm

Only point estimates of the parameters, not uncertainty

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 93 / 98

Page 100: Statistical Methods in Historical Linguistics

Results: reconstruction

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 94 / 98

Page 101: Statistical Methods in Historical Linguistics

Results: Most probable transitions

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 95 / 98

Page 102: Statistical Methods in Historical Linguistics

Bouchard et al.: Comments

Impressive first step

Much more refinement is possible, especially in defining the context

Allows quantitative tests of open questions (e.g. functional load, universality of changes)

Joint estimation of sound changes and phylogenies

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 96 / 98

Page 103: Statistical Methods in Historical Linguistics

Outline

1 Swadesh: glottochronology

2 Gray and Atkinson: Language phylogenies

3 Lieberman et al., Pagel et al.: Frequency of use

4 Perreault and Mathew, Atkinson: Phoneme expansion

5 Dunn et al.: Word-order universals

6 Bouchard-Cote et al: automated cognates

7 Overall conclusions

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 97 / 98

Page 104: Statistical Methods in Historical Linguistics

Overall conclusions

When done right, statistical methods can provide new insight into linguistic history

The conclusions will only ever be as good as the data

Importance of collaboration in building the model and in checking for mis-specification

Major avenues for future research. Challenges in finding relevant data, building models, and statistical inference:

Models for morphosyntactic traits

Putting together lexical, phonemic and morphosyntactic traits

Escape the tree-like model (borrowing, networks, diffusion processes...)

Incorporate geography

Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 98 / 98