Basic vocabulary and the phylogenetic approach to the study of Uralic language history

In Contextualizing Historical Lexicology, Helsinki, May 15-17, 2017

Michael Rießler¹, Mervi de Heer², Terhi Honkola³, Unni-Päivä Leino⁴, Kaj Syrjänen⁴, Outi Vesakoski³

¹ Univ of Freiburg, Germany
² Univ of Uppsala, Sweden
³ Univ of Turku, Finland
⁴ Univ of Tampere, Finland




BEDLAN team funded by the Kone Foundation

• BEDLAN 2009-2013, PI Urho Määttä
• URALEX 2013-2016, PI Unni-Päivä Leino
• SumuraSyyni 2014-2016, PI Outi Vesakoski
• Kippo 2017-2020, PI Unni-Päivä Leino
• AikaSyyni 2017-2020, PI Outi Vesakoski

Senior staff: Terhi Honkola (1), Unni-Päivä Leino (2), Urho Määttä (2), Luke Maurits (1), Jenni Santaharju (3), Outi Vesakoski (1), Niklas Wahlberg

Doctoral students: Mervi de Heer (4), Timo Rantanen (1), Kaj Syrjänen (2)

Assistants: Hilkka Ahola (1), Jaakko Helke (3), Timo Rantakaulio (3), Ilpo Tammi (1)

Collaboration & visitors: Rogier Blokland (4), Michael Dunn (4), Mikko Heikkilä (2), Michael Rießler (5), Harri Tolvanen (1), Sanni Översti (3)

Key to markers 1-5: [legend not recoverable]

Disciplines: Geography, Linguistics, Mathematics, Biology


Themes of the presentation

1. What is “phylogenetic linguistics”?
2. Lexical data
3. Uralic family analysed with phylogenetic methods & FAQ
4. Added value of phylogenetic linguistics


Phylogenetic linguistics

Comparative method (CM) vs statistical approaches: hypothesis testing

Exploration of linguistic data with the help of phylogenetic methodology

Based on the idea that linguistic change can be regarded as a type of generalized evolution
• generalized evolution ≠ biological evolution

A large selection of computational methods based on varying principles
• Often adopted from phylogenetics
• Developed to find signal/pattern in large data sets


Phylogenetic linguistics

Phylogenetic linguistics is not equal to
● Lexicostatistics ≈ Clustering languages based on distances calculated from meaning lists
● Glottochronology ≈ Use of lexicostatistical distances to infer chronological dates
● Mass comparisons ≈ Subjective inspection of large data


Data in phylogenetic linguistics

E.g.

Lexical data
● Often basic vocabulary & cognate coding (root-meaning forms)
● Most readily available data type
● Data hygiene varies

Typological (structural) data
● Syntactic, morphological, phonological
● Collection on-going (e.g. world languages, Uralic languages)
● Data hygiene: subjective choice of traits?
● Limited design space of typological traits!


Lexical data in phylogenetic linguistics

Different basic vocabulary lists available:
● Swadesh 100 and 200 lists
● Leipzig-Jakarta list, 100 meanings (WOLD 1-100)
● BEDLAN: Less stable vocabulary (WOLD 401-500)

WOLD = World Loanword Database (Haspelmath and Tadmor 2009)

Basic vocabulary in WOLD
● wide-spread concepts
● resistant to borrowing
● unlikely to be replaced
● morphologically simple

No.  Meaning   Sw200  Sw100  WOLD 1-100
 1   all       X      X      -
 2   and       X      -      -
 3   animal    X      -      -
 4   ashes     X      X      X
 5   at        X      -      -
 6   back      X      -      X
 7   bad       X      -      -
 8   bark      X      X      -
 9   because   X      -      -
10   belly     X      X      -
11   big       X      X      X
12   bird      X      X      X
13   bite      X      X      X
14   black     X      X      X
15   blood     X      X      X
16   blow      X      -      X
17   bone      X      X      X
18   breast    X      X      X
19   breathe   X      -      -
20   burn      X      X      X
21   child     X      -      -
22   claw      X      X      -
23   cloud     X      X      -
24   cold      X      X      -
25   come      X      X      X
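The overlap between the lists can be counted directly from the rows above. The following is a minimal sketch covering only the 25 meanings shown on the slide, not the full lists; the names `ROWS`, `LIST_NAMES` and `parse_lists` are illustrative and not part of any BEDLAN tooling.

```python
# First 25 rows of the comparison table, transcribed as
# "meaning Sw200 Sw100 WOLD1-100" with X = included, - = not included.
ROWS = """
all X X -
and X - -
animal X - -
ashes X X X
at X - -
back X - X
bad X - -
bark X X -
because X - -
belly X X -
big X X X
bird X X X
bite X X X
black X X X
blood X X X
blow X - X
bone X X X
breast X X X
breathe X - -
burn X X X
child X - -
claw X X -
cloud X X -
cold X X -
come X X X
"""

LIST_NAMES = ("Sw200", "Sw100", "WOLD 1-100")


def parse_lists(text):
    """Return {list name: set of meanings} from the X/- table rows."""
    lists = {name: set() for name in LIST_NAMES}
    for line in text.strip().splitlines():
        meaning, *flags = line.split()
        for name, flag in zip(LIST_NAMES, flags):
            if flag == "X":
                lists[name].add(meaning)
    return lists


lists = parse_lists(ROWS)
# Meanings that both the Swadesh 100 list and WOLD 1-100 contain.
shared = lists["Sw100"] & lists["WOLD 1-100"]
# Meanings that WOLD 1-100 contains but the Swadesh 100 list does not.
wold_only = lists["WOLD 1-100"] - lists["Sw100"]

for name in LIST_NAMES:
    print(name, len(lists[name]))
print("shared:", sorted(shared))
print("WOLD only:", sorted(wold_only))
```

Within this excerpt, most WOLD 1-100 meanings are also Swadesh 100 meanings, which is why the choice of list changes the data set less than the list names suggest.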


Uralic data

Data collection initiated by the BEDLAN group in 2009

26 Uralic languages, 313 meanings (available on request)

Collection described in detail in:
• Syrjänen et al. 2013: Shedding more light on language classification using basic vocabularies and phylogenetic methods (Diachronica 30:3)
• Lehtinen et al. 2014: Behind Family Trees. Secondary Connections in Uralic Language Networks (Language Dynamics and Change 4)


Uralic languages collected

Not yet included: Ludic, Võru, Lule/Akkala/Ter Saami, Moksha, Enets, Nenets, Kamas

Map source: Geographic database of Uralic languages
• Timo Rantanen + BEDLAN
• Jussi Ylikoski
• Language experts: Authors of the Oxford Handbook of Uralic languages


Collection of Uralic data

Words collected from bilingual dictionaries

Etymological relationships and cognate coding based on literature
• E.g. Itkonen & Kulonen 1992-2000, Rédei 1988-1991, Sammallahti 1988, Janhunen 1997 (Álgu database)

Data hygiene
• Uniform criteria for selection of words and cognates (single person work)
• Multiple equivalents for a meaning when relevant
• No cognate hunting!
• 17 of 26 languages checked by an expert and/or native speaker of the language → Refining in progress


Coding cognate relationships, or better, root-meaning forms (Chang et al. 2015), as a presence-absence binary matrix (pics by J. Lehtinen)

Word forms by cognate set. Each column is one cognate set, named after one of its members; "-" = the language has no form in that set.

                          FISH              WATER                           EAR
Language                  guelie   ćeri     vesi           tjaetsie  jəŋk   bieljie  korva   xa
                          (SaaS)   (KomiZ)  (Fin)          (SaaS)    (KhaV) (SaaS)   (Fin)   (NenT)
Proto-Uralic (outgroup)   *kala    -        *weti          -         -      [not reconstructable]
South Saami               guelie   -        -              tjaetsie  -      bieljie  -       -
North Saami               guolli   -        -              čáhci     -      beallji  -       -
Inari Saami               kyeli    -        -              čääci     -      pelji    -       -
Kildin Saami              kūll’    -        -              čāʒ’      -      piellj   -       -
Standard Finnish          kala     -        vesi           -         -      -        korva   -
Ingrian                   kala     -        vezi           -         -      -        korva   -
Western Votic             kaлa     -        vesi           -         -      -        ke̮rv    -
Standard Estonian         kala     -        vesi           -         -      -        kõrv    -
Võro South Estonian       kala     -        vesi           -         -      -        kõrv    -
Courland Livonian         kalà     -        veiʾž ~ veʾžʹ  -         -      -        kùora   -
Erzya                     kal      -        vedʹ           -         -      pilʹe    -       -
Meadow Mari               kol      -        βüt            -         -      pə̑lə̑š    -       -
Komi-Zyrian               -        ćeri     va             -         -      pelʹ     -       -
Udmurt                    -        ćori̮g    vu             -         -      pelʹ     -       -
Hungarian                 hal      -        víz            -         -      fül      -       -
Vakh-Vasyugan Khanty      kul      -        -              -         jəŋk   pəl      -       -
Tundra Nenets             xalya    -        yiq            -         -      -        -       xa

The same data as a binary matrix (1 = present, 0 = absent, ? = not reconstructable), columns in the same order:

Language                  FISH  WATER  EAR
Proto-Uralic (outgroup)   1 0   1 0 0  ? ? ?
South Saami               1 0   0 1 0  1 0 0
North Saami               1 0   0 1 0  1 0 0
Inari Saami               1 0   0 1 0  1 0 0
Kildin Saami              1 0   0 1 0  1 0 0
Standard Finnish          1 0   1 0 0  0 1 0
Ingrian                   1 0   1 0 0  0 1 0
Western Votic             1 0   1 0 0  0 1 0
Standard Estonian         1 0   1 0 0  0 1 0
Võro South Estonian       1 0   1 0 0  0 1 0
Courland Livonian         1 0   1 0 0  0 1 0
Erzya                     1 0   1 0 0  1 0 0
Meadow Mari               1 0   1 0 0  1 0 0
Komi-Zyrian               0 1   1 0 0  1 0 0
Udmurt                    0 1   1 0 0  1 0 0
Hungarian                 1 0   1 0 0  1 0 0
Vakh-Vasyugan Khanty      1 0   0 0 1  1 0 0
Tundra Nenets             1 0   1 0 0  0 0 1
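The step from cognate judgements to the binary matrix can be sketched in a few lines. This is an illustration only, not the BEDLAN pipeline: the cognate-set labels ("kala", "vesi", ...) are hypothetical names chosen here for the sets in the figure, and only six of the languages are included.

```python
# Column order of the binary matrix: one column per (meaning, cognate set).
# The set labels are illustrative names, not identifiers from the data set.
COLUMNS = [
    ("FISH", "kala"), ("FISH", "ćeri"),
    ("WATER", "vesi"), ("WATER", "tjaetsie"), ("WATER", "jəŋk"),
    ("EAR", "bieljie"), ("EAR", "korva"), ("EAR", "xa"),
]

# Cognate sets attested per language and meaning. A set (rather than a single
# label) allows multiple equivalents for one meaning; None marks a meaning
# that cannot be reconstructed and is coded as missing data.
COGNATES = {
    "Proto-Uralic": {"FISH": {"kala"}, "WATER": {"vesi"}, "EAR": None},
    "South Saami": {"FISH": {"kala"}, "WATER": {"tjaetsie"}, "EAR": {"bieljie"}},
    "Standard Finnish": {"FISH": {"kala"}, "WATER": {"vesi"}, "EAR": {"korva"}},
    "Komi-Zyrian": {"FISH": {"ćeri"}, "WATER": {"vesi"}, "EAR": {"bieljie"}},
    "Vakh-Vasyugan Khanty": {"FISH": {"kala"}, "WATER": {"jəŋk"}, "EAR": {"bieljie"}},
    "Tundra Nenets": {"FISH": {"kala"}, "WATER": {"vesi"}, "EAR": {"xa"}},
}


def encode(language_data):
    """Return one matrix row as a string of '1', '0' and '?' characters."""
    row = []
    for meaning, cognate_set in COLUMNS:
        attested = language_data[meaning]
        if attested is None:
            row.append("?")  # missing data, not absence
        else:
            row.append("1" if cognate_set in attested else "0")
    return "".join(row)


matrix = {language: encode(data) for language, data in COGNATES.items()}
for language, row in matrix.items():
    print(f"{language:22} {row}")
```

The resulting rows match the corresponding rows of the matrix in the figure. Keeping "?" distinct from "0" matters because an unreconstructable form is not evidence that the cognate set was absent.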


Methods in phylogenetic linguistics

A large selection of computational methods based on varying principles
● Character-based and distance-based methods

Character-based methods
● Parsimony (smallest number of evolutionary changes)
● Maximum Likelihood (find the best tree and model parameters)
● Bayesian statistics (produce a distribution of likely trees and model parameters)

BEDLAN work so far mostly Bayesian statistics

Cognates = characters (figure: the binary FISH / WATER / EAR cognate matrix shown under “Coding cognate relationships”)
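To make "smallest number of evolutionary changes" concrete, here is a minimal sketch of Fitch's small-parsimony count on binary cognate characters. The four matrix rows are taken from the FISH / WATER / EAR figure; the two tree topologies are toy examples chosen for illustration, not results, and "?" is treated as compatible with either state, which is one common convention for missing data.

```python
def fitch(tree, states):
    """Fitch pass for one character on a binary nested-tuple tree.

    Returns (possible states at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):  # leaf: a language name
        state = states[tree]
        return ({"0", "1"} if state == "?" else {state}), 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:  # children can agree: no extra change needed
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, matrix):
    """Sum the minimum number of changes over all characters (columns)."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch(tree, {lang: row[i] for lang, row in matrix.items()})[1]
        for i in range(n_chars)
    )


# Rows copied from the binary matrix in the figure.
MATRIX = {
    "South Saami": "10010100",
    "North Saami": "10010100",
    "Standard Finnish": "10100010",
    "Standard Estonian": "10100010",
}

# Two hypothetical topologies to compare.
TREE_A = (("South Saami", "North Saami"), ("Standard Finnish", "Standard Estonian"))
TREE_B = (("South Saami", "Standard Finnish"), ("North Saami", "Standard Estonian"))

score_a = parsimony_score(TREE_A, MATRIX)
score_b = parsimony_score(TREE_B, MATRIX)
print("tree A:", score_a, "tree B:", score_b)
```

Under parsimony the tree with the lower score is preferred. Maximum Likelihood and Bayesian methods replace this count with an explicit probabilistic model of cognate gain and loss, which is not shown here.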


Language data in numbers?

Figure: basic vocabulary + MrBayes (Syrjänen et al. 2013) alongside Kulonen 2002 and Korhonen 1981

The approach works for the Uralic languages, as seen from comparison with traditional results.


Usage of lexical data in general

Potential
• Data readily available
• Builds upon observations from an earlier research tradition
• Datasets are usually large enough

Two important aspects related to evaluating (un)certainty
● Possibility to compare likelihoods of alternative scenarios of linguistic history
● Inferring the outcome: Likelihood of the tree / network and parts of it

Challenges
• Linguistic: Diverse material (e.g. descriptive vs. normative sources)
• Computational: Better methods to model different types of linguistic change needed
• Cultural: Unorthodox approach in historical-comparative linguistics


Measuring uncertainty & usage of priors

• Tree figures are simplifications of all the alternative trees produced by the algorithm
• (Un)certainty (variation between the trees) is condensed into posterior probability values

Example: Timed tree of the Uralic family with all 313 meanings, using the Bayesian BEAST algorithm with a restricted clock based on 2 calibration points (PRIORS): Saami 1300 YBP (30-y stdev) and Samoyed languages 2030 YBP (60-y stdev). Manuscript.

[Tree figure not reproduced; one node label reads 0,47]
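How variation between sampled trees is condensed into a posterior probability per node can be sketched as follows. The four sampled topologies are invented for illustration and are not BEDLAN results; a real analysis would read thousands of trees from the sampler's output instead of hard-coding nested tuples.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set of this subtree, list of all clades inside it)."""
    if isinstance(tree, str):  # leaf: a language or branch name
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)
    return leaves, found


def clade_support(sample):
    """Fraction of sampled trees in which each clade occurs."""
    counts = Counter()
    for tree in sample:
        counts.update(set(clades(tree)[1]))
    return {clade: n / len(sample) for clade, n in counts.items()}


# Hypothetical posterior sample of four trees over four branches.
SAMPLE = [
    ((("Saami", "Finnic"), "Mordvin"), "Mari"),
    ((("Saami", "Finnic"), "Mordvin"), "Mari"),
    ((("Saami", "Finnic"), "Mari"), "Mordvin"),
    (("Saami", ("Finnic", "Mordvin")), "Mari"),
]

support = clade_support(SAMPLE)
for clade, value in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(clade), value)
```

A node value such as 0,47 in a summary tree is read the same way: the grouping below that node appears in roughly 47 % of the sampled trees.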


Language lineaging as tree models

• Networks illustrate the secondary contacts, while trees show inheritance.

Example: Distance-based NeighbourNet and character-based MrBayes tree for a data set with a low amount of known borrowings (149 meanings) in Lehtinen et al. 2014
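Distance-based methods such as NeighbourNet start from a pairwise distance matrix rather than from the characters themselves. A minimal sketch of one possible distance, the share of differing binary characters with missing values skipped, is given below. The rows come from the FISH / WATER / EAR matrix; the choice of this particular distance is an assumption made here for illustration and is not stated in the source.

```python
from itertools import combinations


def distance(row_a, row_b):
    """Share of comparable characters in which two languages differ."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if "?" not in (a, b)]
    if not pairs:
        raise ValueError("no comparable characters")
    return sum(a != b for a, b in pairs) / len(pairs)


# Rows copied from the binary matrix in the cognate-coding figure.
ROWS = {
    "Proto-Uralic": "10100???",
    "South Saami": "10010100",
    "Standard Finnish": "10100010",
    "Hungarian": "10100100",
}

distances = {
    (a, b): distance(ROWS[a], ROWS[b]) for a, b in combinations(ROWS, 2)
}
for (a, b), d in distances.items():
    print(f"{a} / {b}: {d:.2f}")
```

Collapsing the characters into one number per language pair discards which cognate sets are shared, which is the main difference from the character-based methods.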


Loan words: a problem?

MrBayes analyses (Lehtinen et al. 2014) for
1. more stable (WOLD 1-100)
2. less stable (WOLD 401-500)
basic vocabulary

• Removing known loans retains the old, unattested loans.
• Horizontal transfer IS an essential part of language lineaging.
• Current models don’t differentiate between inherited cognates and lost & replaced root forms
• How big of a problem?

[Figure panels: 100 most stable in WOLD list; Less stable]


Added value of phylogenetic linguistics?

Technical issues:
• Objective handling of large data sets
• (Un)certainty (posterior probabilities, model comparison) of the inference
• Flexible analyses (model can be adjusted as needed)

Phylogenetic modelling allows:
● Making trees without prior assumptions (only data talks)
● Using earlier knowledge as “priors” in evolutionary modelling (earlier knowledge and data talk)
● Running the model without data: Do only the priors talk?
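Model comparison is often summarised as a log Bayes factor between two scenarios. The sketch below assumes that a marginal log-likelihood has already been estimated for each scenario by the inference software; the two numbers are hypothetical placeholders, and equal prior weight for both scenarios is an assumption.

```python
import math


def posterior_model_probs(log_marginals):
    """Posterior model probabilities, assuming equal prior model weights."""
    top = max(log_marginals.values())  # subtract the maximum for stability
    weights = {name: math.exp(value - top) for name, value in log_marginals.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical marginal log-likelihoods for two alternative scenarios.
log_marginals = {"scenario_A": -1000.0, "scenario_B": -1003.0}

# Log Bayes factor in favour of scenario A over scenario B.
log_bf = log_marginals["scenario_A"] - log_marginals["scenario_B"]
probs = posterior_model_probs(log_marginals)

print("log Bayes factor (A vs B):", log_bf)
print({name: round(p, 3) for name, p in probs.items()})
```

A positive log Bayes factor favours the first scenario; how large it must be to count as decisive is a matter of convention and should be reported together with the estimation method.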


Added value of phylogenetic linguistics

Contextual issues:
• Aim is not to create a new linguistic paradigm, but to add to existing paradigms
• Possibility to test the probability of alternative hypotheses

From basic research to “applied historical linguistics”
• Data and results readily usable for other disciplines
• Language history as a well-studied approximation of human history
• → Stronger role in studies of holistic human prehistory?


Acknowledgements