

Vocabulary lists in computational historical linguistics

Licentiate Seminar

Taraka Rama

Språkbanken & GSLT

University of Gothenburg


Outline

Introduction

Language change

Models of language change

Diversity and differences

Questions, answers and contributions

Acknowledgements

References


Goal of the thesis

Applying techniques from Language Technology (LT) to the following problems:

• Dating of language families
• Structural similarity vs. genetic similarity
• Language classification


Language count

• More than 7000 languages, or
• 100,000 languoids [1]
• 400 language families

[1] Defined as a set of documented and closely related linguistic varieties (Nordhoff & Hammarström 2012).


Historical linguistics I

Concerned with:

• Language change: phonological, lexical, grammatical, and semantic change
• The processes that introduce language change
• Identifying the (pre-)historic relationships between languages


Historical linguistics II

From Diamond (2011), Vajda (2010).


Computational Historical Linguistics

Employs LT (including computational) techniques to:

• Classify languages
• Evaluate language relatedness hypotheses
• Devise phonological rule systems
• Reconstruct proto-forms


Basics: I

Cognates:

• Inherited words whose origin can be traced back to a common form
• Ex: Sanskrit dva ~ Armenian erku ‘two’; Sanskrit chakra ~ English wheel < PIE *kwekwelo


Basics: II

Cognacy representation [2]

Item       Danish    Swedish          Dutch    English
‘skin’     skind/1   skinn/1, hud/2   huid/2   skin/1

Items      Danish    Swedish          Dutch    English
‘skin-1’   1         1                0        1
‘skin-2’   0         1                1        0
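The step from the first table to the second can be sketched in a few lines of Python. This is only an illustration of the presence/absence coding shown above; the function name `to_binary_matrix` and the dictionary layout are assumptions made for this sketch, not part of the cited work.

```python
from collections import defaultdict

# Cognate-coded word list: each form carries its cognate-class ID after "/".
# The data mirrors the 'skin' row of the table above.
wordlist = {
    "skin": {
        "Danish": ["skind/1"],
        "Swedish": ["skinn/1", "hud/2"],
        "Dutch": ["huid/2"],
        "English": ["skin/1"],
    }
}

def to_binary_matrix(wordlist):
    """Return {item-class label: {language: 0 or 1}} (presence/absence coding)."""
    matrix = {}
    for item, forms_by_language in wordlist.items():
        # Collect, for every cognate class, the languages that attest it.
        classes = defaultdict(set)
        for language, forms in forms_by_language.items():
            for form in forms:
                _, class_id = form.rsplit("/", 1)
                classes[class_id].add(language)
        # One binary row per (item, cognate class) pair.
        for class_id in sorted(classes):
            matrix[f"{item}-{class_id}"] = {
                language: int(language in classes[class_id])
                for language in forms_by_language
            }
    return matrix

binary = to_binary_matrix(wordlist)
for label, row in binary.items():
    print(label, row)
```

Each meaning contributes as many binary rows as it has cognate classes, so a language with two synonyms (Swedish here) gets a 1 in both rows.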


Basics: III

Families:

• Language families: related languages descended from a common ancestor
  • Ex: Indo-European, Dravidian, Niger-Congo, Mixe-Zoquean, and Austronesian
• Language group: subset of a language family
  • Indo-European: Slavic, Germanic, Indo-Iranian (Iranian and Indic)


Basics: IV

Spread of Indo-European family [3]

[2] Wichmann (2010)

[3] Bouckaert et al. (2012)


Word lists

(from Grant 2010)

Holman et al. (2008): 40-word list [a]

blood, bone, breasts, come, die, dog, drink, ear, eye, fire, fish, full, hand, hear, horn, I, knee, leaf, liver, louse, mountain, name, new, night, nose, one, path, person, see, skin, star, stone, sun, tongue, tooth, tree, two, water, we, you

[a] Swadesh (1955): 100-word list


Phonological change: I

• Sound addition: Cypriot Arabic developed a [k], as in *pjara > pkjara
• Sound loss: *tracu > Pengo racu ‘snake’
• Metathesis: Latin miraculum > Spanish milagro ‘miracle’


Phonological change: II

Levenshtein (1966): computes the distance between two strings as the minimum number of insertions, deletions, and substitutions needed to transform a source string into a target string (LD).

Damerau (1964) is an extension to LD that also counts transpositions of adjacent symbols.
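A minimal sketch of plain LD with unit costs, using the standard dynamic-programming formulation. The function name `levenshtein` and the example word pairs (taken from the cognate table earlier in the talk) are chosen for illustration only.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions (unit costs)."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j symbols of `target`.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein("skind", "skinn"))
print(levenshtein("hud", "huid"))
```

For language comparison the raw distance is often divided by the length of the longer word so that long and short words are comparable; that normalisation is left out here.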


Phonological change: III

Linguistically sensitive Levenshtein distance [4]

• Represent each symbol as a vector of phonetic features
• Compare the vectors of phonetic features using a distance measure

[4] Kessler (2005)
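The two steps above can be sketched as follows. The feature table is a hypothetical toy inventory with four binary features, and the substitution cost (share of differing features) is one possible distance measure assumed for this sketch; it is not claimed to be the weighting used by Kessler (2005).

```python
# Hypothetical toy feature table: (voiced, nasal, labial, coronal).
# A real system would use a fuller phonetic feature set.
FEATURES = {
    "p": (0, 0, 1, 0),
    "b": (1, 0, 1, 0),
    "m": (1, 1, 1, 0),
    "t": (0, 0, 0, 1),
    "d": (1, 0, 0, 1),
    "n": (1, 1, 0, 1),
    "a": (1, 0, 0, 0),
}

def substitution_cost(x, y):
    """Fraction of features on which two symbols differ (0.0 to 1.0)."""
    fx, fy = FEATURES[x], FEATURES[y]
    return sum(a != b for a, b in zip(fx, fy)) / len(fx)

def feature_levenshtein(source, target, indel_cost=1.0):
    """Levenshtein distance with feature-based substitution costs."""
    previous = [j * indel_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [i * indel_cost]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + indel_cost,                   # deletion
                current[j - 1] + indel_cost,                # insertion
                previous[j - 1] + substitution_cost(s, t),  # graded substitution
            ))
        previous = current
    return previous[-1]

print(substitution_cost("p", "b"))        # differ only in voicing
print(feature_levenshtein("pat", "bad"))  # two voicing substitutions
```

With this cost, replacing [p] by [b] is cheaper than replacing [p] by [n], which plain LD cannot express. Symbols missing from the toy table raise a `KeyError`.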


Semantic change: I

• Semantic change
• Lexical change
• Grammatical change


Semantic change: II

Typology:

• Broadening and narrowing: English dog ~ hound
• Melioration and pejoration: OHG diorna ‘young girl’ > MHG dirne ‘prostitute’
• Metaphoric extension: head, tail, star
• Metonymic extension: Sanskrit ratha ‘chariot’ ~ Latin rota ‘wheel’


Semantic change: III

Lexical change

• Borrowing: beef ‘cow’ from Norman French
• Neologisms: new words in a language
  • vandalize (from Vandals, a Germanic tribe)
  • all + together ⇒ altogether
  • gym < gymnasium


Semantic change: IV

Grammatical change

• Morphological change: English umlaut ⇒ foot : feet, mouse : mice
• Syntactic change: word order, morphological complexity, verb chains, and grammaticalization


Tree model

Figure: IE family from Garrett (1999).


Wave model

Figure: Indo-European isoglosses from Bloomfield (1935, 316) and tree-envelope from Southworth (1964). 1. Sibilants for velars in certain forms. 2. Case-endings with [m] for [bh]. 3. Passive-voice endings with [r]. 4. Prefix [e-] in past tenses. 5. Feminine nouns with masculine suffixes. 6. Perfect tense used as general past tense.


Network model [5]

[5] Gray & Atkinson (2003) and Huson & Bryant (2006)


Linguistic diversity: I

As defined by Nettle (1999):

• Language diversity
• Phylogenetic diversity
• Structural diversity


Linguistic diversity: II

Language diversity

• Languages per square kilometer
• Ex. New Guinea has 800 languages (786 × 10³ km²) ~ Iceland has only one language (103 × 10³ km²)
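The comparison above is a simple ratio; a short sketch makes the arithmetic explicit. The figures are those quoted on the slide, while the dictionary layout and the helper `language_density` are illustrative assumptions.

```python
# Figures from the slide: number of languages and land area in km^2.
regions = {
    "New Guinea": {"languages": 800, "area_km2": 786e3},
    "Iceland": {"languages": 1, "area_km2": 103e3},
}

def language_density(region):
    """Languages per square kilometer."""
    return region["languages"] / region["area_km2"]

density = {name: language_density(data) for name, data in regions.items()}
ratio = density["New Guinea"] / density["Iceland"]

for name, value in density.items():
    print(f"{name}: {value * 1000:.3f} languages per 1000 km^2")
print(f"New Guinea / Iceland density ratio: {ratio:.1f}")
```

On these figures New Guinea has about one language per 1000 km², roughly a hundred times the density of Iceland.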


Linguistic diversity: III

Hotspots of language diversity [6]


Linguistic diversity: IV

Phylogenetic diversity

• Number of families per square kilometer
• North America has more than 20 language families in 24.49 × 10⁶ km²
• South America has 53 families in 17.84 × 10⁶ km²


Linguistic diversity: V

North American family distribution [7]


Linguistic diversity: VI

Structural diversity

• Languages per square kilometer with respect to a linguistic feature
• Ex. Word order, size of phoneme inventory, morphological type, or suffixing vs. prefixing


Linguistic diversity: VII

Plosive systems [8]

Encoding


Linguistic diversity: VIII

Phonological segment distribution

Figure: Segment identity. White circles: Romance; black circles: North Sea; grey squares: Slavic; grey triangles: Ireland, Britain; the rest are isolates.


Linguistic diversity: IX

Phonological diversity

• Similarity between the phonetic inventories translates into language relatedness (Lohr 1998)
• Differences translate into inter-language distance (or divergence)

[6] Gorenflo et al. (2012)

[7] http://www.freelang.net/families/language_maps.php

[8] Donohue (2012)


Questions:

1. Can language relationships be inferred from parallel corpora? (Corpus-based phylogenetic inference)

2. How well can structural relations be employed for the task of language classification? (Structural similarity and genetic classification)

3. How to develop a system for dating the split/divergence of language groups present in the world’s language families? (Estimating age of language families)

4. How to generate a ranked list of concepts which can be used for investigating the problem of automatic language classification? (Item stability)

5. Which string similarity measure is the best for the tasks of automatic discrimination and internal classification of languages? (Comparison of string similarity measures for automated language classification)


Corpus-based phylogenetic inference [9]

• Use three different string similarity measures
• Show that parallel corpora can be used to automatically extract cognates and infer a phylogenetic tree
• Work with 10 European languages

Figure: Dice and LCSR tree

[9] Rama & Borin (2011)
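Two of the measures named in the figure label, Dice and LCSR, can be sketched as below. The definitions used are the common textbook ones (Dice over character bigrams; longest common subsequence length divided by the length of the longer word); the exact variants in Rama & Borin (2011) may differ, and the function names are assumptions of this sketch.

```python
from collections import Counter

def bigrams(word):
    """Adjacent character pairs of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def dice(x, y):
    """Dice coefficient over character bigrams (multiset version)."""
    bx, by = Counter(bigrams(x)), Counter(bigrams(y))
    total = sum(bx.values()) + sum(by.values())
    if total == 0:
        return 0.0
    shared = sum((bx & by).values())  # multiset intersection
    return 2 * shared / total

def lcs_length(x, y):
    """Length of the longest common subsequence."""
    previous = [0] * (len(y) + 1)
    for a in x:
        current = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def lcsr(x, y):
    """Longest common subsequence ratio."""
    longest = max(len(x), len(y))
    return lcs_length(x, y) / longest if longest else 0.0

print(dice("skind", "skinn"), lcsr("skind", "skinn"))
print(dice("hud", "huid"), lcsr("hud", "huid"))
```

Both scores lie between 0 and 1; word pairs scoring above a chosen threshold can be treated as cognate candidates, and 1 minus the averaged score as a distance for tree building.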


The grouping of Dutch, English and French: the first two have borrowed large parts of the vocabulary used in the Europarl corpus (administrative and legal terms) from French, and additionally in many cases have a spelling close to the original French form of the words (whereas French loanwords in e.g. Swedish have often been orthographically adapted, for example French jus ~ English juice ~ Swedish sky ‘meat juice’).


Structural similarity and genetic classification [10]

Correlate typological distances with genealogical classification

[10] Rama & Kolachina (2012)


Correlate typological distances with the lexical distances computed from 40-word Swadesh lists


Estimating the age of language families I

The combination of phonotactic diversity and lexical divergence is used to predict the dates of splits for more than 50 language families [11]
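One simple way to quantify phonotactic diversity is to count the distinct n-grams attested in the word lists of a language group. The sketch below assumes that reading; the function name `ngram_diversity` and the toy word forms (orthographic, taken from the cognate table earlier in the talk) are illustrative and not the data of Rama (2013).

```python
def ngrams(word, n):
    """All contiguous substrings of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_diversity(words, n):
    """Number of distinct n-grams (types) in a collection of word forms."""
    return len({gram for word in words for gram in ngrams(word, n)})

# Toy word forms pooled over a small group of languages.
forms = ["skind", "skinn", "hud", "huid", "skin"]

for n in (1, 2, 3):
    print(n, ngram_diversity(forms, n))
```

The intuition is that a group that has been diverging for longer accumulates more distinct sound sequences, so the count grows with age; real input would be phonetic transcriptions rather than spellings.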


Estimating the age of language families II

Subfamily           NOL   CD     Type   FN    MOS   GA [12]
Brythonic           2     1450   H      IE    AGR   Eurasia
Dardic              22    3550   A      IE    AGR   Eurasia
Inuit               4     800    A      EA    PAS   Americas
Malayo-Polynesian   954   4250   A      An    AGR   Oceania
Ongamo-Maa          4     1150   A      NS    AGR   Africa
Slavic              16    1450   H      IE    AGR   Eurasia
Turkic              51    2500   AH     Alt   AGR   Eurasia


Estimating the age of language families III

Figure: Pairwise scatterplot matrix of group size, N-gram diversity and date; the lower matrix panels show scatterplots and LOESS lines; the upper matrix panels show Spearman rank correlation (ρ) and level of statistical significance (*). The diagonal panels display variable names (LGS, 1-grams, 2-grams, 3-grams, 4-grams, 5-grams, CD). All the plots are on a log-log scale. The 21 correlations shown range from 0.646 to 0.987, all marked ***.

[11] Rama (2013)

[12] FN: family name; MOS: mode of subsistence; GA: geographical area; An: Austronesian; Alt: Altaic; EA: Eskimo-Aleut; NS: Nilo-Saharan


Comparison of string similarity measures I

• Compare the performance of 14 different string similarity techniques: [13]

1. IDENT
2. PREFIX
3. DICE
4. LCSR
5. TRIGRAM
6. XDICE
7. Jaccard’s index (JCD)
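Three of the listed measures are simple enough to sketch directly. The definitions below are assumptions based on common usage (IDENT as exact match, PREFIX as shared-prefix length over the longer word, JCD as Jaccard's index over bigram sets); the precise formulations in Rama & Borin (2014a) may differ.

```python
def ident(x, y):
    """1.0 if the two words are identical, else 0.0."""
    return 1.0 if x == y else 0.0

def prefix(x, y):
    """Length of the common prefix divided by the length of the longer word."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0
    common = 0
    for a, b in zip(x, y):
        if a != b:
            break
        common += 1
    return common / longest

def jaccard(x, y, n=2):
    """Jaccard's index over the sets of character n-grams."""
    gx = {x[i:i + n] for i in range(len(x) - n + 1)}
    gy = {y[i:i + n] for i in range(len(y) - n + 1)}
    union = gx | gy
    return len(gx & gy) / len(union) if union else 0.0

pair = ("skind", "skinn")
print(ident(*pair), prefix(*pair), jaccard(*pair))
```

All three return similarities in the range 0 to 1, so they can be compared on the same word pairs or turned into distances by subtracting from 1.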


Comparison of string similarity measures II

The FDR procedure¹⁴

- suggests that the choice of string similarity measure is immaterial for internal classification

- for Dist, suggests that JCD(D) > JCD, JCD > TRI(D), DICED > IDENTD, LDND > LCSD, and LCSD > LDN

¹³ Rama & Borin (2014a)

¹⁴ False Discovery Rate (Benjamini & Hochberg 1995)
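For reference, the Benjamini & Hochberg (1995) step-up rule can be sketched as below. How the p-values for the pairwise comparisons of measures were obtained is not shown on the slide, so the function name, the default alpha and the inputs are assumptions for illustration only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values (NOT results from the study).
example = benjamini_hochberg([0.039, 0.001, 0.205, 0.008, 0.041])
```

Note that a p-value above its own threshold is still rejected when a larger p-value further down the ordering meets its threshold; this is what distinguishes the step-up rule from testing each rank in isolation.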

Item stability I

Employ n-grams to quantify the resistance to lexical replacement across the branches of a language family.¹⁵

Item stability II

Lexical items with widespread and numerous cognates are stable. This notion can be captured using self-entropy of n-grams.
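One plausible reading of this idea, sketched under stated assumptions: pool the character n-grams of all word forms recorded for one item across a family and take the Shannon entropy of that distribution. Forms that resemble each other reuse the same n-grams, so the entropy is lower. The exact formulation used in the study may differ; the function names and the word forms below are invented for illustration.

```python
import math
from collections import Counter

def char_ngrams(word, n):
    """List of character n-grams of a word (no padding)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_self_entropy(forms, n=2):
    """Shannon entropy (in bits) of the pooled n-gram distribution of the forms."""
    counts = Counter(g for form in forms for g in char_ngrams(form, n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Invented toy forms (NOT data from the study).
similar_forms = ["mater", "mater", "mater"]  # the same form in every language
diverse_forms = ["dog", "hund", "perro"]     # unrelated forms

low = ngram_self_entropy(similar_forms)
high = ngram_self_entropy(diverse_forms)
```

In this toy case the identical forms give four bigrams with equal probability (2 bits), while the unrelated forms give nine distinct bigrams (log2 9 bits). Word length and sample size also affect the raw value, so a real comparison would need some normalisation.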

Item stability III

Ranks derived from n-gram analysis largely agree with the item stability ranks based on phonological matches found by Holman et al. (2008) using LD as the similarity measure.

Item stability IV

At the same time, n-gram analysis is cheaper in terms of computational resources: the fundamental comparison step has linear complexity, against quadratic complexity for LD. This matters when processing large quantities of language data.

¹⁵ Rama & Borin (2014b)
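The complexity contrast can be seen in a small sketch. Levenshtein distance (LD) fills a table with one cell per pair of positions, so its cost grows with the product of the two word lengths, while counting n-grams makes a single pass over each word. The bigram overlap shown here is only an illustrative stand-in for the n-gram comparison step, not necessarily the one used in the study.

```python
from collections import Counter

def levenshtein(a, b):
    """Levenshtein distance: nested loops, len(a) * len(b) cell updates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def shared_bigrams(a, b):
    """Number of shared bigram tokens: one pass per word, len(a) + len(b) steps."""
    count_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    count_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    # Counter intersection keeps the minimum count of each bigram.
    return sum((count_a & count_b).values())
```

For example, `levenshtein("night", "nacht")` needs 25 cell updates to find a distance of 2, whereas `shared_bigrams("night", "nacht")` scans the 4 bigrams of each word once and finds 1 in common.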

Future work: I

- Exploit longer word lists such as IDS¹⁶ and LWT¹⁷ (Borin, Comrie & Saxena 2013)

- Apply all the available string similarity measures¹⁸

Future work: II

- Check the relationship between reticulation and typological distances (Donohue 2012)

- Use multilingual treebanks for the comparison of word order, part-of-speech, and syntactic subtree (or treelet) distributions (Wiersma et al. 2011)

Future work: III

- Include the phylogenetic tree structure in automatic dating (Pagel 1999)

- Extract typological and phonological databases (Nordhoff 2012) from:

  - digitized grammatical descriptions

  - public resources such as Wikipedia and Wiktionary

¹⁶ Intercontinental Dictionary Series

¹⁷ Loanword Typology

¹⁸ SimMetrics: http://sourceforge.net/projects/simmetrics/

Thanks for listening!

References: I

Benjamini, Y. & Hochberg, Y. (1995), ‘Controlling the false discovery rate: A practical and powerful approach to multiple testing’, Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 289–300.

Bloomfield, L. (1935), Language, George Allen and Unwin, London.

Borin, L., Comrie, B. & Saxena, A. (2013), The intercontinental dictionary series – a rich and principled database for language comparison, in L. Borin & A. Saxena, eds, ‘Approaches to Measuring Linguistic Differences’, De Gruyter Mouton, Berlin, pp. 285–302.

Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., Gray, R. D., Suchard, M. A. & Atkinson, Q. D. (2012), ‘Mapping the origins and expansion of the Indo-European language family’, Science 337(6097), 957–960.

Damerau, F. J. (1964), ‘A technique for computer detection and correction of spelling errors’, Communications of the ACM 7(3), 171–176.

Diamond, J. (2011), ‘Linguistics: Deep relationships between languages’, Nature 476(7360), 291–292.

Donohue, M. (2012), ‘Typology and Areality’, Language Dynamics and Change 2(1), 98–116.

Garrett, A. (1999), A new model of Indo-European subgrouping and dispersal, in S. S. Chang, L. Liaw & J. Ruppenhofer, eds, ‘Proceedings of the Twenty-Fifth Annual Meeting of the Berkeley Linguistics Society’, Berkeley Linguistics Society, Berkeley, pp. 146–156.

Gorenflo, L. J., Romaine, S., Mittermeier, R. A. & Walker-Painemilla, K. (2012), ‘Co-occurrence of linguistic and biological diversity in biodiversity hotspots and high biodiversity wilderness areas’, Proceedings of the National Academy of Sciences 109(21), 8032–8037.

References: II

Grant, A. P. (2010), ‘Swadesh’s life and place in linguistics’, Diachronica 27(2), 191–196.

Gray, R. D. & Atkinson, Q. D. (2003), ‘Language-tree divergence times support the Anatolian theory of Indo-European origin’, Nature 426(6965), 435–439.

Holman, E. W., Wichmann, S., Brown, C. H., Velupillai, V., Müller, A. & Bakker, D. (2008), ‘Explorations in automated language classification’, Folia Linguistica 42(3-4), 331–354.

Huson, D. H. & Bryant, D. (2006), ‘Application of phylogenetic networks in evolutionary studies’, Molecular Biology and Evolution 23(2), 254–267.

Kessler, B. (2005), ‘Phonetic comparison algorithms’, Transactions of the Philological Society 103(2), 243–260.

Levenshtein, V. I. (1966), Binary codes capable of correcting deletions, insertions and reversals, in ‘Soviet Physics Doklady’, Vol. 10, p. 707.

Lohr, M. (1998), Methods for the genetic classification of languages, PhD thesis, University of Cambridge.

Nettle, D. (1999), Linguistic Diversity, Oxford University Press, Oxford.

Nordhoff, S., ed. (2012), Electronic Grammaticography, University of Hawai‘i, Honolulu, Hawai‘i.

Nordhoff, S. & Hammarström, H. (2012), Glottolog/Langdoc: Increasing the visibility of grey literature for low-density languages, in ‘Language Resources and Evaluation Conference’, pp. 3289–3294.

References: III

Pagel, M. (1999), ‘Inferring the historical patterns of biological evolution’, Nature 401(6756), 877–884.

Rama, T. (2013), ‘Phonotactic diversity predicts the time depth of the world’s language families’, PLoS ONE 8(5), e63238.

Rama, T. & Borin, L. (2011), Estimating language relationships from a parallel corpus. A study of the Europarl corpus, in ‘NEALT Proceedings Series (NODALIDA 2011 Conference Proceedings)’, Vol. 11, pp. 161–167. URL: http://hdl.handle.net/10062/17303

Rama, T. & Borin, L. (2014a), ‘Comparison of string similarity measures for automated language classification’. Under review.

Rama, T. & Borin, L. (2014b), ‘N-gram approaches to the historical dynamics of basic vocabulary’, Journal of Quantitative Linguistics 21(1), 50–64.

Rama, T. & Kolachina, P. (2012), How good are typological distances for determining genealogical relationships among languages?, in ‘COLING (Posters)’, pp. 975–984. URL: http://aclweb.org/anthology/C/C12/C12-2095.pdf

Southworth, F. C. (1964), ‘Family-tree diagrams’, Language 40(4), 557–565.

Swadesh, M. (1955), ‘Towards greater accuracy in lexicostatistic dating’, International Journal of American Linguistics 21(2), 121–137.

Vajda, E. (2010), Yeniseian, Na-Dene, and historical linguistics, in J. Kari & B. A. Potter, eds, ‘The Dene-Yeniseian Connection’, Anthropological Papers of the University of Alaska, pp. 100–118.

References: IV

Wichmann, S. (2010), Internal language classification, in S. Luraghi & V. Bubeník, eds, ‘Continuum Companion to Historical Linguistics’, Continuum International Publishing Group, pp. 70–88.

Wiersma, W., Nerbonne, J. & Lauttamus, T. (2011), ‘Automatically extracting typical syntactic differences from corpora’, Literary and Linguistic Computing 26(1), 107–124.
