
Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Jiaming Luo (CSAIL, MIT), Frederik Hartmann (University of Konstanz), Enrico Santus (Bayer), Regina Barzilay (CSAIL, MIT), Yuan Cao (Google Brain)

Abstract

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure of language closeness which correctly identifies related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the position favored by current scholarship.¹

1 Introduction

All the known cases of lost language decipherment have been accomplished by human experts, oftentimes over decades of painstaking efforts. At least a dozen languages are still undeciphered today. For some of those languages, even the most fundamental questions pertaining to their origins and connections to known languages are shrouded in mystery, igniting fierce scientific debate among humanities scholars. Can NLP methods be helpful in bringing some clarity to these questions? Recent work has already demonstrated that algorithms can successfully decipher lost languages like Ugaritic and Linear B (Luo et al., 2019), relying only on non-parallel data in known languages—Hebrew and Ancient Greek, respectively. However, these methods are based on assumptions that are not applicable to many undeciphered scripts.

¹ Code and data available at https://github.com/j-luo93/DecipherUnsegmented/.

The first assumption relates to the knowledge of the language family of the lost language. This information enables us to identify the closest living language, which anchors the decipherment process. Moreover, the models assume significant proximity between the two languages, so that a significant portion of their vocabulary is matched. The second assumption presumes that word boundaries are provided that uniquely define the vocabulary of the lost language.

One of the famous counterexamples to both of these assumptions is Iberian. The Iberian scripts are undersegmented with inconsistent use of word dividers. At the same time, there is no definitive consensus on its close known language—over the years, Greek, Latin, and Basque were all considered as possibilities.

In this paper, we introduce a decipherment approach that relaxes the above assumptions. The model is provided with undersegmented inscriptions in the lost language and the vocabulary in a known language. No assumptions are made about the proximity between the lost and the known languages, and the goal is to match spans in the lost texts with known tokens. As a byproduct of this model, we propose a measure of language closeness that drives the selection of the best target language from the wealth of world languages.


Given the vast space of possible mappings and the scarcity of guiding signal in the input data, decipherment algorithms are commonly informed by linguistic constraints. These constraints reflect consistent patterns in language change and linguistic borrowings. Examples of previously utilized constraints include skewness of vocabulary mapping and monotonicity of character-level alignment within cognates. We further expand the linguistic foundations of decipherment models, and incorporate phonological regularities of sound change into the matching procedure. For instance, a velar consonant [k] is unlikely to change into a labial [m]. Another important constraint in this class pertains to sound preservation, that is, the size of phonological inventories is largely preserved during language evolution.

Our approach is designed to encapsulate these constraints while addressing the segmentation issue. We devise a generative framework that jointly models word segmentation and cognate alignment. To capture the natural phonological geometry, we incorporate phonological features into character representations using the International Phonetic Alphabet (IPA). We introduce a regularization term to explicitly discourage the reduction of the phonological system and employ an edit distance-based formulation to model the monotonic alignment between cognates. The model is trained in an end-to-end fashion to optimize both the quality and the coverage of the matched tokens in the lost texts.

The ultimate goal of this work is to evaluate the model on an undeciphered language, specifically Iberian. Given how little is known about the language, it is impossible to directly assess prediction accuracy. Therefore, we adopt two complementary evaluation strategies to analyze model performance. First, we apply the model to deciphered ancient languages, Ugaritic and Gothic, which share some common challenges with Iberian. Second, we consider evaluation scenarios that capitalize on a few known facts about Iberian, such as personal names, and report the model's accuracy against these ground truths.

The results demonstrate that our model can robustly handle unsegmented or undersegmented scripts. In the Iberian personal name experiment, our model achieves a top 10 accuracy score of 75.0%. Across all the evaluation scenarios, incorporating phonological geometry leads to clear and consistent gains. For instance, the model informed by IPA obtains a 12.8% increase in the Gothic-Old Norse experiments. We also demonstrate that the proposed unsupervised measure of language closeness is consistent with historical linguistics findings on known languages.

2 Related Work

Non-parallel Machine Translation At a high level, our work falls into research on non-parallel machine translation. One of the important recent advancements in that area is the ability to induce accurate crosslingual lexical representations without access to parallel data (Lample et al., 2018b,a; Conneau and Lample, 2019). This is achieved by aligning embedding spaces constructed from large amounts of monolingual data. The size of data for both languages is key: high-quality monolingual embeddings are required for successful matching. This assumption, however, does not hold for ancient languages, where we can typically access a few thousand words at most.

Decoding Cipher Texts Man-made ciphers have been the focal point for most of the early work on decipherment. They usually use EM algorithms, which are tailored towards these specific types of ciphers, most prominently substitution ciphers (Knight and Yamada, 1999; Knight et al., 2006). Later work by Nuhn et al. (2013), Hauer et al. (2014), and Kambhatla et al. (2018) addresses the problem using a heuristic search procedure, guided by a pretrained language model. To the best of our knowledge, these methods developed for tackling man-made ciphers have so far not been successfully applied to archaeological data. One contributing factor could be the inherent complexity in the evolution of natural languages.

Deciphering Ancient Scripts Our research is most closely aligned with computational decipherment of ancient scripts. Prior work has already featured several successful instances of ancient language decipherment previously done by human experts (Snyder et al., 2010; Berg-Kirkpatrick and Klein, 2013; Luo et al., 2019). Our work incorporates many linguistic insights about the structure of valid alignments introduced in prior work, such as monotonicity. We further expand the linguistic foundation by incorporating phonetic regularities that have been beneficial in early, pre-neural decipherment work (Knight et al., 2006). However, our model is designed to handle challenging cases not addressed by prior work, where segmentation of the ancient scripts is unknown. Moreover, we are interested in dead languages without a known relative and introduce an unsupervised measure of language closeness that enables us to select an informative known language for decipherment.

3 Model

We design a model for the automatic extraction of cognates² directly from unsegmented or undersegmented texts (detailed setting in Section 3.1). In order to properly handle the uncertainties caused by the issue of segmentation, we devise a generative framework that composes the lost texts using smaller units—from characters to tokens, and from tokens to inscriptions. The model is trained in an end-to-end fashion to optimize both the quality and the coverage of the matched tokens.

To help the model navigate the complex search space, we consider the following linguistic properties of sound change, including phonology and phonetics, in our model design:

• Plausibility of sound change: Similar sounds rarely change into drastically different sounds. This pattern is captured by the natural phonological geometry in human speech sounds, and we incorporate relevant phonological features into the representation of characters.

• Preservation of sounds: The size of phonological inventories tends to be largely preserved over time. This implies that total disappearance of any sound is uncommon. In light of this, we use a regularization term to discourage any sound loss in the phonological system of the lost language.

• Monotonicity of alignment: The alignment between any matched pair is predominantly monotonic, which means that character-level alignments do not cross each other. This property inspires our edit distance-based formulation at the token level.

² Throughout this paper, the term cognate is liberally used to also include loanwords, as the sound correspondences in cognates and loanwords are both regular, although usually different.

To reason about phonetic proximity, we need to find a character representation that explicitly reflects its phonetic properties. One such representation is provided by the IPA, where each character is represented by a vector of phonological features. As an example, consider the IPA representation of two phonetically close characters [b] and [p] (see Figure 3), which share two identical coordinates. To further refine this representation, the model learns to embed these features into a new space, optimized for the decipherment task.
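To make this geometry concrete, the minimal Python sketch below compares IPA characters through a small, illustrative three-group feature inventory (not the paper's exact feature set): [b] and [p] agree on two of the three groups, while a velar [k] and a labial nasal [m] agree on none.

```python
# A minimal sketch of the phonetic prior: IPA characters as phonological
# feature vectors. Feature names and values are illustrative only.
IPA_FEATURES = {
    "b": ("voiced",    "stop",  "labial"),
    "p": ("voiceless", "stop",  "labial"),
    "k": ("voiceless", "stop",  "velar"),
    "m": ("voiced",    "nasal", "labial"),
}

def feature_overlap(c1: str, c2: str) -> int:
    """Number of phonological feature groups on which two phones agree."""
    return sum(a == b for a, b in zip(IPA_FEATURES[c1], IPA_FEATURES[c2]))

print(feature_overlap("b", "p"))  # 2 -> phonetically close, plausible change
print(feature_overlap("k", "m"))  # 0 -> drastic change, implausible
```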

3.1 Problem Setting

We are given a list of unsegmented or undersegmented inscriptions $\mathcal{X} = \{X\}$ in the lost language, and a vocabulary, that is, a list of tokens $Y = \{y\}$, in the known language. For each lost text X, the goal is to identify a list of non-overlapping spans $\{x\}$ that correspond to cognates in Y. We refer to these spans as matched spans and any remaining characters as unmatched spans.

We denote the character sets of the lost and the known languages by $C_L = \{c^L\}$ and $C_K = \{c^K\}$, respectively. To exploit the phonetic prior, IPA transcriptions are used for $C_K$, while orthographic characters are used for $C_L$. For this paper, we only consider alphabetical scripts for the lost language.³

³ Given that the known side uses IPA, an alphabetical system, having an alphabetical system on the lost side makes it much easier to enforce the linguistic constraints in this paper. For other types of scripts, it requires more thorough investigation, which is beyond the scope of this work.
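For concreteness, a minimal sketch of the inputs and the desired output under this setting follows; the inscription strings and spans are illustrative (the first echoes the Gothic example in Table 5), and the type names are ours.

```python
from typing import List, Tuple

# Inputs: undersegmented inscriptions in the lost language (orthographic
# characters) and a vocabulary of the known language (IPA transcriptions).
# The strings below are illustrative only.
lost_inscriptions: List[str] = ["ammuhsaminhaidau", "gardawaldans"]
known_vocabulary: List[str] = ["xaið", "garð"]  # IPA-transcribed stems

# Desired output: for each inscription, a list of non-overlapping matched
# spans (start, end) paired with the known token they correspond to.
# Characters outside these spans form the unmatched spans.
Span = Tuple[int, int, str]  # (start, end, matched known token)
example_output: List[List[Span]] = [
    [(10, 14, "xaið")],  # "haid" in "ammuhsamin|haid|au" matched to "xaið"
    [(0, 4, "garð")],    # "gard" matched to "garð"
]
```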

3.2 Generative Framework

We design the following generative framework to handle the issue of segmentation. It jointly models segmentation and cognate alignment, which requires different treatments for matched spans and unmatched spans. An overview of the framework is provided in Figure 1 and a graphical model representation in Figure 2.

For matched spans, we introduce two latent variables: y, representing the corresponding cognate in the known language, and a, indicating the character alignment between x and y (see the Token box in Figure 1). More concretely, $a = \{a_\tau\}$ is a sequence of indices, with $a_\tau$ representing the aligned position for $y_\tau$ in x. The lost token is generated by applying a character-level mapping to y according to the alignment provided by a. For unmatched spans, we assume each character is generated in an independent and identically distributed fashion under a uniform distribution $p_0 = 1/|C_L|$.

Figure 1: An overview of our framework, which generates the lost texts from smaller units—from characters to tokens and from tokens to inscriptions. Character mappings are first performed on the phonetic alphabet of the known language. Based on these mappings, a token y in the known vocabulary Y is converted into a token x in the lost language according to the latent alignment variable a. Lastly, all generated tokens, together with characters in unmatched spans, are concatenated to form a lost inscription. Blue boxes display the corresponding linguistic properties associated with each level of modeling.

Figure 2: A graphical model representation for our framework to generate a span x. Characters in unmatched spans are generated in an independent and identically distributed fashion, whereas matched spans are additionally conditioned on two latent variables: y, representing a known cognate, and a, the character-level alignment between x and y.

Whether a span is matched or not is indicated by another latent variable z, and the corresponding span is denoted by $x_z$. More specifically, each character in an unmatched span is tagged by $z = O$, whereas the entirety of a matched span of length l is marked by $z = E_l$ at the end of the span (see the Inscription box in Figure 1). All spans are then concatenated to form the inscription, with a corresponding (sparse) tag sequence $Z = \{z\}$.

Under this framework, we arrive at the following derivation for the marginal distribution for each lost inscription X:

$$\Pr(X) = \sum_{Z} \Big[\prod_{z \in Z} \Pr(z)\Big] \Big[\prod_{\substack{z \in Z \\ z = O}} p_0\Big] \Big[\prod_{\substack{z \in Z \\ z \neq O}} \Pr(x_z \mid z)\Big], \qquad (1)$$

where $\Pr(x_z \mid z \neq O)$ is further broken down into individual character mappings:

$$\begin{aligned}
\Pr(x_z \mid z \neq O) &= \sum_{y \in Y} \sum_{a \in A} \Pr(y)\,\Pr(a) \cdot \Pr(x_z \mid y, z, a) \\
&\propto \sum_{y \in Y} \sum_{a \in A} \Pr(x_z \mid y, z, a) \\
&\approx \sum_{y \in Y} \max_{a} \Pr(x_z \mid y, z, a) \\
&= \sum_{y \in Y} \max_{a} \prod_{\tau} \Pr(x_{a_\tau} \mid y_\tau). \qquad (2)
\end{aligned}$$

Note that we assume a uniform prior for both y and a, and use the maximum to approximate the sum of $\Pr(x_z \mid y, z, a)$ over the latent variable a. A is the set of valid alignment values, to be detailed in §3.2.2.

3.2.1 Phonetics-aware Parameterization

The character mapping distributions are specified as follows:

$$\Pr_\theta(x_{a_\tau} = c^L_j \mid y_\tau = c^K_i) \propto \exp\!\Big(\frac{E_L(c^L_j) \cdot E_K(c^K_i)}{T}\Big), \qquad (3)$$

where T is a temperature hyperparameter, $E_L(\cdot)$ and $E_K(\cdot)$ are the embedding functions for the lost characters and the known characters, respectively, and θ is the collection of all trainable parameters (i.e., the embeddings).
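A small numpy sketch of Equation (3) follows; the embedding matrices are random stand-ins, and only the temperature value is taken from Appendix A.3. It shows how scaled dot products between lost and known character embeddings are normalized into per-known-character mapping distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_lost, n_known, dim = 5, 7, 16        # toy sizes, not the paper's
E_L = rng.normal(size=(n_lost, dim))   # embeddings of lost characters
E_K = rng.normal(size=(n_known, dim))  # embeddings of known IPA characters
T = 0.2                                # temperature (value from Appendix A.3)

# Equation (3): Pr(x = c^L_j | y = c^K_i) proportional to exp(E_L . E_K / T)
logits = E_L @ E_K.T / T                              # shape (n_lost, n_known)
logits -= logits.max(axis=0, keepdims=True)           # numerical stability
char_probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

# Column i is the distribution over lost characters for known character i.
assert np.allclose(char_probs.sum(axis=0), 1.0)
```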

In order to capture the similarity within certain sound classes, we use IPA embeddings to represent each IPA character in the known language. More specifically, each IPA character is represented by a vector of phonological features. The model learns to embed these features into a new space, and the full IPA embedding for $c^K$ is composed by concatenating all of its relevant feature embeddings. For the example in Figure 3, the phone [b] can be represented as the concatenation of the voiced embedding, the stop embedding, and the labial embedding.

This compositional structure encodes the natural geometry existent in sound classes (Stevens, 2000) and biases the model towards utilizing such a structure. By design, the representations for [b] and [p] are close as they share the same values for two out of three feature groups. This structural bias is crucial for realistic character mappings.

Figure 3: An illustration of IPA embeddings. Each phone is first represented by a vector of phonological features. The model first embeds each feature, and the IPA embedding is then obtained by concatenating all of its relevant feature embeddings. For instance, the phone [b] can be represented as the concatenation of the voiced, stop, and labial embeddings.

For the lost language, we represent each character $c^L_j$ as a weighted sum of IPA embeddings on the known side. Specifically,

$$E_L(c^L_j) = \sum_i w_{i,j} \cdot E_K(c^K_i), \qquad (4)$$

where $\{w_{i,j}\}$ are learnable parameters.
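The sketch below combines Figure 3 and Equation (4) under toy dimensions: known IPA characters are embedded by concatenating learned feature-group embeddings, and each lost character embedding is a learnable weighted sum of the known embeddings. The feature inventory and sizes are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim = 4  # toy size; the paper uses 100 per feature group (Appendix A.3)

# Learned embeddings for individual phonological feature values (toy inventory).
feature_embeddings = {
    name: rng.normal(size=feat_dim)
    for name in ["voiced", "voiceless", "stop", "nasal", "labial", "velar"]
}
ipa_features = {"b": ("voiced", "stop", "labial"),
                "p": ("voiceless", "stop", "labial"),
                "k": ("voiceless", "stop", "velar")}

def embed_known(char: str) -> np.ndarray:
    """E_K: concatenate the embeddings of the character's feature values (Figure 3)."""
    return np.concatenate([feature_embeddings[f] for f in ipa_features[char]])

known_chars = list(ipa_features)
E_K = np.stack([embed_known(c) for c in known_chars])  # (n_known, 3 * feat_dim)

# Equation (4): each lost character embedding is a weighted sum of E_K rows.
n_lost = 4
W = rng.random(size=(len(known_chars), n_lost))        # learnable weights w_{i,j}
E_L = W.T @ E_K                                        # (n_lost, 3 * feat_dim)
print(E_L.shape)
```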

3.2.2 Monotonic Alignment and Edit Distance

Individual characters in the known token y are mapped to a lost token x according to the alignment variable a. The monotonic nature of character alignment between cognate pairings motivates our design of an edit distance-based formulation to capture the dominant mechanisms involved in cognate pairings: substitutions, deletions, and insertions (Campbell, 2013). In addition to $a_\tau$ taking the value of an integer signifying the substituted position, $a_\tau$ can be the empty symbol $\epsilon$, which indicates that $y_\tau$ is deleted. To model insertions, $a_\tau = (a_{\tau,1}, a_{\tau,2})$ can be two⁴ adjacent indices in x.

This formulation inherently defines a set A of valid values for the alignment. Firstly, they are monotonically increasing with respect to τ, with the exception of $\epsilon$. Secondly, they cover every index of x, which means every character in x is accounted for by some character in y. The Token box in Figure 1 showcases such an example with all three types of edit operations.

⁴ Insertions of even longer character sequences are rare.

More concretely, we have the following alignment model:

$$\Pr(x_{a_\tau} \mid y_\tau) =
\begin{cases}
\Pr_\theta(x_{a_\tau} \mid y_\tau) & \text{(substitution)} \\
\Pr_\theta(\epsilon \mid y_\tau) & \text{(deletion)} \\
\Pr_\theta(x_{a_{\tau,1}} \mid y_\tau) \cdot \alpha \, \Pr_\theta(x_{a_{\tau,2}} \mid y_\tau) & \text{(insertion)}
\end{cases}$$

where $\alpha \in [0, 1]$ is a hyperparameter to control the use of insertions.
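A simplified dynamic program in this spirit is sketched below: it computes $\max_a \prod_\tau \Pr(x_{a_\tau} \mid y_\tau)$ over monotonic alignments built from substitutions, deletions, and (α-penalized) insertions. It is a toy, log-space counterpart of the formulation above rather than the paper's implementation; the probability functions are passed in as arguments.

```python
import math
from typing import Callable

def best_alignment_logprob(x: str, y: str,
                           logp_sub: Callable[[str, str], float],
                           logp_del: Callable[[str], float],
                           log_alpha: float) -> float:
    """Max over monotonic alignments of log Pr(x | y), with substitutions,
    deletions of y-characters, and penalized insertions (one y character
    emitting two adjacent x characters). Toy counterpart of Section 3.2.2."""
    n, m = len(y), len(x)
    NEG = float("-inf")
    # dp[t][i]: best log-probability of generating x[:i] from y[:t]
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for t in range(1, n + 1):
        for i in range(0, m + 1):
            best = dp[t - 1][i] + logp_del(y[t - 1])                    # deletion
            if i >= 1:                                                  # substitution
                best = max(best, dp[t - 1][i - 1] + logp_sub(x[i - 1], y[t - 1]))
            if i >= 2:                                                  # insertion
                ins = (dp[t - 1][i - 2] + logp_sub(x[i - 2], y[t - 1])
                       + log_alpha + logp_sub(x[i - 1], y[t - 1]))
                best = max(best, ins)
            dp[t][i] = best
    return dp[n][m]  # every position of x must be covered

# Toy usage: an identity-favoring substitution model over a small alphabet.
logp = lambda a, b: math.log(0.7 if a == b else 0.1)
print(best_alignment_logprob("gard", "garð", logp, lambda _: math.log(0.05),
                             math.log(0.5)))
```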

3.3 Objective

Given the generative framework, our training objective is designed to optimize the quality of the extracted cognates, while matching a reasonable proportion of the text.

Quality We aim to optimize the quality of matched spans under the posterior distribution $\Pr(Z \mid X)$, measured by a scoring function $\Phi(X, Z)$. $\Phi(X, Z)$ is computed by aggregating the likelihoods of these matched spans, normalized by length. The objective is defined as follows:

$$Q(X) = \mathbb{E}_{Z \sim \Pr(Z \mid X)}\, \Phi(X, Z), \qquad (5)$$
$$\Phi(X, Z) = \sum_{\substack{z \in Z \\ z \neq O}} \phi(x_z, z), \qquad (6)$$
$$\phi(x_z, z) = \Pr(x_z \mid z)^{\frac{1}{|x_z|}}. \qquad (7)$$

This term encourages the model to explicitly focus on improving the probability of generating the matched spans.

Regularity and Coverage The regularity of sound change, as stated by the Neogrammarian hypothesis (Campbell, 2013), implies that we need to find a reasonable number of matched spans. To achieve this goal, we incur a penalty if the expected coverage ratio of the matched characters under the posterior distribution falls below a given threshold $r_{\mathrm{cov}}$:

$$\Omega_{\mathrm{cov}}(\mathcal{X}) = \max\Big(r_{\mathrm{cov}} - \frac{\sum_{X \in \mathcal{X}} \mathrm{cov}(X)}{\sum_{X \in \mathcal{X}} |X|},\ 0.0\Big),$$
$$\mathrm{cov}(X) = \mathbb{E}_{Z \sim \Pr(Z \mid X)}\, \Psi(X, Z), \qquad (8)$$
$$\Psi(X, Z) = \sum_{\substack{z \in Z \\ z \neq O}} \psi(x_z, z) = \sum_{\substack{z \in Z \\ z \neq O}} |x_z|. \qquad (9)$$

Note that the ratio is computed on the entire corpus $\mathcal{X}$ instead of individual texts X, because the coverage ratio can vary greatly across individual texts. The hyperparameter $r_{\mathrm{cov}}$ controls the expected overlap between two languages, which enables us to apply the method even when languages share some loanwords but are not closely related.

Preservation of Sounds The size of phonological inventories tends to be largely preserved over time. This implies that total disappearance of any sound is uncommon. To reflect this tendency, we introduce an additional regularization term to discourage any sound loss. The intuition is to encourage any lost character to be mapped to exactly one⁵ known IPA symbol. Formally, we have the following term:

$$\Omega_{\mathrm{loss}}(C_L, C_K) = \sum_{c^L} \Big( \sum_{c^K} \Pr(c^L \mid c^K) - 1.0 \Big)^2.$$

⁵ We experimented with looser constraints (e.g., with at least one instead of exactly one correspondence), but obtained worse results.

Final Objective Putting the terms together, we have the following final objective:

$$S(\mathcal{X}; C_L, C_K) = \sum_{X \in \mathcal{X}} Q(X) + \lambda_{\mathrm{cov}}\, \Omega_{\mathrm{cov}}(\mathcal{X}) + \lambda_{\mathrm{loss}}\, \Omega_{\mathrm{loss}}(C_L, C_K), \qquad (10)$$

where $\lambda_{\mathrm{cov}}$ and $\lambda_{\mathrm{loss}}$ are both hyperparameters.
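The following schematic numpy sketch shows how the three terms combine into Equation (10). The per-inscription quantities Q(X) and cov(X) are toy numbers here (in the model they are expectations computed by the dynamic programs of Section 3.4), the λ values follow Appendix A.3, and the r_cov threshold is an assumed value for illustration.

```python
import numpy as np

# Toy precomputed quantities for a corpus of three inscriptions; in the paper
# Q(X) and cov(X) are expectations computed with dynamic programming.
Q_per_text = np.array([1.8, 0.9, 2.4])       # Equation (5), one value per X
cov_per_text = np.array([10.0, 4.0, 15.0])   # expected number of matched characters
text_lengths = np.array([20, 18, 25])        # |X| in characters

# Coverage penalty (Equation (8)): penalize if corpus-level coverage < r_cov.
r_cov = 0.5                                  # assumed threshold for illustration
coverage_ratio = cov_per_text.sum() / text_lengths.sum()
omega_cov = max(r_cov - coverage_ratio, 0.0)

# Sound-loss penalty: each lost character should receive ~1.0 total probability
# mass summed over known characters (columns of char_probs are known characters).
char_probs = np.full((5, 7), 1.0 / 5)        # toy Pr(c^L | c^K); columns sum to 1
omega_loss = ((char_probs.sum(axis=1) - 1.0) ** 2).sum()

# Final objective (Equation (10)); lambda values follow Appendix A.3.
lam_cov, lam_loss = 10.0, 100.0
S = Q_per_text.sum() + lam_cov * omega_cov + lam_loss * omega_loss
print(S)
```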

3.4 Training

Training with the final objective involves either finding the best latent variable, as in Equation (2), or computing the expectation under a distribution that involves one latent variable, as in Equation (5) and Equation (8). In both cases, we resort to dynamic programming to facilitate efficient computation and end-to-end training. We refer interested readers to Appendix A.1 for more detailed derivations. We illustrate one training step in Algorithm 1.

4 Experimental Setup

Our ultimate goal is to evaluate the decipherment capacity for unsegmented lost languages, without information about a known counterpart. Iberian fits both of these criteria. However, our ability to evaluate decipherment of Iberian is limited because a full ground truth is not known. Therefore, we supplement our evaluation on Iberian with more complete evaluation on lost languages with known translation, such as Gothic and Ugaritic.

Algorithm 1: One training step for our decipherment model

Input: one batch of lost inscriptions X̃; entire known vocabulary Y = {y}
Parameters: feature embeddings θ

1: Pr(c^L_j | c^K_i) ← ComputeCharDistr(θ)            ▷ compute character mapping distributions (Section 3.2.1)
2: Pr(x | y) ← EditDistDP(x, y, Pr(c^L_j | c^K_i))     ▷ compute token alignment probabilities (Section 3.2.2)
3: S(X̃; C_L, C_K) ← WordBoundaryDP(Pr(x | y))          ▷ compute the final objective (Section 3.3)
4: θ ← SGD(S)                                          ▷ backpropagate and update the parameters

Language | Family | Source | #Tokens | Segmentation Situation | Century
Gothic | Germanic | Wulfila† | 40,518 | unsegmented | 3–10 AD
Ugaritic | Semitic | Snyder et al. (2010) | 7,353†† | segmented | 14–12 BC
Iberian | unclassified | Hesperia‡ | 3,466‡‡ | undersegmented | 6–1 BC

† http://www.wulfila.be/gothic/download/.
†† This dataset directly provides the Ugaritic vocabulary, i.e., each word occurs exactly once.
‡ http://hesperia.ucm.es/. The Iberian language is semi-syllabic, but this database has already transliterated the inscriptions into Latin script.
‡‡ Since the texts are undersegmented and we do not know the ground truth segmentations, this represents the number of unsegmented chunks, each of which might contain multiple tokens.

Table 1: Basic information about the lost languages.

4.1 Languages

We focus our description on the Gothic and Iberian corpora that we compiled for this paper. The Ugaritic data was reused from the prior work on decipherment (Snyder et al., 2010). Table 1 provides statistics about these languages. To evaluate the validity of our proposed language proximity measure, we additionally include six known languages: Spanish (Romance), Arabic (Semitic), Hungarian (Uralic), Turkish (Turkic), classical Latin (Latino-Faliscan), and Basque (isolate).

Gothic Several features of Gothic make it an ideal candidate for studying decipherment models. Because Gothic is fully deciphered, we can compare our predictions against ground truth. Like Iberian, Gothic is unsegmented. Its alphabet was adapted from a diverse set of languages: Greek, Latin, and Runic, but some characters are of unknown origin. The latter were at the center of decipherment efforts on Gothic (Zacher, 1855; Wagner, 2006). Another appealing feature of Gothic is its relatedness to several known Germanic languages that exhibit various degrees of proximity to Gothic. The closest is its reconstructed ancestor Proto-Germanic, with Old Norse and Old English being more distantly related to Gothic. This variation in linguistic proximity enables us to study the robustness of decipherment methods to historical change in the source and the target.

Iberian Iberian serves as a real test scenario for automatic methods—it is still undeciphered, withstanding multiple attempts over at least two centuries. Iberian scripts present two issues facing many undeciphered languages today: undersegmentation and the lack of a well-researched relative. Many theories of origin have been proposed in the past, most notably linking Iberian to Basque, another non-Indo-European language on the Iberian peninsula. However, due to a lack of conclusive evidence, the current scholarship favors the position that Iberian is not genetically related to any living language. Our knowledge of Iberian owes much to the phonological system proposed by Manuel Gómez Moreno in the mid 20th century, based on fragmentary evidence such as bilingual coin legends (Sinner and Velaza, 2019). Another area with a broad consensus relates to Iberian personal names, thanks to a key Latin epigraph, the Ascoli Bronze, which recorded the grant of Roman citizenship to Iberian soldiers who had fought for Rome (Martí et al., 2017). We use these personal names recorded in Latin as the known vocabulary.


WR† \ Known language | Proto-Germanic (PG) | Old Norse (ON) | Old English (OE) | avg††
0% | 0.820 / 0.749 / 0.863 | 0.213 / 0.397 / 0.597 | 0.046 / 0.204 / 0.497 | 0.360 / 0.450 / 0.652
25% | 0.752 / 0.734 / 0.826 | 0.312 / 0.478 / 0.610 | 0.128 / 0.328 / 0.474 | 0.398 / 0.513 / 0.637
50% | 0.752 / 0.736 / 0.848 | 0.391 / 0.508 / 0.643 | 0.169 / 0.404 / 0.495 | 0.438 / 0.549 / 0.662
75% | 0.761 / 0.732 / 0.866 | 0.435 / 0.544 / 0.682 | 0.250 / 0.447 / 0.533 | 0.482 / 0.574 / 0.693
avg‡ | 0.771 / 0.737 / 0.851 | 0.338 / 0.482 / 0.633 | 0.148 / 0.346 / 0.500 | 0.419 / 0.522 / 0.661

† Short for whitespace ratio.
‡ Averaged over all whitespace ratio values.
†† Averaged over all known languages.

Table 2: Main results on Gothic in a variety of settings, using A@10 scores. All scores are reported as triplets corresponding to the base / partial / full models. In general, more phonological knowledge about the lost language and more segmentation improve model performance. The choice of the known language also plays a significant role, as Proto-Germanic has a noticeably higher score than the other two choices.

4.2 Evaluation

Stemming and Segmentation Our matching process operates at the stem level for the known language, instead of full words. Stems are more consistently preserved during language change or linguistic borrowings. While we always assume that gold stems are provided for the known language, we estimate them for the lost language.

The original Gothic texts are only segmented into sentences. To study the effect of having varying degrees of prior knowledge about the word segmentations, we create separate datasets by randomly inserting ground truth segmentations (i.e., whitespaces) with a preset probability to simulate undersegmentation scenarios.
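A small sketch of this corruption procedure (function and variable names are ours) is shown below: starting from a segmented sentence, each ground-truth word boundary is kept as whitespace with a preset probability and dropped otherwise.

```python
import random

def undersegment(words, keep_prob: float, seed: int = 0) -> str:
    """Rejoin a segmented sentence, keeping each ground-truth whitespace
    boundary with probability keep_prob (the 'whitespace ratio')."""
    rng = random.Random(seed)
    out = [words[0]]
    for w in words[1:]:
        if rng.random() < keep_prob:
            out.append(" ")
        out.append(w)
    return "".join(out)

# Toy sentence (invented for illustration).
sentence = ["jah", "qab", "du", "im"]
print(undersegment(sentence, keep_prob=0.25))  # keeps ~25% of the boundaries
print(undersegment(sentence, keep_prob=0.0))   # fully unsegmented: "jahqabduim"
```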

Model Variants In multiple decipherment scenarios, partial information about phonetic assignments is available. This is the case with both Iberian and Gothic. Therefore, we evaluate the performance of our model with respect to the available phonological knowledge for the lost language. The base model assumes no knowledge, while the full model has full knowledge of the phonological system and therefore the character mappings. For the Gothic experiment, we additionally experiment with a partial model that assumes that we know the phonetic values for the characters k, l, m, n, p, s, and t. The sound values of these characters can be used as prior knowledge as they closely resemble their original counterparts in the Latin or Greek alphabets. These known mappings are incorporated through an additional term which encourages the model to match its predicted distributions with the ground truths.

In scenarios with full segmentations where it is possible to compare with previous work, we report the results for the Bayesian model proposed by Snyder et al. (2010) and NeuroCipher by Luo et al. (2019).

Metric We evaluate the model performance using top-K accuracy (A@K) scores. A prediction (i.e., a stem-span pair) is considered correct if and only if the stem is correct and the span is a prefix of the ground truth span. For instance, the ground truth for the Gothic word garda has the stem gard spanning the first four letters, matching the Old Norse stem garð. We only consider the prediction as correct if it correctly matches garð and the predicted span starts with the first letter.
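The sketch below implements this criterion for a single lost word (data structures and names are ours): a candidate counts as correct only if it proposes the gold stem and a span that is a prefix of the gold span, i.e., starts at the same position and does not overshoot.

```python
from typing import List, Tuple

Prediction = Tuple[str, int, int]  # (known stem, span start, span end)

def a_at_k(preds: List[Prediction], gold: Prediction, k: int) -> bool:
    """Top-K accuracy for one lost word: some top-k candidate proposes the
    gold stem together with a span that is a prefix of the gold span."""
    g_stem, g_start, g_end = gold
    for stem, start, end in preds[:k]:
        if stem == g_stem and start == g_start and end <= g_end:
            return True
    return False

# Example following the paper: Gothic "garda", gold stem span over the first
# four letters ("gard"), matching the Old Norse stem "garð".
gold = ("garð", 0, 4)
candidates = [("garð", 0, 4), ("warð", 0, 4), ("garð", 1, 5)]
print(a_at_k(candidates, gold, k=1))        # True: first candidate is correct
print(a_at_k([candidates[2]], gold, k=1))   # False: span does not start at 0
```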

5 Results

Decipherment of Undersegmented Texts Our main results on Gothic in Table 2 demonstrate that our model can effectively extract cognates in a variety of settings. Averaged over all choices of whitespace ratios and known languages (bottom right), our base/partial/full models achieve A@10 scores of 0.419/0.522/0.661, respectively. Not surprisingly, access to additional knowledge about phonological mappings and/or segmentation leads to improved performance. See Table 5 for an example of model predictions.

Figure 4: (a) A@K scores on Iberian using personal names recorded in Latin; (b), (c), and (d): closeness plots for Gothic, Ugaritic, and Iberian, respectively.

The choice of the known language also plays a significant role. On the closest language pair, Gothic-PG, A@10 reaches 75% even without assuming any phonological knowledge about the lost language. As expected, language proximity directly impacts the complexity of the decipherment task, which in turn translates into lower model performance on Old English and Old Norse. These results reaffirm that choosing a close known language is vital for decipherment.

The results on Iberian show that our model performs well on a real undeciphered language with undersegmented texts. As shown in Figure 4a, the base model reaches 60% in A@10 while the full model reaches 75%. Note that Iberian is non-Indo-European with no genetic relationship with Latin, but our model can still discover regular correspondences for this particular set of personal names.

Ablation Study To investigate the contribution of phonetic and phonological knowledge, we conduct an ablation study using Gothic/Old Norse (Table 4). The IPA embeddings consistently improve all the model variants. As expected, the gains are most noticeable (+12.8%) for the hardest matching scenario where no prior information is available (the base model). As expected, Ωloss is vital for base but unnecessary for full, which has readily available character mappings.

Lost language: | Ugaritic† | Gothic | Gothic | Gothic
Known language: | Hebrew | PG | ON | OE
Bayesian | 0.604 | – | – | –
NeuroCipher | 0.659 | 0.753 | 0.543 | 0.313
base | 0.778 | 0.865 | 0.558 | 0.472

† A@1 is reported for Ugaritic to allow a direct comparison with previous work. A@10 is still used for the Gothic experiments.

Table 3: Results comparing the base model with previous work. Bayesian and NeuroCipher are the models proposed by Snyder et al. (2010) and Luo et al. (2019), respectively. Ugaritic results for previous work are taken from their papers. For NeuroCipher, we run the authors' public implementation to obtain the results for Gothic.

IPA | Ωloss | base | partial | full
+ | + | 0.435 | 0.544 | 0.682
− | + | 0.307 | 0.490 | 0.599
+ | − | 0.000 | 0.493 | 0.695

Table 4: Ablation study on the pair Gothic-ON. Both the IPA embeddings and the regularization on sound loss are beneficial, especially when we do not assume much phonological knowledge about the lost language.

Inscription | Matched stem
ammuhsaminhaidau | xaið (ground truth)
ammuhsaminhaidau | xaið
ammuhsaminhaidau | raið
ammuhsaminhaidau | braið

Table 5: An example of the top-3 model predictions for base on Gothic-PG in the WR 0% setting. Spans are highlighted in the inscriptions. The first row presents the ground truth and the others are the model predictions. Green is used for correct predictions and red for incorrect ones.

Comparison with Previous Work To compare with the state-of-the-art decipherment models (Snyder et al., 2010; Luo et al., 2019), we consider the version of our model that operates with a 100% whitespace ratio for the lost language. Table 3 demonstrates that our model consistently outperforms the baselines for both Ugaritic and Gothic. For instance, it reaches an over 11% gain for the Hebrew/Ugaritic pair and over 15% for Gothic/Old English.

Identifying Close Known Languages Next, we evaluate the model's ability to identify a close known language to anchor the decipherment process. We expect that for a closer language pair, the predictions of the model will be more confident while matching more characters. We illustrate this idea with a plot that charts character coverage (i.e., what percentage of the lost texts is matched, regardless of correctness) as a function of prediction confidence value (i.e., the probability of generating a span, normalized by its length). As Figure 4b and Figure 4c illustrate, the model accurately predicts the closest languages for both Ugaritic and Gothic. Moreover, languages within the same family as the lost language stand out from the rest.
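A sketch of how such a closeness curve can be computed is given below, with toy inputs and our own naming: for a range of confidence thresholds it reports what fraction of the lost-text characters is covered by matched spans whose length-normalized probability exceeds the threshold.

```python
import numpy as np

def coverage_curve(span_lengths, span_confidences, total_chars, thresholds):
    """Character coverage (matched chars / all chars) as a function of the
    minimum length-normalized span probability; higher curves suggest a
    closer known language."""
    span_lengths = np.asarray(span_lengths, dtype=float)
    span_confidences = np.asarray(span_confidences, dtype=float)
    return [
        span_lengths[span_confidences >= t].sum() / total_chars
        for t in thresholds
    ]

# Toy matched spans extracted by the model for one candidate known language.
lengths = [4, 5, 4, 6, 3]
confidences = [0.9, 0.7, 0.4, 0.85, 0.2]   # length-normalized probabilities
thresholds = np.linspace(0.0, 1.0, 6)
print(coverage_curve(lengths, confidences, total_chars=60, thresholds=thresholds))
```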

The picture is quite different for Iberian (see Figure 4d). No language seems to have a pronounced advantage over others. This seems to accord with the current scholarly understanding that Iberian is a language isolate, with no established kinship with others. Basque somewhat stands out from the rest, which might be attributed to its similar phonological system with Iberian (Sinner and Velaza, 2019) and a very limited vocabulary overlap (numeral names) (Aznar, 2005) which does not carry over to the lexical system.⁶

⁶ For true isolates, whether the predicted segmentations are reliable despite the lack of cognates is beyond our current scope of investigation.

6 Conclusions

We propose a decipherment model to extract cognates from undersegmented texts, without assuming proximity between lost and known languages. Linguistic properties, such as phonetic plausibility of sound change and preservation of sounds, are incorporated into the model design. Our results on Gothic, Ugaritic, and Iberian show that our model can effectively handle undersegmented texts even when the source and target languages are not related. Additionally, we introduce a method for identifying close languages that correctly finds related languages for Gothic and Ugaritic. For Iberian, the method does not show strong evidence supporting Basque as a related language, concurring with the position favored by current scholarship.

Potential applications of our method are not limited to decipherment. The phonetic values of lost characters can be inferred by mapping them to known cognates. These values can serve as the starting point for lost sound reconstruction, and more investigation is needed to establish their efficacy. Moreover, the effectiveness of incorporating phonological feature embeddings provides a path for future improvement of cognate detection in computational historical linguistics (Rama and List, 2019). Currently, our method operates on a pair of languages. To simultaneously process multiple languages, as is common in the cognate detection task, more work is needed to modify our current model and its inference procedure.

Acknowledgments

We sincerely thank Noemí Moncunill Martí for her invaluable guidance on Iberian onomastics, and Eduardo Orduña Aznar for his tremendous help on the Hesperia database and the Vasco-Iberian theories. Special thanks also go to Ignacio Fuentes and Carme Huertas for the insightful discussions. This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract #FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References

Eduardo Orduña Aznar. 2005. Sobre algunos posibles numerales en textos ibéricos. Palaeohispanica, 5:491–506.

Taylor Berg-Kirkpatrick and Dan Klein. 2013. Decipherment with a million random restarts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 874–878. Association for Computational Linguistics.

Steven Bird. 2006. NLTK: The natural language toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 69–72. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1225403.1225421

Lyle Campbell. 2013. Historical Linguistics. Edinburgh University Press.

Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, 49(2):375–395. DOI: https://doi.org/10.1007/s10579-014-9287-y, PMID: 26321896, PMCID: PMC4551210

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7059–7069.

Bradley Hauer, Ryan Hayward, and Grzegorz Kondrak. 2014. Solving substitution ciphers with combined language models. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2314–2325, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Nishant Kambhatla, Anahita Mansouri Bigvand, and Anoop Sarkar. 2018. Decipherment of substitution ciphers with neural language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 869–874, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1102

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499–506, Sydney, Australia. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1273073.1273138

Kevin Knight and Kenji Yamada. 1999. A computational approach to deciphering unknown scripts. Unsupervised Learning in Natural Language Processing.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018b. Word translation without parallel data. In International Conference on Learning Representations.

Jiaming Luo, Yuan Cao, and Regina Barzilay. 2019. Neural decipherment via minimum-cost flow: From Ugaritic to Linear B. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3146–3155, Florence, Italy. Association for Computational Linguistics.

Noemí Moncunill Martí and others. 2017. Indigenous naming practices in the western Mediterranean: The case of Iberian. Studia Antiqua et Archaeologica, 23(1):7–20.

David R. Mortensen, Siddharth Dalmia, and Patrick Littell. 2018. Epitran: Precision G2P for many languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).

Malte Nuhn, Julian Schamper, and Hermann Ney. 2013. Beam search for solving substitution ciphers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1568–1576, Sofia, Bulgaria. Association for Computational Linguistics.

Taraka Rama and Johann-Mattis List. 2019. An automated framework for fast cognate detection and Bayesian phylogenetic inference in computational historical linguistics. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6225–6235, Florence, Italy. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1627

Jesús Rodríguez Ramos. 2014. Nuevo índice crítico de formantes de compuestos de tipo onomástico íberos. Arqueoweb: Revista sobre Arqueología en Internet, 15(1):7–158.

Donald A. Ringe. 2017. From Proto-Indo-European to Proto-Germanic. Oxford.

Alejandro Garcia Sinner and Javier Velaza. 2019. Palaeohispanic Languages and Epigraphies. Oxford University Press.

Benjamin Snyder, Regina Barzilay, and Kevin Knight. 2010. A statistical model for lost language decipherment. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1048–1057, Uppsala, Sweden. Association for Computational Linguistics. DOI: https://doi.org/10.1093/oso/9780198792581.001.0001

Kenneth N. Stevens. 2000. Acoustic Phonetics, volume 30. MIT Press.

Larry Trask. 2008. Etymological Dictionary of Basque. http://www.bulgari-istoria-2010.com/Rechnici/baski-rechnik.pdf.

Norbert Wagner. 2006. Zu got. hv, q und ai, au. Historische Sprachforschung/Historical Linguistics, pages 286–291.

Julius Zacher. 1855. Das Gothische Alphabet Vulfilas und das Runen Alphabet: eine Sprachwissenschaftliche Untersuchung. F. A. Brockhaus.

A Appendices

A.1 Derivations for Dynamic Programming

We show the derivation for Pr(X) here—other quantities can be derived in a similar fashion.

Given any X with length n, let $p_i(X)$ be the probability of generating the prefix subsequence $X_{:i}$, and $p_{i,z}(X)$ be the probability of generating $X_{:i}$ using z as the last latent variable. By definition, we have

$$\Pr(X) = p_n(X), \qquad (11)$$
$$p_i(X) = \sum_{z} p_{i,z}(X). \qquad (12)$$

$p_{i,z}$ can be recursively computed using the following equations:

$$p_{i,O} = \Pr(O) \cdot p_0 \cdot p_{i-1}, \qquad (13)$$
$$p_{i,E_l} = \Pr(E_l) \cdot \Pr(x_{i-l+1:i} \mid E_l) \cdot p_{i-l}. \qquad (14)$$
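A direct transcription of this recursion into Python is sketched below; the tag prior Pr(z) and the span probability $\Pr(x_{i-l+1:i} \mid E_l)$ are supplied as toy callables rather than computed from the full model.

```python
from typing import Callable, Iterable

def inscription_prob(X: str,
                     span_lengths: Iterable[int],
                     p_tag: Callable[[str], float],
                     p_unmatched_char: float,
                     p_span: Callable[[str, int], float]) -> float:
    """Marginal probability of an inscription under the segmentation model,
    following Equations (11)-(14): p[i] sums over the last tag being O
    (one unmatched character) or E_l (a matched span of length l)."""
    n = len(X)
    p = [0.0] * (n + 1)
    p[0] = 1.0
    for i in range(1, n + 1):
        total = p_tag("O") * p_unmatched_char * p[i - 1]                    # Eq. (13)
        for l in span_lengths:
            if l <= i:
                total += p_tag(f"E{l}") * p_span(X[i - l:i], l) * p[i - l]  # Eq. (14)
        p[i] = total
    return p[n]                                                             # Eq. (11)

# Toy usage: uniform tag prior, uniform unmatched characters, dummy span model.
print(inscription_prob("abcde", span_lengths=[3, 4],
                       p_tag=lambda z: 0.5,
                       p_unmatched_char=1.0 / 26,
                       p_span=lambda span, l: 0.01))
```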


A.2 Data Preparation

Stemming Gothic stemmers are developed based on the documentation of Gomorph v2.⁷ Stemmers for Proto-Germanic, Old Norse, and Old English are derived from the relevant Wikipedia entries on their grammar and phonology. For all other languages, we use the Snowball stemmer from NLTK (Bird, 2006).

⁷ http://www.wulfila.be/gomorph/gothic/html/.

IPA Transcription We use the CLTK library⁸ for Old Norse and Old English, and a rule-based converter for Proto-Germanic based on Ringe (2017, pp. 242–260). The Basque transcriber is based on its Wikipedia guide for transcription, and all other languages are transcribed using Epitran (Mortensen et al., 2018). The ipapy library⁹ is used to obtain their phonetic features. There are 7 feature groups in total.

⁸ http://cltk.org/.
⁹ https://github.com/pettarin/ipapy.

Known Vocabulary For Proto-Germanic, Old Norse, and Old English, we extract the information from the descendant trees in Wiktionary.¹⁰ All matched stems with at least four characters form the known vocabulary. This resulted in 7,883, 10,754, and 11,067 matches with Gothic inscriptions, and 613, 529, and 627 unique words in the vocabularies for Proto-Germanic, Old Norse, and Old English, respectively. For Ugaritic-Hebrew, we retain stems with at least three characters due to its shorter average stem length. For the Iberian-Latin personal name experiments, we take the list provided by Ramos (2014) and select the elements that have both Latin and Iberian correspondences. We obtain 64 unique Latin stems in total. For Basque, we use a Basque etymological dictionary (Trask, 2008), and extract Basque words of unknown origins to have a better chance of matching Iberian tokens.

¹⁰ https://www.wiktionary.org/.

For all other known languages used for the closeness experiments, we use the Book of Genesis in these languages compiled by Christodouloupoulos and Steedman (2015) and take the most frequent stems. The number of stems is chosen to be roughly the same as for the actual close relative, in order to remove any potential impact due to different vocabulary sizes. For instance, for the Gothic experiments in Figure 4b, this number is set to 600 since the PG vocabulary has 613 words.

A.3 Training Details

Architecture For the majority of our experiments, we use a dimensionality of 100 for each feature embedding, making the character embedding of size 700 (there are 7 feature groups). For the ablation study without IPA embeddings, each character is directly represented by a vector of size 700 instead. To compare with previous work, we use the default setting from NeuroCipher, which has a hidden size of 250; therefore, for our model we use a feature embedding size of 35, making it 245 per character.

Hyperparameters We use SGD with a learning rate of 0.2 for all experiments. Dropout with a rate of 0.5 is applied after the embedding layer. The length l of matched spans is in the range [4, 10] for most experiments and [3, 10] for Ugaritic. Other settings include T = 0.2, λcov = 10.0, and λloss = 100.0. We experimented with two annealing schedules for the insertion penalty α: ln α is annealed from 10.0 to 3.5 or from 0.0 to 3.5. These values are chosen based on our preliminary results, representing an extreme (10.0), a moderate (3.5), or a non-existent (0.0) penalty. Annealing lasts for 2000 steps, and the model is trained for an additional 1000 steps afterwards. Five random runs are conducted for each setting and annealing schedule, and the best result is reported.

