
Latin Noun Inflection and Latin Prosody -

A Finite State Implementation

BA-Thesis

Author: Bettina Demmer

Nauklerstraße 63

72074 Tübingen


-------------------------------

Seminar: Finite State Methods in Computational Linguistics (SS 2006)

Instructor: Dr. Dale Gerdemann

International Studies in Computational Linguistics

Seminar für Sprachwissenschaft

Eberhard Karls Universität Tübingen


I hereby declare that I have written the submitted thesis independently and only with the sources and aids indicated, including the WWW and other electronic sources. All passages of the thesis that I have taken from other works, either verbatim or in substance, are marked as such.

Tübingen, 21 August 2006

Bettina Demmer


In principio erat verbum (Joh 1,1)


Table of Contents

Abstract
1 Introduction
2 Morphology in Computational Linguistics
    2.1 Definition of Morphology
    2.2 Computational Applications of Morphology
    2.3 What is Finite State Morphology?
    2.4 Existing Approaches to the Morphology of Latin
3 The Latin Noun
    3.1 Latin Alphabet
    3.2 Latin Noun Inflection
        3.2.1 Stem + Ending
        3.2.2 Case, Number, Gender
        3.2.3 Declension
    3.3 Latin Prosody and Stress Assignment
        3.3.1 Latin Syllabification
        3.3.2 Penultimate Law
4 Latin Finite State Implementation in xfst
    4.1 The Overall Structure of the Script
    4.2 Introduction to the xfst Syntax
    4.3 The xfst Script in More Detail
5 Bibliography
6 Appendix: xfst Script File


Abstract

This paper, submitted for the degree 'Bachelor of Arts', describes a finite state approach to the inflectional morphology and the prosody of classical Latin nouns. Using the Xerox finite state tools, we developed an xfst script which describes step by step, as a series of small transducers composed together, the construction of a classical Latin noun on the one hand and stress assignment to that noun on the other. The advantage of finite state tools is that the resulting system is bidirectional: the xfst script can be used to form a declined, stress-marked surface form of a Latin noun from a given lexicon entry (generation) or to trace a given surface form back to its lexicon entry (analysis). The paper also provides a general introduction to morphology, a definition of finite state morphology, which is used to describe the morphology of Latin, a survey of existing computational approaches to Latin morphology, and a linguistic description of Latin inflectional morphology and prosody.

1 Introduction

Finite state morphology is a computational description of natural language morphology. Morphology, a classical branch of linguistics, deals with the formation of words out of smaller pieces called morphemes (→ section 2.1). Finite state morphology (→ section 2.3) is concerned with the morphology of natural languages from a technical point of view. The aim is to extract rules that describe the structural patterns of word formation and that can be spelled out as a two-way program, one that can both analyze surface word forms of a specific language and generate word forms from the lexicon according to specified features. In this paper we discuss a finite state implementation of the inflectional morphology and prosody of Latin nouns. The program generates an inflected Latin noun from the lexicon according to features specified by the user (case and number) and assigns stress to it. It can also analyze an inflected noun supplied by the user, returning its lemma (dictionary entry) and the morphological features it carries.

In section two of this paper we give a general introduction to morphology. What is morphology, and how can it be described in terms of a finite state machine? As this paper deals with the highly inflecting language Latin, we focus especially on the definition of inflectional morphology (→ section 2.1). In section 2.2 we survey possible applications of morphology in computational linguistics. The general ideas of finite state morphology are then discussed (→ section 2.3). That section introduces the basics of finite state theory; readers familiar with this theory can skip it. At the end of section two (→ section 2.4) we give an overview of the research done on Latin morphology and of existing approaches dealing with it.

In section three of this paper we give an overview of Latin noun morphology (→ section 3.2) and Latin prosody (→ section 3.3). We explain our motivation for choosing Latin as an example language for experimenting with finite state tools. We argue that splitting Latin nouns into stem and ending is more general and computationally more efficient than the traditional split into root, theme vowel and suffix (→ section 3.2.1). The prosody of Latin is also discussed in this section. The Latin stress rule, the 'Penultimate Law', is explained (→ section 3.3.2), and we argue that it is possible to assign stress to Latin nouns without knowing the exact syllable boundaries of the word.

In section four of this paper we turn to the finite state implementation, that is, the application of the previously discussed theory of finite state morphology to Latin, a natural though 'dead' language. We first argue for the 'Item-and-Process' theory, on which the basic structure of our xfst implementation is based (→ section 4.1). In section 4.2 we give an introduction to the syntax of xfst. We then explain step by step the rules of the finite state transducers, which together form a complete 'construction plan' of a Latin noun (→ section 4.3).

2 Morphology in Computational Linguistics

2.1 Definition of Morphology

Morphology, from Greek morphe 'form', is the branch of linguistics that studies the 'forms of

words'. It deals with the internal structure of words. The basic components of a word are one

or more morphemes. There are three types of morphemes (Müller, 2002): root morpheme,

which carries lexical meaning and is the base of all morphologically related words of the


same family, stem morpheme, which is a realization of a root morpheme and can be identical

to the root morpheme, and affix, a dependent component of a complex word, which cannot

stand alone. There are three fields into which morphology divides its studies about word

formation (Matthews, 1991): Derivation describes the formation of a new word out of an

existing word with the help of derivational morphemes. This process usually involves

changing of meaning or changing of the part of speech of a word. An example of this process

is the English prefix 'un-' which turns an adjective into its negative counterpart. Composition

describes the formation of new words out of two (or more) single words. The third field, inflection, is concerned with the grammatically motivated forms that words take depending on the syntactic context in which they appear. In inflecting languages words are usually

constructed of a basic morpheme, the root of a word (which carries the lexical meaning of the

word), and inflection morphemes, affixes (which carry some other information, e.g. plural

marking, case marking etc.). Traditionally, inflection is presented in paradigms, a two-

dimensional representation form, which covers one morphosyntactic category on one axis,

and another morphosyntactic category (a category that is "directly referred to by specific rules

in both morphology and syntax" (Matthews, 1991)) on the other axis. These categories can

consist of sets of variables. A word (or, more precisely, a lemma, the morpheme carrying the meaning of the word) is then inflected according to the categories on the two axes. Latin, which is described morphologically in section 3.2, is a highly inflecting language: the function of a noun in its context is expressed by a suffix that is determined by its declension on the one axis and by gender, case and number as a set of categories on the other axis. An advantage of the

representation of the inflection of a language in paradigms is that it is quite easy to find word

forms that share the same spelling and phonetics but express different functions. This

phenomenon is called 'syncretism' (Matthews, 1991). In Latin, for example, the ablative plural

form is always identical in spelling and phonetics to the dative plural form of the same noun.

But the function of the noun, either as dative plural or as ablative plural, in the context is

different. The branch of morphology that is concerned with inflection and paradigms is called

'Inflectional Morphology'.

2.2 Computational Applications of Morphology

This section gives an overview of the role of morphology in computational linguistics in general and of some of its possible applications in natural language processing.


Morphology, as the study of the internal structure of words, forms the basis for almost all natural language applications in computational linguistics. In classical applications such as machine translation, information retrieval, parsing, part of speech disambiguation and data mining, correctly determining the internal structure of a word is necessary and contributes to the efficiency of the system. In machine translation, for example, the grammatical function of a word in its context, which can largely be determined by morphological analysis, must be known in order to choose the correct translation. In information retrieval, tracing a word back to its lemma (the base form under which the word appears in a dictionary), rather than analyzing its grammatical realization in context, helps to summarize the information in a text.

Morphology can therefore be regarded as a central branch of linguistics and computational linguistics, since the other branches (syntax, semantics and phonology) build on it.

2.3 What is Finite State Morphology?

Finite state morphology is a branch of computational linguistics which deals with morphology in a technical sense. In finite state morphology, the morphological description of a natural language is represented as a finite state automaton or as a finite state transducer (general term: finite state machine). A finite state automaton describes a language, whereas a finite state transducer describes a relation between two languages. The language or relation is specified in terms of regular expressions (Roark and Sproat, 2006).

Kaplan and Kay (1994) is the most influential work in the field of finite state morphology. It was their idea to represent phonological rules as a cascade of transducers, an idea that circulated well before the paper was published. Inspired by it, Koskenniemi (1983) implemented a finite state system which he calls 'a general computational model for word-form recognition and production'. Since his work,

finite state methods have been used to describe the morphology and phonology of a wide

range of natural languages.

Every finite state machine consists of one or more states, exactly one start state and any

number of final states, which are connected by arcs. Every arc has a label and a destination

(one state of the network). Small networks can be viewed graphically as transition diagrams.

Every finite state network includes a 'sigma', the symbol alphabet of the machine. These

symbols represent the range of the language or relation that the network describes (Beesley

and Karttunen, 2003).


Finite State Automaton

A finite state automaton is a network that accepts a regular language. Figure 1 shows a finite state automaton which describes the language ab*cdd*e.

Figure 1: A simple finite-state automaton accepting the language ab*cdd*e (Roark and Sproat, 2006).

A language in finite state terms is a set of strings over an alphabet, which is a finite set of symbols. A language is called regular if it can be constructed from a finite alphabet using one or more of the following operations: set union, concatenation and transitive closure (Roark and Sproat, 2006). A finite state automaton matches an input string against the labels of its arcs. If a final state is reached once the whole string has been matched, the string is accepted, i.e. it is in the language of the automaton. Roark and Sproat (2006) give a technical summary of the definition of a finite state automaton:

A finite-state automaton is a quintuple M = (K, s, F, Σ, δ) where:

1. K is a finite set of states

2. s is a designated initial state

3. F is a designated set of final states

4. Σ is an alphabet of symbols, and

5. δ is a transition relation from K × (Σ ∪ {ε}) to K
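To make the definition concrete, the automaton for the language ab*cdd*e can be written down as a quintuple and run on a few strings. The following Python sketch is only an illustration and not part of our xfst script; the state numbering and the function name `accepts` are assumptions, and the sketch covers only the deterministic case without ε-transitions.

```python
# Quintuple M = (K, s, F, SIGMA, DELTA) for the language ab*cdd*e.
# The state numbers are illustrative assumptions.
K = {0, 1, 2, 3, 4}          # finite set of states
s = 0                        # designated initial state
F = {4}                      # designated set of final states
SIGMA = set("abcde")         # alphabet of symbols
DELTA = {                    # transition relation (deterministic, no epsilon arcs)
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
    (2, "d"): 3,
    (3, "d"): 3,
    (3, "e"): 4,
}

def accepts(word):
    """Match the word against the arc labels; accept if a final state is reached."""
    state = s
    for symbol in word:
        if (state, symbol) not in DELTA:
            return False     # no arc with this label leaves the current state
        state = DELTA[(state, symbol)]
    return state in F

print(accepts("abbcdde"))  # True
print(accepts("ace"))      # False
```

The string 'ace' is rejected because at least one 'd' is required between 'c' and 'e'.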

Some special languages should be mentioned:

1) The empty-string language contains only the empty string ε. The network that accepts it consists of exactly one state, which is final.

2) The null language does not contain any string, not even the empty string. The network that accepts it consists of exactly one non-final state.

3) The universal language, denoted by Σ*, contains all strings that can be constructed out of the alphabet Σ, including the empty string ε.

Finite State Transducer

A finite state transducer is a network that describes a regular relation. Figure 2 shows a finite state transducer that describes the regular relation (a:a)(b:b)*(c:g)(d:f)(d:f)*(e:e).

Figure 2: A simple finite-state transducer that computes the relation (a:a)(b:b)*(c:g)(d:f)(d:f)*(e:e) (Roark and Sproat, 2006).

In natural language applications, a regular relation is almost always a mapping between pairs of strings. A finite state transducer matches a string against the upper symbols of the labels of its arcs and maps these to the lower symbols. If the string is matched, i.e. a final state of the network is reached, the transformed string is given as output. Roark and Sproat (2006) give a technical summary of the definition of a finite state transducer:

A (2-way) finite-state transducer is a quintuple M = (K, s, F, Σ × Σ, δ) where:

1. K is a finite set of states

2. s is a designated initial state

3. F is a designated set of final states

4. Σ is an alphabet of symbols, and

5. δ is a transition relation from K × ((Σ ∪ {ε}) × (Σ ∪ {ε})) to K
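The relation (a:a)(b:b)*(c:g)(d:f)(d:f)*(e:e) can be sketched in the same way as a table of arcs whose labels are pairs of an upper and a lower symbol. This is a minimal Python illustration under stated assumptions, not our xfst code: the state numbers and the name `transduce` are hypothetical, and the sketch handles only the deterministic, ε-free case.

```python
# Arcs of the transducer: (state, upper symbol) -> (lower symbol, next state).
# State numbers are illustrative assumptions.
ARCS = {
    (0, "a"): ("a", 1),
    (1, "b"): ("b", 1),
    (1, "c"): ("g", 2),
    (2, "d"): ("f", 3),
    (3, "d"): ("f", 3),
    (3, "e"): ("e", 4),
}
START, FINALS = 0, {4}

def transduce(upper):
    """Map an upper string to its lower string, or return None if it is not matched."""
    state, lower = START, []
    for symbol in upper:
        if (state, symbol) not in ARCS:
            return None                  # no arc with this upper symbol
        out, state = ARCS[(state, symbol)]
        lower.append(out)
    return "".join(lower) if state in FINALS else None

print(transduce("abbcdde"))  # abbgffe
```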

Composition plays an important role in finite state transducers (Roark and Sproat, 2006). Composing two transducers means first applying the first transducer and then applying the second transducer to the output of the first. We used this operation very often in our finite state description of Latin noun morphology, where we factored our system into a set of operations that are executed one after another by means of composition (→ section 4.3). Another central feature of finite state transducers is inversion (Roark and Sproat, 2006). Inversion means that a system implemented as a finite state transducer, or as a set of transducers composed together, can be used in two directions. It can map a string from the lexical level to the surface level following several rules (generation), or it can be used the other way around, from the surface level to the lexical level (analysis). This feature constitutes the innovation of our morphological description of Latin nouns, as our program can be used to generate Latin nouns from the lexicon as well as to analyze Latin nouns from a given surface form (→ section 4.3).
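Composition and inversion can be illustrated with a small Python sketch. It is restricted to ε-free transducers whose arcs map exactly one symbol to one symbol, which is much less general than what xfst provides. The class `FST`, its method names and the second transducer `to_upper` (which simply capitalizes each symbol) are assumptions made for this illustration.

```python
from itertools import product

class FST:
    """Epsilon-free transducer; arcs are (source, upper, lower, destination)."""

    def __init__(self, start, finals, arcs):
        self.start, self.finals, self.arcs = start, set(finals), list(arcs)

    def apply_down(self, word):
        """Return the set of lower strings related to the upper string `word`."""
        configs = {(self.start, "")}
        for ch in word:
            configs = {(dst, out + lo)
                       for (state, out) in configs
                       for (src, up, lo, dst) in self.arcs
                       if src == state and up == ch}
        return {out for (state, out) in configs if state in self.finals}

    def invert(self):
        """Swap upper and lower symbols on every arc."""
        return FST(self.start, self.finals,
                   [(src, lo, up, dst) for (src, up, lo, dst) in self.arcs])

    def compose(self, other):
        """Product construction: feed the lower side of self into the upper side of other."""
        arcs = [((s1, s2), up1, lo2, (d1, d2))
                for (s1, up1, lo1, d1) in self.arcs
                for (s2, up2, lo2, d2) in other.arcs
                if lo1 == up2]
        finals = set(product(self.finals, other.finals))
        return FST((self.start, other.start), finals, arcs)

# (a:a)(b:b)*(c:g)(d:f)(d:f)*(e:e)
t1 = FST(0, {4}, [(0, "a", "a", 1), (1, "b", "b", 1), (1, "c", "g", 2),
                  (2, "d", "f", 3), (3, "d", "f", 3), (3, "e", "e", 4)])
# Hypothetical second rule: capitalize every symbol.
to_upper = FST(0, {0}, [(0, ch, ch.upper(), 0) for ch in "abefg"])

cascade = t1.compose(to_upper)
print(cascade.apply_down("abcde"))           # {'ABGFE'}  (generation direction)
print(cascade.invert().apply_down("ABGFE"))  # {'abcde'}  (analysis direction)
```

The same composed network is used for both directions; only the roles of the upper and lower sides are exchanged.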

Finite state methods can be used for speech and language processing including morphology

and phonology, computational analysis of syntax, language modelling for speech recognition,

pronunciation modelling, etc. (Roark and Sproat, 2006).


For our implementation we used the Xerox finite state tools, specifically xfst, in order to describe transducers which, composed together, form the construction plan of a Latin noun. xfst includes a compiler which builds a finite state network out of the transducer descriptions in the xfst script file. For more information on the syntax of xfst see section 4.2.

2.4 Existing Approaches to the Morphology of Latin

Latin is a very popular language for morphological analysis. Much research has been done on

Latin inflectional morphology. In the following section we will present an overview of

literature or systems concerned with Latin morphology.

Matthews (1972) describes the inflection of Latin verbs in order to explain inflectional morphology in general. In his book he contrasts the 'Item-and-Arrangement' theory with the 'Item-and-Process' theory. He argues that for highly inflecting languages such as Latin the 'Item-and-Arrangement' theory, in which morphemes are the basic units of meaning and are arranged linearly, is not sufficient. Instead he argues for the 'Item-and-Process' theory, in which morphology is viewed as the construction of words out of base forms (stems or roots) modified by rules. See section 4.1 for further discussion.

In Matthews (1991), Latin is used as a representative of the highly inflecting Indo-European languages in order to show and explain paradigm structures. Paradigms (Greek parádeigma 'pattern') are the traditional way of presenting a word (in this paper, a noun) and its inflectional changes according to certain features and contexts. Paradigms are two-dimensional constructions in which one category is set against other categories. In Latin noun inflection, which is described in this paper, the declension of a noun is set against its case and number.

Lindsay (1894), Sommer (1914) and Sommer (1977) give classical analyses of Latin morphology without reference to computational applications. Our summary of Latin noun morphology in sections 3.1 and 3.2 is mostly taken from these books.

Bender, in an essay from his collection 'Essays on Morphology', describes Latin noun inflection somewhat differently from traditional descriptions. He splits Latin nouns into stems and endings, as opposed to the traditional analysis into root, theme vowel and suffix (with fusion of the latter two), in order to minimize the number of morphologically conditioned contexts. He argues that this split generalizes over the traditional theme-vowel-plus-suffix analysis, whose endings in most cases differ only in the theme vowel. Counting the theme vowel as part of the stem makes it possible to predict the declension membership of a noun: the final character of the stem decides which of the six stem classes (the five traditional declensions, with the third divided into i-stems and consonant stems) the noun belongs to. Our implementation follows Bender's analysis. For further discussion see section 3.2.1.

Covington (1999) adopts the same generalization about Latin noun inflection as Bender. He argues that leaving the theme vowel together with the root of the word, which forms the stem, and generalizing the remaining ending over the declensions contributes to the "economy of representation when the inflectional system is stored as a transition network […], a representation that is computationally efficient and may be psychologically realistic". He refers to Bubenheimer (1995), who implemented a morphological analyzer based on transition networks.

McLean presents a Latin translator on his homepage. The program takes inflected Latin words as input and returns the English translation and a short analysis of the word (including case and number, but not declension). A disadvantage of the program is that it does not trace declined nouns back to their lemmas.

Logos offers 'language solutions' on its homepage. There we found a 'universal conjugator', a

system, which also handles Latin verb inflection. It is possible to enter a Latin verb and the

output of the system is the complete conjugation of that verb.

Bozzi and Cappelli (1991) present 'A project for Latin Lexicography: 2. A Latin

morphological analyzer'. In their article they describe a morphological analyzer, which

comprises a base dictionary, a table of suffixes, a table of endings and a table of postfixes.

The Perseus Project, 'a digital library for the humanities', offers morphological analysis for inflecting languages such as Latin and Greek. Using the tool 'Latin Morphological Analysis', the user can enter an inflected word (the system covers adjectives, nouns and verbs) and receives a table with all possible morphological analyses of that word, including its lemma, its English translation and its frequency in the Latin Prose, Latin Poetry or Latin Texts corpus.

During our literature research we encountered many implementations of, and ideas about, Latin morphology. Most of the systems we found deal with Latin verb morphology rather than Latin noun morphology, and each of them works in one direction only. What is new in the approach described in this paper is the construction of a bidirectional system, which on the one hand analyzes Latin noun forms and on the other hand generates Latin noun forms according to given features.


3 The Latin Noun

Latin belongs to the Indo-European language family. Within the tree of Indo-European languages it is the parent language of the Romance languages (Stowasser, 2004).

Classical Latin, also called 'aurea Latinitas', refers to a period of about 100 years in the first century BC. It is regarded as the period in which the language was most highly developed. In this period Latin developed into a cultivated language of literature and education (Stowasser, 2004) and became the official language of the Roman Empire. As classical Latin is the phase in the history of the language in which the grammar was most strictly regulated, the following analysis concentrates on the grammar of this period. Whenever we mention Latin, we refer to classical Latin.

A Latin noun is determined by case, number and gender. Further, Latin nouns are grouped into five declensions, which are distinguished by the final character of the noun's stem.

3.1 Latin Alphabet

The alphabet of Latin consists of 24 letters (Stock, 1970).

A B C D E F G H I K L M N O P Q R S T U V X Y Z

a b c d e f g h i k l m n o p q r s t u v x y z

Classical Latin has

• 6 vowels a e i o u (y), each of which can be either short or long: a= e= i= o= u= (y=) (long vowels are marked with a following = sign in this text)

• 4 diphthongs ae oe au eu, which are always long

• 18 consonants b p d t g c q k l r m n f v s z h x (Stock, 1970).

Written Latin makes no graphical distinction between long and short vowels. Thus many homographs can be found in Latin: words which share the same spelling but differ in meaning because of differences in vowel quantity, e.g. iace=re (Engl. 'to lie') vs. iacere (Engl. 'to throw'), pare=re (Engl. 'to obey') vs. parere (Engl. 'to give birth'), cupi=do (Engl. 'desire') vs. cupido (Engl. 'someone who is eager', in the dative case) (Stowasser, 2004). Information about vowel quantity therefore has to be given in the lexicon.
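The consequence for the lexicon can be sketched in a few lines of Python. The list `LEXICON` and the function names are assumptions made for this illustration; the entries are the examples above, with '=' following a long vowel. Removing the length marks yields the written form, and entries that collapse to the same written form are homographs.

```python
from collections import defaultdict

# Lexicon entries with vowel quantity; '=' follows a long vowel, as in the text.
LEXICON = ["iace=re", "iacere", "pare=re", "parere", "cupi=do", "cupido"]

def written_form(entry):
    """Drop the length marks to obtain the spelling found in a written text."""
    return entry.replace("=", "")

def homographs(lexicon):
    """Group lexicon entries by written form and keep groups with several entries."""
    groups = defaultdict(list)
    for entry in lexicon:
        groups[written_form(entry)].append(entry)
    return {form: entries for form, entries in groups.items() if len(entries) > 1}

for form, entries in homographs(LEXICON).items():
    print(form, entries)
```

Going from a lexicon entry to the written form is a deterministic deletion, whereas going back from the written form is ambiguous, which is why the quantity cannot be recovered from the text alone.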


3.2 Latin Noun Inflection

3.2.1 Stem + Ending

Latin is an inflecting language, i.e. the grammatical function of a word form is expressed mainly by changes of the word-final ending (Stowasser, 2004). For example, the word form 'amica-m' (Engl. '(female) friend') can be analyzed in the following way:

amica - m

'stem' (amica): The stem carries the lexical meaning of the word, in this case '(female) friend', and the information which declension the noun belongs to; in this case the final character of the stem is an 'a', so the noun belongs to the first declension (or a-declension).

'ending' (m): The ending carries the grammatical function of the word, in this case 'accusative' + 'singular' + 'feminine'.

Stem

The stem of a noun can be found by cutting off the ending '–um' or '–rum' in the genitive

plural form of a noun, e.g. flamma-rum, lupo-rum, passu-um, die-rum, turri-um, reg-um

(Stock, 1970). This stem appears in front of all case endings except for the nominative

singular ending (and except for accusative singular ending in neuter nouns). In front of

nominative singular endings (and accusative singular endings in neuter nouns), a stem change

takes place (Sommer, 1914). The changed stem that is used in front of the nominative singular

ending (and accusative singular ending) combined with its nominative singular ending

constitutes the base form, the lemma, that is found in the Latin dictionary. The final character

of the stem is decisive for the declension of the noun (→ section 3.2.3).
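The procedure of cutting off '-rum' or '-um' can be sketched as follows. This is an illustration in Python, not the method used in our xfst script. The function name and the heuristic for choosing between the two endings (cut '-rum' only after a, o or e) are assumptions. Since the input carries no vowel quantity, the heuristic fails for consonant stems ending in -er-, such as 'mulierum'.

```python
def stem_from_genitive_plural(gen_pl):
    """Cut off '-rum' or '-um' from a genitive plural form written without length marks."""
    # Assumption: '-rum' is the ending only after the stem vowels a, o and e.
    if gen_pl.endswith("rum") and gen_pl[-4:-3] in ("a", "o", "e"):
        return gen_pl[:-3]
    if gen_pl.endswith("um"):
        return gen_pl[:-2]
    raise ValueError(f"not a genitive plural form: {gen_pl!r}")

for form in ["flammarum", "luporum", "passuum", "dierum", "turrium", "regum"]:
    print(form, "->", stem_from_genitive_plural(form) + "-")
```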

Ending

The ending is added to this stem according to the declension, the case, the number and the gender of the noun. Most grammars split up Latin nouns differently: they count the final character of the stem as part of the ending, so that every declension has its own ending for each case, number and gender. This way of splitting is easier for learners of the language. The learner first studies the different declensions that exist in Latin and then learns all the endings for each declension, i.e. the endings that contain the theme vowel. In this paper, however, we discuss a computational implementation of Latin noun morphology, so we prefer the highest degree of generalization that an examination of Latin nouns permits. Splitting up Latin nouns into stem and ending allows greater generalization of the endings. This can be seen in more detail in the description of the xfst implementation (→ section 4.3). All the endings can be summarized in only two tables (→ Table 1 and Table 2). The endings can be condensed even further: for example, the ablative singular ending for masculine/feminine nouns is the same for all declensions except the consonantal declension. This phenomenon is called 'syncretism'. Scaling down the endings and combining features in this way runs counter to the needs of language learning, where it is easier to have a separate ending for every feature combination.

The following tables show all Latin noun endings (taken from Bender). The proper ending for a noun is determined by the final character of its stem (→ declension) and its gender. Table 1 covers all endings for masculine and feminine nouns; Table 2 covers all endings for neuter nouns. As neuter nouns occur only in the second, third and fourth declensions, the other columns are left empty. Some special characters have to be explained: A caret (^) preceding a vowel-initial suffix indicates that that vowel replaces the stem vowel, if any. A tilde (~) preceding a consonantal suffix indicates that the preceding vowel is always short. A hash (#) stands for a zero suffix. A colon (:) following a vowel indicates that the vowel turns into a long vowel.

Table 1: masc/fem

               1st: a   2nd: o   5th: e   4th: u   3rd: i   3rd: C
nom sg         #        s        s        s        s        s
acc sg         ~m       ~m       ~m       ~m       ~m       em
dat sg         e        :        i:       i:       i:       i:
abl sg         :        :        :        :        :        e
gen sg         e        ^i:      i:       :s       ^is      ^is
nom pl         e        ^i:      :s       :s       ^e:s     ^e:s
acc pl         :s       :s       :s       :s       :s       e:s
dat pl/abl pl  ^i:s     ^i:s     bus      ^ibus    ^ibus    ^ibus
gen pl         :rum     :rum     :rum     um       um       um

Table 2: neut

               1st: a   2nd: o   5th: e   4th: u   3rd: i   3rd: C
nom sg                  ~m                :        #        #
acc sg                  ~m                :        #        #
dat sg                  :                 :        :        i:
abl sg                  :                 :        :        e
gen sg                  ^i:               :s       s        is
nom pl                  ^a                a        a        a
acc pl                  ^a                a        a        a
dat pl/abl pl           ^i:s              ^ibus    ^ibus    ^ibus
gen pl                  :rum              um       um       um

As an example we take the word 'stella' (Engl. 'star') and go through its declension. The stem of this noun, which is given in the lexicon, is 'stella-'. From the final character of the stem we can see that the noun belongs to the a-declension. The noun is feminine (which is also given in the lexicon), which means that the endings are taken from Table 1, column '1st: a'. All the endings from this column can now be added to the stem according to case and number.
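The declension of 'stella' can be worked through with a short Python sketch that interprets the special characters of the ending tables: caret (^), tilde (~), hash (#) and colon (:). This is an illustration under stated assumptions and not our xfst script. The function `attach` and the dictionary `A_STEM_ENDINGS` are hypothetical names, the endings are written with an explicit caret, long vowels in the output are marked with a following '=' as elsewhere in this text, and only the a-stem column is covered.

```python
VOWELS = set("aeiouy")

# Endings of the column '1st: a' for masculine/feminine nouns.
A_STEM_ENDINGS = {
    ("nom", "sg"): "#",
    ("acc", "sg"): "~m",
    ("dat", "sg"): "e",
    ("abl", "sg"): ":",
    ("gen", "sg"): "e",
    ("nom", "pl"): "e",
    ("acc", "pl"): ":s",
    ("dat", "pl"): "^i:s",
    ("abl", "pl"): "^i:s",
    ("gen", "pl"): ":rum",
}

def attach(stem, ending):
    """Attach an ending in table notation to a stem; long vowels come out as 'V='."""
    if ending == "#":                  # zero suffix
        return stem
    if ending.startswith("^"):         # suffix vowel replaces the stem vowel, if any
        stem = stem.rstrip("=")
        if stem[-1] in VOWELS:
            stem = stem[:-1]
        ending = ending[1:]
    elif ending.startswith("~"):       # preceding vowel is always short
        stem = stem.rstrip("=")
        ending = ending[1:]
    elif ending.startswith(":"):       # stem-final vowel becomes long
        if stem[-1] in VOWELS:
            stem += "="
        ending = ending[1:]
    return stem + ending.replace(":", "=")

for (case, number), ending in A_STEM_ENDINGS.items():
    print(case, number, attach("stella", ending))
```

For example, the ablative singular comes out as 'stella=' (with long a) and the dative/ablative plural as 'stelli=s', where the suffix vowel has replaced the stem vowel.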

3.2.2 Case, Number, Gender

In Latin, six cases are distinguished: nominative (subject of sentence), genitive (possession,

attachment), dative (indirect object), accusative (direct object), ablative (means, object of

prepositions of position) and vocative (personal address). The vocative case endings

correspond to the nominative case endings in all declensions except for the vocative singular

ending in o-declension nouns, where the ending is '–e' instead of the nominative singular

ending '–us' (Stock, 1970).

The number of a noun is either singular or plural.

The gender of a Latin noun can be masculine, feminine or neuter. The gender cannot be read off the form of the noun, so this information must be given in the dictionary.

3.2.3 Declension

Latin nouns are grouped into five different declensions (Stock, 1970). A noun's membership of a declension is decided by the last character of its stem. For example, the noun rex (Engl. 'king') has the stem reg-, which ends in a consonant; therefore 'rex' belongs to the third declension (consonantal stems). The first declension covers nouns with a-stems and the second declension nouns with o-stems. The third covers, as already mentioned, nouns with consonantal stems, but also i-stems and mixed nouns (nouns that follow the consonantal stems in the singular and the i-stems in the plural). The fourth declension covers nouns with u-stems, the fifth declension nouns with e-stems. Some grammars (e.g. Stock, 1970) traditionally name the declensions after their stems: a-declension, o-declension, consonantal, i- and mixed declension, u-declension and e-declension, respectively.

3.3 Latin Prosody and Stress Assignment

As Latin is no longer a spoken language, Latin prosody as a discipline is based on the syllabification theories of Roman grammarians, on actual syllabification in inscriptions and on a theory of syllable boundaries with which linguistic phenomena can be explained (Sommer, 1977).

In written Latin, stress is not marked. But from observations of how stress affects meaning, from how stress occurs in verse and from ancient sources it is possible to reconstruct stress assignment in Latin words.

Until the 5th century BC, Latin had an expiratory stress, which is produced by stronger air pressure during pronunciation of the stressed syllable. In this phase, stress was always assigned to the first syllable of the word form (initial stress) (Allen, 1978). Later in the 5th century BC, this type of stress changed to a musical stress in which the stressed syllable is pronounced at a higher pitch. In this second phase, stress was assigned according to the quantity of the penultimate syllable of the word form. In a third phase the stress changed back into an expiratory stress (Sommer, 1914), which distinguishes accented syllables by giving them greater energy of articulation than unaccented ones. The stress remained in its old position (according to the Penultimate Law). This stress type lives on in the Romance languages (Stowasser, 2004).

3.3.1 Latin Syllabification

The basic principle of Latin syllabification is that a syllable ends in a vowel and begins with a consonant or a combination of consonants (Lindsay, 1894). The syllabification rule of the Roman grammarians confirms that a group of consonants within a word is assigned to the following syllable unless the group is unpronounceable. In the latter case the syllable boundary falls within the consonant group (Sommer, 1977; Lindsay, 1894). Compounds are split etymologically. In inscriptions, however, syllable boundaries are found at different places: syllables are always split between consonants (if more than two consonants occur, the syllable boundary lies before the last consonant) except for the 'muta cum liquida' sequence, which always counts towards the next syllable.

As the literature is vague about Latin syllabification, we use a different analysis for stress assignment on Latin nouns. We know that every syllable contains exactly one vowel or diphthong, and that a syllable (or its vowel) counts as long (= heavy) when the vowel is followed by at least two consonants (position length) (Allen, 1978). With this information it is possible to apply the penultimate law to Latin nouns without knowing the exact syllable boundaries in a word. The penultimate law thus applies to the vowels as stand-ins for the syllables.

3.3.2 Penultimate Law

The penultimate law is a rule which describes the grammatical stress assignment in Latin words. According to that law, the last syllable of a word is extrametrical. If a word consists of one or two syllables, stress is assigned to the first syllable. If a word consists of at least three syllables, the quantity of the penultimate syllable is decisive: stress is on the penultimate syllable if it is heavy (i.e. ending in a long vowel or diphthong, or closed by position) and on the antepenultimate syllable if the penultimate syllable is light (i.e. ending in a short vowel) (Stowasser, 2004; Sommer, 1914; Lindsay, 1894; Allen, 1978; Kenstowicz, 1994; Zirin, 1967).

A syllable is heavy in Latin either by nature ('natural length') or by position ('position length'). A syllable is naturally heavy if it ends in a long vowel or a diphthong (diphthongs are always long). If a syllable contains a short vowel followed by at least two consonants it is called 'closed', which makes it heavy by position (Stock, 1970). If a 'muta' (i.e. b, p, g, c, d, t) is followed by a 'liquida' (l, r), a sequence called muta cum liquida, both consonants belong to the onset of the following syllable, e.g. inte-grum. The penultimate syllable is then no longer closed, which triggers stress on the antepenultimate syllable (Allen, 1978). Words which have stress on the third syllable have a second stress on the first syllable (Sommer, 1914; Lindsay, 1894).

If an enclitic occurs at the end of a word, stress shifts to the syllable before the enclitic, regardless of the quantity of that syllable (Allen, 1978).
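The rule can be summarized in a few lines of Python. This is only a sketch of the penultimate law as stated above and is not part of the xfst script; the function name `stressed_syllable` and the representation of a word as a list of syllable weights are assumptions made for illustration. Enclitics and the second stress on longer words are not handled.

```python
def stressed_syllable(heavy):
    """Return the 0-based index of the syllable that carries main stress.

    heavy: one boolean per syllable, True if the syllable is heavy
    (by nature or by position). The weight of the last syllable is
    never consulted, because the last syllable is extrametrical.
    """
    n = len(heavy)
    if n == 0:
        raise ValueError("a word needs at least one syllable")
    if n <= 2:
        # One- and two-syllable words are stressed on the first syllable.
        return 0
    # Three or more syllables: heavy penult takes the stress,
    # otherwise the antepenult does.
    return n - 2 if heavy[n - 2] else n - 3


# inte-grum: the penult 'te' is light because 'gr' is muta cum liquida.
print(stressed_syllable([True, False, False]))  # 0
# A three-syllable word with a heavy penult.
print(stressed_syllable([True, True, False]))   # 1
```

Representing the word by weights alone keeps the rule independent of how syllable boundaries are found, which matches the argument of section 3.3.1.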

4 Latin Finite State Implementation in xfst

4.1 The Overall Structure of the Script

The overall structure of the Latin xfst script that we implemented for this paper follows the traditional 'Item-and-Process' theory. There has been a long discussion about two different morphological theories. On the one hand there is 'Item-and-Arrangement' (Hockett, 1954), where morphology is viewed as the construction of words out of morphemes, small lexical pieces. On the other hand there is 'Item-and-Process' (Hockett, 1954), the theory that we use for our analysis of Latin nouns in this paper, where morphology is viewed as the construction of words out of base forms (stems or roots) modified by rules. These different approaches are motivated by the properties of different languages. Roark and Sproat (2006) argue that the differences between the two approaches are less significant from a more formal or computational point of view. As Latin noun inflection is presented with paradigms listing the various inflected word forms according to their functions and with rules for deriving these forms, an analysis of Latin (and of other highly inflected Indo-European languages such as Classical Greek and Sanskrit) is best done in the framework of the 'Item-and-Process' theory.

In more detail, the first reason to apply 'Item-and-Process' rather than 'Item-and-Arrangement' to Latin noun morphology is that in Latin many morphosyntactic features are expressed by a single suffix (Roark and Sproat, 2006). A noun is not built out of many morphs that can each be associated with a corresponding morpheme. A second reason why 'Item-and-Arrangement' is inappropriate for Latin noun inflection is that there is not always a single stem to which suffixes are attached, but sometimes two, depending on the suffix. In Latin, if a second stem occurs, it appears in all cases except the nominative (and, in neuter nouns, the accusative). Thirdly, the suffixes may change depending on the class of the noun they are attached to. In Latin, for example, the dative/ablative plural ending is '-is' in the first and second declension, but '-(i)bus' in the other declensions, even though both express the same morphosyntactic features. For these three reasons, the 'Item-and-Process' theory fits the structure of Latin noun inflection better: it is more naturally described by rules that introduce suffixes according to certain features than by separate morphemes that encode the features (Roark and Sproat, 2006).

The 'Item-and-Process' and 'Item-and-Arrangement' theories are reformulated by Stump (2001), who classifies morphological theories with four terms: lexical versus inferential, and incremental versus realizational. Without explaining these four terms in detail, we just mention that his inferential-realizational theory, which he also calls Paradigm Function Morphology, corresponds best to the 'Item-and-Process' theory. With the term 'inferential' he describes theories in which "associations between the morphosyntactic properties of a word and its morphology are expressed by morphological rules which relate that word to its root" (Stump, 2001). The term 'realizational' refers to theories in which "the association of a word with a particular set of morphosyntactic properties licenses the introduction of the inflectional exponents of those properties" (Stump, 2001).

4.2 Introduction to the xfst Syntax

In this section we give an introduction to the syntax of xfst. Xfst is part of the Xerox finite-state tools and "provides a regular-expression compiler and direct access to the XEROX FINITE-STATE CALCULUS, the algorithm for building and manipulating finite-state networks" (Beesley and Karttunen, 2003). Xfst is maintained and expanded by Lauri Karttunen. The table lists all symbols and commands that are used in the xfst script below, together with their functions (taken from Beesley and Karttunen, 2003).

Notation              Function
0                     EPSILON symbol: empty-string language or corresponding identity relation
?                     ANY symbol: language of all single-symbol strings or corresponding identity relation
a                     single symbol: language that consists of the corresponding string or identity relation on that language
a:b                   pair of symbols: relation that consists of the corresponding ordered pair of strings; a = UPPER symbol, b = LOWER symbol
.#.                   boundary symbol; designates the beginning of a string in the left or the end of a string in the right context of a restriction
[]                    empty string language
[..] -> A             epenthesis rule: mapping the empty string into non-empty string A
()                    optionality
A+                    Kleene-plus: concatenation of A with itself one or more times
A*                    Kleene-star: union of A+ with the empty string language
~A                    complement of A
[A B]                 concatenation of the two languages or relations
{word} = [w o r d]    concatenation of the corresponding single-character symbols
$A                    'contains A'
[A | B]               union of the two languages or relations; DISJUNCTION
[A & B]               intersection of the two languages; CONJUNCTION
[A - B]               all the strings in A that are not members of B
[A .o. B]             composition of the relation A with the relation B
A.u                   upper language of the relation A
A.l                   lower language of the relation A
A.i                   inverse of the relation A
[A -> B || L _ R]     replacement of an original upper-side string A by a string from B if the indicated condition (L = left context; R = right context) is fulfilled
clear                 clear the stack (which saves networks)
define VAR text       command to define a variable VAR
#                     start of a comment
source <filename>     e.g. source Latin.xfst: reads in the xfst script and builds a network out of it
read regex            a single regular expression can be read in with this command
apply up              in Latin.xfst: the user can enter a declined noun after this command and gets back the lemma of the noun and some features (analysis)
apply down            in Latin.xfst: the user can enter a Latin noun in nominative singular and some features after this command and gets back the declined noun (generation)
print lower-words     print all surface strings (i.e. declined nouns) that the network covers
print upper-words     print all lexical strings (i.e. lemma and features) that the network covers
print net             print information about the network

4.3 The xfst Script in More Detail

Stem + Ending

In order to understand the particular rules we begin with the definition of the lexicon which is

later handled by replace rules. The lexicon entries look like |1|:

|1| [{stella} [noun & $fem]]

[{genus} %# {gener} [noun & $neut]]

Each entry of the lexicon consists of the stem of the Latin noun, information on its part of speech (in case the script is extended to handle other parts of speech) and its gender. A variant stem is also listed in the lexicon, as it cannot be reconstructed automatically from the other stem. In the example above, 'stella' does not have a variant stem, but 'gener' does: the variant 'genus' is used in the nominative singular and accusative singular. That is why its nominative singular form is given in front of the 'traditional' stem (separated by a hash sign for further processing). How the stem of a Latin noun can be found is described in section 3.2.1. The lexicon with its features is the part of the program that serves as input to generation and as output of analysis. To make the program more user-friendly, i.e. so that the user does not have to know the stem of a word by heart but can use the program with the better known nominative singular form, we implemented a transducer that automatically turns the stem of a noun into its nominative singular form, which is the lexical entry of a noun in a Latin dictionary (→ |2|).

|2| define StemToNomSg [?* []:[noun & $nom & $sg]] .o. Lexicon .o. Suffixes;

If a noun has a variant stem (which is already given as the nominative singular form), this form is used. That means that either the already given nominative singular form is used ('Case1') or the nominative singular form has to be newly constructed ('Case2') (→ |3|). In the end, the inverse of the StemToDict transducer is composed with the Morphology transducer, which is described later in this section.

|3| define Case1 [ $%#

.o.

[..] -> %# || \LexFeatures _ LexFeatures

.o.

%# ?* %# -> 0

.o.

~$%#

];

define Case2 [

~$%# .o. [StemToNomSg LexFeatures*]

];

define StemToDict [

Case1

|

Case2

];

define Dictionary [

[StemToDict].i .o. Morphology

];

Now, we come to the actual program, the part which works only on the lexicon itself. Firstly,

the 'noun' feature in the lexicon is replaced by the four features that it describes: the part of

speech tag 'Noun', gender 'Gend' (which is already replaced by the gender that is given in the

entry because of the 'contains'-sign $), case 'Case' and number 'Num' (→ |4|).

|4| define noun Noun Gend Case Num;

In the next step, these features are extended by two more features, namely 'Ending', which is a placeholder for the possible endings that can be attached to the stem of the word, and 'DeclTag', which is a placeholder for the class (declension) the noun belongs to (→ |5|). These are 'helper' tags; they are neither important nor harmful for the user. The declension tag is rewritten automatically from its context, so the user does not have to specify the class of the noun.

|5| define Features [ [..] -> [Ending DeclTag] || _ noun

];

After this step, the lexicon entry looks like the following:

{stella} Ending DeclTag Noun Gend Case Num

This is now the starting position for the other transducers of the program. All these tags are

necessary to determine the surface word form of a Latin noun (in the generation phase; all

transducers can also be used the other way around (for analysis)). In the following part of the

program, each of these 'tags' (Ending, DeclTag, Noun, Gend, Case, Num) is now replaced by

its possible realizations (conditioned by the realization of the other tags).

|6| define Declension [

DeclTag -> adecl || [a|a=] Ending _

.o.

DeclTag -> odecl || [o|o=] Ending _

.o.

DeclTag -> edecl || [e|e=] Ending _

.o.

DeclTag -> udecl || [u|u=] Ending _

.o.

DeclTag -> idecl || [i|i=] Ending _

.o.

DeclTag -> cdecl || C Ending _

];

|6| describes the rewriting of the declension tag. The script distinguishes six declension classes, because the traditional third declension is split into consonantal stems ('cdecl') and i-stems ('idecl'). The final character of the stem decides which class a word belongs to, as can be seen in the context part of each rewrite rule (after the ||). In our example from the lexicon (the entry 'stella'), DeclTag is rewritten to 'adecl', as the word stem ends in an 'a'. The gender tag has already been rewritten (from the information in the lexicon) to 'fem'.
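The decision made by the six rules in |6| can be mirrored by a small Python function. This is a sketch for illustration only, not the xfst implementation; the function name `declension_tag` is an assumption, while the tag names, the consonant inventory and the '=' marker for long vowels follow the script.

```python
# Consonant inventory as defined for C in the appendix of the script.
CONSONANTS = set("bcdfghlmnpqrstvxz")
DECL_BY_VOWEL = {
    "a": "adecl", "o": "odecl", "e": "edecl", "u": "udecl", "i": "idecl",
}


def declension_tag(stem):
    """Choose the declension tag from the final character of the stem.

    A trailing '=' marks a long vowel in the script's notation and is
    ignored for the classification, like [a|a=] in the xfst rules.
    """
    base = stem[:-1] if stem.endswith("=") else stem
    if not base:
        raise ValueError("empty stem")
    last = base[-1]
    if last in DECL_BY_VOWEL:
        return DECL_BY_VOWEL[last]
    if last in CONSONANTS:
        return "cdecl"
    raise ValueError(f"unexpected stem-final character: {last!r}")


print(declension_tag("stella"))  # adecl
print(declension_tag("reg"))     # cdecl
```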

|7| define StemChange [ [C|V]+ -> 0 || .#. _ %# [C|V]+ Ending cdecl Noun masc nom pl

.

.

.

.o.

[C|V]+ -> 0 || %# _ Ending cdecl Noun masc nom sg

.

.

.

.o. o -> u || _ Ending odecl Noun masc acc sg

];

In |7| the definition of the possible stem change can be seen. In the first part of the rewrite rule

all cases are listed in which the 'standard' stem (→ section 3.2.1) is used. These are all cases

except for the nominative singular case for masculine or feminine nouns and all cases except

for the nominative and accusative singular case for neuter nouns. In these cases the variant

stem in front of the hash sign is deleted. In the second part of the rule all nominative (and for

neuter nouns accusative) singular forms are listed in the condition part of the rewrite rule. For

these cases the variant stem is used and the 'standard' stem (following the hash sign) is

deleted. At the end of the StemChange definition, one exceptional case is listed: In masculine

o-declension nouns the o of the stem changes into a u in the accusative singular. |7| doesn't

affect our example from the lexicon. 'stella' does not have a variant stem.
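The choice between the two stems can be illustrated in Python as follows. This is a simplified sketch under stated assumptions: the function name `select_stem` is hypothetical, an entry is given as a plain string with the variant stem before the hash sign, and the change of o to u in the masculine accusative singular of the o-declension is left out.

```python
def select_stem(entry, gender, case, number):
    """Pick the stem that the endings attach to.

    entry: 'variant#standard' (e.g. 'genus#gener') or a single stem.
    The variant stem is used in the nominative singular, and for neuter
    nouns also in the accusative singular; otherwise the standard stem.
    """
    if "#" not in entry:
        return entry  # no variant stem, nothing to choose
    variant, standard = entry.split("#", 1)
    uses_variant = number == "sg" and (
        case == "nom" or (case == "acc" and gender == "neut")
    )
    return variant if uses_variant else standard


print(select_stem("genus#gener", "neut", "acc", "sg"))  # genus
print(select_stem("genus#gener", "neut", "gen", "sg"))  # gener
print(select_stem("stella", "fem", "nom", "sg"))        # stella
```

Where the xfst rule lists every combination of declension, gender and case explicitly, the sketch states the same condition once as a Boolean expression.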

|8| define Endings [

Ending -> 0 || _ adecl Noun fem nom sg

.o.

.

.

.

Ending -> {~m} || _ adecl Noun fem acc sg

.o.

.

.

.

Ending -> {:rum} || _ adecl Noun fem gen pl

.o.

.

.

.

Ending -> {^i:s} || _ adecl Noun fem dat pl

];

In the Endings section (|8| is just an extract) the Ending tag is rewritten to the respective 'real' ending, conditioned on gender, case and number. We first listed all possible endings for all possible conditions and realized later that many cells of the abstract paradigms share the same ending (which is called syncretism in morphology). It is possible to capture this with another rule, which states that two conditions trigger the same rewriting of the Ending tag. In the Endings section three special characters can be observed which have to be removed later from the surface string. These special characters are explained in section 3.2.1. It can be seen again that the ending of a Latin noun is determined by its declension, its gender, its case and its number. Our example from the lexicon gets its endings in the Endings section according to the remaining information, case and number.

Rules |9| to |11| make no changes specific to stems or endings. They describe phonological replacements: in |9| vowels preceding the caret sign (^) are deleted, in |10| vowels preceding the colon (:) turn into long vowels and in |11| vowels in front of the tilde (~) turn into short vowels (→ recall the special characters in section 3.2.1):

|9| define Voweldeletion

|10| define Long

|11| define Short

Rules |12| and |13| describe cosmetic replacements: in |12| all the special characters which were used to trigger phonological changes in rules |9| to |11| are deleted, and in |13| all features are deleted in order to leave only the word in its surface form.

|12| define RemoveSpecialCharacters

|13| define RemoveFeatures
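Since the bodies of rules |9| to |12| are not shown here, the following Python sketch only imitates their described effect on the a-declension endings listed in the appendix. The function name `realize` and the regular expressions are assumptions; the dative/ablative plural ending is written as `^i:s`, on the assumption that the caret precedes the i so that the stem-final vowel is deleted. Long vowels carry a trailing '=' as in the script.

```python
import re


def realize(stem, ending):
    """Attach an ending and interpret the special characters ^, : and ~."""
    s = stem + ending
    s = re.sub(r"[aeiou]=?\^", "^", s)       # |9|  delete the vowel before ^
    s = re.sub(r"([aeiou])=?:", r"\1=:", s)  # |10| vowel before : becomes long
    s = re.sub(r"([aeiou])=?~", r"\1~", s)   # |11| vowel before ~ becomes short
    return re.sub(r"[\^:~]", "", s)          # |12| remove the special characters


# Singular and plural endings of the a-declension, taken from the appendix.
ENDINGS = ["", "e", "e", "~m", ":", "e", ":rum", "^i:s", ":s", "^i:s"]
print([realize("stella", e) for e in ENDINGS])
# ['stella', 'stellae', 'stellae', 'stellam', 'stella=', 'stellae',
#  'stella=rum', 'stelli=s', 'stella=s', 'stelli=s']
```

The order of the substitutions follows the order in which the transducers are composed: deletion first, then lengthening and shortening, then removal of the markers.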

Finally, in |14|, all the rules that we just described (each of which is a small transducer) are composed into a bigger transducer which we call 'Suffixes'. Thus, in 'Suffixes' the smaller transducers are applied one after the other. Composition is an operation on two relations that yields a new relation. Here we compose several relations, which means that the lower language of one transducer serves as the upper language of the next transducer. This continues through all the transducers until we obtain the composition of all the relations.

|14| define Suffixes [

Features

.o.

Declension

.o.

StemChange

.o.

RemoveHashSign

.o.

Endings

.o.

Voweldeletion

.o.

Long

.o.

Short

.o.

RemoveSpecialCharacters

.o.

RemoveFeatures

];
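What composition does can be shown on a toy scale in Python, with a relation represented as a finite set of (upper, lower) string pairs. This is only an illustration of the operation: real transducers encode such relations as networks and may be infinite, and all names and example strings below are hypothetical.

```python
def compose(r, s):
    """Pairs (a, c) such that r maps a to some b and s maps that b to c."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def invert(r):
    """Swap the upper and the lower side of every pair."""
    return {(b, a) for (a, b) in r}


def apply_down(r, upper):
    """Generation: all lower strings related to the given upper string."""
    return {b for (a, b) in r if a == upper}


def apply_up(r, lower):
    """Analysis: all upper strings related to the given lower string."""
    return {a for (a, b) in r if b == lower}


# Two toy steps: attach an ending, then clean up the special character.
add_ending = {("stella+acc+sg", "stella~m")}
clean_up = {("stella~m", "stellam")}
both = compose(add_ending, clean_up)

print(apply_down(both, "stella+acc+sg"))  # {'stellam'}
print(apply_up(both, "stellam"))          # {'stella+acc+sg'}
```

Because the composed relation is still a set of pairs, it can be read in both directions, which is the property the script relies on for generation and analysis.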

To read in the Lexicon and to actually build the network we have to give the command 'read regex' (→ |15|). First we define the morphology of Latin nouns, which is the composition of the Lexicon (with all its stem entries and some additional information) with the Suffixes (→ |14|). Then we compose the inverse of the StemToDict transducer with the 'Morphology' transducer in order to get the nominative singular forms instead of the stems. This 'Dictionary' transducer is then composed with 'Spaces', a cosmetic transducer that adds spaces between the lexical features on the lexical side (upper language) for readability. This is the final definition that is read in to build the finite-state network which can be used by the user.

|15| define Morphology [ Lexicon .o. Suffixes

];

define Dictionary [

[StemToDict].i .o. Morphology

];

read regex Spaces .o. Dictionary;

Stress Assignment

After completing the 'stem-and-ending' transducer we turn to stress assignment on Latin nouns. As already mentioned in section 3.3.1, we argue that it is possible to assign stress to a Latin noun without knowing the exact syllable boundaries, using only two facts: every syllable contains exactly one vowel or diphthong, and every vowel is long by position when it is followed by at least two consonants (unless it is followed by a 'muta cum liquida' sequence, which counts towards the beginning of the following syllable rather than being split and therefore does not make the vowel long by position). With these two facts it is possible to formulate rules for stress assignment:

First, in |16| every short vowel that is followed by at least two consonants (other than a 'muta cum liquida' sequence) is replaced by the corresponding long vowel.

|16| define LongVowel [ a -> a= || _ [C C+] - MutaCumLiquida

.o.

e -> e= || _ [C C+] - MutaCumLiquida

.

.

.

];

All the other vowels which are naturally long are marked in the lexicon.

In the next steps (→ |17| to |20|) the nouns are divided into three classes according to the number of syllables (or vowels/diphthongs) they consist of: one syllable (→ |17|), two syllables (→ |18|) or three or more syllables (→ |19| and |20|). If a word consists of only one syllable the stress lies on the single vowel or diphthong:

|17| define OneSyllable [ a -> 'a || .#. C* _ C* .#.

.o.

.

.

.

a= -> 'a= || .#. C* _ C* .#.

.o.

.

.

.

{ae} -> '{ae} || .#. C* _ C* .#.

.o.

.

.

.

];

If a word consists of two syllables (i.e. two vowels/diphthongs), stress is assigned to the first vowel or diphthong independent of the quantity of the vowel, which can be either long or short:

|18| define TwoSyllables [ a -> 'a || .#. C* _ C* [V|D] C* .#.

.o.

.

.

.

a= -> 'a= || .#. C* _ C* [V|D] C* .#.

.o.

.

.

.

{ae} -> '{ae} || .#. C* _ C* [V|D] C* .#.

.o.

.

.

.

];

If the word consists of three or more syllables, the penultimate law (→ section 3.3.2) comes

into play. That means if the second last vowel is a long vowel or a diphthong, stress is

assigned to it:

|19| define ThreeOrMoreSyllablesPenult [ a= -> 'a= || _ C* [V|D] C* .#.

.o.

.

.

.

{ae} -> '{ae} || _ C* [V|D] C* .#.

.o.

.

.

.

];

If, on the other hand, the second last syllable contains a short vowel, stress is assigned to the vowel or diphthong preceding that short vowel, independent of its quantity:

|20| define ThreeOrMoreSyllablesAntepenult [ a -> 'a || _ C* VShort C* [V|D] C* .#.

.o.

.

.

.

a= -> 'a= || _ C* VShort C* [V|D] C* .#.

.o.

.

.

.

{ae} -> '{ae} || _ C* VShort C* [V|D] C* .#.

.o.

.

.

.

];

As in |14|, all five transducers are composed to build the 'StressAssignment' transducer (→ |21|):

|21| define StressAssignment [ LongVowel

.o.

OneSyllable

.o.

TwoSyllables

.o.

ThreeOrMoreSyllablesPenult

.o.

ThreeOrMoreSyllablesAntepenult

];
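The combined effect of the five stress rules can be approximated in Python on strings in the script's notation (long vowels with a trailing '=', stress marked by an apostrophe in front of the vowel). This is a sketch under assumptions, not a translation of the xfst rules: the function name `mark_stress` is hypothetical, position length is tested directly instead of first rewriting the vowel as long, and enclitics are ignored.

```python
import re

# A nucleus is a diphthong, or a vowel with an optional length mark.
NUCLEUS = re.compile(r"ae|au|oe|eu|[aeiou]=?")
MUTA_CUM_LIQUIDA = {m + l for m in "bpgcdt" for l in "lr"}


def mark_stress(word):
    """Insert an apostrophe in front of the stressed vowel or diphthong."""
    nuclei = list(NUCLEUS.finditer(word))
    if not nuclei:
        return word
    if len(nuclei) <= 2:
        target = nuclei[0]  # one or two syllables: first vowel
    else:
        penult, last = nuclei[-2], nuclei[-1]
        between = word[penult.end():last.start()]
        long_by_nature = len(penult.group()) == 2
        long_by_position = len(between) >= 2 and between not in MUTA_CUM_LIQUIDA
        heavy = long_by_nature or long_by_position
        target = penult if heavy else nuclei[-3]
    return word[:target.start()] + "'" + word[target.start():]


print(mark_stress("stella"))      # st'ella
print(mark_stress("stella=rum"))  # stell'a=rum
print(mark_stress("integrum"))    # 'integrum
```

Only the consonants between the last two vowels are inspected, so no syllable boundary ever has to be computed.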

'Prosody' then is the composition of 'Spaces', the dictionary and the just mentioned

'StressAssignment' transducer:

|22| define Prosody [ Spaces .o. Dictionary .o. StressAssignment

];

|23| read regex Prosody;

|23| is the command to read in the transducer to build the actual finite state network.

One Final Remark

The implementation of Latin noun inflection is a program that can be used in two ways: for generation and for analysis of declined Latin nouns. The implementation of Latin stress assignment, on the other hand, is only of interest in one direction, namely generation. If 'StressAssignment' were always composed with 'Dictionary', the user would have to mark the main stress in the word entered for analysis, which is impractical, because the user may not know the main stress rule and could then not use the program for analysis at all. Therefore, it is useful to keep the two finite-state networks separate. If the user wishes to have information on stress in Latin nouns during generation, the second network 'Prosody' is activated. Otherwise just the network 'Dictionary' is used.

5 Bibliography

Allen, W. Sidney. Vox Latina – A Guide to the Pronunciation of Classical Latin, Second

edition. Cambridge University Press, Cambridge, 1978.

Beesley, Kenneth R. and Karttunen, Lauri. Finite State Morphology. CSLI Publications,

Leland Stanford Junior University, 2003.

Bender, Byron W. Latin Noun Inflection (A Solution to Latin 10). University of Hawaii at

Manoa.

Bozzi, A. and Cappelli, G. A Project for Latin Lexicography: 2. A Latin Morphological

Analyzer. In Computers and the Humanities, 24 (5-6). 1991.

Bubenheimer, Uli. Eine Morphologische Analysekomponente für das Lateinische zum Einsatz

in einem Lehrunterstützendem System. Studienarbeit, Universität Koblenz-Landau, 1995.

Covington, Michael A. Converging Transition Networks and Sub-Morphemic Regularities in

Latin Noun Inflection. Draft. Artificial Intelligence Center, University of Georgia, Athens,

1999.

Hockett, Charles F. Two models of grammatical description. 1954.

Kaplan, Ronald M. and Kay, Martin. Regular Models of Phonological Rule Systems.

Computational Linguistics 20:331-378, 1994.

Kenstowicz, Michael. Phonology in Generative Grammar. Blackwell Publishing, Oxford,

1994.

Koskenniemi, Kimmo. Two-Level Morphology: A General Computational Model for Word-

Form Recognition and Production. Dissertation, University of Helsinki, 1983.

Lindsay, W.M. The Latin Language – An Historical Account of Latin Sounds, Stems, and

Flexions. Clarendon Press, Oxford, 1894.

Logos Group. URL: http://www.logosconjugator.org. 2006. Universal Conjugator.

Matthews, P. H. Inflectional Morphology. University Press, Cambridge, 1972.

Matthews, P. H. Morphology, Second Edition. University Press, Cambridge, 1991.

McLean, Adam. URL: http://www.levity.com/alchemy/latin/latintrans.html. Latin parser and

translator 0.96.

Müller, Horst M. (Hrsg.). Arbeitsbuch Linguistik. Schöningh, Paderborn, 2002.

Perseus Digital Library Project. URL: http://www.perseus.tufts.edu. Ed. Gregory R. Crane.

Tufts University.

Roark, Brian and Sproat, Richard. Computational Approaches to Morphology and Syntax.

2006. (unpublished draft).

Sommer, Ferdinand. Handbuch der Lateinischen Laut- und Formenlehre – Eine Einführung

in das sprachwissenschaftliche Studium des Lateins, 2. und 3. Auflage. Carl Winters

Universitätsbuchhandlung, Heidelberg, 1914.

Sommer, Ferdinand. Handbuch der lateinischen Laut- und Formenlehre – Eine Einführung in

das sprachwissenschaftliche Studium des Lateins, 4. Auflage. Carl Winter

Universitätsverlag, Heidelberg, 1977.

Stock, Leo. Langenscheidts Kurzgrammatik Latein, 24. Auflage. Langenscheidt, Berlin, 1970.

Stowasser, J.M. et al. Stowasser, Auflage 2004. HPT Medien AG, Zug, 1979.

Stump, Gregory T. Inflectional Morphology - A Theory of Paradigm Structure. Cambridge

University Press, Cambridge, 2001.

Zirin, R. The Phonological Basis of Latin Prosody. University Microfilms, Inc., Ann Arbor,

Michigan, 1967.

6 Appendix: xfst Script File

clear

undefine all

define VShort a | e | i | o | u; #short vowel

define VLong a= | e= | i= | o= | u=; #long vowel

define V VShort | VLong; #vowel

define D {ae} | {au} | {oe} | {eu}; #diphthong

define C b | c | d | f | g | h | l | m | n | p | q | r | s | t | v | x | z;

#consonant

define Seg C | V; #segment

define Gend masc | fem | neut; #gender

define Case nom | gen | dat | acc | abl; #case

define Num sg | pl; #number

define Decl adecl | odecl | edecl | udecl | idecl | cdecl; #declension

define MutaCumLiquida {bl} | {br} | {pl} | {pr} | {gl} | {gr} | {cl} | {cr}

| {dl} | {dr} | {tl} | {tr}; #muta cum liquida

define noun Noun Gend Case Num; #noun

define POS Noun; #part of speech

define LexFeatures [Gend | Case | Num | POS]; #lexical features

####################STEM+ENDING############################################

###########################################################################

###########################################################################

# The lexical features given in the lexicon of the noun are extended by two

# more: 'Ending' and 'DeclTag' (declension)

define Features [

[..] -> [Ending DeclTag] || _ noun

];

# The 'DeclTag' feature is rewritten by the actual declension depending on

# the context

define Declension [

DeclTag -> adecl || [a|a=] Ending _

.o.

DeclTag -> odecl || [o|o=] Ending _

.o.

DeclTag -> edecl || [e|e=] Ending _

.o.

DeclTag -> udecl || [u|u=] Ending _

.o.

DeclTag -> idecl || [i|i=] Ending _

.o.

DeclTag -> cdecl || C Ending _

];

# The following definition about stem change contains three different

# condition contexts: 1) all segments preceding the hash sign are deleted

# in all cases 2) except for the nominative singular (and for neuter nouns

# in the accusative singular) where instead the segments following the

# hash sign are deleted and 3) final stem character o changes into u with

# o-declension masculine accusative singular nouns

define StemChange [

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc nom pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc gen

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc dat

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc acc

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun masc abl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem nom pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem gen

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem dat

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem acc

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun fem abl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem nom pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem gen

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem dat

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem acc

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun fem abl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc nom pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc gen

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc dat

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc acc

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun masc abl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut nom pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut gen

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut dat

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut acc pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending cdecl Noun neut abl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut nom pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut gen

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut dat

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut acc pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending idecl Noun neut abl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut nom pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut gen

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut dat

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut acc pl

.o.

Seg+ -> 0 || .#. _ %# Seg+ Ending odecl Noun neut abl

.o.

Seg+ -> 0 || %# _ Ending cdecl Noun masc nom sg

.o.

Seg+ -> 0 || %# _ Ending cdecl Noun fem nom sg

.o.

Seg+ -> 0 || %# _ Ending idecl Noun fem nom sg

.o.

Seg+ -> 0 || %# _ Ending odecl Noun masc nom sg

.o.

Seg+ -> 0 || %# _ Ending cdecl Noun neut nom sg

.o.

Seg+ -> 0 || %# _ Ending idecl Noun neut nom sg

.o.

Seg+ -> 0 || %# _ Ending odecl Noun neut nom sg

.o.

Seg+ -> 0 || %# _ Ending cdecl Noun neut acc sg

.o.

Seg+ -> 0 || %# _ Ending idecl Noun neut acc sg

.o.

Seg+ -> 0 || %# _ Ending odecl Noun neut acc sg

.o.

o -> u || _ Ending odecl Noun masc acc sg

];

# The auxiliary hash sign is deleted after 'StemChange'

define RemoveHashSign [

%# -> 0

];
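
# Illustration (an added sketch, not part of the original script). The tag
# order "Ending Decl Noun Gend Case Num" is read off the rule contexts above.
# For the accusative singular of 'servus', 'StemChange' followed by
# 'RemoveHashSign' should behave as follows:
#
#   servus#servo Ending odecl Noun masc acc sg   (input)
#   #servo Ending odecl Noun masc acc sg         (nominative form deleted)
#   #servu Ending odecl Noun masc acc sg         (o -> u before the ending)
#   servu Ending odecl Noun masc acc sg          (hash sign deleted)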

# The Ending tag is rewritten to the actual ending of the noun according to

# its declension, its gender, its case and its number

define Endings [

Ending -> 0 || _ adecl Noun fem nom sg

.o.

Ending -> e || _ adecl Noun fem gen sg

.o.

Ending -> e || _ adecl Noun fem dat sg

.o.

Ending -> {~m} || _ adecl Noun fem acc sg

.o.

Ending -> {:} || _ adecl Noun fem abl sg

.o.

Ending -> e || _ adecl Noun fem nom pl

.o.

Ending -> {:rum} || _ adecl Noun fem gen pl

.o.

Ending -> {^i:s} || _ adecl Noun fem dat pl

.o.

Ending -> {:s} || _ adecl Noun fem acc pl

.o.

Ending -> {^i:s} || _ adecl Noun fem abl pl

.o.

Ending -> 0 || _ odecl Noun masc nom sg

.o.

Ending -> {^i:} || _ odecl Noun masc gen sg

.o.

Ending -> {:} || _ odecl Noun masc dat sg

.o.

Ending -> {~m} || _ odecl Noun masc acc sg

.o.

Ending -> {:} || _ odecl Noun masc abl sg

.o.

Ending -> {^i:} || _ odecl Noun masc nom pl

.o.

Ending -> {:rum} || _ odecl Noun masc gen pl

.o.

Ending -> {^i:s} || _ odecl Noun masc dat pl

.o.

Ending -> {:s} || _ odecl Noun masc acc pl

.o.

Ending -> {^i:s} || _ odecl Noun masc abl pl

.o.

Ending -> s || _ edecl Noun fem nom sg

.o.

Ending -> {i:} || _ edecl Noun fem gen sg

.o.

Ending -> {i:} || _ edecl Noun fem dat sg

.o.

Ending -> {~m} || _ edecl Noun fem acc sg

.o.

Ending -> {:} || _ edecl Noun fem abl sg

.o.

Ending -> {:s} || _ edecl Noun fem nom pl

.o.

Ending -> {:rum} || _ edecl Noun fem gen pl

.o.

Ending -> {bus} || _ edecl Noun fem dat pl

.o.

Ending -> {:s} || _ edecl Noun fem acc pl

.o.

Ending -> {bus} || _ edecl Noun fem abl pl

.o.

Ending -> s || _ udecl Noun masc nom sg

.o.

Ending -> {:s} || _ udecl Noun masc gen sg

.o.

Ending -> {i:} || _ udecl Noun masc dat sg

.o.

Ending -> {~m} || _ udecl Noun masc acc sg

.o.

Ending -> {:} || _ udecl Noun masc abl sg

.o.

Ending -> {:s} || _ udecl Noun masc nom pl

.o.

Ending -> {um} || _ udecl Noun masc gen pl

.o.

Ending -> {^ibus} || _ udecl Noun masc dat pl

.o.

Ending -> {:s} || _ udecl Noun masc acc pl

.o.

Ending -> {^ibus} || _ udecl Noun masc abl pl

.o.

Ending -> s || _ udecl Noun fem nom sg

.o.

Ending -> {:s} || _ udecl Noun fem gen sg

.o.

Ending -> {i:} || _ udecl Noun fem dat sg

.o.

Ending -> {~m} || _ udecl Noun fem acc sg

.o.

Ending -> {:} || _ udecl Noun fem abl sg

.o.

Ending -> {:s} || _ udecl Noun fem nom pl

.o.

Ending -> {um} || _ udecl Noun fem gen pl

.o.

Ending -> {^ibus} || _ udecl Noun fem dat pl

.o.

Ending -> {:s} || _ udecl Noun fem acc pl

.o.

Ending -> {^ibus} || _ udecl Noun fem abl pl

.o.

Ending -> s || _ idecl Noun fem nom sg

.o.

Ending -> {^is} || _ idecl Noun fem gen sg

.o.

Ending -> {^i:} || _ idecl Noun fem dat sg

.o.

Ending -> {~m} || _ idecl Noun fem acc sg

.o.

Ending -> {:} || _ idecl Noun fem abl sg

.o.

Ending -> {^e:s} || _ idecl Noun fem nom pl

.o.

Ending -> {um} || _ idecl Noun fem gen pl

.o.

Ending -> {^ibus} || _ idecl Noun fem dat pl

.o.

Ending -> {:s} || _ idecl Noun fem acc pl

.o.

Ending -> {^ibus} || _ idecl Noun fem abl pl

.o.

Ending -> 0 || _ cdecl Noun masc nom sg

.o.

Ending -> {^is} || _ cdecl Noun masc gen sg

.o.

Ending -> {i:} || _ cdecl Noun masc dat sg

.o.

Ending -> {em} || _ cdecl Noun masc acc sg

.o.

Ending -> e || _ cdecl Noun masc abl sg

.o.

Ending -> {^e:s} || _ cdecl Noun masc nom pl

.o.

Ending -> {um} || _ cdecl Noun masc gen pl

.o.

Ending -> {^ibus} || _ cdecl Noun masc dat pl

.o.

Ending -> {e:s} || _ cdecl Noun masc acc pl

.o.

Ending -> {^ibus} || _ cdecl Noun masc abl pl

.o.

Ending -> 0 || _ cdecl Noun fem nom sg

.o.

Ending -> {^is} || _ cdecl Noun fem gen sg

.o.

Ending -> {i:} || _ cdecl Noun fem dat sg

.o.

Ending -> {em} || _ cdecl Noun fem acc sg

.o.

Ending -> e || _ cdecl Noun fem abl sg

.o.

Ending -> {^e:s} || _ cdecl Noun fem nom pl

.o.

Ending -> {um} || _ cdecl Noun fem gen pl

.o.

Ending -> {^ibus} || _ cdecl Noun fem dat pl

.o.

Ending -> {e:s} || _ cdecl Noun fem acc pl

.o.

Ending -> {^ibus} || _ cdecl Noun fem abl pl

.o.

Ending -> 0 || _ odecl Noun neut nom sg

.o.

Ending -> {^i:} || _ odecl Noun neut gen sg

.o.

Ending -> {:} || _ odecl Noun neut dat sg

.o.

Ending -> 0 || _ odecl Noun neut acc sg

.o.

Ending -> {:} || _ odecl Noun neut abl sg

.o.

Ending -> {^a} || _ odecl Noun neut nom pl

.o.

Ending -> {:rum} || _ odecl Noun neut gen pl

.o.

Ending -> {^i:s} || _ odecl Noun neut dat pl

.o.

Ending -> {^a} || _ odecl Noun neut acc pl

.o.

Ending -> {^i:s} || _ odecl Noun neut abl pl

.o.

Ending -> {:} || _ udecl Noun neut nom sg

.o.

Ending -> {:s} || _ udecl Noun neut gen sg

.o.

Ending -> {:} || _ udecl Noun neut dat sg

.o.

Ending -> {:} || _ udecl Noun neut acc sg

.o.

Ending -> {:} || _ udecl Noun neut abl sg

.o.

Ending -> a || _ udecl Noun neut nom pl

.o.

Ending -> {um} || _ udecl Noun neut gen pl

.o.

Ending -> {^ibus} || _ udecl Noun neut dat pl

.o.

Ending -> a || _ udecl Noun neut acc pl

.o.

Ending -> {^ibus} || _ udecl Noun neut abl pl

.o.

Ending -> 0 || _ idecl Noun neut nom sg

.o.

Ending -> s || _ idecl Noun neut gen sg

.o.

Ending -> {:} || _ idecl Noun neut dat sg

.o.

Ending -> 0 || _ idecl Noun neut acc sg

.o.

Ending -> {:} || _ idecl Noun neut abl sg

.o.

Ending -> a || _ idecl Noun neut nom pl

.o.

Ending -> {um} || _ idecl Noun neut gen pl

.o.

Ending -> {^ibus} || _ idecl Noun neut dat pl

.o.

Ending -> a || _ idecl Noun neut acc pl

.o.

Ending -> {^ibus} || _ idecl Noun neut abl pl

.o.

Ending -> 0 || _ cdecl Noun neut nom sg

.o.

Ending -> {is} || _ cdecl Noun neut gen sg

.o.

Ending -> {i:} || _ cdecl Noun neut dat sg

.o.

Ending -> 0 || _ cdecl Noun neut acc sg

.o.

Ending -> e || _ cdecl Noun neut abl sg

.o.

Ending -> a || _ cdecl Noun neut nom pl

.o.

Ending -> {um} || _ cdecl Noun neut gen pl

.o.

Ending -> {^ibus} || _ cdecl Noun neut dat pl

.o.

Ending -> a || _ cdecl Noun neut acc pl

.o.

Ending -> {^ibus} || _ cdecl Noun neut abl pl

];

#define Referral

# Vowels preceding a caret (^) are deleted

define Voweldeletion [

V -> 0 || _ %^

];

# Short vowels preceding a colon (:) turn into the corresponding long vowels

define Long [

a -> a= || _ %:

.o.

e -> e= || _ %:

.o.

i -> i= || _ %:

.o.

o -> o= || _ %:

.o.

u -> u= || _ %:

];

# A vowel preceding a tilde (~) is always short

define Short [

[a|a=] -> a || _ %~

.o.

[e|e=] -> e || _ %~

.o.

[i|i=] -> i || _ %~

.o.

[o|o=] -> o || _ %~

.o.

[u|u=] -> u || _ %~

];

# After 'Voweldeletion', 'Long' and 'Short' all special characters are

# deleted

define RemoveSpecialCharacters [

%^ -> 0 || Seg _

.o.

%: -> 0 || VLong _

.o.

%~ -> 0 || VShort _

];
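
# Illustration (an added sketch, not part of the original script; it assumes
# that VShort and VLong, defined earlier in the script, contain the short and
# the long vowels respectively). The colon and tilde markers introduced by
# 'Endings' should be resolved as follows:
#
#   stella:  -> stella=:  ('Long')   -> stella=  ('RemoveSpecialCharacters')
#   re=~m    -> re~m      ('Short')  -> rem      ('RemoveSpecialCharacters')
#
# The first line is the ablative singular of 'stella', the second the
# accusative singular of 'res'. The second line can be tried out
# interactively, for example with:
#
#   read regex [r e= %~ m] .o. Short .o. RemoveSpecialCharacters ;
#   print lower-words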

# Finally, all tags are deleted to leave just the surface form of the noun

# as a result

define RemoveFeatures [

Decl -> 0

.o.

Gend -> 0

.o.

nom | gen | dat | acc | abl -> 0

.o.

sg | pl -> 0

.o.

POS -> 0

];

define Suffixes [

Features

.o.

Declension

.o.

StemChange

.o.

RemoveHashSign

.o.

Endings

.o.

Voweldeletion

.o.

Long

.o.

Short

.o.

RemoveSpecialCharacters

.o.

RemoveFeatures

];

define Lexicon [{stella} [noun & $fem]] |

[{fenestra} [noun & $fem]] |

[{servus} %# {servo} [noun & $masc]] |

[{bellum} %# {bello} [noun & $neut]] |

[{integrum} %# {integro} [noun & $neut]] |

[{puer} %# {puero} [noun & $masc]] |

[{ager} %# {agro} [noun & $masc]] |

[{vir} %# {viro} [noun & $masc]] |

[{deus} %# {deo} [noun & $masc]] |

[{rex} %# r e= g [noun & $masc]] |

[{cor} %# {cord} [noun & $neut]] |

[{iter} %# {itiner} [noun & $neut]] |

[{caput} %# {capit} [noun & $neut]] |

[c o= {nsul} [noun & $masc]] |

[{pater} %# {patr} [noun & $masc]] |

[n o= {men} %# n o= {min} [noun & $neut]] |

[{genus} %# {gener} [noun & $neut]] |

[{corpus} %# {corpor} [noun & $neut]] |

[{turri} [noun & $fem]] |

[i= {gni} [noun & $fem]] |

[{animal} %# {anim} a= {li} [noun & $neut]] |

[{manu} [noun & $fem]] |

[{lacu} [noun & $masc]] |

[{genu} [noun & $neut]] |

[r e= [noun & $fem]] |

[{di} e= [noun & $fem]] |

[{fid} e= [noun & $fem]];

define Spaces [

~[{ } ?*]

.o.

~[?* { }]

.o.

~$[{ } { }]

.o.

~$[Seg { } Seg]

.o.

~[?* [Seg|Gend|Case|Num|POS] [Gend|Case|Num|POS] ?*]

.o.

{ } -> 0

];

# The morphology of Latin nouns is defined as the composition of the

# lexicon with the suffixes

define Morphology [

Lexicon .o. Suffixes

];

# Stylistic transducer to change every stem into its nominative singular

# (standard dictionary) form

define StemToNomSg [?* []:[noun & $nom & $sg]] .o. Lexicon .o. Suffixes;

# If a noun has a variant stem, the nominative singular form of the noun is

# given in the lexicon preceding a hash sign. In these cases the nominative

# singular form of the noun does not have to be formed but can be taken

# from the lexicon.

define Case1 [

$%#

.o.

[..] -> %# || \LexFeatures _ LexFeatures

.o.

%# ?* %# -> 0

.o.

~$%#

];
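
# Illustration (an added sketch, not part of the original script; <F> stands
# for the tags matched by 'LexFeatures', which is assumed to be defined
# earlier in the script). 'Case1' should treat an entry with a variant stem
# as follows:
#
#   servus#servo<F>    (lexicon entry containing a hash sign)
#   servus#servo#<F>   (second hash sign inserted in front of the features)
#   servus<F>          (everything from the first to the second hash sign
#                       deleted)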

# If there is a hash sign in the lexicon entry of a noun, the nominative

# singular form is taken from the lexicon, otherwise the nominative

# singular form of the noun is newly constructed

define StemToDict [

Case1

|

[~$%# .o. [StemToNomSg LexFeatures*]]

];

# The dictionary is defined to be the composition of the inversion of the

# 'StemToDict' function with the Latin morphology

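# (The body is missing in this copy; it is restored here from the description

# above.)

define Dictionary [

StemToDict.i .o. Morphology

];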

read regex Spaces .o. Dictionary;

####################PROSODY################################################

###########################################################################

###########################################################################

# Every short vowel turns into a long vowel (-> heavy syllable ('position

# length')) when it is followed by at least two consonants, unless these

# form a muta cum liquida

define LongVowel [

a -> a= || _ [C C+] - MutaCumLiquida

.o.

e -> e= || _ [C C+] - MutaCumLiquida

.o.

i -> i= || _ [C C+] - MutaCumLiquida

.o.

o -> o= || _ [C C+] - MutaCumLiquida

.o.

u -> u= || _ [C C+] - MutaCumLiquida

];
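
# Illustration (an added sketch, not part of the original script; it assumes
# that C contains the consonants and that MutaCumLiquida, defined earlier in
# the script, contains stop + liquid clusters such as 'tr'). 'LongVowel'
# should then give:
#
#   stella -> ste=lla   (e is followed by 'll')
#   patris -> patris    (a is followed by the muta cum liquida 'tr')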

# If a word consists of only one syllable, stress is assigned to that

# syllable

define OneSyllable [

a -> 'a || .#. C* _ C* .#.

.o.

e -> 'e || .#. C* _ C* .#.

.o.

i -> 'i || .#. C* _ C* .#.

.o.

o -> 'o || .#. C* _ C* .#.

.o.

u -> 'u || .#. C* _ C* .#.

.o.

a= -> 'a= || .#. C* _ C* .#.

.o.

e= -> 'e= || .#. C* _ C* .#.

.o.

i= -> 'i= || .#. C* _ C* .#.

.o.

o= -> 'o= || .#. C* _ C* .#.

.o.

u= -> 'u= || .#. C* _ C* .#.

.o.

{ae} -> '{ae} || .#. C* _ C* .#.

.o.

{au} -> '{au} || .#. C* _ C* .#.

.o.

{oe} -> '{oe} || .#. C* _ C* .#.

];

# If a word consists of two syllables (two vowels or diphthongs) stress is

# assigned to the first syllable

define TwoSyllables [

a -> 'a || .#. C* _ C* [V|D] C* .#.

.o.

e -> 'e || .#. C* _ C* [V|D] C* .#.

.o.

i -> 'i || .#. C* _ C* [V|D] C* .#.

.o.

o -> 'o || .#. C* _ C* [V|D] C* .#.

.o.

u -> 'u || .#. C* _ C* [V|D] C* .#.

.o.

a= -> 'a= || .#. C* _ C* [V|D] C* .#.

.o.

e= -> 'e= || .#. C* _ C* [V|D] C* .#.

.o.

i= -> 'i= || .#. C* _ C* [V|D] C* .#.

.o.

o= -> 'o= || .#. C* _ C* [V|D] C* .#.

.o.

u= -> 'u= || .#. C* _ C* [V|D] C* .#.

.o.

{ae} -> '{ae} || .#. C* _ C* [V|D] C* .#.

.o.

{au} -> '{au} || .#. C* _ C* [V|D] C* .#.

.o.

{oe} -> '{oe} || .#. C* _ C* [V|D] C* .#.

];

# If a word consists of three or more syllables, stress is assigned to the

# second-to-last syllable if it is a heavy syllable (ending in a long vowel

# or diphthong)

define ThreeOrMoreSyllablesPenult [

a= -> 'a= || _ C* [V|D] C* .#.

.o.

e= -> 'e= || _ C* [V|D] C* .#.

.o.

i= -> 'i= || _ C* [V|D] C* .#.

.o.

o= -> 'o= || _ C* [V|D] C* .#.

.o.

u= -> 'u= || _ C* [V|D] C* .#.

.o.

{ae} -> '{ae} || _ C* [V|D] C* .#.

.o.

{au} -> '{au} || _ C* [V|D] C* .#.

.o.

{oe} -> '{oe} || _ C* [V|D] C* .#.

];

# If a word consists of three or more syllables and the second-to-last

# syllable is a light syllable (ending in a short vowel), stress is assigned

# to the third-to-last syllable (vowel or diphthong)

define ThreeOrMoreSyllablesAntepenult [

a -> 'a || _ C* VShort C* [V|D] C* .#.

.o.

e -> 'e || _ C* VShort C* [V|D] C* .#.

.o.

i -> 'i || _ C* VShort C* [V|D] C* .#.

.o.

o -> 'o || _ C* VShort C* [V|D] C* .#.

.o.

u -> 'u || _ C* VShort C* [V|D] C* .#.

.o.

a= -> 'a= || _ C* VShort C* [V|D] C* .#.

.o.

e= -> 'e= || _ C* VShort C* [V|D] C* .#.

.o.

i= -> 'i= || _ C* VShort C* [V|D] C* .#.

.o.

o= -> 'o= || _ C* VShort C* [V|D] C* .#.

.o.

u= -> 'u= || _ C* VShort C* [V|D] C* .#.

.o.

{ae} -> '{ae} || _ C* VShort C* [V|D] C* .#.

.o.

{au} -> '{au} || _ C* VShort C* [V|D] C* .#.

.o.

{oe} -> '{oe} || _ C* VShort C* [V|D] C* .#.

];

define StressAssignment [

LongVowel

.o.

OneSyllable

.o.

TwoSyllables

.o.

ThreeOrMoreSyllablesPenult

.o.

ThreeOrMoreSyllablesAntepenult

];
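
# Illustration (an added sketch, not part of the original script; it assumes
# that V contains the plain vowels, VShort the short vowels and C the
# consonants, as their names suggest). Applied to surface forms, the cascade
# should first lengthen by position and then mark stress:
#
#   stella   -> ste=lla   -> st'e=lla    ('TwoSyllables')
#   fenestra -> fene=stra -> fen'e=stra  ('ThreeOrMoreSyllablesPenult')
#   corporis -> co=rporis -> c'o=rporis  ('ThreeOrMoreSyllablesAntepenult')
#
# A single form can be tried out interactively, for example with:
#
#   read regex {fenestra} .o. StressAssignment ;
#   print lower-words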

define Prosody [

Spaces .o. Dictionary .o. StressAssignment

];

#read regex Prosody;
