Bastien Kindt, Tamar Pataridze, Emmanuel Van Elverdinghe

International Workshop on Computer Aided Processing of Intertextuality in Ancient Languages

Lyon, 2nd-4th June 2014

PRLG

Projet de recherche en lexicologie grecque

Two goals

Creating an electronic dictionary of Ancient Greek

Lemmatizing patristic and historiographical Byzantine texts

The Dictionary

DICTIONNAIRE AUTOMATIQUE GREC (D.A.G.)

Lexical data stemming directly from corpus-based observations: ensures comprehensiveness and coherence

No restriction regarding the date, literary genre, language level or dialect of the texts handled

The Dictionary


434,190 word-forms

66,772 lemmata

Every morphosyntactic category

The Dictionary


Proper names: anthroponyms, toponyms

Numeric determiners

Crases (984 different forms)

Elided forms (1,160 forms)

Lemmatization

1990-1991 Thesaurus Sancti Gregorii Nazianzeni

Lemmatization

Clement of Alexandria (2nd-3rd c.)
Basil of Caesarea (4th c.)
Gregory of Nyssa (4th c.)
Procopius of Caesarea (6th c.)
Theophylact Simocatta (6th-7th c.)
Theophanes Confessor (8th-9th c.)
Joseph Genesius (10th c.)
Doukas (15th c.)
Etc.

Lemmatization

Comprehensive inventories of the vocabulary of Byzantine patristic and historiographic texts with the D.A.G.

Lemmatization

Concordances published in the Thesaurus Patrum Graecorum series (Brepols Publishers)

24 volumes published

Concordances on microfiches (!)

PRLG's tools

Lemmatized concordances
Frequency indexes
Reverse indexes
End-of-book indexes
Indexes of words common or specific (to two corpora or corpus parts)
Etc.
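As an illustration of what these tools compute, here is a minimal Python sketch. It is not the PRLG software: the token triples, the romanized forms and all function names are hypothetical, chosen only to show how a frequency index, a reverse index, a lemmatized concordance and a common/specific vocabulary index can be derived from a lemmatized corpus.

```python
from collections import Counter, defaultdict

# Hypothetical lemmatized tokens: (word-form, lemma, reference). Illustrative only.
tokens = [
    ("logos", "logos", "1.1"),
    ("logou", "logos", "1.2"),
    ("theos", "theos", "1.2"),
    ("logon", "logos", "2.1"),
    ("theou", "theos", "2.3"),
    ("anthropos", "anthropos", "2.3"),
]

def frequency_index(tokens):
    """Lemmata sorted by decreasing frequency, then alphabetically."""
    counts = Counter(lemma for _, lemma, _ in tokens)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def reverse_index(tokens):
    """Distinct word-forms sorted by reversed spelling, i.e. grouped by ending."""
    forms = {form for form, _, _ in tokens}
    return sorted(forms, key=lambda f: f[::-1])

def lemmatized_concordance(tokens):
    """Map each lemma to the (form, reference) pairs where it occurs."""
    conc = defaultdict(list)
    for form, lemma, ref in tokens:
        conc[lemma].append((form, ref))
    return dict(conc)

def common_and_specific(tokens_a, tokens_b):
    """Lemmata shared by two corpora, then those specific to each."""
    lemmas_a = {lemma for _, lemma, _ in tokens_a}
    lemmas_b = {lemma for _, lemma, _ in tokens_b}
    return lemmas_a & lemmas_b, lemmas_a - lemmas_b, lemmas_b - lemmas_a
```

Sorting forms by their reversed spelling is what groups identical endings together in a reverse index.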

From PRLG to GREgORI project

Switching to full-digital

Extending to the other languages of the Christian Orient the computing tools and linguistic resources developed for Greek

Constituting multilingual lexica

From PRLG to GREgORI project

Armenian: studying the formulaic style in manuscript colophons (E. Van Elverdinghe)

Georgian: studying translation methods from Greek (T. Pataridze)

Armenian manuscript colophons

I. Lemmatizing Armenian

II. The Armenian colophons project

III. An illustrative case-study

Lemmatizing Armenian

SOME METHODOLOGICAL NOTES

Indo-European, inflectional language with a leaning towards agglutination

Grammatical categories

Tokenization issues

Diachrony


1. CORPUS

2. PURPOSE

3. METHOD



1. CORPUS

Text

Digitized text editions

6154 pages in 9 volumes

5th century to 1500 + 1601 to 1660

16,000 different colophons (from 1 word to several pages)

>1,300,000 forms

Processed through Unitex

Database

Gathering metadata extracted from the editions

Reference

Date

Author

Place

Manuscript content

Etc.
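A metadata database of this kind could be sketched as follows. The table layout, column names and sample rows are assumptions made for illustration (the slides list only the kinds of metadata gathered); Python's standard `sqlite3` module stands in for whatever SQL engine the project uses.

```python
import sqlite3

# In-memory database; the schema is hypothetical and mirrors the fields listed above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE colophon (
           reference TEXT PRIMARY KEY,
           year INTEGER,
           author TEXT,
           place TEXT,
           manuscript_content TEXT
       )"""
)

# Invented sample rows, for illustration only.
rows = [
    ("vol1-001", 1331, "scribe A", "Lake Van", "Gospels"),
    ("vol1-002", 1399, "scribe B", "Lake Van", "Gospels"),
    ("vol2-001", 1450, "scribe C", "Jerusalem", "Bible"),
]
conn.executemany("INSERT INTO colophon VALUES (?, ?, ?, ?, ?)", rows)

def count_by_place(conn, first_year, last_year):
    """Number of colophons per place within a date range."""
    cur = conn.execute(
        "SELECT place, COUNT(*) FROM colophon "
        "WHERE year BETWEEN ? AND ? GROUP BY place ORDER BY place",
        (first_year, last_year),
    )
    return cur.fetchall()
```

Keeping the metadata separate from the text makes it possible to cross any textual query with date, place or manuscript content later on.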

2. PURPOSE

Studying stereotypical patterns (formulae)

Lifespan

Frequency

Variation

Evolution

Geographical diffusion

Relevant for manuscript studies as a whole

3. METHOD

1) Spotting formulaic patterns (= collocations)

2) Determining the formula's structure

3) Extracting all utterances

4) Cross-analysis with information stored in the database

5) Sketch of the formula's life and deeds
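Steps 3 and 4 can be pictured with the following Python sketch. It is only an approximation under stated assumptions: the project processes its corpus through Unitex, whereas this example uses a plain regular expression; the colophon texts (English glosses standing in for Armenian lemmata), the metadata and the qualifier list are hypothetical.

```python
import re
from collections import Counter

# Hypothetical lemmatized colophons; English glosses stand in for Armenian lemmata.
colophons = {
    "c1": "I write this from good and choice exemplar",
    "c2": "write from reliable choice and true exemplar with unworthy hand",
    "c3": "remember the sinful scribe",
}
# Hypothetical metadata, as it might be stored in the database.
metadata = {
    "c1": {"year": 1331, "place": "Lake Van"},
    "c2": {"year": 1420, "place": "Lake Van"},
    "c3": {"year": 1250, "place": "Jerusalem"},
}

QUALIFIERS = ("good", "choice", "reliable", "true", "glorious")
Q = "(?:" + "|".join(QUALIFIERS) + ")"
# Structure: "from" + 2 or 3 qualifiers + "exemplar"; the conjunction and the
# repetition of the preposition before later qualifiers are both optional.
PATTERN = re.compile(rf"\bfrom {Q}(?: (?:and )?(?:from )?{Q}){{1,2}} exemplar\b")

def extract_occurrences(colophons):
    """Step 3: list (reference, matched text) for every colophon with the formula."""
    hits = []
    for ref, text in sorted(colophons.items()):
        match = PATTERN.search(text)
        if match:
            hits.append((ref, match.group(0)))
    return hits

def attestations_by_century(hits, metadata):
    """Step 4: cross the matches with the dates stored in the metadata."""
    return Counter((metadata[ref]["year"] - 1) // 100 + 1 for ref, _ in hits)
```

The optional elements in the pattern correspond to the kinds of structural variation a formula may show, such as the number of qualifiers or the presence of a conjunction.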


1. COLLOCATIONS

2. STRUCTURE

3. ANALYSIS

4. LEARNINGS



An illustrative case-study

[I wrote this] from a good and choice exemplar

396 occurrences (variants included)

From 989 AD onwards

Relatively stable

Verbose colophons

1. COLLOCATIONS

Log-likelihood

Browsing the concordance

Literature survey
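The slide names only "log-likelihood"; assuming this refers to the log-likelihood ratio (G²) commonly used to score collocations, a minimal Python sketch looks like this. The function name and the layout of the contingency table are illustrative choices, not the project's implementation.

```python
import math

def log_likelihood(k11, k12, k21, k22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table.

    k11: occurrences of word1 together with word2
    k12: occurrences of word1 without word2
    k21: occurrences of word2 without word1
    k22: occurrences of neither word
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

A score near zero means the two words co-occur about as often as chance predicts; high scores flag candidate formulae, which still have to be checked by browsing the concordance.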

2. STRUCTURE

Yields 56 pertinent matches

Provides new vocabulary

= good, choice

but also = reliable

true

glorious etc.

Concordance for each word

Some variation in structure

3 qualifiers

Repetition/omission of the preposition

Polysyndeton/asyndeton

Rarer vocabulary

New, more complex, and exhaustive graph

Total: 396 matches

2 qualifiers

3 qualifiers

3. ANALYSIS

Statistical outlook

Most frequent adjectives

Making up 90% of attestations

A B

205 9

16 2 45

1 78

Attestations by century

Varieties found in only 1 manuscript

10th c. 11th c. 12th c. 13th c. 14th c. 15th c. 17th c.

20% 27.3% 12.3% 1.7% 3.4%

10th c. 11th c. 12th c. 13th c. 14th c. 15th c. 17th c.

2 1 5 33 57 118 400?

3. ANALYSIS

Formula in context: metadata

Date, copyist and locality almost always known

Found alongside the formula

Part of a wider-scale pattern

In early times, mostly in biblical manuscripts

Not very significant

Subtypes distinguishable by the context


I wrote this with my unworthy hands, from a reliable and choice exemplar, with a tormented life and through much emigration ...

N.B. Both word orders: or

42 attestations

83% before 1500

First time in 1331

Gospels, Bibles

Lake Van (1 from Jerusalem)

Often the same copyists

7 attestations in the 17th c.

Lake Van; 1 from today's Armenia

Gospels, canon-books

: 21 attestations (50%)

Gospels only

First time in 1399, in Ałt'amar (island on Lake Van)

17 times during the 15th century: 14 manuscripts from Ałt'amar, 3 written on the shores of Lake Van

All manuscripts from Ałt'amar with this formula present this word-order

vs

History and geography of the formula

4. LEARNINGS

Copyists' mentality

Stylistic and orthographic habits

Increasing standardization

Inferring missing information about some manuscripts

Insight into the life and activity of copyists: passing down of techniques, traditions, and knowledge

Lemmatizing Georgian

Bilingual index of Gregory of Nazianzus

Edited, digitized and formatted text

DATABASE (SQL)

UNITEX Corpus processor

- First disambiguation step
- Lexical lookup
- Fully manual disambiguation
- Lexical data export

Production of lexical tools

- Lemmatized concordances
- Frequency index
- Reverse index
- End-of-book index
- Common or specific vocabulary index
- Fully tagged corpus
- Bilingual index

State of the art for processing Greek

Lemmatization principles for Georgian

B. Kindt, La lemmatisation des sources patristiques et byzantines au service d'une description lexicale du grec ancien. Les principes de formulation des lemmes du Dictionnaire Automatique Grec, in: Byzantion, 74 (2004), pp. 213-272.

B. Coulie, B. Kindt, T. Pataridze, Lemmatisation automatique des sources du géorgien ancien, in: Le Muséon, 126 (2013), pp. 161-201.

Final goal: a bilingual Greek-Georgian index

DATABASE (SQL)

- SOURCE language tagged corpus

- TARGET language tagged corpus

Production of bilingual index

mkAlign Text alignment processor

Method: from text-alignment to bilingual index
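The step from aligned text to bilingual index can be sketched in Python as follows. This is not mkAlign's output format or the project's code: the aligned pairs, the placeholder lemmata and the function names are hypothetical; only the tag labels (V, V+Mas, V+Part, A) are taken from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical aligned pairs:
# (source lemma, source tag, target lemma, target tag, text reference).
aligned = [
    ("gr_lemma_1", "V", "geo_lemma_1", "V+Mas", "ref 1"),
    ("gr_lemma_1", "V", "geo_lemma_1", "V+Mas", "ref 2"),
    ("gr_lemma_2", "V", "geo_lemma_2", "V+Part", "ref 3"),
    ("gr_lemma_3", "A", "geo_lemma_3", "V+Part", "ref 4"),
]

def bilingual_index(aligned):
    """Map each source lemma to its target equivalents and their references."""
    index = defaultdict(lambda: defaultdict(list))
    for src, _, tgt, _, ref in aligned:
        index[src][tgt].append(ref)
    return {src: dict(targets) for src, targets in index.items()}

def tag_correspondences(aligned):
    """Count source/target category pairs, with their share of all pairs in %."""
    counts = Counter((src_tag, tgt_tag) for _, src_tag, _, tgt_tag, _ in aligned)
    total = sum(counts.values())
    rows = [
        (src_tag, tgt_tag, n, round(100 * n / total, 1))
        for (src_tag, tgt_tag), n in counts.items()
    ]
    return sorted(rows, key=lambda r: (-r[2], r[0], r[1]))
```

Counting category pairs in this way is one possible route to observations such as a Greek verb being rendered by a Georgian masdar or participle.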

Some analysis

Verbs and Common Nouns

V V+Mas

V V+Part

VERBS

Greek Georgian

V+Mas

V+Part

Correspondence

Correspondence

Greek Georgian

N+Com

V+Part

[Geo.] V+Part = A [Gr.]

A V+Part

Although [ketili] has the morphology of a participle, formed with the morpheme [-il], it is used as an adjective.

[Geo.] V+Part = N+Com [Gr.]

N+Com V+Part

[mo-u-ar-i] is an active participle formed with the [mo--ar] morphemes. Literally it means "one who teaches", but once substantivized it takes on a meaning close to "professor".

Another example, with the past participle: V+Part [mo-na-geb-i], "what was obtained, earned", leads to the meaning "goods, properties". The corresponding Greek term is N+Com.

COMMON NOUNS

Greek Georgian

N+Com V+Part

N+Com

[Gr.] N+Com = V+Part [Geo.]

N+Com A

N+Com V+Part

[na-ksov-i], past participle, "something that is knit", with the meaning "material, fabric"

[Gr.] N+Com = A [Geo.]

Grass, lawn, greenery, green: the meaning can be expressed by the possessive suffix [ovan]; [mcuanil-ovan-il] means something like "holder of grass"

Greek Georgian

N+Com

I+Adv

N+Com

[Geo.] N+Com = I+Adv [Gr.]

PRO+Per1p N+Com

V N+Com

I+Adv N+Com

[Geo.] N+Com = V [Gr.]

[Geo.] N+Com = Pro+Pers [Gr.]

[tavi] = head

[Geo.] N+Com = A [Gr.]

= genitive of "stone". Of course, the genitive of a common noun receives a nominative lemma tagged N+Com.

Genitive of a common noun [in Georgian] = adjective [in other languages]

Adverbial case of the adjectives and participles [in Georgian] = adverbs [in other languages]

_{.PRO+Per1s.54-0} _{ ().V.55-0} = -_{.V+Mas.52-0}

[Gr.] PRO+Pers + V = V [Geo.]

- = - [tana]

V V+Mas

V V+Mas

V V+Mas

V V+Mas

V V+Part

A $ A$N+Com

A V+Part

frequency % + text ref






