Bastien Kindt, Tamar Pataridze, Emmanuel Van Elverdinghe
International Workshop on Computer Aided Processing of Intertextuality in Ancient Languages
Lyon, 2nd-4th June 2014
PRLG
Projet de recherche en lexicologie grecque
Two goals
Creating an electronic dictionary of Ancient Greek
Lemmatizing patristic and historiographical Byzantine texts
The Dictionary
DICTIONNAIRE AUTOMATIQUE GREC (D.A.G.)
Lexical data stemming directly from corpus-based observations: this ensures comprehensiveness and coherence
No restriction regarding the date, literary genre, language level or dialect of the texts handled
434,190 word-forms
66,772 lemmata
Every morphosyntactic category
Proper names: anthroponyms, toponyms
Numeric determiners
Crases (984 different forms)
Elided forms (1,160 forms)
Lemmatization
1990-1991 Thesaurus Sancti Gregorii Nazianzeni
Clement of Alexandria (2nd-3rd c.)
Basil of Caesarea (4th c.)
Gregory of Nyssa (4th c.)
Procopius of Caesarea (6th c.)
Theophylact Simocatta (6th-7th c.)
Theophanes Confessor (8th-9th c.)
Joseph Genesius (10th c.)
Doukas (15th c.)
Etc.
Comprehensive inventories of the vocabulary of Byzantine patristic and historiographical texts, compiled with the D.A.G.
Concordances published in the Thesaurus Patrum Graecorum
series (Brepols Publishers)
24 volumes published
Concordances on microfiches (!)
PRLG's tools
Lemmatized concordances
Frequency indexes
Reverse indexes
End-of-book indexes
Indexes of words common or specific to two corpora or corpus parts
Etc.
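As a rough illustration of what the index types listed above contain, the following minimal Python sketch derives a frequency index, a reverse index and a lemmatized concordance from a list of (word-form, lemma) pairs. The token list, the transliterated forms and all function names are assumptions made for this example; they do not reproduce the PRLG's actual tools or data.

```python
from collections import Counter, defaultdict

# Hypothetical lemmatized corpus: (word-form, lemma) pairs in text order.
# Transliterated forms are used purely for illustration.
tokens = [
    ("logou", "logos"),
    ("kai", "kai"),
    ("logos", "logos"),
    ("theou", "theos"),
    ("kai", "kai"),
    ("logon", "logos"),
]


def frequency_index(tokens):
    """Lemmata sorted by descending frequency, then alphabetically."""
    counts = Counter(lemma for _, lemma in tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def reverse_index(tokens):
    """Distinct word-forms sorted by reversed spelling, i.e. by their endings."""
    return sorted({form for form, _ in tokens}, key=lambda form: form[::-1])


def lemmatized_concordance(tokens, width=1):
    """Map each lemma to its occurrences, with `width` forms of context per side."""
    forms = [form for form, _ in tokens]
    concordance = defaultdict(list)
    for i, (form, lemma) in enumerate(tokens):
        left = " ".join(forms[max(0, i - width):i])
        right = " ".join(forms[i + 1:i + 1 + width])
        concordance[lemma].append((left, form, right))
    return dict(concordance)
```

With this toy data, `frequency_index(tokens)` puts `logos` first with 3 occurrences, and `reverse_index(tokens)` groups forms that share an ending, which is what makes a reverse index useful for studying inflection.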
From PRLG to GREgORI project
Switching to fully digital
Extending to the other languages of the Christian Orient the computing tools and linguistic resources developed for Greek
Constituting multilingual lexica
Armenian: studying the formulaic style in manuscript colophons (E. Van Elverdinghe)
Georgian: studying translation methods from Greek (T. Pataridze)
Armenian manuscript colophons
I. Lemmatizing Armenian
II. The Armenian colophons project
III. An illustrative case-study
Lemmatizing Armenian
SOME METHODOLOGICAL NOTES
Indo-European, inflectional language with a leaning towards agglutination
Grammatical categories
Tokenization issues
Diachrony
I. Lemmatizing Armenian
II. The Armenian colophons project
1. CORPUS
2. PURPOSE
3. METHOD
III. An illustrative case-study
1. CORPUS
Text
Digitized text editions
6154 pages in 9 volumes
5th century to 1500 + 1601 to 1660
16,000 different colophons (from 1 word to several pages)
>1,300,000 forms
Processed through Unitex
Database
Gathering metadata extracted from the editions
Reference
Date
Author
Place
Manuscript content
Etc.
2. PURPOSE
Studying stereotypical patterns (formulae)
Lifespan
Frequency
Variation
Evolution
Geographical diffusion
Relevant for manuscript studies as a whole
3. METHOD
1) Spotting formulaic patterns (= collocations)
2) Determining the formula's structure
3) Extracting all utterances
4) Cross-analysis with information stored in the database
5) Sketch of the formula's life and deeds
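Step 4, the cross-analysis of extracted utterances with the database, can be pictured with a small Python sketch that counts attestations of a formula per century. The colophon references, years, places and function names below are hypothetical; the real database schema is not described in these slides.

```python
from collections import Counter

# Hypothetical metadata table: colophon reference -> date and place.
metadata = {
    "col-001": {"year": 989, "place": "Place A"},
    "col-002": {"year": 1331, "place": "Place B"},
    "col-003": {"year": 1399, "place": "Place B"},
    "col-004": {"year": 1450, "place": "Place C"},
    "col-005": {"year": 1620, "place": "Place B"},
    "col-006": {"year": None, "place": "Place A"},  # undated colophon
}

# Hypothetical output of step 3: references of colophons containing the formula.
matches = ["col-001", "col-002", "col-003", "col-004", "col-005", "col-006"]


def century(year):
    """Century number of a year AD (989 -> 10, 1500 -> 15, 1601 -> 17)."""
    return (year - 1) // 100 + 1


def attestations_by_century(matches, metadata):
    """Count formula attestations per century, skipping undated colophons."""
    counts = Counter()
    for ref in matches:
        year = metadata[ref].get("year")
        if year is not None:
            counts[century(year)] += 1
    return dict(sorted(counts.items()))
```

The same join on the colophon reference could just as well group matches by place, copyist or manuscript content, depending on which metadata field is of interest.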
I. Lemmatizing Armenian
II. The Armenian colophons project
III. An illustrative case-study
1. COLLOCATIONS
2. STRUCTURE
3. ANALYSIS
4. LEARNINGS
An illustrative case-study
[I wrote this] from a good and choice exemplar
396 occurrences (variants included)
From 989 AD onwards
Relatively stable
Verbose colophons
1. COLLOCATIONS
Log-likelihood
Browsing the concordance
Literature survey
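The slide names log-likelihood as the collocation measure without giving a formula. The Python sketch below assumes the common formulation by Dunning (the G2 statistic on a 2x2 contingency table of adjacent word pairs); the function names and the toy token list in the usage note are illustrative assumptions, not the project's implementation.

```python
import math


def bigram_table(tokens, w1, w2):
    """2x2 contingency counts for the adjacent pair (w1, w2) in a token list."""
    bigrams = list(zip(tokens, tokens[1:]))
    k11 = sum(1 for a, b in bigrams if a == w1 and b == w2)  # w1 followed by w2
    k12 = sum(1 for a, b in bigrams if a == w1 and b != w2)  # w1 followed by other
    k21 = sum(1 for a, b in bigrams if a != w1 and b == w2)  # other followed by w2
    k22 = len(bigrams) - k11 - k12 - k21                     # neither
    return k11, k12, k21, k22


def log_likelihood(k11, k12, k21, k22):
    """G2 statistic: 2 * sum(observed * ln(observed / expected)) over the table."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2 * g2
```

For example, `log_likelihood(*bigram_table(["a", "b", "a", "b", "c"], "a", "b"))` scores the pair `a b`; a score of 0 means the two words co-occur exactly as often as chance predicts, and higher scores flag candidate formulae to inspect in the concordance.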
2. STRUCTURE
Yields 56 pertinent matches
Provides new vocabulary
Qualifiers meaning "good", "choice", but also "reliable", "true", "glorious", etc.
Concordance for each word
Some variation in structure
3 qualifiers
Repetition/omission of the preposition
Polysyndeton/asyndeton
Rarer vocabulary
New, more complex, and exhaustive graph
Total: 396 matches
2 qualifiers
3 qualifiers
3. ANALYSIS
Statistical outlook
Most frequent adjectives
Making up 90% of attestations
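[Table: counts for combinations of the most frequent adjectives (columns A and B); values 205, 9, 16, 2, 45, 1, 78; row labels not recoverable]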
[Charts: "Attestations by century" and "Varieties found in only 1 manuscript"]
Axis: 10th c., 11th c., 12th c., 13th c., 14th c., 15th c., 17th c.
Percentage values: 20%, 27.3%, 12.3%, 1.7%, 3.4%
Count values: 2, 1, 5, 33, 57, 118, 400?
Formula in context: metadata
Date, copyist and locality almost always known
Found alongside the formula
Part of a wider-scale pattern
In early times, mostly in biblical manuscripts
Not very significant
Subtypes distinguishable by the context
"I wrote this with my unworthy hands, from a reliable and choice exemplar, with a tormented life and through much emigration ..."
N.B. Both word orders: or
42 attestations
83% before 1500
First time in 1331
Gospels, Bibles
Lake Van (1 from Jerusalem)
Often the same copyists
7 attestations in the 17th c.
Lake Van; 1 from today's Armenia
Gospels, canon-books
: 21 attestations (50%)
Gospels only
First time in 1399, in Ałt'amar (island on Lake Van)
17 times during the 15th century: 14 manuscripts from Ałt'amar, 3 written on the shores of Lake Van
All manuscripts from Ałt'amar with this formula present this word order
vs
History and geography of the formula
4. LEARNINGS
Copyists' mentality
Stylistic and orthographic habits
Increasing standardization
Inferring missing information about some manuscripts
Insight into the life and activity of copyists: passing down of techniques, traditions, and knowledge
Lemmatizing Georgian
Bilingual index of Gregory of Nazianzus
Edited, digitized and formatted text
DATABASE (SQL)
UNITEX corpus processor
- First disambiguation step
- Lexical lookup
- Fully manual disambiguation
- Lexical data export
Production of lexical tools
- Lemmatized concordances
- Frequency index
- Reverse index
- End-of-book index
- Common or specific vocabulary index
- Fully tagged corpus
- Bilingual index
State of the art for processing Greek:
B. Kindt, La lemmatisation des sources patristiques et byzantines au service d'une description lexicale du grec ancien. Les principes de formulation des lemmes du Dictionnaire Automatique Grec, in: Byzantion, 74 (2004), pp. 213-272.
Lemmatization principles for Georgian:
B. Coulie, B. Kindt, T. Pataridze, Lemmatisation automatique des sources du géorgien ancien, in: Le Muséon, 126 (2013), pp. 161-201.
Final goal: a bilingual Greek-Georgian index
DATABASE (SQL)
- SOURCE language tagged corpus
- TARGET language tagged corpus
Production of bilingual index
mkAlign Text alignment processor
Method: from text-alignment to bilingual index
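To make the step from aligned, tagged corpora to a bilingual index more concrete, here is a minimal Python sketch. It assumes the alignment has already been reduced to one record per aligned word pair; the record layout, the placeholder lemmata and the function names are hypothetical and do not reflect the actual output of mkAlign or the project's SQL schema.

```python
from collections import Counter, defaultdict

# Hypothetical aligned records:
# (source lemma, source tag, target lemma, target tag, text reference)
aligned = [
    ("gr_lemma_1", "V", "geo_lemma_1", "V+Mas", "1.1"),
    ("gr_lemma_1", "V", "geo_lemma_1", "V+Mas", "2.4"),
    ("gr_lemma_2", "V", "geo_lemma_2", "V+Part", "3.2"),
    ("gr_lemma_3", "N+Com", "geo_lemma_3", "V+Part", "5.7"),
]


def bilingual_index(aligned):
    """Map each source (lemma, tag) to its target (lemma, tag) equivalents and refs."""
    index = defaultdict(lambda: defaultdict(list))
    for src_lemma, src_tag, tgt_lemma, tgt_tag, ref in aligned:
        index[(src_lemma, src_tag)][(tgt_lemma, tgt_tag)].append(ref)
    return {source: dict(targets) for source, targets in index.items()}


def tag_correspondences(aligned):
    """Share (in %) of each source-tag/target-tag pairing among all aligned pairs."""
    counts = Counter((src_tag, tgt_tag) for _, src_tag, _, tgt_tag, _ in aligned)
    total = sum(counts.values())
    return {pair: round(100 * n / total, 1) for pair, n in counts.items()}
```

`bilingual_index` keeps the text references for every equivalence, while `tag_correspondences` summarises how often a category in one language is rendered by a given category in the other, for instance a Greek verb by a Georgian masdar (V+Mas) or participle (V+Part).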
Some analysis
Verbs and Common Nouns
V V+Mas
V V+Part
VERBS
Greek Georgian
V+Mas
V+Part
Correspondence
Correspondence
Greek Georgian
N+Com
V+Part
[Geo.] V+Part = A [Gr.]
A V+Part
Although [ketili] has the morphology of a participle, formed with the morpheme [-il], it is used as an adjective.
[Geo.] V+Part = N+Com [Gr.]
N+Com V+Part
[mo-u-ar-i] is an active participle formed with the [mo--ar] morphemes. Literally it means "one who teaches", but once it has become a substantive it takes on a meaning similar to "professor".
Another example, with a past participle: V+Part [mo-na-geb-i], "what was obtained, earned", leads to the meaning "goods, properties". The corresponding Greek term is N+Com.
COMMON NOUNS
Greek Georgian
N+Com V+Part
N+Com
[Gr.] N+Com = V+Part [Geo.]
N+Com A
N+Com V+Part
[na-ksov-i], past participle, "something that is knit", with the meaning "material, fabric"
[Gr.] N+Com = A [Geo.]
"Grass, lawn, greenery, green": the meaning can be expressed by the suffix of possession [ovan], so that [mcuanil-ovan-il] means something like "holder of grass".
Greek Georgian
N+Com
I+Adv
N+Com
[Geo.] N+Com = I+Adv [Gr.]
PRO+Per1p N+Com
V N+Com
I+Adv N+Com
[Geo.] N+Com = V [Gr.]
[Geo.] N+Com = Pro+Pers [Gr.]
[tavi] = head
[Geo.] N+Com = A [Gr.]
= genitive of "stone". The genitive of a common noun receives a nominative lemma tagged N+Com.
Genitive of a common noun [in Georgian] = adjective [in other languages]
Adverbial case of the adjectives and participles [in Georgian] = adverbs [in other languages]
_{.PRO+Per1s.54-0} _{ ().V.55-0} = -_{.V+Mas.52-0}
[Gr.] PRO+Pers + V = V [Geo.]
- = - [tana]
V  V+Mas      frequency % + text ref
V  V+Mas      frequency % + text ref
V  V+Mas      frequency % + text ref
V  V+Mas      frequency % + text ref
V  V+Part     frequency % + text ref
A $ A$N+Com   frequency % + text ref
A  V+Part     frequency % + text ref