Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana

Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa
Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora
Application of MULTEXT-East and TEI in the compilation of parallel corpora

[email protected], http://nl.ijs.si/et/

BKS symposium, April 2007

Overview

1. The need for standardisation
2. Corpus encoding in TEI
3. MULTEXT-East morphosyntactic descriptions



Why standards (for digital language resources)?

public documentation (+ software)
(semi)automated validation
application independent
platform independent
do not become obsolescent (as fast)
However:
– demand time to understand and use them
– there are (too) many and not all are accepted
– they are not perfectly tuned to application (overhead)



TEI: the Text Encoding Initiative

TEI Guidelines are a vocabulary to describe text for scholarly purposes
They consist of:
– XML schemas
– documentation
P3 (1994), P4 (2002), P5 (0.9, 2007)
being developed by the TEI Consortium
large user base, web site, mailing list, tutorials, yearly meetings
increasingly popular for digital libraries, text-critical editions,…, to a certain extent for corpora



Jp-Sl dictionary

<entry id="jaslo.113">
  <form type="hw">
    <orth type="roma">akeru</orth>
    <orth type="kana">あける</orth>
    <orth type="kanji">開ける</orth>
  </form>
  <gramGrp><pos>V1</pos> <subc>trans.</subc></gramGrp>
  <form type="infl">
    <orth type="v-masu">あけます</orth>
    <orth type="v-te">あけて</orth>
    <orth type="v-nai">あけない</orth>
  </form>
  <trans><tr>odpreti</tr></trans>
  <eg><q>穴（あな）をあける</q> <tr>narediti luknjo</tr></eg>
  <eg><q>窓（まど）を開ける</q> <tr>odpreti okno</tr></eg>
  <xr type="related">
    <lbl>prim.</lbl> <ref>開く（あく）</ref> <lbl>intr.</lbl>
  </xr>
  <usg type="level">4</usg>
</entry>
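One benefit of such an encoding is that the entry can be read with any generic XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree and a shortened copy of the entry above; the function name headword_summary and the shape of the returned dictionary are illustrative choices, not part of the dictionary project.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the entry shown on the slide (inflected forms and examples omitted).
ENTRY = """<entry id="jaslo.113">
  <form type="hw">
    <orth type="roma">akeru</orth>
    <orth type="kana">あける</orth>
    <orth type="kanji">開ける</orth>
  </form>
  <gramGrp><pos>V1</pos> <subc>trans.</subc></gramGrp>
  <trans><tr>odpreti</tr></trans>
</entry>"""


def headword_summary(entry_xml):
    """Collect the headword spellings, part of speech and translations of one entry.

    Hypothetical helper for illustration only.
    """
    entry = ET.fromstring(entry_xml)
    # Map each orthography type (roma, kana, kanji) to its headword form.
    orths = {
        orth.get("type"): (orth.text or "").strip()
        for orth in entry.findall("form[@type='hw']/orth")
    }
    return {
        "id": entry.get("id"),
        "orth": orths,
        "pos": entry.findtext("gramGrp/pos"),
        "trans": [tr.text for tr in entry.findall("trans/tr")],
    }


summary = headword_summary(ENTRY)
print(summary["orth"]["roma"], summary["pos"], summary["trans"])
```

The same path expressions would work on the full entry, since the extra form, eg and xr elements are simply not selected.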



Example: MULTEXT-East “1984”, Serbian

<text id="mteo-sr." lang="sr">
  <body id="Osr" lang="sr">
    <div id="Osr.1" n="1" type="part">
      <head>Prvi deo</head>
      <div id="Osr.1.2" n="1" type="chapter">
        <head>1.</head>
        <p id="Osr.1.2.2">
          <s id="Osr.1.2.2.1">Bio je vedar i hladan aprilski dan; na časovnicima je izbijalo trinaest.</s>
          <s id="Osr.1.2.2.2"><name>Vinston Smit</name>, brade zabijene u nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju stambene zgrade <hi rend="it">Pobeda</hi>, no nedovoljno hitro da bi sprećio jednu spiralu oštre prašine da uđe zajedno s njim.</s>
        </p>
        …



MULTEXT-East

MULTEXT-East: EU Project (1995-1997)
Multilingual Texts and Corpora for Eastern and Central European Languages
Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
– morphosyntactic specifications (EAGLES / MULTEXT)
– morphosyntactically annotated parallel corpus
– inflectional lexica
– multilingual comparable, speech corpora
– language processing tools



History of MULTEXT-East resources

First release 1998 on CD-ROM: already extended with new languages
Resources since 1998 available on the Web: http://nl.ijs.si/ME/
Second release 2002 (EU CONCEDE): re-encoding in XML/TEI, harmonisation
Third release 2004: merge of first two releases, further languages
Fourth release 2007 (?)



The Languages of MULTEXT-East

Germanic: English
Romance: Romanian
Baltic:
– Latvian
– Lithuanian
Finno-Ugric:
– Estonian
– Hungarian
(BalkaNet:
– Greek
– Turkish)
Slavic:
– Russian (East Slavic)
– Czech (West Slavic)
– Slovene (South West Slavic)
– Resian (Slovene dialect)
– Croatian (South West Slavic) -- Marko Tadič
– Serbian (South West Slavic) -- C. Krstev, D. Vitas
– Bulgarian (South East Slavic)
In progress:
– Macedonian
– Persian



The MULTEXT morphosyntactic trinity

1. MULTEXT-East morphosyntactic specifications (Croatian, Serbian)
2. MULTEXT-East morphosyntactic lexica (Serbian)
3. MULTEXT-East morphosyntactically annotated "1984" corpus (Serbian)



1. Morphosyntactic specifications

Based on EAGLES / MULTEXT
Define PoS, their attributes and values
The specs are a document containing:
– introduction
– common tables
– language particular sections
Written in LaTeX (PDF & HTML)
Derived XML/TEI encoding as feature structures
In Version 4 specifications to be fully in TEI/XML
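To make the idea of "PoS, their attributes and values" concrete, here is a minimal sketch of how a positional morphosyntactic description (MSD) such as Ncfsn could be expanded into attribute-value pairs. The table below is a small hand-written subset for nouns, assumed here for illustration; the authoritative attribute order and value codes are those of the specifications themselves, and the function name decode_noun_msd is hypothetical.

```python
# Assumed, illustrative subset of the noun table: one (attribute, codes) pair
# per character position after the initial part-of-speech letter "N".
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd):
    """Expand a noun MSD string into a dictionary of attribute-value pairs.

    Raises ValueError for non-noun MSDs and KeyError for codes that are
    missing from the illustrative table above.
    """
    if not msd or msd[0] != "N":
        raise ValueError("this sketch only handles noun MSDs: %r" % msd)
    features = {"PoS": "Noun"}
    for (name, codes), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code == "-":
            # A hyphen marks an attribute that does not apply at this position.
            continue
        features[name] = codes[code]
    return features


print(decode_noun_msd("Ncfsn"))
```

A feature-structure encoding in XML/TEI carries the same information as the returned dictionary, only with explicit attribute names instead of character positions.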



Example common table



Example language-specific table



2. The lexica

Medium size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
~ all word-forms of approx. 15,000 lemmas
Lexical entry is composed of three fields:
– the word-form: the inflected form of the word
– the lemma: the base-form of the word
– the morphosyntactic description (MSD)



Example: Slovene lexicon

abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
abecedah abeceda Ncfpl
abecedam abeceda Ncfpd
abecedama abeceda Ncfdd
abecedama abeceda Ncfdi
abecedami abeceda Ncfpi
abecede abeceda Ncfpa
abecede abeceda Ncfpn
abecede abeceda Ncfsg
abecedi abeceda Ncfda
abecedi abeceda Ncfdn
…
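The three-field entries can be read with a few lines of code. The sketch below assumes whitespace-separated fields and that "=" in the lemma field means the lemma is identical to the word-form; the names parse_lexicon_line and msds_by_form are illustrative, not part of any MULTEXT-East tool.

```python
from collections import defaultdict

# A few entries copied from the Slovene example.
SAMPLE = """abeced abeceda Ncfdg
abeced abeceda Ncfpg
abeceda = Ncfsn
abecedah abeceda Ncfdl
abecedah abeceda Ncfpl
"""


def parse_lexicon_line(line):
    """Split one lexicon line into (word_form, lemma, msd)."""
    word_form, lemma, msd = line.split()
    if lemma == "=":
        # Assumed convention: "=" stands for "lemma equals the word-form".
        lemma = word_form
    return word_form, lemma, msd


entries = [parse_lexicon_line(line) for line in SAMPLE.splitlines() if line.strip()]

# Group the MSDs by word-form to expose ambiguous forms.
msds_by_form = defaultdict(list)
for word_form, lemma, msd in entries:
    msds_by_form[word_form].append(msd)

for word_form, msds in msds_by_form.items():
    print(word_form, msds)
```

Grouping by word-form shows why a tagger is needed: a form such as abeced has more than one MSD, and only the context decides between them.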



3. The “1984” corpus

Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)



Example linguistic encoding

<text id="Osl." lang="sl">
<body>
<div type="part" id="Osl.1">
<div type="chapter" id="Osl.1.2">
<p id="Osl.1.2.2">
<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>
<c>.</c>
</s>
…

Context disambiguated lemmas and MSDs
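Word-level annotation of this kind is straightforward to extract. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree and a single, explicitly closed sentence element copied from the example; the function name tokens is illustrative.

```python
import xml.etree.ElementTree as ET

# The first four tokens of the example sentence, with the <s> element closed
# so that the fragment is well-formed on its own.
SENTENCE = """<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
</s>"""


def tokens(sentence_xml):
    """Return (text, lemma, msd) triples; punctuation gets None for lemma and MSD."""
    sentence = ET.fromstring(sentence_xml)
    result = []
    for element in sentence:
        if element.tag == "w":
            # Words carry their lemma and MSD as attributes.
            result.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            # Punctuation is kept as a token without linguistic annotation.
            result.append((element.text, None, None))
    return result


for text, lemma, msd in tokens(SENTENCE):
    print(text, lemma, msd)
```

Because the lemma and MSD sit in attributes, the running text of the sentence stays recoverable by simply concatenating the element contents.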



Utility of MULTEXT-East LRs

Specifications became, for some, the “national” standard
Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
– Word-sense disambiguation
– WordNet development and evaluation
– Syntactic parser induction
Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian, Bosnian?



Corpora using TEI+MULTEXT-East

Reference corpus of Slovene: FIDA (100Mw), FIDA+ (600Mw) (+ other Sl. corpora)
Croatian National Corpus: HNK (100Mw)
Various Romanian corpora, …
En-Sl parallel annotated corpus: SVEZ-IJS (10Mw)



Conclusions

TEI provides a rich and flexible infrastructure to encode parallel corpora: meta-data, corpus and document structure, alignment, linguistic analysis
MULTEXT-East provides a harmonised and common infrastructure for word-level morphosyntactic descriptions
Both have already been used for a number of corpora
Maybe also for BKS?

Thank you!
