Page 1: Tomaž Erjavec Department of Knowledge Technologies Jožef Stefan Institute, Ljubljana

Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa

Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora

Application of MULTEXT-East and TEI in the compilation of parallel corpora

Tomaž Erjavec

Department of Knowledge Technologies

Jožef Stefan Institute, Ljubljana

[email protected], http://nl.ijs.si/et/

Page 2: Overview

BKS symposium, April 2007

Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute

1. The need for standardisation

2. Corpus encoding in TEI

3. MULTEXT-East morphosyntactic descriptions

Page 3: Why standards (for digital language resources)?

• public documentation (+ software)
• (semi)automated validation
• application independent
• platform independent
• do not become obsolescent (as fast)
• However:
  – they demand time to understand and use
  – there are (too) many, and not all are accepted
  – they are not perfectly tuned to the application (overhead)

Page 4: TEI: the Text Encoding Initiative

• The TEI Guidelines are a vocabulary to describe text for scholarly purposes
• They consist of:
  – XML schemas
  – documentation
• P3 (1994), P4 (2002), P5 (0.9, 2007)
• being developed by the TEI Consortium
• large user base, web site, mailing list, tutorials, yearly meetings
• increasingly popular for digital libraries, text-critical editions, …, and to a certain extent for corpora

Page 5: Jp-Sl dictionary

<entry id="jaslo.113">
  <form type="hw">
    <orth type="roma">akeru</orth>
    <orth type="kana">あける</orth>
    <orth type="kanji">開ける</orth>
  </form>
  <gramGrp><pos>V1</pos> <subc>trans.</subc></gramGrp>
  <form type="infl">
    <orth type="v-masu">あけます</orth>
    <orth type="v-te">あけて</orth>
    <orth type="v-nai">あけない</orth>
  </form>
  <trans><tr>odpreti</tr></trans>
  <eg><q>穴(あな)をあける</q> <tr>narediti luknjo</tr></eg>
  <eg><q>窓(まど)を開ける</q> <tr>odpreti okno</tr></eg>
  <xr type="related">
    <lbl>prim.</lbl> <ref>開く(あく)</ref> <lbl>intr.</lbl>
  </xr>
  <usg type="level">4</usg>
</entry>
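A minimal sketch of how such a dictionary entry could be read programmatically, using Python's standard `xml.etree.ElementTree`. The entry below is an abbreviated copy of the one on the slide; the function name `summarise_entry` and the shape of the returned dictionary are illustrative assumptions, not part of the dictionary project.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the slide's entry (inflected forms and one example omitted).
ENTRY_XML = """
<entry id="jaslo.113">
  <form type="hw">
    <orth type="roma">akeru</orth>
    <orth type="kana">あける</orth>
    <orth type="kanji">開ける</orth>
  </form>
  <gramGrp><pos>V1</pos> <subc>trans.</subc></gramGrp>
  <trans><tr>odpreti</tr></trans>
  <eg><q>窓(まど)を開ける</q> <tr>odpreti okno</tr></eg>
</entry>
"""


def summarise_entry(xml_text):
    """Collect headword spellings, part of speech and translations (hypothetical helper)."""
    entry = ET.fromstring(xml_text)
    # Headword spellings keyed by script: roma, kana, kanji.
    headword = {
        orth.get("type"): orth.text.strip()
        for orth in entry.findall("form[@type='hw']/orth")
    }
    return {
        "id": entry.get("id"),
        "headword": headword,
        "pos": entry.findtext("gramGrp/pos"),
        "translations": [tr.text for tr in entry.findall("trans/tr")],
        # Each usage example pairs the Japanese quote with its Slovene translation.
        "examples": [(eg.findtext("q"), eg.findtext("tr")) for eg in entry.findall("eg")],
    }


summary = summarise_entry(ENTRY_XML)
print(summary["headword"]["roma"], summary["pos"], summary["translations"])
```

Because the entry is plain XML with typed `<orth>` elements, no format-specific parser is needed; a generic XML library is enough to pull out any field.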

Page 6: Example: MULTEXT-East “1984”, Serbian

<text id="mteo-sr." lang="sr">
  <body id="Osr" lang="sr">
    <div id="Osr.1" n="1" type="part">
      <head>Prvi deo</head>
      <div id="Osr.1.2" n="1" type="chapter">
        <head>1.</head>
        <p id="Osr.1.2.2">
          <s id="Osr.1.2.2.1">Bio je vedar i hladan aprilski dan; na časovnicima je izbijalo trinaest.</s>
          <s id="Osr.1.2.2.2"><name>Vinston Smit</name>, brade zabijene u nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju stambene zgrade <hi rend="it">Pobeda</hi>, no nedovoljno hitro da bi sprećio jednu spiralu oštre prašine da uđe zajedno s njim.</s>
        </p>
        …
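The structural markup above makes every sentence addressable by its `id`. As a rough illustration (not the project's own tooling), the following Python sketch lists sentence identifiers with their plain text, flattening inline elements such as `<name>`. The sample is a shortened and closed version of the slide's fragment, and `sentences` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened version of the slide's fragment, with the elided closing tags supplied
# and the second sentence cut short for brevity.
SAMPLE = """<text id="mteo-sr." lang="sr">
  <body id="Osr" lang="sr">
    <div id="Osr.1" n="1" type="part">
      <head>Prvi deo</head>
      <div id="Osr.1.2" n="1" type="chapter">
        <head>1.</head>
        <p id="Osr.1.2.2">
          <s id="Osr.1.2.2.1">Bio je vedar i hladan aprilski dan; na časovnicima je izbijalo trinaest.</s>
          <s id="Osr.1.2.2.2"><name>Vinston Smit</name>, brade zabijene u nedra.</s>
        </p>
      </div>
    </div>
  </body>
</text>"""


def sentences(xml_text):
    """Yield (sentence id, plain text) pairs for every <s> element."""
    root = ET.fromstring(xml_text)
    for s in root.iter("s"):
        # itertext() walks nested elements such as <name> and <hi>;
        # split/join normalises any whitespace left over from indentation.
        text = " ".join("".join(s.itertext()).split())
        yield s.get("id"), text


sents = dict(sentences(SAMPLE))
for sid, text in sents.items():
    print(sid, text)
```

The hierarchical identifiers (`Osr.1.2.2.1` and so on) are what a sentence alignment with another language version would point at.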

Page 7: MULTEXT-East

• MULTEXT-East: EU project (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
• Based on the results of EU MULTEXT (~West)
• To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
  – morphosyntactic specifications (EAGLES / MULTEXT)
  – morphosyntactically annotated parallel corpus
  – inflectional lexica
  – multilingual comparable, speech corpora
  – language processing tools

Page 8: History of MULTEXT-East resources

• First release 1998 on CD-ROM: already extended with new languages
• Resources available on the Web since 1998: http://nl.ijs.si/ME/
• Second release 2002 (EU CONCEDE): re-encoding in XML/TEI, harmonisation
• Third release 2004: merge of the first two releases, further languages
• Fourth release 2007 (?)

Page 9: The Languages of MULTEXT-East

• Germanic: English
• Romance: Romanian
• Baltic:
  – Latvian
  – Lithuanian
• Finno-Ugric:
  – Estonian
  – Hungarian
• (BalkaNet):
  – Greek
  – Turkish
• Slavic:
  – Russian (East Slavic)
  – Czech (West Slavic)
  – Slovene (South West Slavic)
  – Resian (Slovene dialect)
  – Croatian (South West Slavic) -- Marko Tadić
  – Serbian (South West Slavic) -- C. Krstev, D. Vitas
  – Bulgarian (South East Slavic)
• In progress:
  – Macedonian
  – Persian

Page 10: The MULTEXT morphosyntactic trinity

1. MULTEXT-East morphosyntactic specifications (Croatian, Serbian)

2. MULTEXT-East morphosyntactic lexica (Serbian)

3. MULTEXT-East morphosyntactically annotated "1984" corpus (Serbian)

Page 11: 1. Morphosyntactic specifications

• Based on EAGLES / MULTEXT
• Define PoS, their attributes and values
• The specs are a document containing:
  – introduction
  – common tables
  – language particular sections
• Written in LaTeX → PDF & HTML
• Derived XML/TEI encoding as feature structures
• In Version 4, specifications to be fully in TEI/XML
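To make "PoS, their attributes and values" concrete, here is a small Python sketch that expands a positional tag such as `Ncfsn` into attribute-value pairs. The table is a hand-written, simplified excerpt covering only four noun attributes and is an assumption made for illustration; the real specifications define many more categories, attributes and language-specific values. The names `SPEC` and `decode_msd` are hypothetical.

```python
# Simplified, hand-written excerpt of a noun table (assumption for illustration only).
# Each position after the category letter encodes one attribute.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

SPEC = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional tag into a feature dictionary (hypothetical helper)."""
    if not msd or msd[0] not in SPEC:
        raise ValueError(f"category not covered by this sketch: {msd!r}")
    category, attributes = SPEC[msd[0]]
    features = {"PoS": category}
    for (name, values), code in zip(attributes, msd[1:]):
        if code == "-":
            # A hyphen marks an attribute that does not apply to this form.
            continue
        if code not in values:
            raise ValueError(f"unknown value {code!r} for attribute {name}")
        features[name] = values[code]
    return features


decoded = decode_msd("Ncfsn")
print(decoded)
```

The resulting dictionary is essentially a flat feature structure, which is the same idea the derived XML/TEI encoding expresses in markup.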

Page 12: Example common table

Page 13: Example language specific table

Page 14: 2. The lexica

• Medium size morphosyntactic lexica
• Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
• ~ all word-forms of ca. 15,000 lemmas
• A lexical entry is composed of three fields:
  – the word-form: the inflected form of the word
  – the lemma: the base-form of the word
  – the morphosyntactic description (MSD)

Page 15: Example: Slovene lexicon

abeced      abeceda   Ncfdg
abeced      abeceda   Ncfpg
abeceda     =         Ncfsn
abecedah    abeceda   Ncfdl
abecedah    abeceda   Ncfpl
abecedam    abeceda   Ncfpd
abecedama   abeceda   Ncfdd
abecedama   abeceda   Ncfdi
abecedami   abeceda   Ncfpi
abecede     abeceda   Ncfpa
abecede     abeceda   Ncfpn
abecede     abeceda   Ncfsg
abecedi     abeceda   Ncfda
abecedi     abeceda   Ncfdn
…
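The three-field layout is easy to load into a lookup table. The sketch below assumes whitespace-separated fields and reads `=` in the lemma column as "lemma identical to the word-form", which is how the example appears to use it; `read_lexicon` is a hypothetical helper, and the sample is a subset of the lines on the slide.

```python
from collections import defaultdict

# A few lines from the slide's example; fields are assumed to be whitespace-separated.
LEXICON = """\
abeced     abeceda  Ncfdg
abeced     abeceda  Ncfpg
abeceda    =        Ncfsn
abecede    abeceda  Ncfpa
abecede    abeceda  Ncfpn
abecede    abeceda  Ncfsg
"""


def read_lexicon(lines):
    """Map each word-form to its list of (lemma, MSD) readings."""
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        word_form, lemma, msd = fields
        if lemma == "=":
            # Assumed convention: "=" means the lemma equals the word-form.
            lemma = word_form
        lexicon[word_form].append((lemma, msd))
    return dict(lexicon)


lexicon = read_lexicon(LEXICON.splitlines())
for form, readings in lexicon.items():
    print(form, readings)
```

Grouping by word-form makes the ambiguity visible: `abecede` alone has three possible MSDs, which is exactly what a tagger has to resolve in context.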

Page 16: 3. The “1984” corpus

• Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr, …))
• Structurally annotated
• Sentence aligned with English
• Words annotated with lemma and MSD
• Encoded in TEI P4 (XML)

Page 17: Example linguistic encoding

<text id="Osl." lang="sl">
  <body>
    <div type="part" id="Osl.1">
      <div type="chapter" id="Osl.1.2">
        <p id="Osl.1.2.2">
          <s id="Osl.1.2.2.1">
            <w lemma="biti" ana="Vcps-sma">Bil</w>
            <w lemma="biti" ana="Vcip3s--n">je</w>
            <w lemma="jasen" ana="Afpmsnn">jasen</w>
            <c>,</c>
            <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
            <w lemma="aprilski" ana="Aopmsn">aprilski</w>
            <w lemma="dan" ana="Ncmsn">dan</w>
            <w lemma="in" ana="Ccs">in</w>
            <w lemma="ura" ana="Ncfpn">ure</w>
            <w lemma="biti" ana="Vcip3p--n">so</w>
            <w lemma="biti" ana="Vmps-pfa">bile</w>
            <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
            <c>.</c>
          </s>
          …

Context disambiguated lemmas and MSDs
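As an illustration of how this word-level annotation can be consumed, the following Python sketch turns one annotated sentence into (token, lemma, MSD) triples, roughly the form a tagger training script might expect. The sentence is a shortened copy of the one on the slide; `read_tokens` is a hypothetical helper, and giving punctuation `None` for lemma and MSD is a choice made for this sketch only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotated sentence from the slide.
SENTENCE = """<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
  <w lemma="aprilski" ana="Aopmsn">aprilski</w>
  <w lemma="dan" ana="Ncmsn">dan</w>
  <c>.</c>
</s>"""


def read_tokens(s_xml):
    """Return (token, lemma, MSD) triples in text order."""
    sentence = ET.fromstring(s_xml)
    triples = []
    for element in sentence:
        if element.tag == "w":
            # Words carry a disambiguated lemma and an MSD in the ana attribute.
            triples.append((element.text, element.get("lemma"), element.get("ana")))
        elif element.tag == "c":
            # Punctuation has no lemma or MSD in this encoding.
            triples.append((element.text, None, None))
    return triples


tokens = read_tokens(SENTENCE)
for token, lemma, msd in tokens:
    print(token, lemma, msd)
```

Note that `Bil` and `je` share the lemma `biti` but carry different MSDs, which is the kind of distinction the context-disambiguated annotation is meant to record.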

Page 18: Utility of MULTEXT-East LRs

• Specifications became, for some, the “national” standard
• Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
• A base dataset for further annotation and experiments:
  – Word-sense disambiguation
  – WordNet development and evaluation
  – Syntactic parser induction
• Teaching aid in HLT courses
• ~ 100 registered users
• As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian, Bosnian?

Page 19: Corpora using TEI+MULTEXT-East

• Reference corpus of Slovene: FIDA (100Mw), FIDA+ (600Mw) (+ other Sl. corpora)
• Croatian National Corpus: HNK (100Mw)
• Various Romanian corpora, …
• En-Sl parallel annotated corpus: SVEZ-IJS (10Mw)

Page 20: Conclusions

• TEI provides a rich and flexible infrastructure to encode parallel corpora: meta-data, corpus and document structure, alignment, linguistic analysis
• MULTEXT-East provides a harmonised and common infrastructure for word-level morphosyntactic descriptions
• Both have already been used for a number of corpora
• Maybe also for BKS?

Page 21: Thank you!