The FIDA & MULTEXT-East language resources, Tomaž Erjavec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, [email protected]

The FIDA & MULTEXT-East language resources

Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
[email protected], http://nl.ijs.si/et/

Gralis 2006
Institut für Slawistik der Universität Graz
2006-05-09

Overview

1. Background
2. FIDA: a reference corpus of Slovene
3. MULTEXT-East: morphosyntactic resources for Central and East-European languages
4. Other language resources for Slovene


Language Resources

Language resources (LRs) comprise three layers of data:
– corpora: mono- or multilingual, reference or specialised, … /variously annotated/
– lexica: vocabularies, morphosyntactic, syntactic, semantic, (ontologies)
– standards: linguistic and technical encoding

LRs, esp. corpora, are used for empirical language research:
– linguistic studies: (annotated) corpus + (sophisticated) search engine
– human language technology R&D: testing and training dataset


Part I. The FIDA corpus

Slovene reference corpus for linguistic studies


FIDA http://www.fida.net/

Joint project (1997-2000) of four partners, whose initials give the name FIDA:
– Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
– Institut Jožef Stefan: Tomaž Erjavec
– DZS: Simon Krek
– Amebis: Peter Holozan, Miro Romih

Financed by industry partners


Characteristics of FIDA

monolingual
synchronic
written language
reference
– representative
– balanced
annotated


Sizes

Total: 103,513,072 words in 29,177 texts
Avg. text length: 3,548 words

Largest texts:
– Leksikon DZS: 508,370 words
– 69 texts > 100,000 words

Smallest texts:
– 2,648 texts < 100 words
– 2 x <w>rezgrtshdrghgth4</w>


Time Composition

Oldest / most recent text: 1989 / 2000
Average date: 1997-02
Texts / words with unknown date: 3.94% / 8.28%


FIDA taxonomy: publication types

…
Ft.P.P.O (published)            95.72%
Ft.P.P.O.K (books)              22.71%
Ft.P.P.O.P (periodicals)        70.50%
Ft.P.P.O.P.C (newspaper)        46.59%
Ft.P.P.O.P.C.D (daily)          32.67%
Ft.P.P.O.P.C.T (weekly)         66.18%
Ft.P.P.O.P.C.V (multi-weekly)   17.74%
…


FIDA taxonomy: text types

Ft.Z (text type)                  99.47%
Ft.Z.N (non-fiction)              93.57%
Ft.Z.N.N (non-professional)       75.14%
Ft.Z.N.S (professional)           18.37%
Ft.Z.N.S.H (hum. & soc. sci.)     10.57%
Ft.Z.N.S.N (nat. & tech. sci.)     6.04%
Ft.Z.U (fiction)                   5.90%
Ft.Z.U.D (drama)                   0.10%
Ft.Z.U.P (poetry)                  0.17%
Ft.Z.U.R (prose)                   5.12%


Markup of FIDA

corpus elements annotated with meta-data (bibliographic, taxonomy)
text linguistically annotated
encoded according to international standards and recommendations
– technical: SGML, TEI P3
– linguistic: MULTEXT-East (MULTEXT, EAGLES)


Linguistic annotation


Accessibility

Exploitation by partners:
– DZS: new dictionaries
– Amebis: development of HLT
– Arts faculty: teaching
– IJS: research on HLT

Availability to the public:
– access via concordance engine by Amebis
– free access, but displays only a few hits
– possibility of academic licences

FIDA (web site) no longer maintained!


FIDA+ http://www.fidaplus.net/

FIDA Plus project:
– Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan
– DZS, Amebis
Financed by the ministry + ind. partners
Extend the corpus with
– Web materials
– spoken component
Better linguistic markup
Free concordances: up to 100 lines
Also possibility of licences


Concordancer
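To illustrate what a concordancer returns, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not the Amebis engine: the function name `kwic`, the `width` parameter and the `max_hits` cap (set to 100 only to echo the free-access limit mentioned for FIDA+) are all assumptions made for this illustration.

```python
def kwic(tokens, keyword, width=3, max_hits=100):
    """Return (left context, hit, right context) triples for a keyword.

    Hypothetical helper for illustration only; a real corpus engine
    would use an index instead of scanning the token list.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
            if len(hits) >= max_hits:  # stop once the display cap is reached
                break
    return hits


# Tokens of the Slovene "1984" sentence used later in the talk.
tokens = ["Bil", "je", "jasen", ",", "mrzel", "aprilski", "dan",
          "in", "ure", "so", "bile", "trinajst", "."]
for left, hit, right in kwic(tokens, "dan", width=2):
    print(f"{left:>20} [{hit}] {right}")
```

A search over an annotated corpus would match on lemma or MSD rather than on the surface form, which is what the annotation makes possible.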

Output


Extended searches


Corpus “Nova Beseda” http://bos.zrc-sazu.si/

being developed at the Institute for Slovene Language, ZRC SAZU (Primož Jakopin)
Web concordancer with no hit limit
now larger than FIDA but much less varied: fiction, Delo, DZ
not linguistically annotated


Part II. MULTEXT-East

multilingual morphosyntactic resources for HLT development


MULTEXT-East resources

MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK (Basic Language Resource Kit) for six languages:
– corpus encoding standardisation (TEI / CES)
– multilingual parallel, comparable, speech corpora
– morphosyntactic specifications (EAGLES / MULTEXT)
– (inflectional) lexicon
– annotated corpus
– language processing tools


History of MULTEXT-East resources

First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
Resources since 1998 available on the Web: http://nl.ijs.si/ME/
Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
Third release 2004: merge of first two releases, further languages
Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects


The Languages of MULTEXT-East

Germanic: English
Romance: Romanian
Baltic:
– Latvian
– Lithuanian
Finno-Ugric:
– Estonian
– Hungarian
Slavic:
– Russian (East Slavic)
– Czech (West Slavic)
– Slovene (South West Slavic)
– Resian (Slovene dialect)
– Croatian (South West Slavic)
– Serbian (South West Slavic)
– Bulgarian (South East Slavic)
In progress:
– Macedonian
– Persian


Version 3

Available on http://nl.ijs.si/ME/V3/
Some parts completely free, others free for research (Web licence)
Web pages give:
– extensive documentation
– bibliography list
– web licence form
– resource download


The MULTEXT morphosyntactic trinity

1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus


1. Morphosyntactic specifications

Based on EAGLES / MULTEXT
Define PoS, their attributes and values
The specs are a document containing:
– introduction
– common tables
– language particular sections
Written in LaTeX, converted to PDF & HTML
Derived XML/TEI encoding as feature structures
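As a rough illustration of how the specifications map a compact tag to a PoS with attributes and values, the Python sketch below decodes positional noun tags such as Ncfsn (the tags that appear in the Slovene lexicon example). The attribute table is a partial one written for this sketch and covers only the first four noun positions; the function name `decode_noun_msd` is hypothetical, and the authoritative tables are those in the MULTEXT-East specifications.

```python
# Partial, illustrative attribute table for Slovene nouns (category "N").
# Assumption: position 1 = Type, 2 = Gender, 3 = Number, 4 = Case.
# Consult the MULTEXT-East specifications for the authoritative tables.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd):
    """Expand a noun tag into a feature dictionary (hypothetical helper)."""
    if not msd or msd[0] != "N":
        raise ValueError("this sketch only handles noun tags: %r" % msd)
    features = {"PoS": "Noun"}
    # Each character after the category letter fills one attribute slot;
    # "-" marks an attribute that does not apply and is skipped.
    for (name, values), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code == "-":
            continue
        features[name] = values.get(code, "unknown (%s)" % code)
    return features


print(decode_noun_msd("Ncfsn"))
print(decode_noun_msd("Ncfdg"))
```

The same positional idea carries over to the other categories, each with its own attribute list; the feature-structure encoding mentioned above expresses the same information explicitly as attribute-value pairs.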



Example common table


Example language specific section

table (shows only categories actually used)
notes
combinations
lexicon
for Slovene (FIDA): localisation of category names


Morphosyntactic Complexity


2. The lexica

Medium size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
~ all word-forms of ca. 15,000 lemmas
Lexical entry is composed of three fields:
– the word-form: the inflected form of the word
– the lemma: the base-form of the word
– the morphosyntactic description (MSD)


Example: Slovene lexicon

abeced      abeceda   Ncfdg
abeced      abeceda   Ncfpg
abeceda     =         Ncfsn
abecedah    abeceda   Ncfdl
abecedah    abeceda   Ncfpl
abecedam    abeceda   Ncfpd
abecedama   abeceda   Ncfdd
abecedama   abeceda   Ncfdi
abecedami   abeceda   Ncfpi
abecede     abeceda   Ncfpa
abecede     abeceda   Ncfpn
abecede     abeceda   Ncfsg
abecedi     abeceda   Ncfda
abecedi     abeceda   Ncfdn
…
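A lexicon in this three-field layout is easy to load programmatically. The Python sketch below assumes whitespace-separated fields and reads "=" in the lemma field as "lemma identical to the word-form", as the abeceda row suggests; the function names `parse_lexicon` and `analyses_by_form` are invented for this illustration.

```python
from collections import defaultdict

# A few rows from the Slovene example, with tab-separated fields (assumed).
SAMPLE = """\
abeced\tabeceda\tNcfdg
abeceda\t=\tNcfsn
abecede\tabeceda\tNcfpa
abecede\tabeceda\tNcfsg
"""


def parse_lexicon(text):
    """Return a list of (word-form, lemma, MSD) triples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        form, lemma, msd = line.split()
        if lemma == "=":  # assumption: "=" abbreviates lemma == word-form
            lemma = form
        entries.append((form, lemma, msd))
    return entries


def analyses_by_form(entries):
    """Group all (lemma, MSD) readings under each word-form."""
    index = defaultdict(list)
    for form, lemma, msd in entries:
        index[form].append((lemma, msd))
    return dict(index)


entries = parse_lexicon(SAMPLE)
index = analyses_by_form(entries)
print(index["abecede"])
```

Grouping by word-form makes the ambiguity visible: a form such as abecede has several possible MSDs, and choosing among them in context is the job of a tagger.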



Lexicon sizes


3. The “1984” corpus

Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)


Example linguistic encoding

<text id="Osl." lang="sl">
 <body>
  <div type="part" id="Osl.1">
   <div type="chapter" id="Osl.1.2">
    <p id="Osl.1.2.2">
     <s id="Osl.1.2.2.1">
      <w lemma="biti" ana="Vcps-sma">Bil</w>
      <w lemma="biti" ana="Vcip3s--n">je</w>
      <w lemma="jasen" ana="Afpmsnn">jasen</w>
      <c>,</c>
      <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
      <w lemma="aprilski" ana="Aopmsn">aprilski</w>
      <w lemma="dan" ana="Ncmsn">dan</w>
      <w lemma="in" ana="Ccs">in</w>
      <w lemma="ura" ana="Ncfpn">ure</w>
      <w lemma="biti" ana="Vcip3p--n">so</w>
      <w lemma="biti" ana="Vmps-pfa">bile</w>
      <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
      <c>.</c>
     </s>
     …

Sentence alignment & context disambiguated lemmas and MSDs
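The encoding above can be read with any XML parser. The following Python sketch uses the standard library's xml.etree.ElementTree on a shortened copy of the sentence element; because the slide excerpt is truncated, the fragment here is restricted to one complete <s> element so that it is well-formed. The function name `read_sentence` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, self-contained copy of the sentence from the slide.
FRAGMENT = """
<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
  <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
</s>
"""


def read_sentence(xml_text):
    """Return the sentence id and a list of (token, lemma, MSD) triples.

    Punctuation (<c>) carries no lemma or MSD, so those slots are None.
    """
    s = ET.fromstring(xml_text)
    tokens = []
    for el in s:
        if el.tag == "w":
            tokens.append((el.text, el.get("lemma"), el.get("ana")))
        elif el.tag == "c":
            tokens.append((el.text, None, None))
    return s.get("id"), tokens


sentence_id, tokens = read_sentence(FRAGMENT)
print(sentence_id)
for token, lemma, msd in tokens:
    print(token, lemma, msd)
```

The sentence id is what the alignment with the English text refers to, so keeping it alongside the tokens is enough to join the Slovene sentence with its aligned counterpart.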



Quantifying the corpus


Utility of MULTEXT-East LRs

Specifications became, for some, the “national” standard
Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
– Word-sense disambiguation
– WordNet development and evaluation
– Syntactic parser induction
Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian


LRs @ JSI http://nl.ijs.si/nl.html#Resource

Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS


Overview of Slovene LRs and services @ Slovenian Language Technologies Society http://nl.ijs.si/sdjt/


Thank you!
