The FIDA & MULTEXT-East language resources
Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
tomaz.erjavec@ijs.si, http://nl.ijs.si/et/
Gralis 2006
Institut für Slawistik der Universität Graz
2006-05-09


Overview

1. Background
2. FIDA: a reference corpus of Slovene
3. MULTEXT-East: morphosyntactic resources for Central and East-European languages
4. Other language resources for Slovene

Language Resources

LRs comprise three layers of data:
  – corpora: mono- or multilingual, reference or specialised, … /variously annotated/
  – lexica: vocabularies, morphosyntactic, syntactic, semantic, (ontologies)
  – standards: linguistic and technical encoding
LRs, esp. corpora, are used for empirical language research:
  – linguistic studies: (annotated) corpus + (sophisticated) search engine
  – human language technology R&D: testing and training dataset

Part I. The FIDA corpus

Slovene reference corpus for linguistic studies

FIDA
http://www.fida.net/

Joint project (1997-2000) of:
  – Filozofska fakulteta: Vojko Gorjanc, Marko Stabej, Špela Vintar
  – Institut Jožef Stefan: Tomaž Erjavec
  – DZS: Simon Krek
  – Amebis: Peter Holozan, Miro Romih
Financed by the industry partners

Characteristics of FIDA

monolingual
synchronic
written language
reference
  – representative
  – balanced
annotated

Sizes

Total: 103,513,072 words in 29,177 texts
Avg. text length: 3,548 words
Largest texts:
  – Leksikon DZS: 508,370 words
  – 69 texts > 100,000 words
Smallest texts:
  – 2,648 texts < 100 words
  – 2 x <w>rezgrtshdrghgth4</w>
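The average text length follows directly from the two totals on this slide. A minimal sketch of the arithmetic (the function name is only illustrative):

```python
# Figures quoted on the "Sizes" slide.
total_words = 103_513_072
total_texts = 29_177


def average_text_length(words: int, texts: int) -> int:
    """Mean number of words per text, rounded to the nearest whole word."""
    return round(words / texts)


print(average_text_length(total_words, total_texts))
```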

Time Composition

Oldest/most recent text: 1989/2000
Average date: 1997-02
Texts/Words with unknown date: 3.94%/8.28%

FIDA taxonomy: publication types

…
Ft.P.P.O (published)             95.72%
Ft.P.P.O.K (books)               22.71%
Ft.P.P.O.P (periodicals)         70.50%
Ft.P.P.O.P.C (newspaper)         46.59%
Ft.P.P.O.P.C.D (daily)           32.67%
Ft.P.P.O.P.C.T (weekly)          66.18%
Ft.P.P.O.P.C.V (multi-weekly)    17.74%
…

FIDA taxonomy: text types

Ft.Z (text type)                 99.47%
Ft.Z.N (non-fiction)             93.57%
Ft.Z.N.N (non-professional)      75.14%
Ft.Z.N.S (professional)          18.37%
Ft.Z.N.S.H (hum. & soc. sci.)    10.57%
Ft.Z.N.S.N (nat. & tech. sci.)    6.04%
Ft.Z.U (fiction)                  5.90%
Ft.Z.U.D (drama)                  0.10%
Ft.Z.U.P (poetry)                 0.17%
Ft.Z.U.R (prose)                  5.12%
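The dotted taxonomy codes form a hierarchy: dropping the last component of a code gives its parent category. The sketch below uses the text-type shares from this slide to show how much of each category is not covered by its listed subcategories. The helper names (`parent`, `children`, `unclassified`) are hypothetical and not part of FIDA.

```python
from typing import List, Optional

# Shares copied from the text-type slide (percent of the corpus).
shares = {
    "Ft.Z": 99.47,
    "Ft.Z.N": 93.57,
    "Ft.Z.N.N": 75.14,
    "Ft.Z.N.S": 18.37,
    "Ft.Z.N.S.H": 10.57,
    "Ft.Z.N.S.N": 6.04,
    "Ft.Z.U": 5.90,
    "Ft.Z.U.D": 0.10,
    "Ft.Z.U.P": 0.17,
    "Ft.Z.U.R": 5.12,
}


def parent(code: str) -> Optional[str]:
    """Return the code one level up, or None if there is no dot left."""
    head, sep, _ = code.rpartition(".")
    return head if sep else None


def children(code: str) -> List[str]:
    """Listed categories whose direct parent is `code`."""
    return [c for c in shares if parent(c) == code]


def unclassified(code: str) -> float:
    """Share of `code` not accounted for by its listed subcategories."""
    return round(shares[code] - sum(shares[c] for c in children(code)), 2)


for code in ("Ft.Z.N", "Ft.Z.N.S", "Ft.Z.U"):
    print(code, children(code), unclassified(code))
```

For fiction (Ft.Z.U), for instance, drama, poetry and prose together cover 5.39% of the corpus, which leaves 0.51% of the 5.90% without a listed subtype.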

Markup of FIDA

corpus elements annotated with meta-data (bibliographic, taxonomy)
text linguistically annotated
encoded according to international standards and recommendations
  – technical: SGML, TEI P3
  – linguistic: MULTEXT-East (MULTEXT, EAGLES)

Linguistic annotation

Accessibility

Exploitation by partners:
  – DZS: new dictionaries
  – Amebis: development of HLT
  – Arts faculty: teaching
  – IJS: research on HLT
Availability to the public:
  – access via concordance engine by Amebis
  – free access, but displays only a few hits
  – possibility of academic licences
FIDA (web site) no longer maintained!

FIDA+
http://www.fidaplus.net/

FIDA Plus project:
  – Filozofska fakulteta, Fakulteta za družbene vede, Institut Jožef Stefan
  – DZS, Amebis
Financed by the ministry + ind. partners
Extend the corpus with
  – Web materials
  – spoken component
Better linguistic markup
Free concordances: up to 100 lines
Also possibility of licences

Concordancer

Output

Extended searches

Corpus “Nova Beseda”
http://bos.zrc-sazu.si/

being developed at the Institute for Slovene language, ZRC SAZU (Primož Jakopin)
Web concordancer with no hit limit
now larger than FIDA but much less varied: fiction, Delo, DZ
not linguistically annotated

Part II. MULTEXT-East

multilingual morphosyntactic resources for HLT development

MULTEXT-East resources

MULTEXT-East: Copernicus Joint Project COP 106 (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK for six languages:
  – corpus encoding standardisation (TEI / CES)
  – multilingual parallel, comparable, speech corpora
  – morphosyntactic specifications (EAGLES / MULTEXT)
  – (inflectional) lexicon
  – annotated corpus
  – language processing tools

History of MULTEXT-East resources

First release 1998 on TELRI CD-ROM Vol II: already extended with new languages
Resources since 1998 available on the Web: http://nl.ijs.si/ME/
Second release 2002 in scope of EU CONCEDE: re-encoding in XML/TEI, harmonisation
Third release 2004: merge of first two releases, further languages
Work (indirectly) supported by: TELRI, CONCEDE, NSF grant, bi-lateral projects

The Languages of MULTEXT-East

Germanic: English
Romance: Romanian
Baltic:
  – Latvian
  – Lithuanian
Finno-Ugric:
  – Estonian
  – Hungarian
Slavic:
  – Russian (East Slavic)
  – Czech (West Slavic)
  – Slovene (South West Slavic)
  – Resian (Slovene dialect)
  – Croatian (South West Slavic)
  – Serbian (South West Slavic)
  – Bulgarian (South East Slavic)
In progress: Macedonian, Persian

Version 3

Available on http://nl.ijs.si/ME/V3/
Some parts completely free, others free for research: Web licence
The Web page gives:
  – extensive documentation
  – bibliography list
  – web licence form
  – resource download

The MULTEXT morphosyntactic trinity

1. MULTEXT-East morphosyntactic specifications
2. MULTEXT-East morphosyntactic lexica
3. MULTEXT-East morphosyntactically annotated "1984" corpus

1. Morphosyntactic specifications

Based on EAGLES / MULTEXT
Define PoS, their attributes and values
The specs are a document containing:
  – introduction
  – common tables
  – language particular sections
Written in LaTeX → PDF & HTML
Derived XML/TEI encoding as feature structures
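A morphosyntactic description (MSD) built from these specifications is a positional string: the first character gives the part of speech and each following character gives the value of one attribute, with a hyphen for an attribute that does not apply. The sketch below decodes noun MSDs such as Ncfsn. The attribute table is an illustrative subset for Slovene nouns written from memory, so treat it as an assumption and check it against the published specifications. `decode_noun_msd` is a hypothetical helper, not part of the MULTEXT-East distribution.

```python
# Assumed attribute order and value codes for Slovene nouns
# (illustrative subset; verify against the MULTEXT-East specifications).
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Expand a noun MSD into attribute/value pairs."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"Category": "Noun"}
    for (name, values), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code == "-":  # hyphen: attribute not applicable
            continue
        # Unknown codes are passed through unchanged rather than guessed.
        features[name] = values.get(code, code)
    return features


print(decode_noun_msd("Ncfsn"))
```

The same positional idea carries over to the other parts of speech, each with its own attribute list in the specifications.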

Example common table

Example language-specific section

table (shows only categories actually used)
notes
combinations
lexicon
for Slovene (FIDA): localisation of category names

Morphosyntactic Complexity

2. The lexica

Medium-size morphosyntactic lexica
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
~ all word-forms of ca. 15,000 lemmas
A lexical entry is composed of three fields:
  – the word-form: the inflected form of the word
  – the lemma: the base-form of the word
  – the morphosyntactic description (MSD)

Example: Slovene lexicon

abeced      abeceda   Ncfdg
abeced      abeceda   Ncfpg
abeceda     =         Ncfsn
abecedah    abeceda   Ncfdl
abecedah    abeceda   Ncfpl
abecedam    abeceda   Ncfpd
abecedama   abeceda   Ncfdd
abecedama   abeceda   Ncfdi
abecedami   abeceda   Ncfpi
abecede     abeceda   Ncfpa
abecede     abeceda   Ncfpn
abecede     abeceda   Ncfsg
abecedi     abeceda   Ncfda
abecedi     abeceda   Ncfdn
…
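Entries in this three-field format are straightforward to read programmatically. The sketch below assumes whitespace-separated fields and treats "=" in the lemma field as "lemma identical to the word-form", as in the abeceda line above. `Entry` and `parse_entry` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    form: str   # inflected word-form
    lemma: str  # base form
    msd: str    # morphosyntactic description


def parse_entry(line: str) -> Entry:
    """Split one lexicon line into its three fields."""
    form, lemma, msd = line.split()
    if lemma == "=":  # lemma is the same as the word-form
        lemma = form
    return Entry(form, lemma, msd)


sample = """\
abeced abeceda Ncfdg
abeceda = Ncfsn
abecede abeceda Ncfsg
"""

entries = [parse_entry(line) for line in sample.splitlines() if line.strip()]
for entry in entries:
    print(entry)
```

A word-form such as abecede appears on several lines of the example because the same string realises several MSDs; a lookup table keyed on the word-form therefore has to map to a list of (lemma, MSD) pairs.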

Lexicon sizes

3. The “1984” corpus

Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr,…))
Structurally annotated
Sentence aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)

Example linguistic encoding

<text id="Osl." lang="sl">
 <body>
  <div type="part" id="Osl.1">
   <div type="chapter" id="Osl.1.2">
    <p id="Osl.1.2.2">
     <s id="Osl.1.2.2.1">
      <w lemma="biti" ana="Vcps-sma">Bil</w>
      <w lemma="biti" ana="Vcip3s--n">je</w>
      <w lemma="jasen" ana="Afpmsnn">jasen</w>
      <c>,</c>
      <w lemma="mrzel" ana="Afpmsnn">mrzel</w>
      <w lemma="aprilski" ana="Aopmsn">aprilski</w>
      <w lemma="dan" ana="Ncmsn">dan</w>
      <w lemma="in" ana="Ccs">in</w>
      <w lemma="ura" ana="Ncfpn">ure</w>
      <w lemma="biti" ana="Vcip3p--n">so</w>
      <w lemma="biti" ana="Vmps-pfa">bile</w>
      <w lemma="trinajst" ana="Mcnpnl">trinajst</w>
      <c>.</c>
     </s>
     …

Sentence alignment &
context-disambiguated lemmas and MSDs
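Because each word carries its lemma and MSD as attributes, the annotation can be read with any XML parser. A minimal sketch using Python's standard library follows. It works on a single well-formed sentence cut from the example above, not on the full corpus file, and `tokens` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# One sentence fragment from the example, closed so that it is well-formed.
SAMPLE = """<s id="Osl.1.2.2.1">
  <w lemma="biti" ana="Vcps-sma">Bil</w>
  <w lemma="biti" ana="Vcip3s--n">je</w>
  <w lemma="jasen" ana="Afpmsnn">jasen</w>
  <c>,</c>
</s>"""


def tokens(sentence_xml: str) -> list:
    """Return (word-form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(sentence_xml)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


for form, lemma, msd in tokens(SAMPLE):
    print(form, lemma, msd)
```

Punctuation sits in <c> elements and is skipped here; iterating over all children of <s> instead would keep it in sequence with the words.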

Quantifying the corpus

Utility of MULTEXT-East LRs

Specifications became, for some, the “national” standard
Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
  – Word-sense disambiguation
  – WordNet development and evaluation
  – Syntactic parser induction
Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages: Resian, Croatian, Macedonian, Persian

LRs @ JSI
http://nl.ijs.si/nl.html#Resource

Also ours: VAYNA, GORE, sloWNet
Contributors to: FIDA, DSI, FDV, JRC-ACQUIS

Overview of Slovene LRs and services @ Slovenian Language Technologies Society
http://nl.ijs.si/sdjt/

Thank you!