Bilingual and Multilingual Electronic Language Resources. Ivan A. Derzhanski ([email protected]), Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Department of Mathematical Linguistics


Page 1: Ivan Derganskyi

Bilingual and Multilingual Electronic Language Resources

Ivan A. Derzhanski ([email protected])

Institute of Mathematics and Informatics, Bulgarian Academy of Sciences
Department of Mathematical Linguistics

Page 2: Ivan Derganskyi

2

Resources for language engineering

• lexical databases (LDBs)
• electronic dictionaries
  – monolingual
  – bilingual and multilingual
• corpora

Page 3: Ivan Derganskyi

3

Corpus annotation

• Def: the process of adding linguistic information in electronic form to a text corpus.
• Most common types:
  – morphosyntactic (grammatical, PoS) annotation
  – lemma annotation

Page 4: Ivan Derganskyi

4

PoS tagging

• Def: the task of labelling each word in a sequence of words with its appropriate part of speech.
• Ambiguity:
  – вероятно ‘probable (sg. n.), probably’
    • вероятно → PoS: adjective, Gender: neuter, Number: singular, Definiteness: no
    • вероятно → PoS: adverb, Type: adjectival
• Def (tagset): the set of PoS tags used for the annotation
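The ambiguity above can be pictured as one word form mapping to several candidate analyses, from which a tagger must pick one in context. The following is a minimal sketch under that reading; the `Analysis` class, the `ANALYSES` table and the function names are illustrative assumptions, not part of any project tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One candidate reading of a word form: a PoS plus feature/value pairs."""
    pos: str
    features: tuple = ()


# Toy table holding only the example from the slide (hypothetical structure).
ANALYSES = {
    "вероятно": [
        Analysis("adjective", (("Gender", "neuter"),
                               ("Number", "singular"),
                               ("Definiteness", "no"))),
        Analysis("adverb", (("Type", "adjectival"),)),
    ],
}


def candidate_tags(word):
    """Return every analysis listed for the word form (case-insensitive)."""
    return ANALYSES.get(word.lower(), [])


def is_ambiguous(word):
    """A form is PoS-ambiguous if its analyses disagree on the part of speech."""
    return len({a.pos for a in candidate_tags(word)}) > 1


readings = [a.pos for a in candidate_tags("Вероятно")]
```

A disambiguator would then choose among `candidate_tags(word)` using the neighbouring words; that step is not sketched here.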

Page 5: Ivan Derganskyi

5

Electronic corpora of Bulgarian

The first two electronic corpora of the Bulgarian language were created in the framework of two EU projects on language technologies:

• MULTEXT-East (http://nl.ijs.si/IME);

• CONCEDE.

Page 6: Ivan Derganskyi

6

MULTEXT-East

The project MULTEXT-East (Multilingual Text Tools and Corpora for Eastern and Central European Languages, 1995–1997) produced resources for six Central and Eastern European languages:

• Bulgarian,
• Slovene,
• Czech,
• Romanian,
• Hungarian,
• Estonian,
as well as English (as the ‘hub language’ of the project).

Page 7: Ivan Derganskyi

7

MULTEXT-East (continued)

The extended results of the project were made available in 1998, first on CD-ROM and then via TRACTOR, the TELRI (Trans-European Language Resources Infrastructure) Research Archive of Computational Tools and Resources.

Version 3 (2004) includes material in five more languages (Croatian, Lithuanian, Resian, Russian, Serbian).

Page 8: Ivan Derganskyi

8

MULTEXT-East (continued)

The corpus of Bulgarian, developed according to the methodology and requirements of the project, contains three parts:

• Bulgarian Language-Specific Resources,
• a Parallel Annotated 1984 Corpus,
• a Comparative Corpus.

Page 9: Ivan Derganskyi

9

The Parallel Annotated 1984 Corpus

The Parallel Annotated 1984 Corpus consists of

• the Bulgarian translation of George Orwell’s novel Nineteen Eighty-Four (approximately 87,000 words);

• Bulgarian-English aligned texts.

Page 10: Ivan Derganskyi

10

The Parallel Annotated 1984 Corpus (continued)

The material was formatted as a well-structured, lemmatised, Corpus Encoding Standard (CES) corpus (Ide, 1998).

That is, each word form is accompanied by the corresponding lemma and grammatical information that constitute its standard lexical description.

Page 11: Ivan Derganskyi

11

The Parallel Annotated 1984 Corpus (continued)

The lexical descriptions for Bulgarian are in line with the terminology and the methodology used by MULTEXT.

The corpus was marked and validated for alignment and sentence boundaries.

Page 12: Ivan Derganskyi

12

The Comparative Corpus

The Comparative Corpus contains two subsets of about 100,000 words each: fiction (excerpts from two contemporary Bulgarian novels) and newspaper text (excerpts).

The data are comparable across the six languages in the number and size of texts.

Page 13: Ivan Derganskyi

13

The Comparative Corpus (continued)

The entire multilingual Comparative Corpus was prepared in CES (Corpus Encoding Standard) format, manually or using ad-hoc tools, and was automatically annotated for tokenisation, sentence boundaries, and part of speech using the project tools.

Page 14: Ivan Derganskyi

14

Bulgarian Language-Specific Resources

The Bulgarian Language-Specific Resources are data required by the segmentation procedure, morphological analyser and disambiguator.

This includes a lexical list and lists of special tokens (frequent abbreviations and names, titles, patterns for proper names, etc.) with their types.

Page 15: Ivan Derganskyi

15

The lexicon

The lexical list (lexicon) contains about 242,000 lemmata.

Each lemma in the lexicon is associated with its part(s) of speech and lexical characteristics.

156,000 morphosyntactic descriptions were provided for Bulgarian.

Page 16: Ivan Derganskyi

16

The lexicon (continued)

Each lexicon entry includes the following information:

• word form;
• lemma;
• part of speech;
• further morphological information (feature values).
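The four fields listed above map naturally onto a small record type. The sketch below assumes a hypothetical tab-separated line layout with `Feature=value` pairs; the actual file format of the project lexicon is not given in these slides and may differ.

```python
from typing import Dict, NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line: word form, lemma, PoS and feature values."""
    word_form: str
    lemma: str
    pos: str
    features: Dict[str, str]


def parse_entry(line: str) -> LexiconEntry:
    """Parse 'form<TAB>lemma<TAB>pos<TAB>Feat=val;Feat=val' (assumed layout)."""
    word_form, lemma, pos, feats = line.rstrip("\n").split("\t")
    features = dict(item.split("=", 1) for item in feats.split(";") if item)
    return LexiconEntry(word_form, lemma, pos, features)


# Illustrative line for the definite form of the noun used later in the slides.
entry = parse_entry(
    "целта\tцел\tnoun\tGender=feminine;Number=singular;Definiteness=yes"
)
```

Keeping the lemma next to every word form is what allows the same table to serve both analysis (form to lemma) and generation (lemma to forms).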

Page 17: Ivan Derganskyi

17

The lexicon (continued)

• part of speech
  – the traditional set of 10 parts of speech
  – punctuation
  – abbreviations
  – numbers written in digits
  – unidentified objects (residuals)
• same system for all languages of the project (though different interpretations)

Page 18: Ivan Derganskyi

18

Lexicography ↔ Linguistic Theory

• lexicography requires linguistic theory (analysis, methodology)
  – but also serves as a touchstone, because what can be represented must have been studied, understood, formalised to a sufficient extent
• lexicography supports linguistic theory (data for research)

Page 19: Ivan Derganskyi

19

Dictionary ↔ Grammar

• mutually complementary, mutually indispensable components of integrated linguistic description
• lexicographic type (unification)
• lexicographic portrait (individualisation)

Page 20: Ivan Derganskyi

20

Computational lexicography

digital (machine-readable) dictionaries:
• digital versions of traditional dictionaries for human use
• computer dictionaries as components of information systems

Page 21: Ivan Derganskyi

21

Advantages of digital dictionaries

• size not an issue
  – potential for infinite growth in depth and breadth (a dictionary needn’t be small, medium or large by design)
  – many purposes served (explanatory dictionary, grammatical dictionary, dictionary of synonyms, antonyms, phraseology, etymology, etc., all as one integrated system)

Page 22: Ivan Derganskyi

22

Advantages of digital dictionaries (continued)

• easy update possible, incl. by continued distributed collective effort (wiki-style)

• flexible search (incl. bidirectional) and presentation of results

• audio, video and other media can be added

• requirement: definitions must be simpler, but at the same time more comprehensive

Page 23: Ivan Derganskyi

23

Dictionary (definition)

• an aggregate of linguistic units (forms)
  – established in the language system as represented by the usage of a certain language community,
  – put in a predetermined order and
  – accompanied by formal (orthographic, phonetic, grammatical, etymological, stylistic, etc.) and semantic information
    • on the linguistic units themselves or
    • on the denoted entities or phenomena,

Page 24: Ivan Derganskyi

24

Dictionary (definition, continued)

• an aggregate of linguistic units (forms)
  – put in a predetermined order and
  – accompanied by formal and semantic information,
  – arranged and ordered in a certain way within the entry,
• … almost always supplemented by auxiliary material
  – introduction, criteria, sources, list of abbreviations, structure of the dictionary entry, grammar tables

Page 25: Ivan Derganskyi

25

Structure of the dictionary entry

• register part (on the left)
• interpretation part (on the right)
• all the register parts together form the dictionary’s register
• the set of rules and methods used when composing the entries forms the metalanguage

Page 26: Ivan Derganskyi

26

The register

• designing the register (needn’t be a one-time event in the case of an electronic dictionary)
  – from other dictionaries
  – from a corpus of texts

• editing the register: eliminating obsolete words, arbitrary neologisms, suspected non-words

• automatic extension: productive derivation made into procedures

Page 27: Ivan Derganskyi

27

Structural aspects of lexicography

• macrostructure: nature and purpose of the dictionary, place within the typology of dictionaries, choice of register, choice of illustrations, order, metalanguage

• mediostructure: relations between language units, e.g., derivation, families of words

• microstructure: setup of the entry, hierarchy of meanings; requirements: standardisation, economy, simplicity, completeness

Page 28: Ivan Derganskyi

28

An example of a lexical entry: CONCEDE Bulgarian dictionary

<entry>
  <hw>цел</hw>
  <gen>ж.</gen>
  <struc type="Sense" n="1">
    <def>Това, към което е насочена някаква дейност, към което някой се стреми; умисъл, намерение.</def>
    <eg><q>С каква цел отиваш в града?</q></eg>
    <eg><q>Вървя без цел.</q></eg>
    <eg><q>Постигнах целта си.</q></eg>
    <eg><q>Целта оправдава средствата.</q></eg>
  </struc>
  <struc type="Sense" n="2">
    <def>Предмет или точка, в която някой стреля, към която е насочено определено действие, движение, удар и под.; прицел.</def>
    <eg><q>Улучих целта.</q></eg>
  </struc>
  <struc type="Phrases">
    <struc type="Phrase" n="1">
      <orth>Имам (нямам) [за] цел.</orth>
      <def>стремя се (не се стремя) към нещо.</def>
      <eg><q>Нямам за цел да му навредя.</q></eg>
    </struc>
    <struc type="Phrase" n="2">
      <orth>Попадам в целта.</orth>
      <def>улучвам, умервам.</def>
    </struc>
  </struc>
  <etym><lang>нем.</lang>&gt;<lang>рус.</lang></etym>
</entry>
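Because the entry is well-formed XML, a standard parser can read it. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the entry; the element names (`hw`, `gen`, `struc`, `def`, `eg`, `q`, `etym`, `lang`) come from the example itself, while the `summarise` function and the shape of its result are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the entry shown on the slide (phrases omitted).
ENTRY_XML = """
<entry>
  <hw>цел</hw>
  <gen>ж.</gen>
  <struc type="Sense" n="1">
    <def>Това, към което е насочена някаква дейност, към което някой се стреми; умисъл, намерение.</def>
    <eg><q>Вървя без цел.</q></eg>
    <eg><q>Постигнах целта си.</q></eg>
  </struc>
  <struc type="Sense" n="2">
    <def>Предмет или точка, в която някой стреля, към която е насочено определено действие, движение, удар и под.; прицел.</def>
    <eg><q>Улучих целта.</q></eg>
  </struc>
  <etym><lang>нем.</lang>&gt;<lang>рус.</lang></etym>
</entry>
"""


def summarise(entry_xml):
    """Collect head word, gender, numbered senses and etymology languages."""
    entry = ET.fromstring(entry_xml)
    senses = []
    # Only top-level <struc type="Sense"> children are senses.
    for struc in entry.findall("struc[@type='Sense']"):
        senses.append({
            "n": int(struc.get("n")),
            "definition": struc.findtext("def"),
            "examples": [q.text for q in struc.findall("eg/q")],
        })
    return {
        "headword": entry.findtext("hw"),
        "gender": entry.findtext("gen"),
        "senses": senses,
        "etymology": [lang.text for lang in entry.findall("etym/lang")],
    }


summary = summarise(ENTRY_XML)
```

The nested `struc` elements are what make the hierarchy of meanings (the microstructure) machine-readable: a sense, a phrase and a phrase group are all the same element distinguished by `type`.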

Page 29: Ivan Derganskyi

29

An example of a lexical entry (zoom, part 1: head word, gender)

<entry>

<hw>цел</hw>

<gen>ж.</gen>

[…]

</entry>

Page 30: Ivan Derganskyi

30

An example of a lexical entry (zoom, part 2)

<struc type="Sense" n="1">
  <def>Това, към което е насочена някаква дейност, към което някой се стреми; умисъл, намерение.</def>
  <eg><q>С каква цел отиваш в града?</q></eg>
  <eg><q>Вървя без цел.</q></eg>
  <eg><q>Постигнах целта си.</q></eg>
  <eg><q>Целта оправдава средствата.</q></eg>
</struc>

Page 31: Ivan Derganskyi

31

An example of a lexical entry (zoom, part 3)

<struc type="Sense" n="2">
  <def>Предмет или точка, в която някой стреля, към която е насочено определено действие, движение, удар и под.; прицел.</def>
  <eg><q>Улучих целта.</q></eg>
</struc>

Page 32: Ivan Derganskyi

32

An example of a lexical entry (zoom, part 4)

<struc type="Phrases">
  <struc type="Phrase" n="1">
    <orth>Имам (нямам) [за] цел.</orth>
    <def>стремя се (не се стремя) към нещо.</def>
    <eg><q>Нямам за цел да му навредя.</q></eg>
  </struc>
  <struc type="Phrase" n="2">
    <orth>Попадам в целта.</orth>
    <def>улучвам, умервам.</def>
  </struc>
</struc>

Page 33: Ivan Derganskyi

33

An example of a lexical entry (zoom, part 5: etymology)

<entry>

[…]

<etym><lang>нем.</lang>&gt;<lang>рус.</lang></etym>

</entry>

Page 34: Ivan Derganskyi

34

ABBYY Lingvo (Ru–It)

Page 35: Ivan Derganskyi

35

ABBYY Lingvo (Ru–Et)

цель

[m1][trn]eesmärk, märk, otstarve, siht[/trn][/m]
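The bracketed tags in this card delimit zones of the entry body. As a rough sketch, the equivalents can be pulled out of the `[trn]…[/trn]` zone with two regular expressions; the tags are those visible in the card above, while the function name and the comma-splitting rule are assumptions made for illustration.

```python
import re

# Non-greedy match of each translation zone in a card body.
TRN_ZONE = re.compile(r"\[trn\](.*?)\[/trn\]", re.DOTALL)
# Any remaining bracketed tag such as [m1], [/m], [i], [/i].
ANY_TAG = re.compile(r"\[/?\w+\]")


def translations(card_body):
    """Return the comma-separated equivalents found in the [trn] zones."""
    result = []
    for zone in TRN_ZONE.findall(card_body):
        plain = ANY_TAG.sub("", zone)
        result.extend(part.strip() for part in plain.split(",") if part.strip())
    return result


equivalents = translations("[m1][trn]eesmärk, märk, otstarve, siht[/trn][/m]")
```

Note that the card lists four Estonian equivalents with no indication of which sense each one covers, which is exactly the kind of information the order and structure of an entry are meant to carry.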

Page 36: Ivan Derganskyi

36

Why is order important?

Page 37: Ivan Derganskyi

37

Page 38: Ivan Derganskyi

38

Why is order important? (continued)

Ингредиенты ‘ingredients’: сахар ‘sugar’, глюкоза ‘glucose’, мука ‘flour’, милая ‘darling’, корица ‘cinnamon’, какао ‘cocoa’, сода ‘soda’, маргарин ‘margarine’

Page 39: Ivan Derganskyi

39

Why is order important? (continued)

Ингредиенты ‘ingredients’: бикарбонат натрия ‘sodium bicarbonate’, ароматы ‘aromas’, студень ‘aspic (jellied meat)’, молочный порошок ‘milk powder’, эмульгатор ‘emulsifier’

Page 40: Ivan Derganskyi

40

wash (En–Ru)

Page 41: Ivan Derganskyi

41

honey (En–Ru)

Page 42: Ivan Derganskyi

42

jelly (En–Ru)

Page 43: Ivan Derganskyi

43

Digital grammatical dictionaries

• modelling of inflexion
  – (essential for inflecting languages)
• word form ↔ lemma + grammatical meaning
  – built upon a formal model of inflexion: a division of the set of words into inflexional paradigmatic classes (non-intersecting subsets with algorithmically described rules)
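A toy version of such a model: each stem belongs to exactly one paradigm class, and the class lists the endings for each grammatical meaning, so the same tables serve both generation and analysis. The class label `V-am`, the table layout and the function names are hypothetical; the two verbs are taken from the Bulgarian–Polish dictionary sample in these slides, and only present-tense forms are covered.

```python
# Paradigm class -> {(person, number): ending}; present tense only.
PARADIGMS = {
    "V-am": {
        ("1", "sg"): "м",
        ("2", "sg"): "ш",
        ("3", "sg"): "",
        ("1", "pl"): "ме",
        ("2", "pl"): "те",
        ("3", "pl"): "т",
    },
}

# Stem -> (lemma, paradigm class). Each stem is in exactly one class.
LEXICON = {
    "претопява": ("претопявам", "V-am"),
    "претрива": ("претривам", "V-am"),
}


def generate(stem):
    """Lemma + grammatical meaning -> word forms, for one stem."""
    lemma, cls = LEXICON[stem]
    return {stem + ending: (lemma, person, number)
            for (person, number), ending in PARADIGMS[cls].items()}


def analyse(form):
    """Word form -> all (lemma, person, number) readings found in the tables."""
    readings = []
    for stem, (lemma, cls) in LEXICON.items():
        for (person, number), ending in PARADIGMS[cls].items():
            if form == stem + ending:
                readings.append((lemma, person, number))
    return readings


reading = analyse("претопяваш")
```

Adding a verb of the same class then costs one line in `LEXICON`, which is the practical benefit of describing inflexion by class rather than by listing every form.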

Page 44: Ivan Derganskyi

44

Bi- and multilingual dictionaries

translation:
• most general member(s) of the corresponding synset
• grammatical semantics (incl. valency, subcategorisation)
• pragmatic context (sublanguage of most frequent usage)

Page 45: Ivan Derganskyi

45

Bi- and multilingual dictionaries (continued)

bilingual dictionary:
• two integrated linguistic systems (explanatory dictionary, grammatical dictionary, dictionary of synonyms, of antonyms, of phraseology)
• complemented by
  – comparable monolingual corpora and
  – a parallel bilingual corpus and
• linked by an interface

Page 46: Ivan Derganskyi

46

Bi- and multilingual dictionaries (continued)

• Integrating a synonym and a translation linguistic system: EuroWordNet (an assembly of WordNets using a common ontology and indexing)

Page 47: Ivan Derganskyi

47

Bi- and multilingual dictionaries (continued)

• multilingual dictionary:
  – a set of pairs of bilingual dictionaries
  – interlingua
    • one of the target languages
    • an external natural language
    • an artificial but speakable language (e.g., Esperanto)
    • a semantic interlingua (a digital concept dictionary)
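The two designs differ in how many resources must be built: one bilingual dictionary per unordered pair of languages, against one mapping per language when everything is linked through an interlingua. The sketch below illustrates the semantic-interlingua option with a single concept; the concept identifier `GOAL`, the table layout and the function names are hypothetical, and the three words are the ones that appear in the example entries of these slides.

```python
from math import comb

# Concept dictionary: concept id -> {language code: word}.
CONCEPTS = {
    "GOAL": {"bg": "цел", "ru": "цель", "et": "eesmärk"},
}


def translate(word, source, target):
    """Translate via the concept layer; None if no concept links the two."""
    for lexicalisations in CONCEPTS.values():
        if lexicalisations.get(source) == word and target in lexicalisations:
            return lexicalisations[target]
    return None


def resources_needed(n_languages):
    """Bilingual pairs versus language-to-interlingua mappings."""
    return {"pairwise": comb(n_languages, 2), "interlingua": n_languages}


bg_to_et = translate("цел", "bg", "et")
cost_for_four = resources_needed(4)
```

For four languages this gives six bilingual dictionaries against four mappings, and the gap widens quadratically as languages are added.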

Page 48: Ivan Derganskyi

48

Plans

of the joint research project “Semantics and Contrastive linguistics with a focus on a bilingual electronic dictionary” between IMI—BAS and ISS—PAS:

• Bulgarian–Polish/Polish–Bulgarian dictionaries
• Bulgarian–Polish–Ukrainian dictionary
• Bulgarian–Polish–Ukrainian–Lithuanian …
• … more?

Page 49: Ivan Derganskyi

49

Bulgarian–Polish/Polish–Bulgarian dictionaries … on the basis of (1)

the most recent paper bilingual dictionaries (1987, 1988)
• volume ≈60 000 words
• already dated
• of questionable reliability to boot

Page 50: Ivan Derganskyi

50

Bulgarian–Polish/Polish–Bulgarian dictionaries … on the basis of (2)

a bilingual corpus (3 000 000 words envisaged) consisting of
• fiction
  – Polish to Bulgarian (easy to find)
  – Bulgarian to Polish (hard to find)
  – third-language original, translated into Bulgarian and Polish
• EU/EC documents
• texts in Bulgarian and Polish of similar sizes
  – excerpts from newspapers
  – literary works available on the Internet

Page 51: Ivan Derganskyi

51

Bulgarian–Polish dictionary (after OCR and proofreading)

претовар|я, -иш vp. v. претоварям
претоп|я, -иш vp. v. претапям, претопявам
претопява|м, -ш vi. przetapiać; przen. asymilować
претор, -и m hist. pretor m
преториан|ец, -ци m pretorianin m
преториански adi. pretoriański
I преточ|а, -иш vp. v. претакам
II преточ|а, -иш vp. v. II преточвам
I преточвам v. претакам
II преточва|м, -ш vi. ostrzyć nadmiernie
претрайва|м, -ш vi. v. претрая
претра|я, -еш vp. lud. przetrwać
претрива|м, -ш vi. przecierać, przecinać, przepiłowywać; ~м праговете wycieram (obijam) cudze progi
претри|я, -еш vp. v. претривам
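The head words in this sample use a compact notation: a vertical bar separates the invariant part from the ending, and a following `-xx` gives the ending of a second form. A minimal sketch of expanding that notation into full forms follows. The rule that a hyphenated ending is appended to the whole word when there is no bar is an assumption inferred from entries such as `претор, -и`, and homonym numerals (`I`, `II`) are not handled.

```python
def expand_headword(headword):
    """Expand 'stem|ending, -other' into the full forms it abbreviates."""
    first, *rest = [part.strip() for part in headword.split(",")]
    # Without a bar, partition leaves the whole word as the stem.
    stem, _bar, ending = first.partition("|")
    forms = [stem + ending]
    for part in rest:
        if part.startswith("-"):
            forms.append(stem + part[1:])   # replace the ending
        else:
            forms.append(part)              # a form written out in full
    return forms


verb_forms = expand_headword("претопява|м, -ш")
noun_forms = expand_headword("преториан|ец, -ци")
```

Expanded forms of this kind are what a search interface needs if users are to find an entry by typing an inflected form rather than the citation form.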

Page 52: Ivan Derganskyi

52

Bulgarian–Polish dictionary (after first round of markup)

[b]претовар|я, -иш[/b] [i]vp.[/i] v. [b]претоварям[/b]
[b]претоп|я, -иш[/b] [i]vp.[/i] v. [b]претапям, претопявам[/b]
[b]претопява|м, -ш[/b] [i]vi.[/i] przetapiać; [i]przen.[/i] asymilować
[b]претор, -и[/b] [i]m[/i] [i]hist.[/i] [b]pretor[/b] [i]m[/i]
[b]преториан|ец, -ци[/b] [i]m[/i] pretorianin [i]m[/i]
[b]преториански[/b] [i]adi.[/i] pretoriański
[b]I преточ|а, -иш[/b] [i]vp.[/i] v. [b]претакам[/b]
[b]II преточ|а, -иш[/b] [i]vp.[/i] v. [b]II преточвам[/b]
[b]I преточвам[/b] v. [b]претакам[/b]
[b]II преточва|м, -ш[/b] [i]vi.[/i] [b]ostrzyć nadmiernie[/b]
[b]претрайва|м, -ш[/b] [i]vi.[/i] v. [b]претрая[/b]
[b]претра|я, -еш[/b] [i]vp.[/i] [i]lud.[/i] przetrwać
[b]претрива|м, -ш[/b] [i]vi.[/i] przecierać, przecinać, przepiłowywać; [b]~м праговете[/b] wycieram (obijam) cudze progi
[b]претри|я, -еш[/b] [i]vp.[/i] v. [b]претривам[/b]

Page 53: Ivan Derganskyi

53

Adding procedurality?

погазва|м, -ш vi. deptać, brodzić (trochę)
погор|я, -иш vp. popalić się (trochę, krótko); […]
погъделичква|м, -ш vi. łaskotać, łechtać (trochę, lekko)
погълта|м, -ш vp. łyknąć trochę
погърмява|м, -ш vi. pogrzmiewać, grzmieć od czasu do czasu, […]
подадва|м, -ш vi. lud. dawać po trochę, od czasu do czasu

Page 54: Ivan Derganskyi

54

Polyprefixation

позагаз|я, -иш vp. zabrnąć, wpaść w ciężkie położenie (trochę)

позагатн|а, -еш vp. napomknąć, wspomnieć mimochodem

позагледа|м, -ш vp. spoglądnąć, spojrzeć, popatrzyć (trochę, od czasu do czasu)

понатежава|м, -ш vi. stawać się trochę cięższym, ciążyć trochę

понатисн|а, -еш vp. nacisnąć, przycisnąć trochę

понатовар|я, -иш vp. naładować trochę, obciążyć, obarczyć trochę
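Entries like these follow a regular pattern: the prefix по- on the Bulgarian side corresponds to an added trochę ‘a little’ on the Polish side. One way to make that procedural is a lookup that falls back on the base verb when the prefixed verb is not listed. The sketch below is illustrative only: the base glosses are obtained by stripping trochę from the entries above rather than taken from the dictionary, the uniform `(trochę)` marker glosses over the variation seen in the real entries, and a real rule would also need to block words that merely begin with по.

```python
# Base verbs with glosses inferred from the prefixed entries (assumption).
BASE_GLOSSES = {
    "загазя": "zabrnąć, wpaść w ciężkie położenie",
    "натисна": "nacisnąć, przycisnąć",
}

PREFIX = "по"
ATTENUATIVE = "(trochę)"


def lookup(word):
    """Return a listed gloss, or derive one for a по-prefixed verb."""
    if word in BASE_GLOSSES:
        return BASE_GLOSSES[word]
    base = word[len(PREFIX):]
    if word.startswith(PREFIX) and base in BASE_GLOSSES:
        return f"{BASE_GLOSSES[base]} {ATTENUATIVE}"
    return None


derived = lookup("позагазя")
```

The trade-off is between register size and precision: a rule removes thousands of predictable entries, but every derivative whose meaning has drifted still needs an explicit entry that overrides the rule.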

Page 55: Ivan Derganskyi

55

Adding procedurality? (continued)

претъркаля|м, -ш vp. przetoczyć, przesunąć tocząc

Likewise perhaps:
• evaluatives
• words for females
• abstract nouns
• … and other productive derivatives

Page 56: Ivan Derganskyi

56

Applications of the electronic LDB

• lexicography:
  – creation of electronic bilingual dictionaries for research and teaching
  – specialised reference works, e.g., valency dictionaries

• education: training skills of independent investigation with the help of the computer