Morphological Analyzer and Generator for Russian and Ukrainian Languages

Mikhail Korobov AIST 2015

Morphological Analysis: word -> possible grammatical tags

• стали: VERB,perf,intr plur,past,indc (ГЛ,сов,неперех мн,прош,изъяв);

• стали: NOUN,inan,femn [sing,gent; sing,datv; sing,loct; plur,nomn; plur,accs] (СУЩ,неод,жр [ед,рд; ед,дт; ед,пр; мн,им; мн,вн])

• бутявка: NOUN,inan,femn sing,nomn (СУЩ,неод,жр ед,им)

Morphological Generation

• lemmatization: стали -> стать, ежом -> ёж

• inflection: стали -> (sing,3per,fut) -> станет

• inflection: ёж -> (datv) -> ежу

pymorphy2: features

• morphological analysis of Russian words;

• morphological generation: lemmatization, inflection, number agreement;

• P(tag | word) estimates;

• out-of-vocabulary words handling;

• experimental support for Ukrainian language.

pymorphy2: implementation

• Python library and a command line tool

• Permissive open-source license: MIT for code, Creative Commons BY-SA for data

• 600+ unit tests; 90%+ test coverage

• Memory usage: 30MB = 15MB pymorphy2 + 15MB Python interpreter

• Speed: 20-100K words per second with an optional C++ extension

Analysis of Vocabulary Words

• OpenCorpora dictionary for Russian (5M word forms, 400K lemmas);

• for Ukrainian: a dictionary based on LanguageTool data (2.5M word forms) by Andrey Rysin, Dmitry Chaplinsky, Mariana Romanyshyn, Vladimir Sevastyanov & others.

Source dictionaries provide lexemes:

ёж      NOUN,anim,masc sing,nomn
ежа     NOUN,anim,masc sing,gent
ежу     NOUN,anim,masc sing,datv
...
ежами   NOUN,anim,masc plur,ablt
ежах    NOUN,anim,masc plur,loct

Tasks

• Analyze: get a word from dictionary, return its tag

• Lemmatize: find a word in dictionary, get 1st word from its lexeme

• Inflect: find a word in dictionary, get a compatible word from its lexeme
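The three tasks can be illustrated with a toy in-memory lexeme table. This is a minimal sketch, not pymorphy2's implementation: the names `LEXEMES`, `INDEX`, `analyze`, `lemmatize`, and `inflect` are hypothetical, and the data is the ёж lexeme shown earlier (with the elided forms left out).

```python
# Toy lexeme table; the first entry of each lexeme is treated as the lemma.
LEXEMES = [
    [
        ("ёж", "NOUN,anim,masc sing,nomn"),
        ("ежа", "NOUN,anim,masc sing,gent"),
        ("ежу", "NOUN,anim,masc sing,datv"),
        ("ежами", "NOUN,anim,masc plur,ablt"),
        ("ежах", "NOUN,anim,masc plur,loct"),
    ],
]

# word form -> list of (lexeme_index, form_index)
INDEX = {}
for lexeme_index, lexeme in enumerate(LEXEMES):
    for form_index, (form, _tag) in enumerate(lexeme):
        INDEX.setdefault(form, []).append((lexeme_index, form_index))


def grammemes(tag):
    """Split a tag such as 'NOUN,anim,masc sing,nomn' into a set of grammemes."""
    return set(tag.replace(" ", ",").split(","))


def analyze(word):
    """Analyze: look the word up and return its tags."""
    return [LEXEMES[li][fi][1] for li, fi in INDEX.get(word, [])]


def lemmatize(word):
    """Lemmatize: return the first form of each lexeme containing the word."""
    return [LEXEMES[li][0][0] for li, _fi in INDEX.get(word, [])]


def inflect(word, required):
    """Inflect: return the first form in the lexeme whose tag has all required grammemes."""
    required = set(required)
    results = []
    for li, _fi in INDEX.get(word, []):
        for form, tag in LEXEMES[li]:
            if required <= grammemes(tag):
                results.append(form)
                break
    return results
```

With this table, `lemmatize("ежами")` returns the lemma ёж and `inflect("ёж", {"datv"})` returns ежу; a word absent from the table yields an empty list.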

Efficiency considerations

• OpenCorpora XML dictionary is 400MB on disk

• Looking a word up by scanning the XML is O(N)

• When loaded into an in-memory hash table (a Python dict), the dictionary takes several GB of RAM

Solution

• Extract paradigms from lexemes; encode words as DAFSA.

• Also tried: succinct tries, two double-array tries

• 5M Russian word forms in DAFSA == 3MB RAM

Lexeme

word          tag
хомяковый     ADJF,Qual masc,sing,nomn
хомякового    ADJF,Qual masc,sing,gent
...
хомяковы      ADJS,Qual plur
хомяковее     COMP,Qual
хомяковей     COMP,Qual V-ej
похомяковее   COMP,Qual Cmp2
похомяковей   COMP,Qual Cmp2,V-ej

Lexeme

prefix  stem     suffix  tag
        хомяков  ый      ADJF,Qual masc,sing,nomn
        хомяков  ого     ADJF,Qual masc,sing,gent
...
        хомяков  ы       ADJS,Qual plur
        хомяков  ее      COMP,Qual
        хомяков  ей      COMP,Qual V-ej
по      хомяков  ее      COMP,Qual Cmp2
по      хомяков  ей      COMP,Qual Cmp2,V-ej

Paradigm

prefix  suffix  tag
        ый      ADJF,Qual masc,sing,nomn
        ого     ADJF,Qual masc,sing,gent
...
        ы       ADJS,Qual plur
        ее      COMP,Qual
        ей      COMP,Qual V-ej
по      ее      COMP,Qual Cmp2
по      ей      COMP,Qual Cmp2,V-ej

Paradigm, encoded

prefix_id  suffix_id  tag_id
0          66         78
0          67         79
...
0          37         94
0          82         95
0          121        96
1          82         97
1          121        98
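The step from a lexeme to an encoded paradigm can be sketched as follows. This is an illustration under assumptions, not the library's code: the stem is taken to be the longest substring common to all forms, the allowed paradigm prefixes are assumed to be `""` and `"по"`, and the helper names are hypothetical. The ids produced here start from zero, so they differ from the ids in the table, which come from tables shared across the whole dictionary.

```python
LEXEME = [
    ("хомяковый", "ADJF,Qual masc,sing,nomn"),
    ("хомякового", "ADJF,Qual masc,sing,gent"),
    ("хомяковы", "ADJS,Qual plur"),
    ("хомяковее", "COMP,Qual"),
    ("хомяковей", "COMP,Qual V-ej"),
    ("похомяковее", "COMP,Qual Cmp2"),
    ("похомяковей", "COMP,Qual Cmp2,V-ej"),
]


def longest_common_substring(words):
    """Brute-force search; fine for the handful of forms in one lexeme."""
    shortest = min(words, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in word for word in words):
                return candidate
    return ""


def to_paradigm(lexeme, allowed_prefixes=("", "по")):
    """Return (stem, paradigm); the paradigm is a tuple of (prefix, suffix, tag) rows."""
    stem = longest_common_substring([form for form, _tag in lexeme])
    rows = []
    for form, tag in lexeme:
        pos = form.index(stem)
        prefix, suffix = form[:pos], form[pos + len(stem):]
        if prefix not in allowed_prefixes:
            # Assumed fallback: empty stem, every form stored whole as a suffix.
            return "", tuple(("", f, t) for f, t in lexeme)
        rows.append((prefix, suffix, tag))
    return stem, tuple(rows)


def encode(paradigm, prefix_ids, suffix_ids, tag_ids):
    """Replace strings with integer ids, assigning new ids on first sight."""
    def intern(table, value):
        return table.setdefault(value, len(table))
    return [
        (intern(prefix_ids, p), intern(suffix_ids, s), intern(tag_ids, t))
        for p, s, t in paradigm
    ]


stem, paradigm = to_paradigm(LEXEME)
encoded = encode(paradigm, {}, {}, {})

# A lexeme with a different stem but the same endings yields the same paradigm,
# so the paradigm needs to be stored only once.
other_lexeme = [(form.replace("хомяков", "бобров"), tag) for form, tag in LEXEME]
other_stem, other_paradigm = to_paradigm(other_lexeme)
```

Because many lexemes share one paradigm, the dictionary only has to store each paradigm once plus a paradigm id per word.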

DAFSA

[Figure: DAFSA graph encoding the triples listed below.]

(word, paradigm_id, form_index) triples:
(двор, 103, 0); (ёж, 104, 0); (дворник, 101, 2); (дворник, 102, 2); (ёжик, 101, 2); (ёжик, 102, 2)
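One way to picture how such triples can live in a string automaton is to store each of them as a single key of the form word + separator + payload, and to answer a lookup by enumerating all keys that start with word + separator. The sketch below uses a sorted list with binary search as a stand-in for the DAFSA; it shows the key layout only, not the memory savings. The separator character and the payload format are assumptions.

```python
import bisect

SEP = "\x01"  # assumption: a character that never occurs inside a word

TRIPLES = [
    ("двор", 103, 0),
    ("ёж", 104, 0),
    ("дворник", 101, 2),
    ("дворник", 102, 2),
    ("ёжик", 101, 2),
    ("ёжик", 102, 2),
]

# Each triple becomes one key: word, separator, then "paradigm_id:form_index".
KEYS = sorted(f"{word}{SEP}{pid}:{idx}" for word, pid, idx in TRIPLES)


def lookup(word):
    """Return all (paradigm_id, form_index) pairs stored for the word."""
    prefix = word + SEP
    start = bisect.bisect_left(KEYS, prefix)
    result = []
    for key in KEYS[start:]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous in sorted order
        pid, idx = key[len(prefix):].split(":")
        result.append((int(pid), int(idx)))
    return result
```

The separator keeps двор from matching keys of дворник: the prefix searched for is the whole word followed by the separator, not the bare word.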

Out of Vocabulary Words

Common prefix removal: language-specific lists of common immutable prefixes (e.g. "не", "псевдо")

• недопсевдоавиашоу == недо + псевдоавиашоу

• псевдоавиашоу == псевдо + авиашоу

• авиашоу == авиа + шоу

• шоу - a known word
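The chain above can be sketched as a recursive search that strips known prefixes until a dictionary word remains. This is a minimal illustration: `KNOWN_PREFIXES` and `DICTIONARY` are tiny made-up stand-ins for the real language-specific lists, and the function name is hypothetical.

```python
KNOWN_PREFIXES = ["недо", "псевдо", "авиа", "не"]  # illustrative subset
DICTIONARY = {"шоу"}  # stand-in for the real dictionary


def strip_known_prefixes(word):
    """Return (prefixes, base_word) if the word reduces to a dictionary word, else None."""
    if word in DICTIONARY:
        return [], word
    # Try longer prefixes first so that "недо" is preferred over "не".
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) > len(prefix):
            result = strip_known_prefixes(word[len(prefix):])
            if result is not None:
                prefixes, base = result
                return [prefix] + prefixes, base
    return None
```

Since the prefixes are immutable, the analysis of the known base word can plausibly be reused for the whole word, with the stripped prefixes put back in front of its forms.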

Words Ending with Other Dictionary Words

Example: котопсина

• a word being analyzed has another word from a dictionary as a suffix;

• the length of this "suffix" word is no less than 3;

• the length of the word without the "suffix" is no greater than 5;

• "suffix" word is of an open class (noun, verb, adjective, participle, gerund)
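The four conditions translate directly into a small search over split points. The sketch below is an illustration only: the dictionary maps words to a part-of-speech label, the `OPEN_CLASSES` set uses OpenCorpora-style labels chosen for the example, and the function name is hypothetical.

```python
# Assumed part-of-speech labels for open classes (illustrative subset).
OPEN_CLASSES = {"NOUN", "VERB", "INFN", "ADJF", "ADJS", "PRTF", "PRTS", "GRND"}

# Stand-in dictionary: word -> part of speech.
TOY_DICTIONARY = {"псина": "NOUN", "на": "PREP", "под": "PREP"}


def guess_by_known_suffix(word, dictionary, min_suffix_len=3, max_prefix_len=5):
    """Return (unknown_prefix, known_word, pos) for every admissible split."""
    results = []
    for split in range(1, max_prefix_len + 1):
        known_part = word[split:]
        if len(known_part) < min_suffix_len:
            break  # later splits only make the known part shorter
        pos = dictionary.get(known_part)
        if pos in OPEN_CLASSES:
            results.append((word[:split], known_part, pos))
    return results
```

For котопсина the only admissible split is кото + псина, so the word would be analyzed like псина. A word ending in под is rejected because the preposition is not of an open class.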

Endings Matching

Example: бурбуляторовый

• words with common endings often have the same grammatical form

• pymorphy2 builds an index of all 1-5 char word endings and their analyses

• (frequency, paradigm_id, form_index) triple is stored for each ending
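A simplified version of such an index can be built with counters. This sketch assumes that prediction prefers the longest ending found in the index; the toy entries and their paradigm ids are made up, and the real index is presumably filtered and stored more compactly.

```python
from collections import Counter, defaultdict

# Made-up (word, paradigm_id, form_index) entries for illustration.
ENTRIES = [
    ("хомяковый", 7, 0),
    ("бобровый", 7, 0),
    ("ежовый", 7, 0),
    ("ковбой", 12, 0),
]


def build_ending_index(entries, max_len=5):
    """Map every 1..max_len character ending to a counter of (paradigm_id, form_index)."""
    index = defaultdict(Counter)
    for word, paradigm_id, form_index in entries:
        for n in range(1, min(max_len, len(word)) + 1):
            index[word[-n:]][(paradigm_id, form_index)] += 1
    return index


def guess_by_ending(word, index, max_len=5):
    """Return (frequency, paradigm_id, form_index) triples for the longest known ending."""
    for n in range(min(max_len, len(word)), 0, -1):
        counter = index.get(word[-n:])
        if counter:
            return [(freq, pid, idx) for (pid, idx), freq in counter.most_common()]
    return []


ENDING_INDEX = build_ending_index(ENTRIES)
```

Here бурбуляторовый matches the five-character ending ровый, seen once in бобровый, so it receives that word's paradigm and form index. A word that shares only the final й gets both candidates, ordered by frequency.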

Words with a Hyphen

• adverbs with a hyphen: по-хорошему

• particles separated by a hyphen: смотри-ка

• compound words: интернет-магазин, человек-паук

P(tag | word) estimation

• Based on partially disambiguated OpenCorpora data;

• MLE with Laplace smoothing
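A Laplace-smoothed estimate of this kind can be written as (count(word, tag) + 1) / (count(word) + number of candidate tags). The sketch below assumes that smoothing is applied over the tags the analyzer proposes for the word; the function name and the counts in the usage note are made up for illustration.

```python
from collections import Counter


def estimate_tag_probabilities(candidate_tags, observed_tags, alpha=1):
    """Estimate P(tag | word) with additive (Laplace) smoothing.

    candidate_tags: tags the analyzer proposes for the word.
    observed_tags: tags seen for this word in disambiguated corpus text.
    """
    counts = Counter(tag for tag in observed_tags if tag in candidate_tags)
    total = sum(counts.values())
    denominator = total + alpha * len(candidate_tags)
    return {tag: (counts[tag] + alpha) / denominator for tag in candidate_tags}
```

If a word has two candidate tags and the corpus (hypothetically) shows the first one 8 times and the second once, the estimates are 9/11 and 2/11. A word never seen in the corpus gets a uniform distribution over its candidate tags instead of zeros.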

Evaluation: bad ideas

• evaluate pymorphy2 on OpenCorpora data

• evaluate Mystem on ruscorpora.ru (НКРЯ) data

http://ruscorpora.ru

Evaluation Setup

• pymorphy2 and Mystem 3.0;

• 100 randomly selected sentences from OpenCorpora ("microcorpus");

• 100 randomly selected sentences from ruscorpora.ru;

• tagsets are different; evaluation requires complicated tag matching and manual checking of all errors;

• available online (http://goo.gl/BNXQXf)



Evaluation: errors (full grammatical tags, recall, errors in hyphenated words are not considered errors)

[Figure: bar chart of errors for pymorphy2 and Mystem 3.0 on the microcorpus and ruscorpora samples.]

Evaluation: errors

[Figure: bar chart of errors for pymorphy2 and Mystem 3.0 by category: Abbreviations, People Names, Regular Words, Other, Hyphenated Words*.]

Evaluation: results

• Both pymorphy2 and Mystem had an error rate below 1% (without disambiguation); most errors are in special cases.

• Hard to draw a conclusion; interpretation of evaluation results is important.

• Parsing with pymorphy2 revealed 6 errors in the ruscorpora.ru gold annotations; parsing with Mystem revealed 1 error in the microcorpus gold annotations.


Future work

• improve parsing of people's names, abbreviations, and hyphenated words;

• improve non-contextual P(tag|word) estimates;

• improve Ukrainian language support;

• add Belarusian language support;

• there is room for speed improvements;

• nicer command-line utility;

• ideas?

You can help

https://github.com/kmike/pymorphy2