Morphological Analyzer and Generator for Russian and Ukrainian Languages
Mikhail Korobov AIST 2015
Morphological Analysis: word -> possible grammatical tags
• стали: VERB,perf,intr plur,past,indc (ГЛ,сов,неперех мн,прош,изъяв);
• стали: NOUN,inan,femn [sing,gent; sing,datv; sing,loct; plur,nomn; plur,accs] (СУЩ,неод,жр [ед,рд; ед,дт; ед,пр; мн,им; мн,вн])
• бутявка: NOUN,inan,femn sing,nomn (СУЩ,неод,жр ед,им)
Morphological Generation
• lemmatization: стали -> стать, ежом -> ёж
• inflection: стали -> (sing,3per,fut) -> станет
• inflection: ёж -> (datv) -> ежу
pymorphy2: features
• morphological analysis of Russian words;
• morphological generation: lemmatization, inflection, number agreement;
• P(tag | word) estimates;
• out-of-vocabulary words handling;
• experimental support for Ukrainian language.
pymorphy2: implementation
• Python library and a command line tool
• Permissive open-source license: MIT for code, Creative Commons BY-SA for data
• 600+ unit tests; 90%+ test coverage
• Memory usage: 30MB = 15MB pymorphy2 + 15MB Python interpreter
• Speed: 20-100K words per second with an optional C++ extension
Analysis of Vocabulary Words
• OpenCorpora dictionary for Russian (5M word forms, 400K lemmas);
• a dictionary for Ukrainian based on LanguageTool data (2.5M word forms) by Andrey Rysin, Dmitry Chaplinsky, Mariana Romanyshyn, Vladimir Sevastyanov & others.
Analysis of Vocabulary Words
Source dictionaries provide lexemes:
ёж      NOUN,anim,masc sing,nomn
ежа     NOUN,anim,masc sing,gent
ежу     NOUN,anim,masc sing,datv
...
ежами   NOUN,anim,masc plur,ablt
ежах    NOUN,anim,masc plur,loct
Tasks
• Analyze: get a word from dictionary, return its tag
• Lemmatize: find a word in dictionary, get 1st word from its lexeme
• Inflect: find a word in dictionary, get a compatible word from its lexeme
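The three tasks above can be sketched on a toy dictionary. This is only an illustration of the idea, not pymorphy2's actual code: the names `LEXEMES`, `INDEX`, `analyze`, `lemmatize` and `inflect` are hypothetical, and the lexeme holds just four forms of "ёж".

```python
# Toy dictionary: one lexeme, the first form is treated as the lemma.
LEXEMES = [
    [
        ("ёж", "NOUN,anim,masc sing,nomn"),
        ("ежа", "NOUN,anim,masc sing,gent"),
        ("ежу", "NOUN,anim,masc sing,datv"),
        ("ежом", "NOUN,anim,masc sing,ablt"),
    ],
]

# Index: word form -> list of (lexeme_id, form_index).
INDEX = {}
for lex_id, lexeme in enumerate(LEXEMES):
    for form_index, (form, _tag) in enumerate(lexeme):
        INDEX.setdefault(form, []).append((lex_id, form_index))


def analyze(word):
    """Return every tag the dictionary lists for this word form."""
    return [LEXEMES[lex_id][i][1] for lex_id, i in INDEX.get(word, [])]


def lemmatize(word):
    """Return the first form of each lexeme that contains the word."""
    return [LEXEMES[lex_id][0][0] for lex_id, _ in INDEX.get(word, [])]


def inflect(word, required):
    """Return forms of the same lexeme whose tag has all required grammemes."""
    results = []
    for lex_id, _ in INDEX.get(word, []):
        for form, tag in LEXEMES[lex_id]:
            grammemes = set(tag.replace(" ", ",").split(","))
            if required <= grammemes:
                results.append(form)
    return results
```

With this toy data, `lemmatize("ежом")` yields the lemma "ёж" and `inflect("ёж", {"datv"})` yields "ежу", matching the examples on the earlier slide.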
Efficiency considerations
• OpenCorpora XML dictionary is 400MB on disk
• Lookup by scanning the XML is O(N)
• When loaded into an in-memory hash table (Python dict), the dictionary takes several GB of RAM
Solution
• Extract paradigms from lexemes; encode words as DAFSA.
• Also tried: succinct tries, two double-array tries
• 5M Russian word forms in DAFSA == 3MB RAM
Lexeme
word          tag
хомяковый     ADJF,Qual masc,sing,nomn
хомякового    ADJF,Qual masc,sing,gent
...
хомяковы      ADJS,Qual plur
хомяковее     COMP,Qual
хомяковей     COMP,Qual V-ej
похомяковее   COMP,Qual Cmp2
похомяковей   COMP,Qual Cmp2,V-ej
Lexeme
prefix  stem      suffix  tag
        хомяков   ый      ADJF,Qual masc,sing,nomn
        хомяков   ого     ADJF,Qual masc,sing,gent
...
        хомяков   ы       ADJS,Qual plur
        хомяков   ее      COMP,Qual
        хомяков   ей      COMP,Qual V-ej
по      хомяков   ее      COMP,Qual Cmp2
по      хомяков   ей      COMP,Qual Cmp2,V-ej
Paradigm
prefix  suffix  tag
        ый      ADJF,Qual masc,sing,nomn
        ого     ADJF,Qual masc,sing,gent
...
        ы       ADJS,Qual plur
        ее      COMP,Qual
        ей      COMP,Qual V-ej
по      ее      COMP,Qual Cmp2
по      ей      COMP,Qual Cmp2,V-ej
Paradigm, encoded
prefix_id  suffix_id  tag_id
0          66         78
0          67         79
...
0          37         94
0          82         95
0          121        96
1          82         97
1          121        98
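The lexeme-to-paradigm step shown in the tables can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: `PARADIGM_PREFIXES` is a hypothetical list of allowed paradigm prefixes, the stem is taken as the longest common prefix of the de-prefixed forms, and ids are assigned in order of first appearance (so they differ from the ids on the slide).

```python
import os.path

PARADIGM_PREFIXES = ["", "по"]  # assumption: a short fixed list of prefixes

LEXEME = [
    ("хомяковый", "ADJF,Qual masc,sing,nomn"),
    ("хомякового", "ADJF,Qual masc,sing,gent"),
    ("хомяковы", "ADJS,Qual plur"),
    ("хомяковее", "COMP,Qual"),
    ("хомяковей", "COMP,Qual V-ej"),
    ("похомяковее", "COMP,Qual Cmp2"),
    ("похомяковей", "COMP,Qual Cmp2,V-ej"),
]


def shared_len(a, b):
    """Length of the common prefix of two strings."""
    return len(os.path.commonprefix([a, b]))


def split_lexeme(lexeme):
    """Return (stem, paradigm); paradigm rows are (prefix, suffix, tag)."""
    base = lexeme[0][0]
    rows = []
    for word, tag in lexeme:
        candidates = [p for p in PARADIGM_PREFIXES if word.startswith(p)]
        # Pick the prefix whose removal makes the word most similar to the
        # first form; ties go to the earlier (empty) prefix.
        prefix = max(candidates, key=lambda p: shared_len(word[len(p):], base))
        rows.append((prefix, word[len(prefix):], tag))
    stem = os.path.commonprefix([rest for _, rest, _ in rows])
    return stem, [(p, rest[len(stem):], tag) for p, rest, tag in rows]


def encode(paradigm, prefix_ids, suffix_ids, tag_ids):
    """Replace strings with integer ids, extending the id tables as needed."""
    def intern(table, value):
        return table.setdefault(value, len(table))
    return [(intern(prefix_ids, p), intern(suffix_ids, s), intern(tag_ids, t))
            for p, s, t in paradigm]


stem, paradigm = split_lexeme(LEXEME)
encoded = encode(paradigm, {}, {}, {})
```

Because many lexemes share the same paradigm, only the stem and a paradigm id need to be stored per lexeme, which is where most of the memory saving comes from.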
DAFSA
[Figure: DAFSA graph encoding the (word, paradigm_id, form_index) triples listed below]
(word, paradigm_id, form_index) triples: (двор, 103, 0); (ёж, 104, 0); (дворник, 101, 2); (дворник, 102, 2); (ёжик, 101, 2); (ёжик, 102, 2)
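To show how such triples can be looked up by word, here is a minimal sketch that swaps the DAFSA for a plain sorted list with binary search. The key layout (word, separator, paradigm_id, separator, form_index) and the `SEP` character are assumptions made for the illustration; a real DAFSA stores the same keys far more compactly by sharing prefixes and suffixes.

```python
import bisect

SEP = "\x00"  # assumed separator; sorts before any letter

TRIPLES = [
    ("двор", 103, 0), ("ёж", 104, 0),
    ("дворник", 101, 2), ("дворник", 102, 2),
    ("ёжик", 101, 2), ("ёжик", 102, 2),
]

# Each triple becomes one string key; the sorted list stands in for the DAFSA.
KEYS = sorted(f"{word}{SEP}{pid}{SEP}{idx}" for word, pid, idx in TRIPLES)


def lookup(word):
    """Return all (paradigm_id, form_index) pairs stored for the word."""
    prefix = word + SEP
    start = bisect.bisect_left(KEYS, prefix)
    found = []
    for key in KEYS[start:]:
        if not key.startswith(prefix):
            break
        _, pid, idx = key.split(SEP)
        found.append((int(pid), int(idx)))
    return found
```

Looking up "дворник" returns both of its analyses, while "двор" returns only its own, because the separator keeps "двор" from matching keys of longer words.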
Out of Vocabulary Words
Common prefixes removal: language-specific lists of common immutable prefixes (e.g. "не", "псевдо")
• недопсевдоавиашоу == недо + псевдоавиашоу
• псевдоавиашоу == псевдо + авиашоу
• авиашоу == авиа + шоу
• шоу is a known word
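The chain above can be sketched as a recursive prefix stripper. The prefix list and the one-word dictionary are hypothetical toy data ("недо" and "авиа" are assumed to be on the list so that the slide's example works), and the function name `strip_prefixes` is made up for this illustration.

```python
KNOWN_WORDS = {"шоу"}                        # toy dictionary
PREFIXES = ["недо", "псевдо", "авиа", "не"]  # hypothetical prefix list


def strip_prefixes(word):
    """Return (removed_prefixes, known_word), or None if nothing matches."""
    if word in KNOWN_WORDS:
        return [], word
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest:
            result = strip_prefixes(rest)
            if result is not None:
                stripped, base = result
                return [prefix] + stripped, base
    return None
```

Since the prefixes are immutable, the analyses of the known word reached at the end can plausibly be reused for the whole word.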
Words Ending with Other Dictionary Words
Example: котопсина
• a word being analyzed has another word from a dictionary as a suffix;
• the length of this "suffix" word is no less than 3;
• the length of the word without the "suffix" is no greater than 5;
• "suffix" word is of an open class (noun, verb, adjective, participle, gerund)
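The four conditions above translate directly into a small check. This is a sketch with a toy dictionary: `DICTIONARY`, `OPEN_CLASSES` and `suffix_word_candidates` are hypothetical names, and the part-of-speech labels are a simplified subset.

```python
# Toy dictionary: word -> part of speech.
DICTIONARY = {"псина": "NOUN", "на": "PREP"}
# Simplified labels for noun, verb, adjective, participle, gerund.
OPEN_CLASSES = {"NOUN", "VERB", "ADJF", "PRTF", "GRND"}


def suffix_word_candidates(word, min_suffix=3, max_rest=5):
    """Find splits rest + suffix where suffix is an open-class dictionary word."""
    found = []
    for cut in range(1, len(word)):
        rest, suffix = word[:cut], word[cut:]
        pos = DICTIONARY.get(suffix)
        if pos is None:
            continue
        if len(suffix) >= min_suffix and len(rest) <= max_rest and pos in OPEN_CLASSES:
            found.append((rest, suffix, pos))
    return found
```

For "котопсина" the only accepted split is "кото" + "псина"; the ending "на" is rejected because it is too short and not an open-class word.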
Endings Matching
Example: бурбуляторовый
• words with common endings often have the same grammatical form
• pymorphy2 builds an index of all 1-5 char word endings and their analyses
• (frequency, paradigm_id, form_index) triple is stored for each ending
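A minimal sketch of such an ending index is given below. The training entries and paradigm ids are invented toy values, the function names are hypothetical, and the real index is built from the whole dictionary and likely applies additional filtering that is not shown here.

```python
from collections import Counter, defaultdict

# Toy (word, paradigm_id, form_index) entries; ids are made up.
TRAINING = [
    ("хомяковый", 7, 0),
    ("ежовый", 7, 0),
    ("новый", 12, 0),
]


def build_ending_index(entries, max_len=5):
    """Map each 1..max_len character ending to counts of (paradigm_id, form_index)."""
    index = defaultdict(Counter)
    for word, paradigm_id, form_index in entries:
        for n in range(1, min(max_len, len(word)) + 1):
            index[word[-n:]][(paradigm_id, form_index)] += 1
    return index


def guess(word, index, max_len=5):
    """Return (frequency, paradigm_id, form_index) triples for the longest known ending."""
    for n in range(min(max_len, len(word)), 0, -1):
        counts = index.get(word[-n:])
        if counts:
            return [(freq, pid, idx) for (pid, idx), freq in counts.most_common()]
    return []


ENDING_INDEX = build_ending_index(TRAINING)
```

For "бурбуляторовый" the 5-character ending "ровый" is unknown in this toy index, so the lookup falls back to "овый" and returns the candidates ordered by frequency.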
Words with a Hyphen
• adverbs with a hyphen: по-хорошему
• particles separated by a hyphen: смотри-ка
• compound words: интернет-магазин, человек-паук
P(tag | word) estimation
• Based on partially disambiguated OpenCorpora data;
• MLE with Laplace smoothing
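As an illustration, add-one (Laplace) smoothing can be written as P(tag | word) = (count(word, tag) + 1) / (count(word) + K), where K is the number of tags being smoothed over. The sketch below assumes K is the number of tags the dictionary allows for the word; the corpus counts are invented for the example, and the tags are shortened to their part of speech.

```python
from collections import Counter


def tag_probabilities(word, possible_tags, corpus_counts, alpha=1.0):
    """Estimate P(tag | word) with add-alpha smoothing over possible_tags."""
    counts = {tag: corpus_counts[(word, tag)] for tag in possible_tags}
    total = sum(counts.values()) + alpha * len(possible_tags)
    return {tag: (n + alpha) / total for tag, n in counts.items()}


# Hypothetical counts from a disambiguated corpus.
CORPUS_COUNTS = Counter({("стали", "VERB"): 7, ("стали", "NOUN"): 1})

probs = tag_probabilities("стали", ["VERB", "NOUN"], CORPUS_COUNTS)
unseen = tag_probabilities("бутявка", ["NOUN", "VERB"], CORPUS_COUNTS)
```

Smoothing keeps rare analyses from getting zero probability, and a word that never occurs in the corpus gets a uniform distribution over its possible tags.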
Evaluation: bad ideas
• evaluate pymorphy2 on OpenCorpora data
• evaluate Mystem on ruscorpora.ru (НКРЯ) data
Evaluation Setup
• pymorphy2 and Mystem 3.0;
• 100 randomly selected sentences from OpenCorpora ("microcorpus");
• 100 randomly selected sentences from ruscorpora.ru;
• tagsets are different; evaluation requires complicated tag matching and manual checking of all errors;
• available online (http://goo.gl/BNXQXf)
Evaluation: errors (full grammatical tags, recall, errors in hyphenated words are not considered errors)
[Bar chart: number of errors on microcorpus and ruscorpora for pymorphy2 and Mystem 3.0; y-axis from 0 to 30; bar labels as extracted: 89, 15, 10]
Evaluation: errors
[Bar chart: number of errors by category (Abbreviations, People Names, Regular Words, Other, Hyphenated Words*) for pymorphy2 and Mystem 3.0; y-axis from 0 to 14; bar labels as extracted: 11, 2, 6, 1, 14, 02, 44, 9]
Evaluation: results
• Both pymorphy2 and Mystem had an error rate below 1% (without disambiguation); most errors are in special cases.
• Hard to draw a conclusion; interpretation of evaluation results is important.
• 6 errors in the ruscorpora.ru gold results were found by parsing it with pymorphy2; 1 error in the microcorpus gold results was found by parsing it with Mystem.
Future work
• improve parsing of people's names, abbreviations, and hyphenated words;
• improve non-contextual P(tag|word) estimates;
• improve Ukrainian language support;
• add Belarusian language support;
• there is room for speed improvements;
• nicer command-line utility;
• ideas?
You can help
https://github.com/kmike/pymorphy2