The Hybrid Approach
to Part-of-Speech
Disambiguation
Bruches Elena, [email protected]
Karpenko Dmitrii, [email protected]
Krayvanova Varvara, [email protected]
AIST’2016
Problem
Goal: Syntactic parser
Problem: Part-of-Speech ambiguity
Solution: an approach that combines neural networks
and manually crafted rules
State of the Art
● Hidden Markov Models Ex.: Sokirko A.V., Toldova C.U. The comparison of
two methods of lexical and morphological disambiguation for Russian //
Internet-mathematics 2005. Web-data processing. Moscow, 2005;
● Rule-based approach Ex.: Brill E. A Simple Rule-Based Part of Speech
Tagger // In Proceedings of ANLC’92, 3rd Conference on Applied Natural
Language Processing, Trento, IT (pp. 152 - 155);
● Neural Networks Ex.: Santos C., Zadrozny B. Learning character-level
representations for part-of-speech tagging // In Proceedings of the 31st
International Conference on Machine Learning, JMLR: W&CP volume 32.
2014;
● Conditional Random Fields Ex.: Antonova, A., Solov'ev, A. Conditional
Random Fields in NLP-related Tasks for Russian // Information Technologies
and Systems. 2013;
● etc.
Model
A word-form w is a lexeme in a particular grammatical form.
A tag is a grammeme of a grammatical category.
TagsW is the set of tags assigned to a word-form w.
TW = <w, TagsW> is the token of the word-form w.
Tстали = <стали, [СУЩ, им|вн, мн, жр, неод] |
[СУЩ, рд|дт|пр, ед, жр, неод] |
[ГЛ, сов, неперех, мн, изъв, прош]>
Tsteels/became = <steels/became, [NOUN, nomn|accs, pl, fem, inan] |
[NOUN, gent|datv|loct, sing, fem, inan] |
[VERB, perf, intran, pl, indic, past]>
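The token definition above can be sketched as a small data structure. This is only an illustration, not the authors' Java implementation: the names `Token`, `pos_candidates` and `is_ambiguous` are hypothetical, and the convention that the first tag of each reading is the part of speech is an assumption taken from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """T_w = <w, Tags_w>: a word-form with its alternative tag readings."""
    word_form: str
    readings: tuple  # each reading is a tuple of tags; first tag is the PoS (assumed)

    def pos_candidates(self):
        """Return the set of parts of speech still possible for this word-form."""
        return {reading[0] for reading in self.readings}

    def is_ambiguous(self):
        """A token is PoS-ambiguous when its readings disagree on the part of speech."""
        return len(self.pos_candidates()) > 1


# The token for "стали" (steels/became) from the example above.
stali = Token(
    word_form="стали",
    readings=(
        ("NOUN", "nomn|accs", "pl", "fem", "inan"),
        ("NOUN", "gent|datv|loct", "sing", "fem", "inan"),
        ("VERB", "perf", "intran", "pl", "indic", "past"),
    ),
)
```

Here `stali.pos_candidates()` yields the two competing parts of speech, NOUN and VERB, which is exactly the ambiguity the disambiguation step has to resolve.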
Algorithm
SP0: the initial set of PoS tags
SPN: the set of PoS tags obtained using neural networks
SPR: the set of PoS tags obtained using the rule-based approach
SP: the resulting set of PoS tags
Tagging Module
Tag Sets:
● Part of Speech;
● Gender;
● Number;
● Case;
● Aspect;
● Transitivity;
● Person;
● Tense;
● Mood;
● Voice
OpenCorpora Wiktionary
Neural Networks: Idea and Input
One network for each common pair <sp1, sp2>
Input: set of binary values
Context: 3 word-forms
For word-forms wi-1, wi, wi+1:
● Part of Speech: 17 bits;
● Case: 12 bits;
● Gender: 5 bits;
● Number: 5 bits;
● Transitivity: 2 bits;
● Aspect: 2 bits
Position in Sentence: 4 bits
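The bit widths listed above add up as follows: 17 + 12 + 5 + 5 + 2 + 2 = 43 bits per word-form, and 3 × 43 + 4 = 133 bits in total, which matches the 133 input neurons of the network. The sketch below shows one way such a vector could be assembled. It assumes that the position bits are encoded once for the whole window, that several bits of a feature may be on while a word-form is still ambiguous, and that bit indexes are assigned arbitrarily; the function names are hypothetical.

```python
# Bit widths per word-form, as listed on the slide.
FEATURE_BITS = {
    "part_of_speech": 17,
    "case": 12,
    "gender": 5,
    "number": 5,
    "transitivity": 2,
    "aspect": 2,
}
POSITION_BITS = 4  # position in sentence (assumed to be encoded once per window)
CONTEXT_SIZE = 3   # word-forms w[i-1], w[i], w[i+1]

PER_WORD_BITS = sum(FEATURE_BITS.values())
INPUT_SIZE = CONTEXT_SIZE * PER_WORD_BITS + POSITION_BITS


def encode_word(active):
    """Encode one word-form as a list of 0/1 values.

    `active` maps a feature name to the set of bit indexes switched on.
    The meaning of each index is hypothetical in this sketch.
    """
    bits = []
    for name, width in FEATURE_BITS.items():
        on = active.get(name, set())
        bits.extend(1 if i in on else 0 for i in range(width))
    return bits


def encode_context(prev_word, word, next_word, position_bits):
    """Concatenate the three word encodings followed by the position bits."""
    if len(position_bits) != POSITION_BITS:
        raise ValueError("expected %d position bits" % POSITION_BITS)
    vector = []
    for w in (prev_word, word, next_word):
        vector.extend(encode_word(w))
    vector.extend(position_bits)
    return vector
```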
Neural Networks: Architecture
Input: vector of binary values
● Input layer (133 neurons)
● Hidden layers (532 neurons per layer), each neuron applying the function f
● Output layer (1 neuron): likelihood of this PoS
Used library: Encog;
Learning algorithm: Resilient Backpropagation
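A forward pass through a network of this shape can be sketched in plain Python. This is not the Encog API and not the authors' code: the sigmoid activation, the choice of two hidden layers, the random weights, and the names `make_network` and `forward` are all assumptions made for illustration. Training with resilient backpropagation is not shown.

```python
import math
import random


def sigmoid(x):
    # Assumed activation; the slide only labels the units "f".
    return 1.0 / (1.0 + math.exp(-x))


def make_network(layer_sizes, seed=0):
    """Create small random weights; each layer is a list of (weights, bias) per neuron."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        layers.append([
            ([rng.uniform(-0.1, 0.1) for _ in range(n_in)], rng.uniform(-0.1, 0.1))
            for _ in range(n_out)
        ])
    return layers


def forward(layers, inputs):
    """Propagate a binary input vector and return the single output value."""
    activations = list(inputs)
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations[0]


# Layer sizes from the slide; using exactly two hidden layers is an assumption.
LAYER_SIZES = (133, 532, 532, 1)
```

With one such network per common pair <sp1, sp2>, the output value would be read as the likelihood of one of the two parts of speech for the word-form in the middle of the window.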
Rule-based Approach
Rule:
C(<T>) → sp;
C(<T>) is a predicate;
<T> is a set of tokens;
sp is a part of speech
Example:
SP0 = {ADJS, NOUN};
<T> = {NOUN, Nomn, Masc};
sp = ADJS
у одних жемчуг мелок
‘pearl is small’
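The rule from this example can be written as a small predicate. The sketch assumes that the predicate inspects only the tags of the preceding token and that a rule which does not fire returns an empty set; `make_rule` and the tag sets passed to it are hypothetical.

```python
def make_rule(candidates, required_tags, result):
    """Build a rule C(<T>) -> sp.

    Assumption: the predicate checks the tags of the preceding token only.
    """
    candidates = frozenset(candidates)
    required_tags = frozenset(required_tags)

    def rule(sp0, prev_tags):
        # Fire only for the ambiguity class the rule was written for
        # and only when the context token carries all required tags.
        if frozenset(sp0) == candidates and required_tags <= set(prev_tags):
            return {result}
        return set()  # the rule does not fire

    return rule


# The example above: a masculine nominative noun before an
# ADJS/NOUN-ambiguous word selects the short adjective reading.
short_adjective_rule = make_rule({"ADJS", "NOUN"}, {"NOUN", "nomn", "masc"}, "ADJS")

# "жемчуг мелок": illustrative tags for "жемчуг"; "мелок" is ADJS or NOUN.
fired = short_adjective_rule({"ADJS", "NOUN"}, {"NOUN", "nomn", "masc", "sing", "inan"})
```

In this sketch `fired` contains only ADJS, so "мелок" is read as the short adjective 'small' rather than as a noun.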
Comparison
SP = SPR ⋂ SPN, if SPR ⋂ SPN ≠ ∅;
SP = SPR ∪ SPN, otherwise.
SPN: the set of PoS tags produced by the neural networks
SPR: the set of PoS tags produced by the rule-based approach
SP: the resulting set of PoS tags
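The combination rule translates directly into set operations. A minimal sketch, with the hypothetical function name `combine` and illustrative tag names:

```python
def combine(sp_rules, sp_network):
    """SP = SPR ∩ SPN if the intersection is non-empty, otherwise SPR ∪ SPN."""
    common = set(sp_rules) & set(sp_network)
    return common if common else set(sp_rules) | set(sp_network)


# Both modules agree on at least one tag: keep only the shared tags.
agreed = combine({"ADVB"}, {"ADVB", "CONJ"})

# The modules disagree completely: keep every proposed tag.
disagreed = combine({"NOUN"}, {"VERB"})
```

One consequence of this definition is that when one module proposes nothing, the result is simply the other module's set, so a silent rule base never discards the network's answer.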
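Example: она точно перевела текст ('she translated the text accurately')
она [МС (pronoun)]
точно [КР_ПРИЛ (short adjective), Н (adverb), СОЮЗ (conjunction), ЧАСТ (particle)]
Rules: [Н]
Neural network: [Н:1.00(185), Н:1.00(335), СОЮЗ:0.8871(1567)]
Result: [Н]
перевела [ГЛ (verb)]
текст [СУЩ (noun)]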
Results
                      OpenCorpora    RusCorpora
General volume        33566 words    169908 words
Ambiguous words       5880 words     61137 words
Precision             96.11%         86.39%
Accuracy              99.02%         93.54%
Average cardinality   26.2%          48.64%
Conclusion
The results show that the algorithm copes well with notional parts of speech
and worse with functional ones.
Future Work:
● Improve this algorithm for conjunctions, particles etc.;
● Take into account multi-words and unknown words;
● Take into account punctuation.
A Java implementation is available on our website:
http://176.9.34.20:8080/com.onpositive.text.webview/parsing/omonimy