Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation

The Hybrid Approach to Part-of-Speech Disambiguation

Bruches Elena, [email protected]

Karpenko Dmitrii, [email protected]

Krayvanova Varvara, [email protected]

AIST’2016




Problem

Goal: Syntactic parser

Problem: Part-of-Speech ambiguity

Solution: an approach that combines neural networks with manually crafted rules


State of the Art

● Hidden Markov Models. Ex.: Sokirko A.V., Toldova C.U. The comparison of two methods of lexical and morphological disambiguation for Russian // Internet-mathematics 2005. Web-data processing. Moscow, 2005;

● Rule-based approach. Ex.: Brill E. A Simple Rule-Based Part of Speech Tagger // In Proceedings of ANLC’92, 3rd Conference on Applied Natural Language Processing, Trento, IT (pp. 152-155);

● Neural Networks. Ex.: Santos C., Zadrozny B. Learning character-level representations for part-of-speech tagging // In Proceedings of the 31st International Conference on Machine Learning, JMLR: W&CP volume 32. 2014;

● Conditional Random Fields. Ex.: Antonova A., Solov'ev A. Conditional Random Fields in NLP-related Tasks for Russian // Information Technologies and Systems. 2013;

● etc.


Model

A word-form w is a lexeme in a particular grammatical form.

A tag is a grammeme of a grammatical category.

Tags_w is the set of tags assigned to the word-form w.

T_w = <w, Tags_w> is the token of the word-form w.

T_стали = <стали, [СУЩ, им|вн, мн, жр, неод] |
                  [СУЩ, рд|дт|пр, ед, жр, неод] |
                  [ГЛ, сов, неперех, мн, изъв, прош]>

T_steels/became = <steels/became, [NOUN, nomn|accs, pl, fem, inan] |
                                  [NOUN, gent|datv|loct, sing, fem, inan] |
                                  [VERB, perf, intran, pl, indic, past]>
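To make the token definition concrete, here is a minimal Python sketch of T_w = <w, Tags_w> for the word-form стали. The class names `Reading` and `Token` and the method names are hypothetical, chosen only for illustration; the slides do not describe how the Java implementation stores tokens.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Reading(NamedTuple):
    """One alternative analysis of a word-form (hypothetical structure)."""
    pos: str                    # part-of-speech tag, e.g. "NOUN"
    grammemes: FrozenSet[str]   # remaining tags, e.g. {"pl", "fem"}


class Token(NamedTuple):
    """T_w = <w, Tags_w>: a word-form together with its alternative readings."""
    word_form: str
    readings: Tuple[Reading, ...]

    def pos_candidates(self):
        """Return the set of PoS tags over all readings."""
        return {reading.pos for reading in self.readings}

    def is_ambiguous(self):
        """A token is PoS-ambiguous if its readings disagree on the PoS."""
        return len(self.pos_candidates()) > 1


# The example from the slide: стали is either a noun form or a past-tense verb.
stali = Token("стали", (
    Reading("NOUN", frozenset({"nomn|accs", "pl", "fem", "inan"})),
    Reading("NOUN", frozenset({"gent|datv|loct", "sing", "fem", "inan"})),
    Reading("VERB", frozenset({"perf", "intran", "pl", "indic", "past"})),
))

print(sorted(stali.pos_candidates()))  # ['NOUN', 'VERB']
print(stali.is_ambiguous())            # True
```

The two noun readings collapse into a single PoS candidate, so the disambiguation task for this token is a choice between NOUN and VERB.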


Algorithm

SP0 - the initial set of PoS tags

SPN - the set of PoS tags obtained using neural networks

SPR - the set of PoS tags obtained using the rule-based approach

SP - the resulting set of PoS tags


Tagging Module

Tag Sets:

● Part of Speech;

● Gender;

● Number;

● Case;

● Aspect;

● Transitivity;

● Person;

● Tense;

● Mood;

● Voice

OpenCorpora, Wiktionary


Neural Networks: Idea and Input

One network for each common pair <sp1, sp2>

Input: set of binary values

Context: 3 word-forms

Part of Speech: 17 bits

Case: 12 bits

Gender: 5 bits

Number: 5 bits

Transitivity: 2 bits

Aspect: 2 bits

Position in Sentence: 4 bits

For word-forms w_{i-1}, w_i, w_{i+1}
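The bit widths above add up to the 133-neuron input layer given on the architecture slide if the six grammatical groups are encoded once per word-form in the 3-word context and the sentence position is encoded once: 3 × (17 + 12 + 5 + 5 + 2 + 2) + 4 = 133. That layout is an assumption read off the numbers, not something the slides state. The sketch below builds a binary vector under that assumption; the function names and the dictionary format for a word-form are hypothetical.

```python
# Bit widths per feature group, taken from the slide.
PER_WORD_GROUPS = {
    "pos": 17,
    "case": 12,
    "gender": 5,
    "number": 5,
    "transitivity": 2,
    "aspect": 2,
}
POSITION_BITS = 4   # assumed to be encoded once, not per word-form
CONTEXT_SIZE = 3    # w_{i-1}, w_i, w_{i+1}


def input_size():
    """Total number of binary inputs under the assumed layout."""
    return CONTEXT_SIZE * sum(PER_WORD_GROUPS.values()) + POSITION_BITS


def encode_group(width, active):
    """Return a list of `width` bits with ones at the `active` indices."""
    bits = [0] * width
    for index in active:
        bits[index] = 1
    return bits


def encode_context(words, position_index):
    """Encode three word-forms plus the sentence position.

    `words` is a list of three dicts mapping a group name to the indices of
    the bits that are set (several bits may be set for an ambiguous word).
    """
    vector = []
    for word in words:
        for group, width in PER_WORD_GROUPS.items():
            vector.extend(encode_group(width, word.get(group, ())))
    vector.extend(encode_group(POSITION_BITS, [position_index]))
    return vector


print(input_size())  # 133
```

Allowing several active bits per group lets an ambiguous neighbouring word-form contribute all of its candidate tags to the input.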


Neural Networks: Architecture

[Diagram: input vector → input layer (133 neurons) → hidden layers (532 neurons per layer, each neuron labelled f) → output layer (1 neuron) → likelihood of this PoS]

Used library: Encog;

Learning algorithm: Resilient Backpropagation
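The layer sizes can be illustrated with a plain-Python forward pass. This is only a shape sketch under stated assumptions: the number of hidden layers (two here), the sigmoid activation, and the random weights are all assumptions, and the code uses neither Encog nor Resilient Backpropagation, which the slides name for the actual implementation.

```python
import math
import random

# 133 inputs, hidden layers of 532 neurons, 1 output.
# Two hidden layers is an assumption; the slides only say "hidden layers".
LAYER_SIZES = [133, 532, 532, 1]


def sigmoid(x):
    """Logistic activation (assumed; the slides only label neurons with f)."""
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(sizes, seed=0):
    """Random weights, one row per neuron, with one extra weight for the bias."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def forward(weights, vector):
    """Propagate a binary input vector and return the single output value."""
    activations = list(vector)
    for layer in weights:
        inputs = activations + [1.0]  # constant bias input
        activations = [
            sigmoid(sum(w * x for w, x in zip(neuron, inputs)))
            for neuron in layer
        ]
    return activations[0]


weights = init_weights(LAYER_SIZES)
score = forward(weights, [0] * 133)
print(0.0 < score < 1.0)  # True
```

With untrained weights the output is meaningless; after training, it would be read as the likelihood that the word-form has the PoS the network was built for.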

Rule-based Approach

Rule:

C(<T>) → sp, where

C(<T>) is a predicate;

<T> is a set of tokens;

sp is a part of speech

Example:

SP0 = {ADJS, NOUN};

<T> = {NOUN, Nomn, Masc};

sp = ADJS

у одних жемчуг мелок

‘pearl is small’
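The example rule can be sketched as data plus a small matching function. The semantics of the predicate C(<T>) are assumed here: the rule fires when the ambiguous word has exactly the candidate set SP0 and some context reading contains all tags in <T>. The slides do not specify which context positions the predicate inspects, and the names `Rule`, `predicate_holds` and `apply_rule` are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    candidates: FrozenSet[str]    # SP0 the rule applies to
    context_tags: FrozenSet[str]  # <T>: tags the predicate looks for
    result: str                   # sp chosen when the predicate holds


def predicate_holds(rule, context_readings):
    """C(<T>): true if any context reading contains all required tags (assumed)."""
    return any(rule.context_tags <= reading for reading in context_readings)


def apply_rule(rule, sp0, context_readings):
    """Return {sp} if the rule fires, otherwise leave SP0 unchanged."""
    if frozenset(sp0) == rule.candidates and predicate_holds(rule, context_readings):
        return {rule.result}
    return set(sp0)


# The rule from the slide: SP0 = {ADJS, NOUN}, <T> = {NOUN, nomn, masc}, sp = ADJS.
rule = Rule(
    candidates=frozenset({"ADJS", "NOUN"}),
    context_tags=frozenset({"NOUN", "nomn", "masc"}),
    result="ADJS",
)

# жемчуг 'pearl' as a masculine nominative noun (reading written by hand).
zhemchug = [frozenset({"NOUN", "nomn", "masc", "sing", "inan"})]

# мелок is ambiguous between a short adjective ('small') and a noun ('chalk').
print(apply_rule(rule, {"ADJS", "NOUN"}, zhemchug))  # {'ADJS'}
```

When the predicate does not hold, the function returns SP0 unchanged, so the ambiguity is left for other rules or for the neural networks.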


Comparison

SP = SPR ∩ SPN, if SPR ∩ SPN ≠ ∅;

SP = SPR ∪ SPN, otherwise

SPN - the set of PoS tags produced by the neural networks

SPR - the set of PoS tags produced by the rule-based approach

SP - the resulting set of PoS tags

Example:

она 'she' [NPRO]

точно 'exactly' [ADJS, ADVB, CONJ, PRCL]

    Rules: [ADVB]

    Neural network: [ADVB:1.00(185), ADVB:1.00(335), CONJ:0.8871(1567)]

    Result: [ADVB]

перевела 'translated' [VERB]

текст 'text' [NOUN]
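The comparison step is a short set operation: take the intersection of the two tag sets when it is non-empty, otherwise fall back to their union. A minimal Python sketch follows; the function name `combine` is hypothetical, and ADVB and CONJ stand for the adverb and conjunction readings of точно in the example.

```python
def combine(spr, spn):
    """Merge rule-based tags (SPR) and neural-network tags (SPN) into SP."""
    common = set(spr) & set(spn)
    # Agreement between the two methods wins; otherwise keep every candidate.
    return common if common else set(spr) | set(spn)


# Example from the slide: the rules give the adverb reading,
# the networks give adverb and conjunction.
print(combine({"ADVB"}, {"ADVB", "CONJ"}))  # {'ADVB'}
```

The union fallback means the method never discards a candidate that only one of the two components proposed when they disagree completely, at the cost of leaving some ambiguity in the output.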


Results

                      OpenCorpora      RusCorpora
Total volume          33566 words      169908 words
Ambiguous words       5880 words       61137 words
Precision             96.11%           86.39%
Accuracy              99.02%           93.54%
Average cardinality   26.2%            48.64%
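The share of ambiguous words in each corpus follows directly from the volumes above and helps put the two columns in perspective. The short Python calculation below uses only the reported word counts; the function name `ambiguous_share` is hypothetical.

```python
# (total words, ambiguous words) as reported in the results.
corpora = {
    "OpenCorpora": (33566, 5880),
    "RusCorpora": (169908, 61137),
}


def ambiguous_share(total, ambiguous):
    """Percentage of words in the corpus that are PoS-ambiguous."""
    return 100.0 * ambiguous / total


for name, (total, ambiguous) in corpora.items():
    print(f"{name}: {ambiguous_share(total, ambiguous):.2f}% ambiguous")
```

Roughly 17.5% of the OpenCorpora sample is ambiguous against about 36% of the RusCorpora sample, so the second test set contains about twice the proportion of ambiguous words.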


Conclusion

The results show that the algorithm copes well with notional (content) parts of speech and worse with functional ones.

Future Work:

● Improve this algorithm for conjunctions, particles etc.;

● Take into account multi-word expressions and unknown words;

● Take into account punctuation.

Java implementation is available on our web-site:

http://176.9.34.20:8080/com.onpositive.text.webview/parsing/omonimy
