Computational Linguistics Colloquium

Modelling Inflection and Derivation of a Morphologically Rich Language

Jan Šnajder

University of Zagreb, Faculty of Electrical Engineering and Computing, TakeLab

Institute for Computational Linguistics, University of Heidelberg
December 6, 2012

Jan Šnajder (UNIZG FER) Modeling Morphologically Rich Language December 6, 2012 1 / 65

Why morphology modelling?

Morphological processing is a prerequisite for many NLP tasks

Typical morphological processing tasks: (1) grouping, (2) segmentation, (3) full morphological analysis

This work focuses on morphological processing for IR/TM: due to morphological variation, the “meaning of a word gets dispersed” among several of its morphological variants

This problem can be addressed by morphological normalization: the grouping of morphologically related words and their replacement with a single representative form


Why Croatian morphology?

Croatian is a heavily under-resourced language

Back in 2007: no morphological analyzer/lexicon

Initial goal: development of a procedure for inflectional and inflectional/derivational morphological normalization for the Croatian language, suitable for IR/TM applications

Working hypotheses:

1 The positive effect of morphological normalization on IR/TM applications depends on the quality of the normalization

2 Due to the morphological complexity of Croatian, satisfactory quality can only be reached using a linguistically-based approach

Focus gradually shifted towards building a full-blown model of Croatian morphology, suitable for morphological analysis in general


Outline

1 Background: morphology

2 Modelling inflection and derivation

3 Lexicon acquisition and paradigm guessing

4 Enter semantics: derivational relations

5 Evaluating morphological normalization

6 Wrap-up and future perspectives


By the end of this talk, you’ll. . .

1 have an impression of how difficult and tedious morphology modelling can be, and why it’s a good thing that someone else has already done it for your language

2 be familiar with one particular approach to modelling inflection and derivation

3 have an idea how to acquire a morphological lexicon using a morphology model

4 have a sense of why lexical semantics is important for derivation and vice versa

5 have an idea how one can evaluate morphological normalization

6 know a couple of Croatian words!


Morphology

1 Inflectional morphology deals with how word-forms are formed from the stem. Word-forms express grammatical features of the word.

- fish → fishes, kuća → kućom, bogat → najbogatijih

2 Derivational morphology deals with the derivation of new words from existing words using derivational affixes. The derived words are possibly of a different POS, but (almost always) of a different meaning than the basis word.

- fish → fishery, kuća → kućanski, bogat → bogatstvo

Typically accomplished by adding affixes (functional morphemes) to the word’s stem (the part of the word common to all inflectional variants).

3 Compounding deals with word formation that combines two or more existing words


Morphological typology (Pirkola, 2001)

Morphological complexity is determined by the degree of affixation and the difficulty of affix segmentation

Traditional morphological typology: isolating, agglutinative, fusional, and polysynthetic languages

Most languages are mixed types. It is possible to describe the morphological complexity of each language using two variables:

- index of synthesis (IS): the amount of affixation (the average number of morphemes per word in a language)

- index of fusion (IF): the ease with which affixes can be segmented in words (the average number of fused words in a language)

- defined separately for inflection, derivation, and compounding


Croatian morphology

Croatian is a morphologically complex (rich) language:

1 Many grammatical categories

- nouns: case, number; adjectives: case, gender, number, definiteness, degree; verbs: tense, person, gender, number. . .

2 Fusion of grammatical categories/suffix ambiguity

- vojnik-e – ruk-e

3 Morphological syncretisms/homography

- pile → pile/piliti/piti

4 Alternations:

- vojnik → vojniče, podatak → podatci/podaci

5 Many derivational patterns

- isključiti – isključivati – isključiv – isključen – isključivanje – isključivost – isključni – . . .


Croatian morphology – example

riba ribe ribi ribu ribo ribom ribama ribanje ribanja ribanju ribanjem ribanjima ribar ribara ribaru ribare ribarom ribari ribarima

ribarenje ribarenja ribarenju ribarenjem ribarenjima ribarica ribarice ribarici ribaricu ribarico ribaricom ribaricama ribariti ribarit

ribarim ribaris ribari ribarimo ribarite ribare ribareci ribarivsi ribario ribarila ribarilo ribarili ribarile ribarnica ribarnice ribarnici

ribarnicu ribarnico ribarnicom ribarnicama ribarski ribarskog ribarskoga ribarskom ribarskome ribarskomu ribarskim ribarskih

ribarskima ribarske ribarsko ribarska ribarskoj ribarsku ribarstven ribarstvena ribarstvenu ribarstvenim ribarstveni ribarstvenih

ribarstvenima ribarstvene ribarstveno ribarstvenoj ribarstvenom ribarstveni ribarstvenog ribarstvenoga ribarstvenom ribarstvenome

ribarstvenomu ribarstvenim ribarstvenih ribarstvenima ribarstvene ribarstveno ribarstvena ribarstvenoj ribarstvenu ribarstvo

ribarstva ribarstvu ribarstvom ribarstvima ribarstava ribic ribica ribicu ribicem ribici ribicima ribice ribici ribicija ribiciju ribicijem

ribiciji ribicijima ribicije ribicki ribickog ribickoga ribickom ribickome ribickomu ribickim ribickih ribickima ribicke ribicko ribicka

ribickoj ribicku riblji ribljeg ribljega ribljem ribljom ribljemu ribljim ribljih ribljima riblje riblja ribljoj riblju ribnjak ribnjaka

ribnjaku ribnjace ribnjakom ribnjaci ribnjacima ribnjake ribnjacarski ribnjacarskog ribnjacarskoga ribnjacarskom ribnjacarskome

ribnjacarskomu ribnjacarskim ribnjacarskih ribnjacarskima ribnjacarske ribnjacarsko ribnjacarska ribnjacarskoj ribnjacarsku

ribnjacarstvo ribnjacarstva ribnjacarstvu ribnjacarstvom ribnjacarstava ribnjacarstvima ribnjacki ribnjackog ribnjackoga

ribnjackom ribnjackome ribnjackomu ribnjackim ribnjackih ribnjackima ribnjacke ribnjacko ribnjacka ribnjackoj ribnjacku

ribnjicarstvo ribnjicarstva ribnjicarstvu ribnjicarstvom ribnjicarstvima ribnjicarstava


Word-and-paradigm approach

The three basic models of morphology description (Hockett, 1954):

1 Item-and-arrangement (IA) – assumes agglutinative structure

2 Item-and-process (IP), e.g., two-level morphology (Koskenniemi, 1983) – assumes agglutinative structure, at least on the lexical level

3 Word-and-paradigm (WP)

Croatian morphology is fusional, hence:

IA is inadequate because one morpheme often expresses many grammatical features

IP is inadequate because of frequent morphological syncretisms and morphologically conditioned alternations

In WP, word formation rules are defined w.r.t. the grammatical categories (which may be fused) and the paradigm (a class of words with similar inflection rules)


Morphological normalization

Morphological normalization conflates morphological variants of a single word or many words to a single and unique form, the morphological norm

- {walk, walking, walks, walked} → walk
- {fish, fishes, fisher, fishers, fishery, . . . } → fish

Two main approaches:

1 Stemming – reducing inflectional/derivational variants to a (pseudo)stem

- princesses → princ

2 Lemmatization – reducing inflectional variants to a linguistically valid, canonical form

- princesses → princess


Morphological normalization – typical approaches

Lexicon-based lemmatization

Rule-based stemming

- (Lovins, 1968), (Paice, 1990), (Porter, 1980), Snowball project

Hybrid stemming (rules+corpus)

- (Krovetz, 1993), (Xu & Croft, 1998)

Machine learning-based lemmatization

- (Džeroski & Erjavec, 2000), (Mladenić, 2002), (Plisson et al., 2008)

Unsupervised stemming (morphology induction)

- (Hafer & Weiss, 1974), (Adamson & Boreham, 1974; Majumder et al., 2007), (Goldsmith, 2000), (Melucci & Orio, 2003), (Schone & Jurafsky, 2001), (Gelbukh et al., 2004)


Higher Order Functional Morphology (HOFM) (Šnajder & Dalbelo Bašić, 2008)

Inspired by the functional programming paradigm (Hudak, 1989)

- functions as abstractions of inflectional and derivational word formation rules

- basic rules are modelled as higher-order functions (functions that map from/to functions)

- function composition serves to combine rules

Main idea: morphological descriptions that closely resemble those found in traditional grammar books

Generative-reductive model

Language-independent model


Transformations and HOFs

Transformation function t : S → S defines the stem transformation

The number of different transformation types is limited in practice: suffixation, prefixation, alternations, or combinations thereof

More complex transformations can be defined indirectly using HOFs

sfx : S → (S → S)   suffixation
pfx : S → (S → S)   prefixation
asfx : ℘(S × S) → (S → S)   suffix alternation

sfx(a)(vojnik) = vojnika

plt = asfx({(k, č), (g, ž), (h, š)})

(sfx(e) ∘ plt)(vojnik) = vojniče

(sfx(emu) ∘ plt ∘ pfx(naj))(jak) = najjačemu
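
The higher-order functions on this slide can be sketched in Python as follows. This is only an illustration, not the HOFM implementation: the helper `compose` and the decision to raise `ValueError` when no alternation applies are assumptions of the sketch, and the Croatian diacritics (č, ž, š) are written out explicitly.

```python
def sfx(suffix):
    """sfx : S -> (S -> S). Returns a function that appends a suffix."""
    return lambda stem: stem + suffix


def pfx(prefix):
    """pfx : S -> (S -> S). Returns a function that prepends a prefix."""
    return lambda stem: prefix + stem


def asfx(pairs):
    """asfx : P(S x S) -> (S -> S). Replaces the stem ending per the pairs."""
    def alternate(stem):
        for old, new in pairs:
            if stem.endswith(old):
                return stem[:len(stem) - len(old)] + new
        # Assumption of this sketch: an inapplicable alternation is an error.
        raise ValueError(f"no alternation applies to {stem!r}")
    return alternate


def compose(*fs):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    def composed(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return composed


# Palatalization as a suffix alternation.
plt = asfx({("k", "č"), ("g", "ž"), ("h", "š")})

print(sfx("a")("vojnik"))
print(compose(sfx("e"), plt)("vojnik"))
print(compose(sfx("emu"), plt, pfx("naj"))("jak"))
```

Because every rule is an ordinary function from stems to stems, combining rules needs nothing beyond function composition, which is the point the slide makes.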


Inflectional paradigm

Defines the distinct word-forms (along with the morphological tags) as transformations of the stem, as well as paradigm applicability conditions

Inflectional paradigm

p = (c, {(t0, x0), . . . , (ti, xi), . . . , (tn, xn)})

c : S → {⊤, ⊥} – condition on the stem
ti : S → S – transformation functions
xi – morphological tags

By convention, t0 transforms the stem s into lemma l, i.e., l = t0(s).


Inflectional paradigm – example 1

Case  Singular    Plural
N.    vojnik-∅    vojnic-i
G.    vojnik-a    vojnik-a
D.    vojnik-u    vojnic-ima
A.    vojnik-a    vojnik-e
V.    vojnič-e    vojnic-i
L.    vojnik-u    vojnic-ima
I.    vojnik-om   vojnic-ima

pvojnik = (λs.(ends({k, g, h}) ∧ ¬cgr), {nul, sfx(a), sfx(u), sfx(om), sfx(e) ∘ plt, sfx(i) ∘ sbl, sfx(ima), sfx(e)})
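
A paradigm of this shape can be read as a condition plus a list of stem transformations. The sketch below generates the word-forms of vojnik under two labelled assumptions: the cgr test is left out because the slide does not define it, and sfx(ima) is composed with sbl so that the generated plural form matches the table entry vojnic-ima. The helper names are illustrative only.

```python
def sfx(suffix):
    return lambda stem: stem + suffix


def asfx(pairs):
    def alternate(stem):
        for old, new in pairs:
            if stem.endswith(old):
                return stem[:len(stem) - len(old)] + new
        return stem  # assumption: leave the stem unchanged if nothing applies
    return alternate


def compose(*fs):
    def composed(x):
        for f in reversed(fs):
            x = f(x)
        return x
    return composed


nul = lambda stem: stem                            # identity transformation
plt = asfx({("k", "č"), ("g", "ž"), ("h", "š")})   # palatalization
sbl = asfx({("k", "c"), ("g", "z"), ("h", "s")})   # sibilarization

# Paradigm = (condition on the stem, stem transformations).
# Assumptions: no cgr check; sfx(ima) is combined with sbl (see lead-in).
p_vojnik = (
    lambda s: s.endswith(("k", "g", "h")),
    [nul, sfx("a"), sfx("u"), sfx("om"), compose(sfx("e"), plt),
     compose(sfx("i"), sbl), compose(sfx("ima"), sbl), sfx("e")],
)


def word_forms(stem, paradigm):
    """Distinct word-forms of a stem, or an empty set if the condition fails."""
    condition, transformations = paradigm
    if not condition(stem):
        return set()
    return {t(stem) for t in transformations}


print(sorted(word_forms("vojnik", p_vojnik)))
```

The condition function is what keeps the paradigm from being applied to stems such as riba, which do not end in k, g, or h.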


Inflectional pattern – example 2

pbrz = (c, Ppi ∪ Ppd ∪ {ti ∘ tc : ti ∈ Pc} ∪ {ti ∘ tc ∘ ts : ti ∈ Ps})

Derivational pattern (Šnajder & Dalbelo Bašić, 2010)

Defines the transformation of the basis word into the derived word and the word classes of the basis and the derived word

Derivational pattern

d = (t, P1, P2)

t transforms the basis word’s stem into the derived word’s lemma
P1 and P2 are the basis and the derived word’s class, respectively

E.g., pattern brz → brzina:

d1 = (sfx(ina), {pA1, pA2}, {pN2, pN3, pN4})


Derivational pattern

Remarks:

(1) different patterns may be used to derive a word of the same semantic class

(2) an identical transformation may be used to derive words of different semantic classes (e.g., suffix ambiguity). If the word classes of the basis and the derived words are also identical, then the pattern itself is ambiguous.

(1) is not a concern for morphological normalization, while (2) occurs rarely and may be ignored


Derivation

Derives LPPs (lemma-paradigm pairs) from a given LPP and a derivational pattern

lDerive(l1, p1, (t, P1, P2)) =
    {(l2, P2*) : l2 ∈ (t ∘ t0⁻¹)(l1)}   if p1 ∈ P1 ∧ P2* ≠ ∅
    ∅                                   otherwise

P2* = {p2 ∈ P2 : p2 is applicable to l2}

Example: d = (sfx(ak) ∘ try(plt), mNouns, mNouns)

lDerive(smijeh, pN04, d) = {(smiješak, {pN28, pN47})}
lDerive(struk, pN04, d) = {(stručak, {pN28, pN47})}
lDerive(jabuka, pN28, d) = ∅


HOFM extensions

1 Ambiguous transformations

- transformations modelled as non-functional relations t : S → ℘(S)

- motivated by the fact that many transformations are not injective
- E.g., palatalization: k/č, g/ž, h/š, c/č, z/ž

(sfx(e) ∘ plt)⁻¹(vojniče) = {vojnik, vojnic}

2 Morphological optionality

- modelling doublets and conditioning of alternations

(sfx(i) ∘ opt(sbl))(tvrtk) = {tvrtci, tvrtki}

(sfx(ovi) ∘ try(rifx(ije, je)))(cvijet) = {cvjetovi}
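
One way to picture the set-valued extension is to let every transformation return a set of stems, with the empty set meaning "not applicable". The sketch below is an assumption-laden reading of `opt`, `try`, and `rifx`, not the actual HOFM code: `opt(t)` keeps the unchanged stem alongside the results of `t`, `try_(t)` falls back to the unchanged stem only when `t` fails, and `rifx` replaces the last occurrence of an infix.

```python
def sfx(suffix):
    return lambda stem: {stem + suffix}


def asfx(pairs):
    """Suffix alternation; returns the empty set if no pair applies."""
    def alternate(stem):
        return {stem[:len(stem) - len(old)] + new
                for old, new in pairs if stem.endswith(old)}
    return alternate


def rifx(old, new):
    """Infix replacement (assumed here to target the last occurrence)."""
    def replace(stem):
        i = stem.rfind(old)
        return set() if i < 0 else {stem[:i] + new + stem[i + len(old):]}
    return replace


def opt(t):
    """Optional transformation: results of t plus the unchanged stem."""
    return lambda stem: t(stem) | {stem}


def try_(t):
    """Apply t if possible, otherwise keep the stem unchanged."""
    return lambda stem: t(stem) or {stem}


def compose(*ts):
    """Right-to-left composition lifted to set-valued transformations."""
    def composed(stem):
        results = {stem}
        for t in reversed(ts):
            results = {out for s in results for out in t(s)}
        return results
    return composed


sbl = asfx({("k", "c"), ("g", "z"), ("h", "s")})  # sibilarization

print(compose(sfx("i"), opt(sbl))("tvrtk"))
print(compose(sfx("ovi"), try_(rifx("ije", "je")))("cvijet"))
```

With this representation a doublet is simply a result set with two members, and a failed transformation propagates as an empty set unless it is wrapped in `try_` or `opt`.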


Model of Croatian morphology

Transformations

- suffixation, prefixation, phonologically and morphologically conditioned alternations

Inflectional morphology

- 93 paradigms: 48 for nouns, 32 for verbs and 13 for adjectives
- morphosyntactic tags according to MULTEXT-East (Erjavec et al., 2003)

Derivational morphology

- 244 patterns for suffixal derivation between and among nouns, verbs, and adjectives


Model of Croatian morphology – transformations

Function  Description                   Example of application
rsfx      Stem suffix replacement       rsfx(ij, ik)(akademija) = {akademik}
sfx       Suffixation                   sfx(a)(vojnik) = {vojnika}
pfx       Prefixation                   (pfx(naj) ∘ sfx(a) ∘ jot)(brz) = {najbrži}
pca1      PC stem alternation 1         (sfx(a) ∘ pca1)(vrabc) = {vrapca}
pca2      PC stem alternation 2         (pca2 ∘ sfx(ba))(svat) = {svadba}
sbl       Sibilarization                (sfx(i) ∘ sbl)(vojnik) = {vojnici}
plt       Palatalization                (sfx(e) ∘ plt)(vojnik) = {vojniče}
jot       Jotation                      (sfx(u) ∘ jot)(krv) = {krvlju}
acg       Consonant group alternation   (sfx(u) ∘ acg)(mast) = {mašću}
exa       Stem extension with a         (sfx(a) ∘ exa)(vrabc) = {vrabaca}
exe       Stem extension with e         exe(Čakovc) = {Čakovec}
jat1      Yat reflex alternation ije/je (sfx(ovi) ∘ jat1)(snijeg) = {snjegovi}
jat2      Yat reflex alternation ije/e  (sfx(ovi) ∘ jat2)(brijeg) = {bregovi}
jat3      Yat reflex alternation ije/i  (sfx(vati) ∘ jat3⁻¹)(izli) = {izlijevati}


Model of Croatian morphology – derivational patterns

Group  Semantic category of the derived word        #   Example
N-1    Masculine agent nouns                        21  banka → bankar
N-2    Masculine noun expressing a characteristic   5   sretan → sretnik
N-3    Masculine nouns for a follower               3   Franjo → franjevac
N-4    Female person nouns                          11  prijatelj → prijateljica
N-5    Nouns for male and female person             5   izdati → izdajica
N-6    Demonyms and ethnonyms                       11  Varaždin → Varaždinac
N-7    Nouns for animals and plants                 6   otrovan → otrovnica
N-8    Nouns for inanimate objects                  11  mijenjati → mjenjač
N-9    Nouns for places                             9   cigla → ciglana
N-10   Abstract nouns                               18  prijatelj → prijateljstvo
N-11   Deverbal (action) nouns                      24  čuvati → čuvanje
N-12   Diminutives and augmentatives                19  orah → oraščić
N-13   Collective nouns                             7   radnik → radništvo
N-14   Other types of nouns                         6   brod → brodarina

A-1    Qualifying adjectives                        35  mrak → mračan
A-2    Possessive adjectives                        19  djed → djedov
A-3    Passive verb adjectives                      9   spasiti → spašen

V-1    Imperfective verbs                           12  baciti → bacati
V-2    Diminutive and pejorative verbs              6   govoriti → govorkati
V-3    Verbs derived from nouns                     5   večera → večerati
V-4    Verbs derived from adjectives                2   sitan → sitniti


Derivational transformations

1 Phonologically conditioned alternations are implicit in the model (attempted before every suffixation)

- d = (sfx(ski), nouns, pAdjectives)
  klub → klupski, but automat → automatski

2 Morphologically conditioned alternations are modelled explicitly

- d = (sfx(avati) ∘ try(jot), tiVerbs, tiVerbs)
  onečistiti → onečišćavati, but odobriti → odobravati

3 Transformation ambiguity 1: multiple derivations are possible

- d = (sfx(ovit) ∘ try(jat1|jat2), mNouns, qAdjectives)
  brijeg → bregovit/brjegovit

4 Transformation ambiguity 2: inconsistencies

- d = (sfx(ar) ∘ opt(jot), fNouns, mNouns)
  tvornica → tvorničar, but biblioteka → bibliotekar


Lexicon acquisition (Šnajder et al., 2008)

Due to grammar ambiguity (overgeneration), a single word-form may allow for the application of many inflectional patterns

⇒ Each word-form is lemmatized to 17 candidate LPPs on average (despite condition functions)

Morphological normalization that would make direct use of the grammar (reductive direction) would lack precision

The way to address this is to first use the grammar to acquire a lexicon from a corpus, and then to use the lexicon for normalization

An inflectional lexicon is a set of entries, each entry being a set of LPPs


Acquisition algorithm in a nutshell

1 Apply the grammar in the reductive direction to a chosen word-form from the corpus in order to obtain a set of LPP candidates. One (or more, in case of homography) of the obtained LPPs will be correct, while the others are spurious

2 Apply the grammar in the generative direction to each LPP candidate in order to generate all its word-forms

3 Score the plausibility of each LPP candidate based on the frequency information from the corpus

- It is assumed that a correct LPP will have more of its word-forms attested in the corpus; spurious LPPs tend to generate invalid (non-existent) word-forms

- The best scored LPP constitutes a new lexicon entry
- If there’s a tie, the best scored LPPs constitute an unresolved lexicon entry


Acquisition algorithm

acquireLexicon(W0) = acquire(W0)

function acquire(W) returns a set of lexicon entries {Ei}
  if W ≠ ∅ then
    w = (any ∘ wfpref)(W)
    LPP = {(l, p) ∈ lmP(w) : accept(W, l, p)}
    if LPP ≠ ∅ then
      bLPP = argmax_{(l,p) ∈ LPP} score(W0, l, p)
      E = {lpp ∈ LPP : tie(W0, bLPP, lpp)}
      WFS = ⋃_{(l,p) ∈ E} lWfs(l, p)
      return {E} ∪ acquire(W \ WFS)
    else
      return acquire(W \ {w})
  else
    return ∅
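
The acquisition loop can be tried out on a toy grammar. Everything named below is hypothetical and chosen only for illustration: the two suffix-list paradigms `N-a` and `N-0`, the `candidates` and `generate` helpers standing in for the reductive and generative directions, the use of `min` in place of (any ∘ wfpref), and the omission of the accept filter. The sketch is iterative rather than recursive and scores candidates with the number of attested word-form types.

```python
# Toy grammar (hypothetical): a paradigm is a list of suffixes,
# the first of which forms the lemma.
PARADIGMS = {
    "N-a": ["a", "e", "i", "u", "o", "om", "ama"],
    "N-0": ["", "a", "u", "om", "e", "i", "ima"],
}


def generate(lemma, paradigm):
    """Generative direction: all word-forms of a lemma-paradigm pair."""
    suffixes = PARADIGMS[paradigm]
    stem = lemma[:len(lemma) - len(suffixes[0])]
    return {stem + s for s in suffixes}


def candidates(word_form):
    """Reductive direction: all LPPs that could have produced the word-form."""
    lpps = set()
    for name, suffixes in PARADIGMS.items():
        for s in suffixes:
            stem = word_form[:len(word_form) - len(s)]
            if word_form.endswith(s) and stem:
                lpps.add((stem + suffixes[0], name))
    return lpps


def score0(corpus, lemma, paradigm):
    """Number of generated word-form types attested in the corpus."""
    return len(generate(lemma, paradigm) & corpus)


def acquire_lexicon(corpus):
    remaining, lexicon = set(corpus), []
    while remaining:
        w = min(remaining)  # deterministic stand-in for (any . wfpref)
        lpps = candidates(w)
        if not lpps:
            remaining.discard(w)
            continue
        best = max(score0(corpus, l, p) for l, p in lpps)
        # All LPPs tied for the best score form one (possibly unresolved) entry.
        entry = {(l, p) for l, p in lpps if score0(corpus, l, p) == best}
        covered = set().union(*(generate(l, p) for l, p in entry))
        lexicon.append(entry)
        remaining -= covered | {w}  # also drop w itself to guarantee progress
    return lexicon


corpus = {"riba", "ribe", "ribu", "ribom", "ribama", "ribar", "ribara", "ribaru"}
print(acquire_lexicon(corpus))
```

On this corpus the word-form riba yields three candidate LPPs; the candidate with lemma riba and paradigm `N-a` wins because five of its generated forms are attested, against four for the spurious lemma rib.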


LPP scoring measures

Basic (frequency-based) measures:

score0(W, l, p) = |lWfs(l, p) ∩ W|
score1(W, l, p) = ∑_{w ∈ lWfs(l, p)} #(w, W0)
score2(W, l, p) = |lWfs(l, p) ∩ W| / |lWfs(l, p)|

Heuristic measure:

scoreH(W, l, p) = 10 · score1(W, l, p) + β1 + β2

β1 = 4 if p ∈ P_N; 0 if p ∈ P_Aqd; 2 otherwise
β2 = 1 if l ∈ W0; 0 otherwise

Probabilistic measures: defined based on the expected distribution of word-forms P(t|p)
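
The three basic measures are easy to compute once the corpus is available as a frequency table. The sketch below assumes the corpus is a `collections.Counter` of word-form frequencies and that `forms` is the set of word-forms generated for one LPP; the sample counts are made up for illustration.

```python
from collections import Counter


def score0(corpus, forms):
    """Number of generated word-form types attested in the corpus."""
    return sum(1 for w in forms if corpus[w] > 0)


def score1(corpus, forms):
    """Sum of corpus frequencies of the generated word-forms."""
    return sum(corpus[w] for w in forms)


def score2(corpus, forms):
    """Proportion of generated word-form types attested in the corpus."""
    return score0(corpus, forms) / len(forms)


# Illustrative (made-up) frequencies and generated word-forms.
corpus = Counter({"riba": 5, "ribe": 3, "ribu": 2})
forms = {"riba", "ribe", "ribi", "ribu"}

print(score0(corpus, forms), score1(corpus, forms), score2(corpus, forms))
```

Unlike score0 and score1, score2 normalizes by paradigm size, so an LPP is penalized for generating many unattested forms.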


LPP scoring measures – evaluation

Measure    P      R       F1
All        4.33   100     8.29
Arbitrary  0.40   0.40    0.40

score0     36.66  98.40   53.42
score1     36.98  98.80   53.81
score2     70.29  77.60   73.76

scoreH     70.78  94.00   80.76

scoreP1    74.28  82.00   77.95
scoreP2    71.27  76.40   73.75
scoreP3    65.25  67.60   66.40

score02    79.43  89.60   84.21
score12    79.79  90.00   84.59
score1P2   80.00  91.20   85.23
score1P3   87.65  88.00   87.82


A machine learning approach: paradigm guessing (Šnajder, 2012)

Guess the inflectional paradigm (and lemma) of a given word-form

1 Use the grammar (reductive+generative direction) to generate candidate LPPs

2 Use supervised machine learning to train a model to decide which LPP is correct based on a number of features

We focus on machine learning aspects: what are the relevant features and how well can we do?


Paradigm guessing – problem definition

Given word-form w, determine its correct stem s and its correct inflectional paradigm p

Given p, the lemma l can be derived from the stem s and vice versa, thus the problem can be recast as:

Problem definition

Given word-form w, determine its correct lemma-paradigm pair (LPP) (l, p). An LPP is correct iff l is valid and p generates the valid word-forms of the stem obtained from l.

⇒ Binary classification problem (which candidate LPP is correct?)


Features

String-based features – orthographic properties of lemma/stem

- incorrect LPPs tend to generate ill-formed stems/lemmas

Corpus-based features – frequencies or probability distributions of word-forms/morphological tags in the corpus

- a correct LPP should have more of its word-forms attested in the corpus

- every inflectional paradigm has its own distribution of morphological tags P(x|p). A correct LPP will generate word-forms that obey such a distribution

Other features – paradigmId and POS

22 features in total (146 binary-encoded features)


String-based features

1 EndsIn

2 EndsInCgr

3 EndsInCons

4 EndsInNonPals

5 EndsInPals

6 EndsInVelars

7 LemmaSuffixProb – the probability P(sl|p)

8 StemSuffixProb – the probability P(ss|p)

9 StemLength

10 NumSyllables

11 OneSyllable


Corpus-based features

1 LemmaAttested

2 Score0 – number of attested word-form types

3 Score1 – sum of corpus frequencies of word-forms

4 Score2 – proportion of attested word-form types

5 Score3 – product of P(x|p) and P(x|l, p)

6 Score4 – expected number of attested word-form types

7 Score5 – Kullback-Leibler divergence between p1 = P(x|p) and p2 = P(x|l, p)

8 Score6 – Jensen-Shannon divergence between p1 and p2

9 Score7 – cosine similarity between p1 and p2

Estimated on Vjesnik newspaper corpus (23 MW)
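
The three distribution-comparison features (Score5 to Score7) can be sketched as below. Distributions are represented as dictionaries from morphological tags to probabilities; the tag names and probabilities are invented for illustration, and returning infinity from the KL divergence when p2 assigns zero probability to a tag is an assumption of this sketch (the talk does not say how such cases are smoothed).

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), natural logarithm."""
    total = 0.0
    for x, px in p.items():
        if px > 0:
            qx = q.get(x, 0.0)
            if qx == 0:
                return math.inf  # assumption: no smoothing is applied
            total += px * math.log(px / qx)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and always finite."""
    keys = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_similarity(p, q):
    """Cosine of the angle between the two probability vectors."""
    keys = set(p) | set(q)
    dot = sum(p.get(x, 0.0) * q.get(x, 0.0) for x in keys)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


# Invented tag distributions: p1 for a paradigm, p2 for one candidate LPP.
p1 = {"Nsn": 0.5, "Nsg": 0.3, "Nsa": 0.2}
p2 = {"Nsn": 0.6, "Nsg": 0.4}

print(kl_divergence(p1, p2), js_divergence(p1, p2), cosine_similarity(p1, p2))
```

The example shows why a symmetric, bounded measure is convenient here: a candidate LPP with an unattested tag makes the unsmoothed KL divergence infinite, while the Jensen-Shannon divergence stays between 0 and ln 2.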


Experimental setting

SVM with an RBF kernel (#features ≪ #examples)

Training/testing data: hand-validated automatically acquired inflectional lexicon

Positive examples: LPPs sampled from the lexicon – 5,000 for training and 5,000 for testing

Negative examples: generated using the grammar – 5,000 for training and 5,000 for testing

Total: 10,000 examples for training and 10,000 examples for testing


Evaluation – feature analysis

Some features are redundant while others may be irrelevant

Top-5 features with univariate filter selection:

- IG: StemSuffixProb, LemmaSuffixProb, Score6, Score5, Score7
- GR: StemSuffixProb, LemmaSuffixProb, LemmaAttested, Score0, Score5
- RELIEF: ParadigmId, EndsIn, LemmaSuffixProb, Score5, Score2

Some features consistently low-ranked (e.g. POS, Score1)

Multivariate feature subset selection:

- CFS: StemSuffixProb, LemmaAttested, Score0
- CSS: . . . (13 features)


Evaluation – classification accuracy

Accuracy by number of word-forms attested:

Features (count)    ≥ 1     ≤ 100   ≤ 10
All (22)            91.97   91.94   90.65
String-based (13)   87.01   87.69   87.98
Corpus-based (11)   87.78   86.59   82.04
IG (5)              81.14   79.05   76.46
GR (5)              59.76   80.90   77.29
RELIEF (5)          90.62   90.60   89.27
CFS (3)             81.69   79.51   78.67
CSS (13)            27.41   91.56   90.37

Baseline            50.00   56.51   69.92

Subset of 5 features gives almost as good results as the full feature set

Decrease of accuracy on rare words is minimal


Derivational relation

A derivational pattern implicitly defines a derivational relation between LPPs as a binary relation on S × P:

Derivational relation →d

(l1, p1) →d (l2, p2) ⇐⇒ (l2, P2*) ∈ lDerive(l1, p1, d) ∧ p2 ∈ P2*

E.g. (smijeh, pN04) →d (smiješak, pN47) holds, but (smijeh, pN04) →d (stručak, pN47) doesn’t

→d does not imply actual derivational relatedness. It merely implies:

- surface-forms match
- lexical categories match


Building derivational families

Grouping of (potentially) derivationally related LPPs

L/=D is the set of derivational families of L

=D is the reflexive and symmetric closure of the transitive relation →*D

Relation w1 →*D w2:

∃d ∈ D. ((w1 →d w2) ∨ ∃w3 ∈ L. ((w1 →d w3) ∧ (w3 →*D w2)))

L/=D is efficiently computable as the WCCs (weakly connected components) of the induced digraph
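
Computing the families as weakly connected components amounts to ignoring edge direction and merging everything that is linked. A minimal union-find sketch, with an illustrative toy lexicon and hand-picked derivation edges (in the model these edges would come from the relation →d):

```python
def derivational_families(lexicon, edges):
    """Group lexicon entries into weakly connected components."""
    parent = {w: w for w in lexicon}

    def find(w):
        # Find the representative of w, halving the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for basis, derived in edges:
        parent[find(basis)] = find(derived)  # direction is irrelevant for WCCs

    families = {}
    for w in lexicon:
        families.setdefault(find(w), set()).add(w)
    return list(families.values())


# Illustrative entries and derivation edges (basis, derived).
lexicon = ["riba", "ribar", "ribarski", "brz", "brzina", "grad"]
edges = [("riba", "ribar"), ("ribar", "ribarski"), ("brz", "brzina")]

for family in derivational_families(lexicon, edges):
    print(sorted(family))
```

Entries without any derivational link, such as grad here, end up as singleton families, which is consistent with the relation being reflexive.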


Problems due to derivational semantics

(1) Spurious derivations: surface-form related words need not be semantically related

- E.g., šal → *šalica, nov → *novac
- Average ratio of spurious derivations across patterns: 7.8%

Group  Suffix  Example                  Match  Mismatch  Mismatch (%)
A-1    -an     beskraj → beskrajan      108    9         7.7
A-3    -en     dogovoriti → dogovoren   79     0         0
N-11   -nje    čitati → čitanje         64     0         0
A-2    -ski    autor → autorski         43     1         2.3
N-10   -ost    aktivan → aktivnost      40     2         4.8
A-3    -an     čitati → čitan           37     3         7.5
N-10   -ost    duhovit → duhovitost     28     0         0
N-11   -enje   donositi → donošenje     23     0         0
N-11   -a      nagraditi → nagrada      21     1         4.6
V-1    -ati    javiti → javljati        18     2         10.0


Problems due to derivational semantics

(2) Vagueness and ambiguity of semantic relations

1 Semantic shift (strength of relation): baciti → bacati vs. baciti → bacač

2 Polysemy: izdavač – izdati – izdajica

3 Non-denotational aspects of meaning (register, connotation): činovnik – činovničić

4 “Distance” of derivational relation: sluga – služiti – služben

5 Opaqueness (idioms, metaphors): morski pas


Strength of derivational relations

To prevent loss of information, derivational normalization may be restricted to specific subsets of the derivational patterns.

Derivational relations may be grouped according to the perceived strength of semantic relation:

Level 1 – strong relatedness: A-2, V-1 (possessive adjectives and aspectual verb pairs)

Level 2 – moderate relatedness: A-1, A-3, V-3, V-4, N-4, N-11 (descriptive and deverbal adjectives, verbs from adjectives and nouns, male/female noun pairs, deverbal nouns)

Level 3 – weak relatedness: all other groups


Derivational lexica

Equivalence classes over the inflectional lexicon:

Lexicon  Entries  Reduction (%)  Avg. class size  Max. class size
L5       47,415   –              –                –
L5-D1    44,477   6.20           1.07             5
L5-D2    38,158   19.52          1.24             19
L5-D3    34,310   27.64          1.38             53


Evaluating morphological normalization

As usual, we can take two approaches:

1 Extrinsic evaluation – measures the effect on an IR/IE task

2 Intrinsic evaluation – measures directly the quality of morphological normalization

Extrinsic evaluation is important, but it does not give us an insight into how the normalization procedure works

More importantly, extrinsic evaluation does not differentiate between the cases in which normalization fails and the cases in which it is useless

We focus on intrinsic evaluation


Intrinsic evaluation

Error counting method according to (Paice, 1996)

1 Understemming – not conflating morphologically related word-forms to the same stem (norm)

- fishery → fisher, fisher → fish
- slikom → slik, slici → slic

2 Overstemming – conflating morphologically unrelated word-forms to the same stem (norm)

- division → div, divine → div
- stanom → stan, stanica → stan


Intrinsic evaluation

Errors are counted on a manually compiled sample of word groups

Understemming index:

UI = #(pairs in each group normalized to different norms) / #(pairs in the groups)

Overstemming index:

OI = #(pairs from different groups normalized to the same norm) / #(pairs in the sample)
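
Both indices reduce to counting word pairs. The sketch below makes two assumptions that should be checked against the original method: every word belongs to exactly one group, and the overstemming denominator counts all word pairs in the sample. The groups reuse word-forms from the talk, while the norms assigned to them are hypothetical stemmer output.

```python
from fractions import Fraction
from itertools import combinations


def stemming_indices(groups, norm):
    """Return (UI, OI) for a sample of word groups and a word -> norm mapping."""
    words = [w for g in groups for w in g]
    group_of = {w: i for i, g in enumerate(groups) for w in g}

    pairs_within = sum(len(g) * (len(g) - 1) // 2 for g in groups)
    pairs_total = len(words) * (len(words) - 1) // 2

    # Understemming: same group, different norms.
    under = sum(1 for g in groups for a, b in combinations(g, 2)
                if norm[a] != norm[b])
    # Overstemming: different groups, same norm.
    over = sum(1 for a, b in combinations(words, 2)
               if group_of[a] != group_of[b] and norm[a] == norm[b])

    return Fraction(under, pairs_within), Fraction(over, pairs_total)


groups = [["stan", "stanom"], ["stanica", "stanice"], ["slikom", "slici"]]
norm = {  # hypothetical stemmer output
    "stan": "stan", "stanom": "stan",
    "stanica": "stan", "stanice": "stanic",
    "slikom": "slik", "slici": "slic",
}

ui, oi = stemming_indices(groups, norm)
print(ui, oi)
```

Here two of the three within-group pairs are split (UI = 2/3) and two of the fifteen pairs in the sample are wrongly merged (OI = 2/15). A sample consisting only of singleton groups would need a guard against division by zero.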


Stemming indices – example

UI = (1 · 3 + 1 · 1 + 1 · 3 + 0) / (C(5, 2) + C(2, 2)) = 7/11 = 63.6%

OI = (5 · 1 + 5 · 1) / C(7, 2) = 10/21 = 47.6%

(C(n, k) denotes the binomial coefficient.)


Word sample

Determining group boundaries is notoriously difficult

Paice groups at two levels: “tight” and “loose” groups

The division is based on (an estimate of) the strength of semantic relation

Problem: lexical semantics enters into play (degree of similarity? polysemy? what is tight and what is loose?)

(1) {appropriate, appropriately} {appropriations}

(2) {author, author’s, authors, authorship}

(3) {authoritarian} {authoritative} {authority, authorities}{authorized} {authorization}

(4) {cost, costing, costed, costs} {costly}


Modified Paice’s method (Snajder, 2010)

Grouping according to morphological, rather than lexico-semantic, relations:

1 Inflectional groups
2 Derivational groups

Under- and overstemming errors are considered independently of the semantic errors ⇒ morphology and semantics are kept apart

Simplifies grouping:

  - The boundary between morphology and semantics is well defined (except perhaps for a few derivations)

  - The boundary between inflection and derivation is well defined

Semantics is kept out of the picture and can be addressed separately (“semantically agnostic” evaluation)


Word sample

10,000 word-forms from the Vjesnik corpus (Culture section)

5510 inflectional groups (on average 1.82 word-forms)

3773 derivational groups (on average 1.46 inflectional groups)


Word sample – example

(1) {arheolog} {arheologija, arheologiju} {arheoloska, arheoloske, arheoloski, arheoloskih}

(2) {arhitekt, arhitekta} {arhitekturi, arhitekturama} {arhitektonske, arhitektonskih, arhitektonskim}

(3) {arhiva, arhivima, arhivu} {arhivske, arhivskim, arhivskoj} {arhivar}

(4) {arija, arije, ariju}

(5) {izdajicom} {izdatke} {izdat, izdati} {izdaje, izdavati} {izdaje, izdajom} {izdanja, izdanje, izdanjem, izdanjima} {izdavaca, izdavace, izdavaci} {izdavacka}

(6) {grad, grada, grade, gradom, gradova} {gradska, gradske}

(7) {grade, gradimo, graditi} {gradnja, gradnje}


Measuring stemming quality

Trade-off between under- and overstemming

Inflectional/derivational stemming quality:

$$iSQ = \frac{2(1 - iUI)(1 - iOI)}{2 - iUI - iOI} \qquad dSQ = \frac{2(1 - dUI)(1 - dOI)}{2 - dUI - dOI}$$

(iUI, iOI – on inflectional groups; dUI, dOI – on derivational groups)
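Read this way, each SQ score is the harmonic mean of the two "success rates" 1 − UI and 1 − OI, much as the F1 measure combines precision and recall. The rewriting below makes this explicit. The numerical check uses the L5 figures (iUI = 5.79%, iOI = 5.56%) from the comparison table later in the talk.

```latex
\begin{align*}
SQ &= \frac{2(1-UI)(1-OI)}{(1-UI)+(1-OI)}
    = \frac{2}{\dfrac{1}{1-UI}+\dfrac{1}{1-OI}} \\[6pt]
iSQ_{\mathrm{L5}} &= \frac{2 \cdot 0.9421 \cdot 0.9444}{2 - 0.0579 - 0.0556}
    = \frac{1.7794}{1.8865} \approx 0.9432
\end{align*}
```

Because the mean is harmonic, a stemmer cannot compensate for heavy overstemming by understemming very little, or vice versa: the lower of the two rates dominates the score.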

Stemming is evaluated separately at the inflectional and the derivational level

In general, normalization may occur at both levels. Separate indices don’t tell us much about the stemming quality in such cases

But how much derivational normalization is “right”?


Measuring stemming quality

Inflectionally related word-forms must be conflated to a single norm, whereas derivationally unrelated word-forms must not

Combined inflectional-derivational stemming quality:

$$idSQ = \frac{2(1 - iUI)(1 - dOI)}{2 - iUI - dOI}$$

Full inflectional conflation is the lower bound for stemming and full derivational conflation the upper bound; between these there is a “semantic gray area” (due to derivation)

Perfect stemmers will have idSQ = 1, but may still differ in how much derivation they address
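The last point can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the function names, the two hypothetical normalizers (`lemmatizer` and `conflater`) and their outputs, and the use of all sample pairs as the OI denominator. The word groups are a subset of the word-sample examples shown earlier.

```python
from itertools import combinations
from math import comb


def understemming_index(groups, norm):
    """Share of within-group pairs that receive different norms."""
    total = sum(comb(len(g), 2) for g in groups)
    errors = sum(norm[a] != norm[b]
                 for g in groups for a, b in combinations(g, 2))
    return errors / total if total else 0.0


def overstemming_index(groups, norm):
    """Share of all sample pairs that cross groups yet share a norm."""
    total = comb(sum(len(g) for g in groups), 2)
    errors = sum(norm[a] == norm[b]
                 for g1, g2 in combinations(groups, 2)
                 for a in g1 for b in g2)
    return errors / total if total else 0.0


def stemming_quality(ui, oi):
    """Harmonic mean of (1 - UI) and (1 - OI)."""
    denom = 2 - ui - oi
    return 2 * (1 - ui) * (1 - oi) / denom if denom else 0.0


# Inflectional groups; derivational groups merge the related ones.
inflectional = [["grad", "grada", "gradom"], ["gradska", "gradske"],
                ["arija", "arije"]]
derivational = [inflectional[0] + inflectional[1], inflectional[2]]

# Hypothetical normalizer outputs: one stops at inflection,
# the other also conflates the derived adjective with its base noun.
lemmatizer = {"grad": "grad", "grada": "grad", "gradom": "grad",
              "gradska": "gradski", "gradske": "gradski",
              "arija": "arija", "arije": "arija"}
conflater = {**lemmatizer, "gradska": "grad", "gradske": "grad"}


def id_sq(norm):
    """idSQ: iUI on inflectional groups, dOI on derivational groups."""
    return stemming_quality(understemming_index(inflectional, norm),
                            overstemming_index(derivational, norm))


for name, norm in [("lemmatizer", lemmatizer), ("conflater", conflater)]:
    print(name, id_sq(norm), len(set(norm.values())))
```

Both normalizers reach idSQ = 1 on this sample, yet the first produces three distinct norms and the second only two, which is exactly the difference that idSQ leaves unmeasured.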


Comparison of stemming quality

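| Method | % iUI | % iOI | % iSQ | % dOI | % idSQ | % dSS |
|---|---|---|---|---|---|---|
| Lexicon-based normalization: | | | | | | |
| L5 | 5.79 | 5.56 | 94.32 | 3.15 | 95.51 | 1.47 |
| L5-D1 | 5.61 | 18.25 | 87.62 | 3.04 | 95.66 | 10.71 |
| L5-D2 | 5.02 | 47.07 | 67.98 | 9.52 | 92.68 | 41.09 |
| L5-D3 | 4.85 | 57.66 | 58.60 | 14.97 | 89.81 | 58.51 |
| String prefix (baseline): | | | | | | |
| P-6 | 44.98 | 50.82 | 51.94 | 6.22 | 69.35 | 30.43 |
| P-5 | 28.82 | 62.58 | 49.05 | 17.09 | 76.60 | 52.78 |
| P-4 | 16.40 | 77.86 | 35.01 | 46.01 | 65.61 | 73.37 |
| Suffix stripping: | | | | | | |
| S-1 | 47.76 | 15.39 | 64.60 | 5.76 | 67.22 | 3.63 |
| S-2 | 23.80 | 17.69 | 79.14 | 6.84 | 83.84 | 6.13 |
| String-based distance clustering: | | | | | | |
| D-1 | 34.25 | 19.30 | 72.46 | 3.45 | 78.23 | 7.88 |
| D-2 | 19.42 | 46.55 | 64.27 | 8.22 | 85.82 | 35.26 |
| D-3 | 13.91 | 61.79 | 52.93 | 18.34 | 83.82 | 59.70 |
| Lemmatization: | | | | | | |
| HML | 5.62 | 1.96 | 96.17 | 0.36 | 96.94 | 0.94 |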
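As a sanity check on how the columns relate, the sketch below recomputes % iSQ and % idSQ from the % iUI, % iOI, and % dOI columns for four of the rows above, using the stemming-quality formulas from the preceding slides. The names `stemming_quality`, `rows`, and `max_dev` are illustrative. Small deviations are expected, because the printed inputs are themselves rounded to two decimals.

```python
def stemming_quality(ui_pct, oi_pct):
    """Harmonic mean of (1 - UI) and (1 - OI), with inputs and output in percent."""
    ui, oi = ui_pct / 100, oi_pct / 100
    return 100 * 2 * (1 - ui) * (1 - oi) / (2 - ui - oi)


# method: (% iUI, % iOI, % iSQ, % dOI, % idSQ), copied from the table
rows = {
    "L5":  (5.79, 5.56, 94.32, 3.15, 95.51),
    "P-6": (44.98, 50.82, 51.94, 6.22, 69.35),
    "S-2": (23.80, 17.69, 79.14, 6.84, 83.84),
    "HML": (5.62, 1.96, 96.17, 0.36, 96.94),
}

max_dev = 0.0
for method, (i_ui, i_oi, i_sq, d_oi, id_sq) in rows.items():
    i_sq_calc = stemming_quality(i_ui, i_oi)    # inflectional quality
    id_sq_calc = stemming_quality(i_ui, d_oi)   # combined quality
    max_dev = max(max_dev, abs(i_sq_calc - i_sq), abs(id_sq_calc - id_sq))
    print(f"{method:4s} iSQ {i_sq_calc:6.2f} (table {i_sq:6.2f})  "
          f"idSQ {id_sq_calc:6.2f} (table {id_sq:6.2f})")

print(f"largest deviation from the table: {max_dev:.3f} percentage points")
```

The recomputed values agree with the table to within rounding, which confirms that idSQ pairs the inflectional understemming index with the derivational overstemming index.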


Comparison of stemming quality

[Figure: scatter plot of % dOI (derivational overstemming, vertical axis, 0–50) against % iUI (inflectional understemming, horizontal axis, 0–50). Plotted methods: P-4, P-5, P-6, L5, L5-D1, L5-D2, L5-D3, S-1, S-2, D-1, D-2, D-3, HML; the label DP-n also appears.]


Outline

1 Background: morphology

2 Modelling inflection and derivation

3 Lexicon acquisition and paradigm guessing

4 Enter semantics: derivational relations

5 Evaluating morphological normalization

6 Wrap-up and future perspectives


Wrap-up

A generative-reductive model of inflection and derivation based on functional programming concepts

The inflectional model is used for lexicon acquisition from a raw corpus

The correct inflectional paradigm can be guessed fairly accurately using a supervised model with a couple of features

The derivational model may be used for derivational normalization, but then the nature of derivational semantics must be taken into account, e.g., by restricting to specific derivational patterns

Morphological normalization can be evaluated intrinsically, taking into account both inflection and derivation, but keeping semantics out of the picture


Future perspectives: DISMODS

DISMODS = Distributional Models of Derivational Semantics

Analyse the representation of derivationally related word pairs in dependency-based semantic space models

Construct representations of derivational processes in semantic spaces in order to support the automatic computation of semantic representations of morphologically derived terms

Investigate how such models can be used to improve information retrieval


References

Adamson, G., & Boreham, J. 1974. The Use of an Association Measure Based on Character Structure to Identify Semantically Related Pairs of Words and Document Titles. Information Processing and Management, 10(7/8), 253–260.

Dzeroski, Saso, & Erjavec, Tomaz. 2000. Learning to Lemmatise Slovene Words. Pages 69–88 of: Learning Language in Logic, Lecture Notes in Computer Science.

Erjavec, T., Krstev, C., Petkevic, V., Simov, K., Tadic, M., & Vitas, D. 2003. The MULTEXT-East Morphosyntactic Specifications for Slavic Languages. Pages 25–32 of: Proceedings of the EACL2003 Workshop on Morphological Processing of Slavic Languages.

Gelbukh, Alexander, Alexandrov, Mikhail, & Han, Sang-Yong. 2004. Detecting Inflection Patterns in Natural Language by Minimization of Morphological Model. Progress in Pattern Recognition, Image Analysis and Applications, LNCS, 3287, 432–438.

Goldsmith, John. 2000. Automatic Language-Specific Stemming in Information Retrieval. Revised Papers from the Workshop of Cross-Language Evaluation Forum on Cross-Language Information Retrieval and Evaluation, LNCS, 2069, 273–284.

Hafer, M., & Weiss, S. 1974. Word Segmentation by Letter Successor Varieties. Information Processing and Management, 10(11/12), 371–386.


Hockett, C. F. 1954. Two Models of Grammatical Description. Word, 10, 210–234.

Hudak, Paul. 1989. Conception, evolution, and application of functional programming languages. ACM Computing Surveys, 21(3), 359–411.

Koskenniemi, K. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Helsinki: Publications of the Department of General Linguistics, University of Helsinki.

Krovetz, R. 1993. Viewing Morphology as an Inference Process. Pages 191–203 of: Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Lovins, Julie Beth. 1968. Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics, 11(1), 22–31.

Majumder, Prasenjit, Mitra, Mandar, Parui, Swapan K., Kole, Gobinda, Mitra, Pabitra, & Datta, Kalyankumar. 2007. YASS: Yet another suffix stripper. ACM Transactions on Information Systems, 25(4), 18:1–18:20.

Melucci, Massimo, & Orio, Nicola. 2003. A Novel Method for Stemmer Generation Based on Hidden Markov Models. Pages 131–138 of: Proceedings of CIKM’2003.

Mladenic, Dunja. 2002. Learning Word Normalization Using Word Suffix and Context from Unlabeled Data. Pages 427–434 of: Proceedings of the Nineteenth International Conference on Machine Learning, ICML 2002.

Paice, C. D. 1990. Another Stemmer. ACM SIGIR Forum, 24, 56–61.

Paice, C. D. 1996. Method for Evaluation of Stemming Algorithms Based on Error Counting. Journal of the American Society for Information Science, 47(8), 632–649.

Pirkola, Ari. 2001. Morphological Typology of Languages for IR. Journal of Documentation, 57(3), 330–348.

Plisson, Joel, Lavrac, Nada, Mladenic, Dunja, & Erjavec, Tomaz. 2008. Ripple Down Rule learning for automated word lemmatisation. AI Communications, 21(1), 15–26.

Porter, M. F. 1980. An Algorithm for Suffix Stripping. Program, 14(3), 130–137.


Schone, P., & Jurafsky, D. 2001. Knowledge-free Induction of Inflectional Morphologies. Pages 1–9 of: Proceedings of the North American Chapter of the Association for Computational Linguistics, NAACL 2001.

Snajder, Jan. 2010. Morfoloska normalizacija tekstova na hrvatskome jeziku za dubinsku analizu i pretrazivanje informacija. Ph.D. thesis, University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb.

Snajder, Jan. 2012. Guessing the Correct Inflectional Paradigm of Unknown Croatian Words. Pages 185–190 of: Proceedings of the Eighth Language Technologies Conference (IS-JT’2012).

Snajder, Jan, & Dalbelo Basic, Bojana. 2008. Higher-Order Functional Representation of Croatian Inflectional Morphology. Pages 121–130 of: Proceedings of the 6th International Conference on Formal Approaches to South Slavic and Balkan Languages, FASSBL6. Dubrovnik, Croatia: Croatian Language Technologies Society.

Snajder, Jan, & Dalbelo Basic, Bojana. 2010. A Computational Model of Croatian Derivational Morphology. Pages 109–117 of: Proceedings of the 7th International Conference on Formal Approaches to South Slavic and Balkan Languages, FASSBL7. Dubrovnik, Croatia: Croatian Language Technologies Society.


Snajder, Jan, Dalbelo Basic, Bojana, & Tadic, Marko. 2008. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation. Information Processing and Management, 44(5), 1720–1731.

Xu, Jinxi, & Croft, W. Bruce. 1998. Corpus-Based Stemming using Cooccurrence of Word Variants. ACM Transactions on Information Systems, 16(1), 61–81.
