Modelling Inflection and Derivation of a Morphologically ... and Derivation of a Morphologically Rich Language Jan Snajder University of Zagreb, Faculty of Electrical Engineering and

Computational Linguistics Colloquium

Modelling Inflection and Derivation of aMorphologically Rich Language

Jan Snajder

University of Zagreb, Faculty of Electrical Engineering and Computing, TakeLab

Institute for Computational Linguistics, University of HeidelbergDecember 6, 2012

Jan Snajder (UNIZG FER) Modeling Morphologically Rich Language December 6, 2012 1 / 65

Why morphology modelling?

Morphological processing is a prerequisite for many NLP tasks

Typical morphological processing tasks: (1) grouping, (2)segmentation, (3) full morphological analysis

This work focuses on morphological processing for IR/TM:due to morphological variation, the “meaning of a word getsdispersed” among several of its morphological variants

This problem can be addressed by morphological normalization:the grouping of morphologically related words and their replacementwith a single representative form


Why Croatian morphology?

Croatian is a heavily under-resourced language

Back in 2007: no morphological analyzer/lexicon

Initial goal: development of a procedure for inflectional andinflectional/derivational morphological normalization for Croatianlanguage suitable for IR/TM applications

Working hypotheses:

1 Positive effect of morphological normalization on IR/TMapplications is depends on the quality of the normalization

2 Due to morphological complexity of Croatian, satisfactory qualitycan only be reached using a linguistically-based approach

Focus gradually shifted towards building a full blown model ofCroatian morphology, suitable for morphological analysis in general


Outline

1 Background: morphology

2 Modelling inflection and derivation

3 Lexicon acquisition and paradigm guessing

4 Enter semantics: derivational relations

5 Evaluating morphological normalization

6 Wrap-up and future perspectives


By the end of this talk, you’ll. . .

1 have an impression of how difficult and tedious morphology modellingcan be, and why it’s a good thing that someone else has already doneit for your language

2 be familiar with one particular approach to modelling inflection andderivation

3 have an idea how to acquire a morphological lexicon using amorphology model

4 have a sense of why lexical semantics is important for derivation andvice versa

5 have an idea how one can evaluate morphological normalization

6 know a couple of Croatian words!


Outline








Morphology

1 Inflectional morphology deals with how word-forms are formed fromthe stem. Word-forms express grammatical features of the word.

I fish→ fishes, kuca→ kucom, bogat→ najbogatijih

2 Derivational morphology deals with derivation of new words fromthe existing words using derivational affixes.The derived words are possibly of a different POS, but (almostalways) of a different meaning than the basis word.

I fish→ fishery, kuca→ kucanski, bogat→ bogatstvo

Typically accomplished by adding affixes (functional morphemes) to theword’s stem (the part of the word common to all inflectional variants).

3 Compounding deals with word formation that combines two or moreexisting words


Morphological typology(Pirkola, 2001)

Morhological complexity is determined by the degree of affixation andthe difficulty of affix segmentation

Traditional morphological typology: isolating, agglutinative,fusional, and polysynthetic languages

Most languages are mixed types. It is possible to describe themorphological complexity of each language using two variables:

I index of synthesis (IS): the amount of affixation (the averagenumber of morphemes per word in a language)

I index of fusion (IF): the ease with which affixes can besegmented in words (the average number of fused words in alanguage)

I defined separately for inflection, derivation, and compounding


Croatian morphology

Croatian is a morfologically complex (rich) language:

1 Many grammatical categories

I nouns: case, number; adjectives: case, gender, number,definiteness, degree; verbs: time, person, gender, number. . .

2 Fusion of grammatical categories/suffix ambiguity

I vojnik-e – ruk-e

3 Morphological syncretisms/homography

I pile→ pile/piliti/piti,

4 Alternations:

I vojnik→ vojnice, podatak→ podatci/podaci

5 Many derivational patterns

I iskljuciti – iskljucivati – iskljuciv – iskljucen – iskljucivanje –iskljucivost – iskljucni – . . .


Croatian morphology – example

riba ribe ribi ribu ribo ribom ribama ribanje ribanja ribanju ribanjem ribanjima ribar ribara ribaru ribare ribarom ribari ribarima

ribarenje ribarenja ribarenju ribarenjem ribarenjima ribarica ribarice ribarici ribaricu ribarico ribaricom ribaricama ribariti ribarit

ribarim ribaris ribari ribarimo ribarite ribare ribareci ribarivsi ribario ribarila ribarilo ribarili ribarile ribarnica ribarnice ribarnici

ribarnicu ribarnico ribarnicom ribarnicama ribarski ribarskog ribarskoga ribarskom ribarskome ribarskomu ribarskim ribarskih

ribarskima ribarske ribarsko ribarska ribarskoj ribarsku ribarstven ribarstvena ribarstvenu ribarstvenim ribarstveni ribarstvenih

ribarstvenima ribarstvene ribarstveno ribarstvenoj ribarstvenom ribarstveni ribarstvenog ribarstvenoga ribarstvenom ribarstvenome

ribarstvenomu ribarstvenim ribarstvenih ribarstvenima ribarstvene ribarstveno ribarstvena ribarstvenoj ribarstvenu ribarstvo

ribarstva ribarstvu ribarstvom ribarstvima ribarstava ribic ribica ribicu ribicem ribici ribicima ribice ribici ribicija ribiciju ribicijem

ribiciji ribicijima ribicije ribicki ribickog ribickoga ribickom ribickome ribickomu ribickim ribickih ribickima ribicke ribicko ribicka

ribickoj ribicku riblji ribljeg ribljega ribljem ribljom ribljemu ribljim ribljih ribljima riblje riblja ribljoj riblju ribnjak ribnjaka

ribnjaku ribnjace ribnjakom ribnjaci ribnjacima ribnjake ribnjacarski ribnjacarskog ribnjacarskoga ribnjacarskom ribnjacarskome

ribnjacarskomu ribnjacarskim ribnjacarskih ribnjacarskima ribnjacarske ribnjacarsko ribnjacarska ribnjacarskoj ribnjacarsku

ribnjacarstvo ribnjacarstva ribnjacarstvu ribnjacarstvom ribnjacarstava ribnjacarstvima ribnjacki ribnjackog ribnjackoga

ribnjackom ribnjackome ribnjackomu ribnjackim ribnjackih ribnjackima ribnjacke ribnjacko ribnjacka ribnjackoj ribnjacku

ribnjicarstvo ribnjicarstva ribnjicarstvu ribnjicarstvom ribnjicarstvima ribnjicarstava


Word-and-paradigm approach

The three basic models of morphology description (Hockett, 1954):

1 Item-and-arrangement (IA) – assumes agglutinative structure

2 Item-and-process (IP), e.g., two-level morphology, (Koskenniemi, 1983)

– assumes agglutinative structure, at least on the lexical level

3 Word-and-paradigm (WP)

Croatian morphology is fusional, hence:

IA is inadequate because one morpheme often expresses manygrammatical features

IP is inadequate because of frequent morphological syncretisms andmorphologically conditioned alternations

In WP, word formation rules are defined w.r.t. the grammatical categories(which may be fused) and the paradigm (a class of rules with similarinflection rules)


Morphological normalization

Morphological normalization conflates morphological variants of asingle word or many words to a single and unique form, themorphological norm

I {walk,walking,walks,walked} → walkI {fish,fishes,fisher,fishers,fishery, . . . } → fish

Two main approaches:

1 Stemming – reducing of inflectional/derivational variants to a(pseudo)stem

F princesses → princ

2 Lemmatization – reducing of inflectional variants to alinguistically valid, canonical form

F princesses → princess


Morphological normalization – typical approaches

Lexicon-based lemmatization

Rule-based stemming

I (Lovins, 1968), (Paice, 1990), (Porter, 1980), Snowball project

Hybrid stemming (rules+corpus)

I (Krovetz, 1993), (Xu & Croft, 1998)

Machine learning-based lemmatization

I (Dzeroski & Erjavec, 2000), (Mladenic, 2002), (Plisson et al., 2008)

Unsupervised stemming (morphology induction)

I (Hafer & Weiss, 1974), (Adamson & Boreham, 1974; Majumder et al.,

2007), (Goldsmith, 2000), (Melucci & Orio, 2003), (Schone &

Jurafsky, 2001), (Goldsmith, 2000; Gelbukh et al., 2004; Majumder

et al., 2007)


Outline








Higher Order Functional Morphology (HOFM)(Snajder & Dalbelo Basic, 2008)

Inspired by the functional programming paradigm (Hudak, 1989)

I functions as abstractions of inflectional and derivational wordformation rules

I basic rules are modelled as higher-order functions (functionsthat map from/to functions)

I function composition serves to combine rules

Main idea: morphological descriptions that closely resemble thosefound in traditional grammar books

Generative-reductive model

Language-independent model


Transformations and HOFs

Transformation function t : S → S defines the stem transformation

The number of different transformation types is limited in practice:suffixation, prefixation, alternations, or combinations thereof

More complex transformations can be defined indirectly using HOFs

sfx : S → (S → S) suffixationpfx : S → (S → S) prefixation

asfx : ℘(S × S)→ (S → S) suffix alternation

sfx (a)

sfx (a)(vojnik) = vojnika

plt = asfx({(k, c), (g, z), (h, s)}

)(sfx (e) ◦ plt)(vojnik) = vojnice

(sfx (emu) ◦ plt ◦ pfx (naj))(jak) = najjacemu


Inflectional paradigm

Defines the distinct word-forms (along with the morphological tags)as transformations of the stem, as well as paradigm applicabilityconditions

Inflectional paradigm

p =(c,{(t0, x0), . . . , (ti, xi), . . . , (tn, xn)

})c : S → {>,⊥} – condition on the stemti : S → S – transformation functionsxi – morphological tags

By convention, t0 transforms the stem s into lemma l, i.e., l = t0(s).


Inflectional paradigm – example 1

Case Singular Plural

N. vojnik-∅ vojnic-iG. vojnik-a vojnik-aD. vojnik-u vojnic-imaA. vojnik-a vojnik-eV. vojnic-e vojnic-iL. vojnik-u vojnic-imaI. vojnik-om vojnic-ima

pvojnik =(λs.(ends({k, g, h}) ∧ ¬cgr),{

nul , sfx (a), sfx (u), sfx (om), sfx (e) ◦ plt , sfx (i) ◦ sbl , sfx (ima), sfx (e)})


Inflectional pattern – example 2

pbrz =(c, Ppi ∪ Ppd ∪ {ti ◦ tc : ti ∈ Pc} ∪ {ti ◦ tc ◦ ts : ti ∈ Ps}

)Jan Snajder (UNIZG FER) Modeling Morphologically Rich Language December 6, 2012 19 / 65

Derivational pattern(Snajder & Dalbelo Basic, 2010)

Defines the transformation of the basis word into the derived wordand the word classes of the basis and the derived word

Derivational pattern

d = (t,P1,P2)

t the basis word’s stem into the derived word’s lemmaP1 and P2 are the basis and the derived word’s class, respectively

E.g., pattern brz→ brzina:

d1 =(sfx (ina), {pA1, pA2}, {pN2, pN3, pN4}

)


Derivational pattern

Remarks:

(1) different patterns may be used to derive a word of the samesemantic class

(2) an identical transformation may be used to derive words ofdifferent semantic classes (e.g., suffix ambiguity).If also the word classes of the basis and the derived words areidentical, then the pattern itself is ambiguous.

(1) is not a concern for morphological normalization, while (2) occursrarely and may be ignored


Derivation

Derives LPPs given an LPPs and a derivational pattern

lDerive(l1, p1, (t,P1,P2)

)={{

(l2,P∗2 ) : l2 ∈ (t ◦ t−10 )(l1)}

if p1 ∈ P1 ∧ P∗2 6= ∅

∅ otherwise

P∗2 = {p2 ∈ P2 : p2 � l2}

d =(sfx (ak) ◦ try(plt),mNouns,mNouns

)lDerive(smijeh, pN04, d) =

{(smijesak, {pN28, pN47})

}lDerive(struk, pN04, d) =

{(strucak, {pN28, pN47})

}lDerive(jabuka, pN28, d) = ∅


HOFM extensions

1 Ambiguous transformations

I transformations modelled as non-functional relationst : S → ℘(S)

I motivated by the fact that many transformations are not injectiveI E.g.,. palatalization: k/c, g/z, h/s, c/c, z/z

(sfx (a) ◦ plt)−1(vojnice) = {vojnik, vojnic}

2 Morphological optionality

I modelling doublets and conditioning of alternations(sfx (i) ◦ opt(sbl)

)(tvrtk) = {tvrtci, tvrtki}(

sfx (ovi) ◦ try(rifx (ije, je)

)(cvijet) = {cvjetovi}


Model of Croatian morphology

Transformations

I suffixation, prefixation, phonologically and morhologicallyconditioned alternations

Inflectional morphology

I 93 paradigms: 48 for nouns, 32 for verbs and 13 for adjectivesI morphosyntactic tags according to MULTEXT-East (Erjavec

et al., 2003)

Derivational morphology

I 244 patterns for suffixal derivation between and among nouns,verbs, and adjectives


Model of Croatian morphology – transformations

Function Description Example of application

rsfx Stem suffix replacement rsfx (ij, ik)(akademija) = {akademik}sfx Suffixation sfx (a) = (vojnik) = {vojnika}pfx Prefixation (pfx (naj) ◦ sfx (a) ◦ jot)(brz) = {najbrzi}

pca1 PC stem alternation 1 (sfx (a) ◦ pca1 )(vrabc) = {vrapca}pca2 PC stem alternation 2 (pca2 ◦ sfx (ba))(svat) = {svadba}

sbl Sibilarization (sfx (i) ◦ sbl)(vojnik) = {vojnici}plt Palatalization (sfx (e) ◦ plt)(vojnik) = {vojnice}jot Jotation (jot(u) ◦ jot)(krv) = {krvlju}acg Consonant group alternation (sfx (u) ◦ acg)(mast) = {mascu}exa Stem extension with a (sfx (a) ◦ exa)(vrabc) = {vrabaca}exe Stem extension with e exa(Cakovc) = {Cakovec}jat1 Yat reflex alternation ije/je (sfx (ovi) ◦ jat1 )(snijeg) = {snjegovi}jat2 Yat reflex alternation ije/e (sfx (ovi) ◦ jat2 )(brijeg) = {bregovi}jat3 Yat reflex alternation ije/i (sfx (vati) ◦ jat3

−1)(izli) = {izlijevati}


Model of Croatian morphology – derivational patternsGroup Semantic category of the derived word # Example

N-1 Masculine agent nouns 21 banka→ bankarN-2 Masculine noun expressing a characteristic 5 sretan→ sretnikN-3 Masculine nouns for a follower 3 Franjo→ franjevacN-4 Female person nouns 11 prijatelj → prijateljicaN-5 Nouns for male and female person 5 izdati→ izdajicaN-6 Demonyms and ethnonyms 11 Varazdin→VarazdinacN-7 Nouns for animals and plants 6 otrovan→ otrovnicaN-8 Nouns for inanimate objects 11 mijenjati→mjenjacN-9 Nouns for places 9 cigla→ ciglanaN-10 Abstract nouns 18 prijatelj → prijateljstvoN-11 Deverbal (action) nouns 24 cuvati→ cuvanjeN-12 Diminutives and augmentatives 19 orah→ orascicN-13 Collective nouns 7 radnik→ radnistvoN-14 Other types of nouns 6 brod→ brodarina

A-1 Qualifying adjectives 35 mrak→mracanA-2 Possessive adjectives 19 djed→ djedovA-3 Passive verb adjectives 9 spasiti→ spasen

V-1 Imperfective verbs 12 baciti→ bacatiV-2 Diminutive and pejorative verbs 6 govoriti→ govorkatiV-3 Verbs derived from nouns 5 vecera→ veceratiV-4 Verbs derived from adjectives 2 sitan→ sitniti


Derivational transformations

1 Phonologically conditioned alternations are implicit in the model(attempted before every suffixation)

I d = (sfx (ski),nouns, pAdjectives)klub→ klupski , but automat→ automatski

2 Morphologically conditioned alternations are modelled explicitly

I d = (sfx (avati) ◦ try(jot), tiVerbs, tiVerbs)onecistiti→ oneciscavati , but odobriti→ odobravati

3 Transformation ambiguity 1: multiple derivations are possible

I d = (sfx (ovit) ◦ try(jat1|jat2),mNouns, qAdjectives)brijeg→ bregovit/brjegovit

4 Transformation ambiguity 2: inconsistencies

I d = (sfx (ar) ◦ opt(jot), fNouns,mNouns)tvornica→ tvornicar , but biblioteka→ bibliotekar


Outline








Lexicon acquisition(Snajder et al., 2008)

Due to grammar ambiguity (overgeneration), a single word-formmay allow for the application of many inflectional patterns

⇒ Each word-form is lemmatized to 17 candidate LPPs on average(despite condition functions)

Morphological normalization that would make a direct use of thegrammar (reductive direction) would lack precision

The way to address this is to first use the grammar to acquire alexicon from corpus, and then to use the lexicon for normalization

An inflectional lexicon is a set of entries, each entry being a set ofLPPs


Acquisition algorithm in a nutshell

1 Apply the grammar in reductive direction on a chosen word-formfrom corpus in order to obtain a set of LPP candidates.One (or more, in case of homography) of the obtained LPPs will becorrect, while the others are spurious

2 Apply the grammar in generative direction on each LPP candidatein order to generate all its word-forms

3 Score the plausibilty of each LPP candidate based on the frequencyinformation from the corpus

I It is assumed that a correct LPP will have more of its word-formsattested in the corpus; spurious LPPs tend to generate invalid(non-existent) word-forms

I The best scored LPP consitutes a new lexicon entryI If there’s a tie, the best scored LPPs constitue an unresolved

lexicon entry


Acquisition algorithm – example


Acquisition algorithm

acquireLexicon(W0) = acquire(W0)

function acquire(W ) returns a set of lexicon entries {Ei}if W 6= ∅ thenw = (any ◦ wfpref )(W )

LPP ={(l, p) ∈ lmP(w) : accept(W, l, p)

}if LPP 6= ∅ then

bLPP = argmax(l,p)∈LPPscore(W0, l, p)

E = {lpp ∈ LPP : tie(W0, blf , lf )}WFS =

⋃(l,p)∈E lWfs(l, p)

return {E} ∪ acquire(W \WFS )else

return acquire(W \ {w})else

return ∅


LPP scoring measures

Basic (frequency-based) measures:

score0(W, l, p) = |lWfs(p, l) ∩W |score1(W, l, p) =

∑w∈lWfs(p,l)#(w,W0)

score2(W, l, p) = |lWfs(p, l) ∩W | / |lWfs(p, l)|

Heuristic measure:

scoreH(W, l, p) = 10 · score1(W, l, p) + β1 + β2

β1 =

4 if p ∈ PN0 if p ∈ PAqd

2 otherwiseβ2 =

{1 if l ∈W0

0 otherwise

Probabilistic measures:defined based on the expected distribution of word-forms P (t|p)


LPP scoring measures – evaluation

Measure P R F1

All 4,33 100 8,29Arbitrary 0,40 0,40 0,40

score0 36,66 98,40 53,42score1 36,98 98,80 53,81score2 70,29 77,60 73,76

scoreH 70,78 94,00 80,76

scoreP1 74,28 82,00 77,95scoreP2 71,27 76,40 73,75scoreP3 65,25 67,60 66,40

score02 79,43 89,60 84,21score12 79.79 90.00 84.59score1P2 80,00 91,20 85,23score1P3 87,65 88,00 87,82


A machine learning approach: paradigm guessing(Snajder, 2012)

Guess the inflectional paradigm (and lemma) of a given word-form

1 Use the grammar (reductive+generative direction) to generatecandidate LPPs

2 Use supervised machine learning to train a model to decidewhich LPPs is correct based on a number of features

We focus on machine learning aspects: what are the relevant featuresand how well can we do?


Paradigm guessing – problem definition

Given word-form w, determine its correct stem s and its correctinflectional paradigm p

Given p, the lemma l can be derived from the stem s and vice versa,thus the problem can be re-casted as:

Problem definition

Given word-form w, determine its correct lemma-paradigm pair (LPP)(l, p). LPP is correct iff l is valid and p generates the valid word-forms ofthe stem obtained from l.

⇒ Binary classification problem (which candidate LPP is correct?)


Features

String-based features – orthographic properties of lemma/stem

I incorrect LPPs tend to generate ill-formed stems/lemmas

Corpus-based features – frequencies or probability distributions ofword-forms/morphological tags in the corpus

I a correct LPP should have more of its word-forms attested in thecorpus

I every inflectional paradigm has its own distribution ofmorphological tags P (x|p). A correct LPP will generateword-forms that obey such a distribution

Other features – paradigmId and POS

22 features in total (146 binary-encoded features)


String-based features

1 EndsIn

2 EndsInCgr

3 EndsInCons

4 EndsInNonPals

5 EndsInPals

6 EndsInVelars

7 LemmaSuffixProb – the probability P (sl|p)8 StemSuffixProb – the probability P (ss|p)9 StemLength

10 NumSyllables

11 OneSyllable


Corpus-based features

1 LemmaAttested

2 Score0 – number of attested word-form types

3 Score1 – sum of corpus frequencies of word-forms

4 Score2 – proportion of attested word-form types

5 Score3 – product of P (x|p) and P (x|l, p)6 Score4 – expected number of attested word-form types

7 Score5 – Kullback-Leibler divergence between p1 = P (x|p) andp2 = P (x|l, p)

8 Score6 – Jensen-Shannon divergence between p1 and p2

9 Score7 – cosine similarity between p1 and p2

Estimated on Vjesnik newspaper corpus (23 MW)


Experimental setting

SVM with an RBF kernel (#features � #examples)

Training/testing data: hand-validated automatically acquiredinflectional lexicon

Positive examples: LPPs sampled from the lexicon – 5,000 fortraining and 5,000 for testing

Negative examples: generated using the grammar – 5,000 fortraining and 5,000 for testing

Total: 10,000 examples for training and 10,000 examples for testing


Evaluation – feature analysis

Some features are redundant while others may be irrelevant

Top-5 features with univariate filter selection:

I IG: StemSuffixProb, LemmaSuffixProb, Score6, Score5, Score7I GR: StemSuffixProb, LemmaSuffixProb, LemmaAttested, Score0,

Score5I RELIEF: ParadigmId, EndsIn, LemmaSuffixProb, Score5, Score2

Some features consistently low-ranked (e.g. POS, Score1)

Multivariate feature subset selection:

I CFS: StemSuffixProb, LemmaAttested, Score0I CSS: . . . (13 features)


Evaluation – classification accuracy

Word-forms attested

Features (count) ≥ 1 ≤ 100 ≤ 10

All (22) 91.97 91.94 90.65String-based (13) 87.01 87.69 87.98Corpus-based (11) 87.78 86.59 82.04IG (5) 81.14 79.05 76.46GR (5) 59.76 80.90 77.29RELIEF (5) 90.62 90.60 89.27CFS (3) 81.69 79.51 78.67CSS (13) 27.41 91.56 90.37

Baseline 50.00 56.51 69.92

Subset of 5 features gives almost as good results as the full feature set

Decrease of accuracy on rare words is minimal


Outline








Derivational relation

A derivational pattern implicitly defines a derivational relationbetween LPPs as a binary relation on S × P:

Derivational relation →d

(l1, p1)→d (l2, p2) ⇐⇒ (l2,P∗2 ) ∈ lDerive(d, l1, p1) ∧ p2 ∈ P∗2

E.g. (smijeh, pN04)→d (smijesak, pN47) holds, but(smijeh, pN04)→d (strucak, pN47) doesn’t

→d does not imply actual derivational relatedness. It merely implies:

I surface-forms matchI lexical categories match


Building derivational families

Grouping of (potentially) derivationally related LPPs

L/=D is a set of derivational families of L=D is the reflexive and symmetric closure of transitive relation

∗−→D

Relation w1∗−→D w2

∃d ∈ D.((w1 →d w2) ∨ ∃w3 ∈ L.

((w1 →d w3) ∧ (w3

∗−→D w2)))

L/=D efficiently computable as WCCs of the induced digraph


Problems due to derivational semantics

(1) Spurious derivations: surface-form related words need not besemantically related

I E.g., sal→ *salica, nov→ *novacI Average ratio of spurious derivation accross patterns: 7.8%

Derivations

Group Suffix Example Match Mismatch (%)

A-1 -an beskraj → beskrajan 108 9 7.7A-3 -en dogovoriti→ dogovoren 79 0 0

N-11 -nje citati→ citanje 64 0 0A-2 -ski autor→ autorski 43 1 2.3

N-10 -ost aktivan→ aktivnost 40 2 4.8A-3 -an citati→ citan 37 3 7.5

N-10 -ost duhovit→ duhovitost 28 0 0N-11 -enje donositi→ donosenje 23 0 0N-11 -a nagraditi→nagrada 21 1 4.6V-1 -ati javiti→ javljati 18 2 10.0


Problems due to derivational semantics

(2) Vagueness and ambiguity of semantic relations

1 Semantic shift (strength of relation)baciti→ bacati vs. baciti→ bacac

2 Polysemyizdavac – izdati – izdajica

3 Non-denotational aspects of meaning (register, connotation)cinovnik – cinovnicic

4 “Distance” of derivational relationsluga – sluziti – sluzben

5 Opaqueness (idioms, methaphors)morski pas


Strength of derivational relations

To prevent loss of information, derivational normalization may berestricted to specific subsets of the derivational patterns.

Derivational relations may be grouped according to the perceivedstrength of semantic relation:

Level 1 – strong relatedness: A-2, V-1(possesive adjectives and aspectual verb pairs)

Level 2 – moderate relatedness: A-1, A-3, V-3, V-4, N-4, N-11(descriptive and deverbal adjectives, verbs from adjectives and nouns,male/female noun pairs, deverbal nouns)

Level 3 – weak relatedness: all other groups


Derivational lexica

Equivalence classes over the inflectional lexicon:

Class size

Lexicon Entries Reduction (%) Average Maximum

L5 47,415 – – –L5-D1 44,477 6.20 1.07 5L5-D2 38,158 19.52 1.24 19L5-D3 34,310 27.64 1.38 53


Outline








Evaluating morphological normalization

As usual, we can take two approaches:

1 Extrinsic evaluation – measures the effect on an IR/IE task2 Intrinsic evaluation – measures directly the quality of

morphological normalization

Extrinsic evaluation is important, but it does not give us an insightinto how the normalization procedure works

More importantly, extrinsic evaluation does not differentiate betweenthe cases in which normalization fails and the cases in which it isuseless

We focus on intrinsic evaluation


Intrinsic evaluation

Error counting method according to (Paice, 1996)

1 Understemming – not conflating morphologically relatedword-forms to the same stem (norm)fishery→ fisher, fisher→ fishslikom→ slik, slici→ slic

2 Overstemming – conflating morphologically unrelatedword-forms to the same stem (norm)divison→ div, divine→ divstanom→ stan, stanica→ stan


Intrinsic evaluation

Errors are counted on a manually compiled sample of word groups

Understemming index:

UI =#(pairs in each group normalized to different norms)

#(pairs in the group)

Overstemming index:

OI =#(pairs from different groups normalized to the same norm)

#(pairs in the group)


Stemming indices – example

UI =1 · 3 + 1 · 1 + 1 · 3 + 0(

52

)+(22

) =7

10= 63.6%

OI =5 · 1 + 5 · 1(

72

) =11

21= 47.6%


Word sample

Determining group boundaries is notoriously difficult

Paice groups at two levels: “tight” and “loose” groups

The division is based on (an estimate of) the strength of semanticrelation

Problem: lexical semantics enters into the play(degree of similarity? polysemy? what is tight and what is loose?)

(1) {appriopriate, appropriately} {appropriations}

(2) {author, author’s, authors, autorship}

(3) {authoritarian} {authoritative} {authority, authorities}{authorized} {authorization}

(4) {cost, costing, costed, costs} {costly}


Modified Paice’s method(Snajder, 2010)

Grouping accoring to morphological, rather than lexico-semanticrelations:

1 Inflectional groups2 Derivational groups

Under- and overstemming errors are considered independently of thesemantic errors ⇒ morphology and semantics are kept apart

Simplifies grouping:

I The boundary between morphology and semantics is well defined(except perhaps for a few derivations)

I The boundary between inflection and derivation is well defined

Semantics is kept out of the picture and can be addressed separately(“semantically agnostic” evaluation)


Word sample

10,000 word-forms from the Vjesnik corpus (Culture section)

5510 inflectional groups (on average 1.82 word-forms)

3773 derivational groups (on average 1.46 inflectional groups)


Word sample – example

(1) {arheolog} {arheologija, arheologiju}{arheoloska, arheoloske, arheoloski, arheoloskih}

(2) {arhitekt, arhitekta} {arhitekturi, arhitekturama}{arhitektonske, arhitektonskih, arhitektonskim}

(3) {arhiva, arhivima, arhivu} {arhivske, arhivskim, arhivskoj}{arhivar}

(4) {arija, arije, ariju}

(5) {izdajicom} {izdatke} {izdat, izdati} {izdaje, izdavati}{izdaje, izdajom} {izdanja, izdanje, izdanjem, izdanjima}{izdavaca, izdavace, izdavaci} {izdavacka}

(6) {grad, grada,grade, gradom, gradova} {gradska, gradske}

(7) {grade, gradimo, graditi} {gradnja, gradnje}


Measuring stemming quality

Trade-off between under- and overstemming

Inflectional/derivational stemming quality:

iSQ =2(1− iUI )(1− iOI )

2− iUI − iOIdSQ =

2(1− dUI )(1− dOI )

2− dUI − dOI

(iUI , iOI – on inflectional groups; dUI , dOI – on derivational groups)

Steming evaluated separatelly on inflectional/derivational level

In general, normalization may occur at both levels. Separate indicesdon’t tell us much about the stemming quality in such cases

But how much derivational normalization is “right”?


Measuring stemming quality

Inflectional word-forms must be conflated to a single norm, whereasderivationally unrelated word-forms must not

Combined inflectional-derivational stemming quality:

idSQ =2(1− iUI )(1− dOI )

2− iUI − dOI

Full inflectional conflation is the lower, while full derivationalconflation the upper bound for stemming; between these there is a“semantic gray area” (due to derivation)

Perfect stemmers will have idSQ = 1, but may still differ in howmuch derivation they address


Comparison of stemming quality

Method % iUI % iOI % iSQ % dOI % idSQ % dSS

Lexicon-based normalization:L5 5.79 5.56 94.32 3.15 95.51 1.47L5-D1 5.61 18.25 87.62 3.04 95.66 10.71L5-D2 5.02 47.07 67.98 9.52 92.68 41.09L5-D3 4.85 57.66 58.60 14.97 89.81 58.51

String prefix (baseline)P-6 44.98 50.82 51.94 6.22 69.35 30.43P-5 28.82 62.58 49.05 17.09 76.60 52.78P-4 16.40 77.86 35.01 46.01 65.61 73.37

Suffix stripping:S-1 47.76 15.39 64.60 5.76 67.22 3.63S-2 23.80 17.69 79.14 6.84 83.84 6.13

String-based distance clustering:D-1 34.25 19.30 72.46 3.45 78.23 7.88D-2 19.42 46.55 64.27 8.22 85.82 35.26D-3 13.91 61.79 52.93 18.34 83.82 59.70

Lemmatization:HML 5.62 1.96 96.17 0.36 96.94 0.94


Comparison of stemming quality

0

10

20

30

40

50

0 10 20 30 40 50

%dO

I (de

rivac

ijsko

pre

korje

nova

nje)

%iUI (flektivno potkorjenovanje)

P-4

P-5

P-6

L5 L5-D1

L5-D2

L5-D3

S-1

S-2

D-1

D-2

D-3

HML

P-4

P-5

P-6

DP-n


Outline








Wrap-up

A generative-reductive model of inflection and derivation based onfunctional programming concepts

Inflectional model is used for lexicon acquisition from raw corpus

The correct inflectional paradigm can be guessed fairly accuratelyusing a supervised model with a couple of features

Derivational model may be used for derivational normalization, butthen the nature of derivational semantics must be taken into account,e.g., by restricting to specific derivational patterns

Morphological normalization can be evaluated intrinsically, taking intoaccount both inflection and derivation, but keeping semanics out ofthe picture


Future perspectives: DISMODS

DISMODS = Distributional Models of Derivational Semantics

Analyse the representation of derivationally related word pairs independency-based semantic space models

Construct representations of derivational processes in semanticspaces in order to support the automatic computation of semanticrepresentations of morphologically derived terms

Investigate how such models can be used to improve informationretrieval


Adamson, G., & Boreham, J. 1974.The Use of an Association Measure Based on Character Structure to Identify SemanticallyRelated Pairs of Words and Document Titles.Information Processing and Management, 10(7/8), 253–260.

Dzeroski, Saso, & Erjavec, Tomaz. 2000.Learning to Lemmatise Slovene Words.Pages 69–88 of: Learning language in logic, Lecture notes in computer science.

Erjavec, T., Krstev, C., Petkevic, V., Simov, K., Tadic, M., & Vitas, D. 2003.The MULTEXT-East Morphosyntactic Specifications for Slavic Languages.Pages 25–32 of: Proceedings of the EACL2003 Workshop on Morphological Processing ofSlavic Languages.

Gelbukh, Alexander, Alexandrov, Mikhail, & Han, Sang-Yong. 2004.Detecting Inflection Patterns in Natural Language by Minimization of MorphologicalModel.Progress in Pattern Recognition, Image Analysis and Applications, LNCS, 3287, 432–438.

Goldsmith, John. 2000.Automatic Language-Specific Stemming in Information Retrieval.Revised Papers from the Workshop of Cross-Language Evaluation Forum onCross-Language Information Retrieval and Evaluation, LNCS, 2069, 273–284.

Hafer, M., & Weiss, S. 1974.Word Segmentation by Letter Successor Varieties.Information Processing and Management, 10(11/12), 371–386.


Hockett, C. F. 1954.Two models of Grammatical Description.Word, 10, 210–234.

Hudak, Paul. 1989.Conception, evolution, and application of functional programming languages.ACM Computing Surveys, 21(3), 359–411.

Koskenniemi, K. 1983.Two-Level Morphology: A General Computational Model for Word-Form Recognition andProduction.Helsinki: Publications of the Department of General Linguistics, University of Helsinki.

Krovetz, R. 1993.Viewing Morphology as an Inference Process.Pages 191–203 of: Proceedings of the Sixteenth Annual International ACM SIGIRConference on Research and Development in Information Retrieval.

Lovins, Julie Beth. 1968.Development of a Stemming Algorithm.Translation and Computational Linguistics, 11(1), 22–31.

Majumder, Prasenjit, Mitra, Mandar, Parui, Swapan K., Kole, Gobinda, Mitra, Pabitra, &Datta, Kalyankumar. 2007.YASS: Yet another suffix stripper.ACM Transactions on Information Systems, 25(4), 18:1–18:20.

Melucci, Massimo, & Orio, Nicola. 2003.Jan Snajder (UNIZG FER) Modeling Morphologically Rich Language December 6, 2012 65 / 65

A Novel Method for Stemmer Generation Based on Hidden Markov Models.Pages 131–138 of: Proceedings of CIKM’2003.

Mladenic, Dunja. 2002.Learning Word Normalization Using Word Suffix and Context from Unlabeled Data.Pages 427–434 of: Proceedings of the Nineteenth International Conference on MachineLearning, ICML 2002.

Paice, C. D. 1990.Another Stemmer.ACM SIGIR Forum, 24, 56–61.

Paice, C. D. 1996.Method for Evaluation of Stemming Algorithms Based on Error Counting.Journal of the American Society for Information Science, 47(8), 632–649.

Pirkola, Ari. 2001.Morphological Typology of Languages for IR.Journal of Documentation, 57(3), 330–348.

Plisson, Joel, Lavrac, Nada, Mladenic, Dunja, & Erjavec, Tomaz. 2008.Ripple Down Rule learning for automated word lemmatisation.AI Communications, 21(1), 15–26.

Porter, M. F. 1980.An Algorithm for Suffix Stripping.Program, 14(3), 130–137.


Schone, P., & Jurafsky, D. 2001.Knowledge-free Induction of Inflectional Morphologies.Pages 1–9 of: Proceedings of the North American Chapter Of The Association ForComputational Linguistics, NAACL 2001.

Snajder, Jan. 2010.Morfoloska normalizacija tekstova na hrvatskome jeziku za dubinsku analizu ipretrazivanje informacija.Ph.D. thesis, University of Zagreb, Faculty of Electrical Engineering and Computing,Zagreb.

Snajder, Jan. 2012.Guessing the Correct Inflectional Paradigm of Uknown Croatian Words.Pages 185–190 of: Proceedings of the Eight Language Technologies Conference(IS-JT’2012).

Snajder, Jan, & Dalbelo Basic, Bojana. 2008.Higher-Order Functional Representation of Croatian Inflectional Morphology.Pages 121–130 of: Proceedings of the 6th International Conference on Formal Approachesto South Slavic and Balkan Languages, FASSBL6.Dubrovnik, Croatia: Croatian Language Technologies Society.

Snajder, Jan, & Dalbelo Basic, Bojana. 2010.A Computational Model of Croatian Derivational Morphology.Pages 109–117 of: Proceedings of the 6th International Conference on Formal Approachesto South Slavic and Balkan Languages, FASSBL7.Dubrovnik, Croatia: Croatian Language Technologies Society.


Snajder, Jan, Dalbelo Basic, Bojana, & Tadic, Marko. 2008.Automatic Acquisition of Inflectional Lexica for Morphological Normalisation.Information Processing and Management, 44(5), 1720–1731.

Xu, Jinxi, & Croft, W. Bruce. 1998.Corpus-Based Stemming using Cooccurrence of Word Variants.ACM Transactions on Information Systems, 16(1), 61–81.


Documents

Modelling Inflection and Derivation of a Morphologically ... and Derivation of a Morphologically Rich Language Jan Snajder University of Zagreb, Faculty of Electrical Engineering and