<pisze> : ['piʃɛ] : /#p’yše#/ : /#p’ys/ + /R3e#/
Morphophonological Annotation of Polish
Analyzing and Tagging Polish Morphological Suffixes
17.5.2006 Amir Zeldes
Amir Zeldes 17.5.2006 Morphophonological Annotation of Polish
Overview
1. Lemmatization
2. Polish morphology
3. Text-based approaches
4. Morphophonemic analysis
5. Benefits and applications
6. References
Lemmatization…
• …means:
  1. Analyzing the grammatical categorization of a word form (case, number, tense…)
  2. Finding the basic dictionary entry (or ‘lemma’)
  3. Finding the grammatical categorization of the lemma (part of speech, gender of a noun etc.)
• …is important for:
  1. Machine translation
  2. Information retrieval (search engines etc.)
  3. Building electronic corpora
Lemmatization
• Suffixal morphology:
  – Each word has a stem (n characters) and a suffix (m characters):

      stem (n)    suffix (m)
      AAAAA…A     BB…B

  How do we know how to divide the word?
• Naïve definitions:
  – Stem: that part of the word which is common to all word forms belonging to the same lemma
  – Suffix: the remaining characters
Lemmatization
• The naïve algorithm:
  1. Divide input string into all possible stem-suffix pairs
  2. Look up each possible suffix
  3. If a suffix is found, get instructions to find the lemma
  4. Look up the lemma
  5. If a lemma is found, create an analysis containing the lemma, its categorization, and any grammatical information from the suffix
Lemmatization
An English example:
• Given that <s> is a suffix of plural nouns, analyze the form <books>:
  1. b-ooks, bo-oks, boo-ks… one possible division is book-s
  2. Look for the suffix <s>
  3. Its base suffix is Ø (just the stem), the category is plural noun
  4. Search the lexicon for a noun <book> + Ø = <book>
  5. Create an analysis: <books> is a plural noun with the lemma <book>

Suffix table:
  suf   PoS    num   base
  s     Noun   pl    Ø

Lexicon:
  lemma   PoS
  book    Noun

<books> =? <book> + <s>
<books> = pl. noun < book
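The five steps of the naïve algorithm, applied to the <books> example, can be sketched roughly as follows. This is only an illustration: the names `SuffixEntry`, `SUFFIXES`, `LEXICON` and `lemmatize` are hypothetical, and the one-row tables simply mirror the example rather than any real tagger's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixEntry:
    suf: str   # suffix as it appears on the word form
    pos: str   # part of speech
    num: str   # grammatical number
    base: str  # suffix of the lemma; "" stands for Ø


# Toy tables mirroring the <books> example (assumed, not real data).
SUFFIXES = [SuffixEntry(suf="s", pos="Noun", num="pl", base="")]
LEXICON = {("book", "Noun")}


def lemmatize(form):
    """Return every analysis the naive algorithm finds for `form`."""
    analyses = []
    # Step 1: try every division into a non-empty stem and suffix.
    for cut in range(1, len(form)):
        stem, suffix = form[:cut], form[cut:]
        # Step 2: look up the candidate suffix.
        for entry in SUFFIXES:
            if entry.suf != suffix:
                continue
            # Step 3: the base suffix tells us how to build the lemma.
            lemma = stem + entry.base
            # Step 4: look up the lemma with the expected part of speech.
            if (lemma, entry.pos) in LEXICON:
                # Step 5: combine lemma, categorization and suffix info.
                analyses.append(
                    {"form": form, "lemma": lemma,
                     "pos": entry.pos, "num": entry.num}
                )
    return analyses


print(lemmatize("books"))
```

A form whose candidate lemma is missing from the lexicon (for instance `lemmatize("boots")` with the toy tables above) simply yields no analysis.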
English vs. Polish
• But… Polish has many more morphological forms than English*:

                          English                    Polish
  Cases                   2                          7
  Genders                 -                          5
  Non-finite verb forms   3 (go, going, gone)        4+2x5x7
  Finite verb forms       3 (walk, walks, walked)    22
  …
*some forms can appear identical (cf. die Frau : die Frauen)
Polish Morphology
Polish stems and suffixes have many forms
• Orthographic suffix variation
    lampie – loc. sg. fem. < lampa “lamp”
    szkole – loc. sg. fem. < szkoła “school”
• Stem variation
  – Consonant mutation: (cf. Molkerei : Milch)
      ręce – loc. sg. fem. < nom. ręka “hand”
  – Vowel alternation: (cf. Hand : Hände)
      rąk – gen. pl. fem. < nom. ręka “hand”
Polish Morphology
Keeping the naïve definition means:
• The stem of <ręka> “hand” is <r>:
    ręce = r + ęce
    ręka = r + ęka
    rąk = r + ąk
• All of the following are suffixes for loc. sg. fem.:
    ręce = r + ęce
    szkole = szkol + e
    lampie = lamp + ie
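To see why the naïve definition leaves <ręka> with the stem <r>, one can compute the longest prefix shared by the listed forms. The sketch below does just that; `naive_stem` and `naive_suffixes` are hypothetical helper names, and the form lists contain only forms mentioned on the slides.

```python
def naive_stem(forms):
    """Longest prefix common to all word forms of one lemma."""
    prefix = forms[0]
    for form in forms[1:]:
        # Shorten the candidate until it is a prefix of this form too.
        while not form.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def naive_suffixes(forms):
    """Map each form to the characters left after the naive stem."""
    stem = naive_stem(forms)
    return {form: form[len(stem):] for form in forms}


print(naive_stem(["ręka", "ręce", "rąk"]))      # r
print(naive_suffixes(["ręka", "ręce", "rąk"]))  # suffixes ęka, ęce, ąk
```

The vowel alternation in <rąk> is what cuts the shared prefix down to a single letter, so every form of this lemma needs its own long "suffix" entry.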
Polish Morphology
• Sometimes the same suffix has two forms, or the same form represents different suffixes:
    piękny – beautiful (M sg.)    piękni – beautiful (M pl.)
    ciężki – heavy (M sg.)        ciężcy – heavy (M pl.)
• This means we need the entries:
  – -ny : nom sg M
  – -ki : nom sg M
  – -ni : nom pl M
  – -cy : nom pl M
The Tokarski Index
…an a-tergo list of all possible Polish suffixes according to the naïve definition:
• Contains ca. 18,000 entries
• Each entry gives a form suffix, a lemma suffix, the grammatical categorization, and some examples
• The index was created manually!
• It is currently used in major taggers (SAM, Morfeusz)
[Table: excerpt from the index with entries for forms ending in -mam and -nam, e.g. mam < mieć (XII), -mam < -mać (I: mniemam, dumam, trzymam …), mam < mamić (VIa), nam < my]
The Tokarski Index
Pros
• Very fast (simple text-based search)
• Words inflected similarly to listed examples are also recognized
• Simple dictionary - can contain only 1 form per lemma

Cons
• Very difficult to produce, maintain and expand
• Some unforeseen forms may still not be recognized
• Distortion of intuitive “suffixes”
Dictionary Based Approaches
• For each dictionary item, define what stem forms it has, and which suffixes they take:

    ręka:
    – ręk + {a, i, ami}
    – ręc + {e}
    – rąk + {Ø}

Pros
• Can generate paradigms
• Irregularities easy to handle

Cons
• Massive lexicographic work
• Forms not in dictionary can’t be analyzed
• Suffixes are still recognized at a textual level
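A dictionary-based entry like the one for <ręka> can both generate a paradigm and analyze a form. The following is a minimal sketch under the assumption that each lemma simply maps its stem variants to the suffixes they take; the `DICTIONARY` layout and the function names are hypothetical, and the entry lists only the suffixes shown on the slide.

```python
# Hypothetical entry format: lemma -> {stem variant: [suffixes]}.
# The empty string stands for the zero suffix Ø.
DICTIONARY = {
    "ręka": {
        "ręk": ["a", "i", "ami"],
        "ręc": ["e"],
        "rąk": [""],
    },
}


def generate_paradigm(lemma):
    """List every form licensed by the entry (one of the 'pros')."""
    return [
        stem + suffix
        for stem, suffixes in DICTIONARY[lemma].items()
        for suffix in suffixes
    ]


def analyze(form):
    """Return (lemma, stem, suffix) triples matching `form`."""
    hits = []
    for lemma, stems in DICTIONARY.items():
        for stem, suffixes in stems.items():
            if form.startswith(stem) and form[len(stem):] in suffixes:
                hits.append((lemma, stem, form[len(stem):]))
    return hits


print(generate_paradigm("ręka"))
print(analyze("ręce"))
```

Because recognition is still plain string matching, a form that is not spelled out in an entry (here, for example, `analyze("rękom")`) returns nothing, which is the limitation listed among the cons.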
Goals
1. Recognize and correctly analyze all forms
2. Use a simple, mono-lemmatic dictionary
3. Use a simple, extensible suffix table
4. Distinguish between distinct homographic suffixes
5. Identify the same suffix no matter how it is spelled

Text based approaches cannot fully realize these goals
Polish Orthography
• In the best case: 1 letter = 1 phoneme
<tak> = /t;a;k;/
• In some cases: 2 letters = 1 phoneme (digraphs)
<czas> = /cz;a;s;/
Polish Orthography
• <i> can mark:
  – A vowel, allophone of <y>
      <i> = [i] = /y;/
  – Palatality of previous consonant
      <dzia> = [ʥa] = /dź;a;/
  – Palatality and a vowel
      <ci> = [ʨi] = /ć;y;/
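The orthographic conventions above (single letters, digraphs, and the three roles of <i>) can be turned into a small spelling-to-phoneme converter. This is a rough sketch, not the author's implementation: the digraph list, the `SOFT` mapping of soft counterparts, and the use of ’ for other palatalized consonants (as in /p’/) are assumptions, and real Polish has exceptions (for example <rz> is not always a digraph) that the sketch ignores.

```python
DIGRAPHS = ("ch", "cz", "dz", "dź", "dż", "rz", "sz")
VOWELS = set("aąeęioóuy")
# Assumed soft counterparts; other consonants just receive the mark ’.
SOFT = {"c": "ć", "dz": "dź", "n": "ń", "s": "ś", "z": "ź"}


def graphemes(word):
    """Split a spelled word into letters and digraphs (greedy match)."""
    units, i = [], 0
    while i < len(word):
        step = 2 if word[i:i + 2] in DIGRAPHS else 1
        units.append(word[i:i + step])
        i += step
    return units


def phonemes(word):
    """Convert spelling to a phoneme array, resolving the roles of <i>."""
    units = graphemes(word)
    result = []
    for pos, unit in enumerate(units):
        if unit != "i":
            result.append(unit)
            continue
        # <i> after a consonant marks that consonant as palatal.
        if result and result[-1] not in VOWELS:
            previous = result.pop()
            result.append(SOFT.get(previous, previous + "’"))
        # <i> before another vowel is only a palatality marker;
        # otherwise it also stands for the vowel /y/.
        next_is_vowel = pos + 1 < len(units) and units[pos + 1] in VOWELS
        if not next_is_vowel:
            result.append("y")
    return result


def notation(word):
    """Render a phoneme array in the /t;a;k;/ style used above."""
    return "/" + "".join(p + ";" for p in phonemes(word)) + "/"


for example in ("tak", "czas", "dzia", "ci"):
    print(example, notation(example))
```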
Using a phonemic analysis…
• A common stem /#ręk-/ still cannot be reached:
    <ręka> : ['rɛ̃ka] : /#ręka#/
    <ręce> : ['rɛ̃ʦɛ] : /#ręce#/
• Some previously possible analyses are impossible:
    <pisać> : ['pisaʨ] : /#p’ysać#/ “to write”
    <pisze> : ['piʃɛ] : /#p’ysze#/ “(he/she) writes”
    pis-ać, pis-ze is not a valid division: /sz/ is 1 phoneme
Morphophonemics
• Morphophonemic analysis:
/c/ is just an allophone of /k/ before /e/, in certain morphological environments:
    <ręce> = /#ręk/ + /e#/ (loc. sg. of <ręka> “hand”)
    <rzece> = /#rzek/ + /e#/ (loc. sg. of <rzeka> “river”)
/sz/ is just an allophone of /s/ before /e/, in certain morphological environments…
Morphophonemics
• But /k/ and /c/ are not allophones: /e/ after /k/ can produce different changes:
    <krzyczeć> “to shout” = /#krzyk/ + /eć#/ (cf. perfective infinitive <krzyknąć>)
• We can define these as different morphophonemes, e1 and e2 (after Swan, 2002):
    <ręce> = /#ręk/ + /e1#/
    <krzyczeć> = /#krzyk/ + /e2ć#/
Morphophonemics
• But the same changes can happen with different vowels:
    <ciężcy> “heavy” (M pl.) = /#ciężk/ + /y1#/ (cf. singular <ciężki>)
• Operator morphophonemes are a more economical description:
    <ręce> = /#ręk/ + /R1e#/
    <ciężcy> = /#ciężk/ + /R1y#/
    <krzyczeć> = /#krzyk/ + /R2eć#/
Morphophonemics
• Each operator affects different phonemes in different ways:

  Phoneme   R1     R2     R3     R4
  /k/       /c/    /cz/   /cz/   /k’/
  /g/       /dz/   /ż/    /ż/    /g’/
  /t/       /ć/    /ć/    /c/    –
  …
Phoneme Table
• These effects and other information can be stored in a phoneme table:
  Phon   Vowel   Voiced   Air   Softness   Articulation   R1    R2    R3    R4
  k;     1       1        1     4          5              +c    +cz   +cz   +k’
  g;     1       2        1     4          5              +dz   +ż    +ż    +g’
  t;     1       1        1     1          2              +ć    +ć    +c    0
  ć;     1       1        2     3          3              -t    -t    0     0
  …

/ć/ can be derived from /t/ + /R1/
Feature values for /t/: consonant, unvoiced, plosive, hard, dental
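The operator columns of the phoneme table can be read as a lookup from (operator, stem-final phoneme) to the mutated phoneme. The sketch below encodes only the rows given for /k/, /g/ and /t/ and leaves out cells marked "–" or "0"; `MUTATIONS` and `attach` are hypothetical names, and the function assumes the stem ends in a single-letter phoneme, which holds for the examples used here.

```python
# Mutations per operator, copied from the rows for /k/, /g/ and /t/.
# Cells marked "–" or "0" in the table are deliberately left out.
MUTATIONS = {
    "R1": {"k": "c", "g": "dz", "t": "ć"},
    "R2": {"k": "cz", "g": "ż", "t": "ć"},
    "R3": {"k": "cz", "g": "ż", "t": "c"},
    "R4": {"k": "k’", "g": "g’"},
}


def attach(stem, suffix):
    """Join stem and suffix, applying a leading operator such as R1."""
    operator, ending = suffix[:2], suffix[2:]
    if operator not in MUTATIONS:
        # Plain suffix without an operator: simple concatenation.
        return stem + suffix
    final = stem[-1]
    if final not in MUTATIONS[operator]:
        raise ValueError(f"no {operator} mutation listed for /{final}/")
    return stem[:-1] + MUTATIONS[operator][final] + ending


print(attach("ręk", "R1e"))     # ręce
print(attach("ciężk", "R1y"))   # ciężcy
print(attach("krzyk", "R2eć"))  # krzyczeć
```

The output is a phonemic string; for these examples it coincides with the spelling, whereas a result such as /ciężk’y/ (from R4) would still need the orthographic conventions for <i> to yield <ciężki>.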
Phonotactic Rules
• Strings can be converted to phoneme arrays
• Arrays can be matched against rules describing contact between phonemes
• This rule describes the mutation in <ręce>:
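$$C_{[-R_1]}\,V_{[+\mathrm{front}]}\,\# \;=\; C_{[+R_1]} + R_1\,V_{[+\mathrm{front}]}\,\#$$

$$\#\text{rę}\;\text{c}_{[-R_1]}\,\text{e}_{[+\mathrm{front}]}\,\# \;=\; \#\text{rę}\;\text{k}_{[+R_1]} + R_1\,\text{e}_{[+\mathrm{front}]}\,\#$$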
Phonotactic Rules
• Once a rule has been matched, it produces a stem-suffix pair for lookup:

    #ręce# = #ręk + R1e#

• The suffix can be found in a table:

  suf    PoS    case   num   gend   pers   tense   asp   base   conditions
  R1e#   Noun   loc    sg    F                           a#

• The same suffix and base suffix apply to:
    szkole = /#szkoł/ + /R1e#/    szkoła = /#szkoł/ + /a#/
    lampie = /#lamp/ + /R1e#/     lampa = /#lamp/ + /a#/
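Once a rule has delivered a stem-suffix pair, the lookup itself is a plain table access followed by a dictionary check. A minimal sketch, assuming a one-row suffix table and a three-word lexicon built from the examples above (the names `SUFFIX_TABLE`, `LEXICON` and `analyze` are hypothetical, and part of speech is omitted from the lexicon for brevity):

```python
# One row of the suffix table, keyed by the morphophonemic suffix.
SUFFIX_TABLE = {
    "R1e#": {"PoS": "Noun", "case": "loc", "num": "sg",
             "gend": "F", "base": "a#"},
}
# Mono-lemmatic dictionary: one form per lemma.
LEXICON = {"ręka", "szkoła", "lampa"}


def analyze(stem, suffix):
    """Look up the suffix, rebuild the lemma and check the lexicon."""
    entry = SUFFIX_TABLE.get(suffix)
    if entry is None:
        return None
    # Lemma = recovered stem + base suffix, without the # boundaries.
    lemma = stem.lstrip("#") + entry["base"].rstrip("#")
    if lemma not in LEXICON:
        return None
    analysis = {key: value for key, value in entry.items() if key != "base"}
    analysis["lemma"] = lemma
    return analysis


for stem in ("#ręk", "#szkoł", "#lamp"):
    print(analyze(stem, "R1e#"))
```

All three nouns are resolved through the same table row, which is the point of identifying the suffix morphophonemically rather than by its spelling.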
Phonotactic Rules
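• In many cases rules must be used to retrieve the lemma:
    <gryzł> : /#gryz-ł#/ “he bit”

  suf   PoS    case   num   gend   pers   tense   asp   base   conditions
  ł#    VFin          1     M      3      1             ć#     vowel=1

    /#gryz/ + /ć#/ = /#gryźć#/

$$C\begin{bmatrix} +\text{hard} \\ +\text{dental} \\ +\text{sibilant} \\ +R_1 \end{bmatrix} + \text{ć}\# \;\rightarrow\; C\begin{bmatrix} +\text{soft} \\ +\text{dental} \\ +\text{sibilant} \\ -R_1 \end{bmatrix}\text{ć}\# \qquad \text{z} + \text{ć} > \text{źć}$$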
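The lemma retrieval for <gryzł> can be sketched as two steps: strip the suffix /ł#/, then attach the base suffix /ć#/ while softening a hard dental sibilant in front of it. The code below is an illustration only. It reads the condition `vowel=1` as "the stem ends in a consonant" (an interpretation based on the phoneme table), and the `s` to `ś` pair is extrapolated from the feature description; only `z` to `ź` is attested on the slide.

```python
VOWELS = set("aąeęioóuy")
# Hard dental sibilants and their soft counterparts.
# z -> ź is from the slide; s -> ś is an assumption.
SOFTEN = {"z": "ź", "s": "ś"}


def past_to_infinitive(form):
    """Rebuild the infinitive from a 3rd sg. masc. past form in -ł."""
    if len(form) < 2 or not form.endswith("ł"):
        return None
    stem = form[:-1]
    final = stem[-1]
    # Assumed reading of the condition: this entry applies only
    # when the stem ends in a consonant.
    if final in VOWELS:
        return None
    # Contact rule: the sibilant softens before the base suffix ć.
    return stem[:-1] + SOFTEN.get(final, final) + "ć"


print(past_to_infinitive("gryzł"))  # gryźć
```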
Phonotactic Rules
• Sometimes elements appear only on one side of the equation:
    <dworcom> dat. pl. of <dworzec> “station”

$$C_{1[+R_2]}\,C_2\,V_3 \;=\; C_{1[-R_2]}\,e\,C_2 + V_3$$

$$\#\text{dwo}\;\text{r}_{[+R_2]}\,\text{c}\,\text{om}\# \;=\; \#\text{dwo}\;\text{rz}_{[-R_2]}\,\text{e}\,\text{c} + \text{om}\#$$

• Co-indexing keeps track of changes
• /om#/ = dat. pl. (regardless of ablaut and mutations)
• nom. sg. = /#/ (reconstructs the lemma from the recovered stem)
Applications
The morphophonemic approach allows:
1. A comprehensive description of Polish morphology
2. A simple dictionary: lemma + part of speech
3. Smaller, expandable suffix table
4. Distinguishing homographic, but morphophonemically distinct suffixes:
     <piękny> : /#piękn/ + /R4y#/    <piękni> : /#piękn/ + /R1y#/
     <ciężki> : /#ciężk/ + /R4y#/    <ciężcy> : /#ciężk/ + /R1y#/
5. Recognition of forms not in the dictionary:
     <biolog> “biologist”, plural:
     <biologowie> : /#biolog/ + /owie#/
     <biolodzy> : /#biolog/ + /R1y#/
Morphophonemic Annotation
• Identified suffixes can be stored in corpora:
<t ID='w1c13v04s01t03' lemma='siać' pos='VFin' asp='impfv' pers='3' num='sg' gend='M' tense='past' suf='R2ał#' bsuf='R2ać#'>siał</t>
Morphophonemic Annotation
• Suffix fields can be used for morphological queries
  – Retrieve tokens having the same categorization but different suffixes:
Morphophonemic Annotation
• Pluralia tantum:
  – Retrieve nouns with a base suffix /R4y#/ and neuter nouns with a base suffix /a#/
• Imperfectives derived from perfectives
  – Retrieve verbs with a base suffix /R3ać#/ (distinct from simple imperfectives in homographic /R2ać#/)
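Queries of this kind amount to filtering tokens by their `suf` and `bsuf` attributes. The sketch below uses Python's standard `xml.etree.ElementTree` on a two-token sample: the first token is the <siał> annotation shown earlier, the second is built from the <ręce> analysis, with a made-up ID. The wrapping `<text>` element and the `query` helper are assumptions for illustration, not part of the corpus format described here.

```python
import xml.etree.ElementTree as ET

# Two annotated tokens; 'example-2' is a hypothetical ID.
CORPUS = """<text>
<t ID='w1c13v04s01t03' lemma='siać' pos='VFin' asp='impfv' pers='3' num='sg' gend='M' tense='past' suf='R2ał#' bsuf='R2ać#'>siał</t>
<t ID='example-2' lemma='ręka' pos='Noun' case='loc' num='sg' gend='F' suf='R1e#' bsuf='a#'>ręce</t>
</text>"""


def query(root, **criteria):
    """Return all <t> tokens whose attributes match every criterion."""
    return [
        token
        for token in root.iter("t")
        if all(token.get(name) == value for name, value in criteria.items())
    ]


root = ET.fromstring(CORPUS)

# Verbs with the base suffix /R2ać#/ (simple imperfectives).
print([t.text for t in query(root, pos="VFin", bsuf="R2ać#")])
# Feminine nouns with the base suffix /a#/.
print([t.get("lemma") for t in query(root, pos="Noun", gend="F", bsuf="a#")])
```

Because the homographic suffixes /R2ać#/ and /R3ać#/ are stored as different attribute values, a query for one does not return tokens annotated with the other.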
Morphophonemic Annotation
• Morphological queries can help study:
  1. The distribution of morphological suffixes
  2. Changes in morphology through historical corpora
  3. How speakers disambiguate morphology

Morphological data is more than a by-product of analysis: it’s another layer of information
References
• O.E. Swan. 2002. A Grammar of Contemporary Polish. Bloomington, Indiana: Slavica Publishers.
• K. Szafran. 1997. Automatic Lemmatisation of Texts in Polish – Is it Possible? In: Formale Slavistik, eds. U. Junghanns and G. Zybatow. Frankfurt am Main: Vervuert Verlag, pp. 437-441.
• J. Bień and K. Szafran. 2001. Analiza morfologiczna języka polskiego w praktyce. Bulletin de la société polonaise de linguistique, fasc. LVII, pp. 171-184.
• J. Tokarski. 1993. Schematyczny indeks a tergo polskich form wyrazowych, opracowania i redakcja Zygmunt Saloni. Warszawa: Wydawnictwo Naukowe PWN.