TWO-LEVEL MORPHOLOGY and
FINITE STATE METHODS: A CONSUMER’S VIEW
Kemal Oflazer
Sabancı University
İstanbul, Turkey
20 Years of Finite State Systems 2
OVERVIEW
Engineering a Morphological Analyzer for Turkish: Experiences and Reflections
Lenient Morphology
Bootstrapping Morphological Lexicons
TURKISH
Turkish is an agglutinative language (like Finnish, Hungarian, Basque, Korean ...)
Very productive inflectional and derivational suffixation
Small root word lexicon (~60 K roots), but essentially an infinite number of word forms.
Rich morphophonological processes (Vowel harmony, etc.)
evinizdekilerden (from the ones at your house)
ev+iniz+de+ki+ler+den
ev+HnHz+DA+ki+lAr+DAn
A = {a, e}, H = {ı, i, u, ü}, D = {d, t}
cf. odanızdakilerden (from the ones in your room)
oda+[ı]nız+da+ki+ler+den
oda+HnHz+DA+ki+lAr+DAn
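The two examples above can be replayed with a minimal Python sketch that resolves the archiphonemes A, H and D from left to right. This is only an illustration: the function name `realize`, the voiceless-consonant set, and the rule that a suffix-initial H drops after a vowel are simplifying assumptions, not the 30+ two-level rules of the actual analyzer.

```python
FRONT = set("eiöü")
BACK = set("aıou")
ROUNDED = set("oöuü")
VOWELS = FRONT | BACK
VOICELESS = set("çfhkpsşt")  # assumed set of voiceless consonants


def realize(lexical: str) -> str:
    """Map a lexical form such as ev+HnHz+DA to a surface form (toy version)."""
    surface = []
    last_vowel = None
    for morpheme in lexical.split("+"):
        for i, ch in enumerate(morpheme):
            prev = surface[-1] if surface else ""
            if ch == "A":
                # A = {a, e}: agree in backness with the preceding vowel
                out = "e" if last_vowel in FRONT else "a"
            elif ch == "H":
                # Assumption: a suffix-initial H drops after a vowel (oda+HnHz)
                if i == 0 and prev in VOWELS:
                    continue
                # H = {ı, i, u, ü}: agree in backness and rounding
                if last_vowel in FRONT:
                    out = "ü" if last_vowel in ROUNDED else "i"
                else:
                    out = "u" if last_vowel in ROUNDED else "ı"
            elif ch == "D":
                # D = {d, t}: t after a voiceless consonant, d otherwise
                out = "t" if prev in VOICELESS else "d"
            else:
                out = ch
            surface.append(out)
            if out in VOWELS:
                last_vowel = out
    return "".join(surface)


print(realize("ev+HnHz+DA+ki+lAr+DAn"))
print(realize("oda+HnHz+DA+ki+lAr+DAn"))
```

The sketch only generates; a two-level system states the same constraints declaratively so that they also work in the analysis direction.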
evinizdekilerden (from the ones at your house)
ev+iniz+de+ki+ler+den
ev+HnHz+DA+ki+lAr+DAn
ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl
(three of the features are realized as 0, i.e. have no surface morpheme)
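For downstream use, an analysis string like the one above is usually split at the ^DB markers, which separate the derivational steps. The following is a small sketch of that bookkeeping; the function name `split_analysis` and the returned layout are assumptions made for illustration.

```python
def split_analysis(analysis: str):
    """Split an analysis at ^DB into (root, list of feature groups)."""
    chunks = analysis.replace(" ", "").split("^DB")
    groups = [[feat for feat in chunk.split("+") if feat] for chunk in chunks]
    root = groups[0].pop(0)  # the first item of the first group is the root
    return root, groups


root, groups = split_analysis(
    "ev+Noun+A3sg+P2pl+Loc^DB+Adj^DB+Noun+A3pl+Pnon+Abl"
)
print(root)
for group in groups:
    print(group)
```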
ENGINEERING A MORPHOLOGICAL ANALYZER
First implementation in 1992 – 1993 using PC-KIMMO
About two months to get the representation and 30+ two-level rules right
• kgen rule compiler + some hand compilation
Crude morphotactics
• Manual replications of lexicons to deal with exceptions (maintenance nightmare)
• Manual partitioning of root lexicons to deal with allomorph selections (more of the same)
No easy way to deal with numeric forms
Slow (~5 words / second on (old) workstations)
ENGINEERING A MORPHOLOGICAL ANALYZER
Reimplementation using twolc and lexc in late 1994.
Rule component was essentially a rewrite of the rules from the PC-KIMMO version, taking advantage of the notational conveniences that twolc offers.
Additional contexts were included to deal with vocalization of numeric constructions.
The morphotactics (encoded in the ordering of root and suffix lexicons) was completely re-structured and streamlined.
About 300 finite state constraints added to deal with
• Long distance feature constraints,
• Exceptions,
• Allomorph selection (which was a MAJOR pain in the PC-KIMMO version)
Availability of regular expressions in lexicon specifications enabled us to handle simple vocalizations to deal with forms like
• 2/3’ü, 2/3’si, 1995’te vs 1996’da, 12.si vs 12’yi, F16’ları, 100,000’i vs 1,000,000’u
and with variable forms like
• aaaaaah! (Interjection) as a+ h,
• çoook, (emphatic form of çok) as ç o+ k
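The variable forms above translate almost directly into ordinary regular expressions. The sketch below uses Python's `re` module rather than lexc; the table `VARIABLE_FORMS` and the choice of "ah" and "çok" as canonical forms are assumptions made for illustration.

```python
import re

# Canonical form -> pattern; mirrors "a+ h" and "ç o+ k" from the slide.
VARIABLE_FORMS = {
    "ah": re.compile(r"a+h"),
    "çok": re.compile(r"ço+k"),
}


def canonical_form(word: str):
    """Return the canonical entry for a stretched form, or None if no pattern fits."""
    for canonical, pattern in VARIABLE_FORMS.items():
        if pattern.fullmatch(word):
            return canonical
    return None


print(canonical_form("aaaaaah"))
print(canonical_form("çoook"))
```

In the lexicon transducer the same expression both recognizes the stretched form and maps it to a single lexical entry, so no separate lookup step is needed.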
Turkish Analyzer Architecture
[Figure: cascade of transducers between the surface form and the analysis]
• Tes-is: Transducer to normalize case and map to a platform-independent character representation (xfst)
• Tis-lx: Morphographemics transducer (twolc) = intersection of the rule transducers TR1, TR2, TR3, TR4, ..., TRn, the transducers for the individual two-level rules (twolc)
• Tlx-if: Root and morpheme lexicon transducer (lexc)
• TC: Transducers for morphotactic constraints (twolc/xfst)
• Tif-ef: Transducer to clean up the symbolic output (xfst)
Example (intermediate results through the cascade):
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
→ kUtUGUnden
→ kUtUk+sH+ndAn, kUtUk+yH+ndAn
→ kUtUk+Noun+A3sg+P3sg+nAbl
→ kütük+Noun+A3sg+P3sg+Abl
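The first step of the example, case normalization plus mapping to a platform-independent representation, can be sketched in a few lines of Python. Only ü → U and ğ → G are attested in the example above; the remaining entries of `ASCII_MAP`, and the explicit handling of dotted and dotless I, are assumptions.

```python
# ü -> U and ğ -> G come from the example; the other entries are assumed.
ASCII_MAP = {"ü": "U", "ğ": "G", "ö": "O", "ı": "I", "ş": "S", "ç": "C"}


def normalize(word: str) -> str:
    """Lowercase with Turkish casing, then map special letters to ASCII symbols."""
    # Turkish casing: I lowercases to dotless ı, İ to dotted i.
    word = word.replace("I", "ı").replace("İ", "i")
    return "".join(ASCII_MAP.get(ch, ch) for ch in word.lower())


for form in ("kütüğünden", "Kütüğünden", "KÜTÜĞÜNDEN"):
    print(form, "->", normalize(form))
```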
After all transducers are intersected or composed, the Turkish Analyzer is a single transducer (~300K states, 800K transitions) that maps
kütüğünden, Kütüğünden, KÜTÜĞÜNDEN
directly to
kütük+Noun+A3sg+P3sg+Abl
REFLECTIONS
Getting the two-level rules right is reasonably simple:
• Get ALL your data right
• Use a consistent representation
• Test early and often
• Hack idiosyncratic cases with diacritics or other special markers. No real need to be very religious about “theory” here.
(For languages like Turkish) Getting the morphotactics “right” is REALLY hard:
Have a clean and manageable lexicon structure
Handle
• Overgeneration,
• Exceptions, long distance dependencies,
• Allomorph selection
using carefully crafted finite state filters.
Any serious analyzer will consist of tens of files:
Use scripts and makefiles
During development, save intermediate transducers during compositions, so that you can trace bugs by checking intermediate results.
Resulting system compiles in a few minutes on a high-end SparcStation and runs at about 5-6K forms / second.
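The advice about keeping intermediate results can be illustrated with a small tracing helper. This is only an analogy: plain Python functions stand in for transducers, and the name `run_with_trace` and the two toy stages are hypothetical.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[str], str]]


def run_with_trace(stages: List[Stage], form: str) -> List[Tuple[str, str]]:
    """Apply the stages in order and keep every intermediate result."""
    trace = [("input", form)]
    for name, stage in stages:
        form = stage(form)
        trace.append((name, form))
    return trace


# Hypothetical stand-ins for two steps of a cascade.
stages = [
    ("lowercase", str.lower),
    ("strip-apostrophe", lambda s: s.replace("'", "")),
]
for name, value in run_with_trace(stages, "1995'TE"):
    print(f"{name:>16}: {value}")
```

When the final output is wrong, the first stage whose recorded output looks wrong is the place to start debugging.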
This analyzer (along with an unknown word processor) will assign analyses to about 98-99% of the forms encountered in news text.
Basic analyzer covers about 97%.
Unknown word processor will attempt to analyze any word whose root is not in the lexicon (provided the orthography does not violate Turkish rules!)
ARE WE THERE YET?
How can we deal with Eng-ish/Fren-ish?
Dell serverları çok yetenekli. (Dell servers are very capable.)
Galatasaray Bordeaux’yu 2-1 yendi. (Galatasaray beat Bordeaux 2-1.)
Very common in technical text (IT papers, journals, popular science magazines, etc.)
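[Figure: in the two example sentences the stems server and Bordeaux are marked E (English) and F (French), and their suffixes -ları and -’yu are marked T (Turkish).]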
The problem is that even though foreign orthography is used, suffixation proceeds based on foreign pronunciation! (sörvır, Bordo)
Orthographically, such forms violate two-level rules (e.g., vowel harmony is violated in serverları)
Use the CMU pronunciation dictionary to build a “TTS” transducer that maps forms to a different representation capturing pronunciation, do the morphology, and use a reverse “TTS” transducer to get back to orthography.
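The idea of harmonizing against the pronunciation rather than the spelling can be sketched as follows. The tiny `PRONUNCIATION` table stands in for a real pronunciation dictionary (its two entries are the ones given above), and the function `plural` with its `use_pronunciation` switch is a hypothetical illustration, not the transducer-based solution.

```python
# Stand-in for a pronunciation dictionary; entries taken from the slide.
PRONUNCIATION = {"server": "sörvır", "Bordeaux": "bordo"}

BACK = set("aıou")
VOWELS = set("aeıioöuü")


def last_vowel(text: str):
    """Return the last vowel of text, or None if it has no vowel."""
    for ch in reversed(text):
        if ch in VOWELS:
            return ch
    return None


def plural(word: str, use_pronunciation: bool = True) -> str:
    """Attach -lar/-ler, choosing the vowel from the spoken or written form."""
    basis = PRONUNCIATION.get(word, word) if use_pronunciation else word
    suffix = "lar" if last_vowel(basis.lower()) in BACK else "ler"
    return word + suffix  # the spelling of the stem is kept as-is


print(plural("server"))                           # harmony with sörvır
print(plural("server", use_pronunciation=False))  # harmony with the spelling
```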
LENIENT MORPHOLOGY
Two-level morphology requires that all rules accept a given lexical-surface string pair: All rules have to put in a good word!
We want to analyze word forms even if they are mildly (and controllably) malformed.
• Mismatches between orthography and pronunciation
• Linguistic variants
We do not want to do spelling correction!
Allow some two-level rules to (conceptually) fail (in the analysis direction), instead of requiring all to succeed.
Use an “optimality theory” style constraint cascade to (leniently) filter / accept forms (Karttunen 1998, Gerdemann & van Noord 2000)
OT FILTERING
[Figure: cascade GEN .O. C0 .O. C1 .O. C2 .O. Ck]
• Each filter Ci passes
  - All input forms if NONE satisfy the constraint, OR
  - Only those input forms that satisfy the constraint
• C0 passes forms with 0 violations
• C1 passes forms with at most 1 violation (of possibly selected types).
• ...
• Transducers are composed with Karttunen’s lenient composition operator.
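The behaviour of the filter cascade can be mimicked on plain lists of candidate strings. The sketch below is not Karttunen's operator on transducers, only the same pass-all-or-pass-satisfying logic; the names `lenient_filter`, `at_most` and `cascade`, and the use of X as the violation marker, are assumptions.

```python
def lenient_filter(candidates, constraint):
    """Pass the candidates that satisfy the constraint; if none do, pass all."""
    survivors = [c for c in candidates if constraint(c)]
    return survivors if survivors else list(candidates)


def at_most(k, marker="X"):
    """Constraint Ck: the form carries at most k violation markers."""
    return lambda form: form.count(marker) <= k


def cascade(candidates, max_violations):
    """Apply C0, C1, ..., Ck in order, each one leniently."""
    for k in range(max_violations + 1):
        candidates = lenient_filter(candidates, at_most(k))
    return candidates


# No candidate is violation-free, so C0 passes both and C1 picks the better one.
print(cascade(["serv+Xr+lXr+DA", "server+lXr+DA"], 2))
# A violation-free candidate blocks all marked ones at C0.
print(cascade(["ev+lAr+DA", "ev+lXr+DA"], 2))
```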
LENIENT MORPHOLOGY
[Figure: Two-level Rules Transducer .O. C0 .O. C1 .O. C2 .O. Ck .o. Clean-up]
• Failing rules mark failures with additional symbols.
• Filters select outputs with selected violations.
• Clean-up removes failure symbols.
a:b => LC _ RC;
X:b (new feasible pair)
X:b /<= LC _ RC;
Potentially overgenerating; filter with lexicon later
Clean-up handles a <- X replacement later
a:b /<= LC _ RC;
Y:b (new feasible pair)
Y:b <=> LC _ RC;
Clean-up handles a <- Y replacement later
a:b <= LC _ RC;
Z:w new feasible pair for each w ≠ b such that a:w is a feasible pair
Z:w => LC _ RC; for each such w.
Clean-up handles a <- Z replacement later.
Assume rules
A:a <= LC1 _ ;
A:e <= LC2 _ ;
handle vowel harmony
X:a is a new FP generated from the second rule.
X:a => LC2 _ ; is an additional rule.
[Figure: serverlarda → Two-level transducer → server+lXr+DA → filters (Allow 0 violations, Allow ≤ 1 violations, ...) → server+lXr+DA → Clean-up (..., A <- X, ...) → server+lAr+DA]
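The serverlarda example can be walked through with a toy Python version of the marking and clean-up steps. It covers only the A archiphoneme and takes the segmentation as given; the function names, the (lexical, surface) suffix pairs and the use of X as the single failure symbol are assumptions made for illustration.

```python
FRONT = set("eiöü")
VOWELS = set("aeıioöuü")


def last_vowel(text):
    """Return the last vowel of text, or None if it has no vowel."""
    for ch in reversed(text):
        if ch in VOWELS:
            return ch
    return None


def analyze_leniently(root, suffixes):
    """Build the lexical form, writing X where a surface vowel breaks harmony.

    suffixes is a list of (lexical, surface) pairs, e.g. ("lAr", "lar").
    """
    parts = [root]
    context = last_vowel(root)
    for lexical, surface in suffixes:
        marked = []
        for lex_ch, surf_ch in zip(lexical, surface):
            if lex_ch == "A":
                expected = "e" if context in FRONT else "a"
                marked.append("A" if surf_ch == expected else "X")
            else:
                marked.append(lex_ch)
            if surf_ch in VOWELS:
                context = surf_ch  # harmony continues from the surface vowel
        parts.append("".join(marked))
    return "+".join(parts)


def clean_up(form):
    """Clean-up step: A <- X (toy version; assumes X occurs only as a marker)."""
    return form.replace("X", "A")


marked = analyze_leniently("server", [("lAr", "lar"), ("DA", "da")])
print(marked)
print(clean_up(marked))
```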
Before or after the lenient filtering cascade, one can employ finite state filters that limit violations
• to just after or before the root-suffix boundary,
• to specific morphemes,
• to specific roots, etc.
or just allow only selected rules to be violated.
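One such positional filter, allowing violations only in the suffix right after the root, might look like this on marked strings. The function name and the X marker are assumptions; a real system would state this as a finite state filter.

```python
def violations_only_after_root(form: str, marker: str = "X") -> bool:
    """Accept a form only if violation markers occur in the first suffix at most."""
    root, *suffixes = form.split("+")
    # No marker in the root and none in any suffix after the first one.
    return marker not in root and all(marker not in s for s in suffixes[1:])


for form in ("server+lXr+DA", "server+lAr+DX", "server+lAr+DA"):
    print(form, violations_only_after_root(form))
```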
THANKS