CS 4705 Morphology: Words and their Parts CS 4705 Julia Hirschberg

Morphology: Words and their Parts


Page 1: Morphology: Words and their Parts

CS 4705

Morphology: Words and their Parts

Julia Hirschberg

Page 2: Morphology: Words and their Parts

Words

• In formal languages, words are arbitrary strings
• In natural languages, words are made up of meaningful subunits called morphemes
  – Morphemes are abstract concepts denoting entities or relationships
  – Morphemes may be
    • Stems: the main morpheme of the word
    • Affixes: convey the word’s role, number, gender, etc.
    • cats == cat [stem] + s [suffix]
    • undo == un [prefix] + do [stem]

Page 3: Morphology: Words and their Parts

Why do we need to do Morphological Analysis?

• The study of how words are composed from smaller, meaning-bearing units (morphemes)
• Applications:
  – Spelling correction: referece
  – Hyphenation algorithms: refer-ence
  – Part-of-speech analysis: googler [N], googling [V]
  – Text-to-speech: grapheme-to-phoneme conversion
    • hothouse (/T/ or /D/)

Page 4: Morphology: Words and their Parts

  – Lets us guess the meaning of unknown words
    • ‘Twas brillig and the slithy toves…
    • Muggles moogled migwiches

Page 5: Morphology: Words and their Parts

Morphotactics

• What are the ‘rules’ for constructing a word in a given language?
  – Pseudo-intellectual vs. *intellectual-pseudo
  – Rational-ize vs. *ize-rational
  – Cretin-ous vs. *cretin-ly vs. *cretin-acious
• Possible ‘rules’
  – Suffixes are suffixes and prefixes are prefixes
  – Certain affixes attach to certain types of stems (nouns, verbs, etc.)
  – Certain stems can/cannot take certain affixes

Page 6: Morphology: Words and their Parts

• Semantics: In English, un- cannot attach to adjectives that already have a negative connotation:
  – Unhappy vs. *unsad
  – Unhealthy vs. *unsick
  – Unclean vs. *undirty
• Phonology: In English, -er cannot attach to words of more than two syllables
  – Great, greater
  – Happy, happier
  – Competent, *competenter
  – Elegant, *eleganter
  – Unruly, ?unrulier

Page 7: Morphology: Words and their Parts

Regular and Irregular Morphology

• Regular
  – Walk, walks, walking, walked, (had) walked
  – Table, tables
• Irregular
  – Eat, eats, eating, ate, (had) eaten
  – Catch, catches, catching, caught, (had) caught
  – Cut, cuts, cutting, cut, (had) cut
  – Goose, geese

Page 8: Morphology: Words and their Parts

Morphological Parsing

• Algorithms developed to use regularities, and known irregularities, to parse words into their morphemes
  • Cats → cat +N +PL
  • Cat → cat +N +SG
  • Cities → city +N +PL
  • Merging → merge +V +Present-participle
  • Caught → catch +V +past-participle
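The mappings above can be sketched as a lookup-plus-suffix-stripping parser. This is a minimal illustration, not the method the slides build toward: the lexicon entries, the `parse` function and the handful of suffix rules are all assumptions chosen to cover the five example words.

```python
# Toy lexicon: illustrative entries only, not a real lexical resource.
NOUN_STEMS = {"cat", "dog", "city", "fox"}
VERB_STEMS = {"merge", "walk"}

# Known irregular forms are simply listed with their analyses.
IRREGULAR = {
    "geese": "goose +N +PL",
    "caught": "catch +V +past-participle",
}


def parse(word):
    """Return a list of analyses such as 'cat +N +PL' for a surface word."""
    w = word.lower()
    analyses = []
    if w in IRREGULAR:
        analyses.append(IRREGULAR[w])
    if w in NOUN_STEMS:
        analyses.append(f"{w} +N +SG")
    # Regular plural: stem + s
    if w.endswith("s") and w[:-1] in NOUN_STEMS:
        analyses.append(f"{w[:-1]} +N +PL")
    # Spelling change y -> ies in the plural (city -> cities)
    if w.endswith("ies") and (w[:-3] + "y") in NOUN_STEMS:
        analyses.append(f"{w[:-3]}y +N +PL")
    # -ing, with or without deletion of a stem-final e (merge -> merging)
    if w.endswith("ing"):
        base = w[:-3]
        for stem in (base, base + "e"):
            if stem in VERB_STEMS:
                analyses.append(f"{stem} +V +Present-participle")
    return analyses


for form in ["Cats", "Cat", "Cities", "Merging", "Caught"]:
    print(form, "->", parse(form))
```

Listing irregular forms and deriving regular ones by rule is the division of labor the bullet describes; an unknown word simply yields an empty list.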

Page 9: Morphology: Words and their Parts

Morphology and Finite State Automata

• We can use the machinery provided by FSAs to capture facts about morphology
  • Accept strings that are in the language
  • Reject strings that are not
  • Do this in a way that does not require us to list all the words in the language

Page 10: Morphology: Words and their Parts

How do we build a Morphological Analyzer?

• Lexicon: list of stems and affixes (w/ corresponding part of speech (p.o.s.))

• Morphotactics of the language: model of how and which morphemes can be affixed to a stem

• Orthographic rules: spelling modifications that may occur when affixation occurs
  – in → il in the context of a following l (in- + legal → illegal)

• Most morphological phenomena can be described with regular expressions, so finite-state techniques are often used to represent morphological processes

Page 11: Morphology: Words and their Parts

Some Simple Rules

• Regular singular nouns stay as is
• Regular plural nouns have an -s on the end
• Irregulars stay as is
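These three rules can be sketched as a small finite-state acceptor over morpheme classes. The state names (`q0`, `q1`, `q2`), the class labels and the word lists are assumptions made for illustration; spelling changes such as fox/foxes are deliberately not handled.

```python
# Illustrative word lists (assumed, not taken from a real lexicon).
REG_NOUNS = {"cat", "dog", "fox"}
IRREG_SG = {"goose", "child"}
IRREG_PL = {"geese", "children"}

# Transition table: (state, morpheme class) -> next state.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",       # regular singular nouns stay as is
    ("q1", "plural-s"): "q2",       # regular plurals add -s
    ("q0", "irreg-sg-noun"): "q2",  # irregulars stay as is
    ("q0", "irreg-pl-noun"): "q2",
}
ACCEPTING = {"q1", "q2"}


def label(morpheme):
    """Map a morpheme to the class used on the FSA arcs (None if unknown)."""
    if morpheme in REG_NOUNS:
        return "reg-noun"
    if morpheme in IRREG_SG:
        return "irreg-sg-noun"
    if morpheme in IRREG_PL:
        return "irreg-pl-noun"
    if morpheme == "s":
        return "plural-s"
    return None


def run(morphemes):
    """Run the FSA over a morpheme sequence; True if it ends in an accepting state."""
    state = "q0"
    for m in morphemes:
        state = TRANSITIONS.get((state, label(m)))
        if state is None:
            return False
    return state in ACCEPTING


def accepts(word):
    """Try the word whole and, if it ends in s, split as stem + s."""
    candidates = [[word]]
    if word.endswith("s") and len(word) > 1:
        candidates.append([word[:-1], "s"])
    return any(run(c) for c in candidates)


print(accepts("cats"), accepts("geese"), accepts("gooses"))
```

Because the irregular arcs lead straight to `q2`, which has no outgoing `plural-s` arc, forms like "gooses" and "geeses" are rejected without being listed anywhere.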

Page 12: Morphology: Words and their Parts

Simple English NP FSA

Page 13: Morphology: Words and their Parts

Expand the Arcs with Stems and Affixes

[Figure: the FSA with its arcs expanded into stems and affixes; surviving arc labels: cat, dog, child, geese]

Page 14: Morphology: Words and their Parts

• We can now run strings through these machines to recognize strings in the language
  • Accept words that are ok
  • Reject words that are not
• But is this enough?
  • We often want to know the structure of a word (understanding/parsing)
  • Or we may have a stem and want to produce a surface form (production/generation)
• Example
  • From “cats” to “cat +N +PL”
  • From “cat +N +PL” to “cats”

Page 15: Morphology: Words and their Parts

Finite State Transducers (FSTs)

• Turning an FSA into an FST
  • Add another tape
  • Add extra symbols to the transitions
  • On one tape we read “cats”; on the other we write “cat +N +PL”
  • Or vice versa…

Page 16: Morphology: Words and their Parts

Koskenniemi 2-level Morphology

• Kimmo Koskenniemi’s two-level morphology
• Idea: a word is a relationship between lexical level (its morphemes) and surface level (its orthography)

Page 17: Morphology: Words and their Parts

• c:c means read a c on one tape and write a c on the other
• +N:ε means read a +N symbol on one tape and write nothing on the other
• +PL:s means read +PL and write an s

c:c a:a t:t +N:ε +PL:s
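The pair notation above can be sketched as a list of (lexical, surface) symbol pairs walked in either direction. This is a single-path toy, not a general transducer: `CATS_PATH`, `EPS` and `transduce` are names assumed for the example.

```python
EPS = ""  # epsilon: nothing is read or written on that tape

# One path through the transducer: c:c a:a t:t +N:ε +PL:s
CATS_PATH = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+PL", "s")]


def transduce(input_symbols, path, read_side):
    """Walk `path`, reading symbols on one tape and writing the other.

    read_side = 0 reads the lexical tape (generation);
    read_side = 1 reads the surface tape (parsing).
    Returns the list of written symbols, or None if the input is rejected.
    """
    write_side = 1 - read_side
    pos = 0
    output = []
    for pair in path:
        expected = pair[read_side]
        if expected != EPS:
            # A non-epsilon symbol must match the next input symbol.
            if pos >= len(input_symbols) or input_symbols[pos] != expected:
                return None
            pos += 1
        if pair[write_side] != EPS:
            output.append(pair[write_side])
    # Accept only if the whole input was consumed.
    return output if pos == len(input_symbols) else None


# Generation: lexical -> surface
print("".join(transduce(["c", "a", "t", "+N", "+PL"], CATS_PATH, 0)))
# Parsing: surface -> lexical
print(" ".join(transduce(list("cats"), CATS_PATH, 1)))
```

The same pair list serves both directions; only the choice of which tape is read changes, which is the sense of "or vice versa" in the FST description.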

Page 18: Morphology: Words and their Parts

Not So Simple

• Of course, it’s not all as easy as
  • “cat +N +PL” <-> “cats”
• What do we do about geese, mice, oxen?
• Many spelling/pronunciation changes go along with inflectional changes, e.g.
  • Fox and Foxes

Page 19: Morphology: Words and their Parts

Multi-Tape Machines

• Solution for complex changes:
  – Add more tapes
  – Use output of one tape machine as input to the next
• To handle irregular spelling changes, add intermediate tapes with intermediate symbols

Page 20: Morphology: Words and their Parts

Example of a Multi-Tape Machine

• We use one machine to transduce between the lexical and the intermediate level, and another to transduce between the intermediate and the surface tapes

Page 21: Morphology: Words and their Parts

FST Fragment: Lexical to Intermediate

• ^ is morpheme boundary; # is word boundary

Page 22: Morphology: Words and their Parts

FST Fragment: Intermediate to Surface

• Rule: insert an e after a morpheme-final x, s or z and before the morpheme s, e.g. fox^s# → foxes
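The e-insertion rule can be sketched with a regular-expression substitution on the intermediate form. This stands in for the rule's FST rather than reproducing it; the function name `intermediate_to_surface` is an assumption.

```python
import re


def intermediate_to_surface(form):
    """Map an intermediate form such as 'fox^s#' to its surface spelling.

    '^' marks a morpheme boundary and '#' the word boundary.
    """
    # e-insertion: after morpheme-final x, s or z, before the morpheme s
    form = re.sub(r"([xsz])\^(?=s#)", r"\1e^", form)
    # Boundary symbols do not appear on the surface tape
    return form.replace("^", "").replace("#", "")


for f in ["fox^s#", "cat^s#", "kiss^s#", "fox#"]:
    print(f, "->", intermediate_to_surface(f))
```

The lookahead `(?=s#)` restricts the insertion to the case where the following morpheme is exactly the word-final s, so "cat^s#" passes through unchanged apart from boundary removal.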

Page 23: Morphology: Words and their Parts

Putting Them Together

Page 24: Morphology: Words and their Parts

Practical Uses

• This kind of parsing is normally called morphological analysis

• Can be
  • An important stand-alone component of an application (spelling correction, information retrieval, part-of-speech tagging,…)
  • Or simply a link in a chain of processing (machine translation, parsing,…)

Page 25: Morphology: Words and their Parts

Porter Stemmer (1980)

• Standard, very popular and usable stemmer (IR, IE) – identify a word’s stem

• Sequence of cascaded rewrite rules, e.g.
  – IZE → ε (e.g. unionize → union)
  – CY → T (e.g. frequency → frequent)
  – ING → ε, if stem contains vowel (motoring → motor)
• Can be implemented as a lexicon-free FST (many implementations available on the web)
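The three example rules can be sketched as a cascade of suffix rewrites. This is only an illustration of the cascading idea: it is not the Porter algorithm, it omits that algorithm's other rules and its conditions on stem length, and the function names are assumptions.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def strip_ize(word):
    """IZE -> ε (unionize -> union)."""
    return word[:-3] if word.endswith("ize") else word


def cy_to_t(word):
    """CY -> T (frequency -> frequent)."""
    return word[:-2] + "t" if word.endswith("cy") else word


def strip_ing(word):
    """ING -> ε, only if the remaining stem contains a vowel (motoring -> motor)."""
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]
    return word


# The cascade: the output of each rule is the input to the next.
RULES = [strip_ize, cy_to_t, strip_ing]


def stem(word):
    for rule in RULES:
        word = rule(word)
    return word


for w in ["unionize", "frequency", "motoring", "sing"]:
    print(w, "->", stem(w))
```

The vowel condition is what keeps "sing" from being reduced to "s". No lexicon is consulted at any point, which is why such a stemmer can overstem words the rules were not designed for.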

Page 26: Morphology: Words and their Parts

Important Note: Morphology Differs by Language

• Languages differ in how they encode morphological information
  – Isolating languages (e.g. Cantonese) have no affixes: each word usually has 1 morpheme
  – Agglutinative languages (e.g. Finnish, Turkish) build words from prefixes and suffixes added to a stem (like beads on a string), with each feature realized by a single affix, e.g. Finnish
    epäjärjestelmällistyttämättömyydellänsäkäänköhän
    ‘Wonder if he can also ... with his capability of not causing things to be unsystematic’

Page 27: Morphology: Words and their Parts

  – Polysynthetic languages (e.g. Inuit languages) express much of their syntax in their morphology, incorporating a verb’s arguments into the verb, e.g. Western Greenlandic
    Aliikusersuillammassuaanerartassagaluarpaalli.
    aliiku-sersu-i-llammas-sua-a-nerar-ta-ssa-galuar-paal-li
    entertainment-provide-SEMITRANS-one.good.at-COP-say.that-REP-FUT-sure.but-3.PL.SUBJ/3SG.OBJ-but
    'However, they will say that he is a great entertainer, but ...'
  – So different languages may require very different morphological analyzers

Page 28: Morphology: Words and their Parts

Concatenative vs. Non-concatenative Morphology

• Semitic root-and-pattern morphology
  – Root (2-4 consonants) conveys basic semantics (e.g. Arabic /ktb/)
  – Vowel pattern conveys voice and aspect
  – Derivational template (binyan) identifies word class

Page 29: Morphology: Words and their Parts

Template     Vowel pattern            Gloss
             active      passive
CVCVC        katab       kutib        write
CVCCVC       kattab      kuttib       cause to write
CVVCVC       ka:tab      ku:tib       correspond
tVCVVCVC     taka:tab    tuku:tib     write each other
nCVVCVC      nka:tab     nku:tib      subscribe
CtVCVC       ktatab      ktutib       write
stVCCVC      staktab     stuktib      dictate
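The table can be reproduced by interleaving the root /ktb/ with a vowel melody inside each template. The sketch below is reverse-engineered from these seven rows only: the gemination rule, the passive melody (u on every vowel slot but the last, which takes i) and the function name `interdigitate` are assumptions, not a general account of Arabic morphology.

```python
import re


def interdigitate(template, root, voice="active"):
    """Fill the C and V slots of `template` with root consonants and vowels.

    Slots: C = root consonant, V = short vowel, VV = long vowel (written v:),
    any lowercase letter = literal affix material (e.g. t, n, st).
    """
    slots = re.findall(r"VV|V|C|[a-z]", template)
    consonants = list(root)
    n_c = slots.count("C")
    if n_c == len(consonants) + 1:
        # Assumed rule: one extra C slot doubles the second root consonant
        # (CVCCVC + ktb -> kattab).
        consonants.insert(1, consonants[1])
    if n_c != len(consonants):
        raise ValueError("root does not fit template")

    n_v = sum(1 for s in slots if s in ("V", "VV"))
    if voice == "active":
        vowels = ["a"] * n_v
    else:
        vowels = ["u"] * (n_v - 1) + ["i"]

    out = []
    c_iter, v_iter = iter(consonants), iter(vowels)
    for s in slots:
        if s == "C":
            out.append(next(c_iter))
        elif s == "V":
            out.append(next(v_iter))
        elif s == "VV":
            out.append(next(v_iter) + ":")
        else:
            out.append(s)  # literal template material
    return "".join(out)


for t in ["CVCVC", "CVCCVC", "CVVCVC", "tVCVVCVC", "nCVVCVC", "CtVCVC", "stVCCVC"]:
    print(t, interdigitate(t, "ktb"), interdigitate(t, "ktb", "passive"))
```

Root, vowel melody and template are kept as three separate inputs, which is what makes this morphology non-concatenative: no contiguous substring of the output corresponds to the root.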

Page 30: Morphology: Words and their Parts

Morphological Representations: Evidence from Human Performance

• Hypotheses:
  – Full listing hypothesis: words listed
  – Minimum redundancy hypothesis: morphemes listed
• Experimental evidence:
  – Priming experiments (Does seeing/hearing one word facilitate recognition of another?) suggest something in between
    • Regularly inflected forms (e.g. cars) prime their stem (car), but derived forms do not (e.g. management does not prime manage)

Page 31: Morphology: Words and their Parts

• But spoken derived words can prime stems if they are semantically close (e.g. government/govern but not department/depart)

• Speech errors suggest affixes must be represented separately in the mental lexicon
  – ‘easy enoughly’ for ‘easily enough’

• Importance of morphological family size
  – Larger families → faster recognition

Page 32: Morphology: Words and their Parts

Summing Up

• Regular expressions and FSAs can represent subsets of natural language as well as regular languages
  – Both representations may be difficult for humans to understand for any real subset of a language
  – Can be hard to scale up: e.g., when many choices at any point (e.g. surnames)
  – But quick, powerful and easy to use for small problems
  – AT&T Finite State Toolkit does scale
• Next class:
  – Read Ch 4 on Ngrams
  – HW1 will be due at midnight on Oct 1