Morphology
Reading: Chap 3, Jurafsky & Martin
Instructor: Paul Tarau, based on Rada Mihalcea’s original slides
Note: Some of the material in this slide set was adapted from Christel Kemke (U. Manitoba) slides on morphology


Slide 2

Morphology

Morpheme = "minimal meaning-bearing unit in a language"

Morphology handles the formation of words by using morphemes
  – base form (stem), e.g., believe
  – affixes (suffixes, prefixes, infixes), e.g., un-, -able, -ly

Morphological parsing = the task of recognizing the morphemes inside a word
  – e.g., hands, foxes, children

Important for many tasks
  – machine translation
  – information retrieval
  – lexicography
  – any further processing (e.g., part-of-speech tagging)
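
As a rough illustration of morphological parsing, the sketch below analyses the three example words (hands, foxes, children) against a toy lexicon. The lexicon contents, the `parse` function name and the `+SG` / `+PL` feature labels are assumptions made for this example; they are not taken from the slides.

```python
# Toy lexicon: illustrative assumptions, not part of the original slides.
STEMS = {"hand", "fox", "child"}
IRREGULAR_PLURALS = {"children": "child"}


def parse(word):
    """Return (stem, feature) for a noun, or None if no analysis is found."""
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word], "+PL"
    if word in STEMS:
        return word, "+SG"
    # Try the longer suffix first so that 'foxes' is split as fox + es.
    for suffix in ("es", "s"):
        if word.endswith(suffix) and word[: -len(suffix)] in STEMS:
            return word[: -len(suffix)], "+PL"
    return None


for w in ("hands", "foxes", "children"):
    print(w, "->", parse(w))
```

Checking candidate stems against a lexicon is what keeps this from over-segmenting: a word such as "bus" is not split into "bu" + "s" because "bu" is not a known stem.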


Slide 3

Morphemes and Words

Combine morphemes to create words

Inflection
  – combination of a word stem with a grammatical morpheme
  – same word class, e.g. clean (verb), clean-ing (verb)

Derivation
  – combination of a word stem with a grammatical morpheme
  – yields different word class, e.g. clean (verb), clean-ing (noun)

Compounding
  – combination of multiple word stems

Cliticization
  – combination of a word stem with a clitic
  – different words from different syntactic categories, e.g. I’ve = I + have


Slide 4

Inflectional Morphology

word stem + grammatical morpheme, e.g. cat + s
only for nouns, verbs, and some adjectives

Nouns
  – plural: regular: +s, +es; irregular: mouse - mice; ox - oxen
  – rules for exceptions: e.g. -y -> -ies, like butterfly - butterflies
  – possessive: +'s, +'

Verbs
  – main verbs (sleep, eat, walk)
  – modal verbs (can, will, should)
  – primary verbs (be, have, do)
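
The noun plural rules above (regular +s / +es, the -y -> -ies exception, and a list of irregular forms) can be sketched as a small function. This is a simplified illustration: the set of endings that trigger +es and the consonant check before -y are common textbook assumptions, and the `pluralize` name is made up for this example.

```python
# Irregular plurals mentioned on the slide.
IRREGULAR = {"mouse": "mice", "ox": "oxen"}
VOWELS = "aeiou"


def pluralize(noun):
    """Form an English plural using a few hand-written rules."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    # consonant + y -> ies (butterfly -> butterflies), but boy -> boys
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # sibilant endings take +es (assumed list of endings)
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    return noun + "s"


for n in ("cat", "fox", "butterfly", "boy", "mouse", "ox"):
    print(n, "->", pluralize(n))
```

The irregular table is consulted first, so the exception list always wins over the general rules.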


Slide 5

Inflectional Morphology (verbs)

Verb inflections for: main verbs (sleep, eat, walk); primary verbs (be, have, do)

Morphological form     Regularly inflected form
stem                   walk      merge     try      map
-s form                walks     merges    tries    maps
-ing participle        walking   merging   trying   mapping
past; -ed participle   walked    merged    tried    mapped

Morphological form     Irregularly inflected form
stem                   eat       catch      cut
-s form                eats      catches    cuts
-ing participle        eating    catching   cutting
-ed past               ate       caught     cut
-ed participle         eaten     caught     cut
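
The regular verb forms (walk, merge, try, map) follow a handful of spelling rules: drop a final e before -ing, change consonant + y to ie before -s and -ed, and double a final consonant after a single vowel. The sketch below encodes those rules in a deliberately crude way. The consonant-vowel-consonant test is an assumption that fits short stems like "map" and would wrongly double in words such as "visit"; the function names are hypothetical.

```python
VOWELS = "aeiou"


def _ends_consonant_y(stem):
    return len(stem) >= 2 and stem.endswith("y") and stem[-2] not in VOWELS


def _doubles_final_consonant(stem):
    # Crude consonant-vowel-consonant test; final w, x, y never double.
    return (
        len(stem) >= 3
        and stem[-1] not in VOWELS + "wxy"
        and stem[-2] in VOWELS
        and stem[-3] not in VOWELS
    )


def inflect_regular(stem):
    """Return the -s, -ing and -ed forms of a regular verb stem."""
    if _ends_consonant_y(stem):
        s_form = stem[:-1] + "ies"
    elif stem.endswith(("s", "x", "z", "ch", "sh")):
        s_form = stem + "es"
    else:
        s_form = stem + "s"

    if stem.endswith("e"):
        ing_form, ed_form = stem[:-1] + "ing", stem + "d"
    elif _doubles_final_consonant(stem):
        ing_form, ed_form = stem + stem[-1] + "ing", stem + stem[-1] + "ed"
    elif _ends_consonant_y(stem):
        ing_form, ed_form = stem + "ing", stem[:-1] + "ied"
    else:
        ing_form, ed_form = stem + "ing", stem + "ed"
    return {"stem": stem, "-s": s_form, "-ing": ing_form, "-ed": ed_form}


for verb in ("walk", "merge", "try", "map"):
    print(inflect_regular(verb))
```

Irregular verbs such as eat, catch and cut cannot be produced by rules like these and would need a lookup table.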


Slide 6

Inflectional Morphology (nouns)

Noun inflections for: regular nouns (cat, hand); irregular nouns (child, ox)

Morphological form   Regularly inflected form
stem                 cat    hand
plural form          cats   hands

Morphological form   Irregularly inflected form
stem                 child      ox
plural form          children   oxen


Slide 7

Inflectional and Derivational Morphology (adjectives)

Adjective inflections and derivations:

prefix   un-     unhappy     adjective, negation
suffix   -ly     happily     adverb, mode
         -er     happier     adjective, comparative 1
         -est    happiest    adjective, comparative 2 (superlative)
suffix   -ness   happiness   noun

plus combinations, like unhappiest, unhappiness.

Distinguish different adjective classes, which can or cannot take certain inflectional or derivational forms, e.g. no negation for big.


Slide 8

Derivational Morphology (nouns)


Slide 9

Derivational Morphology (adjectives)


Slide 10

Verb Clitics


Methods, Algorithms


Slide 12

Stemming

Stemming algorithms strip off word affixes
  – yield stem only, no additional information (like plural, 3rd person etc.)
  – used, e.g. in web search engines
  – famous stemming algorithm: the Porter stemmer


Slide 13

Stemming

Reduce tokens to “root” form of words to recognize morphological variation.
  – “computer”, “computational”, “computation” all reduced to same token “compute”

Correct morphological analysis is language specific and can be complex.

Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion.

for example compressed and compression are both accepted as equivalent to compress.

for exampl compres and compres are both accept as equival to compres.


Slide 14

Porter Stemmer

Simple procedure for removing known affixes in English without using a dictionary.

Can produce unusual stems that are not English words:
  – “computer”, “computational”, “computation” all reduced to same token “comput”

May conflate (reduce to the same token) words that are actually distinct.

Does not recognize all morphological derivations.

Typical rules in Porter stemmer:
  sses → ss
  ies → i
  ational → ate
  tional → tion
  ing → ε
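
To make the rule format concrete, here is a minimal sketch that applies only the five suffix rewrite rules named on this slide (sses → ss, ies → i, ational → ate, tional → tion, ing → empty). It is not the Porter stemmer: the real algorithm runs several ordered steps and attaches conditions to most rules, all of which are omitted here. The `apply_rules` name is an assumption for this example.

```python
# (suffix, replacement) pairs; 'ational' must be tried before 'tional'.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
    ("ing", ""),
]


def apply_rules(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "relational", "conditional", "walking", "sing"):
    print(w, "->", apply_rules(w))
```

Without any conditions, "sing" is reduced to "s", which shows why the actual stemmer only strips a suffix when enough of a stem remains.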


Slide 15

Stemming Problems

Errors of Commission                Errors of Omission
(distinct words conflated)          (related words not conflated)

organization     organ              European    Europe
doing            doe                analysis    analyzes
generalization   generic            matrices    matrix
numerical        numerous           noise       noisy
policy           police             sparse      sparsity


Slide 16

Tokenization, Word Segmentation

Tokenization or word segmentation
  – separate out “words” (lexical entries) from running text
  – expand abbreviated terms
      E.g. I’m into I am, it’s into it is
  – collect tokens forming single lexical entry
      E.g. New York marked as one single entry

More of an issue in languages like Chinese
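
The two steps above (expanding abbreviated forms and grouping tokens that form one lexical entry) might look like the following sketch. Both lookup tables are tiny illustrative assumptions, and expanding "it's" to "it is" ignores the "it has" reading, so a real system would need context to choose.

```python
# Illustrative tables; real systems use much larger resources.
CONTRACTIONS = {"i'm": "I am", "it's": "it is"}
MULTIWORD = {("New", "York")}


def tokenize(text):
    """Split on whitespace, expand contractions, merge known multiword entries."""
    words = text.replace("’", "'").split()
    expanded = []
    for word in words:
        expanded.extend(CONTRACTIONS.get(word.lower(), word).split())

    tokens = []
    i = 0
    while i < len(expanded):
        pair = tuple(expanded[i : i + 2])
        if pair in MULTIWORD:
            tokens.append(" ".join(pair))  # keep "New York" as one token
            i += 2
        else:
            tokens.append(expanded[i])
            i += 1
    return tokens


print(tokenize("I'm in New York"))
```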


Slide 17

Simple Tokenization

Analyze text into a sequence of discrete tokens (words).

Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token. However, frequently they are not.

Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.

More careful approach:
  – Separate ? ! ; : “ ‘ [ ] ( ) < >
  – Care with . - why? when?
  – Care with … ??
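
Both approaches can be sketched with regular expressions. The first function implements the simplest approach (case-insensitive runs of alphabetic characters); the second splits off the punctuation marks listed above but leaves "." and "-" attached, since those need care. The function names and the exact character set, which includes straight and curly quote variants, are assumptions for this example.

```python
import re


def simple_tokens(text):
    """Simplest approach: lowercase runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


# Marks to split off as separate tokens; '.' and '-' are deliberately absent.
SEPARATE = "?!;:\"'“”‘’[]()<>"
_CLASS = re.escape(SEPARATE)
_CAREFUL = re.compile("[{0}]|[^\\s{0}]+".format(_CLASS))


def careful_tokens(text):
    """Split off listed punctuation; keep '.', '-' and digits inside tokens."""
    return _CAREFUL.findall(text)


print(simple_tokens("E-mail from 1999: Republican vs. republican!"))
print(careful_tokens("Was it e-mail (1999)?"))
```

Note that the simple approach turns "e-mail" into two tokens and drops "1999" entirely, while the careful one keeps both intact.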


Slide 18

Punctuation

Children’s: use language-specific mappings to normalize (e.g. Anglo-Saxon genitive of nouns, verb contractions: won’t -> wo n’t)

State-of-the-art: break up hyphenated sequence.

U.S.A. vs. USA

a.out


Slide 19

Numbers
  – 3/12/91
  – Mar. 12, 1991
  – 55 B.C.
  – B-52
  – 100.2.86.144

Generally, don’t index as text
  – Creation dates for docs


Slide 20

Lemmatization

Reduce inflectional/derivational forms to base form

Direct impact on vocabulary size

E.g.,
  – am, are, is → be
  – car, cars, car's, cars' → car
  – the boy's cars are different colors → the boy car be different color

How to do this?
  – Need a list of grammatical rules + a list of irregular words
  – children → child, spoken → speak …

Practical implementation: use WordNet’s morphstr function
  – Perl: WordNet::QueryData (first returned value from validForms function)
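
The "rules plus irregular list" idea can be sketched in a few lines of plain Python. This is not the WordNet morphstr function or WordNet::QueryData; it is a hand-rolled illustration whose two rules (strip a possessive marker, strip a plural -s) and length threshold are assumptions chosen to reproduce the examples on this slide.

```python
# Irregular forms from the slide examples.
IRREGULAR = {
    "am": "be", "are": "be", "is": "be",
    "children": "child", "spoken": "speak",
}


def lemma(token):
    """Map a token to a base form using an exception list and two rules."""
    t = token.lower().replace("’", "'")
    # Rule 1: strip possessive markers ('s or a trailing apostrophe).
    if t.endswith("'s"):
        t = t[:-2]
    elif t.endswith("'"):
        t = t[:-1]
    if t in IRREGULAR:
        return IRREGULAR[t]
    # Rule 2: strip plural -s, but leave short words and -ss endings alone.
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t


sentence = "the boy's cars are different colors"
print(" ".join(lemma(w) for w in sentence.split()))
```

Rules this blunt also damage words like "this", which is one reason practical systems rely on a dictionary such as WordNet to validate candidate base forms.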


Slide 21

Morphological Processing

Knowledge
  – lexical entry: stem plus possible prefixes, suffixes plus word classes, e.g. endings for verb forms (see tables above)
  – rules: how to combine stem and affixes, e.g. add s to form plural of noun as in dogs
  – orthographic rules: spelling, e.g. double consonant as in mapping

Processing: Finite State Transducers
  – take information above and analyze word token / generate word form
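
As a simplified illustration of how a lexicon and combination rules feed a finite-state machine, the sketch below implements a finite-state acceptor over morpheme classes for noun inflection. It is an acceptor only, not a full transducer: it checks whether a morpheme sequence is well formed but produces no analysis and applies no orthographic rules. The state numbers, class names and lexicon entries are assumptions for this example.

```python
# Lexicon: morpheme -> class (illustrative entries).
LEXICON = {
    "cat": "reg-noun", "hand": "reg-noun", "dog": "reg-noun",
    "child": "irreg-sg-noun", "ox": "irreg-sg-noun",
    "children": "irreg-pl-noun", "oxen": "irreg-pl-noun",
    "-s": "plural",
}

# Morphotactic rules as transitions: (state, morpheme class) -> next state.
TRANSITIONS = {
    (0, "reg-noun"): 1,
    (1, "plural"): 2,
    (0, "irreg-sg-noun"): 2,
    (0, "irreg-pl-noun"): 2,
}
ACCEPTING = {1, 2}


def accepts(morphemes):
    """Return True if the morpheme sequence is a well-formed noun."""
    state = 0
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, LEXICON.get(morpheme)))
        if state is None:  # no transition: reject
            return False
    return state in ACCEPTING


print(accepts(["cat", "-s"]))   # regular noun + plural
print(accepts(["ox", "-s"]))    # irregular noun cannot take -s
print(accepts(["oxen"]))        # irregular plural listed in the lexicon
```

A transducer would extend each transition with an output symbol, so that the same traversal could map between a surface form and an analysis such as stem plus features.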


Slide 22

Fig. 3.3 FSA for verb inflection.


Slide 23

Fig. 3.4 Simple FSA for adjective inflection.

Fig. 3.5 More detailed FSA for adjective inflection.


Slide 24

Fig. 3.7 Compiled FSA for noun inflection.