
Christel Kemke

COMP 4060 Natural Language Processing

Morphology, Word Classes, POS Tagging


Overview

Morphology
Stemming
Word Classes
POS Tagging

(Jurafsky, 2nd edition, Ch. 2, 3, 5; Allen Ch. 2,3)


Morphology


Morphemes and Words

Morpheme = "minimal meaning-bearing unit in a language"

Combine morphemes to create words:

Inflection
  combination of a word stem with a grammatical morpheme
  same word class, e.g. clean (verb), clean-ing (verb)

Derivation
  combination of a word stem with a grammatical morpheme
  yields different word class, e.g. clean (verb), clean-ing (noun)

Compounding
  combination of multiple word stems

Cliticization
  combination of a word stem with a clitic
  different words from different syntactic categories, e.g. I’ve = I + have


Inflectional Morphology

Word stem + grammatical morpheme, e.g. cat + s
Only for nouns, verbs, and some adjectives

Nouns
  plural:
    regular: +s, +es
    irregular: mouse - mice; ox - oxen
    rules for exceptions: e.g. -y -> -ies, like butterfly - butterflies
  possessive: +'s, +'

Verbs
  main verbs (sleep, eat, walk)
  modal verbs (can, will, should)
  primary verbs (be, have, do)
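The noun plural rules above can be sketched in a few lines of Python. This is only an illustration: the function name `pluralize`, the tiny irregular table, and the exact list of endings that take +es are assumptions made for this example, not part of the course material.

```python
# Irregular plurals are looked up first (only the two examples from the slide).
IRREGULAR_PLURALS = {"mouse": "mice", "ox": "oxen"}

VOWELS = "aeiou"


def pluralize(noun):
    """Return a plural form using the regular rules plus a small exception table."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    # Exception rule: consonant + y -> ies (butterfly -> butterflies).
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # Regular +es after sibilant-like endings (assumed list).
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Regular +s otherwise.
    return noun + "s"


for word in ["cat", "fox", "butterfly", "mouse", "ox"]:
    print(word, "->", pluralize(word))
```

The lookup order matters: ox must be caught by the irregular table before the +es rule would turn it into oxes.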


Inflectional Morphology (verbs)

Verb inflections only for: main verbs (sleep, eat, walk); primary verbs (be, have, do)

Morphological Form      Regularly Inflected Form
stem                    walk      merge     try      map
-s form                 walks     merges    tries    maps
-ing participle         walking   merging   trying   mapping
past; -ed participle    walked    merged    tried    mapped

Morphological Form      Irregularly Inflected Form
stem                    eat       catch     cut
-s form                 eats      catches   cuts
-ing participle         eating    catching  cutting
-ed past                ate       caught    cut
-ed participle          eaten     caught    cut
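The regular half of the table can be reproduced with a handful of spelling rules. The sketch below is a simplification under stated assumptions: the helper names are hypothetical, the consonant-doubling test only handles one-vowel stems such as map, and irregular verbs (eat, catch, cut) would need a separate lookup table.

```python
VOWELS = set("aeiou")


def doubles_final_consonant(stem):
    """Simplified test: one-vowel stem ending consonant-vowel-consonant (not w, x, y)."""
    if len(stem) < 3:
        return False
    c1, v, c2 = stem[-3], stem[-2], stem[-1]
    one_vowel = sum(ch in VOWELS for ch in stem) == 1
    return (one_vowel and c1 not in VOWELS and v in VOWELS
            and c2 not in VOWELS and c2 not in "wxy")


def ends_in_consonant_y(stem):
    return len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS


def inflect(stem):
    """Return the regular -s, -ing and -ed forms of a verb stem."""
    # -s form: try -> tries, catch -> catches, walk -> walks
    if ends_in_consonant_y(stem):
        s_form = stem[:-1] + "ies"
    elif stem.endswith(("s", "x", "z", "ch", "sh")):
        s_form = stem + "es"
    else:
        s_form = stem + "s"

    # -ing participle: merge -> merging (e-deletion), map -> mapping (doubling)
    if stem.endswith("e") and not stem.endswith("ee"):
        ing_form = stem[:-1] + "ing"
    elif doubles_final_consonant(stem):
        ing_form = stem + stem[-1] + "ing"
    else:
        ing_form = stem + "ing"

    # past / -ed participle: merge -> merged, try -> tried, map -> mapped
    if stem.endswith("e"):
        ed_form = stem + "d"
    elif ends_in_consonant_y(stem):
        ed_form = stem[:-1] + "ied"
    elif doubles_final_consonant(stem):
        ed_form = stem + stem[-1] + "ed"
    else:
        ed_form = stem + "ed"

    return {"stem": stem, "-s": s_form, "-ing": ing_form, "-ed": ed_form}


for verb in ["walk", "merge", "try", "map"]:
    print(inflect(verb))
```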


Inflectional and Derivational Morphology (adjectives)

Adjective Inflections and Derivations:

prefix   un-     unhappy     adjective, negation
suffix   -ly     happily     adverb, mode
suffix   -er     happier     adjective, comparative 1 (comparative)
suffix   -est    happiest    adjective, comparative 2 (superlative)
suffix   -ness   happiness   noun

plus combinations, like unhappiest, unhappiness.

Distinguish different adjective classes, which can or cannot take certain inflectional or derivational forms, e.g. no negation for big.


Inflectional Morphology


Noun Inflections


Verb Inflections


Derivational Morphology


Noun Derivation


Adjective Derivation


Clitics


Verb Clitics


Methods, Algorithms


Stemming

Stemming algorithms strip off word affixes
  yield stem only, no additional information (like plural, 3rd person etc.)
  used, e.g., in web search engines
  famous stemming algorithm: the Porter stemmer


Stemming Methods

Rule-based stemming

Example rules:
  ATIONAL → ATE   e.g., relational → relate
  ING → ε         if stem contains vowel, e.g., motoring → motor
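The two example rules can be written as a toy stemmer. This is a minimal sketch, not the Porter stemmer itself: the function name `stem` is an assumption, only these two rules are implemented, and the additional conditions the full algorithm places on the remaining stem are left out.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def stem(word):
    """Apply the two example rules; return the word unchanged if neither fires."""
    w = word.lower()
    # Rule 1: ATIONAL -> ATE (relational -> relate)
    if w.endswith("ational"):
        return w[:-len("ational")] + "ate"
    # Rule 2: ING -> (empty), but only if the remaining stem contains a vowel
    if w.endswith("ing"):
        candidate = w[:-len("ing")]
        if VOWEL.search(candidate):
            return candidate
    return w


print(stem("relational"))  # relate
print(stem("motoring"))    # motor
print(stem("sing"))        # sing (the remainder "s" has no vowel)
```

The vowel condition is what keeps short words such as sing from being stripped down to a single consonant.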


Stemming Problems

Errors of Commission                 Errors of Omission
(unrelated words conflated)          (related words not conflated)

organization    organ                European    Europe
doing           doe                  analysis    analyzes
generalization  generic              matrices    matrix
numerical       numerous             noise       noisy
policy          police               sparse      sparsity


Tokenization, Word Segmentation

Tokenization or word segmentation
  separate out “words” (lexical entries) from running text
  expand abbreviated terms
    E.g. I’m into I am, it’s into it is
  collect tokens forming single lexical entry
    E.g. New York marked as one single entry
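The three tokenization steps listed above (separate, expand, collect) can be sketched as follows. The contraction table and the multiword list are tiny, hypothetical stand-ins for a real lexicon, and the regular expression is an assumed approximation of word boundaries.

```python
import re

# Hypothetical lookup tables; a real system would use a full lexicon.
# Note that "it's" can also stand for "it has"; this sketch ignores that ambiguity.
CONTRACTIONS = {"i'm": ["I", "am"], "it's": ["it", "is"], "i've": ["I", "have"]}
MULTIWORD_ENTRIES = {("New", "York")}


def tokenize(text):
    """Split text into tokens, expand contractions, and merge multiword entries."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    # Step 1: separate words (optionally with a clitic), numbers, and punctuation.
    raw = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]", text)

    # Step 2: expand abbreviated terms.
    expanded = []
    for token in raw:
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))

    # Step 3: collect adjacent tokens that form a single lexical entry.
    tokens = []
    i = 0
    while i < len(expanded):
        if i + 1 < len(expanded) and (expanded[i], expanded[i + 1]) in MULTIWORD_ENTRIES:
            tokens.append(expanded[i] + " " + expanded[i + 1])
            i += 2
        else:
            tokens.append(expanded[i])
            i += 1
    return tokens


print(tokenize("I'm in New York."))
```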


Tokenization, Word Segmentation

Finite state transducer (FST)
  Modifies input string (rules)
  Recognizes (stored) abbreviations and composite words
  See Fig. 3.22 in Jurafsky, Ch. 3

More of an issue in languages like Chinese


Lemmatization

Lemmatization maps words with same root but different surface appearances onto the same lexeme

e.g. buys, bought, buying -> buy
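A lookup-based sketch of this mapping is shown below. The lexicon, the irregular-form table, and the suffix list are all small assumed examples; a real lemmatizer would rely on a full morphological analysis rather than this suffix check.

```python
# Hypothetical mini-lexicon and irregular-form table for illustration only.
LEXICON = {"buy", "walk", "eat"}
IRREGULAR_FORMS = {"bought": "buy", "ate": "eat", "eaten": "eat"}
SUFFIXES = ("ing", "ed", "es", "s")


def lemmatize(word):
    """Map a surface form onto its lexeme, or return it unchanged if unknown."""
    w = word.lower()
    if w in LEXICON:
        return w
    if w in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[w]
    # Strip a suffix only if what remains is a known lexeme.
    for suffix in SUFFIXES:
        if w.endswith(suffix) and w[:-len(suffix)] in LEXICON:
            return w[:-len(suffix)]
    return w


print([lemmatize(w) for w in ["buys", "bought", "buying"]])
```

Unlike a stemmer, the result is always an entry of the lexicon (or the unchanged input), which is why irregular forms such as bought need an explicit table.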


Morphological Processing


Word Recognition

Spelling Errors
  Mark non-words based on dictionary/lexicon
  Use “minimum edit distance”
    Dynamic programming
    Table-based
    Transform operations: deletion, substitution, insertion
    Calculate minimum path
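The table-based computation can be sketched as below. The slide does not fix the operation costs, so the cost parameters here are assumptions: unit cost for every operation by default (Levenshtein distance), with the substitution cost left adjustable.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Fill the dynamic-programming table and return the cost of the cheapest path."""
    n, m = len(source), len(target)
    # table[i][j] = cost of transforming source[:i] into target[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = table[i - 1][0] + del_cost
    for j in range(1, m + 1):
        table[0][j] = table[0][j - 1] + ins_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            table[i][j] = min(
                table[i - 1][j] + del_cost,                       # deletion
                table[i][j - 1] + ins_cost,                       # insertion
                table[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )
    return table[n][m]


print(min_edit_distance("intention", "execution"))              # unit costs
print(min_edit_distance("intention", "execution", sub_cost=2))  # substitution = delete + insert
```

For spelling correction, a non-word would be compared against lexicon entries and the entries with the smallest distance proposed as candidates.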

Morphological Parser = FST


Morphological Processing

Knowledge
  lexical entry: stem plus possible prefixes, suffixes plus word classes, e.g. endings for verb forms (see tables above)
  rules: how to combine stem and affixes, e.g. add s to form plural of noun as in dogs
  orthographic rules: spelling, e.g. double consonant as in mapping

Processing: Finite State Transducers
  take information above and analyze word token / generate word form


Fig. 3.3 FSA for verb inflection.


Fig. 3.5 More detailed FSA for adjective inflection.

Fig. 3.4 Simple FSA for adjective inflection.


Fig. 3.7 Compiled FSA for noun inflection.


Fig. 3.12 Lexical and intermediate tape of a FS Transducer

Fig. 3.13 Lexical, intermediate, and surface tape after spelling transformation.
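The three tapes can be imitated with two small functions, one per mapping. This is a sketch under stated assumptions, not a real transducer: the function names, the feature symbols +N, +SG, +PL, the boundary markers ^ and #, the three-noun lexicon, and the single e-insertion spelling rule are all chosen for illustration.

```python
import re

# Hypothetical mini-lexicon of regular noun stems.
NOUN_STEMS = {"fox", "cat", "dog"}


def lexical_to_intermediate(lexical):
    """Map e.g. 'fox +N +PL' to 'fox^s#' (^ = morpheme boundary, # = word boundary)."""
    stem, *features = lexical.split()
    if stem not in NOUN_STEMS:
        raise ValueError("unknown stem: " + stem)
    if features == ["+N", "+PL"]:
        return stem + "^s#"
    if features == ["+N", "+SG"]:
        return stem + "#"
    raise ValueError("unsupported features: " + " ".join(features))


def intermediate_to_surface(intermediate):
    """Apply the e-insertion spelling rule, then erase the boundary symbols."""
    # Insert e between a sibilant-like ending and the plural s: fox^s# -> foxes#
    spelled = re.sub(r"(x|s|z|ch|sh)\^s#", r"\1es#", intermediate)
    return spelled.replace("^", "").replace("#", "")


def generate(lexical):
    return intermediate_to_surface(lexical_to_intermediate(lexical))


print(lexical_to_intermediate("fox +N +PL"))
print(generate("fox +N +PL"))
print(generate("cat +N +PL"))
```

Composing the two functions mirrors the cascade of a lexicon transducer followed by spelling-rule transducers; running the mappings in the opposite direction would give analysis instead of generation.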


Word Classes and POS Tagging


Word Classes

Sort words into categories according to:

morphological properties
  Which types of morphological forms do they take?
  e.g. form plural: noun+s; 3rd person: verb+s

distributional properties
  What other words or phrases can occur nearby?
  e.g. possessive pronoun before noun

semantic coherence
  Classify according to similar semantic type.
  e.g. nouns refer to object-like entities


Open vs. Closed Word Classes

Open Class Types
  The set of words in these classes can change over time, with the development of the language, e.g. spaghetti and download

Open Class Types: nouns, verbs, adjectives, adverbs


Open vs. Closed Word Classes

Closed Class Types
  The set of words in these classes is largely fixed and hardly ever changes for one language.

Closed Class Types: prepositions, determiners, pronouns, conjunctions, auxiliary verbs, particles, numerals


Open Class Words: Nouns

Nouns
  denote objects, concepts, entities, events

Proper Nouns
  Names for specific individual objects, entities
  e.g. the Eiffel Tower, Dr. Kemke

Common Nouns
  Names for categories, classes, abstracts, events
  e.g. fruit, banana, table, freedom, sleep, race, ...

Count Nouns
  enumerable entities, e.g. two bananas

Mass Nouns
  not countable items, e.g. water, salt, freedom


Open Class Words: Verbs

Verbs
  denote actions, processes, and states, e.g. smoke, dream, rest, run

several morphological forms, e.g.

  non-3rd person                                 - eat, sleep
  3rd person                                     - eats, sleeps
  progressive / present participle / gerundive   - eating, sleeping
  past participle                                - eaten, slept
  simple past                                    - ate, slept


Open Class Words: Verbs (2)

non-3rd person           eat      I eat. We eat. They eat.
3rd person               eats     He eats. She eats. It eats.
progressive              eating   He is eating. He will be eating. He has been eating.
  e.g. present participle         He is eating.
       gerundive                  Eating scorpions [NP] is common in China.
       use as adjective           Eating children [NP] are common at McDonalds.
past participle          eaten    He has eaten the scorpion. The scorpion was eaten.
simple past              ate      He ate the scorpion.


Verb Forms 1 - The five verb forms

Fig.2.6. The five verb forms. (Allen, 1995, p.28)


Verb Forms 2 - The basic tenses

Fig.2.7. The basic tenses. (Allen, 1995, p.29)


Verb Forms 3 - The progressive tenses

Fig.2.8. The progressive tenses. (Allen, 1995, p.29)

Simple
  Past: cooked
    An action that ended at a point in the past.
    (time clue)* e.g. He cooked yesterday.
  Present: cook / cooks
    An action that exists, is usual, or is repeated.
    (time clue)* e.g. He cooks dinner every Friday.
  Future: will cook
    A plan for future action.
    (time clue)* e.g. He will cook tomorrow.

Progressive: be + main verb + ing
  Past: was / were cooking
    An action was happening (past progressive) when another action happened (simple past).
    (time clue)* e.g. He was cooking when the phone rang.
  Present: am / is / are cooking
    An action that is happening now.
    (time clue)* e.g. He is cooking now.
  Future: will be cooking
    An action that will be happening over time, in the future, when something else happens.
    (time clue)* e.g. He will be cooking when you come.

Perfect: have + main verb
  Past: had cooked
    An action that ended before another action or time in the past.
    (time clue)* e.g. He had cooked the dinner when the phone rang.
  Present: has / have cooked
    An action that happened at an unspecified time in the past.
    (time clue)* e.g. He has cooked many meals.
  Future: will have cooked
    An action that will end before another action or time in the future.
    (time clue)* e.g. He will have cooked dinner by the time you come.

Perfect Progressive: have + be + main verb + ing
  Past: had been cooking
    An action that happened over time, in the past, before another time or action in the past.
    (time clue)* e.g. He had been cooking for a long time before he took lessons.
  Present: has / have been cooking
    An action occurring over time that started in the past and continues into the present.
    (time clue)* e.g. He has been cooking for over an hour.
  Future: will have been cooking
    An action occurring over time, in the future, before another action or time in the future.
    (time clue)* e.g. He will have been cooking all day by the time she gets home.

Verb Tense Chart. From: http://www.athabascau.ca/courses/engl/155/support/verb_tenses.htm
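The patterns in the chart (be + -ing, have + participle, will + stem) compose mechanically, which the following sketch tries to show. It is an illustration only: the function names and the small form table are assumptions, and agreement is fixed to a third-person singular subject such as he.

```python
# Hypothetical form table: -s form, simple past, past participle (pp), -ing form.
FORMS = {
    "cook": {"s": "cooks", "past": "cooked", "pp": "cooked", "ing": "cooking"},
    "be": {"s": "is", "past": "was", "pp": "been", "ing": "being"},
    "have": {"s": "has", "past": "had", "pp": "had", "ing": "having"},
}


def finite_form(verb, tense):
    """The first verb of the group carries the tense."""
    if tense == "future":
        return ["will", verb]
    if tense == "past":
        return [FORMS[verb]["past"]]
    return [FORMS[verb]["s"]]  # present, 3rd person singular


def verb_group(main, tense, perfect=False, progressive=False):
    """Build e.g. 'will have been cooking' from tense plus aspect flags."""
    chain = []
    if perfect:
        chain.append("have")  # perfect auxiliary
    if progressive:
        chain.append("be")    # progressive auxiliary
    chain.append(main)

    words = finite_form(chain[0], tense)
    # Each auxiliary dictates the form of the verb that follows it:
    # have -> past participle, be -> -ing form.
    for previous, verb in zip(chain, chain[1:]):
        words.append(FORMS[verb]["pp" if previous == "have" else "ing"])
    return " ".join(words)


for tense in ["past", "present", "future"]:
    print(tense, "|", verb_group("cook", tense),
          "|", verb_group("cook", tense, progressive=True),
          "|", verb_group("cook", tense, perfect=True),
          "|", verb_group("cook", tense, perfect=True, progressive=True))
```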


Open Class Words: Adjectives

Adjectives
  denote qualities or properties of objects, e.g. heavy, blue, content

most languages have concepts for
  colour - white, green, ...
  age - young, old, ...
  value - good, bad, ...

not all languages have adjectives as separate class


Open Class Words: Adverbs 1

Adverbs
  denote modifications of actions (verbs) or qualities (adjectives)
  e.g. walk slowly or heavily drunk

Directional or Locational Adverbs
  specify direction or location
  e.g. go home, stay here


Open Class Words: Adverbs 2

Degree Adverbs
  specify extent of process, action, property
  e.g. extremely slow, very modest

Manner Adverbs
  specify manner of action or process
  e.g. walk slowly, run fast

Temporal Adverbs
  specify time of event or action
  e.g. yesterday, Monday


Closed Word Classes

Closed Class Types:

Prepositions: on, under, over, at, from, to, with, ...

Determiners: a, an, the, ...

Pronouns: he, she, it, his, her, who, I, ...

Conjunctions: and, or, as, if, when, ...

Auxiliary verbs: can, may, should, are, …

Particles: up, down, on, off, in, out, …

Numerals: one, two, three, ..., first, second, ...


Closed Word Class: Prepositions

Prepositions
  occur before noun phrases; describe relations; often spatial or temporal relations
  e.g. on the table    (spatial)
       in two hours    (temporal)


Closed Word Class: Pronouns

Pronouns
  reference to entities, events, relations etc.

Personal Pronouns
  refer to persons or entities, e.g. you, he, it, ...

Possessive Pronouns
  possession or relation between person and object, e.g. his, her, my, its, ...

Wh-Pronouns
  reference in question or back reference, e.g. Who did this ..., Frieda, who is 80 years old ...


Closed Word Class: Conjunctions

Conjunctions
  join phrases or sentences; semantics is varied and complex

Coordinating Conjunction
  Join two phrases or sentences on the same level through conjunctions like and, or, but, ...
  e.g. He takes a cat and a dog.
       He takes a dog and she takes a cat.

Subordinating Conjunction
  Connect embedded phrases through, e.g., that
  e.g. He thinks that the cat is nicer than the dog.


Closed Word Class: Auxiliary Verbs

Auxiliary Verbs
  Mark semantic features of main verb. Often describe tense and modality aspects. Semantics is difficult.

Tense
  addition expressing present, past or future, ...
  e.g. He will take the cat home.

Aspect
  addition expressing completion of action
  e.g. He is taking the cat home. (incomplete)

Mood
  addition expressing necessity or possibility of action
  e.g. He can take the cat home. (possible)


Closed Word Class: Copula, Modal Verbs

Copula (be, do, have) and Modal Verbs (can, should, ...) are subclasses of Auxiliary Verbs.

Describe state, process, or tense / modality of action.
Semantics: difficult (e.g. modal logic)

State / Process: be and do
  e.g. He is at home. He does nothing.

Tense: have
  e.g. He has taken the cat home.

Modality: can, ought to, should, must
  e.g. He can take the cat home. (possibility)


Tagsets and POS Tagging


POS Tagging - Tagsets

Tagsets for English
  Penn Treebank, 45 tags
  Brown corpus, 87 tags
  C5 tagset, 61 tags
  C7 tagset, 146 tags

For references see Jurafsky, p. 296
C5 and C7 tagsets are listed in Appendix C

Fig. 8.6 Penn Treebank, 45 tags


Ambiguity in POS Tagging

Fig. 8.7 Ambiguity in tagging. The left column classifies words according to the number of tags, which can be used for them. The right column shows how many words fall into each class. E.g. there are 264 words which can be tagged with 3 different POS tags, and 1 word (“still”) which has 7 possible tags. (based on the Brown Corpus)


POS Tagging - Taggers

Methods for POS Tagging:

Rule-Based Tagging
  use dictionary to assign POS; then use rules to disambiguate different POS/word classes (e.g. book as verb or noun)

Stochastic Tagging
  determines tags based on the probability of the occurrence of the tag, given the observed word, in the context of the preceding tags. Similar to Hidden Markov Models (probabilistic finite state machines).

Learn tagging rules

Problem in POS Tagging: Ambiguity
Problem in POS Tagging: Which tag set to use?
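To make the stochastic approach concrete, here is a minimal sketch of a bigram HMM tagger decoded with the Viterbi algorithm. Everything numeric in it is an assumption: the four-tag set and all start, transition, and emission probabilities are invented for illustration and were not estimated from any corpus.

```python
TAGS = ["DT", "NN", "VB", "PRP"]

# Made-up probabilities, for illustration only.
START = {"DT": 0.4, "NN": 0.1, "VB": 0.1, "PRP": 0.4}
TRANS = {  # TRANS[previous_tag][tag]
    "DT": {"DT": 0.01, "NN": 0.90, "VB": 0.08, "PRP": 0.01},
    "NN": {"DT": 0.20, "NN": 0.20, "VB": 0.50, "PRP": 0.10},
    "VB": {"DT": 0.60, "NN": 0.20, "VB": 0.05, "PRP": 0.15},
    "PRP": {"DT": 0.05, "NN": 0.05, "VB": 0.85, "PRP": 0.05},
}
EMIT = {  # EMIT[tag][word]; "book" is ambiguous between NN and VB
    "DT": {"the": 0.9, "a": 0.1},
    "NN": {"book": 0.5, "flight": 0.5},
    "VB": {"book": 0.6, "take": 0.4},
    "PRP": {"i": 1.0},
}


def viterbi(words):
    """Return the most probable tag sequence under the toy model."""
    words = [w.lower() for w in words]
    # best[i][tag] = (probability of the best path ending in tag at position i,
    #                 previous tag on that path)
    best = [{t: (START[t] * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            prob, prev = max((best[-1][p][0] * TRANS[p][t], p) for p in TAGS)
            column[t] = (prob * EMIT[t].get(word, 0.0), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["the", "book"]))
print(viterbi(["I", "book", "the", "flight"]))
```

The same word book receives different tags in the two sentences because the preceding tag changes which reading is more probable. Plain products are used for readability; longer inputs would normally use log probabilities to avoid underflow.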