
Part of Speech Tagging

September 9, 2005


Part of Speech

Each word belongs to a word class. The word class of a word is known as the part of speech (POS) of that word.

Most POS tags implicitly encode fine-grained specializations of eight basic parts of speech:

noun, verb, pronoun, preposition, adjective, adverb, conjunction, article

These categories are based on morphological and distributional similarities (not semantic similarities).

Part of speech is also known as:

word classes, morphological classes, lexical tags


Part of Speech (cont.)

A POS tag of a word describes the major and minor word classes of that word.

A POS tag of a word gives a significant amount of information about that word and its neighbours. For example, a possessive pronoun (my, your, her, its) will most likely be followed by a noun, while a personal pronoun (I, you, he, she) will most likely be followed by a verb.

Most words have a single POS tag, but some have more than one (2, 3, 4, …). For example, book/noun or book/verb:

I bought a book. Please book that flight.


Tag Sets

There are various tag sets to choose from. The choice of tag set depends on the nature of the application:

We may use a small tag set (more general tags) or a large tag set (finer tags).

Some widely used part-of-speech tag sets:
Penn Treebank: 45 tags
Brown Corpus: 87 tags
C7 tag set: 146 tags

In a tagged corpus, each word is associated with a tag from the tag set used.


English Word Classes

Parts of speech can be divided into two broad categories:
closed class types -- such as prepositions
open class types -- such as nouns, verbs

Closed class words are generally also function words. Function words play an important role in grammar. Some function words are: of, it, and, you. Function words are usually very short and occur frequently.

There are four major open classes: noun, verb, adjective, adverb. A new word may easily enter into an open class.

Word classes may change depending on the natural language, but all natural languages have at least two word classes: noun and verb.


Nouns

Nouns can be divided into:
proper nouns -- names for specific entities, such as India, John, Ali
common nouns

Proper nouns do not take an article, but common nouns may. Common nouns can be divided into:
count nouns -- they can be singular or plural -- chair/chairs
mass nouns -- they are used when something is conceptualized as a homogeneous group -- snow, salt

Mass nouns cannot take the articles a and an, and they cannot be plural.


Verbs

The verb class includes words referring to actions and processes. Verbs can be divided into:
main verbs -- open class -- draw, bake
auxiliary verbs -- closed class -- can, should

Auxiliary verbs can be divided into:
copula -- be, have
modal verbs -- may, can, must, should

Verbs have different morphological forms:
non-3rd-person-sg -- eat
3rd-person-sg -- eats
progressive -- eating
past -- ate
past participle -- eaten


Adjectives

Adjectives describe properties or qualities:
for color -- black, white
for age -- young, old

In Turkish, all adjectives can also be used as nouns:
kırmızı kitap -- red book
kırmızıyı -- the red one (ACC)


Adverbs

Adverbs normally modify verbs. Adverb categories:
locative adverbs -- home, here, downhill
degree adverbs -- very, extremely
manner adverbs -- slowly, delicately
temporal adverbs -- yesterday, Friday

Because of the heterogeneous nature of adverbs, some adverbs such as Friday may be tagged as nouns.


Major Closed Classes

Prepositions -- on, under, over, near, at, from, to, with
Determiners -- a, an, the
Pronouns -- I, you, he, she, who, others
Conjunctions -- and, but, if, when
Particles -- up, down, on, off, in, out
Numerals -- one, two, first, second


Prepositions

Prepositions occur before noun phrases and indicate spatial or temporal relations. Examples:
on the table
under the chair

Prepositions are frequent. For example, some frequency counts in a 16-million-word corpus (COBUILD):
of 540,085   in 331,235   for 142,421   to 125,691   with 124,965   on 109,129   at 100,169


Particles

A particle combines with a verb to form a larger unit called a phrasal verb:
go on, turn on, turn off, shut down


Articles

Articles are a small closed class; there are only three words in the class: a, an, the. An article marks a noun phrase as definite or indefinite.

Articles occur very often. For example, frequency counts in the same 16-million-word corpus (COBUILD):
the 1,071,676   a 413,887   an 59,359

Almost 10% of words are articles in this corpus.


Conjunctions

Conjunctions are used to combine or join two phrases, clauses or sentences.

Coordinating conjunctions -- and, or, but
join two elements of equal status
Example: you and me

Subordinating conjunctions -- that, who
combine a main clause with a subordinate clause
Example: I thought that you might like milk


Pronouns

Pronouns are shorthand for referring to an entity or event. Pronouns can be divided into:
personal -- you, she, I
possessive -- my, your, his
wh-pronouns -- who, what


Tag Sets for English

There are several widely used tag sets for English. The Penn Treebank tag set has 45 tags, including:

IN   preposition/subordinating conj.
DT   determiner
JJ   adjective
NN   noun, singular or mass
NNS  noun, plural
VB   verb, base form
VBD  verb, past tense

A sentence from the Brown corpus, tagged using the Penn Treebank tag set:

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
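To see these tags in practice, here is a minimal sketch using NLTK, whose off-the-shelf tagger emits Penn Treebank tags (an assumption: the tagger model has been downloaded once via nltk.download):

```python
# Sketch (not from the slides): tagging the Brown-corpus sentence above
# with NLTK's default tagger, which uses the Penn Treebank tag set.
# Assumes: pip install nltk, plus nltk.download('averaged_perceptron_tagger').
import nltk

tokens = "The grand jury commented on a number of other topics .".split()
print(nltk.pos_tag(tokens))
# expected output along the lines of:
# [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
#  ('on', 'IN'), ('a', 'DT'), ('number', 'NN'), ...]
```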


Part of Speech Tagging

Part of speech tagging is simply assigning the correct part of speech to each word in an input sentence.

We assume that we have the following:
A set of tags (our tag set).
A dictionary that tells us the possible tags for each word (including all morphological variants).
A text to be tagged.

There are different algorithms for tagging:
Rule-Based Tagging -- uses hand-written rules
Stochastic Tagging -- uses probabilities computed from a training corpus
Transformation-Based Tagging -- uses rules learned automatically


How hard is tagging?

Most words in English are unambiguous: they have only a single tag. But many of the most common words are ambiguous:
can/verb, can/auxiliary, can/noun

The number of word types in the Brown Corpus:
unambiguous (one tag)   35,340
ambiguous (2-7 tags)     4,100
  2 tags  3,760
  3 tags    264
  4 tags     61
  5 tags     12
  6 tags      2
  7 tags      1

While only 11.5% of word types are ambiguous, over 40% of Brown corpus tokens are ambiguous.


Problem Setup

There are M types of POS tags. Tag set: {t1, …, tM}.

The word vocabulary size is V. Vocabulary set: {w1, …, wV}.

We have a word sequence of length n:

W = w1, w2, …, wn

We want to find the best sequence of POS tags T = t1, t2, …, tn:

T_best = argmax_T P(T | W)


Information sources for tagging

All techniques are based on the same observations…

some tag sequences are more probable than others

ART+ADJ+N is more probable than ART+ADJ+VB

Lexical information: knowing the word to be tagged gives a lot of information about the correct tag:
"table": {noun, verb} but not {adj, prep, …}
"rose": {noun, adj, verb} but not {prep, …}


Rule-Based Part-of-Speech Tagging

First Stage: Uses a dictionary to assign each word a list of potential parts-of-speech.

Second Stage: Uses a large list of hand-crafted rules to winnow this list down to a single part of speech for each word.

ENGTWOL is a rule-based tagger:
in the first stage, it uses a two-level lexicon transducer
in the second stage, it uses hand-crafted rules (about 1,100 rules)


Sample rules

N-IP rule: a tag N (noun) cannot be followed by a tag IP (interrogative pronoun)
  … man who …
  man: {N}
  who: {RP, IP} --> {RP}   (relative pronoun)

ART-V rule: a tag ART (article) cannot be followed by a tag V (verb)
  … the book …
  the: {ART}
  book: {N, V} --> {N}


After The First Stage

Example: He had a book. After the first stage:

he    he/pronoun
had   have/verb-past, have/auxiliary-past
a     a/article
book  book/noun, book/verb


Tagging Rule

Rule-1: if (the previous tag is an article)
        then eliminate all verb tags

Rule-2: if (the next tag is a verb)
        then eliminate all verb tags
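A minimal sketch of applying such an elimination rule to the output of the first (dictionary) stage; the mini-lexicon and rule encoding here are toy illustrations, not ENGTWOL's actual resources:

```python
# Toy two-stage rule-based tagging of "He had a book." (a sketch).
lexicon = {
    "he": {"pronoun"},
    "had": {"verb-past", "auxiliary-past"},
    "a": {"article"},
    "book": {"noun", "verb"},
}

def rule1(tag_sets):
    # Rule-1: if the previous tag is an article, eliminate all verb tags
    for i in range(1, len(tag_sets)):
        if tag_sets[i - 1] == {"article"}:
            remaining = tag_sets[i] - {"verb"}
            if remaining:              # never eliminate every tag
                tag_sets[i] = remaining
    return tag_sets

words = ["he", "had", "a", "book"]
print(rule1([set(lexicon[w]) for w in words]))
# -> [{'pronoun'}, {'verb-past', 'auxiliary-past'}, {'article'}, {'noun'}]
```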


Transformation-based tagging

Due to Eric Brill (1995). Basic idea:

take a non-optimal sequence of tags and improve it successively by applying a series of well-ordered re-write rules

It is rule-based, but the rules are learned automatically by training on a pre-tagged corpus.


An example

1. Assign to words their most likely tag:
   P(NN|race) = .98, P(VB|race) = .02

2. Change some tags by applying transformation rules:

Rule: NN -> VB (noun -> verb)
  Trigger (apply the rule when…): the previous tag is the preposition "to"
  Example: go to sleep (VB), go to school (VB)

Rule: VBD -> VB (past tense -> base form)
  Trigger: one of the previous 3 tags is a modal (MD)
  Example: you may cut (VB)

Rule: JJR -> RBR (comparative adjective -> comparative adverb)
  Trigger: the next tag is an adjective (JJ)
  Example: a more (RBR) valuable

Rule: VBP -> VB (non-3rd-person present -> base form)
  Trigger: one of the previous 2 words is "n't"
  Example: should (VB) n't


Types of context: lots of latitude… the trigger can be:

tag-triggered transformation
  the preceding/following word is tagged this way
  the word two before/after is tagged this way
  …

word-triggered transformation
  the preceding/following word is this word
  …

morphology-triggered transformation
  the preceding/following word finishes with an s
  …

a combination of the above
  the preceding word is tagged this way AND the following word is this word


Learning the transformation rules

Input: a corpus with each word:
  correctly tagged (for reference)
  tagged with its most frequent tag (C0)

Output: a bag of transformation rules

Algorithm: instantiate a small set of hand-written templates (generic rules) by comparing the reference corpus to C0:
  Change tag a to tag b when…
    the preceding/following word is tagged z
    the word two before/after is tagged z
    one of the 2 preceding/following words is tagged z
    one of the 2 preceding words is z
    …


Learning the transformation rules (con't)

Run the initial tagger and compile types of errors <incorrect tag, desired tag, # of occurrences>

For each error type, instantiate all templates to generate candidate transformations

Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces

Save the transformation that yields the greatest improvement

Stop when no transformation can reduce the error rate by a predetermined threshold


Example

If the initial tagger mistags 159 words as verbs instead of nouns, create the error triple: <verb, noun, 159>.

Suppose template #3 is instantiated as the rule: change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner.

When this rule is applied to the corpus:
it corrects 98 of the 159 errors
but it also creates 18 new errors

Error reduction is 98-18=80


Learning the best transformations

Input: a corpus with each word:
  correctly tagged (for reference)
  tagged with its most frequent tag (C0)
and a bag of unordered transformation rules.

Output: an ordering of the best transformation rules.


Learning the best transformations (con't)

let:
  E(Ck) = number of words incorrectly tagged in the corpus at iteration k
  v(C)  = the corpus obtained after applying rule v to the corpus C
  ε     = minimum improvement (in number of errors) required

for k := 0 step 1 do
  bt := argmin_t E(t(Ck))        // find the transformation t that minimizes the error rate
  if (E(Ck) - E(bt(Ck))) < ε     // if bt does not improve the tagging significantly
    then goto finished
  Ck+1 := bt(Ck)                 // apply rule bt to the current corpus
  Tk+1 := bt                     // keep bt as the current transformation rule
end

finished: the sequence T1 T2 … Tk is the ordered list of transformation rules
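The same greedy loop as a runnable Python sketch. For compactness it assumes a single rule template, "change tag a to tag b when the previous tag is z"; all function and variable names here are ours, not Brill's:

```python
# Greedy TBL rule selection (a sketch under the assumptions stated above).
def apply_rule(rule, tagged):
    a, b, z = rule  # change tag a to tag b when the previous tag is z
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == a and out[i - 1][1] == z:
            out[i] = (word, b)
    return out

def learn_transformations(tagged, gold, candidates, epsilon=1):
    # tagged: (word, tag) pairs from the initial most-frequent-tag tagger
    # gold: the reference tags; returns the ordered list of learned rules
    def errors(seq):
        return sum(t != g for (_, t), g in zip(seq, gold))
    learned = []
    while True:
        best = min(candidates, key=lambda r: errors(apply_rule(r, tagged)))
        if errors(tagged) - errors(apply_rule(best, tagged)) < epsilon:
            return learned           # no rule improves the tagging enough
        tagged = apply_rule(best, tagged)
        learned.append(best)

# toy usage: learn to fix "to/TO race/NN" into race/VB
tagged = [("to", "TO"), ("race", "NN")]
print(learn_transformations(tagged, ["TO", "VB"], [("NN", "VB", "TO")]))
# -> [('NN', 'VB', 'TO')]
```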


Strengths of transformation-based tagging

Exploits a wider range of lexical and syntactic regularities:

can look at a wider context:
  conditions the tags on preceding/next words, not just preceding tags
  can use more context than a bigram or trigram model

transformation rules are easier to understand than matrices of probabilities


How TBL Rules are Applied

Before the rules are applied, the tagger labels every word with its most likely tag. We get these most likely tags from a tagged corpus. Example:

He is expected to race tomorrow
he/PRN is/VBZ expected/VBN to/TO race/NN tomorrow/NN

After selecting the most likely tags, we apply the transformation rules:
Change NN to VB when the previous tag is TO.
This rule converts race/NN into race/VB.

This may not work for every case: … according to race …


How TBL Rules are Learned

We will assume that we have a tagged corpus. Brill's TBL algorithm has three major steps:

1. Tag the corpus with the most likely tag for each word (unigram model).
2. Choose a transformation that deterministically replaces an existing tag with a new tag such that the resulting tagged training corpus has the lowest error rate out of all transformations.
3. Apply the transformation to the training corpus.

These steps are repeated until a stopping criterion is reached. The resulting tagger works as follows:
first tag each word with its most likely tag,
then apply the learned transformations in order.


Transformations

A transformation is selected from a small set of templates.

Change tag a to tag b when

- The preceding (following) word is tagged z.

- The word two before (after) is tagged z.

- One of two preceding (following) words is tagged z.

- One of three preceding (following) words is tagged z.

- The preceding word is tagged z and the following word is tagged w.

- The preceding (following) word is tagged z and the word two before (after) is tagged w.


Stochastic POS tagging

Assume that a word’s tag only depends on the previous tags (not following ones)

Use a training set (manually tagged corpus) to:
learn the regularities of tag sequences
learn the possible tags for a word
model this info through a language model (n-gram)


Hidden Markov Model (HMM) Taggers

Goal: maximize P(word|tag) x P(tag|previous n tags)

P(word|tag): word/lexical likelihood (lexical information)
  the probability that, given this tag, we have this word
  NOT the probability that this word has this tag
  modeled through a language model (word-tag matrix)

P(tag|previous n tags): tag sequence likelihood (syntagmatic information)
  the probability that this tag follows these previous tags
  modeled through a language model (tag-tag matrix)


Tag sequence probability

P(tag|previous n tags): if we look at the (n-1) previous tags to find the current tag --> n-gram model

trigram model: chooses the most probable tag ti for word wi given
  the previous 2 tags ti-2, ti-1 and
  the current word wi

bigram model: chooses the most probable tag ti for word wi given
  the previous tag ti-1 and
  the current word wi

unigram model (just the most likely tag): chooses the most probable tag ti for word wi given
  the current word wi


Example: "race" can be VB or NN

"Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/ADV"

"People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN"

Let's tag the word "race" in the 1st sentence with a bigram model.


Example (con't)

Assuming the previous words have been tagged, we have:
"Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow"

P(race|VB) x P(VB|TO) ?
  given that we have a VB, how likely is the current word to be "race"
  given that the previous tag is TO, how likely is the current tag to be VB

P(race|NN) x P(NN|TO) ?
  given that we have an NN, how likely is the current word to be "race"
  given that the previous tag is TO, how likely is the current tag to be NN


Example (con't)

From the training corpus, we found that:

P(NN|TO) = .021      // given that the previous tag is TO, 2.1% chance that the current tag is NN
P(VB|TO) = .34       // given that the previous tag is TO, 34% chance that the current tag is VB
P(race|NN) = .00041  // given that we have an NN, 0.041% chance that this word is "race"
P(race|VB) = .00003  // given that we have a VB, 0.003% chance that this word is "race"

so:

P(VB|TO) x P(race|VB) = .34 x .00003 = .0000102
P(NN|TO) x P(race|NN) = .021 x .00041 = .0000086

so: VB is more probable!
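The same arithmetic, checked in a couple of lines of Python:

```python
# Reproducing the bigram disambiguation of "race" from the slide
p_vb = 0.34 * 0.00003    # P(VB|TO) x P(race|VB)
p_nn = 0.021 * 0.00041   # P(NN|TO) x P(race|NN)
print(p_vb, p_nn)        # -> about 1.02e-05 and 8.61e-06
print("VB" if p_vb > p_nn else "NN")   # -> VB
```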


Example (con't)

And by the way: "race" is 98% of the time an NN!
P(VB|race) = 0.02
P(NN|race) = 0.98

How are the probabilities found?
  using a training corpus of hand-tagged text
  long and meticulous work done by linguists


HMM Tagging

But HMM tagging tries to find:
the best sequence of tags for a sentence
not just the best tag for a single word

Goal: maximize the probability of a tag sequence, given a word sequence, i.e. choose the sequence of tags that maximizes P(tag sequence | word sequence).


HMM Tagging (con't)

By Bayes' law:

P(tagSeq | wordSeq) = P(tagSeq) x P(wordSeq | tagSeq) / P(wordSeq)

wordSeq is given… so P(wordSeq) will be the same for all tagSeq, and we can drop it from the equation:

bestTagSeq = argmax_tagSeq P(tagSeq | wordSeq)
           = argmax_tagSeq P(tagSeq) x P(wordSeq | tagSeq)

(t1,…,tn)* = argmax_{t1,…,tn} P(t1,…,tn) x P(w1,…,wn | t1,…,tn)


Assumptions in HMM Tagging

1. Words are independent of each other:
   P(w1,…,wn | t1,…,tn) = P(w1 | t1,…,tn) x P(w2 | t1,…,tn) x … x P(wn | t1,…,tn)

2. Markov assumption (approximation to a short history), ex. with a bigram approximation:
   P(ti | t1,…,ti-1) ≈ P(ti | ti-1)        (state transition probability)

3. The probability of a word is only dependent on its tag:
   P(wi | t1,…,tn) ≈ P(wi | ti)            (emission probability)

so:  (t1,…,tn)* = argmax_{t1,…,tn} P(t1,…,tn) x P(w1,…,wn | t1,…,tn)
                = argmax_{t1,…,tn} ∏_{i=1..n} P(ti | ti-1) x P(wi | ti)


The derivation

bestTagSeq = argmax P(tagSeq) x P(wordSeq | tagSeq)
(t1,…,tn)* = argmax P(t1,…,tn) x P(w1,…,wn | t1,…,tn)

Assumption 1: independence assumption + chain rule:
P(t1,…,tn) x P(w1,…,wn | t1,…,tn)
  = P(tn | t1,…,tn-1) x P(tn-1 | t1,…,tn-2) x … x P(t1)
    x P(w1 | t1,…,tn) x P(w2 | t1,…,tn) x … x P(wn | t1,…,tn)

Assumption 2: Markov assumption -- only look at a short history (ex. bigram):
  = P(tn | tn-1) x P(tn-1 | tn-2) x … x P(t1)
    x P(w1 | t1,…,tn) x P(w2 | t1,…,tn) x … x P(wn | t1,…,tn)

Assumption 3: a word's identity only depends on its tag:
  = P(tn | tn-1) x P(tn-1 | tn-2) x … x P(t1)
    x P(w1 | t1) x P(w2 | t2) x … x P(wn | tn)
  = ∏_{i=1..n} P(ti | ti-1) x P(wi | ti)

so: bestTagSeq = (t1,…,tn)* = argmax_{t1,…,tn} ∏_{i=1..n} P(ti | ti-1) x P(wi | ti)


Emissions & Transitions probabilities

Let N = number of possible tags (size of the tag set) and V = number of word types (vocabulary size).

From a tagged training corpus, we compute the frequency of:

Emission probabilities P(wi | ti)
  stored in an N x V matrix
  emission[i,j] = probability that tag i is the correct tag for word j

Transition probabilities P(ti | ti-1)
  stored in an N x N matrix
  transition[i,j] = probability that tag i follows tag j

In practice, these matrices are very sparse, so the models are smoothed to avoid zero probabilities.
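As an illustration, a sketch of estimating both matrices by relative-frequency counting over a tagged corpus (the two-sentence toy corpus and all names here are ours; smoothing is left out):

```python
# Estimate P(word|tag) and P(tag_i|tag_i-1) by counting (a sketch).
from collections import Counter, defaultdict

def estimate(corpus):                 # corpus: list of [(word, tag), ...]
    emit = defaultdict(Counter)       # emit[tag][word] counts
    trans = defaultdict(Counter)      # trans[prev_tag][tag] counts
    for sentence in corpus:
        prev = "<s>"                  # sentence-start pseudo-tag
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            prev = tag
    p_emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
              for t, ws in emit.items()}
    p_trans = {p: {t: c / sum(ts.values()) for t, c in ts.items()}
               for p, ts in trans.items()}
    return p_emit, p_trans

corpus = [[("the", "DT"), ("race", "NN")], [("to", "TO"), ("race", "VB")]]
p_emit, p_trans = estimate(corpus)
print(p_emit["NN"]["race"], p_trans["DT"]["NN"])   # -> 1.0 1.0
```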


Emission probabilities P(wi | ti)
  stored in an N x V matrix
  emission[i,j] = probability/frequency that tag i is the correct tag for word j


Transition probabilities P(ti | ti-1)
  stored in an N x N matrix
  transition[i,j] = probability/frequency that tag i follows tag j


Efficiency issues

Finding the probability of the best tag sequence naively is exponential in time.

For efficiency, we usually use the Viterbi algorithm for global maximisation, i.e. the best tag sequence.


An example

Emission probabilities P(word | tag):

  Vocabulary   PN    VB    TO    IN    AT    NN
  John         0.9                           0.1
  likes              0.2                     0.3
  to                       0.5
  fish         0.1   0.8                     0.3
  in                       0.5   1.0
  the                                  1.0
  sea                                        0.3

Transition probabilities P(second tag | first tag) -- rows are the first tag, with the nonzero entries over the second tags PN, VB, TO, IN, AT, NN, None (last tag):

  PN:  0.2, 0.7, 0.1
  VB:  0.1, 0.2, 0.2, 0.5
  TO:  1.0
  IN:  0.1, 0.9
  AT:  0.05, 0.95
  NN:  0.3, 0.25, 0.25, 0.1, 0.5, 0.05
  None (1st tag):  0.7, 0.2, 0.1


State Transition Diagram (VMM): transition probabilities

[Figure: a state transition diagram over the tags PN, VB, TO, IN, AT, NN, with start and end pseudo-states; each arc carries the corresponding transition probability from the table above.]


State Transition Diagram (HMM): but the states are "invisible" (we only see the words)

[Figure: the same transition diagram, with each state additionally annotated with the words it can emit and their emission probabilities (John, likes, to, fish, in, the, sea).]


The Viterbi Algorithm

best tag sequence for "John likes to fish in the sea"?

efficiently computes the most likely state sequence given a particular output sequence

based on dynamic programming


A smaller example

What is the best sequence of states for the input string "bbba"?

[Figure: a four-state automaton: start, q, r, end. Transitions: start->q = 1.0; q->q = 0.3; q->r = 0.7; r->q = 0.5; r->r = 0.5. Emissions: q emits b with 0.6 and a with 0.4; r emits b with 0.8 and a with 0.2.]

Computing all possible paths and finding the one with the maximum probability is exponential.


A smaller example (con't)

For each state, store the most likely sequence that could lead to it (and its probability).

Path probability matrix: an array of states versus time (tags versus words) that stores the probability of being in each state at each time, in terms of the probabilities of being in each state at the preceding time.

Input sequence / time:    ε -> b           b -> b            bb -> b            bbb -> a

leading to q:
  coming from q:  ε->q 0.6         q->q 0.108        qq->q 0.01944      qrq->q 0.012096
                  (1.0x0.6)        (0.6x0.3x0.6)     (0.108x0.3x0.6)    (0.1008x0.3x0.4)
  coming from r:  --               r->q 0            qr->q 0.1008       qrr->q 0.02688
                                   (0x0.5x0.6)       (0.336x0.5x0.6)    (0.1344x0.5x0.4)

leading to r:
  coming from q:  ε->r 0           q->r 0.336        qq->r 0.06048      qrq->r 0.014112
                  (0x0.8)          (0.6x0.7x0.8)     (0.108x0.7x0.8)    (0.1008x0.7x0.2)
  coming from r:  --               r->r 0            qr->r 0.1344       qrr->r 0.01344
                                   (0x0.5x0.8)       (0.336x0.5x0.8)    (0.1344x0.5x0.2)


Viterbi for POS tagging

Let:
  n = number of words in the sentence to tag (number of input tokens)
  T = number of tags in the tag set (number of states)
  vit = path probability matrix (Viterbi):
        vit[i,j] = probability of being at state (tag) j at word i
  path = matrix to recover the nodes of the best path (best tag sequence):
        path[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1

// Initialization
vit[1,PERIOD] := 1.0     // pretend that there is a period before our sentence (start tag = PERIOD)
vit[1,t] := 0.0 for t ≠ PERIOD


Viterbi for POS tagging (con't)

// Induction (build the path probability matrix)
for i := 1 to n do                    // for all words in the sentence
  for all tags tj do                  // for all possible tags
    // store the max probability of the path leading to tag tj at word i+1:
    // vit[i,tk] is the probability of the best path leading to state tk at
    // word i, P(wi+1|tj) the emission probability, and P(tj|tk) the state
    // transition probability
    vit[i+1,tj]  := max_{1≤k≤T} ( vit[i,tk] x P(wi+1|tj) x P(tj|tk) )
    // store the actual state
    path[i+1,tj] := argmax_{1≤k≤T} ( vit[i,tk] x P(wi+1|tj) x P(tj|tk) )
  end
end

// Termination and path-readout
bestState_{n+1} := argmax_{1≤j≤T} vit[n+1,j]
for j := n to 1 step -1 do            // for all the words in the sentence
  bestState_j := path[j+1, bestState_{j+1}]
end
P(bestState_1,…,bestState_n) := max_{1≤j≤T} vit[n+1,j]    // probability of the best path
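And here is the same algorithm as a runnable Python sketch, applied to the two-state q/r automaton from the "smaller example" slides (input "bbba"). Variable names are ours, and end-of-string transition probabilities are omitted, as in the path probability matrix above:

```python
# Viterbi on the q/r automaton from the "smaller example" (a sketch).
transition = {                        # transition[s1][s2] = P(s2|s1)
    "start": {"q": 1.0, "r": 0.0},
    "q":     {"q": 0.3, "r": 0.7},
    "r":     {"q": 0.5, "r": 0.5},
}
emission = {                          # emission[s][sym] = P(sym|s)
    "q": {"b": 0.6, "a": 0.4},
    "r": {"b": 0.8, "a": 0.2},
}

def viterbi(symbols, states=("q", "r")):
    # vit[i][s]: probability of the best path ending in s after symbols[:i+1]
    # back[i][s]: the predecessor state on that best path
    vit = [{s: transition["start"][s] * emission[s][symbols[0]] for s in states}]
    back = [{s: "start" for s in states}]
    for sym in symbols[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda k: vit[-1][k] * transition[k][s])
            row[s] = vit[-1][prev] * transition[prev][s] * emission[s][sym]
            ptr[s] = prev
        vit.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: vit[-1][s])   # best final state
    seq = [last]                                   # follow back-pointers
    for ptr in reversed(back[1:]):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq)), vit[-1][last]

print(viterbi("bbba"))    # -> (['q', 'r', 'r', 'q'], 0.02688)
```

The intermediate values match the path probability matrix above (e.g. 0.336 for q->r at the second input symbol, and 0.02688 for the winning path qrrq).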


Possible improvements

In bigram POS tagging, we condition a tag only on the preceding tag. Why not…

use more context (ex. use a trigram model), which is more precise:
  "is clearly marked" --> verb, past participle
  "he clearly marked" --> verb, past tense
combine trigram, bigram and unigram models
condition on words too

But with an n-gram approach this is too costly (too many parameters to model); hence transformation-based tagging…


Evaluation of POS taggers

Compared with a gold standard of human performance. Metric:

accuracy = % of tags that are identical to the gold standard

Most taggers achieve ~96-97% accuracy. Accuracy must be compared to:

ceiling (best possible results):
  how do human annotators score compared to each other? (96-97%)
  so systems are not bad at all!

baseline (worst possible results):
  what if we take the most likely tag (unigram model), regardless of previous tags? (90-91%)
  so anything less is really bad
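In code, the metric is a one-liner (the toy tag lists here are for illustration only):

```python
# accuracy = % of tags identical to the gold standard
def accuracy(predicted, gold):
    return 100 * sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(accuracy(["DT", "JJ", "NN", "VBD"], ["DT", "JJ", "NN", "VBN"]))  # -> 75.0
```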


More on tagger accuracy

Is 95% good?
  that's 5 mistakes every 100 words
  if a sentence is 20 words on average, that's 1 mistake per sentence

When comparing tagger accuracy, beware of:
  size of the training corpus: the bigger, the better the results
  difference between training and testing corpora (genre, domain…): the closer, the better the results
  size of the tag set: prediction versus classification
  unknown words: the more unknown words (not in the dictionary), the worse the results


Error analysis of POS taggers

correct tag

tags assigned by the tagger (Penn Treebank tags)

DT NNP J J NN VBD VBN … Total

DT 99.4 .3 0 0 .3 0 100

NNP 0 90.2 3.3 4.1 0 0 100

J J 0 .1 93.9 1.8 .1 1.9 100

NN 0 .5 2.2 95.5 .2 0 100

VBD 0 0 .3 1.4 96.0 2.5 100

VBN 0 0 1.9 0 3.4 93.3 100

Where did the tagger go wrong ? Use a confusion matrix / contingency table

Most confused: NN (noun) vs. NNP (proper noun) vs. JJ (adjective) VBD (verb, past tense) vs. VBN (past participle) vs. JJ (adjective)

he chopped carrots, the carrots were chopped, the chopped carrots
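A sketch of tallying such a confusion matrix from parallel gold/assigned tag sequences (toy data; names are ours; entries as row percentages, like the table above):

```python
# Build a confusion matrix: rows = correct tag, columns = assigned tag.
from collections import Counter, defaultdict

def confusion(gold, assigned):
    rows = defaultdict(Counter)
    for g, a in zip(gold, assigned):
        rows[g][a] += 1
    return {g: {a: round(100 * c / sum(r.values()), 1) for a, c in r.items()}
            for g, r in rows.items()}

print(confusion(["NN", "NN", "VBD", "JJ"], ["NN", "NNP", "VBN", "JJ"]))
# -> {'NN': {'NN': 50.0, 'NNP': 50.0}, 'VBD': {'VBN': 100.0}, 'JJ': {'JJ': 100.0}}
```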


Major difficulties in POS tagging

Unknown words (e.g. proper names)
  because we do not know the set of tags they can take, and knowing this takes you a long way (cf. the baseline POS tagger)
  possible solutions:
    assign all possible tags, with a probability distribution identical to the lexicon as a whole
    use morphological cues to infer possible tags, as in the sketch below
      ex. words ending in -ed are likely to be past tense verbs or past participles
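A minimal sketch of such a morphological fallback; the -ed cue comes from the slide, while the other two cues are illustrative assumptions:

```python
# Guess candidate tags for an unknown word from crude morphological cues.
def guess_tags(word):
    if word.endswith("ed"):
        return {"VBD", "VBN"}      # past tense verb or past participle
    if word[:1].isupper():
        return {"NNP"}             # assumption: capitalized -> proper noun
    return {"NN"}                  # assumption: default to common noun

print(sorted(guess_tags("chopped")), guess_tags("Secretariat"))
# -> ['VBD', 'VBN'] {'NNP'}
```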

Frequently confused tag pairs:
  preposition vs. particle
    <running> <up> a hill (preposition) / <running up> a bill (particle)
  verb, past tense vs. past participle vs. adjective