Upload
kelsey-sanford
View
60
Download
1
Tags:
Embed Size (px)
DESCRIPTION
Part of Speech Tagging. September 9, 2005. Part of Speech. Each word belongs to a word class. The word class of a word is known as part-of-speech (POS) of that word. Most POS tags implicitly encode fine-grained specializations of eight basic parts of speech: - PowerPoint PPT Presentation
Citation preview
04/19/23 CAP6640 Natural Language Systems 2
Part of Speech
Each word belongs to a word class. The word class of a word is known as part-of-speech (POS) of that word.
Most POS tags implicitly encode fine-grained specializations of eight basic parts of speech:
noun, verb, pronoun, preposition, adjective, adverb, conjunction, article
These categories are based on morphological and distributional similarities (not semantic similarities).
Part of speech is also known as:
word classes morphological classes lexical tags
04/19/23 3
Part of Speech (cont.)
A POS tag of a word describes the major and minor word
classes of that word. A POS tag of a word gives a significant amount of information about
that word and its neighbours. For example, a possessive pronoun (my, your, her, its) most likely will be followed by a noun, and a personal pronoun (I, you, he, she) most likely will
be followed by a verb. Most of words have a single POS tag, but some of them have more
than one (2,3,4,…) For example, book/noun or book/verb
I bought a book. Please book that flight.
04/19/23 4
Tag Sets
There are various tag sets to choose. The choice of the tag set depends on the nature of the application.
We may use small tag set (more general tags) or large tag set (finer tags).
Some of widely used part-of-speech tag sets:
Penn Treebank has 45 tags Brown Corpus has 87 tags C7 tag set has 146 tags
In a tagged corpus, each word is associated with a tag from the used tag set.
04/19/23 5
English Word Classes
Part-of-speech can be divided into two broad categories: closed class types -- such as prepositions open class types -- such as noun, verb
Closed class words are generally also function words. Function words play important role in grammar Some function words are: of, it, and, you Functions words are most of time very short and frequently
occur. There are four major open classes.
noun, verb, adjective, adverb a new word may easily enter into an open class.
Word classes may change depending on the natural language, but all natural languages have at least two word classes: noun and verb.
04/19/23 6
Nouns
Nouns can be divided as: proper nouns -- names for specific entities such as India, John,
Ali common nouns
Proper nouns do not take an article but common nouns may take. Common nouns can be divided as:
count nouns -- they can be singular or plural -- chair/chairs mass nouns -- they are used when something is
conceptualized as a homogenous group -- snow, salt Mass nouns cannot take articles a and an, and they can not be plural.
04/19/23 7
Verbs
Verb class includes the words referring actions and processes. Verbs can be divided as:
main verbs -- open class -- draw, bake auxiliary verbs -- closed class -- can, should
Auxiliary verbs can be divided as: copula -- be, have modal verbs -- may, can, must, should
Verbs have different morphological forms: non-3rd-person-sg eat 3rd-person-sg - eats progressive -- eating past -- ate past participle -- eaten
04/19/23 8
Adjectives
Adjectives describe properties or qualities for color -- black, white for age -- young, old
In Turkish, all adjectives can also be used as noun. kırmızı kitap red book kırmızıyı the red one (ACC)
04/19/23 9
Adverbs
Adverbs normally modify verbs. Adverb categories:
locative adverbs -- home, here, downhill degree adverbs -- very, extremely manner adverbs -- slowly, delicately temporal adverbs -- yesterday, Friday
Because of the heterogeneous nature of adverbs, some adverbs such as Friday may be tagged as nouns.
04/19/23 10
Major Closed Classes
Prepositions -- on, under, over, near, at, from, to, with Determiners -- a, an, the Pronouns -- I, you, he, she, who, others Conjunctions -- and, but, if, when Participles -- up, down, on, off, in, out Numerals -- one, two, first, second
04/19/23 11
Prepositions
Occur before noun phrases indicate spatial or temporal relations Example:
on the table under chair
Frequent. For example, some of the frequency counts in a 16 million word corpora (COBUILD). of 540,085 in 331,235 for 142,421 to 125,691 with 124,965 on 109,129 at 100,169
04/19/23 12
Particles
A particle combines with a verb to form a larger unit called phrasal verb.
go on turn on turn off shut down
04/19/23 13
Articles
A small closed class Only three words in the class: a an the Marks definite or indefinite They occur so often. For example, some of the
frequency counts in a 16 million word corpora (COBUILD). the 1,071,676 a 413,887 an 59,359
Almost 10% of words are articles in this corpus.
04/19/23 14
Conjunctions
Conjunctions are used to combine or join two phrases, clauses or sentences.
Coordinating conjunctions -- and or but join two elements of equal status Example: you and me
Subordinating conjunctions -- that who combines main clause with subordinate clause Example:
I thought that you might like milk
04/19/23 15
Pronouns
Shorthand for referring to some entity or event. Pronouns can be divided:
personal you she I possessive my your his wh-pronouns who what
04/19/23 16
TagSets for English
There are popular actual tagsets for part-of-speech PENN TREEBANK tagset has 45 tags
IN preposition/subordinating conj. DT determiner JJ adjective NN noun, singular or mass NNS noun, plural VB verb, base form VBD verb, past tense
A sentence from Brown corpus which is tagged using Penn Treebank tagset. The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
04/19/23 17
Part of Speech Tagging
Part of speech tagging is simply assigning the correct part of speech for each in an input sentence
We assume that we have the following: A set of tags (our tag set) A dictionary that tells us the possible tags for each word
(including all morphological variants). A text to be tagged.
There are different algorithms for tagging. Rule Based Tagging – uses hand-written rules Stochastic Tagging – uses probabilities computed from training
corpus Transformation Based Tagging – uses rules learned
automatically
04/19/23 18
How hard is tagging?
Most words in English are unambiguous. They have only a single tag. But many of most common words are ambiguous:
can/verb can/auxiliary can/noun The number of word types in Brown Corpus
unambiguous (one tag) 35,340 ambiguous (2-7 tags) 4,100
2 tags 3760 3 tags 264 4 tags 61 5 tags 12 6 tags 2 7 tags 1
While only 11.5% of word types are ambiguous, over 40% of Brown corpus tokens are ambiguous.
04/19/23 19
Problem Setup
There are M types of POS tags Tag set: {t1,..,tM}.
The word vocabulary size is V Vocabulary set: {w1,..,wV}.
We have a word sequence of length n:
W = w1,w2…wn
Want to find the best sequence of POS tags:T = t1,t2…tn
)|Pr(maxarg WTTT
best
04/19/23 20
Information sources for tagging
All techniques are based on the same observations…
some tag sequences are more probable than others
ART+ADJ+N is more probable than ART+ADJ+VB
Lexical information: knowing the word to be tagged gives a lot of information about the correct tag
“table”: {noun, verb} but not a {adj, prep,…} “rose”: {noun, adj, verb} but not {prep, ...}
04/19/23 21
Rule-Based Part-of-Speech Tagging
First Stage: Uses a dictionary to assign each word a list of potential parts-of-speech.
Second Stage: Uses a large list of handcrafted rules to window down this list to a single part-of-speech for each word.
The ENGTWOL is a rule-based tagger In the first stage, uses a two-level lexicon transducer In the second stage, uses hand-crafted rules (about 1100
rules)
04/19/23 22
Sample rules
N-IP rule: A tag N (noun) cannot be followed by a tag IP
(interrogative pronoun)
... man who … man: {N} who: {RP, IP} --> {RP} relative pronoun
ART-V rule:A tag ART (article) cannot be followed by a tag V (verb)...the book…
the: {ART} book: {N, V} --> {N}
04/19/23 23
After The First Stage
Example: He had a book. After the fırst stage:
he he/pronoun had have/verbpast have/auxliarypast a a/article book book/noun book/verb
04/19/23 24
Tagging Rule
Rule-1:
if (the previous tag is an article)
then eliminate all verb tags
Rule-2:
if (the next tag is verb)
then eliminate all verb tags
04/19/23 25
Transformation-based tagging
Due to Eric Brill (1995) basic idea:
take a non-optimal sequence of tags and improve it successively by applying a series of well-
ordered re-write rules
rule-based but, rules are learned automatically by training
on a pre-tagged corpus
04/19/23 26
1. Assign to words their most likely tag P(NN|race) = .98 P(VB|race) = .02
2. Change some tags by applying transformation rulesRule Context (trigger)
(apply the rule when…) Examples
NN VB (noun verb)
the previous tag is the preposition to
go to sleep(VB) ? go to school(VB)
VBR VB (past tense base f orm)
one of the previous 3 tags is a modal (MD)
you may cut (VB)
J J R RBR (comparative adj comparative adv)
next tag is an adjective (J J )
a more (RBR) valuable
VBP VB (past tense base f orm)
one of the previous 2 words is “n’t”
should (VB) n’t
An example
04/19/23 27
Types of context lots of latitude… can be:
tag-triggered transformation The preceding/following word is tagged this way The word two before/after is tagged this way ...
word- triggered transformation The preceding/following word this word …
morphology- triggered transformation The preceding/following word finishes with an s …
a combination of the above The preceding word is tagged this ways AND the following word is this word
04/19/23 28
Learning the transformation rules
Input: A corpus with each word: correctly tagged (for reference) tagged with its most frequent tag (C0)
Output: A bag of transformation rules Algorithm:
Instantiates a small set of hand-written templates (generic rules) by comparing the reference corpus to C0Change tag a to tag b when…
The preceding/following word is tagged zThe word two before/after is tagged zOne of the 2 preceding/following words is tagged zOne of the 2 preceding words is z…
04/19/23 29
Learning the transformation rules (con't)
Run the initial tagger and compile types of errors <incorrect tag, desired tag, # of occurrences>
For each error type, instantiate all templates to generate candidate transformations
Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces
Save the transformation that yields the greatest improvement
Stop when no transformation can reduce the error rate by a predetermined threshold
04/19/23 30
Example
if the initial tagger mistags 159 words as verbs instead of nouns create the error triple: <verb, noun, 159>
Suppose template #3 is instantiated as the rule: Change the tag from <verb> to <noun> if one of the two
preceding words is tagged as a determiner.
When this template is applied to the corpus: it corrects 98 of the 159 errors but it also creates 18 new errors
Error reduction is 98-18=80
04/19/23 31
Learning the best transformations
input: a corpus with each word:
correctly tagged (for reference) tagged with its most frequent tag (C0)
a bag of unordered transformation rules
output: an ordering of the best transformation rules
04/19/23 32
let: E(Ck) = nb of words incorrectly tagged in the corpus at iteration k v(C) = the corpus obtained after applying rule v on the corpus Cε = minimum number of errors desired
for k:= 0 step 1 dobt := argmint (E(t(Ck)) // find the transformation t that minimizes // the error rate
if ((E(Ck) - E(bt(Ck))) < ε) // if bt does not improve the tagging significantly then goto finished
Ck+1 := bt(Ck) // apply rule bt to the current corpus
Tk+1 := bt // bt will be kept as the current transformation // rule
endfinished: the sequence T1 T2 … Tk is the ordered transformation rules
Learning the best transformations (con’t)
04/19/23 33
Strengths of transformation-based tagging
exploits a wider range of lexical and syntactic regularities
can look at a wider context condition the tags on preceding/next words not just
preceding tags. can use more context than bigram or trigram.
transformation rules are easier to understand than matrices of probabilities
04/19/23 34
How TBL Rules are Applied
Before the rules are applied the tagger labels every word with its most likely tag.
We get these most likely tags from a tagged corpus. Example:
He is expected to race tomorrow he/PRN is/VBZ expected/VBN to/TO race/NN tomorrow/NN
After selecting most-likely tags, we apply transformation rules. Change NN to VB when the previous tag is TO This rule converts race/NN into race/VB
This may not work for every case ….. According to race
04/19/23 35
How TBL Rules are Learned
We will assume that we have a tagged corpus. Brill’s TBL algorithm has three major steps.
Tag the corpus with the most likely tag for each (unigram model) Choose a transformation that deterministically replaces an
existing tag with a new tag such that the resulting tagged training corpus has the lowest error rate out of all transformations.
Apply the transformation to the training corpus. These steps are repeated until a stopping criterion is reached. The result (which will be our tagger) will be:
First tags using most-likely tags Then apply the learned transformations
04/19/23 36
Transformations
A transformation is selected from a small set of templates.
Change tag a to tag b when
- The preceding (following) word is tagged z.
- The word two before (after) is tagged z.
- One of two preceding (following) words is tagged z.
- One of three preceding (following) words is tagged z.
- The preceding word is tagged z and the following word is tagged w.
- The preceding (following) word is tagged z and the word
two before (after) is tagged w.
04/19/23 37
Stochastic POS tagging
Assume that a word’s tag only depends on the previous tags (not following ones)
Use a training set (manually tagged corpus) to: learn the regularities of tag sequences learn the possible tags for a word model this info through a language model (n-
gram)
04/19/23 38
Goal: maximize P(word|tag) x P(tag|previous n tags)
P(word|tag) word/lexical likelihood probability that given this tag, we have this word NOT probability that this word has this tag modeled through language model (word-tag matrix)
P(tag|previous n tags) tag sequence likelihood probability that this tag follows these previous tags modeled through language model (tag-tag matrix)
Hidden Markov Model (HMM) Taggers
Lexical information Syntagmatic information
04/19/23 39
P(tag|previous n tags) if we look (n-1) tags before to find current tag --> n-gram
model
trigram model chooses the most probable tag ti for word wi given:
• the previous 2 tags ti-2 & ti-1 and • the current word wi
bigram model chooses the most probable tag ti for word wi given:
• the previous tag ti-1 and • the current word wi
unigram model (just most-likely tag) chooses the most probable tag ti for word wi given:
• the current word wi
Tag sequence probability
04/19/23 40
Example“race” can be VB or NN
“Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/ADV”
“People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN”
let’s tag the word “race” in 1st sentence with a bigram model.
04/19/23 41
Example (con’t) assuming previous words have been tagged, we have:
“Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow”
P(race|VB) x P(VB|TO) ? given that we have a VB, how likely is the current word to be race given that the previous tag is TO, how likely is the current tag to be
VB
P(race|NN) x P(NN|TO) ? given that we have a NN, how likely is the current word to be race given that the previous tag is TO, how likely is the current tag to be
NN
04/19/23 42
From the training corpus, we found that:
P(NN|TO) = .021 // given that the previous tag is TO// 2.1% chances that the current tag is NN
P(VB|TO) = .34 // given that the previous tag is TO// 34% chances that the current tag is VB
P(race|NN) = .00041 // given that we have an NN // 0.041% chances that this word is "race"
P(race|VB) = .00003 // given that we have a VB // 003% chances that this word is "race"
so:
P(VB|TO) x P(race|VB) = .34 x .00003 = .000 01 P(NN|TO) x P(race|NN) = .021 x .00041 = .000 009
so:VB is more probable!
Example (con’t)
04/19/23 43
and by the way: race is 98% of the time a NN !!!
P(VB|race) = 0.02
P(NN|race) = 0.98 !!!
How are the probabilities found ? using a training corpus of hand-tagged text long & meticulous work done by linguists
Example (con’t)
04/19/23 44
HMM Tagging
But HMM tagging tries to find: the best sequence of tags for a sentence not just best tag for a single word
goal: maximize the probability of a tag sequence, given a word sequence
i.e. choose the sequence of tags that maximizes P(tag sequence|word sequence)
04/19/23 45
By Bayes law:
wordSeq is given… so P(wordSeq) will be the same for all tagSeq so we can drop it from the equation
P(wordSeq)
tagSeq) | P(wordSeq x P(tagSeq) wordSeq) | P(tagSeq
HMM Tagging (con’t) wordSeq) | P(tagSeq argmax bestTagSeq
tagSeq
)t,...,t|w,...,P(w x )t,...,P(t argmax )*t,...,(t
tagSeq)|P(wordSeq x P(tagSeq) argmax bestTagSeq
n1n1n1t,...,t
n1
tagSeq
n1
04/19/23 46
n
1iii1-ii
t,...,t
n1n1n1t,...,t
n1
)t|P(w x )t|P(t argmax
)t,...,t|w,...,P(w x )t,...,P(t argmax )*t,...,(t so
n1
n1
1. words are independent
2. Markov assumption (approximation to short history) ex. with bigram approximation:
3. probability of a word is only dependent on its tag
emission probability
state transition probability
Assumptions in HMM Tagging
)t,...,t|P(w x ... )xt,...,t|P(w x )t,...,t|P(w )t,...,t|w,...,P(w n1nn12n11n1n1
)t|P(t )t,...,t|P(t 1-ii1-i1i
)t|P(w t,...,tP(w iin1i )|
04/19/23 47
The derivationbestTagSeq = argmax P(tagSeq) x P(wordSeq|tagSeq) (t1…tn)* = argmax P( t1, …, tn ) x P(w1, …, wn | t1, …, tn )
Assumption 1: Independence assumption + Chain ruleP(t1, …, tn) x P(w1, …, wn | t1, …, tn) = P(tn| t1, …, tn-1) x P(tn-1| t1, …, tn-2) x P(tn-2| t1, …, tn-3) x … x P(t1) x P(w1| t1, …, tn) x P(w2 | t1, …, tn) x P(w3 | t1, …, tn) x … x P(wn | t1, …, tn)
Assumption 2: Markov assumption: only look at short history (ex. bigram)= P(tn|tn-1) x P(tn-1|tn-2) x P(tn-2|tn-3) x … x P(t1) x P(w1| t1, …, tn) x P(w2 | t1, …, tn) x P(w3 | t1, …, tn) x … x P(wn | t1, …, tn)
Assumption 3: A word’s identity only depends on its tag = P(tn|tn-1) x P(tn-1|tn-2) x P(tn-2|tn-3) x … x P(t1) x P(w1| t1) x P(w2 | t2) x P(w3 | t3) x … x P(wn | tn)
n
1 i
ii1-ii )t|P(w x )t|P(t
n
1 i
ii1-ii
t,...,tn1 )t|P(w x )t|P(t argmax*t,...,t bestTagSeq so
n1
04/19/23 48
Emissions & Transitions probabilities
let N: number of possible tags (size of tag set) V: number of word types (vocabulary)
from a tagged training corpus, we compute the frequency of: Emission probabilities P(wi| ti)
stored in an N x V matrix emission[i,j] = probability that tag i is the correct tag for word j
Transitions probabilities P(ti|ti-1) stored in an N x N matrix transmission[i,j] = probability that tag i follows tag j
In practice, these matrices are very sparse So these models are smoothed to avoid zero probabilities
04/19/23 49
Emission probabilities P(wi| ti)
stored in an N x V matrix emission[i,j] = probability/frequency that tag i
is the correct tag for word j
04/19/23 50
Transitions probabilities P(ti|ti-1)
stored in an N x N matrix transmission[i,j] = probability/frequency that
tag i follows tag j
04/19/23 51
Efficiency issues
to find the best probability of a sequence is exponential in time
for efficiency, we usually use the Viterbi algorithm For global maximisation i.e. best tag sequence
04/19/23 52
an Example Emission probabilities:
Transition probabilities:
Tag
PN VB TO IN AT NN
Vo
cabu
lary
John 0.9 0.1
likes 0.2 0.3
to 0.5
fish 0.1 0.8 0.3
in 0.5 1
the 1
sea 0.3
First T
ag
Second tag
PN VB TO IN AT NN None (last tag)
PN 0.2 0.7 0.1
VB 0.1 0.2 0.2 0.5
TO 1
IN 0.1 0.9
AT 0.05 0.95
NN 0.3 0.25 0.25 0.1 0.5 0.05
None (1st tag) 0.7 0.2 0.1
04/19/23 53
State Transition Diagram (VMM) Transition probabilities
0.7 0.1
0.7
0.1
PN
startAT
0.2
NN
0.5
0.95
IN
0.2
0.05
0.1
0.2
VB0.25
TO
0.25
0.3
0.1
0.2
end
0.05
0.5
1
0.1
0.9
04/19/23 54
State Transition Diagram (HMM) but the states are "invisible" (we only see the words)
John: 0.3
fish: 0.3
0.7 0.1
0.7
0.1
PN
startAT
0.2
NN
0.50.95
IN
0.2
0.05
0.1
0.2
VB0.25
TO
0.25
0.3
0.1
0.2
end
0.05
0.5
1
0.1
0.9
likes: 0.1
likes: 0.1
to: 0.1
fish: 0.1
the: 0.1 in: 0.2
sea: 0.2
in: 0.1…
…
…
…
…
…
04/19/23 55
The Viterbi Algorithm
best tag sequence for "John likes to fish in the sea"?
efficiently computes the most likely state sequence given a particular output sequence
based on dynamic programming
04/19/23 56
A smaller example
0.6
b
q rstart end
0.5
0.7
What is the best sequence of states for the input string “bbba”?
Computing all possible paths and finding the one with the max probability is exponential
a
0.4 0.80.2
b a
1 1
0.3 0.5
04/19/23 57
A smaller example (con’t)
For each state, store the most likely sequence that could lead to it (and its probability)
Path probability matrix: An array of states versus time (tags versus words) That stores the prob. of being at each state at each time in terms of the prob.
for being in each state at the preceding time.Best sequence Input sequence / time
ε --> b b --> b bb --> b bbb --> a
leading
to q
coming
from qε --> q 0.6
(1.0x0.6)
q --> q 0.108
(0.6x0.3x0.6)
qq --> q 0.01944 (0.108x0.3x0.6)
qrq --> q 0.018144
(0.1008x0.3x0.4)
coming
from rr --> q 0
(0x0.5x0.6)
qr --> q 0.1008
(0.336x0.5x 0.6)
qrr --> q 0.02688 (0.1344x0.5x0.4)
leading
to r
coming
from qε --> r 0
(0x0.8)
q --> r 0.336
(0.6x0.7x0.8)
qq --> r 0.0648 (0.108x0.7x0.8)
qrq --> r 0.014112
(0.1008x0.7x0.2)
coming
from rr --> r 0 (0x0.5x0.8)
qr --> r 0.1344 (0.336x0.5x0.8)
qrr --> r 0.01344
(0.1344x0.5x0.2)
04/19/23 58
Viterbi for POS taggingLet:
n = nb of words in sentence to tag (nb of input tokens) T = nb of tags in the tag set (nb of states) vit = path probability matrix (viterbi) vit[i,j] = probability of being at state (tag) j at word i state = matrix to recover the nodes of the best path (best tag
sequence) state[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1
// Initialization vit[1,PERIOD]:=1.0 // pretend that there is a period before // our sentence (start tag = PERIOD)
vit[1,t]:=0.0 for t ≠ PERIOD
04/19/23 59
Viterbi for POS tagging (con’t)// Induction (build the path probability matrix)
for i:=1 to n step 1 do // for all words in the sentence
for all tags tj do // for all possible tags// store the max prob of the path
vit[i+1,tj] := max1≤k≤T(vit[i,tk] x P(wi+1|tj) x P(tj| tk))
// store the actual state
path[i+1,tj] := argmax1≤k≤T ( vit[i,tk] x P(wi+1|tj) x P(tj| tk)) endend
//Termination and path-readout
bestStaten+1 := argmax1≤j≤T vit[n+1,j]for j:=n to 1 step -1 do // for all the words in the sentence
bestStatej := path[i+1, bestStatej+1]end
P(bestState1,…, bestStaten ) := max1≤j≤T vit[n+1,j]
emission probability
state transitionprobability
probability of best path
leading to state tk at word i
04/19/23 60
in bigram POS tagging, we condition a tag only on the preceding tag
why not... use more context (ex. use trigram model)
more precise:• “is clearly marked” --> verb, past participle• “he clearly marked” --> verb, past tense
combine trigram, bigram, unigram models condition on words too
but with an n-gram approach, this is too costly (too many parameters to model)
transformation-based tagging...
Possible improvements
04/19/23 64
Evaluation of POS taggers
compared with gold-standard of human performance metric:
accuracy = % of tags that are identical to gold standard
most taggers ~96-97% accuracy must compare accuracy to:
ceiling (best possible results) how do human annotators score compared to each other? (96-97%) so systems are not bad at all!
baseline (worst possible results) what if we take the most-likely tag (unigram model) regardless of
previous tags ? (90-91%) so anything less is really bad
04/19/23 65
More on tagger accuracy is 95% good?
that’s 5 mistakes every 100 words if on average, a sentence is 20 words, that’s 1 mistake per sentence
when comparing tagger accuracy, beware of: size of training corpus
the bigger, the better the results difference between training & testing corpora (genre, domain…)
the closer, the better the results size of tag set
Prediction versus classification unknown words
the more unknown words (not in dictionary), the worst the results
04/19/23 66
Error analysis of POS taggers
correct tag
tags assigned by the tagger (Penn Treebank tags)
DT NNP J J NN VBD VBN … Total
DT 99.4 .3 0 0 .3 0 100
NNP 0 90.2 3.3 4.1 0 0 100
J J 0 .1 93.9 1.8 .1 1.9 100
NN 0 .5 2.2 95.5 .2 0 100
VBD 0 0 .3 1.4 96.0 2.5 100
VBN 0 0 1.9 0 3.4 93.3 100
…
Where did the tagger go wrong ? Use a confusion matrix / contingency table
Most confused: NN (noun) vs. NNP (proper noun) vs. JJ (adjective) VBD (verb, past tense) vs. VBN (past participle) vs. JJ (adjective)
he chopped carrots, the carrots were chopped, the chopped carrots
04/19/23 67
Major difficulties in POS tagging Unknown words (proper names)
because we do not know the set of tags it can take and knowing this takes you a long way (cf. baseline POS tagger) possible solutions:
assign all possible tags with probabilities distribution identical to lexicon as a whole
use morphological cues to infer possible tags • ex. word ending in -ed are likely to be past tense verbs or past participles
Frequently confused tag pairs preposition vs particle
<running> <up> a hill (prep) / <running up> a bill (particle) verb, past tense vs. past participle vs. adjective