CS60057 Speech & Natural Language Processing
Autumn 2007
Lecture 10
16 August 2007
Hidden Markov Model (HMM) Tagging
Using an HMM to do POS tagging
HMM is a special case of Bayesian inference
It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)
Hidden Markov Model (HMM) Taggers
Goal: maximize P(word|tag) x P(tag|previous n tags)
P(word|tag): the word/lexical likelihood, i.e. the probability that, given this tag, we see this word (NOT the probability that this word has this tag); modeled through the word-tag matrix (lexical information)
P(tag|previous n tags): the tag-sequence likelihood, i.e. the probability that this tag follows the previous tags; modeled through the tag-tag matrix (syntagmatic information)
POS tagging as a sequence classification task
We are given a sentence (an "observation" or a "sequence of observations"), e.g. "Secretariat is expected to race tomorrow", i.e. a sequence of n words w1…wn.
What is the best sequence of tags that corresponds to this sequence of observations?
Probabilistic/Bayesian view: consider all possible sequences of tags and, out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
Getting to HMM
Let T = t1,t2,…,tn
Let W = w1,w2,…,wn
Goal: out of all sequences of tags t1…tn, find the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn:
T^ = argmax_T P(T|W)
The hat ^ means "our estimate of the best, i.e. most probable, tag sequence"; argmax_x f(x) means "the x such that f(x) is maximized".
Getting to HMM
This equation is guaranteed to give us the best tag sequence
But how do we make it operational? How do we compute this value? Intuition of Bayesian classification:
Use Bayes rule to transform it into a set of other probabilities that are easier to compute
Thomas Bayes: British mathematician (1702-1761)
Bayes Rule
Bayes' rule breaks down any conditional probability P(x|y) into three other probabilities:
P(x|y) = P(y|x) P(x) / P(y)
P(x|y) is the conditional probability of an event x given that y has occurred.
Bayes Rule
Applying Bayes' rule to tagging: T^ = argmax_T P(T|W) = argmax_T P(W|T) P(T) / P(W).
We can drop the denominator P(W): it is the same for every candidate tag sequence, since we are looking for the best tag sequence for the same observation, i.e. the same fixed set of words. So T^ = argmax_T P(W|T) P(T).
Likelihood and prior
T^ = argmax_T P(W|T) P(T), where P(W|T) is the likelihood and P(T) is the prior.
Likelihood and prior: Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it: P(W|T) ≈ ∏_{i=1..n} P(wi|ti)
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag: P(T) ≈ ∏_{i=1..n} P(ti|ti-1)
3. The most probable tag sequence estimated by the bigram tagger: T^ = argmax_T ∏_{i=1..n} P(wi|ti) P(ti|ti-1)
Likelihood and prior: Further Simplifications
1. The probability of a word appearing depends only on its own POS tag, i.e. it is independent of the other words around it.
Example: "the koala put the keys on the table"; the WORDS are the observations and the TAGS (N, V, P, DET) are the hidden states, and each word depends only on its own tag.
Likelihood and prior: Further Simplifications
2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag.
Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-grams.
Bigrams are used as the basis for simple statistical analysis of text.
The bigram assumption is related to the first-order Markov assumption.
Likelihood and prior: Further Simplifications
3. The most probable tag sequence estimated by the bigram tagger (applying the bigram assumption):
T^ = argmax_T ∏_{i=1..n} P(wi|ti) P(ti|ti-1)
Two kinds of probabilities (1)
Tag transition probabilities p(ti|ti-1)
Determiners are likely to precede adjectives and nouns: That/DT flight/NN, The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high, but what should P(DT|JJ) be? (Low: determiners rarely follow adjectives.)
Two kinds of probabilities (1)
Tag transition probabilities p(ti|ti-1)
Compute P(NN|DT) by counting in a labeled corpus:
P(NN|DT) = C(DT, NN) / C(DT), i.e. the number of times DT is followed by NN, divided by the number of times DT occurs.
Two kinds of probabilities (2)
Word likelihood probabilities p(wi|ti)
P(is|VBZ) = the probability of a VBZ (3sg present verb) being "is"
Compute P(is|VBZ) by counting in a labeled corpus: P(is|VBZ) = C(VBZ, "is") / C(VBZ)
If we were expecting a third person singular verb, how likely is it that this verb would be "is"?
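This counting recipe is easy to put into code. Below is a minimal sketch (not from the slides): relative-frequency estimates over a tiny hand-tagged corpus; the corpus, tag names and function names are illustrative only.

```python
# Minimal sketch: relative-frequency estimates of the two kinds of HMM probabilities
# from a tiny hand-tagged corpus (illustrative data only).
from collections import defaultdict

corpus = [
    [("the", "DT"), ("flight", "NN"), ("is", "VBZ"), ("late", "JJ")],
    [("that", "DT"), ("flight", "NN"), ("is", "VBZ"), ("expected", "VBN")],
]

transition_counts = defaultdict(int)   # C(t_{i-1}, t_i)
tag_counts = defaultdict(int)          # C(t)
emission_counts = defaultdict(int)     # C(t, w)

for sentence in corpus:
    prev = "<s>"                       # sentence-start pseudo-tag
    tag_counts[prev] += 1
    for word, tag in sentence:
        transition_counts[(prev, tag)] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

def p_transition(prev_tag, tag):
    """P(tag | prev_tag) = C(prev_tag, tag) / C(prev_tag)."""
    return transition_counts[(prev_tag, tag)] / tag_counts[prev_tag]

def p_emission(tag, word):
    """P(word | tag) = C(tag, word) / C(tag)."""
    return emission_counts[(tag, word)] / tag_counts[tag]

print(p_transition("DT", "NN"))   # 1.0 in this toy corpus: DT is always followed by NN
print(p_emission("VBZ", "is"))    # 1.0 in this toy corpus
```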
An Example: the verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
P(NN|TO) = .00047, P(VB|TO) = .83: the tag transition probabilities P(NN|TO) and P(VB|TO) answer the question "How likely are we to expect a verb or a noun given the previous tag TO?"
P(race|NN) = .00057, P(race|VB) = .00012: lexical likelihoods from the Brown corpus for "race" given the POS tag NN or VB.
P(NR|VB) = .0027, P(NR|NN) = .0012: tag sequence probabilities for the likelihood of an adverb (NR) occurring given the previous tag verb or noun.
Multiplying the lexical likelihoods with the tag sequence probabilities: P(VB|TO) P(NR|VB) P(race|VB) = .00000027 versus P(NN|TO) P(NR|NN) P(race|NN) = .00000000032, so the verb reading wins.
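As a quick arithmetic check, the two products can be reproduced in a couple of lines (the probabilities are the ones quoted above):

```python
# Reproducing the two products quoted above (values as given on the slide).
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.8f}")                # 0.00000027
print(f"{p_nn:.11f}")               # 0.00000000032
print(p_vb > p_nn)                  # True: the verb reading wins
```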
Hidden Markov Models
What we have described with these two kinds of probabilities is a Hidden Markov Model (HMM).
Let's spend a bit of time tying this into the model. In order to define HMMs, we will first introduce the Markov chain, or observable Markov model.
Definitions
A weighted finite-state automaton adds probabilities to the arcs; the probabilities leaving any state must sum to one.
A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through.
Markov chains cannot represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.
Markov chain = "first-order observable Markov model"
A set of states Q = q1, q2, …, qN; the state at time t is qt.
A set of transition probabilities A = {aij}: each aij represents the probability of transitioning from state i to state j; the set of these is the transition probability matrix A.
aij = P(qt = j | qt-1 = i), 1 ≤ i, j ≤ N
∑_{j=1..N} aij = 1, for each 1 ≤ i ≤ N
Distinguished start and end states.
Special initial probability vector π: πi is the probability that the Markov model will start in state i; each πi expresses the probability p(qi|START).
Markov chain = "first-order observable Markov model"
Markov chain for weather (Example 1): three types of weather: sunny, rainy, foggy. We want to find conditional probabilities of the form
P(qn | qn-1, qn-2, …, q1)
i.e. the probability of the unknown weather on day n, depending on the (known) weather of the preceding days.
We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences.
Problem: the larger n is, the more observations we must collect. Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories.
Markov chain = "first-order observable Markov model"
Therefore we make a simplifying assumption, called the (first-order) Markov assumption: for a sequence of observations q1, …, qn, the current state depends only on the previous state:
P(qn | qn-1, …, q1) ≈ P(qn | qn-1)
The joint probability of past and current observations then factorizes as
P(q1, …, qn) = ∏_{i=1..n} P(qi | qi-1)
Markov chain = "first-order observable Markov model"
(Figure, not reproduced: the state-transition diagram / table of transition probabilities for the weather example, referred to below as table 1.)
Markov chain = "first-order observable Markov model"
Given that today the weather is sunny, what is the probability that tomorrow is sunny and the day after is rainy?
Using the Markov assumption and the probabilities in table 1, this translates into:
P(q2 = sunny, q3 = rainy | q1 = sunny) = P(q3 = rainy | q2 = sunny) x P(q2 = sunny | q1 = sunny)
Markov chain for weather
What is the probability of 4 consecutive rainy days? The sequence is rainy-rainy-rainy-rainy, i.e. the state sequence is 3-3-3-3:
P(3,3,3,3) = π3 x a33 x a33 x a33 = 0.2 x (0.6)^3 = 0.0432
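The same computation in a few lines of code, using the values quoted above (an initial probability of 0.2 for rainy and a rainy-to-rainy transition probability of 0.6):

```python
# Probability of the state sequence rainy-rainy-rainy-rainy.
pi_rainy = 0.2
a_rainy_rainy = 0.6
p = pi_rainy * a_rainy_rainy ** 3   # one initial probability, three self-transitions
print(p)                            # 0.0432 (up to floating-point rounding)
```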
Hidden Markov Model
For Markov chains, the output symbols are the same as the states: if we see sunny weather, we are in state "sunny".
But in part-of-speech tagging (and other tasks), the output symbols are words, while the hidden states are part-of-speech tags.
So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
This means we don't know which state we are in.
Markov chain for words
Observed events: words
Hidden events: tags
Hidden Markov Models
States Q = q1, q2, …, qN; observations O = o1, o2, …, oN: each observation is a symbol drawn from a vocabulary V = {v1, v2, …, vV}.
Transition probabilities (the prior): the transition probability matrix A = {aij}.
Observation likelihoods (the likelihood): the output probability matrix B = {bi(ot)}, a set of observation likelihoods, each expressing the probability of an observation ot being generated from state i (the emission probabilities).
Special initial probability vector π: πi is the probability that the HMM will start in state i; each πi expresses the probability p(qi|START).
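One minimal way to hold these components in code is sketched below; the state names, vocabulary and probabilities are made-up illustrative values, not taken from the lecture.

```python
# Minimal container for the HMM components (Q, A, B, pi) described above.
from typing import Dict, List, NamedTuple

class HMM(NamedTuple):
    states: List[str]                     # Q
    trans: Dict[str, Dict[str, float]]    # A: trans[i][j] = a_ij = P(q_t = j | q_t-1 = i)
    emit: Dict[str, Dict[str, float]]     # B: emit[i][o] = b_i(o)
    start: Dict[str, float]               # pi: start[i] = P(q_1 = i)

toy = HMM(
    states=["HOT", "COLD"],
    trans={"HOT": {"HOT": 0.7, "COLD": 0.3}, "COLD": {"HOT": 0.4, "COLD": 0.6}},
    emit={"HOT": {"1": 0.2, "2": 0.4, "3": 0.4}, "COLD": {"1": 0.5, "2": 0.4, "3": 0.1}},
    start={"HOT": 0.8, "COLD": 0.2},
)

# Each row of A, each row of B, and pi must sum to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in toy.trans.values())
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in toy.emit.values())
assert abs(sum(toy.start.values()) - 1.0) < 1e-9
```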
Assumptions
Markov assumption: the probability of a particular state depends only on the previous state:
P(qi | q1 … qi-1) ≈ P(qi | qi-1)
Output-independence assumption: the probability of an output observation depends only on the state that produced that observation:
P(oi | q1 … qi, o1 … oi-1) ≈ P(oi | qi)
HMM for Ice Cream
You are a climatologist in the year 2799, studying global warming. You can't find any records of the weather in Boston, MA for the summer of 2007, but you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer.
Our job: figure out how hot it was.
Noam task
Given an ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3, … (cp. the output symbols),
produce a weather sequence: C, C, H, C, C, C, H, … (cp. the hidden states, the causing states).
HMM for ice cream
(Figure, not reproduced: the HMM for the ice cream task, with hidden weather states and emission probabilities for eating 1, 2, or 3 ice creams.)
Different types of HMM structure
Bakis = left-to-right Ergodic = fully-connected
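The structural difference shows up directly in the transition matrix A. The sketch below uses made-up values: the Bakis matrix assigns zero probability to transitions back to earlier states, while the ergodic matrix is fully populated.

```python
import numpy as np

# Bakis (left-to-right): no transitions back to earlier states, so A is upper-triangular.
A_bakis = np.array([
    [0.6, 0.3, 0.1],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])
# Ergodic (fully connected): every state can reach every other state.
A_ergodic = np.array([
    [0.4, 0.3, 0.3],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])
# Rows are probability distributions in both cases.
assert np.allclose(A_bakis.sum(axis=1), 1.0)
assert np.allclose(A_ergodic.sum(axis=1), 1.0)
```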
HMM Taggers
Two kinds of probabilities: A, the transition probabilities (the PRIOR), and B, the observation likelihoods (the LIKELIHOOD).
HMM Taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability
Weighted FSM corresponding to hidden states of HMM, showing A probs
B observation likelihoods for POS HMM
The A matrix for the POS HMM
The B matrix for the POS HMM
HMM Taggers
The probabilities are trained on hand-labeled training corpora (the training set).
Different N-gram levels can be combined.
Taggers are evaluated by comparing their output on a test set to human labels for that test set (the gold standard).
The Viterbi Algorithm
What is the best tag sequence for "John likes to fish in the sea"?
The Viterbi algorithm efficiently computes the most likely state sequence given a particular output sequence, based on dynamic programming.
A smaller example
(Figure: a two-state automaton with states q and r, plus start and end states. The start state goes to q with probability 1.0. Transition probabilities: q->q = 0.3, q->r = 0.7, r->q = 0.5, r->r = 0.5. Emission probabilities: q emits b with 0.6 and a with 0.4; r emits b with 0.8 and a with 0.2.)
What is the best sequence of states for the input string "bbba"?
Computing all possible paths and finding the one with the maximum probability is exponential.
A smaller example (con’t)
For each state, store the most likely sequence that could lead to it (and its probability). Path probability matrix: an array of states versus time (tags versus words) that stores, for each state and each time, the probability of the best sequence reaching that state, in terms of the probabilities for the states at the preceding time.
After the 1st "b":
  ending in q: ε -> q  0.6 (1.0 x 0.6)
  ending in r: ε -> r  0 (0 x 0.8)
After the 2nd "b":
  ending in q: from q: q -> q  0.108 (0.6 x 0.3 x 0.6); from r: r -> q  0 (0 x 0.5 x 0.6); best: 0.108
  ending in r: from q: q -> r  0.336 (0.6 x 0.7 x 0.8); from r: r -> r  0 (0 x 0.5 x 0.8); best: 0.336
After the 3rd "b":
  ending in q: from q: qq -> q  0.01944 (0.108 x 0.3 x 0.6); from r: qr -> q  0.1008 (0.336 x 0.5 x 0.6); best: 0.1008
  ending in r: from q: qq -> r  0.06048 (0.108 x 0.7 x 0.8); from r: qr -> r  0.1344 (0.336 x 0.5 x 0.8); best: 0.1344
After the final "a":
  ending in q: from q: qrq -> q  0.012096 (0.1008 x 0.3 x 0.4); from r: qrr -> q  0.02688 (0.1344 x 0.5 x 0.4); best: 0.02688
  ending in r: from q: qrq -> r  0.014112 (0.1008 x 0.7 x 0.2); from r: qrr -> r  0.01344 (0.1344 x 0.5 x 0.2); best: 0.014112
The best complete sequence of states for "bbba" is therefore q r r q, with probability 0.02688.
Viterbi intuition: we are looking for the best ‘path’
(Figure: a tag lattice for the sentence "promised to back the bill", with states S1-S5, one per word, and candidate tags such as VBD, VBN, TO, VB, JJ, NN, RB, DT, NNP at each position; Viterbi searches this lattice for the best path.)
Slide from Dekang Lin
The Viterbi Algorithm
Intuition
The value in each cell is computed by taking the MAX over all paths that lead to this cell.
An extension of a path from state i at time t-1 is computed by multiplying:
the previous path probability from the previous cell, viterbi[t-1, i]
the transition probability aij from previous state i to current state j
the observation likelihood bj(ot) that current state j matches observation symbol ot
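The recurrence above is all that is needed to implement the algorithm. The following is a compact sketch (not the slides' pseudocode), run on the small q/r automaton from the earlier example; the transition and emission values are the ones read off that example.

```python
def viterbi(obs, states, start, trans, emit):
    """Most likely state sequence for obs; returns (path, probability)."""
    # Initialization: viterbi[0][s] = pi(s) * b_s(o_0)
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    # Recursion: viterbi[t][j] = max_i viterbi[t-1][i] * a_ij * b_j(o_t)
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for j in states:
            prob, best_i = max(
                (V[t - 1][i] * trans[i][j] * emit[j][obs[t]], i) for i in states
            )
            V[t][j], back[t][j] = prob, best_i
    # Termination and backtrace
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]

# The small q/r example from above: start goes to q with probability 1.
states = ["q", "r"]
start = {"q": 1.0, "r": 0.0}
trans = {"q": {"q": 0.3, "r": 0.7}, "r": {"q": 0.5, "r": 0.5}}
emit = {"q": {"a": 0.4, "b": 0.6}, "r": {"a": 0.2, "b": 0.8}}
print(viterbi("bbba", states, start, trans, emit))   # (['q', 'r', 'r', 'q'], ~0.02688)
```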
Viterbi example
Smoothing of probabilities
Data sparseness is a problem when estimating probabilities based on corpus data. The "add one" smoothing technique:
P(w1,n) = (C(w1,n) + 1) / (N + B)
where C is the absolute frequency, N is the number of training instances, and B is the number of different types.
Linear interpolation methods can compensate for data sparseness with higher-order models. A common method is interpolating trigrams, bigrams and unigrams:
P(ti | ti-1, ti-2) = λ1 P(ti) + λ2 P(ti | ti-1) + λ3 P(ti | ti-1, ti-2), with λ1 + λ2 + λ3 = 1 and 0 ≤ λi ≤ 1
The lambda values are automatically determined using a variant of the Expectation Maximization algorithm.
Possible improvements
In bigram POS tagging, we condition a tag only on the preceding tag. Why not use more context (e.g. a trigram model)?
More precise: "is clearly marked" --> verb, past participle; "he clearly marked" --> verb, past tense
Combine trigram, bigram and unigram models; condition on words too.
But with an n-gram approach this is too costly (too many parameters to model).
Further issues with Markov Model tagging
Unknown words are a problem since we don't have the required probabilities. Possible solutions:
Assign word probabilities based on the corpus-wide distribution of POS tags.
Use morphological cues (capitalization, suffix) to assign a more informed guess.
Using higher-order Markov models: a trigram model captures more context, but data sparseness becomes much more of a problem.
TnT
Efficient statistical POS tagger developed by Thorsten Brants, ANLP-2000. Underlying model:
Trigram modelling: the probability of a POS tag depends only on its two preceding tags, and the probability of a word appearing at a particular position, given that its POS occurs at that position, is independent of everything else.
argmax_{t1…tT} [ ∏_{i=1..T} P(ti | ti-1, ti-2) P(wi | ti) ] P(tT+1 | tT)
Training
Maximum likelihood estimates:
Unigrams:  P^(t3) = c(t3) / N
Bigrams:   P^(t3 | t2) = c(t2, t3) / c(t2)
Trigrams:  P^(t3 | t1, t2) = c(t1, t2, t3) / c(t1, t2)
Lexical:   P^(w3 | t3) = c(w3, t3) / c(t3)
Smoothing: context-independent variant of linear interpolation:
P(t3 | t1, t2) = λ1 P^(t3) + λ2 P^(t3 | t2) + λ3 P^(t3 | t1, t2)
Smoothing algorithm
Set λ1 = λ2 = λ3 = 0
For each trigram t1 t2 t3 with f(t1,t2,t3) > 0, depending on which of the following three values is the maximum:
  case (f(t1,t2,t3) - 1) / (f(t1,t2) - 1): increment λ3 by f(t1,t2,t3)
  case (f(t2,t3) - 1) / (f(t2) - 1): increment λ2 by f(t1,t2,t3)
  case (f(t3) - 1) / (N - 1): increment λ1 by f(t1,t2,t3)
Normalize the λi
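A direct transcription of this procedure might look like the sketch below, assuming trigram, bigram and unigram frequency tables are already available; the counts used at the end are tiny made-up examples.

```python
from collections import Counter

def deleted_interpolation(tri, bi, uni, n_tokens):
    """Estimate lambda1..lambda3 as in the algorithm above.
    tri, bi, uni: Counters of tag trigram/bigram/unigram frequencies; n_tokens: corpus size."""
    lam = [0.0, 0.0, 0.0]
    for (t1, t2, t3), f in tri.items():
        # The three candidate ratios, guarding against zero denominators.
        c3 = (f - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else -1.0
        c2 = (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else -1.0
        c1 = (uni[t3] - 1) / (n_tokens - 1)
        best = max(range(3), key=lambda i: (c1, c2, c3)[i])   # 0 -> lambda1, 1 -> lambda2, 2 -> lambda3
        lam[best] += f
    total = sum(lam)
    return [l / total for l in lam]

# Tiny made-up counts, for illustration only.
uni = Counter({"DT": 4, "NN": 4, "VBZ": 2})
bi = Counter({("DT", "NN"): 4, ("NN", "VBZ"): 2, ("VBZ", "DT"): 1})
tri = Counter({("DT", "NN", "VBZ"): 2, ("NN", "VBZ", "DT"): 1})
print(deleted_interpolation(tri, bi, uni, n_tokens=10))
```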
Evaluation of POS taggers
Compared with a gold standard of human annotation. Metric:
accuracy = % of tags that are identical to the gold standard; most taggers achieve ~96-97% accuracy.
Accuracy must be compared to:
a ceiling (best possible result): how do human annotators score compared to each other? (~96-97%), so systems are not bad at all!
a baseline (worst reasonable result): what if we take the most likely tag (unigram model) regardless of previous tags? (~90-91%), so anything less is really bad.
More on tagger accuracy: is 95% good?
That's 5 mistakes every 100 words; if a sentence is 20 words on average, that's 1 mistake per sentence.
When comparing tagger accuracy, beware of:
size of the training corpus: the bigger, the better the results
difference between training and testing corpora (genre, domain, …): the closer, the better the results
size of the tag set; prediction versus classification
unknown words: the more unknown words (not in the dictionary), the worse the results
Error Analysis
Look at a confusion matrix (contingency table).
E.g. 4.4% of the total errors are caused by mistagging VBD as VBN.
See what errors are causing problems:
Noun (NN) vs Proper Noun (NNP) vs Adjective (JJ); Adverb (RB) vs Particle (RP) vs Preposition (IN); Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
ERROR ANALYSIS IS ESSENTIAL!!!
Tag indeterminacy
Major difficulties in POS tagging
Unknown words (e.g. proper names): a problem because we do not know the set of tags a word can take, and knowing this takes you a long way (cf. the baseline POS tagger). Possible solutions:
assign all possible tags, with a probability distribution identical to the lexicon as a whole
use morphological cues to infer the possible tags, e.g. words ending in -ed are likely to be past tense verbs or past participles
Frequently confused tag pairs:
preposition vs particle: <running> <up> a hill (preposition) / <running up> a bill (particle)
verb past tense vs past participle vs adjective
Unknown Words
Most-frequent-tag approach: but what about words that don't appear in the training set?
Suffix analysis: the probability distribution for a particular suffix is generated from all words in the training set that share the same suffix.
Suffix estimation: calculate the probability of a tag t given the last i letters of an n-letter word.
Smoothing: successive abstraction through sequences of increasingly more general contexts (i.e. omit more and more characters of the suffix).
Use a morphological analyzer to get restrictions on the possible tags.
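As a rough illustration of the suffix idea (a sketch only: the suffix length, backoff policy and training data are made up), one can back off from the longest seen suffix of an unknown word to shorter ones:

```python
from collections import Counter, defaultdict

def build_suffix_model(tagged_words, max_len=4):
    """P(tag | suffix) tables for every suffix length up to max_len, by relative frequency."""
    by_suffix = defaultdict(Counter)
    for word, tag in tagged_words:
        for i in range(1, min(max_len, len(word)) + 1):
            by_suffix[word[-i:]][tag] += 1
    return by_suffix

def guess_tags(word, by_suffix, max_len=4):
    """Use the longest suffix of `word` seen in training; return a tag distribution."""
    for i in range(min(max_len, len(word)), 0, -1):
        counts = by_suffix.get(word[-i:])
        if counts:
            total = sum(counts.values())
            return {tag: c / total for tag, c in counts.items()}
    return {}   # nothing known about this word's shape

# Made-up training data, for illustration only.
train = [("walked", "VBD"), ("opened", "VBD"), ("painted", "VBN"), ("quickly", "RB")]
model = build_suffix_model(train)
print(guess_tags("jogged", model))   # a distribution over VBD/VBN, based on the "-ed" words
```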
Unknown words
Alternative graphical models for part of speech tagging
Different Models for POS tagging
HMM, Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs)
Hidden Markov Model (HMM): Generative Modeling
Source model P(Y): P(y) = ∏_i P(yi | yi-1)
Noisy channel P(X|Y): P(x|y) = ∏_i P(xi | yi)
Dependency (1st order)
(Figure: the first-order HMM dependency structure; each label Yk depends on the previous label Yk-1 via P(Yk | Yk-1), and each observation Xk depends only on its own label via P(Xk | Yk).)
Disadvantage of HMMs (1)
No rich feature information, but rich information is often required:
when xk is complex, or when the data for xk is sparse.
Example: POS tagging. How do we evaluate P(wk|tk) for unknown words wk? Useful features: suffix (e.g. -ed, -tion, -ing, etc.), capitalization.
Generative model: parameter estimation maximizes the joint likelihood of the training examples:
∑_{(x,y) ∈ T} log2 P(X = x, Y = y)
Generative Models
Hidden Markov models (HMMs) and stochastic grammars: assign a joint probability to paired observation and label sequences; the parameters are typically trained to maximize the joint likelihood of the training examples.
Generative Models (cont’d)
Difficulties and disadvantages:
need to enumerate all possible observation sequences
not practical to represent multiple interacting features or long-range dependencies of the observations
very strict independence assumptions on the observations
Better approach: a discriminative model which models P(y|x) directly and maximizes the conditional likelihood of the training examples:
∑_{(x,y) ∈ T} log2 P(Y = y | X = x)
Maximum Entropy modeling
An N-gram model makes probabilities depend on the previous few tokens. We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is "to", whether one of the last 5 words is a preposition, etc.).
Maxent combines these features in a probabilistic model. The given features provide constraints on the model. We would like a probability distribution which, outside of these constraints, is as uniform as possible, i.e. has the maximum entropy among all models that satisfy the constraints.
Maximum Entropy Markov Model: Discriminative Sub-Models
Unify the two parameters of the generative model into one conditional model: the generative model has a source-model parameter P(yk | yk-1) and a noisy-channel parameter P(xk | yk); the unified conditional model uses P(yk | xk, yk-1) and employs the maximum entropy principle:
P(y | x) = ∏_i P(yi | yi-1, xi)
General Maximum Entropy Principle
Model: model the distribution P(Y|X) with a set of features {f1, f2, …, fl} defined on X and Y.
Idea: collect information about the features from the training data.
Principle: model what is known; assume nothing else. Choose the flattest distribution, i.e. the distribution with the maximum entropy.
Example
(Berger et al., 1996) example: model the translation of the word "in" from English to French.
We need to model P(wordFrench). Constraints:
1: possible translations: dans, en, à, au cours de, pendant
2: "dans" or "en" is used 30% of the time
3: "dans" or "à" is used 50% of the time
Features
Features are 0-1 indicator functions: 1 if (x, y) satisfies a predefined condition, 0 if not.
Example (POS tagging):
f1(x, y) = 1 if x ends with -tion and y is NN, 0 otherwise
f2(x, y) = 1 if x starts with a capitalization and y is NNP, 0 otherwise
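These indicator features translate directly into code. A minimal sketch (the function names and feature list are illustrative):

```python
# 0-1 indicator features f(x, y) over a (word, tag) pair, as in the examples above.
def f1(x: str, y: str) -> int:
    """1 if x ends with -tion and y is NN."""
    return 1 if x.endswith("tion") and y == "NN" else 0

def f2(x: str, y: str) -> int:
    """1 if x starts with a capital letter and y is NNP."""
    return 1 if x[:1].isupper() and y == "NNP" else 0

features = [f1, f2]
print([f("station", "NN") for f in features])   # [1, 0]
print([f("Boston", "NNP") for f in features])   # [0, 1]
```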
Constraints
Empirical information (statistics from the training data T):
P^(fi) = (1/|T|) ∑_{(x,y) ∈ T} fi(x, y)
Expected value under the distribution P(Y|X) we want to model:
P(fi) = (1/|T|) ∑_{(x,y) ∈ T} ∑_{y' ∈ D(Y)} P(Y = y' | X = x) fi(x, y')
Constraints: P^(fi) = P(fi)
Maximum Entropy: Objective
Entropy of the conditional model:
I = - ∑_x P^(x) ∑_y P(Y = y | X = x) log2 P(Y = y | X = x)
Maximization problem: choose P(Y|X) to maximize I subject to the constraints P^(fi) = P(fi).
Dual Problem
A conditional model of exponential form:
P(Y = y | X = x) ∝ exp( ∑_{i=1..l} λi fi(x, y) )
whose parameters maximize the likelihood of the conditional data:
max_{λ1,…,λl} ∑_{(x,y) ∈ T} log2 P(Y = y | X = x)
Solutions: Improved Iterative Scaling (IIS) (Berger et al. 1996), Generalized Iterative Scaling (GIS) (McCallum et al. 2000)
Maximum Entropy Markov Model: use the maximum entropy approach to model the 1st-order conditional probability
P(Yk = yk | Xk = xk, Yk-1 = yk-1)
Features: basic features (like the parameters in an HMM): bigram (1st order) or trigram (2nd order) features from the source model, and state-output pair features (Xk = xk, Yk = yk).
Advantage: we can incorporate other advanced features on (xk, yk).
HMM vs MEMM (1st order)
(Figure: in the HMM, arrows go from Yk-1 to Yk with P(Yk | Yk-1) and from Yk to Xk with P(Xk | Yk); in the MEMM, arrows go from Yk-1 and Xk into Yk, modeling P(Yk | Xk, Yk-1) directly.)
Performance in POS Tagging
POS tagging data set: WSJ. Features: HMM features plus spelling features (like -ed, -tion, -s, -ing, etc.)
Results (Lafferty et al. 2001):
1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy
ME applications
Part of Speech (POS) tagging (Ratnaparkhi, 1996): P(POS tag | context). Information sources: word window (4), word features (prefix, suffix, capitalization), previous POS tags.
ME applications
Abbreviation expansion (Pakhomov, 2002). Information sources: word window (4), document title.
Word Sense Disambiguation (WSD) (Chao & Dyer, 2002). Information sources: word window (4), structurally related words (4).
Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997). Information sources: token features (prefix, suffix, capitalization, abbreviation), word window (2).
Solution
Global optimization: optimize the parameters in a global model simultaneously, not in sub-models separately.
Alternatives: conditional random fields; application of the perceptron algorithm.
Why ME?
Advantages: combine multiple knowledge sources.
Local: word prefix, suffix, capitalization (POS tagging (Ratnaparkhi, 1996)); word POS, POS class, suffix (WSD (Chao & Dyer, 2002)); token prefix, suffix, capitalization, abbreviation (sentence boundary detection (Reynar & Ratnaparkhi, 1997)).
Global: N-grams (Rosenfeld, 1997); word window; document title (Pakhomov, 2002); structurally related words (Chao & Dyer, 2002); sentence length, conventional lexicon (Och & Ney, 2002).
Combine dependent knowledge sources.
Why ME?
Advantages: add additional knowledge sources; implicit smoothing.
Disadvantages: computational cost (expected values at each iteration, normalizing constant); overfitting; feature selection (cutoffs, basic feature selection (Berger et al., 1996)).
Conditional Models
Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x): specify the probability of possible label sequences given an observation sequence.
Allow arbitrary, non-independent features on the observation sequence X.
The probability of a transition between labels may depend on past and future observations; this relaxes the strong independence assumptions of generative models.
Discriminative Models: Maximum Entropy Markov Models (MEMMs)
Exponential model. Given a training set X with label sequence Y: train a model θ that maximizes P(Y|X, θ); for a new data sequence x, the predicted label y maximizes P(y|x, θ). Notice the per-state normalization.
MEMMs (cont’d)
MEMMs have all the advantages of Conditional Models
Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (“conservation of score mass”)
Subject to Label Bias Problem
Bias toward states with fewer outgoing transitions
Label Bias Problem
• Consider this MEMM (figure not reproduced; in it, state 1 has only one outgoing transition, to state 2):
• P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r), and P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
• In the training data, label value 2 is the only label value observed after label value 1, therefore P(2 | 1) = 1 and so P(2 | 1 and x) = 1 for all x. Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri).
• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
• Per-state normalization does not allow the required expectation.
Solve the Label Bias Problem
Change the state-transition structure of the model: not always practical to change the set of states.
Start with a fully-connected model and let the training procedure figure out a good structure: this precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction).
Random Field
Conditional Random Fields (CRFs)
CRFs have all the advantages of MEMMs without the label bias problem.
An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state; a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence.
Undirected acyclic graph. Allows some transitions to "vote" more strongly than others, depending on the corresponding observations.
Definition of CRFs
X is a random variable over data sequences to be labeled
Y is a random variable over corresponding label sequences
Example of CRFs
Graphical comparison among HMMs, MEMMs and CRFs
(Figure: the graphical structures of an HMM, an MEMM, and a CRF, side by side.)
Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields:
p_θ(y | x) ∝ exp( ∑_{e ∈ E, k} λk fk(e, y|e, x) + ∑_{v ∈ V, k} μk gk(v, y|v, x) )
x is a data sequence and y a label sequence; v is a vertex from the vertex set V (the set of label random variables); e is an edge from the edge set E over V.
fk and gk are given and fixed: gk is a Boolean vertex feature, fk is a Boolean edge feature; k indexes the features.
θ = (λ1, λ2, …, λn; μ1, μ2, …, μn) are the parameters to be estimated.
y|e is the set of components of y defined by edge e; y|v is the set of components of y defined by vertex v.
Conditional Distribution (cont’d)
• CRFs use an observation-dependent normalization Z(x) for the conditional distributions:
p_θ(y | x) = (1 / Z(x)) exp( ∑_{e ∈ E, k} λk fk(e, y|e, x) + ∑_{v ∈ V, k} μk gk(v, y|v, x) )
where Z(x) is a normalization constant over the data sequence x.
Parameter Estimation for CRFs
The original paper provided iterative scaling algorithms, which turn out to be very inefficient.
Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient.
Training of CRFs (From Prof. Dietterich)
• First, take the log of the conditional distribution:
log p_θ(y | x) = ∑_{e ∈ E, k} λk fk(e, y|e, x) + ∑_{v ∈ V, k} μk gk(v, y|v, x) − log Z(x)
• Then take the derivative of this expression with respect to the parameters.
• For training, the first two terms are easy to compute. For example, for each k, fk evaluated along the sequence is a sequence of Boolean values, such as 00101110100111, and ∑_{e ∈ E} fk(e, y|e, x) is just the total number of 1's in that sequence.
• The hardest part is calculating Z(x).
Training of CRFs (From Prof. Dietterich) (cont'd)
• Maximal cliques: for a linear chain y1 - y2 - y3 - y4, the maximal cliques are c1 = (y1, y2), c2 = (y2, y3), c3 = (y3, y4).
• Z(x) = ∑_{y1,y2,y3,y4} Φc1(y1, y2, x) Φc2(y2, y3, x) Φc3(y3, y4, x), where each clique potential Φc is the exponential of the weighted features defined on that clique.
• The sum can be computed efficiently by pushing it inside the product and summing out one yi at a time, instead of enumerating all label sequences.
POS tagging Experiments
POS tagging Experiments (cont’d)
• Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging.
• Each word in a given input sentence must be labeled with one of 45 syntactic tags.
• Add a small set of orthographic features: whether a spelling begins with a number or upper-case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
• oov = out-of-vocabulary (not observed in the training set)
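The orthographic features listed above are straightforward to compute. A sketch follows; the exact feature set used in the experiments may differ.

```python
SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

def orthographic_features(word: str) -> dict:
    """Simple spelling features of the kind listed above."""
    feats = {
        "starts_with_digit": word[:1].isdigit(),
        "starts_with_upper": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suf in SUFFIXES:
        feats["suffix" + suf] = word.endswith(suf.lstrip("-"))
    return feats

print(orthographic_features("Well-known"))
print(orthographic_features("running"))
```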
Summary
Per-state normalized discriminative models such as MEMMs are prone to the label bias problem
CRFs provide the benefits of discriminative models
CRFs solve the label bias problem well, and demonstrate good performance