Probabilistic Graphical Models
Case Studies: HMM and CRF
Eric Xing
Lecture 6, February 3, 2020
Reading: see class homepage
Hidden Markov Model: from static to dynamic mixture models
Dynamic mixture
[Figure: hidden Markov chain Y1 → Y2 → Y3 → ... → YT with emissions X1, X2, X3, ..., XT]
Static mixture
[Figure: a single hidden variable Y1 emitting X1, replicated over N i.i.d. samples (plate)]
Example
• Speech recognition
Applications of HMMs
• Some early applications of HMMs
  • finance, but we never saw them
  • speech recognition
  • modelling ion channels
• In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.
• Some current applications of HMMs to biology
  • mapping chromosomes
  • aligning biological sequences
  • predicting sequence structure
  • inferring evolutionary relationships
  • finding genes in DNA sequence
Definition (of HMM)
• Observation space
  Alphabetic set: $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$
  Euclidean space: $\mathbb{R}^d$
• Index set of hidden states: $\mathcal{I} = \{1, 2, \ldots, M\}$
• Transition probabilities between any two states:
  $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$,
  or $p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in \mathcal{I}$
• Start probabilities: $p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$
• Emission probabilities associated with each state:
  $p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in \mathcal{I}$,
  or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in \mathcal{I}$
[Figure: HMM graphical model — hidden chain $y_1 \to y_2 \to \cdots \to y_T$ with emissions $x_1, x_2, \ldots, x_T$]
Probability of a parse
• Given a sequence x = x1 ... xT and a parse y = y1, ..., yT,
• To find how likely the parse is (given our HMM and the sequence):
  p(x, y) = p(x1 ... xT, y1, ..., yT)   (joint probability)
          = p(y1) p(x1 | y1) p(y2 | y1) p(x2 | y2) ... p(yT | yT-1) p(xT | yT)
          = p(y1) p(y2 | y1) ... p(yT | yT-1) × p(x1 | y1) p(x2 | y2) ... p(xT | yT)
          = p(y1, ..., yT) p(x1 ... xT | y1, ..., yT)
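The factorization above can be turned directly into code. Below is a minimal sketch (not from the lecture) that evaluates p(x, y) for a small discrete HMM; the parameter names pi, A, B are assumptions standing in for the start, transition, and emission probabilities defined earlier.

```python
import numpy as np

def parse_probability(x, y, pi, A, B):
    """Joint probability p(x, y) = p(y1) p(x1|y1) prod_t p(yt|yt-1) p(xt|yt)."""
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

# Toy example with M = 2 hidden states and K = 2 symbols (made-up numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(parse_probability([0, 1, 1], [0, 0, 1], pi, A, B))
```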
Variable Elimination on Hidden Markov Model
p(x, y) = p(x1 ... xT, y1, ..., yT) = p(y1) p(x1 | y1) p(y2 | y1) p(x2 | y2) ... p(yT | yT-1) p(xT | yT)
Conditional probability: $p(y_t \mid x_1, \ldots, x_T)$, obtained by summing out (eliminating) the remaining hidden variables one at a time.
The Forward Algorithm
• We want to calculate P(x), the likelihood of x, given the HMM
• Sum over all possible ways of generating x:
  $P(\mathbf{x}) = \sum_{y_1}\sum_{y_2}\cdots\sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1},y_t} \prod_{t=1}^{T} p(x_t \mid y_t)$
• To avoid summing over an exponential number of paths y, define
  $\alpha_t^k \stackrel{\mathrm{def}}{=} \alpha_t(k) = P(x_1, \ldots, x_t, y_t^k = 1)$   (the forward probability)
• The recursion:
  $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
• Termination: $P(\mathbf{x}) = \sum_k \alpha_T^k$
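As a hedged illustration of the recursion above (a sketch, not the course's reference code), the forward pass can be written with one vector of alphas per position; pi, A, B follow the same assumed convention as the previous sketch.

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, k] = P(x_1..x_t, y_t = k); returns alphas and the likelihood P(x)."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]                      # base case: alpha_1^k = pi_k p(x_1 | y_1 = k)
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)  # alpha_t^k = p(x_t | y_t = k) sum_i alpha_{t-1}^i a_{i,k}
    return alpha, alpha[-1].sum()                   # P(x) = sum_k alpha_T^k
```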
The Backward Algorithm
• We want to compute $P(y_t^k = 1 \mid \mathbf{x})$, the posterior probability distribution on the t-th position, given x
• We start by computing
  $P(y_t^k = 1, \mathbf{x}) = P(x_1, \ldots, x_t, y_t^k = 1, x_{t+1}, \ldots, x_T)$
  $\qquad = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t, y_t^k = 1)$
  $\qquad = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$
  Forward: $\alpha_t^k$;  Backward: $\beta_t^k \stackrel{\mathrm{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$
• The recursion:
  $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
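A matching sketch of the backward recursion, under the same assumed parameter convention (A transitions, B emissions); beta_T is initialized to 1 since no observations remain after position T.

```python
import numpy as np

def backward(x, A, B):
    """beta[t, k] = P(x_{t+1}..x_T | y_t = k)."""
    T, M = len(x), A.shape[0]
    beta = np.ones((T, M))                            # base case: beta_T^k = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])  # beta_t^k = sum_i a_{k,i} p(x_{t+1} | y_{t+1}=i) beta_{t+1}^i
    return beta
```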
The junction tree algorithm: message passing for HMM
• A junction tree for the HMM
  [Figure: clique chain ψ(y1, x1) – ψ(y1, y2) – ψ(y2, y3) – ... – ψ(yT-1, yT), with evidence cliques ψ(y2, x2), ψ(y3, x3), ..., ψ(yT, xT) attached and separators over y2, y3, ..., yT]
• Rightward pass:
  $\mu_{t \to t+1}(y_{t+1}) = \sum_{y_t} \psi(y_t, y_{t+1})\, \mu_{t-1 \to t}(y_t)$
  $\qquad = \sum_{y_t} p(y_{t+1} \mid y_t)\, p(x_{t+1} \mid y_{t+1})\, \mu_{t-1 \to t}(y_t)$
  $\qquad = p(x_{t+1} \mid y_{t+1}) \sum_{y_t} a_{y_t, y_{t+1}}\, \mu_{t-1 \to t}(y_t)$
• This is exactly the forward algorithm!
• Leftward pass:
  $\mu_{t-1 \leftarrow t}(y_t) = \sum_{y_{t+1}} \psi(y_t, y_{t+1})\, \mu_{t \leftarrow t+1}(y_{t+1})$
  $\qquad = \sum_{y_{t+1}} p(y_{t+1} \mid y_t)\, p(x_{t+1} \mid y_{t+1})\, \mu_{t \leftarrow t+1}(y_{t+1})$
• This is exactly the backward algorithm!
Summary
• Forward algorithm:
  $\alpha_t^k \stackrel{\mathrm{def}}{=} \mu_{t-1 \to t}(k) = P(x_1, \ldots, x_t, y_t^k = 1)$
  $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
• Backward algorithm:
  $\beta_t^k \stackrel{\mathrm{def}}{=} \mu_{t-1 \leftarrow t}(k) = P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$
  $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
• Pairwise posterior:
  $\xi_t^{i,j} \stackrel{\mathrm{def}}{=} p(y_t^i = 1, y_{t+1}^j = 1 \mid x_{1:T}) \propto \alpha_t^i\, a_{i,j}\, p(x_{t+1} \mid y_{t+1}^j = 1)\, \beta_{t+1}^j$
• Singleton posterior:
  $\gamma_t^i \stackrel{\mathrm{def}}{=} p(y_t^i = 1 \mid x_{1:T}) \propto \alpha_t^i \beta_t^i = \sum_j \xi_t^{i,j}$
• The matrix-vector form: with $B_t(i) \stackrel{\mathrm{def}}{=} p(x_t \mid y_t^i = 1)$ and $A(i,j) \stackrel{\mathrm{def}}{=} p(y_{t+1}^j = 1 \mid y_t^i = 1)$,
  $\alpha_t = (A^{\top}\alpha_{t-1}) \circ B_t$
  $\beta_t = A(\beta_{t+1} \circ B_{t+1})$
  $\xi_t = \big(\alpha_t (\beta_{t+1} \circ B_{t+1})^{\top}\big) \circ A$
  $\gamma_t = \alpha_t \circ \beta_t$
  (where $\circ$ denotes the element-wise product)
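Combining the two passes gives the singleton and pairwise posteriors. The sketch below is an illustration (reusing the forward and backward functions from the earlier sketches, not the course's reference code) of gamma and xi in the vectorized form described above.

```python
import numpy as np

def posteriors(x, pi, A, B):
    """gamma[t, i] = P(y_t = i | x) and xi[t, i, j] = P(y_t = i, y_{t+1} = j | x)."""
    alpha, px = forward(x, pi, A, B)
    beta = backward(x, A, B)
    gamma = alpha * beta / px                                 # gamma_t = alpha_t * beta_t / P(x)
    T, M = len(x), len(pi)
    xi = np.zeros((T - 1, M, M))
    for t in range(T - 1):
        # xi_t[i, j] = alpha_t^i a_{i,j} p(x_{t+1} | y_{t+1}=j) beta_{t+1}^j / P(x)
        xi[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :] / px
    return gamma, xi
```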
Posterior decoding
• We can now calculate
  $P(y_t^k = 1 \mid \mathbf{x}) = \dfrac{P(y_t^k = 1, \mathbf{x})}{P(\mathbf{x})} = \dfrac{\alpha_t^k \beta_t^k}{P(\mathbf{x})}$
• Then, we can ask
  • What is the most likely state at position t of sequence x:
    $k^{*}_t = \arg\max_k P(y_t^k = 1 \mid \mathbf{x})$
• Note that this is the MPA of a single hidden state; what if we want the MPA of a whole hidden state sequence?
• Posterior decoding:
  $\{\, y_t^{k^{*}_t} = 1 : t = 1 \ldots T \,\}$
• This is different from the MPA of a whole sequence of hidden states
• This can be understood as bit error rate vs. word error rate
  Example: MPA of X? MPA of (X, Y)?
    x  y  P(x,y)
    0  0  0.35
    0  1  0.05
    1  0  0.3
    1  1  0.3
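Posterior decoding then reduces to an argmax over gamma at each position; this short sketch assumes the posteriors helper from the summary above.

```python
def posterior_decode(x, pi, A, B):
    """Pick the individually most likely state at every position (bit-error-rate decoding)."""
    gamma, _ = posteriors(x, pi, A, B)   # gamma[t, k] = P(y_t = k | x)
    return gamma.argmax(axis=1)          # k*_t = argmax_k P(y_t = k | x)
```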
Viterbi decoding
• GIVEN x = x1, ..., xT, we want to find y = y1, ..., yT, such that P(y|x) is maximized:
  y* = argmax_y P(y|x) = argmax_y P(y, x)
• Let
  $V_t^k = \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)$
  = probability of the most likely sequence of states ending at state $y_t = k$
• The recursion:
  $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
  [Figure: Viterbi trellis — states 1, 2, ..., K across positions x1, x2, x3, ..., xN]
• Underflows are a significant problem:
  $p(x_1, \ldots, x_t, y_1, \ldots, y_t) = \pi_{y_1} a_{y_1,y_2} \cdots a_{y_{t-1},y_t}\, b_{y_1,x_1} \cdots b_{y_t,x_t}$
  These numbers become extremely small – underflow
• Solution: take the logs of all values:
  $V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \left( \log a_{i,k} + V_{t-1}^i \right)$
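A log-space sketch of the Viterbi recursion (illustrative only; pi, A, B as in the earlier sketches), including the back-pointers needed to recover the best path.

```python
import numpy as np

def viterbi(x, pi, A, B):
    """argmax_y P(y | x), computed in log space to avoid underflow."""
    T, M = len(x), len(pi)
    logpi, logA, logB = np.log(pi), np.log(A), np.log(B)
    V = np.zeros((T, M))                   # V[t, k] = log prob of the best path ending in state k at t
    ptr = np.zeros((T, M), dtype=int)      # back-pointers
    V[0] = logpi + logB[:, x[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] + logA  # scores[i, k] = V_{t-1}^i + log a_{i,k}
        ptr[t] = scores.argmax(axis=0)
        V[t] = logB[:, x[t]] + scores.max(axis=0)
    path = [int(V[-1].argmax())]           # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    return path[::-1]
```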
The Viterbi Algorithm – derivation
• Define the Viterbi probability:
  $V_{t+1}^k = \max_{\{y_1, \ldots, y_t\}} P(x_1, \ldots, x_t, y_1, \ldots, y_t, x_{t+1}, y_{t+1}^k = 1)$
  $\quad = \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid x_1, \ldots, x_t, y_1, \ldots, y_t)\, P(x_1, \ldots, x_t, y_1, \ldots, y_t)$
  $\quad = \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid y_t)\, P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t)$
  $\quad = \max_i P(x_{t+1}, y_{t+1}^k = 1 \mid y_t^i = 1) \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^i = 1)$
  $\quad = \max_i p(x_{t+1} \mid y_{t+1}^k = 1)\, a_{i,k}\, V_t^i$
  $\quad = p(x_{t+1} \mid y_{t+1}^k = 1) \max_i a_{i,k}\, V_t^i$
Computational Complexity and implementation details
• What are the running time and space required for Forward and Backward?
  $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
  $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
  $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
  Time: O(K²N); Space: O(KN).
• Useful implementation techniques to avoid underflows
  • Viterbi: sum of logs
  • Forward/Backward: rescaling at each position by multiplying by a constant
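The rescaling trick mentioned above can be sketched as follows (an illustration, not the canonical implementation): each alpha vector is renormalized to sum to one, and the log-likelihood is recovered from the product of the normalizers.

```python
import numpy as np

def forward_rescaled(x, pi, A, B):
    """Forward pass with per-position rescaling; returns normalized alphas and log P(x)."""
    T, M = len(x), len(pi)
    alpha_hat = np.zeros((T, M))
    c = np.zeros(T)                                  # rescaling constants
    a = pi * B[:, x[0]]
    c[0] = a.sum(); alpha_hat[0] = a / c[0]
    for t in range(1, T):
        a = B[:, x[t]] * (alpha_hat[t - 1] @ A)
        c[t] = a.sum(); alpha_hat[t] = a / c[t]      # rescale so each alpha_hat_t sums to 1
    return alpha_hat, np.log(c).sum()                # P(x) = prod_t c_t, so log P(x) = sum_t log c_t
```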
Learning HMM: two scenarios
• Supervised learning: estimation when the "right answer" is known
  • Examples:
    GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
• Unsupervised learning: estimation when the "right answer" is unknown
  • Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
• QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- Maximum likelihood (ML) estimation
Parameter sharing
• Consider a time-invariant (stationary) 1st-order Markov model
  [Figure: chain X1 → X2 → X3 → ... → XT, with parameters π and A]
  • Initial state probability vector: $\pi_k \stackrel{\mathrm{def}}{=} p(X_1^k = 1)$
  • State transition probability matrix: $A_{ij} \stackrel{\mathrm{def}}{=} p(X_t^j = 1 \mid X_{t-1}^i = 1)$
• The joint:
  $p(X_{1:T} \mid \theta) = p(x_1 \mid \pi) \prod_{t=2}^{T} p(X_t \mid X_{t-1})$
• The log-likelihood:
  $\ell(\theta; D) = \sum_n \log p(x_{n,1} \mid \pi) + \sum_n \sum_{t=2}^{T} \log p(x_{n,t} \mid x_{n,t-1}, A)$
• Again, we optimize each parameter separately
  • π is a multinomial frequency vector, and we've seen it before
  • What about A?
Learning a Markov chain transition matrix
• A is a stochastic matrix: $\sum_j A_{ij} = 1$
  • Each row of A is a multinomial distribution.
  • So the MLE of $A_{ij}$ is the fraction of transitions from i to j:
    $A_{ij}^{ML} = \dfrac{\#(i \to j)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=2}^{T} x_{n,t-1}^i x_{n,t}^j}{\sum_n \sum_{t=2}^{T} x_{n,t-1}^i}$
• Application:
  • if the states Xt represent words, this is called a bigram language model
• Sparse data problem:
  • If i → j did not occur in the data, we will have $A_{ij} = 0$; then any future sequence with the word pair i → j will have zero probability.
  • A standard hack: backoff smoothing or deleted interpolation,
    $\tilde{A}_{i \to \cdot} = \lambda \eta + (1 - \lambda) A_{i \to \cdot}^{ML}$
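A sketch of the count-and-normalize MLE with the interpolation smoothing mentioned above (illustrative; it assumes every state occurs at least once as a source, and eta defaults to a uniform back-off distribution).

```python
import numpy as np

def bigram_mle(sequences, M, lam=0.0, eta=None):
    """A_ij^ML = #(i->j) / #(i->.), optionally smoothed as lam * eta + (1 - lam) * A^ML."""
    counts = np.zeros((M, M))
    for s in sequences:
        for t in range(1, len(s)):
            counts[s[t - 1], s[t]] += 1                   # count transitions i -> j
    A_ml = counts / counts.sum(axis=1, keepdims=True)     # normalize each row
    if eta is None:
        eta = np.full(M, 1.0 / M)                         # back-off distribution (uniform by default)
    return lam * eta + (1 - lam) * A_ml
```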
Supervised ML estimation for “Hidden” MM
• Given x = x1...xN for which the true state path y = y1...yN is known,
• Define:
  Aij = # times state transition i → j occurs in y
  Bik = # times state i in y emits k in x
• We can show that the maximum likelihood parameters θ are:
  $a_{ij}^{ML} = \dfrac{\#(i \to j)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \dfrac{A_{ij}}{\sum_{j'} A_{ij'}}$
  $b_{ik}^{ML} = \dfrac{\#(i \to k)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=1}^{T} y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \dfrac{B_{ik}}{\sum_{k'} B_{ik'}}$
• What if x is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1 \ldots T,\ n = 1 \ldots N\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian ...
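The supervised estimates are just normalized counts; the sketch below (illustrative, with hypothetical argument names xs/ys for the labeled sequences) implements the two formulas above.

```python
import numpy as np

def supervised_mle(xs, ys, M, K):
    """MLE of HMM transition and emission matrices from fully labeled sequences."""
    A = np.zeros((M, M))      # A[i, j] = # times transition i -> j occurs in y
    B = np.zeros((M, K))      # B[i, k] = # times state i emits symbol k in x
    for x, y in zip(xs, ys):
        for t in range(len(x)):
            B[y[t], x[t]] += 1
            if t > 0:
                A[y[t - 1], y[t]] += 1
    a = A / A.sum(axis=1, keepdims=True)   # a_ij^ML = A_ij / sum_j' A_ij'
    b = B / B.sum(axis=1, keepdims=True)   # b_ik^ML = B_ik / sum_k' B_ik'
    return a, b
```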
Supervised ML estimation, ctd.
• Intuition:
  • When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data
• Drawback:
  • Given little data, there may be overfitting:
    P(x|θ) is maximized, but θ is unreasonable: 0 probabilities – VERY BAD
• Example:
  • Given 10 casino rolls, we observe
    x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
    y = F, F, F, F, F, F, F, F, F, F
  • Then: aFF = 1; aFL = 0
    bF1 = bF3 = bF6 = .2; bF2 = .3; bF4 = 0; bF5 = .1
Pseudocounts
• Solution for small training sets:
  • Add pseudocounts
    Aij = # times state transition i → j occurs in y + Rij
    Bik = # times state i in y emits k in x + Sik
  • Rij, Sik are pseudocounts representing our prior belief
  • Total pseudocounts: Ri = Σj Rij , Si = Σk Sik
    --- the "strength" of the prior belief,
    --- the total number of imaginary instances in the prior
  • Larger total pseudocounts ⇒ stronger prior belief
  • Small total pseudocounts: just to avoid 0 probabilities --- smoothing
• This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts
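Pseudocounts only change the initialization of the count matrices. A sketch (assuming, for simplicity, a single pseudocount value R for all transitions and S for all emissions rather than full R_ij, S_ik tables):

```python
import numpy as np

def smoothed_mle(xs, ys, M, K, R=1.0, S=1.0):
    """Supervised estimation with pseudocounts added to the raw counts before normalizing."""
    A = np.full((M, M), R)    # start every transition count at its pseudocount
    B = np.full((M, K), S)    # start every emission count at its pseudocount
    for x, y in zip(xs, ys):
        for t in range(len(x)):
            B[y[t], x[t]] += 1
            if t > 0:
                A[y[t - 1], y[t]] += 1
    return A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)
```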
Bayesian language model
• Global and local parameter independence
  [Figure: Bayesian Markov chain — X1 → X2 → ... → XT with parameters π, A_{i→·}, A_{i'→·} and Dirichlet hyperparameters]
• The posterior of A_{i→·} and A_{i'→·} is factorized despite the v-structure on Xt, because Xt-1 acts like a multiplexer
• Assign a Dirichlet prior βi to each row of the transition matrix:
  $A_{ij}^{Bayes} \stackrel{\mathrm{def}}{=} p(X_t^j = 1 \mid X_{t-1}^i = 1, D, \beta_i) = \dfrac{\#(i \to j) + \beta_{i,j}}{\#(i \to \cdot) + |\beta_i|} = \lambda_i \dfrac{\beta_{i,j}}{|\beta_i|} + (1 - \lambda_i) A_{ij}^{ML}$,
  where $|\beta_i| = \sum_j \beta_{i,j}$ and $\lambda_i = \dfrac{|\beta_i|}{\#(i \to \cdot) + |\beta_i|}$
• We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.)
Learning HMM: two scenarios
• Supervised learning: if only we knew the true state path, then ML parameter estimation would be trivial
  • E.g., recall that for a completely observed tabular BN:
    $\theta_{ijk}^{ML} = \dfrac{n_{ijk}}{\sum_{k'} n_{ijk'}}$
  • For the HMM, with $\{(x_{n,t}, y_{n,t}) : t = 1 \ldots T,\ n = 1 \ldots N\}$:
    $a_{ij}^{ML} = \dfrac{\#(i \to j)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i}$
    $b_{ik}^{ML} = \dfrac{\#(i \to k)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=1}^{T} y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i}$
  • What if y is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1 \ldots T,\ n = 1 \ldots N\}$ as N×T observations of, e.g., a GLIM, and apply the learning rules for a GLIM ...
• Unsupervised learning: when the true state path is unknown, we can fill in the missing values using inference recursions.
  • The Baum-Welch algorithm (i.e., EM)
  • Guaranteed to increase the log likelihood of the model after each iteration
  • Converges to a local optimum, depending on initial conditions
The Baum Welch algorithm
• The complete log likelihood
  $\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$
• The expected complete log likelihood
  $\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j} + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k}$
• EM
  • The E step
    $\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n)$
    $\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid x_n)$
  • The M step ("symbolically" identical to MLE)
    $\pi_i^{ML} = \dfrac{\sum_n \gamma_{n,1}^i}{N}$
    $a_{ij}^{ML} = \dfrac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$
    $b_{ik}^{ML} = \dfrac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$
    (compare with the supervised MLE counts: $a_{ij}^{ML} = \#(i \to j)/\#(i \to \cdot)$, $b_{ik}^{ML} = \#(i \to k)/\#(i \to \cdot)$)
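Putting the pieces together, one EM iteration accumulates the expected counts (E step) and renormalizes them (M step). The sketch below reuses the posteriors helper from the summary slide and is an illustration under those assumptions, not the course's reference implementation.

```python
import numpy as np

def baum_welch(xs, M, K, n_iter=50, seed=0):
    """EM (Baum-Welch) for a discrete HMM over observation sequences xs."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(M))
    A = rng.dirichlet(np.ones(M), size=M)
    B = rng.dirichlet(np.ones(K), size=M)
    for _ in range(n_iter):
        pi_acc, A_acc, B_acc = np.zeros(M), np.zeros((M, M)), np.zeros((M, K))
        for x in xs:
            gamma, xi = posteriors(x, pi, A, B)   # E step: gamma_{n,t}^i and xi_{n,t}^{i,j}
            pi_acc += gamma[0]
            A_acc += xi.sum(axis=0)
            for t, xt in enumerate(x):
                B_acc[:, xt] += gamma[t]
        pi = pi_acc / pi_acc.sum()                         # M step: normalized expected counts
        A = A_acc / A_acc.sum(axis=1, keepdims=True)       # ("symbolically" identical to the MLE)
        B = B_acc / B_acc.sum(axis=1, keepdims=True)
    return pi, A, B
```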
Conditional Random Fields
Shortcomings of Hidden Markov Model (1): locality of features
• HMM models capture dependences between each state and only its corresponding observation
  • NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between the learning objective function and the prediction objective function
  • HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
[Figure: HMM — each state Yi connected only to its own observation Xi]
Shortcomings of HMM (2): the Label bias problem
[Figure: state-transition diagram over states 1–5 and observations 1–4, annotated with the local (per-state normalized) transition probabilities]
What the local transition probabilities say:
• State 1 almost always prefers to go to state 2
• State 2 almost always prefers to stay in state 2
HMM: the Label bias problem
Probability of path 1 → 1 → 1 → 1:
• 0.4 × 0.45 × 0.5 = 0.09
HMM: the Label bias problem
Probability of path 2 → 2 → 2 → 2:
• 0.2 × 0.3 × 0.3 = 0.018
Other paths: 1 → 1 → 1 → 1: 0.09
HMM: the Label bias problem
Probability of path 1 → 2 → 1 → 2:
• 0.6 × 0.2 × 0.5 = 0.06
Other paths: 1 → 1 → 1 → 1: 0.09;  2 → 2 → 2 → 2: 0.018
HMM: the Label bias problem
Probability of path 1 → 1 → 2 → 2:
• 0.4 × 0.55 × 0.3 = 0.066
Other paths: 1 → 1 → 1 → 1: 0.09;  2 → 2 → 2 → 2: 0.018;  1 → 2 → 1 → 2: 0.06
HMM: the Label bias problem
Most likely path: 1 → 1 → 1 → 1
• Although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2.
• Why?
HMM: the Label bias problem
Most likely path: 1 → 1 → 1 → 1
• State 1 has only two outgoing transitions, but state 2 has five:
• the average transition probability from state 2 is lower
HMM: the Label bias problem
Label bias problem in HMM:
• Preference for states with a lower number of outgoing transitions over others
Solution: Do not normalize probabilities locally
From local probabilities ...
[Figure: the same transition diagram, with locally normalized probabilities]
Solution: Do not normalize probabilities locally
From local probabilities to local potentials
[Figure: the same diagram, with unnormalized local potentials (e.g., 20, 30, 10, ...) in place of probabilities]
• States with fewer outgoing transitions do not have an unfair advantage!
From HMM to CRF
• CRF is a partially directed model
  • Discriminative model, unlike HMM
  • Usage of a global normalizer Z(x) overcomes the label bias problem of HMM
  • Models the dependence between each state and the entire observation sequence
[Figure: HMM — joint model $P(X, Y) = p(y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}) \prod_{t=1}^{T} p(x_t \mid y_t)$ vs. CRF — states Y1, ..., Yn each connected to the entire observation x1:n]
Conditional Random Fields
• General parametric form:
  $P(\mathbf{y} \mid \mathbf{x}) = \dfrac{1}{Z(\mathbf{x}, \lambda, \mu)} \exp\Big( \sum_t \big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_l \mu_l g_l(y_t, \mathbf{x}) \big) \Big)$
[Figure: chain CRF — states Y1, Y2, ..., Yn connected to the entire observation x1:n]
CRFs: Inference
• Given CRF parameters λ and µ, find the y* that maximizes P(y|x)
  • Can ignore Z(x) because it is not a function of y
• Run the max-product algorithm on the junction tree of the CRF:
  [Figure: junction tree of the chain CRF — cliques (Y1,Y2), (Y2,Y3), ..., (Yn-1,Yn) with separators Y2, Y3, ..., Yn-1]
• Same as Viterbi decoding used in HMMs!
CRF learning
• Given $\{(\mathbf{x}_d, \mathbf{y}_d)\}_{d=1}^{N}$, find λ*, µ* that maximize the conditional log likelihood $L(\lambda, \mu) = \sum_d \log P(\mathbf{y}_d \mid \mathbf{x}_d, \lambda, \mu)$
• Computing the gradient w.r.t. λ: the gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.
CRF learning
• Computing the model expectations:
  • Requires an exponentially large number of summations: is it intractable?
• Tractable!
  • Can compute marginals using the sum-product algorithm on the chain
  • Expectation of f over the corresponding marginal probability of neighboring nodes!
CRF learning
• Computing marginals using junction-tree calibration:
  [Figure: junction tree of the chain CRF — cliques (Y1,Y2), (Y2,Y3), ..., (Yn-1,Yn) with separators Y2, Y3, ..., Yn-1]
• Junction tree initialization:
• After calibration:
• Also called the forward-backward algorithm
CRF learning
• Computing feature expectations using calibrated potentials:
• Now we know how to compute ∇λL(λ, µ):
• Learning can now be done using gradient ascent:
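A schematic sketch of the gradient-ascent loop (illustrative only): the callables feature_counts and expected_feature_counts are hypothetical placeholders for the empirical feature sums and for the model expectations obtained from the forward-backward marginals.

```python
import numpy as np

def crf_gradient_ascent(lam, data, feature_counts, expected_feature_counts,
                        lr=0.1, sigma2=10.0, n_iter=100):
    """Gradient ascent on the CRF conditional log likelihood with a Gaussian regularizer."""
    for _ in range(n_iter):
        grad = -lam / sigma2                                   # gradient of the Gaussian regularizer
        for x, y in data:
            # empirical feature counts minus model expected counts under P(y | x, lam)
            grad += feature_counts(x, y) - expected_feature_counts(x, lam)
        lam = lam + lr * grad
    return lam
```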
CRF learning
• In practice, we use a Gaussian regularizer for the parameter vector to improve generalization
• In practice, gradient ascent has very slow convergence
• Alternatives:
  • Conjugate gradient method
  • Limited-memory quasi-Newton methods
CRFs: some empirical results
• Comparison of error rates on synthetic data
  [Figure: scatter plot of CRF error vs. HMM error; data is increasingly higher order in the direction of the arrow]
• CRFs achieve the lowest error rate for higher-order data
CRFs: some empirical results
• Parts-of-speech tagging
• Using the same set of features: HMM >=< CRF > MEMM
• Using additional overlapping features: CRF+ > MEMM+ >> HMM
Supplementary
Other CRFs
• So far we have discussed only 1-dimensional chain CRFs
  • Inference and learning: exact
• We could also have CRFs for arbitrary graph structure
  • E.g., grid CRFs
  • Inference and learning no longer tractable
  • Approximate techniques used:
    • MCMC sampling
    • Variational inference
    • Loopy belief propagation
• We will discuss these techniques soon
Applications of CRF in Vision
• Image segmentation
• Stereo matching
• Image restoration
Application: Image Segmentation
Application: Handwriting Recognition
Application: Pose Estimation
Feature Functions for CRF in Vision
Case Study: Image Segmentation
• Image segmentation (FG/BG) by modeling interactions between random variables
  • Images are noisy.
  • Objects occupy continuous regions in an image.
  [Figure: input image; pixel-wise separate optimal labeling; locally-consistent joint optimal labeling]
  [Nowozin, Lampert 2012]
  $Y^{*} = \arg\max_{y \in \{0,1\}^n} \Big( \sum_{i \in S} V_i(y_i, X) + \sum_{i \in S} \sum_{j \in N_i} V_{i,j}(y_i, y_j) \Big)$
  where the first sum is the unary term and the second the pairwise term, and
  Y: labels; X: data (features); S: pixels; Ni: neighbors of pixel i
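The objective above is just a sum of unary and pairwise terms over pixels; a sketch that scores one candidate labeling is shown below (illustrative: unary, pairwise, and neighbors are hypothetical user-supplied potentials and neighborhood structure, and maximizing this score over all labelings is the hard part).

```python
def segmentation_score(Y, X, unary, pairwise, neighbors):
    """Score sum_i V_i(y_i, X) + sum_i sum_{j in N_i} V_{i,j}(y_i, y_j) of a labeling Y."""
    score = 0.0
    for i, yi in enumerate(Y):
        score += unary(i, yi, X)          # unary term V_i(y_i, X)
        for j in neighbors[i]:
            score += pairwise(yi, Y[j])   # pairwise term V_{i,j}(y_i, y_j)
    return score
```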
Discriminative Random Fields
• A special type of CRF
  • The unary and pairwise potentials are designed using local discriminative classifiers.
• Posterior
  $P(Y \mid X) = \dfrac{1}{Z} \exp\Big( \sum_{i \in S} A_i(y_i, X) + \sum_{i \in S} \sum_{j \in N_i} I_{ij}(y_i, y_j, X) \Big)$
  (association and interaction terms)
• Association potential
  • Local discriminative model for site i, using a logistic link with a GLM:
    $A_i(y_i, X) = \log P(y_i \mid f_i(X))$,  $P(y_i = 1 \mid f_i(X)) = \dfrac{1}{1 + \exp(-w^{T} f_i(X))} = \sigma(w^{T} f_i(X))$
• Interaction potential
  • A measure of how likely sites i and j are to have the same label given X:
    $I_{ij}(y_i, y_j, X) = k\, y_i y_j + (1 - k)\big(2\sigma(y_i y_j \mu_{ij}(X)) - 1\big)$
    (1) data-independent smoothing term; (2) data-dependent pairwise logistic function
S. Kumar and M. Hebert. Discriminative Random Fields. IJCV, 2006.
DRF Results
• Task: detecting man-made structure in natural scenes.
  • Each image is divided into non-overlapping 16x16 tile blocks.
• An example
  [Figure: input image and Logistic, MRF, DRF results]
  • Logistic: no smoothness in the labels
  • MRF: smoothed false positives; lack of neighborhood interaction of the data
S. Kumar and M. Hebert. Discriminative Random Fields. IJCV, 2006.
Multiscale Conditional Random Fields
• Considering features at different scales:
  • Local features (site)
  • Regional label features (small patch)
  • Global label features (big patch or the whole image)
• The conditional probability P(L|X) is formulated from features at different scales:
  $P(L \mid X) = \dfrac{1}{Z} \prod_{s} P_s(L \mid X)$, where $Z = \sum_{L} \prod_{s} P_s(L \mid X)$
He, X. et al.: Multiscale conditional random fields for image labeling. CVPR 2004
Multiscale Conditional Random Fields
He, X. et al.: Multiscale conditional random fields for image labeling. CVPR 2004
[Figure: local features, regional label features, global label features]
mCRF Results
He, X. et al.: Multiscale conditional random fields for image labeling. CVPR 2004
Topic Random Fields
• Spatial MRF over topic assignments
Zhao, B. et al.: Topic random fields for image segmentation. ECCV 2010
TRF Results
Zhao, B. et al.: Topic random fields for image segmentation. ECCV 2010
[Figure: Spatial LDA vs. Topic Random Fields]
Summary
• Conditional Random Fields are partially directed discriminative models
  • They overcome the label bias problem of HMM by using a global normalizer
• Inference for 1-D chain CRFs is exact
  • Same as max-product or Viterbi decoding
• Learning also is exact
  • Globally optimum parameters can be learned
  • Requires using the sum-product or forward-backward algorithm
• CRFs involving arbitrary graph structure are intractable in general
  • E.g., grid CRFs
  • Inference and learning require approximation techniques
    • MCMC sampling
    • Variational methods
    • Loopy BP