Probabilistic Graphical Models
Case Studies: HMM and CRF
Eric Xing
Lecture 6, February 3, 2020
Reading: see class homepage
Hidden Markov Model: from static to dynamic mixture models
Dynamic mixture
[Figure: hidden Markov chain Y1 → Y2 → Y3 → ... → YT with emissions X1, X2, X3, ..., XT]
Static mixture
[Figure: a single hidden variable Y1 emitting X1, replicated over N i.i.d. samples (plate)]
Example
• Speech recognition
Applications of HMMs
• Some early applications of HMMs
  • finance, but we never saw them
  • speech recognition
  • modelling ion channels
• In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.
• Some current applications of HMMs to biology
  • mapping chromosomes
  • aligning biological sequences
  • predicting sequence structure
  • inferring evolutionary relationships
  • finding genes in DNA sequence
Definition (of HMM)
• Observation space
  Alphabetic set: $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$
  Euclidean space: $\mathbb{R}^d$
• Index set of hidden states: $\mathcal{I} = \{1, 2, \ldots, M\}$
• Transition probabilities between any two states:
  $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$,
  or $p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in \mathcal{I}$
• Start probabilities: $p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$
• Emission probabilities associated with each state:
  $p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in \mathcal{I}$,
  or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in \mathcal{I}$
[Figure: HMM graphical model — hidden chain $y_1 \to y_2 \to \cdots \to y_T$ with emissions $x_1, x_2, \ldots, x_T$]
Probability of a parse
• Given a sequence x = x1 ... xT and a parse y = y1, ..., yT,
• To find how likely the parse is (given our HMM and the sequence):
  p(x, y) = p(x1 ... xT, y1, ..., yT)   (joint probability)
          = p(y1) p(x1 | y1) p(y2 | y1) p(x2 | y2) ... p(yT | yT-1) p(xT | yT)
          = p(y1) p(y2 | y1) ... p(yT | yT-1) × p(x1 | y1) p(x2 | y2) ... p(xT | yT)
          = p(y1, ..., yT) p(x1 ... xT | y1, ..., yT)
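The factorization above can be turned directly into code. Below is a minimal sketch (not from the lecture) that evaluates p(x, y) for a small discrete HMM; the parameter names pi, A, B are assumptions standing in for the start, transition, and emission probabilities defined earlier.

```python
import numpy as np

def parse_probability(x, y, pi, A, B):
    """Joint probability p(x, y) = p(y1) p(x1|y1) prod_t p(yt|yt-1) p(xt|yt)."""
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

# Toy example with M = 2 hidden states and K = 2 symbols (made-up numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(parse_probability([0, 1, 1], [0, 0, 1], pi, A, B))
```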
Variable Elimination on Hidden Markov Model
p(x, y) = p(x1 ... xT, y1, ..., yT) = p(y1) p(x1 | y1) p(y2 | y1) p(x2 | y2) ... p(yT | yT-1) p(xT | yT)
Conditional probability: $p(y_t \mid x_1, \ldots, x_T)$, obtained by summing out (eliminating) the remaining hidden variables one at a time.
The Forward Algorithm
• We want to calculate P(x), the likelihood of x, given the HMM
• Sum over all possible ways of generating x:
  $P(\mathbf{x}) = \sum_{y_1}\sum_{y_2}\cdots\sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1},y_t} \prod_{t=1}^{T} p(x_t \mid y_t)$
• To avoid summing over an exponential number of paths y, define
  $\alpha_t^k \stackrel{\mathrm{def}}{=} \alpha_t(k) = P(x_1, \ldots, x_t, y_t^k = 1)$   (the forward probability)
• The recursion:
  $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
• Termination: $P(\mathbf{x}) = \sum_k \alpha_T^k$
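As a hedged illustration of the recursion above (a sketch, not the course's reference code), the forward pass can be written with one vector of alphas per position; pi, A, B follow the same assumed convention as the previous sketch.

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, k] = P(x_1..x_t, y_t = k); returns alphas and the likelihood P(x)."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]                      # base case: alpha_1^k = pi_k p(x_1 | y_1 = k)
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)  # alpha_t^k = p(x_t | y_t = k) sum_i alpha_{t-1}^i a_{i,k}
    return alpha, alpha[-1].sum()                   # P(x) = sum_k alpha_T^k
```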
The Backward Algorithm
• We want to compute $P(y_t^k = 1 \mid \mathbf{x})$, the posterior probability distribution on the t-th position, given x
• We start by computing
  $P(y_t^k = 1, \mathbf{x}) = P(x_1, \ldots, x_t, y_t^k = 1, x_{t+1}, \ldots, x_T)$
  $\qquad = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t, y_t^k = 1)$
  $\qquad = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$
  Forward: $\alpha_t^k$;  Backward: $\beta_t^k \stackrel{\mathrm{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$
• The recursion:
  $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
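A matching sketch of the backward recursion, under the same assumed parameter convention (A transitions, B emissions); beta_T is initialized to 1 since no observations remain after position T.

```python
import numpy as np

def backward(x, A, B):
    """beta[t, k] = P(x_{t+1}..x_T | y_t = k)."""
    T, M = len(x), A.shape[0]
    beta = np.ones((T, M))                            # base case: beta_T^k = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])  # beta_t^k = sum_i a_{k,i} p(x_{t+1} | y_{t+1}=i) beta_{t+1}^i
    return beta
```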
The junction tree algorithm: message passing for HMM
• A junction tree for the HMM
  [Figure: clique chain ψ(y1, x1) – ψ(y1, y2) – ψ(y2, y3) – ... – ψ(yT-1, yT), with evidence cliques ψ(y2, x2), ψ(y3, x3), ..., ψ(yT, xT) attached and separators over y2, y3, ..., yT]
• Rightward pass:
  $\mu_{t \to t+1}(y_{t+1}) = \sum_{y_t} \psi(y_t, y_{t+1})\, \mu_{t-1 \to t}(y_t)$
  $\qquad = \sum_{y_t} p(y_{t+1} \mid y_t)\, p(x_{t+1} \mid y_{t+1})\, \mu_{t-1 \to t}(y_t)$
  $\qquad = p(x_{t+1} \mid y_{t+1}) \sum_{y_t} a_{y_t, y_{t+1}}\, \mu_{t-1 \to t}(y_t)$
• This is exactly the forward algorithm!
• Leftward pass:
  $\mu_{t-1 \leftarrow t}(y_t) = \sum_{y_{t+1}} \psi(y_t, y_{t+1})\, \mu_{t \leftarrow t+1}(y_{t+1})$
  $\qquad = \sum_{y_{t+1}} p(y_{t+1} \mid y_t)\, p(x_{t+1} \mid y_{t+1})\, \mu_{t \leftarrow t+1}(y_{t+1})$
• This is exactly the backward algorithm!
Summary
• Forward algorithm:
  $\alpha_t^k \stackrel{\mathrm{def}}{=} \mu_{t-1 \to t}(k) = P(x_1, \ldots, x_t, y_t^k = 1)$
  $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
• Backward algorithm:
  $\beta_t^k \stackrel{\mathrm{def}}{=} \mu_{t-1 \leftarrow t}(k) = P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$
  $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
• Pairwise posterior:
  $\xi_t^{i,j} \stackrel{\mathrm{def}}{=} p(y_t^i = 1, y_{t+1}^j = 1 \mid x_{1:T}) \propto \alpha_t^i\, a_{i,j}\, p(x_{t+1} \mid y_{t+1}^j = 1)\, \beta_{t+1}^j$
• Singleton posterior:
  $\gamma_t^i \stackrel{\mathrm{def}}{=} p(y_t^i = 1 \mid x_{1:T}) \propto \alpha_t^i \beta_t^i = \sum_j \xi_t^{i,j}$
• The matrix-vector form: with $B_t(i) \stackrel{\mathrm{def}}{=} p(x_t \mid y_t^i = 1)$ and $A(i,j) \stackrel{\mathrm{def}}{=} p(y_{t+1}^j = 1 \mid y_t^i = 1)$,
  $\alpha_t = (A^{\top}\alpha_{t-1}) \circ B_t$
  $\beta_t = A(\beta_{t+1} \circ B_{t+1})$
  $\xi_t = \big(\alpha_t (\beta_{t+1} \circ B_{t+1})^{\top}\big) \circ A$
  $\gamma_t = \alpha_t \circ \beta_t$
  (where $\circ$ denotes the element-wise product)
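Combining the two passes gives the singleton and pairwise posteriors. The sketch below is an illustration (reusing the forward and backward functions from the earlier sketches, not the course's reference code) of gamma and xi in the vectorized form described above.

```python
import numpy as np

def posteriors(x, pi, A, B):
    """gamma[t, i] = P(y_t = i | x) and xi[t, i, j] = P(y_t = i, y_{t+1} = j | x)."""
    alpha, px = forward(x, pi, A, B)
    beta = backward(x, A, B)
    gamma = alpha * beta / px                                 # gamma_t = alpha_t * beta_t / P(x)
    T, M = len(x), len(pi)
    xi = np.zeros((T - 1, M, M))
    for t in range(T - 1):
        # xi_t[i, j] = alpha_t^i a_{i,j} p(x_{t+1} | y_{t+1}=j) beta_{t+1}^j / P(x)
        xi[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :] / px
    return gamma, xi
```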
Posterior decoding
• We can now calculate
  $P(y_t^k = 1 \mid \mathbf{x}) = \dfrac{P(y_t^k = 1, \mathbf{x})}{P(\mathbf{x})} = \dfrac{\alpha_t^k \beta_t^k}{P(\mathbf{x})}$
• Then, we can ask
  • What is the most likely state at position t of sequence x:
    $k^{*}_t = \arg\max_k P(y_t^k = 1 \mid \mathbf{x})$
• Note that this is the MPA of a single hidden state; what if we want the MPA of a whole hidden state sequence?
• Posterior decoding:
  $\{\, y_t^{k^{*}_t} = 1 : t = 1 \ldots T \,\}$
• This is different from the MPA of a whole sequence of hidden states
• This can be understood as bit error rate vs. word error rate
  Example: MPA of X? MPA of (X, Y)?
    x  y  P(x,y)
    0  0  0.35
    0  1  0.05
    1  0  0.3
    1  1  0.3
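Posterior decoding then reduces to an argmax over gamma at each position; this short sketch assumes the posteriors helper from the summary above.

```python
def posterior_decode(x, pi, A, B):
    """Pick the individually most likely state at every position (bit-error-rate decoding)."""
    gamma, _ = posteriors(x, pi, A, B)   # gamma[t, k] = P(y_t = k | x)
    return gamma.argmax(axis=1)          # k*_t = argmax_k P(y_t = k | x)
```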
Viterbi decoding
• GIVEN x = x1, ..., xT, we want to find y = y1, ..., yT, such that P(y|x) is maximized:
  y* = argmax_y P(y|x) = argmax_y P(y, x)
• Let
  $V_t^k = \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)$
  = probability of the most likely sequence of states ending at state $y_t = k$
• The recursion:
  $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
  [Figure: Viterbi trellis — states 1, 2, ..., K across positions x1, x2, x3, ..., xN]
• Underflows are a significant problem:
  $p(x_1, \ldots, x_t, y_1, \ldots, y_t) = \pi_{y_1} a_{y_1,y_2} \cdots a_{y_{t-1},y_t}\, b_{y_1,x_1} \cdots b_{y_t,x_t}$
  These numbers become extremely small – underflow
• Solution: take the logs of all values:
  $V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \left( \log a_{i,k} + V_{t-1}^i \right)$
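A log-space sketch of the Viterbi recursion (illustrative only; pi, A, B as in the earlier sketches), including the back-pointers needed to recover the best path.

```python
import numpy as np

def viterbi(x, pi, A, B):
    """argmax_y P(y | x), computed in log space to avoid underflow."""
    T, M = len(x), len(pi)
    logpi, logA, logB = np.log(pi), np.log(A), np.log(B)
    V = np.zeros((T, M))                   # V[t, k] = log prob of the best path ending in state k at t
    ptr = np.zeros((T, M), dtype=int)      # back-pointers
    V[0] = logpi + logB[:, x[0]]
    for t in range(1, T):
        scores = V[t - 1][:, None] + logA  # scores[i, k] = V_{t-1}^i + log a_{i,k}
        ptr[t] = scores.argmax(axis=0)
        V[t] = logB[:, x[t]] + scores.max(axis=0)
    path = [int(V[-1].argmax())]           # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(ptr[t, path[-1]]))
    return path[::-1]
```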
The Viterbi Algorithm – derivation
• Define the Viterbi probability:
  $V_{t+1}^k = \max_{\{y_1, \ldots, y_t\}} P(x_1, \ldots, x_t, y_1, \ldots, y_t, x_{t+1}, y_{t+1}^k = 1)$
  $\quad = \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid x_1, \ldots, x_t, y_1, \ldots, y_t)\, P(x_1, \ldots, x_t, y_1, \ldots, y_t)$
  $\quad = \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid y_t)\, P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t)$
  $\quad = \max_i P(x_{t+1}, y_{t+1}^k = 1 \mid y_t^i = 1) \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^i = 1)$
  $\quad = \max_i p(x_{t+1} \mid y_{t+1}^k = 1)\, a_{i,k}\, V_t^i$
  $\quad = p(x_{t+1} \mid y_{t+1}^k = 1) \max_i a_{i,k}\, V_t^i$
Computational Complexity and implementation details
• What are the running time and space required for Forward and Backward?
  $\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$
  $\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$
  $V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$
  Time: O(K²N); Space: O(KN).
• Useful implementation techniques to avoid underflows
  • Viterbi: sum of logs
  • Forward/Backward: rescaling at each position by multiplying by a constant
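The rescaling trick mentioned above can be sketched as follows (an illustration, not the canonical implementation): each alpha vector is renormalized to sum to one, and the log-likelihood is recovered from the product of the normalizers.

```python
import numpy as np

def forward_rescaled(x, pi, A, B):
    """Forward pass with per-position rescaling; returns normalized alphas and log P(x)."""
    T, M = len(x), len(pi)
    alpha_hat = np.zeros((T, M))
    c = np.zeros(T)                                  # rescaling constants
    a = pi * B[:, x[0]]
    c[0] = a.sum(); alpha_hat[0] = a / c[0]
    for t in range(1, T):
        a = B[:, x[t]] * (alpha_hat[t - 1] @ A)
        c[t] = a.sum(); alpha_hat[t] = a / c[t]      # rescale so each alpha_hat_t sums to 1
    return alpha_hat, np.log(c).sum()                # P(x) = prod_t c_t, so log P(x) = sum_t log c_t
```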
Learning HMM: two scenarios
• Supervised learning: estimation when the "right answer" is known
  • Examples:
    GIVEN: a genomic region x = x1...x1,000,000 where we have good (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
• Unsupervised learning: estimation when the "right answer" is unknown
  • Examples:
    GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
• QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- Maximum likelihood (ML) estimation
Parameter sharing
• Consider a time-invariant (stationary) 1st-order Markov model
  [Figure: chain X1 → X2 → X3 → ... → XT, with parameters π and A]
  • Initial state probability vector: $\pi_k \stackrel{\mathrm{def}}{=} p(X_1^k = 1)$
  • State transition probability matrix: $A_{ij} \stackrel{\mathrm{def}}{=} p(X_t^j = 1 \mid X_{t-1}^i = 1)$
• The joint:
  $p(X_{1:T} \mid \theta) = p(x_1 \mid \pi) \prod_{t=2}^{T} p(X_t \mid X_{t-1})$
• The log-likelihood:
  $\ell(\theta; D) = \sum_n \log p(x_{n,1} \mid \pi) + \sum_n \sum_{t=2}^{T} \log p(x_{n,t} \mid x_{n,t-1}, A)$
• Again, we optimize each parameter separately
  • π is a multinomial frequency vector, and we've seen it before
  • What about A?
Learning a Markov chain transition matrix
• A is a stochastic matrix: $\sum_j A_{ij} = 1$
  • Each row of A is a multinomial distribution.
  • So the MLE of $A_{ij}$ is the fraction of transitions from i to j:
    $A_{ij}^{ML} = \dfrac{\#(i \to j)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=2}^{T} x_{n,t-1}^i x_{n,t}^j}{\sum_n \sum_{t=2}^{T} x_{n,t-1}^i}$
• Application:
  • if the states Xt represent words, this is called a bigram language model
• Sparse data problem:
  • If i → j did not occur in the data, we will have $A_{ij} = 0$; then any future sequence with the word pair i → j will have zero probability.
  • A standard hack: backoff smoothing or deleted interpolation,
    $\tilde{A}_{i \to \cdot} = \lambda \eta + (1 - \lambda) A_{i \to \cdot}^{ML}$
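A sketch of the count-and-normalize MLE with the interpolation smoothing mentioned above (illustrative; it assumes every state occurs at least once as a source, and eta defaults to a uniform back-off distribution).

```python
import numpy as np

def bigram_mle(sequences, M, lam=0.0, eta=None):
    """A_ij^ML = #(i->j) / #(i->.), optionally smoothed as lam * eta + (1 - lam) * A^ML."""
    counts = np.zeros((M, M))
    for s in sequences:
        for t in range(1, len(s)):
            counts[s[t - 1], s[t]] += 1                   # count transitions i -> j
    A_ml = counts / counts.sum(axis=1, keepdims=True)     # normalize each row
    if eta is None:
        eta = np.full(M, 1.0 / M)                         # back-off distribution (uniform by default)
    return lam * eta + (1 - lam) * A_ml
```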
Supervised ML estimation for “Hidden” MM
• Given x = x1...xN for which the true state path y = y1...yN is known,
• Define:
  Aij = # times state transition i → j occurs in y
  Bik = # times state i in y emits k in x
• We can show that the maximum likelihood parameters θ are:
  $a_{ij}^{ML} = \dfrac{\#(i \to j)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \dfrac{A_{ij}}{\sum_{j'} A_{ij'}}$
  $b_{ik}^{ML} = \dfrac{\#(i \to k)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=1}^{T} y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \dfrac{B_{ik}}{\sum_{k'} B_{ik'}}$
• What if x is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1 \ldots T,\ n = 1 \ldots N\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for a Gaussian ...
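The supervised estimates are just normalized counts; the sketch below (illustrative, with hypothetical argument names xs/ys for the labeled sequences) implements the two formulas above.

```python
import numpy as np

def supervised_mle(xs, ys, M, K):
    """MLE of HMM transition and emission matrices from fully labeled sequences."""
    A = np.zeros((M, M))      # A[i, j] = # times transition i -> j occurs in y
    B = np.zeros((M, K))      # B[i, k] = # times state i emits symbol k in x
    for x, y in zip(xs, ys):
        for t in range(len(x)):
            B[y[t], x[t]] += 1
            if t > 0:
                A[y[t - 1], y[t]] += 1
    a = A / A.sum(axis=1, keepdims=True)   # a_ij^ML = A_ij / sum_j' A_ij'
    b = B / B.sum(axis=1, keepdims=True)   # b_ik^ML = B_ik / sum_k' B_ik'
    return a, b
```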
Supervised ML estimation, ctd.
• Intuition:
  • When we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data
• Drawback:
  • Given little data, there may be overfitting:
    P(x|θ) is maximized, but θ is unreasonable: 0 probabilities – VERY BAD
• Example:
  • Given 10 casino rolls, we observe
    x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
    y = F, F, F, F, F, F, F, F, F, F
  • Then: aFF = 1; aFL = 0
    bF1 = bF3 = bF6 = .2; bF2 = .3; bF4 = 0; bF5 = .1
Pseudocounts
• Solution for small training sets:
  • Add pseudocounts
    Aij = # times state transition i → j occurs in y + Rij
    Bik = # times state i in y emits k in x + Sik
  • Rij, Sik are pseudocounts representing our prior belief
  • Total pseudocounts: Ri = Σj Rij , Si = Σk Sik
    --- the "strength" of the prior belief,
    --- the total number of imaginary instances in the prior
  • Larger total pseudocounts ⇒ stronger prior belief
  • Small total pseudocounts: just to avoid 0 probabilities --- smoothing
• This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts
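Pseudocounts only change the initialization of the count matrices. A sketch (assuming, for simplicity, a single pseudocount value R for all transitions and S for all emissions rather than full R_ij, S_ik tables):

```python
import numpy as np

def smoothed_mle(xs, ys, M, K, R=1.0, S=1.0):
    """Supervised estimation with pseudocounts added to the raw counts before normalizing."""
    A = np.full((M, M), R)    # start every transition count at its pseudocount
    B = np.full((M, K), S)    # start every emission count at its pseudocount
    for x, y in zip(xs, ys):
        for t in range(len(x)):
            B[y[t], x[t]] += 1
            if t > 0:
                A[y[t - 1], y[t]] += 1
    return A / A.sum(axis=1, keepdims=True), B / B.sum(axis=1, keepdims=True)
```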
Bayesian language model
• Global and local parameter independence
  [Figure: Bayesian Markov chain — X1 → X2 → ... → XT with parameters π, A_{i→·}, A_{i'→·} and Dirichlet hyperparameters]
• The posterior of A_{i→·} and A_{i'→·} is factorized despite the v-structure on Xt, because Xt-1 acts like a multiplexer
• Assign a Dirichlet prior βi to each row of the transition matrix:
  $A_{ij}^{Bayes} \stackrel{\mathrm{def}}{=} p(X_t^j = 1 \mid X_{t-1}^i = 1, D, \beta_i) = \dfrac{\#(i \to j) + \beta_{i,j}}{\#(i \to \cdot) + |\beta_i|} = \lambda_i \dfrac{\beta_{i,j}}{|\beta_i|} + (1 - \lambda_i) A_{ij}^{ML}$,
  where $|\beta_i| = \sum_j \beta_{i,j}$ and $\lambda_i = \dfrac{|\beta_i|}{\#(i \to \cdot) + |\beta_i|}$
• We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.)
Learning HMM: two scenarios
• Supervised learning: if only we knew the true state path, then ML parameter estimation would be trivial
  • E.g., recall that for a completely observed tabular BN:
    $\theta_{ijk}^{ML} = \dfrac{n_{ijk}}{\sum_{k'} n_{ijk'}}$
  • For the HMM, with $\{(x_{n,t}, y_{n,t}) : t = 1 \ldots T,\ n = 1 \ldots N\}$:
    $a_{ij}^{ML} = \dfrac{\#(i \to j)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i}$
    $b_{ik}^{ML} = \dfrac{\#(i \to k)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=1}^{T} y_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i}$
  • What if y is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1 \ldots T,\ n = 1 \ldots N\}$ as N×T observations of, e.g., a GLIM, and apply the learning rules for a GLIM ...
• Unsupervised learning: when the true state path is unknown, we can fill in the missing values using inference recursions.
  • The Baum-Welch algorithm (i.e., EM)
  • Guaranteed to increase the log likelihood of the model after each iteration
  • Converges to a local optimum, depending on initial conditions
The Baum Welch algorithm
• The complete log likelihood
  $\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$
• The expected complete log likelihood
  $\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j} + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k}$
• EM
  • The E step
    $\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n)$
    $\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid x_n)$
  • The M step ("symbolically" identical to MLE)
    $\pi_i^{ML} = \dfrac{\sum_n \gamma_{n,1}^i}{N}$
    $a_{ij}^{ML} = \dfrac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$
    $b_{ik}^{ML} = \dfrac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$
    (compare with the supervised MLE counts: $a_{ij}^{ML} = \#(i \to j)/\#(i \to \cdot)$, $b_{ik}^{ML} = \#(i \to k)/\#(i \to \cdot)$)
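Putting the pieces together, one EM iteration accumulates the expected counts (E step) and renormalizes them (M step). The sketch below reuses the posteriors helper from the summary slide and is an illustration under those assumptions, not the course's reference implementation.

```python
import numpy as np

def baum_welch(xs, M, K, n_iter=50, seed=0):
    """EM (Baum-Welch) for a discrete HMM over observation sequences xs."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(M))
    A = rng.dirichlet(np.ones(M), size=M)
    B = rng.dirichlet(np.ones(K), size=M)
    for _ in range(n_iter):
        pi_acc, A_acc, B_acc = np.zeros(M), np.zeros((M, M)), np.zeros((M, K))
        for x in xs:
            gamma, xi = posteriors(x, pi, A, B)   # E step: gamma_{n,t}^i and xi_{n,t}^{i,j}
            pi_acc += gamma[0]
            A_acc += xi.sum(axis=0)
            for t, xt in enumerate(x):
                B_acc[:, xt] += gamma[t]
        pi = pi_acc / pi_acc.sum()                         # M step: normalized expected counts
        A = A_acc / A_acc.sum(axis=1, keepdims=True)       # ("symbolically" identical to the MLE)
        B = B_acc / B_acc.sum(axis=1, keepdims=True)
    return pi, A, B
```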
Conditional Random Fields
Shortcomings of Hidden Markov Model (1): locality of features
• HMM models capture dependences between each state and only its corresponding observation
  • NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on (non-local) features of the whole line such as line length, indentation, amount of white space, etc.
• Mismatch between the learning objective function and the prediction objective function
  • HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
[Figure: HMM — each state Yi connected only to its own observation Xi]
Shortcomings of HMM (2): the Label bias problem
[Figure: state-transition diagram over states 1–5 and observations 1–4, annotated with the local (per-state normalized) transition probabilities]
What the local transition probabilities say:
• State 1 almost always prefers to go to state 2
• State 2 almost always prefers to stay in state 2
HMM: the Label bias problem
Probability of path 1 → 1 → 1 → 1:
• 0.4 × 0.45 × 0.5 = 0.09
HMM: the Label bias problem
Probability of path 2 → 2 → 2 → 2:
• 0.2 × 0.3 × 0.3 = 0.018
Other paths: 1 → 1 → 1 → 1: 0.09
HMM: the Label bias problem
Probability of path 1 → 2 → 1 → 2:
• 0.6 × 0.2 × 0.5 = 0.06
Other paths: 1 → 1 → 1 → 1: 0.09;  2 → 2 → 2 → 2: 0.018
HMM: the Label bias problem
Probability of path 1 → 1 → 2 → 2:
• 0.4 × 0.55 × 0.3 = 0.066
Other paths: 1 → 1 → 1 → 1: 0.09;  2 → 2 → 2 → 2: 0.018;  1 → 2 → 1 → 2: 0.06
HMM: the Label bias problem
Most likely path: 1 → 1 → 1 → 1
• Although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2.
• Why?
HMM: the Label bias problem
Most likely path: 1 → 1 → 1 → 1
• State 1 has only two outgoing transitions, but state 2 has five:
• the average transition probability from state 2 is lower
HMM: the Label bias problem
Label bias problem in HMM:
• Preference for states with a lower number of outgoing transitions over others
Solution: Do not normalize probabilities locally
From local probabilities ...
[Figure: the same transition diagram, with locally normalized probabilities]
Solution: Do not normalize probabilities locally
From local probabilities to local potentials
[Figure: the same diagram, with unnormalized local potentials (e.g., 20, 30, 10, ...) in place of probabilities]
• States with fewer outgoing transitions do not have an unfair advantage!
From HMM to CRF
• CRF is a partially directed model
  • Discriminative model, unlike HMM
  • Usage of a global normalizer Z(x) overcomes the label bias problem of HMM
  • Models the dependence between each state and the entire observation sequence
[Figure: HMM — joint model $P(X, Y) = p(y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}) \prod_{t=1}^{T} p(x_t \mid y_t)$ vs. CRF — states Y1, ..., Yn each connected to the entire observation x1:n]
Conditional Random Fields
• General parametric form:
  $P(\mathbf{y} \mid \mathbf{x}) = \dfrac{1}{Z(\mathbf{x}, \lambda, \mu)} \exp\Big( \sum_t \big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_l \mu_l g_l(y_t, \mathbf{x}) \big) \Big)$
[Figure: chain CRF — states Y1, Y2, ..., Yn connected to the entire observation x1:n]
CRFs: Inference
• Given CRF parameters λ and µ, find the y* that maximizes P(y|x)
  • Can ignore Z(x) because it is not a function of y
• Run the max-product algorithm on the junction tree of the CRF:
  [Figure: junction tree of the chain CRF — cliques (Y1,Y2), (Y2,Y3), ..., (Yn-1,Yn) with separators Y2, Y3, ..., Yn-1]
• Same as Viterbi decoding used in HMMs!
CRF learning
• Given $\{(\mathbf{x}_d, \mathbf{y}_d)\}_{d=1}^{N}$, find λ*, µ* that maximize the conditional log likelihood $L(\lambda, \mu) = \sum_d \log P(\mathbf{y}_d \mid \mathbf{x}_d, \lambda, \mu)$
• Computing the gradient w.r.t. λ: the gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.
CRF learning
• Computing the model expectations:
  • Requires an exponentially large number of summations: is it intractable?
• Tractable!
  • Can compute marginals using the sum-product algorithm on the chain
  • Expectation of f over the corresponding marginal probability of neighboring nodes!
CRF learning
• Computing marginals using junction-tree calibration:
  [Figure: junction tree of the chain CRF — cliques (Y1,Y2), (Y2,Y3), ..., (Yn-1,Yn) with separators Y2, Y3, ..., Yn-1]
• Junction tree initialization:
• After calibration:
• Also called the forward-backward algorithm
CRF learning
• Computing feature expectations using calibrated potentials:
• Now we know how to compute ∇λL(λ, µ):
• Learning can now be done using gradient ascent:
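A schematic sketch of the gradient-ascent loop (illustrative only): the callables feature_counts and expected_feature_counts are hypothetical placeholders for the empirical feature sums and for the model expectations obtained from the forward-backward marginals.

```python
import numpy as np

def crf_gradient_ascent(lam, data, feature_counts, expected_feature_counts,
                        lr=0.1, sigma2=10.0, n_iter=100):
    """Gradient ascent on the CRF conditional log likelihood with a Gaussian regularizer."""
    for _ in range(n_iter):
        grad = -lam / sigma2                                   # gradient of the Gaussian regularizer
        for x, y in data:
            # empirical feature counts minus model expected counts under P(y | x, lam)
            grad += feature_counts(x, y) - expected_feature_counts(x, lam)
        lam = lam + lr * grad
    return lam
```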
CRF learning
• In practice, we use a Gaussian regularizer for the parameter vector to improve generalization
• In practice, gradient ascent has very slow convergence
• Alternatives:
  • Conjugate gradient method
  • Limited-memory quasi-Newton methods
CRFs: some empirical results
• Comparison of error rates on synthetic data
  [Figure: scatter plot of CRF error vs. HMM error; data is increasingly higher order in the direction of the arrow]
• CRFs achieve the lowest error rate for higher-order data
CRFs: some empirical results
• Parts-of-speech tagging
• Using the same set of features: HMM >=< CRF > MEMM
• Using additional overlapping features: CRF+ > MEMM+ >> HMM
Supplementary
Other CRFs
• So far we have discussed only 1-dimensional chain CRFs
  • Inference and learning: exact
• We could also have CRFs for arbitrary graph structure
  • E.g., grid CRFs
  • Inference and learning no longer tractable
  • Approximate techniques used:
    • MCMC sampling
    • Variational inference
    • Loopy belief propagation
• We will discuss these techniques soon
Applications of CRF in Vision
• Image segmentation
• Stereo matching
• Image restoration
Application: Image Segmentation
Application: Handwriting Recognition
Application: Pose Estimation
Feature Functions for CRF in Vision
Case Study: Image Segmentation
• Image segmentation (FG/BG) by modeling interactions between random variables
  • Images are noisy.
  • Objects occupy continuous regions in an image.
  [Figure: input image; pixel-wise separate optimal labeling; locally-consistent joint optimal labeling]
  [Nowozin, Lampert 2012]
  $Y^{*} = \arg\max_{y \in \{0,1\}^n} \Big( \sum_{i \in S} V_i(y_i, X) + \sum_{i \in S} \sum_{j \in N_i} V_{i,j}(y_i, y_j) \Big)$
  where the first sum is the unary term and the second the pairwise term, and
  Y: labels; X: data (features); S: pixels; Ni: neighbors of pixel i
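The objective above is just a sum of unary and pairwise terms over pixels; a sketch that scores one candidate labeling is shown below (illustrative: unary, pairwise, and neighbors are hypothetical user-supplied potentials and neighborhood structure, and maximizing this score over all labelings is the hard part).

```python
def segmentation_score(Y, X, unary, pairwise, neighbors):
    """Score sum_i V_i(y_i, X) + sum_i sum_{j in N_i} V_{i,j}(y_i, y_j) of a labeling Y."""
    score = 0.0
    for i, yi in enumerate(Y):
        score += unary(i, yi, X)          # unary term V_i(y_i, X)
        for j in neighbors[i]:
            score += pairwise(yi, Y[j])   # pairwise term V_{i,j}(y_i, y_j)
    return score
```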
Discriminative Random Fields
• A special type of CRF
  • The unary and pairwise potentials are designed using local discriminative classifiers.
• Posterior
  $P(Y \mid X) = \dfrac{1}{Z} \exp\Big( \sum_{i \in S} A_i(y_i, X) + \sum_{i \in S} \sum_{j \in N_i} I_{ij}(y_i, y_j, X) \Big)$
  (association and interaction terms)
• Association potential
  • Local discriminative model for site i, using a logistic link with a GLM:
    $A_i(y_i, X) = \log P(y_i \mid f_i(X))$,  $P(y_i = 1 \mid f_i(X)) = \dfrac{1}{1 + \exp(-w^{T} f_i(X))} = \sigma(w^{T} f_i(X))$
• Interaction potential
  • A measure of how likely sites i and j are to have the same label given X:
    $I_{ij}(y_i, y_j, X) = k\, y_i y_j + (1 - k)\big(2\sigma(y_i y_j \mu_{ij}(X)) - 1\big)$
    (1) data-independent smoothing term; (2) data-dependent pairwise logistic function
S. Kumar and M. Hebert. Discriminative Random Fields. IJCV, 2006.
DRF Results
• Task: detecting man-made structure in natural scenes.
  • Each image is divided into non-overlapping 16x16 tile blocks.
• An example
  [Figure: input image and Logistic, MRF, DRF results]
  • Logistic: no smoothness in the labels
  • MRF: smoothed false positives; lack of neighborhood interaction of the data
S. Kumar and M. Hebert. Discriminative Random Fields. IJCV, 2006.
Multiscale Conditional Random Fields
• Considering features at different scales:
  • Local features (site)
  • Regional label features (small patch)
  • Global label features (big patch or the whole image)
• The conditional probability P(L|X) is formulated from features at different scales:
  $P(L \mid X) = \dfrac{1}{Z} \prod_{s} P_s(L \mid X)$, where $Z = \sum_{L} \prod_{s} P_s(L \mid X)$
He, X. et al.: Multiscale conditional random fields for image labeling. CVPR 2004
Multiscale Conditional Random Fields
He, X. et al.: Multiscale conditional random fields for image labeling. CVPR 2004
[Figure: local features, regional label features, global label features]
mCRF Results
He, X. et al.: Multiscale conditional random fields for image labeling. CVPR 2004
Topic Random Fields
• Spatial MRF over topic assignments
Zhao, B. et al.: Topic random fields for image segmentation. ECCV 2010
TRF Results
Zhao, B. et al.: Topic random fields for image segmentation. ECCV 2010
[Figure: Spatial LDA vs. Topic Random Fields]
Summary
• Conditional Random Fields are partially directed discriminative models
  • They overcome the label bias problem of HMM by using a global normalizer
• Inference for 1-D chain CRFs is exact
  • Same as max-product or Viterbi decoding
• Learning also is exact
  • Globally optimum parameters can be learned
  • Requires using the sum-product or forward-backward algorithm
• CRFs involving arbitrary graph structure are intractable in general
  • E.g., grid CRFs
  • Inference and learning require approximation techniques
    • MCMC sampling
    • Variational methods
    • Loopy BP