CS 6784 Paper Presentation

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando C. N. Pereira

Presenters: Brad Gulko and Stephanie Hyland
February 20, 2014
Main Contributions

This 2001 paper introduced the Conditional Random Field (CRF).
Describes an efficient representation of field potentials in terms of features.
Provides two algorithms for finding Maximum Likelihood parameter values.
Provides some really unconvincing examples...
...examples are NOT the strongest point of this paper.
Talk Structure

Brad
  CRF in context
  The Label Bias Problem
Stephanie
  Parameter Estimation
  Experiments
  Conclusion
CRF in Context

Let $X = \{X_1, X_2, \dots\}$ be a set of observed random variables,
$Y = \{Y_1, Y_2, \dots\}$ be a set of label random variables, and
$\mathcal{X}, \mathcal{Y}$ be a set of joint observations of $X, Y$.

             Generative   Discriminative
Directed     HMM          ME-MM
Undirected   MRF          CRF

In 2001, HMM, ME-MM and MRF were well known; the paper presents the CRF.
Generative vs. Discriminative

Generative: maximise joint $P(Y, X) = P(Y \mid X)\, P(X)$
Discriminative: maximise conditional $P(Y \mid X)$

When is Discriminative helpful?
Tractability requires independence
...but sometimes there are important correlations in $X$.
Examples: important correlations

Long-range interactions in human genomics
Pronoun definition and binding
Context in whole-scene image recognition
Recursive structure in language
Directed vs. Undirected

For a graphical model $G(E, V)$ with joint potential $\Psi(V)$:
Let $C$ be the set of cliques (fully connected subgroups) in $G$, with $c \in C$ having edges $E_c$ and vertices $V_c$.
Finally, $\mathrm{Dom}(V)$ is the set of all values assumable by the random variables, $V (= X \cup Y)$.

$$P(V) = \frac{1}{Z}\,\Psi(V), \qquad Z = \sum_{v \in \mathrm{Dom}(V)} \Psi(v)$$
Directed vs. Undirected, continued

Compactness requires factorization (Hammersley-Clifford, 1971):
$$\Psi(V) = \prod_{c \in C} \Psi_c(V_c)$$

Directed: local normalization (each $\Psi_c$ is a probability):
$$\forall c \in C, \quad \sum_{v \in \mathrm{Dom}(V_c)} \Psi_c(v) = 1$$

Undirected: global normalization relaxes this constraint... but what does it buy us?
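To make global normalization concrete, here is a minimal brute-force sketch (our own illustration, not code from the paper); the potential tables psi_1 and psi_12 and their values are made up, and the point is only that they are arbitrary positive numbers with no local sum-to-one constraint:

# Minimal sketch: brute-force global normalization for a tiny
# undirected chain Y1 - Y2 with binary labels.
from itertools import product

# Hypothetical clique potentials: any positive values allowed.
psi_1 = {0: 2.0, 1: 1.0}                      # singleton potential on Y1
psi_12 = {(0, 0): 4.0, (0, 1): 1.0,
          (1, 0): 1.0, (1, 1): 4.0}           # pairwise potential on (Y1, Y2)

def unnorm(y1, y2):
    """Product of clique potentials: Psi(V) = prod_c Psi_c(V_c)."""
    return psi_1[y1] * psi_12[(y1, y2)]

# Global normalizer Z sums the joint potential over Dom(V).
Z = sum(unnorm(y1, y2) for y1, y2 in product((0, 1), repeat=2))

for y1, y2 in product((0, 1), repeat=2):
    print((y1, y2), unnorm(y1, y2) / Z)       # P(V) = Psi(V) / Z

Probabilities only appear after dividing by the global $Z$; nothing forces each factor to normalize on its own.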
The Label Bias Problem: Conditional Markov Model (MEMM)
Toy Problem: fragment of a MEMM

Training Data: [figure lost in extraction]

P(Y_1 | X_1):
            Y_1 = 1   Y_1 = 2
X_1 = R     0.8       0.2

Relative joint counts (Y_2, X_2, Y_1):
                   Y_2 = 3   Y_2 = 4
Y_1 = 1, X_2 = I   8         -
Y_1 = 2, X_2 = O   -         2

Conditional P(Y_2 | X_2, Y_1), where ε is a very small value and unseen contexts are normalized to 0.5/0.5:
                   Y_2 = 3   Y_2 = 4
Y_1 = 1, X_2 = I   1 - ε     ε
Y_1 = 1, X_2 = O   0.5       0.5
Y_1 = 2, X_2 = I   0.5       0.5
Y_1 = 2, X_2 = O   ε         1 - ε

Let's try Viterbi for X = RO. Which labeling wins?

Y_1, Y_2   P(Y_1 | R)   P(Y_2 | Y_1, O)   P(Y_1, Y_2 | RO)
1, 3       0.8          0.5               0.4
1, 4       0.8          0.5               0.4
2, 3       0.2          ε                 ≈ 0
2, 4       0.2          1 - ε             ≈ 0.2

Viterbi picks (1, 3) or (1, 4), but we want (2, 4). What happened?
Local normalization requires a probability... so the 0.8 of mass entering Y_1 = 1 must be passed on in full, no matter how badly X_2 = O fits that branch.
The Label Bias Problem: Potentials
Toy Problem: fragment of a CRF

Training Data: [figure lost in extraction]

Potentials (ε is a very small positive value):

ψ(X_1, Y_1):   X_1 = R:  Y_1 = 1 → 8,  Y_1 = 2 → 2
ψ(Y_1, Y_2):   (1, 3) → 8,  (2, 4) → 2,  others → ε
ψ(X_2, Y_2):   X_2 = I:  Y_2 = 3 → 8,  Y_2 = 4 → ε;   X_2 = O:  Y_2 = 3 → ε,  Y_2 = 4 → 2

Which labeling wins, now, for X = RO?

Y_1, Y_2   ψ(X_1,Y_1)   ψ(Y_1,Y_2)   ψ(X_2,Y_2)   Ψ(Y_1,Y_2,X=RO)   P(Y_1,Y_2 | X)
1, 3       8            8            ε            64ε               small
1, 4       8            ε            2            16ε               tiny
2, 3       2            ε            ε            2ε²               infinitesimal
2, 4       2            2            2            8                 ≈ 100%

Because potentials do not have to normalize into probabilities until AFTER aggregation, they don't suffer from inappropriate conditioning.
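A quick numerical check of the two toy models (our own sketch; the ε value is an arbitrary small number, and the tables follow the slides above):

# Score the four labelings of X = "RO" under the toy MEMM and toy CRF.
from itertools import product

EPS = 1e-6
labelings = list(product((1, 2), (3, 4)))

# MEMM: locally normalized conditionals from the tables above.
p_y1 = {1: 0.8, 2: 0.2}                               # P(Y1 | X1=R)
p_y2 = {(1, 3): 0.5, (1, 4): 0.5,                     # P(Y2 | X2=O, Y1)
        (2, 3): EPS, (2, 4): 1 - EPS}
memm = {(y1, y2): p_y1[y1] * p_y2[(y1, y2)] for y1, y2 in labelings}

# CRF: unnormalized potentials, globally normalized only at the end.
psi_x1y1 = {1: 8.0, 2: 2.0}                           # psi(X1=R, Y1)
psi_y1y2 = {(1, 3): 8.0, (2, 4): 2.0,
            (1, 4): EPS, (2, 3): EPS}                 # psi(Y1, Y2)
psi_x2y2 = {3: EPS, 4: 2.0}                           # psi(X2=O, Y2)
score = {(y1, y2): psi_x1y1[y1] * psi_y1y2[(y1, y2)] * psi_x2y2[y2]
         for y1, y2 in labelings}
Z = sum(score.values())
crf = {yy: s / Z for yy, s in score.items()}

print("MEMM argmax:", max(memm, key=memm.get))        # -> (1, 3): label bias
print("CRF argmax:", max(crf, key=crf.get))           # -> (2, 4): as intended

Running this reproduces the tables: the MEMM's Viterbi path goes through Y_1 = 1, while the CRF gives nearly all mass to (2, 4).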
Fun fact: We have seen this in class before!

Graphical model $G(E, V)$ with joint potential $\Psi(V)$, $C$ the set of cliques in $G$ with $c \in C$ having edges $E_c$ and vertices $V_c$:
$$P(V) \propto \Psi(V) = \prod_{c \in C} \Psi_c(V_c)$$

M³ nets: cliques are pairs, and all conditioned on observed $x$:
$$P(y \mid x) \propto \prod_{(i,j) \in E} \psi_{ij}(y_i, y_j, x)$$

AMN: cliques are pairs of nodes and singletons:
$$P_\phi(y) = \frac{1}{Z} \prod_i^N \phi_i(y_i) \prod_{(i,j) \in E} \phi_{ij}(y_i, y_j)$$
Where do Parameters Come From?

CRFs are part of the same general class:
$$P(V) \propto \Psi(V) = \prod_{c \in C} \Psi_c(V_c)$$

For trees, cliques are pairs of vertices sharing an edge ($e \in E$) and single vertices ($v \in V$):
$$\Psi(V) = \prod_{e \in E} \Psi_e(V_e) \prod_{v \in V} \Psi_v(V_v)$$

And because this is a conditional network with $V = X \cup Y$:
$$\Psi(Y \mid X) = \prod_{e \in E} \Psi_e(e, Y, X) \prod_{v \in V} \Psi_v(v, Y, X)$$

An exponential identity gives us
$$\Psi(Y \mid X) = \exp\Big( \sum_{e \in E} \log \Psi_e(e, Y, X) + \sum_{v \in V} \log \Psi_v(v, Y, X) \Big)$$

Potentials can be ANY positive values... like linear combinations of arbitrary features:
$$\Psi(Y \mid X) = \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$
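To see the exponential form in action, here is a brute-force sketch for a length-3 chain (our own illustration: the labels, the features f_same and g_cap, and the weights lam and mu are all invented for the example):

# The exponential-form CRF above, computed by exhaustive enumeration.
import math
from itertools import product

x = ["Rob", "likes", "CRFs"]           # toy observation sequence
LABELS = (0, 1)                        # e.g. 0 = other, 1 = name

def f_same(y_prev, y_cur, x, i):       # edge feature: adjacent labels agree
    return 1.0 if y_prev == y_cur else 0.0

def g_cap(y, x, i):                    # vertex feature: capitalised word tagged 1
    return 1.0 if (x[i][0].isupper() and y == 1) else 0.0

lam, mu = 0.5, 2.0                     # weights lambda_k, mu_k (made up)

def log_potential(y):
    s = sum(lam * f_same(y[i - 1], y[i], x, i) for i in range(1, len(x)))
    s += sum(mu * g_cap(y[i], x, i) for i in range(len(x)))
    return s

scores = {y: math.exp(log_potential(y)) for y in product(LABELS, repeat=len(x))}
Z = sum(scores.values())               # global normalizer
best = max(scores, key=scores.get)
print("argmax_y P(y|x) =", best, "with P =", scores[best] / Z)

For real sequence lengths the enumeration is replaced by forward-backward recursions, but the model being normalized is exactly this product of exponentiated feature sums.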
Improved iterative scaling

Want to maximize log-likelihood with respect to parameters
$\theta = (\lambda_1, \lambda_2, \dots; \mu_1, \mu_2, \dots)$

Method: Improved Iterative Scaling¹: extension of Generalised Iterative Scaling (Darroch and Ratcliff 1972). Improved because features need not sum to a constant.

Idea: find a new set of parameters
$\theta' = \theta + \delta\theta = (\lambda_1 + \delta\lambda_1, \dots; \mu_1 + \delta\mu_1, \dots)$ which will not decrease the objective function. Iteratively apply!

Problem: slow, and nobody uses this any more.

¹Della Pietra et al. (1997)
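For reference, the flavour of the update (a sketch following Della Pietra et al.; the notation here is assumed, not quoted from the paper): each $\delta\lambda_k$ is chosen to solve

$$\tilde{E}[f_k] \;=\; \sum_{x} \tilde{p}(x) \sum_{y} p_\theta(y \mid x)\, f_k(x, y)\, e^{\delta\lambda_k\, T(x, y)}, \qquad T(x, y) = \sum_k f_k(x, y),$$

which is where the "need not sum to a constant" improvement enters: GIS requires the total feature count $T$ to be constant across examples, IIS does not.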
Modern CRF training - L-BFGS

Generally use the L-BFGS² algorithm.
Approximates Newton's method: optimise a multivariate function $f(\theta)$ through updates
$$\theta_{t+1} = \theta_t - [H_f(\theta_t)]^{-1} \nabla f(\theta_t)$$
Quasi-Newton: approximates the Hessian $H_f(\theta)$.
Limited-memory: doesn't store the full (approximate) Hessian.

²Limited-Memory Broyden-Fletcher-Goldfarb-Shanno Algorithm
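A minimal sketch of how this looks in practice (our illustration, not from the paper): SciPy's L-BFGS-B minimising a logistic negative log-likelihood, a stand-in objective with the same convex shape as the CRF log-likelihood; the data here is randomly generated.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy design matrix
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

def neg_log_lik(theta):
    z = X @ theta
    # logistic NLL: sum over examples of log(1 + e^z) - y*z
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(theta):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))    # predicted probabilities
    return X.T @ (p - y)

res = minimize(neg_log_lik, np.zeros(3), jac=grad, method="L-BFGS-B")
print(res.x, res.fun)

For a CRF, only neg_log_lik and grad change: the gradient becomes observed minus expected feature counts, with expectations computed by forward-backward.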
Label bias

Generate data with a noisy HMM.
Five-state system (not counting the 'initial state'), transitions:
1 → 2 → 3
4 → 5 → 3
Emissions: highly biased!
P(X = Y's preferred value | Y) = 29/32, P(X = other | Y) = 1/32
Preferred values: 1 → 'r', 4 → 'r', 2 → 'i', 5 → 'o', 3 → 'b'.
Result: CRF error 4.6%, MEMM error 42%.
Mixed-order sources

Generate data with a mixed-order HMM:
Transitions: $(1 - \alpha)\, p_1(y_i \mid y_{i-1}) + \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2})$
Emissions: $(1 - \alpha)\, p_1(x_i \mid y_i) + \alpha\, p_2(x_i \mid y_i, x_{i-1})$
Five labels, 26 observation values.
Training/testing: 1000 sequences of length 25.
CRF trained with Algorithm S (modified IIS). MEMM trained with iterative scaling.
Viterbi to label the test set.
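A sketch of the mixed-order transition sampling (our own illustration; the conditionals p1 and p2 are random stand-ins, not the distributions used in the experiment):

import numpy as np

rng = np.random.default_rng(0)
K, alpha = 5, 0.3                               # five labels, mixing weight

# Random stand-ins for the first- and second-order conditionals.
p1 = rng.dirichlet(np.ones(K), size=K)          # p1[y_prev] is a distribution over y
p2 = rng.dirichlet(np.ones(K), size=(K, K))     # p2[y_prev2][y_prev] likewise

def sample_next(y_prev2, y_prev):
    # (1 - alpha) * p1(y | y_prev) + alpha * p2(y | y_prev, y_prev2)
    mix = (1 - alpha) * p1[y_prev] + alpha * p2[y_prev2][y_prev]
    return rng.choice(K, p=mix)

y = [0, 1]                                      # arbitrary seed labels
for _ in range(23):                             # extend to length 25
    y.append(sample_next(y[-2], y[-1]))
print(y)

As $\alpha$ grows, the data violates the first-order assumption that all three trained models make, which is what the experiment varies.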
Mixed-order sources: results

[Plot lost in extraction: CRF vs. MEMM error rates across values of α.]
Squares: α < 0.5.
CRF sort of wins?
Part of Speech Tagging

Penn Treebank: 45 syntactic tags, label each word in the sentence.
Train first-order HMM, MEMM, CRF.
Spelling features exploit the conditional framework.
Examples: starts with a number/upper case?, contains a hyphen?, has a suffix?
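The kind of spelling features meant here, as a sketch (a hypothetical feature set of our own; a conditional model can use such overlapping features freely because it never has to model $P(x)$):

import re

def spelling_features(word):
    return {
        "starts_with_digit": word[0].isdigit(),
        "starts_with_upper": word[0].isupper(),
        "contains_hyphen": "-" in word,
        "has_ing_suffix": word.endswith("ing"),          # one example suffix
        "looks_like_number": re.fullmatch(r"[\d,.]+", word) is not None,
    }

print(spelling_features("well-known"))
print(spelling_features("Running"))

Each boolean, conjoined with a candidate tag, becomes one weighted feature $f_k$ in the exponential model.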
Skip-chain CRF

Example: skip-chain CRF³. Has long-range features!
Basic idea: extend the linear-chain CRF by joining some distant observations with 'skip edges'.
Connect multiple mentions of an entity across the whole document.

³From An Introduction to Conditional Random Fields for Relational Learning, Charles Sutton and Andrew McCallum, 2006.
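In symbols (a sketch following the form in the Sutton & McCallum tutorial; the factor names and the symbol $\mathcal{I}$ are our notation):

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x}) \prod_{(u,v) \in \mathcal{I}} \Psi_{uv}(y_u, y_v, \mathbf{x})$$

where $\mathcal{I}$ is the set of skip edges, i.e. the pairs of distant positions whose labels are joined by an extra factor.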
Example: [factor graph figure lost in extraction] Note: squares denote factors (i.e. potential functions).

Question: Ignoring the skip edges (in blue), what potentials does $Y_i$ appear in?
Answer: $\Psi(Y_i, Y_{i-1}, X_i)$ and $\Psi(Y_{i+1}, Y_i, X_{i+1})$.
Skip-chain task

Data: 485 email announcements for seminars at CMU.
Task: identify start time, end time, location, speaker.
Linear-chain CRF with skip edges between identical capitalised words.
Other word-specific features, e.g. 'appears in list of first names', 'upper case', 'appears to be part of time/date' (by regex), etc.
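Building the skip edges is simple; a minimal sketch (our own illustration of the construction described above; the token list is invented):

from itertools import combinations

tokens = "Speaker : Dr. Green . Dr. Green will discuss CRFs".split()

def skip_edges(tokens):
    edges = []
    for i, j in combinations(range(len(tokens)), 2):
        if tokens[i] == tokens[j] and tokens[i][0].isupper():
            edges.append((i, j))   # connect repeated capitalised mentions
    return edges

print(skip_edges(tokens))          # pairs of indices for 'Dr.' and 'Green'

Evidence at one mention (e.g. the preceding 'Dr.') can then flow along the skip edge to help label the other mention.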
Skip-chain results

[Results table lost in extraction.] Values are F1 scores.
Repeated occurrences of the speaker improve skip-chain performance.
Tokens are consistently classified by the skip-chain model: the linear chain is inconsistent on 30.2 speakers on average, the skip chain on 4.8.
Summary

CRFs combine discriminative (e.g. MEMM) and undirected (e.g. MRF) properties to solve problems:
Global normalisation avoids label bias.
Conditioning on observations avoids modelling complex dependencies.
Enables use of features using global structure.

Examples in the paper are strangely insubstantial, but CRFs are widely and successfully used.