CS 6784 Paper Presentation

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando C. N. Pereira

Presenters: Brad Gulko and Stephanie Hyland
February 20, 2014
Main Contributions

This 2001 paper introduced the Conditional Random Field (CRF).
Describes an efficient representation of field potentials in terms of features.
Provides two algorithms for finding Maximum Likelihood parameter values.
Provides some really unconvincing examples...
...examples are NOT the strongest point of this paper.
Talk Structure

Brad
  CRF in context
  The Label Bias Problem
Stephanie
  Parameter Estimation
  Experiments
  Conclusion
CRF in Context

Let $X = \{X_1, X_2, \dots\}$ be a set of observed random variables,
$Y = \{Y_1, Y_2, \dots\}$ be a set of label random variables, and
$\mathcal{X}, \mathcal{Y}$ be a set of joint observations of $X, Y$.

             Generative   Discriminative
Directed     HMM          ME-MM
Undirected   MRF          CRF

In 2001, HMM, ME-MM and MRF were well known; the paper presents the CRF.
Generative vs. Discriminative

Generative: maximise joint $P(Y, X) = P(Y \mid X)\, P(X)$
Discriminative: maximise conditional $P(Y \mid X)$

When is Discriminative helpful?
Tractability requires independence
...but sometimes there are important correlations in $X$.
Examples: important correlations

Long-range interactions in human genomics
Pronoun definition and binding
Context in whole-scene image recognition
Recursive structure in language
Directed vs. Undirected

For a graphical model $G(E, V)$ with joint potential $\Psi(V)$:
Let $C$ be the set of cliques (fully connected subgroups) in $G$, with $c \in C$ having edges $E_c$ and vertices $V_c$.
Finally, $\mathrm{Dom}(V)$ is the set of all values assumable by the random variables, $V (= X \cup Y)$.

$$P(V) = \frac{1}{Z}\,\Psi(V), \qquad Z = \sum_{v \in \mathrm{Dom}(V)} \Psi(v)$$
Directed vs. Undirected, continued

Compactness requires factorization (Hammersley-Clifford, 1971):
$$\Psi(V) = \prod_{c \in C} \Psi_c(V_c)$$

Directed: local normalization (each $\Psi_c$ is a probability):
$$\forall c \in C, \quad \sum_{v \in \mathrm{Dom}(V_c)} \Psi_c(v) = 1$$

Undirected: global normalization relaxes this constraint... but what does it buy us?
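To make global normalization concrete, here is a minimal brute-force sketch (our own illustration, not code from the paper); the potential tables psi_1 and psi_12 and their values are made up, and the point is only that they are arbitrary positive numbers with no local sum-to-one constraint:

# Minimal sketch: brute-force global normalization for a tiny
# undirected chain Y1 - Y2 with binary labels.
from itertools import product

# Hypothetical clique potentials: any positive values allowed.
psi_1 = {0: 2.0, 1: 1.0}                      # singleton potential on Y1
psi_12 = {(0, 0): 4.0, (0, 1): 1.0,
          (1, 0): 1.0, (1, 1): 4.0}           # pairwise potential on (Y1, Y2)

def unnorm(y1, y2):
    """Product of clique potentials: Psi(V) = prod_c Psi_c(V_c)."""
    return psi_1[y1] * psi_12[(y1, y2)]

# Global normalizer Z sums the joint potential over Dom(V).
Z = sum(unnorm(y1, y2) for y1, y2 in product((0, 1), repeat=2))

for y1, y2 in product((0, 1), repeat=2):
    print((y1, y2), unnorm(y1, y2) / Z)       # P(V) = Psi(V) / Z

Probabilities only appear after dividing by the global $Z$; nothing forces each factor to normalize on its own.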
The Label Bias Problem: Conditional Markov Model (MEMM)
Toy Problem: fragment of a MEMM

Training Data: [figure lost in extraction]

P(Y_1 | X_1):
            Y_1 = 1   Y_1 = 2
X_1 = R     0.8       0.2

Relative joint counts (Y_2, X_2, Y_1):
                   Y_2 = 3   Y_2 = 4
Y_1 = 1, X_2 = I   8         -
Y_1 = 2, X_2 = O   -         2

Conditional P(Y_2 | X_2, Y_1), where ε is a very small value and unseen contexts are normalized to 0.5/0.5:
                   Y_2 = 3   Y_2 = 4
Y_1 = 1, X_2 = I   1 - ε     ε
Y_1 = 1, X_2 = O   0.5       0.5
Y_1 = 2, X_2 = I   0.5       0.5
Y_1 = 2, X_2 = O   ε         1 - ε

Let's try Viterbi for X = RO. Which labeling wins?

Y_1, Y_2   P(Y_1 | R)   P(Y_2 | Y_1, O)   P(Y_1, Y_2 | RO)
1, 3       0.8          0.5               0.4
1, 4       0.8          0.5               0.4
2, 3       0.2          ε                 ≈ 0
2, 4       0.2          1 - ε             ≈ 0.2

Viterbi picks (1, 3) or (1, 4), but we want (2, 4). What happened?
Local normalization requires a probability... so the 0.8 of mass entering Y_1 = 1 must be passed on in full, no matter how badly X_2 = O fits that branch.
The Label Bias Problem: Potentials
Toy Problem: fragment of a CRF

Training Data: [figure lost in extraction]

Potentials (ε is a very small positive value):

ψ(X_1, Y_1):   X_1 = R:  Y_1 = 1 → 8,  Y_1 = 2 → 2
ψ(Y_1, Y_2):   (1, 3) → 8,  (2, 4) → 2,  others → ε
ψ(X_2, Y_2):   X_2 = I:  Y_2 = 3 → 8,  Y_2 = 4 → ε;   X_2 = O:  Y_2 = 3 → ε,  Y_2 = 4 → 2

Which labeling wins, now, for X = RO?

Y_1, Y_2   ψ(X_1,Y_1)   ψ(Y_1,Y_2)   ψ(X_2,Y_2)   Ψ(Y_1,Y_2,X=RO)   P(Y_1,Y_2 | X)
1, 3       8            8            ε            64ε               small
1, 4       8            ε            2            16ε               tiny
2, 3       2            ε            ε            2ε²               infinitesimal
2, 4       2            2            2            8                 ≈ 100%

Because potentials do not have to normalize into probabilities until AFTER aggregation, they don't suffer from inappropriate conditioning.
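A quick numerical check of the two toy models (our own sketch; the ε value is an arbitrary small number, and the tables follow the slides above):

# Score the four labelings of X = "RO" under the toy MEMM and toy CRF.
from itertools import product

EPS = 1e-6
labelings = list(product((1, 2), (3, 4)))

# MEMM: locally normalized conditionals from the tables above.
p_y1 = {1: 0.8, 2: 0.2}                               # P(Y1 | X1=R)
p_y2 = {(1, 3): 0.5, (1, 4): 0.5,                     # P(Y2 | X2=O, Y1)
        (2, 3): EPS, (2, 4): 1 - EPS}
memm = {(y1, y2): p_y1[y1] * p_y2[(y1, y2)] for y1, y2 in labelings}

# CRF: unnormalized potentials, globally normalized only at the end.
psi_x1y1 = {1: 8.0, 2: 2.0}                           # psi(X1=R, Y1)
psi_y1y2 = {(1, 3): 8.0, (2, 4): 2.0,
            (1, 4): EPS, (2, 3): EPS}                 # psi(Y1, Y2)
psi_x2y2 = {3: EPS, 4: 2.0}                           # psi(X2=O, Y2)
score = {(y1, y2): psi_x1y1[y1] * psi_y1y2[(y1, y2)] * psi_x2y2[y2]
         for y1, y2 in labelings}
Z = sum(score.values())
crf = {yy: s / Z for yy, s in score.items()}

print("MEMM argmax:", max(memm, key=memm.get))        # -> (1, 3): label bias
print("CRF argmax:", max(crf, key=crf.get))           # -> (2, 4): as intended

Running this reproduces the tables: the MEMM's Viterbi path goes through Y_1 = 1, while the CRF gives nearly all mass to (2, 4).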
Fun fact: We have seen this in class before!

Graphical model $G(E, V)$ with joint potential $\Psi(V)$, $C$ the set of cliques in $G$ with $c \in C$ having edges $E_c$ and vertices $V_c$:
$$P(V) \propto \Psi(V) = \prod_{c \in C} \Psi_c(V_c)$$

M³ nets: cliques are pairs, and all conditioned on observed $x$:
$$P(y \mid x) \propto \prod_{(i,j) \in E} \psi_{ij}(y_i, y_j, x)$$

AMN: cliques are pairs of nodes and singletons:
$$P_\phi(y) = \frac{1}{Z} \prod_i^N \phi_i(y_i) \prod_{(i,j) \in E} \phi_{ij}(y_i, y_j)$$
Where do Parameters Come From?

CRFs are part of the same general class:
$$P(V) \propto \Psi(V) = \prod_{c \in C} \Psi_c(V_c)$$

For trees, cliques are pairs of vertices sharing an edge ($e \in E$) and single vertices ($v \in V$):
$$\Psi(V) = \prod_{e \in E} \Psi_e(V_e) \prod_{v \in V} \Psi_v(V_v)$$

And because this is a conditional network with $V = X \cup Y$:
$$\Psi(Y \mid X) = \prod_{e \in E} \Psi_e(e, Y, X) \prod_{v \in V} \Psi_v(v, Y, X)$$

An exponential identity gives us
$$\Psi(Y \mid X) = \exp\Big( \sum_{e \in E} \log \Psi_e(e, Y, X) + \sum_{v \in V} \log \Psi_v(v, Y, X) \Big)$$

Potentials can be ANY positive values... like linear combinations of arbitrary features:
$$\Psi(Y \mid X) = \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$$
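To see the exponential form in action, here is a brute-force sketch for a length-3 chain (our own illustration: the labels, the features f_same and g_cap, and the weights lam and mu are all invented for the example):

# The exponential-form CRF above, computed by exhaustive enumeration.
import math
from itertools import product

x = ["Rob", "likes", "CRFs"]           # toy observation sequence
LABELS = (0, 1)                        # e.g. 0 = other, 1 = name

def f_same(y_prev, y_cur, x, i):       # edge feature: adjacent labels agree
    return 1.0 if y_prev == y_cur else 0.0

def g_cap(y, x, i):                    # vertex feature: capitalised word tagged 1
    return 1.0 if (x[i][0].isupper() and y == 1) else 0.0

lam, mu = 0.5, 2.0                     # weights lambda_k, mu_k (made up)

def log_potential(y):
    s = sum(lam * f_same(y[i - 1], y[i], x, i) for i in range(1, len(x)))
    s += sum(mu * g_cap(y[i], x, i) for i in range(len(x)))
    return s

scores = {y: math.exp(log_potential(y)) for y in product(LABELS, repeat=len(x))}
Z = sum(scores.values())               # global normalizer
best = max(scores, key=scores.get)
print("argmax_y P(y|x) =", best, "with P =", scores[best] / Z)

For real sequence lengths the enumeration is replaced by forward-backward recursions, but the model being normalized is exactly this product of exponentiated feature sums.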
Improved iterative scaling

Want to maximize log-likelihood with respect to parameters
$\theta = (\lambda_1, \lambda_2, \dots; \mu_1, \mu_2, \dots)$

Method: Improved Iterative Scaling¹: extension of Generalised Iterative Scaling (Darroch and Ratcliff 1972). Improved because features need not sum to a constant.

Idea: find a new set of parameters
$\theta' = \theta + \delta\theta = (\lambda_1 + \delta\lambda_1, \dots; \mu_1 + \delta\mu_1, \dots)$ which will not decrease the objective function. Iteratively apply!

Problem: slow, and nobody uses this any more.

¹Della Pietra et al. (1997)
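For reference, the flavour of the update (a sketch following Della Pietra et al.; the notation here is assumed, not quoted from the paper): each $\delta\lambda_k$ is chosen to solve

$$\tilde{E}[f_k] \;=\; \sum_{x} \tilde{p}(x) \sum_{y} p_\theta(y \mid x)\, f_k(x, y)\, e^{\delta\lambda_k\, T(x, y)}, \qquad T(x, y) = \sum_k f_k(x, y),$$

which is where the "need not sum to a constant" improvement enters: GIS requires the total feature count $T$ to be constant across examples, IIS does not.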
Modern CRF training - L-BFGS

Generally use the L-BFGS² algorithm.
Approximates Newton's method: optimise a multivariate function $f(\theta)$ through updates
$$\theta_{t+1} = \theta_t - [H_f(\theta_t)]^{-1} \nabla f(\theta_t)$$
Quasi-Newton: approximates the Hessian $H_f(\theta)$.
Limited-memory: doesn't store the full (approximate) Hessian.

²Limited-Memory Broyden-Fletcher-Goldfarb-Shanno Algorithm
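A minimal sketch of how this looks in practice (our illustration, not from the paper): SciPy's L-BFGS-B minimising a logistic negative log-likelihood, a stand-in objective with the same convex shape as the CRF log-likelihood; the data here is randomly generated.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy design matrix
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

def neg_log_lik(theta):
    z = X @ theta
    # logistic NLL: sum over examples of log(1 + e^z) - y*z
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(theta):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))    # predicted probabilities
    return X.T @ (p - y)

res = minimize(neg_log_lik, np.zeros(3), jac=grad, method="L-BFGS-B")
print(res.x, res.fun)

For a CRF, only neg_log_lik and grad change: the gradient becomes observed minus expected feature counts, with expectations computed by forward-backward.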
Label bias

Generate data with a noisy HMM.
Five-state system (not counting the 'initial state'), transitions:
1 → 2 → 3
4 → 5 → 3
Emissions: highly biased!
P(X = Y's preferred value | Y) = 29/32, P(X = other | Y) = 1/32
Preferred values: 1 → 'r', 4 → 'r', 2 → 'i', 5 → 'o', 3 → 'b'.
Result: CRF error 4.6%, MEMM error 42%.
Mixed-order sources

Generate data with a mixed-order HMM:
Transitions: $(1 - \alpha)\, p_1(y_i \mid y_{i-1}) + \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2})$
Emissions: $(1 - \alpha)\, p_1(x_i \mid y_i) + \alpha\, p_2(x_i \mid y_i, x_{i-1})$
Five labels, 26 observation values.
Training/testing: 1000 sequences of length 25.
CRF trained with Algorithm S (modified IIS). MEMM trained with iterative scaling.
Viterbi to label the test set.
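A sketch of the mixed-order transition sampling (our own illustration; the conditionals p1 and p2 are random stand-ins, not the distributions used in the experiment):

import numpy as np

rng = np.random.default_rng(0)
K, alpha = 5, 0.3                               # five labels, mixing weight

# Random stand-ins for the first- and second-order conditionals.
p1 = rng.dirichlet(np.ones(K), size=K)          # p1[y_prev] is a distribution over y
p2 = rng.dirichlet(np.ones(K), size=(K, K))     # p2[y_prev2][y_prev] likewise

def sample_next(y_prev2, y_prev):
    # (1 - alpha) * p1(y | y_prev) + alpha * p2(y | y_prev, y_prev2)
    mix = (1 - alpha) * p1[y_prev] + alpha * p2[y_prev2][y_prev]
    return rng.choice(K, p=mix)

y = [0, 1]                                      # arbitrary seed labels
for _ in range(23):                             # extend to length 25
    y.append(sample_next(y[-2], y[-1]))
print(y)

As $\alpha$ grows, the data violates the first-order assumption that all three trained models make, which is what the experiment varies.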
Mixed-order sources: results

[Plot lost in extraction: CRF vs. MEMM error rates across values of α.]
Squares: α < 0.5.
CRF sort of wins?
Part of Speech Tagging

Penn Treebank: 45 syntactic tags, label each word in the sentence.
Train first-order HMM, MEMM, CRF.
Spelling features exploit the conditional framework.
Examples: starts with a number/upper case?, contains a hyphen?, has a suffix?
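The kind of spelling features meant here, as a sketch (a hypothetical feature set of our own; a conditional model can use such overlapping features freely because it never has to model $P(x)$):

import re

def spelling_features(word):
    return {
        "starts_with_digit": word[0].isdigit(),
        "starts_with_upper": word[0].isupper(),
        "contains_hyphen": "-" in word,
        "has_ing_suffix": word.endswith("ing"),          # one example suffix
        "looks_like_number": re.fullmatch(r"[\d,.]+", word) is not None,
    }

print(spelling_features("well-known"))
print(spelling_features("Running"))

Each boolean, conjoined with a candidate tag, becomes one weighted feature $f_k$ in the exponential model.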
Skip-chain CRF

Example: skip-chain CRF³. Has long-range features!
Basic idea: extend the linear-chain CRF by joining some distant observations with 'skip edges'.
Connect multiple mentions of an entity across the whole document.

³From An Introduction to Conditional Random Fields for Relational Learning, Charles Sutton and Andrew McCallum, 2006.
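In symbols (a sketch following the form in the Sutton & McCallum tutorial; the factor names and the symbol $\mathcal{I}$ are our notation):

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x}) \prod_{(u,v) \in \mathcal{I}} \Psi_{uv}(y_u, y_v, \mathbf{x})$$

where $\mathcal{I}$ is the set of skip edges, i.e. the pairs of distant positions whose labels are joined by an extra factor.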
Example: [factor graph figure lost in extraction] Note: squares denote factors (i.e. potential functions).

Question: Ignoring the skip edges (in blue), what potentials does $Y_i$ appear in?
Answer: $\Psi(Y_i, Y_{i-1}, X_i)$ and $\Psi(Y_{i+1}, Y_i, X_{i+1})$.
Skip-chain task

Data: 485 email announcements for seminars at CMU.
Task: identify start time, end time, location, speaker.
Linear-chain CRF with skip edges between identical capitalised words.
Other word-specific features, e.g. 'appears in list of first names', 'upper case', 'appears to be part of time/date' (by regex), etc.
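Building the skip edges is simple; a minimal sketch (our own illustration of the construction described above; the token list is invented):

from itertools import combinations

tokens = "Speaker : Dr. Green . Dr. Green will discuss CRFs".split()

def skip_edges(tokens):
    edges = []
    for i, j in combinations(range(len(tokens)), 2):
        if tokens[i] == tokens[j] and tokens[i][0].isupper():
            edges.append((i, j))   # connect repeated capitalised mentions
    return edges

print(skip_edges(tokens))          # pairs of indices for 'Dr.' and 'Green'

Evidence at one mention (e.g. the preceding 'Dr.') can then flow along the skip edge to help label the other mention.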
Skip-chain results

[Results table lost in extraction.] Values are F1 scores.
Repeated occurrences of the speaker improve skip-chain performance.
Tokens are consistently classified by the skip-chain model: the linear chain is inconsistent on 30.2 speakers on average, the skip chain on 4.8.
Summary

CRFs combine discriminative (e.g. MEMM) and undirected (e.g. MRF) properties to solve problems:
Global normalisation avoids label bias.
Conditioning on observations avoids modelling complex dependencies.
Enables use of features using global structure.

Examples in the paper are strangely insubstantial, but CRFs are widely and successfully used.