Unsupervised Models for Coreference Resolution
Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
Plan for the Talk
Supervised learning for coreference resolution
  how and when supervised coreference research started
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications
Machine Learning for Coreference Resolution
Started in the mid-1990s
  Connolly et al. (1994), Aone and Bennett (1995), McCarthy and Lehnert (1995)
Propelled by the availability of annotated corpora produced by
  the Message Understanding Conferences (MUC-6/7: 1995, 1998): English only
  Automatic Content Extraction (ACE 2003, 2004, 2005, 2008): English, Chinese, Arabic
Identified as an important task for information extraction
  identity coreference only
Identity Coreference
Identify the noun phrases (or mentions) that refer to the same real-world entity:

  Queen Elizabeth set about transforming her husband, King George VI, into a
  viable monarch. Logue, a renowned speech therapist, was summoned to help
  the King overcome his speech impediment...

Lots of prior work on supervised coreference resolution
Standard Supervised Learning Approach: Classification
A classifier is trained to determine whether two mentions are coreferent or not coreferent

  [Queen Elizabeth] set about transforming [her] [husband], ...
  For each pair of mentions, the classifier decides: coref or not coref?
Standard Supervised Learning Approach: Clustering
Coordinates possibly contradictory pairwise classification decisions

  [Queen Elizabeth] set about transforming [her] [husband], ...
  pairwise decisions: coref, not coref, not coref, ...

  The clustering algorithm then partitions the mentions into entities:
  { Queen Elizabeth, her }
  { husband, King George VI, the King, his }
  { Logue, a renowned speech therapist }
Standard Supervised Learning Approach
Typically relies on a large amount of labeled data
What if we only have a small amount of annotated data?
First Attempt: Supervised Learning
Train on whatever annotated data we have
Need to specify a
  learning algorithm (Bayes)
  feature set
  clustering algorithm (Bell tree)
The Bayes Classifier
Finds the class value y (Coref or Not Coref) that is the most probable given the feature vector x1, ..., xn

Finds y* such that

  y* = argmax_{y ∈ Y} P(y | x1, x2, ..., xn)
     = argmax_{y ∈ Y} P(y) P(x1, x2, ..., xn | y)

What features to use in the feature representation?
Linguistic Features
Use 7 linguistic features divided into 3 groups

Strong Coreference Indicators
  String match
  Appositive
  Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints
  Gender agreement
  Number agreement
  Semantic compatibility

Mention Pair Type
  (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
  E.g., for the mention pair (Barack Obama, president-elect), the feature value is (Name, Nominal)
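As a concrete illustration, the 3-group, 7-feature representation might be extracted as follows. This is a simplified sketch: the Mention class and the string-based feature tests (and the stubbed-out appositive and semantic-compatibility checks) are hypothetical stand-ins, not the talk's actual feature extractors.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    mtype: str    # "Pronoun", "Name", or "Nominal"
    gender: str   # e.g. "male", "female", "neuter", "unknown"
    number: str   # "singular" or "plural"

def extract_features(mi, mj):
    """7-feature representation of a mention pair (hypothetical, simplified tests)."""
    return {
        # Group 1: strong coreference indicators
        "string_match": mi.text.lower() == mj.text.lower(),
        "appositive": False,      # would need parse information; stubbed out
        "alias": mi.text == "".join(w[0].upper() for w in mj.text.split()),
        # Group 2: linguistic constraints
        "gender_agree": mi.gender == mj.gender or "unknown" in (mi.gender, mj.gender),
        "number_agree": mi.number == mj.number,
        "semantic_compat": True,  # would need a semantic lexicon; stubbed out
        # Group 3: mention pair type
        "pair_type": (mi.mtype, mj.mtype),
    }

obama = Mention("Barack Obama", "Name", "male", "singular")
pe = Mention("president-elect", "Nominal", "male", "singular")
feats = extract_features(obama, pe)   # pair_type is ("Name", "Nominal")
```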
The Bayes Classifier
With the 7 features, finds y* such that

  y* = argmax_{y ∈ Y} P(y | x1, x2, ..., x7)
     = argmax_{y ∈ Y} P(y) P(x1, x2, ..., x7 | y)

But we may have a data sparseness problem, so let's simplify the likelihood term:
  assume that feature values from different groups are independent of each other given the class
The Bayes Classifier
Under this assumption,

  y* = argmax_{y ∈ Y} P(y) P(x1, x2, x3 | y) P(x4, x5, x6 | y) P(x7 | y)

These conditional distributions and P(y) are the model parameters (to be estimated from annotated data using maximum likelihood estimation).

This is a generative model: it specifies how an instance is generated
  Generate the class y with P(y)
  Given y, generate x1, x2, and x3 with P(x1, x2, x3 | y)
  Given y, generate x4, x5, and x6 with P(x4, x5, x6 | y)
  Given y, generate x7 with P(x7 | y)
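The grouped factorization above can be sketched as a small classifier: features within a group are modeled jointly, and the three groups are assumed conditionally independent given the class. The Laplace smoothing and the toy training pairs below are illustrative additions, not part of the talk.

```python
import math
from collections import Counter, defaultdict

GROUPS = [(0, 1, 2), (3, 4, 5), (6,)]  # indices of x1..x3, x4..x6, x7

class GroupedBayes:
    """Bayes classifier whose feature groups are modeled jointly but are
    assumed conditionally independent of each other given the class."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing (my addition)
        self.class_counts = Counter()
        self.group_counts = defaultdict(Counter)  # (class, group) -> joint-value counts

    def fit(self, X, y):
        for xs, c in zip(X, y):
            self.class_counts[c] += 1
            for g, idxs in enumerate(GROUPS):
                self.group_counts[(c, g)][tuple(xs[i] for i in idxs)] += 1
        return self

    def predict(self, xs):
        best, best_score = None, float("-inf")
        total = sum(self.class_counts.values())
        for c, n in self.class_counts.items():
            score = math.log(n / total)           # log P(y)
            for g, idxs in enumerate(GROUPS):
                counts = self.group_counts[(c, g)]
                v = tuple(xs[i] for i in idxs)
                # smoothed MLE of P(x_group | y)
                score += math.log((counts[v] + self.alpha) /
                                  (n + self.alpha * (len(counts) + 1)))
            if score > best_score:
                best, best_score = c, score
        return best

# Toy mention-pair instances: x1..x6 as 0/1 values, x7 as a mention-type pair
X = [
    (1, 0, 0, 1, 1, 1, ("Name", "Name")),
    (1, 0, 0, 1, 1, 1, ("Name", "Name")),
    (0, 0, 0, 0, 1, 1, ("Name", "Nominal")),
    (0, 0, 0, 0, 1, 1, ("Pronoun", "Name")),
]
y = ["Coref", "Coref", "Not Coref", "Not Coref"]
clf = GroupedBayes().fit(X, y)
```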
Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions
Structures the search space as a Bell tree

  [1]
  [12], [1][2]
  [123], [12][3], [13][2], [1][23], [1][2][3]

Each level either adds the next mention to an existing cluster of a partition
from the previous level or starts a new cluster with it. Leaves contain all
the possible partitions of all of the mentions.

Computationally infeasible to expand all nodes in the Bell tree, so the
algorithm expands only the most promising nodes.
How to determine which nodes are promising?
Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier

Suppose the classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7

  [1]        score 1
  [12]       1 * Pc(1,2) = 1 * 0.6 = 0.6
  [1][2]     1 * (1 - Pc(1,2)) = 1 * (1 - 0.6) = 0.4
  [123]      0.6 * max(Pc(1,3), Pc(2,3)) = 0.6 * max(0.2, 0.7) = 0.42
  [12][3]    0.58
  [13][2]    0.08
  [1][23]    0.28
  [1][2][3]  0.12

Expands only the N most probable nodes at each level.
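The node scoring above amounts to a beam search over partial partitions. The sketch below uses one plausible reading of the scoring rule (multiply by Pc when joining a mention to a cluster, by 1 minus the best link when starting a new cluster); it is a conceptual sketch, not a faithful reimplementation of Luo et al. (2004).

```python
def bell_tree_search(mentions, pc, beam_width=2):
    """Beam search over partitions. pc[(i, j)]: P(mentions i, j coreferent), i < j."""
    beam = [([[mentions[0]]], 1.0)]             # root node [1], score 1
    for m in mentions[1:]:
        candidates = []
        for partition, score in beam:
            # extend: add m to each existing cluster
            for k, cluster in enumerate(partition):
                link = max(pc[(c, m)] for c in cluster)
                new_part = [list(cl) for cl in partition]
                new_part[k].append(m)
                candidates.append((new_part, score * link))
            # or start a new cluster containing only m
            best_link = max(pc[(c, m)] for cl in partition for c in cl)
            candidates.append((partition + [[m]], score * (1 - best_link)))
        candidates.sort(key=lambda ps: ps[1], reverse=True)
        beam = candidates[:beam_width]          # expand only the N best nodes
    return beam[0]

# The running example: Pc(1,2) = 0.6, Pc(1,3) = 0.2, Pc(2,3) = 0.7
pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
partition, score = bell_tree_search([1, 2, 3], pc)
```

With these probabilities the best partition found is [123], with score 0.6 * 0.7 = 0.42, matching the best-scoring node on the slide.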
Where are we?
We have described
  a learning algorithm for training a coreference classifier
  a clustering algorithm for combining coreference probabilities
Goal: evaluate this coreference system in the presence of a small amount of labeled data
Experimental Setup
The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper); each has a training set and a test set
  use one training text for training the coreference classifier; evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program: CEAF (Luo, 2005)
  recall, precision, F-measure
Evaluation Results (experiments on system mentions)

                                 Broadcast News      Newswire
                                 R     P     F       R     P     F
Weakly Supervised Baseline       53.1  45.5  49.0    57.2  50.3  53.5
Heuristic Baseline               54.3  43.7  48.4    58.9  50.2  54.2
Our EM-based Model               57.0  54.6  55.7    62.9  56.5  59.6
Duplicated Haghighi and Klein    53.2  39.3  45.2    54.5  44.2  48.8
 + Relaxed Head Generation       53.4  42.8  47.5    55.9  49.8  52.6
 + Agreement Constraints         57.8  46.3  51.4    57.9  51.5  54.5
 + Pronoun-only Salience         59.2  50.8  54.7    59.4  55.6  57.4
Fully Supervised Model           63.4  60.3  61.8    65.8  63.2  64.5

Can we improve performance by combining a small amount of labeled data and a potentially large amount of unlabeled data?
Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications
Self-Training
Given a small labeled data set L and a large unlabeled data set U:
  train a classifier h on L
  use h to label U
  move the N most confidently labeled instances from U to L
  repeat
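The loop above can be sketched generically. The fit/predict_proba interface mirrors scikit-learn conventions but any model with those two methods would do; ThresholdClassifier is a toy 1-D stand-in for the coreference classifier, not part of the talk's setup.

```python
import math

def self_train(clf, X_l, y_l, X_u, n_per_iter=10, iterations=5):
    """Generic self-training: repeatedly move the most confident predictions
    on the unlabeled pool into the labeled set and retrain."""
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    for _ in range(iterations):
        if not X_u:
            break
        clf.fit(X_l, y_l)
        probs = clf.predict_proba(X_u)
        # confidence of an instance = probability of its most likely class
        ranked = sorted(range(len(X_u)), key=lambda i: max(probs[i]), reverse=True)
        keep = set(ranked[:n_per_iter])
        for i in keep:
            X_l.append(X_u[i])
            y_l.append(max(range(len(probs[i])), key=lambda c: probs[i][c]))
        X_u = [x for i, x in enumerate(X_u) if i not in keep]
    clf.fit(X_l, y_l)
    return clf

class ThresholdClassifier:
    """Toy 1-D stand-in for the coreference classifier."""
    def fit(self, X, y):
        ones = [x[0] for x, c in zip(X, y) if c == 1]
        zeros = [x[0] for x, c in zip(X, y) if c == 0]
        self.threshold = (min(ones) + max(zeros)) / 2
        return self
    def predict_proba(self, X):
        p1 = [1 / (1 + math.exp(self.threshold - x[0])) for x in X]
        return [(1 - p, p) for p in p1]

clf = self_train(ThresholdClassifier(), [[0], [10]], [0, 1],
                 [[1], [2], [8], [9]], n_per_iter=2, iterations=2)
```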
Results (F-measure for Self-Training)
[Two plots, Broadcast News and Newswire: F-measure (43 to 55) against the number of self-training iterations (0 to 9), without bagging.]
Why doesn't Self-Training improve?
Only the most confidently labeled instances are added in each iteration
  the classifier already knows how to label these newly added instances
  not much new knowledge is gained by re-training a classifier on such newly added instances
Why does Self-Training hurt?
Also due to the bias towards confidently-labeled instances
  many confidently labeled instances are pairs of identical proper names, all labeled Coref:
    (India, India), (IBM, IBM), (prince, prince), (Clinton, Clinton), ...
  their Mention Pair Type feature value is always (Name, Name)
  the classifier gradually learns that two proper names are likely to be coreferent, regardless of whether the names are identical
Since we hypothesize that the Mention Pair Type feature is causing the problem, we repeat the experiments without using this feature.
Results (F-measure for Self-Training)
[Two plots, Broadcast News and Newswire: F-measure (43 to 55) against the number of self-training iterations (0 to 9), with and without the Mention Pair Type feature.]
Some Lessons Learned
When labeled data is scarce, feature design becomes an important issue.
When exploiting unlabeled data, it is crucial to learn from both confidently labeled and not-so-confidently labeled data.
Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications
Unsupervised Coreference as EM Clustering
Exploits unlabeled data by inducing a clustering for an unlabeled document, not by labeling mention pairs
  the EM-based model is forced to learn from all of the mention pairs when the model is retrained
Representing a Clustering
A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent
  we don't care about the diagonal entries, or about the entries below the diagonal
  the coreference relation must be transitive: if Cij = 1 and Cjk = 1, then Cik = 1
  a matrix is a valid clustering only if transitivity holds; otherwise it is invalid
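A minimal sketch of this representation and its validity check, assuming the matrix is given as a list of lists with only the upper triangle meaningful:

```python
from itertools import combinations

def is_valid_clustering(C):
    """True iff the upper triangle of C encodes a transitive coreference relation."""
    n = len(C)
    link = lambda i, j: C[min(i, j)][max(i, j)] == 1  # ignore diagonal / lower triangle
    for i, j, k in combinations(range(n), 3):
        # exactly two of the three links present means transitivity is violated
        if sum((link(i, j), link(j, k), link(i, k))) == 2:
            return False
    return True

valid = [[0, 1, 0],
         [0, 0, 0],
         [0, 0, 0]]    # clusters {1, 2} and {3}
invalid = [[0, 1, 1],
           [0, 0, 0],
           [0, 0, 0]]  # 1-2 and 1-3 coreferent but not 2-3
```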
The Generative Model
Given a document D,
  generate a clustering C according to P(C)
  generate D given C

  P(D, C) = P(C) P(D | C)

How to generate D given C?
  Assume that D is represented by its mention pairs
  To generate D, generate all pairs of mentions in D:
    (Queen Elizabeth, her), (Queen Elizabeth, husband), (Queen Elizabeth, King George VI), ...

  P(D, C) = P(C) P(mp12, mp13, mp14, ... | C)

  where mpij is the pair formed from mention i and mention j

Let's simplify this term: assume that each mention pair mpij is generated conditionally independently given Cij

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij | Cij)

How to represent a mention pair mpij?
Recall: each mention pair is represented by the 7 linguistic features in 3 groups (Strong Coreference Indicators, Linguistic Constraints, Mention Pair Type).
The Generative Model
Each mention pair mpij is a vector of 7 feature values mpij1, ..., mpij7:

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij1, mpij2, ..., mpij7 | Cij)

Let's simplify this term: assume that feature values from different groups are conditionally independent of each other given Cij

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij1, mpij2, mpij3 | Cij) P(mpij4, mpij5, mpij6 | Cij) P(mpij7 | Cij)
Model Parameters
  P(mp1, mp2, mp3 | c)
  P(mp4, mp5, mp6 | c)
  P(mp7 | c)
where the mpi are the feature values and c ∈ { Coref, Not Coref }

If we had labeled data, we could estimate the parameters. But we don't have labeled data. So ...

Use EM to iteratively
  estimate the model parameters
  probabilistically induce a clustering for each document
The Induction Algorithm
Given a set of unlabeled documents
121
The Induction Algorithm
Given a set of unlabeled documentsguess a clustering for each document according to P(C)
122
The Induction Algorithm
Given a set of unlabeled documentsguess a clustering for each document according to P(C)
Initial labelings are presumably noisy
123
The Induction Algorithm
Given a set of unlabeled documentsguess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
124
The Induction Algorithm
Given a set of unlabeled documentsguess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
125
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
126
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
[123] [12][3][13][2] [1][23][1][2][3] + invalid clusterings
127
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
+ invalid clusterings
[123] [12][3][13][2] [1][23][1][2][3]
0.23 0.21 0.11 0.29 0.05 …
128
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
+ invalid clusterings
[123] [12][3][13][2] [1][23][1][2][3]
0.23 0.21 0.11 0.29 0.05 …
Iterate till convergence
129
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
+ invalid clusterings
[123] [12][3][13][2] [1][23][1][2][3]
0.23 0.21 0.11 0.29 0.05 …
Iterate till convergence
How to cope with the computational complexity
of the E-step?
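The E-step is expensive because the number of clusterings of n mentions is the Bell number, which explodes quickly. A small sketch that enumerates every partition makes the combinatorics concrete:

```python
def partitions(mentions):
    """Enumerate every clustering of a mention list (Bell-number many)."""
    if not mentions:
        yield []
        return
    first, rest = mentions[0], mentions[1:]
    for smaller in partitions(rest):
        # put `first` into each existing cluster in turn
        for k in range(len(smaller)):
            yield smaller[:k] + [[first] + smaller[k]] + smaller[k + 1:]
        # or start a new singleton cluster
        yield [[first]] + smaller

print(sum(1 for _ in partitions([1, 2, 3])))   # 5 clusterings, as on the slide
```

Already for 10 mentions there are 115,975 clusterings, which is why the E-step must be approximated.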
130
Approximating the E-step
Search for the N most probable clusterings only
using the Bell Tree algorithm
131
Approximating the E-step
Search for the N most probable clusterings only
using the Bell Tree algorithm
[1]
├─ [12] ─ [123], [12][3]
└─ [1][2] ─ [13][2], [1][23], [1][2][3]
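One way to realize this approximation is a beam over the Bell tree: add mentions one at a time and keep only the N best partial clusterings. The scoring rule below (max link probability for joining a cluster, its complement for starting a new one) is a simplified illustration in the spirit of the Bell-tree search, not the exact formula, and `pair_prob` stands in for a hypothetical pairwise classifier:

```python
import heapq
from math import log

def n_best_clusterings(mentions, pair_prob, n=50):
    """Beam over the Bell tree: add mentions left to right and keep only
    the n most probable partial clusterings (log-space scores)."""
    beam = [(0.0, [])]                                   # (log score, clustering)
    for m in mentions:
        expanded = []
        for score, clusters in beam:
            for k in range(len(clusters)):               # join an existing cluster
                p = max(pair_prob(i, m) for i in clusters[k])
                new = [list(c) for c in clusters]
                new[k].append(m)
                expanded.append((score + log(max(p, 1e-12)), new))
            best_link = max((pair_prob(i, m) for c in clusters for i in c), default=0.0)
            new = [list(c) for c in clusters] + [[m]]    # or start a new cluster
            expanded.append((score + log(max(1.0 - best_link, 1e-12)), new))
        beam = heapq.nlargest(n, expanded, key=lambda e: e[0])
    return beam
```

With the toy probabilities used later in the talk (Pc(1,2)=0.6, Pc(1,3)=0.2, Pc(2,3)=0.7) and a wide beam, this recovers all 5 clusterings of 3 mentions.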
132
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions of each document (E-step) use the normalized scores of the 50-best clusterings
The Induction Algorithm
Iterate till convergence
133
Plan for the Talk
Supervised learning for coreference resolution
brief history
standard machine learning approach
Unsupervised learning for coreference resolution
self-training
EM clustering (Ng, 2008)
nonparametric Bayesian modeling (Haghighi and Klein, 2007)
three modifications
134
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
135
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
136
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
1Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
137
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
1 1Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
138
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
1 1 2
139
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
1 1 2
2 3
4
2 2 5
4
140
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mentionensures transitivity automatically
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
1 1 2
2 3
4
2 2 5
4
141
Haghighi and Klein’s Generative Story
142
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
143
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
Inference: Gibbs sampling
144
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
Inference: Gibbs sampling
Problem with the model: Too simplistic!
mentions with the same head likely to get the same cluster id
145
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
Inference: Gibbs sampling
Problem with the model: Too simplistic!
mentions with the same head likely to get the same cluster id
two occurrences of "she" will likely be posited as coreferent
particularly inappropriate for generating pronouns
146
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
Inference: Gibbs sampling
Problem with the model: Too simplistic!
mentions with the same head likely to get the same cluster id
Extensions:
use a separate "pronoun head model" to generate pronouns
incorporate salience
147
Plan for the Talk
Supervised learning for coreference resolution
brief history
standard machine learning approach
Unsupervised learning for coreference resolution
self-training and its variant (Ng and Cardie, 2003)
EM clustering (Ng, 2008)
nonparametric Bayesian modeling (Haghighi and Klein, 2007)
three modifications
relaxed head generation
agreement constraints
pronoun-only salience
148
Modification 1: Relaxed Head Generation
Motivation
H&K's model is linguistically impoverished
does not exploit useful knowledge: alias, appositives, …
149
Modification 1: Relaxed Head Generation
Motivation
H&K's model is linguistically impoverished
does not exploit useful knowledge: alias, appositives, …
Goal
simple method for incorporating such knowledge sources
150
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
151
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
International Business Corporation → 1
IBM → 1
Barcelona → 2
…
152
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
instead of generating the head noun, generate the head id
International Business Corporation → 1
IBM → 1
Barcelona → 2
…
153
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
instead of generating the head noun, generate the head id
the model views "International Business Corporation" and "IBM" as two mentions having the same head
International Business Corporation → 1
IBM → 1
Barcelona → 2
…
154
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
instead of generating the head noun, generate the head id
the model views "International Business Corporation" and "IBM" as two mentions having the same head
encourages the model to put the two into the same cluster
International Business Corporation → 1
IBM → 1
Barcelona → 2
…
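The pre-processing step can be sketched as below, assuming the alias and appositive pairs have already been detected (they are passed in as sets here, which is an assumption of the sketch). A union-find closure keeps the head ids consistent when the three criteria chain together:

```python
def assign_head_ids(mentions, aliases, appositives):
    """Give two mentions the same head id iff they are the same string,
    are aliases, or stand in an appositive relation."""
    parent = list(range(len(mentions)))          # union-find over mentions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            pair = (mentions[i], mentions[j])
            if mentions[i] == mentions[j] or pair in aliases or pair in appositives:
                parent[find(i)] = find(j)        # merge the two groups

    ids, head_id = {}, []                        # map each root to a dense head id
    for i in range(len(mentions)):
        head_id.append(ids.setdefault(find(i), len(ids) + 1))
    return head_id
```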
155
Modification 2: Agreement Constraints
Motivation
gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
156
Modification 2: Agreement Constraints
Motivation
gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
while the model favours the assignment of a pronoun to a gender- and number-compatible cluster
it also favours the assignment of a pronoun to a large cluster
157
Modification 2: Agreement Constraints
Motivation
gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
while the model favours the assignment of a pronoun to a gender- and number-compatible cluster
it also favours the assignment of a pronoun to a large cluster
if a cluster is large enough, the model may assign the pronoun to the cluster even if the two are not compatible
158
Modification 2: Agreement Constraints
Motivation
gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
while the model favours the assignment of a pronoun to a gender- and number-compatible cluster
it also favours the assignment of a pronoun to a large cluster
if a cluster is large enough, the model may assign the pronoun to the cluster even if the two are not compatible
Goal
implement gender and number agreement as a constraint
159
Modification 2: Agreement Constraints
disallow the generation of a mention by any cluster where the two are incompatible in number or gender
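The constraint can be sketched as a hard filter on the clusters allowed to generate a mention. The attribute-dict representation of mentions (with `None` for an unknown gender or number) is an assumption of this sketch:

```python
def allowed_clusters(mention, clusters):
    """Hard agreement constraint: a cluster may generate a mention only
    if no member clashes with it in gender or number."""
    def compatible(a, b):
        for attr in ("gender", "number"):
            if a[attr] and b[attr] and a[attr] != b[attr]:
                return False          # both known and conflicting
        return True
    return [c for c in clusters if all(compatible(mention, m) for m in c)]
```

Unlike the soft preference in H&K's model, a large but incompatible cluster is simply never a candidate.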
160
Modification 3: Pronoun-Only Salience
In H&K’s model, salience is applied to all types of mentions (pronouns, names and nominals) during cluster assignment
Our hypothesis
since names and nominals are less sensitive to salience, the net benefit of applying salience to names and nominals could be negative as a result of inaccurate modeling of salience
We restrict the application of salience to pronouns only
161
Improving Haghighi and Klein’s Model3 modifications
relaxed head generationagreement constraintspronoun-only salience
162
Evaluation
EM-based model
Haghighi and Klein's model
with and without the 3 modifications
163
Experimental Setup
The ACE 2003 coreference corpus
3 data sets (Broadcast News, Newswire, Newspaper)
For each data set
use one training text for initializing model parameters
evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
CEAF scoring program (Luo, 2005)
164
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Weakly Supervised Baseline)
Train the Bayes classifier on one (labeled) document
Use the Bell Tree clustering algorithm to impose a partition for each test document using the pairwise probabilities
165
Heuristic Baseline
Simple rule-based system
Posits two mentions as coreferent if and only if they are
the same string
aliases
in an appositive relation
166
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Heuristic Baseline)
167
EM-Based Model
Initialize the parameters using one (labeled) document
rather than using randomly guessed clusterings
168
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (EM-Based Model)
169
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (EM-Based Model)
gains in both recall and precision
F-measure increases by 5-7%
170
Duplicated Haghighi and Klein’s Model
Use the same labeled document as in the EM-based model to learn the value of the concentration parameter in the Dirichlet process
171
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Duplicated H&K’s Model)
172
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Duplicated H&K’s Model)
In comparison to EM-based model
precision drops substantially
F-measure decreases by 10-11%
173
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Adding 3 Modifications)
174
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Adding 3 Modifications)
In comparison to Duplicated Haghighi and Klein
F-measure improves after the addition of each modification
175
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Adding 3 Modifications)
In comparison to Duplicated Haghighi and Klein
F-measure improves after the addition of each modification
modest gain in recall and substantial gain in precision when all modifications are applied (9-10% gain in F-measure)
176
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Fully-Supervised Resolver)
Trained using C4.5, entire ACE training set, 34 features
Outperforms the unsupervised models by 7%
177
Using a Knowledge-Based Feature
Add a feature to the EM-based model that encodes the output of a knowledge-based coreference system
implements heuristics used by different MUC-7 resolvers
Resulting model not so "unsupervised"
178
Broadcast News Newswire Experiments on System Mentions
R P F R P F
EM-based Model (w/ KB feature) 65.4 53.3 58.8 68.1 58.2 62.8
EM-based Model (w/o KB feature) 57.0 54.6 55.7 62.9 56.5 59.6
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (EM-Based Model w/ KB Feature)
179
Summary
Examined unsupervised models for coreference resolution
self-training, EM, Haghighi and Klein's model
require little labeled data
facilitates their application to resource-scarce languages
EM-based model and modified H&K’s model outperform self-training and H&K’s original model
Not as competitive as fully-supervised model, but …
180
Summary (Cont’)… they can potentially be improved by
incorporating additional linguistic features in
feature engineering remains a challenging issuecombining a large amount of labeled data with a large amount
of unlabeled data
generative modeling is interesting in itself
181
Summary
Examined unsupervised models for coreference resolution
self-training, EM, Haghighi and Klein's model
require little labeled data
facilitates their application to resource-scarce languages
Self-training with and without bagging
Doesn't improve (and sometimes even hurts) performance
Augment labeled data with only confidently-labeled instances
Little knowledge is gained by the classifier
Careful feature design is an especially important issue
Need to label both confident and not-so-confident instances
182
Summary (Cont’)EM-based generative model
induces a clustering on an unlabeled documentoutperforms Haghighi and Klein’s coreference model
Three extensions to Haghighi and Klein’s generative model each modification improves F-measure
Not as competitive as fully-supervised modelbut … generative modeling is interesting in itselffeature engineering remains a crucial yet challenging issue
183
Weakly Supervised Baseline
Train the Naïve Bayes classifier on one (labeled) document
Use the Bell Tree clustering algorithm to impose a partition on each test document using the pairwise probabilities
184
Experimental Setup
The ACE 2003 coreference corpus
3 data sets (Broadcast News, Newswire, Newspaper)
each has a training set and a test set
use one training text for training the Bayes coreference classifier
evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
MUC scoring program (Vilain et al., 1995)
185
Experimental Setup
The ACE 2003 coreference corpus
3 data sets (Broadcast News, Newswire, Newspaper)
each has a training set and a test set
use one training text for training the Bayes coreference classifier
evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
MUC scoring program (Vilain et al., 1995)
2 problems
under-penalizes partitions where mentions are over-clustered
does not reward successful identification of singleton clusters
186
The Bayes Classifier
finds the class value y ∈ { COREF, NOT COREF } that is the most probable given the feature vector x1, …, xn
finds y* such that
y* = argmax_{y ∈ Y} P(y | x1, x2, …, xn)
   = argmax_{y ∈ Y} P(y) P(x1, x2, …, x7 | y)
   = argmax_{y ∈ Y} P(y) P(x1, x2, x3 | y) P(x4, x5, x6 | y) P(x7 | y)
These are the model parameters (to be estimated from annotated data using maximum likelihood estimation)
Not as naïve as Naïve Bayes …
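The argmax above can be sketched directly (the parameter tables below hold toy values; in the talk they would be estimated by maximum likelihood from the labeled data):

```python
from math import log

def classify(x, priors, group_params):
    """y* = argmax_y P(y) P(x1,x2,x3 | y) P(x4,x5,x6 | y) P(x7 | y),
    computed in log space for numerical stability."""
    def score(y):
        g1, g2, g3 = group_params[y]
        return log(priors[y]) + log(g1[x[0:3]]) + log(g2[x[3:6]]) + log(g3[x[6]])
    return max(priors, key=score)
```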
187
Results (Self-Training w/ and w/o Bagging)
[Two line charts, Broadcast News and Newswire: score (37-55) vs. number of iterations (0-9), comparing self-training w/ bagging (5 bags) and w/o bagging.]
188
Self-Training with Bagging
Labeled data (L)
Unlabeled data (U)
[Diagram: L and U drawn as point clouds.]
189
Self-Training with Bagging
Create k training sets, each of size |L|, by sampling from L with replacement
Train k classifiers
[Diagram: labeled data (L) and unlabeled data (U) as point clouds.]
190
Self-Training with Bagging
Bagged Classifier h1
Bagged Classifier h2
Bagged Classifier hk
[Diagram: the k bagged classifiers are trained from the bootstrap samples of L.]
192
Self-Training with Bagging
Bagged Classifier h1
Bagged Classifier h2
Bagged Classifier hk
N labeled instances with the highest average confidence
[Diagram: the bagged classifiers label U; the N instances with the highest average confidence are added to L.]
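The whole loop can be sketched as follows. The `train` callback and its `predict_proba(x) -> (label, confidence)` interface are assumptions made for illustration; the procedure otherwise follows the slides: k bootstrap bags, k classifiers, and the N unlabeled instances with the highest average confidence moved into L each round.

```python
import random

def self_train_with_bagging(L, U, train, k=5, n=10, rounds=9):
    """Self-training with bagging: repeatedly grow the labeled set L
    with the most confidently (majority-vote) labeled items of U."""
    L, U = list(L), list(U)
    for _ in range(rounds):
        # k bootstrap samples of L, one classifier per bag
        bags = [[random.choice(L) for _ in range(len(L))] for _ in range(k)]
        classifiers = [train(bag) for bag in bags]
        scored = []
        for x in U:
            preds = [clf.predict_proba(x) for clf in classifiers]
            labels = [p[0] for p in preds]
            label = max(set(labels), key=labels.count)       # majority vote
            conf = sum(p[1] for p in preds) / k              # average confidence
            scored.append((conf, x, label))
        scored.sort(key=lambda t: t[0], reverse=True)
        for conf, x, label in scored[:n]:                    # move top n into L
            L.append((x, label))
            U.remove(x)
        if not U:
            break
    return L
```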
193
Why doesn’t Self-Training improve?only the most confidently labeled instances are added in
each iterationthe classifier already knows how to label these newly added
instancesnot much new knowledge is gained by re-training a classifier
from such newly added instances
Need to learn from both the confidently and no-so-confidently labeled instances
194
Haghighi and Klein’s ModelNonparametric Bayesian model
195
Haghighi and Klein’s ModelNonparametric Bayesian model
Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)
196
Haghighi and Klein’s ModelNonparametric Bayesian model
Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)
197
Haghighi and Klein’s ModelNonparametric Bayesian model
Given a set of mentions X, find the most likely partition Z. Find the Z that maximizes
Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)
dXPXZPXZP )|(),|()|(
198
Haghighi and Klein’s ModelNonparametric Bayesian model
Given a set of mentions X, find the most likely partition Z. Find the Z that maximizes
Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)
dXPXZPXZP )|(),|()|(
Integrate out the parameters
Encode prior knowledge on hypotheses
199
Bell-Tree Clustering (Luo et al., 2004)
searches for the most probable partition of a set of mentions
structures the search space as a Bell tree
[1]
├─ [12] ─ [123], [12][3]
└─ [1][2] ─ [13][2], [1][23], [1][2][3]
200
Bell-Tree Clustering (Luo et al., 2004)
searches for the most probable partition of a set of mentions
structures the search space as a Bell tree
[1]
├─ [12] ─ [123], [12][3]
└─ [1][2] ─ [13][2], [1][23], [1][2][3]
expands only the most promising paths
201
Bell-Tree Clustering (Luo et al., 2004)
searches for the most probable partition of a set of mentions
structures the search space as a Bell tree
[1]
├─ [12] ─ [123], [12][3]
└─ [1][2] ─ [13][2], [1][23], [1][2][3]
expands only the most promising paths
How to determine which paths are promising?
202
Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier
Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7
[1]
[12]
[1][2]
1
0.6
0.4
[123]
[12][3]
[13][2]
[1][23]
[1][2][3]
0.6*(1- max (Pc(1,3), Pc(2,3))) = 0.6 * (1- max(0.2, 0.7)) = 0.58
0.42
203
Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier
Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7
[1]
[12]
[1][2]
1
0.6
0.4
[123]
[12][3]
[13][2]
[1][23]
[1][2][3]
0.42
0.58
204
Plan for the Talk
Supervised learning for coreference resolution
brief history
standard machine learning approach
Unsupervised learning for coreference resolution
self-training and its variant (Ng and Cardie, 2003)
EM clustering (Ng, 2008)
nonparametric Bayesian modeling (Haghighi and Klein, 2007)
three modifications
205
Standard Supervised Learning Approach
Classification
given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent
create one training instance for each pair of mentions from texts annotated with coreference information feature vector: describes the two mentions
train a classifier using a machine learning algorithm decision tree learner (C5), maximum entropy, SVMs
[Queen Elizabeth] set about transforming [her] [husband], ...
coref ?
not coref ?
coref ?
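The instance-creation step above can be sketched as follows (the `feature_fn` helper is hypothetical; gold chains come from the annotated text):

```python
def make_training_instances(mentions, gold_chains, feature_fn):
    """One instance per mention pair (mi, mj), i < j, labeled coreferent
    iff both mentions sit in the same gold coreference chain."""
    chain_of = {m: cid for cid, chain in enumerate(gold_chains) for m in chain}
    instances = []
    for j in range(len(mentions)):
        for i in range(j):
            mi, mj = mentions[i], mentions[j]
            label = mi in chain_of and chain_of.get(mj) == chain_of[mi]
            instances.append((feature_fn(mi, mj), label))
    return instances
```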
206
Related Work
Apply a weakly supervised or unsupervised learning algorithm to pronoun resolution
co-training (Müller et al., 2002)
self-training (Kehler et al., 2004)
207
Linguistic Features
Use 7 linguistic features divided into 3 groups
Strong Coreference Indicators
String match Appositive Alias (one is an acronym or abbreviation of the other)
Linguistic Constraints
Gender agreement Number agreement Semantic compatibility
Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
Heuristics
208
Linguistic Features
Use 7 linguistic features divided into 3 groups
Strong Coreference Indicators
String match Appositive Alias (one is an acronym or abbreviation of the other)
Linguistic Constraints
Gender agreement Number agreement Semantic compatibility
Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
How to compute the semantic class of a mention?
209
Linguistic Features
Use 7 linguistic features divided into 3 groups
Strong Coreference Indicators
String match Appositive Alias (one is an acronym or abbreviation of the other)
Linguistic Constraints
Gender agreement Number agreement Semantic compatibility
Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
How to compute the semantic class of a mention? Proper names: use a named entity recognizer Nominals: induced from an unannotated corpus
210
Inducing Semantic Classes
Goal: induce the semantic class of a nominal, focusing on PERSON, ORGANIZATION, LOCATION, and OTHERS
211
Inducing Semantic Classes
Goal: induce the semantic class of a nominal, focusing on PERSON, ORGANIZATION, LOCATION, and OTHERS
Given a large, unannotated corpus
Use a parser to extract appositive relations <Eastern Airlines, carrier>, <George Bush, president>, …
Use a named entity recognizer to find the semantic classes of the proper names
Infer the semantic class of a nominal from the associated proper name
212
Potential Problems Named entity recognizer is not perfect
Mislabels proper names
Parser is not perfect
Extracts mention pairs that are not in apposition
213
Potential Problems Named entity recognizer is not perfect
Mislabels proper names
Parser is not perfect
Extracts mention pairs that are not in apposition
To improve robustness:
1. Compute the probability that the nominal co-occurs with each of the named entity types
2. If the most likely NE type has a probability above 0.7, label the nominal with the most likely NE type
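Putting the induction procedure and the robustness fix together (a sketch: `ne_label` stands in for the named entity recognizer, and backing off to OTHERS when no NE type clears the 0.7 threshold is an assumption, not stated on the slides):

```python
from collections import Counter

def induce_semantic_classes(appositive_pairs, ne_label, threshold=0.7):
    """For each nominal, count the NE types of the proper names it is in
    apposition with; keep the majority type only when its relative
    frequency exceeds the threshold, otherwise back off to OTHERS."""
    counts = {}
    for name, nominal in appositive_pairs:
        counts.setdefault(nominal, Counter())[ne_label(name)] += 1
    classes = {}
    for nominal, c in counts.items():
        ne_type, freq = c.most_common(1)[0]
        classes[nominal] = ne_type if freq / sum(c.values()) > threshold else "OTHERS"
    return classes
```

For example, a nominal seen in apposition with ORGANIZATION names only two times out of three stays OTHERS, since 2/3 does not clear the 0.7 threshold.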
214
Broadcast News Newswire Experiments on System Mentions
MUC CEAF MUC CEAF
Weakly Supervised Baseline 38.0 49.0 42.8 53.5
Heuristic Baseline 36.4 48.4 43.2 54.2
Our EM-based Model 51.6 55.7 57.8 59.6
Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8
+ Relaxed Head Generation 47.0 47.5 45.0 52.6
+ Agreement Constraints 48.9 51.4 46.0 54.5
+ Pronoun-only Salience 52.6 54.7 50.0 57.4
Fully Supervised Model 60.4 61.8 60.6 64.5
MUC and CEAF F-Scores
215
Broadcast News Newswire Experiments on System Mentions
MUC CEAF MUC CEAF
Weakly Supervised Baseline 38.0 49.0 42.8 53.5
Heuristic Baseline 36.4 48.4 43.2 54.2
Our EM-based Model 51.6 55.7 57.8 59.6
Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8
+ Relaxed Head Generation 47.0 47.5 45.0 52.6
+ Agreement Constraints 48.9 51.4 46.0 54.5
+ Pronoun-only Salience 52.6 54.7 50.0 57.4
Fully Supervised Model 60.4 61.8 60.6 64.5
MUC and CEAF F-Scores
216
Broadcast News Newswire Experiments on System Mentions
MUC CEAF MUC CEAF
Weakly Supervised Baseline 38.0 49.0 42.8 53.5
Heuristic Baseline 36.4 48.4 43.2 54.2
Our EM-based Model 51.6 55.7 57.8 59.6
Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8
+ Relaxed Head Generation 47.0 47.5 45.0 52.6
+ Agreement Constraints 48.9 51.4 46.0 54.5
+ Pronoun-only Salience 52.6 54.7 50.0 57.4
Fully Supervised Model 60.4 61.8 60.6 64.5
MUC and CEAF F-Scores
217
Broadcast News Newswire Experiments on System Mentions
MUC CEAF MUC CEAF
Weakly Supervised Baseline 38.0 49.0 42.8 53.5
Heuristic Baseline 36.4 48.4 43.2 54.2
Our EM-based Model 51.6 55.7 57.8 59.6
Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8
+ Relaxed Head Generation 47.0 47.5 45.0 52.6
+ Agreement Constraints 48.9 51.4 46.0 54.5
+ Pronoun-only Salience 52.6 54.7 50.0 57.4
Fully Supervised Model 60.4 61.8 60.6 64.5
MUC and CEAF F-Scores
Similar performance trends across the 2 scoring programs
218
Experiments using Perfect Mentions
Perfect mentions are NPs marked up in the answer key
using them makes the coreference task somewhat easier
Similar performance trends observed
except that the unsupervised models perform comparably to the fully-supervised resolver
Conclusions drawn from system mentions are not always generalizable to perfect mentions and vice versa
219
Summary
Presented an EM-based model for unsupervised coreference resolution that
outperforms Haghighi and Klein's coreference model
compares favourably to a modified version of their model
220
H&K's Model: Salience Modeling
Each entity/cluster is initially assigned a salience value of 0
As we process the discourse, the salience value of each entity will change
When we encounter a mention, we update the salience scores (multiply each entity's salience by 0.5, then add 1 to the current entity's)
Then discretize the salience values
5 buckets: TOP, HIGH, MID, LOW, NONE
Using a separate corpus, estimate the probability P(mention type | Salience), where mention type can be pronoun, name, or nominal. E.g.,
P(pronoun | TOP) is a large value
P(nominal | TOP) is a small value
model is sensitive to these estimated values
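The decay-and-bucket scheme above can be sketched as follows. The 0.5 decay and +1 increment come from the slide; the bucket thresholds are illustrative assumptions, since the slide only names the five buckets.

```python
def update_salience(saliences, current_entity):
    """Halve every entity's salience, then add 1 to the entity of the
    mention just encountered (the update described on the slide)."""
    for e in saliences:
        saliences[e] *= 0.5
    saliences[current_entity] = saliences.get(current_entity, 0.0) + 1.0

def bucket(s):
    """Discretize a salience value into the 5 named buckets.
    These thresholds are assumptions; the slide names only the buckets."""
    if s >= 1.0:
        return "TOP"
    if s >= 0.5:
        return "HIGH"
    if s >= 0.25:
        return "MID"
    if s > 0.0:
        return "LOW"
    return "NONE"

saliences = {}
for entity in [1, 1, 2]:   # entity ids of three successive mentions
    update_salience(saliences, entity)
# entity 1 decays to 0.75; entity 2, just mentioned, sits at 1.0
```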
221
Why Salience Modeling?
Important for pronouns
For H&K, since they don't use features like apposition, modeling salience may allow the mentions in an appositive construction to be assigned the same cluster id.
222
Parameter Initialization
0.4 (true mentions) and 0.7 (system mentions)
concentration parameter: e^-4
223
Parameter Initialization
Uses one (labeled) document taken from the training set to
initialize the parameters of our EM-based model
determine the concentration parameter, α, in H&K's model
224
Experiments with Perfect Mentions
Similar performance trends observed
except that the unsupervised models perform comparably to the fully-supervised resolver
Conclusions drawn from perfect mentions are not always generalizable to system mentions and vice versa
Results obtained using perfect mentions should not be compared against those obtained using system mentions
225
Degenerate EM Baseline
Model obtained after one iteration of EM
No parameter re-estimation on the unlabeled data
226
Degenerate EM Baseline: MUC Results
Experiments on System Mentions

                                 Broadcast News          Newswire
                                 R      P      F         R      P      F
Heuristic Baseline               30.9   44.3   36.4      36.3   53.4   43.2
Degenerate EM Baseline           70.8   36.3   48.0      69.0   25.1   36.8
Our EM-based Model               42.4   66.0   51.6      55.2   60.6   57.8
Haghighi and Klein Baseline      50.8   40.7   45.2      43.0   40.9   41.9
 + Relaxed Head Generation       48.3   45.7   47.0      40.9   50.0   45.0
 + Agreement Constraints         50.4   47.5   48.9      41.7   51.2   46.0
 + Pronoun-only Salience         52.2   53.0   52.6      44.3   57.3   50.0
Fully Supervised Model           53.0   70.3   60.4      53.1   70.5   60.6
227
large gain in recall and large drop in precision (over-clustering)
F-score increases for one data set and drops for the other
228
EM-Based Model: MUC Results
In comparison to Degenerate EM:
large drop in recall, but larger gain in precision
F-score increases by 4-21%
gains attributed to exploitation of unlabeled data
Experiments on System Mentions

                                 Broadcast News          Newswire
                                 R      P      F         R      P      F
Heuristic Baseline               30.9   44.3   36.4      36.3   53.4   43.2
Our EM-based Model               42.4   66.0   51.6      55.2   60.6   57.8
Haghighi and Klein Baseline      50.8   40.7   45.2      43.0   40.9   41.9
 + Relaxed Head Generation       48.3   45.7   47.0      40.9   50.0   45.0
 + Agreement Constraints         50.4   47.5   48.9      41.7   51.2   46.0
 + Pronoun-only Salience         52.2   53.0   52.6      44.3   57.3   50.0
Fully Supervised Model           53.0   70.3   60.4      53.1   70.5   60.6
229
Experiments on System Mentions

                                 Broadcast News            Newswire
                                 MUC    CEAF   CEAFV       MUC    CEAF   CEAFV
Heuristic Baseline               36.4   48.4   46.3        43.2   54.2   50.3
Degenerate EM Baseline           48.0   39.4   35.8        36.8   27.9   26.3
Our EM-based Model               51.6   55.7   52.9        57.8   59.6   52.8
Haghighi and Klein Baseline      45.2   45.2   39.0        41.9   48.8   41.7
 + Relaxed Head Generation       47.0   47.5   42.3        45.0   52.6   46.3
 + Agreement Constraints         48.9   51.4   47.0        46.0   54.5   48.4
 + Pronoun-only Salience         52.6   54.7   51.1        50.0   57.4   51.2
Fully Supervised Model           60.4   61.8   59.9        60.6   64.5   60.6

MUC, CEAF, CEAF-Variant F-Scores
230
Degenerate EM Baseline performs the worst
231
EM-based Model outperforms Heuristic Baseline
232
Addition of each extension yields improvements in F-score
233
Extended H&K system performs comparably with EM-based model
234
Unsupervised models lag performance of the supervised model
235
Unsupervised Coreference as EM Clustering
Design a generative model that can be used to induce a clustering of the mentions in a given document
Exploit pairwise linguistic constraints gender and number agreement, semantic compatibility, …
236
Representing a Clustering
A clustering C of n mentions is an n × n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent
Facilitates the incorporation of pairwise linguistic constraints
[Two 5 × 5 example matrices: a Valid clustering (symmetric and transitive) and an Invalid one]
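A clustering matrix is valid exactly when it encodes an equivalence relation. A minimal check, with hypothetical 3-mention example matrices:

```python
def is_valid_clustering(C):
    """A Boolean coreference matrix is valid iff it encodes an
    equivalence relation: reflexive (Cii = 1), symmetric (Cij = Cji),
    and transitive (Cij and Cjk imply Cik)."""
    n = len(C)
    for i in range(n):
        if not C[i][i]:                      # reflexivity
            return False
        for j in range(n):
            if C[i][j] != C[j][i]:           # symmetry
                return False
            for k in range(n):
                if C[i][j] and C[j][k] and not C[i][k]:   # transitivity
                    return False
    return True

# hypothetical examples: clusters {1, 2} and {3} vs. a broken matrix
valid = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
invalid = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]   # 1~2 and 2~3 but not 1~3
```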
237
Features
Use 7 linguistic features
Strong Coreference Indicators: String match, Alias (one is an acronym or abbreviation of the other), Appositive
Linguistic Constraints: Gender agreement, Number agreement, Semantic compatibility
Mention Type Pairs: (ti, tj), where ti, tj ∈ { Pronoun, Proper, Common }
238
Computing the E-step
Goal: assign a probability to each possible clustering of the mentions in a document
239
Computationally intractable: number of clusterings is exponential in the number of mentions
240
Search for the N most probable clusterings only
241
using Luo et al.'s (2004) search algorithm
structure the search space as a Bell tree
242
A Bell Tree
[1]
[12]    [1][2]
[123]   [12][3]   [13][2]   [1][23]   [1][2][3]
243
The Bell-Tree Search Algorithm
Finds the N most probable paths from the root to a leaf using a beam search
The probability of a clustering (or partition) is the probability assigned to the corresponding path
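The search can be sketched as follows: each new mention either joins an existing cluster or starts its own, and only the best partial partitions survive each step. The scoring function here is a simplifying assumption (a product of pairwise coreference probabilities), not necessarily Luo et al.'s exact score.

```python
from itertools import combinations

def beam_search_partitions(n_mentions, p_coref, beam=5):
    """Beam search over the Bell tree.  At each level, mention m either
    joins an existing cluster or opens a new one; keep the `beam` best
    partial partitions.  Scoring assumption: product over mention pairs
    of p_coref(i, j) if same cluster, else 1 - p_coref(i, j)."""
    def score(partition):
        cluster_of = {m: ci for ci, c in enumerate(partition) for m in c}
        s = 1.0
        for i, j in combinations(sorted(cluster_of), 2):
            p = p_coref(i, j)
            s *= p if cluster_of[i] == cluster_of[j] else 1.0 - p
        return s

    beams = [[[0]]]                      # mention 0 starts the first cluster
    for m in range(1, n_mentions):
        expanded = []
        for part in beams:
            for ci in range(len(part)):  # m joins an existing cluster
                new = [c[:] for c in part]
                new[ci].append(m)
                expanded.append(new)
            expanded.append([c[:] for c in part] + [[m]])  # m opens a new one
        beams = sorted(expanded, key=score, reverse=True)[:beam]
    return beams

# toy pairwise model: mentions 0 and 1 likely coreferent, 2 likely not
best = beam_search_partitions(3, lambda i, j: 0.9 if {i, j} == {0, 1} else 0.1)[0]
```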
244
Degenerate EM Baseline
model that is obtained after one iteration of EM
initializes model parameters based on a labeled document
applies the model (and Bell tree search) to obtain the most probable coreference partition
no parameter re-estimation on the unlabeled data
245
Noun Phrase Coreference
Identify the noun phrases (or mentions) that refer to the same real-world entity
Partition the set of mentions into coreference equivalence classes
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment...
246
Supervised Coreference Resolution
Lots of prior work on supervised coreference resolution
Soon et al. (2001), Strube et al. (2002), Yang et al. (2003), Luo et al. (2004), Denis and Baldridge (2007), …
247
Representing a Clustering
A clustering C of n mentions is an n × n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent
[5 × 5 example matrix illustrating Reflexivity: Cii = 1 for every mention i]
248
Approximating the E-step
Search for the N most probable clusterings only
using Luo et al.'s (2004) search algorithm
structures the search space as a Bell tree
takes as input the pairwise coreference probabilities
scores a clustering based on these probabilities
249
Haghighi and Klein's Model
Cluster-level model
assigns a cluster id to each mention
ensures transitivity automatically
Nonparametric Bayesian model
does not commit to a particular set of parameters
250
Model Parameters
P(mp1, mp2, mp3 | c)
P(mp4, mp5, mp6 | c)
P(mp7 | c)
mpi are the feature values
c ∈ { Coref, Not Coref }
251
Experimental Setup
The ACE 2003 coreference corpus
3 data sets (Broadcast News, Newswire, Newspaper)
each has a training set and a test set; evaluate on the test set only
Mentions
system mentions (mentions extracted by an NP chunker)
perfect mentions (mentions extracted from the answer key)
Scoring programs: recall, precision, F-measure
MUC scoring program (Vilain et al., 1995)
CEAF scoring program (Luo, 2005)
CEAF variant: same as CEAF, but ignores singleton clusters
253
Features
Use 7 linguistic features divided into 3 groups
Strong Coreference Indicators: String match, Appositive, Alias (one is an acronym or abbreviation of the other)
Linguistic Constraints: Gender agreement, Number agreement, Semantic compatibility
Mention Type Pairs: (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
256
The Generative Model
Given a document D,
generate a clustering C according to P(C)
generate D given C

P(D, C) = P(C) P(D | C)
        = P(C) P(mp12, mp13, mp14, … | C)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mpij | Cij)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp1ij, mp2ij, …, mp7ij | Cij)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp1ij, mp2ij, mp3ij | Cij) P(mp4ij, mp5ij, mp6ij | Cij) P(mp7ij | Cij)
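Once the three group-conditional distributions are estimated, the factorization above can be computed directly. The probability tables, feature values, and P(C) value below are toy assumptions for illustration.

```python
def joint_prob(pair_features, pair_labels, p_clustering, group_probs):
    """P(D, C) = P(C) * product over mention pairs of
       P(mp1..3 | Cij) * P(mp4..6 | Cij) * P(mp7 | Cij),
    mirroring the factorization into three feature groups.
    group_probs[c][g] maps a tuple of group-g feature values to its
    probability given pair label c (toy conditional tables here)."""
    prob = p_clustering
    for pair, c in pair_labels.items():
        f = pair_features[pair]
        prob *= (group_probs[c][0][f[0:3]]      # strong indicators
                 * group_probs[c][1][f[3:6]]    # linguistic constraints
                 * group_probs[c][2][f[6:7]])   # mention type pair
    return prob

# toy example: a single mention pair with its 7 feature values
features = {(0, 1): ("match", "no-alias", "no-appos",
                     "agree", "agree", "compat", "Name-Name")}
labels = {(0, 1): "COREF"}
tables = {"COREF": [
    {("match", "no-alias", "no-appos"): 0.5},
    {("agree", "agree", "compat"): 0.8},
    {("Name-Name",): 0.4},
]}
p = joint_prob(features, labels, 0.3, tables)   # 0.3 * 0.5 * 0.8 * 0.4
```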
257
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
[123]
260
3 mentions: 1, 2, 3
[123] [12][3][13][2] [1][23][1][2][3]
0.23 0.32 0.11 0.29 0.05
261
Iterate till convergence
262
How to cope with the computational complexity of the E-step?
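The induction loop itself can be sketched as a skeleton; the three callables below are stand-ins (assumptions) for the full initialization, N-best E-step, and M-step described on these slides.

```python
def em_coreference(documents, init_params, e_step_nbest, m_step, iters=10):
    """Skeleton of the induction loop: alternate an E-step that keeps
    the N most probable clusterings of each document (with their
    probabilities) and an M-step that re-estimates the parameters from
    those weighted clusterings."""
    params = init_params
    for _ in range(iters):
        # E-step: N-best clusterings per document,
        # e.g. found with the Bell-tree beam search
        weighted = [e_step_nbest(doc, params) for doc in documents]
        # M-step: maximum-likelihood re-estimation from the soft labels
        params = m_step(weighted)
    return params

# trivial stand-ins just to exercise the loop
params = em_coreference(
    documents=["doc1", "doc2"],
    init_params=0,
    e_step_nbest=lambda doc, p: [(doc, 1.0)],   # pretend 1-best clustering
    m_step=lambda weighted: len(weighted),      # pretend re-estimation
    iters=3)
```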
263
Goals
Design a new model for unsupervised coreference resolution
Improve Haghighi and Klein’s model with three modifications
264
Evaluation Results
Broadcast News
Recall: 53.1, Precision: 45.5, F-measure: 49.0
Newswire
Recall: 57.2, Precision: 50.3, F-measure: 53.5
Can we improve performance by combining labeled and unlabeled data?
266
Haghighi and Klein's Generative Story
For each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)

EM-based Generative Model:
Create mention pairs
For each pair, guess whether it is COREF or NOT COREF according to P(COREF)
Generate feature values

H&K's Generative Model:
For each mention, guess the cluster id according to P(cluster id)
Generate feature values
268
Dirichlet Process
Generate new cluster ids as needed
Probability of generating some existing cluster id i is:
  (number of mentions already in cluster i) / (n − 1 + α)
  higher probability for larger clusters
Probability of generating some new cluster id is:
  α / (n − 1 + α)
for some constant α
273
The CEAF Scoring Program
Input: correct partition, system partition
Correct partition: {3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}
System partition: {6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}
Recast the scoring problem as bipartite matching
Find the best matching using the Hungarian Algorithm
matched pairs (overlap): {6} ↔ {6, 11, 12} (1), {2, 5, 8} ↔ {2, 7, 8} (2), {1, 9} ↔ {1, 4, 9} (2), {3} ↔ {3, 5, 10} (1)
Matching score = 6
Recall = 6 / 9 = 0.66
Prec = 6 / 12 = 0.5
F-measure = 0.57
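The computation above can be reproduced in a few lines. This sketch finds the best one-to-one matching by brute force over permutations instead of the Hungarian Algorithm; on small inputs it gives the same optimum.

```python
from itertools import permutations

def ceaf(correct, system):
    """CEAF with entity-overlap similarity: best one-to-one matching
    between correct and system clusters, found by brute force."""
    correct = [set(c) for c in correct]
    system = [set(s) for s in system]
    while len(system) < len(correct):
        system.append(set())   # pad so every correct cluster can be matched
    best = max(sum(len(c & s) for c, s in zip(correct, perm))
               for perm in permutations(system, len(correct)))
    recall = best / sum(len(c) for c in correct)
    precision = best / sum(len(s) for s in system)
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

r, p, f = ceaf(
    [[3], [4, 7], [2, 5, 8], [6], [1, 9]],
    [[6, 11, 12], [2, 7, 8], [1, 4, 9], [3, 5, 10]])
# matching score 6  ->  R = 6/9, P = 6/12, F ≈ 0.57
```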
275
Standard Supervised Learning Approach
Classification
given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent
create one training instance for each pair of mentions from a training text
feature vector: describes the two mentions
[Queen Elizabeth] set about transforming [her] [husband], ...
coref ?
not coref ?
coref ?
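The instance-creation step can be sketched as follows; the feature extractor is a hypothetical stand-in for the 7-feature vector, and the pair is labeled coreferent iff both mentions belong to the same annotated chain.

```python
from itertools import combinations

def make_instances(mentions, chains, extract_features):
    """One training instance per mention pair: (feature vector, label),
    where the label is True iff the two mentions share a chain."""
    chain_of = {m: ci for ci, chain in enumerate(chains) for m in chain}
    instances = []
    for i, j in combinations(range(len(mentions)), 2):
        label = chain_of[mentions[i]] == chain_of[mentions[j]]
        instances.append((extract_features(mentions[i], mentions[j]), label))
    return instances

mentions = ["Queen Elizabeth", "her", "husband", "King George VI"]
chains = [["Queen Elizabeth", "her"], ["husband", "King George VI"]]
# the identity extractor stands in for the real 7-feature extractor
data = make_instances(mentions, chains, lambda a, b: (a, b))
# 4 mentions -> 6 pairwise instances
```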
277
Haghighi and Klein's Generative Story
For each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
The probability of generating a particular cluster id is based on some distribution that specifies P(id=1), P(id=2), P(id=3), …
but we don't know the number of clusters a priori
so we don't know how many probabilities the distribution must specify
we need a distribution over an unknown number of clusters
280
Dirichlet Process
Generate new cluster ids as needed
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
cluster ids so far: Queen Elizabeth → 1, her → 1, husband → 2, King George VI → 2, Logue → ?
Should we generate id 1 or 2, or should we generate a new id 3?
283
Dirichlet Process
Generate new cluster ids as needed
Probability of generating some existing cluster id i is proportional to the number of mentions already in cluster i
higher probability for larger clusters
Probability of generating some new cluster id is proportional to some constant α
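Drawing a cluster id under this process can be sketched as a minimal Chinese-restaurant-process step; the α value and the cluster counts below are illustrative assumptions.

```python
import random

def sample_cluster_id(cluster_sizes, alpha, rng=random):
    """Draw a cluster id for the next mention: an existing id i with
    probability proportional to the number of mentions already in
    cluster i, a brand-new id with probability proportional to alpha."""
    total = sum(cluster_sizes.values()) + alpha
    r = rng.random() * total
    for cid, size in cluster_sizes.items():
        if r < size:
            return cid
        r -= size
    return max(cluster_sizes, default=0) + 1   # open a new cluster

rng = random.Random(0)
sizes = {1: 2, 2: 2}            # two existing clusters of two mentions each
counts = {1: 0, 2: 0, 3: 0}
for _ in range(10000):
    counts[sample_cluster_id(sizes, alpha=1.0, rng=rng)] += 1
# with alpha = 1, roughly 2/5 of the draws pick each existing cluster
# and roughly 1/5 open the new cluster 3
```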