Unsupervised Models for Coreference Resolution
Vincent Ng
Human Language Technology Research Institute
University of Texas at Dallas
Plan for the Talk
Supervised learning for coreference resolution
  how and when supervised coreference research started
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications
Machine Learning for Coreference Resolution
Started in the mid-1990s
  Connolly et al. (1994), Aone and Bennett (1995), McCarthy and Lehnert (1995)
Propelled by the availability of annotated corpora produced by
  the Message Understanding Conferences (MUC-6/7: 1995, 1998): English only
  Automatic Content Extraction (ACE 2003, 2004, 2005, 2008): English, Chinese, Arabic
Identified as an important task for information extraction
  identity coreference only
Identity Coreference
Identify the noun phrases (or mentions) that refer to the same real-world entity:

  Queen Elizabeth set about transforming her husband, King George VI, into a
  viable monarch. Logue, a renowned speech therapist, was summoned to help
  the King overcome his speech impediment...

Lots of prior work on supervised coreference resolution
Standard Supervised Learning Approach: Classification
A classifier is trained to determine whether two mentions are coreferent or not coreferent

  [Queen Elizabeth] set about transforming [her] [husband], ...
  For each pair of mentions, the classifier decides: coref or not coref?
Standard Supervised Learning Approach: Clustering
Coordinates possibly contradictory pairwise classification decisions

  [Queen Elizabeth] set about transforming [her] [husband], ...
  pairwise decisions: coref, not coref, not coref, ...

  The clustering algorithm then partitions the mentions into entities:
  { Queen Elizabeth, her }
  { husband, King George VI, the King, his }
  { Logue, a renowned speech therapist }
Standard Supervised Learning Approach
Typically relies on a large amount of labeled data
What if we only have a small amount of annotated data?
First Attempt: Supervised Learning
Train on whatever annotated data we have
Need to specify a
  learning algorithm (Bayes)
  feature set
  clustering algorithm (Bell tree)
The Bayes Classifier
Finds the class value y (Coref or Not Coref) that is the most probable given the feature vector x1, ..., xn

Finds y* such that

  y* = argmax_{y ∈ Y} P(y | x1, x2, ..., xn)
     = argmax_{y ∈ Y} P(y) P(x1, x2, ..., xn | y)

What features to use in the feature representation?
Linguistic Features
Use 7 linguistic features divided into 3 groups

Strong Coreference Indicators
  String match
  Appositive
  Alias (one is an acronym or abbreviation of the other)

Linguistic Constraints
  Gender agreement
  Number agreement
  Semantic compatibility

Mention Pair Type
  (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
  E.g., for the mention pair (Barack Obama, president-elect), the feature value is (Name, Nominal)
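As a concrete illustration, the 3-group, 7-feature representation might be extracted as follows. This is a simplified sketch: the Mention class and the string-based feature tests (and the stubbed-out appositive and semantic-compatibility checks) are hypothetical stand-ins, not the talk's actual feature extractors.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    mtype: str    # "Pronoun", "Name", or "Nominal"
    gender: str   # e.g. "male", "female", "neuter", "unknown"
    number: str   # "singular" or "plural"

def extract_features(mi, mj):
    """7-feature representation of a mention pair (hypothetical, simplified tests)."""
    return {
        # Group 1: strong coreference indicators
        "string_match": mi.text.lower() == mj.text.lower(),
        "appositive": False,      # would need parse information; stubbed out
        "alias": mi.text == "".join(w[0].upper() for w in mj.text.split()),
        # Group 2: linguistic constraints
        "gender_agree": mi.gender == mj.gender or "unknown" in (mi.gender, mj.gender),
        "number_agree": mi.number == mj.number,
        "semantic_compat": True,  # would need a semantic lexicon; stubbed out
        # Group 3: mention pair type
        "pair_type": (mi.mtype, mj.mtype),
    }

obama = Mention("Barack Obama", "Name", "male", "singular")
pe = Mention("president-elect", "Nominal", "male", "singular")
feats = extract_features(obama, pe)   # pair_type is ("Name", "Nominal")
```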
The Bayes Classifier
With the 7 features, finds y* such that

  y* = argmax_{y ∈ Y} P(y | x1, x2, ..., x7)
     = argmax_{y ∈ Y} P(y) P(x1, x2, ..., x7 | y)

But we may have a data sparseness problem, so let's simplify the likelihood term:
  assume that feature values from different groups are independent of each other given the class
The Bayes Classifier
Under this assumption,

  y* = argmax_{y ∈ Y} P(y) P(x1, x2, x3 | y) P(x4, x5, x6 | y) P(x7 | y)

These conditional distributions and P(y) are the model parameters (to be estimated from annotated data using maximum likelihood estimation).

This is a generative model: it specifies how an instance is generated
  Generate the class y with P(y)
  Given y, generate x1, x2, and x3 with P(x1, x2, x3 | y)
  Given y, generate x4, x5, and x6 with P(x4, x5, x6 | y)
  Given y, generate x7 with P(x7 | y)
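The grouped factorization above can be sketched as a small classifier: features within a group are modeled jointly, and the three groups are assumed conditionally independent given the class. The Laplace smoothing and the toy training pairs below are illustrative additions, not part of the talk.

```python
import math
from collections import Counter, defaultdict

GROUPS = [(0, 1, 2), (3, 4, 5), (6,)]  # indices of x1..x3, x4..x6, x7

class GroupedBayes:
    """Bayes classifier whose feature groups are modeled jointly but are
    assumed conditionally independent of each other given the class."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing (my addition)
        self.class_counts = Counter()
        self.group_counts = defaultdict(Counter)  # (class, group) -> joint-value counts

    def fit(self, X, y):
        for xs, c in zip(X, y):
            self.class_counts[c] += 1
            for g, idxs in enumerate(GROUPS):
                self.group_counts[(c, g)][tuple(xs[i] for i in idxs)] += 1
        return self

    def predict(self, xs):
        best, best_score = None, float("-inf")
        total = sum(self.class_counts.values())
        for c, n in self.class_counts.items():
            score = math.log(n / total)           # log P(y)
            for g, idxs in enumerate(GROUPS):
                counts = self.group_counts[(c, g)]
                v = tuple(xs[i] for i in idxs)
                # smoothed MLE of P(x_group | y)
                score += math.log((counts[v] + self.alpha) /
                                  (n + self.alpha * (len(counts) + 1)))
            if score > best_score:
                best, best_score = c, score
        return best

# Toy mention-pair instances: x1..x6 as 0/1 values, x7 as a mention-type pair
X = [
    (1, 0, 0, 1, 1, 1, ("Name", "Name")),
    (1, 0, 0, 1, 1, 1, ("Name", "Name")),
    (0, 0, 0, 0, 1, 1, ("Name", "Nominal")),
    (0, 0, 0, 0, 1, 1, ("Pronoun", "Name")),
]
y = ["Coref", "Coref", "Not Coref", "Not Coref"]
clf = GroupedBayes().fit(X, y)
```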
Bell-Tree Clustering (Luo et al., 2004)
Searches for the most probable partition of a set of mentions
Structures the search space as a Bell tree

  [1]
  [12], [1][2]
  [123], [12][3], [13][2], [1][23], [1][2][3]

Each level either adds the next mention to an existing cluster of a partition
from the previous level or starts a new cluster with it. Leaves contain all
the possible partitions of all of the mentions.

Computationally infeasible to expand all nodes in the Bell tree, so the
algorithm expands only the most promising nodes.
How to determine which nodes are promising?
Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier

Suppose the classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7

  [1]        score 1
  [12]       1 * Pc(1,2) = 1 * 0.6 = 0.6
  [1][2]     1 * (1 - Pc(1,2)) = 1 * (1 - 0.6) = 0.4
  [123]      0.6 * max(Pc(1,3), Pc(2,3)) = 0.6 * max(0.2, 0.7) = 0.42
  [12][3]    0.58
  [13][2]    0.08
  [1][23]    0.28
  [1][2][3]  0.12

Expands only the N most probable nodes at each level.
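The node scoring above amounts to a beam search over partial partitions. The sketch below uses one plausible reading of the scoring rule (multiply by Pc when joining a mention to a cluster, by 1 minus the best link when starting a new cluster); it is a conceptual sketch, not a faithful reimplementation of Luo et al. (2004).

```python
def bell_tree_search(mentions, pc, beam_width=2):
    """Beam search over partitions. pc[(i, j)]: P(mentions i, j coreferent), i < j."""
    beam = [([[mentions[0]]], 1.0)]             # root node [1], score 1
    for m in mentions[1:]:
        candidates = []
        for partition, score in beam:
            # extend: add m to each existing cluster
            for k, cluster in enumerate(partition):
                link = max(pc[(c, m)] for c in cluster)
                new_part = [list(cl) for cl in partition]
                new_part[k].append(m)
                candidates.append((new_part, score * link))
            # or start a new cluster containing only m
            best_link = max(pc[(c, m)] for cl in partition for c in cl)
            candidates.append((partition + [[m]], score * (1 - best_link)))
        candidates.sort(key=lambda ps: ps[1], reverse=True)
        beam = candidates[:beam_width]          # expand only the N best nodes
    return beam[0]

# The running example: Pc(1,2) = 0.6, Pc(1,3) = 0.2, Pc(2,3) = 0.7
pc = {(1, 2): 0.6, (1, 3): 0.2, (2, 3): 0.7}
partition, score = bell_tree_search([1, 2, 3], pc)
```

With these probabilities the best partition found is [123], with score 0.6 * 0.7 = 0.42, matching the best-scoring node on the slide.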
Where are we?
We have described
  a learning algorithm for training a coreference classifier
  a clustering algorithm for combining coreference probabilities
Goal: evaluate this coreference system in the presence of a small amount of labeled data
Experimental Setup
The ACE 2003 coreference corpus
  3 data sets (Broadcast News, Newswire, Newspaper); each has a training set and a test set
  use one training text for training the coreference classifier; evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program: CEAF (Luo, 2005)
  recall, precision, F-measure
Evaluation Results (experiments on system mentions)

                                 Broadcast News      Newswire
                                 R     P     F       R     P     F
Weakly Supervised Baseline       53.1  45.5  49.0    57.2  50.3  53.5
Heuristic Baseline               54.3  43.7  48.4    58.9  50.2  54.2
Our EM-based Model               57.0  54.6  55.7    62.9  56.5  59.6
Duplicated Haghighi and Klein    53.2  39.3  45.2    54.5  44.2  48.8
 + Relaxed Head Generation       53.4  42.8  47.5    55.9  49.8  52.6
 + Agreement Constraints         57.8  46.3  51.4    57.9  51.5  54.5
 + Pronoun-only Salience         59.2  50.8  54.7    59.4  55.6  57.4
Fully Supervised Model           63.4  60.3  61.8    65.8  63.2  64.5

Can we improve performance by combining a small amount of labeled data and a potentially large amount of unlabeled data?
Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications
Self-Training
Given a small labeled data set L and a large unlabeled data set U:
  train a classifier h on L
  use h to label U
  move the N most confidently labeled instances from U to L
  repeat
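The loop above can be sketched generically. The fit/predict_proba interface mirrors scikit-learn conventions but any model with those two methods would do; ThresholdClassifier is a toy 1-D stand-in for the coreference classifier, not part of the talk's setup.

```python
import math

def self_train(clf, X_l, y_l, X_u, n_per_iter=10, iterations=5):
    """Generic self-training: repeatedly move the most confident predictions
    on the unlabeled pool into the labeled set and retrain."""
    X_l, y_l, X_u = list(X_l), list(y_l), list(X_u)
    for _ in range(iterations):
        if not X_u:
            break
        clf.fit(X_l, y_l)
        probs = clf.predict_proba(X_u)
        # confidence of an instance = probability of its most likely class
        ranked = sorted(range(len(X_u)), key=lambda i: max(probs[i]), reverse=True)
        keep = set(ranked[:n_per_iter])
        for i in keep:
            X_l.append(X_u[i])
            y_l.append(max(range(len(probs[i])), key=lambda c: probs[i][c]))
        X_u = [x for i, x in enumerate(X_u) if i not in keep]
    clf.fit(X_l, y_l)
    return clf

class ThresholdClassifier:
    """Toy 1-D stand-in for the coreference classifier."""
    def fit(self, X, y):
        ones = [x[0] for x, c in zip(X, y) if c == 1]
        zeros = [x[0] for x, c in zip(X, y) if c == 0]
        self.threshold = (min(ones) + max(zeros)) / 2
        return self
    def predict_proba(self, X):
        p1 = [1 / (1 + math.exp(self.threshold - x[0])) for x in X]
        return [(1 - p, p) for p in p1]

clf = self_train(ThresholdClassifier(), [[0], [10]], [0, 1],
                 [[1], [2], [8], [9]], n_per_iter=2, iterations=2)
```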
Results (F-measure for Self-Training)
[Two plots, Broadcast News and Newswire: F-measure (43 to 55) against the number of self-training iterations (0 to 9), without bagging.]
Why doesn't Self-Training improve?
Only the most confidently labeled instances are added in each iteration
  the classifier already knows how to label these newly added instances
  not much new knowledge is gained by re-training a classifier on such newly added instances
Why does Self-Training hurt?
Also due to the bias towards confidently-labeled instances
  many confidently labeled instances are pairs of identical proper names, all labeled Coref:
    (India, India), (IBM, IBM), (prince, prince), (Clinton, Clinton), ...
  their Mention Pair Type feature value is always (Name, Name)
  the classifier gradually learns that two proper names are likely to be coreferent, regardless of whether the names are identical
Since we hypothesize that the Mention Pair Type feature is causing the problem, we repeat the experiments without using this feature.
Results (F-measure for Self-Training)
[Two plots, Broadcast News and Newswire: F-measure (43 to 55) against the number of self-training iterations (0 to 9), with and without the Mention Pair Type feature.]
Some Lessons Learned
When labeled data is scarce, feature design becomes an important issue.
When exploiting unlabeled data, it is crucial to learn from both confidently labeled and not-so-confidently labeled data.
Plan for the Talk
Supervised learning for coreference resolution
  brief history
  standard machine learning approach
Unsupervised learning for coreference resolution
  self-training
  EM clustering (Ng, 2008)
  nonparametric Bayesian modeling (Haghighi and Klein, 2007)
    three modifications
Unsupervised Coreference as EM Clustering
Exploits unlabeled data by inducing a clustering for an unlabeled document, not by labeling mention pairs
  the EM-based model is forced to learn from all of the mention pairs when the model is retrained
Representing a Clustering
A clustering C of n mentions is an n x n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent
  we don't care about the diagonal entries, or about the entries below the diagonal
  the coreference relation must be transitive: if Cij = 1 and Cjk = 1, then Cik = 1
  a matrix is a valid clustering only if transitivity holds; otherwise it is invalid
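A minimal sketch of this representation and its validity check, assuming the matrix is given as a list of lists with only the upper triangle meaningful:

```python
from itertools import combinations

def is_valid_clustering(C):
    """True iff the upper triangle of C encodes a transitive coreference relation."""
    n = len(C)
    link = lambda i, j: C[min(i, j)][max(i, j)] == 1  # ignore diagonal / lower triangle
    for i, j, k in combinations(range(n), 3):
        # exactly two of the three links present means transitivity is violated
        if sum((link(i, j), link(j, k), link(i, k))) == 2:
            return False
    return True

valid = [[0, 1, 0],
         [0, 0, 0],
         [0, 0, 0]]    # clusters {1, 2} and {3}
invalid = [[0, 1, 1],
           [0, 0, 0],
           [0, 0, 0]]  # 1-2 and 1-3 coreferent but not 2-3
```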
The Generative Model
Given a document D,
  generate a clustering C according to P(C)
  generate D given C

  P(D, C) = P(C) P(D | C)

How to generate D given C?
  Assume that D is represented by its mention pairs
  To generate D, generate all pairs of mentions in D:
    (Queen Elizabeth, her), (Queen Elizabeth, husband), (Queen Elizabeth, King George VI), ...

  P(D, C) = P(C) P(mp12, mp13, mp14, ... | C)

  where mpij is the pair formed from mention i and mention j

Let's simplify this term: assume that each mention pair mpij is generated conditionally independently given Cij

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij | Cij)

How to represent a mention pair mpij?
Recall: each mention pair is represented by the 7 linguistic features in 3 groups (Strong Coreference Indicators, Linguistic Constraints, Mention Pair Type).
The Generative Model
Each mention pair mpij is a vector of 7 feature values mpij1, ..., mpij7:

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij1, mpij2, ..., mpij7 | Cij)

Let's simplify this term: assume that feature values from different groups are conditionally independent of each other given Cij

  P(D, C) = P(C) ∏_{(i,j) ∈ Pairs(D)} P(mpij1, mpij2, mpij3 | Cij) P(mpij4, mpij5, mpij6 | Cij) P(mpij7 | Cij)
Model Parameters
  P(mp1, mp2, mp3 | c)
  P(mp4, mp5, mp6 | c)
  P(mp7 | c)
where the mpi are the feature values and c ∈ { Coref, Not Coref }

If we had labeled data, we could estimate the parameters. But we don't have labeled data. So ...

Use EM to iteratively
  estimate the model parameters
  probabilistically induce a clustering for each document
The Induction Algorithm
Given a set of unlabeled documents
121
The Induction Algorithm
Given a set of unlabeled documentsguess a clustering for each document according to P(C)
122
The Induction Algorithm
Given a set of unlabeled documentsguess a clustering for each document according to P(C)
Initial labelings are presumably noisy
123
The Induction Algorithm
Given a set of unlabeled documentsguess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
124
The Induction Algorithm
Given a set of unlabeled documentsguess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
125
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
126
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
[123] [12][3][13][2] [1][23][1][2][3] + invalid clusterings
127
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
+ invalid clusterings
[123] [12][3][13][2] [1][23][1][2][3]
0.23 0.21 0.11 0.29 0.05 …
128
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
+ invalid clusterings
[123] [12][3][13][2] [1][23][1][2][3]
0.23 0.21 0.11 0.29 0.05 …
Iterate till convergence
129
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
+ invalid clusterings
[123] [12][3][13][2] [1][23][1][2][3]
0.23 0.21 0.11 0.29 0.05 …
Iterate till convergence
How to cope with the computational complexity
of the E-step?
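The E-step is expensive because the number of clusterings of n mentions is the Bell number, which explodes quickly. A small sketch that enumerates every partition makes the combinatorics concrete:

```python
def partitions(mentions):
    """Enumerate every clustering of a mention list (Bell-number many)."""
    if not mentions:
        yield []
        return
    first, rest = mentions[0], mentions[1:]
    for smaller in partitions(rest):
        # put `first` into each existing cluster in turn
        for k in range(len(smaller)):
            yield smaller[:k] + [[first] + smaller[k]] + smaller[k + 1:]
        # or start a new singleton cluster
        yield [[first]] + smaller

print(sum(1 for _ in partitions([1, 2, 3])))   # 5 clusterings, as on the slide
```

Already for 10 mentions there are 115,975 clusterings, which is why the E-step must be approximated.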
130
Approximating the E-step
Search for the N most probable clusterings only
using the Bell Tree algorithm
131
Approximating the E-step
Search for the N most probable clusterings only
using the Bell Tree algorithm
[1]
├─ [12] ─ [123], [12][3]
└─ [1][2] ─ [13][2], [1][23], [1][2][3]
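One way to realize this approximation is a beam over the Bell tree: add mentions one at a time and keep only the N best partial clusterings. The scoring rule below (max link probability for joining a cluster, its complement for starting a new one) is a simplified illustration in the spirit of the Bell-tree search, not the exact formula, and `pair_prob` stands in for a hypothetical pairwise classifier:

```python
import heapq
from math import log

def n_best_clusterings(mentions, pair_prob, n=50):
    """Beam over the Bell tree: add mentions left to right and keep only
    the n most probable partial clusterings (log-space scores)."""
    beam = [(0.0, [])]                                   # (log score, clustering)
    for m in mentions:
        expanded = []
        for score, clusters in beam:
            for k in range(len(clusters)):               # join an existing cluster
                p = max(pair_prob(i, m) for i in clusters[k])
                new = [list(c) for c in clusters]
                new[k].append(m)
                expanded.append((score + log(max(p, 1e-12)), new))
            best_link = max((pair_prob(i, m) for c in clusters for i in c), default=0.0)
            new = [list(c) for c in clusters] + [[m]]    # or start a new cluster
            expanded.append((score + log(max(1.0 - best_link, 1e-12)), new))
        beam = heapq.nlargest(n, expanded, key=lambda e: e[0])
    return beam
```

With the toy probabilities used later in the talk (Pc(1,2)=0.6, Pc(1,3)=0.2, Pc(2,3)=0.7) and a wide beam, this recovers all 5 clusterings of 3 mentions.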
132
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step) maximum likelihood estimation
assign a probability to each possible clustering of the mentions of each document (E-step) use the normalized scores of the 50-best clusterings
The Induction Algorithm
Iterate till convergence
133
Plan for the Talk
Supervised learning for coreference resolution
brief history
standard machine learning approach
Unsupervised learning for coreference resolution
self-training
EM clustering (Ng, 2008)
nonparametric Bayesian modeling (Haghighi and Klein, 2007)
three modifications
134
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
135
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
136
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
1Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
137
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
1 1Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
138
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
1 1 2
139
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mention
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
1 1 2
2 3
4
2 2 5
4
140
Haghighi and Klein’s ModelCluster-level model
assigns a cluster id to each mentionensures transitivity automatically
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
1 1 2
2 3
4
2 2 5
4
141
Haghighi and Klein’s Generative Story
142
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
143
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
Inference: Gibbs sampling
144
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
Inference: Gibbs sampling
Problem with the model: Too simplistic!
mentions with the same head likely to get the same cluster id
145
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
Inference: Gibbs sampling
Problem with the model: Too simplistic!
mentions with the same head likely to get the same cluster id
two occurrences of "she" will likely be posited as coreferent
particularly inappropriate for generating pronouns
146
Haghighi and Klein’s Generative StoryFor each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
Inference: Gibbs sampling
Problem with the model: Too simplistic!
mentions with the same head likely to get the same cluster id
Extensions:
use a separate "pronoun head model" to generate pronouns
incorporate salience
147
Plan for the Talk
Supervised learning for coreference resolution
brief history
standard machine learning approach
Unsupervised learning for coreference resolution
self-training and its variant (Ng and Cardie, 2003)
EM clustering (Ng, 2008)
nonparametric Bayesian modeling (Haghighi and Klein, 2007)
three modifications
relaxed head generation
agreement constraints
pronoun-only salience
148
Modification 1: Relaxed Head Generation
Motivation
H&K's model is linguistically impoverished
does not exploit useful knowledge: alias, appositives, …
149
Modification 1: Relaxed Head Generation
Motivation
H&K's model is linguistically impoverished
does not exploit useful knowledge: alias, appositives, …
Goal
simple method for incorporating such knowledge sources
150
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
151
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
International Business Corporation → 1
IBM → 1
Barcelona → 2
…
152
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
instead of generating the head noun, generate the head id
International Business Corporation → 1
IBM → 1
Barcelona → 2
…
153
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
instead of generating the head noun, generate the head id
the model views "International Business Corporation" and "IBM" as two mentions having the same head
International Business Corporation → 1
IBM → 1
Barcelona → 2
…
154
Modification 1: Relaxed Head Generation
pre-process a document by assigning a "head id" to each mention, such that two mentions have the same head id iff
they are the same string
or they are aliases
or they are in an appositive relation
instead of generating the head noun, generate the head id
the model views "International Business Corporation" and "IBM" as two mentions having the same head
encourages the model to put the two into the same cluster
International Business Corporation → 1
IBM → 1
Barcelona → 2
…
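The pre-processing step can be sketched as below, assuming the alias and appositive pairs have already been detected (they are passed in as sets here, which is an assumption of the sketch). A union-find closure keeps the head ids consistent when the three criteria chain together:

```python
def assign_head_ids(mentions, aliases, appositives):
    """Give two mentions the same head id iff they are the same string,
    are aliases, or stand in an appositive relation."""
    parent = list(range(len(mentions)))          # union-find over mentions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            pair = (mentions[i], mentions[j])
            if mentions[i] == mentions[j] or pair in aliases or pair in appositives:
                parent[find(i)] = find(j)        # merge the two groups

    ids, head_id = {}, []                        # map each root to a dense head id
    for i in range(len(mentions)):
        head_id.append(ids.setdefault(find(i), len(ids) + 1))
    return head_id
```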
155
Modification 2: Agreement Constraints
Motivation
gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
156
Modification 2: Agreement Constraints
Motivation
gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
while the model favours the assignment of a pronoun to a gender- and number-compatible cluster
it also favours the assignment of a pronoun to a large cluster
157
Modification 2: Agreement Constraints
Motivation
gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
while the model favours the assignment of a pronoun to a gender- and number-compatible cluster
it also favours the assignment of a pronoun to a large cluster
if a cluster is large enough, the model may assign the pronoun to the cluster even if the two are not compatible
158
Modification 2: Agreement Constraints
Motivation
gender and number agreement is implemented as a preference, not as a constraint, in H&K’s model
while the model favours the assignment of a pronoun to a gender- and number-compatible cluster
it also favours the assignment of a pronoun to a large cluster
if a cluster is large enough, the model may assign the pronoun to the cluster even if the two are not compatible
Goal
implement gender and number agreement as a constraint
159
Modification 2: Agreement Constraints
disallow the generation of a mention by any cluster where the two are incompatible in number or gender
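The constraint can be sketched as a hard filter on the clusters allowed to generate a mention. The attribute-dict representation of mentions (with `None` for an unknown gender or number) is an assumption of this sketch:

```python
def allowed_clusters(mention, clusters):
    """Hard agreement constraint: a cluster may generate a mention only
    if no member clashes with it in gender or number."""
    def compatible(a, b):
        for attr in ("gender", "number"):
            if a[attr] and b[attr] and a[attr] != b[attr]:
                return False          # both known and conflicting
        return True
    return [c for c in clusters if all(compatible(mention, m) for m in c)]
```

Unlike the soft preference in H&K's model, a large but incompatible cluster is simply never a candidate.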
160
Modification 3: Pronoun-Only Salience
In H&K’s model, salience is applied to all types of mentions (pronouns, names and nominals) during cluster assignment
Our hypothesis
since names and nominals are less sensitive to salience, the net benefit of applying salience to names and nominals could be negative as a result of inaccurate modeling of salience
We restrict the application of salience to pronouns only
161
Improving Haghighi and Klein’s Model3 modifications
relaxed head generationagreement constraintspronoun-only salience
162
Evaluation
EM-based model
Haghighi and Klein's model
with and without the 3 modifications
163
Experimental Setup
The ACE 2003 coreference corpus
3 data sets (Broadcast News, Newswire, Newspaper)
For each data set
use one training text for initializing model parameters
evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
CEAF scoring program (Luo, 2005)
164
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Weakly Supervised Baseline)
Train the Bayes classifier on one (labeled) document
Use the Bell Tree clustering algorithm to impose a partition for each test document using the pairwise probabilities
165
Heuristic Baseline
Simple rule-based system
Posits two mentions as coreferent if and only if they are
the same string
aliases
in an appositive relation
166
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Heuristic Baseline)
167
EM-Based Model
Initialize the parameters using one (labeled) document
rather than using randomly guessed clusterings
168
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (EM-Based Model)
169
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (EM-Based Model)
gains in both recall and precision
F-measure increases by 5-7%
170
Duplicated Haghighi and Klein’s Model
Use the same labeled document as in the EM-based model to learn the value of the concentration parameter in the Dirichlet process
171
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Duplicated H&K’s Model)
172
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Duplicated H&K’s Model)
In comparison to EM-based model
precision drops substantially
F-measure decreases by 10-11%
173
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Adding 3 Modifications)
174
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Adding 3 Modifications)
In comparison to Duplicated Haghighi and Klein
F-measure improves after the addition of each modification
175
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Adding 3 Modifications)
In comparison to Duplicated Haghighi and Klein
F-measure improves after the addition of each modification
modest gain in recall and substantial gain in precision when all modifications are applied (9-10% gain in F-measure)
176
Broadcast News Newswire Experiments on System Mentions
R P F R P F
Weakly Supervised Baseline 53.1 45.5 49.0 57.2 50.3 53.5
Heuristic Baseline 54.3 43.7 48.4 58.9 50.2 54.2
Our EM-based Model 57.0 54.6 55.7 62.9 56.5 59.6
Duplicated Haghighi and Klein 53.2 39.3 45.2 54.5 44.2 48.8
+ Relaxed Head Generation 53.4 42.8 47.5 55.9 49.8 52.6
+ Agreement Constraints 57.8 46.3 51.4 57.9 51.5 54.5
+ Pronoun-only Salience 59.2 50.8 54.7 59.4 55.6 57.4
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (Fully-Supervised Resolver)
Trained using C4.5, entire ACE training set, 34 features
Outperforms the unsupervised models by 7%
177
Using a Knowledge-Based Feature
Add a feature to the EM-based model that encodes the output of a knowledge-based coreference system
implements heuristics used by different MUC-7 resolvers
Resulting model not so "unsupervised"
178
Broadcast News Newswire Experiments on System Mentions
R P F R P F
EM-based Model (w/ KB feature) 65.4 53.3 58.8 68.1 58.2 62.8
EM-based Model (w/o KB feature) 57.0 54.6 55.7 62.9 56.5 59.6
Fully Supervised Model 63.4 60.3 61.8 65.8 63.2 64.5
Results (EM-Based Model w/ KB Feature)
179
Summary
Examined unsupervised models for coreference resolution
self-training, EM, Haghighi and Klein's model
require little labeled data
facilitates their application to resource-scarce languages
EM-based model and modified H&K’s model outperform self-training and H&K’s original model
Not as competitive as fully-supervised model, but …
180
Summary (Cont’)… they can potentially be improved by
incorporating additional linguistic features in
feature engineering remains a challenging issuecombining a large amount of labeled data with a large amount
of unlabeled data
generative modeling is interesting in itself
181
Summary
Examined unsupervised models for coreference resolution
self-training, EM, Haghighi and Klein's model
require little labeled data
facilitates their application to resource-scarce languages
Self-training with and without bagging
Doesn't improve (and sometimes even hurts) performance
Augment labeled data with only confidently-labeled instances
Little knowledge is gained by the classifier
Careful feature design is an especially important issue
Need to label both confident and not-so-confident instances
182
Summary (Cont’)EM-based generative model
induces a clustering on an unlabeled documentoutperforms Haghighi and Klein’s coreference model
Three extensions to Haghighi and Klein’s generative model each modification improves F-measure
Not as competitive as fully-supervised modelbut … generative modeling is interesting in itselffeature engineering remains a crucial yet challenging issue
183
Weakly Supervised Baseline
Train the Naïve Bayes classifier on one (labeled) document
Use the Bell Tree clustering algorithm to impose a partition on each test document using the pairwise probabilities
184
Experimental Setup
The ACE 2003 coreference corpus
3 data sets (Broadcast News, Newswire, Newspaper)
each has a training set and a test set
use one training text for training the Bayes coreference classifier
evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
MUC scoring program (Vilain et al., 1995)
185
Experimental Setup
The ACE 2003 coreference corpus
3 data sets (Broadcast News, Newswire, Newspaper)
each has a training set and a test set
use one training text for training the Bayes coreference classifier
evaluate on the entire test set
Mentions extracted automatically using an NP chunker
Scoring program
MUC scoring program (Vilain et al., 1995)
2 problems
under-penalizes partitions where mentions are over-clustered
does not reward successful identification of singleton clusters
186
The Bayes Classifier
finds the class value y ∈ { COREF, NOT COREF } that is the most probable given the feature vector x1, …, xn
finds y* such that
y* = argmax_{y ∈ Y} P(y | x1, x2, …, xn)
   = argmax_{y ∈ Y} P(y) P(x1, x2, …, x7 | y)
   = argmax_{y ∈ Y} P(y) P(x1, x2, x3 | y) P(x4, x5, x6 | y) P(x7 | y)
These are the model parameters (to be estimated from annotated data using maximum likelihood estimation)
Not as naïve as Naïve Bayes …
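The argmax above can be sketched directly (the parameter tables below hold toy values; in the talk they would be estimated by maximum likelihood from the labeled data):

```python
from math import log

def classify(x, priors, group_params):
    """y* = argmax_y P(y) P(x1,x2,x3 | y) P(x4,x5,x6 | y) P(x7 | y),
    computed in log space for numerical stability."""
    def score(y):
        g1, g2, g3 = group_params[y]
        return log(priors[y]) + log(g1[x[0:3]]) + log(g2[x[3:6]]) + log(g3[x[6]])
    return max(priors, key=score)
```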
187
Results (Self-Training w/ and w/o Bagging)
[Two line charts, Broadcast News and Newswire: score (37-55) vs. number of iterations (0-9), comparing self-training w/ bagging (5 bags) and w/o bagging.]
188
Self-Training with Bagging
Labeled data (L)
Unlabeled data (U)
[Diagram: L and U drawn as point clouds.]
189
Self-Training with Bagging
Create k training sets, each of size |L|, by sampling from L with replacement
Train k classifiers
[Diagram: labeled data (L) and unlabeled data (U) as point clouds.]
190
Self-Training with Bagging
Bagged Classifier h1
Bagged Classifier h2
Bagged Classifier hk
[Diagram: the k bagged classifiers are trained from the bootstrap samples of L.]
192
Self-Training with Bagging
Bagged Classifier h1
Bagged Classifier h2
Bagged Classifier hk
N labeled instances with the highest average confidence
[Diagram: the bagged classifiers label U; the N instances with the highest average confidence are added to L.]
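The whole loop can be sketched as follows. The `train` callback and its `predict_proba(x) -> (label, confidence)` interface are assumptions made for illustration; the procedure otherwise follows the slides: k bootstrap bags, k classifiers, and the N unlabeled instances with the highest average confidence moved into L each round.

```python
import random

def self_train_with_bagging(L, U, train, k=5, n=10, rounds=9):
    """Self-training with bagging: repeatedly grow the labeled set L
    with the most confidently (majority-vote) labeled items of U."""
    L, U = list(L), list(U)
    for _ in range(rounds):
        # k bootstrap samples of L, one classifier per bag
        bags = [[random.choice(L) for _ in range(len(L))] for _ in range(k)]
        classifiers = [train(bag) for bag in bags]
        scored = []
        for x in U:
            preds = [clf.predict_proba(x) for clf in classifiers]
            labels = [p[0] for p in preds]
            label = max(set(labels), key=labels.count)       # majority vote
            conf = sum(p[1] for p in preds) / k              # average confidence
            scored.append((conf, x, label))
        scored.sort(key=lambda t: t[0], reverse=True)
        for conf, x, label in scored[:n]:                    # move top n into L
            L.append((x, label))
            U.remove(x)
        if not U:
            break
    return L
```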
193
Why doesn’t Self-Training improve?only the most confidently labeled instances are added in
each iterationthe classifier already knows how to label these newly added
instancesnot much new knowledge is gained by re-training a classifier
from such newly added instances
Need to learn from both the confidently and no-so-confidently labeled instances
194
Haghighi and Klein’s ModelNonparametric Bayesian model
195
Haghighi and Klein’s ModelNonparametric Bayesian model
Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)
196
Haghighi and Klein’s ModelNonparametric Bayesian model
Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)
197
Haghighi and Klein’s ModelNonparametric Bayesian model
Given a set of mentions X, find the most likely partition Z. Find the Z that maximizes
Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)
dXPXZPXZP )|(),|()|(
198
Haghighi and Klein’s ModelNonparametric Bayesian model
Given a set of mentions X, find the most likely partition Z. Find the Z that maximizes
Enables the use of prior knowledge to put a higher probability on hypotheses deemed more likely
Don’t commit to a particular set of parameters (don’t attempt to compute the most likely hypothesis)
dXPXZPXZP )|(),|()|(
Integrate out the parameters
Encode prior knowledge on hypotheses
199
Bell-Tree Clustering (Luo et al., 2004)
searches for the most probable partition of a set of mentions
structures the search space as a Bell tree
[1]
├─ [12] ─ [123], [12][3]
└─ [1][2] ─ [13][2], [1][23], [1][2][3]
200
Bell-Tree Clustering (Luo et al., 2004)
searches for the most probable partition of a set of mentions
structures the search space as a Bell tree
[1]
├─ [12] ─ [123], [12][3]
└─ [1][2] ─ [13][2], [1][23], [1][2][3]
expands only the most promising paths
201
Bell-Tree Clustering (Luo et al., 2004)
searches for the most probable partition of a set of mentions
structures the search space as a Bell tree
[1]
├─ [12] ─ [123], [12][3]
└─ [1][2] ─ [13][2], [1][23], [1][2][3]
expands only the most promising paths
How to determine which paths are promising?
202
Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier
Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7
[1]
[12]
[1][2]
1
0.6
0.4
[123]
[12][3]
[13][2]
[1][23]
[1][2][3]
0.6*(1- max (Pc(1,3), Pc(2,3))) = 0.6 * (1- max(0.2, 0.7)) = 0.58
0.42
203
Determining the Most Promising Paths
Idea: assign a score to each node, based on the pairwise probabilities returned by the coreference classifier
Classifier gives us: Pc(1, 2) = 0.6, Pc(1, 3) = 0.2, Pc(2, 3) = 0.7
[1]
[12]
[1][2]
1
0.6
0.4
[123]
[12][3]
[13][2]
[1][23]
[1][2][3]
0.42
0.58
204
Plan for the Talk
Supervised learning for coreference resolution
brief history
standard machine learning approach
Unsupervised learning for coreference resolution
self-training and its variant (Ng and Cardie, 2003)
EM clustering (Ng, 2008)
nonparametric Bayesian modeling (Haghighi and Klein, 2007)
three modifications
205
Standard Supervised Learning Approach
Classification
given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent
create one training instance for each pair of mentions from texts annotated with coreference information feature vector: describes the two mentions
train a classifier using a machine learning algorithm decision tree learner (C5), maximum entropy, SVMs
[Queen Elizabeth] set about transforming [her] [husband], ...
coref ?
not coref ?
coref ?
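The instance-creation step above can be sketched as follows (the `feature_fn` helper is hypothetical; gold chains come from the annotated text):

```python
def make_training_instances(mentions, gold_chains, feature_fn):
    """One instance per mention pair (mi, mj), i < j, labeled coreferent
    iff both mentions sit in the same gold coreference chain."""
    chain_of = {m: cid for cid, chain in enumerate(gold_chains) for m in chain}
    instances = []
    for j in range(len(mentions)):
        for i in range(j):
            mi, mj = mentions[i], mentions[j]
            label = mi in chain_of and chain_of.get(mj) == chain_of[mi]
            instances.append((feature_fn(mi, mj), label))
    return instances
```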
206
Related Work
Apply a weakly supervised or unsupervised learning algorithm to pronoun resolution
co-training (Müller et al., 2002)
self-training (Kehler et al., 2004)
207
Linguistic Features
Use 7 linguistic features divided into 3 groups
Strong Coreference Indicators
String match Appositive Alias (one is an acronym or abbreviation of the other)
Linguistic Constraints
Gender agreement Number agreement Semantic compatibility
Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
Heuristics
208
Linguistic Features
Use 7 linguistic features divided into 3 groups
Strong Coreference Indicators
String match Appositive Alias (one is an acronym or abbreviation of the other)
Linguistic Constraints
Gender agreement Number agreement Semantic compatibility
Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
How to compute the semantic class of a mention?
209
Linguistic Features
Use 7 linguistic features divided into 3 groups
Strong Coreference Indicators
String match Appositive Alias (one is an acronym or abbreviation of the other)
Linguistic Constraints
Gender agreement Number agreement Semantic compatibility
Mention Type Pairs (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
How to compute the semantic class of a mention? Proper names: use a named entity recognizer Nominals: induced from an unannotated corpus
210
Inducing Semantic Classes
Goal: induce the semantic class of a nominal, focusing on PERSON, ORGANIZATION, LOCATION, and OTHERS
211
Inducing Semantic Classes
Goal: induce the semantic class of a nominal, focusing on PERSON, ORGANIZATION, LOCATION, and OTHERS
Given a large, unannotated corpus
Use a parser to extract appositive relations <Eastern Airlines, carrier>, <George Bush, president>, …
Use a named entity recognizer to find the semantic classes of the proper names
Infer the semantic class of a nominal from the associated proper name
212
Potential Problems Named entity recognizer is not perfect
Mislabels proper names
Parser is not perfect
Extracts mention pairs that are not in apposition
213
Potential Problems Named entity recognizer is not perfect
Mislabels proper names
Parser is not perfect
Extracts mention pairs that are not in apposition
To improve robustness:
1. Compute the probability that the nominal co-occurs with each of the named entity types
2. If the most likely NE type has a probability above 0.7, label the nominal with the most likely NE type
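Putting the induction procedure and the robustness fix together (a sketch: `ne_label` stands in for the named entity recognizer, and backing off to OTHERS when no NE type clears the 0.7 threshold is an assumption, not stated on the slides):

```python
from collections import Counter

def induce_semantic_classes(appositive_pairs, ne_label, threshold=0.7):
    """For each nominal, count the NE types of the proper names it is in
    apposition with; keep the majority type only when its relative
    frequency exceeds the threshold, otherwise back off to OTHERS."""
    counts = {}
    for name, nominal in appositive_pairs:
        counts.setdefault(nominal, Counter())[ne_label(name)] += 1
    classes = {}
    for nominal, c in counts.items():
        ne_type, freq = c.most_common(1)[0]
        classes[nominal] = ne_type if freq / sum(c.values()) > threshold else "OTHERS"
    return classes
```

For example, a nominal seen in apposition with ORGANIZATION names only two times out of three stays OTHERS, since 2/3 does not clear the 0.7 threshold.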
214
Broadcast News Newswire Experiments on System Mentions
MUC CEAF MUC CEAF
Weakly Supervised Baseline 38.0 49.0 42.8 53.5
Heuristic Baseline 36.4 48.4 43.2 54.2
Our EM-based Model 51.6 55.7 57.8 59.6
Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8
+ Relaxed Head Generation 47.0 47.5 45.0 52.6
+ Agreement Constraints 48.9 51.4 46.0 54.5
+ Pronoun-only Salience 52.6 54.7 50.0 57.4
Fully Supervised Model 60.4 61.8 60.6 64.5
MUC and CEAF F-Scores
215
Broadcast News Newswire Experiments on System Mentions
MUC CEAF MUC CEAF
Weakly Supervised Baseline 38.0 49.0 42.8 53.5
Heuristic Baseline 36.4 48.4 43.2 54.2
Our EM-based Model 51.6 55.7 57.8 59.6
Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8
+ Relaxed Head Generation 47.0 47.5 45.0 52.6
+ Agreement Constraints 48.9 51.4 46.0 54.5
+ Pronoun-only Salience 52.6 54.7 50.0 57.4
Fully Supervised Model 60.4 61.8 60.6 64.5
MUC and CEAF F-Scores
216
Broadcast News Newswire Experiments on System Mentions
MUC CEAF MUC CEAF
Weakly Supervised Baseline 38.0 49.0 42.8 53.5
Heuristic Baseline 36.4 48.4 43.2 54.2
Our EM-based Model 51.6 55.7 57.8 59.6
Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8
+ Relaxed Head Generation 47.0 47.5 45.0 52.6
+ Agreement Constraints 48.9 51.4 46.0 54.5
+ Pronoun-only Salience 52.6 54.7 50.0 57.4
Fully Supervised Model 60.4 61.8 60.6 64.5
MUC and CEAF F-Scores
217
Broadcast News Newswire Experiments on System Mentions
MUC CEAF MUC CEAF
Weakly Supervised Baseline 38.0 49.0 42.8 53.5
Heuristic Baseline 36.4 48.4 43.2 54.2
Our EM-based Model 51.6 55.7 57.8 59.6
Duplicated Haghighi and Klein 45.2 45.2 41.9 48.8
+ Relaxed Head Generation 47.0 47.5 45.0 52.6
+ Agreement Constraints 48.9 51.4 46.0 54.5
+ Pronoun-only Salience 52.6 54.7 50.0 57.4
Fully Supervised Model 60.4 61.8 60.6 64.5
MUC and CEAF F-Scores
Similar performance trends across the 2 scoring programs
218
Experiments using Perfect Mentions
Perfect mentions are NPs marked up in the answer key
using them makes the coreference task somewhat easier
Similar performance trends observed
except that the unsupervised models perform comparably to the fully-supervised resolver
Conclusions drawn from system mentions are not always generalizable to perfect mentions and vice versa
219
Summary
Presented an EM-based model for unsupervised coreference resolution that
outperforms Haghighi and Klein's coreference model
compares favourably to a modified version of their model
220
H&K's Model: Salience Modeling
Each entity/cluster is initially assigned a salience value of 0
As we process the discourse, the salience value of each entity will change
When we encounter a mention, we update the salience scores (multiply each entity's salience by 0.5, then add 1 to the current entity's)
Then discretize the salience values
5 buckets: TOP, HIGH, MID, LOW, NONE
Using a separate corpus, estimate the probability P(mention type | Salience), where mention type can be pronoun, name, or nominal. E.g.,
P(pronoun | TOP) is a large value
P(nominal | TOP) is a small value
model is sensitive to these estimated values
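The decay-and-bucket scheme above can be sketched as follows. The 0.5 decay and +1 increment come from the slide; the bucket thresholds are illustrative assumptions, since the slide only names the five buckets.

```python
def update_salience(saliences, current_entity):
    """Halve every entity's salience, then add 1 to the entity of the
    mention just encountered (the update described on the slide)."""
    for e in saliences:
        saliences[e] *= 0.5
    saliences[current_entity] = saliences.get(current_entity, 0.0) + 1.0

def bucket(s):
    """Discretize a salience value into the 5 named buckets.
    These thresholds are assumptions; the slide names only the buckets."""
    if s >= 1.0:
        return "TOP"
    if s >= 0.5:
        return "HIGH"
    if s >= 0.25:
        return "MID"
    if s > 0.0:
        return "LOW"
    return "NONE"

saliences = {}
for entity in [1, 1, 2]:   # entity ids of three successive mentions
    update_salience(saliences, entity)
# entity 1 decays to 0.75; entity 2, just mentioned, sits at 1.0
```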
221
Why Salience Modeling?
Important for pronouns
For H&K, since they don't use features like apposition, modeling salience may allow the mentions in an appositive construction to be assigned the same cluster id.
222
Parameter Initialization
0.4 (true mentions) and 0.7 (system mentions)
concentration parameter: e^-4
223
Parameter Initialization
Uses one (labeled) document taken from the training set to
initialize the parameters of our EM-based model
determine the concentration parameter, α, in H&K's model
224
Experiments with Perfect Mentions
Similar performance trends observed
except that the unsupervised models perform comparably to the fully-supervised resolver
Conclusions drawn from perfect mentions are not always generalizable to system mentions and vice versa
Results obtained using perfect mentions should not be compared against those obtained using system mentions
225
Degenerate EM Baseline
Model obtained after one iteration of EM
No parameter re-estimation on the unlabeled data
226
Degenerate EM Baseline: MUC Results
Experiments on System Mentions

                                 Broadcast News          Newswire
                                 R      P      F         R      P      F
Heuristic Baseline               30.9   44.3   36.4      36.3   53.4   43.2
Degenerate EM Baseline           70.8   36.3   48.0      69.0   25.1   36.8
Our EM-based Model               42.4   66.0   51.6      55.2   60.6   57.8
Haghighi and Klein Baseline      50.8   40.7   45.2      43.0   40.9   41.9
 + Relaxed Head Generation       48.3   45.7   47.0      40.9   50.0   45.0
 + Agreement Constraints         50.4   47.5   48.9      41.7   51.2   46.0
 + Pronoun-only Salience         52.2   53.0   52.6      44.3   57.3   50.0
Fully Supervised Model           53.0   70.3   60.4      53.1   70.5   60.6
227
large gain in recall and large drop in precision (over-clustering)
F-score increases for one data set and drops for the other
228
EM-Based Model: MUC Results
In comparison to Degenerate EM:
large drop in recall, but larger gain in precision
F-score increases by 4-21%
gains attributed to exploitation of unlabeled data
Experiments on System Mentions

                                 Broadcast News          Newswire
                                 R      P      F         R      P      F
Heuristic Baseline               30.9   44.3   36.4      36.3   53.4   43.2
Our EM-based Model               42.4   66.0   51.6      55.2   60.6   57.8
Haghighi and Klein Baseline      50.8   40.7   45.2      43.0   40.9   41.9
 + Relaxed Head Generation       48.3   45.7   47.0      40.9   50.0   45.0
 + Agreement Constraints         50.4   47.5   48.9      41.7   51.2   46.0
 + Pronoun-only Salience         52.2   53.0   52.6      44.3   57.3   50.0
Fully Supervised Model           53.0   70.3   60.4      53.1   70.5   60.6
229
Experiments on System Mentions

                                 Broadcast News            Newswire
                                 MUC    CEAF   CEAFV       MUC    CEAF   CEAFV
Heuristic Baseline               36.4   48.4   46.3        43.2   54.2   50.3
Degenerate EM Baseline           48.0   39.4   35.8        36.8   27.9   26.3
Our EM-based Model               51.6   55.7   52.9        57.8   59.6   52.8
Haghighi and Klein Baseline      45.2   45.2   39.0        41.9   48.8   41.7
 + Relaxed Head Generation       47.0   47.5   42.3        45.0   52.6   46.3
 + Agreement Constraints         48.9   51.4   47.0        46.0   54.5   48.4
 + Pronoun-only Salience         52.6   54.7   51.1        50.0   57.4   51.2
Fully Supervised Model           60.4   61.8   59.9        60.6   64.5   60.6

MUC, CEAF, CEAF-Variant F-Scores
230
Degenerate EM Baseline performs the worst
231
EM-based Model outperforms Heuristic Baseline
232
Addition of each extension yields improvements in F-score
233
Extended H&K system performs comparably with EM-based model
234
Unsupervised models lag performance of the supervised model
235
Unsupervised Coreference as EM Clustering
Design a generative model that can be used to induce a clustering of the mentions in a given document
Exploit pairwise linguistic constraints gender and number agreement, semantic compatibility, …
236
Representing a Clustering
A clustering C of n mentions is an n × n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent
Facilitates the incorporation of pairwise linguistic constraints
[Two 5 × 5 example matrices: a Valid clustering (symmetric and transitive) and an Invalid one]
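A clustering matrix is valid exactly when it encodes an equivalence relation. A minimal check, with hypothetical 3-mention example matrices:

```python
def is_valid_clustering(C):
    """A Boolean coreference matrix is valid iff it encodes an
    equivalence relation: reflexive (Cii = 1), symmetric (Cij = Cji),
    and transitive (Cij and Cjk imply Cik)."""
    n = len(C)
    for i in range(n):
        if not C[i][i]:                      # reflexivity
            return False
        for j in range(n):
            if C[i][j] != C[j][i]:           # symmetry
                return False
            for k in range(n):
                if C[i][j] and C[j][k] and not C[i][k]:   # transitivity
                    return False
    return True

# hypothetical examples: clusters {1, 2} and {3} vs. a broken matrix
valid = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
invalid = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]   # 1~2 and 2~3 but not 1~3
```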
237
Features
Use 7 linguistic features
Strong Coreference Indicators: String match, Alias (one is an acronym or abbreviation of the other), Appositive
Linguistic Constraints: Gender agreement, Number agreement, Semantic compatibility
Mention Type Pairs: (ti, tj), where ti, tj ∈ { Pronoun, Proper, Common }
238
Computing the E-step
Goal: assign a probability to each possible clustering of the mentions in a document
239
Computationally intractable: number of clusterings is exponential in the number of mentions
240
Search for the N most probable clusterings only
241
using Luo et al.'s (2004) search algorithm
structure the search space as a Bell tree
242
A Bell Tree
[1]
[12]    [1][2]
[123]   [12][3]   [13][2]   [1][23]   [1][2][3]
243
The Bell-Tree Search Algorithm
Finds the N most probable paths from the root to a leaf using a beam search
The probability of a clustering (or partition) is the probability assigned to the corresponding path
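The search can be sketched as follows: each new mention either joins an existing cluster or starts its own, and only the best partial partitions survive each step. The scoring function here is a simplifying assumption (a product of pairwise coreference probabilities), not necessarily Luo et al.'s exact score.

```python
from itertools import combinations

def beam_search_partitions(n_mentions, p_coref, beam=5):
    """Beam search over the Bell tree.  At each level, mention m either
    joins an existing cluster or opens a new one; keep the `beam` best
    partial partitions.  Scoring assumption: product over mention pairs
    of p_coref(i, j) if same cluster, else 1 - p_coref(i, j)."""
    def score(partition):
        cluster_of = {m: ci for ci, c in enumerate(partition) for m in c}
        s = 1.0
        for i, j in combinations(sorted(cluster_of), 2):
            p = p_coref(i, j)
            s *= p if cluster_of[i] == cluster_of[j] else 1.0 - p
        return s

    beams = [[[0]]]                      # mention 0 starts the first cluster
    for m in range(1, n_mentions):
        expanded = []
        for part in beams:
            for ci in range(len(part)):  # m joins an existing cluster
                new = [c[:] for c in part]
                new[ci].append(m)
                expanded.append(new)
            expanded.append([c[:] for c in part] + [[m]])  # m opens a new one
        beams = sorted(expanded, key=score, reverse=True)[:beam]
    return beams

# toy pairwise model: mentions 0 and 1 likely coreferent, 2 likely not
best = beam_search_partitions(3, lambda i, j: 0.9 if {i, j} == {0, 1} else 0.1)[0]
```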
244
Degenerate EM Baseline
model that is obtained after one iteration of EM
initializes model parameters based on a labeled document
applies the model (and Bell tree search) to obtain the most probable coreference partition
no parameter re-estimation on the unlabeled data
245
Noun Phrase Coreference
Identify the noun phrases (or mentions) that refer to the same real-world entity
Partition the set of mentions into coreference equivalence classes
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. A renowned speech therapist was summoned to help the King overcome his speech impediment...
246
Supervised Coreference Resolution
Lots of prior work on supervised coreference resolution
Soon et al. (2001), Strube et al. (2002), Yang et al. (2003), Luo et al. (2004), Denis and Baldridge (2007), …
247
Representing a Clustering
A clustering C of n mentions is an n × n Boolean matrix, where Cij = 1 iff mentions i and j are coreferent
[5 × 5 example matrix illustrating Reflexivity: Cii = 1 for every mention i]
248
Approximating the E-step
Search for the N most probable clusterings only
using Luo et al.'s (2004) search algorithm
structures the search space as a Bell tree
takes as input the pairwise coreference probabilities
scores a clustering based on these probabilities
249
Haghighi and Klein's Model
Cluster-level model
assigns a cluster id to each mention
ensures transitivity automatically
Nonparametric Bayesian model
does not commit to a particular set of parameters
250
Model Parameters
P(mp1, mp2, mp3 | c)
P(mp4, mp5, mp6 | c)
P(mp7 | c)
mpi are the feature values
c ∈ { Coref, Not Coref }
251
Experimental Setup
The ACE 2003 coreference corpus
3 data sets (Broadcast News, Newswire, Newspaper)
each has a training set and a test set; evaluate on the test set only
Mentions
system mentions (mentions extracted by an NP chunker)
perfect mentions (mentions extracted from the answer key)
Scoring programs: recall, precision, F-measure
MUC scoring program (Vilain et al., 1995)
CEAF scoring program (Luo, 2005)
CEAF variant: same as CEAF, but ignores singleton clusters
253
Features
Use 7 linguistic features divided into 3 groups
Strong Coreference Indicators: String match, Appositive, Alias (one is an acronym or abbreviation of the other)
Linguistic Constraints: Gender agreement, Number agreement, Semantic compatibility
Mention Type Pairs: (ti, tj), where ti, tj ∈ { Pronoun, Name, Nominal }
256
The Generative Model
Given a document D,
generate a clustering C according to P(C)
generate D given C

P(D, C) = P(C) P(D | C)
        = P(C) P(mp12, mp13, mp14, … | C)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mpij | Cij)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp1ij, mp2ij, …, mp7ij | Cij)
        = P(C) ∏_{ij ∈ Pairs(D)} P(mp1ij, mp2ij, mp3ij | Cij) P(mp4ij, mp5ij, mp6ij | Cij) P(mp7ij | Cij)
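Once the three group-conditional distributions are estimated, the factorization above can be computed directly. The probability tables, feature values, and P(C) value below are toy assumptions for illustration.

```python
def joint_prob(pair_features, pair_labels, p_clustering, group_probs):
    """P(D, C) = P(C) * product over mention pairs of
       P(mp1..3 | Cij) * P(mp4..6 | Cij) * P(mp7 | Cij),
    mirroring the factorization into three feature groups.
    group_probs[c][g] maps a tuple of group-g feature values to its
    probability given pair label c (toy conditional tables here)."""
    prob = p_clustering
    for pair, c in pair_labels.items():
        f = pair_features[pair]
        prob *= (group_probs[c][0][f[0:3]]      # strong indicators
                 * group_probs[c][1][f[3:6]]    # linguistic constraints
                 * group_probs[c][2][f[6:7]])   # mention type pair
    return prob

# toy example: a single mention pair with its 7 feature values
features = {(0, 1): ("match", "no-alias", "no-appos",
                     "agree", "agree", "compat", "Name-Name")}
labels = {(0, 1): "COREF"}
tables = {"COREF": [
    {("match", "no-alias", "no-appos"): 0.5},
    {("agree", "agree", "compat"): 0.8},
    {("Name-Name",): 0.4},
]}
p = joint_prob(features, labels, 0.3, tables)   # 0.3 * 0.5 * 0.8 * 0.4
```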
257
The Induction Algorithm
Given a set of unlabeled documents
guess a clustering for each document according to P(C)
estimate the model parameters based on the automatically labeled documents (M-step): maximum likelihood estimation
assign a probability to each possible clustering of the mentions for each document (E-step)
3 mentions: 1, 2, 3
[123]
260
3 mentions: 1, 2, 3
[123] [12][3][13][2] [1][23][1][2][3]
0.23 0.32 0.11 0.29 0.05
261
Iterate till convergence
262
How to cope with the computational complexity of the E-step?
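The induction loop itself can be sketched as a skeleton; the three callables below are stand-ins (assumptions) for the full initialization, N-best E-step, and M-step described on these slides.

```python
def em_coreference(documents, init_params, e_step_nbest, m_step, iters=10):
    """Skeleton of the induction loop: alternate an E-step that keeps
    the N most probable clusterings of each document (with their
    probabilities) and an M-step that re-estimates the parameters from
    those weighted clusterings."""
    params = init_params
    for _ in range(iters):
        # E-step: N-best clusterings per document,
        # e.g. found with the Bell-tree beam search
        weighted = [e_step_nbest(doc, params) for doc in documents]
        # M-step: maximum-likelihood re-estimation from the soft labels
        params = m_step(weighted)
    return params

# trivial stand-ins just to exercise the loop
params = em_coreference(
    documents=["doc1", "doc2"],
    init_params=0,
    e_step_nbest=lambda doc, p: [(doc, 1.0)],   # pretend 1-best clustering
    m_step=lambda weighted: len(weighted),      # pretend re-estimation
    iters=3)
```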
263
Goals
Design a new model for unsupervised coreference resolution
Improve Haghighi and Klein’s model with three modifications
264
Evaluation Results
Broadcast News
Recall: 53.1, Precision: 45.5, F-measure: 49.0
Newswire
Recall: 57.2, Precision: 50.3, F-measure: 53.5
Can we improve performance by combining labeled and unlabeled data?
266
Haghighi and Klein's Generative Story
For each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)

EM-based Generative Model:
Create mention pairs
For each pair, guess whether it is COREF or NOT COREF according to P(COREF)
Generate feature values

H&K's Generative Model:
For each mention, guess the cluster id according to P(cluster id)
Generate feature values
268
Dirichlet Process
Generate new cluster ids as needed
Probability of generating some existing cluster id i is:
  (number of mentions already in cluster i) / (n − 1 + α)
  higher probability for larger clusters
Probability of generating some new cluster id is:
  α / (n − 1 + α)
for some constant α
273
The CEAF Scoring Program
Input: correct partition, system partition
Correct partition: {3}, {4, 7}, {2, 5, 8}, {6}, {1, 9}
System partition: {6, 11, 12}, {2, 7, 8}, {1, 4, 9}, {3, 5, 10}
Recast the scoring problem as bipartite matching
Find the best matching using the Hungarian Algorithm
matched pairs (overlap): {6} ↔ {6, 11, 12} (1), {2, 5, 8} ↔ {2, 7, 8} (2), {1, 9} ↔ {1, 4, 9} (2), {3} ↔ {3, 5, 10} (1)
Matching score = 6
Recall = 6 / 9 = 0.66
Prec = 6 / 12 = 0.5
F-measure = 0.57
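The computation above can be reproduced in a few lines. This sketch finds the best one-to-one matching by brute force over permutations instead of the Hungarian Algorithm; on small inputs it gives the same optimum.

```python
from itertools import permutations

def ceaf(correct, system):
    """CEAF with entity-overlap similarity: best one-to-one matching
    between correct and system clusters, found by brute force."""
    correct = [set(c) for c in correct]
    system = [set(s) for s in system]
    while len(system) < len(correct):
        system.append(set())   # pad so every correct cluster can be matched
    best = max(sum(len(c & s) for c, s in zip(correct, perm))
               for perm in permutations(system, len(correct)))
    recall = best / sum(len(c) for c in correct)
    precision = best / sum(len(s) for s in system)
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

r, p, f = ceaf(
    [[3], [4, 7], [2, 5, 8], [6], [1, 9]],
    [[6, 11, 12], [2, 7, 8], [1, 4, 9], [3, 5, 10]])
# matching score 6  ->  R = 6/9, P = 6/12, F ≈ 0.57
```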
275
Standard Supervised Learning Approach
Classification
given a description of two mentions, mi and mj, classify the pair as coreferent or not coreferent
create one training instance for each pair of mentions from a training text
feature vector: describes the two mentions
[Queen Elizabeth] set about transforming [her] [husband], ...
coref ?
not coref ?
coref ?
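The instance-creation step can be sketched as follows; the feature extractor is a hypothetical stand-in for the 7-feature vector, and the pair is labeled coreferent iff both mentions belong to the same annotated chain.

```python
from itertools import combinations

def make_instances(mentions, chains, extract_features):
    """One training instance per mention pair: (feature vector, label),
    where the label is True iff the two mentions share a chain."""
    chain_of = {m: ci for ci, chain in enumerate(chains) for m in chain}
    instances = []
    for i, j in combinations(range(len(mentions)), 2):
        label = chain_of[mentions[i]] == chain_of[mentions[j]]
        instances.append((extract_features(mentions[i], mentions[j]), label))
    return instances

mentions = ["Queen Elizabeth", "her", "husband", "King George VI"]
chains = [["Queen Elizabeth", "her"], ["husband", "King George VI"]]
# the identity extractor stands in for the real 7-feature extractor
data = make_instances(mentions, chains, lambda a, b: (a, b))
# 4 mentions -> 6 pairwise instances
```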
277
Haghighi and Klein's Generative Story
For each mention encountered in a document,
generate a cluster id for the mention (according to some cluster id distribution)
generate the head noun of the mention (according to some cluster-specific head distribution)
The probability of generating a particular cluster id is based on some distribution that specifies P(id=1), P(id=2), P(id=3), …
but we don't know the number of clusters a priori
so we don't know how many probabilities the distribution must specify
we need a distribution over an unknown number of clusters
280
Dirichlet Process
Generate new cluster ids as needed
Queen Elizabeth set about transforming her husband,
King George VI, into a viable monarch. Logue, a
renowned speech therapist, was summoned to help the
King overcome his speech impediment...
cluster ids so far: Queen Elizabeth → 1, her → 1, husband → 2, King George VI → 2, Logue → ?
Should we generate id 1 or 2, or should we generate a new id 3?
283
Dirichlet Process
Generate new cluster ids as needed
Probability of generating some existing cluster id i is proportional to the number of mentions already in cluster i
higher probability for larger clusters
Probability of generating some new cluster id is proportional to some constant α
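Drawing a cluster id under this process can be sketched as a minimal Chinese-restaurant-process step; the α value and the cluster counts below are illustrative assumptions.

```python
import random

def sample_cluster_id(cluster_sizes, alpha, rng=random):
    """Draw a cluster id for the next mention: an existing id i with
    probability proportional to the number of mentions already in
    cluster i, a brand-new id with probability proportional to alpha."""
    total = sum(cluster_sizes.values()) + alpha
    r = rng.random() * total
    for cid, size in cluster_sizes.items():
        if r < size:
            return cid
        r -= size
    return max(cluster_sizes, default=0) + 1   # open a new cluster

rng = random.Random(0)
sizes = {1: 2, 2: 2}            # two existing clusters of two mentions each
counts = {1: 0, 2: 0, 3: 0}
for _ in range(10000):
    counts[sample_cluster_id(sizes, alpha=1.0, rng=rng)] += 1
# with alpha = 1, roughly 2/5 of the draws pick each existing cluster
# and roughly 1/5 open the new cluster 3
```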