
A Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields

Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto

Nara Institute of Science and Technology

EMNLP-CoNLL 2007, 29th June

Prague, Czech Republic

2

Background

Named Entity

Proper nouns (e.g. Shinzo Abe (Person), Prague (Location)), time/date expressions (e.g. June 29 (Date)) and numerical expressions (e.g. 10%)

In many NLP applications (e.g. IE, QA), Named Entities play an important role

Named Entity Recognition task (NER)

Treated as sequential tagging problem

Machine learning methods have been proposed

Recall is usually low

A large-scale NE dictionary is useful for NER

=> Semi-automatic methods to compile NE dictionaries are in demand

3

Resource for NE dictionary construction

Wikipedia

Multi-lingual encyclopedia on the Web

382,613 gloss articles (as of June 20, 2007, Japanese)

Gloss indices consist of nouns or proper nouns

HTML (Semi-structured text)

Lists (<LI>) and tables (<TABLE>) can be used as clues for NE type categorization

Linked articles are glossed by anchor texts in articles

Each article has one or more categories

Wikipedia has useful information for NE categorization

=> It can be considered a suitable resource

4

Objective

Extract Named Entities by assigning proper NE labels to the gloss indices of Wikipedia

[Figure: example Wikipedia articles labeled with NE classes such as Person, Product, Location, Organization and Natural Object]

5

Use of Wikipedia features

Features of Wikipedia articles

Anchors in an article refer to other related articles

Anchors in list elements have dependencies on each other

=> Make 3 assumptions about dependencies between anchors

An example of a list structure:

[Figure: a list whose anchors are Burt Bacharach (PERSON) ... composer (VOCATION); Dillard & Clark (ORGANIZATION); Carpenters (ORGANIZATION); Karen Carpenter (PERSON)]

Assumption 1: The latter element in a list item tends to be in an attribute relation to the former element

Assumption 2: The elements in the same itemization tend to be in the same NE category

Assumption 3: The nested element tends to be in a part-of relation to the upper element

6

Overview of our approach

Focus on HTML list structure in Wikipedia

Make 3 assumptions about dependencies between anchors

Formalize NE categorization problem as labeling NE classes to anchors in lists

Define 3 kinds of cliques (edges: Sibling, Cousin and Relative) between anchors based on the 3 assumptions

Construct graphs based on 3 defined cliques

CRFs for NE categorization in Wikipedia

Define potential functions over 3 edges (and nodes) to provide conditional distribution over the graphs

Estimate MAP label assignment over the graphs using Conditional Random Fields

7

Conditional Random Fields (CRFs)

Conditional Random Fields [Lafferty 2001]

Discriminative, undirected models that define the conditional distribution p(y|x)

Features:

Arbitrary features can be used

Globally optimized over all possible label assignments

Can deal with label dependencies by defining potential functions for cliques (2 or more nodes)

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{c \in C} \sum_{k} \lambda_k f_k(\mathbf{y}_c, \mathbf{x}) \Big)

Z(x): partition function, C: set of cliques, f_k: feature function, \lambda_k: model parameter

[Figure: a linear chain of output variables y1, y2, y3, ..., yn conditioned on the input x]
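As a toy illustration of the formula above (not the system described in this talk), the following sketch enumerates all label assignments of a tiny 3-node chain and computes the partition function Z(x) and the conditional probability p(y|x). The labels, cliques, feature and weight are invented, and the dependence on x is omitted for brevity.

```python
from itertools import product
from math import exp

LABELS = ["PERSON", "ORGANIZATION"]
CLIQUES = [(0, 1), (1, 2)]          # edges of a tiny chain y1 - y2 - y3

def feature_agree(y_c):
    """Toy feature f_k: fires when both labels in a clique agree."""
    return 1.0 if y_c[0] == y_c[1] else 0.0

WEIGHTS = {feature_agree: 1.5}      # lambda_k for each feature f_k

def unnormalized(y):
    """exp( sum_c sum_k lambda_k * f_k(y_c, x) ) for one assignment y."""
    s = 0.0
    for c in CLIQUES:
        y_c = tuple(y[i] for i in c)
        for f, lam in WEIGHTS.items():
            s += lam * f(y_c)
    return exp(s)

assignments = list(product(LABELS, repeat=3))
Z = sum(unnormalized(y) for y in assignments)      # partition function Z(x)

for y in assignments:
    print(y, round(unnormalized(y) / Z, 3))        # p(y | x)
```

Exhaustive enumeration is only feasible for such tiny examples; real models use learned weights and (approximate) inference.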

8

Use of dependencies for categorization

NE categorization problem as labeling classes to anchors

Each edge of the constructed graphs corresponds to a particular dependency

Estimate MAP label assignment over the constructed graphs using Conditional Random Fields

Our formulation can also extract anchors that have no gloss article

[Figure: list example (Dillard & Clark ... country rock; Carpenters; Karen Carpenter) with markers indicating whether a gloss article exists for each anchor]

9

Clique definition based on HTML tree structure

[Figure: HTML tree of the list: a <UL> with <LI> items whose <A> anchors are Dillard & Clark, country rock and Carpenters, plus a nested <UL>/<LI>/<A> for Karen Carpenter; Sibling, Cousin and Relative edges are drawn between the anchors]

Sibling: the latter element tends to be an attribute or a concept of the former element

Cousin: the elements tend to have a common NE category (e.g. ORGANIZATION)

Relative: the latter element tends to be a constituent part of the former element

Use these 3 relations as cliques of the CRFs
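To make the three relations concrete, here is a minimal sketch (not the authors' code) that derives Sibling, Cousin and Relative edges from a list that has already been parsed out of the HTML. The nested-tuple representation, the function name, and the choice to attach Relative edges to the first anchor of the parent item are illustrative assumptions; Cousin edges are simplified to positionally corresponding anchors of adjacent items.

```python
# Illustrative sketch only: a parsed <UL> is represented as a list of items,
# each item being (anchors, children), where `anchors` are the <A> texts of
# that <LI> in order and `children` is the nested <UL> in the same format.

def extract_edges(items):
    """Derive Sibling, Cousin and Relative edges between anchor texts."""
    sibling, cousin, relative = [], [], []

    # Cousin (Assumption 2): positionally corresponding anchors of adjacent
    # items in the same itemization tend to share an NE category.
    for a, b in zip(items, items[1:]):
        for x, y in zip(a[0], b[0]):
            cousin.append((x, y))

    for anchors, children in items:
        # Sibling (Assumption 1): consecutive anchors inside one <LI>; the
        # latter tends to be an attribute of the former.
        for x, y in zip(anchors, anchors[1:]):
            sibling.append((x, y))

        # Relative (Assumption 3): an anchor in a nested <UL> tends to be a
        # constituent part of the element above it (here: the first anchor).
        if anchors:
            for child_anchors, _ in children:
                for y in child_anchors:
                    relative.append((anchors[0], y))

        # Recurse into the nested itemization.
        s, c, r = extract_edges(children)
        sibling += s
        cousin += c
        relative += r

    return sibling, cousin, relative

# The example list from this slide: Carpenters has a nested item.
ul = [
    (["Dillard & Clark", "country rock"], []),
    (["Carpenters"], [(["Karen Carpenter"], [])]),
]
print(extract_edges(ul))
# -> Sibling: (Dillard & Clark, country rock); Cousin: (Dillard & Clark, Carpenters);
#    Relative: (Carpenters, Karen Carpenter)
```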

10

A graph constructed from 3 clique definitions

[Figure: graph over the anchors Burt Bacharach, "On my own", 1986, Dillard & Clark, Gene Clark, Carpenters, "As Time Goes By", 2000 and Karen Carpenter, connected by Sibling (S), Cousin (C) and Relative (R) edges]

Estimate the MAP label assignment over the graph


11

Model

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{(i,j) \in E_S \cup E_C \cup E_R} \psi_{SCR}(y_i, y_j, \mathbf{x}) \prod_{v_i \in V} \psi_{V}(y_i, \mathbf{x})

\psi_{SCR}(y_i, y_j, \mathbf{x}) = \exp\Big( \sum_k \lambda_k f_k(y_i, y_j, \mathbf{x}) \Big), \qquad \psi_{V}(y_i, \mathbf{x}) = \exp\Big( \sum_k \lambda'_k f'_k(y_i, \mathbf{x}) \Big)

\psi_V : potential function for nodes

\psi_{SCR} : potential function for Sibling, Cousin and Relative cliques


• The constructed graphs include cycles: exact inference is computationally expensive

-> Introduce Tree-based Reparameterization (TRP) [Wainwright 2003] for approximate inference
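The sketch below spells out this factorization for the small example graph: log-linear node potentials psi_V and type-specific edge potentials psi_SCR over Cousin and Relative edges, with the MAP assignment found by brute-force enumeration as a stand-in for TRP. All features and weights are invented for illustration and are not the trained model.

```python
from itertools import product
from math import exp

LABELS = ["PERSON", "ORGANIZATION", "VOCATION"]
NODES = ["Dillard & Clark", "Carpenters", "Karen Carpenter"]
EDGES = {                      # typed edge sets derived from the list structure
    "S": [],
    "C": [(0, 1)],             # Cousin: Dillard & Clark -- Carpenters
    "R": [(1, 2)],             # Relative: Carpenters -- Karen Carpenter
}

def psi_v(node, label):
    """Node potential psi_V with one invented lexical feature."""
    w = 1.0 if (node.endswith("s") and label == "ORGANIZATION") else 0.0
    return exp(w)

def psi_scr(edge_type, y_i, y_j):
    """Edge potential psi_SCR with invented preferences per edge type."""
    if edge_type == "C":                        # Cousin: same NE category
        return exp(1.5 if y_i == y_j else 0.0)
    if edge_type == "R":                        # Relative: part-of, e.g. band -> member
        return exp(1.5 if (y_i, y_j) == ("ORGANIZATION", "PERSON") else 0.0)
    return exp(0.0)

def score(y):
    """Unnormalized product of node and edge potentials for assignment y."""
    p = 1.0
    for i, node in enumerate(NODES):
        p *= psi_v(node, y[i])
    for edge_type, edges in EDGES.items():
        for i, j in edges:
            p *= psi_scr(edge_type, y[i], y[j])
    return p

# Brute-force MAP assignment; the paper uses TRP instead because the real
# graphs contain cycles and exhaustive enumeration does not scale.
best = max(product(LABELS, repeat=len(NODES)), key=score)
print(dict(zip(NODES, best)))
# -> {'Dillard & Clark': 'ORGANIZATION', 'Carpenters': 'ORGANIZATION',
#     'Karen Carpenter': 'PERSON'}
```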

12

Experiments

The aims of the experiments are:

1. Compare graph-based approach (relational) to node-wise approach (independent) to investigate how the relational classification improves classification accuracy

2. Investigate the effect of defined cliques

3. Compare CRFs models to baseline models based on SVMs

4. Show the effectiveness of using marginal probability for filtering NE candidates.

13

Dataset

Randomly sampled 2,300 articles (Japanese version as of October 2005)

Anchors in list elements (<LI>) are hand-annotated with NE class labels

We used the Extended Named Entity Hierarchy (Sekine et al. 2002)

We reduced the number of classes to 13 from the original 200+ in order to avoid data sparseness

Classification targets: 16,136 anchors (14,285 of these are NEs)

NE Class # of articles

EVENT 121

PERSON 3315

UNIT 15

LOCATION 1480

FACILITY 2449

TITLE 42

ORGANIZATION 991

VOCATION 303

NATURAL_OBJECT 1132

PRODUCT 1664

NAME_OTHER 24

TIMEX/NUMEX 2749

OTHER 1851

ALL 16136

14

Experiments (CRFs)

[Figure: graph structure of each model; SCR, SC, SR, CR, S, C and R use the corresponding edge types, and the I model has no edges (node-wise)]

To investigate which clique type contributes to classification accuracy:

We construct models from all possible combinations of the defined cliques

8 models (SCR, SC, SR, CR, S, C, R, I)

Classification is performed on each connected subgraph
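The eight configurations are simply the subsets of the three clique types, with the empty subset being the node-wise I model. A small sketch (names are illustrative):

```python
from itertools import combinations

CLIQUE_TYPES = ["S", "C", "R"]          # Sibling, Cousin, Relative

models = []
for size in range(len(CLIQUE_TYPES), -1, -1):
    for subset in combinations(CLIQUE_TYPES, size):
        models.append("".join(subset) or "I")   # empty subset = node-wise I model

print(models)   # ['SCR', 'SC', 'SR', 'CR', 'S', 'C', 'R', 'I']
```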

15

Experimental settings (Baseline), Evaluation

Baseline: Support Vector Machines (SVMs) [Vapnik 1998]

We perform two models:

I model: each anchor text is classified independently

P model: anchor texts are ordered by their linear position in the HTML and classified history-based (the (j-1)-th classification result is used as a feature in the j-th classification)

For multi-class classification: one-versus-rest
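A minimal sketch of the P model's prediction loop, assuming a one-versus-rest classifier clf has already been trained with the gold previous label as an extra feature; clf, make_features and the feature name prev_label are hypothetical placeholders, not part of the described system.

```python
def predict_in_document_order(anchors, clf, make_features):
    """History-based classification (the P model): the (j-1)-th prediction is
    fed into the features of the j-th anchor. `clf` is any trained
    one-versus-rest classifier exposing predict(); `make_features` builds the
    base feature dict for an anchor. Both are hypothetical placeholders."""
    predictions = []
    prev_label = "BOS"                     # dummy label before the first anchor
    for anchor in anchors:                 # anchors sorted by position in the HTML
        feats = make_features(anchor)
        feats["prev_label"] = prev_label   # history feature
        label = clf.predict(feats)
        predictions.append(label)
        prev_label = label
    return predictions
```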

Evaluation

5-fold cross validation, by F1-value


16

Results (F1-value)

            #      SCR    SC     SR     CR     S      C      R      I      P      I
ALL         14285  .7854  .7855  .7822  .7862  .7817  .7845  .7813  .7805  .7798  .7790
no article   3898  .5465  .5484  .5223  .5495  .5271  .5475  .5273  .5249  .5386  .5278

(SCR, SC, SR, CR, S, C, R and I are CRF models; the rightmost P and I are the SVM baselines)


ALL: whole dataset; no article: anchors without gloss articles

17

Results (F1-value)


1. Graph-based vs. Node-wise

Performed a McNemar paired test on labeling disagreements

=> the difference was significant (p < 0.01)
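For reference, a small sketch of a two-sided McNemar test computed directly from the disagreement counts (normal approximation, no continuity correction); the counts are hypothetical and this is not necessarily the exact variant used in the experiments.

```python
from math import erf, sqrt

def mcnemar_p(only_a_correct, only_b_correct):
    """Two-sided McNemar test on labeling disagreements.

    only_a_correct: items model A labels correctly but model B does not;
    only_b_correct: the reverse. Normal approximation, no continuity correction."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0                       # the two models never disagree
    z = abs(only_a_correct - only_b_correct) / sqrt(n)
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))

# Hypothetical counts: a p-value below 0.01 means the difference is significant.
print(mcnemar_p(180, 120))
```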


18

Results (F1-value)


2. Which clique contributes most? => the Cousin clique

Cousin cliques provided the largest accuracy improvement compared to Sibling and Relative cliques


19

Results (F1-value)


3. CRFs vs. SVMs: McNemar paired significance test on labeling disagreements


20

Filtering NE candidates using marginal probability

Construct dictionaries from extracted NE candidates

Methods with lower cost are desirable

Extract only confident NE candidates

-> Use the marginal probability provided by the CRFs

Marginal probability

probability of a particular label assignment for a node

This can be regarded as the "confidence" of the classifier

p(y_i \mid \mathbf{x}) = \sum_{\mathbf{y} \setminus \{y_i\}} p(\mathbf{y} \mid \mathbf{x}), \qquad v_i \in V
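A minimal sketch of the filtering step, assuming per-anchor marginals have already been computed (e.g. by TRP); the threshold of 0.9, the data layout and the treatment of the OTHER class are illustrative assumptions, not the paper's settings.

```python
def filter_candidates(marginals, threshold=0.9):
    """Keep only confident NE candidates for the dictionary.

    `marginals` maps each anchor to {NE class: marginal probability p(y_i|x)};
    an anchor is kept if its best non-OTHER class clears the threshold.
    The threshold value and data layout are illustrative."""
    dictionary = {}
    for anchor, dist in marginals.items():
        label, prob = max(dist.items(), key=lambda kv: kv[1])
        if label != "OTHER" and prob >= threshold:
            dictionary[anchor] = label
    return dictionary

example = {
    "Karen Carpenter": {"PERSON": 0.95, "ORGANIZATION": 0.03, "OTHER": 0.02},
    "country rock":    {"VOCATION": 0.40, "OTHER": 0.45, "PRODUCT": 0.15},
}
print(filter_candidates(example))   # {'Karen Carpenter': 'PERSON'}
```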

21

Precision-Recall Curve

Precision-Recall curve obtained by thresholding the marginal probability of the MAP estimation in the CR model of CRFs

At the operating point marked on the curve, recall is about 0.57 and precision is about 0.97

With a proper threshold on the marginal probability, an NE dictionary can be constructed at lower cost

22

Summary and future work

Summary

Proposed a method for categorizing NEs in Wikipedia

Defined 3 kinds of cliques (Sibling, Cousin and Relative) over HTML tree

The graph-based models achieved significant improvements compared to the node-wise model and the baseline methods (SVMs)

NEs can be extracted at lower cost by exploiting marginal probabilities

23

Summary and Future work

Future work

Use fine-grained NE classes

For many NLP applications (e.g. QA, IE), an NE dictionary with a fine-grained label set will be a useful resource

Classification with statistical methods becomes difficult when the label set is large, because positive examples become insufficient

Incorporate hierarchical structure of label sets into our models (Hierarchical Classification)

Previous work suggests that exploiting the hierarchical structure of label sets improves classification accuracy

24

Thank you.