Language and Domain Independent Entity Linking with Quantified Collective Validation

Han Wang, Jin Guang Zheng, Xiaogang Ma, Peter Fox, and Heng Ji

EMNLP 2015

Presented by: Shuangshuang Zhou

Inui & Okazaki Lab., Tohoku University

# Figures in these slides are borrowed from the authors' paper and poster

An example to explain the task

One day after released by the Patriots, Florida born Caldwell visited the Jet. ... The New York Jets have six receivers on the roster: Cotchery, Coles, ...

Linked KB entities: New England Patriots; Reche Caldwell; Jerricho Cotchery; Laveranues Coles; New York Jets

9/12/16 2

Motivation and Contribution

• “Most of the previous research extensively exploited the linguistic features of the source documents in a supervised or semi-supervised way”.

• Quantified Collective Validation can be applied to a new language or domain:
  • It can work with limited linguistic resources.
  • It can conduct a more deliberate study of the KB.

• It is a collective way of aligning co-occurring mentions to the KB, with a further step that quantitatively differentiates entity relations in the KB.

Approaches - Overview

Candidate Ranking (two ranking steps + Quantified Collective Validation)

Salience Ranking (SR): measures candidates’ importance without the context using information entropy.

Context Similarity Ranking (CS): measures the structural similarity between the mention context graph and each candidate graph using Jaccard similarity.

Candidate Graph Collective Validation (CV)

Approaches – Salience Ranking

Measures a candidate’s (entity’s) importance without the context, using information entropy.

KB tuple format: (eh, r, et)

In the salience formula for Sa(c): R(c) is the relation set for c in the KB; H(r) is the information entropy of relation r, defined by a separate equation; Et(r) is the tail entity set with c being the head entity and r being the connecting relation in the KB; L(et) denotes the cardinality of the tail entity set with et being the head entity in the KB. Sa(c) is recursively computed until convergence.
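The equations themselves are not reproduced on this slide, so the following is only a sketch under stated assumptions: it takes H(r) to be the Shannon entropy of the tail-entity distribution of r, and takes Sa(c) to sum H(r) · Sa(et) / L(et) over the relations and tail entities of c. The toy triples, the function names, and the division-by-zero guard are all hypothetical. Only one update pass is shown; the slide states that the real computation repeats until convergence.

```python
import math
from collections import Counter, defaultdict

def relation_entropy(triples):
    """Assumed form of H(r): Shannon entropy of r's tail-entity distribution."""
    tail_counts = defaultdict(Counter)
    for _head, rel, tail in triples:
        tail_counts[rel][tail] += 1
    entropy = {}
    for rel, counts in tail_counts.items():
        total = sum(counts.values())
        entropy[rel] = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy

def salience_pass(sa, triples, entropy):
    """One update pass of the assumed recursion for Sa(c)."""
    tails_of = defaultdict(set)    # head entity -> set of its tail entities
    by_head = defaultdict(list)    # head entity -> [(relation, tail), ...]
    for head, rel, tail in triples:
        tails_of[head].add(tail)
        by_head[head].append((rel, tail))
    updated = {}
    for c in sa:
        score = 0.0
        for rel, tail in by_head[c]:
            # L(et): number of tail entities reachable from et as a head.
            # The max(1, ...) guard is an assumption to avoid dividing by zero.
            l_et = max(1, len(tails_of[tail]))
            score += entropy[rel] * sa[tail] / l_et
        updated[c] = score
    return updated

# Toy KB in (eh, r, et) format; entity and relation names are illustrative only.
KB = [
    ("Reche_Caldwell", "team", "New_England_Patriots"),
    ("Jerricho_Cotchery", "team", "New_York_Jets"),
    ("Laveranues_Coles", "team", "New_York_Jets"),
    ("New_York_Jets", "league", "NFL"),
    ("New_England_Patriots", "league", "NFL"),
    ("NFL", "hasTeam", "New_York_Jets"),
    ("NFL", "hasTeam", "New_England_Patriots"),
]

H = relation_entropy(KB)
entities = {e for h, _r, t in KB for e in (h, t)}
sa0 = {e: 1.0 for e in entities}       # uniform starting scores
sa1 = salience_pass(sa0, KB, H)
```

In this toy KB the "league" relation always points at the same tail, so its entropy is zero and it contributes nothing to salience, while "hasTeam" spreads evenly over two tails and gets the largest weight.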

Approaches – Context Similarity Ranking

Measures the structural similarity between the mention context graph Gm and each candidate graph Gic using Jaccard similarity.

The comparison checks whether two co-occurring mentions have their entity referents connected by some relation in the KB.

The more a Gic is structurally similar to its Gm, the better the candidates in this Gic represent their mentions in Gm.
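As a rough illustration of the comparison, the sketch below computes Jaccard similarity over edge sets, assuming (this is not stated on the slide) that each candidate-graph edge is keyed by the pair of mentions whose candidates it connects. The mention names and both candidate graphs are hypothetical toy data.

```python
def undirected(edges):
    """Normalise (a, b) pairs so that edge direction does not matter."""
    return {frozenset(edge) for edge in edges}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Edges of the mention context graph Gm (mentions sharing a context window).
gm_edges = undirected([
    ("Patriots", "Caldwell"),
    ("Caldwell", "Jet"),
    ("Patriots", "Jet"),
])

# Two hypothetical candidate graphs, written over the mention pairs whose
# chosen candidates are connected by some relation in the KB.
gc1_edges = undirected([("Patriots", "Caldwell"), ("Caldwell", "Jet")])
gc2_edges = undirected([("Patriots", "Jet")])

cs1 = jaccard(gm_edges, gc1_edges)
cs2 = jaccard(gm_edges, gc2_edges)
```

Here the first candidate graph preserves two of the three mention edges and the second only one, so the first would rank higher under this measure.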

Approaches – Mention Context Graph

Gm is a lightweight source context representation which simply involves mention co-occurrence.

• There will be an edge between two mention vertices if both of them fall into a context window in the source document.

• Two mention vertices will be connected via a dashed edge if they are coreferential but are not located in the same context window.

One day after released by the Patriots, Florida born Caldwell visited the Jet. ...

The New York Jets have six receivers on the roster: Cotchery, Coles, ...
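The two edge rules above can be sketched as follows. This is a minimal illustration, assuming that "falling into a context window" means a token distance of at most `window`, and that coreferential pairs are supplied by some external step. The token positions, the window size, and the function name are hypothetical.

```python
from itertools import combinations

def build_mention_graph(mentions, window, coref_pairs=()):
    """Return (solid_edges, dashed_edges) for a mention context graph.

    mentions:    list of (mention_text, token_position) with distinct texts
    window:      maximum token distance for two mentions to share a window
    coref_pairs: pairs of mention texts known to be coreferential
    """
    solid = set()
    for (m1, p1), (m2, p2) in combinations(mentions, 2):
        if abs(p1 - p2) <= window:
            solid.add(frozenset((m1, m2)))
    # Dashed edges: coreferential mentions that do not already share a window.
    dashed = {frozenset(pair) for pair in coref_pairs} - solid
    return solid, dashed

# Toy token positions for the example sentences (illustrative values only).
mentions = [
    ("Patriots", 6), ("Caldwell", 9), ("Jet", 12),
    ("New York Jets", 40), ("Cotchery", 48), ("Coles", 49),
]
solid, dashed = build_mention_graph(
    mentions, window=10, coref_pairs=[("Jet", "New York Jets")]
)
```

With these positions the first sentence yields three solid edges, the second sentence yields three more, and "Jet" and "New York Jets" are joined only by a dashed coreference edge.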

Approaches – KB Graph

GK is a weighted graph that consists of a set of vertices representing the entities and a set of directed edges labeled with relations between entities.

A “wiki link” relation is added between two entities if one of them appears in the Wikipedia article of the other.

Approaches – Candidate Graphs

Gc is a series of graphs, each of which represents a collective linking solution to the given mentions.

• Two vertices are connected if they are also connected in GK by some relation r and their mentions are connected in Gm. The edge label r is transferred from GK.
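The construction rule above can be sketched as an enumeration over candidate assignments. This is only an illustration: the candidate lists, the toy KB edge dictionary, and the function name are hypothetical, and a real system would need to prune the combinations rather than enumerate them all.

```python
from itertools import product

def candidate_graphs(candidates, gm_edges, kb_edges):
    """Yield (assignment, edges) for every combination of candidates.

    candidates: dict mention -> list of candidate entities
    gm_edges:   iterable of (mention, mention) edges from Gm
    kb_edges:   dict (head_entity, tail_entity) -> relation label from GK
    """
    mentions = sorted(candidates)
    for combo in product(*(candidates[m] for m in mentions)):
        assignment = dict(zip(mentions, combo))
        edges = []
        for m1, m2 in gm_edges:
            e1, e2 = assignment[m1], assignment[m2]
            # Keep an edge only if the two candidates are related in GK,
            # in either direction; the relation label is copied from GK.
            for head, tail in ((e1, e2), (e2, e1)):
                rel = kb_edges.get((head, tail))
                if rel is not None:
                    edges.append((head, rel, tail))
        yield assignment, edges

# Toy inputs (illustrative only).
candidates = {
    "Caldwell": ["Reche_Caldwell", "Andre_Caldwell"],
    "Patriots": ["New_England_Patriots"],
}
gm_edges = [("Caldwell", "Patriots")]
kb_edges = {("Reche_Caldwell", "New_England_Patriots"): "team"}

graphs = list(candidate_graphs(candidates, gm_edges, kb_edges))
```

In this toy case two candidate graphs are produced, and only the one that picks Reche_Caldwell keeps an edge, because only that candidate is related to New_England_Patriots in the toy KB.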

Approaches – Candidate Graph Collective Validation

• Assumption: a “tighter” relation between two candidates is more likely to be an appropriate representation of the relation between their co-occurring mentions in the source context.

• Quantitatively differentiates different types of relations using the calculated relation weights in GK.

The validation score combines three components: the added effects of “tighter” relations, the salience ranking, and the context similarity ranking.
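To picture the "tighter relation" assumption, the sketch below sums the KB relation weights found on the edges of each candidate graph and prefers the graph with the larger total. The relation weights, the graph labels, and the use of a plain sum are all hypothetical; how the paper combines this term with the SR and CS scores is not reproduced here.

```python
def relation_tightness(edges, relation_weight):
    """Sum the weights of the relations labelling a candidate graph's edges."""
    return sum(relation_weight.get(rel, 0.0) for _head, rel, _tail in edges)

# Hypothetical relation weights: a specific relation counts as "tighter"
# than a generic wiki link.
relation_weight = {"team": 0.9, "wiki link": 0.2}

# Two toy candidate graphs for the same pair of mentions.
candidate_edges = {
    "A": [("Reche_Caldwell", "team", "New_England_Patriots")],
    "B": [("Andre_Caldwell", "wiki link", "New_England_Patriots")],
}

tightness = {
    name: relation_tightness(edges, relation_weight)
    for name, edges in candidate_edges.items()
}
best = max(tightness, key=tightness.get)
```

Both toy graphs have the same structure, so context similarity alone could not separate them; the relation weights are what make graph A the preferred solution in this sketch.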

Experiments - Generic English Corpora

Experiments on TAC-KBP 2013 linkable mentions

Baseline: (Zheng et al., 2014)

Compared with the top 3 supervised and top 3 unsupervised systems from TAC-KBP 2013.

Error analysis: 1) context capturing is deficient; 2) the coreference rules are simple; 3) certain relations are missing in the KB.

Experiments - Generic English Corpora

• SR outperforms the best KBP unsupervised system (0.632).

• Although CS did not produce many more correct linking results than SR, it promoted a great number of good candidates to the top of the ranking list.

• CS is deficient in recognizing the subtle contextual differences among similar candidates (those of the same type).

Experiments - Generic Chinese Corpora

• Fahrni et al. (2012) used over 20 fine-tuned features and many linguistic resources.

• Error analysis:
  • Low recall when mapping candidates between English and Chinese.

Experiments – Specific domain

• There is only a slight improvement in biomedical science because candidates of the related mentions mostly have similar relations in the KB.

• First study on the earth science domain.

• Error analysis:
  • Salience ranking introduces bias when a generic KB is used.
  • Some relations are not clearly defined in DBpedia.

Conclusion and Future work

• QCV relies minimally on linguistic analysis and makes deep use of structured KBs.

• The authors demonstrated a high-performance EL approach that can be migrated to new languages and domains.

• They plan to extract mention context better and to incorporate the impact of more distant KB entities, not just the neighbors.

Comments (I)

• For conventional collective ranking in unsupervised EL approaches, time complexity is a significant problem: the upper bound on the computing time to link all mentions in a document is O(nm*nc*nnc*nnm).

• Their intensive analysis of each experimental result is worth learning from.

• Since their method is less reliant on the source document, it could also be applied to short texts (tweets, search engine queries).

Comments (II)

• Their approach may not be effective when there are few co-occurring mentions.

• We would like to see their system's performance across more generic English corpora.

• Their method was evaluated on linkable mentions only; it cannot handle unlinkable mentions (NILs).

• For a new language and a new domain, they used the same KB (DBpedia), and their system performance was affected by the structured KB. Their performance therefore needs to be verified with new KBs.
